12.7.2. crate_anon.preprocess.autoimport_db

crate_anon/preprocess/autoimport_db.py

Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.

Automatically import to a database from a collection of tabular files.

Efficiency is challenging here. Simple CSV/TSV files are efficiently handled as file-like objects, and can be iterated in a low-memory way very fast. Spreadsheet-type objects often need to be loaded “whole”, so repeat iteration is less sensible. We’re trying to handle both, so no perfect/simple way.

Considered but not done:

Track min/max for numeric types. This would allow us to refine the integer type. However, there is always the danger that we scan data and create tables for one set of files, then want to import from another, and the latter has more extreme values. So we just use a column type (BigInteger) with wide capabilities. (Having said that, we do track the maximum length of strings!)

class crate_anon.preprocess.autoimport_db.ColumnTypeDetector(colname: str, values: Iterable[Any] | None = None, default_type: TypeEngine | None = None, verbose: bool = False)[source]

Class to inspect values from a spreadsheet column, and make a decision about what kind of SQLAlchemy column should be used.

__init__(colname: str, values: Iterable[Any] | None = None, default_type: TypeEngine | None = None, verbose: bool = False) → None[source]

Parameters:

colname – Future column name.
values – Optional values to inspect immediately.
default_type – Type to use if no non-NULL data whatsoever is seen.
verbose – Report values on failure. WARNING: unsuitable, for this reason, for production code.

inspect(v: Any) → None[source]: Inspect a new value.

sqlalchemy_column(nullable: bool | None = None) → Column[source]

Returns an SQLAlchemy Column object (free-floating, i.e. with no table attached), or raises ValueError.

Parameters:: nullable – Should the column be NULL-capable? Use True for “NULL”, False for “NOT NULL”, and None for “NULL if NULL/None values have been seen, otherwise NOT NULL”.

class crate_anon.preprocess.autoimport_db.TabularFileInfo(tablename: str, metadata: MetaData, engine: Engine, datagen: Iterable[Dict[str, Any]] | None = None, with_columns_from_data: bool = False, with_columns_from_reflection: bool = False, with_data: bool = False, verbose: bool = False)[source]

Simple class to represent information about a potential database table, from a tabular data file format.

__init__(tablename: str, metadata: MetaData, engine: Engine, datagen: Iterable[Dict[str, Any]] | None = None, with_columns_from_data: bool = False, with_columns_from_reflection: bool = False, with_data: bool = False, verbose: bool = False) → None[source]

Parameters:

tablename – Name of the table.
metadata – Database MetaData object.
engine – SQLAlchemy engine.
datagen – Optional iterable to provide data. (Must be supplied if with_columns or with_data is True.)
with_columns_from_data – Read/autodetect column information from data, for creating tables?
with_columns_from_reflection – Should columns be read from the metadata?
with_data – Provide data?
verbose – Be verbose if the process fails? WARNING: will report values.

colreport() → str[source]: A text-format report of our columns.

create_table() → None[source]: Create a database table, if it doesn’t already exist.

drop_table() → None[source]: Drop a database table, if it exists.

has_columns() → bool[source]: Do we have at least one column?

table() → Table[source]: Returns an SQLAlchemy table object. Caches this across requests (or SQLAlchemy will complain that we re-assign columns to a table). Assumes that the MetaData object WILL NOT CHANGE ACROSS CALLS.

validate_columns() → None[source]: Validates columns, or raises ValueError.

crate_anon.preprocess.autoimport_db.auto_import_db(url: str, filenames: List[str], use_spreadsheet_names: bool = True, drop_tables: bool = False, create_tables: bool = False, import_data: bool = False, chunksize: int = 100000, skip_tables: List[str] | None = None, echo: bool = False, verbose: bool = False) → None[source]

Main import function.

Parameters:

url – Database URL.
filenames – Filenames to iterate through.
use_spreadsheet_names – Use spreadsheet names (where relevant) as table names, rather than filenames. (If False, only the first sheet in each spreadsheet file will be used.)
drop_tables – Drop tables first?
create_tables – Create tables, if required?
import_data – Do the actual import?
skip_tables – Optional names of tables to skip.
chunksize – Number of records to insert at once.
echo – Echo SQL?
verbose – Be verbose?

crate_anon.preprocess.autoimport_db.dict_from_rows(row_iterator: Iterable[Iterable], row_to_list_fn: Callable[[Iterable], List]) → Generator[Dict, None, None][source]: Iterate through rows (from row_iterator); apply row_to_list_fn() to each; yield dictionaries mapping column names to values.

crate_anon.preprocess.autoimport_db.does_datetime_have_zero_time(d: datetime | DateTime) → bool[source]: Does a given datetime-like object have all its time fields set to zero?

crate_anon.preprocess.autoimport_db.gen_dicts_from_csv(fileobj: TextIO) → Generator[Dict[str, Any], None, None][source]: Generates value dictionaries from a CSV file.

crate_anon.preprocess.autoimport_db.gen_dicts_from_tsv(fileobj: TextIO) → Generator[Dict[str, Any], None, None][source]: Generates value dictionaries from a TSV file.

crate_anon.preprocess.autoimport_db.gen_filename_fileobj(filenames: List[str]) → Generator[Tuple[Path, IO], None, None][source]

Iterates files, yielding (filename, file-like object) tuples. If a file is a ZIP file, iterate within it similarly (but not recursively).

Parameters:: filenames – Filenames to process.
Yields:: tuple (Path, file-like object) for each inner file

crate_anon.preprocess.autoimport_db.gen_files_from_zipfile(zipfilename: Path | str) → Generator[Tuple[Path, IO], None, None][source]

Iterates ZIP file(s), yielding filenames and corresponding file-like objects from within it/them.

Parameters:: zipfilename – filename of the .zip file
Yields:: tuple (Path, file-like object) for each inner file

NB related to cardinal_pythonlib.file_io.gen_files_from_zipfiles, but simpler and also provides the filenames.

crate_anon.preprocess.autoimport_db.gen_sheets_from_ods(fileobj: BinaryIO, first_sheet_only: bool = False) → Generator[Tuple[str, Iterable[Dict[str, Any]]], None, None][source]: Generates tuples of (sheet name, iterable-of-value-dictionaries) from an ODS file.

crate_anon.preprocess.autoimport_db.gen_sheets_from_xlsx(fileobj: BinaryIO, first_sheet_only: bool = False) → Generator[Tuple[str, Iterable[Dict[str, Any]]], None, None][source]: Generates tuples of (sheet name, iterable-of-value-dictionaries) from an Excel XLSX file.

crate_anon.preprocess.autoimport_db.gen_tablename_info(filenames: List[str], metadata: MetaData, engine: Engine, use_spreadsheet_names: bool = True, with_columns_from_data: bool = False, with_columns_from_reflection: bool = False, with_data: bool = False, skip_tables: List[str] | None = None, verbose: bool = False) → Generator[TabularFileInfo, None, None][source]

Parameters:

filenames – Filenames to iterate through.
metadata – Database MetaData object.
engine – SQLAlchemy engine.
use_spreadsheet_names – Use spreadsheet names (where relevant) as table names, rather than filenames. (If False, only the first sheet in each spreadsheet file will be used.)
with_columns_from_data – Read/autodetect column information from data, for creating tables?
with_columns_from_reflection – Should columns be read from the metadata?
with_data – Provide data?
skip_tables – Optional names of tables to skip.
verbose – Be verbose if the process fails? WARNING: will report values.

Yields:

TabularFileInfo instances.

crate_anon.preprocess.autoimport_db.import_table(ti: TabularFileInfo, session: Session, chunksize: int = 100000) → None[source]: Import a database table.

crate_anon.preprocess.autoimport_db.is_date_like_not_datetime_like(v: Any) → bool[source]: Does this look like a date (but not a datetime)?

crate_anon.preprocess.autoimport_db.is_datetime_or_date_like(v: Any) → bool[source]: Does this look like a datetime (or a date)?

crate_anon.preprocess.autoimport_db.mk_columns(datagen: Iterable[Dict[str, Any]], verbose: bool = False) → List[Column][source]

Attempt to autodetect SQLAlchemy column types.

Parameters:

datagen – Generator of data.
verbose – Be verbose and report values? (WARNING: therefore unsuitable for production use.)

crate_anon.preprocess.autoimport_db.ods_row_to_list(row: Iterable[Any]) → List[Any][source]: Convert an OpenOffice ODS row to a list of values, translating the empty string (used for empty cells) to None.

crate_anon.preprocess.autoimport_db.translate_empty_str_to_none(reader: Iterable[Dict[str, Any]]) → Generator[Dict[str, Any], None, None][source]

Yield dictionaries (mapping column name to value), but (a) translating blank strings (often the product of empty cells e.g. with csv.DictReader) to None; (b) skipping entirely blank rows.

Parameters:: reader – For example, a csv.DictReader().

crate_anon.preprocess.autoimport_db.xlsx_row_to_list(row: Iterable[Cell]) → List[Any][source]: Convert an OpenPyXL XLSX row to a list of values, translating the empty string (used for empty cells) to None.