14.1.11. crate_anon.anonymise.dd
crate_anon/anonymise/dd.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Data dictionary classes for CRATE anonymiser.
The data dictionary is a TSV file, for ease of editing by multiple authors, rather than a database table.
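As a rough illustration of why a tab-separated file is easy to handle, the sketch below parses a tiny TSV data dictionary with Python's standard csv module. The column names and contents (src_db, src_table, src_field, comment) are assumptions made up for this example; they are not taken from the CRATE file format.

```python
import csv
import io
from typing import Dict, List

# Hypothetical column names and contents, for illustration only.
TSV_TEXT = (
    "src_db\tsrc_table\tsrc_field\tcomment\n"
    "mydb\tpatient\tpatient_id\tdefines the patient\n"
    "mydb\tnote\tnote_text\tfree text to scrub\n"
)


def read_tsv_rows(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv_rows(TSV_TEXT)
for row in rows:
    print(row["src_db"], row["src_table"], row["src_field"])
```

Because each row is a single line of plain text, edits by several authors merge cleanly under ordinary version control.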
- class crate_anon.anonymise.dd.DDTableSummary(src_db: str, src_table: str, src_has_pk: bool, src_pk_fieldname: str, src_constant: bool, src_addition_only: bool, src_defines_pid: bool, src_has_pid: bool, src_has_mpid: bool, src_has_opt_out: bool, src_has_patient_scrub_info: bool, src_has_third_party_scrub_info: bool, src_has_required_scrub_info: bool, src_has_table_comment: bool, dest_table: str, dest_has_rows: bool, dest_add_src_hash: bool, dest_being_scrubbed: bool)[source]
- __init__(src_db: str, src_table: str, src_has_pk: bool, src_pk_fieldname: str, src_constant: bool, src_addition_only: bool, src_defines_pid: bool, src_has_pid: bool, src_has_mpid: bool, src_has_opt_out: bool, src_has_patient_scrub_info: bool, src_has_third_party_scrub_info: bool, src_has_required_scrub_info: bool, src_has_table_comment: bool, dest_table: str, dest_has_rows: bool, dest_add_src_hash: bool, dest_being_scrubbed: bool) None
- class crate_anon.anonymise.dd.DataDictionary(config: Config)[source]
Class representing an entire data dictionary.
- __init__(config: Config) None [source]
Set defaults.
- Parameters:
config – crate_anon.anonymise.config.Config
- cached_funcs() List[Any] [source]
Returns a list of our methods that are cached. See clear_caches().
- check_against_source_db() None [source]
Check DD validity against the source database(s).
Also caches SQLAlchemy source column types.
- check_valid(prohibited_fieldnames: List[str] | None = None, check_against_source_db: bool = True) None [source]
Check DD validity, internally and (optionally) against the source database(s).
- Parameters:
prohibited_fieldnames – list of prohibited destination fieldnames
check_against_source_db – check validity against the source database(s)?
- Raises:
ValueError –
- classmethod create_from_file(filename: str, config: Config, check_valid: bool = True, override_dialect: Dialect = None) DataDictionary [source]
Creates a new data dictionary by reading a file.
- property dest_dialect: Dialect
Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the destination database.
- property dest_dialect_name: str
Returns the SQLAlchemy dialect name for the destination database.
- draft_from_source_databases(report_every: int = 100) None [source]
Create a draft DD from a source database.
Will skip any rows it knows about already (thus allowing the generation of incremental changes).
- Parameters:
report_every – report to the Python log every n columns
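The incremental behaviour described for draft_from_source_databases() can be pictured with the following minimal sketch. It is not the CRATE implementation: the function draft_new_signatures and its inputs are hypothetical, and rows are reduced to db.table.column strings for brevity.

```python
import logging
from typing import Iterable, List, Set, Tuple

log = logging.getLogger(__name__)


def draft_new_signatures(
    known: Set[str],
    source_columns: Iterable[Tuple[str, str, str]],
    report_every: int = 100,
) -> List[str]:
    """Return signatures for source columns that are not already known."""
    new_sigs = []
    for n, (db, table, column) in enumerate(source_columns, start=1):
        if n % report_every == 0:
            log.info("Processed %d columns", n)  # periodic progress report
        sig = f"{db}.{table}.{column}"
        if sig in known:
            continue  # already in the dictionary: skip it
        new_sigs.append(sig)
    return new_sigs


known = {"mydb.patient.patient_id"}
columns = [
    ("mydb", "patient", "patient_id"),
    ("mydb", "patient", "dob"),
    ("mydb", "note", "note_text"),
]
new_sigs = draft_new_signatures(known, columns, report_every=2)
```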
- get_dest_sqla_table(tablename: str) Table [source]
For a given destination table name, return an sqlalchemy.sql.schema.Table object for the destination table (which we will create).
- get_dest_table_for_src_db_table(src_db: str, src_table: str) str [source]
For a given source database/table, return the single or the first destination table.
- get_dest_tables_all() AbstractSet[str] [source]
Return a SortedSet of all destination table names (including tables that will receive no contents).
- get_dest_tables_for_src_db_table(src_db: str, src_table: str) AbstractSet[str] [source]
For a given source database/table, return a SortedSet of destination tables.
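The relationship between the "all destination tables" and "single or first destination table" lookups might look like the sketch below. The helper names are hypothetical, plain sorted lists stand in for the SortedSet, and returning None when nothing matches is an assumption rather than documented behaviour.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

SrcKey = Tuple[str, str]  # (src_db, src_table)


def build_dest_map(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[SrcKey, Set[str]]:
    """Map each (src_db, src_table) pair to its destination table names."""
    dest_map: Dict[SrcKey, Set[str]] = defaultdict(set)
    for src_db, src_table, dest_table in rows:
        dest_map[(src_db, src_table)].add(dest_table)
    return dest_map


def dest_tables_for(
    dest_map: Dict[SrcKey, Set[str]], src_db: str, src_table: str
) -> List[str]:
    """All destination tables for a source table, in sorted order."""
    return sorted(dest_map.get((src_db, src_table), set()))


def first_dest_table_for(
    dest_map: Dict[SrcKey, Set[str]], src_db: str, src_table: str
) -> Optional[str]:
    """The single, or first (in sorted order), destination table."""
    tables = dest_tables_for(dest_map, src_db, src_table)
    return tables[0] if tables else None  # None here is an assumption


dest_map = build_dest_map([
    ("mydb", "note", "note_b"),
    ("mydb", "note", "note_a"),
    ("mydb", "patient", "patient"),
])
```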
- get_dest_tables_included() AbstractSet[str] [source]
Return a SortedSet of all destination table names (tables with at least some columns that are included).
- get_dest_tables_with_patient_info() AbstractSet[str] [source]
Return a SortedSet of destination table names that have patient information.
- get_fieldnames_for_src_table(src_db: str, src_table: str) AbstractSet[DataDictionaryRow] [source]
For a given source database name/table, return a SortedSet of source fields.
- get_int_pk_ddr(src_db: str, src_table: str) DataDictionaryRow | None [source]
For a given source database name and table, return the DD row for the integer PK for that table.
Will return None if no such data dictionary row exists.
- get_int_pk_name(src_db: str, src_table: str) str | None [source]
For a given source database name and table, return the field name of the integer PK for that table (or None if there isn’t one).
- get_mandatory_scrubber_sigs() AbstractSet[str] [source]
Return a set of field signatures (strings of the format db.table.column) for all rows representing “required scrubber” fields – that is, rows that must have at least one non-NULL value for each patient, or the patient won’t get processed.
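To make the db.table.column signature format and the "required scrubber" rule concrete, here is a small sketch. The helper names are hypothetical, and the code assumes that database, table, and column names contain no dots.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class FieldRef(NamedTuple):
    """Hypothetical container for the three parts of a field signature."""
    db: str
    table: str
    column: str


def make_signature(db: str, table: str, column: str) -> str:
    """Build a signature of the form db.table.column."""
    return f"{db}.{table}.{column}"


def split_signature(sig: str) -> FieldRef:
    """Split a signature into its parts (assumes no dots within names)."""
    db, table, column = sig.split(".")
    return FieldRef(db, table, column)


def patient_is_processable(
    required_sigs: Iterable[str],
    values_by_sig: Dict[str, List[Optional[str]]],
) -> bool:
    """True if every required field has at least one non-NULL value."""
    return all(
        any(v is not None for v in values_by_sig.get(sig, []))
        for sig in required_sigs
    )


sig = make_signature("mydb", "patient", "surname")
required = {sig}
ok = patient_is_processable(required, {sig: [None, "Smith"]})
not_ok = patient_is_processable(required, {sig: [None]})
```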
- get_mpid_name(src_db: str, src_table: str) str | None [source]
For a given source database name and table: return the field name of the field providing master PID (MPID) information (or None if there isn’t one).
- get_optout_defining_fields() AbstractSet[Tuple[str, str, str, str, str]] [source]
Return a SortedSet of src_db, src_table, src_field, pidfield, mpidfield tuples for rows that define opt-out information.
- get_patient_src_tables_with_active_dest(src_db: str) AbstractSet[str] [source]
For a given source database name, return a SortedSet of source tables that contain patient information and have an active destination table.
- get_pid_name(src_db: str, src_table: str) str | None [source]
For a given source database name and table: return the field name of the field providing primary PID information (or None if there isn’t one).
- get_pk_ddr(src_db: str, src_table: str) DataDictionaryRow | None [source]
For a given source database name and table, return the DD row for the PK for that table, whether integer or not.
Will return None if no such data dictionary row exists.
- get_rows_for_dest_table(dest_table: str, skip_table_comments: bool = True) AbstractSet[DataDictionaryRow] [source]
For a given destination table, return a SortedSet of DD rows.
- get_rows_for_src_table(src_db: str, src_table: str, skip_table_comments: bool = True) AbstractSet[DataDictionaryRow] [source]
For a given source database name/table, return a SortedSet of DD rows.
- get_scrub_from_db_table_pairs() AbstractSet[Tuple[str, str]] [source]
Return a SortedSet of source_database_name, source_table tuples for tables whose fields contain scrub_src (scrub-from) information.
- get_scrub_from_rows(src_db: str, src_table: str) AbstractSet[DataDictionaryRow] [source]
Return a SortedSet of DD rows for all fields containing scrub_src (scrub-from) information.
- get_scrub_from_rows_as_fieldinfo(src_db: str, src_table: str, depth: int, max_depth: int) List[ScrubSourceFieldInfo] [source]
Returns the results of get_scrub_from_rows() as a list of ScrubSourceFieldInfo objects, which is more convenient for scrubbing.
- Parameters:
src_db – Source database name.
src_table – Source table name.
depth – Current recursion depth for looking up third-party information.
max_depth – Maximum permitted recursion depth for looking up third-party information.
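The role of depth and max_depth can be pictured as an ordinary depth-limited recursion. The sketch below is only an illustration of that idea: the RECORDS structure and the function collect_scrub_values are hypothetical and do not reflect how CRATE stores or retrieves third-party information.

```python
from typing import Dict, List

# Hypothetical records: each person has names to scrub and may refer to
# third parties (e.g. relatives) who have records of their own.
RECORDS: Dict[str, Dict[str, List[str]]] = {
    "A": {"names": ["Alice"], "third_party_pids": ["B"]},
    "B": {"names": ["Bob"], "third_party_pids": ["C"]},
    "C": {"names": ["Carol"], "third_party_pids": []},
}


def collect_scrub_values(
    pid: str,
    records: Dict[str, Dict[str, List[str]]],
    depth: int,
    max_depth: int,
) -> List[str]:
    """Gather scrub values, following third parties up to max_depth."""
    record = records[pid]
    values = list(record["names"])
    if depth < max_depth:  # only recurse while below the permitted depth
        for third_party in record["third_party_pids"]:
            values.extend(
                collect_scrub_values(
                    third_party, records, depth + 1, max_depth
                )
            )
    return values


only_self = collect_scrub_values("A", RECORDS, depth=0, max_depth=0)
one_level = collect_scrub_values("A", RECORDS, depth=0, max_depth=1)
two_levels = collect_scrub_values("A", RECORDS, depth=0, max_depth=2)
```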
- get_src_db_tablepairs() AbstractSet[Tuple[str, str]] [source]
Return a SortedSet of all source_database_name, source_table tuples.
- get_src_db_tablepairs_w_int_pk() AbstractSet[Tuple[str, str]] [source]
Return a SortedSet of source_database_name, source_table tuples for tables that have an integer PK.
- get_src_db_tablepairs_w_no_pt_info() AbstractSet[Tuple[str, str]] [source]
Return a SortedSet of source_database_name, source_table tuples for tables that contain no patient information.
- get_src_db_tablepairs_w_pt_info() AbstractSet[Tuple[str, str]] [source]
Return a SortedSet of source_database_name, source_table tuples for tables that contain patient information.
- get_src_dbs_tables_for_dest_table(dest_table: str) AbstractSet[Tuple[str, str]] [source]
For a given destination table, return a SortedSet of dbname, table tuples, representing source database(s)/table(s).
- get_src_dbs_tables_with_no_pt_info_int_pk() AbstractSet[Tuple[str, str]] [source]
Return a SortedSet of source_database_name, source_table tuples where the table has no patient information and has an integer PK.
- get_src_dbs_tables_with_no_pt_info_no_pk() AbstractSet[Tuple[str, str]] [source]
Return a SortedSet of source_database_name, source_table tuples where the table has no patient information and no integer PK.
- get_src_tables(src_db: str) AbstractSet[str] [source]
For a given source database name, return a SortedSet of all source tables that are required (that is, ones being copied and ones providing vital patient information).
- get_src_tables_with_active_dest(src_db: str) AbstractSet[str] [source]
For a given source database name, return a SortedSet of its source tables that have an active destination.
- get_src_tables_with_patient_info(src_db: str) AbstractSet[str] [source]
For a given source database name, return a SortedSet of source tables that have patient information.
- get_summary_info_all_tables() List[DDTableSummary] [source]
Returns summary information by table.
- get_summary_info_for_table(src_db: str, src_table: str) DDTableSummary [source]
Returns summary information for a specific table.
- get_tables_w_no_pt_info() AbstractSet[str] [source]
Return a SortedSet of source_table names for tables that contain no patient information.
- get_tables_w_scrub_src() AbstractSet[str] [source]
Return a SortedSet of source_table names for tables that contain scrub_src information, i.e. that contribute to anonymisation.
- has_active_destination(src_db: str, src_table: str) bool [source]
For a given source database name and table, does it have an active destination?
- make_dest_datatypes_explicit() None [source]
By default, when autocreating a data dictionary, the dest_datatype field is not populated explicitly, just implicitly. This option makes them explicit by instantiating those values. Primarily for debugging.
- property n_definers: int
The number of patient-defining columns.
- property n_rows: int
Number of rows.
- omit_rows_by_filter(keep: Callable[[DataDictionaryRow], bool]) None [source]
Set to “omit” any rows that do not pass a filter function. Does not alter any rows already set to omit. (Skips table comment rows.)
- Parameters:
keep – Function taking a data dictionary row as an argument, and returning a boolean of whether to keep the row.
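The semantics described for omit_rows_by_filter() can be sketched as follows. The Row class is a hypothetical stand-in for DataDictionaryRow, with invented attribute names, and the free function is a simplified illustration rather than the method's real implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Row:
    """Hypothetical stand-in for a data dictionary row."""
    src_table: str
    src_field: str
    omit: bool = False
    is_table_comment: bool = False


def omit_rows_by_filter(rows: List[Row], keep: Callable[[Row], bool]) -> None:
    """Mark as omitted every non-comment row that fails the filter."""
    for row in rows:
        if row.is_table_comment:
            continue  # table comment rows are skipped entirely
        if not keep(row):
            row.omit = True  # only ever sets omit; never clears it


rows = [
    Row("note", "note_text"),
    Row("audit", "changed_by"),
    Row("note", "author", omit=True),          # already omitted
    Row("audit", "", is_table_comment=True),   # table comment row
]

# Keep only rows from the "note" table.
omit_rows_by_filter(rows, keep=lambda r: r.src_table == "note")
omit_flags = [r.omit for r in rows]
```

Note that the third row passes the filter but stays omitted, because the operation never un-omits a row.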
- read_from_file(filename: str, check_valid: bool = True, override_dialect: Dialect | None = None) None [source]
Read DD from file.
- Parameters:
filename – Filename to read.
check_valid – Run a validity check after setting each row from its values?
override_dialect – SQLAlchemy SQL dialect to enforce (e.g. for interpreting textual column types in the source database). By default, the source database’s own dialect is used.
- remove_rows_by_filter(keep: Callable[[DataDictionaryRow], bool]) None [source]
Removes any rows that do not pass a filter function. (Retains table comment rows.)
- Parameters:
keep – Function taking a data dictionary row as an argument, and returning a boolean of whether to keep the row.
- remove_rows_by_modifying_filter(keep_modify: Callable[[DataDictionaryRow], DataDictionaryRow | None]) None [source]
Removes any rows that do not pass a filter function; allows the filter function to modify rows that are kept. (Retains table comment rows.)
- Parameters:
keep_modify – Function taking a data dictionary row as an argument, and returning either the row (potentially modified) to retain it, or None to reject it.
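A keep_modify function of the kind described above might be written as in this sketch. The Row class and its attributes are hypothetical stand-ins for DataDictionaryRow, and the free function returns a new list instead of changing a dictionary in place, purely to keep the example self-contained.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass
class Row:
    """Hypothetical stand-in for a data dictionary row."""
    src_table: str
    src_field: str
    dest_field: str = ""
    is_table_comment: bool = False


def remove_rows_by_modifying_filter(
    rows: List[Row],
    keep_modify: Callable[[Row], Optional[Row]],
) -> List[Row]:
    """Keep (possibly modified) rows; drop rows for which None is returned."""
    kept = []
    for row in rows:
        if row.is_table_comment:
            kept.append(row)  # table comment rows are always retained
            continue
        result = keep_modify(row)
        if result is not None:
            kept.append(result)
    return kept


def keep_modify(row: Row) -> Optional[Row]:
    """Reject rows from the 'audit' table; lower-case other dest names."""
    if row.src_table == "audit":
        return None
    return replace(row, dest_field=row.src_field.lower())


rows = [
    Row("note", "NoteText"),
    Row("audit", "ChangedBy"),
    Row("audit", "", is_table_comment=True),
]
kept = remove_rows_by_modifying_filter(rows, keep_modify)
```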
- sort() None [source]
Sorts the data dictionary.
(Table comments, having no source field, will be first among rows for their tables.)
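One way to see why table comments come first is to sort on a tuple key in which the source field is the last element. This is a sketch under the assumption that a table comment row carries an empty source field; the Row class is a hypothetical stand-in, and the real sort key may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Row:
    """Hypothetical stand-in for a data dictionary row."""
    src_db: str
    src_table: str
    src_field: str  # empty string for a table comment row (assumption)


def sort_key(row: Row) -> Tuple[str, str, str]:
    """Order by database, then table, then field name."""
    return (row.src_db, row.src_table, row.src_field)


rows = [
    Row("mydb", "note", "note_text"),
    Row("mydb", "note", ""),  # table comment for "note"
    Row("mydb", "admission", "admit_date"),
]
rows.sort(key=sort_key)
ordered = [(r.src_table, r.src_field) for r in rows]
```

An empty string compares lower than any non-empty string, so the comment row precedes the other rows of its table.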
- tidy_draft() None [source]
Corrects a draft data dictionary for overall logical consistency.
The checks are:
Don’t scrub in non-patient tables.
SQL Server only supports one FULLTEXT index per table, and only if the table has a non-null column with a unique index.
Test code for full-text index creation:
-- ----------------------------------------------------------------
-- SQL Server: basic use
-- ----------------------------------------------------------------

USE mydb;
CREATE FULLTEXT CATALOG default_fulltext_catalog AS DEFAULT;
CREATE TABLE junk (intthing INT PRIMARY KEY, textthing VARCHAR(MAX));
-- now find the name of the PK index (!
-- by hand or see cardinal_pythonlib)
CREATE FULLTEXT INDEX ON junk (textthing) KEY INDEX <pk_index_name>;

-- ----------------------------------------------------------------
-- SQL Server: it means it about the "NOT NULL" aspects, and a
-- unique index is not enough
-- ----------------------------------------------------------------

USE mydb;
DROP TABLE IF EXISTS rubbish;
CREATE TABLE rubbish (a INT NOT NULL, b VARCHAR(MAX));
CREATE UNIQUE INDEX rubbish_a ON rubbish (a);
CREATE FULLTEXT INDEX ON rubbish (b) KEY INDEX rubbish_a;

-- .. that works, but if you remove the "NOT NULL" from the table
-- definition, it fails with:
--
-- 'rubbish_a' is not a valid index to enforce a full-text search
-- key. A full-text search key must be a unique, non-nullable,
-- single-column index which is not offline, is not defined on a
-- non-deterministic or imprecise nonpersisted computed column,
-- does not have a filter, and has maximum size of 900 bytes.
-- Choose another index for the full-text key.

-- ----------------------------------------------------------------
-- MySQL: two FULLTEXT indexes on one table
-- ----------------------------------------------------------------

USE mydb;
CREATE TABLE junk (intthing INT PRIMARY KEY, text1 LONGTEXT, text2 LONGTEXT);
ALTER TABLE junk ADD FULLTEXT INDEX ftidx1 (text1);
ALTER TABLE junk ADD FULLTEXT INDEX ftidx2 (text2);  -- OK
- write(filename: str, filetype: str | None = None) None [source]
Writes the dictionary, either specifying the filetype or autodetecting it from the specified filename.
- Parameters:
filename – Name of file to write, or “-” for stdout (in which case the filetype is forced to TSV).
filetype – File type as one of .ods, .tsv, or .xlsx; alternatively, use None to autodetect from the filename.
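The filetype rules for write() can be summarised in a short sketch. The function resolve_filetype is hypothetical, and lower-casing the extension and raising ValueError for unsupported types are assumptions, not documented behaviour.

```python
import os
from typing import Optional

SUPPORTED_FILETYPES = (".ods", ".tsv", ".xlsx")


def resolve_filetype(filename: str, filetype: Optional[str] = None) -> str:
    """Decide which file type to write, following the rules above."""
    if filename == "-":
        return ".tsv"  # writing to stdout forces TSV
    if filetype is None:
        # Autodetect from the filename's extension.
        filetype = os.path.splitext(filename)[1]
    filetype = filetype.lower()  # case-insensitivity is an assumption
    if filetype not in SUPPORTED_FILETYPES:
        raise ValueError(f"Unsupported file type: {filetype!r}")
    return filetype


detected = resolve_filetype("dd.xlsx")
forced = resolve_filetype("-", ".ods")
explicit = resolve_filetype("dd.txt", ".tsv")
```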
- class crate_anon.anonymise.dd.ScrubSourceFieldInfo(is_mpid: bool, is_patient: bool, recurse: bool, required_scrubber: bool, scrub_method: crate_anon.anonymise.constants.ScrubMethod, signature: str, value_fieldname: str)[source]
- __init__(is_mpid: bool, is_patient: bool, recurse: bool, required_scrubber: bool, scrub_method: ScrubMethod, signature: str, value_fieldname: str) None
- crate_anon.anonymise.dd.ensure_no_source_type_mismatch(ddr: DataDictionaryRow, config_sqlatype: TypeEngine | String, primary_pid: bool = True) None [source]
Ensure that the source column type of a data dictionary row is compatible with what’s expected from the config. We check this only for specific types of column (PID, MPID), because we need to know their data types concretely for the secret mapping table. The question is not whether the types are the same, but whether the value will fit into the config-determined type (for example, it’s OK to convert an integer to a long-enough string, but not necessarily the other way round).
- Parameters:
ddr – Data dictionary row.
config_sqlatype – SQLAlchemy column type that would be expected based on the current config.
primary_pid – Is this the main PID field? If false, it’s the MPID.
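The "will the value fit?" question can be illustrated without SQLAlchemy by using simplified stand-in types. IntType, StrType, and value_fits below are hypothetical and far cruder than real SQLAlchemy column types; the sketch only shows the asymmetry between integer-to-string and string-to-integer conversion.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IntType:
    """Hypothetical integer type, described by its maximum digit count."""
    max_digits: int  # e.g. 10 for 32-bit signed, 19 for 64-bit signed


@dataclass(frozen=True)
class StrType:
    """Hypothetical string type with a maximum length."""
    length: int


ColType = Union[IntType, StrType]


def value_fits(src: ColType, dest: ColType) -> bool:
    """Can every value of the source type be stored in the dest type?"""
    if isinstance(src, IntType) and isinstance(dest, IntType):
        return src.max_digits <= dest.max_digits
    if isinstance(src, StrType) and isinstance(dest, StrType):
        return src.length <= dest.length
    if isinstance(src, IntType) and isinstance(dest, StrType):
        # Allow one extra character for a minus sign.
        return src.max_digits + 1 <= dest.length
    return False  # a string is not guaranteed to fit into an integer


int32 = IntType(max_digits=10)
int64 = IntType(max_digits=19)
int_to_long_string = value_fits(int64, StrType(length=20))
int_to_short_string = value_fits(int64, StrType(length=10))
string_to_int = value_fits(StrType(length=5), int64)
```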