12.1.11. crate_anon.anonymise.dd

crate_anon/anonymise/dd.py

Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.

Data dictionary classes for CRATE anonymiser.

The data dictionary is a TSV file, for ease of editing by multiple authors, rather than a database table.

class crate_anon.anonymise.dd.DDTableSummary(src_db: str, src_table: str, src_has_pk: bool, src_pk_fieldname: str, src_constant: bool, src_addition_only: bool, src_defines_pid: bool, src_has_pid: bool, src_has_mpid: bool, src_has_opt_out: bool, src_has_patient_scrub_info: bool, src_has_third_party_scrub_info: bool, src_has_required_scrub_info: bool, src_has_table_comment: bool, dest_table: str, dest_has_rows: bool, dest_add_src_hash: bool, dest_being_scrubbed: bool)[source]

__init__(src_db: str, src_table: str, src_has_pk: bool, src_pk_fieldname: str, src_constant: bool, src_addition_only: bool, src_defines_pid: bool, src_has_pid: bool, src_has_mpid: bool, src_has_opt_out: bool, src_has_patient_scrub_info: bool, src_has_third_party_scrub_info: bool, src_has_required_scrub_info: bool, src_has_table_comment: bool, dest_table: str, dest_has_rows: bool, dest_add_src_hash: bool, dest_being_scrubbed: bool) → None

class crate_anon.anonymise.dd.DataDictionary(config: Config)[source]

Class representing an entire data dictionary.

__init__(config: Config) → None[source]

Set defaults.

Parameters:: config – crate_anon.anonymise.config.Config

cached_funcs() → List[Any][source]: Returns a list of our methods that are cached. See clear_caches().

check_against_source_db() → None[source]

Check DD validity against the source database(s).

Also caches SQLAlchemy source column types.

check_valid(prohibited_fieldnames: List[str] | None = None, check_against_source_db: bool = True) → None[source]

Check DD validity, internally ± against the source database(s).

Parameters:

prohibited_fieldnames – list of prohibited destination fieldnames
check_against_source_db – check validity against the source database(s)?

Raises:

ValueError –

clear_caches() → None[source]: Clear all our cached information.

classmethod create_from_file(filename: str, config: Config, check_valid: bool = True, override_dialect: Dialect = None) → DataDictionary[source]: Creates a new data dictionary by reading a file.

debug_cache_hits() → None[source]: Report cache hit information for our caches, to the Python log.

property dest_dialect: Dialect: Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the destination database.

property dest_dialect_name: str: Returns the SQLAlchemy dialect name for the destination database.

draft_from_source_databases(report_every: int = 100) → None[source]

Create a draft DD from a source database.

Will skip any rows it knows about already (thus allowing the generation of incremental changes).

Parameters:: report_every – report to the Python log every n columns

get_dest_sqla_table(tablename: str) → Table[source]: For a given destination table name, return an sqlalchemy.sql.schema.Table object for the destination table (which we will create).

get_dest_table_for_src_db_table(src_db: str, src_table: str) → str[source]: For a given source database/table, return the single or the first destination table.

get_dest_tables_all() → AbstractSet[str][source]: Return a SortedSet of all destination table names (including tables that will receive no contents).

get_dest_tables_for_src_db_table(src_db: str, src_table: str) → AbstractSet[str][source]: For a given source database/table, return a SortedSet of destination tables.

get_dest_tables_included() → AbstractSet[str][source]: Return a SortedSet of all destination table names (tables with at least some columns that are included).

get_dest_tables_with_patient_info() → AbstractSet[str][source]: Return a SortedSet of destination table names that have patient information.

get_fieldnames_for_src_table(src_db: str, src_table: str) → AbstractSet[DataDictionaryRow][source]: For a given source database name/table, return a SortedSet of source fields.

get_int_pk_ddr(src_db: str, src_table: str) → DataDictionaryRow | None[source]

For a given source database name and table, return the DD row for the integer PK for that table.

Will return None if no such data dictionary row exists.

get_int_pk_name(src_db: str, src_table: str) → str | None[source]: For a given source database name and table, return the field name of the integer PK for that table (or None if there isn’t one).

get_mandatory_scrubber_sigs() → AbstractSet[str][source]: Return a set of field signatures (strings of the format db.table.column) for all rows representing “required scrubber” fields – that is, rows that must have at least one non-NULL value for each patient, or the patient won’t get processed.

get_mpid_name(src_db: str, src_table: str) → str | None[source]: For a given source database name and table: return the field name of the field providing master PID (MPID) information (or None if there isn’t one).

get_optout_defining_fields() → AbstractSet[Tuple[str, str, str, str, str]][source]: Return a SortedSet of src_db, src_table, src_field, pidfield, mpidfield tuples for rows that define opt-out information.

get_patient_src_tables_with_active_dest(src_db: str) → AbstractSet[str][source]: For a given source database name, return a SortedSet of source tables that contain patient information and have an active destination table.

get_pid_name(src_db: str, src_table: str) → str | None[source]: For a given source database name and table: return the field name of the field providing primary PID information (or None if there isn’t one).

get_pk_ddr(src_db: str, src_table: str) → DataDictionaryRow | None[source]

For a given source database name and table, return the DD row for the PK for that table, whether integer or not.

Will return None if no such data dictionary row exists.

get_rows_for_dest_table(dest_table: str, skip_table_comments: bool = True) → AbstractSet[DataDictionaryRow][source]: For a given destination table, return a SortedSet of DD rows.

get_rows_for_src_table(src_db: str, src_table: str, skip_table_comments: bool = True) → AbstractSet[DataDictionaryRow][source]: For a given source database name/table, return a SortedSet of DD rows.

get_scrub_from_db_table_pairs() → AbstractSet[Tuple[str, str]][source]: Return a SortedSet of source_database_name, source_table tuples where those fields contain scrub_src (scrub-from) information.

get_scrub_from_rows(src_db: str, src_table: str) → AbstractSet[DataDictionaryRow][source]: Return a SortedSet of DD rows for all fields containing scrub_src (scrub-from) information.

get_scrub_from_rows_as_fieldinfo(src_db: str, src_table: str, depth: int, max_depth: int) → List[ScrubSourceFieldInfo][source]

Using get_scrub_from_rows(), as a list of ScrubSourceFieldInfo objects, which is more convenient for scrubbing.

Parameters:

src_db – Source database name.
src_table – Source table name.
depth – Current recursion depth for looking up third-party information.
max_depth – Maximum permitted recursion depth for looking up third-party information.

get_source_databases() → AbstractSet[str][source]: Return a SortedSet of source database names.

get_src_db_tablepairs() → AbstractSet[Tuple[str, str]][source]: Return a SortedSet of all source_database_name, source_table tuples.

get_src_db_tablepairs_w_int_pk() → AbstractSet[Tuple[str, str]][source]: Return a SortedSet of source_database_name, source_table tuples for tables that have an integer PK.

get_src_db_tablepairs_w_no_pt_info() → AbstractSet[Tuple[str, str]][source]: Return a SortedSet of source_database_name, source_table tuples for tables that contain no patient information.

get_src_db_tablepairs_w_pt_info() → AbstractSet[Tuple[str, str]][source]: Return a SortedSet of source_database_name, source_table tuples for tables that contain patient information.

get_src_dbs_tables_for_dest_table(dest_table: str) → AbstractSet[Tuple[str, str]][source]: For a given destination table, return a SortedSet of dbname, table tuples, representing source database(s)/table(s).

get_src_dbs_tables_with_no_pt_info_int_pk() → AbstractSet[Tuple[str, str]][source]: Return a SortedSet of source_database_name, source_table tuples where the table has no patient information and has an integer PK.

get_src_dbs_tables_with_no_pt_info_no_pk() → AbstractSet[Tuple[str, str]][source]: Return a SortedSet of source_database_name, source_table tuples where the table has no patient information and no integer PK.

get_src_tables(src_db: str) → AbstractSet[str][source]: For a given source database name, return a SortedSet of all source tables that are required (that is, ones being copied and ones providing vital patient information).

get_src_tables_with_active_dest(src_db: str) → AbstractSet[str][source]: For a given source database name, return a SortedSet of its source tables that have an active destination.

get_src_tables_with_patient_info(src_db: str) → AbstractSet[str][source]: For a given source database name, return a SortedSet of source tables that have patient information.

get_summary_info_all_tables() → List[DDTableSummary][source]: Returns summary information by table.

get_summary_info_for_table(src_db: str, src_table: str) → DDTableSummary[source]: Returns summary information for a specific table.

get_tables_w_no_pt_info() → AbstractSet[str][source]: Return a SortedSet of source_table names for tables that contain no patient information.

get_tables_w_scrub_src() → AbstractSet[str][source]: Return a SortedSet of source_table names for tables that contain scrub_src information, i.e. that contribute to anonymisation.

has_active_destination(src_db: str, src_table: str) → bool[source]: For a given source database name and table, does it have an active destination?

make_dest_datatypes_explicit() → None[source]: By default, when autocreating a data dictionary, the dest_datatype field is not populated explicit, just implicitly. This option makes them explicit by instantiating those values. Primarily for debugging.

property n_definers: int: The number of patient-defining columns.

property n_rows: int: Number of rows.

omit_rows_by_filter(keep: Callable[[DataDictionaryRow], bool]) → None[source]

Set to “omit” any rows that do not pass a filter function. Does not alter any rows already set to omit. (Skips table comment rows.)

Parameters:: keep – Function taking a data dictionary row as an argument, and returning a boolean of whether to keep the row.

read_from_file(filename: str, check_valid: bool = True, override_dialect: Dialect | None = None) → None[source]

Read DD from file.

Parameters:

filename – Filename to read.
check_valid – Run a validity check after setting each row from its values?
override_dialect – SQLAlchemy SQL dialect to enforce (e.g. for interpreting textual column types in the source database). By default, the source database’s own dialect is used.

remove_rows_by_filter(keep: Callable[[DataDictionaryRow], bool]) → None[source]

Removes any rows that do not pass a filter function. (Retains table comment rows.)

Parameters:: keep – Function taking a data dictionary row as an argument, and returning a boolean of whether to keep the row.

remove_rows_by_modifying_filter(keep_modify: Callable[[DataDictionaryRow], DataDictionaryRow | None]) → None[source]

Removes any rows that do not pass a filter function; allows the filter function to modify rows that are kept. (Retains table comment rows.)

Parameters:: keep_modify – Function taking a data dictionary row as an argument, and returning either the row (potentially modified) to retain it, or None to reject it.

sort() → None[source]

Sorts the data dictionary.

(Table comments, having no source field, will be first among rows for their tables.)

tidy_draft() → None[source]

Corrects a draft data dictionary for overall logical consistency.

The checks are:

Don’t scrub in non-patient tables.
SQL Server only supports one FULLTEXT index per table, and only if the table has a non-null column with a unique index.

Test code for full-text index creation:

-- ----------------------------------------------------------------
-- SQL Server: basic use
-- ----------------------------------------------------------------

USE mydb;
CREATE FULLTEXT CATALOG default_fulltext_catalog AS DEFAULT;
CREATE TABLE junk (intthing INT PRIMARY KEY, textthing VARCHAR(MAX));
-- now find the name of the PK index (! -- by hand or see cardinal_pythonlib)
CREATE FULLTEXT INDEX ON junk (textthing) KEY INDEX <pk_index_name>;

-- ----------------------------------------------------------------
-- SQL Server: it means it about the "NOT NULL" aspects, and a
-- unique index is not enough
-- ----------------------------------------------------------------

USE mydb;
DROP TABLE IF EXISTS rubbish;
CREATE TABLE rubbish (a INT NOT NULL, b VARCHAR(MAX));
CREATE UNIQUE INDEX rubbish_a ON rubbish (a);
CREATE FULLTEXT INDEX ON rubbish (b) KEY INDEX rubbish_a;

-- .. that works, but if you remove the "NOT NULL" from the table
-- definition, it fails with:
--
-- 'rubbish_a' is not a valid index to enforce a full-text search
-- key. A full-text search key must be a unique, non-nullable,
-- single-column index which is not offline, is not defined on a
-- non-deterministic or imprecise nonpersisted computed column,
-- does not have a filter, and has maximum size of 900 bytes.
-- Choose another index for the full-text key.

-- ----------------------------------------------------------------
-- MySQL: two FULLTEXT indexes on one table
-- ----------------------------------------------------------------

USE mydb;
CREATE TABLE junk (intthing INT PRIMARY KEY, text1 LONGTEXT, text2 LONGTEXT);
ALTER TABLE junk ADD FULLTEXT INDEX ftidx1 (text1);
ALTER TABLE junk ADD FULLTEXT INDEX ftidx2 (text2);  -- OK

write(filename: str, filetype: str | None = None) → None[source]

Writes the dictionary, either specifying the filetype or autodetecting it from the specified filename.

Parameters:

filename – Name of file to write, or “-” for stdout (in which case the filetype is forced to TSV).
filetype – File type as one of .ods, .tsv, or .xlsx; alternatively, use None to autodetect from the filename.

class crate_anon.anonymise.dd.ScrubSourceFieldInfo(is_mpid: bool, is_patient: bool, recurse: bool, required_scrubber: bool, scrub_method: crate_anon.anonymise.constants.ScrubMethod, signature: str, value_fieldname: str)[source]

__init__(is_mpid: bool, is_patient: bool, recurse: bool, required_scrubber: bool, scrub_method: ScrubMethod, signature: str, value_fieldname: str) → None

crate_anon.anonymise.dd.ensure_no_source_type_mismatch(ddr: DataDictionaryRow, config_sqlatype: TypeEngine | String, primary_pid: bool = True) → None[source]

Ensure that the source column type of a data dictionary row is compatible with what’s expected from the config. We check this only for specific type of column (PID, MPID), because we need to know their data types concretely for the secret mapping table. The question is not whether the types are the same, but whether the value will fit into the config-determined type (for example, it’s OK to convert an integer to a long-enough string but necessarily not the other way round).

Parameters:

ddr – Data dictionary row.
config_sqlatype – SQLAlchemy column type that would be expected based on the current config.
primary_pid – Is this the main PID field? If false, it’s the MPID.