14.1.12. crate_anon.anonymise.ddr

crate_anon/anonymise/ddr.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Data dictionary rows.

class crate_anon.anonymise.ddr.DataDictionaryRow(config: Config)

Class representing a single row of a data dictionary (a DDR).

__init__(config: Config) → None

Set up basic defaults.

Parameters

config – a crate_anon.anonymise.config.Config

property add_src_hash: bool

Should we add a column to the destination that contains a hash of the contents of the whole source row (all fields)?

property addition_only: bool

May we assume that records can only be added to this table, not deleted?

This is a flag that may be applied to a PK row only.

property alter_method: str

Return the alter_method string from the working fields.

property alter_methods: List[crate_anon.anonymise.altermethod.AlterMethod]

Return all alteration methods to be applied.

Returns

list of crate_anon.anonymise.altermethod.AlterMethod objects

as_row() → List[Any]

Returns a data row (a list of values whose order matches header_row()) for use in spreadsheet formats.

property being_scrubbed: bool

Is the field being scrubbed as it passes from source to destination? (Only true if the field is being included, not omitted.)

check_prohibited_fieldnames(prohibited_fieldnames: Iterable[str]) → None

Check that the destination field isn’t a prohibited one.

Parameters

prohibited_fieldnames – list of prohibited fieldnames

Raises

ValueError
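
A minimal sketch of this kind of check, assuming an exact, case-sensitive comparison against the destination field name. The function name check_prohibited_sketch and the comparison rule are illustrative assumptions, not CRATE's implementation:

```python
from typing import Iterable


def check_prohibited_sketch(dest_field: str,
                            prohibited_fieldnames: Iterable[str]) -> None:
    """Hypothetical stand-in: raise ValueError if dest_field is prohibited."""
    # Assumption: exact, case-sensitive match on the destination field name.
    if dest_field in set(prohibited_fieldnames):
        raise ValueError(f"Destination field {dest_field!r} is prohibited")
```

Calling check_prohibited_sketch("id", ["id", "pk"]) raises ValueError, whereas a permitted name returns silently.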

check_valid() → None

Check internal validity and complain if invalid, showing the source of the problem.

Raises

AssertionError

property constant: bool

Is the source field guaranteed not to change (for a given PK)?

property contains_patient_info: bool

Does the field contain patient information? That means any of:

  • primary PID

  • MPID

  • scrub-source (sensitive) information
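
The "any of" rule above can be sketched as a simple boolean combination. FlagSketch is a hypothetical stand-in that borrows the property names documented on this page; it is not the CRATE class and ignores how the real flags are stored:

```python
from dataclasses import dataclass


@dataclass
class FlagSketch:
    """Hypothetical stand-in for the flags of a data dictionary row."""
    primary_pid: bool = False
    master_pid: bool = False
    contains_scrub_src: bool = False

    @property
    def contains_patient_info(self) -> bool:
        # True if at least one of the three conditions holds.
        return (
            self.primary_pid
            or self.master_pid
            or self.contains_scrub_src
        )
```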

property contains_patient_scrub_src_info: bool

Does this field contain scrub-source information about the patient?

property contains_scrub_src: bool

Does the field contain scrub-source information (sensitive information used for de-identification)?

property contains_third_party_info: bool

Does this field contain (identifiable) information about a third party, either directly or via a third-party PID?

property contains_third_party_info_directly: bool

Does this field contain (identifiable) information about a third party, directly?

property contains_vital_patient_info: bool

Does the field contain vital patient information? That means:

  • scrub-source (sensitive) information

property decision: str

Should we include the field in the destination?

Returns

"OMIT" or "include".

property defines_primary_pids: bool

Is this the field – usually one in the entire source database – that defines primary PIDs? Usually this is true for the “ID” column of the master patient table.

property dest_dialect: sqlalchemy.engine.interfaces.Dialect

Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the destination database.

property dest_dialect_name: str

Returns the SQLAlchemy dialect name for the destination database.

property dest_should_be_encrypted_pid_type: bool

Should the destination column (if included) be of the encrypted PID/MPID type?

property dest_signature: str

Returns a signature based on the destination table/field, in the format table.column.

property dest_sqla_coltype: sqlalchemy.sql.type_api.TypeEngine

Returns the SQLAlchemy column type of the destination column.

Note that this doesn’t include nullable status. An SQLAlchemy column looks like Column(String(50), nullable=False) – the type that we’re fetching here is, for example, the String(50) part. For the full column, see dest_sqla_column below.

property dest_sqla_column: sqlalchemy.sql.schema.Column

Returns an SQLAlchemy sqlalchemy.sql.schema.Column for the destination column.

property exclusion_values: Union[List[Any], str]

Returns a list of exclusion values (or an empty string if there are no such values).

This slightly curious output format is used to create a TSV row (see get_tsv()) or to check in a “truthy” way whether we have exclusion values (see crate_anon.anonymise.anonymise.process_table()).

property extracting_text_altermethods: List[crate_anon.anonymise.altermethod.AlterMethod]

Return all alteration methods that involve text extraction.

Returns

list of crate_anon.anonymise.altermethod.AlterMethod objects

property from_file: bool

Was this DDR loaded from a file (rather than, say, autogenerated from a database)?

property has_special_alter_method: bool

Is this a field for which the alter method is fixed?

classmethod header_row() → List[str]

Returns a header row (a list of headings) for use in spreadsheet formats.
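
As an illustration of how header_row() and as_row() fit together, the sketch below writes a tab-separated file using the standard csv module. RowSketch and rows_to_tsv are hypothetical, and the two column names are chosen for illustration only; the real class has many more columns:

```python
import csv
import io
from typing import Any, List


class RowSketch:
    """Hypothetical stand-in for a DDR, with only two illustrative columns."""

    def __init__(self, src_table: str, src_field: str) -> None:
        self.src_table = src_table
        self.src_field = src_field

    @classmethod
    def header_row(cls) -> List[str]:
        return ["src_table", "src_field"]

    def as_row(self) -> List[Any]:
        # Values appear in the same order as the headings in header_row().
        return [self.src_table, self.src_field]


def rows_to_tsv(rows: List[RowSketch]) -> str:
    """Write one header line, then one line per row, tab-separated."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(RowSketch.header_row())
    for row in rows:
        writer.writerow(row.as_row())
    return buffer.getvalue()
```

Because the order of as_row() matches header_row(), the same pair of calls would serve any row-oriented spreadsheet writer.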

property include: bool

Is this row being included (not omitted)?

property inclusion_values: List[Any]

Returns a list of inclusion values (or an empty string if there are no such values).

This slightly curious output format is used to create a TSV row (see get_tsv()) or to check in a “truthy” way whether we have inclusion values (see crate_anon.anonymise.anonymise.process_table()).

make_dest_datatype_explicit() → None

By default, when autocreating a data dictionary, the dest_datatype field is not populated explicitly, only implicitly. This method makes it explicit by instantiating the value. Primarily for debugging.

property master_pid: bool

Does this field contain the master patient ID (MPID)?

(A typical example of an MPID: “NHS number”.)

matches_fielddef(fielddef: Union[str, List[str]]) → bool

Does our source table/field match the wildcard-based field definition?

Parameters

fielddef – fnmatch-style pattern (e.g. "system_table.*" or "*.nhs_number"), or list of them
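
A minimal sketch of fnmatch-style matching against a "table.field" string, using the standard fnmatch module. The function name and the case-sensitive comparison are assumptions made for illustration; CRATE's own handling of case may differ:

```python
from fnmatch import fnmatchcase
from typing import List, Union


def matches_fielddef_sketch(table: str, field: str,
                            fielddef: Union[str, List[str]]) -> bool:
    """Hypothetical stand-in: does "table.field" match any pattern?"""
    patterns = [fielddef] if isinstance(fielddef, str) else fielddef
    target = f"{table}.{field}"
    # Assumption: case-sensitive comparison (fnmatchcase never folds case).
    return any(fnmatchcase(target, pattern) for pattern in patterns)
```

With the example patterns above, "system_table.*" matches every field of system_table, and "*.nhs_number" matches a field named nhs_number in any table.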

matches_tabledef(tabledef: Union[str, List[str]]) → bool

Does our source table match the wildcard-based table definition?

Parameters

tabledef – fnmatch-style pattern (e.g. "patient_address_table_*"), or list of them

property not_null: bool

Defaults to False. But if the DD row was created by database reflection, and the source field was set NOT NULL, will return True.

property offender_description: str

Get a string used to describe this DDR (in terms of its source/destination fields) if it does something wrong.

property opt_out_info: bool

Does the field contain information about whether the patient wishes to opt out entirely from the anonymised database?

(Whether the contents of the field means “opt out” or “don’t opt out” depends on optout_col_values in the crate_anon.anonymise.config.Config.)

property pk: bool

Is the source field (and the destination field, for that matter) a primary key (PK)?

property primary_pid: bool

Does the source field contain the primary patient ID (PID)?

(A typical example of a PID: “hospital number”.)

remove_scrub_from_alter_methods() → None

Prevent this row from being scrubbed, by removing any “scrub” method from among its alteration methods.

report_dest_annotation() → str

Returns information useful for a researcher looking at the destination database, in simple string form.

  • Therefore: does not include fields like “constant”, “addition_only”, which are primarily for database managers; we’re trying to keep this terse.

  • Relates to DESTINATION fields, e.g. a source PID becomes a destination RID.

property required: bool

Is the field required? That means any of:

  • chosen by the user to be translated into the destination

  • contains vital patient information (scrub-source information)

property required_scrubber: bool

Is this a “required scrubber” field?

A “required scrubber” is a field that must provide at least one non-NULL value for each patient, or the patient won’t get processed. (For example, you might want to omit a patient if you can’t be certain about their surname for anonymisation.)
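
The rule can be sketched as follows, assuming the values found for each required scrubber field have already been gathered per patient. The function name and the dictionary layout are hypothetical and do not reflect how CRATE collects these values:

```python
from typing import Dict, List, Optional


def patient_can_be_processed(
        required_scrubber_values: Dict[str, List[Optional[str]]]) -> bool:
    """
    Hypothetical check: every required scrubber field must supply at least
    one non-NULL (non-None) value for the patient.
    """
    return all(
        any(value is not None for value in values)
        for values in required_scrubber_values.values()
    )
```

For example, a patient whose only surname value is None would be omitted under this rule, while a patient with at least one recorded surname would be processed.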

set_alter_methods_directly(methods: List[crate_anon.anonymise.altermethod.AlterMethod]) → None

For internal use: setting from a list directly.

set_from_dict(valuedict: Dict[str, Any], override_dialect: Optional[sqlalchemy.engine.interfaces.Dialect] = None) → None

Set internal fields from a dict of elements representing a row from the TSV data dictionary file.

Also sets the “loaded from file” indicator, since that is the context in which we use this function.

Parameters
  • valuedict – Dictionary mapping row headings (or attribute names) to values.

  • override_dialect – SQLAlchemy SQL dialect to enforce (e.g. for interpreting textual column types in the source database). By default, the source database’s own dialect is used.

set_from_src_db_info(src_db: str, src_table: str, src_field: str, src_datatype_sqltext: str, src_sqla_coltype: sqlalchemy.sql.type_api.TypeEngine, dbconf: DatabaseSafeConfig, comment: str = None, nullable: bool = True, primary_key: bool = False, table_has_explicit_pk: bool = False, table_has_candidate_pk: bool = False) → None

Set up this DDR from a field in the source database, using options set in the config file. Used to draft a data dictionary. This is the first-draft classification of a given column, which the administrator should review and may then wish to edit.

Parameters
  • src_db – Source database name.

  • src_table – Source table name.

  • src_field – Source field (column) name.

  • src_datatype_sqltext – Source string SQL type, e.g. "VARCHAR(100)".

  • src_sqla_coltype – Source SQLAlchemy column type, e.g. Integer().

  • dbconf – A crate_anon.anonymise.config.DatabaseSafeConfig.

  • comment – Textual comment.

  • nullable – Whether the source can be NULL (True) or is NOT NULL (False).

  • primary_key – Whether the source is marked as a primary key.

  • table_has_explicit_pk – Whether the source database knows of a formal PK field for this table, whether or not this is it.

  • table_has_candidate_pk – Whether the source table contains a field that matches CRATE’s name detection criteria, whether or not this is it.

set_src_sqla_coltype(sqla_coltype: sqlalchemy.sql.type_api.TypeEngine) → None

Sets the SQLAlchemy column type of the source column.

skip_row_by_value(value: Any) → bool

Should we skip this row, because the value is one of the row’s exclusion values, or the row has inclusion values and the value isn’t one of them?

Parameters

value – value to test
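
The two conditions can be sketched as below. The function takes the exclusion and inclusion values as explicit arguments for the sake of a self-contained example; that signature is an assumption, and the real method reads them from the row itself:

```python
from typing import Any, List, Union


def skip_row_sketch(value: Any,
                    exclusion_values: Union[List[Any], str],
                    inclusion_values: Union[List[Any], str]) -> bool:
    """Hypothetical stand-in for the skip decision described above."""
    # An empty list or empty string is falsy, meaning "no such values".
    if exclusion_values and value in exclusion_values:
        return True  # the value is explicitly excluded
    if inclusion_values and value not in inclusion_values:
        return True  # inclusion values exist and this is not one of them
    return False
```

The truthiness tests are what make the "list or empty string" return format of exclusion_values and inclusion_values convenient: both empty forms mean that the corresponding filter is not applied.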

property skip_row_if_extract_text_fails: bool

Should we skip the row if processing the row involves extracting text and that process fails?

property src_db_lowercase: str

Returns the source database name, in lower case.

property src_dialect: sqlalchemy.engine.interfaces.Dialect

Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the source database.

property src_field_lowercase: str

Returns the source field (column) name, in lower case.

property src_flags: str

Returns a string representation of the source flags.

property src_is_textual: bool

Is the source column textual?

property src_signature: str

Returns a signature based on the source database/table/field, in the format db.table.column.
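
For illustration, the two signature formats documented on this page (db.table.column for the source, table.column for the destination) can be sketched as plain string formatting. The function names are hypothetical:

```python
def src_signature_sketch(db: str, table: str, column: str) -> str:
    """Source signature in the format db.table.column."""
    return f"{db}.{table}.{column}"


def dest_signature_sketch(table: str, column: str) -> str:
    """Destination signature in the format table.column."""
    return f"{table}.{column}"
```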

property src_sqla_coltype: sqlalchemy.sql.type_api.TypeEngine

Returns the SQLAlchemy column type of the source column.

property src_table_lowercase: str

Returns the source table name, in lower case.

property src_textlength: Optional[int]

If the source column is textual, returns its length (or None for unlimited). Also returns None if the source is not textual.

property third_party_pid: bool

Does this field contain the PID of a different (e.g. related) patient?

property using_fulltext_index: bool

Should the destination field have a full-text index?

crate_anon.anonymise.ddr.warn_if_identifier_long(table: str, column: str, dest_dialect: Optional[str]) → None

Warns about identifiers that are too long for specific database engines.
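
A minimal sketch of such a check, assuming a lookup table of per-dialect limits. The table, the function name, and the returned list are illustrative assumptions (the documented function returns None); the only limit included is MySQL's 64-character maximum for table and column names:

```python
import logging
from typing import Dict, List, Optional

log = logging.getLogger(__name__)

# Illustrative limits only; other engines would need their own entries.
MAX_IDENTIFIER_LENGTH: Dict[str, int] = {"mysql": 64}


def long_identifiers_sketch(table: str, column: str,
                            dest_dialect: Optional[str]) -> List[str]:
    """
    Hypothetical stand-in: log a warning for each identifier that exceeds
    the limit for the given dialect, and return those identifiers.
    """
    if dest_dialect is None:
        return []  # no dialect known, so nothing to check
    limit = MAX_IDENTIFIER_LENGTH.get(dest_dialect)
    if limit is None:
        return []  # no limit recorded for this dialect in the sketch
    too_long = [name for name in (table, column) if len(name) > limit]
    for name in too_long:
        log.warning("Identifier %r exceeds %d characters for dialect %s",
                    name, limit, dest_dialect)
    return too_long
```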