14.1.12. crate_anon.anonymise.ddr
crate_anon/anonymise/ddr.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Data dictionary rows.
- class crate_anon.anonymise.ddr.DataDictionaryRow(config: Config)
Class representing a single row of a data dictionary (a DDR).
- __init__(config: Config) -> None
Set up basic defaults.
- Parameters:
config – crate_anon.anonymise.config.Config
- property add_src_hash: bool
Should we add a column to the destination that contains a hash of the contents of the whole source row (all fields)?
- property addition_only: bool
May we assume that records can only be added to this table, not deleted?
This is a flag that may be applied to a PK row only.
- property alter_method: str
Return the alter_method string from the working fields.
- property alter_methods: List[AlterMethod]
Return all alteration methods to be applied.
- Returns:
list of crate_anon.anonymise.altermethod.AlterMethod objects
- as_row() -> List[Any]
Returns a data row (a list of values whose order matches header_row()) for use in spreadsheet formats.
- property being_scrubbed: bool
Is the field being scrubbed as it passes from source to destination? (Only true if the field is being included, not omitted.)
- check_prohibited_fieldnames(prohibited_fieldnames: Iterable[str]) -> None
Check that the destination field isn’t a prohibited one.
- Parameters:
prohibited_fieldnames – list of prohibited fieldnames
- Raises:
ValueError –
- check_valid() -> None
Check internal validity and complain if invalid, showing the source of the problem.
- Raises:
AssertionError –
- property constant: bool
Is the source field guaranteed not to change (for a given PK)?
- property contains_patient_info: bool
Does the field contain patient information? That means any of:
primary PID
MPID
scrub-source (sensitive) information
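The "any of" rule above amounts to a logical OR over three flags. A minimal sketch follows; RowFlags is a hypothetical stand-in for the relevant part of DataDictionaryRow, not the class itself, and its attribute names are borrowed from the properties documented on this page.

```python
from dataclasses import dataclass


@dataclass
class RowFlags:
    """Hypothetical stand-in for the relevant DataDictionaryRow flags."""

    primary_pid: bool = False
    master_pid: bool = False
    contains_scrub_src: bool = False

    @property
    def contains_patient_info(self) -> bool:
        # True if the field is a primary PID, an MPID, or scrub-source
        # information; False only if none of the three applies.
        return self.primary_pid or self.master_pid or self.contains_scrub_src


nhs_number_field = RowFlags(master_pid=True)
plain_field = RowFlags()
```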
- property contains_patient_scrub_src_info: bool
Does this field contain scrub-source information about the patient?
- property contains_scrub_src: bool
Does the field contain scrub-source information (sensitive information used for de-identification)?
- property contains_third_party_info: bool
Does this field contain (identifiable) information about a third party, either directly or via a third-party PID?
- property contains_third_party_info_directly: bool
Does this field contain (identifiable) information about a third party, directly?
- property contains_vital_patient_info: bool
Does the field contain vital patient information? That means:
scrub-source (sensitive) information
- property decision: str
Should we include the field in the destination?
- Returns:
"OMIT" or "include".
- property defines_primary_pids: bool
Is this the field – usually one in the entire source database – that defines primary PIDs? Usually this is true for the “ID” column of the master patient table.
- property dest_dialect: Dialect
Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the destination database.
- property dest_dialect_name: str
Returns the SQLAlchemy dialect name for the destination database.
- property dest_should_be_encrypted_pid_type: bool
Should the destination column (if included) be of the encrypted PID/MPID type?
- property dest_signature: str
Returns a signature based on the destination table/field, in the format table.column.
- property dest_sqla_coltype: TypeEngine
Returns the SQLAlchemy column type of the destination column.
Note that this doesn’t include nullable status. An SQLAlchemy column looks like Column(String(50), nullable=False) – the type that we’re fetching here is, for example, the String(50) part. For the full column, see dest_sqla_column below.
- property dest_sqla_column: Column
Returns an SQLAlchemy sqlalchemy.sql.schema.Column for the destination column.
- property exclusion_values: List[Any] | str
Returns a list of exclusion values (or an empty string if there are no such values).
This slightly curious output format is used to create a TSV row (see get_tsv()) or to check in a “truthy” way whether we have exclusion values (see crate_anon.anonymise.anonymise.process_table()).
- property extracting_text_altermethods: List[AlterMethod]
Return all alteration methods that involve text extraction.
- Returns:
list of crate_anon.anonymise.altermethod.AlterMethod objects
- property from_file: bool
Was this DDR loaded from a file (rather than, say, autogenerated from a database)?
- property has_special_alter_method: bool
Is this a field for which the alter method is fixed?
- classmethod header_row() -> List[str]
Returns a header row (a list of headings) for use in spreadsheet formats.
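As an illustration of how header_row() and as_row() pair up, here is a minimal sketch that writes a tab-separated table with the standard csv module. DemoRow is a hypothetical stand-in with an invented three-column layout; the real DataDictionaryRow column set is not reproduced here.

```python
import csv
import io
from typing import Any, List


class DemoRow:
    """Hypothetical stand-in; the real column set is not reproduced here."""

    def __init__(self, src_table: str, src_field: str, decision: str) -> None:
        self.src_table = src_table
        self.src_field = src_field
        self.decision = decision

    @classmethod
    def header_row(cls) -> List[str]:
        # One heading per column, in output order.
        return ["src_table", "src_field", "decision"]

    def as_row(self) -> List[Any]:
        # Values in exactly the same order as header_row().
        return [self.src_table, self.src_field, self.decision]


def rows_to_tsv(rows: List[DemoRow]) -> str:
    """Serialise the heading line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(DemoRow.header_row())
    for row in rows:
        writer.writerow(row.as_row())
    return buffer.getvalue()


tsv_text = rows_to_tsv([DemoRow("patient", "nhs_number", "OMIT")])
```

Keeping the heading list and the value list in one class, in a fixed order, is what lets a spreadsheet writer treat every row uniformly.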
- property include: bool
Is this row being included (not omitted)?
- property inclusion_values: List[Any]
Returns a list of inclusion values (or an empty string if there are no such values).
This slightly curious output format is used to create a TSV row (see get_tsv()) or to check in a “truthy” way whether we have inclusion values (see crate_anon.anonymise.anonymise.process_table()).
- property is_table_comment: bool
Is this a table comment, free of column information?
- make_dest_datatype_explicit() -> None
By default, when autocreating a data dictionary, the dest_datatype field is not populated explicitly, just implicitly. This method makes it explicit by instantiating the value. Primarily for debugging.
- property master_pid: bool
Does this field contain the master patient ID (MPID)?
(A typical example of an MPID: “NHS number”.)
- matches_fielddef(fielddef: str | List[str]) -> bool
Does our source table/field match the wildcard-based field definition?
- Parameters:
fielddef – fnmatch-style pattern (e.g. "system_table.*" or "*.nhs_number"), or list of them
- matches_tabledef(tabledef: str | List[str]) -> bool
Does our source table match the wildcard-based table definition?
- Parameters:
tabledef – fnmatch-style pattern (e.g. "patient_address_table_*"), or list of them
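The wildcard matching used by matches_fielddef() and matches_tabledef() can be pictured with the standard fnmatch module. This is a minimal sketch, not CRATE's implementation: matches_any is a hypothetical helper, and the choice of case-sensitive matching (fnmatchcase) is an assumption.

```python
from fnmatch import fnmatchcase
from typing import List, Union


def matches_any(name: str, patterns: Union[str, List[str]]) -> bool:
    """Does `name` match the pattern, or any pattern in the list?"""
    if isinstance(patterns, str):
        patterns = [patterns]  # accept a single pattern as well as a list
    # fnmatchcase: "*" matches any run of characters, "?" exactly one.
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Field definitions are matched against "table.field" strings...
field_hit = matches_any("patient.nhs_number", ["*.dob", "*.nhs_number"])
# ...and table definitions against the table name alone.
table_hit = matches_any("patient_address_table_2019", "patient_address_table_*")
```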
- property not_null: bool
Defaults to False. But if the DD row was created by database reflection, and the source field was set NOT NULL, will return True.
- property offender_description: str
Get a string used to describe this DDR (in terms of its source/destination fields) if it does something wrong.
- property opt_out_info: bool
Does the field contain information about whether the patient wishes to opt out entirely from the anonymised database?
(Whether the contents of the field means “opt out” or “don’t opt out” depends on optout_col_values in the crate_anon.anonymise.config.Config.)
- property pk: bool
Is the source field (and the destination field, for that matter) a primary key (PK)?
- property primary_pid: bool
Does the source field contain the primary patient ID (PID)?
(A typical example of a PID: “hospital number”.)
- remove_scrub_from_alter_methods() -> None
Prevent this row from being scrubbed, by removing any “scrub” method from among its alteration methods.
- static replace_odd_chars(text: str) -> str
Sanitise a table or field name to only contain printable ASCII characters plus ()/|.
SQLServer and MySQL allow pretty much anything in a table or field name but these could cause problems elsewhere.
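To make the idea concrete, here is a minimal sketch of this kind of sanitising. It is not the CRATE implementation: the function name sanitise_name, the exact permitted range (ASCII 0x20 to 0x7E, which already includes the characters ( ) / and |) and the underscore replacement are all assumptions made for illustration.

```python
def sanitise_name(text: str, replacement: str = "_") -> str:
    """Replace every character outside printable ASCII (assumed rule)."""
    return "".join(
        # Printable ASCII runs from space (0x20) to tilde (0x7E).
        char if 0x20 <= ord(char) <= 0x7E else replacement
        for char in text
    )


cleaned = sanitise_name("c\u00f4te")       # non-ASCII letter is replaced
untouched = sanitise_name("a(b)/c|d")      # already printable ASCII
```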
- report_dest_annotation() -> str
Returns information useful for a researcher looking at the destination database, in simple string form.
Therefore: does not include fields like “constant”, “addition_only”, which are primarily for database managers; we’re trying to keep this terse.
Relates to DESTINATION fields, e.g. a source PID becomes a destination RID.
- property required: bool
Is the field required? That means any of:
chosen by the user to be translated into the destination
contains vital patient information (scrub-source information)
- property required_scrubber: bool
Is this a “required scrubber” field?
A “required scrubber” is a field that must provide at least one non-NULL value for each patient, or the patient won’t get processed. (For example, you might want to omit a patient if you can’t be certain about their surname for anonymisation.)
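The rule can be sketched as a check over the values collected for one patient. The function and argument names below are hypothetical, and the dictionary layout (field name mapped to the list of values found for that patient) is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def patient_can_be_processed(
    values_by_field: Dict[str, List[Optional[str]]],
    required_scrubber_fields: List[str],
) -> bool:
    """Every required-scrubber field needs at least one non-NULL value."""
    return all(
        any(value is not None for value in values_by_field.get(field, []))
        for field in required_scrubber_fields
    )


# The surname is known in one of two records, so this patient qualifies.
known = patient_can_be_processed({"surname": [None, "Smith"]}, ["surname"])
# No non-NULL surname anywhere, so this patient would be omitted.
unknown = patient_can_be_processed({"surname": [None]}, ["surname"])
```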
- set_alter_methods_directly(methods: List[AlterMethod]) -> None
For internal use: setting from a list directly.
- set_as_table_comment(src_db: str, src_table: str, comment: str) -> None
Set up this DDR as a special table-comment row. (Used in data dictionary drafting.)
- Parameters:
src_db – Source database name.
src_table – Source table name.
comment – Textual comment.
- set_from_dict(valuedict: Dict[str, Any], override_dialect: Dialect | None = None) -> None
Set internal fields from a dict of elements representing a row from the TSV data dictionary file.
Also sets the “loaded from file” indicator, since that is the context in which we use this function.
- Parameters:
valuedict – Dictionary mapping row headings (or attribute names) to values.
override_dialect – SQLAlchemy SQL dialect to enforce (e.g. for interpreting textual column types in the source database). By default, the source database’s own dialect is used.
- set_from_src_db_info(src_db: str, src_table: str, src_field: str, src_datatype_sqltext: str, src_sqla_coltype: TypeEngine, dbconf: DatabaseSafeConfig, comment: str = None, nullable: bool = True, primary_key: bool = False, table_has_explicit_pk: bool = False, table_has_candidate_pk: bool = False) -> None
Set up this DDR from a field in the source database, using options set in the config file. Used to draft a data dictionary. This is the first-draft classification of a given column, which the administrator should review and may then wish to edit.
- Parameters:
src_db – Source database name.
src_table – Source table name.
src_field – Source field (column) name.
src_datatype_sqltext – Source string SQL type, e.g. "VARCHAR(100)".
src_sqla_coltype – Source SQLAlchemy column type, e.g. Integer().
dbconf – A crate_anon.anonymise.config.DatabaseSafeConfig.
comment – Textual comment.
nullable – Whether the source can be NULL (True) or is NOT NULL (False).
primary_key – Whether the source is marked as a primary key.
table_has_explicit_pk – Whether the source database knows of a formal PK field for this table, whether or not this is it.
table_has_candidate_pk – Whether the source table contains a field that matches CRATE’s name detection criteria, whether or not this is it.
- set_src_sqla_coltype(sqla_coltype: TypeEngine) -> None
Sets the SQLAlchemy column type of the source column.
- skip_row_by_value(value: Any) -> bool
Should we skip this row, because the value is one of the row’s exclusion values, or the row has inclusion values and the value isn’t one of them?
- Parameters:
value – value to test
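The skip rule described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the method itself: the name should_skip is hypothetical, the inclusion and exclusion values are passed in explicitly rather than read from the row, and "no values" may be an empty list or an empty string.

```python
from typing import Any, List, Union


def should_skip(
    value: Any,
    inclusion_values: Union[List[Any], str],
    exclusion_values: Union[List[Any], str],
) -> bool:
    """Skip if excluded, or if an inclusion list exists and omits value."""
    # The truthiness checks mean an empty list or "" imposes no constraint.
    if inclusion_values and value not in inclusion_values:
        return True
    if exclusion_values and value in exclusion_values:
        return True
    return False


skip_excluded = should_skip("draft", [], ["draft", "deleted"])
skip_not_included = should_skip(3, [1, 2], "")
keep = should_skip(1, [1, 2], "")
```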
- property skip_row_if_extract_text_fails: bool
Should we skip the row if processing the row involves extracting text and that process fails?
- property src_db_lowercase: str
Returns the source database name, in lower case.
- property src_dialect: Dialect
Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the source database.
- property src_field_lowercase: str
Returns the source field (column) name, in lower case.
- property src_flags: str
Returns a string representation of the source flags.
- property src_is_textual: bool
Is the source column textual?
- property src_signature: str
Returns a signature based on the source database/table/field, in the format db.table.column.
- property src_sqla_coltype: TypeEngine
Returns the SQLAlchemy column type of the source column.
- property src_table_lowercase: str
Returns the source table name, in lower case.
- property src_textlength: int | None
If the source column is textual, returns its length (or None for unlimited). Also returns None if the source is not textual.
- property third_party_pid: bool
Does this field contain the PID of a different (e.g. related) patient?
- property using_fulltext_index: bool
Should the destination field have a full-text index?