14.1.4. crate_anon.anonymise.anonymise

crate_anon/anonymise/anonymise.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Anonymise one or more SQL-based source databases into a destination database using a data dictionary.

crate_anon.anonymise.anonymise.anonymise(incremental: bool = False, skipdelete: bool = False, dropremake: bool = False, full_drop_only: bool = False, optout: bool = False, patienttables: bool = False, nonpatienttables: bool = False, index: bool = False, restrict: str = '', restrict_file: str = '', restrict_limits: Tuple[Any, Any] | None = None, restrict_list: List[Any] | None = None, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False, nprocesses: int = 1, process: int = 0, skip_dd_check: bool = False, seed: str = '', chunksize: int = 100000, reportevery: int = 100000, echo: bool = False, debugscrubbers: bool = False, savescrubbers: bool = False) None[source]

Main entry point for anonymisation.

Parameters:
  • incremental – If true: incremental run, rather than full.

  • skipdelete – (For “incremental”.) Skip deletion of rows present in the destination but not the source.

  • dropremake – If true: drop/remake destination tables.

  • full_drop_only – If true: drop destination tables (even opt-out ones) and do nothing else.

  • optout – If true: update opt-out list.

  • patienttables – If true: process patient tables only (rather than all tables).

  • nonpatienttables – If true: process non-patient tables only (rather than all tables).

  • index – If true: create indexes only.

  • restrict – Restrict to certain patients? Specify a field name, or pid to restrict by patient IDs.

  • restrict_file – (For “restrict”.) Filename for permitted values.

  • restrict_limits – (For “restrict”.) Tuple of lower and upper limits to apply to the field.

  • restrict_list – (For “restrict”.) List of permitted values.

  • free_text_limit – Filter out all free text over the specified length.

  • exclude_scrubbed_fields – Exclude all text fields which are being scrubbed.

  • nprocesses – Number of processing being run (of which this is one), for work allocation.

  • process – Number of this process (from 0 to nprocesses - 1), for work allocation.

  • skip_dd_check – If true: skip data dictionary validity check. (Useful in multiprocessing contexts when another process has already done this.)

  • seed – Seed for random number generator (for TRID generation). Blank for the default of system time.

  • chunksize – Number of records copied in a chunk when copying PKs from one database to another.

  • reportevery – Report insert progress every n rows in verbose mode.

  • echo – Echo SQL?

  • debugscrubbers – Report sensitive scrubbing information, for debugging

  • savescrubbers – Saves sensitive scrubbing information in admin database, for debugging

crate_anon.anonymise.anonymise.commit_admindb() None[source]

Execute a COMMIT on the admin database, which is using ORM sessions.

crate_anon.anonymise.anonymise.commit_destdb() None[source]

Execute a COMMIT on the destination database, and reset row counts.

crate_anon.anonymise.anonymise.count_rows(dbname: str, sourcetable: str, pid: int | str | None = None) int[source]

Count the number of rows in a table for a given PID.

Parameters:
  • dbname – name (as per the data dictionary) of the source database

  • sourcetable – name of the source table

  • pid – patient ID (PID)

Returns:

the number of records

crate_anon.anonymise.anonymise.create_indexes(tasknum: int = 0, ntasks: int = 1) None[source]

Create indexes for the destination tables.

Parameters:
  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

crate_anon.anonymise.anonymise.delete_dest_rows_with_no_src_row(srcdbname: str, src_table: str, report_every: int = 100000, chunksize: int = 100000) None[source]

For a given source database/table, delete any rows in the corresponding destination table where there is no corresponding source row.

  • Can’t do this in a single SQL command, since the engine can’t necessarily see both databases.

  • Can’t do this in a multiprocess way, because we’re trying to do a DELETE WHERE NOT IN.

  • However, we can get stupidly long query lists if we try to SELECT all the values and use a DELETE FROM x WHERE y NOT IN (v1, v2, v3, ...) query. This crashes the MySQL connection, etc.

  • Therefore, we need a temporary table in the destination.

Parameters:
  • srcdbname – name (as per the data dictionary) of the source database

  • src_table – name of the source table

  • report_every – report to the Python log every n records

  • chunksize – insert records every n records

crate_anon.anonymise.anonymise.drop_remake(incremental: bool = False, skipdelete: bool = False, full_drop_only: bool = False) None[source]

Drop and rebuild (a) mapping table, (b) destination tables.

Parameters:
  • incremental – Doesn’t drop tables; just deletes destination information where source information no longer exists.

  • skipdelete – For incremental updates, skip deletion of rows present in the destination but not the source

  • full_drop_only – Performs a full drop (even opt-out tables) and does nothing else. Incompatible with incremental.

crate_anon.anonymise.anonymise.estimate_count_patients() int[source]

Estimate the number of patients in the source database.

We can’t easily and quickly get the total number of patients, because they may be defined in multiple tables across multiple databases. We shouldn’t fetch them all into Python in case there are billions, and it’s a waste of effort to stash them in a temporary table and count unique rows, because this is all only for a progress indicator. So we approximate.

crate_anon.anonymise.anonymise.fieldname_is_pid(field: str) bool[source]

Checks if a field name is the literal 'pid' or, if in the form 'database.table.field', is the name of a primary PID field in the source database. If either of those conditions is met, return True; otherwise, False.

crate_anon.anonymise.anonymise.gen_index_row_sets_by_table(tasknum: int = 0, ntasks: int = 1) Generator[Tuple[str, List[DataDictionaryRow]], None, None][source]

Generate table, list_of_dd_rows_for_indexed_fields tuples for all tables requiring indexing.

Parameters:
  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

Yields:

tupletable, list_of_dd_rows_for_indexed_fields for each table as above

crate_anon.anonymise.anonymise.gen_nonpatient_tables_with_int_pk() Generator[Tuple[str, str, str], None, None][source]

Generate source_db_name, source_table, pk_name tuples for all tables that

  1. don’t contain patient information and

  2. do have an integer PK.

crate_anon.anonymise.anonymise.gen_nonpatient_tables_without_int_pk(tasknum: int = 0, ntasks: int = 1) Generator[Tuple[str, str], None, None][source]

Generate (source db name, source table) tuples for all tables that

  1. don’t contain patient information and

  2. don’t have an integer PK.

Parameters:
  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

Yields:

tuplesource_db_name, source_table for each table as above

crate_anon.anonymise.anonymise.gen_opt_out_pids_from_database(mpid: bool = False) Generator[Any, None, None][source]

Generate opt-out PIDs (or MPIDs) from a database.

Parameters:

mpid – generate MPIDs, not PIDs

Yields:

each PID (or MPID)

crate_anon.anonymise.anonymise.gen_opt_out_pids_from_file(mpid: bool = False) Generator[int | str, None, None][source]

Generate opt-out PIDs (or MPIDs) from a file.

Parameters:

mpid – generate MPIDs, not PIDs (and therefore use config.optout_mpid_filenames, not config.optout_pid_filenames, as the set of filenames to read)

Yields:

each PID (or MPID), which will be either str or int depending on the value of config.mpidtype_is_integer or config.pidtype_is_integer.

crate_anon.anonymise.anonymise.gen_optout_rids() Generator[str, None, None][source]

Generates RIDs for patients who opt out (which we can use to wipe their information from the destination database).

Yields:

string – research ID (RID)

crate_anon.anonymise.anonymise.gen_patient_ids(tasknum: int = 0, ntasks: int = 1, specified_pids: List[str | int] | None = None) Generator[int | str, None, None][source]

Generate patient IDs.

Parameters:
  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

  • specified_pids – optional list of PIDs to restrict ourselves to

Yields:

integer or string patient IDs (PIDs)

  • Assigns work to threads/processes, via the simple expedient of processing only those patient ID numbers where patientnum % ntasks == tasknum (for integers), or an equivalent method for string PIDs.

crate_anon.anonymise.anonymise.gen_pks(srcdbname: str, tablename: str, pkname: str) Generator[int, None, None][source]

Generate PK values from a table.

Parameters:
  • srcdbname – name (as per the data dictionary) of the database

  • tablename – name of the table

  • pkname – name of the PK column

Yields:

int – each primary key

crate_anon.anonymise.anonymise.gen_rows(dbname: str, sourcetable: str, sourcefields: Iterable[str], pid: int | str | None = None, intpkname: str | None = None, tasknum: int = 0, ntasks: int = 1, debuglimit: int = 0) Generator[List[Any], None, None][source]

Generates rows from a source table: - … each row being a list of values - … each value corresponding to a field in sourcefields. - … optionally restricted to a single patient

If the table has a PK and we’re operating in a multitasking situation, generate just the rows for this task (thread/process).

Parameters:
  • dbname – name (as per the data dictionary) of the source database

  • sourcetable – name of the source table

  • sourcefields – names of fields in the source table

  • pid – patient ID (PID)

  • intpkname – name of the integer PK column in the source table, if one exists

  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

  • debuglimit – if specified, the maximum number of rows to process

Yields:

lists, each representing one row and containing values for each of the sourcefields

crate_anon.anonymise.anonymise.get_pid_subset_from_field(field: str, values_to_find: List[Any]) List[Any][source]

Takes a field name and elements from that field (values present in that field) and queries the database to find the PIDs associated with these values.

Parameters:
  • field – field name in the format database.table.field

  • values_to_find – values to look for

Returns:

list of PIDs

For example, suppose you have a source table called mydb.mystudyinfo like this:

pid (INTEGER)

include_in_extract (VARCHAR)

1

no

2

0

3

yes

4

1

5

definitely

then a call like

get_subset_from_field("mydb.mystudyinfo.include_in_extract",
                      ["yes", "1", "definitely"])

should return [3, 4, 5], assuming that pid has been correctly marked as the PID column in the data dictionary.

crate_anon.anonymise.anonymise.get_pids_from_field_limits(field: str, low: int, high: int) List[Any][source]

Takes a field name and a lower/upper limit, and returns a list of associated PIDs.

Parameters:
  • field – a fieldname of the format database.table.field, or the literal pid

  • low – lower (inclusive) limit

  • high – upper (inclusive) limit

The range is used as follows.

  • If field is the string literal 'pid', or is the name of a source database field containing PIDs, then fetch PIDs in the specified range and check that they are valid.

  • If it’s another kind of field, look for rows where this field is in the specified range, and return the value of the PID column from the same row of the table. (See get_pids_query_field_limits().)

Returns:

list of PIDs

crate_anon.anonymise.anonymise.get_pids_from_file(field: str, filename: str) List[str][source]

Takes a field name, and a filename of values of that field, and returns a list of PIDs associated with them.

Parameters:
  • field – a fieldname of the format database.table.field, or the literal pid

  • filename

    A file containing words that represent values to look for, as follows.

    • If field is the string literal 'pid', or is the name of a source database field containing PIDs, then the values in the file should be PIDs. We check that they are valid.

    • If it’s another kind of field, look for values (from the file) in this field, and return the value of the PID column from the same row of the table. (See get_pid_subset_from_field().)

Returns:

list of PIDs

crate_anon.anonymise.anonymise.get_pids_from_limits(low: int, high: int) List[Any][source]

Finds PIDs from the source database that are between low and high inclusive.

Parameters:
  • low – lower (inclusive) limit

  • high – upper (inclusive) limit

Returns:

list of PIDs in this range

crate_anon.anonymise.anonymise.get_pids_from_list(field: str, values: List[str]) List[str][source]

Takes a field name and a list of values, and returns a list of PIDs associated with them.

Parameters:
  • field – a fieldname of the format database.table.field, or the literal pid

  • values

    Values to look for, as follows.

    • If field is the string literal 'pid', or is the name of a source database field containing PIDs, then the values in the should be PIDs. We check that they are valid.

    • If it’s another kind of field, look for the values in this field, and return the value of the PID column from the same row of the table. (See get_pid_subset_from_field().)

Returns:

list of PIDs

crate_anon.anonymise.anonymise.get_pids_query_field_limits(field: str, low: int, high: int) List[Any][source]

Takes a field name and queries the database to find the PIDs associated with records where field is in the range low to high inclusive.

Parameters:
  • field – field name in the format database.table.field

  • low – lower (inclusive) limit

  • high – upper (inclusive) limit

Returns:

list of PIDs

For example, suppose you have a source table called mydb.myoptouts like this:

pid (INTEGER)

opt_out_level (INTEGER)

1

0

2

1

3

2

4

3

5

4

then a call like

get_subset_from_field("mydb.myoptouts.opt_out_level", 2, 3)

should return [3, 4], assuming that pid has been correctly marked as the PID column in the data dictionary.

crate_anon.anonymise.anonymise.get_valid_pid_subset(given_pids: List[str]) List[str][source]

Takes a list of PIDs and returns those in the list which are also in the database.

crate_anon.anonymise.anonymise.identical_record_exists_by_hash(dest_table: str, pkfield: str, pkvalue: int, hashvalue: str) bool[source]

For a given PK in a given destination table, is there a record with the specified value for its source hash?

Parameters:
  • dest_table – name of the destination table

  • pkfield – name of the primary key (PK) column in the destination table

  • pkvalue – integer value of the PK in the destination table

  • hashvalue – hash of the source

crate_anon.anonymise.anonymise.identical_record_exists_by_pk(dest_table: str, pkfield: str, pkvalue: int) bool[source]

For a given PK in a given destination table, does a record exist?

Parameters:
  • dest_table – name of the destination table

  • pkfield – name of the primary key (PK) column in the destination table

  • pkvalue – integer value of the PK in the destination table

crate_anon.anonymise.anonymise.opting_out_mpid(mpid: int | str) bool[source]

Does this patient wish to opt out?

Parameters:

mpid – master patient identifier (MPID)

crate_anon.anonymise.anonymise.opting_out_pid(pid: int | str) bool[source]

Does this patient wish to opt out?

Parameters:

pid – patient identifier (PID)

crate_anon.anonymise.anonymise.patient_processing_fn(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, specified_pids: List[int] | None = None, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None[source]

Main function to anonymise patient data.

  • Iterate through patient IDs;

  • build the scrubber for each patient;

  • process source data for that patient, scrubbing it;

  • insert the patient into the mapping table in the admin database.

Parameters:
  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

  • incremental – perform an incremental update, rather than a full run?

  • specified_pids – if specified, restrict to specific PIDs

  • free_text_limit – as per process_table()

  • exclude_scrubbed_fields – as per process_table()

crate_anon.anonymise.anonymise.process_nonpatient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None[source]

Copies all non-patient tables.

  • If they have an integer PK, the work may be parallelized.

  • If not, whole tables are assigned to different processes in parallel mode.

Parameters:
  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

  • incremental – perform an incremental update, rather than a full run?

  • free_text_limit – as per process_table()

  • exclude_scrubbed_fields – as per process_table()

crate_anon.anonymise.anonymise.process_patient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, specified_pids: List[int] | None = None, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None[source]

Process all patient tables, optionally in a parallel-processing fashion.

All the work is done via patient_processing_fn().

Parameters:
  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

  • incremental – perform an incremental update, rather than a full run?

  • specified_pids – if specified, restrict to specific PIDs

  • free_text_limit – as per process_table()

  • exclude_scrubbed_fields – as per process_table()

crate_anon.anonymise.anonymise.process_table(sourcedbname: str, sourcetable: str, patient: Patient | None = None, incremental: bool = False, intpkname: str | None = None, tasknum: int = 0, ntasks: int = 1, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None[source]

Process a table. This can either be a patient table (in which case the patient’s scrubber is applied and only rows for that patient are processed) or not (in which case the table is just copied).

Parameters:
  • sourcedbname – name (as per the data dictionary) of the source database

  • sourcetable – name of the source table

  • patientcrate_anon.anonymise.patient.Patient object, or None for non-patient tables

  • incremental – perform an incremental update, rather than a full run?

  • intpkname – name of the integer PK column in the source table

  • tasknum – task number of this process (for dividing up work)

  • ntasks – total number of processes (for dividing up work)

  • free_text_limit – If specified, any text field longer than this will be excluded

  • exclude_scrubbed_fields – Exclude all text fields which are being scrubbed.

crate_anon.anonymise.anonymise.remove_invalid_bools_from_optout_values(optout_colname: str, values: List[Any]) List[Any][source]

Called when the column that defines opt-outs is of boolean type. Removes any values from values that is not a valid boolean value (or None/NULL), announcing it, and return the values that pass the test.

crate_anon.anonymise.anonymise.setup_opt_out(incremental: bool = False) None[source]
Parameters:

incremental – after adding opt-out patients, delete any data for them found in the destination database. (Unnecessary for “full” rather than “incremental” runs, since “full” runs delete all the destination tables and start again.)

crate_anon.anonymise.anonymise.validate_optouts()[source]

Check that our opt-out definitions are valid, or raise ValueError.

crate_anon.anonymise.anonymise.wipe_and_recreate_destination_db(incremental: bool = False, full_drop_only: bool = False) None[source]

Drop and recreate all destination tables (as specified in the DD) in the destination database.

Parameters:
  • incremental – Don’t drop the tables first, just create them if they don’t exist.

  • full_drop_only – Drop everything, but don’t rebuild. Incompatible with incremental.

crate_anon.anonymise.anonymise.wipe_destination_data_for_opt_out_patients(report_every: int = 1000, chunksize: int = 10000) None[source]

Delete any data from patients that have opted out (after their data was processed on a previous occasion).

(Slightly complicated by the fact that the destination database can’t necessarily ‘see’ the mapping database, so we need to cache the RID keys in the destination database temporarily.)

Parameters:
  • report_every – report logging information every n records

  • chunksize – insert records every n records