14.1.4. crate_anon.anonymise.anonymise
crate_anon/anonymise/anonymise.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Anonymise one or more SQL-based source databases into a destination database using a data dictionary.
- crate_anon.anonymise.anonymise.anonymise(incremental: bool = False, skipdelete: bool = False, dropremake: bool = False, full_drop_only: bool = False, optout: bool = False, patienttables: bool = False, nonpatienttables: bool = False, index: bool = False, restrict: str = '', restrict_file: str = '', restrict_limits: Tuple[Any, Any] | None = None, restrict_list: List[Any] | None = None, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False, nprocesses: int = 1, process: int = 0, skip_dd_check: bool = False, seed: str = '', chunksize: int = 100000, reportevery: int = 100000, echo: bool = False, debugscrubbers: bool = False, savescrubbers: bool = False) None [source]
Main entry point for anonymisation.
- Parameters:
incremental – If true: incremental run, rather than full.
skipdelete – (For “incremental”.) Skip deletion of rows present in the destination but not the source.
dropremake – If true: drop/remake destination tables.
full_drop_only – If true: drop destination tables (even opt-out ones) and do nothing else.
optout – If true: update opt-out list.
patienttables – If true: process patient tables only (rather than all tables).
nonpatienttables – If true: process non-patient tables only (rather than all tables).
index – If true: create indexes only.
restrict – Restrict to certain patients? Specify a field name, or
pid
to restrict by patient IDs.restrict_file – (For “restrict”.) Filename for permitted values.
restrict_limits – (For “restrict”.) Tuple of lower and upper limits to apply to the field.
restrict_list – (For “restrict”.) List of permitted values.
free_text_limit – Filter out all free text over the specified length.
exclude_scrubbed_fields – Exclude all text fields which are being scrubbed.
nprocesses – Number of processing being run (of which this is one), for work allocation.
process – Number of this process (from 0 to nprocesses - 1), for work allocation.
skip_dd_check – If true: skip data dictionary validity check. (Useful in multiprocessing contexts when another process has already done this.)
seed – Seed for random number generator (for TRID generation). Blank for the default of system time.
chunksize – Number of records copied in a chunk when copying PKs from one database to another.
reportevery – Report insert progress every n rows in verbose mode.
echo – Echo SQL?
debugscrubbers – Report sensitive scrubbing information, for debugging
savescrubbers – Saves sensitive scrubbing information in admin database, for debugging
- crate_anon.anonymise.anonymise.commit_admindb() None [source]
Execute a
COMMIT
on the admin database, which is using ORM sessions.
- crate_anon.anonymise.anonymise.commit_destdb() None [source]
Execute a
COMMIT
on the destination database, and reset row counts.
- crate_anon.anonymise.anonymise.count_rows(dbname: str, sourcetable: str, pid: int | str | None = None) int [source]
Count the number of rows in a table for a given PID.
- Parameters:
dbname – name (as per the data dictionary) of the source database
sourcetable – name of the source table
pid – patient ID (PID)
- Returns:
the number of records
- crate_anon.anonymise.anonymise.create_indexes(tasknum: int = 0, ntasks: int = 1) None [source]
Create indexes for the destination tables.
- Parameters:
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
- crate_anon.anonymise.anonymise.delete_dest_rows_with_no_src_row(srcdbname: str, src_table: str, report_every: int = 100000, chunksize: int = 100000) None [source]
For a given source database/table, delete any rows in the corresponding destination table where there is no corresponding source row.
Can’t do this in a single SQL command, since the engine can’t necessarily see both databases.
Can’t do this in a multiprocess way, because we’re trying to do a
DELETE WHERE NOT IN
.However, we can get stupidly long query lists if we try to
SELECT
all the values and use aDELETE FROM x WHERE y NOT IN (v1, v2, v3, ...)
query. This crashes the MySQL connection, etc.Therefore, we need a temporary table in the destination.
- Parameters:
srcdbname – name (as per the data dictionary) of the source database
src_table – name of the source table
report_every – report to the Python log every n records
chunksize – insert records every n records
- crate_anon.anonymise.anonymise.drop_remake(incremental: bool = False, skipdelete: bool = False, full_drop_only: bool = False) None [source]
Drop and rebuild (a) mapping table, (b) destination tables.
- Parameters:
incremental – Doesn’t drop tables; just deletes destination information where source information no longer exists.
skipdelete – For incremental updates, skip deletion of rows present in the destination but not the source
full_drop_only – Performs a full drop (even opt-out tables) and does nothing else. Incompatible with
incremental
.
- crate_anon.anonymise.anonymise.estimate_count_patients() int [source]
Estimate the number of patients in the source database.
We can’t easily and quickly get the total number of patients, because they may be defined in multiple tables across multiple databases. We shouldn’t fetch them all into Python in case there are billions, and it’s a waste of effort to stash them in a temporary table and count unique rows, because this is all only for a progress indicator. So we approximate.
- crate_anon.anonymise.anonymise.fieldname_is_pid(field: str) bool [source]
Checks if a field name is the literal
'pid'
or, if in the form'database.table.field'
, is the name of a primary PID field in the source database. If either of those conditions is met, returnTrue
; otherwise,False
.
- crate_anon.anonymise.anonymise.gen_index_row_sets_by_table(tasknum: int = 0, ntasks: int = 1) Generator[Tuple[str, List[DataDictionaryRow]], None, None] [source]
Generate
table, list_of_dd_rows_for_indexed_fields
tuples for all tables requiring indexing.- Parameters:
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
- Yields:
tuple –
table, list_of_dd_rows_for_indexed_fields
for each table as above
- crate_anon.anonymise.anonymise.gen_nonpatient_tables_with_int_pk() Generator[Tuple[str, str, str], None, None] [source]
Generate
source_db_name, source_table, pk_name
tuples for all tables thatdon’t contain patient information and
do have an integer PK.
- crate_anon.anonymise.anonymise.gen_nonpatient_tables_without_int_pk(tasknum: int = 0, ntasks: int = 1) Generator[Tuple[str, str], None, None] [source]
Generate
(source db name, source table)
tuples for all tables thatdon’t contain patient information and
don’t have an integer PK.
- Parameters:
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
- Yields:
tuple –
source_db_name, source_table
for each table as above
- crate_anon.anonymise.anonymise.gen_opt_out_pids_from_database(mpid: bool = False) Generator[Any, None, None] [source]
Generate opt-out PIDs (or MPIDs) from a database.
- Parameters:
mpid – generate MPIDs, not PIDs
- Yields:
each PID (or MPID)
- crate_anon.anonymise.anonymise.gen_opt_out_pids_from_file(mpid: bool = False) Generator[int | str, None, None] [source]
Generate opt-out PIDs (or MPIDs) from a file.
- Parameters:
mpid – generate MPIDs, not PIDs (and therefore use
config.optout_mpid_filenames
, notconfig.optout_pid_filenames
, as the set of filenames to read)- Yields:
each PID (or MPID), which will be either
str
orint
depending on the value ofconfig.mpidtype_is_integer
orconfig.pidtype_is_integer
.
- crate_anon.anonymise.anonymise.gen_optout_rids() Generator[str, None, None] [source]
Generates RIDs for patients who opt out (which we can use to wipe their information from the destination database).
- Yields:
string – research ID (RID)
- crate_anon.anonymise.anonymise.gen_patient_ids(tasknum: int = 0, ntasks: int = 1, specified_pids: List[str | int] | None = None) Generator[int | str, None, None] [source]
Generate patient IDs.
- Parameters:
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
specified_pids – optional list of PIDs to restrict ourselves to
- Yields:
integer or string patient IDs (PIDs)
Assigns work to threads/processes, via the simple expedient of processing only those patient ID numbers where
patientnum % ntasks == tasknum
(for integers), or an equivalent method for string PIDs.
- crate_anon.anonymise.anonymise.gen_pks(srcdbname: str, tablename: str, pkname: str) Generator[int, None, None] [source]
Generate PK values from a table.
- Parameters:
srcdbname – name (as per the data dictionary) of the database
tablename – name of the table
pkname – name of the PK column
- Yields:
int – each primary key
- crate_anon.anonymise.anonymise.gen_rows(dbname: str, sourcetable: str, sourcefields: Iterable[str], pid: int | str | None = None, intpkname: str | None = None, tasknum: int = 0, ntasks: int = 1, debuglimit: int = 0) Generator[List[Any], None, None] [source]
Generates rows from a source table: - … each row being a list of values - … each value corresponding to a field in sourcefields. - … optionally restricted to a single patient
If the table has a PK and we’re operating in a multitasking situation, generate just the rows for this task (thread/process).
- Parameters:
dbname – name (as per the data dictionary) of the source database
sourcetable – name of the source table
sourcefields – names of fields in the source table
pid – patient ID (PID)
intpkname – name of the integer PK column in the source table, if one exists
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
debuglimit – if specified, the maximum number of rows to process
- Yields:
lists, each representing one row and containing values for each of the
sourcefields
- crate_anon.anonymise.anonymise.get_pid_subset_from_field(field: str, values_to_find: List[Any]) List[Any] [source]
Takes a field name and elements from that field (values present in that field) and queries the database to find the PIDs associated with these values.
- Parameters:
field – field name in the format
database.table.field
values_to_find – values to look for
- Returns:
list of PIDs
For example, suppose you have a source table called
mydb.mystudyinfo
like this:pid (INTEGER)
include_in_extract (VARCHAR)
1
no
2
0
3
yes
4
1
5
definitely
then a call like
get_subset_from_field("mydb.mystudyinfo.include_in_extract", ["yes", "1", "definitely"])
should return
[3, 4, 5]
, assuming thatpid
has been correctly marked as the PID column in the data dictionary.
- crate_anon.anonymise.anonymise.get_pids_from_field_limits(field: str, low: int, high: int) List[Any] [source]
Takes a field name and a lower/upper limit, and returns a list of associated PIDs.
- Parameters:
field – a fieldname of the format
database.table.field
, or the literalpid
low – lower (inclusive) limit
high – upper (inclusive) limit
The range is used as follows.
If
field
is the string literal'pid'
, or is the name of a source database field containing PIDs, then fetch PIDs in the specified range and check that they are valid.If it’s another kind of field, look for rows where this field is in the specified range, and return the value of the PID column from the same row of the table. (See
get_pids_query_field_limits()
.)
- Returns:
list of PIDs
- crate_anon.anonymise.anonymise.get_pids_from_file(field: str, filename: str) List[str] [source]
Takes a field name, and a filename of values of that field, and returns a list of PIDs associated with them.
- Parameters:
field – a fieldname of the format
database.table.field
, or the literalpid
filename –
A file containing words that represent values to look for, as follows.
If
field
is the string literal'pid'
, or is the name of a source database field containing PIDs, then the values in the file should be PIDs. We check that they are valid.If it’s another kind of field, look for values (from the file) in this field, and return the value of the PID column from the same row of the table. (See
get_pid_subset_from_field()
.)
- Returns:
list of PIDs
- crate_anon.anonymise.anonymise.get_pids_from_limits(low: int, high: int) List[Any] [source]
Finds PIDs from the source database that are between
low
andhigh
inclusive.The SQL
BETWEEN
operator is inclusive (https://www.w3schools.com/sql/sql_between.asp).
- Parameters:
low – lower (inclusive) limit
high – upper (inclusive) limit
- Returns:
list of PIDs in this range
- crate_anon.anonymise.anonymise.get_pids_from_list(field: str, values: List[str]) List[str] [source]
Takes a field name and a list of values, and returns a list of PIDs associated with them.
- Parameters:
field – a fieldname of the format
database.table.field
, or the literalpid
values –
Values to look for, as follows.
If
field
is the string literal'pid'
, or is the name of a source database field containing PIDs, then the values in the should be PIDs. We check that they are valid.If it’s another kind of field, look for the values in this field, and return the value of the PID column from the same row of the table. (See
get_pid_subset_from_field()
.)
- Returns:
list of PIDs
- crate_anon.anonymise.anonymise.get_pids_query_field_limits(field: str, low: int, high: int) List[Any] [source]
Takes a field name and queries the database to find the PIDs associated with records where
field
is in the rangelow
tohigh
inclusive.- Parameters:
field – field name in the format
database.table.field
low – lower (inclusive) limit
high – upper (inclusive) limit
- Returns:
list of PIDs
For example, suppose you have a source table called
mydb.myoptouts
like this:pid (INTEGER)
opt_out_level (INTEGER)
1
0
2
1
3
2
4
3
5
4
then a call like
get_subset_from_field("mydb.myoptouts.opt_out_level", 2, 3)
should return
[3, 4]
, assuming thatpid
has been correctly marked as the PID column in the data dictionary.
- crate_anon.anonymise.anonymise.get_valid_pid_subset(given_pids: List[str]) List[str] [source]
Takes a list of PIDs and returns those in the list which are also in the database.
- crate_anon.anonymise.anonymise.identical_record_exists_by_hash(dest_table: str, pkfield: str, pkvalue: int, hashvalue: str) bool [source]
For a given PK in a given destination table, is there a record with the specified value for its source hash?
- Parameters:
dest_table – name of the destination table
pkfield – name of the primary key (PK) column in the destination table
pkvalue – integer value of the PK in the destination table
hashvalue – hash of the source
- crate_anon.anonymise.anonymise.identical_record_exists_by_pk(dest_table: str, pkfield: str, pkvalue: int) bool [source]
For a given PK in a given destination table, does a record exist?
- Parameters:
dest_table – name of the destination table
pkfield – name of the primary key (PK) column in the destination table
pkvalue – integer value of the PK in the destination table
- crate_anon.anonymise.anonymise.opting_out_mpid(mpid: int | str) bool [source]
Does this patient wish to opt out?
- Parameters:
mpid – master patient identifier (MPID)
- crate_anon.anonymise.anonymise.opting_out_pid(pid: int | str) bool [source]
Does this patient wish to opt out?
- Parameters:
pid – patient identifier (PID)
- crate_anon.anonymise.anonymise.patient_processing_fn(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, specified_pids: List[int] | None = None, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None [source]
Main function to anonymise patient data.
Iterate through patient IDs;
build the scrubber for each patient;
process source data for that patient, scrubbing it;
insert the patient into the mapping table in the admin database.
- Parameters:
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
incremental – perform an incremental update, rather than a full run?
specified_pids – if specified, restrict to specific PIDs
free_text_limit – as per
process_table()
exclude_scrubbed_fields – as per
process_table()
- crate_anon.anonymise.anonymise.process_nonpatient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None [source]
Copies all non-patient tables.
If they have an integer PK, the work may be parallelized.
If not, whole tables are assigned to different processes in parallel mode.
- Parameters:
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
incremental – perform an incremental update, rather than a full run?
free_text_limit – as per
process_table()
exclude_scrubbed_fields – as per
process_table()
- crate_anon.anonymise.anonymise.process_patient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, specified_pids: List[int] | None = None, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None [source]
Process all patient tables, optionally in a parallel-processing fashion.
All the work is done via
patient_processing_fn()
.- Parameters:
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
incremental – perform an incremental update, rather than a full run?
specified_pids – if specified, restrict to specific PIDs
free_text_limit – as per
process_table()
exclude_scrubbed_fields – as per
process_table()
- crate_anon.anonymise.anonymise.process_table(sourcedbname: str, sourcetable: str, patient: Patient | None = None, incremental: bool = False, intpkname: str | None = None, tasknum: int = 0, ntasks: int = 1, free_text_limit: int | None = None, exclude_scrubbed_fields: bool = False) None [source]
Process a table. This can either be a patient table (in which case the patient’s scrubber is applied and only rows for that patient are processed) or not (in which case the table is just copied).
- Parameters:
sourcedbname – name (as per the data dictionary) of the source database
sourcetable – name of the source table
patient –
crate_anon.anonymise.patient.Patient
object, orNone
for non-patient tablesincremental – perform an incremental update, rather than a full run?
intpkname – name of the integer PK column in the source table
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
free_text_limit – If specified, any text field longer than this will be excluded
exclude_scrubbed_fields – Exclude all text fields which are being scrubbed.
- crate_anon.anonymise.anonymise.remove_invalid_bools_from_optout_values(optout_colname: str, values: List[Any]) List[Any] [source]
Called when the column that defines opt-outs is of boolean type. Removes any values from
values
that is not a valid boolean value (or None/NULL), announcing it, and return the values that pass the test.
- crate_anon.anonymise.anonymise.setup_opt_out(incremental: bool = False) None [source]
Hunts far and wide through its sources for PID/MPID values of patients who wish to opt out.
Adds them to the admin tables for
crate_anon.anonymise.models.OptOutPid
andcrate_anon.anonymise.models.OptOutMpid
.
- Parameters:
incremental – after adding opt-out patients, delete any data for them found in the destination database. (Unnecessary for “full” rather than “incremental” runs, since “full” runs delete all the destination tables and start again.)
- crate_anon.anonymise.anonymise.validate_optouts()[source]
Check that our opt-out definitions are valid, or raise ValueError.
- crate_anon.anonymise.anonymise.wipe_and_recreate_destination_db(incremental: bool = False, full_drop_only: bool = False) None [source]
Drop and recreate all destination tables (as specified in the DD) in the destination database.
- Parameters:
incremental – Don’t drop the tables first, just create them if they don’t exist.
full_drop_only – Drop everything, but don’t rebuild. Incompatible with
incremental
.
- crate_anon.anonymise.anonymise.wipe_destination_data_for_opt_out_patients(report_every: int = 1000, chunksize: int = 10000) None [source]
Delete any data from patients that have opted out (after their data was processed on a previous occasion).
(Slightly complicated by the fact that the destination database can’t necessarily ‘see’ the mapping database, so we need to cache the RID keys in the destination database temporarily.)
- Parameters:
report_every – report logging information every n records
chunksize – insert records every n records