14.1.4. crate_anon.anonymise.anonymise
crate_anon/anonymise/anonymise.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Anonymise one or more SQL-based source databases into a destination database using a data dictionary.
- crate_anon.anonymise.anonymise.anonymise(incremental: bool = False, skipdelete: bool = False, dropremake: bool = False, full_drop_only: bool = False, optout: bool = False, patienttables: bool = False, nonpatienttables: bool = False, index: bool = False, restrict: str = '', restrict_file: str = '', restrict_limits: Optional[Tuple[Any, Any]] = None, restrict_list: Optional[List[Any]] = None, free_text_limit: Optional[int] = None, exclude_scrubbed_fields: bool = False, nprocesses: int = 1, process: int = 0, skip_dd_check: bool = False, seed: str = '', chunksize: int = 100000, reportevery: int = 100000, echo: bool = False, debugscrubbers: bool = False, savescrubbers: bool = False) None [source]
Main entry point for anonymisation.
- Parameters
incremental – If true: incremental run, rather than full.
skipdelete – (For “incremental”.) Skip deletion of rows present in the destination but not the source.
dropremake – If true: drop/remake destination tables.
full_drop_only – If true: drop destination tables (even opt-out ones) and do nothing else.
optout – If true: update opt-out list.
patienttables – If true: process patient tables only (rather than all tables).
nonpatienttables – If true: process non-patient tables only (rather than all tables).
index – If true: create indexes only.
restrict – Restrict to certain patients? Specify a field name, or
pid
to restrict by patient IDs.restrict_file – (For “restrict”.) Filename for permitted values.
restrict_limits – (For “restrict”.) Tuple of lower and upper limits to apply to the field.
restrict_list – (For “restrict”.) List of permitted values.
free_text_limit – Filter out all free text over the specified length.
exclude_scrubbed_fields – Exclude all text fields which are being scrubbed.
nprocesses – Number of processing being run (of which this is one), for work allocation.
process – Number of this process (from 0 to nprocesses - 1), for work allocation.
skip_dd_check – If true: skip data dictionary validity check. (Useful in multiprocessing contexts when another process has already done this.)
seed – Seed for random number generator (for TRID generation). Blank for the default of system time.
chunksize – Number of records copied in a chunk when copying PKs from one database to another.
reportevery – Report insert progress every n rows in verbose mode.
echo – Echo SQL?
debugscrubbers – Report sensitive scrubbing information, for debugging
savescrubbers – Saves sensitive scrubbing information in admin database, for debugging
- crate_anon.anonymise.anonymise.commit_admindb() None [source]
Execute a
COMMIT
on the admin database, which is using ORM sessions.
- crate_anon.anonymise.anonymise.commit_destdb() None [source]
Execute a
COMMIT
on the destination database, and reset row counts.
- crate_anon.anonymise.anonymise.count_rows(dbname: str, sourcetable: str, pid: Optional[Union[int, str]] = None) int [source]
Count the number of rows in a table for a given PID.
- Parameters
dbname – name (as per the data dictionary) of the source database
sourcetable – name of the source table
pid – patient ID (PID)
- Returns
the number of records
- crate_anon.anonymise.anonymise.create_indexes(tasknum: int = 0, ntasks: int = 1) None [source]
Create indexes for the destination tables.
- Parameters
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
- crate_anon.anonymise.anonymise.delete_dest_rows_with_no_src_row(srcdbname: str, src_table: str, report_every: int = 100000, chunksize: int = 100000) None [source]
For a given source database/table, delete any rows in the corresponding destination table where there is no corresponding source row.
Can’t do this in a single SQL command, since the engine can’t necessarily see both databases.
Can’t do this in a multiprocess way, because we’re trying to do a
DELETE WHERE NOT IN
.However, we can get stupidly long query lists if we try to
SELECT
all the values and use aDELETE FROM x WHERE y NOT IN (v1, v2, v3, ...)
query. This crashes the MySQL connection, etc.Therefore, we need a temporary table in the destination.
- Parameters
srcdbname – name (as per the data dictionary) of the source database
src_table – name of the source table
report_every – report to the Python log every n records
chunksize – insert records every n records
- crate_anon.anonymise.anonymise.drop_remake(incremental: bool = False, skipdelete: bool = False, full_drop_only: bool = False) None [source]
Drop and rebuild (a) mapping table, (b) destination tables.
- Parameters
incremental – Doesn’t drop tables; just deletes destination information where source information no longer exists.
skipdelete – For incremental updates, skip deletion of rows present in the destination but not the source
full_drop_only – Performs a full drop (even opt-out tables) and does nothing else. Incompatible with
incremental
.
- crate_anon.anonymise.anonymise.estimate_count_patients() int [source]
Estimate the number of patients in the source database.
We can’t easily and quickly get the total number of patients, because they may be defined in multiple tables across multiple databases. We shouldn’t fetch them all into Python in case there are billions, and it’s a waste of effort to stash them in a temporary table and count unique rows, because this is all only for a progress indicator. So we approximate.
- crate_anon.anonymise.anonymise.fieldname_is_pid(field: str) bool [source]
Checks if a field name is the literal
'pid'
or, if in the form'database.table.field'
, is the name of a primary PID field in the source database. If either of those conditions is met, returnTrue
; otherwise,False
.
- crate_anon.anonymise.anonymise.gen_index_row_sets_by_table(tasknum: int = 0, ntasks: int = 1) Generator[Tuple[str, List[crate_anon.anonymise.ddr.DataDictionaryRow]], None, None] [source]
Generate
table, list_of_dd_rows_for_indexed_fields
tuples for all tables requiring indexing.- Parameters
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
- Yields
tuple –
table, list_of_dd_rows_for_indexed_fields
for each table as above
- crate_anon.anonymise.anonymise.gen_nonpatient_tables_with_int_pk() Generator[Tuple[str, str, str], None, None] [source]
Generate
source_db_name, source_table, pk_name
tuples for all tables thatdon’t contain patient information and
do have an integer PK.
- crate_anon.anonymise.anonymise.gen_nonpatient_tables_without_int_pk(tasknum: int = 0, ntasks: int = 1) Generator[Tuple[str, str], None, None] [source]
Generate
(source db name, source table)
tuples for all tables thatdon’t contain patient information and
don’t have an integer PK.
- Parameters
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
- Yields
tuple –
source_db_name, source_table
for each table as above
- crate_anon.anonymise.anonymise.gen_opt_out_pids_from_database(mpid: bool = False) Generator[Any, None, None] [source]
Generate opt-out PIDs (or MPIDs) from a database.
- Parameters
mpid – generate MPIDs, not PIDs
- Yields
each PID (or MPID)
- crate_anon.anonymise.anonymise.gen_opt_out_pids_from_file(mpid: bool = False) Generator[Union[int, str], None, None] [source]
Generate opt-out PIDs (or MPIDs) from a file.
- Parameters
mpid – generate MPIDs, not PIDs (and therefore use
config.optout_mpid_filenames
, notconfig.optout_pid_filenames
, as the set of filenames to read)- Yields
each PID (or MPID), which will be either
str
orint
depending on the value ofconfig.mpidtype_is_integer
orconfig.pidtype_is_integer
.
- crate_anon.anonymise.anonymise.gen_optout_rids() Generator[str, None, None] [source]
Generates RIDs for patients who opt out (which we can use to wipe their information from the destination database).
- Yields
string – research ID (RID)
- crate_anon.anonymise.anonymise.gen_patient_ids(tasknum: int = 0, ntasks: int = 1, specified_pids: Optional[List[Union[int, str]]] = None) Generator[Union[int, str], None, None] [source]
Generate patient IDs.
- Parameters
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
specified_pids – optional list of PIDs to restrict ourselves to
- Yields
integer or string patient IDs (PIDs)
Assigns work to threads/processes, via the simple expedient of processing only those patient ID numbers where
patientnum % ntasks == tasknum
(for integers), or an equivalent method for string PIDs.
- crate_anon.anonymise.anonymise.gen_pks(srcdbname: str, tablename: str, pkname: str) Generator[int, None, None] [source]
Generate PK values from a table.
- Parameters
srcdbname – name (as per the data dictionary) of the database
tablename – name of the table
pkname – name of the PK column
- Yields
int – each primary key
- crate_anon.anonymise.anonymise.gen_rows(dbname: str, sourcetable: str, sourcefields: Iterable[str], pid: Optional[Union[int, str]] = None, intpkname: Optional[str] = None, tasknum: int = 0, ntasks: int = 1, debuglimit: int = 0) Generator[List[Any], None, None] [source]
Generates rows from a source table: - … each row being a list of values - … each value corresponding to a field in sourcefields. - … optionally restricted to a single patient
If the table has a PK and we’re operating in a multitasking situation, generate just the rows for this task (thread/process).
- Parameters
dbname – name (as per the data dictionary) of the source database
sourcetable – name of the source table
sourcefields – names of fields in the source table
pid – patient ID (PID)
intpkname – name of the integer PK column in the source table, if one exists
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
debuglimit – if specified, the maximum number of rows to process
- Yields
lists, each representing one row and containing values for each of the
sourcefields
- crate_anon.anonymise.anonymise.get_pid_subset_from_field(field: str, values_to_find: List[Any]) List[Any] [source]
Takes a field name and elements from that field (values present in that field) and queries the database to find the PIDs associated with these values.
- Parameters
field – field name in the format
database.table.field
values_to_find – values to look for
- Returns
list of PIDs
For example, suppose you have a source table called
mydb.mystudyinfo
like this:pid (INTEGER)
include_in_extract (VARCHAR)
1
no
2
0
3
yes
4
1
5
definitely
then a call like
get_subset_from_field("mydb.mystudyinfo.include_in_extract", ["yes", "1", "definitely"])
should return
[3, 4, 5]
, assuming thatpid
has been correctly marked as the PID column in the data dictionary.
- crate_anon.anonymise.anonymise.get_pids_from_field_limits(field: str, low: int, high: int) List[Any] [source]
Takes a field name and a lower/upper limit, and returns a list of associated PIDs.
- Parameters
field – a fieldname of the format
database.table.field
, or the literalpid
low – lower (inclusive) limit
high – upper (inclusive) limit
The range is used as follows.
If
field
is the string literal'pid'
, or is the name of a source database field containing PIDs, then fetch PIDs in the specified range and check that they are valid.If it’s another kind of field, look for rows where this field is in the specified range, and return the value of the PID column from the same row of the table. (See
get_pids_query_field_limits()
.)
- Returns
list of PIDs
- crate_anon.anonymise.anonymise.get_pids_from_file(field: str, filename: str) List[str] [source]
Takes a field name, and a filename of values of that field, and returns a list of PIDs associated with them.
- Parameters
field – a fieldname of the format
database.table.field
, or the literalpid
filename –
A file containing words that represent values to look for, as follows.
If
field
is the string literal'pid'
, or is the name of a source database field containing PIDs, then the values in the file should be PIDs. We check that they are valid.If it’s another kind of field, look for values (from the file) in this field, and return the value of the PID column from the same row of the table. (See
get_pid_subset_from_field()
.)
- Returns
list of PIDs
- crate_anon.anonymise.anonymise.get_pids_from_limits(low: int, high: int) List[Any] [source]
Finds PIDs from the source database that are between
low
andhigh
inclusive.The SQL
BETWEEN
operator is inclusive (https://www.w3schools.com/sql/sql_between.asp).
- Parameters
low – lower (inclusive) limit
high – upper (inclusive) limit
- Returns
list of PIDs in this range
- crate_anon.anonymise.anonymise.get_pids_from_list(field: str, values: List[str]) List[str] [source]
Takes a field name and a list of values, and returns a list of PIDs associated with them.
- Parameters
field – a fieldname of the format
database.table.field
, or the literalpid
values –
Values to look for, as follows.
If
field
is the string literal'pid'
, or is the name of a source database field containing PIDs, then the values in the should be PIDs. We check that they are valid.If it’s another kind of field, look for the values in this field, and return the value of the PID column from the same row of the table. (See
get_pid_subset_from_field()
.)
- Returns
list of PIDs
- crate_anon.anonymise.anonymise.get_pids_query_field_limits(field: str, low: int, high: int) List[Any] [source]
Takes a field name and queries the database to find the PIDs associated with records where
field
is in the rangelow
tohigh
inclusive.- Parameters
field – field name in the format
database.table.field
low – lower (inclusive) limit
high – upper (inclusive) limit
- Returns
list of PIDs
For example, suppose you have a source table called
mydb.myoptouts
like this:pid (INTEGER)
opt_out_level (INTEGER)
1
0
2
1
3
2
4
3
5
4
then a call like
get_subset_from_field("mydb.myoptouts.opt_out_level", 2, 3)
should return
[3, 4]
, assuming thatpid
has been correctly marked as the PID column in the data dictionary.
- crate_anon.anonymise.anonymise.get_valid_pid_subset(given_pids: List[str]) List[str] [source]
Takes a list of PIDs and returns those in the list which are also in the database.
- crate_anon.anonymise.anonymise.identical_record_exists_by_hash(dest_table: str, pkfield: str, pkvalue: int, hashvalue: str) bool [source]
For a given PK in a given destination table, is there a record with the specified value for its source hash?
- Parameters
dest_table – name of the destination table
pkfield – name of the primary key (PK) column in the destination table
pkvalue – integer value of the PK in the destination table
hashvalue – hash of the source
- crate_anon.anonymise.anonymise.identical_record_exists_by_pk(dest_table: str, pkfield: str, pkvalue: int) bool [source]
For a given PK in a given destination table, does a record exist?
- Parameters
dest_table – name of the destination table
pkfield – name of the primary key (PK) column in the destination table
pkvalue – integer value of the PK in the destination table
- crate_anon.anonymise.anonymise.opting_out_mpid(mpid: Union[int, str]) bool [source]
Does this patient wish to opt out?
- Parameters
mpid – master patient identifier (MPID)
- crate_anon.anonymise.anonymise.opting_out_pid(pid: Union[int, str]) bool [source]
Does this patient wish to opt out?
- Parameters
pid – patient identifier (PID)
- crate_anon.anonymise.anonymise.patient_processing_fn(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, specified_pids: Optional[List[int]] = None, free_text_limit: Optional[int] = None, exclude_scrubbed_fields: bool = False) None [source]
Main function to anonymise patient data.
Iterate through patient IDs;
build the scrubber for each patient;
process source data for that patient, scrubbing it;
insert the patient into the mapping table in the admin database.
- Parameters
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
incremental – perform an incremental update, rather than a full run?
specified_pids – if specified, restrict to specific PIDs
free_text_limit – as per
process_table()
exclude_scrubbed_fields – as per
process_table()
- crate_anon.anonymise.anonymise.process_nonpatient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, free_text_limit: Optional[int] = None, exclude_scrubbed_fields: bool = False) None [source]
Copies all non-patient tables.
If they have an integer PK, the work may be parallelized.
If not, whole tables are assigned to different processes in parallel mode.
- Parameters
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
incremental – perform an incremental update, rather than a full run?
free_text_limit – as per
process_table()
exclude_scrubbed_fields – as per
process_table()
- crate_anon.anonymise.anonymise.process_patient_tables(tasknum: int = 0, ntasks: int = 1, incremental: bool = False, specified_pids: Optional[List[int]] = None, free_text_limit: Optional[int] = None, exclude_scrubbed_fields: bool = False) None [source]
Process all patient tables, optionally in a parallel-processing fashion.
All the work is done via
patient_processing_fn()
.- Parameters
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
incremental – perform an incremental update, rather than a full run?
specified_pids – if specified, restrict to specific PIDs
free_text_limit – as per
process_table()
exclude_scrubbed_fields – as per
process_table()
- crate_anon.anonymise.anonymise.process_table(sourcedbname: str, sourcetable: str, patient: Optional[crate_anon.anonymise.patient.Patient] = None, incremental: bool = False, intpkname: Optional[str] = None, tasknum: int = 0, ntasks: int = 1, free_text_limit: Optional[int] = None, exclude_scrubbed_fields: bool = False) None [source]
Process a table. This can either be a patient table (in which case the patient’s scrubber is applied and only rows for that patient are processed) or not (in which case the table is just copied).
- Parameters
sourcedbname – name (as per the data dictionary) of the source database
sourcetable – name of the source table
patient –
crate_anon.anonymise.patient.Patient
object, orNone
for non-patient tablesincremental – perform an incremental update, rather than a full run?
intpkname – name of the integer PK column in the source table
tasknum – task number of this process (for dividing up work)
ntasks – total number of processes (for dividing up work)
free_text_limit – If specified, any text field longer than this will be excluded
exclude_scrubbed_fields – Exclude all text fields which are being scrubbed.
- crate_anon.anonymise.anonymise.setup_opt_out(incremental: bool = False) None [source]
Hunts far and wide through its sources for PID/MPID values of patients who wish to opt out.
Adds them to the admin tables for
crate_anon.anonymise.models.OptOutPid
andcrate_anon.anonymise.models.OptOutMpid
.
- Parameters
incremental – after adding opt-out patients, delete any data for them found in the destination database. (Unnecessary for “full” rather than “incremental” runs, since “full” runs delete all the destination tables and start again.)
- crate_anon.anonymise.anonymise.wipe_and_recreate_destination_db(incremental: bool = False, full_drop_only: bool = False) None [source]
Drop and recreate all destination tables (as specified in the DD) in the destination database.
- Parameters
incremental – Don’t drop the tables first, just create them if they don’t exist.
full_drop_only – Drop everything, but don’t rebuild. Incompatible with
incremental
.
- crate_anon.anonymise.anonymise.wipe_destination_data_for_opt_out_patients(report_every: int = 1000, chunksize: int = 10000) None [source]
Delete any data from patients that have opted out (after their data was processed on a previous occasion).
(Slightly complicated by the fact that the destination database can’t necessarily ‘see’ the mapping database, so we need to cache the RID keys in the destination database temporarily.)
- Parameters
report_every – report logging information every n records
chunksize – insert records every n records