6.3. The anonymiser config file
This file controls the behaviour of the anonymiser, and tells it where to find the source, destination, and secret databases, and the data dictionary that controls the conversion process for each database column.
You can generate a specimen config file with
crate_anon_demo_config > test_anon_config.ini
You should save this, then edit it to your own needs. A copy is shown below.
For convenience, you may want the CRATE_ANON_CONFIG environment variable to point to this file. (Otherwise you must specify it each time.)
6.3.1. Format of the configuration file
The config file is in standard INI file format.
UTF-8 encoding. Use this! The file is explicitly opened in UTF-8 mode.
Comments. Hashes (
#
) and semicolons (;
) denote comments.Sections. Sections are indicated with:
[section]
Name/value (key/value) pairs. The parser used is ConfigParser. It allows
name=value
orname:value
.Avoid indentation of parameters. (Indentation is used to indicate the continuation of previous parameters.)
Parameter types, referred to below, are:
String. Single-line strings are simple.
Multiline string. Here, a series of lines is read and split into a list of strings (one for each line). You should indent all lines except the first beyond the level of the parameter name, and then they will be treated as one parameter value.
Integer. Simple.
Boolean. For Boolean options, true values are any of:
1, yes, true, on
(case-insensitive). False values are any of:0, no, false, off
.
6.3.2. [main] section
6.3.2.1. Data dictionary
6.3.2.1.1. data_dictionary_filename
String.
Specify the filename of a data dictionary in TSV (tab-separated value) format, with a header row. See Data Dictionary.
6.3.2.2. Critical field types
6.3.2.2.1. sqlatype_pid
String. Default: BigInteger
.
See sqlatype_mpid below.
6.3.2.2.2. sqlatype_mpid
String. Default: BigInteger
.
We need to know PID and MPID types from the config so that we can set up our
secret mapping tables. You can leave these blank, in which case they will be
assumed to be large integers, using SQLAlchemy’s BigInteger
(e.g.
SQL Server’s BIGINT
). If you do specify them, you may specify EITHER
BigInteger
or a string type such as String(50)
.
Note
Some databases have genuine string IDs, like “M123456”, and you should of
course use a string type then. However, some databases have integers stored
in a variety of fields. For example, an NHS number is a 10-digit integer.
This might be stored in a BIGINT
field, or a VARCHAR(10)
field,
even if the latter is odd. What should you do?
Use BigInteger
for preference.
First, note that it is very unusual to have PIDs going into your destination database, so the only places these column types are likely to be used are in your secret mapping database. (CRATE’s RIDs in the research database are typically alphanumeric, i.e. strings, and its TRIDs are integer.)
Second, database autoconvert string values to integer and vice versa, with varying flexibility. Here are two examples:
-- MySQL 5.7.30:
SELECT 123 = 123; -- returns 1, meaning true
SELECT 123 = '123'; -- returns 1, meaning true
SELECT 123 = '123x'; -- returns 1, meaning true
SELECT 123 = 'x123'; -- returns 0, meaning false
DROP TABLE IF EXISTS rubbish;
CREATE TABLE rubbish (a INTEGER, b VARCHAR(10));
INSERT INTO rubbish (a, b) VALUES (123, '123'), (456, '456');
SELECT * FROM rubbish WHERE a = '123'; -- returns 1 row
SELECT * FROM rubbish WHERE a = '123x'; -- returns 1 row
SELECT * FROM rubbish WHERE b = 123; -- returns 1 row
-- Microsoft SQL Server 14.0.1000.169:
SELECT 123 = 123; -- Incorrect syntax near '='.
SELECT 123 = '123'; -- Incorrect syntax near '='.
SELECT 123 = '123x'; -- Incorrect syntax near '='.
SELECT 123 = 'x123'; -- Incorrect syntax near '='.
USE mydatabase;
DROP TABLE IF EXISTS rubbish;
CREATE TABLE rubbish (a INTEGER, b VARCHAR(10));
INSERT INTO rubbish (a, b) VALUES (123, '123'), (456, '456');
SELECT * FROM rubbish WHERE a = '123'; -- works
SELECT * FROM rubbish WHERE a = '123x'; -- Conversion failed when converting the varchar value '123x' to data type int.
SELECT * FROM rubbish WHERE b = 123; -- works
6.3.2.3. Encryption phrases/passwords
6.3.2.3.1. hash_method
String. Default HMAC_MD5
.
Hashing method, used for
PID to RID conversion;
MPID to MRID conversion;
hashing source data for change detection.
Options are:
6.3.2.3.2. per_table_patient_id_encryption_phrase
String.
Secret phrase with which to hash the PID (creating the RID).
6.3.2.3.3. master_patient_id_encryption_phrase
String.
Secret phrase with which to hash the MPID (creating the MRID).
6.3.2.3.4. change_detection_encryption_phrase
String.
Secret phrase with which to hash content (storing the result in the output database), so that changes in content can be detected.
6.3.2.3.5. extra_hash_config_sections
Multiline string.
If you are using the “hash” field alteration method (see alter_method), you need to list the hash methods here, for internal initialization order/performance reasons.
See hasher definitions for how to define these.
6.3.2.4. Text extraction
6.3.2.4.1. extract_text_extensions_permitted
Multiline string.
extract_text_extensions_permitted
and
extract_text_extensions_prohibited
govern what kinds of files are accepted
for text extraction. It is very likely that you’ll want to apply such
restrictions; for example, if your database contains .jpg
files, it’s a
waste of trying to extract text from them (and in theory, if your text
extraction tool provided sufficient detail, such as binary-encoding the JPEG,
you might leak identifiable information, such as a photo).
The “permitted” and “prohibited” settings are both lists of strings.
If the “permitted” list is not empty then a file will be processed only if its extension is in the permitted list. Otherwise, it will be processed only if it is not in the prohibited list.
The extensions must include the “.” prefix.
Case sensitivity is controlled by the extra flag,
extract_text_extensions_case_sensitive
.
6.3.2.4.2. extract_text_extensions_prohibited
Multiline string.
6.3.2.4.3. extract_text_extensions_case_sensitive
Boolean. Default: false.
6.3.2.4.4. extract_text_plain
Boolean. Default: true. (Changed to true from v0.18.88.)
Use the plainest possible layout for text extraction?
False
= better for human layout. Table example (e.g. from a DOCX file):
┼────────────────┼────────────────┼
│ Row 1 col 1 │ Row 1 col 2 │
│ R1C1 continued │ R1C2 continued │
┼────────────────┼────────────────┼
│ Row 2 col 1 │ Row 2 col 2 │
│ R2C1 continued │ R2C2 continued │
┼────────────────┼────────────────┼
True
= good for natural language processing. Table example equivalent to
that above:
╔═════════════════════════════════════════════════════════════════╗
Row 1 col 1
R1C1 continued
───────────────────────────────────────────────────────────────────
Row 1 col 2
R1C2 continued
═══════════════════════════════════════════════════════════════════
Row 2 col 1
R2C1 continued
───────────────────────────────────────────────────────────────────
Row 2 col 2
R2C2 continued
╚═════════════════════════════════════════════════════════════════╝
… note the absence of vertical interruptions, and that text from one cell remains contiguous.
6.3.2.4.5. extract_text_width
Integer. Default: 80.
Default width (in columns) to word-wrap extracted text to.
6.3.2.5. Anonymisation
6.3.2.5.1. allow_no_patient_info
Boolean. Default: false.
Allow the absence of patient info? Used to copy databases; WILL NOT ANONYMISE. Since this is usually (but not always) a mistake, this option exists as a safeguard: if you have it set to false and attempt to anonymise with a data dictionary that contains no “patient-defining” information (i.e. a table that lists all patients), CRATE will stop with an error.
6.3.2.5.2. replace_patient_info_with
String. Default: [__PPP__]
.
Patient information will be replaced with this. For example, XXXXXX
or
[___]
or [__PPP__]
or [__ZZZ__]
; the bracketed forms can be a bit
easier to spot, and work better if they directly abut other text.
6.3.2.5.3. replace_third_party_info_with
String. Default: [__TTT__]
.
Third-party information (e.g. information about family members) will be
replaced by this. For example, YYYYYY
or [...]
or [__TTT__]
or
[__QQQ__]
.
6.3.2.5.4. replace_nonspecific_info_with
String. Default: [~~~]
.
Things to be removed irrespective of patient-specific information will be
replaced by this (for example, if you opt to remove all things looking like
telephone numbers). For example, [~~~]
.
6.3.2.5.5. replace_all_dates_with
String. Default: [~~~]
.
When scrub_all_dates is True, replace with this text.
Supports limited datetime.strftime
directives for “blurring” of dates. The
supported directives are:
%b # Month as locale's abbreviated name, e.g. "Sep"
%B # Month as locale's full name, e.g. "September
%m # Month as zero-padded decimal number, e.g. "09"
%Y # Year with century as decimal number, e.g. "2022"
%y # Year without century as zero-padded decimal number, e.g. "22"
Note that day-of-the-month directives are not supported, because this is all about blurring dates. Examples:
This... Gives, for example...
%b %Y Sep 2020
[%b %Y] [Sep 2020] -- making the editorial changes more apparent
[01-%m-%Y] [01-09-2020] -- blurring to the first of the month
6.3.2.5.6. thirdparty_xref_max_depth
Integer. Default 1.
For fields marked as scrub_src = thirdparty_xref_pid
(see scrub_src), how deep should we recurse? Beware making this too large; the
recursion trawls a lot of information (and also uses an extra simultaneous
database cursor for each recursion).
6.3.2.5.7. scrub_string_suffixes
Multiline string.
Strings to append to every “scrub from” string.
For example, include “s” if you want to scrub “Roberts” whenever you scrub “Robert”.
Applies to scrub methods words
, but not to phrase
(see
scrub_method).
6.3.2.5.8. string_max_regex_errors
Integer. Default: 0.
Specify maximum number of errors (insertions, deletions, substitutions) in string regex matching. Beware using a high number! Suggest 1-2.
6.3.2.5.9. min_string_length_for_errors
Integer. Default: 3.
Is there a minimum length to apply string_max_regex_errors? For example, if you allow one typo and someone is called Ian, all instances of ‘in’ or ‘an’ will be wiped. Note that this applies to scrub-source data.
6.3.2.5.10. min_string_length_to_scrub_with
Integer. Default: 2.
Is there a minimum length of string to scrub WITH? For example, if you specify 2, you allow two-letter names such as Al to be scrubbed, but you allow initials through, and therefore prevent e.g. ‘A’ from being scrubbed from the destination. Note that this applies to scrub-source data.
6.3.2.5.11. allowlist_filenames
Multiline string.
Allowlist.
Are there any words not to scrub? For example, “the”, “road”, “street” often appear in addresses, but you might not want them removed. Be careful in case these could be names (e.g. “Lane”).
Specify these as a list of filenames, where the files contain words; e.g.
allowlist_filenames = /some/path/short_english_words.txt
Here’s a suggestion for some of the sorts of words you might include:
am
an
as
at
bd
by
he
if
is
it
me
mg
od
of
on
or
re
so
to
us
we
her
him
tds
she
the
you
road
street
Note
Formerly whitelist_filenames
; changed 2020-07-20 as part of neutral
language review.
See anonymisation sequence for an explanation of how the allow/deny sequence works.
6.3.2.5.12. denylist_filenames
Multiline string.
Denylist.
Are there any words you always want to remove?
Specify these as a list of filenames, e.g
denylist_filenames =
/some/path/boy_names.txt
/some/path/girl_names.txt
/some/path/common_surnames.txt
The prototypical use is to remove all known names (pre-filtered to remove medical eponyms) – see also crate_fetch_wordlists.
See denylist_files_as_phrases, which governs how these files are interpreted.
See anonymisation sequence for an explanation of how the allow/deny sequence works.
Note
Formerly blacklist_filenames
; changed 2020-07-20 as part of neutral
language review.
Note
These words are removed only at word boundaries. So “Bob” won’t scrub “bobsleigh”, for example.
Note
Whitespace is removed at the start/end of lines, and any line starting with
a hash (#
) is treated as a comment and ignored. (However, comments are
not permitted at the end of lines containing content; they are treated as
part of the content.)
6.3.2.5.13. denylist_files_as_phrases
Boolean. Default: false.
(For denylist_filenames.) Consider a “denylist” file like this:
Alice
Bob
Charlie Brown
If denylist_files_as_phrases
is false, then every word will be scrubbed
– Alice
, Bob
, Charlie
, and Brown
.
If denylist_files_as_phrases
is true, then every line is treated as a
phrase to be scrubbed: Alice
, Bob
, and Charlie Brown
. That is,
Charlie
and Brown
will not be scrubbed, just the composite phrase. This
option is further configurable via denylist_use_regex.
The prototypical use of this option is to provide a list of organizational (e.g. general practitioner) addresses that should be scrubbed, and always appear in a consistent format.
6.3.2.5.14. denylist_use_regex
Boolean. Default: false.
If false:
Replacements are very fast (they use FlashText).
However, spacing matters and is exactly as provided: as a phrase,
Charlie Brown
will not scrubCharlie Brown
(note the double-space).Word boundaries are always required, so
blahCharlie Brownblah
will not be scrubbed.
If true:
Replacements are not as fast (they use regex).
However, spacing is flexible; if you enable denylist_files_as_phrases, the phrase
Charlie Brown
will scrubCharlie Brown
and the like.Word boundary behaviour is also flexible, governed by anonymise_strings_at_word_boundaries_only.
6.3.2.5.15. phrase_alternative_word_filenames
Multiline string.
Alternatives for common words. These will be used to find alternative phrases which will be scrubbed. The files specified should be in comma separated variable (CSV) form.
Examples of alternative words include street types: https://en.wikipedia.org/wiki/Street_suffix.
6.3.2.5.16. scrub_all_numbers_of_n_digits
Multiline list of integers.
Use nonspecific scrubbing of numbers of a certain length?
For example, scrubbing all 11-digit numbers will remove modern UK telephone
numbers in conventional format. To do this, specify
scrub_all_numbers_of_n_digits = 11
. You could scrub both 10- and 11-digit
numbers by specifying both numbers (in multiline format, as above); 10-digit
numbers would include all NHS numbers. Avoid using this for short numbers; you
may lose valuable numeric data!
6.3.2.5.17. scrub_all_uk_postcodes
Boolean. Default: false.
Nonspecific scrubbing of UK postcodes?
See https://www.mrs.org.uk/pdf/postcodeformat.pdf; these can look like
FORMAT EXAMPLE
AN NAA M1 1AA
ANN NAA M60 1NW
AAN NAA CR2 6XH
AANN NAA DN55 1PT
ANA NAA W1A 1HQ
AANA NAA EC1A 1BB
6.3.2.5.18. scrub_all_dates
Boolean. Default: false.
Nonspecific scrubbing of all dates? Anything that looks like a date will be removed, including:
2/11/73
03.31.1991
13 11 2001
1976/02/28
19741213
2 Sep 1990
1st Sep 2000
Sep 2nd 1990
(etc.)
Note
At present, the ordinal suffixes (“st”, “nd”, “rd”, “th”) are fixed to English, as are the month names (“January”, “February”, …).
6.3.2.5.19. scrub_all_email_addresses
Boolean. Default: false.
Nonspecific scrubbing of all e-mail addresses?
6.3.2.5.20. nonspecific_scrubber_first
Boolean. Default: false.
This influences the anonymisation sequence.
Suppose you have a nonspecific scrubber that you have set up to scrub postcodes (see scrub_all_uk_postcodes) but also names from a pre-built list (see denylist_filenames). This knows about the names John and Smith. As it happens, your patient is also called John Smith, and your database tells CRATE that. So CRATE has two ways of removing this person’s name. Do you want this patient’s names to be treated as nonspecific sensitive information, or patient-specific sensitive information?
Suppose we are scrubbing this:
Please send a letter to John Smith about the next appointment.
In his youth, this patient worked as a smith.
If nonspecific_scrubber_first
is False, then you will get something like
Please send a letter to [PATIENT] about the next appointment.
In his youth, this patient worked as a [PATIENT].
but if it is True, you will get something like
Please send a letter to [GENERIC] about the next appointment.
In his youth, this patient worked as a [GENERIC].
So using False provides you with a bit more information about the person being referred to – (1) likely in a good way, so you know the first sentence is referring to the patient, but (2) potentially in a bad way, because (in this contrived example) you would be able to guess that the patient’s surname is a word for a profession. In contrast, using True provides neither of those pieces of information. It’s a trade-off. Since access to de-identified free text should be strictly controlled anyway, we opt for the slightly more informative default.
6.3.2.5.21. anonymise_codes_at_word_boundaries_only
Boolean. Default: true.
Anonymise codes only when they are found at word boundaries?
Applies to the code
scrub method (see scrub_method).
True
is more liberal (produces less scrubbing); False
is more
conservative (more scrubbing; higher chance of over-scrubbing) and will deal
with accidental word concatenation. With ID numbers, beware if you use a
prefix, e.g. if people write M123456
or R123456
; in that case you will
need anonymise_numbers_at_word_boundaries_only = False
.
6.3.2.5.22. anonymise_codes_at_numeric_boundaries_only
Boolean. Default: true.
Only applicable if anonymise_codes_at_word_boundaries_only is false. Requires
that codes occur at a numeric boundary to be scrubbed. That is, if your code is
M123
, then using anonymise_codes_at_numeric_boundaries_only = True
will
not scrub M1234
. (See anonymise_numbers_at_numeric_boundaries_only, which
works in the same way.)
6.3.2.5.23. anonymise_dates_at_word_boundaries_only
Boolean. Default: true.
As for anonymise_codes_at_word_boundaries_only, but applies to the date
scrub method (see scrub_method). Also applies to
scrub_all_dates.
6.3.2.5.24. anonymise_numbers_at_word_boundaries_only
Boolean. Default: false.
As for anonymise_codes_at_word_boundaries_only, but applies to the number
scrub method (see scrub_method).
6.3.2.5.25. anonymise_numbers_at_numeric_boundaries_only
Boolean. Default: true.
Only applicable if anonymise_numbers_at_word_boundaries_only is False.
Similar to anonymise_numbers_at_word_boundaries_only, and similarly applies
to the number
scrub method (see scrub_method);
however, this relates to whether numbers are scrubbed only at numeric
boundaries.
If True
, CRATE will not scrub “234” from “123456”. Setting this to
False
is extremely conservative (all sorts of numbers may be scrubbed). You
probably want this set to True
. (See also
anonymise_codes_at_numeric_boundaries_only, which works in the same way.)
6.3.2.5.26. anonymise_strings_at_word_boundaries_only
Boolean. Default: true.
As for anonymise_codes_at_word_boundaries_only, but applies to the words
and the phrase
scrub methods (see scrub_method).
This setting will only apply to pre-built wordlists (supplied via
denylist_filenames) if you enable denylist_use_regex; otherwise, those
wordlists will behave as if anonymise_strings_at_word_boundaries_only
is
true.
6.3.2.6. Other anonymisation options
You can also specify additional “nonspecific” regular expressions yourself. See extra_regexes.
6.3.2.7. Output fields and formatting
6.3.2.7.1. timefield_name
String. Default: _when_processed_utc
.
Name of the DATETIME
column to be created in every output table indicating
when CRATE processed that row (see
crate_anon.anonymise.anonymise.process_table()
).
6.3.2.7.2. research_id_fieldname
String. Default: rid
.
Research ID (RID) field name for destination tables. This will be a VARCHAR
of length determined by hash_method. Used to
replace patient ID fields from source tables.
6.3.2.7.3. trid_fieldname
String. Default: trid
.
Transient integer research ID (TRID) fieldname. An unsigned integer field with this name will be added to every table containing a primary patient ID (in the source) or research ID (in the destination). It will be indexed (and whether that index is unique or not depends on the settings for the PID field).
6.3.2.7.4. master_research_id_fieldname
String. Default: mrid
.
Master research ID (MRID) field name for destination tables. This will be a
VARCHAR
of length determined by hash_method. Used to replace master patient ID fields from
source tables.
6.3.2.7.5. add_mrid_wherever_rid_added
Boolean. Default: true.
Whenever adding a RID field to a destination table (replacing the PID field from the source, and adding a TRID field), should we also add an MRID field? It will be indexed (non-uniquely; the MRID is not always guaranteed to be present just because the PID is present).
6.3.2.7.6. source_hash_fieldname
String. Default: _src_hash
.
Change-detection hash fieldname for destination tables, used to hash entire
rows to see if they’ve changed later. To rephrase: this is a field (column)
that CRATE will create in tables in the destination database, using it to store
a “signature” of the source row (allowing changes to be detected). This column
will be a VARCHAR
of length determined by hash_method. See also src_flags in the
data dictionary.
6.3.2.8. Destination database configuration
6.3.2.8.1. max_rows_before_commit
Integer. Default: 1000.
Specify the maximum number of rows to be processed before a COMMIT
is
issued on the database transaction(s). This prevents the transaction(s) growing
too large.
6.3.2.8.2. max_bytes_before_commit
Integer. Default: 80 Mb (80 * 1024 * 1024 = 83886080).
Specify the maximum number of source-record bytes (approximately!) that are
processed before a COMMIT
is issued on the database transaction(s). This
prevents the transaction(s) growing too large. The COMMIT
will be issued
after this limit has been met/exceeded, so it may be exceeded if the
transaction just before the limit takes the cumulative total over the limit.
6.3.2.8.3. temporary_tablename
String. Default: _crate_temp_table
.
We need a temporary table name for incremental updates. This can’t be the name of a real destination table. It lives in the destination database.
6.3.2.9. Choose databases (defined in their own sections)
Parameter values in this section are themselves config file section names.
For example, if you refer to a database called mydb
, CRATE will look for a
database config section named [mydb]
.
6.3.2.9.1. source_databases
Multiline string (of database config section names).
Source database list. Can be lots.
6.3.2.9.2. destination_database
String (a database config section name).
Destination database. Just one.
6.3.2.9.3. admin_database
String (a database config section name).
Secret admin database. Just one.
6.3.2.10. Processing options, to limit data quantity for testing
6.3.2.10.1. debug_max_n_patients
Integer. Default: 0.
Limit the number of patients to be processed? Specify 0 (the default) for no limit. The limit applies to each process separately, if you are using parallel processing.
6.3.2.10.2. debug_pid_list
Multiline string.
Specify a list of patient IDs to use, for debugging? If specified, only these patients will be processed – this list will be used directly (overriding the patient ID source specified in the data dictionary, and overriding debug_max_n_patients).
6.3.2.11. Opting out entirely
Patients who elect to opt out entirely have their PIDs stored in the OptOut
table of the admin database. ENTRIES ARE NEVER REMOVED FROM THIS LIST BY
CRATE. It can be populated in several ways:
Manually, by adding a PID to the column
opt_out_pid.pid
in the admin database. Seecrate_anon.anonymise.models.OptOutPid
.Similarly, by adding an MPID to the column
opt_out_mpid.mpid
in the admin database. Seecrate_anon.anonymise.models.OptOutMpid
.By maintaining a text file list of integer PIDs/MPIDs. Any PIDs/MPIDs in this file/files are added to the opt-out list. See below.
By flagging a source database field as indicating an opt-out, using the
!
marker in src_flags). See below.
6.3.2.11.1. optout_pid_filenames
Multiline string.
If you set this, each line of each named file is scanned for an integer, taken to be the PID of a patient who wishes to opt out.
6.3.2.11.2. optout_mpid_filenames
Multiline string.
If you set this, each line of each named file is scanned for an integer, taken to be the MPID of a patient who wishes to opt out.
6.3.2.11.3. optout_col_values
List of Python values.
If you mark a field in the data dictionary as an opt-out field (see above and src_flags), that says “the field tells you whether the patient opts out or not”. But is it “opt out” or “not”? If the actual value matches a value specified here, then it’s “opt out”.
Specify a LIST OF PYTHON VALUES; for example:
optout_col_values = [True, 1, '1', 'Yes', 'yes', 'Y', 'y']
6.3.3. [extra_regexes] section
Arbitrary number of name (string), value (string) pairs.
This section is optional.
Here, you can specify extra regular expression patterns (regexes) that you wish to be scrubbed from the text as nonspecific information (see replace_nonspecific_info_with).
These regexes can be multiline and contain comments – just remember to escape spaces and hash signs which you actually want to be part of the regex.
They have their own section so that you can use the parameter name as a helpful descriptive name (these names are ignored, and you could specify a giant regex combining them yourself, but CRATE will do that for you to enhance config file legibility and convenience). You can name each of them anything, e.g.
[extra_regexes]
my_regex_canadian_postcodes = [a-zA-Z][0-9][a-zA-Z]\w+[0-9][a-zA-Z][0-9]
another_regex =
\d+\#x # a number then a hash sign then an 'x'
\d+\ y # then another number then a space then 'y'
6.3.4. Database config sections
Config file sections that define databases always have the url
parameter.
Destination and admin databases need only this. Source databases have other
options, as below.
Warning
You should permit CRATE write access to the destination and admin databases, but only read access to source databases. It doesn’t need more, so read-only is safer.
6.3.4.1. Connection details
6.3.4.1.1. url
String.
Use SQLAlchemy URLs: see http://docs.sqlalchemy.org/en/latest/core/engines.html.
For example:
url = mysql+mysqldb://username:password@127.0.0.1:3306/output_databasename?charset=utf8
You may need to install additional drivers, e.g.
pip install SOME_DRIVER
… see database drivers.
6.3.4.2. Data dictionary generation: source fields
This section relates to automatic creation of data dictionaries only. In normal use, these settings do nothing.
In this section, fields (columns) can either be specified as column
(to
match a column with that name any table) or table.column
, to match a
specific column in a specific table.
The specifications are case-insensitive.
Wildcards (*
and ?
) may also be used (as per Python’s fnmatch). Thus,
one can write specifications like addr*.address_line_*
to match all of
address_current.address_line_1
, address_previous.address_line_4
, etc.
6.3.4.2.1. ddgen_omit_by_default
Boolean. Default: true.
By default, most fields (except PKs and patient ID codes) are marked as
OMIT
(see decision), pending human review. If you want
to live dangerously, set this to False, and they will be marked as include
from the outset.
6.3.4.2.2. ddgen_omit_fields
Multiline string, of field/column specifications.
You can specify additional fields to omit (see decision) here. Settings here override ddgen_include_fields – that is, “omit” overrides “include”.
6.3.4.2.3. ddgen_include_fields
Multiline string, of field/column specifications.
You can specify additional fields to include (see decision) here.
If a field contains scrubbing source information (see scrub_src), it will also be omitted pending human review, regardless of other settings.
6.3.4.2.4. ddgen_per_table_pid_field
String.
Specify the name of a (typically integer) patient identifier (PID) field present in EVERY table. It will be replaced by the research ID (RID) in the destination database.
6.3.4.2.5. ddgen_table_defines_pids
String
Specify a table which will define patient identifiers (PIDs) in the field
specified above. Only the PIDs in this field (and any other field defining
PIDs - see ddgen_pid_defining_fieldnames
below) will be included in the
anonymisation. If both this option and ddgen_pid_defining_fieldnames
are
left blank, the data dictionary will not work without manual editing.
6.3.4.2.6. ddgen_add_per_table_pids_to_scrubber
Boolean. Default: false.
Add every instance of a per-table PID field to the patient scrubber?
This is a very conservative setting, and should be unnecessary as the single master “PID-defining” column (see ddgen_pid_defining_fieldnames) should be enough.
(Note that per-table PIDs are always replaced by RIDs – this setting governs whether the scrubber used to scrub free-text fields also works through every single per-table PID.)
6.3.4.2.7. ddgen_master_pid_fieldname
String.
Master patient ID fieldname. Used for e.g. NHS numbers. This information will be replaced by the MRID in the destination database.
6.3.4.2.8. ddgen_table_denylist
Multiline string.
Deny any tables when creating new data dictionaries?
This is case-insensitive, and you can use *
and ?
wildcards (as per
Python’s fnmatch module).
Note
Formerly ddgen_table_blacklist
; changed 2020-07-20 as part of neutral
language review.
6.3.4.2.9. ddgen_table_allowlist
Multiline string.
Allow any tables? (Allowlists override denylists.)
Note
Formerly ddgen_table_whitelist
; changed 2020-07-20 as part of neutral
language review.
6.3.4.2.10. ddgen_table_require_field_absolute
Multiline string.
List any fields that all tables MUST contain. If a table doesn’t contain all of the field(s) listed here, it will be skipped.
6.3.4.2.11. ddgen_table_require_field_conditional
Multiline string (one pair per line).
List any fields that are required conditional on other fields. List them as one
or more pairs: A, B
where B is required if A is present (or the table will
be skipped).
6.3.4.2.12. ddgen_field_denylist
Multiline string.
Deny any fields (regardless of their table) when creating new data
dictionaries? Wildcards of *
and ?
operate as above.
Note
Formerly ddgen_field_blacklist
; changed 2020-07-20 as part of neutral
language review.
6.3.4.2.13. ddgen_field_allowlist
Multiline string.
Allow any fields? (Allowlists override denylists.)
Note
Formerly ddgen_field_whitelist
; changed 2020-07-20 as part of neutral
language review.
6.3.4.2.14. ddgen_pk_fields
Multiline string.
Fieldnames assumed to be their table’s PK. In addition to these, any columns that the source database reports as a PK will be treated as such.
6.3.4.2.15. ddgen_constant_content
Boolean. Default: false.
Assume that content stays constant?
Applies the C
flags to PK fields; see src_flags. This
then becomes the default, after which ddgen_constant_content_tables and
ddgen_nonconstant_content_tables can override (of which,
ddgen_nonconstant_content_tables takes priority if a table
matches both).
6.3.4.2.16. ddgen_constant_content_tables
Multiline string.
Table-specific overrides for ddgen_constant_content, as above.
6.3.4.2.17. ddgen_nonconstant_content_tables
Table-specific overrides for ddgen_constant_content, as above.
6.3.4.2.18. ddgen_addition_only
Boolean. Default: false.
Assume that records can only be added, not deleted?
6.3.4.2.19. ddgen_addition_only_tables
Multiline string.
Table-specific overrides for ddgen_addition_only, similarly.
6.3.4.2.20. ddgen_deletion_possible_tables
Multiline string.
Table-specific overrides for ddgen_addition_only, similarly.
6.3.4.2.21. ddgen_pid_defining_fieldnames
Multiline string.
Predefine field(s) that define the existence of patient IDs? UNUSUAL to want to do this.
6.3.4.2.22. ddgen_scrubsrc_patient_fields
Multiline string.
Field names assumed to provide patient information for scrubbing.
6.3.4.2.23. ddgen_scrubsrc_thirdparty_fields
Multiline string.
Field names assumed to provide third-party information for scrubbing.
6.3.4.2.24. ddgen_scrubsrc_thirdparty_xref_pid_fields
Multiline string.
Field names assumed to contain PIDs of third parties (e.g. relatives also in the patient database), to be used to look up the third party in a recursive way, for scrubbing.
6.3.4.2.25. ddgen_required_scrubsrc_fields
Multiline string.
Are any scrub_src fields required (mandatory), i.e. must have non-NULL data in at least one row (or the patient will be skipped)?
6.3.4.2.26. ddgen_scrubmethod_code_fields
Multiline string.
Fields to enforce the code
scrub_method upon,
overriding the default method.
6.3.4.2.27. ddgen_scrubmethod_date_fields
Multiline string.
Fields to enforce the date
scrub_method upon,
overriding the default method.
6.3.4.2.28. ddgen_scrubmethod_number_fields
Multiline string.
Fields to enforce the number
scrub_method upon,
overriding the default method.
6.3.4.2.29. ddgen_scrubmethod_phrase_fields
Multiline string.
Fields to enforce the phrase
scrub_method upon,
overriding the default method.
6.3.4.2.30. ddgen_safe_fields_exempt_from_scrubbing
Multiline string.
Known safe fields, exempt from scrubbing.
6.3.4.2.31. ddgen_min_length_for_scrubbing
Integer. Default: 50.
Define minimum source text column length for scrubbing (fields shorter than
this value are assumed safe). For example, specifying 10 will mean that
VARCHAR(9)
columns are assumed not to need scrubbing.
If you specify a number below 1 (e.g. 0), nothing will be marked for scrubbing. Be particularly careful with this!
6.3.4.2.32. ddgen_truncate_date_fields
Multiline string.
Fields whose date should be truncated to the first of the month.
6.3.4.2.33. ddgen_filename_to_text_fields
Multiline string.
Fields containing filenames, which files should be converted to text.
6.3.4.2.34. ddgen_binary_to_text_field_pairs
Multiline string (one pair per line).
Fields containing raw binary data from files (binary large objects; BLOBs), whose contents should be converted to text – paired with fields in the same table containing their file extension (e.g. “pdf”, “.PDF”) or a filename having that extension.
Specify it as a list of comma-joined pairs, e.g.
ddgen_binary_to_text_field_pairs =
binary1field, ext1field
binary2field, ext2field
...
The first (binaryfield
) can be specified as column
or table.column
,
but the second must be column
only.
6.3.4.2.35. ddgen_skip_row_if_extract_text_fails_fields
Multiline string.
Specify any text-extraction rows for which you also want to set the flag
skip_if_extract_fails
(see alter_method).
6.3.4.2.36. ddgen_rename_tables_remove_suffixes
Multiline string.
Automatic renaming of tables. This option specifies a list of suffixes to
remove from table names. (Typical use: you make a view with a suffix _x
as
a working step, then you want the suffix removed for users.)
6.3.4.2.37. ddgen_patient_opt_out_fields
Multiline string.
Fields that are used as patient opt-out fields (see above and src_flags).
6.3.4.2.38. ddgen_extra_hash_fields
Multiline string (one pair per line; see below).
Are there any fields you want hashed, in addition to the normal PID/MPID
fields? Specify these a list of FIELDSPEC, EXTRA_HASH_NAME
pairs. For
example:
ddgen_extra_hash_fields = CaseNumber*, case_number_hashdef
where case_number_hashdef
is an extra hash definition (see
extra_hash_config_sections, and
alter_method in the data dictionary).
6.3.4.3. Data dictionary generation: destination fields
This section relates to automatic creation of data dictionaries only. In normal use, these settings do nothing.
6.3.4.3.1. ddgen_force_lower_case
Boolean. Default: false.
Force all destination table/field names to lower case?
(Default changed from true to false in v0.19.3.)
6.3.4.3.2. ddgen_convert_odd_chars_to_underscore
Boolean. Default: true.
Convert spaces in table/fieldnames (yuk!) to underscores?
6.3.4.3.3. ddgen_append_source_info_to_comment
Boolean. Default: true.
When drafting a data dictionary, append the source table/field name to the column comment?
6.3.4.3.4. ddgen_index_fields
Multiline string.
Fields to apply an index to.
6.3.4.3.5. ddgen_allow_fulltext_indexing
Boolean. Default: true.
Allow full-text index creation?
(Disable for databases that don’t support full-text indexes?)
6.3.4.3.6. ddgen_freetext_index_min_length
Integer. Default: 1000.
For full-text index creation (see ddgen_freetext_index_min_length): what is the minimum (source) text field length that should have a free-text index applied?
Note
This is applied to the length of the source field, not the destination, because sometimes short source fields are expanded (to make room for “anonymisation expansion”), so the source length is usually more meaningful.
6.3.4.4. Other options for source databases
6.3.4.4.1. debug_row_limit
Integer. Default: 0.
Specify 0 (the default) for no limit, or a number of rows (e.g. 1000) to apply to any tables listed in debug_limited_tables. For those tables, only this many rows will be taken from the source database.
Use this, for example, to reduce the number of large documents fetched for testing.
If you run a multiprocess/multithreaded anonymisation, this limit applies per process (or task), not overall.
Note that these limits DO NOT APPLY to the fetching of patient-identifiable information to build “scrubbers” for anonymisation – when a patient is processed, all identification information for that patient is trawled.
6.3.4.4.2. debug_limited_tables
Multiline string.
List of tables to which to apply debug_row_limit.
6.3.5. Hasher definitions
If you use the hash
alter_method, you must specify
a config section there that is cross-referenced in the
extra_hash_config_sections
parameter of the [main] section of the config
file.
Such config sections, named e.g. [my_extra_hasher]
, must have the following
parameters:
6.3.5.1. hash_method
String.
Options are as for the hash_method parameter of the [main] section.
6.3.5.2. secret_key
String.
Secret key for the hasher.
6.3.6. Anonymisation sequence
When CRATE processes data and creates scrubbers, it follows the following sequence:
For every patient, a scrubber is built (see
crate_anon.anonymise.patient.Patient
). That scrubber can contain patient-specific, third-party, and nonspecific (generic) information to remove.NONSPECIFIC options are set according to the config file.
This can include generic options like removing all numbers with a certain number of digits, or anything that looks like a UK postcode.
It can also include words or phrases to remove, read from files specified by denylist_filenames.
PATIENT and THIRD-PARTY information is added by scanning the source database for all “scrub-source” information for that patient (optionally, recursing into third-party records).
Words and phrases are not added if they are in the “allowlist” (see allowlist_filenames).
For every data item for that patient that requires scrubbing, scrub using the scrubber we have built for that patient.
That scrubber operates as follows (see
crate_anon.anonymise.scrub.PersonalizedScrubber.scrub()
):If nonspecific_scrubber_first is True:
the NONSPECIFIC part of the scrubber runs first;
the PATIENT part of the scrubber runs second;
the THIRD-PARTY part of the scrubber runs third.
If nonspecific_scrubber_first is False,
the PATIENT part of the scrubber runs first;
the THIRD-PARTY part of the scrubber runs second.
the NONSPECIFIC part of the scrubber runs third;
6.3.7. Minimal anonymiser config
Here’s an extremely minimal version for a hypothetical test database. Many options are not shown and most comments have been removed.
# Configuration file for CRATE anonymiser (crate_anonymise).
# =============================================================================
# Main settings
# =============================================================================
[main]
data_dictionary_filename = testdd.tsv
hash_method = HMAC_MD5
per_table_patient_id_encryption_phrase = SOME_PASSPHRASE_REPLACE_ME
master_patient_id_encryption_phrase = SOME_OTHER_PASSPHRASE_REPLACE_ME
change_detection_encryption_phrase = YETANOTHER
replace_patient_info_with = [XXXXXX]
replace_third_party_info_with = [QQQQQQ]
replace_nonspecific_info_with = [~~~~~~]
research_id_fieldname = rid
trid_fieldname = trid
master_research_id_fieldname = mrid
source_hash_fieldname = _src_hash
temporary_tablename = _temp_table
source_databases =
mysourcedb1
destination_database = my_destination_database
admin_database = my_admin_database
# =============================================================================
# Destination database details. User should have WRITE access.
# =============================================================================
[my_destination_database]
url = mysql+mysqldb://username:password@127.0.0.1:3306/output_databasename?charset=utf8
# =============================================================================
# Administrative database. User should have WRITE access.
# =============================================================================
[my_admin_database]
url = mysql+mysqldb://username:password@127.0.0.1:3306/admin_databasename?charset=utf8
# =============================================================================
# Source database. (Just one in this example.)
# User should have READ access only for safety.
# =============================================================================
[mysourcedb1]
url = mysql+mysqldb://username:password@127.0.0.1:3306/source_databasename?charset=utf8
Todo
Check minimal anonymiser config example works.
6.3.8. Specimen config
A specimen anonymiser config file is available by running
crate_anon_demo_config
. You can send its output to a file using >
or
the --output
option:
USAGE: crate_anon_demo_config [-h] [--output OUTPUT] [--leave_placeholders]
Print a demo config file for the CRATE anonymiser. (CRATE version 0.20.0,
2023-02-14. Created by Rudolf Cardinal.)
OPTIONAL ARGUMENTS:
-h, --help show this help message and exit
--output OUTPUT File for output; use '-' for stdout. (default: -)
--leave_placeholders Don't substitute @@ placeholders with examples
(default: False)
Here’s the specimen anonymisation config file:
# Configuration file for CRATE anonymiser (crate_anonymise).
# Version 0.20.0 (2023-02-14).
#
# SEE HELP FOR DETAILS.
# =============================================================================
# Main settings
# =============================================================================
[main]
# -----------------------------------------------------------------------------
# Data dictionary
# -----------------------------------------------------------------------------
data_dictionary_filename = testdd.tsv
# -----------------------------------------------------------------------------
# Critical field types
# -----------------------------------------------------------------------------
sqlatype_pid =
sqlatype_mpid =
# -----------------------------------------------------------------------------
# Encryption phrases/passwords
# -----------------------------------------------------------------------------
hash_method = HMAC_MD5
per_table_patient_id_encryption_phrase = SOME_PASSPHRASE_REPLACE_ME
master_patient_id_encryption_phrase = SOME_OTHER_PASSPHRASE_REPLACE_ME
change_detection_encryption_phrase = YETANOTHER
extra_hash_config_sections =
# -----------------------------------------------------------------------------
# Text extraction
# -----------------------------------------------------------------------------
extract_text_extensions_permitted =
extract_text_extensions_prohibited =
extract_text_extensions_case_sensitive = False
extract_text_plain = True
extract_text_width = 80
# -----------------------------------------------------------------------------
# Anonymisation
# -----------------------------------------------------------------------------
allow_no_patient_info = False
replace_all_dates_with = [~~~]
replace_patient_info_with = [__PPP__]
replace_third_party_info_with = [__TTT__]
replace_nonspecific_info_with = [~~~]
thirdparty_xref_max_depth = 1
scrub_string_suffixes =
s
string_max_regex_errors = 0
min_string_length_for_errors = 3
min_string_length_to_scrub_with = 2
allowlist_filenames =
denylist_filenames =
denylist_files_as_phrases = False
denylist_use_regex = False
phrase_alternative_word_filenames =
scrub_all_dates = False
scrub_all_email_addresses = False
scrub_all_numbers_of_n_digits =
scrub_all_uk_postcodes = False
nonspecific_scrubber_first = False
anonymise_codes_at_word_boundaries_only = True
anonymise_codes_at_numeric_boundaries_only = True
anonymise_dates_at_word_boundaries_only = True
anonymise_numbers_at_word_boundaries_only = False
anonymise_numbers_at_numeric_boundaries_only = True
anonymise_strings_at_word_boundaries_only = True
# -----------------------------------------------------------------------------
# Output fields and formatting
# -----------------------------------------------------------------------------
timefield_name = _when_processed_utc
research_id_fieldname = rid
trid_fieldname = trid
master_research_id_fieldname = mrid
source_hash_fieldname = _src_hash
# -----------------------------------------------------------------------------
# Destination database configuration
# See the [destination_database] section for connection details.
# -----------------------------------------------------------------------------
max_rows_before_commit = 1000
max_bytes_before_commit = 83886080
temporary_tablename = _crate_temp_table
# -----------------------------------------------------------------------------
# Choose databases (defined in their own sections).
# -----------------------------------------------------------------------------
source_databases =
sourcedb1
# sourcedb2
destination_database = destination_database
admin_database = admin_database
# -----------------------------------------------------------------------------
# PROCESSING OPTIONS, TO LIMIT DATA QUANTITY FOR TESTING
# -----------------------------------------------------------------------------
debug_max_n_patients =
debug_pid_list =
# -----------------------------------------------------------------------------
# Opting out entirely
# -----------------------------------------------------------------------------
optout_pid_filenames =
optout_mpid_filenames =
optout_col_values =
# =============================================================================
# Extra regular expression patterns you wish to be scrubbed from the text
# as nonspecific information. See help.
# =============================================================================
[extra_regexes]
# =============================================================================
# Destination database details. User should have WRITE access.
# =============================================================================
[destination_database]
url = mysql+mysqldb://username:password@127.0.0.1:3306/output_databasename?charset=utf8
# =============================================================================
# Administrative database. User should have WRITE access.
# =============================================================================
[admin_database]
url = mysql+mysqldb://username:password@127.0.0.1:3306/admin_databasename?charset=utf8
# =============================================================================
# SOURCE DATABASE DETAILS BELOW HERE.
# User should have READ access only for safety.
# =============================================================================
# -----------------------------------------------------------------------------
# Source database example 1
# -----------------------------------------------------------------------------
[sourcedb1]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# CONNECTION DETAILS
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
url = mysql+mysqldb://username:password@127.0.0.1:3306/source_databasename?charset=utf8
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# INPUT FIELDS, FOR THE AUTOGENERATION OF DATA DICTIONARIES
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ddgen_omit_by_default = True
ddgen_omit_fields =
ddgen_include_fields =
ddgen_per_table_pid_field = patient_id
ddgen_table_defines_pids = patient
ddgen_add_per_table_pids_to_scrubber = False
ddgen_master_pid_fieldname = nhsnum
ddgen_table_denylist =
ddgen_table_allowlist =
ddgen_table_require_field_absolute =
ddgen_table_require_field_conditional =
ddgen_field_denylist =
ddgen_field_allowlist =
ddgen_pk_fields =
ddgen_constant_content = False
ddgen_constant_content_tables =
ddgen_nonconstant_content_tables =
ddgen_addition_only = False
ddgen_addition_only_tables =
ddgen_deletion_possible_tables =
ddgen_pid_defining_fieldnames =
ddgen_scrubsrc_patient_fields =
ddgen_scrubsrc_thirdparty_fields =
ddgen_scrubsrc_thirdparty_xref_pid_fields =
ddgen_required_scrubsrc_fields =
ddgen_scrubmethod_code_fields =
ddgen_scrubmethod_date_fields =
ddgen_scrubmethod_number_fields =
ddgen_scrubmethod_phrase_fields =
ddgen_safe_fields_exempt_from_scrubbing =
ddgen_min_length_for_scrubbing = 50
ddgen_truncate_date_fields =
ddgen_filename_to_text_fields =
ddgen_binary_to_text_field_pairs =
ddgen_skip_row_if_extract_text_fails_fields =
ddgen_rename_tables_remove_suffixes =
ddgen_patient_opt_out_fields =
ddgen_extra_hash_fields =
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# DESTINATION INDEXING
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ddgen_index_fields =
ddgen_allow_fulltext_indexing = True
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# DATA DICTIONARY MANIPULATION TO DESTINATION TABLE/FIELD NAMES
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ddgen_force_lower_case = False
ddgen_convert_odd_chars_to_underscore = True
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# PROCESSING OPTIONS, TO LIMIT DATA QUANTITY FOR TESTING
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
debug_row_limit =
debug_limited_tables =
# -----------------------------------------------------------------------------
# Source database example 2
# -----------------------------------------------------------------------------
[mysourcedb2]
url = mysql+mysqldb://username:password@127.0.0.1:3306/source2_databasename?charset=utf8
ddgen_force_lower_case = False
ddgen_append_source_info_to_comment = True
ddgen_per_table_pid_field = patient_id
ddgen_master_pid_fieldname = nhsnum
ddgen_table_denylist =
ddgen_field_denylist =
ddgen_table_require_field_absolute =
ddgen_table_require_field_conditional =
ddgen_pk_fields =
ddgen_constant_content = False
ddgen_scrubsrc_patient_fields =
ddgen_scrubsrc_thirdparty_fields =
ddgen_scrubmethod_code_fields =
ddgen_scrubmethod_date_fields =
ddgen_scrubmethod_number_fields =
ddgen_scrubmethod_phrase_fields =
ddgen_safe_fields_exempt_from_scrubbing =
ddgen_min_length_for_scrubbing = 50
ddgen_truncate_date_fields =
ddgen_filename_to_text_fields =
ddgen_binary_to_text_field_pairs =
# -----------------------------------------------------------------------------
# Source database example 3
# -----------------------------------------------------------------------------
[camcops]
# Example for the CamCOPS anonymisation staging database
url = mysql+mysqldb://username:password@127.0.0.1:3306/camcops_databasename?charset=utf8
# FOR EXAMPLE:
ddgen_force_lower_case = False
ddgen_per_table_pid_field = _patient_idnum1
ddgen_pid_defining_fieldnames = _patient_idnum1
ddgen_master_pid_fieldname = _patient_idnum2
ddgen_table_denylist =
ddgen_field_denylist = _patient_iddesc1
_patient_idshortdesc1
_patient_iddesc2
_patient_idshortdesc2
_patient_iddesc3
_patient_idshortdesc3
_patient_iddesc4
_patient_idshortdesc4
_patient_iddesc5
_patient_idshortdesc5
_patient_iddesc6
_patient_idshortdesc6
_patient_iddesc7
_patient_idshortdesc7
_patient_iddesc8
_patient_idshortdesc8
id
patient_id
_device
_era
_current
_when_removed_exact
_when_removed_batch_utc
_removing_user
_preserving_user
_forcibly_preserved
_predecessor_pk
_successor_pk
_manually_erased
_manually_erased_at
_manually_erasing_user
_addition_pending
_removal_pending
_move_off_tablet
ddgen_table_require_field_absolute =
ddgen_table_require_field_conditional =
ddgen_pk_fields = _pk
ddgen_constant_content = False
ddgen_scrubsrc_patient_fields = _patient_forename
_patient_surname
_patient_dob
_patient_idnum1
_patient_idnum2
_patient_idnum3
_patient_idnum4
_patient_idnum5
_patient_idnum6
_patient_idnum7
_patient_idnum8
ddgen_scrubsrc_thirdparty_fields =
ddgen_scrubmethod_code_fields =
ddgen_scrubmethod_date_fields = _patient_dob
ddgen_scrubmethod_number_fields =
ddgen_scrubmethod_phrase_fields =
ddgen_safe_fields_exempt_from_scrubbing = _device
_era
_when_added_exact
_adding_user
_when_removed_exact
_removing_user
_preserving_user
_manually_erased_at
_manually_erasing_user
when_last_modified
when_created
when_firstexit
clinician_specialty
clinician_name
clinician_post
clinician_professional_registration
clinician_contact_details
# ... now some task-specific ones
bdi_scale
pause_start_time
pause_end_time
trial_start_time
cue_start_time
target_start_time
detection_start_time
iti_start_time
iti_end_time
trial_end_time
response_time
target_time
choice_time
discharge_date
discharge_reason_code
diagnosis_psych_1_icd10code
diagnosis_psych_1_description
diagnosis_psych_2_icd10code
diagnosis_psych_2_description
diagnosis_psych_3_icd10code
diagnosis_psych_3_description
diagnosis_psych_4_icd10code
diagnosis_psych_4_description
diagnosis_medical_1
diagnosis_medical_2
diagnosis_medical_3
diagnosis_medical_4
category_start_time
category_response_time
category_chosen
gamble_fixed_option
gamble_lottery_option_p
gamble_lottery_option_q
gamble_start_time
gamble_response_time
likelihood
ddgen_min_length_for_scrubbing = 50
ddgen_truncate_date_fields = _patient_dob
ddgen_filename_to_text_fields =
ddgen_binary_to_text_field_pairs =