6.3. The anonymiser config file

This file controls the behaviour of the anonymiser, and tells it where to find the source, destination, and secret databases, and the data dictionary that controls the conversion process for each database column.

You can generate a specimen config file with

crate_anonymise --democonfig > test_anon_config.ini

You should save this, then edit it to your own needs. A copy is shown below.

For convenience, you may want the CRATE_ANON_CONFIG environment variable to point to this file. (Otherwise you must specify it each time.)

6.3.1. Format of the configuration file

  • The config file is in standard INI file format.
  • UTF-8 encoding. Use this! The file is explicitly opened in UTF-8 mode.
  • Comments. Hashes (#) and semicolons (;) denote comments.
  • Sections. Sections are indicated with: [section]
  • Name/value (key/value) pairs. The parser used is ConfigParser. It allows name=value or name:value.
  • Avoid indentation of parameters. (Indentation is used to indicate the continuation of previous parameters.)
  • Parameter types, referred to below, are:
    • String. Single-line strings are simple.
    • Multiline string. Here, a series of lines is read and split into a list of strings (one for each line). You should indent all lines except the first beyond the level of the parameter name, and then they will be treated as one parameter value.
    • Integer. Simple.
    • Boolean. For Boolean options, true values are any of: 1, yes, true, on (case-insensitive). False values are any of: 0, no, false, off.

6.3.2. [main] section

6.3.2.1. Data dictionary

6.3.2.1.1. data_dictionary_filename

String.

Specify the filename of a data dictionary in TSV (tab-separated value) format, with a header row. See Data Dictionary.

6.3.2.2. Critical field types

6.3.2.2.2. sqlatype_mpid

String.

We need to know PID and MPID types from the config so that we can set up our secret mapping tables. You can leave these blank, in which case they will be assumed to be large integers, using SQLAlchemy’s BigInteger (e.g. SQL Server’s BIGINT). If you do specify them, you may specify EITHER BigInteger or a string type such as String(50).

6.3.2.3. Encryption phrases/passwords

6.3.2.3.1. hash_method

String.

PID-to-RID hashing method. Options are:

  • HMAC_MD5 – use MD5 and produce a 32-character digest
  • HMAC_SHA256 – use SHA256 and produce a 64-character digest
  • HMAC_SHA512 – use SHA512 and produce a 128-character digest

6.3.2.3.2. per_table_patient_id_encryption_phrase

String.

Secret phrase with which to hash the PID (creating the RID).

6.3.2.3.3. master_patient_id_encryption_phrase

String.

Secret phrase with which to hash the MPID (creating the MRID).

6.3.2.3.4. change_detection_encryption_phrase

String.

Secret phrase with which to hash content (storing the result in the output database), so that changes in content can be detected.

6.3.2.3.5. extra_hash_config_sections

Multiline string.

If you are using the “hash” field alteration method (see alter_method), you need to list the hash methods here, for internal initialization order/performance reasons.

See hasher definitions for how to define these.

6.3.2.4. Text extraction

6.3.2.4.1. extract_text_extensions_permitted

Multiline string.

extract_text_extensions_permitted and extract_text_extensions_prohibited govern what kinds of files are accepted for text extraction. It is very likely that you’ll want to apply such restrictions; for example, if your database contains .jpg files, it’s a waste of trying to extract text from them (and in theory, if your text extraction tool provided sufficient detail, such as binary-encoding the JPEG, you might leak identifiable information, such as a photo).

  • The “permitted” and “prohibited” settings are both lists of strings.
  • If the “permitted” list is not empty then a file will be processed only if its extension is in the permitted list. Otherwise, it will be processed only if it is not in the prohibited list.
  • The extensions must include the “.” prefix.
  • Case sensitivity is controlled by the extra flag, extract_text_extensions_case_sensitive.

6.3.2.4.4. extract_text_plain

Boolean. Default: true. (Changed to true from v0.18.88.)

Use the plainest possible layout for text extraction?

False = better for human layout. Table example from DOCX:

┼─────────────┼─────────────┼
│ Row 1 col 1 │ Row 1 col 2 │
┼─────────────┼─────────────┼
│ Row 2 col 1 │ Row 2 col 2 │
┼─────────────┼─────────────┼

True = good for natural language processing. Table example from DOCX:

╔═════════════════════════════════════════════════════════════════╗
Row 1 col 1
───────────────────────────────────────────────────────────────────
Row 1 col 2
═══════════════════════════════════════════════════════════════════
Row 2 col 1
───────────────────────────────────────────────────────────────────
Row 2 col 2
╚═════════════════════════════════════════════════════════════════╝

… note the absence of vertical interruptions, and that text from one cell remains contiguous.

6.3.2.4.5. extract_text_width

Integer. Default: 80.

Default width (in columns) to word-wrap extracted text to.

6.3.2.5. Anonymisation

6.3.2.5.1. replace_patient_info_with

String.

Patient information will be replaced with this. For example, XXXXXX or [___] or [__PPP__] or [__ZZZ__]; the bracketed forms can be a bit easier to spot, and work better if they directly abut other text.

6.3.2.5.2. replace_third_party_info_with

String.

Third-party information (e.g. information about family members) will be replaced by this. For example, YYYYYY or [...] or [__TTT__] or [__QQQ__].

6.3.2.5.3. thirdparty_xref_max_depth

Integer. Default 1.

For fields marked as scrub_src = thirdparty_xref_pid (see scrub_src), how deep should we recurse? Beware making this too large; the recursion trawls a lot of information (and also uses an extra simultaneous database cursor for each recursion).

6.3.2.5.4. replace_nonspecific_info_with

String.

Things to be removed irrespective of patient-specific information will be replaced by this (for example, if you opt to remove all things looking like telephone numbers). For example, ZZZZZZ or [~~~].

6.3.2.5.5. scrub_string_suffixes

Multiline string.

Strings to append to every “scrub from” string.

For example, include “s” if you want to scrub “Roberts” whenever you scrub “Robert”.

Applies to scrub methods words, but not to phrase (see scrub_method).

6.3.2.5.6. string_max_regex_errors

Integer. Default: 0.

Specify maximum number of errors (insertions, deletions, substitutions) in string regex matching. Beware using a high number! Suggest 1-2.

6.3.2.5.7. min_string_length_for_errors

Integer. Default: 1.

Is there a minimum length to apply string_max_regex_errors? For example, if you allow one typo and someone is called Ian, all instances of ‘in’ or ‘an’ will be wiped. Note that this applies to scrub-source data.

6.3.2.5.8. min_string_length_to_scrub_with

Integer. Default: 2.

Is there a minimum length of string to scrub WITH? For example, if you specify 2, you allow two-letter names such as Al to be scrubbed, but you allow initials through, and therefore prevent e.g. ‘A’ from being scrubbed from the destination. Note that this applies to scrub-source data.

6.3.2.5.9. whitelist_filenames

Multiline string.

Whitelist.

Are there any words not to scrub? For example, “the”, “road”, “street” often appear in addresses, but you might not want them removed. Be careful in case these could be names (e.g. “Lane”).

Specify these as a list of filenames, where the files contain words; e.g.

whitelist_filenames = /some/path/short_english_words.txt

Here’s a suggestion for some of the sorts of words you might include:

am
an
as
at
bd
by
he
if
is
it
me
mg
od
of
on
or
re
so
to
us
we
her
him
tds
she
the
you
road
street

6.3.2.5.10. blacklist_filenames

Multiline string.

Blacklist.

Are there any words you always want to remove?

Specify these as a list of filenames, e.g

blacklist_filenames =
    /some/path/boy_names.txt
    /some/path/girl_names.txt
    /some/path/common_surnames.txt

6.3.2.5.11. phrase_alternative_word_filenames

Multiline string.

Alternatives for common words. These will be used to find alternative phrases which will be scrubbed. The files specified should be in comma separated variable (CSV) form.

Examples of alternative words include street types: https://en.wikipedia.org/wiki/Street_suffix.

6.3.2.5.12. scrub_all_numbers_of_n_digits

Multiline list of integers.

Use nonspecific scrubbing of numbers of a certain length?

For example, scrubbing all 11-digit numbers will remove modern UK telephone numbers in conventional format. To do this, specify scrub_all_numbers_of_n_digits = 11. You could scrub both 10- and 11-digit numbers by specifying both numbers (in multiline format, as above); 10-digit numbers would include all NHS numbers. Avoid using this for short numbers; you may lose valuable numeric data!

6.3.2.5.13. scrub_all_uk_postcodes

Boolean. Default: false.

Nonspecific scrubbing of UK postcodes?

See https://www.mrs.org.uk/pdf/postcodeformat.pdf; these can look like

FORMAT    EXAMPLE
AN NAA    M1 1AA
ANN NAA   M60 1NW
AAN NAA   CR2 6XH
AANN NAA  DN55 1PT
ANA NAA   W1A 1HQ
AANA NAA  EC1A 1BB

6.3.2.5.14. anonymise_codes_at_word_boundaries_only

Boolean. Default: true.

Anonymise codes only when they are found at word boundaries?

Applies to the code scrub method (see scrub_method).

True is more liberal (produces less scrubbing); False is more conservative (more scrubbing; higher chance of over-scrubbing) and will deal with accidental word concatenation. With ID numbers, beware if you use a prefix, e.g. if people write M123456 or R123456; in that case you will need anonymise_numbers_at_word_boundaries_only = False.

6.3.2.5.15. anonymise_dates_at_word_boundaries_only

Boolean. Default: true.

As for anonymise_codes_at_word_boundaries_only, but applies to the date scrub method (see scrub_method).

6.3.2.5.16. anonymise_numbers_at_word_boundaries_only

Boolean. Default: false.

As for anonymise_codes_at_word_boundaries_only, but applies to the number scrub method (see scrub_method).

6.3.2.5.17. anonymise_numbers_at_numeric_boundaries_only

Boolean. Default: true.

Similar to anonymise_numbers_at_word_boundaries_only, and similarly applies to the number scrub method (see scrub_method); however, this relates to whether numbers are scrubbed only at numeric boundaries.

If True, CRATE will not scrub “234” from “123456”. Setting this to False is extremely conservative (all sorts of numbers may be scrubbed). You probably want this set to True.

6.3.2.5.18. anonymise_strings_at_word_boundaries_only

Boolean. Default: true.

As for anonymise_codes_at_word_boundaries_only, but applies to the words and the phrase scrub methods (see scrub_method).

6.3.2.6. Other anonymisation options

You can also specify additional “nonspecific” regular expressions yourself. See extra_regexes.

6.3.2.7. Output fields and formatting

6.3.2.7.1. timefield_name

String.

Name of the DATETIME column to be created in every output table indicating when CRATE processed that row (see crate_anon.anonymise.anonymise.process_table()).

An example might be _when_processed_utc.

6.3.2.7.2. research_id_fieldname

String.

Research ID (RID) field name for destination tables. This will be a VARCHAR of length determined by hash_method. Used to replace patient ID fields from source tables.

6.3.2.7.3. trid_fieldname

String.

Transient integer research ID (TRID) fieldname. An unsigned integer field with this name will be added to every table containing a primary patient ID (in the source) or research ID (in the destination).

6.3.2.7.4. master_research_id_fieldname

String.

Master research ID (MRID) field name for destination tables. This will be a VARCHAR of length determined by hash_method. Used to replace master patient ID fields from source tables.

6.3.2.7.5. source_hash_fieldname

String.

Change-detection hash fieldname for destination tables. This will be a VARCHAR of length determined by hash_method. Used to hash entire rows to see if they’ve changed later.

6.3.2.7.6. ddgen_append_source_info_to_comment

Boolean. Default: true.

When drafting a data dictionary, append the source table/field name to the column comment?

6.3.2.8. Destination database configuration

6.3.2.8.1. max_rows_before_commit

Integer. Default: 1000.

Specify the maximum number of rows to be processed before a COMMIT is issued on the database transaction(s). This prevents the transaction(s) growing too large.

6.3.2.8.2. max_bytes_before_commit

Integer. Default: 80 Mb (80 * 1024 * 1024 = 83886080).

Specify the maximum number of source-record bytes (approximately!) that are processed before a COMMIT is issued on the database transaction(s). This prevents the transaction(s) growing too large. The COMMIT will be issued after this limit has been met/exceeded, so it may be exceeded if the transaction just before the limit takes the cumulative total over the limit.

6.3.2.8.3. temporary_tablename

String.

We need a temporary table name for incremental updates. This can’t be the name of a real destination table. It lives in the destination database.

6.3.2.9. Choose databases (defined in their own sections)

Parameter values in this section are themselves config file section names. For example, if you refer to a database called mydb, CRATE will look for a database config section named [mydb].

6.3.2.9.1. source_databases

Multiline string (of database config section names).

Source database list. Can be lots.

6.3.2.9.2. destination_database

String (a database config section name).

Destination database. Just one.

6.3.2.9.3. admin_database

String (a database config section name).

Secret admin database. Just one.

6.3.2.10. Processing options, to limit data quantity for testing

6.3.2.10.1. debug_max_n_patients

Integer. Default: 0.

Limit the number of patients to be processed? Specify 0 (the default) for no limit.

6.3.2.10.2. debug_pid_list

Multiline string.

Specify a list of patient IDs to use, for debugging? If specified, only these patients will be processed – this list will be used directly (overriding the patient ID source specified in the data dictionary, and overriding debug_max_n_patients).

6.3.2.11. Opting out entirely

Patients who elect to opt out entirely have their PIDs stored in the OptOut table of the admin database. ENTRIES ARE NEVER REMOVED FROM THIS LIST BY CRATE. It can be populated in several ways:

  1. Manually, by adding a PID to the column opt_out_pid.pid in the admin database. See crate_anon.anonymise.models.OptOutPid.
  2. Similarly, by adding an MPID to the column opt_out_mpid.mpid in the admin database. See crate_anon.anonymise.models.OptOutMpid.
  3. By maintaining a text file list of integer PIDs/MPIDs. Any PIDs/MPIDs in this file/files are added to the opt-out list. See below.
  4. By flagging a source database field as indicating an opt-out, using the ! marker in src_flags). See below.

6.3.2.11.1. optout_pid_filenames

Multiline string.

If you set this, each line of each named file is scanned for an integer, taken to be the PID of a patient who wishes to opt out.

6.3.2.11.2. optout_mpid_filenames

Multiline string.

If you set this, each line of each named file is scanned for an integer, taken to be the MPID of a patient who wishes to opt out.

6.3.2.11.3. optout_col_values

List of Python values.

If you mark a field in the data dictionary as an opt-out field (see above and src_flags), that says “the field tells you whether the patient opts out or not”. But is it “opt out” or “not”? If the actual value matches a value specified here, then it’s “opt out”.

Specify a LIST OF PYTHON VALUES; for example:

optout_col_values = [True, 1, '1', 'Yes', 'yes', 'Y', 'y']

6.3.3. [extra_regexes] section

Arbitrary number of name (string), value (string) pairs.

This section is optional.

Here, you can specify extra regular expression patterns (regexes) that you wish to be scrubbed from the text as nonspecific information (see replace_nonspecific_info_with).

These regexes can be multiline and contain comments – just remember to escape spaces and hash signs which you actually want to be part of the regex.

They have their own section so that you can use the parameter name as a helpful descriptive name (these names are ignored, and you could specify a giant regex combining them yourself, but CRATE will do that for you to enhance config file legibility and convenience). You can name each of them anything, e.g.

[extra_regexes]

my_regex_canadian_postcodes = [a-zA-Z][0-9][a-zA-Z]\w+[0-9][a-zA-Z][0-9]

another_regex =
   \d+\#x    # a number then a hash sign then an 'x'
   \d+\ y    # then another number then a space then 'y'

6.3.4. Database config sections

Config file sections that define databases always have the url parameter. Destination and admin databases need only this. Source databases have other options, as below.

Warning

You should permit CRATE write access to the destination and admin databases, but only read access to source databases. It doesn’t need more, so read-only is safer.

6.3.4.1. Connection details

6.3.4.1.1. url

String.

Use SQLAlchemy URLs: see http://docs.sqlalchemy.org/en/latest/core/engines.html.

For example:

url = mysql+mysqldb://username:password@127.0.0.1:3306/output_databasename?charset=utf8

You may need to install additional drivers, e.g.

pip install SOME_DRIVER

… see database drivers.

6.3.4.2. Data dictionary generation: input fields

This section relates to automatic creation of data dictionaries only. In normal use, none of these settings does anything.

In this section, fields (columns) can either be specified as column (to match a column with that name any table) or table.column, to match a specific column in a specific table.

The specifications are case-insensitive.

Wildcards (* and ?) may also be used (as per Python’s fnmatch). Thus, one can write specifications like addr*.address_line_* to match all of address_current.address_line_1, address_previous.address_line_4, etc.

6.3.4.2.1. ddgen_omit_by_default

Boolean. Default: true.

By default, most fields (except PKs and patient ID codes) are marked as OMIT (see decision), pending human review. If you want to live dangerously, set this to False, and they will be marked as include from the outset.

6.3.4.2.2. ddgen_omit_fields

Multiline string, of field/column specifications.

You can specify additional fields to omit (see decision) here. Settings here override ddgen_include_fields – that is, “omit” overrides “include”.

6.3.4.2.3. ddgen_include_fields

Multiline string, of field/column specifications.

You can specify additional fields to include (see decision) here.

If a field contains scrubbing source information (see scrub_src), it will also be omitted pending human review, regardless of other settings.

6.3.4.2.4. ddgen_allow_no_patient_info

Boolean. Default: false.

Allow the absence of patient info? Used to copy databases; WILL NOT ANONYMISE.

6.3.4.2.5. ddgen_per_table_pid_field

String.

Specify the name of a (typically integer) patient identifier (PID) field present in EVERY table. It will be replaced by the research ID (RID) in the destination database.

6.3.4.2.6. ddgen_add_per_table_pids_to_scrubber

Boolean. Default: false.

Add every instance of a per-table PID field to the patient scrubber?

This is a very conservative setting, and should be unnecessary as the single master “PID-defining” column (see ddgen_pid_defining_fieldnames) should be enough.

(Note that per-table PIDs are always replaced by RIDs – this setting governs whether the scrubber used to scrub free-text fields also works through every single per-table PID.)

6.3.4.2.7. ddgen_master_pid_fieldname

String.

Master patient ID fieldname. Used for e.g. NHS numbers. This information will be replaced by the MRID in the destination database.

6.3.4.2.8. ddgen_table_blacklist

Multiline string.

Blacklist any tables when creating new data dictionaries?

This is case-insensitive, and you can use * and ? wildcards (as per Python’s fnmatch module).

6.3.4.2.9. ddgen_table_whitelist

Multiline string.

Whitelist any tables? (Whitelists override blacklists.)

6.3.4.2.10. ddgen_table_require_field_absolute

Multiline string.

List any fields that all tables MUST contain. If a table doesn’t contain all of the field(s) listed here, it will be skipped.

6.3.4.2.11. ddgen_table_require_field_conditional

Multiline string (one pair per line).

List any fields that are required conditional on other fields. List them as one or more pairs: A, B where B is required if A is present (or the table will be skipped).

6.3.4.2.12. ddgen_field_blacklist

Multiline string.

Blacklist any fields (regardless of their table) when creating new data dictionaries? Wildcards of * and ? operate as above.

6.3.4.2.13. ddgen_field_whitelist

Multiline string.

Whitelist any fields? (Whitelists override blacklists.)

6.3.4.2.14. ddgen_pk_fields

Multiline string.

Fieldnames assumed to be their table’s PK.

6.3.4.2.15. ddgen_constant_content

Boolean. Default: false.

Assume that content stays constant?

Applies the C flags to PK fields; see src_flags. This then becomes the default, after which ddgen_constant_content_tables and ddgen_nonconstant_content_tables can override (of which, ddgen_nonconstant_content_tables takes priority if a table matches both).

6.3.4.2.16. ddgen_constant_content_tables

Multiline string.

Table-specific overrides for ddgen_constant_content, as above.

6.3.4.2.18. ddgen_addition_only

Boolean. Default: false.

Assume that records can only be added, not deleted?

6.3.4.2.19. ddgen_addition_only_tables

Multiline string.

Table-specific overrides for ddgen_addition_only, similarly.

6.3.4.2.20. ddgen_deletion_possible_tables

Multiline string.

Table-specific overrides for ddgen_addition_only, similarly.

6.3.4.2.21. ddgen_pid_defining_fieldnames

Multiline string.

Predefine field(s) that define the existence of patient IDs? UNUSUAL to want to do this.

6.3.4.2.22. ddgen_scrubsrc_patient_fields

Multiline string.

Field names assumed to provide patient information for scrubbing.

6.3.4.2.23. ddgen_scrubsrc_thirdparty_fields

Multiline string.

Field names assumed to provide third-party information for scrubbing.

6.3.4.2.24. ddgen_scrubsrc_thirdparty_xref_pid_fields

Multiline string.

Field names assumed to contain PIDs of third parties (e.g. relatives also in the patient database), to be used to look up the third party in a recursive way, for scrubbing.

6.3.4.2.25. ddgen_required_scrubsrc_fields

Multiline string.

Are any scrub_src fields required (mandatory), i.e. must have non-NULL data in at least one row (or the patient will be skipped)?

6.3.4.2.26. ddgen_scrubmethod_code_fields

Multiline string.

Fields to enforce the code scrub_method upon, overriding the default method.

6.3.4.2.27. ddgen_scrubmethod_date_fields

Multiline string.

Fields to enforce the date scrub_method upon, overriding the default method.

6.3.4.2.28. ddgen_scrubmethod_number_fields

Multiline string.

Fields to enforce the number scrub_method upon, overriding the default method.

6.3.4.2.29. ddgen_scrubmethod_phrase_fields

Multiline string.

Fields to enforce the phrase scrub_method upon, overriding the default method.

6.3.4.2.30. ddgen_safe_fields_exempt_from_scrubbing

Multiline string.

Known safe fields, exempt from scrubbing.

6.3.4.2.31. ddgen_min_length_for_scrubbing

Integer. Default: 0.

Define minimum text column length for scrubbing (fields shorter than this value are assumed safe). For example, specifying 10 will mean that VARCHAR(9) columns are assumed not to need scrubbing.

6.3.4.2.32. ddgen_truncate_date_fields

Multiline string.

Fields whose date should be truncated to the first of the month.

6.3.4.2.33. ddgen_filename_to_text_fields

Multiline string.

Fields containing filenames, which files should be converted to text.

6.3.4.2.34. ddgen_binary_to_text_field_pairs

Multiline string (one pair per line).

Fields containing raw binary data from files (binary large objects; BLOBs), whose contents should be converted to text – paired with fields in the same table containing their file extension (e.g. “pdf”, “.PDF”) or a filename having that extension.

Specify it as a list of comma-joined pairs, e.g.

ddgen_binary_to_text_field_pairs =
    binary1field, ext1field
    binary2field, ext2field
    ...

The first (binaryfield) can be specified as column or table.column, but the second must be column only.

6.3.4.2.35. ddgen_skip_row_if_extract_text_fails_fields

Multiline string.

Specify any text-extraction rows for which you also want to set the flag skip_if_extract_fails (see alter_method).

6.3.4.2.36. ddgen_rename_tables_remove_suffixes

Multiline string.

Automatic renaming of tables. This option specifies a list of suffixes to remove from table names. (Typical use: you make a view with a suffix _x as a working step, then you want the suffix removed for users.)

6.3.4.2.37. ddgen_patient_opt_out_fields

Multiline string.

Fields that are used as patient opt-out fields (see above and src_flags).

6.3.4.2.38. ddgen_extra_hash_fields

Multiline string (one pair per line; see below).

Are there any fields you want hashed, in addition to the normal PID/MPID fields? Specify these a list of FIELDSPEC, EXTRA_HASH_NAME pairs. For example:

ddgen_extra_hash_fields = CaseNumber*, case_number_hashdef

where case_number_hashdef is an extra hash definition (see extra_hash_config_sections, and alter_method in the data dictionary).

6.3.4.3. Data dictionary generation: destination indexing

This section relates to automatic creation of data dictionaries only. In normal use, none of these settings does anything.

6.3.4.3.1. ddgen_index_fields

Multiline string.

Fields to apply an index to.

6.3.4.3.2. ddgen_allow_fulltext_indexing

Boolean. Default: true.

Allow full-text index creation?

(Disable for databases that don’t support full-text indexes?)

6.3.4.4. Data dictionary generation: altering destination table/field names

This section relates to automatic creation of data dictionaries only. In normal use, none of these settings does anything.

6.3.4.4.1. ddgen_force_lower_case

Boolean. Default: true.

Force all destination table/field names to lower case?

6.3.4.4.2. ddgen_convert_odd_chars_to_underscore

Boolean. Default: true.

Convert spaces in table/fieldnames (yuk!) to underscores?

6.3.4.5. Other options for source databases

6.3.4.5.1. debug_row_limit

Integer. Default: 0.

Specify 0 (the default) for no limit, or a number of rows (e.g. 1000) to apply to any tables listed in debug_limited_tables. For those tables, only this many rows will be taken from the source database.

Use this, for example, to reduce the number of large documents fetched.

If you run a multiprocess/multithreaded anonymisation, this limit applies per process (or task), not overall.

Note that these limits DO NOT APPLY to the fetching of patient- identifiable information for anonymisation – when a patient is processed, all identifiable information for that patient is trawled.

6.3.4.5.2. debug_limited_tables

Multiline string.

List of tables to which to apply debug_row_limit.

6.3.5. Hasher definitions

If you use the hash alter_method, you must specify a config section there that is cross-referenced in the extra_hash_config_sections parameter of the [main] section of the config file.

Such config sections, named e.g. [my_extra_hasher], must have the following parameters:

6.3.5.1. hash_method

String.

Options are as for the hash_method parameter of the [main] section.

6.3.5.2. secret_key

String.

Secret key for the hasher.

6.3.6. Minimal anonymiser config

Here’s an extremely minimal version for a hypothetical test database. Many options are not shown and most comments have been removed.

# Configuration file for CRATE anonymiser (crate_anonymise).

# =============================================================================
# Main settings
# =============================================================================

[main]

data_dictionary_filename = testdd.tsv

hash_method = HMAC_MD5
per_table_patient_id_encryption_phrase = SOME_PASSPHRASE_REPLACE_ME
master_patient_id_encryption_phrase = SOME_OTHER_PASSPHRASE_REPLACE_ME
change_detection_encryption_phrase = YETANOTHER

replace_patient_info_with = [XXXXXX]
replace_third_party_info_with = [QQQQQQ]
replace_nonspecific_info_with = [~~~~~~]

research_id_fieldname = rid
trid_fieldname = trid
master_research_id_fieldname = mrid

source_hash_fieldname = _src_hash

temporary_tablename = _temp_table

source_databases =
    mysourcedb1

destination_database = my_destination_database

admin_database = my_admin_database

# =============================================================================
# Destination database details. User should have WRITE access.
# =============================================================================

[my_destination_database]

url = mysql+mysqldb://username:password@127.0.0.1:3306/output_databasename?charset=utf8

# =============================================================================
# Administrative database. User should have WRITE access.
# =============================================================================

[my_admin_database]

url = mysql+mysqldb://username:password@127.0.0.1:3306/admin_databasename?charset=utf8

# =============================================================================
# Source database. (Just one in this example.)
# User should have READ access only for safety.
# =============================================================================

[mysourcedb1]

url = mysql+mysqldb://username:password@127.0.0.1:3306/source_databasename?charset=utf8

Todo

Check minimal anonymiser config example works.

6.3.7. Specimen config

A specimen NLP config is available by running crate_nlp --democonfig. In the source, it is crate_anon.nlp_manager.constants.DEMO_CONFIG.

Here’s the specimen NLP config:

# Configuration file for CRATE anonymiser (crate_anonymise).
# Version 0.18.92 (2019-10-10).
#
# SEE HELP FOR DETAILS.

# =============================================================================
# Main settings
# =============================================================================

[main]

# -----------------------------------------------------------------------------
# Data dictionary
# -----------------------------------------------------------------------------

data_dictionary_filename = testdd.tsv

# -----------------------------------------------------------------------------
# Critical field types
# -----------------------------------------------------------------------------

sqlatype_pid =
sqlatype_mpid =

# -----------------------------------------------------------------------------
# Encryption phrases/passwords
# -----------------------------------------------------------------------------

hash_method = HMAC_MD5

per_table_patient_id_encryption_phrase = SOME_PASSPHRASE_REPLACE_ME

master_patient_id_encryption_phrase = SOME_OTHER_PASSPHRASE_REPLACE_ME

change_detection_encryption_phrase = YETANOTHER

extra_hash_config_sections =

# -----------------------------------------------------------------------------
# Text extraction
# -----------------------------------------------------------------------------

extract_text_extensions_permitted =
extract_text_extensions_prohibited =
extract_text_extensions_case_sensitive = False

extract_text_plain = False

extract_text_width = 80

# -----------------------------------------------------------------------------
# Anonymisation
# -----------------------------------------------------------------------------

replace_patient_info_with = [__PPP__]

replace_third_party_info_with = [__TTT__]

thirdparty_xref_max_depth = 1

replace_nonspecific_info_with = [~~~]

scrub_string_suffixes =
    s

string_max_regex_errors = 1

min_string_length_for_errors = 4

min_string_length_to_scrub_with = 2

whitelist_filenames =

blacklist_filenames =

phrase_alternative_word_filenames =

scrub_all_numbers_of_n_digits =

scrub_all_uk_postcodes = False

anonymise_codes_at_word_boundaries_only = True
anonymise_dates_at_word_boundaries_only = True
anonymise_numbers_at_word_boundaries_only = False
anonymise_numbers_at_numeric_boundaries_only = True
anonymise_strings_at_word_boundaries_only = True

# -----------------------------------------------------------------------------
# Output fields and formatting
# -----------------------------------------------------------------------------

timefield_name = _when_processed_utc

research_id_fieldname = brcid
trid_fieldname = trid
master_research_id_fieldname = nhshash

source_hash_fieldname = _src_hash

ddgen_append_source_info_to_comment = True

# -----------------------------------------------------------------------------
# Destination database configuration
# See the [destination_database] section for connection details.
# -----------------------------------------------------------------------------

max_rows_before_commit = 1000
max_bytes_before_commit = 83886080

temporary_tablename = _temp_table

# -----------------------------------------------------------------------------
# Choose databases (defined in their own sections).
# -----------------------------------------------------------------------------

source_databases =
    mysourcedb1
    mysourcedb2

destination_database = my_destination_database

admin_database = my_admin_database

# -----------------------------------------------------------------------------
# PROCESSING OPTIONS, TO LIMIT DATA QUANTITY FOR TESTING
# -----------------------------------------------------------------------------

debug_max_n_patients =
debug_pid_list =

# -----------------------------------------------------------------------------
# Opting out entirely
# -----------------------------------------------------------------------------

optout_pid_filenames =
optout_mpid_filenames =

optout_col_values =


# =============================================================================
# Extra regular expression patterns you wish to be scrubbed from the text
# as nonspecific information. See help.
# =============================================================================

[extra_regexes]


# =============================================================================
# Destination database details. User should have WRITE access.
# =============================================================================

[my_destination_database]

url = mysql+mysqldb://username:password@127.0.0.1:3306/output_databasename?charset=utf8


# =============================================================================
# Administrative database. User should have WRITE access.
# =============================================================================

[my_admin_database]

url = mysql+mysqldb://username:password@127.0.0.1:3306/admin_databasename?charset=utf8


# =============================================================================
# SOURCE DATABASE DETAILS BELOW HERE.
# User should have READ access only for safety.
# =============================================================================

# -----------------------------------------------------------------------------
# Source database example 1
# -----------------------------------------------------------------------------

[mysourcedb1]

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # CONNECTION DETAILS
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

url = mysql+mysqldb://username:password@127.0.0.1:3306/source_databasename?charset=utf8

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # INPUT FIELDS, FOR THE AUTOGENERATION OF DATA DICTIONARIES
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ddgen_omit_by_default = True
ddgen_omit_fields =
ddgen_include_fields =

ddgen_allow_no_patient_info = False

ddgen_per_table_pid_field = patient_id

ddgen_add_per_table_pids_to_scrubber = False

ddgen_master_pid_fieldname = nhsnum

ddgen_table_blacklist =
ddgen_table_whitelist =

ddgen_table_require_field_absolute =
ddgen_table_require_field_conditional =

ddgen_field_blacklist =
ddgen_field_whitelist =

ddgen_pk_fields =

ddgen_constant_content = False
ddgen_constant_content_tables =
ddgen_nonconstant_content_tables =
ddgen_addition_only = False
ddgen_addition_only_tables =
ddgen_deletion_possible_tables =

ddgen_pid_defining_fieldnames =

ddgen_scrubsrc_patient_fields =
ddgen_scrubsrc_thirdparty_fields =
ddgen_scrubsrc_thirdparty_xref_pid_fields =

ddgen_required_scrubsrc_fields =

ddgen_scrubmethod_code_fields =
ddgen_scrubmethod_date_fields =
ddgen_scrubmethod_number_fields =
ddgen_scrubmethod_phrase_fields =

ddgen_safe_fields_exempt_from_scrubbing =

ddgen_min_length_for_scrubbing = 4

ddgen_truncate_date_fields =

ddgen_filename_to_text_fields =
ddgen_binary_to_text_field_pairs =

ddgen_skip_row_if_extract_text_fails_fields =

ddgen_rename_tables_remove_suffixes =

ddgen_patient_opt_out_fields =

ddgen_extra_hash_fields =

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # DESTINATION INDEXING
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ddgen_index_fields =
ddgen_allow_fulltext_indexing = True

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # DATA DICTIONARY MANIPULATION TO DESTINATION TABLE/FIELD NAMES
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ddgen_force_lower_case = True
ddgen_convert_odd_chars_to_underscore = True

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # PROCESSING OPTIONS, TO LIMIT DATA QUANTITY FOR TESTING
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

debug_row_limit =
debug_limited_tables =

# -----------------------------------------------------------------------------
# Source database example 2
# -----------------------------------------------------------------------------

[mysourcedb2]

url = mysql+mysqldb://username:password@127.0.0.1:3306/source2_databasename?charset=utf8

ddgen_force_lower_case = True
ddgen_per_table_pid_field = patient_id
ddgen_master_pid_fieldname = nhsnum
ddgen_table_blacklist =
ddgen_field_blacklist =
ddgen_table_require_field_absolute =
ddgen_table_require_field_conditional =
ddgen_pk_fields =
ddgen_constant_content = False
ddgen_scrubsrc_patient_fields =
ddgen_scrubsrc_thirdparty_fields =
ddgen_scrubmethod_code_fields =
ddgen_scrubmethod_date_fields =
ddgen_scrubmethod_number_fields =
ddgen_scrubmethod_phrase_fields =
ddgen_safe_fields_exempt_from_scrubbing =
ddgen_min_length_for_scrubbing = 4
ddgen_truncate_date_fields =
ddgen_filename_to_text_fields =
ddgen_binary_to_text_field_pairs =

# -----------------------------------------------------------------------------
# Source database example 3
# -----------------------------------------------------------------------------

[camcops]
# Example for the CamCOPS anonymisation staging database

url = mysql+mysqldb://username:password@127.0.0.1:3306/camcops_databasename?charset=utf8

# FOR EXAMPLE:
ddgen_force_lower_case = True
ddgen_per_table_pid_field = _patient_idnum1
ddgen_pid_defining_fieldnames = _patient_idnum1
ddgen_master_pid_fieldname = _patient_idnum2

ddgen_table_blacklist =

ddgen_field_blacklist = _patient_iddesc1
    _patient_idshortdesc1
    _patient_iddesc2
    _patient_idshortdesc2
    _patient_iddesc3
    _patient_idshortdesc3
    _patient_iddesc4
    _patient_idshortdesc4
    _patient_iddesc5
    _patient_idshortdesc5
    _patient_iddesc6
    _patient_idshortdesc6
    _patient_iddesc7
    _patient_idshortdesc7
    _patient_iddesc8
    _patient_idshortdesc8
    id
    patient_id
    _device
    _era
    _current
    _when_removed_exact
    _when_removed_batch_utc
    _removing_user
    _preserving_user
    _forcibly_preserved
    _predecessor_pk
    _successor_pk
    _manually_erased
    _manually_erased_at
    _manually_erasing_user
    _addition_pending
    _removal_pending
    _move_off_tablet

ddgen_table_require_field_absolute =
ddgen_table_require_field_conditional =
ddgen_pk_fields = _pk
ddgen_constant_content = False

ddgen_scrubsrc_patient_fields = _patient_forename
    _patient_surname
    _patient_dob
    _patient_idnum1
    _patient_idnum2
    _patient_idnum3
    _patient_idnum4
    _patient_idnum5
    _patient_idnum6
    _patient_idnum7
    _patient_idnum8

ddgen_scrubsrc_thirdparty_fields =

ddgen_scrubmethod_code_fields =
ddgen_scrubmethod_date_fields = _patient_dob
ddgen_scrubmethod_number_fields =
ddgen_scrubmethod_phrase_fields =

ddgen_safe_fields_exempt_from_scrubbing = _device
    _era
    _when_added_exact
    _adding_user
    _when_removed_exact
    _removing_user
    _preserving_user
    _manually_erased_at
    _manually_erasing_user
    when_last_modified
    when_created
    when_firstexit
    clinician_specialty
    clinician_name
    clinician_post
    clinician_professional_registration
    clinician_contact_details
# ... now some task-specific ones
    bdi_scale
    pause_start_time
    pause_end_time
    trial_start_time
    cue_start_time
    target_start_time
    detection_start_time
    iti_start_time
    iti_end_time
    trial_end_time
    response_time
    target_time
    choice_time
    discharge_date
    discharge_reason_code
    diagnosis_psych_1_icd10code
    diagnosis_psych_1_description
    diagnosis_psych_2_icd10code
    diagnosis_psych_2_description
    diagnosis_psych_3_icd10code
    diagnosis_psych_3_description
    diagnosis_psych_4_icd10code
    diagnosis_psych_4_description
    diagnosis_medical_1
    diagnosis_medical_2
    diagnosis_medical_3
    diagnosis_medical_4
    category_start_time
    category_response_time
    category_chosen
    gamble_fixed_option
    gamble_lottery_option_p
    gamble_lottery_option_q
    gamble_start_time
    gamble_response_time
    likelihood

ddgen_min_length_for_scrubbing = 4

ddgen_truncate_date_fields = _patient_dob
ddgen_filename_to_text_fields =
ddgen_binary_to_text_field_pairs =



# Generated at 2019-10-10 10:23:25