6.4. Data dictionary (DD)
The data dictionary is a catalogue of tables and columns (fields) in a source database (typically containing identifiable data). It tells CRATE how to transform the data into a de-identified destination database.
The data dictionary is a spreadsheet-style file: a tab-separated values (TSV) file, OpenOffice Spreadsheet (ODS) file, or Microsoft Excel XLSX (OpenXML, Excel 2007+) file.
It has a single header row, and columns as defined below.
6.4.1. Drafting a data dictionary
Once you have edited your anonymiser config file to point to your source database, you can generate a draft data dictionary like this:
crate_anon_draft_dd --output mydd.xlsx
Now edit the data dictionary as required. (And then edit your config file so it points to the data dictionary you have created.)
Full options for this tool are:
USAGE: crate_anon_draft_dd [-h] [--config CONFIG] [--verbose] [--incremental]
[--skip_dd_check] [--output OUTPUT]
[--explicit_dest_datatype] [--systmone]
[--systmone_context {tpp_sre,cpft_dw}]
[--systmone_sre_spec SYSTMONE_SRE_SPEC]
[--systmone_append_comments]
[--systmone_include_generic]
[--systmone_allow_unprefixed_tables]
[--systmone_alter_loaded_rows]
[--systmone_table_info_in_comments]
[--systmone_no_table_info_in_comments]
Draft a data dictionary for the anonymiser, by scanning a source database.
(CRATE version 0.20.4, 2023-10-17. Created by Rudolf Cardinal.)
OPTIONS:
-h, --help show this help message and exit
--config CONFIG Config file (overriding environment variable
CRATE_ANON_CONFIG). Note that the config file has
several options governing the automatic generation of
data dictionaries. (default: None)
--verbose, -v Be verbose (default: False)
--incremental Drafts an INCREMENTAL draft data dictionary
(containing fields in the database that aren't in the
existing data dictionary referred to by the config
file). (default: False)
--skip_dd_check Skip validity check (against the source database) for
the data dictionary. (default: False)
--output OUTPUT File for output; use '-' for stdout. (default: -)
--explicit_dest_datatype
(Primarily for debugging.) CRATE will convert the
source column data type (e.g. INTEGER, FLOAT,
VARCHAR(25)) to a datatype for the destination
database, sometimes with modifications. However, this
is usually implicit: the draft data dictionary doesn't
show these data types unless they require
modification. Use this option to make them all
explicit. (default: False)
--systmone Modify the data dictionary for SystmOne. CRATE knows
about some of the standard SystmOne data structure and
can read a database and customize the data dictionary
for SystmOne. (default: False)
SYSTMONE OPTIONS (FOR WHEN --SYSTMONE IS USED):
--systmone_context {tpp_sre,cpft_dw}
Context of the SystmOne database that you are reading.
(default: cpft_dw)
--systmone_sre_spec SYSTMONE_SRE_SPEC
SystmOne Strategic Reporting Extract (SRE)
specification CSV filename (from TPP, containing
table/field comments). (default: None)
--systmone_append_comments
Append to comments, rather than replacing them.
(default: False)
--systmone_include_generic
Include all 'generic' fields, overriding preferences
set via the config file options. (default: False)
--systmone_allow_unprefixed_tables
Permit tables that don't start with the expected
prefix (which is e.g. 'SR' for the TPP SRE context,
'S1_' for the CPFT Data Warehouse context). May add
helpful content, but you may get odd tables and views.
(default: False)
--systmone_alter_loaded_rows
(For --incremental.) Alter rows that were loaded from
disk (not read from a database)? The default is to
leave such rows untouched. (default: False)
--systmone_table_info_in_comments
Add table descriptions to column comments. Useful if
the database does not itself support table comments.
(default: True)
--systmone_no_table_info_in_comments
Opposite of --systmone_table_info_in_comments.
(default: False)
6.4.2. Columns in the data dictionary
The DD columns can be in any order as long as the header row matches the data, and the column headings include the headings shown here.
In TSV format, lines beginning with a hash (#) are treated as comments and ignored, as are blank lines.
There is a special row type for table comments, in which the fieldnames are blank. That is, just the database and table names are specified, with a comment (but no other flags).
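The TSV rules above (single header row, hash-prefixed comment lines and blank lines ignored, table-comment rows with a blank field name) can be sketched as follows. This is an illustrative reader, not CRATE's own loader; the function names `read_dd_rows` and `is_table_comment` are hypothetical, and the sample uses only a subset of the columns for brevity.

```python
import csv


def read_dd_rows(tsv_text):
    """Yield one dict per data row, keyed by the header row.

    Blank lines and lines beginning with '#' are dropped before parsing.
    """
    lines = [
        line
        for line in tsv_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    # The first surviving line is the header; column order does not matter.
    yield from csv.DictReader(lines, delimiter="\t")


def is_table_comment(row):
    """A table-comment row names the database and table but no field."""
    return bool(row["src_db"]) and bool(row["src_table"]) and not row["src_field"]


SAMPLE = (
    "src_db\tsrc_table\tsrc_field\tdecision\tcomment\n"
    "# This line is a comment and is ignored.\n"
    "\n"
    "mydb\tpatients\t\t\tOne row per patient\n"
    "mydb\tpatients\tdob\tinclude\tDate of birth\n"
)

rows = list(read_dd_rows(SAMPLE))
```

Because rows are keyed by heading rather than by position, the same code works whatever order the columns are in.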
6.4.2.1. src_db
String.
This column specifies the source database, using a name that matches one from the source_databases list in the config file.
6.4.2.2. src_table
String.
This column specifies the table name in the source database.
6.4.2.3. src_field
String.
This column specifies the field (column) name in the source database.
6.4.2.4. src_datatype
String.
This column gives the source column’s SQL data type (e.g. INT, VARCHAR(50)).
6.4.2.5. src_flags
String.
This field can be blank or can contain a string made up of one or more characters. The characters have the following meanings:
Character | Meaning |
---|---|
K | PK. This field is the primary key (PK) for the table it’s in. |
 | NOT NULL. This field should be set to NOT NULL in the destination. |
H | ADD SOURCE HASH. Add source hash of the record, for incremental updates? |
C | CONSTANT. Record contents are constant (will not change) for a given PK. |
 | ADDITION ONLY. Marks an addition-only table. It is assumed that records can only be added to this table, not deleted. |
P | PRIMARY PID. Primary patient ID field. If set, |
* | DEFINES PRIMARY PIDS. This field defines primary PIDs. If set, this row will be used to search for all patient IDs, and will define them for this database. Only those patients will be processed (for all tables containing patient info). Typically, this flag is applied to a SINGLE field in a SINGLE table, usually the principal patient registration/demographics table. CRATE will warn you if there is more than one such field, and will raise an error if there are none, unless allow_no_patient_info is set. |
M | MASTER PID. Master ID (e.g. NHS number). |
! | OPT OUT. This field is used to mark that the patient wishes to opt out entirely. It must be in a table that also has a primary patient ID field (because that’s the ID that will be omitted). If the opt-out field contains a value that’s defined in the optout_col_values config setting, that patient will be opted out entirely from the anonymised database. |
 | REQUIRED SCRUBBER. If this field is a scrub_src field (see below), and this flag is set, then at least one non-NULL value for this field must be present for each patient, or no information will be processed for this patient. (Typical use: where you have a master patient index separate from the patient name table, and data might have been brought across partially, so there are some missing names. In this situation, text might go unscrubbed because the names are missing. Setting this flag for the name field will prevent this.) |
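As a rough illustration of how a src_flags string such as `K*H` breaks down into individual flags, here is a minimal sketch. The character-to-meaning mapping covers only the flag characters that appear in the example data dictionary in section 6.4.3 (K, H, C, P, *, M, !); the other flags are omitted because their characters are not shown there. The function names are hypothetical, and the check on the DEFINES PRIMARY PIDS flag only mirrors the warn/error behaviour described above.

```python
import warnings

# Flag characters as used in the example in section 6.4.3 (not exhaustive).
FLAG_MEANINGS = {
    "K": "PK",
    "H": "ADD SOURCE HASH",
    "C": "CONSTANT",
    "P": "PRIMARY PID",
    "*": "DEFINES PRIMARY PIDS",
    "M": "MASTER PID",
    "!": "OPT OUT",
}


def decode_src_flags(src_flags):
    """Return the set of flag meanings encoded in a src_flags string."""
    unknown = [c for c in src_flags if c not in FLAG_MEANINGS]
    if unknown:
        raise ValueError(f"Unrecognised flag character(s): {unknown}")
    return {FLAG_MEANINGS[c] for c in src_flags}


def check_defining_fields(all_src_flags, allow_no_patient_info=False):
    """Warn if several fields define primary PIDs; fail if none do."""
    n = sum(1 for flags in all_src_flags if "*" in flags)
    if n == 0 and not allow_no_patient_info:
        raise ValueError("No field defines primary PIDs")
    if n > 1:
        warnings.warn(f"{n} fields define primary PIDs; expected one")
    return n


decoded = decode_src_flags("K*H")
```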
6.4.2.6. scrub_src
String.
One of the following values, or blank:
Value | Meaning |
---|---|
patient | Contains patient-identifiable information that must be removed from |
thirdparty | Contains identifiable information about a carer, family member, or other third party, which must be removed from |
 | This field is a patient identifier for ANOTHER patient (such as a relative). The scrubber should recursively include THAT patient’s identifying information as third-party information for THIS patient. Fields marked thus, if included in the destination database (see decision), are automatically hashed with the “primary” PID hasher, allowing you to link connected records in the research database. You cannot specify another alter_method. |
6.4.2.7. scrub_method
String.
Applicable to scrub_src fields, this column determines the manner in which this field should be treated for scrubbing. It must be one of the following values (or blank):
Value | Meaning |
---|---|
words | Treat as a set of textual words. This is the default for all textual fields (e.g. CHAR, VARCHAR, TEXT). Typically used for names: for example, “John Smith” will scrub both “John” and “Smith” separately. Also OK for e-mail addresses. |
phrase | Treat as a textual phrase (a sequence of words to be replaced only when they occur in sequence). Any superfluous whitespace at the start/end, or between words, is ignored. Typically used for address components: for example, “5 Tree Avenue” will not scrub “tree” or “avenue” by themselves, but this phrase will be scrubbed. |
 | If the value is numeric, ignore it. Otherwise, treat it as |
number | Treat as a number. This is the default for all numeric fields (e.g. INTEGER, FLOAT). If you have a phone number in a text field, use this method; it will be scrubbed regardless of spacing/punctuation. |
code | Treat as an alphanumeric code. Suited to postcodes. Very like the numeric method, but permits non-digits. |
date | Treat as a date, and scrub any recognizable representations of that date. This is the default for all DATE/DATETIME fields. |
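The difference between scrubbing a value as separate words and scrubbing it as a phrase can be sketched with regular expressions. This is a simplified illustration, not CRATE's scrubber: the function names, the `[XXX]` replacement text, and the choice of case-insensitive whole-word matching are all assumptions made for the example.

```python
import re


def words_patterns(value):
    """One pattern per word: each word is scrubbed wherever it occurs."""
    return [
        re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in value.split()
    ]


def phrase_pattern(value):
    """One pattern for the whole phrase; whitespace between words is flexible."""
    words = [re.escape(word) for word in value.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


def scrub(text, patterns, replacement="[XXX]"):
    """Replace every match of every pattern."""
    for pattern in patterns:
        text = pattern.sub(replacement, text)
    return text


patterns = words_patterns("John Smith") + [phrase_pattern("5 Tree Avenue")]
result = scrub("John met Mr Smith at 5  Tree Avenue, near a tree.", patterns)
```

In this sketch both "John" and "Smith" are replaced independently, the address is replaced despite the double space, and the lone word "tree" is left alone because it does not occur as part of the full phrase.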
6.4.2.8. decision
String.
One of the following two values:
Value | Meaning |
---|---|
OMIT | Omit the field from the output entirely. |
include | Include it. |
This is case sensitive, for safety.
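A case-sensitive check of this column might look like the sketch below. The two spellings used, `OMIT` and `include`, are those that appear in the example data dictionary in section 6.4.3; the function name is hypothetical.

```python
VALID_DECISIONS = {"OMIT", "include"}  # spellings as in the 6.4.3 example


def validate_decision(decision):
    """Return the decision unchanged, or raise if it is not an exact match."""
    if decision not in VALID_DECISIONS:
        raise ValueError(
            f"Bad decision {decision!r}; expected one of {sorted(VALID_DECISIONS)}"
        )
    return decision
```

Rejecting near-misses such as `omit` or `Include`, rather than guessing, means a typo cannot silently let an identifiable field through.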
6.4.2.9. inclusion_values
String.
Either blank, or an expression that evaluates to a Python iterable (e.g. list or tuple) with Python’s ast.literal_eval() function (see https://docs.python.org/3.4/library/ast.html).
If this is not blank/None, then it serves as a ROW INCLUSION LIST – the source row will only be processed if the field’s value is one of the inclusion values.
It applies to the raw value from the database (before any transformation via alter_method).
This is not applied to scrub_src fields (which contribute to the scrubber regardless).
Note that [None] is a list with one member, None, whereas None is equivalent to leaving the field blank.
Examples:
[None, 0]
[True, 1, 'yes', 'true', 'Yes', 'True']
6.4.2.10. exclusion_values
String.
As for inclusion_values, but the row is excluded if the field’s value is in the exclusion_values list.
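The way inclusion_values and exclusion_values cells are interpreted can be sketched with `ast.literal_eval()`. This is an illustration of the rules described above, not CRATE's own code; the function names are hypothetical.

```python
import ast


def parse_value_list(cell):
    """Evaluate a DD cell; blank (or the literal None) means 'no list'."""
    cell = cell.strip()
    if not cell:
        return None
    return ast.literal_eval(cell)  # safe: evaluates literals only


def row_passes(raw_value, inclusion_cell="", exclusion_cell=""):
    """Decide whether a source row is processed, given the raw field value."""
    inclusion = parse_value_list(inclusion_cell)
    exclusion = parse_value_list(exclusion_cell)
    if inclusion is not None and raw_value not in inclusion:
        return False  # an inclusion list exists and the value is not in it
    if exclusion is not None and raw_value in exclusion:
        return False  # the value is explicitly excluded
    return True
```

Note how `[None, 0]` admits rows whose raw value is NULL (None) or 0, while a cell containing just `None` imposes no restriction at all.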
6.4.2.11. alter_method
String.
Manner in which to alter the data. Blank, or a comma-separated list of one or more of the following. (You should replace aspects in capitals with appropriate values.)
Component | Meaning |
---|---|
scrub | Scrub in. Applies to text fields only. The field will have its contents anonymised (using information from other fields). Use this for any text field that end users might store free-text comments in. |
truncate_date | Truncate this date to the first of the month. Applicable to text or date-as-text fields. |
 | Convert a binary field (e.g. `VARBINARY`, `BLOB`) to text (e.g. `LONGTEXT`). Insert your chosen field name in place of EXTFIELDNAME. The binary data is taken to be the representation of a document. The field must be in the same source table, must contain the file extension (e.g. |
filename_to_text | As for the binary-to-text option, but the field contains a full filename (the contents of which is converted to text), rather than containing binary data directly. |
filename_format_to_text=FMT | A more powerful way of specifying a filename that can be created using data from this table. Replace FMT with an unquoted Python str.format() string; see https://docs.python.org/3.4/library/stdtypes.html#str.format. The dictionary passed to format() is created from all fields in the row. Using an example from RiO: if your ClientDocuments table contains a ClientID column (with a value like filename_format_to_text=C:\some\path\{ClientID}\docs\{Path} You probably want to apply this |
skip_if_extract_fails | If one of the text extraction methods is specified, and this flag is also specified, then the data row will be skipped if text extraction fails (rather than inserted with a NULL value for the text). This is helpful, for example, if your text-processing pipeline breaks; the option prevents rows being created erroneously with NULL text values, so that a subsequent incremental update will fix the problems once you’ve fixed your text extraction tools. |
html_unescape | HTML encoding is removed, e.g. convert |
html_untag | HTML tags are removed, e.g. from |
 | Hash this field, using the hasher specified in the config file section that you name. |
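To illustrate the filename-format option described in the table above, the sketch below applies a str.format() string to a dictionary built from one row. The row values (`12345`, `letter.pdf`) and the extra `DocType` column are invented for the example, and `build_filename` is a hypothetical helper name.

```python
def build_filename(fmt, row):
    """Fill a str.format() template from a dict of all fields in the row."""
    return fmt.format(**row)


# Hypothetical row from a ClientDocuments-style table.
row = {"ClientID": 12345, "Path": "letter.pdf", "DocType": "clinic"}

# Raw string so that the backslashes in the Windows path are kept literally.
fmt = r"C:\some\path\{ClientID}\docs\{Path}"

filename = build_filename(fmt, row)
```

Fields in the row that the format string does not mention (here, `DocType`) are simply ignored by str.format().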
You can specify multiple options separated by commas.
Not all are compatible (e.g. scrubbing is for text; date truncation is for dates).
If there’s more than one, text extraction from BLOBs/files is performed first. After that, they are executed in sequence. (The position of the skip-if-text-extraction-fails flag is immaterial.)
A typical combination might be:
filename_to_text,skip_if_extract_fails,scrub
or:
html_untag,html_unescape,scrub
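The ordering rule above (text extraction first, then the remaining components in the order written, with the skip flag's position ignored) can be sketched like this. It is an illustration only: `plan_alter_methods` is a hypothetical name, and the set of extraction components lists only the two filename-based names that appear in this section, so the binary-to-text component would need adding.

```python
SKIP_FLAG = "skip_if_extract_fails"

# Text-extraction components named in this section (assumed incomplete:
# the binary-to-text component is not listed here).
EXTRACTORS = {"filename_to_text", "filename_format_to_text"}


def plan_alter_methods(alter_method):
    """Return (ordered components, skip-on-failure flag) for a DD cell."""
    components = [c.strip() for c in alter_method.split(",") if c.strip()]
    skip_if_extract_fails = SKIP_FLAG in components
    steps = [c for c in components if c != SKIP_FLAG]

    def is_extractor(component):
        # Compare the part before any '=' (e.g. 'filename_format_to_text=...').
        return component.split("=", 1)[0] in EXTRACTORS

    ordered = [c for c in steps if is_extractor(c)]
    ordered += [c for c in steps if not is_extractor(c)]
    return ordered, skip_if_extract_fails


plan = plan_alter_methods("scrub,skip_if_extract_fails,filename_to_text")
```

A plain split on commas is a simplification: it would mis-parse a filename format string that itself contains a comma.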
6.4.2.12. dest_table
String.
Table name in the destination database.
6.4.2.13. dest_field
String.
Field (column) name in the destination database.
6.4.2.14. dest_datatype
String. Default: none.
SQL data type in the destination database.
If omitted, the source SQL data type is translated appropriately.
6.4.2.15. index
String.
One of:
Value | Meaning |
---|---|
(blank) | No index. |
I | Create a normal index on the destination field. |
U | Create a unique index on the destination field. |
F | Create a FULLTEXT index, for rapid searching within long text fields. Only applicable to one field per table. |
6.4.2.16. indexlen
Integer. Default: none.
Can be blank. If not, sets the prefix length of the index. This is mandatory in MySQL if you apply a normal or unique index to a TEXT or BLOB field. It is not required for FULLTEXT indexes.
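To show how index and indexlen combine, the sketch below builds the kind of MySQL statement that a prefix-length index corresponds to. It is not CRATE's code: the function name and the index naming scheme (`_idx_<field>`) are invented, and the index codes I, U and F are taken from the example in section 6.4.3.

```python
def create_index_sql(table, field, index, indexlen=None):
    """Return a MySQL CREATE INDEX statement, or None if no index is wanted."""
    if not index:
        return None
    keyword = {"I": "", "U": "UNIQUE ", "F": "FULLTEXT "}[index]
    column = f"`{field}`"
    # A prefix length applies to normal/unique indexes, not to FULLTEXT ones.
    if indexlen is not None and index != "F":
        column += f"({indexlen})"
    return f"CREATE {keyword}INDEX `_idx_{field}` ON `{table}` ({column})"


sql = create_index_sql("notes", "note", "I", indexlen=50)
```

Without the `(50)` prefix length, MySQL would refuse to build a normal index on a TEXT or BLOB column.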
6.4.2.17. comment
String.
Field (column) comment, stored in the destination database.
6.4.3. Minimal data dictionary example
This illustrates a data dictionary for a fictional database.
Some more specialist columns (inclusion_values, exclusion_values) are not shown for clarity.
src_db src_table src_field src_datatype src_flags scrub_src scrub_method decision alter_method dest_table dest_field dest_datatype index indexlen comment
------- ---------- ------------ ------------- ---------- ---------- ------------- --------- ----------------------- ----------- ----------- -------------- ------ --------- ----------------------------------------------------
# The source table "patients" defines our patients.
# This is also a primary source of information that is used to build our scrubbers.
# Most information shouldn't come through to the destination database, but some (e.g. DOB) is helpful in a truncated form.
# This table also includes our master opt-out switch.
mydb patients patientnum INTEGER(11) K*H patient number OMIT Local patient ID (PID); will be replaced by RID+TRID
mydb patients nhsnum INTEGER(11) M patient number OMIT NHS number (MPID); will be replaced by MRID
mydb patients dob DATE patient date include truncate_date patients dob DATE Date of birth (truncated to first of month)
mydb patients dod DATE include patients dod DATE Date of death, or NULL if alive
mydb patients forename VARCHAR(255) patient words OMIT
mydb patients surname VARCHAR(255) patient words OMIT
mydb patients telephone VARCHAR(255) patient number OMIT A phone number.
mydb patients opt_out_anon BIT !
# The "addresses" table gives (potentially several) addresses per patient.
mydb addresses pk INTEGER(11) KH include addresses pk INTEGER(11) U Arbitrary address PK.
mydb addresses patientnum INTEGER(11) P OMIT
mydb addresses from_date DATE include addresses from_date I
mydb addresses to_date DATE include addresses to_date I
mydb addresses line1 VARCHAR(255) patient phrase OMIT
mydb addresses line2 VARCHAR(255) patient phrase OMIT
mydb addresses line3 VARCHAR(255) patient phrase OMIT
mydb addresses line4 VARCHAR(255) patient phrase OMIT
mydb addresses line5 VARCHAR(255) patient phrase OMIT
mydb addresses postcode VARCHAR(10) patient code OMIT UK postcode.
mydb addresses lsoa VARCHAR(10) include addresses lsoa Lower Super Output Area, added by CRATE preprocessor (calculated from postcode).
mydb addresses imd INTEGER include addresses imd UK Index of Multiple Deprivation, added by CRATE preprocessor.
# The "relatives" table gives us some third-party information to add to our scrubbers.
mydb relatives pk INTEGER(11) KH OMIT
mydb relatives patientnum INTEGER(11) P OMIT
mydb relatives relationship VARCHAR(255) OMIT
mydb relatives forename VARCHAR(255) thirdparty words OMIT
mydb relatives surname VARCHAR(255) thirdparty words OMIT
# The "notes" table contains simple text that needs scrubbing.
mydb notes pk INTEGER(11) KH include notes pk INTEGER(11) U
mydb notes patientnum INTEGER(11) P OMIT Patient ID will be replaced by RID+TRID
mydb notes when DATETIME include notes when DATETIME I
mydb notes note VARCHAR(MAX) include scrub notes note LONGTEXT Gives the scrubbed note.
# The "documents" table uses filenames to refer to binary documents on disk, which need scrubbing.
# (If binary documents won't change once added, you might want to set the "C" flag on "doc_id", instead of "H", for efficiency.)
mydb documents doc_id INTEGER(11) KH include documents doc_id INTEGER(11) U Document PK
mydb documents patientnum INTEGER(11) P OMIT Patient ID will be replaced by RID+TRID
mydb documents when_added DATETIME include documents when_added DATETIME I
mydb documents filename VARCHAR(255) include filename_to_text,scrub documents contents LONGTEXT F Becomes scrubbed document contents with FULLTEXT index.
Todo
Check minimal data dictionary example works.