5. Preprocessing tools
These tools:

- reshape specific databases for CRATE:
  - crate_preprocess_pcmis – preprocess a PCMIS database
  - crate_preprocess_rio – preprocess a RiO database
  - crate_preprocess_systmone – index a SystmOne database
- fetch external data used for anonymisation:
  - crate_postcodes – fetch ONS postcode information
  - crate_fetch_wordlists – fetch forenames, surnames, and medical eponyms
- perform fuzzy identity matching for linking different databases securely.
Although they are usually run before anonymisation, it’s probably more helpful to read the Anonymisation section first.
5.1. crate_preprocess_pcmis
Options:
USAGE: crate_preprocess_pcmis [-h] --url URL [-v] [--print] [--echo]
[--drop_danger_drop] [--debug_skiptables]
[--postcodedb POSTCODEDB]
[--geogcols [GEOGCOLS ...]]
[--settings_filename SETTINGS_FILENAME]
Alters a PCMIS database to be suitable for CRATE.
OPTIONS:
-h, --help show this help message and exit
--url URL SQLAlchemy database URL (default: None)
-v, --verbose Verbose (default: False)
--print Print SQL but do not execute it. (You can redirect the
printed output to create an SQL script.) (default:
False)
--echo Echo SQL (default: False)
--drop_danger_drop REMOVES new columns and indexes, rather than creating
them. (There's not very much danger; no real
information is lost, but it might take a while to
recalculate it.) (default: False)
--debug_skiptables DEBUG-ONLY OPTION. Skip tables (view creation only)
(default: False)
--postcodedb POSTCODEDB
Specify database (schema) name for ONS Postcode
Database (as imported by CRATE) to link to addresses
as a view. With SQL Server, you will have to specify
the schema as well as the database; e.g. "--postcodedb
ONS_PD.dbo" (default: None)
--geogcols [GEOGCOLS ...]
List of geographical information columns to link in
from ONS Postcode Database. BEWARE that you do not
specify anything too identifying. (default: ['bua11',
'buasd11', 'casward', 'imd', 'lea', 'lsoa01',
'lsoa11', 'msoa01', 'msoa11', 'nuts', 'oac01',
'oac11', 'parish', 'pcon', 'pct', 'ru11ind',
'statsward', 'ur01ind'])
--settings_filename SETTINGS_FILENAME
Specify filename to write draft ddgen_* settings to,
for use in a CRATE anonymiser configuration file.
(default: None)
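As an illustrative sketch only (the DSN-style URL is an assumption, not a site default), a cautious first run might generate the SQL for review rather than executing it. The command is echoed here so the sketch is safe to copy; remove the echo to run the tool for real:

```shell
# Hypothetical invocation: build the command line and echo it for review.
# The URL below is an assumed SQL Server DSN; substitute your own.
# --print writes the SQL instead of executing it, so the output can be
# redirected to a script and inspected first.
url="mssql+pyodbc://@pcmis_dsn"
echo crate_preprocess_pcmis --url "${url}" --print
```

Redirecting the real (un-echoed) command's output, e.g. `> preprocess_pcmis.sql`, gives you a reviewable SQL script, as the --print help text describes.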
5.2. crate_preprocess_rio
The RiO preprocessor creates a unique integer field named crate_pk in all tables (copying the existing integer PK, creating one from an existing non-integer primary key, or adding a new one using SQL Server’s INT IDENTITY(1, 1) type). For all patient tables, it makes the patient ID (RiO number) into an integer, called crate_rio_number. It then adds indexes and views. All of these can be removed again, or updated incrementally if you add new data.
The views ‘denormalize’ the data for convenience, since it can be pretty hard to follow the key chain of fully normalized tables. The views conform mostly to the names used by the Servelec RiO CRIS Extraction Program (RCEP), with added consistency. Because user lookups are common, to save typing (and in some cases keep the field length below the 64-character column name limit of MySQL), the following abbreviations are used:
(Table of abbreviations and their expansions omitted; only one expansion, “Responsible Clinician”, is recoverable here.)
Options:
USAGE: crate_preprocess_rio [-h] --url URL [-v] [--print] [--echo] [--rcep]
[--drop_danger_drop] [--cpft] [--debug_skiptables]
[--prognotes_current_only | --prognotes_all]
[--clindocs_current_only | --clindocs_all]
[--allergies_current_only | --allergies_all]
[--audit_info | --no_audit_info]
[--postcodedb POSTCODEDB]
[--geogcols [GEOGCOLS ...]]
[--settings_filename SETTINGS_FILENAME]
* Alters a RiO database to be suitable for CRATE.
* By default, this treats the source database as being a copy of a RiO
database (slightly later than version 6.2; exact version unclear).
Use the "--rcep" (+/- "--cpft") switch(es) to treat it as a
Servelec RiO CRIS Extract Program (RCEP) v2 output database.
OPTIONS:
-h, --help show this help message and exit
--url URL SQLAlchemy database URL
-v, --verbose Verbose
--print Print SQL but do not execute it. (You can redirect the
printed output to create an SQL script.)
--echo Echo SQL
--rcep Treat the source database as the product of Servelec's
RiO CRIS Extract Program v2 (instead of raw RiO)
--drop_danger_drop REMOVES new columns and indexes, rather than creating
them. (There's not very much danger; no real
information is lost, but it might take a while to
recalculate it.)
--cpft Apply hacks for Cambridgeshire & Peterborough NHS
Foundation Trust (CPFT) RCEP database. Only applicable
with --rcep
--debug_skiptables DEBUG-ONLY OPTION. Skip tables (view creation only)
--prognotes_current_only
Progress_Notes view restricted to current versions
only (* default)
--prognotes_all Progress_Notes view shows old versions too
--clindocs_current_only
Clinical_Documents view restricted to current versions
only (*)
--clindocs_all Clinical_Documents view shows old versions too
--allergies_current_only
Client_Allergies view restricted to current info only
--allergies_all Client_Allergies view shows deleted allergies too (*)
--audit_info Audit information (creation/update times) added to
views
--no_audit_info No audit information added (*)
--postcodedb POSTCODEDB
Specify database (schema) name for ONS Postcode
Database (as imported by CRATE) to link to addresses
as a view. With SQL Server, you will have to specify
the schema as well as the database; e.g. "--postcodedb
ONS_PD.dbo"
--geogcols [GEOGCOLS ...]
List of geographical information columns to link in
from ONS Postcode Database. BEWARE that you do not
specify anything too identifying. Default: bua11
buasd11 casward imd lea lsoa01 lsoa11 msoa01 msoa11
nuts oac01 oac11 parish pcon pct ru11ind statsward
ur01ind
--settings_filename SETTINGS_FILENAME
Specify filename to write draft ddgen_* settings to,
for use in a CRATE anonymiser configuration file.
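In the same hedged style (the DSN and the postcode database/schema name are assumptions), a RiO run that also links in ONS postcode geography might look like this; the command is echoed for review, so remove the echo to execute it:

```shell
# Hypothetical RiO preprocessing command, echoed for review rather than run.
# 'rio_dsn' and 'ONS_PD.dbo' are assumed names; substitute your own.
url="mssql+pyodbc://@rio_dsn"
echo crate_preprocess_rio \
    --url "${url}" \
    --postcodedb ONS_PD.dbo \
    --prognotes_current_only
```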
5.3. crate_preprocess_systmone
Preprocess SystmOne data by indexing it. (It shouldn’t need further reshaping.)
Options:
USAGE: crate_preprocess_systmone [-h] --url URL [-v] [--print] [--echo]
[--systmone_context {tpp_sre,cpft_dw}]
[--systmone_allow_unprefixed_tables]
[--postcodedb POSTCODEDB]
[--geogcols [GEOGCOLS ...]]
[--drop_danger_drop]
Indexes a SystmOne database to be suitable for CRATE.
OPTIONS:
-h, --help show this help message and exit
--url URL SQLAlchemy database URL
-v, --verbose Verbose
--print Print SQL but do not execute it. (You can redirect the
printed output to create an SQL script.)
--echo Echo SQL
--systmone_context {tpp_sre,cpft_dw}
Context of the SystmOne database that you are reading.
--systmone_allow_unprefixed_tables
Permit tables that don't start with the expected
prefix (which is e.g. 'SR' for the TPP SRE context,
'S1_' for the CPFT Data Warehouse context). May add
helpful content, but you may get odd tables and views.
--postcodedb POSTCODEDB
Specify database (schema) name for ONS Postcode
Database (as imported by CRATE) to link to addresses
as a view. With SQL Server, you will have to specify
the schema as well as the database; e.g. "--postcodedb
ONS_PD.dbo"
--geogcols [GEOGCOLS ...]
List of geographical information columns to link in
from ONS Postcode Database. BEWARE that you do not
specify anything too identifying. Default: bua11
buasd11 casward imd lea lsoa01 lsoa11 msoa01 msoa11
nuts oac01 oac11 parish pcon pct ru11ind statsward
ur01ind
--drop_danger_drop REMOVES new columns and indexes, rather than creating
them. (There's not very much danger; no real
information is lost, but it might take a while to
recalculate it.)
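A minimal sketch, with an assumed DSN, for a SystmOne database in the TPP Strategic Reporting Extract context; as above, the command is echoed for review and the echo should be removed to execute it:

```shell
# Hypothetical SystmOne indexing command, echoed for review.
# 'systmone_dsn' is an assumed DSN; substitute your own.
url="mssql+pyodbc://@systmone_dsn"
echo crate_preprocess_systmone --url "${url}" --systmone_context tpp_sre
```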
5.4. crate_postcodes
Options:
USAGE: crate_postcodes [-h] [--dir DIR] [--url URL] [--echo]
[--reportevery REPORTEVERY] [--commitevery COMMITEVERY]
[--startswith STARTSWITH [STARTSWITH ...]] [--replace]
[--skiplookup]
[--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES ...]]
[--list_lookup_tables] [--skippostcodes] [--docsonly]
[-v]
- This program reads data from the UK Office for National Statistics Postcode
  Database (ONSPD) and inserts it into a database.
- You will need to download the ONSPD from
https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
e.g. ONSPD_MAY_2016_csv.zip (79 MB), and unzip it (>1.4 GB) to a directory.
Tell this program which directory you used.
- Specify your database as an SQLAlchemy connection URL: see
https://docs.sqlalchemy.org/en/latest/core/engines.html
The general format is:
dialect[+driver]://username:password@host[:port]/database[?key=value...]
- If you get an error like:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in
position 33: ordinal not in range(256)
then try appending "?charset=utf8" to the connection URL.
- ONS POSTCODE DATABASE LICENSE.
Output using this program must add the following attribution statements:
Contains OS data © Crown copyright and database right
Contains Royal Mail data © Royal Mail copyright and database right
Contains National Statistics data © Crown copyright and database right
See https://www.ons.gov.uk/methodology/geography/licences
OPTIONS:
-h, --help show this help message and exit
--dir DIR Root directory of unzipped ONSPD download (default:
/path/to/unzipped/ONSPD/download)
--url URL SQLAlchemy database URL (default: None)
--echo Echo SQL (default: False)
--reportevery REPORTEVERY
Report every n rows (default: 1000)
--commitevery COMMITEVERY
Commit every n rows. If you make this too large
(relative e.g. to your MySQL max_allowed_packet
setting), you may get crashes with errors like 'MySQL
has gone away'. (default: 10000)
--startswith STARTSWITH [STARTSWITH ...]
Restrict to postcodes that start with one of these
strings (default: None)
--replace Replace tables even if they exist (default: skip
existing tables) (default: False)
--skiplookup Skip generation of code lookup tables (default: False)
--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES ...]
Within the lookup tables, process only specific named
tables (default: None)
--list_lookup_tables List all possible lookup tables, then stop (default:
False)
--skippostcodes Skip generation of main (large) postcode table
(default: False)
--docsonly Show help for postcode table then stop (default:
False)
-v, --verbose Verbose (default: False)
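As an illustrative sketch only (the directory, credentials, and postcode prefix are all assumptions): restricting the import with --startswith is a quick way to test your setup before loading the full table. The command is echoed for review; remove the echo to execute it:

```shell
# Hypothetical: import only postcodes starting 'CB', as a small test run.
# The directory and MySQL credentials are made up; note '?charset=utf8',
# as suggested above for Unicode encoding errors.
onspd_dir="${HOME}/onspd_unzipped"
url="mysql+mysqldb://user:password@127.0.0.1/onspd?charset=utf8"
echo crate_postcodes --dir "${onspd_dir}" --url "${url}" --startswith CB
```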
5.5. crate_fetch_wordlists
This tool assists in fetching common word lists, such as name lists for global denial, and words to exclude from such lists (such as English words or medical eponyms). It also provides an exclusion filter system, to find lines in some files that are absent from others.
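Conceptually, the exclusion filter works like this minimal shell sketch (illustrative only, with made-up file names; crate_fetch_wordlists implements the equivalent via --filter_input, --filter_include, and --filter_exclude):

```shell
# Keep input words that appear in the inclusion list but not the exclusion
# list; whole-line, fixed-string matching. All file names are made up.
printf 'alice\nbob\ncarol\n' > demo_input.txt
printf 'alice\nbob\ndora\n'  > demo_include.txt
printf 'bob\n'               > demo_exclude.txt
grep -Fxf demo_include.txt demo_input.txt \
    | grep -Fxvf demo_exclude.txt > demo_output.txt
cat demo_output.txt    # prints: alice
```

Here "carol" fails the inclusion test and "bob" is excluded, leaving only "alice".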
The purpose of creating large name lists is usually to remove more names. However, you will probably want to exclude medical eponyms, such as Parkinson (for Parkinson’s disease), from those lists; CRATE has a hand-curated list of these. (If a patient is named Parkinson, though, and CRATE is told to remove that name as a patient-specific identifier, it will also be removed from phrases like “Parkinson’s disease”, leaving e.g. “[XXX]’s disease”, which may itself be an identifying clue; but 100% reliable text de-identification is impossible.)
The overlap between names and English words is really tricky.
If you use all names from this set and exclude all valid English words (e.g. from a “valid answers in Scrabble or crosswords” list), you will remove from the namelist – and thus NOT remove from text being nonspecifically scrubbed – names such as John (john is a noun) and Veronica (veronica is also a noun).
If you keep all names in the exclusion namelist, though, you will scrub words like excellent, fought, friend, games, he, hope, husband, joyful, kitten, knuckle, libel, limp, lovely, man, memory, mood, music, no, power, powers, sad, stress, true, yes, you, young, zone (to list but a few); these are all names.
A compromise may be to start with all names, then remove medical eponyms and common English words. CRATE provides tools to count word frequencies in a subset of the Project Gutenberg corpus. For example, you might remove English words that together account for the top 99% of that corpus and are also valid Scrabble/crossword words. The process is shown in the specimen usage below.
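The cumulative-frequency idea can be illustrated with a tiny made-up CSV in the word,word_freq,cum_freq format that --gutenberg_word_freq produces; keeping rows with cumulative frequency <= 0.99 retains the common words and drops the rare tail (a sketch of the idea, not the tool's implementation):

```shell
# Made-up frequency data in word,word_freq,cum_freq format; keep only
# the common words (cumulative frequency <= 0.99), dropping the rare tail.
cat > word_freq_demo.csv <<'EOF'
word,word_freq,cum_freq
the,0.50,0.50
and,0.30,0.80
of,0.19,0.99
zyzzyva,0.01,1.00
EOF
awk -F, 'NR > 1 && $3 + 0 <= 0.99 { print $1 }' word_freq_demo.csv
```

This prints "the", "and", and "of", but not "zyzzyva"; the real tool does the analogous filtering via --filter_words_by_freq with --wordfreqfilter_max_cum_freq.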
Options:
USAGE: crate_fetch_wordlists [-h] [--verbose]
[--min_word_length MIN_WORD_LENGTH]
[--show_rejects]
[--debug_names [DEBUG_NAMES ...]]
[--english_words]
[--english_words_output ENGLISH_WORDS_OUTPUT]
[--english_words_url ENGLISH_WORDS_URL]
[--valid_word_regex VALID_WORD_REGEX]
[--gutenberg_word_freq]
[--gutenberg_word_freq_output GUTENBERG_WORD_FREQ_OUTPUT]
[--gutenberg_id_first GUTENBERG_ID_FIRST]
[--gutenberg_id_last GUTENBERG_ID_LAST]
[--filter_words_by_freq]
[--wordfreqfilter_input WORDFREQFILTER_INPUT]
[--wordfreqfilter_output WORDFREQFILTER_OUTPUT]
[--wordfreqfilter_min_cum_freq WORDFREQFILTER_MIN_CUM_FREQ]
[--wordfreqfilter_max_cum_freq WORDFREQFILTER_MAX_CUM_FREQ]
[--us_forenames]
[--us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT]
[--us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT]
[--us_forenames_url US_FORENAMES_URL]
[--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT]
[--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT]
[--us_forenames_output US_FORENAMES_OUTPUT]
[--us_surnames]
[--us_surnames_output US_SURNAMES_OUTPUT]
[--us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT]
[--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL]
[--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL]
[--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT]
[--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT]
[--eponyms] [--eponyms_output EPONYMS_OUTPUT]
[--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]]
[--filter_input [FILTER_INPUT ...]]
[--filter_include [FILTER_INCLUDE ...]]
[--filter_exclude [FILTER_EXCLUDE ...]]
[--filter_output [FILTER_OUTPUT]]
OPTIONS:
-h, --help show this help message and exit
--verbose, -v Be verbose (default: False)
--min_word_length MIN_WORD_LENGTH
Minimum word (or name) length to allow (default: 2)
--show_rejects Print to stdout (and, in verbose mode, log) the words
being rejected (default: False)
--debug_names [DEBUG_NAMES ...]
Show extra detail about these names (e.g. to work out
your preferred frequency thresholds) (default: None)
ENGLISH WORDS:
--english_words Fetch English words (e.g. to remove from the
nonspecific denylist, not to add to an allowlist;
consider words like smith) (default: False)
--english_words_output ENGLISH_WORDS_OUTPUT
Output file for English words (default: None)
--english_words_url ENGLISH_WORDS_URL
URL for a textfile containing all English words (will
then be filtered) (default:
https://www.gutenberg.org/files/3201/files/CROSSWD.TXT
)
--valid_word_regex VALID_WORD_REGEX
Regular expression to determine valid English words
(default: ^[a-z](?:[A-Za-z'-]*[a-z])*$)
--gutenberg_word_freq
Fetch words from Project Gutenberg with frequencies
(default: False)
--gutenberg_word_freq_output GUTENBERG_WORD_FREQ_OUTPUT
Output file for English words with frequencies. CSV
file with columns: word, word_freq, cum_freq.
(default: None)
--gutenberg_id_first GUTENBERG_ID_FIRST
For word counting: first Project Gutenberg book ID
(default: 100)
--gutenberg_id_last GUTENBERG_ID_LAST
For word counting: last Project Gutenberg book ID
(default: 110)
FILTER ENGLISH WORDS BY FREQUENCY:
--filter_words_by_freq
Read a CSV file from --gutenberg_word_freq, filter it
by cumulative word frequency, and write a plain list
of words. (default: False)
--wordfreqfilter_input WORDFREQFILTER_INPUT
Input filename. Usually the output of
--gutenberg_word_freq_output. (default: None)
--wordfreqfilter_output WORDFREQFILTER_OUTPUT
Output filename. Plain text file. (default: None)
--wordfreqfilter_min_cum_freq WORDFREQFILTER_MIN_CUM_FREQ
Minimum cumulative frequency. (Set to >0 to exclude
common words.) (default: 0.0)
--wordfreqfilter_max_cum_freq WORDFREQFILTER_MAX_CUM_FREQ
Maximum cumulative frequency. (Set to <1 to exclude
rare words.) (default: 1.0)
US FORENAMES:
--us_forenames Fetch US forenames (for denylist) (default: False)
--us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT
Output CSV file for US forenames with frequencies
(columns are: name, frequency) (default: None)
--us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT
Output CSV file for US forenames with sex and
frequencies (columns are: name, gender, frequency)
(default: None)
--us_forenames_url US_FORENAMES_URL
URL to Zip file of US Census-derived forenames lists
(excludes names with national frequency <5; see
https://www.ssa.gov/OACT/babynames/limits.html)
(default:
https://www.ssa.gov/OACT/babynames/names.zip)
--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage, up to and including this name, was at
least this value. Range is 0-100. Use 0 for no limit.
Setting this above 0 excludes COMMON names. (This is a
trade-off between being comprehensive and operating at
a reasonable speed. Lower numbers are more
comprehensive but slower.) (default: 0)
--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage, up to and including this name, was less
than or equal to this value. Range is 0-100. Use 100
for no limit. Setting this below 100 excludes RARE
names. (This is a trade-off between being
comprehensive and operating at a reasonable speed.
Higher numbers are more comprehensive but slower.)
(default: 100)
--us_forenames_output US_FORENAMES_OUTPUT
Output file for US forenames (default: None)
US SURNAMES:
--us_surnames Fetch US surnames (for denylist) (default: False)
--us_surnames_output US_SURNAMES_OUTPUT
Output text file for US surnames (default: None)
--us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT
Output CSV file for US surnames with frequencies
(columns are: name, frequency) (default: None)
--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL
URL for textfile of US 1990 Census surnames (default:
http://www2.census.gov/topics/genealogy/1990surnames/d
ist.all.last)
--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL
URL for zip of US 2010 Census surnames (default:
https://www2.census.gov/topics/genealogy/2010surnames/
names.zip)
--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage, up to and including this name, was at
least this value. Range is 0-100. Use 0 for no limit.
Setting this above 0 excludes COMMON names. (This is a
trade-off between being comprehensive and operating at
a reasonable speed. Lower numbers are more
comprehensive but slower.) (default: 0)
--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage, up to and including this name, was less
than or equal to this value. Range is 0-100. Use 100
for no limit. Setting this below 100 excludes RARE
names. (This is a trade-off between being
comprehensive and operating at a reasonable speed.
Higher numbers are more comprehensive but slower.)
(default: 100)
MEDICAL EPONYMS:
--eponyms Write medical eponyms (to remove from denylist)
(default: False)
--eponyms_output EPONYMS_OUTPUT
Output file for medical eponyms (default:
medical_eponyms.txt)
--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]
Add unaccented versions (e.g. Sjogren as well as
Sjögren) (default: True)
FILTER FUNCTIONS:
Extra functions to filter wordlists. Specify an input file, optional
exclusion and/or inclusion file(s), and an output file. You can use '-'
for the output file to mean 'stdout', and for one input file to mean
'stdin'. No filenames (other than '-' for input and output) may overlap.
The --min_word_length option also applies. Duplicates are not removed.
--filter_input [FILTER_INPUT ...]
Input file(s). Words will be drawn from these files.
(default: None)
--filter_include [FILTER_INCLUDE ...]
Inclusion file(s). If any inclusion files are
specified, words from the input must be present in at
least one inclusion file to pass. (default: None)
--filter_exclude [FILTER_EXCLUDE ...]
Exclusion file(s). Any words present in the exclusion
files do not pass. (default: None)
--filter_output [FILTER_OUTPUT]
Output file. Words are written here. (default: None)
Specimen usage:
#!/bin/bash
# Generating name files under Linux, for nonspecific name removal.
# -----------------------------------------------------------------------------
# 1. Fetch our source data.
# -----------------------------------------------------------------------------
# Downloading these and then using a file:// URL is unnecessary, but it makes
# the processing steps faster if we need to retry with new settings.
wget https://www.gutenberg.org/files/3201/files/CROSSWD.TXT -O dictionary.txt
wget https://www.ssa.gov/OACT/babynames/names.zip -O forenames.zip
wget http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last -O surnames_1990.txt
wget https://www2.census.gov/topics/genealogy/2010surnames/names.zip -O surnames_2010.zip
# -----------------------------------------------------------------------------
# 2. Create our wordlists.
# -----------------------------------------------------------------------------
# Fetch forenames, surnames, medical eponyms, and words valid for Scrabble.
# Via --debug_names, we'll illustrate frequencies for some names that are also
# words.
crate_fetch_wordlists \
--english_words \
--english_words_url "file://${PWD}/dictionary.txt" \
--english_words_output english_crossword_words.txt \
--us_forenames \
--us_forenames_url "file://${PWD}/forenames.zip" \
--us_forenames_max_cumfreq_pct 100 \
--us_forenames_output us_forenames.txt \
--us_surnames \
--us_surnames_1990_census_url "file://${PWD}/surnames_1990.txt" \
--us_surnames_2010_census_url "file://${PWD}/surnames_2010.zip" \
--us_surnames_max_cumfreq_pct 100 \
--us_surnames_output us_surnames.txt \
--eponyms \
--eponyms_output medical_eponyms.txt \
--debug_names \
excellent fought friend games he hope husband john joyful kitten \
knuckle libel limp lovely man memory mood music no power powers sad \
stress true veronica yes you young zone
# Count frequencies for a few books (preserving case, filtering for words of
# length 2+ and meeting a valid pattern which includes starting with a
# lower-case letter), across 1,134,142 words:
crate_fetch_wordlists \
--gutenberg_word_freq \
--gutenberg_id_first 100 \
--gutenberg_id_last 110 \
--gutenberg_word_freq_output english_word_freq_gutenberg.csv
# 100 is Shakespeare: https://www.gutenberg.org/ebooks/100
# Filter to common words, which together account for the top 99% of usage.
crate_fetch_wordlists \
--filter_words_by_freq \
--wordfreqfilter_input english_word_freq_gutenberg.csv \
--wordfreqfilter_min_cum_freq 0.0 \
--wordfreqfilter_max_cum_freq 0.99 \
--wordfreqfilter_output english_gutenberg_common_words.txt
# In this corpus, "john" comes in at about 0.9923.
# Create a list of common English words -- the overlap between "common words in
# Project Gutenberg books" and "valid Scrabble words".
crate_fetch_wordlists \
--filter_input english_gutenberg_common_words.txt \
--filter_include english_crossword_words.txt \
--filter_output common_english_words.txt
# -----------------------------------------------------------------------------
# 3. Generate an amalgamated and filtered wordlist.
# -----------------------------------------------------------------------------
crate_fetch_wordlists \
--filter_input us_forenames.txt us_surnames.txt \
--filter_exclude medical_eponyms.txt common_english_words.txt \
--filter_output filtered_names.txt
# ^^^^^^^^^^^^^^^^^^
# This may be useful for the CRATE anonymiser option --denylist_filenames.
# As a check, these are then names that are *rare* English words:
comm -12 \
<(sort filtered_names.txt | tr "[:upper:]" "[:lower:]") \
<(sort english_crossword_words.txt | tr "[:upper:]" "[:lower:]") \
> remaining_names_that_are_english_crossword_words.txt