5. Preprocessing tools
These tools:

- reshape specific databases for CRATE:
  - crate_preprocess_rio – preprocess a RiO database
  - crate_preprocess_pcmis – preprocess a PCMIS database
- fetch external data used for anonymisation:
  - crate_postcodes – fetch ONS postcode information
  - crate_fetch_wordlists – fetch forenames, surnames, and medical eponyms
- perform fuzzy identity matching, for linking different databases securely:
  - crate_fuzzy_id_match – identity matching via hashed fuzzy identifiers
Although they are usually run before anonymisation, it’s probably more helpful to read the Anonymisation section first.
5.1. crate_preprocess_rio
The RiO preprocessor creates a unique integer field named crate_pk in all tables (copying the existing integer PK, creating one from an existing non-integer primary key, or adding a new one using SQL Server’s INT IDENTITY(1, 1) type). For all patient tables, it makes the patient ID (RiO number) into an integer, called crate_rio_number. It then adds indexes and views. All of these can be removed again, or updated incrementally if you add new data.
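The idea of adding and populating a crate_pk column can be sketched as follows. This is illustrative only (using SQLite rather than SQL Server, and not CRATE's actual code); the table and column names other than crate_pk are made up.

```python
import sqlite3

# Illustrative sketch only (SQLite, not CRATE's actual SQL Server code):
# add and populate a crate_pk integer key on a table lacking an integer PK.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (doc_id TEXT, body TEXT)")  # non-integer key
conn.executemany("INSERT INTO notes VALUES (?, ?)", [("a1", "x"), ("a2", "y")])

cols = [row[1] for row in conn.execute("PRAGMA table_info(notes)")]
if "crate_pk" not in cols:  # idempotent, so re-runs are safe
    conn.execute("ALTER TABLE notes ADD COLUMN crate_pk INTEGER")
    # On SQL Server, CRATE would instead use an INT IDENTITY(1, 1) column.
    conn.execute("UPDATE notes SET crate_pk = rowid")

rows = list(conn.execute("SELECT crate_pk, doc_id FROM notes ORDER BY crate_pk"))
print(rows)
```

The "if column missing" guard mirrors the preprocessor's ability to be re-run incrementally, and --drop-danger-drop is simply the reverse operation.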
The views ‘denormalize’ the data for convenience, since it can be pretty hard to follow the key chain of fully normalized tables. The views conform mostly to the names used by the Servelec RiO CRIS Extract Program (RCEP), with added consistency. Because user lookups are common, to save typing (and in some cases to keep the field length below the 64-character column name limit of MySQL), the following abbreviations are used:
[Table of field-name abbreviations; the only legible surviving entry expands to “Responsible Clinician”.]
Options:
usage: crate_preprocess_rio [-h] --url URL [-v] [--print] [--echo] [--rcep]
[--drop-danger-drop] [--cpft] [--debug-skiptables]
[--prognotes-current-only | --prognotes-all]
[--clindocs-current-only | --clindocs-all]
[--allergies-current-only | --allergies-all]
[--audit-info | --no-audit-info]
[--postcodedb POSTCODEDB]
[--geogcols [GEOGCOLS [GEOGCOLS ...]]]
[--settings-filename SETTINGS_FILENAME]
* Alters a RiO database to be suitable for CRATE.
* By default, this treats the source database as being a copy of a RiO
database (slightly later than version 6.2; exact version unclear).
Use the "--rcep" (+/- "--cpft") switch(es) to treat it as a
Servelec RiO CRIS Extract Program (RCEP) v2 output database.
optional arguments:
-h, --help show this help message and exit
--url URL SQLAlchemy database URL
-v, --verbose Verbose
--print Print SQL but do not execute it. (You can redirect the
printed output to create an SQL script.)
--echo Echo SQL
--rcep Treat the source database as the product of Servelec's
RiO CRIS Extract Program v2 (instead of raw RiO)
--drop-danger-drop REMOVES new columns and indexes, rather than creating
them. (There's not very much danger; no real
information is lost, but it might take a while to
recalculate it.)
--cpft Apply hacks for Cambridgeshire & Peterborough NHS
Foundation Trust (CPFT) RCEP database. Only applicable
with --rcep
--debug-skiptables DEBUG-ONLY OPTION. Skip tables (view creation only)
--prognotes-current-only
Progress_Notes view restricted to current versions
only (* default)
--prognotes-all Progress_Notes view shows old versions too
--clindocs-current-only
Clinical_Documents view restricted to current versions
only (*)
--clindocs-all Clinical_Documents view shows old versions too
--allergies-current-only
Client_Allergies view restricted to current info only
--allergies-all Client_Allergies view shows deleted allergies too (*)
--audit-info Audit information (creation/update times) added to
views
--no-audit-info No audit information added (*)
--postcodedb POSTCODEDB
Specify database (schema) name for ONS Postcode
Database (as imported by CRATE) to link to addresses
as a view. With SQL Server, you will have to specify
the schema as well as the database; e.g. "--postcodedb
ONS_PD.dbo"
--geogcols [GEOGCOLS [GEOGCOLS ...]]
List of geographical information columns to link in
from ONS Postcode Database. BEWARE that you do not
specify anything too identifying. Default: pcon pct
nuts lea statsward casward lsoa01 msoa01 ur01ind oac01
lsoa11 msoa11 parish bua11 buasd11 ru11ind oac11 imd
--settings-filename SETTINGS_FILENAME
Specify filename to write draft ddgen_* settings to,
for use in a CRATE anonymiser configuration file.
5.2. crate_preprocess_pcmis
Options:
usage: crate_preprocess_pcmis [-h] --url URL [-v] [--print] [--echo]
[--drop-danger-drop] [--debug-skiptables]
[--postcodedb POSTCODEDB]
[--geogcols [GEOGCOLS [GEOGCOLS ...]]]
[--settings-filename SETTINGS_FILENAME]
Alters a PCMIS database to be suitable for CRATE.
optional arguments:
-h, --help show this help message and exit
--url URL SQLAlchemy database URL (default: None)
-v, --verbose Verbose (default: False)
--print Print SQL but do not execute it. (You can redirect the
printed output to create an SQL script.) (default:
False)
--echo Echo SQL (default: False)
--drop-danger-drop REMOVES new columns and indexes, rather than creating
them. (There's not very much danger; no real
information is lost, but it might take a while to
recalculate it.) (default: False)
--debug-skiptables DEBUG-ONLY OPTION. Skip tables (view creation only)
(default: False)
--postcodedb POSTCODEDB
Specify database (schema) name for ONS Postcode
Database (as imported by CRATE) to link to addresses
as a view. With SQL Server, you will have to specify
the schema as well as the database; e.g. "--postcodedb
ONS_PD.dbo" (default: None)
--geogcols [GEOGCOLS [GEOGCOLS ...]]
List of geographical information columns to link in
from ONS Postcode Database. BEWARE that you do not
specify anything too identifying. (default: ['pcon',
'pct', 'nuts', 'lea', 'statsward', 'casward',
'lsoa01', 'msoa01', 'ur01ind', 'oac01', 'lsoa11',
'msoa11', 'parish', 'bua11', 'buasd11', 'ru11ind',
'oac11', 'imd'])
--settings-filename SETTINGS_FILENAME
Specify filename to write draft ddgen_* settings to,
for use in a CRATE anonymiser configuration file.
(default: None)
5.3. crate_postcodes
Options:
usage: crate_postcodes [-h] [--dir DIR] [--url URL] [--echo]
[--reportevery REPORTEVERY] [--commitevery COMMITEVERY]
[--startswith STARTSWITH [STARTSWITH ...]] [--replace]
[--skiplookup]
[--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]]
[--list_lookup_tables] [--skippostcodes] [--docsonly]
[-v]
- This program reads data from the UK Office for National Statistics Postcode
Database (ONSPD) and inserts it into a database.
- You will need to download the ONSPD from
https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
e.g. ONSPD_MAY_2016_csv.zip (79 Mb), and unzip it (>1.4 Gb) to a directory.
Tell this program which directory you used.
- Specify your database as an SQLAlchemy connection URL: see
http://docs.sqlalchemy.org/en/latest/core/engines.html
The general format is:
dialect[+driver]://username:password@host[:port]/database[?key=value...]
- If you get an error like:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in
position 33: ordinal not in range(256)
then try appending "?charset=utf8" to the connection URL.
- ONS POSTCODE DATABASE LICENSE.
Output using this program must add the following attribution statements:
Contains OS data © Crown copyright and database right [year]
Contains Royal Mail data © Royal Mail copyright and database right [year]
Contains National Statistics data © Crown copyright and database right [year]
See http://www.ons.gov.uk/methodology/geography/licences
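The URL format and the "?charset=utf8" fix described above can be sketched in a few lines. The credentials, host, and database name here are made up; only the URL shape and the charset suffix come from the text above.

```python
from urllib.parse import quote_plus

# Hedged sketch: assembling a URL of the form
#   dialect[+driver]://username:password@host[:port]/database[?key=value...]
# Credentials, host, and database name below are illustrative only.
user, password = "crate", "p@ss:word"
url = (
    f"mysql+mysqldb://{user}:{quote_plus(password)}@dbserver:3306/onspd"
    "?charset=utf8"  # works around the latin-1 UnicodeEncodeError above
)
print(url)
```

Note that special characters in the password (here "@" and ":") must be percent-encoded, or SQLAlchemy will mis-parse the URL.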
optional arguments:
-h, --help show this help message and exit
--dir DIR Root directory of unzipped ONSPD download (default:
/path/to/unzipped/ONSPD/download)
--url URL SQLAlchemy database URL (default: None)
--echo Echo SQL (default: False)
--reportevery REPORTEVERY
Report every n rows (default: 1000)
--commitevery COMMITEVERY
Commit every n rows. If you make this too large
(relative e.g. to your MySQL max_allowed_packet
setting), you may get crashes with errors like 'MySQL
has gone away'. (default: 10000)
--startswith STARTSWITH [STARTSWITH ...]
Restrict to postcodes that start with one of these
strings (default: None)
--replace Replace tables even if they exist (default: skip
existing tables) (default: False)
--skiplookup Skip generation of code lookup tables (default: False)
--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]
Within the lookup tables, process only specific named
tables (default: None)
--list_lookup_tables List all possible lookup tables, then stop (default:
False)
--skippostcodes Skip generation of main (large) postcode table
(default: False)
--docsonly Show help for postcode table then stop (default:
False)
-v, --verbose Verbose (default: False)
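The batched-commit behaviour behind --commitevery can be sketched as below. This is illustrative only (SQLite in place of a real target database; CRATE's importer is more involved): committing every n rows bounds transaction size, which is what avoids the 'MySQL has gone away' failures mentioned above.

```python
import sqlite3

# Sketch (not CRATE's code) of committing every n rows during a bulk insert.
def insert_batched(conn, postcodes, commit_every=10_000):
    for i, pcd in enumerate(postcodes, start=1):
        conn.execute("INSERT INTO postcodes (pcd) VALUES (?)", (pcd,))
        if i % commit_every == 0:
            conn.commit()  # keep each transaction small
    conn.commit()  # flush the final partial batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postcodes (pcd TEXT)")
insert_batched(conn, ["CB2 0QQ", "CB2 3EB", "CB3 9DF"], commit_every=2)
count = conn.execute("SELECT COUNT(*) FROM postcodes").fetchone()[0]
print(count)
```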
5.4. crate_fetch_wordlists
This tool assists in fetching common word lists, such as name lists for global denial, and words to exclude from such lists (such as English words or medical eponyms). It also provides an exclusion filter system, to find lines in some files that are absent from others.
Options:
usage: crate_fetch_wordlists [-h] [--verbose]
[--min_word_length MIN_WORD_LENGTH]
[--show_rejects] [--english_words]
[--english_words_output ENGLISH_WORDS_OUTPUT]
[--english_words_url ENGLISH_WORDS_URL]
[--valid_word_regex VALID_WORD_REGEX]
[--us_forenames]
[--us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT]
[--us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT]
[--us_forenames_url US_FORENAMES_URL]
[--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT]
[--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT]
[--us_forenames_output US_FORENAMES_OUTPUT]
[--us_surnames]
[--us_surnames_output US_SURNAMES_OUTPUT]
[--us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT]
[--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL]
[--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL]
[--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT]
[--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT]
[--eponyms] [--eponyms_output EPONYMS_OUTPUT]
[--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]]
[--filter_input [FILTER_INPUT [FILTER_INPUT ...]]]
[--filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]]
[--filter_output [FILTER_OUTPUT]]
optional arguments:
-h, --help show this help message and exit
--verbose, -v Be verbose (default: False)
--min_word_length MIN_WORD_LENGTH
Minimum word (or name) length to allow (default: 2)
--show_rejects Print to stdout (and, in verbose mode, log) the words
being rejected (default: False)
English words:
--english_words Fetch English words (to remove from the nonspecific
denylist, not to add to an allowlist; consider words
like smith) (default: False)
--english_words_output ENGLISH_WORDS_OUTPUT
Output file for English words (default:
english_words.txt)
--english_words_url ENGLISH_WORDS_URL
URL for a textfile containing all English words (will
then be filtered) (default: https://www.gutenberg.org/
files/3201/files/CROSSWD.TXT)
--valid_word_regex VALID_WORD_REGEX
Regular expression to determine valid English words
(default: ^[a-z](?:[A-Za-z'-]*[a-z])*$)
US forenames:
--us_forenames Fetch US forenames (for denylist) (default: False)
--us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT
Output CSV file for US forename with frequencies
(columns are: name, frequency) (default:
us_forename_freq.csv)
--us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT
Output CSV file for US forename with sex and
frequencies (columns are: name, gender, frequency)
(default: us_forename_sex_freq.csv)
--us_forenames_url US_FORENAMES_URL
URL to Zip file of US Census-derived forenames lists
(excludes names with national frequency <5; see
https://www.ssa.gov/OACT/babynames/limits.html)
(default:
https://www.ssa.gov/OACT/babynames/names.zip)
--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was at least
this value. Range is 0-100. Use 0 for no limit.
Setting this above 0 excludes COMMON names. (This is a
trade-off between being comprehensive and operating at
a reasonable speed. Higher numbers are more
comprehensive but slower.) (default: 0)
--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was less than
or equal to this value. Range is 0-100. Use 100 for no
limit. Setting this below 100 excludes RARE names.
(This is a trade-off between being comprehensive and
operating at a reasonable speed. Higher numbers are
more comprehensive but slower.) (default: 100)
--us_forenames_output US_FORENAMES_OUTPUT
Output file for US forenames (default:
us_forenames.txt)
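The min/max cumulative-frequency options can be illustrated with a small sketch. This is an assumption about the algorithm (not CRATE's exact code): names are taken sorted by descending frequency, with frequencies as percentages, and a name is kept if the cumulative percentage up to and including it lies within [min_pct, max_pct].

```python
# Sketch of --us_forenames_min/max_cumfreq_pct filtering (assumed logic).
def filter_by_cumfreq(name_freqs, min_pct=0.0, max_pct=100.0):
    """name_freqs: (name, frequency %) pairs, sorted by descending frequency."""
    kept, cum = [], 0.0
    for name, freq in name_freqs:
        cum += freq
        if min_pct <= cum <= max_pct:
            kept.append(name)
    return kept

names = [("JAMES", 40.0), ("MARY", 30.0), ("RARENAME", 30.0)]
print(filter_by_cumfreq(names, max_pct=75.0))   # drops the rare tail
print(filter_by_cumfreq(names, min_pct=50.0))   # drops the common head
```

This shows why setting max below 100 excludes RARE names, while setting min above 0 excludes COMMON names.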
US surnames:
--us_surnames Fetch US surnames (for denylist) (default: False)
--us_surnames_output US_SURNAMES_OUTPUT
Output text file for US surnames (default:
us_surnames.txt)
--us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT
Output CSV file for US surnames with frequencies
(columns are: name, frequency) (default:
us_surname_freq.csv)
--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL
URL for textfile of US 1990 Census surnames (default:
http://www2.census.gov/topics/genealogy/1990surnames/d
ist.all.last)
--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL
URL for zip of US 2010 Census surnames (default: https
://www2.census.gov/topics/genealogy/2010surnames/names
.zip)
--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was at least
this value. Range is 0-100. Use 0 for no limit.
Setting this above 0 excludes COMMON names. (This is a
trade-off between being comprehensive and operating at
a reasonable speed. Higher numbers are more
comprehensive but slower.) (default: 0)
--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT
Fetch only names where the cumulative frequency
percentage up to and including this name was less than
or equal to this value. Range is 0-100. Use 100 for no
limit. Setting this below 100 excludes RARE names.
(This is a trade-off between being comprehensive and
operating at a reasonable speed. Higher numbers are
more comprehensive but slower.) (default: 100)
Medical eponyms:
--eponyms Write medical eponyms (to remove from denylist)
(default: False)
--eponyms_output EPONYMS_OUTPUT
Output file for medical eponyms (default:
medical_eponyms.txt)
--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]
Add unaccented versions (e.g. Sjogren as well as
Sjögren) (default: True)
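What --eponyms_add_unaccented_versions describes (e.g. generating Sjogren alongside Sjögren) can be done with Unicode decomposition; this sketch is an assumption, not necessarily CRATE's exact routine.

```python
import unicodedata

# Derive an unaccented variant by decomposing (NFKD) and dropping
# combining marks, e.g. Sjögren -> Sjogren.
def unaccent(word: str) -> str:
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", word)
        if not unicodedata.combining(ch)
    )

print(unaccent("Sjögren"))
```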
Filter functions:
Extra functions to filter wordlists. Specify an input file (or files),
whose lines will be included; optional exclusion file(s), whose lines will
be excluded (in case-insensitive fashion); and an output file. You can use
'-' for the output file to mean 'stdout', and for one input file to mean
'stdin'. No filenames (other than '-' for input and output) may overlap.
The --min_line_length option also applies. Duplicates are not removed.
--filter_input [FILTER_INPUT [FILTER_INPUT ...]]
Input file(s). See above. (default: None)
--filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]
Exclusion file(s). See above. (default: None)
--filter_output [FILTER_OUTPUT]
Output file. See above. (default: None)
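The exclusion filter described above can be sketched in a few lines (simplified relative to crate_fetch_wordlists itself): duplicates are preserved and exclusion is case-insensitive, as the help text notes.

```python
# Minimal sketch (not CRATE's code) of the filter-function behaviour:
# keep input lines absent, case-insensitively, from the exclusion files.
def filter_lines(input_lines, exclude_lines):
    excluded = {line.strip().lower() for line in exclude_lines}
    return [
        line.strip() for line in input_lines
        if line.strip().lower() not in excluded
    ]

names = ["Smith", "Sjogren", "Zara"]        # e.g. from a name denylist
english, eponyms = ["smith"], ["sjogren"]   # e.g. words to remove
print(filter_lines(names, english + eponyms))
```

This mirrors the second specimen invocation below, which removes English words and medical eponyms from the combined forename/surname lists.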
Specimen usage:
#!/bin/bash
# -----------------------------------------------------------------------------
# Specimen usage under Linux
# -----------------------------------------------------------------------------
# Downloading these and then using a file:// URL is unnecessary, but it makes
# the processing steps faster if we need to retry with new settings.
wget https://www.gutenberg.org/files/3201/files/CROSSWD.TXT -O dictionary.txt
wget https://www.ssa.gov/OACT/babynames/names.zip -O forenames.zip
wget http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last -O surnames_1990.txt
wget https://www2.census.gov/topics/genealogy/2010surnames/names.zip -O surnames_2010.zip
crate_fetch_wordlists --help
crate_fetch_wordlists \
--english_words \
--english_words_url "file://${PWD}/dictionary.txt" \
--us_forenames \
--us_forenames_url "file://${PWD}/forenames.zip" \
--us_forenames_max_cumfreq_pct 100 \
--us_surnames \
--us_surnames_1990_census_url "file://${PWD}/surnames_1990.txt" \
--us_surnames_2010_census_url "file://${PWD}/surnames_2010.zip" \
--us_surnames_max_cumfreq_pct 100 \
--eponyms
# --show_rejects \
# --verbose
# Forenames encompassing the top 95% gives 5874 forenames (of 96174).
# Surnames encompassing the top 85% gives 74525 surnames (of 175880).
crate_fetch_wordlists \
--filter_input \
us_forenames.txt \
us_surnames.txt \
--filter_exclude \
english_words.txt \
medical_eponyms.txt \
--filter_output \
filtered_names.txt
5.5. crate_fuzzy_id_match
In development.
See crate_anon.preprocess.fuzzy_id_match.
Options (from crate_fuzzy_id_match --allhelp):
usage: crate_fuzzy_id_match [-h] [--version] [--allhelp] [--verbose]
[--key KEY] [--allow_default_hash_key]
[--rounding_sf ROUNDING_SF]
[--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
[--forename_cache_filename FORENAME_CACHE_FILENAME]
[--surname_freq_csv SURNAME_FREQ_CSV]
[--surname_cache_filename SURNAME_CACHE_FILENAME]
[--name_min_frequency NAME_MIN_FREQUENCY]
[--p_middle_name_n_present P_MIDDLE_NAME_N_PRESENT]
[--population_size POPULATION_SIZE]
[--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
[--postcode_csv_filename POSTCODE_CSV_FILENAME]
[--postcode_cache_filename POSTCODE_CACHE_FILENAME]
[--mean_oa_population MEAN_OA_POPULATION]
[--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
[--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
[--p_minor_forename_error P_MINOR_FORENAME_ERROR]
[--p_minor_surname_error P_MINOR_SURNAME_ERROR]
[--p_proband_middle_name_missing P_PROBAND_MIDDLE_NAME_MISSING]
[--p_sample_middle_name_missing P_SAMPLE_MIDDLE_NAME_MISSING]
[--p_minor_postcode_error P_MINOR_POSTCODE_ERROR]
[--p_gender_error P_GENDER_ERROR]
[--min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH]
[--exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS]
{selftest,speedtest,print_demo_sample,validate1,validate2_fetch_cdl,validate2_fetch_rio,hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext,show_metaphone,show_forename_freq,show_forename_metaphone_freq,show_surname_freq,show_surname_metaphone_freq,show_dob_freq}
...
Identity matching via hashed fuzzy identifiers
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--allhelp show help for all commands and exit
display options:
--verbose Be verbose (default: False)
hasher (secrecy) options:
--key KEY Key (passphrase) for hasher (default: fuzzy_id_match_d
efault_hash_key_DO_NOT_USE_FOR_LIVE_DATA)
--allow_default_hash_key
Allow the default hash key to be used beyond tests.
INADVISABLE! (default: False)
--rounding_sf ROUNDING_SF
Number of significant figures to use when rounding
frequencies in hashed version (default: 5)
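Rounding to a fixed number of significant figures, as --rounding_sf describes, can be sketched as below (an assumption about the method, not necessarily CRATE's exact routine); coarser rounding leaks less frequency information in the hashed output.

```python
import math

# Round x to sf significant figures, e.g. 0.00012345678 -> 0.00012346 at 5 s.f.
def round_sf(x, sf=5):
    if x == 0:
        return 0.0
    return round(x, sf - 1 - math.floor(math.log10(abs(x))))

print(round_sf(0.00012345678))      # default 5 significant figures
print(round_sf(123456, 3))
```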
frequency information for prior probabilities:
--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
CSV file of "name, sex, frequency" rows for
forenames. You can generate one via
crate_fetch_wordlists. (default:
/path/to/crate/user/data/us_forename_sex_freq.csv)
--forename_cache_filename FORENAME_CACHE_FILENAME
File in which to store cached forename info (to speed
loading) (default:
/path/to/crate/user/data/fuzzy_forename_cache.pickle)
--surname_freq_csv SURNAME_FREQ_CSV
CSV file of "name, frequency" pairs for surnames. You
can generate one via crate_fetch_wordlists. (default:
/path/to/crate/user/data/us_surname_freq.csv)
--surname_cache_filename SURNAME_CACHE_FILENAME
File in which to store cached surname info (to speed
loading) (default:
/path/to/crate/user/data/fuzzy_surname_cache.pickle)
--name_min_frequency NAME_MIN_FREQUENCY
Minimum base frequency for names. If a frequency is
less than this, use this minimum. Allowing extremely
low frequencies may increase the chances of a spurious
match. Note also that typical name frequency tables
don't give very-low-frequency information. For
example, for US census forename/surname information,
below 0.001 percent they report 0.000 percent; so a
reasonable minimum is 0.0005 percent or 0.000005 or
5e-6. (default: 5e-06)
--p_middle_name_n_present P_MIDDLE_NAME_N_PRESENT
CSV list of probabilities that a randomly selected
person has a certain number of middle names. The first
number is P(has a first middle name). The second
number is P(has a second middle name | has a first
middle name), and so on. The last number present will
be re-used ad infinitum if someone has more names.
(default: 0.8,0.1375)
--population_size POPULATION_SIZE
Size of the whole population, from which we calculate
the baseline log odds that two people, randomly
selected (and replaced) from the population are the
same person. (default: 66040000)
--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
Birth year pseudo-range. The sole purpose is to
calculate the probability of two random people sharing
a DOB, which is taken as 1/(365.25 * b). This option
is b. (default: 90)
--postcode_csv_filename POSTCODE_CSV_FILENAME
CSV file of postcode geography from UK Census/ONS data
(default: /path/to/postcodes/file)
--postcode_cache_filename POSTCODE_CACHE_FILENAME
File in which to store cached postcodes (to speed
loading) (default:
/path/to/crate/user/data/fuzzy_postcode_cache.pickle)
--mean_oa_population MEAN_OA_POPULATION
Mean population of a UK Census Output Area, from which
we estimate the population of postcode-based units.
(default: 309)
--p_not_male_or_female P_NOT_MALE_OR_FEMALE
Probability that a person in the population has gender
'X'. (default: 0.004)
--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
Probability that a person in the population is female,
given that they are either male or female. (default:
0.51)
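The prior probabilities above can be checked with a short worked calculation, using only the defaults quoted in the help text: the chance that two random people share a DOB is 1/(365.25 × b), and the baseline log odds of two randomly selected people being the same person follows from the population size.

```python
import math

# Worked numbers from the documented defaults.
population = 66_040_000        # --population_size default
b = 90                         # --birth_year_pseudo_range default
p_same_dob = 1 / (365.25 * b)  # P(two random people share a DOB)
p_same_person = 1 / population
prior_log_odds = math.log(p_same_person / (1 - p_same_person))
print(p_same_dob, prior_log_odds)
```

The prior log odds come out around -18; evidence from names, DOB, gender, and postcodes must then lift the posterior above the matching threshold (see --min_log_odds_for_match below).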
error probabilities:
--p_minor_forename_error P_MINOR_FORENAME_ERROR
Assumed probability that a forename has an error that
means it fails a full match but satisfies a partial
(metaphone) match. (default: 0.001)
--p_minor_surname_error P_MINOR_SURNAME_ERROR
Assumed probability that a surname has an error that
means it fails a full match but satisfies a partial
(metaphone) match. (default: 0.001)
--p_proband_middle_name_missing P_PROBAND_MIDDLE_NAME_MISSING
Probability that a middle name, present in the sample,
is missing from the proband. (default: 0.05)
--p_sample_middle_name_missing P_SAMPLE_MIDDLE_NAME_MISSING
Probability that a middle name, present in the
proband, is missing from the sample. (default: 0.05)
--p_minor_postcode_error P_MINOR_POSTCODE_ERROR
Assumed probability that a postcode has an error that
means it fails a full (postcode unit) match but
satisfies a partial (postcode sector) match. (default:
0.001)
--p_gender_error P_GENDER_ERROR
Assumed probability that a gender is wrong leading to
a proband/candidate mismatch. (default: 0.0001)
matching rules:
--min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH
Minimum natural log (ln) odds of two people being the
same, before a match will be considered. (Default is
equivalent to p = 0.999.) (default: 6.906754778648553)
--exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS
Minimum log (ln) odds by which a best match must
exceed the next-best match to be considered a unique
match. (default: 10)
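The default threshold above is exactly the log odds corresponding to p = 0.999, as the help text states; this one-liner verifies the stated equivalence.

```python
import math

# ln(p / (1 - p)) at p = 0.999 reproduces the documented default of
# --min_log_odds_for_match, 6.906754778648553.
p = 0.999
min_log_odds = math.log(p / (1 - p))
print(min_log_odds)
```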
commands:
Valid commands are as follows.
{selftest,speedtest,print_demo_sample,validate1,validate2_fetch_cdl,validate2_fetch_rio,hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext,show_metaphone,show_forename_freq,show_forename_metaphone_freq,show_surname_freq,show_surname_metaphone_freq,show_dob_freq}
Specify one command.
selftest Run self-tests and stop
speedtest Run speed tests and stop
print_demo_sample Print a demo sample .CSV file (#1).
validate1 Run validation test 1 and stop. In this test, a list
of people is compared to a version of itself, at times
with elements deleted or with typos introduced.
validate2_fetch_cdl
Validation 2A: fetch people from CPFT CDL database
validate2_fetch_rio
Validation 2B: fetch people from CPFT RiO database
hash STEP 1 OF DE-IDENTIFIED LINKAGE. Hash an identifiable
CSV file into an encrypted one.
compare_plaintext IDENTIFIABLE LINKAGE COMMAND. Compare a list of
probands against a sample (both in plaintext).
compare_hashed_to_hashed
STEP 2 OF DE-IDENTIFIED LINKAGE (for when you have de-
identified both sides in advance). Compare a list of
probands against a sample (both hashed).
compare_hashed_to_plaintext
STEP 2 OF DE-IDENTIFIED LINKAGE (for when you have
received de-identified data and you want to link to
your identifiable data, producing a de-identified
result). Compare a list of probands (hashed) against a
sample (plaintext).
show_metaphone Show metaphones of words
show_forename_freq Show frequencies of forenames
show_forename_metaphone_freq
Show frequencies of forename metaphones
show_surname_freq Show frequencies of surnames
show_surname_metaphone_freq
Show frequencies of surname metaphones
show_dob_freq Show the frequency of any DOB
===============================================================================
Help for command 'selftest'
===============================================================================
usage: crate_fuzzy_id_match selftest [-h]
This will run a bunch of self-tests and crash out if one fails.
optional arguments:
-h, --help show this help message and exit
===============================================================================
Help for command 'speedtest'
===============================================================================
usage: crate_fuzzy_id_match speedtest [-h] [--profile]
This will run several comparisons to test hashing and comparison speed.
Results are reported as microseconds per comparison.
optional arguments:
-h, --help show this help message and exit
--profile Profile (makes things slower but shows you what's taking the
time). (default: False)
===============================================================================
Help for command 'print_demo_sample'
===============================================================================
usage: crate_fuzzy_id_match print_demo_sample [-h]
optional arguments:
-h, --help show this help message and exit
===============================================================================
Help for command 'validate1'
===============================================================================
usage: crate_fuzzy_id_match validate1 [-h] --people PEOPLE --output OUTPUT
[--seed SEED]
Takes an identifiable list of people (typically a short list of imaginary
people!) and validates the matching process.
This is done by splitting the input list into two groups (alternating),
then comparing a list of probands either against itself (there should be
matches) or against the other group (there should generally not be).
The process is carried out in cleartext (plaintext) and hashed. At times
it's made harder by introducing deletions or mutations (typos) into one of
the groups.
Here's a specimen test CSV file to use, with entirely made-up people and
institutional (not personal) postcodes in Cambridge:
original_id,research_id,first_name,middle_names,surname,dob,gender,postcodes
1,r1,Alice,Zara,Smith,1931-01-01,F,CB2 0QQ/2000-01-01/2010-12-31
2,r2,Bob,Yorick,Jones,1932-01-01,M,CB2 3EB/2000-01-01/2010-12-31
3,r3,Celia,Xena,Wright,1933-01-01,F,CB2 1TP/2000-01-01/2010-12-31
4,r4,David,William;Wallace,Cartwright,1934-01-01,M,CB2 8PH/2000-01-01/2010-12-31;CB2 1TP/2000-01-01/2010-12-31
5,r5,Emily,Violet,Fisher,1935-01-01,F,CB3 9DF/2000-01-01/2010-12-31
6,r6,Frank,Umberto,Williams,1936-01-01,M,CB2 1TQ/2000-01-01/2010-12-31
7,r7,Greta,Tilly,Taylor,1937-01-01,F,CB2 1DQ/2000-01-01/2010-12-31
8,r8,Harry,Samuel,Davies,1938-01-01,M,CB3 9ET/2000-01-01/2010-12-31
9,r9,Iris,Ruth,Evans,1939-01-01,F,CB3 0DG/2000-01-01/2010-12-31
10,r10,James,Quentin,Thomas,1940-01-01,M,CB2 0SZ/2000-01-01/2010-12-31
11,r11,Alice,,Smith,1931-01-01,F,CB2 0QQ/2000-01-01/2010-12-31
Explanation of the output format:
- 'collection_name' is a human-readable name summarizing the next four;
- 'in_sample' (boolean) is whether the probands are in the sample;
- 'deletions' (boolean) is whether random items have been deleted from
the probands;
- 'typos' (boolean) is whether random typos have been made in the
probands;
- 'is_hashed' (boolean) is whether the proband and sample are hashed;
- 'original_id' is the gold-standard ID of the proband;
- 'winner_id' is the ID of the best-matching person in the sample if they
were a good enough match to win;
- 'best_match_id' is the ID of the best-matching person in the sample;
- 'best_log_odds' is the calculated log (ln) odds that the proband and the
sample member identified by 'winner_id' are the same person (ideally
high if there is a true match, low if not);
- 'second_best_log_odds' is the calculated log odds of the proband and the
runner-up being the same person (ideally low);
- 'second_best_match_id' is the ID of the second-best matching person, if
there is one;
- 'correct_if_winner' is whether the proband and winner IDs are the same
(ideally true);
- 'leader_advantage' is the log odds by which the winner beats the
runner-up (ideally high indicating a strong preference for the winner
over the runner-up).
Clearly, if the probands are in the sample, then a match may occur; if not,
no match should occur. If hashing is in use, this tests de-identified
linkage; if not, this tests identifiable linkage. Deletions and typos
may reduce (but we hope not always eliminate) the likelihood of a match,
and we don't want to see mismatches.
For n input rows, each basic set test involves n^2/2 comparisons.
Then we repeat for typos and deletions. (There is no point in DOB typos
as our rules preclude that.)
Examine:
- P(unique plaintext match | proband in sample) -- should be close to 1.
- P(unique plaintext match | proband in others) -- should be close to 0.
- P(unique hashed match | proband in sample) -- should be close to 1.
- P(unique hashed match | proband in others) -- should be close to 0.
optional arguments:
-h, --help show this help message and exit
--people PEOPLE CSV filename for validation 1 data. Header row present.
Columns: ['original_id', 'research_id', 'first_name',
'middle_names', 'surname', 'dob', 'gender', 'postcodes'].
The fields ['postcodes'] are in TemporalIdentifier format.
TemporalIdentifier format: IDENTIFIER/STARTDATE/ENDDATE,
where dates are in YYYY-MM-DD format or one of ['none',
'null', '?'] (case-insensitive). Semicolon-separated values
are allowed within ['middle_names', 'postcodes']. (default:
None)
--output OUTPUT Output CSV file for validation. Header row present.
Columns: ['collection_name', 'in_sample', 'deletions',
'typos', 'is_hashed', 'original_id', 'winner_id',
'best_match_id', 'best_log_odds', 'second_best_log_odds',
'second_best_match_id', 'correct_if_winner',
'leader_advantage']. (default: None)
--seed SEED Random number seed, for introducing deliberate errors in
validation test 1 (default: 1234)
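The TemporalIdentifier format described above can be parsed as follows; this parser is a sketch (an assumption, not CRATE's implementation), splitting from the right so that identifiers containing spaces or slashes in unexpected places are not mishandled.

```python
from datetime import date

# Sketch parser for IDENTIFIER/STARTDATE/ENDDATE, where dates are
# YYYY-MM-DD or one of 'none', 'null', '?' (case-insensitive).
NULL_VALUES = {"none", "null", "?"}

def parse_temporal_identifier(text):
    identifier, start, end = text.rsplit("/", 2)  # identifier may contain spaces
    def parse_date(s):
        return None if s.lower() in NULL_VALUES else date.fromisoformat(s)
    return identifier, parse_date(start), parse_date(end)

print(parse_temporal_identifier("CB2 0QQ/2000-01-01/2010-12-31"))
print(parse_temporal_identifier("CB2 0QQ/none/?"))
```

Semicolon-separated lists of such values (as in the 'postcodes' column of the specimen CSV above) would simply be split on ';' first.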
===============================================================================
Help for command 'validate2_fetch_cdl'
===============================================================================
usage: crate_fuzzy_id_match validate2_fetch_cdl [-h] --url URL [--echo]
--output OUTPUT
Validation #2. Sequence:
1. Fetch
- crate_fuzzy_id_match validate2_fetch_cdl --output validate2_cdl_DANGER_IDENTIFIABLE.csv --url <SQLALCHEMY_URL_CDL>
- crate_fuzzy_id_match validate2_fetch_rio --output validate2_rio_DANGER_IDENTIFIABLE.csv --url <SQLALCHEMY_URL_RIO>
2. Hash
- crate_fuzzy_id_match hash --input validate2_cdl_DANGER_IDENTIFIABLE.csv --output validate2_cdl_hashed.csv --include_original_id --allow_default_hash_key
- crate_fuzzy_id_match hash --input validate2_rio_DANGER_IDENTIFIABLE.csv --output validate2_rio_hashed.csv --include_original_id --allow_default_hash_key
3. Compare
- crate_fuzzy_id_match compare_plaintext --population_size 852523 --probands validate2_cdl_DANGER_IDENTIFIABLE.csv --sample validate2_rio_DANGER_IDENTIFIABLE.csv --output cdl_to_rio_plaintext.csv --extra_validation_output
- crate_fuzzy_id_match compare_hashed_to_hashed --population_size 852523 --probands validate2_cdl_hashed.csv --sample validate2_rio_hashed.csv --output cdl_to_rio_hashed.csv --extra_validation_output
- crate_fuzzy_id_match compare_plaintext --population_size 852523 --probands validate2_rio_DANGER_IDENTIFIABLE.csv --sample validate2_cdl_DANGER_IDENTIFIABLE.csv --output rio_to_cdl_plaintext.csv --extra_validation_output
- crate_fuzzy_id_match compare_hashed_to_hashed --population_size 852523 --probands validate2_rio_hashed.csv --sample validate2_cdl_hashed.csv --output rio_to_cdl_hashed.csv --extra_validation_output
optional arguments:
-h, --help show this help message and exit
--url URL SQLAlchemy URL for CPFT CDL source (IDENTIFIABLE) database
(default: None)
--echo Echo SQL? (default: False)
--output OUTPUT CSV filename for output (plaintext, IDENTIFIABLE) data.
Header row present. Columns: ['original_id', 'research_id',
'first_name', 'middle_names', 'surname', 'dob', 'gender',
'postcodes']. The fields ['postcodes'] are in
TemporalIdentifier format. TemporalIdentifier format:
IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-MM-DD
format or one of ['none', 'null', '?'] (case-insensitive).
Semicolon-separated values are allowed within
['middle_names', 'postcodes']. (default: None)
===============================================================================
Help for command 'validate2_fetch_rio'
===============================================================================
usage: crate_fuzzy_id_match validate2_fetch_rio [-h] --url URL [--echo]
--output OUTPUT
See validate2_fetch_cdl command.
optional arguments:
-h, --help show this help message and exit
--url URL SQLAlchemy URL for CPFT RiO source (IDENTIFIABLE) database
(default: None)
--echo Echo SQL? (default: False)
--output OUTPUT CSV filename for output (plaintext, IDENTIFIABLE) data.
Header row present. Columns: ['original_id', 'research_id',
'first_name', 'middle_names', 'surname', 'dob', 'gender',
'postcodes']. The fields ['postcodes'] are in
TemporalIdentifier format. TemporalIdentifier format:
IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-MM-DD
format or one of ['none', 'null', '?'] (case-insensitive).
Semicolon-separated values are allowed within
['middle_names', 'postcodes']. (default: None)
===============================================================================
Help for command 'hash'
===============================================================================
usage: crate_fuzzy_id_match hash [-h] --input INPUT --output OUTPUT
[--without_frequencies]
[--include_original_id]
Takes an identifiable list of people (with name, DOB, and postcode
information) and creates a hashed, de-identified equivalent.
The research ID (presumed not to be a direct identifier) is preserved.
Optionally, the unique original ID (e.g. NHS number, presumed to be a
direct identifier) is preserved, but you have to ask for that explicitly.
optional arguments:
-h, --help show this help message and exit
--input INPUT CSV filename for input (plaintext) data. Header row
present. Columns: ['original_id', 'research_id',
'first_name', 'middle_names', 'surname', 'dob',
'gender', 'postcodes']. The fields ['postcodes'] are
in TemporalIdentifier format. TemporalIdentifier
format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
in YYYY-MM-DD format or one of ['none', 'null', '?']
(case-insensitive). Semicolon-separated values are
allowed within ['middle_names', 'postcodes'].
(default: None)
--output OUTPUT Output CSV file for hashed version. Header row
present. Columns: ['original_id', 'research_id',
'hashed_first_name', 'first_name_frequency',
'hashed_first_name_metaphone',
'first_name_metaphone_frequency',
'hashed_middle_names', 'middle_name_frequencies',
'hashed_middle_name_metaphones',
'middle_name_metaphone_frequencies', 'hashed_surname',
'surname_frequency', 'hashed_surname_metaphone',
'surname_metaphone_frequency', 'hashed_dob',
'hashed_gender', 'gender_frequency',
'hashed_postcode_units', 'postcode_unit_frequencies',
'hashed_postcode_sectors',
'postcode_sector_frequencies']. The fields
['hashed_postcode_sectors', 'hashed_postcode_units']
are in TemporalIdentifier format. TemporalIdentifier
format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
in YYYY-MM-DD format or one of ['none', 'null', '?']
(case-insensitive). Semicolon-separated values are
allowed within ['hashed_middle_name_metaphones',
'hashed_middle_names', 'hashed_postcode_sectors',
'hashed_postcode_units', 'middle_name_frequencies',
'middle_name_metaphone_frequencies',
'postcode_sector_frequencies',
'postcode_unit_frequencies']. (default: None)
--without_frequencies
Do not include frequency information. This makes the
result suitable for use as a sample file, but not a
proband file. (default: False)
--include_original_id
Include the (potentially identifying) 'original_id'
data? Usually False; may be set to True for
validation. (default: False)
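The hash command replaces each identifier with a keyed, irreversible hash, so that two hashed files remain linkable without exposing names. CRATE's exact hash construction is not shown in this help text; HMAC-SHA256 is one standard keyed-hash construction, sketched here purely for illustration:

```python
import hashlib
import hmac

def keyed_hash(value: str, key: str) -> str:
    """Irreversibly hash one identifier under a secret key.

    Illustration only: CRATE's actual construction and key handling may
    differ. Without the key, the hash cannot feasibly be recomputed from
    the identifier, which is why using the default key is normally
    refused (hence --allow_default_hash_key for validation runs).
    """
    return hmac.new(key.encode("utf-8"),
                    value.upper().encode("utf-8"),  # normalise case first
                    hashlib.sha256).hexdigest()

# The same name always yields the same hash under the same key,
# so hashed proband and sample files can still be compared.
assert keyed_hash("Smith", "my_secret_key") == keyed_hash("SMITH", "my_secret_key")
```

A different key produces entirely different hashes, so files hashed under different keys cannot be linked.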
===============================================================================
Help for command 'compare_plaintext'
===============================================================================
usage: crate_fuzzy_id_match compare_plaintext [-h] --probands PROBANDS
--sample SAMPLE
[--sample_cache SAMPLE_CACHE]
--output OUTPUT
[--extra_validation_output]
[--n_workers N_WORKERS]
[--max_chunksize MAX_CHUNKSIZE]
[--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
[--profile]
Comparison rules:
- People MUST match on DOB and surname (or surname metaphone), or hashed
equivalents, to be considered a plausible match.
- Only plausible matches proceed to the Bayesian comparison.
Output file format:
- CSV file with header.
- Columns: ['proband_original_id', 'proband_research_id', 'matched', 'log_odds_match', 'p_match', 'sample_match_original_id', 'sample_match_research_id', 'second_best_log_odds']
- proband_original_id
Original (identifiable?) ID of the proband. Taken from the input.
Optional -- may be blank for de-identified comparisons.
- proband_research_id
Research ID (de-identified?) of the proband. Taken from the input.
- matched
Boolean. Was a matching person (a "winner") found in the sample, who
is to be considered a match to the proband? To give a match requires
(a) that the log odds for the winner reaches a threshold, and (b) that
the log odds for the winner exceeds the log odds for the runner-up by
a certain amount (because a mismatch may be worse than a failed match).
- log_odds_match
Log (ln) odds that the winner in the sample is a match to the proband.
- p_match
Probability that the winner in the sample is a match.
- sample_match_original_id
Original ID of the "winner" in the sample (the closest match to the
proband). Optional -- may be blank for de-identified comparisons.
- sample_match_research_id
Research ID of the winner in the sample.
- second_best_log_odds
Log odds of the runner up (the second-closest match) being the same
person as the proband.
- If '--extra_validation_output' is used, the following columns are added:
- best_person_original_id
Original ID of the closest-matching person in the sample, EVEN IF THEY
DID NOT WIN.
- best_person_research_id
Research ID of the closest-matching person in the sample, EVEN IF THEY
DID NOT WIN.
- The results file is NOT necessarily sorted as the input proband file was
(not sorting improves parallel processing efficiency).
optional arguments:
-h, --help show this help message and exit
--probands PROBANDS CSV filename for probands data. Header row present.
Columns: ['original_id', 'research_id', 'first_name',
'middle_names', 'surname', 'dob', 'gender',
'postcodes']. The fields ['postcodes'] are in
TemporalIdentifier format. TemporalIdentifier format:
IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-
MM-DD format or one of ['none', 'null', '?'] (case-
insensitive). Semicolon-separated values are allowed
within ['middle_names', 'postcodes']. (default: None)
--sample SAMPLE CSV filename for sample data. Header row present.
Columns: ['original_id', 'research_id', 'first_name',
'middle_names', 'surname', 'dob', 'gender',
'postcodes']. The fields ['postcodes'] are in
TemporalIdentifier format. TemporalIdentifier format:
IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-
MM-DD format or one of ['none', 'null', '?'] (case-
insensitive). Semicolon-separated values are allowed
within ['middle_names', 'postcodes']. (default: None)
--sample_cache SAMPLE_CACHE
File in which to store cached sample info (to speed
loading) (default: None)
--output OUTPUT Output CSV file for proband/sample comparison.
(default: None)
--extra_validation_output
Add extra output for validation purposes. (default:
False)
--n_workers N_WORKERS
Number of processes to use in parallel. (default: 8)
--max_chunksize MAX_CHUNKSIZE
Maximum chunk size (number of probands to pass to a
subprocess each time). (default: 500)
--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
Minimum number of probands for which we will bother to
use parallel processing. (default: 100)
--profile Profile the code (for development only). (default:
False)
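The "matched" and "p_match" columns follow directly from the rules described above. A sketch of the decision rule; the threshold parameters here are hypothetical names, as the actual defaults are not stated in this help text:

```python
import math

def p_from_log_odds(log_odds: float) -> float:
    """Convert natural-log odds L to a probability: p = 1 / (1 + e^-L)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def matched(best_log_odds: float,
            second_best_log_odds: float,
            min_log_odds: float,
            min_advantage: float) -> bool:
    """Decision rule described above: the winner must (a) reach a
    threshold, and (b) beat the runner-up by a margin (because a
    mismatch may be worse than a failed match). Threshold values are
    illustrative parameters, not CRATE's defaults."""
    return (best_log_odds >= min_log_odds
            and best_log_odds - second_best_log_odds >= min_advantage)
```

So a proband whose best candidate scores well may still go unmatched if the runner-up scores nearly as well.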
===============================================================================
Help for command 'compare_hashed_to_hashed'
===============================================================================
usage: crate_fuzzy_id_match compare_hashed_to_hashed [-h] --probands PROBANDS
--sample SAMPLE
[--sample_cache SAMPLE_CACHE]
--output OUTPUT
[--extra_validation_output]
[--n_workers N_WORKERS]
[--max_chunksize MAX_CHUNKSIZE]
[--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
[--profile]
Comparison rules:
- People MUST match on DOB and surname (or surname metaphone), or hashed
equivalents, to be considered a plausible match.
- Only plausible matches proceed to the Bayesian comparison.
Output file format:
- CSV file with header.
- Columns: ['proband_original_id', 'proband_research_id', 'matched', 'log_odds_match', 'p_match', 'sample_match_original_id', 'sample_match_research_id', 'second_best_log_odds']
- proband_original_id
Original (identifiable?) ID of the proband. Taken from the input.
Optional -- may be blank for de-identified comparisons.
- proband_research_id
Research ID (de-identified?) of the proband. Taken from the input.
- matched
Boolean. Was a matching person (a "winner") found in the sample, who
is to be considered a match to the proband? To give a match requires
(a) that the log odds for the winner reaches a threshold, and (b) that
the log odds for the winner exceeds the log odds for the runner-up by
a certain amount (because a mismatch may be worse than a failed match).
- log_odds_match
Log (ln) odds that the winner in the sample is a match to the proband.
- p_match
Probability that the winner in the sample is a match.
- sample_match_original_id
Original ID of the "winner" in the sample (the closest match to the
proband). Optional -- may be blank for de-identified comparisons.
- sample_match_research_id
Research ID of the winner in the sample.
- second_best_log_odds
Log odds of the runner up (the second-closest match) being the same
person as the proband.
- If '--extra_validation_output' is used, the following columns are added:
- best_person_original_id
Original ID of the closest-matching person in the sample, EVEN IF THEY
DID NOT WIN.
- best_person_research_id
Research ID of the closest-matching person in the sample, EVEN IF THEY
DID NOT WIN.
- The results file is NOT necessarily sorted as the input proband file was
(not sorting improves parallel processing efficiency).
optional arguments:
-h, --help show this help message and exit
--probands PROBANDS CSV filename for probands data. Header row present.
Columns: ['original_id', 'research_id',
'hashed_first_name', 'first_name_frequency',
'hashed_first_name_metaphone',
'first_name_metaphone_frequency',
'hashed_middle_names', 'middle_name_frequencies',
'hashed_middle_name_metaphones',
'middle_name_metaphone_frequencies', 'hashed_surname',
'surname_frequency', 'hashed_surname_metaphone',
'surname_metaphone_frequency', 'hashed_dob',
'hashed_gender', 'gender_frequency',
'hashed_postcode_units', 'postcode_unit_frequencies',
'hashed_postcode_sectors',
'postcode_sector_frequencies']. The fields
['hashed_postcode_sectors', 'hashed_postcode_units']
are in TemporalIdentifier format. TemporalIdentifier
format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
in YYYY-MM-DD format or one of ['none', 'null', '?']
(case-insensitive). Semicolon-separated values are
allowed within ['hashed_middle_name_metaphones',
'hashed_middle_names', 'hashed_postcode_sectors',
'hashed_postcode_units', 'middle_name_frequencies',
'middle_name_metaphone_frequencies',
'postcode_sector_frequencies',
'postcode_unit_frequencies']. (default: None)
--sample SAMPLE CSV filename for sample data. Header row present.
Columns: ['original_id', 'research_id',
'hashed_first_name', 'first_name_frequency',
'hashed_first_name_metaphone',
'first_name_metaphone_frequency',
'hashed_middle_names', 'middle_name_frequencies',
'hashed_middle_name_metaphones',
'middle_name_metaphone_frequencies', 'hashed_surname',
'surname_frequency', 'hashed_surname_metaphone',
'surname_metaphone_frequency', 'hashed_dob',
'hashed_gender', 'gender_frequency',
'hashed_postcode_units', 'postcode_unit_frequencies',
'hashed_postcode_sectors',
'postcode_sector_frequencies']. The fields
['hashed_postcode_sectors', 'hashed_postcode_units']
are in TemporalIdentifier format. TemporalIdentifier
format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
in YYYY-MM-DD format or one of ['none', 'null', '?']
(case-insensitive). Semicolon-separated values are
allowed within ['hashed_middle_name_metaphones',
'hashed_middle_names', 'hashed_postcode_sectors',
'hashed_postcode_units', 'middle_name_frequencies',
'middle_name_metaphone_frequencies',
'postcode_sector_frequencies',
'postcode_unit_frequencies']. (default: None)
--sample_cache SAMPLE_CACHE
File in which to store cached sample info (to speed
loading) (default: None)
--output OUTPUT Output CSV file for proband/sample comparison.
(default: None)
--extra_validation_output
Add extra output for validation purposes. (default:
False)
--n_workers N_WORKERS
Number of processes to use in parallel. (default: 8)
--max_chunksize MAX_CHUNKSIZE
Maximum chunk size (number of probands to pass to a
subprocess each time). (default: 500)
--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
Minimum number of probands for which we will bother to
use parallel processing. (default: 100)
--profile Profile the code (for development only). (default:
False)
===============================================================================
Help for command 'compare_hashed_to_plaintext'
===============================================================================
usage: crate_fuzzy_id_match compare_hashed_to_plaintext [-h] --probands
PROBANDS --sample
SAMPLE
[--sample_cache SAMPLE_CACHE]
--output OUTPUT
[--extra_validation_output]
[--n_workers N_WORKERS]
[--max_chunksize MAX_CHUNKSIZE]
[--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
[--profile]
Comparison rules:
- People MUST match on DOB and surname (or surname metaphone), or hashed
equivalents, to be considered a plausible match.
- Only plausible matches proceed to the Bayesian comparison.
Output file format:
- CSV file with header.
- Columns: ['proband_original_id', 'proband_research_id', 'matched', 'log_odds_match', 'p_match', 'sample_match_original_id', 'sample_match_research_id', 'second_best_log_odds']
- proband_original_id
Original (identifiable?) ID of the proband. Taken from the input.
Optional -- may be blank for de-identified comparisons.
- proband_research_id
Research ID (de-identified?) of the proband. Taken from the input.
- matched
Boolean. Was a matching person (a "winner") found in the sample, who
is to be considered a match to the proband? To give a match requires
(a) that the log odds for the winner reaches a threshold, and (b) that
the log odds for the winner exceeds the log odds for the runner-up by
a certain amount (because a mismatch may be worse than a failed match).
- log_odds_match
Log (ln) odds that the winner in the sample is a match to the proband.
- p_match
Probability that the winner in the sample is a match.
- sample_match_original_id
Original ID of the "winner" in the sample (the closest match to the
proband). Optional -- may be blank for de-identified comparisons.
- sample_match_research_id
Research ID of the winner in the sample.
- second_best_log_odds
Log odds of the runner up (the second-closest match) being the same
person as the proband.
- If '--extra_validation_output' is used, the following columns are added:
- best_person_original_id
Original ID of the closest-matching person in the sample, EVEN IF THEY
DID NOT WIN.
- best_person_research_id
Research ID of the closest-matching person in the sample, EVEN IF THEY
DID NOT WIN.
- The results file is NOT necessarily sorted as the input proband file was
(not sorting improves parallel processing efficiency).
optional arguments:
-h, --help show this help message and exit
--probands PROBANDS CSV filename for probands data. Header row present.
Columns: ['original_id', 'research_id',
'hashed_first_name', 'first_name_frequency',
'hashed_first_name_metaphone',
'first_name_metaphone_frequency',
'hashed_middle_names', 'middle_name_frequencies',
'hashed_middle_name_metaphones',
'middle_name_metaphone_frequencies', 'hashed_surname',
'surname_frequency', 'hashed_surname_metaphone',
'surname_metaphone_frequency', 'hashed_dob',
'hashed_gender', 'gender_frequency',
'hashed_postcode_units', 'postcode_unit_frequencies',
'hashed_postcode_sectors',
'postcode_sector_frequencies']. The fields
['hashed_postcode_sectors', 'hashed_postcode_units']
are in TemporalIdentifier format. TemporalIdentifier
format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
in YYYY-MM-DD format or one of ['none', 'null', '?']
(case-insensitive). Semicolon-separated values are
allowed within ['hashed_middle_name_metaphones',
'hashed_middle_names', 'hashed_postcode_sectors',
'hashed_postcode_units', 'middle_name_frequencies',
'middle_name_metaphone_frequencies',
'postcode_sector_frequencies',
'postcode_unit_frequencies']. (default: None)
--sample SAMPLE CSV filename for sample data. Header row present.
Columns: ['original_id', 'research_id', 'first_name',
'middle_names', 'surname', 'dob', 'gender',
'postcodes']. The fields ['postcodes'] are in
TemporalIdentifier format. TemporalIdentifier format:
IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-
MM-DD format or one of ['none', 'null', '?'] (case-
insensitive). Semicolon-separated values are allowed
within ['middle_names', 'postcodes']. (default: None)
--sample_cache SAMPLE_CACHE
File in which to store cached sample info (to speed
loading) (default: None)
--output OUTPUT Output CSV file for proband/sample comparison.
(default: None)
--extra_validation_output
Add extra output for validation purposes. (default:
False)
--n_workers N_WORKERS
Number of processes to use in parallel. (default: 8)
--max_chunksize MAX_CHUNKSIZE
Maximum chunk size (number of probands to pass to a
subprocess each time). (default: 500)
--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
Minimum number of probands for which we will bother to
use parallel processing. (default: 100)
--profile Profile the code (for development only). (default:
False)
===============================================================================
Help for command 'show_metaphone'
===============================================================================
usage: crate_fuzzy_id_match show_metaphone [-h] words [words ...]
positional arguments:
words Words to check
optional arguments:
-h, --help show this help message and exit
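show_metaphone prints the metaphone encoding of words: a phonetic code under which names that sound alike map to the same value, so that e.g. spelling variants of a surname can still match. To give a flavour of what phonetic encodings do, here is a simplified version of the much older Soundex algorithm (not the metaphone algorithm CRATE uses, which handles far more pronunciation rules):

```python
# Soundex digit for each consonant; vowels and h/w/y get no digit.
CODES = {**{c: "1" for c in "bfpv"}, **{c: "2" for c in "cgjkqsxz"},
         **{c: "3" for c in "dt"}, "l": "4",
         **{c: "5" for c in "mn"}, "r": "6"}

def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits.
    (Simplification: h/w break duplicate runs here, which full Soundex
    does not; fine for common names, and illustration only.)"""
    digits = [CODES.get(c, "") for c in name.lower()]
    out = [name[0].upper()]
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:   # skip vowels; collapse adjacent duplicates
            out.append(d)
        prev = d
    return "".join(out)[:4].ljust(4, "0")
```

For example, `soundex("Smith")` and `soundex("Smyth")` give the same code, which is the property that makes such encodings useful for fuzzy matching.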
===============================================================================
Help for command 'show_forename_freq'
===============================================================================
usage: crate_fuzzy_id_match show_forename_freq [-h] forenames [forenames ...]
positional arguments:
forenames Forenames to check
optional arguments:
-h, --help show this help message and exit
===============================================================================
Help for command 'show_forename_metaphone_freq'
===============================================================================
usage: crate_fuzzy_id_match show_forename_metaphone_freq [-h]
metaphones
[metaphones ...]
positional arguments:
  metaphones  Forename metaphones to check
optional arguments:
-h, --help show this help message and exit
===============================================================================
Help for command 'show_surname_freq'
===============================================================================
usage: crate_fuzzy_id_match show_surname_freq [-h] surnames [surnames ...]
positional arguments:
  surnames    Surnames to check
optional arguments:
-h, --help show this help message and exit
===============================================================================
Help for command 'show_surname_metaphone_freq'
===============================================================================
usage: crate_fuzzy_id_match show_surname_metaphone_freq [-h]
metaphones
[metaphones ...]
positional arguments:
  metaphones  Surname metaphones to check
optional arguments:
-h, --help show this help message and exit
===============================================================================
Help for command 'show_dob_freq'
===============================================================================
usage: crate_fuzzy_id_match show_dob_freq [-h]
optional arguments:
-h, --help show this help message and exit