5. Preprocessing tools

These tools prepare source databases and reference data for use with CRATE.

Although they are usually run before anonymisation, it’s probably more helpful to read the Anonymisation section first.

5.1. crate_preprocess_rio

The RiO preprocessor creates a unique integer field named crate_pk in all tables (copying the existing integer PK, creating one from an existing non-integer primary key, or adding a new one using SQL Server’s INT IDENTITY(1, 1) type). For all patient tables, it converts the patient ID (RiO number) into an integer field called crate_rio_number. It then adds indexes and views. All of these changes can be removed again, or updated incrementally if you add new data.
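
For the "add a new one" case, the added column corresponds broadly to DDL along the following lines. This is an illustrative sketch only, not the tool's actual implementation; the table name is hypothetical.

```python
def add_crate_pk_ddl(table: str) -> str:
    """SQL Server DDL (as a string) to add an autoincrementing integer
    primary-key column named crate_pk to a table. Illustrative only;
    the real preprocessor generates its DDL internally."""
    return f"ALTER TABLE {table} ADD crate_pk INT IDENTITY(1, 1) NOT NULL"

print(add_crate_pk_ddl("UserAssessMyTable"))
```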

The views ‘denormalize’ the data for convenience, since it can be pretty hard to follow the key chain of fully normalized tables. The views conform mostly to the names used by the Servelec RiO CRIS Extraction Program (RCEP), with added consistency. Because user lookups are common, to save typing (and in some cases keep the field length below the 64-character column name limit of MySQL), the following abbreviations are used:

_Resp_Clinician_

… Responsible Clinician

Options:

usage: crate_preprocess_rio [-h] --url URL [-v] [--print] [--echo] [--rcep]
                            [--drop-danger-drop] [--cpft] [--debug-skiptables]
                            [--prognotes-current-only | --prognotes-all]
                            [--clindocs-current-only | --clindocs-all]
                            [--allergies-current-only | --allergies-all]
                            [--audit-info | --no-audit-info]
                            [--postcodedb POSTCODEDB]
                            [--geogcols [GEOGCOLS [GEOGCOLS ...]]]
                            [--settings-filename SETTINGS_FILENAME]

*   Alters a RiO database to be suitable for CRATE.

*   By default, this treats the source database as being a copy of a RiO
    database (slightly later than version 6.2; exact version unclear).
    Use the "--rcep" (+/- "--cpft") switch(es) to treat it as a
    Servelec RiO CRIS Extract Program (RCEP) v2 output database.
    

optional arguments:
  -h, --help            show this help message and exit
  --url URL             SQLAlchemy database URL
  -v, --verbose         Verbose
  --print               Print SQL but do not execute it. (You can redirect the
                        printed output to create an SQL script.)
  --echo                Echo SQL
  --rcep                Treat the source database as the product of Servelec's
                        RiO CRIS Extract Program v2 (instead of raw RiO)
  --drop-danger-drop    REMOVES new columns and indexes, rather than creating
                        them. (There's not very much danger; no real
                        information is lost, but it might take a while to
                        recalculate it.)
  --cpft                Apply hacks for Cambridgeshire & Peterborough NHS
                        Foundation Trust (CPFT) RCEP database. Only applicable
                        with --rcep
  --debug-skiptables    DEBUG-ONLY OPTION. Skip tables (view creation only)
  --prognotes-current-only
                        Progress_Notes view restricted to current versions
                        only (* default)
  --prognotes-all       Progress_Notes view shows old versions too
  --clindocs-current-only
                        Clinical_Documents view restricted to current versions
                        only (*)
  --clindocs-all        Clinical_Documents view shows old versions too
  --allergies-current-only
                        Client_Allergies view restricted to current info only
  --allergies-all       Client_Allergies view shows deleted allergies too (*)
  --audit-info          Audit information (creation/update times) added to
                        views
  --no-audit-info       No audit information added (*)
  --postcodedb POSTCODEDB
                        Specify database (schema) name for ONS Postcode
                        Database (as imported by CRATE) to link to addresses
                        as a view. With SQL Server, you will have to specify
                        the schema as well as the database; e.g. "--postcodedb
                        ONS_PD.dbo"
  --geogcols [GEOGCOLS [GEOGCOLS ...]]
                        List of geographical information columns to link in
                        from ONS Postcode Database. BEWARE that you do not
                        specify anything too identifying. Default: pcon pct
                        nuts lea statsward casward lsoa01 msoa01 ur01ind oac01
                        lsoa11 msoa11 parish bua11 buasd11 ru11ind oac11 imd
  --settings-filename SETTINGS_FILENAME
                        Specify filename to write draft ddgen_* settings to,
                        for use in a CRATE anonymiser configuration file.

5.2. crate_preprocess_pcmis

Options:

usage: crate_preprocess_pcmis [-h] --url URL [-v] [--print] [--echo]
                              [--drop-danger-drop] [--debug-skiptables]
                              [--postcodedb POSTCODEDB]
                              [--geogcols [GEOGCOLS [GEOGCOLS ...]]]
                              [--settings-filename SETTINGS_FILENAME]

Alters a PCMIS database to be suitable for CRATE.

optional arguments:
  -h, --help            show this help message and exit
  --url URL             SQLAlchemy database URL (default: None)
  -v, --verbose         Verbose (default: False)
  --print               Print SQL but do not execute it. (You can redirect the
                        printed output to create an SQL script.) (default:
                        False)
  --echo                Echo SQL (default: False)
  --drop-danger-drop    REMOVES new columns and indexes, rather than creating
                        them. (There's not very much danger; no real
                        information is lost, but it might take a while to
                        recalculate it.) (default: False)
  --debug-skiptables    DEBUG-ONLY OPTION. Skip tables (view creation only)
                        (default: False)
  --postcodedb POSTCODEDB
                        Specify database (schema) name for ONS Postcode
                        Database (as imported by CRATE) to link to addresses
                        as a view. With SQL Server, you will have to specify
                        the schema as well as the database; e.g. "--postcodedb
                        ONS_PD.dbo" (default: None)
  --geogcols [GEOGCOLS [GEOGCOLS ...]]
                        List of geographical information columns to link in
                        from ONS Postcode Database. BEWARE that you do not
                        specify anything too identifying. (default: ['pcon',
                        'pct', 'nuts', 'lea', 'statsward', 'casward',
                        'lsoa01', 'msoa01', 'ur01ind', 'oac01', 'lsoa11',
                        'msoa11', 'parish', 'bua11', 'buasd11', 'ru11ind',
                        'oac11', 'imd'])
  --settings-filename SETTINGS_FILENAME
                        Specify filename to write draft ddgen_* settings to,
                        for use in a CRATE anonymiser configuration file.
                        (default: None)

5.3. crate_postcodes

Options:

usage: crate_postcodes [-h] [--dir DIR] [--url URL] [--echo]
                       [--reportevery REPORTEVERY] [--commitevery COMMITEVERY]
                       [--startswith STARTSWITH [STARTSWITH ...]] [--replace]
                       [--skiplookup]
                       [--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]]
                       [--list_lookup_tables] [--skippostcodes] [--docsonly]
                       [-v]

-   This program reads data from the UK Office for National Statistics Postcode
    Database (ONSPD) and inserts it into a database.

-   You will need to download the ONSPD from
        https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
    e.g. ONSPD_MAY_2016_csv.zip (79 Mb), and unzip it (>1.4 Gb) to a directory.
    Tell this program which directory you used.

-   Specify your database as an SQLAlchemy connection URL: see
        http://docs.sqlalchemy.org/en/latest/core/engines.html
    The general format is:
        dialect[+driver]://username:password@host[:port]/database[?key=value...]

-   If you get an error like:
        UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in
        position 33: ordinal not in range(256)
    then try appending "?charset=utf8" to the connection URL.

-   ONS POSTCODE DATABASE LICENSE.
    Output using this program must add the following attribution statements:

    Contains OS data © Crown copyright and database right [year]
    Contains Royal Mail data © Royal Mail copyright and database right [year]
    Contains National Statistics data © Crown copyright and database right [year]

    See http://www.ons.gov.uk/methodology/geography/licences
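
The connection-URL format above can be assembled mechanically. A minimal sketch follows; the credentials, host, and database name are made up, and real code can instead use SQLAlchemy's own URL helpers.

```python
def make_sqlalchemy_url(dialect, username, password, host, database,
                        driver=None, port=None, query=None):
    """Build a URL of the form
    dialect[+driver]://username:password@host[:port]/database[?key=value...]
    Illustrative only."""
    scheme = dialect if driver is None else f"{dialect}+{driver}"
    netloc = f"{username}:{password}@{host}"
    if port is not None:
        netloc += f":{port}"
    url = f"{scheme}://{netloc}/{database}"
    if query:  # e.g. {"charset": "utf8"} to avoid the latin-1 error above
        url += "?" + "&".join(f"{k}={v}" for k, v in query.items())
    return url

print(make_sqlalchemy_url("mysql", "researcher", "secret", "127.0.0.1",
                          "onspd", driver="pymysql",
                          query={"charset": "utf8"}))
# mysql+pymysql://researcher:secret@127.0.0.1/onspd?charset=utf8
```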
    

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR             Root directory of unzipped ONSPD download (default:
                        /path/to/unzipped/ONSPD/download)
  --url URL             SQLAlchemy database URL (default: None)
  --echo                Echo SQL (default: False)
  --reportevery REPORTEVERY
                        Report every n rows (default: 1000)
  --commitevery COMMITEVERY
                        Commit every n rows. If you make this too large
                        (relative, e.g., to your MySQL max_allowed_packet
                        setting), you may get crashes with errors like 'MySQL
                        has gone away'. (default: 10000)
  --startswith STARTSWITH [STARTSWITH ...]
                        Restrict to postcodes that start with one of these
                        strings (default: None)
  --replace             Replace tables even if they exist (default: skip
                        existing tables) (default: False)
  --skiplookup          Skip generation of code lookup tables (default: False)
  --specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]
                        Within the lookup tables, process only specific named
                        tables (default: None)
  --list_lookup_tables  List all possible lookup tables, then stop (default:
                        False)
  --skippostcodes       Skip generation of main (large) postcode table
                        (default: False)
  --docsonly            Show help for postcode table then stop (default:
                        False)
  -v, --verbose         Verbose (default: False)

5.4. crate_fetch_wordlists

This tool assists in fetching common word lists, such as name lists for global denial, and words to exclude from such lists (such as English words or medical eponyms). It also provides an exclusion filter system, to find lines in some files that are absent from others.

Options:

usage: crate_fetch_wordlists [-h] [--verbose]
                             [--min_word_length MIN_WORD_LENGTH]
                             [--show_rejects] [--english_words]
                             [--english_words_output ENGLISH_WORDS_OUTPUT]
                             [--english_words_url ENGLISH_WORDS_URL]
                             [--valid_word_regex VALID_WORD_REGEX]
                             [--us_forenames]
                             [--us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT]
                             [--us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT]
                             [--us_forenames_url US_FORENAMES_URL]
                             [--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT]
                             [--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT]
                             [--us_forenames_output US_FORENAMES_OUTPUT]
                             [--us_surnames]
                             [--us_surnames_output US_SURNAMES_OUTPUT]
                             [--us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT]
                             [--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL]
                             [--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL]
                             [--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT]
                             [--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT]
                             [--eponyms] [--eponyms_output EPONYMS_OUTPUT]
                             [--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]]
                             [--filter_input [FILTER_INPUT [FILTER_INPUT ...]]]
                             [--filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]]
                             [--filter_output [FILTER_OUTPUT]]

optional arguments:
  -h, --help            show this help message and exit
  --verbose, -v         Be verbose (default: False)
  --min_word_length MIN_WORD_LENGTH
                        Minimum word (or name) length to allow (default: 2)
  --show_rejects        Print to stdout (and, in verbose mode, log) the words
                        being rejected (default: False)

English words:
  --english_words       Fetch English words (to remove from the nonspecific
                        denylist, not to add to an allowlist; consider words
                        like smith) (default: False)
  --english_words_output ENGLISH_WORDS_OUTPUT
                        Output file for English words (default:
                        english_words.txt)
  --english_words_url ENGLISH_WORDS_URL
                        URL for a textfile containing all English words (will
                        then be filtered) (default: https://www.gutenberg.org/
                        files/3201/files/CROSSWD.TXT)
  --valid_word_regex VALID_WORD_REGEX
                        Regular expression to determine valid English words
                        (default: ^[a-z](?:[A-Za-z'-]*[a-z])*$)

US forenames:
  --us_forenames        Fetch US forenames (for denylist) (default: False)
  --us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT
                        Output CSV file for US forenames with frequencies
                        (columns are: name, frequency) (default:
                        us_forename_freq.csv)
  --us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT
                        Output CSV file for US forenames with sex and
                        frequencies (columns are: name, gender, frequency)
                        (default: us_forename_sex_freq.csv)
  --us_forenames_url US_FORENAMES_URL
                        URL to Zip file of US Census-derived forenames lists
                        (excludes names with national frequency <5; see
                        https://www.ssa.gov/OACT/babynames/limits.html)
                        (default:
                        https://www.ssa.gov/OACT/babynames/names.zip)
  --us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was at least
                        this value. Range is 0-100. Use 0 for no limit.
                        Setting this above 0 excludes COMMON names. (This is a
                        trade-off between being comprehensive and operating at
                        a reasonable speed. Higher numbers are more
                        comprehensive but slower.) (default: 0)
  --us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was less than
                        or equal to this value. Range is 0-100. Use 100 for no
                        limit. Setting this below 100 excludes RARE names.
                        (This is a trade-off between being comprehensive and
                        operating at a reasonable speed. Higher numbers are
                        more comprehensive but slower.) (default: 100)
  --us_forenames_output US_FORENAMES_OUTPUT
                        Output file for US forenames (default:
                        us_forenames.txt)

US surnames:
  --us_surnames         Fetch US surnames (for denylist) (default: False)
  --us_surnames_output US_SURNAMES_OUTPUT
                        Output text file for US surnames (default:
                        us_surnames.txt)
  --us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT
                        Output CSV file for US surnames with frequencies
                        (columns are: name, frequency) (default:
                        us_surname_freq.csv)
  --us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL
                        URL for textfile of US 1990 Census surnames (default: 
                        http://www2.census.gov/topics/genealogy/1990surnames/d
                        ist.all.last)
  --us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL
                        URL for zip of US 2010 Census surnames (default: https
                        ://www2.census.gov/topics/genealogy/2010surnames/names
                        .zip)
  --us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was at least
                        this value. Range is 0-100. Use 0 for no limit.
                        Setting this above 0 excludes COMMON names. (This is a
                        trade-off between being comprehensive and operating at
                        a reasonable speed. Higher numbers are more
                        comprehensive but slower.) (default: 0)
  --us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was less than
                        or equal to this value. Range is 0-100. Use 100 for no
                        limit. Setting this below 100 excludes RARE names.
                        (This is a trade-off between being comprehensive and
                        operating at a reasonable speed. Higher numbers are
                        more comprehensive but slower.) (default: 100)

Medical eponyms:
  --eponyms             Write medical eponyms (to remove from denylist)
                        (default: False)
  --eponyms_output EPONYMS_OUTPUT
                        Output file for medical eponyms (default:
                        medical_eponyms.txt)
  --eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]
                        Add unaccented versions (e.g. Sjogren as well as
                        Sjögren) (default: True)

Filter functions:
  Extra functions to filter wordlists. Specify an input file (or files),
  whose lines will be included; optional exclusion file(s), whose lines will
  be excluded (in case-insensitive fashion); and an output file. You can use
  '-' for the output file to mean 'stdout', and for one input file to mean
  'stdin'. No filenames (other than '-' for input and output) may overlap.
  The --min_word_length option also applies. Duplicates are not removed.

  --filter_input [FILTER_INPUT [FILTER_INPUT ...]]
                        Input file(s). See above. (default: None)
  --filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]
                        Exclusion file(s). See above. (default: None)
  --filter_output [FILTER_OUTPUT]
                        Output file. See above. (default: None)
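
The *_min_cumfreq_pct/*_max_cumfreq_pct options above describe a window over cumulative name frequency. The following sketch is one reading of that rule (not the actual implementation); the toy names and frequencies are made up.

```python
def names_in_cumfreq_window(name_freqs, min_pct=0.0, max_pct=100.0):
    """name_freqs: (name, frequency %) pairs, sorted by descending
    frequency. Keep names whose cumulative frequency percentage, up to
    and including the name, lies in [min_pct, max_pct]. Raising min_pct
    above 0 drops the commonest names; lowering max_pct below 100 drops
    the rarest."""
    kept = []
    cumfreq_pct = 0.0
    for name, freq_pct in name_freqs:
        cumfreq_pct += freq_pct
        if min_pct <= cumfreq_pct <= max_pct:
            kept.append(name)
    return kept

freqs = [("JAMES", 40.0), ("MARY", 30.0), ("ZELIG", 20.0), ("XANTHE", 10.0)]
print(names_in_cumfreq_window(freqs, max_pct=75.0))  # ['JAMES', 'MARY']
```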

Specimen usage:

#!/bin/bash
# -----------------------------------------------------------------------------
# Specimen usage under Linux
# -----------------------------------------------------------------------------

# Downloading these and then using a file:// URL is unnecessary, but it makes
# the processing steps faster if we need to retry with new settings.
wget https://www.gutenberg.org/files/3201/files/CROSSWD.TXT -O dictionary.txt
wget https://www.ssa.gov/OACT/babynames/names.zip -O forenames.zip
wget http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last -O surnames_1990.txt
wget https://www2.census.gov/topics/genealogy/2010surnames/names.zip -O surnames_2010.zip

crate_fetch_wordlists --help

crate_fetch_wordlists \
    --english_words \
        --english_words_url "file://${PWD}/dictionary.txt" \
    --us_forenames \
        --us_forenames_url "file://${PWD}/forenames.zip" \
        --us_forenames_max_cumfreq_pct 100 \
    --us_surnames \
        --us_surnames_1990_census_url "file://${PWD}/surnames_1990.txt" \
        --us_surnames_2010_census_url "file://${PWD}/surnames_2010.zip" \
        --us_surnames_max_cumfreq_pct 100 \
    --eponyms

#    --show_rejects \
#    --verbose

# Forenames encompassing the top 95% gives 5874 forenames (of 96174).
# Surnames encompassing the top 85% gives 74525 surnames (of 175880).

crate_fetch_wordlists \
    --filter_input \
        us_forenames.txt \
        us_surnames.txt \
    --filter_exclude \
        english_words.txt \
        medical_eponyms.txt \
    --filter_output \
        filtered_names.txt
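
The final filtering step above keeps each input line that is absent, case-insensitively, from every exclusion file. In essence (a sketch with made-up example words, operating on lists rather than files):

```python
def filter_wordlists(input_lines, exclude_lines):
    """Keep lines from the inputs that do not appear (case-insensitively)
    in the exclusion lists. Duplicates are not removed."""
    excluded = {line.lower() for line in exclude_lines}
    return [line for line in input_lines if line.lower() not in excluded]

names = ["Smith", "Amy", "Addison"]
english_words = ["smith"]      # e.g. from english_words.txt
eponyms = ["addison"]          # e.g. from medical_eponyms.txt
print(filter_wordlists(names, english_words + eponyms))  # ['Amy']
```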

5.5. crate_fuzzy_id_match

In development.

See crate_anon.preprocess.fuzzy_id_match.
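
Two of the numeric defaults in the options below can be derived directly: --min_log_odds_for_match defaults to the natural log odds equivalent to p = 0.999, and the probability of two random people sharing a date of birth is 1/(365.25 × b):

```python
import math

# ln odds equivalent to p = 0.999: the --min_log_odds_for_match default.
p = 0.999
min_log_odds_for_match = math.log(p / (1 - p))  # ≈ 6.9068

# Probability of a shared DOB, with the --birth_year_pseudo_range
# default of b = 90.
b = 90
p_two_people_share_dob = 1 / (365.25 * b)  # ≈ 3.04e-5
```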

Options (from crate_fuzzy_id_match --allhelp):

usage: crate_fuzzy_id_match [-h] [--version] [--allhelp] [--verbose]
                            [--key KEY] [--allow_default_hash_key]
                            [--rounding_sf ROUNDING_SF]
                            [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                            [--forename_cache_filename FORENAME_CACHE_FILENAME]
                            [--surname_freq_csv SURNAME_FREQ_CSV]
                            [--surname_cache_filename SURNAME_CACHE_FILENAME]
                            [--name_min_frequency NAME_MIN_FREQUENCY]
                            [--p_middle_name_n_present P_MIDDLE_NAME_N_PRESENT]
                            [--population_size POPULATION_SIZE]
                            [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                            [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                            [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                            [--mean_oa_population MEAN_OA_POPULATION]
                            [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                            [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                            [--p_minor_forename_error P_MINOR_FORENAME_ERROR]
                            [--p_minor_surname_error P_MINOR_SURNAME_ERROR]
                            [--p_proband_middle_name_missing P_PROBAND_MIDDLE_NAME_MISSING]
                            [--p_sample_middle_name_missing P_SAMPLE_MIDDLE_NAME_MISSING]
                            [--p_minor_postcode_error P_MINOR_POSTCODE_ERROR]
                            [--p_gender_error P_GENDER_ERROR]
                            [--min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH]
                            [--exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS]
                            {selftest,speedtest,print_demo_sample,validate1,validate2_fetch_cdl,validate2_fetch_rio,hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext,show_metaphone,show_forename_freq,show_forename_metaphone_freq,show_surname_freq,show_surname_metaphone_freq,show_dob_freq}
                            ...

Identity matching via hashed fuzzy identifiers

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --allhelp             show help for all commands and exit

display options:
  --verbose             Be verbose (default: False)

hasher (secrecy) options:
  --key KEY             Key (passphrase) for hasher (default: fuzzy_id_match_d
                        efault_hash_key_DO_NOT_USE_FOR_LIVE_DATA)
  --allow_default_hash_key
                        Allow the default hash key to be used beyond tests.
                        INADVISABLE! (default: False)
  --rounding_sf ROUNDING_SF
                        Number of significant figures to use when rounding
                        frequencies in hashed version (default: 5)

frequency information for prior probabilities:
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. (default:
                        /path/to/crate/user/data/us_forename_sex_freq.csv)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading) (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.pickle)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists. (default:
                        /path/to/crate/user/data/us_surname_freq.csv)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading) (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.pickle)
  --name_min_frequency NAME_MIN_FREQUENCY
                        Minimum base frequency for names. If a frequency is
                        less than this, use this minimum. Allowing extremely
                        low frequencies may increase the chances of a spurious
                        match. Note also that typical name frequency tables
                        don't give very-low-frequency information. For
                        example, for US census forename/surname information,
                        below 0.001 percent they report 0.000 percent; so a
                        reasonable minimum is 0.0005 percent or 0.000005 or
                        5e-6. (default: 5e-06)
  --p_middle_name_n_present P_MIDDLE_NAME_N_PRESENT
                        CSV list of probabilities that a randomly selected
                        person has a certain number of middle names. The first
                        number is P(has a first middle name). The second
                        number is P(has a second middle name | has a first
                        middle name), and so on. The last number present will
                        be re-used ad infinitum if someone has more names.
                        (default: 0.8,0.1375)
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 66040000)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The sole purpose is to
                        calculate the probability of two random people sharing
                        a DOB, which is taken as 1/(365.25 * b). This option
                        is b. (default: 90)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS data
                        (default: /path/to/postcodes/file)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading) (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.pickle)
  --mean_oa_population MEAN_OA_POPULATION
                        Mean population of a UK Census Output Area, from which
                        we estimate the population of postcode-based units.
                        (default: 309)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)

error probabilities:
  --p_minor_forename_error P_MINOR_FORENAME_ERROR
                        Assumed probability that a forename has an error in it
                        that means it fails a full match but satisfies a
                        partial (metaphone) match. (default: 0.001)
  --p_minor_surname_error P_MINOR_SURNAME_ERROR
                        Assumed probability that a surname has an error in it
                        that means it fails a full match but satisfies a
                        partial (metaphone) match. (default: 0.001)
  --p_proband_middle_name_missing P_PROBAND_MIDDLE_NAME_MISSING
                        Probability that a middle name, present in the sample,
                        is missing from the proband. (default: 0.05)
  --p_sample_middle_name_missing P_SAMPLE_MIDDLE_NAME_MISSING
                        Probability that a middle name, present in the
                        proband, is missing from the sample. (default: 0.05)
  --p_minor_postcode_error P_MINOR_POSTCODE_ERROR
                        Assumed probability that a postcode has an error in it
                        that means it fails a full (postcode unit) match but
                        satisfies a partial (postcode sector) match. (default:
                        0.001)
  --p_gender_error P_GENDER_ERROR
                        Assumed probability that a gender is wrong, leading to
                        a proband/candidate mismatch. (default: 0.0001)

matching rules:
  --min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH
                        Minimum natural log (ln) odds of two people being the
                        same, before a match will be considered. (Default is
                        equivalent to p = 0.999.) (default: 6.906754778648553)
  --exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS
                        Minimum log (ln) odds by which a best match must
                        exceed the next-best match to be considered a unique
                        match. (default: 10)

commands:
  Valid commands are as follows.

  {selftest,speedtest,print_demo_sample,validate1,validate2_fetch_cdl,validate2_fetch_rio,hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext,show_metaphone,show_forename_freq,show_forename_metaphone_freq,show_surname_freq,show_surname_metaphone_freq,show_dob_freq}
                        Specify one command.
    selftest            Run self-tests and stop
    speedtest           Run speed tests and stop
    print_demo_sample   Print a demo sample .CSV file (#1).
    validate1           Run validation test 1 and stop. In this test, a list
                        of people is compared to a version of itself, at times
                        with elements deleted or with typos introduced.
    validate2_fetch_cdl
                        Validation 2A: fetch people from CPFT CDL database
    validate2_fetch_rio
                        Validation 2B: fetch people from CPFT RiO database
    hash                STEP 1 OF DE-IDENTIFIED LINKAGE. Hash an identifiable
                        CSV file into an encrypted one.
    compare_plaintext   IDENTIFIABLE LINKAGE COMMAND. Compare a list of
                        probands against a sample (both in plaintext).
    compare_hashed_to_hashed
                        STEP 2 OF DE-IDENTIFIED LINKAGE (for when you have de-
                        identified both sides in advance). Compare a list of
                        probands against a sample (both hashed).
    compare_hashed_to_plaintext
                        STEP 2 OF DE-IDENTIFIED LINKAGE (for when you have
                        received de-identified data and you want to link to
                        your identifiable data, producing a de-identified
                        result). Compare a list of probands (hashed) against a
                        sample (plaintext).
    show_metaphone      Show metaphones of words
    show_forename_freq  Show frequencies of forenames
    show_forename_metaphone_freq
                        Show frequencies of forename metaphones
    show_surname_freq   Show frequencies of surnames
    show_surname_metaphone_freq
                        Show frequencies of surname metaphones
    show_dob_freq       Show the frequency of any DOB

===============================================================================
Help for command 'selftest'
===============================================================================
usage: crate_fuzzy_id_match selftest [-h]

This will run a bunch of self-tests and crash out if one fails.

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'speedtest'
===============================================================================
usage: crate_fuzzy_id_match speedtest [-h] [--profile]

This will run several comparisons to test hashing and comparison speed.
Results are reported as microseconds per comparison.

optional arguments:
  -h, --help  show this help message and exit
  --profile   Profile (makes things slower but shows you what's taking the
              time). (default: False)

===============================================================================
Help for command 'print_demo_sample'
===============================================================================
usage: crate_fuzzy_id_match print_demo_sample [-h]

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'validate1'
===============================================================================
usage: crate_fuzzy_id_match validate1 [-h] --people PEOPLE --output OUTPUT
                                      [--seed SEED]

    Takes an identifiable list of people (typically a short list of imaginary
    people!) and validates the matching process.

    This is done by splitting the input list into two groups (alternating),
    then comparing a list of probands either against itself (there should be
    matches) or against the other group (there should generally not be).
    The process is carried out in cleartext (plaintext) and hashed. At times
    it's made harder by introducing deletions or mutations (typos) into one of
    the groups.

    Here's a specimen test CSV file to use, with entirely made-up people and
    institutional (not personal) postcodes in Cambridge:

original_id,research_id,first_name,middle_names,surname,dob,gender,postcodes
1,r1,Alice,Zara,Smith,1931-01-01,F,CB2 0QQ/2000-01-01/2010-12-31
2,r2,Bob,Yorick,Jones,1932-01-01,M,CB2 3EB/2000-01-01/2010-12-31
3,r3,Celia,Xena,Wright,1933-01-01,F,CB2 1TP/2000-01-01/2010-12-31
4,r4,David,William;Wallace,Cartwright,1934-01-01,M,CB2 8PH/2000-01-01/2010-12-31;CB2 1TP/2000-01-01/2010-12-31
5,r5,Emily,Violet,Fisher,1935-01-01,F,CB3 9DF/2000-01-01/2010-12-31
6,r6,Frank,Umberto,Williams,1936-01-01,M,CB2 1TQ/2000-01-01/2010-12-31
7,r7,Greta,Tilly,Taylor,1937-01-01,F,CB2 1DQ/2000-01-01/2010-12-31
8,r8,Harry,Samuel,Davies,1938-01-01,M,CB3 9ET/2000-01-01/2010-12-31
9,r9,Iris,Ruth,Evans,1939-01-01,F,CB3 0DG/2000-01-01/2010-12-31
10,r10,James,Quentin,Thomas,1940-01-01,M,CB2 0SZ/2000-01-01/2010-12-31
11,r11,Alice,,Smith,1931-01-01,F,CB2 0QQ/2000-01-01/2010-12-31

    Explanation of the output format:

    - 'collection_name' is a human-readable name summarizing the next four;
    - 'in_sample' (boolean) is whether the probands are in the sample;
    - 'deletions' (boolean) is whether random items have been deleted from
       the probands;
    - 'typos' (boolean) is whether random typos have been made in the
       probands;

    - 'is_hashed' (boolean) is whether the proband and sample are hashed;
    - 'original_id' is the gold-standard ID of the proband;
    - 'winner_id' is the ID of the best-matching person in the sample if they
      were a good enough match to win;
    - 'best_match_id' is the ID of the best-matching person in the sample;
    - 'best_log_odds' is the calculated log (ln) odds that the proband and the
      sample member identified by 'winner_id' are the same person (ideally
      high if there is a true match, low if not);
    - 'second_best_log_odds' is the calculated log odds of the proband and the
      runner-up being the same person (ideally low);
    - 'second_best_match_id' is the ID of the second-best matching person, if
      there is one;

    - 'correct_if_winner' is whether the proband and winner IDs are the same
      (ideally true);
    - 'leader_advantage' is the log odds by which the winner beats the
      runner-up (ideally high indicating a strong preference for the winner
      over the runner-up).

    Clearly, if the probands are in the sample, then a match may occur; if not,
    no match should occur. If hashing is in use, this tests de-identified
    linkage; if not, this tests identifiable linkage. Deletions and typos
    may reduce (but we hope not always eliminate) the likelihood of a match,
    and we don't want to see mismatches.

    For n input rows, each basic set test involves n^2/2 comparisons.
    Then we repeat for typos and deletions. (There is no point in DOB typos
    as our rules preclude that.)

    Examine:
    - P(unique plaintext match | proband in sample) -- should be close to 1.
    - P(unique plaintext match | proband in others) -- should be close to 0.
    - P(unique hashed match | proband in sample) -- should be close to 1.
    - P(unique hashed match | proband in others) -- should be close to 0.

optional arguments:
  -h, --help       show this help message and exit
  --people PEOPLE  CSV filename for validation 1 data. Header row present.
                   Columns: ['original_id', 'research_id', 'first_name',
                   'middle_names', 'surname', 'dob', 'gender', 'postcodes'].
                   The fields ['postcodes'] are in TemporalIdentifier format.
                   TemporalIdentifier format: IDENTIFIER/STARTDATE/ENDDATE,
                   where dates are in YYYY-MM-DD format or one of ['none',
                   'null', '?'] (case-insensitive). Semicolon-separated values
                   are allowed within ['middle_names', 'postcodes']. (default:
                   None)
  --output OUTPUT  Output CSV file for validation. Header row present.
                   Columns: ['collection_name', 'in_sample', 'deletions',
                   'typos', 'is_hashed', 'original_id', 'winner_id',
                   'best_match_id', 'best_log_odds', 'second_best_log_odds',
                   'second_best_match_id', 'correct_if_winner',
                   'leader_advantage']. (default: None)
  --seed SEED      Random number seed, for introducing deliberate errors in
                   validation test 1 (default: 1234)
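The TemporalIdentifier format described above (IDENTIFIER/STARTDATE/ENDDATE, with 'none', 'null' or '?' for an unknown date) is straightforward to parse. An illustrative Python sketch, not CRATE's own parser (the function name is invented here):

```python
from datetime import date
from typing import Optional, Tuple

# Case-insensitive tokens meaning "date unknown", per the help text:
NULL_TOKENS = {"none", "null", "?"}


def parse_temporal_identifier(
    value: str,
) -> Tuple[str, Optional[date], Optional[date]]:
    """Split IDENTIFIER/STARTDATE/ENDDATE into (identifier, start, end)."""
    identifier, start_s, end_s = value.split("/")

    def parse_date(s: str) -> Optional[date]:
        if s.lower() in NULL_TOKENS:
            return None
        return date.fromisoformat(s)  # expects YYYY-MM-DD

    return identifier, parse_date(start_s), parse_date(end_s)


print(parse_temporal_identifier("CB2 0QQ/2000-01-01/2010-12-31"))
# ('CB2 0QQ', datetime.date(2000, 1, 1), datetime.date(2010, 12, 31))
```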

===============================================================================
Help for command 'validate2_fetch_cdl'
===============================================================================
usage: crate_fuzzy_id_match validate2_fetch_cdl [-h] --url URL [--echo]
                                                --output OUTPUT

    Validation #2. Sequence:

    1. Fetch

    - crate_fuzzy_id_match validate2_fetch_cdl --output validate2_cdl_DANGER_IDENTIFIABLE.csv --url <SQLALCHEMY_URL_CDL>
    - crate_fuzzy_id_match validate2_fetch_rio --output validate2_rio_DANGER_IDENTIFIABLE.csv --url <SQLALCHEMY_URL_RIO>

    2. Hash

    - crate_fuzzy_id_match hash --input validate2_cdl_DANGER_IDENTIFIABLE.csv --output validate2_cdl_hashed.csv --include_original_id --allow_default_hash_key
    - crate_fuzzy_id_match hash --input validate2_rio_DANGER_IDENTIFIABLE.csv --output validate2_rio_hashed.csv --include_original_id --allow_default_hash_key

    3. Compare

    - crate_fuzzy_id_match compare_plaintext --population_size 852523 --probands validate2_cdl_DANGER_IDENTIFIABLE.csv --sample validate2_rio_DANGER_IDENTIFIABLE.csv --output cdl_to_rio_plaintext.csv --extra_validation_output
    - crate_fuzzy_id_match compare_hashed_to_hashed --population_size 852523 --probands validate2_cdl_hashed.csv --sample validate2_rio_hashed.csv --output cdl_to_rio_hashed.csv --extra_validation_output
    - crate_fuzzy_id_match compare_plaintext --population_size 852523 --probands validate2_rio_DANGER_IDENTIFIABLE.csv --sample validate2_cdl_DANGER_IDENTIFIABLE.csv --output rio_to_cdl_plaintext.csv --extra_validation_output
    - crate_fuzzy_id_match compare_hashed_to_hashed --population_size 852523 --probands validate2_rio_hashed.csv --sample validate2_cdl_hashed.csv --output rio_to_cdl_hashed.csv --extra_validation_output

optional arguments:
  -h, --help       show this help message and exit
  --url URL        SQLAlchemy URL for CPFT CDL source (IDENTIFIABLE) database
                   (default: None)
  --echo           Echo SQL? (default: False)
  --output OUTPUT  CSV filename for output (plaintext, IDENTIFIABLE) data.
                   Header row present. Columns: ['original_id', 'research_id',
                   'first_name', 'middle_names', 'surname', 'dob', 'gender',
                   'postcodes']. The fields ['postcodes'] are in
                   TemporalIdentifier format. TemporalIdentifier format:
                   IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-MM-DD
                   format or one of ['none', 'null', '?'] (case-insensitive).
                   Semicolon-separated values are allowed within
                   ['middle_names', 'postcodes']. (default: None)

===============================================================================
Help for command 'validate2_fetch_rio'
===============================================================================
usage: crate_fuzzy_id_match validate2_fetch_rio [-h] --url URL [--echo]
                                                --output OUTPUT

See validate2_fetch_cdl command.

optional arguments:
  -h, --help       show this help message and exit
  --url URL        SQLAlchemy URL for CPFT RiO source (IDENTIFIABLE) database
                   (default: None)
  --echo           Echo SQL? (default: False)
  --output OUTPUT  CSV filename for output (plaintext, IDENTIFIABLE) data.
                   Header row present. Columns: ['original_id', 'research_id',
                   'first_name', 'middle_names', 'surname', 'dob', 'gender',
                   'postcodes']. The fields ['postcodes'] are in
                   TemporalIdentifier format. TemporalIdentifier format:
                   IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-MM-DD
                   format or one of ['none', 'null', '?'] (case-insensitive).
                   Semicolon-separated values are allowed within
                   ['middle_names', 'postcodes']. (default: None)

===============================================================================
Help for command 'hash'
===============================================================================
usage: crate_fuzzy_id_match hash [-h] --input INPUT --output OUTPUT
                                 [--without_frequencies]
                                 [--include_original_id]

    Takes an identifiable list of people (with name, DOB, and postcode
    information) and creates a hashed, de-identified equivalent.

    The research ID (presumed not to be a direct identifier) is preserved.
    Optionally, the unique original ID (e.g. NHS number, presumed to be a
    direct identifier) is preserved, but you have to ask for that explicitly.
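The help text does not specify the hashing scheme; as an illustration of the general idea of keyed hashing for de-identification, here is a minimal HMAC-SHA-256 sketch (not CRATE's implementation; the key value and the case normalisation are assumptions for the example):

```python
import hashlib
import hmac


def keyed_hash(value: str, key: bytes) -> str:
    """Illustrative keyed hash (HMAC-SHA-256). Using a secret key means an
    attacker cannot reverse identifiers by brute-forcing a public hash.
    Case normalisation here is an assumption, not documented behaviour."""
    return hmac.new(key, value.upper().encode("utf-8"),
                    hashlib.sha256).hexdigest()


key = b"example-secret-key"  # hypothetical key, for illustration only
print(keyed_hash("Smith", key) == keyed_hash("SMITH", key))  # True
```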

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         CSV filename for input (plaintext) data. Header row
                        present. Columns: ['original_id', 'research_id',
                        'first_name', 'middle_names', 'surname', 'dob',
                        'gender', 'postcodes']. The fields ['postcodes'] are
                        in TemporalIdentifier format. TemporalIdentifier
                        format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
                        in YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). Semicolon-separated values are
                        allowed within ['middle_names', 'postcodes'].
                        (default: None)
  --output OUTPUT       Output CSV file for hashed version. Header row
                        present. Columns: ['original_id', 'research_id',
                        'hashed_first_name', 'first_name_frequency',
                        'hashed_first_name_metaphone',
                        'first_name_metaphone_frequency',
                        'hashed_middle_names', 'middle_name_frequencies',
                        'hashed_middle_name_metaphones',
                        'middle_name_metaphone_frequencies', 'hashed_surname',
                        'surname_frequency', 'hashed_surname_metaphone',
                        'surname_metaphone_frequency', 'hashed_dob',
                        'hashed_gender', 'gender_frequency',
                        'hashed_postcode_units', 'postcode_unit_frequencies',
                        'hashed_postcode_sectors',
                        'postcode_sector_frequencies']. The fields
                        ['hashed_postcode_sectors', 'hashed_postcode_units']
                        are in TemporalIdentifier format. TemporalIdentifier
                        format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
                        in YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). Semicolon-separated values are
                        allowed within ['hashed_middle_name_metaphones',
                        'hashed_middle_names', 'hashed_postcode_sectors',
                        'hashed_postcode_units', 'middle_name_frequencies',
                        'middle_name_metaphone_frequencies',
                        'postcode_sector_frequencies',
                        'postcode_unit_frequencies']. (default: None)
  --without_frequencies
                        Do not include frequency information. This makes the
                        result suitable for use as a sample file, but not a
                        proband file. (default: False)
  --include_original_id
                        Include the (potentially identifying) 'original_id'
                        data? Usually False; may be set to True for
                        validation. (default: False)

===============================================================================
Help for command 'compare_plaintext'
===============================================================================
usage: crate_fuzzy_id_match compare_plaintext [-h] --probands PROBANDS
                                              --sample SAMPLE
                                              [--sample_cache SAMPLE_CACHE]
                                              --output OUTPUT
                                              [--extra_validation_output]
                                              [--n_workers N_WORKERS]
                                              [--max_chunksize MAX_CHUNKSIZE]
                                              [--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
                                              [--profile]

    Comparison rules:

    - People MUST match on DOB and surname (or surname metaphone), or hashed
      equivalents, to be considered a plausible match.
    - Only plausible matches proceed to the Bayesian comparison.

    Output file format:

    - CSV file with header.
    - Columns: ['proband_original_id', 'proband_research_id', 'matched', 'log_odds_match', 'p_match', 'sample_match_original_id', 'sample_match_research_id', 'second_best_log_odds']

      - proband_original_id
        Original (identifiable?) ID of the proband. Taken from the input.
        Optional -- may be blank for de-identified comparisons.

      - proband_research_id
        Research ID (de-identified?) of the proband. Taken from the input.

      - matched
        Boolean. Was a matching person (a "winner") found in the sample, who
        is to be considered a match to the proband? To give a match requires
        (a) that the log odds for the winner reaches a threshold, and (b) that
        the log odds for the winner exceeds the log odds for the runner-up by
        a certain amount (because a mismatch may be worse than a failed match).

      - log_odds_match
        Log (ln) odds that the winner in the sample is a match to the proband.

      - p_match
        Probability that the winner in the sample is a match.

      - sample_match_original_id
        Original ID of the "winner" in the sample (the closest match to the
        proband). Optional -- may be blank for de-identified comparisons.

      - sample_match_research_id
        Research ID of the winner in the sample.

      - second_best_log_odds
        Log odds of the runner-up (the second-closest match) being the same
        person as the proband.

    - If '--extra_validation_output' is used, the following columns are added:

      - best_person_original_id
        Original ID of the closest-matching person in the sample, EVEN IF THEY
        DID NOT WIN.

      - best_person_research_id
        Research ID of the closest-matching person in the sample, EVEN IF THEY
        DID NOT WIN.

    - The results file is NOT necessarily in the same order as the input
      proband file (leaving the output unsorted improves parallel processing
      efficiency).
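The 'matched' decision described above combines the two global thresholds, `--min_log_odds_for_match` and `--exceeds_next_best_log_odds`. A minimal sketch of that rule, reconstructed from the help text (whether the comparisons are strict or non-strict is an assumption; defaults are taken from the options listed earlier):

```python
def is_match(best_log_odds: float,
             second_best_log_odds: float,
             min_log_odds: float = 6.906754778648553,   # ~ p = 0.999
             exceeds_next_best: float = 10.0) -> bool:
    """A winner counts as a match only if it (a) reaches the absolute
    threshold AND (b) beats the runner-up by a sufficient margin."""
    return (best_log_odds >= min_log_odds
            and best_log_odds - second_best_log_odds >= exceeds_next_best)


print(is_match(20.0, 5.0))   # True: above threshold, clear margin
print(is_match(20.0, 15.0))  # False: margin of 5 is below 10
print(is_match(4.0, -10.0))  # False: below the absolute threshold
```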

optional arguments:
  -h, --help            show this help message and exit
  --probands PROBANDS   CSV filename for probands data. Header row present.
                        Columns: ['original_id', 'research_id', 'first_name',
                        'middle_names', 'surname', 'dob', 'gender',
                        'postcodes']. The fields ['postcodes'] are in
                        TemporalIdentifier format. TemporalIdentifier format:
                        IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-
                        MM-DD format or one of ['none', 'null', '?'] (case-
                        insensitive). Semicolon-separated values are allowed
                        within ['middle_names', 'postcodes']. (default: None)
  --sample SAMPLE       CSV filename for sample data. Header row present.
                        Columns: ['original_id', 'research_id', 'first_name',
                        'middle_names', 'surname', 'dob', 'gender',
                        'postcodes']. The fields ['postcodes'] are in
                        TemporalIdentifier format. TemporalIdentifier format:
                        IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-
                        MM-DD format or one of ['none', 'null', '?'] (case-
                        insensitive). Semicolon-separated values are allowed
                        within ['middle_names', 'postcodes']. (default: None)
  --sample_cache SAMPLE_CACHE
                        File in which to store cached sample info (to speed
                        loading) (default: None)
  --output OUTPUT       Output CSV file for proband/sample comparison.
                        (default: None)
  --extra_validation_output
                        Add extra output for validation purposes. (default:
                        False)
  --n_workers N_WORKERS
                        Number of processes to use in parallel. (default: 8)
  --max_chunksize MAX_CHUNKSIZE
                        Maximum chunk size (number of probands to pass to a
                        subprocess each time). (default: 500)
  --min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
                        Minimum number of probands for which we will bother to
                        use parallel processing. (default: 100)
  --profile             Profile the code (for development only). (default:
                        False)

===============================================================================
Help for command 'compare_hashed_to_hashed'
===============================================================================
usage: crate_fuzzy_id_match compare_hashed_to_hashed [-h] --probands PROBANDS
                                                     --sample SAMPLE
                                                     [--sample_cache SAMPLE_CACHE]
                                                     --output OUTPUT
                                                     [--extra_validation_output]
                                                     [--n_workers N_WORKERS]
                                                     [--max_chunksize MAX_CHUNKSIZE]
                                                     [--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
                                                     [--profile]

    Comparison rules:

    - People MUST match on DOB and surname (or surname metaphone), or hashed
      equivalents, to be considered a plausible match.
    - Only plausible matches proceed to the Bayesian comparison.

    Output file format:

    - CSV file with header.
    - Columns: ['proband_original_id', 'proband_research_id', 'matched', 'log_odds_match', 'p_match', 'sample_match_original_id', 'sample_match_research_id', 'second_best_log_odds']

      - proband_original_id
        Original (identifiable?) ID of the proband. Taken from the input.
        Optional -- may be blank for de-identified comparisons.

      - proband_research_id
        Research ID (de-identified?) of the proband. Taken from the input.

      - matched
        Boolean. Was a matching person (a "winner") found in the sample, who
        is to be considered a match to the proband? To give a match requires
        (a) that the log odds for the winner reaches a threshold, and (b) that
        the log odds for the winner exceeds the log odds for the runner-up by
        a certain amount (because a mismatch may be worse than a failed match).

      - log_odds_match
        Log (ln) odds that the winner in the sample is a match to the proband.

      - p_match
        Probability that the winner in the sample is a match.

      - sample_match_original_id
        Original ID of the "winner" in the sample (the closest match to the
        proband). Optional -- may be blank for de-identified comparisons.

      - sample_match_research_id
        Research ID of the winner in the sample.

      - second_best_log_odds
        Log odds of the runner-up (the second-closest match) being the same
        person as the proband.

    - If '--extra_validation_output' is used, the following columns are added:

      - best_person_original_id
        Original ID of the closest-matching person in the sample, EVEN IF THEY
        DID NOT WIN.

      - best_person_research_id
        Research ID of the closest-matching person in the sample, EVEN IF THEY
        DID NOT WIN.

    - The results file is NOT necessarily in the same order as the input
      proband file (leaving the output unsorted improves parallel processing
      efficiency).

optional arguments:
  -h, --help            show this help message and exit
  --probands PROBANDS   CSV filename for probands data. Header row present.
                        Columns: ['original_id', 'research_id',
                        'hashed_first_name', 'first_name_frequency',
                        'hashed_first_name_metaphone',
                        'first_name_metaphone_frequency',
                        'hashed_middle_names', 'middle_name_frequencies',
                        'hashed_middle_name_metaphones',
                        'middle_name_metaphone_frequencies', 'hashed_surname',
                        'surname_frequency', 'hashed_surname_metaphone',
                        'surname_metaphone_frequency', 'hashed_dob',
                        'hashed_gender', 'gender_frequency',
                        'hashed_postcode_units', 'postcode_unit_frequencies',
                        'hashed_postcode_sectors',
                        'postcode_sector_frequencies']. The fields
                        ['hashed_postcode_sectors', 'hashed_postcode_units']
                        are in TemporalIdentifier format. TemporalIdentifier
                        format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
                        in YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). Semicolon-separated values are
                        allowed within ['hashed_middle_name_metaphones',
                        'hashed_middle_names', 'hashed_postcode_sectors',
                        'hashed_postcode_units', 'middle_name_frequencies',
                        'middle_name_metaphone_frequencies',
                        'postcode_sector_frequencies',
                        'postcode_unit_frequencies']. (default: None)
  --sample SAMPLE       CSV filename for sample data. Header row present.
                        Columns: ['original_id', 'research_id',
                        'hashed_first_name', 'first_name_frequency',
                        'hashed_first_name_metaphone',
                        'first_name_metaphone_frequency',
                        'hashed_middle_names', 'middle_name_frequencies',
                        'hashed_middle_name_metaphones',
                        'middle_name_metaphone_frequencies', 'hashed_surname',
                        'surname_frequency', 'hashed_surname_metaphone',
                        'surname_metaphone_frequency', 'hashed_dob',
                        'hashed_gender', 'gender_frequency',
                        'hashed_postcode_units', 'postcode_unit_frequencies',
                        'hashed_postcode_sectors',
                        'postcode_sector_frequencies']. The fields
                        ['hashed_postcode_sectors', 'hashed_postcode_units']
                        are in TemporalIdentifier format. TemporalIdentifier
                        format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
                        in YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). Semicolon-separated values are
                        allowed within ['hashed_middle_name_metaphones',
                        'hashed_middle_names', 'hashed_postcode_sectors',
                        'hashed_postcode_units', 'middle_name_frequencies',
                        'middle_name_metaphone_frequencies',
                        'postcode_sector_frequencies',
                        'postcode_unit_frequencies']. (default: None)
  --sample_cache SAMPLE_CACHE
                        File in which to store cached sample info (to speed
                        loading) (default: None)
  --output OUTPUT       Output CSV file for proband/sample comparison.
                        (default: None)
  --extra_validation_output
                        Add extra output for validation purposes. (default:
                        False)
  --n_workers N_WORKERS
                        Number of processes to use in parallel. (default: 8)
  --max_chunksize MAX_CHUNKSIZE
                        Maximum chunk size (number of probands to pass to a
                        subprocess each time). (default: 500)
  --min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
                        Minimum number of probands for which we will bother to
                        use parallel processing. (default: 100)
  --profile             Profile the code (for development only). (default:
                        False)

===============================================================================
Help for command 'compare_hashed_to_plaintext'
===============================================================================
usage: crate_fuzzy_id_match compare_hashed_to_plaintext [-h] --probands
                                                        PROBANDS --sample
                                                        SAMPLE
                                                        [--sample_cache SAMPLE_CACHE]
                                                        --output OUTPUT
                                                        [--extra_validation_output]
                                                        [--n_workers N_WORKERS]
                                                        [--max_chunksize MAX_CHUNKSIZE]
                                                        [--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
                                                        [--profile]

    Comparison rules:

    - People MUST match on DOB and surname (or surname metaphone), or hashed
      equivalents, to be considered a plausible match.
    - Only plausible matches proceed to the Bayesian comparison.

    Output file format:

    - CSV file with header.
    - Columns: ['proband_original_id', 'proband_research_id', 'matched', 'log_odds_match', 'p_match', 'sample_match_original_id', 'sample_match_research_id', 'second_best_log_odds']

      - proband_original_id
        Original (identifiable?) ID of the proband. Taken from the input.
        Optional -- may be blank for de-identified comparisons.

      - proband_research_id
        Research ID (de-identified?) of the proband. Taken from the input.

      - matched
        Boolean. Was a matching person (a "winner") found in the sample, who
        is to be considered a match to the proband? To give a match requires
        (a) that the log odds for the winner reaches a threshold, and (b) that
        the log odds for the winner exceeds the log odds for the runner-up by
        a certain amount (because a mismatch may be worse than a failed match).

      - log_odds_match
        Log (ln) odds that the winner in the sample is a match to the proband.

      - p_match
        Probability that the winner in the sample is a match.

      - sample_match_original_id
        Original ID of the "winner" in the sample (the closest match to the
        proband). Optional -- may be blank for de-identified comparisons.

      - sample_match_research_id
        Research ID of the winner in the sample.

      - second_best_log_odds
        Log odds of the runner-up (the second-closest match) being the same
        person as the proband.

    - If '--extra_validation_output' is used, the following columns are added:

      - best_person_original_id
        Original ID of the closest-matching person in the sample, EVEN IF THEY
        DID NOT WIN.

      - best_person_research_id
        Research ID of the closest-matching person in the sample, EVEN IF THEY
        DID NOT WIN.

    - The results file is NOT necessarily in the same order as the input
      proband file (leaving the output unsorted improves parallel processing
      efficiency).

optional arguments:
  -h, --help            show this help message and exit
  --probands PROBANDS   CSV filename for probands data. Header row present.
                        Columns: ['original_id', 'research_id',
                        'hashed_first_name', 'first_name_frequency',
                        'hashed_first_name_metaphone',
                        'first_name_metaphone_frequency',
                        'hashed_middle_names', 'middle_name_frequencies',
                        'hashed_middle_name_metaphones',
                        'middle_name_metaphone_frequencies', 'hashed_surname',
                        'surname_frequency', 'hashed_surname_metaphone',
                        'surname_metaphone_frequency', 'hashed_dob',
                        'hashed_gender', 'gender_frequency',
                        'hashed_postcode_units', 'postcode_unit_frequencies',
                        'hashed_postcode_sectors',
                        'postcode_sector_frequencies']. The fields
                        ['hashed_postcode_sectors', 'hashed_postcode_units']
                        are in TemporalIdentifier format. TemporalIdentifier
                        format: IDENTIFIER/STARTDATE/ENDDATE, where dates are
                        in YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). Semicolon-separated values are
                        allowed within ['hashed_middle_name_metaphones',
                        'hashed_middle_names', 'hashed_postcode_sectors',
                        'hashed_postcode_units', 'middle_name_frequencies',
                        'middle_name_metaphone_frequencies',
                        'postcode_sector_frequencies',
                        'postcode_unit_frequencies']. (default: None)
  --sample SAMPLE       CSV filename for sample data. Header row present.
                        Columns: ['original_id', 'research_id', 'first_name',
                        'middle_names', 'surname', 'dob', 'gender',
                        'postcodes']. The fields ['postcodes'] are in
                        TemporalIdentifier format. TemporalIdentifier format:
                        IDENTIFIER/STARTDATE/ENDDATE, where dates are in YYYY-
                        MM-DD format or one of ['none', 'null', '?'] (case-
                        insensitive). Semicolon-separated values are allowed
                        within ['middle_names', 'postcodes']. (default: None)
  --sample_cache SAMPLE_CACHE
                        File in which to store cached sample info (to speed
                        loading) (default: None)
  --output OUTPUT       Output CSV file for proband/sample comparison.
                        (default: None)
  --extra_validation_output
                        Add extra output for validation purposes. (default:
                        False)
  --n_workers N_WORKERS
                        Number of processes to use in parallel. (default: 8)
  --max_chunksize MAX_CHUNKSIZE
                        Maximum chunk size (number of probands to pass to a
                        subprocess each time). (default: 500)
  --min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
                        Minimum number of probands for which we will bother to
                        use parallel processing. (default: 100)
  --profile             Profile the code (for development only). (default:
                        False)
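The TemporalIdentifier format used by the ``--probands`` and ``--sample`` files (``IDENTIFIER/STARTDATE/ENDDATE``, with dates in YYYY-MM-DD format or one of ``none``/``null``/``?``, and semicolon-separated values within certain fields) can be parsed as sketched below. This is a minimal standard-library sketch, not CRATE's own parser, and the example field values are hypothetical.

```python
from datetime import date
from typing import List, Optional, Tuple

# Date placeholders meaning "unknown", per the help text (case-insensitive).
NULL_DATES = {"none", "null", "?"}


def parse_date(s: str) -> Optional[date]:
    """Parse a YYYY-MM-DD date, or return None for a null placeholder."""
    if s.strip().lower() in NULL_DATES:
        return None
    return date.fromisoformat(s.strip())


def parse_temporal_identifier(s: str) -> Tuple[str, Optional[date], Optional[date]]:
    """Split one IDENTIFIER/STARTDATE/ENDDATE value into its parts."""
    identifier, start, end = s.split("/")
    return identifier, parse_date(start), parse_date(end)


def parse_multivalue_field(field: str) -> List[Tuple[str, Optional[date], Optional[date]]]:
    """Parse a CSV field containing semicolon-separated TemporalIdentifiers."""
    return [parse_temporal_identifier(v) for v in field.split(";") if v]


# Hypothetical 'postcodes' field with two postcode units:
print(parse_multivalue_field("CB2 0QQ/2018-01-01/?;CB2 3EB/2010-06-01/2017-12-31"))
```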

===============================================================================
Help for command 'show_metaphone'
===============================================================================
usage: crate_fuzzy_id_match show_metaphone [-h] words [words ...]

positional arguments:
  words       Words to check

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'show_forename_freq'
===============================================================================
usage: crate_fuzzy_id_match show_forename_freq [-h] forenames [forenames ...]

positional arguments:
  forenames   Forenames to check

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'show_forename_metaphone_freq'
===============================================================================
usage: crate_fuzzy_id_match show_forename_metaphone_freq [-h]
                                                         metaphones
                                                         [metaphones ...]

positional arguments:
  metaphones  Metaphones to check

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'show_surname_freq'
===============================================================================
usage: crate_fuzzy_id_match show_surname_freq [-h] surnames [surnames ...]

positional arguments:
  surnames    Surnames to check

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'show_surname_metaphone_freq'
===============================================================================
usage: crate_fuzzy_id_match show_surname_metaphone_freq [-h]
                                                        metaphones
                                                        [metaphones ...]

positional arguments:
  metaphones  Metaphones to check

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'show_dob_freq'
===============================================================================
usage: crate_fuzzy_id_match show_dob_freq [-h]

optional arguments:
  -h, --help  show this help message and exit