5. Preprocessing tools

These tools prepare source databases for CRATE.

Although they are usually run before anonymisation, it’s probably more helpful to read the Anonymisation section first.

5.1. crate_preprocess_rio

The RiO preprocessor creates a unique integer field named crate_pk in all tables (copying the existing integer primary key, creating one from an existing non-integer primary key, or adding a new one using SQL Server’s INT IDENTITY(1, 1) type). For all patient tables, it converts the patient ID (RiO number) to an integer field called crate_rio_number. It then adds indexes and views. All of these can be removed again, or updated incrementally if you add new data.

The views ‘denormalize’ the data for convenience, since it can be hard to follow the key chain of fully normalized tables. The views mostly conform to the names used by the Servelec RiO CRIS Extract Program (RCEP), with added consistency. Because user lookups are common, the following abbreviations are used to save typing (and, in some cases, to keep column names below MySQL’s 64-character limit):

_Resp_Clinician_ … Responsible Clinician

Options:

usage: crate_preprocess_rio [-h] --url URL [-v] [--print] [--echo] [--rcep]
                            [--drop-danger-drop] [--cpft] [--debug-skiptables]
                            [--prognotes-current-only | --prognotes-all]
                            [--clindocs-current-only | --clindocs-all]
                            [--allergies-current-only | --allergies-all]
                            [--audit-info | --no-audit-info]
                            [--postcodedb POSTCODEDB]
                            [--geogcols [GEOGCOLS [GEOGCOLS ...]]]
                            [--settings-filename SETTINGS_FILENAME]

*   Alters a RiO database to be suitable for CRATE.

*   By default, this treats the source database as being a copy of a RiO
    database (slightly later than version 6.2; exact version unclear).
    Use the "--rcep" (+/- "--cpft") switch(es) to treat it as a
    Servelec RiO CRIS Extract Program (RCEP) v2 output database.
    

optional arguments:
  -h, --help            show this help message and exit
  --url URL             SQLAlchemy database URL
  -v, --verbose         Verbose
  --print               Print SQL but do not execute it. (You can redirect the
                        printed output to create an SQL script.)
  --echo                Echo SQL
  --rcep                Treat the source database as the product of Servelec's
                        RiO CRIS Extract Program v2 (instead of raw RiO)
  --drop-danger-drop    REMOVES new columns and indexes, rather than creating
                        them. (There's not very much danger; no real
                        information is lost, but it might take a while to
                        recalculate it.)
  --cpft                Apply hacks for Cambridgeshire & Peterborough NHS
                        Foundation Trust (CPFT) RCEP database. Only applicable
                        with --rcep
  --debug-skiptables    DEBUG-ONLY OPTION. Skip tables (view creation only)
  --prognotes-current-only
                        Progress_Notes view restricted to current versions
                        only (* default)
  --prognotes-all       Progress_Notes view shows old versions too
  --clindocs-current-only
                        Clinical_Documents view restricted to current versions
                        only (*)
  --clindocs-all        Clinical_Documents view shows old versions too
  --allergies-current-only
                        Client_Allergies view restricted to current info only
  --allergies-all       Client_Allergies view shows deleted allergies too (*)
  --audit-info          Audit information (creation/update times) added to
                        views
  --no-audit-info       No audit information added (*)
  --postcodedb POSTCODEDB
                        Specify database (schema) name for ONS Postcode
                        Database (as imported by CRATE) to link to addresses
                        as a view. With SQL Server, you will have to specify
                        the schema as well as the database; e.g. "--postcodedb
                        ONS_PD.dbo"
  --geogcols [GEOGCOLS [GEOGCOLS ...]]
                        List of geographical information columns to link in
                        from ONS Postcode Database. BEWARE that you do not
                        specify anything too identifying. Default: pcon pct
                        nuts lea statsward casward lsoa01 msoa01 ur01ind oac01
                        lsoa11 msoa11 parish bua11 buasd11 ru11ind oac11 imd
  --settings-filename SETTINGS_FILENAME
                        Specify filename to write draft ddgen_* settings to,
                        for use in a CRATE anonymiser configuration file.

# Generated at 2019-10-10 10:23:45
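
Specimen usage (the URL below is a placeholder; substitute your own SQL Server connection details):

crate_preprocess_rio \
    --url "mssql+pyodbc://username:password@rio_dsn" \
    --print > rio_preprocess_draft.sql

Review the draft SQL, then re-run without --print to apply the changes.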

5.2. crate_preprocess_pcmis

Options:

usage: crate_preprocess_pcmis [-h] --url URL [-v] [--print] [--echo]
                              [--drop-danger-drop] [--debug-skiptables]
                              [--postcodedb POSTCODEDB]
                              [--geogcols [GEOGCOLS [GEOGCOLS ...]]]
                              [--settings-filename SETTINGS_FILENAME]

Alters a PCMIS database to be suitable for CRATE.

optional arguments:
  -h, --help            show this help message and exit
  --url URL             SQLAlchemy database URL (default: None)
  -v, --verbose         Verbose (default: False)
  --print               Print SQL but do not execute it. (You can redirect the
                        printed output to create an SQL script.) (default:
                        False)
  --echo                Echo SQL (default: False)
  --drop-danger-drop    REMOVES new columns and indexes, rather than creating
                        them. (There's not very much danger; no real
                        information is lost, but it might take a while to
                        recalculate it.) (default: False)
  --debug-skiptables    DEBUG-ONLY OPTION. Skip tables (view creation only)
                        (default: False)
  --postcodedb POSTCODEDB
                        Specify database (schema) name for ONS Postcode
                        Database (as imported by CRATE) to link to addresses
                        as a view. With SQL Server, you will have to specify
                        the schema as well as the database; e.g. "--postcodedb
                        ONS_PD.dbo" (default: None)
  --geogcols [GEOGCOLS [GEOGCOLS ...]]
                        List of geographical information columns to link in
                        from ONS Postcode Database. BEWARE that you do not
                        specify anything too identifying. (default: ['pcon',
                        'pct', 'nuts', 'lea', 'statsward', 'casward',
                        'lsoa01', 'msoa01', 'ur01ind', 'oac01', 'lsoa11',
                        'msoa11', 'parish', 'bua11', 'buasd11', 'ru11ind',
                        'oac11', 'imd'])
  --settings-filename SETTINGS_FILENAME
                        Specify filename to write draft ddgen_* settings to,
                        for use in a CRATE anonymiser configuration file.
                        (default: None)

# Generated at 2019-10-10 10:23:44

5.3. crate_postcodes

Options:

usage: crate_postcodes [-h] [--dir DIR] [--url URL] [--echo]
                       [--reportevery REPORTEVERY] [--commitevery COMMITEVERY]
                       [--startswith STARTSWITH [STARTSWITH ...]] [--replace]
                       [--skiplookup]
                       [--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]]
                       [--list_lookup_tables] [--skippostcodes] [--docsonly]
                       [-v]

-   This program reads data from the UK Office for National Statistics
    Postcode Database (ONSPD) and inserts it into a database.

-   You will need to download the ONSPD from
        https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
    e.g. ONSPD_MAY_2016_csv.zip (79 Mb), and unzip it (>1.4 Gb) to a directory.
    Tell this program which directory you used.

-   Specify your database as an SQLAlchemy connection URL: see
        http://docs.sqlalchemy.org/en/latest/core/engines.html
    The general format is:
        dialect[+driver]://username:password@host[:port]/database[?key=value...]

-   If you get an error like:
        UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in
        position 33: ordinal not in range(256)
    then try appending "?charset=utf8" to the connection URL.

-   ONS POSTCODE DATABASE LICENSE.
    Output using this program must add the following attribution statements:

    Contains OS data © Crown copyright and database right [year]
    Contains Royal Mail data © Royal Mail copyright and database right [year]
    Contains National Statistics data © Crown copyright and database right [year]

    See http://www.ons.gov.uk/methodology/geography/licences
    

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR             Root directory of unzipped ONSPD download (default:
                        /home/rudolf/dev/onspd)
  --url URL             SQLAlchemy database URL (default: None)
  --echo                Echo SQL (default: False)
  --reportevery REPORTEVERY
                        Report every n rows (default: 1000)
  --commitevery COMMITEVERY
                        Commit every n rows. If you make this too large
                        (relative, e.g., to your MySQL max_allowed_packet
                        setting), you may get crashes with errors like 'MySQL
                        has gone away'. (default: 10000)
  --startswith STARTSWITH [STARTSWITH ...]
                        Restrict to postcodes that start with one of these
                        strings (default: None)
  --replace             Replace tables even if they exist (default: skip
                        existing tables) (default: False)
  --skiplookup          Skip generation of code lookup tables (default: False)
  --specific_lookup_tables [SPECIFIC_LOOKUP_TABLES [SPECIFIC_LOOKUP_TABLES ...]]
                        Within the lookup tables, process only specific named
                        tables (default: None)
  --list_lookup_tables  List all possible lookup tables, then stop (default:
                        False)
  --skippostcodes       Skip generation of main (large) postcode table
                        (default: False)
  --docsonly            Show help for postcode table then stop (default:
                        False)
  -v, --verbose         Verbose (default: False)

# Generated at 2019-10-10 10:23:43
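
The SQLAlchemy URL format above can be assembled from its parts; here is a minimal stdlib-only sketch (the function name and connection details are illustrative, and SQLAlchemy itself provides make_url / URL.create for real use):

```python
# Assemble a SQLAlchemy URL of the form
#   dialect[+driver]://username:password@host[:port]/database[?key=value...]
# Purely illustrative; all names/credentials below are made up.
from urllib.parse import quote_plus

def make_sqlalchemy_url(dialect, username, password, host, database,
                        driver=None, port=None, **query):
    """Build a SQLAlchemy connection URL string from its components."""
    scheme = dialect + ("+" + driver if driver else "")
    auth = f"{quote_plus(username)}:{quote_plus(password)}"  # escape specials
    hostport = host + (f":{port}" if port else "")
    qs = "&".join(f"{k}={v}" for k, v in query.items())
    return f"{scheme}://{auth}@{hostport}/{database}" + (f"?{qs}" if qs else "")

url = make_sqlalchemy_url("mysql", "researcher", "secret", "localhost",
                          "onspd", driver="pymysql", charset="utf8")
print(url)  # mysql+pymysql://researcher:secret@localhost/onspd?charset=utf8
```

Note the charset=utf8 query parameter, as suggested above for the latin-1 UnicodeEncodeError.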

5.4. crate_fetch_wordlists

This tool fetches common word lists, such as name lists for global blacklisting, and words to exclude from such lists (e.g. English words or medical eponyms). It also provides an exclusion filter system, to find lines present in some files but absent from others.

Options:

usage: crate_fetch_wordlists [-h] [--verbose]
                             [--min_word_length MIN_WORD_LENGTH]
                             [--show_rejects] [--english_words]
                             [--english_words_output ENGLISH_WORDS_OUTPUT]
                             [--english_words_url ENGLISH_WORDS_URL]
                             [--valid_word_regex VALID_WORD_REGEX]
                             [--us_forenames]
                             [--us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT]
                             [--us_forenames_url US_FORENAMES_URL]
                             [--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT]
                             [--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT]
                             [--us_forenames_output US_FORENAMES_OUTPUT]
                             [--us_surnames]
                             [--us_surnames_output US_SURNAMES_OUTPUT]
                             [--us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT]
                             [--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL]
                             [--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL]
                             [--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT]
                             [--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT]
                             [--eponyms] [--eponyms_output EPONYMS_OUTPUT]
                             [--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]]
                             [--filter_input [FILTER_INPUT [FILTER_INPUT ...]]]
                             [--filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]]
                             [--filter_output [FILTER_OUTPUT]]

optional arguments:
  -h, --help            show this help message and exit
  --verbose, -v         Be verbose (default: False)
  --min_word_length MIN_WORD_LENGTH
                        Minimum word (or name) length to allow (default: 2)
  --show_rejects        Print to stdout (and, in verbose mode, log) the words
                        being rejected (default: False)

English words:
  --english_words       Fetch English words (for reducing nonspecific
                        blacklist, not as whitelist; consider words like
                        smith) (default: False)
  --english_words_output ENGLISH_WORDS_OUTPUT
                        Output file for English words (default:
                        english_words.txt)
  --english_words_url ENGLISH_WORDS_URL
                        URL for a textfile containing all English words (will
                        then be filtered) (default: https://www.gutenberg.org/
                        files/3201/files/CROSSWD.TXT)
  --valid_word_regex VALID_WORD_REGEX
                        Regular expression to determine valid English words
                        (default: ^[a-z](?:[A-Za-z'-]*[a-z])*$)

US forenames:
  --us_forenames        Fetch US forenames (for blacklist) (default: False)
  --us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT
                        Output CSV file for US forenames with frequencies
                        (columns are: name, frequency) (default:
                        us_forename_freq.csv)
  --us_forenames_url US_FORENAMES_URL
                        URL to Zip file of US Census-derived forenames lists
                        (excludes names with national frequency <5; see
                        https://www.ssa.gov/OACT/babynames/limits.html)
                        (default:
                        https://www.ssa.gov/OACT/babynames/names.zip)
  --us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was at least
                        this value. Range is 0-100. Use 0 for no limit.
                        Setting this above 0 excludes COMMON names. (This is a
                        trade-off between being comprehensive and operating at
                        a reasonable speed. Higher numbers are more
                        comprehensive but slower.) (default: 0)
  --us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was less than
                        or equal to this value. Range is 0-100. Use 100 for no
                        limit. Setting this below 100 excludes RARE names.
                        (This is a trade-off between being comprehensive and
                        operating at a reasonable speed. Higher numbers are
                        more comprehensive but slower.) (default: 100)
  --us_forenames_output US_FORENAMES_OUTPUT
                        Output file for US forenames (default:
                        us_forenames.txt)

US surnames:
  --us_surnames         Fetch US surnames (for blacklist) (default: False)
  --us_surnames_output US_SURNAMES_OUTPUT
                        Output text file for US surnames (default:
                        us_surnames.txt)
  --us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT
                        Output CSV file for US surnames with frequencies
                        (columns are: name, frequency) (default:
                        us_surname_freq.csv)
  --us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL
                        URL for textfile of US 1990 Census surnames (default: 
                        http://www2.census.gov/topics/genealogy/1990surnames/d
                        ist.all.last)
  --us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL
                        URL for zip of US 2010 Census surnames (default: https
                        ://www2.census.gov/topics/genealogy/2010surnames/names
                        .zip)
  --us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was at least
                        this value. Range is 0-100. Use 0 for no limit.
                        Setting this above 0 excludes COMMON names. (This is a
                        trade-off between being comprehensive and operating at
                        a reasonable speed. Higher numbers are more
                        comprehensive but slower.) (default: 0)
  --us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage up to and including this name was less than
                        or equal to this value. Range is 0-100. Use 100 for no
                        limit. Setting this below 100 excludes RARE names.
                        (This is a trade-off between being comprehensive and
                        operating at a reasonable speed. Higher numbers are
                        more comprehensive but slower.) (default: 100)

Medical eponyms:
  --eponyms             Write medical eponyms (to remove from blacklist)
                        (default: False)
  --eponyms_output EPONYMS_OUTPUT
                        Output file for medical eponyms (default:
                        medical_eponyms.txt)
  --eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]
                        Add unaccented versions (e.g. Sjogren as well as
                        Sjögren) (default: True)

Filter functions:
  Extra functions to filter wordlists. Specify an input file (or files),
  whose lines will be included; optional exclusion file(s), whose lines will
  be excluded (in case-insensitive fashion); and an output file. You can use
  '-' for the output file to mean 'stdout', and for one input file to mean
  'stdin'. No filenames (other than '-' for input and output) may overlap.
  The --min_word_length option also applies. Duplicates are not removed.

  --filter_input [FILTER_INPUT [FILTER_INPUT ...]]
                        Input file(s). See above. (default: None)
  --filter_exclude [FILTER_EXCLUDE [FILTER_EXCLUDE ...]]
                        Exclusion file(s). See above. (default: None)
  --filter_output [FILTER_OUTPUT]
                        Output file. See above. (default: None)

# Generated at 2019-10-10 10:23:41
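
The min/max cumulative-frequency options above operate on a name list sorted by descending frequency: a name is kept only if the cumulative frequency percentage up to and including it falls within the requested window. A sketch of that logic (the names and frequencies below are invented, not census data):

```python
# Illustrative sketch of the --us_*_min/max_cumfreq_pct windowing described
# above. Keep a name only if the cumulative frequency percentage up to and
# including it lies within [min_cumfreq_pct, max_cumfreq_pct] (both inclusive,
# per the help text). Not CRATE's own code.
def window_by_cumfreq(name_freq_pct, min_cumfreq_pct=0, max_cumfreq_pct=100):
    """name_freq_pct: list of (name, frequency_pct), most common first."""
    kept, cumfreq = [], 0.0
    for name, freq_pct in name_freq_pct:
        cumfreq += freq_pct
        if min_cumfreq_pct <= cumfreq <= max_cumfreq_pct:
            kept.append(name)
    return kept

names = [("SMITH", 40.0), ("JONES", 30.0), ("ZWEIG", 20.0), ("XANTHE", 10.0)]
# min > 0 excludes the COMMON names; max < 100 excludes the RARE ones:
print(window_by_cumfreq(names, min_cumfreq_pct=50, max_cumfreq_pct=90))
# -> ['JONES', 'ZWEIG']
```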

Specimen usage:

#!/bin/bash
# -----------------------------------------------------------------------------
# Specimen usage under Linux
# -----------------------------------------------------------------------------

cd ~/Documents/code/crate/working

# Downloading these and then using a file:// URL is unnecessary, but it makes
# the processing steps faster if we need to retry with new settings.
wget https://www.gutenberg.org/files/3201/files/CROSSWD.TXT -O dictionary.txt
wget https://www.ssa.gov/OACT/babynames/names.zip -O forenames.zip
wget http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last -O surnames_1990.txt
wget https://www2.census.gov/topics/genealogy/2010surnames/names.zip -O surnames_2010.zip

crate_fetch_wordlists --help

crate_fetch_wordlists \
    --english_words \
        --english_words_url file://$PWD/dictionary.txt \
    --us_forenames \
        --us_forenames_url file://$PWD/forenames.zip \
        --us_forenames_max_cumfreq_pct 100 \
    --us_surnames \
        --us_surnames_1990_census_url file://$PWD/surnames_1990.txt \
        --us_surnames_2010_census_url file://$PWD/surnames_2010.zip \
        --us_surnames_max_cumfreq_pct 100 \
    --eponyms

#    --show_rejects \
#    --verbose

# Forenames encompassing the top 95% gives 5874 forenames (of 96174).
# Surnames encompassing the top 85% gives 74525 surnames (of 175880).

crate_fetch_wordlists \
    --filter_input \
        us_forenames.txt \
        us_surnames.txt \
    --filter_exclude \
        english_words.txt \
        medical_eponyms.txt \
    --filter_output \
        filtered_names.txt
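
The filter step above keeps each input line unless its content appears, case-insensitively, in an exclusion file; duplicates are preserved and the minimum word length applies. A rough Python equivalent of that logic (a sketch, not CRATE's actual implementation):

```python
# Rough sketch of the --filter_input/--filter_exclude behaviour described
# above: include every input line that is long enough and does not appear
# (case-insensitively) in any exclusion list. Duplicates are not removed.
# Illustrative only; not CRATE's own code.
def filter_lines(input_lines, exclude_lines, min_word_length=2):
    excluded = {line.strip().lower() for line in exclude_lines}
    return [
        line.strip() for line in input_lines
        if len(line.strip()) >= min_word_length
        and line.strip().lower() not in excluded
    ]

names = ["Smith", "Wegener", "Amy", "Smith"]   # e.g. forenames + surnames
reject = ["smith", "wegener"]                  # e.g. English words + eponyms
print(filter_lines(names, reject))             # -> ['Amy']
```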

5.5. crate_fuzzy_id_match

In development.

See crate_anon.preprocess.fuzzy_id_match.
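
As an illustration of the probability chaining used by the --p_middle_name_n_present option below: with the default "0.8,0.1375", P(has at least two middle names) = 0.8 × 0.1375 = 0.11, and the last factor is reused for any further names. A minimal sketch (not CRATE's own code):

```python
# Chained middle-name probabilities, per --p_middle_name_n_present: the i-th
# value is P(has an (i+1)th middle name | has an i-th middle name), and the
# last value listed is reused ad infinitum for additional names.
def p_has_at_least_n_middle_names(probs, n):
    p = 1.0
    for i in range(n):
        p *= probs[min(i, len(probs) - 1)]
    return p

defaults = [0.8, 0.1375]  # the documented default "0.8,0.1375"
print(p_has_at_least_n_middle_names(defaults, 1))  # 0.8
print(p_has_at_least_n_middle_names(defaults, 2))  # 0.8 * 0.1375 ≈ 0.11
```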

Options (from crate_fuzzy_id_match --allhelp):

usage: crate_fuzzy_id_match [-h] [--version] [--allhelp] [--verbose]
                            [--key KEY] [--allow_default_hash_key]
                            [--rounding_sf ROUNDING_SF]
                            [--forename_freq_csv FORENAME_FREQ_CSV]
                            [--forename_cache_filename FORENAME_CACHE_FILENAME]
                            [--surname_freq_csv SURNAME_FREQ_CSV]
                            [--surname_cache_filename SURNAME_CACHE_FILENAME]
                            [--name_min_frequency NAME_MIN_FREQUENCY]
                            [--p_middle_name_n_present P_MIDDLE_NAME_N_PRESENT]
                            [--population_size POPULATION_SIZE]
                            [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                            [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                            [--mean_oa_population MEAN_OA_POPULATION]
                            [--p_minor_forename_error P_MINOR_FORENAME_ERROR]
                            [--p_minor_surname_error P_MINOR_SURNAME_ERROR]
                            [--p_proband_middle_name_missing P_PROBAND_MIDDLE_NAME_MISSING]
                            [--p_sample_middle_name_missing P_SAMPLE_MIDDLE_NAME_MISSING]
                            [--p_minor_postcode_error P_MINOR_POSTCODE_ERROR]
                            [--min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH]
                            [--exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS]
                            {selftest,speedtest,validate1,hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext}
                            ...

Identity matching via hashed fuzzy identifiers

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --allhelp             show help for all commands and exit

display options:
  --verbose             Be verbose (default: False)

hasher (secrecy) options:
  --key KEY             Key (passphrase) for hasher (default: fuzzy_id_match_d
                        efault_hash_key_DO_NOT_USE_FOR_LIVE_DATA)
  --allow_default_hash_key
                        Allow the default hash key to be used beyond tests.
                        INADVISABLE! (default: False)
  --rounding_sf ROUNDING_SF
                        Number of significant figures to use when rounding
                        frequencies in hashed version (default: 3)

frequency information for prior probabilities:
  --forename_freq_csv FORENAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for forenames
                        (default: /home/rudolf/Documents/code/crate/working/us
                        _forename_freq.csv)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading) (default: /home/rudolf/.local/share/crate/fuz
                        zy_forename_cache.pickle)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames
                        (default: /home/rudolf/Documents/code/crate/working/us
                        _surname_freq.csv)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading) (default: /home/rudolf/.local/share/crate/fuz
                        zy_surname_cache.pickle)
  --name_min_frequency NAME_MIN_FREQUENCY
                        Minimum base frequency for names. If a frequency is
                        less than this, use this minimum. Allowing extremely
                        low frequencies may increase the chances of a spurious
                        match. Note also that typical name frequency tables
                        don't give very-low-frequency information. For
                        example, for US census forename/surname information,
                        below 0.001 percent they report 0.000 percent; so a
                        reasonable minimum is 0.0005 percent or 0.000005 or
                        5e-6. (default: 5e-06)
  --p_middle_name_n_present P_MIDDLE_NAME_N_PRESENT
                        CSV list of probabilities that a randomly selected
                        person has a certain number of middle names. The first
                        number is P(has a first middle name). The second
                        number is P(has a second middle name | has a first
                        middle name), and so on. The last number present will
                        be re-used ad infinitum if someone has more names.
                        (default: 0.8,0.1375)
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 66040000)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS data
                        (default:
                        /home/rudolf/dev/onspd/Data/ONSPD_MAY_2016_UK.csv)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading) (default: /home/rudolf/.local/share/crate/fuz
                        zy_postcode_cache.pickle)
  --mean_oa_population MEAN_OA_POPULATION
                        Mean population of a UK Census Output Area, from which
                        we estimate the population of postcode-based units.
                        (default: 309)

error probabilities:
  --p_minor_forename_error P_MINOR_FORENAME_ERROR
                        Assumed probability that a forename has an error in it
                        that means it fails a full match but satisfies a
                        partial (metaphone) match. (default: 0.001)
  --p_minor_surname_error P_MINOR_SURNAME_ERROR
                        Assumed probability that a surname has an error in it
                        that means it fails a full match but satisfies a
                        partial (metaphone) match. (default: 0.001)
  --p_proband_middle_name_missing P_PROBAND_MIDDLE_NAME_MISSING
                        Probability that a middle name, present in the sample,
                        is missing from the proband. (default: 0.05)
  --p_sample_middle_name_missing P_SAMPLE_MIDDLE_NAME_MISSING
                        Probability that a middle name, present in the
                        proband, is missing from the sample. (default: 0.05)
  --p_minor_postcode_error P_MINOR_POSTCODE_ERROR
                        Assumed probability that a postcode has an error in it
                        that means it fails a full (postcode unit) match but
                        satisfies a partial (postcode sector) match. (default:
                        0.001)

matching rules:
  --min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH
                        Minimum log odds of two people being the same, before
                        a match will be considered. (Default is equivalent to
                        p = 0.999.) (default: 6.906754778648553)
  --exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS
                        Minimum log odds by which a best match must exceed the
                        next-best match to be considered a unique match.
                        (default: 10)

commands:
  Valid commands are as follows.

  {selftest,speedtest,validate1,hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext}
                        Specify one command.
    selftest            Run self-tests and stop
    speedtest           Run speed tests and stop
    validate1           Run validation test 1 and stop. In this test, a list
                        of people is compared to a version of itself, at times
                        with elements deleted or with typos introduced.
    hash                Hash an identifiable CSV file into an encrypted one.
    compare_plaintext   Compare a list of probands against a sample (both in
                        plaintext).
    compare_hashed_to_hashed
                        Compare a list of probands against a sample (both
                        hashed).
    compare_hashed_to_plaintext
                        Compare a list of probands (hashed) against a sample
                        (plaintext).

===============================================================================
Help for command 'selftest'
===============================================================================
usage: crate_fuzzy_id_match selftest [-h]

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'speedtest'
===============================================================================
usage: crate_fuzzy_id_match speedtest [-h]

optional arguments:
  -h, --help  show this help message and exit

===============================================================================
Help for command 'validate1'
===============================================================================
usage: crate_fuzzy_id_match validate1 [-h] [--people_csv PEOPLE_CSV]
                                      [--people_cache_filename PEOPLE_CACHE_FILENAME]
                                      [--output_csv OUTPUT_CSV] [--seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  --people_csv PEOPLE_CSV
                        CSV filename for validation 1 data. (Header row.
                        Columns: ['unique_id', 'research_id', 'first_name',
                        'middle_names', 'surname', 'dob', 'postcodes']. Use
                        semicolon-separated values for ['middle_names',
                        'postcodes'].)
  --people_cache_filename PEOPLE_CACHE_FILENAME
                        File in which to store cached people info (to speed
                        loading)
  --output_csv OUTPUT_CSV
                        Output CSV file for validation
  --seed SEED           Random number seed, for introducing deliberate errors
                        in validation test 1

===============================================================================
Help for command 'hash'
===============================================================================
usage: crate_fuzzy_id_match hash [-h] [--input INPUT] [--output OUTPUT]
                                 [--include_unique_id]

optional arguments:
  -h, --help           show this help message and exit
  --input INPUT        CSV filename for input (plaintext) data. (Header row.
                       Columns: ['unique_id', 'research_id', 'first_name',
                       'middle_names', 'surname', 'dob', 'postcodes']. Use
                       semicolon-separated values for ['middle_names',
                       'postcodes'].)
  --output OUTPUT      Output CSV file for hashed version.
  --include_unique_id  Include the (potentially identifying) 'unique_id' data?
                       Usually False; may be set to True for validation.
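The plaintext CSV format above (shared by the ``hash``, ``validate1`` and
``compare_plaintext`` commands) can be generated programmatically. A minimal
sketch using only the Python standard library; the filename and the sample
record are purely illustrative, but the column names and the semicolon
separator for multi-valued fields follow the help text above:

```python
import csv

# Plaintext columns documented in the crate_fuzzy_id_match help text.
FIELDNAMES = ["unique_id", "research_id", "first_name",
              "middle_names", "surname", "dob", "postcodes"]


def write_people_csv(filename, people):
    """Write records in the documented plaintext format.

    The multi-valued fields ('middle_names', 'postcodes') are lists in
    the input records and are joined with semicolons, as the help text
    specifies.
    """
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for person in people:
            row = dict(person)
            row["middle_names"] = ";".join(person["middle_names"])
            row["postcodes"] = ";".join(person["postcodes"])
            writer.writerow(row)


# Illustrative record only, not real data:
write_people_csv("people.csv", [{
    "unique_id": "1",
    "research_id": "R1",
    "first_name": "Alice",
    "middle_names": ["Beatrice", "Clara"],
    "surname": "Smith",
    "dob": "1970-01-01",
    "postcodes": ["CB2 0QQ", "CB2 1TN"],
}])
```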

===============================================================================
Help for command 'compare_plaintext'
===============================================================================
usage: crate_fuzzy_id_match compare_plaintext [-h] [--probands PROBANDS]
                                              [--sample SAMPLE]
                                              [--sample_cache SAMPLE_CACHE]
                                              [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --probands PROBANDS   CSV filename for probands data. (Header row. Columns:
                        ['unique_id', 'research_id', 'first_name',
                        'middle_names', 'surname', 'dob', 'postcodes']. Use
                        semicolon-separated values for ['middle_names',
                        'postcodes'].)
  --sample SAMPLE       CSV filename for sample data. (Header row. Columns:
                        ['unique_id', 'research_id', 'first_name',
                        'middle_names', 'surname', 'dob', 'postcodes']. Use
                        semicolon-separated values for ['middle_names',
                        'postcodes'].)
  --sample_cache SAMPLE_CACHE
                        File in which to store cached sample info (to speed
                        loading)
  --output OUTPUT       Output CSV file for proband/sample comparison

===============================================================================
Help for command 'compare_hashed_to_hashed'
===============================================================================
usage: crate_fuzzy_id_match compare_hashed_to_hashed [-h]
                                                     [--probands PROBANDS]
                                                     [--sample SAMPLE]
                                                     [--output OUTPUT]

optional arguments:
  -h, --help           show this help message and exit
  --probands PROBANDS  CSV filename for probands data. (Header row. Columns:
                       ['unique_id', 'research_id', 'hashed_first_name',
                       'first_name_frequency', 'hashed_first_name_metaphone',
                       'first_name_metaphone_frequency',
                       'hashed_middle_names', 'middle_name_frequencies',
                       'hashed_middle_name_metaphones',
                       'middle_name_metaphone_frequencies', 'hashed_surname',
                       'surname_frequency', 'hashed_surname_metaphone',
                       'surname_metaphone_frequency', 'hashed_dob',
                       'hashed_postcode_units', 'postcode_unit_frequencies',
                       'hashed_postcode_sectors',
                       'postcode_sector_frequencies']. Use semicolon-separated
                       values for ['hashed_middle_name_metaphones',
                       'hashed_middle_names', 'hashed_postcode_sectors',
                       'hashed_postcode_units', 'middle_name_frequencies',
                       'middle_name_metaphone_frequencies',
                       'postcode_sector_frequencies',
                       'postcode_unit_frequencies'].)
  --sample SAMPLE      CSV filename for sample data. (Header row. Columns:
                       ['unique_id', 'research_id', 'hashed_first_name',
                       'first_name_frequency', 'hashed_first_name_metaphone',
                       'first_name_metaphone_frequency',
                       'hashed_middle_names', 'middle_name_frequencies',
                       'hashed_middle_name_metaphones',
                       'middle_name_metaphone_frequencies', 'hashed_surname',
                       'surname_frequency', 'hashed_surname_metaphone',
                       'surname_metaphone_frequency', 'hashed_dob',
                       'hashed_postcode_units', 'postcode_unit_frequencies',
                       'hashed_postcode_sectors',
                       'postcode_sector_frequencies']. Use semicolon-separated
                       values for ['hashed_middle_name_metaphones',
                       'hashed_middle_names', 'hashed_postcode_sectors',
                       'hashed_postcode_units', 'middle_name_frequencies',
                       'middle_name_metaphone_frequencies',
                       'postcode_sector_frequencies',
                       'postcode_unit_frequencies'].)
  --output OUTPUT      Output CSV file for proband/sample comparison
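When consuming a hashed CSV (for example, one produced by the ``hash``
command), the semicolon-separated fields listed above need to be split back
into lists. A minimal reader sketch using only the Python standard library;
the function name and demo filename are illustrative, but the set of
multi-valued field names is taken from the help text above:

```python
import csv

# Fields documented as semicolon-separated lists in the hashed format.
MULTI_VALUE_FIELDS = {
    "hashed_middle_names", "middle_name_frequencies",
    "hashed_middle_name_metaphones", "middle_name_metaphone_frequencies",
    "hashed_postcode_units", "postcode_unit_frequencies",
    "hashed_postcode_sectors", "postcode_sector_frequencies",
}


def read_hashed_csv(filename):
    """Yield one dict per row, splitting multi-valued fields into lists.

    Empty multi-valued cells become empty lists rather than [''].
    """
    with open(filename, newline="") as f:
        for row in csv.DictReader(f):
            for field in MULTI_VALUE_FIELDS & row.keys():
                row[field] = row[field].split(";") if row[field] else []
            yield row
```

Fields not present in a given file are simply left alone, so the same reader
works for files containing any subset of the documented columns.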

===============================================================================
Help for command 'compare_hashed_to_plaintext'
===============================================================================
usage: crate_fuzzy_id_match compare_hashed_to_plaintext [-h]
                                                        [--probands PROBANDS]
                                                        [--sample SAMPLE]
                                                        [--sample_cache SAMPLE_CACHE]
                                                        [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --probands PROBANDS   CSV filename for probands data. (Header row. Columns:
                        ['unique_id', 'research_id', 'hashed_first_name',
                        'first_name_frequency', 'hashed_first_name_metaphone',
                        'first_name_metaphone_frequency',
                        'hashed_middle_names', 'middle_name_frequencies',
                        'hashed_middle_name_metaphones',
                        'middle_name_metaphone_frequencies', 'hashed_surname',
                        'surname_frequency', 'hashed_surname_metaphone',
                        'surname_metaphone_frequency', 'hashed_dob',
                        'hashed_postcode_units', 'postcode_unit_frequencies',
                        'hashed_postcode_sectors',
                        'postcode_sector_frequencies']. Use semicolon-
                        separated values for ['hashed_middle_name_metaphones',
                        'hashed_middle_names', 'hashed_postcode_sectors',
                        'hashed_postcode_units', 'middle_name_frequencies',
                        'middle_name_metaphone_frequencies',
                        'postcode_sector_frequencies',
                        'postcode_unit_frequencies'].)
  --sample SAMPLE       CSV filename for sample data. (Header row. Columns:
                        ['unique_id', 'research_id', 'first_name',
                        'middle_names', 'surname', 'dob', 'postcodes']. Use
                        semicolon-separated values for ['middle_names',
                        'postcodes'].)
  --sample_cache SAMPLE_CACHE
                        File in which to store cached sample info (to speed
                        loading)
  --output OUTPUT       Output CSV file for proband/sample comparison


# Generated at 2019-10-10 10:23:42