10.3. crate_fuzzy_id_match

A tool to match people across two databases that do not share a person-unique identifier, using names, dates of birth, sex/gender, and addresses (postcodes). This is a probability-based (“fuzzy”) matching technique. It can operate on identifiable information or in a de-identified fashion.

You will need to download a CSV file of postcode geography from UK Census/ONS data, e.g. from https://geoportal.statistics.gov.uk/search?q=PRD_ONSPD%20NOV_2024, and place it somewhere accessible to CRATE. If you are running CRATE under Docker, this file must be under the files directory, which is under the top-level directory of the CRATE installation.

10.3.1. Example

In an area with a population size of 100,000:

Institution A has a database with the following patient record table:

ID  first_name  other_names  last_name     date_of_birth  gender  postcode  NHS Number
--  ----------  -----------  ------------  -------------  ------  --------  ----------
1   Danny       Jonathan     Walters       2014-10-07     M       CB2 0QQ   9997577205
2   Lucy                     Bryan         2007-06-24     F       CB2 3EB   9998871948
3   Julie       Kate         Evans         2014-10-26     F       CB2 1TP   9994594737
4   Amelia      Geraldine    Hudson-Smith  2012-02-04     F       CB2 8PH   9994484249
5   Aaisha                   Singh         2006-11-11     F       CB3 9DF   9997885643

Institution B has a database with the following student record table:

ID    first_name  middle_name  last_name     date_of_birth  gender  postcode
----  ----------  -----------  ------------  -------------  ------  --------
1001  Lucy                     Bryan         2007-06-24     F       CB2 3EB
1002  Julie       Katherine    Evans         2014-10-26     F       CB2 1TP
1003  Amelia      Geraldine    Hudson-Smith  2012-02-04     F       CB2 8PH
1004  Aaisha                   Singh         2006-11-11     F       CB3 9DF
1005  Daniel      Jonathan     Walters       2014-10-07     M       CB2 0QQ

There are no NHS numbers in the student table so we rely on name, date of birth, gender and postcode to link the two tables.

The database manager at Institution A creates a CSV file called patients_for_hashing.csv like this:

local_id,forenames,surnames,dob,gender,postcodes,perfect_id,other_info
1,Danny; Jonathan,Walters,2014-10-07,M,CB2 0QQ,,
2,Lucy,Bryan,2007-06-24,F,CB2 3EB,,
3,Julie; Kate,Evans,2014-10-26,F,CB2 1TP,,
4,Amelia; Geraldine,Hudson-Smith,2012-02-04,F,CB2 8PH,,
5,Aaisha,Singh,2006-11-11,F,CB3 9DF,,

If you are running CRATE under Docker, you must place this file under the files directory under the top level directory of the CRATE installation. The Docker container sees this as /crate/files.
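As an illustration of how such a file might be produced, here is a minimal Python sketch that maps rows from the patient table to the input columns listed in the --input help text. The source column names (ID, first_name, other_names and so on) are taken from the example table; the helper names to_fuzzy_row and write_fuzzy_csv are hypothetical and are not part of CRATE.

```python
import csv
import io

# Column order expected by crate_fuzzy_id_match (from the --input help text).
HEADER = [
    "local_id", "forenames", "surnames", "dob", "gender",
    "postcodes", "perfect_id", "other_info",
]


def to_fuzzy_row(patient: dict) -> dict:
    """Map one source record (example table columns) to the input format."""
    # Multiple forenames share one field, separated by semicolons.
    forenames = [patient["first_name"]]
    if patient.get("other_names"):
        forenames.append(patient["other_names"])
    return {
        "local_id": patient["ID"],
        "forenames": "; ".join(forenames),
        "surnames": patient["last_name"],
        "dob": patient["date_of_birth"],  # YYYY-MM-DD
        "gender": patient["gender"],
        "postcodes": patient["postcode"],
        "perfect_id": "",  # left blank, as in the example
        "other_info": "",
    }


def write_fuzzy_csv(patients, fileobj) -> None:
    """Write the header row and one line per patient to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=HEADER, lineterminator="\n")
    writer.writeheader()
    for patient in patients:
        writer.writerow(to_fuzzy_row(patient))


patients = [
    {"ID": 1, "first_name": "Danny", "other_names": "Jonathan",
     "last_name": "Walters", "date_of_birth": "2014-10-07",
     "gender": "M", "postcode": "CB2 0QQ"},
    {"ID": 2, "first_name": "Lucy", "other_names": "",
     "last_name": "Bryan", "date_of_birth": "2007-06-24",
     "gender": "F", "postcode": "CB2 3EB"},
]

buf = io.StringIO()
write_fuzzy_csv(patients, buf)
print(buf.getvalue())
```

In practice you would open patients_for_hashing.csv for writing (with newline="") instead of using an in-memory buffer.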

The database manager then runs the following script in CRATE:

# If using the Docker-based CRATE installer (from the scripts directory)
./fuzzy_id_match_hash.sh --population_size=100000 --input /crate/files/patients_for_hashing.csv --key mysecretpassphrase --output /crate/files/hashed_patients.jsonl --postcode_csv_filename=/crate/files/ONSPD_NOV_2024_UK.csv

# otherwise
crate_fuzzy_id_match hash --population_size=100000 --input patients_for_hashing.csv --key mysecretpassphrase --output hashed_patients.jsonl --postcode_csv_filename=ONSPD_NOV_2024_UK.csv

This writes a file in JSON Lines format, hashed_patients.jsonl. The file is sent, along with the hash key, to the database manager at Institution B.
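Before sending the file, you may want to check that it is readable. The help text suggests the jq tool; the sketch below is a rough Python equivalent that counts records and lists the top-level keys it sees. It makes no assumption about the field names CRATE writes, and the demonstration file contains placeholder content, not real CRATE output. The function names are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one decoded JSON value per non-blank line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_jsonl(path):
    """Return (record count, sorted list of top-level keys seen)."""
    count = 0
    keys = set()
    for record in read_jsonl(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)


# Demonstration on a throwaway file with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"a": 1}\n{"a": 2, "b": 3}\n')
    demo_summary = summarize_jsonl(demo_path)

print(demo_summary)
```

For the example above you would call summarize_jsonl("hashed_patients.jsonl") and expect one record per input row.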

The database manager at Institution B creates a CSV file called students_for_hashing.csv like this:

local_id,forenames,surnames,dob,gender,postcodes,perfect_id,other_info
1001,Lucy,Bryan,2007-06-24,F,CB2 3EB,,
1002,Julie; Katherine,Evans,2014-10-26,F,CB2 1TP,,
1003,Amelia; Geraldine,Hudson-Smith,2012-02-04,F,CB2 8PH,,
1004,Aaisha,Singh,2006-11-11,F,CB3 9DF,,
1005,Danny; Jonathan,Walters,2014-10-07,M,CB2 0QQ,,

They then run the following script in their CRATE installation:

# If using the Docker-based CRATE installer (from the scripts directory)
./fuzzy_id_match_compare_hashed_to_plaintext.sh --probands /crate/files/hashed_patients.jsonl --sample /crate/files/students_for_hashing.csv --sample_cache /crate/files/sample_cache.jsonl --output /crate/files/sample_comparison.csv --key mysecretpassphrase --population_size=100000 --postcode_csv_filename=/crate/files/ONSPD_NOV_2024_UK.csv

# otherwise
crate_fuzzy_id_match compare_hashed_to_plaintext --probands hashed_patients.jsonl --sample students_for_hashing.csv --output sample_comparison.csv --key mysecretpassphrase --population_size=100000 --postcode_csv_filename=ONSPD_NOV_2024_UK.csv

This produces the following output sample_comparison.csv:

proband_local_id  matched  log_odds_match      p_match             sample_match_local_id  second_best_log_odds
----------------  -------  ------------------  ------------------  ---------------------  --------------------
1                 1        29.651800972095156  0.9999999999998674  1005                   -31.527640352096324
2                 1        25.298704043709186  0.9999999999896982  1001                   -inf
3                 1        19.664790718087005  0.999999997118027   1002                   -31.517581820004903
4                 1        34.766408465008126  0.9999999999999992  1003                   -inf
5                 1        29.579028524494277  0.9999999999998574  1004                   -inf

As this short example shows, CRATE has matched each record in the patient table to a record in the student table; the paired IDs appear in the proband_local_id and sample_match_local_id columns. Typically the local IDs would be hashed as well (with the --local_id_hash_key option), but for this example we have left them unmodified for easier identification.
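The log_odds_match and p_match columns carry the same information: log odds are natural-log odds, so p = 1 / (1 + exp(-log_odds)). The sketch below shows that conversion, plus a simplified version of the match decision described in the command help (the winner must reach a threshold θ and beat the runner-up by δ). The function names are hypothetical, and the use of non-strict inequalities at the boundaries is an assumption; CRATE's own handling of ties may differ.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Convert natural-log odds to a probability, p = 1 / (1 + exp(-x))."""
    # Two branches avoid overflow in exp() for large |log_odds|.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def is_match(best: float, second_best: float,
             theta: float = 5.0, delta: float = 0.0) -> bool:
    """
    Simplified decision rule. theta and delta mirror the documented defaults
    of --min_log_odds_for_match and --exceeds_next_best_log_odds.
    ASSUMPTION: boundary cases are treated as matches (>=).
    """
    return best >= theta and best >= second_best + delta


# Row 3 of the example output: log odds 19.66... gives p close to 1.
print(log_odds_to_probability(19.664790718087005))

# The default threshold theta = 5 expressed as a probability.
print(log_odds_to_probability(5.0))

# Row 1: best candidate far ahead of the runner-up.
print(is_match(29.651800972095156, -31.527640352096324))

# A runner-up of -inf means no other plausible candidate was found.
print(is_match(25.298704043709186, -math.inf))
```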

Now the two institutions are able to link records between their databases and can share de-identified data with each other.
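One way to use the comparison output is to turn it into a lookup from Institution A's IDs to Institution B's IDs, keeping only rows where matched is 1. The following is a minimal sketch using the example output reproduced as an inline string; the function name linkage_map is hypothetical.

```python
import csv
import io

# Contents of sample_comparison.csv from the example above.
EXAMPLE_OUTPUT = """\
proband_local_id,matched,log_odds_match,p_match,sample_match_local_id,second_best_log_odds
1,1,29.651800972095156,0.9999999999998674,1005,-31.527640352096324
2,1,25.298704043709186,0.9999999999896982,1001,-inf
3,1,19.664790718087005,0.999999997118027,1002,-31.517581820004903
4,1,34.766408465008126,0.9999999999999992,1003,-inf
5,1,29.579028524494277,0.9999999999998574,1004,-inf
"""


def linkage_map(fileobj) -> dict:
    """Return {proband_local_id: sample_match_local_id} for matched rows only."""
    reader = csv.DictReader(fileobj)
    return {
        row["proband_local_id"]: row["sample_match_local_id"]
        for row in reader
        if row["matched"] == "1"
    }


links = linkage_map(io.StringIO(EXAMPLE_OUTPUT))
for proband_id, sample_id in links.items():
    print(f"patient {proband_id} <-> student {sample_id}")
```

With a real file, pass an open file handle, e.g. open("sample_comparison.csv", newline=""), instead of the inline string.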

We describe this tool in:

  • Cardinal RN, Moore A, Burchell M, Lewis JR (2023). De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation. BMC Medical Informatics and Decision Making 23: 85. PubMed ID 37147600; DOI 10.1186/s12911-023-02176-6.
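Several of the frequency options in the command help that follows describe simple prior calculations. The sketch below works through three of them as we read the help text: the baseline log odds implied by --population_size, the chance of two random people sharing a date of birth given --birth_year_pseudo_range, and the automatic value of --k_postcode. The function names are hypothetical and the formulas are an interpretation of the help text, not a copy of CRATE's code; the paper above is the authoritative description.

```python
import math


def baseline_log_odds(population_size: int) -> float:
    """
    Prior log odds that two people drawn at random (with replacement) from
    the population are the same person.
    ASSUMPTION: the prior probability is taken as 1 / population_size.
    """
    p = 1.0 / population_size
    return math.log(p / (1.0 - p))


def p_same_dob(birth_year_pseudo_range: float = 30.0) -> float:
    """Probability that two random people share a DOB: 1 / (365.25 * b)."""
    return 1.0 / (365.25 * birth_year_pseudo_range)


def auto_k_postcode(population_size: int, n_uk: int = 66_040_000) -> float:
    """Autocalculated k_postcode = n_UK / population_size (help-text default)."""
    return n_uk / population_size


# Values for the worked example (population size 100,000).
print(baseline_log_odds(100_000))   # strongly negative: a match is unlikely a priori
print(p_same_dob())                 # uses the documented default b = 30
print(auto_k_postcode(100_000))
```

The evidence from names, date of birth, gender and postcode has to overcome this negative baseline before the log odds reach the matching threshold.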

USAGE: crate_fuzzy_id_match [-h] [--version] [--allhelp]
                            {hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext,print_demo_sample,show_metaphone,show_names_for_metaphone,show_forename_freq,show_forename_metaphone_freq,show_forename_f2c_freq,show_surname_freq,show_surname_metaphone_freq,show_surname_f2c_freq,show_dob_freq,show_postcode_freq}
                            ...

Identity matching via hashed fuzzy identifiers

OPTIONS:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --allhelp             Show help for all commands and exit.

COMMANDS:
  Valid commands are as follows.

  {hash,compare_plaintext,compare_hashed_to_hashed,compare_hashed_to_plaintext,print_demo_sample,show_metaphone,show_names_for_metaphone,show_forename_freq,show_forename_metaphone_freq,show_forename_f2c_freq,show_surname_freq,show_surname_metaphone_freq,show_surname_f2c_freq,show_dob_freq,show_postcode_freq}
                        Specify one command.
    hash                STEP 1 OF DE-IDENTIFIED LINKAGE. Hash an identifiable
                        CSV file into an encrypted one.
    compare_plaintext   IDENTIFIABLE LINKAGE COMMAND. Compare a list of
                        probands against a sample (both in plaintext).
    compare_hashed_to_hashed
                        STEP 2 OF DE-IDENTIFIED LINKAGE (for when you have
                        de-identified both sides in advance). Compare a list
                        of probands against a sample (both hashed).
    compare_hashed_to_plaintext
                        STEP 2 OF DE-IDENTIFIED LINKAGE (for when you have
                        received de-identified data and you want to link to
                        your identifiable data, producing a de-identified
                        result). Compare a list of probands (hashed) against a
                        sample (plaintext). Hashes the sample on the fly.
    print_demo_sample   Print a demo sample .CSV file.
    show_metaphone      Show metaphones of words
    show_names_for_metaphone
                        Show names (forenames and surnames) for a given
                        metaphone
    show_forename_freq  Show frequencies of forenames
    show_forename_metaphone_freq
                        Show frequencies of forename metaphones
    show_forename_f2c_freq
                        Show frequencies of forename first two characters
    show_surname_freq   Show frequencies of surnames
    show_surname_metaphone_freq
                        Show frequencies of surname metaphones
    show_surname_f2c_freq
                        Show frequencies of surname first two characters
    show_dob_freq       Show the frequency of any DOB
    show_postcode_freq  Show the frequency of any postcode

===============================================================================
Help for command 'hash'
===============================================================================
USAGE: crate_fuzzy_id_match hash [-h] --input INPUT --output OUTPUT
                                 [--without_frequencies]
                                 [--include_other_info] [--key KEY]
                                 [--allow_default_hash_key]
                                 [--hash_method {HMAC_MD5,HMAC_SHA256,HMAC_SHA512}]
                                 [--rounding_sf ROUNDING_SF]
                                 [--local_id_hash_key LOCAL_ID_HASH_KEY]
                                 [--population_size POPULATION_SIZE]
                                 [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                 [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                 [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                 [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                 [--surname_freq_csv SURNAME_FREQ_CSV]
                                 [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                 [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                 [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                 [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                 [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                 [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                 [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                 [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                 [--k_postcode K_POSTCODE]
                                 [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                 [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                 [--verbose]

Takes an identifiable list of people (with name, DOB, and postcode information)
and creates a hashed, de-identified equivalent. Order is preserved.

The local ID (presumed not to be a direct identifier) is preserved exactly,
unless you explicitly elect to hash it.

Optionally, the "other" information (you can choose, e.g. attaching a direct
identifier) is preserved, but you have to ask for that explicitly; that is
normally for testing.

OPTIONS:
  -h, --help            show this help message and exit
  --input INPUT         Filename for input (plaintext) data. (1) CSV format
                        with header row. Columns: ['local_id', 'forenames',
                        'surnames', 'dob', 'gender', 'postcodes',
                        'perfect_id', 'other_info']. (2) Semicolon-separated
                        values are allowed within ['forenames', 'surnames',
                        'postcodes', 'perfect_id']. (3) The fields
                        ['forenames', 'surnames', 'postcodes'] are in
                        TemporalIdentifier format. Temporal identifier format:
                        either just IDENTIFIER, or
                        IDENTIFIER/STARTDATE/ENDDATE, where dates are in
                        YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). (4) perfect_id, if specified,
                        contains one or more perfect person identifiers as
                        key:value pairs, e.g. 'nhs:12345;ni:AB6789XY'. The
                        keys will be forced to lower case; values will be
                        forced to upper case. (5) 'other_info' is an arbitrary
                        string for you to use (e.g. for validation). (default:
                        None)
  --output OUTPUT       Output file for hashed version. File created by CRATE
                        in JSON Lines (.jsonl) format. (You could use the 'jq'
                        tool to inspect these.) (default: None)
  --without_frequencies
                        Do not include frequency information. This makes the
                        result suitable for use as a sample file, but not a
                        proband file. (default: False)
  --include_other_info  Include the (potentially identifying) 'other_info'
                        data? Usually False; may be set to True for
                        validation. (default: False)

HASHER (SECRECY) OPTIONS:
  --key KEY             Key (passphrase) for hasher. (default:
                        fuzzy_id_match_default_hash_key_DO_NOT_USE_FOR_LIVE_DA
                        TA)
  --allow_default_hash_key
                        Allow the default hash key to be used beyond tests.
                        INADVISABLE! (default: False)
  --hash_method {HMAC_MD5,HMAC_SHA256,HMAC_SHA512}
                        Hash method. (default: HMAC_SHA256)
  --rounding_sf ROUNDING_SF
                        Number of significant figures to use when rounding
                        frequencies in hashed version. Use 'None' to disable
                        rounding. (default: 5)
  --local_id_hash_key LOCAL_ID_HASH_KEY
                        Only applicable to the 'hash' command. Hash the
                        local_id values, using this key (passphrase). There
                        are good reasons to use a key different to that
                        specified for --key. If you leave this blank, or
                        specify an empty string, then local ID values will be
                        left unmodified (e.g. if you have pre-hashed them).
                        (default: None)

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: P(postcode unit match | ¬H) =
                        k_postcode * f_f_postcode, and P(postcode sector match
                        | ¬H) = k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified) or to have a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)

===============================================================================
Help for command 'compare_plaintext'
===============================================================================
USAGE: crate_fuzzy_id_match compare_plaintext [-h] --probands PROBANDS
                                              --sample SAMPLE
                                              [--sample_cache SAMPLE_CACHE]
                                              --output OUTPUT [--profile]
                                              [--min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH]
                                              [--exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS]
                                              [--perfect_id_translation PERFECT_ID_TRANSLATION]
                                              [--extra_validation_output]
                                              [--check_comparison_order]
                                              [--report_every REPORT_EVERY]
                                              [--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
                                              [--n_workers N_WORKERS]
                                              [--p_ep1_forename P_EP1_FORENAME]
                                              [--p_ep2np1_forename P_EP2NP1_FORENAME]
                                              [--p_en_forename P_EN_FORENAME]
                                              [--p_u_forename P_U_FORENAME]
                                              [--p_ep1_surname P_EP1_SURNAME]
                                              [--p_ep2np1_surname P_EP2NP1_SURNAME]
                                              [--p_en_surname P_EN_SURNAME]
                                              [--p_ep_dob P_EP_DOB]
                                              [--p_en_dob P_EN_DOB]
                                              [--p_e_gender P_E_GENDER]
                                              [--p_ep_postcode P_EP_POSTCODE]
                                              [--p_en_postcode P_EN_POSTCODE]
                                              [--population_size POPULATION_SIZE]
                                              [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                              [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                              [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                              [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                              [--surname_freq_csv SURNAME_FREQ_CSV]
                                              [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                              [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                              [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                              [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                              [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                              [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                              [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                              [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                              [--k_postcode K_POSTCODE]
                                              [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                              [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                              [--verbose]

Comparison rules:

- People MUST match on DOB and surname (or surname metaphone), or hashed
  equivalents, to be considered a plausible match.

- Only plausible matches proceed to the Bayesian comparison.

The output file is a CSV (comma-separated value) file with a header and
these columns:

    proband_local_id:
        Local ID (identifiable or de-identified as the user chose) of the
        proband. Taken from the input.
    matched:
        Boolean as binary (0/1). Was a matching person (a "winner") found in
        the sample, who is to be considered a match to the proband? To give a
        match requires (a) that the log odds for the winner reaches a
        threshold, and (b) that the log odds for the winner exceeds the log
        odds for the runner-up by a certain amount (because a mismatch may be
        worse than a failed match).
    log_odds_match:
        Log (ln) odds that the best candidate in the sample is a match to the
        proband.
    p_match:
        Probability that the best candidate in the sample is a match.
        Equivalent to log_odds_match.
    sample_match_local_id:
        Local ID of the "winner" in the sample (the candidate who was matched
        to the proband), or blank if there was no winner.
    second_best_log_odds:
        Log odds of the runner-up (the candidate from the sample who is the
        second-closest match) being the same person as the proband.

If '--extra_validation_output' is used, the following columns are
added:

    best_candidate_local_id:
        Local ID of the closest-matching person (candidate) in the sample, EVEN
        IF THEY DID NOT WIN. (This will be the same as the winner if there was
        a match.) String; blank for no match.
    second_best_candidate_local_id:
        Local ID of the second-best candidate in the sample, if any. String;
        blank for no match.

Proband order is retained in the output (even using parallel processing).

OPTIONS:
  -h, --help            show this help message and exit

COMPARISON OPTIONS:
  --probands PROBANDS   Input filename for probands data. (1) CSV format with
                        header row. Columns: ['local_id', 'forenames',
                        'surnames', 'dob', 'gender', 'postcodes',
                        'perfect_id', 'other_info']. (2) Semicolon-separated
                        values are allowed within ['forenames', 'surnames',
                        'postcodes', 'perfect_id']. (3) The fields
                        ['forenames', 'surnames', 'postcodes'] are in
                        TemporalIdentifier format. Temporal identifier format:
                        either just IDENTIFIER, or
                        IDENTIFIER/STARTDATE/ENDDATE, where dates are in
                        YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). (4) perfect_id, if specified,
                        contains one or more perfect person identifiers as
                        key:value pairs, e.g. 'nhs:12345;ni:AB6789XY'. The
                        keys will be forced to lower case; values will be
                        forced to upper case. (5) 'other_info' is an arbitrary
                        string for you to use (e.g. for validation). (default:
                        None)
  --sample SAMPLE       Input filename for sample data. (1) CSV format with
                        header row. Columns: ['local_id', 'forenames',
                        'surnames', 'dob', 'gender', 'postcodes',
                        'perfect_id', 'other_info']. (2) Semicolon-separated
                        values are allowed within ['forenames', 'surnames',
                        'postcodes', 'perfect_id']. (3) The fields
                        ['forenames', 'surnames', 'postcodes'] are in
                        TemporalIdentifier format. Temporal identifier format:
                        either just IDENTIFIER, or
                        IDENTIFIER/STARTDATE/ENDDATE, where dates are in
                        YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). (4) perfect_id, if specified,
                        contains one or more perfect person identifiers as
                        key:value pairs, e.g. 'nhs:12345;ni:AB6789XY'. The
                        keys will be forced to lower case; values will be
                        forced to upper case. (5) 'other_info' is an arbitrary
                        string for you to use (e.g. for validation). (default:
                        None)
  --sample_cache SAMPLE_CACHE
                        JSONL file in which to store cached sample info (to
                        speed loading) (default: None)
  --output OUTPUT       Output CSV file for proband/sample comparison.
                        (default: None)
  --profile             Profile the code (for development only). (default:
                        False)

MATCHING RULES:
  --min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH
                        Minimum natural log (ln) odds of two people being the
                        same, before a match will be considered. Referred to
                        as theta (θ) in the validation paper. (Default is
                        equivalent to p = 0.9933071490757152.) (default: 5)
  --exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS
                        Minimum log (ln) odds by which a best match must
                        exceed the next-best match to be considered a unique
                        match. Referred to as delta (δ) in the validation
                        paper. (default: 0)
  --perfect_id_translation PERFECT_ID_TRANSLATION
                        Optional dictionary of the form {'nhsnum':'nhsnumber',
                        'ni_num':'national_insurance'}, mapping the names of
                        perfect (person-unique) identifiers as found in the
                        proband data to their equivalents in the sample.
                        (default: None)

CONTROL OPTIONS:
  --extra_validation_output
                        Add extra output for validation purposes (the local
                        IDs of the best and second-best candidates, if any,
                        even if there was no match). (default: False)
  --check_comparison_order
                        Check every comparison for log-likelihood ratio
                        sequence 'no match ≤ partial(s) ≤ full' and warn if
                        this is not observed. Note, however, that deviations
                        from this are not unexpected. (default: False)
  --report_every REPORT_EVERY
                        Report progress every n probands. (default: 100)
  --min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
                        Minimum number of probands for which we will bother to
                        use parallel processing. (default: 1000)
  --n_workers N_WORKERS
                        Number of processes to use in parallel. Defaults to 1
                        (Windows) or the number of CPUs on your system (other
                        operating systems). (default: 8)

ERROR PROBABILITIES:
  --p_ep1_forename P_EP1_FORENAME
                        Probability that a forename has an error such that it
                        fails a full match but satisfies a partial 1
                        (metaphone) match. (Comma-separated list of 'gender:p'
                        values, where gender must include F, M and can include
                        X, ''.) (default: F:0.00894,M:0.0084)
  --p_ep2np1_forename P_EP2NP1_FORENAME
                        Probability that a forename has an error such that it
                        fails a full/partial 1 match but satisfies a partial 2
                        (first two character) match. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00881,M:0.00688)
  --p_en_forename P_EN_FORENAME
                        Probability that a forename has an error such that it
                        produces no match at all. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00572,M:0.00625)
  --p_u_forename P_U_FORENAME
                        Probability that a set of at least two forenames has
                        an error such that they become unordered (e.g.
                        swapped/shuffled) with respect to their counterpart.
                        See paper for full details. (default: 0.00191)
  --p_ep1_surname P_EP1_SURNAME
                        Probability that a surname has an error such that it
                        fails a full match but satisfies a partial 1
                        (metaphone) match. (Comma-separated list of 'gender:p'
                        values, where gender must include F, M and can include
                        X, ''.) (default: F:0.00551,M:0.00471)
  --p_ep2np1_surname P_EP2NP1_SURNAME
                        Probability that a surname has an error such that it
                        fails a full/partial 1 match but satisfies a partial 2
                        (first two character) match. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00378,M:0.00247)
  --p_en_surname P_EN_SURNAME
                        Probability that a surname has an error such that it
                        produces no match at all. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.0567,M:0.0134)
  --p_ep_dob P_EP_DOB   Probability that a DOB is wrong in some way that
                        causes a partial match (YM, MD, or YD) but not a full
                        (YMD) match. (default: 0.00459036)
  --p_en_dob P_EN_DOB   Probability that a DOB error leads to no match
                        (neither full, nor partial as defined above).
                        Empirically, this is about 0.00033. However, we
                        suggest setting it to 0, as anything higher will run
                        much slower. (default: 0)
  --p_e_gender P_E_GENDER
                        Assumed probability (p_e) that a gender is wrong,
                        leading to a proband/candidate mismatch. (default:
                        0.0033)
  --p_ep_postcode P_EP_POSTCODE
                        Assumed probability (p_ep) that a proband/candidate
                        postcode pair fails a full (postcode unit) match but
                        satisfies a partial (postcode sector) match, through
                        error or a move within a sector. (default: 0.0097)
  --p_en_postcode P_EN_POSTCODE
                        Assumed probability (p_en) that a proband/candidate
                        postcode pair exhibits no match at all. (default: 0.3)

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" rows for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: p_f_postcode [P(postcode unit
                        match | ¬H)] = k_postcode * f_f_postcode, and
                        p_p_postcode [P(postcode sector match | ¬H)] =
                        k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ, no
                        fixed abode; ZZ99 3CZ, England/UK not otherwise
                        specified), or of a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
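
The frequency options above feed simple prior calculations. The following is a minimal Python sketch of three of them, using only the formulas and defaults stated in the help text (population_size, birth_year_pseudo_range and n_UK). The function names are hypothetical and are not part of CRATE; the prior log odds are derived here on the assumption that the chance of two random draws (with replacement) being the same person is 1/population_size, which may differ in detail from the tool's internal calculation.

```python
import math


def prior_log_odds(population_size: int) -> float:
    """Ln odds that two people drawn at random (with replacement) are the
    same person, assuming P(same person) = 1 / population_size."""
    p = 1.0 / population_size
    return math.log(p / (1.0 - p))


def p_same_dob(birth_year_pseudo_range: float) -> float:
    """Probability that two random people share a full DOB, taken as
    1 / (365.25 * b) as described for --birth_year_pseudo_range."""
    return 1.0 / (365.25 * birth_year_pseudo_range)


def auto_k_postcode(population_size: int, n_uk: int = 66_040_000) -> float:
    """Autocalculated k_postcode = n_UK / population_size, as described for
    --k_postcode when it is left at its default of None."""
    return n_uk / population_size


if __name__ == "__main__":
    n = 852523  # default --population_size
    print(f"prior log odds: {prior_log_odds(n):.3f}")
    print(f"P(same DOB), b=30: {p_same_dob(30):.3e}")
    print(f"autocalculated k_postcode: {auto_k_postcode(n):.2f}")
```

A smaller population_size gives less negative prior log odds, so the same evidence from names, DOB and postcode yields a higher posterior for a match.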

===============================================================================
Help for command 'compare_hashed_to_hashed'
===============================================================================
USAGE: crate_fuzzy_id_match compare_hashed_to_hashed [-h] --probands PROBANDS
                                                     --sample SAMPLE
                                                     [--sample_cache SAMPLE_CACHE]
                                                     --output OUTPUT
                                                     [--profile]
                                                     [--min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH]
                                                     [--exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS]
                                                     [--perfect_id_translation PERFECT_ID_TRANSLATION]
                                                     [--extra_validation_output]
                                                     [--check_comparison_order]
                                                     [--report_every REPORT_EVERY]
                                                     [--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
                                                     [--n_workers N_WORKERS]
                                                     [--p_ep1_forename P_EP1_FORENAME]
                                                     [--p_ep2np1_forename P_EP2NP1_FORENAME]
                                                     [--p_en_forename P_EN_FORENAME]
                                                     [--p_u_forename P_U_FORENAME]
                                                     [--p_ep1_surname P_EP1_SURNAME]
                                                     [--p_ep2np1_surname P_EP2NP1_SURNAME]
                                                     [--p_en_surname P_EN_SURNAME]
                                                     [--p_ep_dob P_EP_DOB]
                                                     [--p_en_dob P_EN_DOB]
                                                     [--p_e_gender P_E_GENDER]
                                                     [--p_ep_postcode P_EP_POSTCODE]
                                                     [--p_en_postcode P_EN_POSTCODE]
                                                     [--population_size POPULATION_SIZE]
                                                     [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                                     [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                                     [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                                     [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                                     [--surname_freq_csv SURNAME_FREQ_CSV]
                                                     [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                                     [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                                     [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                                     [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                                     [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                                     [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                                     [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                                     [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                                     [--k_postcode K_POSTCODE]
                                                     [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                                     [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                                     [--verbose]

Comparison rules:

- People MUST match on DOB and surname (or surname metaphone), or hashed
  equivalents, to be considered a plausible match.

- Only plausible matches proceed to the Bayesian comparison.

The output file is a CSV (comma-separated value) file with a header and
these columns:

    proband_local_id:
        Local ID (identifiable or de-identified as the user chose) of the
        proband. Taken from the input.
    matched:
        Boolean as binary (0/1). Was a matching person (a "winner") found in
        the sample, who is to be considered a match to the proband? To give a
        match requires (a) that the log odds for the winner reaches a
        threshold, and (b) that the log odds for the winner exceeds the log
        odds for the runner-up by a certain amount (because a mismatch may be
        worse than a failed match).
    log_odds_match:
        Log (ln) odds that the best candidate in the sample is a match to the
        proband.
    p_match:
        Probability that the best candidate in the sample is a match. This is
        log_odds_match converted from log odds to a probability.
    sample_match_local_id:
        Local ID of the "winner" in the sample (the candidate who was matched
        to the proband), or blank if there was no winner.
    second_best_log_odds:
        Log odds of the runner-up (the candidate from the sample who is the
        second-closest match) being the same person as the proband.

If '--extra_validation_output' is used, the following columns are
added:

    best_candidate_local_id:
        Local ID of the closest-matching person (candidate) in the sample, EVEN
        IF THEY DID NOT WIN. (This will be the same as the winner if there was
        a match.) String; blank for no match.
    second_best_candidate_local_id:
        Local ID of the second-best candidate in the sample, if any. String;
        blank for no match.

Proband order is retained in the output (even using parallel processing).
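
The relationship between log_odds_match and p_match, and the two conditions for "matched" described above, can be sketched in Python as follows. This is an illustration under stated assumptions rather than CRATE's own code: the function names are hypothetical, theta and delta stand for --min_log_odds_for_match and --exceeds_next_best_log_odds, and the use of ">=" in both comparisons (and of negative infinity when there is no runner-up) is assumed, not taken from the source.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Convert natural-log odds to a probability (the logistic function),
    written to avoid overflow for large negative inputs."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def is_match(
    best_log_odds: float,
    second_best_log_odds: float = float("-inf"),  # assumed: no runner-up
    theta: float = 5.0,   # --min_log_odds_for_match
    delta: float = 0.0,   # --exceeds_next_best_log_odds
) -> bool:
    """Apply conditions (a) and (b): the winner reaches the threshold and
    beats the runner-up by at least delta (comparison operators assumed)."""
    reaches_threshold = best_log_odds >= theta
    beats_runner_up = best_log_odds >= second_best_log_odds + delta
    return reaches_threshold and beats_runner_up


if __name__ == "__main__":
    # The default threshold of 5 corresponds to p of roughly 0.9933.
    print(log_odds_to_probability(5.0))
    print(is_match(6.0, 2.0))
    print(is_match(6.0, 5.5, delta=1.0))
```

Raising delta trades missed matches for fewer mismatches when two candidates score similarly.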

OPTIONS:
  -h, --help            show this help message and exit

COMPARISON OPTIONS:
  --probands PROBANDS   Input filename for probands data. File created by
                        CRATE in JSON Lines (.jsonl) format. (You could use
                        the 'jq' tool to inspect these.) (default: None)
  --sample SAMPLE       Input filename for sample data. File created by CRATE
                        in JSON Lines (.jsonl) format. (You could use the 'jq'
                        tool to inspect these.) (default: None)
  --sample_cache SAMPLE_CACHE
                        JSONL file in which to store cached sample info (to
                        speed loading) (default: None)
  --output OUTPUT       Output CSV file for proband/sample comparison.
                        (default: None)
  --profile             Profile the code (for development only). (default:
                        False)

MATCHING RULES:
  --min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH
                        Minimum natural log (ln) odds of two people being the
                        same, before a match will be considered. Referred to
                        as theta (θ) in the validation paper. (Default is
                        equivalent to p = 0.9933071490757152.) (default: 5)
  --exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS
                        Minimum log (ln) odds by which a best match must
                        exceed the next-best match to be considered a unique
                        match. Referred to as delta (δ) in the validation
                        paper. (default: 0)
  --perfect_id_translation PERFECT_ID_TRANSLATION
                        Optional dictionary of the form {'nhsnum':'nhsnumber',
                        'ni_num':'national_insurance'}, mapping the names of
                        perfect (person-unique) identifiers as found in the
                        proband data to their equivalents in the sample.
                        (default: None)

CONTROL OPTIONS:
  --extra_validation_output
                        Add extra output for validation purposes (the local
                        IDs of the best and second-best candidates, if any,
                        even if there was no match). (default: False)
  --check_comparison_order
                        Check every comparison for log-likelihood ratio
                        sequence 'no match ≤ partial(s) ≤ full' and warn if
                        this is not observed. Note, however, that deviations
                        from this are not unexpected. (default: False)
  --report_every REPORT_EVERY
                        Report progress every n probands. (default: 100)
  --min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
                        Minimum number of probands for which we will bother to
                        use parallel processing. (default: 1000)
  --n_workers N_WORKERS
                        Number of processes to use in parallel. Defaults to 1
                        (Windows) or the number of CPUs on your system (other
                        operating systems). (default: 8)

ERROR PROBABILITIES:
  --p_ep1_forename P_EP1_FORENAME
                        Probability that a forename has an error such that it
                        fails a full match but satisfies a partial 1
                        (metaphone) match. (Comma-separated list of 'gender:p'
                        values, where gender must include F, M and can include
                        X, ''.) (default: F:0.00894,M:0.0084)
  --p_ep2np1_forename P_EP2NP1_FORENAME
                        Probability that a forename has an error such that it
                        fails a full/partial 1 match but satisfies a partial 2
                        (first two character) match. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00881,M:0.00688)
  --p_en_forename P_EN_FORENAME
                        Probability that a forename has an error such that it
                        produces no match at all. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00572,M:0.00625)
  --p_u_forename P_U_FORENAME
                        Probability that a set of at least two forenames has
                        an error such that they become unordered (e.g.
                        swapped/shuffled) with respect to their counterpart.
                        See paper for full details. (default: 0.00191)
  --p_ep1_surname P_EP1_SURNAME
                        Probability that a surname has an error such that it
                        fails a full match but satisfies a partial 1
                        (metaphone) match. (Comma-separated list of 'gender:p'
                        values, where gender must include F, M and can include
                        X, ''.) (default: F:0.00551,M:0.00471)
  --p_ep2np1_surname P_EP2NP1_SURNAME
                        Probability that a surname has an error such that it
                        fails a full/partial 1 match but satisfies a partial 2
                        (first two character) match. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00378,M:0.00247)
  --p_en_surname P_EN_SURNAME
                        Probability that a surname has an error such that it
                        produces no match at all. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.0567,M:0.0134)
  --p_ep_dob P_EP_DOB   Probability that a DOB is wrong in some way that
                        causes a partial match (YM, MD, or YD) but not a full
                        (YMD) match. (default: 0.00459036)
  --p_en_dob P_EN_DOB   Probability that a DOB error leads to no match
                        (neither full, nor partial as defined above).
                        Empirically, this is about 0.00033. However, we
                        suggest setting it to 0, as anything higher will run
                        much slower. (default: 0)
  --p_e_gender P_E_GENDER
                        Assumed probability (p_e) that a gender is wrong,
                        leading to a proband/candidate mismatch. (default:
                        0.0033)
  --p_ep_postcode P_EP_POSTCODE
                        Assumed probability (p_ep) that a proband/candidate
                        postcode pair fails a full (postcode unit) match but
                        satisfies a partial (postcode sector) match, through
                        error or a move within a sector. (default: 0.0097)
  --p_en_postcode P_EN_POSTCODE
                        Assumed probability (p_en) that a proband/candidate
                        postcode pair exhibits no match at all. (default: 0.3)

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" rows for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: p_f_postcode [P(postcode unit
                        match | ¬H)] = k_postcode * f_f_postcode, and
                        p_p_postcode [P(postcode sector match | ¬H)] =
                        k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ, no
                        fixed abode; ZZ99 3CZ, England/UK not otherwise
                        specified), or of a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
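
Once a comparison has run, the output CSV can be post-processed with standard tools. The sketch below reads it with Python's csv module and keeps only the rows where a match was declared. The column names are those listed in the help text above; the function name, the column order and the example rows are made up for illustration, and it is assumed that "matched" is written as the literal text 0 or 1.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_matches(csv_file: TextIO) -> Iterator[Tuple[str, str, float]]:
    """Yield (proband_local_id, sample_match_local_id, log_odds_match)
    for every row whose 'matched' column is 1."""
    for row in csv.DictReader(csv_file):
        if row["matched"] == "1":
            yield (
                row["proband_local_id"],
                row["sample_match_local_id"],
                float(row["log_odds_match"]),
            )


# Made-up example data standing in for a real output file.
example = io.StringIO(
    "proband_local_id,matched,log_odds_match,p_match,"
    "sample_match_local_id,second_best_log_odds\n"
    "a1,1,12.5,0.999996,b7,-3.2\n"
    "a2,0,1.1,0.750260,,0.4\n"
)
matches = list(read_matches(example))

if __name__ == "__main__":
    for proband_id, sample_id, log_odds in matches:
        print(proband_id, sample_id, log_odds)
```

For a real file, pass an open file handle instead, for example open("output.csv", newline="", encoding="utf-8"), where output.csv is whatever was given to --output.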

===============================================================================
Help for command 'compare_hashed_to_plaintext'
===============================================================================
USAGE: crate_fuzzy_id_match compare_hashed_to_plaintext [-h] --probands
                                                        PROBANDS --sample
                                                        SAMPLE
                                                        [--sample_cache SAMPLE_CACHE]
                                                        --output OUTPUT
                                                        [--profile]
                                                        [--key KEY]
                                                        [--allow_default_hash_key]
                                                        [--hash_method {HMAC_MD5,HMAC_SHA256,HMAC_SHA512}]
                                                        [--rounding_sf ROUNDING_SF]
                                                        [--local_id_hash_key LOCAL_ID_HASH_KEY]
                                                        [--min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH]
                                                        [--exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS]
                                                        [--perfect_id_translation PERFECT_ID_TRANSLATION]
                                                        [--extra_validation_output]
                                                        [--check_comparison_order]
                                                        [--report_every REPORT_EVERY]
                                                        [--min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL]
                                                        [--n_workers N_WORKERS]
                                                        [--p_ep1_forename P_EP1_FORENAME]
                                                        [--p_ep2np1_forename P_EP2NP1_FORENAME]
                                                        [--p_en_forename P_EN_FORENAME]
                                                        [--p_u_forename P_U_FORENAME]
                                                        [--p_ep1_surname P_EP1_SURNAME]
                                                        [--p_ep2np1_surname P_EP2NP1_SURNAME]
                                                        [--p_en_surname P_EN_SURNAME]
                                                        [--p_ep_dob P_EP_DOB]
                                                        [--p_en_dob P_EN_DOB]
                                                        [--p_e_gender P_E_GENDER]
                                                        [--p_ep_postcode P_EP_POSTCODE]
                                                        [--p_en_postcode P_EN_POSTCODE]
                                                        [--population_size POPULATION_SIZE]
                                                        [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                                        [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                                        [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                                        [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                                        [--surname_freq_csv SURNAME_FREQ_CSV]
                                                        [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                                        [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                                        [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                                        [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                                        [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                                        [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                                        [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                                        [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                                        [--k_postcode K_POSTCODE]
                                                        [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                                        [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                                        [--verbose]

Comparison rules:

- People MUST match on DOB and surname (or surname metaphone), or hashed
  equivalents, to be considered a plausible match.

- Only plausible matches proceed to the Bayesian comparison.

The output file is a CSV (comma-separated value) file with a header and
these columns:

    proband_local_id:
        Local ID (identifiable or de-identified as the user chose) of the
        proband. Taken from the input.
    matched:
        Boolean as binary (0/1). Was a matching person (a "winner") found in
        the sample, who is to be considered a match to the proband? To give a
        match requires (a) that the log odds for the winner reaches a
        threshold, and (b) that the log odds for the winner exceeds the log
        odds for the runner-up by a certain amount (because a mismatch may be
        worse than a failed match).
    log_odds_match:
        Log (ln) odds that the best candidate in the sample is a match to the
        proband.
    p_match:
        Probability that the best candidate in the sample is a match. This is
        log_odds_match re-expressed as a probability.
    sample_match_local_id:
        Local ID of the "winner" in the sample (the candidate who was matched
        to the proband), or blank if there was no winner.
    second_best_log_odds:
        Log odds of the runner-up (the candidate from the sample who is the
        second-closest match) being the same person as the proband.

If '--extra_validation_output' is used, the following columns are
added:

    best_candidate_local_id:
        Local ID of the closest-matching person (candidate) in the sample, EVEN
        IF THEY DID NOT WIN. (This will be the same as the winner if there was
        a match.) String; blank for no match.
    second_best_candidate_local_id:
        Local ID of the second-best candidate in the sample, if any. String;
        blank for no match.

Proband order is retained in the output (even using parallel processing).
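
The relationship between the log_odds_match and p_match columns can be sketched as below. This is a minimal illustration of the standard log-odds/probability conversion, not CRATE's own code; the function names are hypothetical. It reproduces the figure quoted for the default threshold under --min_log_odds_for_match (ln odds of 5).

```python
import math


def probability_from_log_odds(log_odds: float) -> float:
    """Convert natural-log odds to a probability, p = 1 / (1 + exp(-x)).

    Hypothetical helper for illustration. The two branches avoid
    overflow in exp() for large positive or negative inputs.
    """
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def log_odds_from_probability(p: float) -> float:
    """Inverse conversion: ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# A log odds of 5 corresponds to p of about 0.99331.
print(probability_from_log_odds(5.0))
```

Because the conversion is monotonic, ranking candidates by log odds or by probability gives the same order.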

OPTIONS:
  -h, --help            show this help message and exit

COMPARISON OPTIONS:
  --probands PROBANDS   Input filename for probands data. File created by
                        CRATE in JSON Lines (.jsonl) format. (You could use
                        the 'jq' tool to inspect these.) (default: None)
  --sample SAMPLE       Input filename for sample data. (1) CSV format with
                        header row. Columns: ['local_id', 'forenames',
                        'surnames', 'dob', 'gender', 'postcodes',
                        'perfect_id', 'other_info']. (2) Semicolon-separated
                        values are allowed within ['forenames', 'surnames',
                        'postcodes', 'perfect_id']. (3) The fields
                        ['forenames', 'surnames', 'postcodes'] are in
                        TemporalIdentifier format. Temporal identifier format:
                        either just IDENTIFIER, or
                        IDENTIFIER/STARTDATE/ENDDATE, where dates are in
                        YYYY-MM-DD format or one of ['none', 'null', '?']
                        (case-insensitive). (4) perfect_id, if specified,
                        contains one or more perfect person identifiers as
                        key:value pairs, e.g. 'nhs:12345;ni:AB6789XY'. The
                        keys will be forced to lower case; values will be
                        forced to upper case. (5) 'other_info' is an arbitrary
                        string for you to use (e.g. for validation). (default:
                        None)
  --sample_cache SAMPLE_CACHE
                        JSONL file in which to store cached sample info (to
                        speed loading) (default: None)
  --output OUTPUT       Output CSV file for proband/sample comparison.
                        (default: None)
  --profile             Profile the code (for development only). (default:
                        False)
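
To make the sample file's field formats concrete, here is a minimal sketch of how a TemporalIdentifier value and a perfect_id value could be parsed, following only the description of --sample above. The function names are hypothetical and this is not CRATE's parser; error handling is deliberately simple.

```python
from datetime import date
from typing import Dict, Optional, Tuple

# Values that stand for "no date", compared case-insensitively.
NULL_DATES = {"none", "null", "?"}


def parse_optional_date(text: str) -> Optional[date]:
    """Parse YYYY-MM-DD, or return None for 'none', 'null' or '?'."""
    text = text.strip()
    if text.lower() in NULL_DATES:
        return None
    return date.fromisoformat(text)


def parse_temporal_identifier(
    text: str,
) -> Tuple[str, Optional[date], Optional[date]]:
    """Parse IDENTIFIER or IDENTIFIER/STARTDATE/ENDDATE."""
    parts = text.strip().split("/")
    if len(parts) == 1:
        return parts[0], None, None
    if len(parts) == 3:
        return (
            parts[0],
            parse_optional_date(parts[1]),
            parse_optional_date(parts[2]),
        )
    raise ValueError(f"Bad temporal identifier: {text!r}")


def parse_perfect_id(text: str) -> Dict[str, str]:
    """Parse 'key:value;key:value'; keys to lower case, values to upper."""
    result: Dict[str, str] = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, value = item.split(":", 1)
        result[key.strip().lower()] = value.strip().upper()
    return result


print(parse_temporal_identifier("CB2 0QQ/2010-01-01/none"))
print(parse_perfect_id("nhs:12345;ni:AB6789XY"))
```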

HASHER (SECRECY) OPTIONS:
  --key KEY             Key (passphrase) for hasher. (default:
                        fuzzy_id_match_default_hash_key_DO_NOT_USE_FOR_LIVE_DA
                        TA)
  --allow_default_hash_key
                        Allow the default hash key to be used beyond tests.
                        INADVISABLE! (default: False)
  --hash_method {HMAC_MD5,HMAC_SHA256,HMAC_SHA512}
                        Hash method. (default: HMAC_SHA256)
  --rounding_sf ROUNDING_SF
                        Number of significant figures to use when rounding
                        frequencies in hashed version. Use 'None' to disable
                        rounding. (default: 5)
  --local_id_hash_key LOCAL_ID_HASH_KEY
                        Only applicable to the 'hash' command. Hash the
                        local_id values, using this key (passphrase). There
                        are good reasons to use a key different to that
                        specified for --key. If you leave this blank, or
                        specify an empty string, then local ID values will be
                        left unmodified (e.g. if you have pre-hashed them).
                        (default: None)

MATCHING RULES:
  --min_log_odds_for_match MIN_LOG_ODDS_FOR_MATCH
                        Minimum natural log (ln) odds of two people being the
                        same, before a match will be considered. Referred to
                        as theta (θ) in the validation paper. (Default is
                        equivalent to p = 0.9933071490757152.) (default: 5)
  --exceeds_next_best_log_odds EXCEEDS_NEXT_BEST_LOG_ODDS
                        Minimum log (ln) odds by which a best match must
                        exceed the next-best match to be considered a unique
                        match. Referred to as delta (δ) in the validation
                        paper. (default: 0)
  --perfect_id_translation PERFECT_ID_TRANSLATION
                        Optional dictionary of the form {'nhsnum':'nhsnumber',
                        'ni_num':'national_insurance'}, mapping the names of
                        perfect (person-unique) identifiers as found in the
                        proband data to their equivalents in the sample.
                        (default: None)
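
The two-part matching rule (theta and delta) can be sketched as follows. This is an illustration based only on the option descriptions above; the function name is hypothetical, and the treatment of exact ties at the boundaries (>= rather than >) is an assumption, not something stated in the help text.

```python
from typing import Optional


def is_match(
    best_log_odds: Optional[float],
    second_best_log_odds: Optional[float],
    theta: float = 5.0,  # --min_log_odds_for_match
    delta: float = 0.0,  # --exceeds_next_best_log_odds
) -> bool:
    """Declare a winner only if it is good enough and clearly the best.

    Hypothetical helper; boundary handling is an assumption.
    """
    if best_log_odds is None:
        return False  # no plausible candidate at all
    if best_log_odds < theta:
        return False  # (a) the winner must reach the threshold
    if second_best_log_odds is None:
        return True  # no runner-up to compete with
    # (b) the winner must beat the runner-up by at least delta
    return best_log_odds - second_best_log_odds >= delta


print(is_match(8.2, 1.5))             # good candidate, distant runner-up
print(is_match(8.2, 7.9, delta=1.0))  # too close to the runner-up
```

Raising delta trades missed matches for fewer mismatches, in line with the remark above that a mismatch may be worse than a failed match.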

CONTROL OPTIONS:
  --extra_validation_output
                        Add extra output for validation purposes (the local
                        IDs of the best and second-best candidates, if any,
                        even if there was no match). (default: False)
  --check_comparison_order
                        Check every comparison for log-likelihood ratio
                        sequence 'no match ≤ partial(s) ≤ full' and warn if
                        this is not observed. Note, however, that deviations
                        from this are not unexpected. (default: False)
  --report_every REPORT_EVERY
                        Report progress every n probands. (default: 100)
  --min_probands_for_parallel MIN_PROBANDS_FOR_PARALLEL
                        Minimum number of probands for which we will bother to
                        use parallel processing. (default: 1000)
  --n_workers N_WORKERS
                        Number of processes to use in parallel. Defaults to 1
                        (Windows) or the number of CPUs on your system (other
                        operating systems). (default: 8)

ERROR PROBABILITIES:
  --p_ep1_forename P_EP1_FORENAME
                        Probability that a forename has an error such that it
                        fails a full match but satisfies a partial 1
                        (metaphone) match. (Comma-separated list of 'gender:p'
                        values, where gender must include F, M and can include
                        X, ''.) (default: F:0.00894,M:0.0084)
  --p_ep2np1_forename P_EP2NP1_FORENAME
                        Probability that a forename has an error such that it
                        fails a full/partial 1 match but satisfies a partial 2
                        (first two character) match. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00881,M:0.00688)
  --p_en_forename P_EN_FORENAME
                        Probability that a forename has an error such that it
                        produces no match at all. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00572,M:0.00625)
  --p_u_forename P_U_FORENAME
                        Probability that a set of at least two forenames has
                        an error such that they become unordered (e.g.
                        swapped/shuffled) with respect to their counterpart.
                        See paper for full details. (default: 0.00191)
  --p_ep1_surname P_EP1_SURNAME
                        Probability that a surname has an error such that it
                        fails a full match but satisfies a partial 1
                        (metaphone) match. (Comma-separated list of 'gender:p'
                        values, where gender must include F, M and can include
                        X, ''.) (default: F:0.00551,M:0.00471)
  --p_ep2np1_surname P_EP2NP1_SURNAME
                        Probability that a surname has an error such that it
                        fails a full/partial 1 match but satisfies a partial 2
                        (first two character) match. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.00378,M:0.00247)
  --p_en_surname P_EN_SURNAME
                        Probability that a surname has an error such that it
                        produces no match at all. (Comma-separated list of
                        'gender:p' values, where gender must include F, M and
                        can include X, ''.) (default: F:0.0567,M:0.0134)
  --p_ep_dob P_EP_DOB   Probability that a DOB is wrong in some way that
                        causes a partial match (YM, MD, or YD) but not a full
                        (YMD) match. (default: 0.00459036)
  --p_en_dob P_EN_DOB   Probability that a DOB error leads to no match
                        (neither full, nor partial as defined above).
                        Empirically, this is about 0.00033. However, we
                        suggest setting it to 0, as anything higher will run
                        much slower. (default: 0)
  --p_e_gender P_E_GENDER
                        Assumed probability (p_e) that a gender is wrong,
                        leading to a proband/candidate mismatch. (default:
                        0.0033)
  --p_ep_postcode P_EP_POSTCODE
                        Assumed probability (p_ep) that a proband/candidate
                        postcode pair fails a full (postcode unit) match but
                        satisfies a partial (postcode sector) match, through
                        error or a move within a sector. (default: 0.0097)
  --p_en_postcode P_EN_POSTCODE
                        Assumed probability (p_en) that a proband/candidate
                        postcode pair exhibits no match at all. (default: 0.3)
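
Several of the options above take a comma-separated list of 'gender:p' values. A minimal sketch of how such a string could be read is shown here; the function name is hypothetical, and the validation reflects only the stated requirement that F and M be present.

```python
from typing import Dict


def parse_gender_probabilities(text: str) -> Dict[str, float]:
    """Parse e.g. 'F:0.00894,M:0.0084' into {'F': 0.00894, 'M': 0.0084}.

    Hypothetical helper. Gender keys may also be 'X' or '' (empty).
    """
    result: Dict[str, float] = {}
    for item in text.split(","):
        gender, p_text = item.split(":", 1)
        p = float(p_text)
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"Not a probability: {p_text!r}")
        result[gender.strip()] = p
    missing = {"F", "M"} - result.keys()
    if missing:
        raise ValueError(f"Missing required genders: {sorted(missing)}")
    return result


print(parse_gender_probabilities("F:0.00894,M:0.0084"))
```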

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
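
As a worked illustration of the baseline log odds: if two people are drawn at random with replacement from a population of n, the probability that they are the same person is 1/n, so the odds are 1/(n - 1). The sketch below follows from that wording alone; the function name is hypothetical and the exact formula used inside CRATE is not confirmed here.

```python
import math


def baseline_log_odds_same_person(population_size: int) -> float:
    """Prior ln odds that two random draws (with replacement) coincide.

    P(same) = 1/n, so odds = (1/n) / (1 - 1/n) = 1 / (n - 1).
    Hypothetical helper for illustration.
    """
    if population_size < 2:
        raise ValueError("population_size must be at least 2")
    return -math.log(population_size - 1)


# With the default population size this is roughly -13.66.
print(baseline_log_odds_same_person(852523))
```

The evidence from names, DOB, gender and postcode therefore has to contribute a large positive log likelihood ratio before the posterior log odds can reach the default threshold of 5.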
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
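
The DOB formula above is simple enough to check by hand. The snippet below evaluates 1/(365.25 * b) for the default b = 30; the function name is hypothetical.

```python
def p_two_random_people_share_dob(birth_year_pseudo_range: float = 30) -> float:
    """Probability of a chance full-DOB match, 1 / (365.25 * b)."""
    return 1.0 / (365.25 * birth_year_pseudo_range)


# With b = 30 there are 10957.5 equally likely "pseudo-days".
print(p_two_random_people_share_dob())  # roughly 9.13e-05
```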
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: P(postcode unit match | ¬H) =
                        k_postcode * f_f_postcode, and p_p_postcode =
                        P(postcode sector match | ¬H) = k_postcode *
                        f_p_postcode. The default, None, autocalculates
                        k_postcode = n_UK / population_size, where n_UK =
                        66040000; this is approximately correct if your
                        population is a geographically restricted section of
                        the UK, but if it is geographically representative of
                        the UK, specify 1. (default: None)
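
The autocalculated k_postcode can be reproduced directly from the formula above. This sketch uses the default population size; the function name and the example postcode-unit frequency are hypothetical, chosen only to show the scaling.

```python
from typing import Optional

N_UK = 66040000  # UK population figure quoted in the help text


def k_postcode(population_size: int, override: Optional[float] = None) -> float:
    """Return the override if given, else n_UK / population_size."""
    if override is not None:
        return override
    return N_UK / population_size


k = k_postcode(852523)
print(round(k, 2))  # about 77.46 for the default population size

# Hypothetical national frequency of one postcode unit:
f_f_postcode = 2.5e-7
# Chance-match probability for that unit in the restricted population:
print(k * f_f_postcode)
```

Intuitively, a postcode unit that is rare nationally is proportionally more common within a geographically restricted population that contains it, so chance matches on it are more likely.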
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified), or of a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must be >= 1; in
                        practice we enforce > 1 (see paper). (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)

===============================================================================
Help for command 'print_demo_sample'
===============================================================================
USAGE: crate_fuzzy_id_match print_demo_sample [-h] [--verbose]

OPTIONS:
  -h, --help  show this help message and exit

DISPLAY OPTIONS:
  --verbose   Be verbose. (default: False)

===============================================================================
Help for command 'show_metaphone'
===============================================================================
USAGE: crate_fuzzy_id_match show_metaphone [-h] [--verbose] words [words ...]

POSITIONAL ARGUMENTS:
  words       Words to check

OPTIONS:
  -h, --help  show this help message and exit

DISPLAY OPTIONS:
  --verbose   Be verbose. (default: False)

===============================================================================
Help for command 'show_names_for_metaphone'
===============================================================================
USAGE: crate_fuzzy_id_match show_names_for_metaphone [-h]
                                                     [--population_size POPULATION_SIZE]
                                                     [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                                     [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                                     [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                                     [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                                     [--surname_freq_csv SURNAME_FREQ_CSV]
                                                     [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                                     [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                                     [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                                     [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                                     [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                                     [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                                     [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                                     [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                                     [--k_postcode K_POSTCODE]
                                                     [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                                     [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                                     [--verbose]
                                                     words [words ...]

POSITIONAL ARGUMENTS:
  words                 Words to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: P(postcode unit match | ¬H) =
                        k_postcode * f_f_postcode, and p_p_postcode =
                        P(postcode sector match | ¬H) = k_postcode *
                        f_p_postcode. The default, None, autocalculates
                        k_postcode = n_UK / population_size, where n_UK =
                        66040000; this is approximately correct if your
                        population is a geographically restricted section of
                        the UK, but if it is geographically representative of
                        the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified), or of a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must be >= 1; in
                        practice we enforce > 1 (see paper). (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)

===============================================================================
Help for command 'show_forename_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_forename_freq [-h]
                                               [--population_size POPULATION_SIZE]
                                               [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                               [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                               [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                               [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                               [--surname_freq_csv SURNAME_FREQ_CSV]
                                               [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                               [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                               [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                               [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                               [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                               [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                               [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                               [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                               [--k_postcode K_POSTCODE]
                                               [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                               [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                               [--verbose]
                                               forenames [forenames ...]

POSITIONAL ARGUMENTS:
  forenames             Forenames to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor of
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: P(postcode unit match | ¬H) =
                        k_postcode * f_f_postcode, and P(postcode sector
                        match | ¬H) = k_postcode * f_p_postcode. The
                        default, None, autocalculates k_postcode = n_UK /
                        population_size where n_UK = 66040000; this is
                        approximately correct if your population is a
                        geographically restricted section of the UK, but if it
                        is geographically representative of the UK, specify 1.
                        (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified) or to have a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
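
The minimum-frequency rule described for --forename_min_frequency and --surname_min_frequency ("if a frequency is unknown or less than this, the software uses this minimum") amounts to a simple floor. The sketch below illustrates that rule only; effective_frequency is a hypothetical name, and None is an assumed stand-in for "name not found in the frequency data".

```python
from typing import Optional


def effective_frequency(
    raw_frequency: Optional[float], min_frequency: float = 5e-06
) -> float:
    """Apply the documented floor to a name frequency (hypothetical helper).

    raw_frequency is None when the name is absent from the frequency data.
    """
    if raw_frequency is None:
        return min_frequency
    return max(raw_frequency, min_frequency)


# A common name keeps its own frequency; rare or unknown names get the floor.
for label, freq in [("common", 0.01), ("rare", 2.9e-8), ("unknown", None)]:
    print(label, effective_frequency(freq))
```

Because the floored values are saved in the name caches, the help text asks you to delete the relevant cache whenever you change either minimum.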

===============================================================================
Help for command 'show_forename_metaphone_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_forename_metaphone_freq [-h]
                                                         [--population_size POPULATION_SIZE]
                                                         [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                                         [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                                         [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                                         [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                                         [--surname_freq_csv SURNAME_FREQ_CSV]
                                                         [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                                         [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                                         [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                                         [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                                         [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                                         [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                                         [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                                         [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                                         [--k_postcode K_POSTCODE]
                                                         [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                                         [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                                         [--verbose]
                                                         metaphones
                                                         [metaphones ...]

POSITIONAL ARGUMENTS:
  metaphones            Metaphones to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor of
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: P(postcode unit match | ¬H) =
                        k_postcode * f_f_postcode, and P(postcode sector
                        match | ¬H) = k_postcode * f_p_postcode. The
                        default, None, autocalculates k_postcode = n_UK /
                        population_size where n_UK = 66040000; this is
                        approximately correct if your population is a
                        geographically restricted section of the UK, but if it
                        is geographically representative of the UK, specify 1.
                        (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified) or to have a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
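
Two of the priors described above follow directly from their option values: the chance DOB match probability 1/(365.25 * b) from --birth_year_pseudo_range, and the baseline log odds from --population_size. The sketch below works through both. It assumes natural logarithms and the standard definition of odds, p / (1 - p), neither of which is stated in the help text; the function names are hypothetical.

```python
import math


def p_dob_match_by_chance(birth_year_pseudo_range: float = 30) -> float:
    """Probability that two random people share a DOB: 1 / (365.25 * b)."""
    return 1 / (365.25 * birth_year_pseudo_range)


def baseline_log_odds_same_person(population_size: int) -> float:
    """Log odds that two people drawn with replacement are the same person.

    Assumption: natural log, with P(same person) = 1 / population_size.
    """
    p_same = 1 / population_size
    return math.log(p_same / (1 - p_same))


p_dob = p_dob_match_by_chance()  # default b = 30
prior = baseline_log_odds_same_person(852523)  # default population size
print(f"P(DOB match by chance) = {p_dob:.3e}")
print(f"Baseline log odds      = {prior:.3f}")
```

A larger population makes the baseline log odds more negative, so more matching evidence is needed before two records are judged to belong to the same person.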

===============================================================================
Help for command 'show_forename_f2c_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_forename_f2c_freq [-h]
                                                   [--population_size POPULATION_SIZE]
                                                   [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                                   [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                                   [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                                   [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                                   [--surname_freq_csv SURNAME_FREQ_CSV]
                                                   [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                                   [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                                   [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                                   [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                                   [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                                   [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                                   [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                                   [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                                   [--k_postcode K_POSTCODE]
                                                   [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                                   [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                                   [--verbose]
                                                   f2c [f2c ...]

POSITIONAL ARGUMENTS:
  f2c                   First-two-character groups to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor of
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: P(postcode unit match | ¬H) =
                        k_postcode * f_f_postcode, and P(postcode sector
                        match | ¬H) = k_postcode * f_p_postcode. The
                        default, None, autocalculates k_postcode = n_UK /
                        population_size where n_UK = 66040000; this is
                        approximately correct if your population is a
                        geographically restricted section of the UK, but if it
                        is geographically representative of the UK, specify 1.
                        (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified) or to have a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
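
The two gender options above combine into a three-way distribution: --p_not_male_or_female gives P(X), and --p_female_given_male_or_female splits the remainder between F and M. The sketch below is a worked calculation from those definitions, not CRATE's implementation; gender_probabilities is a hypothetical name.

```python
from typing import Dict


def gender_probabilities(
    p_not_male_or_female: float = 0.004,
    p_female_given_male_or_female: float = 0.51,
) -> Dict[str, float]:
    """Derive P(F), P(M), P(X) from the two command-line options."""
    p_male_or_female = 1 - p_not_male_or_female
    p_female = p_female_given_male_or_female * p_male_or_female
    p_male = (1 - p_female_given_male_or_female) * p_male_or_female
    return {"F": p_female, "M": p_male, "X": p_not_male_or_female}


probs = gender_probabilities()  # defaults from the help text
for gender, p in probs.items():
    print(f"P({gender}) = {p:.5f}")
```

The three probabilities sum to 1 by construction, so changing either option rescales the others consistently.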

===============================================================================
Help for command 'show_surname_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_surname_freq [-h]
                                              [--population_size POPULATION_SIZE]
                                              [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                              [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                              [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                              [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                              [--surname_freq_csv SURNAME_FREQ_CSV]
                                              [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                              [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                              [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                              [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                              [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                              [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                              [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                              [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                              [--k_postcode K_POSTCODE]
                                              [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                              [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                              [--verbose]
                                              surnames [surnames ...]

POSITIONAL ARGUMENTS:
  surnames              Surnames to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor of
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,
                        DELL,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,
                        JR,L,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,
                        VII,VIII,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: p_f_postcode [P(postcode unit
                        match | ¬H)] = k_postcode * f_f_postcode, and
                        p_p_postcode [P(postcode sector match | ¬H)] =
                        k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified), or of having a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
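
The postcode options above describe simple multiplications. The following is a minimal Python sketch of that arithmetic, using the defaults quoted in the help text. The function names are hypothetical (they are not CRATE's API), and the example postcode frequency is made up for illustration.

```python
# Illustrative sketch only; these function names are hypothetical and are not
# part of CRATE's API.

N_UK = 66_040_000  # UK population figure quoted in the --k_postcode help


def autocalc_k_postcode(population_size: int, n_uk: int = N_UK) -> float:
    """Return n_UK / population_size (used when --k_postcode is None)."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    return n_uk / population_size


def p_postcode_match_given_not_h(k_postcode: float, f_postcode: float) -> float:
    """Scale a national postcode frequency (unit or sector) by k_postcode."""
    return k_postcode * f_postcode


def p_pseudopostcode_sector_match_given_not_h(
    k_pseudopostcode: float = 1.83,
    p_unknown_or_pseudo_postcode: float = 0.00201,
) -> float:
    """Return k_pseudopostcode * p_unknown_or_pseudo_postcode."""
    if k_pseudopostcode <= 1:
        raise ValueError("k_pseudopostcode must be > 1")
    return k_pseudopostcode * p_unknown_or_pseudo_postcode


k = autocalc_k_postcode(population_size=852_523)  # default --population_size
f_unit = 1e-6  # made-up national frequency for one postcode unit
print(f"k_postcode = {k:.2f}")
print(f"P(unit match | not H) = {p_postcode_match_given_not_h(k, f_unit):.3g}")
print(
    "P(pseudo sector match | not H) = "
    f"{p_pseudopostcode_sector_match_given_not_h():.6f}"
)
```

With a geographically representative population, the help text says to specify 1 for k_postcode, in which case the national frequencies are used unscaled.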

===============================================================================
Help for command 'show_surname_metaphone_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_surname_metaphone_freq [-h]
                                                        [--population_size POPULATION_SIZE]
                                                        [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                                        [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                                        [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                                        [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                                        [--surname_freq_csv SURNAME_FREQ_CSV]
                                                        [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                                        [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                                        [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                                        [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                                        [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                                        [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                                        [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                                        [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                                        [--k_postcode K_POSTCODE]
                                                        [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                                        [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                                        [--verbose]
                                                        metaphones
                                                        [metaphones ...]

POSITIONAL ARGUMENTS:
  metaphones            Surname metaphones to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population, are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor of
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,
                        DELL,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,
                        JR,L,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,
                        VII,VIII,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: p_f_postcode [P(postcode unit
                        match | ¬H)] = k_postcode * f_f_postcode, and
                        p_p_postcode [P(postcode sector match | ¬H)] =
                        k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified), or of having a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
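
The --forename_min_frequency and --surname_min_frequency options both describe a floor: an unknown frequency, or one below the minimum, is replaced by the minimum. A minimal Python sketch of that rule follows. The function name is hypothetical and the lookup table is invented for illustration; neither is taken from CRATE or its frequency data.

```python
from typing import Dict, Optional

# Illustrative sketch only; floored_frequency() is a hypothetical helper and
# the frequencies below are invented, not real surname data.


def floored_frequency(
    freq: Optional[float], min_frequency: float = 5e-06
) -> float:
    """Return freq, or min_frequency if freq is unknown or below the floor."""
    if freq is None or freq < min_frequency:
        return min_frequency
    return freq


example_freqs: Dict[str, float] = {
    "COMMONNAME": 0.008,  # well above the floor: returned unchanged
    "RARENAME": 1e-07,    # below the floor: raised to the floor
}

for name in ("COMMONNAME", "RARENAME", "UNSEENNAME"):
    # dict.get() returns None for a name missing from the table
    print(name, floored_frequency(example_freqs.get(name)))
```

Because the floored values are stored in the name caches, the help text asks you to delete the relevant cache whenever you change a minimum frequency.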

===============================================================================
Help for command 'show_surname_f2c_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_surname_f2c_freq [-h]
                                                  [--population_size POPULATION_SIZE]
                                                  [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                                  [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                                  [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                                  [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                                  [--surname_freq_csv SURNAME_FREQ_CSV]
                                                  [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                                  [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                                  [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                                  [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                                  [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                                  [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                                  [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                                  [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                                  [--k_postcode K_POSTCODE]
                                                  [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                                  [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                                  [--verbose]
                                                  f2c [f2c ...]

POSITIONAL ARGUMENTS:
  f2c                   First-two-character groups to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population, are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor of
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,
                        DELL,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,
                        JR,L,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,
                        VII,VIII,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: p_f_postcode [P(postcode unit
                        match | ¬H)] = k_postcode * f_f_postcode, and
                        p_p_postcode [P(postcode sector match | ¬H)] =
                        k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified), or of having a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
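
The --population_size option feeds the baseline (prior) log odds that two people drawn at random, with replacement, are the same person. One way to read that description is sketched below in Python: the probability of drawing the same person twice is 1/n, which is then converted to log odds. The function name is hypothetical, and the use of natural logarithms is an assumption of this sketch rather than something stated in the help text.

```python
import math

# Illustrative sketch only; baseline_log_odds_same_person() is a hypothetical
# helper. Natural logarithms are an assumption made for this example.


def baseline_log_odds_same_person(population_size: int) -> float:
    """Log odds that two people drawn with replacement are the same person."""
    if population_size <= 1:
        raise ValueError("population_size must be greater than 1")
    p_same = 1 / population_size       # P(second draw equals the first)
    odds = p_same / (1 - p_same)       # equals 1 / (population_size - 1)
    return math.log(odds)


for n in (100_000, 852_523):  # the worked example's size and the default
    print(n, round(baseline_log_odds_same_person(n), 3))
```

A larger population gives more negative baseline log odds, so more matching evidence is needed before two records are judged to belong to the same person.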

===============================================================================
Help for command 'show_dob_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_dob_freq [-h]
                                          [--population_size POPULATION_SIZE]
                                          [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                          [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                          [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                          [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                          [--surname_freq_csv SURNAME_FREQ_CSV]
                                          [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                          [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                          [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                          [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                          [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                          [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                          [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                          [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                          [--k_postcode K_POSTCODE]
                                          [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                          [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                          [--verbose]

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population, are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" pairs for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor of
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,
                        DELL,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,
                        JR,L,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,
                        VII,VIII,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: p_f_postcode [P(postcode unit
                        match | ¬H)] = k_postcode * f_f_postcode, and
                        p_p_postcode [P(postcode sector match | ¬H)] =
                        k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of each
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ = no
                        fixed abode; ZZ99 3CZ = England/UK not otherwise
                        specified), or of having a postcode not known to the
                        postcode geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
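
The --birth_year_pseudo_range help gives the probability of two random people sharing a date of birth as 1/(365.25 * b). A minimal Python sketch of that formula follows. The function name is hypothetical (not part of CRATE), and the sketch covers only full-DOB matches, not the partial-DOB case that the help text also mentions.

```python
# Illustrative sketch only; p_two_random_people_share_dob() is a hypothetical
# helper implementing the 1 / (365.25 * b) formula from the help text.

DAYS_PER_YEAR = 365.25


def p_two_random_people_share_dob(birth_year_pseudo_range: float = 30) -> float:
    """Return 1 / (365.25 * b), where b is the birth year pseudo-range."""
    if birth_year_pseudo_range <= 0:
        raise ValueError("birth_year_pseudo_range must be positive")
    return 1 / (DAYS_PER_YEAR * birth_year_pseudo_range)


for b in (10, 30, 90):  # 30 is the documented default
    print(f"b = {b}: P(same DOB) = {p_two_random_people_share_dob(b):.3e}")
```

A wider pseudo-range spreads birth dates over more days, so a DOB match between two unrelated people becomes less likely and therefore more informative.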

===============================================================================
Help for command 'show_postcode_freq'
===============================================================================
USAGE: crate_fuzzy_id_match show_postcode_freq [-h]
                                               [--population_size POPULATION_SIZE]
                                               [--forename_cache_filename FORENAME_CACHE_FILENAME]
                                               [--forename_sex_freq_csv FORENAME_SEX_FREQ_CSV]
                                               [--forename_min_frequency FORENAME_MIN_FREQUENCY]
                                               [--surname_cache_filename SURNAME_CACHE_FILENAME]
                                               [--surname_freq_csv SURNAME_FREQ_CSV]
                                               [--surname_min_frequency SURNAME_MIN_FREQUENCY]
                                               [--accent_transliterations ACCENT_TRANSLITERATIONS]
                                               [--nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS]
                                               [--birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE]
                                               [--p_not_male_or_female P_NOT_MALE_OR_FEMALE]
                                               [--p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE]
                                               [--postcode_cache_filename POSTCODE_CACHE_FILENAME]
                                               [--postcode_csv_filename POSTCODE_CSV_FILENAME]
                                               [--k_postcode K_POSTCODE]
                                               [--p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE]
                                               [--k_pseudopostcode K_PSEUDOPOSTCODE]
                                               [--verbose]
                                               postcodes [postcodes ...]

POSITIONAL ARGUMENTS:
  postcodes             postcodes to check

OPTIONS:
  -h, --help            show this help message and exit

FREQUENCY INFORMATION FOR PRIOR PROBABILITIES:
  --population_size POPULATION_SIZE
                        Size of the whole population, from which we calculate
                        the baseline log odds that two people, randomly
                        selected (and replaced) from the population are the
                        same person. (default: 852523)
  --forename_cache_filename FORENAME_CACHE_FILENAME
                        File in which to store cached forename info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_forename_cache.jsonl)
  --forename_sex_freq_csv FORENAME_SEX_FREQ_CSV
                        CSV file of "name, sex, frequency" triples for
                        forenames. You can generate one via
                        crate_fetch_wordlists. [Information saved in the
                        forename cache. If you change this, delete your
                        forename cache.] (default:
                        /path/to/linkage/data/us_forename_sex_freq.zip)
  --forename_min_frequency FORENAME_MIN_FREQUENCY
                        Minimum frequency for forenames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. The standard US forename data has a floor
                        2.875e-8 (M), 2.930e-8 (F), so 2.9e-8 to 2sf.
                        [Information saved in the forename cache. If you
                        change this, delete your forename cache.] (default:
                        5e-06)
  --surname_cache_filename SURNAME_CACHE_FILENAME
                        File in which to store cached surname info (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_surname_cache.jsonl)
  --surname_freq_csv SURNAME_FREQ_CSV
                        CSV file of "name, frequency" pairs for surnames. You
                        can generate one via crate_fetch_wordlists.
                        [Information saved in the surname cache. If you change
                        this, delete your surname cache.] (default:
                        /path/to/linkage/data/us_surname_freq.zip)
  --surname_min_frequency SURNAME_MIN_FREQUENCY
                        Minimum frequency for surnames. If a frequency is
                        unknown or less than this, the software uses this
                        minimum. In the standard US surname data, values below
                        3e-7 are reported as 0, so 1.5e-7 is the midpoint of
                        the low-frequency range. [Information saved in the
                        surname cache. If you change this, delete your surname
                        cache.] (default: 5e-06)
  --accent_transliterations ACCENT_TRANSLITERATIONS
                        (For surnames.) CSV list of 'accented/plain' pairs,
                        representing how accented characters may be
                        transliterated (if they are not reproduced accurately
                        and not simply mangled into ASCII like É→E). Only
                        upper-case versions are required (anything supplied
                        will be converted to upper case). (default:
                        Ä/AE,Ö/OE,Ü/UE,ẞ/SS)
  --nonspecific_name_components NONSPECIFIC_NAME_COMPONENTS
                        (For surnames.) CSV list of name components that
                        should not be used as alternatives in their own right,
                        such as nobiliary particles. (default:
                        AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL
                        ,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L
                        ,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VII
                        I,VON,X,ZU)
  --birth_year_pseudo_range BIRTH_YEAR_PSEUDO_RANGE
                        Birth year pseudo-range. The purpose is to calculate
                        the probability of two random people sharing a DOB,
                        which is taken as 1/(365.25 * b), even for 29 Feb, or
                        a partial DOB equivalently. This option is b.
                        (default: 30)
  --p_not_male_or_female P_NOT_MALE_OR_FEMALE
                        Probability that a person in the population has gender
                        'X'. (default: 0.004)
  --p_female_given_male_or_female P_FEMALE_GIVEN_MALE_OR_FEMALE
                        Probability that a person in the population is female,
                        given that they are either male or female. (default:
                        0.51)
  --postcode_cache_filename POSTCODE_CACHE_FILENAME
                        File in which to store cached postcodes (to speed
                        loading). (default:
                        /path/to/crate/user/data/fuzzy_postcode_cache.json)
  --postcode_csv_filename POSTCODE_CSV_FILENAME
                        CSV file of postcode geography from UK Census/ONS
                        data. A ZIP file is also acceptable. [Information
                        saved in the postcode cache. If you change this,
                        delete your postcode cache.] (default:
                        /path/to/linkage/data/ONSPD_MAY_2022_UK.zip)
  --k_postcode K_POSTCODE
                        Probability multiple: p_f_postcode [P(postcode unit
                        match | ¬H)] = k_postcode * f_f_postcode, and
                        p_p_postcode [P(postcode sector match | ¬H)] =
                        k_postcode * f_p_postcode. The default, None,
                        autocalculates k_postcode = n_UK / population_size,
                        where n_UK = 66040000; this is approximately correct
                        if your population is a geographically restricted
                        section of the UK, but if it is geographically
                        representative of the UK, specify 1. (default: None)
  --p_unknown_or_pseudo_postcode P_UNKNOWN_OR_PSEUDO_POSTCODE
                        Expected population probability of having a
                        'pseudo-postcode' postcode unit (e.g. ZZ99 3VZ, no
                        fixed abode; ZZ99 3CZ, England/UK not otherwise
                        specified) or a postcode not known to the postcode
                        geography database. (default: 0.00201)
  --k_pseudopostcode K_PSEUDOPOSTCODE
                        Probability multiple: P(pseudopostcode sector or
                        unknown postcode sector match | ¬H) = k_pseudopostcode
                        * p_unknown_or_pseudo_postcode. Must strictly be >=1
                        and we enforce >1; see paper. (default: 1.83)

DISPLAY OPTIONS:
  --verbose             Be verbose. (default: False)
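
The prior-probability options above are defined by simple arithmetic, which the following minimal sketch reproduces using the default values shown in the help text. The function names are hypothetical and are not part of CRATE's API; the use of natural logarithms for the baseline log odds is an assumption of this sketch, not something the help text states.

```python
import math

N_UK = 66_040_000  # n_UK as quoted in the --k_postcode help text


def baseline_log_odds(population_size: int) -> float:
    """Prior log odds that two people drawn at random (with replacement)
    from the population are the same person.
    Assumption: natural log; the help text does not give the base."""
    p_same = 1 / population_size
    return math.log(p_same / (1 - p_same))


def p_same_dob(birth_year_pseudo_range: float) -> float:
    """P(two random people share a DOB) = 1 / (365.25 * b)."""
    return 1 / (365.25 * birth_year_pseudo_range)


def gender_probabilities(
    p_not_male_or_female: float, p_female_given_male_or_female: float
) -> dict:
    """Population probabilities for F, M and X, derived from the two
    gender options by the ordinary rule P(F) = P(F | M or F) * P(M or F)."""
    p_m_or_f = 1 - p_not_male_or_female
    return {
        "F": p_m_or_f * p_female_given_male_or_female,
        "M": p_m_or_f * (1 - p_female_given_male_or_female),
        "X": p_not_male_or_female,
    }


def k_postcode_auto(population_size: int) -> float:
    """Autocalculated k_postcode = n_UK / population_size."""
    return N_UK / population_size


def p_pseudopostcode_sector_match(
    k_pseudopostcode: float, p_unknown_or_pseudo_postcode: float
) -> float:
    """P(pseudopostcode/unknown sector match | not H)
    = k_pseudopostcode * p_unknown_or_pseudo_postcode."""
    return k_pseudopostcode * p_unknown_or_pseudo_postcode


if __name__ == "__main__":
    # Defaults taken from the help text above.
    print("baseline log odds:", baseline_log_odds(852523))
    print("P(same DOB):", p_same_dob(30))
    print("gender probabilities:", gender_probabilities(0.004, 0.51))
    print("autocalculated k_postcode:", k_postcode_auto(852523))
    print("P(pseudo sector match | not H):",
          p_pseudopostcode_sector_match(1.83, 0.00201))
```

With the default population_size, the autocalculated k_postcode is about 77, which illustrates why the help text recommends specifying 1 for a population that is geographically representative of the UK.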

Name frequency data is pre-supplied. It was generated like this:

#!/bin/bash
# Fetch/generate name/frequency files for de-identified fuzzy linkage.

# 1. Fetch our source data.
wget https://www.ssa.gov/OACT/babynames/names.zip -O forenames.zip
wget http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last -O surnames_1990.txt
wget https://www2.census.gov/topics/genealogy/2010surnames/names.zip -O surnames_2010.zip

# 2. Create our frequency lists.
crate_fetch_wordlists \
    --us_forenames \
        --us_forenames_url "file://${PWD}/forenames.zip" \
        --us_forenames_sex_freq_output us_forename_sex_freq.csv \
    --us_surnames \
        --us_surnames_1990_census_url "file://${PWD}/surnames_1990.txt" \
        --us_surnames_2010_census_url "file://${PWD}/surnames_2010.zip" \
        --us_surnames_freq_output us_surname_freq.csv