5. Preprocessing tools

These tools:

reshape specific databases for CRATE:
- crate_preprocess_pcmis – preprocess a PCMIS database
- crate_preprocess_rio – preprocess a RiO database
- crate_preprocess_systmone – index a SystmOne database
fetch external data used for anonymisation:
- crate_postcodes – fetch ONS postcode information
- crate_fetch_wordlists – fetch forenames, surnames, and medical eponyms
perform fuzzy identity matching for linking different databases securely:
- crate_fuzzy_id_match

Although they are usually run before anonymisation, it’s probably more helpful to read the Anonymisation section first.

5.1. crate_preprocess_pcmis

Options:

USAGE: crate_preprocess_pcmis [-h] --url URL [-v] [--print] [--echo]
                              [--drop_danger_drop] [--debug_skiptables]
                              [--postcodedb POSTCODEDB]
                              [--geogcols [GEOGCOLS ...]]
                              [--settings_filename SETTINGS_FILENAME]

Alters a PCMIS database to be suitable for CRATE.

OPTIONS:
  -h, --help            show this help message and exit
  --url URL             SQLAlchemy database URL (default: None)
  -v, --verbose         Verbose (default: False)
  --print               Print SQL but do not execute it. (You can redirect the
                        printed output to create an SQL script.) (default:
                        False)
  --echo                Echo SQL (default: False)
  --drop_danger_drop    REMOVES new columns and indexes, rather than creating
                        them. (There's not very much danger; no real
                        information is lost, but it might take a while to
                        recalculate it.) (default: False)
  --debug_skiptables    DEBUG-ONLY OPTION. Skip tables (view creation only)
                        (default: False)
  --postcodedb POSTCODEDB
                        Specify database (schema) name for ONS Postcode
                        Database (as imported by CRATE) to link to addresses
                        as a view. With SQL Server, you will have to specify
                        the schema as well as the database; e.g. "--postcodedb
                        ONS_PD.dbo" (default: None)
  --geogcols [GEOGCOLS ...]
                        List of geographical information columns to link in
                        from ONS Postcode Database. BEWARE that you do not
                        specify anything too identifying. (default: ['bua11',
                        'buasd11', 'casward', 'imd', 'lea', 'lsoa01',
                        'lsoa11', 'msoa01', 'msoa11', 'nuts', 'oac01',
                        'oac11', 'parish', 'pcon', 'pct', 'ru11ind',
                        'statsward', 'ur01ind'])
  --settings_filename SETTINGS_FILENAME
                        Specify filename to write draft ddgen_* settings to,
                        for use in a CRATE anonymiser configuration file.
                        (default: None)

5.2. crate_preprocess_rio

The RiO preprocessor creates a unique integer field named crate_pk in all tables (copying the existing integer PK, creating one from an existing non-integer primary key, or adding a new one using SQL Server’s INT IDENTITY(1, 1) type. For all patient tables, it makes the patient ID (RiO number) into an integer, called crate_rio_number. It then adds indexes and views. All of these can be removed again, or updated incrementally if you add new data.

The views ‘denormalize’ the data for convenience, since it can be pretty hard to follow the key chain of fully normalized tables. The views conform mostly to the names used by the Servelec RiO CRIS Extraction Program (RCEP), with added consistency. Because user lookups are common, to save typing (and in some cases keep the field length below the 64-character column name limit of MySQL), the following abbreviations are used:

_Resp_Clinician_

… Responsible Clinician

Options:

USAGE: crate_preprocess_rio [-h] --url URL [-v] [--print] [--echo] [--rcep]
                            [--drop_danger_drop] [--cpft] [--debug_skiptables]
                            [--prognotes_current_only | --prognotes_all]
                            [--clindocs_current_only | --clindocs_all]
                            [--allergies_current_only | --allergies_all]
                            [--audit_info | --no_audit_info]
                            [--postcodedb POSTCODEDB]
                            [--geogcols [GEOGCOLS ...]]
                            [--settings_filename SETTINGS_FILENAME]

*   Alters a RiO database to be suitable for CRATE.

*   By default, this treats the source database as being a copy of a RiO
    database (slightly later than version 6.2; exact version unclear).
    Use the "--rcep" (+/- "--cpft") switch(es) to treat it as a
    Servelec RiO CRIS Extract Program (RCEP) v2 output database.


OPTIONS:
  -h, --help            show this help message and exit
  --url URL             SQLAlchemy database URL
  -v, --verbose         Verbose
  --print               Print SQL but do not execute it. (You can redirect the
                        printed output to create an SQL script.)
  --echo                Echo SQL
  --rcep                Treat the source database as the product of Servelec's
                        RiO CRIS Extract Program v2 (instead of raw RiO)
  --drop_danger_drop    REMOVES new columns and indexes, rather than creating
                        them. (There's not very much danger; no real
                        information is lost, but it might take a while to
                        recalculate it.)
  --cpft                Apply hacks for Cambridgeshire & Peterborough NHS
                        Foundation Trust (CPFT) RCEP database. Only applicable
                        with --rcep
  --debug_skiptables    DEBUG-ONLY OPTION. Skip tables (view creation only)
  --prognotes_current_only
                        Progress_Notes view restricted to current versions
                        only (* default)
  --prognotes_all       Progress_Notes view shows old versions too
  --clindocs_current_only
                        Clinical_Documents view restricted to current versions
                        only (*)
  --clindocs_all        Clinical_Documents view shows old versions too
  --allergies_current_only
                        Client_Allergies view restricted to current info only
  --allergies_all       Client_Allergies view shows deleted allergies too (*)
  --audit_info          Audit information (creation/update times) added to
                        views
  --no_audit_info       No audit information added (*)
  --postcodedb POSTCODEDB
                        Specify database (schema) name for ONS Postcode
                        Database (as imported by CRATE) to link to addresses
                        as a view. With SQL Server, you will have to specify
                        the schema as well as the database; e.g. "--postcodedb
                        ONS_PD.dbo"
  --geogcols [GEOGCOLS ...]
                        List of geographical information columns to link in
                        from ONS Postcode Database. BEWARE that you do not
                        specify anything too identifying. Default: bua11
                        buasd11 casward imd lea lsoa01 lsoa11 msoa01 msoa11
                        nuts oac01 oac11 parish pcon pct ru11ind statsward
                        ur01ind
  --settings_filename SETTINGS_FILENAME
                        Specify filename to write draft ddgen_* settings to,
                        for use in a CRATE anonymiser configuration file.

5.3. crate_preprocess_systmone

Preprocess SystmOne data, by indexing it. (It shouldn’t need further reshaping.)

Options:

USAGE: crate_preprocess_systmone [-h] --url URL [-v] [--print] [--echo]
                                 [--systmone_context {tpp_sre,cpft_dw}]
                                 [--systmone_allow_unprefixed_tables]
                                 [--postcodedb POSTCODEDB]
                                 [--geogcols [GEOGCOLS ...]]
                                 [--drop_danger_drop]

Indexes a SystmOne database to be suitable for CRATE.

OPTIONS:
  -h, --help            show this help message and exit
  --url URL             SQLAlchemy database URL
  -v, --verbose         Verbose
  --print               Print SQL but do not execute it. (You can redirect the
                        printed output to create an SQL script.)
  --echo                Echo SQL
  --systmone_context {tpp_sre,cpft_dw}
                        Context of the SystmOne database that you are reading.
  --systmone_allow_unprefixed_tables
                        Permit tables that don't start with the expected
                        prefix (which is e.g. 'SR' for the TPP SRE context,
                        'S1_' for the CPFT Data Warehouse context). May add
                        helpful content, but you may get odd tables and views.
  --postcodedb POSTCODEDB
                        Specify database (schema) name for ONS Postcode
                        Database (as imported by CRATE) to link to addresses
                        as a view. With SQL Server, you will have to specify
                        the schema as well as the database; e.g. "--postcodedb
                        ONS_PD.dbo"
  --geogcols [GEOGCOLS ...]
                        List of geographical information columns to link in
                        from ONS Postcode Database. BEWARE that you do not
                        specify anything too identifying. Default: bua11
                        buasd11 casward imd lea lsoa01 lsoa11 msoa01 msoa11
                        nuts oac01 oac11 parish pcon pct ru11ind statsward
                        ur01ind
  --drop_danger_drop    REMOVES new columns and indexes, rather than creating
                        them. (There's not very much danger; no real
                        information is lost, but it might take a while to
                        recalculate it.)

5.4. crate_postcodes

Options:

USAGE: crate_postcodes [-h] [--dir DIR] [--url URL] [--echo]
                       [--reportevery REPORTEVERY] [--commitevery COMMITEVERY]
                       [--startswith STARTSWITH [STARTSWITH ...]] [--replace]
                       [--skiplookup]
                       [--specific_lookup_tables [SPECIFIC_LOOKUP_TABLES ...]]
                       [--list_lookup_tables] [--skippostcodes] [--docsonly]
                       [-v]

-   This program reads data from the UK Office of National Statistics Postcode
    Database (ONSPD) and inserts it into a database.

-   You will need to download the ONSPD from
        https://geoportal.statistics.gov.uk/geoportal/catalog/content/filelist.page
    e.g. ONSPD_MAY_2016_csv.zip (79 Mb), and unzip it (>1.4 Gb) to a directory.
    Tell this program which directory you used.

-   Specify your database as an SQLAlchemy connection URL: see
        https://docs.sqlalchemy.org/en/latest/core/engines.html
    The general format is:
        dialect[+driver]://username:password@host[:port]/database[?key=value...]

-   If you get an error like:
        UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in
        position 33: ordinal not in range(256)
    then try appending "?charset=utf8" to the connection URL.

-   ONS POSTCODE DATABASE LICENSE.
    Output using this program must add the following attribution statements:

    Contains OS data © Crown copyright and database right
    Contains Royal Mail data © Royal Mail copyright and database right
    Contains National Statistics data © Crown copyright and database right

    See https://www.ons.gov.uk/methodology/geography/licences


OPTIONS:
  -h, --help            show this help message and exit
  --dir DIR             Root directory of unzipped ONSPD download (default:
                        /path/to/unzipped/ONSPD/download)
  --url URL             SQLAlchemy database URL (default: None)
  --echo                Echo SQL (default: False)
  --reportevery REPORTEVERY
                        Report every n rows (default: 1000)
  --commitevery COMMITEVERY
                        Commit every n rows. If you make this too large
                        (relative e.g. to your MySQL max_allowed_packet
                        setting, you may get crashes with errors like 'MySQL
                        has gone away'. (default: 10000)
  --startswith STARTSWITH [STARTSWITH ...]
                        Restrict to postcodes that start with one of these
                        strings (default: None)
  --replace             Replace tables even if they exist (default: skip
                        existing tables) (default: False)
  --skiplookup          Skip generation of code lookup tables (default: False)
  --specific_lookup_tables [SPECIFIC_LOOKUP_TABLES ...]
                        Within the lookup tables, process only specific named
                        tables (default: None)
  --list_lookup_tables  List all possible lookup tables, then stop (default:
                        False)
  --skippostcodes       Skip generation of main (large) postcode table
                        (default: False)
  --docsonly            Show help for postcode table then stop (default:
                        False)
  -v, --verbose         Verbose (default: False)

5.5. crate_fetch_wordlists

This tool assists in fetching common word lists, such as name lists for global denial, and words to exclude from such lists (such as English words or medical eponyms). It also provides an exclusion filter system, to find lines in some files that are absent from others.

The purpose of creating large name lists is usually to remove more names. However, it’s likely that you want to remove medical eponyms, like Parkinson (for Parkinson’s disease). CRATE has a hand-curated list of these. (If a patient is named Parkinson, though, and CRATE is told to remove that as a patient-specific identifier, that name will be removed from phrases like “Parkinson’s disease”, which may itself be a potential identifying clue, however, e.g. “[XXX]’s disease”, but 100% reliable text de-identification is impossible.)

The overlap between names and English words is really tricky.

If you use all names from this set and exclude all valid English words (e.g. from a “valid answers in Scrabble or crosswords” list), you will remove from the namelist – and thus NOT remove from text being nonspecifically scrubbed – names such as John (john is a noun) and Veronica (veronica is also a noun).
If you keep all names in the exclusion namelist, though, you will scrub words like excellent, fought, friend, games, he, hope, husband, joyful, kitten, knuckle, libel, limp, lovely, man, memory, mood, music, no, power, powers, sad, stress, true, yes, you, young, zone (to list but a few); these are all names.
A compromise may be to start with all names, remove medical eponyms, and remove common English words. CRATE provides tools to count words in a subset of the Project Gutenberg corpus. For example, removing English words that account for the top 99% of this corpus (and are also valid Scrabble clues) does this. The process is shown in the specimen usage below.

Options:

USAGE: crate_fetch_wordlists [-h] [--verbose]
                             [--min_word_length MIN_WORD_LENGTH]
                             [--show_rejects]
                             [--debug_names [DEBUG_NAMES ...]]
                             [--english_words]
                             [--english_words_output ENGLISH_WORDS_OUTPUT]
                             [--english_words_url ENGLISH_WORDS_URL]
                             [--valid_word_regex VALID_WORD_REGEX]
                             [--gutenberg_word_freq]
                             [--gutenberg_word_freq_output GUTENBERG_WORD_FREQ_OUTPUT]
                             [--gutenberg_id_first GUTENBERG_ID_FIRST]
                             [--gutenberg_id_last GUTENBERG_ID_LAST]
                             [--filter_words_by_freq]
                             [--wordfreqfilter_input WORDFREQFILTER_INPUT]
                             [--wordfreqfilter_output WORDFREQFILTER_OUTPUT]
                             [--wordfreqfilter_min_cum_freq WORDFREQFILTER_MIN_CUM_FREQ]
                             [--wordfreqfilter_max_cum_freq WORDFREQFILTER_MAX_CUM_FREQ]
                             [--us_forenames]
                             [--us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT]
                             [--us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT]
                             [--us_forenames_url US_FORENAMES_URL]
                             [--us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT]
                             [--us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT]
                             [--us_forenames_output US_FORENAMES_OUTPUT]
                             [--us_surnames]
                             [--us_surnames_output US_SURNAMES_OUTPUT]
                             [--us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT]
                             [--us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL]
                             [--us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL]
                             [--us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT]
                             [--us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT]
                             [--eponyms] [--eponyms_output EPONYMS_OUTPUT]
                             [--eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]]
                             [--filter_input [FILTER_INPUT ...]]
                             [--filter_include [FILTER_INCLUDE ...]]
                             [--filter_exclude [FILTER_EXCLUDE ...]]
                             [--filter_output [FILTER_OUTPUT]]

OPTIONS:
  -h, --help            show this help message and exit
  --verbose, -v         Be verbose (default: False)
  --min_word_length MIN_WORD_LENGTH
                        Minimum word (or name) length to allow (default: 2)
  --show_rejects        Print to stdout (and, in verbose mode, log) the words
                        being rejected (default: False)
  --debug_names [DEBUG_NAMES ...]
                        Show extra detail about these names (e.g. to work out
                        your preferred frequency thresholds) (default: None)

ENGLISH WORDS:
  --english_words       Fetch English words (e.g. to remove from the
                        nonspecific denylist, not to add to an allowlist;
                        consider words like smith) (default: False)
  --english_words_output ENGLISH_WORDS_OUTPUT
                        Output file for English words (default: None)
  --english_words_url ENGLISH_WORDS_URL
                        URL for a textfile containing all English words (will
                        then be filtered) (default:
                        https://www.gutenberg.org/files/3201/files/CROSSWD.TXT
                        )
  --valid_word_regex VALID_WORD_REGEX
                        Regular expression to determine valid English words
                        (default: ^[a-z](?:[A-Za-z'-]*[a-z])*$)
  --gutenberg_word_freq
                        Fetch words from Project Gutenberg with frequencies
                        (default: False)
  --gutenberg_word_freq_output GUTENBERG_WORD_FREQ_OUTPUT
                        Output file for English words with frequencies. CSV
                        file with columns: word, word_freq, cum_freq.
                        (default: None)
  --gutenberg_id_first GUTENBERG_ID_FIRST
                        For word counting: first Project Gutenberg book ID
                        (default: 100)
  --gutenberg_id_last GUTENBERG_ID_LAST
                        For word counting: last Project Gutenberg book ID
                        (default: 110)

FILTER ENGLISH WORDS BY FREQUENCY:
  --filter_words_by_freq
                        Read a CSV file from --gutenberg_word_freq, filter it
                        by cumulative word frequency, and write a plain list
                        of words. (default: False)
  --wordfreqfilter_input WORDFREQFILTER_INPUT
                        Input filename. Usually the output of
                        --gutenberg_word_freq_output. (default: None)
  --wordfreqfilter_output WORDFREQFILTER_OUTPUT
                        Output filename. Plain text file. (default: None)
  --wordfreqfilter_min_cum_freq WORDFREQFILTER_MIN_CUM_FREQ
                        Minimum cumulative frequency. (Set to >0 to exclude
                        common words.) (default: 0.0)
  --wordfreqfilter_max_cum_freq WORDFREQFILTER_MAX_CUM_FREQ
                        Maximum cumulative frequency. (Set to <1 to exclude
                        rare words.) (default: 1.0)

US FORENAMES:
  --us_forenames        Fetch US forenames (for denylist) (default: False)
  --us_forenames_freq_output US_FORENAMES_FREQ_OUTPUT
                        Output CSV file for US forename with frequencies
                        (columns are: name, frequency) (default: None)
  --us_forenames_sex_freq_output US_FORENAMES_SEX_FREQ_OUTPUT
                        Output CSV file for US forename with sex and
                        frequencies (columns are: name, gender, frequency)
                        (default: None)
  --us_forenames_url US_FORENAMES_URL
                        URL to Zip file of US Census-derived forenames lists
                        (excludes names with national frequency <5; see
                        https://www.ssa.gov/OACT/babynames/limits.html)
                        (default:
                        https://www.ssa.gov/OACT/babynames/names.zip)
  --us_forenames_min_cumfreq_pct US_FORENAMES_MIN_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage, up to and including this name, was at
                        least this value. Range is 0-100. Use 0 for no limit.
                        Setting this above 0 excludes COMMON names. (This is a
                        trade-off between being comprehensive and operating at
                        a reasonable speed. Lower numbers are more
                        comprehensive but slower.) (default: 0)
  --us_forenames_max_cumfreq_pct US_FORENAMES_MAX_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage, up to and including this name, was less
                        than or equal to this value. Range is 0-100. Use 100
                        for no limit. Setting this below 100 excludes RARE
                        names. (This is a trade-off between being
                        comprehensive and operating at a reasonable speed.
                        Higher numbers are more comprehensive but slower.)
                        (default: 100)
  --us_forenames_output US_FORENAMES_OUTPUT
                        Output file for US forenames (default: None)

US SURNAMES:
  --us_surnames         Fetch US surnames (for denylist) (default: False)
  --us_surnames_output US_SURNAMES_OUTPUT
                        Output text file for US surnames (default: None)
  --us_surnames_freq_output US_SURNAMES_FREQ_OUTPUT
                        Output CSV file for US surnames with frequencies
                        (columns are: name, frequency) (default: None)
  --us_surnames_1990_census_url US_SURNAMES_1990_CENSUS_URL
                        URL for textfile of US 1990 Census surnames (default:
                        http://www2.census.gov/topics/genealogy/1990surnames/d
                        ist.all.last)
  --us_surnames_2010_census_url US_SURNAMES_2010_CENSUS_URL
                        URL for zip of US 2010 Census surnames (default:
                        https://www2.census.gov/topics/genealogy/2010surnames/
                        names.zip)
  --us_surnames_min_cumfreq_pct US_SURNAMES_MIN_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage, up to and including this name, was at
                        least this value. Range is 0-100. Use 0 for no limit.
                        Setting this above 0 excludes COMMON names. (This is a
                        trade-off between being comprehensive and operating at
                        a reasonable speed. Lower numbers are more
                        comprehensive but slower.) (default: 0)
  --us_surnames_max_cumfreq_pct US_SURNAMES_MAX_CUMFREQ_PCT
                        Fetch only names where the cumulative frequency
                        percentage, up to and including this name, was less
                        than or equal to this value. Range is 0-100. Use 100
                        for no limit. Setting this below 100 excludes RARE
                        names. (This is a trade-off between being
                        comprehensive and operating at a reasonable speed.
                        Higher numbers are more comprehensive but slower.)
                        (default: 100)

MEDICAL EPONYMS:
  --eponyms             Write medical eponyms (to remove from denylist)
                        (default: False)
  --eponyms_output EPONYMS_OUTPUT
                        Output file for medical eponyms (default:
                        medical_eponyms.txt)
  --eponyms_add_unaccented_versions [EPONYMS_ADD_UNACCENTED_VERSIONS]
                        Add unaccented versions (e.g. Sjogren as well as
                        Sjögren) (default: True)

FILTER FUNCTIONS:
  Extra functions to filter wordlists.Specify an input file, optional
  exclusion and/or inclusion file(s), and an output file. You can use '-'
  for the output file to mean 'stdout', and for one input file to mean
  'stdin'. No filenames (other than '-' for input and output) may overlap.
  The --min_line_length option also applies. Duplicates are not removed.

  --filter_input [FILTER_INPUT ...]
                        Input file(s). Words will be drawn from these files.
                        (default: None)
  --filter_include [FILTER_INCLUDE ...]
                        Inclusion file(s). If any inclusion files are
                        specified, words from the input must be present in at
                        least one inclusion file to pass. (default: None)
  --filter_exclude [FILTER_EXCLUDE ...]
                        Exclusion file(s). Any words present in the exclusion
                        files do not pass. (default: None)
  --filter_output [FILTER_OUTPUT]
                        Output file. Words are written here. (default: None)

Specimen usage:

#!/bin/bash
# Generating name files under Linux, for nonspecific name removal.

# -----------------------------------------------------------------------------
# 1. Fetch our source data.
# -----------------------------------------------------------------------------
# Downloading these and then using a file:// URL is unnecessary, but it makes
# the processing steps faster if we need to retry with new settings.
wget https://www.gutenberg.org/files/3201/files/CROSSWD.TXT -O dictionary.txt
wget https://www.ssa.gov/OACT/babynames/names.zip -O forenames.zip
wget http://www2.census.gov/topics/genealogy/1990surnames/dist.all.last -O surnames_1990.txt
wget https://www2.census.gov/topics/genealogy/2010surnames/names.zip -O surnames_2010.zip

# -----------------------------------------------------------------------------
# 2. Create our wordlists.
# -----------------------------------------------------------------------------
# Fetch forenames, surnames, medical eponyms, and words valid for Scrabble.
# Via --debug_names, we'll illustrate frequencies for some names that are also
# words.
crate_fetch_wordlists \
    --english_words \
        --english_words_url "file://${PWD}/dictionary.txt" \
        --english_words_output english_crossword_words.txt \
    --us_forenames \
        --us_forenames_url "file://${PWD}/forenames.zip" \
        --us_forenames_max_cumfreq_pct 100 \
        --us_forenames_output us_forenames.txt \
    --us_surnames \
        --us_surnames_1990_census_url "file://${PWD}/surnames_1990.txt" \
        --us_surnames_2010_census_url "file://${PWD}/surnames_2010.zip" \
        --us_surnames_max_cumfreq_pct 100 \
        --us_surnames_output us_surnames.txt \
    --eponyms \
        --eponyms_output medical_eponyms.txt \
    --debug_names \
        excellent fought friend games he hope husband john joyful kitten \
        knuckle libel limp lovely man memory mood music no power powers sad \
        stress true veronica yes you young zone

# Count frequencies for a few books (preserving case, filtering for words of
# length 2+ and meeting a valid pattern which includes starting with a
# lower-case letter), across 1,134,142 words:
crate_fetch_wordlists \
      --gutenberg_word_freq \
      --gutenberg_id_first 100 \
      --gutenberg_id_last 110 \
      --gutenberg_word_freq_output english_word_freq_gutenberg.csv
# 100 is Shakespeare: https://www.gutenberg.org/ebooks/100

# Filter to common words, that account together for the top 99% of usage.
crate_fetch_wordlists \
    --filter_words_by_freq \
    --wordfreqfilter_input english_word_freq_gutenberg.csv \
    --wordfreqfilter_min_cum_freq 0.0 \
    --wordfreqfilter_max_cum_freq 0.99 \
    --wordfreqfilter_output english_gutenberg_common_words.txt
# In this corpus, "john" comes in at about 0.9923.

# Create a list of common English words -- the overlap between "common words in
# Project Gutenberg books" and "valid Scrabble words".
crate_fetch_wordlists \
    --filter_input english_gutenberg_common_words.txt \
    --filter_include english_crossword_words.txt \
    --filter_output common_english_words.txt

# -----------------------------------------------------------------------------
# 3. Generate an amalgamated and filtered wordlist.
# -----------------------------------------------------------------------------
crate_fetch_wordlists \
    --filter_input us_forenames.txt us_surnames.txt \
    --filter_exclude medical_eponyms.txt common_english_words.txt \
    --filter_output filtered_names.txt
#                   ^^^^^^^^^^^^^^^^^^
# This may be useful for the CRATE anonymiser option --denylist_filenames.

# As a check, these are then names that are *rare* English words:
comm -12 \
    <(sort filtered_names.txt | tr "[:upper:]" "[:lower:]") \
    <(sort english_crossword_words.txt | tr "[:upper:]" "[:lower:]") \
    > remaining_names_that_are_english_crossword_words.txt