14.1.16. crate_anon.anonymise.fetch_wordlists

crate_anon/anonymise/fetch_wordlists.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Script to fetch wordlists from Internet sources, such as lists of forenames, surnames, and English words.

For specimen usage: see ancillary.rst, as crate_fetch_wordlists.

See:

For the Moby project (word lists):

and default URLs in command-line parameters. The “crossword” file is good. However, for frequency information this is a bit sparse (it contains the top 1000 words in various contexts).

Broader corpora with frequencies include:

For the SPGC, notations like “PG74” refer to books (e.g. PG74 is “The Adventures of Tom Sawyer”); these are listed in the metadata file. Overall, the SPGC looks pretty good but one downside is that the SPGC software forces all words to lower case. See:

  • process_data – calls process_book()

  • src.pipeline.process_book – calls tokenize_text() via “tokenize_f”

  • src.tokenizer.tokenize_text – calls filter_tokens()

  • src.tokenizer.filter_tokens – forces everything to lower-case.

and thus the output contains e.g. “ellen”, “james”, “jamestown”, “josephine”, “mary”. Cross-referencing to our Scrabble/crossword list will remove some, but it will retain the problem that “john” (a rare-ish word but a common name) has its frequency overestimated.

For API access to Project Gutenberg:

class crate_anon.anonymise.fetch_wordlists.NameInfo(name: str, freq_pct: float | None = None, cumfreq_pct: float | None = None)[source]

Information about a human name.

__init__(name: str, freq_pct: float | None = None, cumfreq_pct: float | None = None) None[source]
Parameters:
  • name – The name.

  • freq_pct – Frequency (%).

  • cumfreq_pct – Cumulative frequency (%) when names are ordered from most to least common; therefore, close to 0 for common names, and close to 100 for rare names.

assert_freq_info() None[source]

Assert that the frequencies are reasonable numbers.

property freq_p: float

Frequency as a probability or proportion, range [0, 1].

class crate_anon.anonymise.fetch_wordlists.UsForenameInfo(name: str, sex: str, count: str)[source]

Information about a forename in the United States of America.

__init__(name: str, sex: str, count: str) None[source]
Parameters:
  • name – The name.

  • sex – The sex, as "M" or "F".

  • count – A string version of an integer, giving the number of times the name appeared in a certain time period.

class crate_anon.anonymise.fetch_wordlists.UsSurname1990Info(name: str, freq_pct: str, cumfreq_pct: str, rank: int)[source]

Represents US surnames from the 1990 census.

__init__(name: str, freq_pct: str, cumfreq_pct: str, rank: int) None[source]
Parameters:
  • name – The name.

  • freq_pct – Frequency (%) in string form.

  • cumfreq_pct – Cumulative frequency (%) in string form.

  • rank – Integer rank of frequency, in string form.

class crate_anon.anonymise.fetch_wordlists.UsSurname2010Info(name: str, rank: str, count: str, prop100k: str, cum_prop100k: str, pct_white: str, pct_black: str, pct_api: str, pct_aian: str, pct_2prace: str, pct_hispanic: str)[source]

Represents US surnames from the 2010 census.

__init__(name: str, rank: str, count: str, prop100k: str, cum_prop100k: str, pct_white: str, pct_black: str, pct_api: str, pct_aian: str, pct_2prace: str, pct_hispanic: str) None[source]
Parameters:
  • name – The name.

  • rank – Integer rank of frequency, in string form.

  • count – Frequency/count of the number of uses nationally.

  • prop100k – “Proportion per 100,000 population”, in string format, or a percentage times 1000.

  • cum_prop100k – Cumulative “proportion per 100,000 population” [1].

  • pct_white – “Percent Non-Hispanic White Alone” [1, 2].

  • pct_black – “Percent Non-Hispanic Black or African American Alone” [1, 2].

  • pct_api – “Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone” [1, 2].

  • pct_aian – “Percent Non-Hispanic American Indian and Alaska Native Alone” [1, 2].

  • pct_2prace – “Percent Non-Hispanic Two or More Races” [1, 2].

  • pct_hispanic – “Percent Hispanic or Latino origin” [1, 2].

[1] These will be filtered through float_or_na_for_us_surnames().

[2] These mean “of people with this name, the percentage who are X race”.

crate_anon.anonymise.fetch_wordlists.fetch_english_words(url: str, filename: str = '', valid_word_regex_text: str = "^[a-z](?:[A-Za-z'-]*[a-z])*$", min_word_length: int = 1, show_rejects: bool = False) None[source]

Fetch English words and write them to a file.

Parameters:
  • url – URL to fetch file from.

  • filename – Filename to write to.

  • valid_word_regex_text – Regular expression text; every word must match this regex.

  • min_word_length – Minimum word length; all words must be at least this long.

  • show_rejects – Report rejected words to the Python debug log.
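
As an illustration of what the default valid_word_regex_text accepts, here is a minimal sketch. The pattern is copied from the signature above; the helper is_valid_word() and the sample words are hypothetical and are not part of CRATE.

```python
import re

# Default pattern, copied from the fetch_english_words() signature.
VALID_WORD_REGEX_TEXT = r"^[a-z](?:[A-Za-z'-]*[a-z])*$"
valid_word_regex = re.compile(VALID_WORD_REGEX_TEXT)


def is_valid_word(word: str, min_word_length: int = 1) -> bool:
    """Hypothetical helper: apply the length and regex criteria."""
    return len(word) >= min_word_length and bool(valid_word_regex.match(word))


# Sample words chosen for illustration only.
candidates = ["apple", "don't", "well-known", "Mary", "a", "rock-", "e.g."]
accepted = [w for w in candidates if is_valid_word(w, min_word_length=2)]
rejected = [w for w in candidates if w not in accepted]
```

With this pattern, a word must start and end with a lower-case letter, and may contain letters, apostrophes, and hyphens in between. Words beginning with a capital (such as “Mary”) therefore fail the pattern.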

crate_anon.anonymise.fetch_wordlists.fetch_eponyms(filename: str, add_unaccented_versions: bool) None[source]

Writes medical eponyms to a file.

Parameters:
  • filename – Filename to write to.

  • add_unaccented_versions – Add unaccented (mangled) versions of names, too? For example, do you want Sjogren as well as Sjögren?
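
One common way to derive an unaccented form is Unicode decomposition followed by removal of combining marks. The sketch below shows that approach; it is an assumption about how such a conversion might be done, not necessarily the method CRATE uses, and both function names are hypothetical.

```python
import unicodedata
from typing import Iterable, Iterator


def remove_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def with_unaccented_versions(names: Iterable[str]) -> Iterator[str]:
    """Yield each name, followed by its unaccented form if that differs."""
    for name in names:
        yield name
        plain = remove_accents(name)
        if plain != name:
            yield plain


eponyms = list(with_unaccented_versions(["Sjögren", "Parkinson"]))
```

Note that this technique only handles letters that decompose into a base letter plus a combining mark; letters such as “ø” have no such decomposition and pass through unchanged.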

crate_anon.anonymise.fetch_wordlists.fetch_gutenberg_word_freq(filename: str = '', gutenberg_id_first: int = 1, gutenberg_id_last: int = 100, valid_word_regex_text: str = "^[a-z](?:[A-Za-z'-]*[a-z])*$", min_word_length: int = 1) None[source]

Fetch English word frequencies from a frozen Standardized Project Gutenberg Corpus, and write them to a file. Within the words selected (which might be e.g. words of at least 2 characters, per min_word_length, and excluding words starting with upper-case letters or containing unusual punctuation, per valid_word_regex_text), it produces a CSV file whose columns are: word, word_freq, cum_freq.

Parameters:
  • filename – Filename to write to.

  • gutenberg_id_first – First book ID to use from Project Gutenberg.

  • gutenberg_id_last – Last book ID to use from Project Gutenberg.

  • valid_word_regex_text – Regular expression text; every word must match this regex.

  • min_word_length – Minimum word length; all words must be at least this long.

crate_anon.anonymise.fetch_wordlists.fetch_us_forenames(url: str, filename: str = '', freq_csv_filename: str = '', freq_sex_csv_filename: str = '', min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, min_name_length: int = 1, show_rejects: bool = False, debug_names: List[str] | None = None) None[source]

Fetch US forenames and store them in a file, one per line.

Parameters:
  • url – URL to fetch file from.

  • filename – Filename to write names to.

  • freq_csv_filename – Optional CSV to write “name, frequency” pairs to, one name per line.

  • freq_sex_csv_filename – Optional CSV to write “name, gender, frequency” rows to.

  • min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.

  • max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.

  • min_name_length – Minimum word length; all words must be at least this long.

  • show_rejects – Report rejected words to the Python debug log.

  • debug_names – Names to show extra information about (e.g. to discover the right thresholds).

crate_anon.anonymise.fetch_wordlists.fetch_us_surnames(url_1990: str, url_2010: str, filename: str = '', freq_csv_filename: str = '', min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, min_word_length: int = 1, show_rejects: bool = False, debug_names: List[str] | None = None) None[source]

Fetches US surnames from the 1990 and 2010 census data. Writes them to a file.

Parameters:
  • url_1990 – URL for 1990 US census data.

  • url_2010 – URL for 2010 US census data.

  • filename – Text filename to write names to (one name per line).

  • freq_csv_filename – Optional CSV to write “name, frequency” pairs to, one name per line.

  • min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.

  • max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.

  • min_word_length – Minimum word length; all words must be at least this long.

  • show_rejects – Report rejected words to the Python debug log.

  • debug_names – Names to show extra information about (e.g. to discover the right thresholds).

crate_anon.anonymise.fetch_wordlists.filter_files(input_filenames: List[str], output_filename: str, exclusion_filenames: List[str] | None = None, inclusion_filenames: List[str] | None = None, min_line_length: int = 0) None[source]

Reads lines from input files, filters them, and writes them to the output file.

Parameters:
  • input_filenames – Read lines from these files.

  • output_filename – The output file.

  • exclusion_filenames – If a line is present in any of these files, it is excluded.

  • inclusion_filenames – If any files are specified here, lines must be present in at least one inclusion file to pass through.

  • min_line_length – Skip any lines that are shorter than this value.
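
The filtering rules described by these parameters can be sketched in memory as follows. This is an illustration only: filter_lines() is a hypothetical function that works on sets of strings rather than files, and it assumes exact, case-sensitive comparison, which the text above does not specify.

```python
from typing import Iterable, Iterator, Optional, Set


def filter_lines(
    lines: Iterable[str],
    exclusions: Optional[Set[str]] = None,
    inclusions: Optional[Set[str]] = None,
    min_line_length: int = 0,
) -> Iterator[str]:
    """Hypothetical in-memory version of the filtering rules."""
    for line in lines:
        if len(line) < min_line_length:
            continue  # too short
        if exclusions and line in exclusions:
            continue  # present in an exclusion list
        if inclusions and line not in inclusions:
            continue  # inclusion lists given, but line is in none of them
        yield line


# Sample data invented for illustration.
words = ["mary", "table", "a", "john", "chair"]
kept = list(filter_lines(words, exclusions={"mary", "john"}, min_line_length=2))
kept_inclusive = list(filter_lines(words, inclusions={"table"}))
```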

crate_anon.anonymise.fetch_wordlists.filter_words_by_freq(input_filename: str, output_filename: str, min_cum_freq: float = 0.0, max_cum_freq: float = 1.0) None[source]

Reads words from our frequency file and filters them.

Parameters:
  • input_filename – Input CSV file. The output of fetch_gutenberg_word_freq().

  • output_filename – A plain output file, sorted.

  • min_cum_freq – Minimum cumulative frequency. Set to >0 to exclude common words.

  • max_cum_freq – Maximum cumulative frequency. Set to <1 to exclude rare words.
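
A minimal sketch of filtering by cumulative frequency follows. It assumes the CSV has a header row naming the columns word, word_freq, cum_freq, that frequencies are proportions in [0, 1], and that the limits are inclusive; none of these details is confirmed above. The function name and the sample data are hypothetical.

```python
import csv
import io
from typing import List, TextIO

# Invented sample data; the header row is an assumption.
CSV_TEXT = (
    "word,word_freq,cum_freq\n"
    "the,0.5,0.5\n"
    "cat,0.3,0.8\n"
    "axolotl,0.2,1.0\n"
)


def filter_by_cum_freq(
    csv_file: TextIO, min_cum_freq: float = 0.0, max_cum_freq: float = 1.0
) -> List[str]:
    """Hypothetical helper: keep words inside the cumulative frequency band."""
    reader = csv.DictReader(csv_file)
    words = [
        row["word"]
        for row in reader
        if min_cum_freq <= float(row["cum_freq"]) <= max_cum_freq
    ]
    return sorted(words)


without_rare = filter_by_cum_freq(io.StringIO(CSV_TEXT), max_cum_freq=0.9)
without_common = filter_by_cum_freq(io.StringIO(CSV_TEXT), min_cum_freq=0.6)
```

Because common words accumulate first, raising min_cum_freq removes the most common words and lowering max_cum_freq removes the rarest.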

crate_anon.anonymise.fetch_wordlists.float_or_na_for_us_surnames(x: float | str) float | None[source]

The US surname data replaces low-frequency numbers with "(S)" for suppressed. Return a float representation of our input, but convert the suppression marker to None.

Parameters:

x – Input.

Returns:

Float version of input, or None.

Raises:

ValueError
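
The behaviour described above can be sketched as below. This is an illustrative reimplementation under the stated description, not the CRATE source; the name float_or_none() is hypothetical.

```python
from typing import Optional, Union

SUPPRESSED = "(S)"  # suppression marker, as described above


def float_or_none(x: Union[float, str]) -> Optional[float]:
    """Hypothetical equivalent: map the suppression marker to None."""
    if x == SUPPRESSED:
        return None
    return float(x)  # raises ValueError for other non-numeric strings


value = float_or_none("1.5")
suppressed = float_or_none("(S)")
```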

crate_anon.anonymise.fetch_wordlists.gen_lines_from_binary_files_with_maxfiles(files: Iterable[BinaryIO], encoding: str = 'utf8', max_files: int | None = None) Generator[str, None, None][source]

Generates lines from binary files. Strips out newlines.

Parameters:
  • files – iterable of BinaryIO file-like objects

  • encoding – encoding to use

  • max_files – maximum number of files to read

Yields:

each line of all the files
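
A generator of this kind might look like the following sketch. It is an assumption about the general technique (decode each binary line, strip the line ending, stop after max_files files) and not the CRATE implementation; gen_lines() and make_files() are hypothetical names.

```python
import io
from itertools import islice
from typing import BinaryIO, Generator, Iterable, List, Optional


def gen_lines(
    files: Iterable[BinaryIO],
    encoding: str = "utf8",
    max_files: Optional[int] = None,
) -> Generator[str, None, None]:
    """Hypothetical sketch: yield decoded lines without their newlines."""
    # islice(..., None) imposes no limit on the number of files.
    for f in islice(files, max_files):
        for raw_line in f:
            yield raw_line.decode(encoding).rstrip("\r\n")


def make_files() -> List[BinaryIO]:
    """Build fresh in-memory files for the demonstration."""
    return [io.BytesIO(b"alpha\nbeta\n"), io.BytesIO(b"gamma\n")]


all_lines = list(gen_lines(make_files()))
first_file_only = list(gen_lines(make_files(), max_files=1))
```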

crate_anon.anonymise.fetch_wordlists.gen_name_from_name_info(info_iter: Iterable[NameInfo]) Generator[str, None, None][source]

Generates names from NameInfo objects.

Parameters:

info_iter – Iterable of NameInfo objects.

Yields:

Names as strings.

crate_anon.anonymise.fetch_wordlists.gen_name_info_via_min_length(info_iter: Iterable[NameInfo], min_name_length: int = 1) Generator[NameInfo, None, None][source]

Generates NameInfo objects matching a name length criterion.

Parameters:
  • info_iter – Iterable of NameInfo objects.

  • min_name_length – Minimum name length; all names must be at least this long.

Yields:

NameInfo objects.

crate_anon.anonymise.fetch_wordlists.gen_sufficiently_frequent_names(infolist: Iterable[NameInfo], min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, show_rejects: bool = False, debug_names: List[str] | None = None) Generator[NameInfo, None, None][source]

Generate names of a chosen kind of frequency.

Parameters:
  • infolist – Iterable of NameInfo objects.

  • min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.

  • max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.

  • show_rejects – Report rejected words to the Python debug log.

  • debug_names – Names to show extra information about (e.g. to discover the right thresholds).

Yields:

NameInfo objects
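
To show how cumulative frequency thresholds select names, here is a minimal sketch. The counts are invented for illustration, both functions are hypothetical, and treating the limits as inclusive is an assumption.

```python
from typing import Dict, List, Tuple


def cumulative_freq_pct(counts: Dict[str, int]) -> List[Tuple[str, float, float]]:
    """Return (name, freq_pct, cumfreq_pct) rows, most common name first."""
    total = sum(counts.values())
    rows = []
    running = 0.0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        freq_pct = 100.0 * count / total
        running += freq_pct
        rows.append((name, freq_pct, running))
    return rows


def within_band(
    rows: List[Tuple[str, float, float]],
    min_cumfreq_pct: float = 0.0,
    max_cumfreq_pct: float = 100.0,
) -> List[str]:
    """Keep names whose cumulative frequency lies inside the band."""
    return [
        name
        for name, _, cumfreq_pct in rows
        if min_cumfreq_pct <= cumfreq_pct <= max_cumfreq_pct
    ]


# Invented counts, for illustration only.
rows = cumulative_freq_pct({"Mary": 50, "Ellen": 30, "Zebedee": 20})
common_only = within_band(rows, max_cumfreq_pct=90)
rare_only = within_band(rows, min_cumfreq_pct=60)
```

Here the cumulative frequencies are 50, 80, and 100, so an upper limit of 90 drops the rarest name and a lower limit of 60 drops the most common one.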

crate_anon.anonymise.fetch_wordlists.gen_us_forename_info(lines: Iterable[str]) Generator[UsForenameInfo, None, None][source]

Generate US forenames from an iterable of lines in a specific textfile format, where each line looks like:

Mary,F,7065

representing name, sex, frequency (count).

Parameters:

lines – Iterable of lines.

Yields:

UsForenameInfo objects, one per name, with frequency information added.

crate_anon.anonymise.fetch_wordlists.gen_us_forename_info_by_sex(lines: Iterable[str]) Generator[UsForenameInfo, None, None][source]

Generate US forenames from an iterable of lines in a specific textfile format, where each line looks like:

Mary,F,7065

representing name, sex, frequency (count).

Parameters:

lines – Iterable of lines.

Yields:

UsForenameInfo objects, one per name/sex combination present, with frequency information added.
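
The difference between the per-name and the per-name/sex generators can be sketched by aggregating counts in two ways. Only the line “Mary,F,7065” comes from the text above; the other sample lines are invented, and the aggregation shown is an assumption about the general approach rather than the CRATE code.

```python
from collections import Counter

# First line is from the format example; the rest are invented.
lines = ["Mary,F,7065", "Anna,F,2604", "John,M,9655", "Mary,M,27"]

count_by_name = Counter()      # one entry per name
count_by_name_sex = Counter()  # one entry per (name, sex) combination

for line in lines:
    name, sex, count = line.strip().split(",")
    count_by_name[name] += int(count)
    count_by_name_sex[(name, sex)] += int(count)

total = sum(count_by_name.values())
mary_freq_pct = 100.0 * count_by_name["Mary"] / total
```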

crate_anon.anonymise.fetch_wordlists.gen_us_surname_1990_info(lines: Iterable[str]) Generator[UsSurname1990Info, None, None][source]

Process a series of lines from a textfile and generate US surname information from the 1990 census data.

Parameters:

lines

Iterable of lines, with this format:

# Format is e.g.
SMITH          1.006  1.006      1
# which is:
# name, frequency (%), cumulative frequency (%), rank

Yields:

UsSurname1990Info objects
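
Since the fields in this format are separated by runs of spaces, a line can be split on whitespace. The sketch below parses the example line shown above; the variable names are illustrative and the approach is an assumption, not the CRATE code.

```python
# Example line from the format description above.
line = "SMITH          1.006  1.006      1"

# str.split() with no arguments splits on any run of whitespace.
name, freq_pct_str, cumfreq_pct_str, rank_str = line.split()

freq_pct = float(freq_pct_str)
cumfreq_pct = float(cumfreq_pct_str)
rank = int(rank_str)
```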

crate_anon.anonymise.fetch_wordlists.gen_us_surname_2010_info(rows: Iterable[Iterable[str]]) Generator[UsSurname2010Info, None, None][source]

Process a series of rows and generate US surname information from the 2010 census data.

Parameters:

rows – Iterable giving “row” objects, where each row is an iterable of strings.

Yields:

UsSurname2010Info objects

crate_anon.anonymise.fetch_wordlists.gen_valid_words_from_words(words: Iterable[str], valid_word_regex_text: str, min_word_length: int = 1, show_rejects: bool = False) Generator[str, None, None][source]

Generates valid words from an iterable of words.

Parameters:
  • words – Source iterable of words.

  • valid_word_regex_text – Regular expression text; every word must match this regex.

  • min_word_length – Minimum word length; all words must be at least this long.

  • show_rejects – Report rejected words to the Python debug log.

Yields:

Valid words.

crate_anon.anonymise.fetch_wordlists.gen_word_freq_tuples_from_words(words: Iterable[str]) Generator[Tuple[str, float], None, None][source]

Generates valid words and their frequencies from an iterable of SPGC count lines.

Parameters:

words – Source iterable of words

Yields:

(word, count, word_freq, cum_freq) tuples, sorted by frequency (ascending).

crate_anon.anonymise.fetch_wordlists.gen_words_from_gutenberg_ids(gutenberg_ids: Iterable[int], valid_word_regex_text: str, min_word_length: int = 1) Generator[str, None, None][source]

Generates words from Project Gutenberg books. Does not alter case.

Parameters:
  • gutenberg_ids – Project Gutenberg IDs; e.g. 74 is Tom Sawyer, 100 is the complete works of Shakespeare.

  • valid_word_regex_text – Regular expression text; every word must match this regex.

  • min_word_length – Minimum word length; all words must be at least this long.

Yields:

words

crate_anon.anonymise.fetch_wordlists.main() None[source]

Command-line processor. See command-line help.

crate_anon.anonymise.fetch_wordlists.write_words_to_file(filename: str, words: Iterable[str]) None[source]

Write all the words to a file, one per line.

Parameters:
  • filename – Filename to open (or '-' for stdout).

  • words – Iterable of words.
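
A function with this contract might be sketched as follows. The name write_words() is hypothetical, and the UTF-8 encoding is an assumption; the demonstration writes to a temporary file and reads it back.

```python
import os
import sys
import tempfile
from typing import Iterable


def write_words(filename: str, words: Iterable[str]) -> None:
    """Hypothetical sketch: one word per line; '-' means stdout."""
    if filename == "-":
        for word in words:
            sys.stdout.write(word + "\n")
        return
    with open(filename, "w", encoding="utf-8") as f:  # encoding is an assumption
        for word in words:
            f.write(word + "\n")


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "words.txt")
    write_words(path, ["apple", "banana"])
    with open(path, encoding="utf-8") as f:
        contents = f.read()
```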