14.1.16. crate_anon.anonymise.fetch_wordlists
crate_anon/anonymise/fetch_wordlists.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Script to fetch wordlists from Internet sources, such as lists of forenames, surnames, and English words.
For specimen usage, see ancillary.rst, under crate_fetch_wordlists.
See:
For the Moby project (word lists):
https://www.gutenberg.org/ebooks/3201 (Moby word lists)
https://www.gutenberg.org/files/3201/3201.txt – explains the other files
and the default URLs in the command-line parameters. The “crossword” file is good; however, for frequency information it is rather sparse (it contains the top 1000 words in various contexts).
Broader corpora with frequencies include:
Google Books Ngrams, https://storage.googleapis.com/books/ngrams/books/datasetsv2.html, where “1-grams” means individual words. However, it’s large (e.g. the “A” file is 1.7 GB), it’s split by year, and it has a lot of non-word entities like “Amood_ADJ” and “→_ADJ”.
Wiktionary, e.g. https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists, but it doesn’t seem to offer formats oriented to automatic processing.
British National Corpus, http://www.natcorp.ox.ac.uk/corpus/index.xml?ID=intro (but not freely distributable).
Non-free ones, e.g. COCA, https://www.wordfrequency.info/.
A “frozen” version of the Standardized Project Gutenberg Corpus (SPGC), https://doi.org/10.5281/zenodo.2422561 and https://github.com/pgcorpus/gutenberg.
For the SPGC, notations like “PG74” refer to books (e.g. PG74 is “The Adventures of Tom Sawyer”); these are listed in the metadata file. Overall, the SPGC looks pretty good but one downside is that the SPGC software forces all words to lower case. See:
process_data – calls process_book()
src.pipeline.process_book – calls tokenize_text() via “tokenize_f”
src.tokenizer.tokenize_text – calls filter_tokens()
src.tokenizer.filter_tokens – forces everything to lower-case.
and thus the output contains e.g. “ellen”, “james”, “jamestown”, “josephine”, “mary”. Cross-referencing to our Scrabble/crossword list will remove some, but it will retain the problem that “john” (a rare-ish word but a common name) has its frequency overestimated.
For API access to Project Gutenberg:
- class crate_anon.anonymise.fetch_wordlists.NameInfo(name: str, freq_pct: float | None = None, cumfreq_pct: float | None = None)[source]
Information about a human name.
- __init__(name: str, freq_pct: float | None = None, cumfreq_pct: float | None = None) None [source]
- Parameters:
name – The name.
freq_pct – Frequency (%).
cumfreq_pct – Cumulative frequency (%) when names are ordered from most to least common; therefore, close to 0 for common names, and close to 100 for rare names.
- property freq_p: float
Frequency as a probability or proportion, range [0, 1].
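A brief sketch of the relationship between freq_pct and the freq_p property, using made-up frequency values (not real census figures):

from crate_anon.anonymise.fetch_wordlists import NameInfo

# Illustrative values only; not real census frequencies.
info = NameInfo(name="MARY", freq_pct=2.5, cumfreq_pct=2.5)

print(info.name)      # MARY
print(info.freq_pct)  # 2.5 -- frequency as a percentage
print(info.freq_p)    # 0.025 -- the same frequency as a proportion in [0, 1]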
- class crate_anon.anonymise.fetch_wordlists.UsForenameInfo(name: str, sex: str, count: str)[source]
Information about a forename in the United States of America.
- class crate_anon.anonymise.fetch_wordlists.UsSurname1990Info(name: str, freq_pct: str, cumfreq_pct: str, rank: int)[source]
Represents US surnames from the 1990 census.
- class crate_anon.anonymise.fetch_wordlists.UsSurname2010Info(name: str, rank: str, count: str, prop100k: str, cum_prop100k: str, pct_white: str, pct_black: str, pct_api: str, pct_aian: str, pct_2prace: str, pct_hispanic: str)[source]
Represents US surnames from the 2010 census.
- __init__(name: str, rank: str, count: str, prop100k: str, cum_prop100k: str, pct_white: str, pct_black: str, pct_api: str, pct_aian: str, pct_2prace: str, pct_hispanic: str) None [source]
- Parameters:
name – The name.
rank – Integer rank of frequency, in string form.
count – Frequency/count of the number of uses nationally.
prop100k – “Proportion per 100,000 population”, in string format, or a percentage times 1000.
cum_prop100k – Cumulative “proportion per 100,000 population” [1].
pct_white – “Percent Non-Hispanic White Alone” [1, 2].
pct_black – “Percent Non-Hispanic Black or African American Alone” [1, 2].
pct_api – “Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone” [1, 2].
pct_aian – “Percent Non-Hispanic American Indian and Alaska Native Alone” [1, 2].
pct_2prace – “Percent Non-Hispanic Two or More Races” [1, 2].
pct_hispanic – “Percent Hispanic or Latino origin” [1, 2].
[1] These will be filtered through float_or_na_for_us_surnames().
[2] These mean “of people with this name, the percentage who are of race X”.
- crate_anon.anonymise.fetch_wordlists.fetch_english_words(url: str, filename: str = '', valid_word_regex_text: str = "^[a-z](?:[A-Za-z'-]*[a-z])*$", min_word_length: int = 1, show_rejects: bool = False) None [source]
Fetch English words and write them to a file.
- Parameters:
url – URL to fetch file from.
filename – Filename to write to.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
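A minimal usage sketch. The URL below is a placeholder, not the package default; the real default appears in the crate_fetch_wordlists command-line help.

from crate_anon.anonymise.fetch_wordlists import fetch_english_words

# Placeholder URL; substitute the default from the command-line interface.
WORDLIST_URL = "https://example.com/english_words.txt"

fetch_english_words(
    url=WORDLIST_URL,
    filename="english_words.txt",
    # Documented default regex: lower-case start/end; internal apostrophes
    # and hyphens allowed.
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,  # reject single-letter "words"
    show_rejects=False,
)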
- crate_anon.anonymise.fetch_wordlists.fetch_eponyms(filename: str, add_unaccented_versions: bool) None [source]
Writes medical eponyms to a file.
- Parameters:
filename – Filename to write to.
add_unaccented_versions – Add unaccented (mangled) versions of names, too? For example, do you want Sjogren as well as Sjögren?
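A minimal usage sketch (the output filename is illustrative):

from crate_anon.anonymise.fetch_wordlists import fetch_eponyms

# Writes eponyms such as Sjögren, plus unaccented variants such as Sjogren.
fetch_eponyms(
    filename="medical_eponyms.txt",
    add_unaccented_versions=True,
)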
- crate_anon.anonymise.fetch_wordlists.fetch_gutenberg_word_freq(filename: str = '', gutenberg_id_first: int = 1, gutenberg_id_last: int = 100, valid_word_regex_text: str = "^[a-z](?:[A-Za-z'-]*[a-z])*$", min_word_length: int = 1) None [source]
Fetch English word frequencies from a frozen Standardized Project Gutenberg Corpus and write them to a file. Words are selected according to min_word_length (e.g. at least 2 characters) and valid_word_regex_text (e.g. excluding words that start with an upper-case letter or contain unusual punctuation); from these, it produces a CSV file whose columns are: word, word_freq, cum_freq.
- Parameters:
filename – Filename to write to.
gutenberg_id_first – First book ID to use from Project Gutenberg.
gutenberg_id_last – Last book ID to use from Project Gutenberg.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
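A usage sketch, using the documented defaults for the ID range and regex (the filename is illustrative):

from crate_anon.anonymise.fetch_wordlists import fetch_gutenberg_word_freq

# Produces a CSV with columns: word, word_freq, cum_freq.
fetch_gutenberg_word_freq(
    filename="gutenberg_word_freq.csv",
    gutenberg_id_first=1,   # documented default
    gutenberg_id_last=100,  # documented default; 100 is Shakespeare's complete works
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,
)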
- crate_anon.anonymise.fetch_wordlists.fetch_us_forenames(url: str, filename: str = '', freq_csv_filename: str = '', freq_sex_csv_filename: str = '', min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, min_name_length: int = 1, show_rejects: bool = False, debug_names: List[str] | None = None) None [source]
Fetch US forenames and store them in a file, one per line.
- Parameters:
url – URL to fetch file from.
filename – Filename to write names to.
freq_csv_filename – Optional CSV to write “name, frequency” pairs to, one name per line.
freq_sex_csv_filename – Optional CSV to write “name, gender, frequency” rows to.
min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.
max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.
min_name_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
debug_names – Names to show extra information about (e.g. to discover the right thresholds).
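A usage sketch. The URL is a placeholder, not the package default (the real default, pointing at data with lines such as “Mary,F,7065”, is in the crate_fetch_wordlists command-line help); filenames are illustrative.

from crate_anon.anonymise.fetch_wordlists import fetch_us_forenames

US_FORENAMES_URL = "https://example.com/us_forenames_data"  # placeholder

fetch_us_forenames(
    url=US_FORENAMES_URL,
    filename="us_forenames.txt",                        # one name per line
    freq_csv_filename="us_forename_freq.csv",           # name, frequency
    freq_sex_csv_filename="us_forename_sex_freq.csv",   # name, gender, frequency
    min_cumfreq_pct=0,    # no lower limit (raise to exclude common names)
    max_cumfreq_pct=100,  # no upper limit (lower to exclude rare names)
    min_name_length=2,
)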
- crate_anon.anonymise.fetch_wordlists.fetch_us_surnames(url_1990: str, url_2010: str, filename: str = '', freq_csv_filename: str = '', min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, min_word_length: int = 1, show_rejects: bool = False, debug_names: List[str] | None = None) None [source]
Fetches US surnames from the 1990 and 2010 US Census data and writes them to a file.
- Parameters:
url_1990 – URL for the 1990 US Census data.
url_2010 – URL for the 2010 US Census data.
filename – Text filename to write names to (one name per line).
freq_csv_filename – Optional CSV to write “name, frequency” pairs to, one name per line.
min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.
max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.
min_word_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
debug_names – Names to show extra information about (e.g. to discover the right thresholds).
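A usage sketch; both URLs are placeholders for the defaults given in the crate_fetch_wordlists command-line help, and the filenames are illustrative.

from crate_anon.anonymise.fetch_wordlists import fetch_us_surnames

fetch_us_surnames(
    url_1990="https://example.com/us_surnames_1990",  # placeholder
    url_2010="https://example.com/us_surnames_2010",  # placeholder
    filename="us_surnames.txt",                # one name per line
    freq_csv_filename="us_surname_freq.csv",   # name, frequency
    min_cumfreq_pct=0,
    max_cumfreq_pct=100,
    min_word_length=2,
)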
- crate_anon.anonymise.fetch_wordlists.filter_files(input_filenames: List[str], output_filename: str, exclusion_filenames: List[str] | None = None, inclusion_filenames: List[str] | None = None, min_line_length: int = 0) None [source]
Reads lines from input files, filters them, and writes them to the output file.
- Parameters:
input_filenames – Read lines from these files.
output_filename – The output file.
exclusion_filenames – If a line is present in any of these files, it is excluded.
inclusion_filenames – If any files are specified here, lines must be present in at least one inclusion file to pass through.
min_line_length – Skip any lines that are shorter than this value.
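For example, a sketch (with illustrative filenames) that keeps names which are not also ordinary English words:

from crate_anon.anonymise.fetch_wordlists import filter_files

filter_files(
    input_filenames=["us_forenames.txt", "us_surnames.txt"],
    output_filename="names_not_english_words.txt",
    exclusion_filenames=["english_words.txt"],  # drop names that are also words
    inclusion_filenames=None,                   # no inclusion requirement
    min_line_length=2,                          # skip very short lines
)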
- crate_anon.anonymise.fetch_wordlists.filter_words_by_freq(input_filename: str, output_filename: str, min_cum_freq: float = 0.0, max_cum_freq: float = 1.0) None [source]
Reads words from our frequency file and filters them.
- Parameters:
input_filename – Input CSV file. The output of fetch_gutenberg_word_freq().
output_filename – A plain output file, sorted.
min_cum_freq – Minimum cumulative frequency. Set to >0 to exclude common words.
max_cum_freq – Maximum cumulative frequency. Set to <1 to exclude rare words.
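A usage sketch with illustrative filenames and cut-offs:

from crate_anon.anonymise.fetch_wordlists import filter_words_by_freq

# Filter the CSV produced by fetch_gutenberg_word_freq().
filter_words_by_freq(
    input_filename="gutenberg_word_freq.csv",
    output_filename="gutenberg_filtered_words.txt",
    min_cum_freq=0.0,   # raise above 0 to exclude the most common words
    max_cum_freq=0.99,  # below 1 excludes the rarest words
)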
- crate_anon.anonymise.fetch_wordlists.float_or_na_for_us_surnames(x: float | str) float | None [source]
The US surname data replaces low-frequency numbers with “(S)” for “suppressed”. Returns a float representation of the input, but converts the suppression marker to None.
- Parameters:
x – Input.
- Returns:
Float version of the input, or None.
- Raises:
ValueError –
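For example (the non-numeric case is expected to raise ValueError, per the documentation above):

from crate_anon.anonymise.fetch_wordlists import float_or_na_for_us_surnames

print(float_or_na_for_us_surnames("70.9"))  # 70.9
print(float_or_na_for_us_surnames(3.0))     # 3.0
print(float_or_na_for_us_surnames("(S)"))   # None -- suppressed value
# float_or_na_for_us_surnames("not a number")  # expected to raise ValueError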
- crate_anon.anonymise.fetch_wordlists.gen_lines_from_binary_files_with_maxfiles(files: Iterable[BinaryIO], encoding: str = 'utf8', max_files: int | None = None) Generator[str, None, None] [source]
Generates lines from binary files. Strips out newlines.
- Parameters:
files – iterable of BinaryIO file-like objects
encoding – encoding to use
max_files – maximum number of files to read
- Yields:
each line of all the files
- crate_anon.anonymise.fetch_wordlists.gen_name_from_name_info(info_iter: Iterable[NameInfo]) Generator[str, None, None] [source]
Generates names from NameInfo objects.
- Parameters:
info_iter – Iterable of NameInfo objects.
- Yields:
Names as strings.
- crate_anon.anonymise.fetch_wordlists.gen_name_info_via_min_length(info_iter: Iterable[NameInfo], min_name_length: int = 1) Generator[NameInfo, None, None] [source]
Generates NameInfo objects matching a name length criterion.
- Parameters:
info_iter – Iterable of NameInfo objects.
min_name_length – Minimum name length; all names must be at least this long.
- Yields:
NameInfo objects meeting the length criterion.
- crate_anon.anonymise.fetch_wordlists.gen_sufficiently_frequent_names(infolist: Iterable[NameInfo], min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, show_rejects: bool = False, debug_names: List[str] | None = None) Generator[NameInfo, None, None] [source]
Generates names whose cumulative frequency lies within a chosen range.
- Parameters:
infolist – Iterable of NameInfo objects.
min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.
max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.
show_rejects – Report rejected words to the Python debug log.
debug_names – Names to show extra information about (e.g. to discover the right thresholds).
- Yields:
NameInfo objects.
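A sketch with made-up NameInfo values, keeping only names whose cumulative frequency does not exceed 90% (i.e. dropping the rarest names):

from crate_anon.anonymise.fetch_wordlists import (
    NameInfo,
    gen_sufficiently_frequent_names,
)

# Illustrative values only.
names = [
    NameInfo("MARY", freq_pct=2.5, cumfreq_pct=2.5),
    NameInfo("RARENAME", freq_pct=0.0001, cumfreq_pct=99.99),
]

for info in gen_sufficiently_frequent_names(
    names,
    min_cumfreq_pct=0,   # no lower limit
    max_cumfreq_pct=90,  # exclude names with cumulative frequency above 90%
):
    print(info.name)  # expected output: MARY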
- crate_anon.anonymise.fetch_wordlists.gen_us_forename_info(lines: Iterable[str]) Generator[UsForenameInfo, None, None] [source]
Generate US forenames from an iterable of lines in a specific text file format, where each line looks like:
Mary,F,7065
representing name, sex, frequency (count).
- Parameters:
lines – Iterable of lines.
- Yields:
UsForenameInfo objects, one per name, with frequency information added.
- crate_anon.anonymise.fetch_wordlists.gen_us_forename_info_by_sex(lines: Iterable[str]) Generator[UsForenameInfo, None, None] [source]
Generate US forenames from an iterable of lines in a specific text file format, where each line looks like:
Mary,F,7065
representing name, sex, frequency (count).
- Parameters:
lines – Iterable of lines.
- Yields:
UsForenameInfo objects, one per name/sex combination present, with frequency information added.
- crate_anon.anonymise.fetch_wordlists.gen_us_surname_1990_info(lines: Iterable[str]) Generator[UsSurname1990Info, None, None] [source]
Process a series of lines from a text file and generate US surname information from the 1990 census data.
- Parameters:
lines – Iterable of lines, each with this format:
SMITH 1.006 1.006 1
which is: name, frequency (%), cumulative frequency (%), rank.
- Yields:
UsSurname1990Info objects.
- crate_anon.anonymise.fetch_wordlists.gen_us_surname_2010_info(rows: Iterable[Iterable[str]]) Generator[UsSurname2010Info, None, None] [source]
Process a series of rows and generate US surname information from the 2010 census data.
- Parameters:
rows – Iterable giving “row” objects, where each row is an iterable of strings.
- Yields:
UsSurname2010Info objects.
- crate_anon.anonymise.fetch_wordlists.gen_valid_words_from_words(words: Iterable[str], valid_word_regex_text: str, min_word_length: int = 1, show_rejects: bool = False) Generator[str, None, None] [source]
Generates valid words from an iterable of words.
- Parameters:
words – Source iterable of words.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
- Yields:
Valid words.
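A sketch using the regex that appears as the documented default elsewhere in this module; words that start with a capital letter, contain digits, or are shorter than two characters are rejected:

from crate_anon.anonymise.fetch_wordlists import gen_valid_words_from_words

words = ["apple", "John", "e", "co-operate", "don't", "x-ray2"]

valid = list(gen_valid_words_from_words(
    words,
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,
    show_rejects=False,
))
print(valid)  # expected: ['apple', 'co-operate', "don't"]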
- crate_anon.anonymise.fetch_wordlists.gen_word_freq_tuples_from_words(words: Iterable[str]) Generator[Tuple[str, float], None, None] [source]
Generates valid words and their frequencies from an iterable of SPGC count lines.
- Parameters:
words – Source iterable of words
- Yields:
(word, count, word_freq, cum_freq) tuples, sorted by frequency (ascending).
- crate_anon.anonymise.fetch_wordlists.gen_words_from_gutenberg_ids(gutenberg_ids: Iterable[int], valid_word_regex_text: str, min_word_length: int = 1) Generator[str, None, None] [source]
Generates words from Project Gutenberg books. Does not alter case.
- Parameters:
gutenberg_ids – Project Gutenberg IDs; e.g. 74 is Tom Sawyer, 100 is the complete works of Shakespeare.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
- Yields:
words
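A sketch; this fetches book text from Project Gutenberg, so it needs network access. ID 74 is Tom Sawyer, as noted above.

from itertools import islice

from crate_anon.anonymise.fetch_wordlists import gen_words_from_gutenberg_ids

words = gen_words_from_gutenberg_ids(
    gutenberg_ids=[74],  # PG74: "The Adventures of Tom Sawyer"
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,
)
print(list(islice(words, 10)))  # the first ten qualifying words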