14.1.16. crate_anon.anonymise.fetch_wordlists
crate_anon/anonymise/fetch_wordlists.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Script to fetch wordlists from Internet sources, such as lists of forenames, surnames, and English words.
For specimen usage, see ancillary.rst, under crate_fetch_wordlists.
See:
For the Moby project (word lists):
https://www.gutenberg.org/ebooks/3201 (Moby word lists)
https://www.gutenberg.org/files/3201/3201.txt – explains the other files
and the default URLs in the command-line parameters. The “crossword” file is good; however, for frequency information it is rather sparse (it contains the top 1000 words in various contexts).
Broader corpora with frequencies include:
Google Books Ngrams, https://storage.googleapis.com/books/ngrams/books/datasetsv2.html, where “1-grams” means individual words. However, it’s large (e.g. the “A” file is 1.7 GB), it’s split by year, and it has a lot of non-word entities like “Amood_ADJ” and “→_ADJ”.
Wiktionary, e.g. https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists, but it doesn’t seem to offer formats oriented to automatic processing.
British National Corpus, http://www.natcorp.ox.ac.uk/corpus/index.xml?ID=intro (but not freely distributable).
Non-free ones, e.g. COCA, https://www.wordfrequency.info/.
A “frozen” version of the Standardized Project Gutenberg Corpus (SPGC), https://doi.org/10.5281/zenodo.2422561 and https://github.com/pgcorpus/gutenberg.
For the SPGC, notations like “PG74” refer to books (e.g. PG74 is “The Adventures of Tom Sawyer”); these are listed in the metadata file. Overall, the SPGC looks pretty good but one downside is that the SPGC software forces all words to lower case. See:
process_data – calls process_book()
src.pipeline.process_book – calls tokenize_text() via “tokenize_f”
src.tokenizer.tokenize_text – calls filter_tokens()
src.tokenizer.filter_tokens – forces everything to lower-case.
and thus the output contains e.g. “ellen”, “james”, “jamestown”, “josephine”, “mary”. Cross-referencing to our Scrabble/crossword list will remove some, but it will retain the problem that “john” (a rare-ish word but a common name) has its frequency overestimated.
For API access to Project Gutenberg:
- class crate_anon.anonymise.fetch_wordlists.NameInfo(name: str, freq_pct: float | None = None, cumfreq_pct: float | None = None)[source]
Information about a human name.
- __init__(name: str, freq_pct: float | None = None, cumfreq_pct: float | None = None) None [source]
- Parameters:
name – The name.
freq_pct – Frequency (%).
cumfreq_pct – Cumulative frequency (%) when names are ordered from most to least common; therefore, close to 0 for common names, and close to 100 for rare names.
- property freq_p: float
Frequency as a probability or proportion, range [0, 1].
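A brief sketch of the relationship between freq_pct and the freq_p property, using made-up frequency values (not real census figures):

from crate_anon.anonymise.fetch_wordlists import NameInfo

# Illustrative values only; not real census frequencies.
info = NameInfo(name="MARY", freq_pct=2.5, cumfreq_pct=2.5)

print(info.name)      # MARY
print(info.freq_pct)  # 2.5 -- frequency as a percentage
print(info.freq_p)    # 0.025 -- the same frequency as a proportion in [0, 1]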
- class crate_anon.anonymise.fetch_wordlists.UsForenameInfo(name: str, sex: str, count: str)[source]
Information about a forename in the United States of America.
- class crate_anon.anonymise.fetch_wordlists.UsSurname1990Info(name: str, freq_pct: str, cumfreq_pct: str, rank: int)[source]
Represents US surnames from the 1990 census.
- class crate_anon.anonymise.fetch_wordlists.UsSurname2010Info(name: str, rank: str, count: str, prop100k: str, cum_prop100k: str, pct_white: str, pct_black: str, pct_api: str, pct_aian: str, pct_2prace: str, pct_hispanic: str)[source]
Represents US surnames from the 2010 census.
- __init__(name: str, rank: str, count: str, prop100k: str, cum_prop100k: str, pct_white: str, pct_black: str, pct_api: str, pct_aian: str, pct_2prace: str, pct_hispanic: str) None [source]
- Parameters:
name – The name.
rank – Integer rank of frequency, in string form.
count – Frequency/count of the number of uses nationally.
prop100k – “Proportion per 100,000 population”, in string format, or a percentage times 1000.
cum_prop100k – Cumulative “proportion per 100,000 population” [1].
pct_white – “Percent Non-Hispanic White Alone” [1, 2].
pct_black – “Percent Non-Hispanic Black or African American Alone” [1, 2].
pct_api – “Percent Non-Hispanic Asian and Native Hawaiian and Other Pacific Islander Alone” [1, 2].
pct_aian – “Percent Non-Hispanic American Indian and Alaska Native Alone” [1, 2].
pct_2prace – “Percent Non-Hispanic Two or More Races” [1, 2].
pct_hispanic – “Percent Hispanic or Latino origin” [1, 2].
[1] These will be filtered through float_or_na_for_us_surnames().
[2] These mean “of people with this name, the percentage who are of race X”.
- crate_anon.anonymise.fetch_wordlists.fetch_english_words(url: str, filename: str = '', valid_word_regex_text: str = "^[a-z](?:[A-Za-z'-]*[a-z])*$", min_word_length: int = 1, show_rejects: bool = False) None [source]
Fetch English words and write them to a file.
- Parameters:
url – URL to fetch file from.
filename – Filename to write to.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
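A minimal usage sketch. The URL below is a placeholder, not the package default; the real default appears in the crate_fetch_wordlists command-line help.

from crate_anon.anonymise.fetch_wordlists import fetch_english_words

# Placeholder URL; substitute the default from the command-line interface.
WORDLIST_URL = "https://example.com/english_words.txt"

fetch_english_words(
    url=WORDLIST_URL,
    filename="english_words.txt",
    # Documented default regex: lower-case start/end; internal apostrophes
    # and hyphens allowed.
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,  # reject single-letter "words"
    show_rejects=False,
)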
- crate_anon.anonymise.fetch_wordlists.fetch_eponyms(filename: str, add_unaccented_versions: bool) None [source]
Writes medical eponyms to a file.
- Parameters:
filename – Filename to write to.
add_unaccented_versions – Add unaccented (mangled) versions of names, too? For example, do you want Sjogren as well as Sjögren?
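A minimal usage sketch (the output filename is illustrative):

from crate_anon.anonymise.fetch_wordlists import fetch_eponyms

# Writes eponyms such as Sjögren, plus unaccented variants such as Sjogren.
fetch_eponyms(
    filename="medical_eponyms.txt",
    add_unaccented_versions=True,
)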
- crate_anon.anonymise.fetch_wordlists.fetch_gutenberg_word_freq(filename: str = '', gutenberg_id_first: int = 1, gutenberg_id_last: int = 100, valid_word_regex_text: str = "^[a-z](?:[A-Za-z'-]*[a-z])*$", min_word_length: int = 1) None [source]
Fetch English word frequencies from a frozen Standardized Project Gutenberg Corpus and write them to a file. Words are selected according to min_word_length (e.g. at least 2 characters) and valid_word_regex_text (e.g. excluding words that start with an upper-case letter or contain unusual punctuation); from these, it produces a CSV file whose columns are: word, word_freq, cum_freq.
- Parameters:
filename – Filename to write to.
gutenberg_id_first – First book ID to use from Project Gutenberg.
gutenberg_id_last – Last book ID to use from Project Gutenberg.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
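A usage sketch, using the documented defaults for the ID range and regex (the filename is illustrative):

from crate_anon.anonymise.fetch_wordlists import fetch_gutenberg_word_freq

# Produces a CSV with columns: word, word_freq, cum_freq.
fetch_gutenberg_word_freq(
    filename="gutenberg_word_freq.csv",
    gutenberg_id_first=1,   # documented default
    gutenberg_id_last=100,  # documented default; 100 is Shakespeare's complete works
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,
)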
- crate_anon.anonymise.fetch_wordlists.fetch_us_forenames(url: str, filename: str = '', freq_csv_filename: str = '', freq_sex_csv_filename: str = '', min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, min_name_length: int = 1, show_rejects: bool = False, debug_names: List[str] | None = None) None [source]
Fetch US forenames and store them in a file, one per line.
- Parameters:
url – URL to fetch file from.
filename – Filename to write names to.
freq_csv_filename – Optional CSV to write “name, frequency” pairs to, one name per line.
freq_sex_csv_filename – Optional CSV to write “name, gender, frequency” rows to.
min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.
max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.
min_name_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
debug_names – Names to show extra information about (e.g. to discover the right thresholds).
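A usage sketch. The URL is a placeholder, not the package default (the real default, pointing at data with lines such as “Mary,F,7065”, is in the crate_fetch_wordlists command-line help); filenames are illustrative.

from crate_anon.anonymise.fetch_wordlists import fetch_us_forenames

US_FORENAMES_URL = "https://example.com/us_forenames_data"  # placeholder

fetch_us_forenames(
    url=US_FORENAMES_URL,
    filename="us_forenames.txt",                        # one name per line
    freq_csv_filename="us_forename_freq.csv",           # name, frequency
    freq_sex_csv_filename="us_forename_sex_freq.csv",   # name, gender, frequency
    min_cumfreq_pct=0,    # no lower limit (raise to exclude common names)
    max_cumfreq_pct=100,  # no upper limit (lower to exclude rare names)
    min_name_length=2,
)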
- crate_anon.anonymise.fetch_wordlists.fetch_us_surnames(url_1990: str, url_2010: str, filename: str = '', freq_csv_filename: str = '', min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, min_word_length: int = 1, show_rejects: bool = False, debug_names: List[str] | None = None) None [source]
Fetches US surnames from the 1990 and 2010 US Census data and writes them to a file.
- Parameters:
url_1990 – URL for the 1990 US Census data.
url_2010 – URL for the 2010 US Census data.
filename – Text filename to write names to (one name per line).
freq_csv_filename – Optional CSV to write “name, frequency” pairs to, one name per line.
min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.
max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.
min_word_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
debug_names – Names to show extra information about (e.g. to discover the right thresholds).
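A usage sketch; both URLs are placeholders for the defaults given in the crate_fetch_wordlists command-line help, and the filenames are illustrative.

from crate_anon.anonymise.fetch_wordlists import fetch_us_surnames

fetch_us_surnames(
    url_1990="https://example.com/us_surnames_1990",  # placeholder
    url_2010="https://example.com/us_surnames_2010",  # placeholder
    filename="us_surnames.txt",                # one name per line
    freq_csv_filename="us_surname_freq.csv",   # name, frequency
    min_cumfreq_pct=0,
    max_cumfreq_pct=100,
    min_word_length=2,
)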
- crate_anon.anonymise.fetch_wordlists.filter_files(input_filenames: List[str], output_filename: str, exclusion_filenames: List[str] | None = None, inclusion_filenames: List[str] | None = None, min_line_length: int = 0) None [source]
Reads lines from input files, filters them, and writes them to the output file.
- Parameters:
input_filenames – Read lines from these files.
output_filename – The output file.
exclusion_filenames – If a line is present in any of these files, it is excluded.
inclusion_filenames – If any files are specified here, lines must be present in at least one inclusion file to pass through.
min_line_length – Skip any lines that are shorter than this value.
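For example, a sketch (with illustrative filenames) that keeps names which are not also ordinary English words:

from crate_anon.anonymise.fetch_wordlists import filter_files

filter_files(
    input_filenames=["us_forenames.txt", "us_surnames.txt"],
    output_filename="names_not_english_words.txt",
    exclusion_filenames=["english_words.txt"],  # drop names that are also words
    inclusion_filenames=None,                   # no inclusion requirement
    min_line_length=2,                          # skip very short lines
)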
- crate_anon.anonymise.fetch_wordlists.filter_words_by_freq(input_filename: str, output_filename: str, min_cum_freq: float = 0.0, max_cum_freq: float = 1.0) None [source]
Reads words from our frequency file and filters them.
- Parameters:
input_filename – Input CSV file. The output of fetch_gutenberg_word_freq().
output_filename – A plain output file, sorted.
min_cum_freq – Minimum cumulative frequency. Set to >0 to exclude common words.
max_cum_freq – Maximum cumulative frequency. Set to <1 to exclude rare words.
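A usage sketch with illustrative filenames and cut-offs:

from crate_anon.anonymise.fetch_wordlists import filter_words_by_freq

# Filter the CSV produced by fetch_gutenberg_word_freq().
filter_words_by_freq(
    input_filename="gutenberg_word_freq.csv",
    output_filename="gutenberg_filtered_words.txt",
    min_cum_freq=0.0,   # raise above 0 to exclude the most common words
    max_cum_freq=0.99,  # below 1 excludes the rarest words
)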
- crate_anon.anonymise.fetch_wordlists.float_or_na_for_us_surnames(x: float | str) float | None [source]
The US surname data replaces low-frequency numbers with “(S)” for “suppressed”. Returns a float representation of the input, but converts the suppression marker to None.
- Parameters:
x – Input.
- Returns:
Float version of the input, or None.
- Raises:
ValueError –
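For example (the non-numeric case is expected to raise ValueError, per the documentation above):

from crate_anon.anonymise.fetch_wordlists import float_or_na_for_us_surnames

print(float_or_na_for_us_surnames("70.9"))  # 70.9
print(float_or_na_for_us_surnames(3.0))     # 3.0
print(float_or_na_for_us_surnames("(S)"))   # None -- suppressed value
# float_or_na_for_us_surnames("not a number")  # expected to raise ValueError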
- crate_anon.anonymise.fetch_wordlists.gen_lines_from_binary_files_with_maxfiles(files: Iterable[BinaryIO], encoding: str = 'utf8', max_files: int | None = None) Generator[str, None, None] [source]
Generates lines from binary files. Strips out newlines.
- Parameters:
files – iterable of BinaryIO file-like objects
encoding – encoding to use
max_files – maximum number of files to read
- Yields:
each line of all the files
- crate_anon.anonymise.fetch_wordlists.gen_name_from_name_info(info_iter: Iterable[NameInfo]) Generator[str, None, None] [source]
Generates names from NameInfo objects.
- Parameters:
info_iter – Iterable of NameInfo objects.
- Yields:
Names as strings.
- crate_anon.anonymise.fetch_wordlists.gen_name_info_via_min_length(info_iter: Iterable[NameInfo], min_name_length: int = 1) Generator[NameInfo, None, None] [source]
Generates NameInfo objects matching a name length criterion.
- Parameters:
info_iter – Iterable of NameInfo objects.
min_name_length – Minimum name length; all names must be at least this long.
- Yields:
NameInfo objects meeting the length criterion.
- crate_anon.anonymise.fetch_wordlists.gen_sufficiently_frequent_names(infolist: Iterable[NameInfo], min_cumfreq_pct: float = 0, max_cumfreq_pct: float = 100, show_rejects: bool = False, debug_names: List[str] | None = None) Generator[NameInfo, None, None] [source]
Generates names whose cumulative frequency lies within a chosen range.
- Parameters:
infolist – Iterable of NameInfo objects.
min_cumfreq_pct – Minimum cumulative frequency (%): 0 for no limit, or above 0 to exclude common names.
max_cumfreq_pct – Maximum cumulative frequency (%): 100 for no limit, or below 100 to exclude rare names.
show_rejects – Report rejected words to the Python debug log.
debug_names – Names to show extra information about (e.g. to discover the right thresholds).
- Yields:
NameInfo objects.
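A sketch with made-up NameInfo values, keeping only names whose cumulative frequency does not exceed 90% (i.e. dropping the rarest names):

from crate_anon.anonymise.fetch_wordlists import (
    NameInfo,
    gen_sufficiently_frequent_names,
)

# Illustrative values only.
names = [
    NameInfo("MARY", freq_pct=2.5, cumfreq_pct=2.5),
    NameInfo("RARENAME", freq_pct=0.0001, cumfreq_pct=99.99),
]

for info in gen_sufficiently_frequent_names(
    names,
    min_cumfreq_pct=0,   # no lower limit
    max_cumfreq_pct=90,  # exclude names with cumulative frequency above 90%
):
    print(info.name)  # expected output: MARY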
- crate_anon.anonymise.fetch_wordlists.gen_us_forename_info(lines: Iterable[str]) Generator[UsForenameInfo, None, None] [source]
Generate US forenames from an iterable of lines in a specific text file format, where each line looks like:
Mary,F,7065
representing name, sex, frequency (count).
- Parameters:
lines – Iterable of lines.
- Yields:
UsForenameInfo objects, one per name, with frequency information added.
- crate_anon.anonymise.fetch_wordlists.gen_us_forename_info_by_sex(lines: Iterable[str]) Generator[UsForenameInfo, None, None] [source]
Generate US forenames from an iterable of lines in a specific text file format, where each line looks like:
Mary,F,7065
representing name, sex, frequency (count).
- Parameters:
lines – Iterable of lines.
- Yields:
UsForenameInfo objects, one per name/sex combination present, with frequency information added.
- crate_anon.anonymise.fetch_wordlists.gen_us_surname_1990_info(lines: Iterable[str]) Generator[UsSurname1990Info, None, None] [source]
Process a series of lines from a text file and generate US surname information from the 1990 census data.
- Parameters:
lines – Iterable of lines, each with this format:
SMITH 1.006 1.006 1
which is: name, frequency (%), cumulative frequency (%), rank.
- Yields:
UsSurname1990Info objects.
- crate_anon.anonymise.fetch_wordlists.gen_us_surname_2010_info(rows: Iterable[Iterable[str]]) Generator[UsSurname2010Info, None, None] [source]
Process a series of rows and generate US surname information from the 2010 census data.
- Parameters:
rows – Iterable giving “row” objects, where each row is an iterable of strings.
- Yields:
UsSurname2010Info objects.
- crate_anon.anonymise.fetch_wordlists.gen_valid_words_from_words(words: Iterable[str], valid_word_regex_text: str, min_word_length: int = 1, show_rejects: bool = False) Generator[str, None, None] [source]
Generates valid words from an iterable of words.
- Parameters:
words – Source iterable of words.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
show_rejects – Report rejected words to the Python debug log.
- Yields:
Valid words.
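A sketch using the regex that appears as the documented default elsewhere in this module; words that start with a capital letter, contain digits, or are shorter than two characters are rejected:

from crate_anon.anonymise.fetch_wordlists import gen_valid_words_from_words

words = ["apple", "John", "e", "co-operate", "don't", "x-ray2"]

valid = list(gen_valid_words_from_words(
    words,
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,
    show_rejects=False,
))
print(valid)  # expected: ['apple', 'co-operate', "don't"]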
- crate_anon.anonymise.fetch_wordlists.gen_word_freq_tuples_from_words(words: Iterable[str]) Generator[Tuple[str, float], None, None] [source]
Generates valid words and their frequencies from an iterable of SPGC count lines.
- Parameters:
words – Source iterable of words
- Yields:
(word, count, word_freq, cum_freq) tuples, sorted by frequency (ascending).
- crate_anon.anonymise.fetch_wordlists.gen_words_from_gutenberg_ids(gutenberg_ids: Iterable[int], valid_word_regex_text: str, min_word_length: int = 1) Generator[str, None, None] [source]
Generates words from Project Gutenberg books. Does not alter case.
- Parameters:
gutenberg_ids – Project Gutenberg IDs; e.g. 74 is Tom Sawyer, 100 is the complete works of Shakespeare.
valid_word_regex_text – Regular expression text; every word must match this regex.
min_word_length – Minimum word length; all words must be at least this long.
- Yields:
words
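A sketch; this fetches book text from Project Gutenberg, so it needs network access. ID 74 is Tom Sawyer, as noted above.

from itertools import islice

from crate_anon.anonymise.fetch_wordlists import gen_words_from_gutenberg_ids

words = gen_words_from_gutenberg_ids(
    gutenberg_ids=[74],  # PG74: "The Adventures of Tom Sawyer"
    valid_word_regex_text=r"^[a-z](?:[A-Za-z'-]*[a-z])*$",
    min_word_length=2,
)
print(list(islice(words, 10)))  # the first ten qualifying words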