14.4.7. crate_anon.linkage.helpers
crate_anon/linkage/helpers.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Helper functions for linkage tools.
Avoid using pickle for caching; it is insecure (arbitrary code execution).
- crate_anon.linkage.helpers.age_years(dob: Date | None, when: Date | None) int | None [source]
A person’s age in years when something happened, or None if either DOB or the index date is unknown.
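The calculation can be sketched as follows. This is a minimal illustration that assumes standard-library datetime.date objects in place of the Date type in the signature; the name age_years_sketch is hypothetical and is not the CRATE implementation.

```python
from datetime import date
from typing import Optional


def age_years_sketch(dob: Optional[date], when: Optional[date]) -> Optional[int]:
    """Age in whole years at `when`, or None if either date is unknown."""
    if dob is None or when is None:
        return None
    # Subtract one year if the birthday has not yet occurred in the year of `when`.
    before_birthday = (when.month, when.day) < (dob.month, dob.day)
    return when.year - dob.year - int(before_birthday)
```

For instance, someone born on 15 June 2000 is 19 on 14 June 2020 and 20 on the following day.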
- crate_anon.linkage.helpers.dict_from_str(x: str) Dict[str, str] [source]
Reads a dictionary like {‘a’: ‘x’, ‘b’: ‘y’} from a string like “{a:x, b:y}”.
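One way such a parser might work is sketched below. The function name dict_from_str_sketch is hypothetical, and the handling of whitespace, empty input, and malformed items is an assumption rather than documented CRATE behaviour.

```python
from typing import Dict


def dict_from_str_sketch(x: str) -> Dict[str, str]:
    """Parse a string like "{a:x, b:y}" into {"a": "x", "b": "y"}."""
    text = x.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"Expected braces around dictionary string: {x!r}")
    inner = text[1:-1].strip()
    result: Dict[str, str] = {}
    if not inner:
        return result  # "{}" gives an empty dictionary
    for pair in inner.split(","):
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"Missing ':' in item: {pair!r}")
        result[key.strip()] = value.strip()
    return result
```

Keys and values containing commas or colons are not supported by this sketch.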
- crate_anon.linkage.helpers.get_first_two_char(x: str) str [source]
Returns the first two characters of a string. Having this as a function is slight overkill.
- crate_anon.linkage.helpers.get_metaphone(x: str) str [source]
Returns a string representing a metaphone of the string – specifically, the first (primary) part of a Double Metaphone.
See
The implementation is from https://pypi.org/project/Fuzzy/.
Alternatives (soundex, NYSIIS) are in fuzzy and also in jellyfish (https://jellyfish.readthedocs.io/en/latest/).
from crate_anon.tools.fuzzy_id_match import *
get_metaphone("Alice")  # ALK
get_metaphone("Alec")  # matches Alice; ALK
get_metaphone("Mary Ellen")  # MRLN
get_metaphone("D'Souza")  # TSS
get_metaphone("de Clerambault")  # TKRM; won't do accents
- crate_anon.linkage.helpers.get_postcode_sector(postcode_unit: str, prestandardized: bool = False) str [source]
Returns the postcode (area + district +) sector from a full postcode. For example, converts “AB12 3CD” to “AB12 3”.
While the format and length of the first part (area + district) varies (2-4 characters), the format of the second (sector + unit) is fixed, of the format “9AA” (3 characters); https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Formatting. So to get the sector, we chop off the last two characters.
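The "chop off the last two characters" rule can be illustrated with a short sketch. The name postcode_sector_sketch is hypothetical, and the prestandardized parameter of the real function is not modelled here.

```python
def postcode_sector_sketch(postcode_unit: str) -> str:
    """Drop the final two characters (the unit letters) of a full UK postcode."""
    # The inward part is always "9AA", so removing "AA" leaves area + district + sector.
    return postcode_unit.strip()[:-2]
```

This works whether or not the postcode contains a space, because only the end of the string is examined.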
- crate_anon.linkage.helpers.getdictprob(d: Dict[str, Any], key: str, mandatory: bool = False, default: float | None = None) float | None [source]
As for getdictval() but returns a probability and checks that it is in range. The default is non-mandatory, returning None.
- crate_anon.linkage.helpers.getdictval(d: Dict[str, Any], key: str, type_: Type, mandatory: bool = False, default: Any | None = None) Any [source]
Returns a value from a dictionary, or raises ValueError.
If mandatory is True, the key must be present, and the value must not be None or a blank string.
If mandatory is False and the key is absent, default is returned.
The value must be of type type_ (or None if permitted).
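The rules above might be implemented along the following lines. This is a sketch only: getdictval_sketch is a hypothetical name, and the treatment of a present-but-None value for a non-mandatory key (returned as None) is an assumption.

```python
from typing import Any, Dict, Optional, Type


def getdictval_sketch(
    d: Dict[str, Any],
    key: str,
    type_: Type,
    mandatory: bool = False,
    default: Optional[Any] = None,
) -> Any:
    """Fetch d[key] with presence and type checks, raising ValueError on failure."""
    if key not in d:
        if mandatory:
            raise ValueError(f"Missing key: {key!r}")
        return default
    value = d[key]
    if mandatory and (value is None or value == ""):
        raise ValueError(f"Missing or blank value for key: {key!r}")
    if value is None:
        return None  # assumption: None is permitted when the key is not mandatory
    if not isinstance(value, type_):
        raise ValueError(f"Value for {key!r} is not of type {type_.__name__}")
    return value
```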
- crate_anon.linkage.helpers.is_nfa_postcode(postcode_unit: str, prestandardized: bool = False) bool [source]
Is this the pseudopostcode meaning “no fixed abode”?
- crate_anon.linkage.helpers.is_pseudopostcode(postcode_unit: str, prestandardized: bool = False) bool [source]
Is this a pseudopostcode?
- crate_anon.linkage.helpers.is_valid_isoformat_blurred_date(x: str) bool [source]
Validates an ISO-format date (as for is_valid_isoformat_date()) that must be the first of the month.
- crate_anon.linkage.helpers.is_valid_isoformat_date(x: str) bool [source]
Validates an ISO-format date with separators, e.g. ‘2022-12-31’.
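A sketch of how the two date validators could work is shown below. The function names and the regular expression are hypothetical; the real functions may use a different parsing approach.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, with "-" separators.
ISO_DATE_RE_SKETCH = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_valid_iso_date_sketch(x: str) -> bool:
    """True for strings like '2022-12-31' that name a real calendar date."""
    m = ISO_DATE_RE_SKETCH.fullmatch(x)
    if not m:
        return False
    try:
        date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:
        return False  # e.g. 30 February
    return True


def is_valid_blurred_iso_date_sketch(x: str) -> bool:
    """As above, but the day must be the first of the month."""
    return is_valid_iso_date_sketch(x) and x.endswith("-01")
```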
- crate_anon.linkage.helpers.isoformat_date_or_none(d: Date | None) str | None [source]
Returns a date in string format, or None if it is absent.
- crate_anon.linkage.helpers.isoformat_optional_date_str(d: Date | None) str [source]
Returns a date in string format.
- crate_anon.linkage.helpers.ln(x: float) float [source]
Version of math.log() that treats log(0) as -inf, rather than crashing with ValueError: math domain error.
- Parameters:
x – parameter
- Returns:
ln(x), the natural logarithm of x
- Return type:
float
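The behaviour described can be sketched in a few lines. The name ln_sketch is hypothetical; negative inputs are left to raise ValueError as math.log() normally does.

```python
import math


def ln_sketch(x: float) -> float:
    """Natural logarithm of x, with ln(0) defined as -inf."""
    if x == 0:
        return -math.inf
    return math.log(x)  # still raises ValueError for negative x
```

Returning -inf lets zero probabilities pass through log-domain arithmetic without special-casing at every call site.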
- crate_anon.linkage.helpers.log_likelihood_ratio_from_p(p_d_given_h: float, p_d_given_not_h: float) float [source]
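Calculates the log likelihood ratio. Fast implementation.
- Parameters:
p_d_given_h – P(D | H), the probability of the data given the hypothesis
p_d_given_not_h – P(D | ¬H), the probability of the data given that the hypothesis is false
- Returns:
log likelihood ratio, ln[P(D | H) / P(D | ¬H)]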
- Return type:
float
- crate_anon.linkage.helpers.log_posterior_odds_from_pdh_pdnh(log_prior_odds: float, p_d_given_h: float, p_d_given_not_h: float) float [source]
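Calculates log posterior odds. Fast implementation.
- Parameters:
log_prior_odds – log prior odds of H, ln[P(H) / P(¬H)]
p_d_given_h – P(D | H), the probability of the data given the hypothesis
p_d_given_not_h – P(D | ¬H), the probability of the data given that the hypothesis is false
- Returns:
log posterior odds of H, ln[P(H | D) / P(¬H | D)]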
- Return type:
float
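In the log domain, Bayes' theorem becomes an addition: log posterior odds = log prior odds + log likelihood ratio. The sketch below illustrates this with hypothetical function names; it uses math.log() directly and so assumes both probabilities are strictly positive.

```python
import math


def log_likelihood_ratio_sketch(p_d_given_h: float, p_d_given_not_h: float) -> float:
    """ln[P(D | H) / P(D | not H)], computed as a difference of logs."""
    return math.log(p_d_given_h) - math.log(p_d_given_not_h)


def log_posterior_odds_sketch(
    log_prior_odds: float, p_d_given_h: float, p_d_given_not_h: float
) -> float:
    """Log posterior odds of H: log prior odds plus log likelihood ratio."""
    return log_prior_odds + log_likelihood_ratio_sketch(p_d_given_h, p_d_given_not_h)


# Even prior odds (ln 1 = 0); the data are nine times likelier under H.
example = log_posterior_odds_sketch(0.0, 0.9, 0.1)
posterior_odds = math.exp(example)  # approximately 9
```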
- crate_anon.linkage.helpers.mangle_unicode_to_ascii(s: Any) str [source]
Mangle unicode to ASCII, losing accents etc. in the process. This is a slightly different version to that in cardinal_pythonlib, because the Eszett gets a rough ride:
"Straße Clérambault".encode("ascii", "ignore") # b'Strae Clerambault'
So we add the MANGLE_PRETRANSLATE step.
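The idea of a pre-translation step can be sketched as follows. The table PRETRANSLATE_SKETCH and the function mangle_sketch are hypothetical; the contents of the real MANGLE_PRETRANSLATE table are not shown in this documentation and may differ.

```python
import unicodedata

# Hypothetical pre-translation table: expand Eszett before ASCII encoding drops it.
PRETRANSLATE_SKETCH = str.maketrans({"ß": "ss", "ẞ": "SS"})


def mangle_sketch(s: str) -> str:
    """Convert to ASCII, losing accents but keeping Eszett as 'ss'."""
    pre = s.translate(PRETRANSLATE_SKETCH)
    # NFKD splits accented letters into a base letter plus combining marks,
    # so the base letter survives when non-ASCII characters are discarded.
    decomposed = unicodedata.normalize("NFKD", pre)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```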
- crate_anon.linkage.helpers.mk_blurry_dates(d: Date | str) Tuple[str, str, str] [source]
Returns MONTH_DAY, YEAR_DAY, and YEAR_MONTH versions in a standard form.
- crate_anon.linkage.helpers.mkdir_for_filename(filename: str) None [source]
Ensures that a directory exists for the filename.
- crate_anon.linkage.helpers.mutate_name(name: str) str [source]
Introduces typos into a (standardized, capitalized, no-space-no-punctuation) name.
- crate_anon.linkage.helpers.mutate_postcode(postcode: str, cfg: MatchConfig) str [source]
Introduces typos into a UK postcode, keeping the letter/digit format.
- Parameters:
postcode – the postcode to alter
cfg – the main MatchConfig object
- crate_anon.linkage.helpers.open_even_if_zipped(filename: str) Generator[StringIO, None, None] [source]
Yields (as a context manager) a text file, opened directly or through a ZIP file (distinguished by its extension) containing that file.
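A context manager of this kind might look like the sketch below. The name open_even_if_zipped_sketch is hypothetical, and two details are assumptions: that the file of interest is the first member of the archive, and that the text is UTF-8.

```python
import io
import zipfile
from contextlib import contextmanager
from typing import Generator, TextIO


@contextmanager
def open_even_if_zipped_sketch(filename: str) -> Generator[TextIO, None, None]:
    """Yield a text stream for a plain file, or for a file inside a .zip archive."""
    if filename.lower().endswith(".zip"):
        with zipfile.ZipFile(filename) as zf:
            inner_name = zf.namelist()[0]  # assumption: first member is the one wanted
            with zf.open(inner_name) as binary:
                # Wrap the binary member stream so callers read text, not bytes.
                yield io.TextIOWrapper(binary, encoding="utf-8")
    else:
        with open(filename, "rt", encoding="utf-8") as f:
            yield f
```

Callers use it the same way in both cases: `with open_even_if_zipped_sketch(path) as f: ...`.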
- crate_anon.linkage.helpers.optional_int(value: str) int | None [source]
argparse argument type that checks that its value is an integer or the value None.
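An argparse type function of this sort could be sketched as below. The name optional_int_sketch and the --limit option are hypothetical, and case-insensitive matching of "None" is an assumption.

```python
import argparse
from typing import Optional


def optional_int_sketch(value: str) -> Optional[int]:
    """argparse type: accept an integer, or the literal text 'None'."""
    if value.strip().lower() == "none":
        return None
    try:
        return int(value)
    except ValueError:
        # ArgumentTypeError makes argparse print a clean usage error.
        raise argparse.ArgumentTypeError(
            f"{value!r} is not an integer or None"
        ) from None


parser = argparse.ArgumentParser()
parser.add_argument("--limit", type=optional_int_sketch, default=None)
```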
- crate_anon.linkage.helpers.remove_redundant_whitespace(x: str) str [source]
Strip whitespace at the edges; collapse double spaces; replace any other whitespace with a single space.
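In Python, this kind of normalization is commonly achieved by splitting on whitespace and rejoining. The sketch below uses a hypothetical function name and is not necessarily how CRATE implements it.

```python
def remove_redundant_whitespace_sketch(x: str) -> str:
    """Trim the edges and collapse every run of whitespace to one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces.
    return " ".join(x.split())
```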
- crate_anon.linkage.helpers.safe_upper(name: str) str [source]
Convert to upper case, but don’t mess up a few specific accents. Note that:
'ß'.upper() == 'SS' but 'ẞ'.upper() == 'ẞ'
… here, we will use an upper-case Eszett, and the “SS” will be dealt with through transliteration.
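The Eszett handling can be sketched as follows. The function name safe_upper_sketch is hypothetical, and only the Eszett is handled; any other accents that the real function protects are not modelled.

```python
def safe_upper_sketch(name: str) -> str:
    """Upper-case a name, mapping lower-case Eszett to upper-case Eszett."""
    # Replace first, because str.upper() would otherwise expand "ß" to "SS".
    return name.replace("ß", "ẞ").upper()
```

Keeping a single-character upper-case Eszett preserves the string length, leaving the choice of "SS" to a later transliteration step.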
- crate_anon.linkage.helpers.simplify_punctuation_whitespace(x: str) str [source]
Simplify punctuation and whitespace, e.g. curly to straight quotes, tab to space, en dash to hyphen, etc.
- crate_anon.linkage.helpers.standardize_name(name: str) str [source]
Converts a name to a standard form: upper case (will also e.g. translate Eszett to SS), no spaces, no punctuation.
This is the format used by the US surname database, e.g. ACOSTAPEREZ for (probably) Acosta Perez, and just PEREZ without e.g. PÉREZ.
We use this for our name frequency databases. For other purposes, we use a more sophisticated approach; see e.g. surname_alternative_fragments().
Examples: see unit tests.
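As a rough illustration of the standard form described above, the following sketch reduces a name to the letters A to Z. The function name is hypothetical, and the exact steps (including dropping digits) are assumptions; the unit tests remain the authoritative examples.

```python
import unicodedata


def standardize_name_sketch(name: str) -> str:
    """Upper case, accents removed, Eszett as SS, letters A-Z only."""
    expanded = name.replace("ß", "ss").replace("ẞ", "SS").upper()
    # Decompose accented letters, then discard everything that is not ASCII.
    ascii_only = (
        unicodedata.normalize("NFKD", expanded)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove spaces, punctuation, and anything else outside A-Z.
    return "".join(c for c in ascii_only if "A" <= c <= "Z")
```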
- crate_anon.linkage.helpers.standardize_perfect_id_key(k: str) str [source]
Keys are compared case-insensitive, in lower case.
- crate_anon.linkage.helpers.standardize_perfect_id_value(k: Any) str [source]
Values are forced to strings and compared case-insensitive, in upper case.
- crate_anon.linkage.helpers.standardize_postcode(postcode_unit_or_sector: str) str [source]
Standardizes postcodes to “no space” format.
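A "no space" standardization can be sketched in one line. The function name is hypothetical, and converting to upper case is an assumption that the description above does not state.

```python
def standardize_postcode_sketch(postcode_unit_or_sector: str) -> str:
    """Remove all whitespace from a postcode (and, by assumption, upper-case it)."""
    return "".join(postcode_unit_or_sector.upper().split())
```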
- crate_anon.linkage.helpers.surname_alternative_fragments(surname: str, accent_transliterations: Dict[int, str | int | None] = {196: 'AE', 214: 'OE', 220: 'UE', 7838: 'SS'}, nonspecific_name_components: Set[str] = {'AF', 'AL', 'AUF', 'AV', 'AW', 'D', 'DA', 'DAI', 'DAL', 'DALLA', 'DAS', 'DE', 'DEI', 'DEL', 'DELL', 'DELLA', 'DER', 'DES', 'DI', 'DO', 'DOS', 'DU', 'EL', 'I', 'II', 'III', 'IV', 'IX', 'JNR', 'JR', 'L', 'LA', 'LE', 'NA', 'OF', 'PHRA', 'SNR', 'SR', 'SRI', 'THOE', 'TOT', 'V', 'VAN', 'VI', 'VII', 'VIII', 'VON', 'X', 'ZU'}) List[str] [source]
Return a list of fragments that may occur as substitutes for the name (including the name itself). Those fragments include:
Parts of double-barrelled surnames.
ASCII-mangled versions of accents (e.g. Ü to U).
Transliterated versions of accents (e.g. Ü to UE).
Upper case will be used throughout.
- Parameters:
surname – The name to process. This should contain all original accents, spacing, and punctuation (i.e. should NOT have been standardized as above). Case is unimportant (we will use upper case internally).
accent_transliterations – A mapping from accents to potential transliterated versions, in the form of a Python string translation table.
nonspecific_name_components – Name fragments that should not be produced in their own right, e.g. nobiliary particles such as “van” in “van Beethoven”.
- Returns:
A list of fragments: full name first, then other fragments in alphabetical order.
- Return type:
List[str]
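The accent_transliterations argument is a translation table of the kind accepted by str.translate(), keyed by Unicode code point. The short sketch below shows how the default table from the signature behaves on upper-case input; the variable names are illustrative only, and the rest of the fragment-generation logic is not reproduced.

```python
# Default table from the signature: 196 = Ä, 214 = Ö, 220 = Ü, 7838 = ẞ.
accent_transliterations = {196: "AE", 214: "OE", 220: "UE", 7838: "SS"}

# str.translate() replaces each listed code point with its mapped string.
transliterated_1 = "MÜLLER".translate(accent_transliterations)
transliterated_2 = "STRAẞE".translate(accent_transliterations)
```

Because the keys are upper-case code points only, the table has an effect only after the surname has been converted to upper case.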