14.4.7. crate_anon.linkage.helpers

crate_anon/linkage/helpers.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Helper functions for linkage tools.

Avoid using pickle for caching; it is insecure (arbitrary code execution).

crate_anon.linkage.helpers.age_years(dob: Optional[pendulum.date.Date], when: Optional[pendulum.date.Date]) → Optional[int]

A person’s age in years when something happened, or None if either DOB or the index date is unknown.
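As an illustration only, the calculation might be sketched as below. The function name is hypothetical, and the sketch uses the standard library's datetime.date rather than pendulum; it is not CRATE's implementation.

```python
from datetime import date
from typing import Optional


def age_years_sketch(dob: Optional[date], when: Optional[date]) -> Optional[int]:
    """Age in whole years at `when`, or None if either date is unknown."""
    if dob is None or when is None:
        return None
    # Subtract one year if the birthday has not yet occurred in the year of `when`.
    had_birthday = (when.month, when.day) >= (dob.month, dob.day)
    return when.year - dob.year - (0 if had_birthday else 1)
```

For example, someone born on 2000-06-15 is 19 on 2020-06-14 and 20 on 2020-06-15.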

crate_anon.linkage.helpers.dict_from_str(x: str) → Dict[str, str]

Reads a dictionary like {‘a’: ‘x’, ‘b’: ‘y’} from a string like “{a:x, b:y}”.
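A minimal sketch of such a parser follows. It assumes no quoting, nesting, or embedded commas/colons, and the function name is hypothetical; the real function may be stricter or more lenient.

```python
from typing import Dict


def dict_from_str_sketch(x: str) -> Dict[str, str]:
    """Parse '{a:x, b:y}' into {'a': 'x', 'b': 'y'} (no quoting or nesting)."""
    x = x.strip()
    if not (x.startswith("{") and x.endswith("}")):
        raise ValueError(f"Expected braces around: {x!r}")
    result: Dict[str, str] = {}
    for pair in x[1:-1].split(","):
        if not pair.strip():
            continue  # tolerate '{}' and trailing commas
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"Missing ':' in {pair!r}")
        result[key.strip()] = value.strip()
    return result
```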

crate_anon.linkage.helpers.get_first_two_char(x: str) → str

Returns the first two characters of a string. Having this as a function is slight overkill.

crate_anon.linkage.helpers.get_metaphone(x: str) → str

Returns a string representing a metaphone of the string – specifically, the first (primary) part of a Double Metaphone.

See

The implementation is from https://pypi.org/project/Fuzzy/.

Alternatives (soundex, NYSIIS) are in fuzzy and also in jellyfish (https://jellyfish.readthedocs.io/en/latest/).

from crate_anon.linkage.helpers import get_metaphone
get_metaphone("Alice")  # ALK
get_metaphone("Alec")  # matches Alice; ALK
get_metaphone("Mary Ellen")  # MRLN
get_metaphone("D'Souza")  # TSS
get_metaphone("de Clerambault")  # TKRM; won't do accents

crate_anon.linkage.helpers.get_postcode_sector(postcode_unit: str, prestandardized: bool = False) → str

Returns the postcode (area + district +) sector from a full postcode. For example, converts “AB12 3CD” to “AB12 3”.

While the format and length of the first part (area + district) varies (2-4 characters), the format of the second (sector + unit) is fixed, of the format “9AA” (3 characters); https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Formatting. So to get the sector, we chop off the last two characters.
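The "chop off the last two characters" rule can be sketched as follows. The function name is hypothetical, and the sketch skips any standardization step that the real function may apply first.

```python
def postcode_sector_sketch(postcode_unit: str) -> str:
    """Drop the two-letter unit suffix from a full postcode."""
    return postcode_unit.strip()[:-2]
```

This works with or without the internal space, because the unit suffix is always the final two characters.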

crate_anon.linkage.helpers.getdictprob(d: Dict[str, Any], key: str, mandatory: bool = False, default: Optional[float] = None) → Optional[float]

As for getdictval() but returns a probability and checks that it is in range. The default is non-mandatory, returning None.

crate_anon.linkage.helpers.getdictval(d: Dict[str, Any], key: str, type_: Type, mandatory: bool = False, default: Optional[Any] = None) → Any

Returns a value from a dictionary, or raises ValueError.

  • If mandatory is True, the key must be present, and the value must not be None or a blank string.

  • If mandatory is False and the key is absent, default is returned.

  • The value must be of type type_ (or None if permitted).
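The rules above might be sketched like this. The function name is hypothetical, and the handling of None when the key is not mandatory is an assumption rather than something the documentation states precisely.

```python
from typing import Any, Dict, Optional, Type


def getdictval_sketch(
    d: Dict[str, Any],
    key: str,
    type_: Type,
    mandatory: bool = False,
    default: Optional[Any] = None,
) -> Any:
    """Fetch a type-checked value from a dictionary, or raise ValueError."""
    if key not in d:
        if mandatory:
            raise ValueError(f"Missing key: {key!r}")
        return default
    value = d[key]
    if value is None or value == "":
        if mandatory:
            raise ValueError(f"Missing or blank value for key: {key!r}")
        if value is None:
            return None  # assumption: None is permitted when not mandatory
    if not isinstance(value, type_):
        raise ValueError(f"Value for {key!r} is not of type {type_.__name__}")
    return value
```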

crate_anon.linkage.helpers.identity(x: Any) → Any

Returns its input.

crate_anon.linkage.helpers.is_nfa_postcode(postcode_unit: str, prestandardized: bool = False) → bool

Is this the pseudopostcode meaning “no fixed abode”?

crate_anon.linkage.helpers.is_pseudopostcode(postcode_unit: str, prestandardized: bool = False) → bool

Is this a pseudopostcode?

crate_anon.linkage.helpers.is_valid_isoformat_blurred_date(x: str) → bool

Validates an ISO-format date (as for is_valid_isoformat_date()) that must be the first of the month.

crate_anon.linkage.helpers.is_valid_isoformat_date(x: str) → bool

Validates an ISO-format date with separators, e.g. ‘2022-12-31’.
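A rough sketch of both date validators, using only the standard library, is shown below. The function names are hypothetical and the real checks may be implemented differently (for example, with a regular expression or with pendulum).

```python
from datetime import date


def is_valid_iso_date_sketch(x: str) -> bool:
    """True for a real calendar date written exactly as YYYY-MM-DD."""
    # Check the shape first: date.fromisoformat() alone accepts other
    # ISO 8601 forms (e.g. without separators) on newer Python versions.
    if len(x) != 10 or x[4] != "-" or x[7] != "-":
        return False
    try:
        date.fromisoformat(x)
    except ValueError:
        return False
    return True


def is_valid_blurred_iso_date_sketch(x: str) -> bool:
    """As above, but the day must be the first of the month."""
    return is_valid_iso_date_sketch(x) and x.endswith("-01")
```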

crate_anon.linkage.helpers.isoformat_date_or_none(d: Optional[pendulum.date.Date]) → Optional[str]

Returns a date in string format, or None if it is absent.

crate_anon.linkage.helpers.isoformat_optional_date_str(d: Optional[pendulum.date.Date]) → str

Returns a date in string format.

crate_anon.linkage.helpers.ln(x: float) → float

Version of math.log() that treats log(0) as -inf, rather than crashing with ValueError: math domain error.

Parameters

x – parameter

Returns

ln(x), the natural logarithm of x

Return type

float
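One way to obtain this behaviour is sketched below; the function name is hypothetical and the real implementation may differ (for instance, by catching the exception instead of testing for zero).

```python
import math


def ln_sketch(x: float) -> float:
    """Natural log, returning -inf for x == 0 instead of raising."""
    if x == 0:
        return -math.inf
    return math.log(x)  # still raises ValueError for negative x
```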

crate_anon.linkage.helpers.log_likelihood_ratio_from_p(p_d_given_h: float, p_d_given_not_h: float) → float

Calculates the log likelihood ratio. Fast implementation.

Parameters
  • p_d_given_h – P(D | H)

  • p_d_given_not_h – P(D | \neg H)

Returns

log likelihood ratio, ln(\frac{ P(D | H) }{ P(D | \neg H) })

Return type

float

crate_anon.linkage.helpers.log_posterior_odds_from_pdh_pdnh(log_prior_odds: float, p_d_given_h: float, p_d_given_not_h: float) → float

Calculates log posterior odds. Fast implementation.

Parameters
  • log_prior_odds – log prior odds of H, ln(\frac{ P(H) }{ P(\neg H) })

  • p_d_given_h – P(D | H)

  • p_d_given_not_h – P(D | \neg H)

Returns

log posterior odds of H, ln(\frac{ P(H | D) }{ P(\neg H | D) })

Return type

float
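In log-odds form, Bayes' rule says that the log posterior odds equal the log prior odds plus the log likelihood ratio. A small worked sketch follows; the names are hypothetical, and it uses plain math.log(), so it assumes both probabilities are non-zero.

```python
import math


def log_posterior_odds_sketch(
    log_prior_odds: float, p_d_given_h: float, p_d_given_not_h: float
) -> float:
    """Log posterior odds of H, given log prior odds and P(D|H), P(D|not H)."""
    # ln(a / b) = ln(a) - ln(b); assumes both probabilities are > 0.
    log_likelihood_ratio = math.log(p_d_given_h) - math.log(p_d_given_not_h)
    return log_prior_odds + log_likelihood_ratio


# Even prior odds (1:1, so log odds 0), with data nine times more likely
# under H than under not-H:
log_post = log_posterior_odds_sketch(0.0, 0.9, 0.1)
posterior_p = 1 / (1 + math.exp(-log_post))  # convert log odds to probability
```

Here the posterior odds are 9:1, so posterior_p is 0.9 (up to floating-point error).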

crate_anon.linkage.helpers.mangle_unicode_to_ascii(s: Any) → str

Mangle unicode to ASCII, losing accents etc. in the process. This is a slightly different version to that in cardinal_pythonlib, because the Eszett gets a rough ride:

import unicodedata
unicodedata.normalize("NFKD", "Straße Clérambault").encode("ascii", "ignore")  # b'Strae Clerambault'

So we add the MANGLE_PRETRANSLATE step.
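The idea can be sketched as follows. The table contents and names here are assumptions for illustration; the real MANGLE_PRETRANSLATE may contain different or additional entries.

```python
import unicodedata

# Hypothetical pre-translation table (lower- and upper-case Eszett only).
PRETRANSLATE_SKETCH = str.maketrans({"\u00df": "ss", "\u1e9e": "SS"})


def mangle_to_ascii_sketch(s: str) -> str:
    """Replace Eszett explicitly, then strip accents and non-ASCII characters."""
    s = s.translate(PRETRANSLATE_SKETCH)
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With the pre-translation step, the Eszett becomes "ss" rather than vanishing.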

crate_anon.linkage.helpers.mk_blurry_dates(d: Union[pendulum.date.Date, str]) → Tuple[str, str, str]

Returns MONTH_DAY, YEAR_DAY, and YEAR_MONTH versions in a standard form.

crate_anon.linkage.helpers.mkdir_for_filename(filename: str) → None

Ensures that a directory exists for the filename.

crate_anon.linkage.helpers.mutate_name(name: str) → str

Introduces typos into a (standardized, capitalized, no-space-no-punctuation) name.

crate_anon.linkage.helpers.mutate_postcode(postcode: str, cfg: MatchConfig) → str

Introduces typos into a UK postcode, keeping the letter/digit format.

Parameters
  • postcode – the postcode to alter

  • cfg – the main MatchConfig object

crate_anon.linkage.helpers.open_even_if_zipped(filename: str) → Generator[_io.StringIO, None, None]

Yields (as a context manager) a text file, opened directly or through a ZIP file (distinguished by its extension) containing that file.
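A sketch of this pattern using the standard library is given below. The function name is hypothetical, and two details are assumptions: that the archive contains a single file (the sketch takes the first member), and that the text is UTF-8.

```python
import io
import os
import zipfile
from contextlib import contextmanager
from typing import Iterator, TextIO


@contextmanager
def open_even_if_zipped_sketch(filename: str) -> Iterator[TextIO]:
    """Yield a text stream from a plain file or from inside a .zip file."""
    if os.path.splitext(filename)[1].lower() == ".zip":
        with zipfile.ZipFile(filename) as zf:
            # Assumption: the archive holds one file; take the first member.
            inner_name = zf.namelist()[0]
            with zf.open(inner_name) as binary:
                with io.TextIOWrapper(binary, encoding="utf-8") as text:
                    yield text
    else:
        with open(filename, "rt", encoding="utf-8") as text:
            yield text
```

Callers can then write `with open_even_if_zipped_sketch(path) as f:` without caring whether the data is compressed.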

crate_anon.linkage.helpers.optional_int(value: str) → Optional[int]

argparse argument type that checks that its value is an integer or the value None.
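Such an argparse type function might look like the sketch below. The name is hypothetical, and accepting "none" case-insensitively is an assumption; the real function may accept only the exact string "None".

```python
import argparse
from typing import Optional


def optional_int_sketch(value: str) -> Optional[int]:
    """argparse 'type' callable: an integer, or the word None."""
    if value.strip().lower() == "none":  # assumption: case-insensitive
        return None
    try:
        return int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"{value!r} is not an integer or None"
        ) from None


parser = argparse.ArgumentParser()
parser.add_argument("--limit", type=optional_int_sketch, default=None)
```

Parsing `--limit 5` then gives the integer 5, and `--limit None` gives None.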

crate_anon.linkage.helpers.remove_redundant_whitespace(x: str) → str

Strip at edges; remove double-spaces; replace any other whitespace with a single space.
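A compact way to get this effect (not necessarily the one used here; the function name is hypothetical) is to split on runs of whitespace and rejoin with single spaces:

```python
def remove_redundant_whitespace_sketch(x: str) -> str:
    """Collapse every run of whitespace to one space and strip the ends."""
    # str.split() with no argument splits on any whitespace run and
    # discards leading/trailing whitespace.
    return " ".join(x.split())
```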

crate_anon.linkage.helpers.safe_upper(name: str) → str

Convert to upper case, but don’t mess up a few specific accents. Note that:

  • ‘ß’.upper() == ‘SS’ but ‘ẞ’.upper() == ‘ẞ’

… here, we will use an upper-case Eszett, and the “SS” will be dealt with through transliteration.
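The Eszett handling can be sketched as below. The function name is hypothetical, and the sketch covers only the Eszett; the real function may treat other characters specially too.

```python
def safe_upper_sketch(name: str) -> str:
    """Upper-case a name, keeping Eszett as a single capital letter."""
    # Swap the lower-case Eszett (U+00DF) for the capital form (U+1E9E)
    # first, so that str.upper() does not expand it to "SS".
    return name.replace("\u00df", "\u1e9e").upper()
```

By contrast, calling `.upper()` directly on "Straße" gives "STRASSE".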

crate_anon.linkage.helpers.simplify_punctuation_whitespace(x: str) → str

Simplify punctuation and whitespace, e.g. curly to straight quotes, tab to space, en dash to hyphen, etc.

crate_anon.linkage.helpers.standardize_name(name: str) → str

Converts a name to a standard form: upper case (will also e.g. translate Eszett to SS), no spaces, no punctuation.

This is the format used by the US surname database, which has e.g. ACOSTAPEREZ for (probably) Acosta Perez, and only PEREZ, with no accented form such as PÉREZ.

We use this for our name frequency databases. For other purposes, we use a more sophisticated approach; see e.g. surname_alternative_fragments().

Examples: see unit tests.

crate_anon.linkage.helpers.standardize_perfect_id_key(k: str) → str

Keys are compared case-insensitive, in lower case.

crate_anon.linkage.helpers.standardize_perfect_id_value(k: Any) → str

Values are forced to strings and compared case-insensitive, in upper case.

crate_anon.linkage.helpers.standardize_postcode(postcode_unit_or_sector: str) → str

Standardizes postcodes to “no space” format.

crate_anon.linkage.helpers.surname_alternative_fragments(surname: str, accent_transliterations: Dict[int, Optional[Union[str, int]]] = {196: 'AE', 214: 'OE', 220: 'UE', 7838: 'SS'}, nonspecific_name_components: Set[str] = {'AF', 'AL', 'AUF', 'AV', 'AW', 'D', 'DA', 'DAI', 'DAL', 'DALLA', 'DAS', 'DE', 'DEI', 'DEL', 'DELL', 'DELLA', 'DER', 'DES', 'DI', 'DO', 'DOS', 'DU', 'EL', 'I', 'II', 'III', 'IV', 'IX', 'JNR', 'JR', 'L', 'LA', 'LE', 'NA', 'OF', 'PHRA', 'SNR', 'SR', 'SRI', 'THOE', 'TOT', 'V', 'VAN', 'VI', 'VII', 'VIII', 'VON', 'X', 'ZU'}) → List[str]

Return a list of fragments that may occur as substitutes for the name (including the name itself). Those fragments include:

  • Parts of double-barrelled surnames.

  • ASCII-mangled versions of accents (e.g. Ü to U).

  • Transliterated versions of accents (e.g. Ü to UE).

Upper case will be used throughout.

Parameters
  • surname – The name to process. This should contain all original accents, spacing, and punctuation (i.e. should NOT have been standardized via standardize_name()). Case is unimportant (we will use upper case internally).

  • accent_transliterations – A mapping from accents to potential transliterated versions, in the form of a Python string translation table.

  • nonspecific_name_components – Name fragments that should not be produced in their own right, e.g. nobiliary particles such as “van” in “van Beethoven”.

Returns

A list of fragments: full name first, then other fragments in alphabetical order.

Return type

List[str]
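The integer keys in the default accent_transliterations are Unicode code points (196 is Ä, 214 is Ö, 220 is Ü, 7838 is the capital Eszett), which is the form that str.translate() expects. The sketch below shows only how a transliterated and an ASCII-mangled variant of one name can be derived; it does not reproduce the function's fragment splitting or ordering, and the variable names are illustrative.

```python
import unicodedata

# The default table from the signature above: code point -> replacement.
ACCENT_TRANSLITERATIONS = {196: "AE", 214: "OE", 220: "UE", 7838: "SS"}

surname = "M\u00dcLLER"  # MÜLLER, already in upper case

# Transliterated variant: U-umlaut becomes "UE".
transliterated = surname.translate(ACCENT_TRANSLITERATIONS)

# ASCII-mangled variant: decompose, then drop the combining diaeresis.
mangled = (
    unicodedata.normalize("NFKD", surname)
    .encode("ascii", "ignore")
    .decode("ascii")
)
```

Here transliterated is "MUELLER" and mangled is "MULLER".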

crate_anon.linkage.helpers.validate_prob(p: float, description: str) → None

Checks a probability is in the range [0, 1] or raises ValueError.

crate_anon.linkage.helpers.validate_uncertain_prob(p: float, description: str) → None

Checks a probability is in the range (0, 1) or raises ValueError.
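The difference between the two validators is whether the endpoints 0 and 1 are allowed. A minimal sketch follows, with hypothetical names and error messages:

```python
def validate_prob_sketch(p: float, description: str) -> None:
    """Require p in the closed interval [0, 1]."""
    if not 0 <= p <= 1:
        raise ValueError(f"Bad probability for {description}: {p}")


def validate_uncertain_prob_sketch(p: float, description: str) -> None:
    """Require p in the open interval (0, 1): certainty is not allowed."""
    if not 0 < p < 1:
        raise ValueError(f"Bad uncertain probability for {description}: {p}")
```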