12.4.9. crate_anon.linkage.matchconfig

crate_anon/linkage/matchconfig.py

Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.

Helper functions for linkage tools.

class crate_anon.linkage.matchconfig.MatchConfig(hash_key: str = 'fuzzy_id_match_default_hash_key_DO_NOT_USE_FOR_LIVE_DATA', hash_method: str = 'HMAC_SHA256', rounding_sf: int | None = 5, local_id_hash_key: str | None = None, population_size: int = 852523, forename_sex_csv_filename: str = '/path/to/linkage/data/us_forename_sex_freq.zip', forename_cache_filename: str = '/path/to/crate/user/data/fuzzy_forename_cache.jsonl', forename_freq_info: NameFrequencyInfo | None = None, forename_min_frequency: float = 5e-06, surname_csv_filename: str = '/path/to/linkage/data/us_surname_freq.zip', surname_cache_filename: str = '/path/to/crate/user/data/fuzzy_surname_cache.jsonl', surname_freq_info: NameFrequencyInfo | None = None, surname_min_frequency: float = 5e-06, accent_transliterations_csv: str = 'Ä/AE,Ö/OE,Ü/UE,ẞ/SS', nonspecific_name_components_csv: str = 'AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VIII,VON,X,ZU', birth_year_pseudo_range: float = 30, p_not_male_or_female: float = 0.004, p_female_given_male_or_female: float = 0.51, postcode_csv_filename: str = '/path/to/linkage/data/ONSPD_MAY_2022_UK.zip', postcode_cache_filename: str = '/path/to/crate/user/data/fuzzy_postcode_cache.json', postcode_freq_info: PostcodeFrequencyInfo | None = None, k_postcode: float | None = None, p_unknown_or_pseudo_postcode: float = 0.00201, k_pseudopostcode: float = 1.83, p_ep1_forename: str = 'F:0.00894,M:0.0084', p_ep2np1_forename: str = 'F:0.00881,M:0.00688', p_u_forename: float = 0.00191, p_en_forename: str = 'F:0.00572,M:0.00625', p_ep1_surname: str = 'F:0.00551,M:0.00471', p_ep2np1_surname: str = 'F:0.00378,M:0.00247', p_en_surname: str = 'F:0.0567,M:0.0134', p_ep_dob: float = 0.00459036, p_en_dob: float = 0, p_e_gender: float = 0.0033, p_ep_postcode: float = 0.0097, p_en_postcode: float = 0.3, min_log_odds_for_match: float = 5, exceeds_next_best_log_odds: float = 0, 
perfect_id_translation: Dict[str, str] | str = '', extra_validation_output: bool = False, check_comparison_order: bool = False, report_every: int = 100, min_probands_for_parallel: int = 1000, n_workers: int = 8, verbose: bool = False)[source]

Master config class. It’s more convenient to pass one of these round than lots of its components.

Default arguments are there for testing.

__init__(hash_key: str = 'fuzzy_id_match_default_hash_key_DO_NOT_USE_FOR_LIVE_DATA', hash_method: str = 'HMAC_SHA256', rounding_sf: int | None = 5, local_id_hash_key: str | None = None, population_size: int = 852523, forename_sex_csv_filename: str = '/path/to/linkage/data/us_forename_sex_freq.zip', forename_cache_filename: str = '/path/to/crate/user/data/fuzzy_forename_cache.jsonl', forename_freq_info: NameFrequencyInfo | None = None, forename_min_frequency: float = 5e-06, surname_csv_filename: str = '/path/to/linkage/data/us_surname_freq.zip', surname_cache_filename: str = '/path/to/crate/user/data/fuzzy_surname_cache.jsonl', surname_freq_info: NameFrequencyInfo | None = None, surname_min_frequency: float = 5e-06, accent_transliterations_csv: str = 'Ä/AE,Ö/OE,Ü/UE,ẞ/SS', nonspecific_name_components_csv: str = 'AF,AL,AUF,AV,AW,D,DA,DAI,DAL,DALLA,DAS,DE,DEI,DEL,DELL,DELLA,DER,DES,DI,DO,DOS,DU,EL,I,II,III,IV,IX,JNR,JR,L,LA,LE,NA,OF,PHRA,SNR,SR,SRI,THOE,TOT,V,VAN,VI,VII,VIII,VON,X,ZU', birth_year_pseudo_range: float = 30, p_not_male_or_female: float = 0.004, p_female_given_male_or_female: float = 0.51, postcode_csv_filename: str = '/path/to/linkage/data/ONSPD_MAY_2022_UK.zip', postcode_cache_filename: str = '/path/to/crate/user/data/fuzzy_postcode_cache.json', postcode_freq_info: PostcodeFrequencyInfo | None = None, k_postcode: float | None = None, p_unknown_or_pseudo_postcode: float = 0.00201, k_pseudopostcode: float = 1.83, p_ep1_forename: str = 'F:0.00894,M:0.0084', p_ep2np1_forename: str = 'F:0.00881,M:0.00688', p_u_forename: float = 0.00191, p_en_forename: str = 'F:0.00572,M:0.00625', p_ep1_surname: str = 'F:0.00551,M:0.00471', p_ep2np1_surname: str = 'F:0.00378,M:0.00247', p_en_surname: str = 'F:0.0567,M:0.0134', p_ep_dob: float = 0.00459036, p_en_dob: float = 0, p_e_gender: float = 0.0033, p_ep_postcode: float = 0.0097, p_en_postcode: float = 0.3, min_log_odds_for_match: float = 5, exceeds_next_best_log_odds: float = 0, perfect_id_translation: Dict[str, str] | 
str = '', extra_validation_output: bool = False, check_comparison_order: bool = False, report_every: int = 100, min_probands_for_parallel: int = 1000, n_workers: int = 8, verbose: bool = False) → None[source]

Parameters:

hash_key – Key (passphrase) for hasher.
hash_method – Method to use for hashing.
rounding_sf – Number of significant figures to use when rounding frequency information in hashed copies. Use None for no rounding.
local_id_hash_key – If specified, then for hash operations, the local_id values will also be hashed, using this key.
population_size – The size of the entire population (not our sample). See docstrings above.
forename_sex_csv_filename – Forename frequencies. CSV file, with no header, of “name, frequency” pairs.
forename_cache_filename – File in which to cache forename information for faster loading.
forename_freq_info – Debugging option: overrides forename_sex_csv_filename by providing a NameFrequencyInfo object directly.
forename_min_frequency – Minimum frequency for forenames.
surname_csv_filename – Surname frequencies. CSV file, with no header, of “name, frequency” pairs.
surname_cache_filename – File in which to cache surname information for faster loading.
surname_freq_info – Debugging option: overrides surname_csv_filename by providing a NameFrequencyInfo object directly.
surname_min_frequency – Minimum frequency for surnames.
accent_transliterations_csv – Accent transliteration map. String of the form “Ä/AE,Ö/OE”: comma-separated pairs, with a slash separating the two elements of each pair.
nonspecific_name_components_csv – Comma-separated list of nonspecific name components (e.g. nobiliary particles), which will be avoided as equivalent name fragments.
birth_year_pseudo_range – b, such that P(two people share a DOB) = 1/(365.25 * b).
p_not_male_or_female – Probability that a person in the population has gender ‘X’.
p_female_given_male_or_female – Probability that a person in the population is female, given that they are either male or female.
postcode_csv_filename – Postcode mapping. CSV (or ZIP) file. Special format; see PostcodeFrequencyInfo.
postcode_cache_filename – File in which to cache postcode information for faster loading.
postcode_freq_info – Debugging option: overrides postcode_csv_filename by providing a PostcodeFrequencyInfo object directly.
k_postcode – Multiple applied to postcode unit/sector frequencies, such that p_f_postcode = k_postcode * f_f_postcode and p_p_postcode = k_postcode * f_p_postcode. If None, defaults to UK_POPULATION_2017 / population_size, appropriate if the population under consideration is geographically constrained (rather than sampled from across the UK).
p_unknown_or_pseudo_postcode – Probability that a random person will have a pseudo-postcode, e.g. ZZ99 3VZ (no fixed abode) or a postcode not known to our database. Specifically, P(each pseudopostcode or unknown postcode unit | ¬H).
k_pseudopostcode – Probability multiple: P(pseudopostcode sector or unknown postcode sector match | ¬H) = k_pseudopostcode * p_unknown_or_pseudo_postcode. Must strictly be >=1 and we enforce >1; see paper.
p_ep1_forename – Error probability that a forename fails a full match but passes a partial 1 (metaphone) match. [GPD]
p_ep2np1_forename – Error probability that a forename fails a full match and a partial 1 match but passes a partial 2 (F2C) match. [GPD]
p_en_forename – Error probability that a forename yields no match at all. [GPD]
p_ep1_surname – Error probability that a surname fails a full match but passes a partial 1 (metaphone) match. [GPD]
p_ep2np1_surname – Error probability that a surname fails a full match and a partial 1 match but passes a partial 2 (F2C) match. [GPD]
p_en_surname – Error probability that a surname yields no match at all. [GPD]
p_ep_dob – Error probability that a DOB fails a full (YMD) match but passes a partial (YM, MD, or YD) match.
p_en_dob – Error probability that a DOB produces no match at all.
p_e_gender – Error probability of no gender match.
p_ep_postcode – Probability that a postcode fails a full (unit) match but passes a partial (sector) match (due to error or a move within a sector).
p_en_postcode – Probability that a postcode gives no match at all.
min_log_odds_for_match – Minimum log odds of a match required to consider two people a match.
exceeds_next_best_log_odds – In a multi-person comparison, the log odds of the best match must exceed those of the next-best match by this much for the best to be considered a unique winner.
perfect_id_translation – Optional dictionary mapping the perfect ID names in the proband to the equivalents in the sample, e.g. {“nhsnum”: “nhsnumber”}.
extra_validation_output – Add extra columns to the output for validation purposes?
check_comparison_order – Check that comparisons follow the general rule “no match ≤ partial(s) ≤ full” and warn if not.
report_every – Report progress every n probands.
min_probands_for_parallel – Minimum number of probands for which we will bother to use parallel processing.
n_workers – Number of parallel processes to use, if parallel processing is used.
verbose – Be verbose on creation?

[GPD] In {gender:p, ...} dict-as-string format.
F2C = First two characters.
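
The string-valued parameters above use compact formats, and birth_year_pseudo_range is defined through a formula. The following is a minimal sketch of how those values could be decoded, assuming plain comma, colon and slash splitting. All helper names here are hypothetical and are not part of crate_anon, which has its own parsing code.

```python
from typing import Dict, Set


def parse_gender_probabilities(text: str) -> Dict[str, float]:
    """Parse a [GPD] string such as 'F:0.00894,M:0.0084' (hypothetical helper)."""
    result: Dict[str, float] = {}
    # Tolerate optional surrounding braces, as in '{F:0.1,M:0.2}'.
    for item in text.strip().strip("{}").split(","):
        gender, probability = item.split(":")
        result[gender.strip()] = float(probability)
    return result


def parse_transliterations(text: str) -> Dict[str, str]:
    """Parse 'Ä/AE,Ö/OE' into {'Ä': 'AE', 'Ö': 'OE'} (hypothetical helper)."""
    pairs = (item.split("/") for item in text.split(","))
    return {accented.strip(): plain.strip() for accented, plain in pairs}


def parse_name_components(text: str) -> Set[str]:
    """Parse a comma-separated list of name components into a set."""
    return {part.strip() for part in text.split(",") if part.strip()}


def p_shared_dob(birth_year_pseudo_range: float) -> float:
    """P(two people share a DOB) = 1 / (365.25 * b), as documented above."""
    return 1 / (365.25 * birth_year_pseudo_range)


# Values taken from the documented defaults (name components abbreviated).
p_ep1_forename = parse_gender_probabilities("F:0.00894,M:0.0084")
accents = parse_transliterations("Ä/AE,Ö/OE,Ü/UE,ẞ/SS")
nonspecific = parse_name_components("DE,LA,VAN,VON")
dob_probability = p_shared_dob(30)  # 1 / 10957.5
```

With the default birth_year_pseudo_range of 30, the chance that two unrelated people share a DOB under this formula is about 1 in 10,958.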

debug_postcode_sector_population(postcode_sector: str, prestandardized: bool = False) → float[source]

Returns the calculated population of a postcode sector.

Parameters:

postcode_sector – the postcode sector to check
prestandardized – was the postcode pre-standardized in format?

debug_postcode_unit_population(postcode_unit: str, prestandardized: bool = False) → float[source]

Returns the calculated population of a postcode unit.

Parameters:

postcode_unit – the postcode unit to check
prestandardized – was the postcode pre-standardized in format?

exceeds_primary_threshold(log_odds_match: float) → bool[source]

Decides whether the log odds, representing P(H | D) from a comparison of two Person objects, are sufficient for a match, based on our threshold.

Parameters: log_odds_match – log odds that they’re the same person
Returns: binary decision
Return type: bool
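
The threshold logic described here, together with the exceeds_next_best_log_odds rule from the parameter list, might be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes natural-log odds, an inclusive (>=) comparison against the threshold, and that the best candidate must lead the next-best by at least the margin. The function unique_winner and the module-level constants are hypothetical names.

```python
import math
from typing import List, Optional

MIN_LOG_ODDS_FOR_MATCH = 5.0      # documented default
EXCEEDS_NEXT_BEST_LOG_ODDS = 0.0  # documented default


def probability_from_log_odds(log_odds: float) -> float:
    """Convert natural-log odds to a probability (assumes ln, not log10)."""
    return 1 / (1 + math.exp(-log_odds))


def exceeds_primary_threshold(
    log_odds_match: float, threshold: float = MIN_LOG_ODDS_FOR_MATCH
) -> bool:
    """Assumption: the comparison is inclusive (>=)."""
    return log_odds_match >= threshold


def unique_winner(
    candidate_log_odds: List[float],
    threshold: float = MIN_LOG_ODDS_FOR_MATCH,
    margin: float = EXCEEDS_NEXT_BEST_LOG_ODDS,
) -> Optional[int]:
    """Return the index of the winning candidate, or None if there is none."""
    if not candidate_log_odds:
        return None
    ranked = sorted(
        range(len(candidate_log_odds)),
        key=lambda i: candidate_log_odds[i],
        reverse=True,
    )
    best = ranked[0]
    if not exceeds_primary_threshold(candidate_log_odds[best], threshold):
        return None
    if len(ranked) > 1:
        lead = candidate_log_odds[best] - candidate_log_odds[ranked[1]]
        if lead < margin:  # assumption: a lead equal to the margin passes
            return None
    return best


winner = unique_winner([2.0, 7.5, 6.0])                 # candidate 1 wins
no_winner = unique_winner([2.0, 7.5, 6.0], margin=2.0)  # lead of 1.5 is too small
```

Under the natural-log assumption, the default threshold of 5 corresponds to a posterior probability of roughly 0.993.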

get_forename_freq_info(name: str, gender: str, prestandardized: bool = False) → BasicNameFreqInfo[source]

Returns the baseline frequency of a forename.

Parameters:

name – the name to check
gender – the gender for which to look up the frequency
prestandardized – was the name pre-standardized?

get_surname_freq_info(name: str, prestandardized: bool = False) → BasicNameFreqInfo[source]

Returns the baseline frequency of a surname.

Parameters:

name – the name to check
prestandardized – was it pre-standardized?

is_valid_postcode(postcode_unit: str) → bool[source]: Is this a valid postcode?

postcode_unit_sector_freq(postcode_unit: str, prestandardized: bool = False) → Tuple[float, float][source]

Returns the frequency of a full postcode, or postcode unit (the proportion of the population who live in that postcode), and the frequency of the corresponding larger-scale postcode sector.

The underlying function ensures that the sector frequency is at least as big as the unit frequency.
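
The relationship between raw postcode frequencies and the k_postcode, p_unknown_or_pseudo_postcode and k_pseudopostcode parameters might be sketched as below. This is a rough illustration only: the function names are hypothetical, the UK population figure is a placeholder rather than the library's UK_POPULATION_2017 constant, and the real implementation may apply further adjustments not shown here.

```python
from typing import Tuple


def default_k_postcode(uk_population: float, population_size: int) -> float:
    """Documented default when k_postcode is None: UK population / population_size."""
    return uk_population / population_size


def unit_sector_probabilities(
    f_unit: float, f_sector: float, k_postcode: float
) -> Tuple[float, float]:
    """Scale raw unit/sector frequencies by k_postcode (hypothetical helper)."""
    # A sector contains its units, so its frequency cannot be smaller.
    f_sector = max(f_sector, f_unit)
    return k_postcode * f_unit, k_postcode * f_sector


def pseudopostcode_probabilities(
    p_unknown_or_pseudo_postcode: float = 0.00201,
    k_pseudopostcode: float = 1.83,
) -> Tuple[float, float]:
    """Unit-level and sector-level probabilities for pseudo/unknown postcodes."""
    if k_pseudopostcode <= 1:
        raise ValueError("k_pseudopostcode must be > 1")
    p_unit = p_unknown_or_pseudo_postcode
    p_sector = k_pseudopostcode * p_unknown_or_pseudo_postcode
    return p_unit, p_sector


# Placeholder population figure, for illustration only.
k = default_k_postcode(uk_population=66_000_000, population_size=852523)
p_unit, p_sector = unit_sector_probabilities(2e-7, 5e-6, k)
pseudo_unit, pseudo_sector = pseudopostcode_probabilities()
```

Taking the maximum before scaling keeps the "sector at least as frequent as unit" property whatever value k_postcode takes.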

crate_anon.linkage.matchconfig.mk_dummy_match_config() → MatchConfig[source]: Returns a dummy config with empty frequency information.