14.4.5. crate_anon.linkage.frequencies

crate_anon/linkage/frequencies.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Frequency classes for linkage tools.

These record and calculate frequencies of real-world things (names, postcodes) from publicly available data.

class crate_anon.linkage.frequencies.BasicNameFreqInfo(name: str, p_name: float, gender: str = '', metaphone: str = '', p_metaphone: float = 0.0, p_metaphone_not_name: float = 0.0, f2c: str = '', p_f2c: float = 0.0, p_f2c_not_name_metaphone: float = 0.0, synthetic: bool = False)[source]

Used for calculating P(share F2C but not name or metaphone).

Note that the metaphone can be “”, e.g. if the name is “W”. But we can still calculate the frequency of those metaphones cumulatively across all our names.

__init__(name: str, p_name: float, gender: str = '', metaphone: str = '', p_metaphone: float = 0.0, p_metaphone_not_name: float = 0.0, f2c: str = '', p_f2c: float = 0.0, p_f2c_not_name_metaphone: float = 0.0, synthetic: bool = False) None[source]

The constructor allows initialization with just a name and its frequency (with other probabilities being set later), or from a saved representation with full details.

Parameters:
  • name – Name.

  • p_name – Population probability (frequency) of this name, within the specified gender if there is one.

  • gender – Specified gender, or a blank string for non-gender-associated names.

  • metaphone – “Sounds-like” representation as the first part of a double metaphone.

  • p_metaphone – Population frequency (probability) of the metaphone.

  • p_metaphone_not_name – Probability that someone in the population shares this metaphone, but not this name. Usually this is p_metaphone - p_name, but you may choose to impose a minimum frequency.

  • f2c – First two characters (F2C) of the name.

  • p_f2c – Population probability of the F2C.

  • p_f2c_not_name_metaphone – Probability that someone in the population shares this F2C, but not this name or metaphone.

  • synthetic – Is this record made up (e.g. an unknown name, or a mean of two other records)?
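
The relationship between p_name, p_metaphone and p_metaphone_not_name described above can be sketched as follows. This is a minimal illustration, not CRATE's implementation: the helper name p_shared_not_exact and the way its min_frequency floor is applied are assumptions made for this example.

```python
def p_shared_not_exact(
    p_fuzzy: float, p_exact: float, min_frequency: float = 0.0
) -> float:
    """
    Hypothetical helper (not part of crate_anon).

    Probability that someone shares the fuzzy representation (e.g. the
    metaphone) but not the exact one (e.g. the name), with an optional
    floor so that the result never falls below min_frequency.
    """
    return max(min_frequency, p_fuzzy - p_exact)


# Illustrative values only.
p_name = 0.01
p_metaphone = 0.015
p_metaphone_not_name = p_shared_not_exact(p_metaphone, p_name)

# If the name is the only one with its metaphone, the difference is zero,
# and the floor (if any) takes over.
p_floor_applied = p_shared_not_exact(0.01, 0.01, min_frequency=1e-6)
```

The same pattern would apply to p_f2c_not_name_metaphone, subtracting whatever is already accounted for by the name and metaphone matches.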

as_dict() Dict[str, Any][source]

Returns a dictionary representation suitable for JSON serialization.

classmethod from_dict(d: Dict[str, Any]) BasicNameFreqInfo[source]

Creates an object from its dictionary (JSON) representation.
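
A round trip through as_dict() and from_dict() might look like the sketch below. To keep the example self-contained, it uses a hypothetical stand-in class, NameFreqSketch, with only a few of the fields listed above; it is not the real BasicNameFreqInfo, and the values are illustrative.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class NameFreqSketch:
    """Hypothetical stand-in for BasicNameFreqInfo (subset of fields)."""

    name: str
    p_name: float
    gender: str = ""
    metaphone: str = ""
    p_metaphone: float = 0.0

    def as_dict(self) -> Dict[str, Any]:
        # Plain dictionary of built-in types, so json.dumps() accepts it.
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "NameFreqSketch":
        # Keys of the dictionary map directly onto constructor arguments.
        return cls(**d)


original = NameFreqSketch(
    name="SMITH", p_name=0.01, metaphone="SM0", p_metaphone=0.012
)
encoded = json.dumps(original.as_dict())  # e.g. for storage
restored = NameFreqSketch.from_dict(json.loads(encoded))
```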

static weighted_mean(objects: Sequence[BasicNameFreqInfo], weights: Sequence[float])[source]

Returns an object with the weighted probabilities across the objects specified. Used for gender weighting.
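
As a rough illustration of the arithmetic, the sketch below takes a weighted mean of a single probability across two gender-specific records. The function name weighted_mean_probability is hypothetical, and normalizing by the sum of the weights is an assumption of this example rather than a statement about how CRATE treats its weights.

```python
from typing import Sequence


def weighted_mean_probability(
    probabilities: Sequence[float], weights: Sequence[float]
) -> float:
    """Hypothetical helper: weighted mean of one probability field."""
    if len(probabilities) != len(weights):
        raise ValueError("Need exactly one weight per probability")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    weighted_sum = sum(p * w for p, w in zip(probabilities, weights))
    return weighted_sum / total_weight


# Illustrative values: a name that is common in one gender, rare in the other.
p_name_female = 0.02
p_name_male = 0.001
p_name_gender_unknown = weighted_mean_probability(
    [p_name_female, p_name_male],
    [0.51, 0.49],  # assumed population proportions, for illustration
)
```

A full implementation would repeat this for each probability field and mark the resulting record as synthetic.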

class crate_anon.linkage.frequencies.NameFrequencyInfo(csv_filename: str, cache_filename: str, by_gender: bool = False, min_frequency: float = 0)[source]

Holds frequencies of a class of names (e.g. first names or surnames), and also of their fuzzy (metaphone) versions.

We keep these frequency representations entirely here (the source) and with the probands (storage); the config is not involved except to define min_frequency at creation. We need to scan across all names to estimate the frequency of the empty (“”) metaphone, which does arise in our standard data. Default frequency information can be obtained for any name not found in our name definitions, but it is then stored with the (hashed) name representations, so nothing needs to be recalculated at comparison time. (Contrast postcodes, where further geographical adjustments may be required, depending on the comparison population.)

__init__(csv_filename: str, cache_filename: str, by_gender: bool = False, min_frequency: float = 0) None[source]

Initializes the object from a CSV file. Uses standardize_name().

Parameters:
  • csv_filename – CSV file, with no header, of “name, frequency” pairs.

  • cache_filename – File in which to cache information, for faster loading.

  • by_gender – Is the source data split by gender?

  • min_frequency – Minimum frequency to allow; see command-line help.
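
Reading a headerless CSV of “name, frequency” pairs could be sketched as below. This is not the class's actual loading code: read_name_frequencies is a hypothetical function, upper-casing is only a stand-in for standardize_name(), and treating min_frequency as a floor on each value is an assumption.

```python
import csv
import io
from typing import Dict, TextIO


def read_name_frequencies(
    csv_file: TextIO, min_frequency: float = 0.0
) -> Dict[str, float]:
    """Hypothetical reader for a headerless 'name, frequency' CSV."""
    frequencies: Dict[str, float] = {}
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        name = row[0].strip().upper()  # stand-in for standardize_name()
        frequency = float(row[1])
        # Assumption: min_frequency acts as a lower bound on each value.
        frequencies[name] = max(frequency, min_frequency)
    return frequencies


# In-memory sample data instead of a real file, for illustration.
sample = io.StringIO("Smith,0.01\nZyzzyva,0.0000001\n")
frequencies = read_name_frequencies(sample, min_frequency=5e-6)
```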

first_two_char_frequency(f2c: str, gender: str = '') float[source]

Returns the frequency of the first two characters of a name. This one isn’t very important; we want a more refined probability.

get_names_for_metaphone(metaphone: str) List[str][source]

Return (for debugging purposes) a list of all names matching the specified metaphone.

metaphone_frequency(metaphone: str, gender: str = '') float[source]

Returns the frequency of a metaphone.

name_frequency(name: str, gender: str = '', prestandardized: bool = True) float[source]

Returns the frequency of a name.

Parameters:
  • name – the name to check

  • gender – the gender, if created with by_gender=True

  • prestandardized – was the name pre-standardized in format?

Returns:

the name’s frequency in the population

name_frequency_info(name: str, gender: str = '', prestandardized: bool = True) BasicNameFreqInfo[source]

Look up frequency information for a name (with gender, optionally).

class crate_anon.linkage.frequencies.PostcodeFrequencyInfo(csv_filename: str, cache_filename: str, report_every: int = 10000)[source]

Holds frequencies of UK postcodes, and also their hashed versions. Handles pseudo-postcodes somewhat separately.

Frequencies are national estimates for known real postcodes. Any local correction or correction for unknown postcodes is done separately.

We return explicit “don’t know” values for unknown postcodes (including pseudopostcodes) since those values may be handled differently, in a way that is set at comparison time.

__init__(csv_filename: str, cache_filename: str, report_every: int = 10000) None[source]

Initializes the object from a CSV file.

Parameters:
  • csv_filename – CSV file from the UK Office for National Statistics, e.g. ONSPD_MAY_2022_UK.csv. Columns include “pcds” (one of the postcode formats) and “oa11” (Output Area from the 2011 Census). A ZIP file containing a single CSV file is also permissible (distinguished by filename extension).

  • cache_filename – Filename to hold cached data in pickle format, because the CSV read process is slow (the CSV is about 1.4 GB).

  • report_every – How often to report progress during loading.
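
The caching idea behind cache_filename can be sketched generically: load from the pickle file if it exists, otherwise run the slow load and save the result. The function load_with_cache and the toy loader below are hypothetical and do not reflect the class's internal logic.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def load_with_cache(cache_filename: str, slow_loader: Callable[[], Any]) -> Any:
    """Hypothetical helper: return cached data, building the cache if absent."""
    if os.path.exists(cache_filename):
        with open(cache_filename, "rb") as f:
            return pickle.load(f)
    data = slow_loader()
    with open(cache_filename, "wb") as f:
        pickle.dump(data, f)
    return data


calls = []  # records how often the slow path runs


def slow_loader() -> dict:
    calls.append(1)
    return {"CB2 0QQ": 1e-6}  # made-up frequency, for illustration


with tempfile.TemporaryDirectory() as tmpdir:
    cache = os.path.join(tmpdir, "postcodes.pickle")
    first = load_with_cache(cache, slow_loader)   # slow path, writes cache
    second = load_with_cache(cache, slow_loader)  # fast path, reads cache
```

Pickle files can execute arbitrary code when loaded, so a cache like this should only ever be read back from a location you control.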

debug_is_valid_postcode(postcode_unit: str, prestandardized: bool = False) bool[source]

Is this a valid postcode?

debug_postcode_sector_population(postcode_sector: str, prestandardized: bool = False, total_population: int = 66040000) float | None[source]

Returns the calculated population of a postcode sector.

Parameters:
  • postcode_sector – the postcode sector to check

  • prestandardized – was the sector pre-standardized in format?

  • total_population – national population

debug_postcode_unit_population(postcode_unit: str, prestandardized: bool = False, total_population: int = 66040000) float | None[source]

Returns the calculated population of a postcode unit.

Parameters:
  • postcode_unit – the postcode unit to check

  • prestandardized – was the postcode pre-standardized in format?

  • total_population – national population
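
One plausible reading of “calculated population” is the population frequency multiplied by total_population, with None passed through for unknown postcodes. The sketch below illustrates that reading only; estimated_population is a hypothetical function and the frequency is made up.

```python
from typing import Optional

# Default total_population from the signatures above.
UK_POPULATION = 66_040_000


def estimated_population(
    frequency: Optional[float], total_population: int = UK_POPULATION
) -> Optional[float]:
    """Hypothetical helper: scale a population frequency to a head count."""
    if frequency is None:
        return None  # unknown postcode: no estimate
    return frequency * total_population


unit_population = estimated_population(1e-6)  # roughly 66 people
unknown_population = estimated_population(None)
```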

postcode_unit_sector_frequency(postcode_unit: str, prestandardized: bool = False) Tuple[float | None, float | None][source]

Returns the frequency of a postcode unit and its associated sector. Performs an important check that the sector frequency is at least as big as the unit frequency.

Parameters:
  • postcode_unit – the postcode unit to check

  • prestandardized – was the postcode pre-standardized in format?

Returns:

unit_frequency, sector_frequency

Return type:

tuple
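
The unit/sector lookup and its consistency check can be sketched with two plain dictionaries. Everything here is illustrative: unit_sector_frequency is a hypothetical function, the frequencies are made up, and the example assumes postcodes formatted so that the sector is the unit with its final two characters removed.

```python
from typing import Dict, Optional, Tuple


def unit_sector_frequency(
    postcode_unit: str,
    unit_freq: Dict[str, float],
    sector_freq: Dict[str, float],
) -> Tuple[Optional[float], Optional[float]]:
    """Hypothetical lookup returning (unit_frequency, sector_frequency)."""
    postcode_sector = postcode_unit[:-2]  # e.g. "CB2 0QQ" -> "CB2 0"
    p_unit = unit_freq.get(postcode_unit)  # None means "don't know"
    p_sector = sector_freq.get(postcode_sector)
    if p_unit is not None and p_sector is not None and p_sector < p_unit:
        # A sector contains its units, so it cannot be rarer than one of them.
        raise ValueError(
            f"Sector {postcode_sector!r} is rarer than unit {postcode_unit!r}"
        )
    return p_unit, p_sector


unit_freq = {"CB2 0QQ": 1.0e-6}   # made-up values
sector_freq = {"CB2 0": 2.5e-4}

known = unit_sector_frequency("CB2 0QQ", unit_freq, sector_freq)
unknown = unit_sector_frequency("ZZ99 3VZ", unit_freq, sector_freq)
```

Returning None rather than a default frequency leaves the caller free to decide, at comparison time, how unknown postcodes should be treated.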