14.4.8. crate_anon.linkage.identifiers

crate_anon/linkage/identifiers.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Helper functions for linkage tools.

Represents various types of person identifier (e.g. name, postcode) that may be compared between two people.

class crate_anon.linkage.identifiers.BasicName(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '', temporal: bool = False, start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None, description: str = 'name')[source]

Base class for names.

Note that this is a pretty difficult generic problem. See https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

The sequence of preferences is (1) full match, (2) metaphone match, (3) first-two-character (F2C) match, (4) no match. Reasons are discussed in the validation paper. Frequency representations here are slightly more complex because the fuzzy representations are not subsets/supersets of each other but overlap, so we need to represent explicitly e.g. P(F2C match but not metaphone or name match).
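
The preference order can be sketched as follows. This is a minimal illustration, not CRATE's implementation: the NameKeys class, the match_level() function and the hand-written phonetic keys are all hypothetical, and a real metaphone algorithm is assumed to have been applied beforehand.

```python
from dataclasses import dataclass
from enum import Enum


class MatchLevel(Enum):
    FULL = 1
    METAPHONE = 2
    F2C = 3
    NONE = 4


@dataclass(frozen=True)
class NameKeys:
    """Hypothetical container: a standardized name plus a precomputed key."""

    name: str
    metaphone: str  # assumed to come from a phonetic library; hand-written here

    @property
    def f2c(self) -> str:
        # First two characters of the name.
        return self.name[:2]


def match_level(a: NameKeys, b: NameKeys) -> MatchLevel:
    """Return the first (most preferred) level at which the two names agree."""
    if a.name == b.name:
        return MatchLevel.FULL
    if a.metaphone == b.metaphone:
        return MatchLevel.METAPHONE
    if a.f2c == b.f2c:
        return MatchLevel.F2C
    return MatchLevel.NONE


# Illustrative values only; the phonetic keys are typed in by hand.
smith = NameKeys("SMITH", "SM0")
smythe = NameKeys("SMYTHE", "SM0")
smart = NameKeys("SMART", "SMRT")
jones = NameKeys("JONES", "JNS")
```

Because the first matching rule wins, a result of MatchLevel.F2C in this sketch corresponds to "F2C match but not metaphone or name match", which is the kind of category that needs its own explicit frequency.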

We will need some special gender features for both forenames and surnames:

  • UK forename frequency depends on gender.

  • The probability that someone’s surname changes depends on gender.

As a result, because we can’t access gender once hashed, we need to store error frequencies as well as population frequencies.

Since names can change, we also support optional start/end dates. If none are supplied, it simply becomes a non-temporal identifier.

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '', temporal: bool = False, start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None, description: str = 'name') None[source]

Plaintext creation of a name.

Parameters
  • cfg – The config object.

  • name – (PLAINTEXT.) The name.

  • description – Used internally for verbose comparisons.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

For JSON.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

fully_matches(other: crate_anon.linkage.identifiers.BasicName) bool[source]

Does this identifier fully match the other?

You can assume that self.comparison_relevant(other) is True.

property p_en: float

For internal use. Only call if frequencies are set up.

property p_n: float

For internal use. Only call if frequencies are set up.

partially_matches(other: crate_anon.linkage.identifiers.BasicName) bool[source]

Does this identifier partially match the other?

You can assume that self.comparison_relevant(other) is True.

partially_matches_second(other: crate_anon.linkage.identifiers.BasicName) bool[source]

Does this identifier partially match the other on the second fuzzy identifier?

You can assume that self.comparison_relevant(other) is True.

plaintext_str_core() str[source]

For CSV.

set_gender(gender: str) None[source]

Special operation for identifiable reading.

class crate_anon.linkage.identifiers.ComparisonInfo(proband_idx: int, candidate_idx: int, comparison: crate_anon.linkage.comparison.Comparison)[source]

Used by gen_best_comparisons().

__init__(proband_idx: int, candidate_idx: int, comparison: crate_anon.linkage.comparison.Comparison) None[source]

static sort_asc_best_to_worst(x: crate_anon.linkage.identifiers.ComparisonInfo) Tuple[float, int][source]

Returns a sort value suitable for ASCENDING (standard, reverse=False) sorting to give a best-to-worst sort order.

  • The first part of the tuple is negative log likelihood ratio, so higher values are worse (because higher values of log likelihood ratio are better).

  • The second part of the tuple (the tie-breaker if NLLR is identical) is the square of the distance between the proband and candidate indexes. We prefer to use identical values (distance = squared distance = 0), so higher values are worse. This tiebreaker means that if we compare Alice Alice SMITH to Alice Alice SMITH on first names, we will choose index pairs (1, 1) and (2, 2), not (1, 2) and (2, 1).
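
The two-part sort key described above can be sketched like this. PairScore is a hypothetical stand-in for ComparisonInfo that holds a plain log likelihood ratio instead of a Comparison object, and the value 5.0 is invented. Indexes are zero-based here, so (0, 0) and (1, 1) correspond to the pairs (1, 1) and (2, 2) in the text.

```python
from typing import NamedTuple, Tuple


class PairScore(NamedTuple):
    """Hypothetical stand-in for ComparisonInfo."""

    proband_idx: int
    candidate_idx: int
    log_likelihood_ratio: float


def sort_key_best_to_worst(x: PairScore) -> Tuple[float, int]:
    """Key for an ascending sort that yields best-to-worst order."""
    # Negative LLR: a higher LLR is better, so it must sort first.
    nllr = -x.log_likelihood_ratio
    # Tie-breaker: squared index distance, so same-position pairs come first.
    squared_distance = (x.proband_idx - x.candidate_idx) ** 2
    return nllr, squared_distance


# "Alice Alice" versus "Alice Alice": every pairing matches equally well.
pairs = [PairScore(p, c, 5.0) for p in (0, 1) for c in (0, 1)]
ranked = sorted(pairs, key=sort_key_best_to_worst)
best_two = [(x.proband_idx, x.candidate_idx) for x in ranked[:2]]
```

With identical log likelihood ratios, the tie-breaker alone decides the order, so the same-index pairs are ranked ahead of the crossed pairs.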

class crate_anon.linkage.identifiers.DateOfBirth(cfg: crate_anon.linkage.matchconfig.MatchConfig, dob: str = '')[source]

Represents a date of birth (DOB).

We don’t store any frequencies with the hashed version, since they are all obtainable from the config (they are not specific to a particular DOB).

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, dob: str = '') None[source]

Plaintext creation of a DOB.

Parameters
  • cfg – The config object.

  • dob – (PLAINTEXT.) The date of birth in ISO 8601 “YYYY-MM-DD” string format.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

For JSON.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

classmethod from_dict(cfg: crate_anon.linkage.matchconfig.MatchConfig, d: Dict[str, Any], hashed: bool) crate_anon.linkage.identifiers.DateOfBirth[source]

Creation of a hashed DOB, ultimately from JSON.

classmethod from_plaintext_str(cfg: crate_anon.linkage.matchconfig.MatchConfig, x: str) crate_anon.linkage.identifiers.DateOfBirth[source]

Creation from CSV.

fully_matches(other: crate_anon.linkage.identifiers.DateOfBirth) bool[source]

Does this identifier fully match the other?

You can assume that self.comparison_relevant(other) is True.

partially_matches(other: crate_anon.linkage.identifiers.DateOfBirth) bool[source]

Does this identifier partially match the other?

You can assume that self.comparison_relevant(other) is True.

plaintext_str_core() str[source]

For CSV.

class crate_anon.linkage.identifiers.DummyLetterIdentifier(value: str, cfg: Optional[crate_anon.linkage.matchconfig.MatchConfig] = None)[source]

Represents identifiers {A, B, … Z}, each with probability 1/26, allowing exact matching only. For testing multiple comparison algorithms. No temporal component.

__init__(value: str, cfg: Optional[crate_anon.linkage.matchconfig.MatchConfig] = None) None[source]

Plaintext creation of a dummy identifier.

class crate_anon.linkage.identifiers.DummyLetterTemporalIdentifier(value: str, cfg: Optional[crate_anon.linkage.matchconfig.MatchConfig] = None, temporal: bool = False, start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None)[source]

Represents identifiers {A, B, … Z}, each with probability 1/26, allowing exact matching only. For testing multiple comparison algorithms. Allows a temporal component.

__init__(value: str, cfg: Optional[crate_anon.linkage.matchconfig.MatchConfig] = None, temporal: bool = False, start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None) None[source]

Plaintext creation of a dummy identifier.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

For JSON.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

fully_matches(other: crate_anon.linkage.identifiers.DummyLetterIdentifier) bool[source]

Does this identifier fully match the other?

You can assume that self.comparison_relevant(other) is True.

plaintext_str_core() str[source]

Represents the identifier in plaintext, for CSV. Potentially encapsulated within more information by __str__().

class crate_anon.linkage.identifiers.Forename(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '', start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None)[source]

Represents a forename (given name).

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '', start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None) None[source]

Plaintext creation of a name.

Parameters
  • cfg – The config object.

  • name – (PLAINTEXT.) The name.

  • description – Used internally for verbose comparisons.

classmethod from_dict(cfg: crate_anon.linkage.matchconfig.MatchConfig, d: Dict[str, Any], hashed: bool) crate_anon.linkage.identifiers.Forename[source]

Creation of a hashed name, ultimately from JSON.

classmethod from_plaintext_str(cfg: crate_anon.linkage.matchconfig.MatchConfig, x: str) crate_anon.linkage.identifiers.Forename[source]

Creation from CSV.

class crate_anon.linkage.identifiers.Gender(cfg: crate_anon.linkage.matchconfig.MatchConfig, gender: str = '')[source]

Represents a gender.

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, gender: str = '') None[source]

Plaintext creation of a gender.

Parameters
  • cfg – The config object.

  • gender – (PLAINTEXT.) The gender.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

For JSON.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

classmethod from_dict(cfg: crate_anon.linkage.matchconfig.MatchConfig, d: Dict[str, Any], hashed: bool) crate_anon.linkage.identifiers.Gender[source]

Creation of a hashed gender, ultimately from JSON.

classmethod from_plaintext_str(cfg: crate_anon.linkage.matchconfig.MatchConfig, x: str) crate_anon.linkage.identifiers.Gender[source]

Creation from CSV.

fully_matches(other: crate_anon.linkage.identifiers.Gender) bool[source]

Does this identifier fully match the other?

You can assume that self.comparison_relevant(other) is True.

plaintext_str_core() str[source]

For CSV.

class crate_anon.linkage.identifiers.Identifier(cfg: Optional[crate_anon.linkage.matchconfig.MatchConfig], is_plaintext: bool, temporal: bool = False, start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None)[source]

Abstract base class: generic nugget of information about a person, in identifiable (plaintext) or de-identified (hashed) form. Optionally, may convey start/end dates.

Note:

  • We trust that probabilities from the config have been validated (i.e. are in the range 0-1), but we should check values arising from incoming data, primarily via from_hashed_dict(). The function crate_anon.linkage.helpers.getdictprob() does this, but more checks may be required.

  • A typical comparison operation involves comparing a lot of people to each other, so it is usually efficient to cache “derived” information (e.g. we should calculate metaphones, etc., from names at creation, not at comparison). See comparison().

__init__(cfg: Optional[crate_anon.linkage.matchconfig.MatchConfig], is_plaintext: bool, temporal: bool = False, start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None) None[source]
Parameters
  • cfg – A configuration object. Can be None but you have to specify that manually.

  • is_plaintext – Is this an identifiable (plaintext) version? If False, then it is a de-identified (hashed) version, whose internal structure can be more complex.

  • temporal – Store start/end dates (which can be None) along with the information?

  • start_date – The start date (first valid date), or None.

  • end_date – The end date (last valid date), or None.

abstract as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

Represents the object in a dictionary suitable for JSON serialization, for the de-identified (hashed) version.

Parameters
  • encrypt – Encrypt the contents when writing, creating a hashed version.

  • include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.

abstract comparison(candidate_id: crate_anon.linkage.identifiers.Identifier) Optional[crate_anon.linkage.comparison.Comparison][source]

Return a Comparison object (embodying the change in log odds) for a comparison between the “self” identifier (as the proband) and another, the candidate. Frequency information is expected to be on the “self” (proband) side.

comparison_relevant(other: crate_anon.linkage.identifiers.Identifier) bool[source]

It’s only relevant to compare this identifier to another if both have some information, and if they are not specifically excluded by a temporal check.

abstract ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

abstract classmethod from_dict(cfg: crate_anon.linkage.matchconfig.MatchConfig, d: Dict[str, Any], hashed: bool) crate_anon.linkage.identifiers.Identifier[source]

Restore a hashed or plaintext version from a dictionary (which will have been read from JSON).

abstract classmethod from_plaintext_str(cfg: crate_anon.linkage.matchconfig.MatchConfig, x: str) crate_anon.linkage.identifiers.Identifier[source]

Restore a plaintext version from a string (which has been read from CSV). Reverses __str__(), not plaintext_str_core().

hashed(include_frequencies: bool = True) crate_anon.linkage.identifiers.Identifier[source]

For testing: hash this identifier by itself.

overlaps(other: crate_anon.linkage.identifiers.Identifier) bool[source]

Do self and other overlap in time?

Parameters

other – the other Identifier

For similar logic, see cardinal_pythonlib.interval.Interval.overlaps().
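
A temporal overlap test of this kind can be sketched as below. This is an illustration under stated assumptions, not the library's code: intervals_overlap() is a hypothetical free function, a date of None is assumed to mean "unbounded on that side", and both endpoints are treated as inclusive (first valid date, last valid date).

```python
from datetime import date
from typing import Optional


def intervals_overlap(
    start_a: Optional[date],
    end_a: Optional[date],
    start_b: Optional[date],
    end_b: Optional[date],
) -> bool:
    """Inclusive overlap test; None means 'unbounded on that side'."""
    # A must start no later than B ends...
    a_starts_in_time = start_a is None or end_b is None or start_a <= end_b
    # ...and B must start no later than A ends.
    b_starts_in_time = start_b is None or end_a is None or start_b <= end_a
    return a_starts_in_time and b_starts_in_time


old_address = (date(2000, 1, 1), date(2005, 12, 31))
same_day_move = (date(2005, 12, 31), None)
later_move = (date(2006, 1, 1), None)
```

Under these assumptions, two identifiers with no dates at all always overlap, which matches the idea that an undated identifier behaves as a non-temporal one.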

abstract plaintext_str_core() str[source]

Represents the identifier in plaintext, for CSV. Potentially encapsulated within more information by __str__().

class crate_anon.linkage.identifiers.IdentifierFourState(*args, **kwargs)[source]

Identifier that supports a four-state comparison.

__init__(*args, **kwargs) None[source]
Parameters
  • cfg – A configuration object. Can be None but you have to specify that manually.

  • is_plaintext – Is this an identifiable (plaintext) version? If False, then it is a de-identified (hashed) version, whose internal structure can be more complex.

  • temporal – Store start/end dates (which can be None) along with the information?

  • start_date – The start date (first valid date), or None.

  • end_date – The end date (last valid date), or None.

comparison(candidate_id: crate_anon.linkage.identifiers.IdentifierFourState) Optional[crate_anon.linkage.comparison.Comparison][source]

See IdentifierTwoState.comparison().

abstract partially_matches_second(other: crate_anon.linkage.identifiers.IdentifierFourState) bool[source]

Does this identifier partially match the other on the second fuzzy identifier?

You can assume that self.comparison_relevant(other) is True.

class crate_anon.linkage.identifiers.IdentifierThreeState(*args, **kwargs)[source]

Identifier that supports a three-state comparison.

__init__(*args, **kwargs) None[source]
Parameters
  • cfg – A configuration object. Can be None but you have to specify that manually.

  • is_plaintext – Is this an identifiable (plaintext) version? If False, then it is a de-identified (hashed) version, whose internal structure can be more complex.

  • temporal – Store start/end dates (which can be None) along with the information?

  • start_date – The start date (first valid date), or None.

  • end_date – The end date (last valid date), or None.

comparison(candidate_id: crate_anon.linkage.identifiers.IdentifierThreeState) Optional[crate_anon.linkage.comparison.Comparison][source]

See IdentifierTwoState.comparison().

abstract partially_matches(other: crate_anon.linkage.identifiers.IdentifierThreeState) bool[source]

Does this identifier partially match the other?

You can assume that self.comparison_relevant(other) is True.

class crate_anon.linkage.identifiers.IdentifierTwoState(*args, **kwargs)[source]

Identifier that supports a two-state comparison.

__init__(*args, **kwargs) None[source]
Parameters
  • cfg – A configuration object. Can be None but you have to specify that manually.

  • is_plaintext – Is this an identifiable (plaintext) version? If False, then it is a de-identified (hashed) version, whose internal structure can be more complex.

  • temporal – Store start/end dates (which can be None) along with the information?

  • start_date – The start date (first valid date), or None.

  • end_date – The end date (last valid date), or None.

comparison(candidate_id: crate_anon.linkage.identifiers.IdentifierTwoState) Optional[crate_anon.linkage.comparison.Comparison][source]

Compare our identifier to another of the same type. Return None if you wish to draw no conclusions (e.g. there is missing information, or temporally defined identifiers do not overlap).

You should assume that frequency information must be present on the “self” side (this should be the proband); it may be missing from the “other” side (the candidate).

This is a high-speed function; pre-cache any fixed information that requires multi-stage lookup.
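
As a rough sketch of what a two-state comparison computes, the function below returns a log likelihood ratio for a match or a mismatch, or None when either side has no information. It uses a textbook probabilistic-linkage formula chosen for illustration; the function name and the parameters p_f (population frequency of the proband's value) and p_e (error probability) are assumptions, not CRATE's API.

```python
import math
from typing import Optional


def two_state_log_likelihood_ratio(
    proband_value: Optional[str],
    candidate_value: Optional[str],
    p_f: float,
    p_e: float,
) -> Optional[float]:
    """
    Natural-log likelihood ratio, ln[P(D | H) / P(D | not H)], where H is
    "proband and candidate are the same person". Returns None if either
    value is missing, i.e. no conclusion is drawn.
    """
    if not proband_value or not candidate_value:
        return None
    if proband_value == candidate_value:
        # Match: likely under H unless there was an error; chance p_f otherwise.
        return math.log((1 - p_e) / p_f)
    # Mismatch: requires an error under H; chance 1 - p_f otherwise.
    return math.log(p_e / (1 - p_f))


llr_match = two_state_log_likelihood_ratio("F", "F", p_f=0.5, p_e=0.01)
llr_mismatch = two_state_log_likelihood_ratio("F", "M", p_f=0.5, p_e=0.01)
llr_missing = two_state_log_likelihood_ratio("F", "", p_f=0.5, p_e=0.01)
```

Note that p_f and p_e are taken from the proband side only, consistent with the requirement that frequency information be present on "self" but not necessarily on the candidate.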

abstract fully_matches(other: crate_anon.linkage.identifiers.IdentifierTwoState) bool[source]

Does this identifier fully match the other?

You can assume that self.comparison_relevant(other) is True.

warn_if_llr_order_unexpected(full: crate_anon.linkage.comparison.DirectComparison, partials: Optional[List[crate_anon.linkage.comparison.DirectComparison]] = None) None[source]

Partial/full comparisons are not guaranteed to be ordered as you might expect; an example is in the validation paper (and in other_examples_for_paper.py). Nor are all partial/full matches guaranteed to yield better evidence for H than a complete mismatch. However, that’s what you might expect. This function warns the user if that’s not the case.
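
The kind of check described here can be sketched with plain numbers. The function below is hypothetical: it takes log likelihood ratios as floats rather than DirectComparison objects, and it takes the mismatch value as an explicit argument, which the real method does not.

```python
import warnings
from typing import Sequence


def warn_if_order_unexpected(
    llr_full: float, llr_partials: Sequence[float], llr_mismatch: float
) -> bool:
    """
    Expect full >= each partial (in the order given) >= mismatch.
    Emit a warning and return False if that does not hold.
    """
    expected = [llr_full, *llr_partials, llr_mismatch]
    in_order = all(a >= b for a, b in zip(expected, expected[1:]))
    if not in_order:
        warnings.warn(f"Unexpected log likelihood ratio order: {expected}")
    return in_order
```

The function only warns and does not raise, since an unexpected order is possible with legitimate frequencies and is not necessarily an error.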

Parameters
  • full – Comparisons for the “full match” condition.

  • partials – Comparisons for “partial match” conditions.

class crate_anon.linkage.identifiers.PerfectID(cfg: crate_anon.linkage.matchconfig.MatchConfig, identifiers: Optional[Dict[str, Any]] = None)[source]

For comparing people based on one or more perfect ID values.

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, identifiers: Optional[Dict[str, Any]] = None) None[source]

The identifier values will be converted to strings, if they aren’t already.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

Represents the object in a dictionary suitable for JSON serialization, for the de-identified (hashed) version.

Parameters
  • encrypt – Encrypt the contents when writing, creating a hashed version.

  • include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.

comparison(candidate_id: crate_anon.linkage.identifiers.PerfectID) Optional[crate_anon.linkage.comparison.Comparison][source]

Compare our identifier to another of the same type. Return None if you wish to draw no conclusions (e.g. there is missing information, or temporally defined identifiers do not overlap).

You should assume that frequency information must be present on the “self” side (this should be the proband); it may be missing from the “other” side (the candidate).

This is a high-speed function; pre-cache any fixed information that requires multi-stage lookup.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

fully_matches(other: crate_anon.linkage.identifiers.PerfectID) bool[source]

Does this identifier fully match the other?

You can assume that self.comparison_relevant(other) is True.

plaintext_str_core() str[source]

Represents the identifier in plaintext, for CSV. Potentially encapsulated within more information by __str__().

class crate_anon.linkage.identifiers.Postcode(cfg: crate_anon.linkage.matchconfig.MatchConfig, postcode: str = '', start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None)[source]

Represents a UK postcode.

Note that we store nationwide frequencies. Final adjustment by k_postcode is only done at the last moment, allowing k_postcode to vary without having to change a hashed frequency file. Similarly for the probability of a postcode being unknown. So stored frequencies may be None.
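
The idea of deferring the adjustment can be sketched as follows. This assumes that k_postcode acts as a simple multiplier on the stored nationwide frequency, which the text above does not state; the function name and the cap at 1.0 are likewise assumptions made for illustration.

```python
from typing import Optional


def adjusted_postcode_frequency(
    nationwide_freq: Optional[float], k_postcode: float
) -> Optional[float]:
    """
    Apply k_postcode at comparison time, leaving the stored value untouched.
    A stored frequency of None (unknown) stays None.
    """
    if nationwide_freq is None:
        return None
    # Assumed form of the adjustment: multiply, then cap at a valid probability.
    return min(1.0, k_postcode * nationwide_freq)
```

Because only the unadjusted value is stored, changing k_postcode in the config would not require regenerating a hashed frequency file.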

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, postcode: str = '', start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None)[source]

Plaintext creation of a postcode.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

For JSON.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

classmethod from_dict(cfg: crate_anon.linkage.matchconfig.MatchConfig, d: Dict[str, Any], hashed: bool) crate_anon.linkage.identifiers.Postcode[source]

Creation of a hashed postcode, ultimately from JSON.

classmethod from_plaintext_str(cfg: crate_anon.linkage.matchconfig.MatchConfig, x: str) crate_anon.linkage.identifiers.Postcode[source]

Creation from CSV.

fully_matches(other: crate_anon.linkage.identifiers.Postcode) bool[source]

Does this identifier fully match the other?

You can assume that self.comparison_relevant(other) is True.

partially_matches(other: crate_anon.linkage.identifiers.Postcode) bool[source]

Does this identifier partially match the other?

You can assume that self.comparison_relevant(other) is True.

plaintext_str_core() str[source]

For CSV.

class crate_anon.linkage.identifiers.Surname(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '', start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None)[source]

Represents a surname (family name).

Identifiably, we store the unmodified (unstandardized) name.

We don’t inherit from BasicName, but from Identifier, because surnames need to deal with “fragment” problems.

We need to be able to match on parts. For example, “van Beethoven” should match “van Beethoven” but also “Beethoven”. What frequency should we use for those parts? This has to be the frequency of the part (not the composite). For example, if someone is called “Mozart-Smith”, then a match on “Mozart-Smith” or “Mozart” is less likely in the population, and thus more informative, than a match on “Smith”. So, we need frequency information associated with each part.
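
A minimal sketch of fragment handling is shown below. It is not CRATE's algorithm: the splitting rule (whitespace and hyphens), the function names and the frequency values are all invented for illustration, and choosing the rarest shared fragment is just one plausible way to use per-fragment frequencies.

```python
import re
from typing import Dict, List, Optional


def surname_fragments(surname: str) -> List[str]:
    """Whole name first, then its parts split on whitespace and hyphens."""
    whole = surname.strip().upper()
    fragments = [whole]
    for part in re.split(r"[\s\-]+", whole):
        if part and part not in fragments:
            fragments.append(part)
    return fragments


def rarest_shared_fragment(
    a: str, b: str, freq: Dict[str, float]
) -> Optional[str]:
    """Return the shared fragment with the lowest frequency, or None."""
    b_fragments = surname_fragments(b)
    shared = [f for f in surname_fragments(a) if f in b_fragments]
    if not shared:
        return None
    # A rarer fragment is more informative, so prefer it.
    return min(shared, key=lambda f: freq[f])


# Invented frequencies, for illustration only.
example_freq = {"MOZART-SMITH": 1e-7, "MOZART": 1e-5, "SMITH": 1e-2}
```

In this sketch, "Mozart-Smith" compared with "Smith" can only match on the common fragment "SMITH", whereas "Mozart-Smith" compared with itself matches on the much rarer whole name.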

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '', start_date: Optional[Union[str, pendulum.date.Date]] = None, end_date: Optional[Union[str, pendulum.date.Date]] = None) None[source]
Parameters
  • cfg – A configuration object. Can be None but you have to specify that manually.

  • is_plaintext – Is this an identifiable (plaintext) version? If False, then it is a de-identified (hashed) version, whose internal structure can be more complex.

  • temporal – Store start/end dates (which can be None) along with the information?

  • start_date – The start date (first valid date), or None.

  • end_date – The end date (last valid date), or None.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

Represents the object in a dictionary suitable for JSON serialization, for the de-identified (hashed) version.

Parameters
  • encrypt – Encrypt the contents when writing, creating a hashed version.

  • include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.

comparison(candidate_id: crate_anon.linkage.identifiers.Surname) Optional[crate_anon.linkage.comparison.Comparison][source]

Specialized version for surname.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

classmethod from_dict(cfg: crate_anon.linkage.matchconfig.MatchConfig, d: Dict[str, Any], hashed: bool) crate_anon.linkage.identifiers.Surname[source]

Creation of a hashed name, ultimately from JSON.

classmethod from_plaintext_str(cfg: crate_anon.linkage.matchconfig.MatchConfig, x: str) crate_anon.linkage.identifiers.Surname[source]

Creation from CSV.

fully_matches(other: crate_anon.linkage.identifiers.Surname) bool[source]

Primarily for debugging; comparison() is used for real work.

partially_matches(other: crate_anon.linkage.identifiers.Surname) bool[source]

Primarily for debugging; comparison() is used for real work.

partially_matches_second(other: crate_anon.linkage.identifiers.Surname) bool[source]

Primarily for debugging; comparison() is used for real work.

plaintext_str_core() str[source]

Represents the identifier in plaintext, for CSV. Potentially encapsulated within more information by __str__().

set_gender(gender: str) None[source]

Special operation for identifiable reading.

class crate_anon.linkage.identifiers.SurnameFragment(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '')[source]

Collate information about a name fragment. This identifier is unlikely to be used directly for comparisons, but is used by Surname.

We don’t store dates; they are stored with the surname.

__init__(cfg: crate_anon.linkage.matchconfig.MatchConfig, name: str = '', gender: str = '') None[source]

Plaintext creation of a name.

Parameters
  • cfg – The config object.

  • name – (PLAINTEXT.) The name.

  • description – Used internally for verbose comparisons.

plaintext_str_core() str[source]

For CSV.

class crate_anon.linkage.identifiers.TemporalIDHolder(identifier: str, start_date: Optional[pendulum.date.Date] = None, end_date: Optional[pendulum.date.Date] = None)[source]

Limited class that allows no config and stores a plain string identifier. Used for representing postcodes between a database and CSV for validation.

__init__(identifier: str, start_date: Optional[pendulum.date.Date] = None, end_date: Optional[pendulum.date.Date] = None) None[source]
Parameters
  • cfg – A configuration object. Can be None but you have to specify that manually.

  • is_plaintext – Is this an identifiable (plaintext) version? If False, then it is a de-identified (hashed) version, whose internal structure can be more complex.

  • temporal – Store start/end dates (which can be None) along with the information?

  • start_date – The start date (first valid date), or None.

  • end_date – The end date (last valid date), or None.

as_dict(encrypt: bool = True, include_frequencies: bool = True) Dict[str, Any][source]

Represents the object in a dictionary suitable for JSON serialization, for the de-identified (hashed) version.

Parameters
  • encrypt – Encrypt the contents when writing, creating a hashed version.

  • include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.

comparison(candidate_id: crate_anon.linkage.identifiers.Identifier) Optional[crate_anon.linkage.comparison.Comparison][source]

Return a Comparison object (embodying the change in log odds) for a comparison between the “self” identifier (as the proband) and another, the candidate. Frequency information is expected to be on the “self” (proband) side.

ensure_has_freq_info_if_id_present() None[source]

If we have ID information but some frequency information is missing, raise ValueError. Used to check validity for probands; candidates do not have to fulfil this requirement.

plaintext_str_core() str[source]

Represents the identifier in plaintext, for CSV. Potentially encapsulated within more information by __str__().

crate_anon.linkage.identifiers.gen_best_comparisons(proband_identifiers: List[crate_anon.linkage.identifiers.Identifier], candidate_identifiers: List[crate_anon.linkage.identifiers.Identifier], ordered: bool = False, p_u: Optional[float] = None) Generator[crate_anon.linkage.comparison.Comparison, None, None][source]

Generates comparisons for two sequences of identifiers (one from the proband, one from the candidate), being indifferent to their order. The method, which needs to be fast, is described in NOTES_MULTIPLE_COMPARISONS, earlier in the module source.

Parameters
  • proband_identifiers – List of identifiers from the proband.

  • candidate_identifiers – List of comparable identifiers from the candidate.

  • ordered – Treat the comparison as an ordered one?

  • p_u – (Applicable if ordered is True.) The probability of being “unordered”, and the complement of p_o, where p_o is the probability, given the hypothesis H (proband and candidate are the same person) and that c > 1 identifiers are being compared, that the candidate identifiers will be in exactly the right order (that is, for all matches, the index of the candidate’s identifier is the same as the index of the proband’s identifier).
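
Since NOTES_MULTIPLE_COMPARISONS is not reproduced here, the following is only a sketch of one plausible order-indifferent approach: score every proband/candidate pair, sort best to worst, and greedily keep pairs whose indexes have not been used yet. The function name, the use of plain strings as identifiers and the toy scoring function are assumptions. The ordered/p_u adjustment is not implemented.

```python
from typing import Callable, Iterator, List, Optional, Set, Tuple


def greedy_best_pairs(
    proband: List[str],
    candidate: List[str],
    llr: Callable[[str, str], Optional[float]],
) -> Iterator[Tuple[int, int, float]]:
    """Yield (proband_idx, candidate_idx, llr), using each index at most once."""
    scored = []
    for i, p in enumerate(proband):
        for j, c in enumerate(candidate):
            value = llr(p, c)
            if value is not None:  # None means "no conclusion"; skip the pair
                # Sort on negative LLR, then squared index distance.
                scored.append((-value, (i - j) ** 2, i, j))
    scored.sort()
    used_p: Set[int] = set()
    used_c: Set[int] = set()
    for neg_llr, _, i, j in scored:
        if i in used_p or j in used_c:
            continue
        used_p.add(i)
        used_c.add(j)
        yield i, j, -neg_llr


def toy_llr(a: str, b: str) -> Optional[float]:
    """Invented scores: a fixed reward for equality, a fixed penalty otherwise."""
    return 5.0 if a == b else -2.0


result = list(greedy_best_pairs(["ALICE", "BETH"], ["BETH", "ALICE"], toy_llr))
```

Here the two names appear in opposite orders, and the greedy pass still pairs each proband name with its matching candidate name.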