14.4.12. crate_anon.linkage.person

crate_anon/linkage/person.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Person representations for fuzzy matching.

class crate_anon.linkage.person.Person(cfg: MatchConfig, local_id: str = '', other_info: str = '', forenames: List[None | str | TemporalIDHolder | Forename] | None = None, surnames: List[None | str | TemporalIDHolder | Surname] | None = None, dob: None | str | DateOfBirth = '', gender: None | str | Gender = '', postcodes: List[None | str | TemporalIDHolder | Postcode] | None = None, perfect_id: None | Dict[str, Any] | PerfectID = None)

A proper representation of a person that can do hashing and comparisons. The information may be incomplete or slightly wrong. Includes frequency information and requires a config.

__init__(cfg: MatchConfig, local_id: str = '', other_info: str = '', forenames: List[None | str | TemporalIDHolder | Forename] | None = None, surnames: List[None | str | TemporalIDHolder | Surname] | None = None, dob: None | str | DateOfBirth = '', gender: None | str | Gender = '', postcodes: List[None | str | TemporalIDHolder | Postcode] | None = None, perfect_id: None | Dict[str, Any] | PerfectID = None) → None
Parameters:
  • cfg – The config object.

  • local_id – Identifier within this person’s local database (e.g. proband ID or sample ID). Typically a research pseudonym, not itself identifying.

  • other_info – String containing any other attributes the user may wish to remember (e.g. in JSON). Only used for validation research (e.g. ensuring linkage is not biased by ethnicity).

  • forenames – The person’s forenames (given names, first/middle names), as strings or Forename or TemporalIDHolder objects.

  • surnames – The person’s surname(s), as strings or Surname or TemporalIDHolder objects.

  • dob – The date of birth, in ISO 8601 “YYYY-MM-DD” string format, or as a DateOfBirth object, or None, or ‘’.

  • gender – The gender: ‘M’, ‘F’, ‘X’, or ‘’, or None, or a Gender object.

  • postcodes – Any UK postcodes for this person, with optional associated dates.

  • perfect_id – Any named person-unique identifiers (e.g. UK NHS numbers, UK National Insurance numbers), for non-fuzzy matching. Dictionary keys will be forced to lower case, and dictionary values to upper case.
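
The normalisation described for perfect_id, and the date format expected for dob, can be illustrated with a minimal sketch. The helper names normalize_perfect_id and parse_dob are hypothetical and are not part of CRATE; the sketch only mimics the documented behaviour (keys forced to lower case, values to upper case; ISO 8601 dates) using the standard library.

```python
from datetime import date
from typing import Any, Dict, Optional


def normalize_perfect_id(identifiers: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: lower-case the keys, upper-case the values."""
    return {str(k).lower(): str(v).upper() for k, v in identifiers.items()}


def parse_dob(dob: Optional[str]) -> Optional[date]:
    """Hypothetical helper: treat None or '' as missing, else parse YYYY-MM-DD."""
    if not dob:
        return None
    return date.fromisoformat(dob)  # raises ValueError on a malformed date


# Made-up identifier names and values, for illustration only.
ids = normalize_perfect_id({"NHSNum": "abc123", "NI": "qq123456c"})
dob = parse_dob("1950-01-31")
```

Normalising case on both sides means that exact (non-fuzzy) matching does not fail merely because two sources capitalised an identifier differently.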

as_dict(hashed: bool = True, include_frequencies: bool = True, include_other_info: bool = False) → Dict[str, Any]

Returns a dictionary representation of this person, suitable for serialization to JSON.

Parameters:
  • hashed – Create a hashed/encrypted version?

  • include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.

  • include_other_info – Include the (potentially identifying) other_info data? Usually False; may be True for validation.

copy() → Person

Returns a copy of this object.

  • copy.deepcopy() is incredibly slow, yet copy.copy() isn’t enough when we want to mutate this object.

  • We previously did this quasi-manually, copying attributes but using [copy.copy(x) for x in value] if the value was a list.

  • However, since we have functions to convert to/from a dict representation, we may as well use them.
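
The reasoning above can be sketched with a toy class. ToyPerson is a stand-in invented for this illustration, not the CRATE Person class, and its as_dict()/from_dict() methods are simplified assumptions; the point is only that a shallow copy shares its lists with the original, whereas a round trip through a dictionary does not.

```python
import copy
from typing import Any, Dict, List


class ToyPerson:
    """Toy stand-in for illustration only; not the CRATE Person class."""

    def __init__(self, forenames: List[str], postcodes: List[str]) -> None:
        self.forenames = list(forenames)
        self.postcodes = list(postcodes)

    def as_dict(self) -> Dict[str, Any]:
        # Build fresh lists so the dictionary shares nothing mutable with self.
        return {
            "forenames": list(self.forenames),
            "postcodes": list(self.postcodes),
        }

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "ToyPerson":
        return cls(forenames=d["forenames"], postcodes=d["postcodes"])

    def copy(self) -> "ToyPerson":
        # Round-trip through the dictionary representation.
        return self.from_dict(self.as_dict())


original = ToyPerson(["ALICE", "ZOE"], ["CB2 0QQ"])
shallow = copy.copy(original)   # shares the same list objects
independent = original.copy()   # has its own lists

shallow.forenames.append("X")   # also changes original.forenames
independent.postcodes.clear()   # leaves original.postcodes untouched
```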

debug_compare(candidate: Person, verbose: bool = True) → None

Compare a person with another, and log every step of the way.

debug_comparison_report(candidate: Person, verbose: bool = True) → str

Compare a person with another, log every step of the way, and return the result as a string.

debug_delete_something() → None

Randomly delete one of: a forename, or a postcode.

debug_gen_identifiers() → Generator[Identifier, None, None]

Yield all identifiers.

debug_mutate_something() → None

Randomly mutate one of: a forename, or a postcode.

ensure_valid_as_candidate() → None

Ensures this person has sufficient information to act as a candidate, or raises AssertionError.

We previously required a DOB unless debugging, but no longer.

ensure_valid_as_proband() → None

Ensures this person has sufficient information to act as a proband, or raises ValueError.

We previously required a DOB unless debugging, but no longer.

classmethod from_json_dict(cfg: MatchConfig, d: Dict[str, Any], hashed: bool = True) → Person

Restore a hashed or plaintext version from a dictionary (which has been read from JSONL).

classmethod from_json_str(cfg: MatchConfig, s: str) → Person

Restore a hashed version from a string representing JSON.
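
As a rough illustration of the JSONL handling that these two class methods sit behind, the sketch below reads a JSON Lines stream into dictionaries using only the standard library. The function gen_dicts_from_jsonl and the record keys are hypothetical placeholders; they do not reflect CRATE's actual file schema.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator


def gen_dicts_from_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dictionary per non-blank line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the key names are placeholders, not CRATE's schema.
records = [
    {"local_id": "p1", "example_key": "a"},
    {"local_id": "p2", "example_key": "b"},
]
jsonl_text = "".join(json.dumps(r) + "\n" for r in records)

restored = list(gen_dicts_from_jsonl(io.StringIO(jsonl_text)))
# With CRATE available, each dictionary d could then be passed on, e.g.
#   person = Person.from_json_dict(cfg, d, hashed=True)
```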

classmethod from_plaintext_csv(cfg: MatchConfig, rowdict: Dict[str, str]) → Person

Returns a Person object from a CSV row.

Parameters:
  • cfg – a configuration object

  • rowdict – a CSV row, read via csv.DictReader.
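
A minimal sketch of how such a rowdict arises from csv.DictReader follows. The column names and values are assumptions made for illustration; the real column names are those returned by plaintext_csv_columns().

```python
import csv
import io

# Hypothetical column names and values, for illustration only. The real
# column names are those returned by Person.plaintext_csv_columns().
csv_text = (
    "local_id,forenames,surnames,dob,gender\n"
    "p1,ALICE,SMITH,1950-01-31,F\n"
    "p2,BOB,JONES,,M\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Each row is a Dict[str, str]; an empty cell arrives as '' rather than None.
# With CRATE available, each row could be passed on, e.g.
#   person = Person.from_plaintext_csv(cfg, rowdict=row)
```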

has_dob() → bool

Do we have a DOB?

hashed(include_frequencies: bool = True, include_other_info: bool = False) → Person

Returns a Person object but with all the elements hashed (if they are not blank).

Note that you do NOT need to do this just to write a hashed version to disk. This function is primarily for comparing an entire sample of hashed people to plaintext people, or vice versa; we hash the plaintext version first.

Parameters:
  • include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.

  • include_other_info – Include the (potentially identifying) other_info data? Usually False; may be True for validation.
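
The idea of hashing the plaintext side first, so that like is compared with like, can be sketched as follows. This is an assumption-laden illustration: it uses HMAC-SHA256 from the standard library as a stand-in keyed hash, and the function keyed_hash and the key are hypothetical. The hash method and key that CRATE actually uses are determined by the config object and may differ.

```python
import hashlib
import hmac


def keyed_hash(key: bytes, plaintext: str) -> str:
    """Stand-in keyed hash (HMAC-SHA256); blank values stay blank."""
    if not plaintext:
        return ""
    return hmac.new(key, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-secret-key"  # placeholder; never hard-code a real key

# A value from a sample that was hashed earlier, and a plaintext proband value.
hashed_sample_forename = keyed_hash(key, "ALICE")
plaintext_proband_forename = "ALICE"

# Hash the plaintext side first, then compare like with like.
same = hmac.compare_digest(
    keyed_hash(key, plaintext_proband_forename), hashed_sample_forename
)
```

Both sides must use the same key and hash method; otherwise identical plaintext values produce different hashes and never match.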

is_hashed() → bool

Is this a hashed (de-identified) Person?

is_plaintext() → bool

Is this a plaintext (identifiable) Person?

log_odds_same(candidate: Person) → float

Returns the log odds that self (the proband) and candidate are the same person.

Parameters:
  • candidate – another Person object

Returns:
  the log odds that they are the same person

Return type:
  float
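
To interpret a log odds value, it can help to convert it to a probability. The sketch below assumes natural-logarithm odds, which this page does not state explicitly, and uses hypothetical helper names; it shows only the standard logistic conversion, not how CRATE computes the log odds in the first place.

```python
import math


def probability_from_log_odds(log_odds: float) -> float:
    """Convert natural-log odds to a probability via the logistic function."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    # Equivalent form that avoids overflow for very negative inputs.
    e = math.exp(log_odds)
    return e / (1.0 + e)


def log_odds_from_probability(p: float) -> float:
    """Inverse conversion; requires 0 < p < 1."""
    return math.log(p / (1.0 - p))


p_even = probability_from_log_odds(0.0)   # odds of 1:1
p_high = probability_from_log_odds(5.0)   # strongly favours "same person"
p_low = probability_from_log_odds(-5.0)   # strongly favours "different"
```

Log odds of zero correspond to even odds; positive values favour a match and negative values favour a non-match, symmetrically.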

n_forenames() → int

How many forenames does this person have?

n_postcodes() → int

How many postcodes does this person have?

static plain_or_hashed_txt(plaintext: bool) → str

Used for error messages.

classmethod plaintext_csv_columns() → List[str]

CSV column names – including user-specified “other” information.

plaintext_csv_dict() → Dict[str, str]

Returns a dictionary suitable for csv.DictWriter. This is for writing identifiable content.
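
A minimal sketch of writing such dictionaries with csv.DictWriter follows. The columns and rows are hypothetical; in practice they would come from plaintext_csv_columns() and plaintext_csv_dict() respectively.

```python
import csv
import io

# Hypothetical columns and rows; in practice these would come from
# Person.plaintext_csv_columns() and person.plaintext_csv_dict().
columns = ["local_id", "forenames", "dob"]
rowdicts = [
    {"local_id": "p1", "forenames": "ALICE", "dob": "1950-01-31"},
    {"local_id": "p2", "forenames": "BOB", "dob": ""},
]

buffer = io.StringIO(newline="")  # let the csv module control line endings
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rowdicts)
csv_output = buffer.getvalue()
```

Because the output is identifiable, a file written this way needs the same protection as the source data.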