14.4.12. crate_anon.linkage.person
crate_anon/linkage/person.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Person representations for fuzzy matching.
- class crate_anon.linkage.person.Person(cfg: MatchConfig, local_id: str = '', other_info: str = '', forenames: List[None | str | TemporalIDHolder | Forename] | None = None, surnames: List[None | str | TemporalIDHolder | Surname] | None = None, dob: None | str | DateOfBirth = '', gender: None | str | Gender = '', postcodes: List[None | str | TemporalIDHolder | Postcode] | None = None, perfect_id: None | Dict[str, Any] | PerfectID = None)[source]
A proper representation of a person that can do hashing and comparisons. The information may be incomplete or slightly wrong. Includes frequency information and requires a config.
- __init__(cfg: MatchConfig, local_id: str = '', other_info: str = '', forenames: List[None | str | TemporalIDHolder | Forename] | None = None, surnames: List[None | str | TemporalIDHolder | Surname] | None = None, dob: None | str | DateOfBirth = '', gender: None | str | Gender = '', postcodes: List[None | str | TemporalIDHolder | Postcode] | None = None, perfect_id: None | Dict[str, Any] | PerfectID = None) None [source]
- Parameters:
cfg – The config object.
local_id – Identifier within this person’s local database (e.g. proband ID or sample ID). Typically a research pseudonym, not itself identifying.
other_info – String containing any other attributes the user may wish to remember (e.g. in JSON). Only used for validation research (e.g. ensuring linkage is not biased by ethnicity).
forenames – The person’s forenames (given names, first/middle names), as strings or Forename or TemporalIDHolder objects.
surnames – The person’s surname(s), as strings or Surname or TemporalIDHolder objects.
dob – The date of birth, in ISO 8601 “YYYY-MM-DD” string format, or as a DateOfBirth object, or None, or ‘’.
gender – The gender: ‘M’, ‘F’, ‘X’, or ‘’, or None, or a Gender object.
postcodes – Any UK postcodes for this person, with optional associated dates.
perfect_id – Any named person-unique identifiers (e.g. UK NHS numbers, UK National Insurance numbers), for non-fuzzy matching. Dictionary keys will be forced to lower case, and dictionary values to upper case.
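The case normalization described for perfect_id can be illustrated with a minimal sketch. The helper name normalize_perfect_id and the identifier names in the example are hypothetical; this is not CRATE's own code, only an illustration of forcing keys to lower case and values to upper case.

```python
from typing import Dict


def normalize_perfect_id(identifiers: Dict[str, str]) -> Dict[str, str]:
    """
    Hypothetical helper (not part of CRATE): force dictionary keys to
    lower case and dictionary values to upper case.
    """
    return {
        key.lower(): value.upper()
        for key, value in identifiers.items()
    }


# Identifier names and values here are made up for illustration.
normalized = normalize_perfect_id(
    {"NHSNum": "1234567890", "NINO": "qq123456c"}
)
print(normalized)
```

Normalizing case on both sides means that two records carrying the same identifier compare equal regardless of how each source capitalized it.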
- as_dict(hashed: bool = True, include_frequencies: bool = True, include_other_info: bool = False) Dict[str, Any] [source]
Returns a dictionary representation of this person, suitable for serializing to JSON.
- Parameters:
hashed – Create a hashed/encrypted version?
include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.
include_other_info – Include the (potentially identifying) other_info data? Usually False; may be True for validation.
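As a rough sketch of how such dictionaries might be written out, the following uses only the standard json module to emit one JSON object per line (JSONL). The function name write_jsonl and the dictionary key local_id are assumptions for illustration; the actual keys produced by as_dict() are not listed here.

```python
import io
import json
from typing import Any, Dict, Iterable, TextIO


def write_jsonl(dicts: Iterable[Dict[str, Any]], stream: TextIO) -> None:
    """Hypothetical helper: write each dictionary as one line of JSON."""
    for d in dicts:
        stream.write(json.dumps(d) + "\n")


# In real use, each dictionary would come from something like
# person.as_dict(); the contents below are placeholders.
buffer = io.StringIO()
write_jsonl([{"local_id": "p1"}, {"local_id": "p2"}], buffer)
jsonl_lines = buffer.getvalue().splitlines()
print(jsonl_lines)
```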
- copy() Person [source]
Returns a copy of this object.
copy.deepcopy() is incredibly slow, yet copy.copy() isn’t enough when we want to mutate this object.
We previously did this quasi-manually, copying attributes but using [copy.copy(x) for x in value] if the value was a list.
However, since we have functions to convert to/from a dict representation, we may as well use them.
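The dict round-trip approach to copying can be sketched with a small stand-in class. Record, its attributes, and from_dict are hypothetical and far simpler than Person; the point is only that rebuilding an object from its dictionary form yields a copy that shares no mutable state with the original.

```python
from typing import Any, Dict, List


class Record:
    """Hypothetical stand-in for an object with dict conversion methods."""

    def __init__(self, name: str, tags: List[str]) -> None:
        self.name = name
        self.tags = tags

    def as_dict(self) -> Dict[str, Any]:
        # Build a fresh list so the dict shares no mutable state with self.
        return {"name": self.name, "tags": list(self.tags)}

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Record":
        return cls(name=d["name"], tags=list(d["tags"]))

    def copy(self) -> "Record":
        # Copy by converting to a dict and back again.
        return self.from_dict(self.as_dict())


original = Record("a", ["x"])
duplicate = original.copy()
duplicate.tags.append("y")  # mutating the copy leaves the original alone
print(original.tags, duplicate.tags)
```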
- debug_compare(candidate: Person, verbose: bool = True) None [source]
Compare a person with another, and log every step of the way.
- debug_comparison_report(candidate: Person, verbose: bool = True) str [source]
Compare a person with another, log every step of the way, and return the result as a string.
- debug_gen_identifiers() Generator[Identifier, None, None] [source]
Yield all identifiers.
- ensure_valid_as_candidate() None [source]
Ensures this person has sufficient information to act as a candidate, or raises AssertionError.
We previously required a DOB unless debugging, but no longer.
- ensure_valid_as_proband() None [source]
Ensures this person has sufficient information to act as a proband, or raises ValueError.
We previously required a DOB unless debugging, but no longer.
- classmethod from_json_dict(cfg: MatchConfig, d: Dict[str, Any], hashed: bool = True) Person [source]
Restore a hashed or plaintext version from a dictionary (which has been read from JSONL).
- classmethod from_json_str(cfg: MatchConfig, s: str) Person [source]
Restore a hashed version from a string representing JSON.
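A minimal sketch of the reading side, assuming a JSONL source with one JSON object per line: each line is parsed into a dictionary of the kind that from_json_dict() expects. The generator name and the key local_id are hypothetical; blank lines are skipped here as a defensive assumption.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def gen_dicts_from_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield one dictionary per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        yield json.loads(line)


# In real use, the lines would come from an open file; each resulting
# dictionary could then be handed to Person.from_json_dict(cfg, d).
sample_lines = ['{"local_id": "p1"}', "", '{"local_id": "p2"}']
parsed = list(gen_dicts_from_jsonl(sample_lines))
print(parsed)
```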
- classmethod from_plaintext_csv(cfg: MatchConfig, rowdict: Dict[str, str]) Person [source]
Returns a Person object from a CSV row.
- Parameters:
cfg – a configuration object
rowdict – a CSV row, read via csv.DictReader.
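For reference, csv.DictReader yields one dictionary per data row, keyed by the header line, with every value as a string. The sketch below shows the shape of such a row dictionary; the column names are assumptions chosen for illustration and are not necessarily the names that from_plaintext_csv() expects.

```python
import csv
import io

# Hypothetical CSV content; the column names are illustrative only.
csv_text = "local_id,dob,gender\np1,1980-01-31,F\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Each element of rows is a Dict[str, str], the type that the rowdict
# parameter takes.
print(rows[0])
```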
- hashed(include_frequencies: bool = True, include_other_info: bool = False) Person [source]
Returns a Person object but with all the elements hashed (if they are not blank).
Note that you do NOT need to do this just to write a hashed version to disk. This function is primarily for comparing an entire sample of hashed people to plaintext people, or vice versa; we hash the plaintext version first.
- Parameters:
include_frequencies – Include frequency information. If you don’t, this makes the resulting file suitable for use as a sample, but not as a proband file.
include_other_info – Include the (potentially identifying) other_info data? Usually False; may be True for validation.
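The idea of hashing elements only when they are not blank can be sketched as follows. The choice of HMAC-SHA256 and the helper name hash_if_not_blank are assumptions for illustration; the hashing method that CRATE actually applies is determined by its configuration and is not described here.

```python
import hashlib
import hmac


def hash_if_not_blank(value: str, key: bytes) -> str:
    """
    Hypothetical helper: return a keyed hash of value, or the blank
    value unchanged. HMAC-SHA256 is an assumed algorithm.
    """
    if not value:
        return value  # blank stays blank, so "missing" remains detectable
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


secret_key = b"example-key-not-for-real-use"
hashed_name = hash_if_not_blank("SMITH", secret_key)
hashed_blank = hash_if_not_blank("", secret_key)
print(hashed_name, repr(hashed_blank))
```

Because the hash is deterministic for a given key, a plaintext sample hashed with the same key can be compared directly against an already hashed sample.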
- log_odds_same(candidate: Person) float [source]
Returns the log odds that self (the proband) and candidate are the same person.
- Parameters:
candidate – another Person object
- Returns:
the log odds they’re the same person
- Return type:
float
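To interpret a log odds value, it can be converted to a probability with the logistic function. The sketch below assumes natural-logarithm log odds (the base is not stated here), and probability_from_log_odds is a hypothetical helper rather than part of this module.

```python
import math


def probability_from_log_odds(log_odds: float) -> float:
    """
    Hypothetical helper: convert natural-log odds to a probability,
    p = 1 / (1 + exp(-log_odds)), written to avoid overflow.
    """
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    # For negative inputs, exp(log_odds) is at most 1, so it cannot overflow.
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Log odds of zero means even odds; odds of 3:1 mean a probability of 0.75.
even = probability_from_log_odds(0.0)
three_to_one = probability_from_log_odds(math.log(3))
print(even, three_to_one)
```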