14.4.6. crate_anon.linkage.fuzzy_id_match

crate_anon/linkage/fuzzy_id_match.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Fuzzy matching with hashed identifiers.

See paper: Cardinal et al. (2023), https://pubmed.ncbi.nlm.nih.gov/37147600/.

class crate_anon.linkage.fuzzy_id_match.Commands

Main commands.

crate_anon.linkage.fuzzy_id_match.add_basic_options(parser: ArgumentParser) → None

Adds a subparser for global options.

crate_anon.linkage.fuzzy_id_match.add_comparison_options(parser: ArgumentParser, proband_is_hashed: bool = True, sample_is_hashed: bool = True) → None

Adds a subparser for comparisons.

crate_anon.linkage.fuzzy_id_match.add_config_options(parser: ArgumentParser) → None

Adds a subparser for MatchConfig options (except the hasher options, which are handled separately). These live in their own function because validate_fuzzy_linkage.py uses them too.

crate_anon.linkage.fuzzy_id_match.add_error_probabilities(parser: ArgumentParser) → None

Adds a subparser for error probabilities.

crate_anon.linkage.fuzzy_id_match.add_hasher_options(parser: ArgumentParser) → None

Adds a subparser for hasher options.

crate_anon.linkage.fuzzy_id_match.add_matching_rules(parser: ArgumentParser) → None

Adds a subparser for matching rules.

crate_anon.linkage.fuzzy_id_match.add_subparsers(parser: ArgumentParser) → _SubParsersAction

Adds global-only options and subparsers.
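The add_* helpers above follow a common argparse pattern: each function attaches one group of options to a parser, so that several command-line tools can share them. The following is a minimal sketch of that pattern using only the standard library. The function names, option names, and the "hash" subcommand are hypothetical and are not CRATE's actual options.

```python
import argparse


def demo_add_basic_options(parser: argparse.ArgumentParser) -> None:
    # Hypothetical global option, for illustration only.
    group = parser.add_argument_group("display options")
    group.add_argument("--verbose", action="store_true", help="Be verbose")


def demo_add_hasher_options(parser: argparse.ArgumentParser) -> None:
    # Hypothetical hasher option, for illustration only.
    group = parser.add_argument_group("hasher options")
    group.add_argument("--key", type=str, default="demo_key", help="Hash key")


def demo_build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo_linkage")
    demo_add_basic_options(parser)  # global options live on the top-level parser
    subparsers = parser.add_subparsers(dest="command", required=True)
    hash_parser = subparsers.add_parser("hash", help="Hash an identity file")
    demo_add_hasher_options(hash_parser)  # options reused per subcommand
    hash_parser.add_argument("--input", required=True, help="Input filename")
    return parser


demo_args = demo_build_parser().parse_args(
    ["--verbose", "hash", "--input", "people.csv"]
)
```

Keeping each option group in its own function means a second tool can call the same helper on its own parser and obtain identical options.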

crate_anon.linkage.fuzzy_id_match.compare_probands_to_sample(cfg: MatchConfig, probands: People, sample: People, output_filename: str) → None

Compares each proband to the sample. Writes to an output file. Order is retained.

See the notes in the source code regarding parallel processing.

Parameters:
  • cfg – The main MatchConfig object.

  • probands – A People object (the probands).

  • sample – A People object (the sample).

  • output_filename – Output CSV filename.
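The overall shape of this operation (one output row per proband, in input order) can be sketched as follows. This is a toy illustration only: the dictionary-based Person type, the field names, and the scoring function are hypothetical stand-ins, and the score simply counts identical fields rather than using CRATE's matching method.

```python
import csv
import io
from typing import Dict, List

Person = Dict[str, str]  # hypothetical stand-in for a person record


def toy_score(a: Person, b: Person) -> int:
    """Placeholder score: number of non-ID fields with identical values."""
    return sum(1 for k in a if k != "local_id" and a[k] == b.get(k))


def compare_all(probands: List[Person], sample: List[Person]) -> str:
    """Return CSV text with one row per proband, in input order.

    Assumes a non-empty sample.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["proband_id", "best_candidate_id", "score"])
    for proband in probands:
        best = max(sample, key=lambda cand: toy_score(proband, cand))
        writer.writerow(
            [proband["local_id"], best["local_id"], toy_score(proband, best)]
        )
    return out.getvalue()


demo_sample = [
    {"local_id": "s1", "forename": "ALICE", "surname": "SMITH"},
    {"local_id": "s2", "forename": "BOB", "surname": "JONES"},
]
demo_probands = [
    {"local_id": "p1", "forename": "BOB", "surname": "JONES"},
    {"local_id": "p2", "forename": "ALICE", "surname": "SMYTH"},
]
demo_output = compare_all(demo_probands, demo_sample)
```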

crate_anon.linkage.fuzzy_id_match.compare_probands_to_sample_from_files(cfg: MatchConfig, probands_filename: str, sample_filename: str, output_filename: str, probands_plaintext: bool = True, sample_plaintext: bool = True, sample_cache_filename: str = '', profile: bool = False) → None

Compares each of the people in the probands file to the sample file.

Parameters:
  • cfg – The main MatchConfig object.

  • probands_filename – Filename of people (probands); see read_people().

  • sample_filename – Filename of people (sample); see read_people().

  • output_filename – Output filename.

  • sample_cache_filename – File in which to cache sample, for speed.

  • probands_plaintext – Is the probands file plaintext (not hashed)?

  • sample_plaintext – Is the sample file plaintext (not hashed)?

  • profile – Profile the code?

crate_anon.linkage.fuzzy_id_match.get_cfg_from_args(args: Namespace, require_hasher: bool, require_main_config: bool, require_error: bool, require_matching: bool) → MatchConfig

Return a MatchConfig object from our standard arguments. Uses defaults where not specified.

crate_anon.linkage.fuzzy_id_match.get_demo_csv() → str

A demonstration CSV file, as text.

crate_anon.linkage.fuzzy_id_match.get_demo_people(cfg: MatchConfig | None = None) → List[Person]

Some demonstration records. All data are fictional. The postcodes are real but are institutional, not residential, addresses in Cambridge.

crate_anon.linkage.fuzzy_id_match.hash_identity_file(cfg: MatchConfig, input_filename: str, output_filename: str, include_frequencies: bool = True, include_other_info: bool = False) → None

Hash a file of identifiable people to a hashed version. Order is preserved.

Parameters:
  • cfg – The main MatchConfig object.

  • input_filename – Input (plaintext) CSV filename to read.

  • output_filename – Output (hashed) CSV filename to write.

  • include_frequencies – Include frequency information. Without this, the resulting file is suitable for use as a sample, but not as a proband file.

  • include_other_info – Include the (potentially identifying) other_info data? Usually False; may be True for validation.
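The general idea of converting a plaintext identity file to a hashed one, row by row and in order, can be sketched with a keyed hash from the standard library. This sketch assumes HMAC-SHA256 and a simple CSV layout with hypothetical column names. CRATE's real hasher, file format, and handling of frequencies and other_info are not reproduced here.

```python
import csv
import hashlib
import hmac
import io

DEMO_KEY = b"not_a_real_key"  # hypothetical key, for illustration only


def hash_value(key: bytes, value: str) -> str:
    """Keyed hash of one field value (HMAC-SHA256, hex-encoded)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def hash_csv_text(key: bytes, plaintext_csv: str) -> str:
    """Hash every field of every row; header and row order are preserved."""
    reader = csv.DictReader(io.StringIO(plaintext_csv))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow({col: hash_value(key, val) for col, val in row.items()})
    return out.getvalue()


demo_plaintext = "forename,surname\nALICE,SMITH\nBOB,JONES\n"
demo_hashed = hash_csv_text(DEMO_KEY, demo_plaintext)
```

Because the same key and value always produce the same digest, two organizations sharing a key can compare hashed fields for equality without exchanging plaintext.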

crate_anon.linkage.fuzzy_id_match.main() → int

Command-line entry point.

Returns:

program exit status code

crate_anon.linkage.fuzzy_id_match.process_proband_chunk(probands: List[Person], sample: People, worker_num: int, report_every: int = 100) → List[MatchResult]

Used for multiprocessing, where a single process handles lots of probands, not one proband per process.
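The chunking idea (each worker receives a contiguous batch of probands, and results are reassembled in the original order) can be sketched as below. All names are hypothetical, and the per-proband work is a placeholder. For portability this sketch uses a thread pool; concurrent.futures.ProcessPoolExecutor offers the same map() interface for a process-based version.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_chunks(items: Sequence[str], n_chunks: int) -> List[List[str]]:
    """Split items into at most n_chunks contiguous chunks, preserving order."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = -(-len(items) // n_chunks)  # ceiling division
    if size == 0:
        return []  # nothing to do for an empty input
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def process_chunk(chunk: List[str], worker_num: int) -> List[str]:
    """Placeholder for per-proband work; handles a whole chunk in one call."""
    return [name.upper() for name in chunk]


def process_all(probands: Sequence[str], n_workers: int) -> List[str]:
    chunks = split_into_chunks(probands, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in the order the chunks were submitted.
        per_chunk = pool.map(process_chunk, chunks, range(len(chunks)))
        return [result for results in per_chunk for result in results]


demo_results = process_all(["a", "b", "c", "d", "e"], n_workers=2)
```

Handing each worker a batch rather than a single proband reduces the per-task overhead of dispatching work and collecting results.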

crate_anon.linkage.fuzzy_id_match.read_people(cfg: MatchConfig, filename: str, plaintext: bool = True, jsonl: bool | None = None) → People

Read a list of people from a CSV/JSONLines file.

See read_people_alternate_groups(); this version does not offer the feature of splitting into two groups, and returns only a single People object.
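A reader that accepts either CSV or JSON Lines can be sketched as follows. The behaviour shown for jsonl=None (guessing the format from the filename extension) is an assumption made for this illustration, not documented behaviour of read_people(), and the function names and record layout are hypothetical.

```python
import csv
import io
import json
from typing import Any, Dict, List, Optional


def is_jsonl(filename: str, jsonl: Optional[bool] = None) -> bool:
    """Use the explicit flag if given; otherwise guess from the extension.

    The extension-based guess is an assumption for this sketch.
    """
    if jsonl is not None:
        return jsonl
    return filename.lower().endswith(".jsonl")


def parse_records(text: str, jsonl: bool) -> List[Dict[str, Any]]:
    """Parse text as JSON Lines (one object per line) or as CSV with a header."""
    if jsonl:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return list(csv.DictReader(io.StringIO(text)))


demo_csv_text = "forename,surname\nALICE,SMITH\n"
demo_jsonl_text = '{"forename": "ALICE", "surname": "SMITH"}\n'
demo_from_csv = parse_records(demo_csv_text, is_jsonl("people.csv"))
demo_from_jsonl = parse_records(demo_jsonl_text, is_jsonl("people.jsonl"))
```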

crate_anon.linkage.fuzzy_id_match.read_people_alternate_groups(cfg: MatchConfig, filename: str, plaintext: bool = True, jsonl: bool | None = None) → Tuple[People, People]

Read people from a file, splitting consecutive people into “first group”, “second group”. (A debugging/validation feature.)

Returns:

first_group, second_group

Return type:

tuple
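Splitting consecutive records into two alternating groups can be sketched with list slicing. This assumes that the first, third, fifth (and so on) records go to the first group and the rest to the second; the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def split_alternate(people: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (first_group, second_group), assigning records alternately."""
    first_group = list(people[0::2])   # records at positions 0, 2, 4, ...
    second_group = list(people[1::2])  # records at positions 1, 3, 5, ...
    return first_group, second_group


demo_first, demo_second = split_alternate(["p1", "p2", "p3", "p4", "p5"])
```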

crate_anon.linkage.fuzzy_id_match.warn_or_fail_if_default_key(args: Namespace) → None

Ensure that we are not using the default (insecure) hash key unless the user has specifically authorized this.

It is unlikely that local_id_hash_key will equal this specific default, because it defaults to None. However, key might.
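A check of this kind can be sketched as follows. The default key value, the attribute name allow_default_hash_key, and the choice of ValueError are all hypothetical and chosen for illustration; they are not taken from CRATE.

```python
import argparse
import logging

DEMO_DEFAULT_KEY = "demo_default_key"  # hypothetical insecure default

log = logging.getLogger(__name__)


def demo_warn_or_fail_if_default_key(args: argparse.Namespace) -> None:
    """Warn if the default key is explicitly permitted; otherwise refuse."""
    if args.key != DEMO_DEFAULT_KEY:
        return  # a user-supplied key: nothing to check
    if args.allow_default_hash_key:
        log.warning("Proceeding with the default (insecure) hash key")
    else:
        raise ValueError("Default hash key in use; set a key or authorize it")


# A custom key passes silently.
demo_warn_or_fail_if_default_key(
    argparse.Namespace(key="my_secret", allow_default_hash_key=False)
)
```

Failing by default, and only warning when the user opts in, means an insecure key cannot be used by accident.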