14.4.6. crate_anon.linkage.fuzzy_id_match
crate_anon/linkage/fuzzy_id_match.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Fuzzy matching with hashed identifiers.
See paper: Cardinal et al. (2023), https://pubmed.ncbi.nlm.nih.gov/37147600/.
- crate_anon.linkage.fuzzy_id_match.add_basic_options(parser: ArgumentParser) → None
Adds a subparser for global options.
- crate_anon.linkage.fuzzy_id_match.add_comparison_options(parser: ArgumentParser, proband_is_hashed: bool = True, sample_is_hashed: bool = True) → None
Adds a subparser for comparisons.
- crate_anon.linkage.fuzzy_id_match.add_config_options(parser: ArgumentParser) → None
Adds a subparser for MatchConfig options (excepting hasher, above). In a function because we use these in validate_fuzzy_linkage.py too.
- crate_anon.linkage.fuzzy_id_match.add_error_probabilities(parser: ArgumentParser) → None
Adds a subparser for error probabilities.
- crate_anon.linkage.fuzzy_id_match.add_hasher_options(parser: ArgumentParser) → None
Adds a subparser for hasher options.
- crate_anon.linkage.fuzzy_id_match.add_matching_rules(parser: ArgumentParser) → None
Adds a subparser for matching rules.
- crate_anon.linkage.fuzzy_id_match.add_subparsers(parser: ArgumentParser) → _SubParsersAction
Adds global-only options and subparsers.
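The general pattern (global options on the top-level parser, plus one subparser per command) can be sketched with plain argparse. This is an illustration only, not CRATE's code: the functions demo_add_basic_options and demo_add_subparsers, the --verbose option, the "hash" command and its --input option are all hypothetical names chosen for the example.

```python
import argparse


def demo_add_basic_options(parser: argparse.ArgumentParser) -> None:
    # Hypothetical global option, grouped for tidier --help output.
    group = parser.add_argument_group("basic options")
    group.add_argument("--verbose", action="store_true", help="Be verbose")


def demo_add_subparsers(
    parser: argparse.ArgumentParser,
) -> argparse._SubParsersAction:
    # Global-only options live on the top-level parser...
    demo_add_basic_options(parser)
    # ...and each command gets its own subparser with its own options.
    subparsers = parser.add_subparsers(dest="command", required=True)
    hash_parser = subparsers.add_parser("hash", help="Hypothetical command")
    hash_parser.add_argument("--input", required=True, help="Input file")
    return subparsers


parser = argparse.ArgumentParser(prog="demo")
demo_add_subparsers(parser)
# Global options come before the command name; command options after it.
args = parser.parse_args(["--verbose", "hash", "--input", "people.csv"])
```

Returning the subparsers action lets the caller attach further commands later, which is presumably why the real function returns a _SubParsersAction.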
- crate_anon.linkage.fuzzy_id_match.compare_probands_to_sample(cfg: MatchConfig, probands: People, sample: People, output_filename: str) → None
Compares each proband to the sample. Writes to an output file. Order is retained.
See notes above (in source code) re parallel processing.
- Parameters:
cfg – The main MatchConfig object.
probands – People
sample – People
output_filename – Output CSV filename.
- crate_anon.linkage.fuzzy_id_match.compare_probands_to_sample_from_files(cfg: MatchConfig, probands_filename: str, sample_filename: str, output_filename: str, probands_plaintext: bool = True, sample_plaintext: bool = True, sample_cache_filename: str = '', profile: bool = False) → None
Compares each of the people in the probands file to the sample file.
- Parameters:
cfg – The main MatchConfig object.
probands_filename – Filename of people (probands); see read_people().
sample_filename – Filename of people (sample); see read_people().
output_filename – Output filename.
sample_cache_filename – File in which to cache sample, for speed.
probands_plaintext – Is the probands file plaintext (not hashed)?
sample_plaintext – Is the sample file plaintext (not hashed)?
profile – Profile the code?
- crate_anon.linkage.fuzzy_id_match.get_cfg_from_args(args: Namespace, require_hasher: bool, require_main_config: bool, require_error: bool, require_matching: bool) → MatchConfig
Return a MatchConfig object from our standard arguments. Uses defaults where not specified.
- crate_anon.linkage.fuzzy_id_match.get_demo_people(cfg: MatchConfig | None = None) → List[Person]
Some demonstration records. All data are fictional. The postcodes are real but are institutional, not residential, addresses in Cambridge.
- crate_anon.linkage.fuzzy_id_match.hash_identity_file(cfg: MatchConfig, input_filename: str, output_filename: str, include_frequencies: bool = True, include_other_info: bool = False) → None
Hash a file of identifiable people to a hashed version. Order is preserved.
- Parameters:
cfg – The main MatchConfig object.
input_filename – Input (plaintext) CSV filename to read.
output_filename – Output (hashed) CSV filename to write.
include_frequencies – Include frequency information. Without this, the resulting file is suitable for use as a sample, but not as a proband file.
include_other_info – Include the (potentially identifying) other_info data? Usually False; may be True for validation.
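As a rough illustration of hashing a CSV of identifiers with a secret key while preserving row order, here is a minimal standard-library sketch. It is not CRATE's implementation: the choice of HMAC-SHA256, the helper names demo_hash_value and demo_hash_rows, and the column names local_id and surname are assumptions made for the example, and it ignores frequency information and other_info entirely.

```python
import csv
import hashlib
import hmac
import io
from typing import Iterable, TextIO


def demo_hash_value(key: bytes, value: str) -> str:
    """Keyed hash (HMAC-SHA256) of one identifier, as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def demo_hash_rows(
    key: bytes,
    infile: TextIO,
    outfile: TextIO,
    identifying_fields: Iterable[str],
) -> None:
    """Copy a CSV, replacing the named columns with their keyed hashes."""
    fields = list(identifying_fields)
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:  # rows are written in the order they are read
        for field in fields:
            row[field] = demo_hash_value(key, row[field])
        writer.writerow(row)


# Hypothetical two-person input; a real key must be kept secret.
demo_key = b"demo-key-not-secret"
plaintext = io.StringIO("local_id,surname\n1,SMITH\n2,JONES\n")
hashed = io.StringIO()
demo_hash_rows(demo_key, plaintext, hashed, ["surname"])
hashed_rows = list(csv.DictReader(io.StringIO(hashed.getvalue())))
```

Using a keyed hash rather than a bare digest means that someone without the key cannot rebuild the mapping simply by hashing a list of common surnames.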
- crate_anon.linkage.fuzzy_id_match.main() → int
Command-line entry point.
- Returns:
program exit status code
- crate_anon.linkage.fuzzy_id_match.process_proband_chunk(probands: List[Person], sample: People, worker_num: int, report_every: int = 100) → List[MatchResult]
Used for multiprocessing, where a single process handles lots of probands, not one proband per process.
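The chunk-per-worker idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: demo_chunks and demo_process_chunk are hypothetical helpers, people are represented as plain strings, and "matching" is reduced to exact membership in the sample. The chunks are processed in a simple loop here; in a real parallel run each chunk would be handed to a separate worker process.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def demo_chunks(items: Sequence[T], n_chunks: int) -> List[List[T]]:
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(items), n_chunks)
    chunks: List[List[T]] = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when items are scarce
            chunks.append(list(items[start:end]))
        start = end
    return chunks


def demo_process_chunk(
    probands: List[str], sample: List[str], worker_num: int
) -> List[bool]:
    """Handle a whole chunk of probands in one call (one call per worker)."""
    sample_set = set(sample)
    return [p in sample_set for p in probands]


probands = ["a", "b", "c", "d", "e"]
sample = ["b", "e", "z"]
chunks = demo_chunks(probands, 2)
# Contiguous chunks, concatenated in chunk order, keep the proband order.
results = [
    result
    for worker_num, chunk in enumerate(chunks)
    for result in demo_process_chunk(chunk, sample, worker_num)
]
```

Giving each worker many probands amortises the cost of starting a process and transferring the sample, compared with dispatching one proband at a time.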
- crate_anon.linkage.fuzzy_id_match.read_people(cfg: MatchConfig, filename: str, plaintext: bool = True, jsonl: bool | None = None) → People
Read a list of people from a CSV/JSONLines file.
See read_people_2(), but this version doesn’t offer the feature of splitting into two groups, and returns only a single People object.
- crate_anon.linkage.fuzzy_id_match.read_people_alternate_groups(cfg: MatchConfig, filename: str, plaintext: bool = True, jsonl: bool | None = None) → Tuple[People, People]
Read people from a file, splitting consecutive people into “first group”, “second group”. (A debugging/validation feature.)
- Returns:
first_group, second_group
- Return type:
tuple
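One plausible reading of "splitting consecutive people" is that records are assigned alternately: first, second, first, second, and so on. The sketch below shows that interpretation with plain lists; demo_alternate_groups is a hypothetical helper, and the actual function reads from a file and returns People objects rather than lists.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def demo_alternate_groups(items: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Assign items alternately: even positions first, odd positions second."""
    first_group = list(items[0::2])
    second_group = list(items[1::2])
    return first_group, second_group


first, second = demo_alternate_groups(["p1", "p2", "p3", "p4", "p5"])
```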
- crate_anon.linkage.fuzzy_id_match.warn_or_fail_if_default_key(args: Namespace) → None
Ensure that we are not using the default (insecure) hash key unless the user has specifically authorized this.
It’s pretty unlikely that local_id_hash_key will be this specific default, because that defaults to None. However, key might be.
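A check of this kind might look like the sketch below. It is an assumption-laden illustration, not the library's logic: the constant DEMO_DEFAULT_KEY, the attribute name allow_default_hash_key, and the choice of ValueError are all hypothetical; only the key and local_id_hash_key attribute names come from the description above.

```python
import argparse
import logging

log = logging.getLogger(__name__)

# Hypothetical stand-in for a publicly known, insecure default key.
DEMO_DEFAULT_KEY = "demo_default_key"


def demo_warn_or_fail_if_default_key(args: argparse.Namespace) -> None:
    """Warn if the default key is explicitly permitted; otherwise refuse."""
    keys_in_use = (args.key, args.local_id_hash_key)
    if DEMO_DEFAULT_KEY not in keys_in_use:
        return  # a non-default key is in use; nothing to report
    if args.allow_default_hash_key:  # hypothetical opt-in flag
        log.warning("Proceeding with the default (insecure) hash key")
    else:
        raise ValueError("Default (insecure) hash key in use; set your own")


safe_args = argparse.Namespace(
    key="my-secret", local_id_hash_key=None, allow_default_hash_key=False
)
demo_warn_or_fail_if_default_key(safe_args)  # returns without complaint
```

Failing by default and requiring an explicit opt-in means an accidental production run with a publicly known key stops immediately rather than producing weakly protected output.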