..  crate_anon/docs/source/linkage/fuzzy_id_match.rst

..  Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
    .
    This file is part of CRATE.
    .
    CRATE is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
    .
    CRATE is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
    .
    You should have received a copy of the GNU General Public License
    along with CRATE. If not, see .

.. _crate_fuzzy_id_match:

crate_fuzzy_id_match
--------------------

A tool to match people from two databases that don't share a person-unique
identifier, using names, dates of birth, sex/gender, and address information.
This is a probability-based ("fuzzy") matching technique. It can operate
either on identifiable information or in de-identified fashion.

You will need to download a CSV file of postcode geography from UK Census/ONS
data, e.g. from
https://geoportal.statistics.gov.uk/search?q=PRD_ONSPD%20NOV_2024, and place it
somewhere accessible to CRATE. If you are running CRATE under Docker, this file
must be under the ``files`` directory, which is under the top-level directory
of the CRATE installation.


Example
~~~~~~~

In an area with a population size of 100,000:

Institution A has a database with the following patient record table:

.. csv-table::
    :file: fuzzy_linkage_example_patient.csv
    :header-rows: 1
    :class: compact-table

Institution B has a database with the following student record table:

.. csv-table::
    :file: fuzzy_linkage_example_student.csv
    :header-rows: 1
    :class: compact-table

There are no NHS numbers in the student table, so we rely on name, date of
birth, gender and postcode to link the two tables.

The database manager at Institution A creates a CSV file called
``patients_for_hashing.csv`` like this:

.. literalinclude:: patients_for_hashing.csv
    :language: none

If you are running CRATE under Docker, you must place this file under the
``files`` directory under the top-level directory of the CRATE installation.
The Docker container sees this directory as ``/crate/files``.

The database manager then runs the following script in CRATE:

.. code-block:: bash

    # If using the Docker-based CRATE installer (from the scripts directory)
    ./fuzzy_id_match_hash.sh \
        --population_size=100000 \
        --input /crate/files/patients_for_hashing.csv \
        --key mysecretpassphrase \
        --output /crate/files/hashed_patients.jsonl \
        --postcode_csv_filename=/crate/files/ONSPD_NOV_2024_UK.csv

    # otherwise
    crate_fuzzy_id_match hash \
        --population_size=100000 \
        --input patients_for_hashing.csv \
        --key mysecretpassphrase \
        --output hashed_patients.jsonl \
        --postcode_csv_filename=ONSPD_NOV_2024_UK.csv

This writes a file in JSON Lines format, ``hashed_patients.jsonl``. This file
is sent, along with the hash key, to the database manager at Institution B.

The database manager at Institution B creates a CSV file called
``students_for_hashing.csv`` like this:

.. literalinclude:: students_for_hashing.csv
    :language: none

They then run the following script in their CRATE installation:

.. code-block:: bash

    # If using the Docker-based CRATE installer (from the scripts directory)
    ./fuzzy_id_match_compare_hashed_to_plaintext.sh \
        --probands /crate/files/hashed_patients.jsonl \
        --sample /crate/files/students_for_hashing.csv \
        --sample_cache /crate/files/sample_cache.jsonl \
        --output /crate/files/sample_comparison.csv \
        --key mysecretpassphrase \
        --population_size=100000 \
        --postcode_csv_filename=/crate/files/ONSPD_NOV_2024_UK.csv

    # otherwise
    crate_fuzzy_id_match compare_hashed_to_plaintext \
        --probands hashed_patients.jsonl \
        --sample students_for_hashing.csv \
        --output sample_comparison.csv \
        --key mysecretpassphrase \
        --population_size=100000 \
        --postcode_csv_filename=ONSPD_NOV_2024_UK.csv

This produces the following output file, ``sample_comparison.csv``:

.. csv-table::
    :file: sample_comparison.csv
    :header-rows: 1
    :class: compact-table

As this short example shows, CRATE has matched the records from the two
tables; the matched IDs are shown in the ``proband_local_id`` and
``sample_match_local_id`` columns. Typically the local IDs would be hashed as
well (with the ``--local_id_hash_key`` option), but for this example we have
left them unmodified for easier identification.

The two institutions are now able to link records between their databases and
can share de-identified data with each other.

We describe this tool in:

- Cardinal RN, Moore A, Burchell M, Lewis JR (2023).
  De-identified Bayesian personal identity matching for privacy-preserving
  record linkage despite errors: development and validation.
  *BMC Medical Informatics and Decision Making* 23: 85.
  PubMed ID 37147600; DOI 10.1186/s12911-023-02176-6.

.. literalinclude:: _crate_fuzzy_id_match_help.txt
    :language: none

Name frequency data is pre-supplied. It was generated like this:

.. literalinclude:: fetch_name_frequencies.sh
    :language: bash
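
Returning to the worked example: once ``sample_comparison.csv`` has been
produced, either institution may want to turn it into a simple lookup from
proband ID to matched sample ID. The following is a minimal sketch only, not
part of CRATE. It uses the two column names mentioned above and assumes
(hypothetically) that a proband with no match has a blank
``sample_match_local_id``; check your own output file before relying on this.
The helper name ``load_matches`` and the demonstration rows are invented for
illustration.

.. code-block:: python

    import csv
    import io
    from typing import Dict, TextIO


    def load_matches(f: TextIO) -> Dict[str, str]:
        """Map proband_local_id -> sample_match_local_id for matched rows."""
        matches: Dict[str, str] = {}
        for row in csv.DictReader(f):
            sample_id = (row.get("sample_match_local_id") or "").strip()
            # Assumption: a blank sample ID means "no match was declared".
            if sample_id:
                matches[row["proband_local_id"]] = sample_id
        return matches


    # Invented demonstration data, not real CRATE output. A real file has
    # further columns, which csv.DictReader simply carries along unused.
    demo = io.StringIO(
        "proband_local_id,sample_match_local_id\n"
        "P1,S7\n"
        "P2,\n"
        "P3,S2\n"
    )
    links = load_matches(demo)
    print(links)

To read a real file instead, open it with
``open("sample_comparison.csv", newline="")`` and pass the file object to
``load_matches``.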