.. crate_anon/docs/source/linkage/overview_linkage.rst

.. Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
   Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
   .
   This file is part of CRATE.
   .
   CRATE is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.
   .
   CRATE is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
   GNU General Public License for more details.
   .
   You should have received a copy of the GNU General Public License
   along with CRATE. If not, see <https://www.gnu.org/licenses/>.

.. _linkage_overview:

Linkage: overview
-----------------

Linkage is about joining two databases together using common keys or
identifiers. In the context of de-identified clinical records linkage, it is
often desirable to link without using any direct identity information.

One way to do so is to "pseudonymise" both databases, converting a
person-unique identifier (such as a National Health Service [NHS] number in the
UK) to a research ID (pseudonym, tag).

A common operation is for institution A to say to institution B: "please send
me de-identified data for the following people". If the two institutions share
a common passphrase (secret key), they can both "hash" their identifiers in the
same way, and then check for matches using the resulting pseudonyms. This could
work as follows:

- Institutions A and B agree a secret passphrase.
- Institution A hashes the identifiers of relevant people, for whom it would
  like de-identified data from institution B.
- Institution A sends the resulting pseudonyms to institution B.
- Institution B hashes all its identifiers with the same passphrase.
- Institution B looks for pseudonyms that match those requested by A.
- Institution B sends de-identified data for those people (only) back to A.

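
The exchange described in the steps above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not CRATE's
own implementation: the function names (``keyed_hash``,
``pseudonyms_to_request``, ``answer_request``), the record layout, and the
choice to encode identifiers and the passphrase as UTF-8 text are all
hypothetical. HMAC-MD5 is used simply because it is the algorithm named in
this page's example.

.. code-block:: python

    import hashlib
    import hmac
    from typing import Dict, Iterable, Set


    def keyed_hash(identifier: str, passphrase: str) -> str:
        """Hash an identifier with a shared secret; returns hex text."""
        return hmac.new(
            key=passphrase.encode("utf-8"),  # assumption: UTF-8 passphrase
            msg=identifier.encode("utf-8"),  # assumption: identifier as text
            digestmod=hashlib.md5,
        ).hexdigest()


    def pseudonyms_to_request(identifiers: Iterable[str],
                              passphrase: str) -> Set[str]:
        """Institution A: hash the identifiers of the people of interest."""
        return {keyed_hash(i, passphrase) for i in identifiers}


    def answer_request(requested: Set[str],
                       records: Dict[str, dict],
                       passphrase: str) -> Dict[str, dict]:
        """
        Institution B: hash every local identifier, then return de-identified
        data (keyed by pseudonym) only for pseudonyms that A asked about.
        """
        by_pseudonym = {
            keyed_hash(identifier, passphrase): data
            for identifier, data in records.items()
        }
        return {p: d for p, d in by_pseudonym.items() if p in requested}


    # Hypothetical data held by each institution.
    shared_passphrase = "tiger"
    a_wants = ["1234567890", "3456789012"]
    b_records = {
        "1234567890": {"age_band": "40-49"},
        "2345678901": {"age_band": "20-29"},
    }

    request = pseudonyms_to_request(a_wants, shared_passphrase)
    reply = answer_request(request, b_records, shared_passphrase)

In this sketch, only pseudonyms cross the boundary between the two
institutions. B replies for the one person it holds in common with A's
request, and A can map the returned pseudonym back to its own identifier
because it computed the same hash itself.
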
For example, using the passphrase "tiger" and the HMAC-MD5 algorithm, the
following hashes (expressed as hexadecimal) can be generated consistently:

.. code-block:: none

    Identifier      Hash (research pseudonym)
    ------------------------------------------------
    1234567890      35b102550cd6b3118153d0372dffb0aa
    2345678901      4aa6ca6d046b6fcffd2e465061bf19de
    3456789012      71597eb16547ab2a87bad4139ff73693

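
A table of this kind can be produced with Python's standard ``hmac`` and
``hashlib`` modules. The sketch below is a minimal illustration, not the
tool's own code: the helper name ``hmac_md5_hex`` is hypothetical, and it
assumes that both the passphrase and the identifier are treated as UTF-8
text. Whether it reproduces the exact values shown above depends on those
encoding details, which are not specified here.

.. code-block:: python

    import hashlib
    import hmac


    def hmac_md5_hex(identifier: str, passphrase: str) -> str:
        """Return the HMAC-MD5 of an identifier as 32 hex characters."""
        return hmac.new(
            key=passphrase.encode("utf-8"),  # assumption: UTF-8 passphrase
            msg=identifier.encode("utf-8"),  # assumption: identifier as text
            digestmod=hashlib.md5,
        ).hexdigest()


    identifiers = ["1234567890", "2345678901", "3456789012"]
    rows = [(i, hmac_md5_hex(i, "tiger")) for i in identifiers]

    # Print a two-column table: identifier, then research pseudonym.
    print(f"{'Identifier':<16}Hash (research pseudonym)")
    for identifier, pseudonym in rows:
        print(f"{identifier:<16}{pseudonym}")

The same identifier and passphrase always give the same pseudonym, which is
what lets two institutions match records independently. A different
passphrase gives unrelated pseudonyms, so the passphrase must be kept secret
and shared only between the linking parties.
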
The :ref:`crate_bulk_hash ` tool allows you to generate these
sorts of pseudonyms en masse.

Sometimes, organisations do not share a common person-unique identifier. (For
example, education databases probably don't contain NHS numbers.) In these
circumstances, matching may have to rely on non-unique information, such as
names and dates of birth.

The :ref:`crate_fuzzy_id_match ` tool helps you link
databases that do not share a common person-unique identifier, using Bayesian
probabilistic techniques. It can do so either directly (identifiably) or via a
de-identified method.