12.1. Linkage: overview
Linkage is about joining two databases together using common keys or
identifiers. When linking de-identified clinical records, it is often
desirable to do so without using any direct identity information.
One way to do so is to “pseudonymise” both databases, converting a
person-unique identifier (such as a National Health Service [NHS] number in the
UK) to a research ID (pseudonym, tag).
A common operation is for institution A to say to institution B: “please send
me de-identified data for the following people”. If the two institutions share
a common passphrase (secret key), they can both “hash” their identifiers in the
same way, and then check for matches using the resulting pseudonyms. This could
work as follows:
1. Institutions A and B agree a secret passphrase.
2. Institution A hashes the identifiers of relevant people, for whom it would
   like de-identified data from institution B.
3. Institution A sends the resulting pseudonyms to institution B.
4. Institution B hashes all its identifiers with the same passphrase.
5. Institution B looks for pseudonyms that match those requested by A.
6. Institution B sends de-identified data for those people (only) back to A.
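
The steps above can be sketched in a few lines of Python. This is only an
illustration using the standard library, not the code of any CRATE tool. The
passphrase, the identifiers, the record contents and the function name
pseudonymise are all made up for the example, and HMAC-MD5 is just one keyed
hash that both sides could agree on.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, passphrase: str) -> str:
    """Return the hex HMAC-MD5 digest of an identifier, keyed by the passphrase."""
    return hmac.new(
        passphrase.encode("utf-8"),
        identifier.encode("utf-8"),
        hashlib.md5,
    ).hexdigest()


# Step 1: both institutions agree a passphrase (hypothetical value).
PASSPHRASE = "example-shared-secret"

# Steps 2-3: institution A hashes the identifiers it is interested in and
# sends only the pseudonyms to institution B. (Made-up identifiers.)
a_wanted_ids = ["1234567890", "2345678901"]
request_from_a = {pseudonymise(i, PASSPHRASE) for i in a_wanted_ids}

# Step 4: institution B hashes all of its own identifiers.
# (Made-up, already de-identified payloads keyed by B's internal identifier.)
b_records = {
    "2345678901": {"year_of_birth": 1970, "diagnosis_code": "F32"},
    "3456789012": {"year_of_birth": 1985, "diagnosis_code": "F20"},
}
b_by_pseudonym = {
    pseudonymise(identifier, PASSPHRASE): data
    for identifier, data in b_records.items()
}

# Steps 5-6: B keeps only the pseudonyms that A asked for and returns
# de-identified data for those people, keyed by pseudonym.
reply_to_a = {
    pseudonym: data
    for pseudonym, data in b_by_pseudonym.items()
    if pseudonym in request_from_a
}

print(reply_to_a)
```

In this sketch only pseudonyms and de-identified data cross between the two
institutions; the raw identifiers stay where they were.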
For example, using the passphrase “tiger” and the HMAC-MD5 algorithm, the
following hashes (expressed as hexadecimal) can be generated consistently:
Identifier Hash (research pseudonym)
The crate_bulk_hash tool allows you to generate these
sorts of pseudonyms en masse.
Sometimes, organisations do not share a common person-unique identifier. (For
example, education databases probably don’t contain NHS numbers.) In these
circumstances, a match may have to be made using non-unique
information, such as names and dates of birth.
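
One general way to weigh non-unique information is to combine the evidence
from each field using Bayes' rule. The sketch below is a generic illustration
of that idea and is not taken from any CRATE tool. The function name, the prior
and all of the agreement probabilities are made-up assumptions, and the fields
are treated as independent for simplicity.

```python
import math
from typing import List, Tuple

# Each comparison is (fields_agree, P(agree | same person), P(agree | different people)).
Comparison = Tuple[bool, float, float]


def posterior_match_probability(prior: float, comparisons: List[Comparison]) -> float:
    """Update a prior match probability with independent field comparisons."""
    log_odds = math.log(prior / (1.0 - prior))
    for agrees, p_agree_match, p_agree_nonmatch in comparisons:
        if agrees:
            # Agreement is more likely for a true match, so it raises the odds.
            log_odds += math.log(p_agree_match / p_agree_nonmatch)
        else:
            # Disagreement is more likely for a non-match, so it lowers the odds.
            log_odds += math.log((1.0 - p_agree_match) / (1.0 - p_agree_nonmatch))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical numbers: a 1-in-1000 prior that two records are the same person.
PRIOR = 0.001
NAME = (0.95, 0.01)     # P(name agrees | match), P(name agrees | non-match)
DOB = (0.98, 0.0001)    # P(DOB agrees | match),  P(DOB agrees | non-match)

both_agree = posterior_match_probability(PRIOR, [(True, *NAME), (True, *DOB)])
both_differ = posterior_match_probability(PRIOR, [(False, *NAME), (False, *DOB)])

print(f"name and DOB agree:  {both_agree:.4f}")
print(f"name and DOB differ: {both_differ:.8f}")
```

Agreement on a rare value (such as a full date of birth) shifts the odds far
more than agreement on a common one, which is why several weak identifiers
together can still support a confident match.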
The crate_fuzzy_id_match tool helps you link
databases that do not share a common person-unique identifier, via Bayesian
probabilistic techniques; either directly (identifiably) or via a de-identified