12.1. Linkage: overview

Linkage means joining two databases using common keys or identifiers. When linking de-identified clinical records, it is often desirable to link without using any direct identity information.

One way to do so is to “pseudonymise” both databases, converting a person-unique identifier (such as a National Health Service [NHS] number in the UK) to a research ID (pseudonym, tag).

A common operation is for institution A to ask institution B: “please send me de-identified data for the following people”. If the two institutions share a passphrase (secret key), they can both “hash” their identifiers in the same way, and then check for matches using the resulting pseudonyms. This could work as follows:

  • Institutions A and B agree a secret passphrase.

  • Institution A uses that passphrase to hash the identifiers of the people for whom it would like de-identified data from institution B.

  • Institution A sends the resulting pseudonyms to institution B.

  • Institution B hashes all its identifiers with the same passphrase.

  • Institution B looks for pseudonyms that match those requested by A.

  • Institution B sends de-identified data for those people (only) back to A.
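
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any CRATE tool: the pseudonymise() helper, the passphrase, the identifiers and the record contents are all hypothetical, and HMAC-SHA256 is assumed as the keyed hash (any keyed hash that both institutions apply identically would serve).

```python
import hashlib
import hmac


def pseudonymise(identifier: str, passphrase: str) -> str:
    """Keyed hash of an identifier, returned as a hexadecimal string."""
    return hmac.new(
        passphrase.encode("utf-8"),   # secret key shared by A and B
        identifier.encode("utf-8"),   # person-unique identifier
        hashlib.sha256,               # assumed digest; see note above
    ).hexdigest()


# Hypothetical passphrase, agreed by A and B through a secure channel.
PASSPHRASE = "shared secret agreed out of band"

# Institution A: hash the identifiers of the people it is asking about,
# and send only the pseudonyms to B.
a_identifiers = ["1234567890", "3456789012"]
requested = {pseudonymise(i, PASSPHRASE) for i in a_identifiers}

# Institution B: hash all its own identifiers with the same passphrase.
b_records = {
    "1234567890": {"example_field": "value 1"},
    "2345678901": {"example_field": "value 2"},
    "9999999999": {"example_field": "value 3"},
}
b_by_pseudonym = {
    pseudonymise(identifier, PASSPHRASE): record
    for identifier, record in b_records.items()
}

# Institution B: return data only for requested pseudonyms that it holds.
response = {
    pseudonym: record
    for pseudonym, record in b_by_pseudonym.items()
    if pseudonym in requested
}
```

In this sketch, response is keyed by pseudonym only, so the original identifiers do not travel in either direction. People requested by A but unknown to B simply produce no match.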

For example, using the passphrase “tiger” and the HMAC-MD5 algorithm, the following hashes (expressed as hexadecimal) can be generated consistently:

Identifier      Hash (research pseudonym)
1234567890      35b102550cd6b3118153d0372dffb0aa
2345678901      4aa6ca6d046b6fcffd2e465061bf19de
3456789012      71597eb16547ab2a87bad4139ff73693
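
As a rough sketch of how such a table could be produced, the following Python uses the standard library's hmac module with MD5. The hmac_md5_hex() helper is hypothetical, and encoding both the passphrase and the identifier as UTF-8 text is an assumption; the table above does not state how its inputs were encoded, so the digests printed here are not guaranteed to reproduce it.

```python
import hashlib
import hmac


def hmac_md5_hex(identifier: str, passphrase: str) -> str:
    """HMAC-MD5 of an identifier, as 32 hexadecimal characters."""
    # Assumption: both inputs are treated as UTF-8 encoded text.
    return hmac.new(
        passphrase.encode("utf-8"),
        identifier.encode("utf-8"),
        hashlib.md5,
    ).hexdigest()


for identifier in ("1234567890", "2345678901", "3456789012"):
    print(identifier, hmac_md5_hex(identifier, "tiger"))
```

The important property is consistency: the same identifier and passphrase always give the same pseudonym, while a different passphrase gives an unrelated one.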

The crate_bulk_hash tool allows you to generate these sorts of pseudonyms en masse.

Sometimes, organisations do not share a common person-unique identifier. (For example, education databases probably don’t contain NHS numbers.) In these circumstances, matching has to rely on non-unique information, such as names and dates of birth.

The crate_fuzzy_id_match tool helps you link databases that do not share a common person-unique identifier. It uses Bayesian probabilistic techniques, and can operate either directly (identifiably) or via a de-identified method.
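
To give a flavour of Bayesian matching on non-unique information, here is a generic sketch in Python. It is not the algorithm used by crate_fuzzy_id_match: the posterior_match_probability() function is hypothetical, every probability below is invented for illustration, and the comparisons are assumed to be independent of each other.

```python
from typing import Iterable, Tuple


def posterior_match_probability(
    prior: float,
    comparisons: Iterable[Tuple[float, float]],
) -> float:
    """
    Update a prior probability that two records refer to the same person.

    Each comparison is a pair:
        (P(observation | same person), P(observation | different people)).
    Comparisons are treated as independent (a simplifying assumption).
    """
    odds = prior / (1.0 - prior)
    for p_if_same, p_if_different in comparisons:
        odds *= p_if_same / p_if_different  # multiply by the likelihood ratio
    return odds / (1.0 + odds)


# Hypothetical figures: a candidate pool of one million people, and
# two records that agree on forename, surname and date of birth.
prior = 1 / 1_000_000
evidence = [
    (0.98, 0.01),         # forename agrees
    (0.98, 0.005),        # surname agrees
    (0.99, 1 / 36_525),   # date of birth agrees
]
p_match = posterior_match_probability(prior, evidence)
print(f"Posterior probability of a match: {p_match:.4f}")
```

Each agreeing field multiplies the odds of a match by how much more likely that agreement is for the same person than for two different people; a disagreeing field (likelihood ratio below 1) reduces the odds in the same way.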