2. Package elements in brief

There are multiple stages between a clinical source database and a final research database. CRATE separates these stages.

2.1. Terminology

This document refers to ‘anonymisation’ throughout, sometimes as shorthand for ‘pseudonymisation’, in which identifiers are removed and replaced by a generated pseudonym.
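As a rough illustration of pseudonymisation (a minimal sketch, not CRATE's own implementation), a keyed hash can map each identifier to a stable pseudonym that cannot be reversed without the secret key. The function name `pseudonymise`, the key, and the example ID are all hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier (illustrative only)."""
    # HMAC-SHA256: the same identifier and key always give the same output,
    # but the identifier cannot be recovered without the key.
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key and identifier, for illustration only.
key = b"replace-with-a-long-random-secret"
pseudonym = pseudonymise("1234567890", key)
print(pseudonym)
```

Because the mapping is deterministic for a given key, records belonging to the same person still line up after the real identifier has been removed.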

2.2. Database connections

Most of the CRATE tools talk to one or more databases. They do this via SQLAlchemy, which uses a unified URL scheme to define a database connection. You will need to create a URL for every database you wish to use. In some cases you may need to install drivers. See “Internal database interfaces” below.
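SQLAlchemy database URLs take the general form dialect+driver://username:password@host:port/database. The sketch below assembles one using only the Python standard library; the helper name `build_db_url` and all connection details are hypothetical, and the dialect and driver you need depend on your own database.

```python
from urllib.parse import quote


def build_db_url(dialect: str, driver: str, user: str, password: str,
                 host: str, port: int, database: str) -> str:
    """Assemble a SQLAlchemy-style database URL (illustrative helper)."""
    # Percent-encode the username and password so that characters such as
    # '@' or '/' cannot be mistaken for URL delimiters.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return (
        f"{dialect}+{driver}://{safe_user}:{safe_password}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical connection details, for illustration only.
url = build_db_url(
    "mysql", "mysqldb", "researcher", "p@ss/word", "127.0.0.1", 3306, "anon_db"
)
print(url)  # mysql+mysqldb://researcher:p%40ss%2Fword@127.0.0.1:3306/anon_db
```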

2.3. Preprocessing

Your data may need reshaping or adding to. For example, while you will want to remove addresses and postcodes from the raw data, you may want to add less specific but nonetheless useful UK Office for National Statistics (ONS) geographical information. It may also be that your source database is heavily normalised [1] and you want to de-normalise it to make life easier for your researchers.

CRATE provides the following optional pre-processing steps:

  • crate_postcodes: takes a downloaded UK ONS Postcode Database (ONS PD) file and inserts it into a database, for later linking to your source data.

  • crate_preprocess_rio: adds fields, indexes, and views to a Servelec RiO database (or a Servelec RiO CRIS Extract Program [RCEP] database), de-normalising some RiO Core and RiO Non-Core data to make it simpler for end users. This program also generates data dictionary options for the next step.

2.4. Data dictionary generation and editing

CRATE removes identifiable information as it copies a database based on a data dictionary, which is essentially a spreadsheet with one row for every column in the source database.
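To make the "one row per source column" idea concrete, here is a minimal sketch that walks a toy schema and emits a draft table. The schema, the column headings (`src_table`, `src_field`, `decision`, `comment`), and the default decision are simplified assumptions for illustration; they are not CRATE's actual data dictionary format, which crate_anon_draft_dd produces for you.

```python
import csv
import io

# Hypothetical source schema: table name -> list of column names.
source_schema = {
    "patient": ["patient_id", "forename", "surname", "dob"],
    "note": ["note_id", "patient_id", "note_text"],
}

FIELDNAMES = ["src_table", "src_field", "decision", "comment"]


def draft_rows(schema):
    """Yield one draft row per source column, defaulting to omission."""
    for table, columns in schema.items():
        for column in columns:
            yield {
                "src_table": table,
                "src_field": column,
                "decision": "OMIT",  # conservative default; edit by hand
                "comment": "",
            }


def write_tsv(rows) -> str:
    """Render the draft rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDNAMES, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(write_tsv(draft_rows(source_schema)))
```

Defaulting every column to omission means that nothing is copied to the research database until someone has reviewed the row and decided to include it.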

You can create this data dictionary manually, and edit it manually, but CRATE also provides a way to generate a draft of the data dictionary automatically. Use the command crate_anon_draft_dd to start a data dictionary or to discover new database columns and add them to an existing data dictionary.

This command refers to a CRATE anonymiser configuration file, and you can use that configuration file to guide CRATE on how to draft the data dictionary, via options named ddgen....

To help further, preprocessors (including crate_preprocess_rio) can create these options for you; see the --settings-filename option above. These suggestions incorporate knowledge about the specific database (e.g. which fields contain patient IDs, and which contain references to external document files). You can take those suggestions and add them to your CRATE configuration file. If you do that, the autogenerated data dictionary will (we hope) be much closer to what you want.

However, you should always review your data dictionary by hand prior to anonymisation.

2.5. Anonymisation

You can use the crate_anonymise (or crate_anonymise_multiprocess) commands to perform the main anonymisation. This can be done in a “full” way, dropping existing tables and starting from scratch, or in an “incremental” way, looking for changes to the source database (with respect to the anonymised database) and changing the anonymised database accordingly.
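The difference between the two modes can be sketched as follows. This is an illustrative change-detection scheme under stated assumptions, not a description of CRATE's internals: each source row is hashed, and the hash is compared with one remembered from the previous run to decide whether the row needs inserting, updating, deleting, or no work at all. The names `row_hash` and `plan_incremental` are hypothetical.

```python
import hashlib


def row_hash(row: dict) -> str:
    """Hash a row's contents in a key-order-independent way."""
    canonical = repr(sorted(row.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_incremental(source_rows: dict, stored_hashes: dict) -> dict:
    """Decide what to do with each row.

    source_rows: primary key -> current source row (as a dict).
    stored_hashes: primary key -> hash recorded at the previous run.
    """
    plan = {"insert": [], "update": [], "delete": [], "skip": []}
    for pk, row in source_rows.items():
        if pk not in stored_hashes:
            plan["insert"].append(pk)   # new in the source
        elif stored_hashes[pk] != row_hash(row):
            plan["update"].append(pk)   # changed since the last run
        else:
            plan["skip"].append(pk)     # unchanged; nothing to do
    # Rows that have vanished from the source must be removed downstream.
    plan["delete"] = [pk for pk in stored_hashes if pk not in source_rows]
    return plan
```

A "full" run corresponds to discarding the stored hashes (and the destination tables) and treating every source row as an insert.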

This tool uses a configuration file that you create and edit. Use crate_anon_demo_config to generate a demonstration file.

Note

For some databases, like RiO, you can mix in the suggested options from crate_preprocess_rio.

2.6. Natural language processing (NLP)

You can use the crate_nlp (or crate_nlp_multiprocess) commands to pass text from one or more databases/tables/columns to an external NLP tool, and to write the resulting structured data back to a database table.

CRATE includes some built-in natural language tools, including regular expression (regex) parsers for numerical results.
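To give a flavour of what a regex parser for a numerical result does, here is a minimal sketch that pulls C-reactive protein (CRP) values out of free text. The pattern and the function name `extract_crp` are illustrative assumptions; CRATE's built-in parsers are more thorough than this.

```python
import re

# Illustrative pattern: the test name, an optional linking word or symbol,
# a number, and optional units.
CRP_PATTERN = re.compile(
    r"\b(?:CRP|C-reactive protein)\b"
    r"\s*(?:is|was|of|=|:)?\s*"
    r"(?P<value>\d+(?:\.\d+)?)"
    r"\s*(?P<units>mg/L)?",
    re.IGNORECASE,
)


def extract_crp(text: str) -> list:
    """Return one dict per CRP result found in the text."""
    return [
        {"value": float(m.group("value")), "units": m.group("units")}
        for m in CRP_PATTERN.finditer(text)
    ]


results = extract_crp("Bloods today: CRP was 12.5 mg/L, sodium 140.")
print(results)  # [{'value': 12.5, 'units': 'mg/L'}]
```

Each dictionary corresponds to a structured row that could be written to a database table alongside a reference to the source text.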

The GATE NLP system is also supported, via a Java program. Use crate_nlp_build_gate_java_interface to build this before you use it for the first time.

The MedEx-UIMA system is also supported, via a Java program. Use crate_nlp_build_medex_java_interface to build this before you use it for the first time.

This tool uses a configuration file that you create and edit. Use crate_nlp --democonfig to generate a demonstration file.

2.7. Linkage

You might have more than one database and want to link them, so that information about the same person in two databases can be analysed together. CRATE provides tools to do this whether or not the two databases share a person-unique identifier (like a UK NHS number). Linkage without a shared person-unique identifier is performed via a Bayesian personal identity matching process. This can be done in an entirely de-identified manner. See linkage.
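The general shape of a Bayesian matching calculation can be sketched as below. This is a generic illustration, not CRATE's algorithm: it starts from a prior probability that two records refer to the same person and updates it with a likelihood ratio for each compared field. All probabilities shown are made-up assumptions, and the function name is hypothetical.

```python
import math


def posterior_match_probability(prior: float, comparisons: list) -> float:
    """Update a prior match probability with per-field evidence.

    comparisons: list of (agrees, p_agree_if_match, p_agree_if_nonmatch).
    """
    log_odds = math.log(prior / (1 - prior))
    for agrees, p_match, p_nonmatch in comparisons:
        if agrees:
            # Agreement is more likely if the records are the same person.
            log_odds += math.log(p_match / p_nonmatch)
        else:
            log_odds += math.log((1 - p_match) / (1 - p_nonmatch))
    # Convert log odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical figures: a population of a million candidate records, and
# agreement on date of birth and surname but not on postcode.
evidence = [
    (True, 0.99, 1 / 36525),   # date of birth
    (True, 0.95, 0.001),       # surname
    (False, 0.80, 0.0001),     # postcode (people move house)
]
print(posterior_match_probability(1e-6, evidence))
```

Working in log odds keeps the arithmetic stable when many fields are combined. In a de-identified setting the same comparisons can be made on hashed values rather than on the identifiers themselves.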

2.8. Web front end

CRATE offers a web front end that supports researcher access to the data, and allows managers to operate a specific consent-to-contact process.

It uses a configuration file. Use crate_print_demo_crateweb_config to create a starting config that you can edit, and crate_generate_new_django_secret_key to generate a random secret key for your site (which goes into the config).

The crate_django_manage command provides options for:

  • building the structure of the admin database (migrate);

  • collecting statically served files (collectstatic);

  • creating a superuser (createsuperuser);

  • manually changing a password (changepassword);

  • populating a consent database (populate);

  • testing the back-end messaging system by sending an e-mail (test_email);

and a few other things that other scripts provide more convenient interfaces to.

Other scripts include:

2.9. Testing and additional tools

Other tools include:


Footnotes

[1] https://en.wikipedia.org/wiki/Database_normalization