.. crate_anon/docs/source/introduction/package_elements.rst
.. Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
    .
    This file is part of CRATE.
    .
    CRATE is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
    .
    CRATE is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
    .
    You should have received a copy of the GNU General Public License
    along with CRATE. If not, see .
.. _Meld: http://meldmerge.org/
.. _MySQL: https://www.mysql.com/
.. _SQLAlchemy: https://www.sqlalchemy.org/
Package elements in brief
-------------------------
There are multiple stages between a clinical source database and a final
research database. CRATE separates these stages.
.. contents::
   :local:
Terminology
~~~~~~~~~~~
We will refer to ‘anonymisation’ in this document, but sometimes as a shorthand
for ‘pseudonymisation’, in which IDs are removed and replaced by a generated
pseudonym.
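
To make the idea of a generated pseudonym concrete, here is a minimal sketch
that replaces an identifier with a keyed hash (HMAC-SHA256). This illustrates
the general technique only and is not necessarily how CRATE generates its
pseudonyms; the function name ``pseudonymise``, the key and the identifiers
are all hypothetical.

.. code-block:: python

    import hashlib
    import hmac


    def pseudonymise(identifier: str, secret_key: bytes) -> str:
        """Return a stable pseudonym for an identifier, using a keyed hash."""
        digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()


    # Hypothetical key; a real key must be kept secret and out of source code.
    key = b"example-secret-key"

    pid_a = pseudonymise("patient-0001", key)
    pid_b = pseudonymise("patient-0001", key)  # same person, same pseudonym
    pid_c = pseudonymise("patient-0002", key)  # different person

The same identifier always maps to the same pseudonym, so records for one
person stay linked, while the original identifier cannot be read back from the
pseudonym without the key.
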
Database connections
~~~~~~~~~~~~~~~~~~~~
Most of the CRATE tools talk to one or more databases. They do this via
SQLAlchemy_, which uses a unified URL scheme to define a database connection.
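
SQLAlchemy URLs take the general form
``dialect+driver://username:password@host:port/database``. The sketch below
uses only the Python standard library to show which parts such a URL carries.
SQLAlchemy does its own parsing, so this is purely illustrative, and the
credentials, host and database name are made up.

.. code-block:: python

    from urllib.parse import parse_qsl, urlsplit

    # Hypothetical connection URL for a MySQL database.
    url = "mysql+mysqldb://someuser:somepassword@127.0.0.1:3306/somedb?charset=utf8"

    parts = urlsplit(url)
    dialect, _, driver = parts.scheme.partition("+")

    connection = {
        "dialect": dialect,                     # which database system
        "driver": driver,                       # which Python driver to use
        "username": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": dict(parse_qsl(parts.query)),
    }
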
Preprocessing
~~~~~~~~~~~~~
Your data may need reshaping or adding to. For example, while you will want to
remove addresses and postcodes from the raw data, you may want to add less
specific but nonetheless useful UK Office for National Statistics (ONS)
geographical information. It may also be that your source database is heavily
normalised [#dbnormalization]_ and you want to de-normalize it to make life
easier for your researchers.
CRATE provides the following optional preprocessing steps:
- :ref:`crate_postcodes `: takes a downloaded UK ONS Postcode
  Database (ONS PD) file and inserts it into a database, for later linking to
  your source data.
- :ref:`crate_preprocess_rio `: adds fields, indexes, and
  views to a Servelec RiO database (or a Servelec RiO CRIS Extract Program
  [RCEP] database) to apply some de-normalization of RiO Core and RiO Non-Core
  data to make it simpler for end users. This program also generates data
  dictionary options for the next step.
- :ref:`crate_preprocess_systmone `: indexes a SystmOne database.
Data dictionary generation and editing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As it copies a database, CRATE removes identifiable information according to a
:ref:`data dictionary `, which is essentially a spreadsheet
with one row for every column in the source database.
You can create and edit this data dictionary manually, but CRATE
also provides a way to generate a draft of the data dictionary automatically.
Use the command :ref:`crate_anon_draft_dd ` to start a
data dictionary or to discover new database columns and add them to an existing
data dictionary.
This command refers to a CRATE anonymiser :ref:`configuration file
`, and you can use that configuration file to guide CRATE on
how to draft the data dictionary, via options named ``ddgen...``.
To be more helpful, preprocessors (including :ref:`crate_preprocess_rio
`) can create these options for you; see the
``--settings-filename`` option above. These suggestions incorporate knowledge
about the specific database (e.g. which fields contain patient IDs; which
contain references to external document files; etc). You can take those
suggestions, and add them to your CRATE configuration file. If you do that, the
autogenerated data dictionary will (we hope) be much closer to what you want.
However, you should always review your data dictionary by hand prior to
anonymisation.
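
The idea of a data dictionary with one row per source column, and of adding
newly discovered columns to an existing dictionary, can be sketched as follows.
This is not the code behind ``crate_anon_draft_dd``. It uses an in-memory
SQLite database for convenience, and the row keys (``src_db``, ``src_table``,
``src_field``, ``src_datatype``, ``decision``) are illustrative names rather
than a definitive list of CRATE's data dictionary columns.

.. code-block:: python

    import sqlite3


    def draft_rows(conn, db_name):
        """Yield one draft data dictionary row per column in the database."""
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        for table in tables:
            # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
            for _cid, column, datatype, *_rest in conn.execute(
                f'PRAGMA table_info("{table}")'
            ):
                yield {
                    "src_db": db_name,
                    "src_table": table,
                    "src_field": column,
                    "src_datatype": datatype,
                    "decision": "REVIEW",  # placeholder for a human to edit
                }


    def add_new_rows(existing, drafted):
        """Append drafted rows for columns that the dictionary lacks."""
        known = {(r["src_table"], r["src_field"]) for r in existing}
        new = [r for r in drafted if (r["src_table"], r["src_field"]) not in known]
        return existing + new


    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patient (id INTEGER, forename TEXT, dob DATE)")
    dd = list(draft_rows(conn, "source"))

    # A column appears later; only that column is added to the dictionary.
    conn.execute("ALTER TABLE patient ADD COLUMN postcode TEXT")
    dd = add_new_rows(dd, draft_rows(conn, "source"))

Existing rows are left untouched when new columns are merged in, so any manual
edits already made to the dictionary survive a re-draft.
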
Anonymisation
~~~~~~~~~~~~~
You can use the :ref:`crate_anonymise ` (or
:ref:`crate_anonymise_multiprocess `) commands to
perform the main anonymisation. This can be done in a “full” way, dropping
existing tables and starting from scratch, or in an “incremental” way, looking
for changes to the source database (with respect to the anonymised database)
and changing the anonymised database accordingly.
This tool uses a :ref:`configuration file ` that is initially
created by the installer in the ``config`` directory and can be edited.
.. note::

    For some databases, like RiO, you can mix in the suggested options from
    :ref:`crate_preprocess_rio `.
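
The difference between "full" and "incremental" processing can be illustrated
with a toy example. One common way to detect changes, assumed here for
illustration and not a description of CRATE's internals, is to store a hash of
each source row alongside its anonymised copy and to reprocess only rows whose
hash differs. The names ``row_hash``, ``incremental_update`` and ``scrub`` are
hypothetical, and plain dictionaries stand in for database tables.

.. code-block:: python

    import hashlib
    import json


    def row_hash(row):
        """Hash a source row so that later changes can be detected."""
        payload = json.dumps(row, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


    def incremental_update(source, dest, scrub):
        """
        Bring ``dest`` up to date with ``source``.

        ``source`` maps primary key -> raw row; ``dest`` maps primary key ->
        (source row hash, scrubbed row). Returns the keys that were processed.
        """
        processed = []
        for pk, row in source.items():
            current = row_hash(row)
            if pk not in dest or dest[pk][0] != current:
                dest[pk] = (current, scrub(row))  # new or changed row
                processed.append(pk)
        for pk in list(dest):
            if pk not in source:
                del dest[pk]  # row has vanished from the source
        return processed


    def scrub(row):
        # Toy scrubber: blanks out one hard-coded name.
        return {**row, "note": row["note"].replace("Alice", "[XXX]")}


    source = {1: {"note": "Alice seen today"}, 2: {"note": "Stable"}}
    dest = {}
    first_pass = incremental_update(source, dest, scrub)   # everything is new

    source[1] = {"note": "Alice discharged"}
    del source[2]
    second_pass = incremental_update(source, dest, scrub)  # only row 1 changed

A "full" run corresponds to starting again with an empty ``dest``; an
incremental run reuses it and touches only what has changed.
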
Natural language processing (NLP)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can use the :ref:`crate_nlp ` (or :ref:`crate_nlp_multiprocess
`) commands to pass text from one or more
databases/tables/columns to an external NLP tool, and to write the resulting
structured data back to a database table.
CRATE includes some built-in natural language tools, including regular
expression (regex) parsers for numerical results.
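
As a rough illustration of what a regex parser for a numerical result does, the
sketch below pulls C-reactive protein (CRP) values out of free text. It is a
deliberately simplified, hypothetical pattern and not one of CRATE's built-in
parsers, which are likely to handle many more phrasings and units.

.. code-block:: python

    import re

    CRP_REGEX = re.compile(
        r"""
        \b(?:CRP|C-reactive\s+protein)\b    # the analyte name
        \s*(?:is|was|of|=|:)?\s*            # optional linking word or symbol
        (?P<value>\d+(?:\.\d+)?)            # the numerical value
        \s*(?P<units>mg/[lL])?              # optional units
        """,
        re.IGNORECASE | re.VERBOSE,
    )


    def extract_crp(text):
        """Return a list of structured CRP results found in the text."""
        return [
            {"value": float(m.group("value")), "units": m.group("units")}
            for m in CRP_REGEX.finditer(text)
        ]


    results = extract_crp("CRP was 45.2 mg/L on admission; repeat CRP 12.")

Each match becomes a small structured record (a value plus optional units),
which is the kind of output that can then be stored in a database table.
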
The GATE NLP system is also supported, via a Java program. This program is
built into the CRATE Docker image; if you are not using Docker, use
:ref:`crate_nlp_build_gate_java_interface
` to build it.
The MedEx-UIMA system is also supported, via a Java program. Use
:ref:`crate_nlp_build_medex_java_interface
` to build this before you use it for the
first time.
This tool uses a configuration file that you create and edit. Use ``crate_nlp
--democonfig`` to generate a demonstration file.
Linkage
~~~~~~~
You might have more than one database and want to link them, so information
about the same person in two databases can be analysed together. CRATE provides
tools to do this whether or not the two databases share a person-unique
identifier (like a UK NHS number). Linkage without a shared person-unique
identifier is performed via a Bayesian personal identity matching process. This
can be done in an entirely de-identified manner. See :ref:`linkage `.
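
The core of Bayesian identity matching can be sketched in a few lines: start
from a prior probability that two records refer to the same person, then update
it with the evidence from each compared field. The function name and every
probability below are invented for illustration; CRATE's own linkage process is
described in the linkage documentation and is considerably more detailed.

.. code-block:: python

    import math


    def posterior_match_probability(prior, comparisons):
        """
        Combine field-by-field evidence into P(same person | observations).

        ``comparisons`` is an iterable of pairs
        (P(observation | same person), P(observation | different people)),
        with fields treated as independent.
        """
        log_odds = math.log(prior / (1 - prior))
        for p_if_match, p_if_nonmatch in comparisons:
            log_odds += math.log(p_if_match / p_if_nonmatch)
        return 1 / (1 + math.exp(-log_odds))


    PRIOR = 1 / 1000  # hypothetical: 1 in 1000 candidate pairs is a true match

    # Forename agrees and date of birth agrees.
    agree = posterior_match_probability(
        PRIOR, [(0.95, 0.01), (0.98, 0.0001)]
    )

    # Forename agrees but date of birth disagrees.
    disagree = posterior_match_probability(
        PRIOR, [(0.95, 0.01), (0.02, 0.9999)]
    )

Agreement on a rare attribute, such as a full date of birth, shifts the
probability strongly towards a match, whereas a disagreement on it pulls the
probability back down even when a common attribute agrees.
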
Web front end
~~~~~~~~~~~~~
CRATE offers a web front end that supports researcher access to the data, and
allows managers to operate a specific consent-to-contact process.
It uses a configuration file, which is created automatically by the CRATE
installer. Alternatively, use :ref:`crate_print_demo_crateweb_config
` to create a starting config that you can
edit, and :ref:`crate_generate_new_django_secret_key
` to generate a random secret key for your
site (which goes into the config).
The :ref:`crate_django_manage ` command provides options
for:
- building the structure of the admin database (``migrate``);
- collecting statically served files (``collectstatic``);
- creating a superuser (``createsuperuser``);
- manually changing a password (``changepassword``);
- populating a consent database (``populate``);
- testing the back-end messaging system by sending an e-mail (``test_email``);
and a few other things that other scripts provide more convenient interfaces
to.
Other scripts include:
- :ref:`crate_launch_django_server ` for a test
  Django server;
- :ref:`crate_launch_cherrypy_server ` to launch
  a production-grade CherryPy server;
- :ref:`crate_launch_celery ` to launch the Celery
  message-handling backend;
- :ref:`crate_launch_flower ` for the Flower tool to
  monitor the Celery/RabbitMQ backend;
- :ref:`crate_windows_service ` to set up or test a
  Windows service for the web server system. (The CRATE Windows service does
  the equivalent of running both :ref:`crate_launch_cherrypy_server
  ` and :ref:`crate_launch_celery
  `, in the background.)
Testing and additional tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Other tools include:
- :ref:`crate_help `: launches this documentation.
- :ref:`crate_make_demo_database `: creates a
  demonstration database for testing.
- :ref:`crate_test_extract_text `: tests methods of
  extracting text from binary files.
- :ref:`crate_test_anonymisation `: fetches raw and
  anonymised data (from a source and a destination database), for a human to
  compare with a tool like Meld_ to verify the accuracy of anonymisation.
===============================================================================
.. rubric:: Footnotes
.. [#dbnormalization]
   https://en.wikipedia.org/wiki/Database_normalization