..  crate_anon/docs/source/introduction/package_elements.rst

..  Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
    .
    This file is part of CRATE.
    .
    CRATE is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
    .
    CRATE is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.
    .
    You should have received a copy of the GNU General Public License
    along with CRATE. If not, see .

.. _Meld: http://meldmerge.org/
.. _MySQL: https://www.mysql.com/
.. _SQLAlchemy: https://www.sqlalchemy.org/


Package elements in brief
-------------------------

There are multiple stages between a clinical source database and a final
research database. CRATE separates these stages.

..  contents::
    :local:


Terminology
~~~~~~~~~~~

This document refers to ‘anonymisation’ throughout, sometimes as shorthand for
‘pseudonymisation’, in which IDs are removed and replaced by a generated
pseudonym.


Database connections
~~~~~~~~~~~~~~~~~~~~

Most of the CRATE tools talk to one or more databases. They do this via
SQLAlchemy_, which uses a unified URL scheme to define a database connection.


Preprocessing
~~~~~~~~~~~~~

Your data may need reshaping or supplementing. For example, while you will want
to remove addresses and postcodes from the raw data, you may want to add less
specific but nonetheless useful UK Office for National Statistics (ONS)
geographical information. Your source database may also be heavily normalised
[#dbnormalization]_, in which case you may want to de-normalise it to make life
easier for your researchers.
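As a rough illustration of what de-normalising means in practice (this is not
how CRATE's preprocessors are implemented), the sketch below uses Python's
built-in ``sqlite3`` module to build a view that resolves a lookup code into
readable text. The tables ``patient`` and ``gender_lookup``, their columns, and
the view ``patient_readable`` are all hypothetical names chosen for the
example.

```python
import sqlite3


def build_denormalised_view(conn):
    """Create two normalised tables and a view that joins them.

    All table, column, and view names here are hypothetical.
    """
    conn.executescript(
        """
        CREATE TABLE gender_lookup (
            code INTEGER PRIMARY KEY,
            description TEXT
        );
        CREATE TABLE patient (
            patient_id INTEGER PRIMARY KEY,
            gender_code INTEGER REFERENCES gender_lookup (code)
        );
        INSERT INTO gender_lookup VALUES (1, 'Female'), (2, 'Male');
        INSERT INTO patient VALUES (101, 1), (102, 2);

        -- The view spares researchers from writing the join themselves.
        CREATE VIEW patient_readable AS
            SELECT p.patient_id, g.description AS gender
            FROM patient AS p
            LEFT JOIN gender_lookup AS g ON g.code = p.gender_code;
        """
    )


conn = sqlite3.connect(":memory:")
build_denormalised_view(conn)
rows = conn.execute(
    "SELECT patient_id, gender FROM patient_readable ORDER BY patient_id"
).fetchall()
print(rows)
```

Researchers can then query the view directly and see descriptive text rather
than an opaque code.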
CRATE provides the following optional pre-processing steps:

- :ref:`crate_postcodes `: takes a downloaded UK ONS Postcode Database (ONS PD)
  file and inserts it into a database, for later linking to your source data.

- :ref:`crate_preprocess_rio `: adds fields, indexes, and views to a Servelec
  RiO database (or a Servelec RiO CRIS Extract Program [RCEP] database) to
  apply some de-normalisation of RiO Core and RiO Non-Core data to make it
  simpler for end users. This program also generates data dictionary options
  for the next step.

- :ref:`crate_preprocess_systmone `: indexes a SystmOne database.


Data dictionary generation and editing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As it copies a database, CRATE removes identifiable information according to a
:ref:`data dictionary `, which is essentially a spreadsheet with one row for
every column in the source database.

You can create and edit this data dictionary manually, but CRATE also provides
a way to generate a draft automatically.

Use the command :ref:`crate_anon_draft_dd ` to start a data dictionary or to
discover new database columns and add them to an existing data dictionary.

This command refers to a CRATE anonymiser :ref:`configuration file `, and you
can use that configuration file to guide CRATE on how to draft the data
dictionary, via options named ``ddgen...``.

To be more helpful, preprocessors (including :ref:`crate_preprocess_rio `) can
create these options for you; see the ``--settings-filename`` option above.
These suggestions incorporate knowledge about the specific database (e.g. which
fields contain patient IDs; which contain references to external document
files; etc). You can take those suggestions and add them to your CRATE
configuration file. If you do that, the autogenerated data dictionary will (we
hope) be much closer to what you want.

However, you should always review your data dictionary by hand prior to
anonymisation.
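To make the "one row for every column in the source database" idea concrete,
here is a minimal sketch that inspects a database and emits one draft row per
column. It uses Python's built-in ``sqlite3`` module and an in-memory database.
It is not CRATE's data dictionary format and not how ``crate_anon_draft_dd``
works: the row keys (``src_table``, ``src_field``, ``src_datatype``,
``needs_review``) and the example tables are assumptions made purely for
illustration.

```python
import sqlite3


def draft_rows(conn):
    """Return one dict per column of every table (hypothetical layout)."""
    rows = []
    tables = [
        r[0]
        for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, datatype, _notnull, _default, _pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append(
                {
                    "src_table": table,
                    "src_field": name,
                    "src_datatype": datatype,
                    # Every drafted row still needs a human decision.
                    "needs_review": True,
                }
            )
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, surname TEXT, dob DATE)"
)
conn.execute(
    "CREATE TABLE note (note_id INTEGER PRIMARY KEY, patient_id INTEGER, note_text TEXT)"
)
dd = draft_rows(conn)
for row in dd:
    print(row)
```

The ``needs_review`` flag reflects the advice above: an automatically drafted
row is a starting point, not a decision.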
Anonymisation
~~~~~~~~~~~~~

You can use the :ref:`crate_anonymise ` (or :ref:`crate_anonymise_multiprocess `)
commands to perform the main anonymisation. This can be done in a “full” way,
dropping existing tables and starting from scratch, or in an “incremental” way,
looking for changes to the source database (with respect to the anonymised
database) and changing the anonymised database accordingly.

This tool uses a :ref:`configuration file ` that is initially created by the
installer in the ``config`` directory and can be edited.

.. note::

    For some databases, like RiO, you can mix in the suggested options from
    :ref:`crate_preprocess_rio `.


Natural language processing (NLP)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use the :ref:`crate_nlp ` (or :ref:`crate_nlp_multiprocess `) commands
to pass text from one or more databases/tables/columns to an external NLP tool
and to write the resulting structured data back to a database table.

CRATE includes some built-in natural language tools, including regular
expression (regex) parsers for numerical results.

The GATE NLP system is also supported, via a Java program. This program is
built into the CRATE Docker image; if you are not using Docker, use
:ref:`crate_nlp_build_gate_java_interface ` to build it.

The MedEx-UIMA system is also supported, via a Java program. Use
:ref:`crate_nlp_build_medex_java_interface ` to build this before you use it
for the first time.

This tool uses a configuration file that you create and edit. Use
``crate_nlp --democonfig`` to generate a demonstration file.


Linkage
~~~~~~~

You might have more than one database and want to link them, so information
about the same person in two databases can be analysed together. CRATE provides
tools to do this whether or not the two databases share a person-unique
identifier (like a UK NHS number). Linkage without a shared person-unique
identifier is performed via a Bayesian personal identity matching process. This
can be done in an entirely de-identified manner. See :ref:`linkage `.
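The general idea behind Bayesian identity matching can be sketched as follows.
This is a generic illustration of Bayesian updating on log odds, not CRATE's
linkage algorithm; the function name, its parameters, and all probabilities
below are hypothetical. Each compared attribute contributes a likelihood ratio
that raises or lowers the odds that two records refer to the same person.

```python
import math


def posterior_match_probability(prior, comparisons):
    """Combine a prior with per-attribute evidence using Bayes' rule.

    prior: probability that two randomly paired records are the same person.
    comparisons: iterable of (agrees, p_agree_if_same, p_agree_if_different).
    All of these values are illustrative assumptions.
    """
    log_odds = math.log(prior / (1 - prior))
    for agrees, p_same, p_diff in comparisons:
        if agrees:
            # Agreement is more likely if the records are the same person.
            log_odds += math.log(p_same / p_diff)
        else:
            # Disagreement counts as evidence against a match.
            log_odds += math.log((1 - p_same) / (1 - p_diff))
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical evidence: forename and date of birth agree, postcode does not.
evidence = [
    (True, 0.95, 0.01),
    (True, 0.99, 0.0001),
    (False, 0.80, 0.001),
]
p = posterior_match_probability(prior=1e-6, comparisons=evidence)
print(f"Posterior probability of a match: {p:.4f}")
```

Because only agreement or disagreement enters this calculation, the compared
values could in principle be irreversibly hashed before comparison, which is
one way such a process can avoid handling identifiable data directly.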
Web front end
~~~~~~~~~~~~~

CRATE offers a web front end that supports researcher access to the data, and
allows managers to operate a specific consent-to-contact process.

It uses a configuration file, which is created automatically by the CRATE
installer. Alternatively, use :ref:`crate_print_demo_crateweb_config ` to create
a starting config that you can edit, and
:ref:`crate_generate_new_django_secret_key ` to generate a random secret key for
your site (which goes into the config).

The :ref:`crate_django_manage ` command provides options for:

- building the structure of the admin database (``migrate``);

- collecting statically served files (``collectstatic``);

- creating a superuser (``createsuperuser``);

- manually changing a password (``changepassword``);

- populating a consent database (``populate``);

- testing the back-end messaging system by sending an e-mail (``test_email``);

and a few other things that other scripts provide more convenient interfaces
to.

Other scripts include:

- :ref:`crate_launch_django_server ` for a test Django server;

- :ref:`crate_launch_cherrypy_server ` to launch a production-grade CherryPy
  server;

- :ref:`crate_launch_celery ` to launch the Celery message-handling backend;

- :ref:`crate_launch_flower ` for the Flower tool to monitor the Celery/RabbitMQ
  backend;

- :ref:`crate_windows_service ` to set up or test a Windows service for the web
  server system. (The CRATE Windows service does the equivalent of running both
  :ref:`crate_launch_cherrypy_server ` and :ref:`crate_launch_celery `, in the
  background.)


Testing and additional tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Other tools include:

- :ref:`crate_help ` launches this documentation.

- :ref:`crate_make_demo_database `: creates a demonstration database for
  testing.

- :ref:`crate_test_extract_text ` tests methods of extracting text from binary
  files.
- :ref:`crate_test_anonymisation `: fetches raw and anonymised data (from a
  source and a destination database), for a human to compare with a tool like
  Meld_ to verify the accuracy of anonymisation.


===============================================================================

.. rubric:: Footnotes

.. [#dbnormalization]
    https://en.wikipedia.org/wiki/Database_normalization