.. crate_anon/docs/source/anonymisation/preprocessing.rst .. Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk). . This file is part of CRATE. . CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. . CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. . You should have received a copy of the GNU General Public License along with CRATE. If not, see . .. _ONS: https://www.ons.gov.uk/ .. _PCMIS: https://www.york.ac.uk/healthsciences/pc-mis/ .. _RiO: https://www.servelec.co.uk/product-range/rio-epr-system/ .. _SystmOne: https://tpp-uk.com/products/ Preprocessing tools =================== These tools: - reshape specific databases for CRATE: - crate_preprocess_pcmis_ -- preprocess a PCMIS_ database - crate_preprocess_rio_ -- preprocess a RiO_ database - crate_preprocess_systmone_ -- preprocess a SystmOne_ database - fetch external data used for anonymisation: - crate_postcodes_ -- fetch ONS_ postcode information - crate_fetch_wordlists_ -- fetch forenames, surnames, and medical eponyms - import tabular data to a database: - crate_autoimport_db_ -- import tabular data from files to a database - perform fuzzy identity matching for linking different databases securely: - :ref:`crate_fuzzy_id_match ` Although they are usually run before anonymisation, it's probably more helpful to read the :ref:`Anonymisation ` section first. .. _crate_preprocess_pcmis: crate_preprocess_pcmis ---------------------- Options: .. literalinclude:: _crate_preprocess_pcmis_help.txt :language: none .. _crate_preprocess_rio: crate_preprocess_rio -------------------- See the script ``preprocess_rio.sh`` in the ``scripts`` folder generated by the CRATE installer. The RiO preprocessor creates a unique integer field named `crate_pk` in all tables (copying the existing integer PK, creating one from an existing non-integer primary key, or adding a new one using SQL Server’s `INT IDENTITY(1, 1)` type. For all patient tables, it makes the patient ID (RiO number) into an integer, called `crate_rio_number`. It then adds *indexes* and *views*. All of these can be removed again, or updated incrementally if you add new data. The views ‘denormalize’ the data for convenience, since it can be pretty hard to follow the key chain of fully normalized tables. The views conform mostly to the names used by the Servelec RiO CRIS Extraction Program (RCEP), with added consistency. Because user lookups are common, to save typing (and in some cases keep the field length below the 64-character column name limit of MySQL), the following abbreviations are used: ==================== ======================= ``_Resp_Clinician_`` … Responsible Clinician ==================== ======================= Options: .. literalinclude:: _crate_preprocess_rio_help.txt :language: none .. _crate_preprocess_systmone: crate_preprocess_systmone ------------------------- See the script ``preprocess_systmone.sh`` in the ``scripts`` folder generated by the CRATE installer. Preprocess SystmOne_ data, by indexing it. (It shouldn't need further reshaping.) As part of the preprocessing step, CRATE can extract text from a document store into a table ``crate_extracted_text``, which can then be anonymised along with the other tables. CRATE processes each file in the document store that has a filename of the pattern ``___.`` where the ``RowIdentifier`` matches an entry in the ``S1_Documents`` table. Options: .. literalinclude:: _crate_preprocess_systmone_help.txt :language: none .. _crate_postcodes: crate_postcodes --------------- See the script ``load_ons_postcode_database.sh`` in the ``scripts`` folder generated by the CRATE installer. Options: .. literalinclude:: _crate_postcodes_help.txt :language: none .. _crate_fetch_wordlists: crate_fetch_wordlists --------------------- This tool assists in fetching common word lists, such as name lists for global denial, and words to exclude from such lists (such as English words or medical eponyms). It also provides an exclusion filter system, to find lines in some files that are absent from others. The purpose of creating large name lists is usually to remove more names. However, it's likely that you want to remove medical eponyms, like Parkinson (for Parkinson's disease). CRATE has a hand-curated list of these. (If a patient is named Parkinson, though, and CRATE is told to remove that as a patient-specific identifier, that name will be removed from phrases like "Parkinson's disease", which may itself be a potential identifying clue, however, e.g. "[XXX]'s disease", but 100% reliable text de-identification is impossible.) The overlap between names and English words is really tricky. - If you use all names from this set and exclude all valid English words (e.g. from a "valid answers in Scrabble or crosswords" list), you will remove from the namelist -- and thus NOT remove from text being nonspecifically scrubbed -- names such as John (john is a noun) and Veronica (veronica is also a noun). - If you keep all names in the exclusion namelist, though, you will scrub words like excellent, fought, friend, games, he, hope, husband, joyful, kitten, knuckle, libel, limp, lovely, man, memory, mood, music, no, power, powers, sad, stress, true, yes, you, young, zone (to list but a few); these are all names. - A compromise may be to start with all names, remove medical eponyms, and remove *common* English words. CRATE provides tools to count words in a subset of the Project Gutenberg corpus. For example, removing English words that account for the top 99% of this corpus (and are also valid Scrabble clues) does this. The process is shown in the specimen usage below. Options: .. literalinclude:: _crate_fetch_wordlists_help.txt :language: none Specimen usage: .. literalinclude:: crate_fetch_wordlists_specimen_usage.sh :language: bash .. _crate_autoimport_db: crate_autoimport_db ------------------- This tool reads tabular data from files, and imports it into a database. Help: .. literalinclude:: _crate_fetch_wordlists_help.txt :language: none