14.7.5. crate_anon.preprocess.preprocess_pcmis

crate_anon/preprocess/preprocess_pcmis.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Preprocesses PCMIS tables for CRATE.

PCMIS is an EMR for UK IAPT services from the University of York.

PCMIS table structure

No proper documentation, but the structure is clear. See pcmis_information_schema.ods.

- PatientDetails

    PatientID -- PK; patient-defining field; VARCHAR(100)
    FirstName
    LastName
    NHSNumber -- VARCHAR(100)
    ...

- Other per-patient things: Patient*

- CasesAll

    CaseNumber -- appears to be unique; same #records as ReferralDetails
    ReferralDate

- Many other per-case things: Case*

    CaseNumber -- FK to CasesAll/ReferralDetails

- Group things: linked to cases via GroupMember

    IMPORTANTLY: there are only two, Groups and GroupSession
    and neither are identifiable.

- Carers: from PatientCarerDetails (CarerNumber, PatientID)
- Children: from PatientChildDetails (ChildNumber, PatientID)

- ReferralDetails

    CaseNumber -- appears to be unique; same #records as CasesAll
    PatientID -- not unique
    PrimaryDiagnosis (e.g. 'F41.1')

- Non-patient stuff we'll filter out:

    pcmis_UserProfiles
    Users

- Then a lot of other things are index by ContactNumber, which probably
  cross-refers to CaseContacts, having

    ContactNumber INT
    CaseNumber VARCHAR(100)

Decisions re database keys and anonymisation

For RiO, we had integer patient IDs but mangled into a text format. So there were distinct performance advantages in making an integer version. For PCMIS, patient IDs look like ‘JC000001’, ‘SB000001’ (where the letters are unrelated to patients’ initials; I’m not sure what they refer to). There are numerical overlaps if you ignore the letters. So there is no neat integer mapping; we’d be inventing an arbitrary new key if we added one.

So the tradeoff is simplicity (keep textual PK for patients) versus speed (parallel processing based on an integer operation). It’s natural to think of an integer hash of a string, but this hash has to operate in the SQL domain, be portable, and produce an integer (so SQL Server’s HASHBYTES is of no use. At present (2017-05-02), our PCMIS copy has ~53,000 patients in, and there are lots of tables with patients in.

Therefore, DECISION: create an integer PK.

However, we could do this deterministically. Since the length is fixed, and the numerical part goes up to 999999, and the letters are always upper case – ah, no, there are some like <digit><letter>999999. But 0-99 would be fine.

SELECT (
    (ASCII(SUBSTRING(PatientID, 1, 1))) * 100000000 +
    (ASCII(SUBSTRING(PatientID, 2, 1))) * 1000000 +
    CAST(SUBSTRING(PatientID, 3, 6) AS BIGINT)
) AS patient_id_int
FROM PatientDetails

If we’re using SQLAlchemy, then use things like func.substr instead, but it’s a reasonable compromise for now to say that a specific database like PCMIS is going to be hosted on SQL Server, since PCMIS uses that

SQL Server

SQLAlchemy

SUBSTR

func.substr

ASCII

What about CaseNumber – is that identifying? If not, it can remain the internal key to identify cases. If it is, then we have to replace it. The first character is 1,9,A-Z except Q, X, Y (n=25). The second character is 0,A-Z except I, Q, U, X, Z (n=22). So, pretty spread. The digits seem to be approximately consecutive. So it does look more like an internal PK than something identifiable.

Mind you, very often it is identical to CaseNumber. So, do we need a second hash?

Our PCMIS copy certainly has free text (search the schema for text types).

Therefore, views and the like

MAIN SOFTWARE CHANGES

  • Support non-integer PIDs/MPIDs.

  • Add an AlterMethod that is hash=hash_config_key_name with e.g.

[hash_config_key_name]
method = hmacsha256
key = somesecretkey

TABLES

  • If a table doesn’t have a PK, give it an AUTONUMBER integer PK (e.g. “crate_pk”). That looks to be true of ?all tables.

VIEWS

  • In general, not needed: we can use PatientId and CaseNumber as non-integer fields.

  • We do need the geography views, though.

DATA DICTIONARY AUTOGENERATIO

  • PatientId: always the PID.

  • NHSNumber: always the MPID.

  • CaseNumber: belongs in ddgen_extra_hash_fields, and users should give it the same hash key as for the PID-to-RID conversion, since it’s often the same code.

class crate_anon.preprocess.preprocess_pcmis.PcmisConfigOptions(postcodedb: str, geogcols: List[str], print_sql_only: bool, drop_not_create: bool)[source]

Hold configuration options for this program.

__init__(postcodedb: str, geogcols: List[str], print_sql_only: bool, drop_not_create: bool) None[source]
Parameters:
  • postcodedb – Specify database (schema) name for ONS Postcode Database (as imported by CRATE) to link in. With SQL Server, you will have to specify the schema as well as the database; e.g. ONS_PD.dbo".

  • geogcols – List of geographical information columns to link in from ONS Postcode Database. BEWARE that you do not specify anything too identifying.

  • print_sql_only – print SQL rather than executing it?

  • drop_not_create – REMOVES new columns/indexes, rather than creating them. (Not really very dangerous, but might take some time to recreate.)

crate_anon.preprocess.preprocess_pcmis.add_geography_to_view(columns: List[str], viewmaker: ViewMaker, engine: Engine, configoptions: PcmisConfigOptions) None[source]

Modifies a viewmaker to add geography columns to views on PCMIS tables. For example, if you start with an address table including postcodes, and you’re building a view involving it, then you can link in LSOA or IMD information with this function.

Parameters:
  • columns – column names from the postcode table to include

  • viewmaker – a crate_anon.common.sql.ViewMaker, which will be modified. The base table is taken from viewmaker.basetable.

  • engine – an SQLAlchemy Engine

  • configoptions – an instance of PcmisConfigOptions

crate_anon.preprocess.preprocess_pcmis.create_pcmis_views(engine: Engine, metadata: MetaData, configoptions: PcmisConfigOptions, ddhint: DDHint) None[source]

Creates all PCMIS views.

Parameters:
  • engine – an SQLAlchemy Engine

  • metadata – SQLAlchemy MetaData containing reflected details of database

  • configoptions – an instance of PcmisConfigOptions

  • ddhint – a crate_anon/preprocess/ddhint.DDHint, which will be modified

crate_anon.preprocess.preprocess_pcmis.drop_pcmis_views(engine: Engine, metadata: MetaData, configoptions: PcmisConfigOptions, ddhint: DDHint) None[source]

Drops all PCMIS views.

Parameters:
  • engine – an SQLAlchemy Engine

  • metadata – SQLAlchemy MetaData containing reflected details of database

  • configoptions – an instance of PcmisConfigOptions

  • ddhint – a crate_anon/preprocess/ddhint.DDHint, which will be modified

crate_anon.preprocess.preprocess_pcmis.get_pcmis_dd_settings(ddhint: DDHint) str[source]

Draft CRATE config file settings that will allow CRATE to create a PCMIS data dictionary near-automatically.

Parameters:

ddhintcrate_anon.preprocess.ddhint.DDHint

Returns:

the config file settings, as a string

crate_anon.preprocess.preprocess_pcmis.get_pcmis_views(engine: Engine, configoptions: PcmisConfigOptions, ddhint: DDHint) List[ViewMaker][source]

Gets all PCMIS view definitions.

Parameters:
  • engine – an SQLAlchemy Engine

  • configoptions – an instance of PcmisConfigOptions

  • ddhint – a crate_anon/preprocess/ddhint.DDHint, which will be modified

Returns:

a list of crate_anon.common.sql.ViewMaker objects

crate_anon.preprocess.preprocess_pcmis.main() None[source]

Command-line parser. See command-line help.

crate_anon.preprocess.preprocess_pcmis.process_all_tables(engine: Engine, metadata: MetaData, configoptions: PcmisConfigOptions) None[source]

Process all PCMIS tables; see process_table().

Parameters:
  • engine – an SQLAlchemy Engine

  • metadata – SQLAlchemy MetaData containing reflected details of database

  • configoptions – an instance of PcmisConfigOptions

crate_anon.preprocess.preprocess_pcmis.process_table(table: Table, engine: Engine, configoptions: PcmisConfigOptions) None[source]

Processes a PCMIS table by checking it has appropriate columns, perhaps adding a CRATE integer PK, and indexing it.

Parameters:
  • table – an SQLAlchemy Table to process

  • engine – an SQLAlchemy Engine

  • configoptions – an instance of PcmisConfigOptions