14.7.13. crate_anon.preprocess.systmone_ddgen

crate_anon/preprocess/systmone_ddgen.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Generate a CRATE data dictionary for SystmOne data.

Notes

  • SystmOne is a general-purpose electronic health record (EHR) system from TPP (The Phoenix Partnership): https://tpp-uk.com/products/.

  • It’s widely used in general practice (GP), and in Cambridgeshire/Peterborough, ~80% of GP surgeries use it (2018 data, https://pubmed.ncbi.nlm.nih.gov/29490968/, Figure 2).

  • Cambridgeshire & Peterborough NHS Foundation Trust (CPFT) used to use SystmOne for community services, and then moved nearly all the rest of its services to SystmOne (from RiO, in the case of mental health services): Children’s Directorate (12 Oct 2020), Community Hospital wards (30 Nov 2020), the rest of the Older People, Adults, and Community Directorate (7 Dec 2020), and finally the Adult and Specialist Directorate (14 Jun 2021).

  • SystmOne is centrally hosted by TPP.

  • TPP provide a nightly “Strategic Reporting extract” (SRE) of SystmOne data.

  • Its primary coding mechanisms are (1) CTV3 (Read) codes, and (2) SNOMED codes (see e.g. https://termbrowser.nhs.uk/) – the latter are gradually taking over (as of 2021). Coded values can be numeric. For example, one entry might include:

    • SNOMED code 718087004

    • SNOMED text “QRISK2 cardiovascular disease 10 year risk score”

    • CTV3 code “XaQVY”

    • CTV3 text “QRISK2 cardiovascular disease 10 year risk score”

    • Numeric unit “%”

    • Numeric value 10.4

  • SystmOne collects data mostly via “templates” and “questionnaires”. Templates are perhaps closer to the heart of SystmOne (e.g. better presented in the long-form journal view) and values entered into templates are (always?) coded. Questionnaires are more free-form. Both can have free text attached to coded values.

14.7.13.1. Strategic Reporting extract

SpecificationDirectory.zip (e.g. 2021-02-18) contains e.g. Specification v123.csv, which is a full description of the SRE. Principles:

  • All these tables start SR, e.g. SR18WeekWait, SRAAndEAttendance.

  • Columns in that spreadsheet are:

    TableName
    TableDescription
    ColumnName
    ColumnDescription
    ColumnDataType -- possible values include:
        Boolean
        Date
        Date and Time
        Numeric - Integer
        Numeric - Real
        Text - Fixed
        Text - Variable
    ColumnLength -- possible values include:
        empty (e.g. boolean, date, date/time)
        8 for integer
        4 for real
        the VARCHAR length -- for both "variable" and "fixed" text types
    DateDefining
    ColumnOrdinal -- sequence number of column within table
    LinkedTable     }
    LinkedColumn1   }-+
    LinkedColumn2   } |
                      +-- e.g.
                            SROrganisation, ID
                            SRStaffMember, RowIdentifier
                            SRPatient, RowIdentifier, IDOrganisationVisibleTo
    
  • To get a table list:

    # Poor for CSVs with newlines within their strings:
    tail -n+2 "Specification v123.csv" | cut -d, -f1 | sort | uniq
    
    # Much better:
    python3 -c 'import csv; print("\n".join(row[0] for i, row in enumerate(csv.reader(open("Specification v123.csv"))) if i > 0))' | sort | uniq
    
  • Tables and their descriptions:

    import csv
    s = set()
    for i, row in enumerate(csv.reader(open("Specification v123.csv"))):
        if i > 0:
            s.add(f"{row[0]} - {row[1]}")
    
    print("\n".join((x for x in sorted(s))))
    

    Translating that to a single line: https://www.python.org/dev/peps/pep-0289/ … meh, hard.

  • SRPatient looks to be the master patient table here – including names, dates of birth/death, NHS number.

  • Tpp Strategic Reporting Table Specification v123.rtf contains a nicer version of (exactly?) the same information.

  • Strategic Reporting downloads can be configured. Options include:

    • Whether to include the shared record. (I’m not sure if that means a national thing or data from SystmOne that each patient may have consented to sharing “‘out’ from another organization, then ‘in’ to mine”.)

  • When a download is set up, the recipient gets one CSV file per table selected, such as SRPatient.csv for the SRPatient table, plus some ever-present system tables:

    • SRManifest.csv, describing what you’ve received;

    • SRMapping.csv and SRMappingGroup.csv, providing text for built-in

      lists.

  • The date format is e.g. “29 Sep 2011 14:53:28”. Unknown times are marked as “00:00:00”. Unknown dates give an empty string. Boolean values are TRUE or FALSE.

14.7.13.2. Free-text data

The SRE does not contain free text data or binary documents by default. For some Trusts, an augmented SRE is provided also, with that information.

From FreeText Model.xlsx, 2021-04-15, some of this data comes in the following format:

Field Name                  Type            Description
RowIdentifier                   bigint              The unique identifier of the
                                            record
IDPatient                   bigint          Links to patient ID in
                                            demographics
IDReferralIn                    bigint              ID of referral
IDEvent                         bigint              Links to activity event ID
Question                    varchar(MAX)    The questionnaire question
[FreeText]                  varchar(MAX)    The answer given to the above
                                            question
EventDate                   datetime            The data/time of the
                                            questionnaire
SRTable                         varchar(100)        Which SR table the record
                                            relates to
IDSRTable                   bigint          The ID of the above table
QuestionnaireName           varchar(255)    The name of the questionnaire
IDAnsweredQuestionnaire         bigint              The ID of the above
                                            questionnaire
QuestionnaireVersionNumber  int                 The version number of the above
                                            questionnaire
IDOrganisation                  bigint              Organisation ID of the
                                            questionnaire record
CPFTGroup                   int                 Group (directorate)
Directorate                 varchar(50)         Directorate name
TeamName                    varchar(100)    Name of team linked to the
                                            referral
IsMentalHealth                  int             Mental or physical health
Imported                    date            Date imported to the database

(SR = Strategic Reporting.)

Specimen values:

  • SRTable: ‘SRAnsweredQuestionnaire’

  • IDSRTable: this varies for rows with SRTable = ‘SRAnsweredQuestionnaire’, so I think it’s the PK within the table indicated by SRTable.

  • QuestionnaireName = ‘CPFT Risk Assessment’

  • IDAnsweredQuestionnaire = this is unique for rows with QuestionnaireName = ‘CPFT Risk Assessment’, so I think it’s the ID of the Questionnaire, and is probably a typo.

(This ends up (in our environment) in the S1_FreeText table, as below, so it likely arrives as SRFreeText.)

However, note that RowIdentifier is not unique in this table. Whatever they mean by “record”, it isn’t that. For example, there are 7 rows with one common value of RowIdentifier that are clearly the 7 questions (in Question) and textually coded answers (in FreeText) to a SWEMWBS questionnaire. That means that to apply a FULLTEXT index, which requires an indexed unique value, we have to add one.

14.7.13.3. Key fields

  • IDPatient – the SystmOne patient number, in all patient tables (PID, in CRATE terms).

  • SRPatient.NHSNumber – the NHS number (MPID, in CRATE terms).

14.7.13.4. Notable tables in the SRE

  • [SR]Patient, as above

  • Patient identifiers and relationship/third-party details:

    • [SR]PatientAddressHistory

    • [SR]PatientContactDetails

    • [SR]HospitalAAndENumber

  • Relationship/third-party details:

    • [SR]PatientRelationship

    • some of the safeguarding tables

  • [SR]NDOptOutPreference, re NHS national data opt out (for NHS Act s251 use)

    • This has an IDPatient column; presumably presence indicates an active opt-out.

  • Full text and binary:

    • [SR]Media – contains filenames and some metadata

    • [SR]FreeText – if supplied

14.7.13.5. Notable additional tables/columns in the CPFT environment

  • S1_FreeText – this includes all answers to Questionnaires (linked via IDAnsweredQuestionnaire etc.). Comes from the “upgraded” SRE.

  • Several tables have identifiers linked in. For example, try:

    SELECT * FROM information_schema.columns WHERE column_name = 'FirstName'
    

14.7.13.6. Notable tables omitted from the CPFT environment

  • Questionnaire – data is linked into to AnsweredQuestionnaire (which still contains the column IDQuestionnaire).

14.7.13.7. CPFT copy

This broadly follows the SRE, but is expanded. Some notable differences:

  • Tables named SR* in the SRE are named S1_* in the CPFT version (e.g. SRPatient becomes S1_Patient).

  • There is a S1_Patient.NationalDataOptOut column (0 or 1).

  • The local opt-out information appears in S1_ClinicalOutcome_ConsentResearch (as the OptOut field, a text field) but is clearer in S1_ClinicalOutcome_ConsentResearch_OptOutCheck, which only contains patients opting out and has:

    IDPatient = <ID_of_patient_opting_out>
    SNOMEDCode = 1091881000000109
    CTV3Code = 'XaaDb'
    CTV3Text = 'Declined invitation to participate in research study'
    

    So for CPFT, we will autodetect this table/column (S1_ClinicalOutcome_ConsentResearch_OptOutCheck.SNOMEDCode) and the config file should contain:

    optout_col_values = [1091881000000109]
    
  • There seem to be quite a few extra tables, such as:

    S1_ClinicalMeasure_QRisk
    S1_ClinicalMeasure_SWEMWBS
    S1_ClinicalMeasure_Section58
    

    These look like CPFT-created tables pulling data from questionnaires or similar.

  • There is S1_FreeText, where someone (NP!) has helpfully imported that additional data.

  • There is S1_ClinicalOutcome_ConsentResearch, which is the traffic-light system for the CPFT Research Database.

In more detail:

  • All data is loaded via stored procedures, available via Microsoft SQL Server Management Studio in [server] ‣ [database] ‣ Programmability ‣ Stored Procedures. Right-click any and choose “Modify” to view the source. For example, the stored procedure named dbo.load_S1_Patient creates the S1_Patient table.

  • RwNo or RwNo_Patient is frequently used, typically via:

    SELECT
        -- stuff,
        ROW_NUMBER() OVER (
            PARTITION BY IDPatient
            ORDER BY DateEventRecorded DESC
        ) AS RwNo
    FROM
        -- somewhere
    WHERE
        RwNo = 1
    ;
    
    SELECT
        -- stuff,
        ROW_NUMBER() OVER (
            PARTITION BY IDPatient
            ORDER BY DateEvent DESC
        ) AS RwNo_Patient
    FROM
        -- somewhere
    ;
    

    … in other words, picking the most recent for each patient (or, without the WHERE clause, showing its sequencing within each patient).

14.7.13.8. Test patients in the live system?

There are some test patients in our live system.

SELECT COUNT(*)  -- or DISTINCT firstname, surname
FROM S1_Patient
WHERE firstname LIKE '%test%' AND surname LIKE '%test%';

-- Several present. However, in the CPFT copy, column "TestPatient" from
-- this table (BOOLEAN in SRE docs) is missing. How to distinguish?

There are several present. They should be distinguished by the TestPatient column (BOOLEAN, as per the SRE docs). Our code looks for the “TestPatient” column and marks it as an opt-out flag.

Todo

TestPatient column missing in CPFT copy. [A/w NP 2022-03-21.]

14.7.13.9. Manual review after first draft

Reviewing CPFT de-identified output for patient-related content only (not staff-related), per local ethics approvals.

-- Tables in the de-identified database:
SELECT table_name FROM information_schema.tables WHERE table_catalog = 'S1' ORDER BY table_name;

All reviewed and this code tweaked accordingly.