14.7.13. crate_anon.preprocess.systmone_ddgen
crate_anon/preprocess/systmone_ddgen.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Generate a CRATE data dictionary for SystmOne data.
Notes
SystmOne is a general-purpose electronic health record (EHR) system from TPP (The Phoenix Partnership): https://tpp-uk.com/products/.
It’s widely used in general practice (GP), and in Cambridgeshire/Peterborough, ~80% of GP surgeries use it (2018 data, https://pubmed.ncbi.nlm.nih.gov/29490968/, Figure 2).
Cambridgeshire & Peterborough NHS Foundation Trust (CPFT) used to use SystmOne for community services, and then moved nearly all the rest of its services to SystmOne (from RiO, in the case of mental health services): Children’s Directorate (12 Oct 2020), Community Hospital wards (30 Nov 2020), the rest of the Older People, Adults, and Community Directorate (7 Dec 2020), and finally the Adult and Specialist Directorate (14 Jun 2021).
SystmOne is centrally hosted by TPP.
TPP provide a nightly “Strategic Reporting extract” (SRE) of SystmOne data.
Its primary coding mechanisms are (1) CTV3 (Read) codes, and (2) SNOMED codes (see e.g. https://termbrowser.nhs.uk/) – the latter are gradually taking over (as of 2021). Coded values can be numeric. For example, one entry might include:
SNOMED code 718087004
SNOMED text “QRISK2 cardiovascular disease 10 year risk score”
CTV3 code “XaQVY”
CTV3 text “QRISK2 cardiovascular disease 10 year risk score”
Numeric unit “%”
Numeric value 10.4
SystmOne collects data mostly via “templates” and “questionnaires”. Templates are perhaps closer to the heart of SystmOne (e.g. better presented in the long-form journal view) and values entered into templates are (always?) coded. Questionnaires are more free-form. Both can have free text attached to coded values.
14.7.13.1. Strategic Reporting extract
SpecificationDirectory.zip
(e.g. 2021-02-18) contains e.g. Specification
v123.csv
, which is a full description of the SRE. Principles:
All these tables start
SR
, e.g.SR18WeekWait
,SRAAndEAttendance
.Columns in that spreadsheet are:
TableName TableDescription ColumnName ColumnDescription ColumnDataType -- possible values include: Boolean Date Date and Time Numeric - Integer Numeric - Real Text - Fixed Text - Variable ColumnLength -- possible values include: empty (e.g. boolean, date, date/time) 8 for integer 4 for real the VARCHAR length -- for both "variable" and "fixed" text types DateDefining ColumnOrdinal -- sequence number of column within table LinkedTable } LinkedColumn1 }-+ LinkedColumn2 } | +-- e.g. SROrganisation, ID SRStaffMember, RowIdentifier SRPatient, RowIdentifier, IDOrganisationVisibleTo
To get a table list:
# Poor for CSVs with newlines within their strings: tail -n+2 "Specification v123.csv" | cut -d, -f1 | sort | uniq # Much better: python3 -c 'import csv; print("\n".join(row[0] for i, row in enumerate(csv.reader(open("Specification v123.csv"))) if i > 0))' | sort | uniq
Tables and their descriptions:
import csv s = set() for i, row in enumerate(csv.reader(open("Specification v123.csv"))): if i > 0: s.add(f"{row[0]} - {row[1]}") print("\n".join((x for x in sorted(s))))
Translating that to a single line: https://www.python.org/dev/peps/pep-0289/ … meh, hard.
SRPatient
looks to be the master patient table here – including names, dates of birth/death, NHS number.Tpp Strategic Reporting Table Specification v123.rtf
contains a nicer version of (exactly?) the same information.Strategic Reporting downloads can be configured. Options include:
Whether to include the shared record. (I’m not sure if that means a national thing or data from SystmOne that each patient may have consented to sharing “‘out’ from another organization, then ‘in’ to mine”.)
When a download is set up, the recipient gets one CSV file per table selected, such as
SRPatient.csv
for theSRPatient
table, plus some ever-present system tables:SRManifest.csv
, describing what you’ve received;SRMapping.csv
andSRMappingGroup.csv
, providing text for built-inlists.
The date format is e.g. “29 Sep 2011 14:53:28”. Unknown times are marked as “00:00:00”. Unknown dates give an empty string. Boolean values are
TRUE
orFALSE
.
14.7.13.2. Free-text data
The SRE does not contain free text data or binary documents by default. For some Trusts, an augmented SRE is provided also, with that information.
From FreeText Model.xlsx
, 2021-04-15, some of this data comes in the
following format:
Field Name Type Description
RowIdentifier bigint The unique identifier of the
record
IDPatient bigint Links to patient ID in
demographics
IDReferralIn bigint ID of referral
IDEvent bigint Links to activity event ID
Question varchar(MAX) The questionnaire question
[FreeText] varchar(MAX) The answer given to the above
question
EventDate datetime The data/time of the
questionnaire
SRTable varchar(100) Which SR table the record
relates to
IDSRTable bigint The ID of the above table
QuestionnaireName varchar(255) The name of the questionnaire
IDAnsweredQuestionnaire bigint The ID of the above
questionnaire
QuestionnaireVersionNumber int The version number of the above
questionnaire
IDOrganisation bigint Organisation ID of the
questionnaire record
CPFTGroup int Group (directorate)
Directorate varchar(50) Directorate name
TeamName varchar(100) Name of team linked to the
referral
IsMentalHealth int Mental or physical health
Imported date Date imported to the database
(SR = Strategic Reporting.)
Specimen values:
SRTable: ‘SRAnsweredQuestionnaire’
IDSRTable: this varies for rows with SRTable = ‘SRAnsweredQuestionnaire’, so I think it’s the PK within the table indicated by SRTable.
QuestionnaireName = ‘CPFT Risk Assessment’
IDAnsweredQuestionnaire = this is unique for rows with QuestionnaireName = ‘CPFT Risk Assessment’, so I think it’s the ID of the Questionnaire, and is probably a typo.
(This ends up (in our environment) in the S1_FreeText table, as below, so it likely arrives as SRFreeText.)
However, note that RowIdentifier
is not unique in this table. Whatever
they mean by “record”, it isn’t that. For example, there are 7 rows with one
common value of RowIdentifier
that are clearly the 7 questions (in
Question
) and textually coded answers (in FreeText
) to a SWEMWBS
questionnaire. That means that to apply a FULLTEXT index, which requires an
indexed unique value, we have to add one.
14.7.13.3. Key fields
IDPatient
– the SystmOne patient number, in all patient tables (PID, in CRATE terms).SRPatient.NHSNumber
– the NHS number (MPID, in CRATE terms).
14.7.13.4. Notable tables in the SRE
[SR]Patient, as above
Patient identifiers and relationship/third-party details:
[SR]PatientAddressHistory
[SR]PatientContactDetails
[SR]HospitalAAndENumber
Relationship/third-party details:
[SR]PatientRelationship
some of the safeguarding tables
[SR]NDOptOutPreference, re NHS national data opt out (for NHS Act s251 use)
This has an IDPatient column; presumably presence indicates an active opt-out.
Full text and binary:
[SR]Media – contains filenames and some metadata
[SR]FreeText – if supplied
14.7.13.5. Notable additional tables/columns in the CPFT environment
S1_FreeText – this includes all answers to Questionnaires (linked via
IDAnsweredQuestionnaire
etc.). Comes from the “upgraded” SRE.Several tables have identifiers linked in. For example, try:
SELECT * FROM information_schema.columns WHERE column_name = 'FirstName'
14.7.13.6. Notable tables omitted from the CPFT environment
Questionnaire – data is linked into to AnsweredQuestionnaire (which still contains the column
IDQuestionnaire
).
14.7.13.7. CPFT copy
This broadly follows the SRE, but is expanded. Some notable differences:
Tables named
SR*
in the SRE are namedS1_*
in the CPFT version (e.g.SRPatient
becomesS1_Patient
).There is a
S1_Patient.NationalDataOptOut
column (0 or 1).The local opt-out information appears in S1_ClinicalOutcome_ConsentResearch (as the OptOut field, a text field) but is clearer in S1_ClinicalOutcome_ConsentResearch_OptOutCheck, which only contains patients opting out and has:
IDPatient = <ID_of_patient_opting_out> SNOMEDCode = 1091881000000109 CTV3Code = 'XaaDb' CTV3Text = 'Declined invitation to participate in research study'
So for CPFT, we will autodetect this table/column (S1_ClinicalOutcome_ConsentResearch_OptOutCheck.SNOMEDCode) and the config file should contain:
optout_col_values = [1091881000000109]
There seem to be quite a few extra tables, such as:
S1_ClinicalMeasure_QRisk S1_ClinicalMeasure_SWEMWBS S1_ClinicalMeasure_Section58
These look like CPFT-created tables pulling data from questionnaires or similar.
There is
S1_FreeText
, where someone (NP!) has helpfully imported that additional data.There is
S1_ClinicalOutcome_ConsentResearch
, which is the traffic-light system for the CPFT Research Database.
In more detail:
All data is loaded via stored procedures, available via Microsoft SQL Server Management Studio in
. Right-click any and choose “Modify” to view the source. For example, the stored procedure nameddbo.load_S1_Patient
creates theS1_Patient
table.RwNo
orRwNo_Patient
is frequently used, typically via:SELECT -- stuff, ROW_NUMBER() OVER ( PARTITION BY IDPatient ORDER BY DateEventRecorded DESC ) AS RwNo FROM -- somewhere WHERE RwNo = 1 ; SELECT -- stuff, ROW_NUMBER() OVER ( PARTITION BY IDPatient ORDER BY DateEvent DESC ) AS RwNo_Patient FROM -- somewhere ;
… in other words, picking the most recent for each patient (or, without the WHERE clause, showing its sequencing within each patient).
14.7.13.8. Test patients in the live system?
There are some test patients in our live system.
SELECT COUNT(*) -- or DISTINCT firstname, surname
FROM S1_Patient
WHERE firstname LIKE '%test%' AND surname LIKE '%test%';
-- Several present. However, in the CPFT copy, column "TestPatient" from
-- this table (BOOLEAN in SRE docs) is missing. How to distinguish?
There are several present. They should be distinguished by the TestPatient
column (BOOLEAN, as per the SRE docs). Our code looks for the “TestPatient”
column and marks it as an opt-out flag.
Todo
TestPatient column missing in CPFT copy. [A/w NP 2022-03-21.]
14.7.13.9. Manual review after first draft
Reviewing CPFT de-identified output for patient-related content only (not staff-related), per local ethics approvals.
-- Tables in the de-identified database:
SELECT table_name FROM information_schema.tables WHERE table_catalog = 'S1' ORDER BY table_name;
All reviewed and this code tweaked accordingly.