12.7.14. crate_anon.preprocess.systmone_ddgen

crate_anon/preprocess/systmone_ddgen.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Generate a CRATE data dictionary for SystmOne data.

Notes

  • SystmOne is a general-purpose electronic health record (EHR) system from TPP (The Phoenix Partnership): https://tpp-uk.com/products/.

  • It’s widely used in general practice (GP), and in Cambridgeshire/Peterborough, ~80% of GP surgeries use it (2018 data, https://pubmed.ncbi.nlm.nih.gov/29490968/, Figure 2).

  • Cambridgeshire & Peterborough NHS Foundation Trust (CPFT) used to use SystmOne for community services, and then moved nearly all the rest of its services to SystmOne (from RiO, in the case of mental health services): Children’s Directorate (12 Oct 2020), Community Hospital wards (30 Nov 2020), the rest of the Older People, Adults, and Community Directorate (7 Dec 2020), and finally the Adult and Specialist Directorate (14 Jun 2021).

  • SystmOne is centrally hosted by TPP.

  • TPP provide a nightly “Strategic Reporting extract” (SRE) of SystmOne data.

  • Its primary coding mechanisms are (1) CTV3 (Read) codes, and (2) SNOMED codes (see e.g. https://termbrowser.nhs.uk/) – the latter are gradually taking over (as of 2021). Coded values can be numeric. For example, one entry might include:

    • SNOMED code 718087004

    • SNOMED text “QRISK2 cardiovascular disease 10 year risk score”

    • CTV3 code “XaQVY”

    • CTV3 text “QRISK2 cardiovascular disease 10 year risk score”

    • Numeric unit “%”

    • Numeric value 10.4

  • SystmOne collects data mostly via “templates” and “questionnaires”. Templates are perhaps closer to the heart of SystmOne (e.g. better presented in the long-form journal view) and values entered into templates are (always?) coded. Questionnaires are more free-form. Both can have free text attached to coded values.

12.7.14.1. Strategic Reporting extract

SpecificationDirectory.zip (e.g. 2021-02-18) contains e.g. Specification v123.csv, which is a full description of the SRE. Principles:

  • All these tables start SR, e.g. SR18WeekWait, SRAAndEAttendance.

  • Columns in that spreadsheet are:

    TableName
    TableDescription
    ColumnName
    ColumnDescription
    ColumnDataType -- possible values include:
        Boolean
        Date
        Date and Time
        Numeric - Integer
        Numeric - Real
        Text - Fixed
        Text - Variable
    ColumnLength -- possible values include:
        empty (e.g. boolean, date, date/time)
        8 for integer
        4 for real
        the VARCHAR length -- for both "variable" and "fixed" text types
    DateDefining
    ColumnOrdinal -- sequence number of column within table
    LinkedTable     }
    LinkedColumn1   }-+
    LinkedColumn2   } |
                      +-- e.g.
                            SROrganisation, ID
                            SRStaffMember, RowIdentifier
                            SRPatient, RowIdentifier, IDOrganisationVisibleTo
    
  • To get a table list:

    # Poor for CSVs with newlines within their strings:
    tail -n+2 "Specification v123.csv" | cut -d, -f1 | sort | uniq
    
    # Much better:
    python3 -c 'import csv; print("\n".join(row[0] for i, row in enumerate(csv.reader(open("Specification v123.csv"))) if i > 0))' | sort | uniq
    
  • Tables and their descriptions:

    import csv

    # Collect unique "table - description" pairs, skipping the header row.
    s = set()
    with open("Specification v123.csv", newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i > 0:
                s.add(f"{row[0]} - {row[1]}")
    
    print("\n".join(sorted(s)))
    

    Translating that to a single line: https://www.python.org/dev/peps/pep-0289/ … meh, hard.

  • SRPatient looks to be the master patient table here – including names, dates of birth/death, NHS number.

  • Tpp Strategic Reporting Table Specification v123.rtf contains a nicer version of (exactly?) the same information.

  • Strategic Reporting downloads can be configured. Options include:

    • Whether to include the shared record. (I’m not sure if that means a national thing or data from SystmOne that each patient may have consented to sharing “‘out’ from another organization, then ‘in’ to mine”.)

  • When a download is set up, the recipient gets one CSV file per table selected, such as SRPatient.csv for the SRPatient table, plus some ever-present system tables:

    • SRManifest.csv, describing what you’ve received;

    • SRMapping.csv and SRMappingGroup.csv, providing text for built-in lists.

  • The date format is e.g. “29 Sep 2011 14:53:28”. Unknown times are marked as “00:00:00”. Unknown dates give an empty string. Boolean values are TRUE or FALSE.
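
The value conventions in the last bullet (date/time format, empty string for an unknown date, TRUE/FALSE booleans) can be handled in a few lines of Python. This is a minimal sketch, not part of CRATE: the function names are hypothetical, and raising an error for any other boolean string is an assumption.

```python
from datetime import datetime
from typing import Optional

# Matches values such as "29 Sep 2011 14:53:28".
SRE_DATETIME_FORMAT = "%d %b %Y %H:%M:%S"


def parse_sre_datetime(value: str) -> Optional[datetime]:
    """Parse an SRE date/time string; an empty string means 'unknown date'."""
    if value == "":
        return None
    return datetime.strptime(value, SRE_DATETIME_FORMAT)


def parse_sre_bool(value: str) -> bool:
    """Parse an SRE boolean, which is the literal TRUE or FALSE."""
    if value == "TRUE":
        return True
    if value == "FALSE":
        return False
    # Assumption: anything else is treated as an error in this sketch.
    raise ValueError(f"Unexpected SRE boolean: {value!r}")


when = parse_sre_datetime("29 Sep 2011 14:53:28")
```

Note that "%b" depends on the process locale (English month abbreviations are assumed), and that a time of 00:00:00 cannot be told apart from a genuinely unknown time.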

12.7.14.2. Free-text data

The SRE does not contain free-text data or binary documents by default. For some Trusts, an augmented SRE is also provided that includes this information.

From FreeText Model.xlsx, 2021-04-15, some of this data comes in the following format:

Field Name                  Type          Description
RowIdentifier               bigint        The unique identifier of the record
IDPatient                   bigint        Links to patient ID in demographics
IDReferralIn                bigint        ID of referral
IDEvent                     bigint        Links to activity event ID
Question                    varchar(MAX)  The questionnaire question
[FreeText]                  varchar(MAX)  The answer given to the above question
EventDate                   datetime      The date/time of the questionnaire
SRTable                     varchar(100)  Which SR table the record relates to
IDSRTable                   bigint        The ID of the above table
QuestionnaireName           varchar(255)  The name of the questionnaire
IDAnsweredQuestionnaire     bigint        The ID of the above questionnaire
QuestionnaireVersionNumber  int           The version number of the above questionnaire
IDOrganisation              bigint        Organisation ID of the questionnaire record
CPFTGroup                   int           Group (directorate)
Directorate                 varchar(50)   Directorate name
TeamName                    varchar(100)  Name of team linked to the referral
IsMentalHealth              int           Mental or physical health
Imported                    date          Date imported to the database

(SR = Strategic Reporting.)

Specimen values:

  • SRTable: ‘SRAnsweredQuestionnaire’

  • IDSRTable: this varies for rows with SRTable = ‘SRAnsweredQuestionnaire’, so I think it’s the PK within the table indicated by SRTable.

  • QuestionnaireName = ‘CPFT Risk Assessment’

  • IDAnsweredQuestionnaire = this is unique for rows with QuestionnaireName = ‘CPFT Risk Assessment’, so I think it’s the ID of the Questionnaire, and is probably a typo.

(This ends up (in our environment) in the S1_FreeText table, as below, so it likely arrives as SRFreeText.)

However, note that RowIdentifier is not unique in this table. Whatever they mean by “record”, it isn’t that. For example, there are 7 rows with one common value of RowIdentifier that are clearly the 7 questions (in Question) and textually coded answers (in FreeText) to a SWEMWBS questionnaire. That means that to apply a FULLTEXT index, which requires an indexed unique value, we have to add one.
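
To illustrate the point about RowIdentifier, here is a minimal sketch that detects repeated RowIdentifier values and adds a sequential surrogate key. The rows are invented, and the column name "surrogate_pk" and both function names are hypothetical; the real work would be done in the database, not in Python.

```python
from collections import Counter
from typing import Any, Dict, List

# Invented rows: several questions from one questionnaire share a RowIdentifier.
rows: List[Dict[str, Any]] = [
    {"RowIdentifier": 101, "Question": "Q1", "FreeText": "Often"},
    {"RowIdentifier": 101, "Question": "Q2", "FreeText": "Rarely"},
    {"RowIdentifier": 102, "Question": "Q1", "FreeText": "Sometimes"},
]


def duplicated_row_identifiers(rows: List[Dict[str, Any]]) -> List[Any]:
    """Return RowIdentifier values that occur more than once."""
    counts = Counter(r["RowIdentifier"] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def add_surrogate_pk(
    rows: List[Dict[str, Any]], pk_name: str = "surrogate_pk"
) -> List[Dict[str, Any]]:
    """Return copies of the rows with a unique integer key, starting at 1."""
    return [{pk_name: i, **r} for i, r in enumerate(rows, start=1)]


keyed = add_surrogate_pk(rows)
```

The surrogate key is unique by construction, so it can serve as the indexed unique value that a FULLTEXT index needs.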

12.7.14.3. Key fields

  • IDPatient – the SystmOne patient number, in all patient tables (PID, in CRATE terms).

  • SRPatient.NHSNumber – the NHS number (MPID, in CRATE terms).

12.7.14.4. Notable tables in the SRE

  • [SR]Patient, as above

  • Patient identifiers and relationship/third-party details:

    • [SR]PatientAddressHistory

    • [SR]PatientContactDetails

    • [SR]HospitalAAndENumber

  • Relationship/third-party details:

    • [SR]PatientRelationship

    • some of the safeguarding tables

  • [SR]NDOptOutPreference, re NHS national data opt out (for NHS Act s251 use)

    • This has an IDPatient column; presumably presence indicates an active opt-out.

  • Full text and binary:

    • [SR]Media – contains filenames and some metadata

    • [SR]FreeText – if supplied

12.7.14.5. Notable additional tables/columns in the CPFT environment

  • S1_FreeText – this includes all answers to Questionnaires (linked via IDAnsweredQuestionnaire etc.). Comes from the “upgraded” SRE.

  • Several tables have identifiers linked in. For example, try:

    SELECT * FROM information_schema.columns WHERE column_name = 'FirstName'
    

12.7.14.6. Notable tables omitted from the CPFT environment

  • Questionnaire – data is linked into AnsweredQuestionnaire (which still contains the column IDQuestionnaire).

12.7.14.7. CPFT copy

This broadly follows the SRE, but is expanded. Some notable differences:

  • Tables named SR* in the SRE are named S1_* in the CPFT version (e.g. SRPatient becomes S1_Patient).

  • There is an S1_Patient.NationalDataOptOut column (0 or 1).

  • The local opt-out information appears in S1_ClinicalOutcome_ConsentResearch (as the OptOut field, a text field) but is clearer in S1_ClinicalOutcome_ConsentResearch_OptOutCheck, which only contains patients opting out and has:

    IDPatient = <ID_of_patient_opting_out>
    SNOMEDCode = 1091881000000109
    CTV3Code = 'XaaDb'
    CTV3Text = 'Declined invitation to participate in research study'
    

    So for CPFT, we will autodetect this table/column (S1_ClinicalOutcome_ConsentResearch_OptOutCheck.SNOMEDCode) and the config file should contain:

    optout_col_values = [1091881000000109]
    
  • There seem to be quite a few extra tables, such as:

    S1_ClinicalMeasure_QRisk
    S1_ClinicalMeasure_SWEMWBS
    S1_ClinicalMeasure_Section58
    

    These look like CPFT-created tables pulling data from questionnaires or similar.

  • There is S1_FreeText, where someone (NP!) has helpfully imported that additional data.

  • There is S1_ClinicalOutcome_ConsentResearch, which is the traffic-light system for the CPFT Research Database.
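
As a rough illustration of the value-based opt-out described above: a patient is treated as opting out if any row in the opt-out table has a SNOMEDCode that appears in optout_col_values. This is a sketch under that assumption, not CRATE's actual implementation; the function name and the sample rows are invented.

```python
from typing import Any, Dict, Iterable, List, Set

# As in the config fragment above.
OPTOUT_COL_VALUES = [1091881000000109]

# Invented rows standing in for S1_ClinicalOutcome_ConsentResearch_OptOutCheck.
optout_rows: List[Dict[str, Any]] = [
    {"IDPatient": 1, "SNOMEDCode": 1091881000000109},
    {"IDPatient": 2, "SNOMEDCode": 123456789},  # arbitrary non-opt-out code
]


def patients_opting_out(
    rows: Iterable[Dict[str, Any]],
    optout_values: Iterable[Any],
    pid_col: str = "IDPatient",
    optout_col: str = "SNOMEDCode",
) -> Set[Any]:
    """Return the patient IDs whose opt-out column holds an opt-out value."""
    wanted = set(optout_values)
    return {row[pid_col] for row in rows if row[optout_col] in wanted}


opted_out = patients_opting_out(optout_rows, OPTOUT_COL_VALUES)
```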

In more detail:

  • All data is loaded via stored procedures, available via Microsoft SQL Server Management Studio in [server] ‣ [database] ‣ Programmability ‣ Stored Procedures. Right-click any and choose “Modify” to view the source. For example, the stored procedure named dbo.load_S1_Patient creates the S1_Patient table.

  • RwNo or RwNo_Patient is frequently used, typically via:

    SELECT *
    FROM (
        SELECT
            -- stuff,
            ROW_NUMBER() OVER (
                PARTITION BY IDPatient
                ORDER BY DateEventRecorded DESC
            ) AS RwNo
        FROM
            -- somewhere
    ) AS numbered  -- a window-function alias can't be filtered at the same query level
    WHERE
        RwNo = 1
    ;
    
    SELECT
        -- stuff,
        ROW_NUMBER() OVER (
            PARTITION BY IDPatient
            ORDER BY DateEvent DESC
        ) AS RwNo_Patient
    FROM
        -- somewhere
    ;
    

    … in other words, picking the most recent for each patient (or, without the WHERE clause, showing its sequencing within each patient).
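
For readers less familiar with window functions, the RwNo logic above can be mimicked in plain Python. This is only an illustrative sketch with invented data and a hypothetical function name; it numbers each patient's rows from most recent to oldest, so that RwNo = 1 picks the latest row.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List

# Invented events for two patients.
events: List[Dict[str, Any]] = [
    {"IDPatient": 1, "DateEvent": date(2021, 1, 5)},
    {"IDPatient": 1, "DateEvent": date(2021, 3, 9)},
    {"IDPatient": 2, "DateEvent": date(2020, 7, 1)},
]


def add_row_numbers(
    rows: List[Dict[str, Any]],
    partition_col: str = "IDPatient",
    order_col: str = "DateEvent",
    rowno_col: str = "RwNo",
) -> List[Dict[str, Any]]:
    """Mimic ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC)."""
    out = []
    # groupby() needs its input sorted by the grouping key.
    by_partition = sorted(rows, key=itemgetter(partition_col))
    for _, group in groupby(by_partition, key=itemgetter(partition_col)):
        newest_first = sorted(group, key=itemgetter(order_col), reverse=True)
        for n, row in enumerate(newest_first, start=1):
            out.append({**row, rowno_col: n})
    return out


numbered = add_row_numbers(events)
most_recent = [r for r in numbered if r["RwNo"] == 1]  # like WHERE RwNo = 1
```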

12.7.14.8. Test patients in the live system?

There are some test patients in our live system.

SELECT COUNT(*)  -- or DISTINCT firstname, surname
FROM S1_Patient
WHERE firstname LIKE '%test%' AND surname LIKE '%test%';

-- Several present. However, in the CPFT copy, column "TestPatient" from
-- this table (BOOLEAN in SRE docs) is missing. How to distinguish?

There are several present. They should be distinguished by the TestPatient column (BOOLEAN, as per the SRE docs). Our code looks for the “TestPatient” column and marks it as an opt-out flag.

Todo

TestPatient column missing in CPFT copy. [A/w NP 2022-03-21.]

12.7.14.9. Manual review after first draft

Reviewing CPFT de-identified output for patient-related content only (not staff-related), per local ethics approvals.

-- Tables in the de-identified database:
SELECT table_name FROM information_schema.tables WHERE table_catalog = 'S1' ORDER BY table_name;

All reviewed and this code tweaked accordingly.

class crate_anon.preprocess.systmone_ddgen.CPFTAddressCol[source]

CPFT variants for the address table.

class crate_anon.preprocess.systmone_ddgen.CPFTGenericCol[source]

CPFT variants for generic column names.

class crate_anon.preprocess.systmone_ddgen.CPFTOtherCol[source]

Other CPFT variants.

class crate_anon.preprocess.systmone_ddgen.CPFTPatientCol[source]

CPFT variants for the patient table.

class crate_anon.preprocess.systmone_ddgen.CPFTTable[source]

Selected tables that CPFT have renamed or created.

class crate_anon.preprocess.systmone_ddgen.CrateS1ViewCol[source]

Additional columns added by CRATE’s preprocessor.

class crate_anon.preprocess.systmone_ddgen.CrateView[source]

Views created by CRATE, which do not have contextual prefixes.

class crate_anon.preprocess.systmone_ddgen.S1AddressCol[source]

Columns in the PatientAddressHistory table.

class crate_anon.preprocess.systmone_ddgen.S1ContactCol[source]

Columns in the PatientContactDetails table.

class crate_anon.preprocess.systmone_ddgen.S1GenericCol[source]

Columns used in many SystmOne tables.

class crate_anon.preprocess.systmone_ddgen.S1HospNumCol[source]

Columns in the HospitalAAndENumber table.

class crate_anon.preprocess.systmone_ddgen.S1PatientCol[source]

Columns in the Patient table.

class crate_anon.preprocess.systmone_ddgen.S1RelCol[source]

Columns in the PatientRelationship table. (This is also one for which we specify everything in detail, since CPFT add in extra identifiers.)

class crate_anon.preprocess.systmone_ddgen.S1Table[source]

SystmOne “core” table names, with no prefix.

class crate_anon.preprocess.systmone_ddgen.ScrubSrcAlterMethodInfo(change_comment_and_indexing_only: bool = False, src_flags: str = '', scrub_src: ~crate_anon.anonymise.constants.ScrubSrc | None = None, scrub_method: ~crate_anon.anonymise.constants.ScrubMethod | None = None, decision: ~crate_anon.anonymise.constants.Decision = Decision.OMIT, alter_methods: ~typing.List[~crate_anon.anonymise.altermethod.AlterMethod] = <factory>, dest_datatype: str | None = None, dest_field: str | None = None)[source]

For describing scrub-source and alter-method information.

__init__(change_comment_and_indexing_only: bool = False, src_flags: str = '', scrub_src: ~crate_anon.anonymise.constants.ScrubSrc | None = None, scrub_method: ~crate_anon.anonymise.constants.ScrubMethod | None = None, decision: ~crate_anon.anonymise.constants.Decision = Decision.OMIT, alter_methods: ~typing.List[~crate_anon.anonymise.altermethod.AlterMethod] = <factory>, dest_datatype: str | None = None, dest_field: str | None = None) None
add_alter_method(alter_method: AlterMethod) None[source]

Adds an alteration method.

add_src_flag(flag: SrcFlag) None[source]

Add a flag, if it doesn’t exist already.

include() None[source]

Sets the decision to “include”.

omit() None[source]

Sets the decision to “omit”.
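
The behaviour documented for this class can be pictured with a simplified stand-in. The sketch below is not the real class: DemoDecision and DemoScrubInfo are hypothetical names, alteration methods are reduced to plain strings, and treating each source flag as a single character within src_flags is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DemoDecision(Enum):
    """Hypothetical stand-in for CRATE's Decision enum."""
    OMIT = "OMIT"
    INCLUDE = "include"


@dataclass
class DemoScrubInfo:
    """Simplified stand-in for ScrubSrcAlterMethodInfo."""
    src_flags: str = ""
    decision: DemoDecision = DemoDecision.OMIT  # omit by default
    alter_methods: List[str] = field(default_factory=list)

    def add_src_flag(self, flag: str) -> None:
        # Add a flag only if it is not already present.
        if flag not in self.src_flags:
            self.src_flags += flag

    def add_alter_method(self, alter_method: str) -> None:
        self.alter_methods.append(alter_method)

    def include(self) -> None:
        self.decision = DemoDecision.INCLUDE

    def omit(self) -> None:
        self.decision = DemoDecision.OMIT


info = DemoScrubInfo()
info.add_src_flag("X")  # "X" is an arbitrary flag character for the demo
info.add_src_flag("X")  # second call is a no-op
info.include()
```

The default_factory for alter_methods mirrors the "<factory>" default in the signature above: each instance gets its own list.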

class crate_anon.preprocess.systmone_ddgen.SystmOneContext(value)[source]

Environments in which we might have SystmOne data.

class crate_anon.preprocess.systmone_ddgen.SystmOneSRESpecRow(d: Dict[str, Any])[source]

Represents a row in the SystmOne SRE specification CSV file.

__init__(d: Dict[str, Any]) None[source]

Initialize with a row dictionary from a csv.DictReader.

comment(context: SystmOneContext, with_table: bool = True) str[source]

Used to generate a comment for the CRATE data dictionary.

Parameters:
  • context – The SystmOneContext in which data is being processed.

  • with_table – Include information about the table.

description(context: SystmOneContext, with_table: bool = True) str[source]

Full description line.

Parameters:
  • context – The SystmOneContext in which data is being processed.

  • with_table – Include information about the table.

property linked_table_core: str

Core part of the linked table name.

property tablename_core: str

Core part of the tablename.

class crate_anon.preprocess.systmone_ddgen.SystmOneSRESpecs(context: SystmOneContext, filename: str)[source]

Loads and represents the SystmOne SRE specifications.

__init__(context: SystmOneContext, filename: str) None[source]

Initialize by reading a SystmOne SRE specification CSV file.

context:

The context from which SystmOne data is being extracted (e.g. the raw TPP Strategic Reporting Extract (SRE), or a local version processed into CPFT’s Data Warehouse).

filename:

Optional filename for the TPP SRE specification file, in comma-separated value (CSV) format.

debug_specs() None[source]

Print the specs to the debugging log.

get_spec_row(tablename_core: str, columnname: str) SystmOneSRESpecRow[source]

Look up a row specification.

table_comment(tablename_core: str) str[source]

Returns the table description/comment for a given table, if known, or a blank string.
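
A minimal sketch of the kind of lookup this class provides, assuming the specification CSV has the TableName, TableDescription, ColumnName and ColumnDescription columns listed earlier. It is not the class's real implementation: the function names are hypothetical, the sample descriptions are invented, and stripping the "SR" prefix and lower-casing the keys are assumptions.

```python
import csv
import io
from typing import Dict, TextIO, Tuple

# Invented sample; the real specification file has more columns and rows.
SAMPLE_SPEC_CSV = """TableName,TableDescription,ColumnName,ColumnDescription
SRPatient,Patient demographics,RowIdentifier,Unique row ID
SRPatient,Patient demographics,NHSNumber,NHS number
"""

SpecDict = Dict[Tuple[str, str], Dict[str, str]]


def load_demo_specs(f: TextIO) -> SpecDict:
    """Index specification rows by (core table name, column name)."""
    specs: SpecDict = {}
    for row in csv.DictReader(f):
        core = row["TableName"]
        if core.startswith("SR"):
            core = core[2:]  # "SRPatient" -> "Patient"
        specs[(core.lower(), row["ColumnName"].lower())] = row
    return specs


def demo_table_comment(specs: SpecDict, tablename_core: str) -> str:
    """Return the table description if known, or a blank string."""
    wanted = tablename_core.lower()
    for (table, _column), row in specs.items():
        if table == wanted:
            return row["TableDescription"]
    return ""


specs = load_demo_specs(io.StringIO(SAMPLE_SPEC_CSV))
```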

class crate_anon.preprocess.systmone_ddgen.TableCommentWorking(dd: DataDictionary, specifications: SystmOneSRESpecs, append_comments: bool = False, allow_unprefixed_tables: bool = False)[source]

Class used to store data temporarily about table comments, during SystmOne data dictionary annotation. Slightly complex because

__init__(dd: DataDictionary, specifications: SystmOneSRESpecs, append_comments: bool = False, allow_unprefixed_tables: bool = False) None[source]
Parameters:
  • dd – The data dictionary.

  • specifications – Details of the TPP SRE specifications.

  • append_comments – Append comments to any that were autogenerated, rather than replacing them. (If you use the SRE specifications, you may as well set this to False as the SRE specification comments are much better.)

  • allow_unprefixed_tables – Permit tables that don’t start with the expected contextual prefix? Discouraged; you may get odd tables and views.

maybe_add_table_comment(ddr: DataDictionaryRow)[source]

We scan each data dictionary row via this function.

  • If we already have seen a comment for this table in the data dictionary, we don’t do anything, UNLESS this row is itself that comment, and then if

  • Otherwise, we add the SystmOne comment, if found, as an extra DDR, storing it in our “extra_table_comment_rows” list.

crate_anon.preprocess.systmone_ddgen.annotate_systmone_dd_row(ddr: DataDictionaryRow, context: SystmOneContext, specifications: SystmOneSRESpecs, append_comments: bool = False, include_generic: bool = False, allow_unprefixed_tables: bool = False, table_info_in_comments: bool = True) None[source]

Modifies (in place) a data dictionary row for SystmOne.

Parameters:
  • ddr – The data dictionary row to amend.

  • context – The context from which SystmOne data is being extracted (e.g. the raw TPP Strategic Reporting Extract (SRE), or a local version processed into CPFT’s Data Warehouse).

  • specifications – Details of the TPP SRE specifications.

  • append_comments – Append comments to any that were autogenerated, rather than replacing them. (If you use the SRE specifications, you may as well set this to False as the SRE specification comments are much better.)

  • include_generic – Include all fields that are not known about by this code and treated specially? If False, the config file settings are used (which may omit or include). If True, all such fields are included.

  • allow_unprefixed_tables – Permit tables that don’t start with the expected contextual prefix? Discouraged; you may get odd tables and views. A few (see INCLUDE_TABLES_REGEX) are explicitly included anyway.

  • table_info_in_comments – Include table descriptions in column comments?

crate_anon.preprocess.systmone_ddgen.contextual_columnname(tablename_core: str, columnname_core: str, to_context: SystmOneContext) str[source]

Translates a “core” column name to its contextual variant, if applicable.

crate_anon.preprocess.systmone_ddgen.contextual_tablename(tablename_core: str, to_context: SystmOneContext) str[source]

Prefixes the “core” table name for a given context, and sometimes translates it too.

crate_anon.preprocess.systmone_ddgen.core_columnname(tablename_core: str, columnname_context: str, from_context: SystmOneContext) str[source]

Some contexts rename their column names. This function puts them back into the “core” (TPP SRE) name space.

crate_anon.preprocess.systmone_ddgen.core_tablename(tablename: str, from_context: SystmOneContext, allow_unprefixed: bool = False) str[source]

Is this a table of an expected format that we will consider?

  • If so, returns the “core” part of the tablename, in the given context.

  • Otherwise, if allow_unprefixed is set, returns the input.

  • Otherwise, returns an empty string.
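
The table-name functions in this module can be pictured with the following sketch, which uses the two prefixes mentioned earlier ("SR" in the SRE, "S1_" in the CPFT copy). It is an illustration only: DemoContext and the demo_* functions are hypothetical, matching here is case-sensitive, and the real functions also translate some names, which this sketch ignores.

```python
from enum import Enum


class DemoContext(Enum):
    """Hypothetical stand-in for SystmOneContext; the value is the prefix."""
    SRE = "SR"    # raw TPP extract, e.g. SRPatient
    CPFT = "S1_"  # CPFT copy, e.g. S1_Patient


def demo_core_tablename(
    tablename: str, from_context: DemoContext, allow_unprefixed: bool = False
) -> str:
    """Strip the contextual prefix, following the three rules above."""
    prefix = from_context.value
    if tablename.startswith(prefix):
        return tablename[len(prefix):]
    if allow_unprefixed:
        return tablename
    return ""


def demo_contextual_tablename(tablename_core: str, to_context: DemoContext) -> str:
    """Add the prefix for the target context."""
    return to_context.value + tablename_core


def demo_translate_tablename(
    from_tablename: str, from_context: DemoContext, to_context: DemoContext
) -> str:
    """Translate a prefixed table name from one context to another."""
    core = demo_core_tablename(from_tablename, from_context)
    return demo_contextual_tablename(core, to_context)


translated = demo_translate_tablename("SRPatient", DemoContext.SRE, DemoContext.CPFT)
```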

crate_anon.preprocess.systmone_ddgen.cpft_s1_tablename(core_tablename: str) str[source]

Helper function for the consent-for-contact system, but conceptually it sits reasonably well here.

Parameters:

core_tablename – Table name in S1 “core” format (devoid of any prefix).

Returns:

The local CPFT table name.

crate_anon.preprocess.systmone_ddgen.eq(x: str, y: str) bool[source]

Case-insensitive string comparison.

crate_anon.preprocess.systmone_ddgen.eq_re(x: str, y_regex: str) bool[source]

Returns True if the regex matches at the start of the string.
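
The difference between eq() and eq_re() can be shown with a short sketch. These demo_* functions are hypothetical re-implementations based only on the descriptions here; in particular, making the regex match case-insensitive is an assumption (by analogy with is_in_re()).

```python
import re


def demo_eq(x: str, y: str) -> bool:
    """Case-insensitive string comparison."""
    return x.lower() == y.lower()


def demo_eq_re(x: str, y_regex: str) -> bool:
    """True if the regex matches at the start of the string."""
    # re.match() anchors at the start of the string but not at the end.
    return re.match(y_regex, x, flags=re.IGNORECASE) is not None


starts_with_id = demo_eq_re("IDPatient", "id")
```

Because the match is anchored only at the start, "id" matches "IDPatient" but not "PatientID"; see not_just_at_start() and terminate() for changing that behaviour.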

crate_anon.preprocess.systmone_ddgen.get_index_flag(tablename: str, colname: str, ddr: DataDictionaryRow, context: SystmOneContext) IndexType | None[source]

Should this be indexed? Returns an indexing flag, or None if it should not be indexed.

crate_anon.preprocess.systmone_ddgen.get_scrub_alter_details(tablename: str, colname: str, ddr: DataDictionaryRow, context: SystmOneContext, include_generic: bool = False) ScrubSrcAlterMethodInfo[source]

The main “thinking” function.

Is this a sensitive field that should be used for scrubbing? Should it be modified in transit?

Parameters:
  • tablename – The “core” tablename being considered, without any prefix (e.g. “Patient”, not “SRPatient” or “S1_Patient”).

  • colname – The database column name.

  • ddr – Data dictionary row.

  • context – The context from which SystmOne data is being extracted (e.g. the raw TPP Strategic Reporting Extract (SRE), or a local version processed into CPFT’s Data Warehouse).

  • include_generic – Include all fields that are not known about by this code and treated specially? If False, the config file settings are used (which may omit or include). If True, all such fields are included.

crate_anon.preprocess.systmone_ddgen.is_free_text(tablename: str, colname: str, context: SystmOneContext, ddr: DataDictionaryRow | None = None) bool[source]

Is this a free-text field requiring scrubbing?

Unusually, there is not very much free text, and it is mostly collated. (We haven’t added binary support yet. Do we have the binary documents?)

crate_anon.preprocess.systmone_ddgen.is_in(x: str, y: Iterable[str]) bool[source]

Case-insensitive version of “in”, to replace “if x in y”.

crate_anon.preprocess.systmone_ddgen.is_in_re(x: str, y_regexes: Iterable[str]) bool[source]

Case-insensitive regex-based version of “in”, to replace “if x in y”.

crate_anon.preprocess.systmone_ddgen.is_master_patient_table(tablename: str) bool[source]

Is this the master patient table?

crate_anon.preprocess.systmone_ddgen.is_mpid(colname: str, context: SystmOneContext) bool[source]

Is this column the master patient identifier (MPID), i.e. the NHS number?

crate_anon.preprocess.systmone_ddgen.is_other_system_id(colname: str, context: SystmOneContext) bool[source]

Is this column an ID from another system (e.g. RiO, PCMIS)?

crate_anon.preprocess.systmone_ddgen.is_pair_in(a: str, b: str, y: Iterable[Tuple[str, str]]) bool[source]

Case-insensitive version of “in”, to replace “if a, b in y”.

crate_anon.preprocess.systmone_ddgen.is_pair_in_re(a: str, b: str, y_regexes: Iterable[Tuple[str, str]]) bool[source]

Case-insensitive regex-based version of “in”, to replace “if a, b in y”.
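
A sketch of how a (table, column) pair might be tested against a list of regex pairs. The function below is a hypothetical re-implementation from the description above, and the example patterns are invented; it assumes each regex is matched from the start of its string, as for eq_re().

```python
import re
from typing import Iterable, Tuple


def demo_is_pair_in_re(
    a: str, b: str, y_regexes: Iterable[Tuple[str, str]]
) -> bool:
    """True if some (regex_a, regex_b) pair matches a and b respectively."""
    return any(
        re.match(regex_a, a, flags=re.IGNORECASE)
        and re.match(regex_b, b, flags=re.IGNORECASE)
        for regex_a, regex_b in y_regexes
    )


# Invented (table regex, column regex) pairs for the demonstration.
pairs = [
    ("Patient$", "FirstName$"),
    ("PatientRelationship$", ".*Name"),
]
found = demo_is_pair_in_re("patient", "firstname", pairs)
```

Both halves of a pair must match for the pair to count, so a matching table with a non-matching column gives False.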

crate_anon.preprocess.systmone_ddgen.is_pid(colname: str, context: SystmOneContext) bool[source]

Is this column the SystmOne primary patient identifier (PID)?

It’s nearly always S1GenericCol.PID. But occasionally something else (e.g. in CPFT-created tables).

This works for all tables EXCEPT the main “Patient” table, where the PK takes its place.

Occasionally, CPFT tables blend SystmOne patients with other patients using IDs from other EHR systems. However, those patients won’t be in our master patient index, so their data won’t be brought through.

crate_anon.preprocess.systmone_ddgen.is_pk(tablename: str, colname: str, context: SystmOneContext, ddr: DataDictionaryRow | None = None) bool[source]

Is this a primary key (PK) column within its table?

crate_anon.preprocess.systmone_ddgen.join_comments(comments: List[str]) str[source]

Joins comment elements, skipping any blanks.

crate_anon.preprocess.systmone_ddgen.modify_dd_for_systmone(dd: DataDictionary, context: SystmOneContext, sre_spec_csv_filename: str = '', debug_specs: bool = False, append_comments: bool = False, include_generic: bool = False, allow_unprefixed_tables: bool = False, alter_loaded_rows: bool = False, table_info_in_comments: bool = True) None[source]

Modifies a data dictionary in place.

Parameters:
  • dd – The data dictionary to amend.

  • context – The context from which SystmOne data is being extracted (e.g. the raw TPP Strategic Reporting Extract (SRE), or a local version processed into CPFT’s Data Warehouse).

  • sre_spec_csv_filename – Optional filename for the TPP SRE specification file, in comma-separated value (CSV) format. If present, this will be used to add proper descriptive comments to all known fields. Highly recommended.

  • debug_specs – Report the SRE specifications to the log.

  • append_comments – Append comments to any that were autogenerated, rather than replacing them. (If you use the SRE specifications, you may as well set this to False as the SRE specification comments are much better.)

  • include_generic – Include all fields that are not known about by this code and treated specially? If False, the config file settings are used (which may omit or include). If True, all such fields are included.

  • allow_unprefixed_tables – Permit tables that don’t start with the expected contextual prefix? Discouraged; you may get odd tables and views.

  • alter_loaded_rows – Alter rows that were loaded from disk (not read from a database)? The default is to leave such rows untouched.

  • table_info_in_comments – Include table descriptions in column comments?

crate_anon.preprocess.systmone_ddgen.not_just_at_start(x: str) str[source]

Apply a prefix so that a regex string doesn’t just work at the start of a string.

crate_anon.preprocess.systmone_ddgen.process_generic_table_column(tablename: str, colname: str, ddr: DataDictionaryRow, ssi: ScrubSrcAlterMethodInfo, context: SystmOneContext) bool[source]

Performs operations applicable to columns in any SystmOne table, except a few very special ones like Patient. Modifies ssi in place.

Returns: recognized and dealt with?

crate_anon.preprocess.systmone_ddgen.should_be_fulltext_indexed(tablename: str, colname: str) bool[source]

Is this a field that should get a FULLTEXT index? That’s not just “a column that contains free text and should be scrubbed”, that is “a column with a lot of interesting free text that should get a special index”.

crate_anon.preprocess.systmone_ddgen.tablename_prefix(context: SystmOneContext) str[source]

The tablename prefix in the given context.

crate_anon.preprocess.systmone_ddgen.tcmatch(table1: str, column1: str, table2: str, column2: str) bool[source]

Equal (in case-insensitive fashion) for table and column?

crate_anon.preprocess.systmone_ddgen.terminate(x: str) str[source]

Apply an end-of-string terminator to a regex string.
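
The two regex helpers not_just_at_start() and terminate() make sense in combination with start-anchored matching (as in eq_re()). The sketch below shows one way they could work; the exact prefix and terminator used by the real functions are not stated here, so ".*" and "$" are assumptions, and the demo_* names are hypothetical.

```python
import re


def demo_not_just_at_start(x: str) -> str:
    """Allow the regex to match anywhere, not only at the start."""
    return ".*" + x  # assumption: a "match anything" prefix


def demo_terminate(x: str) -> str:
    """Require the regex to match up to the end of the string."""
    return x + "$"  # assumption: the standard end-of-string anchor


# re.match() only looks at the start of the string:
plain = re.match("Name", "FirstName")                                # no match
anywhere = re.match(demo_not_just_at_start("Name"), "FirstName")     # matches
prefix_only = re.match("ID", "IDPatient")                            # matches
whole = re.match(demo_terminate("ID"), "IDPatient")                  # no match
```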

crate_anon.preprocess.systmone_ddgen.translate_tablename(from_tablename: str, from_context: SystmOneContext, to_context: SystmOneContext)[source]

Translates a table name from one S1 context to another.