14.4.21. crate_anon.linkage.validation.validate_fuzzy_linkage

crate_anon/linkage/validation/validate_fuzzy_linkage.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Highly specific code to develop/validate fuzzy linkage.

class crate_anon.linkage.validation.validate_fuzzy_linkage.CPFTValidationExtras(hashed_nhs_number: str, blurred_dob: str, gender: str, ethnicity: str | None, index_of_multiple_deprivation: int | None, first_mh_care_date: str, age_at_first_mh_care: int | None, any_icd10_dx_present: int, chapter_f_icd10_dx_present: int, severe_mental_illness_icd10_dx_present: int, has_pseudopostcode: bool = False, has_nfa_pseudopostcode: bool = False, has_non_nfa_pseudopostcode: bool = False)[source]

Extra information for the “other_info” column for validation, as per the approved CPFT protocol.

This class should contain all information that would not otherwise make it into the matching file, i.e. that information required to check the correctness and/or bias of matching. It should not contain anything directly identifiable.

We store dates as strings because they are then JSON-serializable. We use blank strings for unknown dates.

__init__(hashed_nhs_number: str, blurred_dob: str, gender: str, ethnicity: str | None, index_of_multiple_deprivation: int | None, first_mh_care_date: str, age_at_first_mh_care: int | None, any_icd10_dx_present: int, chapter_f_icd10_dx_present: int, severe_mental_illness_icd10_dx_present: int, has_pseudopostcode: bool = False, has_nfa_pseudopostcode: bool = False, has_non_nfa_pseudopostcode: bool = False) None
set_pseudo_postcode_info(postcodes: Iterable[PostcodeInfo]) None[source]

Updates our pseudopostcode flags, without storing the postcode(s).

class crate_anon.linkage.validation.validate_fuzzy_linkage.PostcodeInfo(postcode: str, start_date: date | None, end_date: date | None, index_of_multiple_deprivation: int | None)[source]

Postcode with IMD.

__init__(postcode: str, start_date: date | None, end_date: date | None, index_of_multiple_deprivation: int | None) None
class crate_anon.linkage.validation.validate_fuzzy_linkage.QueryColnames[source]

Used to reduce some duplication. However, we don’t use these within SQL itself simply because copying/pasting is helpful for SQL development.

crate_anon.linkage.validation.validate_fuzzy_linkage.help_v2_compare(plaintext: bool) str[source]

Help string for comparing data from all sources. Produces Windows output; one line per command (helpful for commenting things out).

crate_anon.linkage.validation.validate_fuzzy_linkage.help_v2_fetch(plaintext: bool = True) str[source]

Help string for fetching data from all sources. Produces Windows output; one line per command (helpful for commenting things out).

crate_anon.linkage.validation.validate_fuzzy_linkage.help_v2_hash() str[source]

Help string for hashing data from all sources. Produces Windows output; one line per command (helpful for commenting things out).

crate_anon.linkage.validation.validate_fuzzy_linkage.is_ok_nhs_number(n: int) bool[source]

For skipping records with invalid NHS numbers (e.g. bad checksum) or test records (e.g. NHS number starts with 999, the official test range).

crate_anon.linkage.validation.validate_fuzzy_linkage.last_imd(postcodes: List[PostcodeInfo]) int | None[source]

The IMD from the last postcode specified for which an IMD is known, if any.

crate_anon.linkage.validation.validate_fuzzy_linkage.main() int[source]

Command-line entry point.

crate_anon.linkage.validation.validate_fuzzy_linkage.make_deletion_data(people: People, cfg: MatchConfig) People[source]

Makes a copy of the supplied data set with deliberate deletions applied.

Surnames and DOBs are excepted as we require exact matches for those.

crate_anon.linkage.validation.validate_fuzzy_linkage.make_typo_data(people: People, cfg: MatchConfig) People[source]

Makes a copy of the supplied data set with deliberate typos applied.

Surnames and DOBs are excepted as we require exact matches for those.

crate_anon.linkage.validation.validate_fuzzy_linkage.postcode_temporal_identifiers(postcodes: List[PostcodeInfo]) List[TemporalIDHolder][source]

Returns the DummyTemporalIdentifier components of a list of postcodes.

crate_anon.linkage.validation.validate_fuzzy_linkage.save_people_from_db(people: Iterable[Person], output_filename: str, plaintext: bool = True, report_every: int = 1000) None[source]

Saves people (in plaintext) from a function that generates them from a database.

Parameters:
  • people – Iterable of Person.

  • output_filename – Output CSV/JSONL filename.

  • plaintext – Save in plaintext format?

  • report_every – Report progress every n people.

Out of interest, the reason why Windows files show as being of zero length until closed (and how to update the display): https://superuser.com/questions/727883.

crate_anon.linkage.validation.validate_fuzzy_linkage.speedtest(cfg: MatchConfig, set_breakpoint: bool = False) None[source]

Run self-tests or timing tests.

Parameters:
  • cfg – The main MatchConfig object.

  • set_breakpoint – Set a pdb breakpoint to explore objects from the Python console?

crate_anon.linkage.validation.validate_fuzzy_linkage.v2_hashed(database: str) str[source]

A default filename.

crate_anon.linkage.validation.validate_fuzzy_linkage.v2_outhashed(probands: str, sample: str) str[source]

A default filename.

crate_anon.linkage.validation.validate_fuzzy_linkage.v2_outplain(probands: str, sample: str) str[source]

A default filename.

crate_anon.linkage.validation.validate_fuzzy_linkage.v2_plaintext(database: str) str[source]

A default filename.

crate_anon.linkage.validation.validate_fuzzy_linkage.validate_1(cfg: MatchConfig, people_filename: str, output_filename: str, seed: int = 1234) None[source]

Read data and perform split-half validation.

Parameters:
  • cfg – The main MatchConfig object.

  • people_filename – Filename of people; see read_people().

  • output_filename – Output CSV filename.

  • seed – RNG seed

crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_cdl(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None][source]

Generates IDENTIFIED people from CPFT’s CRS/CRL source database.

See validate_2_fetch_rio() for notes.

Information we do not have:

  • Dates for postcodes; there are address dates in CRS_CDL.dbo.Address but that is de-identified. Not sure where the master identifiable copy is, but maybe no longer available?

  • Middle names (not present anywhere).

An older query with columns like patients.dttm_of_birth is no longer current.

Column exploration (see non-aliased table names below):

  • v.EJPS_ID is PRIMARY KEY VARCHAR(10) NOT NULL; either ‘M<number>’ or ‘number’; length 3/4/6/7.

  • ip.PatientID is INT NOT NULL, observed length 4/5/6 = ?

  • ip.Identifier is NVARCHAR(50) NOT NULL, length 4/10/12; deduced (below) to be NHS#

  • ip.CRSNo is NVARCHAR(50) NOT NULL; length 4/6/7/8; deduced to be CRS/CDL#

  • Linkage combinations that do/do not work:

    • v.EPJS_ID = ip.PatientID – type mismatch

    • v.EPJS_ID = ip.Identifier COLLATE Latin1_General_CI_AS – matches 0

    • v.EPJS_ID = ip.CRSNo COLLATE Latin1_General_CI_AS – matches 154658

    • REPLACE(v.EPJS_ID, 'M', '') = REPLACE(ip.CRSNo COLLATE Latin1_General_CI_AS, 'M', '') – also matches 154658

    • v.NHS_ID = ip.Identifier COLLATE Latin1_General_CI_AS – matches 152944

    • REPLACE(v.NHS_ID, ' ', '') = REPLACE(ip.Identifier COLLATE Latin1_General_CI_AS, ' ', '') – matches 153060

  • v.NHS_ID: VARCHAR(15) column, can be NULL. Length 3/8/10/12.

    • If 3, is ‘123’ (junk) or ‘xNx’, i.e. missing. (Note that ‘xNx’ is/was a common “missing” code in CRIS.)

    • If 8, is garbage.

    • If 12, has spaces in (format: xxx xxx xxxx).

Date range:

SELECT YEAR(REFERRAL_RECVD_DATE), COUNT(*)
FROM CRS_CDL.dbo.Referral
WHERE REFERRAL_RECVD_DATE IS NOT NULL
GROUP BY YEAR(REFERRAL_RECVD_DATE)
ORDER BY YEAR(REFERRAL_RECVD_DATE)

-- exceeds 10,000/year from 1999-2012 inclusive.
crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_pcmis(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None][source]

Generates IDENTIFIED people from CPFT’s PCMIS source database.

Parameters:
  • cfg – The main MatchConfig object.

  • url – SQLAlchemy URL.

  • echo – Echo SQL?

Yields:

Person objects

Before running:

CREATE INDEX _crateidx_fuzzy_pcmis_pd_pid
    ON rawPCMIS.dbo.PatientDetails (PatientID);
CREATE INDEX _crateidx_fuzzy_pcmis_pd_nhsn
    ON rawPCMIS.dbo.PatientDetails (NHSNumber);

CREATE INDEX _crateidx_fuzzy_pcmis_ref_pid
    ON rawPCMIS.dbo.CPFT_Referrals (PatientID);
CREATE INDEX _crateidx_fuzzy_pcmis_ref_case
    ON rawPCMIS.dbo.CPFT_Referrals (CaseNumber);
CREATE INDEX _crateidx_fuzzy_pcmis_ref_dx1
    ON rawPCMIS.dbo.CPFT_Referrals (PrimaryDiagnosis);
CREATE INDEX _crateidx_fuzzy_pcmis_ref_dx2
    ON rawPCMIS.dbo.CPFT_Referrals (SecondaryDiagnosis);

Date range:

SELECT YEAR(CreateDate), COUNT(*)
FROM rawPCMIS.dbo.CPFT_Referrals
WHERE CreateDate IS NOT NULL
GROUP BY YEAR(CreateDate)
ORDER BY YEAR(CreateDate)

-- exceeds 1,000/year from 2008-2020 inclusive;
-- exceeds 10,000/year from 2015-2019 inclusive;
crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_rio(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None][source]

Generates IDENTIFIED people from CPFT’s RiO source database.

The connection to any such database is HIGHLY confidential; it sits on a secure server within a secure network and access to this specific database is very restricted – to administrators only.

Parameters:
  • cfg – The main MatchConfig object.

  • url – SQLAlchemy URL.

  • echo – Echo SQL?

Yields:

Person objects

Date range:

SELECT YEAR(Referral_DateTime), COUNT(*)
FROM RiO62CAMLive.dbo.Referral
WHERE Referral_DateTime IS NOT NULL
GROUP BY YEAR(Referral_DateTime)
ORDER BY YEAR(Referral_DateTime)

-- exceeds 10,000/year from 2012-2021 inclusive.

Don’t use ClientIndex; first name and surname are always NULL. Use Client.

crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_systmone(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None][source]

Generates IDENTIFIED people from CPFT’s SystmOne source database.

Parameters:
  • cfg – The main MatchConfig object.

  • url – SQLAlchemy URL.

  • echo – Echo SQL?

Yields:

Person objects

Date range:

  • We know that CPFT secondary care MH moves to SystmOne in phases, 2020-2021, specifically:

    • 2020-10-12, Children’s Directorate;

    • 2020-12-07, Older People/Adults and Community Directorate (MH);

    • 2021-06-14, Adult & Specialist Directorate.

  • Before that: in 2015, many Cambridgeshire Community Services (CCS) staff moved to CPFT to support integrated adult/community services (https://www.cambscommunityservices.nhs.uk/docs/default-source/board-papers—june-2016/0-9-1-quality-account-2015-16-final.pdf) and they were using SystmOne.

    SELECT YEAR(ReferralDate), COUNT(*)
    FROM SystmOne.dbo.S1_ReferralsIn
    WHERE ReferralDate IS NOT NULL
    GROUP BY YEAR(ReferralDate)
    ORDER BY YEAR(ReferralDate)
    
    -- exceeds 10,000/year from 2007-.
    
  • S1_Diagnosis contains these code schemes:

    SELECT DISTINCT CodeScheme
    FROM SystmOne.dbo.S1_Diagnosis
    
    -- 'ICD-10', 'OPCS' only (in current data).
    -- Only 7504 rows in whole table.
    

    In the strategic reporting extract, it’s SRClinicalCode (and there is no SRDiagnosis). There’s SRCode, too, which contains Read/CTV3 and SNOMED codes. CPFT’s table S1_Diagnosis is essentially SRClinicalCode with a Description column linked in, and minor column renaming (e.g. DtDiagnosis to DateDiagnosis). But it’s the right (and only) table for ICD-10 codes.

  • There are no NHS numbers with spaces in, so we don’t have to worry about that:

    SELECT COUNT(*) FROM SystmOne.dbo.S1_Patient
    WHERE p.NHSNumber LIKE '% %'
    -- returns 0