14.4.21. crate_anon.linkage.validation.validate_fuzzy_linkage
crate_anon/linkage/validation/validate_fuzzy_linkage.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Highly specific code to develop/validate fuzzy linkage.
- class crate_anon.linkage.validation.validate_fuzzy_linkage.CPFTValidationExtras(hashed_nhs_number: str, blurred_dob: str, gender: str, ethnicity: str | None, index_of_multiple_deprivation: int | None, first_mh_care_date: str, age_at_first_mh_care: int | None, any_icd10_dx_present: int, chapter_f_icd10_dx_present: int, severe_mental_illness_icd10_dx_present: int, has_pseudopostcode: bool = False, has_nfa_pseudopostcode: bool = False, has_non_nfa_pseudopostcode: bool = False)[source]
Extra information for the “other_info” column for validation, as per the approved CPFT protocol.
This class should contain all information that would not otherwise make it into the matching file, i.e. that information required to check the correctness and/or bias of matching. It should not contain anything directly identifiable.
We store dates as strings because they are then JSON-serializable. We use blank strings for unknown dates.
- __init__(hashed_nhs_number: str, blurred_dob: str, gender: str, ethnicity: str | None, index_of_multiple_deprivation: int | None, first_mh_care_date: str, age_at_first_mh_care: int | None, any_icd10_dx_present: int, chapter_f_icd10_dx_present: int, severe_mental_illness_icd10_dx_present: int, has_pseudopostcode: bool = False, has_nfa_pseudopostcode: bool = False, has_non_nfa_pseudopostcode: bool = False) None
- set_pseudo_postcode_info(postcodes: Iterable[PostcodeInfo]) None [source]
Updates our pseudopostcode flags, without storing the postcode(s).
- class crate_anon.linkage.validation.validate_fuzzy_linkage.PostcodeInfo(postcode: str, start_date: date | None, end_date: date | None, index_of_multiple_deprivation: int | None)[source]
Postcode with IMD.
- __init__(postcode: str, start_date: date | None, end_date: date | None, index_of_multiple_deprivation: int | None) None
- class crate_anon.linkage.validation.validate_fuzzy_linkage.QueryColnames[source]
Used to reduce some duplication. However, we don’t use these within SQL itself simply because copying/pasting is helpful for SQL development.
- crate_anon.linkage.validation.validate_fuzzy_linkage.help_v2_compare(plaintext: bool) str [source]
Help string for comparing data from all sources. Produces Windows output; one line per command (helpful for commenting things out).
- crate_anon.linkage.validation.validate_fuzzy_linkage.help_v2_fetch(plaintext: bool = True) str [source]
Help string for fetching data from all sources. Produces Windows output; one line per command (helpful for commenting things out).
- crate_anon.linkage.validation.validate_fuzzy_linkage.help_v2_hash() str [source]
Help string for hashing data from all sources. Produces Windows output; one line per command (helpful for commenting things out).
- crate_anon.linkage.validation.validate_fuzzy_linkage.is_ok_nhs_number(n: int) bool [source]
For skipping records with invalid NHS numbers (e.g. bad checksum) or test records (e.g. NHS number starts with 999, the official test range).
- crate_anon.linkage.validation.validate_fuzzy_linkage.last_imd(postcodes: List[PostcodeInfo]) int | None [source]
The IMD from the last postcode specified for which an IMD is known, if any.
- crate_anon.linkage.validation.validate_fuzzy_linkage.make_deletion_data(people: People, cfg: MatchConfig) People [source]
Makes a copy of the supplied data set with deliberate deletions applied.
Surnames and DOBs are excepted as we require exact matches for those.
- crate_anon.linkage.validation.validate_fuzzy_linkage.make_typo_data(people: People, cfg: MatchConfig) People [source]
Makes a copy of the supplied data set with deliberate typos applied.
Surnames and DOBs are excepted as we require exact matches for those.
- crate_anon.linkage.validation.validate_fuzzy_linkage.postcode_temporal_identifiers(postcodes: List[PostcodeInfo]) List[TemporalIDHolder] [source]
Returns the DummyTemporalIdentifier components of a list of postcodes.
- crate_anon.linkage.validation.validate_fuzzy_linkage.save_people_from_db(people: Iterable[Person], output_filename: str, plaintext: bool = True, report_every: int = 1000) None [source]
Saves people (in plaintext) from a function that generates them from a database.
- Parameters:
people – Iterable of
Person
.output_filename – Output CSV/JSONL filename.
plaintext – Save in plaintext format?
report_every – Report progress every n people.
Out of interest, the reason why Windows files show as being of zero length until closed (and how to update the display): https://superuser.com/questions/727883.
- crate_anon.linkage.validation.validate_fuzzy_linkage.speedtest(cfg: MatchConfig, set_breakpoint: bool = False) None [source]
Run self-tests or timing tests.
- Parameters:
cfg – The main
MatchConfig
object.set_breakpoint – Set a pdb breakpoint to explore objects from the Python console?
- crate_anon.linkage.validation.validate_fuzzy_linkage.v2_hashed(database: str) str [source]
A default filename.
- crate_anon.linkage.validation.validate_fuzzy_linkage.v2_outhashed(probands: str, sample: str) str [source]
A default filename.
- crate_anon.linkage.validation.validate_fuzzy_linkage.v2_outplain(probands: str, sample: str) str [source]
A default filename.
- crate_anon.linkage.validation.validate_fuzzy_linkage.v2_plaintext(database: str) str [source]
A default filename.
- crate_anon.linkage.validation.validate_fuzzy_linkage.validate_1(cfg: MatchConfig, people_filename: str, output_filename: str, seed: int = 1234) None [source]
Read data and perform split-half validation.
- Parameters:
cfg – The main
MatchConfig
object.people_filename – Filename of people; see
read_people()
.output_filename – Output CSV filename.
seed – RNG seed
- crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_cdl(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None] [source]
Generates IDENTIFIED people from CPFT’s CRS/CRL source database.
See
validate_2_fetch_rio()
for notes.Information we do not have:
Dates for postcodes; there are address dates in CRS_CDL.dbo.Address but that is de-identified. Not sure where the master identifiable copy is, but maybe no longer available?
Middle names (not present anywhere).
An older query with columns like
patients.dttm_of_birth
is no longer current.Column exploration (see non-aliased table names below):
v.EJPS_ID is PRIMARY KEY VARCHAR(10) NOT NULL; either ‘M<number>’ or ‘number’; length 3/4/6/7.
ip.PatientID is INT NOT NULL, observed length 4/5/6 = ?
ip.Identifier is NVARCHAR(50) NOT NULL, length 4/10/12; deduced (below) to be NHS#
ip.CRSNo is NVARCHAR(50) NOT NULL; length 4/6/7/8; deduced to be CRS/CDL#
Linkage combinations that do/do not work:
v.EPJS_ID = ip.PatientID
– type mismatchv.EPJS_ID = ip.Identifier COLLATE Latin1_General_CI_AS
– matches 0v.EPJS_ID = ip.CRSNo COLLATE Latin1_General_CI_AS
– matches 154658REPLACE(v.EPJS_ID, 'M', '') = REPLACE(ip.CRSNo COLLATE Latin1_General_CI_AS, 'M', '')
– also matches 154658v.NHS_ID = ip.Identifier COLLATE Latin1_General_CI_AS
– matches 152944REPLACE(v.NHS_ID, ' ', '') = REPLACE(ip.Identifier COLLATE Latin1_General_CI_AS, ' ', '')
– matches 153060
v.NHS_ID: VARCHAR(15) column, can be NULL. Length 3/8/10/12.
If 3, is ‘123’ (junk) or ‘xNx’, i.e. missing. (Note that ‘xNx’ is/was a common “missing” code in CRIS.)
If 8, is garbage.
If 12, has spaces in (format: xxx xxx xxxx).
Date range:
SELECT YEAR(REFERRAL_RECVD_DATE), COUNT(*) FROM CRS_CDL.dbo.Referral WHERE REFERRAL_RECVD_DATE IS NOT NULL GROUP BY YEAR(REFERRAL_RECVD_DATE) ORDER BY YEAR(REFERRAL_RECVD_DATE) -- exceeds 10,000/year from 1999-2012 inclusive.
- crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_pcmis(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None] [source]
Generates IDENTIFIED people from CPFT’s PCMIS source database.
- Parameters:
cfg – The main
MatchConfig
object.url – SQLAlchemy URL.
echo – Echo SQL?
- Yields:
Person
objects
Before running:
CREATE INDEX _crateidx_fuzzy_pcmis_pd_pid ON rawPCMIS.dbo.PatientDetails (PatientID); CREATE INDEX _crateidx_fuzzy_pcmis_pd_nhsn ON rawPCMIS.dbo.PatientDetails (NHSNumber); CREATE INDEX _crateidx_fuzzy_pcmis_ref_pid ON rawPCMIS.dbo.CPFT_Referrals (PatientID); CREATE INDEX _crateidx_fuzzy_pcmis_ref_case ON rawPCMIS.dbo.CPFT_Referrals (CaseNumber); CREATE INDEX _crateidx_fuzzy_pcmis_ref_dx1 ON rawPCMIS.dbo.CPFT_Referrals (PrimaryDiagnosis); CREATE INDEX _crateidx_fuzzy_pcmis_ref_dx2 ON rawPCMIS.dbo.CPFT_Referrals (SecondaryDiagnosis);
Date range:
SELECT YEAR(CreateDate), COUNT(*) FROM rawPCMIS.dbo.CPFT_Referrals WHERE CreateDate IS NOT NULL GROUP BY YEAR(CreateDate) ORDER BY YEAR(CreateDate) -- exceeds 1,000/year from 2008-2020 inclusive; -- exceeds 10,000/year from 2015-2019 inclusive;
- crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_rio(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None] [source]
Generates IDENTIFIED people from CPFT’s RiO source database.
The connection to any such database is HIGHLY confidential; it sits on a secure server within a secure network and access to this specific database is very restricted – to administrators only.
- Parameters:
cfg – The main
MatchConfig
object.url – SQLAlchemy URL.
echo – Echo SQL?
- Yields:
Person
objects
Date range:
SELECT YEAR(Referral_DateTime), COUNT(*) FROM RiO62CAMLive.dbo.Referral WHERE Referral_DateTime IS NOT NULL GROUP BY YEAR(Referral_DateTime) ORDER BY YEAR(Referral_DateTime) -- exceeds 10,000/year from 2012-2021 inclusive.
Don’t use
ClientIndex
; first name and surname are always NULL. UseClient
.
- crate_anon.linkage.validation.validate_fuzzy_linkage.validate_2_fetch_systmone(cfg: MatchConfig, url: str, echo: bool = False) Generator[Person, None, None] [source]
Generates IDENTIFIED people from CPFT’s SystmOne source database.
- Parameters:
cfg – The main
MatchConfig
object.url – SQLAlchemy URL.
echo – Echo SQL?
- Yields:
Person
objects
Date range:
We know that CPFT secondary care MH moves to SystmOne in phases, 2020-2021, specifically:
2020-10-12, Children’s Directorate;
2020-12-07, Older People/Adults and Community Directorate (MH);
2021-06-14, Adult & Specialist Directorate.
Before that: in 2015, many Cambridgeshire Community Services (CCS) staff moved to CPFT to support integrated adult/community services (https://www.cambscommunityservices.nhs.uk/docs/default-source/board-papers—june-2016/0-9-1-quality-account-2015-16-final.pdf) and they were using SystmOne.
SELECT YEAR(ReferralDate), COUNT(*) FROM SystmOne.dbo.S1_ReferralsIn WHERE ReferralDate IS NOT NULL GROUP BY YEAR(ReferralDate) ORDER BY YEAR(ReferralDate) -- exceeds 10,000/year from 2007-.
S1_Diagnosis contains these code schemes:
SELECT DISTINCT CodeScheme FROM SystmOne.dbo.S1_Diagnosis -- 'ICD-10', 'OPCS' only (in current data). -- Only 7504 rows in whole table.
In the strategic reporting extract, it’s SRClinicalCode (and there is no SRDiagnosis). There’s SRCode, too, which contains Read/CTV3 and SNOMED codes. CPFT’s table S1_Diagnosis is essentially SRClinicalCode with a Description column linked in, and minor column renaming (e.g. DtDiagnosis to DateDiagnosis). But it’s the right (and only) table for ICD-10 codes.
There are no NHS numbers with spaces in, so we don’t have to worry about that:
SELECT COUNT(*) FROM SystmOne.dbo.S1_Patient WHERE p.NHSNumber LIKE '% %' -- returns 0