14.1.31. crate_anon.anonymise.test_anonymisation
crate_anon/anonymise/test_anonymisation.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Test anonymisation for specific databases.
From the output, we have:
n_replacements (POSITIVE)
word_count (N)
true_positive_confidential_masked (TP)
false_positive_banal_masked (FP)
false_negative_confidential_visible_known_to_source (FN)
confidential_visible_but_unknown_to_source
Therefore, having summed across documents:
TP + FP = POSITIVE
NEGATIVE = N - POSITIVE
TN = NEGATIVE - FN
and then we have everything we need. For all identifiers, we make FN equal to
false_negative_confidential_visible_known_to_source
+ not_false_negative_confidential_visible_but_unknown_to_source
instead.
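The arithmetic above can be sketched in Python as follows. This is a minimal illustration, not CRATE's own code: the function name `confusion_counts` and the example totals are hypothetical, and only the count names come from the list above. Sensitivity and specificity are shown as examples of standard metrics that the four cells make available.

```python
from typing import Dict


def confusion_counts(
    n_replacements: int,
    word_count: int,
    true_positive_confidential_masked: int,
    false_positive_banal_masked: int,
    false_negative_confidential_visible_known_to_source: int,
) -> Dict[str, int]:
    """Derive TP/FP/FN/TN from counts already summed across documents.

    Hypothetical helper for illustration; not part of crate_anon.
    """
    tp = true_positive_confidential_masked
    fp = false_positive_banal_masked
    fn = false_negative_confidential_visible_known_to_source
    positive = n_replacements
    # The text states TP + FP = POSITIVE; check that the inputs agree.
    if tp + fp != positive:
        raise ValueError("TP + FP should equal POSITIVE (n_replacements)")
    negative = word_count - positive  # NEGATIVE = N - POSITIVE
    tn = negative - fn  # TN = NEGATIVE - FN
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up totals, for illustration only.
counts = confusion_counts(
    n_replacements=50,
    word_count=1000,
    true_positive_confidential_masked=45,
    false_positive_banal_masked=5,
    false_negative_confidential_visible_known_to_source=3,
)
sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
specificity = counts["TN"] / (counts["TN"] + counts["FP"])
```

For the "all identifiers" variant described above, the same helper would be called with the larger FN value (the sum of the two terms) in place of the known-to-source count alone.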
- class crate_anon.anonymise.test_anonymisation.FieldInfo(table: str, field: str)
Fetches useful subsets from the data dictionary (DD), for tables that have a primary key, a patient ID, and some text field of interest.
Reads the singleton crate_anon.anonymise.config.Config.
- __init__(table: str, field: str) -> None
Reads the data dictionary and populates:
pk_ddrow: DD row (DDR) for the table’s PK
pid_ddrow: DDR for the table’s PID field
text_ddrow: DDR for the table’s text field (as chosen by the field parameter)
- Parameters:
table – destination table to read information for
field – destination text field to read information for
- Raises:
ValueError –
- crate_anon.anonymise.test_anonymisation.get_docids(fieldinfo: FieldInfo, uniquepatients: bool = True, limit: int = 100, from_src: bool = True) -> List[int]
Returns a limited number of document PKs (which we will use to summarize anonymisation performance).
- Parameters:
fieldinfo – FieldInfo describing the table
uniquepatients – fetch one document each for a lot of patients (rather than a lot of documents, potentially from the same patient or a small number)?
limit – maximum number of documents to retrieve
from_src – retrieve IDs from the source database, not the destination database?
- Returns:
a list of document IDs
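The effect of the uniquepatients and limit parameters can be illustrated without a database. The sketch below is an assumption-laden stand-in: the real function queries the source or destination database, whereas the hypothetical `pick_docids` helper here simply filters an in-memory list of (docid, pid) pairs.

```python
from typing import Iterable, List, Set, Tuple


def pick_docids(
    rows: Iterable[Tuple[int, int]],
    uniquepatients: bool = True,
    limit: int = 100,
) -> List[int]:
    """Select up to `limit` document IDs from (docid, pid) pairs.

    Hypothetical helper for illustration; not part of crate_anon.
    """
    docids: List[int] = []
    seen_pids: Set[int] = set()
    for docid, pid in rows:
        if len(docids) >= limit:
            break
        if uniquepatients:
            # Keep only the first document seen for each patient.
            if pid in seen_pids:
                continue
            seen_pids.add(pid)
        docids.append(docid)
    return docids


# Made-up (docid, pid) pairs: patients 10 and 12 have two documents each.
rows = [(1, 10), (2, 10), (3, 11), (4, 12), (5, 12)]
one_per_patient = pick_docids(rows, uniquepatients=True, limit=3)
any_documents = pick_docids(rows, uniquepatients=False, limit=3)
```

With uniquepatients set, the three documents come from three different patients; without it, the first three documents are taken regardless of patient.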
- crate_anon.anonymise.test_anonymisation.get_patientnum_anontext(docid: int, fieldinfo: FieldInfo) -> Tuple[int | None, str | None]
Fetches the anonymised text for a given document PK, plus the associated research ID (RID).
- Parameters:
docid – integer PK for the document
fieldinfo – FieldInfo describing the table
- Returns:
rid, text, or None, None if none found
- Return type:
tuple
- crate_anon.anonymise.test_anonymisation.get_patientnum_rawtext(docid: int, fieldinfo: FieldInfo) -> Tuple[int | None, str | None]
Fetches the original text for a given document PK, plus the associated patient ID (PID).
- Parameters:
docid – integer PK for the document
fieldinfo – FieldInfo describing the table
- Returns:
pid, text, or None, None if none found
- Return type:
tuple
- Raises:
ValueError –
- crate_anon.anonymise.test_anonymisation.main() -> None
Command-line entry point. See command-line help.
- crate_anon.anonymise.test_anonymisation.process_doc(docid: int, rawdir: str, anondir: str, fieldinfo: FieldInfo, csvwriter: CSVWriterType, first: bool, scrubdict: Dict[int, Dict[str, Any]]) -> int
For a given document ID, write the original and anonymised documents to disk, plus some counts to a CSV file. Also saves scrubber information for each patient.
- Parameters:
docid – integer PK for the document
rawdir – directory to store raw documents in
anondir – directory to store anonymised documents in
fieldinfo – FieldInfo describing the table
csvwriter – a csv.writer() object to write summary data to
first – is this the first document being processed? If so, we’ll add a CSV header
scrubdict – a dictionary with {pid: scrubber_info} information, which is written to by this function. The scrubber information comes from crate_anon.anonymise.scrub.PersonalizedScrubber.get_raw_info()
- Returns:
the patient ID number (PID)
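The "header on the first document only" pattern implied by the first parameter can be sketched with the standard csv module. This is an illustrative sketch rather than the function's actual body: the helper name `write_summary_row` and the column names in the sample rows are assumptions.

```python
import csv
import io
from typing import Any, Dict


def write_summary_row(csvwriter: Any, summary: Dict[str, Any], first: bool) -> None:
    """Write one summary row, preceded by a header row if `first` is true.

    Hypothetical helper for illustration; not part of crate_anon.
    """
    if first:
        csvwriter.writerow(list(summary.keys()))
    csvwriter.writerow(list(summary.values()))


# Write to an in-memory buffer so the sketch has no side effects on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# Made-up per-document summaries; the real columns may differ.
summaries = [
    {"docid": 1, "pid": 10, "n_replacements": 4, "word_count": 120},
    {"docid": 2, "pid": 11, "n_replacements": 7, "word_count": 300},
]
for index, summary in enumerate(summaries):
    write_summary_row(writer, summary, first=(index == 0))

csv_text = buffer.getvalue()
```

Passing first as a flag keeps the writer itself stateless, so the caller decides when the header is emitted.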
- crate_anon.anonymise.test_anonymisation.test_anon(uniquepatients: bool, limit: int, from_src: bool, rawdir: str, anondir: str, scrubfile: str, resultsfile: str, dsttable: str, dstfield: str) -> None
Fetch raw and anonymised documents and store them in files for comparison, along with some summary information.
- Parameters:
uniquepatients – fetch one document each for a lot of patients (rather than a lot of documents, potentially from the same patient or a small number)?
limit – maximum number of documents to retrieve
from_src – retrieve IDs from the source database, not the destination database?
rawdir – directory to store raw documents in
anondir – directory to store anonymised documents in
scrubfile – filename to store scrubber information in (as JSON)
resultsfile – filename to store CSV summaries in
dsttable – name of the destination table
dstfield – name of the destination table’s text field of interest
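Because the scrubber information is stored as JSON and is keyed by integer PID, anyone reading the scrubfile back should expect string keys. The sketch below shows this with the standard json module; the contents of `scrubdict` are invented for illustration, and the real structure returned by get_raw_info() may differ.

```python
import json
from typing import Any, Dict

# Made-up scrubber information keyed by integer PID (structure is an assumption).
scrubdict: Dict[int, Dict[str, Any]] = {
    10: {"patient_words": ["alice"], "third_party_words": []},
    11: {"patient_words": ["bob"], "third_party_words": ["carol"]},
}

# JSON object keys are always strings, so integer PIDs are converted on output.
scrub_json = json.dumps(scrubdict, indent=4, sort_keys=True)

# Reading back: convert the keys to int again if integer PIDs are needed.
loaded = json.loads(scrub_json)
restored = {int(pid): info for pid, info in loaded.items()}
```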