12.1.31. crate_anon.anonymise.test_anonymisation

crate_anon/anonymise/test_anonymisation.py

Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.

Test anonymisation for specific databases.

From the output, we have:

n_replacements (POSITIVE)
word_count (N)
true_positive_confidential_masked (TP)
false_positive_banal_masked (FP)
false_negative_confidential_visible_known_to_source (FN)
confidential_visible_but_unknown_to_source

Therefore, having summed across documents:

TP + FP = POSITIVE
NEGATIVE = N - POSITIVE
TN = NEGATIVE - FN

and then we have everything we need (the full confusion matrix). To evaluate against all identifiers, not only those known to the source, we make FN equal to

false_negative_confidential_visible_known_to_source
    + not_false_negative_confidential_visible_but_unknown_to_source

instead.
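The bookkeeping above can be sketched in Python. This is a minimal illustration, not CRATE's own code: the function name `summarise_counts`, the `all_identifiers` flag, and the derived sensitivity, specificity, and PPV are assumptions added here. The per-document dictionaries are assumed to use the count names listed above as keys, with `confidential_visible_but_unknown_to_source` taken as the "unknown to source" count.

```python
from typing import Dict, Iterable, Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None if the denominator is zero."""
    return numerator / denominator if denominator else None


def summarise_counts(
    rows: Iterable[Dict[str, int]], all_identifiers: bool = False
) -> Dict[str, Optional[float]]:
    """Sum per-document counts, then derive TN and some summary ratios."""
    rows = list(rows)
    n = sum(r["word_count"] for r in rows)
    positive = sum(r["n_replacements"] for r in rows)
    tp = sum(r["true_positive_confidential_masked"] for r in rows)
    fp = sum(r["false_positive_banal_masked"] for r in rows)
    fn = sum(
        r["false_negative_confidential_visible_known_to_source"] for r in rows
    )
    if all_identifiers:
        # Assumption: also count identifiers the source did not know about.
        fn += sum(r["confidential_visible_but_unknown_to_source"] for r in rows)

    negative = n - positive  # NEGATIVE = N - POSITIVE
    tn = negative - fn       # TN = NEGATIVE - FN

    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "TN": tn,
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
    }
```

Returning `None` for an undefined ratio (rather than raising) is a design choice made for this sketch, so that an empty sample still yields a result.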

class crate_anon.anonymise.test_anonymisation.FieldInfo(table: str, field: str)

Fetches useful subsets from the data dictionary (DD), for tables that have a primary key, a patient ID, and some text field of interest.

Reads the singleton crate_anon.anonymise.config.Config.

__init__(table: str, field: str) → None

Reads the data dictionary and populates:

pk_ddrow: DD row (DDR) for the table’s PK
pid_ddrow: DDR for the table’s PID field
text_ddrow: DDR for the table’s text field (as chosen by the field parameter)

Parameters:

table – destination table to read information for
field – destination text field to read information for

Raises:

ValueError –

crate_anon.anonymise.test_anonymisation.get_docids(fieldinfo: FieldInfo, uniquepatients: bool = True, limit: int = 100, from_src: bool = True) → List[int]

Returns a limited number of document PKs (which we will use to summarize anonymisation performance).

Parameters:

fieldinfo – FieldInfo describing the table
uniquepatients – fetch one document per patient, across many patients (rather than many documents that may all come from the same patient or a small number of patients)?
limit – maximum number of documents to retrieve
from_src – retrieve IDs from the source database, not the destination database?

Returns:

a list of document IDs
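The effect of uniquepatients and limit can be illustrated with a small pure-Python sketch. This is not how CRATE retrieves IDs (the real function queries a database via the data dictionary); the helper name `pick_docids` and its input, an iterable of `(docid, pid)` pairs, are hypothetical and used only to show the selection semantics.

```python
from typing import Iterable, List, Set, Tuple


def pick_docids(
    doc_patient_pairs: Iterable[Tuple[int, int]],
    uniquepatients: bool = True,
    limit: int = 100,
) -> List[int]:
    """Return up to `limit` document IDs from (docid, pid) pairs.

    With uniquepatients=True, at most one document is kept per patient.
    """
    seen_patients: Set[int] = set()
    docids: List[int] = []
    for docid, pid in doc_patient_pairs:
        if len(docids) >= limit:
            break
        if uniquepatients:
            if pid in seen_patients:
                continue  # this patient already contributed a document
            seen_patients.add(pid)
        docids.append(docid)
    return docids
```

With uniquepatients set, the sample spreads across patients, so a single patient with many documents cannot dominate the performance summary.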

crate_anon.anonymise.test_anonymisation.get_patientnum_anontext(docid: int, fieldinfo: FieldInfo) → Tuple[int | None, str | None]

Fetches the anonymised text for a given document PK, plus the associated research ID (RID).

Parameters:

docid – integer PK for the document
fieldinfo – FieldInfo describing the table

Returns:

(rid, text), or (None, None) if none found

Return type:

tuple

crate_anon.anonymise.test_anonymisation.get_patientnum_rawtext(docid: int, fieldinfo: FieldInfo) → Tuple[int | None, str | None]

Fetches the original text for a given document PK, plus the associated patient ID (PID).

Parameters:

docid – integer PK for the document
fieldinfo – FieldInfo describing the table

Returns:

(pid, text), or (None, None) if none found

Return type:

tuple

Raises:

ValueError –

crate_anon.anonymise.test_anonymisation.main() → None

Command-line entry point. See command-line help.

crate_anon.anonymise.test_anonymisation.process_doc(docid: int, rawdir: str, anondir: str, fieldinfo: FieldInfo, csvwriter: CSVWriterType, first: bool, scrubdict: Dict[int, Dict[str, Any]]) → int

For a given document ID, write the original and anonymised documents to disk, plus some counts to a CSV file. Also saves scrubber information for each patient.

Parameters:

docid – integer PK for the document
rawdir – directory to store raw documents in
anondir – directory to store anonymised documents in
fieldinfo – FieldInfo describing the table
csvwriter – a csv.writer() object to write summary data to
first – is this the first document being processed? If so, we’ll add a CSV header
scrubdict – a dictionary with {pid: scrubber_info} information, which is written to by this function. The scrubber information comes from crate_anon.anonymise.scrub.PersonalizedScrubber.get_raw_info()

Returns:

the patient ID number (PID)
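The "header on the first document" and "scrubdict is written to by the function" behaviours described above can be sketched with the standard csv module. The helper `record_doc`, its parameters, and the column names in the sample data are hypothetical; the real process_doc also fetches the documents and writes them to rawdir and anondir, which this sketch omits.

```python
import csv
import io
from typing import Any, Dict


def record_doc(
    csvwriter: Any,
    first: bool,
    scrubdict: Dict[int, Dict[str, Any]],
    pid: int,
    summary: Dict[str, Any],
    scrubber_info: Dict[str, Any],
) -> int:
    """Write one summary row (plus a header if first) and store scrubber info."""
    if first:
        csvwriter.writerow(list(summary.keys()))  # header row, written once
    csvwriter.writerow(list(summary.values()))
    scrubdict[pid] = scrubber_info  # mutates the caller's dictionary
    return pid


# Usage with an in-memory buffer standing in for the results file.
buffer = io.StringIO()
writer = csv.writer(buffer)
scrubdict: Dict[int, Dict[str, Any]] = {}
docs = [
    (101, {"docid": 1, "word_count": 100, "n_replacements": 10}),
    (102, {"docid": 2, "word_count": 200, "n_replacements": 20}),
]
for i, (pid, summary) in enumerate(docs):
    record_doc(writer, i == 0, scrubdict, pid, summary, {"note": "placeholder"})
```

Passing `first` in from the caller keeps the per-document function stateless, at the cost of relying on every document producing the same columns in the same order.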

crate_anon.anonymise.test_anonymisation.test_anon(uniquepatients: bool, limit: int, from_src: bool, rawdir: str, anondir: str, scrubfile: str, resultsfile: str, dsttable: str, dstfield: str) → None

Fetch raw and anonymised documents and store them in files for comparison, along with some summary information.

Parameters:

uniquepatients – fetch one document per patient, across many patients (rather than many documents that may all come from the same patient or a small number of patients)?
limit – maximum number of documents to retrieve
from_src – retrieve IDs from the source database, not the destination database?
rawdir – directory to store raw documents in
anondir – directory to store anonymised documents in
scrubfile – filename to store scrubber information in (as JSON)
resultsfile – filename to store CSV summaries in
dsttable – name of the destination table
dstfield – name of the destination table’s text field of interest
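Since scrubfile holds the per-patient scrubber information as JSON, one practical detail is worth a sketch: JSON object keys are always strings, so integer PIDs used as dictionary keys come back as strings when the file is reloaded. The helpers `save_scrubber_info` and `load_scrubber_info` and the `"words"` key below are hypothetical illustrations, not part of CRATE, and the exact layout CRATE writes is not assumed here.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_scrubber_info(
    scrubdict: Dict[int, Dict[str, Any]], scrubfile: str
) -> None:
    """Write {pid: scrubber_info} to a JSON file."""
    with open(scrubfile, "w", encoding="utf-8") as f:
        json.dump(scrubdict, f, indent=4, sort_keys=True)


def load_scrubber_info(scrubfile: str) -> Dict[int, Dict[str, Any]]:
    """Read the JSON file back, converting string keys to integer PIDs."""
    with open(scrubfile, encoding="utf-8") as f:
        loaded = json.load(f)
    return {int(pid): info for pid, info in loaded.items()}


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmpdir:
    scrubfile = os.path.join(tmpdir, "scrubber.json")
    original = {2: {"words": ["Smith"]}, 1: {"words": ["Jones"]}}
    save_scrubber_info(original, scrubfile)
    restored = load_scrubber_info(scrubfile)
```

Because this file contains the very identifiers the anonymiser is meant to remove, it should be treated as being as sensitive as the raw documents.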