14.1.31. crate_anon.anonymise.test_anonymisation

crate_anon/anonymise/test_anonymisation.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Test anonymisation for specific databases.

From the output, we have:

n_replacements (POSITIVE)
word_count (N)
true_positive_confidential_masked (TP)
false_positive_banal_masked (FP)
false_negative_confidential_visible_known_to_source (FN)
confidential_visible_but_unknown_to_source

Therefore, having summed across documents:

TP + FP = POSITIVE
NEGATIVE = N - POSITIVE
TN = NEGATIVE - FN

and then we have everything we need (TP, FP, FN, and TN). For all identifiers, we make FN equal to

false_negative_confidential_visible_known_to_source
    + not_false_negative_confidential_visible_but_unknown_to_source

instead.
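The arithmetic above can be sketched as follows. This is a minimal illustration, not CRATE's own code: the `ConfusionCounts` class, the `confusion_counts` function, and the `all_identifiers` flag are hypothetical names, and the example totals are made up. The dictionary keys are the output columns listed above; the second term of the all-identifiers sum is assumed here to be the `confidential_visible_but_unknown_to_source` count.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConfusionCounts:
    """Hypothetical container for the four confusion-matrix cells."""
    tp: int
    fp: int
    fn: int
    tn: int


def confusion_counts(totals: Dict[str, int],
                     all_identifiers: bool = False) -> ConfusionCounts:
    """Derive TP/FP/FN/TN from counts already summed across documents."""
    positive = totals["n_replacements"]  # POSITIVE
    n = totals["word_count"]             # N
    tp = totals["true_positive_confidential_masked"]
    fp = totals["false_positive_banal_masked"]
    fn = totals["false_negative_confidential_visible_known_to_source"]
    if all_identifiers:
        # Assumption: also count identifiers the source did not know about.
        fn += totals["confidential_visible_but_unknown_to_source"]
    negative = n - positive              # NEGATIVE = N - POSITIVE
    tn = negative - fn                   # TN = NEGATIVE - FN
    return ConfusionCounts(tp=tp, fp=fp, fn=fn, tn=tn)


# Made-up totals, for illustration only.
totals = {
    "n_replacements": 30,
    "word_count": 1000,
    "true_positive_confidential_masked": 25,
    "false_positive_banal_masked": 5,
    "false_negative_confidential_visible_known_to_source": 3,
    "confidential_visible_but_unknown_to_source": 2,
}
print(confusion_counts(totals))
print(confusion_counts(totals, all_identifiers=True))
```

From these four cells, measures such as sensitivity, TP / (TP + FN), follow directly.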

class crate_anon.anonymise.test_anonymisation.FieldInfo(table: str, field: str)

Fetches useful subsets from the data dictionary (DD), for tables that have a primary key, a patient ID, and some text field of interest.

Reads the singleton crate_anon.anonymise.config.Config.

__init__(table: str, field: str) -> None

Reads the data dictionary and populates:

  • pk_ddrow: DD row (DDR) for the table’s PK

  • pid_ddrow: DDR for the table’s PID field

  • text_ddrow: DDR for the table’s text field (as chosen by the field parameter)

Parameters:
  • table – destination table to read information for

  • field – destination text field to read information for

Raises:

ValueError

crate_anon.anonymise.test_anonymisation.get_docids(fieldinfo: FieldInfo, uniquepatients: bool = True, limit: int = 100, from_src: bool = True) -> List[int]

Returns a limited number of document PKs (which we will use to summarize anonymisation performance).

Parameters:
  • fieldinfo – FieldInfo describing the table

  • uniquepatients – fetch one document each for a lot of patients (rather than a lot of documents, potentially from the same patient or a small number)?

  • limit – maximum number of documents to retrieve

  • from_src – retrieve IDs from the source database, not the destination database?

Returns:

a list of document IDs
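The effect of uniquepatients and limit can be illustrated with a small in-memory sketch. The real function queries a database; the hypothetical `pick_docids` below instead takes (docid, pid) pairs directly. Keeping the first document seen for each patient is an assumption made for the example.

```python
from typing import Iterable, List, Set, Tuple


def pick_docids(rows: Iterable[Tuple[int, int]],
                uniquepatients: bool = True,
                limit: int = 100) -> List[int]:
    """Hypothetical helper: choose up to `limit` document IDs from
    (docid, pid) pairs, optionally keeping one document per patient."""
    docids: List[int] = []
    seen_pids: Set[int] = set()
    for docid, pid in rows:
        if len(docids) >= limit:
            break
        if uniquepatients:
            if pid in seen_pids:
                continue  # already have a document for this patient
            seen_pids.add(pid)
        docids.append(docid)
    return docids


# Made-up data: documents 1 and 2 belong to the same patient (10).
rows = [(1, 10), (2, 10), (3, 11), (4, 12)]
print(pick_docids(rows, uniquepatients=True, limit=2))
print(pick_docids(rows, uniquepatients=False, limit=3))
```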

crate_anon.anonymise.test_anonymisation.get_patientnum_anontext(docid: int, fieldinfo: FieldInfo) -> Tuple[int | None, str | None]

Fetches the anonymised text for a given document PK, plus the associated research ID (RID).

Parameters:
  • docid – integer PK for the document

  • fieldinfo – FieldInfo describing the table

Returns:

(rid, text), or (None, None) if none found

Return type:

tuple

crate_anon.anonymise.test_anonymisation.get_patientnum_rawtext(docid: int, fieldinfo: FieldInfo) -> Tuple[int | None, str | None]

Fetches the original text for a given document PK, plus the associated patient ID (PID).

Parameters:
  • docid – integer PK for the document

  • fieldinfo – FieldInfo describing the table

Returns:

(pid, text), or (None, None) if none found

Return type:

tuple

Raises:

ValueError

crate_anon.anonymise.test_anonymisation.main() -> None

Command-line entry point. See command-line help.

crate_anon.anonymise.test_anonymisation.process_doc(docid: int, rawdir: str, anondir: str, fieldinfo: FieldInfo, csvwriter: CSVWriterType, first: bool, scrubdict: Dict[int, Dict[str, Any]]) -> int

For a given document ID, write the original and anonymised documents to disk, plus some counts to a CSV file. Also saves scrubber information for each patient.

Parameters:
  • docid – integer PK for the document

  • rawdir – directory to store raw documents in

  • anondir – directory to store anonymised documents in

  • fieldinfo – FieldInfo describing the table

  • csvwriter – a csv.writer() object to write summary data to

  • first – is this the first document being processed? If so, we’ll add a CSV header

  • scrubdict – a dictionary with {pid: scrubber_info} information, which is written to by this function. The scrubber information comes from crate_anon.anonymise.scrub.PersonalizedScrubber.get_raw_info()

Returns:

the patient ID number (PID)
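The role of the first flag can be sketched with the standard csv module. The `write_summary_row` helper and the example summary dictionaries below are hypothetical, not part of CRATE; the sketch only shows the pattern of writing a header row before the first data row.

```python
import csv
import io
from typing import Any, Dict


def write_summary_row(csvwriter: Any,
                      summary: Dict[str, Any],
                      first: bool) -> None:
    """Hypothetical helper: write one summary row, preceded by a header
    row (the dictionary keys) if this is the first document."""
    if first:
        csvwriter.writerow(list(summary.keys()))
    csvwriter.writerow(list(summary.values()))


# Usage with an in-memory buffer standing in for the results file.
buf = io.StringIO()
writer = csv.writer(buf)
write_summary_row(writer, {"docid": 1, "word_count": 10}, first=True)
write_summary_row(writer, {"docid": 2, "word_count": 20}, first=False)
print(buf.getvalue())
```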

crate_anon.anonymise.test_anonymisation.test_anon(uniquepatients: bool, limit: int, from_src: bool, rawdir: str, anondir: str, scrubfile: str, resultsfile: str, dsttable: str, dstfield: str) -> None

Fetch raw and anonymised documents and store them in files for comparison, along with some summary information.

Parameters:
  • uniquepatients – fetch one document each for a lot of patients (rather than a lot of documents, potentially from the same patient or a small number)?

  • limit – maximum number of documents to retrieve

  • from_src – retrieve IDs from the source database, not the destination database?

  • rawdir – directory to store raw documents in

  • anondir – directory to store anonymised documents in

  • scrubfile – filename to store scrubber information in (as JSON)

  • resultsfile – filename to store CSV summaries in

  • dsttable – name of the destination table

  • dstfield – name of the destination table’s text field of interest
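Because scrubber information is keyed by integer PID and stored as JSON, it may help to see what that serialisation does to the keys. The sketch below uses a hypothetical `scrubdict_to_json` helper and a made-up scrubber entry (the "words" key is illustrative only); it is not how CRATE necessarily writes the file.

```python
import json
from typing import Any, Dict


def scrubdict_to_json(scrubdict: Dict[int, Dict[str, Any]]) -> str:
    """Hypothetical helper: serialise {pid: scrubber_info} to JSON text."""
    return json.dumps(scrubdict, indent=4, sort_keys=True)


# Made-up scrubber information for one patient.
example = {1: {"words": ["Smith"]}}
text = scrubdict_to_json(example)
print(text)

# JSON object keys are always strings, so integer PIDs come back as str.
print(json.loads(text))
```

Anything that reads the file back and needs integer PIDs would have to convert the keys explicitly.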