12.1.22. crate_anon.anonymise.scrub

crate_anon/anonymise/scrub.py

Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.

Scrubber classes for CRATE anonymiser.

class crate_anon.anonymise.scrub.NonspecificReplacer(replacement_text: str, replacement_text_all_dates: str)[source]

Custom regex replacement for the Nonspecific scrubber. Currently this will “blur” dates if replacement_text_all_dates contains any formatting directives.

__init__(replacement_text: str, replacement_text_all_dates: str)[source]

Parameters:

replacement_text – Generic text to use.
replacement_text_all_dates – Replacement text to use if the matched text is a date. Can include format specifiers to blur the date rather than scrubbing it out entirely.

static is_a_date(groupdict: Dict[str, Any]) → bool[source]: Is the match result a date? We detect this via our named regex groups.

static parse_date(match: Match, groupdict: Dict[str, Any]) → datetime[source]

Retrieve a valid date from the Match object for blurring.

Valid regex group name combinations, where D == DateRegexNames:

D.ISODATE_NO_SEP: D.FOUR_DIGIT_YEAR,

D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.TWO_DIGIT_YEAR,
D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.FOUR_DIGIT_YEAR,
D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.TWO_DIGIT_YEAR,
D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.FOUR_DIGIT_YEAR,

D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.TWO_DIGIT_YEAR,
D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.FOUR_DIGIT_YEAR,
D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.TWO_DIGIT_YEAR,
D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.FOUR_DIGIT_YEAR,

D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.TWO_DIGIT_YEAR,
D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.FOUR_DIGIT_YEAR,
D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.TWO_DIGIT_YEAR,
D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.FOUR_DIGIT_YEAR,

replace(match: Match) → str[source]: When re.sub() or regex.sub() is called, the “repl” argument can be a function. If so, it’s a function that takes a re.Match argument and returns the replacement text.
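The callable-replacer idea can be sketched with the standard library alone. This is a minimal illustration, not CRATE's NonspecificReplacer: the class name BlurringReplacer, the pattern DATE_RE and its group names (day, month, year) are all assumptions made for the example, and CRATE's real date regexes are considerably more elaborate.

```python
import re
from datetime import datetime

# Hypothetical day/month/year pattern with named groups (illustration only).
DATE_RE = re.compile(
    r"\b(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})\b"
)


class BlurringReplacer:
    """Illustrative stand-in for a replacer object; not CRATE's class."""

    def __init__(self, replacement_text: str, date_format: str) -> None:
        self.replacement_text = replacement_text
        self.date_format = date_format

    def replace(self, match: re.Match) -> str:
        # re.sub() calls this once per match and substitutes the return value.
        groups = match.groupdict()
        try:
            dt = datetime(
                int(groups["year"]), int(groups["month"]), int(groups["day"])
            )
        except ValueError:
            # Not a real calendar date (e.g. 31/02/2020): scrub it entirely.
            return self.replacement_text
        # "Blur" the date by keeping only what the format string asks for.
        return dt.strftime(self.date_format)


replacer = BlurringReplacer("[~~~]", "[%Y]")
blurred = DATE_RE.sub(replacer.replace, "Seen on 07/02/2013 and 31/02/2020.")
print(blurred)  # Seen on [2013] and [~~~].
```

Passing the bound method replacer.replace as the "repl" argument lets the replacement depend on what was matched, which a plain replacement string cannot do.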

class crate_anon.anonymise.scrub.NonspecificScrubber(hasher: GenericHasher, replacement_text: str = '[~~~]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, denylist: WordList | None = None, scrub_all_numbers_of_n_digits: List[int] | None = None, scrub_all_uk_postcodes: bool = False, scrub_all_dates: bool = False, replacement_text_all_dates: str = '[~~~]', scrub_all_email_addresses: bool = False, extra_regexes: List[str] | None = None)[source]

Scrubs a bunch of things that are independent of any patient-specific data, such as removing all UK postcodes, or numbers of a certain length.

__init__(hasher: GenericHasher, replacement_text: str = '[~~~]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, denylist: WordList | None = None, scrub_all_numbers_of_n_digits: List[int] | None = None, scrub_all_uk_postcodes: bool = False, scrub_all_dates: bool = False, replacement_text_all_dates: str = '[~~~]', scrub_all_email_addresses: bool = False, extra_regexes: List[str] | None = None) → None[source]

Parameters:

replacement_text – Replace sensitive content with this string.
hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher.
anonymise_codes_at_word_boundaries_only – For codes: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
anonymise_dates_at_word_boundaries_only – Scrub dates only if they occur at word boundaries. (Even if this is False, some restrictions still apply, because otherwise very odd things would happen; see crate_anon.anonymise.anonregex.get_generic_date_regex_elements().)
anonymise_numbers_at_word_boundaries_only – For numbers: Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. If not set, the regex must be surrounded by non-digits. (If it were surrounded by more digits, it wouldn’t be an n-digit number!)
denylist – Words to scrub.
scrub_all_numbers_of_n_digits – List of values of n; number lengths to scrub.
scrub_all_uk_postcodes – Scrub all UK postcodes?
scrub_all_dates – Scrub all dates? (Currently assumes the default locale for month names and ordinal suffixes.)
replacement_text_all_dates – When scrub_all_dates is True, replace with this text. Supports limited datetime.strftime directives for “blurring” of dates. Example: “%b %Y” for abbreviated month and year.
scrub_all_email_addresses – Scrub all e-mail addresses?
extra_regexes – List of user-defined extra regexes to scrub.
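The difference between the two boundary modes described for anonymise_numbers_at_word_boundaries_only can be sketched as follows. The helper n_digit_regex is hypothetical; CRATE builds its actual patterns in crate_anon.anonymise.anonregex, and they may differ in detail.

```python
import re


def n_digit_regex(n: int, at_word_boundaries_only: bool) -> str:
    """Hypothetical helper: regex text matching numbers of exactly n digits."""
    if at_word_boundaries_only:
        # Require a word boundary on both sides.
        return rf"\b\d{{{n}}}\b"
    # Otherwise require only that the neighbouring characters are not digits
    # (if they were digits, this would not be an n-digit number).
    return rf"(?<!\d)\d{{{n}}}(?!\d)"


text = "NHS1234567890 ref 1234567890 and 12345678901"
strict = re.sub(n_digit_regex(10, True), "[~~~]", text)
loose = re.sub(n_digit_regex(10, False), "[~~~]", text)
print(strict)  # NHS1234567890 ref [~~~] and 12345678901
print(loose)   # NHS[~~~] ref [~~~] and 12345678901
```

With word boundaries required, a number glued to letters survives; without them, it is scrubbed. In both modes the 11-digit number is left alone.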

build_regex() → None[source]: Compile our high-speed regex.

check_replacement_text_all_dates() → None[source]: Ensure our date-replacement text is legitimate in terms of e.g. “%Y”-style directives.

get_hash() → str[source]: Returns a hash of our scrubber – so we can store it, and later see if it’s changed. In an incremental update, if the scrubber has changed, we should re-anonymise all data for this patient.
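The change-detection idea can be sketched with a keyed hash over a deterministic serialisation of the scrubber's settings. This is an assumption-laden illustration: the function hash_settings, the example key and the settings dictionaries are invented here, and CRATE uses its own GenericHasher rather than this code.

```python
import hashlib
import hmac
import json


def hash_settings(settings: dict, key: bytes) -> str:
    """Hypothetical helper: keyed hash of a settings dictionary."""
    # Serialise deterministically, so identical settings always give the
    # same hash regardless of key order.
    canonical = json.dumps(settings, sort_keys=True)
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-secret-key"  # illustrative only
old_hash = hash_settings({"scrub_all_dates": False, "denylist": ["smith"]}, key)
same_hash = hash_settings({"denylist": ["smith"], "scrub_all_dates": False}, key)
new_hash = hash_settings({"scrub_all_dates": True, "denylist": ["smith"]}, key)

# A changed hash signals that previously scrubbed data should be redone.
needs_rescrub = new_hash != old_hash
```

Only the stored hash is compared on the next run, so the settings themselves (which may be sensitive) need not be kept alongside the output.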

get_replacer() → Replacer[source]: Return a function that can be used as the “repl” (replacer) argument to a re.sub() or regex.sub() call.

scrub(text: str) → str[source]

Returns a scrubbed version of the text.

Parameters: text – the raw text, potentially containing sensitive information
Returns: the de-identified text

class crate_anon.anonymise.scrub.PersonalizedScrubber(hasher: GenericHasher, replacement_text_patient: str = '[__PPP__]', replacement_text_third_party: str = '[__TTT__]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_codes_at_numeric_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, anonymise_numbers_at_numeric_boundaries_only: bool = True, anonymise_strings_at_word_boundaries_only: bool = True, min_string_length_for_errors: int = 3, min_string_length_to_scrub_with: int = 2, scrub_string_suffixes: List[str] | None = None, string_max_regex_errors: int = 0, allowlist: WordList | None = None, alternatives: List[List[str]] | None = None, nonspecific_scrubber: NonspecificScrubber | None = None, nonspecific_scrubber_first: bool = False, debug: bool = False)[source]

Accepts patient-specific (patient and third-party) information, and uses that to scrub text.

__init__(hasher: GenericHasher, replacement_text_patient: str = '[__PPP__]', replacement_text_third_party: str = '[__TTT__]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_codes_at_numeric_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, anonymise_numbers_at_numeric_boundaries_only: bool = True, anonymise_strings_at_word_boundaries_only: bool = True, min_string_length_for_errors: int = 3, min_string_length_to_scrub_with: int = 2, scrub_string_suffixes: List[str] | None = None, string_max_regex_errors: int = 0, allowlist: WordList | None = None, alternatives: List[List[str]] | None = None, nonspecific_scrubber: NonspecificScrubber | None = None, nonspecific_scrubber_first: bool = False, debug: bool = False) → None[source]

Parameters:

hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher.
replacement_text_patient – Replace sensitive “patient” content with this string.
replacement_text_third_party – Replace sensitive “third party” content with this string.
anonymise_codes_at_word_boundaries_only – For codes: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
anonymise_codes_at_numeric_boundaries_only – For codes: Boolean. Only applicable if anonymise_codes_at_word_boundaries_only is False. Ensure that the code is only recognized when surrounded by non-numbers; that is, only at the boundaries of numbers (at numeric boundaries). See crate_anon.anonymise.anonregex.get_code_regex_elements().
anonymise_dates_at_word_boundaries_only – For dates: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
anonymise_numbers_at_word_boundaries_only – For numbers: Boolean. Ensure that the regex begins and ends with a word boundary requirement. See crate_anon.anonymise.anonregex.get_code_regex_elements().
anonymise_numbers_at_numeric_boundaries_only – For numbers: Boolean. Only applicable if anonymise_numbers_at_word_boundaries_only is False. Ensure that the number is only recognized when surrounded by non-numbers; that is, only at the boundaries of numbers (at numeric boundaries). See crate_anon.anonymise.anonregex.get_code_regex_elements().
anonymise_strings_at_word_boundaries_only – For strings: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
min_string_length_for_errors – For strings: minimum string length at which typographical errors will be permitted.
min_string_length_to_scrub_with – For strings: minimum length a string must have to be used for scrubbing; shorter strings are not scrubbed with.
scrub_string_suffixes – A list of suffixes to permit on strings.
string_max_regex_errors – The maximum number of typographical insertion / deletion / substitution errors to permit.
allowlist – WordList of words to allow (not to scrub).
alternatives – This allows words to be substituted by equivalents; such as St for Street or Rd for Road. The parameter is a list of lists of equivalents; see crate_anon.anonymise.config.get_word_alternatives().
nonspecific_scrubber – NonspecificScrubber to apply to remove information that is generic.
nonspecific_scrubber_first – If one is provided, run the nonspecific scrubber first (rather than last)?
debug – Show the final scrubber regex text as we compile our regexes.

add_value(value: Any, scrub_method: ScrubMethod, patient: bool = True, clear_cache: bool = True) → None[source]

Add a specific value via a specific scrub_method.

Parameters:

value – value to add to the scrubber
scrub_method – crate_anon.anonymise.constants.SCRUBMETHOD value
patient – Boolean; controls whether it’s treated as a patient value or a third-party value.
clear_cache – also clear our cache?

build_regexes() → None[source]: Compile our regexes.

clear_cache() → None[source]: Clear the internal cache (the compiled regex).

get_elements_code(value: Any) → List[str][source]

Start with an alphanumeric code. Remove whitespace. Build a regex that scrubs the code.

Particular examples: postcodes, e.g. "PE12 3AB".

Parameters: value – a string containing an alphanumeric code
Returns: a list of regex elements
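The steps described (remove whitespace, then build a regex for the code) might look like the sketch below. The helper code_regex is hypothetical and returns a single pattern string rather than a list of elements; allowing optional whitespace between characters is an assumption about what a useful pattern would do, not a statement of CRATE's implementation.

```python
import re


def code_regex(value: str, at_word_boundaries_only: bool = True) -> str:
    """Hypothetical helper: regex text for an alphanumeric code."""
    # Remove all whitespace from the source value.
    chars = re.sub(r"\s+", "", value)
    # Permit optional whitespace between any two characters of the code.
    body = r"\s*".join(re.escape(c) for c in chars)
    return rf"\b{body}\b" if at_word_boundaries_only else body


pattern = re.compile(code_regex("PE12 3AB"), re.IGNORECASE)
scrubbed = pattern.sub("[__PPP__]", "Lives at PE123AB (formerly pe12  3ab).")
print(scrubbed)  # Lives at [__PPP__] (formerly [__PPP__]).
```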

get_elements_date(value: datetime | date) → List[str] | None[source]: Returns a list of regex elements for a given date value.

get_elements_numeric(value: Any) → List[str][source]

Start with a number. Remove everything but the digits. Build a regex that scrubs the number.

Particular examples: phone numbers, e.g. "(01223) 123456".

Parameters: value – a string containing a number, or an actual number.
Returns: a list of regex elements
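A comparable sketch for numbers: keep only the digits, then build a pattern that tolerates separators between them. The helper numeric_regex, its choice of permitted separators and its use of numeric boundaries are assumptions for illustration; the real regex elements come from crate_anon.anonymise.anonregex.

```python
import re
from typing import Any


def numeric_regex(value: Any) -> str:
    """Hypothetical helper: regex text for a number, ignoring formatting."""
    # Remove everything but the digits.
    digits = re.sub(r"\D", "", str(value))
    if not digits:
        raise ValueError("value contains no digits")
    # Allow optional spaces, brackets or hyphens between consecutive digits.
    body = r"[\s()\-]*".join(digits)
    # Numeric boundaries: the match must not be flanked by further digits.
    return rf"(?<!\d){body}(?!\d)"


pattern = re.compile(numeric_regex("(01223) 123456"))
scrubbed = pattern.sub("[__PPP__]", "Tel 01223-123456 or 01223 123 456.")
print(scrubbed)  # Tel [__PPP__] or [__PPP__].
```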

get_elements_phrase(value: Any) → List[str][source]: Returns a list of regex elements for a given phrase.

get_elements_phrase_unless_numeric(value: Any) → List[str][source]: If the value is numeric, return an empty list. Otherwise, returns a list of regex elements for the given phrase.

get_elements_words(value: str) → List[str][source]: Returns a list of regex elements for a given string that contains textual words.

get_hash() → str[source]: Returns a hash of our scrubber – so we can store it, and later see if it’s changed. In an incremental update, if the scrubber has changed, we should re-anonymise all data for this patient.

get_patient_regex_string() → str[source]: Return the string version of the patient regex, sorted.

get_raw_info() → Dict[str, Any][source]

Summarizes settings and (sensitive) data for this scrubber.

This is both a summary for debugging and the basis for our change-detection hash (and for the latter reason we need order etc. to be consistent). For any information we put in here, changes will cause data to be re-scrubbed.

Note that the hasher should be a secure one, because this is sensitive information.

static get_scrub_method(datatype_long: str, scrub_method: ScrubMethod | None) → ScrubMethod[source]

Return the default scrub method for a given SQL datatype, unless overridden. For example, dates are scrubbed via a date method; numbers by a numeric method.

Parameters:

datatype_long – SQL datatype as a string
scrub_method – optional method to enforce

Returns:

crate_anon.anonymise.constants.SCRUBMETHOD value

get_tp_regex_string() → str[source]: Return the string version of the third-party regex, sorted.

scrub(text: str) → str | None[source]

Returns a scrubbed version of the text.

Parameters: text – the raw text, potentially containing sensitive information
Returns: the de-identified text

class crate_anon.anonymise.scrub.Replacer(replacement_text: str)[source]

Custom regex replacement called from regex.sub(). This base class doesn’t do much and is the equivalent of just passing the replacement text to regex.sub().

__init__(replacement_text: str) → None[source]

replace(match: Match) → str[source]: When re.sub() or regex.sub() is called, the “repl” argument can be a function. If so, it’s a function that takes a re.Match argument and returns the replacement text.

class crate_anon.anonymise.scrub.ScrubberBase(hasher: GenericHasher)[source]

Scrubber base class.

__init__(hasher: GenericHasher) → None[source]

Parameters: hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher

abstract get_hash() → str[source]: Returns a hash of our scrubber – so we can store it, and later see if it’s changed. In an incremental update, if the scrubber has changed, we should re-anonymise all data for this patient.

abstract scrub(text: str) → str[source]

Returns a scrubbed version of the text.

Parameters: text – the raw text, potentially containing sensitive information
Returns: the de-identified text

class crate_anon.anonymise.scrub.WordList(filenames: Iterable[str] | None = None, words: Iterable[str] | None = None, as_phrases: bool = False, replacement_text: str = '[---]', hasher: GenericHasher | None = None, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0, regex_method: bool = False)[source]

A scrubber that removes all words in a wordlist, in case-insensitive fashion.

This serves a dual function as an allowlist (is a word in the list?) and a denylist (scrub text using the wordlist).

__init__(filenames: Iterable[str] | None = None, words: Iterable[str] | None = None, as_phrases: bool = False, replacement_text: str = '[---]', hasher: GenericHasher | None = None, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0, regex_method: bool = False) → None[source]

Parameters:

filenames – Filenames to read words from.
words – Additional words to add.
as_phrases – Keep lines in the source file intact (as phrases), rather than splitting them into individual words, and (if regex_method is True) scrub as phrases.
replacement_text – Replace sensitive content with this string.
hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher.
suffixes – Append each of these suffixes to each word.
at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. (If false: will scrub ANN from bANNed, for example.)
max_errors – The maximum number of typographical insertion / deletion / substitution errors to permit. Applicable only if regex_method is True.
regex_method – Use regular expressions? If True: slower, but phrase scrubbing deals with variable whitespace. If False: much faster (uses FlashText), but whitespace is inflexible.
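The effect of at_word_boundaries_only can be shown with a small regex-based sketch. The helper wordlist_regex is hypothetical and uses only the standard re module; it does not reproduce WordList itself, its FlashText path, or its suffix and error-tolerance handling.

```python
import re
from typing import Iterable


def wordlist_regex(
    words: Iterable[str], at_word_boundaries_only: bool = True
) -> re.Pattern:
    """Hypothetical helper: one case-insensitive regex for a list of words."""
    # Longest words first, so longer entries win over their own prefixes.
    ordered = sorted(words, key=len, reverse=True)
    body = "(?:" + "|".join(re.escape(w) for w in ordered) + ")"
    if at_word_boundaries_only:
        body = rf"\b{body}\b"
    return re.compile(body, re.IGNORECASE)


text = "Ann was banned."
strict = wordlist_regex(["ann"]).sub("[---]", text)
loose = wordlist_regex(["ann"], at_word_boundaries_only=False).sub("[---]", text)
print(strict)  # [---] was banned.
print(loose)   # [---] was b[---]ed.
```

The second result shows why the word-boundary requirement is usually the safer choice for a denylist.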

add_file(filename: str, clear_cache: bool = True) → None[source]

Add all words from a file.

Parameters:

filename – File to read.
clear_cache – Also clear our cache?

add_word(word: str, clear_cache: bool = True) → None[source]

Add a word to our wordlist.

Parameters:

word – word to add
clear_cache – also clear our cache?

build() → None[source]: Compiles a high-speed scrubbing device, be it a regex or a FlashText processor. Called only when we have collected all our words.

clear_cache() → None[source]: Clear cached information (e.g. the compiled regex, the cached hash of this scrubber).

contains(word: str) → bool[source]: Does our wordlist contain this word?

get_hash() → str[source]: Returns a hash of our scrubber – so we can store it, and later see if it’s changed. In an incremental update, if the scrubber has changed, we should re-anonymise all data for this patient.

scrub(text: str) → str[source]

Returns a scrubbed version of the text.

Parameters: text – the raw text, potentially containing sensitive information
Returns: the de-identified text

crate_anon.anonymise.scrub.lower_case_phrase_lines_from_file(filename: str) → Generator[str, None, None][source]: Generates lower-case phrases from a file, one per line.

crate_anon.anonymise.scrub.lower_case_words_from_file(filename: str) → Generator[str, None, None][source]: Generates lower-case words from a file.
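The difference between the two generators (phrases per line versus individual words) can be sketched as below. The functions words_from_file and phrase_lines_from_file are illustrative stand-ins written for this example; how CRATE's own functions treat blank lines, comments or punctuation is not stated here, so the sketch simply skips blank lines and splits on whitespace.

```python
import os
import tempfile
from typing import Generator


def words_from_file(filename: str) -> Generator[str, None, None]:
    """Illustrative: yield each whitespace-separated word, lower-cased."""
    with open(filename, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                yield word.lower()


def phrase_lines_from_file(filename: str) -> Generator[str, None, None]:
    """Illustrative: yield each non-blank line as one lower-case phrase."""
    with open(filename, encoding="utf-8") as f:
        for line in f:
            phrase = line.strip().lower()
            if phrase:
                yield phrase


# Write a small temporary word file to demonstrate both generators.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Addenbrooke's Hospital\n\nCambridge\n")
    tmp_name = tmp.name

words = list(words_from_file(tmp_name))
phrases = list(phrase_lines_from_file(tmp_name))
os.remove(tmp_name)

print(words)    # ["addenbrooke's", 'hospital', 'cambridge']
print(phrases)  # ["addenbrooke's hospital", 'cambridge']
```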