14.1.22. crate_anon.anonymise.scrub
crate_anon/anonymise/scrub.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Scrubber classes for CRATE anonymiser.
- class crate_anon.anonymise.scrub.NonspecificReplacer(replacement_text: str, replacement_text_all_dates: str)[source]
Custom regex replacement for the Nonspecific scrubber. Currently this will “blur” dates if replacement_text_all_dates contains any formatting directives.
- __init__(replacement_text: str, replacement_text_all_dates: str)[source]
- Parameters:
replacement_text – Generic text to use.
replacement_text_all_dates – Replacement text to use if the matched text is a date. Can include format specifiers to blur the date rather than scrubbing it out entirely.
- static is_a_date(groupdict: Dict[str, Any]) bool [source]
Is the match result a date? We detect this via our named regex groups.
- static parse_date(match: Match, groupdict: Dict[str, Any]) datetime [source]
Retrieve a valid date from the Match object for blurring.
Valid regex group name combinations, where D == DateRegexNames:
D.ISODATE_NO_SEP: D.FOUR_DIGIT_YEAR,
D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.TWO_DIGIT_YEAR,
D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.FOUR_DIGIT_YEAR,
D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.TWO_DIGIT_YEAR,
D.DAY_MONTH_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.FOUR_DIGIT_YEAR,
D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.TWO_DIGIT_YEAR,
D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.FOUR_DIGIT_YEAR,
D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.TWO_DIGIT_YEAR,
D.MONTH_DAY_YEAR: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.FOUR_DIGIT_YEAR,
D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.TWO_DIGIT_YEAR,
D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.NUMERIC_MONTH, D.FOUR_DIGIT_YEAR,
D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.TWO_DIGIT_YEAR,
D.YEAR_MONTH_DAY: D.NUMERIC_DAY, D.ALPHABETICAL_MONTH, D.FOUR_DIGIT_YEAR,
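The date-blurring idea can be sketched with the standard library alone. This is not CRATE's implementation: the simplified regex, the group names day, month and year, the class name BlurringReplacer, and the fallback text are all assumptions made for illustration. The sketch shows a callable passed to re.sub() that rebuilds a date from named groups and formats it with a strftime directive.

```python
import re
from datetime import datetime

# Hypothetical, simplified day/month/year pattern with named groups.
DATE_RE = re.compile(
    r"\b(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})\b"
)


class BlurringReplacer:
    """Callable for re.sub() that blurs dates via strftime directives."""

    def __init__(self, replacement_text_all_dates: str) -> None:
        self.replacement_text_all_dates = replacement_text_all_dates

    def __call__(self, match: re.Match) -> str:
        groups = match.groupdict()
        try:
            date = datetime(
                int(groups["year"]), int(groups["month"]), int(groups["day"])
            )
        except ValueError:
            # Not a real calendar date (e.g. 31/02/2020): scrub it outright.
            return "[~~~]"
        return date.strftime(self.replacement_text_all_dates)


# Keep only the year; "[%Y]" avoids locale-dependent month names.
blurred = DATE_RE.sub(BlurringReplacer("[%Y]"), "Seen on 05/03/2019.")
impossible = DATE_RE.sub(BlurringReplacer("[%Y]"), "31/02/2020")
```

Here the first call keeps the year and discards day and month, while the impossible date falls back to a plain scrub.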
- class crate_anon.anonymise.scrub.NonspecificScrubber(hasher: GenericHasher, replacement_text: str = '[~~~]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, denylist: WordList | None = None, scrub_all_numbers_of_n_digits: List[int] | None = None, scrub_all_uk_postcodes: bool = False, scrub_all_dates: bool = False, replacement_text_all_dates: str = '[~~~]', scrub_all_email_addresses: bool = False, extra_regexes: List[str] | None = None)[source]
Scrubs a bunch of things that are independent of any patient-specific data, such as removing all UK postcodes, or numbers of a certain length.
- __init__(hasher: GenericHasher, replacement_text: str = '[~~~]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, denylist: WordList | None = None, scrub_all_numbers_of_n_digits: List[int] | None = None, scrub_all_uk_postcodes: bool = False, scrub_all_dates: bool = False, replacement_text_all_dates: str = '[~~~]', scrub_all_email_addresses: bool = False, extra_regexes: List[str] | None = None) None [source]
- Parameters:
replacement_text – Replace sensitive content with this string.
hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher.
anonymise_codes_at_word_boundaries_only – For codes: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
anonymise_dates_at_word_boundaries_only – Scrub dates only if they occur at word boundaries. (Even if you say no, there are some restrictions or very odd things would happen; see crate_anon.anonymise.anonregex.get_generic_date_regex_elements().)
anonymise_numbers_at_word_boundaries_only – For numbers: Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. If not set, the regex must be surrounded by non-digits. (If it were surrounded by more digits, it wouldn’t be an n-digit number!)
denylist – Words to scrub.
scrub_all_numbers_of_n_digits – List of values of n; number lengths to scrub.
scrub_all_uk_postcodes – Scrub all UK postcodes?
scrub_all_dates – Scrub all dates? (Currently assumes the default locale for month names and ordinal suffixes.)
replacement_text_all_dates – When scrub_all_dates is True, replace with this text. Supports limited datetime.strftime directives for “blurring” of dates. Example: “%b %Y” for abbreviated month and year.
scrub_all_email_addresses – Scrub all e-mail addresses?
extra_regexes – List of user-defined extra regexes to scrub.
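The difference between the two settings of anonymise_numbers_at_word_boundaries_only for n-digit numbers can be illustrated roughly as follows. The helper name n_digit_number_regex is hypothetical, and the patterns are an assumed approximation of the behaviour described above, not the regexes CRATE actually builds.

```python
import re


def n_digit_number_regex(n: int, at_word_boundaries_only: bool) -> str:
    """Hypothetical helper: regex text matching numbers of exactly n digits."""
    if at_word_boundaries_only:
        # Word boundary required on both sides.
        return rf"\b\d{{{n}}}\b"
    # Otherwise only require that the neighbours are not digits.
    return rf"(?<!\d)\d{{{n}}}(?!\d)"


text = "ID abc123456def, phone 1234567"
strict = re.sub(n_digit_number_regex(6, True), "[~~~]", text)
loose = re.sub(n_digit_number_regex(6, False), "[~~~]", text)
```

With word boundaries required, the six digits embedded in letters are left alone; with the non-digit rule they are scrubbed. The seven-digit number is untouched either way, because it is not a six-digit number.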
- check_replacement_text_all_dates() None [source]
Ensure our date-replacement text is legitimate in terms of e.g. “%Y”-style directives.
- get_hash() str [source]
Returns a hash of our scrubber – so we can store it, and later see if it’s changed. In an incremental update, if the scrubber has changed, we should re-anonymise all data for this patient.
- class crate_anon.anonymise.scrub.PersonalizedScrubber(hasher: GenericHasher, replacement_text_patient: str = '[__PPP__]', replacement_text_third_party: str = '[__TTT__]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_codes_at_numeric_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, anonymise_numbers_at_numeric_boundaries_only: bool = True, anonymise_strings_at_word_boundaries_only: bool = True, min_string_length_for_errors: int = 3, min_string_length_to_scrub_with: int = 2, scrub_string_suffixes: List[str] | None = None, string_max_regex_errors: int = 0, allowlist: WordList | None = None, alternatives: List[List[str]] | None = None, nonspecific_scrubber: NonspecificScrubber | None = None, nonspecific_scrubber_first: bool = False, debug: bool = False)[source]
Accepts patient-specific (patient and third-party) information, and uses that to scrub text.
- __init__(hasher: GenericHasher, replacement_text_patient: str = '[__PPP__]', replacement_text_third_party: str = '[__TTT__]', anonymise_codes_at_word_boundaries_only: bool = True, anonymise_codes_at_numeric_boundaries_only: bool = True, anonymise_dates_at_word_boundaries_only: bool = True, anonymise_numbers_at_word_boundaries_only: bool = False, anonymise_numbers_at_numeric_boundaries_only: bool = True, anonymise_strings_at_word_boundaries_only: bool = True, min_string_length_for_errors: int = 3, min_string_length_to_scrub_with: int = 2, scrub_string_suffixes: List[str] | None = None, string_max_regex_errors: int = 0, allowlist: WordList | None = None, alternatives: List[List[str]] | None = None, nonspecific_scrubber: NonspecificScrubber | None = None, nonspecific_scrubber_first: bool = False, debug: bool = False) None [source]
- Parameters:
hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher.
replacement_text_patient – Replace sensitive “patient” content with this string.
replacement_text_third_party – Replace sensitive “third party” content with this string.
anonymise_codes_at_word_boundaries_only – For codes: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
anonymise_codes_at_numeric_boundaries_only – For codes: Boolean. Only applicable if anonymise_codes_at_word_boundaries_only is False. Ensure that the code is only recognized when surrounded by non-numbers; that is, only at the boundaries of numbers (at numeric boundaries). See crate_anon.anonymise.anonregex.get_code_regex_elements().
anonymise_dates_at_word_boundaries_only – For dates: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
anonymise_numbers_at_word_boundaries_only – For numbers: Boolean. Ensure that the regex begins and ends with a word boundary requirement. See crate_anon.anonymise.anonregex.get_code_regex_elements().
anonymise_numbers_at_numeric_boundaries_only – For numbers: Boolean. Only applicable if anonymise_numbers_at_word_boundaries_only is False. Ensure that the number is only recognized when surrounded by non-numbers; that is, only at the boundaries of numbers (at numeric boundaries). See crate_anon.anonymise.anonregex.get_code_regex_elements().
anonymise_strings_at_word_boundaries_only – For strings: Boolean. Ensure that the regex begins and ends with a word boundary requirement.
min_string_length_for_errors – For strings: minimum string length at which typographical errors will be permitted.
min_string_length_to_scrub_with – For strings: minimum length a string must have before it is used for scrubbing.
scrub_string_suffixes – A list of suffixes to permit on strings.
string_max_regex_errors – The maximum number of typographical insertion / deletion / substitution errors to permit.
allowlist – WordList of words to allow (not to scrub).
alternatives – This allows words to be substituted by equivalents; such as St for Street or Rd for Road. The parameter is a list of lists of equivalents; see crate_anon.anonymise.config.get_word_alternatives().
nonspecific_scrubber – NonspecificScrubber to apply to remove information that is generic.
nonspecific_scrubber_first – If one is provided, run the nonspecific scrubber first (rather than last)?
debug – Show the final scrubber regex text as we compile our regexes.
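To make the alternatives parameter more concrete, here is a minimal sketch of how a list of lists of equivalents might be turned into a regex alternation. The helper word_regex_with_alternatives and the hand-built address pattern are hypothetical; the way CRATE actually applies alternatives may differ.

```python
import re
from typing import List


def word_regex_with_alternatives(word: str, alternatives: List[List[str]]) -> str:
    """Hypothetical helper: regex text matching a word or any listed equivalent."""
    for group in alternatives:
        if word.upper() in (w.upper() for w in group):
            # Any member of the group is accepted in place of the word.
            return "(?:" + "|".join(re.escape(w) for w in group) + ")"
    return re.escape(word)


alternatives = [["Street", "St"], ["Road", "Rd"]]
pattern = re.compile(
    r"\b5\s+Elm\s+" + word_regex_with_alternatives("Street", alternatives) + r"\b",
    re.IGNORECASE,
)
scrubbed = pattern.sub("[__PPP__]", "Lives at 5 Elm St now.")
```

The patient's address was recorded as "5 Elm Street", but the abbreviated form in free text is scrubbed as well.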
- add_value(value: Any, scrub_method: ScrubMethod, patient: bool = True, clear_cache: bool = True) None [source]
Add a specific value via a specific scrub_method.
- Parameters:
value – value to add to the scrubber
scrub_method – crate_anon.anonymise.constants.SCRUBMETHOD value
patient – Boolean; controls whether it’s treated as a patient value or a third-party value.
clear_cache – also clear our cache?
- get_elements_code(value: Any) List[str] [source]
Start with an alphanumeric code. Remove whitespace. Build a regex that scrubs the code.
Particular examples: postcodes, e.g. "PE12 3AB".
- Parameters:
value – a string containing an alphanumeric code
- Returns:
a list of regex elements
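The approach described for codes (strip whitespace, then build a regex) might look roughly like this. The function code_regex is a hypothetical stand-in, and allowing optional whitespace between every pair of characters is an assumption of this sketch rather than a statement about what get_elements_code() returns.

```python
import re


def code_regex(value: str, at_word_boundaries_only: bool = True) -> str:
    """Hypothetical helper: regex text that matches a code despite spacing."""
    chars = "".join(value.split())  # remove all whitespace from the code
    # Permit optional whitespace between consecutive characters.
    body = r"\s*".join(re.escape(c) for c in chars)
    return rf"\b{body}\b" if at_word_boundaries_only else body


postcode = re.compile(code_regex("PE12 3AB"), re.IGNORECASE)
no_space = postcode.sub("[__PPP__]", "Postcode PE123AB.")
lower_case = postcode.sub("[__PPP__]", "lives in pe12  3ab")
embedded = postcode.sub("[__PPP__]", "ref XPE12 3AB9")
```

The same source value therefore covers the postcode written with no space, with extra spaces, or in lower case, while a longer string that merely contains it is left alone when word boundaries are required.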
- get_elements_date(value: datetime | date) List[str] | None [source]
Returns a list of regex elements for a given date value.
- get_elements_numeric(value: Any) List[str] [source]
Start with a number. Remove everything but the digits. Build a regex that scrubs the number.
Particular examples: phone numbers, e.g. "(01223) 123456".
- Parameters:
value – a string containing a number, or an actual number.
- Returns:
a list of regex elements
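A comparable sketch for numbers: keep only the digits, then tolerate separators between them. The helper numeric_regex, the choice of permitted separators (whitespace, hyphens, brackets), and the non-digit lookarounds are assumptions for illustration, not the elements that get_elements_numeric() actually produces.

```python
import re
from typing import Any


def numeric_regex(value: Any) -> str:
    """Hypothetical helper: regex text matching a number despite punctuation."""
    digits = re.sub(r"\D", "", str(value))  # keep only the digits
    # Assumed separators: whitespace, hyphens and round brackets.
    body = r"[\s\-()]*".join(digits)
    # Must not sit inside a longer run of digits.
    return rf"(?<!\d){body}(?!\d)"


phone = re.compile(numeric_regex("(01223) 123456"))
hyphenated = phone.sub("[__PPP__]", "Call 01223-123456 today")
longer_number = phone.sub("[__PPP__]", "Ref 9012231234567")
```

A differently punctuated copy of the phone number is scrubbed, but the same digits inside a longer number are not.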
- get_elements_phrase(value: Any) List[str] [source]
Returns a list of regex elements for a given phrase.
- get_elements_phrase_unless_numeric(value: Any) List[str] [source]
If the value is numeric, return an empty list. Otherwise, returns a list of regex elements for the given phrase.
- get_elements_words(value: str) List[str] [source]
Returns a list of regex elements for a given string that contains textual words.
- get_hash() str [source]
Returns a hash of our scrubber – so we can store it, and later see if it’s changed. In an incremental update, if the scrubber has changed, we should re-anonymise all data for this patient.
- get_raw_info() Dict[str, Any] [source]
Summarizes settings and (sensitive) data for this scrubber.
This is both a summary for debugging and the basis for our change-detection hash (and for the latter reason we need order etc. to be consistent). For any information we put in here, changes will cause data to be re-scrubbed.
Note that the hasher should be a secure one, because this is sensitive information.
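The need for consistent ordering and a secure hasher can be shown with a small sketch. It uses HMAC-SHA256 over canonical JSON from the standard library as a stand-in; the function hash_settings and the settings dictionary are hypothetical, and CRATE's GenericHasher may work differently.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def hash_settings(settings: Dict[str, Any], key: bytes) -> str:
    """Hypothetical helper: keyed hash of a settings summary."""
    # sort_keys gives a canonical serialization, so that dictionary ordering
    # alone never changes the hash.
    canonical = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


key = b"secret-key-for-illustration-only"
first = hash_settings({"words": ["SMITH"], "max_errors": 0}, key)
reordered = hash_settings({"max_errors": 0, "words": ["SMITH"]}, key)
changed = hash_settings({"words": ["SMITH", "JONES"], "max_errors": 0}, key)
```

Identical content hashes identically regardless of key order, while any change to the sensitive content changes the hash and would trigger re-scrubbing. Because the hash is keyed, the stored value does not expose the underlying data to someone without the key.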
- static get_scrub_method(datatype_long: str, scrub_method: ScrubMethod | None) ScrubMethod [source]
Return the default scrub method for a given SQL datatype, unless overridden. For example, dates are scrubbed via a date method; numbers by a numeric method.
- Parameters:
datatype_long – SQL datatype as a string
scrub_method – optional method to enforce
- Returns:
crate_anon.anonymise.constants.SCRUBMETHOD value
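The "default unless overridden" logic can be sketched as below. The Method enum is a hypothetical stand-in for ScrubMethod, and the datatype-to-method mapping (including which SQL types count as dates or text) is an assumption for illustration only.

```python
from enum import Enum
from typing import Optional


class Method(Enum):
    """Hypothetical stand-in for ScrubMethod."""
    WORDS = "words"
    NUMERIC = "numeric"
    DATE = "date"


def default_scrub_method(
    datatype_long: str, scrub_method: Optional[Method] = None
) -> Method:
    if scrub_method is not None:
        return scrub_method  # an explicit override always wins
    # Strip any length specifier, e.g. "VARCHAR(50)" -> "VARCHAR".
    base = datatype_long.upper().split("(")[0].strip()
    if base in {"DATE", "DATETIME", "TIMESTAMP"}:
        return Method.DATE
    if base in {"CHAR", "VARCHAR", "NVARCHAR", "TEXT"}:
        return Method.WORDS
    return Method.NUMERIC


from_text = default_scrub_method("VARCHAR(50)")
from_date = default_scrub_method("datetime")
overridden = default_scrub_method("INT", Method.WORDS)
```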
- class crate_anon.anonymise.scrub.Replacer(replacement_text: str)[source]
Custom regex replacement called from regex.sub(). This base class doesn’t do much and is the equivalent of just passing the replacement text to regex.sub().
- class crate_anon.anonymise.scrub.ScrubberBase(hasher: GenericHasher)[source]
Scrubber base class.
- __init__(hasher: GenericHasher) None [source]
- Parameters:
hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher.
- class crate_anon.anonymise.scrub.WordList(filenames: Iterable[str] | None = None, words: Iterable[str] | None = None, as_phrases: bool = False, replacement_text: str = '[---]', hasher: GenericHasher | None = None, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0, regex_method: bool = False)[source]
A scrubber that removes all words in a wordlist, in case-insensitive fashion.
This serves a dual function as an allowlist (is a word in the list?) and a denylist (scrub text using the wordlist).
- __init__(filenames: Iterable[str] | None = None, words: Iterable[str] | None = None, as_phrases: bool = False, replacement_text: str = '[---]', hasher: GenericHasher | None = None, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0, regex_method: bool = False) None [source]
- Parameters:
filenames – Filenames to read words from.
words – Additional words to add.
as_phrases – Keep lines in the source file intact (as phrases), rather than splitting them into individual words, and (if regex_method is True) scrub as phrases.
replacement_text – Replace sensitive content with this string.
hasher – GenericHasher to use to hash this scrubber (for change-detection purposes); should be a secure hasher.
suffixes – Append each of these suffixes to each word.
at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. (If false: will scrub ANN from bANNed, for example.)
max_errors – The maximum number of typographical insertion / deletion / substitution errors to permit. Applicable only if regex_method is True.
regex_method – Use regular expressions? If True: slower, but phrase scrubbing deals with variable whitespace. If False: much faster (uses FlashText), but whitespace is inflexible.
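The effect of at_word_boundaries_only can be demonstrated with a plain regex. This is a sketch of the regex-based approach only; the helper wordlist_regex is hypothetical, and neither the FlashText path nor CRATE's real pattern construction is shown.

```python
import re
from typing import Iterable


def wordlist_regex(
    words: Iterable[str], at_word_boundaries_only: bool = True
) -> re.Pattern:
    """Hypothetical helper: case-insensitive regex matching any listed word."""
    # Longest first, so longer words win over words that are their prefixes.
    ordered = sorted(words, key=len, reverse=True)
    body = "(?:" + "|".join(re.escape(w) for w in ordered) + ")"
    if at_word_boundaries_only:
        body = rf"\b{body}\b"
    return re.compile(body, re.IGNORECASE)


text = "Ann was banned."
strict = wordlist_regex(["ANN"]).sub("[---]", text)
loose = wordlist_regex(["ANN"], at_word_boundaries_only=False).sub("[---]", text)
```

With word boundaries required, only the standalone name is removed; without them, the letters inside "banned" are scrubbed too.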
- add_file(filename: str, clear_cache: bool = True) None [source]
Add all words from a file.
- Parameters:
filename – File to read.
clear_cache – Also clear our cache?
- add_word(word: str, clear_cache: bool = True) None [source]
Add a word to our wordlist.
- Parameters:
word – word to add
clear_cache – also clear our cache?
- build() None [source]
Compiles a high-speed scrubbing device, be it a regex or a FlashText processor. Called only when we have collected all our words.
- clear_cache() None [source]
Clear cached information (e.g. the compiled regex, the cached hash of this scrubber).