14.1.3. crate_anon.anonymise.anonregex
crate_anon/anonymise/anonregex.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Regular expression functions for anonymisation.
- crate_anon.anonymise.anonregex.get_anon_fragments_from_string(s: str) → List[str]
Takes a complex string, such as a name or address with its components separated by spaces, commas, etc., and returns a list of substrings to be used for anonymisation.
For example, from "John Smith", return ["John", "Smith"]; from "John D'Souza", return ["John", "D", "Souza"]; from "42 West Street", return ["42", "West", "Street"].
Try these examples:
get_anon_fragments_from_string("Bob D'Souza")
get_anon_fragments_from_string("Jemima Al-Khalaim")
get_anon_fragments_from_string("47 Russell Square")
Note that this is a LIBERAL algorithm, i.e. one prone to anonymise too much (e.g. all instances of "Street" if someone has that as part of their address).
Note that we use the “word boundary” facility when replacing, and that treats apostrophes and hyphens as word boundaries. Therefore, we don’t need the largest-level chunks, like D'Souza.
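The fragmenting behaviour described above can be illustrated with a minimal sketch. This is not CRATE's implementation: the helper name `split_into_fragments` is hypothetical, and it assumes that splitting on runs of non-word characters is enough to reproduce the documented examples.

```python
import re
from typing import List


def split_into_fragments(s: str) -> List[str]:
    """Hypothetical helper: split on runs of non-word characters
    (spaces, commas, apostrophes, hyphens) and drop empty pieces."""
    return [fragment for fragment in re.split(r"\W+", s) if fragment]


for example in ("Bob D'Souza", "Jemima Al-Khalaim", "47 Russell Square"):
    print(example, "->", split_into_fragments(example))
```

Because apostrophes and hyphens act as separators here, "D'Souza" yields "D" and "Souza" rather than one larger chunk, which mirrors the word-boundary reasoning in the note above.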
- crate_anon.anonymise.anonregex.get_code_regex_elements(s: str, liberal: bool = True, very_liberal: bool = True, at_word_boundaries_only: bool = True, at_numeric_boundaries_only: bool = True) → List[str]
Takes a string representation of a number or an alphanumeric code, which may include leading zeros (as for phone numbers), and produces a list of regex strings for scrubbing.
We allow all sorts of separators. For example, 0123456789 might appear as
(01234) 56789
0123 456 789
01234-56789
0123.456.789
This can also be used for postcodes, which should have whitespace prestripped, so e.g. PE123AB might appear as
PE123AB
PE12 3AB
PE 12 3 AB
- Parameters:
s – The string representation of a number or code.
liberal – Boolean. Use “optional non-newline whitespace” to separate characters in the source.
very_liberal – Boolean. Use “optional nonword” to separate characters in the source.
at_word_boundaries_only – Boolean. Ensure that the regex begins and ends with a word boundary requirement. So, if True, “123” will not be scrubbed from “M123”.
at_numeric_boundaries_only – Boolean. Only applicable if at_word_boundaries_only is False. If set, ensure that the number/code is recognized only when it is bordered by non-numbers; that is, only at the boundaries of numbers (at numeric boundaries).
Even though we’re not restricting to word boundaries, because (for example) we want 123456 to match M123456, it can be undesirable to match numbers that are bordered only by numbers; that is, with this setting, 23 should never match 234 or 1234 or 123.
But if you want to anonymise “123456” out of a phone number written like “01223123456”, you might have to turn this off…
- Returns:
a list of regular expression strings
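The interaction between the separator options and the two boundary options can be sketched as follows. This is an illustrative sketch only, not the library's code: `code_regex_sketch` is a hypothetical name, it returns a single regex string rather than a list, and the concrete separator patterns (`[ \t]*` and `\W*`) are assumptions based on the parameter descriptions above.

```python
import re


def code_regex_sketch(
    s: str,
    liberal: bool = True,
    very_liberal: bool = True,
    at_word_boundaries_only: bool = True,
    at_numeric_boundaries_only: bool = True,
) -> str:
    """Hypothetical helper: allow optional separators between every
    character of the code, then add the requested boundary condition."""
    if very_liberal:
        separator = r"\W*"  # assumed encoding of "optional nonword"
    elif liberal:
        separator = r"[ \t]*"  # assumed "optional non-newline whitespace"
    else:
        separator = ""
    body = separator.join(re.escape(ch) for ch in s)
    if at_word_boundaries_only:
        return r"\b" + body + r"\b"
    if at_numeric_boundaries_only:
        # Refuse matches that are bordered by further digits.
        return r"(?<!\d)" + body + r"(?!\d)"
    return body


phone = code_regex_sketch("0123456789")
for text in ("(01234) 56789", "0123 456 789", "01234-56789", "0123.456.789"):
    print(text, bool(re.search(phone, text)))
```

With word boundaries switched off but numeric boundaries left on, "123456" is found inside "M123456" but "23" is not found inside "1234"; switching both off is what allows "123456" to be found inside "01223123456".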
- crate_anon.anonymise.anonregex.get_date_regex_elements(dt: datetime | date, at_word_boundaries_only: bool = False, ordinal_suffixes: Iterable[str] = ('st', 'nd', 'rd', 'th')) → List[str]
Takes a datetime object and returns a list of regex strings with which to scrub.
For example, a date/time of 13 Sep 2014 will produce regexes that recognize “13 Sep 2014”, “September 13, 2014”, “2014/09/13”, and many more.
- Parameters:
dt – The datetime or date or similar object.
at_word_boundaries_only – Ensure that all regexes begin and end with a word boundary requirement.
ordinal_suffixes – Language-specific suffixes that may be appended to numbers to make them ordinal. In English, “st”, “nd”, “rd”, and “th”.
- Returns:
the list of regular expression strings, as above
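To show how one date can expand into several regexes, here is a reduced sketch. It is not the library's implementation and covers far fewer formats than the real function presumably does; `date_regex_sketch`, `MONTH_NAMES`, and the separator pattern are all hypothetical choices made for illustration.

```python
import re
from datetime import date
from typing import List

MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def date_regex_sketch(d: date) -> List[str]:
    """Hypothetical helper: a few regexes matching common spellings of d."""
    day = rf"0*{d.day}"  # allow leading zeros, e.g. "09"
    month_number = rf"0*{d.month}"
    year = str(d.year)
    full_name = MONTH_NAMES[d.month - 1]
    abbreviation = full_name[:3]
    month_word = rf"(?:{full_name}|{abbreviation})"
    ordinal = r"(?:st|nd|rd|th)?"  # optional English ordinal suffix
    sep = r"[\s,./-]+"  # assumed set of separators
    return [
        rf"{day}{ordinal}{sep}{month_word}{sep}{year}",  # 13 Sep 2014
        rf"{month_word}{sep}{day}{ordinal}{sep}{year}",  # September 13, 2014
        rf"{year}{sep}{month_number}{sep}{day}",  # 2014/09/13
        rf"{day}{sep}{month_number}{sep}{year}",  # 13/09/2014
    ]


regex = re.compile("|".join(date_regex_sketch(date(2014, 9, 13))), re.IGNORECASE)
for text in ("13 Sep 2014", "September 13, 2014", "2014/09/13"):
    print(text, bool(regex.search(text)))
```

The sketch applies no word-boundary requirement, matching the documented default of at_word_boundaries_only = False for this function.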
- crate_anon.anonymise.anonregex.get_generic_date_regex_elements(at_word_boundaries_only: bool = True, ordinal_suffixes: Iterable[str] = ('st', 'nd', 'rd', 'th'), all_month_names: Iterable[str] = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) → List[str]
Returns a set of regex elements to scrub any date.
Word boundaries are strongly preferred! This will match some odd things otherwise; see the associated unit tests.
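A small sketch of why word boundaries matter for generic date patterns. The pattern `NUMERIC_DATE` is a hypothetical, deliberately simple stand-in for one numeric date format; it is not one of the regexes the library generates.

```python
import re

# Hypothetical stand-in for one generic numeric date format (d/m/yyyy).
NUMERIC_DATE = r"\d{1,2}/\d{1,2}/\d{4}"

unbounded = re.compile(NUMERIC_DATE)
bounded = re.compile(r"\b" + NUMERIC_DATE + r"\b")

reference_code = "Sample ref 991/02/20145"  # not a date
real_date = "Seen on 13/09/2014."

# Without boundaries, a date-shaped substring is pulled out of the code.
print(unbounded.search(reference_code))
# With boundaries, the code is left alone but the real date still matches.
print(bounded.search(reference_code))
print(bounded.search(real_date))
```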
- crate_anon.anonymise.anonregex.get_number_of_length_n_regex_elements(n: int, liberal: bool = True, very_liberal: bool = False, at_word_boundaries_only: bool = True) → List[str]
Get a list of regex strings for scrubbing n-digit numbers – for example, to remove all 10-digit numbers as putative NHS numbers, or all 11-digit numbers as putative UK phone numbers.
- Parameters:
n – the length of the number
liberal – Boolean. Use “optional non-newline whitespace” to separate the digits.
very_liberal – Boolean. Use “optional nonword” to separate the digits.
at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. If not set, the regex must be surrounded by non-digits. (If it were surrounded by more digits, it wouldn’t be an n-digit number!)
- Returns:
a list of regular expression strings
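The idea of matching exactly n digits, with optional whitespace between them, can be sketched like this. The function name `n_digit_regex_sketch` is hypothetical, only the "liberal" whitespace separator is modelled, and a single string is returned rather than a list; treat it as an illustration of the boundary logic, not as the library's regex.

```python
import re


def n_digit_regex_sketch(n: int, at_word_boundaries_only: bool = True) -> str:
    """Hypothetical helper: n digits separated by optional spaces/tabs."""
    body = r"[ \t]*".join([r"\d"] * n)
    if at_word_boundaries_only:
        return r"\b" + body + r"\b"
    # Otherwise require non-digits on both sides; more digits on either
    # side would mean this is not an n-digit number.
    return r"(?<!\d)" + body + r"(?!\d)"


ten_digits = n_digit_regex_sketch(10)
print(re.search(ten_digits, "NHS 123 456 7890"))  # a putative NHS number
print(re.search(ten_digits, "12345678901"))  # 11 digits: no match
```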
- crate_anon.anonymise.anonregex.get_phrase_regex_elements(phrase: str, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0, alternatives: List[List[str]] | None = None) → List[str]
Gets regular expressions to scrub a phrase; that is, all words within a phrase consecutively.
- Parameters:
phrase – E.g. ‘4 Privet Drive’.
suffixes – A list of suffixes to permit (unusual).
at_word_boundaries_only – Apply regex only at word boundaries?
max_errors – Maximum number of typos, as defined by the regex module.
alternatives – This allows words to be substituted by equivalents; such as St for Street or Rd for Road. The parameter is a list of lists of equivalents; see crate_anon.anonymise.config.get_word_alternatives().
- Returns:
A list of regex fragments.
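A sketch of how a list of lists of equivalents might be folded into a phrase regex. This is an assumption-laden illustration: `phrase_regex_sketch` is a hypothetical name, words are joined with `\s+`, and the suffixes and max_errors options are left out (the latter relies on the third-party regex module).

```python
import re
from typing import Dict, List, Optional


def phrase_regex_sketch(
    phrase: str,
    alternatives: Optional[List[List[str]]] = None,
    at_word_boundaries_only: bool = True,
) -> str:
    """Hypothetical helper: match the words of a phrase consecutively,
    letting each word be replaced by any of its listed equivalents."""
    lookup: Dict[str, List[str]] = {}
    for group in alternatives or []:
        for word in group:
            lookup[word.lower()] = group
    parts = []
    for word in phrase.split():
        options = lookup.get(word.lower(), [word])
        parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
    body = r"\s+".join(parts)
    return r"\b" + body + r"\b" if at_word_boundaries_only else body


pattern = re.compile(
    phrase_regex_sketch("4 Privet Drive", alternatives=[["Drive", "Dr"]]),
    re.IGNORECASE,
)
print(pattern.search("He lives at 4 privet dr, Surrey"))
```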
- crate_anon.anonymise.anonregex.get_regex_from_elements(elementlist: List[str]) → Pattern | None
Convert a list of regex elements into a compiled regex, which will operate in case-insensitive fashion on Unicode strings.
- crate_anon.anonymise.anonregex.get_regex_string_from_elements(elementlist: List[str]) → str
Convert a list of regex elements into a single regex string.
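Combining elements into one regex can be sketched with plain alternation. The function names below are hypothetical, the non-capturing group wrapped around each element is a design choice of this sketch, and returning None for an empty list is an assumption suggested by the documented `Pattern | None` return type.

```python
import re
from typing import List, Optional


def combine_elements_sketch(elements: List[str]) -> str:
    """Hypothetical helper: join elements into one alternation, wrapping
    each so that its own "|" operators cannot leak into its neighbours."""
    return "|".join(f"(?:{element})" for element in elements)


def compile_elements_sketch(elements: List[str]) -> "Optional[re.Pattern[str]]":
    """Hypothetical helper: compile case-insensitively, or return None
    when there is nothing to match (an assumption, see lead-in)."""
    if not elements:
        return None
    return re.compile(combine_elements_sketch(elements), re.IGNORECASE | re.UNICODE)


regex = compile_elements_sketch([r"\bJohn\b", r"\bSmith\b"])
print(regex.sub("[XXX]", "john SMITH met Smithson"))
```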
- crate_anon.anonymise.anonregex.get_string_regex_elements(s: str, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0) → List[str]
Takes a string and returns a list of regex strings with which to scrub.
- Parameters:
s – The starting string.
suffixes – A list of suffixes to permit, typically ["s"].
at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. (If false: will scrub ANN from bANNed.)
max_errors – The maximum number of typographical insertion/deletion/substitution errors to permit.
- Returns:
a list of regular expression strings
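The effect of suffixes and of the word-boundary option can be sketched as below. `string_regex_sketch` is a hypothetical helper that returns one regex string; it omits max_errors, since fuzzy matching would need the third-party regex module rather than the standard library.

```python
import re
from typing import List, Optional


def string_regex_sketch(
    s: str,
    suffixes: Optional[List[str]] = None,
    at_word_boundaries_only: bool = True,
) -> str:
    """Hypothetical helper: match s literally, optionally followed by one
    of the permitted suffixes, optionally only at word boundaries."""
    body = re.escape(s)
    if suffixes:
        body += "(?:" + "|".join(re.escape(x) for x in suffixes) + ")?"
    return r"\b" + body + r"\b" if at_word_boundaries_only else body


# With boundaries, "Ann" is not scrubbed from inside "bANNed".
print(re.search(string_regex_sketch("Ann"), "bANNed", re.IGNORECASE))
# Without boundaries, it is.
print(re.search(string_regex_sketch("Ann", at_word_boundaries_only=False),
                "bANNed", re.IGNORECASE))
# A permitted suffix lets "Smith" also catch "Smiths".
print(re.search(string_regex_sketch("Smith", suffixes=["s"]), "the Smiths"))
```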
- crate_anon.anonymise.anonregex.get_uk_postcode_regex_elements(at_word_boundaries_only: bool = True) → List[str]
Get a list of regex strings for scrubbing UK postcodes. These have a well-defined format.
Unless compiled with the re.IGNORECASE flag, they will match upper-case postcodes only.
- Parameters:
at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement.
- Returns:
a list of regular expression strings
- crate_anon.anonymise.anonregex.get_uk_postcode_regex_string(at_word_boundaries_only: bool = True) → str
Shortcut to retrieve a single regex string for UK postcodes (following the changes above on 2020-04-28). See get_uk_postcode_regex_elements().
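The case-sensitivity point made for the UK postcode functions can be demonstrated with a deliberately simplified pattern. `POSTCODE_SKETCH` is a hypothetical approximation of the postcode shape, not the regex the library returns, and it will accept some strings that are not valid postcodes.

```python
import re

# Hypothetical, simplified postcode shape: 1-2 letters, a digit, an optional
# digit or letter, optional whitespace, then a digit and two letters.
POSTCODE_SKETCH = r"\b[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}\b"

# Upper-case postcodes match as written.
print(re.search(POSTCODE_SKETCH, "Lives at CB2 0QQ."))
# Lower-case postcodes are missed unless re.IGNORECASE is supplied.
print(re.search(POSTCODE_SKETCH, "lives at cb2 0qq"))
print(re.search(POSTCODE_SKETCH, "lives at cb2 0qq", re.IGNORECASE))
```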