12.1.3. crate_anon.anonymise.anonregex

crate_anon/anonymise/anonregex.py

Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.

Regular expression functions for anonymisation.

class crate_anon.anonymise.anonregex.DateRegexNames: For named groups in date regexes.

crate_anon.anonymise.anonregex.get_anon_fragments_from_string(s: str) → List[str]

Takes a complex string, such as a name or address with its components separated by spaces, commas, etc., and returns a list of substrings to be used for anonymisation.

For example, from "John Smith", return ["John", "Smith"]; from "John D'Souza", return ["John", "D", "Souza"]; from "42 West Street", return ["42", "West", "Street"].

Try these examples:

get_anon_fragments_from_string("Bob D'Souza")
get_anon_fragments_from_string("Jemima Al-Khalaim")
get_anon_fragments_from_string("47 Russell Square")

Note that this is a LIBERAL algorithm, i.e. one prone to anonymising too much (e.g. all instances of "Street" if someone has that as part of their address).
Note also that we use the “word boundary” facility when replacing, which treats apostrophes and hyphens as word boundaries. Therefore, we don’t need the largest-level chunks, like D'Souza.
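
The splitting described above can be approximated with the standard library. This is a minimal sketch, not CRATE's implementation: the helper name split_into_fragments is hypothetical, and it assumes that any run of non-word characters (spaces, commas, apostrophes, hyphens) separates fragments.

```python
import re
from typing import List


def split_into_fragments(s: str) -> List[str]:
    """Hypothetical helper: split a string on runs of non-word characters."""
    # \W+ matches spaces, commas, apostrophes, hyphens and similar separators.
    # Empty strings (from leading/trailing separators) are discarded.
    return [fragment for fragment in re.split(r"\W+", s) if fragment]


print(split_into_fragments("John D'Souza"))       # ['John', 'D', 'Souza']
print(split_into_fragments("Jemima Al-Khalaim"))  # ['Jemima', 'Al', 'Khalaim']
print(split_into_fragments("47 Russell Square"))  # ['47', 'Russell', 'Square']
```

Because every fragment is scrubbed separately, a short fragment such as "D" illustrates why the approach is liberal.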

crate_anon.anonymise.anonregex.get_code_regex_elements(s: str, liberal: bool = True, very_liberal: bool = True, at_word_boundaries_only: bool = True, at_numeric_boundaries_only: bool = True) → List[str]

Takes a string representation of a number or an alphanumeric code, which may include leading zeros (as for phone numbers), and produces a list of regex strings for scrubbing.

We allow all sorts of separators. For example, 0123456789 might appear as

(01234) 56789
0123 456 789
01234-56789
0123.456.789

This can also be used for postcodes, which should have whitespace prestripped, so e.g. PE123AB might appear as

PE123AB
PE12 3AB
PE 12 3 AB

Parameters:

s – The string representation of a number or code.
liberal – Boolean. Use “optional non-newline whitespace” to separate characters in the source.
very_liberal – Boolean. Use “optional nonword” to separate characters in the source.
at_word_boundaries_only – Boolean. Ensure that the regex begins and ends with a word boundary requirement. So, if True, “123” will not be scrubbed from “M123”.
at_numeric_boundaries_only – Boolean. Only applicable if at_word_boundaries_only is False. If set, the number/code is recognized only when it is bordered by non-digits; that is, only at the boundaries of numbers (at numeric boundaries).
- Even though we’re not restricting to word boundaries (because, for example, we want 123456 to match M123456), it can be undesirable to match numbers that are bordered only by digits. With this setting, 23 will never match within 234, 1234, or 123.
- But if you want to anonymise “123456” out of a phone number written like “01223123456”, you might have to turn this off.

Returns:

a list of regular expression strings
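
The separator and boundary options can be illustrated with a short sketch. It is not the library's code: the helper code_regex_sketch, the two separator constants, and the use of digit lookarounds for numeric boundaries are assumptions chosen to mirror the parameter descriptions above.

```python
import re

# Assumed spellings of the two separator styles described above.
OPTIONAL_NON_NEWLINE_WHITESPACE = r"[ \t]*"  # "liberal"
OPTIONAL_NONWORD = r"\W*"                    # "very liberal"


def code_regex_sketch(
    code: str,
    separator: str = OPTIONAL_NONWORD,
    at_word_boundaries_only: bool = True,
    at_numeric_boundaries_only: bool = True,
) -> str:
    """Hypothetical helper: permit an optional separator between characters."""
    body = separator.join(re.escape(ch) for ch in code)
    if at_word_boundaries_only:
        return r"\b" + body + r"\b"
    if at_numeric_boundaries_only:
        # Must not be preceded or followed by another digit.
        return r"(?<!\d)" + body + r"(?!\d)"
    return body


phone = re.compile(code_regex_sketch("0123456789"))
for text in ("(01234) 56789", "0123 456 789", "01234-56789", "0123.456.789"):
    print(text, bool(phone.search(text)))
```

With word boundaries off but numeric boundaries on, code_regex_sketch("23", at_word_boundaries_only=False) finds "23" in "M23" but not in "1234".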

crate_anon.anonymise.anonregex.get_date_regex_elements(dt: datetime | date, at_word_boundaries_only: bool = False, ordinal_suffixes: Iterable[str] = ('st', 'nd', 'rd', 'th')) → List[str]

Takes a datetime object and returns a list of regex strings with which to scrub.

For example, a date/time of 13 Sep 2014 will produce regexes that recognize “13 Sep 2014”, “September 13, 2014”, “2014/09/13”, and many more.

Parameters:

dt – The datetime or date or similar object.
at_word_boundaries_only – Ensure that all regexes begin and end with a word boundary requirement.
ordinal_suffixes – Language-specific suffixes that may be appended to numbers to make them ordinal. In English, “st”, “nd”, “rd”, and “th”.

Returns:

the list of regular expression strings, as above
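
To make the idea concrete, here is a small sketch that builds three regex strings for one specific date. It covers far fewer formats than the description above implies, and everything in it (the helper date_regex_sketch, the separator choice, the digit lookarounds) is an assumption for illustration rather than CRATE's code.

```python
import re
from datetime import date
from typing import List

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)
ORDINAL = r"(?:st|nd|rd|th)?"  # optional English ordinal suffix
SEP = r"\W*"                   # assumed separator: any run of non-word characters


def date_regex_sketch(d: date) -> List[str]:
    """Hypothetical helper: a few regex strings matching one specific date."""
    day = f"0?{d.day}"          # allow a leading zero
    month_num = f"0?{d.month}"
    name = MONTHS[d.month - 1]
    month_name = f"(?:{name}|{name[:3]})"  # full name or three-letter form
    year = str(d.year)
    elements = [
        f"{day}{ORDINAL}{SEP}(?:{month_name}|{month_num}){SEP}{year}",  # 13 Sep 2014
        f"{month_name}{SEP}{day}{ORDINAL}{SEP}{year}",  # September 13, 2014
        f"{year}{SEP}{month_num}{SEP}{day}",            # 2014/09/13
    ]
    # Refuse to match inside a longer run of digits.
    return [rf"(?<!\d)(?:{e})(?!\d)" for e in elements]


pattern = re.compile("|".join(date_regex_sketch(date(2014, 9, 13))), re.IGNORECASE)
for text in ("13 Sep 2014", "September 13, 2014", "2014/09/13"):
    print(text, bool(pattern.search(text)))
```

The digit lookarounds stop a pattern for 1 September 2014 from matching the start of "2014/09/13".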

crate_anon.anonymise.anonregex.get_generic_date_regex_elements(at_word_boundaries_only: bool = True, ordinal_suffixes: Iterable[str] = ('st', 'nd', 'rd', 'th'), all_month_names: Iterable[str] = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) → List[str]

Returns a list of regex elements to scrub any date.

Word boundaries are strongly preferred! This will match some odd things otherwise; see the associated unit tests.

crate_anon.anonymise.anonregex.get_number_of_length_n_regex_elements(n: int, liberal: bool = True, very_liberal: bool = False, at_word_boundaries_only: bool = True) → List[str]

Get a list of regex strings for scrubbing n-digit numbers – for example, to remove all 10-digit numbers as putative NHS numbers, or all 11-digit numbers as putative UK phone numbers.

Parameters:

n – the length of the number
liberal – Boolean. Use “optional non-newline whitespace” to separate the digits.
very_liberal – Boolean. Use “optional nonword” to separate the digits.
at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. If not set, the regex must be surrounded by non-digits. (If it were surrounded by more digits, it wouldn’t be an n-digit number!)

Returns:

a list of regular expression strings
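
A sketch of the n-digit idea, assuming whitespace-only separators between digits. The helper name n_digit_regex_sketch is hypothetical and the regex is illustrative, not the one CRATE generates.

```python
import re


def n_digit_regex_sketch(
    n: int,
    separator: str = r"[ \t]*",  # assumed "optional non-newline whitespace"
    at_word_boundaries_only: bool = True,
) -> str:
    """Hypothetical helper: regex string for exactly n digits."""
    if n < 1:
        raise ValueError("n must be at least 1")
    body = separator.join([r"\d"] * n)
    if at_word_boundaries_only:
        return r"\b" + body + r"\b"
    # Without word boundaries, still insist on non-digit neighbours;
    # otherwise this would match part of a longer number.
    return r"(?<!\d)" + body + r"(?!\d)"


ten_digits = re.compile(n_digit_regex_sketch(10))
print(bool(ten_digits.search("NHS number 123 456 7890")))  # True
print(bool(ten_digits.search("12345678901")))              # False: 11 digits
```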

crate_anon.anonymise.anonregex.get_phrase_regex_elements(phrase: str, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0, alternatives: List[List[str]] | None = None) → List[str]

Gets regular expressions to scrub a phrase; that is, all words within a phrase consecutively.

Parameters:

phrase – The phrase to scrub, e.g. ‘4 Privet Drive’.
suffixes – A list of suffixes to permit (unusual).
at_word_boundaries_only – Apply regex only at word boundaries?
max_errors – Maximum number of typos, as defined by the regex module.
alternatives – This allows words to be substituted by equivalents; such as St for Street or Rd for Road. The parameter is a list of lists of equivalents; see crate_anon.anonymise.config.get_word_alternatives().

Returns:

A list of regex fragments.
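
The role of the alternatives parameter can be sketched as follows. This is an illustration under stated assumptions: the helper phrase_regex_sketch is hypothetical, the alternatives are matched case-insensitively, and suffixes and max_errors are omitted (typo tolerance relies on the third-party regex module, which this standard-library sketch does not use).

```python
import re
from typing import Dict, List, Optional


def phrase_regex_sketch(
    phrase: str, alternatives: Optional[List[List[str]]] = None
) -> str:
    """Hypothetical helper: match the words of a phrase consecutively."""
    # Map each known word (lower-cased) to its group of equivalents.
    lookup: Dict[str, List[str]] = {}
    for group in alternatives or []:
        for word in group:
            lookup[word.lower()] = group
    parts = []
    for word in re.findall(r"\w+", phrase):
        options = lookup.get(word.lower(), [word])
        parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
    # Words must be separated by at least one non-word character.
    return r"\b" + r"\W+".join(parts) + r"\b"


pattern = re.compile(
    phrase_regex_sketch("4 Privet Drive", alternatives=[["Drive", "Dr"]]),
    re.IGNORECASE,
)
print(bool(pattern.search("lives at 4 Privet Dr")))  # True
print(bool(pattern.search("14 Privet Drive")))       # False: no boundary before 4
```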

crate_anon.anonymise.anonregex.get_regex_from_elements(elementlist: List[str]) → Pattern | None: Convert a list of regex elements into a compiled regex, which will operate in case-insensitive fashion on Unicode strings.

crate_anon.anonymise.anonregex.get_regex_string_from_elements(elementlist: List[str]) → str: Convert a list of regex elements into a single regex string.
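
One plausible way to combine elements is alternation, sketched here with the standard re module. The helper names are hypothetical, and returning None for an empty list is an assumption suggested by the Pattern | None return type rather than documented behaviour.

```python
import re
from typing import List, Optional


def join_elements_sketch(elements: List[str]) -> str:
    """Hypothetical helper: combine regex elements into one alternation."""
    # Wrap each element in a non-capturing group so "|" cannot leak across them.
    return "|".join(f"(?:{element})" for element in elements)


def compile_elements_sketch(elements: List[str]) -> Optional[re.Pattern]:
    """Hypothetical helper: compile case-insensitively, or None if empty."""
    if not elements:
        return None  # assumption: nothing to scrub means no pattern
    return re.compile(join_elements_sketch(elements), re.IGNORECASE | re.UNICODE)


scrubber = compile_elements_sketch([r"\bsmith\b", r"\bjohn\b"])
print(scrubber.sub("[X]", "John SMITH met Johnson"))  # [X] [X] met Johnson
```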

crate_anon.anonymise.anonregex.get_string_regex_elements(s: str, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0) → List[str]

Takes a string and returns a list of regex strings with which to scrub.

Parameters:

s – The starting string.
suffixes – A list of suffixes to permit, typically ["s"].
at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. (If false: will scrub ANN from bANNed.)
max_errors – The maximum number of typographical insertion/deletion/substitution errors to permit.

Returns:

a list of regular expression strings
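
The effect of suffixes and at_word_boundaries_only can be sketched with the standard library. The helper string_regex_sketch is hypothetical; max_errors is left out because typo-tolerant matching is not available in the standard re module.

```python
import re
from typing import List, Optional


def string_regex_sketch(
    s: str,
    suffixes: Optional[List[str]] = None,
    at_word_boundaries_only: bool = True,
) -> str:
    """Hypothetical helper: regex string for a word plus optional suffixes."""
    body = re.escape(s)
    if suffixes:
        # Any one of the suffixes may follow, or none at all.
        body += "(?:" + "|".join(re.escape(x) for x in suffixes) + ")?"
    if at_word_boundaries_only:
        body = r"\b" + body + r"\b"
    return body


strict = re.compile(string_regex_sketch("Ann", suffixes=["s"]), re.IGNORECASE)
loose = re.compile(
    string_regex_sketch("Ann", at_word_boundaries_only=False), re.IGNORECASE
)
print(bool(strict.search("Anns")))    # True: suffix permitted
print(bool(strict.search("bANNed")))  # False: not at a word boundary
print(bool(loose.search("bANNed")))   # True: scrubs ANN from bANNed
```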

crate_anon.anonymise.anonregex.get_uk_postcode_regex_elements(at_word_boundaries_only: bool = True) → List[str]

Get a list of regex strings for scrubbing UK postcodes. These have a well-defined format.

Unless compiled with the re.IGNORECASE flag, they will match upper-case postcodes only.

Parameters:

at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement.

Returns:

a list of regular expression strings

See:

https://stackoverflow.com/questions/164979/regex-for-matching-uk-postcodes
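
The case-sensitivity point can be demonstrated with a deliberately simplified postcode pattern. POSTCODE_SKETCH is an assumption for illustration only; it is much cruder than a full UK postcode regex and is not the pattern CRATE returns.

```python
import re

# Simplified shape: 1-2 letters, a digit, an optional letter or digit,
# optional whitespace, then a digit and two letters.
POSTCODE_SKETCH = r"\b[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}\b"

case_sensitive = re.compile(POSTCODE_SKETCH)
case_insensitive = re.compile(POSTCODE_SKETCH, re.IGNORECASE)

print(bool(case_sensitive.search("Lives at CB2 0QQ.")))    # True
print(bool(case_sensitive.search("lives at cb2 0qq.")))    # False: lower case
print(bool(case_insensitive.search("lives at cb2 0qq.")))  # True
```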

crate_anon.anonymise.anonregex.get_uk_postcode_regex_string(at_word_boundaries_only: bool = True) → str: Shortcut to retrieve a single regex string for UK postcodes (following the changes above on 2020-04-28). See get_uk_postcode_regex_elements().