14.1.3. crate_anon.anonymise.anonregex

crate_anon/anonymise/anonregex.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Regular expression functions for anonymisation.

class crate_anon.anonymise.anonregex.DateRegexNames

For named groups in date regexes.

crate_anon.anonymise.anonregex.get_anon_fragments_from_string(s: str) → List[str]

Takes a complex string, such as a name or address with its components separated by spaces, commas, etc., and returns a list of substrings to be used for anonymisation.

  • For example, from "John Smith", return ["John", "Smith"]; from "John D'Souza", return ["John", "D", "Souza"]; from "42 West Street", return ["42", "West", "Street"].

  • Try these examples:

    get_anon_fragments_from_string("Bob D'Souza")
    get_anon_fragments_from_string("Jemima Al-Khalaim")
    get_anon_fragments_from_string("47 Russell Square")
    
  • Note that this is a LIBERAL algorithm, i.e. one prone to anonymise too much (e.g. all instances of "Street" if someone has that as part of their address).

  • Note that we use the “word boundary” facility when replacing, and that treats apostrophes and hyphens as word boundaries. Therefore, we don’t need the largest-level chunks, like D'Souza.
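The splitting and the word-boundary behaviour described above can be illustrated with the standard re module. This is a minimal sketch, not CRATE's implementation: the helper name split_into_fragments is hypothetical, and splitting on runs of non-word characters is an assumption about what "separated by spaces, commas, etc." means.

```python
import re
from typing import List


def split_into_fragments(s: str) -> List[str]:
    """Hypothetical helper: split on runs of non-word characters
    (spaces, commas, apostrophes, hyphens, ...) and drop empty chunks."""
    return [chunk for chunk in re.split(r"\W+", s) if chunk]


for example in ("Bob D'Souza", "Jemima Al-Khalaim", "47 Russell Square"):
    print(example, "->", split_into_fragments(example))

# \b sees the apostrophe as a word boundary, so scrubbing the small
# fragment "Souza" also removes it from inside "D'Souza".
print(re.sub(r"\bSouza\b", "[XXX]", "Mrs D'Souza attended."))
```

Because the fragment "Souza" is matched at the boundary created by the apostrophe, a separate pattern for the whole of "D'Souza" adds nothing.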

crate_anon.anonymise.anonregex.get_code_regex_elements(s: str, liberal: bool = True, very_liberal: bool = True, at_word_boundaries_only: bool = True, at_numeric_boundaries_only: bool = True) → List[str]

Takes a string representation of a number or an alphanumeric code, which may include leading zeros (as for phone numbers), and produces a list of regex strings for scrubbing.

We allow all sorts of separators. For example, 0123456789 might appear as

(01234) 56789
0123 456 789
01234-56789
0123.456.789

This can also be used for postcodes, which should have whitespace prestripped, so e.g. PE123AB might appear as

PE123AB
PE12 3AB
PE 12 3 AB

Parameters:
  • s – The string representation of a number or code.

  • liberal – Boolean. Use “optional non-newline whitespace” to separate characters in the source.

  • very_liberal – Boolean. Use “optional nonword” to separate characters in the source.

  • at_word_boundaries_only – Boolean. Ensure that the regex begins and ends with a word boundary requirement. So, if True, “123” will not be scrubbed from “M123”.

  • at_numeric_boundaries_only

    Boolean. Ensure that the number/code is recognized only when surrounded by non-numbers; that is, only at the boundaries of numbers (at numeric boundaries).

    • Applicable if not at_word_boundaries_only.

    • Even though we’re not restricting to word boundaries, because (for example) we want 123456 to match M123456, it can be undesirable to match numbers that are bordered only by numbers; that is, with this setting, 23 should never match 234 or 1234 or 123.

    • If set, this option ensures that the number/code is recognized only when it is bordered by non-numbers.

    • But if you want to anonymise “123456” out of a phone number written like “01223123456”, you might have to turn this off…

Returns:

a list of regular expression strings
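The interplay of separators and boundary options can be sketched with the standard re module. This is an illustration under stated assumptions, not CRATE's code: the helper code_pattern and its parameter names are hypothetical, and the two separator expressions are one plausible reading of "optional non-newline whitespace" and "optional nonword".

```python
import re

# Assumed forms of the two separator styles (not taken from CRATE):
NON_NEWLINE_WHITESPACE = r"[^\S\r\n]*"  # "liberal"
NONWORD = r"\W*"                        # "very liberal"


def code_pattern(code: str,
                 separator: str = NONWORD,
                 word_boundaries: bool = True,
                 numeric_boundaries: bool = True) -> str:
    """Hypothetical helper: allow an optional separator between every
    pair of characters, then apply the requested boundary rule."""
    body = separator.join(re.escape(ch) for ch in code)
    if word_boundaries:
        return rf"\b{body}\b"
    if numeric_boundaries:
        # Only consulted when word boundaries are off: the code must not
        # be preceded or followed by another digit.
        return rf"(?<!\d){body}(?!\d)"
    return body


phone = re.compile(code_pattern("0123456789"))
for text in ("(01234) 56789", "0123 456 789", "01234-56789", "0123.456.789"):
    print(text, "->", bool(phone.search(text)))

postcode = re.compile(code_pattern("PE123AB"))
for text in ("PE123AB", "PE12 3AB", "PE 12 3 AB"):
    print(text, "->", bool(postcode.search(text)))
```

With word boundaries on, "123" is not found in "M123". With word boundaries off and numeric boundaries on, "123456" is found in "M123456" but "23" is not found in "1234". Only with both off is "123456" found inside "01223123456".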

crate_anon.anonymise.anonregex.get_date_regex_elements(dt: datetime | date, at_word_boundaries_only: bool = False, ordinal_suffixes: Iterable[str] = ('st', 'nd', 'rd', 'th')) → List[str]

Takes a datetime object and returns a list of regex strings with which to scrub.

For example, a date/time of 13 Sep 2014 will produce regexes that recognize “13 Sep 2014”, “September 13, 2014”, “2014/09/13”, and many more.

Parameters:
  • dt – The datetime or date or similar object.

  • at_word_boundaries_only – Ensure that all regexes begin and end with a word boundary requirement.

  • ordinal_suffixes – Language-specific suffixes that may be appended to numbers to make them ordinal. In English, “st”, “nd”, “rd”, and “th”.

Returns:

the list of regular expression strings, as above
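To make the idea concrete, here is a much-reduced sketch that builds a handful of patterns for one specific date. It is an assumption-laden illustration, not the CRATE implementation: the helper date_patterns, the separator set, and the four layouts chosen are all hypothetical, and the real function covers many more formats.

```python
import re
from datetime import date
from typing import Iterable, List

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def date_patterns(d: date,
                  ordinal_suffixes: Iterable[str] = ("st", "nd", "rd", "th")
                  ) -> List[str]:
    """Hypothetical helper: a few regex strings matching one date."""
    sep = r"[\s,./-]+"  # assumed separators between date parts
    suffix = "(?:" + "|".join(ordinal_suffixes) + ")?"
    day = rf"0?{d.day}{suffix}"
    name = MONTHS[d.month - 1]
    month_name = rf"(?:{name}|{name[:3]})"  # full or three-letter name
    month_num = rf"0?{d.month}"
    year = str(d.year)
    return [
        day + sep + month_name + sep + year,   # 13 Sep 2014
        month_name + sep + day + sep + year,   # September 13, 2014
        day + sep + month_num + sep + year,    # 13/09/2014
        year + sep + month_num + sep + day,    # 2014/09/13
    ]


rx = re.compile("|".join(date_patterns(date(2014, 9, 13))), re.IGNORECASE)
for text in ("13 Sep 2014", "September 13, 2014", "2014/09/13",
             "13th September 2014", "14 Sep 2014"):
    print(text, "->", bool(rx.search(text)))
```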

crate_anon.anonymise.anonregex.get_generic_date_regex_elements(at_word_boundaries_only: bool = True, ordinal_suffixes: Iterable[str] = ('st', 'nd', 'rd', 'th'), all_month_names: Iterable[str] = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')) → List[str]

Returns a set of regex elements to scrub any date.

Word boundaries are strongly preferred! This will match some odd things otherwise; see the associated unit tests.
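A small illustration of why word boundaries matter for a generic date pattern. The pattern below is a deliberately simple stand-in (numeric day/month/year only), not one of CRATE's regexes, and the sample text is invented for the example.

```python
import re

# Simplified, assumed pattern: D/M/YYYY with "/", "." or "-" separators.
core = r"\d{1,2}[/.-]\d{1,2}[/.-]\d{4}"
loose = re.compile(core)
strict = re.compile(rf"\b{core}\b")

text = "Seen on 13/09/2014; sample ID 991/2/20145."

# Without boundaries, part of the sample ID is mistaken for a date.
print(loose.findall(text))
# With boundaries, only the genuine date is found.
print(strict.findall(text))
```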

crate_anon.anonymise.anonregex.get_number_of_length_n_regex_elements(n: int, liberal: bool = True, very_liberal: bool = False, at_word_boundaries_only: bool = True) → List[str]

Get a list of regex strings for scrubbing n-digit numbers – for example, to remove all 10-digit numbers as putative NHS numbers, or all 11-digit numbers as putative UK phone numbers.

Parameters:
  • n – the length of the number

  • liberal – Boolean. Use “optional non-newline whitespace” to separate the digits.

  • very_liberal – Boolean. Use “optional nonword” to separate the digits.

  • at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. If not set, the regex must be surrounded by non-digits. (If it were surrounded by more digits, it wouldn’t be an n-digit number!)

Returns:

a list of regular expression strings
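The boundary logic for n-digit numbers can be sketched as follows. The helper n_digit_pattern is hypothetical and the separator expression is an assumed rendering of "optional non-newline whitespace"; CRATE's actual output may differ.

```python
import re


def n_digit_pattern(n: int,
                    separator: str = r"[^\S\r\n]*",
                    word_boundaries: bool = True) -> str:
    """Hypothetical helper: match exactly n digits, optionally separated."""
    body = separator.join([r"\d"] * n)
    if word_boundaries:
        return rf"\b{body}\b"
    # Without word boundaries, still refuse to match inside a longer
    # run of digits: otherwise it would not be an n-digit number.
    return rf"(?<!\d){body}(?!\d)"


ten = re.compile(n_digit_pattern(10))
print(ten.search("NHS number 123 456 7890"))   # spaced 10-digit number
print(ten.search("tel 01223123456"))           # 11 digits: no match

embedded = re.compile(n_digit_pattern(10, word_boundaries=False))
print(embedded.search("ID:X1234567890Y"))      # letters may adjoin
```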

crate_anon.anonymise.anonregex.get_phrase_regex_elements(phrase: str, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0, alternatives: List[List[str]] | None = None) → List[str]

Gets regular expressions to scrub a phrase; that is, all words within a phrase consecutively.

Parameters:
  • phrase – E.g. ‘4 Privet Drive’.

  • suffixes – A list of suffixes to permit (unusual).

  • at_word_boundaries_only – Apply regex only at word boundaries?

  • max_errors – Maximum number of typos, as defined by the regex module.

  • alternatives – This allows words to be substituted by equivalents; such as St for Street or Rd for Road. The parameter is a list of lists of equivalents; see crate_anon.anonymise.config.get_word_alternatives().

Returns:

A list of regex fragments.
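The effect of the alternatives parameter can be sketched like this. The helper phrase_pattern is hypothetical, the case-insensitive lookup of equivalents is an assumption, and suffixes and max_errors are left out (typo tolerance needs the third-party regex module rather than the standard re module used here).

```python
import re
from typing import List, Optional


def phrase_pattern(phrase: str,
                   alternatives: Optional[List[List[str]]] = None,
                   word_boundaries: bool = True) -> str:
    """Hypothetical helper: match the words of a phrase consecutively,
    letting any word be replaced by a listed equivalent."""
    parts = []
    for word in phrase.split():
        options = [word]
        for group in alternatives or []:
            if word.lower() in (g.lower() for g in group):
                options = group  # e.g. ["Drive", "Dr"]
                break
        parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
    body = r"\s+".join(parts)  # words separated by whitespace
    return rf"\b{body}\b" if word_boundaries else body


rx = re.compile(
    phrase_pattern("4 Privet Drive", alternatives=[["Drive", "Dr"]]),
    re.IGNORECASE,
)
for text in ("lives at 4 privet dr, Surrey", "4 Privet Drive",
             "Privet Drive", "14 Privet Drive"):
    print(text, "->", bool(rx.search(text)))
```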

crate_anon.anonymise.anonregex.get_regex_from_elements(elementlist: List[str]) → Pattern | None

Convert a list of regex elements into a compiled regex, which will operate in case-insensitive fashion on Unicode strings.

crate_anon.anonymise.anonregex.get_regex_string_from_elements(elementlist: List[str]) → str

Convert a list of regex elements into a single regex string.
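One plausible way to combine elements, shown with the standard re module. The helper names join_elements and compile_elements are hypothetical; joining with "|" and returning None for an empty list are assumptions consistent with the signatures above, not a statement of how CRATE does it.

```python
import re
from typing import List, Optional


def join_elements(elements: List[str]) -> str:
    """Hypothetical helper: OR the elements together, wrapping each in
    a non-capturing group so that "|" inside one element stays local."""
    return "|".join(f"(?:{e})" for e in elements)


def compile_elements(elements: List[str]) -> Optional[re.Pattern]:
    """Hypothetical helper: compile case-insensitively, or return None
    when there is nothing to scrub (assumed reason for the None)."""
    if not elements:
        return None
    return re.compile(join_elements(elements), re.IGNORECASE | re.UNICODE)


rx = compile_elements([r"\bJohn\b", r"\bSmith\b"])
print(rx.sub("[XXX]", "john SMITH met Johnson"))
print(compile_elements([]))
```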

crate_anon.anonymise.anonregex.get_string_regex_elements(s: str, suffixes: List[str] | None = None, at_word_boundaries_only: bool = True, max_errors: int = 0) → List[str]

Takes a string and returns a list of regex strings with which to scrub.

Parameters:
  • s – The starting string.

  • suffixes – A list of suffixes to permit, typically ["s"].

  • at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement. (If false: will scrub ANN from bANNed.)

  • max_errors – The maximum number of typographical insertion/deletion/substitution errors to permit.

Returns:

a list of regular expression strings
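The suffix and word-boundary options can be illustrated with a short sketch. The helper string_pattern is hypothetical and max_errors is omitted, since typo-tolerant matching relies on the third-party regex module rather than the standard re module used here.

```python
import re
from typing import List, Optional


def string_pattern(s: str,
                   suffixes: Optional[List[str]] = None,
                   word_boundaries: bool = True) -> str:
    """Hypothetical helper: literal string plus an optional suffix."""
    body = re.escape(s)
    if suffixes:
        body += "(?:" + "|".join(re.escape(x) for x in suffixes) + ")?"
    return rf"\b{body}\b" if word_boundaries else body


strict = re.compile(string_pattern("Ann", suffixes=["s"]), re.IGNORECASE)
print(strict.findall("Ann and the Anns were banned"))

# Without word boundaries, the name is scrubbed out of ordinary words.
loose = re.compile(string_pattern("Ann", word_boundaries=False), re.IGNORECASE)
print(loose.sub("[XXX]", "banned"))
```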

crate_anon.anonymise.anonregex.get_uk_postcode_regex_elements(at_word_boundaries_only: bool = True) → List[str]

Get a list of regex strings for scrubbing UK postcodes. These have a well-defined format.

Unless compiled with the re.IGNORECASE flag, they will match upper-case postcodes only.

Parameters:

at_word_boundaries_only – Boolean. If set, ensure that the regex begins and ends with a word boundary requirement.

Returns:

a list of regular expression strings

See:

crate_anon.anonymise.anonregex.get_uk_postcode_regex_string(at_word_boundaries_only: bool = True) → str

Shortcut to retrieve a single regex string for UK postcodes (following the changes above on 2020-04-28). See get_uk_postcode_regex_elements().
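The case-sensitivity remark for get_uk_postcode_regex_elements() can be demonstrated with a simplified stand-in pattern. The expression below is an assumption made for illustration (it is looser than the official postcode format and is not the regex CRATE produces), and the helper name uk_postcode_pattern is hypothetical.

```python
import re


def uk_postcode_pattern(word_boundaries: bool = True) -> str:
    """Hypothetical helper: a rough UK postcode shape, e.g. CB2 0QQ."""
    # Outward part: 1-2 letters, a digit, optional letter/digit.
    # Inward part: a digit and two letters. Whitespace is optional.
    core = r"[A-Z]{1,2}[0-9][A-Z0-9]?\s*[0-9][A-Z]{2}"
    return rf"\b{core}\b" if word_boundaries else core


upper_only = re.compile(uk_postcode_pattern())
any_case = re.compile(uk_postcode_pattern(), re.IGNORECASE)

print(upper_only.search("Lives in CB2 0QQ now"))   # matches
print(upper_only.search("lives in cb2 0qq now"))   # no match: lower case
print(any_case.search("lives in cb2 0qq now"))     # matches with IGNORECASE
```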