14.5.32. crate_anon.nlp_manager.regex_parser

crate_anon/nlp_manager/regex_parser.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Shared elements for regex-based NLP work.

class crate_anon.nlp_manager.regex_parser.NumeratorOutOfDenominatorParser(nlpdef: NlpDefinition, cfg_processor_name: str, variable_name: str, variable_regex_str: str, expected_denominator: int, numerator_text_fieldname: str = 'numerator_text', numerator_fieldname: str = 'numerator', denominator_text_fieldname: str = 'denominator_text', denominator_fieldname: str = 'denominator', correct_numerator_fieldname: str | None = None, take_absolute: bool = True, commit: bool = False, debug: bool = False)

Base class for X-out-of-Y numerical results, e.g. for MMSE/ACE.

__init__(nlpdef: NlpDefinition, cfg_processor_name: str, variable_name: str, variable_regex_str: str, expected_denominator: int, numerator_text_fieldname: str = 'numerator_text', numerator_fieldname: str = 'numerator', denominator_text_fieldname: str = 'denominator_text', denominator_fieldname: str = 'denominator', correct_numerator_fieldname: str | None = None, take_absolute: bool = True, commit: bool = False, debug: bool = False) → None

This class operates with compiled regexes having this group format:
  • quantity_regex_str: e.g. to find “MMSE”

Parameters:
  • nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition

  • cfg_processor_name – the suffix (name) of a CRATE NLP config file processor section (from which we may choose to get extra config information)

  • variable_name – becomes the content of the variable_name output column

  • variable_regex_str – regex for the text that states the variable

  • expected_denominator – the integer value that’s expected as the “out of Y” part. For example, an MMSE is out of 30; an ACE-III total is out of 100. If the text just says “MMSE 17”, we will infer “17 out of 30”; so, for the MMSE, expected_denominator should be 30.

  • numerator_text_fieldname – field (column) name in which to store the text retrieved as the numerator

  • numerator_fieldname – field (column) name in which to store the numerical value retrieved as the numerator

  • denominator_text_fieldname – field (column) name in which to store the text retrieved as the denominator

  • denominator_fieldname – field (column) name in which to store the numerical value retrieved as the denominator

  • correct_numerator_fieldname – field (column) name in which we store the principal validated numerator. For example, if an MMSE processor sees “17” or “17/30”, this field will end up containing 17; but if it sees “17/100”, it will remain NULL.

  • take_absolute – Convert negative values to positive ones? As for SimpleNumericalResultParser.

  • commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.

  • debug – print the regex?
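How expected_denominator and the "correct numerator" field interact can be illustrated with a minimal sketch. This is not CRATE's implementation: the regex, the function name parse_out_of, and the dictionary keys are assumptions made for illustration, and the standard-library re module stands in for the real compiled regexes. The sketch also assumes that a missing denominator is recorded as None rather than filled in.

```python
import re
from typing import Dict, Iterator, Optional

# Hypothetical, heavily simplified regex: the variable, a numerator, and an
# optional "/ Y" or "out of Y" denominator. CRATE's real regexes are richer.
SKETCH_REGEX = re.compile(
    r"""
    \b (?P<variable> MMSE ) \b
    \s* (?: is | was | : | = )? \s*
    (?P<numerator> -? \d+ (?: \. \d+ )? )
    (?: \s* (?: / | out \s+ of ) \s* (?P<denominator> \d+ ) )?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_out_of(
    text: str, expected_denominator: int, take_absolute: bool = True
) -> Iterator[Dict[str, Optional[float]]]:
    """Yield one dictionary per "X out of Y" style match (illustrative only)."""
    for m in SKETCH_REGEX.finditer(text):
        numerator = float(m.group("numerator"))
        if take_absolute:
            numerator = abs(numerator)
        denominator_text = m.group("denominator")
        denominator = float(denominator_text) if denominator_text else None
        # The validated numerator is kept when the denominator is unstated
        # (so the expected one is inferred) or matches the expected value;
        # otherwise it stays None (NULL in the database).
        if denominator is None or denominator == expected_denominator:
            correct_numerator = numerator
        else:
            correct_numerator = None
        yield {
            "numerator": numerator,
            "denominator": denominator,
            "correct_numerator": correct_numerator,
        }


for example in ("MMSE 17", "MMSE 17/30", "MMSE 17/100"):
    print(example, list(parse_out_of(example, expected_denominator=30)))
```

With an expected denominator of 30, the first two strings produce a validated numerator of 17, while "MMSE 17/100" leaves it empty.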

dest_tables_columns() → Dict[str, List[Column]]

Describes the destination table(s) that this NLP processor wants to write to.

Returns:

a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.

Return type:

dict

parse(text: str, debug: bool = False) → Generator[Tuple[str, Dict[str, Any]], None, None]

Main parsing function.

Parameters:

text – the raw text to parse

Yields:

tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.

Raises:

test_numerator_denominator_parser(test_expected_list: List[Tuple[str, List[Tuple[float, float]]]], verbose: bool = False) → None

Test the parser.

Parameters:
  • test_expected_list – list of tuples test_string, expected_values. The parser will parse test_string and compare the result (each value of the target unit) to expected_values, which is a list of tuples numerator, denominator, and can be an empty list.

  • verbose – print the regex?

Raises:

AssertionError

class crate_anon.nlp_manager.regex_parser.NumericalResultParser(nlpdef: NlpDefinition, cfg_processor_name: str, variable: str, target_unit: str, regex_str_for_debugging: str, commit: bool = False)

DO NOT USE DIRECTLY. Base class for generic numerical results, where a SINGLE variable is produced.

__init__(nlpdef: NlpDefinition, cfg_processor_name: str, variable: str, target_unit: str, regex_str_for_debugging: str, commit: bool = False) → None

Init function for NumericalResultParser.

Parameters:
  • nlpdef – A crate_anon.nlp_manager.nlp_definition.NlpDefinition.

  • cfg_processor_name – Config section name in the NLP config file.

  • variable – Used by subclasses as the record value for variable_name.

  • target_unit – Fieldname used for the primary output quantity.

  • regex_str_for_debugging – String form of regex, for debugging.

  • commit – Force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.

Subclasses will extend this method.

dest_tables_columns() → Dict[str, List[Column]]

Describes the destination table(s) that this NLP processor wants to write to.

Returns:

a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.

Return type:

dict

detailed_test(text: str, expected: List[Dict[str, Any]], verbose: bool = False) → None

Runs a more detailed check. Whereas test_numerical_parser() tests the primary numerical results, this function tests other key/value pairs returned by the parser.

Parameters:
  • text – text to parse

  • expected

    list of resultdict dictionaries (each mapping column names to values).

    • The parser should return one result dictionary for every entry in expected.

    • It’s fine for the resultdict not to include all the columns returned for the parser. However, for any column that is present, the parser must provide the corresponding value.

  • verbose – be verbose
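The comparison rule described for expected (one result per expected entry, with only the listed columns checked) can be sketched as follows. The helper name check_detailed and the column names in the usage example are hypothetical; this is an illustration of the rule, not the method's actual code.

```python
from typing import Any, Dict, List


def check_detailed(
    results: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> None:
    """Assert that each result contains every key/value of its expected dict."""
    # One result dictionary is required for every entry in `expected`.
    assert len(results) == len(expected), (
        f"Expected {len(expected)} result(s), got {len(results)}"
    )
    for resultdict, expecteddict in zip(results, expected):
        # Columns absent from the expected dictionary are not checked.
        for column, expected_value in expecteddict.items():
            assert column in resultdict, f"Missing column: {column!r}"
            assert resultdict[column] == expected_value, (
                f"Column {column!r}: expected {expected_value!r}, "
                f"got {resultdict[column]!r}"
            )


# Hypothetical parser output for "CRP was 72 mg/L":
parser_output = [
    {"variable_name": "CRP", "tense": "past", "relation": "=", "value_mg_l": 72.0}
]
# Only a subset of columns is specified; the rest are ignored.
check_detailed(parser_output, [{"tense": "past", "value_mg_l": 72.0}])
```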

detailed_test_multiple(tests: List[Tuple[str, List[Dict[str, Any]]]], verbose: bool = False) → None

Parameters:
  • tests – list of tuples test_string, expected. The parser will parse test_string and compare the result(s) to expected, which is a list of dictionaries with keys such as values, tense, etc. Each dictionary value is the corresponding expected value.

  • verbose – show the regex string too

Raises:

AssertionError

get_regex_str_for_debugging() → str

Returns the string version of the regex, for debugging.

abstract parse(text: str) → Generator[Tuple[str, Dict[str, Any]], None, None]

Main parsing function.

Parameters:

text – the raw text to parse

Yields:

tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.

Raises:

set_tablename(tablename: str) → None

In case a friend class wants to override.

test_numerical_parser(test_expected_list: List[Tuple[str, List[float]]], add_test_no_plain_number: bool = True, verbose: bool = False) → None

Parameters:
  • test_expected_list – list of tuples test_string, expected_values. The parser will parse test_string and compare the result (each value of the target unit) to expected_values, which is a list of numerical (float) values and can be an empty list.

  • verbose – show the regex string too

Raises:

AssertionError

Compare also detailed_test().

class crate_anon.nlp_manager.regex_parser.SimpleNumericalResultParser(nlpdef: NlpDefinition, cfg_processor_name: str, regex_str: str, variable: str, target_unit: str, units_to_factor: Dict[str, float], take_absolute: bool = False, commit: bool = False, debug: bool = False)

Base class for simple single-format numerical results. Use this when not only do you have a single variable to produce, but you have a single regex (in a standard format) that can produce it.

__init__(nlpdef: NlpDefinition, cfg_processor_name: str, regex_str: str, variable: str, target_unit: str, units_to_factor: Dict[str, float], take_absolute: bool = False, commit: bool = False, debug: bool = False) → None

Parameters:
  • nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition

  • cfg_processor_name – config section suffix in the NLP config file

  • regex_str

    Regular expression, in string format.

    This class operates with compiled regexes having this group format (capture groups in this sequence):

    • variable

    • tense_indicator

    • relation

    • value

    • units

  • variable – used as the record value for variable_name

  • target_unit – fieldname used for the primary output quantity

  • units_to_factor

    dictionary, mapping

    • FROM (compiled regex for units)

    • TO EITHER a float (multiple) to multiply those units by, to get the preferred unit

    • OR a function taking a text parameter and returning a float value in preferred unit

    Any units present in the regex but absent from units_to_factor will lead the result to be ignored. For example, this allows you to ignore a relative neutrophil count (“neutrophils 2.2%”) while detecting absolute neutrophil counts (“neutrophils 2.2”), or ignoring “docusate sodium 100mg” but detecting “sodium 140 mM”.

  • take_absolute

    Convert negative values to positive ones? Typical text requiring this option might look like:

    CRP-4
    CRP-106
    CRP -97
    Blood results for today as follows: Na- 142, K-4.1, ...
    

    … occurring in 23 out of 8054 hits for CRP of one test set in our data.

    For many quantities, we know that they cannot be negative, so this is just a notation rather than a minus sign. We have to account for it, or it’ll distort our values. Preferable to account for it here rather than later; see manual.

  • commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.

  • debug – print the regex?
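The units_to_factor and take_absolute behaviour described above can be sketched roughly as below. The function name to_preferred_unit and the example mapping are assumptions for illustration, as is the choice to treat missing units as already being in the preferred unit and to pass the value text to a converter function; this is not the class's actual code.

```python
import re
from typing import Callable, Dict, Optional, Union

# A converter is either a multiplying factor or a function of the text.
Converter = Union[float, Callable[[str], float]]

# Hypothetical mapping for CRP with mg/L as the preferred unit
# (1 mg/dL = 10 mg/L). Percentages are deliberately not listed.
CRP_UNITS_TO_FACTOR: Dict[re.Pattern, Converter] = {
    re.compile(r"mg/L", re.IGNORECASE): 1.0,
    re.compile(r"mg/dL", re.IGNORECASE): 10.0,
}


def to_preferred_unit(
    value_text: str,
    units_text: Optional[str],
    units_to_factor: Dict[re.Pattern, Converter],
    take_absolute: bool = False,
) -> Optional[float]:
    """Convert a matched value to the preferred unit, or None to ignore it."""
    result: Optional[float] = None
    if not units_text:
        # Assumption: no units stated means the preferred unit.
        result = float(value_text)
    else:
        for units_regex, converter in units_to_factor.items():
            if units_regex.fullmatch(units_text):
                if callable(converter):
                    result = converter(value_text)
                else:
                    result = float(value_text) * converter
                break
    if result is None:
        # Units were found but are not in the mapping: ignore this result.
        return None
    # "CRP-4" style notation: the minus is punctuation, not a sign.
    return abs(result) if take_absolute else result


print(to_preferred_unit("-4", None, CRP_UNITS_TO_FACTOR, take_absolute=True))
print(to_preferred_unit("2", "mg/dL", CRP_UNITS_TO_FACTOR))
print(to_preferred_unit("2.2", "%", CRP_UNITS_TO_FACTOR))
```

The last call returns None, mirroring how a relative count such as “neutrophils 2.2%” would be skipped when “%” is absent from the mapping.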

parse(text: str, debug: bool = False) → Generator[Tuple[str, Dict[str, Any]], None, None]

Main parsing function.

Parameters:

text – the raw text to parse

Yields:

tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.

Raises:

class crate_anon.nlp_manager.regex_parser.ValidatorBase(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)

DO NOT USE DIRECTLY. Base class for validating regex parser sensitivity.

The validator will find fields that refer to the variable, whether or not they meet the other criteria of the actual NLP processors (i.e. whether or not they contain a valid value). More explanation below.

Suppose we’re validating C-reactive protein (CRP). Key concepts:

  • source (true state of the world): Pr present, Ab absent

  • software decision: Y yes, N no

  • signal detection theory classification:

    • hit = Pr & Y = true positive

    • miss = Pr & N = false negative

    • false alarm = Ab & Y = false positive

    • correct rejection = Ab & N = true negative

  • common SDT metrics:

    • positive predictive value, PPV = P(Pr | Y) = precision (*)

    • negative predictive value, NPV = P(Ab | N)

    • sensitivity = P(Y | Pr) = recall (*) = true positive rate

    • specificity = P(N | Ab) = true negative rate

    (*) common names used in the NLP context.

  • other common classifier metric:

    F_beta score = (1 + beta^2) * precision * recall /
                   ((beta^2 * precision) + recall)
    

    … which measures performance when you value recall beta times as much as precision (thus, for example, the F1 score when beta = 1). See https://en.wikipedia.org/wiki/F1_score/
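As a worked illustration of the metrics above, the following sketch computes them from the four signal detection counts. The function name sdt_metrics and the example counts are made up for illustration, and the sketch assumes that none of the denominators is zero.

```python
from typing import Dict


def sdt_metrics(
    hits: int,
    misses: int,
    false_alarms: int,
    correct_rejections: int,
    beta: float = 1.0,
) -> Dict[str, float]:
    """Compute PPV, NPV, sensitivity, specificity and F_beta from counts."""
    precision = hits / (hits + false_alarms)  # PPV = P(Pr | Y)
    npv = correct_rejections / (correct_rejections + misses)  # P(Ab | N)
    recall = hits / (hits + misses)  # sensitivity = P(Y | Pr)
    specificity = correct_rejections / (correct_rejections + false_alarms)
    f_beta = (
        (1 + beta ** 2) * precision * recall
        / ((beta ** 2 * precision) + recall)
    )
    return {
        "precision": precision,
        "npv": npv,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }


# Invented counts: 80 hits, 20 misses, 20 false alarms, 880 correct rejections.
for name, value in sdt_metrics(80, 20, 20, 880).items():
    print(f"{name}: {value:.3f}")
```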

Working from source to NLP, we can see there are a few types of “absent”:

    1. unselected database field containing text

    • Q. field contains “CRP”, “C-reactive protein”, etc.; something that a human (or as a proxy: a machine) would judge as containing a textual reference to CRP.

      • Pr. Present: a human would judge that a CRP value is present, e.g. “today her CRP is 7, which I am not concerned about.”

        • H. Hit: software reports the value.

        • M. Miss: software misses the value. (Maybe: “his CRP was twenty-one”.)

      • Ab1. Absent: reference to CRP, but no numerical information, e.g. “her CRP was normal”.

        • FA1. False alarm: software reports a numerical value. (Maybe: “my CRP was 7 hours behind my boss’s deadline”)

        • CR1. Correct rejection: software doesn’t report a value.

    • Ab2. field contains no reference to CRP at all.

      • FA2. False alarm: software reports a numerical value. (A bit harder to think of examples… but imagine a bug that gives a hit for “number of carp: 7”. Or an alternative abbreviation meaning, e.g. “took part in a cardiac rehabilitation programme (CRP) 4 hours/week”.)

      • CR2. Correct rejection: software doesn’t report a value.

From NLP backwards to source:

  • Y. Software says value present.

    • H. Hit: value is present.

    • FA. False alarm: value is absent.

  • N. Software says value absent.

    • CR. Correct rejection: value is absent.

    • M. Miss: value is present.

The key metrics are:

  • precision = positive predictive value = P(Pr | Y)

    … relatively easy to check; find all the “Y” records and check manually that they’re correct.

  • sensitivity = recall = P(Y | Pr)

    … Here, we want a sample that is enriched for “symptom actually present”, for human reasons. For example, if 0.1% of text entries refer to CRP, then to assess 100 “Pr” samples we would have to review 100,000 text records, 99,900 of which are completely irrelevant. So we want an automated way of finding “Pr” records. That’s what the validator classes do.
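The review burden in the recall example follows from simple arithmetic, sketched here with a hypothetical helper, records_to_review, using the figures from the text (100 “Pr” samples at a prevalence of 0.1%).

```python
def records_to_review(n_present_wanted: int, prevalence: float) -> int:
    """Expected number of unselected records needed to find n "Pr" records."""
    return round(n_present_wanted / prevalence)


total = records_to_review(100, 0.001)
irrelevant = total - 100
print(f"Review {total} records, of which {irrelevant} are irrelevant")
```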

You can enrich for “Pr” records with SQL, e.g.

SELECT textfield FROM sometable WHERE (
    textfield LIKE '%CRP%'
    OR textfield LIKE '%C-reactive protein%');

or similar, but really we want the best “CRP detector” possible. That is probably to use a regex, either in SQL (… WHERE textfield REGEX 'myregex') or using these validator classes. (The main NLP regexes don’t distinguish between “CRP present, no valid value” and “CRP absent”, because regexes either match or don’t.)

Each validator class implements the core variable-finding part of its corresponding NLP regex class, but without the value or units. For example, the CRP class looks for things like “CRP is 6” or “CRP 20 mg/L”, whereas the CRP validator looks for things like “CRP”.
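The relationship between an NLP regex and its validator can be sketched with two toy patterns that share the same variable-finding part. Both regexes are simplified assumptions written for the standard-library re module, not the ones CRATE uses, and the classify helper is hypothetical.

```python
import re

# Hypothetical variable-finding part, shared by both regexes.
CRP_VARIABLE = r"(?: \b CRP \b | \b C-reactive \s+ protein \b )"

# Validator: does the text mention the variable at all?
VALIDATOR_REGEX = re.compile(CRP_VARIABLE, re.IGNORECASE | re.VERBOSE)

# NLP processor (toy version): the variable followed by a numerical value.
VALUE_REGEX = re.compile(
    CRP_VARIABLE + r" \s* (?: is | was )? \s* (?P<value> \d+ (?: \. \d+ )? )",
    re.IGNORECASE | re.VERBOSE,
)


def classify(text: str) -> str:
    """Describe how the two regexes respond to a piece of text."""
    mentioned = VALIDATOR_REGEX.search(text) is not None
    valued = VALUE_REGEX.search(text) is not None
    if valued:
        return "value reported"
    if mentioned:
        return "variable mentioned, no value reported (needs human review)"
    return "no reference to the variable"


for text in (
    "today her CRP is 7, which I am not concerned about.",
    "her CRP was normal",
    "his CRP was twenty-one",
    "bloods not taken today",
):
    print(f"{text!r}: {classify(text)}")
```

The second and third texts are the interesting ones for estimating recall: the validator finds them, and a human then decides whether a value was present (a miss, as in “twenty-one”) or absent (a correct rejection, as in “normal”).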

__init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) → None

Parameters:

dest_tables_columns() → Dict[str, List[Column]]

Describes the destination table(s) that this NLP processor wants to write to.

Returns:

a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.

Return type:

dict

abstract classmethod get_variablename_regexstrlist() → Tuple[str, List[str]]

To be overridden.

Returns:

(validated_variable_name, regex_str_list), where:

regex_str_list:

List of regular expressions, each in string format.

This class operates with compiled regexes having this group format (capture groups in this sequence):

  • variable

validated_variable:

used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.

Return type:

tuple

parse(text: str) → Generator[Tuple[str, Dict[str, Any]], None, None]

Main parsing function.

Parameters:

text – the raw text to parse

Yields:

tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.

Raises:

set_tablename(tablename: str) → None

In case a friend class wants to override.

test(verbose: bool = False) → None

Performs a self-test on the NLP processor.

Parameters:

verbose – Be verbose?

This is an abstract method that is subclassed.

test_validator(test_expected_list: List[Tuple[str, bool]], verbose: bool = False) → None

The ‘bool’ part of test_expected_list is: should it match any? … noting that “match anywhere” is the “search” function, whereas “match” matches at the beginning:
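The difference between the two functions can be demonstrated with the standard-library re module (the third-party regex module draws the same distinction between match and search). The pattern and the test list below are illustrative assumptions, not the ones used by any CRATE validator.

```python
import re
from typing import List, Tuple

CRP_REGEX_STR = r"\bCRP\b"  # hypothetical validator-style pattern


def matches_anywhere(regex_str: str, text: str) -> bool:
    """True if the regex matches at any position: the "search" behaviour."""
    return re.search(regex_str, text, re.IGNORECASE) is not None


def matches_at_start(regex_str: str, text: str) -> bool:
    """True only if the regex matches at the beginning: the "match" behaviour."""
    return re.match(regex_str, text, re.IGNORECASE) is not None


# Same shape as test_expected_list: (test_string, should_it_match_anywhere).
test_expected_list: List[Tuple[str, bool]] = [
    ("CRP", True),
    ("today her CRP is 7", True),
    ("number of carp: 7", False),
]
for test_string, expected in test_expected_list:
    assert matches_anywhere(CRP_REGEX_STR, test_string) == expected

# "match" would wrongly reject text where the variable is not at the start.
print(matches_at_start(CRP_REGEX_STR, "today her CRP is 7"))
print(matches_anywhere(CRP_REGEX_STR, "today her CRP is 7"))
```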

crate_anon.nlp_manager.regex_parser.common_tense(tense_text: str | None, relation_text: str | None) → Tuple[str | None, str | None]

Takes strings potentially representing “tense” and “equality” concepts and unifies them.

  • Used, for example, to help impute that “CRP was 72” means that relation was EQ in the PAST, etc.

Parameters:
  • tense_text – putative tense information

  • relation_text – putative relationship (equals, less than, etc.)

Returns:

tense, relation; either may be None.

Return type:

tuple
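One way such a unification could work is sketched below. Everything here is an assumption for illustration: the lookup tables, the output constants, the function name common_tense_sketch, and in particular the rule that a known tense with no stated relation implies equality. The real function may differ in all of these respects.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical output constants.
PAST, PRESENT = "past", "present"
EQ, LT, GT, LE, GE = "=", "<", ">", "<=", ">="

LookupTable = List[Tuple[re.Pattern, str]]

# Hypothetical lookup tables, tried in order.
TENSE_LOOKUP: LookupTable = [
    (re.compile(r"was", re.IGNORECASE), PAST),
    (re.compile(r"is", re.IGNORECASE), PRESENT),
]
RELATION_LOOKUP: LookupTable = [
    (re.compile(r"<=", re.IGNORECASE), LE),
    (re.compile(r"<|less\s+than|under", re.IGNORECASE), LT),
    (re.compile(r"=|equals|equal\s+to", re.IGNORECASE), EQ),
    (re.compile(r">=", re.IGNORECASE), GE),
    (re.compile(r">|(?:more|greater)\s+than|over", re.IGNORECASE), GT),
]


def _lookup(text: Optional[str], table: LookupTable) -> Optional[str]:
    """Return the value for the first regex that matches the whole text."""
    if text is None:
        return None
    stripped = text.strip()
    for regex, value in table:
        if regex.fullmatch(stripped):
            return value
    return None


def common_tense_sketch(
    tense_text: Optional[str], relation_text: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Unify tense and relation text into (tense, relation); either may be None."""
    tense = _lookup(tense_text, TENSE_LOOKUP)
    if tense is None:
        # Words like "was" may turn up in the relation slot instead.
        tense = _lookup(relation_text, TENSE_LOOKUP)
    relation = _lookup(relation_text, RELATION_LOOKUP)
    if relation is None and tense is not None:
        # Assumed rule: "CRP was 72" implies equality, in the past.
        relation = EQ
    return tense, relation


print(common_tense_sketch("was", None))
print(common_tense_sketch("is", "less than"))
```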

crate_anon.nlp_manager.regex_parser.learning_alternative_regex_groups() → None

Function to learn about regex syntax.

crate_anon.nlp_manager.regex_parser.make_simple_numeric_regex(quantity: str, units: str, value: str = '(?: (?: \\+ | [-−–] )? (?: (?: \\d{1,3} (?:,\\d{3})+ ) | \\d+ ) (?: \\. \\d+ )? )', tense_indicator: str = '(?: \\b is \\b | \\b was \\b )', relation: str = '(?: <= | (?: < | less \\s+ than | under ) | (?: = | equals | equal \\s+ to ) | >= | (?: > | (?:more|greater) \\s+ than | over ) )', optional_results_ignorables: str = '\n    (?:  # OPTIONAL_RESULTS_IGNORABLES\n        \\s | \\| | \\:          # whitespace, bar, colon\n        | \\bHH?\\b | \\(HH?\\)   # H/HH at a word boundary; (H)/(HH)\n        | \\bLL?\\b | \\(LL?\\)   # L/LL etc.\n        | \\* | \\(\\*\\)         # *, (*)\n        | | --              # em dash, double hyphen-minus\n        | –\\s+ | -\\s+ | ‐\\s+  # en dash/hyphen-minus/Unicode hyphen; whitespace\n    )*                        # ... any of those, repeated 0 or more times\n', optional_ignorable_after_quantity: str = '', units_optional: bool = True) str[source]

Makes a regex with named groups to handle simple numerical results.

Copes with formats like:

sodium 132 mM
sodium (mM) 132
sodium (132 mM)

… and lots more.

Parameters:
  • quantity – Regex for the quantity (e.g. for “sodium” or “Na”).

  • units – Regex for units.

  • value – Regex for the numerical value (e.g. our SIGNED_FLOAT regex).

  • tense_indicator – Regex for tense indicator.

  • relation – Regex for mathematical relationship (e.g. equals, less than).

  • optional_results_ignorables – Regex for junk to ignore in between the other things. Should include its own “optionality” (e.g. *).

  • optional_ignorable_after_quantity – Regex for additional things that can be ignored right after the quantity. Should include its own “optionality” (e.g. ?).

  • units_optional – The units are allowed to be omitted. Usually true.

The resulting regex groups are named, not numbered:

0:          Whole thing; integer, as in: m.group(0)
'quantity': Quantity
'tense':    Tense (optional)
'relation': Relation (optional)
'value':    Value
'units':    Units (optional)

… as used by SimpleNumericalResultParser.

Just to check re overlap:

import regex
s1 = r"(?P<quantity>Sodium)\s+(?P<value>\d+)\s+(?P<units>mM)"
s2 = r"(?P<quantity>Sodium)\s+\((?P<units>mM)\)\s+(?P<value>\d+)"
s = f"{s1}|{s2}"
r = regex.compile(s)
t1 = "Sodium 132 mM"
t2 = "Sodium (mM) 127"
m1 = r.match(t1)
m2 = r.match(t2)

print(m1.group(0))  # Sodium 132 mM
print(m1.group("quantity"))  # Sodium
print(m1.group("value"))  # 132
print(m1.group("units"))  # mM

print(m2.group(0))  # Sodium (mM) 127
print(m2.group("quantity"))  # Sodium
print(m2.group("value"))  # 127
print(m2.group("units"))  # mM

… so it’s fine in that multiple groups can have the same name.
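Note that this relies on the third-party regex module. As a contrast, the sketch below shows that the standard-library re module refuses to compile the same combined pattern, and gives one possible workaround using distinct group names. The suffixed group names and the helper named_group are assumptions for illustration only.

```python
import re
from typing import Optional

s1 = r"(?P<quantity>Sodium)\s+(?P<value>\d+)\s+(?P<units>mM)"
s2 = r"(?P<quantity>Sodium)\s+\((?P<units>mM)\)\s+(?P<value>\d+)"

# The standard-library re module rejects a repeated group name.
try:
    re.compile(f"{s1}|{s2}")
    duplicate_names_allowed = True
except re.error:
    duplicate_names_allowed = False
print(duplicate_names_allowed)

# Workaround sketch for re: give each alternative its own suffixed names.
t1 = r"(?P<quantity1>Sodium)\s+(?P<value1>\d+)\s+(?P<units1>mM)"
t2 = r"(?P<quantity2>Sodium)\s+\((?P<units2>mM)\)\s+(?P<value2>\d+)"
r = re.compile(f"{t1}|{t2}")


def named_group(m: "re.Match[str]", name: str) -> Optional[str]:
    """Return whichever suffixed variant of the group took part in the match."""
    return m.group(name + "1") or m.group(name + "2")


m1 = r.match("Sodium 132 mM")
m2 = r.match("Sodium (mM) 127")
print(named_group(m1, "value"), named_group(m1, "units"))
print(named_group(m2, "value"), named_group(m2, "units"))
```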