14.5.32. crate_anon.nlp_manager.regex_parser
crate_anon/nlp_manager/regex_parser.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Shared elements for regex-based NLP work.
- class crate_anon.nlp_manager.regex_parser.NumeratorOutOfDenominatorParser(nlpdef: NlpDefinition, cfg_processor_name: str, variable_name: str, variable_regex_str: str, expected_denominator: int, numerator_text_fieldname: str = 'numerator_text', numerator_fieldname: str = 'numerator', denominator_text_fieldname: str = 'denominator_text', denominator_fieldname: str = 'denominator', correct_numerator_fieldname: str | None = None, take_absolute: bool = True, commit: bool = False, debug: bool = False)[source]
Base class for X-out-of-Y numerical results, e.g. for MMSE/ACE.
Integer denominator, expected to be positive.
Otherwise similar to SimpleNumericalResultParser.
- __init__(nlpdef: NlpDefinition, cfg_processor_name: str, variable_name: str, variable_regex_str: str, expected_denominator: int, numerator_text_fieldname: str = 'numerator_text', numerator_fieldname: str = 'numerator', denominator_text_fieldname: str = 'denominator_text', denominator_fieldname: str = 'denominator', correct_numerator_fieldname: str | None = None, take_absolute: bool = True, commit: bool = False, debug: bool = False) None [source]
- This class operates with compiled regexes having this group format:
quantity_regex_str: e.g. to find “MMSE”
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the suffix (name) of a CRATE NLP config file processor section (from which we may choose to get extra config information)
variable_name – becomes the content of the variable_name output column
variable_regex_str – regex for the text that states the variable
expected_denominator – the integer value that’s expected as the “out of Y” part. For example, an MMSE is out of 30; an ACE-III total is out of 100. If the text just says “MMSE 17”, we will infer “17 out of 30”; so, for the MMSE, expected_denominator should be 30.
numerator_text_fieldname – field (column) name in which to store the text retrieved as the numerator
numerator_fieldname – field (column) name in which to store the numerical value retrieved as the numerator
denominator_text_fieldname – field (column) name in which to store the text retrieved as the denominator
denominator_fieldname – field (column) name in which to store the numerical value retrieved as the denominator
correct_numerator_fieldname – field (column) name in which we store the principal validated numerator. For example, if an MMSE processor sees “17” or “17/30”, this field will end up containing 17; but if it sees “17/100”, it will remain NULL.
take_absolute – Convert negative values to positive ones? As for SimpleNumericalResultParser.
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
debug – print the regex?
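The relationship between expected_denominator and the validated numerator can be illustrated with a small standalone sketch. This is not the CRATE implementation: the pattern, the function name sketch_out_of and the dictionary keys are assumptions made for illustration, and only the standard-library re module is used. How the real class stores a missing denominator is not shown here; the sketch simply leaves it as None.

```python
import re
from typing import Dict, List, Optional

# Hypothetical, much-simplified pattern (an assumption for this sketch only).
_MMSE_PATTERN = re.compile(
    r"""
    \b (?P<variable> MMSE ) \b
    \D{0,10}?                                # a few non-digit characters
    (?P<numerator> -? \d+ (?: \. \d+ )? )    # e.g. 17
    (?: \s* (?: / | out \s+ of ) \s*
        (?P<denominator> \d+ ) )?            # optional, e.g. /30
    """,
    re.IGNORECASE | re.VERBOSE,
)


def sketch_out_of(
    text: str, expected_denominator: int = 30, take_absolute: bool = True
) -> List[Dict[str, Optional[float]]]:
    """Return one dictionary per 'X out of Y' style hit in the text."""
    results = []
    for m in _MMSE_PATTERN.finditer(text):
        numerator = float(m.group("numerator"))
        if take_absolute:
            numerator = abs(numerator)
        denominator_text = m.group("denominator")
        denominator = int(denominator_text) if denominator_text else None
        # The numerator is "correct" if the denominator is absent (so the
        # expected one is inferred) or if it equals the expected one.
        if denominator is None or denominator == expected_denominator:
            correct_numerator = numerator
        else:
            correct_numerator = None
        results.append(
            {
                "numerator": numerator,
                "denominator": denominator,
                "correct_numerator": correct_numerator,
            }
        )
    return results


for example in ("MMSE 17", "MMSE 17/30", "MMSE 17/100"):
    print(example, "->", sketch_out_of(example))
```

In this sketch, “MMSE 17” and “MMSE 17/30” both give a validated numerator of 17, while “MMSE 17/100” leaves it as None, mirroring the NULL described for correct_numerator_fieldname.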
- dest_tables_columns() Dict[str, List[Column]] [source]
Describes the destination table(s) that this NLP processor wants to write to.
- Returns:
a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.
- Return type:
dict
- parse(text: str, debug: bool = False) Generator[Tuple[str, Dict[str, Any]], None, None] [source]
Main parsing function.
- Parameters:
text – the raw text to parse
- Yields:
tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.
- Raises:
crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed – if we could not process this text.
- test_numerator_denominator_parser(test_expected_list: List[Tuple[str, List[Tuple[float, float]]]], verbose: bool = False) None [source]
Test the parser.
- Parameters:
test_expected_list – list of tuples test_string, expected_values. The parser will parse test_string and compare the result (each value of the target unit) to expected_values, which is a list of tuples numerator, denominator, and can be an empty list.
verbose – print the regex?
- Raises:
AssertionError
- class crate_anon.nlp_manager.regex_parser.NumericalResultParser(nlpdef: NlpDefinition, cfg_processor_name: str, variable: str, target_unit: str, regex_str_for_debugging: str, commit: bool = False)[source]
DO NOT USE DIRECTLY. Base class for generic numerical results, where a SINGLE variable is produced.
- __init__(nlpdef: NlpDefinition, cfg_processor_name: str, variable: str, target_unit: str, regex_str_for_debugging: str, commit: bool = False) None [source]
Init function for NumericalResultParser.
- Parameters:
nlpdef – A crate_anon.nlp_manager.nlp_definition.NlpDefinition.
cfg_processor_name – Config section name in the NLP config file.
variable – Used by subclasses as the record value for variable_name.
target_unit – Fieldname used for the primary output quantity.
regex_str_for_debugging – String form of regex, for debugging.
commit – Force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
Subclasses will extend this method.
- dest_tables_columns() Dict[str, List[Column]] [source]
Describes the destination table(s) that this NLP processor wants to write to.
- Returns:
a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.
- Return type:
dict
- detailed_test(text: str, expected: List[Dict[str, Any]], verbose: bool = False) None [source]
Runs a more detailed check. Whereas test_numerical_parser() tests the primary numerical results, this function tests other key/value pairs returned by the parser.
- Parameters:
text – text to parse
expected – list of resultdict dictionaries (each mapping column names to values). The parser should return one result dictionary for every entry in expected. It’s fine for the resultdict not to include all the columns returned by the parser. However, for any column that is present, the parser must provide the corresponding value.
verbose – be verbose
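The matching rule described for expected (one result per entry, and only the listed columns are checked) can be sketched as a plain subset comparison. This is an illustrative reading of the description, not the CRATE code: the helper name matches_expected, the column names and the assumption that results are compared in order are all hypothetical.

```python
from typing import Any, Dict, List


def matches_expected(
    actual: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> bool:
    """
    True if there is exactly one result per expected entry and every
    key/value pair listed in an expected entry appears in the corresponding
    result. Columns present only in the result are ignored.
    """
    if len(actual) != len(expected):
        return False
    return all(
        key in result and result[key] == value
        for result, wanted in zip(actual, expected)
        for key, value in wanted.items()
    )


# Hypothetical parser output for "CRP was 7" (column names are assumptions).
parser_output = [
    {"variable_name": "CRP", "value_mg_l": 7.0, "tense": "past", "relation": "="}
]

print(matches_expected(parser_output, [{"tense": "past"}]))     # subset: passes
print(matches_expected(parser_output, [{"tense": "present"}]))  # wrong value
print(matches_expected(parser_output, []))                      # wrong count
```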
- detailed_test_multiple(tests: List[Tuple[str, List[Dict[str, Any]]]], verbose: bool = False) None [source]
- Parameters:
tests – list of tuples test_string, expected. The parser will parse test_string and compare the result(s) to expected. This is a list of dictionaries with keys such as values, tense, etc. Each dictionary value is the corresponding expected value.
verbose – show the regex string too
- Raises:
AssertionError
- abstract parse(text: str) Generator[Tuple[str, Dict[str, Any]], None, None] [source]
Main parsing function.
- Parameters:
text – the raw text to parse
- Yields:
tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.
- Raises:
crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed – if we could not process this text.
- test_numerical_parser(test_expected_list: List[Tuple[str, List[float]]], add_test_no_plain_number: bool = True, verbose: bool = False) None [source]
- Parameters:
test_expected_list – list of tuples test_string, expected_values. The parser will parse test_string and compare the result (each value of the target unit) to expected_values, which is a list of numerical (float) values, and can be an empty list.
verbose – show the regex string too
- Raises:
AssertionError
Compare also test_numerical_parser_detailed().
- class crate_anon.nlp_manager.regex_parser.SimpleNumericalResultParser(nlpdef: NlpDefinition, cfg_processor_name: str, regex_str: str, variable: str, target_unit: str, units_to_factor: Dict[str, float], take_absolute: bool = False, commit: bool = False, debug: bool = False)[source]
Base class for simple single-format numerical results. Use this when not only do you have a single variable to produce, but you have a single regex (in a standard format) that can produce it.
- __init__(nlpdef: NlpDefinition, cfg_processor_name: str, regex_str: str, variable: str, target_unit: str, units_to_factor: Dict[str, float], take_absolute: bool = False, commit: bool = False, debug: bool = False) None [source]
- Parameters:
nlpdef – crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – config section suffix in the NLP config file
regex_str – Regular expression, in string format. This class operates with compiled regexes having this group format (capture groups in this sequence):
    variable
    tense_indicator
    relation
    value
    units
variable – used as the record value for variable_name
target_unit – fieldname used for the primary output quantity
units_to_factor – dictionary, mapping
    FROM (compiled regex for units)
    TO EITHER a float (multiple) to multiply those units by, to get the preferred unit
    OR a function taking a text parameter and returning a float value in preferred unit
Any units present in the regex but absent from units_to_factor will lead the result to be ignored. For example, this allows you to ignore a relative neutrophil count (“neutrophils 2.2%”) while detecting absolute neutrophil counts (“neutrophils 2.2”), or ignoring “docusate sodium 100mg” but detecting “sodium 140 mM”.
take_absolute – Convert negative values to positive ones? Typical text requiring this option might look like:
    CRP-4
    CRP-106
    CRP -97
    Blood results for today as follows: Na- 142, K-4.1, ...
… occurring in 23 out of 8054 hits for CRP of one test set in our data.
For many quantities, we know that they cannot be negative, so this is just a notation rather than a minus sign. We have to account for it, or it’ll distort our values. Preferable to account for it here rather than later; see manual.
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
debug – print the regex?
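A minimal sketch of how a units_to_factor mapping and take_absolute might be applied to one matched value. It follows the description above (compiled units regex mapped to a multiplier or to a conversion function; unrecognised units cause the result to be ignored) but is not the CRATE code. The function name to_preferred_unit, the example mapping and the treatment of absent units (value taken as already being in the preferred unit) are assumptions.

```python
import re
from typing import Callable, Dict, Optional, Pattern, Union

Factor = Union[float, Callable[[str], float]]

# Hypothetical mapping for CRP, with mg/L as the preferred unit.
CRP_UNITS_TO_FACTOR: Dict[Pattern[str], Factor] = {
    re.compile(r"mg\s*/\s*L", re.IGNORECASE): 1.0,
    re.compile(r"mg\s*/\s*dL", re.IGNORECASE): 10.0,  # 1 mg/dL = 10 mg/L
}


def to_preferred_unit(
    value_text: str,
    units_text: str,
    units_to_factor: Dict[Pattern[str], Factor],
    take_absolute: bool = False,
) -> Optional[float]:
    """Convert a matched value to the preferred unit, or None to ignore it."""
    units_text = units_text.strip()
    if not units_text:
        # Assumption: no units given, so treat the value as already preferred.
        value = float(value_text)
        return abs(value) if take_absolute else value
    for units_regex, factor in units_to_factor.items():
        if units_regex.fullmatch(units_text):
            if callable(factor):
                value = factor(value_text)  # conversion function
            else:
                value = float(value_text) * factor  # simple multiplier
            return abs(value) if take_absolute else value
    return None  # units present but not recognised: ignore this result


print(to_preferred_unit("1.5", "mg/dL", CRP_UNITS_TO_FACTOR))
print(to_preferred_unit("-4", "mg/L", CRP_UNITS_TO_FACTOR, take_absolute=True))
print(to_preferred_unit("2.2", "%", CRP_UNITS_TO_FACTOR))
```

With take_absolute set, text such as “CRP-4” (where the hyphen is notation rather than a minus sign) yields 4 rather than -4.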
- parse(text: str, debug: bool = False) Generator[Tuple[str, Dict[str, Any]], None, None] [source]
Main parsing function.
- Parameters:
text – the raw text to parse
- Yields:
tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.
- Raises:
crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed – if we could not process this text.
- class crate_anon.nlp_manager.regex_parser.ValidatorBase(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
DO NOT USE DIRECTLY. Base class for validating regex parser sensitivity.
The validator will find fields that refer to the variable, whether or not they meet the other criteria of the actual NLP processors (i.e. whether or not they contain a valid value). More explanation below.
Suppose we’re validating C-reactive protein (CRP). Key concepts:
source (true state of the world): Pr present, Ab absent
software decision: Y yes, N no
signal detection theory classification:
    hit = Pr & Y = true positive
    miss = Pr & N = false negative
    false alarm = Ab & Y = false positive
    correct rejection = Ab & N = true negative
common SDT metrics:
    positive predictive value, PPV = P(Pr | Y) = precision (*)
    negative predictive value, NPV = P(Ab | N)
    sensitivity = P(Y | Pr) = recall (*) = true positive rate
    specificity = P(N | Ab) = true negative rate
    (*) common names used in the NLP context.
other common classifier metric:
    F_beta score = (1 + beta^2) * precision * recall / ((beta^2 * precision) + recall)
    … which measures performance when you value recall beta times as much as precision (thus, for example, the F1 score when beta = 1). See https://en.wikipedia.org/wiki/F1_score/
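The metrics above can be computed directly from the four counts (hits, misses, false alarms, correct rejections). The following is a small illustrative sketch; the function names and the example counts are invented for demonstration and are not part of CRATE. No guard against zero denominators is included.

```python
def precision(hits: int, false_alarms: int) -> float:
    """PPV = P(Pr | Y): proportion of 'yes' decisions that are correct."""
    return hits / (hits + false_alarms)


def recall(hits: int, misses: int) -> float:
    """Sensitivity = P(Y | Pr): proportion of present values detected."""
    return hits / (hits + misses)


def specificity(correct_rejections: int, false_alarms: int) -> float:
    """P(N | Ab): proportion of absent values correctly rejected."""
    return correct_rejections / (correct_rejections + false_alarms)


def npv(correct_rejections: int, misses: int) -> float:
    """NPV = P(Ab | N): proportion of 'no' decisions that are correct."""
    return correct_rejections / (correct_rejections + misses)


def f_beta(precision_: float, recall_: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / ((beta^2 * P) + R)."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision_ * recall_ / ((beta_sq * precision_) + recall_)


# Made-up counts, purely for illustration.
hits, misses, false_alarms, correct_rejections = 90, 10, 30, 870

p = precision(hits, false_alarms)  # 90 / 120
r = recall(hits, misses)           # 90 / 100
print(f"precision={p:.3f} recall={r:.3f} F1={f_beta(p, r):.3f}")
print(f"F2 (recall weighted more heavily)={f_beta(p, r, beta=2):.3f}")
```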
Working from source to NLP, we can see there are a few types of “absent”:
unselected database field containing text
    Q. field contains “CRP”, “C-reactive protein”, etc.; something that a human (or as a proxy: a machine) would judge as containing a textual reference to CRP.
        Pr. Present: a human would judge that a CRP value is present, e.g. “today her CRP is 7, which I am not concerned about.”
            Hit: software reports the value.
            M. Miss: software misses the value. (Maybe: “his CRP was twenty-one”.)
        Ab1. Absent: reference to CRP, but no numerical information, e.g. “her CRP was normal”.
            FA1. False alarm: software reports a numerical value. (Maybe: “my CRP was 7 hours behind my boss’s deadline”)
            CR1. Correct rejection: software doesn’t report a value.
    Ab2. field contains no reference to CRP at all.
        FA2. False alarm: software reports a numerical value. (A bit harder to think of examples… but imagine a bug that gives a hit for “number of carp: 7”. Or an alternative abbreviation meaning, e.g. “took part in a cardiac rehabilitation programme (CRP) 4 hours/week”.)
        CR2. Correct rejection: software doesn’t report a value.
From NLP backwards to source:
Software says value present.
    Hit: value is present.
    FA. False alarm: value is absent.
Software says value absent.
    CR. Correct rejection: value is absent.
    Miss: value is present.
The key metrics are:
precision = positive predictive value = P(Pr | Y)
… relatively easy to check; find all the “Y” records and check manually that they’re correct.
sensitivity = recall = P(Y | Pr)
… Here, we want a sample that is enriched for “symptom actually present”, for human reasons. For example, if 0.1% of text entries refer to CRP, then to assess 100 “Pr” samples we would have to review 100,000 text records, 99,900 of which are completely irrelevant. So we want an automated way of finding “Pr” records. That’s what the validator classes do.
You can enrich for “Pr” records with SQL, e.g.
    SELECT textfield
    FROM sometable
    WHERE (
        textfield LIKE '%CRP%'
        OR textfield LIKE '%C-reactive protein%'
    );
or similar, but really we want the best “CRP detector” possible. That is probably to use a regex, either in SQL (… WHERE textfield REGEX 'myregex') or using these validator classes. (The main NLP regexes don’t distinguish between “CRP present, no valid value” and “CRP absent”, because regexes either match or don’t.)
Each validator class implements the core variable-finding part of its corresponding NLP regex class, but without the value or units. For example, the CRP class looks for things like “CRP is 6” or “CRP 20 mg/L”, whereas the CRP validator looks for things like “CRP”.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
- Parameters:
nlpdef – crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – config section suffix in the NLP config file
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- dest_tables_columns() Dict[str, List[Column]] [source]
Describes the destination table(s) that this NLP processor wants to write to.
- Returns:
a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.
- Return type:
dict
- abstract classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- parse(text: str) Generator[Tuple[str, Dict[str, Any]], None, None] [source]
Main parsing function.
- Parameters:
text – the raw text to parse
- Yields:
tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.
- Raises:
crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed – if we could not process this text.
- crate_anon.nlp_manager.regex_parser.common_tense(tense_text: str | None, relation_text: str | None) Tuple[str | None, str | None] [source]
Takes strings potentially representing “tense” and “equality” concepts and unifies them.
Used, for example, to help impute that “CRP was 72” means that relation was EQ in the PAST, etc.
- Parameters:
tense_text – putative tense information
relation_text – putative relationship (equals, less than, etc.)
- Returns:
tense, relation; either may be None.
- Return type:
tuple
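The idea of unifying tense and relation text might be sketched as follows. This is a guess at the general behaviour based only on the description above (“CRP was 72” implying an EQ relation in the past); it is not the CRATE implementation. The function name sketch_common_tense, the lookup tables, the output labels and the rule that a tense word without an explicit relation implies “=” are all assumptions.

```python
from typing import Optional, Tuple

# Hypothetical lookup tables (assumptions for this sketch).
_TENSE_LOOKUP = {"is": "present", "was": "past"}
_RELATION_LOOKUP = {
    "<": "<", "less than": "<", "under": "<",
    "<=": "<=",
    "=": "=", "equals": "=", "equal to": "=",
    ">=": ">=",
    ">": ">", "more than": ">", "greater than": ">", "over": ">",
}


def _normalize(text: Optional[str]) -> Optional[str]:
    """Lower-case and collapse whitespace; pass None/empty through as None."""
    return " ".join(text.lower().split()) if text else None


def sketch_common_tense(
    tense_text: Optional[str], relation_text: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (tense, relation); either may be None."""
    tense = _TENSE_LOOKUP.get(_normalize(tense_text))
    relation = _RELATION_LOOKUP.get(_normalize(relation_text))
    if relation is None and tense is not None:
        # Assumption: "CRP was 72" states equality, in the given tense.
        relation = "="
    return tense, relation


print(sketch_common_tense("was", None))        # as in "CRP was 72"
print(sketch_common_tense("is", "less than"))  # as in "CRP is less than 5"
print(sketch_common_tense(None, None))
```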
- crate_anon.nlp_manager.regex_parser.learning_alternative_regex_groups() None [source]
Function to learn about regex syntax.
- crate_anon.nlp_manager.regex_parser.make_simple_numeric_regex(quantity: str, units: str, value: str = '(?: (?: \\+ | [-−–] )? (?: (?: \\d{1,3} (?:,\\d{3})+ ) | \\d+ ) (?: \\. \\d+ )? )', tense_indicator: str = '(?: \\b is \\b | \\b was \\b )', relation: str = '(?: <= | (?: < | less \\s+ than | under ) | (?: = | equals | equal \\s+ to ) | >= | (?: > | (?:more|greater) \\s+ than | over ) )', optional_results_ignorables: str = '\n (?: # OPTIONAL_RESULTS_IGNORABLES\n \\s | \\| | \\: # whitespace, bar, colon\n | \\bHH?\\b | \\(HH?\\) # H/HH at a word boundary; (H)/(HH)\n | \\bLL?\\b | \\(LL?\\) # L/LL etc.\n | \\* | \\(\\*\\) # *, (*)\n | — | -- # em dash, double hyphen-minus\n | –\\s+ | -\\s+ | ‐\\s+ # en dash/hyphen-minus/Unicode hyphen; whitespace\n )* # ... any of those, repeated 0 or more times\n', optional_ignorable_after_quantity: str = '', units_optional: bool = True) str [source]
Makes a regex with named groups to handle simple numerical results.
Copes with formats like:
    sodium 132 mM
    sodium (mM) 132
    sodium (132 mM)
… and lots more.
- Parameters:
quantity – Regex for the quantity (e.g. for “sodium” or “Na”).
units – Regex for units.
value – Regex for the numerical value (e.g. our SIGNED_FLOAT regex).
tense_indicator – Regex for tense indicator.
relation – Regex for mathematical relationship (e.g. equals, less than).
optional_results_ignorables – Regex for junk to ignore in between the other things. Should include its own “optionality” (e.g. *).
optional_ignorable_after_quantity – Regex for additional things that can be ignored right after the quantity. Should include its own “optionality” (e.g. ?).
units_optional – The units are allowed to be omitted. Usually true.
The resulting regex groups are named, not numbered:
    0:          Whole thing; integer, as in: m.group(0)
    'quantity': Quantity
    'tense':    Tense (optional)
    'relation': Relation (optional)
    'value':    Value
    'units':    Units (optional)
… as used by SimpleNumericalResultParser.
Just to check re overlap:
    import regex

    s1 = r"(?P<quantity>Sodium)\s+(?P<value>\d+)\s+(?P<units>mM)"
    s2 = r"(?P<quantity>Sodium)\s+\((?P<units>mM)\)\s+(?P<value>\d+)"
    s = f"{s1}|{s2}"
    r = regex.compile(s)
    t1 = "Sodium 132 mM"
    t2 = "Sodium (mM) 127"
    m1 = r.match(t1)
    m2 = r.match(t2)

    print(m1.group(0))  # Sodium 132 mM
    print(m1.group("quantity"))  # Sodium
    print(m1.group("value"))  # 132
    print(m1.group("units"))  # mM

    print(m2.group(0))  # Sodium (mM) 127
    print(m2.group("quantity"))  # Sodium
    print(m2.group("value"))  # 127
    print(m2.group("units"))  # mM
… so it’s fine in that multiple groups can have the same name.
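Note that the check above relies on the third-party regex module. Python's standard-library re module rejects a pattern that defines the same group name twice, so the same alternation cannot be compiled with it. The sketch below demonstrates this and shows one possible workaround with distinct group names; the names value1, value2, units1, units2 and the helper value_and_units are invented for this illustration.

```python
import re
from typing import Optional, Tuple

s1 = r"(?P<quantity>Sodium)\s+(?P<value>\d+)\s+(?P<units>mM)"
s2 = r"(?P<quantity>Sodium)\s+\((?P<units>mM)\)\s+(?P<value>\d+)"

# The standard-library "re" module raises re.error for repeated group names.
try:
    re.compile(f"{s1}|{s2}")
    duplicate_names_allowed = True
except re.error:
    duplicate_names_allowed = False
print("re accepts duplicate group names:", duplicate_names_allowed)

# Workaround for "re": give each alternative its own group names, then merge.
pattern = re.compile(
    r"(?P<quantity>Sodium)\s+"
    r"(?:(?P<value1>\d+)\s+(?P<units1>mM)"
    r"|\((?P<units2>mM)\)\s+(?P<value2>\d+))"
)


def value_and_units(text: str) -> Optional[Tuple[str, str]]:
    """Return (value, units) from whichever alternative matched, else None."""
    m = pattern.match(text)
    if not m:
        return None
    value = m.group("value1") or m.group("value2")
    units = m.group("units1") or m.group("units2")
    return value, units


print(value_and_units("Sodium 132 mM"))
print(value_and_units("Sodium (mM) 127"))
```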