14.5.26. crate_anon.nlp_manager.parse_haematology
crate_anon/nlp_manager/parse_haematology.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Python regex-based NLP processors for haematology tests.
All inherit from crate_anon.nlp_manager.regex_parser.NumeratorOutOfDenominatorParser and are constructed with these arguments:
- nlpdef:
a crate_anon.nlp_manager.nlp_definition.NlpDefinition
- cfgsection:
the name of a CRATE NLP config file section (from which we may choose to get extra config information)
- commit:
force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.Basophils(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Basophil count (absolute). Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
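The unit handling described for the white cell processors can be illustrated with a minimal sketch. It is not CRATE's implementation; the dictionary and function names are hypothetical, and the only fact relied on is that 1 cell/mm^3 = 1 cell/μL = 10^6 cells/L, i.e. 0.001 × 10^9 / L.

```python
# Hypothetical conversion table into the default unit, 10^9 cells per litre.
# 1 cell/mm^3 = 1 cell/uL = 10^6 cells/L = 1e-3 x 10^9 cells/L.
FACTOR_TO_BILLION_PER_L = {
    "10^9/L": 1.0,
    "cells/mm^3": 1e-3,
    "cells/uL": 1e-3,
}


def to_billion_per_litre(value: float, units: str) -> float:
    """Convert a cell count in the given units to 10^9 cells/L."""
    return value * FACTOR_TO_BILLION_PER_L[units]


# A basophil count of 50 cells/mm^3 is the same quantity as 0.05 x 10^9 / L.
print(to_billion_per_litre(50, "cells/mm^3"))
print(to_billion_per_litre(0.05, "10^9/L"))
```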
- class crate_anon.nlp_manager.parse_haematology.BasophilsValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Basophils (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
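As a rough illustration of the return value described for get_variablename_regexstrlist(), the sketch below builds such a tuple and derives the output variable name from it. The regex strings and both function names are hypothetical; only the `_validator` suffix convention comes from the description above.

```python
import re
from typing import List, Tuple


def example_variablename_regexstrlist() -> Tuple[str, List[str]]:
    """Hypothetical return value: the validated variable name plus regex strings."""
    return "basophils", [r"\bbasophils?\b", r"\bbaso\b"]


def validator_variable_name(validated_variable: str) -> str:
    """Name written to the variable_name field, per the convention described."""
    return f"{validated_variable}_validator"


name, regex_strings = example_variablename_regexstrlist()
compiled = [re.compile(r, re.IGNORECASE) for r in regex_strings]
print(validator_variable_name(name))
print(any(c.search("Basophils 0.1") for c in compiled))
```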
- class crate_anon.nlp_manager.parse_haematology.Eosinophils(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Eosinophil count (absolute). Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.EosinophilsValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Eosinophils (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Esr(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (ESR).
Erythrocyte sedimentation rate (ESR), in mm/h.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – config section suffix in the NLP config file
regex_str –
Regular expression, in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
tense_indicator
relation
value
units
variable – used as the record value for variable_name
target_unit – fieldname used for the primary output quantity
units_to_factor –
dictionary, mapping
FROM (compiled regex for units)
TO EITHER a float (multiple) to multiply those units by, to get the preferred unit
OR a function taking a text parameter and returning a float value in preferred unit
Any units present in the regex but absent from units_to_factor will lead the result to be ignored. For example, this allows you to ignore a relative neutrophil count (“neutrophils 2.2%”) while detecting absolute neutrophil counts (“neutrophils 2.2”), or to ignore “docusate sodium 100mg” while detecting “sodium 140 mM”.
take_absolute –
Convert negative values to positive ones? Typical text requiring this option might look like:
CRP-4
CRP-106
CRP -97
Blood results for today as follows: Na- 142, K-4.1, ...
… occurring in 23 out of 8054 hits for CRP of one test set in our data.
For many quantities, we know that they cannot be negative, so this is just a notation rather than a minus sign. We have to account for it, or it’ll distort our values. Preferable to account for it here rather than later; see manual.
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
debug – print the regex?
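The units_to_factor mapping described in this parameter list can be sketched as follows. This is an illustration under stated assumptions, not CRATE's code: the regexes, the cm/h entry and the convert() helper are hypothetical, and treating a value with no units as already being in the preferred unit is an assumption based on the "neutrophils 2.2" example.

```python
import re
from typing import Callable, Dict, Optional, Pattern, Union

Factor = Union[float, Callable[[str], float]]

# Hypothetical mapping for a quantity whose preferred unit is mm/h.
UNITS_TO_FACTOR: Dict[Pattern[str], Factor] = {
    # A plain float: multiply the value by this to reach the preferred unit.
    re.compile(r"mm\s*/\s*h(?:r|our)?", re.IGNORECASE): 1.0,
    # A function: takes the value text, returns a float in the preferred unit.
    re.compile(r"cm\s*/\s*h", re.IGNORECASE): lambda text: float(text) * 10.0,
}


def convert(value_text: str, units_text: str,
            units_to_factor: Dict[Pattern[str], Factor]) -> Optional[float]:
    """Return the value in the preferred unit, or None if the hit is ignored."""
    units_text = units_text.strip()
    if not units_text:
        # Assumption: no units given means the preferred unit.
        return float(value_text)
    for unit_regex, factor in units_to_factor.items():
        if unit_regex.fullmatch(units_text):
            if callable(factor):
                return factor(value_text)
            return float(value_text) * factor
    return None  # units matched by the main regex but not mapped: ignore


print(convert("18", "mm/h", UNITS_TO_FACTOR))
print(convert("18", "%", UNITS_TO_FACTOR))
```

Returning None for unmapped units mirrors the documented behaviour that such results are ignored rather than stored with a wrong scale.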
- class crate_anon.nlp_manager.parse_haematology.EsrValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Esr (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Haematocrit(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Haematocrit (Hct). A dimensionless quantity (but supports L/L notation).
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – config section suffix in the NLP config file
regex_str –
Regular expression, in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
tense_indicator
relation
value
units
variable – used as the record value for variable_name
target_unit – fieldname used for the primary output quantity
units_to_factor –
dictionary, mapping
FROM (compiled regex for units)
TO EITHER a float (multiple) to multiply those units by, to get the preferred unit
OR a function taking a text parameter and returning a float value in preferred unit
Any units present in the regex but absent from units_to_factor will lead the result to be ignored. For example, this allows you to ignore a relative neutrophil count (“neutrophils 2.2%”) while detecting absolute neutrophil counts (“neutrophils 2.2”), or to ignore “docusate sodium 100mg” while detecting “sodium 140 mM”.
take_absolute –
Convert negative values to positive ones? Typical text requiring this option might look like:
CRP-4
CRP-106
CRP -97
Blood results for today as follows: Na- 142, K-4.1, ...
… occurring in 23 out of 8054 hits for CRP of one test set in our data.
For many quantities, we know that they cannot be negative, so this is just a notation rather than a minus sign. We have to account for it, or it’ll distort our values. Preferable to account for it here rather than later; see manual.
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
debug – print the regex?
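The effect of take_absolute can be shown with a small sketch using the CRP text quoted in the parameter description. The pattern and the extract() helper are hypothetical simplifications, not the regex CRATE builds.

```python
import re
from typing import List

# Hypothetical pattern: the variable name, then a number that may be preceded
# by a hyphen used as a separator (which looks like a minus sign).
PATTERN = re.compile(r"(?P<variable>CRP)\s*(?P<value>-?\s*\d+(?:\.\d+)?)")


def extract(text: str, take_absolute: bool) -> List[float]:
    """Return all values found, optionally converting negatives to positives."""
    values = []
    for match in PATTERN.finditer(text):
        value = float(match.group("value").replace(" ", ""))
        values.append(abs(value) if take_absolute else value)
    return values


text = "CRP-4 CRP-106 CRP -97"
print(extract(text, take_absolute=False))  # hyphens read as minus signs
print(extract(text, take_absolute=True))   # hyphens treated as notation
```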
- class crate_anon.nlp_manager.parse_haematology.HaematocritValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Haematocrit (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Haemoglobin(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Haemoglobin (Hb). Default units are g/L; also supports g/dL.
UK reporting for haemoglobin switched in 2013 from g/dL to g/L; see e.g.
The DANGER remains that “Hb 9” may have been from someone assuming old-style units, 9 g/dL = 90 g/L, but this will be interpreted as 9 g/L. This problem is hard to avoid.
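The ambiguity can be made concrete with a short sketch. The function below is hypothetical and only encodes the behaviour described above: g/dL is converted to g/L (1 dL = 0.1 L, so multiply by 10), and a value without units is taken to be in the default unit, g/L.

```python
def haemoglobin_g_per_l(value: float, units: str = "") -> float:
    """Convert a haemoglobin value to g/L; missing units are assumed to be g/L."""
    if units == "g/dL":
        return value * 10.0  # 1 g/dL = 10 g/L
    return value


print(haemoglobin_g_per_l(9.0, "g/dL"))  # explicit old-style units: 90 g/L
print(haemoglobin_g_per_l(9.0))          # bare "Hb 9": read as 9 g/L
```

A downstream plausibility check on the stored value is one way to flag such cases, but that is outside what this sketch shows.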
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – config section suffix in the NLP config file
regex_str –
Regular expression, in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
tense_indicator
relation
value
units
variable – used as the record value for variable_name
target_unit – fieldname used for the primary output quantity
units_to_factor –
dictionary, mapping
FROM (compiled regex for units)
TO EITHER a float (multiple) to multiply those units by, to get the preferred unit
OR a function taking a text parameter and returning a float value in preferred unit
Any units present in the regex but absent from units_to_factor will lead the result to be ignored. For example, this allows you to ignore a relative neutrophil count (“neutrophils 2.2%”) while detecting absolute neutrophil counts (“neutrophils 2.2”), or to ignore “docusate sodium 100mg” while detecting “sodium 140 mM”.
take_absolute –
Convert negative values to positive ones? Typical text requiring this option might look like:
CRP-4
CRP-106
CRP -97
Blood results for today as follows: Na- 142, K-4.1, ...
… occurring in 23 out of 8054 hits for CRP of one test set in our data.
For many quantities, we know that they cannot be negative, so this is just a notation rather than a minus sign. We have to account for it, or it’ll distort our values. Preferable to account for it here rather than later; see manual.
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
debug – print the regex?
- class crate_anon.nlp_manager.parse_haematology.HaemoglobinValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Haemoglobin (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Lymphocytes(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Lymphocyte count (absolute). Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.LymphocytesValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Lymphocytes (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Monocytes(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Monocyte count (absolute). Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.MonocytesValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Monocytes (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Neutrophils(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Neutrophil (polymorphonuclear leukocyte) count (absolute). Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.NeutrophilsValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Neutrophils (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Platelets(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Platelet count. Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
Not actually a white blood cell, of course, but can share the same base class; platelets are expressed in the same units, of 10^9 / L. Typical values 150–450 ×10^9 / L (or 150,000–450,000 per μL).
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.PlateletsValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Platelets (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.RBC(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
Red blood cell count. Default units are 10^12/L; also supports cells/mm^3 = cells/μL.
A typical excerpt from a FBC report:
RBC, POC 4.84 10*12/L
RBC, POC 9.99 (H) 10*12/L
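To show how an excerpt of this kind might be picked apart, here is a deliberately simplified sketch. The pattern is hypothetical and much narrower than whatever regex the RBC class actually builds; it only handles the "RBC, POC" form, an optional (H)/(L) flag, and the 10*12/L unit spelling seen in the excerpt.

```python
import re

# Hypothetical, simplified pattern for the report excerpt.
RBC_PATTERN = re.compile(
    r"RBC(?:,\s*POC)?\s+"            # variable, optionally written "RBC, POC"
    r"(?P<value>\d+(?:\.\d+)?)\s*"   # numeric value
    r"(?:\([HL]\)\s*)?"              # optional high/low flag
    r"(?P<units>10\*12/L)?"          # units as printed in the report
)

text = "RBC, POC 4.84 10*12/L RBC, POC 9.99 (H) 10*12/L"
results = [
    (float(m.group("value")), m.group("units"))
    for m in RBC_PATTERN.finditer(text)
]
print(results)
```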
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – config section suffix in the NLP config file
regex_str –
Regular expression, in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
tense_indicator
relation
value
units
variable – used as the record value for variable_name
target_unit – fieldname used for the primary output quantity
units_to_factor –
dictionary, mapping
FROM (compiled regex for units)
TO EITHER a float (multiple) to multiply those units by, to get the preferred unit
OR a function taking a text parameter and returning a float value in preferred unit
Any units present in the regex but absent from units_to_factor will lead the result to be ignored. For example, this allows you to ignore a relative neutrophil count (“neutrophils 2.2%”) while detecting absolute neutrophil counts (“neutrophils 2.2”), or to ignore “docusate sodium 100mg” while detecting “sodium 140 mM”.
take_absolute –
Convert negative values to positive ones? Typical text requiring this option might look like:
CRP-4
CRP-106
CRP -97
Blood results for today as follows: Na- 142, K-4.1, ...
… occurring in 23 out of 8054 hits for CRP of one test set in our data.
For many quantities, we know that they cannot be negative, so this is just a notation rather than a minus sign. We have to account for it, or it’ll distort our values. Preferable to account for it here rather than later; see manual.
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
debug – print the regex?
- class crate_anon.nlp_manager.parse_haematology.RBCValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for RBC (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple
- class crate_anon.nlp_manager.parse_haematology.Wbc(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
HAEMATOLOGY (FBC).
White cell count (WBC, WCC). Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.WbcBase(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, cell_type_regex_text: str, variable: str, commit: bool = False)[source]
DO NOT USE DIRECTLY. White cell count base class. Default units are 10^9 / L; also supports cells/mm^3 = cells/μL.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, cell_type_regex_text: str, variable: str, commit: bool = False) None [source]
__init__ function for WbcBase.
- Parameters:
nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition
cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)
cell_type_regex_text – text for regex for the cell type, representing e.g. “monocytes” or “basophils”
variable – used as the record value for variable_name
commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
- class crate_anon.nlp_manager.parse_haematology.WbcValidator(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False)[source]
Validator for Wbc (see help for explanation).
- classmethod get_variablename_regexstrlist() Tuple[str, List[str]] [source]
To be overridden.
- Returns:
(validated_variable_name, regex_str_list), where:
- regex_str_list:
List of regular expressions, each in string format.
This class operates with compiled regexes having this group format (capture groups in this sequence):
variable
- validated_variable:
used to set our variable attribute and thus the value of the field variable_name in the NLP output; for example, if validated_variable == 'crp', then the variable_name field will be set to crp_validator.
- Return type:
tuple