7.5. CRATE’s Python regex NLP

CRATE provides several simple numerical results processors. All are in English unless specified. Each comes with a corresponding validator.

Processor

The processor looks for full results in text, like “his CRP is 10 mg/L today” or “CRP | (H) | 17”, and will extract the number, units, and so forth. It does so by matching a keyword (such as “CRP”, “C-reactive protein”, “C reactive protein”, etc.) plus other attributes such as an optional tense indicator (“is”, “was”), an optional relationship (“equals”, “<”, etc.), a value, and units. Units may be optional, and some units may be recognized and specifically disallowed. For example, “MMSE 25/30” or “MMSE 25 out of 30” may be allowed, where variants on “out of 30” are the units, and “MMSE 25” can be treated as if it were implicitly out of 30, but “MMSE 25/29” disallowed.

The processor produces the standard NLP output columns, and also these output columns:

Column

SQL type

Description

variable_name

VARCHAR(64)

Variable name (e.g. ‘CRP’)

_content

TEXT

Matching text contents

_start

INT

Start position (of matching string within whole text)

_end

INT

End position (of matching string within whole text)

variable_text

TEXT

Text that matched the variable name (e.g. ‘CRP’, ‘C-reactive protein’).

relation_text

VARCHAR(50)

Text that matched the mathematical relationship between variable and value (e.g. ‘=’, ‘equals’, ‘less than’)

relation

VARCHAR(2)

Standardized mathematical relationship between variable and value (e.g. ‘=’, ‘<=’)

value_text

VARCHAR(50)

Matched numerical value, as text

units

VARCHAR(50)

Matched units, as text

value_mg_l (*)

FLOAT

Numerical value in preferred units, if known

tense_text

VARCHAR(50)

Tense text, if known (e.g. ‘is’, ‘was’)

tense

VARCHAR(7)

Calculated tense, if known (e.g. ‘past’, ‘present’)

… plus any fields you elected to copy.

The name of the column marked (*) will vary from processor to processor (e.g. value_mg_l for the CRP processor; value_kg for the Weight processor; value_m for the Height processor). The columns may vary from processor to processor; for example, the blood pressure (BP) processor produces two numbers per entry (a systolic and a diastolic BP).

Validator

The validator simply looks for the corresponding keyword. It doesn’t record much information except for a reference to the source row.

The validator produces the standard NLP output columns, and typically these output columns:

Column

SQL type

Description

variable_name

VARCHAR(64)

Variable name (e.g. ‘CRP’)

_content

TEXT

Matching text contents

_start

INT

Start position (of matching string within whole text)

_end

INT

End position (of matching string within whole text)

… plus any fields you elected to copy.

To look at things the validator recognized but the processor didn’t like, you can do something like the following. This example was created for a database with string source PKs (yuk) on Microsoft SQL Server (which sometimes requires a slightly convoluted way of specifying table names).

SELECT text  -- field with the free text in
FROM crissql_v3.dbo.Progress_Notes  -- source table
WHERE document_id IN (  -- primary key
    SELECT _srcpkstr FROM crissql_workspace.[CRIS-CPFT\RCardinal].validate_crp
    WHERE _srctable = 'Progress_Notes'  -- source table
) AND document_id NOT IN (
    SELECT _srcpkstr FROM crissql_workspace.[CRIS-CPFT\RCardinal].crp
    WHERE _srctable = 'Progress_Notes'  -- source table
)

This should produce text where CRP is mentioned but no value given, such as “FBC, TSH, vitamin B12, CRP and eGFR are all within normal range”; “blood sample taken (CRP/U&Es and FBC)”; “monitoring CK and CRP”; “CRP was back up yesterday”.

For a table with integer PKs you would use _srcpkval instead of _srcpkstr. Here’s an example, again using SQL Server:

SELECT [text]  -- field with the free text in
FROM RiO.dbo.Progress_Notes  -- source table
WHERE crate_pk IN (  -- primary key
    SELECT _srcpkval FROM RiONLP.dbo.validate_crp
    WHERE _srctable = 'Progress_Notes'  -- source table
) AND crate_pk NOT IN (
    SELECT _srcpkval FROM RiONLP.dbo.crp
    WHERE _srctable = 'Progress_Notes'  -- source table
)

Specimen timing on a slow system (2016-11-15): 5,954 seconds (1h40) for a full run of 2,717,779 text notes (one per database row, from a table with a string PK) through 40 NLP tasks (20 main, 20 validator) on a virtual computer mimicking 2×2.7GHz CPUs running Windows Server 2003, with all databases under SQL Server hosted elsewhere over a network. That works out at 18.2 kHz for processor-notes or 456 Hz for notes. (The corresponding do-nothing incremental update, with the –skipdelete option, took 4,756 s. That’s not much faster, and was limited primarily by queries for a record indicating that each datum had previously been processed. The advantage of incremental updates can be considerably more than this if the NLP step is slow, as with GATE and other more complex systems, but regular expressions are pretty quick.) Fast computers with local networks and SSD storage should perform considerably better, and tables with integer PKs are also processed faster because their work can be more efficiently and evenly assigned to parallel processes.

Current processors

Use the crate_nlp --listprocessors or crate_nlp --describeprocessors commands to show these (see Current NLP processors). Those not marked “EXTERNAL” are CRATE Python NLP processors.

Not all have been formally validated.

Testing the NLP

You can test these built-in NLP tools quickly via the command line:

USAGE: crate_run_crate_nlp_demo [-h] [--terminator TERMINATOR] --processors
                                PROCESSOR [PROCESSOR ...] [--verbose]
                                inputs [inputs ...]

Demonstrate CRATE's built-in Python NLP tools

POSITIONAL ARGUMENTS:
  inputs                Input files (use '-' for stdin)

OPTIONS:
  -h, --help            show this help message and exit
  --terminator TERMINATOR
                        Single-line terminator separating input chunks in an
                        input file. (default: STOP)
  --processors PROCESSOR [PROCESSOR ...]
                        NLP processor(s) to apply. Possibilities:
                        all,Ace,AceValidator,Albumin,AlbuminValidator,AlcoholU
                        nits,AlcoholUnitsValidator,AlkPhos,AlkPhosValidator,AL
                        T,ALTValidator,Basophils,BasophilsValidator,Bilirubin,
                        BilirubinValidator,Bmi,BmiValidator,Bp,BpValidator,Cre
                        atinine,CreatinineValidator,Crp,CrpValidator,Eosinophi
                        ls,EosinophilsValidator,Esr,EsrValidator,GammaGT,Gamma
                        GTValidator,Glucose,GlucoseValidator,Haematocrit,Haema
                        tocritValidator,Haemoglobin,HaemoglobinValidator,HbA1c
                        ,HbA1cValidator,HDLCholesterol,HDLCholesterolValidator
                        ,Height,HeightValidator,LDLCholesterol,LDLCholesterolV
                        alidator,Lithium,LithiumValidator,Lymphocytes,Lymphocy
                        tesValidator,MiniAce,MiniAceValidator,Mmse,MmseValidat
                        or,Moca,MocaValidator,Monocytes,MonocytesValidator,Neu
                        trophils,NeutrophilsValidator,Platelets,PlateletsValid
                        ator,Potassium,PotassiumValidator,RBC,RBCValidator,Sod
                        ium,SodiumValidator,TotalCholesterol,TotalCholesterolV
                        alidator,Triglycerides,TriglyceridesValidator,Tsh,TshV
                        alidator,Urea,UreaValidator,Wbc,WbcValidator,Weight,We
                        ightValidator (default: None)
  --verbose             Be verbose (default: False)

Note also that you can test any NLP processor via a config file; see testing NLP.