7.4. Run the NLP
Once you have created and edited your config file, you can run the NLP process in one of the following ways:
crate_nlp [--config CONFIG] --nlpdef NLP_NAME --incremental
crate_nlp [--config CONFIG] --nlpdef NLP_NAME --full
crate_nlp_multiprocess [--config CONFIG] --nlpdef NLP_NAME --incremental
crate_nlp_multiprocess [--config CONFIG] --nlpdef NLP_NAME --full
where NLP_NAME is something you’ve configured in the NLP config file (e.g. a drug-parsing NLP program or the GATE demonstration name/location NLP app). You can specify the config file explicitly or default to one selected by an environment variable (see below).
The ‘multiprocess’ versions are faster (if you have a multi-core/-CPU computer). The ‘full’ option destroys the destination database and starts again. The ‘incremental’ one brings the destination database up to date (creating it if necessary). The default is ‘incremental’, for safety reasons.
Get more help with:
crate_nlp --help
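The four invocations above differ only in the program name and the mode flag. As a minimal sketch, a Python script could assemble the command line like this. The helper name build_crate_nlp_command and the NLP definition name my_nlpdef are hypothetical and not part of CRATE; only the flags themselves come from the usage shown above.

```python
import shlex
from typing import List, Optional


def build_crate_nlp_command(
    nlpdef: str,
    config: Optional[str] = None,
    full: bool = False,
    multiprocess: bool = False,
) -> List[str]:
    """Assemble an argument list for crate_nlp or crate_nlp_multiprocess."""
    program = "crate_nlp_multiprocess" if multiprocess else "crate_nlp"
    command = [program]
    if config is not None:
        # If --config is omitted, the CRATE_NLP_CONFIG environment variable
        # is expected to name the config file instead.
        command += ["--config", config]
    command += ["--nlpdef", nlpdef]
    # Incremental is the default mode; it is passed explicitly for clarity.
    command.append("--full" if full else "--incremental")
    return command


if __name__ == "__main__":
    # Print only. To execute, pass the list to subprocess.run(..., check=True).
    print(shlex.join(build_crate_nlp_command("my_nlpdef")))
```

Building an argument list (rather than a single shell string) avoids quoting problems if the config path contains spaces.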
7.4.1. crate_nlp
This runs a single-process NLP controller.
Options:
USAGE: crate_nlp [-h] [--config CONFIG] [--nlpdef NLPDEF] [-i | -f]
[--dropremake] [--skipdelete] [--nlp] [--chunksize CHUNKSIZE]
[--verbose] [--report_every_fast REPORT_EVERY_FAST]
[--report_every_nlp REPORT_EVERY_NLP] [--echo] [--timing]
[--process PROCESS] [--nprocesses NPROCESSES]
[--processcluster PROCESSCLUSTER] [--version] [--democonfig]
[--listprocessors] [--describeprocessors] [--test_nlp]
[--print_local_processors] [--print_cloud_processors]
[--count] [--cloud] [--immediate] [--retrieve]
[--cancelrequest] [--cancelall] [--showqueue]
NLP manager. Version 0.20.0 (2023-02-14). By Rudolf Cardinal.
OPTIONS:
-h, --help show this help message and exit
CONFIG OPTIONS:
--config CONFIG Config file (overriding environment variable
CRATE_NLP_CONFIG) (default: None)
--nlpdef NLPDEF NLP definition name (from config file) (default: None)
-i, --incremental Process only new/changed information, where possible
(default: True)
-f, --full Drop and remake everything (default: False)
--dropremake Drop/remake destination tables only (default: False)
--skipdelete For incremental updates, skip deletion of rows present
in the destination but not the source (default: False)
--nlp Perform NLP processing only (default: False)
--chunksize CHUNKSIZE
Number of records copied in a chunk when copying PKs
from one database to another (default: 100000)
REPORTING OPTIONS:
--verbose, -v Be verbose (use twice for extra verbosity) (default:
False)
--report_every_fast REPORT_EVERY_FAST
Report insert progress (for fast operations) every n
rows in verbose mode (default: 100000)
--report_every_nlp REPORT_EVERY_NLP
Report progress for NLP every n rows in verbose mode
(default: 500)
--echo Echo SQL (default: False)
--timing Show detailed timing breakdown (default: False)
MULTIPROCESSING OPTIONS:
--process PROCESS For multiprocess mode: specify process number
(default: 0)
--nprocesses NPROCESSES
For multiprocess mode: specify total number of
processes (launched somehow, of which this is to be
one) (default: 1)
--processcluster PROCESSCLUSTER
Process cluster name (default: )
INFO ACTIONS:
--version show program's version number and exit
--democonfig Print a demo config file (default: False)
--listprocessors Show all possible built-in NLP processor names
(default: False)
--describeprocessors Show details of all built-in NLP processors (default:
False)
--test_nlp Test the NLP processor(s) for the selected definition,
by sending text from stdin to them (default: False)
--print_local_processors
For the chosen NLP definition, establish which local
NLP processors are involved (if any). Show detailed
information about these processors (as NLPRP JSON),
then stop (default: False)
--print_cloud_processors
For the chosen NLP definition, establish the relevant
cloud server, if applicable (from the 'cloud_config'
parameter). Ask that remote server about its available
NLP processors. Show detailed information about these
remote processors (as NLPRP JSON), then stop (default:
False)
--count Count records in source/destination databases, then
stop (default: False)
CLOUD OPTIONS:
--cloud Use cloud-based NLP processing tools. Queued mode by
default. (default: False)
--immediate To be used with 'cloud'. Process immediately.
(default: False)
--retrieve Retrieve NLP data from cloud (default: False)
--cancelrequest Cancel pending requests for the nlpdef specified
(default: False)
--cancelall Cancel all pending cloud requests. WARNING: this
option cancels all pending requests - not just those
for the nlp definition specified (default: False)
--showqueue Shows all pending cloud requests. (default: False)
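The cloud options describe a queued workflow: submit, inspect the queue, then retrieve. The sketch below lists plausible command lines for each step. The function name cloud_commands and the NLP definition name are hypothetical. The help text does not say whether --retrieve, --showqueue or --cancelrequest must be combined with --cloud; the combinations chosen here are assumptions, so check them against your installed version.

```python
from typing import Dict, List


def cloud_commands(nlpdef: str) -> Dict[str, List[str]]:
    """Map each cloud step to a crate_nlp argument list (illustrative only)."""
    base = ["crate_nlp", "--nlpdef", nlpdef]
    return {
        # Queued mode is the documented default when --cloud is given.
        "submit_queued": base + ["--cloud"],
        # --immediate is documented as being used together with --cloud.
        "submit_immediate": base + ["--cloud", "--immediate"],
        # ASSUMPTION: --showqueue is given without --cloud.
        "show_queue": base + ["--showqueue"],
        # ASSUMPTION: --retrieve is combined with --cloud.
        "retrieve": base + ["--cloud", "--retrieve"],
        # ASSUMPTION: --cancelrequest is given without --cloud.
        "cancel_pending": base + ["--cancelrequest"],
    }


if __name__ == "__main__":
    for step, command in cloud_commands("my_nlpdef").items():
        print(f"{step}: {' '.join(command)}")
```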
7.4.2. Current NLP processors
NLP processors (from crate_nlp --describeprocessors):
+---------------------------+-------------------------------------------------+
| NLP name | Description |
+---------------------------+-------------------------------------------------+
| Ace | COGNITIVE. |
| | |
| | Addenbrooke's Cognitive Examination (ACE, |
| | ACE-R, ACE-III) total score. |
| | |
| | The default denominator is 100 but it supports |
| | other values if given |
| | explicitly. |
+---------------------------+-------------------------------------------------+
| AceValidator | Validator for Ace (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Albumin | BIOCHEMISTRY (LFTs). |
| | |
| | Albumin (Alb). Units are g/L. |
+---------------------------+-------------------------------------------------+
| AlbuminValidator | Validator for Albumin (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| AlcoholUnits | SUBSTANCE MISUSE. |
| | |
| | Alcohol consumption, specified explicitly as |
| | (UK) units per day or per |
| | week, or via non-numeric references to not |
| | drinking any. |
| | |
| | - Output is in UK units per week. A UK unit is |
| | 10 ml of ethanol [#f1]_ [#f2]_. |
| | UK NHS guidelines used to be "per week" and |
| | remain broadly week-based [#f1]_. |
| | - It doesn't attempt any understanding of other |
| | alcohol descriptions (e.g. |
| | "pints of beer", "glasses of wine", "bottles |
| | of vodka") so is expected to |
| | apply where a clinician has converted a |
| | (potentially mixed) alcohol |
| | description to a units-per-week calculation. |
| | |
| | .. [#f1] https://www.nhs.uk/live-well/alcohol- |
| | advice/calculating-alcohol-units/, |
| | accessed 2023-01-18. |
| | .. [#f2] |
| | https://en.wikipedia.org/wiki/Unit_of_alcohol |
+---------------------------+-------------------------------------------------+
| AlcoholUnitsValidator | Validator for AlcoholUnits (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| AlkPhos | BIOCHEMISTRY (LFTs/BFTs). |
| | |
| | Alkaline phosphatase (ALP, AlkP, AlkPhos). |
| | Units are U/L. |
+---------------------------+-------------------------------------------------+
| AlkPhosValidator | Validator for AlkPhos (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| ALT | BIOCHEMISTRY (LFTs). |
| | |
| | Alanine aminotransferase (ALT), a.k.a. alanine |
| | transaminase (ALT). |
| | Units are U/L. |
| | |
| | A.k.a. serum glutamate-pyruvate transaminase |
| | (SGPT), or serum |
| | glutamate-pyruvic transaminase (SGPT), but not |
| | a.k.a. those in recent |
| | memory! |
+---------------------------+-------------------------------------------------+
| ALTValidator | Validator for ALT (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Basophils | HAEMATOLOGY (FBC). |
| | |
| | Basophil count (absolute). |
| | Default units are 10^9 / L; also supports |
| | cells/mm^3 = cells/μL. |
+---------------------------+-------------------------------------------------+
| BasophilsValidator | Validator for Basophils (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Bilirubin | BIOCHEMISTRY (LFTs). |
| | |
| | Total bilirubin. Units are μM. |
+---------------------------+-------------------------------------------------+
| BilirubinValidator | Validator for Bilirubin (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Bmi | CLINICAL EXAMINATION. |
| | |
| | Body mass index (BMI), in kg / m^2. |
+---------------------------+-------------------------------------------------+
| BmiValidator | Validator for Bmi (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Bp | CLINICAL EXAMINATION. |
| | |
| | Blood pressure, in mmHg. (Systolic and |
| | diastolic.) |
+---------------------------+-------------------------------------------------+
| BpValidator | Validator for Bp (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Cloud | EXTERNAL. |
| | |
| | Abstract NLP processor that passes information |
| | to a remote (cloud-based) |
| | NLP system via the NLPRP protocol. The |
| | processor at the other end might be |
| | of any kind. |
+---------------------------+-------------------------------------------------+
| Creatinine | BIOCHEMISTRY (U&E). |
| | |
| | Creatinine. Default units are micromolar (SI); |
| | also supports mg/dL. |
+---------------------------+-------------------------------------------------+
| CreatinineValidator | Validator for Creatinine (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Crp | BIOCHEMISTRY. |
| | |
| | C-reactive protein (CRP). Default units are |
| | mg/L; also supports mg/dL. |
| | |
| | CRP units: |
| | |
| | - mg/L is commonest in the UK (or at least |
| | standard at Addenbrooke's, |
| | Hinchingbrooke, and Dundee); |
| | |
| | - values of <=6 mg/L or <10 mg/L are normal, |
| | and e.g. 70-250 mg/L in |
| | pneumonia. |
| | |
| | - Refs include: |
| | |
| | - https://www.ncbi.nlm.nih.gov/pubmed/7705110 |
| | - https://emedicine.medscape.com/article/2086 |
| | 909-overview |
| | |
| | - 1 mg/dL = 10 mg/L, so normal in mg/dL is <=1 |
| | roughly. |
+---------------------------+-------------------------------------------------+
| CrpValidator | Validator for Crp (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Eosinophils | HAEMATOLOGY (FBC). |
| | |
| | Eosinophil count (absolute). |
| | Default units are 10^9 / L; also supports |
| | cells/mm^3 = cells/μL. |
+---------------------------+-------------------------------------------------+
| EosinophilsValidator | Validator for Eosinophils (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Esr | HAEMATOLOGY (ESR). |
| | |
| | Erythrocyte sedimentation rate (ESR), in mm/h. |
+---------------------------+-------------------------------------------------+
| EsrValidator | Validator for Esr (see help for explanation). |
+---------------------------+-------------------------------------------------+
| GammaGT | BIOCHEMISTRY (LFTs). |
| | |
| | Gamma-glutamyl transferase (gGT), in U/L. |
+---------------------------+-------------------------------------------------+
| GammaGTValidator | Validator for GammaGT (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Gate | EXTERNAL. |
| | |
| | Abstract NLP processor controlling an external |
| | process, typically our Java |
| | interface to GATE programs, |
| | ``CrateGatePipeline.java`` (but it could be any |
| | external program). |
| | |
| | We send text to it, it parses the text, and it |
| | sends us back results, which |
| | we return as dictionaries. The specific text |
| | sought depends on the |
| | configuration file and the specific GATE |
| | program used. |
| | |
| | For details of GATE, see |
| | https://www.gate.ac.uk/. |
+---------------------------+-------------------------------------------------+
| Glucose | BIOCHEMISTRY. |
| | |
| | Glucose. Default units are mM; also supports |
| | mg/dL. |
+---------------------------+-------------------------------------------------+
| GlucoseValidator | Validator for Glucose (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Haematocrit | HAEMATOLOGY (FBC). |
| | |
| | Haematocrit (Hct). |
| | A dimensionless quantity (but supports L/L |
| | notation). |
+---------------------------+-------------------------------------------------+
| HaematocritValidator | Validator for Haematocrit (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Haemoglobin | HAEMATOLOGY (FBC). |
| | |
| | Haemoglobin (Hb). Default units are g/L; also |
| | supports g/dL. |
| | |
| | UK reporting for haemoglobin switched in 2013 |
| | from g/dL to g/L; see |
| | e.g. |
| | |
| | - http://www.pathology.leedsth.nhs.uk/pathology |
| | /Portals/0/PDFs/BP-2013-02%20Hb%20units.pdf |
| | - https://www.acb.org.uk/docs/default-source/co |
| | mmittees/scientific/guidelines/acb/pathology- |
| | harmony-haematology.pdf |
| | |
| | The *DANGER* remains that "Hb 9" may have been |
| | from someone assuming |
| | old-style units, 9 g/dL = 90 g/L, but this will |
| | be interpreted as 9 g/L. |
| | This problem is hard to avoid. |
+---------------------------+-------------------------------------------------+
| HaemoglobinValidator | Validator for Haemoglobin (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| HbA1c | BIOCHEMISTRY. |
| | |
| | Glycosylated (glycated) haemoglobin (HbA1c). |
| | Default units are mmol/mol; also supports %. |
| | |
| | Note: HbA1 is different |
| | (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2 |
| | 541274). |
+---------------------------+-------------------------------------------------+
| HbA1cValidator | Validator for HbA1c (see help for explanation). |
+---------------------------+-------------------------------------------------+
| HDLCholesterol | BIOCHEMISTRY (LIPID PROFILE). |
| | |
| | High-density lipoprotein (HDL) cholesterol. |
| | Default units are mM; also supports mg/dL. |
+---------------------------+-------------------------------------------------+
| HDLCholesterolValidator | Validator for HDLCholesterol (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Height | CLINICAL EXAMINATION. |
| | |
| | Height. Handles metric (e.g. "1.8m") and |
| | imperial (e.g. "5 ft 2 in"). |
+---------------------------+-------------------------------------------------+
| HeightValidator | Validator for Height (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| LDLCholesterol | BIOCHEMISTRY (LIPID PROFILE). |
| | |
| | Low density lipoprotein (LDL) cholesterol. |
| | Default units are mM; also supports mg/dL. |
+---------------------------+-------------------------------------------------+
| LDLCholesterolValidator | Validator for LDLCholesterol (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Lithium | BIOCHEMISTRY (THERAPEUTIC DRUG MONITORING). |
| | |
| | Lithium (Li) levels (for blood tests, not |
| | doses), in mM. |
+---------------------------+-------------------------------------------------+
| LithiumValidator | Validator for Lithium (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Lymphocytes | HAEMATOLOGY (FBC). |
| | |
| | Lymphocyte count (absolute). |
| | Default units are 10^9 / L; also supports |
| | cells/mm^3 = cells/μL. |
+---------------------------+-------------------------------------------------+
| LymphocytesValidator | Validator for Lymphocytes (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Medex | EXTERNAL. |
| | |
| | Class controlling a Medex-UIMA external |
| | process, via our custom Java |
| | interface, ``CrateMedexPipeline.java``. |
| | |
| | MedEx-UIMA is a medication-finding tool: |
| | https://www.ncbi.nlm.nih.gov/pubmed/25954575. |
+---------------------------+-------------------------------------------------+
| MiniAce | COGNITIVE. |
| | |
| | Mini-Addenbrooke's Cognitive Examination |
| | (M-ACE). |
| | |
| | The default denominator is 30, but it supports |
| | other values if given |
| | explicitly. |
+---------------------------+-------------------------------------------------+
| MiniAceValidator | Validator for MiniAce (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Mmse | COGNITIVE. |
| | |
| | Mini-mental state examination (MMSE). |
| | |
| | The default denominator is 30, but it supports |
| | other values if given |
| | explicitly. |
+---------------------------+-------------------------------------------------+
| MmseValidator | Validator for Mmse (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Moca | COGNITIVE. |
| | |
| | Montreal Cognitive Assessment (MOCA). |
| | |
| | The default denominator is 30, but it supports |
| | other values if given |
| | explicitly. |
+---------------------------+-------------------------------------------------+
| MocaValidator | Validator for Moca (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Monocytes | HAEMATOLOGY (FBC). |
| | |
| | Monocyte count (absolute). |
| | Default units are 10^9 / L; also supports |
| | cells/mm^3 = cells/μL. |
+---------------------------+-------------------------------------------------+
| MonocytesValidator | Validator for Monocytes (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Neutrophils | HAEMATOLOGY (FBC). |
| | |
| | Neutrophil (polymorphonuclear leukocyte) count |
| | (absolute). |
| | Default units are 10^9 / L; also supports |
| | cells/mm^3 = cells/μL. |
+---------------------------+-------------------------------------------------+
| NeutrophilsValidator | Validator for Neutrophils (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Platelets | HAEMATOLOGY (FBC). |
| | |
| | Platelet count. |
| | Default units are 10^9 / L; also supports |
| | cells/mm^3 = cells/μL. |
| | |
| | Not actually a white blood cell, of course, but |
| | can share the same base |
| | class; platelets are expressed in the same |
| | units, of 10^9 / L. |
| | Typical values 150–450 ×10^9 / L (or |
| | 150,000–450,000 per μL). |
+---------------------------+-------------------------------------------------+
| PlateletsValidator | Validator for Platelets (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Potassium | BIOCHEMISTRY (U&E). |
| | |
| | Potassium (K), in mM. |
+---------------------------+-------------------------------------------------+
| PotassiumValidator | Validator for Potassium (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| RBC | HAEMATOLOGY (FBC). |
| | |
| | Red blood cell count. |
| | Default units are 10^12/L; also supports |
| | cells/mm^3 = cells/μL. |
| | |
| | A typical excerpt from a FBC report: |
| | |
| | .. code-block:: none |
| | |
| | RBC, POC 4.84 10*12/L |
| | RBC, POC 9.99 (H) 10*12/L |
+---------------------------+-------------------------------------------------+
| RBCValidator | Validator for RBC (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Sodium | BIOCHEMISTRY (U&E). |
| | |
| | Sodium (Na), in mM. |
+---------------------------+-------------------------------------------------+
| SodiumValidator | Validator for Sodium (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| TotalCholesterol | BIOCHEMISTRY (LIPID PROFILE). |
| | |
| | Total or undifferentiated cholesterol. |
| | Default units are mM; also supports mg/dL. |
+---------------------------+-------------------------------------------------+
| TotalCholesterolValidator | Validator for TotalCholesterol (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Triglycerides | BIOCHEMISTRY (LIPID PROFILE). |
| | |
| | Triglycerides. |
| | Default units are mM; also supports mg/dL. |
+---------------------------+-------------------------------------------------+
| TriglyceridesValidator | Validator for Triglycerides (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
| Tsh | BIOCHEMISTRY (ENDOCRINOLOGY). |
| | |
| | Thyroid-stimulating hormone (TSH), in mIU/L (or |
| | μIU/mL). |
+---------------------------+-------------------------------------------------+
| TshValidator | Validator for TSH (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Urea | BIOCHEMISTRY (U&E). |
| | |
| | Urea, in mM. |
+---------------------------+-------------------------------------------------+
| UreaValidator | Validator for Urea (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Wbc | HAEMATOLOGY (FBC). |
| | |
| | White cell count (WBC, WCC). |
| | Default units are 10^9 / L; also supports |
| | cells/mm^3 = cells/μL. |
+---------------------------+-------------------------------------------------+
| WbcValidator | Validator for Wbc (see help for explanation). |
+---------------------------+-------------------------------------------------+
| Weight | CLINICAL EXAMINATION. |
| | |
| | Weight. Handles metric (e.g. "57kg") and |
| | imperial (e.g. "10 st 2 lb"). |
| | Requires units to be specified. |
+---------------------------+-------------------------------------------------+
| WeightValidator | Validator for Weight (see help for |
| | explanation). |
+---------------------------+-------------------------------------------------+
Abbreviations not otherwise explained:
BFTs: bone function tests.
FBC: full blood count.
LFTs: liver function tests.
U&E: urea and electrolytes.
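Several processors in the table accept an alternative unit alongside the default. The arithmetic behind the conversions quoted in the table (1 mg/dL = 10 mg/L for CRP, 9 g/dL = 90 g/L for haemoglobin, 150,000 per μL = 150 ×10^9 / L for platelets) can be sketched as below. These function names are hypothetical and this is not CRATE's own implementation, only a worked illustration of the unit relationships.

```python
def crp_mg_per_dl_to_mg_per_l(value: float) -> float:
    """CRP: 1 dL = 0.1 L, so 1 mg/dL = 10 mg/L."""
    return value * 10


def hb_g_per_dl_to_g_per_l(value: float) -> float:
    """Haemoglobin: 1 g/dL = 10 g/L (the pre-2013 UK unit to the current one)."""
    return value * 10


def cells_per_ul_to_1e9_per_l(value: float) -> float:
    """Cell counts: 1 cell/uL = 10^6 cells/L, i.e. 0.001 x 10^9 / L."""
    return value / 1000


def alcohol_units_per_day_to_per_week(value: float) -> float:
    """Alcohol: output is per week, so a daily figure is multiplied by 7."""
    return value * 7


if __name__ == "__main__":
    print(crp_mg_per_dl_to_mg_per_l(1))       # normal upper bound, roughly
    print(hb_g_per_dl_to_g_per_l(9))          # the ambiguous "Hb 9" case
    print(cells_per_ul_to_1e9_per_l(150000))  # lower typical platelet count
```

The haemoglobin case shows why the ambiguity is dangerous: the two readings of "Hb 9" differ by a factor of ten.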
7.4.3. crate_nlp_multiprocess
This program runs multiple copies of crate_nlp in parallel.
Options:
USAGE: crate_nlp_multiprocess [-h] --nlpdef NLPDEF [--nproc [NPROC]]
[--verbose]
Runs the CRATE NLP manager in parallel. Version 0.20.0 (2023-02-14). Note that
all arguments not specified here are passed to the underlying script (see
crate_nlp --help).
OPTIONS:
-h, --help show this help message and exit
--nlpdef NLPDEF NLP processing name, from the config file (default:
None)
--nproc, -n [NPROC] Number of processes (default is the number of CPUs on
this machine) (default: 8)
--verbose, -v Be verbose (default: False)
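The --process and --nprocesses options of crate_nlp indicate how the work can be divided: each copy is told its own number and the total. The sketch below generates one command line per process. It assumes that process numbers run from 0 to n - 1 (the default for --process is 0) and is not claimed to be what crate_nlp_multiprocess does internally; the function name per_process_commands is hypothetical.

```python
import os
from typing import List, Optional


def per_process_commands(nlpdef: str, nproc: Optional[int] = None) -> List[List[str]]:
    """Return one crate_nlp argument list per parallel process."""
    # Fall back on the CPU count, mirroring the documented --nproc default.
    n = nproc if nproc is not None else (os.cpu_count() or 1)
    return [
        [
            "crate_nlp",
            "--nlpdef", nlpdef,
            "--process", str(i),      # ASSUMPTION: zero-based process number
            "--nprocesses", str(n),   # total number of processes
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    for command in per_process_commands("my_nlpdef", nproc=3):
        print(" ".join(command))
```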
7.4.4. Limiting the network bandwidth used by cloud NLP
Cloud-based NLP may involve sending large quantities of text (de-identified and encrypted en route) to a distant server. If you have limited network bandwidth, you may want to cap the bandwidth used by CRATE (at the price of speed).
Under Linux, use trickle. Here’s how:
# Install with e.g. "sudo apt install trickle", then see "man trickle".
# Source code is at https://github.com/mariusae/trickle.
# Example with limits of 500 KB/s download, 200 KB/s upload:
trickle -s -d 500 -u 200 crate_nlp <OPTIONS>
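To judge the price in speed, it helps to estimate how long a transfer takes under a given cap. The sketch below builds the trickle prefix from the example above and computes a lower bound on upload time. The function names are hypothetical, the corpus size is invented for illustration, and the estimate assumes 1 MB = 1000 KB and ignores protocol overhead and server processing time.

```python
from typing import List


def trickle_prefix(download_kb_s: int, upload_kb_s: int) -> List[str]:
    """Arguments to put in front of a crate_nlp command line."""
    # -s: standalone mode; -d / -u: download / upload limits in KB/s.
    return ["trickle", "-s", "-d", str(download_kb_s), "-u", str(upload_kb_s)]


def transfer_seconds(megabytes: float, limit_kb_per_s: float) -> float:
    """Lower bound on transfer time, taking 1 MB = 1000 KB (an assumption)."""
    return megabytes * 1000 / limit_kb_per_s


if __name__ == "__main__":
    command = trickle_prefix(500, 200) + ["crate_nlp", "--nlpdef", "my_nlpdef", "--cloud"]
    print(" ".join(command))
    # Hypothetical 1000 MB of text, uploaded at the 200 KB/s cap:
    print(transfer_seconds(1000, 200) / 60, "minutes at least")
```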
Under Windows, use NetLimiter. The rationale is as follows.
The choice under Windows is less obvious. A commercial option is NetLimiter, but there is no direct equivalent of trickle. Python options require quite a bit of network code redesign; e.g.
https://stackoverflow.com/questions/3488616/bandwidth-throttling-in-python
https://stackoverflow.com/questions/13047458/bandwidth-throttling-using-twisted
but with the exception of rewriting network code to use Twisted rather than requests, none of these open-source methods addresses the general-purpose bandwidth limitation problem that trickle solves. The best option might be txrequests or treq plus bandwidth limitation via Twisted through its ThrottlingFactory, but this doesn’t look entirely simple (see links above). Even with that, it’d be hard to coordinate bandwidth limits across multiple processes.
Therefore, in favour of NetLimiter:
it’s cheap (~$30/licence in 2019);
it provides a per-host unlimited-duration licence;
if you’re using Windows you’re already in the domain of commercial software;
the cloud NLP facility of CRATE is the sort of thing you’re likely to run on one big computer rather than lots of computers (so one licence should suffice);
its filters are very flexible (including time-of-day restrictions and the ability to group applications);
the alternatives would involve substantial development effort for lesser benefit;
… so NetLimiter seems like the most cost-effective option.
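For comparison, the in-process throttling approach discussed above can be illustrated with a token bucket. This is a generic sketch using only the standard library. It is not Twisted's ThrottlingFactory and not part of CRATE, and the class name TokenBucket is hypothetical. Each process would hold its own bucket, which is why a shared limit across several crate_nlp processes is hard to enforce this way.

```python
import time
from typing import Callable


class TokenBucket:
    """Per-process byte-rate limiter: tokens refill at a fixed rate."""

    def __init__(
        self,
        rate_bytes_per_s: float,
        capacity_bytes: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.clock = clock
        self.tokens = capacity_bytes  # start with a full bucket
        self.last = clock()

    def delay_for(self, nbytes: int) -> float:
        """Consume nbytes; return how long the caller should sleep first."""
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # In debt: wait until the refill pays the debt off.
        return -self.tokens / self.rate


if __name__ == "__main__":
    bucket = TokenBucket(rate_bytes_per_s=200_000, capacity_bytes=200_000)
    for chunk in (150_000, 150_000):
        time.sleep(bucket.delay_for(chunk))
        print("would send", chunk, "bytes now")
```

The caller has to invoke delay_for before every send, which is the kind of network code redesign that makes this route costly compared with an external limiter.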