7.2. NLP config file
7.2.1. Overview
The CRATE NLP config file controls the behaviour of the NLP manager. It defines source and destination databases, and one or more NLP definitions.
You can generate a specimen config file with
crate_nlp --democonfig > test_nlp_config.ini
Save this file, then edit it to suit your needs.
In particular, make it point to anything else it depends on, such as your GATE installation if you want to use GATE NLP.
For convenience, you may want the CRATE_NLP_CONFIG environment variable to point to this file. (Otherwise you must specify it each time.)
7.2.2. Detail
The config file describes NLP definitions, such as ‘people_and_places’, ‘drugs_and_doses’, or ‘haematology_white_cell_differential’. You choose the names of these NLP definitions as you define them. You select one when you run the NLP manager (using the --nlpdef argument) [1].
The NLP definition sets out the following things.
“Where am I going to find my source text?” You specify this by giving one or more inputfielddefs. Each one of those specifies (via its own config section) a database/table/field combination, such as the ‘Notes’ field of the ‘Progress Notes’ table in the ‘RiO’ database, or the ‘text_content’ field of the ‘Clinical_Documents’ table in the ‘CDL’ database [2].
CRATE will always store minimal source information (database, table, integer PK) with the NLP output. However, for convenience you are also likely to want to copy some other key information over to the output, such as patients’ research identifiers (RIDs). You can specify these via the copyfields option to the input field definitions. For validation purposes, you might even choose to copy the full source text, but it’s unlikely you’d want to do this routinely (because it wastes space).
“Which NLP processor will run, and where will it store its output?” This might be an external GATE program specializing in finding drug names, or one of CRATE’s built-in regular expression (regex) parsers, such as for inflammatory markers from blood tests. The choice of NLP processor also determines the fields that will appear in the output; for example, a drug-detecting NLP program might provide fields such as ‘drug’, ‘dose’, ‘units’, and ‘route’, while a white-cell differential processor might provide output such as ‘cell type’, ‘value_in_billion_per_litre’, and so on [3].
In fact, GATE applications can simultaneously provide more than one type of output; for example, GATE’s demonstration people-and-places application yields both ‘person’ information (rule, firstname, surname, gender…) and ‘location’ information (rule, loctype…), and it can be computationally more efficient to run them together. Therefore, CRATE supports multiple types of output from ‘single’ NLP processors.
Each NLP processor may have its own set of options. For example, the GATE controller requires information about the specific external GATE app to run, and about any necessary environment variables. Others, such as CRATE’s built-in regular expression parsers, are simpler.
You might want all of your “drugs and doses” information to be stored in a single table (such that drugs found in your Progress_Notes and drugs found in your Clinical_Documents get stored together); this would be common and sensible (and CRATE will keep a record of where the information came from). However, it’s possible that you might want to segregate them (e.g. having C-reactive protein information extracted from your Progress_Notes stored in a different table to C-reactive protein information extracted from your High_Sensitivity_CRP_Notes_For_Bobs_Project table).
For GATE apps that provide more than one type of output structure, you will need to specify more than one output table.
You can batch different NLP processors together. For example, the demo config batches up CRATE’s internal regular expression NLP processors together. This is more efficient, because one record fetched from the source database can then be sent to multiple NLP processors. However, it’s less helpful if you are developing new NLP tools and want to be able to re-run just one NLP tool frequently.
All NLP configuration ‘chunks’ are sections within the NLP config file, which is in standard .INI format. For example, an input field definition is a section; a database definition is a section; an environment variable section for use with external programs is a section; and so on.
To allow incremental updates, CRATE keeps a master progress table. For each source record this stores a reference to the source information (database, table, PK), a hash of the source information (to work out later whether the source has changed), the date/time when the NLP was last run, and the name of the NLP definition that was run.
It’s definitely better if your source table has integer PKs, but you might not have a choice in the matter (and be unable to add one to a read-only source database), so CRATE also supports string PKs. In this instance it will create an integer by hashing the string and store that along with the string PK itself. (That integer is not guaranteed to be unique, because of hash collisions [5], but it allows some efficiency to be added.)
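As a rough illustration of the string-PK approach just described, the sketch below derives a signed 64-bit integer from a string PK. The function name and the choice of SHA-256 are assumptions made for this example only; CRATE's own hash function may well differ.

```python
import hashlib


def int_from_string_pk(pk: str) -> int:
    """Derive a signed 64-bit integer from a string primary key.

    Hypothetical helper: hashes the UTF-8 bytes with SHA-256 and keeps the
    first 8 bytes. Different strings can collide, so the original string
    PK must be stored alongside the integer.
    """
    digest = hashlib.sha256(pk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# The same string always maps to the same integer.
pk_int = int_from_string_pk("DOC-000123")
```

The integer is only a fast lookup key: any match on it still has to be confirmed against the stored string PK.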
7.2.3. Testing your NLP definitions
You can test any NLP definition that you create. See testing NLP.
7.2.4. Format of the configuration file
The config file is in standard INI file format.
UTF-8 encoding. Use this! The file is explicitly opened in UTF-8 mode.
Comments. Hashes (#) and semicolons (;) denote comments.
Sections. Sections are indicated with: [section]
Name/value (key/value) pairs. The parser used is ConfigParser. It allows name=value or name:value.
Avoid indentation of parameters. (Indentation is used to indicate the continuation of previous parameters.)
Parameter types, referred to below, are:
String. Single-line strings are simple.
Multiline string. Here, a series of lines is read and split into a list of strings (one for each line). You should indent all lines except the first beyond the level of the parameter name, and then they will be treated as one parameter value.
Integer. Simple.
Boolean. For Boolean options, true values are any of: 1, yes, true, on (case-insensitive). False values are any of: 0, no, false, off.
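To see these rules in action, here is a minimal sketch using Python's standard configparser module on a small, made-up config fragment. The section name my_demo and the values are illustrative only; the way the multiline value is split into a list is an assumption about how you might read it, not a copy of CRATE's code.

```python
import configparser

# Hypothetical config text; the parameter names follow this page.
CONFIG_TEXT = """
[nlpdef:my_demo]
# A multiline string: continuation lines are indented.
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
; Both name=value and name:value are accepted.
max_rows_before_commit: 500
record_truncated_values = Yes
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
section = parser["nlpdef:my_demo"]

# Multiline string -> list of strings, one per non-blank line.
input_names = [
    line.strip()
    for line in section["inputfielddefs"].splitlines()
    if line.strip()
]
# Integer and Boolean parameters.
max_rows = section.getint("max_rows_before_commit")
record_truncated = section.getboolean("record_truncated_values")
```

Note that an unindented continuation line would be read as a new (and invalid) parameter, which is why the indentation matters.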
7.2.5. Config file section: NLP definition
These are config file sections named [nlpdef:XXX] where XXX is the name of one of your NLP definitions.
These are the “top-level” configuration sections, referred to when you launch CRATE’s NLP tools from the command line. Start here.
These config sections map inputs (from your database) to processors and a progress-tracking database, and give names to those mappings.
7.2.5.1. inputfielddefs
Multiline string.
List of input fields to parse. Each is the name of an input field definition in the config file.
Input to the NLP processor(s) comes from one or more source fields (columns), each within a table within a database. Each item in this list refers to a config section that defines that field in more detail.
7.2.5.2. processors
Multiline string.
Which NLP processors shall we use?
Specify these as a list of processor_type, processor_config_section pairs.
For example, one might be:
GATE mygateproc_name_location
and CRATE would then look for a processor definition in a config file section named [processor:mygateproc_name_location], and expect it to have the information required for a GATE processor.
For possible processor types, see crate_nlp --listprocessors. These include CRATE internal processors (e.g. “Glucose”), external tools run locally (e.g. “GATE”), and cloud-based NLP (“Cloud”).
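The pairing described above can be sketched as follows. This is a hypothetical helper (parse_processor_pairs is not a CRATE function) that turns the multiline processors value into (processor type, config section name) tuples, assuming whitespace-separated pairs, one per line.

```python
from typing import List, Tuple


def parse_processor_pairs(value: str) -> List[Tuple[str, str]]:
    """Split a 'processors' value into (processor_type, section_name) pairs."""
    pairs = []
    for line in value.splitlines():
        words = line.split()
        if not words:
            continue  # skip blank lines
        if len(words) != 2:
            raise ValueError(
                f"Expected 'processor_type section_name', got: {line!r}"
            )
        processor_type, section_name = words
        # The processor's settings live in a [processor:...] section.
        pairs.append((processor_type, f"processor:{section_name}"))
    return pairs


pairs = parse_processor_pairs("""
    GATE mygateproc_name_location
    Glucose procdef_glucose
""")
```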
7.2.5.3. progressdb
String.
Secret progress database; the name of a database definition in the config file.
To allow incremental updates, information is stored in a progress table. The database name is a cross-reference to another section in this config file. The table name within this database is hard-coded to crate_nlp_progress.
7.2.5.4. hashphrase
String.
You should insert a hash phrase of your own here. However, it’s not especially secret (it’s only used for change detection and users are likely to have access to the source material anyway), and its specific value is unimportant.
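To illustrate what “change detection” means here, the sketch below computes a keyed hash of the source text and compares it with a previously stored value. The function names and the use of HMAC-SHA256 are assumptions for this example; CRATE's actual hashing scheme is not described on this page.

```python
import hashlib
import hmac
from typing import Optional


def source_hash(hashphrase: str, source_text: str) -> str:
    """Keyed hash of the source text (hypothetical scheme: HMAC-SHA256)."""
    return hmac.new(
        hashphrase.encode("utf-8"),
        source_text.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


def needs_reprocessing(
    stored_hash: Optional[str], hashphrase: str, source_text: str
) -> bool:
    """True if the record is new, or its text has changed since the last run."""
    return stored_hash is None or stored_hash != source_hash(
        hashphrase, source_text
    )


old_hash = source_hash("doesnotmatter", "CRP 10 mg/L")
unchanged = needs_reprocessing(old_hash, "doesnotmatter", "CRP 10 mg/L")
changed = needs_reprocessing(old_hash, "doesnotmatter", "CRP 12 mg/L")
```

One practical consequence of any scheme like this: if you change the hash phrase, previously stored hashes no longer match, so every record looks changed.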
7.2.5.5. temporary_tablename
String. Default: _crate_nlp_temptable.
Temporary table name to use (in progress and destination databases).
7.2.5.6. max_rows_before_commit
Integer. Default: 1000.
Specify the maximum number of rows to be processed before a COMMIT is issued on the database transaction(s). This prevents the transaction(s) growing too large.
7.2.5.7. max_bytes_before_commit
Integer. Default: 80 MiB (80 * 1024 * 1024 = 83886080 bytes).
Specify the maximum number of source-record bytes (approximately!) that are processed before a COMMIT is issued on the database transaction(s). This prevents the transaction(s) growing too large. The COMMIT is issued only once this limit has been met or exceeded, so the total can overshoot the limit if the record that reaches it takes the cumulative total over the limit.
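The interaction between the row limit and the byte limit can be sketched like this. It is an illustrative model only (commit_points is a hypothetical function, and CRATE's byte accounting is approximate), showing that a COMMIT follows whichever limit is reached first and that both counters then restart.

```python
from typing import Iterable, List


def commit_points(
    text_lengths: Iterable[int],
    max_rows: int = 1000,
    max_bytes: int = 80 * 1024 * 1024,
) -> List[int]:
    """Return the 1-based record numbers after which a COMMIT would occur."""
    points = []
    rows = 0
    nbytes = 0
    for record_number, length in enumerate(text_lengths, start=1):
        rows += 1
        nbytes += length
        # Commit once either limit has been met or exceeded.
        if rows >= max_rows or nbytes >= max_bytes:
            points.append(record_number)
            rows = 0
            nbytes = 0
    return points


# Row limit reached first: commit after every second record.
by_rows = commit_points([10, 10, 10, 10, 10], max_rows=2, max_bytes=1000)
# Byte limit reached first: the second record takes the total to 120 bytes.
by_bytes = commit_points([60, 60, 10], max_rows=100, max_bytes=100)
```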
7.2.5.8. truncate_text_at
Integer. Default: 0. Must be zero or positive.
Use this to truncate very long incoming text fields. If non-zero, this is the length at which to truncate.
7.2.5.9. record_truncated_values
Boolean. Default: false.
Record in the progress database that we have processed records for which the source text was truncated (see truncate_text_at).
This option lets the program, when running in incremental mode, decide whether to re-run NLP on records whose source text was truncated before processing. If this option is set to true, such records won’t be run again unless they have changed.
7.2.5.10. cloud_config
String. Required to use cloud NLP.
The name of the cloud NLP configuration to use if you ask for cloud-based processing with this NLP definition.
For example, you might specify:
cloud_config = my_uk_cloud_nlp_service
and CRATE would then look for a cloud NLP configuration in a config file section named [cloud:my_uk_cloud_nlp_service], and use the information there to connect to a cloud NLP service via the NLPRP.
7.2.5.11. cloud_request_data_dir
String. Required to use cloud NLP.
Directory (on your local filesystem) to hold files containing information for the retrieval of data which has been sent in queued mode.
For safety (in case the user specifies a foolish directory!), CRATE will make a subdirectory of this directory (whose name is that of the NLP definition). CRATE will delete files at will within that subdirectory.
7.2.6. Config file section: input field definition
These are config file sections named [input:XXX] where XXX is the name of one of your input field definitions.
These define database “inputs” in more detail, including the database, table, and field (column) containing the input, the associated primary key field, and fields that should be copied to the destination to make subsequent work easier (e.g. patient research IDs).
They are referred to by the NLP definition.
7.2.6.1. srcdb
String.
Source database; the name of a database definition in the config file.
7.2.6.2. srctable
String.
The name of the table in the source database.
7.2.6.3. srcpkfield
String.
The name of the primary key field (column) in the source table.
7.2.6.4. srcfield
String.
The name of the field (column) in the source table that contains the data of interest.
7.2.6.5. srcdatetimefield
String. Optional (but advisable).
The name of the DATETIME field (column) in the source table that represents the date/time of the source data. If present, this information will be copied to the output; see Standard NLP output columns.
7.2.6.6. copyfields
Multiline string. Optional.
Names of fields to copy from the source table to the destination (NLP output) table.
7.2.6.7. indexed_copyfields
Multiline string.
Optional subset of copyfields that should be indexed in the destination (NLP output) table.
7.2.6.8. debug_row_limit
Integer. Default: 0.
Debugging option. Specify this to set the maximum number of rows to be fetched from the source table. Specifying 0 means “no limit”.
7.2.7. Config file section: processor definition
These are config file sections named [processor:XXX] where XXX is the name of one of your NLP processors.
These control the behaviour of individual NLP processors.
In the case of CRATE’s built-in processors, the only configuration needed is the destination database/table, but for some, like GATE applications, you need to define more – such as how to run the external program, and what sort of table structure should be created to receive the results.
The format depends on the specific processor type (see processors).
7.2.7.1. destdb
String.
Applicable to: all parsers.
Destination database; the name of a database definition in the config file.
7.2.7.2. desttable
String.
Applicable to: MedEx, all CRATE Python processors.
The name of the table in the destination database in which the results should be stored.
Cloud and local GATE processors may produce output for multiple tables (or a single table, but potentially one that you need to help define). For these, use outputtypemap instead. This refers to “output” configurations, in which you can define the table(s).
7.2.7.3. outputtypemap
Multiline string.
Applicable to: GATE, Cloud.
For GATE processors:
What’s GATE? See the section on GATE NLP.
This tabular entry maps GATE ‘_type’ parameters to possible destination tables (in case-insensitive fashion). This parameter is a list of pairs, one pair per line.
The first item of each is the annotation type coming out of the GATE system.
The second is the output type section defined in this file (as a separate section). Those sections each define a table with its columns (fields); see GATE/cloud output definitions.
Example:
outputtypemap =
    Person output_person
    Location output_location
This example would take output from GATE labelled with _type=Person and send it to output defined in the [output:output_person] section of the config file; see GATE/cloud output definitions. Equivalently for the Location type.
For cloud processors:
The first parameter is the remote server’s tablename (see NLPRP schema definition). The second is an output type definition, as above (which may define the table in full, or just name it and leave the definition to the remote processor).
Use this method whenever the remote processor may return data for more than one table.
If the remote processor will only return results for a single table, and doesn’t name it, the FIRST definition in the output type map is used (and the first element of the pair is ignored for this purpose, i.e. you can use any string you want).
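The lookup rules above (case-insensitive matching, plus the “first definition wins” fallback for an unnamed single table) might be modelled as in this sketch. Both function names are hypothetical and the code is not taken from CRATE.

```python
from typing import Dict, Optional


def parse_outputtypemap(value: str) -> Dict[str, str]:
    """Map lower-cased annotation/table names to [output:...] section names."""
    mapping: Dict[str, str] = {}
    for line in value.splitlines():
        words = line.split()
        if not words:
            continue  # skip blank lines
        if len(words) != 2:
            raise ValueError(f"Expected a pair, got: {line!r}")
        source_type, output_name = words
        mapping[source_type.lower()] = f"output:{output_name}"
    return mapping


def output_section_for(
    mapping: Dict[str, str], source_type: Optional[str] = None
) -> str:
    """Find the output section; with no type given, use the FIRST entry."""
    if source_type is None:
        # Dictionaries preserve insertion order, so this is the first pair.
        return next(iter(mapping.values()))
    return mapping[source_type.lower()]


mapping = parse_outputtypemap("""
    Person output_person
    Location output_location
""")
person_section = output_section_for(mapping, "PERSON")
default_section = output_section_for(mapping)
```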
7.2.7.4. assume_preferred_unit
Boolean. Default: True.
Applicable to: nearly all numerical CRATE Python processors.
If a unit is not specified, assume that values are in the processor’s preferred units. (For example, crate_anon.nlp_manager.parse_biochemistry.Crp will assume mg/L.)
Some override this and are not configurable, however:
AlcoholUnits never assumes this.
7.2.7.5. progargs
Multiline string.
Applicable to: GATE, MedEx.
This parameter defines how we will launch GATE. See GATE NLP.
GATE NLP is done by an external program.
In this parameter, we specify a program and associated arguments. Here’s an example:
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/bin/gate.jar"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{GATE_HOME}/plugins/ANNIE/ANNIE_with_defaults.gapp"
    --annotation Person
    --annotation Location
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --verbose
The example shows how to use Java to launch a specific Java program (CrateGatePipeline), having set a path to find other Java classes, and how to pass arguments to the program itself.
NOTE IN PARTICULAR:
Use double quotes to encapsulate any filename that may have spaces within it (e.g. C:/Program Files/...).
Use a forward slash directory separator, even under Windows. If that doesn’t work, try a double backslash, \\.
Under Windows, use a semicolon to separate parts of the Java classpath. Under Linux, use a colon.
So a Linux Java classpath looks like
/some/path:/some/other/path:/third/path
and a Windows one looks like
C:/some/path;C:/some/other/path;C:/third/path
To make this simpler, we can define the environment variable OS_PATHSEP (by analogy to Python’s os.pathsep). See the environment variable section below.
You can use substitutable parameters:
{X}: Substitutes variable X from the environment you specify (see below).
{NLPLOGTAG}: Additional environment variable that indicates the process being run; used to label the output from the CrateGatePipeline application.
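The substitution and quoting rules above can be tried out with a short sketch. Here build_command is a hypothetical helper, and the use of a regular expression plus shlex.split is an assumption about one reasonable way to do this, not a description of CRATE's implementation. The directory names are made up.

```python
import re
import shlex
from typing import Dict, List


def build_command(progargs: str, env: Dict[str, str]) -> List[str]:
    """Substitute {NAME} placeholders from env, then split into arguments."""

    def lookup(match: "re.Match[str]") -> str:
        return env[match.group(1)]  # KeyError if the variable is undefined

    substituted = re.sub(r"\{(\w+)\}", lookup, progargs)
    # shlex honours double quotes, so paths containing spaces stay intact.
    return shlex.split(substituted)


PROGARGS = """
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/bin/gate.jar"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --log_tag {NLPLOGTAG}
"""

command = build_command(
    PROGARGS,
    {
        "NLPPROGDIR": "/opt/nlp classes",  # note the space in this path
        "GATE_HOME": "/opt/gate",
        "OS_PATHSEP": ":",
        "NLPLOGTAG": "demo",
    },
)
```

Because each path is quoted separately and the separator sits between the quoted parts, the whole classpath ends up as a single argument even when one directory contains a space.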
7.2.7.6. progenvsection
String.
Applicable to: GATE, MedEx.
Environment variable config section to use when launching this program.
7.2.7.7. input_terminator
String.
Applicable to: GATE.
The external GATE program is slow, because NLP is slow. Therefore, we set up the external program and use it repeatedly for a whole bunch of text. Individual pieces of text are sent to it (via its stdin). We finish our piece of text with a delimiter, which should (a) be specified in the -it or --input_terminator parameter to the CRATE CrateGatePipeline interface (above), and (b) be set here, TO THE SAME VALUE. The external program will return a tab-separated set of field/value pairs, like this:
field1\tvalue1\tfield2\tvalue2...
field1\tvalue3\tfield2\tvalue4...
...
OUTPUTTERMINATOR
… where OUTPUTTERMINATOR is something that you (a) specify with the -ot or --output_terminator parameter above, and (b) set via the config file output_terminator, TO THE SAME VALUE.
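As a sketch of the receiving side of this exchange, the hypothetical function below reads lines until it meets the output terminator and turns each line of alternating field/value items into a dictionary. It assumes tab separation as described above; CRATE's real reader may handle details (escaping, encoding) that are ignored here.

```python
from typing import Dict, Iterable, List


def read_records(
    lines: Iterable[str], output_terminator: str
) -> List[Dict[str, str]]:
    """Collect field/value records until the output terminator line."""
    records = []
    for raw_line in lines:
        line = raw_line.rstrip("\n")
        if line == output_terminator:
            break  # end of output for this piece of text
        parts = line.split("\t")
        # Even positions are field names; odd positions are their values.
        records.append(dict(zip(parts[0::2], parts[1::2])))
    return records


fake_output = [
    "_type\tPerson\tfirstname\tAlice\n",
    "_type\tLocation\tloctype\tcity\n",
    "END_OF_NLP_OUTPUT_RECORD\n",
    "this line belongs to the next record\n",
]
records = read_records(fake_output, "END_OF_NLP_OUTPUT_RECORD")
```

If the terminator configured here differs from the one given to the external program, a reader like this would never see its stop signal, which is why the two values must match.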
7.2.7.8. output_terminator
String.
Applicable to: GATE.
See input_terminator.
7.2.7.9. max_external_prog_uses
Integer.
Applicable to: GATE, MedEx.
If the external program leaks memory, you may wish to cap the number of uses before it’s restarted. Specify this option if so. Specify 0 or omit the option entirely to ignore this.
7.2.7.10. processor_name
String.
Applicable to: Cloud.
Name of the remote processor; see NLPRP list_processors.
Note that this is case sensitive. To ask the remote server what processor names it offers, use the crate_nlp tool like this:
crate_nlp --config MYCONFIG --nlpdef MYNLPDEF --print_cloud_processors
That will, in sequence:
read the config called MYCONFIG;
look for an NLP definition section marked [nlpdef:MYNLPDEF];
look for a cloud_config parameter in that section;
look up the corresponding cloud server definition, including the URL of the remote server;
ask the server for details of all its processors;
print the results (in NLPRP JSON format; see the NLPRP list_processors command for details).
7.2.7.11. processor_version
String. Default: None.
Applicable to: Cloud.
Version of the remote processor; see NLPRP list_processors.
7.2.7.12. processor_format
String.
Applicable to: Cloud.
One of: Standard, GATE.
Standard refers primarily to CRATE Python-based remote processors, but would be compatible with any remote processor which returned data in the same format as the CRATE processors. GATE refers to GATE remote processors, which return a standard set of columns.
7.2.8. Config file section: GATE/cloud output definition
These are config file sections named [output:XXX] where XXX is the name of one of your GATE output types, or cloud remote processor table names.
For GATE applications, we need this additional information because CRATE doesn’t automatically know what sort of output they will produce. The tables and SPECIFIC output fields for a given GATE processor are defined here.
For remote cloud processors, this section enables you to rename remote tables to something appropriate locally, and add options (like indexing). Additionally, CRATE may or may not be told exactly by the remote application what tabular structure it is using, but even if the remote application is helpfully informative, you wouldn’t automatically trust remotely provided table names. So this section is still mandatory.
They are referred to by the outputtypemap parameter (q.v.).
7.2.8.1. desttable
String.
Table name in the destination (NLP output) database into which to write results from the GATE/cloud NLP application.
7.2.8.2. renames
Multiline string.
This is an optional “column renaming” section.
For GATE processors: a list of from, to things to rename from the GATE output en route to the database. In each case, the from item is the name of a GATE output annotation. The to item is the destination field/column name.
Also applicable to cloud processors; you can rename columns in this way. The from item is the column name as specified by the remote processor, and the to item the local destination column name.
Specify one pair per line. You can quote, using shlex rules. Case-sensitive.
This example:
renames =
    firstName firstname
renames firstName to firstname.
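To show what “shlex rules” means for names that contain spaces, here is a small sketch using Python's standard shlex module. The helper parse_renames is hypothetical, and the sample lines are only illustrations of the from/to format described above.

```python
import shlex
from typing import Dict


def parse_renames(value: str) -> Dict[str, str]:
    """Turn 'from to' lines into a {from: to} mapping, honouring quotes."""
    renames: Dict[str, str] = {}
    for line in value.splitlines():
        words = shlex.split(line)
        if not words:
            continue  # skip blank lines
        if len(words) != 2:
            raise ValueError(f"Expected a from/to pair, got: {line!r}")
        from_name, to_name = words
        renames[from_name] = to_name  # case is preserved
    return renames


renames = parse_renames('''
    firstName firstname
    drug-type drug_type
    "Length of Time" length_of_time
''')
```

Without the double quotes, the last line would split into four words rather than two, and a parser like this one would reject it.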
A more relevant example, in which the GATE annotation names are clearly not well suited to being database column names:
renames =
    drug-type drug_type
    dose-value dose_value
    dose-unit dose_unit
    dose-multiple dose_multiple
    Directionality directionality
    Experiencer experiencer
    "Length of Time" length_of_time
    Temporality temporality
    "Unit of Time" unit_of_time
7.2.8.3. null_literals
Multiline string.
Applicable to: GATE only.
Define values that will be treated as NULL in SQL. For example, sometimes GATE provides the string null for a NULL value; we can convert to a proper SQL NULL.
The parameter is treated as a sequence of words; shlex quoting rules apply.
Example:
null_literals =
    null
    ""
7.2.8.4. destfields
Multiline string.
Defines the database field (column) types used in the output database. This is how you tell the database how much space to allocate for information that will come out of GATE. Each line is a column_name, sql_type pair (or, optionally, a column_name, sql_type, comment triple). Whitespace is used to separate the columns. Examples:
destfields =
    rule VARCHAR(100)
    firstname VARCHAR(100)
    surname VARCHAR(100)
    gender VARCHAR(7)
    kind VARCHAR(100)
destfields =
    rule VARCHAR(100) Rule used to find this person (e.g. TitleFirstName, PersonFull)
    firstname VARCHAR(100) First name
    surname VARCHAR(100) Surname
    gender VARCHAR(7) Gender (e.g. male, female, unknown)
    kind VARCHAR(100) Kind of name (e.g. personName, fullName)
For cloud applications, this is optional. If you specify any lines here, your table will be created in this way (plus additional universal CRATE NLP columns). If you don’t specify any, it will be created according to the remote table specification (plus additional universal CRATE NLP columns).
7.2.8.5. indexdefs
Multiline string.
Fields to index in the destination table.
Each line is an indexed_field, index_length pair. The index_length should be an integer or None. Example:
indexdefs =
    firstname 64
    surname 64
7.2.9. Config file section: environment variables definition
These are config file sections named [env:XXX] where XXX is the name of one of your environment variable definition blocks.
We define environment variable groups here, with one group per section.
When a section is selected (e.g. by a progenvsection parameter in a GATE NLP processor definition as above), these variables can be substituted into the progargs part of the NLP definition (for when external programs are called) and are available in the operating system environment for those programs themselves.
The environment will start by inheriting the parent environment, then add variables here.
Keys are case-sensitive.
Example:
[env:MY_ENV_SECTION]
GATE_HOME = /home/myuser/somewhere/GATE_Developer_8.0
NLPPROGDIR = /home/myuser/somewhere/crate_anon/nlp_manager/compiled_nlp_classes
MEDEXDIR = /home/myuser/somewhere/Medex_UIMA_1.3.6
KCONNECTDIR = /home/myuser/somewhere/yodie-pipeline-1-2-umls-only
OS_PATHSEP = :
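The “inherit, then add” behaviour can be sketched with Python's standard library as below. The function build_child_environment is hypothetical, and letting the config section override a parent variable of the same name is an assumption of this sketch. Note the optionxform setting: by default, configparser lower-cases keys, which would not suit case-sensitive environment variable names.

```python
import configparser
import os
from typing import Dict, Mapping, Optional

CONFIG_TEXT = """
[env:MY_ENV_SECTION]
GATE_HOME = /home/myuser/somewhere/GATE_Developer_8.0
OS_PATHSEP = :
"""


def build_child_environment(
    config_text: str,
    section: str,
    parent: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Start from the parent environment, then add the section's variables."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep keys exactly as written (case-sensitive)
    parser.read_string(config_text)
    env = dict(os.environ if parent is None else parent)
    env.update(parser[section])  # assumption: section values win on a clash
    return env


child_env = build_child_environment(
    CONFIG_TEXT,
    "env:MY_ENV_SECTION",
    parent={"PATH": "/usr/bin", "GATE_HOME": "/old/gate"},
)
```

A dictionary like child_env could then be passed as the env argument of subprocess.Popen when launching the external program.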
7.2.10. Config file section: database definition
These are config file sections named [database:XXX] where XXX is the name of one of your database definitions.
These simply tell CRATE how to connect to different databases.
7.2.10.1. url
String.
The URL of the database. Use SQLAlchemy URLs: http://docs.sqlalchemy.org/en/latest/core/engines.html.
Example:
[database:MY_SOURCE_DATABASE]
url = mysql+mysqldb://myuser:password@127.0.0.1:3306/anonymous_output_db?charset=utf8
7.2.10.2. echo
Boolean. Default: False.
Optional parameter for debugging. If set to True, all SQL being sent to the database will be logged to the Python console log.
7.2.11. Config file section: cloud NLP configuration
These are config file sections named [cloud:XXX] where XXX is the name of one of your cloud NLP configurations (referred to by the cloud_config parameter in an NLP definition) [4].
7.2.11.1. cloud_url
String. Required to use cloud NLP.
The URL of the cloud NLP service.
7.2.11.2. verify_ssl
Boolean. Default: true.
Should CRATE verify the SSL certificate of the remote NLP server?
7.2.11.3. compress
Boolean. Default: true.
Should CRATE compress messages going to the NLP server, using gzip?
CRATE (via the Python requests library) also always tells the server that it will accept gzip compression back; the server should respond to this by compressing results.
7.2.11.4. username
String. Default: “”.
Your username for accessing the services at the URL specified in cloud_url.
7.2.11.5. password
String. Default: “”.
Your password for accessing the services at the URL specified in cloud_url.
7.2.11.6. wait_on_conn_err
Integer. Default: 180.
After a connection error occurs, wait this many seconds before retrying.
7.2.11.7. max_content_length
Integer. Default: 0.
The maximum size of the packets to be sent. This should be less than or equal to the limit the service allows. Put 0 for no maximum length.
NOTE: if a single record is larger than the maximum packet size, that record will not be sent.
7.2.11.8. max_records_per_request
Integer. Default: 1000.
When sending data: the maximum number of pieces of text that will be sent as part of a single NLPRP request (subject also to max_content_length).
7.2.11.9. limit_before_commit
Integer. Default: 1000.
When receiving results: the number of results that will be processed (and written to the database) before a COMMIT command is executed.
7.2.11.10. stop_at_failure
Boolean. Default: true.
Should CRATE stop when a cloud NLP processing request fails? If true, an error is raised and CRATE aborts on failure; if false, CRATE carries on after a failed request. (Some requests are not allowed to fail, regardless of this setting.)
7.2.11.11. max_tries
Integer. Default: 5.
Maximum number of times to try each HTTP connection to the cloud NLP server, before giving it up as a bad job.
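How max_tries and wait_on_conn_err might combine can be sketched as a simple retry loop. This is an illustrative model, not CRATE's code: call_with_retries is a hypothetical function, and the sleep function is injectable so that the example runs without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    send: Callable[[], T],
    max_tries: int = 5,
    wait_on_conn_err: float = 180,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(), retrying after connection errors, up to max_tries times."""
    for attempt in range(1, max_tries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_tries:
                raise  # give it up as a bad job
            sleep(wait_on_conn_err)
    raise ValueError("max_tries must be at least 1")


# Demonstration with a fake request that fails twice, then succeeds.
attempts: List[int] = []
waits: List[float] = []


def flaky_request() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("server unavailable")
    return "ok"


result = call_with_retries(flaky_request, sleep=waits.append)
```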
7.2.11.12. rate_limit_hz
Integer. Default: 2.
The maximum rate, in Hz (times per second), that the CRATE NLP processor will send requests. Use this to avoid overloading the cloud NLP server. Specify 0 for no limit.
7.2.12. Parallel processing
There are two ways to parallelize CRATE NLP.
You can run multiple NLP processors at the same time, by specifying multiple NLP processors in a single NLP definition within your configuration file.
There can be different sources of bottlenecks. One is if database access is limiting. Specifying multiple NLP processors means that text is fetched once (for a given set of input fields) and then run through multiple NLP processors in one go.
However, GATE apps can take e.g. 1 GB RAM per process, so be careful if trying to run several of those! CRATE’s regular expression parsers use very little RAM (and can go quite fast: e.g. 2 CPUs processing about 15,000 records through 10 regex parsers in about 166 s, or of the order of 1 kHz).
You can run multiple simultaneous copies of CRATE’s NLP manager.
This will divide up the work across the copies (by dividing up the records retrieved from the database).
You can use both strategies simultaneously.
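One simple way to divide records across several copies of a program is by the remainder of the integer PK, as in this sketch. This is an assumption for illustration only; this page does not say how CRATE actually divides the work, and records_for_process is a hypothetical function.

```python
from typing import Iterable, List


def records_for_process(
    pks: Iterable[int], process: int, nprocesses: int
) -> List[int]:
    """Select the PKs that one copy (numbered from 0) would handle."""
    if not 0 <= process < nprocesses:
        raise ValueError("process must be in the range 0..nprocesses-1")
    return [pk for pk in pks if pk % nprocesses == process]


all_pks = list(range(1, 11))
shards = [records_for_process(all_pks, p, 3) for p in range(3)]
```

Every record lands in exactly one shard, so the copies never process the same record twice and nothing is left out.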
7.2.13. Specimen config
A specimen NLP config is available by running crate_nlp --democonfig.
Here’s the specimen NLP config:
# Configuration file for CRATE NLP manager (crate_nlp).
# Version 0.20.6 (2025-01-09).
#
# PLEASE SEE THE HELP at https://crateanon.readthedocs.io/
# Using defaults for Docker environment: False
# =============================================================================
# A. Individual NLP definitions
# =============================================================================
# - referred to by the NLP manager's command-line arguments
# - You are likely to need to alter these (particularly the bits in capital
# letters) to refer to your own database(s).
# -----------------------------------------------------------------------------
# GATE people-and-places demo
# -----------------------------------------------------------------------------
[nlpdef:gate_name_location_demo]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_name_location
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
# -----------------------------------------------------------------------------
# KConnect (Bio-YODIE) disease-finding GATE app
# -----------------------------------------------------------------------------
[nlpdef:gate_kconnect_diseases]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_kconnect
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
# -----------------------------------------------------------------------------
# KCL Lewy body dementia GATE app
# -----------------------------------------------------------------------------
[nlpdef:gate_kcl_lbd]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_kcl_lbda
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
# -----------------------------------------------------------------------------
# KCL pharmacotherapy GATE app
# -----------------------------------------------------------------------------
[nlpdef:gate_kcl_pharmacotherapy]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_pharmacotherapy
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
# -----------------------------------------------------------------------------
# Medex-UIMA medication-finding app
# -----------------------------------------------------------------------------
[nlpdef:medex_medications]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    Medex procdef_medex_medications
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
# -----------------------------------------------------------------------------
# CRATE number-finding Python regexes
# -----------------------------------------------------------------------------
[nlpdef:crate_biomarkers]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    # -------------------------------------------------------------------------
    # Biochemistry
    # -------------------------------------------------------------------------
    Albumin procdef_albumin
    AlbuminValidator procdef_validate_albumin
    AlkPhos procdef_alkphos
    AlkPhosValidator procdef_validate_alkphos
    ALT procdef_alt
    ALTValidator procdef_validate_alt
    Bilirubin procdef_bilirubin
    BilirubinValidator procdef_validate_bilirubin
    Creatinine procdef_creatinine
    CreatinineValidator procdef_validate_creatinine
    Crp procdef_crp
    CrpValidator procdef_validate_crp
    GammaGT procdef_gammagt
    GammaGTValidator procdef_validate_gammagt
    Glucose procdef_glucose
    GlucoseValidator procdef_validate_glucose
    HbA1c procdef_hba1c
    HbA1cValidator procdef_validate_hba1c
    HDLCholesterol procdef_hdlcholesterol
    HDLCholesterolValidator procdef_validate_hdlcholesterol
    LDLCholesterol procdef_ldlcholesterol
    LDLCholesterolValidator procdef_validate_ldlcholesterol
    Lithium procdef_lithium
    LithiumValidator procdef_validate_lithium
    Potassium procdef_potassium
    PotassiumValidator procdef_validate_potassium
    Sodium procdef_sodium
    SodiumValidator procdef_validate_sodium
    TotalCholesterol procdef_totalcholesterol
    TotalCholesterolValidator procdef_validate_totalcholesterol
    Triglycerides procdef_triglycerides
    TriglyceridesValidator procdef_validate_triglycerides
    Tsh procdef_tsh
    TshValidator procdef_validate_tsh
    Urea procdef_urea
    UreaValidator procdef_validate_urea
    # -------------------------------------------------------------------------
    # Clinical
    # -------------------------------------------------------------------------
    Bmi procdef_bmi
    BmiValidator procdef_validate_bmi
    Bp procdef_bp
    BpValidator procdef_validate_bp
    Height procdef_height
    HeightValidator procdef_validate_height
    Weight procdef_weight
    WeightValidator procdef_validate_weight
    # -------------------------------------------------------------------------
    # Cognitive
    # -------------------------------------------------------------------------
    Ace procdef_ace
    AceValidator procdef_validate_ace
    MiniAce procdef_miniace
    MiniAceValidator procdef_validate_miniace
    Mmse procdef_mmse
    MmseValidator procdef_validate_mmse
    Moca procdef_moca
    MocaValidator procdef_validate_moca
    # -------------------------------------------------------------------------
    # Haematology
    # -------------------------------------------------------------------------
    Basophils procdef_basophils
    BasophilsValidator procdef_validate_basophils
    Eosinophils procdef_eosinophils
    EosinophilsValidator procdef_validate_eosinophils
    Esr procdef_esr
    EsrValidator procdef_validate_esr
    Haematocrit procdef_haematocrit
    HaematocritValidator procdef_validate_haematocrit
    Haemoglobin procdef_haemoglobin
    HaemoglobinValidator procdef_validate_haemoglobin
    Lymphocytes procdef_lymphocytes
    LymphocytesValidator procdef_validate_lymphocytes
    Monocytes procdef_monocytes
    MonocytesValidator procdef_validate_monocytes
    Neutrophils procdef_neutrophils
    NeutrophilsValidator procdef_validate_neutrophils
    Platelets procdef_platelets
    PlateletsValidator procdef_validate_platelets
    RBC procdef_rbc
    RBCValidator procdef_validate_rbc
    Wbc procdef_wbc
    WbcValidator procdef_validate_wbc
# -------------------------------------------------------------------------
# Substance misuse
# -------------------------------------------------------------------------
AlcoholUnits procdef_alcoholunits
AlcoholUnitsValidator procdef_validate_alcoholunits
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
# truncate_text_at = 32766
# record_truncated_values = False
max_rows_before_commit = 1000
max_bytes_before_commit = 83886080
# -----------------------------------------------------------------------------
# Cloud NLP demo
# -----------------------------------------------------------------------------
[nlpdef:cloud_nlp_demo]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    Cloud procdef_cloud_crp
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
cloud_config = my_uk_cloud_service
cloud_request_data_dir = /srv/crate/clouddata
# =============================================================================
# B. NLP processor definitions
# =============================================================================
# - You're likely to have to modify the destination databases these point to,
# but otherwise you can probably leave them as they are.
# -----------------------------------------------------------------------------
# Specimen CRATE regular expression processor definitions
# -----------------------------------------------------------------------------
# Most of these are very simple, and just require a destination database
# (as a cross-reference to a database section within this file) and a
# destination table.
# Biochemistry
[processor:procdef_albumin]
destdb = DESTINATION_DATABASE
desttable = albumin
[processor:procdef_validate_albumin]
destdb = DESTINATION_DATABASE
desttable = validate_albumin
[processor:procdef_alkphos]
destdb = DESTINATION_DATABASE
desttable = alkphos
[processor:procdef_validate_alkphos]
destdb = DESTINATION_DATABASE
desttable = validate_alkphos
[processor:procdef_alt]
destdb = DESTINATION_DATABASE
desttable = alt
[processor:procdef_validate_alt]
destdb = DESTINATION_DATABASE
desttable = validate_alt
[processor:procdef_bilirubin]
destdb = DESTINATION_DATABASE
desttable = bilirubin
[processor:procdef_validate_bilirubin]
destdb = DESTINATION_DATABASE
desttable = validate_bilirubin
[processor:procdef_creatinine]
destdb = DESTINATION_DATABASE
desttable = creatinine
[processor:procdef_validate_creatinine]
destdb = DESTINATION_DATABASE
desttable = validate_creatinine
[processor:procdef_crp]
destdb = DESTINATION_DATABASE
desttable = crp
[processor:procdef_validate_crp]
destdb = DESTINATION_DATABASE
desttable = validate_crp
[processor:procdef_gammagt]
destdb = DESTINATION_DATABASE
desttable = gammagt
[processor:procdef_validate_gammagt]
destdb = DESTINATION_DATABASE
desttable = validate_gammagt
[processor:procdef_glucose]
destdb = DESTINATION_DATABASE
desttable = glucose
[processor:procdef_validate_glucose]
destdb = DESTINATION_DATABASE
desttable = validate_glucose
[processor:procdef_hba1c]
destdb = DESTINATION_DATABASE
desttable = hba1c
[processor:procdef_validate_hba1c]
destdb = DESTINATION_DATABASE
desttable = validate_hba1c
[processor:procdef_hdlcholesterol]
destdb = DESTINATION_DATABASE
desttable = hdlcholesterol
[processor:procdef_validate_hdlcholesterol]
destdb = DESTINATION_DATABASE
desttable = validate_hdlcholesterol
[processor:procdef_ldlcholesterol]
destdb = DESTINATION_DATABASE
desttable = ldlcholesterol
[processor:procdef_validate_ldlcholesterol]
destdb = DESTINATION_DATABASE
desttable = validate_ldlcholesterol
[processor:procdef_lithium]
destdb = DESTINATION_DATABASE
desttable = lithium
[processor:procdef_validate_lithium]
destdb = DESTINATION_DATABASE
desttable = validate_lithium
[processor:procdef_potassium]
destdb = DESTINATION_DATABASE
desttable = potassium
[processor:procdef_validate_potassium]
destdb = DESTINATION_DATABASE
desttable = validate_potassium
[processor:procdef_sodium]
destdb = DESTINATION_DATABASE
desttable = sodium
[processor:procdef_validate_sodium]
destdb = DESTINATION_DATABASE
desttable = validate_sodium
[processor:procdef_totalcholesterol]
destdb = DESTINATION_DATABASE
desttable = totalcholesterol
[processor:procdef_validate_totalcholesterol]
destdb = DESTINATION_DATABASE
desttable = validate_totalcholesterol
[processor:procdef_triglycerides]
destdb = DESTINATION_DATABASE
desttable = triglycerides
[processor:procdef_validate_triglycerides]
destdb = DESTINATION_DATABASE
desttable = validate_triglycerides
[processor:procdef_tsh]
destdb = DESTINATION_DATABASE
desttable = tsh
[processor:procdef_validate_tsh]
destdb = DESTINATION_DATABASE
desttable = validate_tsh
[processor:procdef_urea]
destdb = DESTINATION_DATABASE
desttable = urea
[processor:procdef_validate_urea]
destdb = DESTINATION_DATABASE
desttable = validate_urea
# Clinical
[processor:procdef_bmi]
destdb = DESTINATION_DATABASE
desttable = bmi
[processor:procdef_validate_bmi]
destdb = DESTINATION_DATABASE
desttable = validate_bmi
[processor:procdef_bp]
destdb = DESTINATION_DATABASE
desttable = bp
[processor:procdef_validate_bp]
destdb = DESTINATION_DATABASE
desttable = validate_bp
[processor:procdef_height]
destdb = DESTINATION_DATABASE
desttable = height
[processor:procdef_validate_height]
destdb = DESTINATION_DATABASE
desttable = validate_height
[processor:procdef_weight]
destdb = DESTINATION_DATABASE
desttable = weight
[processor:procdef_validate_weight]
destdb = DESTINATION_DATABASE
desttable = validate_weight
# Cognitive
[processor:procdef_ace]
destdb = DESTINATION_DATABASE
desttable = ace
[processor:procdef_validate_ace]
destdb = DESTINATION_DATABASE
desttable = validate_ace
[processor:procdef_miniace]
destdb = DESTINATION_DATABASE
desttable = miniace
[processor:procdef_validate_miniace]
destdb = DESTINATION_DATABASE
desttable = validate_miniace
[processor:procdef_mmse]
destdb = DESTINATION_DATABASE
desttable = mmse
[processor:procdef_validate_mmse]
destdb = DESTINATION_DATABASE
desttable = validate_mmse
[processor:procdef_moca]
destdb = DESTINATION_DATABASE
desttable = moca
[processor:procdef_validate_moca]
destdb = DESTINATION_DATABASE
desttable = validate_moca
# Haematology
[processor:procdef_basophils]
destdb = DESTINATION_DATABASE
desttable = basophils
[processor:procdef_validate_basophils]
destdb = DESTINATION_DATABASE
desttable = validate_basophils
[processor:procdef_eosinophils]
destdb = DESTINATION_DATABASE
desttable = eosinophils
[processor:procdef_validate_eosinophils]
destdb = DESTINATION_DATABASE
desttable = validate_eosinophils
[processor:procdef_esr]
destdb = DESTINATION_DATABASE
desttable = esr
[processor:procdef_validate_esr]
destdb = DESTINATION_DATABASE
desttable = validate_esr
[processor:procdef_haematocrit]
destdb = DESTINATION_DATABASE
desttable = haematocrit
[processor:procdef_validate_haematocrit]
destdb = DESTINATION_DATABASE
desttable = validate_haematocrit
[processor:procdef_haemoglobin]
destdb = DESTINATION_DATABASE
desttable = haemoglobin
[processor:procdef_validate_haemoglobin]
destdb = DESTINATION_DATABASE
desttable = validate_haemoglobin
[processor:procdef_lymphocytes]
destdb = DESTINATION_DATABASE
desttable = lymphocytes
[processor:procdef_validate_lymphocytes]
destdb = DESTINATION_DATABASE
desttable = validate_lymphocytes
[processor:procdef_monocytes]
destdb = DESTINATION_DATABASE
desttable = monocytes
[processor:procdef_validate_monocytes]
destdb = DESTINATION_DATABASE
desttable = validate_monocytes
[processor:procdef_neutrophils]
destdb = DESTINATION_DATABASE
desttable = neutrophils
[processor:procdef_validate_neutrophils]
destdb = DESTINATION_DATABASE
desttable = validate_neutrophils
[processor:procdef_platelets]
destdb = DESTINATION_DATABASE
desttable = platelets
[processor:procdef_validate_platelets]
destdb = DESTINATION_DATABASE
desttable = validate_platelets
[processor:procdef_rbc]
destdb = DESTINATION_DATABASE
desttable = rbc
[processor:procdef_validate_rbc]
destdb = DESTINATION_DATABASE
desttable = validate_rbc
[processor:procdef_wbc]
destdb = DESTINATION_DATABASE
desttable = wbc
[processor:procdef_validate_wbc]
destdb = DESTINATION_DATABASE
desttable = validate_wbc
# Substance misuse
[processor:procdef_alcoholunits]
destdb = DESTINATION_DATABASE
desttable = alcoholunits
[processor:procdef_validate_alcoholunits]
destdb = DESTINATION_DATABASE
desttable = validate_alcoholunits
# -----------------------------------------------------------------------------
# Specimen GATE demo people/places processor definition
# -----------------------------------------------------------------------------
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[processor:procdef_gate_name_location]
destdb = DESTINATION_DATABASE
outputtypemap =
    Person output_person
    Location output_location
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{GATE_HOME}/plugins/ANNIE/ANNIE_with_defaults.gapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --annotation Person
    --annotation Location
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the output tables used by this GATE processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[output:output_person]
desttable = person
renames =
    firstName firstname
destfields =
    rule VARCHAR(100) Rule used to find this person (e.g. TitleFirstName, PersonFull)
    firstname VARCHAR(100) First name
    surname VARCHAR(100) Surname
    gender VARCHAR(7) Gender (e.g. male, female, unknown)
    kind VARCHAR(100) Kind of name (e.g. personName, fullName)
    # ... longest gender: "unknown" (7)
indexdefs =
    firstname 64
    surname 64
[output:output_location]
desttable = location
renames =
    locType loctype
destfields =
    rule VARCHAR(100) Rule used (e.g. Location1)
    loctype VARCHAR(100) Location type (e.g. city)
indexdefs =
    rule 100
    loctype 100
# -----------------------------------------------------------------------------
# Specimen Sheffield/KCL KConnect (Bio-YODIE) processor definition
# -----------------------------------------------------------------------------
# https://gate.ac.uk/applications/bio-yodie.html
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[processor:procdef_gate_kconnect]
destdb = DESTINATION_DATABASE
outputtypemap =
    Disease_or_Syndrome output_disease_or_syndrome
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{KCONNECTDIR}/main-bio/main-bio.xgapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --annotation Disease_or_Syndrome
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the output tables used by this GATE processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[output:output_disease_or_syndrome]
desttable = kconnect_diseases
renames =
    Experiencer experiencer
    Negation negation
    PREF pref
    STY sty
    TUI tui
    Temporality temporality
    VOCABS vocabs
destfields =
    # Found by manual inspection of KConnect/Bio-YODIE output from the GATE console:
    experiencer VARCHAR(100) Who experienced it; e.g. "Patient", "Other"
    negation VARCHAR(100) Was it negated or not; e.g. "Affirmed", "Negated"
    pref VARCHAR(100) PREFerred name; e.g. "Rheumatic gout"
    sty VARCHAR(100) Semantic Type (STY) [semantic type name]; e.g. "Disease or Syndrome"
    tui VARCHAR(4) Type Unique Identifier (TUI) [semantic type identifier]; 4 characters; https://www.ncbi.nlm.nih.gov/books/NBK9679/; e.g. "T047"
    temporality VARCHAR(100) Occurrence in time; e.g. "Recent", "historical", "hypothetical"
    vocabs VARCHAR(255) List of UMLS vocabularies; e.g. "AIR,MSH,NDFRT,MEDLINEPLUS,NCI,LNC,NCI_FDA,NCI,MTH,AIR,ICD9CM,LNC,SNOMEDCT_US,LCH_NW,HPO,SNOMEDCT_US,ICD9CM,SNOMEDCT_US,COSTAR,CST,DXP,QMR,OMIM,OMIM,AOD,CSP,NCI_NCI-GLOSS,CHV"
    inst VARCHAR(8) Looks like a Concept Unique Identifier (CUI); 1 letter then 7 digits; e.g. "C0003873"
    inst_full VARCHAR(255) Looks like a URL to a CUI; e.g. "http://linkedlifedata.com/resource/umls/id/C0003873"
    language VARCHAR(100) Language; e.g. ""; ?will look like "ENG" for English? See https://www.nlm.nih.gov/research/umls/implementation_resources/query_diagrams/er1.html
    tui_full VARCHAR(255) TUI (?); e.g. "http://linkedlifedata.com/resource/semanticnetwork/id/T047"
indexdefs =
    pref 100
    sty 100
    tui 4
    inst 8
# -----------------------------------------------------------------------------
# Specimen KCL GATE pharmacotherapy processor definition
# -----------------------------------------------------------------------------
# https://github.com/KHP-Informatics/brc-gate-pharmacotherapy
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[processor:procdef_gate_pharmacotherapy]
destdb = DESTINATION_DATABASE
outputtypemap =
    Prescription output_prescription
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{GATE_PHARMACOTHERAPY_DIR}/application.xgapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --include_set Output
    --annotation Prescription
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --show_contents_on_crash
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the output tables used by this GATE processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[output:output_prescription]
desttable = medications_gate
renames =
    drug-type drug_type
    dose-value dose_value
    dose-unit dose_unit
    dose-multiple dose_multiple
    Directionality directionality
    Experiencer experiencer
    "Length of Time" length_of_time
    Temporality temporality
    "Unit of Time" unit_of_time
null_literals =
    null
    ""
destfields =
    # Found by (a) manual inspection of BRC GATE pharmacotherapy output from
    # the GATE console; (b) inspection of
    # application-resources/schemas/Prescription.xml
    # Note preference for DECIMAL over FLOAT/REAL; see
    # https://stackoverflow.com/questions/1056323
    # Note that not all annotations appear for all texts. Try e.g.:
    # Please start haloperidol 5mg tds.
    # I suggest you start haloperidol 5mg tds for one week.
    rule VARCHAR(100) Rule yielding this drug. Not in XML but is present in a subset: e.g. "weanOff"; max length unclear
    drug VARCHAR(200) Drug name. Required string; e.g. "haloperidol"; max length 47 from "wc -L BNF_generic.lst", 134 from BNF_trade.lst
    drug_type VARCHAR(100) Type of drug name. Required string; from "drug-type"; e.g. "BNF_generic"; ?length of longest drug ".lst" filename
    dose VARCHAR(100) Dose text. Required string; e.g. "5mg"; max length unclear
    dose_value DECIMAL Numerical dose value. Required numeric; from "dose-value"; "double" in the XML but DECIMAL probably better; e.g. 5.0
    dose_unit VARCHAR(100) Text of dose units. Required string; from "dose-unit"; e.g. "mg"; max length unclear
    dose_multiple INT Dose count (multiple). Required integer; from "dose-multiple"; e.g. 1
    route VARCHAR(7) Route of administration. Required string; one of: "oral", "im", "iv", "rectal", "sc", "dermal", "unknown"
    status VARCHAR(10) Change in drug status. Required; one of: "start", "continuing", "stop"
    tense VARCHAR(7) Tense in which drug is referred to. Required; one of: "past", "present"
    date VARCHAR(100) ?. Optional string; max length unclear
    directionality VARCHAR(100) ?. Optional string; max length unclear
    experiencer VARCHAR(100) Person experiencing the drug-related event. Optional string; e.g. "Patient"
    frequency DECIMAL Frequency (times per <time_unit>). Optional numeric; "double" in the XML but DECIMAL probably better
    interval DECIMAL The n in "every n <time_unit>s" (1 for "every <time_unit>"). Optional numeric; "double" in the XML but DECIMAL probably better
    length_of_time VARCHAR(100) ?. Optional string; from "Length of Time"; max length unclear
    temporality VARCHAR(100) ?. Optional string; e.g. "Recent", "Historical"
    time_unit VARCHAR(100) Unit of time (see frequency, interval). Optional string; from "time-unit"; e.g. "day"; max length unclear
    unit_of_time VARCHAR(100) ?. Optional string; from "Unit of Time"; max length unclear
    when VARCHAR(100) ?. Optional string; max length unclear
indexdefs =
    rule 100
    drug 200
    route 7
    status 10
    tense 7
# -----------------------------------------------------------------------------
# Specimen KCL Lewy Body Diagnosis Application (LBDA) processor definition
# -----------------------------------------------------------------------------
# https://github.com/KHP-Informatics/brc-gate-LBD
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[processor:procdef_gate_kcl_lbda]
# "cDiagnosis" is the "confirmed diagnosis" field, as d/w Jyoti Jyoti
# 2018-03-20; see also README.md. This appears in the "Automatic" and the
# unnamed set. There is also a near-miss one, "DiagnosisAlmost", which
# appears in the unnamed set.
# "Mr Jones has Lewy body dementia."
# -> DiagnosisAlmost
# "Mr Jones has a diagnosis of Lewy body dementia."
# -> DiagnosisAlmost, cDiagnosis
# Note that we must use lower case in the outputtypemap.
destdb = DESTINATION_DATABASE
outputtypemap =
    cDiagnosis output_lbd_diagnosis
    DiagnosisAlmost output_lbd_diagnosis
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{KCL_LBDA_DIR}/application.xgapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --set_annotation "" DiagnosisAlmost
    --set_annotation Automatic cDiagnosis
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Define the output tables used by this GATE processor
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[output:output_lbd_diagnosis]
desttable = lewy_body_dementia_gate
null_literals =
    null
    ""
destfields =
    # Found by
    # (a) manual inspection of output from the GATE Developer console:
    # - e.g. {rule=Includefin, text=Lewy body dementia}
    # (b) inspection of contents:
    # - run a Cygwin shell
    # - find . -type f -exec grep cDiagnosis -l {} \;
    # - 3 hits:
    # ./application-resources/jape/DiagnosisExclude2.jape
    # ... part of the "Lewy"-detection apparatus
    # ./application-resources/jape/text-feature.jape
    # ... adds "text" annotation to cDiagnosis Token
    # ./application.xgapp
    # ... in annotationTypes
    # On that basis:
    rule VARCHAR(100) Rule that generated the hit.
    text VARCHAR(200) Text that matched the rule.
indexdefs =
    rule 100
    text 200
# -----------------------------------------------------------------------------
# Specimen MedEx processor definition
# -----------------------------------------------------------------------------
# https://sbmi.uth.edu/ccb/resources/medex.htm
[processor:procdef_medex_medications]
destdb = DESTINATION_DATABASE
desttable = medications_medex
progargs =
    java
    -classpath {NLPPROGDIR}:{MEDEXDIR}/bin:{MEDEXDIR}/lib/*
    -Dfile.encoding=UTF-8
    CrateMedexPipeline
    -lt {NLPLOGTAG}
    -v -v
    # ... other arguments are added by the code
progenvsection = MY_ENV_SECTION
# =============================================================================
# C. Environment variable definitions
# =============================================================================
# - You'll need to modify this according to your local configuration.
[env:MY_ENV_SECTION]
GATE_HOME = /path/to/GATE_Developer_9.0.1
GATE_PHARMACOTHERAPY_DIR = /path/to/brc-gate-pharmacotherapy
GATE_PLUGIN_FILE = /path/to/specimen_gate_plugin_file.ini
KCL_LBDA_DIR = /path/to/brc-gate-LBD/Lewy_Body_Diagnosis
KCONNECTDIR = /path/to/yodie-pipeline-1-2-umls-only
MEDEXDIR = /path/to/Medex_UIMA_1.3.6
NLPPROGDIR = /path/to/crate_anon/nlp_manager/compiled_nlp_classes
OS_PATHSEP = :
# =============================================================================
# D. Input field definitions
# =============================================================================
[input:INPUT_FIELD_CLINICAL_DOCUMENTS]
srcdb = SOURCE_DATABASE
srctable = EXTRACTED_CLINICAL_DOCUMENTS
srcpkfield = DOCUMENT_PK
srcfield = DOCUMENT_TEXT
srcdatetimefield = DOCUMENT_DATE
copyfields =
    RID_FIELD
    TRID_FIELD
indexed_copyfields =
    RID_FIELD
    TRID_FIELD
# debug_row_limit = 0
[input:INPUT_FIELD_PROGRESS_NOTES]
srcdb = SOURCE_DATABASE
srctable = PROGRESS_NOTES
srcpkfield = PN_PK
srcfield = PN_TEXT
srcdatetimefield = PN_DATE
copyfields =
    RID_FIELD
    TRID_FIELD
indexed_copyfields =
    RID_FIELD
    TRID_FIELD
# =============================================================================
# E. Database definitions, each in its own section
# =============================================================================
[database:SOURCE_DATABASE]
url = mysql+mysqldb://anontest:XXX@127.0.0.1:3306/anonymous_output?charset=utf8
[database:DESTINATION_DATABASE]
url = mysql+mysqldb://anontest:XXX@127.0.0.1:3306/anonymous_output?charset=utf8
# =============================================================================
# F. Information for using cloud-based NLP
# =============================================================================
[cloud:my_uk_cloud_service]
cloud_url = https://your_url
username = your_username
password = your_password
wait_on_conn_err = 180
max_content_length = 0
max_records_per_request = 1000
limit_before_commit = 1000
stop_at_failure = true
max_tries = 5
rate_limit_hz = 2
[processor:procdef_cloud_crp]
destdb = DESTINATION_DATABASE
desttable = crp_test
processor_name = crate_anon.nlp_manager.parse_biochemistry.Crp
processor_format = Standard
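
To make the cross-references in the specimen concrete, here is a minimal Python sketch (not CRATE's own loader, whose internals may differ) of how an NLP definition's `processors` list and a processor's `progargs` could be read with the standard `configparser` module. The helper names `load_config`, `processors_for` and `expand_progargs` are hypothetical, and the trimmed-down config and its paths are illustrative only. `NLPLOGTAG` is passed in by hand on the assumption that it is supplied at run time, since the specimen's environment section does not define it. Splitting the expanded string with `shlex` is likewise an assumption.

```python
import configparser
import shlex
from typing import List, Tuple

# A cut-down config in the same style as the specimen (illustrative values).
SPECIMEN = """
[nlpdef:crate_biomarkers]
inputfielddefs =
    INPUT_FIELD_PROGRESS_NOTES
processors =
    Crp procdef_crp
    CrpValidator procdef_validate_crp

[processor:procdef_gate_name_location]
destdb = DESTINATION_DATABASE
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --log_tag {NLPLOGTAG}
progenvsection = MY_ENV_SECTION

[env:MY_ENV_SECTION]
GATE_HOME = /path/to/GATE
NLPPROGDIR = /path/to/nlp_classes
OS_PATHSEP = :
"""


def load_config(text: str) -> configparser.ConfigParser:
    """Parse config text, keeping option names case-sensitive."""
    parser = configparser.ConfigParser(interpolation=None)
    # By default configparser lower-cases keys; GATE_HOME etc. must survive.
    parser.optionxform = str
    parser.read_string(text)
    return parser


def processors_for(parser: configparser.ConfigParser,
                   nlpdef: str) -> List[Tuple[str, str]]:
    """Return (processor type, processor section name) pairs for an nlpdef."""
    pairs = []
    for line in parser["nlpdef:" + nlpdef]["processors"].splitlines():
        line = line.strip()
        if line:
            proctype, procdef = line.split(None, 1)
            pairs.append((proctype, procdef))
    return pairs


def expand_progargs(parser: configparser.ConfigParser,
                    procdef: str, **extra: str) -> List[str]:
    """Substitute {PLACEHOLDERS} in progargs and split into an argument list."""
    section = parser["processor:" + procdef]
    env = dict(parser["env:" + section["progenvsection"]])
    env.update(extra)  # values not defined in the env section
    return shlex.split(section["progargs"].format(**env))


if __name__ == "__main__":
    cfg = load_config(SPECIMEN)
    print(processors_for(cfg, "crate_biomarkers"))
    print(expand_progargs(cfg, "procdef_gate_name_location", NLPLOGTAG="demo"))
```

The sketch relies on continuation lines being indented, which is what lets `configparser` treat `processors` and `progargs` as single multi-line values.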
Footnotes