7.2. NLP config file

7.2.1. Overview

The CRATE NLP config file controls the behaviour of the NLP manager. It defines source and destination databases, and one or more NLP definitions.

You can generate a specimen config file with

crate_nlp --democonfig > test_nlp_config.ini

You should save this, then edit it to your own needs.

In particular, make it point to any external resources you need, such as your GATE installation if you want to use GATE NLP.

For convenience, you may want the CRATE_NLP_CONFIG environment variable to point to this file. (Otherwise you must specify it each time.)

7.2.2. Detail

The config file describes NLP definitions, such as ‘people_and_places’, ‘drugs_and_doses’, or ‘haematology_white_cell_differential’. You choose the names of these NLP definitions as you define them. You select one when you run the NLP manager (using the --nlpdef argument) 1.

The NLP definition sets out the following things.

  • “Where am I going to find my source text?” You specify this by giving one or more inputfielddefs. Each one of those specifies (via its own config section) a database/table/field combination, such as the ‘Notes’ field of the ‘Progress Notes’ table in the ‘RiO’ database, or the ‘text_content’ field of the ‘Clinical_Documents’ table in the ‘CDL’ database, or some such 2.

    • CRATE will always store minimal source information (database, table, integer PK) with the NLP output. However, for convenience you are also likely to want to copy some other key information over to the output, such as patients’ research identifiers (RIDs). You can specify these via the copyfields option to the input field definitions. For validation purposes, you might even choose to copy the full source text (just for convenience), but it’s unlikely you’d want to do this routinely (because it wastes space).

  • “Which NLP processor will run, and where will it store its output?” This might be an external GATE program specializing in finding drug names, or one of CRATE’s built-in regular expression (regex) parsers, such as for inflammatory markers from blood tests. The choice of NLP processor also determines the fields that will appear in the output; for example, a drug-detecting NLP program might provide fields such as ‘drug’, ‘dose’, ‘units’, and ‘route’, while a white-cell differential processor might provide output such as ‘cell type’, ‘value_in_billion_per_litre’, and so on 3.

    • In fact, GATE applications can simultaneously provide more than one type of output; for example, GATE’s demonstration people-and-places application yields both ‘person’ information (rule, firstname, surname, gender…) and ‘location’ information (rule, loctype…), and it can be computationally more efficient to run them together. Therefore, CRATE supports multiple types of output from ‘single’ NLP processors.

    • Each NLP processor may have its own set of options. For example, the GATE controller requires information about the specific external GATE app to run, and about any necessary environment variables. Others, such as CRATE’s built-in regular expression parsers, are simpler.

    • You might want all of your “drugs and doses” information to be stored in a single table (such that drugs found in your Progress_Notes and drugs found in your Clinical_Documents get stored together); this would be common and sensible (and CRATE will keep a record of where the information came from). However, it’s possible that you might want to segregate them (e.g. having C-reactive protein information extracted from your Progress_Notes stored in a different table to C-reactive protein information extracted from your High_Sensitivity_CRP_Notes_For_Bobs_Project table).

    • For GATE apps that provide more than one type of output structure, you will need to specify more than one output table.

    • You can batch different NLP processors together. For example, the demo config batches up CRATE’s internal regular expression NLP processors together. This is more efficient, because one record fetched from the source database can then be sent to multiple NLP processors. However, it’s less helpful if you are developing new NLP tools and want to be able to re-run just one NLP tool frequently.

All NLP configuration ‘chunks’ are sections within the NLP config file, which is in standard .INI format. For example, an input field definition is a section; a database definition is a section; an environment variable section for use with external programs is a section; and so on.

To allow incremental updates, CRATE keeps a master progress table. It stores a reference to the source information (database, table, PK), a hash of the source information (to work out later whether the source has changed), the date/time when the NLP was last run, and the name of the NLP definition that was run.
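The incremental check described above can be sketched roughly as follows. This is an illustration only, not CRATE’s actual implementation: the in-memory dictionary standing in for the progress table, the function names, and the choice of HMAC-SHA256 as the hash are all assumptions.

```python
import hashlib
import hmac
from datetime import datetime, timezone

# Hypothetical stand-in for the progress table, keyed by
# (NLP definition name, source database, source table, source PK).
progress = {}


def source_hash(hashphrase: str, text: str) -> str:
    """Keyed hash of the source text (HMAC-SHA256 is an assumption)."""
    return hmac.new(
        hashphrase.encode("utf-8"), text.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def needs_processing(nlpdef, db, table, pk, text, hashphrase="doesnotmatter"):
    """Return True if the record is new or its text has changed."""
    key = (nlpdef, db, table, pk)
    new_hash = source_hash(hashphrase, text)
    previous = progress.get(key)
    if previous is not None and previous["hash"] == new_hash:
        return False  # unchanged since the last run: skip it
    # Record the hash and the time of this run.
    progress[key] = {"hash": new_hash, "when": datetime.now(timezone.utc)}
    return True
```

Calling needs_processing() twice with the same text returns True and then False; changing the text makes it return True again.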

It’s definitely better if your source table has integer PKs, but you might not have a choice in the matter (and be unable to add one to a read-only source database), so CRATE also supports string PKs. In this instance it will create an integer by hashing the string and store that along with the string PK itself. (That integer is not guaranteed to be unique, because of hash collisions 5, but it allows some efficiency to be added.)
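As a rough illustration of deriving an integer from a string PK, here is a minimal sketch. The function name and the use of SHA-256 truncated to 64 bits are assumptions for the example; CRATE’s own hash function may differ.

```python
import hashlib


def pk_string_to_int(pk: str) -> int:
    """Map a string PK to a signed 64-bit integer.

    Different strings can map to the same integer (a hash collision),
    so the original string PK must be stored alongside it.
    """
    digest = hashlib.sha256(pk.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result fits a BIGINT column.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

The integer can be indexed cheaply for lookups, with the string PK used as the final check.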

7.2.3. Testing your NLP definitions

You can test any NLP definition that you create. See testing NLP.

7.2.4. Format of the configuration file

  • The config file is in standard INI file format.

  • UTF-8 encoding. Use this! The file is explicitly opened in UTF-8 mode.

  • Comments. Hashes (#) and semicolons (;) denote comments.

  • Sections. Sections are indicated with: [section]

  • Name/value (key/value) pairs. The parser used is ConfigParser. It allows name=value or name:value.

  • Avoid indentation of parameters. (Indentation is used to indicate the continuation of previous parameters.)

  • Parameter types, referred to below, are:

    • String. Single-line strings are simple.

    • Multiline string. Here, a series of lines is read and split into a list of strings (one for each line). You should indent all lines except the first beyond the level of the parameter name, and then they will be treated as one parameter value.

    • Integer. Simple.

    • Boolean. For Boolean options, true values are any of: 1, yes, true, on (case-insensitive). False values are any of: 0, no, false, off.
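The format rules above can be tried out with Python’s standard configparser module. This is a minimal sketch: the section and parameter names are taken from the specimen config, but the way the multiline value is split into a list is an assumption about how you might post-process it, not a description of CRATE’s code.

```python
import configparser

SPECIMEN = """
# A hash comment
; A semicolon comment
[nlpdef:demo]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
hashphrase: doesnotmatter
max_rows_before_commit = 1000
record_truncated_values = Yes
"""

parser = configparser.ConfigParser()
parser.read_string(SPECIMEN)
section = parser["nlpdef:demo"]

# Multiline string: indented continuation lines form one value, which we
# split into one string per non-blank line.
inputfielddefs = [
    line.strip()
    for line in section["inputfielddefs"].splitlines()
    if line.strip()
]
hashphrase = section["hashphrase"]  # name:value form
max_rows = section.getint("max_rows_before_commit")
record_truncated = section.getboolean("record_truncated_values")
```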

7.2.5. Config file section: NLP definition

These are config file sections named [nlpdef:XXX] where XXX is the name of one of your NLP definitions.

These are the “top-level” configuration sections, referred to when you launch CRATE’s NLP tools from the command line. Start here.

These config sections map inputs (from your database) to processors and a progress-tracking database, and give names to those mappings.

7.2.5.1. inputfielddefs

Multiline string.

List of input fields to parse. Each is the name of an input field definition in the config file.

Input to the NLP processor(s) comes from one or more source fields (columns), each within a table within a database. Each item in this list refers to a config section that defines that field in more detail.

7.2.5.2. processors

Multiline string.

Which NLP processors shall we use?

Specify these as a list of processor_type, processor_config_section pairs. For example, one might be:

GATE mygateproc_name_location

and CRATE would then look for a processor definition in a config file section named [processor:mygateproc_name_location], and expect it to have the information required for a GATE processor.

For possible processor types, see crate_nlp --listprocessors.
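To make the mapping from a processors entry to a config section concrete, here is a small sketch. The function name parse_processors is hypothetical, and skipping lines that start with # is an assumption based on the comment lines in the specimen config.

```python
def parse_processors(value: str):
    """Turn a multiline 'processors' value into (type, section name) pairs."""
    pairs = []
    for line in value.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        processor_type, config_name = line.split()
        pairs.append((processor_type, f"processor:{config_name}"))
    return pairs


pairs = parse_processors("""
    GATE mygateproc_name_location
    # a comment line
    Crp procdef_crp
""")
```

Here the first pair names the GATE processor type and the section [processor:mygateproc_name_location].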

7.2.5.3. progressdb

String.

Secret progress database; the name of a database definition in the config file.

To allow incremental updates, information is stored in a progress table. The database name is a cross-reference to another section in this config file. The table name within this database is hard-coded to crate_nlp_progress.

7.2.5.4. hashphrase

String.

You should insert a hash phrase of your own here. However, it’s not especially secret (it’s only used for change detection and users are likely to have access to the source material anyway), and its specific value is unimportant.

7.2.5.5. temporary_tablename

String. Default: _crate_nlp_temptable.

Temporary table name to use (in progress and destination databases).

7.2.5.6. max_rows_before_commit

Integer. Default: 1000.

Specify the maximum number of rows to be processed before a COMMIT is issued on the database transaction(s). This prevents the transaction(s) growing too large.

7.2.5.7. max_bytes_before_commit

Integer. Default: 80 MiB (80 * 1024 * 1024 = 83886080).

Specify the maximum number of source-record bytes (approximately!) that are processed before a COMMIT is issued on the database transaction(s). This prevents the transaction(s) growing too large. The COMMIT is issued once this limit has been met or exceeded, so the cumulative total may overshoot the limit by up to the size of the record that crosses it.
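The interaction of max_rows_before_commit and max_bytes_before_commit can be sketched like this. The function commit_points is hypothetical and only models the counting; it does not touch a database.

```python
def commit_points(record_sizes, max_rows=1000, max_bytes=80 * 1024 * 1024):
    """Yield the index of each record after which a COMMIT would be issued.

    A COMMIT happens once either limit has been met or exceeded, so the
    byte total can overshoot max_bytes by part of the last record.
    """
    rows = 0
    nbytes = 0
    for index, size in enumerate(record_sizes):
        rows += 1
        nbytes += size
        if rows >= max_rows or nbytes >= max_bytes:
            yield index
            rows = 0
            nbytes = 0
```

For example, with max_rows=2 and five small records, commits fall after the second and fourth records; the fifth is committed at the end of the run.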

7.2.5.8. truncate_text_at

Integer. Default: 0. Must be zero or positive.

Use this to truncate very long incoming text fields. If non-zero, this is the length at which to truncate.

7.2.5.9. record_truncated_values

Boolean. Default: false.

Record in the progress database that we have processed records for which the source text was truncated (see truncate_text_at).

This option lets the program, when running in incremental mode, decide whether to re-run NLP on records whose source text was truncated before processing. If this option is set to true, such records won’t be run again unless they have changed.

7.2.5.10. cloud_config

String. Required to use cloud NLP.

The name of the cloud NLP configuration to use if you ask for cloud-based processing with this NLP definition.

For example, you might specify:

cloud_config = my_uk_cloud_nlp_service

and CRATE would then look for a cloud NLP configuration in a config file section named [cloud:my_uk_cloud_nlp_service], and use the information there to connect to a cloud NLP service via the NLPRP.

7.2.5.11. cloud_request_data_dir

String. Required to use cloud NLP.

Directory (on your local filesystem) to hold files containing information for the retrieval of data which has been sent in queued mode.

For safety (in case the user specifies a foolish directory!), CRATE will make a subdirectory of this directory (whose name is that of the NLP definition). CRATE will delete files at will within that subdirectory.

7.2.6. Config file section: input field definition

These are config file sections named [input:XXX] where XXX is the name of one of your input field definitions.

These define database “inputs” in more detail, including the database, table, and field (column) containing the input, the associated primary key field, and fields that should be copied to the destination to make subsequent work easier (e.g. patient research IDs).

They are referred to by the NLP definition.

7.2.6.1. srcdb

String.

Source database; the name of a database definition in the config file.

7.2.6.2. srctable

String.

The name of the table in the source database.

7.2.6.3. srcpkfield

String.

The name of the primary key field (column) in the source table.

7.2.6.4. srcfield

String.

The name of the field (column) in the source table that contains the data of interest.

7.2.6.5. srcdatetimefield

String. Optional (but advisable).

The name of the DATETIME field (column) in the source table that represents the date/time of the source data. If present, this information will be copied to the output; see Standard NLP output columns.

7.2.6.6. copyfields

Multiline string. Optional.

Names of fields to copy from the source table to the destination (NLP output) table.

7.2.6.7. indexed_copyfields

Multiline string.

Optional subset of copyfields that should be indexed in the destination (NLP output) table.

7.2.6.8. debug_row_limit

Integer. Default: 0.

Debugging option. Specify this to set the maximum number of rows to be fetched from the source table. Specifying 0 means “no limit”.
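Putting the parameters of this section together, an input field definition might look like the one embedded below. All the values (database, table and column names) are hypothetical; the Python around it simply checks that the section parses and that indexed_copyfields is a subset of copyfields.

```python
import configparser

EXAMPLE = """
[input:INPUT_FIELD_PROGRESS_NOTES]
srcdb = SOURCE_DATABASE
srctable = progress_notes
srcpkfield = note_id
srcfield = note_text
srcdatetimefield = note_datetime
copyfields =
    rid
    trid
indexed_copyfields =
    rid
debug_row_limit = 0
"""


def multiline(value: str):
    """Split a multiline config value into a list of non-blank strings."""
    return [item.strip() for item in value.splitlines() if item.strip()]


parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)
inputdef = parser["input:INPUT_FIELD_PROGRESS_NOTES"]

copyfields = multiline(inputdef.get("copyfields", fallback=""))
indexed = multiline(inputdef.get("indexed_copyfields", fallback=""))
# Every indexed copy field must also be a copy field.
assert set(indexed) <= set(copyfields)
```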

7.2.7. Config file section: processor definition

These are config file sections named [processor:XXX] where XXX is the name of one of your NLP processors.

These control the behaviour of individual NLP processors.

In the case of CRATE’s built-in processors, the only configuration needed is the destination database/table, but for some, like GATE applications, you need to define more – such as how to run the external program, and what sort of table structure should be created to receive the results.

The format depends on the specific processor type (see processors).

7.2.7.1. destdb

String.

Applicable to: all parsers.

Destination database; the name of a database definition in the config file.

7.2.7.2. desttable

String.

Applicable to: Cloud, MedEx, all CRATE Python processors.

The name of the table in the destination database in which the results should be stored.

7.2.7.3. assume_preferred_unit

Boolean. Default: True.

Applicable to: nearly all numerical CRATE Python processors.

If a unit is not specified, assume that values are in the processor’s preferred units. (For example, crate_anon.nlp_manager.parse_biochemistry.Crp will assume mg/L.)

Some override this and are not configurable, however:

  • AlcoholUnits never assumes this.

7.2.7.4. desttable

String.

Applicable to: Cloud.

Table name in the destination (NLP output) database into which to write results from the cloud NLP processor. Use this for single-table processors.

The alternative is outputtypemap.

7.2.7.5. outputtypemap

Multiline string.

Applicable to: GATE, Cloud.

For GATE:

What’s GATE? See the section on GATE NLP.

Map GATE ‘_type’ parameters to possible destination tables (in case-insensitive fashion). This parameter is a list of pairs, one pair per line.

  • The first item of each is the annotation type coming out of the GATE system.

  • The second is the output type section defined in this file (as a separate section). Those sections (q.v.) define tables and columns (fields).

Example:

outputtypemap =
    Person output_person
    Location output_location

This example would take output from GATE labelled with _type=Person and send it to output defined in the [output:output_person] section of the config file – see GATE output definitions. Equivalently for the Location type.
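The case-insensitive lookup from GATE annotation type to output section can be sketched as follows. The function names are hypothetical, and lower-casing the annotation type is just one way to get case-insensitive matching.

```python
def parse_outputtypemap(value: str):
    """Map lower-cased GATE annotation types to output section names."""
    mapping = {}
    for line in value.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        annotation_type, output_name = parts
        mapping[annotation_type.lower()] = f"output:{output_name}"
    return mapping


def section_for(mapping, gate_type: str):
    """Find the config section for a GATE '_type', ignoring case."""
    return mapping.get(gate_type.lower())


typemap = parse_outputtypemap("""
    Person output_person
    Location output_location
""")
```

With this map, both section_for(typemap, "Person") and section_for(typemap, "PERSON") give "output:output_person".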

For cloud:

7.2.7.6. progargs

Multiline string.

Applicable to: GATE, MedEx.

This parameter defines how we will launch GATE. See GATE NLP.

GATE NLP is done by an external program.

In this parameter, we specify a program and associated arguments. Here’s an example:

progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/bin/gate.jar"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{GATE_HOME}/plugins/ANNIE/ANNIE_with_defaults.gapp"
    --annotation Person
    --annotation Location
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --verbose

The example shows how to use Java to launch a specific Java program (CrateGatePipeline), having set a path to find other Java classes, and how to pass arguments to the program itself.

NOTE IN PARTICULAR:

  • Use double quotes to encapsulate any filename that may have spaces within it (e.g. C:/Program Files/...).

  • Use a forward slash directory separator, even under Windows.

    • … ? If that doesn’t work, use a double backslash, \\.

  • Under Windows, use a semicolon to separate parts of the Java classpath. Under Linux, use a colon.

    So a Linux Java classpath looks like

    /some/path:/some/other/path:/third/path
    

    and a Windows one looks like

    C:/some/path;C:/some/other/path;C:/third/path
    
  • To make this simpler, we can define the environment variable OS_PATHSEP (by analogy to Python’s os.pathsep). See the environment variable section below.

  • You can use substitutable parameters:

    {X}

    Substitutes variable X from the environment you specify (see below).

    {NLPLOGTAG}

    Additional environment variable that indicates the process being run; used to label the output from the CrateGatePipeline application.
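To see how the substitutable parameters and the quoting rules combine, here is a minimal sketch using Python’s str.format and shlex.split. The paths and the log tag are hypothetical, and doing the substitution this way is an assumption for illustration rather than a statement of how CRATE does it internally.

```python
import shlex

# Hypothetical values, as might come from an [env:XXX] section.
env = {
    "GATE_HOME": "/opt/GATE",
    "NLPPROGDIR": "/opt/crate/classes",
    "OS_PATHSEP": ":",
    "NLPLOGTAG": "demo",
}

PROGARGS = """
java
-classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/bin/gate.jar"{OS_PATHSEP}"{GATE_HOME}/lib/*"
-Dgate.home="{GATE_HOME}"
CrateGatePipeline
--log_tag {NLPLOGTAG}
"""

# Substitute {X} from the environment, then split into an argument list.
# Adjacent quoted and unquoted pieces are joined into one argument.
args = shlex.split(PROGARGS.format(**env))
classpath = args[args.index("-classpath") + 1]
```

The double quotes protect paths that contain spaces; after splitting, the whole classpath is a single argument with its parts joined by the path separator.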

7.2.7.7. progenvsection

String.

Applicable to: GATE, MedEx.

Environment variable config section to use when launching this program.

7.2.7.8. input_terminator

String.

Applicable to: GATE.

The external GATE program is slow, because NLP is slow. Therefore, we set up the external program once and use it repeatedly for many pieces of text. Individual pieces of text are sent to it (via its stdin). We finish each piece of text with a delimiter, which should (a) be specified in the -it or --input_terminator parameter to the CRATE CrateGatePipeline interface (above), and (b) be set here, TO THE SAME VALUE. The external program will return a tab-separated (TSV) set of field/value pairs, like this:

field1\tvalue1\tfield2\tvalue2...
field1\tvalue3\tfield2\tvalue4...
...
OUTPUTTERMINATOR

… where OUTPUTTERMINATOR is something that you (a) specify with the -ot or --output_terminator parameter above, and (b) set via the config file output_terminator, TO THE SAME VALUE.
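The receiving side of this exchange can be sketched as below. The function parse_gate_output is hypothetical; it assumes that each output line alternates field names and values separated by tabs, as in the layout above, and that reading stops at the output terminator.

```python
def parse_gate_output(lines, output_terminator):
    """Collect one dict per TSV line until the output terminator is seen."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line == output_terminator:
            break  # end of the results for this piece of text
        fields = line.split("\t")
        # Fields alternate: name, value, name, value, ...
        records.append(dict(zip(fields[0::2], fields[1::2])))
    return records


records = parse_gate_output(
    [
        "_type\tPerson\tsurname\tSmith\n",
        "_type\tLocation\tloctype\tcity\n",
        "END_OF_NLP_OUTPUT_RECORD\n",
        "not\tread\n",
    ],
    output_terminator="END_OF_NLP_OUTPUT_RECORD",
)
```

If the terminator configured here differs from the one given to the external program, the reader never sees its stop signal, which is why the two values must match.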

7.2.7.9. output_terminator

String.

Applicable to: GATE.

See input_terminator.

7.2.7.10. max_external_prog_uses

Integer.

Applicable to: GATE, MedEx.

If the external GATE program leaks memory, you may wish to cap the number of uses before it’s restarted. Specify this option if so. Specify 0 or omit the option entirely to ignore this.

7.2.7.11. processor_name

String.

Applicable to: Cloud.

Name of the remote processor; see NLPRP list_processors.

Note that this is case sensitive. To ask the remote server what processor names it offers, use the crate_nlp tool like this:

crate_nlp --config MYCONFIG --nlpdef MYNLPDEF --print_cloud_processors

That will, in sequence:

  • read the config called MYCONFIG;

  • look for an NLP definition section marked [nlpdef:MYNLPDEF];

  • look for a cloud_config parameter in that section;

  • look up the corresponding cloud server definition, including the URL of the remote server;

  • ask the server for details of all its processors;

  • print the results (in NLPRP JSON format; see the NLPRP list_processors command for details).

7.2.7.12. processor_version

String. Default: None.

Applicable to: Cloud.

Version of the remote processor; see NLPRP list_processors.

7.2.7.13. processor_format

String.

Applicable to: Cloud.

One of: Standard, GATE.

Standard refers primarily to CRATE Python-based remote processors, but would be compatible with any remote processor which returned data in the same format as the CRATE processors. GATE refers to GATE remote processors, which return a standard set of columns.

7.2.8. Config file section: GATE output definition

These are config file sections named [output:XXX] where XXX is the name of one of your GATE output types 6.

This is an additional thing we need for GATE applications, since CRATE doesn’t automatically know what sort of output they will produce. The tables and SPECIFIC output fields for a given GATE processor are defined here.

They are referred to by the outputtypemap parameter (q.v.).

7.2.8.1. desttable

String.

Table name in the destination (NLP output) database into which to write results from the GATE NLP application.

7.2.8.2. renames

Multiline string.

A list of from, to things to rename from the GATE output en route to the database. In each case, the from item is the name of a GATE output annotation. The to item is the destination field/column name.

Specify one pair per line. You can quote, using shlex rules. Case-sensitive.

This example:

renames =
    firstName   firstname

renames firstName to firstname.
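Since shlex quoting rules apply, names containing spaces can be quoted. A minimal sketch of parsing and applying such pairs follows; the function names are hypothetical.

```python
import shlex


def parse_renames(value: str):
    """Parse 'from to' pairs, one per line, honouring shlex quoting."""
    renames = {}
    for line in value.splitlines():
        words = shlex.split(line)
        if not words:
            continue  # skip blank lines
        source_name, dest_name = words
        renames[source_name] = dest_name
    return renames


def apply_renames(record: dict, renames: dict):
    """Rename keys of a GATE output record; other keys pass through."""
    return {renames.get(key, key): value for key, value in record.items()}


renames = parse_renames("""
    firstName           firstname
    "Length of Time"    length_of_time
""")
renamed = apply_renames({"firstName": "Ann", "Length of Time": "2 weeks"}, renames)
```

Matching is case-sensitive, so an annotation called firstname would not be touched by a rule for firstName.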

A more relevant example, in which the GATE annotation names are clearly not well suited to being database column names:

renames =
    drug-type           drug_type
    dose-value          dose_value
    dose-unit           dose_unit
    dose-multiple       dose_multiple
    Directionality      directionality
    Experiencer         experiencer
    "Length of Time"    length_of_time
    Temporality         temporality
    "Unit of Time"      unit_of_time

7.2.8.3. null_literals

Multiline string.

Define values that will be treated as NULL in SQL. For example, sometimes GATE provides the string null for a NULL value; we can convert to a proper SQL NULL.

The parameter is treated as a sequence of words; shlex quoting rules apply.

Example:

null_literals =
    null
    ""

7.2.8.4. destfields

Multiline string.

Defines the database field (column) types used in the output database. This is how you tell the database how much space to allocate for information that will come out of GATE. Each line is a column_name, sql_type pair (or, optionally, a column_name, sql_type, comment triple). Whitespace is used to separate the columns. Examples:

destfields =
    rule        VARCHAR(100)
    firstname   VARCHAR(100)
    surname     VARCHAR(100)
    gender      VARCHAR(7)
    kind        VARCHAR(100)

destfields =
    rule        VARCHAR(100)    Rule used to find this person (e.g. TitleFirstName, PersonFull)
    firstname   VARCHAR(100)    First name
    surname     VARCHAR(100)    Surname
    gender      VARCHAR(7)      Gender (e.g. male, female, unknown)
    kind        VARCHAR(100)    Kind of name (e.g. personName, fullName)
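Both forms (with and without a comment) can be read with the same whitespace split, as this sketch shows. The function parse_destfields is hypothetical, and it assumes the SQL type itself contains no spaces.

```python
def parse_destfields(value: str):
    """Parse lines of 'column_name sql_type [comment]' into tuples."""
    fields = []
    for line in value.splitlines():
        # Split at most twice so the comment keeps its internal spaces.
        parts = line.strip().split(maxsplit=2)
        if not parts:
            continue  # skip blank lines
        name, sql_type = parts[0], parts[1]
        comment = parts[2] if len(parts) > 2 else None
        fields.append((name, sql_type, comment))
    return fields


destfields = parse_destfields("""
    rule        VARCHAR(100)
    gender      VARCHAR(7)      Gender (e.g. male, female, unknown)
""")
```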

7.2.8.5. indexdefs

Multiline string.

Fields to index in the destination table.

Each line is an indexed_field, index_length pair. The index_length should be an integer or None. Example:

indexdefs =
    firstname   64
    surname     64

7.2.9. Config file section: environment variables definition

These are config file sections named [env:XXX] where XXX is the name of one of your environment variable definition blocks.

We define environment variable groups here, with one group per section.

When a section is selected (e.g. by a progenvsection parameter in a GATE NLP processor definition as above), these variables can be substituted into the progargs part of the NLP definition (for when external programs are called) and are available in the operating system environment for those programs themselves.

  • The environment will start by inheriting the parent environment, then add variables here.

  • Keys are case-sensitive.

Example:

[env:MY_ENV_SECTION]

GATE_HOME = /home/myuser/somewhere/GATE_Developer_8.0
NLPPROGDIR = /home/myuser/somewhere/crate_anon/nlp_manager/compiled_nlp_classes
MEDEXDIR = /home/myuser/somewhere/Medex_UIMA_1.3.6
KCONNECTDIR = /home/myuser/somewhere/yodie-pipeline-1-2-umls-only
OS_PATHSEP = :

7.2.10. Config file section: database definition

These are config file sections named [database:XXX] where XXX is the name of one of your database definitions.

These simply tell CRATE how to connect to different databases.

7.2.10.1. url

String.

The URL of the database. Use SQLAlchemy URLs: http://docs.sqlalchemy.org/en/latest/core/engines.html.

Example:

[database:MY_SOURCE_DATABASE]

url = mysql+mysqldb://myuser:password@127.0.0.1:3306/anonymous_output_db?charset=utf8

7.2.10.2. echo

Boolean. Default: False.

Optional parameter for debugging. If set to True, all SQL being sent to the database will be logged to the Python console log.

7.2.11. Config file section: cloud NLP configuration

These are config file sections named [cloud:XXX] where XXX is the name of one of your cloud NLP configurations (referred to by the cloud_config parameter in an NLP definition) 4.

7.2.11.1. cloud_url

String. Required to use cloud NLP.

The URL of the cloud NLP service.

7.2.11.2. verify_ssl

Boolean. Default: true.

Should CRATE verify the SSL certificate of the remote NLP server?

7.2.11.3. compress

Boolean. Default: true.

Should CRATE compress messages going to the NLP server, using gzip?

CRATE (via the Python requests library) also always tells the server that it will accept gzip compression back; the server should respond to this by compressing results.

7.2.11.4. username

String. Default: “”.

Your username for accessing the services at the URL specified in cloud_url.

7.2.11.5. password

String. Default: “”.

Your password for accessing the services at the URL specified in cloud_url.

7.2.11.6. wait_on_conn_err

Integer. Default: 180.

After a connection error occurs, wait this many seconds before retrying.

7.2.11.7. max_content_length

Integer. Default: 0.

The maximum size of the packets to be sent. This should be less than or equal to the limit the service allows. Put 0 for no maximum length.

NOTE: if a single record is larger than the maximum packet size, that record will not be sent.

7.2.11.8. max_records_per_request

Integer. Default: 1000.

When sending data: the maximum number of pieces of text that will be sent as part of a single NLPRP request (subject also to max_content_length).
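How max_records_per_request and max_content_length might interact when building requests is sketched below. The function batch_texts is hypothetical, and it measures only the UTF-8 size of the text itself, ignoring the overhead of the surrounding request, which is a simplification.

```python
def batch_texts(texts, max_records_per_request=1000, max_content_length=0):
    """Group texts into requests, honouring both limits (0 = no size limit)."""
    batch = []
    batch_bytes = 0
    for text in texts:
        size = len(text.encode("utf-8"))
        if max_content_length and size > max_content_length:
            continue  # a single oversized record is not sent at all
        too_many = len(batch) >= max_records_per_request
        too_big = bool(max_content_length) and (
            batch_bytes + size > max_content_length
        )
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(text)
        batch_bytes += size
    if batch:
        yield batch
```

For example, with a limit of two records and five bytes per request, the texts "aa", "bb", "cc" and a ten-byte text are sent as two requests, and the ten-byte text is skipped.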

7.2.11.9. limit_before_commit

Integer. Default: 1000.

When receiving results: the number of results that will be processed (and written to the database) before a COMMIT command is executed.

7.2.11.10. stop_at_failure

Boolean. Default: true.

Should CRATE stop when a cloud NLP processing request fails? If true, an error is raised and CRATE aborts on failure; if false, CRATE continues despite the failure. (Some requests are not allowed to fail, regardless of this setting.)

7.2.11.11. max_tries

Integer. Default: 5.

Maximum number of times to try each HTTP connection to the cloud NLP server, before giving it up as a bad job.

7.2.11.12. rate_limit_hz

Integer. Default: 2.

The maximum rate, in Hz (times per second), that the CRATE NLP processor will send requests. Use this to avoid overloading the cloud NLP server. Specify 0 for no limit.
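The retry and rate-limit settings can be pictured with the following sketch. The function names are hypothetical, and the way wait_on_conn_err and max_tries are combined here (one wait between each failed attempt) is an assumption, not a description of CRATE’s internals.

```python
import time


def send_with_retries(send, max_tries=5, wait_on_conn_err=180, sleep=time.sleep):
    """Call send(); on ConnectionError, wait and retry up to max_tries times."""
    for attempt in range(1, max_tries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_tries:
                raise  # give up after the final attempt
            sleep(wait_on_conn_err)


def min_interval(rate_limit_hz):
    """Minimum gap in seconds between requests; 0 Hz means no limit."""
    return 1.0 / rate_limit_hz if rate_limit_hz > 0 else 0.0
```

With the default rate_limit_hz of 2, requests would be spaced at least half a second apart.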

7.2.12. Parallel processing

There are two ways to parallelize CRATE NLP.

  1. You can run multiple NLP processors at the same time, by specifying multiple NLP processors in a single NLP definition within your configuration file.

    There can be different sources of bottleneck. One is database access. Specifying multiple NLP processors means that text is fetched once (for a given set of input fields) and then run through multiple NLP processors in one go.

    However, GATE apps can take e.g. 1 GB of RAM per process, so be careful if trying to run several of those! CRATE’s regular expression parsers use very little RAM (and can go quite fast: e.g. 2 CPUs processing about 15,000 records through 10 regex parsers in about 166 s, i.e. 150,000 parser runs at roughly 900 per second, or of the order of 1 kHz).

  2. You can run multiple simultaneous copies of CRATE’s NLP manager.

    This will divide up the work across the copies (by dividing up the records retrieved from the database).

You can use both strategies simultaneously.
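One simple way to divide records across several copies of a program is by the remainder of the integer PK, as sketched below. This illustrates the idea only: the function name is hypothetical and the exact rule CRATE uses to divide the work is an assumption here.

```python
def records_for_process(pks, process, nprocesses):
    """Select the PKs that a given copy (numbered from 0) should handle."""
    return [pk for pk in pks if pk % nprocesses == process]


all_pks = list(range(1, 11))
shares = [records_for_process(all_pks, p, 3) for p in range(3)]
```

Each record lands in exactly one share, so the copies never process the same record twice and between them cover every record.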

7.2.13. Specimen config

A specimen NLP config is available by running crate_nlp --democonfig.

Here’s the specimen NLP config:

# Configuration file for CRATE NLP manager (crate_nlp).
# Version 0.20.0 (2023-02-14).
#
# PLEASE SEE THE HELP at https://crateanon.readthedocs.io/
# Using defaults for Docker environment: False

# =============================================================================
# A. Individual NLP definitions
# =============================================================================
# - referred to by the NLP manager's command-line arguments
# - You are likely to need to alter these (particularly the bits in capital
#   letters) to refer to your own database(s).

# -----------------------------------------------------------------------------
# GATE people-and-places demo
# -----------------------------------------------------------------------------

[nlpdef:gate_name_location_demo]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_name_location
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter


# -----------------------------------------------------------------------------
# KConnect (Bio-YODIE) disease-finding GATE app
# -----------------------------------------------------------------------------

[nlpdef:gate_kconnect_diseases]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_kconnect
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter


# -----------------------------------------------------------------------------
# KCL Lewy body dementia GATE app
# -----------------------------------------------------------------------------

[nlpdef:gate_kcl_lbd]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_kcl_lbda
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter


# -----------------------------------------------------------------------------
# KCL pharmacotherapy GATE app
# -----------------------------------------------------------------------------

[nlpdef:gate_kcl_pharmacotherapy]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    GATE procdef_gate_pharmacotherapy
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter


# -----------------------------------------------------------------------------
# Medex-UIMA medication-finding app
# -----------------------------------------------------------------------------

[nlpdef:medex_medications]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    Medex procdef_medex_medications
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter


# -----------------------------------------------------------------------------
# CRATE number-finding Python regexes
# -----------------------------------------------------------------------------

[nlpdef:crate_biomarkers]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    # -------------------------------------------------------------------------
    # Biochemistry
    # -------------------------------------------------------------------------
    Albumin procdef_albumin
    AlbuminValidator procdef_validate_albumin
    AlkPhos procdef_alkphos
    AlkPhosValidator procdef_validate_alkphos
    ALT procdef_alt
    ALTValidator procdef_validate_alt
    Bilirubin procdef_bilirubin
    BilirubinValidator procdef_validate_bilirubin
    Creatinine procdef_creatinine
    CreatinineValidator procdef_validate_creatinine
    Crp procdef_crp
    CrpValidator procdef_validate_crp
    GammaGT procdef_gammagt
    GammaGTValidator procdef_validate_gammagt
    Glucose procdef_glucose
    GlucoseValidator procdef_validate_glucose
    HbA1c procdef_hba1c
    HbA1cValidator procdef_validate_hba1c
    HDLCholesterol procdef_hdlcholesterol
    HDLCholesterolValidator procdef_validate_hdlcholesterol
    LDLCholesterol procdef_ldlcholesterol
    LDLCholesterolValidator procdef_validate_ldlcholesterol
    Lithium procdef_lithium
    LithiumValidator procdef_validate_lithium
    Potassium procdef_potassium
    PotassiumValidator procdef_validate_potassium
    Sodium procdef_sodium
    SodiumValidator procdef_validate_sodium
    TotalCholesterol procdef_totalcholesterol
    TotalCholesterolValidator procdef_validate_totalcholesterol
    Triglycerides procdef_triglycerides
    TriglyceridesValidator procdef_validate_triglycerides
    Tsh procdef_tsh
    TshValidator procdef_validate_tsh
    Urea procdef_urea
    UreaValidator procdef_validate_urea
    # -------------------------------------------------------------------------
    # Clinical
    # -------------------------------------------------------------------------
    Bmi procdef_bmi
    BmiValidator procdef_validate_bmi
    Bp procdef_bp
    BpValidator procdef_validate_bp
    Height procdef_height
    HeightValidator procdef_validate_height
    Weight procdef_weight
    WeightValidator procdef_validate_weight
    # -------------------------------------------------------------------------
    # Cognitive
    # -------------------------------------------------------------------------
    Ace procdef_ace
    AceValidator procdef_validate_ace
    MiniAce procdef_miniace
    MiniAceValidator procdef_validate_miniace
    Mmse procdef_mmse
    MmseValidator procdef_validate_mmse
    Moca procdef_moca
    MocaValidator procdef_validate_moca
    # -------------------------------------------------------------------------
    # Haematology
    # -------------------------------------------------------------------------
    Basophils procdef_basophils
    BasophilsValidator procdef_validate_basophils
    Eosinophils procdef_eosinophils
    EosinophilsValidator procdef_validate_eosinophils
    Esr procdef_esr
    EsrValidator procdef_validate_esr
    Haematocrit procdef_haematocrit
    HaematocritValidator procdef_validate_haematocrit
    Haemoglobin procdef_haemoglobin
    HaemoglobinValidator procdef_validate_haemoglobin
    Lymphocytes procdef_lymphocytes
    LymphocytesValidator procdef_validate_lymphocytes
    Monocytes procdef_monocytes
    MonocytesValidator procdef_validate_monocytes
    Neutrophils procdef_neutrophils
    NeutrophilsValidator procdef_validate_neutrophils
    Platelets procdef_platelets
    PlateletsValidator procdef_validate_platelets
    RBC procdef_rbc
    RBCValidator procdef_validate_rbc
    Wbc procdef_wbc
    WbcValidator procdef_validate_wbc
    # -------------------------------------------------------------------------
    # Substance misuse
    # -------------------------------------------------------------------------
    AlcoholUnits procdef_alcoholunits
    AlcoholUnitsValidator procdef_validate_alcoholunits

progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
# truncate_text_at = 32766
# record_truncated_values = False
max_rows_before_commit = 1000
max_bytes_before_commit = 83886080

# -----------------------------------------------------------------------------
# Cloud NLP demo
# -----------------------------------------------------------------------------

[nlpdef:cloud_nlp_demo]

inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    Cloud procdef_cloud_crp
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter
cloud_config = my_uk_cloud_service
cloud_request_data_dir = /srv/crate/clouddata


# =============================================================================
# B. NLP processor definitions
# =============================================================================
# - You're likely to have to modify the destination databases these point to,
#   but otherwise you can probably leave them as they are.

# -----------------------------------------------------------------------------
# Specimen CRATE regular expression processor definitions
# -----------------------------------------------------------------------------

    # Most of these are very simple, and just require a destination database
    # (as a cross-reference to a database section within this file) and a
    # destination table.

    # Biochemistry

[processor:procdef_albumin]
destdb = DESTINATION_DATABASE
desttable = albumin
[processor:procdef_validate_albumin]
destdb = DESTINATION_DATABASE
desttable = validate_albumin

[processor:procdef_alkphos]
destdb = DESTINATION_DATABASE
desttable = alkphos
[processor:procdef_validate_alkphos]
destdb = DESTINATION_DATABASE
desttable = validate_alkphos

[processor:procdef_alt]
destdb = DESTINATION_DATABASE
desttable = alt
[processor:procdef_validate_alt]
destdb = DESTINATION_DATABASE
desttable = validate_alt

[processor:procdef_bilirubin]
destdb = DESTINATION_DATABASE
desttable = bilirubin
[processor:procdef_validate_bilirubin]
destdb = DESTINATION_DATABASE
desttable = validate_bilirubin

[processor:procdef_creatinine]
destdb = DESTINATION_DATABASE
desttable = creatinine
[processor:procdef_validate_creatinine]
destdb = DESTINATION_DATABASE
desttable = validate_creatinine

[processor:procdef_crp]
destdb = DESTINATION_DATABASE
desttable = crp
[processor:procdef_validate_crp]
destdb = DESTINATION_DATABASE
desttable = validate_crp

[processor:procdef_gammagt]
destdb = DESTINATION_DATABASE
desttable = gammagt
[processor:procdef_validate_gammagt]
destdb = DESTINATION_DATABASE
desttable = validate_gammagt

[processor:procdef_glucose]
destdb = DESTINATION_DATABASE
desttable = glucose
[processor:procdef_validate_glucose]
destdb = DESTINATION_DATABASE
desttable = validate_glucose

[processor:procdef_hba1c]
destdb = DESTINATION_DATABASE
desttable = hba1c
[processor:procdef_validate_hba1c]
destdb = DESTINATION_DATABASE
desttable = validate_hba1c

[processor:procdef_hdlcholesterol]
destdb = DESTINATION_DATABASE
desttable = hdlcholesterol
[processor:procdef_validate_hdlcholesterol]
destdb = DESTINATION_DATABASE
desttable = validate_hdlcholesterol

[processor:procdef_ldlcholesterol]
destdb = DESTINATION_DATABASE
desttable = ldlcholesterol
[processor:procdef_validate_ldlcholesterol]
destdb = DESTINATION_DATABASE
desttable = validate_ldlcholesterol

[processor:procdef_lithium]
destdb = DESTINATION_DATABASE
desttable = lithium
[processor:procdef_validate_lithium]
destdb = DESTINATION_DATABASE
desttable = validate_lithium

[processor:procdef_potassium]
destdb = DESTINATION_DATABASE
desttable = potassium
[processor:procdef_validate_potassium]
destdb = DESTINATION_DATABASE
desttable = validate_potassium

[processor:procdef_sodium]
destdb = DESTINATION_DATABASE
desttable = sodium
[processor:procdef_validate_sodium]
destdb = DESTINATION_DATABASE
desttable = validate_sodium

[processor:procdef_totalcholesterol]
destdb = DESTINATION_DATABASE
desttable = totalcholesterol
[processor:procdef_validate_totalcholesterol]
destdb = DESTINATION_DATABASE
desttable = validate_totalcholesterol

[processor:procdef_triglycerides]
destdb = DESTINATION_DATABASE
desttable = triglycerides
[processor:procdef_validate_triglycerides]
destdb = DESTINATION_DATABASE
desttable = validate_triglycerides

[processor:procdef_tsh]
destdb = DESTINATION_DATABASE
desttable = tsh
[processor:procdef_validate_tsh]
destdb = DESTINATION_DATABASE
desttable = validate_tsh

[processor:procdef_urea]
destdb = DESTINATION_DATABASE
desttable = urea
[processor:procdef_validate_urea]
destdb = DESTINATION_DATABASE
desttable = validate_urea

    # Clinical

[processor:procdef_bmi]
destdb = DESTINATION_DATABASE
desttable = bmi
[processor:procdef_validate_bmi]
destdb = DESTINATION_DATABASE
desttable = validate_bmi

[processor:procdef_bp]
destdb = DESTINATION_DATABASE
desttable = bp
[processor:procdef_validate_bp]
destdb = DESTINATION_DATABASE
desttable = validate_bp

[processor:procdef_height]
destdb = DESTINATION_DATABASE
desttable = height
[processor:procdef_validate_height]
destdb = DESTINATION_DATABASE
desttable = validate_height

[processor:procdef_weight]
destdb = DESTINATION_DATABASE
desttable = weight
[processor:procdef_validate_weight]
destdb = DESTINATION_DATABASE
desttable = validate_weight

    # Cognitive

[processor:procdef_ace]
destdb = DESTINATION_DATABASE
desttable = ace
[processor:procdef_validate_ace]
destdb = DESTINATION_DATABASE
desttable = validate_ace

[processor:procdef_miniace]
destdb = DESTINATION_DATABASE
desttable = miniace
[processor:procdef_validate_miniace]
destdb = DESTINATION_DATABASE
desttable = validate_miniace

[processor:procdef_mmse]
destdb = DESTINATION_DATABASE
desttable = mmse
[processor:procdef_validate_mmse]
destdb = DESTINATION_DATABASE
desttable = validate_mmse

[processor:procdef_moca]
destdb = DESTINATION_DATABASE
desttable = moca
[processor:procdef_validate_moca]
destdb = DESTINATION_DATABASE
desttable = validate_moca

    # Haematology

[processor:procdef_basophils]
destdb = DESTINATION_DATABASE
desttable = basophils
[processor:procdef_validate_basophils]
destdb = DESTINATION_DATABASE
desttable = validate_basophils

[processor:procdef_eosinophils]
destdb = DESTINATION_DATABASE
desttable = eosinophils
[processor:procdef_validate_eosinophils]
destdb = DESTINATION_DATABASE
desttable = validate_eosinophils

[processor:procdef_esr]
destdb = DESTINATION_DATABASE
desttable = esr
[processor:procdef_validate_esr]
destdb = DESTINATION_DATABASE
desttable = validate_esr

[processor:procdef_haematocrit]
destdb = DESTINATION_DATABASE
desttable = haematocrit
[processor:procdef_validate_haematocrit]
destdb = DESTINATION_DATABASE
desttable = validate_haematocrit

[processor:procdef_haemoglobin]
destdb = DESTINATION_DATABASE
desttable = haemoglobin
[processor:procdef_validate_haemoglobin]
destdb = DESTINATION_DATABASE
desttable = validate_haemoglobin

[processor:procdef_lymphocytes]
destdb = DESTINATION_DATABASE
desttable = lymphocytes
[processor:procdef_validate_lymphocytes]
destdb = DESTINATION_DATABASE
desttable = validate_lymphocytes

[processor:procdef_monocytes]
destdb = DESTINATION_DATABASE
desttable = monocytes
[processor:procdef_validate_monocytes]
destdb = DESTINATION_DATABASE
desttable = validate_monocytes

[processor:procdef_neutrophils]
destdb = DESTINATION_DATABASE
desttable = neutrophils
[processor:procdef_validate_neutrophils]
destdb = DESTINATION_DATABASE
desttable = validate_neutrophils

[processor:procdef_platelets]
destdb = DESTINATION_DATABASE
desttable = platelets
[processor:procdef_validate_platelets]
destdb = DESTINATION_DATABASE
desttable = validate_platelets

[processor:procdef_rbc]
destdb = DESTINATION_DATABASE
desttable = rbc
[processor:procdef_validate_rbc]
destdb = DESTINATION_DATABASE
desttable = validate_rbc

[processor:procdef_wbc]
destdb = DESTINATION_DATABASE
desttable = wbc
[processor:procdef_validate_wbc]
destdb = DESTINATION_DATABASE
desttable = validate_wbc

    # Substance misuse

[processor:procdef_alcoholunits]
destdb = DESTINATION_DATABASE
desttable = alcoholunits
[processor:procdef_validate_alcoholunits]
destdb = DESTINATION_DATABASE
desttable = validate_alcoholunits

# -----------------------------------------------------------------------------
# Specimen GATE demo people/places processor definition
# -----------------------------------------------------------------------------

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[processor:procdef_gate_name_location]

destdb = DESTINATION_DATABASE
outputtypemap =
    Person output_person
    Location output_location
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{GATE_HOME}/plugins/ANNIE/ANNIE_with_defaults.gapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --annotation Person
    --annotation Location
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[output:output_person]

desttable = person
renames =
    firstName   firstname
destfields =
    rule        VARCHAR(100)    Rule used to find this person (e.g. TitleFirstName, PersonFull)
    firstname   VARCHAR(100)    First name
    surname     VARCHAR(100)    Surname
    gender      VARCHAR(7)      Gender (e.g. male, female, unknown)
    kind        VARCHAR(100)    Kind of name (e.g. personName, fullName)
    # ... longest gender: "unknown" (7)
indexdefs =
    firstname   64
    surname     64

[output:output_location]

desttable = location
renames =
    locType     loctype
destfields =
    rule        VARCHAR(100)    Rule used (e.g. Location1)
    loctype     VARCHAR(100)    Location type (e.g. city)
indexdefs =
    rule    100
    loctype 100


# -----------------------------------------------------------------------------
# Specimen Sheffield/KCL KConnect (Bio-YODIE) processor definition
# -----------------------------------------------------------------------------
# https://gate.ac.uk/applications/bio-yodie.html

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[processor:procdef_gate_kconnect]

destdb = DESTINATION_DATABASE
outputtypemap =
    Disease_or_Syndrome output_disease_or_syndrome
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{KCONNECTDIR}/main-bio/main-bio.xgapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --annotation Disease_or_Syndrome
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[output:output_disease_or_syndrome]

desttable = kconnect_diseases
renames =
    Experiencer     experiencer
    Negation        negation
    PREF            pref
    STY             sty
    TUI             tui
    Temporality     temporality
    VOCABS          vocabs
destfields =
    # Found by manual inspection of KConnect/Bio-YODIE output from the GATE console:
    experiencer  VARCHAR(100)  Who experienced it; e.g. "Patient", "Other"
    negation     VARCHAR(100)  Was it negated or not; e.g. "Affirmed", "Negated"
    pref         VARCHAR(100)  Preferred name (PREF); e.g. "Rheumatic gout"
    sty          VARCHAR(100)  Semantic Type (STY) [semantic type name]; e.g. "Disease or Syndrome"
    tui          VARCHAR(4)    Type Unique Identifier (TUI) [semantic type identifier]; 4 characters; https://www.ncbi.nlm.nih.gov/books/NBK9679/; e.g. "T047"
    temporality  VARCHAR(100)  Occurrence in time; e.g. "Recent", "historical", "hypothetical"
    vocabs       VARCHAR(255)  List of UMLS vocabularies; e.g. "AIR,MSH,NDFRT,MEDLINEPLUS,NCI,LNC,NCI_FDA,NCI,MTH,AIR,ICD9CM,LNC,SNOMEDCT_US,LCH_NW,HPO,SNOMEDCT_US,ICD9CM,SNOMEDCT_US,COSTAR,CST,DXP,QMR,OMIM,OMIM,AOD,CSP,NCI_NCI-GLOSS,CHV"
    inst         VARCHAR(8)    Looks like a Concept Unique Identifier (CUI); 1 letter then 7 digits; e.g. "C0003873"
    inst_full    VARCHAR(255)  Looks like a URL to a CUI; e.g. "http://linkedlifedata.com/resource/umls/id/C0003873"
    language     VARCHAR(100)  Language; e.g. ""; ?will look like "ENG" for English? See https://www.nlm.nih.gov/research/umls/implementation_resources/query_diagrams/er1.html
    tui_full     VARCHAR(255)  TUI (?); e.g. "http://linkedlifedata.com/resource/semanticnetwork/id/T047"
indexdefs =
    pref    100
    sty     100
    tui     4
    inst    8


# -----------------------------------------------------------------------------
# Specimen KCL GATE pharmacotherapy processor definition
# -----------------------------------------------------------------------------
# https://github.com/KHP-Informatics/brc-gate-pharmacotherapy

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[processor:procdef_gate_pharmacotherapy]

destdb = DESTINATION_DATABASE
outputtypemap =
    Prescription output_prescription
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{GATE_PHARMACOTHERAPY_DIR}/application.xgapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --include_set Output
    --annotation Prescription
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --show_contents_on_crash
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[output:output_prescription]

desttable = medications_gate
renames =
    drug-type           drug_type
    dose-value          dose_value
    dose-unit           dose_unit
    dose-multiple       dose_multiple
    Directionality      directionality
    Experiencer         experiencer
    "Length of Time"    length_of_time
    Temporality         temporality
    "Unit of Time"      unit_of_time
null_literals =
    null
    ""
destfields =
    # Found by (a) manual inspection of BRC GATE pharmacotherapy output from
    # the GATE console; (b) inspection of
    # application-resources/schemas/Prescription.xml
    # Note preference for DECIMAL over FLOAT/REAL; see
    # https://stackoverflow.com/questions/1056323
    # Note that not all annotations appear for all texts. Try e.g.:
    #   Please start haloperidol 5mg tds.
    #   I suggest you start haloperidol 5mg tds for one week.
    rule            VARCHAR(100)  Rule yielding this drug. Not in XML but is present in a subset: e.g. "weanOff"; max length unclear
    drug            VARCHAR(200)  Drug name. Required string; e.g. "haloperidol"; max length 47 from "wc -L BNF_generic.lst", 134 from BNF_trade.lst
    drug_type       VARCHAR(100)  Type of drug name. Required string; from "drug-type"; e.g. "BNF_generic"; ?length of longest drug ".lst" filename
    dose            VARCHAR(100)  Dose text. Required string; e.g. "5mg"; max length unclear
    dose_value      DECIMAL       Numerical dose value. Required numeric; from "dose-value"; "double" in the XML but DECIMAL probably better; e.g. 5.0
    dose_unit       VARCHAR(100)  Text of dose units. Required string; from "dose-unit"; e.g. "mg"; max length unclear
    dose_multiple   INT           Dose count (multiple). Required integer; from "dose-multiple"; e.g. 1
    route           VARCHAR(7)    Route of administration. Required string; one of: "oral", "im", "iv", "rectal", "sc", "dermal", "unknown"
    status          VARCHAR(10)   Change in drug status. Required; one of: "start", "continuing", "stop"
    tense           VARCHAR(7)    Tense in which drug is referred to. Required; one of: "past", "present"
    date            VARCHAR(100)  ?. Optional string; max length unclear
    directionality  VARCHAR(100)  ?. Optional string; max length unclear
    experiencer     VARCHAR(100)  Person experiencing the drug-related event. Optional string; e.g. "Patient"
    frequency       DECIMAL       Frequency (times per <time_unit>). Optional numeric; "double" in the XML but DECIMAL probably better
    interval        DECIMAL       The n in "every n <time_unit>s" (1 for "every <time_unit>"). Optional numeric; "double" in the XML but DECIMAL probably better
    length_of_time  VARCHAR(100)  ?. Optional string; from "Length of Time"; max length unclear
    temporality     VARCHAR(100)  ?. Optional string; e.g. "Recent", "Historical"
    time_unit       VARCHAR(100)  Unit of time (see frequency, interval). Optional string; from "time-unit"; e.g. "day"; max length unclear
    unit_of_time    VARCHAR(100)  ?. Optional string; from "Unit of Time"; max length unclear
    when            VARCHAR(100)  ?. Optional string; max length unclear
indexdefs =
    rule    100
    drug    200
    route   7
    status  10
    tense   7


# -----------------------------------------------------------------------------
# Specimen KCL Lewy Body Diagnosis Application (LBDA) processor definition
# -----------------------------------------------------------------------------
# https://github.com/KHP-Informatics/brc-gate-LBD

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[processor:procdef_gate_kcl_lbda]

    # "cDiagnosis" is the "confirmed diagnosis" field, as d/w Jyoti Jyoti
    # 2018-03-20; see also README.md. This appears in the "Automatic" and the
    # unnamed set. There is also a near-miss one, "DiagnosisAlmost", which
    # appears in the unnamed set.
    #   "Mr Jones has Lewy body dementia."
    #       -> DiagnosisAlmost
    #   "Mr Jones has a diagnosis of Lewy body dementia."
    #       -> DiagnosisAlmost, cDiagnosis
    # Note that we must use lower case in the outputtypemap.

destdb = DESTINATION_DATABASE
outputtypemap =
    cDiagnosis output_lbd_diagnosis
    DiagnosisAlmost output_lbd_diagnosis
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --gate_app "{KCL_LBDA_DIR}/application.xgapp"
    --pluginfile "{GATE_PLUGIN_FILE}"
    --set_annotation "" DiagnosisAlmost
    --set_annotation Automatic cDiagnosis
    --input_terminator END_OF_TEXT_FOR_NLP
    --output_terminator END_OF_NLP_OUTPUT_RECORD
    --log_tag {NLPLOGTAG}
    --suppress_gate_stdout
    --verbose
progenvsection = MY_ENV_SECTION
input_terminator = END_OF_TEXT_FOR_NLP
output_terminator = END_OF_NLP_OUTPUT_RECORD
# max_external_prog_uses = 1000

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Define the output tables used by this GATE processor
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[output:output_lbd_diagnosis]

desttable = lewy_body_dementia_gate
null_literals =
    null
    ""
destfields =
    # Found by
    # (a) manual inspection of output from the GATE Developer console:
    # - e.g. {rule=Includefin, text=Lewy body dementia}
    # (b) inspection of contents:
    # - run a Cygwin shell
    # - find . -type f -exec grep cDiagnosis -l {} \;
    # - 3 hits:
    #       ./application-resources/jape/DiagnosisExclude2.jape
    #           ... part of the "Lewy"-detection apparatus
    #       ./application-resources/jape/text-feature.jape
    #           ... adds "text" annotation to cDiagnosis Token
    #       ./application.xgapp
    #           ... in annotationTypes
    # On that basis:
    rule            VARCHAR(100)  Rule that generated the hit.
    text            VARCHAR(200)  Text that matched the rule.
indexdefs =
    rule    100
    text    200


# -----------------------------------------------------------------------------
# Specimen MedEx processor definition
# -----------------------------------------------------------------------------
# https://sbmi.uth.edu/ccb/resources/medex.htm

[processor:procdef_medex_medications]

destdb = DESTINATION_DATABASE
desttable = medications_medex
progargs =
    java
    -classpath {NLPPROGDIR}:{MEDEXDIR}/bin:{MEDEXDIR}/lib/*
    -Dfile.encoding=UTF-8
    CrateMedexPipeline
    -lt {NLPLOGTAG}
    -v -v
# ... other arguments are added by the code
progenvsection = MY_ENV_SECTION


# =============================================================================
# C. Environment variable definitions
# =============================================================================
# - You'll need to modify this according to your local configuration.

[env:MY_ENV_SECTION]

GATE_HOME = /path/to/GATE_Developer_8.6.1
GATE_PHARMACOTHERAPY_DIR = /path/to/brc-gate-pharmacotherapy
GATE_PLUGIN_FILE = /path/to/specimen_gate_plugin_file.ini
KCL_LBDA_DIR = /path/to/brc-gate-LBD/Lewy_Body_Diagnosis
KCONNECTDIR = /path/to/yodie-pipeline-1-2-umls-only
MEDEXDIR = /path/to/Medex_UIMA_1.3.6
NLPPROGDIR = /path/to/crate_anon/nlp_manager/compiled_nlp_classes
OS_PATHSEP = :


# =============================================================================
# D. Input field definitions
# =============================================================================

[input:INPUT_FIELD_CLINICAL_DOCUMENTS]

srcdb = SOURCE_DATABASE
srctable = EXTRACTED_CLINICAL_DOCUMENTS
srcpkfield = DOCUMENT_PK
srcfield = DOCUMENT_TEXT
srcdatetimefield = DOCUMENT_DATE
copyfields =
    RID_FIELD
    TRID_FIELD
indexed_copyfields =
    RID_FIELD
    TRID_FIELD
# debug_row_limit = 0

[input:INPUT_FIELD_PROGRESS_NOTES]

srcdb = SOURCE_DATABASE
srctable = PROGRESS_NOTES
srcpkfield = PN_PK
srcfield = PN_TEXT
srcdatetimefield = PN_DATE
copyfields =
    RID_FIELD
    TRID_FIELD
indexed_copyfields =
    RID_FIELD
    TRID_FIELD


# =============================================================================
# E. Database definitions, each in its own section
# =============================================================================

[database:SOURCE_DATABASE]

url = mysql+mysqldb://anontest:XXX@127.0.0.1:3306/anonymous_output?charset=utf8

[database:DESTINATION_DATABASE]

url = mysql+mysqldb://anontest:XXX@127.0.0.1:3306/anonymous_output?charset=utf8


# =============================================================================
# F. Information for using cloud-based NLP
# =============================================================================

[cloud:my_uk_cloud_service]

cloud_url = https://your_url
username = your_username
password = your_password
wait_on_conn_err = 180
max_content_length = 0
max_records_per_request = 1000
limit_before_commit = 1000
stop_at_failure = true
max_tries = 5
rate_limit_hz = 2

[processor:procdef_cloud_crp]

destdb = DESTINATION_DATABASE
desttable = crp_test
processor_name = crate_anon.nlp_manager.parse_biochemistry.Crp
processor_format = Standard
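
The specimen above relies on cross-references between sections: an NLP definition names processors, each processor names a destination database, and external-program processors name an environment section whose values fill the {PLACEHOLDERS} in progargs. The following is a minimal sketch of how those links resolve, using only Python's standard configparser and shlex modules on an abridged copy of the specimen. It is not CRATE's own loader. The helper names (load, multiline, resolve_nlpdef, expand_progargs) are hypothetical, and the assumption that placeholders are filled by str.format-style substitution, with NLPLOGTAG supplied at run time, is ours.

```python
import configparser
import shlex

# Abridged copy of a few sections from the specimen config above.
SPECIMEN = """
[nlpdef:cloud_nlp_demo]
inputfielddefs =
    INPUT_FIELD_CLINICAL_DOCUMENTS
    INPUT_FIELD_PROGRESS_NOTES
processors =
    Cloud procdef_cloud_crp
progressdb = DESTINATION_DATABASE
hashphrase = doesnotmatter

[processor:procdef_cloud_crp]
destdb = DESTINATION_DATABASE
desttable = crp_test

[processor:procdef_gate_name_location]
destdb = DESTINATION_DATABASE
progargs =
    java
    -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/lib/*"
    -Dgate.home="{GATE_HOME}"
    CrateGatePipeline
    --log_tag {NLPLOGTAG}
progenvsection = MY_ENV_SECTION

[env:MY_ENV_SECTION]
GATE_HOME = /path/to/GATE_Developer_8.6.1
NLPPROGDIR = /path/to/crate_anon/nlp_manager/compiled_nlp_classes
OS_PATHSEP = :

[database:DESTINATION_DATABASE]
url = mysql+mysqldb://anontest:XXX@127.0.0.1:3306/anonymous_output?charset=utf8
"""


def load(text):
    """Parse config text without '%' interpolation, keeping option-name case."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # so GATE_HOME is not lower-cased to gate_home
    parser.read_string(text)
    return parser


def multiline(parser, section, option):
    """Return the non-blank lines of a multi-line option, stripped."""
    raw = parser.get(section, option)
    return [line.strip() for line in raw.splitlines() if line.strip()]


def resolve_nlpdef(parser, name):
    """Follow nlpdef -> processor -> database cross-references."""
    resolved = []
    for line in multiline(parser, f"nlpdef:{name}", "processors"):
        proctype, procdef = line.split()
        proc = parser[f"processor:{procdef}"]  # KeyError if the section is missing
        dest = parser[f"database:{proc['destdb']}"]
        resolved.append((proctype, procdef, proc.get("desttable"), dest["url"]))
    return resolved


def expand_progargs(parser, procdef, logtag="demo"):
    """Fill {PLACEHOLDERS} in progargs from the env section; split into argv."""
    proc = parser[f"processor:{procdef}"]
    env = dict(parser[f"env:{proc['progenvsection']}"])
    env["NLPLOGTAG"] = logtag  # assumption: provided by the caller at run time
    return shlex.split(proc["progargs"].format(**env))


PARSER = load(SPECIMEN)

if __name__ == "__main__":
    print(resolve_nlpdef(PARSER, "cloud_nlp_demo"))
    print(expand_progargs(PARSER, "procdef_gate_name_location"))
```

With shell-style splitting, the adjacent quoted pieces of the -classpath value join into a single argument, so the separator held in OS_PATHSEP ends up between the two classpath entries. A missing [processor:...], [database:...] or [env:...] section raises KeyError in this sketch, which is one simple way to spot a broken cross-reference before running the NLP manager.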

Footnotes

[1] Internally, the config file section is represented by the crate_anon.nlp_manager.nlp_definition.NlpDefinition class, which acts as the master config class.

[2] Internally, this information is represented by the crate_anon.nlp_manager.input_field_config.InputFieldConfig class.

[3] Internally, this information is represented by classes such as crate_anon.nlp_manager.parse_gate.Gate and crate_anon.nlp_manager.parse_biochemistry.Crp, which are subclasses of crate_anon.nlp_manager.base_nlp_parser.BaseNlpParser.

[4] Internally, this information is represented by the crate_anon.nlp_manager.cloud_config.CloudConfig class.

[5] https://en.wikipedia.org/wiki/Hash_function

[6] Internally, this information is represented by the crate_anon.nlp_manager.output_user_config.OutputUserConfig class.