7.5. GATE NLP applications

CRATE performs GATE NLP via an external program, GATE [1], which runs in Java. CRATE supplies a Java front end (CrateGatePipeline.java) that loads a GATE application, sends text to it, and returns the results.

In general, CRATE sends text to the external program (via stdin) and expects a result (via stdout) as a set of tab-separated value (TSV) lines corresponding to the expected destination fields.

The CrateGatePipeline.java program takes arguments that describe how a specific GATE application should be handled.

7.5.1. Output columns

In addition to the standard NLP output columns, the CRATE GATE processor produces these output columns:

Column     SQL type      Description
_set       VARCHAR(64)   GATE output set name
_type      VARCHAR(64)   GATE annotation type name (e.g. ‘Person’)
_id        INT           GATE annotation ID. Not clear that this is very useful.
_start     INT           Start position in the content
_end       INT           End position in the content
_content   TEXT          Full content marked as relevant. (Not the entire content of the source field.)

These default output columns are prefixed with an underscore to reduce the risk of name clashes, since GATE applications can themselves generate arbitrary column names. For example, the demonstration GATE Person app generates these:

rule
firstname
surname
gender
kind

You tell CRATE about the specific fields produced by a GATE application using the destfields option; see the NLP config file.
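
To illustrate why the underscore prefix helps, here is a minimal Python sketch that combines the default columns listed in 7.5.1 with an application's own fields and rejects any clash. The helper name combined_columns is hypothetical and is not part of CRATE; the example field names are those of the demonstration Person app.

```python
from typing import List

# Default columns added by the CRATE GATE processor (as listed in 7.5.1).
GATE_DEFAULT_COLUMNS = ["_set", "_type", "_id", "_start", "_end", "_content"]


def combined_columns(app_fields: List[str]) -> List[str]:
    """
    Return the default GATE columns followed by the application's own fields.
    Raises ValueError if an application field clashes with a default column.
    (Hypothetical helper for illustration; not CRATE's implementation.)
    """
    clashes = sorted(set(app_fields) & set(GATE_DEFAULT_COLUMNS))
    if clashes:
        raise ValueError(f"Field name clash with default columns: {clashes}")
    return GATE_DEFAULT_COLUMNS + list(app_fields)


# Fields generated by the demonstration GATE Person app.
person_fields = ["rule", "firstname", "surname", "gender", "kind"]
print(combined_columns(person_fields))
```

Because none of the Person fields starts with an underscore, the combination succeeds; an application field named e.g. _id would be rejected.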

7.5.2. crate_nlp_build_gate_java_interface

This program builds CrateGatePipeline.

Options:

usage: crate_nlp_build_gate_java_interface [-h] [--builddir BUILDDIR]
                                           [--gatedir GATEDIR] [--java JAVA]
                                           [--javac JAVAC] [--verbose]
                                           [--launch]

Compile Java classes for CRATE's interface to GATE

optional arguments:
  -h, --help           show this help message and exit
  --builddir BUILDDIR  Output directory for compiled .class files (default: /h
                       ome/rudolf/Documents/code/crate/crate_anon/nlp_manager/
                       compiled_nlp_classes)
  --gatedir GATEDIR    Root directory of GATE installation (default:
                       /home/rudolf/dev/GATE_Developer_8.0)
  --java JAVA          Java executable (default: java)
  --javac JAVAC        Java compiler (default: javac)
  --verbose, -v        Be verbose (use twice for extra verbosity) (default: 0)
  --launch             Launch script in demonstration mode (having previously
                       compiled it) (default: False)

# Generated at 2019-08-08 17:03:59

7.5.3. CrateGatePipeline

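The following specimen scripts presuppose that you have set the environment variables GATE_DIR and CRATE_SOURCE_ROOT (and, where used, an application-specific directory variable such as GATE_KCONNECT_DIR). They also assume specific locations for the compiled Java (e.g. files like CrateGatePipeline.class); edit them as required.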

Asking CrateGatePipeline to show its command-line options:

#!/usr/bin/env bash

CRATE_NLP_JAVA_CLASS_DIR=${CRATE_SOURCE_ROOT}/crate_anon/nlp_manager/compiled_nlp_classes

java \
    -classpath "${CRATE_NLP_JAVA_CLASS_DIR}":"${GATE_DIR}/bin/gate.jar":"${GATE_DIR}/lib/*" \
    -Dgate.home="${GATE_DIR}" \
    CrateGatePipeline \
    --help \
    -v -v

exit 0

The resulting output:

usage: CrateGatePipeline -g GATEAPP [-a ANN [-a ANN [...]]]
                         [--include_set SET [--include_set SET [...]]]
                         [--exclude_set SET [--exclude_set SET [...]]]
                         [-e ENCODING] [-it TERM] [-ot TERM] [-lt LOGTAG]
                         [-wa FILESTEM] [-wg FILESTEM] [-wt FILESTEM]
                         [-s] [--show_contents_on_crash]
                         [-h] [-v [-v [-v]]]

Java front end to GATE natural language processor.

- Takes input on stdin. Produces output on stdout.
- GATE applications produce output clustered (1) into named annotation sets
  (with a default, unnamed set). (2) Within annotation sets, we find
  annotations. (3) Each annotation is a collection of key/value pairs.
  This collection is not fixed, in that individual annotations, or keys within
  annotations, may be present sometimes and absent sometimes, depending on the
  input text.

Required arguments:

  --gate_app GATEAPP
  -g GATEAPP
                   Specifies the GATE app (.gapp/.xgapp) file to use.

Optional arguments:

  --include_set SET
  --exclude_set SET
                   Includes or excludes the specified GATE set, by name.
                   By default, the inclusion list is empty, and the exclusion
                   list is also empty. By specifying set names here, you add
                   to the inclusion or exclusion list. You can specify each
                   option multiple times. Then, the rules are as follows:
                   the output from a GATE set is included if (A) the inclusion
                   list is empty OR the set is on the inclusion list, AND (B)
                   the set is not on the exclusion list. Note also that there
                   is a default set with no name; refer to this one using
                   the empty string "". Set names are compared in a
                   case-sensitive manner.

  --annotation ANNOT
  -a ANNOT
                   Adds the specified annotation to the target list.
                   If you don't specify any, you'll get them all.

  --set_annotation SET ANNOT
  -sa SET ANNOT
                   Adds the specific set/annotation combination to the target
                   list. Use this option for maximum control. You cannot mix
                   --annotation and --set_annotation.

  --encoding ENCODING
  -e ENCODING
                   The character encoding of the source documents, to be used
                   for file output. If not specified, the platform default
                   encoding (currently "UTF-8") is assumed.

  --input_terminator TERMINATOR
  -it TERMINATOR
                   Specify stdin end-of-document terminator.

  --output_terminator TERMINATOR
  -ot TERMINATOR
                   Specify stdout end-of-document terminator.

  --log_tag LOGTAG
  -lt LOGTAG
                   Use an additional tag for stderr logging.
                   Helpful in multiprocess environments.

  --write_annotated_xml FILESTEM
  -wa FILESTEM
                   Write annotated XML document to FILESTEM<n>.xml, where <n>
                   is the file's sequence number (starting from 0).

  --write_gate_xml FILESTEM
  -wg FILESTEM
                   Write GateXML document to FILESTEM<n>.xml.

  --write_tsv FILESTEM
  -wt FILESTEM
                   Write TSV-format annotations to FILESTEM<n>.tsv.

  --suppress_gate_stdout
  -s
                   Suppress any stdout from GATE application.

  --show_contents_on_crash
  -show_contents_on_crash
                   If GATE crashes, report the current text to stderr (as well
                   as reporting the error).
                   (WARNING: likely to contain identifiable material.)

  --continue_on_crash
  -c
                   If GATE crashes, carry on after reporting the error.

  --help
  -h
                   Show this help message and exit.

  --verbose
  -v
                   Verbose (use up to 3 times to be more verbose).

# Generated at 2019-10-10 10:23:25
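
The inclusion/exclusion rule for GATE sets described in the help text can be restated as a short Python sketch. This only illustrates the stated rule; the function name set_is_included is hypothetical and this is not the Java implementation.

```python
from typing import Collection


def set_is_included(set_name: str,
                    include_sets: Collection[str] = (),
                    exclude_sets: Collection[str] = ()) -> bool:
    """
    Apply the rule from the help text: a set's output is included if
    (A) the inclusion list is empty OR the set is on the inclusion list, AND
    (B) the set is not on the exclusion list.
    Names are compared case-sensitively; "" denotes the default (unnamed) set.
    """
    passes_inclusion = (not include_sets) or (set_name in include_sets)
    passes_exclusion = set_name not in exclude_sets
    return passes_inclusion and passes_exclusion


# With no lists specified, every set (including the default set) is included.
print(set_is_included("Output"))
print(set_is_included(""))
# With "--include_set Output", the default set is no longer included.
print(set_is_included("", include_sets=["Output"]))
```

Note that exclusion takes priority: a set named on both lists is excluded.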

Asking CrateGatePipeline to run the GATE “ANNIE” demonstration:

#!/usr/bin/env bash

CRATE_NLP_JAVA_CLASS_DIR=${CRATE_SOURCE_ROOT}/crate_anon/nlp_manager/compiled_nlp_classes

# Note the unrealistic use of "STOP" as an end-of-input marker. Don't use that for real!

java \
    -classpath "${CRATE_NLP_JAVA_CLASS_DIR}":"${GATE_DIR}/bin/gate.jar":"${GATE_DIR}/lib/*" \
    -Dgate.home="${GATE_DIR}" \
    CrateGatePipeline \
    -g "${GATE_DIR}/plugins/ANNIE/ANNIE_with_defaults.gapp" \
    -a Person \
    -a Location \
    -it STOP \
    -ot END_OF_NLP_OUTPUT_RECORD \
    -lt . \
    -v -v
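
The terminator options (-it, -ot) let one long-running process handle many documents over stdin/stdout. The Python sketch below shows the general shape of such an exchange. It makes several assumptions: that each terminator travels on a line of its own, and that a small stand-in child process (written here in Python, and NOT CrateGatePipeline) is an acceptable substitute for demonstration. The names FAKE_NLP_CHILD, start_child and send_document are hypothetical.

```python
import subprocess
import sys
from typing import List

INPUT_TERMINATOR = "END_OF_TEXT_FOR_NLP"
OUTPUT_TERMINATOR = "END_OF_NLP_OUTPUT_RECORD"

# Stand-in child (NOT CrateGatePipeline): reads lines until the input
# terminator, emits one tab-separated line per word, then the output
# terminator. It exists only so that this sketch runs without Java/GATE.
FAKE_NLP_CHILD = r'''
import sys
doc = []
for line in sys.stdin:
    line = line.rstrip("\n")
    if line == "END_OF_TEXT_FOR_NLP":
        for word in " ".join(doc).split():
            print("word\t" + word)
        print("END_OF_NLP_OUTPUT_RECORD", flush=True)
        doc = []
    else:
        doc.append(line)
'''


def start_child() -> subprocess.Popen:
    """Start the stand-in child with text-mode pipes on stdin and stdout."""
    return subprocess.Popen(
        [sys.executable, "-c", FAKE_NLP_CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def send_document(proc: subprocess.Popen, text: str) -> List[List[str]]:
    """
    Send one document followed by the input terminator, then read
    tab-separated lines until the output terminator arrives.
    """
    proc.stdin.write(text + "\n" + INPUT_TERMINATOR + "\n")
    proc.stdin.flush()
    rows = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        if line == OUTPUT_TERMINATOR:
            break
        rows.append(line.split("\t"))
    return rows


if __name__ == "__main__":
    child = start_child()
    print(send_document(child, "Alice met Bob"))
    print(send_document(child, "in London"))
    child.stdin.close()  # end of input: the child exits
    child.wait()
```

The terminators must be strings that cannot occur in the text itself, which is why “STOP” is unsuitable for real use.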

7.5.4. KConnect (Bio-YODIE)

This GATE application finds diseases.

You can test the application via the GATE Developer console. See “Testing a GATE application manually” (7.5.7).

See the specimen CRATE NLP config file.

Script to test the app via the command line:

#!/usr/bin/env bash

CRATE_NLP_JAVA_CLASS_DIR=${CRATE_SOURCE_ROOT}/crate_anon/nlp_manager/compiled_nlp_classes

# todo: Bug: this (run_gate_kcl_connect.sh) isn't working on Osprey; KConnect crashes

java \
    -classpath "${CRATE_NLP_JAVA_CLASS_DIR}":"${GATE_DIR}/bin/gate.jar":"${GATE_DIR}/lib/*" \
    -Dgate.home="${GATE_DIR}" \
    CrateGatePipeline \
    --gate_app "${GATE_KCONNECT_DIR}/main-bio/main-bio.xgapp" \
    --annotation Disease_or_Syndrome \
    --input_terminator END_OF_TEXT_FOR_NLP \
    --output_terminator END_OF_NLP_OUTPUT_RECORD \
    --suppress_gate_stdout \
    --show_contents_on_crash \
    -v -v

7.5.5. KCL pharmacotherapy application

This GATE application finds drugs (medications).

You can test the application via the GATE Developer console. See “Testing a GATE application manually” (7.5.7).

See the specimen CRATE NLP config file.

Script to test the app via the command line:

#!/usr/bin/env bash

CRATE_NLP_JAVA_CLASS_DIR=${CRATE_SOURCE_ROOT}/crate_anon/nlp_manager/compiled_nlp_classes

java \
    -classpath "${CRATE_NLP_JAVA_CLASS_DIR}":"${GATE_DIR}/bin/gate.jar":"${GATE_DIR}/lib/*" \
    -Dgate.home="${GATE_DIR}" \
    CrateGatePipeline \
    --gate_app "${GATE_PHARMACOTHERAPY_DIR}/application.xgapp" \
    --include_set Output \
    --annotation Prescription \
    --input_terminator END_OF_TEXT_FOR_NLP \
    --output_terminator END_OF_NLP_OUTPUT_RECORD \
    --suppress_gate_stdout \
    --show_contents_on_crash \
    -v -v

7.5.6. KCL Lewy Body Diagnosis Application

This GATE application finds references to Lewy body dementia.

  • Clone it from https://github.com/KHP-Informatics/brc-gate-LBD
  • As of 2018-03-20, the Git repository just contains a zip. Unzip it.
  • The main application is called application.xgapp.
  • The principal annotation is called cDiagnosis (“confirmed diagnosis”), which has rule and text elements.

You can test the application via the GATE Developer console. See “Testing a GATE application manually” (7.5.7).

See the specimen CRATE NLP config file.

Script to test the app via the command line:

#!/usr/bin/env bash

CRATE_NLP_JAVA_CLASS_DIR=${CRATE_SOURCE_ROOT}/crate_anon/nlp_manager/compiled_nlp_classes

java \
    -classpath "${CRATE_NLP_JAVA_CLASS_DIR}":"${GATE_DIR}/bin/gate.jar":"${GATE_DIR}/lib/*" \
    -Dgate.home="${GATE_DIR}" \
    CrateGatePipeline \
    --gate_app "${GATE_LEWY_BODY_DIAGNOSIS_DIR}/application.xgapp" \
    --set_annotation "" DiagnosisAlmost \
    --set_annotation Automatic cDiagnosis \
    --input_terminator END_OF_TEXT_FOR_NLP \
    --output_terminator END_OF_NLP_OUTPUT_RECORD \
    --suppress_gate_stdout \
    --show_contents_on_crash \
    -v -v
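
The specimen scripts above share one pattern, which could also be assembled programmatically. The following Python sketch builds an equivalent argument list for use with subprocess; the function name build_gate_command and its parameters are hypothetical, and only options shown in the CrateGatePipeline help text are used.

```python
import os
from typing import List, Sequence, Tuple


def build_gate_command(class_dir: str,
                       gate_dir: str,
                       gate_app: str,
                       annotations: Sequence[str] = (),
                       set_annotations: Sequence[Tuple[str, str]] = (),
                       input_terminator: str = "END_OF_TEXT_FOR_NLP",
                       output_terminator: str = "END_OF_NLP_OUTPUT_RECORD",
                       java: str = "java") -> List[str]:
    """
    Build a CrateGatePipeline command line like the specimen scripts.
    (Hypothetical helper for illustration; not part of CRATE.)
    """
    if annotations and set_annotations:
        # The help text says these two options cannot be mixed.
        raise ValueError("Cannot mix --annotation and --set_annotation")
    # Java expands the "lib/*" wildcard itself; no shell expansion is needed.
    classpath = os.pathsep.join([
        class_dir,
        os.path.join(gate_dir, "bin", "gate.jar"),
        os.path.join(gate_dir, "lib", "*"),
    ])
    cmd = [java, "-classpath", classpath, f"-Dgate.home={gate_dir}",
           "CrateGatePipeline", "--gate_app", gate_app]
    for annotation in annotations:
        cmd += ["--annotation", annotation]
    for set_name, annotation in set_annotations:
        cmd += ["--set_annotation", set_name, annotation]
    cmd += ["--input_terminator", input_terminator,
            "--output_terminator", output_terminator]
    return cmd


# Equivalent of the Lewy body script's core arguments (paths are placeholders):
print(build_gate_command(
    class_dir="/path/to/compiled_nlp_classes",
    gate_dir="/path/to/gate",
    gate_app="/path/to/lbd/application.xgapp",
    set_annotations=[("", "DiagnosisAlmost"), ("Automatic", "cDiagnosis")],
))
```

Passing the arguments as a list avoids shell quoting problems, for example with the empty string that names the default set.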

7.5.7. Testing a GATE application manually

The illustration below assumes that the main GATE application file is called main-bio.xgapp, which is correct for KConnect. Other applications use a different file name; see the relevant sections above.

  • Run GATE Developer.
  • Load the application:
    • File ‣ Restore application from file
    • find main-bio.xgapp, in the downloaded KConnect directory (or whichever the appropriate .xgapp file is for your application);
    • load this;
    • wait until it’s finished loading.
  • Create a document:
    • Right-click Language Resources ‣ New ‣ GATE Document
    • name it (e.g. my_test_doc);
    • open it;
    • paste some text in the “Text” window.
  • Create a corpus:
    • Right-click Language Resources ‣ New ‣ GATE Corpus
    • name it (e.g. my_test_corpus);
    • open it;
    • add the document (e.g. with the icon looking like ‘G+’).
  • View the application:
    • Go to the application tab (main-bio.xgapp), or double-click main-bio.xgapp in the left-hand tree (under Applications) to open it if it’s not already open. For other applications: find the appropriate application in the “Applications” tree and double-click it.
    • Make sure your corpus is selected in the “Corpus:” section. (There should already be a bunch of things in the top-right-hand box, “Selected processing resources”; for example, for KConnect, you’ll see “MP:preprocess” through to “MP:finalize”.)
    • Click “Run this Application” to run the application over the corpus.
  • To see the results, go back to the document, and toggle on both “Annotation Sets” and “Annotation Lists”. If you tick “sets” in the Annotation Sets window (at the right; it’s colourful) you should see specific annotations in the Annotation List window (at the bottom).

7.5.8. Troubleshooting GATE

Out of Java heap space

You may see an error like “Out of memory error: java heap space”.

  • On a Windows machine, set the _JAVA_OPTIONS environment variable (not JAVA_OPTS; we’re not sure when that one applies).
    • Edit environment variables via the control panel or e.g. rundll32 sysdm.cpl,EditEnvironmentVariables.
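    • For the user or the system (as you prefer), set _JAVA_OPTIONS to e.g. -Xms2048m -Xmx4096m -XX:MaxPermSize=1024m. (The -XX:MaxPermSize option applies only to Java 7 and earlier; Java 8 and later have no permanent generation, so omit it there.)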
  • Restart the relevant application (e.g. GATE Developer) and retry.
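
If you launch Java from a script rather than relying on a system-wide setting, the same variable can be set for a single child process. The Python sketch below shows one possible approach; the helper name env_with_java_heap is hypothetical, and it omits -XX:MaxPermSize, which applies only to Java 7 and earlier.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_java_heap(initial_mb: int = 2048,
                       max_mb: int = 4096,
                       base_env: Optional[Mapping[str, str]] = None
                       ) -> Dict[str, str]:
    """
    Return a copy of the environment with _JAVA_OPTIONS set to the given
    initial (-Xms) and maximum (-Xmx) heap sizes, in megabytes.
    Any existing _JAVA_OPTIONS value in the copy is overwritten.
    (Hypothetical helper for illustration; not part of CRATE.)
    """
    if initial_mb > max_mb:
        raise ValueError("Initial heap size must not exceed maximum heap size")
    env = dict(os.environ if base_env is None else base_env)
    env["_JAVA_OPTIONS"] = f"-Xms{initial_mb}m -Xmx{max_mb}m"
    return env


# Example use (not executed here):
#   import subprocess
#   subprocess.run(["java", "-version"], env=env_with_java_heap())
print(env_with_java_heap()["_JAVA_OPTIONS"])
```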


Footnotes

[1] University of Sheffield (2016). “GATE: General Architecture for Text Engineering.” https://gate.ac.uk/