7.6. GATE NLP applications

GATE NLP is done via an external program, GATE [1], which runs in Java. CRATE supplies an external front-end Java program (CrateGatePipeline.java) that loads a GATE app, sends text to it, and returns answers.

In general, CRATE sends text to the external program (via stdin) and expects a result (via stdout) as a set of tab-separated value (TSV) lines corresponding to the expected destination fields.
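As a rough illustration of the receiving side of this exchange, the Python sketch below collects result lines for one document. It assumes, purely for illustration, that each output line carries alternating key/value pairs separated by tabs and that the end-of-document terminator appears on a line of its own; the function names are hypothetical and are not CRATE's own.

```python
from typing import Dict, Iterable, List


def parse_key_value_tsv(line: str) -> Dict[str, str]:
    """Convert 'key1<TAB>value1<TAB>key2<TAB>value2' into a dictionary.

    ASSUMPTION: alternating key/value layout, chosen for illustration.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) % 2 != 0:
        raise ValueError(f"Odd number of TSV fields: {parts!r}")
    return dict(zip(parts[0::2], parts[1::2]))


def read_one_document(lines: Iterable[str],
                      terminator: str) -> List[Dict[str, str]]:
    """Collect result rows until the end-of-document terminator line.

    ASSUMPTION: the terminator occupies a whole line by itself.
    """
    rows = []
    for line in lines:
        if line.rstrip("\n") == terminator:
            break  # this document's results are complete
        if line.strip():  # skip blank lines
            rows.append(parse_key_value_tsv(line))
    return rows
```

In practice the lines would come from the external program's stdout, and the values arrive as strings, so numeric columns would still need converting before being written to a database.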

The CrateGatePipeline.java program takes arguments that describe how a specific GATE application should be handled.
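A minimal sketch of assembling such arguments in Python follows. The option names are those listed in the CrateGatePipeline help text in section 7.6.3; the function itself and its parameter names are hypothetical, and how the Java program is launched (classpath and so on) is deliberately left out.

```python
from typing import List, Optional, Sequence, Tuple


def build_pipeline_args(
        gate_app: str,
        include_sets: Sequence[str] = (),
        exclude_sets: Sequence[str] = (),
        annotations: Sequence[str] = (),
        set_annotations: Sequence[Tuple[str, str]] = (),
        input_terminator: Optional[str] = None,
        output_terminator: Optional[str] = None) -> List[str]:
    """Build the argument list that follows the CrateGatePipeline class name."""
    # The help text says these two options cannot be mixed.
    if annotations and set_annotations:
        raise ValueError("Cannot mix --annotation and --set_annotation")
    args = ["--gate_app", gate_app]
    for set_name in include_sets:
        args += ["--include_set", set_name]
    for set_name in exclude_sets:
        args += ["--exclude_set", set_name]
    for annot in annotations:
        args += ["--annotation", annot]
    for set_name, annot in set_annotations:
        args += ["--set_annotation", set_name, annot]
    if input_terminator is not None:
        args += ["--input_terminator", input_terminator]
    if output_terminator is not None:
        args += ["--output_terminator", output_terminator]
    return args
```

Returning a list rather than a single string keeps each argument intact when it is passed to a process launcher, which matters for values such as the empty string that names the default annotation set.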

7.6.1. Output columns

In addition to the standard NLP output columns, the CRATE GATE processor produces these output columns:

Column      SQL type       Description
_set        VARCHAR(64)    GATE output set name
_type       VARCHAR(64)    GATE annotation type name (e.g. ‘Person’)
_id         INT            GATE annotation ID. Not clear that this is very useful.
_start      INT            Start position in the content
_end        INT            End position in the content
_content    TEXT           Full content marked as relevant. (Not the entire content of the source field.)

These default output columns are prefixed with an underscore to reduce the risk of name clashes, since GATE applications can themselves generate arbitrary column names. For example, the demonstration GATE Person app generates these:

rule
firstname
surname
gender
kind

You tell CRATE about the specific fields produced by a GATE application using the destfields option; see the NLP config file.
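To illustrate why the underscore prefix helps, the short Python sketch below checks a list of application-specific field names against the GATE-specific default columns listed above. It is only an illustration: the function name is hypothetical, CRATE is not claimed to perform this check in this way, and the standard NLP output columns are not included.

```python
from typing import Iterable, List

# The GATE-specific default columns described in this section.
GATE_DEFAULT_COLUMNS = ("_set", "_type", "_id", "_start", "_end", "_content")


def clashing_fields(app_fields: Iterable[str]) -> List[str]:
    """Return any application field names that collide with default columns."""
    return sorted(set(app_fields) & set(GATE_DEFAULT_COLUMNS))


# Fields generated by the demonstration GATE Person app:
person_fields = ["rule", "firstname", "surname", "gender", "kind"]
print(clashing_fields(person_fields))  # no clashes expected
```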

7.6.2. crate_nlp_build_gate_java_interface

This program builds CrateGatePipeline.

Options:

USAGE: crate_nlp_build_gate_java_interface [-h] [--builddir BUILDDIR]
                                           [--gatedir GATEDIR]
                                           [--gate_exec GATE_EXEC]
                                           [--java JAVA] [--javac JAVAC]
                                           [--verbose] [--launch]

Compile Java classes for CRATE's interface to GATE

OPTIONS:
  -h, --help            show this help message and exit
  --builddir BUILDDIR   Output directory for compiled .class files (default:
                        /path/to/crate/crate_anon/nlp_manager/compiled_nlp_cla
                        sses)
  --gatedir GATEDIR     Root directory of GATE installation (default:
                        /path/to/GATE/installation)
  --gate_exec GATE_EXEC
                        Path to GATE executable (JAR file). Temporary (future
                        releases may handle this differently). If not
                        specified, defaults to 'bin/gate.jar' within the GATE
                        directory. (default: None)
  --java JAVA           Java executable (default: java)
  --javac JAVAC         Java compiler (default: javac)
  --verbose, -v         Be verbose (use twice for extra verbosity) (default:
                        0)
  --launch              Launch script in demonstration mode (having previously
                        compiled it) (default: False)

7.6.3. CrateGatePipeline

The following specimen scripts presuppose that you have set the environment variable GATE_HOME, and assume specific locations for the compiled Java (e.g. files like CrateGatePipeline.class); edit them as required.

Asking CrateGatePipeline to show its command-line options:

crate_show_crate_gate_pipeline_options

The resulting output:

usage: CrateGatePipeline --gate_app GATEAPP
                         [--include_set SET [--include_set SET [...]]]
                         [--exclude_set SET [--exclude_set SET [...]]]
                         [--annotation ANNOT [--annotation ANNOT [...]]]
                         [--set_annotation SET ANNOT [...]]
                         [--encoding ENCODING]
                         [--input_terminator TERM]
                         [--output_terminator TERM]
                         [--log_tag LOGTAG]
                         [--write_annotated_xml FILESTEM]
                         [--write_gate_xml FILESTEM]
                         [--write_tsv FILESTEM]
                         [--suppress_gate_stdout]
                         [--show_contents_on_crash]
                         [-h] [-v [-v [-v]]]
                         [--loglevel <debug|info|warn|error>]
                         [--gateloglevel <debug|info|warn|error>]
                         [--pluginfile PLUGINFILE]
                         [--launch_then_stop]
                         [--demo]

Java front end to GATE natural language processor.

- Takes input on stdin. Produces output on stdout.
- GATE applications produce output clustered (1) into named annotation sets
  (with a default, unnamed set). (2) Within annotation sets, we find
  annotations. (3) Each annotation is a collection of key/value pairs.
  This collection is not fixed, in that individual annotations, or keys within
  annotations, may be present sometimes and absent sometimes, depending on the
  input text.

Optional arguments:

  --gate_app GATEAPP
  -g GATEAPP
                   Specifies the GATE app (.gapp/.xgapp) file to use.
                   REQUIRED unless specifying --demo.

  --include_set SET
  --exclude_set SET
                   Includes or excludes the specified GATE set, by name.
                   By default, the inclusion list is empty, and the exclusion
                   list is also empty. By specifying set names here, you add
                   to the inclusion or exclusion list. You can specify each
                   option multiple times. Then, the rules are as follows:
                   the output from a GATE set is included if (A) the inclusion
                   list is empty OR the set is on the inclusion list, AND (B)
                   the set is not on the exclusion list. Note also that there
                   is a default set with no name; refer to this one using
                   the empty string "". Set names are compared in a
                   case-sensitive manner.

  --annotation ANNOT
  -a ANNOT
                   Adds the specified annotation to the target list.
                   If you don't specify any, you'll get them all.

  --set_annotation SET ANNOT
  -sa SET ANNOT
                   Adds the specific set/annotation combination to the target
                   list. Use this option for maximum control. You cannot mix
                   --annotation and --set_annotation.

  --encoding ENCODING
  -e ENCODING
                   The character encoding of the source documents, to be used
                   for file output. If not specified, the platform default
                   encoding (currently "UTF-8") is assumed.

  --input_terminator TERMINATOR
  -it TERMINATOR
                   Specify stdin end-of-document terminator.

  --output_terminator TERMINATOR
  -ot TERMINATOR
                   Specify stdout end-of-document terminator.

  --log_tag LOGTAG
  -lt LOGTAG
                   Use an additional tag for stderr logging.
                   Helpful in multiprocess environments.

  --write_annotated_xml FILESTEM
  -wa FILESTEM
                   Write annotated XML document to FILESTEM<n>.xml, where <n>
                   is the file's sequence number (starting from 0).

  --write_gate_xml FILESTEM
  -wg FILESTEM
                   Write GateXML document to FILESTEM<n>.xml.

  --write_tsv FILESTEM
  -wt FILESTEM
                   Write TSV-format annotations to FILESTEM<n>.tsv.

  --suppress_gate_stdout
  -s
                   Suppress any stdout from GATE application.

  --show_contents_on_crash
  -show_contents_on_crash
                   If GATE crashes, report the current text to stderr (as well
                   as reporting the error).
                   (WARNING: likely to contain identifiable material.)

  --continue_on_crash
  -c
                   If GATE crashes, carry on after reporting the error.

  --help
  -h
                   Show this help message and exit.

  --verbose
  -v
                   Verbose (use up to 3 times to be more verbose).

  --loglevel LEVEL
                   Main log level. Overrides verbose. Options are:
                   debug, info, warn, error

  --gateloglevel LEVEL
                   GATE log level. Overrides verbose. Options are:
                   debug, info, warn, error

  --pluginfile PLUGINFILE
                   INI file specifying GATE plugins, including name,
                   location of Maven repository and version. See
                   specimen_gate_plugin_file.ini. A simple example:

                   [ANNIE]
                   name = annie
                   location = uk.ac.gate.plugins
                   version = 8.6

                   [Tools]
                   name = tools
                   location = uk.ac.gate.plugins
                   version = 8.6

  --launch_then_stop
                   Launch the GATE program, then stop immediately. (Used
                   to pre-download plugins.)

  --demo
                   Use the demo gapp file.
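The set inclusion/exclusion rule described under --include_set and --exclude_set in the help text above can be sketched in Python as follows. This is an illustration of the stated rule only, not the Java implementation; the function name is hypothetical.

```python
from typing import Collection


def set_is_included(set_name: str,
                    include: Collection[str],
                    exclude: Collection[str]) -> bool:
    """Apply the rule: (inclusion list empty OR set on it) AND set not excluded.

    Comparison is case-sensitive; the default (unnamed) set is "".
    """
    on_inclusion_side = (not include) or (set_name in include)
    return on_inclusion_side and (set_name not in exclude)


# With both lists empty, everything is included:
print(set_is_included("Bio", include=[], exclude=[]))
# Naming only the default set excludes all named sets:
print(set_is_included("Bio", include=[""], exclude=[]))
```

Note that the exclusion list always wins: a set named on both lists is excluded.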

Asking CrateGatePipeline to run the GATE “ANNIE” demonstration:

crate_run_gate_annie_demo

Note

For the demonstrations that follow, we presuppose that you have also set the environment variable CRATE_GATE_PLUGIN_FILE to be the filename of a GATE plugin INI file like this:

# crate_anon/nlp_manager/specimen_gate_plugin_file.ini
#
# This file is read by CrateGatePipeline.java, when you ask it to.
# It will ask GATE to fetch plugins from a Maven repository, e.g.
# https://mvnrepository.com/artifact/uk.ac.gate.plugins
#
# Note that when you're searching for method in JAR files, you can use:
#
# for i in *.jar; do jar -tvf "$i" | grep -Hsi ClassName && echo "$i"; done

[ANNIE]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/annie
# - ANNIE includes the GATE demo app, but is also required by others, e.g.
#   KCL pharmacotherapy.
# - "ANNIE is a general purpose information extraction system that provides the
#   building blocks of many other GATE applications."
name = annie
location = uk.ac.gate.plugins
version = 8.6

[Tools]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/tools
# - "A selection of processing resources commonly used to extend ANNIE."
# - Certainly required by KCL LBD application.
name = tools
location = uk.ac.gate.plugins
version = 8.6

[JAPE]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/jape-plus
# - "An alternative, usually more efficient and faster, JAPE implementation"
# - JAPE = Java Annotation Patterns Engine
# - https://en.wikipedia.org/wiki/JAPE_(linguistics)
# - Necessary for the KCL pharmacotherapy app.
name = jape-plus
location = uk.ac.gate.plugins
version = 8.6

[Stanford_CoreNLP]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/stanford-corenlp
# - https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp
# - "Stanford CoreNLP provides a set of natural language analysis tools which
#   can take raw English language text input and give the base forms of words,
#   their parts of speech, whether they are names of companies, people, etc.,
#   normalize dates, times, and numeric quantities, mark up the structure of
#   sentences in terms of phrases and word dependencies, and indicate which
#   noun phrases refer to the same entities. It provides the foundational
#   building blocks for higher level text understanding applications."
# - Necessary for KConnect.
name = stanford-corenlp
location = uk.ac.gate.plugins
version = 8.5.1
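A file in this format can be inspected with Python's standard configparser module, which may help when checking a plugin file before handing it to CrateGatePipeline. The sketch below is only an illustration (CrateGatePipeline reads the file itself, in Java); the GatePlugin type and the function name are hypothetical.

```python
import configparser
from typing import List, NamedTuple


class GatePlugin(NamedTuple):
    section: str
    name: str
    location: str
    version: str


def read_gate_plugins(ini_text: str) -> List[GatePlugin]:
    """Parse plugin definitions; raises KeyError if a required key is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)  # lines starting with '#' are comments
    return [
        GatePlugin(section,
                   parser[section]["name"],
                   parser[section]["location"],
                   parser[section]["version"])
        for section in parser.sections()
    ]


SPECIMEN = """
[ANNIE]
# - ANNIE includes the GATE demo app.
name = annie
location = uk.ac.gate.plugins
version = 8.6

[Stanford_CoreNLP]
name = stanford-corenlp
location = uk.ac.gate.plugins
version = 8.5.1
"""

for plugin in read_gate_plugins(SPECIMEN):
    print(plugin.section, plugin.name, plugin.location, plugin.version)
```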

7.6.4. KConnect (Bio-YODIE)

This GATE application finds diseases. Bio-YODIE is part of the KConnect project.

You can test the application via the GATE Developer console. See testing GATE applications.

See the specimen CRATE NLP config file.

Script to test the app via the command line:

crate_run_gate_kcl_kconnect_demo

The KConnect GATE application requires you to register and download UMLS data, containing disease vocabularies. Once you’ve done so, the crate_nlp_prepare_ymls_for_bioyodie tool will do some necessary preprocessing. Its help is:

USAGE: crate_nlp_prepare_ymls_for_bioyodie [-h] [--keeptemp]
                                           [--java_home JAVA_HOME]
                                           [--gate_home GATE_HOME]
                                           [--groovy GROOVY]
                                           [--bioyodie_prep_repo_url BIOYODIE_PREP_REPO_URL]
                                           [--scala_url SCALA_URL]
                                           umls_zip dest_dir

Prepare UMLS data for BioYodie.

POSITIONAL ARGUMENTS:
  umls_zip              Filename of ZIP file downloaded from
                        https://www.nlm.nih.gov/research/umls/licensedcontent/
                        umlsknowledgesources.html, e.g.
                        /path/to/umls-2017AA-full.zip . This can't be
                        autodownloaded, as it requires a license/login.
  dest_dir              Destination directory to write.

OPTIONS:
  -h, --help            show this help message and exit
  --keeptemp            Keep temporary directory on exit. (default: False)
  --java_home JAVA_HOME
                        Value for JAVA_HOME environment variable. Should be a
                        directory that contains 'bin/java'. Default is (a)
                        existing JAVA_HOME variable; (b) location based on
                        'which java'. (default: /path/to/java)
  --gate_home GATE_HOME
                        Value for GATE_HOME environment variable. Should be a
                        directory that contains 'bin/gate.*'. Default is
                        existing GATE_HOME environment variable. (default:
                        /path/to/GATE/directory)
  --groovy GROOVY       Path to groovy binary (ideally v3.0+). Default is the
                        system copy, if there is one. (default: None)
  --bioyodie_prep_repo_url BIOYODIE_PREP_REPO_URL
                        URL of Bio-YODIE preprocessor Git repository (default:
                        https://github.com/RudolfCardinal/bio-yodie-resource-p
                        rep)
  --scala_url SCALA_URL
                        URL for Scala .tgz file (default:
                        https://downloads.lightbend.com/scala/2.11.7/scala-2.1
                        1.7.tgz)

7.6.5. KCL pharmacotherapy application

This GATE application finds drugs (medications).

You can test the application via the GATE Developer console. See testing GATE applications.

See the specimen CRATE NLP config file.

Script to test the app via the command line:

crate_run_gate_kcl_pharmacotherapy_demo

7.6.6. KCL Lewy Body Diagnosis Application

This GATE application finds references to Lewy body dementia.

  • Clone it from https://github.com/KHP-Informatics/brc-gate-LBD

  • As of 2018-03-20, the Git repository just contains a zip. Unzip it.

  • The main application is called application.xgapp.

  • The principal annotation is called cDiagnosis (“confirmed diagnosis”), which has rule and text elements.

You can test the application via the GATE Developer console. See testing GATE applications.

See the specimen CRATE NLP config file.

Script to test the app via the command line:

crate_run_gate_kcl_lewy_demo

7.6.7. Testing a GATE application manually

The illustration below assumes that the main GATE application file is called main-bio.xgapp, which is correct for KConnect. For others, the name is different; see above.

  • Run GATE Developer.

  • Load the application:

    • File ‣ Restore application from file

    • find main-bio.xgapp, in the downloaded KConnect directory (or whichever the appropriate .xgapp file is for your application);

    • load this;

    • wait until it’s finished loading.

  • Create a document:

    • Right-click Language Resources ‣ New ‣ GATE Document

    • name it (e.g. my_test_doc);

    • open it;

    • paste some text in the “Text” window.

  • Create a corpus

    • Right-click Language Resources ‣ New ‣ GATE Corpus

    • name it (e.g. my_test_corpus);

    • open it;

    • add the document (e.g. with the icon looking like ‘G+’).

  • View the application:

    • Go to the application tab (main-bio.xgapp), or double-click main-bio.xgapp in the left hand tree (under Applications) to open it if it’s not already open. For other applications: find the appropriate application in the “Applications” tree and double-click it.

    • Make sure your corpus is selected in the “Corpus:” section. (There should already be a bunch of things in the top-right-hand box, “Selected processing resources”; for example, for KConnect, you’ll see “MP:preprocess” through to “MP:finalize”.)

    • Click “Run this Application”.

  • To see the results, go back to the document, and toggle on both “Annotation Sets” and “Annotation Lists”. If you tick “sets” in the Annotation Sets window (at the right; it’s colourful) you should see specific annotations in the Annotation List window (at the bottom).

7.6.8. Troubleshooting GATE

Out of Java heap space

You may see an error like “Out of memory error: java heap space”.

  • On a Windows machine, set the _JAVA_OPTIONS environment variable (not JAVA_OPTS; we’re not sure when that one applies).

    • Edit environment variables via the control panel or e.g. rundll32 sysdm.cpl,EditEnvironmentVariables.

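    • For the user or the system (as you prefer), set _JAVA_OPTIONS to e.g. -Xms2048m -Xmx4096m -XX:MaxPermSize=1024m. (The -XX:MaxPermSize option applies only to Java 7 and earlier; Java 8 removed the permanent generation, so omit it for later versions.)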

  • Restart the relevant application (e.g. GATE Developer) and retry.
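If you launch Java from a script rather than from the desktop, the same variable can be set for a single child process instead of for the whole user or system. The Python sketch below builds such an environment; it is an illustration under that assumption, and the function name and the heap sizes are hypothetical choices.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_java_heap(min_mb: int,
                       max_mb: int,
                       base_env: Optional[Mapping[str, str]] = None
                       ) -> Dict[str, str]:
    """Return a copy of the environment with _JAVA_OPTIONS set for heap size."""
    if min_mb > max_mb:
        raise ValueError("Initial heap size must not exceed maximum heap size")
    env = dict(os.environ if base_env is None else base_env)
    env["_JAVA_OPTIONS"] = f"-Xms{min_mb}m -Xmx{max_mb}m"
    return env


env = env_with_java_heap(2048, 4096)
print(env["_JAVA_OPTIONS"])
# The result can then be passed to a child process, for example:
#   subprocess.run(["java", "-version"], env=env)
```

Because the function copies the environment rather than modifying os.environ, other processes started by the same script are unaffected.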


Footnotes

[1] University of Sheffield (2016). “GATE: General Architecture for Text Engineering.” https://gate.ac.uk/