7.6. GATE NLP applications
GATE NLP is performed by an external program, GATE [1], which runs in Java. CRATE supplies an external front-end Java program (CrateGatePipeline.java) that loads a GATE app, sends text to it, and returns the results.
In general, CRATE sends text to the external program (via stdin), and will expect a result (via stdout) as a set of tab-separated value (TSV) lines corresponding to the expected destination fields.
The CrateGatePipeline.java program takes arguments that describe how a specific GATE application should be handled.
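As a rough illustration of the stdin/stdout exchange described above, the sketch below turns one TSV result line into a Python dictionary. It assumes, purely for illustration, that each line alternates key and value (key, tab, value, tab, key, ...); the real line layout is not specified in this section, and the function name `tsv_pairs_to_dict` and the sample line are hypothetical.

```python
from typing import Dict


def tsv_pairs_to_dict(line: str) -> Dict[str, str]:
    """
    Convert one TSV line of alternating keys and values into a dict.

    ASSUMPTION: the key/value layout is illustrative only; check the
    actual output of CrateGatePipeline before relying on it.
    """
    stripped = line.rstrip("\r\n")
    if stripped == "":
        return {}  # blank line: nothing to report
    items = stripped.split("\t")
    if len(items) % 2 != 0:
        raise ValueError(f"Odd number of TSV items: {items!r}")
    # Even-indexed items are keys; odd-indexed items are their values.
    return dict(zip(items[0::2], items[1::2]))


# Hypothetical result line for a 'Person' annotation.
sample_line = "_type\tPerson\t_start\t0\t_end\t5\t_content\tAlice\n"
row = tsv_pairs_to_dict(sample_line)
```

All values arrive as strings; converting fields such as `_start` and `_end` to integers would be a separate step.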
7.6.1. Output columns
In addition to the standard NLP output columns, the CRATE GATE processor produces these output columns:
| Column | SQL type | Description |
|---|---|---|
| _set | VARCHAR(64) | GATE output set name |
| _type | VARCHAR(64) | GATE annotation type name (e.g. ‘Person’) |
| _id | INT | GATE annotation ID. Not clear that this is very useful. |
| _start | INT | Start position in the content |
| _end | INT | End position in the content |
| _content | TEXT | Full content marked as relevant. (Not the entire content of the source field.) |
These default output columns are prefixed with an underscore to reduce the risk of name clashes, since GATE applications can themselves generate arbitrary column names. For example, the demonstration GATE Person app generates these:
rule
firstname
surname
gender
kind
You tell CRATE about the specific fields produced by a GATE application using the destfields option; see the NLP config file.
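To show why the underscore prefix and an explicit field list are useful, here is a minimal sketch that lays out one annotation as a fixed-shape row: the default columns first, then the application-specific fields. The function `annotation_to_row` and the sample annotation are hypothetical and are not part of CRATE; the field names are those listed above for the demonstration Person app.

```python
from typing import Any, Dict, List, Optional

# Default GATE output columns described in the table above.
STANDARD_COLUMNS = ["_set", "_type", "_id", "_start", "_end", "_content"]

# Fields generated by the demonstration GATE Person app.
PERSON_DESTFIELDS = ["rule", "firstname", "surname", "gender", "kind"]


def annotation_to_row(
    annotation: Dict[str, Any], destfields: List[str]
) -> Dict[str, Optional[Any]]:
    """
    Build a row with one entry per expected column.

    Keys absent from the annotation become None, since GATE annotations
    do not always carry every key. Unexpected keys are dropped.
    """
    columns = STANDARD_COLUMNS + destfields
    return {column: annotation.get(column) for column in columns}


# Hypothetical annotation in which 'surname' and 'kind' are absent.
annotation = {
    "_set": "",
    "_type": "Person",
    "_id": 7,
    "_start": 0,
    "_end": 5,
    "_content": "Alice",
    "rule": "PersonFinal",
    "firstname": "Alice",
    "gender": "female",
}
row = annotation_to_row(annotation, PERSON_DESTFIELDS)
```

Because the default columns all begin with an underscore, an application that happens to emit a key such as `type` or `id` would not collide with them.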
7.6.2. crate_nlp_build_gate_java_interface
This program builds CrateGatePipeline.
Options:
USAGE: crate_nlp_build_gate_java_interface [-h] [--builddir BUILDDIR]
[--gatedir GATEDIR]
[--gate_exec GATE_EXEC]
[--java JAVA] [--javac JAVAC]
[--verbose] [--launch]
Compile Java classes for CRATE's interface to GATE
OPTIONS:
-h, --help show this help message and exit
--builddir BUILDDIR Output directory for compiled .class files (default:
/path/to/crate/crate_anon/nlp_manager/compiled_nlp_cla
sses)
--gatedir GATEDIR Root directory of GATE installation (default:
/path/to/GATE/installation)
--gate_exec GATE_EXEC
Path to GATE executable (JAR file). Temporary (future
releases may handle this differently). If not
specified, defaults to 'bin/gate.jar' within the GATE
directory. (default: None)
--java JAVA Java executable (default: java)
--javac JAVAC Java compiler (default: javac)
--verbose, -v Be verbose (use twice for extra verbosity) (default:
0)
--launch Launch script in demonstration mode (having previously
compiled it) (default: False)
7.6.3. CrateGatePipeline
The following specimen scripts presuppose that you have set the environment variable GATE_HOME, and assume specific locations for the compiled Java (e.g. files like CrateGatePipeline.class); edit them as required.
Asking CrateGatePipeline to show its command-line options:
crate_show_crate_gate_pipeline_options
The resulting output:
usage: CrateGatePipeline --gate_app GATEAPP
[--include_set SET [--include_set SET [...]]]
[--exclude_set SET [--exclude_set SET [...]]]
[--annotation ANNOT [--annotation ANNOT [...]]]
[--set_annotation SET ANNOT [...]]
[--encoding ENCODING]
[--input_terminator TERM]
[--output_terminator TERM]
[--log_tag LOGTAG]
[--write_annotated_xml FILESTEM]
[--write_gate_xml FILESTEM]
[--write_tsv FILESTEM]
[--suppress_gate_stdout]
[--show_contents_on_crash]
[-h] [-v [-v [-v]]]
[--loglevel <debug|info|warn|error>]
[--gateloglevel <debug|info|warn|error>]
[--pluginfile PLUGINFILE]
[--launch_then_stop]
[--demo]
Java front end to GATE natural language processor.
- Takes input on stdin. Produces output on stdout.
- GATE applications produce output clustered (1) into named annotation sets
(with a default, unnamed set). (2) Within annotation sets, we find
annotations. (3) Each annotation is a collection of key/value pairs.
This collection is not fixed, in that individual annotations, or keys within
annotations, may be present sometimes and absent sometimes, depending on the
input text.
Optional arguments:
--gate_app GATEAPP
-g GATEAPP
Specifies the GATE app (.gapp/.xgapp) file to use.
REQUIRED unless specifying --demo.
--include_set SET
--exclude_set SET
Includes or excludes the specified GATE set, by name.
By default, the inclusion list is empty, and the exclusion
list is also empty. By specifying set names here, you add
to the inclusion or exclusion list. You can specify each
option multiple times. Then, the rules are as follows:
the output from a GATE set is included if (A) the inclusion
list is empty OR the set is on the inclusion list, AND (B)
the set is not on the exclusion list. Note also that there
is a default set with no name; refer to this one using
the empty string "". Set names are compared in a
case-sensitive manner.
--annotation ANNOT
-a ANNOT
Adds the specified annotation to the target list.
If you don't specify any, you'll get them all.
--set_annotation SET ANNOT
-sa SET ANNOT
Adds the specific set/annotation combination to the target
list. Use this option for maximum control. You cannot mix
--annotation and --set_annotation.
--encoding ENCODING
-e ENCODING
The character encoding of the source documents, to be used
for file output. If not specified, the platform default
encoding (currently "UTF-8") is assumed.
--input_terminator TERMINATOR
-it TERMINATOR
Specify stdin end-of-document terminator.
--output_terminator TERMINATOR
-ot TERMINATOR
Specify stdout end-of-document terminator.
--log_tag LOGTAG
-lt LOGTAG
Use an additional tag for stderr logging.
Helpful in multiprocess environments.
--write_annotated_xml FILESTEM
-wa FILESTEM
Write annotated XML document to FILESTEM<n>.xml, where <n>
is the file's sequence number (starting from 0).
--write_gate_xml FILESTEM
-wg FILESTEM
Write GateXML document to FILESTEM<n>.xml.
--write_tsv FILESTEM
-wt FILESTEM
Write TSV-format annotations to FILESTEM<n>.tsv.
--suppress_gate_stdout
-s
Suppress any stdout from GATE application.
--show_contents_on_crash
-show_contents_on_crash
If GATE crashes, report the current text to stderr (as well
as reporting the error).
(WARNING: likely to contain identifiable material.)
--continue_on_crash
-c
If GATE crashes, carry on after reporting the error.
--help
-h
Show this help message and exit.
--verbose
-v
Verbose (use up to 3 times to be more verbose).
--loglevel LEVEL
Main log level. Overrides verbose. Options are:
debug, info, warn, error
--gateloglevel LEVEL
GATE log level. Overrides verbose. Options are:
debug, info, warn, error
--pluginfile PLUGINFILE
INI file specifying GATE plugins, including name,
location of Maven repository and version. See
specimen_gate_plugin_file.ini. A simple example:
[ANNIE]
name = annie
location = uk.ac.gate.plugins
version = 8.6
[Tools]
name = tools
location = uk.ac.gate.plugins
version = 8.6
--launch_then_stop
Launch the GATE program, then stop immediately. (Used
to pre-download plugins.)
--demo
Use the demo gapp file.
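The inclusion/exclusion rule stated in the help text for --include_set and --exclude_set can be sketched in a few lines of Python. This mirrors the rule as documented, not the Java source; the function name `set_is_included` is hypothetical.

```python
from typing import Collection


def set_is_included(
    set_name: str, include: Collection[str], exclude: Collection[str]
) -> bool:
    """
    Apply the documented rule: output from a GATE set is included if
    (A) the inclusion list is empty OR the set is on it, AND
    (B) the set is not on the exclusion list.

    The default (unnamed) set is the empty string "".
    Comparison is case-sensitive.
    """
    on_inclusion_side = (not include) or (set_name in include)
    return on_inclusion_side and (set_name not in exclude)


# With both lists empty, everything is included, even the default set.
default_ok = set_is_included("", include=[], exclude=[])
# Exclusion wins when a set appears on both lists.
both_lists = set_is_included("Bio", include=["Bio"], exclude=["Bio"])
```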
Asking CrateGatePipeline to run the GATE “ANNIE” demonstration:
crate_run_gate_annie_demo
Note
For the demonstrations that follow, we presuppose that you have also set the environment variable CRATE_GATE_PLUGIN_FILE to be the filename of a GATE plugin INI file like this:
# crate_anon/nlp_manager/specimen_gate_plugin_file.ini
#
# This file is read by CrateGatePipeline.java, when you ask it to.
# It will ask GATE to fetch plugins from a Maven repository, e.g.
# https://mvnrepository.com/artifact/uk.ac.gate.plugins
#
# Note that when you're searching for method in JAR files, you can use:
#
# for i in *.jar; do jar -tvf "$i" | grep -Hsi ClassName && echo "$i"; done
[ANNIE]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/annie
# - ANNIE includes the GATE demo app, but is also required by others, e.g.
# KCL pharmacotherapy.
# - "ANNIE is a general purpose information extraction system that provides the
# building blocks of many other GATE applications."
name = annie
location = uk.ac.gate.plugins
version = 8.6
[Tools]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/tools
# - "A selection of processing resources commonly used to extend ANNIE."
# - Certainly required by KCL LBD application.
name = tools
location = uk.ac.gate.plugins
version = 8.6
[JAPE]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/jape-plus
# - "An alternative, usually more efficient and faster, JAPE implementation"
# - JAPE = Java Annotation Patterns Engine
# - https://en.wikipedia.org/wiki/JAPE_(linguistics)
# - Necessary for the KCL pharmacotherapy app.
name = jape-plus
location = uk.ac.gate.plugins
version = 8.6
[Stanford_CoreNLP]
# - https://mvnrepository.com/artifact/uk.ac.gate.plugins/stanford-corenlp
# - https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp
# - "Stanford CoreNLP provides a set of natural language analysis tools which
# can take raw English language text input and give the base forms of words,
# their parts of speech, whether they are names of companies, people, etc.,
# normalize dates, times, and numeric quantities, mark up the structure of
# sentences in terms of phrases and word dependencies, and indicate which
# noun phrases refer to the same entities. It provides the foundational
# building blocks for higher level text understanding applications."
# - Necessary for KConnect.
name = stanford-corenlp
location = uk.ac.gate.plugins
version = 8.5.1
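The plugin file is ordinary INI, so its structure is easy to inspect outside Java. The sketch below uses Python's standard configparser to list each section's name, location, and version. It is only an illustration of the file layout: CrateGatePipeline reads the file itself, and the `GatePlugin` type and `read_plugins` function are hypothetical.

```python
import configparser
from typing import List, NamedTuple


class GatePlugin(NamedTuple):
    section: str   # INI section header, e.g. "ANNIE"
    name: str      # plugin name, e.g. "annie"
    location: str  # e.g. "uk.ac.gate.plugins"
    version: str   # kept as a string, e.g. "8.5.1"


def read_plugins(ini_text: str) -> List[GatePlugin]:
    """Parse plugin definitions from INI text, one per section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return [
        GatePlugin(
            section=section,
            name=parser[section]["name"],
            location=parser[section]["location"],
            version=parser[section]["version"],
        )
        for section in parser.sections()
    ]


# A cut-down copy of the specimen file shown above.
SAMPLE_INI = """
[ANNIE]
# - ANNIE includes the GATE demo app.
name = annie
location = uk.ac.gate.plugins
version = 8.6

[Stanford_CoreNLP]
name = stanford-corenlp
location = uk.ac.gate.plugins
version = 8.5.1
"""

plugins = read_plugins(SAMPLE_INI)
for plugin in plugins:
    print(f"{plugin.section}: {plugin.location}:{plugin.name}:{plugin.version}")
```

To check a real file, read its text first, for example with `pathlib.Path(os.environ["CRATE_GATE_PLUGIN_FILE"]).read_text()`, and pass the result to `read_plugins`.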
7.6.4. KConnect (Bio-YODIE)
This GATE application finds diseases. Bio-YODIE is part of the KConnect project.
See https://gate.ac.uk/applications/bio-yodie.html; http://www.kconnect.eu/.
The main application is called main-bio.xgapp.
You can test the application via the GATE Developer console. See testing GATE applications.
See the specimen CRATE NLP config file.
Script to test the app via the command line:
crate_run_gate_kcl_kconnect_demo
The KConnect GATE application requires you to register and download UMLS data, containing disease vocabularies. Once you’ve done so, the crate_nlp_prepare_ymls_for_bioyodie tool will do some necessary preprocessing. Its help is:
USAGE: crate_nlp_prepare_ymls_for_bioyodie [-h] [--keeptemp]
[--java_home JAVA_HOME]
[--gate_home GATE_HOME]
[--groovy GROOVY]
[--bioyodie_prep_repo_url BIOYODIE_PREP_REPO_URL]
[--scala_url SCALA_URL]
umls_zip dest_dir
Prepare UMLS data for BioYodie.
POSITIONAL ARGUMENTS:
umls_zip Filename of ZIP file downloaded from
https://www.nlm.nih.gov/research/umls/licensedcontent/
umlsknowledgesources.html, e.g.
/path/to/umls-2017AA-full.zip . This can't be
autodownloaded, as it requires a license/login.
dest_dir Destination directory to write.
OPTIONS:
-h, --help show this help message and exit
--keeptemp Keep temporary directory on exit. (default: False)
--java_home JAVA_HOME
Value for JAVA_HOME environment variable. Should be a
directory that contains 'bin/java'. Default is (a)
existing JAVA_HOME variable; (b) location based on
'which java'. (default: /path/to/java)
--gate_home GATE_HOME
Value for GATE_HOME environment variable. Should be a
directory that contains 'bin/gate.*'. Default is
existing GATE_HOME environment variable. (default:
/path/to/GATE/directory)
--groovy GROOVY Path to groovy binary (ideally v3.0+). Default is the
system copy, if there is one. (default: None)
--bioyodie_prep_repo_url BIOYODIE_PREP_REPO_URL
URL of Bio-YODIE preprocessor Git repository (default:
https://github.com/RudolfCardinal/bio-yodie-resource-p
rep)
--scala_url SCALA_URL
URL for Scala .tgz file (default:
https://downloads.lightbend.com/scala/2.11.7/scala-2.1
1.7.tgz)
7.6.5. KCL pharmacotherapy application
This GATE application finds drugs (medications).
See https://github.com/KHP-Informatics/brc-gate-pharmacotherapy
The main application is called application.xgapp.
You can test the application via the GATE Developer console. See testing GATE applications.
See the specimen CRATE NLP config file.
Script to test the app via the command line:
crate_run_gate_kcl_pharmacotherapy_demo
7.6.6. KCL Lewy Body Diagnosis Application
This GATE application finds references to Lewy body dementia.
Clone it from https://github.com/KHP-Informatics/brc-gate-LBD
As of 2018-03-20, the Git repository just contains a zip. Unzip it.
The main application is called application.xgapp.
The principal annotation is called cDiagnosis (“confirmed diagnosis”), which has rule and text elements.
You can test the application via the GATE Developer console. See testing GATE applications.
See the specimen CRATE NLP config file.
Script to test the app via the command line:
crate_run_gate_kcl_lewy_demo
7.6.7. Testing a GATE application manually
The illustration below assumes that the main GATE application file is called main-bio.xgapp, which is correct for KConnect. For others, the name is different; see above.
Run GATE Developer.
Load the application:
find main-bio.xgapp, in the downloaded KConnect directory (or whichever the appropriate .xgapp file is for your application);
load this;
wait until it’s finished loading.
Create a document:
name it (e.g. my_test_doc);
open it;
paste some text in the “Text” window.
Create a corpus:
name it (e.g. my_test_corpus);
open it;
add the document (e.g. with the icon looking like ‘G+’).
View the application:
Go to the application tab (main-bio.xgapp), or double-click main-bio.xgapp in the left-hand tree (under Applications) to open it if it’s not already open. For other applications: find the appropriate application in the “Applications” tree and double-click it.
Make sure your corpus is selected in the “Corpus:” section. (There should already be a bunch of things in the top-right-hand box, “Selected processing resources”; for example, for KConnect, you’ll see “MP:preprocess” through to “MP:finalize”.)
To see the results, go back to the document, and toggle on both “Annotation Sets” and “Annotation Lists”. If you tick “sets” in the Annotation Sets window (at the right; it’s colourful) you should see specific annotations in the Annotation List window (at the bottom).
7.6.8. Troubleshooting GATE
Out of Java heap space
You may see an error like “Out of memory error: java heap space”.
On a Windows machine, set the _JAVA_OPTIONS environment variable (not JAVA_OPTS; we’re not sure when that one applies).
Edit environment variables via the control panel or e.g. rundll32 sysdm.cpl,EditEnvironmentVariables.
For the user or the system (as you prefer), set _JAVA_OPTIONS to e.g. -Xms2048m -Xmx4096m -XX:MaxPermSize=1024m.
Restart the relevant application (e.g. GATE Developer) and retry.
Footnotes
[1] University of Sheffield (2016). “GATE: General Architecture for Text Engineering.” https://gate.ac.uk/