.. crate_anon/docs/source/nlp/nlp_config.rst .. Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk). . This file is part of CRATE. . CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. . CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. . You should have received a copy of the GNU General Public License along with CRATE. If not, see . .. _shlex: https://docs.python.org/3/library/shlex.html .. _nlp_config: NLP config file --------------- .. contents:: :local: Overview ~~~~~~~~ The CRATE NLP config file controls the behaviour of the NLP manager. It defines source and destination databases, and one or more **NLP definitions**. You can generate a specimen config file with .. code-block:: bash crate_nlp --democonfig > test_nlp_config.ini You should save this, then edit it to your own needs. Make it point to other necessary things, like your GATE installation if you want to use GATE NLP. For convenience, you may want the `CRATE_NLP_CONFIG` environment variable to point to this file. (Otherwise you must specify it each time.) Detail ~~~~~~ The config file describes NLP definitions, such as ‘people_and_places’, ‘drugs_and_doses’, or ‘haematology_white_cell_differential’. You choose the names of these NLP definitions as you define them. You select one when you run the NLP manager (using the ``--nlpdef`` argument) [#nlpdefinitionclass]_. The NLP definition sets out the following things. - *“Where am I going to find my source text?”* You specify this by giving one or more ``inputfielddefs``. 
Each one of those specifies (via its own config section) a database/table/field combination, such as the ‘Notes’ field of the ‘Progress Notes’ table in the ‘RiO’ database, or the ‘text_content’ field of the ‘Clinical_Documents’ table in the ‘CDL’ database, or some such [#inputfieldconfig]_. - CRATE will always store minimal source information (database, table, integer PK) with the NLP output. However, for convenience you are also likely to want to copy some other key information over to the output, such as patients’ research identifiers (RIDs). You can specify these via the ``copyfields`` option to the input field definitions. For validation purposes, you might even choose to copy the full source text (just for convenience), but it’s unlikely you’d want to do this routinely (because it wastes space). - *“Which NLP processor will run, and where will it store its output?”* This might be an external GATE program specializing in finding drug names, or one of CRATE’s built-in regular expression (regex) parsers, such as for inflammatory markers from blood tests. The choice of NLP processor also determines the fields that will appear in the output; for example, a drug-detecting NLP program might provide fields such as ‘drug’, ‘dose’, ‘units’, and ‘route’, while a white-cell differential processor might provide output such as ‘cell type’, ‘value_in_billion_per_litre’, and so on [#nlpparser]_. - In fact, GATE applications can simultaneously provide *more than one type* of output; for example, GATE’s demonstration people-and-places application yields both ‘person’ information (rule, firstname, surname, gender...) and ‘location’ information (rule, loctype...), and it can be computationally more efficient to run them together. Therefore, CRATE supports multiple types of output from ‘single’ NLP processors. - Each NLP processor may have its own set of options. 
For example, the GATE controller requires information about the specific external GATE app to run, and about any necessary environment variables. Others, such as CRATE’s built-in regular expression parsers, are simpler. - You might want all of your “drugs and doses” information to be stored in a single table (such that drugs found in your Progress_Notes and drugs found in your Clinical_Documents get stored together); this would be common and sensible (and CRATE will keep a record of where the information came from). However, it’s possible that you might want to segregate them (e.g. having C-reactive protein information extracted from your Progress_Notes stored in a different table to C-reactive protein information extracted from your High_Sensitivity_CRP_Notes_For_Bobs_Project table). - For GATE apps that provide more than one type of output structure, you will need to specify more than one output table. - You can batch different NLP processors together. For example, the demo config batches up CRATE’s internal regular expression NLP processors together. This is more efficient, because one record fetched from the source database can then be sent to multiple NLP processors. However, it’s less helpful if you are developing new NLP tools and want to be able to re-run just one NLP tool frequently. All NLP configuration ‘chunks’ are sections within the NLP config file, which is in standard .INI format. For example, an input field definition is a section; a database definition is a section; an environment variable section for use with external programs is a section; and so on. To allow incremental updates, CRATE will keep a master progress table, storing a reference to the source information (database, table, PK), a hash of the source information (to work out later on if the source has changed), the date/time when the NLP was last run, and the name of the NLP definition that was run.
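The change-detection idea behind the progress table can be sketched in Python. This is a minimal illustration only, not CRATE's implementation: the function names ``source_hash`` and ``needs_processing``, the choice of HMAC-SHA256, and the in-memory dictionary standing in for the progress table are all assumptions made for this example.

```python
import hashlib
import hmac
from typing import Dict, Tuple

# Hypothetical key type: (database, table, integer PK) of a source record.
SourceKey = Tuple[str, str, int]


def source_hash(text: str, hashphrase: str) -> str:
    """Return a keyed hash of the source text (HMAC-SHA256 is an assumption)."""
    return hmac.new(
        hashphrase.encode("utf-8"), text.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def needs_processing(
    progress: Dict[SourceKey, str], key: SourceKey, text: str, hashphrase: str
) -> bool:
    """True if the record is new, or its text has changed since the last run."""
    return progress.get(key) != source_hash(text, hashphrase)


# A dictionary stands in for the progress table in this sketch.
progress: Dict[SourceKey, str] = {}
key: SourceKey = ("RiO", "Progress_Notes", 1)
phrase = "my_hash_phrase"

print(needs_processing(progress, key, "CRP 5 mg/L", phrase))  # new record
progress[key] = source_hash("CRP 5 mg/L", phrase)  # note that NLP has run
print(needs_processing(progress, key, "CRP 5 mg/L", phrase))  # unchanged text
print(needs_processing(progress, key, "CRP 6 mg/L", phrase))  # edited text
```

In an incremental run, only records for which a check of this kind returns true would need to be sent through the NLP processors again.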
It’s definitely better if your source table has integer PKs, but you might not have a choice in the matter (and be unable to add one to a read-only source database), so CRATE also supports string PKs. In this instance it will create an integer by hashing the string and store that along with the string PK itself. (That integer is not guaranteed to be unique, because of *hash collisions* [#hashcollisions]_, but it allows some efficiency to be added.) Testing your NLP definitions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can test any NLP definition that you create. See :ref:`testing NLP `. Format of the configuration file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - The config file is in standard `INI file format `_. - **UTF-8 encoding.** Use this! The file is explicitly opened in UTF-8 mode. - **Comments.** Hashes (``#``) and semicolons (``;``) denote comments. - **Sections.** Sections are indicated with: ``[section]`` - **Name/value (key/value) pairs.** The parser used is `ConfigParser `_. It allows ``name=value`` or ``name:value``. - **Avoid indentation of parameters.** (Indentation is used to indicate the continuation of previous parameters.) - **Parameter types,** referred to below, are: - **String.** Single-line strings are simple. - **Multiline string.** Here, a series of lines is read and split into a list of strings (one for each line). You should indent all lines except the first beyond the level of the parameter name, and then they will be treated as one parameter value. - **Integer.** Simple. - **Boolean.** For Boolean options, true values are any of: ``1, yes, true, on`` (case-insensitive). False values are any of: ``0, no, false, off``. .. _nlp_config_section_nlpdef: Config file section: NLP definition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are config file sections named ``[nlpdef:XXX]`` where ``XXX`` is the name of one of your NLP definitions. **These are the "top-level" configuration sections, referred to when you launch CRATE's NLP tools from the command line. 
Start here.** These config sections map *inputs* (from your database) to *processors* and a *progress-tracking database*, and give names to those mappings. inputfielddefs ^^^^^^^^^^^^^^ *Multiline string.* List of input fields to parse. Each is the name of an :ref:`input field definition ` in the config file. Input to the NLP processor(s) comes from one or more source fields (columns), each within a table within a database. Each item in this list refers to a config section that defines that field in more detail. .. _nlp_config_nlpdef_processors: processors ^^^^^^^^^^ *Multiline string.* Which NLP processors shall we use? Specify these as a list of ``processor_type, processor_config_section`` pairs. For example, one might be: .. code-block:: none GATE mygateproc_name_location and CRATE would then look for a :ref:`processor definition ` in a config file section named ``[processor:mygateproc_name_location]``, and expect it to have the information required for a GATE processor. For possible processor types, see ``crate_nlp --listprocessors``. These include CRATE internal processors (e.g. "Glucose"), external tools run locally (e.g. "GATE"), and cloud-based NLP ("Cloud"). progressdb ^^^^^^^^^^ *String.* Secret progress database; the name of a :ref:`database definition ` in the config file. To allow incremental updates, information is stored in a progress table. The database name is a cross-reference to another section in this config file. The table name within this database is hard-coded to ``crate_nlp_progress``. hashphrase ^^^^^^^^^^ *String.* You should insert a hash phrase of your own here. However, it's not especially secret (it's only used for change detection and users are likely to have access to the source material anyway), and its specific value is unimportant. temporary_tablename ^^^^^^^^^^^^^^^^^^^ *String.* Default: ``_crate_nlp_temptable``. Temporary table name to use (in progress and destination databases).
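As an illustration of how the ``inputfielddefs`` and ``processors`` options of an NLP definition might be read, here is a minimal Python sketch using the standard ``configparser`` module. It is not CRATE's own parsing code: the helper names, the section contents (apart from the ``GATE mygateproc_name_location`` pair used in the ``processors`` example), and the assumption that each processor pair sits on its own whitespace-separated line are all hypothetical.

```python
import configparser
from typing import List, Tuple

# Hypothetical NLP definition; every name below is illustrative only.
SAMPLE_CONFIG = """
[nlpdef:my_nlpdef]
inputfielddefs =
    input_progress_notes
    input_clinical_documents
processors =
    GATE mygateproc_name_location
    Glucose procdef_glucose
progressdb = my_progress_db
"""


def multiline_items(value: str) -> List[str]:
    """Split a multiline option value into its non-blank, stripped lines."""
    return [line.strip() for line in value.splitlines() if line.strip()]


def processor_pairs(value: str) -> List[Tuple[str, str]]:
    """Turn 'type name' lines into (type, config section name) pairs."""
    pairs = []
    for line in multiline_items(value):
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"Expected 'type section' pair, got: {line!r}")
        proc_type, proc_name = parts
        # Assumed naming rule, following the [processor:XXX] convention.
        pairs.append((proc_type, f"processor:{proc_name}"))
    return pairs


parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)
nlpdef = parser["nlpdef:my_nlpdef"]

print(multiline_items(nlpdef["inputfielddefs"]))
print(processor_pairs(nlpdef["processors"]))
print(nlpdef["progressdb"])
```

The continuation lines are indented beyond the option name, which is what lets ``configparser`` treat them as a single multiline value.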
max_rows_before_commit ^^^^^^^^^^^^^^^^^^^^^^ *Integer.* Default: 1000. Specify the maximum number of rows to be processed before a ``COMMIT`` is issued on the database transaction(s). This prevents the transaction(s) growing too large. max_bytes_before_commit ^^^^^^^^^^^^^^^^^^^^^^^ *Integer.* Default: 80 MiB (80 * 1024 * 1024 = 83886080). Specify the maximum number of source-record bytes (approximately!) that are processed before a ``COMMIT`` is issued on the database transaction(s). This prevents the transaction(s) growing too large. The ``COMMIT`` will be issued *after* this limit has been met/exceeded, so it may be exceeded if the transaction just before the limit takes the cumulative total over the limit. .. _nlp_config_truncate_text_at: truncate_text_at ^^^^^^^^^^^^^^^^ *Integer.* Default: 0. Must be zero or positive. Use this to truncate very long incoming text fields. If non-zero, this is the length at which to truncate. record_truncated_values ^^^^^^^^^^^^^^^^^^^^^^^ *Boolean.* Default: false. Record in the progress database that we have processed records for which the source text was truncated (see :ref:`truncate_text_at `). The purpose of this option is to let the program, when running in incremental mode, decide whether to re-run NLP on records whose text was truncated before processing. If this option is set to true, such records won't be run again unless they have changed. .. _cloud_config: cloud_config ^^^^^^^^^^^^ *String.* Required to use cloud NLP. The name of the cloud NLP configuration to use if you ask for cloud-based processing with this NLP definition. For example, you might specify: .. code-block:: ini cloud_config = my_uk_cloud_nlp_service and CRATE would then look for a :ref:`cloud NLP configuration ` in a config file section named ``[cloud:my_uk_cloud_nlp_service]``, and use the information there to connect to a cloud NLP service via the :ref:`NLPRP `. cloud_request_data_dir ^^^^^^^^^^^^^^^^^^^^^^ *String.* Required to use cloud NLP.
Directory (on your local filesystem) to hold files containing information for the retrieval of data which has been sent in queued mode. For safety (in case the user specifies a foolish directory!), CRATE will make a subdirectory of this directory (whose name is that of the NLP definition). CRATE will delete files at will within that subdirectory. .. _nlp_config_section_input: Config file section: input field definition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are config file sections named ``[input:XXX]`` where ``XXX`` is the name of one of your input field definitions. These define database "inputs" in more detail, including the database, table, and field (column) containing the input, the associated primary key field, and fields that should be copied to the destination to make subsequent work easier (e.g. patient research IDs). They are referred to by the :ref:`NLP definition `. srcdb ^^^^^ *String.* Source database; the name of a :ref:`database definition ` in the config file. srctable ^^^^^^^^ *String.* The name of the table in the source database. srcpkfield ^^^^^^^^^^ *String.* The name of the primary key field (column) in the source table. srcfield ^^^^^^^^ *String.* The name of the field (column) in the source table that contains the data of interest. srcdatetimefield ^^^^^^^^^^^^^^^^ *String.* Optional (but advisable). The name of the ``DATETIME`` field (column) in the source table that represents the date/time of the source data. If present, this information will be copied to the output; see :ref:`Standard NLP output columns `. .. _nlp_config_input_copyfields: copyfields ^^^^^^^^^^ *Multiline string.* Optional. Names of fields to copy from the source table to the destination (NLP output) table. indexed_copyfields ^^^^^^^^^^^^^^^^^^ *Multiline string.* Optional subset of :ref:`copyfields ` that should be indexed in the destination (NLP output) table. debug_row_limit ^^^^^^^^^^^^^^^ *Integer.* Default: 0. Debugging option. 
Specify this to set the maximum number of rows to be fetched from the source table. Specifying 0 means "no limit". .. _nlp_config_section_processor: Config file section: processor definition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are config file sections named ``[processor:XXX]`` where ``XXX`` is the name of one of your NLP processors. These control the behaviour of individual NLP processors. In the case of CRATE's built-in processors, the only configuration needed is the destination database/table, but for some, like GATE applications, you need to define more -- such as how to run the external program, and what sort of table structure should be created to receive the results. The format depends on the specific processor *type* (see :ref:`processors `). destdb ^^^^^^ *String.* **Applicable to: all parsers.** Destination database; the name of a :ref:`database definition ` in the config file. .. _nlp_config_processor_desttable: desttable ^^^^^^^^^ *String.* **Applicable to: MedEx, all CRATE Python processors.** The name of the table in the destination database in which the results should be stored. **Cloud** and local **GATE** processors may produce output for multiple tables (or a single table, but potentially one that you need to help define). For these, use :ref:`outputtypemap ` instead. This refers to "output" configurations, in which you can define the table(s). .. _nlp_config_processor_outputtypemap: outputtypemap ^^^^^^^^^^^^^ *Multiline string.* **Applicable to: GATE, Cloud.** For GATE processors: What's GATE? See the section on :ref:`GATE NLP `. This tabular entry maps GATE '_type' parameters to possible destination tables (in case-insensitive fashion). This parameter is a list of pairs, one pair per line. - The first item of each is the annotation type coming out of the GATE system. - The second is the output type section defined in this file (as a separate section).
Those sections each define a table with its columns (fields); see :ref:`GATE/cloud output definitions `. Example: .. code-block:: none outputtypemap = Person output_person Location output_location This example would take output from GATE labelled with ``_type=Person`` and send it to output defined in the ``[output:output_person]`` section of the config file -- see :ref:`GATE/cloud output definitions `. Equivalently for the ``Location`` type. For cloud processors: - The first parameter is the remote server's tablename (see :ref:`NLPRP schema definition `). The second is an output type definition, as above (which may define the table in full, or just name it and leave the definition to the remote processor). - Use this method whenever the remote processor may return data for more than one table. - If the remote processor will only return results for a single table, and doesn't name it, the FIRST definition in the output type map is used (and the first element of the pair is ignored for this purpose, i.e. you can use any string you want). assume_preferred_unit ^^^^^^^^^^^^^^^^^^^^^ *Boolean.* Default: True. **Applicable to: nearly all numerical CRATE Python processors.** If a unit is not specified, assume that values are in the processor's preferred units. (For example, :class:`crate_anon.nlp_manager.parse_biochemistry.Crp` will assume mg/L.) Some override this and are not configurable, however: - ``AlcoholUnits`` never assumes this. .. _nlp_config_section_gate_progargs: progargs ^^^^^^^^ *Multiline string.* **Applicable to: GATE, MedEx.** This parameter defines how we will launch GATE. See :ref:`GATE NLP `. GATE NLP is done by an external program. In this parameter, we specify a program and associated arguments. Here's an example: .. 
code-block:: none progargs = java -classpath "{NLPPROGDIR}"{OS_PATHSEP}"{GATE_HOME}/bin/gate.jar"{OS_PATHSEP}"{GATE_HOME}/lib/*" -Dgate.home="{GATE_HOME}" CrateGatePipeline --gate_app "{GATE_HOME}/plugins/ANNIE/ANNIE_with_defaults.gapp" --annotation Person --annotation Location --input_terminator END_OF_TEXT_FOR_NLP --output_terminator END_OF_NLP_OUTPUT_RECORD --log_tag {NLPLOGTAG} --verbose The example shows how to use Java to launch a specific Java program (``CrateGatePipeline``), having set a path to find other Java classes, and how to pass arguments to the program itself. NOTE IN PARTICULAR: - Use double quotes to encapsulate any filename that may have spaces within it (e.g. ``C:/Program Files/...``). - Use a **forward slash directory separator, even under Windows.** - ... ? If that doesn't work, use a double backslash, ``\\``. - Under Windows, use a semicolon to separate parts of the Java classpath. Under Linux, use a colon. So a Linux Java classpath looks like .. code-block:: none /some/path:/some/other/path:/third/path and a Windows one looks like .. code-block:: none C:/some/path;C:/some/other/path;C:/third/path - To make this simpler, we can define the environment variable ``OS_PATHSEP`` (by analogy to Python's ``os.pathsep``). See the :ref:`environment variable ` section below. - You can use substitutable parameters: +-----------------+---------------------------------------------------------+ | ``{X}`` | Substitutes variable X from the environment you specify | | | (see below). | +-----------------+---------------------------------------------------------+ | ``{NLPLOGTAG}`` | Additional environment variable that indicates the | | | process being run; used to label the output from | | | the ``CrateGatePipeline`` application. | +-----------------+---------------------------------------------------------+ ..
_nlp_config_section_gate_progenvsection: progenvsection ^^^^^^^^^^^^^^ *String.* **Applicable to: GATE, MedEx.** :ref:`Environment variable config section ` to use when launching this program. .. _nlp_config_section_gate_inputterminator: input_terminator ^^^^^^^^^^^^^^^^ *String.* **Applicable to: GATE.** The external GATE program is slow, because NLP is slow. Therefore, we set up the external program and use it repeatedly for a whole bunch of text. Individual pieces of text are sent to it (via its ``stdin``). We finish our piece of text with a delimiter, which should (a) be specified in the ``-it`` or ``--input_terminator`` parameter to the CRATE ``CrateGatePipeline`` interface (above), and (b) be set here, TO THE SAME VALUE. The external program will return a tab-separated (TSV) set of field/value pairs, like this: .. code-block:: none field1\\tvalue1\\tfield2\\tvalue2... field1\\tvalue3\\tfield2\\tvalue4... ... OUTPUTTERMINATOR ... where ``OUTPUTTERMINATOR`` is something that you (a) specify with the ``-ot`` or ``--output_terminator`` parameter above, and (b) set via the config file :ref:`output_terminator `, TO THE SAME VALUE. .. _nlp_config_section_gate_outputterminator: output_terminator ^^^^^^^^^^^^^^^^^ *String.* **Applicable to: GATE.** See :ref:`input_terminator `. max_external_prog_uses ^^^^^^^^^^^^^^^^^^^^^^ *Integer.* **Applicable to: GATE, MedEx.** If the external GATE program leaks memory, you may wish to cap the number of uses before it's restarted. Specify this option if so. Specify 0 or omit the option entirely to ignore this. processor_name ^^^^^^^^^^^^^^ *String.* **Applicable to: Cloud.** Name of the remote processor; see :ref:`NLPRP list_processors `. Note that this is **case sensitive**. To ask the remote server what processor names it offers, use the :ref:`crate_nlp ` tool like this: ..
code-block:: bash crate_nlp --config MYCONFIG --nlpdef MYNLPDEF --print_cloud_processors That will, in sequence: - read the config called ``MYCONFIG``; - look for an NLP definition section marked ``[nlpdef:MYNLPDEF]``; - look for a cloud_config_ parameter in that section; - look up the corresponding :ref:`cloud server ` definition, including the URL of the remote server; - ask the server for details of all its processors; - print the results (in NLPRP JSON format; see the NLPRP :ref:`list_processors ` command for details). processor_version ^^^^^^^^^^^^^^^^^ *String.* Default: None. **Applicable to: Cloud.** Version of the remote processor; see :ref:`NLPRP list_processors `. processor_format ^^^^^^^^^^^^^^^^ *String.* **Applicable to: Cloud.** One of: ``Standard``, ``GATE``. ``Standard`` refers primarily to CRATE Python-based remote processors, but would be compatible with any remote processor which returned data in the same format as the CRATE processors. ``GATE`` refers to GATE remote processors, which return a standard set of columns. .. _nlp_config_section_gate_cloud_output: Config file section: GATE/cloud output definition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are config file sections named ``[output:XXX]`` where ``XXX`` is the name of one of your GATE output types, or cloud remote processor table names [#outputuserconfig]_. For GATE applications, we need this additional information because CRATE doesn't automatically know what sort of output they will produce. The tables and SPECIFIC output fields for a given GATE processor are defined here. For remote cloud processors, this section enables you to rename remote tables to something appropriate locally, and add options (like indexing). Additionally, CRATE may or may not be told exactly by the remote application what tabular structure it is using, but even if the remote application is helpfully informative, you wouldn't automatically trust remotely provided table names.
So this section is still mandatory. They are referred to by the :ref:`outputtypemap ` parameter (q.v.). desttable ^^^^^^^^^ *String.* Table name in the destination (NLP output) database into which to write results from the GATE/cloud NLP application. renames ^^^^^^^ *Multiline string.* This is an optional "column renaming" section. For GATE processors: a list of ``from, to`` things to rename from the GATE output en route to the database. In each case, the ``from`` item is the name of a GATE output annotation. The ``to`` item is the destination field/column name. Also applicable to cloud processors; you can rename columns in this way. The ``from`` item is the column name as specified by the remote processor, and the ``to`` item the local destination column name. Specify one pair per line. You can quote, using shlex_ rules. Case-sensitive. This example: .. code-block:: none renames = firstName firstname renames ``firstName`` to ``firstname``. A more relevant example, in which the GATE annotation names are clearly not well suited to being database column names: .. code-block:: none renames = drug-type drug_type dose-value dose_value dose-unit dose_unit dose-multiple dose_multiple Directionality directionality Experiencer experiencer "Length of Time" length_of_time Temporality temporality "Unit of Time" unit_of_time null_literals ^^^^^^^^^^^^^ *Multiline string.* **Applicable to: GATE only.** Define values that will be treated as ``NULL`` in SQL. For example, sometimes GATE provides the string ``null`` for a NULL value; we can convert to a proper SQL NULL. The parameter is treated as a sequence of words; shlex_ quoting rules apply. Example: .. code-block:: none null_literals = null "" .. _nlp_config_destfields: destfields ^^^^^^^^^^ *Multiline string.* Defines the database field (column) types used in the output database. This is how you tell the database how much space to allocate for information that will come out of GATE.
Each line is a ``column_name, sql_type`` pair (or, optionally, a ``column_name, sql_type, comment`` triple). Whitespace is used to separate the columns. Examples: .. code-block:: none destfields = rule VARCHAR(100) firstname VARCHAR(100) surname VARCHAR(100) gender VARCHAR(7) kind VARCHAR(100) .. code-block:: none destfields = rule VARCHAR(100) Rule used to find this person (e.g. TitleFirstName, PersonFull) firstname VARCHAR(100) First name surname VARCHAR(100) Surname gender VARCHAR(7) Gender (e.g. male, female, unknown) kind VARCHAR(100) Kind of name (e.g. personName, fullName) For cloud applications, this is optional. If you specify **any** lines here, your table will be created in this way (plus additional universal CRATE NLP columns). If you don't specify any, it will be created according to the remote table specification (plus additional universal CRATE NLP columns). indexdefs ^^^^^^^^^ *Multiline string.* Fields to index in the destination table. Each line is an ``indexed_field, index_length`` pair. The ``index_length`` should be an integer or ``None``. Example: .. code-block:: none indexdefs = firstname 64 surname 64 .. _nlp_config_section_envvar: Config file section: environment variables definition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are config file sections named ``[env:XXX]`` where ``XXX`` is the name of one of your environment variable definition blocks. We define environment variable groups here, with one group per section. When a section is selected (e.g. by a :ref:`progenvsection ` parameter in a GATE NLP processor definition as above), these variables can be substituted into the :ref:`progargs ` part of the NLP definition (for when external programs are called) and are available in the operating system environment for those programs themselves. - The environment will start by inheriting the parent environment, then add variables here. - Keys are case-sensitive. Example: ..
code-block:: ini [env:MY_ENV_SECTION] GATE_HOME = /home/myuser/somewhere/GATE_Developer_8.0 NLPPROGDIR = /home/myuser/somewhere/crate_anon/nlp_manager/compiled_nlp_classes MEDEXDIR = /home/myuser/somewhere/Medex_UIMA_1.3.6 KCONNECTDIR = /home/myuser/somewhere/yodie-pipeline-1-2-umls-only OS_PATHSEP = : .. _nlp_config_section_database: Config file section: database definition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are config file sections named ``[database:XXX]`` where ``XXX`` is the name of one of your database definitions. These simply tell CRATE how to connect to different databases. url ^^^ *String.* The URL of the database. Use SQLAlchemy URLs: http://docs.sqlalchemy.org/en/latest/core/engines.html. Example: .. code-block:: ini [database:MY_SOURCE_DATABASE] url = mysql+mysqldb://myuser:password@127.0.0.1:3306/anonymous_output_db?charset=utf8 echo ^^^^ *Boolean.* Default: False. Optional parameter for debugging. If set to True, all SQL being sent to the database will be logged to the Python console log. .. _nlp_config_section_cloud_nlp: Config file section: cloud NLP configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are config file sections named ``[cloud:XXX]`` where ``XXX`` is the name of one of your cloud NLP configurations (referred to by the cloud_config_ parameter in an NLP definition) [#cloudconfigclass]_. .. _nlp_config_cloud_url: cloud_url ^^^^^^^^^ *String.* Required to use cloud NLP. The URL of the cloud NLP service. .. _nlp_config_verify_ssl: verify_ssl ^^^^^^^^^^ *Boolean.* Default: true. Should CRATE verify the SSL certificate of the remote NLP server? compress ^^^^^^^^ *Boolean.* Default: true. Should CRATE compress messages going to the NLP server, using ``gzip``? CRATE (via the Python ``requests`` library) also always tells the server that it will accept ``gzip`` compression back; the server should respond to this by compressing results. username ^^^^^^^^ *String.* Default: "".
Your username for accessing the services at the URL specified in :ref:`cloud_url `. password ^^^^^^^^ *String.* Default: "". Your password for accessing the services at the URL specified in :ref:`cloud_url `. wait_on_conn_err ^^^^^^^^^^^^^^^^ *Integer.* Default: 180. After a connection error occurs, wait this many seconds before retrying. .. _nlp_config_max_content_length: max_content_length ^^^^^^^^^^^^^^^^^^ *Integer.* Default: 0. The maximum size of the packets to be sent. This should be less than or equal to the limit the service allows. Put 0 for no maximum length. NOTE: if a single record is larger than the maximum packet size, that record will not be sent. .. _nlp_config_max_records_per_request: max_records_per_request ^^^^^^^^^^^^^^^^^^^^^^^ *Integer.* Default: 1000. When sending data: the maximum number of pieces of text that will be sent as part of a single NLPRP request (subject also to :ref:`max_content_length `). .. _nlp_config_limit_before_commit: limit_before_commit ^^^^^^^^^^^^^^^^^^^ *Integer.* Default: 1000. When receiving results: the number of results that will be processed (and written to the database) before a ``COMMIT`` command is executed. stop_at_failure ^^^^^^^^^^^^^^^ *Boolean.* Default: true. Are cloud NLP requests for processing allowed to fail, and CRATE continue? If not, an error is raised and CRATE will abort on failure. (Some requests are not allowed to fail, regardless of this setting.) .. _nlp_config_max_tries: max_tries ^^^^^^^^^ *Integer.* Default: 5. Maximum number of times to try each HTTP connection to the cloud NLP server, before giving it up as a bad job. .. _nlp_config_rate_limit_hz: rate_limit_hz ^^^^^^^^^^^^^ *Integer.* Default: 2. The maximum rate, in Hz (times per second), that the CRATE NLP processor will send requests. Use this to avoid overloading the cloud NLP server. Specify 0 for no limit. Parallel processing ~~~~~~~~~~~~~~~~~~~ There are two ways to parallelize CRATE NLP. #. 
You can run multiple NLP processors at the same time, by specifying multiple NLP processors in a single NLP definition within your configuration file. There can be different sources of bottlenecks. One is if database access is limiting. Specifying multiple NLP processors means that text is fetched once (for a given set of input fields) and then run through multiple NLP processors in one go. However, GATE apps can take e.g. 1 GB RAM per process, so be careful if trying to run several of those! CRATE’s regular expression parsers use very little RAM (and can go quite fast: e.g. 2 CPUs processing about 15,000 records through 10 regex parsers in about 166 s, or of the order of 1 kHz). #. You can run multiple simultaneous copies of CRATE's NLP manager. This will divide up the work across the copies (by dividing up the records retrieved from the database). You can use both strategies simultaneously. .. _specimen_nlp_config: Specimen config ~~~~~~~~~~~~~~~ A specimen NLP config is available by running ``crate_nlp --democonfig``. Here's the specimen NLP config: .. literalinclude:: _specimen_nlp_config_file.ini :language: ini =============================================================================== .. rubric:: Footnotes .. [#nlpdefinitionclass] Internally, the config file section is represented by the :class:`crate_anon.nlp_manager.nlp_definition.NlpDefinition` class, which acts as the master config class. .. [#inputfieldconfig] Internally, this information is represented by the :class:`crate_anon.nlp_manager.input_field_config.InputFieldConfig` class. .. [#nlpparser] Internally, this information is represented by classes such as :class:`crate_anon.nlp_manager.parse_gate.Gate` and :class:`crate_anon.nlp_manager.parse_biochemistry.Crp`, which are subclasses of :class:`crate_anon.nlp_manager.base_nlp_parser.BaseNlpParser`. .. [#cloudconfigclass] Internally, this information is represented by the :class:`crate_anon.nlp_manager.cloud_config.CloudConfig` class. ..
[#hashcollisions] https://en.wikipedia.org/wiki/Hash_function .. [#outputuserconfig] Internally, this information is represented by the :class:`crate_anon.nlp_manager.output_user_config.OutputUserConfig` class.