14.5.25. crate_anon.nlp_manager.parse_gate

crate_anon/nlp_manager/parse_gate.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


NLP handler for external GATE NLP tools.

The pipe encoding (Python -> Java stdin, Java stdout -> Python) is fixed to be UTF-8, here and in the Java code.

class crate_anon.nlp_manager.parse_gate.Gate(nlpdef: NlpDefinition, cfg_processor_name: str, commit: bool = False)[source]

EXTERNAL.

Abstract NLP processor controlling an external process, typically our Java interface to GATE programs, CrateGatePipeline.java (but it could be any external program).

We send text to it, it parses the text, and it sends us back results, which we return as dictionaries. The specific text sought depends on the configuration file and the specific GATE program used.

For details of GATE, see https://www.gate.ac.uk/.

__init__(nlpdef: NlpDefinition, cfg_processor_name: str, commit: bool = False) → None[source]
Parameters:
  • nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition

  • cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)

  • commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks. (See the instantiation sketch below.)
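
A minimal instantiation sketch (not the project's own code): it assumes an NlpDefinition has already been built from a CRATE NLP config file elsewhere, and the config section name used here is hypothetical:

    from crate_anon.nlp_manager.nlp_definition import NlpDefinition
    from crate_anon.nlp_manager.parse_gate import Gate

    def make_gate_processor(nlpdef: NlpDefinition) -> Gate:
        # "procdef_gate_demo" is a hypothetical config section name.
        # commit=True is advisable in multiprocess mode, to avoid database
        # deadlocks (see the commit parameter above).
        return Gate(
            nlpdef=nlpdef,
            cfg_processor_name="procdef_gate_demo",
            commit=True,
        )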

dest_tables_columns() → Dict[str, List[Column]][source]

Describes the destination table(s) that this NLP processor wants to write to.

Returns:

a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.

Return type:

dict

dest_tables_indexes() → Dict[str, List[Index]][source]

Describes indexes that this NLP processor suggests for its destination table(s).

Returns:

a dictionary of {tablename: indexes}, where indexes is a list of SQLAlchemy Index objects.

Return type:

dict
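
To make the two return shapes above concrete, here is an illustrative sketch; the table, column and index names are hypothetical, since the real ones depend on the GATE application configured:

    from sqlalchemy import Column, Index, Text

    # {tablename: [Column, ...]}, as returned by dest_tables_columns()
    example_columns = {
        "person": [
            Column("rule", Text, comment="GATE rule that matched"),
            Column("firstname", Text),
            Column("surname", Text),
        ],
    }

    # {tablename: [Index, ...]}, as returned by dest_tables_indexes()
    example_indexes = {
        "person": [
            Index("_idx_surname", "surname", mysql_length=64),
        ],
    }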

nlprp_name() → str[source]

Returns the processor’s name for use in response to the NLPRP list_processors command.

The default is the fully qualified module/class name – because this is highly unlikely to clash with any other NLP processors on a given server.

nlprp_schema_info(sql_dialect: str | None = None) → Dict[str, Any][source]

Returns a dictionary providing the schema_type parameter, and associated parameters describing the schema (e.g. tabular_schema), for the NLPRP list_processors command.

This is not a classmethod, because it may be specialized as we load external schema information (e.g. GATE processors).

Parameters:

sql_dialect – preferred SQL dialect for tabular_schema
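
An illustrative call covering the two NLPRP methods above, assuming gate_proc is a configured Gate instance (as sketched earlier); the exact contents of the returned dictionary are governed by the NLPRP specification, and "mysql" is just an example dialect preference:

    # Defaults to the processor's fully qualified module/class name.
    processor_name = gate_proc.nlprp_name()

    # Schema information for the NLPRP list_processors reply.
    schema_info = gate_proc.nlprp_schema_info(sql_dialect="mysql")
    # schema_info supplies the schema_type parameter and, where available,
    # associated parameters such as tabular_schema.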

parse(text: str) → Generator[Tuple[str, Dict[str, Any]], None, None][source]
  • Send text to the external process, and receive the result.

  • Note that associated data is not passed into this function, and is kept in the Python environment, so we can’t run into any problems with the transfer to/from the Java program garbling important data. All we send to the subprocess is the text (and an input_terminator). Then, we may receive MULTIPLE sets of data back (“your text contains the following 7 people/drug references/whatever”), followed eventually by the output_terminator, at which point this set is complete. (See the usage sketch below.)
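
A minimal usage sketch, assuming gate_proc is a configured Gate instance and that the framework has started the external CrateGatePipeline process for it (the example text is invented):

    text = "Mr John Smith was seen in clinic today."
    for tablename, valuedict in gate_proc.parse(text):
        # Each yielded item is one result set bound for one destination
        # table, as a {column: value} dictionary.
        print(tablename, valuedict)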

test(verbose: bool = False) → None[source]

Test the send() function.