14.5.5. crate_anon.nlp_manager.base_nlp_parser

crate_anon/nlp_manager/base_nlp_parser.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Simple base class for all our NLP parsers (GATE, regex, …)

class crate_anon.nlp_manager.base_nlp_parser.BaseNlpParser(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?')[source]

Base class for all local CRATE NLP parsers.

__init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?') None[source]

__init__ function for TableMaker.

Parameters:
  • nlpdef – An instance of crate_anon.nlp_manager.nlp_definition.NlpDefinition.

  • cfg_processor_name – The name of a CRATE NLP config file section, TO WHICH we will add a processor: prefix (from which section we may choose to get extra config information).

  • commit – Force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.

  • friendly_name – Friendly name for the parser.

static describe_sqla_col(column: Column, sql_dialect: str | None = None) Dict[str, Any][source]

Describes a single SQLAlchemy Column in the NLPRP format, which follows INFORMATION_SCHEMA.COLUMNS closely.

Parameters:
  • column – the Column

  • sql_dialect – preferred SQL dialect for response, or None for a default

classmethod nlprp_description() str[source]

Returns the processor’s description for use in response to the NLPRP list_processors command.

Uses each processor’s docstring, and reformats it slightly.

classmethod nlprp_is_default_version() bool[source]

Returns whether this processor is the default version of its name, for use in response to the NLPRP list_processors command.

The default is True.

classmethod nlprp_name() str[source]

Returns the processor’s name for use in response to the NLPRP list_processors command.

The default is the fully qualified module/class name – because this is highly unlikely to clash with any other NLP processors on a given server.

nlprp_processor_info(sql_dialect: str | None = None) Dict[str, Any][source]

Returns a dictionary suitable for use as this processor’s response to the NLPRP list_processors command.

This is not a classmethod, because it may be specialized as we load external schema information (e.g. GATE processors).

Parameters:

sql_dialect – preferred SQL dialect for tabular_schema

nlprp_processor_info_json(indent: int = 4, sort_keys: bool = True, sql_dialect: str | None = None) str[source]

Returns a formatted JSON string from nlprp_processor_info(). This is primarily for debugging.

Parameters:
  • indent – number of spaces for indentation

  • sort_keys – sort keys?

  • sql_dialect – preferred SQL dialect for tabular_schema, or None for default
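The effect of indent and sort_keys can be sketched with the standard library alone. The dictionary below is a hypothetical stand-in for a processor-info structure (its keys are illustrative, not taken from the NLPRP specification), and to_json is an assumed helper name, not part of CRATE.

```python
import json
from typing import Any, Dict

# Hypothetical processor-info dictionary, for illustration only.
info: Dict[str, Any] = {
    "title": "ExampleProcessor",
    "version": "0.0.0",
    "is_default_version": True,
}


def to_json(d: Dict[str, Any], indent: int = 4, sort_keys: bool = True) -> str:
    """Format a dictionary as human-readable JSON, as one might for debugging."""
    return json.dumps(d, indent=indent, sort_keys=sort_keys)


text = to_json(info)
print(text)
```

With sort_keys=True the keys appear alphabetically, which makes two dumps easy to compare with a diff tool.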

nlprp_schema_info(sql_dialect: str | None = None) Dict[str, Any][source]

Returns a dictionary for the schema_type parameter, and associated parameters describing the schema (e.g. tabular_schema), of the NLPRP list_processors command.

This is not a classmethod, because it may be specialized as we load external schema information (e.g. GATE processors).

Parameters:

sql_dialect – preferred SQL dialect for tabular_schema

classmethod nlprp_title() str[source]

Returns the processor’s title for use in response to the NLPRP list_processors command.

The default is the short Python class name.

classmethod nlprp_version() str[source]

Returns the processor’s version for use in response to the NLPRP list_processors command.

The default is the current CRATE version.

abstract parse(text: str) Generator[Tuple[str, Dict[str, Any]], None, None][source]

Main parsing function.

Parameters:

text – the raw text to parse

Yields:

tuple – (tablename, valuedict), where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and include neither (a) the source reference values (_srcdb, _srctable, etc.) nor (b) the “copy” fields.
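A minimal sketch of the yield contract described above, written as a standalone generator rather than as a real BaseNlpParser subclass. The table name crp_results, the column names, and the regular expression are all hypothetical and chosen only for illustration.

```python
import re
from typing import Any, Dict, Generator, Tuple

# Hypothetical pattern: matches e.g. "CRP 12" or "CRP: 7.5".
CRP_REGEX = re.compile(r"\bCRP\b\s*:?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse(text: str) -> Generator[Tuple[str, Dict[str, Any]], None, None]:
    """Yield one (tablename, valuedict) pair per match found in the text."""
    for m in CRP_REGEX.finditer(text):
        # Only NLP-generated values go here; no source reference fields.
        yield "crp_results", {
            "value_mg_l": float(m.group(1)),
            "start_pos": m.start(),
            "end_pos": m.end(),
        }


results = list(parse("CRP 12 on admission; CRP: 7.5 at discharge."))
```

Because the function is a generator, text with no matches simply produces no records, which corresponds to the "zero, one, or many output records" case.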

Raises:
process(text: str, starting_fields_values: Dict[str, Any]) None[source]

The core function that takes a single piece of text and feeds it through a single NLP processor. This may produce zero, one, or many output records. Those records are then merged with information about their source (etc.), and inserted into the destination database.

Parameters:
  • text – the raw text to parse

  • starting_fields_values – a dictionary of the format {columnname: value} that should be added to whatever the NLP processor comes up with. This will, in practice, include source metadata (which table, row [PK], and column did the text come from), processing metadata (when did the NLP processing take place?), and other values that the user has told us to copy across from the source database.
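The merge step can be sketched as follows, without any database. This is an illustration under stated assumptions, not CRATE's implementation: merge_records is a hypothetical helper, the example values are invented, and the precedence on a column-name clash (here the NLP value wins) is an assumption that the text above does not specify.

```python
from typing import Any, Dict, Iterable, List, Tuple


def merge_records(
    parsed: Iterable[Tuple[str, Dict[str, Any]]],
    starting_fields_values: Dict[str, Any],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Combine each NLP-generated valuedict with the shared starting values."""
    merged = []
    for tablename, nlp_values in parsed:
        final_values = dict(starting_fields_values)  # copy; leave the input intact
        final_values.update(nlp_values)  # assumption: NLP values win on a clash
        merged.append((tablename, final_values))
    return merged


starting = {"_srcdb": "notes_db", "_srctable": "progress_notes"}
records = merge_records([("crp_results", {"value_mg_l": 12.0})], starting)
```

Each resulting dictionary is what would then be inserted as one row into the named destination table.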

Raises:
abstract test(verbose: bool = False) None[source]

Performs a self-test on the NLP processor.

Parameters:

verbose – Be verbose?

This is an abstract method that subclasses must implement.

test_parser(test_strings: List[str]) None[source]

Tests the NLP processor’s parser with a set of test strings.

class crate_anon.nlp_manager.base_nlp_parser.TableMaker(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?')[source]

Base class for all CRATE NLP processors, local and cloud, including those that talk to third-party software. Manages the interface to databases for results storage, etc.

__init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?') None[source]

__init__ function for TableMaker.

Parameters:
  • nlpdef – An instance of crate_anon.nlp_manager.nlp_definition.NlpDefinition.

  • cfg_processor_name – The name of a CRATE NLP config file section, TO WHICH we will add a processor: prefix (from which section we may choose to get extra config information).

  • commit – Force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.

  • friendly_name – Friendly name for the parser.

classmethod classname() str[source]

Returns the short Python name of this class.

delete_dest_record(ifconfig: InputFieldConfig, srcpkval: int, srcpkstr: str | None, commit: bool = False) None[source]

Deletes all destination records for a given source record.

  • Used during incremental updates.

  • For when a record (specified by srcpkval) has been updated in the source; wipe older entries for it in the destination database(s).

Parameters:
delete_where_srcpk_not(ifconfig: InputFieldConfig, temptable: Table | None) None[source]

Function to help with deleting NLP destination records whose source records have been deleted.

See crate_anon.nlp_manager.nlp_manager.delete_where_no_source().

Parameters:
  • ifconfig – crate_anon.nlp_manager.input_field_config.InputFieldConfig that defines the source database, table, and field (column).

  • temptable – If this is specified (as an SQLAlchemy Table), we delete NLP destination records whose source PK has not been inserted into this table. Otherwise, we delete all NLP destination records from the source column.
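The two behaviours described for temptable can be sketched in plain SQL using the standard-library sqlite3 module. This is a simplified illustration, not CRATE's code: the table and column names (nlp_dest, temp_pks, srcpk) are hypothetical, the destination table is assumed to hold records from a single source column, and the identifiers are trusted constants because they are interpolated directly into the SQL.

```python
import sqlite3
from typing import Optional


def delete_where_srcpk_not(
    conn: sqlite3.Connection, dest_table: str, temptable: Optional[str]
) -> None:
    """Delete destination rows whose source PK is absent from temptable."""
    if temptable is None:
        # No list of surviving PKs: remove every destination row.
        conn.execute(f"DELETE FROM {dest_table}")
    else:
        conn.execute(
            f"DELETE FROM {dest_table} "
            f"WHERE srcpk NOT IN (SELECT srcpk FROM {temptable})"
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nlp_dest (srcpk INTEGER NOT NULL, value TEXT)")
conn.execute("CREATE TABLE temp_pks (srcpk INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO nlp_dest VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO temp_pks VALUES (?)", [(1,), (3,)])

# Source record 2 no longer exists, so its NLP output is removed.
delete_where_srcpk_not(conn, "nlp_dest", "temp_pks")
remaining = [row[0] for row in conn.execute("SELECT srcpk FROM nlp_dest ORDER BY srcpk")]
```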

property dest_dbname: str

Returns the friendly (config file) name for the destination database (which this NLP processor was told about at construction).

property dest_engine: Engine

Returns the SQLAlchemy database Engine for the destination database (which this NLP processor was told about at construction).

property dest_metadata: MetaData

Returns the SQLAlchemy metadata for the destination database (which this NLP processor was told about at construction).

property dest_session: Session

Returns the SQLAlchemy ORM Session for the destination database (which this NLP processor was told about at construction).

abstract dest_tables_columns() Dict[str, List[Column]][source]

Describes the destination table(s) that this NLP processor wants to write to.

Returns:

a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.

Return type:

dict

dest_tables_indexes() Dict[str, List[Index]][source]

Describes indexes that this NLP processor suggests for its destination table(s).

Returns:

a dictionary of {tablename: indexes}, where indexes is a list of SQLAlchemy Index objects.

Return type:

dict

property destdb: DatabaseHolder

Returns the destination database.

property friendly_name: str

Returns the NLP parser’s friendly name.

property friendly_name_with_section: str

Returns the NLP parser’s friendly name and config section.

classmethod fully_qualified_classname() str[source]

Returns the class’s fully qualified name.
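For readers unfamiliar with the distinction between the short and fully qualified class names, here is one common way to derive both in Python. ExampleProcessor is a stand-in class, and the method bodies are an assumption about a typical approach; CRATE's own implementation may differ.

```python
class ExampleProcessor:
    """Stand-in class for illustration; not part of CRATE."""

    @classmethod
    def classname(cls) -> str:
        # Short name, e.g. "ExampleProcessor".
        return cls.__name__

    @classmethod
    def fully_qualified_classname(cls) -> str:
        # Module path plus class name, e.g. "mypackage.mymodule.ExampleProcessor".
        return f"{cls.__module__}.{cls.__qualname__}"


class DerivedProcessor(ExampleProcessor):
    """Subclasses report their own names, since these are classmethods."""


short_name = DerivedProcessor.classname()
full_name = DerivedProcessor.fully_qualified_classname()
```

The fully qualified form includes the module path, which is why it is unlikely to collide with the name of another processor on the same server.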

get_table(tablename: str) Table[source]

Returns an SQLAlchemy Table for a given destination table of this NLP processor whose name is tablename.

get_tablenames() Iterable[str][source]

Returns all destination table names for this NLP processor.

classmethod is_cloud_processor() bool[source]

Is this class a cloud-based (remote) NLP processor?

make_tables(drop_first: bool = False) List[str][source]

Creates all destination tables for this NLP processor in the destination database.

Parameters:

drop_first – drop the tables first?

property nlpdef_name: str | None

Returns the name of our crate_anon.nlp_manager.nlp_definition.NlpDefinition, if we have one, or None.

tables() Dict[str, Table][source]

Returns a dictionary of {tablename: Table}, mapping table names to SQLAlchemy Table objects, for all destination tables of this NLP processor.

exception crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed[source]