14.5.5. crate_anon.nlp_manager.base_nlp_parser
crate_anon/nlp_manager/base_nlp_parser.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Simple base class for all our NLP parsers (GATE, regex, …)
- class crate_anon.nlp_manager.base_nlp_parser.BaseNlpParser(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?')[source]
Base class for all local CRATE NLP parsers.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?') None [source]
__init__ function for TableMaker.
- Parameters:
nlpdef – An instance of crate_anon.nlp_manager.nlp_definition.NlpDefinition.
cfg_processor_name – The name of a CRATE NLP config file section, TO WHICH we will add a processor: prefix (from which section we may choose to get extra config information).
commit – Force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
friendly_name – Friendly name for the parser.
- static describe_sqla_col(column: Column, sql_dialect: str | None = None) Dict[str, Any] [source]
Describes a single SQLAlchemy Column in the NLPRP format, which follows INFORMATION_SCHEMA.COLUMNS closely.
- Parameters:
column – the Column
sql_dialect – preferred SQL dialect for response, or None for a default
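As a rough illustration of the kind of dictionary such a description might contain, the sketch below builds one from plain values rather than from a real SQLAlchemy Column, so that it runs without dependencies. The key names are lowercased INFORMATION_SCHEMA.COLUMNS column names and are an assumption here, not a statement of the exact NLPRP keys; describe_column_sketch is a hypothetical helper, not part of CRATE.

```python
from typing import Any, Dict, Optional


def describe_column_sketch(
    name: str,
    type_sql: str,
    nullable: bool = True,
    comment: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Hypothetical helper: describe one column using keys modelled on
    INFORMATION_SCHEMA.COLUMNS (assumed names, for illustration only).
    """
    # "VARCHAR(64)" -> "VARCHAR": keep the part before any length/precision.
    data_type = type_sql.partition("(")[0]
    return {
        "column_name": name,
        "column_type": type_sql,
        "data_type": data_type,
        "is_nullable": nullable,
        "column_comment": comment,
    }


description = describe_column_sketch(
    "drug_name", "VARCHAR(64)", nullable=False, comment="Drug name found"
)
```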
- classmethod nlprp_description() str [source]
Returns the processor’s description for use in response to the NLPRP list_processors command.
Uses each processor’s docstring, and reformats it slightly.
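The exact reformatting is not specified here. As a minimal sketch, assuming that it amounts to collapsing the docstring's whitespace, something like the following would turn a multi-line docstring into a single-line description. DemoProcessor and description_sketch are hypothetical names used only for illustration.

```python
class DemoProcessor:
    """
    Finds mentions of   a made-up
    biomarker in free text.
    """

    @classmethod
    def description_sketch(cls) -> str:
        """Return the class docstring with whitespace runs collapsed."""
        # str.split() with no arguments splits on any run of whitespace,
        # so re-joining with single spaces flattens newlines and indentation.
        return " ".join((cls.__doc__ or "").split())


description = DemoProcessor.description_sketch()
```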
- classmethod nlprp_is_default_version() bool [source]
Returns whether this processor is the default version of its name, for use in response to the NLPRP list_processors command.
The default is True.
- classmethod nlprp_name() str [source]
Returns the processor’s name for use in response to the NLPRP list_processors command.
The default is the fully qualified module/class name – because this is highly unlikely to clash with any other NLP processors on a given server.
- nlprp_processor_info(sql_dialect: str | None = None) Dict[str, Any] [source]
Returns a dictionary suitable for use as this processor’s response to the NLPRP list_processors command.
This is not a classmethod, because it may be specialized as we load external schema information (e.g. GATE processors).
- Parameters:
sql_dialect – preferred SQL dialect for tabular_schema
- nlprp_processor_info_json(indent: int = 4, sort_keys: bool = True, sql_dialect: str | None = None) str [source]
Returns a formatted JSON string from nlprp_schema_info(). This is primarily for debugging.
- Parameters:
indent – number of spaces for indentation
sort_keys – sort keys?
sql_dialect – preferred SQL dialect for tabular_schema, or None for default
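To show what the indent and sort_keys parameters control, here is a minimal sketch that formats an information dictionary with the standard library's json.dumps. Whether the real method is implemented this way is an assumption; info_json_sketch and the example dictionary keys are hypothetical.

```python
import json
from typing import Any, Dict


def info_json_sketch(
    info: Dict[str, Any], indent: int = 4, sort_keys: bool = True
) -> str:
    """Hypothetical helper: pretty-print an info dictionary as JSON."""
    return json.dumps(info, indent=indent, sort_keys=sort_keys)


# Illustrative content only; not the real NLPRP response structure.
example_info = {"title": "DemoProcessor", "is_default_version": True}
formatted = info_json_sketch(example_info, indent=2)
print(formatted)
```

With sort_keys=True, keys are emitted in alphabetical order regardless of insertion order, which makes debugging output easier to compare between runs.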
- nlprp_schema_info(sql_dialect: str | None = None) Dict[str, Any] [source]
Returns a dictionary for the schema_type parameter, and associated parameters describing the schema (e.g. tabular_schema), of the NLPRP list_processors command.
This is not a classmethod, because it may be specialized as we load external schema information (e.g. GATE processors).
- Parameters:
sql_dialect – preferred SQL dialect for tabular_schema
- classmethod nlprp_title() str [source]
Returns the processor’s title for use in response to the NLPRP list_processors command.
The default is the short Python class name.
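The difference between the default name (fully qualified module/class name) and the default title (short class name) can be sketched as follows. DemoProcessor, name_sketch, and title_sketch are hypothetical; the real classmethods may build these strings differently.

```python
class DemoProcessor:
    """Hypothetical processor class, for illustration only."""

    @classmethod
    def name_sketch(cls) -> str:
        # Fully qualified name: module path plus class name.
        return f"{cls.__module__}.{cls.__qualname__}"

    @classmethod
    def title_sketch(cls) -> str:
        # Short name: the class name alone.
        return cls.__name__


name = DemoProcessor.name_sketch()
title = DemoProcessor.title_sketch()
```

Because the qualified name includes the module path, two processors with the same class name in different modules still get distinct names.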
- classmethod nlprp_version() str [source]
Returns the processor’s version for use in response to the NLPRP list_processors command.
The default is the current CRATE version.
- abstract parse(text: str) Generator[Tuple[str, Dict[str, Any]], None, None] [source]
Main parsing function.
- Parameters:
text – the raw text to parse
- Yields:
tuple – tablename, valuedict, where valuedict is a dictionary of {columnname: value}. The values returned are ONLY those generated by NLP, and do not include either (a) the source reference values (_srcdb, _srctable, etc.) or (b) the “copy” fields.
- Raises:
crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed – if we could not process this text.
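The shape of the values yielded by parse() can be illustrated with a toy regex-based parser. This is a minimal sketch: ToyWeightParser is hypothetical, does not inherit from BaseNlpParser, and its table and column names are made up.

```python
import re
from typing import Any, Dict, Generator, Tuple

# Matches e.g. "weight 70.5 kg" (case-insensitive).
WEIGHT_RE = re.compile(r"\bweight\s+(\d+(?:\.\d+)?)\s*kg\b", re.IGNORECASE)


class ToyWeightParser:
    """Hypothetical stand-in for a parser; illustration only."""

    tablename = "toy_weight"  # made-up destination table name

    def parse(
        self, text: str
    ) -> Generator[Tuple[str, Dict[str, Any]], None, None]:
        # Yield one (tablename, valuedict) pair per match. Only NLP-derived
        # values appear here; no source reference or "copy" fields.
        for match in WEIGHT_RE.finditer(text):
            yield self.tablename, {
                "value_kg": float(match.group(1)),
                "matched_text": match.group(0),
            }


results = list(
    ToyWeightParser().parse("Weight 70.5 kg today; weight 71 kg last week.")
)
```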
- process(text: str, starting_fields_values: Dict[str, Any]) None [source]
The core function that takes a single piece of text and feeds it through a single NLP processor. This may produce zero, one, or many output records. Those records are then merged with information about their source (etc.), and inserted into the destination database.
- Parameters:
text – the raw text to parse
starting_fields_values – a dictionary of the format {columnname: value} that should be added to whatever the NLP processor comes up with. This will, in practice, include source metadata (which table, row [PK], and column did the text come from), processing metadata (when did the NLP processing take place?), and other values that the user has told us to copy across from the source database.
- Raises:
crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed – if this parser could not process the text
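The merging step can be sketched as a dictionary update per output record. This is an illustration under assumptions: merge_records_sketch is hypothetical, database insertion is omitted, and letting NLP values override starting values on a key clash is a choice made for this sketch, not something stated above.

```python
from typing import Any, Dict, Iterable, List, Tuple


def merge_records_sketch(
    parsed: Iterable[Tuple[str, Dict[str, Any]]],
    starting_fields_values: Dict[str, Any],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: combine source metadata with each NLP record."""
    merged = []
    for tablename, nlp_values in parsed:
        row = dict(starting_fields_values)  # copy, so the input is not mutated
        row.update(nlp_values)  # assumption: NLP values win on a key clash
        merged.append((tablename, row))
    return merged


starting = {"_srcdb": "sourcedb", "_srctable": "notes"}
parsed = [
    ("toy_weight", {"value_kg": 70.5}),
    ("toy_weight", {"value_kg": 71.0}),
]
rows = merge_records_sketch(parsed, starting)
```

A parser that yields nothing produces no rows at all, matching the "zero, one, or many output records" behaviour described above.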
- class crate_anon.nlp_manager.base_nlp_parser.TableMaker(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?')[source]
Base class for all CRATE NLP processors, local and cloud, including those that talk to third-party software. Manages the interface to databases for results storage, etc.
- __init__(nlpdef: NlpDefinition | None, cfg_processor_name: str | None, commit: bool = False, friendly_name: str = '?') None [source]
__init__ function for TableMaker.
- Parameters:
nlpdef – An instance of crate_anon.nlp_manager.nlp_definition.NlpDefinition.
cfg_processor_name – The name of a CRATE NLP config file section, TO WHICH we will add a processor: prefix (from which section we may choose to get extra config information).
commit – Force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.
friendly_name – Friendly name for the parser.
- delete_dest_record(ifconfig: InputFieldConfig, srcpkval: int, srcpkstr: str | None, commit: bool = False) None [source]
Deletes all destination records for a given source record.
Used during incremental updates.
For when a record (specified by srcpkval) has been updated in the source; wipe older entries for it in the destination database(s).
- Parameters:
ifconfig – crate_anon.nlp_manager.input_field_config.InputFieldConfig that defines the source database, table, and field (column)
srcpkval – integer primary key (PK) value
srcpkstr – for tables with string PKs: the string PK value
commit – execute a COMMIT after we have deleted the records? If you don’t do this, we will get deadlocks in incremental mode. See e.g. https://dev.mysql.com/doc/refman/5.5/en/innodb-deadlocks.html
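The delete-then-commit pattern can be sketched with the standard library's sqlite3 module. This is a simplified illustration, not the CRATE implementation: the table toy_dest, the column name _srcpkval, and delete_dest_record_sketch are assumptions, and the real method also identifies the source database, table, and field via ifconfig.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_dest (_srcpkval INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO toy_dest VALUES (?, ?)",
    [(1, "old a"), (1, "old b"), (2, "keep")],
)
conn.commit()


def delete_dest_record_sketch(
    connection: sqlite3.Connection, srcpkval: int, commit: bool = False
) -> None:
    """Hypothetical helper: wipe all destination rows for one source PK."""
    connection.execute("DELETE FROM toy_dest WHERE _srcpkval = ?", (srcpkval,))
    if commit:
        # Committing promptly releases locks held by the DELETE.
        connection.commit()


delete_dest_record_sketch(conn, srcpkval=1, commit=True)
remaining = conn.execute("SELECT _srcpkval, value FROM toy_dest").fetchall()
```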
- delete_where_srcpk_not(ifconfig: InputFieldConfig, temptable: Table | None) None [source]
Function to help with deleting NLP destination records whose source records have been deleted.
See crate_anon.nlp_manager.nlp_manager.delete_where_no_source().
- Parameters:
ifconfig – crate_anon.nlp_manager.input_field_config.InputFieldConfig that defines the source database, table, and field (column).
temptable – If this is specified (as an SQLAlchemy Table), we delete NLP destination records whose source PK has not been inserted into this table. Otherwise, we delete all NLP destination records from the source column.
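The two behaviours described for temptable can be sketched in plain SQL via sqlite3. This is a minimal illustration under assumptions: toy_dest, toy_temp, the column names, and delete_where_srcpk_not_sketch are hypothetical, the temporary table is passed by name rather than as an SQLAlchemy Table, and the toy destination table holds records from a single source column only.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_dest (_srcpkval INTEGER NOT NULL, value TEXT)")
conn.executemany(
    "INSERT INTO toy_dest VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
# PKs of source records that still exist (record 2 has been deleted).
conn.execute("CREATE TABLE toy_temp (srcpkval INTEGER NOT NULL)")
conn.executemany("INSERT INTO toy_temp VALUES (?)", [(1,), (3,)])


def delete_where_srcpk_not_sketch(
    connection: sqlite3.Connection, temptable: Optional[str]
) -> None:
    """Hypothetical helper mirroring the two cases described above."""
    if temptable is None:
        # No list of surviving PKs: delete every destination record.
        connection.execute("DELETE FROM toy_dest")
    else:
        # Table name is interpolated; acceptable only for trusted,
        # hard-coded names as in this sketch.
        connection.execute(
            "DELETE FROM toy_dest WHERE _srcpkval NOT IN "
            f"(SELECT srcpkval FROM {temptable})"
        )


delete_where_srcpk_not_sketch(conn, "toy_temp")
after_temptable = conn.execute(
    "SELECT _srcpkval FROM toy_dest ORDER BY _srcpkval"
).fetchall()

delete_where_srcpk_not_sketch(conn, None)
after_none = conn.execute("SELECT _srcpkval FROM toy_dest").fetchall()
```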
- property dest_dbname: str
Returns the friendly (config file) name for the destination database (which this NLP processor was told about at construction).
- property dest_engine: Engine
Returns the SQLAlchemy database Engine for the destination database (which this NLP processor was told about at construction).
- property dest_metadata: MetaData
Returns the SQLAlchemy metadata for the destination database (which this NLP processor was told about at construction).
- property dest_session: Session
Returns the SQLAlchemy ORM Session for the destination database (which this NLP processor was told about at construction).
- abstract dest_tables_columns() Dict[str, List[Column]] [source]
Describes the destination table(s) that this NLP processor wants to write to.
- Returns:
a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.
- Return type:
dict
- dest_tables_indexes() Dict[str, List[Index]] [source]
Describes indexes that this NLP processor suggests for its destination table(s).
- Returns:
a dictionary of {tablename: indexes}, where indexes is a list of SQLAlchemy Index objects.
- Return type:
dict
- property destdb: DatabaseHolder
Returns the destination database.
- property friendly_name: str
Returns the NLP parser’s friendly name.
- property friendly_name_with_section: str
Returns the NLP parser’s friendly name and config section.
- get_table(tablename: str) Table [source]
Returns an SQLAlchemy Table for a given destination table of this NLP processor whose name is tablename.
- get_tablenames() Iterable[str] [source]
Returns all destination table names for this NLP processor.
- make_tables(drop_first: bool = False) List[str] [source]
Creates all destination tables for this NLP processor in the destination database.
- Parameters:
drop_first – drop the tables first?
- property nlpdef_name: str | None
Returns the name of our crate_anon.nlp_manager.nlp_definition.NlpDefinition, if we have one, or None.