.. crate_anon/docs/source/nlp/overview.rst .. Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk). . This file is part of CRATE. . CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. . CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. . You should have received a copy of the GNU General Public License along with CRATE. If not, see . Overview of NLP --------------- The purpose of NLP is to start with **free text** that a human wrote and end up with **structured data** that a machine can deal with. CRATE provides a high-level NLP management system. It churns through databases (typically, databases that have already been de-identified by CRATE's :ref:`anonymisation ` system) and sends text to one or more **NLP processors**. It takes the results and stashes them in an NLP **output database**. NLP processors can be disparate. For example, CRATE has support for: - built-in :ref:`Python NLP via regular expressions `; - all sorts of tools that use :ref:`GATE `; - other third-party tools like :ref:`MedEx-UIMA `. .. _standard_nlp_output_columns: Standard NLP output columns ~~~~~~~~~~~~~~~~~~~~~~~~~~~ All CRATE NLP processors use the following output columns: =================== =============== =========================================== Column SQL type Description =================== =============== =========================================== _pk BIGINT Arbitrary PK of output record _nlpdef VARCHAR(64) Name of the NLP definition producing this row _srcdb VARCHAR(64) Source database name (from CRATE NLP config) _srctable VARCHAR(64) Source table name _srcpkfield VARCHAR(64) PK field (column) name in source table _srcpkval BIGINT PK of source record (or integer hash of PK if the PK is a string) _srcpkstr VARCHAR(64) NULL if the table has an integer PK, but the PK itself if the PK was a string, to deal with hash collisions. _srcfield VARCHAR(64) Field (column) name of source text _srcdatetimefield DATETIME Field (column) name containing the source date/time. (Added in v0.18.52.) _srcdatetimeval DATETIME Date/time of the source field. (Added in v0.18.52.) _crate_version VARCHAR(147) Version of CRATE that generated this NLP record, in semantic version form. (Added in v0.18.53.) _when_fetched_utc DATETIME Date/time (in UTC) that the NLP processor fetched the record from the source database. (Added in v0.18.53.) =================== =============== =========================================== The length of the VARCHAR fields that refer to relational database entity names is set by the `MAX_SQL_FIELD_LEN` constant. These default output columns are prefixed with an underscore to reduce the risk of name clashes (for example, with :ref:`GATE NLP applications ` that can themselves generate arbitrary column names). Columns beginning with an underscore are a nuisance for R, though; one has to refer to them in data tables as e.g. ``dt$`_myfield``` rather than ``dt$myfield``.