7.8. Natural Language Processing Request Protocol (NLPRP)

Version 0.3.0

7.8.1. Authors 

In alphabetical order:

Rudolf N. Cardinal (RNC), University of Cambridge, 2017-.
Joe Kearney (JK), University of Cambridge, 2018-19.
Angus Roberts (AR), King’s College London, 2018-.
Ian Roberts (IR), University of Sheffield, 2018-.
Francesca Spivack (FS), University of Cambridge, 2018-20.

7.8.2. Rationale 

In the context of research using electronic medical records (EMR) systems, a consideration is the need to derive structured data from free text, using natural language processing (NLP) software.

However, not all individuals and/or institutions may have the necessary resources to perform bulk NLP as they wish. In particular, there are two common constraints. Firstly, the computing power necessary for NLP can be considerable. Secondly, some NLP programs may be sensitive (for example, by virtue of containing fragments of clinical free text from another institution, used as an exemplar for an algorithm) and thus not widely distributable. Accordingly, there is a need for a client–server NLP framework, and thus for a defined request protocol.

The protocol is not RESTful. In particular, it is not stateless. State can be maintained on the server in between requests, if desired, through the notion of queued requests.

7.8.3. Communications stack 

The underlying application layer is HTTP (and HTTPS, encrypted HTTP, is strongly encouraged), over TCP/IP.

7.8.4. Request 

7.8.4.1. Request format 

Requests via HTTP POST

Requests are always transmitted using the HTTP POST method.

Rationale:
1. Some calls modify state on the server, making GET inappropriate.
2. Some requests will be large, making GET URL encoding inappropriate.
3. Many requests involve sensitive data, which should not be encoded into URLs via GET.
4. Therefore, all requests are via POST [1].
The POST method itself is broad [2].

Requests in JSON

The request content is in JSON, with media type application/json. The encoding can be specified, and will be assumed to be UTF-8 if not specified.

Rationale: we need a structured notation supporting list types, key–value pair (KVP) types, a null notation, and arbitrary nesting. JSON fulfils this, is simple (and thus fast), is legible to humans, and is widely supported under many programming languages. Other formats such as XML require considerably more complex parsing and are slower [3].
Consideration: denial-of-service attacks in which large quantities of nonsense are sent. We considered using XML instead of JSON as XML is intrinsically ordered; we could enforce a constraint of having call arguments such as parameter lists preceding textual content, and abandoning processing if the request is malformed. Instead, we elected to keep JSON but move authentication to the HTTP level (so non-authenticated requests can be thrown away earlier) and allow the server to impose its own choice of maximum request size. With that done, all requests coming through will be from authenticated users and of a reasonable request size. At that point, JSON continues to look simpler.

Note the JSON terminology:

Value: one of:
- String: zero or more Unicode characters, wrapped in double quotes, using backslash escapes.
- Number: a raw number.
- The literals true, false, or null.
- An object or array, as below.
Object: an unordered collection of comma-separated KVPs bounded by braces, such as {key1: value1, key2: value2}, where the keys are strings. Roughly equivalent to a Python dictionary.
Array: an ordered collection of comma-separated values bounded by square brackets, such as [value1, value2]. Equivalent to a Python list.
JSON does not in general permit trailing commas in objects and arrays.
Comments are not supported in JSON (but we may illustrate some JSON examples here using # to indicate comments).

Where versions are passed, they are in Semantic Versioning 2.0.0 format. Semantic versions are strings using a particular format (e.g. "1.2.0"), referred to as a Version henceforth.

Where date/time values are passed, they are in ISO-8601 format and must include all three of: date, time, timezone. (The choice of timezone is immaterial; servers may choose to use UTC throughout.)

7.8.4.2. Authentication at HTTP/HTTPS level 

Servers are free to require an authentication method using a standard HTTP mechanism, such as HTTP basic access authentication, HTTP digest access authentication, a URL query string, or OAuth. The mechanism for doing so is not part of the API.
It is expected that the HTTP front end would make the identity of an authenticated user available to the NLPRP server, e.g. so the server can check that a user is authorized for a specific NLP processor or to impose volume/rate limits, but the mechanism for doing so is not part of the API specification.

7.8.4.3. Compression at HTTP/HTTPS level 

Clients may compress requests by setting the HTTP header Content-Encoding: gzip (see HTTP Content-Encoding) and compressing the POST body accordingly. Servers should accept requests compressed with gzip.
If the client sets the Accept-Encoding header (see HTTP Accept-Encoding), the server may return a suitably compressed response (indicated via the Content-Encoding header in its reply).

7.8.4.4. Rejection of unauthorized or malformed responses 

Servers may reject invalid responses with an HTTP error. Typical reasons might include failed authentication or authorization; overly large requests; requests that exceed a user’s quota; syntactically invalid NLPRP requests; syntactically valid requests that are invalid for this server (such as requests that include invalid processors).
Clients must accept HTTP errors either with a NLPRP response or without.
- If the body of the server’s reply includes valid JSON where json_object["protocol"]["name"] == "nlprp", it is an NLPRP reply.
If an error is returned via the NLP protocol, the status field in the response must match the HTTP status code.
The rationale for this is to reduce the effect of denial-of-service attacks by preprocessing HTTP requests without the need to parse the NLPRP request content, and to allow NLPRP server software to operate within a broader institutional authentication, authorization, and/or accounting framework.

7.8.4.5. Request JSON structure 

The top-level structure of a request is a JSON object with the following keys.

Key

JSON type

Required?

Description

protocol

Object

Mandatory

Details of the NLPRP protocol that the client is using, with keys:

name (string): Must be "nlprp". Case insensitive.
version (string): The Version of the NLPRP protocol that the client is using.

command

String

Mandatory

NLPRP command, as below.

args

Value

Optional

Arguments to the command.

JSON does not care about whitespace in formatting, and neither the client nor the server are under any obligation as to how they format their JSON.

7.8.5. Response 

7.8.5.1. Response format 

The request is returned over HTTP as media type application/json. The encoding should be specified (e.g. application/json; charset=utf-8, and will be assumed to be UTF-8 if not specified.

7.8.5.2. Response JSON structure 

The top-level structure of a response is a JSON object with the following keys.

Key	JSON type	Required?	Description
`status`	Value	Mandatory	An integer matching the HTTP status code. Will be in the range [200, 299] for success.
`errors`	Array	Optional	If the status is not 102 or in the range [200, 299], one or more errors will be given. Each error is an object with at least the following keys: `code` (integer or null): error code `message` (string): brief textual description of the error `description` (string): more detail
`protocol`	Object	Mandatory	Details of the NLPRP protocol that the server is using. Keys: `name` (string): Must be `"nlprp"`. Case insensitive. `version` (string): The Version of the NLPRP protocol that the client is using.
`server_info`	Object	Mandatory	Details of the NLPRP server. Keys: `name` (string): Name of the NLPRP server software in use. `version` (string): The Version of the NLPRP server software.

7.8.6. NLPRP commands 

7.8.6.1. list_processors 

No additional parameters are required, but there is an optional parameter.

Key	JSON type	Required?	Description
`sql_dialect`	String	Optional	The SQL dialect that the client would prefer to receive its column information in. (The server does not have to honour this.) See SQL dialects below. [Version 0.2.0 and higher.]

This command lists the NLP processors available to the requestor. (This might be a subset of all NLP processors on the server, depending on the authentication and the permissions granted by the server.)

The relevant part of the response is:

Key

JSON type

Required?

Description

processors

Array

Mandatory

An array of objects. Each object has the following keys:

name (string): the server’s name for the processor.
title (string): generally, the processor’s name for itself.
version (string): the Version of the processor.
is_default_version (Boolean): indicates that this processor is the default version for the given name. May be true for zero or one versions for a given processor name.
description (string): a description of the processor.
schema_type (string): optional; must be one of unknown, tabular; default is unknown. (Future versions may add more schema types.) [Version 0.2.0 and higher.]
sql_dialect (string): the SQL dialect (see below) used within the tabular_schema object (see below). Must be present if tabular_schema is given. [Version 0.2.0 and higher.]
tabular_schema (object): an object that is present if and only if schema_type is tabular. Represents a tabular schema and describing the tables/columns provided by this processor. The format of tabular_schema is described below. [Version 0.2.0 and higher.]

Schema definition

The NLP server may not know the output format of its NLP processors, in which case schema_type may be set to unknown. However, this is undesirable. Processors should describe their schema by enumerating their output tables/columns if they provide output compatible with storage in database tables. To do so, the server sets schema_type to tabular and provides the tabular_schema object.

The tabular_schema object defines a tabular schema. It maps table names to arrays of column definition objects. Thus, in pseudocode:

"tabular_schema": {
    "table_name_1" : [
        <column_definition_1>,
        <column_definition_2>,
        ...
    ],
    "table_name_2" : [
        <column_definition_1>,
        <column_definition_2>,
        ...
    ],
    ...
}

Most NLP processors produce output for a single database table. The table name may be an empty string, "", and that is fine (it is, after all, up to the client to decide how it names its tables). Such a schema would look like this:

"tabular_schema": {
    "" : [
        <column_definition_1>,
        <column_definition_2>,
        ...
    ]
}

A few (e.g. GATE) processors may give output that requires more than one database table to store. For example, a “people and places” processor may return one kind of result when it finds a person, and another kind when it finds a place; it would therefore need to define two tables.

Each column definition object describes a column in the database being used to store results, and has the following keys.

Key	JSON type	Required?	Description
`column_name`	String	Mandatory	Name of the column.
`column_type`	String	Mandatory	Full column data type, e.g. `VARCHAR(64)`. (*)
`data_type`	String	Mandatory	Type name only, e.g. `VARCHAR`
`is_nullable`	Boolean	Mandatory	Whether this column can contain `null` values.
`column_comment`	String or `null`	Optional	Comment describing this column. (*)
`data_type`	String	Mandatory	Type name only, e.g. `VARCHAR`

(The system follows the ANSI SQL INFORMATION_SCHEMA standard loosely; specifically, using some of the columns found in INFORMATION_SCHEMA.COLUMNS.)

For examples of INFORMATION_SCHEMA.COLUMNS, see e.g.

https://docs.microsoft.com/en-us/sql/relational-databases/system-information-schema-views/columns-transact-sql?view=sql-server-2017

https://www.postgresql.org/docs/current/infoschema-columns.html

https://dev.mysql.com/doc/refman/5.7/en/columns-table.html

(*) Not part of the ANSI standard; a MySQL extension to INFORMATION_SCHEMA.COLUMNS.

SQL dialects

The sql_dialect parameter, detailed above, names the SQL dialect in which column_type and data_type are expressed. For example, “unlimited-length text” might be VARCHAR(MAX) in the mssql dialect but LONGTEXT in the mysql dialect.

SQL dialect values are strings representing those major dialects used by SQLAlchemy (see https://docs.sqlalchemy.org/en/13/dialects/), i.e.

Dialect name	Dialect
`mysql`	MySQL
`mssql`	Microsoft SQL Server
`oracle`	Oracle
`postgresql`	PostgreSQL
`sqlite`	SQLite

Request example

A full request as sent over TCP/IP might be as follows, being sent to https://myserver.mydomain/nlp:

POST /nlp HTTP/1.1
Host: myserver.mydomain
Content-Type: application/json; charset=utf-8
Content-Length: <length_goes_here>

{
    "protocol": {
        "name": "nlprp",
        "version": "0.2.0"
    },
    "command":  "list_processors"
}

Response example

For the specimen request above, the reply sent over TCP/IP might look like this:

HTTP/1.1 200 OK
Date: Mon, 13 Nov 2017 09:50:59 GMT
Server: Apache/2.4.23 (Ubuntu)
Content-Type: application/json; charset=utf-8
Content-Length: <length_goes_here>

{
    "status": 200,
    "protocol": {
        "name": "nlprp",
        "version": "0.2.0"
    },
    "server_info": {
        "name": "My NLPRP server software",
        "version": "0.2.0"
    },
    "processors": [
        {
            "name": "gate_medication",
            "title": "SLAM BRC GATE-based medication finder",
            "version": "1.2.0",
            "is_default_version": true,
            "description": "Finds drug names",
            "schema_type": "unknown",
        },
        {
            "name": "python_c_reactive_protein",
            "title": "Cardinal RN (2017) CRATE CRP finder",
            "version": "0.1.3",
            "is_default_version": true,
            "description": "Finds C-reactive protein (CRP) values",
            "schema_type": "tabular",
            "sql_dialect": "mysql",
            "tabular_schema": {
                "": [
                    {
                        "column_comment": "Variable name",
                        "column_name": "variable_name",
                        "column_type": "VARCHAR(64)",
                        "data_type": "VARCHAR",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Matching text contents",
                        "column_name": "_content",
                        "column_type": "TEXT",
                        "data_type": "TEXT",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Start position (of matching string within whole text)",
                        "column_name": "_start",
                        "column_type": "INTEGER",
                        "data_type": "INTEGER",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "End position (of matching string within whole text)",
                        "column_name": "_end",
                        "column_type": "INTEGER",
                        "data_type": "INTEGER",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Text that matched the variable name",
                        "column_name": "variable_text",
                        "column_type": "TEXT",
                        "data_type": "TEXT",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Text that matched the mathematical relationship between variable and value (e.g. '=', '<=', 'less than')",
                        "column_name": "relation_text",
                        "column_type": "VARCHAR(50)",
                        "data_type": "VARCHAR",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Standardized mathematical relationship between variable and value (e.g. '=', '<=')",
                        "column_name": "relation",
                        "column_type": "VARCHAR(2)",
                        "data_type": "VARCHAR",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Matched numerical value, as text",
                        "column_name": "value_text",
                        "column_type": "TEXT",
                        "data_type": "TEXT",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Matched units, as text",
                        "column_name": "units",
                        "column_type": "VARCHAR(50)",
                        "data_type": "VARCHAR",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Numerical value in preferred units, if known",
                        "column_name": "value_mg_l",
                        "column_type": "FLOAT",
                        "data_type": "FLOAT",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Tense text, if known (e.g. 'is', 'was')",
                        "column_name": "tense_text",
                        "column_type": "VARCHAR(50)",
                        "data_type": "VARCHAR",
                        "is_nullable": true
                    },
                    {
                        "column_comment": "Calculated tense, if known (e.g. 'past', 'present')",
                        "column_name": "tense",
                        "column_type": "VARCHAR(7)",
                        "data_type": "VARCHAR",
                        "is_nullable": true
                    }
                ]
            }
        }
    ]
}

7.8.6.2. process 

This command is the central NLP processing request. The important detail is passed in the top-level args parameter, where args is an object with the following structure:

Key	JSON type	Required?	Description
`processors`	Array	Mandatory	An array of objects, each with the following keys: `name` (string): the name of an NLP processor to apply to the text (matching one of the names given by the server via the list_processors command). `version` (optional string): the version of the named NLP processor to use. If a version is not specified explicitly, and there is a default version (see list_processors), the server will use that. `args`: optional key whose value is a JSON value considered to be arguments to the processor (for future expansion).
`queue`	Boolean value (`true` or `false`)	Optional (default `false`)	Controls queueing behaviour: If `true`, adds the request to the server’s processing queue, and returns a response giving queue information, or refuses the request. See the show_queue and fetch_from_queue commands below. If `false`, performs NLP immediately and returns the processing result. (Note, however, that the server can refuse to serve either immediate or delayed results depending on its preference.)
`client_job_id`	String, of maximum length 150 characters	Optional (if absent, an empty string will be used)	This is for queued processing. It is a string that the server will store alongside the queue request, to aid the client in identifying requests belonging to the same job (if it splits work across many requests). It is returned by the show_queue and fetch_from_queue commands.
`include_text`	Boolean value (`true` or `false`)	Optional (default `false`)	If `true`, includes the source text in the reply.
`content`	Array	Mandatory	A list of JSON objects representing text to be parsed (documents), with optional associated metadata. Each object has the following keys: `text` (string, mandatory): The actual text to parse. `metadata` (value, optional): The metadata will be returned verbatim with the results.

Immediate processing

The response to a successful non-queued process command has the following format (on top of the basic response structure):

Key

JSON type

Required?

Description

client_job_id

String

Mandatory

The same client_job_id as the client provided (or a blank string if none was provided).

results

Array

Mandatory

An array of objects of the same length as content, but in arbitrary order, with each object having the following format:

metadata (optional): a copy of the text-specific metadata provided in the request
text (string, optional); if include_text was true, the source text is included here.
processors: array of objects in the same order as the processors parameter in the request, and whose keys are:
- name (string): name of the processor (as per list_processors)
- title (string): title of the processor (as per list_processors)
- version (string): Version of the processor (as per list_processors)
- success (Boolean): true for success, false for failure. This allows for the possibility of text-specific failure, e.g. a document that crashes the NLP parser or otherwise fails dynamically.
- errors (Array, optional): if success is false, this should be present and describe the reason(s) for failure. It is an array of error objects, where each error is an object with at least the following keys:
  - code (integer or null): error code
  - message (string): brief textual description of the error
  - description (string): more detail
- results: see Format of per-processor results below.

Note that it is strongly advisable for clients to specify metadata as this will be necessary for them to recover order information whenever content has more than one item.

Format of per-processor results

Remember that a single piece of source text can generate zero, one, or many NLP matches from each processor; and that a single NLP “match” can involve highly structured results, but typically involves one set of key/value pairs.

We now consider the response["results"][result_num]["processors"][processor_num]["results"] value. This is processor-specific.

For a failed request, this should be an empty array, [], or an empty object, {}. (Note that it may also be empty following success, meaning that the processor found nothing of interest to it.)
If the processor does not offer a tabular_schema definition (see Schema definition above), then the format of results is not constrained. [In NLPRP Version 0.1.0, there were no constraints, as a result.] A common format is an array of objects (each object providing a key-value mapping of column/field names to values), like this:
```
"results": [
    {
        # row 1
        "column_name_1": <value_of_col_1>,
        "column_name_2": <value_of_col_2>,
        ...
    },
    {
        # row 2
        "column_name_1": <value_of_col_1>,
        "column_name_2": <value_of_col_2>,
        ...
    },
    ...
]
```
If the processor does offer a tabular_schema definition, and that definition is for only a single table, then results may be an array of objects, each object representing a database row and containing a mapping from column names to values (exactly as above).

If the processor offers a tabular_schema definition, the other permissible format is that results is an object, not an array. In this case, the results object maps table names (exactly as in the schema) to arrays of rows, like this:

"results": {
    "table_name_1": [
        {
            # row 1
            "column_name_1": <value_of_col_1>,
            "column_name_2": <value_of_col_2>,
            ...
        },
        {
            # row 2
            "column_name_1": <value_of_col_1>,
            "column_name_2": <value_of_col_2>,
            ...
        },
        ...
    ],
    "table_name_2": [
        {
            # row 1
            "column_name_3": <value_of_col_3>,
            "column_name_4": <value_of_col_4>,
            ...
        },
        {
            # row 2
            "column_name_3": <value_of_col_3>,
            "column_name_4": <value_of_col_4>,
            ...
        },
        ...
    ]
}

This format must be used by processors providing a tabular_schema and using more than one table.

It is an error for the processor to offer a tabular_schema definition and then not abide by it (e.g. by providing table or column names not as described in the schema, or by returning data in non-tabular format).

Example

An example exchange using immediate processing follows. The request sends three pieces of text with metadata, and requests two processors to be run on each of them. (Neither processor takes any arguments. Since no version is specified for the python_c_reactive_protein processor, the default version will be used.)

POST /nlp HTTP/1.1
Host: myserver.mydomain
Content-Type: application/json; charset=utf-8
Content-Length: <length_goes_here>

{
    "protocol": {
        "name": "nlprp",
        "version": "0.2.0"
    },
    "command":  "process",
    "args": {
        "processors": [
            {
                "name": "gate_medication",
                "version": "1.2.0",
            },
            {
                "name": "python_c_reactive_protein",
            },
        ],
        "queue": false,
        "client_job_id": "My NLP job 57 for depression/CRP",
        "include_text": false,
        "content": [
            {
                "metadata": {"myfield": "progress_notes", "pk": 12345},
                "text": "My old man’s a dustman. He wears a dustman’s hat."
            },
            {
                "metadata": {"myfield": "progress_notes", "pk": 23456},
                "text": "Dr Bloggs started aripiprazole 5mg od today."
            },
            {
                "metadata": {"myfield": "clinical_docs", "pk": 777},
                "text": "CRP 45; concern about UTI. No longer on prednisolone. Has started co-amoxiclav 625mg tds."
            }
        ]
    }
}

Here’s the response. The first piece of text generates no hits for either processor. The second generates a hit for the ‘medication’ processor. The third generates a hit for ‘CRP’ and two drugs.

HTTP/1.1 200 OK
Date: Mon, 13 Nov 2017 09:50:59 GMT
Server: Apache/2.4.23 (Ubuntu)
Content-Type: application/json; charset=utf-8
Content-Length: <length_goes_here>

{
    "status": 200,
    "protocol": {
        "name": "nlprp",
        "version": "0.2.0"
    },
    "server_info": {
        "name": "My NLPRP server software",
        "version": "0.2.0"
    },
    "client_job_id": "My NLP job 57 for depression/CRP",
    "results": [
        {
            "metadata": {"myfield": "progress_notes", "pk": 12345},
            "processors": [
                {
                    "name": "gate_medication",
                    "title": "SLAM BRC GATE-based medication finder",
                    "version": "1.2.0",
                    "success": true,
                    "results": []
                },
                {
                    "name": "python_c_reactive_protein",
                    "title": "Cardinal RN (2017) CRATE CRP finder",
                    "version": "0.1.3",
                    "success": true,
                    "results": []
                },
            ]
        },
        {
            "metadata": {"myfield": "progress_notes", "pk": 23456},
            "processors": [
                {
                    "name": "gate_medication",
                    "title": "SLAM BRC GATE-based medication finder",
                    "version": "1.2.0",
                    "success": true,
                    "results": [
                        {
                            "drug": "aripiprazole",
                            "drug_type": "BNF_generic",
                            "dose": "5mg",
                            "dose_value": 5,
                            "dose_unit": "mg",
                            "dose_multiple": 1,
                            "route": null,
                            "status": "start",
                            "tense": "present"
                        }
                    ]
                },
                {
                    "name": "python_c_reactive_protein",
                    "title": "Cardinal RN (2017) CRATE CRP finder",
                    "version": "0.1.3",
                    "success": true,
                    "results": []
                },
            ]
        },
        {
            "metadata": {"myfield": "clinical_docs", "pk": 777},
            "processors": [
                {
                    "name": "gate_medication",
                    "title": "SLAM BRC GATE-based medication finder",
                    "version": "1.2.0",
                    "results": [
                        {
                            "drug": "prednisolone",
                            "drug_type": "BNF_generic",
                            "dose": null,
                            "dose_value": null,
                            "dose_unit": null,
                            "dose_multiple": null,
                            "route": null,
                            "status": "stop",
                            "tense": null
                        },
                        {
                            "drug": "co-amoxiclav",
                            "drug_type": "BNF_generic",
                            "dose": "625mg",
                            "dose_value": 625,
                            "dose_unit": "mg",
                            "dose_multiple": 1,
                            "route": "po",
                            "status": "start",
                            "tense": "present"
                        }
                    ]
                },
                {
                    "name": "python_c_reactive_protein",
                    "title": "Cardinal RN (2017) CRATE CRP finder",
                    "version": "0.1.3",
                    "results": [
                        {
                            "startpos": 1,
                            "endpos": 7,
                            "variable_name": "CRP",
                            "variable_text": "CRP",
                            "relation": "",
                            "value_text": "45",
                            "units": "",
                            "value_mg_l": 45,
                            "tense_text": "",
                            "tense": "present"
                        }
                    ]
                },
            ]
        }
    ]
}

Note that the two NLP processors are returning different sets of information, in a processor-specific way.

Queued processing

NLP can be slow. Non-queued commands require that the server performs all the NLP requested within the HTTP timeout period, which may not be feasible; therefore, the protocol supports queuing. With a queued process request, the server takes the data, says “thanks, I’m thinking about it”, and the client can check back later. When the client checks back, the server might have data to offer it or may still be busy.

One risk of queued commands is to the server: clients may send NLP requests faster than the server can handle them. Therefore, the protocol allows the server to refuse queued requests.

Another thing to note is that immediate requests may or may not require the raw text to “touch down” somewhere on the server — what the server does is up to it — but typically, “immediate” requests require minimal (e.g. in-memory) storage of the raw text, whilst “queued” requests inevitably require that the server store the text (e.g. on disk, perhaps in a database) for the lifetime of the queue request.

Initial successful response to process command with queued = true

The initial response has an HTTP status code of 202 (Accepted) and a top-level key of queue_id, whose value is a string. Like this:

HTTP/1.1 202 Accepted
Date: Mon, 13 Nov 2017 09:50:59 GMT
Server: Apache/2.4.23 (Ubuntu)
Content-Type: application/json; charset=utf-8
Content-Length: <length_goes_here>

{
    "status": 202,
    "protocol": {
        "name": "nlprp",
        "version": "0.1.0"
    },
    "server_info": {
        "name": "My NLPRP server software",
        "version": "0.1.0"
    },
    "queue_id": "7586876b-49cb-447b-9db3-b640e02f4f9b"
}

7.8.6.3. show_queue 

The show_queue command allows the client to view its queue status. It has one optional argument:

Key	JSON type	Required?	Description
`client_job_id`	String	Optional	An optional client job ID (see process). If absent, all queue entries for this client are shown. If present, only queue entries for the specified `client_job_id` are shown.

The reply contains this extra information:

Key

JSON type

Required?

Description

queue

Array

Mandatory

An array of objects, one for each incomplete queue entry, each with the following keys/values:

queue_id: queue ID, as returned from the process command
client_job_id: the client’s job ID (see process).
status: a string; one of: ready, busy.
datetime_submitted: date/time submitted, in ISO-8601 format.
datetime_completed: date/time completed, in ISO-8601 format, or null if it’s not yet complete.

Specimen request:

POST /nlp HTTP/1.1
Host: myserver.mydomain
Content-Type: application/json; charset=utf-8
Content-Length: <length_goes_here>

{
    "protocol": {
        "name": "nlprp",
        "version": "0.1.0"
    },
    "command":  "show_queue"
}

and corresponding response:

HTTP/1.1 200 OK
Date: Mon, 13 Nov 2017 09:50:59 GMT
Server: Apache/2.4.23 (Ubuntu)
Content-Type: application/json; charset=utf-8
Content-Length: <length_goes_here>

{
    "status": 200,
    "protocol": {
        "name": "nlprp",
        "version": "0.1.0"
    },
    "server_info": {
        "name": "My NLPRP server software",
        "version": "0.1.0"
    },
    "queue": [
        {
            "queue_id": "7586876b-49cb-447b-9db3-b640e02f4f9b",
            "client_job_id": "My NLP job 57 for depression/CRP",
            "status": "ready",
            "datetime_submitted": "2017-11-13T09:49:38.578474Z",
            "datetime_completed": "2017-11-13T09:50:00.817611Z"
        }
        {
            "queue_id": "6502b94a-2332-4f51-b2a3-337dc5d36ca0",
            "client_job_id": "My NLP job 57 for depression/CRP",
            "status": "busy",
            "datetime_submitted": "2017-11-13T09:49:39.717170Z",
            "datetime_completed": null
        }
    ]
}

7.8.6.4. fetch_from_queue 

Fetches a single entry from the queue, if it exists and is ready for collection. The top-level args should contain a key queue_id containing the queue ID.

If the queue ID doesn’t correspond to a current queue entry, an error will be returned (HTTP 404 Not Found).
If the queue entry is ready for collection, the reply will be of the format for an “immediate” process request. The queue entry will be deleted upon collection.

If the queue entry is still busy being processed, an information code will be returned (HTTP 202 Accepted), along with details as follows:

Key	JSON type	Required?	Description
`n_docprocs`	Value	Optional	The total number of document/processor pairs corresponding to the queue entry.
`n_docprocs_completed`	Value	Optional	The number of document/processor pairs corresponding to the queue entry for which processing has been completed.

7.8.6.5. delete_from_queue 

For this command, the top-level args should be an object with the following keys:

Key	JSON type	Required?	Description
`queue_ids`	Array	Optional	An array of strings, each representing a queue ID to be deleted.
`client_job_ids`	Array	Optional	An array of strings, each representing a client job ID for which all queue IDs should be deleted.
`delete_all`	Boolean value (`true` or `false`)	Optional (default `false`)	If true, all queue entries (for this client!) are deleted.

7.8.7. Specimen Python 3.7+ client program 

Very briefly, run pip install requests crate_anon, and then you can run this:

#!/usr/bin/env python

r"""
crate_anon/nlprp/nlprp_test_client.py

===============================================================================

    Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

    This file is part of CRATE.

    CRATE is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    CRATE is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with CRATE. If not, see <https://www.gnu.org/licenses/>.

===============================================================================

crate_anon/docs/sources/nlp/nlprp_test_client.py
Simple test client for the NLPRP interface.

"""

import argparse
import logging
import requests
from typing import Any, NoReturn

from cardinal_pythonlib.logs import main_only_quicksetup_rootlogger
from requests.auth import HTTPBasicAuth

from crate_anon.version import require_minimum_python_version
from crate_anon.nlprp.api import NlprpRequest, NlprpResponse
from crate_anon.nlprp.constants import NlprpCommands

require_minimum_python_version()

log = logging.getLogger(__name__)


def get_response(
    url: str,
    command: str,
    command_args: Any = None,
    transmit_compressed: bool = False,
    receive_compressed: bool = True,
    username: str = "",
    password: str = "",
) -> NlprpResponse:
    """
    Illustrate sending to/receiving from an NLPRP server, using HTTP basic
    authentication.

    Args:
        url: URL to send request to
        command: NLPRP command
        command_args: arguments to be sent with the NLPRP command;
            ``json.dumps()`` is applied to this object first
        transmit_compressed: compress requests via GZIP encoding
        receive_compressed: tell the server we will accept responses compresed
            via GZIP encoding
        username: username for basic HTTP authentication
        password: password for basic HTTP authentication
    """

    # -------------------------------------------------------------------------
    # How we fail
    # -------------------------------------------------------------------------
    def fail(msg: str) -> NoReturn:
        log.warning(msg)
        raise ValueError(msg)

    # -------------------------------------------------------------------------
    # Build request and send it
    # -------------------------------------------------------------------------
    req = NlprpRequest(command=command, command_args=command_args)
    log.info(f"Sending request: {req}")
    headers = {"Content-Type": "application/json"}
    if transmit_compressed:
        headers["Content-Encoding"] = "gzip"
        request_data = req.data_gzipped
    else:
        request_data = req.data_bytes
    if receive_compressed:
        headers["Accept-Encoding"] = "gzip, deflate"
        # As it turns out, the "requests" library adds this anyway!
    log.debug(
        f"Sending to URL {url!r}: headers {headers!r}, "
        f"data {request_data!r}"
    )
    r = requests.post(
        url,
        data=request_data,
        headers=headers,
        auth=HTTPBasicAuth(username, password),
    )
    # -------------------------------------------------------------------------
    # Process response
    # -------------------------------------------------------------------------
    # - Note that the "requests" module automatically handles "gzip" and
    #   "deflate" transfer-encodings from the server; see
    #   http://docs.python-requests.org/en/master/user/quickstart/.
    # - The difference between gzip and deflate:
    #   https://stackoverflow.com/questions/388595/why-use-deflate-instead-of-gzip-for-text-files-served-by-apache  # noqa
    log.debug(
        f"Reply has status_code={r.status_code}, headers={r.headers!r}, "
        f"text={r.text!r}"
    )
    try:
        response = NlprpResponse(data=r.text)
        # "r.text" automatically does gzip decode
    except (
        ValueError
    ):  # includes simplejson.errors.JSONDecodeError, json.decoder.JSONDecodeError  # noqa
        fail("Reply was not JSON")
        response = None  # for type checker
    log.debug(f"Response JSON decoded to: {response.dict!r}")
    try:
        assert response.valid()
    except (AssertionError, AttributeError, KeyError):
        fail("Reply was not in the NLPRP protocol")
    return response


def main() -> None:
    """
    Command-line entry point.
    """
    # noinspection PyTypeChecker
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "--url", default="http://0.0.0.0:6543", help="URL of server"
    )
    parser.add_argument(
        "--transmit_compressed",
        action="store_true",
        help="Send data compressed",
    )
    parser.add_argument(
        "--receive_compressed",
        action="store_true",
        help="Accept compressed data from the server",
    )
    parser.add_argument(
        "--username",
        default="",
        help="Username for HTTP basic authentication on server",
    )
    parser.add_argument(
        "--password",
        default="",
        help="Password for HTTP basic authentication on server",
    )
    cmdargs = parser.parse_args()
    response = get_response(
        url=cmdargs.url,
        command=NlprpCommands.LIST_PROCESSORS,
        transmit_compressed=cmdargs.transmit_compressed,
        receive_compressed=cmdargs.receive_compressed,
        username=cmdargs.username,
        password=cmdargs.password,
    )
    log.info(f"Received reply: {response.dict!r}")


if __name__ == "__main__":
    main_only_quicksetup_rootlogger(level=logging.DEBUG)
    main()

7.8.8. Specimen Python 3.7+ server program 

Similarly, for a dummy server program, run pip install pyramid crate_anon and then you can run this:

#!/usr/bin/env python

r"""
crate_anon/nlprp/nlprp_test_server.py

===============================================================================

    Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

    This file is part of CRATE.

    CRATE is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    CRATE is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with CRATE. If not, see <https://www.gnu.org/licenses/>.

===============================================================================

Simple test server for the NLPRP interface.

"""

import argparse
import logging
from typing import Any, Dict
from wsgiref.simple_server import make_server

from cardinal_pythonlib.httpconst import HttpStatus
from cardinal_pythonlib.logs import main_only_quicksetup_rootlogger
from pyramid.config import Configurator
from pyramid.request import Request
from pyramid.response import Response
from semantic_version import Version

from crate_anon.version import require_minimum_python_version
from crate_anon.nlprp.constants import NlprpKeys
from crate_anon.nlprp.api import NlprpRequest, NlprpResponse

require_minimum_python_version()

log = logging.getLogger(__name__)

SERVER_NLPRP_VERSION = Version("0.1.0")


# noinspection PyUnusedLocal
def cmd_list_processors(hreq: Request, nreq: NlprpRequest) -> Dict[str, Any]:
    """
    Specimen (dummy) implementation of the "list_processors" command.
    """
    return {
        NlprpKeys.PROCESSORS: [
            {
                NlprpKeys.NAME: "my_first_processor",
                NlprpKeys.TITLE: "My First NLP Processor",
                NlprpKeys.VERSION: "0.0.1",
                NlprpKeys.IS_DEFAULT_VERSION: True,
                NlprpKeys.DESCRIPTION: "Finds mentions of the word Alice",
            },
            {
                NlprpKeys.NAME: "my_second_processor",
                NlprpKeys.TITLE: "My Second NLP Processor",
                NlprpKeys.VERSION: "0.0.2",
                NlprpKeys.IS_DEFAULT_VERSION: True,
                NlprpKeys.DESCRIPTION: "Finds mentions of the word Bob",
            },
        ]
    }


def nlprp_server(request: Request) -> Response:
    """
    Specimen NLPRP server for Pyramid.

    Args:
        request: the Pyramid :class:`Request` object

    Returns:
        a Pyramid :class:`Response`
    """
    headers = request.headers
    request_is_gzipped = headers.get("Content-Encoding", "") == "gzip"
    client_accepts_gzip = "gzip" in headers.get("Accept-Encoding", "")
    body = request.body
    log.debug(
        f"===========================================\n"
        f"Received request {request!r}:\n"
        f"    headers={dict(headers)!r}\n"
        f"    body={body!r}\n"
        f"    request_is_gzipped={request_is_gzipped}\n"
        f"    client_accepts_gzip={client_accepts_gzip}"
    )
    try:
        nreq = NlprpRequest(data=body, data_is_gzipped=request_is_gzipped)
        log.info(f"NLPRP request is {nreq}")
        assert nreq.valid()
        # Establish command
        log.debug(f"command={nreq.command!r}, args={nreq.args!r}")
    except Exception:
        # Could do more here!
        raise

    # Process NLPRP command
    http_status = HttpStatus.OK
    if nreq.command == "list_processors":
        replydict = cmd_list_processors(request, nreq)
    else:
        raise NotImplementedError()
    assert isinstance(replydict, dict), "Bug in server!"
    reply = NlprpResponse(http_status=http_status, reply_args=replydict)
    log.info(f"Sending NLPRP response: {reply}")

    # Create the HTTP response
    response = Response(reply.data_bytes, status=http_status)
    # Compress the response?
    if client_accepts_gzip:
        response.encode_content("gzip")
    # Done
    log.debug(
        f"Sending HTTP response: headers={response.headers}, "
        f"body={response.body}"
    )
    return response


def main() -> None:
    """
    Command-line entry point.
    """
    # noinspection PyTypeChecker
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "--host", default="0.0.0.0", help="Hostname to serve on"
    )
    parser.add_argument("--port", default=6543, help="TCP port to serve on")
    cmdargs = parser.parse_args()
    with Configurator() as config:
        config.add_route("only_route", "/")
        config.add_view(nlprp_server, route_name="only_route")
        app = config.make_wsgi_app()
    server = make_server(cmdargs.host, cmdargs.port, app)
    log.info(f"Starting server on {cmdargs.host}:{cmdargs.port}")
    server.serve_forever()


if __name__ == "__main__":
    main_only_quicksetup_rootlogger(level=logging.DEBUG)
    main()

7.8.9. More on error responses 

The main design question here is whether HTTP status codes should be used for errors, or not. There are pros and cons here [4]. We shall follow best practice and encode the status both in HTTP and in the JSON.

Specific HTTP status codes not detailed above include:

Command	Situation	HTTP status code
Any	Success	200 OK
Any	Request malformed	400 Bad Request
Any	Authorization failed	401 Unauthorized
Any	Server bug	500 Internal Server Error
process	Results returned	200 OK
process	Request queued	202 Accepted
process	Upstream server went wrong	502 Bad Gateway
process	Server is too busy right now	503 Service Unavailable
fetch_from_queue	No such queue entry	404 Not Found
fetch_from_queue	Entry still in queue and being processed	202 Accepted

7.8.10. Python internal NLP interface 

The NLPRP server should manage per-text metadata (from the process command) internally. We define a very generic Python interface for the NLPRP server to request NLP results from a specific Python NLP processor:

def nlp_process(text: str,
                processor_args: Any = None) -> List[Dict[str, Any]]:
    """
    Standardized interface via the NLP Request Protocol (NLPRP).
    Processes text using some form of natural language processing (NLP).

    Args:
        text: the text to process
        processor_args: additional arguments supplied by the user [via a
            json.loads() call upon the processor argument value].

    Returns:
        a list of dictionaries with string keys, suitable for conversion to
        JSON using a process such as:

        .. code-block:: python

            import json
            from my_nlp_module import nlp_process
            result_dict = nlp_process("some text")
            result_json = json.dumps(result_dict)

    """
    raise NotImplementedError()

The combination of this standard interface plus the Python Package Index (PyPI) should allow easy installation of Python NLP managers (by Python package name and version). The NLPRP server should be able to import a nlp_process or equivalent function from the top-level package.

7.8.11. Existing code of relevance 

The CRATE toolchain has Python handlers for firing up external NLP processors including GATE and other Java-based tools, and piping text to them; similarly for its internal Python code. From the Cambridge perspective we are likely to extend and use CRATE to send data to the NLP API/service and manage results, but it is also potentially extensible to serve as the NLP API server.

7.8.12. Aspects of server function that are not part of the NLPRP specification 

The following are implementation details that are at the server’s discretion:

authentication
authorization
accounting (logging, billing, size/frequency restrictions)
containerization, parallel processing, message queue details

7.8.13. Abbreviations used in this section 

EMR	electronic medical records
HTTP	hypertext transport protocol
HTTPS	secure HTTP
IP	Internet protocol
ISO	International Organization for Standardization
JSON	JavaScript Object Notation
KVP	key–value pair
NHS	UK National Health Service
NLP	natural language processing
NLPRP	NLP Request Protocol
PyPI	The Python Package Index; https://pypi.python.org/
REST	Representational state transfer
TCP	transmission control protocol
UK	United Kingdom
URL	uniform resource locator
UTC	Universal Coordinated Time
UTF-8	Unicode Transformation Format, 8-bit
XML	Extensible Markup Language

7.8.14. NLPRP things to do and potential future requirements 

Todo

NLPRP: consider supra-document processing requirements

Corpus (supra-document) processing:

There may be future use cases where the NLP processor must simultaneously consider more than one document (a “corpus” of documents, in GATE terminology). This is not currently supported. However, batch processing is currently supported.

7.8.15. NLPRP history 

v0.0.1

Started 13 Nov 2017; Rudolf Cardinal.

v0.0.2

RNC
Minor changes 18 July 2018 following discussion with SLAM/KCL team.

v0.1.0

Amendments 4 Oct 2018, RNC/IR/FS/JK/AR.
Authentication moved out of the API.
Authorization moved out of the API.
The server may “fail” requests at the HTTP level or at the subsequent NLPRP processing stage (i.e. failures may or may not include an NLPRP response object).
Compression at HTTP level discussed; servers should accept gzip compression from the client.
Order of results object changed to arbitrary (to facilitate parallel processing).
echorequest/echo parameters removed; this was pointless as all HTTP calls have an associated reply, so the client should never fail to know what was echoed back.
is_default_version argument to the list_processors reply, and version argument to process.
Comment re future potential use case for corpus-level processing
Signalling mechanism for dynamic failure via the success and errors parameters to the response (see immediate response).
Ability for the client to pass a client_job_id to the queued processing mode, so it can add many requests to the same job and retrieve this data as part of show_queue. Similar argument to delete_from_queue.
Consideration of processor version control and how this is managed in practice (e.g. Python modules; GATE apps) isn’t part of the API; removed from NLPAPI “to-do” list.

v0.2.0

4-6 Aug 2019, RNC.
Processors can describe their output format: schema_type and tabular_schema attribute in the response to the list_processors command, with associated sql_dialect options.
Corresponding constraints on the results format for processors that provide a tabular schema.

v0.3.0

14 Feb 2023, RNC.
Use HTTP 202 (Accepted), not 102 (Processing), for the in-progress response to fetch_from_queue, and report how many are complete. See https://github.com/ucam-department-of-psychiatry/crate/issues/106.

Footnotes