14.5.27. crate_anon.nlp_manager.parse_medex

crate_anon/nlp_manager/parse_medex.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


NLP handler for the external MedEx-UIMA tool, to find references to drugs (medications).

  • MedEx-UIMA

    • can’t find Python version of MedEx (which preceded MedEx-UIMA)

    • paper on Python version is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995636/; uses Python NLTK

    • see notes in Documents/CRATE directory

    • MedEx-UIMA is in Java, and resolutely uses a file-based processing system; Main.java calls MedTagger.java (MedTagger.run_batch_medtag), and even in its core MedTagger.medtagging() function it’s making files in directories; that’s deep in the core of its NLP thinking so we can’t change that behaviour without creating a fork. So the obvious way to turn this into a proper “live” pipeline would be for the calling code to

      • fire up a receiving process - Python launching custom Java

      • create its own temporary directory - Python

      • receive data - Python

      • stash it on disk - Python

      • call the MedEx function - Python -> stdout -> custom Java -> MedEx

      • return the results - custom Java signals “done” -> Python reads stdin?

      • and clean up - Python

      Not terribly elegant, but might be fast enough (and almost certainly much faster than reloading Java regularly!).

    • output comes from its MedTagger.print_result() function

    • would need a per-process-unique temporary directory, since it scans all files in the input directory (and similarly one output directory); would do that in Python

MedEx-UIMA is firmly (and internally) wedded to a file-based processing system. So we need to:

  • create a process-specific pair of temporary directories;

  • fire up a receiving process

  • pass data by (1) writing it to a file and (2) signalling that there’s data available;

  • await a “data ready” reply and read the data from disk;

  • clean up (delete files) in readiness for next data chunk.

NOTE ALSO that MedEx’s MedTagger class writes to stdout (though not stderr). Option 1: move our logs to stdout and use stderr for signalling. Option 2: keep things as they are and just use a stdout signal that isn’t used by MedEx. We went with option 2: it’s simpler and more consistent, especially for logging. A rough sketch of the resulting round trip is below.
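
A rough Python-side sketch of that round trip, given the constraints above (the Java class name, signal strings and filenames here are illustrative, not the actual CRATE implementation):

    import os
    import subprocess
    import tempfile

    DATA_READY = "DATA_READY"          # illustrative signal strings only
    RESULTS_READY = "RESULTS_READY"

    # Per-process temporary directories (MedEx scans whole directories).
    indir = tempfile.TemporaryDirectory()
    outdir = tempfile.TemporaryDirectory()

    # Fire up the receiving Java process once and keep it running.
    p = subprocess.Popen(
        ["java", "CrateMedexPipeline",              # hypothetical invocation
         "-i", indir.name, "-o", outdir.name],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        encoding="utf-8",
    )

    def process_text(text: str) -> str:
        """Send one chunk of text through MedEx; return the raw result."""
        infile = os.path.join(indir.name, "input.txt")
        outfile = os.path.join(outdir.name, "input.txt")
        with open(infile, "w", encoding="utf-8") as f:  # stash data on disk
            f.write(text)
        p.stdin.write(DATA_READY + "\n")                # signal: data available
        p.stdin.flush()
        while p.stdout.readline().strip() != RESULTS_READY:
            pass                                        # await "done" signal
        with open(outfile, encoding="utf-8") as f:      # read results from disk
            result = f.read()
        os.remove(infile)                               # clean up for next chunk
        os.remove(outfile)
        return result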

How do we clean up the temporary directories?
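
One standard answer, as a sketch (assuming the stdlib tempfile module is used for the per-process pair): tempfile.TemporaryDirectory removes the directory and its contents when cleanup() is called, when the object is used as a context manager, or when it is finalized.

    import tempfile

    # Automatic cleanup on leaving the block:
    with tempfile.TemporaryDirectory() as tmpdir_name:
        pass  # write/read MedEx input/output files under tmpdir_name
    # ... the directory and its contents are gone here.

    # Or explicit cleanup (e.g. from a destructor or an atexit handler):
    tmpdir = tempfile.TemporaryDirectory()
    # ... use tmpdir.name ...
    tmpdir.cleanup()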

PROBLEMS:

  • NLP works fine, but UK-style abbreviations e.g. “qds” not recognized where “q.i.d.” is. US abbreviations: e.g. https://www.d.umn.edu/medweb/Modules/Prescription/Abbreviations.html

    • Places to look, and things to try adding:

      resources/TIMEX/norm_patterns/NormFREQword
      
          qds=>R1P6H
      
      resources/TIMEX/rules/frequency_rules
      
          //QID ( 4 times a day
          expression="[Qq]\.?[Ii]\.?[Dd]\.?[ ]*\((.*?)\)",val="R1P6H"
      
          // RNC: qds
          expression="[Qq]\.?[Dd]\.?[Ss]\.?[ ]*\((.*?)\)",val="R1P6H"
      
      ... looked like it was correct, but not working
      ... are these files compiled in, rather than being read live?
      ... do I have the user or the developer version?
      

      … not there yet. Probably need to recompile. See MedEx’s Readme.txt

    • reference to expression/val (as in frequency_rules):

      TIMEX.Rule._add_rule()
          ... from TIMEX.Rule.Rule via a directory walker
          ... from TIMEX.ProcessingEngine.ProcessingEngine()
              ... via semi-hardcoded file location relative to class's location
                  ... via rule_dir, set to .../TIMEX/rules
      
    • Detect a file being accessed:

      sudo apt install inotify-tools
      inotifywait -m FILE
      

      … frequency_rules IS opened.

    • OVERALL SEQUENCE:

      org.apache.medex.Main [OR: CrateMedexPipeline.java]
      org.apache.medex.MedTagger.run_batch_medtag
      ... creates an org.apache.NLPTools.Document
          ... not obviously doing frequency stuff, or drug recognition
      ... then runs org.apache.medex.MedTagger.medtagging(doc)
          ... this does most of the heavy lifting, I think
          ... uses ProcessingEngine freq_norm_engine
              ... org.apache.TIMEX.ProcessingEngine
              ... but it may be that this just does frequency NORMALIZATION, not frequency finding
          ... uses SemanticRuleEngine rule_engine
              ... which is org.apache.medex.SemanticRuleEngine
              ... see all the regexlist.put(..., "FREQ") calls
              ... note double-escaping \\ for Java's benefit
      
  • Rebuilding MedEx:

    See build_medex_itself.py

  • YES. If you add to org.apache.medex.SemanticRuleEngine, with extra entries in the regexlist.put(...) sequence, new frequencies appear in the output.

    To get them normalized as well, add them to frequency_rules.

    Specifics:

    1. SemanticRuleEngine.java

      // EXTRA FOR UK FREQUENCIES (see https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations)
      // NB case-insensitive regexes in SemanticRuleEngine.java, so ignore case here
      regexlist.put("^(q\\.?q\\.?h\\.?)( |$)", "FREQ");  // qqh, quarta quaque hora (RNC)
      regexlist.put("^(q\\.?d\\.?s\\.?)( |$)", "FREQ");  // qds, quater die sumendum (RNC); must go before existing competing expression: regexlist.put("^q(\\.|)\\d+( |$)","FREQ");
      regexlist.put("^(t\\.?d\\.?s\\.?)( |$)", "FREQ");  // tds, ter die sumendum (RNC)
      regexlist.put("^(b\\.?d\\.?)( |$)", "FREQ");  // bd, bis die (RNC)
      regexlist.put("^(o\\.?d\\.?)( |$)", "FREQ");  // od, omni die (RNC)
      regexlist.put("^(mane)( |$)", "FREQ");  // mane (RNC)
      regexlist.put("^(o\\.?m\\.?)( |$)", "FREQ");  // om, omni mane (RNC)
      regexlist.put("^(nocte)( |$)", "FREQ");  // nocte (RNC)
      regexlist.put("^(o\\.?n\\.?)( |$)", "FREQ");  // on, omni nocte (RNC)
      regexlist.put("^(fortnightly)( |$)", "FREQ");  // fortnightly (RNC)
      regexlist.put("^((?:2|two)\s+weekly)\b", "FREQ");  // fortnightly (RNC)
      regexlist.put("argh", "FREQ");  // fortnightly (RNC)
      // ALREADY IMPLEMENTED BY MedEx: tid (ter in die)
      // NECESSITY, NOT FREQUENCY: prn (pro re nata)
      // TIMING, NOT FREQUENCY: ac (ante cibum); pc (post cibum)
      
    2. frequency_rules

      // EXTRA FOR UK FREQUENCIES (see https://www.evidence.nhs.uk/formulary/bnf/current/general-reference/latin-abbreviations)
      // NB case-sensitive regexes in Rule.java, so offer upper- and lower-case alternatives here
      // qqh, quarta quaque hora (RNC)
      expression="\b[Qq]\.?[Qq]\.?[Hh]\.?\b",val="R1P4H"
      // qds, quater die sumendum (RNC); MUST BE BEFORE COMPETING "qd" (= per day) expression: expression="[Qq]\.?[ ]?[Dd]\.?",val="R1P24H"
      expression="\b[Qq]\.?[Dd]\.?[Ss]\.?\b",val="R1P6H"
      // tds, ter die sumendum (RNC)
      expression="\b[Tt]\.?[Dd]\.?[Ss]\.?\b",val="R1P8H"
      // bd, bis die (RNC)
      expression="\b[Bb]\.?[Dd]\.?\b",val="R1P12H"
      // od, omni die (RNC)
      expression="\b[Oo]\.?[Dd]\.?\b",val="R1P24H"
      // mane (RNC)
      expression="\b[Mm][Aa][Nn][Ee]\b",val="R1P24H"
      // om, omni mane (RNC)
      expression="\b[Oo]\.?[Mm]\.?\b",val="R1P24H"
      // nocte (RNC)
      expression="\b[Nn][Oo][Cc][Tt][Ee]\b",val="R1P24H"
      // on, omni nocte (RNC)
      expression="\b[Oo]\.?[Nn]\.?\b",val="R1P24H"
      // fortnightly and variants (RNC); unsure if TIMEX3 format is right
      expression="\b[Ff][Oo][Rr][Tt][Nn][Ii][Gg][Hh][Tt][Ll][Yy]\b",val="R1P2WEEK"
      expression="\b(?:2|[Tt][Ww][Oo])\s+[Ww][Ee][Ee][Kk][Ll][Yy]\b",val="R1P2WEEK"
      // monthly (RNC)
      expression="\b[Mm][Oo][Nn][Tt][Hh][Ll][Yy]\b",val="R1P1MONTH"
      //
      // ALREADY IMPLEMENTED BY MedEx: tid (ter in die)
      // NECESSITY, NOT FREQUENCY: prn (pro re nata)
      // TIMING, NOT FREQUENCY: ac (ante cibum); pc (post cibum)
      
    3. Rebuild from source (see build_medex_itself.py, as above).
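
    As a sanity check (a Python sketch only; MedEx applies these patterns in Java via TIMEX Rule.java), the frequency_rules expressions above can be exercised directly:

      import re

      # The expression/val pairs added to frequency_rules above (UK abbreviations).
      UK_FREQ_RULES = [
          (r"\b[Qq]\.?[Qq]\.?[Hh]\.?\b", "R1P4H"),   # qqh
          (r"\b[Qq]\.?[Dd]\.?[Ss]\.?\b", "R1P6H"),   # qds
          (r"\b[Tt]\.?[Dd]\.?[Ss]\.?\b", "R1P8H"),   # tds
          (r"\b[Bb]\.?[Dd]\.?\b", "R1P12H"),         # bd
          (r"\b[Oo]\.?[Dd]\.?\b", "R1P24H"),         # od
      ]

      for text in ["aspirin 75mg od", "quetiapine 25mg b.d.", "paracetamol 1g qds"]:
          for pattern, timex in UK_FREQ_RULES:
              m = re.search(pattern, text)
              if m:
                  print(f"{text!r}: {m.group(0)!r} -> {timex}")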

  • How about routes of administration?

    MedTagger.print_result()
        route is in FStr_list[5]
    ... called from MedTagger.medtagging()
        route is in FStr_list_final[5]
        before that, is in FStr (separated by \n)
            ... from formatDruglist
            ...
            ... from logs, appears first next to "input for tagger" at
                which point it's in
                    sent_token_array[j] (e.g. "po")
                    sent_tag_array[j] (e.g. "RUT" = route)
            ... from tag_dict
            ... from filter_tags
            ... from (Document) doc.filtered_drug_tag()
            ...
            ... ?from MedTagger.medtagging() calling doc.add_drug_tag()
            ... no, not really; is in this bit:
                SuffixArray sa = new SuffixArray(...);
                Vector<SuffixArrayResult> result = sa.search();
            ... and then each element of result has a "semantic_type"
                member that can be "RUT"
    ... SuffixArray.search()
        semantic_type=this.lex.sem_list().get(i);
    
        ... where lex comes from MedTagger:
            this.lex = new Lexicon(this.lex_fname);
    ... Lexicon.sem_list() returns Lexicon.semantic_list
        ... Lexicon.Lexicon() constructs using MedTagger's this.lex_fname
        ... which is lexicon.cfg
    
    ... aha! There it is. If a line in lexicon.cfg has a RUT tag, it'll
        appear as a route. So:
            grep "RUT$" lexicon.cfg | sort  # and replace tabs with spaces
    
        bedside    RUT
        by mouth    RUT
        drip    RUT
        gt    RUT
        g tube    RUT
        g-tube    RUT
        gtube    RUT
        im injection    RUT
        im    RUT
        inhalation    RUT
        inhalatn    RUT
        inhaled    RUT
        intramuscular    RUT
        intravenously    RUT
        intravenous    RUT
        iv    RUT
        j tube    RUT
        j-tube    RUT
        jtube    RUT
        nare    RUT
        nares    RUT
        naris    RUT
        neb    RUT
        nostril    RUT
        orally    RUT
        oral    RUT
        ou    RUT
        patch    DDF-DOSEUNIT-RUT
        per gt    RUT
        per mouth    RUT
        per os    RUT
        per rectum    RUT
        per tube    RUT
        p. g    RUT
        pgt    RUT
        png    RUT
        pnj    RUT
        p.o    RUT
        po    RUT
        sc    RUT
        sl    RUT
        sq    RUT
        subc    RUT
        subcu    RUT
        subcutaneously    RUT
        subcutaneous    RUT
        subcut    RUT
        subling    RUT
        sublingual    RUT
        sub q    RUT
        subq    RUT
        swallow    RUT
        swish and spit    RUT
        sw&spit    RUT
        sw&swall    RUT
        topically    RUT
        topical    RUT
        topical tp    RUT
        trans    RUT
        with spacer    RUT
    

    Looks like these are not using synonyms. Note also format is route\tRUT

    Note also that the first element is always forced to lower case (in Lexicon.Lexicon()), so presumably it’s case-insensitive.

    There’s no specific comment format (though any line that doesn’t resolve to two items when split on a tab looks like it’s ignored).

    So we might want to add more; use

    build_medex_itself.py --extraroutes >> lexicon.cfg
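
    To inspect the current set programmatically (a sketch; the lexicon.cfg path is illustrative), the tab-separated format described above can be parsed like this:

      # List all routes of administration known to MedEx: lexicon.cfg lines whose
      # tag includes RUT. Format is term<TAB>tag; terms are effectively
      # case-insensitive, since Lexicon.Lexicon() lower-cases them.
      LEXICON = "/path/to/medex/resources/lexicon.cfg"  # illustrative path

      routes = set()
      with open(LEXICON, encoding="utf-8") as f:
          for line in f:
              parts = line.rstrip("\n").split("\t")
              if len(parts) != 2:
                  continue  # lines not splitting into exactly two items are ignored
              term, tag = parts
              if "RUT" in tag:  # e.g. "RUT", "DDF-DOSEUNIT-RUT"
                  routes.add(term.lower())

      for route in sorted(routes):
          print(route)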
    
  • Note that all frequencies and routes must be in the lexicon. And all frequencies must be in SemanticRuleEngine.java (and, to be normalized, frequency_rules).

  • USEFUL BIT FOR CHECKING RESULTS:

    SELECT
        sentence_text,
        drug, generic_name,
        form, strength, dose_amount,
        route, frequency, frequency_timex3,
        duration, necessity
    FROM anonymous_output.drugs;
    
  • ENCODING

    • Pipe encoding (to Java’s stdin, from Java’s stdout) is the less important of the two, as we’re only likely to send/receive ASCII. It’s hard-coded to UTF-8.

    • File encoding is vital and is hard-coded to UTF-8 here and in the receiving Java.

    • We have no direct influence over the MedTagger code for output (unless we modify it). The output function is MedTagger.print_result(), which (line 2040 of MedTagger.java) calls out.write(stuff).

      The out variable is set by

      this.out = new BufferedWriter(new FileWriter(output_dir
              + File.separator + doc.fname()));
      

      That form of the FileWriter constructor, FileWriter(String fileName), uses the “default character encoding”, as per https://docs.oracle.com/javase/7/docs/api/java/io/FileWriter.html

      That default is given by System.getProperty("file.encoding"). However, we don’t have to do something daft like asking the Java to report its file encoding to Python through a pipe; instead, we can set the Java default encoding. It can’t be done dynamically, but it can be done at JVM launch: https://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding.

      Therefore, we should have a Java parameter specified in the config file as -Dfile.encoding=UTF-8.
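
      A sketch of the corresponding launch from Python (the class name and classpath are illustrative): set -Dfile.encoding=UTF-8 on the JVM command line so that FileWriter writes UTF-8, and fix the pipe encoding explicitly on the Python side.

        import subprocess

        cmd = [
            "java",
            "-Dfile.encoding=UTF-8",   # JVM default file encoding, set at launch
            "-classpath", "/path/to/crate/java:/path/to/medex/bin",  # illustrative
            "CrateMedexPipeline",      # our custom Java interface (see below)
        ]
        p = subprocess.Popen(
            cmd,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            encoding="utf-8",          # hard-code the pipe encoding too
        )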

class crate_anon.nlp_manager.parse_medex.Medex(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, cfg_processor_name: str, commit: bool = False)[source]

EXTERNAL.

Class controlling a MedEx-UIMA external process, via our custom Java interface, CrateMedexPipeline.java.

MedEx-UIMA is a medication-finding tool: https://www.ncbi.nlm.nih.gov/pubmed/25954575.

__init__(nlpdef: crate_anon.nlp_manager.nlp_definition.NlpDefinition, cfg_processor_name: str, commit: bool = False) → None[source]
Parameters
  • nlpdef – a crate_anon.nlp_manager.nlp_definition.NlpDefinition

  • cfg_processor_name – the name of a CRATE NLP config file section (from which we may choose to get extra config information)

  • commit – force a COMMIT whenever we insert data? You should specify this in multiprocess mode, or you may get database deadlocks.

dest_tables_columns() → Dict[str, List[sqlalchemy.sql.schema.Column]][source]

Describes the destination table(s) that this NLP processor wants to write to.

Returns

a dictionary of {tablename: destination_columns}, where destination_columns is a list of SQLAlchemy Column objects.

Return type

dict

dest_tables_indexes() → Dict[str, List[sqlalchemy.sql.schema.Index]][source]

Describes indexes that this NLP processor suggests for its destination table(s).

Returns

a dictionary of {tablename: indexes}, where indexes is a list of SQLAlchemy Index objects.

Return type

dict

static frequency_and_timex(text: str) → Tuple[Optional[str], Optional[str]][source]

Splits a MedEx frequency/TIMEX string into its frequency and TIMEX parts; e.g. splits b.i.d.(R1P12H) into "b.i.d." and "R1P12H".
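
For illustration only (not the actual implementation), the split might look like this:

    import re

    def frequency_and_timex_demo(text: str):
        # e.g. "b.i.d.(R1P12H)" -> ("b.i.d.", "R1P12H")
        m = re.match(r"^(.*)\((.*)\)$", text)
        return (m.group(1), m.group(2)) if m else (None, None)

    print(frequency_and_timex_demo("b.i.d.(R1P12H)"))  # ('b.i.d.', 'R1P12H')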

static get_text_start_end(medex_str: Optional[str]) → Tuple[Optional[str], Optional[int], Optional[int]][source]

MedEx returns “drug”, “strength”, etc. as aspirin[7,14], where the text is followed by the start position and the end position (one beyond the last character), both zero-indexed. This function converts a string like aspirin[7,14] to a tuple like "aspirin", 7, 14.

Parameters

medex_str – string from MedEx

Returns

text, start_pos, end_pos; values may be None

Return type

tuple
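
A minimal illustration of that format (not the actual implementation):

    import re

    def get_text_start_end_demo(medex_str):
        # e.g. "aspirin[7,14]" -> ("aspirin", 7, 14)
        if not medex_str:
            return None, None, None
        m = re.match(r"^(.*)\[(\d+),(\d+)\]$", medex_str)
        if not m:
            return medex_str, None, None
        return m.group(1), int(m.group(2)), int(m.group(3))

    print(get_text_start_end_demo("aspirin[7,14]"))  # ('aspirin', 7, 14)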

static int_or_none(text: Optional[str]) → Optional[int][source]

Takes text and returns an integer version or None.

parse(text: str) → Generator[Tuple[str, Dict[str, Any]], None, None][source]
  • Send text to the external process, and receive the result.

  • Note that associated data is not passed into this function, and is kept in the Python environment, so we can’t run into any problems with the transfer to/from the Java program garbling important data. All we send to the subprocess is the text (and an input_terminator). Then, we may receive MULTIPLE sets of data back (“your text contains the following 7 people/drug references/whatever”), followed eventually by the output_terminator, at which point this set is complete.
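
A sketch of that receive loop (the terminator string here is illustrative; the real one is configured elsewhere):

    def receive_one_result_set(stdout, output_terminator="--OUTPUT-COMPLETE--"):
        """Collect lines from the Java subprocess's stdout until the output
        terminator arrives. The lines received before it may describe several
        drug references for the one input text."""
        received = []
        for line in stdout:
            line = line.rstrip("\n")
            if line == output_terminator:
                break
            received.append(line)
        return received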

static str_or_none(text: Optional[str]) → Optional[str][source]

If the string is non-empty, return the string; otherwise return None.

test(verbose: bool = False) → None[source]

Test the send function.

class crate_anon.nlp_manager.parse_medex.PseudoTempDir(name: str)[source]

This class exists so that a TemporaryDirectory and a manually specified directory can be addressed via the same (very simple!) interface.

__init__(name: str) → None[source]
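
A minimal sketch of the idea (assumed shape, not the verbatim implementation): expose the same .name attribute as tempfile.TemporaryDirectory, so that calling code does not care which kind of object it holds.

    from typing import Optional
    import tempfile

    class PseudoTempDir:
        """Quacks like tempfile.TemporaryDirectory as far as .name is
        concerned, but wraps a manually specified (and manually managed)
        directory."""
        def __init__(self, name: str) -> None:
            self.name = name

    def choose_workdir(fixed_path: Optional[str] = None):
        # Calling code can address either object via the same interface.
        if fixed_path:
            return PseudoTempDir(fixed_path)
        return tempfile.TemporaryDirectory()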