13.5. Using SemEHR

SemEHR is a tool for surfacing semantic data from free-text notes in electronic health records.

Here we summarize quick ways to get SemEHR operational.

13.5.1. Background on Docker

Docker delivers software in “containers”, which package up code and its dependencies. A container is portable to any Docker-enabled machine, and can be built automatically from a Dockerfile, a text description of the container.

With regard to operating systems:

  • There is a “host OS” (“container host”), which is the operating system that the computer running Docker is using.

  • Containers can have a “container OS” (“base OS”), which is what the software in the container sees. If the container is very different (e.g. Linux container on Windows host), then virtualization is used (e.g. running virtualized Ubuntu on Windows, and Docker containers in or on top of that). If the container is similar (e.g. a container of one flavour of Linux running on another), then the underlying kernel can be used, which is faster. If the container wants to use the same OS as the host OS (e.g. Ubuntu container running on Ubuntu machine) then a “no-OS” container can be used, which is also fast. Most descriptions of Docker involve Linux containers on a Linux host.

  • The container uses only the kernel from the host OS (but it can therefore run one Linux distribution on top of another’s kernel).

Docker Compose is a tool for running multi-container Docker applications. It uses a control file typically called docker-compose.yml to describe the containers, which can then be started with the command docker-compose up and stopped with docker-compose down.
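
For orientation, a minimal, hypothetical docker-compose.yml might look like this (the service name and image here are illustrative only, not from SemEHR):

version: '3.3'
services:
  web:
    image: httpd
    ports:
      - "8080:80"

With this file in the current directory, docker-compose up -d starts the container in the background, and docker-compose down stops and removes it.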

13.5.1.1. Installing Docker for Ubuntu

  • Install the software.

    sudo apt update
    sudo apt-get remove docker docker-engine docker.io
    
    # Docker
    sudo apt install docker.io
    
    # If you want it to start automatically:
    sudo systemctl start docker
    sudo systemctl enable docker
    
    # Docker Compose. Note that "sudo apt install docker-compose" may go wrong.
    sudo apt install curl
    sudo curl -L "https://github.com/docker/compose/releases/download/1.23.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
    sudo chmod +x /usr/local/bin/docker-compose
    
  • Edit /etc/group to add your user to the docker group. Log out and log in again to pick up the change. (Otherwise you will get the error Got permission denied while trying to connect to the Docker daemon socket.)
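
    Alternatively, the same group change can be made without editing the file by hand (usermod is standard on Ubuntu):

    sudo usermod -aG docker "$USER"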

  • Check it’s working:

    groups  # Which groups am I in?
    # ... Should include "docker". If not, try "groups <MYUSERNAME>". If the
    # two differ, reboot and retry. If "docker" is in neither, you have not
    # edited /etc/group properly.
    
    docker --version
    docker version  # More detailed; should show client and server version.
    
    docker-compose --version
    docker-compose version  # More detailed.
    

13.5.1.2. Installing Docker for Windows

Install Docker Desktop for Windows.

13.5.2. SemEHR tutorial via Docker on Ubuntu

This is based on https://semehr.cogstack.systems/tutorials.html.

Prerequisites:

  • Docker and Docker Compose, as above.

  • A Bio-YODIE installation (see links in the tutorial)

Set some environment variables:

#!/usr/bin/env bash
# semehr_set_envvars.sh

# -----------------------------------------------------------------------------
# Environment variables
# -----------------------------------------------------------------------------

# We will make this directory:
export TUTORIALDIR=${HOME}/tmp/semehr_tutorial1

# This should already exist and contain your Bio-YODIE installation:
export BIOYODIEDIR=${HOME}/dev/yodie-pipeline-1-2-umls-only

# Other directories and files we'll use:
# - Root directory of SemEHR Git repository
export GITDIR=${TUTORIALDIR}/CogStack-SemEHR
# - Docker Compose tutorial directory within SemEHR tree
export COMPOSEDIR=${GITDIR}/tutorials/tutorial1_compose_files
# - Docker Compose file to launch Elasticsearch
export ELASTICSEARCH_COMPOSE=${COMPOSEDIR}/semehr-tutorial1-servers-compose.yml
# - Docker Compose file to launch SemEHR
export SEMEHR_COMPOSE=${COMPOSEDIR}/semehr-tutorial-run-compose.yml
# - Data directory
export DATADIR=${GITDIR}/tutorials/mtsamples-cohort
# - SemEHR config file
export SEMEHR_CONFIG=${DATADIR}/semehr_settings.json
# - Docker network name
export NETNAME=semehrnet
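
This file is designed to be sourced, not executed (the scripts below all source it); for example:

source semehr_set_envvars.sh
echo "${TUTORIALDIR}"  # confirm the variables are set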

Make some amendments:

#!/usr/bin/env bash
# semehr_setup_demo.sh
set -e
# ... if we try to run this more than once, it will (appropriately) fail

# -----------------------------------------------------------------------------
# Fetch environment variables from our common source
# -----------------------------------------------------------------------------

THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
source "${THIS_DIR}/semehr_set_envvars.sh"

# -----------------------------------------------------------------------------
# Other variables:
# -----------------------------------------------------------------------------

grep sse4_2 /proc/cpuinfo >/dev/null && HAS_SSE42=true || HAS_SSE42=false

# -----------------------------------------------------------------------------
# Setup actions
# -----------------------------------------------------------------------------

echo "- Making directory: ${TUTORIALDIR}"
mkdir "${TUTORIALDIR}"

echo "- Copying in UMLS..."
cp -R "${BIOYODIEDIR}/bio-yodie-resources" "${TUTORIALDIR}"

echo "- Fetching SemEHR code..."
git clone https://github.com/CogStack/CogStack-SemEHR.git "${GITDIR}"

echo "- Copying/editing Docker Compose files..."
# - Point to our files, not some hard-coded root-based path:
sed -i "s,device: /semehr_tutorial1/,device: ${TUTORIALDIR}/,g" "${ELASTICSEARCH_COMPOSE}"
sed -i "s,device: /semehr_tutorial1/,device: ${TUTORIALDIR}/,g" "${SEMEHR_COMPOSE}"
# - Fix networking aspects of config files
#   (a) Create named network for Elasticsearch.
#       Cannot name network to be created in v2.2 of the Docker Compose
#       file format. Therefore, create it separately.
docker network create "${NETNAME}" || echo "- Docker network ${NETNAME} already exists."
#       ... and declare it as external:
cat <<EOT >> "${ELASTICSEARCH_COMPOSE}"
networks:
  default:
    external:
      name: ${NETNAME}
EOT
#   (b) Make SemEHR join that network.
cat <<EOT >> "${SEMEHR_COMPOSE}"
networks:
  default:
    external:
      name: ${NETNAME}
EOT
#   (c) Make config file use internal net and names, not main net and IP addresses.
sed -i "s,http://172.17.0.1:8200/,http://es01:9200/,g" "${SEMEHR_CONFIG}"
# - Disable machine learning libraries if SSE4.2 not supported
if [ "${HAS_SSE42}" = false ] ; then
    sed -i "s,environment:,environment:\n      - xpack.security.enabled=false\n      - xpack.monitoring.enabled=false\n      - xpack.ml.enabled=false\n      - xpack.graph.enabled=false\n      - xpack.watcher.enabled=false,g" "${ELASTICSEARCH_COMPOSE}"
fi
# - NB to revert files, use
#   cd "${GITDIR}"; git reset --hard origin/master

echo "- Done."

Now start Elasticsearch:

#!/usr/bin/env bash
# semehr_start_elasticsearch.sh
set -e

# -----------------------------------------------------------------------------
# Fetch environment variables from our common source
# -----------------------------------------------------------------------------

THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
source "${THIS_DIR}/semehr_set_envvars.sh"

# -----------------------------------------------------------------------------
# Start Elasticsearch
# -----------------------------------------------------------------------------
# Start the containers (will fetch all necessary software the first time).
# Run in foreground mode, so we can see the log output.
echo "Starting Docker container: ${ELASTICSEARCH_COMPOSE}"
docker-compose -f "${ELASTICSEARCH_COMPOSE}" up

Now fire up another terminal, enter the same variable definitions as above, and fix an Elasticsearch problem:

#!/usr/bin/env bash
# semehr_fix_watermark.sh
set -e

# -----------------------------------------------------------------------------
# Fix an Elasticsearch error (needed once only, but no harm in repeating):
# -----------------------------------------------------------------------------

echo "- Fixing watermark for Elasticsearch (which should be running already)..."
curl -X PUT "localhost:8200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "2gb",
        "cluster.routing.allocation.disk.watermark.high": "1gb",
        "cluster.routing.allocation.disk.watermark.flood_stage": "500mb",
        "cluster.info.update.interval": "1m"
      }
    }
'

Now in that second terminal, run SemEHR:

#!/usr/bin/env bash
# semehr_run_semehr.sh
set -e

# -----------------------------------------------------------------------------
# Fetch environment variables from our common source
# -----------------------------------------------------------------------------

THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
source "${THIS_DIR}/semehr_set_envvars.sh"

# -----------------------------------------------------------------------------
# Run SemEHR (i.e. start, run, stop)
# -----------------------------------------------------------------------------
docker-compose -f "${SEMEHR_COMPOSE}" run semehr

Browse to http://127.0.0.1:8080/SemEHR.html and try searching for patient P001. Try also http://127.0.0.1:8200/_cat/indices/, which should show current indices (you expect one called eprdoc).
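
Equivalently from the command line (using the host port mappings described later):

curl http://127.0.0.1:8200/_cat/indices/          # expect a line for "eprdoc"
curl -s http://127.0.0.1:8080/SemEHR.html | head  # the SemEHR web interface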

Once the Elasticsearch container group is happy, you can (if you want) shut it down (Ctrl-C) and restart it in the background:

docker-compose -f "${ELASTICSEARCH_COMPOSE}" up -d

# And when you want to shut down Elasticsearch:
docker-compose -f "${ELASTICSEARCH_COMPOSE}" down
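
When running detached like this, the containers’ log output remains available via Docker Compose:

docker-compose -f "${ELASTICSEARCH_COMPOSE}" logs --follow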

13.5.3. Notes on Docker

13.5.3.1. Docker information and debugging

  • You can explore what’s going on:

    docker-compose -f <COMPOSEFILE> images
    docker-compose -f <COMPOSEFILE> top
    docker images
    docker container ls
    docker stats  # Ctrl-C to finish
    docker ps
    
  • If things are going wrong, you can start a shell in a running container (see e.g. https://phase2.github.io/devtools/common-tasks/ssh-into-a-container/), such as with:

    docker ps  # get container ID
    docker exec -it <CONTAINER_ID> /bin/bash
    

    Note that the Elasticsearch containers are meant to start up and stay up, but the SemEHR container is meant to run (sending data to Elasticsearch), then stop.

  • To monitor network traffic:

    # Run Wireshark
    # (a) from Docker (e.g. https://hub.docker.com/r/manell/wireshark/):
    #
    # docker run -ti --net=host --privileged -v $HOME:/root:ro -e XAUTHORITY=/root/.Xauthority -e DISPLAY=$DISPLAY manell/wireshark
    #
    # (b) Natively:
    
    wireshark
    
    # Now use Wireshark filters e.g. to debug browsing to http://172.17.0.1:
    # (ip.dst == 172.17.0.1 || ip.src == 172.17.0.1) && http
    
  • To trash a Docker system thoroughly:

    # DANGER: destroys everything it can from Docker.
    docker stop $(docker ps -q)
    docker container rm $(docker container ls -a -q)
    docker image rm $(docker image ls -a -q)
    docker volume rm $(docker volume ls -q)
    docker network rm $(docker network ls -q)
    docker system prune -a
    

13.5.3.2. Docker networking

  • Docker always creates a default network interface (called docker0; run ifconfig to see it), plus a default Docker network whose name is bridge (run docker network inspect bridge to see details). You can see what Docker does in terms of routing by running sudo iptables -t nat -S. The default Docker network appears as the private network 172.17.0.0/16.

  • Docker may set up multiple networks. View them with:

    docker network ls
    docker network inspect $(docker network ls -q)
    
  • To get a container’s IP address (as seen by other processes on the host machine), use

    docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <CONTAINER_NAME_OR_ID>
    

    You can see more information on a container with

    docker inspect <CONTAINER_NAME_OR_ID>
    
  • Within a Docker Compose container collection (I can’t find an established noun for one of these, so will call it a “composition”), applications can use each other’s service names as hostnames – so, for example, containers can talk to the web container via http://web/... and to the database container as postgres://db:5432. However, this naming system does not extend to the “outside world” of the host machine (verified empirically), or to Docker containers outside the composition.

  • Compositions may define their own networks. But if they don’t (as SemEHR doesn’t as of 2019-11-11):

    • Docker Compose will create a network whose name is that of the directory containing the .yml file (minus punctuation), plus _default – thus, for SemEHR, this is tutorial1composefiles_default. In our tests this network is typically 172.21.0.0/16.

    • Individual containers may be exposed via multiple IP addresses. For example, the es01 container’s Elasticsearch port (mapped to host port 8200; see below) is accessible via two gateways:

      # With the Elasticsearch composition running:
      # Get IP address of "es01" container:
      docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' es01
      # ... currently 172.21.0.2
      
      # Ping it
      ping 172.21.0.2
      # ... OK
      
      # Which ports are open?
      nmap 172.21.0.2  # es01 container
      # ... port 9200 is open
      
      # (Not shown) Establish that the "es02" container is 172.21.0.4,
      # and the web ("tutorial1composefiles_web_1") container is 172.21.0.3.
      
      nmap 172.21.0.4  # es02 container
      # ... port 9200 is open
      
      nmap 172.21.0.3  # tutorial1composefiles_web_1 container
      # ... port 80 is open
      
      # Now, there is a special one at 172.21.0.1:
      docker network inspect $(docker network ls -q) | less
      # ... shows that 172.21.0.1 is the gateway for the network named
      #     tutorial1composefiles_default
      
      nmap 172.21.0.1  # gateway for this composition's network
      # ... ports 22, 8080, 8200 are open.
      # ... reducing to port 22 when the Elasticsearch composition shuts down
      
      # What about 172.17.0.1, the gateway for the Docker default bridge
      # network?
      nmap 172.17.0.1  # gateway for default Docker network
      # ... ports 22, 8080, 8200 are open.
      # ... reducing to port 22 when the Elasticsearch composition shuts down
      
      # What about the host machine itself?
      nmap localhost  # or nmap 127.0.0.1
      # ... in my case: stuff including 8080, 8200
      # ... reducing to stuff not including 8080, 8200 when ES container down
      
      curl http://172.17.0.1:8080  # gives SemEHR web page
      curl http://172.21.0.1:8080  # gives SemEHR web page
      
      curl http://172.17.0.1:8200  # gives Elasticsearch JSON result
      curl http://172.21.0.1:8200  # gives Elasticsearch JSON result
      
      # ssh into the es01 container:
      docker exec -it es01 /bin/bash
      # then try: find / -type f -name "*.log"
      
  • Therefore, the host machine can currently access the Elasticsearch container via any of:

    http://172.17.0.1:8200
    http://172.21.0.1:8200
    http://localhost:8200
    http://127.0.0.1:8200
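
    A quick probe of all four from the host (a sketch; expects the Elasticsearch composition to be running):

    for url in http://172.17.0.1:8200 http://172.21.0.1:8200 \
               http://localhost:8200 http://127.0.0.1:8200; do
        echo "== ${url}"
        curl -s "${url}" | head -3
    done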
    

13.5.3.3. Docker debugging container

Let’s create a container that mimics the SemEHR “runner”, in that it is part of our created network, but not in the Elasticsearch composition.

In debugger.yml:

version: '3.3'

services:
  debugger:
    image: praqma/network-multitool
    container_name: debugger
networks:
  default:
    external:
      name: semehrnet

Start it with:

docker-compose -f debugger.yml up -d

And in a separate command line:

docker exec -it debugger /bin/bash

Via ping, nmap, and curl, we see that the correct URL is http://es01:9200/ (and http://web/ or http://web:80/).
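
For the record, a sketch of those checks from inside the debugger container (whose image ships the tools used here):

ping -c 1 es01            # resolves via Docker’s internal DNS
nmap -p 9200 es01         # port 9200 open
curl http://es01:9200/    # Elasticsearch JSON banner
curl http://web/          # SemEHR web page (port 80)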

13.5.4. Notes on the SemEHR Docker setup

  • SemEHR sets up a background Docker application via Docker Compose. This has three containers:

    • web uses the httpd image, which serves content from its /usr/local/apache2/htdocs/ directory on port 80. The Compose file maps some SemEHR data to this directory, and exposes the web server on port 8080.

    • es01 uses an Elasticsearch Docker image, which offers Elasticsearch on port 9200. SemEHR maps that to port 8200.

    • es02 runs a second instance of the Elasticsearch image. The Docker Compose configuration allows the two Elasticsearch containers to talk to each other, as per the Elasticsearch Docker instructions.

    This application is intended to run in the background. It provides Elasticsearch indexing and a web interface.

  • SemEHR then provides another container, which processes the source documents. This application is intended to run, then stop once it’s processed everything.

    • Its config file, CogStack-SemEHR/tutorials/mtsamples-cohort/semehr_settings.json, governs how this container finds the Elasticsearch container (see the check after this list).

    • It writes to semehr.log in the same directory.
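
A quick way to confirm where the processor will send its data (checking the result of the setup script’s sed edit):

grep "es01:9200" "${SEMEHR_CONFIG}"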

13.5.5. Troubleshooting SemEHR

13.5.5.1. Persistent wrong Docker paths

If you started a container based on a .yml file with the wrong directories, the settings can persist. Try docker container ls followed by docker container rm ...; if no containers are listed but the problem persists, try docker image ls followed by docker image rm <IMAGE_ID>, so that the container is rebuilt when you next start it. If that doesn’t work, try docker image prune -a and docker system prune -a, delete the tutorial directory, and start again.

  • Note that under Ubuntu, Docker data is in /var/lib/docker.

  • These were the errors:

    ERROR: for es01  Cannot create container for service es01: failed to mount local volume: mount /semehr_tutorial1/CogStack-SemEHR/tutorials/working_dCreating tutorial1composefiles_web_1 ... error
    ERROR: for tutorial1composefiles_web_1  Cannot create container for service web: failed to mount local volume: mount /semehr_tutorial1/CogStack-SemECreating es02 ... error
    ERROR: for es02  Cannot create container for service es02: failed to mount local volume: mount /semehr_tutorial1/CogStack-SemEHR/tutorials/working_data/docker_es02:/var/lib/docker/volumes/tutorial1composefiles_esdata02/_data, flags: 0x1000: no such file or directory
    ERROR: for es01  Cannot create container for service es01: failed to mount local volume: mount /semehr_tutorial1/CogStack-SemEHR/tutorials/working_data/docker_es01:/var/lib/docker/volumes/tutorial1composefiles_esdata01/_data, flags: 0x1000: no such file or directory
    ERROR: for es02  Cannot create container for service es02: failed to mount local volume: mount /semehr_tutorial1/CogStack-SemEHR/tutorials/working_data/docker_es02:/var/lib/docker/volumes/tutorial1composefiles_esdata02/_data, flags: 0x1000: no such file or directory
    ERROR: for web  Cannot create container for service web: failed to mount local volume: mount /semehr_tutorial1/CogStack-SemEHR/UI/patient_phenome_ui:/var/lib/docker/volumes/tutorial1composefiles_semehr_phenome_ui_folder/_data, flags: 0x1000: no such file or directory
    
  • Show volumes with docker volume ls:

    DRIVER              VOLUME NAME
    local               tutorial1composefiles_esdata01
    local               tutorial1composefiles_esdata02
    local               tutorial1composefiles_semehr_phenome_ui_folder
    
  • Show details on one with e.g. docker volume inspect tutorial1composefiles_esdata01:

    [
        {
            "CreatedAt": "2019-11-07T16:04:18Z",
            "Driver": "local",
            "Labels": {
                "com.docker.compose.project": "tutorial1composefiles",
                "com.docker.compose.volume": "esdata01"
            },
            "Mountpoint": "/var/lib/docker/volumes/tutorial1composefiles_esdata01/_data",
            "Name": "tutorial1composefiles_esdata01",
            "Options": {
                "device": "/semehr_tutorial1/CogStack-SemEHR/tutorials/working_data/docker_es01",
                "o": "bind",
                "type": "none"
            },
            "Scope": "local"
        }
    ]
    

This was on 2019-11-08 after wiping everything I’d thought of. So it seems that the thing that is being persisted/cached is the volume.

A few calls to docker volume rm ... later… and it’s happy.

Lesson: containers and volumes are independent!
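
To remove just this tutorial’s volumes (a sketch; remove the relevant containers first, since in-use volumes can’t be deleted, and note that --filter name= matches a prefix):

docker volume rm $(docker volume ls -q --filter name=tutorial1composefiles)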

Still problems, though. Complete purge, as above.

13.5.5.2. Errors relating to a full disk

If you see INTERNAL ERROR: cannot create temporary directory!, your disk is probably full. (Lots of rubbish in /var/spool/mail/root, for example?)
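
Quick checks (the mail spool being the example culprit just mentioned):

df -h                             # any filesystem at 100%?
sudo du -sh /var/spool/mail/root  # how big is the suspect directory?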

13.5.5.3. Elasticsearch complains about vm.max_map_count

If the Elasticsearch containers fail to start and give the error message max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144], then do this:

#!/usr/bin/env bash
# semehr_fix_vm_settings.sh
set -e

# -----------------------------------------------------------------------------
# Fix Linux virtual memory settings
# -----------------------------------------------------------------------------

TARGET_VM_SIZE=262144


echo "- vm.max_map_count is:"
sysctl vm.max_map_count  # read

echo "- Setting vm.max_map_count to: ${TARGET_VM_SIZE}"
sudo sysctl -w vm.max_map_count=${TARGET_VM_SIZE}  # write

echo "- vm.max_map_count is now:"
sysctl vm.max_map_count  # re-read, should have changed
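
Note that sysctl -w does not survive a reboot; the standard way to persist the setting is:

echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p  # reload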

13.5.5.4. Elasticsearch “high disk watermark…”

It is not clear whether high disk watermark exceeded on one or more nodes messages from Elasticsearch are critical; Elasticsearch seems to carry on regardless.

However, sometimes we get unavailable_shards_exception errors from Elasticsearch, and ConnectionTimeout errors from SemEHR.

If so, apply the fix from https://stackoverflow.com/questions/30289024/high-disk-watermark-exceeded-even-when-there-is-not-much-data-in-my-index:

#!/usr/bin/env bash
# semehr_fix_watermark.sh
set -e

# -----------------------------------------------------------------------------
# Fix an Elasticsearch error (needed once only, but no harm in repeating):
# -----------------------------------------------------------------------------

echo "- Fixing watermark for Elasticsearch (which should be running already)..."
curl -X PUT "localhost:8200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "2gb",
        "cluster.routing.allocation.disk.watermark.high": "1gb",
        "cluster.routing.allocation.disk.watermark.flood_stage": "500mb",
        "cluster.info.update.interval": "1m"
      }
    }
'
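
To confirm the transient settings took effect:

curl "localhost:8200/_cluster/settings?pretty"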

13.5.5.5. Elasticsearch complains about log files (but actually machine learning)

The Elasticsearch containers fail to start with an error like ... Caused by: java.io.FileNotFoundException: /tmp/elasticsearch-12074371925419480839/controller_log_1 (No such file or directory)...:

  • https://github.com/elastic/elasticsearch/issues/43321 – closed as “user issue” but suggests following minimum steps to reproduce:

    docker pull docker.elastic.co/elasticsearch/elasticsearch:7.1.1
    docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.1.1
    

    Same error on one of my machines, but not another. Both are using Docker 18.09.7. Note that the earlier part of the error message was: "stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[Failed to create native process factories for Machine Learning]; nested: FileNotFoundException[/tmp/elasticsearch-13081531845067409927/controller_log_1 (No such file or directory)];",

  • So this may actually relate to machine learning libraries, not logs. Thus:

  • https://discuss.elastic.co/t/unable-to-start-elasticsearch-5-4-0-in-docker/84800

  • Update Ubuntu on the failing machine (including the kernel, which is the relevant bit – to 4.15.0-66-generic from 4.15.0-62-generic; the “good” machine is running 4.15.0-58-generic). Didn’t help.

Add this to the Docker Compose file:

services:
  es01:
    environment:
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.ml.enabled=false
      - xpack.graph.enabled=false
      - xpack.watcher.enabled=false
  es02:
    environment:
      - xpack.security.enabled=false
      - xpack.monitoring.enabled=false
      - xpack.ml.enabled=false
      - xpack.graph.enabled=false
      - xpack.watcher.enabled=false

Yes, that fixed it.

13.5.5.6. SemEHR not passing files to Elasticsearch

I had this from SemEHR:

total 2 docs to process...
semehr_processor(569) root 2019-11-07 23:36:33,250 INFO logging to /data/semehr.log
semehr_processor(574) root 2019-11-07 23:36:33,250 INFO [SemEHR-step] using job status file /data/semehr_job_status_doc_semehr.json
semehr_processor(580) root 2019-11-07 23:36:33,251 INFO [SemEHR-step]load documents to elasticsearch...
base(136) elasticsearch 2019-11-07 23:36:43,254 WARNING POST http://172.17.0.1:8200/eprdoc/docs/discharge_summary_14.txt?timeout=30s [status:N/A request:10.002s]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 220, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 641, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python2.7/dist-packages/urllib3/util/retry.py", line 344, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 355, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python2.7/httplib.py", line 1042, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1082, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 1038, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 882, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 844, in send
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connection.py", line 183, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connection.py", line 165, in _new_conn
    (self.host, self.timeout))
ConnectTimeoutError: (<urllib3.connection.HTTPConnection object at 0x7fce6f9fce90>, u'Connection to 172.17.0.1 timed out. (connect timeout=10)')

  • 172.17.0.1 is a private IP address, and it’s the address of the Elasticsearch engine.

  • Browsing to http://172.17.0.1:8200/ gives a happy Elasticsearch JSON answer:

    {
      "name" : "es01",
      "cluster_name" : "docker-cluster",
      "cluster_uuid" : "GRzBT27MQ3Shni3eK0DVIQ",
      "version" : {
        "number" : "7.1.1",
        "build_flavor" : "default",
        "build_type" : "docker",
        "build_hash" : "7a013de",
        "build_date" : "2019-05-23T14:04:00.380842Z",
        "build_snapshot" : false,
        "lucene_version" : "8.0.0",
        "minimum_wire_compatibility_version" : "6.8.0",
        "minimum_index_compatibility_version" : "6.0.0-beta1"
      },
      "tagline" : "You Know, for Search"
    }
    
  • Browsing to http://172.17.0.1:8200/eprdoc gives:

    {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [eprdoc]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"eprdoc","index":"eprdoc"}],"type":"index_not_found_exception","reason":"no such index [eprdoc]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"eprdoc","index":"eprdoc"},"status":404}
    
  • So, as per https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html:

    curl -X PUT http://172.17.0.1:8200/eprdoc
    

    … which should cause a message like [index [eprdoc] created] on the Elasticsearch container console.

  • Now, browsing to http://172.17.0.1:8200/eprdoc gives a happier answer:

    {"eprdoc":{"aliases":{},"mappings":{},"settings":{"index":{"creation_date":"1573169647882","number_of_shards":"1","number_of_replicas":"1","uuid":"yuzy7rNuTauk9thSPXaB6g","version":{"created":"7010199"},"provided_name":"eprdoc"}}}}
    
  • But re-running SemEHR still gives:

    ConnectTimeoutError: (<urllib3.connection.HTTPConnection object at 0x7fce6f9fced0>, u'Connection to 172.17.0.1 timed out. (connect timeout=10)')
    base(136) elasticsearch 2019-11-07 23:37:07,280 WARNING POST http://172.17.0.1:8200/eprdoc/docs/discharge_summary_14.txt?timeout=30s [status:N/A request:10.010s]
    
  • So, we mimic the call exactly:

    curl -v -X POST http://172.17.0.1:8200/eprdoc/docs/discharge_summary_14.txt?timeout=30s
    # ... nope, needs a request body
    
  • The semehr-tutorial-run-compose.yml file maps the Git root directory (.../CogStack-SemEHR) to its /opt/semehr/CogStack-SemEHR directory. Within that is semehr_processor.py.

    Working through the Python stack trace, we get to the relevant call from CogStack-SemEHR/semehr_processor.py to CogStack-SemEHR/analysis/semquery.py, but then it calls into a Python Elasticsearch library that is not in the Git repository, such as elasticsearch/connection/http_urllib3.py. However, this file can be found within /var/lib/docker.

    So, hack http_urllib3.py to add these lines (NB Python 2.7):

    import sys
    
    # ...
    
    # In the "try" block of "def perform_request(...)":
    print >>sys.stderr, "URL: %s" % repr(full_url)
    print >>sys.stderr, "Headers: %s" % repr(request_headers)
    print >>sys.stderr, "Body: %s" % repr(body)
    

    Then rerun the SemEHR container. We see (edited):

    Headers: {'connection': 'keep-alive', 'content-type': 'application/json'}
    Body: '{"fulltext":"Description: ...","id":"discharge_summary_03.txt","patient_id":"P003"}'
    

    We can save that data (with Python repr syntax removed, then edited) as data.txt:

    {
        "fulltext":"Description: Ankylosing spondylitis.",
        "id":"discharge_summary_03.txt",
        "patient_id":"P003"
    }
    

    and then our curl command is:

    DATAFILE=data.txt
    URL=http://172.17.0.1:8200/eprdoc/docs/discharge_summary_14.txt?timeout=30s
    # Or, for later:
    # URL=http://es01:9200/eprdoc/docs/discharge_summary_14.txt?timeout=30s
    curl -v -X POST  -d @"${DATAFILE}" "${URL}" -H 'connection: keep-alive' -H 'content-type: application/json'
    

    When it works, we see:

    Note: Unnecessary use of -X or --request, POST is already inferred.
    *   Trying 172.17.0.1...
    * TCP_NODELAY set
    * Connected to 172.17.0.1 (172.17.0.1) port 8200 (#0)
    > POST /eprdoc/docs/discharge_summary_14.txt?timeout=30s HTTP/1.1
    > Host: 172.17.0.1:8200
    > User-Agent: curl/7.58.0
    > Accept: */*
    > connection: keep-alive
    > content-type: application/json
    > Content-Length: 115
    >
    * upload completely sent off: 115 out of 115 bytes
    < HTTP/1.1 201 Created
    < Location: /eprdoc/docs/discharge_summary_14.txt
    < Warning: 299 Elasticsearch-7.1.1-7a013de "[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id})."
    < content-type: application/json; charset=UTF-8
    < content-length: 177
    <
    * Connection #0 to host 172.17.0.1 left intact
    {"_index":"eprdoc","_type":"docs","_id":"discharge_summary_14.txt","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}
    

    But sometimes we see (edited):

    Note: Unnecessary use of -X or --request, POST is already inferred.
    *   Trying 172.17.0.1...
    * TCP_NODELAY set
    * Connected to 172.17.0.1 (172.17.0.1) port 8200 (#0)
    > POST /eprdoc/docs/discharge_summary_14.txt?timeout=30s HTTP/1.1
    > Host: 172.17.0.1:8200
    > User-Agent: curl/7.58.0
    > Accept: */*
    > connection: keep-alive
    > content-type: application/json
    > Content-Length: 1456
    > Expect: 100-continue
    >
    < HTTP/1.1 100 Continue
    * We are completely uploaded and fine
    < HTTP/1.1 503 Service Unavailable
    < Warning: 299 Elasticsearch-7.1.1-7a013de "[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id})."
    < content-type: application/json; charset=UTF-8
    < content-length: 3415
    <
    {"error":{"root_cause":[{"type":"unavailable_shards_exception","reason":"[eprdoc][0] primary shard is not active Timeout: [30s], request: [BulkShardRequest [[eprdoc][0]] containing [index {[eprdoc][docs][discharge_summary_14.txt], source[{    \"fulltext\":\"Description: <...>* Connection #0 to host 172.17.0.1 left intact
    <...>\",    \"id\":\"discharge_summary_03.txt\",    \"patient_id\":\"P003\"}]}]]"},"status":503}
    

    and the Elasticsearch console says (excerpt):

    es01    | {"type": "server", "timestamp": "2019-11-11T10:40:07,723+0000", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "docker-cluster", "node.name": "es01", "cluster.uuid": "CVehM86XReSKmJsl9PrPhA", "node.id": "iya2H5K6SPS-EO9Co8NhLQ",  "message": "high disk watermark [90%] exceeded on [id7zHXTrQK6kPDJCzJ5eng][es02][/usr/share/elasticsearch/data/nodes/0] free: 68.2gb[9.5%], shards will be relocated away from this node"  }
    es01    | {"type": "server", "timestamp": "2019-11-11T10:40:07,723+0000", "level": "INFO", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "docker-cluster", "node.name": "es01", "cluster.uuid": "CVehM86XReSKmJsl9PrPhA", "node.id": "iya2H5K6SPS-EO9Co8NhLQ",  "message": "rerouting shards: [high disk watermark exceeded on one or more nodes]"  }
    es01    | {"type": "server", "timestamp": "2019-11-11T10:40:17,275+0000", "level": "WARN", "component": "r.suppressed", "cluster.name": "docker-cluster", "node.name": "es01", "cluster.uuid": "CVehM86XReSKmJsl9PrPhA", "node.id": "iya2H5K6SPS-EO9Co8NhLQ",  "message": "path: /eprdoc/docs/discharge_summary_14.txt, params: {index=eprdoc, id=discharge_summary_14.txt, type=docs, timeout=30s}" ,
    es01    | "stacktrace": ["org.elasticsearch.action.UnavailableShardsException: [eprdoc][0] primary shard is not active Timeout: [30s], request: [BulkShardRequest [[eprdoc][0]] containing [index {[eprdoc][docs][discharge_summary_14.txt], source[{    \"fulltext\":\"Description: <...>\",    \"id\":\"discharge_summary_03.txt\",    \"patient_id\":\"P003\"}]}]]",
    es01    | "at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:968) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:845) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:797) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:928) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:555) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.1.1.jar:7.1.1]",
    es01    | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
    es01    | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
    es01    | "at java.lang.Thread.run(Thread.java:835) [?:?]"] }
    
  • Is the “high disk watermark” thing a problem?

    Fix as above.

    Nope. Makes those “high disk watermark” errors go away, but doesn’t stop SemEHR failing.

  • Make Elasticsearch verbose:

    # Elasticsearch debug-level logs
    curl -X PUT "localhost:8200/_cluster/settings" -H 'Content-Type: application/json' -d'
        {"transient":{"logger._root":"DEBUG"}}
    '
    

    … not especially helpful.

  • Make ES log requests:

    https://stackoverflow.com/questions/13821061/log-elasticsearch-requests

    # Enable Elasticsearch slow log for "eprdoc" index:
    curl -X "PUT" "http://localhost:8200/eprdoc/_settings?preserve_existing=true" \
         -H 'Content-Type: application/json; charset=utf-8' \
         -d $'{
      "index": {
        "search.slowlog.threshold.query.trace": "0ms",
        "search.slowlog.threshold.fetch.trace": "0ms",
        "search.slowlog.level": "trace"
      }
    }'
    

Then a more successful line of enquiry:

  • Wireshark:

    wireshark &
    
    # Use this display filter:
    http && ip.host matches "^172\."
    

    Looks like no HTTP traffic is coming from the SemEHR container.

  • Shell within a Docker container:

    docker run -t -i ianneub/network-tools /bin/bash
    curl http://172.17.0.1:8200/  # aha! Not working.
    exit
    
    docker run -t -i bytesizedalex/nmap 172.17.0.1
    # ... only port 22.
    
    nmap 172.17.0.1
    # ...  ports 22, 8080, 8200
    
    sudo iptables -t nat -S
    # ... is that because the port mapping is via iptables in the host machine?
    
    docker run -t -i bytesizedalex/nmap 172.21.0.1
    # ... only port 22.
    
    # ... and that Docker nmap command fails for .2, .3, and .4 too.
    
  • OK. Fundamental problem in communicating between Docker containers?

    https://docs.docker.com/v17.09/engine/userguide/networking/default_network/container-communication/#communication-between-containers

    sudo iptables -L -n | grep FORWARD
    # ... gives:
    # Chain FORWARD (policy DROP)
    
    sudo pico /etc/init.d/docker
    # edit from
    #   DOCKER_OPTS=
    # to
    #   DOCKER_OPTS=--icc=true
    
    sudo service docker restart
    
    sudo iptables -L -n | grep FORWARD
    # no difference!
    # Reverted.
    
  • As per this:

    https://forums.docker.com/t/communicate-between-two-containers/38646

    “Containers on the same network can use the other[‘]s container name to communicate with each other.”

    So let’s try:

    docker network create semehrnet
    docker network connect semehrnet es01
    docker network connect semehrnet es02
    docker network connect semehrnet tutorial1composefiles_web_1
    
  • Then in semehr_settings.json, do not use http://es01:8200/, but use http://es01:9200/.

Success! These fixes – creating semehrnet and pointing the config file at http://es01:9200/ – are the edits incorporated into the setup script above.