4.1. Installing CRATE

4.1.2. Ubuntu Linux, from Debian package

To install CRATE and all its dependencies, download the Debian package and use gdebi:

sudo gdebi crate-VERSION.deb

(If you don’t have gdebi, install it with sudo apt-get install gdebi.)

4.1.3. Manual installation

Installing CRATE itself is very easy, but you probably want a lot of supporting tools. Here’s a logical sequence.

4.1.3.1. Python

Install Python 3.4 or higher. If it’s not already installed:

Linux

sudo apt-get install python3.4-dev

Windows

4.1.3.2. Python virtual environment and CRATE itself

Create a Python virtual environment (an isolated set of Python programs that won’t interfere with any other Python things) and install CRATE. Choose your own directory names.

Linux

python3.4 -m venv ~/venvs/crate
source ~/venvs/crate/bin/activate
python -m pip install --upgrade pip
pip install crate-anon

Windows

C:\Python34\python.exe -m ensurepip
C:\Python34\python.exe -m pip install --upgrade pip
C:\Python34\python.exe -m venv C:\venvs\crate
C:\venvs\crate\Scripts\activate
pip install crate-anon

4.1.3.3. Activating your virtual environment

Every time you want to work within your virtual environment, you should activate it, by running (Windows) or sourcing (Linux) the activate script within it, as above.

Once activated,

  • the PATHs are set up for the programs in the virtual environment;
  • when you run Python, you will run the copy in the virtual environment;
  • the Python package installation tool, pip, will be the one in the virtual environment and will modify the virtual environment (not the whole system).
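If you are unsure whether activation has taken effect, you can ask Python itself. The following is a minimal sketch, assuming the environment was created with python -m venv as above (the function name in_virtual_environment is made up for this illustration):

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True if the interpreter is running inside a venv.

    In an environment created with "python -m venv", sys.prefix points at
    the virtual environment, while sys.base_prefix still points at the
    base Python installation. Outside a venv, the two are equal.
    """
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        # Fall back to sys.prefix if base_prefix is unavailable.
        base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return prefix != base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtual_environment())
```

Run it with plain python after activating; the interpreter path printed should lie inside your virtual environment directory.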


4.1.3.4. RabbitMQ

Install RabbitMQ, required by the CRATE web site.

Linux

sudo apt-get install rabbitmq-server
# Check it's working:
sudo rabbitmqctl status

Windows

  • Download/install Erlang from http://www.erlang.org/downloads. The 32-bit Windows download (Erlang/OTP 18.3) does not work on Windows XP, so everything that follows has been tested on Windows 10, 64-bit.
  • Download/install RabbitMQ from https://www.rabbitmq.com/ → Download. (If you use the default installer, it will find Erlang automatically.)
  • Check it’s working: Start ‣ RabbitMQ Server ‣ RabbitMQ Command Prompt (sbin dir). Then type rabbitmqctl status. It’s helpful to do this, because you need to tell Windows to allow the various bits of RabbitMQ/Erlang to communicate over internal networks, and (under Windows 10) this triggers the appropriate prompts.
  • For additional RabbitMQ help see https://cmatskas.com/getting-started-with-rabbitmq-on-windows/.
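On either platform, a further quick check is whether the broker is accepting TCP connections. This is a rough sketch, assuming RabbitMQ is running on the local machine and listening on its default AMQP port, 5672; if you have reconfigured the port, change it accordingly. The function name port_is_open is illustrative only. It tests reachability, not whether RabbitMQ is healthy, so rabbitmqctl status remains the better test.

```python
import socket

AMQP_DEFAULT_PORT = 5672  # RabbitMQ's default AMQP port; change if reconfigured


def port_is_open(host="localhost", port=AMQP_DEFAULT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print("RabbitMQ port reachable:", port_is_open())
```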

4.1.3.5. Java

Install a Java development kit, to compile support for GATE natural language processing (NLP).

Linux

  • Usually built in.

Windows

  • Download/run the Java Development Kit installer from Oracle.

4.1.3.6. GATE

Install GATE, for NLP.

4.1.3.7. Third-party text extractors

Ensure any necessary third-party text extractor tools are installed and on the PATH.

Good extractors are built into CRATE for:

  • Office Open XML (DOCX, DOCM), for Microsoft Word 2007 onwards;
  • HTM(L), XML;
  • Open Document text format (ODT), for OpenOffice/LibreOffice;
  • plain text (LOG, TXT).

For some, there is a fallback converter built in, but third-party tools are faster:

  • PDF: speed improves by installing pdftotext [3]
  • Rich Text Format (RTF): speed improves by installing unrtf [4]

For some, you will need an external tool:

  • For Microsoft Word 97–2003 binary (DOC) files, you will need antiword [5]
  • As a fallback tool (“extract text from anything”), CRATE will use strings or strings2 [6], whichever it finds first.

If you install any manually, check that they run and are visible to CRATE via the PATH, as follows. Pass any extensions for which you want to see a report.

crate_anonymise --checkextractor .doc .docx .odt .pdf .rtf .txt None
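If you would like a check of the PATH that is independent of CRATE, something like the following may help. It is only a sketch: it is not how CRATE itself looks for extractors, the executable names are taken from the lists above, and the order in which strings and strings2 are tried here is an assumption.

```python
import shutil

# Executable names taken from the extractor lists above.
EXTRACTOR_TOOLS = {
    ".pdf": ["pdftotext"],
    ".rtf": ["unrtf"],
    ".doc": ["antiword"],
    "fallback": ["strings", "strings2"],  # search order is an assumption
}


def find_tools(tools=EXTRACTOR_TOOLS, which=shutil.which):
    """Map each file type to the full path of the first tool found, or None."""
    found = {}
    for filetype, candidates in tools.items():
        paths = (which(name) for name in candidates)
        found[filetype] = next((p for p in paths if p), None)
    return found


if __name__ == "__main__":
    for filetype, path in sorted(find_tools().items()):
        print("{:10} {}".format(filetype, path or "NOT FOUND on PATH"))
```

A result of NOT FOUND for .pdf or .rtf is not fatal, since CRATE has fallback converters for those; it only means the faster third-party tool is missing.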

4.1.3.8. C/C++ compiler

Note

This is optional. If you want to install C-based Python libraries, you’ll need a C/C++ compiler for Python 3.4.

Linux

Built in.

Windows

This is tricky, because the official compiler for Python 3.4 is Visual Studio 2010 [7] [9] [10], and getting it to work can be hard (e.g. on 64-bit Windows or with later compilers).

FUTURE PLANS: use Python 3.5, which supports Visual C++ 14.0 [8]. At present the necessary dependencies do not work cleanly.

4.1.3.9. Fast MurmurHash3

Note

This is optional (CRATE contains a pure-Python version), but it makes hashing faster.

pip install mmh3  # C version of MurmurHash3
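To see whether the C version is installed in the current environment, you can ask Python's import machinery without actually importing the module. This is a small sketch; module_available is a name invented for the example, and it says nothing about which implementation CRATE will choose at run time.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module of this name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("mmh3"):
        print("mmh3 (C version of MurmurHash3) is installed")
    else:
        print("mmh3 is not installed")
```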

4.1.4. Database and database drivers

You’ll want drivers for at least one database. See Recommended database drivers.

In the CPFT NHS environment, we use SQL Server and these:

pip install pyodbc django-pyodbc-azure

4.1.5. Build the CRATE Java NLP interfaces

crate_nlp_build_gate_java_interface --help
crate_nlp_build_gate_java_interface --javac JAVA_COMPILER_FILENAME --gatedir GATE_DIRECTORY

For example, on Windows:

crate_nlp_build_gate_java_interface ^
    --javac "C:\Program Files\Java\jdk1.8.0_91\bin\javac.exe" ^
    --gatedir "C:\Program Files\GATE_Developer_8.1"

Once built, you can run the script again with an additional --launch parameter to launch the GATE framework in an interactive demonstration mode (using GATE’s supplied “people and places” app).

4.1.6. Configure CRATE for your system

The anonymiser and NLP manager are run on an ad-hoc or regularly scheduled basis, and do not need to be kept running continuously.

For the anonymiser, you will need a .INI-style configuration file (see the anonymiser config file) that the CRATE_ANON_CONFIG environment variable points to when the anonymiser is run, and a .TSV-format data dictionary that the configuration file points to (see data dictionary).

For the NLP manager, you will need another .INI-style configuration file (see NLP config file) that the CRATE_NLP_CONFIG environment variable points to when the NLP manager is run.

For the web service, which you will want to run continuously, you will need a Python (Django) configuration file (see web config file) that the CRATE_WEB_LOCAL_SETTINGS environment variable points to when the web server processes are run. Use crate_print_demo_crateweb_config to make a new one, and edit it for your own settings.
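Since all three components find their configuration through environment variables, a common mistake is a variable that is unset or points to a file that doesn't exist. The following sketch reports on the three variables named above. The helper check_config_variables is hypothetical (it is not part of CRATE), and it checks only that each file exists, not that its contents are valid.

```python
import os

# Variable names as given in the text above.
CONFIG_VARIABLES = [
    "CRATE_ANON_CONFIG",
    "CRATE_NLP_CONFIG",
    "CRATE_WEB_LOCAL_SETTINGS",
]


def check_config_variables(names=CONFIG_VARIABLES, environ=None):
    """Return {variable name: problem description, or None if it looks fine}."""
    if environ is None:
        environ = os.environ
    report = {}
    for name in names:
        value = environ.get(name)
        if not value:
            report[name] = "not set"
        elif not os.path.isfile(value):
            report[name] = "points to a missing file: " + value
        else:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, problem in check_config_variables().items():
        print("{}: {}".format(name, problem or "OK"))
```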

4.1.7. Set up the web site infrastructure

Create the database yourself using your normal database management tool. Make sure that the config file pointed to by the CRATE_WEB_LOCAL_SETTINGS environment variable is set up to point to the database. From the activated Python virtual environment, you want to build the admin database, collect static files, populate relevant parts of the database, and create a superuser:

crate_django_manage migrate
crate_django_manage collectstatic
crate_django_manage populate
crate_django_manage createsuperuser
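If you would rather run these four steps from a script, the important thing is to stop at the first failure, because later steps depend on earlier ones. Here is one possible sketch, assuming the virtual environment is activated so that crate_django_manage is on the PATH; the function run_setup_steps is invented for this example. Note that createsuperuser prompts for input, so run the script from an interactive terminal.

```python
import subprocess

# Management commands in the order given above.
SETUP_STEPS = ["migrate", "collectstatic", "populate", "createsuperuser"]


def run_setup_steps(steps=SETUP_STEPS, run=subprocess.check_call):
    """Run each crate_django_manage step in order; return the steps completed.

    subprocess.check_call raises CalledProcessError on a non-zero exit
    code, so a failing step stops the sequence before later steps run.
    """
    completed = []
    for step in steps:
        run(["crate_django_manage", step])
        completed.append(step)
    return completed
```

Call run_setup_steps() to execute the sequence. The run parameter exists only so that the logic can be tried out without touching a real database.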

4.1.8. Test the web server and message queue

In two separate command windows, with the virtual environment activated in each, run the following two programs:

crate_launch_cherrypy_server
crate_launch_celery --debug

Browse to the web site. Choose ‘Test message queue by sending an e-mail to the RDBM’. If an e-mail arrives, that’s good. If you can’t see the web site, there’s a configuration problem. If you can see the web site but no e-mail arrives, work through these checks:

  • check that the e-mail server and the RDBM e-mail destination are correctly configured in the Django config file (as per the CRATE_WEB_LOCAL_SETTINGS environment variable);
  • check the Django log;
  • check the Celery log;
  • from the RabbitMQ administrative command prompt, run rabbitmqctl list_queues name messages consumers; this shows each queue’s name along with the number of messages in the queue and the number of consumers. If the number of messages is stuck at >0, they’re not being consumed properly;
  • run crate_launch_flower and browse to http://localhost:5555/ to explore the messaging system.
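The rabbitmqctl list_queues output can also be examined by a script. The sketch below assumes that each data row consists of three tab-separated fields (name, messages, consumers) and skips any banner or header lines; the function names are made up for this illustration. It looks at a single snapshot, so it flags queues that hold messages but have no consumers attached. To see whether a count is genuinely stuck, run the command several times.

```python
def parse_list_queues(output):
    """Parse the text printed by
    'rabbitmqctl list_queues name messages consumers'.

    Returns a list of (name, messages, consumers) tuples. Lines that are
    not three tab-separated fields with numeric counts are skipped.
    """
    queues = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue
        name, messages, consumers = fields
        if not (messages.isdigit() and consumers.isdigit()):
            continue
        queues.append((name, int(messages), int(consumers)))
    return queues


def suspicious_queues(queues):
    """Return queues holding messages but with no consumers attached."""
    return [q for q in queues if q[1] > 0 and q[2] == 0]


if __name__ == "__main__":
    sample = "Listing queues ...\ncelery\t3\t0\nother\t0\t1\n"
    for name, messages, consumers in suspicious_queues(parse_list_queues(sample)):
        print("{}: {} message(s) waiting, no consumers".format(name, messages))
```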

4.1.9. Configure the CRATE web service to run automatically

CRATE’s web service has two parts: the web site itself runs Django, and the offline message handling part (e.g. to send emails) runs Celery.

Linux

Try to avoid managing this by hand! That’s what the .deb file is there for.

Windows: service method

Using a privileged command prompt [e.g. on Windows 10: Winkey+X ‣ Command Prompt (Admin)], activate the virtual environment and install the service:

C:\venvs\crate\Scripts\activate
crate_windows_service install

Set the following system (not user!) environment variables (if you can’t find the Environment Variables part of Control Panel, use the command sysdm.cpl):

  • CRATE_ANON_CONFIG – to your main database’s CRATE anonymisation config file
  • CRATE_CHERRYPY_ARGS – e.g. to --port 8999 --root_path / (for relevant options, see crate_django_manage runcpserver --help)
  • CRATE_WEB_LOCAL_SETTINGS – to your Django site-specific Python configuration file.
  • CRATE_WINSERVICE_LOGDIR – to a writable directory.

In older versions of Windows you had to reboot or the service manager wouldn’t see it, but Windows 10 seems to cope happily. You can start the CRATE service manually, or (with the virtual environment activated) with crate_windows_service start, or configure it to start automatically on boot with the Automatic or Automatic (Delayed Start) option [1]. Any messages will appear in the Windows ‘Application’ event log.

Windows: task scheduler method

In principle you could also run the scripts via the Windows Task Scheduler, rather than as a service [2], e.g. with tasks like

cmd /c C:\venvs\crate\Scripts\crate_launch_cherrypy_server >>C:\crate_logs\djangolog.txt 2>&1
cmd /c C:\venvs\crate\Scripts\crate_launch_celery >>C:\crate_logs\celerylog.txt 2>&1

… but I’ve not bothered to test this, as the Service method works fine.

4.1.10. Retest the web server and message queue

Going to a “behind-the-scenes” (service) mode of operation has the potential to go wrong, so retest that the web server and the e-mail transmission task work.


Footnotes

[1] See http://stackoverflow.com/questions/11015189/automatic-vs-automatic-delayed-start
[2] See https://www.calazan.com/windows-tip-run-applications-in-the-background-using-task-scheduler/
[3] pdftotext: Ubuntu: sudo apt-get install poppler-utils. Windows: see http://blog.alivate.com.au/poppler-windows/, then install it and add it to the PATH.
[4] unrtf: Ubuntu: sudo apt-get install unrtf. Windows: see http://gnuwin32.sourceforge.net/packages/unrtf.htm, then install it and add it to the PATH.
[5] antiword: Ubuntu: sudo apt-get install antiword. Windows: see http://www.winfield.demon.nl/, then install it and add it to the PATH.
[6] strings and strings2: strings is part of Linux by default; for Windows, see https://technet.microsoft.com/en-us/sysinternals/strings.aspx or http://split-code.com/strings2.html (then install it and add it to the PATH).
[7] Visual Studio 2010; VC++ 10.0; MSC_VER=1600
[8] Visual Studio 2015; VC++ 14.0; MSC_VER=1900
[9] See http://stackoverflow.com/questions/29909330
[10] To map Visual C++/Studio versions to compiler numbers, see http://stackoverflow.com/questions/2676763. For more detail see http://stackoverflow.com/questions/2817869.