4.2. Installing CRATE
4.2.1. URLs for CRATE source code
https://github.com/ucam-department-of-psychiatry/crate (for source)
https://pypi.io/project/crate-anon/ (for pip install crate-anon)
4.2.2. Ubuntu Linux, from Debian package
To install CRATE and all its dependencies, download the Debian package and use gdebi:
sudo gdebi crate-VERSION.deb
(If you don’t have gdebi, install it with sudo apt-get install gdebi.)
4.2.3. Manual installation
Installing CRATE itself is very easy, but you probably want a lot of supporting tools. Here’s a logical sequence.
4.2.3.1. Python
Install Python 3.8 or higher. If it’s not already installed:
Linux
sudo apt-get install python3.8-dev
Windows
https://www.python.org/ → Downloads
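To confirm that the interpreter you intend to use meets the stated minimum, a short check like the following can help. This is a minimal sketch: the function name python_is_new_enough is hypothetical, and the threshold is simply taken from the "3.8 or higher" requirement above.

```python
import sys

# Minimum taken from the "Install Python 3.8 or higher" instruction above.
MINIMUM_VERSION = (3, 8)


def python_is_new_enough(version_info=sys.version_info) -> bool:
    """Return True if the given version tuple meets the documented minimum."""
    # Compare only (major, minor); the micro version does not matter here.
    return tuple(version_info[:2]) >= MINIMUM_VERSION


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    print("Meets minimum:", python_is_new_enough())
```

Run it with the same interpreter you plan to use for the virtual environment.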
4.2.3.2. Python virtual environment and CRATE itself
Create a Python virtual environment (an isolated set of Python programs that won’t interfere with any other Python things) and install CRATE. Choose your own directory names.
Linux
python3.8 -m venv ~/venvs/crate
source ~/venvs/crate/bin/activate
python -m pip install --upgrade pip
pip install crate-anon
Windows
C:\Python38\python.exe -m ensurepip
C:\Python38\python.exe -m venv C:\venvs\crate
C:\venvs\crate\Scripts\activate
python -m pip install --upgrade pip
pip install crate-anon
4.2.3.3. Activating your virtual environment
Every time you want to work within your virtual environment, you should activate it by running (Windows) or sourcing (Linux) the activate script within it, as above.
Once activated,
the PATHs are set up for the programs in the virtual environment;
when you run Python, you will run the copy in the virtual environment;
the Python package installation tool, pip, will be the one in the virtual environment and will modify the virtual environment (not the whole system).
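If you are unsure whether activation worked, you can ask Python itself. The sketch below relies on the standard behaviour of environments created with python -m venv, where sys.prefix differs from sys.base_prefix; the function name in_virtual_environment is a hypothetical helper, not part of CRATE.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if this interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was built from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python executable:", sys.executable)
    print("Inside a virtual environment:", in_virtual_environment())
```

After activation, the executable path printed should lie inside your chosen environment directory (for example ~/venvs/crate or C:\venvs\crate).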
4.2.3.4. RabbitMQ
Install RabbitMQ, required by the CRATE web site.
Linux
sudo apt-get install rabbitmq-server
# Check it's working:
sudo rabbitmqctl status
Windows
Download/install Erlang from http://www.erlang.org/downloads. The 32-bit Windows download (Erlang/OTP 18.3) does not work on Windows XP, so everything that follows has been tested on Windows 10, 64-bit.
Download/install RabbitMQ from https://www.rabbitmq.com/ → Download. (If you use the default installer, it will find Erlang automatically.)
Check it’s working: open the RabbitMQ command prompt, then type rabbitmqctl status. It’s helpful to do this, because you need to tell Windows to allow the various bits of RabbitMQ/Erlang to communicate over internal networks, and (under Windows 10) this triggers the appropriate prompts.
For additional RabbitMQ help see https://cmatskas.com/getting-started-with-rabbitmq-on-windows/.
4.2.3.5. Java
Install a Java development kit, to compile support for GATE natural language processing (NLP).
Linux
Usually built in.
Windows
Download/run the Java Development Kit installer from Oracle.
4.2.3.6. GATE
Install GATE, for NLP.
Download and install GATE from https://gate.ac.uk/download/
4.2.3.7. Third-party text extractors
Ensure any necessary third-party text extractor tools are installed and on the PATH.
Good extractors are built into CRATE for:
Office Open XML (DOCX, DOCM), for Microsoft Word 2007 onwards;
HTM(L), XML;
Open Document text format (ODT), for OpenOffice/LibreOffice;
plain text (LOG, TXT).
For some, there is a fallback converter built in, but third-party tools are faster:
PDF: speed improves by installing pdftotext [3];
Rich Text Format (RTF): speed improves by installing unrtf [4].
For some, you will need an external tool:
For Microsoft Word 97–2003 binary (DOC) files, you will need antiword [5].
As a fallback tool (“extract text from anything”), CRATE will use strings or strings2 [6], whichever it finds first.
If you install any manually, check that they run. To check that your text extractors are available and visible to CRATE via the PATH, you can use the crate_anon_check_text_extractor tool.
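As a quick independent check of the PATH, the standard library can report where each tool would be found. This is only a sketch and is not a substitute for CRATE's own checker; the tool names come from the list above, and find_tools is a hypothetical helper name.

```python
import shutil
from typing import Dict, Iterable, Optional

# Tool names taken from the list above; CRATE's own checker remains the authority.
EXTRACTOR_TOOLS = ("pdftotext", "unrtf", "antiword", "strings", "strings2")


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path on the PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools(EXTRACTOR_TOOLS).items():
        print(f"{tool:10s} {path or 'NOT FOUND on PATH'}")
```

Run this from the same shell (and user account) that will run CRATE, since the PATH can differ between them.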
4.2.3.8. C/C++ compiler
Note
This is optional. If you want to install C-based Python libraries, you’ll need a C/C++ compiler.
Linux
Built in.
Windows
Install Visual C++ 14.x [8] (or later?), the official compiler for Python 3.7-3.9 under Windows [9]. Visual Studio Community is free [11].
4.2.4. Database and database drivers
You’ll want drivers for at least one database. See Recommended database drivers.
In the CPFT NHS environment, we use SQL Server and these:
pip install pyodbc django-pyodbc-azure
4.2.5. Build the CRATE Java NLP interfaces
crate_nlp_build_gate_java_interface --help
crate_nlp_build_gate_java_interface --javac JAVA_COMPILER_FILENAME --gatedir GATE_DIRECTORY
For example, on Windows:
crate_nlp_build_gate_java_interface ^
--javac "C:\Program Files\Java\jdk1.8.0_91\bin\javac.exe" ^
--gatedir "C:\Program Files\GATE_Developer_8.1"
Once built, you can run the script again with an additional --launch parameter to launch the GATE framework in an interactive demonstration mode (using GATE’s supplied “people and places” app).
4.2.6. Configure CRATE for your system
The anonymiser and NLP manager are run on an ad-hoc or regularly scheduled basis, and do not need to be kept running continuously.
For the anonymiser, you will need a .INI-style configuration file (see anonymiser config file) that the CRATE_ANON_CONFIG environment variable points to when the anonymiser is run, and a .TSV-format data dictionary that the configuration file points to (see data dictionary).
For the NLP manager, you will need another .INI-style configuration file (see NLP config file) that the CRATE_NLP_CONFIG environment variable points to when the NLP manager is run.
For the web service, which you will want to run continuously, you will need a Python (Django) configuration file (see web config file) that the CRATE_WEB_LOCAL_SETTINGS environment variable points to when the web server processes are run. Use crate_print_demo_crateweb_config to make a new one, and edit it for your own settings.
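Because each component locates its configuration through an environment variable, a common failure is a variable that is unset or points at a missing file. The sketch below checks the three variables named above; check_config_variables is a hypothetical helper, and it tests only that a file exists, not that its contents are valid.

```python
import os
from typing import Dict, Mapping

# Environment variable names as given in the text above.
CONFIG_VARIABLES = (
    "CRATE_ANON_CONFIG",
    "CRATE_NLP_CONFIG",
    "CRATE_WEB_LOCAL_SETTINGS",
)


def check_config_variables(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return a short status string for each CRATE configuration variable."""
    report = {}
    for name in CONFIG_VARIABLES:
        value = environ.get(name)
        if not value:
            report[name] = "not set"
        elif not os.path.isfile(value):
            report[name] = f"set, but no file at {value}"
        else:
            report[name] = f"OK ({value})"
    return report


if __name__ == "__main__":
    for name, status in check_config_variables().items():
        print(f"{name}: {status}")
```

You may not need all three variables: set only those for the components you actually run.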
4.2.7. Set up the web site infrastructure
Create the database yourself using your normal database management tool. Make sure that the config file pointed to by the CRATE_WEB_LOCAL_SETTINGS environment variable is set up to point to the database. From the activated Python virtual environment, you want to build the admin database, collect static files, populate relevant parts of the database, and create a superuser:
crate_django_manage migrate
crate_django_manage collectstatic
crate_django_manage populate
crate_django_manage createsuperuser
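If you prefer to run these four steps as one script, something like the following sketch would do, stopping at the first failure. It assumes the virtual environment is activated so that crate_django_manage is on the PATH; run_steps is a hypothetical helper, and the optional run parameter exists only so the function can be exercised without executing anything.

```python
import subprocess
from typing import Callable, List, Sequence

# The four management commands listed above, in order.
SETUP_STEPS = (
    ["crate_django_manage", "migrate"],
    ["crate_django_manage", "collectstatic"],
    ["crate_django_manage", "populate"],
    ["crate_django_manage", "createsuperuser"],
)


def run_steps(steps: Sequence[List[str]], run: Callable = subprocess.run) -> int:
    """Run each command in turn and return how many completed.

    With the default runner, check=True makes a failing command raise
    subprocess.CalledProcessError, so later steps are not attempted.
    """
    completed = 0
    for command in steps:
        print("Running:", " ".join(command))
        run(command, check=True)
        completed += 1
    return completed


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    run_steps(SETUP_STEPS, run=lambda command, check: None)
```

To execute the steps for real, call run_steps(SETUP_STEPS) from an interactive terminal, since createsuperuser prompts for input.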
4.2.8. Test the web server and message queue
In two separate command windows, with the virtual environment activated in each, run the following two programs:
crate_launch_cherrypy_server
crate_launch_celery --debug
Browse to the web site. Choose ‘Test message queue by sending an e-mail to the RDBM’. If an e-mail arrives, that’s good. If you can’t see the web site, there’s a configuration problem. If you can see the web site but no e-mail arrives:
check that the e-mail server and the RDBM e-mail destination are correctly configured in the Django config file (as per the CRATE_WEB_LOCAL_SETTINGS environment variable);
check the Django log;
check the Celery log;
from the RabbitMQ administrative command prompt, run rabbitmqctl list_queues name messages consumers; this shows each queue’s name along with the number of messages in the queue and the number of consumers. If the number of messages is stuck at >0, they’re not being consumed properly;
run crate_launch_flower and browse to http://localhost:5555/ to explore the messaging system.
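The queue listing can also be inspected programmatically. The sketch below parses text in the shape that rabbitmqctl list_queues name messages consumers is assumed to produce (one queue per line, tab-separated, possibly preceded by banner or header lines) and flags queues that hold messages but have no consumers. The names QueueStatus, parse_list_queues and stuck_queues are hypothetical, and a single snapshot is only a heuristic: a queue is truly "stuck" only if its count fails to fall over time.

```python
from typing import List, NamedTuple


class QueueStatus(NamedTuple):
    name: str
    messages: int
    consumers: int


def parse_list_queues(output: str) -> List[QueueStatus]:
    """Parse the text output of `rabbitmqctl list_queues name messages consumers`."""
    queues = []
    for line in output.splitlines():
        fields = line.split("\t")
        # Data rows have exactly three tab-separated fields with two integer
        # counts; banner and header lines fail this test and are skipped.
        if len(fields) != 3:
            continue
        name, messages, consumers = (field.strip() for field in fields)
        if messages.isdigit() and consumers.isdigit():
            queues.append(QueueStatus(name, int(messages), int(consumers)))
    return queues


def stuck_queues(queues: List[QueueStatus]) -> List[QueueStatus]:
    """Return queues holding messages while no consumer is attached."""
    return [q for q in queues if q.messages > 0 and q.consumers == 0]


if __name__ == "__main__":
    # Illustrative sample only, not captured from a real server.
    sample = "Listing queues ...\nname\tmessages\tconsumers\ncelery\t3\t0\n"
    for queue in stuck_queues(parse_list_queues(sample)):
        print(f"{queue.name}: {queue.messages} message(s), no consumers")
```

A queue with waiting messages and zero consumers usually suggests that the Celery worker is not running or is connected to a different broker.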
4.2.9. Configure the CRATE web service to run automatically
CRATE’s web service has two parts: the web site itself runs Django, and the offline message handling part (e.g. to send emails) runs Celery.
Linux
Try to avoid managing this by hand! That’s what the .deb file is there for.
Windows: service method
Using a privileged (administrator) command prompt, activate the virtual environment and install the service:
C:\venvs\crate\Scripts\activate
crate_windows_service install
Set the following system (not user!) environment variables (if you can’t find the Environment Variables part of Control Panel, use the command sysdm.cpl):
CRATE_ANON_CONFIG – to your main database’s CRATE anonymisation config file.
CRATE_CHERRYPY_ARGS – e.g. to --port 8999 --root_path / (for relevant options, see crate_django_manage runcpserver --help).
CRATE_WEB_LOCAL_SETTINGS – to your Django site-specific Python configuration file.
CRATE_WINSERVICE_LOGDIR – to a writable directory.
In older versions of Windows you had to reboot or the service manager wouldn’t see it, but Windows 10 seems to cope happily. You can start the CRATE service manually, or configure it to start automatically on boot with the Automatic or Automatic (Delayed Start) option [1], or start it (with the virtual environment activated) with crate_windows_service start. Any messages will appear in the Windows ‘Application’ event log.
Windows: task scheduler method
In principle you could also run the scripts via the Windows Task Scheduler, rather than as a service [2], e.g. with tasks like
cmd /c C:\venvs\crate\Scripts\crate_launch_cherrypy_server >>C:\crate_logs\djangolog.txt 2>&1
cmd /c C:\venvs\crate\Scripts\crate_launch_celery >>C:\crate_logs\celerylog.txt 2>&1
… but I’ve not bothered to test this, as the Service method works fine.
4.2.10. Retest the web server and message queue
Going to a “behind-the-scenes” (service) mode of operation has the potential to go wrong, so retest that the web server and the e-mail transmission task work.
Footnotes
[1] https://stackoverflow.com/questions/11015189/automatic-vs-automatic-delayed-start
[2] See https://www.calazan.com/windows-tip-run-applications-in-the-background-using-task-scheduler/
[3] pdftotext: Ubuntu: sudo apt-get install poppler-utils. Windows: see http://blog.alivate.com.au/poppler-windows/, then install it and add it to the PATH.
[4] unrtf: Ubuntu: sudo apt-get install unrtf. Windows: see http://gnuwin32.sourceforge.net/packages/unrtf.htm, then install it and add it to the PATH.
[5] antiword: Ubuntu: sudo apt-get install antiword. Windows: see http://www.winfield.demon.nl/, then install it and add it to the PATH.
[6] strings and strings2: strings is part of Linux by default; for Windows, see https://technet.microsoft.com/en-us/sysinternals/strings.aspx or http://split-code.com/strings2.html (then install it and add it to the PATH).
[7] Visual Studio 2010; VC++ 10.0; MSC_VER=1600
[8] Visual Studio 2015; VC++ 14.0; MSC_VER=1900
[9]
[10] To map Visual C++/Studio versions to compiler numbers, see https://stackoverflow.com/questions/2676763. For more detail see https://stackoverflow.com/questions/2817869.
[11]