4.1. Installing CRATE
- URLs for CRATE source code
- Ubuntu Linux, from Debian package
- Manual installation
- Database and database drivers
- Build the CRATE Java NLP interfaces
- Configure CRATE for your system
- Set up the web site infrastructure
- Test the web server and message queue
- Configure the CRATE web service to run automatically
- Retest the web server and message queue
To install CRATE and all its dependencies, download the Debian package and run:
sudo gdebi crate-VERSION.deb
(If you don’t have gdebi, install it with sudo apt-get install gdebi.)
Installing CRATE itself is very easy, but you probably want a lot of supporting tools. Here’s a logical sequence.
Install Python 3.6 or higher. If it’s not already installed:
sudo apt-get install python3.6-dev
- https://www.python.org/ → Downloads
Create a Python virtual environment (an isolated set of Python programs that won’t interfere with any other Python things) and install CRATE. Choose your own directory names.
python3.6 -m venv ~/venvs/crate
source ~/venvs/crate/bin/activate
python -m pip install --upgrade pip
pip install crate-anon
C:\Python36\python.exe -m ensurepip
C:\Python36\python.exe -m pip install --upgrade pip
C:\Python36\python.exe -m venv C:\venvs\crate
C:\venvs\crate\Scripts\activate
pip install crate-anon
Every time you want to work within your virtual environment, you should activate it by running (Windows) or sourcing (Linux) the activate script within it, as above. Once it is activated:
- the PATHs are set up for the programs in the virtual environment;
- when you run Python, you will run the copy in the virtual environment;
- the Python package installation tool, pip, will be the one in the virtual environment and will modify the virtual environment (not the whole system).
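A quick way to confirm that activation worked is to ask Python itself. The following is a minimal sketch, not part of CRATE: the function names are illustrative, and it assumes an environment created with the standard venv module (where sys.prefix differs from sys.base_prefix).

```python
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the interpreter meets the minimum version (3.6 here)."""
    return tuple(version_info[:2]) >= tuple(minimum)


def in_virtualenv():
    """Return True if Python is running inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while sys.base_prefix
    # points at the Python installation the environment was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python executable:", sys.executable)
    print("Version 3.6 or higher:", python_ok())
    print("Inside a virtual environment:", in_virtualenv())
```

If the last line reports False after you have run the activate script, the python on your PATH is probably not the one inside the virtual environment.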
Install RabbitMQ, required by the CRATE web site.
sudo apt-get install rabbitmq-server
# Check it's working:
sudo rabbitmqctl status
- Download/install Erlang from http://www.erlang.org/downloads. The 32-bit Windows download (Erlang/OTP 18.3) does not work on Windows XP, so everything that follows has been tested on Windows 10, 64-bit.
- Download/install RabbitMQ from https://www.rabbitmq.com/ → Download. (If you use the default installer, it will find Erlang automatically.)
- Check it’s working: open the RabbitMQ command prompt, then type rabbitmqctl status. It’s helpful to do this, because you need to tell Windows to allow the various bits of RabbitMQ/Erlang to communicate over internal networks, and (under Windows 10) this triggers the appropriate prompts.
- For additional RabbitMQ help see https://cmatskas.com/getting-started-with-rabbitmq-on-windows/.
Install a Java development kit, to compile support for GATE natural language processing (NLP).
- Usually built in.
- Download/run the Java Development Kit installer from Oracle.
Ensure any necessary third-party text extractor tools are installed and on the PATH.
Good extractors are built into CRATE for:
- Office Open XML (DOCX, DOCM), for Microsoft Word 2007 onwards;
- HTM(L), XML;
- Open Document text format (ODT), for OpenOffice/LibreOffice;
- plain text (LOG, TXT).
For some, there is a fallback converter built in, but third-party tools are faster:
- PDF: speed improves by installing
- Rich Text Format (RTF): speed improves by installing
For some, you will need an external tool:
- For Microsoft Word 97–2003 binary (DOC) files, you will need
- As a fallback tool (“extract text from anything”), CRATE will use
strings2, whichever it finds first.
If you install any manually, check they run, as follows.
Check that your text extractors are available and visible to CRATE via the PATH. Pass any extensions for which you want to see a report.
crate_anonymise --checkextractor .doc .docx .odt .pdf .rtf .txt None
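If you want a quick independent check that a given tool is on the PATH at all, Python's standard library can tell you. This is a minimal sketch that complements, rather than replaces, the crate_anonymise check above; find_on_path is a hypothetical helper name, and the only tool listed is strings2 because that is the one named in the text.

```python
import shutil


def find_on_path(tool_names):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in tool_names}


if __name__ == "__main__":
    # "strings2" is named in the text above; add whichever other extractors
    # you have installed.
    for name, path in find_on_path(["strings2"]).items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```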
This is optional. If you want to install C-based Python libraries, you’ll need a C/C++ compiler.
FUTURE PLANS: use Python 3.5, which supports Visual C++ 14.0. At present the necessary dependencies do not work cleanly.
TODO: fix these docs now that we are using Python 3.6+.
You’ll want drivers for at least one database. See Recommended database drivers.
In the CPFT NHS environment, we use SQL Server and these:
pip install pyodbc django-pyodbc-azure
crate_nlp_build_gate_java_interface --help
crate_nlp_build_gate_java_interface --javac JAVA_COMPILER_FILENAME --gatedir GATE_DIRECTORY
For example, on Windows:
crate_nlp_build_gate_java_interface ^
    --javac "C:\Program Files\Java\jdk1.8.0_91\bin\javac.exe" ^
    --gatedir "C:\Program Files\GATE_Developer_8.1"
Once built, you can run the script again with an additional
parameter to launch the GATE framework in an interactive demonstration mode
(using GATE’s supplied “people and places” app).
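Because the build fails if either path is wrong, it can help to validate both before invoking the tool. The sketch below uses only the --javac and --gatedir options shown above; the helper names are hypothetical, and the paths are copied from the Windows example and will differ on your system.

```python
import os


def gate_build_command(javac, gatedir):
    """Build the argument list for crate_nlp_build_gate_java_interface."""
    return [
        "crate_nlp_build_gate_java_interface",
        "--javac", javac,
        "--gatedir", gatedir,
    ]


def missing_paths(javac, gatedir):
    """Return human-readable problems with the two paths (empty if none)."""
    problems = []
    if not os.path.isfile(javac):
        problems.append(f"Java compiler not found: {javac}")
    if not os.path.isdir(gatedir):
        problems.append(f"GATE directory not found: {gatedir}")
    return problems


if __name__ == "__main__":
    # Paths taken from the Windows example above; substitute your own.
    javac = r"C:\Program Files\Java\jdk1.8.0_91\bin\javac.exe"
    gatedir = r"C:\Program Files\GATE_Developer_8.1"
    for problem in missing_paths(javac, gatedir):
        print(problem)
    print("Command:", gate_build_command(javac, gatedir))
```

Once missing_paths returns an empty list, the command list can be passed to subprocess.call from within the activated virtual environment.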
The anonymiser and NLP manager are run on an ad-hoc or regularly scheduled basis, and do not need to be kept running continuously.
For the anonymiser, you will need a .INI-style configuration file (see anonymiser config file) that the CRATE_ANON_CONFIG environment variable points to when the anonymiser is run, and a .TSV-format data dictionary that the configuration file points to (see data dictionary).
For the NLP manager, you will need another .INI-style configuration file (see NLP config file) that the CRATE_NLP_CONFIG environment variable points to when the NLP manager is run.
For the web service, which you will want to run continuously, you will need a
Python (Django) configuration file (see web config file) that the CRATE_WEB_LOCAL_SETTINGS environment variable
points to when the web server processes are run. Use
crate_print_demo_crateweb_config to make a new one, and edit it for your
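A common source of trouble is an environment variable that is unset or points to a file that does not exist. The following minimal sketch checks the three variables named above; check_config_env is a hypothetical helper, and it verifies only that each file exists, not that its contents are valid.

```python
import os

CONFIG_VARS = (
    "CRATE_ANON_CONFIG",
    "CRATE_NLP_CONFIG",
    "CRATE_WEB_LOCAL_SETTINGS",
)


def check_config_env(env=None, names=CONFIG_VARS):
    """Return {variable: status}; status is 'unset', 'missing file' or 'ok'."""
    env = os.environ if env is None else env
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            report[name] = "unset"
        elif not os.path.isfile(value):
            report[name] = "missing file"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_config_env().items():
        print(f"{name}: {status}")
```

Run it from the same shell (with the virtual environment activated) in which you intend to run the CRATE tools, since that is the environment they will inherit.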
Create the database yourself using your normal database management tool. Make sure that the config file pointed to by the CRATE_WEB_LOCAL_SETTINGS environment variable is set up to point to the database. From the activated Python virtual environment, you want to build the admin database, collect static files, populate relevant parts of the database, and create a superuser:
crate_django_manage migrate
crate_django_manage collectstatic
crate_django_manage populate
crate_django_manage createsuperuser
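If you prefer to run these steps as one unit and stop at the first failure, a small wrapper is enough. This is a sketch under stated assumptions: run_setup is a hypothetical helper, it assumes crate_django_manage is on the PATH (that is, the virtual environment is activated), and it treats any non-zero exit code as failure. Some steps may prompt for input; those prompts pass through to your terminal.

```python
import shutil
import subprocess

# The four subcommands listed above, in the order given.
STEPS = ["migrate", "collectstatic", "populate", "createsuperuser"]


def run_setup(tool="crate_django_manage", steps=STEPS, runner=subprocess.call):
    """Run each step in order; return the list of steps that succeeded."""
    if shutil.which(tool) is None:
        print(f"{tool} not found on PATH; is the virtual environment activated?")
        return []
    done = []
    for step in steps:
        # runner returns the exit code of the command (0 means success).
        if runner([tool, step]) != 0:
            print(f"Step '{step}' failed; stopping.")
            break
        done.append(step)
    return done


if __name__ == "__main__":
    completed = run_setup()
    print("Completed steps:", completed)
```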
In two separate command windows, with the virtual environment activated in each, run the following two programs:
Browse to the web site. Choose ‘Test message queue by sending an e-mail to the RDBM’. If an e-mail arrives, that’s good. If you can’t see the web site, there’s a configuration problem. If you can see the web site but no e-mail arrives, check:
- that the e-mail server and the RDBM e-mail destination are correctly configured in the Django config file (as per the CRATE_WEB_LOCAL_SETTINGS environment variable);
- the Django log;
- the Celery log;
- from the RabbitMQ administrative command prompt, run rabbitmqctl list_queues name messages consumers; this shows each queue’s name along with the number of messages in the queue and the number of consumers. If the number of messages is stuck at >0, they’re not being consumed properly.
- run crate_launch_flower and browse to http://localhost:5555/ to explore the messaging system.
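The queue check can be automated by parsing the output of rabbitmqctl list_queues name messages consumers. The sketch below assumes whitespace-separated columns in that order and skips any banner or header lines; the SAMPLE text is invented for illustration and is not real output from a CRATE installation.

```python
def parse_list_queues(output):
    """Parse 'rabbitmqctl list_queues name messages consumers' output.

    Returns a list of (name, messages, consumers) tuples. Banner and header
    lines are skipped because their last two fields are not integers.
    """
    rows = []
    for line in output.splitlines():
        parts = line.rsplit(None, 2)  # split off the last two columns
        if len(parts) != 3:
            continue
        name, messages, consumers = parts
        try:
            rows.append((name, int(messages), int(consumers)))
        except ValueError:
            continue
    return rows


def queues_with_backlog(output):
    """Return the names of queues that currently hold at least one message."""
    return [name for name, messages, _ in parse_list_queues(output) if messages > 0]


# Illustrative text only; queue names and counts are made up.
SAMPLE = "Listing queues ...\ncelery\t3\t1\nother_queue\t0\t1\n"

if __name__ == "__main__":
    print(queues_with_backlog(SAMPLE))
```

A single snapshot cannot show that a count is stuck; run the command twice a little apart and compare. A queue that has a backlog both times and a consumer count of zero is the likeliest culprit.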
CRATE’s web service has two parts: the web site itself runs Django, and the offline message handling part (e.g. to send emails) runs Celery.
Try to avoid managing this by hand! That’s what the .deb file is there for.
Windows: service method
Using a privileged command prompt, activate the virtual environment and install the service:
C:\venvs\crate\Scripts\activate
crate_windows_service install
Set the following system (not user!) environment variables (if you can’t find
the Environment Variables part of Control Panel, use the command
- CRATE_ANON_CONFIG – to your main database’s CRATE anonymisation config file
- CRATE_CHERRYPY_ARGS – e.g. to --port 8999 --root_path / (for relevant options, see crate_django_manage runcpserver --help)
- CRATE_WEB_LOCAL_SETTINGS – to your Django site-specific Python configuration file.
- CRATE_WINSERVICE_LOGDIR – to a writable directory.
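Two of these settings are easy to get subtly wrong: the argument string and the log directory. The following minimal sketch (hypothetical helper names, not part of CRATE) extracts the port from a CRATE_CHERRYPY_ARGS-style string and checks that a directory is writable. It sees only the environment of the process that runs it, so open a new command prompt after changing system variables.

```python
import os
import shlex


def cherrypy_port(args_string):
    """Return the integer following --port in an argument string, or None."""
    args = shlex.split(args_string)
    if "--port" in args:
        index = args.index("--port")
        if index + 1 < len(args):
            return int(args[index + 1])
    return None


def logdir_writable(path):
    """Return True if path is an existing directory that we can write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


if __name__ == "__main__":
    args = os.environ.get("CRATE_CHERRYPY_ARGS", "")
    logdir = os.environ.get("CRATE_WINSERVICE_LOGDIR", "")
    print("Port from CRATE_CHERRYPY_ARGS:", cherrypy_port(args))
    print("Log directory writable:", bool(logdir) and logdir_writable(logdir))
```

Note that the service runs under its own account, so a directory writable by you is not guaranteed to be writable by the service.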
In older versions of Windows you had to reboot or the service manager wouldn’t see it, but Windows 10 seems to cope happily. You can start the CRATE service manually (with the virtual environment activated, via crate_windows_service start), or configure it to start automatically on boot with the Automatic or Automatic (Delayed Start) option. Any messages will appear in the Windows ‘Application’ event log.
Windows: task scheduler method
In principle you could also run the scripts via the Windows Task Scheduler, rather than as a service, e.g. with tasks like
cmd /c C:\venvs\crate\Scripts\crate_launch_cherrypy_server >>C:\crate_logs\djangolog.txt 2>&1
cmd /c C:\venvs\crate\Scripts\crate_launch_celery >>C:\crate_logs\celerylog.txt 2>&1
… but I’ve not bothered to test this, as the Service method works fine.
Going to a “behind-the-scenes” (service) mode of operation has the potential to go wrong, so retest that the web server and the e-mail transmission task work.
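Before repeating the browser and e-mail tests, it can save time to confirm that the background processes are listening at all. This is a minimal sketch: port_open is a hypothetical helper, 8999 is just the example port from CRATE_CHERRYPY_ARGS above, and 5672 is RabbitMQ's default AMQP port; adjust both to your configuration.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports are assumptions; see the note above the code.
    for label, port in [("CRATE web server", 8999), ("RabbitMQ", 5672)]:
        state = "listening" if port_open("localhost", port) else "NOT reachable"
        print(f"{label} (port {port}): {state}")
```

This shows only that something accepts connections on each port; the e-mail test remains the real end-to-end check.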
Visual Studio 2010; VC++ 10.0; MSC_VER=1600
Visual Studio 2015; VC++ 14.0; MSC_VER=1900
To map Visual C++/Studio versions to compiler numbers, see http://stackoverflow.com/questions/2676763. For more detail see http://stackoverflow.com/questions/2817869.