.. crate_anon/docs/source/misc/faq_troubleshooting.rst
.. Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
.
This file is part of CRATE.
.
CRATE is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
.
CRATE is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.
You should have received a copy of the GNU General Public License
along with CRATE. If not, see .
.. _AMQP: https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol
.. _Celery: http://www.celeryproject.org/
.. _RabbitMQ: https://www.rabbitmq.com/
FAQs and troubleshooting
------------------------
.. contents::
:local:
Known bugs elsewhere affecting CRATE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- wkhtmltopdf font size bug
- See notes next to PATIENT_FONTSIZE in config/settings.py
- https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2505
- If you try to use django-debug-toolbar when proxying via a Unix domain
socket, you need to use a custom INTERNAL_IPS setting; see the specimen
config file.
- SQL Server returns a rowcount of -1; this is normal.
See https://code.google.com/p/pyodbc/wiki/Cursor.
General
~~~~~~~
ImportError: No module named '{mysqldb, pyodbc, ...}'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You haven’t installed the right drivers for the database URL(s) that you have
specified in your CRATE configuration file. Make sure the virtualenv is
activated (see :ref:`Activating your virtual environment `), and
then install it (see :ref:`Database drivers `).
Which Python modules should be installed?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use ``pip freeze`` to show what you have installed. The ones you have to
install manually are marked with ``# MANUAL``; CRATE installs the others.
.. code-block:: none
django-pyodbc-azure==1.10.4.0 # MANUAL: for Django to talk to SQL Server
pyodbc==4.0.3 # MANUAL: for ODBC database connections
Pretty colours all gone (anonymiser, NLP, etc.)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That’s what happens when you pipe the tool through ``tee``.
CRATE anonymiser
~~~~~~~~~~~~~~~~
Anonymisation is slow
^^^^^^^^^^^^^^^^^^^^^
Make sure you have indexes created on all patient_id fields, because the tool
will use this to find (a) values for scrubbing, and (b) records for
anonymisation. Indexing here makes a huge difference!
CRATE uses lots of memory
^^^^^^^^^^^^^^^^^^^^^^^^^
A normal run should see CRATE using roughly 60–80 Mb per process. Values much
in excess of this likely relate to the text extraction process, which uses
third-party software over which CRATE has no control (I’ve seen >1 Gb)
[#debugginghighmemusage]_.
"File is not a zip file"
^^^^^^^^^^^^^^^^^^^^^^^^
In full: “Caught exception from document_to_text: File is not a zip file” when
extracting text from DOCX documents
This error usually appears with encrypted, password-protected DOCX files. The
anonymiser will not be able to read these, and this error can be ignored.
"UnRtf: ... has stopped working"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a third-party program is used by CRATE for text extraction and fails, you
may get lots of messages from Windows like “UnRtf: convert document in RTF
format to other formats has stopped working. Windows can check online for a
solution to the problem...”
To disable this in Windows Server 2008, run the Server Manager, and in the main
page scroll down to a section titled Resources and Support. There should be a
“Windows Error Reporting” section. It’s probably set to “Ask me about sending
reports every time an error occurs”; change this to “I don’t want to
participate, and don’t ask me again”.
That gets rid of the options to tell Microsoft, but it still pops up some
“close or debug?” dialog boxes. To fix that, add the following registry
entries [#disabledebugcloseapplication]_:
.. code-block:: registry
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting : “ForceQueue”=dword:00000000
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting : “DontShowUI”=dword:00000001
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting : “DontSendAdditionalData”=dword:00000001
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\Consent : “DefaultConsent”=dword:00000002
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can get this error when you try to use a config file in Windows, certainly
by using Notepad to create/edit a CRATE config file. Notepad can default to
"UTF-16 LE" encoding. Re-save the config file with UTF-8 encoding.
CRATE NLP
~~~~~~~~~
How can I update a source file in KConnect/Bio-YODIE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Example: replacing `plugins/Tagger_ConText/src/gate/context/ContextFeaturesTagger.java`.
- Make sure Apache Ant is installed [#installapacheant]_.
- Replace the source file.
- Delete the existing `.jar` file.
- Run ``ant build``.
- If it fails, check the `build.properties` file, which contains local
variables such as directories (e.g. `gate.home`); edit this and try again.
CRATE web site
~~~~~~~~~~~~~~
crate_launch_cherrypy_server can't find its config files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use forward slashes (``/``) rather than backslashes (``\``) for filename paths
in the environment variable `CRATE_CHERRYPY_ARGS` (e.g.
``C:/somepath/somefile.ext``) or escape the backslashes by doubling them (e.g.
``C:\\somepath\\somefile.ext``).
.. _faq_port_443_not_free:
Port 443 not free on '127.0.0.1'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: Under Windows Server: `OSError(“Port 443 not free on ‘127.0.0.1’”)`
If you launch the CRATE web server on port 443 (for HTTPS) and get this error,
something else is probably using port 443. If that’s unexpected to you, it’s
because Microsoft uses it by default [#win443unavailable]_. Try:
.. code-block:: bat
netstat -ban
to see what’s using it. In our case we had this:
.. code-block:: none
Proto Local Address Foreign Address State
...
[svchost.exe]
TCP 0.0.0.0:443 0.0.0.0 LISTENING
...
[svchost.exe]
TCP [::]:443 [::]:0 LISTENING
That is, svchost.exe is using port 443. One question now is: which actual
program is using this port via svchost.exe (which is a service host program
that does all sorts of things) [#svchost]_? A possibility relates to VMWare
[#vmware443]_.
If you have sufficient control over your machine to wrest port 443 away from
whatever’s using it, fine. Otherwise, you may need to use an alternative port.
A common choice might be 8443 [#port8443]_.
OSError: No socket could be created... permission denied
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The error ``OSError: No socket could be created -- (('127.0.0.1', 443): [Errno
13] Permission denied)`` may be seen on Linux and is similar to :ref:`port 443
not free `.
Under Linux, programs without root authority cannot, by default, open port
numbers below 1024. The best thing to do under Linux is to run a privileged
"main" web server like Apache (which becomes responsible for encryption and so
forth), and configure CRATE to operate behind Apache. See :ref:`Configuring for
Apache `. Alternative approaches are outlined at
https://superuser.com/questions/710253/allow-non-root-process-to-bind-to-port-80-and-443.
"Your connection is not private..." browser error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You will always get this error until you get a proper HTTPS certificate. This
error occurs when you self-sign a certificate. Browers will offer you a way
round, usually in small print [e.g. in Chrome: :menuselection:`Advanced -->
Proceed... (unsafe)`].
403 Forbidden: CSRF verification failed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are at least two possible reasons:
- Your browser must enable cookies, at least for ‘same-origin’ requests; this
is a critical part of CRATE’s security to prevent cross-site request
forgeries. If you’re using Firefox, try Chrome [#csrffirefox]_.
- By default, the CRATE web site uses only HTTPS (secure HTTP). This is
governed by the `CRATE_HTTPS` parameter in CRATE’s own
`crateweb/config/settings.py`. When `CRATE_HTTPS` is `True`, then CSRF
cookies are only permitted over HTTPS, so if you use plain HTTP, you will see
this error. A quick hack is to set ``CRATE_HTTPS = False`` in your local
settings, but this is a bad idea; set up HTTPS properly instead, as above.
"Unknown, invalid, or unsupported option... in a getsockopt or setsockopt call"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: the front end might produce the error: “An unknown, invalid, or
unsupported option or level was specified in a getsockopt or setsockopt call”
while initiating a back-end task.
This is due to a bug in the Python AMQP_ package version 2.1.4 [#amqp214bug]_.
(AMQP is a protocol used by Celery_ to talk to a message broker such as
RabbitMQ_; CRATE uses Celery to manage its back-end asynchronous tasks, like
sending e-mails.)
Solution: downgrade AMQP. From the activated CRATE virtual environment:
.. code-block:: bash
pip uninstall amqp
pip install -Iv amqp==2.1.3
This change has been hardcoded into CRATE’s setup scripts to prevent Celery
from picking the buggy version of the Python `amqp` package. As a consequence,
other requirements are also downgraded (`celery` to 4.0.1; `kombu` to 4.0.1).
Static files not served via CherryPy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: Static files are not served (e.g. broken icons rather than plus/minus
symbols; broken-looking admin site) when using the CherryPy server for CRATE.
Solution: You should specify either (1) ``FORCE_SCRIPT_NAME = "/crate"`` in
your config file, or (2) ``--root_path /crate`` in your `CRATE_CHERRYPY_ARGS`
environment variable. (The default value for the latter is taken from the
former.)
The cause of the problem is as follows: if you don’t do this, then
`https://mysite/` is meant to be your site, while
`https://mysite/crate_static/` is meant to be your static root. However, the
latter comes under the former, so Django says “it’s for me” then “it doesn’t
exist”. If you use `https://mysite/crate/` as your site root, with
`https://mysite/crate_static/` as your static root, then the software is happy.
I can't restart the CRATE Windows Service cleanly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There may be a problem whereby the CRATE web service doesn’t entirely shut down
when its service is stopped. You can manually kill leftover processes (which
will appear as `python.exe` or `python.exe *32`) using taskmgr.
This should be fixed now.
CRATE service doesn't start... errors in Windows Event Log
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your CRATE service doesn't start and you see this error in the Event
Log:
.. code-block:: none
Unable to read Server Queue performance data from the Server service.
The first four bytes (DWORD) of the Data section contains the status
code, the second four bytes contains the IOSB.Status and the next four
bytes contains the IOSB.Information.
Log Name: Application
Source: PerfNet
Event ID: 2006
then consider first if this might be a bug relating to HP ProLiant servers. See
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00041653en_us&docLocale=en_US
and a fix at
https://support.microsoft.com/en-in/help/4057142/windows-10-update-kb4057142.
However, that is for Windows Server 2016.
The CPFT server from Dec 2016 is an HP ProLiant DL360 Gen9 server with 2 ×
12-core/24-thread Intel Xeon E5-2687WV4 3 GHz CPUs (48 effective CPUs),
with 96 Gb RAM later upgraded to 672 Gb RAM (in 2018), and 11.5 Tb SSD
storage in a RAID configuration (8.9 Tb available); it was about £15k inc.
VAT initially plus £8k for the extra RAM. It runs Windows Server 2008 R2.
Other possible problems:
- https://support.microsoft.com/en-my/help/2607486/windows-server-2008-r2-reports-perfnet-error-in-application-log-on-mac
... but that relates to machines with >64 processors;
- https://support.microsoft.com/en-us/help/2279566/32-bit-application-cannot-query-performance-server-work-queues-counter
... that's more likely since it relates to machines with >32 processors,
and in turn this suggests that a 32-bit application is having trouble.
However, we have 64-bit Python installed.
- As it turned out, s per the :ref:`CRATE Windows service `
help, we try ``crate_windows_service``, and it reported that the
``servicemanager`` module was missing; that'd explain it! The virtual
environment had got messed up.
So the general rescue method:
- remove the old virtual environment
- recreate the virtual environment and reinstall, e.g.
.. code-block:: none
cd \srv\crate
"\Program Files\Python35\python.exe" -m venv crate_virtualenv
crate_virtualenv\Scripts\activate.bat
pip install crate_anon==0.18.51 pyodbc django-pyodbc-azure
- remove and reinstall the CRATE service, using an Administrator command
prompt:
.. code-block:: none
crate_windows_service remove
... reboot...
.. code-block:: none
crate_windows_service install
"No connection could be made because the target machine actively refused it"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: From the front end, you see: “No connection could be made because the
target machine actively refused it”.
Check that RabbitMQ is running. Check also CRATE’s Celery log. If RabbitMQ is
not running, you’ll see something like this in CRATE’s Celery log, giving a
little more detail:
.. code-block:: none
[2017-02-25 23:50:11,433: Error/MainProcess] consumer: Cannot connect to
amqp://guest:**@127.0.0.1:5672//: [WinError 10061] No connection could be
made because the target machine actively refused it.
This indicates that Celery (called by CRATE) is looking for RabbitMQ on port
5672, finding it, but being refused access. Make sure RabbitMQ is installed and
its service started. Run ``rabbitmqctl``, which on Windows machines is typically
typically at
.. code-block:: none
C:\Program Files\RabbitMQ Server\rabbitmq_server-3.6.6\sbin\rabbitmqctl.bat
Specifically, run:
.. code-block:: bash
rabbitmqctl status
If you see this unhappy output:
.. code-block:: none
Status of node 'rabbit@cpft-crate-p01' ...
Error: unable to connect to node 'rabbit@cpft-crate-p01': nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@cpft-crate-p01']
rabbit@cpft-crate-p01:
* connected to epmd (port 4369) on cpft-crate-p01
* epmd reports: node 'rabbit' not running at all
other nodes on cpft-crate-p01: ['RabbitMQ']
* suggestion: start the node
current node details:
- node name: 'rabbitmq-cli-11@cpft-crate-p01'
- home dir: P:\
- cookie hash: <...some hash...>
... then one possibility is that RabbitMQ was improperly installed. This can
happen if installed by a non-administrative user [#rabbitmqwinquirks]_, or if
your Windows variables `HOMEDRIVE` and `HOMESHARE` are pointing to a network
drive [#homedrivehomeshare]_. From an **administrative** command prompt, this
was one solution:
.. code-block:: bat
REM Remove the old installation:
net stop rabbitmq
“C:\Program Files\RabbitMQ Server\uninstall.exe”
REM Set environment variables for the new installation:
SET HOMEDRIVE=C:\
SET HOMESHARE=C:\Users
SET ERLANG_HOME=C:\Program Files\erl8.2
REM Now reinstall:
C:\some_download_dir\rabbitmq-server-3.6.6.exe
REM ... and watch the detailed output closely to make sure there are no errors
C:\Program Files\RabbitMQ Server\rabbitmq_server_3.6.6\sbin\rabbitmqctl.bat status
Here’s some happy output:
.. code-block:: none
Status of node 'rabbit@cpft-crate-p01' ...
[{pid,55372},
{running_applications,[{rabbit,"RabbitMQ","3.6.6"},
{rabbit_common,[],"3.6.6"},
{mnesia,"MNESIA CXC 138 12","4.14.2"},
{ranch,"Socket acceptor pool for TCP protocols.",
"1.2.1"},
{xmerl,"XML parser","1.3.12"},
{os_mon,"CPO CXC 138 46","2.4.1"},
{sasl,"SASL CXC 138 11","3.0.2"},
{stdlib,"ERTS CXC 138 10","3.2"},
{kernel,"ERTS CXC 138 10","5.1.1"}]},
{os,{win32,nt}},
{erlang_version,"Erlang/OTP 19 [erts-8.2] [64-bit] [smp:24:24] [async-threads:64]\n"},
{memory,[{total,63923600},
{connection_readers,0},
{connection_writers,0},
{connection_channels,0},
{connection_other,0},
{queue_procs,2736},
{queue_slave_procs,0},
{plugins,0},
{other_proc,23674272},
{mnesia,61784},
{mgmt_db,0},
{msg_index,42592},
{other_ets,1003792},
{binary,22848},
{code,17795673},
{atom,752561},
{other_system,20567342}]},
{alarms,[]},
{listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{amqp,5672,"0.0.0.0"}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,41174066790},
{disk_free_limit,50000000},
{disk_free,8951801614336},
{file_descriptors,[{total_limit,8092},
{total_used,2},
{sockets_limit,7280},
{sockets_used,0}]},
{processes,[{limit,1048576},{used,179}]},
{run_queue,0},
{uptime,28},
{kernel,{net_ticktime,60}}]
If you see something like that, all should be well.
MySQL
~~~~~
Can't connect to MySQL, even manually
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See https://dev.mysql.com/doc/refman/5.5/en/problems-connecting.html.
How do I reconfigure MySQL?
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Edit the MySQL configuration file.
- Under Ubuntu Linux this is usually `/etc/mysql/my.cnf`.
- Under Windows it can be in several locations [#mysqlcfglocation]_. If you’re
not sure where yours is, find `mysqld` (typically ``C:\Program
Files\MySQL\MySQL Server 5.7\bin\mysqld.exe``) and run ``mysqld --verbose
--help > helpoutput.txt`` and inspect the resulting file `helpoutput.txt`
(which is quite long). There’ll be a pair of lines like:
.. code-block:: none
Default options are read from the following files in the given order:
C:\Windows\my.ini C:\Windows\my.cnf C:\my.ini C:\my.cnf C:\Program
Files\MySQL\MySQL Server 5.7\my.ini C:\Program Files\MySQL\MySQL Server
5.7\my.cnf
- Note that on our main test system (Windows Server 2008 R2, MySQL 5.7) the
only file existing was ``C:\ProgramData\MySQL\MySQL Server 5.7\my.ini``, but
this was busy being ignored when mysqld was run manually (see below for why –
it can be specified directly as an argument to the Windows service, so a
manual call to ``mysqld`` doesn’t see the same file, which is confusing).
- For running MySQL systems, you can also view the configuration filename via
MySQL Workbench (under Server Status), but what you see here can be wrong.
For example, we’ve had it showing ``C:\Program Files\MySQL\MySQL Server
5.0\my.ini``, when there was no such directory (we were using MySQL 5.7), and
the actual file was ``C:\my.ini``.
Restart MySQL.
- Under Ubuntu Linux, use ``sudo service mysql restart``.
- Under Windows, restart the MySQL service in the Services tool (it’s typically
called e.g. `MySQL57`).
If MySQL fails to restart, run the ``mysqld`` program manually so you can see
why.
- If it sits there appearing to do nothing, it’s probably happy; check the log
files, on Windows usually called `HOSTNAME.err` in the database data
directory (where *HOSTNAME* is the name of your computer).
- You can run ``mysqld --console --standalone --log-error-verbosity 3`` to get
it to write to the console. These options also allow you to abort it with
CTRL-C.
- Under Ubuntu, precede that with ``sudo -u mysql bash`` to get a shell running
as the `mysql` user.
- You might also try ``mysqld --print-defaults`` to see its options.
- Under Ubuntu, try also ``journalctl -xe | grep -i mysql | less`` (sometimes
`apparmor` will block access to MySQL files, if you’ve moved them from their
default location, which can be very confusing; in this case, you’ll need to
edit `/etc/apparmor.d/usr.sbin.mysqld` or
`/etc/apparmor.d/local/usr.sbin.mysqld`).
If the Windows service is stuck in the ‘starting’ state, for example after
you’ve reconfigured MySQL:
- To kill a dead/stuck service: (1) Check the service short name by
double-clicking it in Services. Let’s support it’s `MySQL57`. (2) ``sc
queryex MySQL57`` to see its process ID or PID. (3) ``taskkill /f /pid
PIDNUM`` (where *PIDNUM* is the process ID from the previous step).
- Inspect the Properties of the malfunctioning service carefully. These include
a “path to executable” option, which can look like this: ``"C:\Program
Files\MySQL\MySQL Server 5.7\bin\mysqld.exe"
--defaults-file="C:\ProgramData\MySQL\MySQL Server 5.7\my.ini" MySQL57``. This
gives you the service name and also the hidden configuration path!
- To reinstall the service: ``mysqld --install`` [#mysqlinstallwinservice]_.
The default service name is ‘MySQL’, but you can override this. You’re
probably best being explicit, like this: ``mysqld --install MySQL57
--defaults-file="C:\my.ini"``
- After creating a service, start it manually; if it fails, check
:menuselection:`Event Viewer --> Windows Logs --> Application`.
- If MySQL fails to start and you see errors like `The innodb_system data file
‘ibdata1’ must be writable`, the first thing to check is that another copy of
`mysqld` is not already running.
- To delete a defunct service: ``sc delete servicename``. Exercise extreme
caution with this!
Your target is a happy MySQL installation that restarts automatically when you
reboot.
Table names are always lower case using MySQL under Windows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Under Windows, MySQL converts table names to lower case by default (but is
happy with mixed-case column names, and is happy with table and column names
being mixed case under Linux). This is because of the default setting for
``lower_case_table_names`` in `my.ini` or `my.cnf`. In turn, this is because
Windows can use a case-insensitive file system (and since tables can be stored
by the name, this would result in an almighty mess) [#mysqlidcasesens]_.
However, when using NTFS, Windows filenames are case-sensitive
[#ntfscasesens]_. Therefore, under Windows with NTFS, you have more options for
``lower_case_table_names``. Note, however, that it also affects the
case-sensitivity of table names using SQL (but not of column names). So you’re
probably better off always using ``lower_case_table_names = 1``, as per the
MySQL advice. This is the default behaviour under Windows.
"Got a packet bigger than 'max_allowed_packet' bytes" or "MySQL has gone away"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: `sqlalchem.exc.OperationalError:
(mysql.connector.errors.OperationalError)... Got a packet bigger than
‘max_allowed_packet’ bytes` *or* `MySQL has gone away` when sending large
packets.
Using large “chunks” is efficient but you have to configure MySQL to allow it.
The ``max_allowed_packet`` setting in the MySQL configuration file (see above)
governs this. Try changing the default, e.g. from
.. code-block:: none
max_allowed_packet=4M # too small!
to
.. code-block:: none
max_allowed_packet=40M
and restart MySQL (as above). You can also view current settings using MySQL
Workbench (:menuselection:`Management --> Status and System Variables -->
System Variables`; search for ``max_allowed_packet``).
If you can’t get this working, reduce the ``--chunksize`` parameter to the
CRATE anonymiser.
How do I hot-swap two MySQL databases?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since anonymisation is slow, you may want a live research database and another
that you can update offline. When you're ready to swap, you'll want to
- create DEFUNCT
- rename LIVE -> DEFUNCT
- rename OFFLINE -> LIVE
then either revert:
- rename LIVE -> OFFLINE
- rename DEFUNCT -> LIVE
or commit:
- drop DEFUNCT
How?
- https://stackoverflow.com/questions/67093/how-do-i-quickly-rename-a-mysql-database-change-schema-name
- https://gist.github.com/michaelmior/1173781
"MySQL server has gone away"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
One possibility is that you are processing a big binary field, and MySQL's
``max_allowed_packet`` parameter is too small. Try increasing it (e.g. from 16M
to 32M). See also
https://camcops.readthedocs.io/en/latest/administrator/server_troubleshooting.html?highlight=max_allowed_packet#mysql-server-has-gone-away
How to convert a database from SQL Server to MySQL?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This facility is provided by MySQL Workbench, which will connect to an SQL
Server instance. Use the "ODBC via connection string" option if other methods
aren't working: ``DSN=XXX;UID=YYY;PWD=ZZZ``.
If the schema definitions are not seen, it's a permissions issue
(https://stackoverflow.com/questions/17038716), in which case you can also copy
copy the database using CRATE's anonymiser, treating all tables as non-patient
tables (i.e. doing no actual anonymisation).
What settings do I need in /etc/mysql/my.cnf?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Probably these:
.. code-block:: ini
[mysqld]
max_allowed_packet = 32M
innodb_strict_mode = 1
innodb_file_per_table = 1
innodb_file_format = Barracuda
# Only for MySQL prior to 5.7.5 (http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-20.html):
innodb_log_file_size = 320M
# For more performance, less safety:
innodb_flush_log_at_trx_commit = 2
# To save memory?
# Default is 8; suggestion is ncores * 2
# innodb_thread_concurrency = ...
[mysqldump]
max_allowed_packet = 32M
"_mysql_exceptions.OperationalError: (1118, 'Row size too large (> 8126)"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In full, the error is:
.. code-block:: none
_mysql_exceptions.OperationalError: (1118, 'Row size too large (> 8126).
Changing some columns to TEXT or BLOB or using ROW_FORMAT=DYNAMIC or
ROW_FORMAT=COMPRESSED may help. In current row format, BLOB prefix of 768
bytes is stored inline.')
See above. If you need to change the log file size, FOLLOW THIS PROCEDURE:
https://dev.mysql.com/doc/refman/5.0/en/innodb-data-log-reconfiguration.html
"Segmentation fault (core dumped)..."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This error can be seen when using the Microsoft ODBC driver for Linux, which is
buggy. In this situation, use the Microsoft JDBC driver instead.
"Killed."
^^^^^^^^^
You may be out of memory, on a small computer. Try reducing MySQL's memory
footprint. (Steps have already been taken to reduce memory usage by the
anonymiser itself.)
Can't create FULLTEXT index(es)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MySQL v5.6 is required to use FULLTEXT indexes with InnoDB tables (as opposed
to MyISAM tables, which don't support transactions).
On Ubuntu 14.04, the default MySQL version is 5.5, so use:
.. code-block:: bash
sudo apt-get install mysql-server-5.6 mysql-server-core-5.6 \
mysql-client-5.6 mysql-client-core-5.6
How to search with FULLTEXT indexes?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In conventional SQL, you would use:
.. code-block:: none
... WHERE field LIKE '%word%'
In a field having a MySQL FULLTEXT index, you can use:
.. code-block:: none
... WHERE MATCH(field) AGAINST ('word')
There are several variants. See
https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
SQL Server
~~~~~~~~~~
.. _configure_odbc_mars:
[...] "Connection is busy with results for another command" [...]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**Or: configuring Windows ODBC for MARS.**
If you see this with Microsoft SQL Server via ODBC/pyodbc, you need to enable
Multiple Active Result Sets (MARS), because for some reason Microsoft think
it’s unusual to want more than one cursor open (more than one simultaneous
query) to a single database at once. There are several ways:
**Windows**
- (DOESN’T WORK.) Append ``;MultipleActiveResultSets=True`` to the connection
URL, e.g. ``mssql+pyodbc://@MYDSN;MultipleActiveResultSets=True``. However,
although this is documented [#enablingmars]_, it didn’t work via pyodbc
[#enablingmarsmethodfailed]_!
- (WORKS.)
- Fire up an Administrator command prompt (e.g. press the Start button, start
typing "Command Prompt", and when it appears, right-click it and choose
"Run as Administrator").
- Run the command: ``odbcconf /a {CONFIGSYSDSN "SQL Server Native Client
11.0" "DSN=MY_DSN|MARS_Connection=Yes"}`` (replacing the driver and DSN
names with your own).
- You can re-run the ODBC configuration wizard, and it should now say
`Multiple Active Result Sets(MARS): YES` where it said `... NO` before.
This does work. Use ``CONFIGDSN`` instead of ``CONFIGSYSDSN`` if you are
using a user DSN. Your changes should be visible if you restart the ODBC
control panel (e.g. with ``odbccp32.cpl``) and go through the configuration
wizard again; the MARS option (which you can’t edit) should have changed from
“No” to “Yes”.
- There’s also a registry hack [#marsregistry]_.
**Linux**
- Under Linux, in ``/etc/odbc.ini``, for that DSN, set
``MARS_Connection = yes``. See
- https://msdn.microsoft.com/en-us/library/cfa084cz(v=vs.110).aspx
- https://msdn.microsoft.com/en-us/library/h32h3abf(v=vs.110).aspx
- Rationale: We use gen_patient_ids() to iterate through patients, but then
we fetch data for that patient via the same connection to the source
database(s). Therefore, we're operating multiple result sets through one
connection.
"The data types nvarchar(max) and ntext are incompatible..."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: Using SQL Server, you get an error from the CRATE web front end saying
“The data types nvarchar(max) and ntext are incompatible in the equal to
operator.”
Solution: Upgrade pyodbc.
(This error occurs with pyodbc 3.1.1 but not with pyodbc 4.0.3, for example.)
The error relates to pyodbc passing text parameters to SQL Server as NTEXT
rather than NVARCHAR(MAX).
"A default full-text catalog does not exist in database..."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: Using SQL Server: “A default full-text catalog does not exist in database ‘XXX’ or user does not have permission to perform this action.”
Assuming the user does have permission, this means **you need to run this SQL
beforehand:**
.. code-block:: sql
USE mydatabase;
CREATE FULLTEXT CATALOG default_fulltext_catalog AS DEFAULT;
See https://technet.microsoft.com/en-us/library/dd283095(v=sql.100).aspx.
New tables are named like mydb.[SERVERNAME\\USERNAME].mytable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Problem: Under SQL Server, new tables (e.g. from NLP) look like
``mydatabase.[SERVERNAME\USERNAME].tablename`` rather than
``mydatabase.dbo.tablename``.
Under Microsoft SQL Server, the standard full notation for a table is
``database.schema.table``. The default schema is often ``dbo``, so tables look like
``mydatabase.dbo.mytable``. However [#sqlserverschemas]_:
The default schema for a user can be defined by using the
``DEFAULT_SCHEMA`` option of ``CREATE USER`` or ``ALTER USER``. If no
default schema is defined for a user account, SQL Server will assume
``dbo`` is the default schema. **It is important [to] note that if the user
is authenticated by SQL Server as a member of a group in the Windows
operating system, no default schema will be associated with the user. If
the user creates an object, a new schema will be created and named the same
as the user, and the object will be associated with that user schema.**
So, for example, if your username is ``RCardinal`` and you authenticate to SQL
Server via Windows authentication, and then create a table, it is likely to be
called something like ``[mydatabase].[myserver\RCardinal].[mytable]``. You can
try this:
.. code-block:: none
USE mydatabase;
SELECT name, type_desc, default_schema_name FROM sys.database_principals;
USE mydatabase;
ALTER USER [myserver\RCardinal] WITH DEFAULT_SCHEMA = dbo;
"Restricted data type attribute violation"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We've seen this whilst performing an initial scan (data dictionary
autogeneration) of a source database using the ``SQL Server Native Client
11.0`` ODBC driver with SQL Server 2012 (version series 11 per
https://sqlserverbuilds.blogspot.com/; unsure of specific version), via
``pyodbc==4.0.x``, default ODBC settings (except MARS enabled, as above), and a
read-only connection. The SQLAlchemy URL for this setup is
``mssql+pyodbc://@odbc_data_source_name``. The error went away by using the
``ODBC Driver 17 for SQL Server`` driver instead.
Windows
~~~~~~~
Control Panel looks blank in Windows 2008 Server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This isn’t a CRATE problem. However, it’s something we encountered as a problem
when using CRATE. If your Control Panel looks blank, (1) run ``gpedit.msc``
(Local Group Policy Editor), then :menuselection:`User Configuration -->
Administrative Templates --> Control Panel`. Check the settings there.
When that doesn’t work, I’m a bit stuck; try running ``.cpl`` items from the
command line instead.
MedEx-UIMA
~~~~~~~~~~
MedEx-UIMA gives Java errors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Yes. Frankly, it’s just fairly badly written, from a programmer’s perspective.
I’ve fixed a few of its bugs in a nasty patch-based way; see
``build_medex_itself.py``. MedEx takes the approach of allowing bugs to throw
exceptions, catching its own exceptions, printing the stack trace, and carrying
on regardless; this can be off-putting, but I’ve not bothered to fix all its
bugs. CRATE will abort if any exceptions make it out of MedEx, but you can just
ignore ones that you see. Here are a couple I fixed:
.. code-block:: none
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 2
at java.lang.String.charAt(Unknown Source)
at org.apache.NLPTools.Document.(Document.java:134)
at org.apache.medex.MedTagger.run_batch_medtag(MedTagger.java:256)
at CrateMedexPipeline.processInput(CrateMedexPipeline.java:302)
at CrateMedexPipeline.(CrateMedexPipeline.java:128)
at CrateMedexPipeline.main(CrateMedexPipeline.java:320)
.. code-block:: none
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(Unknown Source)
at org.apache.algorithms.SuffixArray.construct_tree_word(SuffixArray.java:375)
at org.apache.algorithms.SuffixArray.re_build(SuffixArray.java:97)
at org.apache.algorithms.SuffixArray.(SuffixArray.java:60)
at org.apache.medex.MedTagger.medtagging(MedTagger.java:359)
at org.apache.medex.MedTagger.run_batch_medtag(MedTagger.java:264)
at CrateMedexPipeline.processInput(CrateMedexPipeline.java:302)
at CrateMedexPipeline.(CrateMedexPipeline.java:128)
at CrateMedexPipeline.main(CrateMedexPipeline.java:320)
(the first being an example of using & when they meant && in a logic test).
Here are a couple of MedEx bugs I haven’t fixed, so you might see them:
.. code-block:: none
java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.Vector.elementData(Unknown Source)
at java.util.Vector.get(Unknown Source)
at org.apache.NLPTools.SentenceBoundary.detect_boundaries(SentenceBoundary.java:329)
at org.apache.medex.MedTagger.medtagging(MedTagger.java:354)
at org.apache.medex.MedTagger.run_batch_medtag(MedTagger.java:264)
at CrateMedexPipeline.processInput(CrateMedexPipeline.java:312)
at CrateMedexPipeline.runPipeline(CrateMedexPipeline.java:138)
at CrateMedexPipeline.(CrateMedexPipeline.java:112)
at CrateMedexPipeline.main(CrateMedexPipeline.java:330)
.. code-block:: none
java.lang.NullPointerException
at org.apache.algorithms.SuffixArray.search(SuffixArray.java:636)
at org.apache.medex.MedTagger.medtagging(MedTagger.java:362)
at org.apache.medex.MedTagger.run_batch_medtag(MedTagger.java:264)
at CrateMedexPipeline.processInput(CrateMedexPipeline.java:312)
at CrateMedexPipeline.runPipeline(CrateMedexPipeline.java:138)
at CrateMedexPipeline.(CrateMedexPipeline.java:112)
at CrateMedexPipeline.main(CrateMedexPipeline.java:330)
.. code-block:: none
java.lang.NullPointerException
CRATE reports an encoding error when talking to MedEx
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You have probably missed out the ``-Dfile.encoding=UTF-8`` setting for Java in
your config file; see the example.
===============================================================================
.. rubric:: Footnotes
.. [#debugginghighmemusage]
For tracking it down:
https://www.huyng.com/posts/python-performance-analysis;
http://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html
.. [#disabledebugcloseapplication]
https://stackoverflow.com/questions/396369/how-do-i-disable-the-debug-close-application-dialog-on-windows-vista;
https://msdn.microsoft.com/en-us/library/windows/desktop/bb513638(v=vs.85).aspx;
https://stackoverflow.com/questions/3561545/how-to-terminate-a-program-when-it-crashes-which-should-just-fail-a-unit-test/3637710#3637710
.. [#installapacheant]
Apache Ant uses `build.xml` files to build Java `.jar` files from Java
`.java` source files. From https://ant.apache.org, menuselection:`Download
--> Binary distributions`, fetch `apache-ant-1.10.1.zip` or similar, and
unzip it (e.g. to ``C:\Program Files``). Set the `JAVA_HOME` environment
variable to the Java JDK root directory. Set the `ANT_HOME` environment
variable to the Apache Ant root directory.
.. [#amqp214bug]
https://github.com/celery/py-amqp/issues/135;
https://stackoverflow.com/questions/41775353;
https://github.com/celery/py-amqp/issues/130
.. [#mysqlidcasesens]
http://dev.mysql.com/doc/refman/5.7/en/identifier-case-sensitivity.html
.. [#ntfscasesens]
https://support.microsoft.com/en-us/kb/100625
.. [#csrffirefox]
Or maybe see http://superuser.com/questions/461608
.. [#enablingmars]
https://msdn.microsoft.com/en-us/library/h32h3abf(v=vs.110).aspx
.. [#enablingmarsmethodfailed]
Connection string emitted by SQLAlchemy (found by placing a trace within
`sqlalchemy.engine.default.DefaultDialect.connect`):
``dsn=MY_DSN;MultipleActiveResultSets=True;Trusted_Connection=Yes``.
.. [#marsregistry]
http://serverfault.com/questions/302169
.. [#mysqlcfglocation]
http://dev.mysql.com/doc/refman/5.7/en/option-files.html
.. [#mysqlinstallwinservice]
http://dev.mysql.com/doc/refman/5.7/en/windows-start-service.html
.. [#win443unavailable]
https://helpdesk.stone-ware.com/portal/helpcenter/articles/port-443-80-not-available-on-windows-server
.. [#svchost]
https://en.wikipedia.org/wiki/Svchost.exe
.. [#vmware443]
http://superuser.com/questions/125455/why-is-the-system-process-listening-on-port-443
.. [#port8443]
http://www.speedguide.net/port.php?port=8443
.. [#sqlserverschemas]
https://technet.microsoft.com/en-us/library/dd283095(v=sql.100).aspx
.. [#rabbitmqwinquirks]
https://www.rabbitmq.com/windows-quirks.html
.. [#homedrivehomeshare]
https://github.com/rabbitmq/rabbitmq-server/issues/625