15. Change log/history
15.1. Contributors
Rudolf Cardinal <rudolf@pobox.com>, 2015–.
Francesca Spivack, 2018–2020.
Martin Burchell, 2020–.
15.2. Changes
15.2.1. 2015
2015-02-18
Started.
v0.03, 2015-03-19
Bug fix for incremental update (previous version inserted rather than updating when the source content had changed); search for update_on_duplicate_key.
Checks for missing/extra fields in destination.
“No separator” allowed for crate_anon.anonymise.anonregex.get_date_regex_elements(), allowing anonymisation of e.g. 19Mar2015, 19800101.
New default at_word_boundaries_only=False for crate_anon.anonymise.anonregex.get_date_regex_elements(), allowing anonymisation of ISO8601-format dates (e.g. 1980-10-01T0000), etc.
Similar option for crate_anon.anonymise.anonregex.get_code_regex_elements().
Similar option for crate_anon.anonymise.anonregex.get_string_regex_elements().
Options in config to control these.
Fuzzy matching for crate_anon.anonymise.anonregex.get_string_regex_elements(); string_max_regex_errors option in config. The downside is the potential for greedy matching; for example, if you anonymise “Ronald MacDonald” with “Ronald” and “MacDonald”, you can end up with “XXX MacXXX”, as the regex greedy-matches “Donald” to “Ronald” with a typo, and therefore fails to process the whole “MacDonald”. On the other hand, this protects against simple typos, which are probably more common.
Audit database/table.
Create an incremental update to the data dictionary (i.e. existing DD plus any new fields in the source, with safe draft entries).
Data dictionary optimizations.
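The “no separator” and word-boundary behaviour described for v0.03 can be illustrated with a small sketch. This is not CRATE's get_date_regex_elements(); the function date_pattern() and its parameters are hypothetical and cover only day-month-year order, under the assumption that separators are whitespace, dots, slashes or hyphens.

```python
import re


def date_pattern(day: int, month_names: list, year: int,
                 at_word_boundaries_only: bool = False) -> re.Pattern:
    """Build a regex for one specific date (hypothetical helper)."""
    day_re = f"0?{day}"
    month_re = "(?:" + "|".join(re.escape(m) for m in month_names) + ")"
    sep = r"[\s./-]*"  # zero or more separators, so "no separator" is allowed
    core = f"{day_re}{sep}{month_re}{sep}{year}"
    if at_word_boundaries_only:
        core = rf"\b{core}\b"  # refuse matches embedded in longer tokens
    return re.compile(core, re.IGNORECASE)


months = ["Mar", "March", "03", "3"]
loose = date_pattern(19, months, 2015)
strict = date_pattern(19, months, 2015, at_word_boundaries_only=True)
```

With at_word_boundaries_only=False, the pattern also finds a date embedded in a longer token such as "ref19Mar2015x", which the stricter variant skips.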
v0.04, 2015-04-25
Whole bunch of stuff to cope with a limited computer talking to SQL Server with some idiosyncrasies.
v0.05, 2015-05-01
Ability to vary audit/secret map tablenames.
Made date element separators broader in anonymisation regex.
min_string_length_for_errors option
min_string_length_to_scrub_with option
words_not_to_scrub option
bugfix: date regex couldn’t cope with years prior to 1900
crate_anon.anonymise.patient.gen_all_values_for_patient() was inefficient in that it would process the same source table multiple times to retrieve different fields.
ddgen_index_fields option
simplification of crate_anon.anonymise.anonregex.get_anon_fragments_from_string()
SCRUBMETHOD.CODE, particularly for postcodes. (Not very different from SCRUBMETHOD.NUMERIC, but a little different.)
debug_row_limit applies to patient-based tables (as a per-thread limit); was previously implemented as a per-patient limit, which was silly.
Indirection step in config for destination/admin databases.
ddgen_allow_fulltext_indexing option, for old MySQL versions.
v0.06, 2015-06-25
Option: replace_nonspecific_info_with
Option: scrub_all_numbers_of_n_digits
Option: scrub_all_uk_postcodes
v0.06, 2015-07-14
bugfix: if a source scrub-from value was a number with value '.', the regex went haywire… so regex builders now check for blanks.
v0.07, 2015-07-16
regex.ENHANCEMATCH flag tried unsuccessfully (segmentation fault, i.e. internal error in regex module, likely because generated regular expressions got too complicated for it).
v0.08, 2015-07-20
SCRUBMETHOD.WORDS renamed SCRUBMETHOD.WORDS [? typo in changelog!]
SCRUBMETHOD.PHRASE added
ddgen_scrubmethod_phrase_fields added
v0.09, 2015-07-28
debug_max_n_patients option, used with crate_anon.anonymise.anonymise.gen_patient_ids(), to reduce the number of patients processed for “full rebuild” debugging.
debug_pid_list option, similarly
v0.10, 2015-09-02 to 2015-09-13
Opt-out mechanism.
Default hasher changed to SHA256.
Bugfix to datatypes in crate_anon.anonymise.delete_dest_rows_with_no_src_row().
v0.11, 2015-09-16
Split main source code for simplicity.
v0.12, 2015-09-21
Database interface renamed from mysqldb to mysql, to allow for PyMySQL support as well (backend details otherwise irrelevant to front-end application).
v0.13, 2015-10-06
Added TRID.
15.2.2. 2016
2016-02-09
SQL helpers for free-text search.
Massive SQL speedup for fetching/caching table info from database.
NOTE that highlighting will not always work for unusual characters, e.g. apostrophes ('); see research.html_functions.make_result_element. This is because we apply django.utils.html.escape before we apply the highlighting, and django.utils.html.escape transforms “'” into “&#39;”. But we can’t highlight and then escape, because we need the HTML in the highlighting. Not critical.
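The ordering problem in that note can be reproduced with a minimal sketch. It uses the standard library's html.escape as a stand-in for django.utils.html.escape (both turn an apostrophe into an HTML entity, though the exact entity differs); highlight_after_escape() is a hypothetical function, not CRATE's make_result_element.

```python
import html
import re


def highlight_after_escape(text: str, term: str) -> str:
    """Escape first, then wrap occurrences of the raw search term in <b> tags."""
    escaped = html.escape(text)  # the apostrophe becomes an entity here
    return re.sub(
        re.escape(term),
        lambda m: f"<b>{m.group(0)}</b>",
        escaped,
    )


plain = highlight_after_escape("Alzheimer's disease", "disease")
missed = highlight_after_escape("Alzheimer's disease", "Alzheimer's")
```

In the second call the raw term no longer appears in the escaped text, so nothing is highlighted; the first call is unaffected because the term contains no escaped characters.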
v0.14.0, 2016-03-10
Code cleanup.
HMAC for RID generation, replacing simpler hashes, for improved security. Default becomes HMAC_MD5.
New option: secret_trid_cache_tablename
Removed option: words_not_to_scrub
New options: whitelist_filenames (replaces words_not_to_scrub), blacklist_filenames.
Transition from cardinal_pythonlib.rnc_db to SQLAlchemy for anonymiser database interface.
Environment variable changed from CRATE_LOCAL_SETTINGS to CRATE_WEB_LOCAL_SETTINGS and coded into crate_anon.crateweb.config.constants.
Web front end now happy getting structure from SQL Server and PostgreSQL.
Windows support. Windows XP not supported as Erlang (and thus RabbitMQ) won’t run on it from the distributed binaries. Windows 10 works fine.
Semantic versioning.
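As a rough illustration of the HMAC change in v0.14.0: a keyed hash differs from a plain hash in that nobody without the secret key can recompute a research ID (RID) from a known patient ID. The sketch below uses only Python's hmac and hashlib modules; make_rid() and the key value are hypothetical and do not reproduce CRATE's hasher classes.

```python
import hashlib
import hmac


def make_rid(pid: str, secret_key: str, digestmod=hashlib.md5) -> str:
    """Return a hex digest of the PID, keyed with a secret (HMAC-MD5 by default)."""
    return hmac.new(
        secret_key.encode("utf-8"),
        pid.encode("utf-8"),
        digestmod,
    ).hexdigest()


rid = make_rid("12345", "hypothetical-secret-key")
rid_other_key = make_rid("12345", "another-key")
unkeyed = hashlib.md5(b"12345").hexdigest()  # what a "simpler hash" would give
```

The same PID yields unrelated RIDs under different keys, and neither matches the unkeyed digest that an attacker could compute independently.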
v0.16.0, 2016-06-04
Fixes to work properly with SQL Server, including proper automatic conversion of VARCHAR(MAX) and NVARCHAR(MAX) to MySQL TEXT fields. Note: also needs SQLAlchemy 1.1 or higher [1], currently available only via (1) fetching source via git clone https://github.com/zzzeek/sqlalchemy and changing into the ‘sqlalchemy’ directory this will create; (2) activating your CRATE virtual environment; (3) pip install . to install SQLAlchemy from your source copy. Further note: as of v0.18.2, this is done via PyPI again.
Opt-out management (1) manually; (2) via disk file; (3) via database fields.
v0.17.0, 2016-06-25
ONS Postcode Database.
RiO preprocessor.
Third-party patient cross-referencing for anonymisation.
The ‘required scrubber’ flag, as a safety measure.
Recordwise view of results in web interface.
Static type checking.
v0.18.0, 2016-09-29
Regular expression NLP tools for simple numerical results (CRP, ESR, WBC and differential, Na, MMSE).
v0.18.1, 2016-11-04
New anonymise_numbers_at_numeric_boundaries_only option, to prevent e.g. ‘23’ being scrubbed from ‘1234’ unless you really want to.
More built-in NLP tools by now (height, weight, BMI, BP, TSH). MedEx support.
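The effect of a numeric-boundary option like the one in v0.18.1 can be sketched with lookaround assertions. The helper names below are hypothetical and the pattern is only one plausible way to express “not adjacent to another digit”; CRATE's own regex builder may differ.

```python
import re


def number_scrubber(number: str,
                    at_numeric_boundaries_only: bool = True) -> re.Pattern:
    """Build a pattern that finds a number to be scrubbed (illustrative only)."""
    core = re.escape(number)
    if at_numeric_boundaries_only:
        # Do not match if a digit sits immediately before or after.
        core = rf"(?<!\d){core}(?!\d)"
    return re.compile(core)


def scrub(text: str, pattern: re.Pattern, replacement: str = "XXX") -> str:
    return pattern.sub(replacement, text)


text = "ID 1234, age 23"
careful = scrub(text, number_scrubber("23"))
aggressive = scrub(text, number_scrubber("23", at_numeric_boundaries_only=False))
```

Here careful is "ID 1234, age XXX", whereas aggressive is "ID 1XXX4, age XXX".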
v0.18.2 to v0.18.8, 2016-11-11 to 2016-11-13
Too many version numbers here because git connection unavailable for remote development.
Requirement upgraded to SQLAlchemy 1.1.3, now SQLAlchemy 1.1 and higher are available from PyPI.
Support for non-integer PKs for NLP, to allow us to operate with tables we have only read-only access to. This is a bit tricky. To parallelize, it helps to be able to convert a non-integer to an integer for use with the modulo operator, %. In addition, we store PK values to speed up incremental updates. It becomes messy if we have to cope with lots and lots of types of PKs. Also, Python’s hash() function is inconsistent across invocations [2]. This is not a cryptographic application, so we can use anything simple and fast [3]. It looks like MurmurHash3 is suitable (hash DDoS attacks are not relevant here) [4]. However, the problem then is with collisions [5]. We want to ask “has this PK been processed before?” Realistically, the only types of PKs are integers and strings; it would be crazy to use floating-point numbers or BLOBs or something. So let’s put a cap at VARCHAR(n), where n comes from MAX_STRING_PK_LENGTH; store a 64-bit integer hash for speed, and then use the hash to say quickly “no, not processed” and check the original PK if processed. If the PK field is integer, we can just use the integer field for the PK itself. Note that the delete_where_no_source function may be imperfect now under hash collisions (and it may be imperfect in other ways too).
This system not implemented for anonymisation; it just gets too confusing (PIDs, MPIDs, uniqueness of PID for TRID generation, etc.).
However, mmh3 requires a Visual C++ 10.0 compiler for Windows. An alternative would be to require pymmh3 but load mmh3 if available, but pymmh3 isn’t on PyPI. Another is xxHash [6], but that also requires VC++ under Windows; pyhashxx installs but the interface isn’t fantastic. Others include FNV and siphash [7]. The xxHash page compares quality and speed and xxHash beats FNV for both (and MurmurHash for speed); siphash not listed. Installation of siphash is fine. Other comparisons at [8]. Let’s use xxhash (needs VC++) and pyhashxx as a fallback… only pyhashxx only supports 32-bit hashing. The pyhash module doesn’t install under Windows Server 2003, and nor does xxh, while lz4tools needs VC++. OK. Upshot: use mmh3 but fall back to some baked in Python implementations (from StackOverflow and pymmh3, with some bugfixes) if mmh3 not available.
NLP delete_where_no_source then failed as expected with large databases, so reworked to be OK regardless of size (using temporary tables).
Python 3.5 can handle circular imports (for type hints) that Python 3.4 can’t, so some delayed and version-conditional imports to sort that out in the NLP code.
Provide source/destination record counts from NLP manager, and better progress indicator for anonymiser.
Optional NLP record limit for debugging.
Speed increases by not requesting unnecessary ORDER BY conditions.
Commit-every options for NLP (every n bytes and/or every n rows).
Regex NLP for ACE, mini-ACE, MOCA.
Timing framework for NLP (for when it’s dreadfully slow and you think the problem might be the source database).
Significant NLP performance enhancement by altering progress DB lookup methods.
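The non-integer PK scheme described for v0.18.2 to v0.18.8 (a stable 64-bit hash for quick lookups and for splitting work by modulo, plus a check against the original PK to resolve collisions) can be sketched as follows. This is an illustration only: it uses hashlib.blake2b from the standard library as a stand-in for MurmurHash3 (mmh3), keeps progress in memory rather than in a database, and the names hash64, ProgressStore and belongs_to_process are hypothetical.

```python
import hashlib


def hash64(pk: str) -> int:
    """Stable signed 64-bit hash of a string PK (unlike Python's hash())."""
    digest = hashlib.blake2b(pk.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


class ProgressStore:
    """In-memory stand-in for a progress table indexed by PK hash."""

    def __init__(self) -> None:
        self._by_hash = {}  # hash -> set of original PK strings

    def record(self, pk: str) -> None:
        self._by_hash.setdefault(hash64(pk), set()).add(pk)

    def already_processed(self, pk: str) -> bool:
        candidates = self._by_hash.get(hash64(pk))
        if candidates is None:
            return False  # fast path: hash never seen, so not processed
        return pk in candidates  # hash seen: confirm against the original PK


def belongs_to_process(pk: str, process_index: int, n_processes: int) -> bool:
    """Divide work between parallel processes via the modulo operator."""
    return hash64(pk) % n_processes == process_index


store = ProgressStore()
store.record("DOC-0001")
```

Each PK is assigned to exactly one process, and a hash collision can never cause an unprocessed record to be reported as processed, because the original PK is compared as well.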
v0.18.9, 2016-12-02
Regex NLP: option in crate_anon.nlp_manager.regex_parser.SimpleNumericalResultParser to take absolute values, e.g. to deal with text like Na-142, K-4.1, CRP-97, which use - simply as punctuation, rather than as a minus sign. Failing to account for these would distort results.
No attempt is made to specify maximum or minimum values, which can easily be excluded as required from the resulting data set. One could of course use the SQL ABS() function to deal with negative values post hoc, but some things have no physical meaning when negative, such as a white cell count or CRP value, so it’s preferable to fix these at source to reduce the chance of user error through not noticing negative values.
The take_absolute option is applied to: CRP, sodium, TSH, BMI, MMSE, ACE, mini-ACE, MOCA, ESR, and white cell/differential counts. (NLP processors for height, BP already enforced positive values. Weight must be able to handle negatives, like “weight change –0.4kg”.)
Similarly, hyphen followed by whitespace treated as ignorable in regex NLP (e.g. in weight - 48 kg); though spaces are meaningful for mathematical operations (“a – b = c”), it is syntactically wrong to use - 4 as a unary minus sign to indicate a negative number (–4) and much more likely that this context means a dash.
En and em dashes, and a double-hyphen as a dash (--) treated as ignorables in regex NLP.
At present, Unicode minus signs (−) are not handled. For reference [9]:

name            character   code                       handling
hyphen-minus    -           Unicode 002D or ASCII 45   minus sign if context correct
formal hyphen   ‐           Unicode 2010               not handled at present
minus sign      −           Unicode 2212               not handled at present
en dash         –           Unicode 2013               treated as ignorable [10]
em dash         —           Unicode 2014               treated as ignorable

Improved regex self-testing, including new test framework in crate_anon.nlp_manager.test_all_regex.
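The dash-handling rules for v0.18.9 can be sketched in a few lines. This is a simplified illustration, not the SimpleNumericalResultParser implementation: extract_value() is a hypothetical function, and the pattern covers only the cases listed above (hyphen plus whitespace, double hyphen, en dash and em dash as ignorable; a hyphen directly before a digit as a possible minus sign).

```python
import re
from typing import Optional


def extract_value(text: str, name: str,
                  take_absolute: bool = False) -> Optional[float]:
    """Find the number following a test name, applying the dash rules."""
    pattern = (
        rf"\b{re.escape(name)}\s*"
        r"(?:--|[\u2013\u2014]|-\s+)?"  # ignorable: --, en/em dash, hyphen+space
        r"\s*(-?\d+(?:\.\d+)?)"         # optional minus directly before digits
    )
    m = re.search(pattern, text, re.IGNORECASE)
    if m is None:
        return None
    value = float(m.group(1))
    return abs(value) if take_absolute else value


bloods = "Na-142, K-4.1, CRP-97"
sodium = extract_value(bloods, "Na", take_absolute=True)
sodium_raw = extract_value(bloods, "Na")
weight = extract_value("weight - 48 kg", "weight")
change = extract_value("weight change -0.4kg", "weight change")
```

Without take_absolute, the hyphen in Na-142 is read as a minus sign; with it, the value comes out positive. The weight-change example keeps its sign because no absolute value is taken.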
v0.18.10, 2016-12-11
Full support for SQL Server as the backend.
Hot-swapping databases (compare MySQL [11]): you can rename databases, so this seems OK [12].
Full-text indexing: optional in SQL Server 2008, 2012, 2014 and 2016 [13]; basic SELECT syntax is WHERE CONTAINS(fieldname, "word"), and index creation with CREATE FULLTEXT INDEX ON table_name (column_name) KEY INDEX index_name .... Added to crate_anon.common.sqla.
Support for SQL query building, with user-configurable selector mechanism. See Transact-SQL syntax reference [14]. We use the Django setting settings.RESEARCH_DB_DIALECT to govern this.
v0.18.11, 2016-12-19
Tweaks/bugfixes for RiO preprocessor, and for anonymisation to SQL Server databases.
Local help HTML offered via web front end.
15.2.3. 2017
v0.18.12, 2017-02-26
More fixes for SQL Server, including full-text indexing.
Completed changes to CPFT consent materials to reflect ethics revision (Major Amendment 2, 12/EE/0407).
v0.18.13, 2017-03-04
Final update/PyPI push for CPFT consent materials.
v0.18.14, 2017-03-05
Extra debug options for consent-to-contact templates.
Multi-column FULLTEXT indexes under SQL Server.
v0.18.15-v0.18.16, 2017-03-06 to 2017-03-13
Full-text finder generates CONTAINS(column, 'word') properly for SQL Server.
Bugfix to Patient Explorer (wasn’t offering WHERE options always).
“Table browser” views in Patient Explorer
Bugfix to Windows service. Problem: a Python process was occasionally being “left over” by the Windows service, i.e. not being killed properly. Process Explorer indicated it was the one launched as python launch_cherrypy_server.py. The Windows event log has a message reading “Process 1/2 (Django/CherryPy) (PID=62516): Subprocess finished cleanly (return code 0).” The problem was probably that in crate_anon.crateweb.core.management.commands.runcpserver, the cherrypy.engine.stop() call was only made upon a KeyboardInterrupt exception, and not on other exceptions. Solution: broadened to all exceptions.
v0.18.17, 2017-03-17
Removed erroneous debugging code from crate_anon.nlp_manager.parse_medex.Medex.parse().
If you mis-configured the Java interface to a GATE application, it crashed quickly, which was helpful. If you mis-configured the Java interface to MedEx, it tried repeatedly. Now it crashes quickly.
v0.18.18 to v0.18.23, 2017-04-28
Paper published on 2017-04-26 as Cardinal (2017), BMC Medical Informatics and Decision Making 17:50; http://www.pubmed.gov/28441940; https://doi.org/10.1186/s12911-017-0437-1.
Support for configurable paths for finding on-disk documents (e.g. from a combination of a fixed root directory, a patient ID, and a filename).
v0.18.23 to v0.18.33, 2017-05-02
NLP value_text field (FN_VALUE_TEXT in code) given maximum length, rather than 50, for the regex parsers, as it was overflowing (e.g. when a lot of whitespace was present). See crate_anon.nlp_manager.regex_parser.NumericalResultParser.dest_tables_columns().
Supports more simple text file types (.csv, .msg, .htm).
New option: ddgen_rename_tables_remove_suffixes.
Bugfix to CRATE GATE handler’s stdout-suppression switch.
New option: ddgen_extra_hash_fields.
PCMIS preprocessor.
Support non-integer PIDs and MPIDs. Note that the hashing is based on a string representation, so if you have one database using an integer NHS number, and another using a string NHS number, the same number will hash to the same result if you use the same key.
Hashing of additional fields, initially to support the PCMIS CaseNumber (as well as PatientId).
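The point about string-based hashing of integer and string identifiers can be checked with a short sketch. It assumes an HMAC-SHA256 hasher and a hypothetical function hash_identifier(); the only property being illustrated is that the value is converted to its string form before hashing.

```python
import hashlib
import hmac


def hash_identifier(value, key: str) -> str:
    """Hash the string representation of an identifier with a secret key."""
    return hmac.new(
        key.encode("utf-8"),
        str(value).encode("utf-8"),  # int 123 and str "123" become the same bytes
        hashlib.sha256,
    ).hexdigest()


from_int = hash_identifier(1234567890, "hypothetical-key")
from_str = hash_identifier("1234567890", "hypothetical-key")
padded = hash_identifier("01234567890", "hypothetical-key")
```

The integer and string forms agree, but a differently formatted string (for example with a leading zero) does not, so consistent formatting across source databases still matters.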
v0.18.34 to v0.18.39, 2017-06-05
For SLAM BRC GATE pharmacotherapy app: add support for output columns whose SQL column name is different to the GATE tag (e.g. when dose-value must be changed to dose_value); see renames option. GATE output fields now preserve case. Another option (null_literals) to allow GATE output of null to be changed to an SQL NULL. Also added _set column to GATE output.
v0.18.40, 2017-07-20
Fixed Python type-checking bug in crate_anon.common.extendedconfigparser.ExtendedConfigParser.get_pyvalue_list(); changed from Generic to Any.
v0.18.41, 2017-07-21
Support for MySQL ENUM types. However, see http://komlenic.com/244/8-reasons-why-mysqls-enum-data-type-is-evil/ also!
To v0.18.46, 2017-07-28 to 2017-08-05
Fix to coerce_to_date (for date types), renamed to coerce_to_datetime.
NLP bug fixed relating to a missing pytz import.
Fixes to NLP, including accepting views (not just tables) as input. Note that under SQL Server, you should not have to specify ‘dbo’ anywhere in the config (but consider setting ALTER USER... WITH DEFAULT SCHEMA as above).
Manual and 2017 paper distributed with package.
Shift some core stuff to cardinal_pythonlib to reduce code duplication with other projects.
v0.18.48, 2017-11-06
Clinician view: find text across a database, for an identified patient. See crate_anon.crateweb.research.views.all_text_from_pid.
Rationale: Should privileged clinical queries be in any way integrated with CRATE? Advantages would include allowing the receiving user to run the query themselves without RDBM intervention and RDBM-to-recipient data transfer considerations, while ensuring the receiving user doesn’t have unrestricted access (e.g. via SQL Server Management Studio). Plus there may be a UI advantage.
Clinician view: look up (M)RIDs from (M)PIDs. Intended purpose for this and the preceding function: “My clinical front end won’t tell me if my patient’s ever had mirtazapine. I want to ask the research database.” (As per CO’L request 2017-05-04.) See crate_anon.crateweb.research.views.ridlookup.
Code to generate and test demonstration databases improved.
15.2.4. 2018
v0.18.49, 2018-01-07, 2018-03-21, 2018-03-27, published 2018-04-20
Use flashtext (rather than regex) for denylisting words; this is much faster and allows large denylists (e.g. a long list of all known forenames/surnames).
Provides the crate_fetch_wordlists tool to fetch names and English words (and perform in-A-not-B functions, e.g. to generate a list of names that are not English words).
Extend CRATE’s GATE pipeline to include or exclude GATE sets, since some applications produce results just in one set, and some produce them twice (e.g. in the unnamed set, named "", and in a specific named set).
Medical eponym list.
v0.18.50 to v0.18.51, 2018-05-04 to 2018-06-29
IllegalCharacterError possible from crate_anon.crateweb.research.models.make_excel(); was raised by openpyxl. The problem may be that the Excel file format itself prohibits some Unicode characters; certainly openpyxl does [15]. See gen_excel_row_elements() for bugfix. Not all queries require this, but anything that allows unrestricted textual/binary content does.
Change to CPFT-specific SQL in crate_anon.crateweb.consent.lookup_rio.get_latest_consent_mode_from_rio_generic().
Bugfix to crate_anon.crateweb.extra.pdf.CratePdfPlan; this failed to specify wkhtmltopdf_filename, so if wkhtmltopdf wasn’t found on the PATH (e.g. via a Celery task), PDFs were not generated properly.
Addition of processed_at to crate_anon.crateweb.consent.models.ContactRequest.
Addition of processed and processed_at to crate_anon.crateweb.consent.models.ClinicianResponse.
Addition of skip_letter_to_patient, needs_processing, processed and processed_at to crate_anon.crateweb.consent.models.ClinicianResponse.
Package version changes:
amqp from 2.1.3 to 2.3.2; https://github.com/celery/py-amqp/blob/master/Changelog
arrow from 0.10.0 to 0.12.1; https://pypi.org/project/arrow/
beautifulsoup4 from 4.5.3 to 4.6.0; https://github.com/newvem/beautifulsoup/blob/master/CHANGELOG
cardinal_pythonlib from 1.0.15 to 1.0.16
celery from 4.0.1 to 4.2.0 (no longer constrained by amqp); http://docs.celeryproject.org/en/latest/history/
chardet from 3.0.2 to 3.0.4
cherrypy from 10.0.0 to 16.0.2; https://docs.cherrypy.org/en/latest/history.html
colorlog from 2.10.0 to 3.1.4
distro from 1.0.2 to 1.3.0
django from 1.10.5 to 2.0.6; https://docs.djangoproject.com/en/2.0/releases/2.0/
django-debug-toolbar from 1.6 to 1.9.1
django-extensions from 1.7.6 to 2.0.7
django-picklefield from 0.3.2 to 1.0.0
django-sslserver from 0.19 to 0.20
flashtext from 2.5 to 2.7
flower from 0.9.1 to 0.9.2
gunicorn from 19.6.0 to 19.8.1
kombu from 4.0.1 to 4.1.0 (no longer constrained by amqp, but kombu 4.2.1 is broken: https://github.com/celery/kombu/issues/870)
openpyxl from 2.4.2 to 2.5.4
pendulum from 1.3.0 to 2.0.2; see https://pendulum.eustace.io/history/
psutil from 5.0.1 to 5.4.6
pyparsing from 2.1.10 to 2.2.0
python-dateutil from 2.6.0 to 2.7.3
regex from 2017.1.17 to 2018.6.21
semver from 2.7.5 to 2.8.0
sortedcontainers from 1.5.7 to 2.0.4
SQLAlchemy from 1.1.5 to 1.2.8
sqlparse from 0.2.2 to 0.2.4
typing from 3.5.3.0 to 3.6.4
Werkzeug from 0.11.15 to 0.14.1
xlrd from 1.0.0 to 1.1.0
(Windows) pypiwin32 from 219 to 223
(Windows) servicemanager 1.3.0, as below
(Windows) winerror
Note
If you are using SQL Server, you probably need to upgrade django-pyodbc-azure (from e.g. 1.10.4.0 to 2.0.6.1, with the command pip install django-pyodbc-azure==2.0.6.1), or you may see errors from ...\sql_server\pyodbc\base.py like “Django 2.0.6 is not supported.”
You may also need to update the database connection parameters; e.g. the DSN key has become dsn; see django-pyodbc-azure.
New crate_celery_status command.
Changed to using Celery --concurrency=1 (formerly 4) from crate_anon.tools.launch_celery, as this should prevent multiple Celery threads doing the same work twice if you call crate_django_manage resubmit_unprocessed_tasks more than once. There was a risk that this breaks Flower or other Celery status monitoring (as it did with Celery v3.1.23), but that was a long time ago, and it works fine now.
v0.18.52, 2018-07-02
NLP fields now support a standard _srcdatetime field; this can be NULL, but it’s normally specified as a defining DATETIME field from the source database (since most NLP needs an associated date and it’s far more convenient if this is in the destination database, along with patient ID). It’s specified directly to the crate_anon.nlp_manager.input_field_config.InputFieldConfig rather than via the copyfields, since we want a consistent date/time field name in the NLP output even if there is a lack of naming consistency in the source. Search for “new in v0.18.52”.
Possibly a bug fixed within the NLP manager, in relation to recording of hashed PKs from tables with non-integer PKs; see crate_anon.nlp_manager.input_field_config.InputFieldConfig.gen_text().
v0.18.53, to 2018-10-24
Added Client_Demographic_Details.National_Insurance_Number and ClientOtherDetail.NINumber to RiO automatic data dictionary generator as a sensitive (scrub-source) field; they were marked for code anonymisation but not flagged as scrub-source automatically.
Removed full stop from end of sentence in email_clinician.html beginning “If you’d like help, please telephone the Research Database Manager…”, since some users copied/pasted the full stop as part of the final e-mail address, which bounced. Clarity more important than grammar in this case.
NLP adds CRATE version column, _crate_version.
NLP adds “when fetched from database” column, _when_fetched_utc.
NLP supports “cmm” as an abbreviation for cubic mm (seen in CPFT and as per https://medical-dictionary.thefreedictionary.com/cmm).
To cardinal_pythonlib==1.0.25 with updates to document_to_text() parameter handling, then to 1.0.32.
Note that cardinal_pythonlib==1.0.25 also fixes a bug related to SQLAlchemy that manifested as AttributeError: module 'sqlalchemy.sql.sqltypes' has no attribute '_DateAffinity'.
NLPRP draft to 0.1.0.
django==2.0.6 to django==2.1.2 given security vulnerabilities reported in Django versions [2.0, 2.0.8).
Bugfix: mark_safe decorator added to all Django admin site parts with allow_tags = True set (for embedded URLs).
django-debug-toolbar==1.9.1 to django-debug-toolbar==1.10.1
Improved docstrings.
Minor bugfixes in crate_anon.anonymise.anonymise for fetching values from files.
_addition_only DDR flag only permitted on PK fields. (Was only attended to for them in any case!)
Bugfix to crate_anon.crateweb.consent.views.validate_email_request() and crate_anon.crateweb.consent.views.validate_letter_request(); these were returning rather than raising. Testing showed that something else was also blocking permission to access such things inappropriately, but fixed anyway!
Renamed generate_fake_nhs to generate_random_nhs to emphasize what this does.
Sitewide queries, editable by RDBM.
Restrict anonymiser to specific patient IDs (for subset generation +/- custom pseudonyms).
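The “returning rather than raising” bug fixed in v0.18.53 is an easy mistake to make in Python, because an exception object that is returned (and then ignored by the caller) has no effect. The sketch below shows the general shape of such a bug under assumed names; it is not the code of validate_email_request() or validate_letter_request(), and PermissionDenied here is a local stand-in class.

```python
class PermissionDenied(Exception):
    """Local stand-in for an access-control exception."""


def validate_buggy(user_is_authorized: bool) -> None:
    if not user_is_authorized:
        # Bug: the exception is created and returned, never raised.
        return PermissionDenied("not allowed")


def validate_fixed(user_is_authorized: bool) -> None:
    if not user_is_authorized:
        raise PermissionDenied("not allowed")


def view(validator, user_is_authorized: bool) -> str:
    validator(user_is_authorized)  # return value is ignored by the caller
    return "sensitive content"


leaked = view(validate_buggy, user_is_authorized=False)
try:
    view(validate_fixed, user_is_authorized=False)
    blocked = False
except PermissionDenied:
    blocked = True
```

With the buggy validator the view carries on regardless; with the fixed one the request is stopped.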
v0.18.54, 2018-10-26
Deferred load of clinical team info. (Main research database structure is still loaded at the start; I think my intention was to fail as early as possible if it’s going to fail, and/or ensure that “filling the cache” time is not experienced by the end user).
Fixed packaging bug in setup.py.
2018-10-21: Fixed bug:
OperationalError at /mgr_admin/consent/study/ (1054, "Unknown column 'consent_study.p_summary' in 'field list'")
Changed p_summary to a property.
v0.18.55, 2018-11-02
In crate_anon.anonymise.altermethod.AlterMethod._extract_text_func(), pre-check that a file exists (to save time if it doesn’t).
Bugfix to cardinal_pythonlib (now v1.0.33) in the autotranslation of SQL Server TIMESTAMP fields.
Changed caching for crate_anon.crateweb.research.research_db_info.SingleResearchDatabase to make command-line startup faster (at the expense of first-fetch speed).
v0.18.56, 2018-11-02
cardinal_pythonlib==1.0.36
Bugfix to setup.py; Java files were not being distributed properly.
Performance optimization to query “column filtering” for “show only columns containing no NULL values”, and more generally optimized; should run queries only once per web session.
Bugfix to crate_anon.crateweb.research.models.get_executed_researchdb_cursor(), which was double-wrapping a database cursor incorrectly.
v0.18.57, 2018-12-11
New lithium NLP processor (still needs external validation).
Bugfix: “cmm” was meant to be accepted as an abbreviation for “cubic mm” as per v0.18.53 above, but wasn’t. Rechecked all with crate_anon.nlp_manager.test_all_regex and added additional specific tests for this unit in crate_anon.nlp_manager.regex_units.test_unit_regexes(). All passing.
v0.18.58, 2018-12-23
Clinician requests added so that a clinician can request that their patient is included in a study.
Bugfix to crate_anon.preprocess.preprocess_rio.main(). Changed ‘progargs.rio’ to ‘rio’.
v0.18.59, 2018-12-24
Bugfix to clinician_initiated_contact_request(). Now checks that patient’s consent mode is green or yellow before confirming request.
v0.18.60, 2018-12-27
New look of website.
Bugfix to clinician requests. Also now sends a more appropriate email in these cases.
15.2.5. 2019
v0.18.61, 2019-01-15
Updated version of Django in setup.py.
Flag on website to check if query has been run since last database update.
Option of column in anonymiser output specifying when processed.
v0.18.62, 2019-02-09
Improved the crate_test_extract_text command (crate_anon.anonymise.test_extract_text), including errorlevel/return codes to detect text presence.
Bump to cardinal_pythonlib==1.0.47. Note that this now raises an exception from cardinal_pythonlib.extract_text.document_to_text() if a filename is passed and the file doesn’t exist.
v0.18.63, 2019-02-12
NLP web server based on the NLPRP API.
Bugfix to the website string finder - ‘text fields’ now includes ‘NVARCHAR(-1)’.
v0.18.64, 2019-02-21
NLP for glucose, cholesterol (LDL, HDL, total), triglycerides, HbA1c (still need external validation).
v0.18.65, 2019-03-04 to 2019-03-25
NLP for potassium, urea, creatinine, haemoglobin, haematocrit (still need external validation).
At some point before this: SQL helpers to find drug classes/types (e.g. “atypical antipsychotics”, “SSRIs”), as per JL’s idea of 2018-01-08.
At some point before this: research query options to show a subset of columns.
At some point before this: “Clinician asks for a study pack” – create a contact request that’s pre-authorized by a clinician (who might want to pass on the pack themselves or delegate the RDBM to do it).
Standard site queries now handle the following problem:
With regular data updates there might be problems with queries returning different results if rerun a week later, so might be worth returning a timestamp of some type, like:
MAX(DATE_CREATED) FROM RIO.DBO.Clinical_Documents + MAX(whenprocessedutc) FROM [RiONLP].[dbo].[crate_nlp_progress] + …
v0.18.66, 2019-03-29
Update to CrateGatePipeline.java to support an option to continue after GATE crashes.
v0.18.67, 2019-03-30 to 2019-03-31
semver to semantic_version; consistent with CamCOPS and better (and not actually used hitherto by CRATE!)
NLPRP constants and core API.
Move to Python 3.6 (already the minimum in CPFT), allowing f-strings.
f-strings. (Note: use Alt-Enter in PyCharm.)
CrateGatePipeline.java supports continuation after a Java RuntimeException (“bug in GATE code”).
v0.18.68, 2019-04-09
Creatinine regex supports mg/dl units as well as micromolar.
url and max_content_length configurable.
Bugfixes to crate_anon.nlp_manager.nlp_manager.send_cloud_requests() and crate_anon.nlp_webserver.views.NlpWebViews.show_queue().
v0.18.70, 2019-04-17
PyPI distribution properly contains nlprp directory.
v0.18.71, 2019-05-13
Bugfix to nlp incremental mode.
Use of tokens in cloud NLP and option not to verify SSL.
v0.18.72, 2019-05-16
Bugfix to crate_anon.nlp_manager.cloud_parser.CloudRequest to convert string datetime back to datetime object. (MySQL automatically converts when writing to the database, but MSSQL doesn’t.)
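A conversion of this kind might look like the sketch below. It assumes that the datetime travelled as an ISO 8601 string (the changelog does not state the format) and uses a hypothetical helper name, to_datetime(); values that are already datetime objects pass through unchanged.

```python
from datetime import datetime


def to_datetime(value):
    """Return a datetime, parsing ISO 8601 strings where necessary."""
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)  # assumes ISO 8601 text


parsed = to_datetime("2019-05-16T10:30:00")
unchanged = to_datetime(datetime(2019, 5, 16, 10, 30))
```

Converting explicitly before the database write avoids relying on the backend to coerce strings, which the entry above notes MySQL does and SQL Server does not.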
v0.18.73, 2019-05-21
Only do nlp processing on records with alphanumeric characters.
Do highlighting only once per query, then save the highlighted version in an attribute of the crate_anon.crateweb.Query class.
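The filter on records without alphanumeric content in v0.18.73 can be approximated with a one-line regex test. This is a sketch with a hypothetical function name, worth_sending(); note that the \w class used here also accepts underscores and non-ASCII letters, which may or may not match CRATE's exact rule.

```python
import re

WORD_CHAR = re.compile(r"\w")


def worth_sending(text) -> bool:
    """True if the text has at least one word character; False for empty or None."""
    return bool(text) and WORD_CHAR.search(text) is not None


candidates = ["Na 142", "   \n", "...---", "", None]
to_process = [t for t in candidates if worth_sending(t)]
```

Skipping such records saves a round trip to the NLP processor for text that cannot yield any result.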
v0.18.74, 2019-05-21
Changed migrations to make them compatible with SQL Server.
v0.18.75, 2019-06-06
Long queries are now hidden on website in order to avoid long render time.
crate_anon.nlp_manager.cloud_parser.CloudRequest now extracts content from GATE processors based on the start and end indexes.
v0.18.76, 2019-06-12
Option to truncate source data in nlp and to mark truncated records as processed or not.
Upgrade to SQLAlchemy==1.3.0 and django==2.2.2.
Bugfix to crate_anon.nlp_webserver.views: include_text and client_job_id are obtained from args rather than top-level of the request.
In crate_anon.nlp_manager.nlp_manager, open file to write after completing retrieval of requests so if there is a problem you don’t lose all your queue_ids.
Records containing no word characters will not be sent.
session.remove() has been added to crate_anon.nlp_webserver.views.
v0.18.77, 2019-06-12
crate_anon.nlp_manager.cloud_parser won’t crash if one request gives an error. This is so we don’t lose all data if just one request doesn’t work.
v0.18.78, 2019-06-12
In crate_anon.nlp_manager.nlp_manager.process_cloud_nlp(), use append file instead of write, so that, if there’s a problem part-way through, we don’t lose all data.
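The difference between write and append mode matters when a file is reopened repeatedly during a long job: mode "w" truncates the file on every open, whereas mode "a" keeps what is already there. The sketch below is an assumption about how that idea applies to saved queue IDs; the function names and the one-ID-per-line file layout are hypothetical.

```python
import os
import tempfile


def save_queue_id(path: str, queue_id: str) -> None:
    """Append one queue ID per line; earlier IDs survive each reopen."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(queue_id + "\n")


def load_queue_ids(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "queue_ids.txt")
for queue_id in ["q1", "q2", "q3"]:
    save_queue_id(path, queue_id)  # a crash after this call loses nothing saved
saved = load_queue_ids(path)
```

If the process stops after the second ID, the file still holds the first two, so the corresponding requests can be retrieved later.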
v0.18.79, 2019-06-13
Downgraded to SQLAlchemy==1.2.8, which it was before, and django==2.1.9, which is higher than it was before, because the updates were causing clashes with django-pyodbc-azure.
Log error messages from server in crate_anon.nlp_manager.cloud_parser.CloudRequest.list_processors().
v0.18.80, 2019-06-13
Sending requests to the cloud servers is broken up into blocks so that the database can be written to periodically.
New sessions for each request on the server-side.
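Breaking a long stream of records into blocks, with a database write between blocks, can be sketched with itertools.islice. The generator in_blocks() and the stand-in lists below are hypothetical; in a real client the loop body would send one cloud request and then commit.

```python
from itertools import islice


def in_blocks(records, block_size: int):
    """Yield successive lists of up to block_size records."""
    iterator = iter(records)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


sent = []
commits = 0
for block in in_blocks(range(5), 2):
    sent.append(block)  # stand-in for sending one request to the server
    commits += 1        # stand-in for a periodic database commit
```

Because the generator pulls records lazily, the whole source table never has to be held in memory, and progress is saved after every block.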
v0.18.81, 2019-06-17
Microsoft specific bugfix in cloud nlp.
Commit every n records, where n is specified by the user, in retrieval of cloud requests.
v0.18.82, 2019-06-17
Used rate limiter.
v0.18.83, 2019-06-23
Bugfix to crate_anon.nlp_manager.cloud_parser.CloudRequest.get_nlp_values_gate() and crate_anon.nlp_manager.cloud_parser.CloudRequest.get_nlp_values_internal() so that they don’t try to fish out results for a processor when there are errors.
Retry after connection failure in crate_anon.nlp_manager.cloud_parser.
v0.18.85, 2019-07-21
Regexes: MICROLITRE, CUBIC_MM_OR_MICROLITRE, CELLS_PER_CUBIC_MM_OR_MICROLITRE.
HGB as synonym for haemoglobin in crate_anon.nlp_manager.parse_haematology.Haemoglobin.
OPTIONAL_POC element in several biochemistry/haematology parsers
crate_anon.nlp_manager.parse_haematology.WbcBase allows “per microlitre” as well as “per cubic mm”.
logging, rather than print(), for regex testing
mention urllib3==1.23 explicitly in setup.py (used by requests)
… then urllib3==1.24.2 to avoid a high severity security vulnerability (automatic Github warning; well done, it).
v0.18.86, 2019-08-06
NLPRP v0.2.0, with schema support.
django==2.1.11 (from 2.1.10), Github-prompted security fix.
sqlalchemy==1.3.6 (from 1.2.8); needed to go to 1.3.0 (Github-prompted security fix) but we’d noted Windows problems with 1.3.0; looks like SQL Server regression was fixed in 1.3.1 (see https://docs.sqlalchemy.org/en/13/changelog/changelog_13.html) so going to 1.3.6.
python-dateutil==2.6.1 (required by pandas), from 2.6.0 (was blocking readthedocs updates).
cardinal_pythonlib==1.0.61 (from 1.0.58); bugfix in log probability handling; fix relating to Django settings.XSENDFILE.
Bugfix to crate_anon.nlp_manager.parse_cognitive.MocaValidator; was looking at the mini-ACE instead!
Abstract base classes in NLP parsers to assist with NLPRP work.
Comments for NLP output columns (for built-in fields and those specified by destfields).
Cloud NLP config modularized. Breaking change to existing cloud NLP configs.
Some code simplification, including classes:
crate_anon.nlp_manager.errors.NlprpError
crate_anon.nlp_manager.tasks.NlpServerResult
crate_anon.nlp_manager.views.NlprpProcessRequest
Reset count in crate_anon.nlp_manager.retrieve_nlp_data() after committing.
Renamed max_retries to max_tries.
Moved “verify SSL” option from --noverify on the command line to the verify_ssl parameter.
Parameterized maximum request frequency via rate_limit_hz.
Split limit_before_write parameter into max_records_per_request and limit_before_commit.
Renamed nlp_web to nlp_webserver for clarity (since “web” might refer to client or server).
Split nlp_webserver/constants.py into crate_anon.nlp_webserver.constants and crate_anon.nlp_webserver.settings so “constants” has no import side-effects.
More compact encoding (including for CRATE web Javascript) via crate_anon.constants.JSON_SEPARATORS_COMPACT.
Removed dependencies:
typing – now using Python 3.6
Werkzeug – no longer in use
Pinned versions:
pytz==2018.5
Added requirements:
cairosvg==2.4.0
pillow==6.1.0
Context-sensitive help on the CRATE web site, via crate_anon.common.constants.HelpUrl.
Amended show_sitewide_queries.html to remove <form> children of <tr>; see
NLPRP client sets include_text to False (see process).
Removed reference to Django setting SEND_BROKEN_LINK_EMAILS (and thus MANAGERS since we won’t enable Django’s BrokenLinkEmailsMiddleware); see https://docs.djangoproject.com/en/dev/internals/deprecation/.
Experimental: archive system.
Removed cardinal_pythonlib.django.middleware.DisableClientSideCachingMiddleware since we may want to do some caching.
cardinal_pythonlib==1.0.63
Added standard tense_text column to NLP classes crate_anon.nlp_manager.parse_clinical.Bp, crate_anon.nlp_manager.regex_parser.NumeratorOutOfDenominatorParser.
Python NLP:
CRP value column case changed from value_mg_l to value_mg_L.
Creatinine value column renamed from value_mmol_L (wrong!) to value_micromol_L.
HbA1c value column renamed from value_mmol_L (wrong!) to value_mmol_mol.
Haematocrit value column case changed from value_l_l to value_L_L.
Haemoglobin value column case changed from value_g_l to value_g_L.
GATE parser now avoids stripping terminal tabs (now just newlines), removing error messages saying “Bad chunk, not of length 2”. See crate_anon.nlp_manager.parse_gate.Gate.parse().
crate_anon.crateweb.research.models.PatientExplorer use is audited.
v0.18.87, 2019-09-30
NLP web server performance tweaks; database structure changes.
Remove dependence on cardinal_pythonlib.rnc_db, which is trivial but gives a warning.
cardinal_pythonlib==1.0.65
readthedocs.org problems fixed; see
environment variable _SPHINX_AUTODOC_IN_PROGRESS (re errors from docs build environment)
readthedocs.yml (re resource usage)
all .ini files were being ignored (despite being fine on a local Sphinx build) – this was a .gitignore bug.
v0.18.88 to 0.18.91, 2019-10-06 to 2019-10-07
We were seeing BrokenPipeError exceptions when very large chunks of text (e.g. 27 Mb) were being sent to GATE processors under Windows. This was due to a bug in the DOCX text extractor. So:
new crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed exception;
BrokenPipeError exceptions now trapped by the GATE and MedEx processors (leading to a log error, a restart of the processor, and a crate_anon.nlp_manager.base_nlp_parser.TextProcessingFailed error);
cardinal_pythonlib==1.0.67, which has improvements to DOCX table extraction;
right-strip all extracted text
v0.18.92, 2019-10-10
Bugfix: tools that were unrelated to the NLP web server were importing its settings (so requiring a dummy config file).
crate_email_rdbm tool
Bugfix in the way that postcodes.py imported from cardinal_pythonlib.extract_text.
cardinal_pythonlib==1.0.73
v0.18.93, 2019-11-19
New option add_mrid_wherever_rid_added.
Preceding thoughts:
Option to add MRID to every table, to make cross-database queries simpler?
Column would have to support NULL values; not all patients with a PID (e.g. local identifier) will have a MPID (e.g. national identifier).
Would not require sequencing of tables during anonymisation, since the MRID should be found via crate_anon.anonymise.patient.Patient._build_scrubber().
Would involve modifying crate_anon.anonymise.anonymise.process_table() to call crate_anon.anonymise.patient.Patient.get_mrid(), possibly where it checks for a column being the primary PID, and adding an extra row there subject to a flag.
The flag relates to the whole database rather than a specific row, so it should probably be in the config file – e.g. named add_mrid_wherever_rid_added, within the [main] section, and the “Output fields and formatting” subsection.
Might also need an option to index that field automatically (true by default) – indexed automatically.
Update pillow from 6.1.0 to 6.2.0 (https://nvd.nist.gov/vuln/detail/CVE-2019-16865).
crate_anon.nlp_manager.parse_biochemistry.TotalCholesterol was incorrectly labelling its output “HDL cholesterol”; changed to “Total cholesterol”.
cardinal_pythonlib==1.0.80, including a better call to Celery that handles a Ctrl-C to the Python process better (via the nice_call function). See CamCOPS documentation for more detail.
v0.18.94, 2019-12-05
Option to filter out free text; see --free_text_limit; see crate_anonymise.
Option to exclude all text fields which are set to be scrubbed via --excludescrubbed.
Temporary bugfix to get round a bug in the flashtext module.
On crash, show which record is being processed in anonymiser.
Allow option ‘C’ (‘patient is definitely ineligible’) for all clinician responses to contact requests.
v0.18.95, 2019-12-10
15.2.6. 2020
v0.18.96, 2020-01-07
Security fixes for external dependencies:
waitress from 1.4.1 to 1.4.2 (https://github.com/advisories/GHSA-m5ff-3wj3-8ph4; https://github.com/advisories/GHSA-968f-66r5-5v74)
django from 2.1.11 to 2.1.15 (https://github.com/advisories/GHSA-hvmf-r92r-27hr)
v0.18.97, 2020-03-20
Create crate_anon.__version__
crate_nlp_build_gate_java_interface: the --launch option now includes the directory for the CRATE Java class as part of the Java classpath.
Document CRATE_HTTPS setting.
New crate_bulk_hash tool.
cardinal_pythonlib==1.0.85
Bump waitress from 1.4.2 to 1.4.3 (security alert).
Bugfix to crate_postcodes (re nonexistent commit argument).
Update crate_postcodes for ONSPD Nov 2019.
Changes to crate_anon.nlp_manager.nlp_manager and crate_anon.nlp_manager.input_field_config to go back to a single query.
v0.18.98, 2020-03-28
Downgrade Django as most recent version was not compatible.
0.18.99, 2020-04-28 to 2020-07-20
More efficient simple postcode regex in crate_anon.anonymise.anonregex.get_uk_postcode_regex_elements().
Fuzzy ID matching work.
Neutral language review, as per https://lkml.org/lkml/2020/7/4/229:
blacklist → denylist, verb “deny”, noun “denial”, jargon verb “denylist”, jargon adjective “denylisted”.
whitelist → allowlist, verb “allow”, noun “allowing” (not “allowance”; that sense only in the late 15th century, according to the OED; “allowing” as a noun is a gerund or verbal noun; example in UK legislation at https://www.legislation.gov.uk/ukpga/1955/26/pdfs/ukpga_19550026_en.pdf); jargon verb “allowlist”, jargon adjective “allowlisted”.
Tidy up config file processing as part of this work.
Bump Pillow from 6.2.0 to 7.2.0. Bump Django from 2.2.11 to 2.2.14.
0.19.0, 2020-07-21
Django 3 and multiple other internal package upgrades.
Basic Docker operation.
Comment lines and blank lines ignored in data dictionary.
0.19.1, 2020-12-18
“LFT” NLP processors: albumin, ALT, alkaline phosphatase, bilirubin, gamma GT.
crate_run_crate_nlp_demo tool to test internal NLP more conveniently.
Bugfix to crate_anon.anonymise.anonregex.escape_literal_string_for_regex: was not doing its job!
Read code support for blood test NLP parsers (biochemistry, haematology).
Significant rework to numerical NLP to support a wider variety of formats, e.g. sodium (mM) 132 as well as sodium 132 mM.
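To illustrate the two orderings mentioned in the numerical NLP item above, here is a much-simplified sketch. The pattern and the function name are hypothetical; this is not CRATE's regex, which handles far more variation.

```python
import re
from typing import Optional

# Hypothetical pattern accepting either "sodium (mM) 132" (units first, in
# brackets) or "sodium 132 mM" (units last).
SODIUM_RE = re.compile(
    r"""
    \b sodium \s*
    (?:
        \( \s* (?: mM | mmol/L ) \s* \) \s* (?P<value_a> \d+ (?: \.\d+ )? )
      |
        (?P<value_b> \d+ (?: \.\d+ )? ) \s* (?: mM | mmol/L ) \b
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_sodium(text: str) -> Optional[float]:
    """Return the first sodium value found in the text, or None."""
    m = SODIUM_RE.search(text)
    if m is None:
        return None
    # Exactly one of the two named groups is populated on a match.
    return float(m.group("value_a") or m.group("value_b"))
```

Both `parse_sodium("sodium (mM) 132")` and `parse_sodium("sodium 132 mM")` yield the same value under this sketch.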
15.2.7. 2021
0.19.2, 2021-01-26
Handle errors when inserting rows in the destination table during NLP.
15.2.8. 2022
0.19.3, 2022-03-31
Migrating Travis CI.
Library updates:
cardinal_pythonlib from 1.1.10 to 1.1.15.
celery from 5.2.0 to 5.2.2 for CVE-2021-23727.
django from 3.1.7 to 3.1.12 for CVE-2021-31542, then to 3.1.13 for CVE-2021-35042.
Update jQuery from 3.1.1 to 3.6.0, and jQuery UI from 1.12.1 to 1.13.0.
kombu from 4.4.6 to 5.2.1 (security fix), and celery from 4.4.6 to 5.2.0 in consequence, then amqp from 2.6.0 to 5.0.6 in consequence. Change syntax in launch_nlp_webserver_celery.py as a result, and similarly elsewhere. Then kombu to 5.2.2 when celery bumped as above.
MarkupSafe from 1.1.1 to 2.0.1 (for other dependencies).
pendulum from 2.1.1 to 2.1.2 so it installs (Python 3.7, Windows) (previously, it complained about PEP 517; https://github.com/sdispater/pendulum/issues/454).
pillow from 8.1.2 to 8.2.0 for several alerts including CVE-2021-25288, then to 8.3.2 (CVE-2021-23437), then to 9.0.0 (e.g. CVE-2022-22817).
urllib3 from 1.26.4 to 1.26.5 for CVE-2021-33503.
waitress from 1.4.4 to 2.1.1 for CVE-2022-24761.
Remove need for xlrd (was only used for the postcode database and now redundant; all other Excel work uses openpyxl), but add pyexcel-ods for ODS files.
Minimum Python version is now 3.7. (Python 3.6 reached end-of-life on 2021-12-23.) Explicit support for Python 3.10.
Specific code for TIMELY project.
Command line:
Split out standalone commands, as the crate_anonymise command was becoming confusingly multi-purpose:
crate_anonymise --count becomes crate_anon_show_counts;
crate_anonymise --democonfig becomes crate_anon_demo_config;
crate_anonymise --checkextractor becomes crate_anon_check_text_extractor;
crate_anonymise --draftdd and crate_anonymise --incrementaldd become crate_anon_draft_dd.
crate_anon_summarize_dd tool.
Change some hyphens to underscores in the command-line arguments to the PCMIS and RiO preprocessing tools, for consistency.
Help:
Index of all CRATE commands.
Data dictionaries and automatic data dictionary generation:
Full support for data dictionaries in CSV, ODS, and XLSX format, as well as the existing TSV. (Uses the first spreadsheet of a potentially multi-sheet file when reading.)
Support SystmOne data dictionaries.
ddgen_force_lower_case default changed from True to False.
ddgen_min_length_for_scrubbing default changed from 0 to 50.
New ddgen_freetext_index_min_length option.
Fulltext indexing during data dictionary autogeneration now bases its decisions on the source (not destination) datatype. This handles the “auto-expansion” better – otherwise all sorts of things were attracting the full-text flag.
Remove warnings about lack of primary PID field in source tables with an MPID if no scrubbing is required (that’s an inconvenience, not a de-identification risk).
Use DataDictionary.get_pid_name instead of ddgen_per_table_pid_field to establish the PID field for each table for scrubbing. The ddgen options should only be for generating a data dictionary; the user may have revised the data dictionary subsequently, and there is no requirement that all PID fields have the same name across tables.
Add data dictionary check that all scrub-source tables have a patient ID field.
Remove ddgen_allow_no_patient_info option and replace it with allow_no_patient_info – this is now a “runtime” setting, not a “data dictionary definition” setting. Depending on allow_no_patient_info, warnings or errors are produced if a data dictionary is used without patient-defining information (which is usually wrong, but there are sometimes sensible use-cases for it).
Option for ddgen_min_length_for_scrubbing to be less than 1 to disable scrubbing entirely (helpful for the SystmOne automatic data dictionary generation).
Add data dictionary row check that “add source hash” (H) flag fields are not omitted, as promised in the documentation.
Autodetect primary keys.
Anonymisation:
New scrub method: phrase_unless_numeric.
Efficiency check when recursing into third-party records, to avoid doing the same work twice.
Automatically hash third-party PIDs using the same hasher as patient PIDs, rendering the de-identified records linkable (if and only if the third-party PID field is marked for inclusion).
denylist_files_as_phrases option for anonymisation, and denylist_use_regex.
Fix crate_anon.anonymise.scrub.WordList to use its suffixes parameter even if regex_method is False. (Was not being used.)
Ensure that if MRIDs are being added automatically (option add_mrid_wherever_rid_added), but a table has an explicit MPID/MRID data dictionary row with the same name, we don’t attempt to add it twice.
Make primary key columns (which are already detected and/or configured by the user) explicitly NOT NULL on the destination, which allows free-text indexing. Replicate source NOT NULL status, allowing the user to control this via a source flag, for other column types.
Add support for SQL column comments (supported since SQLAlchemy 1.2).
Drop all tables known to the data dictionary (not just tables with included content), to avoid leaving orphan tables when the data dictionary is altered to OMIT everything in a table. As before, only active tables are created.
Allow secret table PID/MPID types to be integer despite string source fields, giving a warning only. This is acceptable if the source fields do in fact contain only integers-as-strings, e.g. ‘123’.
When dates are truncated, (a) ensure time fields are zero, and (b) default (during data dictionary drafting) to a DATE field, in case the source is DATETIME.
Provide option nonspecific_scrubber_first to govern scrubber order (in crate_anon.anonymise.scrub.PersonalizedScrubber.scrub()). Was (1) nonspecific, (2) patient, (3) third party. The new default is (1) patient, (2) third party, (3) nonspecific. This provides some more information to the user about the subject of a sentence, but it is a configurable minor trade-off.
Option to scrub all dates: scrub_all_dates.
This does not presently do generic date “blurring”; blurring to year is very imprecise, while blurring to month is quite susceptible to information discovery around month boundaries. However, if required, this could be implemented – likely not by a simple textual replace using named capture groups for the parts to preserve, but by named capture groups followed by date parsing followed by date-writing in a standard, e.g. ISO, format.
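The approach outlined in the note above (named capture groups, then date parsing, then writing in a standard format) might look like the following sketch. It is not CRATE's implementation: the regex covers only dates like “19 Mar 2015”, and all names are hypothetical. It blurs to the month, which the note above warns is susceptible to information discovery around month boundaries.

```python
import re
from datetime import date

# English three-letter month abbreviations (assumed input format).
MONTH_NUMBERS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Named capture groups for the parts of a "19 Mar 2015"-style date.
DATE_RE = re.compile(
    r"\b(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]{3})\s+(?P<year>\d{4})\b"
)


def blur_dates_to_month(text: str) -> str:
    """Replace recognised dates with an ISO 8601 year-month (YYYY-MM)."""

    def _replace(m):
        month = MONTH_NUMBERS.get(m.group("month").lower())
        if month is None:
            return m.group(0)  # not a month name; leave text untouched
        try:
            # Constructing a date validates the day/month/year combination.
            parsed = date(int(m.group("year")), month, int(m.group("day")))
        except ValueError:
            return m.group(0)  # e.g. "31 Feb 2015"; leave untouched
        return f"{parsed.year:04d}-{parsed.month:02d}"

    return DATE_RE.sub(_replace, text)
```

Parsing before writing, rather than doing a plain textual substitution, means that invalid dates are left alone and that every accepted input format ends up in the same output format.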
Installer for CRATE running within Docker
The Docker version of CRATE can now be installed with a single script.
0.19.4, 2022-05-24
Anonymisation:
Document defaults for anonymise_numbers_at_word_boundaries_only and anonymise_numbers_at_numeric_boundaries_only arguments to crate_anon.anonymise.scrub.PersonalizedScrubber, fixing https://github.com/ucam-department-of-psychiatry/crate/issues/67. Add anonymise_codes_at_numeric_boundaries_only option for coherence.
Document behaviour of anonymise_strings_at_word_boundaries_only for FlashText-based wordlist scrubbing (controlled by denylist_use_regex), fixing https://github.com/ucam-department-of-psychiatry/crate/issues/68.
Default for min_string_length_for_errors changed from 1 to 3.
Added some more generally sensible details around CRATE table/field naming.
Support for blurring of dates with new config option replace_all_dates_with.
15.2.9. 2023
0.20.0, 2023-03-14
Support for removing all e-mail addresses, with new config option scrub_all_email_addresses.
Fix bug in the opt-out recording table if the opt-out PID/MPID values were in a column of integer type. This caused errors under SQL Server like “Cannot insert explicit value for identity column in table ‘opt_out_pid’ when IDENTITY_INSERT is set to OFF. (544)”.
NLP:
Improve docstrings and documentation display for NLP processors, also removing the relatively confusing --showinfo option.
Improve internal regex testing.
When generating text from a source database for NLP, skip text that contains only whitespace or is in other senses irrelevant (not just text that is NULL or the empty string). (We were getting errors from remote/cloud NLP processors, e.g. the GATE error “document contains no tokens or sentences”, and it’s just a waste of resources to process these records.) Relevance is controlled by crate_anon.common.stringfunc.relevant_for_nlp().
The crate_nlp --test_nlp option now supports cloud-based (remote) processors as well as local ones.
The NLP cloud client checks that all requested processors are available remotely, and fails overtly if any are missing, rather than silently ignoring them.
Alcohol units-per-week regex NLP.
Support “over” and “under” as further synonyms for inequalities in regex NLP.
NLPRP version now 0.3.0. Uses HTTP return code 202 (Accepted), not 102 (Processing), to respond to calls about NLP jobs that are in progress via fetch_from_queue. See https://github.com/ucam-department-of-psychiatry/crate/issues/106.
Bayesian linkage, identifiable and de-identified.
Minimum Python version is now 3.8. (Required by numpy v1.22.)
Rich text for help.
Fix installer to work with Docker Compose >= 2.14.1. With these versions, the named ‘crate’ image would be pulled from DockerHub even if a Dockerfile was present, which we don’t want.
0.20.1, 2023-10-05
Remove row-level data dictionary check for the combination of “OMIT” and the “H” (add source hash) flag – should be a table-level check only (i.e. if you’re omitting the whole table, it’s fine to have this combination).
Improvements to the installer, in particular around access to real-world databases and external file storage.
0.20.2, 2023-10-06
Support table comments, by:
creating some in the demo database
scanning table comments when drafting a data dictionary
not ignoring (overwriting) column comments when ddgen_append_source_info_to_comment is set (bugfix)!
supporting a kind of DD row with no source/destination field, just a comment
writing that to the destination database
supporting table comments for NLP output tables
and adding column comments more consistently overall
0.20.3, 2023-10-06
No change in functionality. Updates Django to 3.2.21 to fix CVE-2023-41164.
15.2.10. 2024
0.20.4, 2024-05-21
crate_researcher_report command.
crate_subset_db command.
Supported SQLAlchemy version now 1.4.
Consent-for-contact lookup via SystmOne SRE database.
IRAS/REC/version number/title for all ethics pack documents autogenerated by CRATE. Remove PDF_LETTER_HEADER_HTML and PDF_LETTER_FOOTER_HTML from settings; add PDF_LETTER_FOOTER_ADDRESS_HTML instead (as a subset), and ETHICS_INFO.
The GATE interface CrateGatePipeline.java now builds with GATE 9.x as well as GATE 8.x. The version of GATE used in the Docker image is configurable, defaulting to 9.0.1. https://github.com/ucam-department-of-psychiatry/crate/issues/149
0.20.5, 2024-06-26
When the Docker image is built, it is now possible to specify both a user ID and a group ID so that file systems shared between the host and the container have the correct permissions.
15.2.11. 2025
0.20.6, 2025-01-09
Update Django to 4.2 LTS. The minimum version of MySQL supported by Django 4.2 is 8.0.
Update the Docker image to use Debian 11. Debian 10 has now reached end-of-life.
Fix the installer so that sensible Django Database settings are written for SQL Server databases accessed via DSN in the ODBC configuration. https://github.com/ucam-department-of-psychiatry/crate/issues/159
Fix access to /crate/files from the Flower container. https://github.com/ucam-department-of-psychiatry/crate/issues/162
URL-encode passwords that appear in SQLAlchemy URLs generated by the installer. Before this change, passwords with special characters resulted in the wrong credentials being used. https://github.com/ucam-department-of-psychiatry/crate/issues/161
Minimum Python version is now 3.9. Python 3.8 has now reached end-of-life. Python 3.11 is now supported.
Change the default installer file hierarchy on the host so that it can more easily be shared between several users.
Change the installer to test external database connections early and give better feedback to the user regarding failures.
Fix the data dictionary generator to allow column names with valid but unusual characters in the source database. Convert any unusual characters to ASCII in the destination database.
Update NLP handler to cope with remote NLPRP servers providing tabular_schema data, and create local tables based on this, if desired. Change default NLPRP server behaviour to use the more explicit format (object/dictionaries by table, not plain arrays/lists of records).
Update crate_postcodes to support the November 2024 ONSPD in full. Offer partial support for ONSPD lookup tables in earlier/future versions by not failing when there is a mismatch between the created database tables and the ONSPD spreadsheet files.
Work around a problem with Docker’s handling of large sparse files, which resulted in a very large Docker image if the ID of the user creating the image was large.
Update the installer to provide some example scripts for running anonymisation, NLP etc under Docker. https://github.com/ucam-department-of-psychiatry/crate/issues/163
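The password URL-encoding fix in this release addresses a general pitfall with database URLs: characters such as “@”, “:” and “/” in a password are otherwise read as URL delimiters. A minimal sketch, using only the Python standard library; the function name and the example values are hypothetical and this is not the installer's actual code.

```python
from urllib.parse import quote, unquote, urlsplit


def make_db_url(
    driver: str, user: str, password: str, host: str, port: int, database: str
) -> str:
    """Build a database URL with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "/".
    return (
        f"{driver}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


url = make_db_url(
    "mysql+mysqldb", "researcher", "p@ss:w/rd", "localhost", 3306, "crate"
)
```

Without the encoding, the “@” inside the password would end the credentials part of the URL early, so both the password and the host would be parsed wrongly.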
0.20.7, 2025-05-05
Shift from SQLAlchemy 1.4.49 to SQLAlchemy 2.0.36.
Make the anonymiser more tolerant of Integrity errors and patients with invalid Master PIDs.
Ensure the Dockerfile uses a more recent setuptools so that fuzzy can be installed.
Drop support for Debian package given that we have PyPI, GitHub and the Docker-based installer.
Fix bug in import of plaintext CSV for linkage with semicolon-delimited identifiers.
Remove SAVEPOINT when creating Transient Research ID (TRID) on patient records in the secret database. Databricks does not support SAVEPOINT.
Use later urllib3 for Databricks compatibility. This means dropping support for Python 3.9, which reaches end-of-life in October 2025. The Docker image is now based on Python 3.10.
15.2.12. 2026
0.20.8, 2026-02-23
Update the installer with example scripts to:
enter the Docker container (useful when troubleshooting problems).
create wordlists e.g. for removing all personal names that are not medical eponyms.
Fix bug where incremental anonymisation would fail for data dictionary rows with K, H and P flags. https://github.com/ucam-department-of-psychiatry/crate/issues/232
Fix bug where a data dictionary row would not be skipped if its AlterMethod returned True for the skip_row value.
Add --docstore_root option to crate_anon.preprocess.preprocess_systmone.main(). If present, this will extract text from documents in this location into a new table called crate_extracted_text, which can then be anonymised along with the other tables.
0.20.9, 2026-04-16
Add DateEvent and DateEventRecorded fields to crate_extracted_text table when extracting text from SystmOne documents.
Fix bug where validation of the NLP config would fail if the fields in copyfields and indexed_copyfields appeared in different orders in different tables. https://github.com/ucam-department-of-psychiatry/crate/issues/243
0.20.10, in progress
Fix Python NLP bug that didn’t exclude a preceding “if” (e.g. “If CRP is over 30 mg/L the ferritin result may be high and uninterpretable”). Applies to all NLP parsers using make_simple_numeric_regex. Also added “below” as additional synonym for “<”, and “above” as synonym for “>”.
15.3. To do
Fix promise for crate_anonymise_multiprocess: it is launching n patient processes and n non-patient processes simultaneously. Attempts in progress in cardinal_pythonlib.subproc.run_multiple_processes. But consider also Ray (https://www.ray.io/).
Footnotes