6.5. Run the anonymiser
Now that you have created and edited your config file and data dictionary, you can run the anonymiser in one of the following ways:
crate_anonymise --full
crate_anonymise --incremental
crate_anonymise_multiprocess --full
crate_anonymise_multiprocess --incremental
The ‘multiprocess’ versions are faster if you have a multi-core or multi-CPU computer. The ‘full’ option destroys the destination database and starts again. The ‘incremental’ option brings the destination database up to date (creating it if necessary). The default is ‘incremental’, for safety reasons.
For more help, run:
crate_anonymise --help
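As a minimal sketch of a typical invocation: according to the option listing in 6.5.1, the config file can be named through the CRATE_ANON_CONFIG environment variable or with --config, which overrides it. The path used here is a hypothetical placeholder, and the `command -v` guard is an addition so that the snippet only prints the command on a machine where CRATE is not installed.

```shell
# Hypothetical path: point this at your own anonymiser config file.
export CRATE_ANON_CONFIG="$HOME/crate/crate_anon_config.ini"

# Incremental is the default mode; it is spelled out here for clarity.
cmd="crate_anonymise --incremental --verbose"
echo "Command: $cmd"

# Run only if CRATE is installed; otherwise this is just a dry run.
# $cmd is left unquoted on purpose so the shell splits it into arguments.
if command -v crate_anonymise >/dev/null 2>&1; then
    $cmd
fi
```

To use a different config file for a single run, add `--config` followed by its path to the command instead of changing the environment variable.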
6.5.1. crate_anonymise
This runs a single-process anonymiser.
Options:
USAGE: crate_anonymise [-h] [--config CONFIG] [--version] [--verbose]
[-i | -f] [--skipdelete] [--dropremake] [--drop_all]
[--optout] [--nonpatienttables] [--patienttables]
[--index] [--restrict RESTRICT]
[--limits LIMITS LIMITS] [--file FILE]
[--list LIST [LIST ...]]
[--free_text_limit FREE_TEXT_LIMIT] [--excludescrubbed]
[--process [PROCESS]] [--nprocesses [NPROCESSES]]
[--processcluster PROCESSCLUSTER] [--skip_dd_check]
[--seed SEED] [--chunksize [CHUNKSIZE]]
[--reportevery [REPORTEVERY]] [--debugscrubbers]
[--savescrubbers] [--echo]
Database anonymiser. (CRATE version 0.20.0, 2023-02-14. Created by Rudolf
Cardinal.)
OPTIONAL ARGUMENTS:
-h, --help show this help message and exit
--config CONFIG Config file (overriding environment variable
CRATE_ANON_CONFIG). (default: None)
--version show program's version number and exit
--verbose, -v Be verbose (default: False)
MODE OPTIONS:
-i, --incremental Process only new/changed information, where possible.
(default: True)
-f, --full Drop and remake everything. (default: False)
--skipdelete For incremental updates, skip deletion of rows present
in the destination but not the source. (default:
False)
ACTION OPTIONS (DEFAULT IS TO DO ALL, BUT IF ANY ARE SPECIFIED, ONLY THOSE ARE DONE):
--dropremake Drop/remake destination tables, and admin tables
except opt-out tables. (default: False)
--drop_all Drop all destination tables known to the data
dictionary, and all admin tables, then stop. (May also
be helpful in revealing leftover tables in the
destination database, e.g. if the data dictionary has
changed.) (default: False)
--optout Update opt-out list in administrative database.
(default: False)
--nonpatienttables Process non-patient tables only. (default: False)
--patienttables Process patient tables only. (default: False)
--index Create indexes only. (default: False)
RESTRICTION OPTIONS:
--restrict RESTRICT Restrict which patients are processed. Specify which
field to base the restriction on or 'pid' for patient
ids. (default: None)
--limits LIMITS LIMITS
Specify lower and upper limits of the field specified
in '--restrict'. (default: None)
--file FILE Specify a file with a list of values for the field
specified in '--restrict'. (default: None)
--list LIST [LIST ...]
Specify a list of values for the field specified in
'--restrict'. (default: None)
--free_text_limit FREE_TEXT_LIMIT
Filter out all free text fields over the specified
length. For example, if you specify 200, then
VARCHAR(200) fields will be permitted, but
VARCHAR(201), or VARCHAR(MAX), or TEXT (etc., etc.)
fields will be excluded. (default: None)
--excludescrubbed Exclude all text fields which are being scrubbed.
(default: False)
PROCESSING OPTIONS:
--process [PROCESS] For multiprocess mode: specify process number.
(default: 0)
--nprocesses [NPROCESSES]
For multiprocess mode: specify total number of
processes (launched somehow, of which this is to be
one). (default: 1)
--processcluster PROCESSCLUSTER
Process cluster name (used as part of log name).
(default: )
--skip_dd_check Skip data dictionary validity check. (default: False)
--seed SEED String to use as the basis of the seed for the random
number generator used for the transient integer RID
(TRID). Leave blank to use the default seed (system
time). (default: None)
--chunksize [CHUNKSIZE]
Number of records copied in a chunk when copying PKs
from one database to another. (default: 100000)
REPORTING AND DEBUGGING:
--reportevery [REPORTEVERY]
Report insert progress every n rows in verbose mode.
(default: 100000)
--debugscrubbers Report sensitive scrubbing information, for debugging.
(default: False)
--savescrubbers Saves sensitive scrubbing information in admin
database, for debugging. (default: False)
--echo Echo SQL. (default: False)
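The restriction options combine as follows: --restrict names the field (or 'pid'), and one of --list, --limits or --file supplies the values. The sketch below builds two such commands. The patient IDs and limits are made-up values for illustration, and the guard that skips execution when CRATE is absent is an addition, not part of the tool.

```shell
# Hypothetical patient IDs; replace with values from your source data.
pids="1001 1002 1003"
list_cmd="crate_anonymise --restrict pid --list $pids"

# Alternative: give a lower and an upper limit instead of an explicit list.
range_cmd="crate_anonymise --restrict pid --limits 1000 2000"

echo "$list_cmd"
echo "$range_cmd"

# Run the list-based variant only if CRATE is installed.
# $list_cmd is left unquoted on purpose so the shell splits it into arguments.
if command -v crate_anonymise >/dev/null 2>&1; then
    $list_cmd
fi
```

Processing a small subset like this can be a quick way to check a new config file and data dictionary before committing to a full run.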
6.5.2. crate_anonymise_multiprocess
This runs multiple copies of crate_anonymise in parallel.
Options:
USAGE: crate_anonymise_multiprocess [-h] [--nproc [NPROC]] [--verbose]
Runs the CRATE anonymiser in parallel. Version 0.20.0 (2023-02-14). Note that
all arguments not specified here are passed to the underlying script (see
crate_anonymise --help).
OPTIONAL ARGUMENTS:
-h, --help show this help message and exit
--nproc, -n [NPROC] Number of processes (default is the number of CPUs on
this machine) (default: 8)
--verbose, -v Be verbose (default: False)
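Because arguments that the launcher does not recognise are passed to the underlying crate_anonymise, mode options such as --incremental can be given directly on the multiprocess command line. A minimal sketch, assuming four processes are wanted (an arbitrary illustrative number) and that the config file is already set via CRATE_ANON_CONFIG:

```shell
# Hypothetical process count; if --nproc is omitted, the tool uses the
# number of CPUs on this machine.
n_processes=4

# --nproc is consumed by the launcher; --incremental is passed through
# to each underlying crate_anonymise process.
cmd="crate_anonymise_multiprocess --nproc $n_processes --incremental"
echo "Command: $cmd"

# Run only if CRATE is installed; otherwise this is just a dry run.
if command -v crate_anonymise_multiprocess >/dev/null 2>&1; then
    $cmd
fi
```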
6.5.3. crate_anon_show_counts
This ancillary tool prints record counts from your source and destination databases.
USAGE: crate_anon_show_counts [-h] [--config CONFIG] [--verbose]
Print record counts from source/destination databases. (CRATE version 0.20.0,
2023-02-14. Created by Rudolf Cardinal.)
OPTIONAL ARGUMENTS:
-h, --help show this help message and exit
--config CONFIG Config file (overriding environment variable
CRATE_ANON_CONFIG). (default: None)
--verbose, -v Be verbose (default: False)
6.5.4. crate_anon_check_text_extractor
This ancillary tool checks that you have the text extraction software that you might want. See third-party text extractors.
USAGE: crate_anon_check_text_extractor [-h]
[checkextractor [checkextractor ...]]
Check availability of tools to extract text from different document formats.
(CRATE version 0.20.0, 2023-02-14. Created by Rudolf Cardinal.)
POSITIONAL ARGUMENTS:
checkextractor File extensions to check for availability of a text
extractor. Try, for example, '.doc .docx .odt .pdf .rtf .txt
None' (use a '.' prefix for all extensions, and use the
special extension 'None' to check the fallback processor).
(default: None)
OPTIONAL ARGUMENTS:
-h, --help show this help message and exit
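For instance, the extension list suggested in the help text can be passed as-is. This is only a sketch: the guard that skips execution when the tool is missing is an addition, and the output format is not described here, so none is shown.

```shell
# Extensions taken from the help text; 'None' checks the fallback processor.
extensions=".doc .docx .odt .pdf .rtf .txt None"

cmd="crate_anon_check_text_extractor $extensions"
echo "Command: $cmd"

# Run only if CRATE is installed; otherwise this is just a dry run.
if command -v crate_anon_check_text_extractor >/dev/null 2>&1; then
    $cmd
fi
```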
6.5.5. crate_anon_summarize_dd
This ancillary tool reads your data dictionary and summarizes facts about each table. It may help you find problems with large data dictionaries.
USAGE: crate_anon_summarize_dd [-h] [--config CONFIG] [--verbose]
[--output OUTPUT]
Draft a data dictionary for the anonymiser. (CRATE version 0.20.0, 2023-02-14.
Created by Rudolf Cardinal.)
OPTIONAL ARGUMENTS:
-h, --help show this help message and exit
--config CONFIG Config file (overriding environment variable
CRATE_ANON_CONFIG). (default: None)
--verbose, -v Be verbose (default: False)
--output OUTPUT File for output; use '-' for stdout. (default: -)
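As a short sketch of saving the summary to a file rather than printing it to stdout (the default): the output filename is a hypothetical choice, and the config file is assumed to be set via CRATE_ANON_CONFIG.

```shell
# Hypothetical output filename; use '-' instead to write to stdout.
summary_file="dd_summary.txt"

cmd="crate_anon_summarize_dd --output $summary_file"
echo "Command: $cmd"

# Run only if CRATE is installed; otherwise this is just a dry run.
if command -v crate_anon_summarize_dd >/dev/null 2>&1; then
    $cmd
fi
```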