6.5. Run the anonymiser
Now you’ve created and edited your config file and data dictionary, you can run the anonymiser in one of the following ways:
crate_anonymise --full
crate_anonymise --incremental
crate_anonymise_multiprocess --full
crate_anonymise_multiprocess --incremental
The ‘multiprocess’ versions are faster (if you have a multi-core/-CPU computer). The ‘full’ option destroys the destination database and starts again. The ‘incremental’ one brings the destination database up to date (creating it if necessary). The default is ‘incremental’, for safety reasons.
Get more help with
crate_anonymise --help
6.5.1. crate_anonymise
This runs a single-process anonymiser.
Options:
USAGE: crate_anonymise [-h] [--config CONFIG] [--version] [--verbose]
[-i | -f] [--skipdelete] [--dropremake] [--drop_all]
[--optout] [--nonpatienttables] [--patienttables]
[--index] [--restrict RESTRICT]
[--limits LIMITS LIMITS] [--file FILE]
[--list LIST [LIST ...]]
[--free_text_limit FREE_TEXT_LIMIT] [--excludescrubbed]
[--process [PROCESS]] [--nprocesses [NPROCESSES]]
[--processcluster PROCESSCLUSTER] [--skip_dd_check]
[--seed SEED] [--chunksize [CHUNKSIZE]]
[--reportevery [REPORTEVERY]] [--debugscrubbers]
[--savescrubbers] [--echo]
Database anonymiser. (CRATE version 0.20.4, 2023-10-17. Created by Rudolf
Cardinal.)
OPTIONS:
-h, --help show this help message and exit
--config CONFIG Config file (overriding environment variable
CRATE_ANON_CONFIG). (default: None)
--version show program's version number and exit
--verbose, -v Be verbose (default: False)
MODE OPTIONS:
-i, --incremental Process only new/changed information, where possible.
(default: True)
-f, --full Drop and remake everything. (default: False)
--skipdelete For incremental updates, skip deletion of rows present
in the destination but not the source. (default:
False)
ACTION OPTIONS (DEFAULT IS TO DO ALL, BUT IF ANY ARE SPECIFIED, ONLY THOSE ARE DONE):
--dropremake Drop/remake destination tables, and admin tables
except opt-out tables. (default: False)
--drop_all Drop all destination tables known to the data
dictionary, and all admin tables, then stop. (May also
be helpful in revealing leftover tables in the
destination database, e.g. if the data dictionary has
changed.) (default: False)
--optout Update opt-out list in administrative database.
(default: False)
--nonpatienttables Process non-patient tables only. (default: False)
--patienttables Process patient tables only. (default: False)
--index Create indexes only. (default: False)
RESTRICTION OPTIONS:
--restrict RESTRICT Restrict which patients are processed. Specify which
field to base the restriction on or 'pid' for patient
ids. (default: None)
--limits LIMITS LIMITS
Specify lower and upper limits of the field specified
in '--restrict'. (default: None)
--file FILE Specify a file with a list of values for the field
specified in '--restrict'. (default: None)
--list LIST [LIST ...]
Specify a list of values for the field specified in
'--restrict'. (default: None)
--free_text_limit FREE_TEXT_LIMIT
Filter out all free text fields over the specified
length. For example, if you specify 200, then
VARCHAR(200) fields will be permitted, but
VARCHAR(200), or VARCHAR(MAX), or TEXT (etc., etc.)
fields will be excluded. (default: None)
--excludescrubbed Exclude all text fields which are being scrubbed.
(default: False)
PROCESSING OPTIONS:
--process [PROCESS] For multiprocess mode: specify process number.
(default: 0)
--nprocesses [NPROCESSES]
For multiprocess mode: specify total number of
processes (launched somehow, of which this is to be
one). (default: 1)
--processcluster PROCESSCLUSTER
Process cluster name (used as part of log name).
(default: )
--skip_dd_check Skip data dictionary validity check. (default: False)
--seed SEED String to use as the basis of the seed for the random
number generator used for the transient integer RID
(TRID). Leave blank to use the default seed (system
time). (default: None)
--chunksize [CHUNKSIZE]
Number of records copied in a chunk when copying PKs
from one database to another. (default: 100000)
REPORTING AND DEBUGGING:
--reportevery [REPORTEVERY]
Report insert progress every n rows in verbose mode.
(default: 100000)
--debugscrubbers Report sensitive scrubbing information, for debugging.
(default: False)
--savescrubbers Saves sensitive scrubbing information in admin
database, for debugging. (default: False)
--echo Echo SQL. (default: False)
6.5.2. crate_anonymise_multiprocess
This runs multiple copies of crate_anonymise
in parallel.
Options:
USAGE: crate_anonymise_multiprocess [-h] [--nproc [NPROC]] [--verbose]
Runs the CRATE anonymiser in parallel. Version 0.20.4 (2023-10-17). Note that
all arguments not specified here are passed to the underlying script (see
crate_anonymise --help).
OPTIONS:
-h, --help show this help message and exit
--nproc, -n [NPROC] Number of processes (default is the number of CPUs on
this machine) (default: 8)
--verbose, -v Be verbose (default: False)
6.5.3. crate_anon_show_counts
This ancillary tool prints record counts from your source and destination databases.
USAGE: crate_anon_show_counts [-h] [--config CONFIG] [--verbose]
Print record counts from source/destination databases. (CRATE version 0.20.4,
2023-10-17. Created by Rudolf Cardinal.)
OPTIONS:
-h, --help show this help message and exit
--config CONFIG Config file (overriding environment variable
CRATE_ANON_CONFIG). (default: None)
--verbose, -v Be verbose (default: False)
6.5.4. crate_anon_check_text_extractor
This ancillary tool checks that you have the text extraction software that you might want. See third-party text extractors.
USAGE: crate_anon_check_text_extractor [-h] [checkextractor ...]
Check availability of tools to extract text from different document formats.
(CRATE version 0.20.4, 2023-10-17. Created by Rudolf Cardinal.)
POSITIONAL ARGUMENTS:
checkextractor File extensions to check for availability of a text
extractor. Try, for example, '.doc .docx .odt .pdf .rtf .txt
None' (use a '.' prefix for all extensions, and use the
special extension 'None' to check the fallback processor).
(default: None)
OPTIONS:
-h, --help show this help message and exit
6.5.5. crate_anon_summarize_dd
This ancillary tool reads your data dictionary and summarizes facts about each table. It may be helpful to find problems with large data dictionaries.
USAGE: crate_anon_summarize_dd [-h] [--config CONFIG] [--verbose]
[--output OUTPUT]
Summarize a data dictionary for the anonymiser. The resulting
spreadsheet-style report has one row per source table. (CRATE version 0.20.4,
2023-10-17. Created by Rudolf Cardinal.)
OPTIONS:
-h, --help show this help message and exit
--config CONFIG Config file (overriding environment variable
CRATE_ANON_CONFIG). (default: None)
--verbose, -v Be verbose (default: False)
--output OUTPUT File for output; use '-' for stdout. (default: -)
6.5.6. crate_researcher_report
This ancillary tool reads your destination database (and data dictionary) and generates a PDF report intended for use by researchers. Optionally, it can include row counts and specimen values or value ranges.
USAGE: crate_researcher_report [-h] [--config CONFIG] [--noconfig]
[--db_url DB_URL] [--db_name DB_NAME]
[--show_url] [--no_show_url] [--show_counts]
[--no_show_counts] [--use_dd] [--no_use_dd]
[--show_values] [--no_show_values]
[--max_distinct_values MAX_DISTINCT_VALUES]
[--skip_values_if_too_many]
[--max_value_length MAX_VALUE_LENGTH]
[--page_size PAGE_SIZE]
[--margin_left_right MARGIN_LEFT_RIGHT]
[--margin_top_bottom MARGIN_TOP_BOTTOM]
[--header_footer_spacing_mm HEADER_FOOTER_SPACING_MM]
[--orientation {portrait,landscape}]
[--base_font_size BASE_FONT_SIZE] [--verbose]
[--debug_pdf]
output
Produce a researcher-oriented PDF report about a destination database.
(CRATE version 0.20.4, 2023-10-17. Created by Rudolf Cardinal.)
Note: if wkhtmtopdf reports 'Too many open files', see
- https://stackoverflow.com/q/25355697;
- https://github.com/wkhtmltopdf/wkhtmltopdf/issues/3081;
setting e.g. "ulimit -n 2048" is one solution.
POSITIONAL ARGUMENTS:
output PDF output filename
OPTIONS:
-h, --help show this help message and exit
DATABASE:
--config CONFIG CRATE anonymisation config file, overriding
environment variable CRATE_ANON_CONFIG (default: None)
--noconfig Do not use a config file (unusual) (default: False)
--db_url DB_URL Database URL, overriding that in the config file
(default: None)
--db_name DB_NAME Database name, overriding that in the config file;
must be specified if you use --db_url (default: None)
DETAIL:
--show_url Include sanitised, password-safe version of database
URL (default: False)
--no_show_url Do not include database URL (default: True)
--show_counts Include row counts for each table (default: True)
--no_show_counts Do not include row counts (default: False)
--use_dd Use information obtainable from the CRATE data
dictionary (DD), including comments, annotations, and
value suppression for potentially sensitive fields;
only sensible for reporting on a database completely
unrelated to the DD (default: True)
--no_use_dd Do not use information from the CRATE data dictionary
(default: False)
--show_values Include specimen values/ranges (default: True)
--no_show_values Do not include specimen values/ranges (default: False)
--max_distinct_values MAX_DISTINCT_VALUES
Maximum number of distinct values to show, if
applicable (default: 20)
--skip_values_if_too_many
If showing values, and there are more distinct values
than the maximum, omit them (rather than showing the
first few)? (default: False)
--max_value_length MAX_VALUE_LENGTH
Maximum string length to show for a literal value
(default: 50)
VISUALS:
--page_size PAGE_SIZE
Page size, i.e. paper type (default: A4)
--margin_left_right MARGIN_LEFT_RIGHT
Page left/right margins, with units (default: 15mm)
--margin_top_bottom MARGIN_TOP_BOTTOM
Page top/bottom margins for content, ignoring
header/footer (see --header_footer_spacing_mm), with
units (default: 18mm)
--header_footer_spacing_mm HEADER_FOOTER_SPACING_MM
Gap between content and header/footer, in mm (default:
3)
--orientation {portrait,landscape}
Page orientation (default: landscape)
--base_font_size BASE_FONT_SIZE
Base font size, with units (default: 11pt)
PROGRESS:
--verbose, -v Be verbose (default: False)
--debug_pdf Debug PDF creation (default: False)
6.5.7. crate_subset_db
This ancillary tool makes a subset of a database. It’s typically for use by database administrators to create custom subset of data for specific research projects.
USAGE: crate_subset_db [-h] --src_db_url SRC_DB_URL --dst_db_url DST_DB_URL
[--filter_column FILTER_COLUMN]
[--filter_values [FILTER_VALUES ...]]
[--filter_value_filenames [FILTER_VALUE_FILENAMES ...]]
[--filter_value_db_urls [FILTER_VALUE_DB_URLS ...]]
[--filter_value_tablecols [FILTER_VALUE_TABLECOLS ...]]
[--include_rows_filtercol_null [INCLUDE_ROWS_FILTERCOL_NULL]]
[--include_tables_without_filtercol [INCLUDE_TABLES_WITHOUT_FILTERCOL]]
[--include_tables [INCLUDE_TABLES ...]]
[--include_table_filenames [INCLUDE_TABLE_FILENAMES ...]]
[--exclude_tables [EXCLUDE_TABLES ...]]
[--exclude_table_filenames [EXCLUDE_TABLE_FILENAMES ...]]
[--verbose] [--echo]
Create a simple subset of a database, copying one database to another while applying filters. You can filter by a standard column (e.g. one representing patient IDs), taking permitted filter values from the command line, from file(s), and/or from database(s). You can also decide which tables to include/exclude. (CRATE version 0.20.4, 2023-10-17. Created by Rudolf Cardinal.)
OPTIONS:
-h, --help show this help message and exit
SOURCE DATABASE:
--src_db_url SRC_DB_URL
Source database SQLAlchemy URL (default: None)
DESTINATION DATABASE:
--dst_db_url DST_DB_URL
Destination database SQLAlchemy URL (default: None)
ROW FILTERING:
--filter_column FILTER_COLUMN
Column on which to filter. Typically the one that
defines individuals (e.g. 'patient_research_id',
'rid', 'brcid'). If omitted, then the whole database
might be copied unfiltered (if you set
--include_tables_without_filtercol). (default: None)
--filter_values [FILTER_VALUES ...]
Filter values to permit. (Comparison is performed as
strings.) (default: None)
--filter_value_filenames [FILTER_VALUE_FILENAMES ...]
Filename(s) of files containing filter values to
permit. ('#' denotes comments in the file. Comparison
is performed as strings.) (default: None)
--filter_value_db_urls [FILTER_VALUE_DB_URLS ...]
SQLAlchemy URLs of databases to pull additional filter
values from. Must be in the same order as
corresponding arguments to --filter_value_tablecols.
(default: None)
--filter_value_tablecols [FILTER_VALUE_TABLECOLS ...]
Table/column pairs, each expressed as 'table.column',
of database columns to pull additional filter values
from. Must be in the same order as corresponding
arguments to --filter_value_db_urls. (default: None)
--include_rows_filtercol_null [INCLUDE_ROWS_FILTERCOL_NULL]
Include rows where the filter column is NULL. You
can't otherwise specify NULL as a permitted value (at
least, not from the command line or from files).
(Specify as yes/y/true/t/1 or no/n/false/f/0.)
(default: False)
TABLE FILTERING:
--include_tables_without_filtercol [INCLUDE_TABLES_WITHOUT_FILTERCOL]
Include tables that do not possess the filter column
(e.g. system/lookup tables). (Specify as
yes/y/true/t/1 or no/n/false/f/0.) (default: True)
--include_tables [INCLUDE_TABLES ...]
Names of tables to include. (If 'include' tables are
given, only tables explicitly named are included. If
no 'include' tables are specified, all tables are
included by default. Explicit excluding always
overrides including.) (default: None)
--include_table_filenames [INCLUDE_TABLE_FILENAMES ...]
Filename(s) of files containing names of tables to
include. (If 'include' tables are given, only tables
explicitly named are included. If no 'include' tables
are specified, all tables are included by default.
Explicit excluding always overrides including.)
(default: None)
--exclude_tables [EXCLUDE_TABLES ...]
Names of tables to exclude. (default: None)
--exclude_table_filenames [EXCLUDE_TABLE_FILENAMES ...]
Filename(s) of files containing names of tables to
exclude. (default: None)
PROGRESS:
--verbose, -v Be verbose (default: False)
--echo Echo SQL (slow; for debugging only) (default: False)