6.5. Run the anonymiser

Now you’ve created and edited your config file and data dictionary, you can run the anonymiser in one of the following ways:

crate_anonymise --full
crate_anonymise --incremental
crate_anonymise_multiprocess --full
crate_anonymise_multiprocess --incremental

The ‘multiprocess’ versions are faster (if you have a multi-core/-CPU computer). The ‘full’ option destroys the destination database and starts again. The ‘incremental’ one brings the destination database up to date (creating it if necessary). The default is ‘incremental’, for safety reasons.

Get more help with

crate_anonymise --help

6.5.1. crate_anonymise

This runs a single-process anonymiser.

Options:

USAGE: crate_anonymise [-h] [--config CONFIG] [--version] [--verbose]
                       [-i | -f] [--skipdelete] [--dropremake] [--drop_all]
                       [--optout] [--nonpatienttables] [--patienttables]
                       [--index] [--restrict RESTRICT]
                       [--limits LIMITS LIMITS] [--file FILE]
                       [--list LIST [LIST ...]]
                       [--free_text_limit FREE_TEXT_LIMIT] [--excludescrubbed]
                       [--process [PROCESS]] [--nprocesses [NPROCESSES]]
                       [--processcluster PROCESSCLUSTER] [--skip_dd_check]
                       [--seed SEED] [--chunksize [CHUNKSIZE]]
                       [--reportevery [REPORTEVERY]] [--debugscrubbers]
                       [--savescrubbers] [--echo]

Database anonymiser. (CRATE version 0.20.4, 2023-10-17. Created by Rudolf
Cardinal.)

OPTIONS:
  -h, --help            show this help message and exit
  --config CONFIG       Config file (overriding environment variable
                        CRATE_ANON_CONFIG). (default: None)
  --version             show program's version number and exit
  --verbose, -v         Be verbose (default: False)

MODE OPTIONS:
  -i, --incremental     Process only new/changed information, where possible.
                        (default: True)
  -f, --full            Drop and remake everything. (default: False)
  --skipdelete          For incremental updates, skip deletion of rows present
                        in the destination but not the source. (default:
                        False)

ACTION OPTIONS (DEFAULT IS TO DO ALL, BUT IF ANY ARE SPECIFIED, ONLY THOSE ARE DONE):
  --dropremake          Drop/remake destination tables, and admin tables
                        except opt-out tables. (default: False)
  --drop_all            Drop all destination tables known to the data
                        dictionary, and all admin tables, then stop. (May also
                        be helpful in revealing leftover tables in the
                        destination database, e.g. if the data dictionary has
                        changed.) (default: False)
  --optout              Update opt-out list in administrative database.
                        (default: False)
  --nonpatienttables    Process non-patient tables only. (default: False)
  --patienttables       Process patient tables only. (default: False)
  --index               Create indexes only. (default: False)

RESTRICTION OPTIONS:
  --restrict RESTRICT   Restrict which patients are processed. Specify which
                        field to base the restriction on or 'pid' for patient
                        ids. (default: None)
  --limits LIMITS LIMITS
                        Specify lower and upper limits of the field specified
                        in '--restrict'. (default: None)
  --file FILE           Specify a file with a list of values for the field
                        specified in '--restrict'. (default: None)
  --list LIST [LIST ...]
                        Specify a list of values for the field specified in
                        '--restrict'. (default: None)
  --free_text_limit FREE_TEXT_LIMIT
                        Filter out all free text fields over the specified
                        length. For example, if you specify 200, then
                        VARCHAR(200) fields will be permitted, but
                        VARCHAR(200), or VARCHAR(MAX), or TEXT (etc., etc.)
                        fields will be excluded. (default: None)
  --excludescrubbed     Exclude all text fields which are being scrubbed.
                        (default: False)

PROCESSING OPTIONS:
  --process [PROCESS]   For multiprocess mode: specify process number.
                        (default: 0)
  --nprocesses [NPROCESSES]
                        For multiprocess mode: specify total number of
                        processes (launched somehow, of which this is to be
                        one). (default: 1)
  --processcluster PROCESSCLUSTER
                        Process cluster name (used as part of log name).
                        (default: )
  --skip_dd_check       Skip data dictionary validity check. (default: False)
  --seed SEED           String to use as the basis of the seed for the random
                        number generator used for the transient integer RID
                        (TRID). Leave blank to use the default seed (system
                        time). (default: None)
  --chunksize [CHUNKSIZE]
                        Number of records copied in a chunk when copying PKs
                        from one database to another. (default: 100000)

REPORTING AND DEBUGGING:
  --reportevery [REPORTEVERY]
                        Report insert progress every n rows in verbose mode.
                        (default: 100000)
  --debugscrubbers      Report sensitive scrubbing information, for debugging.
                        (default: False)
  --savescrubbers       Saves sensitive scrubbing information in admin
                        database, for debugging. (default: False)
  --echo                Echo SQL. (default: False)

6.5.2. crate_anonymise_multiprocess

This runs multiple copies of crate_anonymise in parallel.

Options:

USAGE: crate_anonymise_multiprocess [-h] [--nproc [NPROC]] [--verbose]

Runs the CRATE anonymiser in parallel. Version 0.20.4 (2023-10-17). Note that
all arguments not specified here are passed to the underlying script (see
crate_anonymise --help).

OPTIONS:
  -h, --help            show this help message and exit
  --nproc, -n [NPROC]   Number of processes (default is the number of CPUs on
                        this machine) (default: 8)
  --verbose, -v         Be verbose (default: False)

6.5.3. crate_anon_show_counts

This ancillary tool prints record counts from your source and destination databases.

USAGE: crate_anon_show_counts [-h] [--config CONFIG] [--verbose]

Print record counts from source/destination databases. (CRATE version 0.20.4,
2023-10-17. Created by Rudolf Cardinal.)

OPTIONS:
  -h, --help       show this help message and exit
  --config CONFIG  Config file (overriding environment variable
                   CRATE_ANON_CONFIG). (default: None)
  --verbose, -v    Be verbose (default: False)

6.5.4. crate_anon_check_text_extractor

This ancillary tool checks that you have the text extraction software that you might want. See third-party text extractors.

USAGE: crate_anon_check_text_extractor [-h] [checkextractor ...]

Check availability of tools to extract text from different document formats.
(CRATE version 0.20.4, 2023-10-17. Created by Rudolf Cardinal.)

POSITIONAL ARGUMENTS:
  checkextractor  File extensions to check for availability of a text
                  extractor. Try, for example, '.doc .docx .odt .pdf .rtf .txt
                  None' (use a '.' prefix for all extensions, and use the
                  special extension 'None' to check the fallback processor).
                  (default: None)

OPTIONS:
  -h, --help      show this help message and exit

6.5.5. crate_anon_summarize_dd

This ancillary tool reads your data dictionary and summarizes facts about each table. It may be helpful to find problems with large data dictionaries.

USAGE: crate_anon_summarize_dd [-h] [--config CONFIG] [--verbose]
                               [--output OUTPUT]

Summarize a data dictionary for the anonymiser. The resulting
spreadsheet-style report has one row per source table. (CRATE version 0.20.4,
2023-10-17. Created by Rudolf Cardinal.)

OPTIONS:
  -h, --help       show this help message and exit
  --config CONFIG  Config file (overriding environment variable
                   CRATE_ANON_CONFIG). (default: None)
  --verbose, -v    Be verbose (default: False)
  --output OUTPUT  File for output; use '-' for stdout. (default: -)

6.5.6. crate_researcher_report

This ancillary tool reads your destination database (and data dictionary) and generates a PDF report intended for use by researchers. Optionally, it can include row counts and specimen values or value ranges.

USAGE: crate_researcher_report [-h] [--config CONFIG] [--noconfig]
                               [--db_url DB_URL] [--db_name DB_NAME]
                               [--show_url] [--no_show_url] [--show_counts]
                               [--no_show_counts] [--use_dd] [--no_use_dd]
                               [--show_values] [--no_show_values]
                               [--max_distinct_values MAX_DISTINCT_VALUES]
                               [--skip_values_if_too_many]
                               [--max_value_length MAX_VALUE_LENGTH]
                               [--page_size PAGE_SIZE]
                               [--margin_left_right MARGIN_LEFT_RIGHT]
                               [--margin_top_bottom MARGIN_TOP_BOTTOM]
                               [--header_footer_spacing_mm HEADER_FOOTER_SPACING_MM]
                               [--orientation {portrait,landscape}]
                               [--base_font_size BASE_FONT_SIZE] [--verbose]
                               [--debug_pdf]
                               output

Produce a researcher-oriented PDF report about a destination database.
(CRATE version 0.20.4, 2023-10-17. Created by Rudolf Cardinal.)

Note: if wkhtmtopdf reports 'Too many open files', see
- https://stackoverflow.com/q/25355697;
- https://github.com/wkhtmltopdf/wkhtmltopdf/issues/3081;
setting e.g. "ulimit -n 2048" is one solution.

POSITIONAL ARGUMENTS:
  output                PDF output filename

OPTIONS:
  -h, --help            show this help message and exit

DATABASE:
  --config CONFIG       CRATE anonymisation config file, overriding
                        environment variable CRATE_ANON_CONFIG (default: None)
  --noconfig            Do not use a config file (unusual) (default: False)
  --db_url DB_URL       Database URL, overriding that in the config file
                        (default: None)
  --db_name DB_NAME     Database name, overriding that in the config file;
                        must be specified if you use --db_url (default: None)

DETAIL:
  --show_url            Include sanitised, password-safe version of database
                        URL (default: False)
  --no_show_url         Do not include database URL (default: True)
  --show_counts         Include row counts for each table (default: True)
  --no_show_counts      Do not include row counts (default: False)
  --use_dd              Use information obtainable from the CRATE data
                        dictionary (DD), including comments, annotations, and
                        value suppression for potentially sensitive fields;
                        only sensible for reporting on a database completely
                        unrelated to the DD (default: True)
  --no_use_dd           Do not use information from the CRATE data dictionary
                        (default: False)
  --show_values         Include specimen values/ranges (default: True)
  --no_show_values      Do not include specimen values/ranges (default: False)
  --max_distinct_values MAX_DISTINCT_VALUES
                        Maximum number of distinct values to show, if
                        applicable (default: 20)
  --skip_values_if_too_many
                        If showing values, and there are more distinct values
                        than the maximum, omit them (rather than showing the
                        first few)? (default: False)
  --max_value_length MAX_VALUE_LENGTH
                        Maximum string length to show for a literal value
                        (default: 50)

VISUALS:
  --page_size PAGE_SIZE
                        Page size, i.e. paper type (default: A4)
  --margin_left_right MARGIN_LEFT_RIGHT
                        Page left/right margins, with units (default: 15mm)
  --margin_top_bottom MARGIN_TOP_BOTTOM
                        Page top/bottom margins for content, ignoring
                        header/footer (see --header_footer_spacing_mm), with
                        units (default: 18mm)
  --header_footer_spacing_mm HEADER_FOOTER_SPACING_MM
                        Gap between content and header/footer, in mm (default:
                        3)
  --orientation {portrait,landscape}
                        Page orientation (default: landscape)
  --base_font_size BASE_FONT_SIZE
                        Base font size, with units (default: 11pt)

PROGRESS:
  --verbose, -v         Be verbose (default: False)
  --debug_pdf           Debug PDF creation (default: False)

6.5.7. crate_subset_db

This ancillary tool makes a subset of a database. It’s typically for use by database administrators to create custom subset of data for specific research projects.

USAGE: crate_subset_db [-h] --src_db_url SRC_DB_URL --dst_db_url DST_DB_URL
                       [--filter_column FILTER_COLUMN]
                       [--filter_values [FILTER_VALUES ...]]
                       [--filter_value_filenames [FILTER_VALUE_FILENAMES ...]]
                       [--filter_value_db_urls [FILTER_VALUE_DB_URLS ...]]
                       [--filter_value_tablecols [FILTER_VALUE_TABLECOLS ...]]
                       [--include_rows_filtercol_null [INCLUDE_ROWS_FILTERCOL_NULL]]
                       [--include_tables_without_filtercol [INCLUDE_TABLES_WITHOUT_FILTERCOL]]
                       [--include_tables [INCLUDE_TABLES ...]]
                       [--include_table_filenames [INCLUDE_TABLE_FILENAMES ...]]
                       [--exclude_tables [EXCLUDE_TABLES ...]]
                       [--exclude_table_filenames [EXCLUDE_TABLE_FILENAMES ...]]
                       [--verbose] [--echo]

Create a simple subset of a database, copying one database to another while applying filters. You can filter by a standard column (e.g. one representing patient IDs), taking permitted filter values from the command line, from file(s), and/or from database(s). You can also decide which tables to include/exclude. (CRATE version 0.20.4, 2023-10-17. Created by Rudolf Cardinal.)

OPTIONS:
  -h, --help            show this help message and exit

SOURCE DATABASE:
  --src_db_url SRC_DB_URL
                        Source database SQLAlchemy URL (default: None)

DESTINATION DATABASE:
  --dst_db_url DST_DB_URL
                        Destination database SQLAlchemy URL (default: None)

ROW FILTERING:
  --filter_column FILTER_COLUMN
                        Column on which to filter. Typically the one that
                        defines individuals (e.g. 'patient_research_id',
                        'rid', 'brcid'). If omitted, then the whole database
                        might be copied unfiltered (if you set
                        --include_tables_without_filtercol). (default: None)
  --filter_values [FILTER_VALUES ...]
                        Filter values to permit. (Comparison is performed as
                        strings.) (default: None)
  --filter_value_filenames [FILTER_VALUE_FILENAMES ...]
                        Filename(s) of files containing filter values to
                        permit. ('#' denotes comments in the file. Comparison
                        is performed as strings.) (default: None)
  --filter_value_db_urls [FILTER_VALUE_DB_URLS ...]
                        SQLAlchemy URLs of databases to pull additional filter
                        values from. Must be in the same order as
                        corresponding arguments to --filter_value_tablecols.
                        (default: None)
  --filter_value_tablecols [FILTER_VALUE_TABLECOLS ...]
                        Table/column pairs, each expressed as 'table.column',
                        of database columns to pull additional filter values
                        from. Must be in the same order as corresponding
                        arguments to --filter_value_db_urls. (default: None)
  --include_rows_filtercol_null [INCLUDE_ROWS_FILTERCOL_NULL]
                        Include rows where the filter column is NULL. You
                        can't otherwise specify NULL as a permitted value (at
                        least, not from the command line or from files).
                        (Specify as yes/y/true/t/1 or no/n/false/f/0.)
                        (default: False)

TABLE FILTERING:
  --include_tables_without_filtercol [INCLUDE_TABLES_WITHOUT_FILTERCOL]
                        Include tables that do not possess the filter column
                        (e.g. system/lookup tables). (Specify as
                        yes/y/true/t/1 or no/n/false/f/0.) (default: True)
  --include_tables [INCLUDE_TABLES ...]
                        Names of tables to include. (If 'include' tables are
                        given, only tables explicitly named are included. If
                        no 'include' tables are specified, all tables are
                        included by default. Explicit excluding always
                        overrides including.) (default: None)
  --include_table_filenames [INCLUDE_TABLE_FILENAMES ...]
                        Filename(s) of files containing names of tables to
                        include. (If 'include' tables are given, only tables
                        explicitly named are included. If no 'include' tables
                        are specified, all tables are included by default.
                        Explicit excluding always overrides including.)
                        (default: None)
  --exclude_tables [EXCLUDE_TABLES ...]
                        Names of tables to exclude. (default: None)
  --exclude_table_filenames [EXCLUDE_TABLE_FILENAMES ...]
                        Filename(s) of files containing names of tables to
                        exclude. (default: None)

PROGRESS:
  --verbose, -v         Be verbose (default: False)
  --echo                Echo SQL (slow; for debugging only) (default: False)