14.1.7. crate_anon.anonymise.config

crate_anon/anonymise/config.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Config class for CRATE anonymiser.

Thoughts on configuration method

  • First version used a Config() class, which initializes with blank values. The anonymise_cli.py file creates a config singleton and passes it around. Then when its set() method is called, it reads a config file and instantiates its settings. An option exists to print a draft config without ever reading one from disk.

    Advantage: easy to start the program without a valid config file (e.g. to print one).

    Disadvantage: modules can’t be sure that a config is properly instantiated before they are loaded, so you can’t easily define a class according to config settings (you’d have to have a class factory, which gets ugly).

  • The Django method is to have a configuration file (e.g. settings.py, which can import from other things) that is read by Django and then becomes importable by anything at startup as django.conf.settings. (I’ve added local settings via an environment variable.) The way Django knows where to look is via this in manage.py:

    os.environ.setdefault("DJANGO_SETTINGS_MODULE",
                          "crate_anon.crateweb.config.settings")
    

    Advantage: setting the config file via an environment variable (read when the config file loads) allows guaranteed config existence as other modules start.

    Further advantage: config filenames not on command line, therefore not visible to ps.

    Disadvantage: how do you override with a command-line (argparse) setting? (Though: who cares?)

    To print a config using that file: raise an exception on nonexistent config, and catch it with a special entry point script.

  • See also https://stackoverflow.com/questions/7443366/argument-passing-strategy-environment-variables-vs-command-line
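The environment-variable approach described in the second bullet can be sketched as follows. This is a minimal illustration, not CRATE's actual startup code: MissingConfigError, get_config_filename() and main() are hypothetical names, and only the CRATE_ANON_CONFIG variable name comes from this documentation.

```python
import os

CONFIG_ENV_VAR = "CRATE_ANON_CONFIG"  # variable name documented for Config


class MissingConfigError(Exception):
    """Hypothetical exception: no usable config file was specified."""


def get_config_filename(environ=None) -> str:
    """Return the config filename from the environment, or raise."""
    environ = os.environ if environ is None else environ
    filename = environ.get(CONFIG_ENV_VAR)
    if not filename:
        raise MissingConfigError(
            f"Environment variable {CONFIG_ENV_VAR} is not set"
        )
    return filename


def main(environ=None) -> str:
    """Hypothetical entry point: fall back to printing a draft config."""
    try:
        return f"Using config: {get_config_filename(environ)}"
    except MissingConfigError:
        return "No config found; printing a draft config instead"
```

Because the filename is resolved when the config is loaded, any module loaded afterwards can rely on the config existing; the special entry point only has to catch the one exception.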

class crate_anon.anonymise.config.Config(open_databases: bool = True, mock: bool = False)[source]

Class representing the main CRATE anonymiser configuration.

__init__(open_databases: bool = True, mock: bool = False) None[source]

Read the config from the file specified in the CRATE_ANON_CONFIG environment variable.

Parameters:
  • open_databases – open SQLAlchemy connections to the databases?

  • mock – create mock (dummy) config?

check_valid() None[source]

Raise ValueError if the config is invalid.

commit_dest_db() None[source]

Executes a COMMIT on the destination database.

property dest_dialect: Dialect

Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the destination database.

property dest_dialect_name: str

Returns the SQLAlchemy name for the destination database dialect (e.g. mysql).

encrypt_master_pid(mpid: int | str) str | None[source]

Encrypt a master PID, producing a master RID (MRID).

encrypt_primary_pid(pid: int | str) str[source]

Encrypt a primary patient ID (PID), producing a research ID (RID).

extract_text_extension_permissible(extension: str) bool[source]

Is this file extension (e.g. .doc, .txt) one that the config permits for text extraction?

See the config options extract_text_extensions_permitted and extract_text_extensions_prohibited.

Parameters:

extension – file extension, beginning with .

Returns:

permitted?
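A minimal sketch of how a permitted/prohibited pair of lists might be combined. The precedence rule (a non-empty permitted list wins; otherwise everything not prohibited passes) and the case-insensitive comparison are assumptions made for illustration, and extension_permissible() is a hypothetical free function, not the CRATE method.

```python
from typing import Iterable


def extension_permissible(
    extension: str,
    permitted: Iterable[str] = (),
    prohibited: Iterable[str] = (),
) -> bool:
    """
    Assumed rule (not confirmed by the CRATE docs): if a permitted list is
    given, only those extensions pass; otherwise everything passes except
    prohibited extensions. Comparison is case-insensitive here.
    """
    ext = extension.lower()
    permitted_set = {e.lower() for e in permitted}
    prohibited_set = {e.lower() for e in prohibited}
    if permitted_set:
        return ext in permitted_set
    return ext not in prohibited_set
```

For example, extension_permissible(".DOC", permitted=[".doc", ".txt"]) returns True, while extension_permissible(".exe", prohibited=[".exe"]) returns False.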

get_destdb_engine_outside_transaction(encoding: str = 'utf-8') Engine[source]

Get a standalone SQLAlchemy Engine for the destination database, and configure it so that transactions aren’t used (equivalently: autocommit is True; equivalently, the database commits after every statement).

See https://github.com/mkleehammer/pyodbc/wiki/Database-Transaction-Management

Parameters:

encoding – passed to the SQLAlchemy create_engine() function

Returns:

the Engine
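To illustrate what "outside a transaction" means in practice, the sketch below uses Python's standard-library sqlite3 module rather than SQLAlchemy or pyodbc, so it shows the concept only and not how CRATE configures its Engine. The table name t and the variable names are arbitrary.

```python
import sqlite3

# Default mode: sqlite3 opens a transaction implicitly before an INSERT.
transactional = sqlite3.connect(":memory:")
transactional.execute("CREATE TABLE t (x INTEGER)")
transactional.execute("INSERT INTO t (x) VALUES (1)")
in_txn_default = transactional.in_transaction  # a COMMIT is still pending

# Autocommit mode: isolation_level=None means no implicit transactions.
autocommit = sqlite3.connect(":memory:", isolation_level=None)
autocommit.execute("CREATE TABLE t (x INTEGER)")
autocommit.execute("INSERT INTO t (x) VALUES (1)")
in_txn_autocommit = autocommit.in_transaction  # nothing pending

transactional.close()
autocommit.close()
```

In the first connection the INSERT leaves a transaction open until commit() is called; in the second, each statement takes effect immediately, which is the behaviour described above.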

get_extra_hasher(hasher_name: str) GenericHasher[source]

Return a named hasher from our extra_hashers dictionary.

Parameters:

hasher_name – name of the hasher

Returns:

the hasher

Raises:

ValueError

get_src_dialect(src_db: str) Dialect[source]

Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the specified source database.

hash_object(x: Any) str[source]

Hashes an object using our change_detection_hasher.

We could use Python’s built-in hash() function, which produces a fixed-width signed integer (64 bits on a typical 64-bit build; see sys.hash_info.width). However, there is an outside chance that someone uses a single-field table, which would leave such a hash vulnerable to content discovery via a dictionary attack. Thus, we should use a stronger hasher.
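A minimal sketch of the idea, assuming a keyed cryptographic hash is used so that an attacker without the key cannot run a dictionary attack. HMAC-SHA256 over repr(x) is chosen here purely for illustration; the actual change_detection_hasher is whatever the CRATE config specifies, and this free function is not the CRATE method.

```python
import hashlib
import hmac
from typing import Any


def hash_object(x: Any, key: bytes) -> str:
    """Keyed hash of an object's string form (illustrative HMAC-SHA256)."""
    message = repr(x).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


digest = hash_object(("Smith",), key=b"not-a-real-key")
```

The same object and key always give the same digest, so changed rows can be detected across runs, but the digest cannot be reproduced from guessed content without the key.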

init_row_counts() None[source]

Initialize the “number of rows inserted” counts to zero for all source tables.

load_dd(check_against_source_db: bool = True) None[source]

Loads the data dictionary (DD) into the config.

Parameters:

check_against_source_db – check DD validity against the source database?

notify_dest_db_transaction(n_rows: int, n_bytes: int) None[source]

Use this function to tell the config how many rows and bytes have been written to the destination database. See, for example, overall_progress().

Note that this may trigger a COMMIT, via our crate_anon.common.sql.TransactionSizeLimiter.

Parameters:
  • n_rows – the number of rows written

  • n_bytes – the number of bytes written
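The way a size limiter can turn these notifications into a COMMIT might look like the sketch below. SizeLimiterSketch is a hypothetical stand-in written for this example; it does not reproduce the interface of crate_anon.common.sql.TransactionSizeLimiter, and the "commit when a limit is reached" rule is an assumption.

```python
from typing import Callable, Optional


class SizeLimiterSketch:
    """Hypothetical stand-in; not the CRATE TransactionSizeLimiter API."""

    def __init__(
        self,
        commit: Callable[[], None],
        max_rows: Optional[int] = None,
        max_bytes: Optional[int] = None,
    ) -> None:
        self._commit = commit
        self._max_rows = max_rows
        self._max_bytes = max_bytes
        self._rows = 0
        self._bytes = 0

    def notify(self, n_rows: int, n_bytes: int) -> None:
        """Record pending work; commit and reset if a limit is reached."""
        self._rows += n_rows
        self._bytes += n_bytes
        over_rows = self._max_rows is not None and self._rows >= self._max_rows
        over_bytes = (
            self._max_bytes is not None and self._bytes >= self._max_bytes
        )
        if over_rows or over_bytes:
            self._commit()
            self._rows = 0
            self._bytes = 0


commits = []
limiter = SizeLimiterSketch(commit=lambda: commits.append("COMMIT"), max_rows=3)
limiter.notify(n_rows=2, n_bytes=100)  # 2 rows pending: below the limit
limiter.notify(n_rows=1, n_bytes=50)   # 3 rows pending: triggers a commit
```

Passing the commit action in as a callable keeps the counting logic independent of any particular database session.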

notify_src_bytes_read(n_bytes: int) None[source]

Use this function to tell the config how many bytes have been read from the source database. See, for example, overall_progress().

Parameters:

n_bytes – the number of bytes read

overall_progress() str[source]

Returns a formatted description of the number of bytes read from the source database(s) and written to the destination database.

(The Config is used to keep track of progress, via notify_src_bytes_read() and notify_dest_db_transaction().)
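As a rough sketch of this bookkeeping, the class below keeps the three counters and formats them. ProgressSketch and the wording of its message are invented for illustration; only the method names mirror those documented here.

```python
class ProgressSketch:
    """Hypothetical counters mirroring the documented notify_* methods."""

    def __init__(self) -> None:
        self.src_bytes_read = 0
        self.dest_rows_written = 0
        self.dest_bytes_written = 0

    def notify_src_bytes_read(self, n_bytes: int) -> None:
        self.src_bytes_read += n_bytes

    def notify_dest_db_transaction(self, n_rows: int, n_bytes: int) -> None:
        self.dest_rows_written += n_rows
        self.dest_bytes_written += n_bytes

    def overall_progress(self) -> str:
        # The wording of this message is invented for illustration.
        return (
            f"{self.src_bytes_read} bytes read; "
            f"{self.dest_rows_written} rows and "
            f"{self.dest_bytes_written} bytes written"
        )


progress = ProgressSketch()
progress.notify_src_bytes_read(2048)
progress.notify_dest_db_transaction(n_rows=3, n_bytes=512)
```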

set_echo(echo: bool) None[source]

Sets the “echo” property for all our SQLAlchemy database connections.

Parameters:

echo – show SQL?

property source_db_names: List[str]

Get all source database names.

class crate_anon.anonymise.config.DatabaseSafeConfig(parser: ExtendedConfigParser, section: str)[source]

Class representing non-sensitive configuration information about a source database.

__init__(parser: ExtendedConfigParser, section: str) None[source]

Read from a configparser section.

Parameters:
  • parser – configparser object

  • section – section name

does_table_fail_minimum_fields(colnames: List[str]) bool[source]

For use when creating a data dictionary automatically:

Does a table with the specified column names fail our minimum requirements? These requirements are set by our ddgen_table_require_field_absolute and ddgen_table_require_field_conditional configuration parameters.

Parameters:

colnames – list of column names for the table

Returns:

does it fail?
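A minimal sketch of such a check. The semantics are assumptions made for this example: "absolute" fields must always be present, and each "conditional" pair means "if the first field is present, the second must be too". The function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def table_fails_minimum_fields(
    colnames: List[str],
    require_absolute: Sequence[str] = (),
    require_conditional: Optional[Dict[str, str]] = None,
) -> bool:
    """Return True if the table lacks a required field (assumed rules)."""
    cols = set(colnames)
    # Assumed rule 1: every "absolute" field must be present.
    for field in require_absolute:
        if field not in cols:
            return True
    # Assumed rule 2: if field A is present, field B must be present too.
    for if_present, then_required in (require_conditional or {}).items():
        if if_present in cols and then_required not in cols:
            return True
    return False
```

With require_absolute=["patient_id"], a table with columns ["id", "notes"] fails; with require_conditional={"notes": "created_date"}, the same table also fails because notes is present without created_date.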

is_field_denied(field: str) bool[source]

Is the field name denylisted (and not also allowlisted)?

is_table_denied(table: str) bool[source]

Is the table name denylisted (and not also allowlisted)?
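The "denylisted and not also allowlisted" rule, shared by is_field_denied() and is_table_denied(), can be sketched as below. Shell-style wildcard matching via fnmatch is an assumption for illustration, and matches_any() and is_denied() are hypothetical helpers.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def matches_any(name: str, patterns: Sequence[str]) -> bool:
    """True if name matches at least one shell-style pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def is_denied(
    name: str, denylist: Sequence[str], allowlist: Sequence[str]
) -> bool:
    """Denied only if denylisted and not rescued by the allowlist."""
    return matches_any(name, denylist) and not matches_any(name, allowlist)
```

For example, with denylist ["audit_*"] and allowlist ["audit_summary"], the name audit_log is denied but audit_summary is not.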

crate_anon.anonymise.config.get_extra_hasher(parser: ExtendedConfigParser, section: str) GenericHasher[source]

Read hasher configuration from a configparser section, and return the hasher.

Parameters:
  • parser – configparser object

  • section – section name

Returns:

the hasher

crate_anon.anonymise.config.get_sqlatype(sqlatype: str) TypeEngine[source]

Converts a string, like “VARCHAR(10)”, to an SQLAlchemy type.

Since we might have to return String(length=…), we have to return an instance, not a class.
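The parsing step can be sketched without SQLAlchemy as below. parse_sql_type() is a hypothetical helper that only splits the string into a type name and an optional length; a real implementation would then build an instance such as String(length=10) from those parts, which is why an instance rather than a class has to be returned.

```python
import re
from typing import Optional, Tuple

# Type name, optionally followed by a single integer length in parentheses.
_TYPE_RE = re.compile(r"^\s*([A-Za-z_]+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_sql_type(spec: str) -> Tuple[str, Optional[int]]:
    """Split e.g. 'VARCHAR(10)' into ('VARCHAR', 10)."""
    match = _TYPE_RE.match(spec)
    if not match:
        raise ValueError(f"Unrecognized SQL type: {spec!r}")
    name = match.group(1).upper()
    length = match.group(2)
    return name, (int(length) if length is not None else None)
```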

crate_anon.anonymise.config.get_word_alternatives(filenames: List[str]) List[List[str]][source]

Reads in a list of word alternatives, from one or more comma-separated-value (CSV) text files (also accepting comment lines starting with #, and allowing different numbers of columns on different lines).

All entries on one line will be substituted for each other, if alternatives are enabled.

Produces a list of equivalent-word lists.

Arbitrarily, uses upper case internally. (All CRATE regex replacements are case-insensitive.)

An alternatives file might look like this:

# Street types
# https://en.wikipedia.org/wiki/Street_suffix

avenue, circus, close, crescent, drive, gardens, grove, hill, lane, mead, mews, place, rise, road, row, square, street, vale, way, wharf

Parameters:

filenames – filenames to read from

Returns:

a list of lists of equivalent words
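A minimal sketch of the parsing described above, operating on an iterable of lines rather than on filenames so that it is self-contained. read_word_alternatives() is a hypothetical name, and details such as skipping blank lines and empty fields are assumptions.

```python
import csv
from typing import Iterable, List


def read_word_alternatives(lines: Iterable[str]) -> List[List[str]]:
    """Parse CSV lines into lists of equivalent words, in upper case."""
    # Drop blank lines and comment lines (those starting with #).
    content = [
        line for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    alternatives = []
    for row in csv.reader(content, skipinitialspace=True):
        words = [word.strip().upper() for word in row if word.strip()]
        if words:
            alternatives.append(words)
    return alternatives


example = [
    "# Street types",
    "",
    "avenue, circus, close",
    "road, street",
]
result = read_word_alternatives(example)
```

Here result is [["AVENUE", "CIRCUS", "CLOSE"], ["ROAD", "STREET"]]: one list per non-comment line, with rows allowed to differ in length.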