14.1.7. crate_anon.anonymise.config
crate_anon/anonymise/config.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Config class for CRATE anonymiser.
Thoughts on configuration method
First version used a Config() class, which initializes with blank values. The anonymise_cli.py file creates a config singleton and passes it around. Then, when its set() method is called, it reads a config file and instantiates its settings. An option exists to print a draft config without ever reading one from disk.
Advantage: easy to start the program without a valid config file (e.g. to print one).
Disadvantage: modules can’t be sure that a config is properly instantiated before they are loaded, so you can’t easily define a class according to config settings (you’d have to have a class factory, which gets ugly).
The Django method is to have a configuration file (e.g. settings.py, which can import from other things) that is read by Django and then becomes importable by anything at startup as django.conf.settings. (I’ve added local settings via an environment variable.) The way Django knows where to look is via this in manage.py:
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "crate_anon.crateweb.config.settings")
Advantage: setting the config file via an environment variable (read when the config file loads) allows guaranteed config existence as other modules start.
Further advantage: config filenames are not on the command line, and are therefore not visible to ps.
Disadvantage: how do you override with a command-line (argparse) setting? (Though: who cares?)
To print a config using that file: raise an exception on nonexistent config, and catch it with a special entry point script.
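The environment-variable approach described above can be sketched as follows. This is a minimal illustration, not CRATE's actual code: the exception class and both function names are hypothetical, and only the CRATE_ANON_CONFIG variable name comes from this page.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

CONFIG_ENV_VAR = "CRATE_ANON_CONFIG"  # variable name taken from this page


class ConfigMissingError(Exception):
    """Hypothetical exception: no usable config file was specified."""


def find_config_file(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the config path named by the environment variable, or raise."""
    environ = os.environ if environ is None else environ
    filename = environ.get(CONFIG_ENV_VAR)
    if not filename:
        raise ConfigMissingError(f"{CONFIG_ENV_VAR} is not set")
    path = Path(filename)
    if not path.is_file():
        raise ConfigMissingError(f"No such config file: {path}")
    return path


def print_demo_config_entry_point(
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical special entry point that tolerates a missing config."""
    try:
        return f"Using config: {find_config_file(environ)}"
    except ConfigMissingError:
        # Ordinary entry points would let the exception propagate; this one
        # catches it so that a draft config can still be produced.
        return "# (a draft config would be printed here)"
```

Every other entry point would simply call find_config_file() and fail early, which is what gives modules the guarantee that a config exists by the time they load.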
- class crate_anon.anonymise.config.Config(open_databases: bool = True, mock: bool = False)[source]
Class representing the main CRATE anonymiser configuration.
- __init__(open_databases: bool = True, mock: bool = False) None [source]
Read the config from the file specified in the CRATE_ANON_CONFIG environment variable.
- Parameters:
open_databases – open SQLAlchemy connections to the databases?
mock – create mock (dummy) config?
- property dest_dialect: Dialect
Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the destination database.
- property dest_dialect_name: str
Returns the SQLAlchemy name for the destination database dialect (e.g. mysql).
- encrypt_master_pid(mpid: int | str) str | None [source]
Encrypt a master PID, producing a master RID (MRID).
- encrypt_primary_pid(pid: int | str) str [source]
Encrypt a primary patient ID (PID), producing a research ID (RID).
- extract_text_extension_permissible(extension: str) bool [source]
Is this file extension (e.g. .doc, .txt) one that the config permits to use for text extraction?
See the config options extract_text_extensions_permitted and extract_text_extensions_prohibited.
- Parameters:
extension – file extension, beginning with
.
- Returns:
permitted?
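One plausible way to combine a "permitted" list and a "prohibited" list is sketched below. The precedence rule (a non-empty permitted list acts as a whitelist, otherwise the prohibited list acts as a blacklist) and the case-insensitive comparison are both assumptions made for illustration; this page does not state how CRATE resolves the two options. The function name is hypothetical.

```python
from typing import Iterable


def extension_permissible(
    extension: str,
    permitted: Iterable[str] = (),
    prohibited: Iterable[str] = (),
) -> bool:
    """
    Hypothetical check. Extensions begin with "." (e.g. ".doc").

    Assumed rule: if any extensions are explicitly permitted, only those
    pass; otherwise everything passes except the prohibited ones.
    """
    ext = extension.lower()  # assumption: case-insensitive comparison
    permitted_set = {e.lower() for e in permitted}
    prohibited_set = {e.lower() for e in prohibited}
    if permitted_set:
        return ext in permitted_set
    return ext not in prohibited_set
```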
- get_destdb_engine_outside_transaction(encoding: str = 'utf-8') Engine [source]
Get a standalone SQLAlchemy Engine for the destination database, and configure it so that transactions aren’t used (equivalently: autocommit is True; equivalently, the database commits after every statement).
See https://github.com/mkleehammer/pyodbc/wiki/Database-Transaction-Management
- Parameters:
encoding – passed to the SQLAlchemy create_engine() function
- Returns:
the Engine
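The difference between transactional and autocommit behaviour can be seen with a small stand-in. The sketch below deliberately uses the standard-library sqlite3 module rather than SQLAlchemy or pyodbc, so it illustrates the concept only and is not how this method is implemented. With SQLAlchemy itself, autocommit is commonly requested by passing isolation_level="AUTOCOMMIT" to create_engine(); whether this method does so is not stated on this page.

```python
import sqlite3

# Default mode: sqlite3 implicitly opens a transaction before the INSERT,
# so the change stays uncommitted until commit() is called.
transactional = sqlite3.connect(":memory:")
transactional.execute("CREATE TABLE t (x INTEGER)")
transactional.execute("INSERT INTO t (x) VALUES (1)")
in_txn_default = transactional.in_transaction

# Autocommit mode: isolation_level=None means no implicit transaction is
# opened, so each statement takes effect immediately.
autocommit = sqlite3.connect(":memory:", isolation_level=None)
autocommit.execute("CREATE TABLE t (x INTEGER)")
autocommit.execute("INSERT INTO t (x) VALUES (1)")
in_txn_autocommit = autocommit.in_transaction

transactional.close()
autocommit.close()
```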
- get_extra_hasher(hasher_name: str) GenericHasher [source]
Return a named hasher from our extra_hashers dictionary.
- Parameters:
hasher_name – name of the hasher
- Returns:
the hasher
- Raises:
ValueError –
- get_src_dialect(src_db: str) Dialect [source]
Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the specified source database.
- hash_object(x: Any) str [source]
Hashes an object using our change_detection_hasher.
We could use Python’s built-in hash() function, which produces a fixed-width integer (64 bits on typical 64-bit builds; the width follows the platform word size, as sys.maxint reflected in Python 2). However, there is an outside chance that someone uses a single-field table and therefore that this is vulnerable to content discovery via a dictionary attack. Thus, we should use a better version.
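To illustrate why a keyed hash resists the dictionary attack mentioned above, here is a minimal sketch using HMAC-SHA256 from the standard library. The choice of algorithm, the use of repr() to serialize the object, and the function name are all assumptions; the page does not say how change_detection_hasher is built.

```python
import hashlib
import hmac
from typing import Any


def keyed_hash(x: Any, key: bytes) -> str:
    """
    Hypothetical keyed hash of an object's repr().

    Without the secret key, an attacker cannot precompute hashes of
    likely values (a dictionary attack) and match them to stored ones.
    """
    message = repr(x).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()
```

The same object hashed with the same key always yields the same digest, which is what change detection needs; a different key yields an unrelated digest.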
- init_row_counts() None [source]
Initialize the “number of rows inserted” counts to zero for all source tables.
- load_dd(check_against_source_db: bool = True) None [source]
Loads the data dictionary (DD) into the config.
- Parameters:
check_against_source_db – check DD validity against the source database?
- notify_dest_db_transaction(n_rows: int, n_bytes: int) None [source]
Use this function to tell the config how many rows and bytes have been written to the destination database. See, for example, overall_progress().
Note that this may trigger a COMMIT, via our crate_anon.common.sql.TransactionSizeLimiter.
- Parameters:
n_rows – the number of rows written
n_bytes – the number of bytes written
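The idea of committing once enough rows or bytes have accumulated can be sketched as below. This is a hypothetical stand-in, not the real crate_anon.common.sql.TransactionSizeLimiter: the class name, constructor arguments, and the "reset counters after commit" behaviour are assumptions made for illustration.

```python
from typing import Callable, Optional


class SimpleTransactionLimiter:
    """Hypothetical limiter: commit when a row or byte threshold is hit."""

    def __init__(
        self,
        commit: Callable[[], None],
        max_rows_before_commit: Optional[int] = None,
        max_bytes_before_commit: Optional[int] = None,
    ) -> None:
        self._commit = commit
        self._max_rows = max_rows_before_commit
        self._max_bytes = max_bytes_before_commit
        self._rows = 0
        self._bytes = 0

    def notify(self, n_rows: int, n_bytes: int) -> None:
        """Record newly written rows/bytes; commit if a limit is reached."""
        self._rows += n_rows
        self._bytes += n_bytes
        rows_hit = self._max_rows is not None and self._rows >= self._max_rows
        bytes_hit = (
            self._max_bytes is not None and self._bytes >= self._max_bytes
        )
        if rows_hit or bytes_hit:
            self._commit()
            self._rows = 0
            self._bytes = 0
```

In real use the commit callable would be something like a database session's commit method; a threshold of None disables that limit.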
- notify_src_bytes_read(n_bytes: int) None [source]
Use this function to tell the config how many bytes have been read from the source database. See, for example, overall_progress().
- Parameters:
n_bytes – the number of bytes read
- overall_progress() str [source]
Returns a formatted description of the number of bytes read from the source database(s) and written to the destination database.
(The Config is used to keep track of progress, via notify_src_bytes_read() and notify_dest_db_transaction().)
- set_echo(echo: bool) None [source]
Sets the “echo” property for all our SQLAlchemy database connections.
- Parameters:
echo – show SQL?
- property source_db_names: List[str]
Get all source database names.
- class crate_anon.anonymise.config.DatabaseSafeConfig(parser: ExtendedConfigParser, section: str)[source]
Class representing non-sensitive configuration information about a source database.
- __init__(parser: ExtendedConfigParser, section: str) None [source]
Read from a configparser section.
- Parameters:
parser – configparser object
section – section name
- does_table_fail_minimum_fields(colnames: List[str]) bool [source]
For use when creating a data dictionary automatically:
Does a table with the specified column names fail our minimum requirements? These requirements are set by our ddgen_table_require_field_absolute and ddgen_table_require_field_conditional configuration parameters.
- Parameters:
colnames – list of column names for the table
- Returns:
does it fail?
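A check of this kind might look like the sketch below. The interpretation is an assumption, since this page does not define the two parameters: "absolute" is taken to mean fields that every table must contain, and "conditional" is taken to mean pairs of the form "if field A is present, field B must be too". The function name and argument shapes are hypothetical.

```python
from typing import Dict, Iterable, List


def table_fails_minimum_fields(
    colnames: List[str],
    required_absolute: Iterable[str],
    required_conditional: Dict[str, str],
) -> bool:
    """
    Hypothetical version of the check.

    required_absolute: fields assumed mandatory for every table.
    required_conditional: maps A -> B, read as "if A exists, B must exist".
    """
    present = set(colnames)
    for field in required_absolute:
        if field not in present:
            return True  # a mandatory field is missing
    for if_field, then_field in required_conditional.items():
        if if_field in present and then_field not in present:
            return True  # the condition applies but is not satisfied
    return False
```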
- crate_anon.anonymise.config.get_extra_hasher(parser: ExtendedConfigParser, section: str) GenericHasher [source]
Read hasher configuration from a configparser section, and return the hasher.
- Parameters:
parser – configparser object
section – section name
- Returns:
the hasher
- crate_anon.anonymise.config.get_sqlatype(sqlatype: str) TypeEngine [source]
Converts a string, like “VARCHAR(10)”, to an SQLAlchemy type.
Since we might have to return String(length=…), we have to return an instance, not a class.
- crate_anon.anonymise.config.get_word_alternatives(filenames: List[str]) List[List[str]] [source]
Reads in a list of word alternatives, from one or more comma-separated-value (CSV) text files (also accepting comment lines starting with #, and allowing different numbers of columns on different lines).
All entries on one line will be substituted for each other, if alternatives are enabled.
Produces a list of equivalent-word lists.
Arbitrarily, uses upper case internally. (All CRATE regex replacements are case-insensitive.)
An alternatives file might look like this:
    # Street types
    # https://en.wikipedia.org/wiki/Street_suffix
    avenue, circus, close, crescent, drive, gardens, grove, hill, lane, mead, mews, place, rise, road, row, square, street, vale, way, wharf
- Parameters:
filenames – filenames to read from
- Returns:
a list of lists of equivalent words
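The file format described above (CSV rows, # comment lines, variable numbers of columns, upper case internally) can be read with a short sketch like the following. It is an illustration under those stated rules, not the module's actual implementation; the function name is hypothetical, and skipping blank lines and empty fields is an assumption.

```python
import csv
from typing import List


def read_word_alternatives(filenames: List[str]) -> List[List[str]]:
    """Hypothetical reader: one list of equivalent words per data line."""
    alternatives: List[List[str]] = []
    for filename in filenames:
        with open(filename, newline="", encoding="utf-8") as f:
            # Drop comment lines before CSV parsing, so commas inside
            # comments are never treated as field separators.
            data_lines = (
                line for line in f if not line.lstrip().startswith("#")
            )
            for row in csv.reader(data_lines, skipinitialspace=True):
                words = [w.strip().upper() for w in row if w.strip()]
                if words:  # assumption: ignore blank lines
                    alternatives.append(words)
    return alternatives
```

Each returned inner list is one equivalence class, so a row with two words and a row with twenty are handled identically.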