12.1.7. crate_anon.anonymise.config
crate_anon/anonymise/config.py
Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).
This file is part of CRATE.
CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.
Config class for CRATE anonymiser.
Thoughts on configuration method
First version used a Config() class, which initializes with blank values. The anonymise_cli.py file creates a config singleton and passes it around. Then when its set() method is called, it reads a config file and instantiates its settings. An option exists to print a draft config without ever reading one from disk.
Advantage: easy to start the program without a valid config file (e.g. to print one).
Disadvantage: modules can’t be sure that a config is properly instantiated before they are loaded, so you can’t easily define a class according to config settings (you’d have to have a class factory, which gets ugly).
The Django method is to have a configuration file (e.g. settings.py, which can import from other things) that is read by Django and then becomes importable by anything at startup as django.conf.settings. (I’ve added local settings via an environment variable.) The way Django knows where to look is via this in manage.py:
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "crate_anon.crateweb.config.settings")
Advantage: setting the config file via an environment variable (read when the config file loads) allows guaranteed config existence as other modules start.
Further advantage: config filenames are not on the command line, and are therefore not visible to ps.
Disadvantage: how do you override with a command-line (argparse) setting? ... though: who cares?
To print a config using that file: raise an exception on nonexistent config, and catch it with a special entry point script.
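The environment-variable approach described above can be sketched roughly as follows. This is an illustration only, not CRATE's actual loader: the variable name CRATE_ANON_CONFIG is taken from the Config documentation on this page, while ConfigNotFound, load_config_from_environment(), demo() and the [main] section with its example_option are hypothetical names invented for the sketch.

```python
import configparser
import os
import tempfile
from typing import Mapping

CONFIG_ENV_VAR = "CRATE_ANON_CONFIG"  # variable name from the Config docs


class ConfigNotFound(Exception):
    """Hypothetical exception: no usable config file was specified."""


def load_config_from_environment(
    environ: Mapping[str, str] = os.environ,
) -> configparser.ConfigParser:
    """Read an INI-style config from the file named by the environment."""
    filename = environ.get(CONFIG_ENV_VAR)
    if not filename or not os.path.isfile(filename):
        # A special entry point (e.g. "print a draft config") could catch this.
        raise ConfigNotFound(f"Set {CONFIG_ENV_VAR} to a valid config filename")
    parser = configparser.ConfigParser()
    parser.read(filename)
    return parser


def demo() -> str:
    """Write a throwaway config file, then load it via a fake environment."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.ini")
        with open(path, "w") as f:
            # Hypothetical section and option, for illustration only.
            f.write("[main]\nexample_option = 42\n")
        parser = load_config_from_environment({CONFIG_ENV_VAR: path})
        return parser.get("main", "example_option")
```

Because the loader raises a dedicated exception when no config is found, an entry point that only wants to print a draft config can catch that exception instead of requiring a valid file.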
- class crate_anon.anonymise.config.Config(open_databases: bool = True, mock: bool = False)
Class representing the main CRATE anonymiser configuration.
- __init__(open_databases: bool = True, mock: bool = False) → None
Read the config from the file specified in the CRATE_ANON_CONFIG environment variable.
- Parameters:
open_databases – open SQLAlchemy connections to the databases?
mock – create mock (dummy) config?
- property dest_dialect: Dialect
Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the destination database.
- property dest_dialect_name: str
Returns the SQLAlchemy name for the destination database dialect (e.g. mysql).
- encrypt_master_pid(mpid: int | str) → str | None
Encrypt a master PID, producing a master RID (MRID).
- encrypt_primary_pid(pid: int | str) → str
Encrypt a primary patient ID (PID), producing a research ID (RID).
- extract_text_extension_permissible(extension: str) → bool
Is this file extension (e.g. .doc, .txt) one that the config permits for text extraction?
See the config options extract_text_extensions_permitted and extract_text_extensions_prohibited.
- Parameters:
extension – file extension, beginning with .
- Returns:
permitted?
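One plausible way to combine a "permitted" list and a "prohibited" list is sketched below. The precedence rule here is an assumption, not something this page states: if a permitted list is given, only those extensions pass; otherwise anything not prohibited passes. The case-insensitive comparison and the function name extension_permissible() are likewise hypothetical.

```python
from typing import Iterable


def extension_permissible(
    extension: str,
    permitted: Iterable[str] = (),
    prohibited: Iterable[str] = (),
) -> bool:
    """
    Decide whether a file extension (beginning with ".") may be used.

    Assumed rule: a non-empty permitted list acts as a whitelist; if it is
    empty, the prohibited list acts as a blacklist.
    """
    ext = extension.lower()  # assumption: comparison ignores case
    permitted_set = {e.lower() for e in permitted}
    prohibited_set = {e.lower() for e in prohibited}
    if permitted_set:
        return ext in permitted_set
    return ext not in prohibited_set
```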
- get_destdb_engine_outside_transaction() → Engine
Get a standalone SQLAlchemy Engine for the destination database, and configure it so that transactions aren’t used (equivalently: autocommit is True; equivalently, the database commits after every statement).
See https://github.com/mkleehammer/pyodbc/wiki/Database-Transaction-Management
- Returns:
the Engine
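To make "transactions aren't used" concrete, here is a small illustration of autocommit behaviour. It deliberately uses Python's standard-library sqlite3 module rather than SQLAlchemy or pyodbc, so it does not show how CRATE configures its Engine; the helper in_transaction_after_insert() is a hypothetical name.

```python
import sqlite3
from typing import Optional


def in_transaction_after_insert(isolation_level: Optional[str]) -> bool:
    """
    Insert one row and report whether a transaction is still open.

    With isolation_level=None, sqlite3 runs in autocommit mode: every
    statement is committed as soon as it has executed.
    """
    conn = sqlite3.connect(":memory:", isolation_level=isolation_level)
    try:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t (x) VALUES (1)")
        return conn.in_transaction
    finally:
        conn.close()
```

With isolation_level="DEFERRED" the INSERT implicitly opens a transaction that stays open until commit(); with isolation_level=None nothing is left pending.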
- get_extra_hasher(hasher_name: str) → GenericHasher
Return a named hasher from our extra_hashers dictionary.
- Parameters:
hasher_name – name of the hasher
- Returns:
the hasher
- Raises:
ValueError –
- get_src_dialect(src_db: str) → Dialect
Returns the SQLAlchemy Dialect (e.g. MySQL, SQL Server…) for the specified source database.
- hash_object(x: Any) → str
Hashes an object using our change_detection_hasher.
We could use Python’s built-in hash() function, which produces a 64-bit integer on typical platforms (calculated from: sys.maxint, i.e. sys.maxsize in Python 3). However, there is an outside chance that someone uses a single-field table and therefore that this is vulnerable to content discovery via a dictionary attack. Thus, we should use a better version.
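The concern about dictionary attacks can be illustrated with a keyed hash: without the secret key, an attacker cannot precompute hashes of likely field values. The sketch below uses HMAC-SHA256 from the standard library; whether CRATE's change_detection_hasher works this way is an assumption, and SECRET_KEY and keyed_hash() are hypothetical names. (A separate practical point: Python's built-in hash() of strings is salted per process by default, so its values are not stable between runs.)

```python
import hashlib
import hmac
from typing import Any

# Hypothetical secret; in practice this would come from the config file.
SECRET_KEY = b"replace-with-a-secret-from-your-config"


def keyed_hash(x: Any, key: bytes = SECRET_KEY) -> str:
    """
    Return a hex digest of repr(x), keyed with a secret.

    Assumes repr(x) is a stable representation of the object's content.
    """
    message = repr(x).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()
```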
- init_row_counts() → None
Initialize the “number of rows inserted” counts to zero for all source tables.
- load_dd(check_against_source_db: bool = True) → None
Loads the data dictionary (DD) into the config.
- Parameters:
check_against_source_db – check DD validity against the source database?
- notify_dest_db_transaction(n_rows: int, n_bytes: int) → None
Use this function to tell the config how many rows and bytes have been written to the destination database. See, for example, overall_progress().
Note that this may trigger a COMMIT, via our crate_anon.common.sql.TransactionSizeLimiter.
- Parameters:
n_rows – the number of rows written
n_bytes – the number of bytes written
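The idea of committing once enough rows or bytes have accumulated can be sketched as below. This is not the code of crate_anon.common.sql.TransactionSizeLimiter; the class CommitLimiter, its parameter names and its notify() method are hypothetical, and the commit action is passed in as a plain callable so that the sketch needs no database.

```python
from typing import Callable, Optional


class CommitLimiter:
    """Hypothetical sketch: commit when pending rows or bytes reach a limit."""

    def __init__(
        self,
        commit: Callable[[], None],
        max_rows_before_commit: Optional[int] = None,
        max_bytes_before_commit: Optional[int] = None,
    ) -> None:
        self._commit = commit
        self._max_rows = max_rows_before_commit
        self._max_bytes = max_bytes_before_commit
        self._rows = 0
        self._bytes = 0

    def notify(self, n_rows: int, n_bytes: int) -> bool:
        """Record written rows/bytes; return True if a commit was triggered."""
        self._rows += n_rows
        self._bytes += n_bytes
        rows_hit = self._max_rows is not None and self._rows >= self._max_rows
        bytes_hit = (
            self._max_bytes is not None and self._bytes >= self._max_bytes
        )
        if rows_hit or bytes_hit:
            self._commit()
            # Start counting afresh for the next transaction.
            self._rows = 0
            self._bytes = 0
            return True
        return False
```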
- notify_src_bytes_read(n_bytes: int) → None
Use this function to tell the config how many bytes have been read from the source database. See, for example, overall_progress().
- Parameters:
n_bytes – the number of bytes read
- overall_progress() → str
Returns a formatted description of the number of bytes read from the source database(s) and written to the destination database.
(The Config is used to keep track of progress, via notify_src_bytes_read() and notify_dest_db_transaction().)
- set_echo(echo: bool) → None
Sets the “echo” property for all our SQLAlchemy database connections.
- Parameters:
echo – show SQL?
- property source_db_names: List[str]
Get all source database names.
- class crate_anon.anonymise.config.DatabaseSafeConfig(parser: ExtendedConfigParser, section: str)
Class representing non-sensitive configuration information about a source database.
- __init__(parser: ExtendedConfigParser, section: str) → None
Read from a configparser section.
- Parameters:
parser – configparser object
section – section name
- does_table_fail_minimum_fields(colnames: List[str]) → bool
For use when creating a data dictionary automatically:
Does a table with the specified column names fail our minimum requirements? These requirements are set by our ddgen_table_require_field_absolute and ddgen_table_require_field_conditional configuration parameters.
- Parameters:
colnames – list of column names for the table
- Returns:
does it fail?
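A check of this kind might look like the sketch below. The semantics are assumptions, since this page does not define them: "absolute" is taken to mean fields that every table must contain, and "conditional" is taken to mean pairs (A, B) where a table containing A must also contain B. The function table_fails_minimum_fields() and its parameter names are hypothetical.

```python
from typing import Dict, Iterable, List


def table_fails_minimum_fields(
    colnames: List[str],
    require_absolute: Iterable[str],
    require_conditional: Dict[str, str],
) -> bool:
    """
    Return True if the table's columns fail the (assumed) requirements.

    require_absolute: columns that must always be present.
    require_conditional: maps column A to column B, meaning "if A is
    present, B must be present too".
    """
    present = set(colnames)
    for required in require_absolute:
        if required not in present:
            return True
    for if_present, then_required in require_conditional.items():
        if if_present in present and then_required not in present:
            return True
    return False
```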
- crate_anon.anonymise.config.get_extra_hasher(parser: ExtendedConfigParser, section: str) → GenericHasher
Read hasher configuration from a configparser section, and return the hasher.
- Parameters:
parser – configparser object
section – section name
- Returns:
the hasher
- crate_anon.anonymise.config.get_sqlatype(sqlatype: str) → TypeEngine
Converts a string, like “VARCHAR(10)”, to an SQLAlchemy type.
Since we might have to return String(length=…), we have to return an instance, not a class.
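The reason for returning an instance rather than a class is that the length lives on the instance. The sketch below shows the parsing step only, using a hypothetical stand-in dataclass (ParsedType) instead of real SQLAlchemy types so that it runs without third-party packages; parse_sql_type() is also a hypothetical name, and only a bare type name or a single length such as VARCHAR(10) is handled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedType:
    """Hypothetical stand-in for an SQLAlchemy type instance."""

    name: str
    length: Optional[int] = None


# A type name, optionally followed by a single integer length in brackets.
_TYPE_REGEX = re.compile(
    r"^\s*(?P<name>[A-Za-z_]+)\s*(?:\(\s*(?P<length>\d+)\s*\))?\s*$"
)


def parse_sql_type(text: str) -> ParsedType:
    """Convert a string like "VARCHAR(10)" to a ParsedType instance."""
    m = _TYPE_REGEX.match(text)
    if m is None:
        raise ValueError(f"Unrecognized SQL type: {text!r}")
    length = m.group("length")
    return ParsedType(
        name=m.group("name").upper(),
        length=int(length) if length is not None else None,
    )
```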
- crate_anon.anonymise.config.get_word_alternatives(filenames: List[str]) → List[List[str]]
Reads in a list of word alternatives, from one or more comma-separated-value (CSV) text files (also accepting comment lines starting with #, and allowing different numbers of columns on different lines).
All entries on one line will be substituted for each other, if alternatives are enabled.
Produces a list of equivalent-word lists.
Arbitrarily, uses upper case internally. (All CRATE regex replacements are case-insensitive.)
An alternatives file might look like this:
    # Street types
    # https://en.wikipedia.org/wiki/Street_suffix
    avenue, circus, close, crescent, drive, gardens, grove, hill, lane, mead, mews, place, rise, road, row, square, street, vale, way, wharf
- Parameters:
filenames – filenames to read from
- Returns:
a list of lists of equivalent words
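The file format described above can be parsed along the following lines. This is a sketch, not the CRATE implementation: it splits on commas directly rather than using a full CSV parser (so quoted fields are not handled), and parse_word_alternatives() and read_word_alternatives() are hypothetical names.

```python
from typing import Iterable, List


def parse_word_alternatives(lines: Iterable[str]) -> List[List[str]]:
    """
    Turn lines of comma-separated words into lists of equivalent words.

    Blank lines and comment lines starting with "#" are skipped; words are
    converted to upper case; lines may have different numbers of words.
    """
    alternatives: List[List[str]] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        words = [w.strip().upper() for w in line.split(",")]
        words = [w for w in words if w]  # drop empty entries
        if words:
            alternatives.append(words)
    return alternatives


def read_word_alternatives(filenames: Iterable[str]) -> List[List[str]]:
    """Read and combine the alternatives from several files."""
    result: List[List[str]] = []
    for filename in filenames:
        with open(filename, encoding="utf-8") as f:
            result.extend(parse_word_alternatives(f))
    return result
```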