14.4.20. crate_anon.linkage.validation.test_hash_speed

crate_anon/linkage/validation/test_hash_speed.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Test the speed of hashing.

The question is: if someone malicious learned a secret hash key, how long would it take them to generate a reverse map from a known identifier space?
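To make the threat concrete, the following is a minimal sketch of a reverse map, not code from CRATE. It assumes the attacker knows the key and can enumerate every candidate identifier; the key, the toy 3-digit identifier space, and the helper names hmac_hex and build_reverse_map are all hypothetical.

```python
import hmac
from typing import Dict, Iterable


def hmac_hex(key: bytes, identifier: str, digestmod: str = "sha256") -> str:
    """Return the hex HMAC digest of one identifier."""
    return hmac.new(key, identifier.encode("ascii"), digestmod).hexdigest()


def build_reverse_map(key: bytes, identifiers: Iterable[str]) -> Dict[str, str]:
    """Map each digest back to the identifier that produced it."""
    return {hmac_hex(key, ident): ident for ident in identifiers}


# Toy identifier space (assumption): all zero-padded 3-digit numbers.
leaked_key = b"hypothetical-leaked-key"
reverse_map = build_reverse_map(leaked_key, (f"{i:03d}" for i in range(1000)))

# Any digest seen in the "anonymised" data can now be looked up directly.
observed = hmac_hex(leaked_key, "042")
print(reverse_map[observed])
```

The cost of the attack is one hash per candidate identifier, which is why the size of the identifier space dominates the estimates in this module.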

The test uses a single CPU core.

Specimen results, for padding length 9 and the HMAC_MD5 algorithm, on Wombat (3.5 GHz CPU), tested with 100000 (1e5) iterations (which took 0.72 s, piping to /dev/null):

  • 1e9 operations will take about 7200 s = 2 hours. This is the right order of magnitude for NHS numbers (9 digits plus a checksum; other rules might restrict that a bit more).

  • 7.3e12 operations will take about 52667022 s = 1.7 years. This is the right order of magnitude for NHS numbers plus dates of birth covering 20 years (1e9 for NHS number * 365 days/year * 20 years).

  • 3.65e13 operations will take about 263335113 s = 8.4 years. This is the right order of magnitude for NHS numbers plus DOBs covering a century (1e9 for NHS number * 365 days/year * 100 years).
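The estimates above are a linear extrapolation from the measured run: time per hash multiplied by the number of candidate identifiers. A minimal sketch of that arithmetic follows; the function name extrapolate_seconds is hypothetical. It uses the rounded figure of 0.72 s per 1e5 hashes, so its results come out slightly below the values quoted above, which were presumably computed from the unrounded timing.

```python
SECONDS_PER_HOUR = 3600.0
SECONDS_PER_YEAR = 365.25 * 24 * 3600.0


def extrapolate_seconds(measured_s: float, n_measured: float, n_target: float) -> float:
    """Scale a measured duration linearly to a larger number of hashes."""
    return measured_s / n_measured * n_target


# Identifier-space sizes taken from the specimen results.
for n_target in (1e9, 7.3e12, 3.65e13):
    seconds = extrapolate_seconds(0.72, 1e5, n_target)
    print(
        f"{n_target:.3g} operations: {seconds:.0f} s "
        f"= {seconds / SECONDS_PER_HOUR:.1f} h "
        f"= {seconds / SECONDS_PER_YEAR:.2f} years"
    )
```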

The hash algorithm isn’t a major factor; moving from HMAC_MD5 to HMAC_SHA512, for example, only takes the time for 1e5 iterations from 0.72 s to 0.86 s.

(A subsequent test was faster: about 5443 s = 1.5 hours, 1.25 years, and 6.3 years respectively for the three cases above.)

Speed tests considered for paper (not used in the end; real-world measures used):

./test_hash_speed.py --method HMAC_MD5 --ntests 1000000 > /dev/null
./test_hash_speed.py --method HMAC_SHA256 --ntests 1000000 > /dev/null
./test_hash_speed.py --method HMAC_SHA512 --ntests 1000000 > /dev/null

crate_anon.linkage.validation.test_hash_speed.gen_dummy_data(n: int, string_length: int) → Iterable[str]

Generate some random strings of the specified length.
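As an illustration only, a generator with the same signature could look like the sketch below. This is not the CRATE implementation: the name gen_dummy_data_sketch, the digits-only alphabet, and the seed parameter are assumptions chosen to make the example reproducible.

```python
import random
import string
from typing import Iterable


def gen_dummy_data_sketch(n: int, string_length: int, seed: int = 0) -> Iterable[str]:
    """Yield n random digit strings, each string_length characters long."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    for _ in range(n):
        yield "".join(rng.choices(string.digits, k=string_length))


for dummy in gen_dummy_data_sketch(n=3, string_length=9):
    print(dummy)
```

Generating lazily keeps memory use flat even when n is large, which matters when the strings are only hashed once and discarded.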

crate_anon.linkage.validation.test_hash_speed.main() → None

Command-line entry point.

crate_anon.linkage.validation.test_hash_speed.test_hash_speed(output_filename: str, hash_method: str, key: str, ntests: int, intended_possibilities: List[int], string_length: int)

Hash lines from one file to another.

Parameters
  • output_filename – Output filename, or “-” for stdout.

  • hash_method – Method to use; e.g. HMAC_SHA256

  • key – Secret key for hasher

  • ntests – Number of hashes to perform.

  • intended_possibilities – Numbers of hashes (sizes of identifier space) for which to estimate the total time.

  • string_length – Length of each string to hash (characters).

Note that the hash precedes the ID with the keep_id option, which works best if the ID might contain commas.
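The measure-then-extrapolate approach that the parameters of test_hash_speed describe can be sketched with the standard library alone. This is an assumption-laden illustration, not the CRATE code: it calls Python's hmac module directly instead of CRATE's hasher classes, the names time_hmac and estimate_times are hypothetical, and the digest name "sha256" stands in for the HMAC_SHA256 method.

```python
import hmac
import time
from typing import Dict, List


def time_hmac(digestmod: str, key: bytes, ntests: int, string_length: int) -> float:
    """Return the seconds taken to HMAC ntests zero-padded strings on one core."""
    # Build the messages first so that only the hashing itself is timed.
    messages = [str(i).zfill(string_length).encode("ascii") for i in range(ntests)]
    start = time.perf_counter()
    for msg in messages:
        hmac.new(key, msg, digestmod).hexdigest()
    return time.perf_counter() - start


def estimate_times(
    elapsed_s: float, ntests: int, intended_possibilities: List[int]
) -> Dict[int, float]:
    """Extrapolate linearly from the measured run to each identifier-space size."""
    per_hash = elapsed_s / ntests
    return {n: per_hash * n for n in intended_possibilities}


elapsed = time_hmac("sha256", b"hypothetical-secret-key", ntests=100_000, string_length=9)
for n, seconds in estimate_times(elapsed, 100_000, [10**9, 73 * 10**11]).items():
    print(f"{n:.3g} hashes: about {seconds:.0f} s")
```

Absolute timings depend on the machine and Python build, so results from this sketch are not comparable with the specimen figures above.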