#!/usr/bin/env python

"""
crate_anon/linkage/validation/entropy_names.py

===============================================================================

    Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

    This file is part of CRATE.

    CRATE is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    CRATE is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with CRATE. If not, see <https://www.gnu.org/licenses/>.

===============================================================================

**Measure entropy and entropy reduction among names.**

See also:

- https://www.lesswrong.com/posts/SEZqJcSm25XpQMhzr/information-theory-and-the-symmetry-of-updating-beliefs
  denoted [Academian2010].

Summarized:

Probabilistic evidence, pev()
-----------------------------

.. code-block:: none

    pev(A, B) = P(A and B) / [P(A) * P(B)]              [1] [Academian2010]

              = pev(B, A)

Stating Bayes' theorem in those terms:

.. code-block:: none

    P(A and B) = P(A) * P(B | A) = P(B) * P(A | B)      [2] Bayes, symmetrical

    P(A | B) = P(A) * P(B | A) / P(B)                   [3] Bayes, conventional

Substituting [2] into [1],

.. code-block:: none

    pev(A, B) = P(B) * P(A | B) / [P(A) * P(B)]         RNC derivation
              = P(A | B) / P(A)
              = P(A) * P(B | A) / [P(A) * P(B)]
              = P(B | A) / P(B)

we reach this version of Bayes' theorem:

.. code-block:: none

    P(A | B) = P(A) * pev(A, B)                         [4a] [Academian2010]
    P(B | A) = P(B) * pev(A, B)                         [4b] [Academian2010]

Probabilistic evidence, being the ratio of two probabilities, has range [0,
+∞]. It is a multiplicative measure of mutual evidence: 1 if A and B are
independent; >1 if they make each other more likely; <1 if they make each other
less likely.
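
As a quick numeric check (a minimal Python sketch; the probabilities are
made up for illustration):

.. code-block:: python

    from math import isclose

    def pev(p_a: float, p_b: float, p_a_and_b: float) -> float:
        # Probabilistic evidence, eq. [1]: P(A and B) / [P(A) * P(B)].
        return p_a_and_b / (p_a * p_b)

    # Independent events: P(A and B) = P(A) * P(B), so pev = 1:
    assert isclose(pev(0.5, 0.4, 0.2), 1.0)
    # Mutually supportive events: pev > 1:
    assert isclose(pev(0.5, 0.4, 0.3), 1.5)
    # Eq. [4a]: P(A | B) = P(A) * pev(A, B); here, 0.5 * 1.5 = 0.3 / 0.4:
    assert isclose(0.5 * pev(0.5, 0.4, 0.3), 0.3 / 0.4)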


Information value, inf()
------------------------

The information value of an event (a.k.a. surprisal, self-information):

.. code-block:: none

    inf(A) = log_base_half[P(A)] = -log2[P(A)]          [5] [Academian2010]

Range check: P(A) ∈ [0, 1], so inf(A) ∈ [0, +∞]; impossibility (P = 0) gives
inf(A) = +∞, while certainty (P = 1) gives inf(A) = 0; P = 0.5 corresponds to
inf(A) = 1.

This is also "uncertainty" (how many independent bits are required to confirm
that A is true) or "informativity" (how many independent bits are gained if we
are told that A is true).
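
For example (a minimal Python sketch; illustrative values only):

.. code-block:: python

    from math import log2

    def inf(p: float) -> float:
        # Surprisal in bits, eq. [5]: -log2(P(A)).
        return -log2(p)

    assert inf(1.0) == 0.0   # certainty carries no information
    assert inf(0.5) == 1.0   # one fair coin flip: exactly 1 bit
    assert inf(0.25) == 2.0  # two independent fair coin flips: 2 bits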


Informational evidence, iev()
-----------------------------

Redundancy, or mutual information, or informational evidence:

.. code-block:: none

    iev(A, B) = log2[pev(A, B)]                        [6] [Academian2010]

                NOTE the error in the original (twice, in equation and
                preceding paragraph); it cannot be -log2[pev(A, B)], as pointed
                out in the comments, and rechecked here.

              = log2{ P(A and B)  / [P(A)      * P(B)] }            from [1]
              = log2[P(A and B)]  - log2[P(A)] - log2[P(B)]
              = -inf(A and B)     + inf(A)     + inf(B)

              = inf(A) + inf(B) - inf(A and B)          [7] [Academian2010]

              = iev(B, A)

Range check: pev ∈ [0, +∞], so iev ∈ [-∞, +∞].

If iev(A, B) is positive, the uncertainty of A decreases upon observing B
(meaning A becomes more likely). If it is negative, the uncertainty of A
increases (A becomes less likely). A value of -∞ means A and B completely
contradict each other, and +∞ means they completely confirm each other.
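
Checking [6] against [7] numerically (a minimal Python sketch with made-up
probabilities):

.. code-block:: python

    from math import isclose, log2

    p_a, p_b, p_a_and_b = 0.5, 0.4, 0.3

    iev_via_6 = log2(p_a_and_b / (p_a * p_b))             # eq. [6]
    iev_via_7 = -log2(p_a) - log2(p_b) + log2(p_a_and_b)  # eq. [7]

    assert isclose(iev_via_6, iev_via_7)  # both log2(1.5) ≈ 0.585 bits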


Conditional information value
-----------------------------

.. code-block:: none

    inf(A | B) = -log2[P(A | B)]                                [8], from [5]
    
               = -log2{ P(A)   * pev(A, B) }                        from [4a]
               = -{ log2[P(A)] + log2[pev(A, B)] }
               = -log2[P(A)]   - log2[pev(A, B)]
               = inf(A)        - iev(A, B)                          from [5, 6]
               = inf(A)        - [inf(A) + inf(B) - inf(A and B)]   from [7]
               = inf(A)        -  inf(A) - inf(B) + inf(A and B)
               =                         - inf(B) + inf(A and B)

               = inf(A and B) - inf(B)                  [9] [Academian2010]
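
As a numeric check of [8] against [9] (a minimal Python sketch with made-up
probabilities):

.. code-block:: python

    from math import isclose, log2

    p_b, p_a_and_b = 0.4, 0.3
    p_a_given_b = p_a_and_b / p_b  # = 0.75

    inf_via_8 = -log2(p_a_given_b)            # eq. [8]
    inf_via_9 = -log2(p_a_and_b) + log2(p_b)  # eq. [9]: inf(A and B) - inf(B)

    assert isclose(inf_via_8, inf_via_9)  # both ≈ 0.415 bits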


Information-theoretic Bayes' theorem
------------------------------------

Taking logs of [4a],

.. code-block:: none

    log2[P(A | B)] = log2[P(A)] + log2[pev(A, B)]

so

.. code-block:: none

    -log2[P(A | B)] = -log2[P(A)] - log2[pev(A, B)]

we obtain, from [8], [5], and [6] respectively,

.. code-block:: none

    inf(A | B) = inf(A) - iev(A, B)                     [10] [Academian2010]

or: Bayesian updating is subtracting *mutual evidence* from *uncertainty*.


A worked example
----------------

.. code-block:: bash

    ./entropy_names.py demobayes


Other references
----------------

- Bayes' theorem:
  https://en.wikipedia.org/wiki/Bayes%27_theorem
  ... ultimately Bayes (1763).

- A probability mass function for a discrete random variable X, which can take
  multiple states each labelled x: p_X(x) = P(X = x).
  https://en.wikipedia.org/wiki/Probability_mass_function

- Information content, self-information, surprisal, Shannon information, inf():
  https://en.wikipedia.org/wiki/Information_content
  ... ultimately e.g. Shannon (1948), Shannon & Weaver (1949).
  For a single event, usually expressed as I(x) = -log[P(x)].
  For a random variable, usually expressed as I_X(x) = -log[p_X(x)].

- Entropy is the expected information content (surprisal) of measurement of X:
  https://en.wikipedia.org/wiki/Entropy_(information_theory)
  Usually written H(X) = E[I(X)] = E[-log(P(X))],
  or (with the minus sign outside): H(X) = -sum_i[P(x_i) * log(P(x_i))],
  i.e. the sum of information for each value, weighted by the probability of
  that value.

- Mutual information (compare "informational evidence" above):
  https://en.wikipedia.org/wiki/Mutual_information
  Typically:

  .. code-block:: none

      I(X; Y) = I(Y; X)                             # symmetric
              = H(X) - H(X | Y)
              = H(Y) - H(Y | X)
              = H(X) + H(Y) - H(X, Y)               # cf. eq. [7]?
              = H(X, Y) - H(X | Y) - H(Y | X)
      I(X; Y) >= 0                                  # non-negative

  where

  - H(X) and H(Y) are marginal entropies,
  - H(X | Y) and H(Y | X) are conditional entropies,
  - H(X, Y) is the joint entropy.

  However, this is not the same quantity; I(X; Y) >= 0 whereas iev ∈ [-∞, +∞].
  This (Wikipedia) is the mutual information of two random variables: the
  amount of information you can observe about one random variable by observing
  the other. The "iev" concept above is about pairs of individual events.

  For two discrete RVs,

  .. code-block:: none

      I(X; Y) = sum_y{ sum_x{ P_XY(x, y) * log[P_XY(x, y) / (P_X(x) * P_Y(y))] }}

- Mutual information is therefore a quantity aggregated across all events of
  the two variables. The single-event version is "pointwise mutual
  information" (see the sketch after this list),
  https://en.wikipedia.org/wiki/Pointwise_mutual_information, which is

  .. code-block:: none

      pmi(x; y) = log[ P(x, y) / (P(x) * P(y)) ]
                = log[ P(x | y) / P(x) ]
                = log[ P(y | x) / P(y) ]
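
To make that distinction concrete, a minimal Python sketch (a made-up joint
distribution over two binary random variables):

.. code-block:: python

    from math import log2

    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    p_x = {0: 0.5, 1: 0.5}  # marginal distribution of X
    p_y = {0: 0.6, 1: 0.4}  # marginal distribution of Y

    def pmi(x: int, y: int) -> float:
        # Pointwise mutual information of one event pair, in bits.
        return log2(joint[(x, y)] / (p_x[x] * p_y[y]))

    # pmi can be negative for individual event pairs...
    assert pmi(0, 1) < 0 < pmi(0, 0)

    # ... but mutual information, its expectation over the joint
    # distribution, cannot be:
    mi = sum(p * pmi(x, y) for (x, y), p in joint.items())
    assert mi >= 0  # here ≈ 0.125 bits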


Applying to our problem of selecting a good partial representation
------------------------------------------------------------------

Assume we are comparing a proband and a candidate and there is not a full
match. We start with some sort of prior, P(H | information so far); for now,
we'll simplify that to P(H). We want P(H | D) where D is the new information
from the partial identifier -- the options being a partial match, or no match.
We generally use this form of Bayes' theorem:

.. code-block:: none

    ln(posterior odds)       = ln(prior odds)   + ln(likelihood ratio)
    ln[P(H | D) / P(¬H | D)] = ln[P(H) / P(¬H)] + ln[P(D | H) / P(D | ¬H)]

Converting to log2 just involves multiplying by a constant, of course:

.. code-block:: none

    ln(x)   = log2(x) * ln(2) 
    log2(x) = ln(x) * log2(e)
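
A minimal Python sketch of that log-odds update, cross-checked against
conventional Bayes using the numbers from the breast-cancer demo
(demo_info_theory_bayes_cancer):

.. code-block:: python

    from math import exp, isclose, log

    def posterior_via_log_odds(
        p_h: float, p_d_given_h: float, p_d_given_not_h: float
    ) -> float:
        # ln(posterior odds) = ln(prior odds) + ln(likelihood ratio)
        ln_odds = log(p_h / (1 - p_h)) + log(p_d_given_h / p_d_given_not_h)
        odds = exp(ln_odds)
        return odds / (1 + odds)  # convert odds back to probability

    p_h, p_d_h, p_d_nh = 0.01, 0.8, 0.096  # breast-cancer example
    direct = p_h * p_d_h / (p_h * p_d_h + (1 - p_h) * p_d_nh)
    assert isclose(posterior_via_log_odds(p_h, p_d_h, p_d_nh), direct)
    # both ≈ 0.0776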

A partial match would provide a log likelihood ratio of

.. code-block:: none

    log(p_ep) − log(p_pnf)

and no match would provide a log likelihood ratio of

.. code-block:: none

    log(p_en) − log(p_n)

We could weight that (or the information equivalent) by the probability of
obtaining a partial match (given no full match) and of obtaining no match
(given no full match) respectively.

... let's skip this and try mutual information.


Note
----

Code largely abandoned; not re-checked after NameFrequencyInfo was
refactored, as this code had served its purpose.

"""  # noqa


# =============================================================================
# Imports
# =============================================================================

import argparse
from collections import defaultdict
from dataclasses import dataclass
import logging
from typing import Dict, Generator, Iterable, List, Tuple

from cardinal_pythonlib.logs import main_only_quicksetup_rootlogger
from cardinal_pythonlib.probability import bayes_posterior
from numba import jit
from numpy import log2, power
from rich_argparse import ArgumentDefaultsRichHelpFormatter

from crate_anon.linkage.constants import GENDER_FEMALE, GENDER_MALE
from crate_anon.linkage.frequencies import NameFrequencyInfo
from crate_anon.linkage.matchconfig import MatchConfig

log = logging.getLogger(__name__)


# =============================================================================
# Constants
# =============================================================================

ACCURATE_MIN_NAME_FREQ = 1e-10
FLOAT_TOLERANCE = 1e-10


# =============================================================================
# Information theory calculations
# =============================================================================


@jit(nopython=True)
def p_log2p(p: float) -> float:
    """
    Given p, calculate p * log_2(p).
    """
    return p * log2(p)


def entropy(frequencies: Iterable[float]) -> float:
    """
    Returns the (information/Shannon) entropy of the probability
    distribution supplied, in bits. By definition,

    H = -sum_i[ p_i * log_2(p_i) ]

    https://en.wikipedia.org/wiki/Quantities_of_information
    """
    return -sum(p_log2p(p) for p in frequencies)
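

# Illustrative check (example values, not from the original source):
# entropy([0.5, 0.5]) == 1.0 (one fair coin flip = 1 bit); a uniform
# distribution over 2**k outcomes has entropy k bits.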


@jit(nopython=True)
def p_log2_p_over_q(p: float, q: float) -> float:
    """
    Given p and q, calculate p * log_2(p / q), except return 0 if p == 0.
    Used for KL divergence.
    """
    if p == 0:
        return 0
    return p * log2(p / q)


def relative_entropy_kl_divergence(
    pairs: Iterable[Tuple[float, float]]
) -> float:
    """
    Returns the relative entropy, or Kullback-Leibler divergence,
    D_KL(P || Q), in bits;
    https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence.

    The iterable should contain pairs P(x), Q(x) for all values of x in the
    distribution. We calculate

    D_KL(P || Q) = sum_x{ P(x) * log[P(x) / Q(x)] }
    """
    kl = sum(p_log2_p_over_q(p, q) for p, q in pairs)
    assert kl >= 0, "Bug: K-L divergence must be >=0"
    return kl
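

# Illustrative example (values not from the original source): for
# P = (0.5, 0.5) and Q = (0.25, 0.75),
# relative_entropy_kl_divergence([(0.5, 0.25), (0.5, 0.75)])
# = 0.5 * log2(2) + 0.5 * log2(2/3) ≈ 0.2075 bits.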


@jit(nopython=True)
def inf_bits_from_p(p: float) -> float:
    """
    The information value (surprisal, self-information) of an event from
    its probability, p; see equation [5] above.
    """
    return -log2(p)


@jit(nopython=True)
def p_from_inf(inf_bits: float) -> float:
    """
    The probability of an event from its information value (surprisal,
    self-information); the inverse of inf_bits_from_p. See equation [5]
    above.
    """
    return power(2, -inf_bits)


# =============================================================================
# Gender-weighted version of a frequency dictionary
# =============================================================================


def gen_gender_weighted_frequencies(
    freqdict: Dict[Tuple[str, str], float], p_female: float
) -> Generator[float, None, None]:
    """
    Generates gender-weighted frequencies. Requires p_female + p_male = 1.
    """
    p_male = 1 - p_female
    for (name, gender), p in freqdict.items():
        if gender == GENDER_FEMALE:
            yield p * p_female
        elif gender == GENDER_MALE:
            yield p * p_male
        else:
            raise ValueError("bad gender in frequency info")


def get_frequencies(
    nf: NameFrequencyInfo,
    p_female: float = None,
    metaphones: bool = False,
    first_two_char: bool = False,
) -> List[float]:
    """
    Returns raw frequencies for a category of identifier, optionally
    combining (weighting by gender) for those stored separately by gender.
    """
    assert not (metaphones and first_two_char)
    if nf.by_gender:
        assert p_female is not None
    if metaphones:
        return [
            i.p_metaphone for i in gen_name_frequency_tuples(nf, p_female)
        ]
    elif first_two_char:
        return [i.p_f2c for i in gen_name_frequency_tuples(nf, p_female)]
    else:
        return [i.p_name for i in gen_name_frequency_tuples(nf, p_female)]


@dataclass
class ValidationNameFreqInfo:
    """
    Used for validation calculations.
    """

    name: str
    p_name: float
    metaphone: str
    p_metaphone: float
    f2c: str
    p_f2c: float


def gen_name_frequency_tuples(
    nf: NameFrequencyInfo,
    p_female: float = None,
) -> Generator[ValidationNameFreqInfo, None, None]:
    """
    Generates frequency tuples (name, p_name, metaphone, p_metaphone, f2c,
    p_firsttwochar).
    """
    by_gender = nf.by_gender
    if by_gender:
        assert p_female is not None
        p_male = 1 - p_female
    else:
        p_male = None
    for info in nf.infolist:
        if by_gender:
            if info.gender == GENDER_FEMALE:
                multiple = p_female
            elif info.gender == GENDER_MALE:
                multiple = p_male
            else:
                raise ValueError("bad gender")
        else:
            multiple = 1
        yield ValidationNameFreqInfo(
            name=info.name,
            p_name=multiple * info.p_name,
            metaphone=info.metaphone,
            p_metaphone=multiple * info.p_metaphone,
            f2c=info.f2c,
            p_f2c=multiple * info.p_f2c,
        )


# =============================================================================
# Demonstration of the information-based version of Bayes' theorem
# =============================================================================


def demo_info_theory_bayes_cancer() -> None:
    """
    From the comments in [Academian2010]:

    1% of women at age forty who participate in routine screening have
    breast cancer. 80% of women with breast cancer will get positive
    mammographies. 9.6% of women without breast cancer will also get
    positive mammographies. A woman in this age group had a positive
    mammography in a routine screening. What is the probability that she
    actually has breast cancer?
    """
    # Problem
    p_cancer = 0.01
    p_pos_given_cancer = 0.8
    p_pos_given_no_cancer = 0.096
    print(demo_info_theory_bayes_cancer.__doc__)
    print(
        f"p(cancer) = {p_cancer}, p(pos | cancer) = {p_pos_given_cancer}, "
        f"p(pos | ¬cancer) = {p_pos_given_no_cancer}"
    )
    # Goal: calculate p_cancer_given_pos

    # Derived
    p_no_cancer = 1 - p_cancer
    p_pos_and_cancer = p_cancer * p_pos_given_cancer
    p_pos_and_no_cancer = p_no_cancer * p_pos_given_no_cancer
    p_pos = p_pos_and_cancer + p_pos_and_no_cancer

    # Conventional Bayes:
    p_cancer_given_pos_std = bayes_posterior(
        prior=p_cancer,
        likelihood=p_pos_given_cancer,
        marginal_likelihood=p_pos,
    )
    print(f"(plain Bayes) p(cancer | pos) = {p_cancer_given_pos_std}")

    # Either info theory version:
    inf_pos = inf_bits_from_p(p_pos)
    inf_pos_and_cancer = inf_bits_from_p(p_pos_and_cancer)

    # Version 1:
    inf_cancer_given_pos_v1 = inf_pos_and_cancer - inf_pos  # eq. [9]
    p_cancer_given_pos_v1 = p_from_inf(inf_cancer_given_pos_v1)
    print(f"(info theory v1) p(cancer | pos) = {p_cancer_given_pos_v1}")

    # Version 2:
    inf_cancer = inf_bits_from_p(p_cancer)
    iev_pos_cancer = inf_pos + inf_cancer - inf_pos_and_cancer
    inf_cancer_given_pos_v2 = inf_cancer - iev_pos_cancer  # eq. [10]
    p_cancer_given_pos_v2 = p_from_inf(inf_cancer_given_pos_v2)
    print(f"(info theory v2) p(cancer | pos) = {p_cancer_given_pos_v2}")

    # Same answer, within rounding error:
    assert (
        abs(p_cancer_given_pos_v1 - p_cancer_given_pos_v2) < FLOAT_TOLERANCE
    )


# =============================================================================
# Mutual information examples
# =============================================================================


def mutual_info(
    iterable_xy_x_y: Iterable[Tuple[float, float, float]],
    verbose: bool = False,
) -> float:
    """
    See https://en.wikipedia.org/wiki/Mutual_information: mutual information
    of two jointly discrete random variables X and Y. The expectation from
    the iterable is that for all x ∈ X and y ∈ Y, the iterable delivers the
    tuple P_X_Y(x, y), P_X(x), P_Y(y). Uses log2 and therefore the units are
    bits.
    """
    # Verbose version:
    if verbose:
        mutual_info_bits = 0.0
        for i, (p_xy, p_x, p_y) in enumerate(iterable_xy_x_y):
            bits = p_xy * log2(p_xy / (p_x * p_y))
            if i % 10000 == 0:
                log.info(
                    f"p_xy = {p_xy}, p_x = {p_x}, p_y = {p_y}, bits = {bits}"
                )
            mutual_info_bits += bits
        return mutual_info_bits
    # Terse version:
    return sum(
        p_xy * log2(p_xy / (p_x * p_y))
        for p_xy, p_x, p_y in iterable_xy_x_y
    )


def gen_mutual_info_name_probabilities(
    nf: NameFrequencyInfo,
    p_female: float = None,
    name_metaphone: bool = False,
    name_firsttwochar: bool = False,
    metaphone_firsttwochar: bool = False,
) -> Generator[Tuple[float, float, float], None, None]:
    """
    Generates mutual information probabilities for name/fuzzy name
    comparisons.
    """
    assert (
        sum([name_metaphone, name_firsttwochar, metaphone_firsttwochar]) == 1
    )
    for info in gen_name_frequency_tuples(
        nf=nf,
        p_female=p_female,
    ):
        if name_metaphone:
            # Names are mapped many-to-one to metaphones. Therefore,
            # P(name_x ∧ metaphone_for_name_x) = P(name_x). However,
            # P(metaphone_for_name_x) ≥ P(name_x).
            p_xy, p_x, p_y = info.p_name, info.p_name, info.p_metaphone
        elif name_firsttwochar:
            # Similarly for first-two-character representations.
            p_xy, p_x, p_y = info.p_name, info.p_name, info.p_f2c
        else:  # metaphone_firsttwochar
            # Here there is overlap. So we use p_name as the joint
            # probability; there may be some duplication but I think that is
            # OK (they'll add up).
            p_xy, p_x, p_y = info.p_name, info.p_metaphone, info.p_f2c
        yield p_xy, p_x, p_y


# =============================================================================
# Information theory summaries
# =============================================================================


def show_info_theory_calcs() -> None:
    """
    Show some information theory calculations.
    """
    # Special options:
    # - Get the probabilities right (very low minimum frequency).
    # - Load details of first-two-character fuzzy representations.
    cfg = MatchConfig(
        min_name_frequency=ACCURATE_MIN_NAME_FREQ,
    )

    # -------------------------------------------------------------------------
    # Surnames
    # -------------------------------------------------------------------------
    surname_p = get_frequencies(cfg.surname_freq_info)
    # Test: surname_p = [1/100000] * 100000
    log.info(f"Number of surnames: {len(surname_p)}")
    h_surnames = entropy(surname_p)
    log.info(f"Surname entropy: H = {h_surnames} bits")

    surname_metaphone_p = get_frequencies(
        cfg.surname_freq_info, metaphones=True
    )
    log.info(f"Number of surname metaphones: {len(surname_metaphone_p)}")
    h_surname_metaphones = entropy(surname_metaphone_p)
    log.info(f"Surname metaphone entropy: H = {h_surname_metaphones} bits")

    surname_f2c_p = get_frequencies(
        cfg.surname_freq_info, first_two_char=True
    )
    log.info(f"Number of surname first-two-chars: {len(surname_f2c_p)}")
    h_surname_f2c = entropy(surname_f2c_p)
    log.info(f"Surname first-two-char entropy: H = {h_surname_f2c} bits")

    # kl_name_metaphone = relative_entropy_kl_divergence(
    #     gen_name_pairs(cfg.surname_freq_info, metaphones=True)
    # )
    # log.info(
    #     f"Surname name/metaphone relative entropy: "
    #     f"D_KL = {kl_name_metaphone} bits"
    # )

    surname_name_meta_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.surname_freq_info, name_metaphone=True
        )
    )
    log.info(
        f"Surname: name/metaphone mutual information: "
        f"I = {surname_name_meta_mi} bits"
    )
    surname_name_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.surname_freq_info, name_firsttwochar=True
        )
    )
    log.info(
        f"Surname: name/first-two-char mutual information: "
        f"I = {surname_name_f2c_mi} bits"
    )
    surname_meta_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.surname_freq_info, metaphone_firsttwochar=True
        )
    )
    log.info(
        f"Surname: metaphone/first-two-char mutual information: "
        f"I = {surname_meta_f2c_mi} bits"
    )

    # -------------------------------------------------------------------------
    # Forenames
    # -------------------------------------------------------------------------
    p_female = cfg.p_female_given_m_or_f
    forename_p = get_frequencies(nf=cfg.forename_freq_info, p_female=p_female)
    log.info(
        f"Number of forenames (M/F versions count separately): "
        f"{len(forename_p)}"
    )
    h_forenames = entropy(forename_p)
    log.info(f"Forename entropy: H = {h_forenames} bits")

    forename_metaphone_p = get_frequencies(
        nf=cfg.forename_freq_info,
        p_female=p_female,
        metaphones=True,
    )
    log.info(
        f"Number of forename metaphones (M/F versions count separately): "
        f"{len(forename_metaphone_p)}"
    )
    h_forename_metaphones = entropy(forename_metaphone_p)
    log.info(f"Forename metaphone entropy: H = {h_forename_metaphones} bits")

    forename_f2c_p = get_frequencies(
        cfg.forename_freq_info,
        p_female=p_female,
        first_two_char=True,
    )
    log.info(
        f"Number of forename first-two-chars (M/F versions count "
        f"separately): {len(forename_f2c_p)}"
    )
    h_forename_f2c = entropy(forename_f2c_p)
    log.info(f"Forename first-two-char entropy: H = {h_forename_f2c} bits")

    forename_name_meta_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.forename_freq_info, p_female=p_female, name_metaphone=True
        )
    )
    log.info(
        f"Forename: name/metaphone mutual information: "
        f"I = {forename_name_meta_mi} bits"
    )
    forename_name_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.forename_freq_info,
            p_female=p_female,
            name_firsttwochar=True,
        )
    )
    log.info(
        f"Forename: name/first-two-char mutual information: "
        f"I = {forename_name_f2c_mi} bits"
    )
    forename_meta_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.forename_freq_info,
            p_female=p_female,
            metaphone_firsttwochar=True,
        )
    )
    log.info(
        f"Forename: metaphone/first-two-char mutual information: "
        f"I = {forename_meta_f2c_mi} bits"
    )


# =============================================================================
# Partial match frequencies
# =============================================================================


def partial_calcs(
    nf: NameFrequencyInfo,
    p_female: float = None,
    report_every: int = 10000,
    debug_stop: int = None,
) -> None:
    """
    Show e.g. p(match metaphone but not name).

    This has the potential to be really slow (e.g. 175k^2 = 3e10) though it
    should only need to be done once -- however, we can optimize beyond an
    n^2 comparison. Uses examples from the public US name databases. Maths
    is e.g.

    .. code-block:: none

        integral_over_a(integral_over_b(p_a * p_b * binary(conjunction event)))

    Examples:

    - Share metaphone, not first two characters (F2C) or name:

      .. code-block:: none

          AABERG [APRK, AA] / WIBBERG [APRK, WI]
          AABY [AP, AA] / ABAY [AP, AB]
          AAKRE [AKR, AA] / OKYERE [AKR, OK]

    - Share F2C, not metaphone or name:

      .. code-block:: none

          AALDERS [ALTR, AA] / AASEN [ASN, AA]
          (etc.; these are obvious)
    """
    # def debug_report(
    #     event: str, a_: ValidationNameFreqInfo, b_: ValidationNameFreqInfo
    # ) -> None:
    #     log.info(
    #         f"{event}: {a.name} [{a.metaphone}, {a.f2c}] / "
    #         f"{b.name} [{b.metaphone}, {b.f2c}]"
    #     )

    # We optimize thus:
    metaphone_to_infolist = defaultdict(
        list
    )  # type: Dict[str, List[ValidationNameFreqInfo]]
    f2c_to_infolist = defaultdict(
        list
    )  # type: Dict[str, List[ValidationNameFreqInfo]]
    for i in gen_name_frequency_tuples(nf, p_female):
        metaphone_to_infolist[i.metaphone].append(i)
        f2c_to_infolist[i.f2c].append(i)
    # This improved the performance for 10k names in A from about 1h7min to
    # about 4 seconds, so that was definitely worth it.

    total_p_a = 0.0  # for normalization, in case this is not 1
    # ... might differ from total_p_a only if we break via our debugging loop
    total_p_b = sum(
        info_.p_name for info_ in gen_name_frequency_tuples(nf, p_female)
    )
    # ... we always iterate through all of b for all a, even if debugging;
    # ... and even if we iterate through b implicitly (via the dictionaries).
    p_share_metaphone = 0.0
    p_share_metaphone_not_name = 0.0
    p_share_metaphone_not_f2c_or_name = 0.0
    p_share_f2c = 0.0
    p_share_f2c_not_name = 0.0
    p_share_f2c_not_metaphone_or_name = 0.0
    n = len(nf.infolist)
    for idx_a, a in enumerate(
        gen_name_frequency_tuples(nf, p_female), start=1
    ):
        if idx_a % report_every == 0:
            log.info(f"... processing name {idx_a} / {n}")
        if debug_stop and idx_a > debug_stop:
            break  # for debugging
        # For speed:
        a_p_name = a.p_name
        a_name = a.name
        a_metaphone = a.metaphone
        a_f2c = a.f2c
        total_p_a += a_p_name
        for b in f2c_to_infolist[a_f2c]:
            # Iterate only through names sharing first two characters.
            p_product = a_p_name * b.p_name
            p_share_f2c += p_product
            if a_name != b.name:
                p_share_f2c_not_name += p_product
                # debug_report("share_f2c_not_name", a, b)
                if a_metaphone != b.metaphone:
                    p_share_f2c_not_metaphone_or_name += p_product
                    # debug_report("share_f2c_not_metaphone_or_name", a, b)
        for b in metaphone_to_infolist[a_metaphone]:
            # Iterate only through names sharing metaphones.
            p_product = a_p_name * b.p_name
            p_share_metaphone += p_product
            if a_name != b.name:
                p_share_metaphone_not_name += p_product
                # debug_report("share_metaphone_not_name", a, b)
                if a_f2c != b.f2c:
                    p_share_metaphone_not_f2c_or_name += p_product
                    # debug_report("share_metaphone_not_f2c_or_name", a, b)

    # Normalized probabilities:
    norm = 1 / (total_p_a * total_p_b)  # normalizing factor
    np_share_metaphone = p_share_metaphone * norm
    np_share_metaphone_not_name = p_share_metaphone_not_name * norm
    np_share_metaphone_not_f2c_or_name = (
        p_share_metaphone_not_f2c_or_name * norm
    )
    np_share_f2c = p_share_f2c * norm
    np_share_f2c_not_name = p_share_f2c_not_name * norm
    np_share_f2c_not_metaphone_or_name = (
        p_share_f2c_not_metaphone_or_name * norm
    )
    log.info(
        f"Unnormalized: total_p_a = {total_p_a}, "
        f"total_p_b = {total_p_b}, "
        f"nf = {norm}, "
        f"p_share_metaphone = {p_share_metaphone}, "
        f"p_share_metaphone_not_name = {p_share_metaphone_not_name}, "
        f"p_share_metaphone_not_f2c_or_name = "
        f"{p_share_metaphone_not_f2c_or_name}, "
        f"p_share_f2c = {p_share_f2c}, "
        f"p_share_f2c_not_name = {p_share_f2c_not_name}, "
        f"p_share_f2c_not_metaphone_or_name = "
        f"{p_share_f2c_not_metaphone_or_name}"
    )
    log.info(
        f"Normalized: P(share metaphone) = {np_share_metaphone}, "
        f"P(share metaphone, not name) = {np_share_metaphone_not_name}, "
        f"P(share metaphone, not first two char or name) = "
        f"{np_share_metaphone_not_f2c_or_name}, "
        f"P(share first two char) = {np_share_f2c}, "
        f"P(share first two char, not name) = {np_share_f2c_not_name}, "
        f"P(share first two char, not metaphone or name) = "
        f"{np_share_f2c_not_metaphone_or_name}"
    )


def show_partial_match_frequencies() -> None:
    """
    Show population-level frequency/probability calculations from our name
    frequency databases, e.g. p(match metaphone but not name), p(match first
    two characters but not metaphone or name).
    """
    cfg = MatchConfig(
        min_name_frequency=ACCURATE_MIN_NAME_FREQ,
    )
    log.info("Partial match frequencies for surnames:")
    partial_calcs(cfg.surname_freq_info)
    log.info("Partial match frequencies for forenames:")
    partial_calcs(cfg.forename_freq_info, p_female=cfg.p_female_given_m_or_f)
_ = """ Saved results: 2022-06-26 14:40:35.939 __main__:INFO: Partial match frequencies for surnames: 2022-06-26 14:40:36.191 crate_anon.linkage.helpers:WARNING: No metaphone for 'HWA' 2022-06-26 14:40:36.191 crate_anon.linkage.helpers:WARNING: No metaphone for 'HWEE' 2022-06-26 14:40:36.192 crate_anon.linkage.helpers:WARNING: No metaphone for 'HWU' 2022-06-26 14:40:39.924 __main__:INFO: ... processing name 10000 / 175880 ... 2022-06-26 14:41:39.135 __main__:INFO: ... processing name 170000 / 175880 2022-06-26 14:41:40.034 __main__:INFO: Unnormalized: total_p_a = 0.9555555960002395, total_p_b = 0.9555555960002395, nf = 1.0951864946351493, p_share_metaphone = 0.011339587936127086, p_share_metaphone_not_name = 0.000768652112738086, p_share_metaphone_not_f2c_or_name = 0.0005779216890297886, p_share_f2c = 0.02015927698649681, p_share_f2c_not_name = 0.009588341150670168, p_share_f2c_not_metaphone_or_name = 0.009397610726948866 2022-06-26 14:41:40.034 __main__:INFO: Normalized: P(share metaphone) = 0.01241896356237405, P(share metaphone, not name) = 0.000841817412943526, P(share metaphone, not first two char or name) = 0.0006329320287821591, P(share first two char) = 0.022078167897220478, P(share first two char, not name) = 0.010501021734168415, P(share first two char, not metaphone or name) = 0.010292136349992806 ~~~ 2022-06-26 14:41:40.054 __main__:INFO: Partial match frequencies for forenames: 2022-06-26 14:41:43.254 __main__:INFO: ... processing name 10000 / 106695 ... 2022-06-26 14:42:13.711 __main__:INFO: ... processing name 100000 / 106695 2022-06-26 14:42:14.506 __main__:INFO: Unnormalized: total_p_a = 0.9999999999997667, total_p_b = 0.9999999999997667, nf = 1.0000000000004665, p_share_metaphone = 0.005028025327852539, p_share_metaphone_not_name = 0.0025948399071387762, p_share_metaphone_not_f2c_or_name = 0.0014685292424996006, p_share_f2c = 0.018453783690030413, p_share_f2c_not_name = 0.01602059826926657, p_share_f2c_not_metaphone_or_name = 0.014894287604555126 2022-06-26 14:42:14.506 __main__:INFO: Normalized: P(share metaphone) = 0.005028025327854884, P(share metaphone, not name) = 0.0025948399071399867, P(share metaphone, not first two char or name) = 0.0014685292425002856, P(share first two char) = 0.01845378369003902, P(share first two char, not name) = 0.016020598269274045, P(share first two char, not metaphone or name) = 0.014894287604562073 Note in particular that metaphone-sharing is rarer than F2C-sharing: Surnames: P(share metaphone) = 0.01241896356237405 P(share first two char) = 0.022078167897220478 P(share metaphone, not name) = 0.000841817412943526 P(share first two char, not name) = 0.010501021734168415 P(share metaphone, not first two char or name) = 0.0006329320287821591 P(share first two char, not metaphone or name) = 0.010292136349992806 Forenames: P(share metaphone) = 0.005028025327854884 P(share first two char) = 0.01845378369003902 P(share metaphone, not name) = 0.0025948399071399867 P(share first two char, not name) = 0.016020598269274045 P(share metaphone, not first two char or name) = 0.0014685292425002856 P(share first two char, not metaphone or name) = 0.014894287604562073 """ # noqa # ============================================================================= # Main # =============================================================================


def main() -> None:
    """
    Command-line entry point.
    """
    function_map = {
        "infotheory": show_info_theory_calcs,
        "demobayes": demo_info_theory_bayes_cancer,
        "partials": show_partial_match_frequencies,
    }
    parser = argparse.ArgumentParser(
        formatter_class=ArgumentDefaultsRichHelpFormatter
    )
    parser.add_argument("command", choices=function_map.keys())
    args = parser.parse_args()
    main_only_quicksetup_rootlogger(level=logging.INFO)
    func = function_map[args.command]
    func()


if __name__ == "__main__":
    main()