Source code for crate_anon.linkage.validation.entropy_names

#!/usr/bin/env python

"""
crate_anon/linkage/validation/entropy_names.py

===============================================================================

    Copyright (C) 2015, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

    This file is part of CRATE.

    CRATE is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    CRATE is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with CRATE. If not, see <https://www.gnu.org/licenses/>.

===============================================================================

**Measure entropy and entropy reduction among names.**

See also:

- https://www.lesswrong.com/posts/SEZqJcSm25XpQMhzr/information-theory-and-the-symmetry-of-updating-beliefs
  denoted [Academian2010].

Summarized:

Probabilistic evidence, pev()
-----------------------------

.. code-block:: none

    pev(A, B) = P(A and B) / [P(A) * P(B)]              [1] [Academian2010]

              = pev(B, A)

Stating Bayes' theorem in those terms:

.. code-block:: none

    P(A and B) = P(A) * P(B | A) = P(B) * P(A | B)      [2] Bayes, symmetrical

    P(A | B) = P(A) * P(B | A) / P(B)                   [3] Bayes, conventional

but, from [1] and [2], since

.. code-block:: none

    pev(A, B) = P(B) * P(A | B) / [P(A) * P(B)]         RNC derivation
              = P(A | B) / P(A)
              = P(A) * P(B | A) / [P(A) * P(B)]
              = P(B | A) / P(B)

we reach this version of Bayes' theorem:

.. code-block:: none

    P(A | B) = P(A) * pev(A, B)                         [4a] [Academian2010]
    P(B | A) = P(B) * pev(A, B)                         [4b] [Academian2010]

Probabilistic evidence, being the ratio of two probabilities, has range [0,
+∞]. It is a multiplicative measure of mutual evidence: 1 if A and B are
independent; >1 if they make each other more likely; <1 if they make each other
less likely.
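
A minimal numerical sketch of [1], [4a] and [4b], using invented
probabilities; ``pev`` here is a hypothetical helper written for illustration,
not a function in this module:

```python
import math


def pev(p_a_and_b: float, p_a: float, p_b: float) -> float:
    # Probabilistic evidence, equation [1]. Hypothetical helper for
    # illustration only.
    return p_a_and_b / (p_a * p_b)


# Invented example probabilities:
p_a = 0.2
p_b = 0.5
p_a_and_b = 0.15  # more than P(A) * P(B) = 0.1, so A and B support each other

ev = pev(p_a_and_b, p_a, p_b)
p_a_given_b = p_a_and_b / p_b  # definition of conditional probability
p_b_given_a = p_a_and_b / p_a

# Equations [4a] and [4b]:
assert math.isclose(p_a_given_b, p_a * ev)
assert math.isclose(p_b_given_a, p_b * ev)
```

Here pev(A, B) is about 1.5, i.e. >1, so learning B raises P(A) from 0.2 to
0.3, and learning A raises P(B) from 0.5 to 0.75.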


Information value, inf()
------------------------

The information value of an event (a.k.a. surprisal, self-information):

.. code-block:: none

    inf(A) = log_base_half[P(A)] = -log2[P(A)]          [5] [Academian2010]

Range check: P(A) ∈ [0, 1], so inf(A) ∈ [0, +∞]; impossibility (P = 0) gives
inf(A) = +∞, while certainty (P = 1) gives inf(A) = 0; P = 0.5 corresponds to
inf(A) = 1.

This is also "uncertainty" (how many independent bits are required to confirm
that A is true) or "informativity" (how many independent bits are gained if we
are told that A is true).
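
For illustration, equation [5] can be sketched in plain Python as follows.
``inf_bits`` is a hypothetical stand-in for ``inf_bits_from_p()`` in this
module; it avoids the numba dependency and handles P = 0 explicitly, which is
an assumption of this sketch rather than the module's behaviour:

```python
import math


def inf_bits(p: float) -> float:
    # Surprisal in bits, equation [5]. Hypothetical helper for illustration.
    if p == 0:
        return math.inf  # impossibility
    return -math.log2(p)


coin = inf_bits(0.5)  # one fair coin flip
certain = inf_bits(1.0)  # certainty carries no information
two_coins = inf_bits(0.25)  # two independent fair flips, both heads

# Information from independent events adds:
assert two_coins == coin + coin
```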


Informational evidence, iev()
-----------------------------

Redundancy, or mutual information, or informational evidence:

.. code-block:: none

    iev(A, B) = log2[pev(A, B)]                        [6] [Academian2010]

                NOTE the error in the original (twice, in equation and
                preceding paragraph); it cannot be -log2[pev(A, B)], as pointed
                out in the comments, and rechecked here.

              = log2{ P(A and B)  / [P(A)      * P(B)] }            from [1]
              = log2[P(A and B)]  - log2[P(A)] - log2[P(B)]
              = -inf(A and B)     + inf(A)     + inf(B)

              = inf(A) + inf(B) - inf(A and B)          [7] [Academian2010]

              = iev(B, A)

Range check: pev ∈ [0, +∞], so iev ∈ [-∞, +∞].

If iev(A, B) is positive, the uncertainty of A decreases upon observing B
(meaning A becomes more likely). If it is negative, the uncertainty of A
increases (A becomes less likely). A value of -∞ means A and B completely
contradict each other, and +∞ means they completely confirm each other.
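
A small sketch of the three cases (supportive, independent, opposed), using
invented probabilities chosen as powers of two so that the results are whole
numbers of bits. The helper names are hypothetical, not part of this module:

```python
import math


def inf_bits(p: float) -> float:
    # Surprisal in bits, equation [5].
    return -math.log2(p)


def iev(p_a_and_b: float, p_a: float, p_b: float) -> float:
    # Informational evidence in bits, equation [6]. Hypothetical helper.
    return math.log2(p_a_and_b / (p_a * p_b))


# Invented probabilities, with P(A) = 0.25 and P(B) = 0.5 throughout:
supportive = iev(0.25, 0.25, 0.5)  # A implies B, so P(A and B) = P(A)
independent = iev(0.125, 0.25, 0.5)  # P(A and B) = P(A) * P(B)
opposed = iev(0.0625, 0.25, 0.5)  # P(A and B) < P(A) * P(B)

# Equation [7] gives the same answer as equation [6]:
via_eq7 = inf_bits(0.25) + inf_bits(0.5) - inf_bits(0.0625)
assert math.isclose(opposed, via_eq7)
```

The three values come out as +1, 0, and -1 bits respectively.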


Conditional information value
-----------------------------

.. code-block:: none

    inf(A | B) = -log2[P(A | B)]                                [8], from [5]
    
               = -log2{ P(A)   * pev(A, B) }                        from [4a]
               = -{ log2[P(A)] + log2[pev(A, B)] }
               = -log2[P(A)]   - log2[pev(A, B)]
               = inf(A)        - iev(A, B)                          from [5, 6]
               = inf(A)        - [inf(A) + inf(B) - inf(A and B)]   from [7]
               = inf(A)        -  inf(A) - inf(B) + inf(A and B)
               =                         - inf(B) + inf(A and B)

               = inf(A and B) - inf(B)                  [9] [Academian2010]


Information-theoretic Bayes' theorem
------------------------------------

Taking logs of [4a],

.. code-block:: none

    log2[P(A | B)] = log2[P(A)] + log2[pev(A, B)]

so

.. code-block:: none

    -log2[P(A | B)] = -log2[P(A)] - log2[pev(A, B)]

we obtain, from [8], [5], and [6] respectively,

.. code-block:: none

    inf(A | B) = inf(A) - iev(A, B)                     [10] [Academian2010]

or: Bayesian updating is subtracting *mutual evidence* from *uncertainty*.
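
As a quick check of [9] and [10] with invented numbers (a sketch only; the
variable names are arbitrary):

```python
import math

# Invented probabilities:
p_a = 0.25
p_b = 0.5
p_a_and_b = 0.2

inf_a = -math.log2(p_a)  # 2 bits of uncertainty about A beforehand
inf_b = -math.log2(p_b)
inf_a_and_b = -math.log2(p_a_and_b)
iev_a_b = inf_a + inf_b - inf_a_and_b  # equation [7]

inf_a_given_b = inf_a - iev_a_b  # equation [10]
p_a_given_b = 2 ** -inf_a_given_b  # invert equation [5]

# Agrees with equation [9] and with the direct calculation:
assert math.isclose(inf_a_given_b, inf_a_and_b - inf_b)
assert math.isclose(p_a_given_b, p_a_and_b / p_b)
```

The mutual evidence is positive here, so the uncertainty about A falls below
its initial 2 bits, and P(A | B) = 0.4 exceeds P(A) = 0.25.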


A worked example
----------------

.. code-block:: bash

    ./entropy_names.py demobayes


Other references
----------------

- Bayes' theorem:
  https://en.wikipedia.org/wiki/Bayes%27_theorem
  ... ultimately Bayes (1763).

- A probability mass function for a discrete random variable X, which can take
  multiple states each labelled x: p_X(x) = P(X = x).
  https://en.wikipedia.org/wiki/Probability_mass_function

- Information content, self-information, surprisal, Shannon information, inf():
  https://en.wikipedia.org/wiki/Information_content
  ... ultimately e.g. Shannon (1948), Shannon & Weaver (1949).
  For a single event, usually expressed as I(x) = -log[P(x)].
  For a random variable, usually expressed as I_X(x) = -log[p_X(x)].

- Entropy is the expected information content (surprisal) of measurement of X:
  https://en.wikipedia.org/wiki/Entropy_(information_theory)
  Usually written H(X) = E[I(X)] = E[-log(P(X))],
  or (with the minus sign outside): H(X) = -sum_i[P(x_i) * log(P(x_i))],
  i.e. the sum of information for each value, weighted by the probability of
  that value.

- Mutual information (compare "informational evidence" above):
  https://en.wikipedia.org/wiki/Mutual_information
  Typically:

  .. code-block:: none

      I(X; Y) = I(Y; X)                             # symmetric
              = H(X) - H(X | Y)
              = H(Y) - H(Y | X)
              = H(X) + H(Y) - H(X, Y)               # cf. eq. [7]?
              = H(X, Y) - H(X | Y) - H(Y | X)
      I(X; Y) >= 0                                  # non-negative

  where

  - H(X) and H(Y) are marginal entropies,
  - H(X | Y) and H(Y | X) are conditional entropies,
  - H(X, Y) is the joint entropy.

  However, this is not the same quantity; I(X; Y) >= 0 whereas iev ∈ [-∞, +∞].
  This (Wikipedia) is the mutual information of two random variables: the
  amount of information you can observe about one random variable by observing
  the other. The "iev" concept above is about pairs of individual events.

  For two discrete RVs,

  .. code-block:: none

      I(X; Y) = sum_y{ sum_x{ P_XY(x, y) * log[ P_XY(x, y) / (P_X(x) * P_Y(y)) ] }}

- Mutual information is a consideration across events. The individual-event
  version is "pointwise mutual information", 
  https://en.wikipedia.org/wiki/Pointwise_mutual_information, which is
  
  .. code-block:: none
  
    pmi(x; y) = log[ P(x, y) / (P(x) * P(y)) ]
              = log[ P(x | y) / P(x) ]
              = log[ P(y | x) / P(y) ]
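
The relationship between these quantities can be checked numerically. The
sketch below uses an invented joint distribution for two binary variables; the
helper names (``pmi``, ``entropy_bits``) are hypothetical, not part of this
module. It shows that individual PMI values can be negative while their
expectation, the mutual information, is not:

```python
import math

# Invented joint distribution P(x, y) for two binary random variables:
joint = {
    ("x0", "y0"): 0.4,
    ("x0", "y1"): 0.1,
    ("x1", "y0"): 0.1,
    ("x1", "y1"): 0.4,
}

# Marginal distributions:
p_x = {}
p_y = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p


def pmi(x: str, y: str) -> float:
    # Pointwise mutual information in bits, for one pair of outcomes.
    return math.log2(joint[(x, y)] / (p_x[x] * p_y[y]))


def entropy_bits(probabilities) -> float:
    # Shannon entropy in bits, treating 0 * log2(0) as 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Mutual information is the expectation of PMI over the joint distribution:
mi = sum(p * pmi(x, y) for (x, y), p in joint.items())

assert pmi("x0", "y1") < 0  # individual pairs can be negative...
assert mi >= 0  # ... but the expectation cannot be
h_x = entropy_bits(p_x.values())
h_y = entropy_bits(p_y.values())
h_xy = entropy_bits(joint.values())
assert math.isclose(mi, h_x + h_y - h_xy)
```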


Applying to our problem of selecting a good partial representation
------------------------------------------------------------------

Assume we are comparing a proband and a candidate and there is not a full
match. We start with some sort of prior, P(H | information so far); for now,
we'll simplify that to P(H). We want P(H | D) where D is the new information
from the partial identifier -- the options being a partial match, or no match.
We generally use this form of Bayes' theorem:

.. code-block:: none

    ln(posterior odds)       = ln(prior odds)   + ln(likelihood ratio)
    ln[P(H | D) / P(¬H | D)] = ln[P(H) / P(¬H)] + ln[P(D | H) / P(D | ¬H)]

Converting to log2 just involves multiplying by a constant, of course:

.. code-block:: none

    ln(x)   = log2(x) * ln(2) 
    log2(x) = ln(x) * log2(e)
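
A minimal sketch of the log-odds update above, with invented values for P(H),
P(D | H) and P(D | ¬H); the helper names are hypothetical and nothing here is
taken from the CRATE matching code:

```python
import math


def log_odds(p: float) -> float:
    # Natural log odds of a probability.
    return math.log(p / (1 - p))


def p_from_log_odds(lo: float) -> float:
    # Inverse of log_odds().
    return 1 / (1 + math.exp(-lo))


# Invented numbers: prior P(H), and the likelihoods of the data D.
p_h = 0.01
p_d_given_h = 0.9
p_d_given_not_h = 0.05

log_lr = math.log(p_d_given_h) - math.log(p_d_given_not_h)
posterior_log_odds = log_odds(p_h) + log_lr
p_h_given_d = p_from_log_odds(posterior_log_odds)

# The same log likelihood ratio in bits, i.e. using log2 rather than ln:
log2_lr = log_lr * math.log2(math.e)
assert math.isclose(log2_lr, math.log2(p_d_given_h / p_d_given_not_h))

# Cross-check against the conventional form of Bayes' theorem:
p_d = p_h * p_d_given_h + (1 - p_h) * p_d_given_not_h
assert math.isclose(p_h_given_d, p_h * p_d_given_h / p_d)
```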

A partial match would provide a log likelihood ratio of

.. code-block:: none

    log(p_ep) − log(p_pnf)

and no match would provide a log likelihood ratio of

.. code-block:: none

    log(p_en) − log(p_n)

We could weight that (or the information equivalent) by the probability of
obtaining a partial match (given no full match) and of obtaining no match
(given no full match) respectively.

... let's skip this and try mutual information.


Note
----

Code largely abandoned; not re-checked since NameFrequencyInfo was refactored,
since this code had served its purpose.

"""  # noqa


# =============================================================================
# Imports
# =============================================================================

import argparse
from collections import defaultdict
from dataclasses import dataclass
import logging
from typing import Dict, Generator, Iterable, List, Tuple

from cardinal_pythonlib.logs import main_only_quicksetup_rootlogger
from cardinal_pythonlib.probability import bayes_posterior
from numba import jit
from numpy import log2, power
from rich_argparse import ArgumentDefaultsRichHelpFormatter

from crate_anon.linkage.constants import GENDER_FEMALE, GENDER_MALE
from crate_anon.linkage.frequencies import NameFrequencyInfo
from crate_anon.linkage.matchconfig import MatchConfig

log = logging.getLogger(__name__)


# =============================================================================
# Constants
# =============================================================================

ACCURATE_MIN_NAME_FREQ = 1e-10
FLOAT_TOLERANCE = 1e-10


# =============================================================================
# Information theory calculations
# =============================================================================


@jit(nopython=True)
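def p_log2p(p: float) -> float:
    """
    Given p, calculate p * log_2(p), except return 0 if p == 0 (the usual
    convention for entropy, since p * log_2(p) -> 0 as p -> 0).
    """
    if p == 0:
        return 0.0
    return p * log2(p)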


def entropy(frequencies: Iterable[float]) -> float:
    """
    Returns the (information/Shannon) entropy of the probability distribution
    supplied, in bits. By definition,

        H = -sum_i[ p_i * log_2(p_i) ]

    https://en.wikipedia.org/wiki/Quantities_of_information
    """
    return -sum(p_log2p(p) for p in frequencies)


@jit(nopython=True)
def p_log2_p_over_q(p: float, q: float) -> float:
    """
    Given p and q, calculate p * log_2(p / q), except return 0 if p == 0.
    Used for KL divergence.
    """
    if p == 0:
        return 0
    return p * log2(p / q)


def relative_entropy_kl_divergence(
    pairs: Iterable[Tuple[float, float]]
) -> float:
    """
    Returns the relative entropy, or Kullback-Leibler divergence, D_KL(P || Q),
    in bits; https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence.

    The iterable should contain pairs P(x), Q(x) for all values of x in the
    distribution. We calculate

        D_KL(P || Q) = sum_x{ P(x) * log2[P(x) / Q(x)] }
    """
    kl = sum(p_log2_p_over_q(p, q) for p, q in pairs)
    assert kl >= 0, "Bug: K-L divergence must be >=0"
    return kl


@jit(nopython=True)
def inf_bits_from_p(p: float) -> float:
    """
    The information value (surprisal, self-information) of an event from its
    probability, p; see equation [5] above.
    """  # noqa
    return -log2(p)


@jit(nopython=True)
def p_from_inf(inf_bits: float) -> float:
    """
    The probability of an event from its information value (surprisal,
    self-information) in bits; the inverse of equation [5] above.
    """  # noqa
    return power(2, -inf_bits)


# =============================================================================
# Gender-weighted version of a frequency dictionary
# =============================================================================


def gen_gender_weighted_frequencies(
    freqdict: Dict[Tuple[str, str], float], p_female: float
) -> Generator[float, None, None]:
    """
    Generates gender-weighted frequencies. Requires p_female + p_male = 1.
    """
    p_male = 1 - p_female
    for (name, gender), p in freqdict.items():
        if gender == GENDER_FEMALE:
            yield p * p_female
        elif gender == GENDER_MALE:
            yield p * p_male
        else:
            raise ValueError("bad gender in frequency info")


def get_frequencies(
    nf: NameFrequencyInfo,
    p_female: float = None,
    metaphones: bool = False,
    first_two_char: bool = False,
) -> List[float]:
    """
    Returns raw frequencies for a category of identifier, optionally combining
    (weighting by gender) for those stored separately by gender.
    """
    assert not (metaphones and first_two_char)
    if nf.by_gender:
        assert p_female is not None
    if metaphones:
        return [i.p_metaphone for i in gen_name_frequency_tuples(nf, p_female)]
    elif first_two_char:
        return [i.p_f2c for i in gen_name_frequency_tuples(nf, p_female)]
    else:
        return [i.p_name for i in gen_name_frequency_tuples(nf, p_female)]


@dataclass
class ValidationNameFreqInfo:
    """
    Used for validation calculations.
    """

    name: str
    p_name: float
    metaphone: str
    p_metaphone: float
    f2c: str
    p_f2c: float


def gen_name_frequency_tuples(
    nf: NameFrequencyInfo,
    p_female: float = None,
) -> Generator[ValidationNameFreqInfo, None, None]:
    """
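    Generates :class:`ValidationNameFreqInfo` objects (name, p_name, metaphone,
    p_metaphone, f2c, p_f2c), with probabilities weighted by gender if the
    frequency information is stored by gender.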
    """
    by_gender = nf.by_gender
    if by_gender:
        assert p_female is not None
        p_male = 1 - p_female
    else:
        p_male = None
    for info in nf.infolist:
        if by_gender:
            if info.gender == GENDER_FEMALE:
                multiple = p_female
            elif info.gender == GENDER_MALE:
                multiple = p_male
            else:
                raise ValueError("bad gender")
        else:
            multiple = 1
        yield ValidationNameFreqInfo(
            name=info.name,
            p_name=multiple * info.p_name,
            metaphone=info.metaphone,
            p_metaphone=multiple * info.p_metaphone,
            f2c=info.f2c,
            p_f2c=multiple * info.p_f2c,
        )


# =============================================================================
# Demonstration of the information-based version of Bayes' theorem
# =============================================================================


def demo_info_theory_bayes_cancer() -> None:
    """
    From the comments in [Academian2010]:

    1% of women at age forty who participate in routine screening have breast
    cancer. 80% of women with breast cancer will get positive mammographies.
    9.6% of women without breast cancer will also get positive mammographies. A
    woman in this age group had a positive mammography in a routine screening.
    What is the probability that she actually has breast cancer?
    """
    # Problem
    p_cancer = 0.01
    p_pos_given_cancer = 0.8
    p_pos_given_no_cancer = 0.096
    print(demo_info_theory_bayes_cancer.__doc__)
    print(
        f"p(cancer) = {p_cancer}, p(pos | cancer) = {p_pos_given_cancer}, "
        f"p(pos | ¬cancer) = {p_pos_given_no_cancer}"
    )
    # Goal: calculate p_cancer_given_pos

    # Derived
    p_no_cancer = 1 - p_cancer
    p_pos_and_cancer = p_cancer * p_pos_given_cancer
    p_pos_and_no_cancer = p_no_cancer * p_pos_given_no_cancer
    p_pos = p_pos_and_cancer + p_pos_and_no_cancer

    # Conventional Bayes:
    p_cancer_given_pos_std = bayes_posterior(
        prior=p_cancer,
        likelihood=p_pos_given_cancer,
        marginal_likelihood=p_pos,
    )
    print(f"(plain Bayes)    p(cancer | pos) = {p_cancer_given_pos_std}")

    # Either info theory version:
    inf_pos = inf_bits_from_p(p_pos)
    inf_pos_and_cancer = inf_bits_from_p(p_pos_and_cancer)

    # Version 1:
    inf_cancer_given_pos_v1 = inf_pos_and_cancer - inf_pos  # eq. [9]
    p_cancer_given_pos_v1 = p_from_inf(inf_cancer_given_pos_v1)
    print(f"(info theory v1) p(cancer | pos) = {p_cancer_given_pos_v1}")

    # Version 2:
    inf_cancer = inf_bits_from_p(p_cancer)
    iev_pos_cancer = inf_pos + inf_cancer - inf_pos_and_cancer
    inf_cancer_given_pos_v2 = inf_cancer - iev_pos_cancer  # eq. [10]
    p_cancer_given_pos_v2 = p_from_inf(inf_cancer_given_pos_v2)
    print(f"(info theory v2) p(cancer | pos) = {p_cancer_given_pos_v2}")

    # Same answer, within rounding error:
    assert abs(p_cancer_given_pos_v1 - p_cancer_given_pos_v2) < FLOAT_TOLERANCE


# =============================================================================
# Mutual information examples
# =============================================================================


def mutual_info(
    iterable_xy_x_y: Iterable[Tuple[float, float, float]],
    verbose: bool = False,
) -> float:
    """
    See https://en.wikipedia.org/wiki/Mutual_information: mutual information of
    two jointly discrete random variables X and Y. The expectation from the
    iterable is that for all x ∈ X and y ∈ Y, the iterable delivers the tuple
    P_X_Y(x, y), P_X(x), P_Y(y). Uses log2 and therefore the units are bits.
    """
    # Verbose version:
    if verbose:
        mutual_info_bits = 0.0
        for i, (p_xy, p_x, p_y) in enumerate(iterable_xy_x_y):
            bits = p_xy * log2(p_xy / (p_x * p_y))
            if i % 10000 == 0:
                log.info(
                    f"p_xy = {p_xy}, p_x = {p_x}, p_y = {p_y}, bits = {bits}"
                )
            mutual_info_bits += bits
        return mutual_info_bits

    # Terse version:
    return sum(
        p_xy * log2(p_xy / (p_x * p_y)) for p_xy, p_x, p_y in iterable_xy_x_y
    )


def gen_mutual_info_name_probabilities(
    nf: NameFrequencyInfo,
    p_female: float = None,
    name_metaphone: bool = False,
    name_firsttwochar: bool = False,
    metaphone_firsttwochar: bool = False,
) -> Generator[Tuple[float, float, float], None, None]:
    """
    Generates mutual information probabilities for name/fuzzy name comparisons.
    """
    assert (
        sum([name_metaphone, name_firsttwochar, metaphone_firsttwochar]) == 1
    )
    for info in gen_name_frequency_tuples(
        nf=nf,
        p_female=p_female,
    ):
        if name_metaphone:
            # Names are mapped many-to-one to metaphones. Therefore, P(name_x ∧
            # metaphone_for_name_x) = P(name_x). However,
            # P(metaphone_for_name_x) ≥ P(name_x).
            p_xy, p_x, p_y = info.p_name, info.p_name, info.p_metaphone
        elif name_firsttwochar:
            # Similarly for first-two-character representations.
            p_xy, p_x, p_y = info.p_name, info.p_name, info.p_f2c
        else:  # metaphone_firsttwochar
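            # Here there is overlap. So we use p_name as the joint probability;
            # there may be some duplication but I think that is OK (they'll add
            # up). Caveat: names sharing both a metaphone and an F2C add up
            # outside the logarithm but not inside it, so this is approximate.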
            p_xy, p_x, p_y = info.p_name, info.p_metaphone, info.p_f2c
        yield p_xy, p_x, p_y


# =============================================================================
# Information theory summaries
# =============================================================================


def show_info_theory_calcs() -> None:
    """
    Show some information theory calculations.
    """
    # Special options:
    # - Get the probabilities right (very low minimum frequency).
    # - Load details of first-two-character fuzzy representations.
    cfg = MatchConfig(
        min_name_frequency=ACCURATE_MIN_NAME_FREQ,
    )

    # -------------------------------------------------------------------------
    # Surnames
    # -------------------------------------------------------------------------

    surname_p = get_frequencies(cfg.surname_freq_info)
    # Test: surname_p = [1/100000] * 100000
    log.info(f"Number of surnames: {len(surname_p)}")
    h_surnames = entropy(surname_p)
    log.info(f"Surname entropy: H = {h_surnames} bits")

    surname_metaphone_p = get_frequencies(
        cfg.surname_freq_info, metaphones=True
    )
    log.info(f"Number of surname metaphones: {len(surname_metaphone_p)}")
    h_surname_metaphones = entropy(surname_metaphone_p)
    log.info(f"Surname metaphone entropy: H = {h_surname_metaphones} bits")

    surname_f2c_p = get_frequencies(cfg.surname_freq_info, first_two_char=True)
    log.info(f"Number of surname first-two-chars: {len(surname_f2c_p)}")
    h_surname_f2c = entropy(surname_f2c_p)
    log.info(f"Surname first-two-char entropy: H = {h_surname_f2c} bits")

    # kl_name_metaphone = relative_entropy_kl_divergence(
    #     gen_name_pairs(cfg.surname_freq_info, metaphones=True)
    # )
    # log.info(
    #     f"Surname name/metaphone relative entropy: "
    #     f"D_KL = {kl_name_metaphone} bits"
    # )

    surname_name_meta_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.surname_freq_info, name_metaphone=True
        )
    )
    log.info(
        f"Surname: name/metaphone mutual information: "
        f"I = {surname_name_meta_mi} bits"
    )

    surname_name_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.surname_freq_info, name_firsttwochar=True
        )
    )
    log.info(
        f"Surname: name/first-two-char mutual information: "
        f"I = {surname_name_f2c_mi} bits"
    )

    surname_meta_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.surname_freq_info, metaphone_firsttwochar=True
        )
    )
    log.info(
        f"Surname: metaphone/first-two-char mutual information: "
        f"I = {surname_meta_f2c_mi} bits"
    )

    # -------------------------------------------------------------------------
    # Forenames
    # -------------------------------------------------------------------------

    p_female = cfg.p_female_given_m_or_f

    forename_p = get_frequencies(nf=cfg.forename_freq_info, p_female=p_female)
    log.info(
        f"Number of forenames (M/F versions count separately): "
        f"{len(forename_p)}"
    )
    h_forenames = entropy(forename_p)
    log.info(f"Forename entropy: H = {h_forenames} bits")

    forename_metaphone_p = get_frequencies(
        nf=cfg.forename_freq_info,
        p_female=p_female,
        metaphones=True,
    )
    log.info(
        f"Number of forename metaphones (M/F versions count separately): "
        f"{len(forename_metaphone_p)}"
    )
    h_forename_metaphones = entropy(forename_metaphone_p)
    log.info(f"Forename metaphone entropy: H = {h_forename_metaphones} bits")

    forename_f2c_p = get_frequencies(
        cfg.forename_freq_info,
        p_female=p_female,
        first_two_char=True,
    )
    log.info(
        f"Number of forename first-two-chars (M/F versions count separately): "
        f"{len(forename_f2c_p)}"
    )
    h_forename_f2c = entropy(forename_f2c_p)
    log.info(f"Forename first-two-char entropy: H = {h_forename_f2c} bits")

    forename_name_meta_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.forename_freq_info, p_female=p_female, name_metaphone=True
        )
    )
    log.info(
        f"Forename: name/metaphone mutual information: "
        f"I = {forename_name_meta_mi} bits"
    )

    forename_name_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.forename_freq_info,
            p_female=p_female,
            name_firsttwochar=True,
        )
    )
    log.info(
        f"Forename: name/first-two-char mutual information: "
        f"I = {forename_name_f2c_mi} bits"
    )

    forename_meta_f2c_mi = mutual_info(
        gen_mutual_info_name_probabilities(
            nf=cfg.forename_freq_info,
            p_female=p_female,
            metaphone_firsttwochar=True,
        )
    )
    log.info(
        f"Forename: metaphone/first-two-char mutual information: "
        f"I = {forename_meta_f2c_mi} bits"
    )


# =============================================================================
# Partial match frequencies
# =============================================================================


def partial_calcs(
    nf: NameFrequencyInfo,
    p_female: float = None,
    report_every: int = 10000,
    debug_stop: int = None,
) -> None:
    """
    Show e.g. p(match metaphone but not name). This has the potential to be
    really slow (e.g. 175k^2 = 3e10) though it should only need to be done once
    -- however, we can optimize beyond an n^2 comparison. Uses examples from
    the public US name databases.

    Maths is e.g.

    .. code-block:: none

        integral_over_a(integral_over_b(p_a * p_b * binary(conjunction event)))

    Examples:

    - Share metaphone, not first two characters (F2C) or name:

      .. code-block:: none

        AABERG [APRK, AA] / WIBBERG [APRK, WI]
        AABY [AP, AA] / ABAY [AP, AB]
        AAKRE [AKR, AA] / OKYERE [AKR, OK]

    - Share F2C, not metaphone or name:

      .. code-block:: none

        AALDERS [ALTR, AA] / AASEN [ASN, AA]
        (etc.; these are obvious)

    """

    # def debug_report(
    #     event: str, a_: ValidationNameFreqInfo, b_: ValidationNameFreqInfo
    # ) -> None:
    #     log.info(
    #         f"{event}: {a_.name} [{a_.metaphone}, {a_.f2c}] / "
    #         f"{b_.name} [{b_.metaphone}, {b_.f2c}]"
    #     )

    # We optimize thus:
    metaphone_to_infolist = defaultdict(
        list
    )  # type: Dict[str, List[ValidationNameFreqInfo]]
    f2c_to_infolist = defaultdict(
        list
    )  # type: Dict[str, List[ValidationNameFreqInfo]]
    for i in gen_name_frequency_tuples(nf, p_female):
        metaphone_to_infolist[i.metaphone].append(i)
        f2c_to_infolist[i.f2c].append(i)
    # This improved the performance for 10k names in A from about 1h7min to
    # about 4 seconds, so that was definitely worth it.

    total_p_a = 0.0  # for normalization, in case this is not 1
    # ... total_p_b (below) might differ from total_p_a, but only if we break
    # out early via our debugging loop
    total_p_b = sum(
        info_.p_name for info_ in gen_name_frequency_tuples(nf, p_female)
    )
    # ... we always iterate through all of b for all a, even if debugging;
    # ... and even if we iterate through b implicitly (via the dictionaries).

    p_share_metaphone = 0.0
    p_share_metaphone_not_name = 0.0
    p_share_metaphone_not_f2c_or_name = 0.0
    p_share_f2c = 0.0
    p_share_f2c_not_name = 0.0
    p_share_f2c_not_metaphone_or_name = 0.0
    n = len(nf.infolist)
    for idx_a, a in enumerate(
        gen_name_frequency_tuples(nf, p_female), start=1
    ):
        if idx_a % report_every == 0:
            log.info(f"... processing name {idx_a} / {n}")
        if debug_stop and idx_a > debug_stop:
            break  # for debugging

        # For speed:
        a_p_name = a.p_name
        a_name = a.name
        a_metaphone = a.metaphone
        a_f2c = a.f2c

        total_p_a += a_p_name

        for b in f2c_to_infolist[a_f2c]:
            # Iterate only through names sharing first two characters.
            p_product = a_p_name * b.p_name
            p_share_f2c += p_product
            if a_name != b.name:
                p_share_f2c_not_name += p_product
                # debug_report("share_f2c_not_name", a, b)
                if a_metaphone != b.metaphone:
                    p_share_f2c_not_metaphone_or_name += p_product
                    # debug_report("share_f2c_not_metaphone_or_name", a, b)

        for b in metaphone_to_infolist[a_metaphone]:
            # Iterate only through names sharing metaphones.
            p_product = a_p_name * b.p_name
            p_share_metaphone += p_product
            if a_name != b.name:
                p_share_metaphone_not_name += p_product
                # debug_report("share_metaphone_not_name", a, b)
                if a_f2c != b.f2c:
                    p_share_metaphone_not_f2c_or_name += p_product
                    # debug_report("share_metaphone_not_f2c_or_name", a, b)

    # Normalized probabilities:
    # (The normalizing factor is named so as not to shadow the "nf" parameter.)
    norm_factor = 1 / (total_p_a * total_p_b)  # normalizing factor
    np_share_metaphone = p_share_metaphone * norm_factor
    np_share_metaphone_not_name = p_share_metaphone_not_name * norm_factor
    np_share_metaphone_not_f2c_or_name = (
        p_share_metaphone_not_f2c_or_name * norm_factor
    )
    np_share_f2c = p_share_f2c * norm_factor
    np_share_f2c_not_name = p_share_f2c_not_name * norm_factor
    np_share_f2c_not_metaphone_or_name = (
        p_share_f2c_not_metaphone_or_name * norm_factor
    )

    log.info(
        f"Unnormalized: total_p_a = {total_p_a}, "
        f"total_p_b = {total_p_b}, "
        f"nf = {norm_factor}, "
        f"p_share_metaphone = {p_share_metaphone}, "
        f"p_share_metaphone_not_name = {p_share_metaphone_not_name}, "
        f"p_share_metaphone_not_f2c_or_name = "
        f"{p_share_metaphone_not_f2c_or_name}, "
        f"p_share_f2c = {p_share_f2c}, "
        f"p_share_f2c_not_name = {p_share_f2c_not_name}, "
        f"p_share_f2c_not_metaphone_or_name = "
        f"{p_share_f2c_not_metaphone_or_name}"
    )
    log.info(
        f"Normalized: P(share metaphone) = {np_share_metaphone}, "
        f"P(share metaphone, not name) = {np_share_metaphone_not_name}, "
        f"P(share metaphone, not first two char or name) = "
        f"{np_share_metaphone_not_f2c_or_name}, "
        f"P(share first two char) = {np_share_f2c}, "
        f"P(share first two char, not name) = {np_share_f2c_not_name}, "
        f"P(share first two char, not metaphone or name) = "
        f"{np_share_f2c_not_metaphone_or_name}"
    )


def show_partial_match_frequencies() -> None:
    """
    Show population-level frequency/probability calculations from our name
    frequency databases, e.g. p(match metaphone but not name), p(match first
    two characters but not metaphone or name).
    """
    cfg = MatchConfig(
        min_name_frequency=ACCURATE_MIN_NAME_FREQ,
    )
    log.info("Partial match frequencies for surnames:")
    partial_calcs(cfg.surname_freq_info)
    log.info("Partial match frequencies for forenames:")
    partial_calcs(cfg.forename_freq_info, p_female=cfg.p_female_given_m_or_f)


_ = """

Saved results:

2022-06-26 14:40:35.939 __main__:INFO: Partial match frequencies for surnames:
2022-06-26 14:40:36.191 crate_anon.linkage.helpers:WARNING: No metaphone for 'HWA'
2022-06-26 14:40:36.191 crate_anon.linkage.helpers:WARNING: No metaphone for 'HWEE'
2022-06-26 14:40:36.192 crate_anon.linkage.helpers:WARNING: No metaphone for 'HWU'
2022-06-26 14:40:39.924 __main__:INFO: ... processing name 10000 / 175880
...
2022-06-26 14:41:39.135 __main__:INFO: ... processing name 170000 / 175880
2022-06-26 14:41:40.034 __main__:INFO: Unnormalized: total_p_a =
0.9555555960002395, total_p_b = 0.9555555960002395, nf = 1.0951864946351493,
p_share_metaphone = 0.011339587936127086, p_share_metaphone_not_name =
0.000768652112738086, p_share_metaphone_not_f2c_or_name =
0.0005779216890297886, p_share_f2c = 0.02015927698649681, p_share_f2c_not_name
= 0.009588341150670168, p_share_f2c_not_metaphone_or_name =
0.009397610726948866

2022-06-26 14:41:40.034 __main__:INFO: Normalized: P(share metaphone) =
0.01241896356237405, P(share metaphone, not name) = 0.000841817412943526,
P(share metaphone, not first two char or name) = 0.0006329320287821591, P(share
first two char) = 0.022078167897220478, P(share first two char, not name) =
0.010501021734168415, P(share first two char, not metaphone or name) =
0.010292136349992806

~~~

2022-06-26 14:41:40.054 __main__:INFO: Partial match frequencies for forenames:
2022-06-26 14:41:43.254 __main__:INFO: ... processing name 10000 / 106695
...
2022-06-26 14:42:13.711 __main__:INFO: ... processing name 100000 / 106695

2022-06-26 14:42:14.506 __main__:INFO: Unnormalized: total_p_a =
0.9999999999997667, total_p_b = 0.9999999999997667, nf = 1.0000000000004665,
p_share_metaphone = 0.005028025327852539, p_share_metaphone_not_name =
0.0025948399071387762, p_share_metaphone_not_f2c_or_name =
0.0014685292424996006, p_share_f2c = 0.018453783690030413, p_share_f2c_not_name
= 0.01602059826926657, p_share_f2c_not_metaphone_or_name = 0.014894287604555126

2022-06-26 14:42:14.506 __main__:INFO: Normalized: P(share metaphone) =
0.005028025327854884, P(share metaphone, not name) = 0.0025948399071399867,
P(share metaphone, not first two char or name) = 0.0014685292425002856, P(share
first two char) = 0.01845378369003902, P(share first two char, not name) =
0.016020598269274045, P(share first two char, not metaphone or name) =
0.014894287604562073


Note in particular that metaphone-sharing is rarer than F2C-sharing:

Surnames:

P(share metaphone)      = 0.01241896356237405
P(share first two char) = 0.022078167897220478

P(share metaphone, not name) =      0.000841817412943526
P(share first two char, not name) = 0.010501021734168415

P(share metaphone, not first two char or name) = 0.0006329320287821591
P(share first two char, not metaphone or name) = 0.010292136349992806

Forenames:

P(share metaphone)      = 0.005028025327854884
P(share first two char) = 0.01845378369003902

P(share metaphone, not name) =      0.0025948399071399867
P(share first two char, not name) = 0.016020598269274045

P(share metaphone, not first two char or name) = 0.0014685292425002856
P(share first two char, not metaphone or name) = 0.014894287604562073

"""  # noqa


# =============================================================================
# Main
# =============================================================================


def main() -> None:
    """
    Command-line entry point.
    """
    function_map = {
        "infotheory": show_info_theory_calcs,
        "demobayes": demo_info_theory_bayes_cancer,
        "partials": show_partial_match_frequencies,
    }
    parser = argparse.ArgumentParser(
        formatter_class=ArgumentDefaultsRichHelpFormatter
    )
    parser.add_argument("command", choices=function_map.keys())
    args = parser.parse_args()
    main_only_quicksetup_rootlogger(level=logging.INFO)
    func = function_map[args.command]
    func()


if __name__ == "__main__":
    main()