14.4.15. crate_anon.linkage.validation.entropy_names

crate_anon/linkage/validation/entropy_names.py


Copyright (C) 2015, University of Cambridge, Department of Psychiatry. Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

This file is part of CRATE.

CRATE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

CRATE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with CRATE. If not, see <https://www.gnu.org/licenses/>.


Measure entropy and entropy reduction among names.

See also [Academian2010], from which much of the following is summarized:

14.4.15.1. Probabilistic evidence, pev()

pev(A, B) = P(A and B) / [P(A) * P(B)]              [1] [Academian2010]

          = pev(B, A)

Stating Bayes' theorem in those terms:

P(A and B) = P(A) * P(B | A) = P(B) * P(A | B)      [2] Bayes, symmetrical

P(A | B) = P(A) * P(B | A) / P(B)                   [3] Bayes, conventional

but, from [1] and [2], since

pev(A, B) = P(B) * P(A | B) / [P(A) * P(B)]         RNC derivation
          = P(A | B) / P(A)
          = P(A) * P(B | A) / [P(A) * P(B)]
          = P(B | A) / P(B)

we reach this version of Bayes’ theorem:

P(A | B) = P(A) * pev(A, B)                         [4a] [Academian2010]
P(B | A) = P(B) * pev(A, B)                         [4b] [Academian2010]

Probabilistic evidence, being the ratio of two probabilities, has range [0, +∞]. It is a multiplicative measure of mutual evidence: 1 if A and B are independent; >1 if they make each other more likely; <1 if they make each other less likely.
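
As a minimal Python sketch of [1], [4a] and [4b] (illustrative probabilities only; not part of the module):

# Probabilistic evidence, pev(), from equation [1]; illustrative sketch.
def pev(p_a: float, p_b: float, p_a_and_b: float) -> float:
    """pev(A, B) = P(A and B) / [P(A) * P(B)]."""
    return p_a_and_b / (p_a * p_b)

# Hypothetical probabilities: P(A) = 0.2, P(B) = 0.5, P(A and B) = 0.15.
p_a, p_b, p_a_and_b = 0.2, 0.5, 0.15
e = pev(p_a, p_b, p_a_and_b)                    # 1.5 > 1: mutually supportive
assert abs(p_a * e - p_a_and_b / p_b) < 1e-12   # [4a]: P(A | B) = P(A) * pev(A, B)
assert abs(p_b * e - p_a_and_b / p_a) < 1e-12   # [4b]: P(B | A) = P(B) * pev(A, B)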

14.4.15.2. Information value, inf()

The information value of an event (a.k.a. surprisal, self-information):

inf(A) = log_base_half[P(A)] = -log2[P(A)]          [5] [Academian2010]

Range check: p(A) ∈ [0, 1], so inf(A) ∈ [0, +∞]; impossibility (p = 0) gives inf(A) = +∞, while certainty (p = 1) gives inf(A) = 0; p = 0.5 corresponds to inf(A) = 1.

This is also “uncertainty” (how many independent bits are required to confirm that A is true) or “informativity” (how many independent bits are gained if we are told that A is true).
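
A corresponding sketch of [5] (illustrative; it mirrors inf_bits_from_p(), documented below):

import math

def inf_bits(p: float) -> float:
    """Information value (surprisal) of an event of probability p, in bits."""
    return -math.log2(p)

assert inf_bits(1.0) == 0.0   # certainty carries no information
assert inf_bits(0.5) == 1.0   # a fair coin flip carries exactly 1 bit
assert inf_bits(0.25) == 2.0  # two independent fair-coin flips: 2 bits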

14.4.15.3. Informational evidence, iev()

Redundancy, or mutual information, or informational evidence:

iev(A, B) = log2[pev(A, B)]                        [6] [Academian2010]

            NOTE the error in the original (twice, in equation and
            preceding paragraph); it cannot be -log2[pev(A, B)], as pointed
            out in the comments, and rechecked here.

          = log2{ P(A and B)  / [P(A)      * P(B)] }            from [1]
          = log2[P(A and B)]  - log2[P(A)] - log2[P(B)]
          = -inf(A and B)     + inf(A)     + inf(B)

          = inf(A) + inf(B) - inf(A and B)          [7] [Academian2010]

          = iev(B, A)

Range check: pev ∈ [0, +∞], so iev ∈ [-∞, +∞].

If iev(A, B) is positive, the uncertainty of A decreases upon observing B (meaning A becomes more likely). If it is negative, the uncertainty of A increases (A becomes less likely). A value of -∞ means A and B completely contradict each other, and +∞ means they completely confirm each other.
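
A sketch checking that [6] and [7] agree, reusing the illustrative probabilities from the pev() sketch above:

import math

def inf_bits(p: float) -> float:
    return -math.log2(p)

p_a, p_b, p_a_and_b = 0.2, 0.5, 0.15
iev_6 = math.log2(p_a_and_b / (p_a * p_b))                   # [6], via [1]
iev_7 = inf_bits(p_a) + inf_bits(p_b) - inf_bits(p_a_and_b)  # [7]
assert abs(iev_6 - iev_7) < 1e-9
# pev = 1.5 here, so iev = log2(1.5) ≈ +0.585 bits of mutual evidence.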

14.4.15.4. Conditional information value

inf(A | B) = -log2[P(A | B)]                                [8], from [5]

           = -log2{ P(A)   * pev(A, B) }                        from [4a]
           = -{ log2[P(A)] + log2[pev(A, B)] }
           = -log2[P(A)]   - log2[pev(A, B)]
           = inf(A)        - iev(A, B)                          from [5, 6]
           = inf(A)        - [inf(A) + inf(B) - inf(A and B)]   from [7]
           = inf(A)        -  inf(A) - inf(B) + inf(A and B)
           =                         - inf(B) + inf(A and B)

           = inf(A and B) - inf(B)                  [9] [Academian2010]

14.4.15.5. Information-theoretic Bayes’ theorem

Taking logs of [4a],

log2[P(A | B)] = log2[P(A)] + log2[pev(A, B)]

so

-log2[P(A | B)] = -log2[P(A)] - log2[pev(A, B)]

we obtain, from [8], [5], and [6] respectively,

inf(A | B) = inf(A) - iev(A, B)                     [10] [Academian2010]

or: Bayesian updating is subtracting mutual evidence from uncertainty.
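
A numerical check of [9] and [10], with the same illustrative probabilities:

import math

def inf_bits(p: float) -> float:
    return -math.log2(p)

p_a, p_b, p_a_and_b = 0.2, 0.5, 0.15
p_a_given_b = p_a_and_b / p_b
iev = math.log2(p_a_and_b / (p_a * p_b))

# [9]: inf(A | B) = inf(A and B) - inf(B)
assert abs(inf_bits(p_a_given_b) - (inf_bits(p_a_and_b) - inf_bits(p_b))) < 1e-9
# [10]: inf(A | B) = inf(A) - iev(A, B); updating subtracts mutual evidence.
assert abs(inf_bits(p_a_given_b) - (inf_bits(p_a) - iev)) < 1e-9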

14.4.15.6. A worked example

./entropy_names.py demo_info_theory_bayes_cancer

14.4.15.7. Other references

  • Bayes' theorem: https://en.wikipedia.org/wiki/Bayes%27_theorem … ultimately Bayes (1763).

  • A probability mass function for a discrete random variable X, which can take multiple states each labelled x: p_X(x) = P(X = x). https://en.wikipedia.org/wiki/Probability_mass_function

  • Information content, self-information, surprisal, Shannon information, inf(): https://en.wikipedia.org/wiki/Information_content … ultimately e.g. Shannon (1948), Shannon & Weaver (1949). For a single event, usually expressed as I(x) = -log[P(x)]. For a random variable, usually expressed as I_X(x) = -log[p_X(x)].

  • Entropy is the expected information content (surprisal) of measurement of X: https://en.wikipedia.org/wiki/Entropy_(information_theory) Usually written H(X) = E[I(X)] = E[-log(P(X))], or (with the minus sign outside): H(X) = -sum_i[P(x_i) * log(P(x_i))], i.e. the sum of information for each value, weighted by the probability of that value.

  • Mutual information (compare “informational evidence” above): https://en.wikipedia.org/wiki/Mutual_information Typically:

    I(X; Y) = I(Y; X)                             # symmetric
            = H(X) - H(X | Y)
            = H(Y) - H(Y | X)
            = H(X) + H(Y) - H(X, Y)               # cf. eq. [7]?
            = H(X, Y) - H(X | Y) - H(Y | X)
    I(X; Y) >= 0                                  # non-negative
    

    where

    • H(X) and H(Y) are marginal entropies,

    • H(X | Y) and H(Y | X) are conditional entropies,

    • H(X, Y) is the joint entropy.

    However, this is not the same quantity; I(X; Y) >= 0, whereas iev ∈ [-∞, +∞]. This (Wikipedia) quantity is the mutual information of two random variables: the amount of information you learn about one random variable by observing the other. The “iev” concept above applies to pairs of individual events. (A short Python sketch contrasting the two appears at the end of this list.)

    For two discrete RVs,

    I(X; Y) = sum_y{ sum_x{ P_XY(x, y) log[ P_XY(x, y) / (P_X(x) * P_Y(y)) ] }}

  • Mutual information is a consideration across events. The individual-event version is “pointwise mutual information”, https://en.wikipedia.org/wiki/Pointwise_mutual_information, which is

    pmi(x; y) = log[ P(x, y) / (P(x) * P(y)) ]
              = log[ P(x | y) / P(x) ]
              = log[ P(y | x) / P(y) ]
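
    The sketch referred to above, contrasting the two quantities (probabilities are illustrative):

    import math

    def pmi_bits(p_xy: float, p_x: float, p_y: float) -> float:
        """Pointwise mutual information of one (x, y) pair, in bits; may be negative."""
        return math.log2(p_xy / (p_x * p_y))

    # A 2x2 joint distribution with uniform marginals (0.5 each):
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    p_x = {0: 0.5, 1: 0.5}
    p_y = {0: 0.5, 1: 0.5}

    # Mutual information I(X; Y) is the expectation of pmi over the joint
    # distribution, and is non-negative even though individual pmi values
    # can be negative.
    mi = sum(p * pmi_bits(p, p_x[x], p_y[y]) for (x, y), p in joint.items())
    print(pmi_bits(joint[(0, 1)], p_x[0], p_y[1]))  # ≈ -1.32 bits
    print(mi)                                       # ≈ +0.28 bits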
    

14.4.15.8. Applying to our problem of selecting a good partial representation

Assume we are comparing a proband and a candidate and there is not a full match. We start with some sort of prior, P(H | information so far); for now, we’ll simplify that to P(H). We want P(H | D) where D is the new information from the partial identifier – the options being a partial match, or no match. We generally use this form of Bayes’ theorem:

ln(posterior odds)       = ln(prior odds)   + ln(likelihood ratio)
ln[P(H | D) / P(¬H | D)] = ln[P(H) / P(¬H)] + ln[P(D | H) / P(D | ¬H)]

Converting to log2 just involves multiplying by a constant, of course:

ln(x)   = log2(x) * ln(2)
log2(x) = ln(x) * log2(e)

A partial match would provide a log likelihood ratio of

log(p_ep) − log(p_pnf)

and no match would provide a log likelihood ratio of

log(p_en) − log(p_n)

We could weight that (or the information equivalent) by the probability of obtaining a partial match (given no full match) and of obtaining no match (given no full match) respectively.
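
As a minimal sketch of that log-odds update (all numbers below are hypothetical placeholders, not the module's p_ep/p_pnf/p_en/p_n values):

import math

p_h = 0.001              # prior P(H): proband and candidate are the same person
p_d_given_h = 0.1        # P(partial match | H), illustrative
p_d_given_not_h = 0.004  # P(partial match | not H), illustrative

prior_log_odds = math.log(p_h / (1 - p_h))
log_lr = math.log(p_d_given_h / p_d_given_not_h)
posterior_log_odds = prior_log_odds + log_lr        # the update shown above

posterior_odds = math.exp(posterior_log_odds)
print(posterior_odds / (1 + posterior_odds))        # posterior P(H | D) ≈ 0.024

# Converting to bits just rescales: log2(x) = ln(x) * log2(e).
print(posterior_log_odds * math.log2(math.e))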

… let’s skip this and try mutual information.

Note

Code largely abandoned; not re-checked since NameFrequencyInfo was refactored, as this code had served its purpose.

class crate_anon.linkage.validation.entropy_names.ValidationNameFreqInfo(name: str, p_name: float, metaphone: str, p_metaphone: float, f2c: str, p_f2c: float)[source]

Used for validation calculations.

__init__(name: str, p_name: float, metaphone: str, p_metaphone: float, f2c: str, p_f2c: float) → None
crate_anon.linkage.validation.entropy_names.demo_info_theory_bayes_cancer() → None[source]

From the comments in [Academian2010]:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
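
For reference, the standard arithmetic (a sketch independent of the demo function), together with the information-theoretic restatement via equation [10]:

import math

p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos_given_healthy = 0.096

p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
print(p_cancer_given_pos)  # ≈ 0.0776, i.e. about 7.8%

# [10]: inf(cancer | positive) = inf(cancer) - iev(cancer, positive)
inf_cancer = -math.log2(p_cancer)            # ≈ 6.64 bits of uncertainty
iev = math.log2(p_pos_given_cancer / p_pos)  # ≈ 2.96 bits of mutual evidence
assert abs(-math.log2(p_cancer_given_pos) - (inf_cancer - iev)) < 1e-9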

crate_anon.linkage.validation.entropy_names.entropy(frequencies: Iterable[float]) → float[source]

Returns the (information/Shannon) entropy of the probability distribution supplied, in bits. By definition,

H = -sum_i[ p_i * log_2(p_i) ]

https://en.wikipedia.org/wiki/Quantities_of_information
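
For example (a standalone sketch mirroring the definition above, not a call into this module):

import math

def entropy_bits(probabilities):
    """H = -sum_i[ p_i * log_2(p_i) ], ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

assert entropy_bits([0.25] * 4) == 2.0                     # uniform over 4 outcomes
assert abs(entropy_bits([0.5, 0.25, 0.25]) - 1.5) < 1e-12  # skewed distribution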

crate_anon.linkage.validation.entropy_names.gen_gender_weighted_frequencies(freqdict: Dict[Tuple[str, str], float], p_female: float) → Generator[float, None, None][source]

Generates gender-weighted frequencies. Requires p_female + p_male = 1.

crate_anon.linkage.validation.entropy_names.gen_mutual_info_name_probabilities(nf: crate_anon.linkage.frequencies.NameFrequencyInfo, p_female: Optional[float] = None, name_metaphone: bool = False, name_firsttwochar: bool = False, metaphone_firsttwochar: bool = False) → Generator[Tuple[float, float, float], None, None][source]

Generates mutual information probabilities for name/fuzzy name comparisons.

crate_anon.linkage.validation.entropy_names.gen_name_frequency_tuples(nf: crate_anon.linkage.frequencies.NameFrequencyInfo, p_female: Optional[float] = None) → Generator[crate_anon.linkage.validation.entropy_names.ValidationNameFreqInfo, None, None][source]

Generates frequency tuples (name, p_name, metaphone, p_metaphone, f2c, p_firsttwochar).

crate_anon.linkage.validation.entropy_names.get_frequencies(nf: crate_anon.linkage.frequencies.NameFrequencyInfo, p_female: Optional[float] = None, metaphones: bool = False, first_two_char: bool = False) → List[float][source]

Returns raw frequencies for a category of identifier, optionally combining (weighting by gender) for those stored separately by gender.

crate_anon.linkage.validation.entropy_names.inf_bits_from_p(p: float) → float[source]

The information value (surprisal, self-information) of an event from its probability, p; see equation [5] above.

crate_anon.linkage.validation.entropy_names.main() → None[source]

Command-line entry point.

crate_anon.linkage.validation.entropy_names.mutual_info(iterable_xy_x_y: Iterable[Tuple[float, float, float]], verbose: bool = False) → float[source]

See https://en.wikipedia.org/wiki/Mutual_information: mutual information of two jointly discrete random variables X and Y. The iterable is expected to deliver, for every x ∈ X and y ∈ Y, the tuple (P_X_Y(x, y), P_X(x), P_Y(y)). Uses log2, so the units are bits.
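
A hypothetical usage sketch, showing the expected tuple format (one tuple per cell of the joint distribution; the probabilities are invented, and the printed value assumes the documented formula):

from crate_anon.linkage.validation.entropy_names import mutual_info

# Two correlated binary variables with uniform marginals (0.5 each):
# (P_X_Y(x, y), P_X(x), P_Y(y)) for each of the four (x, y) cells.
tuples = [
    (0.4, 0.5, 0.5),  # x = 0, y = 0
    (0.1, 0.5, 0.5),  # x = 0, y = 1
    (0.1, 0.5, 0.5),  # x = 1, y = 0
    (0.4, 0.5, 0.5),  # x = 1, y = 1
]
print(mutual_info(tuples))  # ≈ 0.278 bits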

crate_anon.linkage.validation.entropy_names.p_from_inf(inf_bits: float) → float[source]

The probability of an event from its information value (surprisal, self-information) in bits; the inverse of inf_bits_from_p(). See equation [5] above.

crate_anon.linkage.validation.entropy_names.p_log2_p_over_q(p: float, q: float) → float[source]

Given p and q, calculate p * log_2(p / q), except return 0 if p == 0. Used for KL divergence.

crate_anon.linkage.validation.entropy_names.p_log2p(p: float) → float[source]

Given p, calculate p * log_2(p).

crate_anon.linkage.validation.entropy_names.partial_calcs(nf: crate_anon.linkage.frequencies.NameFrequencyInfo, p_female: Optional[float] = None, report_every: int = 10000, debug_stop: Optional[int] = None) → None[source]

Show e.g. p(match metaphone but not name). This has the potential to be really slow (e.g. 175k^2 ≈ 3e10 pairwise comparisons), though it should only need to be done once; we can also optimize beyond a naive n^2 comparison. Uses examples from the public US name databases. (A toy sketch of this calculation follows the examples below.)

The maths is, e.g.,

integral_over_a(integral_over_b(p_a * p_b * binary(conjunction event)))

Examples:

  • Share metaphone, not first two characters (F2C) or name:

    AABERG [APRK, AA] / WIBBERG [APRK, WI]
    AABY [AP, AA] / ABAY [AP, AB]
    AAKRE [AKR, AA] / OKYERE [AKR, OK]
    
  • Share F2C, not metaphone or name:

    AALDERS [ALTR, AA] / AASEN [ASN, AA]
    (etc.; these are obvious)
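
A toy sketch of the kind of calculation involved (the frequencies below are invented; the names and metaphones are taken from the examples above, whereas the real code iterates over NameFrequencyInfo built from the US name databases):

import itertools

# (name, metaphone, probability); probabilities sum to 1.
freqs = [
    ("AABERG", "APRK", 0.3),
    ("WIBBERG", "APRK", 0.2),
    ("AABY", "AP", 0.4),
    ("ABAY", "AP", 0.1),
]

# P(two independently drawn people share a metaphone but not a name)
# = sum over pairs (a, b), a != b, with equal metaphones, of p_a * p_b.
p_partial = sum(
    p_a * p_b
    for (name_a, meta_a, p_a), (name_b, meta_b, p_b)
    in itertools.product(freqs, repeat=2)
    if name_a != name_b and meta_a == meta_b
)
print(p_partial)  # mathematically 2*(0.3*0.2) + 2*(0.4*0.1) = 0.2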
    
crate_anon.linkage.validation.entropy_names.relative_entropy_kl_divergence(pairs: Iterable[Tuple[float, float]]) → float[source]

Returns the relative entropy, or Kullback-Leibler divergence, D_KL(P || Q), in bits; https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence.

The iterable should contain pairs P(x), Q(x) for all values of x in the distribution. We calculate

D_KL(P || Q) = sum_x{ P(x) * log[P(x) / Q(x)] }
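
A hypothetical usage sketch (invented distributions; the expected value follows from the documented formula):

from crate_anon.linkage.validation.entropy_names import (
    relative_entropy_kl_divergence,
)

# P = (0.5, 0.5), Q = (0.75, 0.25), supplied as (P(x), Q(x)) pairs:
pairs = [(0.5, 0.75), (0.5, 0.25)]
print(relative_entropy_kl_divergence(pairs))
# 0.5*log2(0.5/0.75) + 0.5*log2(0.5/0.25) ≈ 0.2075 bits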

crate_anon.linkage.validation.entropy_names.show_info_theory_calcs() → None[source]

Show some information theory calculations.

crate_anon.linkage.validation.entropy_names.show_partial_match_frequencies() → None[source]

Show population-level frequency/probability calculations from our name frequency databases, e.g. p(match metaphone but not name), p(match first two characters but not metaphone or name).