.- -.- |- -+- `` Probabilistic causation indicated by relative risk, attributable risk and by formulas of I.J. Good, Kemeny, Popper, Sheps/Cheng, Pearl and Google's Brin, for data mining, epidemiology, evidence-based medicine, economy, investments or Causal INSIGHTS INSIDE for data mining to fight data tsunami and confounding CopyRight (C) 2002-2007, Jan Hajek , NL, version 3.02 of 2007-5-2 NO part of this document may be published, implemented, programmed, copied or communicated by any means without an explicit & full reference to this author + the full title + the website www.humintel.com/hajek for the freshest version + the CopyRight note in texts and in ALL references to this. An implicit, incomplete, indirect, disconnected or unlinked reference (in your text and/or on www) does NOT suffice. All based on 1st-hand experience. All rights reserved. This file + has lines < 80 chars (+CrLf) in ASCII + has facilities for easy finding & browsing + reads better than 2 columns texts which must be paged dn & up again & again + may read even better (more of left margin, PgDn/Up) outside your email, and/or if you change its name.txt to name.wri + like other texts in MS-Explorer, it is better to do Find in page backwards + can be compared with its previous version if you save it and use a files differencer to see only where the versions differ; eg download Visual Compare VC154.ZIP and run it as VCOMP vers1 vers2 /i /k which is the best line-by-line files comparer (3 colors) for .txt or .wri files + contains math functions graphable with David Meredith's XPL on www + your comments (inserted into this .txt file) are welcome. -.- Associative thinkers-browsers may like to repeatedly find the following: (single spacing indicates semantical closeness) !!! !! ! ?? ? { refs } Q: bound complement block AndNot --> nImp( impli 0/0 /. ./ RDS Sheps TauB Hajek NNR dNNR HF PF GF PS PN Pearl Cheng EBM JH INDEPendentIMPlication paradox CausedBy NecFor SufFor 3angle q.e.d. 
?( asymmetr attributable B( B(~ Perr Bayes factor beta :-) as-if boost confound Cornfield Gastwirth caution chain conjecture :-( coviction Brin Google Conv( Conv1 Conv2 Conv3 corr( correl contingency SIC Gini
Cont( caus1( causa causes code Cofa Cofa0 Cofa1 CI confidence conversion cov( Popper confirm corroborat C( K( korroborat Kahre counterfact degree depend inpedend 17th entrop error etiologic example expos Folk fuzzy B( F( F(~ F0( factual support Kemeny I.J. Good Gini hypothe independ infinit oo NNT NNH NNS costs effort nonevent noisy key LikelyThanNot likelihood LR meaning mislead M( MDL MML monoton necess suffic Occam odds( PARADOX Pearson Phi^2 princip proper ratio relative risk RR( RR(~ r2 refut relativi rapidit regraduat remov rule naive Schield simplistic sense SeLn SIC slope Shannon softmax Spinoza symmetr Venn 2x2 table 5x2 tendency triviality variance regress range scale tanh typo UNDESIRABLE weigh evidence W( W(~ www Bonferroni Inclusion-Exclusion DeMorgan opeRation -log( exagger ChiSqr( student! -.- separates sections .- separates (sub)sections Venn diagrams table |- -.- +Contents: (find a +Term to find a section) +Who might like to read this epaper +Intro: the duality of causal necessity and sufficiency +Epicenter of this epaper with key insights inside +Contrasting formulas aka measures of impact !!! +New conversions between many measures +Contemplating the bounds of some measures +Construction principles P1: to P7: of good association measures +Key elements of probabilistic logic and simple candidates for causation K0: to K4: C1: to C4: +Dissecting RR(:) LR(:) OR(:) for deeper insights ! +3angle inequalities combined yield new JH-bounds for 3 events x,y,z !!! +The simplest thinkable necessary condition for CONFOUNDING +Notation, basic tutorial insights, PARADOX of IndependentImplication !!! +Interpreting a 2x2 contingency table wrt RR(:) = relative risk = risk ratio ! 
see my squashed Venn diagrams +More on probabilities +Tutorial notes on probabilistic logic, entropies and information +Google's Brin's conviction Conv( , my nImp( , AndNot +Rescalings important wrt risk ratio RR(:) aka relative risk +Correlation in a 2x2 contingency table +Example ( example finds more examples ) +Folks' wisdom +Acknowledgment +References -.- +Who might like to read this epaper This epaper started as notes to myself ( Descartes called them Cogitationes privatae). Now it is a much improved version of my original draft tentatively titled "Data mining = fighting the data tsunami : When & how much an evidential event y INDICATES x as a hypothesised cause, for doctors, engineers, investors, lawyers, researchers and scientists", who all should be interested in this stuff. This epaper is primarily targeted at British-style empiricists or BE's (sounds better than BSE :-). Continental Rationalists (CR's) a la Descartes, Leibniz, Spinoza prefer to apply deductive analytical methods to splendidly isolated and well defined problems, while BE's a la Locke, Berkeley, Hume are not afraid of using inductive inferential/experimental/observational methods even on messy tasks in biostatistics, econometry, medicine, military and social domains. BE's credo is Berkeley's "Esse est percipi". CR's credo is Descartes' "Cogito ergo sum". -.- +Intro: the DUALITY of causal Necessity and Sufficiency When confronted with events, and events happen all the time, humans ask about and search for inter-event relationships, associations, influences, reasons, and causes, so that predictions, remedies and decision-making may be learned from the past experiences of such or similar events. 
To find a cause, an explanation, or a remedy is the ultimate goal, the Holy Grail of advisors, analysts, attorneys, barristers, doctors, engineers, investigators, investors, lawyers, philosophers, physicians, prosecutors, researchers, scientists, and in fact of all wonderful expert human beings like you and me, who use or just think the words "because", "due to", and "if-then". David Hume (1711-1776) used to say that "causation is the cement of the Universe". The nobelist Max Planck (1858-1947) wrote { Kahre 2002, 187 }: "Causation is neither true nor false, it is more a heuristic principle, a guide, and in my opinion clearly the most valuable guide that we have to find the right way in the motley hotchpotch [= bunten Wirrwarr], in which scientific research must take place, and reach fruitful results." One man's mechanism is another man's black box, wrote Patrick Suppes. I say: One man's data is another woman's noise, and one man's cause is another woman's effect, eg:
smoking = u ...> (x1 = ill lungs and/or x2 = ill heart) ...> death = y
but your coroner will not say that smoking was the cause of your death.
gene ....> hormone ...> symptom ; or if we view the notion of a specific illness as-if real (in fact its name is an abstraction), then eg:
gene ....> illness ...> symptom . In this causal chain a researcher may see an illness as an effect caused by genes, while a physician, GP or clinician, sees it as a cause of a symptom, eg a pain in the neck to be removed or at least suppressed. Cause-effect relationships are relative wrt the observer's frame of view, as Einstein would have loved to say. Like an implication or entailment, causation is supposed to be TRANSITIVE :
if x causes y & y causes z then x causes z
if x <= y & y <= z then x <= z
The <= is 'less or equal','subset of','entails' or 'implies' if x,y,z are numbers, sets, or Boolean logical operands. The <= works on False, True represented (eg internally) as numbers 0, 1.
Here <= is for numbers, <== is for sets or Booleans, but I use the more human -->
y-->x == (y <== x) == (y subset of x) == (y implies x) == (y entails x) !!
     == ~(y & ~x) == (~x -->~y) == ~(~x & y) == (x or ~y) by DeMorgan !
     == ~(y AndNot x) == ~(y,~x) == ~(y UnlessBlockedBy x) in plaintalk !
y-->x == (~x-->~y) looks nice, but is (find:) UNDESIRABLE for causation , because "the rain causes us to wear a raincoat" makes SENSE, while the statement "not wearing a raincoat causes no rain" is a NONSENSE; find raincoat . Find => for more on --> (the => is meaningless here, in Pascal too); the >= is 'greater or equal' (in Pascal also SuperSet of). Translated into probabilistic logic :
P(y-->x) = P(~(y,~x)) = 1 -(Py - Pxy) = P(~(y AndNot x))
ie (y-->x) is a DECreasing function of (Py - Pxy), hence (y implies x) is at its MAXimum when (Py - Pxy)=0 ie Py=Pxy ie whenever (y subset of x) ie (y entails x). Causation is like the 2-faced Roman god Janus (January is named after him). One face is SUFficiency, the other face is NECessity. They go together, they are 2 components of causation, something like the real and imaginary parts of a complex number. This analogy is not too bad, since necessity is often based on imagined CounterFactual reasoning, ie on what WOULD BE IF (in German: was waere wenn; also find as-if ) the situation WOULD NOT BE as it is the factual one { Pearl 2000, 284 }.
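The probabilistic reading of y-->x above is easy to check with a few lines of code (a minimal sketch; the probabilities are invented and the function name is mine, not standard):

```python
def p_implies(Py, Pxy):
    """P(y-->x) = P(~(y,~x)) = 1 - (Py - Pxy), as derived above."""
    return 1.0 - (Py - Pxy)

# Implication is at its MAXimum (=1) when Py = Pxy, ie y subset of x:
print(p_implies(0.3, 0.3))            # -> 1.0 , y entails x 100%
print(round(p_implies(0.3, 0.1), 9))  # -> 0.8 , since P(y,~x) = 0.2 leaks out
```

Note it is a DECreasing function of Py - Pxy only; Px plays no role here.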
The duality of Sufficiency and Necessity is easily visualized by Venn diagrams (pancakes or pizzas diagrams for kids :-) :

|-----------------------------|
| Universe of discourse = 1   |
|    ______________           |
|   | x            |          |
|   | P(x,~y) < Px |          |
|   |    _______________      |   the general case : x,y overlap
|   |   | Pxy > 0  |    |     |   only partially
|   |___|__________|    |     |
|       | y             |     |
|       | P(y,~x) < Py  |     |
|       |_______________|     |
|_____________________________|

|-----------------------------|
| 100% overlap ie Py-Pxy = 0 :|   y-->x , 100% implication is the extreme
|    _________________        |
|   | x               |       |   for an {find:} archer :-)
|   |   ___________   |       |   hitting x is NECessary
|   |  | y         |  |       |   for y to be hit ( find NECfor )
|   |  | 0< Pxy=Py |  |       |   hitting y is SUFficient
|   |  |___________|  |       |   for x being hit too ( find SufFor )
|   |_________________|       |
|_____________________________|

My approach to causation is based on probabilistic logic, with emphasis on the operation of IMPLication aka entailment. A viewpoint of mine is that
!! causation works in the DIRECTION OPPOSITE to y-->x . This is so
!! because ideally the observed effect y implies x as an UNobservable cause,
!!! while a cause x is NECessary for the effect y ie x NecFor y .
Removing a cause x will ideally remove effect y . Note that an inference rule : IF effect ie evidence THEN hypothesised cause (eg an exposure) is reflected in the: evidence IMPLIES hypothetical cause (eg a treatment), while the causation goes in the opposite direction: An exposure may cause an effect or evidence. Hence we must be careful about the assigned meanings and about directions of arrows and notations like (x:y) , (y:x) , (~y:~x) , (~x:~y) , find ?( Baeyer . Many cues or predictors are symptoms caused by a health disorder, but some cues are surely the causes of an illness, so eg: IF (wo)man THEN "a (fe)male disorder is likely" makes sense, but it would be foolish to think that a disorder caused a human to be a (wo)man. Although IF (fe)male disorder THEN (wo)man is correct, it (usually) is pointless.
-.- +Epicenter of this epaper with key insights inside .- +Contrasting formulas aka measures of impact ARR LR OR RR RRR PAR NNT NNH are abbreviations almost standard in EBM . ACR ARH ARX RDS PF GF HF dNNR NNR are my non-words, hence easy to find. It is important to use notations preventing errors of thought & typos (find Baeyer ). Over too many years too many folks used (and switched to) too many notations; to avoid confusion find ?(y:x) vs ?(x:y) in +Notation for their meanings. x = h = exposure, hypothesised cause, conjecture ; y = e = effect, evidence. Right now it may help to find +MicroTutorial and read a byte or bit(e) of it. Too many measures of statistical association were (re)invented under even more all too suggestive names. Easy is { Feinstein 2001, sect.17.6, 171-175, 337 -340 }, derivations in { Fleiss 2003, 123-133, 151-163 }, insights in both. All measures capture statistical dependence, often as functions of ARR(:) or of RR(:) which both contrast P(y|x) vs P(y|~x). The key question is which formula when (not) for what (not) ?? For the risk (or gain ) of the effect y if exposed to (a treatment) x , the key contrasting formulas in epidemiology and evidence-based medicine EBM for binary x are (for multivalued x or for any other kind of exposure just replace ~x by u, a mnemonic for Ursache = cause): ARR = P(y|x) - P(y|~x) = absolute risk reduction (for risk of effect y if x) = ARR(x:y) "absolute" = "not relative", but often |ARR| too. = ARR(~x:~y) = [ 1 - P(y|~x) ] - [ 1 - P(y|x) ] ; find ?(x:y) = a/(a+b) - c/(c+d) in a 2x2 contingency table (find 2x2 ) = [ Pxy - Px.Py ]/[ Px.(1-Px) ] <= 1 even for tiny Px as Pxy <= min(Px,Py) !! = cov(x,y)/var(x) = beta(y:x) = slope(of y on x) <= 1 ! = 0 if independent x,y then Pxy=Px.Py & P(y|x)=Py & P(x|y )=Px ! = 0 enforce if Px=1 then Py=Pxy=Px.Py & P(y|x)=Py & P(y|~x)=0/0=ARR(x:y) ! = 0 natural if Py=1 then Px=Pxy=Px.Py & P(x|y)=Px & P(x|~y)=0/0=ARR(y:x) ! 
note that Py=1 yields Pxy=Px.Py hence ARR(x:y) = 0/[Px.(1-Px)] =0, ! note that Px=1 yields Pxy=Px.Py hence ARR(y:x) = 0/[Py.(1-Py)] =0. ! Enforced zeros lead to a more meaningful ARR (but Py=1 or Px=1 are too extreme to be of much importance). For DISCOUNTing of the lack of ! surprise in y, K(x:y) = P(y|x) -Py <= 1-Py SEEMS better (find SIC K( ) since a frequent y is seldom perceived as much of a risk anyway (find twice ) but as we just saw, ARR(:) = 0/0 must be numerically forced to ARR(:)=0, but this is logically natural since Pxy = Px.Py means independent x,y also in the extreme case of Py=1 when P(y|x)=1 ie x-->y ie x implies y, as well as in the extreme case of Px=1 when P(x|y)=1 ie y-->x ie y implies x, while in both !!! extreme cases x,y are also INDEPENDent at the same time. Find my !!! INDEPendentIMPlication PARADOX. ARR(x:y) == PNS = Probability of Necessity and Sufficiency (in general), under exogeneity = no confounding, & monotonicity = no prevention of y by x { Pearl 2000, 289,291,300 } ARR(x:y).ARR(y:x).N = r2.N = ChiSqr(x,y) , find r2 ChiSqr( { Allan 1980 } NNT = |1/ARR| was introduced in { Laupacis & Sackett & Roberts 1988 } : NNT = number needed to treat for 1 more |or 1 less| beneficial effect y, low NNT = good, successful, effective treatment x ; NNH = number needed to harm 1 more |or 1 less| by side effects z, high NNH = good, harmless treatment x ; NNS = number needed to screen to find 1 more |or 1 less| case, low NNS = good, effective screening; |1/ARR| is the most realistic general measure of health effects, as it !!! is the least abstract & least exaggerating ie most HONEST, and !!! UNlike RR(:), OR(:) or any other rate ratio, it does NOT "throw away all information on the number of dead" { Fleiss 2003, 123 on Berkson's !!! index ie ARR }. Moreover NNT, NNS, NNH measure EFFORT ie COSTS PER EFFECT. If ARR=0 ie y,x independent, then 1/ARR = oo ie infinite. 
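The ARR and NNT arithmetic above is easy to sketch from the four cells a,b,c,d of a 2x2 table (a minimal sketch; the counts are invented and the helper names are mine, not standard):

```python
import math

def arr(a, b, c, d):
    """ARR(x:y) = P(y|x) - P(y|~x) = a/(a+b) - c/(c+d),
    rows x/~x, columns y/~y in the 2x2 contingency table."""
    return a / (a + b) - c / (c + d)

def nnt(a, b, c, d):
    """NNT = |1/ARR| = number needed to treat for 1 more (or 1 less)
    effect y; infinite (oo) when x,y are independent, ie ARR = 0."""
    v = arr(a, b, c, d)
    return math.inf if v == 0 else abs(1.0 / v)

# Invented counts: 10 of 100 exposed get y, 5 of 100 unexposed get y,
# so ARR = 0.10 - 0.05 = 0.05 and NNT = 1/0.05 = 20 treated per extra effect:
print(arr(10, 90, 5, 95), nnt(10, 90, 5, 95))
print(nnt(10, 90, 10, 90))   # independent-looking table -> inf
```

Note how NNT measures EFFORT ie COSTS PER EFFECT, while RR(:) would report the same table only as the ratio 2, "throwing away the number of cases".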
NNR by Hajek : NNR = 1/RDS > 0 (find RDS ) is my Number Needed for 1 more Relative effect; if RDS < 0 then switch to its COMMENSURABLE neighbor HF .
dNNR = 1/ARR - 1/RDS (if ARR > 0) = Hajek's difference of "Nrs Needed",
     = P(y|~x)/ARR = P(y|~x)/[P(y|x) - P(y|~x)] = 1/( RR -1) = 1/RRR
1/dNNR = RR - 1 = RRR = Relative risk reduction if ARR > 0.
NNH(x:z)/NNT(x:y) is also highly informative; should be >> 1 ie many more have to be x-treated before 1 z-harm will occur, while many more patients have y-improved already. NNH/NNT is in the fine infokit by { Steve Simon at http://www.childrens-mercy.org/stats }.
OR = odds ratio (as odds(P) = P/(1-P) in general)
   = (P(x|y)/[1-P(x|y)])/( P(x|~y)/[1-P(x|~y)] ) = OR(x:y) from which:
   = [P(x|y)/P(x|~y)]/[ P(~x|y)/P(~x|~y) ] = LR+ / LR- ; denominators cancel:
   = P(x,y).P(~x,~y)/[ P(~x,y).P( x,~y) ] = a.d/(b.c) = (a/b)/(c/d)
odds ratio = [P(y|x)/P(~y|x)]/[ P(y|~x)/P(~y|~x) ] = OR(y:x) by symmetry wrt x,y
LR- = P(~x|y)/P(~x|~y) = negative LR = (1 - sensitivity)/specificity
LR  = P( x|y)/P( x|~y) = LR+ = likelihood ratio = sensitivity/(1-specificity)
    = Pxy.[1/(Px-Pxy)].(1-Py)/Py = LR(x:y) = RR(x:y)
    = B(x:y) = Bayes factor in favor of y provided by x
!   Note (x:y) ie x-->y = 1/(Px - Pxy) = 1/P(x,~y) = 1/(x AndNot y)
    = 1/( x UnlessBlockedBy y) in plaintalk.
!   x-->y despite the numerator P(x|y) = (y SufFor x) = naive y-->x
(1-Py)/Py is my measure of surprise in y ; it decreases with increasing Py; in LR , RR , it DISCOUNTS lack of surprise in frequently occurring y;
    = product of 2 simplest measures of surprise thinkable : = (1-Py).(1/Py);
expectation E[fun] = Sum[Py.fun(Py)] in general ;
(1-Py)    = surprise in y; Cont(Y) = Sum[Py.(1-Py)] = 1-Sum[Py^2] find Cont(
log(1/Py) = surprise in y; H(Y) = Sum[Py.log(1/Py)] = Shannon's entropy;
log gives it a coding interpretation (here UNNEEDED) based on positional numerical representation and Kraft inequality for unique decodability.
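The LR's and odds just defined combine via the odds form of Bayes rule, post-test odds = pre-test odds . LR (a minimal sketch; the sens/spec values are invented and the helper names are mine):

```python
def odds(p):
    """odds(P) = P/(1-P) in general, as above."""
    return p / (1.0 - p)

def lr_pos(sens, spec):
    """LR+ = sensitivity/(1 - specificity)."""
    return sens / (1.0 - spec)

def lr_neg(sens, spec):
    """LR- = (1 - sensitivity)/specificity."""
    return (1.0 - sens) / spec

def post_test_p(prior, lr):
    """Bayes rule in odds form: post-odds = prior-odds . LR, back to P."""
    o = odds(prior) * lr
    return o / (1.0 + o)

# Invented example: sens = 0.9, spec = 0.8, local prevalence Py = 0.1 ;
# LR+ = 0.9/0.2 = 4.5 , post-test P = 1/3 , and OR = LR+/LR- = 36 :
sens, spec, Py = 0.9, 0.8, 0.1
print(lr_pos(sens, spec), post_test_p(Py, lr_pos(sens, spec)))
print(lr_pos(sens, spec) / lr_neg(sens, spec))
```

This is the arithmetic behind the nomogram use of LR: a globally collected LR updates the local (or judged) prior Py into the post-test probability.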
Sum[Py.1/Py] = card(Y) = cardinality = variety = nr of distinct values of a r.v. Y = a rough measure of surprise in Y. H(Y) <= log(card(Y)).
Sum[Py.(1-Py).(1/Py)] = Sum[Py.(1/Py - 1)] = Sum[1 - Py] = card(Y) - 1 .
RR = P(y|x)/P(y|~x) = relative risk = risk ratio , measures (y SufFor x)
   = (a/(a+b))/(c/(c+d)) = RR(y:x), seems more impressive than ARR, NNT, NNH
   = Pxy.[1/(Py-Pxy)].(1-Px)/Px goes up with small: Py-Pxy, Px
   = LR(y:x) = RR(y:x) = B(y:x) = Bayes factor in favor of x provided by y,
!    since y-->x is in 1/(Py - Pxy) = 1/P(y,~x) = 1/(y AndNot x)
   = 1/( y UnlessBlockedBy x ) in plaintalk
In medicine LR will be more stable than RR. LR's can be collected globally (eg on a national scale) and via Bayes rule (eg in the nomogram at www.cebm.net in Oxford) applied to the individual cases subject to the local prevalence (or to the judged prior) Py, to obtain what we really want: the post-test probability P(y|x). Find Bayes and Bailey below. Prof. Brian Haynes (McMaster University, Canada) and prof. Paul Glasziou (Oxford, UK) have pointed out to me that it would be misleading to publish P(y|x), because a physician must use his/her internal prior Py of an individual patient and update it (eg via the nomogram) by LR(x:y) of a community or population, to obtain the patient's P(y|x). So although LR may carry more generally useful (more robust ie stable) partial information, RR carries information more meaningful finally and individually: the relative risk ie risk ratio RR(:).
RR(y:x), ARR(x:y) are the key parts of other meaningful formulas :
RRR = ARR/P(y|~x) = RR - 1 (if RR >= 1 ie ARR >= 0) = relative risk reduction
    = excess relative risk { Feinstein 2002, 340 } (is not Pearl's ERR )
``  = 1/dNNR = 1/[1/ARR - 1/RDS] , find RDS dNNR by Hajek
PF  = -ARR/P(y|~x) = 1-RR if RR <= 1, = preventable or prevented fraction
      if RR >= 1 use 1 -1/RR = ARX , not the INCOMPATIBLE GF ;
    = -[ P( y|x) - P( y|~x) ]/ P( y|~x) = PF(x:y) by { P.W.
Cheng 1997 }
    = -[1-P(~y|x) -(1-P(~y|~x))]/[1-P(~y|~x)] my expression of PF as a GF
!! *= [ P(~y|x) - P(~y|~x) ]/[1-P(~y|~x)] = GF(x:~y) = RDS(x:~y) by Hajek
`` my *= expresses Cheng's PF in the CANONICAL form of "relative difference"
   by M. Sheps RDS : GF(x generates y), PF(x prevents y) = GF(x generates ~y)
!! so both have similar structures, ie are unified in the spirit of my slogan
"A non-event is an event is an event", with apologies to Gertrude Stein who spoke similarly about a rose, tho not about a non-rose :-) . Nevertheless
!! PF and GF are INCOMPATIBLE, says this box:
|--- +NEW CONVERSIONS: "relative difference" by Sheps in my notation RDS(u:v)
|   is quantitatively interpretable iff its numerator ARR(u:v) >= 0, hence if
!!  ARR(u:v) < 0 switch to RDS(~u:v) ; RDS(u:~v) is INCOMPATIBLE with RDS(u:v).
|   P( v|u) + P(~v|u) = 1 find ?( in +Notation for the semantics of:
|   LR(~v:u) = P(~v|u)/P(~v|~u) = RR(~v:u)
!!! LR(~v:u) = 1 - RDS(u:v) = 1/[ 1 - RDS(~u:v) ] is my key RDS-CONVERSION RULE
|   between COMPATIBLE RDS's where u stands for x or ~x, and v for y or ~y, so
|   eg 1-ARX = 1/[ 1-PF ], ie 1/RR = 1/RR indeed.
|
| RDS(u:v) = [ Pv - P(v|~u) ]/[ Pu.(1-P(v|~u)) ]
|          = [P(u,v) - Pu.Pv]/[ Pu.P(~u,~v) ] ie asymmetry due to Pu, hence
!!! if Pu <= Pv then RDS(u:v) >= RDS(v:u), P(v|u) >= P(u|v) , and v.v.
|
| 1st derivation: generic form for a success score Pa corrected for guessing
|   by discounting the hits by chance is (for Pa > Pb) :
!!! [(1-Pb)-(1-Pa)]/(1-Pb) = 1 -(1-Pa)/(1-Pb) = [Pa - Pb]/[1-Pb] where
|   1-Pb is a reference or base rate of failures (or errors Pb = Perr ), eg:
| + Jacob Cohen's kappa for interrater agreement or concordance, 1960, in
|   { Fleiss 2003, 603,609,620 } { Feinstein 2002, 20.4.3 }
|   { Bishop 1975, 395-6 }
| + (Py -
E[P])/(1 - E[P]) in general, where E[P] is MINImal if all Pj=1/m
|   (Py -1/m)/(1 -1/m) = m-multiple-choice score [ 1/m MAXImizes Cont( ]
|   ( If all 1..j..m choices must be assigned a Pj , then the score
|     S = 2.Pr -E[Pj] = 2.Pr -Sum[Pj.Pj] { De Finetti 1972, 30 } scale [-1..1]
|     where Pr = P assigned to the right answer ; (S+1)/2 ranges [0..1/2..1];
|     if Pr = 1 then Smax else if a Pj=1 then Smin )
!! + (P(y|x) - Py)/(1-Py) = K(x:y)/[1-Py] = ARH = attributable risk by Hajek
| + TauB = [ (1-E[Py])-(1-E[P(y|x)]) ]/(1-E[Py]) = Cont(X:Y)/Cont(Y)
|   Var(Y) = 1-E[Py] , 1-E[P(y|x)] = E[Var(Y|X)] , find TauB Cont( Gini E[P
|
| 2nd derivation: ARR(x:y) = slope(of y on x) = beta(y:x) , find slope , so
|   RDS = ARR(x:y)/(fictive max. slope of y on x), fictive = as-if = what-if
|
| 3rd derivation: P(y|. ) + P(~y|. ) = 1
!!! 1 >= P(y|x) = P(y|~x) + P(~y|~x).RDS , causal excess factor RDS == GF
|   RDS = [P(y|x) - P(y|~x)]/[1-P(y|~x)] <= 1
!!! P(~y|~x) = 1 - P(y|~x) for (~y,~x) also RDS-susceptible to the cause x
|
| 4th derivation: would x always cause y, enlarged Py would make Pxy=Px ie
|   P(y|x)=1, hence the 1 in RDS's denominator. Note that thus enlarged Py
|   does NOT change P(y,~x) and P(y|~x).
.-
| RDS(:)'s true meaning obtains from its CANONICAL form with the denominator
|   1-P(.|.), then interpret RDS(:) >= 0 from its numerator.
|
| RDS(x:y) = [P( y| x) - P( y|~x)]/[1-P( y|~x)] from the 1st derivation:
|          = [P(~y|~x) - P(~y| x)]/P(~y|~x) = 1 - P(~y|x)/P(~y|~x)
|   = RDS = ARR/P(~y|~x) = Cheng's GF = Pearl's PS = 1 -LR(~y:x) = 1 - 1/Qsuf
|   = 1 if P( y| x) = 1
|
| RDS(~x:y) = [P( y|~x) - P( y| x)]/[1-P( y| x)] = 1 - P(~y|~x)/P(~y|x)
!!  = -ARR/P(~y| x) = new Hajek's fraction HF = 1 -LR(~y:~x) = 1 - Qsuf
|   = 1 if P( y|~x) = 1
|
| RDS(x:~y) = [P(~y| x) - P(~y|~x)]/[1-P(~y|~x)] = 1 - P(y|x)/P(y|~x)
|           = [P( y|~x) - P( y| x)]/ P( y|~x)    = -ARR/P(y|~x)
|   = 1-RR = PF (eg by P.W.
Cheng) = 1 -LR(y:x) = 1 - Qnec | = 1 if P(~y| x) = 1 | | RDS(~x:~y) = [P(~y|~x) - P(~y| x)]/[1-P(~y| x)] = 1 - P(y|~x)/P(y|x) | = [P( y| x) - P( y|~x)]/ P( y| x) = 1 -LR(y:~x) = 1 - 1/Qnec | = 1 -1/RR = ARX = Judea Pearl's ERR ( <= PN ) = ARR/P(y|x) | = 1 if P(~y|~x) = 1 | `` In my CONVERSIONS SCHEME RDS(u:v)'s one neighbor has ~u, the other ~v | ( commensurable neighbors are linked by a / ) : | !!! PS = GF = -PF.P(y|~x)/[ 1-P(y|~x)] = -PF.Odds(y|~x) | /. 1-GF = 1/[1-HF ] = 1/Qsuf = Lnec | New: HF PF = 1-RR = -GF/Odds(y|~x) | ./ RR = 1-PF = 1/[1-ARX] = Qnec = Lsuf | PN >= ARX = 1 -1/RR = -HF/Odds(y|x) = -HF.[1-P(y|x)]/P(y|x) = PAR/P(x|y) | | HF = -ARX.Odds(y|x) = -ARX.P(y|x)/[1-P(y|x)] | RR = OR/[P(y|~x).(OR-1) +1] is the exact conversion from odds ratio ; | OR(:) to RR(:) are needed eg for PF and ARX | | RDS(u:v) < 0 isn't interpretable, so we must use its PROPER COMPLEMENTARY | RDS(~u:v) >= 0. Since I expressed all 4 RDS'es with numerators +-ARR, | complementary RDS'es must also have COMMENSURABLE denominators: hence !!! COMPATIBLE are GF,HF and ARX,PF (in both pairs u and ~u are swapped) !!! INCOMPATIBLE are GF,PF (used by Cheng ), and GF,ARX (by Pearl ). | If P(y|x) & P(y|~x) are very small (as they often are in a population) | then GF & HF are near |ARR|, ie GF & HF keep the implicit information on !!! the number of cases |1/ARR| = NNT or NNH are informative in this sense, | while ARX & PF are >> |ARR|, ie ARX & PF lose that information as they are !!! based on the risk ratio RR(:) ie relative risk which exaggerates an effect. | |--- more on RDS below ARX = ARR/P(y|x ) = RRR/RR = 1 -1/RR if RR >= 1, else use 1-RR = PF , not HF ; if RR > 2 then ARX > 1 -1/2 = 0.5 which is interpretable as !!! 
"more likely than not" eg in civil toxic tort cases, as { Fleiss 2003, 126 } and { Finkelstein 2001, 285 } point out; (find LikelyThanNot ) = PAR/P(x|y) = attributable risk for exposed = attributable risk percent = attributable proportion = attributable fraction in exposed group = etiologic fraction for exposed group (is not PAR ) = excess fraction = excess risk ratio ERR { Pearl 2000, 292 } = excess relative risk So far we obtained our rates or proportions P(.)'s from the study group. Other formulas may require P(.)'s from a community (eg regional population), which may be estimated easily and cheaply from the study group ONLY thus : ! If the control group ie ~y in the study is a RANDOM sample of ~yc ie in the community, then Pxc =. P(x|~y) from the study group { Fleiss 2003, 151 }: Pxc = P(exposed to x in community or population) is estimated by: ! =. P(x|~y) = b/(b+d) from the studied controls subgroup only, writes { Feinstein 2002, twice on p.340/17.7 low } Pyc = P(risk factor y in community or population) { Feinstein 2002, 338 } ! =. Pxc.P(y|x) + (1-Pxc).P(y|~x) where P(y|.) are from the study group = a weighted average ie an interpolation between P(y|x) and P(y|~x) ACR = P(y|x) - Pyc = ARR(1-Pxc) { Feinstein 2002, 338/17.6.2} = attributable community risk = attributable population risk (vs ? in PAR ) PAR = [ Pyc - P(y|~x) ]/Pyc { Feinstein 2002, 340 } = Pxc.(RR-1)/[ 1 + Pxc.(RR-1) ] { Feinstein 2002, find p.340 for a typo } = population attributable risk percent/100 ( RR is from the study group) = population attributable risk fraction !! = [ Pxc.P(y|x) + (1-Pxc).P(y|~x) - P(y|~x) ]/Pyc = ARR.Pxc/Pyc { by Hajek } = community etiologic fraction { Fleiss 2003, 125-8,156/7.5,711/7.5 } : = [ P(x|y) - Pxc ]/[1-Pxc]; if small Pyc then Pxc =. P(x|~y) hence : ! =.[ P(x|y) - P(x|~y)]/[1-P(x|~y)] { Fleiss 2003, 151 }, is RDS-like PAR ? = attributable risk in population { Finkelstein & Levin 2001, 286-7 } : ! 
= (1 -1/RR).P(x|y) = ARX.P(x|y) but case-control studies provide OR, not RR, but RR =. OR if Py is low (above find exact conversion ); for RR >= 1 { Kahn & Sempos 1989, 74,80 } RDS = ARR/P(~y|~x) = ARR/[ 1-P(y|~x) ] = relative difference by M.C. Sheps = [ P(y|x) - P(y|~x)]/[ 1-P(y|~x) ] for as-if binary x !!! = slope(of y on x) /[ FICTIVE MAX. slope, as P(y|x) <= 1 ] is my view = [ Pxy - Px.Py ]/[ Px.P(~x,~y) ] ie asymmetry due to Px only = RDS(x:y) notation like ARR(x:y) due to x-->y in P(y|x), find ?(x:y) = 1 = Max if P(y| x)=1 ie Pxy=Px ie x-->y 100% = P(y|x) = ARR if P(y|~x)=0 ie Pxy=Py ie y-->x 100% = 0 = ARR if P(y| x)=P(y|~x) ie Pxy=Px.Py ie x,y independent = ARR/0 = ?? if P(y|~x)=1 ie Py -Pxy = 1 - Px , ARR <= 0 ie 0 = 1 -(Px+Py-Pxy) = P(~(x or y)) = P(~x,~y) = 0/0 = ?? if Py = 1 = 0/? = ?? if Px = 1 ( ARR/[ Py - P(y|~x) ] = 1/Px :-) Let u = Ursache = cause or causes other than x : RDS = [ P(y|x) - P(y|u) ]/[ 1-P(y|u) ] = relative difference a la Sheps = [ successful y if x minus if u ]/[ failure rate of y if u ], as P(y|x) <= 1=Max, the 1-P(y|u) is the MAXImal thinkable value of ! RDS's numerator, ie 1-P(y|u) is a meaningful normalization. The !!! key IDEA is that failures if u, are available to become successes if x, !! and that RDS is more honest than RRR, ARX, if P(y|~x), P(y|x) is small, as they often are, which inflates the measures based on RR(y:x). RDS'es are in : - { Fleiss 2003, 123-125,162, SeLn for CI( 1-RDS) on pp.133,162-163,152,156 } - { Sheps 1959 }, fine in { Feinstein 2002, 174 }, and for u == ~x in : - { I.J. Good 1961/1983, 208,212 } as QuasiProbability for causal nets - { Khoury 1989 } as susceptibility if independence assumed, commented in: - { Rothman & Greenland 1998, 53-56 eq.4-3 } on attributable fractions; ! 
+ the following authors have paired RDS with a 2nd formula : - { Glymour 2001, chap.7 = Cheng models, pp.75-91, 108-110 } based on: - { Glymour, Cheng 1998 } based on: - { Patricia Cheng 1997 } eq.(16) = RDS = GF , PF = eq.(30) = 1-RR >= 0 - { Pearl 2000, 292,300,284 } PS = RDS , PN >= ERR = 1 -1/RR = ARX "probability of sufficiency" PS = RDS = GF = generative power of x wrt y "probability of necessity" PN vs Cheng PF = preventive power of x wrt y my view: PF = generative factor of x to ~y = PF = preventable fraction ! if RR > 1 then (0 < PN < 1) vs ( PF < 0) else (here PN =. ERR ) ! if RR < 1 then ( PN < 0) vs ( 0 < PF < 1) else PN = 0 = PF ie x,y independent; for increasing RR Pearl's PN is nonlinear INCreasing, PN <= 1 = if RR=oo=MAX while with RR Cheng's PF is linear DECreasing, PF <= 1 = if RR=0 =min !! but this is ok, since (x NECESSARY for y) is OPPOSITE to (x PREVENTS y). ! PN and PF just serve to OPPOSITE purposes. PN >= 0 , PF >= 0 are required for meaningful interpretability, and also : RDS >= 0 is required in { Pearl 2000, 294 },{ Novick & Cheng 2004, 461 }. RDS < 0 has no clear interpretation, hence we better make RDS >= 0 thus : if ARR >= 0 ie if P( y| x) >= P(y|~x) then RDS(x:y) = ARR/P(~y|~x) <= P(y|x) `` = [ P( y| x) - P( y|~x) ]/[ 1-P( y|~x) ] = GF = y CausedBy x else RDS(~x:y) = [ P( y|~x) - P( y| x) ]/[ 1-P( y| x) ] = HF = y CausedBy ~x !! = -ARR/P(~y| x) is the new Hajek's fraction HF signal that the "else" happened; 1-RDS(x:y) = P(~y|x)/P(~y|~x) = 1/LR(~y:~x) = 1/Qsuf = Lnec (by Folk1 Folk3 ) = 1-GF = 1/[1-HF] !! Health effects can be expressed either by counting the ill or dead, or by counting the cured or alive { Sheps 1958 }. So we are free to replace any P(y|.) with P(~y|.) = 1-P(y|.) in many formulae as eg in RDS in PF. Since P(y|.)'s are often small, 1-P(y|.) =. 1 so that RDS =. ARR. Generally results will be different, depending on our choice of events vs ~events. 
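The RDS conversion rules boxed above are easy to verify numerically (a minimal sketch; the two conditional risks are invented and the variable names are mine):

```python
def rds(p_v_u, p_v_not_u):
    """RDS(u:v) = [ P(v|u) - P(v|~u) ]/[ 1 - P(v|~u) ],
    Sheps' relative difference in its CANONICAL form."""
    return (p_v_u - p_v_not_u) / (1.0 - p_v_not_u)

# Invented conditional risks P(y|x), P(y|~x):
p_y_x, p_y_nx = 0.30, 0.10
RR  = p_y_x / p_y_nx               # relative risk = 3
GF  = rds(p_y_x, p_y_nx)           # RDS( x: y) = Cheng's GF = Pearl's PS
HF  = rds(p_y_nx, p_y_x)           # RDS(~x: y) = Hajek's fraction HF
PF  = rds(1 - p_y_x, 1 - p_y_nx)   # RDS( x:~y) = 1 - RR  , preventable fraction
ARX = rds(1 - p_y_nx, 1 - p_y_x)   # RDS(~x:~y) = 1 - 1/RR , Pearl's ERR

# the key RDS-CONVERSION RULE: LR(~v:u) = 1 - RDS(u:v) = 1/[ 1 - RDS(~u:v) ]
assert abs((1 - GF) - 1.0 / (1 - HF)) < 1e-9    # COMPATIBLE pair GF, HF
assert abs((1 - PF) - 1.0 / (1 - ARX)) < 1e-9   # COMPATIBLE pair PF, ARX
assert abs(PF - (1 - RR)) < 1e-9 and abs(ARX - (1 - 1.0 / RR)) < 1e-9
```

With these invented risks GF and HF stay in [-1..1] while PF = -2, illustrating why an RDS with a negative numerator must be swapped for its proper complement.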
These options create ample opportunities for honesty vs dishonesty, misleading, manipulation. Clearly, if P(y|~x) < 0.5 then RDS < RRR which only looks more impressive. For ARR > 0 ie for RR > 1 holds: 0 < RRR <= oo, while 0 < RDS <= 1, and measures with incompatible ranges should not be compared. Moreover, RRR is not a RDS-measure. .- +Contemplating the bounds of some measures Warm-up: Py > P(y|~x) = (Py - Pxy)/(1-Px) simplifies to: Pxy > Px.Py ie x,y 'positively' dependent, hence: if Pxy > Px.Py then Py > P(y|~x) & P(y|x) > Py & P(y|x) > P(y|~x) & K(x:y) = [ P(y|x) - Py ] < [ P(y|x) - P(y|~x) ] = ARR(x:y) else then Py < P(y|~x) & P(y|x) < Py & P(y|x) < P(y|~x) & K(x:y) = [ P(y|x) - Py ] > [ P(y|x) - P(y|~x) ] = ARR(x:y) else equalities. Lets think twice about which basic measure is better : P(x|y) measures how much y SufFor x ie y-->x FORMALLY, but this --> NEEDS NOT to make much sense SEMANTICALLY, as it also ! measures how much is x NecFor y, which usually does !! make a lot of sense: x suppressed makes y suppressed or removed; P(y|x) - P(y|~x) = ARR(x:y) is Absolute risk reduction of the effect y, ! which always has the same sign, as: ! P(x|y) - P(x|~y) is a measure of y-->x or how much is x NecFor y ! which always has the same sign, as: -Px <= P(x|y) - Px <= 1-Px DISCOUNTS the LACK of SURPRISE in x ; find SIC K( since a frequent x is not seen as a real CAUSE; this makes sense if Px =. 1, LESS sense if Px is low; moreover we should see P(x|y) - Px for what it normally is ( Px =. 1 is an extreme ); { if Pxy=Py then P(x|y) - Px = 1 - Px if Pxy=Px then P(x|y) - Px = Pxy.(1/Py -1) = P(x|y).(1-Py) [ = Px .(1/Py -1) ] <= 1-Px (find SIC ) where [.] may suggest that Py, Px can be varied, but Pxy <= min(Px, Py) }, !! 
which always has the same sign as:
-Py <= P(y|x) - Py <= 1-Py which DISCOUNTS the LACK of SURPRISE in y
   (find SIC ) since a frequent y is not seen as a real RISK; this
   makes LESS sense than 1-Px above, since a wide-spread risk is still
   a risk, although psycho-socially it is more acceptable if everybody
   is at the same high risk. Indeed, as long as nothing can be done
   about the risk, the society gets used to it, becomes fatalistic
   about that risk, and is not too jealous wrt those lucky few
   exceptions who are spared the risk.
   { if Pxy=Px then P(y|x) - Py = 1-Py ;
     if Pxy=Py then P(y|x) - Py = Pxy.(1/Px -1) = P(y|x).(1-Px)
        [ = Py.(1/Px -1) ] <= 1-Py (find SIC )
     where [.] may suggest that Px, Py can be varied,
     but Pxy <= min(Px, Py) }.
[-Py..0..1-Py] ie COMPLEMENTARY bounds make sense for a measure ?(y:x).
Also [-Px..0..1-Px] ie COMPLEMENTARY bounds make sense for a measure
?(x:y), see { Kahre 2002, 118-119 } who adopted as desirable the upper
bound 1-Px designed by Popper into corroboration C(y:x) . Moreover the
! bound 1-Px fits with:
Cont(X) = 1 - E[P] = 1 - Sum[Px.Px] = Sum[Px.(1-Px)] , find Cont( E[P
But if by now you want to choose a measure with the upper bound 1-Px or 1-Py,
then look back at ARR(:) above and notice and appreciate my numerically (but
not logically) enforced value: if Px=1 or Py=1 then ARR(:) = 0 :
!!! = 0 to be enforced if Px=1 then P(y|~x)=0/0 & Py=Pxy=Px.Py & P(y|x)=Py
!!! = 0 to be enforced if Py=1 then P(x|~y)=0/0 & Px=Pxy=Px.Py & P(x|y)=Px
Note that BOTH last two lines hold for both ARR(x:y), ARR(y:x) which BOTH
became DISCOUNTing down to ARR(:) = 0 for extreme cases of Px=1 and/or Py=1.
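The enforced value can be sketched thus (an illustrative Python sketch,
not the author's KX code; the name arr_xy and its arguments are this
sketch's own, with Px, Py, Pxy as the three basic proportions):

```python
def arr_xy(px, py, pxy):
    """ARR(x:y) = P(y|x) - P(y|~x), with the value 0 enforced for the
    degenerate cases Px=1 (where P(y|~x) = 0/0) and Px=0; the case
    Py=1 yields 0 automatically, since then Pxy = Px."""
    if px in (0.0, 1.0):
        return 0.0                      # numerically enforced, as argued
    return pxy / px - (py - pxy) / (1.0 - px)
```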
From the bounds 0 <= P(.) <= 1 follow measures' bounds. Notice their
COMPLEMENTarity: in the math sense the bounds are UNIdistant (a new word?) ie
Upb - Lob = 1 which makes sense logically for a hypothesized cause or a
conjecture x and its opposite ~x :
Lob if Pxy=0 : ?(y:x) = y implies x ie y-->x Upb
-P(x) = 0 -P(x) <= K(y:x) = P(x|y) - P(x) <= 1-P(x) = P(~x)
-P(x|~y) = 0 -P(x|~y) <= ARR(y:x) = P(x|y) - P(x|~y) <= 1-P(x|~y) = P(~x|~y)
-P(y|~x) = 0 -P(y|~x) <= ARR(x:y) = P(y|x) - P(y|~x) <= 1-P(y|~x) = P(~y|~x)
-P(y) = 0 -P(y) <= K(x:y) = P(y|x) - P(y) <= 1-P(y) = P(~y)
?(x:y) = x implies y ie x-->y
K(x:y) = my notation for "how much raises x the probability of y" { Kahre }
ie how much (x CAUSES y) as in { Kahre 2002, 186 eq(6.23.2) etc }
where discounting by -Py [ instead of by -P(y|~x) in ARR(x:y) ]
seems implicitly justified by the remark "the probability that a
drunk driver x causes an accident y may be small, eg 1/100" ie
P(y|x)=small and P(y|~x)=smaller, Py = Pyx +P(y,~x). More subtle is :
RDS(x:y) = ARR(x:y)/P(~y|~x) = [ P(y|x) - P(y|~x) ]/[ 1-P(y|~x) ], but
RDS has NOT complementary bounds: LoB for Pxy=0, UpB=P(y|x) :
-P(y|~x)/[1-P(y|~x)] = 1 - 1/[1-P(y|~x)] <= RDS(x:y) <= P(y|x)
hence if P(y|x) < P(y|~x) then RDS(x:y) << -1 is possible and its
interpretation becomes unclear, so if ARR(x:y) < 0 we should evaluate
RDS(~x:y) = [ P(y|~x) - P(y|x) ]/[ 1-P(y|x) ] and signal the swap.
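The swap rule just stated can be sketched in Python (an illustrative
sketch with names of my own, not the author's KX code):

```python
def rds_with_swap(px, py, pxy):
    """RDS(x:y) = ARR(x:y)/[1 - P(y|~x)]; if ARR(x:y) < 0, evaluate
    RDS(~x:y) instead and signal the swap, as argued above."""
    p_y_x  = pxy / px                    # P(y|x)
    p_y_nx = (py - pxy) / (1.0 - px)     # P(y|~x)
    if p_y_x >= p_y_nx:
        return (p_y_x - p_y_nx) / (1.0 - p_y_nx), False
    return (p_y_nx - p_y_x) / (1.0 - p_y_x), True   # swapped ~x for x
```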
?(y:x) is explained in +Notation ( find ?(y:x) to avoid confusion ! ).
K(y:x) = P(x|y) - Px
= Korroboration { Kahre 2002, 120 eq(5.2.8) } preferred over:
C(y:x) = Corroboration { Popper 1972, 400 eq(9.2*) in foot }
= confirmation of x=h by y=e { Popper 1972, 395-396; find modern }
C(y:x) says: y implies x ie y-->x ie y SufFor x, and indeed,
C(y:x) = MAXi if P(x|y)=1 ie y-->x which does NOT SHOW in Popper's POOR
form where P(y|x) SEEMS x-->y while IN FACT y-->x is there :
C(y:x) =
{ Popper's form } = [ P(y|x) - Py ]/[ P(y|x) + Py - Pxy ]
{ find my C-form2 } = [ P(y,x) - Px.Py]/[ P(y,x) + Px.(Py - Pxy )]
{ inverted form } = [ P(x|y) - Px ]/[ P(x|y) + Px.( 1 - P(x|y))] from
which we see that: = 0.0 for Px=1 or x,y independent
{ Kahre 2002, 119 } = [ P(x|y) - Px ]/[ P(x|y).(1-Px) + Px ],
:-( if P(x,y) = 0 : = [ 0 - Px ]/[ 0 + Px ] = -1 :-(
!! if P(x|y) = 1 : = [ 1 - Px ]/[ 1.(1-Px) + Px ] = 1-Px !
if P(y|x) = 1 : = [ 1 - Py ]/[ 1 + Py - Pxy ]
effect = y = evidence , x = hypothesis
-Px <= K(y:x) <= 1-Px = P(~x) for how much y Korroborates x
:-( -1 <= C(y:x) <= 1-Px = P(~x) for how much y Corroborates x
vs -1 <= F(y:x) <= 1 by design, for how much y supports x
F(y:x) = degree of Factual support = F(h,e) { Kemeny 1952 } =
F(x,y) = my F(y:x) = ARR(x:y)/[ P(y|x) + P(y|~x) ]
= [ P(y|x) - P(y|~x) ]/[ P(y|x) + P(y|~x) ] { Kemeny }
= [RR(y:x) - 1 ]/[RR(y:x) + 1 ] find F(
= [ P(y,x) - Px.Py ]/[ P(y,x) + Px.(Py-2.Pxy)] = F-form2 vs C-form2
= -F(y:~x) and anaLogically for any mix of events like x,y,~y,~x
due to my "A nonevent is an event is an event" (sorry Gertrude :-)
if P(x|y) = 1 then Pxy=Py & P(y|~x)=0 since P(y,~x) = Py - Pyx = 0.0 ,
hence F(y:x) = [ P(y|x) - 0 ]/[ P(y|x) + 0 ] = +1 = maxi;
if P(x,y) = 0 : = [ 0 - P(y|~x) ]/[ 0 + P(y|~x) ] = -1 = mini;
if P(y|x) = 1 : = [ 1 - P(y|~x) ]/[ 1 + P(y|~x) ].
!!! Caution: F(y:x) = +1 if P(x|y) = 1 ie y implies x ie y-->x
despite P(y|x) - P(y|~x) = ARR(x:y) in the numerator;
it follows from the 1/P(y|~x) = (1-Px)/P(y,~x) in RR(y:x) = P(y|x)/P(y|~x) =
= oo = infinite if P(y,~x) = Py - Pyx = 0 ie Py = Pxy ie P(x|y)=1
[ find SurpriseBy(x) ] and recall that F(:) = [RR(:) -1]/[RR(:) +1] hence
F(:) and RR(:) are CO-MONOTONIC (find monoton ).
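The conversion F(:) = [RR(:) -1]/[RR(:) +1] and the co-monotonicity can
be checked numerically (a small illustrative Python sketch; the name
f_from_rr is my own):

```python
def f_from_rr(rr):
    """Kemeny's F(:) = [RR(:) - 1]/[RR(:) + 1] rescales RR from
    [0..1..oo) onto [-1..0..+1]; RR = oo maps to +1 = maxi."""
    if rr == float('inf'):
        return 1.0
    return (rr - 1.0) / (rr + 1.0)
```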
Back to bounds (find SIC ):
Kahre liked 1-Px but DISLIKED -1 in -1 <= C(y:x) <= 1-Px and argued for his
"symmetrical" in fact COMPLEMENTary -Px <= K(y:x) <= 1-Px . Alas, Kahre did
not consider the much more used ARR(:) with its COMPLEMENTary bounds shown.
.- +Construction principles P1: to P7: of good association measures
P1: "Measures of association should have operationally meaningful
interpretations that are relevant in the contexts of empirical
investigations in which measures are used." { Goodman & Kruskal, 1963,
p.311, also in the footnote }. Henceforth I discuss events x, y, but it all
holds for their expected values ie averages over variables X, Y ie sets of
events too.
P2: OpeRational usefulness is greatly enhanced if measure's WHOLE RANGE of
values (not only its bounds ) has an opeRationally highly meaningful
interpretation as a quasiprobability, eg ARR, NNT , NNH , RDS .
P3: Various results from a single measure should be meaningfully COMPARABLE
regardless of the total count N of all joint events in a contingency
table. This means that a measure should be built from proportions P(:)
only, without an uncancelled N. Thus measures based on ChiSquare do not
qualify for causation. But N plays its role in confidence intervals.
P4: To measure association means to measure statistical dependence. I can
list 16+1 = 17 equivalent conditions of independence ie equalities
lhs = rhs, like eg P(y|x) = P(y|~x), P(x|y) = P(x|~y), ... ,
Pxy = Px.Py, ..., P(y|x) = Py , P(x|y) = Px, ..., from which 2*17 =
34 measures of dependence can be made by CONTRASTing:
lhs - rhs like ARR(:),
or lhs / rhs like RR(:) above, both are asymmetrical wrt x,y ;
eg Pxy/(Px.Py) = P(x|y)/Px = P(y|x)/Py is symmetrical wrt x,y , and the
correlation coefficient r is also symmetrical wrt x,y :
r = cov(x,y)/sqrt[ var(x).var(y) ]
= (Pxy - Px.Py)/sqrt[(1-Px)Px .(1-Py)Py]
r2 = r.r = r^2
= [ cov(x,y)/var(x)].[cov(x,y)/var(y)]
= [slope of y on x ].[slope of x on y] both slopes have the same sign
= beta(y:x) . beta(x:y) >= 0 , -1 <= beta <= 1
= ARR(x:y) . ARR(y:x) = coefficient of determination
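The identity r^2 = ARR(x:y).ARR(y:x) for indicator events can be
verified numerically (an illustrative Python sketch; names are mine):

```python
import math

def r_squared_two_ways(px, py, pxy):
    """For indicator events the squared correlation equals the product
    of the two regression slopes beta(y:x).beta(x:y), where each
    slope = cov/var = the corresponding ARR(:)."""
    cov = pxy - px * py
    r = cov / math.sqrt(px * (1 - px) * py * (1 - py))
    beta_yx = cov / (px * (1 - px))     # = ARR(x:y) = slope of y on x
    beta_xy = cov / (py * (1 - py))     # = ARR(y:x) = slope of x on y
    return r * r, beta_yx * beta_xy
```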
But measures of confirmation, evidence, indication, and certainly of
!! causation should be DIRECTED ie ORIENTED ie ASYMMETRICAL wrt events x,y.
Asymmetry is easily obtained by taking a symmetrical association measure
(lhs - rhs) and NORMalizing it either by lhs or by rhs, or by 1 - rhs, eg:
(lhs - rhs)/(1 - rhs) = [ (1-rhs) -(1-lhs) ]/(1-rhs) in fact,
= 1 = Maxi (as RDS(:) ) if lhs=1=Maxi ie if it SATURATES.
Saturation (at a fixed bound) usually makes sense since to be off some
target or reference point by 100 meters or by 100 kilometers may be
opeRationally the same: too far.
Or a normalization by a function of one variable only :
ARR(x:y) = P(y|x) - P(y|~x) = [ Pxy - Px.Py ]/[ Px.(1-Px) ]
= cov(x,y)/var(x) = beta(y:x) for indicator events
!! An asymmetry is not enough, unless it is a meaningful asymmetry, with
clearly understood opeRational meaning.
P5: Measures of CAUSATION tendency should be decomposable into (a product of)
terms such that one term itself measures probabilistic IMPLICATION ie
ENTAILMENT, but the equality Measure(y:x) = Measure(~x:~y) is UNDESIRABLE .
!! Alas, the conviction measure Conv(y:x) = Conv(~x:~y) as defined by
Google's co-founder { Brin 1997 } does NOT qualify; find UNDESIRABLE :-(
!!  But if another measure satisfies Measure(y:x) <> Measure(~x:~y), it does NOT
automatically mean that such a measure is better; eg the bounds may be
less opeRationally meaningful than the bounds of Conv(:).
Entailment provides a link with the notions of necessity and sufficiency :
y-->x ie (y SufFor x) == (x NecFor y) == (~y NecFor ~x) == (~x SufFor ~y)
P6: Measure(y:x) should yield meaningful values also for such extremes as
Pxy=0, Pxy=Px.Py, P(x|y)=1, P(y|x)=1, Px=1, Py=1 :
eg: RR(y:x) = 0 if Pxy=0 ie if x,y are disjoint events
! RR(y:x) = 1 if Px=1 hence Py - Pxy = 0 AND YET Pxy = Px.Py,
1 means independent x,y [ find Pxy/(0/0) as special case ]
eg: Conv(y:x) = Py.P(~x)/P(y,~x) = [Py - Px.Py]/[Py - Pxy] in general;
= P(~x)/P(~x|y) = [ 1 - Px ]/[ 1 - P(x|y) ]
= Py/P(y|~x)
= 0/0 numerically if Px = 1 whence Py = Pxy hence:
= 1 if Pxy=Px.Py also [ Py - 1.Py]/[ Py - Py ] = 1
!! = 1 - Px if Pxy=0; this is not a nice fixed value, but
1 - Px is interpretable as "semantic information content" SIC
!!! which makes NO SENSE for Pxy=0 :-( , nevertheless:
1 - Px < 1 = Conv(:) for independent x,y, so for Pxy=0 Conv(:) < neutral 1 :-)
!!! Similarly P(~y|~x) = 1-Py if Pxy = Px.Py makes NO SENSE if
    P(~y|~x) is read as ~y NecFor ~x ie x NecFor y .
To avoid overflows due to /0, such extreme/degenerated/special cases of P's
must be numerically detected at run time and handled apart according to the
meaningful interpretation (or conventions) as just shown.
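The run-time detection of such cases can be sketched in Python (an
illustrative sketch, not Brin's or the author's code; the returned
message strings are my own convention for the "automated reporting"
idea below):

```python
def conv_yx(px, py, pxy):
    """Brin's conviction Conv(y:x) = Py.P(~x)/P(y,~x), with the
    degenerate cases detected at run time and handled apart."""
    if px == 1.0:              # 0/0 numerically; here Pxy = Px.Py anyway
        return 1.0, 'Px=1: 0/0 handled as independence'
    if pxy == py:              # P(x|y)=1 ie y-->x , denominator = 0
        return float('inf'), 'P(x|y)=1: y fully implies x'
    return py * (1.0 - px) / (py - pxy), ''
```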
!! Since any single formula is doomed to measure a mix of at least 2 key prop-
erties ( dependence and implication mixed due to my INDEPendentIMPlication
PARADOX ), it is a good idea to detect & report important extreme/special
cases which do not always obviously follow from the values returned. Such
! automated reporting adds semantics and avoids misreading/misinterpretation.
P7: Although it is useful to consider the values returned by measures under
extreme circumstances like eg Px=1 or Py=1, these will not occur often, and
should be detected apart anyway. It is more important to choose a measure
which will return reasonable values for the application at hand. There
cannot be a single universally best measure, but some like NNT and RDS are
universally more useful than others.
So far for my 7 construction principles. More analysis follows:
RR(y:x) is compared with few related measures like eg:
W(y:x) = weight of evidence by I.J. Good (Turing's statistical assistant);
F(y:x) = degree of factual support by John Kemeny ( Einstein's assistant);
C(y:x) = corroboration by Karl Popper (he often called it confirmation, an
overloaded term, so Popper corroborates here to be findable);
it is funny that Sir Popper who stressed refutation has worked
out measures of confirmation, but not of refutation :-) Why ?
Conv(y:x) = conviction measure by Google's co-founder et al { Brin 1997 }.
Such comparisons increase our insights. How well these formulas measure
causal tendency is also discussed. All this & much more was/is implemented
in my KnowledgeXplorer program KX which not only infers & indicates (ie
identifies, diagnoses, predicts, etc) but also extracts knowledge (on both
event- & variable level of interest) from the information carried by data
input in the simple format. KX has graphical and numerical outputs in compact,
comparative, hence effective forms (eg my squashed Venn diagrams).
.- +Key elements of probabilistic logic and of simple measures of causation
K0: Check-as-check-can instead of catch-as-catch-can :
My probabilistic logic formulas fun(Px, Py, Pxy) can be checked by
evaluating fun(.) for all 4 pairs of values Px,Py = 0,1 together with the
! proper value of Pxy=0,1. The resulting fun(.)=0,1 must be equal to the
corresponding logical result value 0,1 .
K1: Simplification rules, duality of rules, DeMorgan rules :
Dual rules have 'or' replaced by '&' and v.v., '0' ie Mini by '1' ie Maxi
and v.v. My 'by symmetry' isn't 'duals'.
x & x = x = x or x idempotence (duals); ~(~x) = x involution
x & ~x = 0 ; x or ~x = 1 (duals) tertium non datur
(x & y) or (x & ~y) = x = (x or y) & ( x or ~y) = adjacency rules (duals)
(x & y) or x = x = (x or y) & x (duals) = absorption rules, also:
y or (x & ~y) = x or y = x or (~x & y) by symmetry of (x or y), has
dual: y & (x or~y) = x & y = x & (~x or y) by symmetry of (x & y),
hence:
~x or (x & y) = ~x or y = ~(x & ~y) = x <== y = x-->y has
dual: ~x & (x or y) = ~x & y = ~(x or ~y) = y AndNot x = y Unless x
De Morgan's rules for probabilistic logic:
DeMorgan's law for P is this =
P( x Nand y) = P(~x or ~y) = P(~( x & y)) = 1-Pxy ; its dual rule is:
P( x Nor y) = P(~x & ~y) = P(~( x or y)) = 1-(Px +Py -Pxy) = P(~x,~y)
eg:
P( x And y) = P( x & y) = P(~(~x or~y)) = 1-( 1-Pxy) = Pxy
P( x --> y) = P(~x or y) = P(~( x & ~y)) = 1-(Px-Pxy) =
= P(~y --> ~x) = P( y or ~x) = P(~(~y & x)) = 1-(Px-Pxy) q.e.d.
by symmetry = DeMorganish =
P( x == y) = P(~x == ~y) = P(~(x Xor y)) , == is Equivalence
P( x Xor y) = P(~x Xor ~y) = P(~(x == y)) , Xor is NonEquivalence
Numerical formulas for Boolean logic exist; my translation rules are :
1st: replace x.y by Pxy (not by Px.Py unless independent x, y)
2nd: replace x by Px, y by Py, x^2 by Px, y^2 by Py
(x --> y) = 1 - x + x.y where x,y are 0,1 (hence x^2 = x , y^2 = y)
P(x --> y) = 1 -Px + Pxy = 1 -(Px -Pxy) = P(~(x,~y))
(x Xor y) = x^2 -2.x.y + y^2
P(x Xor y) = Px -2.Pxy + Py
(x Nor y) = 1 - x - y + x.y = (1-x)(1-y) = none of both = Peirce function
P(x Nor y) = 1 -Px -Py + Pxy = 1 -(Px+Py-Pxy) = P(~(x or y))
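The translation rules just applied can be sketched in Python and
checked by the K0 rule check-as-check-can, ie by evaluating at all
Boolean corners (an illustrative sketch; function names are mine):

```python
# Translation rules applied: replace x.y by Pxy, then x by Px, y by Py.
def p_imp(px, py, pxy): return 1.0 - px + pxy        # P(x --> y)
def p_xor(px, py, pxy): return px - 2.0 * pxy + py   # P(x Xor y)
def p_nor(px, py, pxy): return 1.0 - px - py + pxy   # P(x Nor y)
```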
Estimates of P(vector) if no P(tuple)'s are available, easily obtain from
P(And_i:[x_i] ) = product[ P(x_i) ] for as-if independent x_i's, and from
P( Or_i:[x_i] ) = P(~(And_i:[~x_i])) ie DeMorgan's rule. Then
P( Or_i:[x_i] ) = 1 - product[ 1-P(x_i) ] = P(at least one x_i occurs) =
= union, a basis for noisy OR-gate.
1-Pa = T1
1-P(a or b) = T1.(1-Pb) = (1-Pa)(1-Pb) = 1-[ Pa+Pb -PaPb ] = T2
1-P(a or b or c) = T2.(1-Pc) = 1-[ Pa+Pb+Pc -(PaPb+PaPc+PbPc) +PaPbPc ]
which is the Inclusion-Exclusion principle for independent events a, b, c.
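The noisy OR-gate formula above can be sketched and checked against the
3-event inclusion-exclusion expansion (an illustrative Python sketch;
the name p_union_as_if_indep is mine):

```python
from functools import reduce

def p_union_as_if_indep(ps):
    """P(Or_i:[x_i]) = 1 - product[1 - P(x_i)] for as-if independent
    x_i's: DeMorgan's rule applied to the product rule (noisy OR)."""
    return 1.0 - reduce(lambda t, p: t * (1.0 - p), ps, 1.0)
```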
It suggests a progression of tightening inequalities (or improving
approximations = ) ending with an equation:
P( Or[xi] ) <= Sum[Pi]
P( Or[xi] ) >= Sum[Pi] - Sum[P(ij)]
P( Or[xi] ) <= Sum[Pi] - Sum[P(ij)] + Sum[P(ijk)]
...
P( Or[xi] ) = Sum[Pi] - Sum[P(ij)] + Sum[P(ijk)] - ... P(ijk..z)
which is the Inclusion-Exclusion principle (visualized by Venn diagram).
Max[ Px , Py ] <= P(x or y) <= min[ Px + Py, 1 ]
Max[ 0, Px + Py - 1 ] <= Pxy <= min[ Px , Py ] where on lhs
Bonferroni inequality becomes nontrivial only if Px + Py > 1, in which case
Pxy > 0.
My 3 inequalities for checking, and if violated then for trimming of eg
smoothed estimates :
Pxy <= min[ P(y|x) , P(x|y) ]
Max[ 0, (Px + Py - 1)/Py, Pxy ] <= P(x|y) <= min[ Px/Py , 1 ]
Max[ 0, (Px + Py - 1)/Px, Pxy ] <= P(y|x) <= min[ Py/Px , 1 ]
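The checking-and-trimming of a smoothed estimate into these bounds can
be sketched in Python (an illustrative sketch; the name
trim_p_x_given_y is mine):

```python
def trim_p_x_given_y(px, py, pxy, estimate):
    """Trim a smoothed estimate of P(x|y) into the legal range
    max[0, (Px+Py-1)/Py, Pxy] <= P(x|y) <= min[Px/Py, 1]."""
    lob = max(0.0, (px + py - 1.0) / py, pxy)
    upb = min(px / py, 1.0)
    return min(max(estimate, lob), upb)
```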
For the union of events :
Max_i:[Pi] <= P(Or_i:[x_i]) <= min[ 1, Sum_i:[Pi] ] is the simplest;
if we know all Pjk ie P(j,k) then
   Sum_i:[Pi] - Sum_{j<k}:[Pjk] <= P(Or_i:[x_i]) is a tighter lower bound.
E[Px] = Sum[Px.Px] = Sum[Px^2] = expected probability of variable X
= Sum[ n(x).(n(x)-1) ]/[N.(N-1)] = unbiased estimate of E[Px]
= "information energy"
1 - E[Px] = Cont(X) <= 1 - 1/k for a variable X with k values; Cont(X)
is preferred over Shannon's entropy by quantum theoreticians { Zeilinger &
Brukner 1999, 2001 }.
E[Px]/Pxi = [ 1 - Cont(X) ]/Pxi = surprise index for an event
xi within the variable X, as defined by Shannon's co-author { Weaver }.
Variance of an indicator event x (ie binary or Bernoulli event) is:
Var(x) = Cov(x,x) = P(x,x) - Px.Px = Px - (Px)^2 = Px.(1 -Px), since
Cov(x,y) = P(x,y) - Px.Py = covariance of events x,y in general
Px.Py is a fictitious joint probability of as-if independent events x, y;
it serves as an Archimedean point of reference ( a la Arendt )
to measure dependence of x,y either by Cov(x,y) = Pxy - Px.Py or
! by Pxy/(Px.Py), (find as-if ). If Px=1 or Py=1 then Pxy = Px.Py !
P(x,y) == Pxy == P(x&y) is the joint probability of x&y ; Pxy measures
co-occurrence ie compatibility of x and y. Until the early 1960s
P(x,y) used to denote P(x|y) in the writings of Hempel, Kemeny, Popper,
Rescher and Bar-Hillel, who used P(x.y) for the modern P(x,y) = my Pxy,
while others used P(xy) for my Pxy.
! P(x,y) + P(x or y) = Px + Py follows from isomorphy with the Set theory, ie
! P(x or y) = Px + Py - Pxy, and DeMorgan's rule says:
! P(~(x or y))= P(~x,~y) = 1-P(x or y)
Empirical and observed proportions should be smoothed to :
0 < Pxy < minimum[ Px, Py ] ie an empirical P(x,y) should be less than
its smallest marginal P. Low counts n(x,y) >= 1 are much improved by
this estimate :
P(x,y) =. [n(x,y) -1/2]/N , and P(y|x) =. [ n(x,y) - 0.5 ]/n(x) which is
close to [ n(.) -1/2]/[N-1] = maximum posterior aka mode aka MAP for
binomial pdf with Jeffreys prior Beta(1/2,1/2). It is vastly superior to
Laplace's succession rule (Quiz: why? :-)
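The contrast between this low-count estimate and Laplace's rule can be
sketched in Python (an illustrative sketch; function names are mine):

```python
def p_map_jeffreys(n_xy, n_total):
    """The low-count estimate [n(x,y) - 1/2]/N, close to the MAP of a
    binomial with the Jeffreys prior Beta(1/2,1/2)."""
    return (n_xy - 0.5) / n_total

def p_laplace(n_xy, n_total):
    """Laplace's succession rule (n+1)/(N+2), for comparison."""
    return (n_xy + 1.0) / (n_total + 2.0)
```

Note how Laplace pulls a low count much harder toward 1/2 than the
Jeffreys-based estimate does.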
P(x|y) = Pxy/Py defines conditional probability, and Bayes rule follows:
P(x|y).Py = Pxy = Pyx = Px.P(y|x) shows invertibility of conditioning
P(x|y)/Px = P(y|x)/Py = Pxy/(Px.Py) is my favorite form of basic Bayes as:
P(x|y) ? Px == P(y|x) ? Py , where the ? is < , = , > ; and also
P(x|y) ? P(x|~y) == P(y|x) ? P(y|~x) where the ? is applied consistently.
P(x|y)/P(y|x) = Px/Py is Milo Schield's form of basic Bayes
P(x|y) = Px.P(y|x)/Py is the basic Bayes rule of inversion,
where Px = "base rate"; IGNORING Px is people's "base rate fallacy".
Odds form of Bayes rule :
Odds(y|x) = Odds(y).LR(x:y) { Odds local or individual, LR from community }
= P(y|x)/P(~y|x) = P(y|x)/(1 -P(y|x)) = (Py/(1-Py)).P(x|y)/P(x|~y)
= P(y,x)/P(~y,x)
= n(y,x)/n(~y,x) = n(y,x)/[ n(x) - n(y,x) ] would be a straight, but
misleading estimate in medicine (find Bailey ).
P(y|x) = Odds(y|x)/(1 + Odds(y|x)) = 1/(1/Odds(y|x) + 1)
= 1/( 1 + n(x,~y)/n(x,y) )
= n(x,y)/[ n(x,~y) + n(x,y) ] = n(x,y)/n(x) = P(y|x) q.e.d.
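The odds <--> probability round trip just derived can be sketched in
Python (an illustrative sketch; function names are mine):

```python
def odds_y_given_x(n_xy, n_x):
    """Odds(y|x) = n(y,x)/[n(x) - n(y,x)], the straight estimate
    (possibly misleading in medicine, as cautioned above)."""
    return n_xy / (n_x - n_xy)

def p_from_odds(odds):
    """P(y|x) = Odds(y|x)/(1 + Odds(y|x)), the inverse conversion."""
    return odds / (1.0 + odds)
```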
-log(Bayes rule) :
-log( P(x|y) ) = -log(Pxy/Py) =
-log(Px.P(y|x)/Py) = -log(Px) - log(P(y|x)) + log(Py) is the -log(Bayes)
Note that for only comparative purposes between hypotheses x_j we may ignore
Py (but NEVER IGNORE the base rate Px !) since Py is (quasi)constant for all
x_j's compared: the shortest code for max P(x_j, y) wins. This holds for
log-less Bayesian decision-making too: the maximal Pxy is the winner. This is
Occam's razor opeRationalized, as it has the minimal coding interpretation:
x = unobserved/able input of a communication channel, or
unobservable hypothesis/conjecture/cause/MODEL to be inferred/induced;
y = observed/able output of a communication channel, or
available test result/evidence/outcome/DATA.
According to { Shannon, 1949, Part 9, 60 } and provable by Kraft's inequality,
the average length of an efficient ie shortest & still uniquely decodable code
for a symbol or message z is -log(P(z)) in bits if the base of log(.) is 2.
Hence an interpretation of our -logarithmicized Bayes rule { Computer Journal,
1999, no.4 = special issue on MML MDL } is Occam's razor opeRationalized as:
MML = minimum message length (by Chris Wallace & Boulton, 1968)
MDL = minimum description length (by Jorma Rissanen, IBM, 1977)
MLE = minimum length encoding (by Edwin P.D. Pednault, Bell Labs 1988)
This theme is very very close to Kolmogorov complexity, originated in the US
by Ray Solomonoff in 1960, and by Greg Chaitin in 1968, and it was designed
into Morse code, and by Zipf's law evolved in plain language, eg:
4-letter words are so short because they are used so often. In Dutch we
use 3-letter words either because we use them more frequently, or
because we are more efficient than the Anglos & Amis :-)
Hence the total cost ie length of encoding is the sum of the cost of coding
the model x_j , plus the cost ie code size of coding the data y given
that particular model x_j. Stated more concisely :
cost or complexity = -log(likelihood) + penalty for the model's complexity
where I don't mean any models on a catwalk :-) The pop version of Occam's
"Nunquam ponenda est pluralitas sine necessitate" is the famous KISS-rule:
"Keep it simple, student!" :-) Simplicity should be preferred over complexity,
subject to the "ceteris paribus" rule . Einstein used to say:
"Everything should be made as simple as possible, but not simpler". I say:
"Keep it simple, but not simplistic."
.- The MOST SIMPLISTIC, NAIVE measures of causal tendency :
P(y|x) = Pxy/Px = Sufficiency of x for y { Schield 2002, Appendix }
= Necessity of y for x { follows from the next line: }
P(x|y) = Pxy/Py = Necessity of x for y { Schield 2002, Appendix }
= Sufficiency of y for x { follows from above }
CAUTION :
Let x = a disease, y = 10 fingers: P(y|x) = 1 in a large subpopulation
but it would be a NONSENSE to say that x suffices for y , or that y is
necessary for x { example by Jan Kahre }; find IndependentImplication .
!! My analysis: P(y|x) = Pxy/Px is not a DECreasing function of Py, hence any
P(y) =. 1 ie too FREQUENT y will REFUTE P(y|x) as a measure (find SIC ).
!!! Much more complicated REFUTATIONS of all single P(.|.)'s or P(.)'s as
measures of confirmation or corroboration are in { Popper 1972, Appendix
IX, 390-2, 397-8 (4.2) etc, and 270 }. P(.|.)'s should be viewed as
NAIVE, CRUDE, MOST SIMPLISTIC measures : rel. = relatively wrt base
P(x|y) = Pxy/Py = a measure of (y implies x) ie rel. how many y are x
= a measure of (x SuperSet y) ie rel. how many y in x ;
P(y|x) = Pxy/Px = a measure of (x implies y) ie rel. how many x are y
= a measure of (y SuperSet x) ie rel. how many x in y ;
find archer-) in Venn
P(y|x).P(x|y) = a measure of (x SufFor y) & (x NecFor y)
= a measure of (y NecFor x) & (y SufFor x)
= (Pxy^2)/(Px.Py) symmetry makes it worthless as a measure
of causal tendency.
Pxy/(Px.Py) has range [0..1..oo) and measures dependence :
oo unbounded POSitive dependence of x, y
1 if x, y are independent
0 bounds NEGative dependence of x, y
0 if x, y are disjoint; do not confuse disjoint with independent !
A fresh alternative look at old stuff ( Px.Py is as-if independence ) :
Pxy/(Px.Py) = (Pxy/Px).(1/Py) =
= ( x --> y).surpriseby(y)
= (Sufficiency of x for y).surpriseby(y)
= ( Necessity of y for x).surpriseby(y)
= (Pxy/Py).(1/Px) = ( y --> x).surpriseby(x)
= (Sufficiency of y for x).surpriseby(x)
= ( Necessity of x for y).surpriseby(x)
= [0..1]*[1..oo) = [0..1..oo) is the range; 1 if independent
= fun(Px, Py) symmetrical wrt x,y which may be good for coding but is poor
for directed, oriented eg causal inferencing, hence I created :
!!
(Pxy/Px).(1-Py) = P(y|x).(1-Py) = (x-->y).LinearSurpriseBy(y)
(Pxy/Py).(1-Px) = P(x|y).(1-Px) = (y-->x).LinearSurpriseBy(x)
= [0..1].[0..1] = [0..1] is very reasonable and it
! = fun(Px, Py) Asymmetrical wrt x, y hence captures causal tendency better.
I created these new measures because trivial ie unsurprising implications
are of little interest for data miners, doctors, engineers, investors,
researchers, scientists. The next formulas would overemphasize importance of
surprise, because Pxy/Px has range [0..1], while (1-Py)/Py has [0..oo) :
!
(Pxy/Px).(1-Py)/Py = P(y|x).(1-Py)/Py = (x-->y).Surpriseby(y)
= [0..1]*[0..oo) = [0..oo) { big range }
(Pxy/Py).(1-Px)/Px = P(x|y).(1-Px)/Px = (y-->x).Surpriseby(x)
= [0..1]*[0..oo) = [0..oo)
After this synthesis we should not be surprised that the last formulas
are a substantial part of a risk ratio aka relative risk:
RR(y:x) = P(y|x)/P(y|~x) is 0 for disjoint x,y ; is 1 for independent ;
= (Pxy/(Py - Pxy)).(1-Px)/Px = (y implies x) . SurpriseBy(x)
= [0..oo)*[0..oo) = [0..oo)
note that :
+ both factors have the same range [0..oo) hence none of them dominates
structurally ie in general;
+ in both factors both numerator and denominator are working in the same
direction for increasing the product of Implies * Surprise;
+ there is no counter-working within each and among factors.
! P(y|x) > P(y|~x) == P(x|y) > P(x|~y) == Pxy > Px.Py (derive it) which
is symmetrical ie directionless ie not oriented;
the equivalence holds for the < <> = >= <= as well, the = is in all
16+1=17 conditions of independence. On human psychological difficulties in
dealing with such causal/diagnostic tasks see { Tversky & Kahneman 122-3 }.
cov(x,y) = Pxy - Px.Py = covariance of events x, y (binary aka indicator)
var(x) = Pxx - Px.Px = Px.(1 - Px) = variance of an event x (autocov )
corr(x,y) = cov(x,y)/sqrt(var(x).var(y)) = correlation of binary events x,y
>= greater or equal
=> is meaningless in this epaper, although some use it for an implication,
which is MISleading because:
y-->x == (y <== x) == (y subset of x) == (y implies x) == ~(y,~x)
== ~(y AndNot x)
my <== works on Booleans represented as 0, 1 for False, True
evaluated numerically, like the THOUGHTFUL notation :
(y <= x) in Pascal on Booleans means (y implies x).
In our probabilistic logic P(x|y)=1 == (y implies x 100%)
ie (y SufFor x), ie hitting y will hit x ,
!! ie (x NecFor y), ie missing x will miss y (find Venn above).
(y AndNot x) == (y,~x) == ~(y-->x) == ~~(y,~x) == (y UnlessBlockedBy x)
1st meaning: = (y,~x) = (y ButNot x) is a logical DIFFERENCE (see Venn )
= (y -x.y) with 0, 1 values
= Py -Pxy with P's
2nd meaning: the functional relation among x,y,z is interpreted
as z = fun(x,y) thus: (called INHIBITION in circuit design)
"z = y UnlessBlockedBy x=1"; if x=1, then:
input y=1 canNOT 'get through' into the result z = (y Unless x=1=y)
since input x=1 blocks y=1 from passing to the output z. Also:
! The output z=1 if y stays present ie 1 without being blocked by x=1
Note that the meaning of (y AndNot x) is 'y UNLESS y=1 is blocked by x=1',
!! ie the meaning of (y AndNot x) is NOT about 'blocking occurred',
! since that meaning is (y And x). This shows how CAREful we must be
when assigning a meaning to even an intuitively clear operation :
Ins Out
x y z = (y Unless x=1=y) == (x < y) == (y > x) Blocking occurred
0 0 = 0 ; no x, no y means no blocking, hence z=y=0 ; 0
0 1 = 1 ; no x means no blocker, hence z=y=1 ; 0
1 0 = 0 yes x, but no y to block, hence z=y=0 ; 0
1 1 > 0 yes x, yes y can be blocked, hence z= 0 ! 1
IF both y, x are present THEN blocker x blocks y from entering z output
ELSE the output z=y ;
there must be some y to block, and some blocker x=1, both present ie y=1=x ;
!! ie blocking of y is CHANGING y=1 to z=0 if x=1=y , ie x And y !
!! ie a blocker x can REVERSE y=1 to z=0 if x=1=y , ie x And y !, but
(y AndNot x) = y EXCEPT when y=1=x in which case the result z = 0 = ~y,
= 1 if y=1 & x=0 ie x <> 1 , as above: (y,~x)
= x < y = not(x >= y) in Pascal on Booleans (internally 0 , 1)
= y > x = not(y <= x) in Pascal; ~(y <== x) == ~(y-->x) here.
My B(y:x), W(y:x), F(y:x) and C(y:x) have been written as ?(x:y) by ancient
authors like I.J. Good, John Kemeny and Karl Popper, who were inspired by the
Odds-forms, which swap x, y via Bayes rule of inversion. However my notation
(used by I.J. Good since 1992) is much less error prone as it naturally and
mnemonically abbreviates the simplest straight forms like eg:
RR(y:x) = risk ratio = relative risk = B(y:x) = simple Bayes factor
= P(y|x) / P(y|~x)
ARR(x:y) = P(y|x) - P(y|~x) = absolute risk reduction = risk difference
= a/(a+b) - c/(c+d) = (ad - bc)/[(a+b).(c+d)]
= (Pxy -Px.Py)/(Px.(1-Px)) = risk increase (or risk reduction )
= cov(x,y)/var(x) = covariance(x,y)/variance(x)
!! = beta(y:x) = the slope of the probabilistic regression
line Py = beta(y:x).Px + alpha(y:x) for indicator events x, y
ie for binary events aka Bernoulli events; -1 <= beta(:) <= 1
! 0.903 - 0.902 = 0.001 is relatively small, but the same difference:
0.003 - 0.002 = 0.001 is relatively large; absolute differences may be
misleading for some purposes, but for practical treatment effects the
RR(y:x) exaggerates risk more, and more often than ARR(:) and 1/|ARR|'s
like NNT, NNH do.
RRR(y:x) = RR(y:x) - 1 = ARR(x:y)/P(y|~x) = [ P(y|x) - P(y|~x) ]/P(y|~x)
if > 0 = relative risk reduction = excess relative risk = relative effect
! Note that despite ARR(x:y) being a measure of m(x-->y) , the
!   1/P(y|~x) = (1-Px)/[ Py - Pxy ] is the overRuling M(y-->x) , hence :
RRR(y:x) is the proper notation.
F(y:x) = (P(y|x) - P(y|~x))/(P(y|x) + P(y|~x)) = factual support
= (difference/2)/(sum/2)
!! = deviation/arithmetic average my 1st interpretation
!! = slope(of y on x )/(P(y|x) + P(y|~x)) my 2nd interpretation
= beta( y:x )/(P(y|x) + P(y|~x)), -1 <= beta(:) <= 1 ,
= [ cov(x,y)/var(x)]/(P(y|x) + P(y|~x))
= (Pxy -Px.Py)/(Px.(1-Px))/(P(y|x) + P(y|~x))
= rescaled B(y:x) from [0..1..oo) to [-1..0..1] 3rd interpretation
= rescaled W(y:x) from (-oo..0..oo) to [-1..0..1] 4th interpretation
= is a combined (mixed) measure scaled [-1..0..1] of :
- how much y and x are independent, 0 if 100% independent
- how much (y --> x) , yields +1 if 100% implication
= [ ad - bc ]/[ ad + bc + 2ac ]
= CF2(y:x) = [ P(x|y) - P(x) ]/[ Px.(1 - P(x|y)) + P(x|y).(1 - Px) ] is
a certainty factor in MYCIN at Stanford rescaled in { Heckerman 1986 },
which I recognized as F(y:x) via :
= (Pxy - Px.Py)/(Pxy + Px.Py - 2.Px.Pxy) my 5th interpretation
= [ RR(y:x) -1]/[ RR(y:x) +1 ] my 6th interpretation
F0(:) = [ F(:) + 1 ]/2 = F(:) linearly rescaled to [0..1/2..1] :
F0(y:x) = P(y|x)/[ P(y|x) + P(y|~x) ]
RR(:) = [ 1 + F(:) ]/[ 1 - F(:) ] is a conversion .
RR(y:x) and its functions F(y:x), F0(y:x) are co-monotonic, as they measure
- how much the event y implies the x event. This is the directed ie
oriented ie asymmetric component of these measures;
- how much x, y are stochastically dependent ie covariate ie associate.
This is the symmetrical aspect or an association.
No contortion is needed to have events x, y which are almost independent,
if we measure independence by Pxy/(Px.Py) or by (Pxy -Px.Py)/(Pxy +Px.Py)
or by (Pxy - Px.Py)/min(Pxy, Px.Py), and at the same time one event will
strongly imply the other event. But this PARADOX depends on the sensitivity
(wrt the deviations from exact independence) of the measure. Hence our
choice of a single measure should depend on our preference for what the
measure should stress: an implication, or (a deviation from) independence.
E.g. K. Popper's corroboration C(:) stresses dependence over implication,
while Kemeny's factual support F(:) stresses implication over dependence,
but neither of those authors says so, nor has anybody noticed it so far.
Of course, we could always use two measures, one for an implication, and
the other for a deviation from independence, but the Holy Grail is a
single formula, which will inevitably combine ie mix these two aspects,
because they are arbitrarily mixable.
However impossible it may be to find the Excalibur formula for causation,
I believe it to be possible to identify formulas which come closer to the
Holy Grail than other formulas. I consider the notions of stochastic
DEPENDENCE together with probabilistic IMPLICATION (or my AndNot ) and
SURPRISE as the key building blocks because they are well defined, though
not understood enough by too many folks.
-.- +Interpreting 2x2 contingency tables wrt RR(:) = relative risk=risk ratio
a , b the counts a, d are hits, and b, c are misses
c , d ie a, d concord, and b, c discord
and a+b+c+d = n = the total count of events.
!! It is useful to view such a table as a Venn diagram formed by two
rectangles, one horizontal and one vertical, with partial overlap
measured by n(x,y) = the joint count ie co-occurrence of x and y :
______________________________
| | |
| a = n( x,y) | b = n( x,~y) | n( x) = a+b
| | |
|-------------+--------------. is a 2x2 Venn diagram
| | .
| c = n(~x,y) | d = n(~x,~y) . n(~x) = c+d
|_____________|...............
a+c = n(y) b+d = n(~y) N = a+b+c+d
but nothing prevents you from viewing an overlap in any of the 4 corners.
Feel free to rotate or to transpose this standard table at your own peril.
Typical semantics (one quadruple per line) may be, eg:
x ~x y ~y
test+ says K.O. test- says ok disorder not this disorder
exposed unexposed illness not this illness
risk factor present risk fact.absent outcome present outcome absent
! treatment control non-case case
alleged cause cause absent effect not this effect
symptom present symptom absent possible cause not this cause
conjecture,hypothesis evidence observed
so be careful with assigning your own semantics ! We can avoid mistakes
if we stick here to the first four interpretations just listed.
The 2x2 probabilistic contingency table summarizes the dichotomies :
| y ~y | marginal sums
-----|------------------------------------+-------------------------
x | a/n = P( x,y) , b/n = P( x,~y) | P( x) = (a + b)/n
 ~x  | c/n = P(~x,y)  ,  d/n = P(~x,~y)   |  P(~x) = (c + d)/n = f/n
-----+------------------------------------|-------------------------
Sums |       P(y)            P(~y)        |  1 = P( x,y) +P( x,~y)
     |     = (a+c)/n       = (b+d)/n =i/n |    +P(~x,y) +P(~x,~y)
In my squashed Venn diagram in 1D-land, the joint occurrences of (x&y)
ie (x,y) ie "a" are marked by ||| = a/N = Pxy :
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
ffffffffffffffxxxxxxxxxxxxxxxxxxxxxxxxxxffffffffffffffffffffff
----------------------aaaaaaaaaaaaaaaaaa----------------------
iiiiiiiiiiiiiiiiiiiiiiyyyyyyyyyyyyyyyyyyyyyyyyyyyyiiiiiiiiiiii
11111111111111111 A limited 1-verse of discourse 1111111111111
---- 1-Px ----xxxxxxxxxxxxxx Px xxxxxxxx------------- 1-Px ---
---- 1-Pxy -----------|||||| Pxy |||||||------------- 1-Pxy --
---- 1-Py ------------yyyyyy Py yyyyyyyyyyyyyyyyy--- 1-Py ---
From the 4 counts ( a+b+c+d = n ) we easily obtain all P(.)'s.
From the 3 proportions or probabilities Px, Py and Pxy we can obtain
any other P(.,.) and P(.|.) containing any mix of (non)negations,
but without raw counts we cannot compute eg confidence interval CI.
The legality of given or generated P's can be checked by the above explained
Bonferroni inequality.
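The bookkeeping just described can be sketched in Python (the function
names are mine, for illustration only; this is not the author's program
Acaus3): from the 4 counts we get Px, Py, Pxy, and from those 3 numbers
every other P(.,.) and P(.|.), plus the Bonferroni-style legality check.

```python
# Illustrative sketch (function names are mine, not from this text):
# from the 4 counts of a 2x2 table get Px, Py, Pxy, and from those 3
# probabilities derive the remaining joints and conditionals.

def table_to_probs(a, b, c, d):
    """x labels the rows, y the columns; a = n(x,y) is the overlap."""
    n = a + b + c + d
    return (a + b) / n, (a + c) / n, a / n     # Px, Py, Pxy

def derived_probs(Px, Py, Pxy):
    """Every other P(.) follows from the 3 defining probabilities."""
    return {
        'P(x,~y)' : Px - Pxy,                  # = b/n
        'P(~x,y)' : Py - Pxy,                  # = c/n
        'P(~x,~y)': 1 - (Px + Py - Pxy),       # = d/n, via DeMorgan
        'P(y|x)'  : Pxy / Px,
        'P(x|y)'  : Pxy / Py,
    }

def legal(Px, Py, Pxy):
    """Bonferroni-style legality check of a given or generated triple."""
    return max(0.0, Px + Py - 1) <= Pxy <= min(Px, Py)

Px, Py, Pxy = table_to_probs(30, 10, 20, 40)   # n = 100
D = derived_probs(Px, Py, Pxy)
```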
.- The subset-based measures of implication aka entailment:
0 <= Px - Pxy <= 1 measures how little the event x-->y, hence
1/(Px - Pxy) measures how much the event x-->y
with MAXImum = oo for Pxy = Px ;
0 <= P(y|x) <= 1 measures how much the event x-->y
with MAXImum = 1 for Pxy = Px ;
!! recall that (x implies y) == (~y implies ~x) in logic;
its probabilistic counterpart is 1 - P(x,~y) = 1 - (Px - Pxy)
P( x| y) = a/(a+c) = sensitivity ( Yerushalmy 1947 ) = true positive ratio
P(~x|~y) = d/(d+b) = specificity ( Yerushalmy 1947 ) = true negative ratio
P( x|~y) = b/(b+d) = 1-specificity = false alarm rate = false positive ratio
P( y| x) = a/(a+b) = positive predictivity
P(~y|~x) = d/(d+c) = negative predictivity
LR+ = positive likelihood ratio = LR(x:y)
= P(x|y)/P(x|~y) = P(x|y)/(1 - P(~x|~y))
= (a/(a+c))/(b/(b+d)) = sensitivity/(1 - specificity)
LR- = negative likelihood ratio
= P(~x|y)/P(~x|~y)
= (c/(c+a))/(d/(d+b)) = (1 - sensitivity)/specificity
RR(y:x) = relative risk = risk ratio = LR(y:x)
= P(y|x)/ P(y|~x) = (a/(a+b))/(c/(c+d))
= (Pxy/Px)/((Py - Pxy)/(1-Px)) which I rearranged into:
!!! = Pxy.[1/(Py - Pxy)] .(1-Px)/Px
!!! = [Pxy / P(y,~x) ] .(1-Px)/Px which I interpret as follows:
RR(y:x)
+ INCreases with Pxy in general (due to numerator and denominator);
+ INCreases with Pxy approaching Py in particular, eg when (y implies x)
fully then RR(y:x) = oo ie infinity;
! - DECreases with INCreasing Px in (1 - Px)/Px which meaningfully measures
our LOW/high surprise when a FREQUENT/rare event x occurs (more below).
Note that P(x|y) = Pxy/Py as a measure of how much is the occurrence of
y Sufficient for x, hence how much is the occurrence of
x Necessary for y, is not an explicit fun( Px ), hence
!! P(x|y) cannot discount the lack of surprise like RR(y:x) does.
!!!
RR(y:x) = CoOccur(x,y) .(y-->x) .SurpriseBy(x), range [0..oo) surprise too
or: = ( Pxy/P(y,~x) ).SurpriseBy(x)
!!! ie: = LikelyThanNot(y: x) .SurpriseBy(x)
where LikelyThanNot(y: x) is visualized by my squashed Venn diagram
(imagine two overlapping pizzas or pancakes x and y, and view them from
aside) :
xxxxxxxxxxxxx--- length of xx..x = Px
yyyyyyyyyy length of yy..y = Py
length of --- = P(~x,y) = Py - Pxy
It is meaningful to contrast the overlap Pxy against the underlap
P(~x,y), eg to contrast them relatively as their ratio :
Pxy/P(~x,y) >=< 1 where >=< stands for >, =, <, >=, <=, <>
ie P(x,y) >=< P(~x,y)
ie P(x|y) >=< P(~x|y) , so that eg for the > we say that :
x occurs More Likely Than Not if y occurred, or we say equivalently :
x occurs More Likely Than Not with y , which both capture our thinking.
+ DE/INcreases with IN/DEcreasing Px; this is meaningful, because our
!!! surprise value of x DIScounts the "triviality effect" of Px =. 1 :
!! if Px =. 1 then Pxy = Py too easily occurs, and RR(y:x) = 1/0 = oo.
!! If Py = 1 then Pxy = Px and P(y,~x)=P(~x) hence RR(y:x) = 1/1 = 1,
indeed, if all are ill, there can be no risk of becoming ill.
Surprise value of x DE/INcreases with IN/DEcreasing Px in general;
(1 - Px), 1/Px, hence also my (1 - Px)/Px measures our surprise by x.
!! My new measure P(x|y).(1 - Px) = (y implies x).LinearSurpriseBy(x) =
= [0..1]*[0..1] = [0..1]
is simpler, but carries less meanings than RR(y:x).
! + is DOMINAted by the factor 1/(Py - Pxy) for a given exposure Px ;
this factor measures how much (y-->x). From this and from
Pxy <= min(Px, Py), but not from "SurpriseBy", follows :
!!! if Py < Px then RR(y:x) >= RR(x:y) ie LR(x:y),
!!! if Py > Px then RR(y:x) <= RR(x:y) ie LR(x:y), where the = may occur
for x,y independent ie RR(:)=1,
or if Pxy=0=RR(:), as my program Acaus3 asserts.
That "SurpriseBy" is not decisive wrt RR(y:x) >=< RR(x:y), follows
from the comparison of:
(y-->x) = 1/(Py - Pxy) vs (1-Px)/Px = SurpriseBy(x)
ie (1 - 0)/(Py - Pxy) vs (1-Px)/(Px - 0).
Let's write Px = k.Py to reduce RR(:) to just 2 variables Pxy, Py,
and let's compare RR(y:x) with RR(x:y) ie LR(x:y) :
(Pxy/(Py - Pxy)).(1-Px)/Px >=< (Pxy/(Px - Pxy)).(1-Py)/Py
ie:
(k.Py - Pxy)/(Py - Pxy) >=< k.(1-Py)/(1 -k.Py) = Dy in shorthand
!! Pxy >=< Py.( Dy - k )/(Dy - 1) =
Py.[ (1 - Py)/(1 - k.Py) - 1]/[(1 - Py)/(1 - k.Py) - 1/k]
Checked:
Solving for k the RR(y:x) = RR(x:y), where Px = k.Py , yields a
quadratic equation with two distinct real roots k1, k2 :
k1 = 1 ie Px = Py which obviously is correct
k2 = Pxy/(Py.Py) ie Py.Px = Pxy which holds for independent x,y .
+ is oo ie infinite for Py - Pxy = 0 ie P(~x,y) = 0 ie Pxy = Py
in which case y implies x fully, because then y is a SubSet of x,
ie whenever y occurs, x occurs too ; draw a Venn diagram.
+ is ASYMMETRICAL ie directed ie oriented wrt the events x, y (this unlike
correlation coefficients and other symmetrical association measures)
. is a relative measure, a ratio (while differences are absolute measures
which may mislead us since eg 0.93 - 0.92 = 0.03 - 0.02)
- is a combined measure which inseparably MIXes measuring of two key
properties:
- stochastic dependence, which is a symmetrical property, and
- probabilistic implication, which is an Asymmetrical property, which
both I see as necessary conditions for a possible CAUSAL relationship
between x and y. Hence RR(:) INDICATES potential CAUSAL TENDENCY;
+ has range [0..1..oo) with 3 opeRationally interpretable fixed points:
RR(y:x)
= oo if Pxy = Py ie (y-->x), ie possibly (x causes y) ;
= 1 if y and x are fully independent ie if Pxy = Px.Py
= 0 if (Pxy = 0) and (0 < Px < 1) ie disjoint events x,y
ie RR(:) = 0 means disjoint ie mutually exclusive events x,y
0 < RR(:) < 1 means negative dependence or correlation of x,y
1 < RR(:) <= oo means positive dependence or correlation of x,y
!! ie RR(:) has a huge unbounded range for positively dependent x,y vs
RR(:) has a small bounded range for negatively dependent x,y ,
hence both subranges are not comparable; the positive subrange is
!! much more SENSITIVE than the negative subrange. In this respect
!! F(:) is BALANCED but has no simple interpretation of risk ratio.
= 0/0 if (Px = 0 or Py = 0 hence Pxy = 0 too).
= 1 if (Py = 1 hence Pxy = Px, P(x|y) = Px/1 ie independent x,y)
then RR(y:x) = (1-Px)/(1-Px) = 1 ie independence.
!! = 1 if (Px = 1 hence Pxy = Py, P(y|x) = Py/1 ie independent x,y)
then RR(y:x) = Pxy/(0/0) = Py/(0/0) numerically, which may
!! seem to be undetermined, but as just shown, Px = 1 means that P(y|x)
does not depend on Px, ie that x,y are independent (find 0/0 ).
RR(y:x) = P(y|x)/P(y|~x) where in many (not all) medical applications
y is a health disorder, and x is a symptom. But both { Lusted 1968 }
and { Bailey 1965, 109, quoted: } noted that :
"P(y|x) will vary with circumstances (social, time, location), however
!! P(x|y) will have often a constant value because symptoms are a
function of a disease processes themselves, and therefore
relatively INdependent of other external circumstances. ...
so we could collect P(x|y) on a national scale, and collect Py on
a [ local/individual ] space-time scale.". The [loc/indiv] is mine.
Therefore we should compute RR(y:x) indirectly via Bayes rule ie via
P(y|x) = Py.P(x|y)/Px where P(x|y) is "global" and more stable.
+ RR(y:x) has an important advantage over its co-monotonic but nonlinear
transform F(y:x). The simple proportionality of RR(y:x) can be used to
(dis)prove confounding. Good explanations of confounding are rare, the
best introduction is in { Schield 1999 } where on p.3 we shall recognize
Cornfield's condition:
P(c|a)/P(c|~a) > P(e|a)/P(e|~a) as RR(c:a) > RR(e:a), and
R.A. Fisher's:
P(a|c)/P(a|~c) > P(a|e)/P(a|~e) as RR(a:c) > RR(a:e).
Be reminded that "contrary to the prevailing pattern of judgment",
as { Tversky & Kahneman 1982, 123 } point out, it holds, in my more
general formulation :
[ P(y|x) >=< P(y|~x) ] == [ P(x|y) >=< P(x|~y) ] , hence also:
[ RR(y:x) >=< 1 ] == [ RR(x:y) >=< 1 ]
where >=< stands for a consistently used >, =, <, >=, <=, <> .
+ For 3 more properties see { Schield 2002, p.4, Conclusions }.
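As a sanity check of the rearrangement above, a small Python sketch
(my code, not the author's Acaus3) computes RR(y:x) both as the
defining ratio and as the factored form, plus the fixed point at
independence and the Py < Px ordering claim:

```python
# Sketch: RR(y:x) via its definition P(y|x)/P(y|~x) and via the
# rearrangement [Pxy/P(y,~x)].(1-Px)/Px must agree; RR = 1 at
# independence Pxy = Px.Py. Names are mine, for illustration.

def rr_def(Px, Py, Pxy):
    return (Pxy / Px) / ((Py - Pxy) / (1 - Px))

def rr_factored(Px, Py, Pxy):
    return (Pxy / (Py - Pxy)) * (1 - Px) / Px

Px, Py, Pxy = 0.4, 0.5, 0.3
r1, r2 = rr_def(Px, Py, Pxy), rr_factored(Px, Py, Pxy)
```

Note that RR(x:y) is just rr_def with the roles of Px and Py swapped,
so for Py < Px the claim RR(y:x) >= RR(x:y) can be checked numerically.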
.- +More on probabilities
Keep in mind that it always holds:
P(.) + P(~.) = 1 eg P(y|x) + P(~y|x) = 1 ; hence also:
P(x or y) + P(~(x or y)) = 1 from which via DeMorgan's rule follows:
P(x or y) + P(~x,~y) = 1
P(x or y) + Pxy = Px + Py see the overlap of 2 Pxy in a Venn diagram
hence P(~x,~y) = 1 -(Px + Py - Pxy)
P(~x,~y) = P(~(x or y)) by DeMorgan's rule; he died in 1871; "his" rule
has been clearly described by Ockham aka Occam
aka Dr. Invincibilis in Summa Logicae in 1323 !
= 1 - P(x or y) = 1 - (Px + Py - Pxy) and, surprise :
! Pxy - Px.Py = Pxy.P(~x,~y) - P(x,~y).P(~x,y) from 2x2 table's diagonals
( with / the rhs would be odds ratio OR(:), find below )
= Pxy.(1 -Px -Py +Pxy) - (Px -Pxy).(Py -Pxy)
= cov(x,y) = covariance of 2 "as-if random" events x,y , or
indicator events aka binary/Bernoulli events, from which follows for
independent events only :
if Pxy - Px.Py = 0 ie cov(x,y) = 0 ie
if Pxy = Px.Py (this is equivalent to 16 other equalities)
! then Pxy.P(~x,~y) = P(x,~y).P(~x,y) ie products on 2x2 table's diagonals
are equal; this I call the 17th condition of independence (find 17th ),
which is equivalent (==) to any of the other 4 + 3*(8/2) = 16 equivalent
conditions of independence, like eg:
( Pxy = Px.Py ) == ( P(x|y) = Px ) == ( P(y|x) = Py ) ==
( P(y|x) = P(y|~x) ) == ( P(~y|~x) = P(~y|x) ) ==
( P(x|y) = P(x|~y) ) == ( P(~x|~y) = P(~x|y) ) == etc
Only for independent x,y it holds, via Occam-DeMorgan's rule :
P(~(~x,~y)) = 1 - (1 -Px).(1 -Py) = Px + Py - Px.Py = P(x or y) for indep.
More of the equivalent conditions of independence are obtained by changing
x into ~x, and/or y into ~y, or vice versa. Any consistent mix of such
changes will produce an equivalent condition of independence for events,
negated or not, simply because "A nonevent is an event is an event".
Changing the = into < or > in any of the 17 conditions of independence will
create corresponding equivalent conditions of dependence which obviously are
necessary but far from sufficient conditions for a causal relation between
2 events x, y. For example :
( Pxy > Px.Py ) == ( P(y|x) > Py ) == ( P(x|y) > Px )
!! == ( P(y|x) > P(y|~x) ) == ( P(x|y) > P(x|~y) ) == etc.
From all these 17 inequalities of the generic form lhs > rhs we can obtain
some 6*17= 102 measures of DEPENDENCE simply by COMPARING or CONTRASTING :
Da = lhs - rhs are ABSOLUTE DEPENDENCE measures, eg P(e|h) - P(e|~h)
Da is scaled [0..1] for lhs > rhs , or [-1..1] in general,
with 0 if x,y are fully independent
Dr = lhs / rhs are RELATIVE DEPENDENCE measures, eg P(e|h) / P(e|~h)
Dr is scaled [0..1..oo) with 1 if x,y are fully independent.
Rescalings : log(lhs / rhs) is scaled (-oo..0..oo) in general;
(lhs - rhs )/(lhs + rhs) is scaled [-1..0..1], I call it kemenyzation,
= (lhs/rhs -1)/(lhs/rhs +1);
and lhs/(lhs + rhs) is scaled [0..1/2..1].
Odds(.) = P(.)/(1 - P(.)) = 1/( 1/P(.) - 1 ) = P(.)/P(~.)
P(.) = Odds(.)/(1 + Odds(.)) = 1/( 1/Odds(.) + 1 )
P(x| y)/P(~x| y) = P(x| y)/(1 - P(x| y)) = Odds(x| y)
P(x|~y)/P(~x|~y) = P(x|~y)/(1 - P(x|~y)) = Odds(x|~y)
P(x| y)/P( x|~y) = B(x: y) = LR(x: y) = LR+ is a likelihood ratio
where B(x: y) is a simple Bayes factor = RR(x:y) ; is
NOT odds(.) since RR(.: .) is NOT a P(.|.)/P(~.|.) .
Bayes rule in odds-likelihood form :
Posterior odds on x if y = Prior odds . Likelihood ratio
= Odds(x|y) = Odds(x) . LR(y:x)
= P(x|y)/ P(~x|y) = ( Px/P(~x) ).( P(y|x)/P(y|~x) )
= P(x|y)/(1 -P(x|y)) = Px/(1 -Px).( P(y|x)/P(y|~x) )
= 1/( 1/P(x|y) - 1 )
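A sketch of the odds-likelihood form (my helper names), checking that
prior odds times LR(y:x) indeed equals the posterior odds Odds(x|y):

```python
# Sketch of Bayes rule in odds-likelihood form:
# Odds(x|y) = Odds(x) . LR(y:x); helper names are mine.

def odds(p):
    return p / (1 - p)

def p_from_odds(o):
    return o / (1 + o)

def posterior_odds(Px, Py, Pxy):
    lr_yx = (Pxy / Px) / ((Py - Pxy) / (1 - Px))   # LR(y:x) = RR(y:x)
    return odds(Px) * lr_yx

Px, Py, Pxy = 0.4, 0.5, 0.3
post = posterior_odds(Px, Py, Pxy)                 # should be Odds(x|y)
```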
In our 2x2 contingency table we have Odds ratio OR(:) :
OR = (a/b)/(c/d) = a.d/(b.c)
SeLn(OR) = sqrt( 1/a + 1/b + 1/c + 1/d ) = standard error of ln(OR)
cov(x,y) = Pxy.P(~x,~y) - [ P(x,~y).P(~x,y) ]
= Pxy - Px.Py
but OR <> Pxy /(Px.Py) , except when Pxy = Px.Py , or Pxy=0 .
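A sketch of OR and of the covariance identity from the table's
diagonals (my naming; the SE formula, Woolf's, is for ln(OR)):

```python
# Sketch: odds ratio OR = a.d/(b.c), the standard error of ln(OR),
# and the diagonal identity Pxy - Px.Py = Pxy.P(~x,~y) - P(x,~y).P(~x,y).
import math

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def se_ln_or(a, b, c, d):
    """Standard error of ln(OR), Woolf's formula."""
    return math.sqrt(1/a + 1/b + 1/c + 1/d)

a, b, c, d = 10, 20, 30, 40
n = a + b + c + d
Px, Py, Pxy = (a + b) / n, (a + c) / n, a / n
cov  = Pxy - Px * Py                      # covariance of the 2 events
diag = (a/n) * (d/n) - (b/n) * (c/n)      # products on the diagonals
```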
Relative risks RR(:) for the following 2x2 contingency table:
| e | ~e | e = effect present; ~e = effect absent
----|-----+-----+------
h | a | b | a+b h = hypothetical cause present (eg tested+ )
~h | c | d | c+d ~h = eg unexposed to environment (eg tested- )
----+-----+-----|------
| a+c | b+d | n
RR( e: h) = P(e|h)/ P(e|~h) = (a/(a+b))/(c/(c+d))
= (Peh/Ph)/((Pe-Peh)/(1-Ph)) = a.(c+d)/((a+b).c)
! = oo if Pe=Peh ie (a+c)=a ie P(e,~h)=0 ie c=0
recalling P(e|~h) + P(~e|~h) = 1, we get:
RR(~e:~h) = P(~e|~h)/P(~e|h) = (d/(c+d))/(b/(a+b))
= (1 - P(e|~h))/(1 - P(e|h))
= (1 - c/(c+d))/(1 - a/(a+b)) = d.(a+b) /(b.(c+d))
! = oo if Peh=Ph ie a=(a+b) ie P(h,~e)=0 ie b=0
RR( h: e) = P(h|e)/ P(h|~e) = (a/(a+c))/(b/(b+d))
= (Peh/Pe)/((Ph-Peh)/(1-Pe)) = a.(b+d)/((a+c).b)
= oo if Ph=Peh ie (a+b)=a ie P(h,~e)=0 ie b=0
RR(~h:~e) = P(~h|~e)/P(~h|e) = (d/(b+d))/(c/(a+c))
= (1 - P(h|~e))/(1 - P(h|e))
= (1 - b/(b+d))/(1 - a/(a+c)) = d.(a+c) /(c.(b+d))
! = oo if Peh=Pe ie a=(a+c) ie P(e,~h)=0 ie c=0
ie:
for c=0 are RR( e: h) = oo = MAXImal = RR(~h:~e)
for b=0 are RR( h: e) = oo = MAXImal = RR(~e:~h)
for a=0 is RR( e: h) = 0 = minimal = RR( h: e)
for d=0 is RR(~h:~e) = 0 = minimal = RR(~e:~h)
! RR(e:h).RR(~e:~h) = RR(h:e).RR(~h:~e)
= Peh.P(~e,~h)/( P(e,~h).P(h,~e) ) = Peh.P(~e,~h)/( P(~e,h).P(~h,e) )
which clearly are identical.
While these equations hold in general, you might like to meditate upon
why the 17th (find 17th , 16+1 ) condition of independent x, y consists
from the same components.
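A numeric sketch of this product identity (my code), with all four
risk ratios taken from one table of counts; their common value is
Peh.P(~e,~h)/( P(e,~h).P(h,~e) ):

```python
# Sketch verifying RR(e:h).RR(~e:~h) = RR(h:e).RR(~h:~e); rows are
# h, ~h and columns e, ~e as in the table above. Names are mine.

def rr(num_a, num_n, den_a, den_n):
    """Generic risk ratio (num_a/num_n)/(den_a/den_n)."""
    return (num_a / num_n) / (den_a / den_n)

a, b, c, d = 30, 10, 20, 40
n = a + b + c + d
rr_eh   = rr(a, a + b, c, c + d)     # P(e|h)/P(e|~h)
rr_nenh = rr(d, c + d, b, a + b)     # P(~e|~h)/P(~e|h)
rr_he   = rr(a, a + c, b, b + d)     # P(h|e)/P(h|~e)
rr_nhne = rr(d, b + d, c, a + c)     # P(~h|~e)/P(~h|e)
common  = (a/n) * (d/n) / ((c/n) * (b/n))   # Peh.P(~e,~h)/(P(e,~h).P(h,~e))
```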
-.- +Tutorial notes on probabilistic logic, entropies and information
Stan Ulam, the father of the H-device (Ed Teller was the mother) used
to say that "Our fortress is our mathematics." I say that here
"Our fortress is our logic." Elementary probability theory is strongly
isomorphous with the set theory, which is strongly isomorphous with logic.
There are 16 Boolean functions of 2 variables, of which 8 are commutative
wrt both variables. For the purposes of inferencing we should use ORIENTED
ie DIRECTED ie ASYMMETRIC functions only. From the remaining 8 asymmetric
logical functions 4 functions are of 1 variable only, so that only 4
asymmetric functions remain for consideration : 2 implications and 2 AndNots
which are pairwise mutually complementary. ASYMMETRY is easily obtained
! even from symmetrical measures of association (or dependence) simply by
normalization with a function of one variable only, eg :
(Pxy -Px.Py)/(Px.(1-Px)) is 0 if x,y are independent
= cov(x,y)/var(x) = beta(y:x)
= slope of a probabilistic regression line Py = beta(y:x).Px + alpha(y:x)
= [ P(y|x) - Py ]/(1-Px) = P(y|x) - P(y|~x) (check it as an equation)
= the numerator of F(y:x) below
= ARR(x:y) = absolute risk reduction of y if x (or increase if ARR < 0).
Many measures of information are easily obtained by taking expected value
of either differences or ratios of the lhs and rhs taken from a dependence
inequality lhs > rhs mentioned above. For example we could create :
SumSum[ Pxy.Dr(y:x) ] where Dr is a relative dependence measure like
eg RR(y:x), but a single Dr = oo would make the
whole SumSum = oo, hence it is better to use
SumSum[ Pxy. Da(x:y) ] where Da is an absolute dependence measure, eg:
SumSum[ Pxy.(P(y|x) - P(y|~x)) ] = SumSum[ Pxy.ARR(x:y) ];
knowing that P(y|x) - P(y|~x) = ARR(x:y) = beta(y:x) = dPy/dPx , and that
Integral[ dx.Px.(dPx/Px)^2 ] = Fisher's information, I realized that
! my SumSum[ Pxy.( F(y:x) )^2 ] could serve as a quasi-Fisher-informatized
RR ; find "my 1st interpretation" of F(y:x) .
{ Renyi 1976, vol.2, 546-9 } connects Fisher's and Shannon's information.
A particularly nice & meaningfully asymmetrical (wrt r.v.s X, Y) information
is my favorite ( out of the 2*[16+1] = 2*17 = 34 possible simple measures
of association; find 17 or 16+1 ) Cont(:) :
Cont(Y:X) = Cont(X) - Cont(X|Y) == Gini(X) - Gini(X|Y) = Gini(Y:X)
= Var(X) - E[Var(X|Y)]
= Sum[ Px.(1-Px) ] - SumSum[ Pxy.(1-P(x|y))] = quadratic entropy
= 1-Sum[(Px)^2] -(1-SumSum[ Pxy. P(x|y) ]) = parabolic entropy
= ( 1 - E[Px] ) -(1 - E[P(x|y)])
!! = SumSum[ Pxy.( P(x|y) -Px )] my opeRationally clearest form
!! = Expected[ P(x|y) -Px ] ie average dependence measured
= SumSum[ Pxy.K(y:x) ] is ASYMMETRICAL wrt x,y, X,Y too
= SumSum[ ( Pxy^2 - Py.Px.Pxy )/Py ] from 3rd line up. Now a leap:
**= SumSum[ ((Pxy - Py.Px)^2 )/Py ] compare with Phi^2 below
= SumSum[ ( Pxy^2 -2Py.Px.Pxy + (Px.Py)^2 )/Py ]
**= will be proven if the next *= is proven :
*= SumSum[ ( Pxy^2 - Py.Px.Pxy + 0 )/Py ] = 4th line up
proof: Sum[Px.Px] =SumSum[Px.Pxy] = SumSum[Py.Px^2] = 3rd term on 3rd line up.
1-Cont(X) = Sum[ Px.Px ] = E[Px] = expected probability of a variable X
= Sum[ (n(x)/N).(n(x)-1)/(N-1) ] = an UNbiased estimate of E[Px]
= expected probability of success in guessing events x
= long-run proportion of correct predictions of events x
= concentration index by Gini/Herfindahl/Simpson ( Simpson was a WWII
codebreaker like I.J.Good and Michie ; they called it a "repeat rate" )
Cont(X) = 1 - Sum[(Px)^2] = expected improbability of variable X
= expected error or failure rate eg in guessing events x
0 <= [ Cont(:), 1-Cont(:) ] <= 1 ie they saturate like P(error), while
Shannon's entropies have no fixed upper bound.
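The quadratic entropy above can be sketched for any discrete joint
table (my code; Pxy[i][j] stands for P(X=i,Y=j)); the Gini-difference
form and the "opeRationally clearest" expected-dependence form must
coincide:

```python
# Sketch: Cont(Y:X) = Gini(X) - Gini(X|Y) = SumSum[ Pxy.(P(x|y) - Px) ]
# computed both ways; function name is mine, for illustration.

def cont_gain(Pxy):
    """Return Cont(Y:X) two ways; Pxy[i][j] = P(X=i, Y=j)."""
    nx, ny = len(Pxy), len(Pxy[0])
    Px = [sum(Pxy[i][j] for j in range(ny)) for i in range(nx)]
    Py = [sum(Pxy[i][j] for i in range(nx)) for j in range(ny)]
    cont_x  = 1 - sum(p * p for p in Px)                 # Cont(X), Gini
    cont_xy = 1 - sum(Pxy[i][j] ** 2 / Py[j]             # Cont(X|Y)
                      for i in range(nx) for j in range(ny))
    expected = sum(Pxy[i][j] * (Pxy[i][j] / Py[j] - Px[i])
                   for i in range(nx) for j in range(ny))
    return cont_x - cont_xy, expected

P = [[0.3, 0.1],
     [0.2, 0.4]]
gain, avg_dep = cont_gain(P)
```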
Log-scale fits with the physiological Weber-Fechner law, eg sound is measured
in decibels dB , and so is the pH-factor (0..7..14 = max. alkaline). Logs work
even in psychenomics, since you will feel less than twice as happy after
your salary or profits were doubled :-) For more on infotheory in physiology
see the nice book { Norwich 1993 }.
Cont(:) has been called many names, eg quadratic entropy or parabolic
entropy. Cont(:) gives provably better, sharper results than Shannon's
entropy for tasks like eg pattern classification in general, and diagnosing,
! identification, prediction, forecasting, and discovery of causality in
particular. Such tasks are naturally DIRECTED ie ASYMMETRICAL, requiring
Cont(Y:X) <> Cont(X:Y), while Shannon's mutual information
I(Y:X) = I(X:Y) = SumSum[ Pxy.log(Pxy/(Px.Py)) ] is symmetrical wrt X,Y.
No wonder that Cont( beats I( .
Cont(Y:X)/Cont(X) = TauB(Y:X) { Goodman & Kruskal, Part 1, 1954, 759-760 }
!
where they interpreted TauB as "relative decrease in the proportion of
incorrect predictions". See my Hint 2 & Hint 7 on www.matheory.info.
Neither they nor { Bishop 1975 } have realized that TauB is a normalized
quadratic entropy based on 1-P, the simplest decreasing function of P.
{ Agresti 1990, 75 } tells us that for 2x2 contingency tables G & K's
TauB = Phi^2. So I deduce that for 2x2 contingency tables G & K's
TauB is SYMMETRICAL wrt X,Y since Phi^2 and other X^2-based measures are
SYMMETRICAL wrt X,Y :
In general
TauB(Y:X) = Cont(Y:X)/Cont(X) <> Cont(X:Y)/Cont(Y) = TauB(X:Y)
For 2x2 : TauB(Y:X) = Cont(Y:X)/Cont(X) = Cont(X:Y)/Cont(Y) = TauB(X:Y)
hence even for 2x2 Cont(Y:X) <> Cont(X:Y)
as long as Cont(X) <> Cont(Y) , I deduce.
X^2 = mean square CONTINGENCY { Kendall & Stuart, chap.33, 555-557 }
= N.(SumSum[ n(x,y)^2 /( n(x).n(y) ) ] -1) sample ChiSqr statistic
= N.SumSum[ (Pxy - Px.Py)^2 /(Px.Py) ] { Blalock 1958, 102 }
= N.Phi^2
= N.SumSum[((Pxy -Px.Py)/Px).(Pxy -Py.Px)/Py ]
= N.SumSum[ (P(y|x) -Py).(P(x|y) -Px) ]
= N.SumSum[ K(x:y).K(y:x) ], find K(
JHr^2 = SumSum[((Pxy - Px.Py)^2)/(Px.(1-Px).Py.(1-Py))]
= SumSum[ cov(x,y)/var(x) . cov(x,y)/var(y) ]
= SumSum[ beta(y:x) . beta(x:y) ] ie [slope.slope]
= SumSum[ (P(y|x) -P(y|~x)).(P(x|y) -P(x|~y)) ]
= SumSum[ ARR(x:y).ARR(y:x) ] = SumSum[ r2 ]
= my new cumulated r2 ie r^2 shows what Phi^2 = ( X^2 )/N is not,
and how are they related.
Pcc = sqrt[ ( X^2 )/( X^2 + N) ] = Pearson's CONTINGENCY coefficient
= sqrt[ (Phi^2)/(Phi^2 + 1) ], (not Pearson's CORRELATION coeff r ,
find r^2 , which are not SumSum[...] ) ;
Phi^2 = ( X^2 )/N { Blalock 1958, p103, Phi^2 vs TauB = Cont(Y:X)/Cont(X) }
= Phi-squared = Pearson's mean square CONTINGENCY
= SumSum[ (Pxy - Px.Py)^2 /(Px.Py) ] from ( X^2 )/N above
(see asymmetric **= Cont(Y:X) above)
= SumSum[ Pxy.Pxy/(Px.Py) ] - 1 a symmetrical expected value
like Shannon's mutual information:
I(Y:X) = SumSum[ Pxy.log(Pxy/(Px.Py)) ] = I(X:Y) in general
= -0.5*ln(1 - corr(X,Y)^2) if X, Y are continuous Gaussian variables
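A sketch (my code) of Phi^2 = (X^2)/N together with Agresti's 2x2
observation that G & K's TauB(Y:X) = Cont(Y:X)/Cont(X) then equals
Phi^2:

```python
# Sketch: Pearson's mean square contingency Phi^2 and G & K's TauB;
# for a 2x2 table they coincide (Agresti). Names are mine.

def phi2(Pxy):
    """Phi^2 = SumSum[(Pxy - Px.Py)^2/(Px.Py)] = (X^2)/N."""
    nx, ny = len(Pxy), len(Pxy[0])
    Px = [sum(row) for row in Pxy]
    Py = [sum(Pxy[i][j] for i in range(nx)) for j in range(ny)]
    return sum((Pxy[i][j] - Px[i] * Py[j]) ** 2 / (Px[i] * Py[j])
               for i in range(nx) for j in range(ny))

def taub_yx(Pxy):
    """TauB(Y:X) = Cont(Y:X)/Cont(X), columns Y predicting rows X."""
    nx, ny = len(Pxy), len(Pxy[0])
    Px = [sum(row) for row in Pxy]
    Py = [sum(Pxy[i][j] for i in range(nx)) for j in range(ny)]
    cont_x  = 1 - sum(p * p for p in Px)
    cont_xy = 1 - sum(Pxy[i][j] ** 2 / Py[j]
                      for i in range(nx) for j in range(ny))
    return (cont_x - cont_xy) / cont_x

P = [[0.3, 0.1],
     [0.2, 0.4]]
```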
[ 1 - Cont(X) ]/Pxi = E[Px]/Pxi = surprise index for an event xi within the
r.v. X, as defined by { Weaver }, Shannon's co-author.
Cont(.) was intended as a semantic information content measure, SIC. The key
idea is that the LOWER the probability of an event, the MORE POSSIBILITIES
it ELIMINATES, EXCLUDES, FORBIDS, hence MORE its occurrence SURPRISES us.
{ Kemeny 1953, 297 } refers this insight to { Popper 1972, 270,374,399,400,402
mention P(~x) = 1-Px as semantic information content measure SIC },
{ Bar-Hillel 1964, 232 } quotes "Omnis determinatio est negatio" =
"Every determination is negation" = "Bestimmen ist verneinen" by Baruch
Spinoza (1632-1677), in 1656 excommunicated from the synagogue in Amsterdam.
Btw, William of Ockham aka Occam aka Dr. Invincibilis was excommunicated from
the Church in 1328 :-) . KISS = Keep IT simple, student! is Occamism.
Stressing elimination of hypotheses or theories is Popperian refutationalism.
In principle any decreasing function of Px will do, but (1-Px) is surely
the simplest one possible, simpler than Shannon's log(1/Px) = -log(Px).
By combining (1-Px) with 1/Px, I constructed SurpriseBy(x) = (1-Px)/Px
only to find it implicit or hidden inside RR(y:x), after my rearrangement
of atomic factors in RR(:). Note that Sum[ Px.1/Px] would not work as an
expected value.
For more on Cont(:) see { Kahre, 2002 } in general, and my (Re)search hints
Hint2 & Hint7 there on pp.501-502 in particular (see www.matheory.info ).
Let me finish this with a proposal for new infomeasures:
!!
JH(Y:X) = SumSum[ Pxy.ARR(y:x) ] = SumSum[ Pxy.( P(x|y) - P(x|~y) ) ]
JH(X:Y) = SumSum[ Pxy.ARR(x:y) ] = SumSum[ Pxy.( P(y|x) - P(y|~x) ) ]
and 2 others with Abs(difference): SumSum[ Pxy*| P(.|_) - P(.|~_) | ]
-.- +Google's Brin's conviction Conv( , my nImp( , AndNot :
Boolean logic is strongly isomorphous with the set theory wherein
X implies Y whenever X is a subset of Y. Since the probability theory
is also strongly isomorphous with the set theory, we see that
for the < as the symbol for both "a subset of" and for the "less than",
it is obvious that since [ P(x|y) > P(y|x) ] == [ Py < Px ] and v.v. ,
!!! Py < Px makes y-->x ie x NecFor y more plausible, while
!!! Px < Py makes x-->y ie y NecFor x more plausible,
which can be easily visualized with a Venn (aka pancakes or pizzas) diagram.
0 <= P(y|x) = Pxy/Px <= 1 measures how much the event x-->y
with maximum = 1 for Pxy=Px ;
0 <= P(x,~y) = Px - Pxy <= 1 measures how little the event x-->y
or: 1 - P(y|x) = (Px - Pxy)/Px measures how little the event x-->y
or: 1/(Px - Pxy) measures how much the event x-->y
Also recall that the Bayesian probability of a j-th hypothesis x_j ,
given a vector of cue events y_c ie y..y, is (under the assumption of
independence) computed by the Bayes chain rule formula based on the
product of P(y|x) , the higher the more probable the hypothesis x_j :
P(x_j, y..y) =. P(x_j).Product_c:( P(y_c | x_j) ) ; don't swap y, x !
A cue event is a feature/attribute/symptom/evidential/test event
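A toy sketch of this Bayes chain rule under the naive independence
assumption; the priors and likelihoods below are invented purely for
illustration:

```python
# Sketch: score each hypothesis x_j by P(x_j).Product_c P(y_c|x_j).
# The numbers are invented toy values, not data from this text.

def naive_bayes_scores(priors, likelihoods, cues):
    """priors[j] = P(x_j); likelihoods[j][c] = P(y_c|x_j); don't swap y, x!"""
    scores = []
    for j, pj in enumerate(priors):
        s = pj
        for c in cues:
            s *= likelihoods[j][c]   # product of cue likelihoods
        scores.append(s)
    return scores

priors = [0.7, 0.3]                      # two hypotheses x_0, x_1
likelihoods = [[0.2, 0.9], [0.8, 0.4]]   # P(y_c|x_j) for cues c = 0, 1
scores = naive_bayes_scores(priors, likelihoods, [0, 1])
best = max(range(len(scores)), key=lambda j: scores[j])
```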
"x implies y" in plaintalk :
"If x then y" is a deterministic rule, which in plain English says that
"x always leads to y" ie Px - Pxy = 0 ie Px = P(x, y); or:
"never (x and not y) occur jointly" ie P(x,~y) = 0 if x-->y 100%
note that Pxy + P(x,~y) = Px , hence [ P(x,~y) = 0 ] == [ Pxy = Px ]
which is the deterministic (ie perfect or ideal) case
which directly translates into the probabilistic formalisms for an
increasing function of the strength of implication :
1 - P(x,~y) = 1 -(Px - Pxy), or, the smaller the P(x,~y), the more x-->y ,
= 1 - Px + Pxy , since ~(x,~y) == (x implies y) in logic.
Another measure is the "conviction" by Google's co-founder { Brin 1997 } :
Conv(x:y) = Px.P(~y)/P(x,~y) = (Px - Px.Py)/(Px - Pxy) is my form
= Px/P(x|~y) = P(~y)/P(~y|x) is my form
= Conv(~y:~x) this = is UNDESIRABLE ( find UNDESIRABLE );
= 1/Nimp(~x:~y) = 1/ Nimp(y:x) nearby below
= 1 if x,y independent; and where:
+ the larger the Pxy <= min(Px, Py), the more the x-->y, and
+ the closer the Pxy is to Px.Py , the less dependent are x,y and
the closer the Conv(:) to 1 which is the fixed point for independence.
( y --> x) == (~x --> ~y) where --> is "implies" in logic; here too:
Conv( y: x) = Py.P(~x)/P(y,~x) = (Py -Px.Py)/(Py -Pxy), UNDESIRABLE = :
= Conv(~x:~y) = P(~x).Py/P(~x,y)
= 1/Nimp(~y:~x) = 1/Nimp(x:y)
( y --> x) is 1/P(y,~x) = 1/[ Py - Pxy ] =
= (~x -->~y) is 1/P(~x,y) = 1/[ P(~x) - P(~x,~y) ] by DeMorgan's law :
= 1/[ 1-Px -(1 -(Px+Py-Pxy)) ] = 1/(Py - Pxy)
so their equality is logically ok, but it is
!!! UNDESIRABLE FOR A MEASURE OF CAUSAL TENDENCY. Q: why? A: because eg:
"the rain causes us to wear raincoat" is ok, but "not wearing a raincoat
causes no rain" makes NO SENSE, as the Nobel prize winner Herbert Simon
pointed out in { Simon H. 1957, 50-51 }. This UNDESIRABLE equality does
not hold for LR(:), RR(:) and its co-monotonic transformations like eg
W(:) and F(:).
(y AndNot x) == (y,~x) == ~(~(y,~x)) == ~(y-->x) in logic,
is equivalent to "y does not imply x" ie NonImp(y:x)
== P(y,~x) = Py - Pxy = y AndNot x
== in plaintalk "Lack of x allows/permits/leads to y",
since in the perfect case we get:
ideally P(~x,~y) = 0 is the deterministic, extreme case;
note that P(~x,~y) = 0 is not equivalent to Pxy = Py, because
by DeMorgan P(~x,~y) = 1 - (Px + Py - Pxy) = P(~(x or y)) always,
hence P(~x,~y) = 0 ie Px + Py - Pxy = 1 ie Px + Py = 1-Pxy
Recall P(~x,~y) + P(~x, y) = P(~x) always
P(~x,~y) + P( x,~y) = P(~y) always
Nimp(y:x) = P(x,~y)/( Px.P(~y) ) = 1/Conv(x:y) = 1/Conv(~y:~x)
Nimp(x:y) = P(y,~x)/( Py.P(~x) ) = 1/Conv(y:x) = 1/Conv(~x:~y)
= ( Py - Pxy )/( Py -Px.Py) this = is UNDESIRABLE
= 0 if Pxy = Py
= 1 if Pxy = Px.Py ie if x, y independent
= oo if Px = 1
= [0..1..oo)
= 1 / Conv( y: x) = Py.P(~x)/P(y,~x)
= 1 / Conv(~x:~y) = P(~x).Py/P(~x,y)
Conv(x:y) = Px.P(~y)/P(x,~y) the larger, the more causation,
due to small P(x,~y) co-occurrence
= (Px - Px.Py)/(Px - Pxy) is 1 if x, y are independent;
= [0..1..oo) , infinity oo if Pxy = Px
nImp(x:y) = ( P(y,~x) - Py.P(~x) )/( P(y,~x) + Py.P(~x) )
= ( Py -Pxy - (Py -Px.Py))/( Py -Pxy + Py -Px.Py )
= ( Pxy - Px.Py )/( Pxy + Px.Py -2.Py )
= [-1..0..1] by kemenyzation
nImp(y:x) = ( Pxy - Px.Py )/( Pxy +Px.Py -2.Px ) find -nImp(
~(x AndNot y) == (x implies y) , hence it should hold:
-nImp(y:x) = (x-->y) = caus1(x:y) and indeed, it does hold
= ( Pxy - Px.Py )/( 2.Px -Pxy -Px.Py ), find caus1(x:y) below.
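These conviction and non-implication measures can be sketched as
(my function names; formulas exactly as above):

```python
# Sketch: Brin's conviction Conv(x:y), its reciprocal Nimp(y:x), and
# the kemenyzed Conv1(x:y) = -nImp(y:x); names are mine.

def conv(Px, Py, Pxy):
    """Conv(x:y) = Px.P(~y)/P(x,~y); 1 at independence, oo if x-->y."""
    return (Px * (1 - Py)) / (Px - Pxy)

def nimp_big(Px, Py, Pxy):
    """Nimp(y:x) = P(x,~y)/(Px.P(~y)) = 1/Conv(x:y)."""
    return (Px - Pxy) / (Px * (1 - Py))

def conv1(Px, Py, Pxy):
    """Kemenyzed conviction, scaled [-1..0..1]."""
    return (Pxy - Px * Py) / (2 * Px - Pxy - Px * Py)

def nimp_small(Px, Py, Pxy):
    """nImp(y:x) = (Pxy - Px.Py)/(Pxy + Px.Py - 2.Px)."""
    return (Pxy - Px * Py) / (Pxy + Px * Py - 2 * Px)

Px, Py, Pxy = 0.4, 0.5, 0.3
```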
Consider again: "Lack of x (almost) always leads to y". Clearly, it
would be wrong to tell somebody with x and y that x caused y . Hence
Pxy alone cannot measure how much the x causes y, but P(y|x) could.
Alas, P(y|x) = Pxy/Px is not a function of Py, and we believe that it is
wise to have measures which are functions of all 3 Pxy, Px and Py :
Conv(x:y) = Px.P(~y)/P(x,~y) the larger this Conv, the more of x-->y
due to small P(x,~y) co-occurrence
= (Px - Px.Py)/(Px - Pxy) is 1 if x, y are independent;
= [0..1..oo) , infinity oo if Pxy = Px
Conv2(x:y) = P(x implies y)/( ~( Px.P(~y))) larger implies more
= P(~(x,~y))/( ~( Px.P(~y)))
= (1 -P( x,~y))/(1 - Px.P(~y) ) is 1 if independent x,y
= [1/2..1..4/3] , 1 if x,y independent; 4/3 if x imp y.
From the P(~x,~y) + P(~x, y) = P(~x)
P(~x,~y) + P( x,~y) = P(~y)
for the case P(~x,~y) = 0 holds P(~y) = P( x,~y) in which case
Conv(x:y) = Px <= 1 = independent x,y
Conv2(x:y) = ( 1-P(~y) )/( 1 -P(~y).Px ) <= 1 = independent x,y
<= 1 is due to .Px , which always is 0 <= Px <= 1 .
<= 1 in this case is good, because P(~x,~y) = 0 was shown to be
!!! equivalent to the (y AndNot x), hence x cannot imply y ,
not even a little bit, ie causation must not exceed the point
of no dependence ie point of independence, and indeed both
Conv(:) and Conv2(:) are <= 1 in this case, which is good.
An explanation and justification of the Conv(:) measures:
+ Conv(:) = fun( Px, Py, Pxy ) ie fun of all 3 defining probabilities.
+ Conv has a fixed value if x, y independent , and also
has a fixed value if x implies y 100% , hence
Conv has a decent opeRational interpretation.
+ Conv(x:y) = extreme when x implies y 100%
!!! ie when Pxy = Px regardless of Py (draw a Venn)
(x implies y) = ~(x,~y) in logic
= 1 - P(x,~y) in probability
{ Brin 1997 } got rid of the outer negation ~ by taking the reciprocal
value. On one hand this trick is not as clean as Conv2(:),
!!! but on the other hand this trick makes the
!!! 100% implication value an extreme value REGARDLESS of Py :
Conv(x:y) = Px.P(~y)/P(x,~y) , the larger the more (x implies y)
= Px/P(x|~y) , is 1 if independent x,y
= Px.(1 - Py)/(Px - Pxy)
= (Px - Px.Py)/(Px - Pxy) is 1 if Pxy = Px.Py ie 100% independence,
is oo if Pxy = Px ie 100% (x implies y)
oo needs a precheck for an overflow;
numerically is 0/0 if Py = 1 ie Pxy = Px (overflow) but:
correct logically is 1 if Py = 1 as Pxy = Px.Py ie x,y indep.
Or its reciprocal (since Pxy = Px is possible, while Py < 1) :
(Px - Pxy)/(Px - Px.Py) , the smaller the more (x implies y) :
is 1 if Pxy = Px.Py ie 100% independence,
is 0 if Pxy = Px ie 100% (x implies y)
numerically is 0/0 if Py = 1 ie Pxy = Px (overflow) but:
correct logically is 1 if Py = 1 as Pxy = Px.Py ie x,y indep.
Conv(:) kemenyzed by me to the scale [-1..0..1] becomes
Conv1(x:y) = ( Px.P(~y) - P(x,~y) )/( Px.P(~y) + P(x,~y) )
= ( Px - P(x|~y) )/( Px + P(x|~y) )
= ( Pxy - Px.Py)/(2.Px - Pxy - Px.Py) = -nImp(y:x) above;
= ( P(~y) - P(~y|x) )/( P(~y) + P(~y|x) )
which kemenyzed to the scale [0..1/2..1] becomes :
Conv3(x:y) = Px.P(~y)/[ Px.P(~y) + P(x,~y) ]
Or based on counterfactual reasoning ( cofa ) IF ~x THEN ~y :
Cofa1(~x:~y) = ( P(~x).Py - P(~x, y))/( P(~x).Py + P(~x, y))
= ( Py -Px.Py - Py +Pxy )/( Py -Px.Py + Py -Pxy )
= ( Pxy - Px.Py)/(2.Py -Px.Py -Pxy )
= -nImp(x:y) above, ie Non(y AndNot x)
Cofa0(~x:~y) = P(~x).Py / P(~x, y)
= (Py -Px.Py)/(Py -Pxy) = Conv(y:x) above,
and indeed, in logic (~x <== ~y) == (y <== x); the <== means "implies"
(and also it means "less than" if applied to 0 = false, 1 = true)
F(~x:~y) == F(~x <== ~y)
= ( P(~x|~y) - P(~x|y) )/( P(~x|~y) + P(~x|y) ) = -F(~x:y)
They all look reasonable, and all are scaled to [-1..0..1].
Q: which one do you like, if any, and why (not) ?
A mathematically more rigorous alternative to Conv(x:y) is my
Conv2(x:y) which does not suffer from the dangers of an overflow, and
employs the exact probabilistic (x implies y) = 1 - P(x,~y) derived
from the exact logical (x implies y) == ~(x,~y).
Since we wish to have a fixed value for the independence of events x, y,
the exact implication form 1 - P(x,~y) suggests to compare it with
the negation of the fictive ie as-if independence term as follows:
Conv2(x:y) = P(x implies y)/( x,y independ ) larger implies more
= P(~(x,~y))/( ~( Px.P(~y)) )
= ( 1- P( x,~y))/( 1 -Px.P(~y) ) is 1 if independent
= ( 1-(Px - Pxy))/( 1 -Px.(1-Py) )
= ( 1- Px + Pxy )/( 1 -Px +Px.Py ) is 1 if Pxy = Px.Py
This is ( 1 -Px + Px )/( 1 -Px +Px.Py ) if Pxy = Px
= 1/( 1 -Px.(1 -Py)) >= 1 if Pxy = Px,
the larger the Px and the smaller the Py, the > 1 is Conv2(x:y).
If Px = Pxy (find Venn ) ie if 100% implies
then the numerator is 1 ie MAXimal, but unlike in Conv(x:y),
!! the denominator depends on Px and Py .
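A minimal numeric sketch of the three variants above (function names are
mine, not from any library; inputs are assumed to be consistent
probabilities with Pxy <= min(Px, Py)):

```python
def conv(px, py, pxy):
    # Brin's conviction Conv(x:y) = Px.P(~y)/P(x,~y);
    # caution: divides by zero (overflows) when Pxy = Px
    return px * (1 - py) / (px - pxy)

def conv1(px, py, pxy):
    # kemenyzed conviction on the scale [-1..0..1]
    num = px * (1 - py) - (px - pxy)   # Px.P(~y) - P(x,~y)
    den = px * (1 - py) + (px - pxy)   # Px.P(~y) + P(x,~y)
    return num / den

def conv2(px, py, pxy):
    # overflow-free variant (1 - P(x,~y))/(1 - Px.P(~y)),
    # equal to 1 exactly when x, y are independent
    return (1 - (px - pxy)) / (1 - px * (1 - py))
```

For independent events (Pxy = Px.Py, eg 0.3, 0.4, 0.12) these return
1, 0 and 1 respectively; as Pxy approaches Px (ie x implies y), Conv
overflows while Conv2 stays finite and above 1.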
-.- +Rescalings important wrt risk ratio RR(:) aka relative risk
For positive u, v : u/v is scaled [ 0..1..oo) v <> 0, and:
W = ln(u/v) is scaled (-oo..0..oo) v <> 0, and:
F = (u - v )/(u + v ) is scaled [ -1..0..1 ], allows u=0 xor v=0
= (1 - v/u)/(1 + v/u) handy for graphing F=f(v/u) u <> 0
= (u/v - 1)/(u/v + 1) handy for graphing F=f(u/v) v <> 0
= (u - v )/(u + v ) this rescaling I call "kemenyzation" to honor
the late John G. Kemeny, the Hungarian-American
co-father of BASIC, and former math-assistant to Einstein;
= tanh(W/2) = tanh(0.5*ln(u/v)) by { I.J. Good 1983, 160 where
sinh is his mistake }
Since
atanh(z) = 0.5*ln((1+z)/(1-z)) for abs(z) < 1,
W = 2.atanh(F) = ln((1+F)/(1-F)) for abs(F) < 1
F0 = (F+1)/2 is linearly rescaled to [0..1/2..1], 1/2 for independence.
W(y:x) = ln( P(y|x)/P(y|~x) ) is an information gain [see F(:) ]
= ln( P(y|x) ) - ln(P(y|~x) ) is additive Bayes factor
= ln( B(y:x) )
= ln(RR(y:x) ) = ln( relative risk of y if x )
= ln( Odds(x|y)/Odds(x) )
W(:) is I.J. Good's "weight of evidence in favor of x provided by y".
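The identities above are easy to verify numerically; a sketch (helper
name is mine, assuming u, v > 0) checking that kemenyzation is exactly
tanh of half the log-ratio:

```python
import math

def kemenyze(u, v):
    # F = (u - v)/(u + v), scaled to [-1..0..1]
    return (u - v) / (u + v)

u, v = 0.9, 0.3
F = kemenyze(u, v)                  # = 0.5
W = math.log(u / v)                 # weight of evidence ln(u/v)

same1 = abs(F - math.tanh(W / 2))   # F = tanh(W/2), Good's sinh corrected
same2 = abs(W - 2 * math.atanh(F))  # W = 2.atanh(F), for abs(F) < 1
```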
The advantage of oo-less scalings like [-1..0..1] or [0..1/2..1] is that
they make comparisons of different formulas possible at all and more
meaningful, though not perfect. E.g. we may try to compare a value of
F(:) with that of Conv1(:) which is Conv(:) kemenyzed by me.
W(:)'s logarithmic scale allows addition (of otherwise multiplicable
ratios) under the assumption of independence between y, z :
W(x: y,z) = W(x:y) + W(x:z)
but when y,z are dependent we must use { I.J. Good 1989, 56 } :
W(x: y,z) = W(x:y) + W(x: z|y)
F(:)'s cannot be simply added, but can be combined (provided y, z are
independent) according to { I.J. Good, 1989, 56, eq.(7) } thus :
F(x: y,z) = ( F(x:y) + F(x:z) )/( 1 + F(x:y).F(x:z) )
but when y,z are dependent we must use :
F(x: y,z) = ( F(x:y) + F(x: z|y) )/( 1 + F(x:y).F(x: z|y) )
Seeing this, physicists, but not necessarily physicians, might recall
the relativistic velocity addition formula for combining 2 relativistic
speeds into the resultant one by means of a regraduation function for
relativistic addition of velocities u, v into a single resultant speed w :
w = ( u + v )/( 1 + u.v/(c.c) ) where c is the speed of light.
P(.|.)'s maximum = 1 corresponds to the unexceedable speed of light c ,
so that w simplifies to our ( u + v )/( 1 + u.v ).
This relativistic addition appears in:
- { Lucas & Hodgson, 5-13 } is the best on regraduation (no P(.)'s )
- { Yizong Cheng & Kashyap 1989, 628 eq.(20) }, good
- { Good I.J. 1989, 56 }
- { Grosof 1986, 157 } last line, no relativity mentioned
- { Heckerman 1986, 180 } first line, no relativity mentioned.
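A quick numeric check (my names; valid under the same independence
assumption as above) that adding W's is the same as "relativistically"
combining the corresponding F's:

```python
import math

def combine_f(f1, f2):
    # Good's eq.(7): F(x: y,z) = (F1 + F2)/(1 + F1.F2),
    # the same form as relativistic velocity addition with c = 1
    return (f1 + f2) / (1 + f1 * f2)

w1, w2 = 0.8, 1.3                        # two weights of evidence
f1, f2 = math.tanh(w1 / 2), math.tanh(w2 / 2)

f_combined = combine_f(f1, f2)
f_from_w   = math.tanh((w1 + w2) / 2)    # the W's simply add
gap = abs(f_combined - f_from_w)
```

The exact agreement is just the tanh addition formula
tanh(a+b) = (tanh a + tanh b)/(1 + tanh a . tanh b).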
-.- +Correlation in a 2x2 contingency table
r == corr(x,y) r's sign = the sign of the numerator
= [ a.d - b.c ]/sqrt[ (a+b).(c+d) .(a+c).(b+d) ]
= [Pxy.P(~x,~y) - P(x,~y).P(~x,y)]/sqrt[ Px.P(~x) . Py.P(~y) ]
! the last and the next numerator have different forms, but are equal :
= [ Pxy - Px.Py ]/sqrt[ Px.(1-Px) . Py.(1-Py) ]
= [ a.N -(a+b).(a+c)]/sqrt[(a+b).(c+d) .(a+c).(b+d) ]
= [n(x,y).N - n(x).n(y) ]/sqrt[ n(x).(N -n(x)) . n(y).(N -n(y))]
= cov(x,y)/sqrt[ var(x) . var(y) ]
= ( Pearson's) CORRELATION coefficient for binary ie Bernoulli ie
indicator events ; is symmetrical wrt x,y
( NOT Pearson's CONTINGENCY coefficient ; find Pcc )
r2 = Square[ corr(x,y) ] = r.r >= 0
= Square[ Pxy - Px.Py]/[ Px.(1-Px).Py.(1-Py) ]
= [ cov(x,y)/var(x) ].[ cov(x,y)/var(y) ]
= beta(y:x) . beta(x:y) , -1 <= beta <= 1, same signs
= [ slope of y on x ].[ slope of x on y ]
= [ P(y|x) - P(y|~x) ].[ P(x|y) - P(x|~y) ] = ARR(x:y) . ARR(y:x)
= ( X^2 )/n find X^2 near below
= coefficient of determination r^2
= ( explained variance ) / ( explained var. + unexplained variance )
= ( variance explained by regression )/( total variance )
= 1 - ( variance unexplained )/( total variance )
r2 is considered to be a less inflated ie more realistic measure
of correlation than the r = corr(x,y) itself (keep its sign).
The key variance decomposition equation from which the above follows is :
total variance = variance explained + variance unexplained aka residual
This equation I call Pythagorean decomposition of the total variation
into its orthogonal partial variations. It is a sad fact that very few
books on statistics and/or probability show the correlation coefficient r
between events.
Yule's coefficient of colligation { Kendall & Stuart 1977, chap.33
on Categorized data, 539 } is also symmetrical wrt x, y:
Y = ( 1 - sqrt(b.c/(a.d)) ) / ( 1 + sqrt(b.c/(a.d)) )
= ( sqrt(a.d) - sqrt(b.c) ) / ( sqrt(a.d) + sqrt(b.c) ) kemenyzed
= tanh(0.25.ln( a.d/(b.c) )) my tanhyperbolization a la I.J. Good
The formula for chi-squared (find X^2 , chisqr , chisquared, ChiSqr ) :
X^2 = Sum[ ( Observed - Expected )^2 /Expected ]
= ( a - (a+b)(a+c)/n )^2 /[ (a+b)(a+c)/n ]
+ ( b - (a+b)(b+d)/n )^2 /[ (a+b)(b+d)/n ]
+ ( c - (a+c)(c+d)/n )^2 /[ (a+c)(c+d)/n ]
+ ( d - (b+d)(c+d)/n )^2 /[ (b+d)(c+d)/n ] is exact;
approx:
=. n.(|ad - bc| -n/2)^2 /[ (a+b)(a+c)(b+d)(c+d) ] Yates' correction
=.. n.( ad - bc )^2 /[ (a+b)(a+c)(b+d)(c+d) ] may be good enough
= n.( ad - bc )/[ (a+b).(c+d) ] . (ad - bc)/[(a+c).(b+d)]
= n.[ a/(a+b) - c/(c+d) ].[ a/(a+c) - b/(b+d) ]
= n.[ P(y|x) - P(y|~x) ].[ P(x|y) - P(x|~y) ]
!! = n. ARR(x:y) . ARR(y:x)
= n.r2 = ChiSqr(x,y) find r2 nearby above
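The identity X^2 = n.r2 is easy to confirm on a concrete 2x2 table; a
sketch (cells a = n(x,y), b = n(x,~y), c = n(~x,y), d = n(~x,~y); the
function names are mine):

```python
import math

def corr_2x2(a, b, c, d):
    # Pearson correlation r between two binary (indicator) events
    return (a*d - b*c) / math.sqrt((a+b)*(c+d)*(a+c)*(b+d))

def chi2_2x2(a, b, c, d):
    # exact X^2 = Sum (Observed - Expected)^2 / Expected
    n = a + b + c + d
    observed = (a, b, c, d)
    expected = ((a+b)*(a+c)/n, (a+b)*(b+d)/n,
                (a+c)*(c+d)/n, (b+d)*(c+d)/n)
    return sum((o - e)**2 / e for o, e in zip(observed, expected))

a, b, c, d = 30, 10, 20, 40
r  = corr_2x2(a, b, c, d)       # = 1/sqrt(6)
x2 = chi2_2x2(a, b, c, d)       # equals n.r^2 = 100/6
```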
.- Besides opeRationally meaningful interpretation of values, it is important
how a measure ORDERS the values obtained from a data set, since we want a
list of the pairs of events (x,y) sorted by the strength of their potential
causal tendency. Note that :
P(x|y) is a naive diagnostic predictivity of the hypothesis x from y effect;
P(y|x) is a naive causal predictivity of the effect y from x ;
P(y|x) = Py.P(x|y)/Px = Pxy/Px is the basic Bayes rule.
The likelihood ratio aka Bayes factor in favor of the conjectured hypothesis
x implied by y ie y-->x (exact, and much more specific than the vague phrase
"x provided by y", where y is the evidence, predictor, cue, feature or
effect) is relative risk RR(:) aka risk ratio :
RR(y:x) == B(y:x) = P(y|x) / P(y|~x) = (Pxy/Px)/[(Py - Pxy)/(1-Px)]
= Pxy.(1 - Px) / (Px.Py - Px.Pxy) caution! /0 if Pxy = Py :-(
= (1 - Px) / (Px.Py/Pxy - Px)
= (Pxy - Px.Pxy)/( Px.Py - Px.Pxy ) which shows that:
B = 1 for Px.Py=Pxy ie for independent x,y ; and
B = oo for Py=Pxy ie "if y then x" ie y-->x
= relative odds on the event x after the event y was observed :
!! = Odds(x|y)/Odds(x) = posteriorOdds / priorOdds ; { odds form }
= [ P(x|y)/(1-P(x|y)) ]/[ Px/(1-Px) ]
= [ P(x|y).(1-Px) ]/[ Px.(1-P(x|y)) ] which inverts into
= P(y|x)/P(y|~x) via basic Bayes rule P(x).P(y|x) = Pxy = P(x|y).Py
!! = [ Pxy/(Py - Pxy) ].[ (1-Px)/Px ]
!! shows that Py = Pxy does mean that y-->x so that B(y:x) = oo
!! note that when (x causes y) then y-->x but not necessarily
!! vice versa; the y is an effect or outcome in general;
= P(y|x)/( 1-P(~y|~x) ) = B(y:x) because of:
= P(y|x)/P( y|~x) q.e.d.
Lets compare :
RR(y:x) = P(y|x)/P(y|~x) = P(y|x).(1-Px)/(Py -Pxy) with:
! Conv(y:x) = Py/P(y|~x) = Py.(1-Px)/(Py -Pxy) = Py.P(~x)/P(y,~x)
clearly RR(y:x) is more meaningful than the "conviction" by { Brin 1997 },
though conviction is no nonsense either :
+ both RR(y:x) and Conv(y:x) equal oo if Py=Pxy ie if y-->x
+ both RR(y:x) and Conv(y:x) equal 1 if Pxy=Px.Py ie y, x are independent
+ RR(y:x) equals 0 if Pxy=0 ie if y is disjoint with x, while
- Conv(y:x) equals P(~x) <> 0 for such disjoint x, y (another flaw)
+ RR(y:x) is relative risk, used within other meaningful formulas
+ RR(y:x) <> RR(~x:~y) which is good, while
- Conv(y:x) == Conv(~x:~y) which is NO GOOD (find UNDESIRABLE above)
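The comparison above, sketched numerically (my function names; the
Example 1 values Px=0.1, Py=0.2, Pxy=0.1 reappear further below):

```python
def rr_yx(px, py, pxy):
    # relative risk RR(y:x) = P(y|x)/P(y|~x)
    return (pxy / px) / ((py - pxy) / (1 - px))

def conv_yx(px, py, pxy):
    # Brin's conviction Conv(y:x) = Py.P(~x)/P(y,~x)
    return py * (1 - px) / (py - pxy)

# both equal 1 at independence (Pxy = Px.Py) ...
ind_rr   = rr_yx(0.2, 0.5, 0.10)
ind_conv = conv_yx(0.2, 0.5, 0.10)
# ... and both grow toward oo as Pxy approaches Py (ie as y --> x):
ex_rr   = rr_yx(0.1, 0.2, 0.1)       # = 9
ex_conv = conv_yx(0.1, 0.2, 0.1)     # = 1.8
```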
B(~y:~x) = P(~y|~x) /P(~y|x) == RR(~y:~x)
= [1 - P( y|~x)]/[ 1 - P(y|x)] = [ P(~y,~x)/P(~y,x)].Px/(1 - Px)
= [ (1 -Py -Px +Pxy)/(Px -Pxy) ]. Px/(1-Px)
= [ (1 -Py)/(Px-Pxy) -1 ]. Px/(1-Px)
!! when Px=Pxy ie when x-->y then B(~y:~x) = oo
hence if we wish to use a B(~.:~.) instead of B( .: .),
then we must swap the events x and y. For example instead of
B( y: x) we might use :
B(~x:~y) = P(~x|~y) / P(~x|y) == RR(~x:~y)
= [ (1 -Px)/(Py-Pxy) -1 ]. Py/(1-Py)
!! when Py=Pxy ie when y-->x then B(~x:~y) = oo
W(y:x) = the weight of evidence for x if y happens/occurs/observed
= ln( P(y|x)/P(y|~x) ) = Qnec(y:x) { I.J. Good 1994, 1992 }
= ln( B(y:x) ) = logarithmic Bayes factor for x due to y
= ln(RR(y:x) ) = ln( Odds(x|y)/Odds(x) )
W(~y:~x) = the weight of evidence against x if y absent { I.J. Good }
= ln[ P(~y|~x)/P(~y|x) ] = Qsuf(y:x) { I.J. Good 1994, 1992 }
= ln[ B(~y:~x) ]
= ln[(1-P( y|~x))/(1-P( y|x))] = -W(~y:x)
W(:) = 2.atanh(F(:)) = ln((1+F)/(1-F)) for abs(F) < 1
B(a:b) = P(a|b)/P(a|~b) = (Pab/Pb)/((Pa-Pab)/(1-Pb))
= oo if Pab=Pa ie (a implies b) ie (a --> b)
B(b:a) = P(b|a)/P(b|~a) = (Pab/Pa)/((Pb-Pab)/(1-Pa))
= oo if Pab=Pb ie (b implies a)
Q:/Quiz: could comparing (eg subtracting or dividing) B(a:b) with B(b:a)
show the DIRECTION of a possible causal tendency ??
P(~b,~a) = 1 - (Pa + Pb - Pab) = P(~(a or b)) by DeMorgan's rule
B(~b:~a) = P(~b|~a)/P(~b|a) = (1 - P(b|~a))/(1 - P(b|a))
= [P(~b,~a)/(1-Pa)] / [(Pa-Pab)/Pa]
= oo if Pab=Pa ie (a implies b) like for B(a:b) or W(a:b)
which speaks against comparing ?(a:b) with ?(~b:~a) for the purpose of
deciding the direction of possible causal tendency.
C(y:x) = corroboration or confirmation measure { Popper 1972, p400, (9.2*) }
= (P(y|x) - Py )/( P(y|x) + Py - Pxy ) C-form1 { Popper 1972 }
= ( Pxy - Px.Py )/( Pxy + Px.Py - Pxy.Px) my C-form2
= (P(y|x) - P(y|~x))/( P(y|x) + Py/P(~x) ) compare w/ F-form1
= (cov(x,y)/var(x) )/( P(y|x) + Py/P(~x) ) C-form3a
= beta(y:x)/( P(y|x) + Py/P(~x) ) C-form3b
F(y:x) == F(y-->x) == F(y <== x) = "degree of factual support of x by y"
= primarily a measure of how much y implies x (my view)
= (P(y|x) - P(y|~x))/( P(y|x) + P(y|~x) ) F-form1 { Kemeny 1952 }
= ARR(x:y)/( P(y|x) + P(y|~x) ) my F-forms follow:
= ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Pxy.Px ) F-form2
= (cov(x,y)/var(x) )/( P(y|x) + P(y|~x) ) F-form3a
= beta(y:x)/( P(y|x) + P(y|~x) ) F-form3b
= tanh( 0.5*ln(P(y|x) / P(y|~x)) ) F-form4
= tanh( W(y:x)/2 ) { my tanh corrects I.J.Good's sinh }
= (difference/2)/Average = deviation/mean F-form5
= ( B(y:x) - 1 )/( B(y:x) + 1 ) is handy also for graphing F = fun(B)
= -F(y:~x)
= (Pxy/Px - (Py-Pxy)/(1-Px)) / (Pxy/Px + (Py-Pxy)/(1-Px)) hence:
= -1 if P(y|x) = 0
ie if P(y,x) = 0 ie if x, y are disjoint
= 0 if x, y are independent ;
= 1 if P(y|~x) = 0
ie if P(y,~x) = 0 ie Pxy = Py ie P(x|y)=1
ie if y implies x ie y-->x 100% ie deterministic
ie if y leads to x always
ie IF y THEN x always holds
where :
the F-form1 is the original one by { Kemeny & Oppenheim 1952 },
my F-form2 is the de-conditioned one, and it does reveal that
if Pxy=Py (see the -2 factor), ie if y --> x, then F(y:x)=1.
my F-form3 reveals an important hidden meaning: beta(y:x) is the slope
of the implicit probabilistic regression line
of Py = beta(y:x).Px + alpha(y:x) ;
the F-form4 reveals that F(:) and Turing-Good's weight of evidence W(:)
are changing co-monotonically
my F-form5 provides the most simple interpretation of F(:)
Unlike B(:) or W(:), the F(:) will not easily overflow due to /0.
The numerators tell us that for independent x, y it holds C(:) = 0 = F(:).
C(:) stresses the near independence, while F(:), W(:), B(:) stress near
implication more than near independence. Try out an example with a near
independence and simultaneously with near implication.
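The equality of the F-forms can be spot-checked; a sketch (my names,
assuming 0 < Px < 1 and P(y|~x) > 0):

```python
import math

def f_form1(px, py, pxy):
    # Kemeny's F(y:x) = (P(y|x) - P(y|~x))/(P(y|x) + P(y|~x))
    p1 = pxy / px                   # P(y|x)
    p0 = (py - pxy) / (1 - px)      # P(y|~x)
    return (p1 - p0) / (p1 + p0)

def f_form2(px, py, pxy):
    # de-conditioned form (Pxy - Px.Py)/(Pxy + Px.Py - 2.Px.Pxy)
    return (pxy - px*py) / (pxy + px*py - 2*px*pxy)

def f_form4(px, py, pxy):
    # tanh(W/2) with W = ln(P(y|x)/P(y|~x))
    p1 = pxy / px
    p0 = (py - pxy) / (1 - px)
    return math.tanh(0.5 * math.log(p1 / p0))

vals = [f(0.3, 0.4, 0.2) for f in (f_form1, f_form2, f_form4)]
```

All three return 0.4 for Px=0.3, Py=0.4, Pxy=0.2.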
F(x:y) == F(x <== y) = degree of factual support of y by x
= primarily a measure of how much x implies y
= (P(x|y) - P(x|~y))/( P(x|y) + P(x|~y) ) F-form1
= ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Py.Pxy ) my F-form2
= (Pxy/Py - (Px-Pxy)/(1-Py)) / (Pxy/Py + (Px-Pxy)/(1-Py))
= -F(x:~y)
note that if Px=Pxy then F(x:y) = 1 == (x implies y) fully, and that
this matches Pxy/Px = 1 as maximal possible contribution to the product
for P(x_j | y..y) computed by the simple Bayesian chain rule over y..y cues
for P(x_j , y..y). Clearly a product of Pxy/Px terms over a vector of cues
y..y may be viewed as a product of the simplest (x implies y) terms.
Rescaling F(:) from [-1..0..1] to [0..1/2..1] :
F0(:) = ( F(:) + 1 )/2 , so that
F0(x:y) = P(x|y)/( P(x|y) + P(x|~y) )
F0(y:x) = P(y|x)/( P(y|x) + P(y|~x) )
Before we go further, we recall that F(:) is co-monotonic with B(:), and
B(y:x) = P(y|x) / P(y|~x)
= (Pxy/Px)/( (Py - Pxy)/(1 - Px) ) {now consider INCreasing Pxy:}
wherein Pxy/Px measures how much (x implies y) up to maximum = 1
while 1/(Py - Pxy) measures how much (y implies x) up to maximum = oo
hence if Py = Pxy then B(y:x) = oo ie y implies x is measured by B(y:x)
and:
B(y:~x) = P(y|~x) / P(y|x)
= ((Py - Pxy)/(1 - Px)) /(Pxy/Px) {now consider DECreasing Pxy:}
wherein Py - Pxy measures how much (y implies x) with maximum = Py
while 1/(Pxy/Px) measures how much (x implies y) with maximum = oo
for Pxy=0
hence if Pxy = 0 then B(y:~x) = oo
F(y: x) = (P(y|x) - P(y|~x))/(P(y|x) + P(y|~x))
= -F(y:~x) = ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Px.Pxy )
F(~y:x) = (P(~y|x) - P(~y|~x))/(P(~y|x) + P(~y|~x) )
= -F(~y:~x) = ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Px.(1 -Px +Pxy) )
and the remaining 4 mirror images are easily obtained by swapping x and y.
Note that W(:) = 2.atanh(F(:)) = ln[ ( 1 + F(:) )/( 1 - F(:) ) ] .
For example F(~y:x) = -F(~y:~x) would measure how much the hypothesis
x explains the unobserved fact y , like eg in common reasoning :
"if (s)he would have the health disorder x
(s)he could NOT be able to do y (eg a body movement (s)he did)",
so that from a high enough F(~y:x) we could exclude the disorder x as
an unsupported hypothesis.
-.- +Example 1xy
for Px=0.1 , Pxy=0.1 , Py=0.2 , visualized by a squashed Venn diagram
xxxxxxxxxx
yyyyyyyyyyyyyyyyyyyy
are P(x|y)=0.5 ie "50:50" ; P(x|~y)=(0.1 -0.1)/(1 -0.2) = 0 ie minimum
P(y|x)=1.0 ie maximum ; P(y|~x)=(0.2 -0.1)/(1 -0.1) = 1/9
and
corr(x,y) = cov(x,y)/sqrt[ var(x) . var(y) ]
= (Pxy - Px.Py)/sqrt[ Px.(1-Px).Py.(1-Py) ]
=. 0.67 is the value of the correlation coefficient between the events x, y
caus1(x:y) = ( Pxy - Px.Py ) / ( 2.Px -Pxy -Px.Py ) = 1
F(x:y) = (0.5 - 0)/(0.5 + 0) = 1
B(x:y) = P(x|y)/P(x|~y) = 0.5/0 = oo = infinity
clearly the rule IF x THEN y cannot be doubted ;
but what do we get when we swap the roles of x, y ie when our observer
will view the situation from the opposite viewpoint ? This can be done
by either swapping the values of Px with Py, or by computing F(y:x) :
+Example 1yx:
is F(y:x) = (1 -1/9)/(1 +1/9) = 0.8 is too high for my taste
caus1(y:x) = (Pxy - Px.Py)/(2.Py -Pxy -Px.Py) = 0.29 is more reasonable,
as Pxy/Py = 0.5 hence y doesn't imply x much (although x implies y fully
as Pxy/Px = 1 );
B(y:x) = P(y|x)/P(y|~x) = 1/((0.2 - 0.1)/0.9) = 9 is (too) high.
!!! Conclusion: for measuring primarily an implication & secondarily
dependence, B(y:x) and F(y:x) are not ideal measures.
!!! Note: if Px < Py then x is more plausible to imply y, than vice versa;
if Py < Px then y is more plausible to imply x, than v.v.
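Examples 1xy/1yx above can be reproduced in a few lines (a sketch; the
variable names are mine):

```python
px, py, pxy = 0.1, 0.2, 0.1        # the squashed Venn diagram above

p_y_x  = pxy / px                  # P(y|x)  = 1.0, maximum
p_y_nx = (py - pxy) / (1 - px)     # P(y|~x) = 1/9

corr  = (pxy - px*py) / (px*(1-px)*py*(1-py)) ** 0.5   # =. 0.67
f_yx  = (p_y_x - p_y_nx) / (p_y_x + p_y_nx)            # = 0.8
b_yx  = p_y_x / p_y_nx                                 # = 9
caus1 = (pxy - px*py) / (2*py - pxy - px*py)           # =. 0.29
```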
+Example 3: x = drunken driver ; y = accident
P(y| x) = 0.01 = P(accident y caused by a drunken driver x)
P(y|~x) = 0.0001 = P(accident y caused by a sober driver ~x)
is how { Kahre 2002, 186 } defines it; obviously P(y|x) > P(y|~x).
Note that without knowing either Px or Py or Pxy, we cannot obtain the
probabilities needed for caus1(x:y), F(x:y) and F(~x:~y), ie we can
compute B(y:x), F(y:x) and caus1(y:x) only :
beta(y:x) = P(y|x) - P(y|~x) =. 0.01 is the regression slope of y on x ,
is misleadingly low.
B(y:x) = P(y|x)/P(y|~x) = 100 ie (y implies x) very strongly.
F(y:x) = (B-1)/(B+1) = 0.98 =. 1 = F's upper bound
F(y:x) measures how much an accident y implies drunkenness x
(obviously an accident cannot cause drunkenness).
B(~y:~x) = (1-P(y|~x))/(1-P(y|x)) = 1.01
F(~y:~x) = (B-1)/(B+1) = 0.005 =. 0 = F's point of independence
F(~y:~x) measures how much an absence of an accident y implies that
a driver is not drunk. Here is my CRITICISM of such formulas:
According to I.J. Good, "The evidence against x if y does not happen" can
also be considered as a possible measure of x causes y . It is based on
COUNTERFACTUAL reasoning "if absent y then absent x", which I denote as
!! "necessitistic" reasoning. I am dissatisfied with the sad fact that his
formulation leads to formulas which are not zero when Pxy = 0 ie when x, y
are DISjoint. If the above explained notion of Necessity is to be taken
seriously, and I think it should be, then Good's formulation is not good
enough.
F(~y:~x) == F(~y-->~x) == F(~y <== ~x)
= measures how much ~y implies ~x
= ( P(~y|~x) - P(~y| x) )/( P(~y|~x) + P(~y|x) )
= ((1-P(y|~x)) -(1-P(y| x)))/((1-P(y|~x)) +(1-P(y|x)))
= ( P(y| x) - P(y|~x) )/( 2-P(y|~x) -P(y|x) )
= ( Px.Py - Pxy )/( Px.Py + Pxy - 2.Px.(1 -Px +Pxy) ) = -F(~y:x)
= (B(~y:~x) - 1)/(B(~y:~x) + 1)
F(~x:~y) = (P(~x|~y) - P(~x|y))/( P(~x|~y) + P(~x|y) ) = -F(~x:y)
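Example 3's numbers need only the two conditionals P(y|x) = 0.01 and
P(y|~x) = 0.0001 from { Kahre 2002 }; a sketch with my variable names:

```python
p_y_x, p_y_nx = 0.01, 0.0001         # drunken vs sober driver

beta   = p_y_x - p_y_nx              # =. 0.01, misleadingly low
b_yx   = p_y_x / p_y_nx              # = 100 : y implies x strongly
f_yx   = (b_yx - 1) / (b_yx + 1)     # =. 0.98, near F's upper bound
b_nynx = (1 - p_y_nx) / (1 - p_y_x)  # =. 1.01
f_nynx = (b_nynx - 1) / (b_nynx + 1) # =. 0.005, near independence
```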
-.- +Folks' wisdom
!!! Caution: causation works in the opposite direction wrt implication. This
is so, because ideally an effect y implies a cause x, ie
a cause x is necessary for an effect y. See the short +Introduction
again. In what follows here it may be necessary to swap e, h if we want
causation. Since different folks had different mindwaves I tried not to mess
with their formulations more than necessary for this comparison.
The notions of probabilistic Necessity and Sufficiency for events have been
quantified differently by various good folks' wisdoms.
E.g. from { Kahre 2002, Figs.3.1, 13.4 + txt } follows :
if X is a subset of Z ie Z is a SuperSet of X ie all X are Z
then X is Sufficient (but not necessary) for Z ie X implies Z.
ie Z is a consequence of X ie IF X THEN Z rule holds, I say.
if Y is a SuperSet of Z ie Z is a subset of Y ie all Z are Y
then Y is Necessary (but not sufficient) for Z ie Z implies Y
ie Y is a consequence of Z ie IF Z THEN Y rule holds, I say;
For 2 numbers x, y it holds (x < y) == ( y > x) , or v.v.
For 2 sets X, Y it holds (X subset of Y) == ( Y SuperSet of X) , or v.v.
For 2 events x, y : (x implies y) entails (Py >= Px) , but NOT v.v.
For 2 events we may like to answer the Q's (and from the above follow A's) :
Q: How much is x SufFor y ? A: as much as is y NecFor x .
Q: How much is x NecFor y ? A: as much as is y SufFor x .
Q: If Pxy=0 ie x, y are disjoint ? A: then Suf = 0 = Nec must hold.
Lets use: e = evidence, effect, outcome; h = hypothesised cause (exposure)
Hence eg P(e|h) is a NAIVE, MOST SIMPLISTIC measure of Sufficiency of h for e
because P(e|h) = 1 = max if Peh = Ph ie Ph - Peh = 0 = P(~e,h) ie
if h is a subset of e, ie if h implies e then is h SufFor e. Note that:
(h SufFor e) == (h subset of e), and
(h NecFor e) == (e subset of h), hence :
P(h|e) measures how much is h NecFor e, and
P(e|h) measures how much is h SufFor e.
Lets compare these with those now corrected in { Schield 2002, Appendix A } :
P(e|h) = S = "Sufficiency of exposure h for case e"
P(h|e) = N = " Necessity of exposure h for case e" (find NAIVE , SIMPLISTIC )
!! Caution: the suffixes nec, suf in ?nec , ?suf as used by various authors
say nothing about which event is necessary for which one, if
the authors do not use ?(y:x) and do not specify what these parameters
mean. I recommend the ?(y:x) to mean that (y implies x) ie
(y SufFor x) , or (x NecFor y).
Folk1: Richard Duda, John Gaschnig & Peter Hart: Model design in the
Prospector consultant system for mineral exploration,
in { Michie 1979, 159 }, { Shinghal 1992, chap.10, 354-358 } and
in { Buchanan & Duda 1983, 191 } :
Lsuf = P( e|h)/P( e|~h) = RR( e: h) = Qnec by I.J. Good
Lnec = P(~e|h)/P(~e|~h) = 1/RR(~e:~h) = 1/Qsuf
if Lnec = 0 then e is logically necessary for h
if Lnec = large then ~e is supporting h (ie absence of e supports h)
Lsuf = Qnec , but in fact there is NO semantic confusion, since
Lsuf denotes how much is e SufFor h ie h NecFor e , and
Qnec denotes how much is h NecFor e ie e SufFor h .
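Folk1's Lsuf/Lnec and Folk3's Qnec/Qsuf below are just reciprocal ratios
of conditionals; a toy check with hypothetical values of P(e|h), P(e|~h):

```python
def rr(p1, p0):
    # generic ratio of two conditional probabilities
    return p1 / p0

p_e_h, p_e_nh = 0.8, 0.2             # hypothetical P(e|h), P(e|~h)

lsuf = rr(p_e_h, p_e_nh)             # = Qnec = RR(e:h) = 4
lnec = rr(1 - p_e_h, 1 - p_e_nh)     # = P(~e|h)/P(~e|~h) = 1/Qsuf
qsuf = rr(1 - p_e_nh, 1 - p_e_h)     # = P(~e|~h)/P(~e|h) = RR(~e:~h)
```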
Folk2: { Brian Skyrms in James Fetzer (ed), 1988, 172 }
Ssuf = P( e| h)/P( e|~h) = RR( e: h) = Lsuf
Snec = P(~h|~e)/P(~h| e) = RR(~h:~e) = [1 - P(h|~e)]/[1 - P(h|e)]
if Ssuf > 1 then h has a tendency towards sufficiency for e ;
if Snec > 1 then h has tendency towards necessity for e ;
if Ssuf.Snec > 1 then h has tendency to cause the event e .
Folk3: { I.J. Good 1994, 306 + 314 = comment by Patrick Suppes } :
Qnec = P( e| h)/P( e|~h) = RR( e: h) = Lsuf by Folk1
Qsuf = P(~e|~h)/P(~e| h) = RR(~e:~h) = 1/Lnec
Qsuf = weight of evidence against h if e does not happen
= a measure of causal tendency, says I.J. Good.
If Qnec.Qsuf > 1 then h is a prima facie cause of e, adds Suppes.
I.J. Good's late insight (delayed 50 years) is formulated thus :
!! "Qsuf(e:h) = Qnec(~e:~h). This identity is a generalization of the fact
that h is a STRICT SUFFICIENT CAUSE of e if and only if
~h is a STRICT NECESSARY CAUSE of ~e, as any example
makes clear." { I.J. Good's emphasis }
!! Qsuf(e:h) = Qnec(~e:~h) in { I.J. Good 1992, 261 }
!! Qnec(e:h) = Qsuf(~e:~h) in { I.J. Good 1995, 227 in Jarvie }
in { I.J. Good 1994, 302(28); on Suppes }
= RR( e: h) in { I.J. Good 1994, 306(40);314 by Suppes }
!! Qsuf(e:h) = RR(~e:~h) is NOT ZERO if e,h are DISjoint, as I require.
Folk4: S = P(e|h) = " sufficiency of exposure h for effect e"
N = P(h|e) = " necessity of exposure h for effect e" { Schield }
In { Schield 2002 } see Appendix A, his first lines left & right.
There in his section 2.2 on necessity vs sufficiency , Milo Schield
nicely explains their contextual semantics and applicability thus:
"Unless an effect [ e ] can be produced by a single sufficient cause [ h ]
(RARE!), producing the effect requires supplying ALL of its necessary
conditions [h_i], while preventing it [e] requires removing or eliminating
only ONE of those necessary conditions." (I added the [.]'s ).
Q: Well told, but do Schield's S and N fit his semantics ?
A: No. While S is unproblematic, his N is not.
Q: What does it mean that h is strictly sufficient for an effect e ?
A: Whenever h occurs, e occurs too. This in my formal translation
means that h implies e ie Peh = Ph ie P(e|h) = 1 ie h --> e .
Hence S = P(e|h) measures sufficiency of h for e ,
or his necessity of e for h (formally, I say).
Note: if h = bad exposure and e = bad effect, then all above fits;
if h = good treatment for e = better health, then all above fits;
other pairings would not fit meaningfully.
!! My view is this : we are interested in h CAUSES e (potentially).
P(h|e) = Sufficiency of e for h ie e implies h ie e SufFor h,
or P(x|y) = Sufficiency of y for x ie y implies x ie y SufFor x.
Sufficiency is unproblematic, so I use it as a fundament.
Q: What is Nec = necessity of h for e , really ?
A: I derive Nec from the semantical definition in { Schield 2002, p.1 }
where he writes: "But epidemiology focuses more on identifying
a necessary condition [h] whose removal would reduce undesirable outcomes
[e] than on identifying sufficient conditions whose presence would produce
undesirable outcomes." His statement between [h] and [e] I formalize (by
relying on the unproblematic Sufficiency ie on implication) thus:
(no h) implies (no e) ie "no e without h" :
~h implies ~e, hence P(~e|~h) = 1 in the ideal extreme case.
Note that generally P(~e|~h) = [ 1 - (Ph + Pe - Peh) ]/[ 1-Ph ] = 1 here
!! ie Peh = Pe ie P(h|e) = 1 in this IDEAL extreme case ONLY, while
N = P(h|e) is Schield's general necessity of h for e.
!! But in general N = P(h|e) <> P(~e|~h) which is [ see P(e|h) above ]
SUFFICIENCY of ~h for ~e, which better captures Schield's semantics.
Q: Do we need his N = P(h|e) ??
A: Not if we stick to his more meaningful (than his N ) requirement on p.1
just quoted, and opeRationalized by me thus :
!!! (Necessity of h for e) I define as (Sufficiency of ~h for ~e) == P(~e|~h)
which is a COUNTERFACTUAL: IF no h THEN no e ie "no e without h"
which is close in spirit to I.J. Good's ( see Folk3 ) verbal definition,
except for the swapped suffixes nec and suf :
Qsuf(e:h) = Qnec(~e:~h) = RR(~e:~h) in my notation as shown at Folk3 above.
"Qsuf(e:h) = Qnec(~e:~h). This identity is a generalization of the
fact that h is a STRICT SUFFICIENT CAUSE of e if and only if
~h is a STRICT NECESSARY CAUSE of ~e, as any example
makes clear." ( I.J. Good's emphasis; it took him 50 papers in 50 years ).
The semantical Necessity of h for e as removed h implies absence of e,
is my NecP = P(~e|~h), and not Schield's necessity N = P(h|e) not fitting
his opeRational definition and containing no negation ~ as a COUNTERFACTUAL
should. Hence Schield's N is now deconstructed, and can be replaced by
my constructive NecP = P(~e|~h).
Summary of SufP = S , and of my NecP constructed from Schield's
!!! opeRationally meaningful verbal requirements :
SufP = P( e| h) = sufficiency of h for e means h implies e
NecP = P(~e|~h) = necessity of h for e means ~h implies ~e
hence:
SufP = P( e| h) = necessity of ~h for ~e means h implies e
Q: is my NecP ok ?
A: Not yet, since for Peh = 0 my common sense requires S=0=N ie zero, which
precludes all P(.,.) or P(.|.) containing ~ ie a NEGation.
Fix1: IF Peh = 0 THEN NecP1 = 0 ELSE NecP1 = P(~e|~h).
Fix2: Like Suf, Nec should have Peh as a factor in its numerator, so eg:
SufP2 = P(e|h) = SufP = sufficiency of h for e ie h implies e
NecP2 = Peh.NecP = Peh.P(~e|~h) = Peh.P(~(e or h))/P(~h)
= Peh.[ 1 - (Ph +Pe -Peh) ]/(1-Ph)
= necessity of h for e
which seems reasonable, since without the factor Peh my original NecP will be
too often too close to 1 = max(P), hence poor anyway, as its form
[ 1 - P(e or h) ]/[ 1-Ph ] is near 1 for small P's .
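The SufP2/NecP2 pair above can be sketched under the definitions just
given (helper names are mine; inputs are the marginals Pe, Ph and the
joint Peh):

```python
def suf_p2(pe, ph, peh):
    # SufP2 = P(e|h): sufficiency of h for e (= 1 when h implies e)
    return peh / ph

def nec_p2(pe, ph, peh):
    # NecP2 = Peh.P(~e|~h) = Peh.[1 - (Ph + Pe - Peh)]/(1 - Ph);
    # the factor Peh forces NecP2 = 0 for disjoint e, h (Fix2)
    return peh * (1 - (ph + pe - peh)) / (1 - ph)

disjoint = nec_p2(0.2, 0.3, 0.0)   # = 0, as required
ideal    = nec_p2(0.2, 0.5, 0.2)   # Peh = Pe: P(~e|~h) = 1, so = Peh
```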
Folk5: RR(:) formulas (derived by Hajek aka JH as my criticism of Folk4 ):
RRsuf = how much h implies e
= RR(h:e) = sufficiency of h for e (not necessity because: )
= P(h|e)/P(h|~e) = [ Peh/(Ph - Peh) ].(1-Pe)/Pe
= Peh.( h implies e)/Odds(e) note that the "implies" factor 1/(Ph - Peh)
comes from P(h|~e), and that it is more influential than P(h| e).
RRnec = how much ~h implies ~e
= derived from RRsuf & my "A nonevent is an event is an event"
= RR(~h:~e)
= P(~h|~e)/P(~h|e) = [ P(~e,~h)/(P(~h) - P(~e,~h)) ].(1-P(~e))/P(~e)
! RRnec should = 0 if Peh = 0, yet RRnec <> 0 here, but it will = 0 if we use
RRnec.Peh in analogy to NecP2 at the end of Folk4 .
Since (h causes e) corresponds to (e implies h) we may have to swap e with h
in some formulas above to get the (h causes e). I have not always done it,
to keep other authors' formulas as close to their original as reasonable.
.- Finally lets look again at the relation between causation and implication.
"Rain causes us to wear a raincoat " makes sense, while
"NOT wearing a raincoat causes it NOT to rain" is an obvious NONSENSE, even
in a clean lab-like context with no shelter and our absolute unwillingness to
become wet. Let x = rain and y = wearing a raincoat.
The 1st statement translates to ( x causes y );
the 2nd statement translates to (~y causes ~x ). Because nobody knows
how to formulate a perfect operator "x causes y", we substitute it with
"y implies x" (the swapped x, y is not the point just now).
Now the 1st statement translates to ( y implies x );
the 2nd statement translates to (~x implies ~y ).
But now we are in trouble, as ( y --> x ) == ( ~x --> ~y ) in logic,
and ideally in probabilities: ( Py = Pxy ) == ( P(~x) = P(~x,~y) ) ie:
in imperfect real situations: ( Py - Pxy ) = ( P(~x) - P(~x,~y) ) ie:
Py - Pxy = (1 - Px) - (1 -(Px + Py - Pxy)) = Py - Pxy q.e.d.
Hence such a simple difference doesn't work for a cause as we would like.
What about the corresponding relative risks ?
RR(y:x) <> RR(~x:~y) ie:
P(y|x)/P(y|~x) <> P(~x|~y)/P(~x|y) ie:
(Pxy/(Py-Pxy)).((1-Px)/Px) <> ( (1-(Px+Py-Pxy))/(Py-Pxy) ).(Py/(1-Py))
where we see the key factor 1/(Py - Pxy) = (y --> x) on both sides of the <>.
Hence despite the <> , both RR's will become oo ie infinite if (y implies x)
perfectly whenever Pxy = Py. Otherwise, RR(y:x) behaves quite well:
RR increases with Pxy, and decreases with unsurprising Px, which is reasonable
as explained far above.
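The asymmetry RR(y:x) <> RR(~x:~y), together with the shared behavior at
independence, is easy to exhibit numerically (my function names):

```python
def rr_y_x(px, py, pxy):
    # RR(y:x) = P(y|x)/P(y|~x)
    return (pxy / px) / ((py - pxy) / (1 - px))

def rr_nx_ny(px, py, pxy):
    # RR(~x:~y) = P(~x|~y)/P(~x|y)
    p_nx_ny = (1 - px - py + pxy) / (1 - py)
    p_nx_y  = (py - pxy) / py
    return p_nx_ny / p_nx_y

px, py, pxy = 0.3, 0.4, 0.2
lhs = rr_y_x(px, py, pxy)          # = 7/3 =. 2.33
rhs = rr_nx_ny(px, py, pxy)        # = 5/3 =. 1.67 : the two RR's differ,
# yet both equal 1 at independence and both -> oo as Pxy -> Py
```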
Conclusion: an implication cannot substitute causation in all its aspects,
but I don't know any other necessary (but not always sufficient)
indicators of causal tendency than :
+ dependence (is symmetrical wrt x,y),
+ implication or entailment (is asymmetrical, transitive operation ie
a subset in a subset in a subset, etc).
! Caution: repeatedly find UNDESIRABLE
+ SurpriseBy
+ time-ordering (a cause precedes its effect).
++ find +Construction principles for more and sharper formulations
-.- +Acknowledgment
Leon Osinski of NL is the best imaginable chief of a scientific library.
-.- +References { refs } in CausRR
To find names on www use 2 queries, eg: "Joseph Fleiss", then "Fleiss Joseph"
the latter is the form used in all refs and in some languages.
Unlike in the titles of (e)papers here listed, in the titles of books and
periodicals all words start with a CAP, except for the insignificants like
eg.: a, and, der, die, das, for, in, of, or, to, the, etc.
Computer Journal, 1999/4, is a special issue on:
- MML = minimum message length by Chris Wallace & Boulton, 1968
- MDL = minimum description length by Jorma Rissanen, 1977
These themes are very close to Kolmogorov's complexity, originated in the
US by Occamite inductionists (as I call them) Ray Solomonoff in 1960, and
Greg Chaitin in 1968
Allan Lorraine G.: A note on measurement of contingency between two binary
variables in judgment tasks; Bulletin of the Psychonomic Society, 15/3,
1980, 147-149
Arendt Hannah: The Human Condition, 1959; on Archimedean point see pp.237
last line up to 239, 260, more in her Index
Agresti Alan: Analysis of Ordinal Categorical Data, 1984;
on p.45 is a math-definition of Simpson's paradox for events A,B,C
Agresti Alan: Categorical Data Analysis, 1st ed. 1990;
see pp.24-25 & 75/3.24 on { Goodman & Kruskal }'s TauB (by Gini )
Agresti Alan: An Introduction to Categorical Data Analysis, 1996
Alvarez Sergio A.: An exact analytical relation among recall, precision,
and classification accuracy in information retrieval, 2002,
http://www.cs.bc.edu/~alvarez/APR/aprformula.pdf
Anderberg M.R.: Cluster Analysis for Applications, 1973
Bailey N.T.J.: Probability methods of diagnosis based on small samples;
Mathematics and Computer Science in Biology and Medicine, 1964/1966, Oxford,
pp.103-110
Bar-Hillel Yehoshua: Language and Information, 1964; Introduction tells that
the original author of their key paper in 1952 (chap.15,221-274) was in fact
Rudolf Carnap who was the 1st author despite B < C, but B-H doesn't tell it
Bar-Hillel Yehoshua: Semantic information and its measures, 1953, in the book
Cybernetics, Heinz von Foerster (ed), 1955, pp.33-48 + 81-82 = refs
Bar-Hillel Yehoshua, Carnap Rudolf: Semantic information, pp.503-511+512 in
the book Communication Theory, 1953, Jackson W. (ed); also in the
British Journal for the Philosophy of Science, Aug.1953. It is much shorter
than the 1952 paper reprinted in Bar-Hillel, 1964, 221-274
Baeyer Hans Christian von: Information - The New Language of Science, 2003, and
2004, Harvard University Press, which I helped to correct. The final part is
"Work in progress", starting with the chap.24 = "Bits, bucks, hits and nuts -
information theory beyond Shannon", is about Law of Diminishing Information
( LDI ) from { Kahre 2002 } (I found LDI and Ideal receiver in { Woodward
1953, 58-63 } ), where Von Baeyer mentions my "Wheeleresque war cry
Hits before bits". In fact I started with Anglo-Saxonic "Hits statt bits" ie
"Hits over bits"; it could have been "Hits ueber bits" :-)
Baeyer Hans Christian von: Nota bene; The Sciences, 39/1, Jan/Feb. 1999, 12-15
Bishop Yvonne, Fienberg Stephen, Holland Paul: Discrete Multivariate Analysis
1975; on pp.390-392 their TauR|C is TauB from { Goodman & Kruskal 1954 },
which I recognized to be a normalized quadratic entropy Cont(Y:X)/Cont(X).
See also { Blalock 1958 }, find TauB
Blachman Nelson M.: Noise and Its Effects on Communication, 1966
Blachman Nelson M.: The amount of information that y gives about X,
IEEE Trans. on Information Theory, IT-14, Jan.1968, 27-31
Blalock Hubert M.: Probabilistic interpretations for the mean square
contingency, JASA 53, 1958, 102-105; he did not realize, but I did, that
TauB(Y:X) = Cont(Y:X)/Cont(X) vs Phi^2 = ( X^2 )/N (find Phi^2 )
TauB in { Goodman & Kruskal , Part 1, 1954, 759-760 } (find TauB )
= TauR|C in { Bishop , Fienberg & Holland, 1975, 390-392 } (find TauR|C )
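The TauB identity noted above can be sketched in Python (my own illustrative
code, not code from Blalock or Goodman & Kruskal): TauB for predicting the
column variable X from the row variable Y equals the normalized quadratic
("Gini") entropy reduction Cont(Y:X)/Cont(X).

```python
def cont(p):
    """Quadratic (Gini) entropy of a distribution: 1 - sum of p^2."""
    return 1.0 - sum(q * q for q in p)

def tau_b(table):
    """Goodman & Kruskal's TauB for predicting the column variable X
    from the row variable Y; table[y][x] holds joint counts."""
    n = float(sum(sum(row) for row in table))
    ncols = len(table[0])
    px = [sum(row[x] for row in table) / n for x in range(ncols)]  # marginal X
    py = [sum(row) / n for row in table]                           # marginal Y
    # expected quadratic entropy of X given Y, ie Cont(X|Y)
    cont_x_given_y = sum(
        py[y] * cont([table[y][x] / sum(table[y]) for x in range(ncols)])
        for y in range(len(table)) if py[y] > 0.0)
    # Cont(Y:X) = Cont(X) - Cont(X|Y), normalized by Cont(X)
    return (cont(px) - cont_x_given_y) / cont(px)
```

Eg tau_b([[30,10],[10,50]]) gives 49/144, ie about 0.34, while an independent
table such as [[20,20],[30,30]] gives 0.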
Blalock Hubert M.: Causal Inferences in Nonexperimental Research, 1964;
start at p.62, on p.67 is his partial correlation coefficient
Blalock Hubert M.: An Introduction to Social Research, 1970; on p.68 starts
Inferring causal relationships from partial correlations
Brin Sergey, Motwani R., Ullman Jeffrey D., Tsur Shalom: Dynamic itemset
counting and implication rules for market basket data; Proc. of the 1997
ACM SIGMOD Int. Conf. on Management of Data, 255-264; see www .
Sergey Brin is the co-founder of Google
Buchanan Bruce G., Duda Richard O.: Principles of rule-based expert systems;
in Advances in Computers, 22, 1983, Yovits M. (ed)
Cheng Patricia W.: From covariation to causation: a causal power theory;
(aka "power PC theory"), Psychological Review, 104, 1997, 367-405;
! on p.373 right mid: P(a|i) =/= P(a|i) should be P(a|i) =/= P(a|~i)
Recent comments & responses by Patricia Cheng and Laura Novick are in
Psychological Review, 112/3, July 2005, pp.675-707.
Cheng Yizong, Kashyap Rangasami L.: A study of associative evidential
reasoning; IEEE Trans. on Pattern Analysis and Machine Intelligence,
11/6, June 1989, 623-631
Cohen Jonathan L.: Knowledge and Language, 2002;
! on p.180 in the eq.(13.8) both D should be ~D
DeWeese M.R., Meister M.: How to measure the information gained from one
symbol, Network: Computation Neural Systems 10, 1999, p.328.
They partially reinvented Nelson Blachman's fine work (in my refs)
Duda Richard, Gaschnig John, Hart Peter: Model design in the Prospector
consultant system for mineral exploration; see p.159 in { Michie 1979 }
Ebanks Bruce R.: On measures of fuzziness and their representations,
Journal of Mathematical Analysis and Applications, 94, 1983, 24-37
Eddy David M.: Probabilistic reasoning in clinical medicine: problems and
opportunities; in { Kahneman 1982, 249-267 }
Eells Ellery: Probabilistic Causality, 1991
Feinstein Alvan R.: Principles of Medical Statistics, 2002, by a professor
of medicine at Yale, who studied math & medicine;
chap.10, 170-175 are on proportionate increments, on NNT NNH , on honesty
vs deceptively impressive magnified results.
Chap.17, 332,337-340 are on fractions, rates, ratios OR(:), risks RR(:).
! On p.340 is a typo: the etiologic fraction should be e(r-1)/[e(r-1)+1];
! on p.444, eq.21.15 for negative likelihood ratio LR- should be
(1-sensitivity)/specificity; above it should be (c/n1)/(d/n2)
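The corrected likelihood ratios can be sketched in Python (my illustrative
code, not Feinstein's): from a 2x2 diagnostic table, LR+ and the corrected
LR- = (1-sensitivity)/specificity.

```python
def likelihood_ratios(tp, fn, fp, tn):
    """2x2 diagnostic table: tp,fn count the diseased; fp,tn the
    non-diseased. Returns (LR+, LR-)."""
    sensitivity = tp / float(tp + fn)   # P(test+ | disease)
    specificity = tn / float(fp + tn)   # P(test- | no disease)
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity   # the corrected eq.21.15
    return lr_pos, lr_neg
```

Eg for tp=90, fn=10, fp=20, tn=80: sensitivity 0.9, specificity 0.8,
LR+ = 4.5, LR- = 0.125.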
De Finetti Bruno: Probability, Induction, and Statistics, 1972
Finkelstein Michael O., Levin Bruce: Statistics for Lawyers, 2001
Fitelson Branden: Studies in Bayesian confirmation theory, Ph.D. thesis, 2001,
on www, where sinh should be tanh (I told him)
Fleiss Joseph L., Levin Bruce, Myunghee Cho Paik: Statistical Methods for
Rates and Proportions, 3rd ed., 2003. In their Index "relative difference"
(also in earlier editions) is my RDS
Gigerenzer Gerd: Adaptive Thinking; see his other fine books & papers
Glymour Clark: The Mind's Arrows - Bayes Nets and Graphical Causal Models in
Psychology, 2001
Glymour Clark, Cheng Patricia W.: Causal mechanism and probability:
a normative approach, pp.295-313 in Oaksford Mike & Chater Nick (eds):
Rational Models of Cognition, 1998.
! On p.305 eq.14.6 isn't just a "noisy And gate" as it is asymmetrical wrt its
! inputs; my new term for it is "noisy AndNot gate" (find INHIBITION ) because
! u.(1-x) = u - ux is the numerical (u AndNot x) for independent u, x
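The "noisy AndNot gate" reading above can be sketched in Python (my
illustrative code, not from Glymour & Cheng): for independent events with
probabilities u and x, u.(1-x) = u - ux is P(u And Not x), and unlike the
symmetrical noisy And gate u.x it is asymmetrical wrt its two inputs.

```python
def noisy_and(u, x):
    """Numerical (u And x) for independent u, x; symmetrical."""
    return u * x

def noisy_and_not(u, x):
    """Numerical (u AndNot x) = u*(1-x) = u - u*x for independent u, x;
    asymmetrical: u generates, x inhibits (find INHIBITION)."""
    return u * (1.0 - x)
```

Eg noisy_and_not(0.8, 0.3) = 0.56 but noisy_and_not(0.3, 0.8) = 0.06,
while noisy_and(0.8, 0.3) = noisy_and(0.3, 0.8).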
Good I.J. (Irving Jack), born 1916 in London as "Isidore Jacob Gudak" who
unlike Good is findable on www.
! My W(y:x) is his old W(x:y), similarly with B(:), F(:); since 1992 he has
switched to my safer notation
Good I.J.: Legal responsibility and causation; pp.25-59 in the book Machine
Intelligence 15, 1999, K. Furukawa, ed. See Michie in this vol.15
Good I.J.: The mathematics of philosophy: a brief review of my work; in
Critical Rationalism, Metaphysics and Science, 1995, Jarvie I.C. & Laor N.
(eds), 211-238
Good I.J.: Causal tendency, necessitivity and sufficientivity: an updated
review; pp.293-315 in "Patrick Suppes: Scientific Philosopher", vol.1,
P. Humphreys (ed), 1994; Suppes comments on pp.312-315.
I.J. explains his surprisingly late insights (delayed 50 years) into the
semantics of two W(:)'s, renamed by him to Qnec(y:x) , Qsuf(y:x) , like
mine ?(y:x) here, ie no more as his old W(x:y)
Good I.J.: Tendencies to be sufficient or necessary causes, 261-262 in
Journal of Statistical Computation and Simulation, 44, 1992. This is a
preliminary note on Good's belated insight of 1992-1942 = 50 years delayed
Good I.J.: Speculations concerning the future of statistics, Journal
of Statistical Planning and Inference, 25, 1990, 441-466
Good I.J.: Abstract of "Speculations concerning the future of statistics",
The American Statistician, 44/2, May 1990, 132-133
Good I.J.: On the combination of pieces of evidence; Journal of Statistical
Computation and Simulation, 31, 1989, 54-58; followed by "Yet another
argument for the explicatum of weight of evidence" on pp.58-59
Good I.J.: The interface between statistics and philosophy of science;
Statistical Science, 3/4, 1988, 386-412;
for W(:) see 389-390, 393-394 left low! + discussion & rejoinder p.409
Good I.J.: Good Thinking - The Foundations of Probability and Its
Applications, 1983, University of Minnesota Press. It reprints (and lists)
a fraction of his 1500 papers and notes written until 1983.
! on p.160 up: sinh(.) should be tanh(.) where { Kemeny & Oppenheim's }
degree of factual support F(:) is discussed
Good I.J.: Corroboration, explanation, evolving probability, simplicity and
a sharpened razor ; British Journal for the Philosophy of Science, 19,
1968, 123-143
Goodman Leo A., Kruskal William H.: Measures of Association for Cross
Classifications, 1979. Originally published under the same title in the
Journal of the American Statistical Association ( JASA ), parts 1-4:
part 1 in vol.49, 1954, 732-764; TauB on 759-760
part 2 in vol.54, 1959, 123-163;
part 3 in vol.58, 1963, 310-364; TauB on 353-354
part 4 in vol.67, 1972, - ; TauB in sect.2.4
See { Kruskal 1958 } for ordinal measures
Goodman Steven N.: Toward evidence-based medical statistics. Two parts:
1. The P value fallacy, pp. 995-1004; 2. The Bayes factor, 1005-1013;
discussion by Frank Davidoff: Standing statistics right up, 1019-1021;
all in Annals of Internal Medicine, 1999
Grosof Benjamin N.: Evidential confirmation as transformed probability;
pp.153-166 in Uncertainty in Artificial Intelligence, Kanal L.N. & Lemmer
J.F. (eds), vol.1, 1986. I found that on p.159 his
! B == (1+C)/2 is in fact the rescaling as in { Kemeny 1952, p.323 }, whose
last two lines lead to F(:) rescaled on the first lines of p.324,
findable here & now as F0(:)
Grune Dick: How to compare the incomparable, Information Processing Letters,
24, 1987, 177-181
Heckerman David R.: Probabilistic interpretations for MYCIN's certainty
factors; pp.167-196 in Uncertainty in Artificial Intelligence, L.N. Kanal
and J.F. Lemmer (eds), vol.1, 1986. I succeeded in rewriting his eq.(31)
! for the certainty factor CF2 on p.179 as Kemeny's F(:).
Heckerman has more papers in other volumes of this series of proceedings
Hempel C.G.: Aspects of Scientific Explanation, 1965; pp.245-290 are chap.10,
Studies in the logic of explanation; reprinted from Philosophy of Science,
15 (the 1948 paper with Paul Oppenheim).
Hesse Mary: Bayesian methods; in Induction, Probability and Confirmation,
1975, Minnesota Studies in the Philosophy of Science, vol.6
Kahn Harold A., Sempos Ch.T.: Statistical Methods in Epidemiology, 1989
Kahneman Daniel, Slovic P., Tversky Amos (eds): Judgment Under Uncertainty:
Heuristics and Biases, 1982. Kahneman won Nobel Prize (economics 2002) for
30 years of this kind of work with the late Amos Tversky
Kahre Jan: The Mathematical Theory of Information, 2002.
To find in his book formulas like eg Cont(.) use his special Index on
pp.491-493. See www.matheory.info for errata + more.
! on p.120 eq(5.2.8) is P(x|y) - Px = Kahre's korroboration, x = cause,
! on p.186 eq(6.23.2) is P(y|x) - Py, risk is no corroboration; y = evidence
Kemeny John G., Oppenheim Paul: Degree of factual support; Philosophy of
Science, 19/4, Oct.1952, 307-324. The footnote 1 on p.307 tells that Kemeny
! was de facto the author. Caution: on pp.320 & 324 his oldfashioned P(.,.)
is our modern P(.|.). On p.324 the first two lines should be bracketized
! thus: P(E|H)/[ P(E|H) + P(E|~H) ], which is findable here & now as F0( .
An excellent paper!
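The bracketized form above can be sketched in Python (my notation, my
illustrative code): with a = P(E|H) and b = P(E|~H), Kemeny & Oppenheim's
degree of factual support is F = (a-b)/(a+b), and F0 = a/(a+b) is its
rescaling from [-1,1] to [0,1] via F0 = (1+F)/2, the same rescaling as
Grosof's B == (1+C)/2 (find Grosof).

```python
def F(a, b):
    """Kemeny & Oppenheim's degree of factual support, in [-1, 1];
    a = P(E|H), b = P(E|~H)."""
    return (a - b) / (a + b)

def F0(a, b):
    """Rescaled support P(E|H)/[ P(E|H) + P(E|~H) ], in [0, 1];
    algebraically F0 = (1 + F)/2."""
    return a / (a + b)
```

Eg a = 0.8, b = 0.2 gives F = 0.6 and F0 = 0.8 = (1 + 0.6)/2.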
Kemeny John G.: A logical measure function; Journal of Symbolic Logic, 18/4,
Dec.1953, 289-308. On p.307 in his F(:) there are missing negation bars ~
! over H's in both 2nd terms. Except for p.297 on Popperian elimination of
models (find SIC now), there is no need to read this paper if you read his
much better one of 1952
Kendall M.G., Stuart A.: The Advanced Theory of Statistics, 1977, vol.2.
Khoury Muin J., Flanders W. Dana, Greenland Sander, Adams Myron J.:
On the measurement of susceptibility in epidemiologic studies;
American Journal of Epidemiology, 129/1, 1989, 183-190
Kruskal William H.: Ordinal measures of association, JASA 53, 1958, 814-861
Laupacis A., Sackett D.L., Roberts R.S.: An assessment of clinically useful
measures of the consequences of treatment; New England Journal of Medicine
( NEJM ), 1988, 318:1728-1733
Lucas J.R., Hodgson P.E.: Spacetime and Electromagnetism, 1990;
pp.5-13 on regraduation of speeds to rapidities
Lusted L.B.: Introduction to Medical Decision Making, 1968
Michie Donald: Adapting Good's Q theory to the causation of individual
events; pp.60-86 in Machine Intelligence 15, Furukawa K., Michie D. and
Muggleton S. (eds). Aged 18 during WWII, Michie was the youngest codebreaker,
assisting I.J. Good who was Alan Turing's statistical assistant
Michie Donald (ed): Expert Systems in the Micro Electronic Age, 1979
Norwich Kenneth: Information, sensation, and perception, 1993
Novick Laura R., Cheng Patricia W.: Assessing interactive causal influence;
Psychological Review, 111/2, 2004, 455-485 = 31 pp! See { Cheng P.W. 1997 }
Pang-Ning Tan, Kumar Vipin, Srivastava Jaideep: Selecting the right
interestingness measure for association patterns; kdd2002-interest.ps
Pearl Judea: Causality: Models, Reasoning, Inference, 2000; see at least
pp.284,291-294,300,308; his references to Shep should be Sheps, and on
! p.304 in his note under tab.9.3, ERR = 1 - P(y|x')/P(y|x) would be correct;
! this error is not in Pearl's ERRata on www
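The corrected formula can be sketched in Python (my illustrative code, not
Pearl's): the excess risk ratio ERR = 1 - P(y|x')/P(y|x), which equals
(RR - 1)/RR for the relative risk RR = P(y|x)/P(y|x').

```python
def relative_risk(p_y_x, p_y_x_prime):
    """RR = P(y|x) / P(y|x')."""
    return p_y_x / p_y_x_prime

def err(p_y_x, p_y_x_prime):
    """Excess risk ratio ERR = 1 - P(y|x')/P(y|x) = (RR - 1)/RR."""
    return 1.0 - p_y_x_prime / p_y_x
```

Eg P(y|x) = 0.4 and P(y|x') = 0.1 give RR = 4 and ERR = 0.75 = (4-1)/4.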
Popper Karl: The Logic of Scientific Discovery, 6th impression (revised),
March 1972; new appendices, on corroboration Appendix IX to his original
Logik der Forschung, 1935, where in his Index: Gehalt, Mass des Gehalts =
Measure of content (find SIC ). His oldfashioned P(y,x) is modern P(y|x)
Renyi Alfred: A Diary on Information Theory, 1987. 3rd lecture discusses
asymmetry and causality on pp.24-25+33
*Renyi Alfred: Selected papers of Alfred Renyi, 1976, 3 volumes
Renyi Alfred: New version of the probabilistic generalization of the large
sieve, Acta Mathematica Academiae Scientiarum Hungaricae, 10, 1959, 217-226;
on p.221 his correlation coefficient R between events is also in { Kemeny
& Oppenheim, 1952, p.314, eq.7 }
Rescher N.: Scientific Explanation, 1970. See pp.76-95 for the chap.10 =
The logic of evidence, where his Pr(p,q) actually means P(p|q). Very nice
methodology of derivation, but the result is not spectacular.
! Note that on p.84 he suddenly switches from Pr(p|q) to Pr(q|p). Why?
Rijsbergen C.J. van: Information Retrieval, 2nd ed., 1979
Rothman Kenneth J., Greenland Sander: Modern Epidemiology, 2nd ed., 1998
Sackett David L., Straus Sharon, Richardson W. Scott, Rosenberg William,
Haynes Brian: Evidence-Based Medicine - How to Practice EBM, 2nd ed, 2000.
There is a Glossary of EBM terms, and Appendix 1 on Confidence intervals
! ( CI ), written by Douglas G. Altman of Oxford, UK. I reported 12 typos,
! most of them in CI formulas.
! 30+ bugs or typos are on http://www.cebm.utoronto.ca/search.htm
Schield Milo, Burnham Tom: Confounder-induced spuriosity and reversal
for binary data: algebraic conditions using a non-interactive linear model;
2003, on www (slides nearby)
Schield Milo, Burnham Tom: Algebraic relationships between relative risk,
phi and measures of necessity and sufficiency; ASA 2002; on www.
Find NAIVE , SIMPLISTIC here & now.
Their Phi = Pearson correlation coefficient r (find r2 ),
not sqrt( Phi^2 ) eg from { Blalock 1958 } (find Phi^2 ), and
not Pcc = Pearson contingency coefficient (find Pcc )
Schield Milo: Simpson's paradox and Cornfield's conditions; ASA 1999; on www;
an excellent multi-angle explanation of confounding, a very important
subject seldom or poorly explained in books on statistics.
His section 8 can be complemented by reading { Agresti 1984, p.45 } for a
definition of Simpson's paradox for events A, B, C
Shannon Claude E., Weaver Warren: The Mathematical Theory of Communication,
1949; 4th printing, Sept.1969, University of Illinois Press. Printings may
differ in page numbering.
His original paper was: A mathematical theory of communication, 1948, in
2 parts, Bell System Technical Journal. Compare his titles with the book
by { Kahre }
Sheps Mindel C.: An examination of some methods of comparing several rates
or proportions; Biometrics, 15, 1959, 87-97
Sheps Mindel C.: Shall we count the living or the dead?;
New England Journal of Medicine ( NEJM ), 1958, 259:1210-1214
Shinghal R.: Formal Concepts in Artificial Intelligence, 1992; see chap.10
on Plausible reasoning in expert systems, pp.347-389, nice tables on
! pp.355-7, in Fig.10.3 the necessity should be N = [1-P(e|h)]/[1-P(e|~h)];
! on p.352, just above 29., in the mid term (...) of the equation,
both ~e should be e, as in section 10.2.11
Simon Herbert: Models of Man, 1957. See pp.50-51+54
Simon Steve: http://www.childrens-mercy.org/stats is a fine infokit
Stoyanov J.M.: Counterexamples in Probability, 1987
Suppes Patrick: A Probabilistic Theory of Causality, 1970
Tversky Amos, Kahneman Daniel: Causal schemas in judgments under uncertainty;
in { Kahneman 1982, 117-128 }
Vaihinger Hans: Die Philosophie des Als Ob (The Philosophy of "As If"), 1923
Weaver Warren: Science and Imagination, 19??; the section on "Probability,
rarity, interest and surprise" has originally appeared in Scientific Monthly,
LXVII ie 67, no.6, Dec.1948, 390-???
Woodward P.M.: Probability and Information Theory, with Applications to Radar,
1953; the 1964 edition adds chap.8
-.-