.- -.- |- -+- `` Probabilistic causation indicated by relative risk, attributable risk and by formulas of I.J. Good, Kemeny, Popper, Sheps/Cheng, Pearl and Google's Brin, for data mining, epidemiology, evidence-based medicine, economy, investments or Causal INSIGHTS INSIDE for data mining to fight data tsunami and confounding CopyRight (C) 2002-2007, Jan Hajek , NL, version 3.02 of 2007-5-2 NO part of this document may be published, implemented, programmed, copied or communicated by any means without an explicit & full reference to this author + the full title + the website www.humintel.com/hajek for the freshest version + the CopyRight note in texts and in ALL references to this. An implicit, incomplete, indirect, disconnected or unlinked reference (in your text and/or on www) does NOT suffice. All based on 1st-hand experience. All rights reserved. This file + has lines < 80 chars (+CrLf) in ASCII + has facilities for easy finding & browsing + reads better than 2 columns texts which must be paged dn & up again & again + may read even better (more of left margin, PgDn/Up) outside your email, and/or if you change its name.txt to name.wri + like other texts in MS-Explorer, it is better to do Find in page backwards + can be compared with its previous version if you save it and use a files differencer to see only where the versions differ; eg download Visual Compare VC154.ZIP and run it as VCOMP vers1 vers2 /i /k which is the best line-by-line files comparer (3 colors) for .txt or .wri files + contains math functions graphable with David Meredith's XPL on www + your comments (inserted into this .txt file) are welcome. -.- Associative thinkers-browsers may like to repeatedly find the following: (single spacing indicates semantical closeness) !!! !! ! ?? ? { refs } Q: bound complement block AndNot --> nImp( impli 0/0 /. ./ RDS Sheps TauB Hajek NNR dNNR HF PF GF PS PN Pearl Cheng EBM JH INDEPendentIMPlication paradox CausedBy NecFor SufFor 3angle q.e.d. ?( asymmetr attributable B( B(~ Perr Bayes factor beta :-) as-if boost confound Cornfield Gastwirth caution chain conjecture :-( coviction Brin Google Conv( Conv1 Conv2 Conv3 corr( correl contingency SIC Gini

Cont( caus1( causa causes code Cofa Cofa0 Cofa1 CI confidence conversion cov( Popper confirm corroborat C( K( korroborat Kahre counterfact degree depend inpedend 17th entrop error etiologic example expos Folk fuzzy B( F( F(~ F0( factual support Kemeny I.J. Good Gini hypothe independ infinit oo NNT NNH NNS costs effort nonevent noisy key LikelyThanNot likelihood LR meaning mislead M( MDL MML monoton necess suffic Occam odds( PARADOX Pearson Phi^2 princip proper ratio relative risk RR( RR(~ r2 refut relativi rapidit regraduat remov rule naive Schield simplistic sense SeLn SIC slope Shannon softmax Spinoza symmetr Venn 2x2 table 5x2 tendency triviality variance regress range scale tanh typo UNDESIRABLE weigh evidence W( W(~ www Bonferroni Inclusion-Exclusion DeMorgan opeRation -log( exagger ChiSqr( student! -.- separates sections .- separates (sub)sections Venn diagrams table |- -.- +Contents: (find a +Term to find a section) +Who might like to read this epaper +Intro: the duality of causal necessity and sufficiency +Epicenter of this epaper with key insights inside +Contrasting formulas aka measures of impact !!! +New conversions between many measures +Contemplating the bounds of some measures +Construction principles P1: to P7: of good association measures +Key elements of probabilistic logic and simple candidates for causation K0: to K4: C1: to C4: +Dissecting RR(:) LR(:) OR(:) for deeper insights ! +3angle inequalities combined yield new JH-bounds for 3 events x,y,z !!! +The simplest thinkable necessary condition for CONFOUNDING +Notation, basic tutorial insights, PARADOX of IndependentImplication !!! +Interpreting a 2x2 contingency table wrt RR(:) = relative risk = risk ratio ! see my squashed Venn diagrams +More on probabilities +Tutorial notes on probabilistic logic, entropies and information +Google's Brin's conviction Conv( , my nImp( , AndNot +Rescalings important wrt risk ratio RR(:) aka relative risk +Correlation in a 2x2 contingency table +Example ( example finds more examples ) +Folks' wisdom +Acknowledgment +References -.- +Who might like to read this epaper This epaper started as notes to myself ( Descartes called them Cogitationes privatae). Now it is a much improved version of my original draft tentatively titled "Data mining = fighting the data tsunami : When & how much an evidential event y INDICATES x as a hypothesised cause, for doctors, engineers, investors, lawyers, researchers and scientists", who all should be interested in this stuff. This epaper is primarily targeted at British-style empiricists or BE's (sounds better than BSE :-). Continental Rationalists (CR's) a la Descartes, Leibniz, Spinoza prefer to apply deductive analytical methods to splendidly isolated and well defined problems, while BE's a la Locke, Berkeley, Hume are not afraid of using inductive inferential/experimental/observational methods even on messy tasks in biostatistics, econometry, medicine, military and social domains. BE's credo is Berkeley's "Esse est percipi". CR's credo is Descartes' "Cogito ergo sum". -.- +Intro: the DUALITY of causal Necessity and Sufficiency When confronted with events, and events happen all the time, humans ask about and search for inter-event relationships, associations, influences, reasons, and causes, so that predictions, remedies and decision-making may be learned from the past experiences of such or similar events. 
To find a cause, an explanation, or a remedy is the ultimate goal, the Holy Grail of advisors, analists, attorneys, barristers, doctors, engineers, investigators, investors, lawyers, philosophers, physicians, prosecutors, researches, scientists, and in fact of all wonderful expert human beings like you and me, who use or just think the words "because", "due to", and "if-then". David Hume (1711-1776) used to say that the "causation is the cement of the Universe". The nobelist Max Planck (1858-1947) wrote { Kahre 2002, 187 }: "Causation is neither true nor false, it is more a heuristic principle, a guide, and in my opinion clearly the most valuable guide that we have to find the right way in the motley hotchpotch [= bunten Wirrwarr], in which scientific research must take place, and reach fruitful results." One man's mechanism is another man's black box, wrote Patrick Suppes. I say: One man's data is another woman's noise, and one man's cause is another woman's effect, eg: smoking = u ...> (x1 = ill lungs and/or x2 = ill heart) ...> death = y but your coroner will not say that smoking was the cause of your death. gene ....> hormone ...> symptom ; or if we view the notion of specific illness as-if real (in fact its name is an abstraction), then eg: gene ....> illness ...> symptom . In this causal chain a researcher may see an illness as an effect caused by genes, while a physician, GP or clinician, sees it as a cause of a symptom, eg a pain in the neck to be removed or at least suppressed. Cause-effect relationships are relative wrt to the observer's frame of view, like Einstein would have loved to say. Like an implication or entailment, causation is supposed to be TRANSITIVE : if x causes y & y causes z then x caused z if x <= y & y <= z then x <= z The <= is 'less or equal','subset of','entails' or 'implies' if x,y,z are numbers, sets, or Boolean logical operands. The <= works on False, True represented (eg internally) as numbers 0, 1. Here <= is for numbers, <== is for sets or Booleans, but I use the more human --> y-->x == (y <== x) == (y subset of x) == (y implies x) == (y entails x) !! == ~(y & ~x) == (~x -->~y) == ~(~x & y) == (x or ~y) by DeMorgan ! == ~(y AndNot x) == ~(y,~x) == ~(y UnlessBlockedBy x) in plaintalk ! y-->x == (~x-->~y) looks nice, but is (find:) UNDESIRABLE for causation , because "the rain causes us to wear raincoat" makes SENSE, while the statement "not wearing a raincoat causes no rain" is a NONSENSE; find raincoat . Find => for more on --> (the => is meaningless here, in Pascal too); the >= is 'greater or equal' (in Pascal also SuperSet of). Translated into probabilistic logic : P(y-->x) = P(~(y,~x)) = 1 -(Py - Pxy) = P(~(y AndNot x)) ie (y-->x) is a DECreasing function of (Py - Pxy), hence (y implies x) is at its MAXimum when (Py - Pxy)=0 ie Py=Pxy ie whenever is (y subset of x) ie (y entails x). Causation is like the 2-faced Roman god Janus (January is named after him). One face is SUFiciency, the other facet is NECessity. They go together, they are 2 components of causation, something like the real and imaginary parts of a complex number. This analogy is not too bad, since necessity is often based on imagined CounterFactual reasoning, ie on what WOULD BE IF (in German: was waere wenn; also find as-if ) the situation WOULD NOT BE as it is the factual one { Pearl 2000, 284 }. 
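The translation into probabilistic logic above is easy to check numerically.
A minimal Python sketch (ad hoc function names, not code from KX or Acaus3),
assuming only the inputs Px, Py, Pxy with Pxy <= min(Px,Py):

def p_y_implies_x(Py, Pxy):
    # P(y --> x) = P(~(y,~x)) = 1 - (Py - Pxy); = 1 only when Pxy = Py ie y subset of x
    return 1.0 - (Py - Pxy)

def p_notx_implies_noty(Px, Py, Pxy):
    # P(~x --> ~y) = 1 - (P(~x) - P(~x,~y)) where P(~x,~y) = 1 - (Px+Py-Pxy) by DeMorgan
    return 1.0 - ((1.0 - Px) - (1.0 - (Px + Py - Pxy)))

Px, Py, Pxy = 0.6, 0.3, 0.25          # toy proportions, Pxy <= min(Px, Py)
assert abs(p_y_implies_x(Py, Pxy) - p_notx_implies_noty(Px, Py, Pxy)) < 1e-12
print(p_y_implies_x(Py, Pxy))         # 0.95 ; reaches 1.0 only if Pxy = Py ie y --> x 100%

The numerical equality of both directions is exactly the equality which is
UNDESIRABLE for causation (find raincoat ).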
The duality of Sufficiency and Necessity is easily visualized by Venn diagrams (pancakes or pizzas diagrams for kids :-) : |---------------------------- | Universe of discourse = 1 | | __________________ | 100% overlap ie Py-Pxy = 0 : | | | | y-->x 100% implication is the extreme | | P(x,~y) < Px | _________________ | | _____________|____ | | | for an {find:} archer-) | | | | | | | Px hitting x is NECessary | | | Pxy > 0 | | | | _____________ | for y to be hit | |___|____________| | | | | | | find NECfor | | P(y,~x) < Py | | | 0< Pxy = Py | hitting y is SUFficient | |________________| | | |___________| | for x being hit too |___________________________| |_______________| find SufFor My approach to causation is based on probabilistic logic, with emphasis on the operation of IMPLication aka entailment. A viewpoint of mine is that !! causation works in the DIRECTION OPPOSITE to y-->x . This is so !! because ideally the observed effect y implies x as an UNobservable cause, !!! while a cause x is NECessary for the effect y ie x NecFor y . Removing a cause x will ideally remove effect y . Note that an inference rule : IF effect ie evidence THEN hypothesised cause (eg an exposure) is reflected in the: evidence IMPLIES hypothetical cause (eg a treatment), while the causation goes in the opposite direction: An exposure may cause an effect or evidence. Hence we must be careful about the assigned meanings and about directions of arrows and notations like (x:y) , (y:x) , (~y:~x) , (~x:~y) , find ?( Baeyer . Many cues or predictors are symptoms caused by a health disorder, but some cues are surely the causes of an illness, so eg: IF (wo)man THEN "a (fe)male disorder is likely" makes sense, but it would be foolish to think that a disorder caused a human to be a (wo)man. Although IF (fe)male disorder THEN (wo)man, is correct, it (usually) is pointless. -.- +Epicenter of this epaper with key insights inside .- +Contrasting formulas aka measures of impact ARR LR OR RR RRR PAR NNT NNH are abbreviations almost standard in EBM . ACR ARH ARX RDS PF GF HF dNNR NNR are my non-words, hence easy to find. It is important to use notations preventing errors of thought & typos (find Baeyer ). Over too many years too many folks used (and switched to) too many notations; to avoid confusion find ?(y:x) vs ?(x:y) in +Notation for their meanings. x = h = exposure, hypothesised cause, conjecture ; y = e = effect, evidence. Right now it may help to find +MicroTutorial and read a byte or bit(e) of it. Too many measures of statistical association were (re)invented under even more all too suggestive names. Easy is { Feinstein 2001, sect.17.6, 171-175, 337 -340 }, derivations in { Fleiss 2003, 123-133, 151-163 }, insights in both. All measures capture statistical dependence, often as functions of ARR(:) or of RR(:) which both contrast P(y|x) vs P(y|~x). The key question is which formula when (not) for what (not) ?? For the risk (or gain ) of the effect y if exposed to (a treatment) x , the key contrasting formulas in epidemiology and evidence-based medicine EBM for binary x are (for multivalued x or for any other kind of exposure just replace ~x by u, a mnemonic for Ursache = cause): ARR = P(y|x) - P(y|~x) = absolute risk reduction (for risk of effect y if x) = ARR(x:y) "absolute" = "not relative", but often |ARR| too. = ARR(~x:~y) = [ 1 - P(y|~x) ] - [ 1 - P(y|x) ] ; find ?(x:y) = a/(a+b) - c/(c+d) in a 2x2 contingency table (find 2x2 ) = [ Pxy - Px.Py ]/[ Px.(1-Px) ] <= 1 even for tiny Px as Pxy <= min(Px,Py) !! 
= cov(x,y)/var(x) = beta(y:x) = slope(of y on x) <= 1 ! = 0 if independent x,y then Pxy=Px.Py & P(y|x)=Py & P(x|y )=Px ! = 0 enforce if Px=1 then Py=Pxy=Px.Py & P(y|x)=Py & P(y|~x)=0/0=ARR(x:y) ! = 0 natural if Py=1 then Px=Pxy=Px.Py & P(x|y)=Px & P(x|~y)=0/0=ARR(y:x) ! note that Py=1 yields Pxy=Px.Py hence ARR(x:y) = 0/[Px.(1-Px)] =0, ! note that Px=1 yields Pxy=Px.Py hence ARR(y:x) = 0/[Py.(1-Py)] =0. ! Enforced zeros lead to a more meaningful ARR (but Py=1 or Px=1 are too extreme to be of much importance). For DISCOUNTing of the lack of ! surprise in y, K(x:y) = P(y|x) -Py <= 1-Py SEEMS better (find SIC K( ) since a frequent y is seldom perceived as much of a risk anyway (find twice ) but as we just saw, ARR(:) = 0/0 must be numerically forced to ARR(:)=0, but this is logically natural since Pxy = Px.Py means independent x,y also in the extreme case of Py=1 when P(y|x)=1 ie x-->y ie x implies y, as well as in the extreme case of Px=1 when P(x|y)=1 ie y-->x ie y implies x, while in both !!! extreme cases x,y are also INDEPENDent at the same time. Find my !!! INDEPendentIMPlication PARADOX. ARR(x:y) == PNS = Probability of Necessity and Sufficiency (in general), under exogeneity = no confounding, & monotonicity = no prevention of y by x { Pearl 2000, 289,291,300 } ARR(x:y).ARR(y:x).N = r2.N = ChiSqr(x,y) , find r2 ChiSqr( { Allan 1980 } NNT = |1/ARR| was introduced in { Laupacis & Sackett & Roberts 1988 } : NNT = number needed to treat for 1 more |or 1 less| beneficial effect y, low NNT = good, successful, effective treatment x ; NNH = number needed to harm 1 more |or 1 less| by side effects z, high NNH = good, harmless treatment x ; NNS = number needed to screen to find 1 more |or 1 less| case, low NNS = good, effective screening; |1/ARR| is the most realistic general measure of health effects, as it !!! is the least abstract & least exaggerating ie most HONEST, and !!! UNlike RR(:), OR(:) or any other rate ratio, it does NOT "throw away all information on the number of dead" { Fleiss 2003, 123 on Berkson's !!! index ie ARR }. Moreover NNT, NNS, NNH measure EFFORT ie COSTS PER EFFECT. If ARR=0 ie y,x independent, then 1/ARR = oo ie infinite. NNR by Hajek : NNR = 1/RDS > 0 (find RDS ) is my Number Needed for 1 more Relative effect; if RDS < 0 then switch to its COMMENSURABLE neighbor HF . dNNR = 1/ARR - 1/RDS (if ARR > 0) = Hajek's difference of "Nrs Needed", = P(y|~x)/ARR = P(y|~x)/[P(y|x) - P(y|~x)] = 1/( RR -1) = 1/RRR 1/dNNR = RR - 1 = RRR = Relative risk reduction if ARR > 0. NNH(x:z)/NNT(x:y) is also highly informative; should be >> 1 ie many more have to be x-treated before 1 z-harm will occur, while many more patients have y-improved already. NNH/NNT is in the fine infokit by { Steve Simon at http://www.childrens-mercy.org/stats }. OR = odds ratio (as odds(P) = P/(1-P) in general) = (P(x|y)/[1-P(x|y)])/( P(x|~y)/[1-P(x|~y)] ) = OR(x:y) from which: = [P(x|y)/P(x|~y)]/[ P(~x|y)/P(~x|~y) ] = LR+ / LR- ; denominators annul: = P(x,y).P(~x,~y)/[ P(~x,y)/P( x,~y) ] = a.d/(b.c) = (a/b)/(c/d) odds ratio = [P(y|x)/P(~y|x)]/[ P(y|~x)/P(~y|~x) ] = OR(y:x) by symmetry wrt x,y LR- = P(~x|y)/P(~x|~y) = LR- = negative LR = (1 - sensitivity)/specificity LR = P( x|y)/P( x|~y) = LR+ = likelihood ratio = sensitivity/(1-specificity) = Pxy.[(1/(Px-Pxy)].(1-Py)/Py = LR(x:y) = RR(x:y) = B(x:y) = Bayes factor in favor of y provided by x ! Note (x:y) ie x-->y = 1/(Px - Pxy) = 1/P(x,~y) = 1/(x AndNot y) = 1/( x UnlessBlockedBy y) in plaintalk. ! 
x-->y despite the numerator P(x|y) = (y SufFor x) = naive y-->x (1-Py)/Py is my measure of surprise in y ; it decreases with increasing Py; in LR , RR, it DISCOUNTS lack of surprise in frequently occurring y; = product of 2 simplest measures of surprise thinkable : = (1-Py).(1/Py); expectation E[Y] = Sum[Py.fun(Py)] in general ; (1-Py) = surprise in y; Cont(Y) = Sum[Py.(1-Py)] = 1-Sum[Py^2] find Cont( log(1/Py) = surprise in y; H(Y) = Sum[Py.log(1/Py)] = Shannon's entropy; log gives it a coding interpretation (here UNNEEDED) based on positional numerical representation and Kraft inequality for unique decodability. Sum[Py.1/Py] = card(Y) = cardinality = variety = nr of distinct values of a r.v. Y = a rough measure of surprise in Y. H(Y) <= log(card(Y)). Sum[Py*(1-Py).(1/Py)] = Sum[Py*(1/Py - 1)] = Sum[1 - Py] = card(Y) - 1 . RR = P(y|x)/P(y|~x) = relative risk = risk ratio , measures (y SufFor x) = (a/(a+b))/(c/(c+d)) = RR(y:x), seems more impressive than ARR, NNT, NNH = Pxy.[1/(Py-Pxy)].(1-Px)/Px goes up with small: Py-Pyx, Px = LR(y:x) = RR(y:x) = B(y:x) = Bayes factor in favor of x provided by y, ! since y-->x is in 1/(Py - Pxy) = 1/P(y,~x) = 1/(y AndNot x) = = 1/( y UnlessBlockedBy x ) in plaintalk In medicine LR will be more stable than RR. LR's can be collected globally (eg on national scale) and via Bayes rule (eg in the nomogram at www.cebm.net in Oxford) applied to the individual cases subject to the local prevalence (or to the judged prior) Py, to obtain what we really want: the post-test probability P(y|x). Find Bayes and Bailey below. Prof. Brian Haynes (McMaster University, Canada) and prof. Paul Glasziou (Oxford, UK) have pointed out to me that it would be misleading to publish P(y|x), because a physician must use his/her internal prior Py of an individual patient and update it (eg via the nomogram) by LR(x:y) of a community or population, to obtain patient's P(y|x). So although LR may carry a more generally useful (more robust ie stable) partial information, RR carries information more meaningful finally and individually: the relative risk ie risk ratio RR(:). RR(y:x), ARR(x:y) are the key parts of other meaningful formulas : RRR = ARR/P(y|~x) = RR - 1 (if RR >= 1 ie ARR >= 0) = relative risk reduction = excess relative risk { Feinstein 2002, 340 } (is not Pearl's ERR ) `` = 1/dNNR = 1/[1/ARR - 1/RDS] , find RDS dNNR by Hajek PF = -ARR/P(y|~x) = 1-RR if RR <= 1, = preventable or prevented fraction if RR >= 1 use 1 -1/RR = ARX , not INCOMPATIBLE GF ; = -[ P( y|x) - P( y|~x) ]/ P( y|~x) = PF(x:y) by { P.W. Cheng 1997 } = -[1-P(~y|x) -(1-P(~y|~x))]/[1-P(~y|~x)] my expression of PF as a GF !! *= [ P(~y|x) - P(~y|~x) ]/[1-P(~y|~x)] = GF(x:~y) = RDS(x:~y) by Hajek `` my *= expresses Cheng's PF in the CANONICAL form of "relative difference" by M. Sheps RDS : GF(x generates y), PF(x prevents y) = GF(x generates ~y) !! so both have similar structures, ie are unified in the spirit of my slogan "A non-event is an event is an event", with apologies to Gertrude Stein who spoke similarly about a rose, tho not about a non-rose :-) . Nevertheless !! PF and GF are INCOMPATIBLE, says this box: |--- +NEW CONVERSIONS: "relative difference" by Sheps in my notation RDS(u:v) | is quantitatively interpretable iff its numerator ARR(u:v) >= 0, hence if !! ARR(u:v) < 0 switch to RDS(~u:v) ; R(u:~v) is INCOMPATIBLE with RDS(u:v). | P( v|u) + P(~v|u) = 1 find ?( in +Notation for the semantics of: | LR(~v:u) = P(~v|u)/P(~v|~u) = RR(~v:u) !!! 
LR(~v:u) = 1 - RDS(u:v) = 1/[ 1 - RDS(~u:v) ] is my key RDS-CONVERSION RULE | between COMPATIBLE RDS's where u stands for x or ~x, and v for y or ~y, so | eg 1-ARX = 1/[ 1-PF ], ie 1/RR = 1/RR indeed. | | RDS(u:v) = [ Pv - P(v|~u) ]/[ Pu.(1-P(v|~u)) ] | = [P(u,v) - Pu.Pv]/[ Pu.P(~u,~v) ] ie asymmetry due to Pu, hence !!! if Pu <= Pv then RDS(u:v) >= RDS(v:u), P(v|u) >= P(u|v) , and v.v. | | 1st derivation: generic form for a success score Pa corrected for guessing | by discounting the hits by chance is (for Pa > Pb) : !!! [(1-Pb)-(1-Pa)]/(1-Pb) = 1 -(1-Pa)/(1-Pb) = [Pa - Pb]/[1-Pb] where | 1-Pb is a reference or base rate of failures (or errors Pb = Perr ), eg: | + Jacob Cohen's kappa for interrater agreement or concordance, 1960, in | { Fleiss 2003, 603,609,620 } { Feinstein 2002, 20.4.3 } | { Bishop 1975, 395-6 } | + (Py -

E[P])/(1 - E[P]) in general, where the mean probability
= E[P] is MINImal if all Pj=1/m | (Py -1/m)/(1 -1/m) = m-multiple-choice score [ 1/m MAXImizes Cont( ] | ( If all 1..j..m choices must be assigned a Pj , then the score | S = 2.Pr -E[Pj] = 2.Pr -Sum[Pj.Pj] { De Finneti 1972, 30 } scale [-1..1] | where Pr = P assigned to the right answer ; (S+1)/2 ranges [0..1/2..1]; | if Pr = 1 then Smax else if a Pj=1 then Smin ) !! + (P(y|x) - Py)/(1-Py) = K(x:y) /[1-Py] = ARH = attributable risk by Hajek | + TauB = [ (1-E[Py])-(1-E[P(y|x)]) ]/(1-E[Py]) = Cont(X:Y)/Cont(Y) | Var(Y) = 1-E[Py] , 1-E[P(y|x)] = E[Var(Y|X)] , find Taub Cont( Gini E[P | | 2nd derivation: ARR(x:y) = slope(of y on x) = beta(y:x) , find slope, so | RDS = ARR(x:y)/(fictive max. slope of y on x), fictive = as-if = what-if | | 3rd derivation: P(y|. ) + P(~y|. ) = 1 !!! 1 >= P(y|x) = P(y|~x) + P(~y|~x).RDS , causal excess factor RDS == GF | RDS = [P(y|x) - P(y|~x)]/[1-P(y|~x)] <= 1 !!! P(~y|~x) = 1 - P(y|~x) for (~y,~x) also RDS-susceptible to the cause x | | 4th derivation: would x always cause y, enlarged Py would make Pxy=Px ie | P(y|x)=1, hence the 1 in RDS's denominator. Note that thus enlarged Py | does NOT change P(y,~x) and P(y|~x). .- | RDS(:)'s true meaning obtains from its CANONICAL form with the denominator | 1-P(.|.), then interpret RDS(:) >= 0 from its numerator. | | RDS(x:y) = [P( y| x) - P( y|~x)]/[1-P( y|~x)] from the 1st derivation: | = [P(~y|~x) - P(~y| x)]/P(~y|~x) = 1 - P(~y|x)/P(~y|~x) | = RDS = ARR/P(~y|~x) = Cheng's GF = Pearl's PS = 1 -LR(~y:x) = 1 - 1/Qsuf | = 1 if P( y| x) = 1 | | RDS(~x:y) = [P( y|~x) - P( y| x)]/[1-P( y| x)] = 1 - P(~y|~x)/P(~y|x) !! = -ARR/P(~y| x) = new Hajek's fraction HF = 1 -LR(~y:~x) = 1 - Qsuf | = 1 if P( y|~x) = 1 | | RDS(x:~y) = [P(~y| x) - P(~y|~x)]/[1-P(~y|~x)] = 1 - P(y|x)/P(y|~x) | = [P( y|~x) - P( y| x)]/ P( y|~x) = -ARR/P(y|~x) | = 1-RR = PF (eg by P.W. Cheng) = 1 -LR(y:x) = 1 - Qnec | = 1 if P(~y| x) = 1 | | RDS(~x:~y) = [P(~y|~x) - P(~y| x)]/[1-P(~y| x)] = 1 - P(y|~x)/P(y|x) | = [P( y| x) - P( y|~x)]/ P( y| x) = 1 -LR(y:~x) = 1 - 1/Qnec | = 1 -1/RR = ARX = Judea Pearl's ERR ( <= PN ) = ARR/P(y|x) | = 1 if P(~y|~x) = 1 | `` In my CONVERSIONS SCHEME RDS(u:v)'s one neighbor has ~u, the other ~v | ( commensurable neighbors are linked by a / ) : | !!! PS = GF = -PF.P(y|~x)/[ 1-P(y|~x)] = -PF.Odds(y|~x) | /. 1-GF = 1/[1-HF ] = 1/Qsuf = Lnec | New: HF PF = 1-RR = -GF/Odds(y|~x) | ./ RR = 1-PF = 1/[1-ARX] = Qnec = Lsuf | PN >= ARX = 1 -1/RR = -HF/Odds(y|x) = -HF.[1-P(y|x)]/P(y|x) = PAR/P(x|y) | | HF = -ARX.Odds(y|x) = -ARX.P(y|x)/[1-P(y|x)] | RR = OR/[P(y|~x).(OR-1) +1] is the exact conversion from odds ratio ; | OR(:) to RR(:) are needed eg for PF and ARX | | RDS(u:v) < 0 isn't interpretable, so we must use its PROPER COMPLEMENTARY | RDS(~u:v) >= 0. Since I expressed all 4 RDS'es with numerators +-ARR, | complementary RDS'es must also have COMMENSURABLE denominators: hence !!! COMPATIBLE are GF,HF and ARX,PF (in both pairs u and ~u are swapped) !!! INCOMPATIBLE are GF,PF (used by Cheng ), and GF,ARX (by Pearl ). | If P(y|x) & P(y|~x) are very small (as they often are in a population) | then GF & HF are near |ARR|, ie GF & HF keep the implicit information on !!! the number of cases |1/ARR| = NNT or NNH are informative in this sense, | while ARX & PF are >> |ARR|, ie ARX & PF lose that information as they are !!! based on the risk ratio RR(:) ie relative risk which exaggerates an effect. 
| |--- more on RDS below ARX = ARR/P(y|x ) = RRR/RR = 1 -1/RR if RR >= 1, else use 1-RR = PF , not HF ; if RR > 2 then ARX > 1 -1/2 = 0.5 which is interpretable as !!! "more likely than not" eg in civil toxic tort cases, as { Fleiss 2003, 126 } and { Finkelstein 2001, 285 } point out; (find LikelyThanNot ) = PAR/P(x|y) = attributable risk for exposed = attributable risk percent = attributable proportion = attributable fraction in exposed group = etiologic fraction for exposed group (is not PAR ) = excess fraction = excess risk ratio ERR { Pearl 2000, 292 } = excess relative risk So far we obtained our rates or proportions P(.)'s from the study group. Other formulas may require P(.)'s from a community (eg regional population), which may be estimated easily and cheaply from the study group ONLY thus : ! If the control group ie ~y in the study is a RANDOM sample of ~yc ie in the community, then Pxc =. P(x|~y) from the study group { Fleiss 2003, 151 }: Pxc = P(exposed to x in community or population) is estimated by: ! =. P(x|~y) = b/(b+d) from the studied controls subgroup only, writes { Feinstein 2002, twice on p.340/17.7 low } Pyc = P(risk factor y in community or population) { Feinstein 2002, 338 } ! =. Pxc.P(y|x) + (1-Pxc).P(y|~x) where P(y|.) are from the study group = a weighted average ie an interpolation between P(y|x) and P(y|~x) ACR = P(y|x) - Pyc = ARR(1-Pxc) { Feinstein 2002, 338/17.6.2} = attributable community risk = attributable population risk (vs ? in PAR ) PAR = [ Pyc - P(y|~x) ]/Pyc { Feinstein 2002, 340 } = Pxc.(RR-1)/[ 1 + Pxc.(RR-1) ] { Feinstein 2002, find p.340 for a typo } = population attributable risk percent/100 ( RR is from the study group) = population attributable risk fraction !! = [ Pxc.P(y|x) + (1-Pxc).P(y|~x) - P(y|~x) ]/Pyc = ARR.Pxc/Pyc { by Hajek } = community etiologic fraction { Fleiss 2003, 125-8,156/7.5,711/7.5 } : = [ P(x|y) - Pxc ]/[1-Pxc]; if small Pyc then Pxc =. P(x|~y) hence : ! =.[ P(x|y) - P(x|~y)]/[1-P(x|~y)] { Fleiss 2003, 151 }, is RDS-like PAR ? = attributable risk in population { Finkelstein & Levin 2001, 286-7 } : ! = (1 -1/RR).P(x|y) = ARX.P(x|y) but case-control studies provide OR, not RR, but RR =. OR if Py is low (above find exact conversion ); for RR >= 1 { Kahn & Sempos 1989, 74,80 } RDS = ARR/P(~y|~x) = ARR/[ 1-P(y|~x) ] = relative difference by M.C. Sheps = [ P(y|x) - P(y|~x)]/[ 1-P(y|~x) ] for as-if binary x !!! = slope(of y on x) /[ FICTIVE MAX. slope, as P(y|x) <= 1 ] is my view = [ Pxy - Px.Py ]/[ Px.P(~x,~y) ] ie asymmetry due to Px only = RDS(x:y) notation like ARR(x:y) due to x-->y in P(y|x), find ?(x:y) = 1 = Max if P(y| x)=1 ie Pxy=Px ie x-->y 100% = P(y|x) = ARR if P(y|~x)=0 ie Pxy=Py ie y-->x 100% = 0 = ARR if P(y| x)=P(y|~x) ie Pxy=Px.Py ie x,y independent = ARR/0 = ?? if P(y|~x)=1 ie Py -Pxy = 1 - Px , ARR <= 0 ie 0 = 1 -(Px+Py-Pxy) = P(~(x or y)) = P(~x,~y) = 0/0 = ?? if Py = 1 = 0/? = ?? if Px = 1 ( ARR/[ Py - P(y|~x) ] = 1/Px :-) Let u = Ursache = cause or causes other than x : RDS = [ P(y|x) - P(y|u) ]/[ 1-P(y|u) ] = relative difference a la Sheps = [ successful y if x minus if u ]/[ failure rate of y if u ], as P(y|x) <= 1=Max, the 1-P(y|u) is the MAXImal thinkable value of ! RDS's numerator, ie 1-P(y|u) is a meaningful normalization. The !!! key IDEA is that failures if u, are available to become successes if x, !! and that RDS is more honest than RRR, ARX, if P(y|~x), P(y|x) is small, as they often are, which inflates the measures based on RR(y:x). 
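A Python sketch contrasting these measures from one 2x2 table (ad hoc names,
not code from KX or Acaus3; the layout a,b,c,d = counts of (x,y),(x,~y),(~x,y),(~x,~y)
is assumed as in the 2x2 table above), with the GF/HF compatibility switch applied:

def contrast_2x2(a, b, c, d):
    # a = exposed & effect, b = exposed & no effect, c = unexposed & effect, d = unexposed & no effect
    p_y_x  = a / (a + b)                 # P(y|x)  = risk in the exposed
    p_y_nx = c / (c + d)                 # P(y|~x) = risk in the unexposed
    ARR = p_y_x - p_y_nx                 # absolute risk difference
    RR  = p_y_x / p_y_nx                 # relative risk = risk ratio
    out = {"ARR": ARR, "RR": RR,
           "OR":  (a * d) / (b * c),     # odds ratio
           "NNT": abs(1.0 / ARR) if ARR != 0 else float("inf"),
           "RRR": RR - 1.0,              # = ARR/P(y|~x) = 1/dNNR
           "ARX": 1.0 - 1.0 / RR,        # = ARR/P(y|x) , Pearl's ERR
           "PF":  1.0 - RR}              # Cheng's prevented fraction
    if ARR >= 0:                         # Sheps' relative difference in its COMPATIBLE form:
        out["GF"] = ARR / (1.0 - p_y_nx)          # RDS(x:y)  = y CausedBy x
    else:
        out["HF"] = -ARR / (1.0 - p_y_x)          # RDS(~x:y) = y CausedBy ~x
    if ARR > 0:
        out["dNNR"] = 1.0 / ARR - (1.0 - p_y_nx) / ARR   # = P(y|~x)/ARR = 1/(RR - 1)
    return out

print(contrast_2x2(a=20, b=80, c=10, d=90))
# P(y|x)=0.20, P(y|~x)=0.10 : ARR=0.10 ie NNT=10, RR=2.0, OR=2.25, ARX=0.5, GF=0.111..

Note how RR = 2.0 reads far more impressively than ARR = 0.10 ie NNT = 10
for the very same table; that is the exaggeration warned about above.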
RDS'es are in : - { Fleiss 2003, 123-125,162, SeLn for CI( 1-RDS) on pp.133,162-163,152,156 } - { Sheps 1959 }, fine in { Feinstein 2002, 174 }, and for u == ~x in : - { I.J. Good 1961/1983, 208,212 } as QuasiProbability for causal nets - { Khoury 1989 } as susceptibility if independence assumed, commented in: - { Rothman & Greenland 1998, 53-56 eq.4-3 } on attributable fractions; ! + the following authors have paired RDS with a 2nd formula : - { Glymour 2001, chap.7 = Cheng models, pp.75-91, 108-110 } based on: - { Glymour, Cheng 1998 } based on: - { Patricia Cheng 1997 } eq.(16) = RDS = GF , PF = eq.(30) = 1-RR >= 0 - { Pearl 2000, 292,300,284 } PS = RDS , PN >= ERR = 1 -1/RR = ARX "probability of sufficiency" PS = RDS = GF = generative power of x wrt y "probability of necessity" PN vs Cheng PF = preventive power of x wrt y my view: PF = generative factor of x to ~y = PF = preventable fraction ! if RR > 1 then (0 < PN < 1) vs ( PF < 0) else (here PN =. ERR ) ! if RR < 1 then ( PN < 0) vs ( 0 < PF < 1) else PN = 0 = PF ie x,y independent; for increasing RR Pearl's PN is nonlinear INCreasing, PN <= 1 = if RR=oo=MAX while with RR Cheng's PF is linear DECreasing, PF <= 1 = if RR=0 =min !! but this is ok, since (x NECESSARY for y) is OPPOSITE to (x PREVENTS y). ! PN and PF just serve to OPPOSITE purposes. PN >= 0 , PF >= 0 are required for meaningful interpretability, and also : RDS >= 0 is required in { Pearl 2000, 294 },{ Novick & Cheng 2004, 461 }. RDS < 0 has no clear interpretation, hence we better make RDS >= 0 thus : if ARR >= 0 ie if P( y| x) >= P(y|~x) then RDS(x:y) = ARR/P(~y|~x) <= P(y|x) `` = [ P( y| x) - P( y|~x) ]/[ 1-P( y|~x) ] = GF = y CausedBy x else RDS(~x:y) = [ P( y|~x) - P( y| x) ]/[ 1-P( y| x) ] = HF = y CausedBy ~x !! = -ARR/P(~y| x) is the new Hajek's fraction HF signal that the "else" happened; 1-RDS(x:y) = P(~y|x)/P(~y|~x) = 1/LR(~y:~x) = 1/Qsuf = Lnec (by Folk1 Folk3 ) = 1-GF = 1/[1-HF] !! Health effects can be expressed either by counting the ill or dead, or by counting the cured or alive { Sheps 1958 }. So we are free to replace any P(y|.) with P(~y|.) = 1-P(y|.) in many formulae as eg in RDS in PF. Since P(y|.)'s are often small, 1-P(y|.) =. 1 so that RDS =. ARR. Generally results will be different, depending on our choice of events vs ~events. These options create ample opportunities for honesty vs dishonesty, misleading, manipulation. Clearly, if P(y|~x) < 0.5 then RDS < RRR which only looks more impressive. For ARR > 0 ie for RR > 1 holds: 0 < RRR <= oo, while 0 < RDS <= 1, and measures with incompatible ranges should not be compared. Moreover, RRR is not a RDS-measure. .- +Contemplating the bounds of some measures Warm-up: Py > P(y|~x) = (Py - Pxy)/(1-Px) simplifies to: Pxy > Px.Py ie x,y 'positively' dependent, hence: if Pxy > Px.Py then Py > P(y|~x) & P(y|x) > Py & P(y|x) > P(y|~x) & K(x:y) = [ P(y|x) - Py ] < [ P(y|x) - P(y|~x) ] = ARR(x:y) else then Py < P(y|~x) & P(y|x) < Py & P(y|x) < P(y|~x) & K(x:y) = [ P(y|x) - Py ] > [ P(y|x) - P(y|~x) ] = ARR(x:y) else equalities. Lets think twice about which basic measure is better : P(x|y) measures how much y SufFor x ie y-->x FORMALLY, but this --> NEEDS NOT to make much sense SEMANTICALLY, as it also ! measures how much is x NecFor y, which usually does !! make a lot of sense: x suppressed makes y suppressed or removed; P(y|x) - P(y|~x) = ARR(x:y) is Absolute risk reduction of the effect y, ! which always has the same sign, as: ! P(x|y) - P(x|~y) is a measure of y-->x or how much is x NecFor y ! 
which always has the same sign, as: -Px <= P(x|y) - Px <= 1-Px DISCOUNTS the LACK of SURPRISE in x ; find SIC K( since a frequent x is not seen as a real CAUSE; this makes sense if Px =. 1, LESS sense if Px is low; moreover we should see P(x|y) - Px for what it normally is ( Px =. 1 is an extreme ); { if Pxy=Py then P(x|y) - Px = 1 - Px if Pxy=Px then P(x|y) - Px = Pxy.(1/Py -1) = P(x|y).(1-Py) [ = Px .(1/Py -1) ] <= 1-Px (find SIC ) where [.] may suggest that Py, Px can be varied, but Pxy <= min(Px, Py) }, !! which always has the same sign, as: -Py <= P(y|x) - Py <= 1-Py DISCOUNTS the LACK of SURPRISE in y (find SIC ) since a frequent y is not seen as a real RISK; this makes LESS sense than 1-Px above, since a wide-spread risk is still a risk, although psycho-socially it is more acceptable if everybody is at the same high risk. Indeed, as long as nothing can be done about the risk, the society gets used to it, becomes fatalistic about that risk, is not too jealous wrt those lucky few exceptions who are spared the risk. { if Pxy=Px then P(y|x) - Py = 1-Py if Pxy=Py then P(y|x) - Py = Pxy.(1/Px -1) = P(y|x).(1-Px) [ = Py .(1/Px -1) ] <= 1-Py (find SIC ) where [.] may suggest that Px, Py can be varied, but Pxy <= min(Px, Py) }. [-Py..0..1-Py] ie COMPLEMENTARY bounds, make sense for a measure ?(y:x). Also [-Px..0..1-Px] ie COMPLEMENTARY bounds, make sense for a measure ?(x:y), see { Kahre 2002, 118-119 } who adopted as desirable the upper bound 1-Px designed by Popper into corroboration C(y:x) . Moreover the ! bound 1-Px fits with Cont(X) = 1 -

E[P] = 1 - Sum[Px.Px] = Sum[Px.(1-Px)] , find Cont( E[P
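A tiny Python sketch (ad hoc names) of Cont(X) = 1 - E[P] = Sum[Px.(1-Px)]
next to Shannon's H(X), just to make the bounds concrete:

import math

def cont(P):                          # Cont(X) = 1 - E[P] = 1 - Sum[Px.Px] = Sum[Px.(1-Px)]
    return 1.0 - sum(p * p for p in P)

def shannon(P):                       # H(X) = Sum[Px.log2(1/Px)], terms with Px=0 dropped
    return sum(p * math.log2(1.0 / p) for p in P if p > 0)

P = [0.5, 0.25, 0.25]                 # a toy distribution over card(X) = 3 values
print(cont(P))                        # 0.625 ; maximal 1 - 1/3 at the uniform distribution
print(shannon(P))                     # 1.5 bits <= log2(3) =. 1.585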

But if by now you want to choose a measure with the upper bound 1-Px or 1-Py, then look back at ARR(:) above and notice and appreciate my numerically (but not logically) enforced value: if Px=1 or Py=1 then ARR(:) = 0 : !!! = 0 to be enforced if Px=1 then P(y|~x)=0/0 & Py=Pxy=Px.Py & P(y|x)=Py !!! = 0 to be enforced if Py=1 then P(x|~y)=0/0 & Px=Pxy=Px.Py & P(x|y)=Px Note that BOTH last two lines hold for both ARR(x:y), ARR(y:x) which BOTH became DISCOUNTing down to ARR(:) = 0 for extreme cases of Px=1 and/or Py=1. From the bounds 0 <= P(.) <= 1 follow measures' bounds. Notice their COMPLEMENTarity: in the math sense the bounds are UNIdistant (a new word?) ie Upb - Lob = 1 which make sense logically for a hypothesized cause or a conjecture x and its opposite ~x : Lob if Pxy=0 : ?(y:x) = y implies x ie y-->x Upb -P(x) = 0 -P(x) <= K(y:x) = P(x|y) - P(x) <= 1-P(x) = P(~x) -P(x|~y) = 0 -P(x|~y) <= ARR(y:x) = P(x|y) - P(x|~y) <= 1-P(x|~y) = P(~x|~y) -P(y|~x) = 0 -P(y|~x) <= ARR(x:y) = P(y|x) - P(y|~x) <= 1-P(y|~x) = P(~y|~x) -P(y) = 0 -P(y) <= K(x:y) = P(y|x) - P(y) <= 1-P(y) = P(~y) ?(x:y) = x implies y ie x-->y K(x:y) = my notation for "how much raises x the probability of y" { Kahre } ie how much (x CAUSES y) as in { Kahre 2002, 186 eq(6.23.2) etc } where discounting by -Py [ instead of by -P(y|~x) in ARR(x:y) ] seems implicitly justified by the remark "the probability that a drunk driver x causes an accident y may be small, eg 1/100" ie P(y|x)=small and P(y|~x)=smaller, Py = Pyx +P(y,~x). More subtle is : RDS(x:y) = ARR(x:y)/P(~y|~x) = [ P(y|x) - P(y|~x) ]/[ 1-P(y|~x) ], but RDS has NOT complementary bounds: LoB for Pxy=0, UpB=P(y|x) : -P(y|~x)[1-P(y|~x)] = 1 -1/P(y|~x) <= RDS(x:y) <= P(y|x) hence if P(y|x) < P(y|~x) then RDS(x:y) << -1 is possible and its interpretation becomes unclear, so if ARR(x:y) < 0 we should evaluate RDS(~x:y) = [ P(y|~x) - P(y|x) ]/[ 1-P(y|x) ] and signal the swap. ?(y:x) is explained in +Notation ( find ?(y:x) to avoid confusion ! ). K(y:x) = P(x|y) - Px = Korroboration { Kahre 2002, 120 eq(5.2.8) } preferred over: C(y:x) = Corroboration { Popper 1972, 400 eq(9.2*) in foot } = confirmation of x=h by y=e { Popper 1972, 395-396; find modern } C(y:x) says: y implies x ie y-->x ie y SufFor x, and indeed, C(y:x) = MAXi if P(x|y)=1 ie y-->x which does NOT SHOW in Popper's POOR form where P(y|x) SEEMS x-->y while IN FACT y-->x is there : C(y:x) = { Popper's form } = [ P(y|x) - Py ]/[ P(y|x) + Py - Pxy ] { find my C-form2 } = [ P(y,x) - Px.Py]/[ P(y,x) + Px.(Py - Pxy )] { inverted form } = [ P(x|y) - Px ]/[ P(x|y) + Px.( 1 - P(x|y))] from which we see that: = 0.0 for Px=1 or x,y independent { Kahre 2002, 119 } = [ P(x|y) - Px ]/[ P(x|y).(1-Px) + Px ], :-( if P(x,y) = 0 : = [ 0 - Px ]/[ 0 + Px ] = -1 :-( !! if P(x|y) = 1 : = [ 1 - Px ]/[ 1.(1-Px) + Px ] = 1-Px ! 
if P(y|x) = 1 : = [ 1 - Py ]/[ 1 + Py - Pxy ] effect = y = evidence , x = hypothesis -Px <= K(y:x) <= 1-Px = P(~x) for how much y Korroborates x :-( -1 <= C(y:x) <= 1-Px = P(~x) for how much y Corroborates x vs -1 <= F(y:x) <= 1 by design, for how much y supports x F(y:x) = degree of Factual support = F(h,e) { Kemeny 1952 } = F(x,y) = my F(y:x) = ARR(x:y)/[ P(y|x) + P(y|~x) ] = [ P(y|x) - P(y|~x) ]/[ P(y|x) + P(y|~x) ] { Kemeny } = [RR(y:x) - 1 ]/[RR(y:x) + 1 ] find F( = [ P(y,x) - Px.Py ]/[ P(y,x) + Px.(Py-2.Pxy)] = F-form2 vs C-form2 = -F(y:~x) and anaLogically for any mix of events like x,y,~y,~x due to my "A nonevent is an event is an event" (sorry Gertrude :-) if P(x|y) = 1 then Pxy=Py & P(y|~x)=0 since P(y,~x) = Py - Pyx = 0.0 , hence F(y:x) = [ P(y|x) - 0 ]/[ P(y|x) + 0 ] = +1 = maxi; if P(x,y) = 0 : = [ 0 - P(y|~x) ]/[ 0 + P(y|~x) ] = -1 = mini; if P(y|x) = 1 : = [ 1 - P(y|~x) ]/[ 1 + P(y|~x) ]. !!! Caution: F(y:x) = +1 if P(x|y) = 1 ie y implies x ie y-->x despite P(y|x) - P(y|~x) = ARR(x:y) in the numerator; it follows from the 1/P(y|~x) = (1-Px)/P(y,~x) in RR(y:x) = P(y|x)/P(y|~x) = = oo = infinite if P(y,~x) = Py - Pyx = 0 ie Py = Pxy ie P(x|y)=1 [ find SurpriseBy(x) ] and recall that F(:) = [RR(:) -1]/[RR(:) +1] hence F(:) and RR(:) are CO-MONOTONIC (find monoton ). Back to bounds (find SIC ): Kahre liked 1-Px but DISLIKED -1 in -1 <= C(y:x) <= 1-Px and argued for his "symmetrical" in fact COMPLEMENTary -Px <= K(y:x) <= 1-Px . Alas, Kahre did not consider the much more used ARR(:) with its COMPLEMENTary bounds shown. .- +Construction principles P1: to P7: of good association measures P1: "Measures of association should have operationally meaningful interpretations that are relevant in the contexts of empirical investigations in which measures are used." { Goodman & Kruskal, 1963, p.311, also in the footnote }. Henceforth I discuss events x, y, but it all holds for their expected values ie averages over variables X, Y ie sets of events too. P2: OpeRational usefulness is greatly enhanced if measure's WHOLE RANGE of values (not only its bounds ) has an opeRationally highly meaningful interpretation as a quasiprobability, eg ARR, NNT , NNH , RDS . P3: Various results from a single measure should be meaningfully COMPARABLE regardless of the total count N of all joint events in a contingency table. This means that a measure should be built from proportions P(:) only, without an uncancelled N. Thus measures based on ChiSquare do not qualify for causation. But N plays its role in confidence intervals. P4: To measure association means to measure statistical dependence. I can list 16+1 = 17 equivalent conditions of independence ie equalities lhs = rhs, like eg P(y|x) = P(y|~x), P(x|y) = P(x|~y), ... , Pxy = Px.Py, ..., P(y|x) = Py , P(x|y) = Px, ..., from which 2*17 = 34 measures of dependence can be made by CONTRASTing: lhs - rhs like ARR(:), or lhs / rhs like RR(:) above, both are asymmetrical wrt x,y ; eg Pxy/(Px.Py) = P(x|y)/Px = P(y|x)/Py is symmetrical wrt x,y , and the correlation coefficient r is also symmetrical wrt x,y : r = cov(x,y)/sqrt[ var(x).var(y) ] = (Pxy - Px.Py)/sqrt[(1-Px)Px .(1-Py)Py] r2 = r.r = r^2 = [ cov(x,y)/var(x)].[cov(x,y)/var(y)] = [slope of y on x ].[slope of x on y] both slopes have the same sign = beta(y:x) . beta(x:y) >= 0 , -1 <= beta <= 1 = ARR(x:y) . ARR(y:x) = coefficient of determination But measures of confirmation, evidence, indication, and certainly of !! causation should be DIRECTED ie ORIENTED ie ASYMMETRICAL wrt events x,y. 
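Before asymmetry is discussed further, a quick Python check (ad hoc names) of
r2 = ARR(x:y).ARR(y:x) and of ChiSqr(x,y) = r2.N stated above:

def arr(P1, P2, P12):                 # ARR of the 2nd event given the 1st = slope = beta
    return (P12 - P1 * P2) / (P1 * (1.0 - P1))

Px, Py, Pxy, N = 0.4, 0.3, 0.18, 1000          # toy marginals, joint and total count N
r2 = arr(Px, Py, Pxy) * arr(Py, Px, Pxy)       # beta(y:x).beta(x:y)
r  = (Pxy - Px * Py) / ((Px * (1 - Px) * Py * (1 - Py)) ** 0.5)
assert abs(r2 - r * r) < 1e-12                 # r2 = ARR(x:y).ARR(y:x)
print(r2, r2 * N)                              # coefficient of determination, ChiSqr(x,y)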
Asymmetry is easily obtained by taking a symmetrical association measure (lhs - rhs) and NORMalizing it either by lhs or by rhs, or by 1 - rhs, eg: (lhs - rhs)/(1 - rhs) = [ (1-rhs) -(1-lhs) ]/(1-rhs) in fact, = 1 = Maxi (as RDS(:) ) if lhs=1=Maxi ie if it SATURATES. Saturation (at a fixed bound) usually makes sense since to be off some target or reference point by 100 meters or by 100 kilometers may be opeRationally the same: too far. Or a normalization by a function of one variable only : ARR(x:y) = P(y|x) - P(y|~x) = [ Pxy - Px.Py ]/[ Px.(1-Px) ] = cov(x,y)/var(x) = beta(y:x) for indicator events !! An asymmetry is not enough, unless it is a meaningful asymmetry, with clearly understood opeRational meaning. P5: Measures of CAUSATION tendency should be decomposable into (a product of) terms such that one term itself measures probabilistic IMPLICATION ie ENTAILMENT, but the equality Measure(y:x) = Measure(~x:~y) is UNDESIRABLE . !! Alas, the conviction measure Conv(y:x) = Conv(~x:~y) as defined by Google's co-founder { Brin 1997 } does NOT qualify; find UNDESIRABLE :-( !! But if for another holds Measure(y:x) <> Measure(~x:~y), it does NOT automatically mean that such a measure is better; eg the bounds may be less opeRationally meaningful than the bounds of Conv(:). Entailment provides a link with the notions of necessity and sufficiency : y-->x ie (y SufFor x) == (x NecFor y) == (~y NecFor ~x) == (~x SufFor ~y) P6: Measure(y:x) should yield meaningful values also for such extreme like Pxy=0, Pxy=Px.Py, P(x|y)=1, P(y|x)=1, Px=1, Py=1 : eg: RR(y:x) = 0 if Pxy=0 ie if x,y are disjoint events ! RR(y:x) = 1 if Px=1 hence Py - Pxy = 0 AND YET Pxy = Px.Py, 1 means independent x,y [ find Pxy/(0/0) as special case ] eg: Conv(y:x) = Py.P(~x)/P(y,~x) = [Py - Px.Py]/[Py - Pxy] in general; = P(~x)/P(~x|y) = [ 1 - Px ]/[ 1 - P(x|y) ] = Py/P(y|~x) = 0/0 numerically if Px = 1 whence Py = Pxy hence: = 1 if Pxy=Px.Py also [ Py - 1.Py]/[ Py - Py ] = 1 !! = 1 - Px if Pxy=0; this is not a nice fixed value, but 1 - Px is interpretable as "semantic information content" SIC !!! which makes NO SENSE for Pxy=0 :-( , nevertheless: 1 - Px < 1 = for x,y independent, so for Pxy=0 is Conv(:) < neutral 1 :-) !!! Similarly P(~y|~x) = 1-Py if Pxy = Px.Py makes NO SENSE if: if P(~y|~x) is ~y NecFor ~x ie x NecFor y To avoid overflows due to /0, such extreme/degenerated/special cases of P's must be numerically detected at run time and handled apart according to the meaningful interpretation (or conventions) as just shown. !! Since any single formula is doomed to measure a mix of at least 2 key prop- erties ( dependence and implication mixed due to my INDEPendentIMPlication PARADOX ), it is a good idea to detect & report important extreme/special cases which do not always obviously follow from the values returned. Such ! automated reporting adds semantics and avoids misreading/misinterpretation. P7: Although it is useful to consider the values returned by measures under extreme circumstances like eg Px=1 or Py=1, these will not occur often, and should be detected apart anyway. It is more important to choose a measure which will return reasonable values for the application at hand. There cannot be a single universally best measure, but some like NNT and RDS are universally more useful than other. So far for my 7 construction principles. More analysis follows: RR(y:x) is compared with few related measures like eg: W(y:x) = weight of evidence by I.J. 
Good (Turing's statistical assistant); F(y:x) = degree of factual support by John Kemeny ( Einstein's assistant); C(y:x) = corroboration by Karl Popper (he often called it confirmation, an overloaded term, so Popper corroborates here to be findable); it is funny that Sir Popper who stressed refutation has worked out measures of confirmation, but not of refutation :-) Why ? Conv(y:x) = conviction measure by Google's co-founder et al { Brin 1997 }. Such comparisons increase our insights. How well these formulas measure causal tendency is also discussed. All this & much more was/is implemented in my KnowledgeXplorer program KX which not only infers & indicates (ie identifies, diagnoses, predicts, etc) but also extracts knowledge (on both event- & variable level of interest) from the information carried by data input in the simple format. KX has graphical and numerical outputs in compact, comparative, hence effective forms (eg my squashed Venn diagrams). .- +Key elements of probabilistic logic and of simple measures of causation K0: Check-as-check-can instead of catch-as-catch-can : My probabilistic logic formulas fun(Px, Py, Pxy) can be checked by evaluating fun(.) for all 4 pairs of values Px,Py = 0,1 together with the ! proper value of Pxy=0,1. The resulting fun(.)=0,1 must be equal to the corresponding logical result value 0,1 . K1: Simplification rules, duality of rules, DeMorgan rules : Dual rules have 'or' replaced by '&' and v.v., '0' ie Mini by '1' ie Maxi `` and v.v. My 'by symmetry' isnt 'duals'. x & x = x = x or x idempotence (duals); ~(~x) = x involution x & ~x = 0 ; x or ~x = 1 (duals) tertium non datur (x & y) or (x & ~y) = x = (x or y) & ( x or ~y) = adjacency rules (duals) (x & y) or x = x = (x or y) & x (duals) = absorption rules, also: y or (x & ~y) = x or y = x or (~x & y) by symmetry of (x or y), has dual: y & (x or~y) = x & y = x & (~x or y) by symmetry of (x & y), hence: ~x or (x & y) = ~x or y = ~(x & ~y) = x <== y = x-->y has dual: ~x & (x or y) = ~x & y = ~(x or ~y) = y AndNot x = y Unless x De Morgan's rules for probabilistic logic: DeMorgan's law for P is this = P( x Nand y) = P(~x or ~y) = P(~( x & y)) = 1-Pxy ; its dual rule is: P( x Nor y) = P(~x & ~y) = P(~( x or y)) = 1-(Px +Py -Pxy) = P(~x,~y) eg: P( x And y) = P( x & y) = P(~(~x or~y)) = 1-( 1-Pxy) = Pxy P( x --> y) = P(~x or y) = P(~( x & ~y)) = 1-(Px-Pxy) = = P(~y --> ~x) = P( y or ~x) = P(~(~y & x)) = 1-(Px-Pxy) q.e.d. by symmetry = DeMorganish = P( x == y) = P(~x == ~y) = P(~(x Xor y)) , == is Equivalence P( x Xor y) = P(~x Xor ~y) = P(~(x == y)) , Xor is NonEquivalence Numerical formulas for Boolean logic exist; my translation rules are : 1st: replace x.y by Pxy (not by Px.Py unless independent x, y) 2nd: replace x by Px, y by Py, x^2 by Px, y^2 by Py (x --> y) = 1 - x + x.y where x,y are 0,1 (hence x^2 = x , y^2 = y) P(x --> y) = 1 -Px + Pxy = 1 -(Px -Pxy) = P(~(x,~y)) (x Xor y) = x^2 -2.x.y + y^2 P(x Xor y) = Px -2.Pxy + Py (x Nor y) = 1 - x - y + x.y = (1-x)(1-y) = none of both = Peirce function P(x Nor y) = 1 -Px -Py + Pxy = 1 -(Px+Py-Pxy) = P(~(x or y)) Estimates of P(vector) if no P(tuple)'s are available, easily obtain from P(And_i:[x_i] ) = product[ P(x_i) ] for as-if independent x_i's, and from P( Or_i:[x_i] ) = P(~(And_i:[~x_i])) ie DeMorgan's rule. Then P( Or_i:[x_i] ) = 1 - product[ 1-P(x_i) ] = P(at least one x_i occurs) = = union, a basis for noisy OR-gate. 
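A minimal Python sketch (ad hoc names) of the noisy OR-gate estimate and of
the numeric translation rule for x --> y, under the stated as-if independence
of the x_i :

def noisy_or(P_list):
    # P(at least one x_i) = 1 - product[ 1 - P(x_i) ] for as-if independent x_i
    prod = 1.0
    for p in P_list:
        prod *= (1.0 - p)
    return 1.0 - prod

def p_x_implies_y(Px, Py, Pxy):
    # numeric translation of (x --> y) = 1 - x + x.y : P(x --> y) = 1 - Px + Pxy
    return 1.0 - Px + Pxy

print(noisy_or([0.1, 0.2, 0.3]))       # 1 - 0.9*0.8*0.7 = 0.496
print(p_x_implies_y(0.6, 0.3, 0.25))   # 0.65 = P(~(x,~y))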
1-Pa = T1
1-P(a or b) = T1.(1-Pb) = (1-Pa)(1-Pb) = 1-[ Pa+Pb -PaPb ] = T2
1-P(a or b or c) = T2.(1-Pc) = 1-[ Pa+Pb+Pc -(PaPb+PaPc+PbPc) +PaPbPc ]
which is the Inclusion-Exclusion principle for independent events a, b, c.
It suggests a progression of tightening inequalities (or improving
approximations =. ) ending with an equation:
P( Or[xi] ) <= Sum[Pi]
P( Or[xi] ) >= Sum[Pi] - Sum[P(ij)]
P( Or[xi] ) <= Sum[Pi] - Sum[P(ij)] + Sum[P(ijk)]
...
P( Or[xi] ) = Sum[Pi] - Sum[P(ij)] + Sum[P(ijk)] - ... P(ijk..z)
which is the Inclusion-Exclusion principle (visualized by Venn diagram).
Max[ Px , Py ] <= P(x or y) <= min[ Px + Py, 1 ]
Max[ 0, Px + Py - 1 ] <= Pxy <= min[ Px , Py ]
where on lhs Bonferroni inequality becomes nontrivial only if Px + Py > 1,
in which case Pxy > 0.
My 3 inequalities for checking, and if violated then for trimming of eg
smoothed estimates :
Pxy <= min[ P(y|x) , P(x|y) ]
Max[ 0, (Px + Py - 1)/Py, Pxy ] <= P(x|y) <= min[ Px/Py , 1 ]
Max[ 0, (Px + Py - 1)/Px, Pxy ] <= P(y|x) <= min[ Py/Px , 1 ]
For the union of events :
Max_i:[Pi] <= P(Or_i:[x_i]) <= min[ 1, Sum_i:[Pi] ] is the simplest;
if we know all Pjk ie P(j,k) then
Sum_i:[Pi] - Sum[P(ij)] <= P(Or_i:[x_i]) is a tighter lower bound.
For a set-theoretical measure M(.) in general:
M(.) >= 0 ie nonnegativity, = 0 if Any is empty,
M(X union Y) = M(X) + M(Y) if X,Y disjoint.
Then the key equation called by G. Birkhoff "valuation of a lattice" is:
!! M(X u Y) + M(X,Y) = M(X) + M(Y) u is union, , is joint aka intersect,
from which the key inequality obtains:
!! M(X,Y) <= min[ M(X), M(Y) ] <= M(X u Y) <= M(X) + M(Y) due to M(.) >= 0
find archer-) at Venn for visualization; eg:
N(X u Y) + N(X,Y) = N(X) + N(Y) for Number of elements in sets X, Y
ie N(.) is a count aka cardinality card( , N(X,Y) counts duplicates ie overlap.
If valuation holds AND M(Universal set)=1 then M(~x) = 1-M(x), and
DeMorgan's laws hold, both vital for our Occidental reasoning. eg:
For r.e.s x,y ie as-if random events : proportions P are in fact counts
N(.) of Bernoulli events or indicator events :
P(x or y) + Pxy = Px + Py subtract Pxy and get :
P(x or y) = Px + Py - Pxy subtract Pxy and get :
P(x or y) - Pxy = Px + Py - 2.Pxy = P(x Xor y) = P(~(x == y))
= Px-Pxy + Py-Pxy = sum of 2 Asymmetric differences
= P(x,~y) + P(y,~x) (find AndNot blocks )
= d(x,y) = distance(x,y) = P(x <> y) = symmetric difference
eg: cont(x or y) + cont(x,y) = cont(x) + cont(y) for events x,y ie:
1-P(x or y) + 1-Pxy = 1-Px + 1-Py ie:
1-(Px +Py -Pxy) + 1-Pxy = 1-Px + 1-Py q.e.d.
P(~(x or y)) + P(~x or~y) = 1-Px + 1-Py
Now on r.v.s X,Y ie as-if random variables ie sets of as-if random events :
! Cont(.) measure has been called many names, has many meanings hence uses;
find SIC Gehalt Popper , and read only 2 info-packed pages of
"(Re)search hints by Jan Hajek" at www.matheory.info
+ Cont(.|.) provides a PROVABLY BETTER ie tighter ie lower upper bound on
Bayes probability of error Perr than Shannon's entropies do;
I(Y;X) = I(X;Y) by Shannon is best only for (linear cost of) coding.
H(X,Y) + I(X;Y) = H(X) + H(Y) are Shannon's entropies
+ Cont(Y:X) <> Cont(X:Y) is ASYMMETRICAL wrt X, Y
+ 0 <= Cont(:) <= 1 is absolutely BOUNDED ie it SATURATES like Perr
+ find Cont( , more in "(Re)search hints by Jan Hajek" at www.matheory.info
+ Cont(f(X)) = Sum[ f(x).(1-f(x)) ] is THE ONLY measure of FUZZiness
! which satisfies all 6 properties P1-P6 desired in { Ebanks 1983, 25-26,
Theorem 3.2 & proof on pp. 32-33 } ; he refs to the 1973 paper by DeLuca
in Information and Control, where 1-f(x) is the measure of choice for the
entropy of FUZZY sets; neat stuff. !
Hajek noticed links: f'(x) = f(x).(1 -f(x)) = derivative of the S-curve f(x), fuzzy or not : = 0.25*(sech(x/2))^2 ; f(x) = (e^x)/(1 + e^x ) = 1/(1 + e^-x) = 0.5 + tanh(x/2)/2 is S-shaped ( a similar curve is 0.5 + atan(x)/pi ) = sigmoid arising from binary logistic regression = a growth curve = LOGISTIC function, saturates to -+1, used as activation function in belief networks and in artificial neural networks (ANN,NN), x = ln[f(x)/(1-f(x))] = a line, an inverse of f(x), a LOGIT for P(x). Odds(x) = Px/(1-Px ), Px = Odds(x)/(1 + Odds(x)), and by Bayes rule : P(xi|y) = P(y|xi).Pxi/Sum_j:[ P(y|xj).Pxj ] = P(xi,y)/Py has the same form as: Pxi = g(xi)/Sum_j:[ g(xj) ]; we get 0 < Pxj < 1 & Sum[Pxj] = 1 & Pxj increasing with xj, if g(xj) monotonically INCreases with xj, and !! if g(0) > 0 then all 0 < Pxj < 1 ie no ZERO FREQUENCY problem eg in Bayes or Markov chain multiplications; eg: Pxi = (c+xi)/Sum_j:[c+xj] > 0 , c > 0 or: (e^(b.xi))/Sum_j:[e^(b.xj)] > 0 since a^0 = 1; called SoftMax by John Bridle, it arises from the multinomial logit model; ?? Q: Which b or bj ( > 0 ) avoids oversmoothing and under-smoothing ? Btw, e^(-E/(k.T))/Z is Boltzmann-Gibbs pdf from statistical mechanics and thermodynamics. With constants other than kT it is used in eg: simulated annealing, maximum entropy (MaxEnt), Markov random field, mean field, expectation-maximization (EM), etc. Note: let t = exp(x) == e^x , 1/t = e^-x : tanh(x) = sinh(x)/cosh(x) = [(t -1/t)/2]/[(t +1/t)/2] = [t.t -1]/[t.t +1] = [e^(2x) - 1]/[e^(2x) + 1] = -tanh(-x) tanh'(x) = 1 -[tanh(x)]^2 .- eg: The analogon in the classical information theory is pseudo-isomorphic with set theoretical measures M(.) , but it is NOT a strict probabilistic isomorph like P(x or y), and my cont(x or y) for events, as follows : H(X,Y) + I(X;Y) = H(X) + H(Y) the , ; are standard for Shannon's entropies ! ( , is & , no P(x or y) here) on r.v.s X, Y; entropies are expected values of log(.) which transforms multiplicativity into additivity. H(X) = Sum[Px.log(1/Px)] , 1/Px is a decreasingFun(Px) = -Sum[ Px.log(Px) ] <= log(N(X )) ; H(X) = entropy of X H(Y,X) = H(X,Y) = -SumSum[ Pxy.log(Pxy)] <= log(N(XY)) ; H(X,Y)=joint entropy H(Y|X) = -SumSum[ Pxy.log(P(y|x)) ] is conditional entropy of Y if X H(X|Y) = -SumSum[ Pxy.log(P(x|y)) ] is conditional entropy of X if Y I(Y;X) = I(X;Y) = SumSum[ Pxy.log(Pxy/(Px.Py)) ] is the mutual information = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) ; H(Y|X) + H(X|Y) = H(X,Y).2 - H(X) - H(Y) = H(X) + H(Y) - 2.I(X;Y) = H(X,Y) - I(X;Y) = a metric distance measure, a symmetric difference ; [ H(Y|X) + H(X|Y) ]/H(X,Y) = 2 - [ H(X) + H(Y) ]/H(X,Y) = 1 - I(X;Y)/H(X,Y) = a metric distance <= 1, find Venn `` ! such symmetries wrt x,y,X,Y make Shannon's entropies UNsuitable or at best ! suboptimal as measures of a hypothesised causation. Moreover, they have NO absolute UPPER BOUND, ie they do NOT saturate like eg Bayesian probability of error Perr (eg of misclassification or misidentification, misdiagnosing) does. However, a NORMalization of a symmetrical function of X and Y, by a function of only one of the variables involved creates an asymmetry, and for M(.)'s also creates an absolute upper bound : 0 <= R(Y:X) = I(X:Y)/H(X) = [ H(X) - H(X|Y) ]/H(X) = 1 - H(X|Y)/H(X) <= 1 0 <= R(X:Y) = I(X:Y)/H(Y) = [ H(Y) - H(Y|X) ]/H(Y) = 1 - H(Y|X)/H(Y) <= 1 0 <= R(X:Y) <> R(Y:X) <= 1 = 100% association between X and Y ; this <> is solely due to H(X) <> H(Y) in the denominators, which are bounded by 0 <= H(.) <= log(N(.)). 
Hence the greater the cardinality N(.), the greater the R(:) will tend to be. This is understandable, but not too exciting, since it could lead eg to overquantization, low hence poor counts and to an illusory DIRECTIONality of a hypothesised causation. See my "Hint 7:" in "(Re)search hints by Jan Hajek" at www.matheory.info For metric distances holds in general: d(x,y) >= 0 nonnegativity d(x,y) = 0 if x=y ; here: logical equivalence Px=Pxy=Py d(x,y) = d(y,x) symmetry d(x,y) + d(y,z) >= d(x,z) 1st 3angle ineq ( triangle inequality ) d(y,z) + d(z,x) >= d(x,y) 2nd 3angle ineq d(z,x) + d(x,y) >= d(y,z) 3rd 3angle ineq My 3angle inequalities (not found elsewhere) : (Px +Py -2.Pxy) + (Py +Pz -2.Pyz) >= (Px +Pz -2.Pxz) = my 1st 3angle ineq: Py -2.Pxy + Py -2.Pyz >= -2.Pxz Py >= Pxy +Pyz -Pxz = 1st Pxz >= Pxy +Pyz -Py may be < 0 ie trivial Pxy <= Py -Pyz +Pxz Pyz <= Py -Pxy +Pxz Pz >= Pxz +Pyz -Pxy = 2nd Pxy >= Pxz +Pyz -Pz may be < 0 ie trivial Pxz <= Pz -Pyz +Pxy Pyz <= Pz -Pxz +Pxy Px >= Pxy +Pxz -Pyz = 3rd Pyz >= Pxy +Pxz -Px may be < 0 ie trivial Pxy <= Px -Pxz +Pyz Pxz <= Px -Pxy +Pyz +3angle inequalities combined yield new JH-bounds (eg for checking) : !!! (Pxz +Pyz -Pz) <= Pxy <= Min[ (Px -Pxz +Pyz), (Py -Pyz +Pxz) ] !!! (Pxy +Pyz -Py) <= Pxz <= Min[ (Px -Pxy +Pyz), (Pz -Pyz +Pxy) ] !!! (Pxy +Pxz -Px) <= Pyz <= Min[ (Py -Pxy +Pxz), (Pz -Pxz +Pxy) ] (find Bonferroni ) K3: Equivalent relations for independence, and for dependence : There are 16+1 = 17 equivalent == relations for independence = , and 17 for -dependence < , and 17 for +dependence > of x,y : the ? stands for any single symbol < , = , > used consistently: [ Pxy ? Px.Py ] == [ P(y|x) ? Py ] == [ P(x|y) ? Px ] but also == [ P(y|x) ? P(y|~x) ] == [ P(x|y) ? P(x|~y) ] eg [ Pxy - Px.Py ? 0 ] == [ P(y|x) - Py ? 0 ] == [ P(x|y) - Px ? 0 ] [ Pxy / Px.Py ? 1 ] == [ P(y|x) / Py ? 1 ] == [ P(x|y) / Px ? 1 ] eg [ RR(y:x) = P(y|x)/P(y|~x) ? 1 ] == [ RR(x:y) = P(x|y)/P(x|~y) ? 1 ] , but RR(y:x) <> RR(x:y) , ie the == concerns the (in)equality ? 1 Formulas left of an ? are candidate elements for a measure of (x CAUSES y). Other elements for (x CAUSES y) must be derived from logic. For 2 binary variables there are 16 different logical functions, of which only the 2 implications and 2 AndNots are ASYMMETRIC ie DIRECTED ie ORIENTED (the remaining 12 functions are either symmetric wrt x,y, or are functions of 1 variable only, either x only or y only). Clearly (x CAUSES y) must to be ASYMETRICAL wrt x,y. But there are more requirements. K4: Implication or entailment, naive sufficiency and necessity : For causation only this == is undesirable (find UNDESIRABLE ) ~(y,~x) == y-->x == (~x --> ~y) == ~(~x,y) == (x or ~y) == == ~(y AndNot x) == ~(y ButNot x) == (y Minus x) == ~(y Unless x=1=y) == ~(y Unless x=1 blocks y=1) Let y = the observed effect ie evidence; x = a hypothesised cause of y : P(x|y) = Pxy/Py is a NAIVE measure of how much y suffices to determine x P(x|y) = 1 = max if y --> x ie y implies x deterministically ie Pxy = Py P(x|y) = Px if x,y are independent ie Pxy = Px.Py. In the extreme case of Px = 1 it holds: if Px=1 then Py = Pxy = Px.Py & P(x|y) = 1 !!! ie y determines x 100% AND x,y are independent. This is my PARADOX of IndependentImplication . If Px > 0 & Py > 0 & Pxy > 0 then if Px > Py then P(x|y) > P(y|x) else if Px = Py then P(x|y) = P(y|x) else if Px < Py then P(x|y) < P(y|x) else mission impossible. 
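Before turning to necessity: a small Python sketch (ad hoc names) using the
combined JH-bounds above to check, and if violated to flag, estimated
pairwise joints of 3 events x,y,z :

def jh_bounds_violations(Px, Py, Pz, Pxy, Pxz, Pyz, eps=1e-12):
    # returns the names of pairwise joints which violate the combined JH-bounds
    checks = {
        "Pxy": (Pxz + Pyz - Pz, min(Px - Pxz + Pyz, Py - Pyz + Pxz), Pxy),
        "Pxz": (Pxy + Pyz - Py, min(Px - Pxy + Pyz, Pz - Pyz + Pxy), Pxz),
        "Pyz": (Pxy + Pxz - Px, min(Py - Pxy + Pxz, Pz - Pxz + Pxy), Pyz)}
    return [name for name, (lo, hi, val) in checks.items()
            if not (lo - eps <= val <= hi + eps)]

print(jh_bounds_violations(0.5, 0.4, 0.3, Pxy=0.20, Pxz=0.15, Pyz=0.12))  # [] ie consistent
print(jh_bounds_violations(0.5, 0.4, 0.3, Pxy=0.39, Pxz=0.01, Pyz=0.29))  # all 3 flagged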
So far on relatively unproblematic sufficiency; now on less clear measures of necessity: Pioneers { Buchanan & Duda 1983, 191 } explained the rule y --> x thus : "... let P(x|y) denote our revised belief in x upon learning that y is true. ... In a typical diagnostic situation, we think of x as a 'cause' and y as an 'effect' and view the computation of P(x|y) as an inference that the cause x is present upon observation of the effect y." (find +Folks' for more). My preferred wording is: P(x|y) is a NAIVE measure of how much evidence does y provide for x ie how much y implies x as a potential cause of y, hence !!! also how likely x CAUSES y. !!! Note that in P(x|y) the y = evidence, x = a hypothesised cause. Ideally x CAUSES y if Pxy = Py ie y implies x ie y --> x , eg if P(x|y) = 1. .- Causation assumes that without a cause x there will be no effect y , hence that a cause x is NECESSARY for effect y which then serves as an evidence for that cause. From the reasonings on the last dozen of lines few simple candidate measures (marked by their +pros, -cons, .neutrals ) follow : C1: P(x|y) = Pxy/Py is a NAIVE, SIMPLISTIC measure of how much is: y SufFor x, or: x NecFor y, but: - is not a fun(Px), eg canNOT discount lack of surprise if Px =. 1 . is = Px if Pxy = Px.Py ie x,y independent + is = 0 if Pxy = 0 ie x,y disjoint + is = 1 if Pxy = Py ie P(y,~x) = 0 from Pyx + P(y,~x) = Py ie "without x no y" ie "x NecFor y" (draw or find Venn ) ie "if y then x" ie "y SufFor x" !! - is = 1 see the CounterExample few lines below (also find SIC ). !! - is a single P(.|.) while all single P's were REFUTED as measures of confirmation or corroboration { Popper 1972, chap.X/sect.83/footn.3 p.270, and Appendix IX, 390-2, 397-8 (4.2) etc } P(y|x) = Pxy/Px is analogical (just swap x with y) 1 if "y follows from x" is the phrase in { Popper 1972, 389 } + is used in simple Bayesian chain products for multiple cues. P(~x|y) = [Py -Pxy]/Py = 1 - P(x|y) is a naive measure of how much y Prevents x, as P(x,y)=0 MAXImizes this measure P(~y|x) = [Px -Pxy]/Px = 1 - P(y|x) is a naive measure of how much x Prevents y, as P(x,y)=0 MAXImizes this measure ; find my *= in RDS(x:~y) = PF prevented fraction { in Cheng 1997 as p_i } !! A counterExample shows that P(.|.) is not good enough measure of causation: Let x = a hypothesised cause, a conjecture y = a widely present symptom, eg 10 fingers on each hand. Then P(x|y) =. 1 ie Pxy =. Py since almost all with y are ill. Yet it is neither wise to assume that y is sufficient for x, nor wise to assume that x is necessary for y. (find SIC ) C2: An alternative single P-measure of how much is x necessary for y : P(~y|~x) = [1 -(Px + Py - Pxy)]/[1 - Px] = 1 - ([Py - Pxy]/[1 - Px]) + is a function of Px, Py, Pxy , but: ?- is 1-Py if Pxy = Px.Py ie x,y independent; note that 1-Py is a "semantic information content measure" SIC; ? does 1-Py make sense if x,y are independent ? I don't think so. ? (similar NO SENSE is Conv(y:x) = 1-Px if Pxy=0 ie disjoint) ?. is = 0 if 1 = Px + Py - Pxy ( unlikely to occur ? ) - is just a single P which all were REFUTED in { Popper 1972, 270, 390-2, 397-8 } - is <> 0 if Pxy = 0 ie x,y disjoint :-( but it can be forced: if Pxy = 0 then NecessityOf(x for y) = 0 ELSE = P(~y|~x) + is = 1 if Pxy = Py , as explained next : + is = 1 if y --> x 100% ie if y implies x fully then : ! Pxy = Py AND P(~y|~x) = 1 = P(x|y) , which is consistent with logic: ~(y,~x) == (y --> x) == (~x --> ~y) == ~(~x,y) are all equivalent in logic. 
P(~y|~x) = 1 is the only nicely interpretable fixed point. P(~y|~x) as a candidate has arisen from my COUNTERFACTUAL reasoning: the semantical Necessity of x for y follows from IF no x THEN no y, ie removed or suppressed x suffices for removed or suppressed y, ie ~x implies ~y ie ~x --> ~y. The COUNTERFACTUALity in human terms says: IF x disappears THEN y will disappear too. For more find +Folks' wisdom. Only after I worked out P(~y|~x) above, I came across { Hempel 1965 } where at the very end of his very long and very abstract paper I could decode his eq.(9.11) as P(~y|~x). He derived it as a "systematic power closely related to the degree of confirmation, or logical probability"{p.282} via his eq(9.6) which is in fact 1-P(.) ie SIC . On his p.283 the last lines tell us why : "Range and content of a sentence vary inversely. The more a sentence asserts, the smaller the variety of its possible realizations, and conversely."( SIC ) "The theory of Range" is a section in { Popper 1972, sect.72/p.212-213 } where on p.213 he refers the notion of [semantic] Range to { Friedrich Waismann: Logische Analyse des Wahrscheinlichkeitsbegriffes, Erkenntnis 1, 1930, 128f}. C3: -Px <= [ P(x|y) - Px ] <= 1-Px { Kahre 2002, 118-119 } -Px if Pxy = 0 ; 1-Px if P(x|y) = 1 ie Pxy=Py , find SIC Note that (1-Px) - (-Px) = 1 ie the absolute magnitudes of both bounds are COMPLEMENTary. This makes sense since a REFUTATION of a conjecture means CONFIRMation of its COMPLEMENTary conjecture. Yet users like fixed points. C4: Better measures of sufficiency and necessity are RR(:)'s or LR(:)'s, like I.J. Good's : Qnec = P( e| h) / P(e|~h) = RR( e: h) = Lsuf (find Folk1 ) Qsuf = P(~e|~h) / P(~e| h) = RR(~e:~h) = 1/Lnec = [1-P( e|~h)]/[1-P(e| h)] = 1/[1- GF ] = 1/[1 - RDS ] by Hajek !! Qsuf(e:h) = Qnec(~e:~h) in { I.J. Good 1992, 261 } (find Folk3 ) !! Qnec(e:h) = Qsuf(~e:~h) in { I.J. Good 1995, 227 in Jarvie } Lsuf = Qnec , but in fact there is NO semantic confusion, since Lsuf denotes how much is e SufFor h (ie h NecFor e), and Qnec denotes how much is h NecFor e (ie e SufFor h). Find +Folks' wisdoms for more. These ratios of ratios have ranges with 3 semantically fixed values, which enhance opeRational interpretability, and are not just single P's REFUTED as measures of confirmation or corroboration in { Popper 1972, chap.X/sect.83/footn.3/p.270, and in Appendix IX, 390-392, 397-8, etc }. .- +Dissecting RR(:) LR(:) OR(:) for deeper insights : y is the effect of interest (= in our focus), eg a target disorder ; x is the potential cause of interest, eg an exposure, a test result . !!! The key point I make now is that while : P(y|x) measures how much (x implies y) ie (x SufFor y) hence (y NecFor x), P(y|x)/P(y|~x) = RR(y:x) = (y implies x) ie (y SufFor x) hence (x NecFor y) = Pxy/P(y,~x) .(1-Px)/Px = [Pxy/(Py - Pxy )].(1-Px)/Px vital in my condition for confounding = Pxy.( y implies x).SurpriseBy(x) = Pxy/P(y,~x) .SurpriseBy(x) = LikelyThanNot(y:x).SurpriseBy(x); note that: 1/P(y,~x) = 1/(Py - Pyx) and 1 - P(y,~x) are both measures of how likely (y implies x) ie IF y THEN x ; recall that ~(y,~x) == (y --> x) == (~x --> ~y) == ~(~x,y) ; note that Py - Pxy = P(~x) - P(~x,~y) in general ie also for imperfect implication = (1-Px) - [1-(Px+Py-Pxy)] = Py-Pxy !!! but equality of fun(y:x) = fun(~x:~y) is UNDESIRABLE for a measure of causal tendency (to find out why find below UNDESIRABLE ). 
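A minimal Python sketch (names and toy probabilities are assumptions) of
Qnec, Qsuf and of the decomposition RR(y:x) = [ Pxy/(Py - Pxy) ].(1-Px)/Px ,
checked against the direct ratio P(y|x)/P(y|~x) :

# Minimal illustrative sketch; e plays the role of y, h the role of x.
def rr_direct(px, py, pxy):        # RR(y:x) = P(y|x)/P(y|~x)
    p_y_x  = pxy / px
    p_y_nx = (py - pxy) / (1.0 - px)
    return p_y_x / p_y_nx

def rr_decomposed(px, py, pxy):    # [ Pxy/(Py - Pxy) ].(1-Px)/Px
    return (pxy / (py - pxy)) * (1.0 - px) / px

def qnec(pe, ph, peh):             # Qnec = P(e|h)/P(e|~h) = RR(e:h)
    return rr_direct(ph, pe, peh)

def qsuf(pe, ph, peh):             # Qsuf = [1 - P(e|~h)]/[1 - P(e|h)]
    p_e_h  = peh / ph
    p_e_nh = (pe - peh) / (1.0 - ph)
    return (1.0 - p_e_nh) / (1.0 - p_e_h)

if __name__ == "__main__":
    px, py, pxy = 0.37, 0.58, 0.26         # toy probabilities
    print("RR(y:x) direct     = %.4f" % rr_direct(px, py, pxy))
    print("RR(y:x) decomposed = %.4f" % rr_decomposed(px, py, pxy))
    print("Qnec = %.4f  Qsuf = %.4f"
          % (qnec(py, px, pxy), qsuf(py, px, pxy)))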
Another fun(y:x) is P(x|y) which also measures (y implies x), however : 100% implication [ P(x|y)= 1 ] = [ P(~y|~x)=1 ], while for less than 100% implication P(x|y) <> P(~y|~x) in general since : Pxy/Py <> [ 1 -(Px + Py - Pxy)]/[1-Px] Pxy/Py <> 1 - ( Py - Pxy )/[1-Px] where by DeMorgan's rule P(~x,~y) = 1 -(Px + Py - Pxy) = P(~(x or y)) (y SufFor x) ie (y implies x) ie: (x NecFor y) ie (x CAUSES y) potentially, because removal, blocking or !! reduction of ANY SINGLE necessity (out of several acting together) as x necessary for y , will annul or suppress its consequent effect y. Draw x enclosing y in a Venn diagram, and see that it is necessary to hit x to have any chance of hitting the enclosed y, but not vice versa. Hence it is the necessary condition which should be seen as a potential cause, removal/suppression of which will remove/suppress the effect y. 1/P(x,~y) = 1/(Px - Pxy), or 1 - P(x,~y) = 1 - (Px - Pxy), are measures of how likely (x implies y) ie IF x THEN y ; recall that: ~(x,~y) == (x implies y) == (~y implies ~x) == ~(~y,x) LR = P(x|y)/P(x|~y) = RR(x:y) = Pxy/P(x,~y) .(1-Py)/Py = [Pxy/(Px - Pxy )].(1-Py)/Py = Pxy.( x implies y).SurpriseBy(y) = Pxy/P(x,~y) .SurpriseBy(y) = LikelyThanNot(x:y).SurpriseBy(y) From RR, LR and also from a Venn diagram, it follows that since the joint P(y,x) = P(x,y), it must be only the unequal marginal probabilities Py, Px, which decide whether (y implies x) more or less than (x implies y) by the rule: !! if Py < Px then RR(y:x) >= RR(x:y) ie LR(x:y), if Py > Px then RR(y:x) <= RR(x:y) ie LR(x:y), where the = occurs for x,y independent ie RR(:) = 1, or if Pxy=0=RR(:), as my program Acaus3 asserts. For more find Py < Px below. IF LikelyThanNot(y:x) < 1 ie Pyx < P(y,~x) ie Less likely than not, AND SurpriseBy(x) is large enough ie Px is low enough THEN RR(y:x) > 1 may still result due to low Px ; ELSE IF RR(y:x) > 1 AND LikelyThanNot(y:x) > 1 ie More likely than not ie P(x|y) > 1/2 ie Pxy > Py/2 ie Pyx > P(y,~x) = Py - Pxy ie 2.Pxy > Py THEN there is a stronger reason for the conjecture that (y implies x) ie that (x causes y), than it is if Pyx < P(y,~x) AND RR(y:x) > 1. The P(x|y) > 0.5 has been: - required as "the critical condition for confirming evidence" in { Rescher 1958, 1970, pp.78-79, but on p.84 swapped to P(y|x) > 0.5 }; - recommended as a potent (not just potential) Necessity N of exposure x for case y : N > 0.5 in { Schield 2002 sect.2.3 & Appendix }; - considered in { Hesse 1975, 81 } but dismissed as a single measure of "confirming evidence" because P(x|y) > 1/2 "may be satisfied even if y has decreased the confirmation of x below its initial value in which case y has disconfirmed x". Mary Hesse ( Oxford ) then opted for P(x|y) > Px as the condition for "y confirms x" aka Carnap's "positive relevance criterion". Alas, her P(x|y) > Px is nothing but P(x,y) > Px.Py . - compare with ARX = 1 -1/RR > 0.5 if RR > 2 (find ARX ). A PARADOXical behaviour of RR(:), and of other formulas, nearby some extreme values is identified : !!! huge, even infinite RR(y:x) = oo is possible while y, x are almost independent. This I call the IndependentImplication paradox. Let: == is equivalence ; rel is >=< ie < , = , > , etc == is IF [.] THEN [_] and vice versa, ie also simultaneously IF [_] THEN [.] 
Remember that there are at least 16+1 = 17 equivalent (in)dependence relations [ P(y,x) rel Py.Px ] , which divided by Px or by Py yields : == [ P(y|x) rel Py ] == [ P(x|y) rel Px ] == [ P(y|x) rel P(y|~x)] == [ P(x|y) rel P(x|~y)] == [P(~y|~x) rel P(~y|x)] == [P(~x|~y) rel P(~x|y)] etc. Since both relative risk RR(:) aka risk ratio, and odds ratio OR(:) are in use, it is good to remember their relationships: [ OR(:) rel 1 ] == [ RR(:) rel 1 ] == [ Pxy rel Px.Py ] , hence If OR(:) rel 1 (ie if Pxy rel Px.Py) then OR(:) rel RR(:), and vice versa, eg If OR(:) < 1 (ie if Pxy < Px.Py) then OR(:) < RR(:) ; If OR(:) > 1 (ie if Pxy > Px.Py) then OR(:) > RR(:) Let only the OR(:) = 2.5 be known (eg from a meta-study, so that a,b,c,d,N are unknown or not published, and RR(:) is not available). Then we still may speculate about the corresponding RR(:) > 1 thus: OR(:) = ad/(bc) = eg (25*30)/(10*30) = 2.5 > 1, or (25*30)/(30*10) = 2.5 RR(:) = [a/(a+b)]/[c/(c+d)] = eg [25/(25+10)]/[30/(30+30)] = 1.4 > 1, or [25/(25+30)]/[10/(10+30)] = 1.8 etc; Keep in mind that swapping rows and/or columns in a 2x2 contingency table may only change OR into 1/OR, but RR(:) will always change somehow. .- +The simplest thinkable necessary condition for CONFOUNDING !!! Let's make search for & research of confounders easier & less expensive. RR(y:x) = P(y|x)/P(y|~x) where y is the effect, and x is the hypothesized cause, eg an exposure, treatment, or a test result. Let's consider c as a competing (against x) candidate cause of y. Clearly a necessary (but generally not sufficient) condition for c to be a more likely candidate than x as a cause of y is RR(y:c) > RR(y:x) , while RR(c:x) > RR(y:x) is a less natural condition by Jerome Cornfield of 1959 , reproduced in the Appendix of { Schield 1999 }. My decompositions RR(c:x) = [ Pcx/(Pc - Pcx) ].(1-Px)/Px RR(y:x) = [ Pyx/(Py - Pyx) ].(1-Px)/Px readily suggest that (1-Px)/Px can be dropped from Cornfield's 2nd inequality RR(c:x) > RR(y:x) , so that [ Pcx/(Pc - Pcx) ] > [ Pyx/(Py - Pyx) ] hence [ Pcx.Py - Pcx.Pyx ] > [ Pyx.Pc - Pyx.Pcx ] where Pcx.Pyx annul, hence !!! P(x|c) > P(x|y) my simplest necessary condition for c overrulling x !!! P(x|c) - P(x|y) > 0 my simplest necessary absolute boost Ab > 0 needed !!! P(x|c) / P(x|y) > 1 my simplest necessary relative boost Rb > 1 needed [ P(x|c) = P(c|x).Px/Pc ] > [ P(y|x).Px/Py = P(x|y) ] by Bayes rule ; !!! P(c|x)/Pc > P(y|x)/Py my Bayesian boost condition !!! P(c|x) > P(y|x).Pc/Py 2nd form of my necessary cond. P(y|x) < P(c|x).Py/Pc 3rd form of my necessary cond. lead to measures !!! P(c|x)/Pc - P(y|x)/Py = aBb(c:x; y:x) my absolute Bayesian boost !!! [ P(c|x)/Pc ]/[ P(y|x)/Py ] = rBb(c:x; y:x) my relative Bayesian boost !!! [ P(c|x)/Pc - P(y|x)/Py ]/[ P(c|x)/Pc + P(y|x)/Py ] is my absolute Bayesian boost kemenyzed to the range [-1..0..1] . If abs.boost < 0 or rel.boost < 1 then confounder c CANNOT replace x as a potential cause of the effect y; ie abs.boost < 0 or rel.boost < 1 SUFFICE to REFUTE c as a competitor with x for a cause of y. This is Popperian refutationalism opeRationalized. If abs.boost > 0 or rel.boost > 1 then confounder c MIGHT replace x as a potential cause of the effect y, but abs.boost > 0 or rel.boost > 1 are only necessary (but not sufficient) conditions for c to replace x as a potential cause of y. 
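A minimal Python sketch (all helper names are assumptions; the inputs are
probabilities estimated elsewhere) of this simplest necessary condition
P(x|c) > P(x|y) and of the Bayesian boosts, used to refute or to keep a
candidate confounder c :

# Minimal illustrative sketch of the necessary-condition screening above.
def confounder_screen(px, py, pc, pxy, pcx):
    # necessary condition for c to overrule x :  P(x|c) > P(x|y)
    p_x_given_c = pcx / pc
    p_x_given_y = pxy / py
    ab = p_x_given_c - p_x_given_y                  # absolute boost Ab
    rb = p_x_given_c / p_x_given_y                  # relative boost Rb
    # same condition in Bayesian-boost form: P(c|x)/Pc > P(y|x)/Py
    abb = (pcx / px) / pc - (pxy / px) / py
    kem = abb / ((pcx / px) / pc + (pxy / px) / py) # kemenyzed [-1..0..1]
    return ab, rb, abb, kem

if __name__ == "__main__":
    px, py, pc = 0.40, 0.30, 0.20                   # toy marginals
    pxy, pcx = 0.18, 0.14                           # toy P(x,y), P(c,x)
    ab, rb, abb, kem = confounder_screen(px, py, pc, pxy, pcx)
    print("Ab = %.3f  Rb = %.3f  aBb = %.3f  kemenyzed = %.3f"
          % (ab, rb, abb, kem))
    print("c REFUTED as overruling x" if ab <= 0.0
          else "c passes the necessary condition, it only MIGHT overrule x")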
Below find Bailey to read that a "globally" collected P(x|y) is more stable than P(y|x), which can be estimated from a locally collected P(x|y) thus: P(y|x) = ( P(x|y).Py )/[ P(x|y).Py + P(x|~y).(1-Py) ] by Bayes , = 1/[ 1 + (P(x|~y)/P(x|y)).(1-Py)/Py ] = 1/[ 1 + ( 1/LR+ ).SurpriseBy(y) ] = 1/[ 1 + SurpriseBy(y)/LR+ ] where Py has to be the proportion of the effect y in a POPULATION. Now it is clear how much better it is to use my condition P(x|c) > P(x|y) necessary for confounding. Combining both necessary conditions for (c overrules x) yields !! RR(y:x) < mini[ RR(c:x) , RR(y:c) ] is necessary for c to overule x ; or my equivalent !! P(x|y) < P(x|c) AND RR(y:x) < RR(y:c) where both < must hold. Note that the user does not have to evaluate all (remaining) subconditions after any single one of them is found to be violated, so that c becomes an implausible competitor of x for potential causation of y. My new necessary condition above can also be derived from the fact that in RR(y:x) < RR(c:x) ie in P(y|x)/P(y|~x) < P(c|x)/P(c|~x) conditionings |. are the same on both sides of < , hence the conditional P(.|.)'s can be turned into joint P(.,.)'s as the conditionings annul. In the { Encyclopedia of Statistics, Update volume 1, on Cornfield's lemma, pp.163-4 } J.L. Gastwirth's exact condition for (non)confounding is shown. Let me write it in a clearer notation (and then simplify it a bit) : RR(c:x) = RR(y:x) + (RR(y:x)-1)/[ (RR(y:c)-1).P(c|~x) ] RR(c:x) > RR(y:x) is necessary for c , rather than x , to be a cause of y; is Cornfield's necessary but insufficient condition. From Gastwirth's equality follows my more concise sufficient condition for c , rather than x , to cause y : we subtract 1 on lhs & rhs and get RR(c:x)-1 > [RR(y:x)-1] + [RR(y:x)-1]/[(RR(y:c)-1).P(c|~x)] RR(c:x)-1 > [RR(y:x)-1] . ( 1 + 1/[(RR(y:c)-1).P(c|~x)] ) [RR(c:x)-1]/[ RR(y:x)-1] > ( 1 + 1/[(RR(y:c)-1).P(c|~x)] ) ie its lhs > rhs, so that I can have a measure (lhs - rhs)/(lhs + rhs) with kemenyzed range [-1..0..1]. When the reading gets tough, the tough get reading. This epaper has one thing common with an aircraft carrier: there are multiple cabels to hook on and so to land safely on the deck of Knowledge. There is no safety without some redundancy at critical or remote points. -.- +Notation, basic tutorial insights, PARADOX of IndependentImplication "Effective notation can shape the very development of a discipline" is what Hans Christian von Baeyer (one of the best essayists of physics and recently also of information) wrote in the subtitle of his masterful essay Nota Bene { Baeyer 1999, 12 }. CI( . ) confidence interval SeLn standard error of the natural logarithm of something NecFor is necessary for SufFor is sufficient for <> unequal ie 'either smaller < or greater > then' --> implies (find impli --> for more on implication ) AndNot ButNot (find nImp AndNot for more on nonImplication ) X, Y symbolize variables viewed as-if random variables r.v.s , ie here a r.v. is a set of r.e.s : x, y; h, e symbolize events viewed as-if random events r.e.s : x = h = hypothesised cause, eg an exposure, with probability Px y = e = evidence, effect, outcome, disorder with probability Py ~x negation (ie complement) of an event x , so that P(.) + P(~.) 
= 1 (y,x) = (x,y) == x,y ie symmetry wrt r.e.s x, y (y:x) <> (x:y) in general, ie is asymmetrical wrt x and y ?(y:x) denotes a measure M of how much the evidence y IMPlies x as a cause, conjecture or hypothesis x, ie how effect y SufFor x , ie how y CONFIRMs x , ie also how much is x NecFor y ; the y IMPlies x is a CLEAR LOGICAL & opeRational formulation. RR(y:x) = P(y|x)/P(y|~x) is M(y-->x) with 1/(Py - Pxy) inside DESPITE the P(y|x) which seems to suggest x-->y ! Aha! Insight inside! :-) = B(y:x) = Bayes factor in favor of x provided by y (due to /(Py-Pxy) !!! Hence ?(y:x) or ?(x:y) is decided by a deconstructive analysis of the ?(.:.) measure by Uncovering its key IMPlication term --> . The ! EXCEPTion (so far) to my rule are my betas : ! beta(y:x) = slope of y on x = ARR(x:y) , and ! beta(x:y) = slope of x on y = ARR(y:x) ; so my beta(.:.)'s are swapped wrt other (.:.) since it is impossible to notate them all mnemonically well, which follows most DRAMAtically from : ARR(x:y) = P(y|x) + P(y|~x) where x-->y rules, but: = slope of y on x = beta(y:x) is a natural notation for beta : = cov(x,y)/var(x) where cov(x,y) = cov(y,x), but : F(y:x) = ARR(x:y)/[ P(y|x) + P(y|~x) ] where y-->x overRules because : !! = [ RR(y:x) -1 ]/[ RR(y:x) + 1 ] where y-->x overRules in RR(y:x) ?(x:y) denotes a measure of how much the exposure x IMPlies y effect, where x-->y is indeed the oveRuling implication effect, eg: B(x:y) = LR = P(x|y)/P(x|~y) = LR+ = likelihood ratio = RR(x:y) = Bayes factor ?(x:y) means a measure M( x implies y ) ie M(x-->y), ie x-->y is withIN such a measure. Note that P(y|x), K(x:y) = P(y|x) - Py, and ARR(x:y) = P(y|x) - P(y|~x), RDS(x:y) = ARR(x:y)/P(~y|~x) are M(x-->y) ( in K(x:y) -Py discounts the lack of surprise in [high] Py; find SIC ), ! vs RR(y:x) = P(y|x)/P(y|~x) HIDING THE KEY term 1/(Py - Pxy) == m(y-->x) in /P(y|~x), which due to its oo RANGE OVERRULES P(y|x) == m(x-->y) while (1-Px)/Px discounts the lack of surprise by [high] Px; find SIC . ! Do NOT get MISLED by the form, since most measures can be rewritten as : [ Pxy - Px.Py ]/denominator1 = cov(x,y)/denominator1 = [ P(x|y) - Px ]/denominator2 if denominator2 = 1 then y-->x = [ P(y|x) - Py ]/denominator3 if denominator3 = 1 then x-->y where the numerator captures dependence (is 0 if x,y independent), while the denominator decides implication y-->x, or x-->y. ! It's the denominator, student! :-) E.g.: cov(x,y)/Py = P(x|y) - Px; cov(x,y)/Px = P(y|x) - Py cov(x,y)/var(x) = cov(x,y)/[Px.(1-Px)] = ARR(x:y) = P(y|x) - P(y|~x) cov(x,y)/[ Pxy + Px.Py - Pxy.Px ] = C(y:x) { my C-form2 } cov(x,y)/[ Pxy + Px.Py - 2.Pxy.Px ] = F(y:x) { my F-form2 } Popper, Kemeny and I.J. Good used ?(x:y), until in 1992 I.J. finally switched to my less error prone ?(y:x) which is more mnemonical, as it matches the 1st term which is P(y|x) in their formulas, but more importantly it stands for y-->x which is HIDDEN in their formulas. [0..1..oo) is a half open interval including 0 but excluding infinity oo, where the central point 1 means stochastic independence. Since I assume marginals Px > 0 and Py > 0, many intervals will be half open ..oo) ie not closed ..oo]. as-if useful fiction a la { Vaihinger 1923 }. E.g. the product of marginal probabilities Px.Py provides a fictional point of reference for dependence of events x and y. Fictional, because Pxy = Px.Py occurs rarely, but we often contrast Pxy vs Px.Py in Pxy - Px.Py or in log(Pxy/(Px.Py)) in Shannon's mutual information formula. 
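Since most measures can be rewritten as cov(x,y)/denominator, here is a
minimal Python sketch (names and toy probabilities are assumptions) which
checks those rewrites numerically, using the risk difference
ARR(x:y) = P(y|x) - P(y|~x) and F(y:x) = [ RR(y:x) -1 ]/[ RR(y:x) +1 ] :

# Minimal illustrative sketch: residuals should all be ~0.
def residuals(px, py, pxy):
    cov    = pxy - px * py
    p_y_x  = pxy / px                    # P(y|x)
    p_y_nx = (py - pxy) / (1.0 - px)     # P(y|~x)
    p_x_y  = pxy / py                    # P(x|y)
    arr_xy = p_y_x - p_y_nx              # ARR(x:y) = risk difference
    rr_yx  = p_y_x / p_y_nx              # RR(y:x)
    return {
        "cov/Py = P(x|y) - Px     ": cov / py - (p_x_y - px),
        "cov/Px = P(y|x) - Py     ": cov / px - (p_y_x - py),
        "cov/var(x) = ARR(x:y)    ": cov / (px * (1 - px)) - arr_xy,
        "F-form2 = (RR-1)/(RR+1)  ": cov / (pxy + px * py - 2 * pxy * px)
                                     - (rr_yx - 1) / (rr_yx + 1),
    }

if __name__ == "__main__":
    for name, resid in residuals(0.3, 0.5, 0.2).items():
        print("%s residual = %+.1e" % (name, resid))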
I realized that the Archimedean point of reference { Arendt 1959 } may be a special case of the useful as-if fictionalism (find Px.Py here & now). Also the MAXimal possible values are as-if values eg for normalization. SIC denotes "semantic information content measure" (sic! :-); find Popper = equal =. nearly or approximately equal, close to := assignment statement in Pascal (in C it is the ambiguous sign = ) == equivalence of two terms, or synonymity of notations or terms = less or equal >= greater or equal (the => is meaningless here and in Pascal ) >=< any one of the relations > = < >= <= <> consistently used * multiplication (must for numbers: 2*3 , 2.3*2.4 ) . multiplication, eg a.b , 2.P == 2P (2.3 is a number) oo infinity (eg 1/0 = oo , but 0/0 = undefined in general, but in expected values we take 0*0/0 = 0 , eg in entropic means; in RR(y:x) = Conv(y:x) = 0/0 = 1 if Px = 1 ie Py = Pxy = Px.Py ! ^ power operator, eg 3^4 = 3^2 * 3^2 = 9*9 = 81 sqr(.) == square(.) == (.)^2 = (.)*(.) sqrt(.) is a square root of (.) Sum_i:[ . ] is a sum over the items indexed by i within [.] lhs, rhs abbreviate left hand side, right hand side exp(a) = e^a where e = 2.718281828 is Euler's number ln(.) is logarithmus naturalis based on Euler's constant e exp(ln(.)) = (.) is antilogarithm aka antilog log2(a) = ln(a)/ln(2) = ln(a)/0.69314718 = 1.442695*ln(a) , now base = 2 log(a.b) = log(a) + log(b) where log(.) is of any base, eg ln(.) log(a/b) = log(a) - log(b) = -log(b/a); log(1/a) = -log(a) (a/b - b/a) = -(b/a - a/b) is a logless reciprocity function of a, b (a - 1/a) = -(1/a - a ) is a logless reciprocity function of a. Reciprocity is desirable when creating new entropy functions. A logless additivity can be achieved by relativistic regraduation. P(.) is a probability, a proportion or a percentage/100. Empirical P's in general and observational P's in particular should be smoothed from the range [0..1] to (0..1) ie to 0 < P(.) < 1. There are several definitions of probability, the main distinction is frequentist (based on repetition and exchangebility) vs subjectivist (allowing plausibility or belief). I am an unproblematic guy because antidogmatic Bayesian frequentist or a data-driven empirical Bayesian. Here I see each proportion as an approximation of a probability. In fact a proportion is a maximum likelihood (ML) estimate of a probability, which is ok if in c/n the count c > 5 and n = large. I designed robust formulas for estimates when c = 0, 1, 2, 3, etc, and data-tested their great powers in my KnowledgeXplorer aka KX. Px == P(x) is a parentheses-less notation for P(x_i) ie P(x[i]) ie P(xi). 1-Px has range [0..1]; linearly DECreases with INCreasing Px ; measures: + improbability of an event x ( x may be a success or a failure ), + surprise value of x ; the less probable, the more surprising is x when it happens. What is too frequent, is neither SPECIFIC, nor surprising, not interesting, carries no new meaning. LOW prior Px means more SIC = "semantic information CONTENT in x", since the lower the Px the more possibilities it FORBIDS, EXCLUDES, REFUTES or ELIMINATES when x occurs (find SIC , Spinoza ). 1/Px has range [1..oo); hyperbolically DECreases with INCreasing Px. log(1/Px) = -log(Px) ranges [0..oo); log bends the steep 1/Px down; measures surprise in Shannon's classical information theory. In 1878 Charles Sanders Peirce (1839-1914) has linked log(Px) to Weber- Fechner's psycho-physiological law, see { Norwich 1993 }. 
In 1930ies Harold Jeffreys wrote about log(LR(x)/LR(y)), Abraham Wald in 1943, and Turing & Good used their "weight of evidence" during WWII . (1-Px)/Px ranges [0..oo); is my steep measure of surprise in an event x. Px.(1-Px) = var(x) = variance of a single indicator event x, ranges [0..1/4..0] where 1/4 = MAXi if Px = 1/2 = 0.5 E[f(x)] = Sum[Px.f(x)] = expected value of f(x) ie an arithmetic average ie an arithmetic mean of f(x). Let f(x) = P(x) : E[Px] =
= Sum[Px.Px] = Sum[Px^2] = expected probability of variable X = Sum[ n(x).(n(x)-1) ]/[N.(N-1)] = unbiased estimate of E[Px] = "information energy" 1-E[Px] = 1 - = Sum[Px.(1-Px)] = 1 - Sum[(Px)^2] = Cont(X) = expected probability of failure ( or error Perr ) for r.v. X = expected surprise = expected semantic information content SIC = quadratic entropy, which is not only simpler and faster than Shannon's, but also provably BETTER for classification, identification, recognition and diagnostic tasks. Shannon's entropies are better only for coding. Don't tell this secret to any classical information theorist :-) A measure of deviation from a fictive as-if uniform distribution of a r.v. X with k recognizable nominal values, has been used for codebreaking since WWII : E[Px] -1/k = Sum_1:k[(Px - 1/k)^2 ] = Sum[ Px^2 - 2.Px/k + 1/k^2 ] = Sum_1:k[ Px.Px ] - 2/k + 1/k =
E[Px]
- 1/k = 1 - Cont(X) - 1/k is preferred over Shannon's entropy by quantum theoreticians { Zeilinger & Brukner 1999, 2001 }. E[Px]/Pxi =
Sum[(Px)^2]
/Pxi = [ 1 - Cont(X) ]/Pxi = surprise index for an event xi within the variable X, as defined by Shannon's co-author { Weaver }. Variance of an indicator event x (ie binary or Bernoulli event) is: Var(x) = Cov(x,x) = P(x,x) - Px.Px = Px - (Px)^2 = Px.(1 -Px), since Cov(x,y) = P(x,y) - Px.Py = covariance of events x,y in general Px.Py is a fictitious joint probability of as-if independent events x, y; it serves as an Archimedean point of reference ( a la Arendt ) to measure dependence of x,y either by Cov(x,y) = Pxy - Px.Py or ! by Pxy/(Px.Py), (find as-if ). If Px=1 or Py=1 then Pxy = Px.Py ! P(x,y) == Pxy == P(x&y) is the joint probability of x&y ; Pxy measures co-occurrence ie compatibility of x and y. Until early 1960ies P(x,y) had used to denote P(x|y) in the writings of Hempel, Kemeny, Popper Rescher and Bar-Hillel who used P(x.y) for the modern P(x,y) = my Pxy while others used P(xy) for my Pxy. ! P(x,y) + P(x or y) = Px + Py follows from isomorphy with the Set theory, ie ! P(x or y) = Px + Py - Pxy, and DeMorgan's rule says: ! P(~(x or y))= P(~x,~y) = 1-P(x or y) Empirical and observed proportions should be smoothed to : 0 < Pxy < minimum[ Px, Py ] ie an empirical P(x,y) should be less than its smallest marginal P. Low counts n(x,y) >= 1 are much improved by this estimate : P(x,y) =. [n(x,y) -1/2]/N , and P(y|x) =. [ n(x,y) - 0.5 ]/n(x) which is close to [ n(.) -1/2]/[N-1] = maximum posterior aka mode aka MAP for binomial pdf with Jeffreys prior Beta(O.5,1/2). It is vastly superior to Laplace's sucession rule (Quiz: why? :-) P(x|y) = Pxy/Py defines conditional probability, and Bayes rule follows: P(x|y).Py = Pxy = Pyx = Px.P(y|x) shows invertibility of conditioning P(x|y)/Px = P(y|x)/Py = Pxy/(Px.Py) is my favorite form of basic Bayes as: P(x|y) ? Px == P(y|x) ? Py , where the ? is < , = , > ; and also P(x|y) ? P(x|~y) == P(y|x) ? P(y|~x) where the ? is applied consistently. P(x|y)/P(y|x) = Px/Py is Milo Schield's form of basic Bayes P(x|y) = Px.P(y|x)/Py is the basic Bayes rule of inversion, where Px = "base rate"; IGNORING Px is people's "base rate fallacy". Odds form of Bayes rule : Odds(y|x) = Odds(y).LR(x:y) { Odds local or individual, LR from community } = P(y|x)/P(~y|x) = P(y|x)/(1 -P(y|x)) = (Py/(1-Py)).P(x|y)/P(x|~y) = P(y,x)/P(~y,x) = n(y,x)/n(~y,x) = n(y,x)/[ n(x) - n(y,x) ] would be a straight, but misleading estimate in medicine (find Bailey ). P(y|x) = Odds(y|x)/(1 + Odds(y|x)) = 1/(1/Odds(y|x) + 1) = 1/( 1 + n(x,~y)/n(x,y) ) = n(x,y)/[ n(x,~y) + n(x,y) ] = n(x,y)/n(x) = P(y|x) q.e.d. -log(Bayes rule) : -log( P(x|y) ) = -log(Pxy/Py) = -log(Px.P(y|x)/Py) = -log(Px) - log(P(y|x)) + log(Py) is the -log(Bayes) Note that for only comparative purposes between hypotheses x_j we may ignore Py (but NEVER IGNORE the base rate Px !) since Py is (quasi)constant for all x_j's compared: the shortest code for max P(x_j, y) wins. This holds for log-less Bayesian decision-making too: the maximal Pxy is the winner. This is Occam's razor opeRationalized, as it has the minimal coding interpretation: x = unobserved/able input of a communication channel, or unobservable hypothesis/conjecture/cause/MODEL to be inferred/induced; y = observed/able output of a communication channel, or available test result/evidence/outcome/DATA. According to { Shannon, 1949, Part 9, 60 } and provable by Kraft's inequality, the average length of an efficient ie shortest & still uniquely decodable code for a symbol or message z is -log(P(z)) in bits if the base of log(.) is 2. 
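A minimal Python sketch (toy priors and likelihoods; names are assumptions)
of the log-less vs -log comparison just described: the hypothesis x_j with
the maximal P(x_j,y) is also the one with the shortest total code
-log2(P(x_j)) - log2(P(y|x_j)) :

# Minimal illustrative sketch; the two hypotheses and their P's are toys.
from math import log2

def joint_and_bits(px, p_y_given_x):
    # joint P(x_j, y) and the total code length in bits
    joint = px * p_y_given_x
    bits = -log2(px) - log2(p_y_given_x)
    return joint, bits

if __name__ == "__main__":
    hypos = {"x1": (0.7, 0.10),          # toy (P(x_j), P(y|x_j))
             "x2": (0.3, 0.40)}
    for name, (px, pyx) in hypos.items():
        joint, bits = joint_and_bits(px, pyx)
        print("%s: P(x,y)=%.3f  code=%.3f bits" % (name, joint, bits))
    # the maximal joint P and the minimal code length pick the same winner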
Hence an interpretation of our -logarithmicized Bayes rule { Computer Journal, 1999, no.4 = special issue on MML MDL } is Occam's razor opeRationalized as: MML = minimum message length (by Chris Wallace & Boulton, 1968) MDL = minimum description length (by Jorma Rissanen, IBM, 1977) MLE = minimum length encoding (by Edwin P.D. Pednault, Bell Labs 1988) This theme is very very close to Kolmogorov complexity, originated in the US by Ray Solomonoff in 1960, and by Greg Chaitin in 1968, and it was designed into Morse code, and by Zipf's law evolved in plain language, eg: 4-letter words are so short because they are used so often. In Dutch we use 3-letter words because we either use them more frequently, and/or we are more efficient than the Anglos & Amis :-) Hence the total cost ie length of encoding is the sum of the cost of coding the model x_j , plus the cost ie code size of coding the data y given that particular model x_j. Stated more concisely : cost or complexity = log(likelihood) + log(penalty for model's complexity) where I don't mean any models on a catwalk :-) The pop version of Occam's "Nunquam ponenda est pluralitas sine necesitate" is the famous KISS-rule: "Keep it simple, student!" :-) Simplicity should be preferred over complexity, subject to the "ceteris paribus" rule . Einstein used to say: "Everything should be made as simple as possible, but not simpler". I say: "Keep it simple, but not simplistic." .- The MOST SIMPLISTIC, NAIVE measures of causal tendency : P(y|x) = Pxy/Px = Sufficiency of x for y { Schield 2002, Appendix } = Necessity of y for x { follows from the next line: } P(x|y) = Pxy/Py = Necessity of x for y { Schield 2002, Appendix } = Sufficiency of y for x { follows from above } CAUTION : Let x = a disease, y = 10 fingers: P(y|x) = 1 in a large subpopulation but it would be a NONSENSE to say that x suffices for y , or that y is necessary for x { example by Jan Kahre }; find IndependentImplication . !! My analysis: P(y|x) = Pxy/Px is not a DECreasing function of Py, hence any P(y) =. 1 ie too FREQUENT y will REFUTE P(y|x) as a measure (find SIC ). !!! Much more complicated REFUTATIONS of all single P(.|.)'s or P(.)'s as measures of confirmation or corroboration are in { Popper 1972, Appendix IX, 390-2, 397-8 (4.2) etc, and 270 }. P(.|.)'s should be viewed as NAIVE, CRUDE, MOST SIMPLISTIC measures : rel. = relatively wrt base P(x|y) = Pxy/Py = a measure of (y implies x) ie rel. how many y are x = a measure of (x SuperSet y) ie rel. how many y in x ; P(y|x) = Pxy/Px = a measure of (x implies y) ie rel. how many x are y = a measure of (y SuperSet x) ie rel. how many x in y ; find archer-) in Venn P(y|x).P(x|y) = a measure of (x SufFor y) & (x NecFor y) = a measure of (y Nexfor x) & (y SufFor x) = (Pxy^2)/(Px.Py) symmetry makes it worthless as a measure of causal tendency. Pxy/(Px.Py) has range [0..1..oo) and measures dependence : oo unbounded POSitive dependence of x, y 1 if x, y are independent 0 bounds NEGative dependence of x, y 0 if x, y are disjoint; do not confuse disjoint with independent ! 
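A minimal Python sketch (names and toy numbers are assumptions) of the
naive measures and of Pxy/(Px.Py), illustrating the CAUTION above: a nearly
ubiquitous y gives P(y|x) =. 1 even when x,y are exactly independent :

# Minimal illustrative sketch of the simplistic measures and the lift.
def naive_and_lift(px, py, pxy):
    return {
        "P(y|x) naive suff. of x for y": pxy / px,
        "P(x|y) naive nec.  of x for y": pxy / py,
        "Pxy/(Px.Py) dependence ratio ": pxy / (px * py),  # 1 = independent
    }

if __name__ == "__main__":
    # x = a rare disease, y = a near-universal symptom (eg 10 fingers)
    px, py = 0.01, 0.999
    pxy = px * py                  # exactly independent, for illustration
    for name, value in naive_and_lift(px, py, pxy).items():
        print("%s = %.4f" % (name, value))
    # P(y|x) = 0.999 although the dependence ratio is exactly 1 :
    # a too frequent y refutes P(y|x) as a measure of causal tendency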
A fresh alternative look at old stuff ( Px.Py is as-if independence ) : Pxy/(Px.Py) = (Pxy/Px).(1/Py) = = ( x --> y).surpriseby(y) = (Sufficiency of x for y).surpriseby(y) = ( Necessity of y for x).surpriseby(y) = (Pxy/Py).(1/Px) = ( y --> x).surpriseby(x) = (Sufficiency of y for x).surpriseby(x) = ( Necessity of x for y).surpriseby(x) = [0..1]*[1..oo) = [0..1..oo) is the range; 1 if independent = fun(Px, Py) symmetrical wrt x,y which may be good for coding but is poor for directed, oriented eg causal inferencing, hence I created : !! (Pxy/Px).(1-Py) = P(y|x).(1-Py) = (x-->y).LinearSurpriseBy(y) (Pxy/Py).(1-Px) = P(x|y).(1-Px) = (y-->x).LinearSurpriseBy(x) = [0..1].[0..1] = [0..1] is very reasonable and it ! = fun(Px, Py) Asymmetrical wrt x, y hence captures causal tendency better. I created these new measures because trivial ie unsurprising implications are of little interest for data miners, doctors, engineers, investors, researchers, scientists. The next formulas would overemphasize importance of surprise, because Pxy/Px has range [0..1], while (1-Py)/Py has [0..oo) : ! (Pxy/Px).(1-Py)/Py = P(y|x).(1-Py)/Py = (x-->y).Surpriseby(y) = [0..1]*[0..oo) = [0..oo) { big range } (Pxy/Py).(1-Px)/Px = P(x|y).(1-Px)/Px = (y-->x).Surpriseby(x) = [0..1]*[0..oo) = [0..oo) Only after this synthesis we may not be surprised that the last lines are a substantial part of a risk ratio aka relative risk: RR(y:x) = P(y|x)/P(y|~x) is 0 for disjoint x,y ; is 1 for independent ; = (Pxy/(Py - Pxy)).(1-Px)/Px = (y implies x) . SurpriseBy(x) = [0..oo)*[0..oo) = [0..oo) note that : + both factors have the same range [0..oo) hence none of them dominates structurally ie in general; + in both factors both numerator and denominator are working in the same direction for increasing the product of Implies * Surprise; + there is no counter-working within each and among factors. ! P(y|x) > P(y|~x) == P(x|y) > P(x|~y) == Pxy > Px.Py (derive it) which is symmetrical ie directionless ie not oriented; the equivalence holds for the < <> = >= <= as well, the = is in all 16+1=17 conditions of independence. On human psychological difficulties in dealing with such causal/diagnostic tasks see { Tversky & Kahneman 122-3 }. cov(x,y) = Pxy - Px.Py = covariance of events x, y (binary aka indicator) var(x) = Pxx - Px.Px = Px.(1 - Px) = variance of an event x (autocov ) corr(x,y) = cov(x,y)/sqrt(var(x).var(y)) = correlation of binary events x,y >= greater or equal => is meaningless in this epaper, although some use it for an implication, which is MISleading because: y-->x == (y <== x) == (y subset of x) == (y implies x) == ~(y,~x) == ~(y AndNot x) my <== works on Booleans represented as 0, 1 for False, True evaluated numerically, like the THOUGHTFUL notation : (y <= x) in Pascal on Booleans means (y implies x). In our probabilistic logic P(x|y)=1 == (y implies x 100%) ie (y SufFor x), ie hitting y will hit x , !! ie (x NecFor y), ie missing x will miss y (find Venn above). (y AndNot x) == (y,~x) == ~(y-->x) == ~~(y,~x) == (y UnlessBlockedBy x) 1st meaning: = (y,~x) = (y ButNot x) is a logical DIFFERENCE (see Venn ) = (y -x.y) with 0, 1 values = Py -Pxy with P's 2nd meaning: the functional relation among x,y,z is interpreted as z = fun(x,y) thus: (called INHIBITION in circuit design) "z = y UnlessBlockedBy x=1"; if x=1, then: input y=1 canNOT 'get through' into the result z = (y Unless x=1=y) since input x=1 blocks y=1 from passing to the output z. Also: ! 
The output z=1 if y stays present ie 1 without being blocked by x=1 Note that the meaning of (y AndNot x) is 'y UNLESS y=1 is blocked by x=1', !! ie the meaning of (y AndNot x) is NOT about 'blocking occurred', ! since that meaning is (y And x). This shows how CAREful we must be when assigning a meaning to even an intuitively clear operation : Ins Out x y z = (y Unless x=1=y) == (x < y) == (y > x) Blocking occurred 0 0 = 0 ; no x, no y means no blocking, hence z=y=0 ; 0 0 1 = 1 ; no x means no blocker, hence z=y=1 ; 0 1 0 = 0 yes x, but no y to block, hence z=y=0 ; 0 1 1 > 0 yes x, yes y can be blocked, hence z= 0 ! 1 IF both y, x are present THEN blocker x blocks y from entering z output ELSE the output z=y ; there must be some y to block, and some blocker x=1, both present ie y=1=x ; !! ie blocking of y is CHANGING y=1 to z=0 if x=1=y , ie x And y ! !! ie a blocker x can REVERSE y=1 to z=0 if x=1=y , ie x And y !, but (y AndNot x) = y EXCEPT when y=1=x in which case the result z = 0 = ~y, = 1 if (y=1 & x=0 ie x <> 1 , as above: (y,~x) = x < y = not(x >= y) in Pascal on Booleans (internally 0 , 1) = y > x = not(y <= x) in Pascal; ~(y <== x) == ~(y-->x) here. My B(y:x), W(y:x), F(y:x) and C(y:x) have been written as ?(x:y) by ancient authors like I.J. Good, John Kemeny and Karl Popper, who were inspired by the Odds-forms, which swap x, y via Bayes rule of inversion. However my notation (used by I.J. Good since 1992) is much less error prone as it naturally and mnemonically abbreviates the simplest straight forms like eg: RR(y:x) = risk ratio = relative risk = B(y:x) = simple Bayes factor = P(y|x) / P(y|~x) ARR(x:y) = P(y|x) - P(y|~x) = absolute risk reduction = risk difference = a/(a+b) - c/(c+d) = (ad - bc)/[(a+b).(c+d)] = (Pxy -Px.Py)/(Px.(1-Px)) = risk increase (or risk reduction ) = cov(x,y)/var(x) = covariance(x,y)/variance(x) !! = beta(y:x) = the slope of the probabilistic regression line Py = beta(y:x).Px + alpha(y:x) for indication events x, y ie for binary events aka Bernoulli events; -1 <= beta(:) <= 1 ! 0.903 - 0.902 = 0.001 is relatively small, but the same difference: 0.003 - 0.002 = 0.001 is relatively large; absolute differences may be misleading for some purposes, but for practical treatment effects the RR(y:x) exaggerates risk more, and more often than ARR(:) and 1/|ARR|'s like NNT, NNH do. RRR(y:x) = RR(y:x) - 1 = ARR(x:y)/P(y|~x) = [ P(y|x) - P(y|~x) ]/P(y|~x) if > 0 = relative risk reduction = excess relative risk = relative effect ! Note that despite ARR(x:y) being a measure of m(x-->y) , the ! 1/P(y|~x) = 1/[ Py - Pxy ] is the overRuling M(y-->x) , hence : RRR(y:x) is the proper notation. F(y:x) = (P(y|x) - P(y|~x))/(P(y|x) + P(y|~x)) = factual support = (difference/2)/(sum/2) !! = deviation/arithmetic average my 1st interpretation !! 
= slope(of y on x )/(P(y|x) + P(y|~x)) my 2nd interpretation = beta( y:x )/(P(y|x) + P(y|~x)), -1 <= beta(:) <= 1 , = [ cov(x,y)/var(x)]/(P(y|x) + P(y|~x)) = (Pxy -Px.Py)/(Px.(1-Px))/(P(y|x) + P(y|~x)) = rescaled B(y:x) from [0..1..oo) to [-1..0..1] 3rd interpretation = rescaled W(y:x) from (-oo..0..oo) to [-1..0..1] 4th interpretation = is a combined (mixed) measure scaled [-1..0..1] of : - how much y and x are independent, 0 if 100% independent - how much (y --> x) , yields +1 if 100% implication = [ ad - bc ]/[ ad + bc + 2ac ] = CF2(y:x) = [ P(x|y) - P(x) ]/[ Px.(1 - P(x|y)) + P(x|y).(1 - Px) ] is a certainty factor in MYCIN at Stanford rescaled in { Heckerman 1986 }, which I recognized as F(y:x) via : = (Pxy - Px.Py)/(Pxy + Px.Py - 2.Px.Pxy) my 5th interpretation = [ RR(y:x) -1]/[ RR(y:x) +1 ] my 6th interpretation F0(:) = [ F(:) + 1 ]/2 = F(:) linearly rescaled to [0..1/2..1] : F0(y:x) = P(y|x)/[ P(y|x) + P(y|~x) ] RR(:) = [ 1 + F(:) ]/[ 1 - F(:) ] is a conversion . RR(y:x) and its functions F(y:x), F0(y:x) are co-monotonic, as they measure - how much the event y implies the x event. This is the directed ie oriented ie asymmetric component of these measures; - how much x, y are stochastically dependent ie covariate ie associate. This is the symmetrical aspect or an association. No contortion is needed to have events x, y which are almost independent, if we measure independence by Pxy/(Px.Py) or by (Pxy -Px.Py)/(Pxy +Px.Py) or by (Pxy - Px.Py)/min(Pxy, Px.Py), and at the same time one event will strongly imply the other event. But this PARADOX depends on the sensitivity (wrt the deviations from exact independence) of the measure. Hence our choice of a single measure should depend on our preference for what the measure should stress: an implication, or (a deviation from) independence. E.g. K. Popper's corroboration C(:) stresses dependence over implication, while Kemeny's factual support F(:) stresses implication over dependence, but neither of those authors say so, nor anybody has noticed that so far. Of course, we could always use two measures, one for an implication, and the other for a deviation from independence, but the Holy Grail is a single formula, which will inevitably combine ie mix these two aspects, because they are arbitrarily mixable. However impossible it may be to find the Excalibur formula for causation, I believe it to be possible to identify formulas which come closer to the Holy Grail than other formulas. I consider the notions of stochastic DEPENDENCE together with probabilistic IMPLICATION (or my AndNot ) and SURPRISE as the key building blocks because they are well defined, though not understood enough by too many folks. -.- +Interpreting 2x2 contingency tables wrt RR(:) = relative risk=risk ratio a , b the counts a, d are hits, and b, c are misses c , d ie a, d concord, and b, c discord and a+b+c+d = n = the total count of events. !! It is useful to view such a table as a Venn diagram formed by two rectangles, one horizontal and one vertical, with partial overlap measured by n(x,y) = the joint count ie co-occurrence of x and y : ______________________________ | | | | a = n( x,y) | b = n( x,~y) | n( x) = a+b | | | |-------------+--------------. is a 2x2 Venn diagram | | . | c = n(~x,y) | d = n(~x,~y) . n(~x) = c+d |_____________|............... a+c = n(y) b+d = n(~y) N = a+b+c+d but nothing prevents you from viewing an overlap in any of the 4 corners. Feel free to rotate or to transpose this standard table at your own peril. 
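A minimal Python sketch (names and toy counts are assumptions) which takes
the standard table a = n(x,y), b = n(x,~y), c = n(~x,y), d = n(~x,~y) and
returns RR(y:x), Kemeny's F(y:x), its rescaling F0(y:x), and checks the
conversions (RR-1)/(RR+1) = F and (1+F)/(1-F) = RR :

# Minimal illustrative sketch; toy counts only.
def rr_yx(a, b, c, d):              # RR(y:x) = [a/(a+b)]/[c/(c+d)]
    return (a / (a + b)) / (c / (c + d))

def f_yx(a, b, c, d):               # F(y:x) = [ad - bc]/[ad + bc + 2ac]
    return (a * d - b * c) / (a * d + b * c + 2.0 * a * c)

def f0_yx(a, b, c, d):              # F0(y:x) = P(y|x)/[P(y|x) + P(y|~x)]
    pyx, pynx = a / (a + b), c / (c + d)
    return pyx / (pyx + pynx)

if __name__ == "__main__":
    a, b, c, d = 25.0, 10.0, 30.0, 30.0     # toy counts
    rr, f = rr_yx(a, b, c, d), f_yx(a, b, c, d)
    print("RR(y:x) = %.4f  F(y:x) = %.4f  F0(y:x) = %.4f"
          % (rr, f, f0_yx(a, b, c, d)))
    print("(RR-1)/(RR+1) = %.4f   (1+F)/(1-F) = %.4f"
          % ((rr - 1) / (rr + 1), (1 + f) / (1 - f)))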
Typical semantics (one quadruple per line) may be, eg: x ~x y ~y test+ says K.O. test- says ok disorder not this disorder exposed unexposed illness not this illness risk factor present risk fact.absent outcome present outcome absent ! treatment control non-case case alleged cause cause absent effect not this effect symptom present symptom absent possible cause not this cause conjecture,hypothesis evidence observed so be careful with assigning your own semantics ! We can avoid mistakes if we stick here to the first four interpretations just listed. The 2x2 probabilistic contingency table summarizes the dichotomies : | y ~y | marginal sums -----|------------------------------------+------------------------- x | a/n = P( x,y) , b/n = P( x,~y) | P( x) = (a + b)/n ~x | c/n = P(~x,y) , d/n = P(~x,~y) | P(~x) = (c + d)/n = i/n -----+------------------------------------|------------------------- Sums | P(y) P(~y) | 1 = P( x,y) +P( x,~y) | = (a+c)/n = (b+d)/n =f/n | +P(~x,y) +P(~x,~y) In my squashed Venn diagram in 1D-land, the joint occurrences of (x&y) ie (x,y) ie "a" are marked by ||| = a/N = Pxy : nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn ffffffffffffffxxxxxxxxxxxxxxxxxxxxxxxxxxffffffffffffffffffffff ----------------------aaaaaaaaaaaaaaaaaa---------------------- iiiiiiiiiiiiiiiiiiiiiiyyyyyyyyyyyyyyyyyyyyyyyyyyyyiiiiiiiiiiii 11111111111111111 A limited 1-verse of discourse 1111111111111 ---- 1-Px ----xxxxxxxxxxxxxx Px xxxxxxxx------------- 1-Px --- ---- 1-Pxy -----------|||||| Pxy |||||||------------- 1-Pxy -- ---- 1-Py ------------yyyyyy Py yyyyyyyyyyyyyyyyy--- 1-Py --- From the 4 counts ( a+b+c+d = n ) we easily obtain all P(.)'s. From the 3 proportions or probabilities Px, Py and Pxy we can obtain any other P(.,.) and P(.|.) containing any mix of (non)negations, but without raw counts we cannot compute eg confidence interval CI. The legality of given or generated P's can be checked by the above explained Bonferroni inequality. .- The subset-based measures of implication aka entailment: 0 <= Px - Pxy <= 1 measures how little the event x-->y, hence 1/(Px - Pxy) measures how much the event x-->y with MAXImum = oo for Pxy = Px ; 0 <= P(y|x) <= 1 measures how much the event x-->y with MAXImum = 1 for Pxy = Px ; !! recall that (x implies y) == (~y implies ~x) in logic; probabilistic is 1 -P(x,~y) = 1 -(Px - Pxy) P( x| y) = a/(a+c) = sensitivity ( Yerushalmy 1947 ) = true positive ratio P(~x|~y) = d/(d+b) = specificity ( Yerushalmy 1947 ) = true negative ratio P( x|~y) = b/(b+d) = 1-specificity = false alarm rate = false positive ratio P( y| x) = a/(a+b) = positive predictivity P(~y|~x) = d/(d+c) = negative predictivity LR+ = positive likelihood ratio = LR(x:y) = P(x|y)/P(x|~y) = P(x|y)/(1 - P(~x|~y)) = (a/(a+c))/(b/(b+d)) = sensitivity/(1 - specificity) LR- = negative likelihood ratio = P(~x|y)/P(~x|~y) = = (c/(c+a))/(d/(d+b)) = (1 - sensitivity)/specificity RR(y:x) = relative risk = risk ratio = LR(y:x) = P(y|x)/ P(y|~x) = (a/(a+b))/(c/(c+d) = (Pxy/Px)/((Py - Pxy)/(1-Px)) which I rearranged into: !!! = Pxy.[1/(Py - Pxy)] .(1-Px)/Px !!! = [Pxy / P(y,~x) ] .(1-Px)/Px which I interpret as follows: RR(y:x) + INCreases with Pxy in general (due to numerator and denominator); + INCreases with Pxy approaching Py in particular, eg when (y implies x) fully then RR(y:x) = oo ie infinity; ! - DECreases with INCreasing Px in (1 - Px)/Px which meaningfully measures our LOW/high surprise when a FREQUENT/rare event x occurs (more below). 
Note that P(x|y) = Pxy/Py as a measure of how much is the occurrence of y Sufficient for x, hence how much is the occurrence of x Necessary for y, is not an explicit fun( Px ), hence !! P(x|y) cannot discount the lack of surprise like RR(y:x) does. !!! RR(y:x) = CoOccur(x,y) .(y-->x) .SurpriseBy(x), range [0..oo) surprise too or: = ( Pxy/P(y,~x) ).SurpriseBy(x) !!! ie: = LikelyThanNot(y: x) .SurpriseBy(x) where LikelyThanNot(y: x) is visualized by my squashed Venn diagram (imagine two overlapping pizzas or pancakes x and y, and view them from aside) : xxxxxxxxxxxxx--- length of xx..x = Px yyyyyyyyyy length of yy..y = Py length of --- = P(~x,y) = Py - Pxy It is meaningful to contrast the overlap Pxy against the underlap P(~x,y), eg to contrast them relatively as their ratio : Pxy/P(~x,y) >=< 1 where >=< stands for >, =, <, >=, <=, <> ie P(x,y) >=< P(~x,y) ie P(x|y) >=< P(~x|y) , so that eg for the > we say that : x occurs More Likely Than Not if y occurred, or we say equivalently : x occurs More Likely Than Not with y , which both capture our thinking. + DE/INcreases with IN/DEcreasing Px; this is meaningful, because our !!! surprise value of x DIScounts the "triviality effect" of Px =. 1 : !! if Px =. 1 then Pxy = Py too easily occurs, and RR(y:x) = 1/0 = oo. !! If Py = 1 then Pxy = Px and P(y,~x)=P(~x) hence RR(y:x) = 1/1 = 1, indeed, if all are ill, there can be no risk of becoming ill. Surprise value of x DE/INcreases with IN/DEcreasing Px in general; (1 - Px), 1/Px, hence also my (1 - Px)/Px measures our surprise by x. !! My new measure P(x|y).(1 - Px) = (y implies x).LinearSurpriseBy(x) = = [0..1]*[0..1] = [0..1] is simpler, but carries less meanings than RR(y:x). ! + is DOMINAted by the factor 1/(Py - Pxy) for a given exposure Px ; this factor measures how much (y-->x). From this and from Pxy <= min(Px, Py), but not from "SurpriseBy", follows : !!! if Py < Px then RR(y:x) >= RR(x:y) ie LR(x:y), !!! if Py > Px then RR(y:x) <= RR(x:y) ie LR(x:y), where the = may occur for x,y independent ie RR(:)=1, or if Pxy=0=RR(:), as my program Acaus3 asserts. That "SurpriseBy" is not decisive wrt RR(y:x) >=< RR(x:y), follows from the comparison of: (y-->x) = 1/(Py - Pxy) vs (1-Px)/Px = SurpriseBy(x) ie (1 - 0)/(Py - Pxy) vs (1-Px)/(Px - 0). Lets write Px = k.Py to reduce RR(:) to just 2 variables Pxy, Py, and lets compare RR(y:x) with RR(x:y) ie LR(x:y) : (Pxy/(Py - Pxy)).(1-Px)/Px >=< (Pxy/(Px - Pxy)).(1-Py)/Py ie: (k.Py - Pxy)/(Py - Pxy) >=< k.(1-Py)/(1 -k.Py) = Dy in shorthand !! Pxy >=< Py.( Dy - k )/(Dy - 1) = Py.[ (1 - Py)/(1 - k.Py) - 1]/[(1 - Py)/(1 - k.Py) - 1/k] Checked: Solving for k the RR(y:x) = RR(x:y), where Px = k.Py , yields a quadratic equation with two distinct real roots k1, k2 : k1 = 1 ie Px = Py which obviously is correct k2 = Pxy/(Py.Py) ie Py.Px = Pxy which holds for independent x,y . + is oo ie infinite for Py - Pxy = 0 ie P(~x,y) = 0 ie Pxy = Py in which case y implies x fully, because then y is a SubSet of x, ie whenever y occurs, x occurs too ; draw a Venn diagram. + is ASYMMETRICAL ie directed ie oriented wrt the events x, y (this unlike correlation coefficients and other symmetrical association measures) . 
is a relative measure, a ratio (while differences are absolute measures which may mislead us since eg 0.93 - 0.92 = 0.03 - 0.02) - is a combined measure which inseparably MIXes measuring of two key properties: - stochastic dependence, which is a symmetrical property, and - probabilistic implication, which is an Asymmetrical property, which both I see as necessary conditions for a possible CAUSAL relationship between x and y. Hence RR(:) INDICATES potential CAUSAL TENDENCY; + has range [0..1..oo) with 3 opeRationally interpretable fixed points: RR(y:x) = oo if Pxy = Py ie (y-->x), ie possibly (x causes y) ; = 1 if y and x are fully independent ie if Pxy = Px.Py = 0 if (Pxy = 0) and (0 < Px < 1) ie disjoint events x,y ie RR(:) = 0 means disjoint ie mutually exclusive events x,y 0 < RR(:) < 1 means negative dependence or correlation of x,y 1 < RR(:) <= oo means positive dependence or correlation of x,y !! ie RR(:) has a huge unbounded range for positively dependent x,y vs RR(:) has a small bounded range for negatively dependent x,y , hence both subranges are not comparable; the positive subrange is !! much more SENSITIVE than the negative subrange. In this respect !! F(:) is BALANCED but has no simple interpretation of risk ratio. = 0/0 if (Py = 0 or Py = 0 hence Pxy = 0 too). = 1 if (Py = 1 hence Pxy = Px, P(x|y) = Px/1 ie independent x,y) then RR(y:x) = (1-Px)/(1-Px) = 1 ie independence. !! = 1 if (Px = 1 hence Pxy = Py, P(y|x) = Py/1 ie independent x,y) then RR(y:x) = Pxy/(0/0) = Py/(0/0) numerically, which may !! seem to be undetermined, but as just shown, Px = 1 means that P(y|x) does not depend on Px, ie that x,y are independent (find 0/0 ). RR(y:x) = P(y|x)/P(y|~x) where in many (not all) medical applications y is a health disorder, and x is a symptom. But both { Lusted 1968 } and { Bailey 1965, 109, quoted: } noted that : "P(y|x) will vary with circumstances (social, time, location), however !! P(x|y) will have often a constant value because symptoms are a function of a disease processes themselves, and therefore relatively INdependent of other external circumstances. ... so we could collect P(x|y) on a national scale, and collect Py on a [ local/individual ] space-time scale.". The [loc/indiv] is mine. Therefore we should compute RR(y:x) indirectly via Bayes rule ie via P(y|x) = Py.P(x|y)/Px where P(x|y) is "global" and more stable. + RR(y:x) has an important advantage over its co-monotonic but nonlinear transform F(y:x). The simple proportionality of RR(y:x) can be used to (dis)prove confounding. Good explanations of confounding are rare, the best introduction is in { Schield 1999 } where on p.3 we shall recognize Cornfield's condition: P(c|a)/P(c|~a) > P(e|a)/P(e|~a) as RR(c:a) > RR(e:a), and R.A. Fisher's: P(a|c)/P(a|~c) > P(a|e)/P(a|~e) as RR(a:c) > RR(a:e). Be reminded that "contrary to the prevailing pattern of judgment", as { Tversky & Kahneman 1982, 123 } point out, it holds, in my more general formulation : [ P(y|x) >=< P(y|~x) ] == [ P(x|y) >=< P(x|~y) ] , hence also: [ RR(y:x) >=< 1 ] == [ RR(x:y) >=< 1 ] where >=< stands for a consistently used >, =, <, >=, <=, <> . + For 3 more properties see { Schield 2002, p.4, Conclusions }. .- +More on probabilities Keep in mind that it always holds: P(.) + P(~.) 
= 1 eg P(y|x) + P(~y|x) = 1 ; hence also: P(x or y) + P(~(x or y)) = 1 from which via DeMorgan's rule follows: P(x or y) + P(~x,~y) = 1 P(x or y) + Pxy = Px + Py see the overlap of 2 Pxy in a Venn diagram hence P(~x,~y) = 1 -(Px + Py - Pxy) P(~x,~y) = P(~(x or y)) by DeMorgan's rule; he died in 1871; "his" rule has been clearly described by Ockham aka Occam aka Dr. Invincibilis in Summa Logicae in 1323 ! = 1 - P(x or y) = 1 - (Px + Py - Pxy) and, surprise : ! Pxy - Px.Py = Pxy.P(~x,~y) - P(x,~y).P(~x,y) from 2x2 table's diagonals ( with / the rhs would be odds ratio OR(:), find below ) = Pxy.(1 -Px -Py +Pxy) - (Px -Pxy).(Py -Pxy) = cov(x,y) = covariance of 2 "as-if random" events x,y , or indicator events aka binary/Bernoulli events, from which follows for independent events only : if Pxy - Px.Py = 0 ie cov(x,y) = 0 ie if Pxy = Px.Py (this is equivalent to 16 other equalities) ! then Pxy.P(~x,~y) = P(x,~y).P(~x,y) ie products on 2x2 table's diagonals are equal; this I call the 17th condition of independence (find 17th ), which is equivalent (==) to any of the other 4 + 3*(8/2) = 16 equivalent conditions of independence, like eg: ( Pxy = Px.Py ) == ( P(x|y) = Px ) == ( P(y|x) = Py ) == ( P(y|x) = P(y|~x) ) == ( P(~y|~x) = P(~y|x) ) == ( P(x|y) = P(x|~y) ) == ( P(~x|~y) = P(~x|y) ) == etc Only for independent x,y it holds, via Occam-DeMorgan's rule : P(~(~x,~y)) = 1 - (1 -Px).(1 -Py) = Px + Py - Px.Py = P(x or y) for indep. More of the equivalent conditions of independence are obtained by changing x into ~x, and/or y into ~y, or vice versa. Any consistent mix of such changes will produce an equivalent condition of independence for events, negated or not, simply because "A nonevent is an event is an event". Changing the = into < or > in any of the 17 conditions of independence will create corresponding equivalent conditions of dependence which obviously are necessary but far from sufficient conditions for a causal relation between 2 events x, y. For example : ( Pxy > Px.Py ) == ( P(y|x) > Py ) == ( P(x|y) > Px ) !! == ( P(y|x) > P(y|~x) ) == ( P(x|y) > P(x|~y) ) == etc. From all these 17 inequalities of the generic form lhs > rhs we can obtain some 6*17= 102 measures of DEPENDENCE simply by COMPARING or CONTRASTING : Da = lhs - rhs are ABSOLUTE DEPENDENCE measures, eg P(e|h) - P(e|~h) Da is scaled [0..1] for lhs > rhs , or [-1..1] in general, with 0 if x,y are fully independent Dr = lhs / rhs are RELATIVE DEPENDENCE measures, eg P(e|h) / P(e|~h) Dr is scaled [0..1..oo) with 1 if x,y are fully independent. Rescalings : log(lhs / rhs) is scaled (-oo..0..oo) in general; (lhs - rhs )/(lhs + rhs) is scaled [-1..0..1], I call it kemenyzation, = (lhs/rhs -1)/(lhs/rhs +1); and lhs/(lhs + rhs) is scaled [0..1/2..1]. Odds(.) = P(.)/(1 - P(.)) = 1/( 1/P(.) - 1 ) = P(.)/P(~.) P(.) = Odds(.)/(1 + Odds(.)) = 1/( 1/Odds(.) + 1 ) P(x| y)/P(~x| y) = P(x| y)/(1 - P(x| y)) = Odds(x| y) P(x|~y)/P(~x|~y) = P(x|~y)/(1 - P(x|~y)) = Odds(x|~y) P(x| y)/P( x|~y) = B(x: y) = LR(x: y) = LR+ is a likelihood ratio where B(x: y) is a simple Bayes factor = RR(x:y) ; is NOT odds(.) since RR(.: .) is NOT a P(.|.)/P(~.|.) . Bayes rule in odds-likelihood form : Posterior odds on x if y = Prior odds . Likelihood ratio = Odds(x|y) = Odds(x) . 
LR(y:x) = P(x|y)/ P(~x|y) = ( Px/P(~x) ).( P(y|x)/P(y|~x) ) = P(x|y)/(1 -P(x|y)) = Px/(1 -Px).( P(y|x)/P(y|~x) ) = 1/( 1/P(x|y) - 1 ) In our 2x2 contingency table we have Odds ratio OR(:) : OR = (a/b)/(c/d) = a.d/(b.c) SeLn(OR) = sqrt( 1/a + 1/b + 1/c + 1/d ) = standard error of odds ratio OR cov(x,y) = Pxy.P(~x,~y) - [ P(x,~y).P(~x,y) ] = Pxy - Px.Py but OR <> Pxy /(Px.Py) , except when Pxy = Px.Py , or Pxy=0 . Relative risks RR(:) for the following 2x2 contingency table: | e | ~e | e = effect present; ~e = effect absent ----|-----+-----+------ h | a | b | a+b h = hypothetical cause present (eg tested+ ) ~h | c | d | c+d ~h = eg unexposed to environment (eg tested- ) ----+-----+-----|------ | a+c | b+d | n RR( e: h) = P(e|h)/ P(e|~h) = (a/(a+b))/(c/(c+d)) = (Peh/Ph)/((Pe-Peh)/(1-Ph)) = a.(c+d)/((a+b).c)) ! = oo if Pe=Peh ie (a+c)=a ie P(e,~h)=0 ie c=0 recalling P(e|~h) + P(~e|~h) = 1, we get: RR(~e:~h) = P(~e|~h)/P(~e|h) = (d/(c+d))/(b/(a+b)) = (1 - P(e|~h))/(1 - P(e|h)) = (1 - c/(c+d))/(1 - a/(a+b)) = d.(a+b) /(b.(c+d)) ! = oo if Peh=Ph ie a=(a+b) ie P(h,~e)=0 ie b=0 RR( h: e) = P(h|e)/ P(h|~e) = (a/(a+c))/(b/(b+d)) = (Peh/Pe)/((Ph-Peh)/(1-Pe)) = a.(b+d)/((a+c).b)) = oo if Ph=Peh ie (a+b)=a ie P(h,~e)=0 ie b=0 RR(~h:~e) = P(~h|~e)/P(~h|e) = (d/(b+d))/(c/(a+c)) = (1 - P(h|~e))/(1 - P(h|e)) = (1 - b/(b+d))/(1 - a/(a+c)) = d.(a+c) /(c.(b+d)) ! = oo if Peh=Pe ie a=(a+c) ie P(e,~h)=0 ie c=0 ie: for c=0 are RR( e: h) = oo = MAXImal = RR(~h:~e) for b=0 are RR( h: e) = oo = MAXImal = RR(~e:~h) for a=0 is RR( e: h) = 0 = minimal = RR( h: e) for d=0 is RR(~h:~e) = 0 = minimal = RR(~e:~h) ! RR(e:h).RR(~e:~h) = RR(h:e).RR(~h:~e) = Peh.P(~e,~h)/( P(e,~h).P(h,~e) ) = Peh.P(~e,~h)/( P(~e,h).P(~h,e) ) which clearly are identical. While these equations hold in general, you might like to meditate upon why the 17th (find 17th , 16+1 ) condition of independent x, y consists from the same components. -.- +Tutorial notes on probabilistic logic, entropies and information Stan Ulam, the father of the H-device (Ed Teller was the mother) used to say that "Our fortress is our mathematics." I say that here "Our fortress is our logic." Elementary probability theory is strongly isomorphous with the set theory, which is strongly isomorphous with logic. There are 16 Boolean functions of 2 variables, of which 8 are commutative wrt both variables. For the purposes of inferencing we should use ORIENTED ie DIRECTED ie ASYMMETRIC functions only. From the remaining 8 asymmetric logical functions 4 functions are of 1 variable only, so that only 4 asymmetric functions remain for consideration : 2 implications and 2 AndNots which are pairwise mutually complementary. ASYMMETRY is easily obtained ! even from symmetrical measures of association (or dependence) simply by normalization with a function of one variable only, eg : (Pxy -Px.Py)/(Px.(1-Px)) is 0 if x,y are independent = cov(x,y)/var(x) = beta(y:x) = slope of a probabilistic regression line Py = beta(y:x).Px + alpha(y:x) = [ P(y|x) - Py ]/(1-Px) = P(y|x) - P(y|~x) (check it as an equation) = the numerator of F(y:x) below = ARR(x:y) = absolute risk reduction of y if x (or increase if ARR < 0). Many measures of information are easily obtained by taking expected value of either differences or ratios of the lhs and rhs taken from a dependence inequality lhs > rhs mentioned above. 
For example we could create : SumSum[ Pxy.Dr(y:x) ] where Dr is a relative dependence measure like eg RR(y:x), but a single Dr = oo would make the whole SumSum = oo, hence it is better to use SumSum[ Pxy. Da(x:y) ] where Da is an absolute dependence measure, eg: SumSum[ Pxy.(P(y|x) - P(y|~x)) ] = SumSum[ Pxy.ARR(x:y) ]; knowing that P(y|x) - P(y|~x) = ARR(x:y) = beta(y:x) = dPy/dPx , and that Integral[ dx.Px.(dPx/Px)^2 ] = Fisher's information, I realized that ! my SumSum[ Pxy.( F(y:x) )^2 ] could serve as a quasi-Fisher-informatized RR ; find "my 1st interpretation" of F(y:x) . { Renyi 1976, vol.2, 546-9 } connects Fisher's and Shannon's information. A particularly nice & meaningfully asymmetrical (wrt r.v.s X, Y) information is my favorite ( out of the 2*[16+1] = 2*17 = 34 possible simple measures of association; find 17 or 16+1 ) Cont(:) : Cont(Y:X) = Cont(X) - Cont(X|Y) == Gini(X) - Gini(X|Y) = Gini(Y:X) = Var(X) - E[Var(X|Y)] = Sum[ Px.(1-Px) ] - SumSum[ Pxy.(1-P(x|y))] = quadratic entropy = 1-Sum[(Px)^2] -(1-SumSum[ Pxy. P(x|y) ]) = parabolic entropy = ( 1 - E[Px]) ) -(1-E[P(x|y)]) !! = SumSum[ Pxy.( P(x|y) -Px )] my opeRationally clearest form !! = Expected[ P(x|y) -Px ] ie average dependence measured = SumSum[ Pxy.K(y:x) ] is ASYMMETRICAL wrt x,y, X,Y too = SumSum[ ( Pxy^2 - Py.Px.Pxy )/Py ] from 3rd line up. Now a leap: **= SumSum[ ((Pxy - Py.Px)^2 )/Py ] compare with Phi^2 below = SumSum[ ( Pxy^2 -2Py.Px.Pxy + (Px.Py)^2 )/Py ] **= will be proven if the next *= is proven : *= SumSum[ ( Pxy^2 - Py.Px.Pxy + 0 )/Py ] = 4th line up proof: Sum[Px.Px] =SumSum[Px.Pxy] = SumSum[Py.Px^2] = 3rd term on 3rd line up. 1-Cont(X) = Sum[ Px.Px ] = E[Px] = expected probability of a variable X = Sum[ (n(x)/N).(n(x)-1)/(N-1) ] = an UNbiased estimate of E[Px] = expected probability of success in guessing events x = long-run proportion of correct predictions of events x = concentration index by Gini/Herfindahl/Simpson ( Simpson was a WWII codebreaker like I.J.Good and Michie ; they called it a "repeat rate" ) Cont(X) = 1 - Sum[(Px)^2] = expected improbability of variable X = expected error or failure rate eg in guessing events x 0 <= [ Cont(:), 1-Cont(:) ] <= 1 ie they saturate like P(error), while Shannon's entropies have no fixed upper bound. Log-scale fits with the physiological Weber-Fechner law, eg sound is measured in decibels dB , and so is the pH-factor (0..7..14 = max. alkalic). Logs work even in psychenomics, since you will feel less than twice as happy after your salary or profits were doubled :-) For more on infotheory in physiology see the nice book { Norwich 1993 }. Cont(:) has been called many names, eg quadratic entropy or parabolic entropy. Cont(:) gives provably better, sharper results than Shannon's entropy for tasks like eg pattern classification in general, and diagnosing, ! identification, prediction, forecasting, and discovery of causality in particular. Such tasks are naturally DIRECTED ie ASYMMETRICAL, requiring Cont(Y:X) <> Cont(X:Y), while Shannon's mutual information I(Y:X) = I(X:Y) = SumSum[ Pxy.log(Pxy/(Px.Py)) ] is symmetrical wrt X,Y. No wonder that Cont( beats I( . Cont(Y:X)/Cont(X) = TauB(Y:X) { Goodman & Kruskal, Part 1, 1954, 759-760 } ! where they interpreted TauB as "relative decrease in the proportion of incorrect predictions". See my Hint 2 & Hint 7 on www.matheory.info. Neither they nor { Bishop 1975 } have realized that TauB is a normalized quadratic entropy based on 1-P, the simplest decresing function of P. 
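A minimal Python sketch (function names and the toy joint table are
assumptions) of the quadratic entropy Cont(X), of
Cont(Y:X) = SumSum[ Pxy.(P(x|y) - Px) ] and of TauB(Y:X) = Cont(Y:X)/Cont(X):

# Minimal illustrative sketch; rows index values of X, columns values of Y.
def cont_x(pxy):
    px = [sum(row) for row in pxy]
    return 1.0 - sum(p * p for p in px)

def cont_yx(pxy):                       # Cont(Y:X) = SumSum[Pxy.(P(x|y)-Px)]
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    total = 0.0
    for i, row in enumerate(pxy):
        for j, p in enumerate(row):
            if p > 0.0:
                total += p * (p / py[j] - px[i])
    return total

def taub_yx(pxy):                       # TauB(Y:X) = Cont(Y:X)/Cont(X)
    return cont_yx(pxy) / cont_x(pxy)

if __name__ == "__main__":
    joint = [[0.20, 0.10],              # toy 2x2 joint distribution
             [0.15, 0.55]]
    print("Cont(X)   = %.4f" % cont_x(joint))
    print("Cont(Y:X) = %.4f" % cont_yx(joint))
    print("TauB(Y:X) = %.4f" % taub_yx(joint))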
{ Agresti 1990, 75 } tells us that for 2x2 contingency tables G & K's TauB = Phi^2. So I deduce that for 2x2 contingency tables G & K's TauB is SYMMETRICAL wrt X,Y since Phi^2 and other X^2-based measures are SYMMETRICAL wrt X,Y : In general TauB(Y:X) = Cont(Y:X)/Cont(X) <> Cont(X:Y)/Cont(Y) = TauB(X:Y) For 2x2 : TauB(Y:X) = Cont(Y:X)/Cont(X) = Cont(X:Y)/Cont(Y) = TauB(X:Y) hence even for 2x2 Cont(Y:X) <> Cont(X:Y) as long as Cont(X) <> Cont(Y) , I deduce. X^2 = mean square CONTINGENCY { Kendall & Stuart, chap.33, 555-557 } = N.(SumSum[ n(x,y)^2 /( n(x).n(y) ) ] -1) sample ChiSqr statistic = N.SumSum[ (Pxy - Px.Py)^2 /(Px.Py) ] { Blalock 1958, 102 } = N.Phi^2 = N.SumSum[((Pxy -Px.Py)/Px).(Pxy -Py.Px)/Py ] = N.SumSum[ (P(y|x) -Py).(P(x|y) -Px) ] = N.SumSum[ K(x:y).K(y:x) ], find K( JHr^2 = SumSum[((Pxy - Px.Py)^2)/(Px.(1-Px).Py.(1-Py))] = SumSum[ cov(x,y)/var(x) . cov(x,y)/var(y) ] = SumSum[ beta(y:x) . beta(x:y) ] ie [slope.slope] = SumSum[ (P(y|x) -P(y|~x)).(P(x|y) -P(x|~y)) ] = SumSum[ ARR(x:y).ARR(y:x) ] = SumSum[ r2 ] = my new cummulated r2 ie r^2 shows what Phi^2 = ( X^2 )/N is not, and how are they related. Pcc = sqrt[ ( X^2 )/( X^2 + N) ] = Pearson's CONTINGENCY coefficient = sqrt[ (Phi^2)/(Phi^2 + 1) ], (not Pearson's CORRELATION coeff r , find r^2 , which are not SumSum[...] ) ; Phi^2 = ( X^2 )/N { Blalock 1958, p103, Phi^2 vs TauB = Cont(Y:X)/Cont(X) } = Phi-squared = Pearson's mean square CONTINGENCY = SumSum[ (Pxy - Px.Py)^2 /(Px.Py) ] from ( X^2 )/N above (see asymmetric **= Cont(Y:X) above) = SumSum[ Pxy.Pxy/(Px.Py) ] - 1 a symmetrical expected value like Shannon's mutual information: I(Y:X) = SumSum[ Pxy.log(Pxy/(Px.Py)) ] = I(X:Y) in general = -0.5*ln(1 - corr(X,Y)) if X, Y are continuous Gaussian variables [ 1 - Cont(X) ]/Pxi = E[Px]/Pxi = surprise index for an event xi within the r.v. X, as defined by { Weaver }, Shannon's co-author. Cont(.) was intended as a semantic information content measure, SIC. The key idea is that the LOWER the probability of an event, the MORE POSSIBILITIES it ELIMINATES, EXCLUDES, FORBIDS, hence MORE its occurrence SURPRISES us. { Kemeny 1953, 297 } refers this insight to { Popper 1972, 270,374,399,400,402 mention P(~x) = 1-Px as semantic information content measure SIC }, { Bar-Hillel 1964, 232 } quotes "Omnis determinatio est negatio" = "Every determination is negation" = "Bestimmen ist verneinen" by Baruch Spinoza (1632-1677), in 1656 excommunicated from the synagogue in Amsterdam. Btw, William of Ockham aka Occam aka Dr. Invincibilis was excommunicated from the Church in 1328 :-) . KISS = Keep IT simple, student! is Occamism. Stressing elimination of hypotheses or theories is Popperian refutationalism. In principle any decreasing function of Px will do, but (1-Px) is surely the simplest one possible, simpler than Shannon's log(1/Px) = -log(Px). By combining (1-Px) with 1/Px, I constructed SurpriseBy(x) = (1-Px)/Px only to find it implicit or hidden inside RR(y:x), after my rearrangement of atomic factors in RR(:). Note that Sum[ Px.1/Px] would not work as an expected value. For more on Cont(:) see { Kahre, 2002 } in general, and my (Re)search hints Hint2 & Hint7 there on pp.501-502 in particular (see www.matheory.info ). Let me finish this with a proposal for new infomeasures: !! 
JH(Y:X) = SumSum[ Pxy.ARR(y:x) ] = SumSum[ Pxy.( P(x|y) - P(x|~y) ) ] JH(X:Y) = SumSum[ Pxy.ARR(x:y) ] = SumSum[ Pxy.( P(y|x) - P(y|~x) ) ] and 2 other with Abs(difference): SumSum[ Pxy*| P(.|_) - P(.|~_) | ] -.- +Google's Brin's conviction Conv( , my nImp( , AndNot : Boolean logic is strongly isomorphous with the set theory wherein X implies Y whenever X is a subset of Y. Since the probability theory is also strongly isomorphous with the set theory, we see that for the < as the symbol for both "a subset of" and for the "less than", it is obvious that since [ P(x|y) > P(y|x) ] == [ Py < Px ] and v.v. , !!! Py < Px makes y-->x ie x NecFor y more plausible, while !!! Px < Py makes x-->y ie y NecFor x more plausible, which can be easily visualized with a Venn (aka pancakes or pizzas) diagram. 0 <= P(y|x) = Pxy/Px <= 1 measures how much the event x-->y with maximum = 1 for Pxy=Px ; 0 <= P(x,~y) = Px - Pxy <= 1 measures how little the event x-->y or: 1 - P(y|x) = (Px - Pxy)/Px measures how little the event x-->y or: 1/(Px - Pxy) measures how much the event x-->y Also recall that the Bayesian probabibility of a j-th hypothesis x_j , given a vector of cue events y_c ie y..y, is (under the assumption of independence) computed by the Bayes chain rule formula based on the product of P(y|x) , the higher the more probable the hypothesis x_j : P(x_j, y..y) =. P(x_j).Product_c:( P(y_c | x_j ) ; don't swap y, x ! A cue event is a feature/attribute/symptom/evidential/test event "x implies y" in plaintalk : "If x then y" is a deterministic rule, which in plain English says that "x always leads to y" ie Px - Pxy = 0 ie Px = P(x, y); or: "never (x and not y) occur jointly" ie P(x,~y) = 0 if x-->y 100% note that Pxy + P(x,~y) = Px , hence [ P(x,~y) = 0 ] == [ Pxy = Px ] which is the deterministic (ie perfect or ideal) case which directly translates into the probabilistic formalisms for an increasing function of the strength of implication : 1 - P(x,~y) = 1 -(Px - Pxy), or, the smaller the P(x,~y), the more x-->y , = 1 - Px + Pxy , since ~(x,~y) == (x implies y) in logic. Another measure is the "conviction" by Google's co-founder { Brin 1997 } : Conv(x:y) = Px.P(~y)/P(x,~y) = (Px - Px.Py)/(Px - Pxy) is my form = Px/P(x|~y) = P(~y)/P(~y|x) is my form = Conv(~y:~x) this = is UNDESIRABLE find; = 1/Nimp(~x:~y) = 1/ Nimp(y:x) nearby below = 1 if x,y independent; and where: + the larger the Pxy <= min(Px, Py), the more the x-->y, and + the closer the Pxy is to Px.Py , the less dependent are x,y and the closer the Conv(:) to 1 which is the fixed point for independence. ( y --> x) == (~x --> ~y) where --> is "implies" in logic; here too: Conv( y: x) = Py.P(~x)/P(y,~x) = (Py -Px.Py)/(Py -Pxy), UNDESIRABLE = : = Conv(~x:~y) = P(~x).Py/P(~x,y) = 1/Nimp(~y:~x) = 1/Nimp(x:y) ( y --> x) is 1/P(y,~x) = 1/[ Py - Pxy ] = = (~x -->~y) is 1/P(~x,y) = 1/[ P(~x) - P(~x,~y) ] by DeMorgan's law : = 1/[ 1-Px -(1 -(Px+Py-Pxy)) ] = 1/(Py - Pxy) so their equality is logically ok, but it is !!! UNDESIRABLE FOR A MEASURE OF CAUSAL TENDENCY. Q: why? A: because eg: "the rain causes us to wear raincoat" is ok, but "not wearing a raincoat causes no rain" makes NO SENSE, as the Nobel prize winner Herbert Simon pointed out in { Simon H. 1957, 50-51 }. This UNDESIRABLE equality does not hold for LR(:), RR(:) and its co-monotonic transformations like eg W(:) and F(:). 
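A quick numeric illustration of this UNDESIRABLE symmetry, as a Python sketch
(the probabilities are an arbitrary toy choice; helper names conv, rr ad hoc):
  def conv(Pa, Pb, Pab):                # Conv(a:b) = Pa.P(~b)/P(a,~b)
      return Pa*(1.0 - Pb)/(Pa - Pab)
  def rr(Pa, Pb, Pab):                  # RR(a:b) = P(a|b)/P(a|~b)
      return (Pab/Pb)/((Pa - Pab)/(1.0 - Pb))
  Px, Py, Pxy = 0.3, 0.4, 0.2           # an arbitrary non-degenerate choice
  Pnx, Pny = 1 - Px, 1 - Py
  Pnxny = 1 - (Px + Py - Pxy)           # P(~x,~y) by DeMorgan
  print(conv(Py, Px, Pxy), conv(Pnx, Pny, Pnxny))  # Conv(y:x) = Conv(~x:~y)
  print(rr(Py, Px, Pxy), rr(Pnx, Pny, Pnxny))      # RR(y:x) <> RR(~x:~y)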
(y AndNot x) == (y,~x) == ~(~(y,~x)) == ~(y-->x) in logic, is equivalent to "y does not imply x" ie NonImp(y:x) == P(y,~x) = Py - Pxy = y AndNot x == in plaintalk "Lack of x allows/permits/leads to y", since in the perfect case we get: ideally P(~x,~y) = 0 is the deterministic, extreme case; note that P(~x,~y) = 0 is not equivalent to Pxy = Py, because by DeMorgan P(~x,~y) = 1 - (Px + Py - Pxy) = P(~(x or y)) always, hence P(~x,~y) = 0 ie Px + Py - Pxy = 1 ie Px + Py = 1-Pxy Recall P(~x,~y) + P(~x, y) = P(~x) always P(~x,~y) + P( x,~y) = P(~y) always Nimp(y:x) = P(x,~y)/( Px.P(~y) ) = 1/Conv(x:y) = 1/Conv(~y:~x) Nimp(x:y) = P(y,~x)/( Py.P(~x) ) = 1/Conv(y:x) = 1/Conv(~x:~y) = ( Py - Pxy )/( Py -Px.Py) this = is UNDESIRABLE = 0 if Pxy = Py = 1 if Pxy = Px.Py ie if x, y independent = oo if Px = 1 = [0..1..oo) = 1 / Conv( y: x) = Py.P(~x)/P(y,~x) = 1 / Conv(~x:~y) = P(~x).Py/P(~x,y) Conv(x:y) = Px.P(~y)/P(x,~y) the larger, the more causation, due to small P(x,~y) co-occurence = (Px - Px.Py)/(Px - Pxy) is 1 if x, y are independent; = [0..1..oo) , infinity oo if Pxy = Px nImp(x:y) = ( P(y,~x) - Py.P(~x) )/( P(y,~x) + Py.P(~x) ) = ( Py -Pxy - (Py -Px.Py))/( Py -Pxy + Py -Px.Py ) = ( Pxy - Px.Py )/( Pxy + Px.Py -2.Py ) = [-1..0..1] by kemenyzation nImp(y:x) = ( Pxy - Px.Py )/( Pxy +Px.Py -2.Px ) find -nImp( ~(x AndNot y) == (x implies y) , hence it should hold: -nImp(y:x) = (x-->y) = caus1(x:y) and indeed, it does hold = ( Pxy - Px.Py )/( 2.Px -Pxy -Px.Py ), find caus1(x:y) below. Consider again: "Lack of x (almost) always leads to y". Clearly, it would be wrong to tell somebody with x and y that x caused y . Hence Pxy alone cannot measure how much the x causes y, but P(y|x) could. Alas, P(y|x) = Pxy/Px is not a function of Py, and we believe that it is wise to have measures which are functions of all 3 Pxy, Px and Py : Conv(x:y) = Px.P(~y)/P(x,~y) the larger this Conv, the more of x-->y due to small P(x,~y) co-occurence = (Px - Px.Py)/(Px - Pxy) is 1 if x, y are independent; = [0..1..oo) , infinity oo if Pxy = Px Conv2(x:y) = P(x implies y)/( ~( Px.P(~y))) larger implies more = P(~(x,~y))/( ~( Px.P(~y))) = (1 -P( x,~y))/(1 - Px.P(~y) ) is 1 if independent x,y = [1/2..1..4/3] , 1 if x,y independent; 4/3 if x imp y. From the P(~x,~y) + P(~x, y) = P(~x) P(~x,~y) + P( x,~y) = P(~y) for the case P(~x,~y) = 0 holds P(~y) = P( x,~y) in which case Conv(x:y) = Px <= 1 = independent x,y Conv2(x:y) = ( 1-P(~y) )/( 1 -P(~y).Px ) <= 1 = independent x,y <= 1 is due to .Px , which always is 0 <= Px <= 1 . <= 1 in this case is good, because P(~x,~y) = 0 was shown to be !!! equivalent to the (y AndNot x), hence x cannot imply y , not even a little bit, ie causation must not exceed the point of no dependence ie point of independence, and indeed both Conv(:) and Conv2(:) are <= 1 in this case, which is good. An explanation and justification of the Conv(:) measures: + Conv(:) = fun( Px, Py, Pxy ) ie fun of all 3 defining probabilities. + Conv has a fixed value if x, y independent , and also has a fixed value if x implies y 100% , hence Conv has a decent opeRational interpretation. + Conv(x:y) = extreme when x implies y 100% !!! ie when Pxy = Px regardless of Py (draw a Venn) (x implies y) = ~(x,~y) in logic = 1 - P(x,~y) in probability { Brin 1997 } got rid of the outer negation ~ by taking the reciprocal value. On one hand this trick is not as clean as Conv2(:), !!! but on the other hand this trick makes the !!! 
100% implication value an extreme value REGARDLESS of Py : Conv(x:y) = Px.P(~y)/P(x,~y) , the larger the more (x implies y) = Px/P(x|~y) , is 1 if independent x,y = Px.(1 - Py)/(Px - Pxy) = (Px - Px.Py)/(Px - Pxy) is 1 if Pxy = Px.Py ie 100% independence, is oo if Pxy = Px ie 100% (x implies y) oo needs a precheck for an overflow; numerically is 0/0 if Py = 1 ie Pxy = Px (overflow) but: correct logically is 1 if Py = 1 as Pxy = Px.Py ie x,y indep. Or its reciprocal (since Pxy = Px is possible, while Py < 1) : (Px - Pxy)/(Px - Px.Py) , the smaller the more (x implies y) : is 1 if Pxy = Px.Py ie 100% independence, is 0 if Pxy = Px ie 100% (x implies y) numerically is 0/0 if Py = 1 ie Pxy = Px (overflow) but: correct logically is 1 if Py = 1 as Pxy = Px.Py ie x,y indep. Conv(:) kemenyzed by me to the scale [-1..0..1] becomes Conv1(x:y) = ( Px.P(~y) - P(x,~y) )/( Px.P(~y) + P(x,~y) ) = ( Px - P(x|~y) )/( Px + P(x|~y) ) = ( Pxy - Px.Py)/(2.Px - Pxy - Px.Py) = -nImp(y:x) above; = ( P(~y) - P(~y|x) )/( P(~y) + P(~y|x) ) which kemenyzed to the scale [0..1/2..1] becomes : Conv3(x:y) = Px.P(~y)/[ Px.P(~y) + P(x,~y) ] Or based on counterfactual reasoning ( cofa ) IF ~x THEN ~y : Cofa1(~x:~y) = ( P(~x).Py - P(~x, y))/( P(~x).Py + P(~x, y)) = ( Py -Px.Py - Py +Pxy )/( Py -Px.Py + Py -Pxy ) = ( Pxy - Px.Py)/(2.Py -Px.Py -Pxy ) = -nImp(x:y) above, ie Non(y AndNot x) Cofa0(~x:~y) = P(~x).Py / P(~x, y) = (Py -Px.Py)/(Py -Pxy) = Conv(y:x) above, and indeed, in logic (~x <== ~y) == (y <== x); the <== means "implies" (and also it means "less then" if applied to 0 = false, 1 = true) F(~x:~y) == F(~x <== ~y) = ( P(~x|~y) - P(~x|y) )/( P(~x|~y) + P(~x|y) ) = -F(~x:y) They all look reasonable, and all are scaled to [-1..0..1]. Q: which one do you like, if any, and why (not) ? A mathematically more rigorous alternative to Conv(x:y) is my Conv2(x:y) which does not suffer from the dangers of an overflow, employs the exact probabilistic (x implies y) = 1-P(x,~y) derived from the exact logical (x implies y) == ~(x,~y). Since we wish to have a fixed value for the independence of events x, y, the exact implication form 1 - P(x,~y) suggests to compare it with the negation of the fictive ie as-if independence term as follows: Conv2(x:y) = P(x implies y)/( x,y independ ) larger implies more = P(~(x,~y))/( ~( Px.P(~y)) ) = ( 1- P( x,~y))/( 1 -Px.P(~y) ) is 1 if independent = ( 1-(Px - Pxy))/( 1 -Px.(1-Py) ) = ( 1- Px + Pxy )/( 1 -Px +Px.Py ) is 1 if Pxy = Px.Py This is ( 1 -Px + Px )/( 1 -Px +Px.Py ) if Pxy = Px = 1/( 1 -Px.(1 -Py)) >= 1 if Pxy = Px, the larger the Px and the smaller the Py, the > 1 is Conv2(x:y). If Px = Pxy (find Venn ) ie if 100% implies then the numerator is 1 ie MAXimal, but unlike in Conv(x:y), !! the denominator depends on Px and Py . -.- +Rescalings important wrt risk ratio RR(:) aka relative risk For positive u, v : u/v is scaled [ 0..1..oo) v <> 0, and: W = ln(u/v) is scaled (-oo..0..oo) v <> 0, and: F = (u - v )/(u + v ) is scaled [ -1..0..1 ], allows u=0 xor v=0 = (1 - v/u)/(1 + v/u) handy for graphing F=f(v/u) u <> 0 = (u/v - 1)/(u/v + 1) handy for graphing F=f(u/v) v <> 0 = (u - v )/(u + v ) this rescaling I call "kemenyzation" to honor the late John G. Kemeny, the Hungarian-American co-father of BASIC, and former math-assistant to Einstein; = tanh(W/2) = tanh(0.5*ln(u/v)) by { I.J. Good 1983, 160 where sinh is his mistake } Since atanh(z) = 0.5*ln((1+z)/(1-z)) for z <> 1, W = 2.atanh(F) = ln((1+F)/(1-F)) for F <> 1 F0 = (F+1)/2 is linearly rescaled to [0..1/2..1], 1/2 for independence. 
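A minimal numeric confirmation of the tanh link (and of the sinh slip it
corrects), as an illustrative Python sketch:
  import math
  def kemenyze(u, v):                   # (u - v)/(u + v): [0..oo) -> [-1..0..1]
      return (u - v)/(u + v)
  for u, v in [(0.8, 0.2), (0.5, 0.5), (0.01, 0.99)]:
      W  = math.log(u/v)                # log scale, (-oo..0..oo)
      F  = kemenyze(u, v)
      F0 = (F + 1.0)/2.0                # linearly rescaled to [0..1/2..1]
      print(F, math.tanh(W/2.0), F0)    # F = tanh(W/2) exactly, NOT sinh(W/2)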
W(y:x) = ln( P(y|x)/P(y|~x) ) is an information gain [see F(:) ] = ln( P(y|x) ) - ln(P(y|~x) ) is additive Bayes factor = ln( B(y:x) ) = ln(RR(y:x) ) = ln( relative risk of y if x ) = ln( Odds(x|y)/Odds(x) ) W(:) is I.J. Good's "weight of evidence in favor of x provided by y". The advantage of oo-less scalings like [-1..0..1] or [0..1/2..1] is that they make comparisons of different formulas possible at all and more meaningful, though not perfect. E.g. we may try to compare a value of F(:) with that of Conv1(:) which is Conv(:) kemenyzed by me. W(:)'s logarithmic scale allows addition (of otherwise multiplicable ratios) under the valid assumption of independence between y, z : W(x: y,z) = W(x:y) + W(x:z) but when y,z are dependent we must use { I.J. Good 1989, 56 } : W(x: y,z) = W(x:y) + W(x: y|z) F(:)'s cannot be simply added, but can be combined (provided y, z are independent) according to { I.J. Good, 1989, 56, eq.(7) } thus : F(x: y,z) = ( F(x:y) + F(x:z) )/( 1 + F(x:y).F(x:z) ) but when y,z are dependent we must use : F(x: y,z) = ( F(x:y) + F(x: z|y) )/( 1 + F(x:y).F(x: z|y) ) Seeing this, physicists, but not necessarily physicians, might recall the relativistic velocity addition formula for combining 2 relativistic speeds into the resultant one by means of a regraduation function for relativistic addition of velocities u, v into a single rapidity rap : rap = ( u + v )/( 1 + u.v/(c.c) ) where c is the speed of light. P(.|.)'s maximum = 1 corresponds to the unexceedable speed of light c , so that rap simplifies to our ( u + v )/( 1 + u.v ). This relativistic addition appears in: - { Lucas & Hodgson, 5-13 } is the best on regraduation (no P(.)'s ) - { Yizong Cheng & Kashyap 1989, 628 eq.(20) }, good - { Good I.J. 1989, 56 } - { Grosof 1986, 157 } last line, no relativity mentioned - { Heckerman 1986, 180 } first line, no relativity mentioned. -.- +Correlation in a 2x2 contingency table r == corr(x,y) r's sign = the sign of the numerator = [ a.d - b.c ]/sqrt[ (a+b).(c+d) .(a+c).(b+d) ] = [Pxy.P(~x,~y) - P(x,~y).P(~x,y)]/sqrt[ Px.P(~x) . Py.P(~y) ] ! the last and the next numerator have different forms, but are equal : = [ Pxy - Px.Py ]/sqrt[ Px.(1-Px) . Py.(1-Py) ] = [ a/N -(a+b).(a+c)]/sqrt[(a+b).(c+d) .(a+c).(b+d) ] = [n(x,y)/N - n(x).n(y) ]/sqrt[ n(x).(N -n(x)) . n(y).(N -n(y))] = cov(x,y)/sqrt[ var(x) . var(y) ] = ( Pearson's) CORRELATION coefficient for binary ie Bernoulli ie indicator events ; is symmetrical wrt x,y ( NOT Pearson's CONTINGENCY coefficient ; find Pcc ) r2 = Square[ corr(x,y) ] = r.r >= 0 = Square[ Pxy - Px.Py]/[ Px.(1-Px).Py.(1-Py) ] = [ cov(x,y)/var(x) ].[ cov(x,y)/var(y) ] = beta(y:x) . beta(x:y) , -1 <= beta <= 1, same signs = [ slope of y on x ].[ slope of x on y ] = [ P(y|x) - P(y|~x) ].( P(x|y) - P(x|~y) ] = ARR(x:y) . ARR(y:x) = ( X^2 )/n find X^2 near below = coefficient of determination r^2 = ( explained variance ) / ( explained var. + unexplained variance ) = ( variance explained by regression )/( total variance ) = 1 - ( variance unexplained )/( total variance ) r2 is considered to be a less inflated ie more realistic measure of correlation than the r = corr(x,y) itself (keep its sign). The key mean squared error equation from which the above follows is : MSE = variance explained + variance unexplained aka residual variance This MSE equation I call Pythagorean decomposition of the mean squared error MSE into its orthogonal partial variations. 
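The r2 identities below are easy to verify on raw counts; a minimal Python
sketch (the counts a,b,c,d = 30,10,20,40 are arbitrary, only for illustration):
  import math
  a, b, c, d = 30, 10, 20, 40           # any 2x2 counts, n = a+b+c+d
  n = a + b + c + d
  r = (a*d - b*c)/math.sqrt((a+b)*(c+d)*(a+c)*(b+d))
  ARR_xy = a/(a+b) - c/(c+d)            # P(y|x) - P(y|~x) = beta(y:x)
  ARR_yx = a/(a+c) - b/(b+d)            # P(x|y) - P(x|~y) = beta(x:y)
  X2 = n*(a*d - b*c)**2/((a+b)*(c+d)*(a+c)*(b+d))
  print(r*r, ARR_xy*ARR_yx, X2/n)       # the three values coincide, all = r2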
It is a sad fact that very few books on statistics and/or probability show
the correlation coefficient r between events.
Yule's coefficient of colligation { Kendall & Stuart 1977, chap.33 on
Categorized data, 539 } is also symmetrical wrt x, y:
Y = ( 1 - sqrt(b.c/(a.d)) ) / ( 1 + sqrt(b.c/(a.d)) )
  = ( sqrt(a.d) - sqrt(b.c) ) / ( sqrt(a.d) + sqrt(b.c) ) kemenyzed
  = tanh(0.25.ln( a.d/(b.c) )) my tanhyperbolization a la I.J. Good
The formula for chi-squared (find X^2 , chisqr , chisquared, ChiSqr ) :
X^2 = Sum[ ( Observed - Expected )^2 /Expected ]
    = ( a - (a+b)(a+c)/n )^2 /( (a+b)(a+c)/n )
    + ( b - (a+b)(b+d)/n )^2 /( (a+b)(b+d)/n )
    + ( c - (a+c)(c+d)/n )^2 /( (a+c)(c+d)/n )
    + ( d - (b+d)(c+d)/n )^2 /( (b+d)(c+d)/n ) is exact; approx:
   =. n.(|ad - bc| -n/2)^2 /[ (a+b)(a+c)(b+d)(c+d) ] Yates' correction
  =.. n.( ad - bc )^2 /[ (a+b)(a+c)(b+d)(c+d) ] may be good enough
    = n.( ad - bc )/[ (a+b).(c+d) ] . (ad - bc)/[(a+c).(b+d)]
    = n.[ a/(a+b) - c/(c+d) ].[ a/(a+c) - b/(b+d) ]
    = n.[ P(y|x) - P(y|~x) ].[ P(x|y) - P(x|~y) ]
 !! = n. ARR(x:y) . ARR(y:x) = n.r2 = ChiSqr(x,y) find r2 nearby above
.-
Besides opeRationally meaningful interpretation of values, it is important
how a measure ORDERS the values obtained from a data set, since we want
a list of the pairs of events (x,y) sorted by the strength of their
potential causal tendency. Note that :
P(x|y) is a naive diagnostic predictivity of the hypothesis x from y effect;
P(y|x) is a naive causal predictivity of the effect y from x ;
P(y|x) = Py.P(x|y)/Px = Pxy/Px is the basic Bayes rule.
The likelihood ratio aka Bayes factor in favor of the conjectured hypothesis
x implied by y, ie y-->x ( which is exact and much more specific than the
vague wording "x provided by y", ie by the evidence, predictor, cue, feature
or effect ), is relative risk RR(:) aka risk ratio :
RR(y:x) == B(y:x) = P(y|x) / P(y|~x) = (Pxy/Px)/[(Py - Pxy)/(1-Px)]
 = Pxy.(1 - Px) / (Px.Py - Px.Pxy) caution! /0 if Pxy = Py :-(
 = (1 - Px) / (Px.Py/Pxy - Px)
 = (Pxy - Px.Pxy)/( Px.Py - Px.Pxy ) which shows that:
   B = 1 for Px.Py=Pxy ie for independent x,y ; and
   B = oo for Py=Pxy ie "if y then x" ie y-->x
 = relative odds on the event x after the event y was observed :
 !! = Odds(x|y)/Odds(x) = posteriorOdds / priorOdds ; { odds form }
 = [ P(x|y)/(1-P(x|y)) ]/[ Px/(1-Px) ]
 = [ P(x|y).(1-Px) ]/[ Px.(1-P(x|y)) ] which inverts into
 = P(y|x)/P(y|~x) via basic Bayes rule P(x).P(y|x) = Pxy = P(x|y).Py
 !! = [ Pxy/(Py - Pxy) ].[ (1-Px)/Px ]
 !! shows that Py = Pxy does mean that y-->x so that B(y:x) = oo
 !! note that when (x causes y) then y-->x but not necessarily
 !! vice versa; the y is an effect or outcome in general;
 = P(y|x)/( 1-P(~y|~x) ) = B(y:x) because of:
 = P(y|x)/P( y|~x) q.e.d.
Lets compare :
  RR(y:x) = P(y|x)/P(y|~x) = P(y|x).(1-Px)/(Py -Pxy) with:
! Conv(y:x) = Py/P(y|~x) = Py.(1-Px)/(Py -Pxy) = Py.P(~x)/P(y,~x)
clearly RR(y:x) is more meaningful than the "conviction" by { Brin 1997 },
though conviction is no nonsense either :
+ both RR(y:x) and Conv(y:x) equal oo if Py=Pxy ie if y-->x
+ both RR(y:x) and Conv(y:x) equal 1 if Pxy=Px.Py ie y, x are independent
+ RR(y:x) equals 0 if Pxy=0 ie if y is disjoint with x, while Conv(y:x)
  then equals 1-Px > 0 , one more point for RR(y:x)
+ RR(y:x) is relative risk, used within other meaningful formulas
+ RR(y:x) <> RR(~x:~y) which is good, while
- Conv(y:x) == Conv(~x:~y) which is NO GOOD (find UNDESIRABLE above)
B(~y:~x) = P(~y|~x) /P(~y|x) == RR(~y:~x) = [1 - P( y|~x)]/[ 1 - P(y|x)]
 = [ P(~y,~x)/P(~y,x) ].Px/(1 - Px)
 = [ (1 -Py -Px +Pxy)/(Px -Pxy) ]. Px/(1-Px)
 = [ (1 -Py)/(Px-Pxy) -1 ]. Px/(1-Px)
!! when Px=Pxy ie when x-->y then B(~y:~x) = oo hence if we wish to use
a B(~.:~.)
instead of B( .: .), then we must swap the events x and y. For example instead of B( y: x) we might use : B(~x:~y) = P(~x|~y) / P(~x|y) == RR(~x:~y) = [ (1 -Px)/(Py-Pxy) -1 ]. Py/(1-Py) !! when Py=Pxy ie when y-->x then B(~x:~y) = oo W(y:x) = the weight of evidence for x if y happens/occurs/observed = ln( P(y|x)/P(y|~x) ) = Qnec(y:x) { I.J. Good 1994, 1992 } = ln( B(y:x) ) = logarithmic Bayes factor for x due to y = ln(RR(y:x) ) = ln( Odds(x|y)/Odds(x) ) W(~y:~x) = the weight of evidence against x if y absent { I.J. Good } = ln[ P(~y|~x)/P(~y|x) ] = Qsuf(y:x) { I.J. Good 1994, 1992 } = ln[ B(~y:~x) ] = ln[(1-P( y|~x))/(1-P( y|x))] = -W(~y:x) W(:) = 2.atanh(F(:)) = ln((1+F)/(1-F)) for abs(F) <> 1 B(a:b) = P(a|b)/P(a|~b) = (Pab/Pb)/((Pa-Pab)/(1-Pb)) = oo if Pab=Pa ie (a implies b) ie (a --> b) B(b:a) = P(b|a)/P(b|~a) = (Pab/Pa)/((Pb-Pab)/(1-Pa)) = oo if Pab=Pb ie (b implies a) Q:/Quiz: could comparing (eg subtracting or dividing) B(a:b) with B(b:a) show the DIRECTION of a possible causal tendency ?? P(~b,~a) = 1 - (Pa + Pb - Pab) = P(~(a or b)) by DeMorgan's rule B(~b:~a) = P(~b|~a)/P(~b|a) = (1 - P(b|~a))/(1 - P(b|a)) = [P(~b,~a)/(1-Pa)] / [(Pa-Pab))/Pa] = oo if Pab=Pa ie (a implies b) like for B(a:b) or W(a:b) which speaks against comparing ?(a:b) with ?(~b:~a) for the purpose of deciding the direction of possible causal tendency. C(y:x) = corroboration or confirmation measure { Popper 1972, p400, (9.2*) } = (P(y|x) - Py )/( P(y|x) + Py - Pxy ) C-form1 { Popper 1972 } = ( Pxy - Px.Py )/( Pxy + Px.Py - Pxy.Px) my C-form2 = (P(y|x) - P(y|~x))/( P(y|x) + Py/P(~x) ) compare w/ F-form1 = (cov(x,y)/var(x) )/( P(y|x) + Py/P(~x) ) C-form3a = beta(y:x)/( P(y|x) + Py/P(~x) ) C-form3b F(y:x) == F(y-->x) == F(y <== x) = "degree of factual support of x by y" = primarily a measure of how much y implies x (my view) = (P(y|x) - P(y|~x))/( P(y|x) + P(y|~x) ) F-form1 { Kemeny 1952 } = ARR(x:y)/( P(y|x) + P(y|~x) ) my F-forms follow: = ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Pxy.Px ) F-form2 = (cov(x,y)/var(x) )/( P(y|x) + P(y|~x) ) F-form3a = beta(y:x)/( P(y|x) + P(y|~x) ) F-form3b = tanh( 0.5*ln(P(y|x) / P(y|~x)) ) F-form4 = tanh( W(y:x)/2 ) { my tanh corrects I.J.Good's sinh } = (difference/2)/Average = deviation/mean F-form5 = ( B(y:x) - 1 )/( B(y:x) + 1 ) is handy also for graphing F = fun(B) = -F(y:~x) = (Pxy/Px - (Py-Pxy)/(1-Px)) / (Pxy/Px + (Py-Pxy)/(1-Px)) hence: = -1 if P(y|x) = 0 ie if P(y,x) = 0 ie if x, y are disjoint = 0 if x, y are independent ; = 1 if P(y|~x) = 0 ie if P(y,~x) = 0 ie Pxy = Py ie P(x|y)=1 ie if y implies x ie y-->x 100% ie deterministic ie if y leads to x always ie IF y THEN x always holds where : the F-form1 is the original one by { Kemeny & Oppenheim 1952 }, my F-form2 is the de-conditioned one, and it does reveal that if Pxy=Py (see the -2 factor), ie if y --> x, then F(x:y)=1. my F-form3 reveals an important hidden meaning: beta(y:x) is the slope of the implicit probabilistic regression line of Py = beta(y:x).Px + alpha(y:x) ; the F-form4 reveals that F(:) and Turing-Good's weight of evidence W(:) are changing co-monotonically my F-form5 provides the most simple interpretation of F(:) Unlike B(:) or W(:), the F(:) will not easily overflow due to /0. The numerators tell us that for independent x, y it holds C(:) = 0 = F(:). C(:) stresses the near independence, while F(:), W(:), B(:) stress near implication more than near independence. Try out an example with a near independence and simultaneously with near implication. 
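Here is one such trial, as an illustrative Python sketch (the three
probabilities are chosen only so that x,y are nearly independent AND y
nearly implies x at the same time; any similar choice will do):
  import math
  Px, Py, Pxy = 0.99, 0.30, 0.299       # Pxy =. Px.Py = 0.297 AND Pxy =. Py
  B = (Pxy/Px)/((Py - Pxy)/(1 - Px))    # B(y:x) = P(y|x)/P(y|~x)
  W = math.log(B)                       # W(y:x) = ln B(y:x)
  F = (B - 1)/(B + 1)                   # F(y:x) via F = fun(B) above
  C = (Pxy - Px*Py)/(Pxy + Px*Py - Pxy*Px)   # Popper's C(y:x), C-form2
  print(B, W, F, C)   # B=.3.0, F=.0.50 feel the near implication, C=.0.007 =. 0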
F(x:y) == F(x <== y) = degree of factual support of y by x = primarily a measure of how much x implies y = (P(x|y) - P(x|~y))/( P(x|y) + P(x|~y) ) F-form1 = ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Py.Pxy ) my F-form2 = (Pxy/Py - (Px-Pxy)/(1-Py)) / (Pxy/Py + (Px-Pxy)/(1-Py)) = -F(x:~y) note that if Px=Pxy then F(x:y) = 1 == (x implies y) fully, and that this matches Pxy/Px = 1 as maximal possible contribution to the product for P(x_j | y..y) computed by the simple Bayesian chain rule over y..y cues for P(x_j , y..y). Clearly a product of Pxy/Px terms over a vector of cues y..y may be viewed as a product of the simplest (x implies y) terms. Rescaling F(:) from [-1..0..1] to [0..1/2..1] : F0(:) = ( F(:) + 1 )/2 , so that F0(x:y) = P(x|y)/( P(x|y) + P(x|~y) ) F0(y:x) = P(y|x)/( P(y|x) + P(y|~x) ) Before we go further, we recall that F(:) is co-monotonic with B(:), and B(y:x) = P(y|x) / P(y|~x) = (Pxy/Px)/( (Py - Pxy)/(1 - Px) ) {now consider INCreasing Pxy:} wherein Pxy/Px measures how much (x implies y) up to maximum = 1 while 1/(Py - Pxy) measures how much (y implies x) up to maximum = oo hence if Py = Pxy then B(y:x) = oo ie y implies x is measured by B(y:x) and: B(y:~x) = P(y|~x) / P(y|x) = ((Py - Pxy)/(1 - Px)) /(Pxy/Px) {now consider DECreasing Pxy:} wherein Py - Pxy measures how much (y implies x) with maximum = Py while 1/(Pxy/Px) measures how much (x implies y) with maximum = oo for Pxy=0 hence if Pxy = 0 then B(y:~x) = oo F(y: x) = (P(y|x) - P(y|~x))/(P(y|x) + P(y|~x)) = -F(y:~x) = ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Px.Pxy ) F(~y:x) = (P(~y|x) - P(~y|~x))/(P(~y|x) + P(~y|~x) ) = -F(~y:~x) = ( Pxy - Px.Py )/( Pxy + Px.Py - 2.Px.(1 -Px +Pxy) ) and the remaining 4 mirror images are easily obtained by swapping x and y. Note that W(:) = 2.atanh(F(:)) = ln[ ( 1 + F(:) )/( 1 - F(:) ) ] . For example F(~y:x) = -F(~y:~x) would measure how much the hypothesis x explains the unobserved fact y , like eg in common reasoning : "if (s)he would have the health disorder x (s)he could NOT be able to do y (eg a body movement (s)he did)", so that from a high enough F(~y:x) we could exclude the disorder x as an unsupported hypothesis. -.- +Example 1xy for Px=0.1 , Pxy=0.1 , Py=0.2 , visualized by a squashed Venn diagram xxxxxxxxxx yyyyyyyyyyyyyyyyyyyy are P(x|y)=0.5 ie "50:50" ; P(x|~y)=(0.1 -0.1)/(1 -0.2) = 0 ie minimum P(y|x)=1.0 ie maximum ; P(y|~x)=(0.2 -0.1)/(1 -0.1) = 1/9 and corr(x,y) = cov(x,y)/sqrt[ var(x) . var(y) ] = (Pxy - Px.Py)/sqrt[ Px.(1-Px).Py.(1-Py) ] = 0.7 is the value of the correlation coefficient between the events x, y caus1(x:y) = ( Pxy - Px.Py ) / ( 2.Px -Pxy -Px.Py ) = 1 F(x:y) = (0.5 - 0)/(0.5 + 0) = 1 B(x:y) = P(x|y)/P(x|~y) = 0.5/0 = oo = infinity clearly the rule IF x THEN y cannot be doubted ; but what do we get when we swap the roles of x, y ie when our observer will view the situation from the opposite viewpoint ? This can be done by either swapping the values of Px with Py, or by computing F(y:x) : +Example 1yx: is F(y:x) = (1 -1/9)/(1 +1/9) = 0.8 is too high for my taste caus1(y:x) = (Pxy - Px.Py)/(2.Py -Pxy -Px.Py) = 0.29 is more reasonable, as Pxy/Py = 0.5 hence y doesnt imply x much (although x implies y fully as Pxy/Px = 1 ); B(y:x) = P(y|x)/P(y|~x) = 1/((0.2 - 0.1)/0.9) = 9 is (too) high. !!! Conclusion: for measuring primarily an implication & secondarily dependence, B(y:x) and F(y:x) are not ideal measures. !!! Note: if Px < Py then x is more plausible to imply y, than vice versa; if Py < Px then y is more plausible to imply x, than v.v. 
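Both viewpoints of +Example 1 can be re-derived mechanically; a minimal
Python sketch (the generic helpers F and caus1 are ad hoc names and follow
the ?(a:b) convention used above):
  import math
  Px, Py, Pxy = 0.1, 0.2, 0.1           # +Example 1: here x is a subset of y
  def F(Pa, Pb, Pab):                   # Kemeny's F(a:b) via its F-form1
      pab  = Pab/Pb                     # P(a|b)
      panb = (Pa - Pab)/(1 - Pb)        # P(a|~b)
      return (pab - panb)/(pab + panb)
  def caus1(Pa, Pb, Pab):               # caus1(a:b) as used in +Example 1
      return (Pab - Pa*Pb)/(2*Pa - Pab - Pa*Pb)
  r = (Pxy - Px*Py)/math.sqrt(Px*(1-Px)*Py*(1-Py))
  print(F(Px, Py, Pxy), caus1(Px, Py, Pxy))   # 1.0 , 1.0    = Example 1xy
  print(F(Py, Px, Pxy), caus1(Py, Px, Pxy))   # 0.8 , =.0.29 = Example 1yx
  print(r)                                    # =.0.67 , the 0.7 above rounded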
+Example 3: x = drunken driver ; y = accident
P(y| x) = 0.01 = P(accident y caused by a drunken driver x)
P(y|~x) = 0.0001 = P(accident y caused by a sober driver ~x)
is how { Kahre 2002, 186 } defines it; obviously P(y|x) > P(y|~x).
Note that without knowing either Px or Py or Pxy, we cannot obtain the
probabilities needed for caus1(x:y), F(x:y) and F(~x:~y), ie we can
compute B(y:x), F(y:x) and caus1(y:x) only :
beta(y:x) = P(y|x) - P(y|~x) =. 0.01 is the regression slope of y on x ,
which is misleadingly low.
B(y:x) = P(y|x)/P(y|~x) = 100 ie (y implies x) very strongly.
F(y:x) = (B-1)/(B+1) = 0.98 =. 1 = F's upper bound
F(y:x) measures how much an accident y implies drunkenness x
(obviously an accident cannot cause drunkenness).
B(~y:~x) = (1-P(y|~x))/(1-P(y|x)) = 1.01
F(~y:~x) = (B-1)/(B+1) = 0.005 =. 0 = F's point of independence
F(~y:~x) measures how much an absence of an accident y implies that
a driver is not drunk.
Here is my CRITICISM of such formulas: According to I.J. Good,
"The evidence against x if y does not happen" can also be considered
as a possible measure of x causes y . It is based on COUNTERFACTUAL
reasoning "if absent y then absent x", which I denote as
!! "necessitistic" reasoning. I am dissatisfied with the sad fact that his
formulation leads to formulas which are not zero when Pxy = 0 ie when
x, y are DISjoint. If the above explained notion of Necessity is to be
taken seriously, and I think it should be, then Good's formulation is
not good enough.
F(~y:~x) == F(~y-->~x) == F(~y <== ~x) = measures how much ~y implies ~x
 = ( P(~y|~x) - P(~y| x) )/( P(~y|~x) + P(~y|x) )
 = ((1-P(y|~x)) -(1-P(y| x)))/((1-P(y|~x)) +(1-P(y|x)))
 = ( P(y| x) - P(y|~x) )/( 2-P(y|~x) -P(y|x) )
 = ( Px.Py - Pxy )/( Px.Py + Pxy - 2.Px.(1 -Px +Pxy) )
 = -F(~y:x) = (B(~y:~x) - 1)/(B(~y:~x) + 1)
F(~x:~y) = (P(~x|~y) - P(~x|y))/( P(~x|~y) + P(~x|y) ) = -F(~x:y)
-.-
+Folks' wisdom
!!! Caution: causation works in the opposite direction wrt implication.
This is so, because ideally an effect y implies a cause x, ie a cause x
is necessary for an effect y. See the short +Intro again.
In what follows here it may be necessary to swap e, h if we want causation.
Since different folks had different mindwaves I tried not to mess with
their formulations more than necessary for this comparison.
The notions of probabilistic Necessity and Sufficiency for events have
been quantified differently by various good folks' wisdoms.
E.g. from { Kahre 2002, Figs.3.1, 13.4 + txt } follows :
if X is a subset of Z ie Z is a SuperSet of X ie all X are Z
   then X is Sufficient (but not necessary) for Z ie X implies Z.
   ie Z is a consequence of X ie IF X THEN Z rule holds, I say.
if Y is a SuperSet of Z ie Z is a subset of Y ie all Z are Y
   then Y is Necessary (but not sufficient) for Z ie Z implies Y
   ie Y is a consequence of Z ie IF Z THEN Y rule holds, I say;
For 2 numbers x, y it holds (x < y) == ( y > x) , or v.v.
For 2 sets X, Y it holds (X subset of Y) == ( Y SuperSet of X) , or v.v.
For 2 events x, y (x implies y) guarantees (Py >= Px) , but NOT v.v.
For 2 events we may like to answer the Q's (and from the above follow A's) :
Q: How much is x SufFor y ? A: as much as is y NecFor x .
Q: How much is x NecFor y ? A: as much as is y SufFor x .
Q: If Pxy=0 ie x, y are disjoint ? A: then Suf = 0 = Nec must hold.
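Before listing the folks' formulas, a tiny numeric check of that last answer
(an illustrative Python sketch; e = effect, h = hypothesised cause as below,
variable names ad hoc); it anticipates the criticism of RR(~e:~h) at Folk3:
  Pe, Ph, Peh = 0.3, 0.2, 0.0           # e and h disjoint, Peh = 0
  suf_naive = Peh/Ph                    # P(e|h) = naive sufficiency of h for e
  nec_naive = Peh/Pe                    # P(h|e) = naive necessity  of h for e
  Pnenh = 1 - (Pe + Ph - Peh)           # P(~e,~h) by DeMorgan
  rr_nec = (Pnenh/(1 - Ph))/((1 - Pe - Pnenh)/Ph)   # RR(~e:~h)
  print(suf_naive, nec_naive)           # both 0, as the answer A: requires
  print(rr_nec)                         # > 0 although e,h are disjoint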
Lets use: e = evidence, effect, outcome; h = hypothesised cause (exposure) Hence eg P(e|h) is a NAIVE, MOST SIMPLISTIC measure of Sufficiency of h for e because P(e|h) = 1 = max if Peh = Ph ie Ph - Peh = 0 = P(~e,h) ie if h is a subset of e, ie if h implies e then is h SufFor e. Note that: (h SufFor e) == (h subset of e), and (h NecFor e) == (e subset of h), hence : P(h|e) measures how much is h NecFor e, and P(e|h) measures how much is h SufFor e. Lets compare these with those now corrected in { Schield 2002, Appendix A } : P(e|h) = S = "Sufficiency of exposure h for case e" P(h|e) = N = " Necessity of exposure h for case e" (find NAIVE , SIMPLISTIC ) !! Caution: the suffixes nec, suf in ?nec , ?suf as used by various authors say nothing about which event is necessary for which one, if the authors do not use ?(y:x) and do not specify what these parameters mean. I recommend the ?(y:x) to mean that (y implies x) ie (y SufFor x) , or (x NecFor y). Folk1: { Richard Duda, John Gaschnig & Peter Hart: Model design in the Prospector consultant system for mineral exploration, in { Michie 1979, 159 }, { Shinghal 1992, chap.10, 354-358 } and in { Buchanan & Duda 1983, 191 } : Lsuf = P( e|h)/P( e|~h) = RR( e: h) = Qnec by I.J. Good Lnec = P(~e|h)/P(~e|~h) = 1/RR(~e:~h) = 1/Qsuf if Lnec = 0 then e is logically necessary for h if Lnec = large then ~e is supporting h (ie absence of e supports h) Lsuf = Qnec , but in fact there is NO semantic confusion, since Lsuf denotes how much is e SufFor h ie h NecFor e , and Qnec denotes how much is h NecFor e ie e SufFor h . Folk2: { Brian Skyrms in James Fetzer (ed), 1988, 172 } Ssuf = P( e| h)/P( e|~h) = RR( e| h) = Lsuf Snec = P(~h|~e)/P(~h| e) = RR(~h:~e) = [1 - P(h|~e)]/[1 - P(h|e)] if Ssuf > 1 then h has a tendency towards sufficiency for e ; if Snec > 1 then h has tendency towards necessity for e ; if Ssuf.Snec > 1 then h has tendency to cause the event e . Folk3: { I.J. Good 1994, 306 + 314 = comment by Patrick Suppes } : Qnec = P( e| h)/P( e|~h) = RR( e: h) = Lsuf by Folk1 Qsuf = P(~e|~h)/P(~e| h) = RR(~e:~h) = 1/Lnec Qsuf = weight of evidence against h if e does not happen = a measure of causal tendency, says I.J. Good. If Qnec.Qsuf > 1 then h is a prima facie cause of e, adds Suppes. I.J. Good's late insight (delayed 50 years) is formulated thus : !! "Qsuf(e:h) = Qnec(~e:~h). This identity is a generalization of the fact that h is a STRICT SUFFICIENT CAUSE of e if and only if ~h is a STRICT NECESSARY CAUSE of ~e, as any example makes clear." { I.J. Good's emphasis } !! Qsuf(e:h) = Qnec(~e:~h) in { I.J. Good 1992, 261 } !! Qnec(e:h) = Qsuf(~e:~h) in { I.J. Good 1995, 227 in Jarvie } in { I.J. Good 1994, 302(28); on Suppes } = RR( e: h) in { I.J. Good 1994, 306(40);314 by Suppes } !! Qsuf(e:h) = RR(~e:~h) is NOT ZERO if e,h are DISjoint, as I require. Folk4: S = P(e|h) = " sufficiency of exposure h for effect e" N = P(h|e) = " necessity of exposure h for effect e" { Schield } In { Schield 2002 } see Appendix A, his first lines left & right. There in his section 2.2 on necessity vs sufficiency , Milo Schield nicely explains their contextual semantics and applicability thus: "Unless an effect [ e ] can be produced by a single sufficient cause [ h ] (RARE!), producing the effect requires supplying ALL of its necessary conditions [h_i], while preventing it [e] requires removing or eliminating only ONE of those necessary conditions." (I added the [.]'s ). Q: Well told, but do Schield's S and N fit his semantics ? A: No. While S is unproblematic, his N is not. 
Q: What does it mean that h is strictly sufficient for an effect e ? A: Whenever h occurs, e occurs too. This in my formal translation means that h implies e ie Peh = Ph ie P(e|h) = 1 ie h --> e . Hence S = P(e|h) measures sufficiency of h for e , or his necessity of e for h (formally, I say). Note: if h = bad exposure and e = bad effect, then all above fits; if h = good treatment for e = better health, then all above fits; other pairings would not fit meaningfully. !! My view is this : we are interested in h CAUSES e (potentially). P(h|e) = Sufficiency of e for h ie e implies h ie e SufFor h, or P(x|y) = Sufficiency of y for x ie y implies x ie y SufFor x. Sufficiency is unproblematic, so I use it as a fundament. Q: What is Nec = necessity of h for e , really ? A: I derive Nec from the semantical definition in { Schield 2002, p.1 } where he writes: "But epidemiology focuses more on identifying a necessary condition [h] whose removal would reduce undesirable outcomes [e] than on identifying sufficient conditions whose presence would produce undesirable outcomes." His statement between [h] and [e] I formalize (by relying on the unproblematic Sufficiency ie on implication) thus: (no h) implies (no e) ie "no e without h" : ~h implies ~e, hence P(~e|~h) = 1 in the ideal extreme case. Note that generally P(~e|~h) = [ 1 - (Ph + Pe - Peh) ]/[ 1-Ph ] = 1 here !! ie Peh = Pe ie P(h|e) = 1 in this IDEAL extreme case ONLY, while N = P(h|e) is Schield's general necessity of h for e. !! But in general N = P(h|e) <> P(~e|~h) which is [ see P(e|h) above ] SUFFICIENCY of ~h for ~e, which better captures Schield's semantics. Q: Do we need his N = P(h|e) ?? A: Not if we stick to his more meaningful (than his N ) requirement on p.1 just quoted, and opeRationalized by me thus : !!! (Necessity of h for e) I define as (Sufficiency of ~h for ~e) == P(~e|~h) which is a COUNTERFACTUAL: IF no h THEN no e ie "no e without h" which is close in spirit to I.J. Good's ( see Folk3 ) verbal definition, except for the swapped suffixes nec and suf : Qsuf(e:h) = Qnec(~e:~h) = RR(~e:~h) in my notation as shown at Folk3 above. "Qsuf(e:h) = Qnec(~e:~h). This identity is a generalization of the fact that h is a STRICT SUFFICIENT CAUSE of e if and only if ~h is a STRICT NECESSARY CAUSE of ~e, as any example makes clear." ( I.J. Good's emphasis; it took him 50 papers in 50 years ). The semantical Necessity of h for e as removed h implies absence of e, is my NecP = P(~e|~h), and not Schield's necessity N = P(h|e) not fitting his opeRational definition and containing no negation ~ as a COUNTERFACTUAL should. Hence Schield's N is now deconstructed, and can be replaced by my constructive NecP = P(~e|~h). Summary of SufP = S , and of my NecP constructed from Schield's !!! opeRationally meaningful verbal requirements : SufP = P( e| h) = sufficiency of h for e means h implies e NecP = P(~e|~h) = necessity of h for e means ~h implies ~e hence: SufP = P( e| h) = necessity of ~h for ~e means h implies e Q: is my NecP ok ? A: Not yet, since for Peh = 0 my common sense requires S=0=N ie zero, which precludes all P(.,.) or P(.|.) containing ~ ie a NEGation. Fix1: IF Peh = 0 THEN NecP1 = 0 ELSE NecP1 = P(~e|~h). Fix2: Like Suf, Nec should have Peh as a factor in its numerator, so eg: SufP2 = P(e|h) = SufP = sufficiency of h for e ie h implies e NecP2 = Peh.NecP = Peh.P(~e|~h) = Peh.P(~(e or h))/P(~h) = Peh.[ 1 - (Ph +Pe -Peh) ]/(1-Ph) = necessity of h for e which seems reasonable, since without Peh. 
my original NecP will be too often too close to 1 = max(P), hence poor anyway, as its form [ 1 - P(e or h) ]/[ 1-Ph ] is near 1 for small P's . Folk5: RR(:) formulas (derived by Hajek aka JH as my criticism of Folk4 ): RRsuf = how much h implies e = RR(h:e) = sufficiency of h for e (not necessity because: ) = P(h|e)/P(h|~e) = [ Peh/(Ph - Peh) ].(1-Pe)/Pe = Peh.( h implies e)/Odds(e) note that the "implies" factor 1/(Ph - Peh) comes from P(h|~e), and that it is more influential than P(h| e). RRnec = how much ~h implies ~e = derived from RRsuf & my "A nonevent is an event is an event" = RR(~h:~e) = P(~h|~e)/P(~h|e) = [ P(~e,~h)/(P(~h) - P(~e,~h)) ].(1-P(~e))/P(~e) ! RRnec should = 0 if Peh = 0, yet RRnec <> 0 here, but it will = 0 if we use RRnec.Peh in analogy to NecP2 at the end of Folk4 . Since (h causes e) corresponds to (e implies h) we may have to swap e with h in some formulas above to get the (h causes e). I have not always done it, to keep other authors' formulas as close to their original as reasonable. .- Finally lets look again at the relation between causation and implication. "Rain causes us to wear a raincoat " makes sense, while "NOT wearing a raincoat causes it NOT to rain" is an obvious NONSENSE, even in a clean lab-like context with no shelter and our absolute unwillingness to become wet. Let x = rain and y = wearing a raincoat. The 1st statement translates to ( x causes y ); the 2nd statement translates to (~y causes ~x ). Because nobody knows how to formulate a perfect operator "x causes y", we substitute it with "y implies x" (the swapped x, y is not the point just now). Now the 1st statement translates to ( y implies x ); the 2nd statement translates to (~x implies ~y ). But now we are in trouble, as ( y --> x ) == ( ~x --> ~y ) in logic, and ideally in probabilities: ( Py = Pxy ) == ( P(~x) = P(~x,~y) ) ie: in imperfect real situations: ( Py - Pxy ) = ( P(~x) - P(~x,~y) ) ie: Py - Pxy = (1 - Px) - (1 -(Px + Py - Pxy)) = Py - Pxy q.e.d. Hence such a simple difference doesnt work as we would like it did for a cause. What about the corresponding relative risks ? RR(y:x) <> RR(~x:~y) ie: P(y|x)/P(y|~x) <> P(~x|~y)/P(~x|y) ie: (Pxy/(Py-Pxy)).((1-Px)/Px) <> ( (1-(Px+Py-Pxy))/(Py-Pxy) ).(Py/(1-Py)) where we see the key factor 1/(Py - Pxy) = (y --> x) on both sides of the <>. Hence despite the <> , both RR's will become oo ie infinite if (y implies x) perfectly whenever Pxy = Py. Otherwise, RR(y:x) behaves quite well: RR increases with Pxy, and decreases with unsurprising Px, which is reasonable as explained far above. Conclusion: an implication cannot substitute causation in all its aspects, but I don't know any other necessary (but not always sufficient) indicators of causal tendency than : + dependence (is symmetrical wrt x,y), + implication or entailment (is asymmetrical, transitive operation ie a subset in a subset in a subset, etc). ! Caution: repeatedly find UNDESIRABLE + SurpriseBy + time-ordering (a cause precedes its effect). ++ find +Construction principles for more and sharper formulations -.- +Acknowledgment Leon Osinski of NL is the best imaginable chief of a scientific library. -.- +References { refs } in CausRR To find names on www use 2 queries, eg: "Joseph Fleiss", then "Fleiss Joseph" the latter is the form used in all refs and in some languages. Unlike in the titles of (e)papers here listed, in the titles of books and periodicals all words start with a CAP, except for the insignificants like eg.: a, and, der, die, das, for, in, of, or, to, the, etc. 
Computer Journal, 1999/4, is a special issue on: - MML = minimum message length by Chris Wallace & Boulton, 1968 - MDL = minimum description length by Jorma Rissanen, 1977 These themes are very close to Kolmogorov's complexity, originated in the US by Occamite inductionists (as I call them) Ray Solomonoff in 1960, and Greg Chaitin in 1968 Allan Lorraine G.: A note on measurement of contingency between two binary variables in judgment tasks; Bulletin of the Psychonomic Society, 15/3, 1980, 147-149 Arendt Hannah: The Human Condition, 1959; on Archimedean point see pp.237 last line up to 239, 260, more in her Index Agresti Alan: Analysis of Ordinal Categorical Data, 1984; on p.45 is a math-definition of Simpson's paradox for events A,B,C Agresti Alan: Categorical Data Analysis, 1st ed. 1990; see pp.24-25 & 75/3.24 on { Goodman & Kruskal }'s TauB (by Gini ) Agresti Alan: An Introduction to Categorical Data Analysis, 1996 Alvarez Sergio A.: An exact analytical relation among recall, precision, and classification accuracy in information retrieval, 2002, http://www.cs.bc.edu/~alvarez/APR/aprformula.pdf Anderberg M.R.: Cluster Analysis for Applications, 1973 Bailey N.T.J.: Probability methods of diagnosis based on small samples; Mathematics and Computer Science in Biology and Medicine, 1964/1966, Oxford, pp.103-110 Bar-Hillel Yehoshua: Language and Information, 1964; Introduction tells that the original author of their key paper in 1952 (chap.15,221-274) was in fact Rudolf Carnap who was the 1st author despite B < C, but B-H doesn't tell it Bar-Hillel Yehoshua: Semantic information and its measures, 1953, in the book Cybernetics, Heinz von Foerster (ed), 1955, pp.33-48 + 81-82 = refs Bar-Hillel Yehoshua, Carnap Rudolf: Semantic information, pp.503-511+512 in the book Communication Theory, 1953, Jackson W. (ed); also in the British Journal for the Philosophy of Science, Aug.1953. It is much shorter than the 1952 paper reprinted in Bar-Hillel, 1964, 221-274 Baeyer Hans Christian von: Information - The New Language of Science, 2003, and 2004, Harvard University Press, which I helped to correct. The final part is "Work in progress", starting with the chap.24 = "Bits, bucks, hits and nuts - information theory beyond Shannon", is about Law of Diminishing Information ( LDI ) from { Kahre 2002 } (I found LDI and Ideal receiver in { Woodward 1953, 58-63 } ), where Von Baeyer mentions my "Wheeleresque war cry Hits before bits". In fact I started with Anglo-Saxonic "Hits statt bits" ie "Hits over bits"; it could have been "Hits ueber bits" :-) Baeyer Hans Christian von: Nota bene; The Sciences, 39/1, Jan/Feb. 1999, 12-15 Bishop Yvonne, Fienberg Stephen, Holland Paul: Discrete Multivariate Analysis 1975; on pp.390-392 their TauR|C is TauB from { Goodman & Kruskal 1954 }, which I recognized to be a normalized quadratic entropy Cont(Y:X)/Cont(X). See also { Blalock 1958 }, find TauB Blachman Nelson M.: Noise and Its Effects on Communication, 1966 Blachman Nelson M.: The amount of information that y gives about X, IEEE Trans. 
on Information Theory, IT-14, Jan.1968, 27-31 Blalock Hubert M.: Probabilistic interpretations for the mean square contingency, JASA 53, 1958, 102-105; he has not realized, but I did, that TauB(Y:X) = Cont(Y:X)/Cont(X) vs Phi^2 = ( X^2 )/N (find Phi^2 ) TauB in { Goodman & Kruskal , Part 1, 1954, 759-760 } (find TauB ) = TauR|C in { Bishop , Fienberg & Holland, 1975, 390-392 } (find TauR|C ) Blalock Hubert M.: Causal Inferences in Nonexperimental Research, 1964; start at p.62, on p.67 is his partial correlation coefficient Blalock Hubert M.: An Introduction to Social Research, 1970; on p.68 starts Inferring causal relationships from partial correlations Brin Sergey, Motwani R., Ullman Jeffrey D., Tsur Shalom: Dynamic itemset counting and implication rules for market basket data; Proc. of the 1997 ACM SIGMOD Int. Conf. on Management of Data, 255-264; see www . Sergey Brin is the co-founder of Google Buchanan Bruce G., Duda Richard O.: Principles of rule-based expert systems; in Advances in Computers, 22, 1983, Yovits M. (ed) Cheng Patricia W.: From covariation to causation: a causal power theory; (aka "power PC theory"), Psychological Review, 104, 1997, 367-405; ! on p.373 right mid: P(a|i) =/= P(a|i) should be P(a|i) =/= P(a|~i) Recent comments & responses by Patricia Cheng and Laura Novick are in Psychology Review, 112/3, July 2005, pp.675-707. Cheng Yizong, Kashyap Rangasami L.: A study of associative evidential reasoning; IEEE Trans. on Pattern Analysis and Machine Intelligence, 11/6, June 1989, 623-631 Cohen Jonathan L.: Knowledge and Language, 2002; ! on p.180 in the eq.(13.8) both D should be ~D DeWeese M.R., Meister M.: How to measure the information gained from one symbol, Network: Computation Neural Systems 10, 1999, p.328. They partially reinvented Nelson Blachman's fine work (in my refs) Duda Richard, Gaschnig John, Hart Peter: Model design in the Prospector consultant system for mineral exploration; see p.159 in { Michie 1979 } Ebanks Bruce R.: On measures of fuzziness and their representations, Journal of Mathematical Analysis and Applications, 94, 1983, 24-37 Eddy David M.: Probabilistic reasoning in clinical medicine: problems and opportunities; in { Kahneman 1982, 249-267 } Eells Ellery: Probabilistic Causality, 1991 Feinstein Alvan R.: Principles of Medical Statistics, 2002, by a professor of of medicine at Yale, who studied math & medicine; chap.10, 170-175 are on proportionate increments, on NNT NNH , on honesty vs deceptively impressive magnified results. Chap.17, 332,337-340 are on fractions, rates, ratios OR(:), risks RR(:). ! On p.340 is a typo : etiologic fraction should be e(r-1)/[e(r-1) +1]; ! on p.444, eq.21.15 for negative likelihood ratio LR- should be (1-sensitivity)/specificity; above it should be (c/n1)/(d/n2) De Finneti Bruno: Probability, Induction, and Statistics, 1972 Finkelstein Michael O., Levin Bruce: Statistics for Lawyers, 2001 Fitelson Branden: Studies in Bayesian confirmation theory, Ph.D. thesis, 2001, on www, where sinh should be tanh , I told him Fleiss Joseph L., Levin Bruce, Myunghee Cho Paik: Statistical Methods for Rates and Proportions, 3rd ed., 2003. 
In their Index "relative difference" (also in earlier editions) is my RDS Gigerenzer Gerd: Adaptive Thinking; see his other fine books & papers Glymour Clark: The Mind's Arrows - Bayes Nets and Graphical Causal Models in Psychology, 2001 Glymour Clark, Cheng Patricia W.: Causal mechanism and probability: a normative approach, pp.295-313 in Oaksford Mike & Chater Nick (eds): Rational Models of Cognition, 1998. ! On p.305 eq.14.6 isn't just a "noisy And gate" as it is asymmetrical wrt its ! inputs; my new term for it is "noisy AndNot gate" (find INHIBITION ) because ! u.(1-x) = u - ux is the numerical (u AndNot x) for independent u, x Good I.J. (Irving Jack), born 1916 in London as "Isidore Jacob Gudak" who unlike Good is findable on www. ! My W(y:x) is his old W(x:y), similarly with B(:), F(:); since 1992 he has switched to my safer notation Good I.J.: Legal responsibility and causation; pp.25-59 in the book Machine Intelligence 15, 1999, K. Furukawa, ed. See Michie in this vol.15 Good I.J.: The mathematics of philosophy: a brief review of my work; in Critical Rationalism, Metaphysics and Science, 1995, Jarvie I.C. & Laor N. (eds), 211-238 Good I.J.: Causal tendency, necessitivity and sufficientivity: an updated review; pp.293-315 in "Patrick Suppes: Scientific Philosopher", vol.1, P. Humphreys (ed), 1994; Suppes comments on pp.312-315. I.J. explains his surprisingly late insights (delayed 50 years) into the semantics of two W(:)'s, renamed by him to Qnec(y:x) , Qsuf(y:x) , like mine ?(y:x) here, ie no more as his old W(x:y) Good I.J.: Tendencies to be sufficient or necessary causes, 261-262 in Journal of Statistical Computation and Simulation, 44, 1992. This is a preliminary note on Good's belated insight of 1992-1942 = 50 years delayed Good I.J.: Speculations concerning the future of statistics, Journal of Statistical Planning and Inference, 25, 1990, 441-66 Good I.J.: Abstract of "Speculations concerning the future of statistics", The American Statistician, 44/2, May 1990, 132-133 Good I.J.: On the combination of pieces of evidence; Journal of Statistical Computation and Simulation, 31, 1989, 54-58; followed by "Yet another argument for the explicatum of weight of evidence" on pp.58-59 Good I.J.: The interface between statistics and philosophy of science; Statistical Science, 3/4, 1988, 386-412; for W(:) see 389-390, 393-394 left low! + discussion & rejoinder p.409 Good I.J.: Good Thinking - The Foundations of Probability and Its Applications, 1983, University of Minnesota Press. It reprints (and lists) a fraction of his 1500 papers and notes written until 1983. ! on p.160 up: sinh(.) should be tanh(.) where { Kemeny & Oppenheim's } degree of factual support F(:) is discussed Good I.J.: Corroboration, explanation, evolving probability, simplicity and a sharpened razor ; British Journal for the Philosophy of Science, 19, 1968, 123-143 Goodman Leo A., Kruskal William H.: Measures of Association for Cross Classifications, 1979. Originally published under the same title in the Journal of the American Statistical Association ( JASA ), parts 1-4: part 1 in vol.49, 1954, 732-764; TauB on 759-760 part 2 in vol.54, 1959, 123-163; part 3 in vol.58, 1963, 310-364; TauB on 353-354 part 4 in vol.67, 1972, - ; TauB in sect.2.4 See { Kruskal 1958 } for ordinal measures Goodman Steven N.: Toward evidence-based medical statistics. Two parts: 1. The P value fallacy, pp. 995-1004; 2. 
The Bayes factor, 1005-1013; discussion by Frank Davidoff: Standing statistics right up, 1019-1021; all in Annals of Internal Medicine, 1999 Grosof Benjamin N.: Evidential confirmation as transformed probability; pp.153-166 in Uncertainty in Artificial Intelligence, Kanal L.N. & Lemmer J.F. (eds), vol.1, 1986. I found that on p.159 his: ! B == (1+C)/2 is in fact the rescaling as in { Kemeny 1952, p.323 }, the last two lines lead to F(:) rescaled on the first lines of p.324, here & now findable as F0(:) Grune Dick: How to compare the incomparable, Information Processing Letters, 24, 1987, 177-181 Heckerman David R.: Probabilistic interpretations for MYCIN's certainty factors; pp.167-196 in Uncertainty in Artificial Intelligence, L.N. Kanal and J.F. Lemmer (eds), vol.1, 1986. I succeeded to rewrite his eq.(31) for ! the certainty factor CF2 on p.179 to Kemeny's F(:). Heckerman has more papers in other volumes of these series of proceedings Hempel C.G.: Aspects of Scientific Explanation, 1965; pp.245-290 are chap.10, Studies in the logic of explanation; reprinted from Philosophy of Science, 15 (reprinted paper of 1948 with Paul Oppenheim). Hesse Mary: Bayesian methods; in Induction, Probability and Confirmation, 1975, Minnesota Studies in the Philosophy of Science, vol.6 Kahn Harold A., Sempos Ch.T.: Statistical Methods in Epidemiology, 1989 Kahneman Daniel, Slovic P., Tversky Amos (eds): Judgment Under Uncertainty: Heuristics and Biases, 1982. Kahneman won Nobel Prize (economics 2002) for 30 years of this kind of work with the late Amos Tversky Kahre Jan: The Mathematical Theory of Information, 2002. To find in his book formulas like eg Cont(.) use his special Index on pp.491-493. See www.matheory.info for errata + more. ! on p.120 eq(5.2.8) is P(x|y) - Px = Kahre's korroboration, x = cause, ! on p.186 eq(6.23.2) is P(y|x) - Py, risk is no corroboration; y = evidence Kemeny John G., Oppenheim Paul: Degree of factual support; Philosophy of Science, 19/4, Oct.1952, 307-324. The footnote 1 on p.307 tells that Kemeny ! was de facto the author. Caution: on pp.320 & 324 his oldfashioned P(.,.) is our modern P(.|.). On p.324 the first two lines should be bracketized ! thus: P(E|H)/[ P(E|H) + P(E|~H) ], which is findable here & now as F0( . An excellent paper! Kemeny John G.: A logical measure function; Journal of Symbolic Logic, 18/4, Dec.1953, 289-308. On p.307 in his F(:) there are missing negation bars ~ ! over H's in both 2nd terms. Except for p.297 on Popperian elimination of models (find SIC now), there is no need to read this paper if you read his much better one of 1952 Kendall M.G., Stuart A.: The Advanced Theory of Statistics, 1977, vol.2. Khoury Muin J., Flanders W. Dana, Greenland Sander, Adams Myron J.: On the measurement of susceptibility in epidemiologic studies; American Journal of Epidemiology, 129/1, 1989, 183-190 Kruskal William H.: Ordinal measures of association, JASA 53, 1958, 814-861 Laupacis A., Sackett D.L., Roberts R.S.: An assessment of clinically useful measures of the consequences of treatment; New England Journal of Medicine ( NEJM ), 1988, 318:1728-1733 Lucas J.R., Hodgson P.E.: Spacetime and Electromagnetism, 1990; pp.5-13 on regraduation of speeds to rapidities Lusted L.B.: Introduction to Medical Decision Making, 1968 Michie Donald: Adapting Good's Q theory to the causation of individual events; pp.60-86 in Machine Intelligence 15, Furukawa K., Michie D. and Muggleton S. (eds). Aged 18 during WWII Michie was the youngest codebreaker, assisting I.J. 
Good who was Alan Turing's statistical assistant Michie Donald (ed): Expert Systems in the Micro Electronic Age, 1979 Norwich Kenneth: Information, sensation, and perception, 1993 Novick Laura R., Cheng Patricia W.: Assesing interactive causal influence; Psychological Review, 111/2, 2004, 455-485 = 31 pp! See { Cheng P.W. 1997 } Pang-Ning Tan, Kumar Vipin, Srivastava Jaideep: Selecting the right interestingness measure for association patterns; kdd2002-interest.ps Pearl Judea: Causality: Models, Reasoning, Inference, 2000; see at least pp.284,291-294,300,308; his references to Shep should be Sheps, and on ! p.304 in his note under tab.9.3 ERR = 1 - P(y|x')/P(y|x) would be correct. ! not in Pearl's ERRata on www Popper Karl: The Logic of Scientific Discovery, 6th impression (revised), March 1972; new appendices, on corroboration Appendix IX to his original Logik der Forschung, 1935, where in his Index: Gehalt, Mass des Gehalts = Measure of content (find SIC ). His oldfashioned P(y,x) is modern P(y|x) Renyi Alfred: A Diary on Information Theory, 1987. 3rd lecture discusses asymmetry and causality on pp.24-25+33 *Renyi Alfred: Selected papers of Alfred Renyi, 1976, 3 volumes Renyi Alfred: New version of the probabilistic generalization of the large sieve, Acta Mathematica Academiae Scientiarum Hungaricae, 10, 1959, 217-226; on p.221 his correlation coefficient R between events is also in { Kemeny & Oppenheim, 1952, p.314, eq.7 } Rescher N.: Scientific Explanation, 1970. See pp.76-95 for the chap.10 = The logic of evidence, where his Pr(p,q) actually means P(p|q). Very nice methodology of derivation, but the result is not spectacular. ! Note that on p.84 he suddenly switches from Pr(p|q) to Pr(q|p). Why? Rijsbergen C.J. van: Information Retrieval, 2nd ed., 1979 Rothman Kenneth J., Greenland Sander: Modern Epidemiology, 2nd ed., 1998 Sackett David L., Straus Sharon, Richardson W. Scott, Rosenberg William, Haynes Brian: Evidence-Based Medicine - How to Practice EBM, 2nd ed, 2000. There is a Glossary of EBM terms, and Appendix 1 on Confidence intervals ! ( CI ), written by Douglas G. Altman of Oxford, UK. I reported 12 typos, ! most of them in CI formulas. ! 30+ bugs or typos are on http://www.cebm.utoronto.ca/search.htm Schield Milo, Burnham Tom: Confounder-induced spuriousity and reversal for binary data: algebraic conditions using a non-iteractive linear model; 2003, on www (slides nearby) Schield Milo, Burnham Tom: Algebraic relationships between relative risk, phi and measures of necessity and sufficiency; ASA 2002; on www. Find NAIVE , SIMPLISTIC here & now. Their Phi = Pearson correlation coefficient r (find r2 ), not sqrt( Phi^2 ) eg from { Blalock 1958 } (find Phi^2 ), and not Pcc = Pearson contingency coefficient (find Pcc ) Schield Milo: Simpson's paradox and Cornfield's conditions; ASA 1999; on www. an excellent multi-angle explanation of confounding, a very important subject seldom or poorly explained in books on statistics. His section 8 can be complemented by reading { Agresti 1984, p.45 } for a definition of Simpson's paradox for events A, B, C Shannon Claude E., Weaver Warren: The Mathematical Theory of Communication, 1949; 4th printing, Sept.1969, University of Illinois Press. Printings may differ in page numbering. His original paper was: A mathematical theory of communication, 1948, in 2 parts, Bell Systems Journal. 
Compare his titles with the book by { Kahre } Sheps Mindel C.: An examination of some methods of comparing several rates or proportions; Biometrics, 15, 1959, 87-97 Sheps Mindel C.: Shall we count the living or the dead; New England Journal of Medicine ( NEJM ), 1958, 259:1210-1214 Shinghal R.: Formal Concepts in Artificial Intelligence, 1992; see chap.10 on Plausible reasoning in expert systems, pp.347-389, nice tables on ! pp.355-7, in Fig.10.3 the necessity should be N = [1-P(e|h)]/[1-P(e|~h)]; ! on p.352 just above 29. in the mid term (...) of the equation, both ~e should be e like in the section 10.2.11 Simon Herbert: Models of Man, 1957. See pp.50-51+54 Simon Steve: http://www.childrens-mercy.org/stats is a fine infokit Stoyanov J.M.: Counterexamples in Probability, 1987 Suppes Patrick: A Probabilistic Theory of Causality, 1970 Tversky Amos, Kahneman Daniel: Causal schemas in judgments under uncertainty; in { Kahneman 1982, 117-128 } Vaihinger Hans: Die Philosophie des Als Ob, 1923 Weaver Warren: Science and Imagination, 19??; the section on "Probability, rarity, interest and surprise" has originally appeared in Scientific Monthly, LXVII ie 67, no.6, Dec.1948, 390-??? Woodward P.M.: Probability and Information Theory, with Applications to Radar, 1953, 1964 adds chap.8 -.-