.-
Causal insights inside, plus new causal hindrance factor versus
P.W. Cheng's preventive causal power
Copyright (C) 2005-2007, Jan Hajek , NL, version 5.9 of 2007-3-5

The word "my" marks what I have not lifted from anywhere, so it is fresh
and may be contested, but must not be lifted from me without a full
reference to me plus this website :
http://www.humintel.com/hajek contains the latest versions of my e-papers

Abstract: After a warm-up we take 3 cold showers and discover that
Patricia W. Cheng's "preventive causal power" (here PF , as it has
existed as a prevented fraction or preventable fraction) is not
commensurable with Sheps/Cheng's "generative causal power" (here the
generative factor GF ). A new "causal hindrance factor" HF is worked out
as a replacement of PF. The body of this e-paper is a cluster of fresh
causal insights starting with Insight0 which postulates new, very
specific desiderata for causal measures. P4b is my new desideratum for a
weak causal transitivity. P4b breaks the spell of Bayesian inversion &
the double-faced coin of sufficiency-necessity by classifying
probabilistic causation measures M(:) as either (x SufFor y) or as
(y SufFor x) according to their behavior wrt Px < Py or Px > Py. The
absolute difference ARR(y:x) = P(y|x) - P(y|~x) does NOT behave orderly
wrt Px < Py , or wrt Px > Py , while the remaining analyzed measures do.
Analyzed are GF( : ) by Sheps/Cheng ; Conv( : ) by Google co-founder
Sergey Brin ; I.J. Good's Qnec = RR(y:x) = relative risk = risk ratio ;
its functions W( : ) = ln( RR(:) ) = an old "weight of evidence" by
I.J. Good, who was a WWII codebreaker and stats assistant to Alan Turing,
and the factor F( : ) = [ RR - 1 ]/[ RR + 1 ] by Einstein's assistant
John Kemeny, whose F( : ) = CF2 is the "certainty factor" by David
Heckerman ( 1986 at the Stanford MYCIN project, now leading Microsoft's
machine learning research ) ; and C( : ) = corroboration or confirmation
by the most influential philosopher of science Sir Karl Popper .
-.-
CONTENTS: +words are for fast jump-finding; the subsection jumper is .-
Only 220 lines are on the incommensurability of Cheng's PF with Sheps' GF
+Abbreviations
+Wisdom
+Warm-up with logic and P's in 2 pizzagrams plus a 2x2 contingency table
+Cold shower
+Warnings
+Snapshots of insights inside (for busy execs :-)
+Sheps' "relative difference" GF is smart
+Why is Cheng's PF incommensurable with Sheps/Cheng's GF
+HF vs PF served as a palatable fast food
+HF vs PF served as a palatable slow food
+Hajek's hindrance factor HF is commensurable with GF
+Insights into semantics and forms of causal formulae
+Palatable probabilistic logic derived from counting
+Causal notation is tough, but our math-beef shouldn't be
Insight0 : Know what you want : new causal desiderata P2 P4 P4b P5 P8 ;
           P4b: Partial order condition for causation measures M(:)
Insight1 : Understanding GFactors, plus my MaxiMin heuristic
Insight2 : Fresh interpretation of GF as a regression slope/MAXslope
Insight3 : Commensurable pairs of formulae : ( GF & HF ) vs ( PF & QF )
Insight4 : How to rescale a range for more palatable results
Insight5 : Numbers needed to treat or harm : NNT , NNH , plus my NNR
Insight6 : The simplest thinkable necessary condition for CONFOUNDING
Insight7 : Sufficiency and necessity : two sides of a causal coin ;
           PAR(x:y) = population attributable risk
Insight8 : MaxiMin vs Kemeny & Fitelson in clash with Popper & Kahre ;
           C(y:x) , F(y:x) , GF(y:x) stress tested with extreme P's ;
           my paradox of IndependentImplication
Insight9 : Variations on the form (ValueA - ValueB)/(MaxValueA - ValueB)
+Conclusions
+References: papers & books worth (y)our looks

-.-
+Abbreviations :
For non-native English readers:
aka = also known as ; btw = by the way
eg = exempli gratia = for example ; ie = id est = that is
vs = versus ; w/o = without ; wrt = with respect to
For non-native math readers:
~  non , not , negation , complement
== synonymous , equivalent , logical equivalence ie 2-way implication
   ie if and only if ie iff
<> or =/= is not equal
=. is near, close to, approximately equal
b^2 = sqr(b) = b*b = b power 2
causal = possibly possessing a causal tendency
causation measure = indicator of a possible causal tendency
NecFor = necessary for ; SufFor = sufficient for
qed = quod erat demonstrandum = which was to be proved
rv = random variable
XOR = exclusive OR , non-equivalence

-.-
+Wisdom :
Detecting error is the primary virtue, not proving truth.
  { Colin McGinn on Karl Popper in NYRB 2001/11/21, end of p.46 }
If it's not checked, it's wrong. { I.J. Good , a WWII codebreaker in UK }
Know their & thy formulae and thou shalt suffer no disgrace. { JH = me }
One man's necessity is another woman's sufficiency. { JH }
One woman's determinism is another man's randomness. { JH }
Never run after a bus, a (wo)man, or a causal formula, because there will
  be another one in ten minutes. { my politically correct paraphrase of a
  Yale professor who spoke about a bus, woman or cosmological theory }
The true logic of this world is the calculus of probabilities.
  { J.C. Maxwell }
Logic is no doubt unshakable, but it cannot withstand a man who wants to
  live. { Franz Kafka (1883-1924) }
We need evidence-based medicine, not evidence-biased medicine-(wo)men.
  { JH }
There is no universally best method. What is best is data dependent. ...
  We need to learn more about what works best where.
  { Leo Breiman, 1994 }
-.-
+Warm-up with logic and P's in 2 pizzagrams plus a 2x2 contingency table

I don't believe in a "theory of everything" as some physicists &
psychicists do. I just try hard to identify good INDICATORS of causation
tendency. When the reading gets tough, the tough get reading. This
e-paper has one thing in common with an aircraft carrier: there are
multiple cables to hook on and so to land safely on the deck of
Knowledge. There is no safety without some redundancy at critical or
remote points.

Only a minimum of math is needed to get some of my key messages from
this e-paper. Even those who see math as a 4-letter word will understand
that
  0.4 - 0.2 = 0.2  >  0.0004 - 0.0002 = 0.0002   ie - preserves zeroes,
while
  0.2 / 0.4 = 0.5  =  0.0002 / 0.0004 = 0.5      ie / LOST all zeroes,
ie the / LOSES INFORMATION on the magnitude of the numbers; see Insight5.
Hence if such ratios are used as measures of probabilistic CONTRAST then
they INFLATE the results. Clearly, differences and ratios are
incommensurable and should not be mixed as { Cheng 1997 } does by
pairing PF with GF. That's one of the key messages here.
Btw, 0.2 - 0.1 = 0.1 = 0.9 - 0.8 = 0.1 , but
  sqrt(0.2)     - sqrt(0.1)     = 0.13  >  sqrt(0.9) - sqrt(0.8) = 0.054
  sqrt(0.02)    - sqrt(0.01)    = 0.041
  sqrt(0.002)   - sqrt(0.001)   = 0.013
  sqrt(0.0002)  - sqrt(0.0001)  = 0.0041
  sqrt(0.00002) - sqrt(0.00001) = 0.0013
Sqrt(P1) - sqrt(P2) is the heart of the Hellinger distance aka Matusita
distance. Sqrt(P) is called probability amplitude in quantum mechanics.
We see that the information about the magnitudes of the numbers is taken
into account in the sense that the difference is related to the
magnitudes, and zeroes .0000 are not totally lost as in a ratio .

Another easily understood though not totally trivial key message is this
(a simple example is worth 1000 words): a student has correctly answered
2/3 of multiple-choice questions, each with one out of 3 choices.
Q: How good is (s)he really ?
A: GF = (2/3 - 1/3)/(1 - 1/3) = (1/3)/(2/3) = 1/2 = 50% is not
impressive; GF = "goodness factor" corrected the raw rate of 2/3 for the
base chance of 1/3 obtainable by random guessing. The 1st correction of
2/3 , ie (2/3 - 1/3) , DECreased 2/3 down to 1/3. The 2nd, less obvious
correction /(1 - 1/3) INCreased 1/3 up to 1/2, because GF is a "relative
difference" measuring the efficiency of rising above the base chance,
relative to the MAXimal possible success of (1 - 1/3) AVAILABLE for the
rise above the base.
! The straight interpretation of this GF is:
! 50% increase of the MAXimally ACHIEVABLE (wrt the base chance), as this
! 50% of the 2/3 ACHIEVABLE in the denominator
!   = (2/3 - 1/3)/(2/3) = (1/3)/(2/3).
Now we understand how GF-like formulae work; a small sketch of this
arithmetic follows below. Insight1 & Insight2 explain this in a less
casual way. After this easy intro we have to become a bit more exact and
abstract, hence general, so back to basics.
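To make the warm-up arithmetic reproducible, here is a minimal Python
sketch (illustrative only; the function name is mine, ad hoc) of the
canonical "relative difference" g = (Pa - Pb)/(1 - Pb) applied to the
student example, plus the zero-loss of ratios:

def relative_difference(pa, pb):
    # canonical form g = (Pa - Pb)/(1 - Pb), eg Sheps' GF ;
    # corrects the raw rate pa for the base chance pb
    return (pa - pb) / (1.0 - pb)

# student: 2/3 correct, base chance 1/3 by guessing -> GF = 0.5
print(relative_difference(2/3, 1/3))        # 0.5

# differences preserve magnitude, ratios lose it:
print(0.4 - 0.2, 0.0004 - 0.0002)           # 0.2    0.0002
print(0.2 / 0.4, 0.0002 / 0.0004)           # 0.5    0.5   (zeroes lost)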
Basics of Boolean logic : there exist only 4^2 = 16 Boolean operators aka
Boolean functions ie binary logical operations (~ is non ie a complement):

Inputs | function's output :
x op y | f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 10 11 12 13 14 15  are f0 up to f15
-------|--c--c--------------c--c--c--c--------------c--c-|--- c = commutative
0 op 0 |  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1 | DeMorgan's laws:
0 op 1 |  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1 | ~(x or y)=(~x & ~y)
1 op 0 |  0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1 | ~(x & y)=(~x or ~y)
1 op 1 |  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1 |
-------|--c--c--------------c--c--c--c--------------c--c-|--- c = commutative
No fun = f0 and f15 (the constants) ; f3 = y , f5 = x , f10 = ~x ,
f12 = ~y ; each remaining one ( f1 f2 f4 f6 f7 f8 f9 f11 f13 f14 ) is a
genuine fun(x , y).

f4 = x > y hence ~f4 = f11 = (x <= y) = ~(x,~y) = (x entails y) = (x->y)
f2 = y > x hence ~f2 = f13 = (y <= x) = ~(y,~x) = (y entails x) = (y->x)
f2 = ~f13 = ~~(y,~x) = (y,~x) = (y unlessBlockedBy x)
   == (~x,y) = (~x unlessBlockedBy ~y) , formally ; eg:
fire z := burning y unlessBlockedBy the use of x=extinguisher ;
   P(z) = Py - Pxy ;
fire z := burning y unlessBlockedBy ~x=no oxygen ;
   P(z) = Py - P(y,~x) = Py - ( Py - Pxy ) = Pxy is no good; clearly
!  do not use complements if you don't want simplistic results.
f2: if ~x then z:=y  else z:=0 ;   f13: if ~x then z:=~y else z:=1 ;
f2: if  x then z:=0  else z:=y ;   f13: if  x then z:=1  else z:=~y ;
f2 is better understood than f13 , since the result z is just z:=y unless x.
f2: P(y,~x) = Py - Pxy , = Py if Pxy = 0  ie disjoint
                         = 0  if Pxy = Py ie P(x|y) = 1, needs Py <= Px
                         > 0  if Pxy < Py ie P(x|y) < 1.
4+2 = 6 : 4 functions of one variable only, plus 2 constants ie functions
          of neither x nor y.
4*2 = 8 complementary pairs of f#'s (note that 0+15 = 1+14 = .. = 7+8 = 15):
4*2 = 8 are commutative (marked by c ) ie symmetric wrt x, y, hence
        UNsuitable as indicators of causation, which is asymmetric wrt x, y:
c |     no fun   =  ~f0 = f15   no fun
c |  ~(x & y)    =  ~f1 = f14 = (x nand y) = Sheffer function
! |  (~y or x)   =  ~f2 = f13 = (y implies x) = (y->x) = ~(y & ~x) = (~x->~y)
  |                 f13 = [(y or x)==x]
  |     ~(y)     =  ~f3 = f12 = ~y
! |  (~x or y)   =  ~f4 = f11 = (x implies y) = (x->y) = ~(x & ~y) = (~y->~x)
  |                 f11 = [(x or y)==y]
  |     ~(x)     =  ~f5 = f10 = ~x
c | ~(x Xor y)   =  ~f6 = f9  = (x eqv y) = (x == y) = equivalence of x, y
c |  ~(x or y)   =  ~f7 = f8  = (x nor y) = Peirce function

Any (not only those listed here) valid rule (ie an lhs = rhs ) has its
DUAL rule (the DUALITY is mutual ie a symmetric 2-way relation), obtained
thus: change ANDs into ORs, ORs into ANDs, False ie 0 into True ie 1,
1 into 0, but don't change the parentheses and negations (ie non, not, ~).
Examples:  0 & 1 = 0  is dual with  1 or 0 = 1 ;
~(x,~y) = (~x or y)  is dual with  ~(x or ~y) = (~x,y).

Only 2 pairs of f's are non-commutative functions of both variables x, y;
these 2 pairs of complementary functions are ( f2 , f13 ) and ( f4 , f11 ).
Only these might provide the necessary asymmetry for a candidate measure
of causation obtainable from probabilistic logic, which follows from the
isomorphism between logic and measures on sets (actually true metrics m ).
Alas, we shall see that the following property of the Boolean implication
(x implies y) == (~y implies ~x) , aka the CONTRAPOSITIVE property, is
UNdesirable, since a causation measure requires M(x:y) <> M(~y:~x)
because eg "fire x causes smoke y" makes sense, while "no smoke ~y causes
no fire ~x" is NONSENSE . Such examples, and the fact that the
implications and their complements are the only asymmetric functions of
both variables x, y , finish my PROOF that we cannot use any purely
Boolean function as an M(:).
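The counts just claimed are machine-checkable. Here is a short Python
check of my own (the bit-indexing inside f() is chosen only to match the
table above); it enumerates all 16 binary Boolean functions and confirms
that exactly the 2 complementary pairs (f2,f13) and (f4,f11) remain as
asymmetric candidates:

from itertools import product

def f(i, x, y):
    # f_i as in the table above: f0 = constant 0, f1 = AND, f15 = constant 1
    return (i >> (3 - (x + 2 * y))) & 1

pairs = list(product((0, 1), repeat=2))
commutative = [i for i in range(16)
               if all(f(i, x, y) == f(i, y, x) for x, y in pairs)]
trivial = [i for i in range(16)
           if all(f(i, 0, y) == f(i, 1, y) for y in (0, 1))     # ignores x
           or all(f(i, x, 0) == f(i, x, 1) for x in (0, 1))]    # ignores y
asymmetric = [i for i in range(16)
              if i not in commutative and i not in trivial]

print(commutative)  # [0, 1, 6, 7, 8, 9, 14, 15] : the 8 marked c
print(trivial)      # [0, 3, 5, 10, 12, 15] : constants and y, x, ~x, ~y
print(asymmetric)   # [2, 4, 11, 13] : only the pairs (f2,f13) and (f4,f11)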
Also a single probability P(.) or P(.|.) should not be used even as a
corroboration, as is shown in great detail in { Popper , appendix IX ,
pp.387-398 }, and as discussed by Sir Karl Popper vs Leblanc in { The
British Journal for the Philosophy of Science, vol.X, 1960, pp.313-318 }.

Here come 2 visualisations of the same basic law for any elementary
measure m which derives from (hence is isomorphic with) Boolean logic and
set theory (which are isomorphic). Just imagine 2 rectangularly cut
pizzas or pancakes Px and Py placed upon each other so that they
(don't)(partially) overlap :
== is equivalent , synonymous ; = is equal ; ~ non, negation, complement
0 <= m(.) is a measure of a set , eg a number n(.) of elements, N = m(All)
eg P(.) is a measure of a set , P(.) = n(.)/N
, is & is a joint ie an overlap ; U is "or" is a union (w/o an overlap )

Venn diagram enhanced :
 ___ Universe = m(All) ________
|   __m(y)___________          |
|  |        ________|__m(x)__  |
|  | m(y-x) | m(x,y) | m(x-y)| |
|  |________|________|       | |
|           |________________| |
|  m(~x,~y) = m(~(x U y))      |
|______________________________|

More expressive one fits a 2x2 table :
 ___ 1verse = P(All) = 1 ______
|  P(x,y)   | P(x,~y)          |  P(x)
|  == Pxy   | = Px - Pxy       |
|-----------|------------------|
|  P(~x,y)  | P(~x,~y)         |  P(~x) = 1-Px
| = Py -Pxy | = P(~(x or y))   |
|___________|__________________|
    P(y)       P(~y) = 1-Py

---yyyyyyyy,,,,,,,,,,xxxxxxxx--- in 1 dimension, where ,,, are the joint m(x,y)

From 3 axioms :  m(empty set) = 0 ;  m(x) >= 0 for any set x ;
m(x or y) = m(x) + m(y) + 0  if m(x,y) = 0 ie if x, y are disjoint sets ;
all what comes next follows :
m(x-y) == m(x,~y) = m(x) - m(x,y)             P(x,~y) = Px - Pxy == P(x-y)
m(x-y) +m(y-x) +m(x,y) = m(x or y)            P(x,~y)+P(~x,y)+Pxy = P(x or y)
m(x-y) +m(y-x) +m(x,y) +m(~x,~y) = m(All)     P(x,~y)+P(~x,y)+Pxy+P(~x,~y) = 1
m(x or y) + m(~x,~y) = m(All)                 P(x or y) + P(~x,~y) = 1
because m(x or y), m(~x,~y) are disjoint since they can never overlap ;
m(x or y) = m(x) + m(y) + 0  iff m(x,y) = 0 ie if no overlap ;
basic Bayes :
m(y|x).m(x) = m(x,y) = m(y).m(x|y) ;  P(y|x).P(x) = P(x,y) = P(y).P(x|y)
basic bounds ( the Bonferroni inequality is on the lhs ) :
Max[ 0, m(x) + m(y) - m(All) ] <= m(x,y) <= min[ m(x), m(y) ]
Max[ 0, Px + Py - 1 ]          <= Pxy    <= min[ Px , Py ]
Max[ m(x), m(y) ] <= m(x U y) = m(x)+m(y)-m(x,y) <= min[ m(x)+m(y), m(All) ]
Max[ Px, Py ]     <= P(x or y) = Px + Py - Pxy   <= min[ Px + Py , 1 ]
so if Px + Py > 1 we are not completely free to choose any
Pxy <= min[Px, Py] in tests or examples; for more find Bonferroni here
and in { Hajek www }. Disjoint x, y have Pxy = 0 hence x, y are
DEPENDENT, since the condition for independence, ie Pxy = Px.Py , cannot
hold if Px > 0 and Py > 0.
m(~x & ~y)  = m(~(x or y)) ,  P(~x,~y) = P(~(x or y))   by DeMorgan's law
m(~x or ~y) = m(~(x & y))  ,  P(~x or ~y) = P(~(x,y)) = 1-Pxy  via De Morgan
d(x,y) = m(x or y) - m(x,y) = m(x) + m(y) - 2.m(x,y) = m(x-y) + m(y-x)
 = symmetrical distance between x, y = sum of asymmetrical distances
 = is a metric distance since it holds :
   d(x,y) = d(y,x) >= d(x,x) = 0 = d(y,y)
   d(x,y) + d(y,z) >= d(x,z)   is the triangle inequality
For a better interpretability of numerical values, we often want a
measure m(.) NORMALIZED to the scale [0..1] or [0..100%]. Eg:
Q: How should we normalize the overlap Pxy ?
A: m(x,y)/[ m(x) + m(y) ] has in fact the range of only [0..1/2] , so
2.m(x,y)/[ m(x) + m(y) ] might work (in analogy to the harmonic average).
[0..0.5] is due to the MAXimal possible overlap iff m(y) = m(x,y) = m(x).
Hence a SHARPER normed overlap is m(x,y)/[ m(x) + m(y) - m(x,y) ] ,
which in fact is the normed equivalence m(x==y), in some applications
interpretable as SIMILARITY or PROXIMITY. These meanings become clear
when we derive normed DISSIMILARITY as non-equivalence ie XOR ie
symmetrical distance by taking the complement of m(x==y) :
m(x <> y) ie m(x =/= y) = ~[ m(x==y) ] = m(All) - m(x==y) .
For probabilities :
P(x <> y) ie P(x =/= y) = ~[ P(x==y) ] = P(All) - P(x==y) = 1 - P(x==y)
  = 1 - Pxy/( Px + Py - Pxy ) = non-proximity = distance
! = ( Px + Py - 2.Pxy )/( Px + Py - Pxy ) = [ P(x or y) - Pxy ]/P(x or y)
! = [(Px-Pxy)+(Py-Pxy)]/( Px + Py - Pxy ) = [ P(x,~y) + P(y,~x) ]/P(x or y)
is the normed probabilistic DISTANCE , isomorphic with d(x,y)/m(x or y).
It can be visually appreciated in the pizzagram aka Venn diagram above.
It has useful applications as a measure of DISSIMILARITY.
Let n(.) = nr. of elements in a set = cardinality = count of unique elements
P(.) = probability , or its estimate as a proportion of counts n(.)/N ;
note that P(x) = Sum(0+1+1+1+1+0+0+1)/N = E[x] = Expected value of the
0/1 indicator of x ;
H(.) = Shannon's entropies in his information theory are Expected values
H(X,Y) = joint entropy ie it is H(X & Y) ;  I(X;Y) = mutual information
H(.|.) = conditional entropies
Then the key equation for an elementary set-conform measure m(.) captures
the very essence of the COMMON SENSE ( visualized by Venn diagrams ) :
m(a) + m(b) - m(a,b) = m(a U b) = m(a,b) + m(a-b) + m(b-a) , eg:
n(x) + n(y) - n(x,y) = n(x U y) = n(x,y) + n(x-y) + n(y-x)
P(x) + P(y) - P(x,y) = P(x U y) = P(x,y) + P(x-y) + P(y-x)
H(X) + H(Y) - I(X;Y) = H(X,Y)  = I(X;Y) + H(X|Y) + H(Y|X)
where the entropies are expected values ie Sum[ P(.).log(...) ] ;
Px.Py is a FICTITIOUS probability of as-if independent events x, y ;
the Px.Py serves as a REFERENCE point.
H(X)   = -Sum[ Px.log(Px) ] = Sum[ Px.log(1/Px) ] = an expected surprise
H(X|Y) = -SumSum[ Pxy.log(P(x|y)) ]
I(X;Y) = +SumSum[ Pxy.log(Pxy/(Px.Py)) ] = SumSum[ Pxy.(log(Pxy) - log(Px.Py)) ]
       = +SumSum[ Pxy.log(P(x|y)/Px) ]   = SumSum[ Pxy.log(P(y|x)/Py) ]
H(X,Y) = -SumSum[ Pxy.log(Pxy) ] ie joint entropy (Pxy is a joint, no union),
nevertheless H(X,Y) >= H(X|Y) + H(Y|X) ie it behaves as-if a union, and
I(X;Y) behaves as-if a joint within the isomorphism which works, but is
! NOT SEMANTICALLY PERFECT for entropies. That it works follows from, eg:
H(X|Y) + H(Y|X) = H(X,Y) - I(X;Y) = H(X) + H(Y) - 2.I(X;Y)
                = D(X <> Y)  is the Shannon metric , and
[ H(X|Y) + H(Y|X) ]/H(X,Y) = d(X <> Y)  is the Rajski metric , a SHARPLY
normed average probabilistic distance , hence 1 - d(X <> Y) = a normed
measure of dependence between 2 sets of nominal variables X, Y.
This 0 <= d(X <> Y) <= 1 measures any dependence, linear or nonlinear,
between 2 paired sets of random nominal (symbolic, non-numerical) events,
while the classical correlation coefficient -1 <= rho(X,Y) <= 1 is not a
metric and it measures only the linear dependence between two paired sets
of numerical values.
A 2x2 contingency table should be viewed as a Venn diagram consisting
! of 2 partially overlapping RECTANGLES: one vertical (a & c) on the left,
and one horizontal (a & b) above, with the OVERLAP (a), and the REST (d).
! Since "A non-event is an event is an event" (= my paraphrase of Gertrude
Stein speaking about roses though not about non-roses), we are free to
see other pairs of perpendicularly overlapping rectangles in the same
table.
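A compact sketch of mine (the function name is ad hoc) computes the
Rajski metric d = [H(X|Y)+H(Y|X)]/H(X,Y) straight from a 2x2 joint
distribution, so the two pivotal cases can be checked at once:

from math import log2

def rajski_distance(pxy, px_only, py_only, pnn):
    # Rajski metric for a 2x2 joint: P(x,y), P(x,~y), P(~x,y), P(~x,~y);
    # 0 = identical events, 1 = independent events
    joint = [pxy, px_only, py_only, pnn]
    hxy = -sum(p * log2(p) for p in joint if p > 0)          # H(X,Y)
    px, py = pxy + px_only, pxy + py_only
    hx = -sum(p * log2(p) for p in (px, 1 - px) if p > 0)    # H(X)
    hy = -sum(p * log2(p) for p in (py, 1 - py) if p > 0)    # H(Y)
    ixy = hx + hy - hxy                                      # I(X;Y)
    return (hxy - ixy) / hxy     # = [H(X|Y) + H(Y|X)] / H(X,Y)

print(rajski_distance(0.25, 0.25, 0.25, 0.25))   # 1.0 : independent
print(rajski_distance(0.5, 0.0, 0.0, 0.5))       # 0.0 : x == y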
Let y = observed effect or evidence (eg symptoms, bad for the patient ) ;
    x = exposure, treatment, test result, hypothesised cause of y ;
    { ~x = tested negative = OK for the patient , ~y = healthy , in medicine }

n( x, y) = a | b = n( x,~y) || n( x)         n(.,.)/N = P(.,.)
n(~x, y) = c | d = n(~x,~y) || n(~x)         n(.)/N   = P(.)
=============|==============||======         N/N = 1
  n(y) = a+c | b+d = n(~y)  || N = n(x)+n(~x) = n(y)+n(~y) = a+b+c+d

P(x or y) + P(x,y) = Px + Py   see In0 ;
P(x or y) = Px + Py - Pxy = 1 - P(~x,~y) = 1 - P(~(x or y)) ;
P(~x or ~y) = 1 - Pxy   by DeMorgan's law
== is synonymous, equivalence , 2-way implication , if and only if , iff
<> or =/= is not equal ; =. is near, close to, approximately equal
P(.) is a probability, or a proportion; 1-P(.) = P(~.) = complement
Py  == P(y) = n(y)/N = prevalence of y
Px  == P(x) = n(x)/N
Pxy == P(x,y) == P(x & y) = n(x,y)/N = joint probability of (x & y)
P(x|y) = Pxy/Py = n(x,y)/n(y) = a crude estimator of "P of x if y"
       = sensitivity of test x = true positive rate = TPR
       = a crude measure of how much (y entails x) ie (y implies x)
P(y|x) = Pxy/Px = n(x,y)/n(x) = a crude estimator of "P of y if x"
       = positive predictivity of y (is not directly available in medicine)
       = a crude measure of how much (x entails y) ie (x implies y)
P(~x|~y) = n(~x,~y)/n(~y) = specificity of test x = true negative rate = TNR
P( x|~y) = 1 - P(~x|~y) = 1 - spec = false alarm rate of x = P(type I error)
P(~x| y) = 1 - P( x| y) = 1 - sens = false negative rate of x = P(type II error)
P(y|x).Px = Pyx = Pxy = Py.P(x|y)  is the basic Bayes rule symmetrized by me
P(y|x) / P(x|y) = Py/Px  is the basic Bayes rule too; hence:
P(y|x) > P(x|y) == Py > Px ; becomes VITAL in Insight0 , property P4b !
GF(y:x)/GF(x:y) = Py/Px  as I discovered, derived at ++ & in Insight1 !
Determinism (draw a Venn diagram or a pizza or pancake Px inside Py) :
(x implies y) == [ P(x & ~y) = 0 = Px - Pxy ] == [ Px = Pxy ]  is easy ;
clumsy: == P(~x or y) = 1 = P(~x) + Py - P(~x,y) = 1-Px +Py -(Py -Pxy)
Py - Pxy = P(y,~x) = P(~x) - P(~x,~y) = 1-Px - [1-(Px+Py-Pxy)]  via DeMorgan
P(x or y) = Px + Py - Pxy = P(~(~x,~y)) = 1 - P(~x,~y)  by DeMorgan's law
          = Px + Py  for disjoint events ie Pxy = 0
Pxy = Px.Py is just 1 out of 17 equivalent conditions for independent x, y ;
! since Px.Py <> 0 , the Pxy = 0 means that disjoint events are dependent.
E[f(x)] = Sum_x[ Px.f(x) ] = expected value of f(x), so for a random
variable X :
E[ Px ] = Sum_x[ Px.Px ] = Sum[ Px^2 ] is the expected probability of rv X
sqrt( Sum_x[ Px^2 ] ) = length of the vector of Px's , says Pythagoras
Odds =eg= Pdeath/Psurvival = (1-Psurvival)/(1-Pdeath)
Odds(x)   = Px/P(~x) = Px/( 1 - Px )
Odds(x|y) = P(x|y)/P(~x|y) = P(x|y)/[ 1 - P(x|y) ]
Px = Odds(x)/[ 1 + Odds(x) ] ;  P(x|y) = Odds(x|y)/[ 1 + Odds(x|y) ]
RR(y:x) = P(y|x)/P(y|~x) = Odds(x|y)/Odds(x) , note the semi-inversion ;
        = LR(y:x) is a likelihood ratio = sensitivity/( 1 - specificity ).
! RR(:) = LR(:) like any simple ratio LOSES INFOrmation on P's magnitude, as
  0.2/0.4 = 0.0002/0.0004 = 0.0000000002/0.0000000004 = 0.5 ; find NNT NNH NNR
RR( y: x) = P( y| x)/P( y|~x) = 1/RR(y:~x)
       <> RR(~y:~x) = P(~y|~x)/P(~y| x) = [1 - P(y|~x)]/[1 - P(y|x)]
                    = 1/RR(~y:x)
vs ARR( y: x) = P( y| x) - P( y|~x) = absolute risk reduction
  = ARR(~y:~x) = P(~y|~x) - P(~y| x)
! = [1-P( y|~x)] - [1-P( y| x)] = ARR(y:x) , UNLIKE for the risk ratio.
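As a concrete companion to the table above, here is a small sketch of
mine (the counts are hypothetical) that turns the four cells a, b, c, d
into the P's and the two contrast measures used throughout:

def measures(a, b, c, d):
    # a = n(x,y), b = n(x,~y), c = n(~x,y), d = n(~x,~y) in a 2x2 table
    n = a + b + c + d
    p_y_x  = a / (a + b)          # P(y|x)
    p_y_nx = c / (c + d)          # P(y|~x)
    return {
        "Px": (a + b) / n,
        "Py": (a + c) / n,
        "sens P(x|y)":   a / (a + c),
        "spec P(~x|~y)": d / (b + d),
        "ARR": p_y_x - p_y_nx,    # absolute risk reduction
        "RR":  p_y_x / p_y_nx,    # relative risk aka risk ratio
    }

# hypothetical counts: 40 exposed sick, 960 exposed healthy,
# 10 unexposed sick, 990 unexposed healthy
print(measures(40, 960, 10, 990))   # ARR = 0.03 , RR = 4.0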
OR( : ) = [ Pu/P(~u) ]/[ Pv/P(~v) ]  is the odds ratio in general
        = [ Pu/Pv ].[ (1-Pv)/(1-Pu) ] ; in particular, find OR(x,y) :
OR(x,y) = RR(y:x)/RR(~y:x) = RR(y:x).RR(~y:~x) = OR(y,x) ,
          since RR(~y:~x) = 1/RR(~y:x)
Eg: { Restle 1961, 150 } shows a formula by Luce, 1959, as an alternative
to ROC curves for the evaluation of radar operators. Let x = target
really present, alpha = strength of the target stimulus, my y = the
operator's yes :
P(y|x) = alpha.P(y|~x)/[ 1 + (alpha-1).P(y|~x) ]  from which I obtained
alpha = [ P(y|x)/P(y|~x) ].[ ( 1 - P(y|~x) )/( 1 - P(y|x) ) ]
      = RR(y:x)/RR(~y:x) = RR(y:x).RR(~y:~x)  which I recognized as
      = Qnec(y:x).Qsuf(y:x)  where Qnec(y:x) = RR(y:x) and
        Qsuf(y:x) = RR(~y:~x) by I.J. Good { Hajek }; find Qnec , Qsuf
        there and here.
Caution: Max[ 0, Px + Py - 1 ] <= Pxy <= min[ Px, Py ]  MUST always hold,
so if Px + Py > 1 we are not completely free to choose any
Pxy <= min[Px,Py] in tests or examples; for more find Bonferroni here and
in { Hajek www }.
Caution: [ P(x|y) > P(x|~y) ] == [ P(y|x) > P(y|~x) ] , similarly for < , = .
Basic Bayes shows how to INVERT ie mutually convert both conditionings
P(.|.) ; find E(x and OR(x . More on Bayes is in { Hajek www }.
Don't worry, here we shall use Bayes only to expose the invertibility
which is exactly the cause of why nobody can formally decide about the
direction of a probabilistic causation, ie whether x causes y, or y
causes x. However, inside Insight0 my new property P4b may help with such
decisions.
[ n(x,y) - 0.5 ]/n(.) are better estimators if low n(x,y) > 0. No book on
statistics will teach you this. Some books on statistics use n(x,y)+0.5
for all entries in a contingency table in order to get rid of all
n(x,y)=0 , but +0.5 deforms low original counts n(x,y) > 0 exactly in the
wrong direction. If n(x) is low then P(y|x) = n(x,y)/n(x) = 1 occurs too
often and is a crude P ; if n(y) is low then P(x|y) = n(x,y)/n(y) = 1
occurs too often; these will lead to EXTREME values from many formulae,
including some in this e-paper. Such extremes will be prevented by using
the less simplistic yet provably near optimal estimators (I derived the
optimal ones) :
P(y|x) = [ n(x,y) - 0.5 ]/n(x) ,  P(x|y) = [ n(x,y) - 0.5 ]/n(y).
Estimates of probabilities should be 0 < P(.) < 1 for two reasons :
- "never say never", eg about black swans, or white crows (with a genetic
  defect, as I saw one in my garden) ;
- "no /0 ", ie there will be no division by zero in many formulae, eg in
  [ Pxy - Px.Py ]/sqrt( Px.(1-Px).Py.(1-Py) )  { Kemeny 1952 , p.14 }
  = cov(x,y)/sqrt( var(x).var(y) ) = the correlation coefficient for
  events x, y ; more in Insight2 . It is strange that I have seen this
  important (for why, see Insight2 ) formula only in a couple of books.
.-
Interpretations of P's for INDEPENDENT events a, b :
When reading the rules, think about their applications. Independence
holds not only for 2 or more throws of one or more dice, but also more or
less for successful hits, or for failures, eg in reliability studies.
0: P(a & b) = 0 if the events a, b are disjoint ie mutually exclusive,
   hence they are not independent; eg it is impossible that a single
   throw of a die will yield 2 numbers.
1: Pa*Pb = P(both events a & b occur jointly ie "together"). Jointly may
   mean either simultaneously, or in parallel, or sequentially, depending
   on how it is defined. Also "success" and "failure" may be defined as
   needed; one (wo)man's success is another (wo)man's failure.
   = P1.P2...Pk = P( all events a & b &...& k occur jointly )
   = (Pj)^k  if all Pj are equal
2: Pa*(1-Pb) = P(a & ~b) = P(a BUT NOT b) = P(a-b)
   = P(of an asymmetric difference) = 1 - P(a->b)  see 3:
   = Pa - Pa*Pb  is the value for independents
!  = P( a UNLESS b blocks a )  suggests applications
3: P(a implies b) = P(a->b) = P(~(a,~b)) = 1 - ( Pa - Pab )  in general;
   = 1 - (Pa - Pa.Pb)  is the value for independents , in which case
   (a entails b) is questionable, except for the extreme situation
!  described by my paradox of an IndependentImplication . Also see 2:
4: P(a or b) = Pa + Pb - Pa*Pb = P(at least one event occurs)
   = P(one or more events occur) = 1 - P(none occurs) , see 5:
!! = 1 - (1-Pa)*(1-Pb) = 1 - (1 - (Pa + Pb - Pa*Pb))
   including the extreme possibility that ALL events occur.
5: P(~a)*P(~b) = (1-Pa)*(1-Pb) = 1 - (Pa + Pb - Pa*Pb) = 1 - P(a or b)
   = P(not even one occurs) = P(none of both occurs) = P(neither a nor b)
   = P(~a & ~b)  by DeMorgan's rule
   = P(a Nor b)  verbally "neither a nor b"
   = (1-Pa)*(1-Pb)*...*(1-Pk) = P(none occurs)
   = (1 - Pj)^k  if all k Pj's are equal
6: 1 - Pa*Pb = P( a & b will not occur jointly ) , eg P( both parts
!  will not fail jointly (eg within the time interval) )
   = P(a Nand b) = P(~(a & b)) = P(~a OR ~b)  by DeMorgan's rule
   = P(at least one event is not occurring)
   = P(one or more events are not occurring)
-.-
+Cold shower
Consider a single cause-effect relationship. Obviously, if an effect y
and its tentative cause x are independent, then neither does x cause y ,
nor does y cause x. In terms of probabilities, independence of y and x
occurs if Pxy = Px.Py . At least 4*2*2 + 1 = 17 equivalent relations exist :
Pxy - Px.Py = 0
P(y|x) - Py = 0 , P(y|x) - P(y|~x) = 0 , see +Palatable
! P(x|y) - Px = 0 , P(x|y) - P(x|~y) = 0 , ...
Pxy.P(~x,~y) = P(x,~y).P(~x,y) , in OR(:) , is the 17th equivalent ;
~y is the complement of y , eg absence or failure of y ;
y = presence or success ; or vice versa ie complementary meanings.
Disjoint events have Pxy = 0 , hence are not independent since Px.Py <> 0.
Stochastic independence of random events and random variables is the most
important reference point in probability and statistics. Mark Kac wrote:
"Independence is the central concept of probability theory." I say:
"Independence is the Archimedean point of probability and stats.", and
"Understanding (in)dependence is the pons asinorum to probability and
stats." Buridan's pons asinorum aka pons asini means a "bridge of the
asses" (Eselsbruecke in German, ezelsbrug in Dutch). In Holland every
school kid has "het ezelsbruggetje" = a little bridge for the (little)
asses, ie a personal aid to succeed at exams, and hopefully to understand
better. Be reminded that independence implies uncorrelatedness, but not
vice versa; the implication is one-way only, ie it is not an equivalence,
which is a 2-way implication.
After this warm-up we can take a COLD SHOWER : ALAS , the equivalences
hold not only among formulae with = but also with < or > , eg:
[ P(y|x) > P(y) ] == [ P(y|x) > P(y|~x) ] ==  see Insight1
[ P(x|y) > P(x) ] == [ P(x|y) > P(x|~y) ] ==  etc (at least 17
equivalences), hence the following hold simultaneously :
P(y|x) / P(y|~x)       = RR(y:x)  > 1 <  RR(x:y)
[ P(y|x) - P(y|~x) ]   = ARR(y:x) > 0 <  ARR(x:y)
( P(y|x) - Py )        = K(y:x)   > 0 <  K(x:y)
( P(y|x) - Py )/(1-Py) = PAR(y:x) > 0 <  PAR(x:y)
                         GF(y:x)  > 0 <  GF(x:y)
hence we CANNOT GET the direction of causation from M(y:x) > M0 , where
M0 is the value of M(:) for INdependent x, y. This is a cold shower, but
the show must go on.
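The cold shower is easy to verify numerically. A sketch of mine checks,
on random admissible P's, that all these contrasts go positive or
negative together in both directions, so the sign alone cannot reveal
the direction:

import random

def contrasts(px, py, pxy):
    # all of these must share one sign: that symmetry is the cold shower
    p_y_x, p_y_nx = pxy / px, (py - pxy) / (1 - px)
    p_x_y, p_x_ny = pxy / py, (px - pxy) / (1 - py)
    return [pxy - px * py,       # dependence contrast
            p_y_x - p_y_nx,      # ARR(y:x)
            p_x_y - p_x_ny,      # ARR(x:y)
            p_y_x - py,          # K(y:x)
            p_x_y - px]          # K(x:y)

random.seed(1)
for _ in range(10_000):
    px, py = random.uniform(0.05, 0.95), random.uniform(0.05, 0.95)
    lo, hi = max(0.0, px + py - 1.0), min(px, py)   # Bonferroni bounds
    pxy = random.uniform(lo, hi)
    cs = contrasts(px, py, pxy)
    if abs(cs[0]) > 1e-9:                           # skip near-independence
        assert all((c > 0) == (cs[0] > 0) for c in cs)
print("signs always agree: the sign of M(:) cannot give the direction")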
The Question : "Does x cause y, or is it y which causes x ?" is hard to
answer automatically, but my condition P4b , which classifies M(:)'s by
the easy criterion of whether the following equivalence holds or not :
  [ Px < Py ] == [ M(y:x) > M(x:y) ] ,
is a big leap forward. P4b is simple but not simplistic ; eg for the risk
ratio aka relative risk it holds
  [ Px < Py ] == RR(y:x) < RR(x:y) = P(x|y)/P(x|~y)
               = [ Pxy/(Px-Pxy) ].P(~y)/Py
where 1/(Px-Pxy) measures how much (x entails y) ie (x implies y) , since
whenever Px = Pxy it holds Px <= Py and RR(x:y) = oo = infinite.
More follows next (see the sketch below), and then in Insight1 and
Insight7 .
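P4b can be probed numerically. A sketch of mine samples admissible
triples (Px, Py, Pxy) with positive dependence and tests whether
[ Px < Py ] == [ M(y:x) > M(x:y) ] holds for GF, for RR (where the order
flips, as claimed above), and for ARR (where no consistent order exists):

import random

def gf(p_e_c, p_e_nc):
    # GF = [ P(e|c) - P(e|~c) ] / [ 1 - P(e|~c) ]
    return (p_e_c - p_e_nc) / (1.0 - p_e_nc)

random.seed(2)
n = gf_ordered = rr_flipped = arr_ordered = 0
for _ in range(20_000):
    px, py = random.uniform(0.05, 0.95), random.uniform(0.05, 0.95)
    lo, hi = max(px * py, px + py - 1.0), min(px, py)  # positive dependence
    if abs(px - py) < 1e-6 or lo >= hi:
        continue
    pxy = lo + (hi - lo) * random.uniform(0.1, 0.9)    # stay off the edges
    p_y_x, p_y_nx = pxy / px, (py - pxy) / (1 - px)
    p_x_y, p_x_ny = pxy / py, (px - pxy) / (1 - py)
    n += 1
    gf_ordered  += ((px < py) == (gf(p_y_x, p_y_nx) > gf(p_x_y, p_x_ny)))
    rr_flipped  += ((px < py) == (p_y_x / p_y_nx < p_x_y / p_x_ny))
    arr_ordered += ((px < py) == (p_y_x - p_y_nx > p_x_y - p_x_ny))

print(gf_ordered == n, rr_flipped == n)  # True True : GF & RR order strictly
print(arr_ordered, "of", n)              # neither 0 nor n : ARR is not orderly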
-.-
+Warnings
W1: Books, papers, e-papers and even emails contain errors and typos.
The ! marks bugs & typos in the References at the end. Eg in { Kemeny
1953 , p.307 } both second terms p(E,H) should be p(E,~H), in fact
P(E|~H), as follows next:
W2: Notations have changed since the classical works prior to 1965, say.
Example: modern P(x|y) is printed as p(x,y) in { Popper } and { Kemeny }.
W3: Notations change when authors (eg I.J. Good, and me too) better
understand what should be better expressed how. Find +Causal notation is
tough ... Example: I too felt it necessary to change a couple of
notations, so here
K(y:x) = P(y|x) - Py <= 1-Py SIC , was called K(x:y) in { Hajek }
Conv(y:x) = Px.P(~y)/P(x,~y)  was called Conv(x:y) in { Hajek }
          = (Px - Py.Px)/(Px - Pxy)  measures how much (x implies y)
          = (1 - Py)/(1 - P(y|x)) = P(~y)/P(~y|x) = Px/P(x|~y)
W4: Causation is a slippery concept. Implication or entailment is tricky,
both in the interpretation of logic/math and in its notation, as shown :
== equivalent , a 2-way implication , if and only if , iff , synonymous ;
>= greater or equal ; <= less or equal ; = equal ;
=> is meaningless in this e-paper and in Pascal, although many authors
use it (usually in a rounded form) for an implication aka entailment, eg
x => y as x -> y is MISLEADing since it is in CONFLICT with the same sign
for (x is SuperSet of y) ie (y subsetOf x), which means (x is entailed by
y) ie (y entails x), hence
(x <- y) == (y -> x) == (y <== x) == (y subsetOf x) == (y implies x)
         == ~(y,~x) == no(y w/o x) == ~(y AndNot x) == (y entails x)
         == (~y or x)  via DeMorgan ;
my <== works on Booleans represented as 0 for False, 1 for True, and
evaluated numerically, like the thoughtful notation in Pascal where
(y <= x) on LOGICal expressions means (y implies x). Note that
!! Py <= Px on numbers for marginal probabilities naturally suggests that
if Py <= Px then a measure of causation M(:) should measure how much
(y implies x). My condition P4b in Insight0 uses Px < Py as a basic
classifier of what an M(:) measures; find P4b .
Example: This example covers F(y:x) = [ RR(y:x) -1 ]/[ RR(y:x) +1 ] ,
W(y:x) = ln( RR(y:x) ) , and RR(y:x) itself. In his excellent paper,
Einstein's math assistant { Kemeny 1952 , 313 } has designed his F(H,E)
as the "degree of factual support" provided by the evidence E for the
hypothesis H , to measure how much "the hypothesis is a logical
consequence of the evidence. CA8: F(H,E) is 1 if and only if E => H .
The evidence definitely disproves the hypothesis just if it logically
implies its negation, so this corresponds to F = -1." { more "..." in
Insight8 }. His E => H stands for E -> H ie E implies H ie E entails H.
But his F(H,E) = [ p(E,H) - p(E,~H) ]/[ p(E,H) + p(E,~H) ] on p.320 means
F(e:h) = [ P(e|h) - P(e|~h) ]/[ P(e|h) + P(e|~h) ]  in modern notation
       = [ P(e|h)/P(e|~h) - 1 ]/[ P(e|h)/P(e|~h) + 1 ]
       = [ RR(e:h) - 1 ]/[ RR(e:h) + 1 ]
       = 1 iff P(e,~h) = 0 ie Peh = Pe ie Peh/Pe = 1 = P(h|e).
It may seem that F(e:h) measures how much (h implies e), BUT IT ISN'T SO,
as my decomposition shows, and as Kemeny correctly says that (E implies H).
So there is nothing wrong with Kemeny, despite the mix of 2 notational
+ 1 genuine nontrivial semantic confusion arising from the specific math
of F . Note that Kemeny wished his F(H,E) ie our F(e:h) to measure
(e implies h).
F(e:h) = [ RR(e:h) - 1 ]/[ RR(e:h) + 1 ] is a function of the risk ratio
RR(e:h) = P(e|h)/P(e|~h) = risk ratio aka relative risk OF THE EFFECT e
IF h OCCURS. Now h is seen as a possible cause of e , since the risk of e
is what we may want to PREDICT. But MDs or GPs want to REMOVE h as the
NECESSARY condition for e to occur. More in Insight7 .
This all creates a lot of genuine confusion plus errors and typos.
That was the 2nd cold shower.
-.-
+Snapshots of insights inside (for busy execs :-)
In0: Fortunately for us, logic, switching circuits, set theory, hence
also probabilistic logic are 4 isomorphic domains. So eg DeMorgan's laws
and the Bonferroni inequality apply to them. The isomorphism derives from
elementary measure theory ( so easily visualized by means of pancakes or
pizzas P aka Venn diagrams in +Warm-up above ) :
P(x U y) + Pxy = Px + Py ;  U is union; Pxy = joint , intersect , overlap
N(x U y) + Nxy = Nx + Ny   for Numbers of elements in sets x, y. Hence
P(x or y) + Pxy = Px + Py  ie  P(x or y) = Px + Py - Pxy  for
probabilities ;  Pxy = 0 for disjoint x, y.
Eg the formula for P(x iff y) ie P(x if and only if y) ie P(x == y) isn't
obvious, but is easily obtained via the isomorphism if we know that
(x == y) == ~(x XOR y) == complement of an obvious symmetrical difference
P(x XOR y) :
P(x or y) - Pxy = (Px + Py - Pxy) - Pxy = Px + Py - 2.Pxy = P(x XOR y)
{ done, P(x==y) = 1 - P(x XOR y), but let's see more isomorphism in action: }
= P(x,~y) + P(y,~x) = (Px -Pxy) + (Py -Pxy) = Px + Py - 2.Pxy
= P(~(x == y)) = 1 - P(x == y) = P(~(x 2wayImplication y)) = P(~(x iff y))
= 1 - P( (x implies y) & (y implies x) ) = 1 - P( ~(x,~y) & ~(y,~x) )
= P(~[ ~(x,~y) & ~(y,~x) ]) , now use the DeMorgan rule :
= P( (x,~y) or (y,~x) ) ; since these 2 disjoint terms have no overlap :
= P(x,~y) + P(y,~x) - 0
= Px -Pxy + Py -Pxy = Px + Py - 2.Pxy  qed.
So far a demo of the powerful isomorphism .
In1: Paradoxically, for the extreme Py=1 a perfect entailment (x entails
y) exists while x, y are independent , since Pxy = Px.Py = Px ie
P(y|x) = 1 ;
! find IndependentImplication and 0/0 . I prefer to resolve the dilemma
by voting for the independence , since the entailment (aka implication )
is 'degenerated' or a trivial one.
In2 to In5: are 4 insights packed together:
P(y|x)/P(y|~x) <> [1-P(y|~x)]/[1-P(y|x)]   eg y=sick, ~y=healthy, ie
RR(y:x) <> RR(~y:~x)  for the relative risk aka risk ratio , vs
ARR(y:x) = ARR(~y:~x) for the absolute risk reduction :
ARR(:) = P(y|x) - P(y|~x) = [1-P(y|~x)] - [1-P(y|x)]
       = [ Pxy - Px.Py ]/[ Px.(1-Px) ] = cov(x,y)/var(x) = beta(y:x)
       = slope of a probabilistic regression line
         Py = beta(y:x).Px + alpha(y:x) ,
from which follows my fresh interpretation of GF and my HF :
GF = ARR/[ 1 - P(y|~x) ]  which I recognized as
   = slope(of y on x)/[ MAXimal achievable slope, since P(y|x) <= 1 ]
   = beta(of y on x)/[ fictive MAXimal beta, ie as-if , what if ]
and is commensurable with my causal hindrance factor HF (unlike Cheng's
PF ) :
HF = -ARR/[ 1 - P(y|x) ] = -ARR/P(~y|x)  instead of Cheng's
PF = -ARR/P(y|~x) ;
HF = slope(of y on ~x)/[ MAXimal achievable slope, since P(y|~x) <= 1 ]
   = beta(of y on ~x)/[ fictive MAXimal beta, ie as-if , what if ]
In6: M(y:x) <> M(~x:~y) should hold for a measure of causation tendency.
Alas M(y:x) = M(~x:~y) holds for measures based PURELY on entailment,
since for an entailment holds (x implies y) == (~y implies ~x) , which
leads to causal NONSENSE ; find Conv(y:x) = Conv(~x:~y) which is KO for
causation.
RR(y:x) <> RR(~x:~y) is OK , and so is its transform F(y:x) <> F(~x:~y).
Proof of <> :
RR( y: x) = P( y| x)/P( y|~x) = [ Pxy/P(y,~x) ].(1-Px)/Px
     <> RR(~x:~y) = P(~x|~y)/P(~x| y) = [ P(~x,~y)/P(~x,y) ].Py/(1-Py)
        = [ (1 - (Px+Py-Pxy))/P(y,~x) ].Py/(1-Py)  qed.
Note P(y,~x) in both RR's ; this is due to the next equalities :
P(y,~x) = Py - Pxy = P(~x,y) = P(~x) - P(~x,~y)
        = (1-Px) - (1 - [Px+Py-Pxy])  via DeMorgan
1/RR(y:~x) = RR(y:x) <> RR(~y:~x) = 1/RR(~y:x) ;
RR(y:x) <> RR(~x:~y) = 1/RR(~x:y) , causation needs this <> ;
RR(y:x) <> RR( x: y) = 1/RR(x:~y)  is proven thus:
[ Pxy/P(y,~x) ].P(~x)/Px <> [ Pxy/P(x,~y) ].P(~y)/Py  ie
(1/Px - 1)/(Py - Pxy)    <> (1/Py - 1)/(Px - Pxy)  qed.
In7: RR(y:x) = P(y|x)/P(y|~x) = 1/RR(y:~x) <> RR(~x:~y) is OK as desired
   = [ Pxy/(Py - Pxy) ].(1-Px)/Px   are my decompositions
   = [ Pxy/P(y,~x) ].(1-Px)/Px , where 1/P(y,~x) == (y entails x)
   = Pxy.(y implies x).SurpriseBy(x) ;
for both Pxy & Py fixed , RR(y:x) will:
 + INCrease with smaller Px ie with more specific x
 + DECrease with LARGEr  Px ie with more common x , hence RR fits SIC !
While P(y|x) is clearly a crude measure of how much (x entails y) ,
GF = ARR/[ 1 - P(y|~x) ] measures how much (x implies y) ie (x entails y) :
(Px < Py) == GF(y:x) > GF(x:y) = [P(x|y) - P(x|~y)]/[ 1 - P(x|~y) ] , like
(Px < Py) == P(y|x) > P(x|y) = Pxy/Py = the simplest crude (y implies x) ,
vs :
(Px < Py) == RR(y:x) < RR(x:y) = [Pxy/(Px -Pxy)].(1-Py)/Py , (x implies y)
!!! ARR(y:x) = P(y|x) - P(y|~x) does NOT obey ANY consistent ordering, by
either Px <= Py , or consistently by Px >= Py .
In8: P(x|c) > P(x|y) ie Pcx/Pxy > Pc/Py is my new simplest thinkable
necessary condition for a confounding candidate event c to overrule x as
a possible cause of an effect y. Below I show its derivation (from my
decomposition just shown in In7 ). The 2 known necessary conditions
combined were: RR(y:x) < minimum[ RR(c:x) , RR(y:c) ] .
InEtc: Etc, etc.
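Before moving on, the In2-In5 package is easy to verify numerically. A
sketch of mine confirms that ARR(y:x) = cov(x,y)/var(x) and shows GF and
HF as ratios of that slope to its maximal achievable value:

def slope_view(px, py, pxy):
    # ARR(y:x) = cov(x,y)/var(x) ; GF and HF as slope / MAXimal slope
    cov, var_x = pxy - px * py, px * (1 - px)
    p_y_x, p_y_nx = pxy / px, (py - pxy) / (1 - px)
    arr = p_y_x - p_y_nx
    assert abs(arr - cov / var_x) < 1e-12   # ARR is the regression slope
    gf = arr / (1 - p_y_nx)    # slope / max achievable slope, P(y|x) <= 1
    hf = -arr / (1 - p_y_x)    # the same with x and ~x swapped
    return arr, gf, hf

print(slope_view(0.5, 0.05, 0.04))  # ARR = +0.06 : GF = 0.0612 applies
print(slope_view(0.5, 0.05, 0.01))  # ARR = -0.06 : HF = 0.0612 applies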
-.-
+Sheps' "relative difference" GF is smart :
I did all this research because I consider GF a very smart formula, and
one of the best indicators of causation tendency.
My fresh analysis shows why I do & you should like GF :
+ Several field-proven formulae (find Insight9 ) have the following
  general structure :
Derivation 0:
g = [(1-Pb) - (1-Pa)]/(1-Pb) = 1 - (1-Pa)/(1-Pb)  is the Error-form
  = (Pa - Pb)/(1-Pb)  is what I call the canonical form
g <= 1 ;  proof: 0 <= P(.) <= 1 , hence g <= 1 , hence also:
g <= Pa ; proof: g - g.Pb = Pa - Pb ; g = Pa - Pb.(1-g) <= Pa <= 1  ++
g <= Pa is semantically desirable because g is a quasiprobability, eg:
GF = [P(y|x) - P(y|~x)]/[1 - P(y|~x)] , where:
GF <= P(y|x) is desirable since it is desirable to have a measure
M(y caused by x) interpretable as a quasiprobability, and it should hold
M <= P(y|x), with the = only in one extreme case, because P(y|x) says
that x was present, but does NOT discount the ~x , while GF <= P(y|x)
says that y may not be caused by x even when x is present, as other
causes of y, namely ~x , may exist. Hence this <= shows how meaningful GF
is, ie that in its subrange 0 <= GF , GF is measuring a probability
P(y caused by x and surely not by some other cause ~x).  ++
Derivation 1: if P(y|x) > P(y|~x) then P(y|x) = P(y|~x) + GF.P(~y|~x) ,
ie GF is that portion of P(~y|~x) which could become y if exposed.
Hence GF = ARR / P(~y|~x)  ++
Derivation 2: if all x would cause y then P(y,x) = Px ie P(y|x) = 1.
Then GF is the ratio of the actual ARR / the fictive maximum ARR :
GF = [ P(y|x) - P(y|~x) ]/[ 1 - P(y|~x) ]
   = [ P(y) - P(y|~x) ]/[ (1 - P(y|~x)).Px ]   my new form
+ All GF's values >= 0 are easily & instantly interpretable, not just its
  pivotal values 0 and 1. This is not so for F( : ), C( : ), K( : ).
+ GF <= 1 = MAXimum if Pa = 1 = MAXimum of any P, eg P(y|x) = 1
+ GF = 0 = minimum if Pa = Pb (the minimum for Pa >= Pb as required )
+ GF = 0/0 = UNdefined if Py=1 , hence Px = Pxy = Px.Py & P(y|x) = 1 , ie
  in this extreme case x, y are INdependent & (x entails y) ; also
+ GF = 0/0 = UNdefined if Px=1 , hence Py = Pxy = Px.Py & P(x|y) = 1 , ie
  in this extreme case x, y are INdependent & (y entails x) which is
! my PARADOX of IndependentImplication (find it). These ambiguities are
  correctly reflected in 0/0 , which can be pre-checked and signalled
  in/by progs. 0/0 occurs in F(:) too.
- GF's lower bound << -1 ie is not -(its upper bound = 1 ). This calls
  for the introduction of a COMMENSURABLE complementary formula HF :
! HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] has maximum = 1 iff P(y|~x)=1
     = hindrance factor, for use if P(y|x) < P(y|~x) ie if ARR < 0.
+ HF paired with GF has only the positive ( + ) properties listed here.
+ GF was designed as a "relative difference", ie as :
  = difference/( difference AVAILABLE for Pa's above the given Pb )
  = ARR/( fictive MAXimum ARR achievable if Pb is given )
! = slope/( as-if MAXimal slope thinkable if Pb is fixed ) , my insight
! = beta/( as-if MAXimal beta possible if Pb is known ) , my insight ;
  beta is for the probabilistic regression line for events
+ GF satisfies my desideratum P2 as all values 0 <= GF <= 1 are
  meaningfully and quantitatively interpretable, not only its pivotal
  points 0 and 1.
GF(y:x) = [P(y|x)-P(y|~x)]/[1-P(y|~x)] = [Pxy-Px.Py]/[Px.(1-(Px+Py-Pxy))]
GF(x:y) = [P(x|y)-P(x|~y)]/[1-P(x|~y)] = [Pxy-Px.Py]/[Py.(1-(Px+Py-Pxy))]
+ GF(z:z) = (1 - 0)/(1 - 0) = 1   a kind of reflexivity ( maximum reflex )
+ GF(y:x) = GF(x:y) iff Px = Py , a kind of antisymmetry
+ GF(y:x) <> GF(x:y) if Px <> Py , a kind of asymmetry
+ GF(y:x) <> GF(~x:~y) as P5 desires ( contrapositive "=" is UNdesirable )
+ GF satisfies my desideratum P4b (see also Insight7 ) :
! GF(y:x) > GF(x:y) == Py > Px  is a must for a measure of causal
sufficiency; in fact for GF an exact proportionality holds :  ++
GF(y:x)/GF(x:y) = Py/Px  ( proof in Insight1 ) is MEANINGFUL & clear, as it
!! fits with basic Bayes ie with :
!  P(y|x)/P(x|y) = Py/Px , while GF conveys more MEANING, just listed
above; if GF >= 0 then: [ GF(y:x) >= 0 <= GF(x:y) ]  AND :
[ GF(y:x) > GF(x:y) ] == [ Py > Px ] == [ P(y|x) > P(x|y) ]
[ GF(y:x) < GF(x:y) ] == [ Py < Px ] == [ P(y|x) < P(x|y) ]
GF(y:x) tells how much (x implies y) ie (x entails y) ie (x SufFor y).
+ Let's see how GF , RR( : ) , PAR( : ) fit "semantic information
  content" SIC :
  for both Pxy & Px fixed , GF(y:x)  DECreases with INCreasing Py ;
  for both Pxy & Py fixed , GF(x:y)  DECreases with INCreasing Px ;
  for both Pxy & Py fixed , PAR(x:y) DECreases with INCreasing Px ;
  for both Pxy & Py fixed , RR(y:x)  DECreases with INCreasing Px .
+ RR(y:x) = 1/RR(y:~x) = Qnec(y:x) by I.J. Good, and also his
  RR(~y:~x) = 1/RR(~y:x) = Qsuf(y:x) , which inserted into my next rule
  GF(v:u) = 1 - RR(~v:u) = 1 - 1/RR(~v:~u) = 1 - 1/[ 1 - GF(v:~u) ]
  { Hajek , find there +New conversions } yield 2 pairs of commensurable
  GF(:)'s. Commensurability means very similar scale and continuity of
  values near GF = 0. Note that the commensurable pairs have the exposure
  u swapped with ~u. Hence :
GF & HF are commensurable :
GF(y:x)  = 1 - RR(~y:x)  = 1 - 1/RR(~y:~x) = 1 - 1/Qsuf            = GF
         = 1 - 1/[ 1 - GF(y:~x) ] = 1 - 1/[ 1 - HF ]
         = generative factor (x causes y)
GF(y:~x) = 1 - RR(~y:~x) = 1 - 1/RR(~y:x)  = 1 - Qsuf              = HF
         = 1 - 1/[ 1 - GF(y:x) ] = 1 - 1/[ 1 - GF ]
         = hindrance factor by { Hajek }
ARX & PF are commensurable :
GF(~y:~x) = 1 - RR(y:~x) = 1 - 1/RR(y:x)   = 1 - 1/Qnec = ERR      = ARX
          = 1 - 1/[ 1 - GF(~y:x) ] = 1 - 1/[ 1 - PF ]
          = excess risk by { Pearl 2000 }
GF(~y:x)  = 1 - RR(y:x)  = 1 - 1/RR(y:~x)  = 1 - Qnec              = PF
          = 1 - 1/[ 1 - GF(~y:~x) ] = 1 - 1/[ 1 - ERR ]
          = preventive factor { Cheng 1997 }
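These four conversions are mechanical to verify. A short check of mine
runs them on one admissible triple (any other works as well):

px, py, pxy = 0.3, 0.4, 0.2              # an admissible triple
p_y_x  = pxy / px                        # P(y|x)
p_y_nx = (py - pxy) / (1 - px)           # P(y|~x)

rr   = p_y_x / p_y_nx                    # Qnec = RR(y:x)
qsuf = (1 - p_y_nx) / (1 - p_y_x)        # Qsuf = RR(~y:~x)
gf   = (p_y_x - p_y_nx) / (1 - p_y_nx)
hf   = (p_y_nx - p_y_x) / (1 - p_y_x)    # = GF with x and ~x swapped
pf   = 1 - rr                            # Cheng's preventive factor
err  = 1 - 1 / rr                        # excess risk ERR = ARX

assert abs(gf  - (1 - 1 / qsuf)) < 1e-12         # GF  = 1 - 1/Qsuf
assert abs(gf  - (1 - 1 / (1 - hf))) < 1e-12     # GF  = 1 - 1/(1 - HF)
assert abs(hf  - (1 - qsuf)) < 1e-12             # HF  = 1 - Qsuf
assert abs(err - (1 - 1 / (1 - pf))) < 1e-12     # ERR = 1 - 1/(1 - PF)
print(gf, hf, pf, err)   # here HF < 0 since ARR > 0 ; HF is for ARR < 0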
-.-
+Why is Cheng's PF incommensurable with Sheps/Cheng's GF :
Let P(y|x)  = probability of an effect y if its alleged cause x is present
    P(y|~x) = probability of an effect y if its assumed cause x is absent
ARR = P(y|x) - P(y|~x) = absolute risk (reduction) in epidemiology
RR  = P(y|x) / P(y|~x) = relative risk aka risk ratio (loses info in zeroes).
P8: My principle of CONTINUITY and COMMENSURABILITY of results :
SMALL changes in input P's must NOT lead to LARGE changes in the output,
ie in the result from a SINGLE causal tendency measure M(:) . This kind
of continuity must also hold between a PAIR of measures each of which
covers only a part of the whole range of all possible inputs. If a pair
of such formulae is used to cover the whole range of causation values,
then the results must be COMMENSURABLE in general, hence also near the
point where we switch from one formula to the other one. Typically a
switch-over point will be INDEPENDENCE ie ARR = 0 , around which there
cannot be any meaningful causation (the degenerated cases are shown by my
paradox of IndependentImplication ), so near ARR = 0 a measure of
causation M(:) must measure (in)dependence ONLY, and DEPENDENCE IS
SYMMETRICAL wrt x, y. With increasing dependence the ASYMMETRY of values
should increase.
An example: y = an effect, eg death ; x = a cause, eg drinking alcohol.
No drinking is better than too much, but worse than drinking a little.
Drinking too much INCreases our chances of early death. Drinking a little
DECreases our chances of early death from heart disease, a major killer.
So, eg, if 100 people change their habit of drinking M glasses to L
glasses, and then
ARR = P(y|x) - P(y|~x) = 4/100 - 1/100 = 0.03  becomes -0.03 ,
then the thus forced switch-over
from the "generative power" GF = 0.0303 = power of alcohol to cause death,
say, to the "preventive power" PF = 0.75 = power of alcohol to avoid death
would be ABSURD on the scale [0..1], or 0% to 100%. Since we have
accepted GF and its meaning and units for the generative factor, it would
be absurd to use a formula which yields such an absurdly high preventive
factor PF. Yet such an absurd "preventive power" comes out of Cheng's PF,
as shown in the next subsection.
By the way, anytime three or more values of x are evaluated as possible
causes of y, and if only the middle value of x has no dependence on y and
vice versa, then we shall get 2 ARR's and 2 GF's with different signs,
and the GF < 0 will have to be replaced by some other formula with a
range commensurable with the GF > 0.
Absurdly high sensitivity to inputs is a property of chaos, unlikely to
be explained in terms of the cause-effect as we know it.
Abs( ARR ) = Abs( -ARR ) , and Abs( NNT = 1/ARR ) = Abs( NNH = -1/ARR )
provide another argument for my objection against Cheng's PF .
.-
+HF vs PF served as a palatable fast food :
Let P(y|x) = 4/100 = 0.04 > P(y|~x) = 1/100 = 0.01 ;
ARR = P(y|x) - P(y|~x) = 0.04 - 0.01 = 0.03 ; RR = 4 due to the LOST zeroes
GF  = ARR/[ 1 - P(y|~x) ] = 0.03/(1 - 0.01) = 0.0303 = the "relative
difference" from { Sheps 1959 }, reinvented in { Cheng 1997 }.
If we accept and use GF for generative impacts of x on y when ARR >= 0,
then we should use a compatible formula for preventive impacts when
ARR < 0, unless ARR is too close to 0, in which case abs(GF) =. abs(ARR)
approximately. When the situation changes and the values become
P(y|x) = 0.01 < P(y|~x) = 0.04 , then we get
GF = (0.01 - 0.04)/(1 - 0.04) = -0.03/0.96 = -0.03125 < 0 ; RR = 0.01/0.04
GF < 0 has to be replaced by a commensurable companion formula because
GF >= 0 has the subrange 0..1 , while { -oo is -infinity }
GF <= 0 has the subrange -oo..0 , ie both subranges are incommensurable.
This incommensurability of both subranges justifies the need for 2
formulae. Therefore, and because of the incommensurability of PF with GF,
I designed and here justified my hindrance factor HF as a replacement for
Cheng's PF. Since a positive power thinker doesn't like negative numbers,
(s)he prefers to switch from the "generative power" GF < 0 to a positive
formula PF , earlier called "prevented fraction" or "preventable fraction" :
PF = -ARR/P(y|~x) = -(-0.03/0.04) = 0.75 = Cheng's "preventive power" .
If the structural and semantic INCOMPATIBILITY of the denominators in GF
and PF has not struck you before, by now it should be clear that PF=0.75
is totally INCOMMENSURABLE with GF=0.03 . After a tiny change of P's, be
it a shift on a scale, or a swap near the zero point, the results should
be about the same. Small changes in the inputs ( P's ) should yield small
changes in the outputs or results ( GF , PF ). Here PF/GF = 25 is larger
than one decimal order ie PF is grossly incommensurable with GF.
For ARR < 0 my commensurable formula obtains from GF by swapping x and ~x
in GF :
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] = generative factor for y if ~x
   = [ 0.04 - 0.01 ]/[ 1 - 0.01 ] = 0.0303 , my "hindrance factor" ;
HF = GF = 0.0303 , which is good.
So far for low P's. Let's plug in high P's which, by the way, are far
less common than low P's in (medical) practice. Whether the P's are
(im)possible doesn't matter here.
P(y|x) = 98/100 = 0.98 > P(y|~x) = 96/100 = 0.96 ;
ARR = P(y|x) - P(y|~x) = 0.98 - 0.96 = 0.02 ; RR = 1.02
GF  = ARR/[ 1 - P(y|~x) ] = 0.02/(1 - 0.96) = 0.5
because 0.02 = 0.5*(1 - 0.96) ie the numerator is 50% of the denominator
ie of the base. For P(y|x) = 0.96 < P(y|~x) = 0.98 we have ARR = -0.02
and RR = 0.98 ;
GF = -0.02/(1 - 0.98) = -1 < 0 , which a positive power thinker replaces with
PF = -ARR/P(y|~x) = -(-0.02)/0.98 = 0.0204 = Cheng's "preventive power" ,
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] = generative factor for y if ~x
   = [ 0.98 - 0.96 ]/[ 1 - 0.96 ] = 0.5  is my "hindrance factor" and
HF = GF = 0.5 again, while Cheng's PF = 0.0204 is INCOMMENSURABLE with
GF = 0.5 .
Another example: P(y|x) = 39/1000 < P(y|~x) = 40/1000 ;
PF = (40 - 39)/40 = 1/40 = 0.025 = 25/1000  by Cheng (zeroes lost) ;
HF = (0.040 - 0.039)/(1 - 0.039) = 0.00104 = 1/1000  by Hajek,
and indeed, the difference was only 1 per 1000, hence HF is almost exact,
while Cheng's PF is 25 times larger than reasonable. Cheng's
0.025 = 25/1000 is ABSURDly high.
A pair of examples with the difference of 1 case per 1000 :
P(y|x) = 2/1000 = 0.002 < P(y|~x) = 3/1000 = 0.003 , ARR = -0.001
PF = (0.003 - 0.002)/0.003 = 1/3 = 0.333  also for 0.2 & 0.3, or 0.02 & 0.03 ;
HF = (0.003 - 0.002)/(1 - 0.002) = 0.001002 , while:
HF = (0.03  - 0.02 )/(1 - 0.02 ) = 0.0102  , while:
HF = (0.3   - 0.2  )/(1 - 0.2  ) = 0.125
P(y|x) = 8/1000 = 0.008 < P(y|~x) = 9/1000 = 0.009 , ARR = -0.001 as before
PF = (0.009 - 0.008)/0.009 = 1/9 = 0.111  also for 0.8 & 0.9, or 0.08 & 0.09 ;
HF = (0.009 - 0.008)/(1 - 0.008) = 0.001008 , while:
HF = (0.09  - 0.08 )/(1 - 0.08 ) = 0.0109  , while:
HF = (0.9   - 0.8  )/(1 - 0.8  ) = 0.5
Yet another example : P(y|x) = 4/100 = 0.04 > P(y|~x) = 3/100 = 0.03 ;
ARR = P(y|x) - P(y|~x) = 0.04 - 0.03 = 0.01
GF  = ARR/[ 1 - P(y|~x) ] = 0.01/0.97 = 0.0103
Now let P(y|x) = 0.03 < P(y|~x) = 0.04 , ARR = -0.01 , RR = 0.75 ;
GF = -0.01/(1 - 0.04)    = -0.0104 < 0
HF = -(-0.01)/(1 - 0.03) =  0.0103 = GF > 0 above
PF = -(-0.01)/0.04       =  0.25 = 1 - RR in general ;
PF/GF = 25 ie they are incommensurable. The difference is -1 per 100 ie
-0.01 ; Cheng's PF = 0.25 suggests 25 prevented per 100, which is absurd,
or it suggests 1/4, which has a meaning different from GF.
Clearly, if we agree on the meaning of a generative factor GF and its
units, then we must evaluate preventive effects by the same units of
impact. In fact the only reason for not using GF < 0 is that GF < 0 has
the range of (-oo..0) , while GF >= 0 has the range of [0..1] , which are
incommensurable. This incommensurability of PF > 0 with the undisputed
GF > 0 is what I want to point out, and to do away with, by using my HF
instead of PF . Other insights into GF , HF and PF follow, but they in NO
WAY affect my point, which is the fundamental commensurability of my HF
with GF , versus the fundamental incommensurability of Cheng's PF with GF
in { Sheps 1959 }, { Cheng 1997 }.
GF >= 0 or HF >= 0 are "relative differences", ie differences relative
wrt their denominators [ 1 - P(.|.) ]. The values of GF and HF are very
meaningfully interpretable over their whole range, ie in accordance with
my principle P2 . My fresh & simple though not simplistic interpretation
of both GF & HF as ratios of slopes aka beta's is in Insight2 .
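The whole fast-food argument fits in a few lines of Python of my own; for
each pair P(y|x) < P(y|~x) it prints Cheng's PF next to my HF, so the
scale clash is visible at once:

def pf(p_y_x, p_y_nx):
    # Cheng's preventive factor = -ARR / P(y|~x)
    return (p_y_nx - p_y_x) / p_y_nx

def hf(p_y_x, p_y_nx):
    # Hajek's hindrance factor = -ARR / [ 1 - P(y|x) ]
    return (p_y_nx - p_y_x) / (1.0 - p_y_x)

for p1, p2 in [(0.01, 0.04), (0.96, 0.98), (0.039, 0.040),
               (0.002, 0.003), (0.03, 0.04)]:
    print(f"P(y|x)={p1:<6.4g} P(y|~x)={p2:<6.4g} "
          f"PF={pf(p1, p2):.4f}  HF={hf(p1, p2):.6f}")

# PF jumps to 0.75 for (0.01, 0.04) although the risks differ by only
# 0.03, while HF stays near abs(ARR) for low P's, commensurable with GF.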
Also note that :
- for commonly (very) low P's, GF and HF behave as (slight) corrections
  of the numerator ARR aka absolute risk reduction ( absolute = not
  relative ) ie of the absolute difference ie the absolute contrast ;
- for uncommonly high P's, GF and HF tend to behave more like relative
  risks RR(:) , rather than like absolute risks, ie they are no more
  (slight) corrections of ARR .
This last point was communicated to me by professor Peter White of
Cardiff University, UK.
.-
+HF vs PF served as a palatable slow food :
The most important temperature for humankind is when ice starts to melt,
which is zero degrees on the scale of Celsius. Here I use the temperature
measuring analogy in a serious attempt to get my message across not only
to the math-champs but also to the math-chimps. I will keep it simple,
but not simplistic. A simple example shows that GF and PF operate on
different scales. If we are used to measuring non-freezing temperatures
in Celsius, then we should not switch to Fahrenheit or to Kelvin when
water starts to freeze.
A proven way to expose problems is stress testing. With formulae this
means to plug in (somewhat) extreme numbers. In our case we shall use low
probabilities for an effect y given an exposure x. In fact in
epidemiology the probabilities of diseases in a population are usually
very low, fortunately for us. So in fact my numbers will not be extreme
at all, just far enough from values like 0.5, 0.3 or 0.2 which
camouflage, rather than reveal, the true nature and behavior of
probabilistic formulae.
The difference ARR = P(y|x) - P(y|~x) between LOW conditional
probabilities P(.|.) is always near ZERO, indicating the near
independence of x and y (ie of y on x as well as of x on y). Being near
0.0 is for our explanatory purposes quite analogical to being near zero
degrees Celsius, when ice starts to melt or water starts to freeze.
Let y be an effect we are concerned about, and x be an assumed or alleged
cause (eg an exposure to a treatment or to an agent), and let
P(y| x) = 0.04 = probability of an effect y if x is present, and
P(y|~x) = 0.02 = probability of an effect y if x is absent ;
ARR = P(y|x) - P(y|~x) = absolute risk reduction in epidemiology
    = 0.04 - 0.02 = 0.02   ("absolute" as opposed to "relative")
GF  = [P(y|x) - P(y|~x)]/[1 - P(y|~x)] = generative factor for y from x
    = [ 0.04 - 0.02 ]/[1 - 0.02 ] = 0.0204 =. ARR ,  =. means approx.
GF is in fact ARR corrected for the "base effect" (due to a basic
treatment, background probability, or base chance by guessing, like eg
1/m in a multiple choice). The KEY IDEA on which such formulae are built
is a DOUBLE DISCOUNTING of the base chance (base rate, trivial chance,
placebo chance). In fact the GF formula can be viewed as employing
corrections of P(y|x) up to the 2nd order. The 1st-order correction
decreases P(y|x) by P(y|~x) in the numerator ARR, and the 2nd-order finer
tuning in the denominator increases GF. This has its analogy in the
Taylor series with - and + terms.
For more & deeper causal insights see Insight0 up to Insight8 below.
Other formulae with the same structure as GF are below in Insight9 .
Now comes the crunch. Often it happens that P(y|x) < P(y|~x) ie ARR < 0.
Let's have such a case, eg: P(y|x) = 0.02 and P(y|~x) = 0.04 ; then some
may feel the need to use another formula with -ARR in its numerator, if
we want to "stay positive".
Cheng used the following "preventive power" (earlier others have called
it "prevented fraction" or "preventable fraction", hence my PF ) :
PF = [ P(y|~x) - P(y|x) ]/P(y|~x) = 1 - P(y|x)/P(y|~x) = 1 - RR(y:x)
RR(y:x) is called "relative risk" aka "risk ratio" in epidemiology.
RR(y:x) is also called the "Bayesian likelihood ratio" or "Bayes factor".
PF belongs to the category of "proportionate increments". Read, in the
wise and fine book { Feinstein 2002 }, section 10.6.3 on "Possible
deception in proportionate increments", and his sect. 17.6.5 where the
prevented fraction PF is clearly categorized as a kind of relative risk.
To detect and disallow misinterpretation, misinformation, disinformation
and deception, as often practiced by many if not most economic and
"egonomic" interests, we as customers and patients must be aware of the
weak points of their formulae. "Know their & thy formulae and thou shalt
suffer no disgrace!" is my fresh paraphrase of the greatest strategist
ever, Sun-Tzu, 2500 b.PC. We need evidence-based medicine, not
evidence-biased medicine-men & women.
Anyway, here we get PF = [ 0.04 - 0.02 ]/0.04 = 0.5 , which is
UNREASONABLY larger than 0.0204. Unreasonably, since by analogy: if the
temperature drops from 1 degree Celsius above zero (ice starts to melt at
0) to -1 degree Celsius, then we should not switch to another scale on
which abs(-1) = 1 will become 272 degrees Kelvin. Sufficient CONTINUITY
AND PROPORTIONALITY OF SCALE must be maintained in order to keep the
interpretation of numbers opeRationally meaningful.
The reason for such a large DISCREPANCY (0.02 vs 0.5) is Cheng's
denominator, since the numerators of both GF and PF have the same
absolute value abs(ARR). We can see that GF is essentially a (slightly)
up-corrected absolute risk reduction ARR, while PF = 1 - RR(y:x) is a
linear function of a relative risk RR(y:x) < 1. Obviously, it is NOT WISE
to use an absolute measure for a causal generation, and a relative one
for a causal prevention. We also see that the smaller the conditional
probabilities, the smaller the difference between Sheps/Cheng's GF and
the absolute risk reduction ARR in GF's numerator, ie that GF is just a
subtle and too often a negligible 2nd-order correction of ARR. To make it
crystal clear to those who see math as a 4-letter word:
0.04 - 0.02 = 0.02  <  0.4 - 0.2 = 0.2   for ARR , while the relative risks
0.02 / 0.04 = 0.5   =  0.2 / 0.4 = 0.5   for RR ,
so we see that RR and 1-RR are LOSING INFORMATION on the magnitude of the
numbers; see Insight5 .
-.-
+Hajek's hindrance factor HF is commensurable with GF :
In order to fix such an INCOMPATIBILITY of scaling, I claim that the
proper preventive factor, to be used instead of PF (and GF) when ARR < 0,
is
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] = generative factor for y if ~x
   = [ 0.04 - 0.02 ]/[ 1 - 0.02 ] = 0.0204 , my "hindrance factor" .
HF is obtained from GF simply by swapping (the roles of) x and ~x. Hence
HF is the generative factor for y if ~x ie if x is absent. Since too many
formulae have existed for a long time with too many similarly sounding
names which are too suggestive, it is wise to avoid yet another similar
name. In order not to add to the verbal confusion I call my HF a
"hindrance factor".
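Because PF = 1 - RR is a pure ratio measure, it cannot see the magnitude
of the risks. A short check of mine makes the contrast with HF explicit:

pf = lambda p1, p2: (p2 - p1) / p2          # Cheng's PF = 1 - p1/p2
hf = lambda p1, p2: (p2 - p1) / (1 - p1)    # Hajek's HF

for scale in (1.0, 0.1, 0.01, 0.001):
    p1, p2 = 0.2 * scale, 0.4 * scale       # P(y|x) < P(y|~x)
    print(f"{p1:<8.4g} {p2:<8.4g} PF={pf(p1, p2):.3f}  HF={hf(p1, p2):.5f}")

# PF = 0.500 on every line (all zeroes lost), while HF shrinks with the
# risks, staying commensurable with GF and with abs(ARR).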
GF < 0 has to be replaced by a commensurable companion formula: GF >= 0
has the subrange 0..1, while GF <= 0 has the subrange -oo..0 , ie the two
subranges are incommensurable. Therefore, and because of the
incommensurability of PF with GF , I designed and have here justified my
hindrance factor HF as a replacement for Cheng's PF .
In medical practice ARR < 0 is usually near zero. That must be the reason
why the late professor of medicine & epidemiology at Yale University,
Alvan Feinstein, who also had a master's degree in mathematics, had no
difficulty using Mindel Sheps' "relative difference" (our GF ) when
ARR < 0 made GF < 0. In his fine and alas also his final book on the
Principles of Medical Statistics, on the last line of page 174, he used
Sheps' "relative difference" (0.02 - 0.04)/(1 - 0.04) = -0.0208 ; note
that abs(-0.0208) = 0.0208 differs only slightly from the 0.0204 of my HF.
Anyway, the Yale professor Feinstein did not feel any need for a second
formula, probably because his realistic epidemiological ARR's were near 0.
If you read on less casually, you'll be rewarded with fresh deeper causal
insights.
-.-
+Insights into semantics and forms of causal formulae :
.-
+Palatable probabilistic logic derived from counting:
Counting "AT LEAST how many of BOTH PROPERTIES" :
Itanic carried 1000 passengers, 300 were smokers, 900 were drinkers, then
at least 300 + 900 - 1000 = 200 were both ( 200 is the minimal
intersection, or unavoidable joint, or necessary overlap ).
Itanic carried 1000 passengers, 400 women, and 700 have drowned, then
at least 400 + 700 - 1000 = 100 women have drowned, and
at least 600 + 700 - 1000 = 300 others have drowned.
The necessary overlap > 0 can arise only for Nx + Ny > Nall=1000.
The algorithm is:
If (0 < Nx < Nall) & (0 < Ny < Nall) & (Nx + Ny > Nall)
then Njoint >= Nx + Ny - Nall { my special triangle inequality quantified }
where Njoint can be visualized as an overlap obtained by folding two sides
Nx and Ny long of a triangle with the horizontal side Nall long.
Probability is isomorphic (= strongly analogical) with set theory which is
isomorphic with Boolean logic. Therefore "Our fortress is our logic" ( =
my paraphrase of Stan Ulam's "our fortress is our mathematics" ).
Isomorphism is transitive (if A resembles B, and B resembles C, then A
resembles C). The key formulae derive from two (non)overlapping pancakes
or pizzas with areas Px , Py ; P(x,y) == P(x & y) == Pxy = the overlapping
area :
P(x U y) + Pxy = Px + Py ; U is union; Pxy = joint , intersect , overlap
N(x U y) + Nxy = Nx + Ny for Numbers of elements in sets x, y.
Hence P(x or y) + Pxy = Px + Py ie P(x or y) = Px + Py - Pxy for
probabilities, from which the key inequalities obtain (draw your pizzas
aka Venn ) :
!! Pxy <= minimum[ Px , Py ] <= P(x U y) <= Px + Py due to P(.) >= 0 ,
   Max[ Px, Py ] <= P(x or y) <= min[ Px + Py , 1 ]
   Max[ 0, Px +Py -1 ] <= Pxy <= min[ Px , Py ]
where on the lhs the Bonferroni inequality becomes nontrivial only if
Px + Py > 1, in which case Pxy > 0, ie there MUST be a certain minimal
joint Pxy >= lhs. Hence you CANNOT take any 3 numbers such that
0 <= Pxy <= min[ Px, Py ] <= 1 and use them as probabilities; it must
also hold that Pxy >= Max[ 0, Px + Py - 1 ].
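The counting rule and the probabilistic bounds are one and the same thing;
a minimal Python sketch of these Bonferroni-style bounds (the counts are
the Itanic numbers above):

  def joint_bounds(nx, ny, nall):
      # necessary and maximal overlap for counts (or P's if nall = 1)
      lo = max(0, nx + ny - nall)
      hi = min(nx, ny)
      return lo, hi

  print(joint_bounds(300, 900, 1000))  # (200, 300): at least 200 were both
  print(joint_bounds(400, 700, 1000))  # (100, 400): at least 100 women drowned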
In plain English: buy a square serving tray 1x1 meter called 1verse sold
by Universe. Take two pieces of pizza cut into squares with areas Px and
Py, such that each fits into the rather deep tray, and put them into the
tray. If the area Px + Py > 1 square meter, then even if you do your best
to shift the pizzas as far apart as possible within the tray, they have to
overlap by the joint area of AT LEAST Pxy >= Px + Py - 1 which is the
minimal overlap. Forget about underlaps ie negative overlaps. The
Bonferroni inequality is just a truncated version of the
Inclusion-Exclusion principle for probabilities.
DeMorgan's laws aka De Morgan rules in my probabilistic form are:
P(~x,~y) = P(~(x or y)) = 1 - (Px+Py-Pxy) = P(x Nor y) 1st rule or law
P(~x or ~y) = P(~(x,y)) = 1 - Pxy = P(x Nand y) 2nd rule or law
Eg P(x implies y) = P(~(x,~y)) = P(~x or y) via DeMorgan's law , so eg the
common-sensical "No smoke without fire" cannot be meant as
(~s & ~f) == ~(s or f) which is absurd. "No s without f" really means
~(s & ~f) which makes sense : "there can't be (smoke & no fire)", but its
equivalent (~s or f) makes NO SENSE : "there is either no smoke, or there
is fire, or BOTH ~s and f". This shows the WEAKness of (x implies y) ie of
(x entails y) as an indicator of causation. See P5 below. Find DeMorgan in
{ Hajek www } for more of my DeMorganish rules on P's. Just in case you
have difficulties with DeMorgan's rules, I recommend drawing two
overlapping pizzas, or pancakes, aka a Venn diagram.
A verbal formulation of the 1st DeMorgan's law (published 1847) is
enlightening: "It should be noted that the contradictory opposite of a
disjunctive proposition is a conjunctive proposition composed of the
disjunctive contradictories of the parts of the disjunctive proposition."
This is a formulation from the book Summa Logicae, of the year 1323, by
the well-known William of Ockham aka Occam aka Dr. Invincibilis, famous
for his Principle of Parsimony aka Occam's razor : "Pluralitas non est
ponenda sine necessitate" ( plurality is not to be posited without
necessity ), in turbo-talk known as KISS .
Occam's razor has a necessary amendment "ceteris paribus" = "all else
being equal" ( conditions , assumptions , properties , desiderata ),
wherein lies the catch-22, since all other aspects are never really equal
(more below). DeMorgan's rule has been traced by Lukasiewicz back to
Petrus Hispanus (1205-1277). If those guys could formulate it, you should
understand it.
"AT LEAST ONE event" ie "ONE OR MORE hits" follows from the 1st DeMorgan
rule below:
P(at least 1 of e's) = P(e1 or e2 .. or ek) = P(~(~e1,~e2, ..,~ek))
 = 1 - P(~e1 & ~e2 & .. & ~ek) in general ;
 = 1 - ([1-Pe1].[1-Pe2]..[1-Pek]) for all Pe's independent ;
 = 1 - [1-Pe]^k for all Pe's equiprobable and independent
 = 1 - exp( k.ln(1-Pe) ) if (large) power raising is not available
 =. 1 - exp(-k.Pe) for very small Pe & large k ; often Pe = h/m , h << m
Such formulae have plenty of applications when computing probabilities of
hit/miss of firing salvos, of finding when searching (for subs, documents,
by hashing), of functioning/failure of (wo)men or machines in estimates of
reliability : serial connections have Pxyz = Px.Py.Pz for independently
failing or functioning x, y, z ; parallel connections have ( via DeMorgan )
P(x or y or z) = P(~(~x,~y,~z)) = 1 - ([1-Px].[1-Py].[1-Pz]) as above.
An advanced application is in my FlashHash on the same site as
{ Hajek www }.
Of the 16 binary functions in Boolean logic, only 2 pairs (of
complementary functions) are neither trivial functions of a single
variable, nor are they symmetric wrt both variables x,y. One such
interesting logical function is the implication (x implies y) aka
(x entails y) in set theory.
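These salvo/reliability formulae are one-liners; a small Python sketch
(the probabilities are made up, only for illustration):

  import math

  def p_at_least_one(pe, k):        # 1 - [1-Pe]^k , independent, equiprobable
      return 1.0 - (1.0 - pe) ** k

  def p_series(ps):                 # Pxyz = Px.Py.Pz : all must function
      out = 1.0
      for p in ps: out *= p
      return out

  def p_parallel(ps):               # 1 - prod(1-P) via DeMorgan
      out = 1.0
      for p in ps: out *= (1.0 - p)
      return 1.0 - out

  print(p_at_least_one(0.01, 100))     # 0.6340
  print(1 - math.exp(-100 * 0.01))     # 0.6321 =. the above for small Pe
  print(p_series([0.9, 0.9, 0.9]))     # 0.729
  print(p_parallel([0.9, 0.9, 0.9]))   # 0.999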
(x implies y) == ~(x,~y) == ~(~y,x) == (~y implies ~x) is UNDESIRABLE for
causation as exposed in P5 in Insight0 . Desirable transitivity holds :
if (x implies y) & (y implies z) then (x implies z) . Repeatedly find
implies and entails . Causation should also be transitive :
if (x causes y) & (y causes z) then (x causes z) ; see P4b , Insight0 .
So we see that the entailment or implication has a desirable TRANSITivity,
but also an UNdesirable property. We cannot have only the good ones :-(
After thousands of years of human thinking, the notion of causation is
still a slippery one. A nice illustration of the psychological and
fundamental difficulties with using probabilities to capture even the
orientation ie DIRECTION of causation is the following equivalence
relation: P(x|y) > P(x|~y) is equivalent to P(y|x) > P(y|~x), similarly
for < , = , "contrary to the prevailing pattern of judgment" of over 80 %
of university students, wrote the late Amos Tversky & Daniel Kahneman in
their chapter "Causal schemas in judgments under uncertainty" { Kahneman
1982, p.122-3 }. They wrote "implies" but it is an "equivalence" ie a
two-way implication. Daniel Kahneman was awarded the Nobel prize
(economics, 2002) for his and Amos Tversky's 30 years of research into
human thinking in general and into the fallacies of probabilistic thinking
in particular. The statistical literacy movement under the leadership of
professor Milo Schield (Augsburg College, Minneapolis, MN) should adopt my
slogan: "(In)dependence is the bridge of the asses for the masses into
stats."
My argument for why we should keep trying to capture causation in a
formula is based on the successes of Boolean logic since 1848, after 2500
years of slippery logic. Be reminded that after George Boole it took
another 90 years until 1937, when the MIT student Claude Shannon showed in
his master's thesis the isomorphism (= strong analogy) between Boolean
logic and switching circuits. Only then could applications like your PC be
developed. Btw, in 1948 Shannon formulated information theory, also
applied inside your PC, CD, DVD, etc. Alas, there is no guarantee that the
successes of logic will be repeated with causal formulae. For now we
should be satisfied with a good INDICATION of a causal tendency .
The key questions are these : "When to use which formula as the best
indicator of a causal relation ?" , and "Which desiderata are more
important than the others, or than 'undesiderata' ?" My questions assume
honesty, not a desire to get inflated ratios which will impress, mislead,
disinform and deceive the patients and customers for the purposes of
dishonest economic gain and "egonomic" advantage.
Vaihinger's fictionalism - A note on useful As-if fictions :
Find again As-if = "Als ob" from the title of a book by the German
philosopher { Vaihinger 1923 } on useful fictions. Eg the product of the
marginal probabilities Px.Py provides a fictional point of reference for
the dependence of events x and y. Fictional, because Pxy = Px.Py occurs
rarely, but we often wish to contrast Pxy vs Px.Py in Pxy - Px.Py or in
log(Pxy/(Px.Py)) in Shannon's mutual information. I realized that the
Archimedean point of reference { Arendt 1959 } may be a special case of
useful as-if-fictionalism. Maximal possible values also serve as As-If
values, eg for many normalizations, like my interpretation of GF and HF
in Insight2 . The concept of a random event or of a random variable is
also an as-if fiction, since usually the randomness is in the eye of the
OBSERVER .
One of my slogans is: "One man's determinism is another woman's
randomness." "Ceteris paribus" should also be understood as an as-if rule,
simply because of the fictitious "all else being equal".
.-
+Causal notation is tough, but our math-beef shouldn't be
A colorful hotchpotch ( a "bunter Wirrwarr" ) of notations exists. In my
References to Popper and to Kemeny I comment on some of them. Due to the
invertibility of conditionings | via the basic Bayes rule, it would be
naive to think that eg a formula expressed in terms of P(y|x) = Pxy/Px
should be M(y:x) where y:x is a mnemonic for division, say. A more
meaningful alternative notation M(x:y) standing for (x entails y) could be
better, but not all M(:) contain a clear entailment. Also our language
allows for "x implies y" or "y entailed by x". So for now I shall not
change all those countless formulae in { Hajek www } because of the risk
of errors and the effort involved.
Eg here Insight6 contains the relative risk aka risk ratio RR(:) which is
much used in evidence-based medicine ( EBM ) :
RR(y:x) = risk ratio = Bayes factor = Bayesian likelihood ratio
 = Odds(x|y)/Odds(x) , where Odds(z) = P(z)/[ 1 - P(z) ]
 = P(y|x)/P(y|~x) , x = cause, exposure, test result; y = effect
 = [ Pxy/(Py - Pxy) ].(1-Px)/Px are my decompositions
 = [ Pxy/P(y,~x) ].(1-Px)/Px , where 1/P(y,~x) measures (y entails x)
 = [ P(x|y)/P(~x|y) ].(1-Px)/Px my semi-inverted form
 = Pxy.(y implies x).SurpriseBy(x) , since:
 . Pxy . 1/P(y,~x) INCreases with (y entails x) ie as (y implies x) grows ;
 . (1-Px)/Px DECreases with increasing Px , as a SURPRISE should ;
   (1-Px) and 1/Px and (1-Px)/Px are decreasing functions of Px , hence
   they are small for frequent (= unsurprising) x ; SIC ;
   1/P(y,~x) = 1/(Py - Pyx) measures how much (y implies x) ;
see P5 for more on "conviction" by Google CEO { Brin 1997 } :
Conv(y:x) = Px.P(~y)/P(x,~y) { as by Brin , my forms next }
 = (Px - Px.Py)/(Px - Pxy) where 1/P(x,~y) measures how much (x entails y),
 = ( 1 - Py)/( 1 - P(y|x) ) = P(~y)/P(~y|x) = Px/P(x|~y) ie (x implies y),
UNLIKE the (y implies x) inside RR(y:x) = P(y|x)/P(y|~x). We see that
RR(y:x), and hence also F(y:x), contains (y implies x). However RR(y:x) is
known as a measure of increased risk of y due to x ie of y from x.
Clearly, genuine difficulties with designing a consistent notation for
causal M(:)'s exist exactly because causation is still a slippery concept
despite our daily "because", "due to", and "if-then". Therefore you must
not rely on even my rather uniform notations like M(:) . What matters is
INSIDE the right hand side of M(:) . Better understanding, use and
interpretation of results from causal M(:) is based on my rule of
interpretation turned into a desired property P4b .
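My decompositions above are easily verified numerically; a Python sketch
with one arbitrary, made-up, positively dependent triple (Px, Py, Pxy):

  px, py, pxy = 0.3, 0.5, 0.2                      # Pxy > Px.Py = 0.15

  rr     = (pxy/px) / ((py - pxy)/(1 - px))        # RR(y:x) = P(y|x)/P(y|~x)
  rr_d   = (pxy/(py - pxy)) * ((1 - px)/px)        # my decomposition
  conv   = px*(1 - py) / (px - pxy)                # Brin's Conv(y:x)
  conv_d = (1 - py) / (1 - pxy/px)                 # = (1-Py)/(1-P(y|x))
  print(rr, rr_d)                                  # both 1.5555...
  print(conv, conv_d)                              # both 1.5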
.-
Insight0 : Know what you want : new causal desiderata P2 P4 P4b P5 P8
It is the requirements aka desiderata & properties that matter most.
Eg measures of similarity, compatibility, MUTUAL dependence, and of
equivalence should be : symmetric ie S(y:x) = S(x:y) , and transitive
(below). A (probabilistic) "distance measure" should be a "metric" ie be :
non-negative, symmetric and satisfy the triangle inequality. We need
neither a metric nor an equivalence S(x:y) = S(y:x), since a causation
measure must be meaningfully asymmetric wrt x, y. M(y:x) <> M(x:y) is the
1st desideratum. My 7 "Construction principles for good association
measures", all findable as P1 to P7 in { Hajek www }, will not be
reproduced here (find P2 here). I am SHARPening P4 into P4b , illustrating
P4 & P5 , and adding P8 .
P4b: Often it is only the ORDERING in general and RANKing in particular
that matters most in many applications. I discovered that the RANKING of
several 2x2 contingency tables by the values of a measure M(:) applied to
such tables will differ between all pairs of different M(:)'s, except
between the measures Conv(y:x) and PAR(y:x) , and of course between
Conv(x:y) and PAR(x:y) , which produce the same RANKing. Remarkably, both
formulae are asymmetric wrt y and x . They are the "conviction" by
{ Brin 1997 } which is UNacceptable as a measure of causation since
Conv(y:x) = Conv(~x:~y) , and the following two based on the same idea as
Sheps' "relative difference" GF :
PAR(y:x) = ( P(y|x) - Py )/( 1 - Py ) until now a nameless one, and
PAR(x:y) = ( Px.[ RR(y:x) -1 ])/( 1 + Px.[ RR(y:x) -1 ]) the usual form;
!        = ( P(x|y) - Px )/( 1 - Px ) is the Sheps-like form
 = population attributable risk = Levin's attributable risk ;
more on PAR(x:y) is at the end of Insight7 .
For both Pxy & Py fixed , PAR(x:y) DECreases with INCreasing Px :
let P(x1|y) = P(x2|y) = K , then:
! PAR(x1:y) < PAR(x2:y) for Px1 > Px2 ;
proof: (K-Px1)/(1-Px1) < (K-Px2)/(1-Px2) ; cross-multiply, cancel, and use
(1-K) >= 0 : (1-K).Px1 > (1-K).Px2 which holds for Px1 > Px2 , qed.
Hence PAR(x:y) obeys SIC , which could also be deduced from the fact that
K < 1, hence a DECreasing Px will make the numerator INCrease ahead of the
denominator. But we have a proof too.
My new partial order condition for causation measures M(:) :
The fundamental ASYMMETRY of x=cause & effect=y M(y:x) <> M(x:y) from P4
below is now opeRationalized into a more specific & meaningful property
P4b :
! Property P4b for M(y:x) measuring how strongly x causes y (= semantics):
If Px < Py then M(y:x) > M(x:y) else
if Px > Py then M(y:x) < M(x:y) else M(y:x) = M(x:y) ;
for example :
if Px > Py then P(y|x) < P(x|y) = Pxy/Py = crude measure of (y entails x)
P4b is desirable since the more common x is, the less likely is x a
SPECIFIC cause of y. Similarly for y as a tentative cause of x.
P4b introduces a PARTIAL ORDERing relation for causation measures M(:) .
Books on discrete mathematics tell us that a partial ordering must be
REFLEXive , ANTISYMMetric and TRANSITive. The relations = , <= , and the
set inclusion <= (in the Pascal programming language too, also for
Booleans) are partial orderings. I have never seen any attempt to express
causal relations like this. What surely must be new is the SPECIFICITY of
my P4b in relating Px , Py , Pz with M(:)'s.
Reflexivity : (x causes x) is ok, though not very explanatory ;
Antisymmetry : if (x causes y) & (y causes x) then (x == y) , equivalent ;
Asymmetry : if (x causes y) then ~(y causes x)
Transitivity : if (x causes y) & (y causes z) then (x causes z)
"causes" is isomorphic with "included in", "subset of", "entails", and
also with "less or equal" <= . Since Px and Py are numbers, my desideratum
P4b above imposes a specific partial order on measures of causation M(:)
so that my new weak TRANSITivity rule among causation measures M(:) will
be:
!! If ( Px <= Py ) & ( Py <= Pz )
!! then M(y:x) >= M(x:y) & M(z:y) >= M(y:z) & M(z:x) >= M(x:z)
where any = goes only with all other = . Draw 3 concentric pizzas or
pancakes or Venn diagrams to visualize my desired weak transitivity rule
for Px <= Py <= Pz . Insight8 contains my de-conditioned C-form2 and
F-form2 which are needed for insight into F(:) and C(:) .
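A minimal Python check of P4b for Sheps' GF, with one made-up positively
dependent triple where Px < Py :

  def gf_yx(px, py, pxy):
      return (pxy - px*py) / (px * (1 - (px + py - pxy)))
  def gf_xy(px, py, pxy):
      return (pxy - px*py) / (py * (1 - (px + py - pxy)))

  px, py, pxy = 0.2, 0.6, 0.15          # Px < Py , Pxy > Px.Py
  print(gf_yx(px, py, pxy) > gf_xy(px, py, pxy))   # True , as P4b demands
  print(gf_yx(px, py, pxy) / gf_xy(px, py, pxy))   # 3.0 = Py/Px , see Insight1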
I obtained for positive dependence relations Pxy > Px.Py the following
orderings ( the < are in fact <= here ) :
if Px > Py swap < and > in what follows ( the > are in fact >= here ) :
ConV(y:x) == V(y:x) > V(x:y) == ConV(x:y) = Py.P(~x)/P(y,~x) :
(Px < Py) == V(y:x) > V(x:y) = [1 - Px]/[ 1 - P(x|y)] ie (y entails x)
(Px < Py) == GF(y:x) > GF(x:y) = [P(x|y) -P(x|~y)]/[ 1 - P(x|~y) ]
(Px < Py) == Z(y:x) > Z(x:y) = [P(x|y) -Px ]/[ 1 - Px ] = PAR(x:y)
(Px < Py) == K(y:x) > K(x:y) = P(x|y) -Px <= 1-Px SIC { Hajek K(y:x) }
(Px < Py) == P(y|x) > P(x|y) = Pxy/Py = simplistic crude (y implies x)
(Px < Py) == M(y:x) > M(x:y) means that M(x:y) measures (y SufFor x) vs :
(Px < Py) == m(y:x) < m(x:y) means that m(x:y) measures (x SufFor y)
(Px < Py) == RR(y:x) < RR(x:y) = [Pxy/(Px -Pxy)].(1-Py)/Py , (x implies y)
(Px < Py) == W(y:x) < W(x:y) = ln(RR(x:y)) "weight of evidence" I.J. Good
(Px < Py) == F(y:x) < F(x:y) = [ RR(x:y) -1]/[ RR(x:y) +1 ] , J. Kemeny
             = [Pxy - Px.Py]/[ Pxy + Px.Py - 2.Pxy.Py ]
(Px < Py) == C(y:x) < C(x:y) = [Pxy - Px.Py]/[ Pxy + Px.Py - Pxy.Py ]
! 1-Px >= C(y:x) , C(x:y) <= 1-Py ie obeys the SIC of Karl Popper
Find F(e:h) in Insight8 for why { Kemeny 1952 } has not swapped e with h
(ie y with x) in his "factual support" of a hypothesis h by an evidence e.
if Px < Py then Max[ RR(y:x) , RR(x:y) ] < OR(x,y) ; OR(:) is odds ratio ;
if Px > Py then min[ RR(y:x) , RR(x:y) ] < OR(x,y) ; OR(:) is symmetric ;
more relations between RR(:) and OR(:) are near the end of Insight1 .
Let's analyze the frequently used and seemingly simple relative risk :
First recall the equivalences
P(x|y) > P(x|~y) == P(y|x) > P(y|~x) , similarly for < , = ;
P(x|y) / P(x|~y) > 1 == P(y|x) / P(y|~x) > 1 for positive dependence
Hence: RR(x:y) > 1 == RR(y:x) > 1. Here is my insight inside RR(:) :
RR(y:x) = P(y|x) / P(y|~x) LOSES INFORMATION on magnitudes ( find LOSING
INFORMATION above ) & does NOT satisfy P4b , because :
 = [Pxy/(Py - Pxy)].(1-Px)/Px = Pxy.(y entails x).DecreasingFun(Px)
!! = Pxy.(y implies x).SurpriseBy(Px) where implies works AGAINST the
!  SurpriseBy(Px) because:
!  - SurpriseBy(Px) directly DECreases with increasing Px , while:
!  + (y implies x) INdirectly INCreases with increasing Px . Nevertheless
!! (y implies x) OVERRULES because Pxy/(Py - Pxy) is more SENSITIVE .
!  So RR(y:x) is a MIXED measure of how much (y entails x) DISCOUNTED for
!  the LACK of SURPRISE or LACK of INTERESTINGNESS in a larger Px for a
   frequent or commonly occurring event x . Such a lack fits the SIC
   principle.
This all carries over to :
W(y:x) = ln(RR(y:x)) = "weight of evidence" promoted by I.J. Good, and
F(y:x) = [ RR(y:x) -1 ]/[ RR(y:x) +1 ] , hence it does NOT satisfy P4b ;
C(y:x) is isomorphic with F(y:x), so no wonder it does NOT satisfy P4b .
! My dissection (or deconstruction) of RR(:) provides us with nontrivial
insights into RR(:) and its consistent behavior wrt Px < Py . It is
! amazing that ARR(:) does NOT behave consistently under Px < Py or
Px > Py :
ARR(y:x) = P(y|x) - P(y|~x) does NOT satisfy ANY consistent ordering by
either Px < Py , or consistently by Px > Py .
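That ARR(:) obeys no consistent ordering shows in two concrete cases
(Python; the triples are made up). Via the beta-slope view in Insight2
below, ARR(y:x)/ARR(x:y) = Py.(1-Py)/(Px.(1-Px)) , so the ratio can fall
on either side of 1 even with Px < Py fixed:

  def arr_yx(px, py, pxy): return (pxy - px*py) / (px*(1 - px))
  def arr_xy(px, py, pxy): return (pxy - px*py) / (py*(1 - py))

  for px, py, pxy in [(0.2, 0.4, 0.12), (0.2, 0.9, 0.19)]:
      print(px < py, arr_yx(px, py, pxy) > arr_xy(px, py, pxy))
  # True True  : here ARR(y:x) > ARR(x:y)
  # True False : here ARR(y:x) < ARR(x:y) ; no consistent ordering wrt Px < Py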
RR(y:x) = fun(y implies x) UNLIKE the (x implies y) in "conviction" by
Brin : Conv(y:x) is a PURE entailment measure of (x entails y) , UNLIKE
RR(y:x) !
Conv(y:x) = "the less (x,~y), the more (x implies y)" a PURE implication :
Conv(y:x) = Px.P(~y)/P(x,~y) = "the more (x implies y), the larger
Conv(y:x)".
Conv(x:y) = Py.P(~x)/P(y,~x) = "the more (y implies x), the larger
Conv(x:y)".
 = (Py - Px.Py)/(Py - Pxy) , where 1/P(y,~x) is an implementation of
P(~(y,~x)) allowing for the numerator Py.P(~x) which provides for the
FIXED semantically pivotal point = 1 for independent x, y. The straight
form P(~(y,~x)) = 1 - P(y,~x) would NOT allow for the creation of such a
pivot. P(~(y,~x)) = "P of no(y w/o x)" ie (y entails x).
All these M(:)'s behaved consistently (as specified), only ARR(:) did
NOT !!
!! It would be a cheap trick to swap the ?(y:x) with ?(x:y) and thus get
all if Px < Py then ?(y:x) > ?(x:y) ; find +Causal notation is tough . The
!! point is that with P(y|x) > P(x|y) and with GF(y:x) > GF(x:y) we know
better what the Px < Py does to a measure than we can see in some other
formulae. Therefore I said above "not rely on notation M(:)" (find rely ),
"What matters is inside the formula", and that has to be verified as I did
here. Don't confuse the meal with the menu M(:) .
Lonely P(.|.)'s are not functions of both Px and Py, hence they are only
crude measures of entailment, despite 0 <= Pxy <= minimum[ Px, Py ].
That a P(.) or P(.|.) cannot serve as a strength of evidence on y from x
ie from x on y, or as a degree of confirmation or as a corroboration is
shown in great detail in { Popper , appendix IX , pp.387-398 }, and is
discussed by Sir Karl Popper vs Leblanc in The British Journal for the
Philosophy of Science, vol.X, 1960, pp.313-318.
P4: was just strengthened into a more specific P4b, but here are some
meaningful illustrations of the asymmetry required by P4 and P4b:
If Px = Py then M(y:x) = M(x:y) is required and easily met; else
M(y:x) <> M(x:y) a genuine asymmetry MUST hold for a causation measure
ie <> must hold after any formal conversion of M's. Eg:
E(y:x) = [ P(y|x) - Py ]/[ P(y|x) + Py ] { Popper p.400 }
 = E(x,y) = [ Pxy - Px.Py ]/[ Pxy + Px.Py ] symmetry revealed
 = E(x:y) = [ P(x|y) - Px ]/[ P(x|y) + Px ] { my inversion }
Due to its symmetry E(:) is ok as a measure of dependence, but K.O. as a
causation measure M(:) . My view of its general structure is
[a-b]/[a+b] = ([a-b]/2)/([a+b]/2) = average deviation from the average
P5: This shows that the operation implication cannot capture causation:
M(y:x) = M(~x:~y) is generally UNDESIRABLE since it is isomorphic with the
logical implication (or entailment ) for which the CONTRAPOSITIVE logical
property holds : (x implies y) == ~(x,~y) == ~(~y,x) == (~y implies ~x)
as a Venn diagram will show (draw 3 pizzas: Px in Py in a PizzaVerse).
Let's test the contrapositivity with folks' wisdom: "Where there is smoke,
there is fire.", which I formalize as
( S implies F ) ie If there is smoke then there is fire ;
( ~F implies ~S ) ie If there is no fire then there is no smoke ; but
"Smoke causes fire" makes NO SENSE
"No fire causes no smoke" makes sense .
Apparently my formalization was wrong, so I change it into
( F implies S ) ie If there is fire then there is smoke ;
( ~S implies ~F ) ie If there is no smoke then there is no fire ; but
"Fire causes smoke" makes sense
"No smoke causes no fire" is NONSENSE .
It makes sense in logic , eg:
( R implies S ) as If it rains then it is slippery
( ~S implies ~R ) as If it isn't slippery then it isn't raining , but it
makes no sense for causation :
(R causes S) makes sense , but (~S causes ~R) is NONSENSE .
Contrapositivity is an UNDESIRABLE property for a causal tendency measure
since eg "Rain causes us to wear a raincoat" is ok, but "Not wearing a
raincoat causes no rain" makes NO SENSE, as the later Nobel laureate
Herbert Simon pointed out { Simon 1957, pp.50-51 }.
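The contrapositivity of the implication is a 4-row truth-table fact; a
tiny Python check:

  from itertools import product

  def implies(a, b):                 # material implication ~(a & ~b)
      return (not a) or b

  for x, y in product([False, True], repeat=2):
      assert implies(x, y) == implies(not y, not x)
  print("(x implies y) == (~y implies ~x) in all 4 cases")
  # fine for logic, UNDESIRABLE for causation, as argued above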
Therefore the "conviction" measure in { Brin 1997 }, co-authored by Sergey Brin, the co-founder and CEO of Google, is unsuitable to measure causation because conviction is based purely on probabilistic implication : Conv(y:x) = Px.P(~y)/P(x,~y) as by Brin ; my forms next: = (Px - Px.Py)/(Px - Pxy) = how much (x entails y) = Px/P(x|~y) = P(~y)/P(~y|x) = (1-Py)/[ 1-P(y|x) ] = Conv(~x:~y) is UNDESIRABLE , as pointed out in { Hajek www } Conv(y:x) = 0 if x,y are independent (designed into the numerator); Conv(y:x) = oo = infinite if x implies y ie if P(x,~y)=0 == if true is ~(x,~y) , where ~(.) obtains from the / in Brin's form. GF(y:x) <> GF(~x:~y) , GF(x:y) <> GF(~y:~x) are OK for causation . F(y:x) <> F(~x:~y) is desirable ; { Kemeny 1952 } has designed F to hold: F(y:x) = -F( y:~x) , F(~y:x) = -F(~y:~x) , worked out in { Hajek www } F(x:y) = -F( x:~y) , F(~x:y) = -F(~x:~y) , worked out in { Hajek www } which is no wonder since F(:) = ( RR(:) -1)/( RR(:) +1) which rescales RR(:)'s range of [0..1..oo] to F(:)'s [-1..0..1] , and since it holds : RR(y:x) <> RR(~x:~y) is a desirable property of RR(:) RR(y:x) = 1/RR( y:~x) , RR(~y:x) = 1/RR(~y:~x) RR(x:y) = 1/RR( x:~y) , RR(~x:y) = 1/RR(~x:~y) My paradox of an INDependent IMPlication : In the extreme case of Py=1 when P(y|x)=1 ie x implies y, as well as in the extreme case of Px=1 when P(x|y)=1 ie y implies x, in these !!! extreme cases x,y are also independent at the same time. Find my !!! IndependentImplication paradox in Insight8 . P8: My principle of CONTINUITY and COMMENSURABILITY of results : SMALL changes in input P's must NOT lead to LARGE changes in the output ie in the result from a SINGLE causal tendency measure M(:) . This kind of continuity must also hold between a PAIR of measures each of which covers only a part of the whole range of all possible inputs. If a pair of such formulae is used to cover the whole range of causation values then the results must be COMMENSURABLE in general, hence also near the point where we switch from one formula to the other one. Typically a switch-over point will be INDEPENDENCE ie ARR=0 , around which cannot be any meaning- ful causation (degenerated cases shows my paradox IndependentImplication ), so near ARR=0 a measure of causation M(:) must measure (in)dependence ONLY, and DEPENDENCE IS SYMMETRICAL wrt x, y. With increasing dependence the ASYMMETRY of values should increase. For an example find P8: far above. .- Insight1 : Understanding GFactors, plus my MaxiMin heuristic The word "power" has too many connotations, and is too trendy, so it should better be avoided, and replaced by the term "factor", for which there exists a genuine semantic justification : P(y|x) = P(y|~x) + P(~y|~x).GF , proof : = P(y|~x) + [1 - P(y|~x)].GF = P(y|~x) + [1 - P(y|~x)].[ P(y|x) - P(y|~x) ]/[1 - P(y|~x)] = P(y|~x) + P(y|x) - P(y|~x) = P(y|x) qed. which shows how to formally construct GF-like factors, eg again for P(y|x) but this time wrt a basic exposure or treatment b we get : GF(y:x:b) = [ P(y|x) - P(y|b) ]/[ 1 - P(y|b) ] which illustrates my slogan "A non-event is an event is an event" with apologies to Gertrude Stein who spoke similarly about a rose, tho not about a non-rose. Formalities: - from P(y|x) <= 1 folows Abs(GF) <= 1 and similarly for other such factors - if P(y|x) = 1 ie Pxy=Px then GF = 1 and v.v. (an equivalence ) - if P(y|~x) = 0 ie P(y,~x)=0 then GF = P(y|x) Semantics: Factors should carry meanings, not just be formally correct. 
P(y|~x) = base rate of y-ers even though they are ~x-ers , eg non-smokers;
were all (~y,~x)-ers to become x-ers , some of them would become
(y,x)-ers .
P(~y,~x) = all the remaining ~x-ers still free to become y after x;
P(~y|~x) = 1 - P(y|~x) = proportion of the remaining ~x-ers available for
y WERE they to become x-ers; this shows the WHAT-IF ie the COUNTERFACTUAL
nature of GF ; or:
"Raising the base P(y|~x) by a fraction GF of P(~y|~x) yields P(y|x)"
! ie: P(y|x) = P(y|~x) + GF*P(~y|~x) = P(y|~x) + GF*[ 1 - P(y|~x) ]
GF = effectivity factor of x to make/generate/cause y from those who have
NOT yet become y due to causes OTHER than x.
The interested reader is advised to study { Sheps 1959 } or to read the
much easier { Fleiss 2003, pp.122-4, optionally 133, 151-2, 156 }.
Clearly "factor" is the proper word for a multiplicative term like GF in
the just shown semantically rich formula.
One piece of good news is that GF ie GF(y:x) <> GF(~x:~y). To find out why
this is good news, see Insight0 and do find again UNDESIRABLE in
{ Hajek www }. The bad news is that honest scientists cannot hide the
dilemma of which factor (or other probabilistic formula) to use for
causation in general and how to FORMALLY even assign the roles to x and y
in particular. Namings like eg sufficiency and necessity may help, but not
much { Hajek }. The dilemma is
GF(y:x) =                             versus GF(x:y) =
 [P(y|x) - P(y|~x)]/[1 - P(y|~x)]     vs [P(x|y) - P(x|~y)]/[1 - P(x|~y)]
 [Pxy -Px.Py]/[Px.(1 -(Px+Py-Pxy))]   vs [Pxy -Px.Py]/[Py.(1 -(Px+Py-Pxy))]
 [ P(y|x) - Py ]/P(~x,~y) = GF(y:x)   vs [ P(x|y) - Px ]/P(~x,~y) = GF(x:y)
where by DeMorgan's rule P(~x,~y) = 1 -(Px+Py-Pxy) = P(~(x or y)) .
We see that the originally duplicate asymmetry (in GF's numerator and its
denominator) is really a single asymmetry in the denominator only, for
which there hold the following equivalences among probabilistic dependence
relations :
[ P(y|x) > Py ] == [ P(x|y) > Px ] == [ Pxy > Px.Py ]
which is symmetric wrt random events x, y, so it is impossible to formally
decide which of the two GF factors to prefer, while they yield different
numerical values. Nevertheless, let me try to heuristically decide between
the two possible generative factors. From the formulae
[ Pxy -Px.Py ]/[ ... ] above it is clear that
GF(y:x) / GF(x:y) = Py / Px = P(y|x) / P(x|y) by Bayes , hence:
[ GF(y:x) > GF(x:y) ] == [ Py > Px ] == [ P(y|x) > P(x|y) ]
[ GF(y:x) < GF(x:y) ] == [ Py < Px ] == [ P(y|x) < P(x|y) ]
GF(y:x) measures how much (x implies y) ie (x entails y) ie (x SufFor y).
This is good because the more widespread x is, the less likely is x a
SPECIFIC cause of y. Similarly for y as a tentative cause of x. Draw a
Venn diagram and do find again Venn , Py < Px , Px < Py in { Hajek }.
By this reasoning I arrived at my MaxiMin heuristic rule for which GFactor
to use when ranking conservatively by
GFactor = minimum[ GF(y:x) , GF(x:y) ] :
If n(x,y) > few then { few > 1, say 4, since "Einmal ist keinmal" }
  { the < is correct if we want the minimum value, see a dozen lines above }
  if Py < Px then use GF(y:x) else use GF(x:y) ,
which is equivalent to GFactor = minimum[ GF(y:x) , GF(x:y) ]
which is a conservative heuristic protecting an automated data-mining
program like my KnowledgeXplorer KX from "false positives" wrt causal
tendency. The output of KX is sorted by the values of GFactors and a human
user can focus his or her attention (our scarcest resource) on the pairs
of events x, y with a high GFactor obtained in the just described MaxiMin
mode.
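A minimal sketch of the MaxiMin heuristic in Python (the counts and the
guard 'few' are illustrative; the n(.)'s come from a 2x2 table):

  def gfactor_maximin(n_xy, n_x, n_y, n_all, few=4):
      # conservative GFactor = minimum[ GF(y:x) , GF(x:y) ]
      if n_xy <= few:
          return None                    # "Einmal ist keinmal": no verdict
      px, py, pxy = n_x/n_all, n_y/n_all, n_xy/n_all
      p_nx_ny = 1 - (px + py - pxy)      # P(~x,~y) by DeMorgan
      gf_yx = (pxy - px*py) / (px * p_nx_ny)
      gf_xy = (pxy - px*py) / (py * p_nx_ny)
      return min(gf_yx, gf_xy)

  print(gfactor_maximin(30, 60, 120, 300))  # 0.1 , here = GF(x:y) since Px < Py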
Similarly with my causal hindrance HFactor. My MaxiMin protects against
the extreme Px=1 and/or Py=1; if Px =. Py then GF(y:x) =. GF(x:y) anyway.
Just in case you don't like MaxiMin, let me tell you about the
quantitative relations between the two most frequently used formulae in
evidence-based medicine:
OR(y:x) = OR(x:y) is the symmetric odds ratio OR(:) under P4b above, and
RR(y:x) <> RR(x:y) is the asymmetric risk ratio == relative risk
Odds(z) = Pz/(1-Pz) in general, hence the odds ratio ie a ratio of odds is
OR( : ) = [ Pu/P(~u) ]/[ Pv/P(~v) ] in general
 = [ Pu/Pv ].[ P(~v)/P(~u) ] ; in particular we want : OR(x,y) =
 = [ P(x|y)/(1-P(x|y)) ]/[ P(x|~y)/(1-P(x|~y)) ] = OR(x:y) from which:
 = [ P(x,y)/P(~x,y) ]/[ P(x,~y)/P(~x,~y) ]
 = [ Pxy.P(~x,~y) ]/[ P(x,~y).P(~x,y) ] = 1 if x,y independent
 = a.d/(b.c) in a 2x2 contingency table
 = [ P(x|y)/P(x|~y) ]/[ P(~x|y)/P(~x|~y) ] = LR+ / LR- ; the conditionings
   annul:
 = [ Pxy/P(x,~y) ]/[ P(~x,y)/P(~x,~y) ] = (a/b)/(c/d) in a 2x2 table
 = [ P(y|x)/P(~y|x) ]/[ P(y|~x)/P(~y|~x) ] = OR(y:x) , symmetric wrt x,y
 = [ P(x,y)/P(~y,x) ]/[ P(y,~x)/P(~y,~x) ] = (a/b)/(c/d) in a 2x2 table
   qed.
 = [ P(y|x)/P(y|~x) ]/[ P(~y|x)/P(~y|~x) ] = RR(y:x)/RR(~y:x)
 = [ P(y|x)/P(y|~x) ].[ P(~y|~x)/P(~y|x) ] = RR(y:x).RR(~y:~x) = OR(y,x)
 = OR(:)
which is an even more impressive example of Bayesian inversion than
E(x,y) .
if Px < Py then Max[ RR(y:x) , RR(x:y) ] < OR(x,y) ;
if Px > Py then min[ RR(y:x) , RR(x:y) ] < OR(x,y) ;
if Pxy = Px.Py then RR(y:x) = RR(x:y) = OR(:) = 1 { 0 = ARR } else
if Pxy > Px.Py then 1 < Max[ RR(y:x) , RR(x:y) ] < OR(:) { 0 < ARR } else
if Pxy < Px.Py then 1 > min[ RR(y:x) , RR(x:y) ] > OR(:) > 0 { ARR < 0 };
Hence the symmetric OR(x,y) is a bound on both RR(:)'s. Depending on the
dependence, OR(:) is an upper bound, or a lower bound as shown. The moral
of this is that lots of medical researchers and specialists are used to
working and living with OR as an upper or lower bound on risk. C'est la
vie.
I know that the complexities of the world including human thinking cannot
be captured in a formula. I do not believe in a "theory of everything" as
some physicists & psychicists do. I just try hard to arrive at good
INDICATORS of causal tendency. With my heuristic rule I am sticking my
neck out like a giraffe. Feel free to cut me down with your logic,
counterexamples, and your common sense, but be aware of the sad fact that
too often our common sense is a common nonsense, especially when dealing
with uncertainties, as the recent Nobel prize winner Daniel Kahneman and
the late Amos Tversky have shown eg in { Kahneman }.
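The identity OR = RR(y:x).RR(~y:~x) and the bounding property can be
checked numerically; a Python sketch with one made-up positively dependent
table:

  px, py, pxy = 0.3, 0.5, 0.2
  p_x_ny  = px - pxy                    # P(x,~y)
  p_nx_y  = py - pxy                    # P(~x,y)
  p_nx_ny = 1 - px - py + pxy           # P(~x,~y)

  odds_ratio = (pxy * p_nx_ny) / (p_x_ny * p_nx_y)     # a.d/(b.c)
  rr_yx   = (pxy/px) / (p_nx_y/(1 - px))               # RR(y:x)
  rr_xy   = (pxy/py) / (p_x_ny/(1 - py))               # RR(x:y)
  rr_nynx = (p_nx_ny/(1 - px)) / (p_x_ny/px)           # RR(~y:~x)
  print(odds_ratio, rr_yx * rr_nynx)     # both 2.666...
  print(max(rr_yx, rr_xy) < odds_ratio)  # True : OR is the upper bound here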
.-
Insight2 : Fresh interpretation of GF as a regression slope/MAXslope
GF = [ P(y|x) - P(y|~x) ]/[ 1 - P(y|~x) ] which I recognized to be:
 = slope(of y on x)/[ MAXimal achievable slope, since P(y|x) <= 1 ]
 = beta(of y on x)/[ fictive MAXimal beta, ie as-if , what if ]
from the probabilistic regression line y = beta.x + alpha , which is not
the statistical regression line from the books on stats.
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] which I recognized to be:
 = slope(of y on ~x) /[ MAXimal achievable slope, since P(y|~x) <= 1 ]
 = beta(of y on ~x) /[ fictive MAXimal beta, ie as-if , what if ]
Proof of ARR = beta(of y on x) ie beta(y:x) :
Events can occur or not; they are binary variables aka Bernoulli r.v.'s
aka indicator events x, y for which the following expected values hold :
E[x] = Px ; E[x.y] = Pxy due to x = 0 or 1 , y = 0 or 1
E[x^2] = E[x.x] = Pxx = Px = E[x] due to x = 0 or 1
cov(x,y) = E[x.y] - E[x].E[y] = Pxy - Px.Py ; cov(x,x) = var(x)
var(x) = E[x.x] - E[x].E[x] = Px - (Px)^2 = Px.(1-Px)
cov(x,y)/var(x) = beta(y:x) recall covariance / variance , hence
 = (Pxy - Px.Py)/(Px.(1-Px)) is 0 if x,y are independent random events
 = slope of a probabilistic regression line Py = beta(y:x).Px + alpha(y:x)
! = [ P(y|x) - Py ]/(1-Px) = P(y|x) - P(y|~x) ( checks as an equation )
 = ARR = absolute risk reduction of y if x ( or increase if x is "bad" ),
qed.
beta(y:x)*beta(x:y) = square( correlation coefficient for events x, y )
 = coefficient of determination for events x, y
E[x.y] <= sqrt( E[x^2].E[y^2] ) is the Cauchy-Schwarz inequality
Pxy <= minimum[ Px , Py ] <= sqrt(Px.Py) ; the sqrt bound is weaker than
the minimum
Pxy is a DOT PRODuct aka inner product aka scalar product of x, y.
The cosine of the angle between the events viewed as-if they were vectors
is :
cos(x,y) = E[x.y]/sqrt( E[x^2].E[y^2] ) = Pxy/sqrt(Px.Py)
 = sqrt( P(y|x).P(x|y) ) = geometricAverage( P(y|x) , P(x|y) ) <= 1
due to 0 <= P <= 1 this cosine cannot be negative, ie 0 <= cos(x,y) <= 1;
Px.Py is in fact a fictive probability of the as-if independent events
x, y with Px, Py as marginals ; find as-if .
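The proof can be double-checked numerically in Python (one arbitrary,
made-up triple):

  px, py, pxy = 0.3, 0.5, 0.2
  beta_yx = (pxy - px*py) / (px*(1 - px))     # cov(x,y)/var(x)
  arr     = pxy/px - (py - pxy)/(1 - px)      # P(y|x) - P(y|~x)
  print(beta_yx, arr)                         # both 0.2380... , qed numerically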
.-
Insight3 : Commensurable pairs of formulae : ( GF & HF ) vs ( PF & QF )
P(.) and its complement 1-P(.) may be probabilities of success and
failure, or vice versa. It's all a matter of semantics. When your
physician calls a test result "positive", it usually is bad news for you.
One man's success is another woman's failure. By complementing ie negating
the semantics we can transform our formulae into other forms shown below.
Too many names given to formulae are too suggestive. Any conditional
probability allows for 2*4 versions P(y|x) P(y|~x) P(~y|x) P(~y|~x), ... ,
from which at least 17*2 = 34 formulae for a probabilistic CONTRAST can be
formed. If possible, it is better to use a single unifying view rather
than more of the dividing views. In our case here we can avoid semantic
confusion and mathematical mistakes if we use a single "generative factor"
allowing also the negative sign if ARR is near 0.0. Or we can use two
pairs of generative factors, comparable only within each proper pair, but
not across the pairs. Each such generative factor has a unique
qualification of what is generated from what. So we do not have to doubt
or meditate upon whether "preventive" refers to the effect, or to the
assumed cause, or to both; see the 4 P(.|.)'s above.
The 1st commensurable pair:
GF = generative factor for y from x see above and use 1-P(.)
     to get = 1 - [1-P(y|x) ]/[1-P(y|~x)] UNlike 1 - RR(y:x)
HF = generative factor for y from ~x see above and use 1-P(.)
     to get = 1 - [1-P(y|~x)]/[1-P(y|x) ] UNlike 1 - RR(y:x)
The 2nd commensurable pair:
PF = generative factor for ~y from x (my proof of these semantics follows)
QF = generative factor for ~y from ~x (left as an exercise to the reader)
PF = [ P(y|~x) - P(y|x) ]/P(y|~x) = 1 - P(y|x)/P(y|~x) = 1 - RR(y:x)
   = [ P(~y|x) - P(~y|~x) ]/[ 1 - P(~y|~x) ] in canonical form
which proves that the "truly unmistakable meaning" of PF is "generative
factor for ~y from x"; just compare its canonical form with that of GF ;
qed. However, Insight1 warns us that "unmistakable" is meant only in a
semi-formal sense. Regardless of how satisfied someone might be with this
"true meaning", it is essential that my requirement of CONTINUITY and
COMMENSURABILITY P8 is met.
.-
Insight4 : How to rescale a range for more palatable results
A suitable scale for the results from a measure is psychologically
important. Both Fahrenheit and Celsius have linear scales, but the Celsius
scale carries more meaning in its pivotal points: 0 = melting temperature
of ice, and 100 = boiling temperature of water at sea level. The range
[0..100] for water as a fluid is easier for mental calculations and
imagination than [32..212]. On a logarithmic scale are the decibel dB, pH,
and the Richter scale for earthquakes, because:
- human sensory perception follows the physiological Weber-Fechner law
- logarithms turn huge numbers into psychologically manageable ones
- logarithms turn "power law" curves and exponential growth into lines
- logarithms turn multiplication into easier mental addition
Rescalings make results more palatable, but they should be co-monotonic.
If we plug our non-negative Pa = P(y|x) and Pb = P(y|~x) into the
following functions(Pa, Pb), then all these measures will be co-monotonic,
although with different ranges and different FIXED semantic PIVOTAL
points.
Pa - Pb is an absolute dependence measure ARR = P(y|x) - P(y|~x) scaled
  [0..1] for Pa > Pb , or [-1..1] in general, with 0 if x,y are fully
  independent
Pa / Pb is a relative dependence measure RR = P(y|x)/P(y|~x) scaled
  [0..1..oo) with 1 if x,y are fully independent.
log(Pa / Pb) is scaled (-oo..0..oo) ; it is W(y:x) findable here
(Pa - Pb)/(Pa + Pb) is scaled [-1..0..1] , I call it the kemenyzed range
  = ( diff/2 )/(average) makes sense
  = (Pa/Pb -1)/(Pa/Pb +1) here F(:) = [ RR - 1 ]/[ RR + 1 ] , nearby
  W(y:x) , and
Pa/(Pa + Pb) is scaled [0..1/2..1] ; find F0 here.
Such a normalization to the range [-1..0..1] makes sense, as shown, but
(Pa - Pb)/(1-Pb) = GF is semantically more specialized, hence stronger.
(a-b)/(1-b) works only for numbers 0 < a,b < 1 = MAX[a,b], but then
(a-b)/(1-b) carries more meaning than a less specialized formula :
(a-b)/(a+b) works for any numbers a,b <> 0 ; it is a true metric for
a,b >= 0, but we do not require a causal measure to be a true metric which
obeys the triangle inequality. Metricity is nice, but if not needed then
it is not a desideratum, and the more important property P2 of a clear
meaning over the WHOLE RANGE makes me prefer GF over Kemeny's F(y:x) which
has a clear meaning only at its 3 pivotal points -1, 0 and 1.
As said, they are co-monotonic (= isotone), and they have 2 (or 3) FIXED
semantic pivots directly interpretable as independence, and as maximum
(and as minimum), but not all of the above functions have their
non-pivotal values directly interpretable. While RR(:) , ARR(:) , and
especially 1/ARR = NNT are directly interpretable over their WHOLE range,
others are not : F( : ) by Kemeny , C( : ) by Popper , Conv( : ) by
Google's CEO Brin . The infinite upper bounds of RR( : ) and of Conv( : )
are not nice for the psyche.
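The co-monotonicity is visible from a few rows of numbers (Python; the
(Pa, Pb) pairs are made up):

  import math

  def rescalings(pa, pb):
      rr = pa / pb
      return (pa - pb,                  # ARR : pivot 0
              rr,                       # RR  : pivot 1
              math.log(rr),             # W   : pivot 0
              (pa - pb)/(pa + pb),      # F   : kemenyzed [-1..0..1]
              (pa - pb)/(1 - pb))       # GF  : Sheps' relative difference

  for pa, pb in [(0.02, 0.04), (0.04, 0.02), (0.4, 0.2)]:
      print([round(v, 3) for v in rescalings(pa, pb)])
  # all five columns rise and fall together; only ranges & pivots differ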
.-
Insight5 : Numbers needed to treat or harm : NNT , NNH , plus my NNR
Since for commonly low P(.|.)'s the GF is nothing but a finely tuned ARR,
and since an opeRationally highly meaningful practical formula is the
"number needed to treat" NNT = 1/ARR > 0 , I designed NNR = 1/GF :
NNT = |1/ARR| was introduced in { Laupacis & Sackett & Roberts 1988 }
NNT = number needed to treat for 1 more |or 1 less| beneficial effect y,
      low NNT = good, successful, effective treatment x ;
NNH = number needed to harm 1 more |or 1 less| by side effect z,
      high NNH = good, almost harmless treatment x ;
NNS = number needed to screen to find 1 more |or 1 less| case,
      low NNS = good, effective screening;
|1/ARR| is the most realistic general measure of health effects, as it
!! is the least abstract & least exaggerating ie most HONEST, and
!! UNlike RR(:), OR(:) or any other rate ratio, it does NOT "throw away
all information on the number of dead" { Fleiss 2003, p.123 on Berkson's
index aka ARR }. Moreover NNT, NNS, NNH measure EFFORT
!!! ie COSTS PER EFFECT.
If ARR=0 ie y,x independent, then 1/ARR = oo ie infinite.
NNR by Jan Hajek : NNR = 1/GF > 0 is my Number Needed for 1 more Relative
effect; if GF < 0 then switch to its commensurable counterpart HF .
dNNR = 1/ARR - 1/GF (if ARR > 0) = Hajek's difference of "Nrs Needed",
     = P(y|~x)/ARR = P(y|~x)/[ P(y|x) - P(y|~x) ] = 1/( RR - 1 ) = 1/RRR
1/dNNR = RR(y:x) - 1 = RRR = relative risk reduction if ARR > 0.
NNH(x:z)/NNT(x:y) is highly informative too; it should be >> 1 ie many
more have to be x-treated before 1 z-harm will occur, while many more
patients have y-improved already. NNH/NNT is in the fine infokit by Steve
Simon at http://www.childrens-mercy.org/stats .
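The "numbers needed" for the earlier example P(y|x) = 0.04 and
P(y|~x) = 0.02, as a Python sketch:

  p_yx, p_ynx = 0.04, 0.02
  arr = p_yx - p_ynx
  gf  = arr / (1 - p_ynx)
  nnt  = 1/arr              # 50 : treat 50 for 1 more effect y
  nnr  = 1/gf               # 49 : my Number Needed for 1 more Relative effect
  dnnr = nnt - nnr          # 1.0 = 1/(RR - 1) = 1/RRR
  print(nnt, nnr, dnnr, 1/(p_yx/p_ynx - 1))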
.-
Insight6 : The simplest thinkable necessary condition for CONFOUNDING
Let me present in a more palatable form what I have done in { Hajek www }.
Confounding is a serious problem when searching for a true cause of y , as
there usually are at least two candidates x, c for a true cause of y.
Confounders ie alternative candidate causes confuse our perception and
understanding of causation. Let's make the search for & research of
confounders easier & less expensive.
Only to derive my simplest necessary condition for c to overrule x as a
cause of y, let's assume as-if we have available three relative risks aka
risk ratios. I say as-if since my simplest condition does not really need
any of the following three RR's; that's my point: the less data we need,
the better.
RR(y:x) = P(y|x)/P(y|~x) , and
RR(y:c) = P(y|c)/P(y|~c) , and
RR(c:x) = P(c|x)/P(c|~x) .
Naturally, a necessary (but not sufficient) condition for c to be a more
likely candidate than x as a cause of y is RR(y:c) > RR(y:x) .
Another, less natural, necessary condition is RR(c:x) > RR(y:x) ,
necessary for c to overrule x as a cause of y , by Jerome Cornfield in
1959, reproduced in the Appendix of { Schield 1999 }.
Caution! If only some necessary conditions are TRUE then c does not yet
exclude x as a cause of y. Only if all (partial) necessary conditions are
TRUE can c overrule x . Although it is plausible that there exists (ie
that we can formulate) only a small number of necessary conditions, there
are infinitely many possible confounders c . Therefore I say : "The best
we can say of a cause is that it has not yet been refuted". Also: "No
amount of experiments can ever prove me right; a single experiment may at
any time prove me wrong."
said Einstein, who was paraphrased by E.W. Dijkstra: "No amount of testing
can show the absence of bugs, only their presence."
"Experiments can only falsify a theory or a hypothesis." is my Popper-ism.
"Absence of evidence is not evidence of absence." wrote { Doug Altman &
Martin Bland, BMJ 311, 1995/8/19, p.485 }
Anyway, combining the above necessary conditions yields my new one:
!! RR(y:x) < minimum[ RR(c:x) , RR(y:c) ] , and its equivalent
!! P(x|y) < P(x|c) AND RR(y:x) < RR(y:c) .
Clearly, they are equivalent if P(x|y) < P(x|c) == RR(y:x) < RR(c:x)
which derives from the fact that
RR(y:x) < RR(c:x) == P(y|x)/P(y|~x) < P(c|x)/P(c|~x)
has the same conditionings |. on both sides of the < , hence the
conditional P(.|.)'s can be turned into joint P(.,.)'s since the
conditionings annul :
P(y|x)/P(y|~x) < P(c|x)/P(c|~x) ; where P(~x)/P(x) annul, and only the <
still holds, NOT the values :
P(y,x)/P(y,~x) < P(c,x)/P(c,~x) ; now obtain the inverted conditionings :
P(x|y)/P(~x|y) < P(x|c)/P(~x|c) values as on the preceding line ;
P(x|y)/[1-P(x|y)] < P(x|c)/[1-P(x|c)] values as on the preceding line ;
!! P(x|y) < P(x|c) proves the equivalences , qed.
P(x|c) > P(x|y) is my SIMPLEST THINKABLE necessary condition for a
candidate c to overrule x as a possible cause of y. Originally I derived
it from the decompositions
RR(c:x) = [ Pcx/(Pc - Pcx) ].(1-Px)/Px
RR(y:x) = [ Pyx/(Py - Pyx) ].(1-Px)/Px
which readily suggest that (1-Px)/Px can be dropped from Cornfield's
inequality RR(c:x) > RR(y:x) which becomes
[ Pcx/(Pc - Pcx) ] > [ Pyx/(Py - Pyx) ] ie
[ Pcx.Py - Pcx.Pyx ] > [ Pyx.Pc - Pyx.Pcx ] where the Pcx.Pyx annul, hence
Pcx.Py > Pyx.Pc hence Pcx/Pyx > Pc/Py hence
!!! P(x|c) > P(x|y) my simplest necessary condition for c to overrule x
!!! P(x|c) - P(x|y) > 0 my simplest necessary absolute boost Ab > 0 needed
!!! P(x|c) / P(x|y) > 1 my simplest necessary relative boost Rb > 1 needed
[ P(x|c) = P(c|x).Px/Pc ] > [ P(y|x).Px/Py = P(x|y) ] by the Bayes rule ;
so
!!! P(c|x)/Pc > P(y|x)/Py my Bayesian boost condition for c to overrule x
!!! P(c|x) > P(y|x).Pc/Py 2nd form of the necessary condition for c over x
    P(y|x) < P(c|x).Py/Pc 3rd form of the necessary condition for c over x
Imitating the Polish mathematician Hugo Steinhaus (a math prof. of Stan
Ulam, the father of the peacekeeping H-device, whose mother was the
US-Hungarian Ed Teller), you may ask "Wo ist der Witz ?" ie What's the
point ? The point is that the researcher does not have to evaluate all the
other NECESSARY (sub)conditions after any single one of them is found to
be FALSE, in which case c becomes an UNCONVINCING competitor of x for
potential causation of y , and NO other necessary condition for
confounding can possibly be simpler than my simplest thinkable one
P(x|c) > P(x|y); draw a Venn diagram (or pizzas or pancakes).
Calcs and PC's make calculations easy, but we seldom get all the data we
would like to have. Even the best medical journals show only bits of data
for lack of space and other economic reasons. So if a simpler condition
like eg mine requires less data, we may be able to do what otherwise would
be impossible. Dzatz dz witz. For more do find again confound in
{ Hajek www }.
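A sketch in Python of how a screening loop can use the simplest condition
as a cheap first filter; the P's and RR's are made-up inputs, and TRUE
here never proves confounding, it only fails to refute c :

  def c_still_in_the_race(p_x_given_c, p_x_given_y, rr_yc, rr_yx):
      # evaluate the cheapest NECESSARY condition first ...
      if p_x_given_c <= p_x_given_y:
          return False               # c refuted, no need to look further
      # ... then the more data-hungry one RR(y:c) > RR(y:x)
      return rr_yc > rr_yx

  print(c_still_in_the_race(0.8, 0.5, 3.0, 2.0))  # True : c not yet refuted
  print(c_still_in_the_race(0.4, 0.5, 3.0, 2.0))  # False: c is UNCONVINCING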
.-
Insight7 : Sufficiency and necessity : two sides of a causal coin
Causation is like the 2-faced Roman god Janus (January is named after
him). One face is SUFficiency, the other face is NECessity. They go
together; they are 2 components of causation, something like the real and
imaginary parts of a complex number.
This analogy is not too bad, since necessity is often based on imagined
CounterFactual reasoning ie on what WOULD BE IF ( in German: was waere
wenn ; also find as-if ) the situation WOULD NOT BE the factual one
{ Sheps 1959 pp.87,89,91,92 } { Pearl 2000, p.284 }.
I have written enough on sufficiency and necessity in { Hajek www }, so
I'll mention only the basic facts of causal life. A reminder:
[ P(y|x) > P(y|~x) ] == [ P(x|y) > P(x|~y) ]
this equivalence holds also for = and < on both sides of the == .
The most simplistic , crude , naive measures of causal tendency are:
P(y|x) = Pxy/Px = Sufficiency of x for y { Schield 2002, Appendix A }
                = Necessity of y for x { follows from the next line: }
P(x|y) = Pxy/Py = Necessity of x for y { Schield 2002, Appendix A }
                = Sufficiency of y for x { follows from above }
To understand this, draw two pizzas or pancakes as (almost) overlapping
targets with the area Px inside Py. Imagine ourselves as as-if archers :
If we want to hit the larger Py then it is sufficient (but not necessary)
to hit the smaller Px to be sure we hit the larger Py. If we want to hit
the smaller Px then it is necessary (but not sufficient) to hit the larger
Py , which is a prerequisite (= conditio sine qua non) but in no way a
guarantee of hitting the smaller Px . My fresh desideratum P4b has much to
do with sufficiency and necessity .
{ Schield 2002, p.1 } says: "But epidemiology focuses more on identifying
a necessary condition [h] whose removal would reduce undesirable outcomes
[e] than on identifying sufficient conditions whose presence would produce
undesirable outcomes". Also see his Appendix A, first lines left & right.
In his section 2.2 on necessity vs. sufficiency , prof. Milo Schield
nicely explains their contextual semantics and applicability ( all [.] by
JH ) :
"Epidemiologists may focus more on necessity than sufficiency. [They] may
want to REDUCE disease incidence more than they want to PREDICT disease.
Focusing on necessity may be more important for them than focusing on
sufficiency since eliminating a necessary condition is sufficient to
prevent the outcome [e]. Unless an effect [e] can be produced by a single
sufficient cause [h] (RARE!), producing the effect requires supplying ALL
of its necessary conditions [h_i], while preventing it [e] requires
removing or eliminating only ONE of those necessary conditions [h_i] ."
Caution: the suffixes nec, suf as used by various authors say nothing
about which event is necessary for which one, as long as you have not
found in their writings what is necessary (or sufficient) for what. It is
vital to know exactly what nec and suf mean because of the Janus-like
double-faced equivalences resulting from set theory and logical
implication :
from set theory == == implication == set theory :
(x SufFor y) == (y NecFor x) == (~x NecFor ~y) == (~y SufFor ~x)
(y SufFor x) == (x NecFor y) == (~y NecFor ~x) == (~x SufFor ~y)
which hold for determinism ie if (x entails y) perfectly. These
equivalences ( == ) are broken when Px <> Py , since then
M(y:x) <> M(x:y) in general and by my desideratum P4b:
(Px < Py) == [ M(y:x) > M(x:y) ] means that M(y:x) measures (x SufFor y)
and M(x:y) measures (y SufFor x);
(Px < Py) == [ m(y:x) < m(x:y) ] means that m(x:y) measures (x SufFor y)
and m(y:x) measures (y SufFor x);
For much more find suffic and necess in { Hajek www }, and find P4b here.
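The archers' picture in two numeric lines (Python): nest x inside y, ie
Pxy = Px, with made-up areas:

  px, py = 0.2, 0.6
  pxy = px          # the smaller target x lies entirely inside y
  print(pxy/px)     # P(y|x) = 1.0   : crude Sufficiency of x for y
  print(pxy/py)     # P(x|y) = 0.33. : crude Necessity of x for y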
Population attributable risk aka Levin's attributable risk { Fleiss 2003,
156, 126-128, 711/7.5 } can also be written in an unusual, Sheps-like
form:
PAR(x:y) = ( Px.[ RR(y:x) -1 ] )/( 1 + Px.[ RR(y:x) -1 ] ) the usual form;
 = ( Pxy - Px.Py )/( Py - Px.Py )
 = ( P(x|y) - Px )/( 1 - Px )
 = ( (x NecFor y) - base )/( MaxP(x|y) - base )
in the spirit of Sheps' "relative difference" . As mentioned under P4b , I
discovered that PAR(x:y) and Conv(x:y) produced the same RANKings of 10
vastly different 2x2 contingency tables. I have no proof that this holds
for all tables. Also I don't know of any other pair of M(:)'s which
produced equal ranks.
.-
Insight8 : MaxiMin vs Kemeny & Fitelson in clash with Popper & Kahre ;
C(y:x) , F(y:x) , GF(y:x) stress tested with extreme P's ; plus my paradox
of IndependentImplication :
Here continue the difficulties from the example under +Warnings above.
After expressing his & others' doubts, { Kemeny 1952, end of p.312 & 1st
lines of p.313 } says: "But we see no possible interpretation under which
the weights [ JH: the P(.|.)'s ] would depend on H. CA7: The weights
depend on n, they may depend on E, but they must be independent of H."
I read it as a foggy way to say that his "factual support" of a
hypothesis H by an evidence E must be
F(e:h) = [ P(e|h) - P(e|~h) ]/[ P(e|h) + P(e|~h) ] , ie it must not be
[ P(h|e) - P(h|~e) ]/[ P(h|e) + P(h|~e) ] .
Similarly { Fitelson 2001, p.42 } requires: "After all, evidential support
is supposed to be a measure of how strong the evidential relationship
between E and H is, and deductive entailment is the strongest that such a
relationship can possibly get. If E is conclusive for H, then H's a priori
probability should, intuitively, be irrelevant to how strong the (maximal,
deductive) relationship between E and H is." On p.43 he wisely adds: "In
any case, we should probably not put too much stock on deductive [ JH:
deterministic ] cases of the kind discussed in this section." Wisely, as I
recall my 7 Construction principles for good association measures in
{ Hajek } where my principle P2 says: "OpeRational usefulness is greatly
enhanced if the measure's WHOLE RANGE of values (not only its bounds) has
an opeRationally highly meaningful interpretation (like eg NNT , NNH )."
My P2 asks for more than Fitelson does, but his wits come in handy when we
cannot have all desiderata in one measure, in which case we have a
dilemma: Should we have opeRationally meaningful bounds, or most of the
range, if we cannot have both ? Answer: Since most of the range covers
more values than a small band near the extremes, I vote for the range.
This justified choice fits with Fitelson's intuitive wits.
Caution: the LOWer the count n(.) ie n(x) or n(y) , the more often will
n(x,y) = n(.), and the more of the conventional proportions n(x,y)/n(.)
will = 1. Such simplistic estimators of P(.|.) lead to EXTREME results
from some formulae which are sensitive to P(y|x)=1 or P(x|y)=1. My
provably (I derived the optimal ones) near optimal estimators for
n(x,y) > 0
P(y|x) = [ n(x,y) - 0.5 ]/n(x) , P(x|y) = [ n(x,y) - 0.5 ]/n(y)
are always < 1, since n(x,y) <= minimum[ n(x), n(y) ], so no P(.|.) = 1
can occur. MaxiMin: my heuristic also protects us from "false positives"
likely to arise from the extreme Px =. 1 or Py =. 1 :
If n(x,y) > few then { few > 1, say 4, since "Einmal ist keinmal" }
  if Py < Px then use M(y:x) else use M(x:y) ,
which is equivalent to:
If n(x,y) > few then { few > 1, say 4, since "Once is as if never" }
  M = minimum[ M(y:x) , M(x:y) ]
is a conservative heuristic, applicable to most measures M of
confirmation, support or causation. Note that if Px =. Py then
M(y:x) =. M(x:y) anyway. Maximal values of M are the most interesting
ones, so it is a MaxiMin heuristic.
{ Fitelson pp.42,43,47,48 } repeatedly singles out Kemeny's factual
support and "ordinally equivalent measures" ie isotone ones, as
"Interestingly, the only measure (among our 5 candidates) that satisfies"
his requirement. This is not so surprising, since among his 5 candidates
there is only one measure L which is based on P(e|h), while 3 are based on
P(h|e), and 2 are fully symmetric wrt e, h (symmetry makes them unfit).
Fitelson's L's :
L = ln( P(e|h)/P(e|~h) ) = ln( RR(e:h) ) = ln( LR(e:h) ) = W(e:h)
 = "weight of evidence" pushed in some 50 papers of I.J. Good
L* = [ P(e|h) - P(e|~h) ]/[ P(e|h) + P(e|~h) ] = F(e:h) = F(y:x)
 = [ P(e|h) / P(e|~h) - 1 ]/[ P(e|h) / P(e|~h) + 1 ] { my form1 }
 = [ RR(e:h) - 1 ]/[ RR(e:h) + 1 ] { my form2 }
RR(y:x) , F(y:x) == L* , and L == W(y:x) are isotone ie co-monotonic.
RR(:) = [ 1 + F(:) ]/[ 1 - F(:) ] is a conversion ;
F(y:x) = "degree of factual support of x by y" { my x == h , y == e }
 = [ P(y|x) - P(y|~x) ]/[ P(y|x) + P(y|~x) ] F-form1 { Kemeny 1952 }
 = [ RR(y:x) - 1 ]/[ RR(y:x) + 1 ] a function of the relative risk
 = tanh( 0.5*ln(RR(y:x)) ) my F-form4; tanh(z) = (e^(2z) -1)/(e^(2z) +1)
 = tanh( W(y:x)/2 ) my tanh corrects I.J. Good's sinh
 = tanh( 0.5*ln( Odds(x|y)/Odds(x) ) ) note the semi-inversion
 = ARR/[ P(y|x) + P(y|~x) ]
 = [ Pxy - Py.Px ]/[ Pxy + Py.Px - 2.Pxy.Px ] { my F-form2 }
 = [ P(x|y) - Px ]/[ P(x|y) + Px.(1 - 2.P(x|y)) ] { my semi-inverted form }
 = (difference/2)/Average = deviation/mean { my F-form5 }
which makes sense as a normalization to the range [-1..0..1] , but is less
semantically rich than GF = [Pa - Pb]/[ 1-Pb ] because unlike GF , F(:)'s
values are NOT quantitatively interpretable over the whole range; find
P2 ;
 = CF2(y:x) = [ P(x|y) - Px ]/[ Px.(1 - P(x|y)) + P(x|y).(1 - Px) ]
is a certainty factor in MYCIN at Stanford rescaled in { Heckerman 1986 },
which I recognized as F(y:x) via my F-form2 above.
F(y:x) = 0 if x,y are independent
F(y:x) = [ 1 - Py ]/[ 1 + Py - 2.Px ] if Pxy=Px
F(y:x) = 1 if P(x|y)=1 ie Pxy=Py , but:
F(y:x) = 0/0 if Px=1 == Pxy=Py == P(x|y)=1 then UNdetermined; I choose 0
F(y:x) = 0/0 if Py=1 == Pxy=Px == P(y|x)=1 then UNdetermined; I choose 0
F(y:x) < F(x:y) == (Px < Py) like RR(:) , UNLIKE P(y|x) & GF(:) , see
P4b: find -F( under P5:
These AMBIGUOUS non-values 0/0 may seem a weakness, but they are correct,
since if x,y are independent then
[ P(x|y) = Px ] == [ P(y|x) = Py ] == [ Pxy = Px.Py ] , so that in the
extreme:
if Px=1 then Py = Pxy = Px.Py & P(x|y) = 1 ie y implies x !
if Py=1 then Px = Pxy = Px.Py & P(y|x) = 1 ie x implies y ! ,
hence in these extreme cases y entails x AND x,y are independent, (and/)or
x entails y AND x,y are independent, which is my PARADOX of
IndependentImplication from { Hajek www }.
For comparison:
GF(y:x) = [ P(y|x) - P(y|~x) ]/[ 1 - P(y|~x) ] = slope(of y on x)/MAXslope
 = [ Pxy - Py.Px ]/[ (1 -(Px+Py-Pxy)).Px ]
GF(y:x) = 0 for independent x,y
GF(y:x) = 1 if Pxy=Px ie P(y|x)=1 ie MAX slope ie MAX beta(y:x) , ie
x entails y , eg if Py=1 ; GF(y:x) = 1 also if Px=1 !
GF(y:x) = Py/Px if Pxy=Py .
Now Kemeny and Fitelson will clash with Spinoza, Popper, and Kahre :
C(y:x) = Popper's corroboration aka confirmation, designed so that C <= 1-Px , which fits SIC
= [ P(y|x) - Py ]/[ P(y|x) + Py - Pxy ]  { Popper p.400, 9.2* }
= [ Pxy - Px.Py ]/[ Pxy + Px.Py - Pxy.Px ]  { my C-form2 }
= [ P(x|y) - Px ]/[ P(x|y) + Px.(1 - P(x|y)) ]  { my semi-inverted form }
= 1-Px iff P(x|y)=1
The factor 2 in the denominator of F(:) is the only mathematical difference between my semi-inverted forms of F(y:x) and C(y:x). Yet they have vastly different extreme values:
-1 <= F(:) <= 1 , C(y:x) <= 1-Px , C(x:y) <= 1-Py
C(y:x) = 0 for independent x,y OR Px=1 OR Py=1 ! which MAKES SENSE, since in all 3 cases Px carries NO surprise, NO new information.
C(y:x) = 1-Px if P(x|y)=1 ie if (y entails x) ie (y->x) ie (y implies x) ;
1-Px = P(~x) is a MEANINGFUL upper bound, because it fits SIC , and P(~x) = 1-Px is the simplest thinkable decreasing function of Px , better than log(1/Px) = -log(Px) in Shannon's entropy -Sum[Px.log(Px)] , where Sum[Px.1/Px] would not work (it always sums to the number of outcomes). Hence the quadratic entropy = Sum[Px.(1-Px)] .
K(x:y) = P(x|y) - Px is Kahre's favorite korroboration { Kahre 2002, p.120 } ;
-Px <= P(x|y) - Px <= 1-Px discounts the lack of surprise in x (fits SIC ), since a too frequent x is unlikely to be a SPECIFIC cause; this makes sense if Px =. 1 , less sense if Px is low.
K(y:x) = P(y|x) - Py is another korroboration in { Kahre 2002, p.186 } ;
-Py <= P(y|x) - Py <= 1-Py discounts the lack of surprise in y (find SIC ), since a frequent y is not seen as a real RISK; this makes LESS sense than 1-Px above, since a widespread risk is still a risk, although psycho-socially it is more acceptable if everybody is at the same high risk. Indeed, as long as nothing can be done about the risk, society gets used to it, becomes fatalistic about that risk, and is not too jealous wrt those lucky few exceptions who are spared the risk.
PsychoLogical justification of C( : ) and K( : ) : a frequent x is unlikely to be a SPECIFIC explanation or cause of y, or at least not a surprising, new, interesting one. Hence the decreasing 1-Px . What is common carries no new meaning. A more surprising x carries more SIC = "Semantic Information Content" , since the lower the prior Px , the more possibilities it a priori FORBIDS, EXCLUDES, REFUTES or ELIMINATES when x occurs; see SIC here & in { Hajek www }.
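A Python sketch of C( : ) and K( : ) on made-up P's, checking the SIC-fitting bound C(y:x) = 1-Px when y entails x:

  def C_form2(Pxy, Px, Py):
      # Popper's corroboration C(y:x) via my C-form2
      return (Pxy - Px*Py) / (Pxy + Px*Py - Pxy*Px)

  def K(p_posterior, p_prior):
      # Kahre's korroboration, eg K(x:y) = P(x|y) - Px
      return p_posterior - p_prior

  Px, Py = 0.4, 0.25
  Pxy = Py                               # P(x|y)=1 , ie y entails x
  print(C_form2(Pxy, Px, Py), 1 - Px)    # both print 0.6 = 1-Px
  print(K(Pxy/Py, Px))                   # K(x:y) also reaches 1-Px = 0.6 here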
{ Kemeny 1953, 297 } attributes this insight to { Popper pp.270,374,399,400,402 } , from whom Carnap has lifted it, and from him Bar-Hillel has lifted SIC ; { Bar-Hillel 1964, p.232 } quotes "Omnis determinatio est negatio" = "Every determination is negation" = "Bestimmen ist verneinen" by Baruch Spinoza (1632-1677), in 1656 excommunicated from the synagogue in Amsterdam. William of Ockham (1285-1347) aka Occam aka Dr. Invincibilis was also excommunicated, from the Church in 1328. Clearly, too original thinkers had to be purged from the establishments' institutions. Occam's bright student, who became a rector of the University of Paris, was Jean Buridan, who wanted to educate the masses by calling them asses. Besides his notion of pons asinorum, or pons asini, ie a "bridge of asses" (find it above), his name is also attached to "Buridan's ass". His (in fact already Aristotelian) ass is an animal standing between two equally appetizing haystacks, unable to decide which one to choose. This assy behavior can serve as a psychoLogical model for the dilemma of choosing between two or more (wo)men, and also for choosing a measure of causation tendency. The SIC property in Popper's C(:) and in Kahre's K(:) is intellectually appealing, but GF has other goodies mentioned under +Sheps . Which haystack do YOU prefer ?
To stress elimination of hypotheses or theories is Popperian refutationalism aka falsificationism. KISS = "Keep it simple, student!" is Occamism, but do not forget the vital amendment "ceteris paribus" ie "all else being equal". There is more to ceteris paribus than taking its meaning literally as explained above. Ceteris paribus should be recognized as a Vaihingerian useful fiction of the as-if kind (find again: Vaihinger or fictionalism ): the counterfactual WHAT-IF , let's pretend as-if , reasoning. Anyway, my slogan is "Keep IT simple, but not simplistic", as this e-paper does.
.- Insight9 : Variations on the form (ValueA - ValueB)/(MaxValueA - ValueB)
Examples tell more than abstractions. Pa, Pb are probabilities; Pb is some base probability or chance of success (or failure, feel free to define it). Counts of (joint) events a, b, c, d in a 2x2 contingency table are denoted as na, nb, nc, nd , where na + nb + nc + nd = N. Of course, na can be any observed score, and nc a chance value, base value, or expected value.
Ex: if Pa >= Pb then a factor F is : F = (Pa - Pb)/(1 - Pb) , where "It's the denominator, students!" , "It is the choice of the base that matters" :
eg Pb = Sum_i:1..m[ P_i * P_i ] = Sum[ (P_i)^2 ] = expected probability for P1..Pm
eg Pb = 1/m where m is the number of alternatives (if equiprobable); note that Sum_i:1..m[ (1/m)^2 ] = m.(1/m).(1/m) = 1/m
eg if na > nc then in a 2x2 contingency table a,b,c,d :
F = (na/N - nc/N)/(N/N - nc/N) = (Pa - Pc)/(1 - Pc) = (na - nc)/(N - nc) ,
hence this F is NOT based on P(.|.)'s , only on P(.)'s ie on counts na, nc, N , with na = n(y,x) , nc = n(y,~x) , N = n(x,y) + n(~x,y) + n(x,~y) + n(~x,~y)
eg if nhits > nMfc then, similarly to the last example, F = (nhits - nMfc)/(N - nMfc) ie success above the Most frequent class
eg if P(x|y) > Px then F = (P(x|y) - Px)/(1 - Px) = K(x:y)/(1 - Px) = PAR(x:y) , find it above.
eg Jacob Cohen's coefficient of agreement aka the chance-adjusted Kappa aka concordance or interrater agreement (of 1960) is an F with P's :
Pe = simple proportion of actually observed cases in which the evaluators agreed
Pi = fictitious proportion expected by chance, ie as-if the experts, who may be (wo)men or (wo)machines, produced statistically INdependent judgments.
In a 2x2 table of a,b,c,d where x = expertX and y = expertY , the joint counts of agreements are on the main diagonal, hence:
Pi = Px.Py + P(~x).P(~y) is the fictitious base chance , so that
F = (Pe - Pi)/(1 - Pi)
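A Python sketch of this F for a 2x2 table of two experts (the counts and the function name are mine, illustrative only):

  def kappa(na, nb, nc, nd):
      # na = both said yes , nd = both said no : main diagonal = agreements
      # nb = only expertX said yes , nc = only expertY said yes
      N  = na + nb + nc + nd
      Pe = (na + nd) / N                  # observed proportion of agreement
      Px = (na + nb) / N                  # expertX's "yes" rate
      Py = (na + nc) / N                  # expertY's "yes" rate
      Pi = Px*Py + (1 - Px)*(1 - Py)      # fictitious base chance of agreement
      return (Pe - Pi) / (1 - Pi)

  print(kappa(20, 5, 10, 15))   # Pe = 0.7 , Pi = 0.5 , kappa = 0.4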
Of course you are free to define (hopefully meaningfully) your own measure of matching or agreement between eg men and women, or between (wo)men and (wo)machines. Eg for random variables X, Y (ie for sets of random events x, y ) there exists Goodman & Kruskal's measure
TauB = [ (1 - E[Py]) - (1 - E[P(y|x)]) ]/( 1 - E[Py] )
     = [ E[P(y|x)] - E[Py] ]/( 1 - E[Py] )
     = (quadratic entropy of Y - conditional quadratic entropy of Y given X)/(quadratic entropy of Y)
     = Cont(X:Y)/Cont(Y)
TauB has never been explained as a normalized quadratic entropy, based on the simplest decreasing function of P , which is 1-P , where
Var(Y) = Sum_y:1..m[ P_y.(1 - P_y) ] = Expected(1 - P_y) = 1 - E[P_y] = 1 - Sum[ P_y.P_y ] = 1 - Sum[Py^2]
E[Var(Y|X)] = SumSum[ Pxy.(1 - P(y|x)) ] = 1 - E[P(y|x)]
Cont(X:Y) = (1 - E[Py]) - (1 - E[P(y|x)]) = E[P(y|x)] - E[Py]
          = SumSum[ Pxy.P(y|x) ] - Sum[ Py^2 ]
          = SumSum[ Pxy.(P(y|x) - Py) ] = SumSum[ Pxy.K(y:x) ]
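A Python sketch of TauB from a joint 2x2 table of made-up probabilities, following the formulas just given:

  def tau_b(Pxy):
      # Goodman & Kruskal's TauB = [ E[P(y|x)] - E[Py] ]/( 1 - E[Py] )
      Px = [sum(row) for row in Pxy]
      Py = [sum(col) for col in zip(*Pxy)]
      E_Py  = sum(p*p for p in Py)        # = 1 - Var(Y)
      # E[P(y|x)] = SumSum[ Pxy.P(y|x) ] = SumSum[ Pxy^2/Px ]
      E_Pyx = sum(p*p / px for row, px in zip(Pxy, Px) for p in row if px > 0)
      return (E_Pyx - E_Py) / (1.0 - E_Py)

  print(tau_b([[0.3, 0.1], [0.1, 0.5]]))   # =. 0.34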
-.-
+Conclusions:
It would be ABSURD if a slightly negative numerator -ARR(:) , ie a slightly negative dependence between y and x , yielded the value PF = 0.5 , which is 25 times larger than the GF = 0.02 obtained for the same numerator ARR with positive sign. Small deviations from independence (ARR = 0) should be treated as SYMMETRICALLY as possible, because DEPENDENCE IS SYMMETRICAL wrt x and y. All the various formulations of the condition of independence between two events y and x can always be transformed into the clearly symmetrical Pxy = Px.Py = Pyx . However, an indicator of probabilistic causation must be a MEANINGFULLY ASYMMETRICAL formula, asymmetric wrt the effect y and its tentative cause x, hence DIRECTED ie ORIENTED. The inversions and semi-inversions above serve as warnings about a merely seeming asymmetry. Cheng's PF does not fit with Sheps/Cheng's GF . If ARR < 0 then we must use the causal hindrance factor HF ; if ARR < 0 is near zero, then we may use GF < 0 as an approximate measure of prevention. More insights into more formulae are at my http://www.humintel.com/hajek
-.-
+References: papers & books worth (y)our looks :
If you cannot find a reference or anything else here, it is very likely in { Hajek } on www : http://www.humintel.com/hajek
Agresti Alan: Analysis of Ordinal Categorical Data, 1984; on p.45 is a math-definition of Simpson's paradox for events A, B, C
Arendt Hannah: The Human Condition, 1959; on the Archimedean point see pp.237 (last line) to 239, ... 260, more in her Index
Bar-Hillel Yehoshua: Language and Information, 1964; only his Introduction, but not his Contents or the reprinted paper, tells that the original author of their key paper of 1952 (chap.15, pp.221-274) was in fact Rudolf Carnap alone, who was the 1st author despite B < C, but B-H does not say it. Only their key paper is worth a look, the rest of the book is obsolete.
Brin Sergey, Motwani R., Ullman Jeffrey D., Tsur Shalom: Dynamic itemset counting and implication rules for market basket data; Proc. of the 1997 ACM SIGMOD Int. Conf. on Management of Data, 255-264; see www . Sergey Brin is the co-founder of Google
Cheng Patricia W.: From covariation to causation: a causal power theory (aka power PC theory); Psychological Review, 104, 1997, 367-405;
! on p.373 right mid: P(a|i) =/= P(a|i) should be P(a|i) =/= P(a|~i) .
Recent comments & responses by Patricia Cheng and Laura Novick are in Psychological Review, 112/3, July 2005, pp.675-707.
Cohen Jacob: Weighted kappa; Psychological Bulletin, 70/4 (19..), 213-220; on p.214 is K = [ Pa - Pc ]/[ 1 - Pc ] where Pa = the observed proportion of agreement (eg among experts), Pc = the proportion of agreement expected by chance alone
Feinstein Alvan R.: Principles of Medical Statistics, 2002, by the late professor of medicine at Yale, who studied both math & medicine; chap.10, 170-175 are on proportionate increments, on NNT , NNH , on honesty vs deceptively impressive magnified results from some formulae. Chap.17, 332, 337-340 are on fractions, rates, ratios OR(:), risks RR(:).
! On p.340 is a typo: the etiologic fraction should be e(r-1)/[ e(r-1) + 1 ];
! on p.444, eq.21.15 for the negative likelihood ratio LR- should be
! (1 - sensitivity)/specificity; above it should be (c/n1)/(d/n2)
Fitelson Branden: Studies in Bayesian confirmation theory, Ph.D. thesis, 2001, on www ; there I.J. Good's sinh(W(:)/2) should be tanh(W(:)/2); I told him so.
Fleiss Joseph L., Levin Bruce, Myunghee Cho Paik: Statistical Methods for Rates and Proportions, 3rd ed., 2003 ( Sheps' "relative difference" is in the Index, also in the earlier editions)
Good I.J.: see { Hajek www } for 10 commented references to too many papers & notes produced by the prolific author and WWII codebreaker over 50 years.
Hajek Jan: Probabilistic causation indicated by relative risk, attributable risk and by formulae of I.J. Good, Kemeny, Popper, Sheps/Cheng, Pearl and Google's Brin, for data mining, epidemiology, evidence-based medicine, economy, investments. This e-paper is at http://www.humintel.com/hajek
! Find the +Epicenter of this e-paper and +New conversions in it. Also see there my 7 "Construction principles for good association measures", where eg my principle P2 says: OpeRational usefulness is greatly enhanced if measure's WHOLE RANGE of values (not only its bounds) has an opeRationally highly meaningful interpretation (like eg NNT , NNH ).
! Caution: in the just referenced paper the notations Conv(y:x) and Conv(x:y) and also K(y:x) and K(x:y) are swapped wrt the present text. This is no error, it is just a change of notation; see +Causal notation is tough here and now; the swaps are due to my fresh desideratum P4b .
Heckerman David R.: Probabilistic interpretations for MYCIN's certainty factors; pp.167-196 in Uncertainty in Artificial Intelligence, L.N. Kanal and J.F. Lemmer (eds), vol.1, 1986. I succeeded in rewriting his eq.(31) for
! the certainty factor CF2 on p.179 into Kemeny's F(:). Heckerman's Lambdas are RR(:)'s. Heckerman has more fine papers in other volumes of this series of proceedings.
Kahneman Daniel, Slovic P., Tversky Amos (eds): Judgment Under Uncertainty: Heuristics and Biases, 1982; see pp.122-123; Kahneman is a Nobel laureate.
Kahre Jan: The Mathematical Theory of Information, 2002. See the last pages for "(Re)Search hints by Jan Hajek" on http://www.matheory.info . To find in his book formulae like eg Cont(.) use his special Index on pp.491-493. See www.matheory.info for errata + more.
! on p.120, eq.(5.2.8) is P(x|y) - Px = Kahre's korroboration, x = cause;
! on p.186, eq.(6.23.2) is P(y|x) - Py ; risk is no corroboration, y = evidence
Kemeny John G., Oppenheim Paul: Degree of factual support; Philosophy of Science, 19/4, Oct.1952, 307-324. Footnote 1 on p.307 tells that Kemeny
! was de facto the author. Caution: on pp.320 & 324 his oldfashioned p(.,.) is our modern P(.|.). On p.324 the first two lines should be bracketized
! thus: p(E|H)/[ p(E|H) + p(E|~H) ] , find it as F0( in { Hajek www }. An excellent paper by the former math-assistant to Einstein and later co-father of the programming language BASIC.
Kemeny John G.: A logical measure function; Journal of Symbolic Logic, 18/4, Dec.1953, 289-308. On p.307 in his F(H,E) the negation bars ~ over H are
! missing in both 2nd terms. Except for p.297 on the Popperian elimination of models (find SIC now), there is no need to read this paper if you read his much better one of 1952.
Laupacis A., Sackett D.L., Roberts R.S.: An assessment of clinically useful measures of the consequences of treatment; New England Journal of Medicine (NEJM), 1988, 318:1728-1733
Pearl Judea: Causality: Models, Reasoning, Inference, 2000; see at least pp.284, 291-294, 300, 308; his references to Shep should be Sheps , and on
! p.304 in his note under tab.9.3 ERR = 1 - P(y|x')/P(y|x) would be correct,
! which wasn't in Pearl's ERRata on www
Popper Karl: The Logic of Scientific Discovery, 6th impression (revised), March 1972, with new appendices; on corroboration see Appendix IX to his old Logik der Forschung, 1935, where in his Index: Gehalt, Mass des Gehalts = measure of content (find SIC ). His oldfashioned p(x,y) is modern P(x|y), and his confirmation = corroboration C(x,y) is my C(y:x)
Restle Frank: Psychology of Judgment and Choice: a Theoretical Essay, 1961; on p.149 is eq.(7.6) P(corrected) = [ P(yes|T) - P(yes|~T) ]/[ 1 - P(yes|~T) ]
Schield Milo: Simpson's paradox and Cornfield's conditions; ASA 1999 & www ; an excellent multi-angle explanation of confounding, a very important subject seldom or poorly explained in books on statistics. His section 8 can be complemented by reading { Agresti 1984, p.45 } for a definition of Simpson's paradox for events A, B, C
Schield Milo, Burnham Tom: Algebraic relationships between relative risk, phi and measures of necessity and sufficiency; ASA 2002; on www .
Sheps Mindel C.: An examination of some methods of comparing several rates or proportions; Biometrics, 15, 1959, 87-97
Simon Herbert: Models of Man, 1957; see pp.50-51, 54
Vaihinger Hans: Die Philosophie des Als Ob, 1923
Find much more in { Hajek } at http://www.humintel.com/hajek
-.-