.-
Causal insights inside, plus new causal hindrance factor versus
P.W. Cheng's preventive causal power
Copyright (C) 2005-2007, Jan Hajek , NL, version 5.9 of 2007-3-5

The word "my" marks what I have not lifted from anywhere, so it is fresh
and may be contested, but must not be lifted from me without a full
reference to me plus this website :
http://www.humintel.com/hajek contains the latest versions of my e-papers

Abstract: After a warm-up we take 3 cold showers and discover that
Patricia W. Cheng's "preventive causal power" (here PF , as it has
existed as a prevented fraction or preventable fraction) is not
commensurable with Sheps/Cheng's "generative causal power" (here the
generative factor GF ). A new "causal hindrance factor" HF is worked out
as a replacement of PF. The body of this e-paper is a cluster of fresh
causal insights starting with Insight0 which postulates new, very
specific desiderata for causal measures. P4b is my new desideratum for a
weak causal transitivity. P4b breaks the spell of Bayesian inversion &
the double-faced coin of sufficiency-necessity by classifying
probabilistic causation measures M(:) as either (x SufFor y) or as
(y SufFor x) according to their behavior wrt Px < Py or Px > Py. The
absolute difference ARR(y:x) = P(y|x) - P(y|~x) does NOT behave orderly
wrt Px < Py , or wrt Px > Py , while the remaining analyzed measures do.
Analyzed are GF( : ) by Sheps/Cheng ; Conv( : ) by Google co-founder
Sergey Brin ; I.J. Good's Qnec = RR(y:x) = relative risk = risk ratio ;
its functions W( : ) = ln( RR(:) ) = an old "weight of evidence" by
I.J. Good, who was a WWII codebreaker and stats assistant to Alan Turing,
and the factor F( : ) = [ RR - 1 ]/[ RR + 1 ] by Einstein's assistant
John Kemeny, whose F( : ) = CF2 is the "certainty factor" by David
Heckerman ( 1986 at the Stanford MYCIN project, now leading Microsoft's
machine learning research ) ; and C( : ) = corroboration or confirmation
by the most influential philosopher of science Sir Karl Popper .
-.-
CONTENTS: +words are for fast jump-finding; the subsection jumper is .-
Only 220 lines are on the incommensurability of Cheng's PF with Sheps' GF
+Abbreviations
+Wisdom
+Warm-up with logic and P's in 2 pizzagrams plus a 2x2 contingency table
+Cold shower
+Warnings
+Snapshots of insights inside (for busy execs :-)
+Sheps' "relative difference" GF is smart
+Why is Cheng's PF incommensurable with Sheps/Cheng's GF
+HF vs PF served as a palatable fast food
+HF vs PF served as a palatable slow food
+Hajek's hindrance factor HF is commensurable with GF
+Insights into semantics and forms of causal formulae
+Palatable probabilistic logic derived from counting
+Causal notation is tough, but our math-beef shouldn't be
Insight0 : Know what you want : new causal desiderata P2 P4 P4b P5 P8 ;
           P4b: Partial order condition for causation measures M(:)
Insight1 : Understanding GFactors, plus my MaxiMin heuristic
Insight2 : Fresh interpretation of GF as a regression slope/MAXslope
Insight3 : Commensurable pairs of formulae : ( GF & HF ) vs ( PF & QF )
Insight4 : How to rescale a range for more palatable results
Insight5 : Numbers needed to treat or harm : NNT , NNH , plus my NNR
Insight6 : The simplest thinkable necessary condition for CONFOUNDING
Insight7 : Sufficiency and necessity : two sides of a causal coin ;
           PAR(x:y) = population attributable risk
Insight8 : MaxiMin vs Kemeny & Fitelson in clash with Popper & Kahre ;
           C(y:x) , F(y:x) , GF(y:x) stress tested with extreme P's ;
           my paradox of IndependentImplication
Insight9 : Variations on the form (ValueA - ValueB)/(MaxValueA - ValueB)
+Conclusions
+References: papers & books worth (y)our looks

-.-
+Abbreviations :
For non-native English readers:
aka = also known as ; btw = by the way
eg = exempli gratia = for example ; ie = id est = that is
vs = versus ; w/o = without ; wrt = with respect to
For non-native math readers:
~  non , not , negation , complement
== synonymous , equivalent , logical equivalence ie 2-way implication
   ie if and only if ie iff
<> or =/= is not equal
=. is near, close to, approximately equal
b^2 = sqr(b) = b*b = b power 2
causal = possibly possessing a causal tendency
causation measure = indicator of a possible causal tendency
NecFor = necessary for ; SufFor = sufficient for
qed = quod erat demonstrandum = which was to be proved
rv = random variable
XOR = exclusive OR , non-equivalence

-.-
+Wisdom :
Detecting error is the primary virtue, not proving truth.
  { Colin McGinn on Karl Popper in NYRB 2001/11/21, end of p.46 }
If it's not checked, it's wrong. { I.J. Good , a WWII codebreaker in UK }
Know their & thy formulae and thou shalt suffer no disgrace. { JH = me }
One man's necessity is another woman's sufficiency. { JH }
One woman's determinism is another man's randomness. { JH }
Never run after a bus, a (wo)man, or a causal formula, because there will
  be another one in ten minutes. { my politically correct paraphrase of a
  Yale professor who spoke about a bus, woman or cosmological theory }
The true logic of this world is the calculus of probabilities.
  { J.C. Maxwell }
Logic is no doubt unshakable, but it cannot withstand a man who wants to
  live. { Franz Kafka (1883-1924) }
We need evidence-based medicine, not evidence-biased medicine-(wo)men.
  { JH }
There is no universally best method. What is best is data dependent. ...
  We need to learn more about what works best where.
  { Leo Breiman, 1994 }
-.-
+Warm-up with logic and P's in 2 pizzagrams plus a 2x2 contingency table

I don't believe in a "theory of everything" as some physicists &
psychicists do. I just try hard to identify good INDICATORS of causation
tendency. When the reading gets tough, the tough get reading. This
e-paper has one thing in common with an aircraft carrier: there are
multiple cables to hook on and so to land safely on the deck of
Knowledge. There is no safety without some redundancy at critical or
remote points.

Only a minimum of math is needed to get some of my key messages from
this e-paper. Even those who see math as a 4-letter word will understand
that
  0.4 - 0.2 = 0.2  >  0.0004 - 0.0002 = 0.0002   ie - preserves zeroes,
while
  0.2 / 0.4 = 0.5  =  0.0002 / 0.0004 = 0.5      ie / LOST all zeroes,
ie the / LOSES INFORMATION on the magnitude of the numbers; see Insight5.
Hence if such ratios are used as measures of probabilistic CONTRAST then
they INFLATE the results. Clearly, differences and ratios are
incommensurable and should not be mixed as { Cheng 1997 } does by
pairing PF with GF. That's one of the key messages here.
Btw, 0.2 - 0.1 = 0.1 = 0.9 - 0.8 = 0.1 , but
  sqrt(0.2)     - sqrt(0.1)     = 0.13  >  sqrt(0.9) - sqrt(0.8) = 0.054
  sqrt(0.02)    - sqrt(0.01)    = 0.041
  sqrt(0.002)   - sqrt(0.001)   = 0.013
  sqrt(0.0002)  - sqrt(0.0001)  = 0.0041
  sqrt(0.00002) - sqrt(0.00001) = 0.0013
Sqrt(P1) - sqrt(P2) is the heart of the Hellinger distance aka Matusita
distance. Sqrt(P) is called probability amplitude in quantum mechanics.
We see that the information about the magnitudes of the numbers is taken
into account in the sense that the difference is related to the
magnitudes, and zeroes .0000 are not totally lost as in a ratio .

Another easily understood though not totally trivial key message is this
(a simple example is worth 1000 words): a student has correctly answered
2/3 of multiple-choice questions, each with one out of 3 choices.
Q: How good is (s)he really ?
A: GF = (2/3 - 1/3)/(1 - 1/3) = (1/3)/(2/3) = 1/2 = 50% is not
impressive; GF = "goodness factor" corrected the raw rate of 2/3 for the
base chance of 1/3 obtainable by random guessing. The 1st correction of
2/3 , ie (2/3 - 1/3) , DECreased 2/3 down to 1/3. The 2nd, less obvious
correction /(1 - 1/3) INCreased 1/3 up to 1/2, because GF is a "relative
difference" measuring the efficiency of rising above the base chance,
relative to the MAXimal possible success of (1 - 1/3) AVAILABLE for the
rise above the base.
! The straight interpretation of this GF is:
! 50% increase of the MAXimally ACHIEVABLE (wrt the base chance), as this
! 50% of the 2/3 ACHIEVABLE in the denominator
!   = (2/3 - 1/3)/(2/3) = (1/3)/(2/3).
Now we understand how GF-like formulae work; a small sketch of this
arithmetic follows below. Insight1 & Insight2 explain this in a less
casual way. After this easy intro we have to become a bit more exact and
abstract, hence general, so back to basics.
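To make the warm-up arithmetic reproducible, here is a minimal Python
sketch (illustrative only; the function name is mine, ad hoc) of the
canonical "relative difference" g = (Pa - Pb)/(1 - Pb) applied to the
student example, plus the zero-loss of ratios:

def relative_difference(pa, pb):
    # canonical form g = (Pa - Pb)/(1 - Pb), eg Sheps' GF ;
    # corrects the raw rate pa for the base chance pb
    return (pa - pb) / (1.0 - pb)

# student: 2/3 correct, base chance 1/3 by guessing -> GF = 0.5
print(relative_difference(2/3, 1/3))        # 0.5

# differences preserve magnitude, ratios lose it:
print(0.4 - 0.2, 0.0004 - 0.0002)           # 0.2    0.0002
print(0.2 / 0.4, 0.0002 / 0.0004)           # 0.5    0.5   (zeroes lost)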
Basics of Boolean logic : there exist only 4^2 = 16 Boolean operators aka
Boolean functions ie binary logical operations (~ is non ie a complement):

Inputs | function's output :
x op y | f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 10 11 12 13 14 15  are f0 up to f15
-------|--c--c--------------c--c--c--c--------------c--c-|--- c = commutative
0 op 0 |  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1 | DeMorgan's laws:
0 op 1 |  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1 | ~(x or y)=(~x & ~y)
1 op 0 |  0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1 | ~(x & y)=(~x or ~y)
1 op 1 |  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1 |
-------|--c--c--------------c--c--c--c--------------c--c-|--- c = commutative
No fun = f0 and f15 (the constants) ; f3 = y , f5 = x , f10 = ~x ,
f12 = ~y ; each remaining one ( f1 f2 f4 f6 f7 f8 f9 f11 f13 f14 ) is a
genuine fun(x , y).

f4 = x > y hence ~f4 = f11 = (x <= y) = ~(x,~y) = (x entails y) = (x->y)
f2 = y > x hence ~f2 = f13 = (y <= x) = ~(y,~x) = (y entails x) = (y->x)
f2 = ~f13 = ~~(y,~x) = (y,~x) = (y unlessBlockedBy x)
   == (~x,y) = (~x unlessBlockedBy ~y) , formally ; eg:
fire z := burning y unlessBlockedBy the use of x=extinguisher ;
   P(z) = Py - Pxy ;
fire z := burning y unlessBlockedBy ~x=no oxygen ;
   P(z) = Py - P(y,~x) = Py - ( Py - Pxy ) = Pxy is no good; clearly
!  do not use complements if you don't want simplistic results.
f2: if ~x then z:=y  else z:=0 ;   f13: if ~x then z:=~y else z:=1 ;
f2: if  x then z:=0  else z:=y ;   f13: if  x then z:=1  else z:=~y ;
f2 is better understood than f13 , since the result z is just z:=y unless x.
f2: P(y,~x) = Py - Pxy , = Py if Pxy = 0  ie disjoint
                         = 0  if Pxy = Py ie P(x|y) = 1, needs Py <= Px
                         > 0  if Pxy < Py ie P(x|y) < 1.
4+2 = 6 : 4 functions of one variable only, plus 2 constants ie functions
          of neither x nor y.
4*2 = 8 complementary pairs of f#'s (note that 0+15 = 1+14 = .. = 7+8 = 15):
4*2 = 8 are commutative (marked by c ) ie symmetric wrt x, y, hence
        UNsuitable as indicators of causation, which is asymmetric wrt x, y:
c |     no fun   =  ~f0 = f15   no fun
c |  ~(x & y)    =  ~f1 = f14 = (x nand y) = Sheffer function
! |  (~y or x)   =  ~f2 = f13 = (y implies x) = (y->x) = ~(y & ~x) = (~x->~y)
  |                 f13 = [(y or x)==x]
  |     ~(y)     =  ~f3 = f12 = ~y
! |  (~x or y)   =  ~f4 = f11 = (x implies y) = (x->y) = ~(x & ~y) = (~y->~x)
  |                 f11 = [(x or y)==y]
  |     ~(x)     =  ~f5 = f10 = ~x
c | ~(x Xor y)   =  ~f6 = f9  = (x eqv y) = (x == y) = equivalence of x, y
c |  ~(x or y)   =  ~f7 = f8  = (x nor y) = Peirce function

Any (not only those listed here) valid rule (ie an lhs = rhs ) has its
DUAL rule (the DUALITY is mutual ie a symmetric 2-way relation), obtained
thus: change ANDs into ORs, ORs into ANDs, False ie 0 into True ie 1,
1 into 0, but don't change the parentheses and negations (ie non, not, ~).
Examples:  0 & 1 = 0  is dual with  1 or 0 = 1 ;
~(x,~y) = (~x or y)  is dual with  ~(x or ~y) = (~x,y).

Only 2 pairs of f's are non-commutative functions of both variables x, y;
these 2 pairs of complementary functions are ( f2 , f13 ) and ( f4 , f11 ).
Only these might provide the necessary asymmetry for a candidate measure
of causation obtainable from probabilistic logic, which follows from the
isomorphism between logic and measures on sets (actually true metrics m ).
Alas, we shall see that the following property of the Boolean implication
(x implies y) == (~y implies ~x) , aka the CONTRAPOSITIVE property, is
UNdesirable, since a causation measure requires M(x:y) <> M(~y:~x)
because eg "fire x causes smoke y" makes sense, while "no smoke ~y causes
no fire ~x" is NONSENSE . Such examples, and the fact that the
implications and their complements are the only asymmetric functions of
both variables x, y , finish my PROOF that we cannot use any purely
Boolean function as an M(:).
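The counts just claimed are machine-checkable. Here is a short Python
check of my own (the bit-indexing inside f() is chosen only to match the
table above); it enumerates all 16 binary Boolean functions and confirms
that exactly the 2 complementary pairs (f2,f13) and (f4,f11) remain as
asymmetric candidates:

from itertools import product

def f(i, x, y):
    # f_i as in the table above: f0 = constant 0, f1 = AND, f15 = constant 1
    return (i >> (3 - (x + 2 * y))) & 1

pairs = list(product((0, 1), repeat=2))
commutative = [i for i in range(16)
               if all(f(i, x, y) == f(i, y, x) for x, y in pairs)]
trivial = [i for i in range(16)
           if all(f(i, 0, y) == f(i, 1, y) for y in (0, 1))     # ignores x
           or all(f(i, x, 0) == f(i, x, 1) for x in (0, 1))]    # ignores y
asymmetric = [i for i in range(16)
              if i not in commutative and i not in trivial]

print(commutative)  # [0, 1, 6, 7, 8, 9, 14, 15] : the 8 marked c
print(trivial)      # [0, 3, 5, 10, 12, 15] : constants and y, x, ~x, ~y
print(asymmetric)   # [2, 4, 11, 13] : only the pairs (f2,f13) and (f4,f11)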
Also a single probability P(.) or P(.|.) should not be used even as a
corroboration, as is shown in great detail in { Popper , appendix IX ,
pp.387-398 }, and as discussed by Sir Karl Popper vs Leblanc in { The
British Journal for the Philosophy of Science, vol.X, 1960, pp.313-318 }.

Here come 2 visualisations of the same basic law for any elementary
measure m which derives from (hence is isomorphic with) Boolean logic and
set theory (which are isomorphic). Just imagine 2 rectangularly cut
pizzas or pancakes Px and Py placed upon each other so that they
(don't)(partially) overlap :
== is equivalent , synonymous ; = is equal ; ~ non, negation, complement
0 <= m(.) is a measure of a set , eg a number n(.) of elements, N = m(All)
eg P(.) is a measure of a set , P(.) = n(.)/N
, is & is a joint ie an overlap ; U is "or" is a union (w/o an overlap )

Venn diagram enhanced :
 ___ Universe = m(All) ________
|   __m(y)___________          |
|  |        ________|__m(x)__  |
|  | m(y-x) | m(x,y) | m(x-y)| |
|  |________|________|       | |
|           |________________| |
|  m(~x,~y) = m(~(x U y))      |
|______________________________|

More expressive one fits a 2x2 table :
 ___ 1verse = P(All) = 1 ______
|  P(x,y)   | P(x,~y)          |  P(x)
|  == Pxy   | = Px - Pxy       |
|-----------|------------------|
|  P(~x,y)  | P(~x,~y)         |  P(~x) = 1-Px
| = Py -Pxy | = P(~(x or y))   |
|___________|__________________|
    P(y)       P(~y) = 1-Py

---yyyyyyyy,,,,,,,,,,xxxxxxxx--- in 1 dimension, where ,,, are the joint m(x,y)

From 3 axioms :  m(empty set) = 0 ;  m(x) >= 0 for any set x ;
m(x or y) = m(x) + m(y) + 0  if m(x,y) = 0 ie if x, y are disjoint sets ;
all what comes next follows :
m(x-y) == m(x,~y) = m(x) - m(x,y)             P(x,~y) = Px - Pxy == P(x-y)
m(x-y) +m(y-x) +m(x,y) = m(x or y)            P(x,~y)+P(~x,y)+Pxy = P(x or y)
m(x-y) +m(y-x) +m(x,y) +m(~x,~y) = m(All)     P(x,~y)+P(~x,y)+Pxy+P(~x,~y) = 1
m(x or y) + m(~x,~y) = m(All)                 P(x or y) + P(~x,~y) = 1
because m(x or y), m(~x,~y) are disjoint since they can never overlap ;
m(x or y) = m(x) + m(y) + 0  iff m(x,y) = 0 ie if no overlap ;
basic Bayes :
m(y|x).m(x) = m(x,y) = m(y).m(x|y) ;  P(y|x).P(x) = P(x,y) = P(y).P(x|y)
basic bounds ( the Bonferroni inequality is on the lhs ) :
Max[ 0, m(x) + m(y) - m(All) ] <= m(x,y) <= min[ m(x), m(y) ]
Max[ 0, Px + Py - 1 ]          <= Pxy    <= min[ Px , Py ]
Max[ m(x), m(y) ] <= m(x U y) = m(x)+m(y)-m(x,y) <= min[ m(x)+m(y), m(All) ]
Max[ Px, Py ]     <= P(x or y) = Px + Py - Pxy   <= min[ Px + Py , 1 ]
so if Px + Py > 1 we are not completely free to choose any
Pxy <= min[Px, Py] in tests or examples; for more find Bonferroni here
and in { Hajek www }. Disjoint x, y have Pxy = 0 hence x, y are
DEPENDENT, since the condition for independence, ie Pxy = Px.Py , cannot
hold if Px > 0 and Py > 0.
m(~x & ~y)  = m(~(x or y)) ,  P(~x,~y) = P(~(x or y))   by DeMorgan's law
m(~x or ~y) = m(~(x & y))  ,  P(~x or ~y) = P(~(x,y)) = 1-Pxy  via De Morgan
d(x,y) = m(x or y) - m(x,y) = m(x) + m(y) - 2.m(x,y) = m(x-y) + m(y-x)
 = symmetrical distance between x, y = sum of asymmetrical distances
 = is a metric distance since it holds :
   d(x,y) = d(y,x) >= d(x,x) = 0 = d(y,y)
   d(x,y) + d(y,z) >= d(x,z)   is the triangle inequality
For a better interpretability of numerical values, we often want a
measure m(.) NORMALIZED to the scale [0..1] or [0..100%]. Eg:
Q: How should we normalize the overlap Pxy ?
A: m(x,y)/[ m(x) + m(y) ] has in fact the range of only [0..1/2] , so
2.m(x,y)/[ m(x) + m(y) ] might work (in analogy to the harmonic average).
[0..0.5] is due to the MAXimal possible overlap iff m(y) = m(x,y) = m(x).
Hence a SHARPER normed overlap is m(x,y)/[ m(x) + m(y) - m(x,y) ] ,
which in fact is the normed equivalence m(x==y), in some applications
interpretable as SIMILARITY or PROXIMITY. These meanings become clear
when we derive normed DISSIMILARITY as non-equivalence ie XOR ie
symmetrical distance by taking the complement of m(x==y) :
m(x <> y) ie m(x =/= y) = ~[ m(x==y) ] = m(All) - m(x==y) .
For probabilities :
P(x <> y) ie P(x =/= y) = ~[ P(x==y) ] = P(All) - P(x==y) = 1 - P(x==y)
  = 1 - Pxy/( Px + Py - Pxy ) = non-proximity = distance
! = ( Px + Py - 2.Pxy )/( Px + Py - Pxy ) = [ P(x or y) - Pxy ]/P(x or y)
! = [(Px-Pxy)+(Py-Pxy)]/( Px + Py - Pxy ) = [ P(x,~y) + P(y,~x) ]/P(x or y)
is the normed probabilistic DISTANCE , isomorphic with d(x,y)/m(x or y).
It can be visually appreciated in the pizzagram aka Venn diagram above.
It has useful applications as a measure of DISSIMILARITY.
Let n(.) = nr. of elements in a set = cardinality = count of unique elements
P(.) = probability , or its estimate as a proportion of counts n(.)/N ;
note that P(x) = Sum(0+1+1+1+1+0+0+1)/N = E[x] = Expected value of the
0/1 indicator of x ;
H(.) = Shannon's entropies in his information theory are Expected values
H(X,Y) = joint entropy ie it is H(X & Y) ;  I(X;Y) = mutual information
H(.|.) = conditional entropies
Then the key equation for an elementary set-conform measure m(.) captures
the very essence of the COMMON SENSE ( visualized by Venn diagrams ) :
m(a) + m(b) - m(a,b) = m(a U b) = m(a,b) + m(a-b) + m(b-a) , eg:
n(x) + n(y) - n(x,y) = n(x U y) = n(x,y) + n(x-y) + n(y-x)
P(x) + P(y) - P(x,y) = P(x U y) = P(x,y) + P(x-y) + P(y-x)
H(X) + H(Y) - I(X;Y) = H(X,Y)  = I(X;Y) + H(X|Y) + H(Y|X)
where the entropies are expected values ie Sum[ P(.).log(...) ] ;
Px.Py is a FICTITIOUS probability of as-if independent events x, y ;
the Px.Py serves as a REFERENCE point.
H(X)   = -Sum[ Px.log(Px) ] = Sum[ Px.log(1/Px) ] = an expected surprise
H(X|Y) = -SumSum[ Pxy.log(P(x|y)) ]
I(X;Y) = +SumSum[ Pxy.log(Pxy/(Px.Py)) ] = SumSum[ Pxy.(log(Pxy) - log(Px.Py)) ]
       = +SumSum[ Pxy.log(P(x|y)/Px) ]   = SumSum[ Pxy.log(P(y|x)/Py) ]
H(X,Y) = -SumSum[ Pxy.log(Pxy) ] ie joint entropy (Pxy is a joint, no union),
nevertheless H(X,Y) >= H(X|Y) + H(Y|X) ie it behaves as-if a union, and
I(X;Y) behaves as-if a joint within the isomorphism which works, but is
! NOT SEMANTICALLY PERFECT for entropies. That it works follows from, eg:
H(X|Y) + H(Y|X) = H(X,Y) - I(X;Y) = H(X) + H(Y) - 2.I(X;Y)
                = D(X <> Y)  is the Shannon metric , and
[ H(X|Y) + H(Y|X) ]/H(X,Y) = d(X <> Y)  is the Rajski metric , a SHARPLY
normed average probabilistic distance , hence 1 - d(X <> Y) = a normed
measure of dependence between 2 sets of nominal variables X, Y.
This 0 <= d(X <> Y) <= 1 measures any dependence, linear or nonlinear,
between 2 paired sets of random nominal (symbolic, non-numerical) events,
while the classical correlation coefficient -1 <= rho(X,Y) <= 1 is not a
metric and it measures only the linear dependence between two paired sets
of numerical values.
A 2x2 contingency table should be viewed as a Venn diagram consisting
! of 2 partially overlapping RECTANGLES: one vertical (a & c) on the left,
and one horizontal (a & b) above, with the OVERLAP (a), and the REST (d).
! Since "A non-event is an event is an event" (= my paraphrase of Gertrude
Stein speaking about roses though not about non-roses), we are free to
see other pairs of perpendicularly overlapping rectangles in the same
table.
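A compact sketch of mine (the function name is ad hoc) computes the
Rajski metric d = [H(X|Y)+H(Y|X)]/H(X,Y) straight from a 2x2 joint
distribution, so the two pivotal cases can be checked at once:

from math import log2

def rajski_distance(pxy, px_only, py_only, pnn):
    # Rajski metric for a 2x2 joint: P(x,y), P(x,~y), P(~x,y), P(~x,~y);
    # 0 = identical events, 1 = independent events
    joint = [pxy, px_only, py_only, pnn]
    hxy = -sum(p * log2(p) for p in joint if p > 0)          # H(X,Y)
    px, py = pxy + px_only, pxy + py_only
    hx = -sum(p * log2(p) for p in (px, 1 - px) if p > 0)    # H(X)
    hy = -sum(p * log2(p) for p in (py, 1 - py) if p > 0)    # H(Y)
    ixy = hx + hy - hxy                                      # I(X;Y)
    return (hxy - ixy) / hxy     # = [H(X|Y) + H(Y|X)] / H(X,Y)

print(rajski_distance(0.25, 0.25, 0.25, 0.25))   # 1.0 : independent
print(rajski_distance(0.5, 0.0, 0.0, 0.5))       # 0.0 : x == y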
Let y = observed effect or evidence (eg symptoms, bad for the patient ) ;
    x = exposure, treatment, test result, hypothesised cause of y ;
    { ~x = tested negative = OK for the patient , ~y = healthy , in medicine }

n( x, y) = a | b = n( x,~y) || n( x)         n(.,.)/N = P(.,.)
n(~x, y) = c | d = n(~x,~y) || n(~x)         n(.)/N   = P(.)
=============|==============||======         N/N = 1
  n(y) = a+c | b+d = n(~y)  || N = n(x)+n(~x) = n(y)+n(~y) = a+b+c+d

P(x or y) + P(x,y) = Px + Py   see In0 ;
P(x or y) = Px + Py - Pxy = 1 - P(~x,~y) = 1 - P(~(x or y)) ;
P(~x or ~y) = 1 - Pxy   by DeMorgan's law
== is synonymous, equivalence , 2-way implication , if and only if , iff
<> or =/= is not equal ; =. is near, close to, approximately equal
P(.) is a probability, or a proportion; 1-P(.) = P(~.) = complement
Py  == P(y) = n(y)/N = prevalence of y
Px  == P(x) = n(x)/N
Pxy == P(x,y) == P(x & y) = n(x,y)/N = joint probability of (x & y)
P(x|y) = Pxy/Py = n(x,y)/n(y) = a crude estimator of "P of x if y"
       = sensitivity of test x = true positive rate = TPR
       = a crude measure of how much (y entails x) ie (y implies x)
P(y|x) = Pxy/Px = n(x,y)/n(x) = a crude estimator of "P of y if x"
       = positive predictivity of y (is not directly available in medicine)
       = a crude measure of how much (x entails y) ie (x implies y)
P(~x|~y) = n(~x,~y)/n(~y) = specificity of test x = true negative rate = TNR
P( x|~y) = 1 - P(~x|~y) = 1 - spec = false alarm rate of x = P(type I error)
P(~x| y) = 1 - P( x| y) = 1 - sens = false negative rate of x = P(type II error)
P(y|x).Px = Pyx = Pxy = Py.P(x|y)  is the basic Bayes rule symmetrized by me
P(y|x) / P(x|y) = Py/Px  is the basic Bayes rule too; hence:
P(y|x) > P(x|y) == Py > Px ; becomes VITAL in Insight0 , property P4b !
GF(y:x)/GF(x:y) = Py/Px  as I discovered, derived at ++ & in Insight1 !
Determinism (draw a Venn diagram or a pizza or pancake Px inside Py) :
(x implies y) == [ P(x & ~y) = 0 = Px - Pxy ] == [ Px = Pxy ]  is easy ;
clumsy: == P(~x or y) = 1 = P(~x) + Py - P(~x,y) = 1-Px +Py -(Py -Pxy)
Py - Pxy = P(y,~x) = P(~x) - P(~x,~y) = 1-Px - [1-(Px+Py-Pxy)]  via DeMorgan
P(x or y) = Px + Py - Pxy = P(~(~x,~y)) = 1 - P(~x,~y)  by DeMorgan's law
          = Px + Py  for disjoint events ie Pxy = 0
Pxy = Px.Py is just 1 out of 17 equivalent conditions for independent x, y ;
! since Px.Py <> 0 , the Pxy = 0 means that disjoint events are dependent.
E[f(x)] = Sum_x[ Px.f(x) ] = expected value of f(x), so for a random
variable X :
E[ Px ] = Sum_x[ Px.Px ] = Sum[ Px^2 ] is the expected probability of rv X
sqrt( Sum_x[ Px^2 ] ) = length of the vector of Px's , says Pythagoras
Odds =eg= Pdeath/Psurvival = (1-Psurvival)/(1-Pdeath)
Odds(x)   = Px/P(~x) = Px/( 1 - Px )
Odds(x|y) = P(x|y)/P(~x|y) = P(x|y)/[ 1 - P(x|y) ]
Px = Odds(x)/[ 1 + Odds(x) ] ;  P(x|y) = Odds(x|y)/[ 1 + Odds(x|y) ]
RR(y:x) = P(y|x)/P(y|~x) = Odds(x|y)/Odds(x) , note the semi-inversion ;
        = LR(y:x) is a likelihood ratio = sensitivity/( 1 - specificity ).
! RR(:) = LR(:) like any simple ratio LOSES INFOrmation on P's magnitude, as
  0.2/0.4 = 0.0002/0.0004 = 0.0000000002/0.0000000004 = 0.5 ; find NNT NNH NNR
RR( y: x) = P( y| x)/P( y|~x) = 1/RR(y:~x)
       <> RR(~y:~x) = P(~y|~x)/P(~y| x) = [1 - P(y|~x)]/[1 - P(y|x)]
                    = 1/RR(~y:x)
vs ARR( y: x) = P( y| x) - P( y|~x) = absolute risk reduction
  = ARR(~y:~x) = P(~y|~x) - P(~y| x)
! = [1-P( y|~x)] - [1-P( y| x)] = ARR(y:x) , UNLIKE for the risk ratio.
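As a concrete companion to the table above, here is a small sketch of
mine (the counts are hypothetical) that turns the four cells a, b, c, d
into the P's and the two contrast measures used throughout:

def measures(a, b, c, d):
    # a = n(x,y), b = n(x,~y), c = n(~x,y), d = n(~x,~y) in a 2x2 table
    n = a + b + c + d
    p_y_x  = a / (a + b)          # P(y|x)
    p_y_nx = c / (c + d)          # P(y|~x)
    return {
        "Px": (a + b) / n,
        "Py": (a + c) / n,
        "sens P(x|y)":   a / (a + c),
        "spec P(~x|~y)": d / (b + d),
        "ARR": p_y_x - p_y_nx,    # absolute risk reduction
        "RR":  p_y_x / p_y_nx,    # relative risk aka risk ratio
    }

# hypothetical counts: 40 exposed sick, 960 exposed healthy,
# 10 unexposed sick, 990 unexposed healthy
print(measures(40, 960, 10, 990))   # ARR = 0.03 , RR = 4.0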
OR( : ) = [ Pu/P(~u) ]/[ Pv/P(~v) ]  is the odds ratio in general
        = [ Pu/Pv ].[ (1-Pv)/(1-Pu) ] ; in particular, find OR(x,y) :
OR(x,y) = RR(y:x)/RR(~y:x) = RR(y:x).RR(~y:~x) = OR(y,x) ,
          since RR(~y:~x) = 1/RR(~y:x)
Eg: { Restle 1961, 150 } shows a formula by Luce, 1959, as an alternative
to ROC curves for the evaluation of radar operators. Let x = target
really present, alpha = strength of the target stimulus, my y = the
operator's yes :
P(y|x) = alpha.P(y|~x)/[ 1 + (alpha-1).P(y|~x) ]  from which I obtained
alpha = [ P(y|x)/P(y|~x) ].[ ( 1 - P(y|~x) )/( 1 - P(y|x) ) ]
      = RR(y:x)/RR(~y:x) = RR(y:x).RR(~y:~x)  which I recognized as
      = Qnec(y:x).Qsuf(y:x)  where Qnec(y:x) = RR(y:x) and
        Qsuf(y:x) = RR(~y:~x) by I.J. Good { Hajek }; find Qnec , Qsuf
        there and here.
Caution: Max[ 0, Px + Py - 1 ] <= Pxy <= min[ Px, Py ]  MUST always hold,
so if Px + Py > 1 we are not completely free to choose any
Pxy <= min[Px,Py] in tests or examples; for more find Bonferroni here and
in { Hajek www }.
Caution: [ P(x|y) > P(x|~y) ] == [ P(y|x) > P(y|~x) ] , similarly for < , = .
Basic Bayes shows how to INVERT ie mutually convert both conditionings
P(.|.) ; find E(x and OR(x . More on Bayes is in { Hajek www }.
Don't worry, here we shall use Bayes only to expose the invertibility
which is exactly the cause of why nobody can formally decide about the
direction of a probabilistic causation, ie whether x causes y, or y
causes x. However, inside Insight0 my new property P4b may help with such
decisions.
[ n(x,y) - 0.5 ]/n(.) are better estimators if low n(x,y) > 0. No book on
statistics will teach you this. Some books on statistics use n(x,y)+0.5
for all entries in a contingency table in order to get rid of all
n(x,y)=0 , but +0.5 deforms low original counts n(x,y) > 0 exactly in the
wrong direction. If n(x) is low then P(y|x) = n(x,y)/n(x) = 1 occurs too
often and is a crude P ; if n(y) is low then P(x|y) = n(x,y)/n(y) = 1
occurs too often; these will lead to EXTREME values from many formulae,
including some in this e-paper. Such extremes will be prevented by using
the less simplistic yet provably near optimal estimators (I derived the
optimal ones) :
P(y|x) = [ n(x,y) - 0.5 ]/n(x) ,  P(x|y) = [ n(x,y) - 0.5 ]/n(y).
Estimates of probabilities should be 0 < P(.) < 1 for two reasons :
- "never say never", eg about black swans, or white crows (with a genetic
  defect, as I saw one in my garden) ;
- "no /0 ", ie there will be no division by zero in many formulae, eg in
  [ Pxy - Px.Py ]/sqrt( Px.(1-Px).Py.(1-Py) )  { Kemeny 1952 , p.14 }
  = cov(x,y)/sqrt( var(x).var(y) ) = the correlation coefficient for
  events x, y ; more in Insight2 . It is strange that I have seen this
  important (for why, see Insight2 ) formula only in a couple of books.
.-
Interpretations of P's for INDEPENDENT events a, b :
When reading the rules, think about their applications. Independence
holds not only for 2 or more throws of one or more dice, but also more or
less for successful hits, or for failures, eg in reliability studies.
0: P(a & b) = 0 if the events a, b are disjoint ie mutually exclusive,
   hence they are not independent; eg it is impossible that a single
   throw of a die will yield 2 numbers.
1: Pa*Pb = P(both events a & b occur jointly ie "together"). Jointly may
   mean either simultaneously, or in parallel, or sequentially, depending
   on how it is defined. Also "success" and "failure" may be defined as
   needed; one (wo)man's success is another (wo)man's failure.
   = P1.P2...Pk = P( all events a & b &...& k occur jointly )
   = (Pj)^k  if all Pj are equal
2: Pa*(1-Pb) = P(a & ~b) = P(a BUT NOT b) = P(a-b)
   = P(of an asymmetric difference) = 1 - P(a->b)  see 3:
   = Pa - Pa*Pb  is the value for independents
!  = P( a UNLESS b blocks a )  suggests applications
3: P(a implies b) = P(a->b) = P(~(a,~b)) = 1 - ( Pa - Pab )  in general;
   = 1 - (Pa - Pa.Pb)  is the value for independents , in which case
   (a entails b) is questionable, except for the extreme situation
!  described by my paradox of an IndependentImplication . Also see 2:
4: P(a or b) = Pa + Pb - Pa*Pb = P(at least one event occurs)
   = P(one or more events occur) = 1 - P(none occurs) , see 5:
!! = 1 - (1-Pa)*(1-Pb) = 1 - (1 - (Pa + Pb - Pa*Pb))
   including the extreme possibility that ALL events occur.
5: P(~a)*P(~b) = (1-Pa)*(1-Pb) = 1 - (Pa + Pb - Pa*Pb) = 1 - P(a or b)
   = P(not even one occurs) = P(none of both occurs) = P(neither a nor b)
   = P(~a & ~b)  by DeMorgan's rule
   = P(a Nor b)  verbally "neither a nor b"
   = (1-Pa)*(1-Pb)*...*(1-Pk) = P(none occurs)
   = (1 - Pj)^k  if all k Pj's are equal
6: 1 - Pa*Pb = P( a & b will not occur jointly ) , eg P( both parts
!  will not fail jointly (eg within the time interval) )
   = P(a Nand b) = P(~(a & b)) = P(~a OR ~b)  by DeMorgan's rule
   = P(at least one event is not occurring)
   = P(one or more events are not occurring)
-.-
+Cold shower
Consider a single cause-effect relationship. Obviously, if an effect y
and its tentative cause x are independent, then neither does x cause y ,
nor does y cause x. In terms of probabilities, independence of y and x
occurs if Pxy = Px.Py . At least 4*2*2 + 1 = 17 equivalent relations exist :
Pxy - Px.Py = 0
P(y|x) - Py = 0 , P(y|x) - P(y|~x) = 0 , see +Palatable
! P(x|y) - Px = 0 , P(x|y) - P(x|~y) = 0 , ...
Pxy.P(~x,~y) = P(x,~y).P(~x,y) , in OR(:) , is the 17th equivalent ;
~y is the complement of y , eg absence or failure of y ;
y = presence or success ; or vice versa ie complementary meanings.
Disjoint events have Pxy = 0 , hence are not independent since Px.Py <> 0.
Stochastic independence of random events and random variables is the most
important reference point in probability and statistics. Mark Kac wrote:
"Independence is the central concept of probability theory." I say:
"Independence is the Archimedean point of probability and stats.", and
"Understanding (in)dependence is the pons asinorum to probability and
stats." Buridan's pons asinorum aka pons asini means a "bridge of the
asses" (Eselsbruecke in German, ezelsbrug in Dutch). In Holland every
school kid has "het ezelsbruggetje" = a little bridge for the (little)
asses, ie a personal aid to succeed at exams, and hopefully to understand
better. Be reminded that independence implies uncorrelatedness, but not
vice versa; the implication is one-way only, ie it is not an equivalence,
which is a 2-way implication.
After this warm-up we can take a COLD SHOWER : ALAS , the equivalences
hold not only among formulae with = but also with < or > , eg:
[ P(y|x) > P(y) ] == [ P(y|x) > P(y|~x) ] ==  see Insight1
[ P(x|y) > P(x) ] == [ P(x|y) > P(x|~y) ] ==  etc (at least 17
equivalences), hence the following hold simultaneously :
P(y|x) / P(y|~x)       = RR(y:x)  > 1 <  RR(x:y)
[ P(y|x) - P(y|~x) ]   = ARR(y:x) > 0 <  ARR(x:y)
( P(y|x) - Py )        = K(y:x)   > 0 <  K(x:y)
( P(y|x) - Py )/(1-Py) = PAR(y:x) > 0 <  PAR(x:y)
                         GF(y:x)  > 0 <  GF(x:y)
hence we CANNOT GET the direction of causation from M(y:x) > M0 , where
M0 is the value of M(:) for INdependent x, y. This is a cold shower, but
the show must go on.
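The cold shower is easy to verify numerically. A sketch of mine checks,
on random admissible P's, that all these contrasts go positive or
negative together in both directions, so the sign alone cannot reveal
the direction:

import random

def contrasts(px, py, pxy):
    # all of these must share one sign: that symmetry is the cold shower
    p_y_x, p_y_nx = pxy / px, (py - pxy) / (1 - px)
    p_x_y, p_x_ny = pxy / py, (px - pxy) / (1 - py)
    return [pxy - px * py,       # dependence contrast
            p_y_x - p_y_nx,      # ARR(y:x)
            p_x_y - p_x_ny,      # ARR(x:y)
            p_y_x - py,          # K(y:x)
            p_x_y - px]          # K(x:y)

random.seed(1)
for _ in range(10_000):
    px, py = random.uniform(0.05, 0.95), random.uniform(0.05, 0.95)
    lo, hi = max(0.0, px + py - 1.0), min(px, py)   # Bonferroni bounds
    pxy = random.uniform(lo, hi)
    cs = contrasts(px, py, pxy)
    if abs(cs[0]) > 1e-9:                           # skip near-independence
        assert all((c > 0) == (cs[0] > 0) for c in cs)
print("signs always agree: the sign of M(:) cannot give the direction")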
The Question : "Does x cause y, or is it y which causes x ?" is hard to
answer automatically, but my condition P4b , which classifies M(:)'s by
the easy criterion of whether the following equivalence holds or not :
  [ Px < Py ] == [ M(y:x) > M(x:y) ] ,
is a big leap forward. P4b is simple but not simplistic ; eg for the risk
ratio aka relative risk it holds
  [ Px < Py ] == RR(y:x) < RR(x:y) = P(x|y)/P(x|~y)
               = [ Pxy/(Px-Pxy) ].P(~y)/Py
where 1/(Px-Pxy) measures how much (x entails y) ie (x implies y) , since
whenever Px = Pxy it holds Px <= Py and RR(x:y) = oo = infinite.
More follows next (see the sketch below), and then in Insight1 and
Insight7 .
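P4b can be probed numerically. A sketch of mine samples admissible
triples (Px, Py, Pxy) with positive dependence and tests whether
[ Px < Py ] == [ M(y:x) > M(x:y) ] holds for GF, for RR (where the order
flips, as claimed above), and for ARR (where no consistent order exists):

import random

def gf(p_e_c, p_e_nc):
    # GF = [ P(e|c) - P(e|~c) ] / [ 1 - P(e|~c) ]
    return (p_e_c - p_e_nc) / (1.0 - p_e_nc)

random.seed(2)
n = gf_ordered = rr_flipped = arr_ordered = 0
for _ in range(20_000):
    px, py = random.uniform(0.05, 0.95), random.uniform(0.05, 0.95)
    lo, hi = max(px * py, px + py - 1.0), min(px, py)  # positive dependence
    if abs(px - py) < 1e-6 or lo >= hi:
        continue
    pxy = lo + (hi - lo) * random.uniform(0.1, 0.9)    # stay off the edges
    p_y_x, p_y_nx = pxy / px, (py - pxy) / (1 - px)
    p_x_y, p_x_ny = pxy / py, (px - pxy) / (1 - py)
    n += 1
    gf_ordered  += ((px < py) == (gf(p_y_x, p_y_nx) > gf(p_x_y, p_x_ny)))
    rr_flipped  += ((px < py) == (p_y_x / p_y_nx < p_x_y / p_x_ny))
    arr_ordered += ((px < py) == (p_y_x - p_y_nx > p_x_y - p_x_ny))

print(gf_ordered == n, rr_flipped == n)  # True True : GF & RR order strictly
print(arr_ordered, "of", n)              # neither 0 nor n : ARR is not orderly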
-.-
+Warnings
W1: Books, papers, e-papers and even emails contain errors and typos.
The ! marks bugs & typos in the References at the end. Eg in { Kemeny
1953 , p.307 } both second terms p(E,H) should be p(E,~H), in fact
P(E|~H), as follows next:
W2: Notations have changed since the classical works prior to 1965, say.
Example: modern P(x|y) is printed as p(x,y) in { Popper } and { Kemeny }.
W3: Notations change when authors (eg I.J. Good, and me too) better
understand what should be better expressed how. Find +Causal notation is
tough ... Example: I too felt it necessary to change a couple of
notations, so here
K(y:x) = P(y|x) - Py <= 1-Py SIC , was called K(x:y) in { Hajek }
Conv(y:x) = Px.P(~y)/P(x,~y)  was called Conv(x:y) in { Hajek }
          = (Px - Py.Px)/(Px - Pxy)  measures how much (x implies y)
          = (1 - Py)/(1 - P(y|x)) = P(~y)/P(~y|x) = Px/P(x|~y)
W4: Causation is a slippery concept. Implication or entailment is tricky,
both in the interpretation of logic/math and in its notation, as shown :
== equivalent , a 2-way implication , if and only if , iff , synonymous ;
>= greater or equal ; <= less or equal ; = equal ;
=> is meaningless in this e-paper and in Pascal, although many authors
use it (usually in a rounded form) for an implication aka entailment, eg
x => y as x -> y is MISLEADing since it is in CONFLICT with the same sign
for (x is SuperSet of y) ie (y subsetOf x), which means (x is entailed by
y) ie (y entails x), hence
(x <- y) == (y -> x) == (y <== x) == (y subsetOf x) == (y implies x)
         == ~(y,~x) == no(y w/o x) == ~(y AndNot x) == (y entails x)
         == (~y or x)  via DeMorgan ;
my <== works on Booleans represented as 0 for False, 1 for True, and
evaluated numerically, like the thoughtful notation in Pascal where
(y <= x) on LOGICal expressions means (y implies x). Note that
!! Py <= Px on numbers for marginal probabilities naturally suggests that
if Py <= Px then a measure of causation M(:) should measure how much
(y implies x). My condition P4b in Insight0 uses Px < Py as a basic
classifier of what an M(:) measures; find P4b .
Example: This example covers F(y:x) = [ RR(y:x) -1 ]/[ RR(y:x) +1 ] ,
W(y:x) = ln( RR(y:x) ) , and RR(y:x) itself. In his excellent paper,
Einstein's math assistant { Kemeny 1952 , 313 } has designed his F(H,E)
as the "degree of factual support" provided by the evidence E for the
hypothesis H , to measure how much "the hypothesis is a logical
consequence of the evidence. CA8: F(H,E) is 1 if and only if E => H .
The evidence definitely disproves the hypothesis just if it logically
implies its negation, so this corresponds to F = -1." { more "..." in
Insight8 }. His E => H stands for E -> H ie E implies H ie E entails H.
But his F(H,E) = [ p(E,H) - p(E,~H) ]/[ p(E,H) + p(E,~H) ] on p.320 means
F(e:h) = [ P(e|h) - P(e|~h) ]/[ P(e|h) + P(e|~h) ]  in modern notation
       = [ P(e|h)/P(e|~h) - 1 ]/[ P(e|h)/P(e|~h) + 1 ]
       = [ RR(e:h) - 1 ]/[ RR(e:h) + 1 ]
       = 1 iff P(e,~h) = 0 ie Peh = Pe ie Peh/Pe = 1 = P(h|e).
It may seem that F(e:h) measures how much (h implies e), BUT IT ISN'T SO,
as my decomposition shows, and as Kemeny correctly says that (E implies H).
So there is nothing wrong with Kemeny, despite the mix of 2 notational
+ 1 genuine nontrivial semantic confusion arising from the specific math
of F . Note that Kemeny wished his F(H,E) ie our F(e:h) to measure
(e implies h).
F(e:h) = [ RR(e:h) - 1 ]/[ RR(e:h) + 1 ] is a function of the risk ratio
RR(e:h) = P(e|h)/P(e|~h) = risk ratio aka relative risk OF THE EFFECT e
IF h OCCURS. Now h is seen as a possible cause of e , since the risk of e
is what we may want to PREDICT. But MDs or GPs want to REMOVE h as the
NECESSARY condition for e to occur. More in Insight7 .
This all creates a lot of genuine confusion plus errors and typos.
That was the 2nd cold shower.
-.-
+Snapshots of insights inside (for busy execs :-)
In0: Fortunately for us, logic, switching circuits, set theory, hence
also probabilistic logic are 4 isomorphic domains. So eg DeMorgan's laws
and the Bonferroni inequality apply to them. The isomorphism derives from
elementary measure theory ( so easily visualized by means of pancakes or
pizzas P aka Venn diagrams in +Warm-up above ) :
P(x U y) + Pxy = Px + Py ;  U is union; Pxy = joint , intersect , overlap
N(x U y) + Nxy = Nx + Ny   for Numbers of elements in sets x, y. Hence
P(x or y) + Pxy = Px + Py  ie  P(x or y) = Px + Py - Pxy  for
probabilities ;  Pxy = 0 for disjoint x, y.
Eg the formula for P(x iff y) ie P(x if and only if y) ie P(x == y) isn't
obvious, but is easily obtained via the isomorphism if we know that
(x == y) == ~(x XOR y) == complement of an obvious symmetrical difference
P(x XOR y) :
P(x or y) - Pxy = (Px + Py - Pxy) - Pxy = Px + Py - 2.Pxy = P(x XOR y)
{ done, P(x==y) = 1 - P(x XOR y), but let's see more isomorphism in action: }
= P(x,~y) + P(y,~x) = (Px -Pxy) + (Py -Pxy) = Px + Py - 2.Pxy
= P(~(x == y)) = 1 - P(x == y) = P(~(x 2wayImplication y)) = P(~(x iff y))
= 1 - P( (x implies y) & (y implies x) ) = 1 - P( ~(x,~y) & ~(y,~x) )
= P(~[ ~(x,~y) & ~(y,~x) ]) , now use the DeMorgan rule :
= P( (x,~y) or (y,~x) ) ; since these 2 disjoint terms have no overlap :
= P(x,~y) + P(y,~x) - 0
= Px -Pxy + Py -Pxy = Px + Py - 2.Pxy  qed.
So far a demo of the powerful isomorphism .
In1: Paradoxically, for the extreme Py=1 a perfect entailment (x entails
y) exists while x, y are independent , since Pxy = Px.Py = Px ie
P(y|x) = 1 ;
! find IndependentImplication and 0/0 . I prefer to resolve the dilemma
by voting for the independence , since the entailment (aka implication )
is 'degenerated' or a trivial one.
In2 to In5: are 4 insights packed together:
P(y|x)/P(y|~x) <> [1-P(y|~x)]/[1-P(y|x)]   eg y=sick, ~y=healthy, ie
RR(y:x) <> RR(~y:~x)  for the relative risk aka risk ratio , vs
ARR(y:x) = ARR(~y:~x) for the absolute risk reduction :
ARR(:) = P(y|x) - P(y|~x) = [1-P(y|~x)] - [1-P(y|x)]
       = [ Pxy - Px.Py ]/[ Px.(1-Px) ] = cov(x,y)/var(x) = beta(y:x)
       = slope of a probabilistic regression line
         Py = beta(y:x).Px + alpha(y:x) ,
from which follows my fresh interpretation of GF and my HF :
GF = ARR/[ 1 - P(y|~x) ]  which I recognized as
   = slope(of y on x)/[ MAXimal achievable slope, since P(y|x) <= 1 ]
   = beta(of y on x)/[ fictive MAXimal beta, ie as-if , what if ]
and is commensurable with my causal hindrance factor HF (unlike Cheng's
PF ) :
HF = -ARR/[ 1 - P(y|x) ] = -ARR/P(~y|x)  instead of Cheng's
PF = -ARR/P(y|~x) ;
HF = slope(of y on ~x)/[ MAXimal achievable slope, since P(y|~x) <= 1 ]
   = beta(of y on ~x)/[ fictive MAXimal beta, ie as-if , what if ]
In6: M(y:x) <> M(~x:~y) should hold for a measure of causation tendency.
Alas M(y:x) = M(~x:~y) holds for measures based PURELY on entailment,
since for an entailment holds (x implies y) == (~y implies ~x) , which
leads to causal NONSENSE ; find Conv(y:x) = Conv(~x:~y) which is KO for
causation.
RR(y:x) <> RR(~x:~y) is OK , and so is its transform F(y:x) <> F(~x:~y).
Proof of <> :
RR( y: x) = P( y| x)/P( y|~x) = [ Pxy/P(y,~x) ].(1-Px)/Px
     <> RR(~x:~y) = P(~x|~y)/P(~x| y) = [ P(~x,~y)/P(~x,y) ].Py/(1-Py)
        = [ (1 - (Px+Py-Pxy))/P(y,~x) ].Py/(1-Py)  qed.
Note P(y,~x) in both RR's ; this is due to the next equalities :
P(y,~x) = Py - Pxy = P(~x,y) = P(~x) - P(~x,~y)
        = (1-Px) - (1 - [Px+Py-Pxy])  via DeMorgan
1/RR(y:~x) = RR(y:x) <> RR(~y:~x) = 1/RR(~y:x) ;
RR(y:x) <> RR(~x:~y) = 1/RR(~x:y) , causation needs this <> ;
RR(y:x) <> RR( x: y) = 1/RR(x:~y)  is proven thus:
[ Pxy/P(y,~x) ].P(~x)/Px <> [ Pxy/P(x,~y) ].P(~y)/Py  ie
(1/Px - 1)/(Py - Pxy)    <> (1/Py - 1)/(Px - Pxy)  qed.
In7: RR(y:x) = P(y|x)/P(y|~x) = 1/RR(y:~x) <> RR(~x:~y) is OK as desired
   = [ Pxy/(Py - Pxy) ].(1-Px)/Px   are my decompositions
   = [ Pxy/P(y,~x) ].(1-Px)/Px , where 1/P(y,~x) == (y entails x)
   = Pxy.(y implies x).SurpriseBy(x) ;
for both Pxy & Py fixed , RR(y:x) will:
 + INCrease with smaller Px ie with more specific x
 + DECrease with LARGEr  Px ie with more common x , hence RR fits SIC !
While P(y|x) is clearly a crude measure of how much (x entails y) ,
GF = ARR/[ 1 - P(y|~x) ] measures how much (x implies y) ie (x entails y) :
(Px < Py) == GF(y:x) > GF(x:y) = [P(x|y) - P(x|~y)]/[ 1 - P(x|~y) ] , like
(Px < Py) == P(y|x) > P(x|y) = Pxy/Py = the simplest crude (y implies x) ,
vs :
(Px < Py) == RR(y:x) < RR(x:y) = [Pxy/(Px -Pxy)].(1-Py)/Py , (x implies y)
!!! ARR(y:x) = P(y|x) - P(y|~x) does NOT obey ANY consistent ordering, by
either Px <= Py , or consistently by Px >= Py .
In8: P(x|c) > P(x|y) ie Pcx/Pxy > Pc/Py is my new simplest thinkable
necessary condition for a confounding candidate event c to overrule x as
a possible cause of an effect y. Below I show its derivation (from my
decomposition just shown in In7 ). The 2 known necessary conditions
combined were: RR(y:x) < minimum[ RR(c:x) , RR(y:c) ] .
InEtc: Etc, etc.
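Before moving on, the In2-In5 package is easy to verify numerically. A
sketch of mine confirms that ARR(y:x) = cov(x,y)/var(x) and shows GF and
HF as ratios of that slope to its maximal achievable value:

def slope_view(px, py, pxy):
    # ARR(y:x) = cov(x,y)/var(x) ; GF and HF as slope / MAXimal slope
    cov, var_x = pxy - px * py, px * (1 - px)
    p_y_x, p_y_nx = pxy / px, (py - pxy) / (1 - px)
    arr = p_y_x - p_y_nx
    assert abs(arr - cov / var_x) < 1e-12   # ARR is the regression slope
    gf = arr / (1 - p_y_nx)    # slope / max achievable slope, P(y|x) <= 1
    hf = -arr / (1 - p_y_x)    # the same with x and ~x swapped
    return arr, gf, hf

print(slope_view(0.5, 0.05, 0.04))  # ARR = +0.06 : GF = 0.0612 applies
print(slope_view(0.5, 0.05, 0.01))  # ARR = -0.06 : HF = 0.0612 applies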
-.-
+Sheps' "relative difference" GF is smart :
I did all this research because I consider GF a very smart formula, and
one of the best indicators of causation tendency.
My fresh analysis shows why I do & you should like GF :
+ Several field-proven formulae (find Insight9 ) have the following
  general structure :
Derivation 0:
g = [(1-Pb) - (1-Pa)]/(1-Pb) = 1 - (1-Pa)/(1-Pb)  is the Error-form
  = (Pa - Pb)/(1-Pb)  is what I call the canonical form
g <= 1 ;  proof: 0 <= P(.) <= 1 , hence g <= 1 , hence also:
g <= Pa ; proof: g - g.Pb = Pa - Pb ; g = Pa - Pb.(1-g) <= Pa <= 1  ++
g <= Pa is semantically desirable because g is a quasiprobability, eg:
GF = [P(y|x) - P(y|~x)]/[1 - P(y|~x)] , where:
GF <= P(y|x) is desirable since it is desirable to have a measure
M(y caused by x) interpretable as a quasiprobability, and it should hold
M <= P(y|x), with the = only in one extreme case, because P(y|x) says
that x was present, but does NOT discount the ~x , while GF <= P(y|x)
says that y may not be caused by x even when x is present, as other
causes of y, namely ~x , may exist. Hence this <= shows how meaningful GF
is, ie that in its subrange 0 <= GF , GF is measuring a probability
P(y caused by x and surely not by some other cause ~x).  ++
Derivation 1: if P(y|x) > P(y|~x) then P(y|x) = P(y|~x) + GF.P(~y|~x) ,
ie GF is that portion of P(~y|~x) which could become y if exposed.
Hence GF = ARR / P(~y|~x)  ++
Derivation 2: if all x would cause y then P(y,x) = Px ie P(y|x) = 1.
Then GF is the ratio of the actual ARR / the fictive maximum ARR :
GF = [ P(y|x) - P(y|~x) ]/[ 1 - P(y|~x) ]
   = [ P(y) - P(y|~x) ]/[ (1 - P(y|~x)).Px ]   my new form
+ All GF's values >= 0 are easily & instantly interpretable, not just its
  pivotal values 0 and 1. This is not so for F( : ), C( : ), K( : ).
+ GF <= 1 = MAXimum if Pa = 1 = MAXimum of any P, eg P(y|x) = 1
+ GF = 0 = minimum if Pa = Pb (the minimum for Pa >= Pb as required )
+ GF = 0/0 = UNdefined if Py=1 , hence Px = Pxy = Px.Py & P(y|x) = 1 , ie
  in this extreme case x, y are INdependent & (x entails y) ; also
+ GF = 0/0 = UNdefined if Px=1 , hence Py = Pxy = Px.Py & P(x|y) = 1 , ie
  in this extreme case x, y are INdependent & (y entails x) which is
! my PARADOX of IndependentImplication (find it). These ambiguities are
  correctly reflected in 0/0 , which can be pre-checked and signalled
  in/by progs. 0/0 occurs in F(:) too.
- GF's lower bound << -1 ie is not -(its upper bound = 1 ). This calls
  for the introduction of a COMMENSURABLE complementary formula HF :
! HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] has maximum = 1 iff P(y|~x)=1
     = hindrance factor, for use if P(y|x) < P(y|~x) ie if ARR < 0.
+ HF paired with GF has only the positive ( + ) properties listed here.
+ GF was designed as a "relative difference", ie as :
  = difference/( difference AVAILABLE for Pa's above the given Pb )
  = ARR/( fictive MAXimum ARR achievable if Pb is given )
! = slope/( as-if MAXimal slope thinkable if Pb is fixed ) , my insight
! = beta/( as-if MAXimal beta possible if Pb is known ) , my insight ;
  beta is for the probabilistic regression line for events
+ GF satisfies my desideratum P2 as all values 0 <= GF <= 1 are
  meaningfully and quantitatively interpretable, not only its pivotal
  points 0 and 1.
GF(y:x) = [P(y|x)-P(y|~x)]/[1-P(y|~x)] = [Pxy-Px.Py]/[Px.(1-(Px+Py-Pxy))]
GF(x:y) = [P(x|y)-P(x|~y)]/[1-P(x|~y)] = [Pxy-Px.Py]/[Py.(1-(Px+Py-Pxy))]
+ GF(z:z) = (1 - 0)/(1 - 0) = 1   a kind of reflexivity ( maximum reflex )
+ GF(y:x) = GF(x:y) iff Px = Py , a kind of antisymmetry
+ GF(y:x) <> GF(x:y) if Px <> Py , a kind of asymmetry
+ GF(y:x) <> GF(~x:~y) as P5 desires ( contrapositive "=" is UNdesirable )
+ GF satisfies my desideratum P4b (see also Insight7 ) :
! GF(y:x) > GF(x:y) == Py > Px  is a must for a measure of causal
sufficiency; in fact for GF an exact proportionality holds :  ++
GF(y:x)/GF(x:y) = Py/Px  ( proof in Insight1 ) is MEANINGFUL & clear, as it
!! fits with basic Bayes ie with :
!  P(y|x)/P(x|y) = Py/Px , while GF conveys more MEANING, just listed
above; if GF >= 0 then: [ GF(y:x) >= 0 <= GF(x:y) ]  AND :
[ GF(y:x) > GF(x:y) ] == [ Py > Px ] == [ P(y|x) > P(x|y) ]
[ GF(y:x) < GF(x:y) ] == [ Py < Px ] == [ P(y|x) < P(x|y) ]
GF(y:x) tells how much (x implies y) ie (x entails y) ie (x SufFor y).
+ Let's see how GF , RR( : ) , PAR( : ) fit "semantic information
  content" SIC :
  for both Pxy & Px fixed , GF(y:x)  DECreases with INCreasing Py ;
  for both Pxy & Py fixed , GF(x:y)  DECreases with INCreasing Px ;
  for both Pxy & Py fixed , PAR(x:y) DECreases with INCreasing Px ;
  for both Pxy & Py fixed , RR(y:x)  DECreases with INCreasing Px .
+ RR(y:x) = 1/RR(y:~x) = Qnec(y:x) by I.J. Good, and also his
  RR(~y:~x) = 1/RR(~y:x) = Qsuf(y:x) , which inserted into my next rule
  GF(v:u) = 1 - RR(~v:u) = 1 - 1/RR(~v:~u) = 1 - 1/[ 1 - GF(v:~u) ]
  { Hajek , find there +New conversions } yield 2 pairs of commensurable
  GF(:)'s. Commensurability means very similar scale and continuity of
  values near GF = 0. Note that the commensurable pairs have the exposure
  u swapped with ~u. Hence :
GF & HF are commensurable :
GF(y:x)  = 1 - RR(~y:x)  = 1 - 1/RR(~y:~x) = 1 - 1/Qsuf            = GF
         = 1 - 1/[ 1 - GF(y:~x) ] = 1 - 1/[ 1 - HF ]
         = generative factor (x causes y)
GF(y:~x) = 1 - RR(~y:~x) = 1 - 1/RR(~y:x)  = 1 - Qsuf              = HF
         = 1 - 1/[ 1 - GF(y:x) ] = 1 - 1/[ 1 - GF ]
         = hindrance factor by { Hajek }
ARX & PF are commensurable :
GF(~y:~x) = 1 - RR(y:~x) = 1 - 1/RR(y:x)   = 1 - 1/Qnec = ERR      = ARX
          = 1 - 1/[ 1 - GF(~y:x) ] = 1 - 1/[ 1 - PF ]
          = excess risk by { Pearl 2000 }
GF(~y:x)  = 1 - RR(y:x)  = 1 - 1/RR(y:~x)  = 1 - Qnec              = PF
          = 1 - 1/[ 1 - GF(~y:~x) ] = 1 - 1/[ 1 - ERR ]
          = preventive factor { Cheng 1997 }
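These four conversions are mechanical to verify. A short check of mine
runs them on one admissible triple (any other works as well):

px, py, pxy = 0.3, 0.4, 0.2              # an admissible triple
p_y_x  = pxy / px                        # P(y|x)
p_y_nx = (py - pxy) / (1 - px)           # P(y|~x)

rr   = p_y_x / p_y_nx                    # Qnec = RR(y:x)
qsuf = (1 - p_y_nx) / (1 - p_y_x)        # Qsuf = RR(~y:~x)
gf   = (p_y_x - p_y_nx) / (1 - p_y_nx)
hf   = (p_y_nx - p_y_x) / (1 - p_y_x)    # = GF with x and ~x swapped
pf   = 1 - rr                            # Cheng's preventive factor
err  = 1 - 1 / rr                        # excess risk ERR = ARX

assert abs(gf  - (1 - 1 / qsuf)) < 1e-12         # GF  = 1 - 1/Qsuf
assert abs(gf  - (1 - 1 / (1 - hf))) < 1e-12     # GF  = 1 - 1/(1 - HF)
assert abs(hf  - (1 - qsuf)) < 1e-12             # HF  = 1 - Qsuf
assert abs(err - (1 - 1 / (1 - pf))) < 1e-12     # ERR = 1 - 1/(1 - PF)
print(gf, hf, pf, err)   # here HF < 0 since ARR > 0 ; HF is for ARR < 0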
-.-
+Why is Cheng's PF incommensurable with Sheps/Cheng's GF :
Let P(y|x)  = probability of an effect y if its alleged cause x is present
    P(y|~x) = probability of an effect y if its assumed cause x is absent
ARR = P(y|x) - P(y|~x) = absolute risk (reduction) in epidemiology
RR  = P(y|x) / P(y|~x) = relative risk aka risk ratio (loses info in zeroes).
P8: My principle of CONTINUITY and COMMENSURABILITY of results :
SMALL changes in input P's must NOT lead to LARGE changes in the output,
ie in the result from a SINGLE causal tendency measure M(:) . This kind
of continuity must also hold between a PAIR of measures each of which
covers only a part of the whole range of all possible inputs. If a pair
of such formulae is used to cover the whole range of causation values,
then the results must be COMMENSURABLE in general, hence also near the
point where we switch from one formula to the other one. Typically a
switch-over point will be INDEPENDENCE ie ARR = 0 , around which there
cannot be any meaningful causation (the degenerated cases are shown by my
paradox of IndependentImplication ), so near ARR = 0 a measure of
causation M(:) must measure (in)dependence ONLY, and DEPENDENCE IS
SYMMETRICAL wrt x, y. With increasing dependence the ASYMMETRY of values
should increase.
An example: y = an effect, eg death ; x = a cause, eg drinking alcohol.
No drinking is better than too much, but worse than drinking a little.
Drinking too much INCreases our chances of early death. Drinking a little
DECreases our chances of early death from heart disease, a major killer.
So, eg, if 100 people change their habit of drinking M glasses to L
glasses, and then
ARR = P(y|x) - P(y|~x) = 4/100 - 1/100 = 0.03  becomes -0.03 ,
then the thus forced switch-over
from the "generative power" GF = 0.0303 = power of alcohol to cause death,
say, to the "preventive power" PF = 0.75 = power of alcohol to avoid death
would be ABSURD on the scale [0..1], or 0% to 100%. Since we have
accepted GF and its meaning and units for the generative factor, it would
be absurd to use a formula which yields such an absurdly high preventive
factor PF. Yet such an absurd "preventive power" comes out of Cheng's PF,
as shown in the next subsection.
By the way, anytime three or more values of x are evaluated as possible
causes of y, and if only the middle value of x has no dependence on y and
vice versa, then we shall get 2 ARR's and 2 GF's with different signs,
and the GF < 0 will have to be replaced by some other formula with a
range commensurable with the GF > 0.
Absurdly high sensitivity to inputs is a property of chaos, unlikely to
be explained in terms of the cause-effect as we know it.
Abs( ARR ) = Abs( -ARR ) , and Abs( NNT = 1/ARR ) = Abs( NNH = -1/ARR )
provide another argument for my objection against Cheng's PF .
.-
+HF vs PF served as a palatable fast food :
Let P(y|x) = 4/100 = 0.04 > P(y|~x) = 1/100 = 0.01 ;
ARR = P(y|x) - P(y|~x) = 0.04 - 0.01 = 0.03 ; RR = 4 due to the LOST zeroes
GF  = ARR/[ 1 - P(y|~x) ] = 0.03/(1 - 0.01) = 0.0303 = the "relative
difference" from { Sheps 1959 }, reinvented in { Cheng 1997 }.
If we accept and use GF for generative impacts of x on y when ARR >= 0,
then we should use a compatible formula for preventive impacts when
ARR < 0, unless ARR is too close to 0, in which case abs(GF) =. abs(ARR)
approximately. When the situation changes and the values become
P(y|x) = 0.01 < P(y|~x) = 0.04 , then we get
GF = (0.01 - 0.04)/(1 - 0.04) = -0.03/0.96 = -0.03125 < 0 ; RR = 0.01/0.04
GF < 0 has to be replaced by a commensurable companion formula because
GF >= 0 has the subrange 0..1 , while { -oo is -infinity }
GF <= 0 has the subrange -oo..0 , ie both subranges are incommensurable.
This incommensurability of both subranges justifies the need for 2
formulae. Therefore, and because of the incommensurability of PF with GF,
I designed and here justified my hindrance factor HF as a replacement for
Cheng's PF. Since a positive power thinker doesn't like negative numbers,
(s)he prefers to switch from the "generative power" GF < 0 to a positive
formula PF , earlier called "prevented fraction" or "preventable fraction" :
PF = -ARR/P(y|~x) = -(-0.03/0.04) = 0.75 = Cheng's "preventive power" .
If the structural and semantic INCOMPATIBILITY of the denominators in GF
and PF has not struck you before, by now it should be clear that PF=0.75
is totally INCOMMENSURABLE with GF=0.03 . After a tiny change of P's, be
it a shift on a scale, or a swap near the zero point, the results should
be about the same. Small changes in the inputs ( P's ) should yield small
changes in the outputs or results ( GF , PF ). Here PF/GF = 25 is larger
than one decimal order ie PF is grossly incommensurable with GF.
For ARR < 0 my commensurable formula obtains from GF by swapping x and ~x
in GF :
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] = generative factor for y if ~x
   = [ 0.04 - 0.01 ]/[ 1 - 0.01 ] = 0.0303 , my "hindrance factor" ;
HF = GF = 0.0303 , which is good.
So far for low P's. Let's plug in high P's which, by the way, are far
less common than low P's in (medical) practice. Whether the P's are
(im)possible doesn't matter here.
P(y|x) = 98/100 = 0.98 > P(y|~x) = 96/100 = 0.96 ;
ARR = P(y|x) - P(y|~x) = 0.98 - 0.96 = 0.02 ; RR = 1.02
GF  = ARR/[ 1 - P(y|~x) ] = 0.02/(1 - 0.96) = 0.5
because 0.02 = 0.5*(1 - 0.96) ie the numerator is 50% of the denominator
ie of the base. For P(y|x) = 0.96 < P(y|~x) = 0.98 we have ARR = -0.02
and RR = 0.98 ;
GF = -0.02/(1 - 0.98) = -1 < 0 , which a positive power thinker replaces with
PF = -ARR/P(y|~x) = -(-0.02)/0.98 = 0.0204 = Cheng's "preventive power" ,
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] = generative factor for y if ~x
   = [ 0.98 - 0.96 ]/[ 1 - 0.96 ] = 0.5  is my "hindrance factor" and
HF = GF = 0.5 again, while Cheng's PF = 0.0204 is INCOMMENSURABLE with
GF = 0.5 .
Another example: P(y|x) = 39/1000 < P(y|~x) = 40/1000 ;
PF = (40 - 39)/40 = 1/40 = 0.025 = 25/1000  by Cheng (zeroes lost) ;
HF = (0.040 - 0.039)/(1 - 0.039) = 0.00104 = 1/1000  by Hajek,
and indeed, the difference was only 1 per 1000, hence HF is almost exact,
while Cheng's PF is 25 times larger than reasonable. Cheng's
0.025 = 25/1000 is ABSURDly high.
A pair of examples with the difference of 1 case per 1000 :
P(y|x) = 2/1000 = 0.002 < P(y|~x) = 3/1000 = 0.003 , ARR = -0.001
PF = (0.003 - 0.002)/0.003 = 1/3 = 0.333  also for 0.2 & 0.3, or 0.02 & 0.03 ;
HF = (0.003 - 0.002)/(1 - 0.002) = 0.001002 , while:
HF = (0.03  - 0.02 )/(1 - 0.02 ) = 0.0102  , while:
HF = (0.3   - 0.2  )/(1 - 0.2  ) = 0.125
P(y|x) = 8/1000 = 0.008 < P(y|~x) = 9/1000 = 0.009 , ARR = -0.001 as before
PF = (0.009 - 0.008)/0.009 = 1/9 = 0.111  also for 0.8 & 0.9, or 0.08 & 0.09 ;
HF = (0.009 - 0.008)/(1 - 0.008) = 0.001008 , while:
HF = (0.09  - 0.08 )/(1 - 0.08 ) = 0.0109  , while:
HF = (0.9   - 0.8  )/(1 - 0.8  ) = 0.5
Yet another example : P(y|x) = 4/100 = 0.04 > P(y|~x) = 3/100 = 0.03 ;
ARR = P(y|x) - P(y|~x) = 0.04 - 0.03 = 0.01
GF  = ARR/[ 1 - P(y|~x) ] = 0.01/0.97 = 0.0103
Now let P(y|x) = 0.03 < P(y|~x) = 0.04 , ARR = -0.01 , RR = 0.75 ;
GF = -0.01/(1 - 0.04)    = -0.0104 < 0
HF = -(-0.01)/(1 - 0.03) =  0.0103 = GF > 0 above
PF = -(-0.01)/0.04       =  0.25 = 1 - RR in general ;
PF/GF = 25 ie they are incommensurable. The difference is -1 per 100 ie
-0.01 ; Cheng's PF = 0.25 suggests 25 prevented per 100, which is absurd,
or it suggests 1/4, which has a meaning different from GF.
Clearly, if we agree on the meaning of a generative factor GF and its
units, then we must evaluate preventive effects by the same units of
impact. In fact the only reason for not using GF < 0 is that GF < 0 has
the range of (-oo..0) , while GF >= 0 has the range of [0..1] , which are
incommensurable. This incommensurability of PF > 0 with the undisputed
GF > 0 is what I want to point out, and to do away with, by using my HF
instead of PF . Other insights into GF , HF and PF follow, but they in NO
WAY affect my point, which is the fundamental commensurability of my HF
with GF , versus the fundamental incommensurability of Cheng's PF with GF
in { Sheps 1959 }, { Cheng 1997 }.
GF >= 0 or HF >= 0 are "relative differences", ie differences relative
wrt their denominators [ 1 - P(.|.) ]. The values of GF and HF are very
meaningfully interpretable over their whole range, ie in accordance with
my principle P2 . My fresh & simple though not simplistic interpretation
of both GF & HF as ratios of slopes aka beta's is in Insight2 .
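The whole fast-food argument fits in a few lines of Python of my own; for
each pair P(y|x) < P(y|~x) it prints Cheng's PF next to my HF, so the
scale clash is visible at once:

def pf(p_y_x, p_y_nx):
    # Cheng's preventive factor = -ARR / P(y|~x)
    return (p_y_nx - p_y_x) / p_y_nx

def hf(p_y_x, p_y_nx):
    # Hajek's hindrance factor = -ARR / [ 1 - P(y|x) ]
    return (p_y_nx - p_y_x) / (1.0 - p_y_x)

for p1, p2 in [(0.01, 0.04), (0.96, 0.98), (0.039, 0.040),
               (0.002, 0.003), (0.03, 0.04)]:
    print(f"P(y|x)={p1:<6.4g} P(y|~x)={p2:<6.4g} "
          f"PF={pf(p1, p2):.4f}  HF={hf(p1, p2):.6f}")

# PF jumps to 0.75 for (0.01, 0.04) although the risks differ by only
# 0.03, while HF stays near abs(ARR) for low P's, commensurable with GF.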
Also note that :
- for commonly (very) low P's, GF and HF behave as (slight) corrections
  of the numerator ARR aka absolute risk reduction ( absolute = not
  relative ) ie of the absolute difference ie the absolute contrast ;
- for uncommonly high P's, GF and HF tend to behave more like relative
  risks RR(:) , rather than like absolute risks, ie they are no more
  (slight) corrections of ARR .
This last point was communicated to me by professor Peter White of
Cardiff University, UK.
.-
+HF vs PF served as a palatable slow food :
The most important temperature for humankind is when ice starts to melt,
which is zero degrees on the scale of Celsius. Here I use the temperature
measuring analogy in a serious attempt to get my message across not only
to the math-champs but also to the math-chimps. I will keep it simple,
but not simplistic. A simple example shows that GF and PF operate on
different scales. If we are used to measuring non-freezing temperatures
in Celsius, then we should not switch to Fahrenheit or to Kelvin when
water starts to freeze.
A proven way to expose problems is stress testing. With formulae this
means to plug in (somewhat) extreme numbers. In our case we shall use low
probabilities for an effect y given an exposure x. In fact in
epidemiology the probabilities of diseases in a population are usually
very low, fortunately for us. So in fact my numbers will not be extreme
at all, just far enough from values like 0.5, 0.3 or 0.2 which
camouflage, rather than reveal, the true nature and behavior of
probabilistic formulae.
The difference ARR = P(y|x) - P(y|~x) between LOW conditional
probabilities P(.|.) is always near ZERO, indicating the near
independence of x and y (ie of y on x as well as of x on y). Being near
0.0 is for our explanatory purposes quite analogical to being near zero
degrees Celsius, when ice starts to melt or water starts to freeze.
Let y be an effect we are concerned about, and x be an assumed or alleged
cause (eg an exposure to a treatment or to an agent), and let
P(y| x) = 0.04 = probability of an effect y if x is present, and
P(y|~x) = 0.02 = probability of an effect y if x is absent ;
ARR = P(y|x) - P(y|~x) = absolute risk reduction in epidemiology
    = 0.04 - 0.02 = 0.02   ("absolute" as opposed to "relative")
GF  = [P(y|x) - P(y|~x)]/[1 - P(y|~x)] = generative factor for y from x
    = [ 0.04 - 0.02 ]/[1 - 0.02 ] = 0.0204 =. ARR ,  =. means approx.
GF is in fact ARR corrected for the "base effect" (due to a basic
treatment, background probability, or base chance by guessing, like eg
1/m in a multiple choice). The KEY IDEA on which such formulae are built
is a DOUBLE DISCOUNTING of the base chance (base rate, trivial chance,
placebo chance). In fact the GF formula can be viewed as employing
corrections of P(y|x) up to the 2nd order. The 1st-order correction
decreases P(y|x) by P(y|~x) in the numerator ARR, and the 2nd-order finer
tuning in the denominator increases GF. This has its analogy in the
Taylor series with - and + terms.
For more & deeper causal insights see Insight0 up to Insight8 below.
Other formulae with the same structure as GF are below in Insight9 .
Now comes the crunch. Often it happens that P(y|x) < P(y|~x) ie ARR < 0.
Let's have such a case, eg: P(y|x) = 0.02 and P(y|~x) = 0.04 ; then some
may feel the need to use another formula with -ARR in its numerator, if
we want to "stay positive".
Cheng used the following "preventive power" (earlier others have called
it "prevented fraction" or "preventable fraction", hence my PF ) :
PF = [ P(y|~x) - P(y|x) ]/P(y|~x) = 1 - P(y|x)/P(y|~x) = 1 - RR(y:x)
RR(y:x) is called "relative risk" aka "risk ratio" in epidemiology.
RR(y:x) is also called the "Bayesian likelihood ratio" or "Bayes factor".
PF belongs to the category of "proportionate increments". Read, in the
wise and fine book { Feinstein 2002 }, section 10.6.3 on "Possible
deception in proportionate increments", and his sect. 17.6.5 where the
prevented fraction PF is clearly categorized as a kind of relative risk.
To detect and disallow misinterpretation, misinformation, disinformation
and deception, as often practiced by many if not most economic and
"egonomic" interests, we as customers and patients must be aware of the
weak points of their formulae. "Know their & thy formulae and thou shalt
suffer no disgrace!" is my fresh paraphrase of the greatest strategist
ever, Sun-Tzu, 2500 b.PC. We need evidence-based medicine, not
evidence-biased medicine-men & women.
Anyway, here we get PF = [ 0.04 - 0.02 ]/0.04 = 0.5 , which is
UNREASONABLY larger than 0.0204. Unreasonably, since by analogy: if the
temperature drops from 1 degree Celsius above zero (ice starts to melt at
0) to -1 degree Celsius, then we should not switch to another scale on
which abs(-1) = 1 will become 272 degrees Kelvin. Sufficient CONTINUITY
AND PROPORTIONALITY OF SCALE must be maintained in order to keep the
interpretation of numbers opeRationally meaningful.
The reason for such a large DISCREPANCY (0.02 vs 0.5) is Cheng's
denominator, since the numerators of both GF and PF have the same
absolute value abs(ARR). We can see that GF is essentially a (slightly)
up-corrected absolute risk reduction ARR, while PF = 1 - RR(y:x) is a
linear function of a relative risk RR(y:x) < 1. Obviously, it is NOT WISE
to use an absolute measure for a causal generation, and a relative one
for a causal prevention. We also see that the smaller the conditional
probabilities, the smaller the difference between Sheps/Cheng's GF and
the absolute risk reduction ARR in GF's numerator, ie that GF is just a
subtle and too often a negligible 2nd-order correction of ARR. To make it
crystal clear to those who see math as a 4-letter word:
0.04 - 0.02 = 0.02  <  0.4 - 0.2 = 0.2   for ARR , while the relative risks
0.02 / 0.04 = 0.5   =  0.2 / 0.4 = 0.5   for RR ,
so we see that RR and 1-RR are LOSING INFORMATION on the magnitude of the
numbers; see Insight5 .
-.-
+Hajek's hindrance factor HF is commensurable with GF :
In order to fix such an INCOMPATIBILITY of scaling, I claim that the
proper preventive factor, to be used instead of PF (and GF) when ARR < 0,
is
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] = generative factor for y if ~x
   = [ 0.04 - 0.02 ]/[ 1 - 0.02 ] = 0.0204 , my "hindrance factor" .
HF is obtained from GF simply by swapping (the roles of) x and ~x. Hence
HF is the generative factor for y if ~x ie if x is absent. Since too many
formulae have existed for a long time with too many similarly sounding
names which are too suggestive, it is wise to avoid yet another similar
name. In order not to add to the verbal confusion I call my HF a
"hindrance factor".
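Because PF = 1 - RR is a pure ratio measure, it cannot see the magnitude
of the risks. A short check of mine makes the contrast with HF explicit:

pf = lambda p1, p2: (p2 - p1) / p2          # Cheng's PF = 1 - p1/p2
hf = lambda p1, p2: (p2 - p1) / (1 - p1)    # Hajek's HF

for scale in (1.0, 0.1, 0.01, 0.001):
    p1, p2 = 0.2 * scale, 0.4 * scale       # P(y|x) < P(y|~x)
    print(f"{p1:<8.4g} {p2:<8.4g} PF={pf(p1, p2):.3f}  HF={hf(p1, p2):.5f}")

# PF = 0.500 on every line (all zeroes lost), while HF shrinks with the
# risks, staying commensurable with GF and with abs(ARR).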
GF < 0 has to be replaced by a commensurable companion formula: GF >= 0
has the subrange 0..1, while GF <= 0 has the subrange -oo..0 , ie the two
subranges are incommensurable. Therefore, and because of the
incommensurability of PF with GF , I designed and have here justified my
hindrance factor HF as a replacement for Cheng's PF .
In medical practice ARR < 0 is usually near zero. That must be the reason
why the late professor of medicine & epidemiology at Yale University,
Alvan Feinstein, who also had a master's degree in mathematics, had no
difficulty using Mindel Sheps' "relative difference" (our GF ) when
ARR < 0 made GF < 0. In his fine and alas also his final book on the
Principles of Medical Statistics, on the last line of page 174, he used
Sheps' "relative difference" (0.02 - 0.04)/(1 - 0.04) = -0.0208 ; note
that abs(-0.0208) = 0.0208 differs only slightly from the 0.0204 of my HF.
Anyway, the Yale professor Feinstein did not feel any need for a second
formula, probably because his realistic epidemiological ARR's were near 0.
If you read on less casually, you'll be rewarded with fresh deeper causal
insights.
-.-
+Insights into semantics and forms of causal formulae :
.-
+Palatable probabilistic logic derived from counting:
Counting "AT LEAST how many of BOTH PROPERTIES" :
Itanic carried 1000 passengers, 300 were smokers, 900 were drinkers, then
at least 300 + 900 - 1000 = 200 were both ( 200 is the minimal
intersection, or unavoidable joint, or necessary overlap ).
Itanic carried 1000 passengers, 400 women, and 700 have drowned, then
at least 400 + 700 - 1000 = 100 women have drowned, and
at least 600 + 700 - 1000 = 300 others have drowned.
The necessary overlap > 0 can arise only for Nx + Ny > Nall=1000.
The algorithm is:
If (0 < Nx < Nall) & (0 < Ny < Nall) & (Nx + Ny > Nall)
then Njoint >= Nx + Ny - Nall { my special triangle inequality quantified }
where Njoint can be visualized as an overlap obtained by folding two sides
Nx and Ny long of a triangle with the horizontal side Nall long.
Probability is isomorphic (= strongly analogical) with set theory which is
isomorphic with Boolean logic. Therefore "Our fortress is our logic" ( =
my paraphrase of Stan Ulam's "our fortress is our mathematics" ).
Isomorphism is transitive (if A resembles B, and B resembles C, then A
resembles C). The key formulae derive from two (non)overlapping pancakes
or pizzas with areas Px , Py ; P(x,y) == P(x & y) == Pxy = the overlapping
area :
P(x U y) + Pxy = Px + Py ; U is union; Pxy = joint , intersect , overlap
N(x U y) + Nxy = Nx + Ny for Numbers of elements in sets x, y.
Hence P(x or y) + Pxy = Px + Py ie P(x or y) = Px + Py - Pxy for
probabilities, from which the key inequalities obtain (draw your pizzas
aka Venn ) :
!! Pxy <= minimum[ Px , Py ] <= P(x U y) <= Px + Py due to P(.) >= 0 ,
   Max[ Px, Py ] <= P(x or y) <= min[ Px + Py , 1 ]
   Max[ 0, Px +Py -1 ] <= Pxy <= min[ Px , Py ]
where on the lhs the Bonferroni inequality becomes nontrivial only if
Px + Py > 1, in which case Pxy > 0, ie there MUST be a certain minimal
joint Pxy >= lhs. Hence you CANNOT take any 3 numbers such that
0 <= Pxy <= min[ Px, Py ] <= 1 and use them as probabilities; it must
also hold that Pxy >= Max[ 0, Px + Py - 1 ].
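The counting rule and the probabilistic bounds are one and the same thing;
a minimal Python sketch of these Bonferroni-style bounds (the counts are
the Itanic numbers above):

  def joint_bounds(nx, ny, nall):
      # necessary and maximal overlap for counts (or P's if nall = 1)
      lo = max(0, nx + ny - nall)
      hi = min(nx, ny)
      return lo, hi

  print(joint_bounds(300, 900, 1000))  # (200, 300): at least 200 were both
  print(joint_bounds(400, 700, 1000))  # (100, 400): at least 100 women drowned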
In plain English: buy a square serving tray 1x1 meter called 1verse sold
by Universe. Take two pieces of pizza cut into squares with areas Px and
Py, such that each fits into the rather deep tray, and put them into the
tray. If the area Px + Py > 1 square meter, then even if you do your best
to shift the pizzas as far apart as possible within the tray, they have to
overlap by the joint area of AT LEAST Pxy >= Px + Py - 1 which is the
minimal overlap. Forget about underlaps ie negative overlaps. The
Bonferroni inequality is just a truncated version of the
Inclusion-Exclusion principle for probabilities.
DeMorgan's laws aka De Morgan rules in my probabilistic form are:
P(~x,~y) = P(~(x or y)) = 1 - (Px+Py-Pxy) = P(x Nor y) 1st rule or law
P(~x or ~y) = P(~(x,y)) = 1 - Pxy = P(x Nand y) 2nd rule or law
Eg P(x implies y) = P(~(x,~y)) = P(~x or y) via DeMorgan's law , so eg the
common-sensical "No smoke without fire" cannot be meant as
(~s & ~f) == ~(s or f) which is absurd. "No s without f" really means
~(s & ~f) which makes sense : "there can't be (smoke & no fire)", but its
equivalent (~s or f) makes NO SENSE : "there is either no smoke, or there
is fire, or BOTH ~s and f". This shows the WEAKness of (x implies y) ie of
(x entails y) as an indicator of causation. See P5 below. Find DeMorgan in
{ Hajek www } for more of my DeMorganish rules on P's. Just in case you
have difficulties with DeMorgan's rules, I recommend drawing two
overlapping pizzas, or pancakes, aka a Venn diagram.
A verbal formulation of the 1st DeMorgan's law (published 1847) is
enlightening: "It should be noted that the contradictory opposite of a
disjunctive proposition is a conjunctive proposition composed of the
disjunctive contradictories of the parts of the disjunctive proposition."
This is a formulation from the book Summa Logicae, of the year 1323, by
the well-known William of Ockham aka Occam aka Dr. Invincibilis, famous
for his Principle of Parsimony aka Occam's razor : "Pluralitas non est
ponenda sine necessitate" ( plurality is not to be posited without
necessity ), in turbo-talk known as KISS .
Occam's razor has a necessary amendment "ceteris paribus" = "all else
being equal" ( conditions , assumptions , properties , desiderata ),
wherein lies the catch-22, since all other aspects are never really equal
(more below). DeMorgan's rule has been traced by Lukasiewicz back to
Petrus Hispanus (1205-1277). If those guys could formulate it, you should
understand it.
"AT LEAST ONE event" ie "ONE OR MORE hits" follows from the 1st DeMorgan
rule below:
P(at least 1 of e's) = P(e1 or e2 .. or ek) = P(~(~e1,~e2, ..,~ek))
 = 1 - P(~e1 & ~e2 & .. & ~ek) in general ;
 = 1 - ([1-Pe1].[1-Pe2]..[1-Pek]) for all Pe's independent ;
 = 1 - [1-Pe]^k for all Pe's equiprobable and independent
 = 1 - exp( k.ln(1-Pe) ) if (large) power raising is not available
 =. 1 - exp(-k.Pe) for very small Pe & large k ; often Pe = h/m , h << m
Such formulae have plenty of applications when computing probabilities of
hit/miss of firing salvos, of finding when searching (for subs, documents,
by hashing), of functioning/failure of (wo)men or machines in estimates of
reliability : serial connections have Pxyz = Px.Py.Pz for independently
failing or functioning x, y, z ; parallel connections have ( via DeMorgan )
P(x or y or z) = P(~(~x,~y,~z)) = 1 - ([1-Px].[1-Py].[1-Pz]) as above.
An advanced application is in my FlashHash on the same site as
{ Hajek www }.
Of the 16 binary functions in Boolean logic, only 2 pairs (of
complementary functions) are neither trivial functions of a single
variable, nor are they symmetric wrt both variables x,y. One such
interesting logical function is the implication (x implies y) aka
(x entails y) in set theory.
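These salvo/reliability formulae are one-liners; a small Python sketch
(the probabilities are made up, only for illustration):

  import math

  def p_at_least_one(pe, k):        # 1 - [1-Pe]^k , independent, equiprobable
      return 1.0 - (1.0 - pe) ** k

  def p_series(ps):                 # Pxyz = Px.Py.Pz : all must function
      out = 1.0
      for p in ps: out *= p
      return out

  def p_parallel(ps):               # 1 - prod(1-P) via DeMorgan
      out = 1.0
      for p in ps: out *= (1.0 - p)
      return 1.0 - out

  print(p_at_least_one(0.01, 100))     # 0.6340
  print(1 - math.exp(-100 * 0.01))     # 0.6321 =. the above for small Pe
  print(p_series([0.9, 0.9, 0.9]))     # 0.729
  print(p_parallel([0.9, 0.9, 0.9]))   # 0.999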
(x implies y) == ~(x,~y) == ~(~y,x) == (~y implies ~x) is UNDESIRABLE for
causation as exposed in P5 in Insight0 . Desirable transitivity holds :
if (x implies y) & (y implies z) then (x implies z) . Repeatedly find
implies and entails . Causation should also be transitive :
if (x causes y) & (y causes z) then (x causes z) ; see P4b , Insight0 .
So we see that the entailment or implication has a desirable TRANSITivity,
but also an UNdesirable property. We cannot have only the good ones :-(
After thousands of years of human thinking, the notion of causation is
still a slippery one. A nice illustration of the psychological and
fundamental difficulties with using probabilities to capture even the
orientation ie DIRECTION of causation is the following equivalence
relation: P(x|y) > P(x|~y) is equivalent to P(y|x) > P(y|~x), similarly
for < , = , "contrary to the prevailing pattern of judgment" of over 80 %
of university students, wrote the late Amos Tversky & Daniel Kahneman in
their chapter "Causal schemas in judgments under uncertainty" { Kahneman
1982, p.122-3 }. They wrote "implies" but it is an "equivalence" ie a
two-way implication. Daniel Kahneman was awarded the Nobel prize
(economics, 2002) for his and Amos Tversky's 30 years of research into
human thinking in general and into the fallacies of probabilistic thinking
in particular. The statistical literacy movement under the leadership of
professor Milo Schield (Augsburg College, Minneapolis, MN) should adopt my
slogan: "(In)dependence is the bridge of the asses for the masses into
stats."
My argument for why we should keep trying to capture causation in a
formula is based on the successes of Boolean logic since 1848, after 2500
years of slippery logic. Be reminded that after George Boole it took
another 90 years until 1937, when the MIT student Claude Shannon showed in
his master's thesis the isomorphism (= strong analogy) between Boolean
logic and switching circuits. Only then could applications like your PC be
developed. Btw, in 1948 Shannon formulated information theory, also
applied inside your PC, CD, DVD, etc. Alas, there is no guarantee that the
successes of logic will be repeated with causal formulae. For now we
should be satisfied with a good INDICATION of a causal tendency .
The key questions are these : "When to use which formula as the best
indicator of a causal relation ?" , and "Which desiderata are more
important than the others, or than 'undesiderata' ?" My questions assume
honesty, not a desire to get inflated ratios which will impress, mislead,
disinform and deceive the patients and customers for the purposes of
dishonest economic gain and "egonomic" advantage.
Vaihinger's fictionalism - A note on useful As-if fictions :
Find again As-if = "Als ob" from the title of a book by the German
philosopher { Vaihinger 1923 } on useful fictions. Eg the product of the
marginal probabilities Px.Py provides a fictional point of reference for
the dependence of events x and y. Fictional, because Pxy = Px.Py occurs
rarely, but we often wish to contrast Pxy vs Px.Py in Pxy - Px.Py or in
log(Pxy/(Px.Py)) in Shannon's mutual information. I realized that the
Archimedean point of reference { Arendt 1959 } may be a special case of
useful as-if-fictionalism. Maximal possible values also serve as As-If
values, eg for many normalizations, like my interpretation of GF and HF
in Insight2 . The concept of a random event or of a random variable is
also an as-if fiction, since usually the randomness is in the eye of the
OBSERVER .
One of my slogans is: "One man's determinism is another woman's
randomness." "Ceteris paribus" should also be understood as an as-if rule,
simply because of the fictitious "all else being equal".
.-
+Causal notation is tough, but our math-beef shouldn't be
A colorful hotchpotch ( a "bunter Wirrwarr" ) of notations exists. In my
References to Popper and to Kemeny I comment on some of them. Due to the
invertibility of conditionings | via the basic Bayes rule, it would be
naive to think that eg a formula expressed in terms of P(y|x) = Pxy/Px
should be M(y:x) where y:x is a mnemonic for division, say. A more
meaningful alternative notation M(x:y) standing for (x entails y) could be
better, but not all M(:) contain a clear entailment. Also our language
allows for "x implies y" or "y entailed by x". So for now I shall not
change all those countless formulae in { Hajek www } because of the risk
of errors and the effort involved.
Eg here Insight6 contains the relative risk aka risk ratio RR(:) which is
much used in evidence-based medicine ( EBM ) :
RR(y:x) = risk ratio = Bayes factor = Bayesian likelihood ratio
 = Odds(x|y)/Odds(x) , where Odds(z) = P(z)/[ 1 - P(z) ]
 = P(y|x)/P(y|~x) , x = cause, exposure, test result; y = effect
 = [ Pxy/(Py - Pxy) ].(1-Px)/Px are my decompositions
 = [ Pxy/P(y,~x) ].(1-Px)/Px , where 1/P(y,~x) measures (y entails x)
 = [ P(x|y)/P(~x|y) ].(1-Px)/Px my semi-inverted form
 = Pxy.(y implies x).SurpriseBy(x) , since:
 . Pxy . 1/P(y,~x) INCreases with (y entails x) ie as (y implies x) grows ;
 . (1-Px)/Px DECreases with increasing Px , as a SURPRISE should ;
   (1-Px) and 1/Px and (1-Px)/Px are decreasing functions of Px , hence
   they are small for frequent (= unsurprising) x ; SIC ;
   1/P(y,~x) = 1/(Py - Pyx) measures how much (y implies x) ;
see P5 for more on "conviction" by Google CEO { Brin 1997 } :
Conv(y:x) = Px.P(~y)/P(x,~y) { as by Brin , my forms next }
 = (Px - Px.Py)/(Px - Pxy) where 1/P(x,~y) measures how much (x entails y),
 = ( 1 - Py)/( 1 - P(y|x) ) = P(~y)/P(~y|x) = Px/P(x|~y) ie (x implies y),
UNLIKE the (y implies x) inside RR(y:x) = P(y|x)/P(y|~x). We see that
RR(y:x), and hence also F(y:x), contains (y implies x). However RR(y:x) is
known as a measure of increased risk of y due to x ie of y from x.
Clearly, genuine difficulties with designing a consistent notation for
causal M(:)'s exist exactly because causation is still a slippery concept
despite our daily "because", "due to", and "if-then". Therefore you must
not rely on even my rather uniform notations like M(:) . What matters is
INSIDE the right hand side of M(:) . Better understanding, use and
interpretation of results from causal M(:) is based on my rule of
interpretation turned into a desired property P4b .
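My decompositions above are easily verified numerically; a Python sketch
with one arbitrary, made-up, positively dependent triple (Px, Py, Pxy):

  px, py, pxy = 0.3, 0.5, 0.2                      # Pxy > Px.Py = 0.15

  rr     = (pxy/px) / ((py - pxy)/(1 - px))        # RR(y:x) = P(y|x)/P(y|~x)
  rr_d   = (pxy/(py - pxy)) * ((1 - px)/px)        # my decomposition
  conv   = px*(1 - py) / (px - pxy)                # Brin's Conv(y:x)
  conv_d = (1 - py) / (1 - pxy/px)                 # = (1-Py)/(1-P(y|x))
  print(rr, rr_d)                                  # both 1.5555...
  print(conv, conv_d)                              # both 1.5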
.-
Insight0 : Know what you want : new causal desiderata P2 P4 P4b P5 P8
It is the requirements aka desiderata & properties that matter most.
Eg measures of similarity, compatibility, MUTUAL dependence, and of
equivalence should be : symmetric ie S(y:x) = S(x:y) , and transitive
(below). A (probabilistic) "distance measure" should be a "metric" ie be :
non-negative, symmetric and satisfy the triangle inequality. We need
neither a metric nor an equivalence S(x:y) = S(y:x), since a causation
measure must be meaningfully asymmetric wrt x, y. M(y:x) <> M(x:y) is the
1st desideratum. My 7 "Construction principles for good association
measures", all findable as P1 to P7 in { Hajek www }, will not be
reproduced here (find P2 here). I am SHARPening P4 into P4b , illustrating
P4 & P5 , and adding P8 .
P4b: Often it is only the ORDERING in general and RANKing in particular
that matters most in many applications. I discovered that the RANKING of
several 2x2 contingency tables by the values of a measure M(:) applied to
such tables will differ between all pairs of different M(:)'s, except
between the measures Conv(y:x) and PAR(y:x) , and of course between
Conv(x:y) and PAR(x:y) , which produce the same RANKing. Remarkably, both
formulae are asymmetric wrt y and x . They are the "conviction" by
{ Brin 1997 } which is UNacceptable as a measure of causation since
Conv(y:x) = Conv(~x:~y) , and the following two based on the same idea as
Sheps' "relative difference" GF :
PAR(y:x) = ( P(y|x) - Py )/( 1 - Py ) until now a nameless one, and
PAR(x:y) = ( Px.[ RR(y:x) -1 ])/( 1 + Px.[ RR(y:x) -1 ]) the usual form;
!        = ( P(x|y) - Px )/( 1 - Px ) is the Sheps-like form
 = population attributable risk = Levin's attributable risk ;
more on PAR(x:y) is at the end of Insight7 .
For both Pxy & Py fixed , PAR(x:y) DECreases with INCreasing Px :
let P(x1|y) = P(x2|y) = K , then:
! PAR(x1:y) < PAR(x2:y) for Px1 > Px2 ;
proof: (K-Px1)/(1-Px1) < (K-Px2)/(1-Px2) ; cross-multiply, cancel, and use
(1-K) >= 0 : (1-K).Px1 > (1-K).Px2 which holds for Px1 > Px2 , qed.
Hence PAR(x:y) obeys SIC , which could also be deduced from the fact that
K < 1, hence a DECreasing Px will make the numerator INCrease ahead of the
denominator. But we have a proof too.
My new partial order condition for causation measures M(:) :
The fundamental ASYMMETRY of x=cause & effect=y M(y:x) <> M(x:y) from P4
below is now opeRationalized into a more specific & meaningful property
P4b :
! Property P4b for M(y:x) measuring how strongly x causes y (= semantics):
If Px < Py then M(y:x) > M(x:y) else
if Px > Py then M(y:x) < M(x:y) else M(y:x) = M(x:y) ;
for example :
if Px > Py then P(y|x) < P(x|y) = Pxy/Py = crude measure of (y entails x)
P4b is desirable since the more common x is, the less likely is x a
SPECIFIC cause of y. Similarly for y as a tentative cause of x.
P4b introduces a PARTIAL ORDERing relation for causation measures M(:) .
Books on discrete mathematics tell us that a partial ordering must be
REFLEXive , ANTISYMMetric and TRANSITive. The relations = , <= , and the
set inclusion <= (in the Pascal programming language too, also for
Booleans) are partial orderings. I have never seen any attempt to express
causal relations like this. What surely must be new is the SPECIFICITY of
my P4b in relating Px , Py , Pz with M(:)'s.
Reflexivity : (x causes x) is ok, though not very explanatory ;
Antisymmetry : if (x causes y) & (y causes x) then (x == y) , equivalent ;
Asymmetry : if (x causes y) then ~(y causes x)
Transitivity : if (x causes y) & (y causes z) then (x causes z)
"causes" is isomorphic with "included in", "subset of", "entails", and
also with "less or equal" <= . Since Px and Py are numbers, my desideratum
P4b above imposes a specific partial order on measures of causation M(:)
so that my new weak TRANSITivity rule among causation measures M(:) will
be:
!! If ( Px <= Py ) & ( Py <= Pz )
!! then M(y:x) >= M(x:y) & M(z:y) >= M(y:z) & M(z:x) >= M(x:z)
where any = goes only with all other = . Draw 3 concentric pizzas or
pancakes or Venn diagrams to visualize my desired weak transitivity rule
for Px <= Py <= Pz . Insight8 contains my de-conditioned C-form2 and
F-form2 which are needed for insight into F(:) and C(:) .
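A minimal Python check of P4b for Sheps' GF, with one made-up positively
dependent triple where Px < Py :

  def gf_yx(px, py, pxy):
      return (pxy - px*py) / (px * (1 - (px + py - pxy)))
  def gf_xy(px, py, pxy):
      return (pxy - px*py) / (py * (1 - (px + py - pxy)))

  px, py, pxy = 0.2, 0.6, 0.15          # Px < Py , Pxy > Px.Py
  print(gf_yx(px, py, pxy) > gf_xy(px, py, pxy))   # True , as P4b demands
  print(gf_yx(px, py, pxy) / gf_xy(px, py, pxy))   # 3.0 = Py/Px , see Insight1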
I obtained for positive dependence relations Pxy > Px.Py the following
orderings ( the < are in fact <= here ) :
if Px > Py swap < and > in what follows ( the > are in fact >= here ) :
ConV(y:x) == V(y:x) > V(x:y) == ConV(x:y) = Py.P(~x)/P(y,~x) :
(Px < Py) == V(y:x) > V(x:y) = [1 - Px]/[ 1 - P(x|y)] ie (y entails x)
(Px < Py) == GF(y:x) > GF(x:y) = [P(x|y) -P(x|~y)]/[ 1 - P(x|~y) ]
(Px < Py) == Z(y:x) > Z(x:y) = [P(x|y) -Px ]/[ 1 - Px ] = PAR(x:y)
(Px < Py) == K(y:x) > K(x:y) = P(x|y) -Px <= 1-Px SIC { Hajek K(y:x) }
(Px < Py) == P(y|x) > P(x|y) = Pxy/Py = simplistic crude (y implies x)
(Px < Py) == M(y:x) > M(x:y) means that M(x:y) measures (y SufFor x) vs :
(Px < Py) == m(y:x) < m(x:y) means that m(x:y) measures (x SufFor y)
(Px < Py) == RR(y:x) < RR(x:y) = [Pxy/(Px -Pxy)].(1-Py)/Py , (x implies y)
(Px < Py) == W(y:x) < W(x:y) = ln(RR(x:y)) "weight of evidence" I.J. Good
(Px < Py) == F(y:x) < F(x:y) = [ RR(x:y) -1]/[ RR(x:y) +1 ] , J. Kemeny
             = [Pxy - Px.Py]/[ Pxy + Px.Py - 2.Pxy.Py ]
(Px < Py) == C(y:x) < C(x:y) = [Pxy - Px.Py]/[ Pxy + Px.Py - Pxy.Py ]
! 1-Px >= C(y:x) , C(x:y) <= 1-Py ie obeys the SIC of Karl Popper
Find F(e:h) in Insight8 for why { Kemeny 1952 } has not swapped e with h
(ie y with x) in his "factual support" of a hypothesis h by an evidence e.
if Px < Py then Max[ RR(y:x) , RR(x:y) ] < OR(x,y) ; OR(:) is odds ratio ;
if Px > Py then min[ RR(y:x) , RR(x:y) ] < OR(x,y) ; OR(:) is symmetric ;
more relations between RR(:) and OR(:) are near the end of Insight1 .
Let's analyze the frequently used and seemingly simple relative risk :
First recall the equivalences
P(x|y) > P(x|~y) == P(y|x) > P(y|~x) , similarly for < , = ;
P(x|y) / P(x|~y) > 1 == P(y|x) / P(y|~x) > 1 for positive dependence
Hence: RR(x:y) > 1 == RR(y:x) > 1. Here is my insight inside RR(:) :
RR(y:x) = P(y|x) / P(y|~x) LOSES INFORMATION on magnitudes ( find LOSING
INFORMATION above ) & does NOT satisfy P4b , because :
 = [Pxy/(Py - Pxy)].(1-Px)/Px = Pxy.(y entails x).DecreasingFun(Px)
!! = Pxy.(y implies x).SurpriseBy(Px) where implies works AGAINST the
!  SurpriseBy(Px) because:
!  - SurpriseBy(Px) directly DECreases with increasing Px , while:
!  + (y implies x) INdirectly INCreases with increasing Px . Nevertheless
!! (y implies x) OVERRULES because Pxy/(Py - Pxy) is more SENSITIVE .
!  So RR(y:x) is a MIXED measure of how much (y entails x) DISCOUNTED for
!  the LACK of SURPRISE or LACK of INTERESTINGNESS in a larger Px for a
   frequent or commonly occurring event x . Such a lack fits the SIC
   principle.
This all carries over to :
W(y:x) = ln(RR(y:x)) = "weight of evidence" promoted by I.J. Good, and
F(y:x) = [ RR(y:x) -1 ]/[ RR(y:x) +1 ] , hence it does NOT satisfy P4b ;
C(y:x) is isomorphic with F(y:x), so no wonder it does NOT satisfy P4b .
! My dissection (or deconstruction) of RR(:) provides us with nontrivial
insights into RR(:) and its consistent behavior wrt Px < Py . It is
! amazing that ARR(:) does NOT behave consistently under Px < Py or
Px > Py :
ARR(y:x) = P(y|x) - P(y|~x) does NOT satisfy ANY consistent ordering by
either Px < Py , or consistently by Px > Py .
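That ARR(:) obeys no consistent ordering shows in two concrete cases
(Python; the triples are made up). Via the beta-slope view in Insight2
below, ARR(y:x)/ARR(x:y) = Py.(1-Py)/(Px.(1-Px)) , so the ratio can fall
on either side of 1 even with Px < Py fixed:

  def arr_yx(px, py, pxy): return (pxy - px*py) / (px*(1 - px))
  def arr_xy(px, py, pxy): return (pxy - px*py) / (py*(1 - py))

  for px, py, pxy in [(0.2, 0.4, 0.12), (0.2, 0.9, 0.19)]:
      print(px < py, arr_yx(px, py, pxy) > arr_xy(px, py, pxy))
  # True True  : here ARR(y:x) > ARR(x:y)
  # True False : here ARR(y:x) < ARR(x:y) ; no consistent ordering wrt Px < Py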
RR(y:x) = fun(y implies x) UNLIKE the (x implies y) in "conviction" by
Brin : Conv(y:x) is a PURE entailment measure of (x entails y) , UNLIKE
RR(y:x) !
Conv(y:x) = "the less (x,~y), the more (x implies y)" a PURE implication :
Conv(y:x) = Px.P(~y)/P(x,~y) = "the more (x implies y), the larger
Conv(y:x)".
Conv(x:y) = Py.P(~x)/P(y,~x) = "the more (y implies x), the larger
Conv(x:y)".
 = (Py - Px.Py)/(Py - Pxy) , where 1/P(y,~x) is an implementation of
P(~(y,~x)) allowing for the numerator Py.P(~x) which provides for the
FIXED semantically pivotal point = 1 for independent x, y. The straight
form P(~(y,~x)) = 1 - P(y,~x) would NOT allow for the creation of such a
pivot. P(~(y,~x)) = "P of no(y w/o x)" ie (y entails x).
All these M(:)'s behaved consistently (as specified), only ARR(:) did
NOT !!
!! It would be a cheap trick to swap the ?(y:x) with ?(x:y) and thus get
all if Px < Py then ?(y:x) > ?(x:y) ; find +Causal notation is tough . The
!! point is that with P(y|x) > P(x|y) and with GF(y:x) > GF(x:y) we know
better what the Px < Py does to a measure than we can see in some other
formulae. Therefore I said above "not rely on notation M(:)" (find rely ),
"What matters is inside the formula", and that has to be verified as I did
here. Don't confuse the meal with the menu M(:) .
Lonely P(.|.)'s are not functions of both Px and Py, hence they are only
crude measures of entailment, despite 0 <= Pxy <= minimum[ Px, Py ].
That a P(.) or P(.|.) cannot serve as a strength of evidence on y from x
ie from x on y, or as a degree of confirmation or as a corroboration is
shown in great detail in { Popper , appendix IX , pp.387-398 }, and is
discussed by Sir Karl Popper vs Leblanc in The British Journal for the
Philosophy of Science, vol.X, 1960, pp.313-318.
P4: was just strengthened into a more specific P4b, but here are some
meaningful illustrations of the asymmetry required by P4 and P4b:
If Px = Py then M(y:x) = M(x:y) is required and easily met; else
M(y:x) <> M(x:y) a genuine asymmetry MUST hold for a causation measure
ie <> must hold after any formal conversion of M's. Eg:
E(y:x) = [ P(y|x) - Py ]/[ P(y|x) + Py ] { Popper p.400 }
 = E(x,y) = [ Pxy - Px.Py ]/[ Pxy + Px.Py ] symmetry revealed
 = E(x:y) = [ P(x|y) - Px ]/[ P(x|y) + Px ] { my inversion }
Due to its symmetry E(:) is ok as a measure of dependence, but K.O. as a
causation measure M(:) . My view of its general structure is
[a-b]/[a+b] = ([a-b]/2)/([a+b]/2) = average deviation from the average
P5: This shows that the operation implication cannot capture causation:
M(y:x) = M(~x:~y) is generally UNDESIRABLE since it is isomorphic with the
logical implication (or entailment ) for which the CONTRAPOSITIVE logical
property holds : (x implies y) == ~(x,~y) == ~(~y,x) == (~y implies ~x)
as a Venn diagram will show (draw 3 pizzas: Px in Py in a PizzaVerse).
Let's test the contrapositivity with folks' wisdom: "Where there is smoke,
there is fire.", which I formalize as
( S implies F ) ie If there is smoke then there is fire ;
( ~F implies ~S ) ie If there is no fire then there is no smoke ; but
"Smoke causes fire" makes NO SENSE
"No fire causes no smoke" makes sense .
Apparently my formalization was wrong, so I change it into
( F implies S ) ie If there is fire then there is smoke ;
( ~S implies ~F ) ie If there is no smoke then there is no fire ; but
"Fire causes smoke" makes sense
"No smoke causes no fire" is NONSENSE .
It makes sense in logic , eg:
( R implies S ) as If it rains then it is slippery
( ~S implies ~R ) as If it isn't slippery then it isn't raining , but it
makes no sense for causation :
(R causes S) makes sense , but (~S causes ~R) is NONSENSE .
Contrapositivity is an UNDESIRABLE property for a causal tendency measure
since eg "Rain causes us to wear a raincoat" is ok, but "Not wearing a
raincoat causes no rain" makes NO SENSE, as the later Nobel laureate
Herbert Simon pointed out { Simon 1957, pp.50-51 }.
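The contrapositivity of the implication is a 4-row truth-table fact; a
tiny Python check:

  from itertools import product

  def implies(a, b):                 # material implication ~(a & ~b)
      return (not a) or b

  for x, y in product([False, True], repeat=2):
      assert implies(x, y) == implies(not y, not x)
  print("(x implies y) == (~y implies ~x) in all 4 cases")
  # fine for logic, UNDESIRABLE for causation, as argued above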
Therefore the "conviction" measure in { Brin 1997 }, co-authored by Sergey Brin, the co-founder and CEO of Google, is unsuitable to measure causation because conviction is based purely on probabilistic implication : Conv(y:x) = Px.P(~y)/P(x,~y) as by Brin ; my forms next: = (Px - Px.Py)/(Px - Pxy) = how much (x entails y) = Px/P(x|~y) = P(~y)/P(~y|x) = (1-Py)/[ 1-P(y|x) ] = Conv(~x:~y) is UNDESIRABLE , as pointed out in { Hajek www } Conv(y:x) = 0 if x,y are independent (designed into the numerator); Conv(y:x) = oo = infinite if x implies y ie if P(x,~y)=0 == if true is ~(x,~y) , where ~(.) obtains from the / in Brin's form. GF(y:x) <> GF(~x:~y) , GF(x:y) <> GF(~y:~x) are OK for causation . F(y:x) <> F(~x:~y) is desirable ; { Kemeny 1952 } has designed F to hold: F(y:x) = -F( y:~x) , F(~y:x) = -F(~y:~x) , worked out in { Hajek www } F(x:y) = -F( x:~y) , F(~x:y) = -F(~x:~y) , worked out in { Hajek www } which is no wonder since F(:) = ( RR(:) -1)/( RR(:) +1) which rescales RR(:)'s range of [0..1..oo] to F(:)'s [-1..0..1] , and since it holds : RR(y:x) <> RR(~x:~y) is a desirable property of RR(:) RR(y:x) = 1/RR( y:~x) , RR(~y:x) = 1/RR(~y:~x) RR(x:y) = 1/RR( x:~y) , RR(~x:y) = 1/RR(~x:~y) My paradox of an INDependent IMPlication : In the extreme case of Py=1 when P(y|x)=1 ie x implies y, as well as in the extreme case of Px=1 when P(x|y)=1 ie y implies x, in these !!! extreme cases x,y are also independent at the same time. Find my !!! IndependentImplication paradox in Insight8 . P8: My principle of CONTINUITY and COMMENSURABILITY of results : SMALL changes in input P's must NOT lead to LARGE changes in the output ie in the result from a SINGLE causal tendency measure M(:) . This kind of continuity must also hold between a PAIR of measures each of which covers only a part of the whole range of all possible inputs. If a pair of such formulae is used to cover the whole range of causation values then the results must be COMMENSURABLE in general, hence also near the point where we switch from one formula to the other one. Typically a switch-over point will be INDEPENDENCE ie ARR=0 , around which cannot be any meaning- ful causation (degenerated cases shows my paradox IndependentImplication ), so near ARR=0 a measure of causation M(:) must measure (in)dependence ONLY, and DEPENDENCE IS SYMMETRICAL wrt x, y. With increasing dependence the ASYMMETRY of values should increase. For an example find P8: far above. .- Insight1 : Understanding GFactors, plus my MaxiMin heuristic The word "power" has too many connotations, and is too trendy, so it should better be avoided, and replaced by the term "factor", for which there exists a genuine semantic justification : P(y|x) = P(y|~x) + P(~y|~x).GF , proof : = P(y|~x) + [1 - P(y|~x)].GF = P(y|~x) + [1 - P(y|~x)].[ P(y|x) - P(y|~x) ]/[1 - P(y|~x)] = P(y|~x) + P(y|x) - P(y|~x) = P(y|x) qed. which shows how to formally construct GF-like factors, eg again for P(y|x) but this time wrt a basic exposure or treatment b we get : GF(y:x:b) = [ P(y|x) - P(y|b) ]/[ 1 - P(y|b) ] which illustrates my slogan "A non-event is an event is an event" with apologies to Gertrude Stein who spoke similarly about a rose, tho not about a non-rose. Formalities: - from P(y|x) <= 1 folows Abs(GF) <= 1 and similarly for other such factors - if P(y|x) = 1 ie Pxy=Px then GF = 1 and v.v. (an equivalence ) - if P(y|~x) = 0 ie P(y,~x)=0 then GF = P(y|x) Semantics: Factors should carry meanings, not just be formally correct. 
P(y|~x) = base rate of y-ers even though they are ~x-ers , eg non-smokers;
were all (~y,~x)-ers to become x-ers , some of them would become
(y,x)-ers .
P(~y,~x) = all the remaining ~x-ers still free to become y after x;
P(~y|~x) = 1 - P(y|~x) = proportion of the remaining ~x-ers available for
y WERE they to become x-ers; this shows the WHAT-IF ie the COUNTERFACTUAL
nature of GF ; or:
"Raising the base P(y|~x) by a fraction GF of P(~y|~x) yields P(y|x)"
! ie: P(y|x) = P(y|~x) + GF*P(~y|~x) = P(y|~x) + GF*[ 1 - P(y|~x) ]
GF = effectivity factor of x to make/generate/cause y from those who have
NOT yet become y due to causes OTHER than x.
The interested reader is advised to study { Sheps 1959 } or to read the
much easier { Fleiss 2003, pp.122-4, optionally 133, 151-2, 156 }.
Clearly "factor" is the proper word for a multiplicative term like GF in
the just shown semantically rich formula.
One piece of good news is that GF ie GF(y:x) <> GF(~x:~y). To find out why
this is good news, see Insight0 and do find again UNDESIRABLE in
{ Hajek www }. The bad news is that honest scientists cannot hide the
dilemma of which factor (or other probabilistic formula) to use for
causation in general and how to FORMALLY even assign the roles to x and y
in particular. Namings like eg sufficiency and necessity may help, but not
much { Hajek }. The dilemma is
GF(y:x) =                             versus GF(x:y) =
 [P(y|x) - P(y|~x)]/[1 - P(y|~x)]     vs [P(x|y) - P(x|~y)]/[1 - P(x|~y)]
 [Pxy -Px.Py]/[Px.(1 -(Px+Py-Pxy))]   vs [Pxy -Px.Py]/[Py.(1 -(Px+Py-Pxy))]
 [ P(y|x) - Py ]/P(~x,~y) = GF(y:x)   vs [ P(x|y) - Px ]/P(~x,~y) = GF(x:y)
where by DeMorgan's rule P(~x,~y) = 1 -(Px+Py-Pxy) = P(~(x or y)) .
We see that the originally duplicate asymmetry (in GF's numerator and its
denominator) is really a single asymmetry in the denominator only, for
which there hold the following equivalences among probabilistic dependence
relations :
[ P(y|x) > Py ] == [ P(x|y) > Px ] == [ Pxy > Px.Py ]
which is symmetric wrt random events x, y, so it is impossible to formally
decide which of the two GF factors to prefer, while they yield different
numerical values. Nevertheless, let me try to heuristically decide between
the two possible generative factors. From the formulae
[ Pxy -Px.Py ]/[ ... ] above it is clear that
GF(y:x) / GF(x:y) = Py / Px = P(y|x) / P(x|y) by Bayes , hence:
[ GF(y:x) > GF(x:y) ] == [ Py > Px ] == [ P(y|x) > P(x|y) ]
[ GF(y:x) < GF(x:y) ] == [ Py < Px ] == [ P(y|x) < P(x|y) ]
GF(y:x) measures how much (x implies y) ie (x entails y) ie (x SufFor y).
This is good because the more widespread x is, the less likely is x a
SPECIFIC cause of y. Similarly for y as a tentative cause of x. Draw a
Venn diagram and do find again Venn , Py < Px , Px < Py in { Hajek }.
By this reasoning I arrived at my MaxiMin heuristic rule for which GFactor
to use when ranking conservatively by
GFactor = minimum[ GF(y:x) , GF(x:y) ] :
If n(x,y) > few then { few > 1, say 4, since "Einmal ist keinmal" }
  { the < is correct if we want the minimum value, see a dozen lines above }
  if Py < Px then use GF(y:x) else use GF(x:y) ,
which is equivalent to GFactor = minimum[ GF(y:x) , GF(x:y) ]
which is a conservative heuristic protecting an automated data-mining
program like my KnowledgeXplorer KX from "false positives" wrt causal
tendency. The output of KX is sorted by the values of GFactors and a human
user can focus his or her attention (our scarcest resource) on the pairs
of events x, y with a high GFactor obtained in the just described MaxiMin
mode.
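A minimal sketch of the MaxiMin heuristic in Python (the counts and the
guard 'few' are illustrative; the n(.)'s come from a 2x2 table):

  def gfactor_maximin(n_xy, n_x, n_y, n_all, few=4):
      # conservative GFactor = minimum[ GF(y:x) , GF(x:y) ]
      if n_xy <= few:
          return None                    # "Einmal ist keinmal": no verdict
      px, py, pxy = n_x/n_all, n_y/n_all, n_xy/n_all
      p_nx_ny = 1 - (px + py - pxy)      # P(~x,~y) by DeMorgan
      gf_yx = (pxy - px*py) / (px * p_nx_ny)
      gf_xy = (pxy - px*py) / (py * p_nx_ny)
      return min(gf_yx, gf_xy)

  print(gfactor_maximin(30, 60, 120, 300))  # 0.1 , here = GF(x:y) since Px < Py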
Similarly with my causal hindrance HFactor. My MaxiMin protects against
the extreme Px=1 and/or Py=1; if Px =. Py then GF(y:x) =. GF(x:y) anyway.
Just in case you don't like MaxiMin, let me tell you about the
quantitative relations between the two most frequently used formulae in
evidence-based medicine:
OR(y:x) = OR(x:y) is the symmetric odds ratio OR(:) under P4b above, and
RR(y:x) <> RR(x:y) is the asymmetric risk ratio == relative risk
Odds(z) = Pz/(1-Pz) in general, hence the odds ratio ie a ratio of odds is
OR( : ) = [ Pu/P(~u) ]/[ Pv/P(~v) ] in general
 = [ Pu/Pv ].[ P(~v)/P(~u) ] ; in particular we want : OR(x,y) =
 = [ P(x|y)/(1-P(x|y)) ]/[ P(x|~y)/(1-P(x|~y)) ] = OR(x:y) from which:
 = [ P(x,y)/P(~x,y) ]/[ P(x,~y)/P(~x,~y) ]
 = [ Pxy.P(~x,~y) ]/[ P(x,~y).P(~x,y) ] = 1 if x,y independent
 = a.d/(b.c) in a 2x2 contingency table
 = [ P(x|y)/P(x|~y) ]/[ P(~x|y)/P(~x|~y) ] = LR+ / LR- ; the conditionings
   annul:
 = [ Pxy/P(x,~y) ]/[ P(~x,y)/P(~x,~y) ] = (a/b)/(c/d) in a 2x2 table
 = [ P(y|x)/P(~y|x) ]/[ P(y|~x)/P(~y|~x) ] = OR(y:x) , symmetric wrt x,y
 = [ P(x,y)/P(~y,x) ]/[ P(y,~x)/P(~y,~x) ] = (a/b)/(c/d) in a 2x2 table
   qed.
 = [ P(y|x)/P(y|~x) ]/[ P(~y|x)/P(~y|~x) ] = RR(y:x)/RR(~y:x)
 = [ P(y|x)/P(y|~x) ].[ P(~y|~x)/P(~y|x) ] = RR(y:x).RR(~y:~x) = OR(y,x)
 = OR(:)
which is an even more impressive example of Bayesian inversion than
E(x,y) .
if Px < Py then Max[ RR(y:x) , RR(x:y) ] < OR(x,y) ;
if Px > Py then min[ RR(y:x) , RR(x:y) ] < OR(x,y) ;
if Pxy = Px.Py then RR(y:x) = RR(x:y) = OR(:) = 1 { 0 = ARR } else
if Pxy > Px.Py then 1 < Max[ RR(y:x) , RR(x:y) ] < OR(:) { 0 < ARR } else
if Pxy < Px.Py then 1 > min[ RR(y:x) , RR(x:y) ] > OR(:) > 0 { ARR < 0 };
Hence the symmetric OR(x,y) is a bound on both RR(:)'s. Depending on the
dependence, OR(:) is an upper bound, or a lower bound as shown. The moral
of this is that lots of medical researchers and specialists are used to
working and living with OR as an upper or lower bound on risk. C'est la
vie.
I know that the complexities of the world including human thinking cannot
be captured in a formula. I do not believe in a "theory of everything" as
some physicists & psychicists do. I just try hard to arrive at good
INDICATORS of causal tendency. With my heuristic rule I am sticking my
neck out like a giraffe. Feel free to cut me down with your logic,
counterexamples, and your common sense, but be aware of the sad fact that
too often our common sense is a common nonsense, especially when dealing
with uncertainties, as the recent Nobel prize winner Daniel Kahneman and
the late Amos Tversky have shown eg in { Kahneman }.
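The identity OR = RR(y:x).RR(~y:~x) and the bounding property can be
checked numerically; a Python sketch with one made-up positively dependent
table:

  px, py, pxy = 0.3, 0.5, 0.2
  p_x_ny  = px - pxy                    # P(x,~y)
  p_nx_y  = py - pxy                    # P(~x,y)
  p_nx_ny = 1 - px - py + pxy           # P(~x,~y)

  odds_ratio = (pxy * p_nx_ny) / (p_x_ny * p_nx_y)     # a.d/(b.c)
  rr_yx   = (pxy/px) / (p_nx_y/(1 - px))               # RR(y:x)
  rr_xy   = (pxy/py) / (p_x_ny/(1 - py))               # RR(x:y)
  rr_nynx = (p_nx_ny/(1 - px)) / (p_x_ny/px)           # RR(~y:~x)
  print(odds_ratio, rr_yx * rr_nynx)     # both 2.666...
  print(max(rr_yx, rr_xy) < odds_ratio)  # True : OR is the upper bound here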
.-
Insight2 : Fresh interpretation of GF as a regression slope/MAXslope
GF = [ P(y|x) - P(y|~x) ]/[ 1 - P(y|~x) ] which I recognized to be:
 = slope(of y on x)/[ MAXimal achievable slope, since P(y|x) <= 1 ]
 = beta(of y on x)/[ fictive MAXimal beta, ie as-if , what if ]
from the probabilistic regression line y = beta.x + alpha , which is not
the statistical regression line from the books on stats.
HF = [ P(y|~x) - P(y|x) ]/[ 1 - P(y|x) ] which I recognized to be:
 = slope(of y on ~x) /[ MAXimal achievable slope, since P(y|~x) <= 1 ]
 = beta(of y on ~x) /[ fictive MAXimal beta, ie as-if , what if ]
Proof of ARR = beta(of y on x) ie beta(y:x) :
Events can occur or not; they are binary variables aka Bernoulli r.v.'s
aka indicator events x, y for which the following expected values hold :
E[x] = Px ; E[x.y] = Pxy due to x = 0 or 1 , y = 0 or 1
E[x^2] = E[x.x] = Pxx = Px = E[x] due to x = 0 or 1
cov(x,y) = E[x.y] - E[x].E[y] = Pxy - Px.Py ; cov(x,x) = var(x)
var(x) = E[x.x] - E[x].E[x] = Px - (Px)^2 = Px.(1-Px)
cov(x,y)/var(x) = beta(y:x) recall covariance / variance , hence
 = (Pxy - Px.Py)/(Px.(1-Px)) is 0 if x,y are independent random events
 = slope of a probabilistic regression line Py = beta(y:x).Px + alpha(y:x)
! = [ P(y|x) - Py ]/(1-Px) = P(y|x) - P(y|~x) ( checks as an equation )
 = ARR = absolute risk reduction of y if x ( or increase if x is "bad" ),
qed.
beta(y:x)*beta(x:y) = square( correlation coefficient for events x, y )
 = coefficient of determination for events x, y
E[x.y] <= sqrt( E[x^2].E[y^2] ) is the Cauchy-Schwarz inequality
Pxy <= minimum[ Px , Py ] <= sqrt(Px.Py) ; the sqrt bound is weaker than
the minimum
Pxy is a DOT PRODuct aka inner product aka scalar product of x, y.
The cosine of the angle between the events viewed as-if they were vectors
is :
cos(x,y) = E[x.y]/sqrt( E[x^2].E[y^2] ) = Pxy/sqrt(Px.Py)
 = sqrt( P(y|x).P(x|y) ) = geometricAverage( P(y|x) , P(x|y) ) <= 1
due to 0 <= P <= 1 this cosine cannot be negative, ie 0 <= cos(x,y) <= 1;
Px.Py is in fact a fictive probability of the as-if independent events
x, y with Px, Py as marginals ; find as-if .
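The proof can be double-checked numerically in Python (one arbitrary,
made-up triple):

  px, py, pxy = 0.3, 0.5, 0.2
  beta_yx = (pxy - px*py) / (px*(1 - px))     # cov(x,y)/var(x)
  arr     = pxy/px - (py - pxy)/(1 - px)      # P(y|x) - P(y|~x)
  print(beta_yx, arr)                         # both 0.2380... , qed numerically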
.-
Insight3 : Commensurable pairs of formulae : ( GF & HF ) vs ( PF & QF )
P(.) and its complement 1-P(.) may be probabilities of success and
failure, or vice versa. It's all a matter of semantics. When your
physician calls a test result "positive", it usually is bad news for you.
One man's success is another woman's failure. By complementing ie negating
the semantics we can transform our formulae into other forms shown below.
Too many names given to formulae are too suggestive. Any conditional
probability allows for 2*4 versions P(y|x) P(y|~x) P(~y|x) P(~y|~x), ... ,
from which at least 17*2 = 34 formulae for a probabilistic CONTRAST can be
formed. If possible, it is better to use a single unifying view rather
than more of the dividing views. In our case here we can avoid semantic
confusion and mathematical mistakes if we use a single "generative factor"
allowing also the negative sign if ARR is near 0.0. Or we can use two
pairs of generative factors, comparable only within each proper pair, but
not across the pairs. Each such generative factor has a unique
qualification of what is generated from what. So we do not have to doubt
or meditate upon whether "preventive" refers to the effect, or to the
assumed cause, or to both; see the 4 P(.|.)'s above.
The 1st commensurable pair:
GF = generative factor for y from x see above and use 1-P(.)
     to get = 1 - [1-P(y|x) ]/[1-P(y|~x)] UNlike 1 - RR(y:x)
HF = generative factor for y from ~x see above and use 1-P(.)
     to get = 1 - [1-P(y|~x)]/[1-P(y|x) ] UNlike 1 - RR(y:x)
The 2nd commensurable pair:
PF = generative factor for ~y from x (my proof of these semantics follows)
QF = generative factor for ~y from ~x (left as an exercise to the reader)
PF = [ P(y|~x) - P(y|x) ]/P(y|~x) = 1 - P(y|x)/P(y|~x) = 1 - RR(y:x)
   = [ P(~y|x) - P(~y|~x) ]/[ 1 - P(~y|~x) ] in canonical form
which proves that the "truly unmistakable meaning" of PF is "generative
factor for ~y from x"; just compare its canonical form with that of GF ;
qed. However, Insight1 warns us that "unmistakable" is meant only in a
semi-formal sense. Regardless of how satisfied someone might be with this
"true meaning", it is essential that my requirement of CONTINUITY and
COMMENSURABILITY P8 is met.
.-
Insight4 : How to rescale a range for more palatable results
A suitable scale for the results from a measure is psychologically
important. Both Fahrenheit and Celsius have linear scales, but the Celsius
scale carries more meaning in its pivotal points: 0 = melting temperature
of ice, and 100 = boiling temperature of water at sea level. The range
[0..100] for water as a fluid is easier for mental calculations and
imagination than [32..212]. On a logarithmic scale are the decibel dB, pH,
and the Richter scale for earthquakes, because:
- human sensory perception follows the physiological Weber-Fechner law
- logarithms turn huge numbers into psychologically manageable ones
- logarithms turn "power law" curves and exponential growth into lines
- logarithms turn multiplication into easier mental addition
Rescalings make results more palatable, but they should be co-monotonic.
If we plug our non-negative Pa = P(y|x) and Pb = P(y|~x) into the
following functions(Pa, Pb), then all these measures will be co-monotonic,
although with different ranges and different FIXED semantic PIVOTAL
points.
Pa - Pb is an absolute dependence measure ARR = P(y|x) - P(y|~x) scaled
  [0..1] for Pa > Pb , or [-1..1] in general, with 0 if x,y are fully
  independent
Pa / Pb is a relative dependence measure RR = P(y|x)/P(y|~x) scaled
  [0..1..oo) with 1 if x,y are fully independent.
log(Pa / Pb) is scaled (-oo..0..oo) ; it is W(y:x) findable here
(Pa - Pb)/(Pa + Pb) is scaled [-1..0..1] , I call it the kemenyzed range
  = ( diff/2 )/(average) makes sense
  = (Pa/Pb -1)/(Pa/Pb +1) here F(:) = [ RR - 1 ]/[ RR + 1 ] , nearby
  W(y:x) , and
Pa/(Pa + Pb) is scaled [0..1/2..1] ; find F0 here.
Such a normalization to the range [-1..0..1] makes sense, as shown, but
(Pa - Pb)/(1-Pb) = GF is semantically more specialized, hence stronger.
(a-b)/(1-b) works only for numbers 0 < a,b < 1 = MAX[a,b], but then
(a-b)/(1-b) carries more meaning than a less specialized formula :
(a-b)/(a+b) works for any numbers a,b <> 0 ; it is a true metric for
a,b >= 0, but we do not require a causal measure to be a true metric which
obeys the triangle inequality. Metricity is nice, but if not needed then
it is not a desideratum, and the more important property P2 of a clear
meaning over the WHOLE RANGE makes me prefer GF over Kemeny's F(y:x) which
has a clear meaning only at its 3 pivotal points -1, 0 and 1.
As said, they are co-monotonic (= isotone), and they have 2 (or 3) FIXED
semantic pivots directly interpretable as independence, and as maximum
(and as minimum), but not all of the above functions have their
non-pivotal values directly interpretable. While RR(:) , ARR(:) , and
especially 1/ARR = NNT are directly interpretable over their WHOLE range,
others are not : F( : ) by Kemeny , C( : ) by Popper , Conv( : ) by
Google's CEO Brin . The infinite upper bounds of RR( : ) and of Conv( : )
are not nice for the psyche.
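The co-monotonicity is visible from a few rows of numbers (Python; the
(Pa, Pb) pairs are made up):

  import math

  def rescalings(pa, pb):
      rr = pa / pb
      return (pa - pb,                  # ARR : pivot 0
              rr,                       # RR  : pivot 1
              math.log(rr),             # W   : pivot 0
              (pa - pb)/(pa + pb),      # F   : kemenyzed [-1..0..1]
              (pa - pb)/(1 - pb))       # GF  : Sheps' relative difference

  for pa, pb in [(0.02, 0.04), (0.04, 0.02), (0.4, 0.2)]:
      print([round(v, 3) for v in rescalings(pa, pb)])
  # all five columns rise and fall together; only ranges & pivots differ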
.-
Insight5 : Numbers needed to treat or harm : NNT , NNH , plus my NNR
Since for commonly low P(.|.)'s the GF is nothing but a finely tuned ARR,
and since an opeRationally highly meaningful practical formula is the
"number needed to treat" NNT = 1/ARR > 0 , I designed NNR = 1/GF :
NNT = |1/ARR| was introduced in { Laupacis & Sackett & Roberts 1988 }
NNT = number needed to treat for 1 more |or 1 less| beneficial effect y,
      low NNT = good, successful, effective treatment x ;
NNH = number needed to harm 1 more |or 1 less| by side effect z,
      high NNH = good, almost harmless treatment x ;
NNS = number needed to screen to find 1 more |or 1 less| case,
      low NNS = good, effective screening;
|1/ARR| is the most realistic general measure of health effects, as it
!! is the least abstract & least exaggerating ie most HONEST, and
!! UNlike RR(:), OR(:) or any other rate ratio, it does NOT "throw away
all information on the number of dead" { Fleiss 2003, p.123 on Berkson's
index aka ARR }. Moreover NNT, NNS, NNH measure EFFORT
!!! ie COSTS PER EFFECT.
If ARR=0 ie y,x independent, then 1/ARR = oo ie infinite.
NNR by Jan Hajek : NNR = 1/GF > 0 is my Number Needed for 1 more Relative
effect; if GF < 0 then switch to its commensurable counterpart HF .
dNNR = 1/ARR - 1/GF (if ARR > 0) = Hajek's difference of "Nrs Needed",
     = P(y|~x)/ARR = P(y|~x)/[ P(y|x) - P(y|~x) ] = 1/( RR - 1 ) = 1/RRR
1/dNNR = RR(y:x) - 1 = RRR = relative risk reduction if ARR > 0.
NNH(x:z)/NNT(x:y) is highly informative too; it should be >> 1 ie many
more have to be x-treated before 1 z-harm will occur, while many more
patients have y-improved already. NNH/NNT is in the fine infokit by Steve
Simon at http://www.childrens-mercy.org/stats .
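The "numbers needed" for the earlier example P(y|x) = 0.04 and
P(y|~x) = 0.02, as a Python sketch:

  p_yx, p_ynx = 0.04, 0.02
  arr = p_yx - p_ynx
  gf  = arr / (1 - p_ynx)
  nnt  = 1/arr              # 50 : treat 50 for 1 more effect y
  nnr  = 1/gf               # 49 : my Number Needed for 1 more Relative effect
  dnnr = nnt - nnr          # 1.0 = 1/(RR - 1) = 1/RRR
  print(nnt, nnr, dnnr, 1/(p_yx/p_ynx - 1))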
.-
Insight6 : The simplest thinkable necessary condition for CONFOUNDING
Let me present in a more palatable form what I have done in { Hajek www }.
Confounding is a serious problem when searching for a true cause of y , as
there usually are at least two candidates x, c for a true cause of y.
Confounders ie alternative candidate causes confuse our perception and
understanding of causation. Let's make the search for & research of
confounders easier & less expensive.
Only to derive my simplest necessary condition for c to overrule x as a
cause of y, let's assume as-if we have available three relative risks aka
risk ratios. I say as-if since my simplest condition does not really need
any of the following three RR's; that's my point: the less data we need,
the better.
RR(y:x) = P(y|x)/P(y|~x) , and
RR(y:c) = P(y|c)/P(y|~c) , and
RR(c:x) = P(c|x)/P(c|~x) .
Naturally, a necessary (but not sufficient) condition for c to be a more
likely candidate than x as a cause of y is RR(y:c) > RR(y:x) .
Another, less natural, necessary condition is RR(c:x) > RR(y:x) ,
necessary for c to overrule x as a cause of y , by Jerome Cornfield in
1959, reproduced in the Appendix of { Schield 1999 }.
Caution! If only some necessary conditions are TRUE then c does not yet
exclude x as a cause of y. Only if all (partial) necessary conditions are
TRUE can c overrule x . Although it is plausible that there exists (ie
that we can formulate) only a small number of necessary conditions, there
are infinitely many possible confounders c . Therefore I say : "The best
we can say of a cause is that it has not yet been refuted". Also: "No
amount of experiments can ever prove me right; a single experiment may at
any time prove me wrong."
said Einstein, who was paraphrased by E.W. Dijkstra: "No amount of testing
can show the absence of bugs, only their presence."
"Experiments can only falsify a theory or a hypothesis." is my Popper-ism.
"Absence of evidence is not evidence of absence." wrote { Doug Altman &
Martin Bland, BMJ 311, 1995/8/19, p.485 }
Anyway, combining the above necessary conditions yields my new one:
!! RR(y:x) < minimum[ RR(c:x) , RR(y:c) ] , and its equivalent
!! P(x|y) < P(x|c) AND RR(y:x) < RR(y:c) .
Clearly, they are equivalent if P(x|y) < P(x|c) == RR(y:x) < RR(c:x)
which derives from the fact that
RR(y:x) < RR(c:x) == P(y|x)/P(y|~x) < P(c|x)/P(c|~x)
has the same conditionings |. on both sides of the < , hence the
conditional P(.|.)'s can be turned into joint P(.,.)'s since the
conditionings annul :
P(y|x)/P(y|~x) < P(c|x)/P(c|~x) ; where P(~x)/P(x) annul, and only the <
still holds, NOT the values :
P(y,x)/P(y,~x) < P(c,x)/P(c,~x) ; now obtain the inverted conditionings :
P(x|y)/P(~x|y) < P(x|c)/P(~x|c) values as on the preceding line ;
P(x|y)/[1-P(x|y)] < P(x|c)/[1-P(x|c)] values as on the preceding line ;
!! P(x|y) < P(x|c) proves the equivalences , qed.
P(x|c) > P(x|y) is my SIMPLEST THINKABLE necessary condition for a
candidate c to overrule x as a possible cause of y. Originally I derived
it from the decompositions
RR(c:x) = [ Pcx/(Pc - Pcx) ].(1-Px)/Px
RR(y:x) = [ Pyx/(Py - Pyx) ].(1-Px)/Px
which readily suggest that (1-Px)/Px can be dropped from Cornfield's
inequality RR(c:x) > RR(y:x) which becomes
[ Pcx/(Pc - Pcx) ] > [ Pyx/(Py - Pyx) ] ie
[ Pcx.Py - Pcx.Pyx ] > [ Pyx.Pc - Pyx.Pcx ] where the Pcx.Pyx annul, hence
Pcx.Py > Pyx.Pc hence Pcx/Pyx > Pc/Py hence
!!! P(x|c) > P(x|y) my simplest necessary condition for c to overrule x
!!! P(x|c) - P(x|y) > 0 my simplest necessary absolute boost Ab > 0 needed
!!! P(x|c) / P(x|y) > 1 my simplest necessary relative boost Rb > 1 needed
[ P(x|c) = P(c|x).Px/Pc ] > [ P(y|x).Px/Py = P(x|y) ] by the Bayes rule ;
so
!!! P(c|x)/Pc > P(y|x)/Py my Bayesian boost condition for c to overrule x
!!! P(c|x) > P(y|x).Pc/Py 2nd form of the necessary condition for c over x
    P(y|x) < P(c|x).Py/Pc 3rd form of the necessary condition for c over x
Imitating the Polish mathematician Hugo Steinhaus (a math prof. of Stan
Ulam, the father of the peacekeeping H-device, whose mother was the
US-Hungarian Ed Teller), you may ask "Wo ist der Witz ?" ie What's the
point ? The point is that the researcher does not have to evaluate all the
other NECESSARY (sub)conditions after any single one of them is found to
be FALSE, in which case c becomes an UNCONVINCING competitor of x for
potential causation of y , and NO other necessary condition for
confounding can possibly be simpler than my simplest thinkable one
P(x|c) > P(x|y); draw a Venn diagram (or pizzas or pancakes).
Calcs and PC's make calculations easy, but we seldom get all the data we
would like to have. Even the best medical journals show only bits of data
for lack of space and other economic reasons. So if a simpler condition
like eg mine requires less data, we may be able to do what otherwise would
be impossible. Dzatz dz witz. For more do find again confound in
{ Hajek www }.
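A sketch in Python of how a screening loop can use the simplest condition
as a cheap first filter; the P's and RR's are made-up inputs, and TRUE
here never proves confounding, it only fails to refute c :

  def c_still_in_the_race(p_x_given_c, p_x_given_y, rr_yc, rr_yx):
      # evaluate the cheapest NECESSARY condition first ...
      if p_x_given_c <= p_x_given_y:
          return False               # c refuted, no need to look further
      # ... then the more data-hungry one RR(y:c) > RR(y:x)
      return rr_yc > rr_yx

  print(c_still_in_the_race(0.8, 0.5, 3.0, 2.0))  # True : c not yet refuted
  print(c_still_in_the_race(0.4, 0.5, 3.0, 2.0))  # False: c is UNCONVINCING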
.-
Insight7 : Sufficiency and necessity : two sides of a causal coin
Causation is like the 2-faced Roman god Janus (January is named after
him). One face is SUFficiency, the other face is NECessity. They go
together; they are 2 components of causation, something like the real and
imaginary parts of a complex number.
This analogy is not too bad, since necessity is often based on imagined
CounterFactual reasoning ie on what WOULD BE IF ( in German: was waere
wenn ; also find as-if ) the situation WOULD NOT BE the factual one
{ Sheps 1959 pp.87,89,91,92 } { Pearl 2000, p.284 }.
I have written enough on sufficiency and necessity in { Hajek www }, so
I'll mention only the basic facts of causal life. A reminder:
[ P(y|x) > P(y|~x) ] == [ P(x|y) > P(x|~y) ]
this equivalence holds also for = and < on both sides of the == .
The most simplistic , crude , naive measures of causal tendency are:
P(y|x) = Pxy/Px = Sufficiency of x for y { Schield 2002, Appendix A }
                = Necessity of y for x { follows from the next line: }
P(x|y) = Pxy/Py = Necessity of x for y { Schield 2002, Appendix A }
                = Sufficiency of y for x { follows from above }
To understand this, draw two pizzas or pancakes as (almost) overlapping
targets with the area Px inside Py. Imagine ourselves as as-if archers :
If we want to hit the larger Py then it is sufficient (but not necessary)
to hit the smaller Px to be sure we hit the larger Py. If we want to hit
the smaller Px then it is necessary (but not sufficient) to hit the larger
Py , which is a prerequisite (= conditio sine qua non) but in no way a
guarantee of hitting the smaller Px . My fresh desideratum P4b has much to
do with sufficiency and necessity .
{ Schield 2002, p.1 } says: "But epidemiology focuses more on identifying
a necessary condition [h] whose removal would reduce undesirable outcomes
[e] than on identifying sufficient conditions whose presence would produce
undesirable outcomes". Also see his Appendix A, first lines left & right.
In his section 2.2 on necessity vs. sufficiency , prof. Milo Schield
nicely explains their contextual semantics and applicability ( all [.] by
JH ) :
"Epidemiologists may focus more on necessity than sufficiency. [They] may
want to REDUCE disease incidence more than they want to PREDICT disease.
Focusing on necessity may be more important for them than focusing on
sufficiency since eliminating a necessary condition is sufficient to
prevent the outcome [e]. Unless an effect [e] can be produced by a single
sufficient cause [h] (RARE!), producing the effect requires supplying ALL
of its necessary conditions [h_i], while preventing it [e] requires
removing or eliminating only ONE of those necessary conditions [h_i] ."
Caution: the suffixes nec, suf as used by various authors say nothing
about which event is necessary for which one, as long as you have not
found in their writings what is necessary (or sufficient) for what. It is
vital to know exactly what nec and suf mean because of the Janus-like
double-faced equivalences resulting from set theory and logical
implication :
from set theory == == implication == set theory :
(x SufFor y) == (y NecFor x) == (~x NecFor ~y) == (~y SufFor ~x)
(y SufFor x) == (x NecFor y) == (~y NecFor ~x) == (~x SufFor ~y)
which hold for determinism ie if (x entails y) perfectly. These
equivalences ( == ) are broken when Px <> Py , since then
M(y:x) <> M(x:y) in general and by my desideratum P4b:
(Px < Py) == [ M(y:x) > M(x:y) ] means that M(y:x) measures (x SufFor y)
and M(x:y) measures (y SufFor x);
(Px < Py) == [ m(y:x) < m(x:y) ] means that m(x:y) measures (x SufFor y)
and m(y:x) measures (y SufFor x);
For much more find suffic and necess in { Hajek www }, and find P4b here.
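The archers' picture in two numeric lines (Python): nest x inside y, ie
Pxy = Px, with made-up areas:

  px, py = 0.2, 0.6
  pxy = px          # the smaller target x lies entirely inside y
  print(pxy/px)     # P(y|x) = 1.0   : crude Sufficiency of x for y
  print(pxy/py)     # P(x|y) = 0.33. : crude Necessity of x for y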
Population attributable risk aka Levin's attributable risk { Fleiss 2003,
156, 126-128, 711/7.5 } can also be written in an unusual, Sheps-like
form:
PAR(x:y) = ( Px.[ RR(y:x) -1 ] )/( 1 + Px.[ RR(y:x) -1 ] ) the usual form;
 = ( Pxy - Px.Py )/( Py - Px.Py )
 = ( P(x|y) - Px )/( 1 - Px )
 = ( (x NecFor y) - base )/( MaxP(x|y) - base )
in the spirit of Sheps' "relative difference" . As mentioned under P4b , I
discovered that PAR(x:y) and Conv(x:y) produced the same RANKings of 10
vastly different 2x2 contingency tables. I have no proof that this holds
for all tables. Also I don't know of any other pair of M(:)'s which
produced equal ranks.
.-
Insight8 : MaxiMin vs Kemeny & Fitelson in clash with Popper & Kahre ;
C(y:x) , F(y:x) , GF(y:x) stress tested with extreme P's ; plus my paradox
of IndependentImplication :
Here continue the difficulties from the example under +Warnings above.
After expressing his & others' doubts, { Kemeny 1952, end of p.312 & 1st
lines of p.313 } says: "But we see no possible interpretation under which
the weights [ JH: the P(.|.)'s ] would depend on H. CA7: The weights
depend on n, they may depend on E, but they must be independent of H."
I read it as a foggy way to say that his "factual support" of a
hypothesis H by an evidence E must be
F(e:h) = [ P(e|h) - P(e|~h) ]/[ P(e|h) + P(e|~h) ] , ie it must not be
[ P(h|e) - P(h|~e) ]/[ P(h|e) + P(h|~e) ] .
Similarly { Fitelson 2001, p.42 } requires: "After all, evidential support
is supposed to be a measure of how strong the evidential relationship
between E and H is, and deductive entailment is the strongest that such a
relationship can possibly get. If E is conclusive for H, then H's a priori
probability should, intuitively, be irrelevant to how strong the (maximal,
deductive) relationship between E and H is." On p.43 he wisely adds: "In
any case, we should probably not put too much stock on deductive [ JH:
deterministic ] cases of the kind discussed in this section." Wisely, as I
recall my 7 Construction principles for good association measures in
{ Hajek } where my principle P2 says: "OpeRational usefulness is greatly
enhanced if the measure's WHOLE RANGE of values (not only its bounds) has
an opeRationally highly meaningful interpretation (like eg NNT , NNH )."
My P2 asks for more than Fitelson does, but his wits come in handy when we
cannot have all desiderata in one measure, in which case we have a
dilemma: Should we have opeRationally meaningful bounds, or most of the
range, if we cannot have both ? Answer: Since most of the range covers
more values than a small band near the extremes, I vote for the range.
This justified choice fits with Fitelson's intuitive wits.
Caution: the LOWer the count n(.) ie n(x) or n(y) , the more often will
n(x,y) = n(.), and the more of the conventional proportions n(x,y)/n(.)
will = 1. Such simplistic estimators of P(.|.) lead to EXTREME results
from some formulae which are sensitive to P(y|x)=1 or P(x|y)=1. My
provably (I derived the optimal ones) near optimal estimators for
n(x,y) > 0
P(y|x) = [ n(x,y) - 0.5 ]/n(x) , P(x|y) = [ n(x,y) - 0.5 ]/n(y)
are always < 1, since n(x,y) <= minimum[ n(x), n(y) ], so no P(.|.) = 1
can occur. MaxiMin: my heuristic also protects us from "false positives"
likely to arise from the extreme Px =. 1 or Py =. 1 :
If n(x,y) > few then { few > 1, say 4, since "Einmal ist keinmal" }
  if Py < Px then use M(y:x) else use M(x:y) ,
which is equivalent to:
If n(x,y) > few then { few > 1, say 4, since "Once is as if never" }
  M = minimum[ M(y:x) , M(x:y) ]
is a conservative heuristic, applicable to most measures M of
confirmation, support or causation. Note that if Px =. Py then
M(y:x) =. M(x:y) anyway. Maximal values of M are the most interesting
ones, so it is a MaxiMin heuristic.
{ Fitelson pp.42,43,47,48 } repeatedly singles out Kemeny's factual
support and "ordinally equivalent measures" ie isotone ones, as
"Interestingly, the only measure (among our 5 candidates) that satisfies"
his requirement. This is not so surprising, since among his 5 candidates
there is only one measure L which is based on P(e|h), while 3 are based on
P(h|e), and 2 are fully symmetric wrt e, h (symmetry makes them unfit).
Fitelson's L's :
L = ln( P(e|h)/P(e|~h) ) = ln( RR(e:h) ) = ln( LR(e:h) ) = W(e:h)
 = "weight of evidence" pushed in some 50 papers of I.J. Good
L* = [ P(e|h) - P(e|~h) ]/[ P(e|h) + P(e|~h) ] = F(e:h) = F(y:x)
 = [ P(e|h) / P(e|~h) - 1 ]/[ P(e|h) / P(e|~h) + 1 ] { my form1 }
 = [ RR(e:h) - 1 ]/[ RR(e:h) + 1 ] { my form2 }
RR(y:x) , F(y:x) == L* , and L == W(y:x) are isotone ie co-monotonic.
RR(:) = [ 1 + F(:) ]/[ 1 - F(:) ] is a conversion ;
F(y:x) = "degree of factual support of x by y" { my x == h , y == e }
 = [ P(y|x) - P(y|~x) ]/[ P(y|x) + P(y|~x) ] F-form1 { Kemeny 1952 }
 = [ RR(y:x) - 1 ]/[ RR(y:x) + 1 ] a function of the relative risk
 = tanh( 0.5*ln(RR(y:x)) ) my F-form4; tanh(z) = (e^(2z) -1)/(e^(2z) +1)
 = tanh( W(y:x)/2 ) my tanh corrects I.J. Good's sinh
 = tanh( 0.5*ln( Odds(x|y)/Odds(x) ) ) note the semi-inversion
 = ARR/[ P(y|x) + P(y|~x) ]
 = [ Pxy - Py.Px ]/[ Pxy + Py.Px - 2.Pxy.Px ] { my F-form2 }
 = [ P(x|y) - Px ]/[ P(x|y) + Px.(1 - 2.P(x|y)) ] { my semi-inverted form }
 = (difference/2)/Average = deviation/mean { my F-form5 }
which makes sense as a normalization to the range [-1..0..1] , but is less
semantically rich than GF = [Pa - Pb]/[ 1-Pb ] because unlike GF , F(:)'s
values are NOT quantitatively interpretable over the whole range; find
P2 ;
 = CF2(y:x) = [ P(x|y) - Px ]/[ Px.(1 - P(x|y)) + P(x|y).(1 - Px) ]
is a certainty factor in MYCIN at Stanford rescaled in { Heckerman 1986 },
which I recognized as F(y:x) via my F-form2 above.
F(y:x) = 0 if x,y are independent
F(y:x) = [ 1 - Py ]/[ 1 + Py - 2.Px ] if Pxy=Px
F(y:x) = 1 if P(x|y)=1 ie Pxy=Py , but:
F(y:x) = 0/0 if Px=1 == Pxy=Py == P(x|y)=1 then UNdetermined; I choose 0
F(y:x) = 0/0 if Py=1 == Pxy=Px == P(y|x)=1 then UNdetermined; I choose 0
F(y:x) < F(x:y) == (Px < Py) like RR(:) , UNLIKE P(y|x) & GF(:) , see
P4b: find -F( under P5:
These AMBIGUOUS non-values 0/0 may seem a weakness, but they are correct,
since if x,y are independent then
[ P(x|y) = Px ] == [ P(y|x) = Py ] == [ Pxy = Px.Py ] , so that in the
extreme:
if Px=1 then Py = Pxy = Px.Py & P(x|y) = 1 ie y implies x !
if Py=1 then Px = Pxy = Px.Py & P(y|x) = 1 ie x implies y ! ,
hence in these extreme cases y entails x AND x,y are independent, (and/)or
x entails y AND x,y are independent, which is my PARADOX of
IndependentImplication from { Hajek www }.
For comparison:
GF(y:x) = [ P(y|x) - P(y|~x) ]/[ 1 - P(y|~x) ] = slope(of y on x)/MAXslope
 = [ Pxy - Py.Px ]/[ (1 -(Px+Py-Pxy)).Px ]
GF(y:x) = 0 for independent x,y
GF(y:x) = 1 if Pxy=Px ie P(y|x)=1 ie MAX slope ie MAX beta(y:x) , ie
x entails y , eg if Py=1 ; GF(y:x) = 1 also if Px=1 !
GF(y:x) = Py/Px if Pxy=Py .
Now Kemeny and Fitelson will clash with Spinoza, Popper, and Kahre :
C(y:x) = Popper's corroboration aka confirmation, designed so that C <= 1-Px , which fits SIC
= [ P(y|x) - Py ]/[ P(y|x) + Py - Pxy ]  { Popper p.400, 9.2* }
= [ Pxy - Px.Py ]/[ Pxy + Px.Py - Pxy.Px ]  { my C-form2 }
= [ P(x|y) - Px ]/[ P(x|y) + Px.(1 - P(x|y)) ]  { my semi-inverted form }
= 1-Px iff P(x|y)=1
The factor 2 in the denominator of F(:) is the only mathematical difference between my semi-inverted forms of F(y:x) and C(y:x). Yet they have vastly different extreme values:
-1 <= F(:) <= 1 , C(y:x) <= 1-Px , C(x:y) <= 1-Py
C(y:x) = 0 for independent x,y OR Px=1 OR Py=1 ! which MAKES SENSE, since in all 3 cases Px carries NO surprise, NO new information.
C(y:x) = 1-Px if P(x|y)=1 ie if (y entails x) ie (y->x) ie (y implies x) ;
1-Px = P(~x) is a MEANINGFUL upper bound, because it fits SIC , and P(~x) = 1-Px is the simplest thinkable decreasing function of Px , better than log(1/Px) = -log(Px) in Shannon's entropy -Sum[Px.log(Px)] , where Sum[Px.1/Px] would not work (it always sums to the number of outcomes). Hence the quadratic entropy = Sum[Px.(1-Px)] .
K(x:y) = P(x|y) - Px is Kahre's favorite korroboration { Kahre 2002, p.120 } ;
-Px <= P(x|y) - Px <= 1-Px discounts the lack of surprise in x (fits SIC ), since a too frequent x is unlikely to be a SPECIFIC cause; this makes sense if Px =. 1 , less sense if Px is low.
K(y:x) = P(y|x) - Py is another korroboration in { Kahre 2002, p.186 } ;
-Py <= P(y|x) - Py <= 1-Py discounts the lack of surprise in y (find SIC ), since a frequent y is not seen as a real RISK; this makes LESS sense than 1-Px above, since a widespread risk is still a risk, although psycho-socially it is more acceptable if everybody is at the same high risk. Indeed, as long as nothing can be done about the risk, society gets used to it, becomes fatalistic about that risk, and is not too jealous wrt those lucky few exceptions who are spared the risk.
PsychoLogical justification of C( : ) and K( : ) : a frequent x is unlikely to be a SPECIFIC explanation or cause of y, or at least not a surprising, new, interesting one. Hence the decreasing 1-Px . What is common carries no new meaning. A more surprising x carries more SIC = "Semantic Information Content" , since the lower the prior Px , the more possibilities it a priori FORBIDS, EXCLUDES, REFUTES or ELIMINATES when x occurs; see SIC here & in { Hajek www }.
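A Python sketch of C( : ) and K( : ) on made-up P's, checking the SIC-fitting bound C(y:x) = 1-Px when y entails x:

  def C_form2(Pxy, Px, Py):
      # Popper's corroboration C(y:x) via my C-form2
      return (Pxy - Px*Py) / (Pxy + Px*Py - Pxy*Px)

  def K(p_posterior, p_prior):
      # Kahre's korroboration, eg K(x:y) = P(x|y) - Px
      return p_posterior - p_prior

  Px, Py = 0.4, 0.25
  Pxy = Py                               # P(x|y)=1 , ie y entails x
  print(C_form2(Pxy, Px, Py), 1 - Px)    # both print 0.6 = 1-Px
  print(K(Pxy/Py, Px))                   # K(x:y) also reaches 1-Px = 0.6 here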
{ Kemeny 1953, 297 } attributes this insight to { Popper pp.270,374,399,400,402 } , from whom Carnap has lifted it, and from him Bar-Hillel has lifted SIC ; { Bar-Hillel 1964, p.232 } quotes "Omnis determinatio est negatio" = "Every determination is negation" = "Bestimmen ist verneinen" by Baruch Spinoza (1632-1677), in 1656 excommunicated from the synagogue in Amsterdam. William of Ockham (1285-1347) aka Occam aka Dr. Invincibilis was also excommunicated, from the Church in 1328. Clearly, too original thinkers had to be purged from the establishments' institutions. Occam's bright student, who became a rector of the University of Paris, was Jean Buridan, who wanted to educate the masses by calling them asses. Besides his notion of pons asinorum, or pons asini, ie a "bridge of asses" (find it above), his name is also attached to "Buridan's ass". His (in fact already Aristotelian) ass is an animal standing between two equally appetizing haystacks, unable to decide which one to choose. This assy behavior can serve as a psychoLogical model for the dilemma of choosing between two or more (wo)men, and also for choosing a measure of causation tendency. The SIC property in Popper's C(:) and in Kahre's K(:) is intellectually appealing, but GF has other goodies mentioned under +Sheps . Which haystack do YOU prefer ?
To stress elimination of hypotheses or theories is Popperian refutationalism aka falsificationism. KISS = "Keep it simple, student!" is Occamism, but do not forget the vital amendment "ceteris paribus" ie "all else being equal". There is more to ceteris paribus than taking its meaning literally as explained above. Ceteris paribus should be recognized as a Vaihingerian useful fiction of the as-if kind (find again: Vaihinger or fictionalism ): the counterfactual WHAT-IF , let's pretend as-if , reasoning. Anyway, my slogan is "Keep IT simple, but not simplistic", as this e-paper does.
.- Insight9 : Variations on the form (ValueA - ValueB)/(MaxValueA - ValueB)
Examples tell more than abstractions. Pa, Pb are probabilities; Pb is some base probability or chance of success (or failure, feel free to define it). Counts of (joint) events a, b, c, d in a 2x2 contingency table are denoted as na, nb, nc, nd , where na + nb + nc + nd = N. Of course, na can be any observed score, and nc a chance value, base value, or expected value.
Ex: if Pa >= Pb then a factor F is : F = (Pa - Pb)/(1 - Pb) , where "It's the denominator, students!" , "It is the choice of the base that matters" :
eg Pb = Sum_i:1..m[ P_i * P_i ] = Sum[ (P_i)^2 ] = expected probability for P1..Pm
eg Pb = 1/m where m is the number of alternatives (if equiprobable); note that Sum_i:1..m[ (1/m)^2 ] = m.(1/m).(1/m) = 1/m
eg if na > nc then in a 2x2 contingency table a,b,c,d :
F = (na/N - nc/N)/(N/N - nc/N) = (Pa - Pc)/(1 - Pc) = (na - nc)/(N - nc) ,
hence this F is NOT based on P(.|.)'s , only on P(.)'s ie on counts na, nc, N , with na = n(y,x) , nc = n(y,~x) , N = n(x,y) + n(~x,y) + n(x,~y) + n(~x,~y)
eg if nhits > nMfc then, similarly to the last example, F = (nhits - nMfc)/(N - nMfc) ie success above the Most frequent class
eg if P(x|y) > Px then F = (P(x|y) - Px)/(1 - Px) = K(x:y)/(1 - Px) = PAR(x:y) , find it above.
eg Jacob Cohen's coefficient of agreement aka the chance-adjusted Kappa aka concordance or interrater agreement (of 1960) is an F with P's :
Pe = simple proportion of actually observed cases in which the evaluators agreed
Pi = fictitious proportion expected by chance, ie as-if the experts, who may be (wo)men or (wo)machines, produced statistically INdependent judgments.
In a 2x2 table of a,b,c,d where x = expertX and y = expertY , the joint counts of agreements are on the main diagonal, hence:
Pi = Px.Py + P(~x).P(~y) is the fictitious base chance , so that
F = (Pe - Pi)/(1 - Pi)
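A Python sketch of this F for a 2x2 table of two experts (the counts and the function name are mine, illustrative only):

  def kappa(na, nb, nc, nd):
      # na = both said yes , nd = both said no : main diagonal = agreements
      # nb = only expertX said yes , nc = only expertY said yes
      N  = na + nb + nc + nd
      Pe = (na + nd) / N                  # observed proportion of agreement
      Px = (na + nb) / N                  # expertX's "yes" rate
      Py = (na + nc) / N                  # expertY's "yes" rate
      Pi = Px*Py + (1 - Px)*(1 - Py)      # fictitious base chance of agreement
      return (Pe - Pi) / (1 - Pi)

  print(kappa(20, 5, 10, 15))   # Pe = 0.7 , Pi = 0.5 , kappa = 0.4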
Of course you are free to define (hopefully meaningfully) your own measure of matching or agreement between eg men and women, or between (wo)men and (wo)machines. Eg for random variables X, Y (ie for sets of random events x, y ) there exists Goodman & Kruskal's measure
TauB = [ (1 - E[Py]) - (1 - E[P(y|x)]) ]/( 1 - E[Py] )
     = [ E[P(y|x)] - E[Py] ]/( 1 - E[Py] )
     = (quadratic entropy of Y - conditional quadratic entropy of Y given X)/(quadratic entropy of Y)
     = Cont(X:Y)/Cont(Y)
TauB has never been explained as a normalized quadratic entropy, based on the simplest decreasing function of P , which is 1-P , where
Var(Y) = Sum_y:1..m[ P_y.(1 - P_y) ] = Expected(1 - P_y) = 1 - E[P_y] = 1 - Sum[ P_y.P_y ] = 1 - Sum[Py^2]
E[Var(Y|X)] = SumSum[ Pxy.(1 - P(y|x)) ] = 1 - E[P(y|x)]
Cont(X:Y) = (1 - E[Py]) - (1 - E[P(y|x)]) = E[P(y|x)] - E[Py]
          = SumSum[ Pxy.P(y|x) ] - Sum[ Py^2 ]
          = SumSum[ Pxy.(P(y|x) - Py) ] = SumSum[ Pxy.K(y:x) ]
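A Python sketch of TauB from a joint 2x2 table of made-up probabilities, following the formulas just given:

  def tau_b(Pxy):
      # Goodman & Kruskal's TauB = [ E[P(y|x)] - E[Py] ]/( 1 - E[Py] )
      Px = [sum(row) for row in Pxy]
      Py = [sum(col) for col in zip(*Pxy)]
      E_Py  = sum(p*p for p in Py)        # = 1 - Var(Y)
      # E[P(y|x)] = SumSum[ Pxy.P(y|x) ] = SumSum[ Pxy^2/Px ]
      E_Pyx = sum(p*p / px for row, px in zip(Pxy, Px) for p in row if px > 0)
      return (E_Pyx - E_Py) / (1.0 - E_Py)

  print(tau_b([[0.3, 0.1], [0.1, 0.5]]))   # =. 0.34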
-.-
+Conclusions:
It would be ABSURD if a slightly negative numerator -ARR(:) , ie a slightly negative dependence between y and x , yielded the value PF = 0.5 , which is 25 times larger than the GF = 0.02 obtained for the same numerator ARR with positive sign. Small deviations from independence (ARR = 0) should be treated as SYMMETRICALLY as possible, because DEPENDENCE IS SYMMETRICAL wrt x and y. All the various formulations of the condition of independence between two events y and x can always be transformed into the clearly symmetrical Pxy = Px.Py = Pyx . However, an indicator of probabilistic causation must be a MEANINGFULLY ASYMMETRICAL formula, asymmetric wrt the effect y and its tentative cause x, hence DIRECTED ie ORIENTED. The inversions and semi-inversions above serve as warnings about a merely seeming asymmetry. Cheng's PF does not fit with Sheps/Cheng's GF . If ARR < 0 then we must use the causal hindrance factor HF ; if ARR < 0 is near zero, then we may use GF < 0 as an approximate measure of prevention. More insights into more formulae are at my http://www.humintel.com/hajek
-.-
+References: papers & books worth (y)our looks :
If you cannot find a reference or anything else here, it is very likely in { Hajek } on www : http://www.humintel.com/hajek
Agresti Alan: Analysis of Ordinal Categorical Data, 1984; on p.45 is a math-definition of Simpson's paradox for events A, B, C
Arendt Hannah: The Human Condition, 1959; on the Archimedean point see pp.237 (last line) to 239, ... 260, more in her Index
Bar-Hillel Yehoshua: Language and Information, 1964; only his Introduction, but not his Contents or the reprinted paper, tells that the original author of their key paper of 1952 (chap.15, pp.221-274) was in fact Rudolf Carnap alone, who was the 1st author despite B < C, but B-H does not say it. Only their key paper is worth a look, the rest of the book is obsolete.
Brin Sergey, Motwani R., Ullman Jeffrey D., Tsur Shalom: Dynamic itemset counting and implication rules for market basket data; Proc. of the 1997 ACM SIGMOD Int. Conf. on Management of Data, 255-264; see www . Sergey Brin is the co-founder of Google
Cheng Patricia W.: From covariation to causation: a causal power theory (aka power PC theory); Psychological Review, 104, 1997, 367-405;
! on p.373 right mid: P(a|i) =/= P(a|i) should be P(a|i) =/= P(a|~i) .
Recent comments & responses by Patricia Cheng and Laura Novick are in Psychological Review, 112/3, July 2005, pp.675-707.
Cohen Jacob: Weighted kappa; Psychological Bulletin, 70/4 (19..), 213-220; on p.214 is K = [ Pa - Pc ]/[ 1 - Pc ] where Pa = the observed proportion of agreement (eg among experts), Pc = the proportion of agreement expected by chance alone
Feinstein Alvan R.: Principles of Medical Statistics, 2002, by the late professor of medicine at Yale, who studied both math & medicine; chap.10, 170-175 are on proportionate increments, on NNT , NNH , on honesty vs deceptively impressive magnified results from some formulae. Chap.17, 332, 337-340 are on fractions, rates, ratios OR(:), risks RR(:).
! On p.340 is a typo: the etiologic fraction should be e(r-1)/[ e(r-1) + 1 ];
! on p.444, eq.21.15 for the negative likelihood ratio LR- should be
! (1 - sensitivity)/specificity; above it should be (c/n1)/(d/n2)
Fitelson Branden: Studies in Bayesian confirmation theory, Ph.D. thesis, 2001, on www ; there I.J. Good's sinh(W(:)/2) should be tanh(W(:)/2); I told him so.
Fleiss Joseph L., Levin Bruce, Myunghee Cho Paik: Statistical Methods for Rates and Proportions, 3rd ed., 2003 ( Sheps' "relative difference" is in the Index, also in the earlier editions)
Good I.J.: see { Hajek www } for 10 commented references to too many papers & notes produced by the prolific author and WWII codebreaker over 50 years.
Hajek Jan: Probabilistic causation indicated by relative risk, attributable risk and by formulae of I.J. Good, Kemeny, Popper, Sheps/Cheng, Pearl and Google's Brin, for data mining, epidemiology, evidence-based medicine, economy, investments. This e-paper is at http://www.humintel.com/hajek
! Find the +Epicenter of this e-paper and +New conversions in it. Also see there my 7 "Construction principles for good association measures", where eg my principle P2 says: OpeRational usefulness is greatly enhanced if measure's WHOLE RANGE of values (not only its bounds) has an opeRationally highly meaningful interpretation (like eg NNT , NNH ).
! Caution: in the just referenced paper the notations Conv(y:x) and Conv(x:y) and also K(y:x) and K(x:y) are swapped wrt the present text. This is no error, it is just a change of notation; see +Causal notation is tough here and now; the swaps are due to my fresh desideratum P4b .
Heckerman David R.: Probabilistic interpretations for MYCIN's certainty factors; pp.167-196 in Uncertainty in Artificial Intelligence, L.N. Kanal and J.F. Lemmer (eds), vol.1, 1986. I succeeded in rewriting his eq.(31) for
! the certainty factor CF2 on p.179 into Kemeny's F(:). Heckerman's Lambdas are RR(:)'s. Heckerman has more fine papers in other volumes of this series of proceedings.
Kahneman Daniel, Slovic P., Tversky Amos (eds): Judgment Under Uncertainty: Heuristics and Biases, 1982; see pp.122-123; Kahneman is a Nobel laureate.
Kahre Jan: The Mathematical Theory of Information, 2002. See the last pages for "(Re)Search hints by Jan Hajek" on http://www.matheory.info . To find in his book formulae like eg Cont(.) use his special Index on pp.491-493. See www.matheory.info for errata + more.
! on p.120, eq.(5.2.8) is P(x|y) - Px = Kahre's korroboration, x = cause;
! on p.186, eq.(6.23.2) is P(y|x) - Py ; risk is no corroboration, y = evidence
Kemeny John G., Oppenheim Paul: Degree of factual support; Philosophy of Science, 19/4, Oct.1952, 307-324. Footnote 1 on p.307 tells that Kemeny
! was de facto the author. Caution: on pp.320 & 324 his oldfashioned p(.,.) is our modern P(.|.). On p.324 the first two lines should be bracketized
! thus: p(E|H)/[ p(E|H) + p(E|~H) ] , find it as F0( in { Hajek www }. An excellent paper by the former math-assistant to Einstein and later co-father of the programming language BASIC.
Kemeny John G.: A logical measure function; Journal of Symbolic Logic, 18/4, Dec.1953, 289-308. On p.307 in his F(H,E) the negation bars ~ over H are
! missing in both 2nd terms. Except for p.297 on the Popperian elimination of models (find SIC now), there is no need to read this paper if you read his much better one of 1952.
Laupacis A., Sackett D.L., Roberts R.S.: An assessment of clinically useful measures of the consequences of treatment; New England Journal of Medicine (NEJM), 1988, 318:1728-1733
Pearl Judea: Causality: Models, Reasoning, Inference, 2000; see at least pp.284, 291-294, 300, 308; his references to Shep should be Sheps , and on
! p.304 in his note under tab.9.3 ERR = 1 - P(y|x')/P(y|x) would be correct,
! which wasn't in Pearl's ERRata on www
Popper Karl: The Logic of Scientific Discovery, 6th impression (revised), March 1972, with new appendices; on corroboration see Appendix IX to his old Logik der Forschung, 1935, where in his Index: Gehalt, Mass des Gehalts = measure of content (find SIC ). His oldfashioned p(x,y) is modern P(x|y), and his confirmation = corroboration C(x,y) is my C(y:x)
Restle Frank: Psychology of Judgment and Choice: a Theoretical Essay, 1961; on p.149 is eq.(7.6) P(corrected) = [ P(yes|T) - P(yes|~T) ]/[ 1 - P(yes|~T) ]
Schield Milo: Simpson's paradox and Cornfield's conditions; ASA 1999 & www ; an excellent multi-angle explanation of confounding, a very important subject seldom or poorly explained in books on statistics. His section 8 can be complemented by reading { Agresti 1984, p.45 } for a definition of Simpson's paradox for events A, B, C
Schield Milo, Burnham Tom: Algebraic relationships between relative risk, phi and measures of necessity and sufficiency; ASA 2002; on www .
Sheps Mindel C.: An examination of some methods of comparing several rates or proportions; Biometrics, 15, 1959, 87-97
Simon Herbert: Models of Man, 1957; see pp.50-51, 54
Vaihinger Hans: Die Philosophie des Als Ob, 1923
Find much more in { Hajek } at http://www.humintel.com/hajek
-.-