.-
Sharper information inequalities applied to continuous pdf estimation
Copyright (C) 2009-5-5 , Jan Hajek , NL , version 2.03 of 2009-6-1
The word MY marks what I have not lifted from anywhere, so it is fresh and
may be contested, but must not be lifted from me without a full reference to
my full name plus this website : http://www.humintel.com/hajek contains the
latest versions of my epapers

Abstract: The Shannon & De Bruijn & Stam inequality (superior to Cramer-Rao)
1/Fisher information <= eNtropy power <= Variance is discussed in detail,
including the best numerical computations via the Hellinger-Bhattacharyya-
Jeffreys-Matusita metric distance M .

Contents :
+Executive summary
+Elements for the uninitiated
+Shannon & De Bruijn & Stam higher than Cramer-Rao lower bounds on variance
+Application of inequalities to automated probability density estimation
+Appendix1 on the length of a pdf curve and its curvature
+Appendix2 on Stan Ulam's tri-inequality and Jan Kahre's LDI
+References
.-
+Executive summary :

Find the words DISCOVERY and MY to get to my main results and insights.
The Cramer-Rao lower bound on variance is a general uncertainty principle
(or an uncertainty relation, or an information inequality), more general &
fundamental than the better known Heisenberg uncertainty principle. To find
a tighter ie higher lower bound on variance V ( ie a stronger information
inequality ) would be of fundamental importance. Here are the old weak and
the fresh, much stronger general inequalities (not specialized like
Heisenberg's uncertainty), presented in the oldest-first order ;
F, H, N, V are all > 0 :

1/F <= V is the Cramer-Rao inequality : variance V >= 1/Fisher information F
F = Integral[ dx.p.(p'/p)^2 ] == square( Matusita self-distance ) :
  =. 4.Sum[ ( sqrt(p[j+1]) - sqrt(p[j]) )^2 ]/dx has no /p ie no /0 :-)
  =.   Sum[ ( sqrt(p[j+1]) - sqrt(p[j-1]) )^2 ]/dx is slightly smoothed.

N <= V is the Shannon inequality : actual variance V >= N , where :
N = exp(2H)/2pie = 0.058549831*e^(2H) == Shannon's eNtropy power, with
H == Shannon's entropy for a pdf p(x) of a continuously distributed
(quasi)random variable x (not just Gaussian).

1/F <= N is the A.J. Stam inequality : entropy power N >= 1/F , hence :
1/F <= N <= V shows that Stam has improved on the Cramer-Rao lower bound by
increasing the lower bound on variance V.

From this I got : 1 <= F.N <= F.V , and also from the preceding inequality
I obtained : 1/V <= 1/N <= F , which I multiplied by V and obtained :
1 <= V/N <= F.V , which combined with the last but one inequality poses a
dilemma wrt the maximum lower bound on V :
1 <= F.N <= F.V >= V/N >= 1 , or :
1/F <= N <= V >= V/(F.N) >= 1/F , or :
N/V <= 1 >= 1/(F.N) , or :
V/N >= 1 <= F.N
which I tried to resolve inductively ; my empirical almost-discovery :
V/N > Stam's F.N >= 1 :
1 <= F.N <= V/N  <= F.V  holds for some artificially generated, hence known
                         ACTUAL ie TRUE pdf , while :
1 <= F.N >= Vs/N <= F.Vs holds for good kernel-based ESTIMATEs of that pdf,
                         where Vs is a sample variance .
Hence F.N^2 <= V >= 1/F for an ACTUAL ie TRUE pdf. This resolves by
deduction: since 1 <= F.N , it holds N <= F.N^2 , hence also :
0 < 1/F <= N <= F.(N^2) <= V holds for ACTUAL ie TRUE pdfs ;
0 < 1/F <= N <= F.(N^2) >= V holds for kernel-based ESTIMATEs of F, N, V.
Note how large the differences between the old weak (because too low) and
the new strong (because much higher) lower bounds on variance V are, eg
( 1/F by Cramer-Rao , N by A.J. Stam , F.N^2 by Jan Hajek ) :
1/F = 1.0253 < 1.9074 = N << 3.5486 = F.N^2 < V = 4.237 for a true pdf ;
1/F = 0.5219 < 1.6817 = N << 5.4188 = F.N^2 > V = 4.194 for estimates.
Alas, my almost-discovery does not always hold. Can you compute when it does
hold and when it doesn't ?

Extras :
E1: Fisher information F = M^2 == square[ Matusita distance of a pdf p(x)
    from p(x+dx) ie from shifted itself ] = 2.[ 1 - Hellinger integral aka
    Bhattacharyya coefficient for p(x) and p(x+dx) ]. M uses probability
    amplitudes sqrt(p(.)) which are handy both in quantum mechanics as well
    as for safe practical numerical computations of F from generated data.
E2: Stan Ulam's little known tri-inequality P(z|x) >= P(z|y).P(y|x) for the
    probability of transition from x to z.
E3: Jan Kahre's Law of Diminishing Information ( LDI ).
E4: Just mentioned : the Hirschman inequality as an entropic uncertainty
    relation.
.-
Mottos :

We ought to be discussing the question "In what aspect of this notion of
information are we primarily interested ?" The answer to this question will
decide which measure of information should be used. { S.D. Silvey , 1964 }

"so far as measures of association are concerned, operational interpretation
(either practical or theoretical) should be the main consideration, and NOT
a pre-ordained set of postulates." { S.D. Silvey on Renyi's 7 demands }

Probabilistic axioms will change with the whims of time, but an application
is forever. { Gian-Carlo Rota on Mark Kac's opinion }

"... operational interpretation is of great importance so far as measures of
association are concerned." { Ali & Silvey on Goodman & Kruskal }

Operational == a definition based on a criterion which allows measurement
for the given purpose. It avoids illusory combinations of words of a
metaphysical character.

It is better to be approximately correct than exactly wrong.
If it's not checked, it's wrong. { I.J. Good (1916-2009), who was Alan
Turing's stats assistant at Bletchley Park, UK, during the WWII codebreaking
efforts. He produced thousands of notes and papers on statistics }
.-
+Elements for the uninitiated :

== equivalence or synonymity ; eg equivalent notation or explanation
=  equal
=. approximately equal
. or * == a multiplication ; eg: 2.0*H == 2H
2pie == 2.pi.e = 2 * 3.141592654 * 2.718281828 = 17.07946845
1/2pie = 0.058549831
^ == power raising ; eg: e^x == exp(x) ; (e^a)^b = e^(a.b)
oo == +infinity
ln(2) = 0.69314718 ; 1 - ln(2) = 0.306852819
ln(.)/ln(2) = ln(.)/0.693 = 1.4427*ln(.) = binary logarithm(.) ie in bits
r.v. x == continuous random variable x
u' == d/dx(u) == d(u)/dx for u == u(x) a function of x , eg a p(x)
ln'(x) == d/dx(ln(x)) = 1/x ; ln'(u(x)) == ln'(u) = u'/u
Integral[ dx.1/x ] = ln(x) , x > 0
Integral[ dx.ln(x) ] = x.ln(x) - x , x > 0
Integral[ dx.u' ] = u
Integral[ dx.u'/u ] = ln|u|
Integral[ dx.u'/sqrt(u) ] = 2.sqrt(u)
p == p(x) == pdf(x) == probability density function of x ;
     p >= 0 and Integral[ dx.p ] = 1 .
q == q(x) = sqrt(p) == probability amplitude of x ; Integral[ dx.q^2 ] = 1
E[x] = Integral[dx.p.x] == mean value of x == expected value of x
       == 1st moment of x (by mechanical analogy)
E[u] = Integral[dx.p.u] == mean value of u(x)

E[p] = Integral[dx.p.p] = Integral[dx.p^2] == the mean value of the pdf p
       itself (find Gini)
V == V(x) = Integral[dx.p.(x - E[x])^2] = Integral[dx.p.x^2] - (E[x])^2
     == variance of x == 2nd moment == average power
V(x|y) = Integral[dx.p(x|y).(x - y)^2] == conditional variance of p(x|y)
V(y|x) = Integral[dy.p(y|x).(y - x)^2] == conditional variance of p(y|x) ,
         for Gaussians is : p(y|x) = exp( -((y-x)^2)/(2V) )/sqrt(2pi.V)
Cov(x,y) = E[x.y] - E[x].E[y] == covariance of r.v.s x, y
V(x) = Cov(x,x) = E[x^2] - (E[x])^2 == Var(x)
V(ax + by) = V(x).a^2 + V(y).b^2 + 2ab.Cov(x,y)
V(ax - by) = V(x).a^2 + V(y).b^2 - 2ab.Cov(x,y) which is isomorphic with the
rule of cosines c^2 = a^2 + b^2 - 2ab*cos(a,b) aka the cosine law .
Caution: watch out which angle ( in radians = degrees*pi/180 =
0.0174533*degrees ) is used: angle(x,y) vs its complement pi - angle(x,y).
These formulas for V(.) are precise ie not truncated Taylor series .

For a one-dimensional mixture (ie a weighted sum) p(x+y) of Gaussian pdfs
p(x), p(y), with means E[x], E[y] and variances V(x), V(y) it holds :
p(x+y) = w.p(x) + (1-w).p(y) , 0 <= w <= 1 , is a mixture which has :
E[x+y] = w.E[x] + (1-w).E[y] as the mean , and
V(x+y) = w.V(x) + (1-w).V(y) + w(1-w).(E[x] - E[y])^2 as the variance.

S == S(x) = sqrt(V(x)) == sigma(x) == standard deviation of x

F == F(x) == Fisher information in r.v. x
  = Integral[ dx.p.(ln'(p))^2 ] , and due to ln'(p(x)) = p'/p :
  = Integral[ dx.p.(p'/p)^2 ] = mean squared change-to-magnitude-of-p ratio ;
    the ^2 prevents annulation of +ups and -downs
  = Integral[ dx.(p'^2)/p ] ; introduce q = sqrt(p) , d(q)/dx = p'/(2sqrt(p))
  = 4.Integral[ dx.(q'^2) ] ; the best numerical formulas ( on a grid with
    step dx ) are :
  =. 4.Sum[ ( q[j+1] - q[j]   )^2 ]/dx , no /p , no /0 :-))
  =. 4.Sum[ ( q[j]   - q[j-1] )^2 ]/dx , ^2 ensures all >= 0
  =. 4.Sum[ (( q[j+1] - q[j-1] )/2)^2 ]/dx , slightly smoothed :-)
   =   Sum[ (  q[j+1] - q[j-1] )^2 ]/dx
   =   Sum[ ( sqrt(p[j+1]) - sqrt(p[j-1]) )^2 ]/dx
   = 4.Sum[ (( sqrt(p[j+1]) - sqrt(p[j-1]) )/2 )^2 ]/dx
( a numerical sketch in code follows at the end of this explanation of F ).
F is often denoted as J (for Jacobian determinant), or as I , like Shannon
information is. Moreover I is also a pronoun, so we stick to F as a better
mnemonic. Since F is much less known than entropy H , I am explaining F in
detail :
== "a measure of the intrinsic accuracy of a distribution. By information on
   an unknown parameter contained in a r.v. or in its distribution, we mean
   the extent to which uncertainty regarding the unknown value is reduced as
   a consequence of an observed value of the r.v." { Rao 1973 p.331 }
== "an index of sensitivity for small changes in the value of the parameter"
   { Rao 1973 p.332 } C.R. Rao is one of the 6 fathers of the C-R inequality.
== "measures an intrinsic character of the distribution of a r.v. wrt an
   unknown parameter. The intrinsic character considered is the
   discriminatory power between close alternative values of a parameter"
   { Rao 1973 p.348 }
== a measure of roughness of a distribution, so it can be used as a
   "roughness penalty" when estimating a pdf { I.J. Good & Gaskins , 1971 }
== measures overall concentration of values of x (in a more or less narrow
   range of the more or less likely values of x), ie peaks, sharpness,
   slopes, gradients of a pdf p or its estimate. The less curved the p
   ( ie the smoother the p, ie the flatter the p, ie the LESS BIASED the p
   wrt some values of x, ie the more "disordered" the p ), the less
   predictable the values of x will be. See +Appendix1 .
In practice we want to predict, identify, diagnose, classify, learn from new
yet UNseen data (cases), hence we MUST GENERALIZE , ie we must be MAXIMALLY
UNbiased within some MEANINGful constraints. Hence we have to MINIMIZE
Fisher information F subject to semantic constraints arising from the
specific context for the specific goal. { Jan Hajek }
Ceteris paribus, the simpler hypotheses generalize better.
{ Paul M. Churchland }
A smoother pdf is simpler than a rough one. { Jan Hajek }
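As announced above, here is a minimal numerical sketch ( my illustration
only, in Python with numpy ) of the amplitude-difference formula for F ;
the uniform grid, the test Gaussian with V = 1.7 and the name fisher_info
are merely assumed for this demo, they are not prescribed anywhere above.
For a Gaussian, F should come out close to 1/V ( derived further below ).

# A minimal sketch : Fisher information F from probability amplitudes q = sqrt(p),
# assuming the pdf is given as an array p on a uniform grid with step dx .
import numpy as np

def fisher_info(p, dx):
    # F =. 4.Sum[ ( sqrt(p[j+1]) - sqrt(p[j]) )^2 ]/dx ; no /p , no /0
    q = np.sqrt(p)
    return 4.0 * np.sum(np.diff(q)**2) / dx

# check on a discretized Gaussian , for which F should be close to 1/V :
x  = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
V  = 1.7                                           # an arbitrary test variance
p  = np.exp(-x**2 / (2*V)) / np.sqrt(2*np.pi*V)
p  = p / (np.sum(p) * dx)                          # renormalize the discretized pdf
print(fisher_info(p, dx), 1.0/V)                   # the two numbers nearly agree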
F = M^2 == squared Matusita distance M of p(x) from p(x+dx) ie "from itself".
Other exactly related functions of M aka Jeffreys invariant (wrt linear
transformations) are 2.M = Hellinger distance ( of 1909, then a
non-statistical geodesic ie spherical distance ), and the Bhattacharyya
coefficient (of 1941). For two pdfs p1 and p2 it holds :
M^2 == (Matusita distance)^2 = Integral[ dx.( sqrt(p1) - sqrt(p2) )^2 ]
    = Integral[dx.p1] + Integral[dx.p2] - 2.Integral[ dx.sqrt(p1.p2) ]
    = 1 + 1 - 2.Integral[ dx.sqrt(p1.p2) ]
    = 2.( 1 - Integral[ dx.sqrt(p1.p2) ] ) == 2.(1 - B)
where B == Bhattacharyya coefficient == Hellinger integral ( requires ideal
precision so that B <= 1 ). B is the cosine of the angle between two points
( two probability amplitudes ) on a unit sphere ( at its maximum 1 when the
angle is 0 ), so it is a measure of proximity of two pdfs.
Bhattacharyya distance = -ln(B) . B can be interpreted as an asymptotic
Mahalanobis distance. For Gaussians it holds ln(1/B) = -ln(B) =
( Mahalanobis distance )/8 .
B and M are in a quasi-Pythagorean relationship : with d == the angle between
the two amplitudes , cos(d) = Sum[ sqrt(p1.p2) ] = B , and
M^2 = Sum[ ( sqrt(p1) - sqrt(p2) )^2 ] = 2.(1 - cos(d)) = 4.(sin(d/2))^2 ,
hence ArcCos(B) == acos(B) = yet another Bhattacharyya distance.
( A numerical sketch in code for two discretized pdfs follows below, after
the notes on probability amplitudes. )

++ Advantages of the Hellinger-Jeffreys-Matusita distance are many :
+ M is a proper metric distance between two pdfs. Proper metric, because :
  M >= 0 ; if p1 = p2 then M = 0 else M > 0 ; M(p1,p2) = M(p2,p1) ie M is
  symmetrical wrt p1, p2 ; and M obeys the triangular inequality. All 3
  properties are must-haves for a proper mathematical metric ( see +Appendix2
  for this, for Ulam's tri-inequality for probabilistic transitions, and for
  Jan Kahre's LDI ).
+ M is a metric of power 2 , while Kullback-Leibler ( K-L ) is not a metric
  distance. Shannonian metrics (for discrete distributions) are :
  H(x|y) + H(y|x) = 2.H(x,y) - H(x) - H(y) = H(x) + H(y) - 2.I(x;y)
                  = H(x,y) - I(x;y) by Shannon (1950) himself ;
  d(x,y) = ( H(x|y) + H(y|x) )/H(x,y) = 1 - I(x;y)/H(x,y) , and
  R(x,y) = sqrt(1 - d^2) "is similar to the correlation coefficient in so far
  as it is a measure of stochastical interdependence between 2 r.v.s" by
  C. Rajski (1961), proven by Y. Horibe (1973) who considers 1 - d(x,y) as a
  correlation-like coefficient, but 0 <= ( R(x,y) , 1 - d(x,y) ) <= 1 are
  non-negative.
+ Unlike the K-L divergence aka relative entropy , M is bounded ,
  0 <= M <= sqrt(2) , ie M saturates like the Bayesian probability of error
  0 <= Pe <= 1 :
  ( B(p1,p2)/2 )^2 <= Pe(p1,p2) <= B(p1,p2)/2
  where Pe(p1,p2) == Bayesian probability of misclassification error between
  2 hypotheses with equal probabilities and pdfs p1(x), p2(x). { Kazakos 1980
  p.62, who refers it to Kailath 1967/2 , and Kadota & Shepp 1967/4 , all in
  IEEE Transactions }
+ M^2 = Integral[ dx.dy.( sqrt(p(x,y)) - sqrt(p(x).p(y)) )^2 ]
      = 2.( 1 - Integral[ dx.dy.sqrt( p(x,y).p(x).p(y) ) ] ) , M <= sqrt(2) ,
  is a distance of an actual joint p(x,y) from a fictive ie virtual pdf
  p(x).p(y) as-if x and y were independent.
+ Jeffreys prior f(p) = Const.sqrt( E[ Fisher information(p) ] ) in general.
  The Jeffreys prior f(p) for a binomially distributed parameter or
  probability p is/has :
  Shape parameters a = b = 0.5 , so that its pdf is ( G == gamma function ) :
  f(p) = (1/Beta(a,b)) * p^(a-1) * (1-p)^(b-1)
       = ( G(a+b)/(G(a)*G(b)) ) * p^(a-1) * (1-p)^(b-1)
       = [ 1/Beta(1/2,1/2) ] * p^(-1/2) * (1-p)^(-1/2)
       = G(1/2 + 1/2)/[ G(1/2).G(1/2) ] * [ p.(1-p) ]^(-1/2)
       = (1/Pi) * [ p.(1-p) ]^(-1/2) , Pi = 3.14159
       = 1/[ Pi*sqrt(p.(1-p)) ] == Jeffreys noninformative prior (unbiased)
      =. 1/( 3.14*sqrt(p.(1-p)) ) == Jeffreys uninformative prior (unbiased),
         U-shaped
  ++ (x+a-1)/(n+a+b-2) = (x-0.5)/(n-1) = posterior mode aka MAP which has a
     near minimum MSE (I have proven that)
  + is in the Krichevsky-Trofimov estimator for context tree weighting
  + is sqrt( E[ Fisher information ] ) which has several advantages
  + invariant wrt transformations (his motivation for his prior)
  + "invariant pdf of a sin^2 transform of a chaotic logistic map at r=4"
    { p.135 Stephen Eubank & Doyne Farmer : An introduction to chaos and
    randomness, in 1989 Lectures in Complex Systems, Erica Jen, ed. }
  + for a = b = 1/2 is MAP = (x-0.5)/(n-1) ; n > 1 , x >= 0.5 ; Jeffreys
    Be(0.5,0.5) has near minimum MSE { Jeffreys , Proc. Royal Soc., A,
    Mathematics and Physics, 186 (1946), p.457 low }.
++ My insights : Boundedness matters since it may be misleading to value
   SEPARABILITY more than necessary. If a distance is large enough so that
   separability is achieved, then it not only makes NO SENSE to value huge
   distances more than sufficiently large ones, but adding an UNsaturated
   distance in a summing formula may actually become MISleading.
+ Unlike the K-L divergence, M is symmetric wrt both pdfs, as a metric must
  be.
+ Unlike the K-L divergence, M contains no division.
+ M contains no derivative , except when M is a self-distance like here.
+ M is a robust estimator ( Peter Huber , Rudolf Beran ).
+ Moreover, unlike other metrics which yield the same distance regardless of
  the magnitudes of p1 and p2, M is a metric self-normalizing wrt the
  magnitudes. No wonder, since as a self-distance of a single pdf p(x) ie
  from p(x+dx) ie from itself, such M^2 = F = Integral[ dx.p.(p'/p)^2 ] .
  Even those who see math as a 4-letter word understand that :
  0.2 - 0.1 = 0.1 = 0.9 - 0.8 , while :
  sqrt(0.2) - sqrt(0.1) = 0.13 >> 0.054 = sqrt(0.9) - sqrt(0.8)
Now we should understand why the late physics Nobelist John Archibald
Wheeler ( who worked on the Manhattan Project during WWII ) wrote in 1982 :
"The proper tool for analyzing distinguishability is probability amplitude,
not probability itself." { Wheeler 1982 p.569 }
Another physics Nobelist, the celebrated Richard Feynman ( a magician-genius
according to Mark Kac ), says in his well known Feynman Lectures on Physics,
vol.3, p.1-10 , about the use of probability amplitudes in quantum
mechanics : "How does it work ? What is the machinery behind the law ?
No one can 'explain' ... No one will give you any deeper representation ...
We have no ideas about a more basic mechanism from which these results can
be deduced."
For the relationships between Fisher information and other measures of
information ( by Kullback-Leibler relative entropy, Renyi, Tsallis, Wootters,
and even Gini-Simpson ; btw Kullback & Leibler in the US, and in the UK
Simpson & I.J. Good who died in April 2009, were all successful codebreakers
during WWII ) see { Frieden 2001, pp.70-73 & 76-77 }.
F == a LOCAL measure of p , because F is very sensitive to ( ie not invariant
wrt ) a rearrangement ( ie a reordering by swapping ) of values of p.
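As announced above, here is a minimal numerical sketch ( again my
illustration only, in Python with numpy ) of M and B for two discretized
pdfs ; the two test Gaussians, the grid and the function names are
assumptions made just for this demo.

# A minimal sketch : Matusita distance M and Bhattacharyya coefficient B for
# two pdfs p1, p2 sampled on the same uniform grid with step dx (assumed).
import numpy as np

def bhattacharyya(p1, p2, dx):
    # B == Hellinger integral = Integral[ dx.sqrt(p1.p2) ] , 0 < B <= 1
    return np.sum(np.sqrt(p1 * p2)) * dx

def matusita(p1, p2, dx):
    # M = sqrt( Integral[ dx.( sqrt(p1) - sqrt(p2) )^2 ] ) , and M^2 = 2.(1 - B)
    return np.sqrt(np.sum((np.sqrt(p1) - np.sqrt(p2))**2) * dx)

x  = np.linspace(-10.0, 14.0, 2401)
dx = x[1] - x[0]
gauss = lambda x, m, V: np.exp(-(x - m)**2 / (2*V)) / np.sqrt(2*np.pi*V)
p1, p2 = gauss(x, 0.0, 1.0), gauss(x, 2.0, 1.0)

B, M = bhattacharyya(p1, p2, dx), matusita(p1, p2, dx)
print(M**2, 2*(1 - B))          # the two numbers nearly agree
print(-np.log(B))               # Bhattacharyya distance -ln(B)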
H == H(x) = Integral[ dx.p.ln(1/p) ] = -Integral[ dx.p.ln(p) ]
     =. -Sum[ dx.p.ln(p) ]
     == continuous entropy aka differential entropy of a continuous r.v. x ,
     in nats due to ln(.) .
Units conversion : nats/ln(2) = 1.4427*nats = bits.
The values p = 0 seem to prevent computations of H , but l'Hopital's rule
( p.ln(p) -> 0 ) allows us to simply skip them.
Unlike entropies of discrete data, for continuous variables :
  H < 0 may result if some values of p(x) > 1 occur ,
  H = oo may result ,
  H is not invariant wrt transformations ,
  H is not interpretable as average self-information.
We do not consider such data. However, the mutual information in general :
I(y;x) = I(x;y) = H(x) - H(x|y) = H(y) - H(y|x) = H(x) + H(y) - H(x,y)
I(y;x) = 0.5.ln(2pie.V(x)) + 0.5.ln(2pie.V(y))
         - ln( 2pie.S(x).S(y).sqrt(1 - r^2) )
       = -0.5.ln(1 - r^2) for a joint Gaussian pdf with r == r(x,y) == a
         correlation coefficient <> 0 ,
is invariant to any reversible transformation ie to any 1-to-1 mapping .
H == a COMBINATORIAL measure of overall disorder . Combinatorial, because H
is invariant wrt reordering (ie rearrangement by swapping) of values of p .
In fact the entropy of a discrete r.v. is a logarithm of a combinatorial
product to which Stirling's approximation was applied, thus turning a
product into a sum and getting rid of a multinomial coefficient W. W (for
Wahrscheinlichkeit) is the number of ways in which n indistinguishable
elements may form one and the same configuration of n elements.
H(a.x) = H(x) + ln(a) for a continuous x and a scale factor a > 0 .
A Gaussian pdf ie a normally distributed x is an important one :
p(x) = exp( -((x - E[x])^2)/(2V) )/sqrt(2pi.V)
     = exp( -( (x - E[x])/(S.sqrt(2)) )^2 )/(S.sqrt(2pi))
like eg thermal white noise.
A one-dimensional mixture of 2 Gaussians p(x), p(y), with means E[x], E[y]
and variances V(x), V(y) , is a weighted sum :
p(x+y) = w.p(x) + (1-w).p(y) , 0 <= w <= 1 , is a mixture which has :
the mean     E[x+y] = w.E[x] + (1-w).E[y]
the variance V(x+y) = w.V(x) + (1-w).V(y) + w(1-w).(E[x] - E[y])^2
The latter formulas are of 1894 by Karl Pearson ; try to find them in the
best books you know of, if you can. Good luck !
For the entropy H only V matters, E[x] does not, ie H is independent of the
mean :
H == H(x) = ln(sqrt(2pie.V)) = 0.5.ln(2pie.V) nats or nits (not bits)
          = 1.418938533 + ln(S) , all for a Gaussian ; we see that H is a
linear function of the logarithm of S or of V.
H is maximum : if the range is (-oo, oo) and the mean & variance are known,
then p(x) = Gaussian pdf has maximum entropy ; if the range is ( 0 , oo ) and
the mean is known, then p(x) = exp(-x/E[x])/E[x] has maximum entropy.
H(y|x) = 0.5.ln(2pie.V(y|x)) nats . Since 2.ln(S) = ln(V) we have
2H = ln(2pie.V) = 2.837877067 + ln(V) nats , for a Gaussian , hence :
V = exp(2H)/2pie for a Gaussian pdf , for which it holds :
p'(x) = -p(x).(x - E[x])/V = p(x).(E[x] - x)/V for a Gaussian r.v. x
p"(x) = p(x).((x - E[x])^2 - V)/(V^2) = p(x).( ((x - E[x])/V)^2 - 1/V )
p.(p'/p)^2 = (p'^2)/p = p(x).((x - E[x])/V)^2 for a Gaussian pdf , hence
F = Integral[ dx.p.(p'/p)^2 ] = Integral[ dx.p.(x - E[x])^2 ]/(V^2)
  = V/(V^2) = 1/V = 1/S^2 is the Fisher information for a Gaussian pdf.
{ Frieden 2001 p.73 } says that "The Fisher information is the reciprocal of
the variance for a host of probability laws, those belonging to the
exponential family of laws. Examples are the normal, exponential, binomial,
geometric and chi-square laws. Also, compare 1/V with 1/sqrt(4.pi.V) , the
Gini-Simpson information for the same case" of the normal law p(x).
I have very frequently emailed with Roy Frieden for about 5 years, and have
triggered him to consider Hellinger , Renyi , and also my favorite
Gini-Simpson = Integral[ dx.p^2 ] == the mean value of the pdf p itself !
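Here is a minimal numerical check ( my sketch, in Python with numpy ) of the
two Gaussian formulas just derived, H = 0.5.ln(2pie.V) and F = 1/V ; the
grid and the test variance V = 2.5 are assumptions for this demo only.

# A minimal sketch : check H = 0.5*ln(2pie.V) and F = 1/V for a discretized Gaussian.
import numpy as np

x  = np.linspace(-14.0, 14.0, 4001)
dx = x[1] - x[0]
V  = 2.5                                           # an arbitrary test variance
p  = np.exp(-x**2 / (2*V)) / np.sqrt(2*np.pi*V)

mask = p > 0                                       # skip p = 0 ( p.ln(p) -> 0 )
H = -np.sum(p[mask] * np.log(p[mask])) * dx        # differential entropy in nats
F = 4.0 * np.sum(np.diff(np.sqrt(p))**2) / dx      # Fisher information, amplitude form

print(H, 0.5*np.log(2*np.pi*np.e*V))               # nearly equal
print(F, 1.0/V)                                    # nearly equal

For a non-Gaussian pdf the same two sums still estimate H and F ; only the
closed forms cease to apply.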

N == N(p) = exp(2H)/2pie = 0.05855*e^(2H) == Shannon's entropy power for a
p(x) of any continuous x (not just a Gaussian pdf).
For a sum x+y of independent r.v.s x, y ( the pdf of the sum is the
convolution of the two pdfs, not a mixture ) it holds :
N(x) + N(y) <= N(x+y) is the Entropy Power Inequality ( EPI )
1/F(x) + 1/F(y) <= 1/F(x+y) is the dual of the EPI ,
hence if N(x) = N(y) = V then N(x+y) >= 2.V , hence if N(x) = N(y) = 1 then
N(x+y) = 0.05855*exp(2H(x+y)) >= 2 , and for 2 Gaussians with
V(x) = V(y) = 1 ( for which N = V = 1 ) it is N(x+y) = 2.0 exactly.
{ Shannon 1949, sect.23 } "Theorem 15: Let the average power of two ensembles
be V1 and V2 and let their entropy powers be N1 and N2. Then the entropy
power of the sum, N3, is bounded by N1 + N2 <= N3 <= V1 + V2 .
White Gaussian noise has the peculiar property that it can absorb any other
noise or signal ensemble which may be added to it with a resultant entropy
power APPROXimately equal to the sum of the white noise and the signal power
(MEASURED FROM THE AVERAGE SIGNAL VALUE, WHICH IS NORMALLY ZERO), provided
the signal power is small, in a certain sense, compared to the noise."
{ JH's emphasis }
.-
+Shannon & De Bruijn & Stam higher than Cramer-Rao lower bounds on variance

If we do NOT know the actual V = S^2 , but we do know H , then we can bound
V from below : V >= exp(2H)/2pie = 0.05855*exp(2H) ; this lower bound is
called the ENTROPY POWER N ( the variance of a Gaussian ie of a normal
noise , eg a thermal noise , with the same entropy H ) :
N == N(p) = exp(2H)/2pie = 0.05855*e^(2H) == entropy power of x .
N was introduced in { Shannon 1948 & 1949 section 21 } , where his last
sentence says : "Since white noise has the maximum entropy for a given
power, the entropy power of any noise is less than or equal to its actual
power."
My formulation : a non-Gaussian pdf with a continuous entropy H has an
entropy power N which is a virtual ie as-if Gaussian variance of an
equivalent average power.
Shannon did not express his last sentence in section 21 as a mathematical
inequality , so it has escaped the attention of many (including Jan Kahre ,
Roy Frieden , and until recently, humble me). The explicit inequality is :
exp(2H)/2pie = N <= V == the ACTUAL variance (or noise power) of x , or
exp(H)/sqrt(2pie) = sqrt(N) <= S = sqrt(V) , due to (e^a)^b = e^(a.b).
Shannon (in 1949) has not mentioned Fisher information F (of 1925). The link
between Fisher information F and Shannon entropy power N has been expressed
as F >= 1/N in { A.J. Stam 1959 p.104 eq.(2.3) } , and also the Cramer-Rao
inequality F >= 1/S^2 , together with the Shannonian 1/N >= 1/S^2 .
The Cramer-Rao lower bound S^2 >= 1/F assumes no systematic error aka bias .
I do not discuss bounds for biased estimates, since an empirical
kernel-based non-parametric estimate of p(x) can always be constructed
without a bias. The C-R inequality was (re)discovered by Aitken &
Silverstone (1942), Frechet (1943), C.R. Rao (1945) and by H. Cramer (1946).
In other words :
1/F <= V is the Cramer-Rao inequality ;
1/F <= N is the A.J. Stam inequality ;
N  <= V is the Shannon inequality for entropy power , hence :
1/F <= N <= V shows that Stam has increased the lower bound on V , ie that
Stam has improved on the Cramer-Rao lower bound, since it is always
desirable to obtain bounds as tight as possible. Loose bounds like eg 0 in
V > 0 are of little value.
I collected these separate inequalities into a single one :
S^2 = V >= N >= 1/F , or : 1/S^2 = 1/V <= 1/N <= F , which I multiplied by V
and obtained : 1 <= V/N <= F.V , and F.V >= F.N from Shannon's N <= V ,
F.N >= 1 by A.J. Stam ; these I combined into :
1 <= F.N <= F.V >= V/N >= 1 .
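Before the two questions below, here is a minimal numerical sketch ( mine,
in Python with numpy ) of the combined chain 1/F <= N <= V for a known TRUE
pdf ; it reuses the 2-Gaussian mixture 0.9*G(6,1) + 0.1*G(12,1) from the
numerical example further below, but on an assumed finer grid, so the
printed values will differ slightly from the quoted ones.

# A minimal sketch : check 1/F <= N <= V on a known TRUE pdf , here the mixture
# p(x) = 0.9*G(6,1) + 0.1*G(12,1) used in the numerical example below.
import numpy as np

x  = np.linspace(-2.0, 20.0, 4001)
dx = x[1] - x[0]
gauss = lambda x, m, V: np.exp(-(x - m)**2 / (2*V)) / np.sqrt(2*np.pi*V)
p  = 0.9*gauss(x, 6.0, 1.0) + 0.1*gauss(x, 12.0, 1.0)

mean = np.sum(x * p) * dx
V    = np.sum((x - mean)**2 * p) * dx                 # actual variance
H    = -np.sum(p[p > 0] * np.log(p[p > 0])) * dx      # entropy in nats
N    = np.exp(2*H) / (2*np.pi*np.e)                   # entropy power
F    = 4.0 * np.sum(np.diff(np.sqrt(p))**2) / dx      # Fisher information

print(1.0/F, "<=", N, "<=", V)                        # Cramer-Rao , Stam , Shannon
print("F.N^2 =", F*N*N, " vs V =", V)                 # the proposed higher bound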
Q1: My 1st fresh question I asked myself : Which inequality between F.N and
V/N holds, if any ? Is (almost) always V/N <= F.N or V/N >= F.N ?
In other words : is V <= F.N^2 or V >= F.N^2 ?
A1: My empirical almost-discovery :
1 <= F.N <= V/N  <= F.V  holds for an artificially generated, hence known
                         ACTUAL ie TRUE pdf , while :
1 <= F.N >= Vs/N <= F.Vs holds for a good kernel-based ESTIMATE of that pdf,
                         where Vs is a sample variance .
Hence F.N^2 <= V >= 1/F for an ACTUAL ie TRUE pdf.
Q2: My 2nd fresh question to myself was : Which of the last two lower bounds
on V is higher for a known actual ie true pdf ?
A2: My deduction : since 1 <= F.N , it holds N <= F.N^2 , hence also :
0 < 1/F <= N <= F.(N^2) <= V holds for actual ie true pdfs ;
0 < 1/F <= N <= F.(N^2) >= V holds for my estimates of F, N and V.
My numerical tests on artificially generated pdfs confirmed that. We see
that the Cramer-Rao inequality 1/F <= V holds for the estimates too, simply
because the C-R inequality is not sharp enough. After all, 0 <= V always
holds too.
Here are a few numbers to give you a feeling for the strength of my lower
bounds : for a mix of 2 Gaussians (I use G because N is Shannon's eNtropy
power) like eg: p(x) = 0.9*G(6,1) + 0.1*G(12,1) generated at 200 values of
x, and estimated from a sample of 100 pseudorandom values, I got :
1/F = 1.0253 < 1.9074 = N << 3.5486 = F.N^2 < V = 4.237 for the true pdf ;
1/F = 0.5219 < 1.6817 = N << 5.4188 = F.N^2 > V = 4.194 for estimates.
Note that :
+ both variances V are almost equal (due to my estimates of pdfs);
+ the estimated F is almost twice the true F (no wonder);
+ the << emphasizes the joyful fact that my bound F.N^2 is much higher than
  Shannon's entropy power N <= V ;
+ be reminded that 1/F <= N <= V is A.J. Stam's inequality F >= 1/N combined
  with Shannon's N <= V .
Conclusion : !! My investigations show that the C-R bound 1/F <= V ==
variance is TOO LOW , although better than the trivial bound 0. My lower
bound F.(N^2) would be higher than the C-R lower bound, hence my F.N.N would
be more realistic ie it would be more informative.
.-
+Application of inequalities to automated probability density estimation :

We seldom know the TRUE ie ACTUAL probability density function of a r.v. x.
So we CANNOT evaluate an integrated mean error as a measure of squared or
absolute differences, or as a discrepancy, divergence or a distance like eg
the above mentioned Matusita distance between the TRUE ie ACTUAL pdf and its
estimate. We usually have only a sample of measurement values x1..xj..xn ,
from which we can compute a sample variance Vs , and we can estimate the pdf
p(x) by eg a Rosenblatt-Parzen kernel method. Having an estimate of p(x), we
can integrate it numerically and thus estimate both the Fisher information
F(x) and the entropy H(x) with its entropy power N :
N = exp(2H)/2pie = 0.058549831*e^(2H) .
Now we can compute N.F and Vs/N , and for a decent estimate of a reasonably
smooth (no infinitely steep steps, though not necessarily differentiable)
UNknown pdf from a sample (of hundreds of values, or as small as a dozen
values) we get 2 values : N.F > Vs/N . Finally we compute their ratio
(F.N^2)/Vs . The SMALLER this ratio (the closer to 1 in practice), the
better the estimate of the pdf p(x) , WITHOUT KNOWING the ACTUAL ie TRUE pdf
!! Since we are free to use more than one estimator of p(x) , we can choose
the estimate with the lowest ratio F.N.N/Vs as the most likely best estimate
of p(x).
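Here is a minimal sketch ( mine, in Python with numpy ) of this selection
procedure : draw a sample from the same 2-Gaussian mixture, form
Rosenblatt-Parzen estimates with a Gaussian kernel for a few candidate
bandwidths, and score each estimate by the ratio F.N^2/Vs ; the sample size,
seed, grid and candidate bandwidths are assumptions for the demo, and the
printed numbers will not reproduce the ones quoted above, since the
pseudorandom sample differs.

# A minimal sketch : several Rosenblatt-Parzen kernel estimates of an unknown pdf,
# scored by the ratio F.N^2/Vs ( smaller is taken as better ).
import numpy as np

rng = np.random.default_rng(1)
n = 100                                               # sample size (assumed)
comp   = rng.random(n) < 0.9                          # mixture 0.9*G(6,1) + 0.1*G(12,1)
sample = np.where(comp, rng.normal(6.0, 1.0, n), rng.normal(12.0, 1.0, n))
Vs = np.var(sample)                                   # sample variance

x  = np.linspace(sample.min() - 4.0, sample.max() + 4.0, 2001)
dx = x[1] - x[0]

def parzen(sample, x, h):
    # Gaussian-kernel density estimate on the grid x with bandwidth h
    u = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5*u**2).sum(axis=1) / (len(sample) * h * np.sqrt(2*np.pi))

def score(p, dx):
    H = -np.sum(p[p > 0] * np.log(p[p > 0])) * dx     # entropy in nats
    N = np.exp(2*H) / (2*np.pi*np.e)                  # entropy power
    F = 4.0 * np.sum(np.diff(np.sqrt(p))**2) / dx     # Fisher information
    return F * N * N / Vs                             # the ratio F.N^2/Vs

for h in (0.2, 0.4, 0.8, 1.6):                        # candidate bandwidths (assumed)
    print("h =", h, " F.N^2/Vs =", score(parzen(sample, x, h), dx))
# the bandwidth with the smallest ratio is selected automatically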
This selection process is easy to automate, thus avoiding visual inspections
and subjective judgements by different people.
.-
+Appendix1 :

For the length s of a segment of a curve y(x) == y it holds by the
Pythagorean theorem : ds^2 = dy^2 + dx^2 ; since dy/dx = y' we have
dy = y'.dx , so that ds = dx.sqrt( 1 + y'^2 ) , hence the length of a
segment is : s = Integral[ ds ] = Integral[ dx.sqrt(1 + y'^2) ].
The y(x) can be a p(x).
[ (1 + y'^2)^(3/2) ]/y" = the radius of curvature of a curve y(x) ;
flat curves have a big radius.
p"/[ (1 + p'^2)^(3/2) ] = the curvature of a p(x) ; flat curves have a small
curvature ( ie a big radius ) and a short length s.
.-
+Appendix2 on Stan Ulam's probabilistic tri-inequality and Jan Kahre's LDI

During WWII in Los Alamos, the ingenious Austro-Hungarian Polish-American-
Jewish mathematician Stanislaw Ulam (Lemberg/Lwow/Lviv 1909 - 1984 Santa Fe)
did math for the Manhattan Project of developing the A-toys used on the
civilian populations of Hiroshima and Nagasaki in Japan. After WWII Stan
Ulam has :
- fathered the H-toy (the Hungarian-American-Jewish physicist Edward Teller
  has mothered it, while Hans Bethe was the midwife, as Bethe described it);
+ invented the Monte Carlo method of numerical problem solving ( MC was
  developed by Metropolis, Von Neumann, Pasta, Teller );
- with C.J. Everett co-fathered the mega-eccentric Project Orion ;
+ invented the Ulam distance between genetic sequences (implemented as the
  Peter Sellers algorithm). Compare it with the Levenshtein distance ;
+ also became interested in math for biology in general and for genetics in
  particular. In 1972 he co-authored a Los Alamos report LA-4973 "Metrics in
  biology, an introduction" where, at the very end, he just briefly
  mentioned what I call "Ulam's tri-inequality of probabilistic transitions
  ( U3IPT )".
U3IPT is a little known probabilistic analog of the well known geometric
triangular inequality d(z,x) <= d(z,y) + d(y,x) which is a must for any
proper metric. In { Ulam 1986 p.127 } U3IPT is presented a little better
than in the original LA-4973. Ulam ( like Karl Popper in The Logic of
Scientific Discovery ; my favorite was the Appendix IX on corroboration,
especially the footnotes on p.400 ) used an old fashioned notation :
"sigma(x,y), denoting the probability of transition from x to y" , eg in
genetics due to random mutations, he says. In my modern notation U3IPT
becomes :
P(z|x) >= P(z|y).P(y|x) , 0 <= P(.|.) <= 1 , for a transition z <- y <- x .
Ulam has said very little on his U3IPT, so I am trying to do it here & now.
Note that UNlike the metric distance d(z,x) <= d(z,y) + d(y,x) , U3IPT is :
- oriented ie directed ie asymmetric wrt x, y, z ,
- with the inequality reversed : P(z|x) >= ... vs d(z,x) <= ... ,
- with a multiplication of P's vs an addition of distances d's.
Let's check if U3IPT makes sense :
1: Both tri-inequalities basically say that "a direct route is never longer
   than an indirect one", which fits common sense, ceteris paribus.
2: Summing over z both sides of U3IPT yields 1 >= 1.P(y|x) ; checks :-)
3: Multiplying U3IPT by P(x) yields P(z,x) >= P(z|y).P(y,x) , and
   summing over x yields P(z) >= P(z|y).P(y) = P(z,y) ; checks ;
   summing over z yields P(x) >= 1.P(y,x) ; checks .
4: From 3: we have P(z,x) >= P(z|y).P(y,x) = P(z,y).P(x|y) ; summed over z :
   P(x) >= 1.P(y,x) = P(y).P(x|y) ; checks too .
5: Taking a logarithm (a monotonically increasing function) of both sides of
   U3IPT gives : ln(P(z|x)) >= ln(P(z|y)) + ln(P(y|x)) , eg:
   ln( 0.3 ) >= ln( 0.8 ) + ln( 0.2 ) , ie: -1.2 >= -0.223 + -1.61 = -1.832
6: -ln(P(z|x)) <= -ln(P(z|y)) + -ln(P(y|x)) , ie:
   ln(1/P(z|x)) <= ln(1/P(z|y)) + ln(1/P(y|x)) is a triangle inequality.
So far U3IPT makes a lot of sense, methinks ( a numerical check in code
follows at the end of this appendix ).
Q: Does Ulam's U3IPT fit with Jan Kahre's Law of Diminishing Information
( LDI ) for x -> y -> z ? If yes, then how ? Here is the LDI :
if P(x,y,z) = P(x|y).P(z|y).P(y) then inf(z@x) <= inf(y@x) ,
where inf(z@x) is an information measure ( not necessarily Shannon's one ).
P(z|x) = P(z,x)/P(x) >= P(x,z) >= P(x,y,z) = P(x,z|y).P(y)
       = P(x|y,z).P(z|y).P(y)
We see that : P(x,y,z) = P(x|y).P(z|y).P(y) is required by Kahre for the
LDI, versus P(x,y,z) = P(x|y,z).P(z|y).P(y) in general, hence the LDI
requires P(x|y,z) = P(x|y) .
A: The answer is left as mental jogging to the readers who are more
mathletic than me.
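As promised above, a small numerical check ( mine, in Python with numpy ) of
U3IPT for a transition chain x -> y -> z built from two random stochastic
matrices ; the sizes, the seed and the matrix names A, B, C are merely
assumed for this demo.

# A minimal sketch : check U3IPT , P(z|x) >= P(z|y).P(y|x) , for a transition
# chain x -> y -> z given by two random column-stochastic matrices.
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 4, 5, 3

A = rng.random((ny, nx)); A /= A.sum(axis=0)   # A[y,x] = P(y|x) , columns sum to 1
B = rng.random((nz, ny)); B /= B.sum(axis=0)   # B[z,y] = P(z|y)
C = B @ A                                      # C[z,x] = P(z|x) = Sum_y P(z|y).P(y|x)

# U3IPT holds for every intermediate y , since all the summed terms are >= 0 :
ok = all(np.all(C >= B[:, [y]] * A[[y], :]) for y in range(ny))
print("U3IPT holds for all x, y, z :", ok)     # expected : True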
.-
+References :

Bhattacharyya A: On a measure of divergence between two multinomial
  populations; Sankhya, 7, 1945-1946 (submitted 1941), 401-406
Blachman Nelson M: The convolution inequality for entropy powers; IEEE
  Transactions on Information Theory, April 1965, 267-271
Cover Thomas M. & Thomas Joy A: Elements of Information Theory, 1991, and
  2nd ed.; both editions contain chapters/sections on differential entropy,
  Fisher information, De Bruijn's identity, Stam, the entropy power
  inequality, the Brunn-Minkowski inequality, Renyi entropy power, etc.
Dembo A., Cover T., Thomas J.A: Information theoretic inequalities; IEEE
  Transactions on Information Theory, 37/6, 1991/11, 1501-1518; sect. IV.B
  is on uncertainty principles
Frieden B. Roy: Probability, Statistical Optics, and Data Testing, 3rd ed!
  2001; he mentions (also in his papers) { A.J. Stam 1959 }, { Huber 1964 },
  and the Hirschman uncertainty principle
Good I.J. & Gaskins R.A: Nonparametric roughness penalties for probability
  densities; Biometrika, 58/2, 1971, 255-277
Huber Peter (ETH & MIT): Robust estimation of a location parameter; The
  Annals of Mathematical Statistics, 35:1 (1964), 73-101, where in section 8
  on minimax theory he has played a "game against Nature" using Fisher
  information as a "payoff to the statistician". An earlier statistical
  miniMax player against Nature was Hugo Steinhaus in 1957. Steinhaus was a
  math teacher of Stanislaw Ulam ( for more see +Appendix2 )
Kahre Jan: The Mathematical Theory of Information, 2002 ;
  + in 10.13 he proves that the reciprocal value of a conditional variance
    aka Fisher's invariance 1/var(x|y) conforms to Kahre's Law of Diminishing
    Information ( for the LDI see +Appendix2 here above ) ;
  + in 10.19-10.21 he discusses (expected) Fisher information mainly in
    relation to Heisenberg's uncertainty principle, yet, remarkably :
  - he neither mentions Cramer-Rao nor { A.J. Stam 1959 }. Kahre's favorite
    Cramer is the quantum mechanic J.G. Cramer, not H. Cramer of the C-R
    inequality ;
  - he does not even mention the Cauchy-Schwarz inequality vital to the
    derivation of Fisher information ;
  - he does not mention the I.I. Hirschman inequality, an entropic
    uncertainty relation, an alternative form of the Heisenberg uncertainty
    principle ;
  - in 10.22 he calls "power of a gaussian white noise" what
    { Shannon sect.21 } calls "entropy power", but Kahre does not mention
    Shannon's inequality : "Since white noise has the maximum entropy for a
    given power, the entropy power of any noise is less than or equal to its
    actual power." ;
  - he does not mention Shannon's sampling theorem
Kazakos D. & Cotsidas Th: A decision theory approach to the approximation of
  discrete probability densities; IEEE Trans. PAMI, 2:1, 1980/1, p.62
Matusita Kameo: Decision rules, based on the distance, for problems of fit,
  two samples, and estimation; Annals of Mathematical Statistics, 26, 1955,
  631-640 (thereafter Matusita wrote at least 6 papers on his distance)
Rao C.R: Linear Statistical Inference and its Applications, 2nd ed., 1973
Resnikoff H.L: The Illusion of Reality, 1989; pp.132-139, sect.4.5, is on
  Shannon's sampling theorem and the uncertainty relation for measurement,
  ie dx.dy >= 1/(4pi) ; it employs the Schwarz inequality and the law of
  cosines
Shannon C.E: The Mathematical Theory of Communication, 1949, 4th printing of
  1969, pbk (other printings may differ). He has not mentioned Fisher
  information (of 1925).
Stam A.J: Some Mathematical Properties of Quantities of Information, 1959,
  Ph.D thesis at the University of Delft, NL, 152 pages, where on p.111 Stam
  writes that F.N >= 1 "was communicated to Prof.Ir.Dr. J.L. van Soest by
  Prof.Dr. N.G. de Bruijn who gave a variational proof of it ..." A summary
  of the connections between F, H and N is on p.114
Stam A.J: Some inequalities satisfied by the quantities of information of
  Fisher and Shannon; Information and Control, 2 (1959), 101-112; this paper
  was based on his Ph.D dissertation
Ulam Stanislaw: Science, Computers and People, 1986; see chap.12 = "Some
  further ideas and prospects in biomathematics", which I consider to be the
  chapter richest in ideas
Wheeler John Archibald: The computer and the universe; Int. J. of
  Theoretical Physics, 21:12, 1982, 557-572, see pp.567-569
Wootters William K: Statistical distance and Hilbert space; Physical Review
  D, 23:2, 15 January 1981; this paper was based on his Ph.D dissertation as
  a graduate student of J.A. Wheeler at U. of TX
.-
Philosophical readings ( oldest 1st ) :

Wigner Eugene: The unreasonable effectiveness of mathematics in the natural
  sciences; Communications on Pure and Applied Mathematics, XIII, 1960, 1-14
Schwartz J: The pernicious influence of mathematics on science; in:
  Methodology and Philosophy of Science, Proc. of the 1960 Intl. Congress,
  356-360
Diaconis Persi: Theories of data analysis: From magical thinking through
  classical statistics; in: Exploring Data Tables, Trends, and Shapes; 1985,
  1-36
Livio Mario: The Golden Ratio, 2002; chap.9 = Is God a Mathematician ?,
  section The unreasonable power of mathematics, 237-253
Wilczek Frank: Reasonably effective: 1. Deconstructing a miracle; Physics
  Today, November 2006; Frank Wilczek is a physics Nobelist at MIT
Grattan-Guinness Ivor: Solving Wigner's mystery: The reasonable (though
  perhaps limited) effectiveness of mathematics in the natural sciences; The
  Mathematical Intelligencer, 30, 2008/3, 7-17
-.-