.-
Sharper information inequalities applied to continuous pdf estimation
Copyright (C) 2009-5-5 , Jan Hajek , NL , version 2.03 of 2009-6-1
The word MY marks what I have not lifted from anywhere, so it is fresh and
may be contested, but must not be lifted from me without a full reference to
my full name plus this website : http://www.humintel.com/hajek contains the
latest versions of my epapers

Abstract: The Shannon & De Bruijn & Stam inequality (superior to Cramer-Rao)
1/Fisher information <= eNtropy power <= Variance is discussed in detail,
including the best numerical computations via the Hellinger-Bhattacharyya-
Jeffreys-Matusita metric distance M .

Contents :
+Executive summary
+Elements for the uninitiated
+Shannon & De Bruijn & Stam higher than Cramer-Rao lower bounds on variance
+Application of inequalities to automated probability density estimation
+Appendix1 on the length of a pdf curve and its curvature
+Appendix2 on Stan Ulam's tri-inequality and Jan Kahre's LDI
+References
.-
+Executive summary :

Find the words DISCOVERY and MY to get to my main results and insights.
The Cramer-Rao lower bound on variance is a general uncertainty principle
(or an uncertainty relation, or an information inequality), more general &
fundamental than the better known Heisenberg uncertainty principle. To find
a tighter ie higher lower bound on variance V ( ie a stronger information
inequality ) would be of fundamental importance. Here are the old weak and
the fresh, much stronger general inequalities (not specialized like
Heisenberg's uncertainty), presented in the oldest-first order ;
F, H, N, V are all > 0 :

1/F <= V is the Cramer-Rao inequality : variance V >= 1/Fisher information F
F = Integral[ dx.p.(p'/p)^2 ] == square( Matusita self-distance ) :
  =. 4.Sum[ ( sqrt(p[j+1]) - sqrt(p[j]) )^2 ]/dx has no /p ie no /0 :-)
  =.   Sum[ ( sqrt(p[j+1]) - sqrt(p[j-1]) )^2 ]/dx is slightly smoothed.

N <= V is the Shannon inequality : actual variance V >= N , where :
N = exp(2H)/2pie = 0.058549831*e^(2H) == Shannon's eNtropy power, with
H == Shannon's entropy for a pdf p(x) of a continuously distributed
(quasi)random variable x (not just Gaussian).

1/F <= N is the A.J. Stam inequality : entropy power N >= 1/F , hence :
1/F <= N <= V shows that Stam has improved on the Cramer-Rao lower bound by
increasing the lower bound on variance V.

From this I got : 1 <= F.N <= F.V , and also from the preceding inequality
I obtained : 1/V <= 1/N <= F , which I multiplied by V and obtained :
1 <= V/N <= F.V , which combined with the last but one inequality poses a
dilemma wrt the maximum lower bound on V :
1 <= F.N <= F.V >= V/N >= 1 , or :
1/F <= N <= V >= V/(F.N) >= 1/F , or :
N/V <= 1 >= 1/(F.N) , or :
V/N >= 1 <= F.N
which I tried to resolve inductively ; my empirical almost-discovery :
V/N > Stam's F.N >= 1 :
1 <= F.N <= V/N  <= F.V  holds for some artificially generated, hence known
                         ACTUAL ie TRUE pdf , while :
1 <= F.N >= Vs/N <= F.Vs holds for good kernel-based ESTIMATEs of that pdf,
                         where Vs is a sample variance .
Hence F.N^2 <= V >= 1/F for an ACTUAL ie TRUE pdf. This resolves by
deduction: since 1 <= F.N , it holds N <= F.N^2 , hence also :
0 < 1/F <= N <= F.(N^2) <= V holds for ACTUAL ie TRUE pdfs ;
0 < 1/F <= N <= F.(N^2) >= V holds for kernel-based ESTIMATEs of F, N, V.
Note how large the differences between the old weak (because too low) and
the new strong (because much higher) lower bounds on variance V are, eg
( 1/F by Cramer-Rao , N by A.J. Stam , F.N^2 by Jan Hajek ) :
1/F = 1.0253 < 1.9074 = N << 3.5486 = F.N^2 < V = 4.237 for a true pdf ;
1/F = 0.5219 < 1.6817 = N << 5.4188 = F.N^2 > V = 4.194 for estimates.
Alas, my almost-discovery does not always hold. Can you compute when it does
hold and when it doesn't ?

Extras :
E1: Fisher information F = M^2 == square[ Matusita distance of a pdf p(x)
    from p(x+dx) ie from shifted itself ] = 2.[ 1 - Hellinger integral aka
    Bhattacharyya coefficient for p(x) and p(x+dx) ]. M uses probability
    amplitudes sqrt(p(.)) which are handy both in quantum mechanics as well
    as for safe practical numerical computations of F from generated data.
E2: Stan Ulam's little known tri-inequality P(z|x) >= P(z|y).P(y|x) for the
    probability of transition from x to z.
E3: Jan Kahre's Law of Diminishing Information ( LDI ).
E4: Just mentioned : the Hirschman inequality as an entropic uncertainty
    relation.
.-
Mottos :

We ought to be discussing the question "In what aspect of this notion of
information are we primarily interested ?" The answer to this question will
decide which measure of information should be used. { S.D. Silvey , 1964 }

"so far as measures of association are concerned, operational interpretation
(either practical or theoretical) should be the main consideration, and NOT
a pre-ordained set of postulates." { S.D. Silvey on Renyi's 7 demands }

Probabilistic axioms will change with the whims of time, but an application
is forever. { Gian-Carlo Rota on Mark Kac's opinion }

"... operational interpretation is of great importance so far as measures of
association are concerned." { Ali & Silvey on Goodman & Kruskal }

Operational == a definition based on a criterion which allows measurement
for the given purpose. It avoids illusory combinations of words of a
metaphysical character.

It is better to be approximately correct than exactly wrong.
If it's not checked, it's wrong. { I.J. Good (1916-2009), who was Alan
Turing's stats assistant at Bletchley Park, UK, during the WWII codebreaking
efforts. He produced thousands of notes and papers on statistics }
.-
+Elements for the uninitiated :

== equivalence or synonymity ; eg equivalent notation or explanation
=  equal
=. approximately equal
. or * == a multiplication ; eg: 2.0*H == 2H
2pie == 2.pi.e = 2 * 3.141592654 * 2.718281828 = 17.07946845
1/2pie = 0.058549831
^ == power raising ; eg: e^x == exp(x) ; (e^a)^b = e^(a.b)
oo == +infinity
ln(2) = 0.69314718 ; 1 - ln(2) = 0.306852819
ln(.)/ln(2) = ln(.)/0.693 = 1.4427*ln(.) = binary logarithm(.) ie in bits
r.v. x == continuous random variable x
u' == d/dx(u) == d(u)/dx for u == u(x) a function of x , eg a p(x)
ln'(x) == d/dx(ln(x)) = 1/x ; ln'(u(x)) == ln'(u) = u'/u
Integral[ dx.1/x ] = ln(x) , x > 0
Integral[ dx.ln(x) ] = x.ln(x) - x , x > 0
Integral[ dx.u' ] = u
Integral[ dx.u'/u ] = ln|u|
Integral[ dx.u'/sqrt(u) ] = 2.sqrt(u)
p == p(x) == pdf(x) == probability density function of x ;
     p >= 0 and Integral[ dx.p ] = 1 .
q == q(x) = sqrt(p) == probability amplitude of x ; Integral[ dx.q^2 ] = 1
E[x] = Integral[dx.p.x] == mean value of x == expected value of x
       == 1st moment of x (by mechanical analogy)
E[u] = Integral[dx.p.u] == mean value of u(x)

E[p] = Integral[dx.p.p] = Integral[dx.p^2] == the mean value of the pdf p
       itself (find Gini)
V == V(x) = Integral[dx.p.(x - E[x])^2] = Integral[dx.p.x^2] - (E[x])^2
     == variance of x == 2nd moment == average power
V(x|y) = Integral[dx.p(x|y).(x - y)^2] == conditional variance of p(x|y)
V(y|x) = Integral[dy.p(y|x).(y - x)^2] == conditional variance of p(y|x) ,
         for Gaussians is : p(y|x) = exp( -((y-x)^2)/(2V) )/sqrt(2pi.V)
Cov(x,y) = E[x.y] - E[x].E[y] == covariance of r.v.s x, y
V(x) = Cov(x,x) = E[x^2] - (E[x])^2 == Var(x)
V(ax + by) = V(x).a^2 + V(y).b^2 + 2ab.Cov(x,y)
V(ax - by) = V(x).a^2 + V(y).b^2 - 2ab.Cov(x,y) which is isomorphic with the
rule of cosines c^2 = a^2 + b^2 - 2ab*cos(a,b) aka the cosine law .
Caution: watch out which angle ( in radians = degrees*pi/180 =
0.0174533*degrees ) is used: angle(x,y) vs its complement pi - angle(x,y).
These formulas for V(.) are precise ie not truncated Taylor series .

For a one-dimensional mixture (ie a weighted sum) p(x+y) of Gaussian pdfs
p(x), p(y), with means E[x], E[y] and variances V(x), V(y) it holds :
p(x+y) = w.p(x) + (1-w).p(y) , 0 <= w <= 1 , is a mixture which has :
E[x+y] = w.E[x] + (1-w).E[y] as the mean , and
V(x+y) = w.V(x) + (1-w).V(y) + w(1-w).(E[x] - E[y])^2 as the variance.

S == S(x) = sqrt(V(x)) == sigma(x) == standard deviation of x

F == F(x) == Fisher information in r.v. x
  = Integral[ dx.p.(ln'(p))^2 ] , and due to ln'(p(x)) = p'/p :
  = Integral[ dx.p.(p'/p)^2 ] = mean squared change-to-magnitude-of-p ratio ;
    the ^2 prevents annulation of +ups and -downs
  = Integral[ dx.(p'^2)/p ] ; introduce q = sqrt(p) , d(q)/dx = p'/(2sqrt(p))
  = 4.Integral[ dx.(q'^2) ] ; the best numerical formulas ( on a grid with
    step dx ) are :
  =. 4.Sum[ ( q[j+1] - q[j]   )^2 ]/dx , no /p , no /0 :-))
  =. 4.Sum[ ( q[j]   - q[j-1] )^2 ]/dx , ^2 ensures all >= 0
  =. 4.Sum[ (( q[j+1] - q[j-1] )/2)^2 ]/dx , slightly smoothed :-)
   =   Sum[ (  q[j+1] - q[j-1] )^2 ]/dx
   =   Sum[ ( sqrt(p[j+1]) - sqrt(p[j-1]) )^2 ]/dx
   = 4.Sum[ (( sqrt(p[j+1]) - sqrt(p[j-1]) )/2 )^2 ]/dx
( a numerical sketch in code follows at the end of this explanation of F ).
F is often denoted as J (for Jacobian determinant), or as I , like Shannon
information is. Moreover I is also a pronoun, so we stick to F as a better
mnemonic. Since F is much less known than entropy H , I am explaining F in
detail :
== "a measure of the intrinsic accuracy of a distribution. By information on
   an unknown parameter contained in a r.v. or in its distribution, we mean
   the extent to which uncertainty regarding the unknown value is reduced as
   a consequence of an observed value of the r.v." { Rao 1973 p.331 }
== "an index of sensitivity for small changes in the value of the parameter"
   { Rao 1973 p.332 } C.R. Rao is one of the 6 fathers of the C-R inequality.
== "measures an intrinsic character of the distribution of a r.v. wrt an
   unknown parameter. The intrinsic character considered is the
   discriminatory power between close alternative values of a parameter"
   { Rao 1973 p.348 }
== a measure of roughness of a distribution, so it can be used as a
   "roughness penalty" when estimating a pdf { I.J. Good & Gaskins , 1971 }
== measures overall concentration of values of x (in a more or less narrow
   range of the more or less likely values of x), ie peaks, sharpness,
   slopes, gradients of a pdf p or its estimate. The less curved the p
   ( ie the smoother the p, ie the flatter the p, ie the LESS BIASED the p
   wrt some values of x, ie the more "disordered" the p ), the less
   predictable the values of x will be. See +Appendix1 .
In practice we want to predict, identify, diagnose, classify, learn from new
yet UNseen data (cases), hence we MUST GENERALIZE , ie we must be MAXIMALLY
UNbiased within some MEANINGful constraints. Hence we have to MINIMIZE
Fisher information F subject to semantic constraints arising from the
specific context for the specific goal. { Jan Hajek }
Ceteris paribus, the simpler hypotheses generalize better.
{ Paul M. Churchland }
A smoother pdf is simpler than a rough one. { Jan Hajek }
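As announced above, here is a minimal numerical sketch ( my illustration
only, in Python with numpy ) of the amplitude-difference formula for F ;
the uniform grid, the test Gaussian with V = 1.7 and the name fisher_info
are merely assumed for this demo, they are not prescribed anywhere above.
For a Gaussian, F should come out close to 1/V ( derived further below ).

# A minimal sketch : Fisher information F from probability amplitudes q = sqrt(p),
# assuming the pdf is given as an array p on a uniform grid with step dx .
import numpy as np

def fisher_info(p, dx):
    # F =. 4.Sum[ ( sqrt(p[j+1]) - sqrt(p[j]) )^2 ]/dx ; no /p , no /0
    q = np.sqrt(p)
    return 4.0 * np.sum(np.diff(q)**2) / dx

# check on a discretized Gaussian , for which F should be close to 1/V :
x  = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
V  = 1.7                                           # an arbitrary test variance
p  = np.exp(-x**2 / (2*V)) / np.sqrt(2*np.pi*V)
p  = p / (np.sum(p) * dx)                          # renormalize the discretized pdf
print(fisher_info(p, dx), 1.0/V)                   # the two numbers nearly agree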
F = M^2 == squared Matusita distance M of p(x) from p(x+dx) ie "from itself".
Other exactly related functions of M aka Jeffreys invariant (wrt linear
transformations) are 2.M = Hellinger distance ( of 1909, then a
non-statistical geodesic ie spherical distance ), and the Bhattacharyya
coefficient (of 1941). For two pdfs p1 and p2 it holds :
M^2 == (Matusita distance)^2 = Integral[ dx.( sqrt(p1) - sqrt(p2) )^2 ]
    = Integral[dx.p1] + Integral[dx.p2] - 2.Integral[ dx.sqrt(p1.p2) ]
    = 1 + 1 - 2.Integral[ dx.sqrt(p1.p2) ]
    = 2.( 1 - Integral[ dx.sqrt(p1.p2) ] ) == 2.(1 - B)
where B == Bhattacharyya coefficient == Hellinger integral ( requires ideal
precision so that B <= 1 ). B is the cosine of the angle between two points
( two probability amplitudes ) on a unit sphere ( at its maximum 1 when the
angle is 0 ), so it is a measure of proximity of two pdfs.
Bhattacharyya distance = -ln(B) . B can be interpreted as an asymptotic
Mahalanobis distance. For Gaussians it holds ln(1/B) = -ln(B) =
( Mahalanobis distance )/8 .
B and M are in a quasi-Pythagorean relationship : with d == the angle between
the two amplitudes , cos(d) = Sum[ sqrt(p1.p2) ] = B , and
M^2 = Sum[ ( sqrt(p1) - sqrt(p2) )^2 ] = 2.(1 - cos(d)) = 4.(sin(d/2))^2 ,
hence ArcCos(B) == acos(B) = yet another Bhattacharyya distance.
( A numerical sketch in code for two discretized pdfs follows below, after
the notes on probability amplitudes. )

++ Advantages of the Hellinger-Jeffreys-Matusita distance are many :
+ M is a proper metric distance between two pdfs. Proper metric, because :
  M >= 0 ; if p1 = p2 then M = 0 else M > 0 ; M(p1,p2) = M(p2,p1) ie M is
  symmetrical wrt p1, p2 ; and M obeys the triangular inequality. All 3
  properties are must-haves for a proper mathematical metric ( see +Appendix2
  for this, for Ulam's tri-inequality for probabilistic transitions, and for
  Jan Kahre's LDI ).
+ M is a metric of power 2 , while Kullback-Leibler ( K-L ) is not a metric
  distance. Shannonian metrics (for discrete distributions) are :
  H(x|y) + H(y|x) = 2.H(x,y) - H(x) - H(y) = H(x) + H(y) - 2.I(x;y)
                  = H(x,y) - I(x;y) by Shannon (1950) himself ;
  d(x,y) = ( H(x|y) + H(y|x) )/H(x,y) = 1 - I(x;y)/H(x,y) , and
  R(x,y) = sqrt(1 - d^2) "is similar to the correlation coefficient in so far
  as it is a measure of stochastical interdependence between 2 r.v.s" by
  C. Rajski (1961), proven by Y. Horibe (1973) who considers 1 - d(x,y) as a
  correlation-like coefficient, but 0 <= ( R(x,y) , 1 - d(x,y) ) <= 1 are
  non-negative.
+ Unlike the K-L divergence aka relative entropy , M is bounded ,
  0 <= M <= sqrt(2) , ie M saturates like the Bayesian probability of error
  0 <= Pe <= 1 :
  ( B(p1,p2)/2 )^2 <= Pe(p1,p2) <= B(p1,p2)/2
  where Pe(p1,p2) == Bayesian probability of misclassification error between
  2 hypotheses with equal probabilities and pdfs p1(x), p2(x). { Kazakos 1980
  p.62, who refers it to Kailath 1967/2 , and Kadota & Shepp 1967/4 , all in
  IEEE Transactions }
+ M^2 = Integral[ dx.dy.( sqrt(p(x,y)) - sqrt(p(x).p(y)) )^2 ]
      = 2.( 1 - Integral[ dx.dy.sqrt( p(x,y).p(x).p(y) ) ] ) , M <= sqrt(2) ,
  is a distance of an actual joint p(x,y) from a fictive ie virtual pdf
  p(x).p(y) as-if x and y were independent.
+ Jeffreys prior f(p) = Const.sqrt( E[ Fisher information(p) ] ) in general.
  The Jeffreys prior f(p) for a binomially distributed parameter or
  probability p is/has :
  Shape parameters a = b = 0.5 , so that its pdf is ( G == gamma function ) :
  f(p) = (1/Beta(a,b)) * p^(a-1) * (1-p)^(b-1)
       = ( G(a+b)/(G(a)*G(b)) ) * p^(a-1) * (1-p)^(b-1)
       = [ 1/Beta(1/2,1/2) ] * p^(-1/2) * (1-p)^(-1/2)
       = G(1/2 + 1/2)/[ G(1/2).G(1/2) ] * [ p.(1-p) ]^(-1/2)
       = (1/Pi) * [ p.(1-p) ]^(-1/2) , Pi = 3.14159
       = 1/[ Pi*sqrt(p.(1-p)) ] == Jeffreys noninformative prior (unbiased)
      =. 1/( 3.14*sqrt(p.(1-p)) ) == Jeffreys uninformative prior (unbiased),
         U-shaped
  ++ (x+a-1)/(n+a+b-2) = (x-0.5)/(n-1) = posterior mode aka MAP which has a
     near minimum MSE (I have proven that)
  + is in the Krichevsky-Trofimov estimator for context tree weighting
  + is sqrt( E[ Fisher information ] ) which has several advantages
  + invariant wrt transformations (his motivation for his prior)
  + "invariant pdf of a sin^2 transform of a chaotic logistic map at r=4"
    { p.135 Stephen Eubank & Doyne Farmer : An introduction to chaos and
    randomness, in 1989 Lectures in Complex Systems, Erica Jen, ed. }
  + for a = b = 1/2 is MAP = (x-0.5)/(n-1) ; n > 1 , x >= 0.5 ; Jeffreys
    Be(0.5,0.5) has near minimum MSE { Jeffreys , Proc. Royal Soc., A,
    Mathematics and Physics, 186 (1946), p.457 low }.
++ My insights : Boundedness matters since it may be misleading to value
   SEPARABILITY more than necessary. If a distance is large enough so that
   separability is achieved, then it not only makes NO SENSE to value huge
   distances more than sufficiently large ones, but adding an UNsaturated
   distance in a summing formula may actually become MISleading.
+ Unlike the K-L divergence, M is symmetric wrt both pdfs, as a metric must
  be.
+ Unlike the K-L divergence, M contains no division.
+ M contains no derivative , except when M is a self-distance like here.
+ M is a robust estimator ( Peter Huber , Rudolf Beran ).
+ Moreover, unlike other metrics which yield the same distance regardless of
  the magnitudes of p1 and p2, M is a metric self-normalizing wrt the
  magnitudes. No wonder, since as a self-distance of a single pdf p(x) ie
  from p(x+dx) ie from itself, such M^2 = F = Integral[ dx.p.(p'/p)^2 ] .
  Even those who see math as a 4-letter word understand that :
  0.2 - 0.1 = 0.1 = 0.9 - 0.8 , while :
  sqrt(0.2) - sqrt(0.1) = 0.13 >> 0.054 = sqrt(0.9) - sqrt(0.8)
Now we should understand why the late physics Nobelist John Archibald
Wheeler ( who worked on the Manhattan Project during WWII ) wrote in 1982 :
"The proper tool for analyzing distinguishability is probability amplitude,
not probability itself." { Wheeler 1982 p.569 }
Another physics Nobelist, the celebrated Richard Feynman ( a magician-genius
according to Mark Kac ), says in his well known Feynman Lectures on Physics,
vol.3, p.1-10 , about the use of probability amplitudes in quantum
mechanics : "How does it work ? What is the machinery behind the law ?
No one can 'explain' ... No one will give you any deeper representation ...
We have no ideas about a more basic mechanism from which these results can
be deduced."
For the relationships between Fisher information and other measures of
information ( by Kullback-Leibler relative entropy, Renyi, Tsallis, Wootters,
and even Gini-Simpson ; btw Kullback & Leibler in the US, and in the UK
Simpson & I.J. Good who died in April 2009, were all successful codebreakers
during WWII ) see { Frieden 2001, pp.70-73 & 76-77 }.
F == a LOCAL measure of p , because F is very sensitive to ( ie not invariant
wrt ) a rearrangement ( ie a reordering by swapping ) of values of p.
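As announced above, here is a minimal numerical sketch ( again my
illustration only, in Python with numpy ) of M and B for two discretized
pdfs ; the two test Gaussians, the grid and the function names are
assumptions made just for this demo.

# A minimal sketch : Matusita distance M and Bhattacharyya coefficient B for
# two pdfs p1, p2 sampled on the same uniform grid with step dx (assumed).
import numpy as np

def bhattacharyya(p1, p2, dx):
    # B == Hellinger integral = Integral[ dx.sqrt(p1.p2) ] , 0 < B <= 1
    return np.sum(np.sqrt(p1 * p2)) * dx

def matusita(p1, p2, dx):
    # M = sqrt( Integral[ dx.( sqrt(p1) - sqrt(p2) )^2 ] ) , and M^2 = 2.(1 - B)
    return np.sqrt(np.sum((np.sqrt(p1) - np.sqrt(p2))**2) * dx)

x  = np.linspace(-10.0, 14.0, 2401)
dx = x[1] - x[0]
gauss = lambda x, m, V: np.exp(-(x - m)**2 / (2*V)) / np.sqrt(2*np.pi*V)
p1, p2 = gauss(x, 0.0, 1.0), gauss(x, 2.0, 1.0)

B, M = bhattacharyya(p1, p2, dx), matusita(p1, p2, dx)
print(M**2, 2*(1 - B))          # the two numbers nearly agree
print(-np.log(B))               # Bhattacharyya distance -ln(B)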
H == H(x) = Integral[ dx.p.ln(1/p) ] = -Integral[ dx.p.ln(p) ]
     =. -Sum[ dx.p.ln(p) ]
     == continuous entropy aka differential entropy of a continuous r.v. x ,
     in nats due to ln(.) .
Units conversion : nats/ln(2) = 1.4427*nats = bits.
The values p = 0 seem to prevent computations of H , but l'Hopital's rule
( p.ln(p) -> 0 ) allows us to simply skip them.
Unlike entropies of discrete data, for continuous variables :
  H < 0 may result if some values of p(x) > 1 occur ,
  H = oo may result ,
  H is not invariant wrt transformations ,
  H is not interpretable as average self-information.
We do not consider such data. However, the mutual information in general :
I(y;x) = I(x;y) = H(x) - H(x|y) = H(y) - H(y|x) = H(x) + H(y) - H(x,y)
I(y;x) = 0.5.ln(2pie.V(x)) + 0.5.ln(2pie.V(y))
         - ln( 2pie.S(x).S(y).sqrt(1 - r^2) )
       = -0.5.ln(1 - r^2) for a joint Gaussian pdf with r == r(x,y) == a
         correlation coefficient <> 0 ,
is invariant to any reversible transformation ie to any 1-to-1 mapping .
H == a COMBINATORIAL measure of overall disorder . Combinatorial, because H
is invariant wrt reordering (ie rearrangement by swapping) of values of p .
In fact the entropy of a discrete r.v. is a logarithm of a combinatorial
product to which Stirling's approximation was applied, thus turning a
product into a sum and getting rid of a multinomial coefficient W. W (for
Wahrscheinlichkeit) is the number of ways in which n indistinguishable
elements may form one and the same configuration of n elements.
H(a.x) = H(x) + ln(a) for a continuous x and a scale factor a > 0 .
A Gaussian pdf ie a normally distributed x is an important one :
p(x) = exp( -((x - E[x])^2)/(2V) )/sqrt(2pi.V)
     = exp( -( (x - E[x])/(S.sqrt(2)) )^2 )/(S.sqrt(2pi))
like eg thermal white noise.
A one-dimensional mixture of 2 Gaussians p(x), p(y), with means E[x], E[y]
and variances V(x), V(y) , is a weighted sum :
p(x+y) = w.p(x) + (1-w).p(y) , 0 <= w <= 1 , is a mixture which has :
the mean     E[x+y] = w.E[x] + (1-w).E[y]
the variance V(x+y) = w.V(x) + (1-w).V(y) + w(1-w).(E[x] - E[y])^2
The latter formulas are of 1894 by Karl Pearson ; try to find them in the
best books you know of, if you can. Good luck !
For the entropy H only V matters, E[x] does not, ie H is independent of the
mean :
H == H(x) = ln(sqrt(2pie.V)) = 0.5.ln(2pie.V) nats or nits (not bits)
          = 1.418938533 + ln(S) , all for a Gaussian ; we see that H is a
linear function of the logarithm of S or of V.
H is maximum : if the range is (-oo, oo) and the mean & variance are known,
then p(x) = Gaussian pdf has maximum entropy ; if the range is ( 0 , oo ) and
the mean is known, then p(x) = exp(-x/E[x])/E[x] has maximum entropy.
H(y|x) = 0.5.ln(2pie.V(y|x)) nats . Since 2.ln(S) = ln(V) we have
2H = ln(2pie.V) = 2.837877067 + ln(V) nats , for a Gaussian , hence :
V = exp(2H)/2pie for a Gaussian pdf , for which it holds :
p'(x) = -p(x).(x - E[x])/V = p(x).(E[x] - x)/V for a Gaussian r.v. x
p"(x) = p(x).((x - E[x])^2 - V)/(V^2) = p(x).( ((x - E[x])/V)^2 - 1/V )
p.(p'/p)^2 = (p'^2)/p = p(x).((x - E[x])/V)^2 for a Gaussian pdf , hence
F = Integral[ dx.p.(p'/p)^2 ] = Integral[ dx.p.(x - E[x])^2 ]/(V^2)
  = V/(V^2) = 1/V = 1/S^2 is the Fisher information for a Gaussian pdf.
{ Frieden 2001 p.73 } says that "The Fisher information is the reciprocal of
the variance for a host of probability laws, those belonging to the
exponential family of laws. Examples are the normal, exponential, binomial,
geometric and chi-square laws. Also, compare 1/V with 1/sqrt(4.pi.V) , the
Gini-Simpson information for the same case" of the normal law p(x).
I have very frequently emailed with Roy Frieden for about 5 years, and have
triggered him to consider Hellinger , Renyi , and also my favorite
Gini-Simpson = Integral[ dx.p^2 ] == the mean value of the pdf p itself !
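Here is a minimal numerical check ( my sketch, in Python with numpy ) of the
two Gaussian formulas just derived, H = 0.5.ln(2pie.V) and F = 1/V ; the
grid and the test variance V = 2.5 are assumptions for this demo only.

# A minimal sketch : check H = 0.5*ln(2pie.V) and F = 1/V for a discretized Gaussian.
import numpy as np

x  = np.linspace(-14.0, 14.0, 4001)
dx = x[1] - x[0]
V  = 2.5                                           # an arbitrary test variance
p  = np.exp(-x**2 / (2*V)) / np.sqrt(2*np.pi*V)

mask = p > 0                                       # skip p = 0 ( p.ln(p) -> 0 )
H = -np.sum(p[mask] * np.log(p[mask])) * dx        # differential entropy in nats
F = 4.0 * np.sum(np.diff(np.sqrt(p))**2) / dx      # Fisher information, amplitude form

print(H, 0.5*np.log(2*np.pi*np.e*V))               # nearly equal
print(F, 1.0/V)                                    # nearly equal

For a non-Gaussian pdf the same two sums still estimate H and F ; only the
closed forms cease to apply.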

N == N(p) = exp(2H)/2pie = 0.05855*e^(2H) == Shannon's entropy power for a
p(x) of any continuous x (not just a Gaussian pdf).
For a sum x+y of independent r.v.s x, y ( the pdf of the sum is the
convolution of the two pdfs, not a mixture ) it holds :
N(x) + N(y) <= N(x+y) is the Entropy Power Inequality ( EPI )
1/F(x) + 1/F(y) <= 1/F(x+y) is the dual of the EPI ,
hence if N(x) = N(y) = V then N(x+y) >= 2.V , hence if N(x) = N(y) = 1 then
N(x+y) = 0.05855*exp(2H(x+y)) >= 2 , and for 2 Gaussians with
V(x) = V(y) = 1 ( for which N = V = 1 ) it is N(x+y) = 2.0 exactly.
{ Shannon 1949, sect.23 } "Theorem 15: Let the average power of two ensembles
be V1 and V2 and let their entropy powers be N1 and N2. Then the entropy
power of the sum, N3, is bounded by N1 + N2 <= N3 <= V1 + V2 .
White Gaussian noise has the peculiar property that it can absorb any other
noise or signal ensemble which may be added to it with a resultant entropy
power APPROXimately equal to the sum of the white noise and the signal power
(MEASURED FROM THE AVERAGE SIGNAL VALUE, WHICH IS NORMALLY ZERO), provided
the signal power is small, in a certain sense, compared to the noise."
{ JH's emphasis }
.-
+Shannon & De Bruijn & Stam higher than Cramer-Rao lower bounds on variance

If we do NOT know the actual V = S^2 , but we do know H , then we can bound
V from below : V >= exp(2H)/2pie = 0.05855*exp(2H) ; this lower bound is
called the ENTROPY POWER N ( the variance of a Gaussian ie of a normal
noise , eg a thermal noise , with the same entropy H ) :
N == N(p) = exp(2H)/2pie = 0.05855*e^(2H) == entropy power of x .
N was introduced in { Shannon 1948 & 1949 section 21 } , where his last
sentence says : "Since white noise has the maximum entropy for a given
power, the entropy power of any noise is less than or equal to its actual
power."
My formulation : a non-Gaussian pdf with a continuous entropy H has an
entropy power N which is a virtual ie as-if Gaussian variance of an
equivalent average power.
Shannon did not express his last sentence in section 21 as a mathematical
inequality , so it has escaped the attention of many (including Jan Kahre ,
Roy Frieden , and until recently, humble me). The explicit inequality is :
exp(2H)/2pie = N <= V == the ACTUAL variance (or noise power) of x , or
exp(H)/sqrt(2pie) = sqrt(N) <= S = sqrt(V) , due to (e^a)^b = e^(a.b).
Shannon (in 1949) has not mentioned Fisher information F (of 1925). The link
between Fisher information F and Shannon entropy power N has been expressed
as F >= 1/N in { A.J. Stam 1959 p.104 eq.(2.3) } , and also the Cramer-Rao
inequality F >= 1/S^2 , together with the Shannonian 1/N >= 1/S^2 .
The Cramer-Rao lower bound S^2 >= 1/F assumes no systematic error aka bias .
I do not discuss bounds for biased estimates, since an empirical
kernel-based non-parametric estimate of p(x) can always be constructed
without a bias. The C-R inequality was (re)discovered by Aitken &
Silverstone (1942), Frechet (1943), C.R. Rao (1945) and by H. Cramer (1946).
In other words :
1/F <= V is the Cramer-Rao inequality ;
1/F <= N is the A.J. Stam inequality ;
N  <= V is the Shannon inequality for entropy power , hence :
1/F <= N <= V shows that Stam has increased the lower bound on V , ie that
Stam has improved on the Cramer-Rao lower bound, since it is always
desirable to obtain bounds as tight as possible. Loose bounds like eg 0 in
V > 0 are of little value.
I collected these separate inequalities into a single one :
S^2 = V >= N >= 1/F , or : 1/S^2 = 1/V <= 1/N <= F , which I multiplied by V
and obtained : 1 <= V/N <= F.V , and F.V >= F.N from Shannon's N <= V ,
F.N >= 1 by A.J. Stam ; these I combined into :
1 <= F.N <= F.V >= V/N >= 1 .
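Before the two questions below, here is a minimal numerical sketch ( mine,
in Python with numpy ) of the combined chain 1/F <= N <= V for a known TRUE
pdf ; it reuses the 2-Gaussian mixture 0.9*G(6,1) + 0.1*G(12,1) from the
numerical example further below, but on an assumed finer grid, so the
printed values will differ slightly from the quoted ones.

# A minimal sketch : check 1/F <= N <= V on a known TRUE pdf , here the mixture
# p(x) = 0.9*G(6,1) + 0.1*G(12,1) used in the numerical example below.
import numpy as np

x  = np.linspace(-2.0, 20.0, 4001)
dx = x[1] - x[0]
gauss = lambda x, m, V: np.exp(-(x - m)**2 / (2*V)) / np.sqrt(2*np.pi*V)
p  = 0.9*gauss(x, 6.0, 1.0) + 0.1*gauss(x, 12.0, 1.0)

mean = np.sum(x * p) * dx
V    = np.sum((x - mean)**2 * p) * dx                 # actual variance
H    = -np.sum(p[p > 0] * np.log(p[p > 0])) * dx      # entropy in nats
N    = np.exp(2*H) / (2*np.pi*np.e)                   # entropy power
F    = 4.0 * np.sum(np.diff(np.sqrt(p))**2) / dx      # Fisher information

print(1.0/F, "<=", N, "<=", V)                        # Cramer-Rao , Stam , Shannon
print("F.N^2 =", F*N*N, " vs V =", V)                 # the proposed higher bound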
Q1: My 1st fresh question I asked myself : Which inequality between F.N and
V/N holds, if any ? Is (almost) always V/N <= F.N or V/N >= F.N ?
In other words : is V <= F.N^2 or V >= F.N^2 ?
A1: My empirical almost-discovery :
1 <= F.N <= V/N  <= F.V  holds for an artificially generated, hence known
                         ACTUAL ie TRUE pdf , while :
1 <= F.N >= Vs/N <= F.Vs holds for a good kernel-based ESTIMATE of that pdf,
                         where Vs is a sample variance .
Hence F.N^2 <= V >= 1/F for an ACTUAL ie TRUE pdf.
Q2: My 2nd fresh question to myself was : Which of the last two lower bounds
on V is higher for a known actual ie true pdf ?
A2: My deduction : since 1 <= F.N , it holds N <= F.N^2 , hence also :
0 < 1/F <= N <= F.(N^2) <= V holds for actual ie true pdfs ;
0 < 1/F <= N <= F.(N^2) >= V holds for my estimates of F, N and V.
My numerical tests on artificially generated pdfs confirmed that. We see
that the Cramer-Rao inequality 1/F <= V holds for the estimates too, simply
because the C-R inequality is not sharp enough. After all, 0 <= V always
holds too.
Here are a few numbers to give you a feeling for the strength of my lower
bounds : for a mix of 2 Gaussians (I use G because N is Shannon's eNtropy
power) like eg: p(x) = 0.9*G(6,1) + 0.1*G(12,1) generated at 200 values of
x, and estimated from a sample of 100 pseudorandom values, I got :
1/F = 1.0253 < 1.9074 = N << 3.5486 = F.N^2 < V = 4.237 for the true pdf ;
1/F = 0.5219 < 1.6817 = N << 5.4188 = F.N^2 > V = 4.194 for estimates.
Note that :
+ both variances V are almost equal (due to my estimates of pdfs);
+ the estimated F is almost twice the true F (no wonder);
+ the << emphasizes the joyful fact that my bound F.N^2 is much higher than
  Shannon's entropy power N <= V ;
+ be reminded that 1/F <= N <= V is A.J. Stam's inequality F >= 1/N combined
  with Shannon's N <= V .
Conclusion : !! My investigations show that the C-R bound 1/F <= V ==
variance is TOO LOW , although better than the trivial bound 0. My lower
bound F.(N^2) would be higher than the C-R lower bound, hence my F.N.N would
be more realistic ie it would be more informative.
.-
+Application of inequalities to automated probability density estimation :

We seldom know the TRUE ie ACTUAL probability density function of a r.v. x.
So we CANNOT evaluate an integrated mean error as a measure of squared or
absolute differences, or as a discrepancy, divergence or a distance like eg
the above mentioned Matusita distance between the TRUE ie ACTUAL pdf and its
estimate. We usually have only a sample of measurement values x1..xj..xn ,
from which we can compute a sample variance Vs , and we can estimate the pdf
p(x) by eg a Rosenblatt-Parzen kernel method. Having an estimate of p(x), we
can integrate it numerically and thus estimate both the Fisher information
F(x) and the entropy H(x) with its entropy power N :
N = exp(2H)/2pie = 0.058549831*e^(2H) .
Now we can compute N.F and Vs/N , and for a decent estimate of a reasonably
smooth (no infinitely steep steps, though not necessarily differentiable)
UNknown pdf from a sample (of hundreds of values, or as small as a dozen
values) we get 2 values : N.F > Vs/N . Finally we compute their ratio
(F.N^2)/Vs . The SMALLER this ratio (the closer to 1 in practice), the
better the estimate of the pdf p(x) , WITHOUT KNOWING the ACTUAL ie TRUE pdf
!! Since we are free to use more than one estimator of p(x) , we can choose
the estimate with the lowest ratio F.N.N/Vs as the most likely best estimate
of p(x).
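Here is a minimal sketch ( mine, in Python with numpy ) of this selection
procedure : draw a sample from the same 2-Gaussian mixture, form
Rosenblatt-Parzen estimates with a Gaussian kernel for a few candidate
bandwidths, and score each estimate by the ratio F.N^2/Vs ; the sample size,
seed, grid and candidate bandwidths are assumptions for the demo, and the
printed numbers will not reproduce the ones quoted above, since the
pseudorandom sample differs.

# A minimal sketch : several Rosenblatt-Parzen kernel estimates of an unknown pdf,
# scored by the ratio F.N^2/Vs ( smaller is taken as better ).
import numpy as np

rng = np.random.default_rng(1)
n = 100                                               # sample size (assumed)
comp   = rng.random(n) < 0.9                          # mixture 0.9*G(6,1) + 0.1*G(12,1)
sample = np.where(comp, rng.normal(6.0, 1.0, n), rng.normal(12.0, 1.0, n))
Vs = np.var(sample)                                   # sample variance

x  = np.linspace(sample.min() - 4.0, sample.max() + 4.0, 2001)
dx = x[1] - x[0]

def parzen(sample, x, h):
    # Gaussian-kernel density estimate on the grid x with bandwidth h
    u = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5*u**2).sum(axis=1) / (len(sample) * h * np.sqrt(2*np.pi))

def score(p, dx):
    H = -np.sum(p[p > 0] * np.log(p[p > 0])) * dx     # entropy in nats
    N = np.exp(2*H) / (2*np.pi*np.e)                  # entropy power
    F = 4.0 * np.sum(np.diff(np.sqrt(p))**2) / dx     # Fisher information
    return F * N * N / Vs                             # the ratio F.N^2/Vs

for h in (0.2, 0.4, 0.8, 1.6):                        # candidate bandwidths (assumed)
    print("h =", h, " F.N^2/Vs =", score(parzen(sample, x, h), dx))
# the bandwidth with the smallest ratio is selected automatically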
This selection process is easy to automate, thus avoiding visual inspections
and subjective judgements by different people.
.-
+Appendix1 :

For the length s of a segment of a curve y(x) == y it holds by the
Pythagorean theorem : ds^2 = dy^2 + dx^2 ; since dy/dx = y' we have
dy = y'.dx , so that ds = dx.sqrt( 1 + y'^2 ) , hence the length of a
segment is : s = Integral[ ds ] = Integral[ dx.sqrt(1 + y'^2) ].
The y(x) can be a p(x).
[ (1 + y'^2)^(3/2) ]/y" = the radius of curvature of a curve y(x) ;
flat curves have a big radius.
p"/[ (1 + p'^2)^(3/2) ] = the curvature of a p(x) ; flat curves have a small
curvature ( ie a big radius ) and a short length s.
.-
+Appendix2 on Stan Ulam's probabilistic tri-inequality and Jan Kahre's LDI

During WWII in Los Alamos, the ingenious Austro-Hungarian Polish-American-
Jewish mathematician Stanislaw Ulam (Lemberg/Lwow/Lviv 1909 - 1984 Santa Fe)
did math for the Manhattan Project of developing the A-toys used on the
civilian populations of Hiroshima and Nagasaki in Japan. After WWII Stan
Ulam has :
- fathered the H-toy (the Hungarian-American-Jewish physicist Edward Teller
  has mothered it, while Hans Bethe was the midwife, as Bethe described it);
+ invented the Monte Carlo method of numerical problem solving ( MC was
  developed by Metropolis, Von Neumann, Pasta, Teller );
- with C.J. Everett co-fathered the mega-eccentric Project Orion ;
+ invented the Ulam distance between genetic sequences (implemented as the
  Peter Sellers algorithm). Compare it with the Levenshtein distance ;
+ also became interested in math for biology in general and for genetics in
  particular. In 1972 he co-authored a Los Alamos report LA-4973 "Metrics in
  biology, an introduction" where, at the very end, he just briefly
  mentioned what I call "Ulam's tri-inequality of probabilistic transitions
  ( U3IPT )".
U3IPT is a little known probabilistic analog of the well known geometric
triangular inequality d(z,x) <= d(z,y) + d(y,x) which is a must for any
proper metric. In { Ulam 1986 p.127 } U3IPT is presented a little better
than in the original LA-4973. Ulam ( like Karl Popper in The Logic of
Scientific Discovery ; my favorite was the Appendix IX on corroboration,
especially the footnotes on p.400 ) used an old fashioned notation :
"sigma(x,y), denoting the probability of transition from x to y" , eg in
genetics due to random mutations, he says. In my modern notation U3IPT
becomes :
P(z|x) >= P(z|y).P(y|x) , 0 <= P(.|.) <= 1 , for a transition z <- y <- x .
Ulam has said very little on his U3IPT, so I am trying to do it here & now.
Note that UNlike the metric distance d(z,x) <= d(z,y) + d(y,x) , U3IPT is :
- oriented ie directed ie asymmetric wrt x, y, z ,
- with the inequality reversed : P(z|x) >= ... vs d(z,x) <= ... ,
- with a multiplication of P's vs an addition of distances d's.
Let's check if U3IPT makes sense :
1: Both tri-inequalities basically say that "a direct route is never longer
   than an indirect one", which fits common sense, ceteris paribus.
2: Summing over z both sides of U3IPT yields 1 >= 1.P(y|x) ; checks :-)
3: Multiplying U3IPT by P(x) yields P(z,x) >= P(z|y).P(y,x) , and
   summing over x yields P(z) >= P(z|y).P(y) = P(z,y) ; checks ;
   summing over z yields P(x) >= 1.P(y,x) ; checks .
4: From 3: we have P(z,x) >= P(z|y).P(y,x) = P(z,y).P(x|y) ; summed over z :
   P(x) >= 1.P(y,x) = P(y).P(x|y) ; checks too .
5: Taking a logarithm (a monotonically increasing function) of both sides of
   U3IPT gives : ln(P(z|x)) >= ln(P(z|y)) + ln(P(y|x)) , eg:
   ln( 0.3 ) >= ln( 0.8 ) + ln( 0.2 ) , ie: -1.2 >= -0.223 + -1.61 = -1.832
6: -ln(P(z|x)) <= -ln(P(z|y)) + -ln(P(y|x)) , ie:
   ln(1/P(z|x)) <= ln(1/P(z|y)) + ln(1/P(y|x)) is a triangle inequality.
So far U3IPT makes a lot of sense, methinks ( a numerical check in code
follows at the end of this appendix ).
Q: Does Ulam's U3IPT fit with Jan Kahre's Law of Diminishing Information
( LDI ) for x -> y -> z ? If yes, then how ? Here is the LDI :
if P(x,y,z) = P(x|y).P(z|y).P(y) then inf(z@x) <= inf(y@x) ,
where inf(z@x) is an information measure ( not necessarily Shannon's one ).
P(z|x) = P(z,x)/P(x) >= P(x,z) >= P(x,y,z) = P(x,z|y).P(y)
       = P(x|y,z).P(z|y).P(y)
We see that : P(x,y,z) = P(x|y).P(z|y).P(y) is required by Kahre for the
LDI, versus P(x,y,z) = P(x|y,z).P(z|y).P(y) in general, hence the LDI
requires P(x|y,z) = P(x|y) .
A: The answer is left as mental jogging to the readers who are more
mathletic than me.
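As promised above, a small numerical check ( mine, in Python with numpy ) of
U3IPT for a transition chain x -> y -> z built from two random stochastic
matrices ; the sizes, the seed and the matrix names A, B, C are merely
assumed for this demo.

# A minimal sketch : check U3IPT , P(z|x) >= P(z|y).P(y|x) , for a transition
# chain x -> y -> z given by two random column-stochastic matrices.
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 4, 5, 3

A = rng.random((ny, nx)); A /= A.sum(axis=0)   # A[y,x] = P(y|x) , columns sum to 1
B = rng.random((nz, ny)); B /= B.sum(axis=0)   # B[z,y] = P(z|y)
C = B @ A                                      # C[z,x] = P(z|x) = Sum_y P(z|y).P(y|x)

# U3IPT holds for every intermediate y , since all the summed terms are >= 0 :
ok = all(np.all(C >= B[:, [y]] * A[[y], :]) for y in range(ny))
print("U3IPT holds for all x, y, z :", ok)     # expected : True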
.-
+References :

Bhattacharyya A: On a measure of divergence between two multinomial
  populations; Sankhya, 7, 1945-1946 (submitted 1941), 401-406
Blachman Nelson M: The convolution inequality for entropy powers; IEEE
  Transactions on Information Theory, April 1965, 267-271
Cover Thomas M. & Thomas Joy A: Elements of Information Theory, 1991, and
  2nd ed.; both editions contain chapters/sections on differential entropy,
  Fisher information, De Bruijn's identity, Stam, the entropy power
  inequality, the Brunn-Minkowski inequality, Renyi entropy power, etc.
Dembo A., Cover T., Thomas J.A: Information theoretic inequalities; IEEE
  Transactions on Information Theory, 37/6, 1991/11, 1501-1518; sect. IV.B
  is on uncertainty principles
Frieden B. Roy: Probability, Statistical Optics, and Data Testing, 3rd ed!
  2001; he mentions (also in his papers) { A.J. Stam 1959 }, { Huber 1964 },
  and the Hirschman uncertainty principle
Good I.J. & Gaskins R.A: Nonparametric roughness penalties for probability
  densities; Biometrika, 58/2, 1971, 255-277
Huber Peter (ETH & MIT): Robust estimation of a location parameter; The
  Annals of Mathematical Statistics, 35:1 (1964), 73-101, where in section 8
  on minimax theory he has played a "game against Nature" using Fisher
  information as a "payoff to the statistician". An earlier statistical
  miniMax player against Nature was Hugo Steinhaus in 1957. Steinhaus was a
  math teacher of Stanislaw Ulam ( for more see +Appendix2 )
Kahre Jan: The Mathematical Theory of Information, 2002 ;
  + in 10.13 he proves that the reciprocal value of a conditional variance
    aka Fisher's invariance 1/var(x|y) conforms to Kahre's Law of Diminishing
    Information ( for the LDI see +Appendix2 here above ) ;
  + in 10.19-10.21 he discusses (expected) Fisher information mainly in
    relation to Heisenberg's uncertainty principle, yet, remarkably :
  - he neither mentions Cramer-Rao nor { A.J. Stam 1959 }. Kahre's favorite
    Cramer is the quantum mechanic J.G. Cramer, not H. Cramer of the C-R
    inequality ;
  - he does not even mention the Cauchy-Schwarz inequality vital to the
    derivation of Fisher information ;
  - he does not mention the I.I. Hirschman inequality, an entropic
    uncertainty relation, an alternative form of the Heisenberg uncertainty
    principle ;
  - in 10.22 he calls "power of a gaussian white noise" what
    { Shannon sect.21 } calls "entropy power", but Kahre does not mention
    Shannon's inequality : "Since white noise has the maximum entropy for a
    given power, the entropy power of any noise is less than or equal to its
    actual power." ;
  - he does not mention Shannon's sampling theorem
Kazakos D. & Cotsidas Th: A decision theory approach to the approximation of
  discrete probability densities; IEEE Trans. PAMI, 2:1, 1980/1, p.62
Matusita Kameo: Decision rules, based on the distance, for problems of fit,
  two samples, and estimation; Annals of Mathematical Statistics, 26, 1955,
  631-640 (thereafter Matusita wrote at least 6 papers on his distance)
Rao C.R: Linear Statistical Inference and its Applications, 2nd ed., 1973
Resnikoff H.L: The Illusion of Reality, 1989; pp.132-139, sect.4.5, is on
  Shannon's sampling theorem and the uncertainty relation for measurement,
  ie dx.dy >= 1/(4pi) ; it employs the Schwarz inequality and the law of
  cosines
Shannon C.E: The Mathematical Theory of Communication, 1949, 4th printing of
  1969, pbk (other printings may differ). He has not mentioned Fisher
  information (of 1925).
Stam A.J: Some Mathematical Properties of Quantities of Information, 1959,
  Ph.D thesis at the University of Delft, NL, 152 pages, where on p.111 Stam
  writes that F.N >= 1 "was communicated to Prof.Ir.Dr. J.L. van Soest by
  Prof.Dr. N.G. de Bruijn who gave a variational proof of it ..." A summary
  of the connections between F, H and N is on p.114
Stam A.J: Some inequalities satisfied by the quantities of information of
  Fisher and Shannon; Information and Control, 2 (1959), 101-112; this paper
  was based on his Ph.D dissertation
Ulam Stanislaw: Science, Computers and People, 1986; see chap.12 = "Some
  further ideas and prospects in biomathematics", which I consider to be the
  chapter richest in ideas
Wheeler John Archibald: The computer and the universe; Int. J. of
  Theoretical Physics, 21:12, 1982, 557-572, see pp.567-569
Wootters William K: Statistical distance and Hilbert space; Physical Review
  D, 23:2, 15 January 1981; this paper was based on his Ph.D dissertation as
  a graduate student of J.A. Wheeler at U. of TX
.-
Philosophical readings ( oldest 1st ) :

Wigner Eugene: The unreasonable effectiveness of mathematics in the natural
  sciences; Communications on Pure and Applied Mathematics, XIII, 1960, 1-14
Schwartz J: The pernicious influence of mathematics on science; in:
  Methodology and Philosophy of Science, Proc. of the 1960 Intl. Congress,
  356-360
Diaconis Persi: Theories of data analysis: From magical thinking through
  classical statistics; in: Exploring Data Tables, Trends, and Shapes; 1985,
  1-36
Livio Mario: The Golden Ratio, 2002; chap.9 = Is God a Mathematician ?,
  section The unreasonable power of mathematics, 237-253
Wilczek Frank: Reasonably effective: 1. Deconstructing a miracle; Physics
  Today, November 2006; Frank Wilczek is a physics Nobelist at MIT
Grattan-Guinness Ivor: Solving Wigner's mystery: The reasonable (though
  perhaps limited) effectiveness of mathematics in the natural sciences; The
  Mathematical Intelligencer, 30, 2008/3, 7-17
-.-