TF ErrorAna StatInference


Statistical Inference

Frequentist -vs- Bayesian Inference

When it comes to testing a hypothesis, there are two dominant philosophies, known as the Frequentist and the Bayesian perspectives.

The dominant discussion for this class will be from the Frequentist perspective.

Frequentist statistical inference

Statistical inference is made using a null-hypothesis test; that is, a test that answers the question: Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?


The relative frequency of occurrence of an event, in a number of repetitions of the experiment, is a measure of the probability of that event. Thus, if n_t is the total number of trials and n_x is the number of trials where the event x occurred, the probability P(x) of the event occurring will be approximated by the relative frequency as follows:

P(x) \approx \frac{n_x}{n_t}
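
A minimal numerical sketch of this idea (not part of the original page; the simulated die roll and the numpy calls are illustrative assumptions) showing the relative frequency n_x/n_t approaching the probability as the number of trials grows:

```python
# Minimal sketch: estimate P(x) by the relative frequency n_x / n_t
# using simulated rolls of a fair six-sided die.
import numpy as np

rng = np.random.default_rng(seed=0)
for n_t in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n_t)   # n_t trials
    n_x = np.count_nonzero(rolls == 6)     # trials where the event "roll a 6" occurred
    print(f"n_t = {n_t:>9d}   P(6) ~ n_x/n_t = {n_x / n_t:.4f}   (exact 1/6 = {1/6:.4f})")
```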

Bayesian inference

Statistical inference is made by using evidence or observations to update or to newly infer the probability that a hypothesis may be true. The name "Bayesian" comes from the frequent use of Bayes' theorem in the inference process.

Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a non-vanishing probability:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

Each term in Bayes' theorem has a conventional name:

  • P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
  • P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
  • P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
  • P(B|A) is the conditional probability of B given A.

Bayes' theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.

Example

Suppose there is a school having 60% boys and 40% girls as students.

The female students wear trousers or skirts in equal numbers; the boys all wear trousers.

An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers.

What is the probability this student is a girl?

The correct answer can be computed using Bayes' theorem.

P(A) = probability that the student observed is a girl = 0.4
P(B) = probability that the student observed is wearing trousers = (60+20)/100 = 0.8
P(B|A) = probability the student is wearing trousers given that the student is a girl = 0.5
P(A|B) = probability the student is a girl given that the student is wearing trousers


P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} = \frac{0.5\times 0.4}{0.8} = 0.25
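
A short sketch checking the arithmetic above (the numbers are those of the example; the script itself is only illustrative):

```python
# Minimal sketch: the trousers example evaluated with Bayes' theorem.
p_girl = 0.4                          # P(A)
p_trousers_given_girl = 0.5           # P(B|A)
p_trousers = 0.6 * 1.0 + 0.4 * 0.5    # P(B): all boys plus half the girls wear trousers

p_girl_given_trousers = p_trousers_given_girl * p_girl / p_trousers
print(p_girl_given_trousers)          # 0.25
```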


Method of Maximum Likelihood

The principle of maximum likelihood is the cornerstone of Frequentist-based hypothesis testing and may be written as:
The best estimate for the mean and standard deviation of the parent population is obtained when the observed set of values is the one most likely to occur; i.e., the probability of the observation is a maximum.

Least Squares Fit to a Line

Applying the Method of Maximum Likelihood

Our objective is to find the best straight-line fit for an expected linear relationship between the dependent variate (y) and the independent variate (x).


If we let y_0(x) represent the "true" linear relationship between the independent variate x and the dependent variate y such that

y_0(x) = A + Bx

Then the probability of observing the value y_i with a standard deviation \sigma_i is given by

P_i = \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_i - y_0(x_i)}{\sigma_i}\right)^2}

assuming an experiment done with sufficiently high statistics that it may be represented by a Gaussian parent distribution.

If you repeat the experiment N times then the probability of deducing the values A and B from the data can be expressed as the joint probability of finding the N values y_i, one for each x_i:

P(A,B) = \prod_i \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_i - y_0(x_i)}{\sigma_i}\right)^2}
= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^N e^{-\frac{1}{2}\sum\left(\frac{y_i - y_0(x_i)}{\sigma_i}\right)^2} = \text{Max}

The maximum probability will result in the best values for A and B

This means

\chi^2 = \sum\left(\frac{y_i - y_0(x_i)}{\sigma_i}\right)^2 = \sum\left(\frac{y_i - A - Bx_i}{\sigma_i}\right)^2 = \text{Min}

The minimum of \chi^2 occurs when the function is at a minimum with respect to both parameters A and B; i.e.

\frac{\partial\chi^2}{\partial A} = \frac{\partial}{\partial A}\sum\left(\frac{y_i - A - Bx_i}{\sigma_i}\right)^2 = 0
\frac{\partial\chi^2}{\partial B} = \frac{\partial}{\partial B}\sum\left(\frac{y_i - A - Bx_i}{\sigma_i}\right)^2 = 0
If \sigma_i = \sigma
All variances are the same (weighted fits don't make this assumption)

Then

\frac{\partial\chi^2}{\partial A} = \frac{1}{\sigma^2}\frac{\partial}{\partial A}\sum\left(y_i - A - Bx_i\right)^2 = \frac{-2}{\sigma^2}\sum\left(y_i - A - Bx_i\right) = 0
\frac{\partial\chi^2}{\partial B} = \frac{1}{\sigma^2}\frac{\partial}{\partial B}\sum\left(y_i - A - Bx_i\right)^2 = \frac{-2}{\sigma^2}\sum x_i\left(y_i - A - Bx_i\right) = 0


or

\sum\left(y_i - A - Bx_i\right) = 0
\sum x_i\left(y_i - A - Bx_i\right) = 0

The above equations represent a set of 2 simultaneous equations with 2 unknowns which can be solved.

\sum y_i = NA + B\sum x_i
\sum x_iy_i = A\sum x_i + B\sum x_i^2


\begin{pmatrix}\sum y_i\\ \sum x_iy_i\end{pmatrix} = \begin{pmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{pmatrix}\begin{pmatrix}A\\ B\end{pmatrix}


The Method of Determinants

for the matrix problem:

\begin{pmatrix}y_1\\ y_2\end{pmatrix} = \begin{pmatrix}a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix}\begin{pmatrix}x_1\\ x_2\end{pmatrix}

the above can be written as

y_1 = a_{11}x_1 + a_{12}x_2
y_2 = a_{21}x_1 + a_{22}x_2

solving for x_1 assuming y_1 and y_2 are known

a_{22}\times\left(y_1 = a_{11}x_1 + a_{12}x_2\right)
a_{12}\times\left(y_2 = a_{21}x_1 + a_{22}x_2\right)

Subtracting the second equation from the first:

a_{22}y_1 - a_{12}y_2 = \left(a_{11}a_{22} - a_{12}a_{21}\right)x_1
\begin{vmatrix}y_1 & a_{12}\\ y_2 & a_{22}\end{vmatrix} = \begin{vmatrix}a_{11} & a_{12}\\ a_{21} & a_{22}\end{vmatrix}\,x_1

or

x_1 = \frac{\begin{vmatrix}y_1 & a_{12}\\ y_2 & a_{22}\end{vmatrix}}{\begin{vmatrix}a_{11} & a_{12}\\ a_{21} & a_{22}\end{vmatrix}} \qquad\text{similarly}\qquad x_2 = \frac{\begin{vmatrix}a_{11} & y_1\\ a_{21} & y_2\end{vmatrix}}{\begin{vmatrix}a_{11} & a_{12}\\ a_{21} & a_{22}\end{vmatrix}}

Solutions exist as long as

\begin{vmatrix}a_{11} & a_{12}\\ a_{21} & a_{22}\end{vmatrix} \neq 0

Applying the method of determinants to the maximum likelihood problem above gives

A = \frac{\begin{vmatrix}\sum y_i & \sum x_i\\ \sum x_iy_i & \sum x_i^2\end{vmatrix}}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}}
B = \frac{\begin{vmatrix}N & \sum y_i\\ \sum x_i & \sum x_iy_i\end{vmatrix}}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}}
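
A minimal sketch evaluating these determinant expressions numerically; the data points are made up for illustration, and np.polyfit is mentioned only as an independent cross-check:

```python
# Minimal sketch (illustrative data): least squares line fit y = A + B*x with
# equal uncertainties, evaluated from the determinant (Cramer's rule) formulas.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(x)

Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), (x * x).sum(), (x * y).sum()
det = N * Sxx - Sx * Sx                  # |N Σx; Σx Σx²|
A = (Sy * Sxx - Sx * Sxy) / det          # y-intercept
B = (N * Sxy - Sx * Sy) / det            # slope
print(A, B)
# np.polyfit(x, y, 1) returns (B, A) and can be used as a cross-check.
```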


If the uncertainty in all the measurements is not the same then we need to insert \sigma_i back into the system of equations.

A = \frac{\begin{vmatrix}\sum\frac{y_i}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_iy_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}}{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}} \qquad B = \frac{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{y_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_iy_i}{\sigma_i^2}\end{vmatrix}}{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}}

Uncertainty in the Linear Fit parameters

As always the uncertainty is determined by the Taylor expansion in quadrature such that

\sigma_P^2 = \sum_i\left[\sigma_i^2\left(\frac{\partial P}{\partial y_i}\right)^2\right] = error in parameter P; here the covariance has been assumed to be zero

By definition of variance

\sigma_i^2 \approx s^2 = \frac{\sum\left(y_i - A - Bx_i\right)^2}{N-2} : there are 2 parameters and N data points, which translates to (N-2) degrees of freedom.


The least squares fit (assuming equal \sigma) has the following solution for the parameters A and B:

A = \frac{\begin{vmatrix}\sum y_i & \sum x_i\\ \sum x_iy_i & \sum x_i^2\end{vmatrix}}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}} \qquad B = \frac{\begin{vmatrix}N & \sum y_i\\ \sum x_i & \sum x_iy_i\end{vmatrix}}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}}

uncertainty in A

\frac{\partial A}{\partial y_j} = \frac{\partial}{\partial y_j}\,\frac{\sum y_i\sum x_i^2 - \sum x_i\sum x_iy_i}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}}
= \frac{\sum x_i^2 - x_j\sum x_i}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}} \quad\text{only the } y_j \text{ term survives}
= D\left(\sum x_i^2 - x_j\sum x_i\right)

Let

D \equiv \frac{1}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}} = \frac{1}{N\sum x_i^2 - \sum x_i\sum x_i}
\sigma_A^2 = \sum_{j=1}^N\left[\sigma_j^2\left(\frac{\partial A}{\partial y_j}\right)^2\right]
= \sum_{j=1}^N\sigma_j^2\left(D\left(\sum x_i^2 - x_j\sum x_i\right)\right)^2
= \sigma^2D^2\sum_{j=1}^N\left(\sum x_i^2 - x_j\sum x_i\right)^2 : Assume \sigma_i = \sigma
= \sigma^2D^2\sum_{j=1}^N\left[\left(\sum x_i^2\right)^2 + \left(x_j\sum x_i\right)^2 - 2\sum x_i^2\,x_j\sum x_i\right]
= \sigma^2D^2\left[N\left(\sum x_i^2\right)^2 + \left(\sum x_i\right)^2\sum_{j=1}^N x_j^2 - 2\sum x_i^2\sum x_i\sum_{j=1}^N x_j\right]
\sum_{j=1}^N x_j = \sum x_i \quad\text{and}\quad \sum_{j=1}^N x_j^2 = \sum x_i^2 : both sums run over the same N observations
= \sigma^2D^2\sum x_i^2\left[N\sum x_i^2 + \left(\sum x_i\right)^2 - 2\left(\sum x_i\right)^2\right]
= \sigma^2D^2\sum x_i^2\left[N\sum x_i^2 - \left(\sum x_i\right)^2\right]
= \sigma^2D^2\sum x_i^2\,\frac{1}{D}
\sigma_A^2 = \sigma^2\,\frac{\sum x_i^2}{N\sum x_i^2 - \left(\sum x_i\right)^2}
\sigma_A^2 = \frac{\sum\left(y_i - A - Bx_i\right)^2}{N-2}\;\frac{\sum x_i^2}{N\sum x_i^2 - \left(\sum x_i\right)^2}


If we redefine our origin in the linear plot so the line is centered at x = 0, then

\sum x_i = 0
\frac{\sum x_i^2}{N\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{\sum x_i^2}{N\sum x_i^2} = \frac{1}{N}

or

\sigma_A^2 = \frac{\sum\left(y_i - A - Bx_i\right)^2}{N-2}\,\frac{1}{N} = \frac{\sigma^2}{N}
Note
The parameter A is the y-intercept, so it makes some intuitive sense that the error in the y-intercept would be dominated by the statistical error in y.


uncertainty in B

B = \frac{\begin{vmatrix}N & \sum y_i\\ \sum x_i & \sum x_iy_i\end{vmatrix}}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}}
\sigma_B^2 = \sum_{j=1}^N\left[\sigma_j^2\left(\frac{\partial B}{\partial y_j}\right)^2\right]
\frac{\partial B}{\partial y_j} = \frac{\partial}{\partial y_j}\,\frac{\begin{vmatrix}N & \sum y_i\\ \sum x_i & \sum x_iy_i\end{vmatrix}}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}} = \frac{\partial}{\partial y_j}\,D\left(N\sum x_iy_i - \sum x_i\sum y_i\right)
= D\left(Nx_j - \sum x_i\right)


\sigma_B^2 = \sum_{j=1}^N\left[\sigma_j^2 D^2\left(Nx_j - \sum x_i\right)^2\right]
= \sigma^2D^2\sum_{j=1}^N\left(Nx_j - \sum x_i\right)^2 \quad\text{assuming } \sigma_j = \sigma
= \sigma^2D^2\sum_{j=1}^N\left[N^2x_j^2 - 2Nx_j\sum x_i + \left(\sum x_i\right)^2\right]
= \sigma^2D^2\left[N^2\sum_{j=1}^N x_j^2 - 2N\sum x_i\sum_{j=1}^N x_j + \left(\sum x_i\right)^2\sum_{j=1}^N 1\right]
= \sigma^2D^2\left[N^2\sum x_i^2 - 2N\left(\sum x_i\right)^2 + N\left(\sum x_i\right)^2\right]
= N\sigma^2D^2\left[N\sum x_i^2 - \left(\sum x_i\right)^2\right]
= ND^2\sigma^2\,\frac{1}{D} = ND\,\sigma^2
\sigma_B^2 = N\,\frac{\sum\left(y_i - A - Bx_i\right)^2}{N-2}\;\frac{1}{N\sum x_i^2 - \left(\sum x_i\right)^2}
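
A sketch of these parameter uncertainties for the equal-\sigma case; the data are the same made-up points used in the fitting sketch above:

```python
# Minimal sketch (illustrative data): uncertainties on A and B for the
# equal-sigma fit, with s² = Σ(y_i - A - B x_i)²/(N-2) estimating σ².
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(x)
Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), (x * x).sum(), (x * y).sum()
det = N * Sxx - Sx * Sx
A = (Sy * Sxx - Sx * Sxy) / det
B = (N * Sxy - Sx * Sy) / det

s2 = ((y - A - B * x) ** 2).sum() / (N - 2)   # estimate of σ²
sigma_A = np.sqrt(s2 * Sxx / det)             # σ_A² = σ² Σx² / (NΣx² − (Σx)²)
sigma_B = np.sqrt(s2 * N / det)               # σ_B² = N σ² / (NΣx² − (Σx)²)
print(f"A = {A:.3f} ± {sigma_A:.3f},  B = {B:.3f} ± {sigma_B:.3f}")
```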

Linear Fit with error

From above we know that if each independent measurement has a different error \sigma_i then the fit parameters are given by

A = \frac{\begin{vmatrix}\sum\frac{y_i}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_iy_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}}{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}} \qquad B = \frac{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{y_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_iy_i}{\sigma_i^2}\end{vmatrix}}{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}}

Weighted Error in A

\sigma_A^2 = \sum_{j=1}^N\left[\sigma_j^2\left(\frac{\partial A}{\partial y_j}\right)^2\right]
A = \frac{\begin{vmatrix}\sum\frac{y_i}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_iy_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}}{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}}

Let

D = \frac{1}{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}} = \frac{1}{\sum\frac{1}{\sigma_i^2}\sum\frac{x_i^2}{\sigma_i^2} - \sum\frac{x_i}{\sigma_i^2}\sum\frac{x_i}{\sigma_i^2}}


\frac{\partial A}{\partial y_j} = D\frac{\partial}{\partial y_j}\left[\sum\frac{y_i}{\sigma_i^2}\sum\frac{x_i^2}{\sigma_i^2} - \sum\frac{x_i}{\sigma_i^2}\sum\frac{x_iy_i}{\sigma_i^2}\right]
= \frac{D}{\sigma_j^2}\left(\sum\frac{x_i^2}{\sigma_i^2} - x_j\sum\frac{x_i}{\sigma_i^2}\right)


\sigma_A^2 = \sum_{j=1}^N\left[\sigma_j^2\left(\frac{\partial A}{\partial y_j}\right)^2\right] = \sum_{j=1}^N\left[\sigma_j^2 D^2\frac{1}{\sigma_j^4}\left(\sum\frac{x_i^2}{\sigma_i^2} - x_j\sum\frac{x_i}{\sigma_i^2}\right)^2\right]
= D^2\sum_{j=1}^N\frac{1}{\sigma_j^2}\left[\left(\sum\frac{x_i^2}{\sigma_i^2}\right)^2 - 2x_j\sum\frac{x_i^2}{\sigma_i^2}\sum\frac{x_i}{\sigma_i^2} + x_j^2\left(\sum\frac{x_i}{\sigma_i^2}\right)^2\right]
= D^2\left[\left(\sum\frac{x_i^2}{\sigma_i^2}\right)^2\sum_{j=1}^N\frac{1}{\sigma_j^2} - 2\sum\frac{x_i^2}{\sigma_i^2}\sum\frac{x_i}{\sigma_i^2}\sum_{j=1}^N\frac{x_j}{\sigma_j^2} + \left(\sum\frac{x_i}{\sigma_i^2}\right)^2\sum_{j=1}^N\frac{x_j^2}{\sigma_j^2}\right]
= D^2\sum\frac{x_i^2}{\sigma_i^2}\left[\sum\frac{1}{\sigma_i^2}\sum\frac{x_i^2}{\sigma_i^2} - 2\left(\sum\frac{x_i}{\sigma_i^2}\right)^2 + \left(\sum\frac{x_i}{\sigma_i^2}\right)^2\right]
= D^2\sum\frac{x_i^2}{\sigma_i^2}\left[\sum\frac{1}{\sigma_i^2}\sum\frac{x_i^2}{\sigma_i^2} - \left(\sum\frac{x_i}{\sigma_i^2}\right)^2\right]
= D\sum\frac{x_i^2}{\sigma_i^2} = \frac{\sum\frac{x_i^2}{\sigma_i^2}}{\sum\frac{1}{\sigma_i^2}\sum\frac{x_i^2}{\sigma_i^2} - \sum\frac{x_i}{\sigma_i^2}\sum\frac{x_i}{\sigma_i^2}}
Compare with the unweighted error
\sigma_A^2 = \frac{\sum\left(y_i - A - Bx_i\right)^2}{N-2}\;\frac{\sum x_i^2}{N\sum x_i^2 - \left(\sum x_i\right)^2}

Weighted Error in B

\sigma_B^2 = \sum_{j=1}^N\left[\sigma_j^2\left(\frac{\partial B}{\partial y_j}\right)^2\right]
B = \frac{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{y_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_iy_i}{\sigma_i^2}\end{vmatrix}}{\begin{vmatrix}\sum\frac{1}{\sigma_i^2} & \sum\frac{x_i}{\sigma_i^2}\\ \sum\frac{x_i}{\sigma_i^2} & \sum\frac{x_i^2}{\sigma_i^2}\end{vmatrix}}


\frac{\partial B}{\partial y_j} = D\frac{\partial}{\partial y_j}\left[\sum\frac{1}{\sigma_i^2}\sum\frac{x_iy_i}{\sigma_i^2} - \sum\frac{y_i}{\sigma_i^2}\sum\frac{x_i}{\sigma_i^2}\right]
= \frac{D}{\sigma_j^2}\left(x_j\sum\frac{1}{\sigma_i^2} - \sum\frac{x_i}{\sigma_i^2}\right)
\sigma_B^2 = \sum_{j=1}^N\left[\sigma_j^2\left(\frac{\partial B}{\partial y_j}\right)^2\right]
= \sum_{j=1}^N\left[\sigma_j^2\frac{D^2}{\sigma_j^4}\left(x_j\sum\frac{1}{\sigma_i^2} - \sum\frac{x_i}{\sigma_i^2}\right)^2\right]
= D^2\sum_{j=1}^N\frac{1}{\sigma_j^2}\left[x_j^2\left(\sum\frac{1}{\sigma_i^2}\right)^2 - 2x_j\sum\frac{1}{\sigma_i^2}\sum\frac{x_i}{\sigma_i^2} + \left(\sum\frac{x_i}{\sigma_i^2}\right)^2\right]
= D^2\left[\left(\sum\frac{1}{\sigma_i^2}\right)^2\sum\frac{x_i^2}{\sigma_i^2} - 2\sum\frac{1}{\sigma_i^2}\left(\sum\frac{x_i}{\sigma_i^2}\right)^2 + \left(\sum\frac{x_i}{\sigma_i^2}\right)^2\sum\frac{1}{\sigma_i^2}\right]
= D^2\sum\frac{1}{\sigma_i^2}\left[\sum\frac{1}{\sigma_i^2}\sum\frac{x_i^2}{\sigma_i^2} - \left(\sum\frac{x_i}{\sigma_i^2}\right)^2\right]
= D\sum\frac{1}{\sigma_i^2}
\sigma_B^2 = \frac{\sum\frac{1}{\sigma_i^2}}{\sum\frac{1}{\sigma_i^2}\sum\frac{x_i^2}{\sigma_i^2} - \sum\frac{x_i}{\sigma_i^2}\sum\frac{x_i}{\sigma_i^2}}
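
A minimal sketch of the weighted straight-line fit and its parameter errors; the data points and per-point uncertainties are illustrative assumptions:

```python
# Minimal sketch (illustrative data): weighted line fit with per-point
# uncertainties σ_i, built from the weighted sums used in the determinants above.
import numpy as np

x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.5])   # assumed measurement errors

w   = 1.0 / sigma**2
S   = w.sum()             # Σ 1/σ²
Sx  = (w * x).sum()       # Σ x/σ²
Sy  = (w * y).sum()       # Σ y/σ²
Sxx = (w * x * x).sum()   # Σ x²/σ²
Sxy = (w * x * y).sum()   # Σ xy/σ²

det = S * Sxx - Sx * Sx
A = (Sy * Sxx - Sx * Sxy) / det
B = (S * Sxy - Sx * Sy) / det
sigma_A = np.sqrt(Sxx / det)    # σ_A² = Σ(x²/σ²) / det
sigma_B = np.sqrt(S / det)      # σ_B² = Σ(1/σ²) / det
print(f"A = {A:.3f} ± {sigma_A:.3f},  B = {B:.3f} ± {sigma_B:.3f}")
```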


Correlation Probability

Once the Linear Fit has been performed, the next step will be to determine a probability that the Fit is actually describing the data.

The Correlation Probability (R) is one method used to try to determine this probability.

This method evaluates the "slope" parameter to determine if there is a correlation between the dependent and independent variables, x and y.

The linear fit above was done to minimize \chi^2 for the following model

y=A+Bx


What if we turn this equation around such that

x = A' + B'y

If there is no correlation between x and y then B' = 0

If there is complete correlation between x and y then

A' = -\frac{A}{B} \quad\text{and}\quad B' = \frac{1}{B}
and BB' = 1

So one can define a metric BB' which has the natural range between 0 and 1 such that

R \equiv \sqrt{BB'}

since

B = \frac{\begin{vmatrix}N & \sum y_i\\ \sum x_i & \sum x_iy_i\end{vmatrix}}{\begin{vmatrix}N & \sum x_i\\ \sum x_i & \sum x_i^2\end{vmatrix}} = \frac{N\sum x_iy_i - \sum y_i\sum x_i}{N\sum x_i^2 - \sum x_i\sum x_i}

and one can show that

B' = \frac{\begin{vmatrix}N & \sum x_i\\ \sum y_i & \sum x_iy_i\end{vmatrix}}{\begin{vmatrix}N & \sum y_i\\ \sum y_i & \sum y_i^2\end{vmatrix}} = \frac{N\sum x_iy_i - \sum x_i\sum y_i}{N\sum y_i^2 - \sum y_i\sum y_i}

Thus

R = \sqrt{\frac{N\sum x_iy_i - \sum y_i\sum x_i}{N\sum x_i^2 - \sum x_i\sum x_i}\;\frac{N\sum x_iy_i - \sum x_i\sum y_i}{N\sum y_i^2 - \sum y_i\sum y_i}}
= \frac{N\sum x_iy_i - \sum y_i\sum x_i}{\sqrt{\left(N\sum x_i^2 - \sum x_i\sum x_i\right)\left(N\sum y_i^2 - \sum y_i\sum y_i\right)}}
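
A short sketch evaluating R directly from the sums in the expression above (illustrative data; np.corrcoef is noted only as a cross-check):

```python
# Minimal sketch (illustrative data): the correlation coefficient R from the sums above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
N = len(x)

num = N * (x * y).sum() - y.sum() * x.sum()
den = np.sqrt((N * (x * x).sum() - x.sum()**2) * (N * (y * y).sum() - y.sum()**2))
R = num / den
print(R)                          # close to 1 for strongly correlated data
# np.corrcoef(x, y)[0, 1] gives the same number and can be used as a cross-check.
```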


Note
The correlation coefficient (R) CAN'T by itself be used to indicate the degree of correlation. The probability distribution of R can be derived from a 2-D Gaussian, but knowledge of the correlation coefficient of the parent population (\rho) is required to evaluate R of the sample distribution.

Instead one assumes a correlation of \rho = 0 in the parent distribution and then compares the sample value of R with what you would get if there were no correlation.

The smaller the probability of obtaining the observed R from uncorrelated data, the more likely it is that the data are correlated and that the linear fit is justified.


P_R(R,\nu) = \frac{1}{\sqrt{\pi}}\,\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\left(1 - R^2\right)^{\frac{\nu - 2}{2}}

= Probability that any random sample of UNCORRELATED data would yield the correlation coefficient R

where

\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt
(ROOT::Math::tgamma(double x) )
\nu = N - 2 = number of degrees of freedom = number of data points - number of parameters in the fit function

Derived in "Pugh and Winslow, The Analysis of Physical Measurement, Addison-Wesley Publishing, 1966."
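
A sketch evaluating P_R(R,\nu) numerically. The page only quotes the density, so the two-sided integral from |R| to 1, used here as the probability that uncorrelated data would give an equal or larger coefficient, is the standard usage and should be treated as an assumption:

```python
# Minimal sketch: evaluate P_R(R, ν) and integrate it (assumption: two-sided
# integral from |R| to 1) to estimate the probability that N uncorrelated
# points would give a correlation coefficient at least as large as |R|.
import numpy as np
from math import gamma, pi, sqrt

def P_R(r, nu):
    """Probability density of the correlation coefficient for uncorrelated data."""
    return (1.0 / sqrt(pi)) * gamma((nu + 1) / 2) / gamma(nu / 2) * (1.0 - r * r) ** ((nu - 2) / 2)

def prob_uncorrelated(R, N):
    nu = N - 2                              # degrees of freedom
    r = np.linspace(abs(R), 1.0, 10001)
    vals = P_R(r, nu)
    dr = r[1] - r[0]
    trapezoid = (vals.sum() - 0.5 * (vals[0] + vals[-1])) * dr   # numerical integral
    return 2.0 * trapezoid                   # both signs of r

print(prob_uncorrelated(R=0.9, N=5))         # small value -> unlikely to be uncorrelated
```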

Least Squares fit to a Polynomial

Let's assume we wish to now fit a polynomial instead of a straight line to the data.

y(x) = \sum_{j=0}^n a_jx^j = \sum_{j=0}^n a_jf_j(x)
f_j(x) = a function which does not depend on a_j

Then the probability of observing the value y_i with a standard deviation \sigma_i is given by

P_i(a_0,a_1,\dots,a_n) = \frac{1}{\sigma_i\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{y_i - \sum_{j=0}^n a_jf_j(x_i)}{\sigma_i}\right)^2}

assuming an experiment done with sufficiently high statistics that it may be represented by a Gaussian parent distribution.

If you repeat the experiment N times then the probability of deducing the values a_j from the data can be expressed as the joint probability of finding the N values y_i, one for each x_i:

P(a_0,a_1,\dots,a_n) = \prod_i P_i(a_0,a_1,\dots,a_n) = \prod_i\frac{1}{\sigma_i\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{y_i - \sum_{j=0}^n a_jf_j(x_i)}{\sigma_i}\right)^2} \propto e^{-\frac{1}{2}\sum_i^N\left(\frac{y_i - \sum_{j=0}^n a_jf_j(x_i)}{\sigma_i}\right)^2}


Once again the probability is maximized when the sum in the argument of the exponential is a minimum.

Let

\chi^2 = \sum_i^N\left(\frac{y_i - \sum_{j=0}^n a_jf_j(x_i)}{\sigma_i}\right)^2

where N = number of data points and n = order of polynomial used to fit the data.

The minimum in \chi^2 is found by setting the partial derivative with respect to each fit parameter \left(\frac{\partial\chi^2}{\partial a_k}\right) to zero.

\frac{\partial\chi^2}{\partial a_k} = \frac{\partial}{\partial a_k}\sum_i^N\frac{1}{\sigma_i^2}\left(y_i - \sum_{j=0}^n a_jf_j(x_i)\right)^2
= \sum_i^N\frac{2}{\sigma_i^2}\left(y_i - \sum_{j=0}^n a_jf_j(x_i)\right)\frac{\partial}{\partial a_k}\left(-\sum_{j=0}^n a_jf_j(x_i)\right)
= -\sum_i^N\frac{2}{\sigma_i^2}\left(y_i - \sum_{j=0}^n a_jf_j(x_i)\right)f_k(x_i)
= -2\sum_i^N\frac{1}{\sigma_i^2}f_k(x_i)\left(y_i - \sum_{j=0}^n a_jf_j(x_i)\right) = 0
\sum_i^N\frac{y_if_k(x_i)}{\sigma_i^2} = \sum_i^N\frac{f_k(x_i)}{\sigma_i^2}\sum_{j=0}^n a_jf_j(x_i) = \sum_{j=0}^n a_j\sum_i^N\frac{f_k(x_i)f_j(x_i)}{\sigma_i^2}


You now have a system of n coupled equations for the parameters aj with each equation summing over the N measurements.

The first equation (f_1) looks like this

\sum_i^N\frac{f_1(x_i)y_i}{\sigma_i^2} = \sum_i^N\frac{f_1(x_i)}{\sigma_i^2}\left(a_1f_1(x_i) + a_2f_2(x_i) + \cdots + a_nf_n(x_i)\right)

You could use the method of determinants as we did to find the parameters (an) for a linear fit but it is more convenient to use matrices in a technique referred to as regression analysis

Regression Analysis

The parameters aj in the previous section are linear parameters to a general function which may be a polynomial.

The system of equations is composed of n equations where the kth equation is given as

\sum_i^N\frac{y_if_k(x_i)}{\sigma_i^2} = \sum_{j=0}^n a_j\sum_i^N\frac{f_k(x_i)f_j(x_i)}{\sigma_i^2}


may be represented in matrix form as

\tilde\beta = \tilde a\,\tilde\alpha

where

\beta_k = \sum_i^N\frac{y_if_k(x_i)}{\sigma_i^2}
a_k = a_k
\alpha_{kj} = \sum_i^N\frac{f_k(x_i)f_j(x_i)}{\sigma_i^2}

or in matrix form


\tilde\beta = \left(\beta_1, \beta_2, \dots, \beta_n\right) = a row matrix of order n
\tilde a = \left(a_1, a_2, \dots, a_n\right) = a row matrix of the parameters


\tilde\alpha = \begin{pmatrix}\alpha_{11} & \alpha_{12} & \cdots & \alpha_{1j}\\ \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2j}\\ \vdots & \vdots & & \vdots\\ \alpha_{k1} & \alpha_{k2} & \cdots & \alpha_{kj}\end{pmatrix} = a k\times j = n\times n matrix
The objective is to find the parameters a_n.
To find a_n, just invert the matrix:
\tilde\beta = \tilde a\,\tilde\alpha
\left(\tilde\beta = \tilde a\,\tilde\alpha\right)\tilde\alpha^{-1}
\tilde\beta\,\tilde\alpha^{-1} = \tilde a\,\tilde\alpha\,\tilde\alpha^{-1}
\tilde\beta\,\tilde\alpha^{-1} = \tilde a\,\tilde 1


\tilde a = \tilde\beta\,\tilde\alpha^{-1}
Thus if you invert the matrix \tilde\alpha you find \tilde\alpha^{-1} and, as a result, the parameters a_n.
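
A minimal sketch of this regression procedure: build \beta and \alpha from the data, invert \alpha, and recover the parameters. The data, uncertainties, and polynomial order are illustrative assumptions, and np.polyfit is noted only as a cross-check:

```python
# Minimal sketch (illustrative data): polynomial regression written as above --
# β_k = Σ y f_k/σ², α_kj = Σ f_k f_j/σ², then a = β α⁻¹.
import numpy as np

x     = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([1.2, 1.9, 4.1, 8.3, 13.8, 22.1])
sigma = np.full_like(x, 0.5)          # assumed, equal errors
order = 2                             # fit y = a0 + a1 x + a2 x²

f = np.array([x**j for j in range(order + 1)])                   # f_j(x_i) = x_i^j
beta  = (y / sigma**2 * f).sum(axis=1)                           # β_k = Σ_i y_i f_k(x_i)/σ_i²
alpha = (f[:, None, :] * f[None, :, :] / sigma**2).sum(axis=2)   # α_kj = Σ_i f_k f_j/σ_i²

a = beta @ np.linalg.inv(alpha)       # ã = β̃ α̃⁻¹
print(a)
# np.polyfit(x, y, order, w=1/sigma) lists the coefficients in reverse order but agrees.
```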

Matrix inversion

The first thing to note is that for the inverse of a matrix (\tilde A) to exist its determinant can not be zero

\left|\tilde A\right| \neq 0


The inverse of a matrix (\tilde A^{-1}) is defined such that

\tilde A\,\tilde A^{-1} = \tilde 1

If we divide both sides by the matrix \tilde A then we have

\tilde A^{-1} = \frac{\tilde 1}{\tilde A}

The above ratio of the unity matrix to the matrix \tilde A remains equal to \tilde A^{-1} as long as both the numerator and the denominator are subjected to the same elementary row operations.

If we do such operations we can transform the ratio such that the denominator has the unity matrix and then the numerator will have the inverse matrix.

This is the principle of Gauss-Jordan Elimination.

Gauss-Jordan Elimination

If Gauss–Jordan elimination is applied on a square matrix, it can be used to calculate the inverse matrix. This can be done by augmenting the square matrix with the identity matrix of the same dimensions, and through the following matrix operations:

\left[\tilde A\,\big|\,\tilde 1\right] \;\Rightarrow\; \tilde A^{-1}\left[\tilde A\,\big|\,\tilde 1\right] \;\Rightarrow\; \left[\tilde 1\,\big|\,\tilde A^{-1}\right]

If the original square matrix, A, is given by the following expression:

\tilde A = \begin{pmatrix}2 & -1 & 0\\ -1 & 2 & -1\\ 0 & -1 & 2\end{pmatrix}

Then, after augmenting by the identity, the following is obtained:

\left[\tilde A\,\big|\,\tilde 1\right] = \left(\begin{array}{ccc|ccc}2 & -1 & 0 & 1 & 0 & 0\\ -1 & 2 & -1 & 0 & 1 & 0\\ 0 & -1 & 2 & 0 & 0 & 1\end{array}\right)

By performing elementary row operations on the \left[\tilde A\,\big|\,\tilde 1\right] matrix until it reaches reduced row echelon form, the following is the final result:

\left[\tilde 1\,\big|\,\tilde A^{-1}\right] = \left(\begin{array}{ccc|ccc}1 & 0 & 0 & \tfrac{3}{4} & \tfrac{1}{2} & \tfrac{1}{4}\\ 0 & 1 & 0 & \tfrac{1}{2} & 1 & \tfrac{1}{2}\\ 0 & 0 & 1 & \tfrac{1}{4} & \tfrac{1}{2} & \tfrac{3}{4}\end{array}\right)

The matrix augmentation can now be undone, which gives the following:

\tilde 1 = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix} \qquad \tilde A^{-1} = \begin{pmatrix}\tfrac{3}{4} & \tfrac{1}{2} & \tfrac{1}{4}\\ \tfrac{1}{2} & 1 & \tfrac{1}{2}\\ \tfrac{1}{4} & \tfrac{1}{2} & \tfrac{3}{4}\end{pmatrix}

or

\tilde A^{-1} = \frac{1}{4}\begin{pmatrix}3 & 2 & 1\\ 2 & 4 & 2\\ 1 & 2 & 3\end{pmatrix} = \frac{1}{\det(A)}\begin{pmatrix}3 & 2 & 1\\ 2 & 4 & 2\\ 1 & 2 & 3\end{pmatrix}

A matrix is non-singular (meaning that it has an inverse matrix) if and only if the identity matrix can be obtained using only elementary row operations.
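
A sketch of Gauss-Jordan inversion on the augmented matrix, reproducing the 3×3 example above (the partial-pivoting detail is an implementation choice, not from the page):

```python
# Minimal sketch: invert a square matrix by Gauss-Jordan elimination on the
# augmented matrix [A | 1], reproducing the 3x3 example above.
import numpy as np

def gauss_jordan_inverse(A):
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    aug = np.hstack([A, np.eye(n)])                       # [A | identity]
    for col in range(n):
        pivot = col + np.argmax(np.abs(aug[col:, col]))   # partial pivoting
        if np.isclose(aug[pivot, col], 0.0):
            raise ValueError("matrix is singular (determinant is zero)")
        aug[[col, pivot]] = aug[[pivot, col]]             # swap rows
        aug[col] /= aug[col, col]                         # scale pivot row to 1
        for row in range(n):
            if row != col:
                aug[row] -= aug[row, col] * aug[col]      # clear the rest of the column
    return aug[:, n:]                                     # [1 | A^-1] -> A^-1

A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
print(gauss_jordan_inverse(A))    # 1/4 * [[3,2,1],[2,4,2],[1,2,3]]
```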

Error Matrix

As always the uncertainty is determined by the Taylor expansion in quadrature such that

\sigma_P^2 = \sum_i\left[\sigma_i^2\left(\frac{\partial P}{\partial y_i}\right)^2\right] = error in parameter P; here the covariance has been assumed to be zero

where, by the definition of variance,

\sigma_i^2 \approx s^2 = \frac{\sum\left(y_i - \sum_{j=0}^n a_jf_j(x_i)\right)^2}{N - n} : there are n parameters and N data points, which translates to (N-n) degrees of freedom.

Applying this for the parameter a_k indicates that

\sigma_{a_k}^2 = \sum_i\left[\sigma_i^2\left(\frac{\partial a_k}{\partial y_i}\right)^2\right]
But what if there are covariances?

In that case the following general expression applies

\sigma_{a_k,a_l}^2 = \sum_i^N\left[\sigma_i^2\,\frac{\partial a_k}{\partial y_i}\,\frac{\partial a_l}{\partial y_i}\right]
\frac{\partial a_k}{\partial y_i} = ?


\tilde a = \tilde\beta\,\tilde\alpha^{-1}

where

\tilde a = \left(a_1, a_2, \dots, a_n\right) = a row matrix of the parameters
\tilde\beta = \left(\beta_1, \beta_2, \dots, \beta_n\right) = a row matrix of order n
\tilde\alpha^{-1} = \begin{pmatrix}\alpha^{-1}_{11} & \alpha^{-1}_{12} & \cdots & \alpha^{-1}_{1j}\\ \alpha^{-1}_{21} & \alpha^{-1}_{22} & \cdots & \alpha^{-1}_{2j}\\ \vdots & \vdots & & \vdots\\ \alpha^{-1}_{k1} & \alpha^{-1}_{k2} & \cdots & \alpha^{-1}_{kj}\end{pmatrix} = a k\times j = n\times n matrix
\tilde a = \left(a_1, a_2, \dots, a_n\right) = \left(\beta_1, \beta_2, \dots, \beta_n\right)\begin{pmatrix}\alpha^{-1}_{11} & \alpha^{-1}_{12} & \cdots & \alpha^{-1}_{1j}\\ \alpha^{-1}_{21} & \alpha^{-1}_{22} & \cdots & \alpha^{-1}_{2j}\\ \vdots & \vdots & & \vdots\\ \alpha^{-1}_{k1} & \alpha^{-1}_{k2} & \cdots & \alpha^{-1}_{kj}\end{pmatrix}


a_k = \sum_j^n\beta_j\,\alpha^{-1}_{jk}


\frac{\partial a_k}{\partial y_i} = \frac{\partial}{\partial y_i}\sum_j^n\beta_j\,\alpha^{-1}_{jk}
= \frac{\partial}{\partial y_i}\sum_j^n\left(\sum_i^N\frac{f_j(x_i)y_i}{\sigma_i^2}\right)\alpha^{-1}_{jk}
= \sum_j^n\left(\frac{f_j(x_i)}{\sigma_i^2}\right)\alpha^{-1}_{jk} : only one y_i in the sum over N survives the derivative

similarly

\frac{\partial a_l}{\partial y_i} = \sum_p^n\left(\frac{f_p(x_i)}{\sigma_i^2}\right)\alpha^{-1}_{pl}


substituting

\sigma_{a_k,a_l}^2 = \sum_i^N\left[\sigma_i^2\,\frac{\partial a_k}{\partial y_i}\,\frac{\partial a_l}{\partial y_i}\right]
= \sum_i^N\left[\sigma_i^2\sum_j^n\left(\frac{f_j(x_i)}{\sigma_i^2}\right)\alpha^{-1}_{jk}\sum_p^n\left(\frac{f_p(x_i)}{\sigma_i^2}\right)\alpha^{-1}_{pl}\right]

A \sigma_i^2 term appears in both the numerator and the denominator and cancels.

Move the outermost sum to the inside:


\sigma_{a_k,a_l}^2 = \sum_j^n\left(\alpha^{-1}_{jk}\sum_p^n\left[\sum_i^N\frac{f_j(x_i)f_p(x_i)}{\sigma_i^2}\right]\alpha^{-1}_{pl}\right)
= \sum_j^n\alpha^{-1}_{jk}\sum_p^n\left(\alpha_{jp}\right)\alpha^{-1}_{pl}
= \sum_j^n\alpha^{-1}_{jk}\,\tilde 1_{jl}

where

\tilde 1_{jl} = the j,l element of the unity matrix (1 if j = l, 0 otherwise)
Note
\alpha_{jk} = \alpha_{kj} : the matrix is symmetric.
\sigma_{a_k,a_l}^2 = \sum_j^n\alpha^{-1}_{kj}\,\tilde 1_{jl} = \alpha^{-1}_{kl} = Covariance/Error matrix element


The inverse matrix \alpha^{-1}_{kl} tells you the variance and covariance for the calculation of the total error.

Remember
Y = \sum_i^n a_if_i(x)
\sigma_{a_i} = \sqrt{\alpha^{-1}_{ii}} = error in the parameters
s^2 = \sum_i^n\sum_j^n\left(\frac{\partial Y}{\partial a_i}\frac{\partial Y}{\partial a_j}\right)\sigma_{ij} = \sum_i^n\sum_j^n\left(f_i(x)f_j(x)\right)\sigma_{ij} = error in the model's prediction
= \sum_i^n\sum_j^n x^{j+i}\,\sigma_{ij} \quad\text{if } Y \text{ is a power series in } x\ \left(f_i(x) = x^i\right)
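
A sketch tying the pieces together for a power-series fit: \alpha^{-1} as the covariance matrix, its diagonal as the parameter errors, and the double sum over f_i f_j \sigma_{ij} as the model-prediction error. The data, uncertainties, fit order, and evaluation point are illustrative assumptions:

```python
# Minimal sketch (illustrative data): covariance matrix α⁻¹, parameter errors,
# and the error of the model prediction at a point x.
import numpy as np

x     = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([1.2, 1.9, 4.1, 8.3, 13.8, 22.1])
sigma = np.full_like(x, 0.5)          # assumed, equal errors
order = 2

f = np.array([x**j for j in range(order + 1)])                   # f_j(x_i) = x_i^j
beta  = (y / sigma**2 * f).sum(axis=1)
alpha = (f[:, None, :] * f[None, :, :] / sigma**2).sum(axis=2)

cov = np.linalg.inv(alpha)            # α⁻¹ = covariance / error matrix
a = beta @ cov                        # fit parameters
param_errors = np.sqrt(np.diag(cov))  # σ_{a_i} = sqrt(α⁻¹_ii)

x0 = 2.5                              # point at which to evaluate the model error
f0 = np.array([x0**j for j in range(order + 1)])
model_var = f0 @ cov @ f0             # s² = Σ_i Σ_j f_i(x) f_j(x) σ_ij
print(a, param_errors, np.sqrt(model_var))
```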

Chi-Square Distribution

The above tools allow you to perform a least squares fit to data using high order polynomials.

The question though is how high in order should you go? (i.e., when should you stop adding parameters to the fit?)

One argument is that you should stop increasing the number of parameters if they don't change much. The parameters, using the above techniques, are correlated such that when you add another order to the fit all the parameters have the potential to change in value. If their change is minuscule then you can argue that adding higher orders to the fit does not change the fit. There are techniques to make this judgment more quantitative.

A quantitative way to express the above uses the \chi^2 value of the fit. The above technique seeks to minimize \chi^2. So if you add higher orders and more parameters but the \chi^2 value does not change appreciably, you could argue that the fit is as good as you can make it with the given function.

Derivation

If you assume a series of measurements have a Gaussian parent distribution

Then

P(x,\bar x,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{x - \bar x}{\sigma}\right)^2} = probability of measuring the value x from a Gaussian distribution with sample mean \bar x and parent standard deviation \sigma


Go Back Forest_Error_Analysis_for_the_Physical_Sciences#Statistical_inference