Statistical Inference
Frequentist -vs- Bayesian Inference
When it comes to testing a hypothesis, there are two dominant philosophies: the Frequentist perspective and the Bayesian perspective.
The dominant discussion for this class will be from the Frequentist perspective.
Frequentist statistical inference
- Statistical inference is made using a null-hypothesis test; that is, one that answers the question: "Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?"
The relative frequency of occurrence of an event, in a number of repetitions of the experiment, is a measure of the probability of that event.
Thus, if [math]n_t[/math] is the total number of trials and [math]n_x[/math] is the number of trials where the event [math]x[/math] occurred, the probability [math]P(x)[/math] of the event occurring will be approximated by the relative frequency as follows:
- [math]P(x) \approx \frac{n_x}{n_t}.[/math]
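The relative-frequency estimate can be illustrated with a small simulation. This is a minimal sketch, assuming a fair six-sided die as the experiment and "rolling a six" as the event; the function name `relative_frequency` and the seed are illustrative choices, not part of the notes.

```python
import random

def relative_frequency(event, trials, rng):
    """Estimate P(event) as n_x / n_t from simulated rolls of a fair die."""
    n_t = trials
    # n_x counts the trials in which the event occurred
    n_x = sum(1 for _ in range(n_t) if event(rng.randint(1, 6)))
    return n_x / n_t

rng = random.Random(12345)  # fixed seed so the run is repeatable
for n_t in (100, 10_000, 200_000):
    estimate = relative_frequency(lambda face: face == 6, n_t, rng)
    print(n_t, estimate)  # tends toward 1/6 as n_t grows
```

The estimate fluctuates for small [math]n_t[/math] and settles near 1/6 as the number of trials increases, which is the sense in which the relative frequency approximates the probability.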
Bayesian inference
- Statistical inference is made by using evidence or observations  to update or to newly infer the probability that a hypothesis may be true. The name "Bayesian" comes from the frequent use of Bayes' theorem in the inference process.
Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a non-vanishing probability:
- [math]P(A|B) = \frac{P(B | A)\, P(A)}{P(B)}\,\! [/math].
Each term in Bayes' theorem has a conventional name:
- P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
- P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
- P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
- P(B|A) is the conditional probability of B given A.
Bayes' theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.
Example
Suppose there is a school having 60% boys and 40% girls as students. 
The female students wear trousers or skirts in equal numbers; the boys all wear trousers. 
An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers. 
What is the probability this student is a girl? 
The correct answer can be computed using Bayes' theorem.
- [math] P(A) \equiv[/math] probability that  the student observed is a girl = 0.4
- [math]P(B) \equiv[/math] probability that the student observed is wearing trousers = (60 + 20)/100 = 0.8 (all of the boys, 60% of the students, plus half of the girls, 20% of the students)
- [math]P(B|A) \equiv[/math] probability the student is wearing trousers given that the student is a girl
- [math]P(A|B) \equiv[/math] probability the student is a girl given that the student is wearing trousers
- [math]P(B|A) =0.5[/math]
- [math]P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{0.5 \times 0.4}{0.8} = 0.25.[/math]
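The arithmetic of the school example can be reproduced in a few lines. This is a minimal sketch; the function name `posterior` and its argument names are illustrative, and [math]P(B)[/math] is built from the law of total probability rather than typed in directly.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from Bayes' theorem.

    P(B) is computed as P(B|A) P(A) + P(B|not A) P(not A).
    """
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b

p_girl = 0.4                 # P(A): student is a girl
p_trousers_given_girl = 0.5  # P(B|A): half of the girls wear trousers
p_trousers_given_boy = 1.0   # all of the boys wear trousers

print(posterior(p_girl, p_trousers_given_girl, p_trousers_given_boy))
```

Up to floating-point rounding, the printed value is the 0.25 obtained by hand above.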
Method of Maximum Likelihood
- The principle of maximum likelihood is the cornerstone of Frequentist-based hypothesis testing and may be stated as follows:
- The best estimates for the mean and standard deviation of the parent population are those for which the observed set of values is the most likely to occur; i.e., the probability of observing those values is a maximum.
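The principle can be checked numerically for a Gaussian parent distribution. The sketch below uses made-up measurements and an illustrative helper `log_likelihood`; it compares the log of the joint probability at the sample mean and at the standard deviation computed with a divisor of [math]N[/math] (the maximum likelihood estimate, which differs from the [math]N-1[/math] form) against nearby values.

```python
import math

def log_likelihood(data, mu, sigma):
    """Log of the joint Gaussian probability of observing `data`."""
    return sum(
        -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((y - mu) / sigma) ** 2
        for y in data
    )

data = [9.8, 10.1, 10.4, 9.7, 10.0, 10.3]  # made-up measurements
n = len(data)
mu_hat = sum(data) / n
# Maximum likelihood estimate of sigma divides by N, not N - 1
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in data) / n)

best = log_likelihood(data, mu_hat, sigma_hat)
for d_mu, d_sigma in ((0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05)):
    shifted = log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma)
    print(d_mu, d_sigma, shifted < best)  # every shifted value is less likely
```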
Least Squares Fit
Applying the Method of Maximum Likelihood
Our object is to find the best straight line fit for an expected linear relationship between dependent variate [math](y)[/math]  and independent variate [math](x)[/math].
If we let [math]y_0(x)[/math] represent the "true" linear relationship between independent variate [math]x[/math] and dependent variate [math]y[/math] such that
- [math]y_0(x) = A + B x[/math]
Then the probability of observing the value [math]y_i[/math] with a standard deviation [math]\sigma_i[/math] is given by
- [math]P_i = \frac{1}{\sigma_i \sqrt{2 \pi}} e^{- \frac{1}{2} \left ( \frac{y_i - y_0(x_i)}{\sigma_i}\right)^2}[/math]
assuming an experiment done with sufficiently high statistics that it may be represented by a Gaussian parent distribution.
If you repeat the measurement [math]N[/math] times, then the probability of deducing the values [math]A[/math] and [math]B[/math] from the data can be expressed as the joint probability of finding the [math]N[/math] values [math]y_i[/math], one for each [math]x_i[/math]:
- [math]P(A,B) = \prod_{i=1}^N \frac{1}{\sigma_i \sqrt{2 \pi}} e^{- \frac{1}{2} \left ( \frac{y_i - y_0(x_i)}{\sigma_i}\right)^2}[/math]
- [math]= \left ( \prod_{i=1}^N \frac{1}{\sigma_i \sqrt{2 \pi}}\right ) e^{- \frac{1}{2} \sum \left ( \frac{y_i - y_0(x_i)}{\sigma_i}\right)^2}[/math] = Max
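The maximum probability will result in the best values for [math]A[/math] and [math]B[/math].
Because the factor in front of the exponential does not depend on [math]A[/math] or [math]B[/math], maximizing the probability means minimizing the sum in the exponent: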
- [math]\chi^2 = \sum \left ( \frac{y_i - y_0(x_i)}{\sigma_i}\right)^2 = \sum \left ( \frac{y_i - A - B x_i }{\sigma_i}\right)^2[/math] = Min
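The [math]\chi^2[/math] sum is straightforward to evaluate for trial values of [math]A[/math] and [math]B[/math]. The sketch below uses made-up data with equal uncertainties; the function name `chi_square` is illustrative, and the values 1.06 and 1.96 are the least-squares intercept and slope for this particular data set, so every other trial pair should give a larger [math]\chi^2[/math].

```python
def chi_square(xs, ys, sigmas, a, b):
    """Sum of squared, sigma-normalized residuals about the line y = a + b*x."""
    return sum(((y - a - b * x) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.1, 6.9]      # made-up measurements
sigmas = [0.1, 0.1, 0.1, 0.1]  # equal uncertainties

best = chi_square(xs, ys, sigmas, 1.06, 1.96)
for a, b in ((1.0, 2.0), (1.07, 1.96), (1.06, 1.95), (1.05, 1.97)):
    print(a, b, chi_square(xs, ys, sigmas, a, b) > best)
```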
The minimum of [math]\chi^2[/math] occurs where its partial derivatives with respect to both parameters [math]A[/math] and [math]B[/math] vanish, i.e.
- [math]\frac{\partial \chi^2}{\partial A} = \sum \frac{ \partial}{\partial A} \left ( \frac{y_i - A - B x_i }{\sigma_i}\right)^2=0[/math]
- [math]\frac{\partial \chi^2}{\partial B} = \sum \frac{ \partial}{\partial B} \left ( \frac{y_i - A - B x_i }{\sigma_i}\right)^2=0[/math]
- If [math]\sigma_i = \sigma[/math]
- All variances are the same  (weighted fits don't make this assumption)
Then
- [math]\frac{\partial \chi^2}{\partial A} = \frac{1}{\sigma^2}\sum \frac{ \partial}{\partial A} \left ( y_i - A - B x_i \right)^2=\frac{-2}{\sigma^2}\sum \left ( y_i - A - B x_i \right)=0[/math]
- [math]\frac{\partial \chi^2}{\partial B} = \frac{1}{\sigma^2}\sum \frac{ \partial}{\partial B} \left ( y_i - A - B x_i \right)^2=\frac{-2}{\sigma^2}\sum x_i \left ( y_i - A - B x_i \right)=0[/math]
or
- [math]\sum \left ( y_i - A - B x_i \right)=0[/math]
- [math]\sum  x_i \left( y_i - A - B x_i \right)=0[/math]
The above equations represent a set of 2 simultaneous equations in 2 unknowns, which can be solved.
- [math]\sum y_i = N A + B \sum x_i[/math]
- [math]\sum x_i y_i = A \sum x_i + B \sum x_i^2[/math]
- [math]\left( \begin{array}{c} \sum y_i \\ \sum x_i y_i \end{array} \right) = \left( \begin{array}{cc} N & \sum x_i\\
\sum x_i & \sum x_i^2 \end{array} \right)\left( \begin{array}{c} A \\ B \end{array} \right)[/math]
The Method of Determinants
for the matrix problem:
- [math]\left( \begin{array}{c} y_1 \\ y_2 \end{array} \right) = \left( \begin{array}{cc} a_{11} & a_{12}\\ a_{21} & a_{22} \end{array} \right)\left( \begin{array}{c} x_1 \\ x_2 \end{array} \right)[/math]
the above can be written as
- [math]y_1 = a_{11} x_1 + a_{12} x_2[/math]
- [math]y_2 = a_{21} x_1 + a_{22} x_2[/math]
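Solving for [math]x_1[/math], with [math]y_1[/math] and [math]y_2[/math] known: multiply the first equation by [math]a_{22}[/math] and the second by [math]-a_{12}[/math], then add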
- [math]a_{22} (y_1 = a_{11} x_1 + a_{12} x_2)[/math]
- [math]-a_{12} (y_2 = a_{21} x_1 + a_{22} x_2)[/math]
- [math]\Rightarrow a_{22} y_1 - a_{12} y_2 = (a_{11}a_{22} - a_{12}a_{21}) x_1[/math]
- [math]\left| \begin{array}{cc} y_1 & a_{12}\\ y_2 & a_{22} \end{array} \right| = \left| \begin{array}{cc} a_{11} & a_{12}\\ a_{21} & a_{22} \end{array} \right| x_1[/math]
or
- [math]x_1 = \frac{\left| \begin{array}{cc} y_1 & a_{12}\\ y_2 & a_{22} \end{array} \right| }{\left| \begin{array}{cc} a_{11} & a_{12}\\ a_{21} & a_{22} \end{array} \right| }[/math] similarly [math]x_2 = \frac{\left| \begin{array}{cc} a_{11} & y_1\\ a_{21} & y_2 \end{array} \right| }{\left| \begin{array}{cc} a_{11} & a_{12}\\ a_{21} & a_{22} \end{array} \right| }[/math]
Solutions exist as long as
- [math]\left| \begin{array}{cc} a_{11} & a_{12}\\ a_{21} & a_{22} \end{array} \right| \ne 0[/math]
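The method of determinants for a 2×2 system can be written as a short function. This is a minimal sketch; the name `solve_2x2` and the argument order are illustrative choices. Each unknown is the ratio of two determinants, with the column of the unknown replaced by the [math]y[/math] values in the numerator.

```python
def solve_2x2(a11, a12, a21, a22, y1, y2):
    """Solve [[a11, a12], [a21, a22]] (x1, x2) = (y1, y2) by determinants."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("determinant is zero: no unique solution")
    x1 = (y1 * a22 - a12 * y2) / det  # first column replaced by (y1, y2)
    x2 = (a11 * y2 - y1 * a21) / det  # second column replaced by (y1, y2)
    return x1, x2

# 2*x1 + 1*x2 = 5 and 1*x1 + 3*x2 = 10 are satisfied by x1 = 1, x2 = 3
print(solve_2x2(2.0, 1.0, 1.0, 3.0, 5.0, 10.0))
```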
Apply the method of determinants to the maximum likelihood problem above
- [math]A = \frac{\left| \begin{array}{cc} \sum y_i & \sum x_i\\ \sum x_i y_i & \sum x_i^2 \end{array}\right|}{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|}[/math]
- [math]B = \frac{\left| \begin{array}{cc} N & \sum y_i\\ \sum x_i  & \sum x_i y_i \end{array}\right|}{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|}[/math]
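The two determinant ratios translate directly into code. The sketch below assumes equal uncertainties on all points; the function name `fit_line` and the sample data are illustrative.

```python
def fit_line(xs, ys):
    """Unweighted least-squares line y = A + B*x from the determinant formulas."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    delta = n * s_xx - s_x * s_x  # denominator determinant
    if delta == 0:
        raise ValueError("all x values are identical: the line is undetermined")
    a = (s_y * s_xx - s_x * s_xy) / delta  # intercept A
    b = (n * s_xy - s_x * s_y) / delta     # slope B
    return a, b

# Points lying exactly on y = 1 + 2x recover A = 1 and B = 2
print(fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

The denominator vanishes only when every [math]x_i[/math] is the same, which is the case where the data cannot constrain a slope.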
If the uncertainty in all the measurements is not the same then we need to insert [math]\sigma_i[/math] back into the system of equations.
- [math]A = \frac{\left| \begin{array}{cc} \sum\frac{ y_i}{\sigma_i^2} & \sum\frac{ x_i}{\sigma_i^2}\\ \sum\frac{ x_i y_i}{\sigma_i^2} & \sum\frac{ x_i^2}{\sigma_i^2} \end{array}\right|}{\left| \begin{array}{cc} \sum \frac{1}{\sigma_i^2} & \sum \frac{x_i}{\sigma_i^2}\\ \sum \frac{x_i}{\sigma_i^2}  & \sum \frac{x_i^2}{\sigma_i^2} \end{array}\right|} \;\;\;\; B = \frac{\left| \begin{array}{cc} \sum \frac{1}{\sigma_i^2} & \sum \frac{ y_i}{\sigma_i^2}\\ \sum \frac{x_i}{\sigma_i^2}  & \sum \frac{x_i y_i}{\sigma_i^2} \end{array}\right|}{\left| \begin{array}{cc} \sum \frac{1}{\sigma_i^2} & \sum \frac{x_i}{\sigma_i^2}\\ \sum \frac{x_i}{\sigma_i^2}  & \sum \frac{x_i^2}{\sigma_i^2} \end{array}\right|}[/math]
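A weighted fit follows the same pattern, with each sum weighted by [math]1/\sigma_i^2[/math] and the count [math]N[/math] replaced by [math]\sum 1/\sigma_i^2[/math]. This is a minimal sketch with an illustrative function name, `weighted_fit_line`, and made-up uncertainties.

```python
def weighted_fit_line(xs, ys, sigmas):
    """Weighted least-squares line y = A + B*x with per-point uncertainties."""
    ws = [1.0 / s ** 2 for s in sigmas]  # weights 1/sigma_i^2
    s_w = sum(ws)
    s_x = sum(w * x for w, x in zip(ws, xs))
    s_y = sum(w * y for w, y in zip(ws, ys))
    s_xx = sum(w * x * x for w, x in zip(ws, xs))
    s_xy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    delta = s_w * s_xx - s_x * s_x  # denominator determinant
    if delta == 0:
        raise ValueError("all x values are identical: the line is undetermined")
    a = (s_y * s_xx - s_x * s_xy) / delta  # intercept A
    b = (s_w * s_xy - s_x * s_y) / delta   # slope B
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.1, 6.9]
print(weighted_fit_line(xs, ys, [0.1, 0.1, 0.1, 0.1]))  # equal weights
print(weighted_fit_line(xs, ys, [0.1, 0.1, 0.5, 0.5]))  # last two points count less
```

With equal uncertainties the weights cancel between numerator and denominator and the unweighted result is recovered.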
Uncertainty in the Linear Fit parameters
As always the uncertainty is determined by the Taylor expansion in quadrature such that
- [math]\sigma_P^2 = \sum \left [ \sigma_i^2 \left ( \frac{\partial P}{\partial y_i}\right )^2\right ][/math] = error in parameter P: here covariance has been assumed to be zero
By definition of variance
- [math]\sigma_i^2 \approx s^2 = \frac{\sum \left( y_i - A - B x_i \right)^2}{N -2}[/math]  : there are 2 parameters and N data points which translate to (N-2) degrees of freedom.
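The estimate [math]s^2[/math] is the sum of squared residuals about the fitted line divided by [math]N-2[/math]. The sketch below uses made-up data; the function name `residual_variance` is illustrative, and the values 1.06 and 1.96 passed in are the least-squares intercept and slope for this data set.

```python
def residual_variance(xs, ys, a, b):
    """Estimate s^2 from residuals about y = a + b*x with N - 2 degrees of freedom."""
    n = len(xs)
    if n <= 2:
        raise ValueError("need more than 2 points to estimate the variance")
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.1, 6.9]  # made-up measurements
print(residual_variance(xs, ys, 1.06, 1.96))
```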
 
The least squares fit (assuming equal [math]\sigma[/math]) has the following solution for the parameters [math]A[/math] and [math]B[/math]:
- [math]A = \frac{\left| \begin{array}{cc} \sum y_i & \sum x_i\\ \sum x_i y_i & \sum x_i^2 \end{array}\right|}{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|} \;\;\;\; B = \frac{\left| \begin{array}{cc} N & \sum y_i\\ \sum x_i  & \sum x_i y_i \end{array}\right|}{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|}[/math]
uncertainty in A
- [math]\frac{\partial A}{\partial y_j} =\frac{\partial}{\partial y_j} \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i  }{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|}[/math]
- [math] = \frac{(1) \sum x_i^2 - x_j\sum x_i   }{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|}[/math] only the [math]y_j[/math] term survives
- [math] = D \left ( \sum x_i^2 - x_j\sum x_i \right)[/math]
where
- [math]D \equiv \frac{1}{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|}=\frac{1}{N\sum x_i^2 - \left (\sum x_i \right)^2 }[/math]
- [math]\sigma_A^2 = \sum_{j=1}^N \left [ \sigma_j^2 \left ( \frac{\partial A}{\partial y_j}\right )^2\right ][/math]
- [math]= \sum_{j=1}^N \sigma_j^2 \left ( D \left ( \sum x_i^2 - x_j\sum x_i \right) \right )^2[/math]
- [math] = \sigma^2 D^2 \sum_{j=1}^N  \left ( \sum x_i^2 - x_j\sum x_i \right )^2[/math] : Assume [math]\sigma_i = \sigma[/math]
- [math] = \sigma^2 D^2 \sum_{j=1}^N  \left [ \left ( \sum x_i^2\right )^2  + x_j^2 \left (\sum x_i \right )^2 - 2 x_j \sum x_i^2 \sum x_i \right ][/math]
- [math] = \sigma^2 D^2 \left [ N \left ( \sum x_i^2\right )^2  + \left (\sum x_i \right )^2 \sum_{j=1}^N x_j^2 - 2 \sum x_i^2 \sum x_i \sum_{j=1}^N x_j \right ][/math]
- [math] = \sigma^2 D^2\sum x_i^2\left [  N \sum x_i^2 + \left ( \sum x_i \right )^2 -  2 \left ( \sum x_i \right )^2 \right ][/math] : since [math]\sum_{j=1}^N x_j^2 = \sum x_i^2[/math] and [math]\sum_{j=1}^N x_j = \sum x_i[/math], both sums being over the [math]N[/math] observations
- [math] = \sigma^2 D^2\sum x_i^2\left [  N \sum x_i^2 - \left ( \sum x_i \right )^2 \right ][/math]
- [math] = \sigma^2 D^2\sum x_i^2 \frac{1}{D}[/math]
- [math] \sigma_A^2= \sigma^2  \frac{\sum x_i^2 }{N\sum x_i^2 - \left (\sum x_i \right)^2}[/math]
- [math] \sigma_A^2= \frac{\sum \left( y_i - A - B x_i \right)^2}{N -2} \frac{\sum x_i^2 }{N\sum x_i^2 - \left (\sum x_i \right)^2}[/math]
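The closed form for [math]\sigma_A^2[/math] can be checked against the propagation sum it was derived from. This is a minimal sketch assuming equal [math]\sigma[/math] on every point; the function names and the sample values are illustrative.

```python
def intercept_variance(xs, sigma2):
    """Closed form: sigma^2 * sum(x^2) / (N * sum(x^2) - (sum x)^2)."""
    n = len(xs)
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    return sigma2 * s_xx / (n * s_xx - s_x * s_x)

def intercept_variance_by_propagation(xs, sigma2):
    """Direct sum over j of sigma^2 * (dA/dy_j)^2 with dA/dy_j = D*(sum x^2 - x_j * sum x)."""
    n = len(xs)
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    d = 1.0 / (n * s_xx - s_x * s_x)
    return sum(sigma2 * (d * (s_xx - x_j * s_x)) ** 2 for x_j in xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma2 = 0.04  # made-up variance of each y measurement
print(intercept_variance(xs, sigma2))
print(intercept_variance_by_propagation(xs, sigma2))
```

The two printed values agree to rounding, which confirms that the sum over [math]j[/math] collapses to [math]\sigma^2 D \sum x_i^2[/math].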
If we redefine our origin in the linear plot so the line is centered at [math]x=0[/math] then
- [math]\sum{x_i} = 0[/math]
- [math]\Rightarrow \frac{\sum x_i^2 }{N\sum x_i^2 - \left (\sum x_i \right)^2} = \frac{\sum x_i^2 }{N\sum x_i^2 } = \frac{1}{N}[/math]
or
- [math] \sigma_A^2= \frac{\sum \left( y_i - A - B x_i \right)^2}{N -2} \frac{1}{N} = \frac{\sigma^2}{N}[/math]
- Note
- The parameter [math]A[/math] is the y-intercept, so it makes some intuitive sense that the error in the y-intercept would be dominated by the statistical error in [math]y[/math].
uncertainty in B
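- [math]B = \frac{\left| \begin{array}{cc} N & \sum y_i\\ \sum x_i  & \sum x_i y_i \end{array}\right|}{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|} = D \left ( N \sum x_i y_i - \sum x_i \sum y_i \right )[/math]
- [math]\sigma_B^2 = \sum_{j=1}^N \left [ \sigma_j^2 \left ( \frac{\partial B}{\partial y_j}\right )^2\right ][/math]
- [math]\frac{\partial B}{\partial y_j} =\frac{\partial}{\partial y_j} \frac{\left| \begin{array}{cc} N & \sum y_i\\ \sum x_i  & \sum x_i y_i \end{array}\right|}{\left| \begin{array}{cc} N & \sum x_i\\ \sum x_i  & \sum x_i^2 \end{array}\right|} = D \left ( N x_j - \sum x_i \right )[/math]
- [math]\sigma_B^2 = \sigma^2 D^2 \sum_{j=1}^N \left ( N x_j - \sum x_i \right )^2[/math] : Assume [math]\sigma_i = \sigma[/math]
- [math] = \sigma^2 D^2 \left [ N^2 \sum_{j=1}^N x_j^2 - 2 N \left ( \sum x_i \right )^2 + N \left ( \sum x_i \right )^2 \right ] = \sigma^2 D^2 N \left [ N \sum x_i^2 - \left ( \sum x_i \right )^2 \right ][/math]
- [math]\sigma_B^2 = \frac{N \sigma^2}{N\sum x_i^2 - \left (\sum x_i \right)^2}[/math]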
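The propagation sum for [math]B[/math] can also be evaluated numerically. The sketch below assumes equal [math]\sigma[/math] and uses [math]\partial B / \partial y_j = D \left( N x_j - \sum x_i \right)[/math], which follows from differentiating the numerator [math]N \sum x_i y_i - \sum x_i \sum y_i[/math]; it compares the sum with the closed form [math]N \sigma^2 D[/math]. Function names and sample values are illustrative.

```python
def slope_variance_by_propagation(xs, sigma2):
    """Sum over j of sigma^2 * (dB/dy_j)^2 with dB/dy_j = D * (N * x_j - sum x)."""
    n = len(xs)
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    d = 1.0 / (n * s_xx - s_x * s_x)
    return sum(sigma2 * (d * (n * x_j - s_x)) ** 2 for x_j in xs)

def slope_variance(xs, sigma2):
    """Closed form: N * sigma^2 / (N * sum(x^2) - (sum x)^2)."""
    n = len(xs)
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    return n * sigma2 / (n * s_xx - s_x * s_x)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma2 = 0.04  # made-up variance of each y measurement
print(slope_variance_by_propagation(xs, sigma2))
print(slope_variance(xs, sigma2))
```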