- 1.1 Corrected Standard Deviation
- 1.2 Standard Error of the Mean
- 1.3 Confidence Interval Around the Mean
- 1.4 Two-Sample T-Test
- 2.1 Confidence Interval of a Bernoulli Parameter
- 2.2 Multinomial Confidence Intervals
- 2.3 Chi-Squared Test
- 3.1. Standard Deviation of a Poisson Distribution
- 3.2. Confidence Interval Around the Poisson Parameter
- 3.3. Conditional Test of Two Poisson Parameters
- 4.1. Comparing An Empirical Distribution to a Known Distribution
- 4.2. Comparing Two Empirical Distributions
- 4.3. Comparing Three or More Empirical Distributions
- 5.1. Slope of a Best-Fit Line
- 5.2. Standard Error of the Slope
- 5.3. Confidence Interval Around the Slope
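Evan Miller has collected the statistical formulas that programmers should know, and the collection is very useful.
But this passage near the top of his article is even more illuminating:
Being able to apply statistics is like having a secret superpower.
Most people only look at the average, while you look at the confidence interval.
When others say "7 is greater than 5," you say that the processes generating the data cannot be told apart.
In a clamor of noise, you hear the cry for help that matters.
Unfortunately, not many programmers have this superpower. That is a real tragedy, because applying statistics almost always improves the display and interpretation of data. To that end, I have collected the statistical formulas that I consider most useful.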
Statistical Formulas For Programmers
By Evan Miller
DRAFT: May 19, 2013
Being able to apply statistics is like having a secret superpower.
Where most people see averages, you see confidence intervals.
When someone says “7 is greater than 5,” you declare that they're really the same.
In a cacophony of noise, you hear a cry for help.
Unfortunately, not enough programmers have this superpower. That's a shame, because the application of statistics can almost always enhance the display and interpretation of data.
As my modest contribution to developer-kind, I've collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.
Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors' web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I've also added links and references, so that even if you're unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.
Send suggestions and corrections to emmiller@gmail.com
Table of Contents
- Formulas For Reporting Averages
- Formulas For Reporting Proportions
- Formulas For Reporting Count Data
- Formulas For Comparing Distributions
- Formulas For Drawing a Trend Line
1. Formulas For Reporting Averages
One of the first programming lessons in any language is to compute an average. But rarely does anyone stop to ask: what does the average actually tell us about the underlying data?
1.1 Corrected Standard Deviation
The standard deviation is a single number that reflects how spread out the data actually is. It should be reported alongside the average (unless the user will be confused).
$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
Where:
- $N$ is the number of observations
- $x_i$ is the value of the $i$th observation
- $\bar{x}$ is the average value of $x_i$
Reference: Standard deviation (Wikipedia)
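As a minimal sketch, the corrected standard deviation can be computed by hand and checked against the standard library. The function name `corrected_std` and the sample values are made up for illustration.

```python
import math
import statistics


def corrected_std(values):
    """Sample standard deviation with the N - 1 correction."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Hypothetical sample, for illustration only.
response_times = [12.0, 15.0, 11.0, 14.0, 18.0]
s = corrected_std(response_times)

# statistics.stdev applies the same N - 1 correction.
assert math.isclose(s, statistics.stdev(response_times))
```

In practice `statistics.stdev` is the simpler choice; the hand-written version is only there to make the division by N - 1 visible.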
1.2 Standard Error of the Mean
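From a statistical point of view, the "average" is really just an estimate of an underlying population mean. That estimate has uncertainty that is summarized by the standard error.
$$SE = \frac{s}{\sqrt{N}}$$
Here $s$ is the corrected standard deviation from section 1.1 and $N$ is the number of observations.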
Reference: Standard error (Wikipedia)
1.3 Confidence Interval Around the Mean
A confidence interval reflects the set of statistical hypotheses that won't be rejected at a given significance level. So the confidence interval around the mean reflects all possible values of the mean that can't be rejected by the data. It is a multiple of the standard error added to and subtracted from the mean.
$$CI = \bar{x} \pm t_{\alpha/2} \, SE$$
Where:
- $\alpha$ is the significance level, typically 5% (one minus the confidence level)
- $t_{\alpha/2}$ is the $1-\alpha/2$ quantile of a t-distribution with $N-1$ degrees of freedom
Reference: Confidence interval (Wikipedia)
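The standard error and the confidence interval around the mean can be sketched together. Python's standard library has no t-distribution quantile, so this sketch takes the critical value as an argument; the function name and the sample values are hypothetical. The value 2.776 is the 0.975 quantile of a t-distribution with 4 degrees of freedom.

```python
import math
import statistics


def mean_confidence_interval(values, t_crit):
    """Return (mean, standard error, (low, high)).

    t_crit is the 1 - alpha/2 quantile of a t-distribution with
    len(values) - 1 degrees of freedom, supplied by the caller.
    """
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se, (mean - t_crit * se, mean + t_crit * se)


# Hypothetical sample of 5 observations, so 4 degrees of freedom.
response_times = [12.0, 15.0, 11.0, 14.0, 18.0]
mean, se, (low, high) = mean_confidence_interval(response_times, t_crit=2.776)
```

If SciPy is available, `scipy.stats.t.ppf(1 - alpha / 2, n - 1)` would supply the critical value instead of a lookup table.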
1.4 Two-Sample T-Test
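A two-sample t-test can tell whether two groups of observations differ in their mean.
The test statistic is given by:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
The hypothesis of equal means is rejected if $|t|$ exceeds the $(1-\alpha/2)$ quantile of a t-distribution with degrees of freedom equal to:
$$df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
Here $\bar{x}_j$, $s_j$ and $n_j$ are the mean, corrected standard deviation and number of observations of group $j$.
You can see a demonstration of these concepts in Evan's Awesome Two-Sample T-Test.
Reference: Student's t-test (Wikipedia)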
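A minimal sketch of the two-sample test, assuming the unequal-variance (Welch) form is the variant meant here. It returns the statistic and the degrees of freedom and leaves the quantile lookup to the caller; the function name and data are illustrative.

```python
import math
import statistics


def welch_t_test(group1, group2):
    """Return (t statistic, degrees of freedom) for two samples.

    Assumes the unequal-variance (Welch) form of the test.
    """
    n1, n2 = len(group1), len(group2)
    # Squared standard error of each group mean.
    v1 = statistics.variance(group1) / n1
    v2 = statistics.variance(group2) / n2
    t = (statistics.fmean(group1) - statistics.fmean(group2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical data, for illustration only.
t_stat, df = welch_t_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The degrees of freedom are usually not an integer. Equal means are rejected when `abs(t_stat)` exceeds the t quantile for that `df`.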
2. Formulas For Reporting Proportions
It's common to report the relative proportions of binary outcomes or categorical data, but in general these are meaningless without confidence intervals and tests of independence.
2.1 Confidence Interval of a Bernoulli Parameter
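A Bernoulli parameter is the proportion underlying a binary-outcome event (for example, the percent of the time a coin comes up heads). The confidence interval is given by:
$$CI = \frac{p + \dfrac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2} \sqrt{\dfrac{p(1-p)}{N} + \dfrac{z_{\alpha/2}^2}{4N^2}}}{1 + \dfrac{z_{\alpha/2}^2}{N}}$$
Where:
- $p$ is the observed proportion of interest
- $N$ is the number of observations
- $z_{\alpha/2}$ is the $(1-\alpha/2)$ quantile of a normal distribution
This formula can also be used as a sorting criterion.
Reference: Binomial proportion confidence interval (Wikipedia)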
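A minimal sketch of the interval, assuming the Wilson score interval is the form intended. The normal quantile comes from the standard library; `wilson_interval` is a name chosen for this example.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval (assumed form) for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    denominator = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return center - half_width, center + half_width


# Hypothetical example: 50 heads in 100 coin flips.
low, high = wilson_interval(50, 100)
```

When used as a sorting criterion, the lower bound `low` is the usual sort key: items with few observations get a wide interval and so rank below items with the same proportion and more data.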
2.2 Multinomial Confidence Intervals
If you have more than two categories, a multinomial confidence interval supplies upper and lower confidence limits on all of the category proportions at once. The formula is nearly identical to the preceding one.
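$$CI = \frac{p_j + \dfrac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2} \sqrt{\dfrac{p_j(1-p_j)}{N} + \dfrac{z_{\alpha/2}^2}{4N^2}}}{1 + \dfrac{z_{\alpha/2}^2}{N}}$$
Where:
- $p_j$ is the observed proportion of the $j$th category
- $N$ is the total number of observations and $z_{\alpha/2}$ is the $(1-\alpha/2)$ quantile of a normal distribution
Reference: Confidence Intervals for Multinomial Proportions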
2.3 Chi-Squared Test
Pearson's chi-squared test can detect whether the distribution of row counts seems to differ across columns (or vice versa). It is useful when comparing two or more sets of category proportions.
The test statistic, called $X^2$, is computed as:
$$X^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
Where:
- $n$ is the number of rows
- $m$ is the number of columns
- $O_{i,j}$ is the observed count in row $i$ and column $j$
- $E_{i,j}$ is the expected count in row $i$ and column $j$
The expected count is given by:
$$E_{i,j} = \frac{\left(\sum_{k=1}^{m} O_{i,k}\right) \left(\sum_{l=1}^{n} O_{l,j}\right)}{N}$$
where $N$ is the total of all observed counts.
A statistical dependence exists if $X^2$ is greater than the $(1-\alpha)$ quantile of a $\chi^2$ distribution with $(m-1) \times (n-1)$ degrees of freedom.
You can see a 2x2 demonstration of these concepts in Evan's Awesome Chi-Squared Test.
Reference: Pearson's chi-squared test (Wikipedia)
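A minimal sketch of the chi-squared statistic for a table of counts, with the expected counts built from row and column totals. The table is made up, and the critical value 3.841 (the 0.95 quantile of a chi-squared distribution with 1 degree of freedom) is hard-coded because the standard library has no chi-squared quantile function.

```python
def chi_squared_statistic(observed):
    """Return (X^2, degrees of freedom) for a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            x2 += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return x2, df


# Hypothetical 2x2 table, for illustration only.
table = [[10, 20], [30, 40]]
x2, df = chi_squared_statistic(table)

CRITICAL_95_DF1 = 3.841  # 0.95 quantile of chi-squared with 1 degree of freedom
dependent = x2 > CRITICAL_95_DF1
```

The sketch assumes every row and column total is positive; a zero total would make an expected count zero and the division fail.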
3. Formulas For Reporting Count Data
If the incoming events are independent, their counts are well-described by a Poisson distribution. A Poisson distribution takes a parameter $\lambda$, which is the distribution's mean, that is, the average arrival rate of events per unit time.
3.1. Standard Deviation of a Poisson Distribution
The standard deviation of Poisson data usually doesn't need to be explicitly calculated. Instead it can be inferred from the Poisson parameter:
$$\sigma = \sqrt{\lambda}$$
This fact can be used to read an unlabeled sales chart, for example.
Reference: Poisson distribution (Wikipedia)
3.2. Confidence Interval Around the Poisson Parameter
The confidence interval around the Poisson parameter represents the set of arrival rates that can't be rejected by the data. It can be inferred from a single data point of $c$ events observed over $t$ time periods with the following formula:
$$CI = \left( \frac{\gamma^{-1}(\alpha/2,\, c)}{t},\ \frac{\gamma^{-1}(1-\alpha/2,\, c+1)}{t} \right)$$
Where:
- $\gamma^{-1}(p, c)$ is the inverse of the lower incomplete gamma function
Reference: Confidence Intervals for the Mean of a Poisson Distribution
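The standard library has no inverse incomplete gamma function, so this sketch swaps in an equivalent route: it finds the two rates by bisection on the Poisson c.d.f. The lower limit is the rate at which seeing $c$ or more events has probability $\alpha/2$, and the upper limit is the rate at which seeing $c$ or fewer events has probability $\alpha/2$. All function names and the search bound are assumptions made for this example.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); 0.0 when k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def _bisect(f, lo, hi, iterations=200):
    """Root of a monotone function f on [lo, hi], found by bisection."""
    sign_lo = f(lo) > 0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_rate_interval(c, t=1.0, alpha=0.05):
    """Exact-style interval for the rate, given c events over t periods."""
    search_bound = c + 10 * math.sqrt(c + 1) + 20  # generous, ad hoc upper bound
    if c == 0:
        lower = 0.0
    else:
        lower = _bisect(lambda lam: (1 - poisson_cdf(c - 1, lam)) - alpha / 2,
                        0.0, search_bound)
    upper = _bisect(lambda lam: poisson_cdf(c, lam) - alpha / 2, 0.0, search_bound)
    return lower / t, upper / t


# Hypothetical example: 10 events observed in one time period.
low, high = poisson_rate_interval(10)
```

With SciPy, `scipy.special.gammaincinv(c, alpha / 2)` and `scipy.special.gammaincinv(c + 1, 1 - alpha / 2)` should give the same two limits directly. The bisection version is meant for modest counts; very large counts or very small `alpha` would need a wider search bound.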
3.3. Conditional Test of Two Poisson Parameters
Please never do this:
From a statistical point of view, 5 events is indistinguishable from 7 events. Before reporting in bright red text that one count is greater than another, it's best to perform a test of the two Poisson means.
The p-value is given by:
$$p = 2 \times \frac{c!}{t^c} \times \sum_{i=0}^{c_1} \frac{t_1^i \, t_2^{c-i}}{i! \, (c-i)!}$$
Where:
- Observation 1 consists of $c_1$ events over $t_1$ time periods
- Observation 2 consists of $c_2$ events over $t_2$ time periods
- $c = c_1 + c_2$ and $t = t_1 + t_2$
You can see a demonstration of these concepts in Evan's Awesome Poisson Means Test.
Reference: A more powerful test for comparing two Poisson means (PDF)
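A minimal sketch of a conditional test, under the assumption that it works as follows: given the total count, the first count is binomial with success probability $t_1/(t_1+t_2)$ when the two rates are equal, and the two-sided p-value is twice the smaller tail. Swapping the observations so the lower rate comes first, and capping the result at 1, are choices made for this example.

```python
import math


def poisson_means_p_value(c1, t1, c2, t2):
    """Two-sided p-value for equal Poisson rates (conditional binomial test)."""
    # Put the observation with the lower rate first, so the summed
    # tail is the smaller one (a choice made for this sketch).
    if c1 * t2 > c2 * t1:
        c1, t1, c2, t2 = c2, t2, c1, t1
    c = c1 + c2
    share = t1 / (t1 + t2)
    lower_tail = sum(
        math.comb(c, i) * share ** i * (1 - share) ** (c - i)
        for i in range(c1 + 1)
    )
    return min(1.0, 2 * lower_tail)


# The 5-versus-7 comparison from the text, each over one time period.
p = poisson_means_p_value(5, 1, 7, 1)
```

For 5 versus 7 events the p-value comes out near 0.77, which is why the two counts cannot be told apart.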
4. Formulas For Comparing Distributions
If you want to test whether groups of observations come from the same (unknown) distribution, or if a single group of observations comes from a known distribution, you'll need a Kolmogorov-Smirnov test. A K-S test will test the entire distribution for equality, not just the distribution mean.
4.1. Comparing An Empirical Distribution to a Known Distribution
The simplest version is a one-sample K-S test, which compares a sample of $n$ points having an observed cumulative distribution function $F_n(x)$ to a known distribution function having a c.d.f. of $F(x)$. The test statistic is:
$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$
In plain English, $D_n$ is the absolute value of the largest difference in the two c.d.f.s for any value of $x$.
The critical value of $D_n$ at significance level $\alpha$ is given by $K_\alpha / \sqrt{n}$, where $K_\alpha$ is the value of $K$ that solves:
$$1 - \alpha = \frac{\sqrt{2\pi}}{K} \sum_{k=1}^{\infty} \exp\left( -\frac{(2k-1)^2 \pi^2}{8K^2} \right)$$
The critical value must be solved iteratively, e.g. by Newton's method. If only the p-value is needed, it can be computed directly by solving the above for $\alpha$.
Reference: Kolmogorov-Smirnov Test (Wikipedia)
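A minimal sketch of the one-sample test, assuming the limiting Kolmogorov distribution is evaluated through its standard series $\frac{\sqrt{2\pi}}{K} \sum_k \exp(-(2k-1)^2 \pi^2 / (8K^2))$. It uses bisection rather than Newton's method to find the critical value, because bisection needs no derivative. All function names are chosen for this example.

```python
import math


def ks_statistic(sample, cdf):
    """Largest gap between the empirical c.d.f. of sample and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical c.d.f. jumps from (i - 1)/n to i/n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def kolmogorov_cdf(k, terms=100):
    """P(K <= k) under the limiting Kolmogorov distribution (series form)."""
    if k <= 0:
        return 0.0
    total = sum(
        math.exp(-((2 * j - 1) ** 2) * math.pi ** 2 / (8 * k * k))
        for j in range(1, terms + 1)
    )
    return math.sqrt(2 * math.pi) / k * total


def ks_p_value(d, n):
    """Asymptotic p-value for an observed statistic d from n points."""
    return 1 - kolmogorov_cdf(math.sqrt(n) * d)


def ks_critical_value(n, alpha=0.05):
    """Critical value of the statistic, found by bisection on K."""
    lo, hi = 0.01, 5.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kolmogorov_cdf(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2 / math.sqrt(n)


# Hypothetical sample compared with the uniform distribution on [0, 1].
d = ks_statistic([0.1, 0.4, 0.7], lambda x: x)
```

The critical value and p-value here rely on the large-sample limit, so they are approximate for a sample as small as the three-point example.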
4.2. Comparing Two Empirical Distributions
The two-sample version is similar, except the test statistic is given by:
$$D_{n_1,n_2} = \sup_x \left| F_1(x) - F_2(x) \right|$$
Where $F_1$ and $F_2$ are the empirical c.d.f.s of the two samples, having $n_1$ and $n_2$ observations, respectively. The critical value of the test statistic is $K_\alpha \sqrt{(n_1 + n_2) / (n_1 n_2)}$ with the same value of $K_\alpha$ above.
Reference: Kolmogorov-Smirnov Test (Wikipedia)
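A minimal sketch of the two-sample statistic, evaluating both empirical c.d.f.s at every observed point. The critical value assumes the usual $\sqrt{(n_1+n_2)/(n_1 n_2)}$ scaling and defaults to 1.358, the 5% value of the Kolmogorov constant; the function names are made up for this example.

```python
import math
from bisect import bisect_right


def ks_two_sample_statistic(sample1, sample2):
    """Largest gap between the empirical c.d.f.s of two samples."""
    xs1, xs2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(xs1), len(xs2)
    d = 0.0
    # The gap can only change at an observed value, so checking
    # every observation from both samples is enough.
    for x in xs1 + xs2:
        f1 = bisect_right(xs1, x) / n1
        f2 = bisect_right(xs2, x) / n2
        d = max(d, abs(f1 - f2))
    return d


def ks_two_sample_critical_value(n1, n2, k_alpha=1.358):
    """Critical value; k_alpha = 1.358 corresponds to alpha = 0.05."""
    return k_alpha * math.sqrt((n1 + n2) / (n1 * n2))


# Hypothetical samples, for illustration only.
d = ks_two_sample_statistic([1, 3, 5], [2, 4, 6])
critical = ks_two_sample_critical_value(50, 50)
```

If SciPy is available, `scipy.stats.ks_2samp` computes the same statistic along with a p-value.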
4.3. Comparing Three or More Empirical Distributions
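A $k$-sample extension of Kolmogorov-Smirnov was described by J. Kiefer in a 1959 paper. The test statistic is:
$$T = \sup_x \sum_{j=1}^{k} n_j \left[ F_j(x) - \bar{F}(x) \right]^2$$
Where $F_j$ is the empirical c.d.f. of the $j$th sample, which has $n_j$ observations, and $\bar{F}$ is the c.d.f. of the combined samples. The critical value of $T$ is $a^2$ where $a$ solves:
$$1 - \alpha = \frac{4}{\Gamma(h/2) \, 2^{h/2} \, a^h} \sum_{n=1}^{\infty} \frac{\left(\gamma_{(h-2)/2,n}\right)^{h-2} \exp\left[ -\left(\gamma_{(h-2)/2,n}\right)^2 / (2a^2) \right]}{\left[ J_{h/2}\left(\gamma_{(h-2)/2,n}\right) \right]^2}$$
Where:
- $h = k - 1$
- $J_{h/2}$ is a Bessel function of the first kind with order $h/2$
- $\gamma_{(h-2)/2,n}$ is the $n$th zero of $J_{(h-2)/2}$
To compute the critical value, this equation must also be solved iteratively. When $k = 2$ the equation reduces to a two-sample Kolmogorov-Smirnov test. The case of $k = 4$ can also be reduced to a simpler form, but for other values of $k$ the equation cannot be reduced.
Reference: K-sample analogues of the Kolmogorov-Smirnov and Cramer-v. Mises tests (JSTOR)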
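A minimal sketch of the test statistic only, assuming it has the form $T = \sup_x \sum_j n_j [F_j(x) - \bar{F}(x)]^2$, where $F_j$ is the empirical c.d.f. of sample $j$ with $n_j$ observations and $\bar{F}$ is the c.d.f. of the pooled data. The critical value is left out because it needs Bessel function zeros, which the standard library does not provide. The function name is made up for this example.

```python
from bisect import bisect_right


def kiefer_statistic(samples):
    """k-sample statistic, under the assumed form
    T = sup_x sum_j n_j * (F_j(x) - F_bar(x))**2.
    """
    sorted_samples = [sorted(s) for s in samples]
    pooled = sorted(x for s in sorted_samples for x in s)
    total = len(pooled)
    t = 0.0
    # The sum can only change at an observed value.
    for x in pooled:
        f_bar = bisect_right(pooled, x) / total
        value = sum(
            len(s) * (bisect_right(s, x) / len(s) - f_bar) ** 2
            for s in sorted_samples
        )
        t = max(t, value)
    return t


# Hypothetical samples, for illustration only.
t_two = kiefer_statistic([[1, 3, 5], [2, 4, 6]])
t_three = kiefer_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
```

With two samples, this form equals $n_1 n_2 D^2 / (n_1 + n_2)$, where $D$ is the two-sample K-S statistic. That is consistent with the remark that the $k = 2$ case reduces to the two-sample test.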
5. Formulas For Drawing a Trend Line
Trend lines (or best-fit lines) can be used to establish a relationship between two variables and predict future values.
5.1. Slope of a Best-Fit Line
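The slope of a best-fit (least squares) line is:
$$m = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$$
Where:
- $\{x_1, \ldots, x_N\}$ is the independent variable with sample mean $\bar{x}$
- $\{y_1, \ldots, y_N\}$ is the dependent variable with sample mean $\bar{y}$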
5.2. Standard Error of the Slope
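The standard error around the estimated slope is:
$$SE = \frac{\sqrt{\dfrac{1}{N-2} \sum_{i=1}^{N} \left( y_i - \bar{y} - m (x_i - \bar{x}) \right)^2}}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}}$$
where $m$ is the estimated slope from section 5.1.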
5.3. Confidence Interval Around the Slope
The confidence interval is constructed as:
$$CI = m \pm t_{\alpha/2} \, SE$$
Where:
- $\alpha$ is the significance level, typically 5% (one minus the confidence level)
- $t_{\alpha/2}$ is the $1-\alpha/2$ quantile of a t-distribution with $N-2$ degrees of freedom
Reference: Simple linear regression (Wikipedia)
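The slope, its standard error, and the confidence interval can be sketched in one function using ordinary least squares. As the standard library has no t quantile, the critical value is passed in; 3.182 is the 0.975 quantile of a t-distribution with 3 degrees of freedom. The function name and data are hypothetical.

```python
import math


def trend_line(xs, ys, t_crit):
    """Return (slope, standard error, (low, high)) for a least-squares line.

    t_crit is the 1 - alpha/2 quantile of a t-distribution with
    len(xs) - 2 degrees of freedom, supplied by the caller.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    residual_ss = sum(
        (y - y_bar - slope * (x - x_bar)) ** 2 for x, y in zip(xs, ys)
    )
    se = math.sqrt(residual_ss / (n - 2)) / math.sqrt(sxx)
    return slope, se, (slope - t_crit * se, slope + t_crit * se)


# Hypothetical data: 5 points, so 3 degrees of freedom.
slope, se, (low, high) = trend_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182)
```

In this example the interval runs from about -0.3 to 1.5 and so includes zero, meaning five points are not enough to say the trend is real.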