Probability and Statistics for Biological Sciences
The Empirical Rule | Tchebysheff's Theorem | The Relationship Between the Mean and the Median | A Rough Estimate of s
There are a number of simple "rules of thumb" that are useful for making rough estimates of various things in statistics. Some of the more useful ones are described briefly here.
The Empirical Rule
The so-called empirical rule describes how values in sets of data or populations which obey the normal distribution are clustered around the mean value. It is exact for situations which are exactly normally distributed, but it can be a reasonable approximation in situations where the normal distribution is followed approximately (i.e., the distribution is unimodal and symmetric). The rule is illustrated in the following figure:
For a normally distributed set of data or a normally distributed population,
|approximately 68% of all elements will fall within one standard deviation of the mean; that is, will have a value between μ - σ and μ + σ|
|approximately 95% of all elements will fall within two standard deviations of the mean; that is, will have a value between μ - 2 σ and μ + 2 σ|
|approximately 99.7% of all elements will fall within three standard deviations of the mean. Since this percentage is so close to 100%, people often state this third rule as "effectively the entire population will be located within three standard deviations of the mean".|
These percentages are accurate when the distribution is exactly a normal distribution (well, ok, the exact percentages in that case to four significant figures are 68.27%, 95.45%, and 99.73%, respectively).
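The exact percentages can be checked numerically. The following is a minimal sketch in Python using only the standard library; the function name `normal_coverage` is an illustrative choice, not part of the course materials. It relies on the fact that, for a normal distribution, the fraction of the population within k standard deviations of the mean equals erf(k/√2).

```python
import math

def normal_coverage(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

# Print the coverage for the three intervals used by the empirical rule.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * normal_coverage(k):.2f}%")
```

Because the function accepts any positive k, it can also be used for intervals such as 1.5 or 2.5 standard deviations, which the rounded rule does not cover.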
When the distribution is symmetric, we know that the percentage of the population excluded from each of these intervals will be evenly split between the lower tail and the upper tail of the distribution. Thus, we can also make statements along the following lines:
|approximately 16% of the population is more than one standard deviation below the mean|
|approximately 16% of the population is more than one standard deviation above the mean|
|approximately 2.5% of the population is more than two standard deviations below the mean, and similarly, approximately 2.5% of the population is more than two standard deviations above the mean.|
|only about 3 elements out of 1000 (0.3%) of a population will deviate from the mean by more than three standard deviations.|
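The tail percentages listed above are rounded. As a rough check, the sketch below computes the corresponding tail fractions for an exactly normal population, assuming Python 3.8 or later (for `statistics.NormalDist`); the helper name `lower_tail` is hypothetical.

```python
from statistics import NormalDist

def lower_tail(k: float) -> float:
    """Fraction of a normal population more than k standard deviations below the mean."""
    # The standard normal has mean 0 and standard deviation 1,
    # so "k standard deviations below the mean" is simply the point -k.
    return NormalDist().cdf(-k)

for k in (1, 2, 3):
    one_tail = lower_tail(k)
    # By symmetry the upper tail is the same size as the lower tail.
    print(f"k = {k}: each tail {100 * one_tail:.2f}%, both tails {200 * one_tail:.2f}%")
```

For an exactly normal population the single-tail values come out near 15.87%, 2.28%, and 0.13%, which the rule of thumb rounds to 16%, 2.5%, and 0.15%.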
Tchebysheff's Theorem
When you don't feel justified in applying the empirical rule (either because you have evidence that the population distribution is not even approximately normal, or you don't know and don't want to make an erroneous assumption), there is another rule that may be tried.
This result, called Tchebysheff's Theorem, makes no assumptions at all about the shape of the data distribution. It can be stated as follows:
"The fraction of a population occurring within k standard deviations of the mean is at least \(1 - 1/k^2\)."
This rule is true for any positive value of k (not just whole numbers), but to get an idea of how its results compare with those from the empirical rule, we look at it for k = 1, 2, and 3:
|when k = 1, \(1 - 1/1^2 = 0\), indicating that at least 0% of the data values fall within one standard deviation of the mean. Since you can never have less than "at least 0%" of a collection of things, this is a useless statement.|
|when k = 2, \(1 - 1/2^2 = 3/4 = 0.75\), indicating that at least 75% of the data values will be found within two standard deviations of the mean. (When we are confident enough that the data are approximately normally distributed so that we can use the empirical rule, we're able to make the statement that approximately 95% of the data fall within this interval -- the lack of information about the data distribution results in this rule hedging by 20 percentage points.) Note that since we haven't assumed that the distribution is symmetric about the mean here, we can't say anything about where the residual 25% of the data may be -- just that up to 25% of the data may lie two or more standard deviations away from the mean.|
|when k = 3, \(1 - 1/3^2 = 8/9 \approx 0.89\), allowing us to say that at least 89% of the data or the population will be found within three standard deviations of the mean.|
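To see how the guaranteed bound compares with the normal-distribution figures for other values of k, the following minimal Python sketch may help. The function names are illustrative, and the normal coverage uses the standard identity erf(k/√2) rather than anything specific to these notes.

```python
import math

def chebyshev_lower_bound(k: float) -> float:
    """Guaranteed minimum fraction within k standard deviations, for any distribution."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula gives zero or a negative number, i.e. no information.
    return max(0.0, 1.0 - 1.0 / k**2)

def normal_coverage(k: float) -> float:
    """Fraction within k standard deviations when the population is exactly normal."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 1.5, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.1f}% (Tchebysheff), "
          f"{100 * normal_coverage(k):.1f}% (normal)")
```

The non-integer case k = 1.5 illustrates that the theorem applies to any positive k, giving a bound of about 55.6%.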
Tchebysheff's theorem has the advantage that what it says is guaranteed to be true. Its disadvantage is that what it says is often so imprecise that it is of little practical use. On the other hand, if you assume the empirical rule when it really isn't justified, you may get some very specific results, but they may be untrue.
To give you an idea of how you might use these rules of thumb, consider a situation in which a technologist is trying to estimate the total amount of time that must be allocated to carry out a frequent laboratory procedure. Suppose she monitors 50 repetitions of the procedure, and finds that for those 50 repetitions, the mean time per procedure was \(\bar{x} = 25.0\) minutes and the standard deviation was s = 5 minutes (nice round numbers for the example!).
According to the empirical rule, she could then say that only about 16% of the time will the procedure take more than 30 minutes, only about 2.5% of the time will the procedure take more than 35 minutes, and rarely (around three times out of every 2000) will the procedure take more than 40 minutes to perform. On the other hand, using Tchebysheff's theorem, all she could say is that at least 75% of the procedures will require less than 35 minutes (so up to 25% could require more than 35 minutes), at least 89% of the procedures will require less than 40 minutes (so up to 11% could require more than 40 minutes), and so on.
Of course, if it were really important to make precise and reliable predictions here, the technologist should employ statistical techniques that go beyond simple "rules-of-thumb" -- which are primarily intended to allow people to make some fast, rough, but insightful estimates, rather than profound, reliable, precise analyses.
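The technologist's two sets of statements can be tabulated with a short script. This is only a sketch under the example's assumptions (mean 25.0 minutes, s = 5 minutes); the names `EMPIRICAL_UPPER_TAIL` and `time_estimates` are hypothetical, and the empirical-rule fractions are the rounded values quoted in the text.

```python
# Upper-tail fractions quoted in the text for a roughly normal population.
EMPIRICAL_UPPER_TAIL = {1: 0.16, 2: 0.025, 3: 0.0015}

def time_estimates(mean, sd, ks=(1, 2, 3)):
    """Return (threshold, empirical-rule fraction, Tchebysheff maximum) for each k."""
    rows = []
    for k in ks:
        threshold = mean + k * sd
        empirical = EMPIRICAL_UPPER_TAIL[k]
        # Tchebysheff: at most 1/k^2 of the values lie outside mean +/- k*sd,
        # so at most that fraction can exceed the upper threshold.
        chebyshev = min(1.0, 1.0 / k**2)
        rows.append((threshold, empirical, chebyshev))
    return rows

for threshold, empirical, chebyshev in time_estimates(25.0, 5.0):
    print(f"more than {threshold:.0f} min: about {100 * empirical:.2f}% (empirical rule), "
          f"at most {100 * chebyshev:.0f}% (Tchebysheff)")
```

The k = 1 row shows why Tchebysheff's theorem says nothing useful at one standard deviation: the bound allows up to 100% of procedures to exceed 30 minutes.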
The Relationship Between the Mean and the Median
Recall that if the distribution is not symmetric, then the mean and median will generally have different values, with the mean displaced from the median in the direction of the skew. It is possible to demonstrate that the mean and median never differ by more than one standard deviation. When applied to a population, this principle takes the form:
\[ |\mu - \text{median}| \le \sigma \]
Of course, this is true for a symmetric distribution as well, since then the mean and the median coincide and \(|\mu - \text{median}| = 0 \le \sigma\), because σ is never a negative number.
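For readers who want to see why the bound holds, one standard argument can be sketched as follows. The notation is an assumption made for this sketch: X is a random member of the population, E denotes the population average, and m stands for the median.

```latex
\begin{align*}
|\mu - m| &= |E(X - m)| \\
          &\le E|X - m|      && \text{(the absolute value of an average is at most the average absolute value)} \\
          &\le E|X - \mu|    && \text{(the median minimizes the average absolute deviation)} \\
          &\le \sqrt{E\left[(X - \mu)^2\right]} = \sigma && \text{(Jensen's inequality)}
\end{align*}
```

The same chain of inequalities applies to a finite data set when the standard deviation is computed with divisor n; since the sample value s uses the divisor n - 1 and is therefore slightly larger, the bound holds for s as well.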
Recall the example of the very skewed set of data described on page 2 of the document on "Measures of Central Tendency". The group under consideration consisted of 20 students, of which 19 had an income of $2000 each and one had an income of $10,000,000 for a particular year. For these twenty students, the median annual income was $2000, but the mean annual income was $501,900.
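We can work out an estimate of σ here by calculating the value of s for the 20 students. Letting \(x_k\) denote the annual income of student number k, we have
\[ s = \sqrt{\frac{\sum_{k=1}^{20} (x_k - \bar{x})^2}{20 - 1}} = \sqrt{\frac{19\,(2000 - 501{,}900)^2 + (10{,}000{,}000 - 501{,}900)^2}{19}} \approx 2{,}235{,}621 \]
The difference between the mean and the median is
\[ |\bar{x} - \text{median}| = |501{,}900 - 2000| = 499{,}900 \]
which is clearly less than or equal to the estimate s = $2,235,621 we have for σ.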
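The arithmetic for the 20 students can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and `statistics.stdev` is used because it applies the n - 1 divisor of the sample standard deviation.

```python
import statistics

# 19 students with an income of $2000 and one with an income of $10,000,000.
incomes = [2000] * 19 + [10_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
s = statistics.stdev(incomes)  # sample standard deviation (divisor n - 1)

gap = abs(mean_income - median_income)
print(f"mean = {mean_income:,.0f}, median = {median_income:,.0f}")
print(f"|mean - median| = {gap:,.0f}, s = {s:,.0f}")
print("bound holds:", gap <= s)
```

Even for data this heavily skewed, the gap between mean and median is less than a quarter of the standard deviation.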
Consider a less far-fetched set of data: the SalmonCa0 data. From information provided in class, we know that \(\bar{x} \cong 74.28\) and the median is \(\cong 70.50\), so that \(|\bar{x} - \text{median}| \cong 3.78\), which is less than s ≅ 22.02 by a good bit.
A Rough Estimate of s
The standard deviation is not a very intuitive quantity, and so it is not always easy to tell if the value you calculate is reasonable (as opposed to being the ridiculous result of an arithmetic blunder). One rough estimate of the value of s arises out of the empirical rule.
For approximately normally distributed populations, about 95% of the members will fall within two standard deviations of the mean, an interval with a width of 2 σ + 2 σ = 4 σ . Thus, particularly when sample sizes are not in the thousands (in which case there is a good likelihood of encountering members of the population that are three or more standard deviations from the mean), it is reasonable to equate this interval roughly with the sample range. This will give a ballpark estimate of σ and hence also of s. Thus, very roughly, for the data in a sample, we can write
\[ s \approx \frac{\text{range}}{4} = \frac{\text{largest observation} - \text{smallest observation}}{4} \]
For the SalmonCa0 data, the largest observation was 129 ppm, and the smallest observation was 29 ppm. Thus, according to this rough rule,
\[ s \approx \frac{129 - 29}{4} = \frac{100}{4} = 25 \text{ ppm} \]
An exact calculation gives s = 22.02 ppm, to two decimal places. The values 22.02 and 25 are close enough that we can be confident no serious blunder has been committed in calculating s. Had we obtained 2.202 or 220.2 when we tried to calculate s, we would probably recheck our work, because either of these two values would be difficult to accept as roughly equal to 25.
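This sanity check is easy to automate. The sketch below is one possible way to do it; the function names and, in particular, the "within a factor of 2" tolerance are assumptions made for illustration, not a rule stated in these notes.

```python
def rough_sd_from_range(largest: float, smallest: float) -> float:
    """Ballpark estimate of s: one quarter of the sample range."""
    return (largest - smallest) / 4.0

def looks_plausible(calculated_s: float, rough_s: float, factor: float = 2.0) -> bool:
    """Hypothetical check: is the calculated s within `factor` times the rough estimate?"""
    return rough_s / factor <= calculated_s <= rough_s * factor

rough = rough_sd_from_range(129, 29)  # SalmonCa0 data, in ppm
for candidate in (22.02, 2.202, 220.2):
    verdict = "plausible" if looks_plausible(candidate, rough) else "recheck the arithmetic"
    print(f"s = {candidate}: rough estimate {rough} ppm -> {verdict}")
```

A tolerance this loose only catches gross blunders such as a misplaced decimal point, which is all the rough estimate is intended to do.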