Fundamentals of Statistics 3: Sampling :: The sampling distribution of the mean
We've seen how the normal distribution can help us understand how likely scores or events are to occur. To work with the normal distribution we need to know the mean and standard deviation of the population, but we almost never know them. Instead we sample from the population (as polls and surveys do) to estimate the unknown population parameters.

Sampling involves measuring some subset of the population of interest. For example, a sample of 20 adult women's heights would give us an estimate of the unknown mean and standard deviation of the heights of all adult women. (Parameters describe the population; statistics describe the sample.) Our random sample of 20 heights would be helpful in estimating the population mean and standard deviation, but the estimates won't be exactly right.

In fact, every sample we take from our population will have some error in its estimate of the population mean and standard deviation. If you think about it, this makes sense. We could sample 20 women and find their average height, then sample a different set of 20 women and take their average again. Every time we do this, our sample average will either underestimate or overestimate the unknown population mean height of all women.
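The repeated sampling described above can be sketched with a small simulation. This is only an illustration: the heights are simulated from a normal distribution rather than measured, the population mean of 64.9 is borrowed from the database mentioned later in this lesson, and `POP_SD = 2.5` inches is an assumed value that the lesson does not give. The helper name `sample_mean` is also made up for this sketch.

```python
import random
import statistics

# Assumed population for illustration only.
POP_MEAN = 64.9  # mean of the height database mentioned in the lesson
POP_SD = 2.5     # NOT from the lesson; an assumed standard deviation

def sample_mean(n, rng):
    """Draw n simulated heights and return their average."""
    heights = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    return statistics.mean(heights)

rng = random.Random(1)  # fixed seed so the run is repeatable
means = [sample_mean(20, rng) for _ in range(5)]

for i, m in enumerate(means, start=1):
    # A positive error is an overestimate, a negative one an underestimate.
    print(f"sample {i}: mean = {m:.2f}, error = {m - POP_MEAN:+.2f}")
```

Each of the five sample means lands near the population mean, but none should be expected to hit it exactly.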

For example, we know the average height of North American women is 65 inches. I have a large database of women's heights with a mean of 64.9 inches (close enough to 65), so I sampled 30 heights and graphed them. I repeated this nine times, and each sample of 30 is graphed below.

[Figure: histograms of 9 of the 30 samples of 30 women's heights. Some women are shorter than 60 inches and some are taller than 70 inches.]

You can see from the graph above that the heights all tend to cluster around the mean of 65, but some women are shorter than 60 inches or taller than 70 inches. I continued this fun exercise until I had 30 samples, each with 30 women's heights. I then graphed the 30 means below.

[Figure: histogram of the means of the 30 samples of women's heights. The mean of the distribution of sample means is expected to equal the population mean. Here it is 65.02, almost exactly the population mean of 65.]
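The exercise of taking 30 samples of 30 heights and averaging each one could be reproduced roughly as follows. This sketch uses simulated heights, not the author's database: the population is assumed to be normal with mean 64.9 and an assumed standard deviation of 2.5 inches, and `draw_sample` is a hypothetical helper name.

```python
import random
import statistics

POP_MEAN = 64.9  # mean of the database, per the text
POP_SD = 2.5     # assumed; the text does not give a standard deviation

rng = random.Random(42)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Return n simulated heights from the assumed population."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# 30 samples, each containing 30 heights
samples = [draw_sample(30) for _ in range(30)]

# One mean per sample, then the mean of those 30 means
sample_means = [statistics.mean(s) for s in samples]
mean_of_means = statistics.mean(sample_means)

print(f"smallest sample mean: {min(sample_means):.2f}")
print(f"largest sample mean:  {max(sample_means):.2f}")
print(f"mean of the 30 sample means: {mean_of_means:.2f}")
```

The mean of the sample means should sit very close to `POP_MEAN`, even though the individual sample means scatter around it.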

Now, if we drop the bars from the histogram and draw a curve that fits the heights, we get the famous bell curve. We can superimpose one sample of 30 heights over the distribution of the 30 means and see how narrow the distribution of means is compared to the distribution of raw heights.

[Figure: the red dashed bell curve shows the distribution of the 30 means. The black curve shows the wider and more variable distribution of raw heights from one sample of 30 women.]

There are a few things to notice in these graphs.
1. The distribution of the sample means looks bell-shaped; that is, it is approximately normally distributed.
2. There is much less fluctuation in the sample means than in the raw data points. While the raw heights varied by as much as 12 inches, the sample means varied by only 2 inches.
3. The mean of these means is really close to 64.9 (65.01 to be exact). In fact, if we were to keep sampling indefinitely, the mean of the sample means would converge to the population mean.
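The second observation, that sample means fluctuate far less than raw heights, can be checked numerically. The sketch below compares the standard deviation of one simulated sample of 30 heights with the standard deviation of many simulated sample means, and sets both beside the standard result that the spread of sample means is the population standard deviation divided by the square root of the sample size. As before, the population values are assumptions (mean 64.9, standard deviation 2.5 inches), and 2000 repetitions is an arbitrary choice to make the estimate stable.

```python
import math
import random
import statistics

POP_MEAN = 64.9     # mean of the database, per the text
POP_SD = 2.5        # assumed; not given in the text
N = 30              # heights per sample
NUM_SAMPLES = 2000  # arbitrary; more repetitions than the lesson's 30

rng = random.Random(7)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Return n simulated heights from the assumed population."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# Spread of raw heights within a single sample
one_sample = draw_sample(N)
sd_raw = statistics.stdev(one_sample)

# Spread of the sample means across many samples
sample_means = [statistics.mean(draw_sample(N)) for _ in range(NUM_SAMPLES)]
sd_means = statistics.stdev(sample_means)

# Standard result: spread of sample means = population SD / sqrt(n)
predicted = POP_SD / math.sqrt(N)

print(f"SD of raw heights in one sample: {sd_raw:.2f}")
print(f"SD of the sample means:          {sd_means:.2f}")
print(f"POP_SD / sqrt(N):                {predicted:.2f}")
```

With these assumed values the sample means should be several times less spread out than the raw heights, which matches the narrow red curve and wide black curve described above.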

### Who cares?

You may wonder what the point is of taking many samples from a population, since you'll never do it in practice. It turns out that nearly everything we do in statistics comes down to what we see in this exercise. Like all important ideas in math, it has a special name and theory: the central limit theorem, which we will now define.
