Sampling and sampling distributions
In many cases, we would like to learn something about a large population without actually inspecting every unit in that population. We then draw a sample that permits us to draw conclusions about the population of interest. We may, for example, draw a sample from the population of Dutch men aged 18 and older to learn something about the joint distribution of height and weight in this population.
Because we cannot draw conclusions about the population from a sample without error, it is important to know how large these errors may be, and how often incorrect conclusions may occur. An objective assessment of these errors is only possible for a probability sample. For a probability sample, the probability of inclusion in the sample is known and positive for each unit in the population. Drawing a probability sample of size n from a population consisting of N units may be a quite complex random experiment. The experiment is simplified considerably by subdividing it into n sub-experiments, each consisting of drawing one of the n consecutive units. In a simple random sample, the n consecutive units are drawn with equal probabilities from the units concerned.
In random sampling with replacement, the sub-experiments (drawing of one unit) are all identical and independent: n times a unit is randomly selected from the entire population. We will see that this property simplifies the ensuing analysis considerably. For the units in the sample, we observe one or more population variables. For probability samples, each draw is a random experiment. Every observation may therefore be viewed as a random variable. The observation of a population variable X on the unit drawn in the i-th trial yields a random variable Xi. Observation of the complete sample yields n random variables X1, ..., Xn.
Table 2.1: A small population

Unit  1  2  3  4  5  6
X     1  1  2  2  2  3

Table 2.2: Probability distribution of X1 and X2

x              1    2    3
p1(x) = p2(x)  1/3  1/2  1/6
Likewise, if we observe for each unit the pair of population variables (X,Y), we obtain pairs of random variables (Xi, Yi) with outcomes (xi, yi). Consider the population of size N = 6, displayed in table 2.1. A random sample of size n = 2 is drawn with replacement from this
population. For each unit drawn we observe the value of X. This yields two random variables X1 and X2, with identical probability distribution as displayed in table 2.2. Furthermore, X1 and X2 are independent, so their joint distribution equals the product of their individual distributions:

p(x1, x2) = p1(x1) · p2(x2) = p(x1) · p(x2)
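The marginal and joint distributions above can be verified by enumeration; a minimal Python sketch, using the population values from table 2.1:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The small population of table 2.1.
population = [1, 1, 2, 2, 2, 3]
N = len(population)

# Marginal distribution of a single draw: p(x) = (# units with value x) / N.
p = {x: Fraction(c, N) for x, c in Counter(population).items()}
# This reproduces table 2.2: p(1) = 1/3, p(2) = 1/2, p(3) = 1/6.

# Because the two draws (with replacement) are independent, the joint
# distribution is the product of the marginals: p(x1, x2) = p(x1) * p(x2).
joint = {(x1, x2): p[x1] * p[x2] for x1, x2 in product(p, repeat=2)}

assert sum(joint.values()) == 1  # probabilities sum to one
```
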
This completely determines the distribution of the sample. Usually, we are not really interested in the individual outcomes of the sample, but rather in some sample statistic. A statistic is a function of the sample observations X1, ..., Xn, and therefore is itself also a random variable.
The probability distribution of a sample statistic is called its sampling distribution.
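For the small example population, the sampling distribution of a statistic such as the sample mean can be obtained by complete enumeration of all possible samples; a Python sketch:

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

population = [1, 1, 2, 2, 2, 3]  # the population of table 2.1
N = len(population)

# Enumerate all N**2 equally likely ordered samples of size n = 2 drawn
# with replacement, and accumulate the probability of each sample mean.
sampling_dist = defaultdict(Fraction)
for x1, x2 in product(population, repeat=2):
    sampling_dist[Fraction(x1 + x2, 2)] += Fraction(1, N * N)

# e.g. the sample mean 2 occurs with probability 13/36
for m in sorted(sampling_dist):
    print(m, sampling_dist[m])
```
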
We were able to determine the probability distribution of the sample, and sample statistics, by complete enumeration of all possible samples. This was feasible only because the sample size and the number of distinct values of X were very small. When the sample is of realistic size, and X takes on many distinct values, the complete enumeration is not possible. Nevertheless, we would like to be able to infer something about the shape of the sampling distribution of a sample statistic, from knowledge of the distribution of X. We consider here two options to make such inferences.
1. The distribution of X has some standard form that allows the mathematical derivation of the exact sampling distribution.
2. We use a limiting distribution to approximate the sampling distribution of interest. The limiting distribution may be derived from some characteristics of the distribution of X.
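The second option can be illustrated by simulation: for samples of size n drawn with replacement, the central limit theorem suggests the sample mean is approximately normal with mean μ and standard deviation σ/√n. A rough Python sketch, reusing the example population (any finite list of values would do):

```python
import random
import statistics

random.seed(42)

# Population of the running example.
population = [1, 1, 2, 2, 2, 3]
mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# Simulate the sampling distribution of the sample mean for n = 30 draws
# with replacement; its mean and spread should be close to the limiting
# normal approximation N(mu, sigma / sqrt(n)).
n, reps = 30, 20_000
means = [statistics.mean(random.choices(population, k=n)) for _ in range(reps)]

print(statistics.mean(means))   # close to mu
print(statistics.stdev(means))  # close to sigma / n ** 0.5
```
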
The exact sampling distribution of a sample statistic is often hard to derive analytically, even if the population distribution of X is known. In that case, the bootstrap offers a simulation-based alternative; consider, for example, a statistic such as the sample correlation coefficient r.
Non-parametric: Draw samples of 30 (x, y) pairs (with replacement) from the data. For each bootstrap sample, compute r, to obtain an empirical sampling distribution.
Parametric: Make appropriate assumptions about the joint distribution of X and Y.
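The non-parametric option can be sketched as follows. The paired data here are synthetic stand-ins for the 30 observed pairs (in practice one would use the actual sample), and r is the sample correlation coefficient:

```python
import random
import statistics

random.seed(1)

# Synthetic stand-in for 30 observed (x, y) pairs.
x = [random.gauss(0, 1) for _ in range(30)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]
data = list(zip(x, y))

def pearson_r(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Non-parametric bootstrap: resample the pairs with replacement and
# recompute r each time, yielding an empirical sampling distribution.
boot_r = [pearson_r(random.choices(data, k=len(data))) for _ in range(2000)]

print(pearson_r(data))           # r of the original sample
print(statistics.stdev(boot_r))  # bootstrap standard error of r
```

The spread of `boot_r` estimates the standard error of r, and its quantiles give a simple bootstrap confidence interval.
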
Prediction error
We have one response variable (Y ) and one predictor variable (X). For example, we might predict a son’s height, based on his father’s height. Or we might predict a cat’s heart weight, based on its total body weight (Figure 2).
Suppose (X; Y ) have a joint distribution f(x; y). You observe X = x. What is your best prediction of Y ?
Let g(x) be any prediction function, for instance, a linear relationship. The prediction error (or risk) is
R(g) = E[(Y − g(X))²], where E is the expected value with respect to the joint distribution f(x, y). Condition on X = x and let

r(x) = E(Y | X = x) = ∫ y f(y|x) dy

be the regression function. Let ε = Y − r(X). Then E(ε) = E[E[Y − r(X) | X]] = 0, and we can write

Y = r(X) + ε. (1)

Key result: for any prediction function g, the regression function r(x) minimizes the prediction error: R(r) ≤ R(g).
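The key result can be checked numerically. Under a joint distribution assumed here purely for illustration (X uniform, Y linear in X plus Gaussian noise), a Monte Carlo estimate of the risk of the regression function r stays below that of any other prediction function g:

```python
import random

random.seed(0)

# Assumed joint distribution (for illustration only): X ~ Uniform(0, 1),
# Y = r(X) + eps with r(x) = 2 + 3x and eps ~ N(0, 1).
def r(x):
    return 2 + 3 * x

def g(x):  # a deliberately different (biased) prediction function
    return 2.5 + 2 * x

def risk(pred, reps=100_000):
    """Monte Carlo estimate of R(pred) = E[(Y - pred(X))^2]."""
    total = 0.0
    for _ in range(reps):
        x = random.random()
        y = r(x) + random.gauss(0, 1)
        total += (y - pred(x)) ** 2
    return total / reps

risk_r = risk(r)  # about 1.0, the irreducible error Var(eps)
risk_g = risk(g)  # strictly larger: bias of g adds to the risk
print(risk_r, risk_g)
```
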