Populations vs. Samples

“A POPULATION is the total set of items that we are concerned about” (Meier, Brudney, and Bohte, 2011,  p. 173).  In other words, the population is the complete set of individuals or items that share a common characteristic or set of characteristics.  We are often interested in population PARAMETERS, i.e., numerical values that are fixed and describe a characteristic of a population, such as the population mean (μ), variance (σ²), and standard deviation (σ).  

“A SAMPLE is a subset of a population” (Meier, Brudney, and Bohte, 2011, p. 173).  There are two different types of samples: probability samples and non-probability samples. In a PROBABILITY SAMPLE, all members of the population have a KNOWN CHANCE of being selected as part of the sample.  To construct a probability sample, you will need to obtain a list of the entire population; this list then serves as the SAMPLING FRAME from which the sample will be selected/drawn. An example of a probability sample is a RANDOM SAMPLE, in which all members of the population have an equal chance of being selected in a sample. In a NON-PROBABILITY SAMPLE, some members of the population have NO CHANCE of being selected as part of the sample (in other words, the probability of selection cannot be determined). An example of a non-probability sample is a CONVENIENCE SAMPLE, in which the sample is selected based on convenience (i.e., as a result of being easy to contact or reach).

“A STATISTIC is a measure that is used to summarize a sample” (Meier, Brudney, and Bohte, 2011, p. 173), such as the measures of central tendency (ex: sample mean, X̄)and dispersion (ex: sample standard deviation, s) for a variable.  In order to treat sample findings as GENERALIZABLE to the population (i.e., use sample statistics as reliable estimates of the population parameters), the sample should to be a probability sample. 

Why Are Only Probability Samples Generalizable?

Probability samples are more representative of the population.  Furthermore, in probability sampling, the sampling distribution of the sample statistic (e.g., sample mean) can be determined based on statistical principles.  Thus, probability sampling allows us to calculate measures such as margins of error and confidence levels, which account for uncertainty in our sample statistics and capture how reliably they estimate population parameters. 

In contrast, non-probability samples lack a clear and defined sampling distribution, making it impossible to accurately estimate the variability of the sample statistic.

Sources of Secondary Data: Open Data 

OPEN DATA refers to data made public for research, information, or transparency purposes.  Often, open data is published by governments, research institutions, or other organizations (such as interest groups or nonprofit organizations) that aim to promote transparency, accountability, and innovation.  The federal government maintains a comprehensive list of open data sets here