Standardization, Z-Scores, and the Z-Table

Standard deviations give us aggregate information but not individual information: although standard deviations describe how the values of a variable as a whole cluster around the mean, they do not indicate how close any particular score lies to the mean.  This is where standardization, z-scores, and the standard normal distribution table are beneficial.

Standardization and the Standard Normal Distribution 

STANDARDIZATION is the process of transforming data into a STANDARD NORMAL DISTRIBUTION, which is a special normal distribution with a mean of 0 and a standard deviation of 1: Z ~ N(0,1).  Standardization allows for comparison between datasets or variables with different units or scales.  For example, if you want to directly compare SAT and ACT scores (which are based on different scales),  you can standardize the data; this puts the scores on the same scale, allowing direct comparisons.  Standardization also allows us to more easily calculate the probability of observing a specific value for a given variable.

Z-Scores

Z-scores (i.e., standard scores) are the result of standardization; they put individual scores into context.  “A Z-SCORE is simply the number of standard deviations a score of interest lies from the mean of a [standard] normal distribution” (Meier, Brudney, and Bohte, 2011, p. 134).
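To make the SAT/ACT comparison above concrete, here is a minimal Python sketch of the z-score calculation, z = (score – mean) / standard deviation.  The means and standard deviations used are illustrative assumptions, not official test statistics.

```python
# Minimal sketch: standardizing an SAT score and an ACT score so they can be compared.
# The means and standard deviations below are illustrative assumptions, not official values.

def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

sat_z = z_score(1300, mean=1050, sd=200)   # assumed SAT mean and SD
act_z = z_score(29, mean=21, sd=5.4)       # assumed ACT mean and SD

print(f"SAT z-score: {sat_z:.2f}")   # 1.25
print(f"ACT z-score: {act_z:.2f}")   # 1.48 -> the relatively stronger score
```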

Using a Standard Normal Distribution Table

Once you have standardized your variable(s) and calculated z-scores for the values of interest, you can use the STANDARD NORMAL DISTRIBUTION TABLE (i.e., Z-TABLE) to determine a value’s probability.  Normal distribution tables can also be used to find p-values for z-tests.

Below are some tips for reading a standard normal distribution table (a code-based lookup is sketched after the list):

  • Round the z-score to the nearest hundredth
  • Familiarize yourself with the layout of the standard normal distribution table:
    • Row and column headers define the z-score
      • Read down the first column for the ones and tenths places of your number
      • Read along the top row for the hundredths place
    • Table cells represent the area under the curve to the left of a z-score
  • To locate the probability of a variable taking on a certain value:
    • Split the z-score into two parts: the ones and tenths digits (e.g., 1.2) and the hundredths digit (e.g., 0.03)
    • The intersection of the row from the first part and the column from the second part will give you the value associated with your z-score
    • This value represents the proportion of the data set that lies below the value corresponding to your z-score in a standard normal distribution
      • For example, the cumulative probability for z-score=1.23 is 0.8907, which means that there is an 89.07% chance that a randomly selected value from a standard normal distribution is less than 1.23
  • Calculating the difference between the area under the curve for two values/data points tells you the probability of variables taking on a range of values
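If you prefer to look these values up programmatically rather than in a printed table, here is a minimal sketch using the standard normal CDF from scipy (scipy.stats.norm.cdf), which returns the same area-to-the-left values described above.

```python
# Minimal sketch: using the standard normal CDF in place of a printed z-table.
from scipy.stats import norm

# Area under the curve to the left of z = 1.23 (matches the table value of 0.8907)
print(round(norm.cdf(1.23), 4))                      # 0.8907

# Probability that a value falls within a range of z-scores (difference of two areas)
print(round(norm.cdf(1.23) - norm.cdf(-1.23), 4))    # 0.7813
```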

The Normal Distribution: The Basics

The NORMAL DISTRIBUTION (sometimes referred to as the Gaussian distribution) is a continuous probability distribution that can be found in many places: height, weight, IQ scores, test scores, errors, reaction times, etc.  Understanding that a variable is normally distributed allows you to:

  • predict the likelihood (i.e., probability) of observing certain values
  • apply various statistical techniques that assume normality
  • establish confidence intervals and conduct hypothesis tests

Characteristics of the Normal Distribution

There are several key characteristics of the normal distribution:

  • mean, median, and mode are equal and located at the center of the distribution
  • the distribution is symmetric about the mean (i.e., the left half of the distribution is a mirror image of the right half); “Scores above and below the mean are equally likely to occur so that half of the probability under the curve (0.5) lies above the mean and half (0.5) below” (Meier, Brudney, and Bohte, 2011, p. 132)
  • the distribution resembles a bell-shaped curve (i.e., highest at the mean and tapers off towards the tails)
  • the standard deviation determines the SPREAD of the distribution (i.e., its height and width): a smaller standard deviation results in a steeper curve, while a larger standard deviation results in a flatter curve
  •  the 68-95-99 RULE can be used to summarize the distribution and calculate probabilities of event occurrence:
    – approximately 68% of the data falls within ±1 standard deviation of the mean
    – approximately 95% of the data falls within ±2 standard deviations of the mean
    – approximately 99% of the data falls within ±3 standard deviations of the mean
  • there is always a chance that values will fall outside ±3 standard deviations of the mean, but the probability of occurrence is less than 1%
  • the tails of the distribution never touch the horizontal axis: the probability of an outlier occurring may be unlikely, but it is always possible; thus, the upper and lower tails approach, but never reach, 0%

Why the Normal Distribution is Common in Nature: The Central Limit Theorem

The CENTRAL LIMIT THEOREM states that the distribution of sample means for INDEPENDENT, IDENTICALLY DISTRIBUTED (IID) random variables will approximate a normal distribution, even when the variables themselves are not normally distributed, assuming the sample is large enough.  Thus, as long as we have a sufficiently large random sample, we can make inferences about population parameters (what we are interested in) from sample statistics (what we are often working with).
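Here is a minimal simulation sketch of this idea: sample means drawn from a clearly non-normal (uniform) distribution still cluster in a roughly bell-shaped pattern around the true mean.  The sample size, number of samples, and seed are arbitrary illustrative choices.

```python
# Minimal sketch: the central limit theorem by simulation.
# Individual draws come from a clearly non-normal (uniform) distribution,
# yet the means of repeated samples form a roughly normal distribution.
import numpy as np

rng = np.random.default_rng(42)
sample_means = [rng.uniform(0, 1, size=50).mean() for _ in range(10_000)]

print(round(np.mean(sample_means), 3))   # close to the true mean of 0.5
print(round(np.std(sample_means), 3))    # close to (1/sqrt(12)) / sqrt(50) ≈ 0.041
```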

What Does “IID” Mean?

Variables are considered independent if the value or occurrence of one does not affect the value or occurrence of another.  Variables are considered identically distributed if they have the same probability distribution (i.e., normal, Poisson, etc.).

Do Outliers Matter?

In a normal distribution based on a large number of observations, it is unlikely that outliers will skew results.  If you are working with data involving fewer observations, outliers are more likely to skew results; in these situations, you should identify, investigate, and decide how to handle outliers.

Example of a Normal Distribution: IQ Tests

Because the IQ test has been given millions of times, IQ scores represent a normal probability distribution.  On the IQ test, the mean, median, and mode are equal and fall in the middle of the distribution (100).  The standard deviation on the IQ test is 15; applying the 68-95-99 rule, we can say with reasonable certainty:

  • 68% of the population will score between 85 and 115, or ±1 standard deviation from the mean
  • 95% of the population will score between 70 and 130, or ±2 standard deviations from the mean
  • 99% of the population will score between 55 and 145, or ±3 standard deviations from the mean

Rarely will you encounter such a perfect normal probability distribution as the IQ test, but we can calculate z-scores to standardize (i.e., “normalize”) values for distributions that aren’t as normal as the IQ distribution.
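As a quick check of the 68-95-99 rule, here is a minimal sketch that standardizes the IQ cutoffs listed above (using the mean of 100 and standard deviation of 15) and computes the corresponding areas with scipy's standard normal CDF.

```python
# Minimal sketch: checking the 68-95-99 rule for IQ scores (mean 100, SD 15)
# by standardizing the cutoffs and using the standard normal CDF.
from scipy.stats import norm

mean, sd = 100, 15

def prob_between(low, high):
    z_low, z_high = (low - mean) / sd, (high - mean) / sd
    return norm.cdf(z_high) - norm.cdf(z_low)

print(round(prob_between(85, 115), 3))   # ~0.683 (±1 standard deviation)
print(round(prob_between(70, 130), 3))   # ~0.954 (±2 standard deviations)
print(round(prob_between(55, 145), 3))   # ~0.997 (±3 standard deviations)
```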

Probability Theory: The Basics

PROBABILITY is a branch of mathematics that deals with the likelihood or chance of different outcomes occurring in uncertain situations.  It quantifies how likely an event is to happen and is expressed as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. 

Probabilities are important to understand because they give us predictive capabilities: probabilities are the closest thing we have to being able to predict the future.  Of course, this is not foolproof; there is always a chance that we are wrong, which is why we couch results/findings in terms of a 95% confidence interval and margin of error. 

Basic Law of Probability

“The BASIC LAW OF PROBABILITY . . . states the following: Given that all possible outcomes of a given event are equally likely, the probability of any specific outcome is equal to the ratio of the number of ways that that outcome could be achieved to the total number of ways that all possible outcomes can be achieved” (Meier, Brudney, and Bohte, 2011, p. 113).  This means that we can predict the outcome of a specific event as long as the likelihood of a given event is known.  For example:

  • the probability of getting heads with one coin flip is 1/2
  • the probability of getting three with one roll of a six-sided die is 1/6
  • the probability of drawing the ace of spades from a deck of cards is 1/52
  • the probability of drawing an ace from a deck of cards is 1/13 (although there are 52 cards in a deck, there are four aces; to calculate the probability of drawing an ace, you would divide four by 52, which reduces to 1/13)
  • the probability of drawing a heart from a deck of 52 cards is 1/4 (this is because there are four suits in each deck of cards)

All of these examples involve probabilities of the occurrence of single events.  As you can see, the process of calculating these probabilities is pretty straightforward: you divide the number of times a specific outcome can occur by the total number of possible outcomes.

Probability P(A) of a Single Event A

P(A) = Number of favorable outcomes / Total number of possible outcomes

This gets a little more complicated as we factor in other events; how we calculate the probability of an event occurring differs depending on whether we are looking at mutually exclusive events (i.e., events that cannot occur at the same time), non-mutually exclusive events (i.e., events that can occur at the same time), independent events (i.e., events in which the occurrence of one does not affect the occurrence of the other), an event occurring given the occurrence of a different event (i.e., the conditional probability), etc.  Nevertheless, while different equations are used to calculate probabilities in these situations, the basic law of probability still holds: the probability of an event occurring falls between 0 (“never occurs”) and 1 (“always occurs”), and calculating this probability is based on possible outcomes.
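For reference, here is a minimal sketch of how these situations are typically handled, reusing the coin and card examples from this section (the ace-or-king case is an added illustration):

```python
# Minimal sketch of the basic probability rules described above.

# Single events: P(A) = favorable outcomes / total outcomes
p_ace = 4 / 52                       # 1/13
p_heart = 13 / 52                    # 1/4
p_ace_of_hearts = 1 / 52

# Mutually exclusive events (cannot occur together): P(A or B) = P(A) + P(B)
p_ace_or_king = 4 / 52 + 4 / 52      # 2/13

# Non-mutually exclusive events: P(A or B) = P(A) + P(B) - P(A and B)
p_ace_or_heart = p_ace + p_heart - p_ace_of_hearts   # 16/52

# Independent events (e.g., two separate coin flips): P(A and B) = P(A) * P(B)
p_two_heads = 0.5 * 0.5              # 0.25

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_ace_given_heart = p_ace_of_hearts / p_heart        # 1/13
```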

A Priori and Posterior Probabilities 

A PRIORI PROBABILITIES are initial probabilities of an event based on existing knowledge, theory, or general reasoning about the event.  All of the probabilities we have discussed thus far are a priori probabilities, because we know in advance the possible outcomes of a coin flip, die roll, or card draw.  By contrast, POSTERIOR PROBABILITIES are probabilities of an event after new evidence or information is taken into account.  A classic example of posterior probability is the Monty Hall problem.  Posterior probabilities are often calculated using BAYES’ THEOREM, which combines the prior probability with the likelihood of new evidence or information.  Frequency distributions provide the empirical data needed to estimate the probabilities used in Bayes’ theorem.
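Here is a minimal simulation sketch of the Monty Hall problem mentioned above (the trial count is an arbitrary illustrative choice).

```python
# Minimal sketch: simulating the Monty Hall problem.
# Switching wins about 2/3 of the time once the host's reveal (new information)
# is taken into account, so the posterior probability differs from the prior of 1/3.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)                # door hiding the prize
        choice = random.randrange(3)               # contestant's initial pick
        # Host opens a door that is neither the contestant's pick nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(round(play(switch=False), 2))   # ~0.33
print(round(play(switch=True), 2))    # ~0.67
```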

Measures of Dispersion

Measures of DISPERSION tell us how much the observations cluster around the expected value (i.e., the “typical” or average value) for a variable.  In other words, measures of dispersion tell us about the SPREAD of a distribution of values and the overall VARIATION in a measure.  This information can be used to understand the distribution of the data, identify the range of values the data take on, and determine how much confidence we can have in our expected values.

Min, Max, Range, IQR, Variance, & Standard Deviation

There are six measures of dispersion (a computational sketch follows the list):

  • MIN and MAX – the minimum (lowest) and maximum (highest) values of a variable
  • RANGE – the difference between the maximum and minimum values of a variable; as a formula: Range = Max – Min
  • INTERQUARTILE RANGE (IQR) — the difference between the third quartile (Q3 / 75%) and the first quartile (Q1 / 25%), which corresponds to the range of the middle 50% of values of a variable; as a formula:  IQR = Q3 – Q1
  • VARIANCE — the average squared deviation of each value from the mean (i.e., the sum of the squared differences between each value and the mean, divided by the number of cases)
  • STANDARD DEVIATION — the average distance of each value from the mean, expressed in the same units as the data (i.e., the square root of the variance)
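Here is a minimal sketch computing these measures with numpy on a small, made-up dataset.  Note that numpy's variance and standard deviation default to the population versions; pass ddof=1 for the sample versions.

```python
# Minimal sketch: computing the measures of dispersion listed above on made-up data.
import numpy as np

values = np.array([4, 8, 6, 5, 3, 9, 7, 5])

minimum, maximum = values.min(), values.max()
value_range = maximum - minimum
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
variance = values.var()    # population variance; use values.var(ddof=1) for a sample
std_dev = values.std()     # square root of the variance

print(minimum, maximum, value_range, iqr, round(variance, 2), round(std_dev, 2))
```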

As is the case with measures of central tendency, we cannot calculate all measures of dispersion on all levels of variables.  Range and IQR require rank ordering of values — which, in turn, requires that the variable has direction.  Variance and standard deviation can only be calculated if values are associated with real numbers that have equal intervals of measurement between them.  Recall that the hierarchy of measurement illustrates that any statistic that can be calculated for a lower level of measurement can be legitimately calculated and used for higher levels of measurement.  Therefore:

  • because min and max can be calculated for nominal level variables, they can also be calculated on ordinal, interval, and ratio variables
  • because range and IQR can be calculated for ordinal variables, they can also be calculated on interval and ratio variables
  • because variance and standard deviation can be calculated for interval variables, they can also be calculated for ratio variables 
Measure | Description | Levels of Measurement
MIN/MAX | Minimum and maximum values of a variable | Nominal + Ordinal + Interval + Ratio
RANGE | Difference between the maximum and minimum values of a variable | Ordinal + Interval + Ratio
IQR | Range of the middle 50% of values of a variable | Ordinal + Interval + Ratio
VARIANCE | Average squared deviation of each value of a variable from the mean | Interval + Ratio
STANDARD DEVIATION | Average distance of each value of a variable from the mean, expressed in the same units as the data; square root of the variance | Interval + Ratio
Measures of Dispersion

Variance vs. Standard Deviation

Variance and standard deviation are measures that capture the same information (hence, the standard deviation is simply the square root of the variance).  Does it matter which measure we report?  In fact, it does! 

Generally speaking, standard deviation is more useful than the variance from an interpretation standpoint because it is in the same units as the original data (unlike variance, which is expressed in squared units of the original data).  This makes standard deviation easier to understand and communicate.  Thus, standard deviation allows for direct comparisons and provides a clearer picture of data spread.  This is especially true within the context of normal distributions.

For example, let’s assume we have a dataset of annual incomes and derive the following measures of central tendency and dispersion:

  • Mean income: 50,000 (dollars)
  • Standard deviation: 10,000 (dollars)
  • Variance: 100,000,000 (square dollars)

Interpreting the standard deviation is pretty straightforward: most people’s incomes are within $10,000 of the average income of $50,000.  Interpreting the variance, however, is trickier: the average squared deviation from the mean income is 100,000,000 square dollars.  As you can see, the interpretation of variance is less directly meaningful without further mathematical manipulation (i.e., taking the square root to find the standard deviation).

Why Does Variance Use Square Units?

When summing the differences between each data point and the mean, the positive and negative differences (associated with values that fall above and below the mean) cancel each other out exactly, resulting in a sum of zero.  Variance squares the deviations so that they are all positive, which prevents the positive and negative differences from cancelling out.

Working with square units also has some useful mathematical properties. 
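A minimal sketch of the cancellation problem described above, using a small made-up dataset:

```python
# Minimal sketch: raw deviations from the mean cancel out; squared deviations do not.
data = [2, 4, 6, 8, 10]
mean = sum(data) / len(data)                       # 6.0

raw_deviations = [x - mean for x in data]          # [-4, -2, 0, 2, 4]
print(sum(raw_deviations))                         # 0.0 -- always sums to zero

squared_deviations = [(x - mean) ** 2 for x in data]
print(sum(squared_deviations) / len(data))         # 8.0 -- the (population) variance
```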

Measures of Central Tendency

Measures of CENTRAL TENDENCY tell us about the “typical” or average value for a variable.  In other words, measures of central tendency tell us how closely data in a variable group around some central point.  This information can be used to make an initial prediction of the EXPECTED VALUE that a variable will take on.

Mode, Median, and Mean

There are three measures of central tendency (a computational sketch follows the list):

  • MODE — the value(s) that occurs most often (i.e., with greatest frequency) in a distribution of observations within a variable
    • Most of the time, the mode corresponds to one value; sometimes, the mode corresponds to two (BIMODAL) or more (MULTIMODAL) values
  • MEDIAN — the middle value when the observations within a variable are ranked in ascending order (i.e., from lowest to highest); in other words, the median is the observation with 50% of observations above and 50% of observations below it
    • If there are an even number of observations, the median is equal to the sum of the two middle observations, divided by two
  • MEAN — the arithmetic average of all the observations within a variable (i.e., the sum of values, divided by the number of values)
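Here is a minimal sketch computing all three measures with Python's standard-library statistics module on a small, made-up set of observations.

```python
# Minimal sketch: the three measures of central tendency via the statistics module.
import statistics

observations = [3, 7, 7, 2, 9, 7, 4, 2]

print(statistics.mode(observations))     # 7     (most frequent value)
print(statistics.median(observations))   # 5.5   (average of the two middle values)
print(statistics.mean(observations))     # 5.125 (sum divided by the number of values)
```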

We cannot calculate all measures of central tendency on all levels of variables.  Median requires rank ordering of values — which, in turn, requires that the variable has direction.  Mean can only be calculated if values are associated with real numbers that have equal intervals of measurement between them.  The HIERARCHY OF MEASUREMENT illustrates that any statistic that can be calculated for a lower level of measurement can be legitimately calculated and used for higher levels of measurement.  Therefore:

  • because mode can be calculated for nominal level variables, it can also be calculated on ordinal, interval, and ratio variables
  • because median can be calculated for ordinal variables, it can also be calculated on interval and ratio variables
  • because mean can be calculated for interval variables, it can also be calculated for ratio variables 
Measure | Description | Levels of Measurement
MODE | Value that occurs most often (i.e., with greatest frequency) in a variable | Nominal + Ordinal + Interval + Ratio
MEDIAN | Middle value when observations of a variable are ranked in ascending order | Ordinal + Interval + Ratio
MEAN | Average of all the observations of a variable | Interval + Ratio
Measures of Central Tendency

Mean vs. Median

Generally, we would opt for the mean over the median when either can be calculated on a particular variable.  However, the mean can be misleading when there are outliers in the data.  An OUTLIER is an extreme value of a variable.  When outliers are present, the mean is distorted: it is pulled towards the outlier.  In such situations, the median may be a more representative measure of central tendency.
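A minimal sketch of this effect, using made-up incomes with a single extreme value added:

```python
# Minimal sketch: a single outlier distorts the mean but barely moves the median.
import statistics

incomes = [42_000, 45_000, 47_000, 50_000, 52_000]
with_outlier = incomes + [1_000_000]

print(statistics.mean(incomes), statistics.median(incomes))            # 47200 47000
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 206000 48500.0
```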

Frequency Distributions

FREQUENCY DISTRIBUTION is a summary of how often each value or range of values occurs in a dataset.  It organizes data into a table or graph that displays the FREQUENCY (count) of each unique CLASS (category or value/interval of values) within the dataset. 

Frequency distributions are an important tool for understanding the distribution and patterns of data.  Frequency distributions provide a clear visual summary of the data, helping to identify patterns such as central tendency, dispersion, and skewness.  Frequency distributions are also an important tool for summarizing data: they condense large datasets into an easily interpretable format.  This can facilitate initial data exploration and analysis.  Furthermore, as data summary tools, frequency distributions can also aid in decision-making processes and serve as a mechanism through which findings can be effectively communicated to various stakeholders.

Frequency Tables

FREQUENCY TABLE is a tabular representation of data that shows the number of occurrences (frequency) of each distinct case (value or category in a dataset).  It organizes raw data into a summary format, making it easier to see how often each value appears. 

While frequency tables are helpful, they do not provide as much information as relative frequency tables.  A RELATIVE FREQUENCY TABLE extends the frequency table by including the relative frequency (i.e., the PERCENTAGE DISTRIBUTION), which is the proportion or percentage of the total number of observations that fall into each case.  A relative frequency table provides a sense of the distribution of data in terms of its overall context.

A CUMULATIVE RELATIVE FREQUENCY TABLE shows the cumulative relative frequency (i.e., CUMULATIVE FREQUENCY DISTRIBUTION), which is the sum of the percentage distributions for all values up to and including the current value.  The cumulative percentage for the final value should equal 100% (or something close, depending on rounding errors).  A cumulative relative frequency table helps us to understand the cumulative distribution of the data.

Another extension of the frequency table is a CONTINGENCY TABLE (also known as a cross-tabulation or crosstab).  A contingency table is used to display the frequency distribution of two or more variables; it shows the relationship between two or more CATEGORICAL VARIABLES (i.e., nominal- or ordinal-level variables) by presenting the frequency of each combination of variable categories.
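Here is a minimal sketch of these four tables using pandas on a small, made-up dataset (the region and approval variables are illustrative assumptions).

```python
# Minimal sketch: frequency, relative frequency, cumulative relative frequency,
# and contingency (crosstab) tables with pandas, using made-up data.
import pandas as pd

df = pd.DataFrame({
    "region":   ["North", "South", "North", "East", "South", "North", "East", "South"],
    "approval": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

freq = df["region"].value_counts()                    # frequency table
rel_freq = df["region"].value_counts(normalize=True)  # relative frequency (proportions)
cum_rel_freq = rel_freq.cumsum()                      # cumulative relative frequency
crosstab = pd.crosstab(df["region"], df["approval"])  # contingency table

print(freq, rel_freq, cum_rel_freq, crosstab, sep="\n\n")
```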

Charts and Graphs

There are numerous charts and graphs that can be used to display frequency distributions (a plotting sketch follows the list):

  • BAR GRAPHS (or bar charts) and HISTOGRAMS are graphical representations of data that use rectangular bars to represent frequencies; bar graphs display the frequency of each category or value, while histograms group values into intervals (i.e., BINS); both are useful for showing the distribution of variables
  • PIE CHART is a circular graph divided into slices to illustrate numerical proportions, the size of each slice proportional to the quantity it represents; pie charts are useful for showing the relative frequencies of different categories within a whole
  • LINE GRAPH (or line chart) is a type of graph that displays information as a series of data points connected by straight line segments.  Line graphs are often used to show trends over time; they can also be used to summarize frequency distributions of interval- and ratio-level variables
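For illustration, here is a minimal matplotlib sketch of a bar chart and a histogram; the categories, counts, and scores are made up.

```python
# Minimal sketch: a bar chart and a histogram of frequency distributions with matplotlib.
import matplotlib.pyplot as plt

categories = ["North", "South", "East"]
counts = [3, 3, 2]                                  # made-up category frequencies
scores = [62, 71, 68, 74, 80, 77, 69, 73, 85, 66]   # made-up interval-level values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, counts)                 # bar chart: frequency of each category
ax1.set_title("Bar chart")
ax2.hist(scores, bins=5)                    # histogram: values grouped into bins
ax2.set_title("Histogram")
plt.show()
```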

Measurement Theory: The Basics

MEASUREMENT refers to the process of assigning numbers or labels to objects, events, or characteristics according to specific rules.  Measurement is fundamental in transforming abstract concepts and complex social phenomena into observable and quantifiable data, allowing for empirical investigation and statistical analysis.  Effective measurement enables public administrators to quantify and analyze variables, thereby:

  • facilitating the examination of relationships, patterns, and trends
  • enabling the comparison of variables across different groups, times, or conditions
  • facilitating the testing of hypotheses by providing empirical data that can be analyzed to support or refute theoretical predictions
  • informing evidence-based decision-making and policy formulation by providing measurable and reliable data

Measurement is not as simple as measuring a tangible object or substance.  It involves defining, operationalizing, and quantifying variables in a manner that ensures consistency, reliability, and validity.  We must be as precise and transparent as possible when discussing our measurements.

Conceptualization and Operationalization

The first step in empirical research is to identify CONCEPTS (sets of characteristics or specific information representing constructs or social phenomena) to study.  Concepts help us organize and understand our data and can be thought of as the “building blocks” for constructing statistical models.  Concepts can be abstract/unobservable (ex: public trust in government) or concrete/observable (ex: government spending, public service delivery).  Once you have identified the concept you want to research, you need to narrow it to a more precise CONCEPTUAL DEFINITION: an explanation of the concept, grounded in theoretical constructs (i.e., what theory says about the concept you are researching), that outlines the concept’s meaning and the essential characteristics and/or attributes related to its different aspects or components.  An effective conceptual definition:

  • provides a theoretical framework for understanding the concept, establishing a common language, understanding, and foundation for measurement 
  • is clear enough that others can understand what you mean, facilitating the replication of studies
  • distinguishes the concept being studied from other different (but related) concepts, reducing MEASUREMENT ERROR (i.e., the difference between the true value and the value obtained through measurement) and helping ensure we are using valid, reliable measures
  • is not circular

Once you have developed a conceptual definition, you need to operationalize the concept.  An OPERATIONAL DEFINITION is a clear, precise description of the procedures or steps used to measure a concept; in essence, it translates abstract concepts into measurable variables.  Operational definitions are essential for enhancing the reliability and validity of data collection and measurement and for facilitating replication.  In operationally defining the concept, you will identify one or more INDICATORS (i.e., specified observations used to measure the concept). 

Variables

VARIABLE refers to an empirical (observable and measurable) characteristic, attribute, or quantity that can vary (i.e., take on different values among individuals, objects, or events being studied).  Variables are used to describe and quantify concepts in an analysis. 

Variables are related to indicators.  Multiple indicators may be used to represent a single variable (ex: a four-item scale, with each item associated with a distinct indicator, is often used to measure public trust in federal government).  Alternatively, a single indicator may be used to represent a single variable OR to represent more than one variable, if it measures different aspects or dimensions of those variables.  

Variables can be classified into three main types:

  • DEPENDENT VARIABLES, or the outcome variables we are interested in explaining or predicting. Sometimes called outputs or responses
  • INDEPENDENT VARIABLES, or the predictor variables we are using to explain or predict variation in the dependent variable(s). Sometimes called predictors, inputs, or features
  • CONTROL VARIABLES, or variables that are theoretically related to the dependent variables of interest, which researchers hold constant or account for to isolate the relationship between the independent variable(s) and the dependent variable; control variables are important because they help ensure that the results are attributable to the variables of interest, rather than being influenced by extraneous factors

CODING refers to the process of preparing data for analysis.  To allow for statistical analysis, variable values are coded numerically.  Some of these numerical values represent categories or labels (ex: gender); other numerical values correspond to real numbers (ex: per capita state-level spending on secondary education). 
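A minimal sketch of numeric coding with pandas; the variable, category labels, and codes are arbitrary illustrations.

```python
# Minimal sketch: coding a nominal variable numerically for analysis.
# The variable, category labels, and codes are arbitrary illustrations.
import pandas as pd

df = pd.DataFrame({"gender": ["Female", "Male", "Female", "Nonbinary"]})

codes = {"Female": 1, "Male": 2, "Nonbinary": 3}   # labels mapped to arbitrary codes
df["gender_coded"] = df["gender"].map(codes)

print(df)
```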

Reliability and Validity

RELIABILITY is the extent to which a measure consistently captures whatever it measures.  Simply put – reliability is repeatability.  VALIDITY is the extent to which a measure captures what it is intended to measure.  Simply put – validity is accuracy.

A measure CAN be reliable and not valid, but a measure CANNOT be valid if it is not reliable.  To illustrate this point, let’s consider an example.  Let’s say that your bathroom scale always displays your weight as ten pounds below your actual weight.  Is the weight displayed by the scale a reliable measure?  Yes: the scale consistently displays your weight as being ten pounds below your actual weight.  Is the weight displayed by the scale a valid measure?  No: the scale does not display your actual weight.  The weight displayed by the scale is reliable but not valid.

Levels of Measurement

The level of measurement of a variable impacts the statistics that can be examined and reported and the types of statistical analyses that can be used.  Levels of measurement are hierarchical: what applies to the lowest level of measurement will also apply to variables measured at higher levels.

Nominal

The lowest level of variable measurement is NOMINAL, in which observations are classified into a set of two (BINARY or DICHOTOMOUS variable) or more (MULTINOMIAL variable) categories that:

  1. have no direction (i.e., they are “yes/no”, “on/off”, or “either/or” in nature — there cannot be “more” or “less” of the variable) 
  2. are MUTUALLY EXCLUSIVE (i.e., there is no overlap between categories) 
  3. are EXHAUSTIVE (i.e., there is a category for every possible outcome)

When coding nominal variables, each category is assigned a numerical value to allow for statistical analysis; these numerical values have no meaning aside from what we assign them. 

Ordinal

The next level of variable measurement is ORDINAL, in which observations are classified into a set of categories that:

  1. have direction
  2. are mutually exclusive and exhaustive
  3. have intervals between categories that cannot be assumed to be equal (i.e., there is no objective basis for differentiating numerical values between categories/responses; ex: the difference between “strongly disagree” and “disagree” may vary from one person to another)

Ordinal variables are often ranking variables measured using a LIKERT SCALE.

When coding ordinal variables measured using Likert scales, categories should be assigned numerical codes arranged in either:

  • ASCENDING ORDER, i.e., from lower numbers (less of [variable]) to higher numbers (more of [variable])
  • DESCENDING ORDER, i.e., from higher numbers (more of [variable]) to lower numbers (less of [variable])
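A minimal sketch of ascending-order coding for a Likert-scale item; reversing the codes would produce descending order.  The labels and codes are illustrative.

```python
# Minimal sketch: coding a Likert-scale item in ascending order
# (higher code = more agreement); reverse the codes for descending order.
likert_ascending = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

responses = ["Agree", "Neutral", "Strongly agree", "Disagree"]
coded = [likert_ascending[r] for r in responses]
print(coded)    # [4, 3, 5, 2]
```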

Interval

The next level of variable measurement is INTERVAL, in which observations:

  1. have direction
  2. are mutually exclusive and exhaustive
  3. correspond to real numbers
  4. have equal intervals of measurement
  5. do not have an absolute zero

We rarely see true interval variables in public and non-profit administration.  An example of an interval variable is temperature measured by Fahrenheit or Celsius.  Fahrenheit and Celsius both lack an absolute zero (i.e., 0 degrees Fahrenheit or 0 degrees Celsius do not represent the absence of temperature).  By extension, you can have negative values (-20 Fahrenheit or -10 Celsius), and you cannot assume that 80 degrees is twice as hot as 40 degrees: 80 degrees is certainly warmer than 40 degrees, and the distance between each degree is the same, but without an absolute zero, we cannot say it is twice as hot.  

Ratio

The final level of variable measurement is RATIO, in which observations:

  1. have direction
  2. are mutually exclusive and exhaustive
  3. correspond to real numbers
  4. have equal intervals of measurement
  5. have an absolute zero

Because ratio variables have an absolute zero, we can make assumptions that cannot be made with interval variables.  For example, an income of zero dollars indicates the absence of income, and someone who is 80 years old has twice as much age (in years) as someone who is 40 years old.

Replication and Generalizability

One of the hallmarks of social science research is REPLICATION (i.e., another researcher or administrator should be able to repeat the study or experiment).  Replication is important for two reasons:

  1. it confirms the reliability of the research findings
  2. it helps us to determine the GENERALIZABILITY of the research findings (i.e., how well the results can be applied to other contexts, populations, or settings / the conditions under which the results hold; generalizability is also referred to as EXTERNAL VALIDITY)

Replication cannot occur without precise explanations of our research.

Sources of Secondary Data: Open Data 

OPEN DATA refers to data made public for research, information, or transparency purposes.  Often, open data is published by governments, research institutions, or other organizations (such as interest groups or nonprofit organizations) that aim to promote transparency, accountability, and innovation.  The federal government maintains a comprehensive list of open data sets online.

Qualitative (Non-Statistical) vs. Quantitative (Statistical) Research

Non-statistical (qualitative) and statistical (quantitative) research are two fundamental approaches to conducting research, each with its own methods, purposes, and strengths.  

QUALITATIVE (NON-STATISTICAL) RESEARCH aims to explore complex phenomena, understand meanings, and gain insights into people’s experiences, behaviors, and interactions.  It focuses on providing a deep, contextual understanding of a specific issue or topic.  Data is often obtained via interviews, focus groups, participant observations, and content analysis.  Data analysis involves identifying patterns, themes, and narratives and is often interpretative and subjective, relying on the researcher’s ability to understand and articulate the meanings within the data.

QUANTITATIVE (STATISTICAL) RESEARCH aims to identify relationships or causal effects between concepts and/or phenomena.  It seeks to produce results that can be generalized to larger populations.  Data often comes from original sources, such as surveys or experiments, or from secondary data that has already been collected (such as information collected by the U.S. Census Bureau).  Analysis involves using statistical methods to analyze numerical data.  Techniques can range from basic descriptive statistics (ex: mean, median, mode) to complex inferential statistics (ex: linear regression analysis, ANOVA).  Data analysis is typically more objective and replicable, with clear rules and procedures for conducting statistical tests.

While qualitative and quantitative research have distinct differences, they are often used together in mixed-methods research to provide a comprehensive understanding of a research problem.  Qualitative research can provide context and depth to quantitative findings, while quantitative research can offer generalizability and precision to qualitative insights.