Difference of Means Tests: The Basics

DIFFERENCE OF MEANS TEST (also called a TWO-SAMPLE T-TEST) is used to compare the average values of a variable across two different groups (samples) to see whether they statistically differ from each other.  The process of conducting a difference of means test mirrors the hypothesis testing process.  Specifically, it involves:

  1. Formulating null and alternative hypotheses — for example:
    • Null hypothesis (H0): there is no difference in the average values of a variable across two groups
    • Research hypothesis (Ha): there is a difference in the average values of a variable across two groups
  2. Choosing a significance level (usually, α = 0.10, 0.05, or 0.01)
  3. Checking assumptions regarding the relationship between the groups and the nature of the data
  4. Choosing the appropriate difference of means test/two-sample t-test
  5. Calculating the test statistic, determining the degrees of freedom (df), and, based on α and df, identifying the:
    • Critical value(s) that will define the rejection region for the null hypothesis
    • P-value, or probability of obtaining the observed results, or more extreme results, if the null hypothesis is true
  6. Analyzing data to make a decision about the validity of the hypotheses, by either:
    • Comparing the test statistic to the critical value(s); if the test statistic falls into the rejection region, we would reject the null hypothesis
    • Comparing the p-value to α; if the p-value is less than or equal to α, we would reject the null hypothesis
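
As a concrete illustration of the six steps above, here is a minimal Python sketch; the two groups, the choice of α = 0.05, the use of a two-tailed equal-variance test, and the scipy library are all assumptions made for illustration only.

```python
# A minimal sketch of the six-step procedure, assuming two small
# hypothetical groups and alpha = 0.05 (two-tailed, equal variances).
from scipy import stats

group_a = [12, 15, 14, 10, 13, 16, 11, 14]   # hypothetical sample 1
group_b = [18, 17, 16, 19, 15, 20, 17, 18]   # hypothetical sample 2

alpha = 0.05                                  # step 2: significance level

# steps 4-5: independent samples t-test (equal variances assumed)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
df = len(group_a) + len(group_b) - 2          # degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df)       # two-tailed critical value

# step 6: decision by either approach
print(f"t = {t_stat:.3f}, df = {df}, critical value = {t_crit:.3f}, p = {p_value:.4f}")
print("Reject H0" if abs(t_stat) >= t_crit else "Fail to reject H0")   # critical-value approach
print("Reject H0" if p_value <= alpha else "Fail to reject H0")        # p-value approach
```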

Assumptions Underlying Difference of Means Tests

Samples: Related, or Unrelated?

INDEPENDENT SAMPLES are “those in which cases across the two samples are not ‘paired’ or matched in any way” (Meier, Brudney, and Bohte, 2011, p. 223).  In other words, independent samples involve between-group comparisons of two unrelated groups of different individuals or cases.  Observations across samples are independent: the observations in one group have no effect on the observations in the other group.  Examples of two independent samples are a treatment group and a control group.

DEPENDENT SAMPLES “exist when each item in one sample is paired with an item in the second sample” (p. 223).  In other words, dependent samples involve within-group comparisons of two closely related groups, which contain the same, or extremely similar, individuals or cases.  Observations across samples are dependent within pairs: each pair of observations is related.  Examples of dependent samples include:

  • two samples containing the same individuals or cases, with data collected before and after a treatment, intervention, policy change, etc. 
  •  two samples containing different individuals, with individuals in one sample matched/paired to individuals in the other sample who have similar characteristics (age, gender, race, income, education, etc.)

Nature of the Data

Is the Data Normally Distributed?

Many statistical methods, including difference of means tests, are based on the ASSUMPTION OF NORMALITY: the distribution of the data — in this context, within each sample (or the differences in paired samples) — should be approximately normal (i.e., bell-curve shaped, symmetrical around the mean).  This assumption is particularly important for small sample sizes.  Researchers or administrators can check for normality by using graphical methods, such as histograms and quantile-quantile (Q-Q) plots.  There are also statistical tests that can check for normality, such as the Shapiro-Wilk test.
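
As a rough sketch of these checks in Python (the sample below is hypothetical, and the scipy and matplotlib libraries are assumed to be available):

```python
# A minimal sketch of two normality checks on a hypothetical sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=40)   # hypothetical data

# Shapiro-Wilk test: a small p-value suggests the data are not normal
w_stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")

# Q-Q plot: points should fall close to the reference line if the data are normal
import matplotlib.pyplot as plt
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```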

Are Variances Equal, or Unequal?

The ASSUMPTION OF EQUAL VARIANCES (i.e., homogeneity of variances) applies when the variances of two groups are assumed to be approximately equal.  At times, however, the variances of two groups cannot be assumed to be equal; in such situations, we proceed with the ASSUMPTION OF UNEQUAL VARIANCES (i.e., heterogeneity of variances).  Determining which assumption applies is important, regardless of whether the data correspond to samples or populations.

Whether variances are equal or unequal impacts the way in which the t-test calculation is performed.  Researchers or administrators can check for homogeneity (i.e., equal variances) using statistical tests such as the F-TEST.
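
A minimal sketch of such a check on two hypothetical groups follows; the F-test is computed directly from the ratio of the sample variances, and Levene’s test (not discussed above) is included only as a commonly used alternative.

```python
# A minimal sketch of checking for equal variances in two hypothetical groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(50, 10, size=30)   # hypothetical group 1
group_b = rng.normal(50, 18, size=30)   # hypothetical group 2 (larger spread)

# F-test: ratio of sample variances compared against the F distribution
f_stat = np.var(group_a, ddof=1) / np.var(group_b, ddof=1)
dfn, dfd = len(group_a) - 1, len(group_b) - 1
p_two_sided = 2 * min(stats.f.cdf(f_stat, dfn, dfd), stats.f.sf(f_stat, dfn, dfd))
print(f"F = {f_stat:.3f}, p = {p_two_sided:.3f}")

# Levene's test: a common alternative that is less sensitive to non-normality
lev_stat, lev_p = stats.levene(group_a, group_b)
print(f"Levene W = {lev_stat:.3f}, p = {lev_p:.3f}")
```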

Types of Differences of Means Tests

There are three difference of means tests that you may use to examine the difference between two groups, depending on the samples and the nature of the data: the INDEPENDENT SAMPLES T-TEST, the DEPENDENT/PAIRED SAMPLES T-TEST, and WELCH’S UNEQUAL VARIANCES T-TEST.  These tests are summarized in the table below.

Types of Difference of Means Tests (t-Tests) 

Type of t-Test | Samples | Nature of Data
Independent Samples t-Test | Independent samples and observations | Data in each group are normally distributed; equal variances in both groups
Dependent/Paired Samples t-Test | Dependent/paired samples; observations are dependent within pairs and independent between pairs | Differences in paired samples are normally distributed
Welch’s Unequal Variances t-Test | Independent samples and observations | Data in each group are normally distributed; unequal variances across groups
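
As a minimal sketch, assuming hypothetical data and the scipy library, the three tests in the table are typically invoked as follows:

```python
# A minimal sketch of the three t-tests in the table, on hypothetical data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, size=30)          # hypothetical independent group 1
group_b = rng.normal(55, 10, size=30)          # hypothetical independent group 2
before  = rng.normal(50, 10, size=30)          # hypothetical paired "before" scores
after   = before + rng.normal(2, 3, size=30)   # hypothetical paired "after" scores

# Independent samples t-test (equal variances assumed)
print(stats.ttest_ind(group_a, group_b, equal_var=True))

# Dependent/paired samples t-test (same cases measured twice)
print(stats.ttest_rel(before, after))

# Welch's unequal variances t-test (equal_var=False)
print(stats.ttest_ind(group_a, group_b, equal_var=False))
```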

Statistical vs. Practical Significance

A result can be statistically significant but have a small effect size that is not meaningful in real-world applications.  Thus, it is important to consider both STATISTICAL SIGNIFICANCE (i.e., whether the observed relationship is unlikely to have occurred by chance alone) and PRACTICAL SIGNIFICANCE (i.e., whether the effect size of the observed relationship is meaningful in real life). 

For example: using a large probability sample and α=0.05, we find a statistically significant relationship between a new policy for processing building permits and the amount of time that it takes to process requests for building permits.  We can reject the null hypothesis that there is no relationship between the new policy and processing time.  Now, let’s assume the new policy is associated with a decrease in processing times of 1 day.  If the average processing time before the new policy was 30 days, and the new average processing time is 29 days, the actual effect size (1 day) is minimal and may have little practical significance.
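
To make this concrete, the following sketch (using hypothetical standard deviations and sample sizes) shows how a 1-day difference can register as statistically significant with large samples even though the practical effect is small:

```python
# A minimal sketch: a 1-day difference in mean processing time can be
# statistically significant with large samples, yet practically small.
# The standard deviation and sample sizes are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before_policy = rng.normal(30, 8, size=5000)   # hypothetical: mean of 30 days
after_policy  = rng.normal(29, 8, size=5000)   # hypothetical: mean of 29 days

t_stat, p_value = stats.ttest_ind(before_policy, after_policy)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # statistically significant with n this large
print(f"mean difference = {before_policy.mean() - after_policy.mean():.2f} days")  # about 1 day
```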

Statistical Significance: How Sure Should a Person Be?

When researchers say that their results are STATISTICALLY SIGNIFICANT, they mean that the observed effect or relationship in the data is unlikely to have occurred by chance alone.  Thus, statistical significance tells us whether the sample results we observe are strong enough to reject the null hypothesis, according to a predefined threshold (i.e., the significance level α). 

Statistically significant results obtained from a probability sample can be generalized to the population from which the sample was taken.  For example, if I find a statistically significant relationship between voting and sex (women are more likely to vote than men) in a random sample of Americans using a threshold of α=0.05, I can conclude with 95% confidence that this relationship exists in the United States: throughout the entire country, women are more likely to vote than men.

Choosing a Significance Level     

What significance level (α) should we use to determine whether results are statistically significant? 

In social science research, the significance level at which results are considered statistically significant is usually α=0.05, meaning we are 95% confident that the relationship between two variables is real (i.e., not the result of random chance).  However, in some social science research (such as some areas of political behavior research), α=0.10 is used to identify whether results are statistically significant, meaning we are 90% confident that the relationship between two variables is real.  If you think about it, choosing this less stringent threshold (a larger α) makes sense: human behavior is only so predictable.  Thus, the choice of significance level is sometimes driven by the concept being researched.

Sample size can also affect the choice of significance level.  As sample size increases, the standard error decreases, which leads to more precise estimates of the population parameters and makes it easier to detect smaller effects.  Therefore, larger sample sizes increase the likelihood of detecting statistically significant effects with a smaller α, relative to smaller sample sizes.  However, there is a trade-off when it comes to sample sizes: larger samples are more costly than smaller samples in terms of the time required to recruit the sample and collect data and the financial cost associated with survey administration and data collection.  Time and money are finite resources: researchers and administrators only have so much time, and so much money, that can be dedicated to a given project.  As such, we sometimes adjust our level of significance to accommodate our sample size, opting to proceed with a smaller sample size (n) and larger significance level (α). 

For instance, when researching crime rates using data from a sample of 50 cities and townships (n=50), you may decide to adopt a threshold of α=0.10 to increase the likelihood of finding statistically significant effects.  If, on the other hand, the sample consists of 2,500 cities and townships (n=2500), you may decide to adopt a threshold of α=0.01.  As long as you decide on a level of significance before conducting statistical analysis, and accurately report α alongside the results, either of these is perfectly acceptable.  You cannot adjust α after conducting statistical analysis to accommodate the results.

Determining Sample Size for a Significance Level

At times, you may decide that you want to report results at a specific significance level and then let α drive the decision regarding how large of a sample you will need to detect a relationship.  In these situations, “the ideal sample size for any problem is a function of (1) the amount of error that can be tolerated, (2) the confidence one wants to have in the error estimate, and (3) the standard deviation of the population” (Meier, Brudney, and Bohte, 2011, p. 203).  Specifically, the sample size should be equal to the squared value obtained when the critical value (i.e., t-score) associated with the desired α is multiplied by the estimated sample standard deviation and then divided by the maximum margin of error that can be tolerated; in formula form, n = (t × s / E)².
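
A minimal sketch of this calculation, using hypothetical values for the standard deviation and the tolerable margin of error (and the large-sample normal critical value as a stand-in for the t-score):

```python
# A minimal sketch of the sample-size calculation n = (t * s / E)**2,
# with hypothetical values for the standard deviation and margin of error.
from scipy import stats

alpha = 0.05
s = 12.0          # hypothetical estimated standard deviation
E = 2.0           # hypothetical maximum tolerable margin of error

# Using the large-sample (normal) critical value for a two-tailed test;
# in practice, the t-table value for the planned degrees of freedom is used.
t_crit = stats.norm.ppf(1 - alpha / 2)     # approximately 1.96
n = (t_crit * s / E) ** 2
print(f"required sample size = {n:.0f}")   # round up in practice
```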

Hypothesis Testing: Errors in Interpreting Results

When interpreting the results of a hypothesis test, there are two types of error that we can make:

  • TYPE I ERROR (i.e., a FALSE POSITIVE) occurs when one rejects the null hypothesis when it is true; in other words, the Type I error is saying that a relationship exists when, in fact, it does not exist
    • The probability of committing a Type I error is equal to the significance level (α); for example, if α=0.05 (95% confidence level), there is a 5% chance of rejecting the null hypothesis when it is true
  • TYPE II ERROR (i.e., a FALSE NEGATIVE) occurs when one fails to reject the null hypothesis when it is false; in other words, the Type II error is saying that no relationship exists when, in fact, a relationship does exist
    • The probability of committing a Type II error is associated with several factors, including sample size, relationship strength/effect size, significance level (α), variability in the data, test design, and measurement precision
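
The following simulation sketch (with hypothetical means, spreads, and sample sizes) illustrates both error rates described above: when the null hypothesis is true, the rejection rate is close to α (Type I errors); when a small real difference exists, the test sometimes fails to reject (Type II errors).

```python
# A minimal simulation sketch of Type I and Type II error rates,
# using hypothetical means, spreads, and sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 2000

# Type I: H0 is true (both groups share the same mean); count false positives
false_pos = sum(
    stats.ttest_ind(rng.normal(50, 10, n), rng.normal(50, 10, n)).pvalue <= alpha
    for _ in range(trials)
)
print(f"Type I error rate  = {false_pos / trials:.3f} (should be near {alpha})")

# Type II: H0 is false (a small real difference exists); count false negatives
false_neg = sum(
    stats.ttest_ind(rng.normal(50, 10, n), rng.normal(54, 10, n)).pvalue > alpha
    for _ in range(trials)
)
print(f"Type II error rate = {false_neg / trials:.3f}")
```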

Which Error is Worse: Type I, or Type II? 

Let’s consider how the judicial system is structured: would we rather convict an innocent man, or let a guilty man go free? 

To convict someone of a crime, the prosecution must convince the jury beyond a reasonable doubt, and the jury verdict must be unanimous.  Clearly, our judicial system is structured to make it harder to convict.  The judicial system would rather let a guilty man go free than convict an innocent man.  In other words, the judicial system seeks to avoid a Type I error, where it asserts a relationship exists when it does not (finding an innocent person “guilty”).  Instead, the judicial system would rather commit a Type II error, failing to find a relationship when it does exist (finding a guilty person “not guilty”). 

The same logic underlies statistical analyses.  If we commit a Type I error, we are saying a relationship exists when, in fact, it does not.  In inferential statistics, we always want to err on the side of caution.  Therefore, a Type II error, where we fail to identify a real relationship, is generally more acceptable.

Hypothesis Testing: One-Tailed vs. Two-Tailed Tests

One-Tailed Tests

“A ONE-TAILED TEST is applied whenever the hypothesis under consideration specifies a direction” (Meier, Brudney, and Bohte, 2011, p. 198).  In a one-tailed test, we are only interested in one tail of the distribution (i.e., values on one side of the mean): 

  • For a positive relationship, we are interested in the RIGHT TAIL (or UPPER TAIL), i.e., values to the right of/greater than the mean
  • For a negative relationship, we are interested in the LEFT TAIL (or LOWER TAIL), i.e., values to the left of/less than the mean

The rejection region is determined by the significance level (α) and the direction of the hypothesized relationship; it usually includes the most extreme 10% (α = 0.10), 5% (α = 0.05), or 1% (α = 0.01) of values in the distribution.  Whether these values lie at the bottom (i.e., on the left side) or at the top (i.e., on the right side) of the distribution, and therefore which one-tailed test should be used for hypothesis testing, is determined by the hypothesized relationship:

  • RIGHT-TAILED TEST is used to test for a positive relationship between variables: if the test statistic falls within the top α percent of the distribution, it is in the rejection region, and the null hypothesis is rejected
    • Assuming α=0.05 and a 95% confidence level, our rejection region would be the top 5% of the distribution (i.e., at or above the 95th percentile)
  • LEFT-TAILED TEST is used to test for a negative relationship between variables: if the test statistic falls within the bottom α percent of the distribution, it is in the rejection region, and the null hypothesis is rejected
    • Assuming α=0.05 and a 95% confidence level, our rejection region would be the bottom 5% of the distribution (i.e., at or below the 5th percentile)

Two-Tailed Tests

A TWO-TAILED TEST is used whenever the hypothesis under consideration does not specify a direction; it simultaneously tests for the possibility of both a positive and a negative relationship between variables.  Thus, in a two-tailed test, we are interested in both tails of the distribution (i.e., values that fall on both sides of the mean).

The rejection region is determined by the significance level (α), divided equally between the left/lower and right/upper tails; it usually includes:

  • for α=0.10, the most extreme 10% of values in the distribution: the bottom 5% of the distribution (i.e., at or below the 5th percentile), and the top 5% of the distribution (i.e., at or above the 95th percentile)
  • for α=0.05, the most extreme 5% of values in the distribution: the bottom 2.5% of the distribution (i.e., at or below the 2.5th percentile), and the top 2.5% of the distribution (i.e., at or above the 97.5th percentile)
  • for α=0.01, the most extreme 1% of values in the distribution: the bottom 0.5% of the distribution (i.e., at or below the 0.5th percentile), and the top 0.5% of the distribution (i.e., at or above the 99.5th percentile)

If the test statistic falls within the bottom α/2 portion of the distribution or the top α/2 portion of the distribution (e.g., for α=0.05, at or below the 2.5th percentile or at or above the 97.5th percentile), it is in the rejection region, and the null hypothesis is rejected.  Thus, the two-tailed test is more conservative than the one-tailed test because it accounts for the possibility of an effect in either direction.
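
As a minimal sketch of these cutoffs, here are the one-tailed and two-tailed rejection-region boundaries at α = 0.05, using the standard normal distribution for simplicity (with the t-distribution, the cutoffs would also depend on the degrees of freedom):

```python
# A minimal sketch of one-tailed vs. two-tailed rejection cutoffs at alpha = 0.05,
# using the standard normal distribution for simplicity.
from scipy import stats

alpha = 0.05
print(f"right-tailed cutoff: {stats.norm.ppf(1 - alpha):.3f}")       # about 1.645 (95th percentile)
print(f"left-tailed cutoff:  {stats.norm.ppf(alpha):.3f}")           # about -1.645 (5th percentile)
print(f"two-tailed cutoffs:  {stats.norm.ppf(1 - alpha / 2):.3f}")   # about 1.960 (2.5th / 97.5th percentiles)
```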

Correlation vs. Causation

Correlation

CORRELATION refers to any relationship or statistical association between two variables.  If two variables are correlated, the variables appear to move together: as one variable changes, the other variable tends to change in a specific direction.  Two variables can display a POSITIVE CORRELATION (as the values for one variable increase, the values for the other variable increase) or a NEGATIVE CORRELATION (as the values for one variable increase, the values for the other variable decrease).  If two variables are UNCORRELATED, there is no apparent relationship between them.  

Positive and negative correlations can also be characterized based on the strength of the relationship between the two variables as either STRONG (a high degree of association between two variables), MODERATE (a noticeable but not perfect association between two variables), or WEAK (a low degree of association between two variables).

Researchers can check to see if two variables are correlated by calculating their CORRELATION COEFFICIENT (also called PEARSON’S R), which measures the direction and strength of a linear relationship between two variables.  Pearson’s R is one of the most widely used statistics in both descriptive statistics and inferential statistics.  Pearson’s R values range from -1 to 1 (a short computational sketch follows the list below):

  • -1 indicates a perfect negative linear relationship between two variables — i.e., as one variable increases, the other decreases in an exactly linear fashion
  • 0 indicates no linear relationship between two variables
  • 1 indicates a perfect positive linear relationship between two variables — i.e., as one variable increases, the other increases in an exactly linear fashion
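
Here is the computational sketch referenced above, using hypothetical data and scipy’s pearsonr function:

```python
# A minimal sketch of computing Pearson's r on hypothetical data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0, 1, size=100)                      # hypothetical variable X
y = 0.7 * x + rng.normal(0, 1, size=100)            # hypothetical variable Y, related to X

r, p_value = stats.pearsonr(x, y)
print(f"Pearson's r = {r:.2f}, p = {p_value:.4f}")  # r near +0.6: a moderate-to-strong positive correlation
```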

There are four possible reasons for correlations: (1) variable X causes variable Y (CAUSATION); (2) variable Y causes variable X (REVERSE CAUSATION); (3) the relationship between variable X and variable Y is simply a coincidence (RANDOM CHANCE); and (4) some other variable Z causes both variable X and variable Y (SPURIOUS RELATIONSHIP). Thus, correlation DOES NOT equal causation.

Example: Ice Cream Sales and Sunburns

There is a strong positive correlation between ice cream sales and sunburns: as ice cream sales increase, so do sunburns.  Does this mean the ice cream is causing sunburns?  Of course not!  As this illustrates, correlation DOES NOT imply that one variable causes the other variable to change.  What other factor helps explain this observed correlation between ice cream sales and sunburns?  Weather!

  • As it gets warmer, people eat more ice cream
  • During the summer months, when it’s warmer, people are more likely to go outside — that, combined with stronger sun exposure, results in increased opportunities for sunburns

This is an example of a spurious relationship — an apparent causal relationship between two variables that is actually due to one or more other variables.  

Causation

In the context of hypothesis testing, CAUSALITY (i.e., whether one variable affects/leads to changes in another variable) is usually what we are interested in because it helps us understand mechanisms and underlying processes, thereby allowing us to make accurate predictions.

Demonstrating Causation

To demonstrate causation, a few factors must be present:

  1. The variables must be correlated
  2. The cause must precede the effect
  3. Other possible causes/explanations of the variation observed in the dependent variable must be ruled out

Hypothesis Testing: The Basics

HYPOTHESIS TESTING involves using statistical techniques to determine whether there is enough evidence to support the hypothesis.  Hypothesis testing helps researchers and analysts make decisions about the validity of their assumptions or claims; thus, it plays a critical role in allowing researchers and administrators to:

  • report sample findings with any degree of certainty
  • make inferences/draw conclusions about a population based on sample data

We never “prove” anything in social sciences; the best that we can say is that the results support our hypothesis within a pre-determined level of statistical certainty (usually, the 0.05 significance level, or 95% confidence level).

How Hypothesis Testing Works

Hypothesis testing involves:

  1. Developing a research question
  2. Operationalizing your concepts and identifying the dependent and independent variables to include in the analysis
  3. Formulating research, null, and alternative hypotheses
  4. Selecting an appropriate significance level to serve as the threshold for rejecting the null hypothesis (usually, α = 0.10, 0.05, or 0.01)
  5. Analyzing data to make a decision about the validity of the hypotheses

We always start research by assuming the null hypothesis is correct — in other words, that there is no relationship between our dependent and independent variables.  From this starting point, our job is to create models based on theory and existing knowledge, run these models, interpret the results, and report the findings.

When we engage in hypothesis testing, we either:

  • REJECT THE NULL (meaning there is a relationship between the two variables)
  • FAIL TO REJECT THE NULL (meaning we do not find sufficient evidence of a relationship between the two variables)

To determine whether we should reject the null or fail to reject the null, we first need to calculate the appropriate TEST STATISTIC; which statistic is appropriate depends on the type of data and the hypothesis being tested.  Examples of test statistics include the t-statistic (used in t-tests), the z-statistic (used in z-tests), and the F-statistic used in ANOVA (analysis of variance), the difference of means test for two or more groups.  Then, based on the chosen significance level (e.g., α = 0.05), we need to identify either the P-VALUE (i.e., the probability of obtaining the observed results, or more extreme results, if the null hypothesis is true) or the CRITICAL VALUE (i.e., the cutoff value that defines the REJECTION REGION for the null hypothesis).  From there, we would either (see the short sketch after this list):

  • compare the test statistic to the critical value (if using the critical value approach); if the absolute value of the test statistic is greater than or equal to the critical value, the test statistic falls into the rejection region, and we would reject the null hypothesis
  • compare the p-value to the significance level (if using the p-value approach); if the p-value is less than or equal to the significance level, we would reject the null hypothesis
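
The two decision rules above always agree.  As a minimal sketch (using a hypothetical test statistic and degrees of freedom), the two-tailed p-value is the tail area beyond the observed statistic, and both approaches lead to the same conclusion:

```python
# A minimal sketch showing that the critical-value and p-value approaches agree,
# using a hypothetical test statistic and degrees of freedom.
from scipy import stats

alpha, df = 0.05, 28
t_stat = 2.31                                  # hypothetical observed test statistic

t_crit = stats.t.ppf(1 - alpha / 2, df)        # two-tailed critical value (about 2.048)
p_value = 2 * stats.t.sf(abs(t_stat), df)      # two-tailed p-value

print(abs(t_stat) >= t_crit)    # True -> reject H0 (critical-value approach)
print(p_value <= alpha)         # True -> reject H0 (p-value approach)
```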

Rejecting the null hypothesis within our predetermined level of confidence indicates that we found a statistically significant relationship between two or more variables.

Hypotheses

Research Hypothesis

A RESEARCH HYPOTHESIS is a clear, specific, testable statement or prediction about the relationship between two or more variables (i.e., independent variable X explains variation in dependent variable Y).  The research hypothesis guides the direction of the study and outlines what the researcher expects to find.  A good hypothesis is:

  • based on existing knowledge (i.e., theory drives your predictions)
  • FALSIFIABLE (i.e., can be proven false if it is incorrect)

Some hypotheses are directional, positing that as values of one variable increase or decrease, values of the other variable increase or decrease:

  • There is a POSITIVE RELATIONSHIP between population density (independent variable) and crime rates (dependent variable) — as population density increases, crime rates increase
  • There is a NEGATIVE RELATIONSHIP between education level (independent variable) and teen pregnancy (dependent variable) — as education increases, teen pregnancy decreases

These types of hypotheses are appropriate when working with variables that have direction (i.e., ordinal-, interval-, or ratio-level variables).

Some hypotheses merely posit that there is a relationship between two variables:

  • Women are more likely to vote than men — respondent sex is the independent variable and voting is the dependent variable; there is a relationship between sex and voting

This type of hypothesis is common when working with nominal variables.  Because nominal variables have no direction, the relationship between a nominal variable and another variable cannot be stated in directional terms. Instead, the hypothesis should specify the type of relationship between the variables in terms of how differences in the dependent variable are linked with differences in the independent variable.

If, based on theory and existing knowledge, there are control variables that further explain variation in the dependent variable, they should be included in the research hypothesis: 

  • if all variables involved have direction (i.e., ordinal-, interval-, or ratio-level variables), you would simply add the phrase “while controlling for” and then specify the control variables
  • if the control variable is nominal, you should specify the expected relationship

The research hypothesis is presented as H1.  If you have more than one research hypothesis, they would be presented as H1, H2, H3, etc.

Null Hypothesis

Null means having no value.  By extension, the NULL HYPOTHESIS is a statement that there is no relationship between the variables being studied. While the research hypothesis serves as a foundation for conducting empirical research by guiding the direction of the study and outlining what the researcher or administrator expects to find, the null hypothesis forms the basis of hypothesis testing.  The null hypothesis is presented as H0.

Alternative Hypothesis

An ALTERNATIVE HYPOTHESIS is proposed as an alternative to the null hypothesis; it indicates that there is a relationship between the variables being studied. Alternative hypotheses can be:

  • directional (one-tailed), specifying a direction of the effect or difference
  • non-directional (two-tailed), merely stating that there is a difference, without specifying the direction

Research hypotheses and alternative hypotheses are conceptually similar but serve different functions: a research hypothesis’s broader context is more aligned with scientific inquiry and theory testing, whereas an alternative hypothesis is specifically formulated for statistical testing against the null hypothesis. The alternative hypothesis is presented as Ha.

Confidence Intervals

Confidence intervals provide a useful way to convey the uncertainty and reliability of an estimate, allowing researchers to make more informed conclusions about the population parameter.

When constructing a confidence interval for a sample mean, the critical value (t) for a given confidence level and number of degrees of freedom is multiplied by the standard error of the mean; this gives us the margin of error, which can then be added to and subtracted from the sample mean to establish the upper and lower bounds of our confidence interval. 
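
A minimal sketch of that calculation on a hypothetical sample follows; scipy’s t.interval performs the same steps in a single call.

```python
# A minimal sketch of building a 95% confidence interval for a hypothetical sample mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
sample = rng.normal(100, 15, size=25)     # hypothetical data

mean = sample.mean()
se = stats.sem(sample)                    # standard error of the mean
df = len(sample) - 1
t_crit = stats.t.ppf(0.975, df)           # critical value for 95% confidence
margin = t_crit * se                      # margin of error

print(f"95% CI: ({mean - margin:.2f}, {mean + margin:.2f})")
print(stats.t.interval(0.95, df, loc=mean, scale=se))   # same interval via scipy
```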

The t-Distribution

The T-DISTRIBUTION is a probability distribution that has a mean of 0 and is symmetrical and bell-shaped, similar to the normal distribution, but with heavier tails.  The t-distribution provides more accurate and conservative estimates of population parameters when dealing with small samples (n<30) or when population standard deviations are unknown (which is usually the case in social science research).

The shape of the t-distribution — how tall/short the center of the distribution is and how thin/thick the tails of the distribution are (i.e., the dispersion of the distribution) — is determined by the DEGREES OF FREEDOM (df).  The degrees of freedom for a single sample is equal to the sample size, minus one; as a formula: df=n-1.  As degrees of freedom increase, the t-distribution approaches the normal distribution.

To interpret a t-distribution, you will need to reference a T-DISTRIBUTION TABLE (i.e., a T-TABLE).  Using a t-table is similar to using a z-table:

  • Rows correspond to different degrees of freedom 
  • Columns correspond to different confidence levels (90%, 95%, 99%) or SIGNIFICANCE LEVELS (α), which are equal to 1 minus the confidence level (α = 0.10, 0.05, 0.01)
  • Table cells report the CRITICAL VALUES of the t-distribution, given the degrees of freedom and the confidence level/significance level; critical values are helpful in hypothesis testing and determining confidence intervals
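
As a minimal sketch, a few of the table cells described above can be reproduced with scipy’s t.ppf function (two-tailed critical values for the listed confidence levels at several degrees of freedom):

```python
# A minimal sketch of a few rows of a t-table: two-tailed critical values
# for common confidence levels at several degrees of freedom.
from scipy import stats

conf_levels = [0.90, 0.95, 0.99]
for df in (5, 10, 30, 100):
    row = [stats.t.ppf(1 - (1 - c) / 2, df) for c in conf_levels]
    print(f"df={df:>3}: " + "  ".join(f"{c:.0%} -> {v:.3f}" for c, v in zip(conf_levels, row)))
```

Notice that as the degrees of freedom grow, the critical values approach the familiar normal-distribution cutoffs (1.645, 1.960, and 2.576), consistent with the t-distribution approaching the normal distribution.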