The Normal Distribution: The Basics

The NORMAL DISTRIBUTION (sometimes referred to as the Gaussian distribution) is a continuous probability distribution that describes many real-world variables: height, weight, IQ scores, test scores, measurement errors, reaction times, etc.  Knowing that a variable is normally distributed allows you to:

  • predict the likelihood (i.e., probability) of observing certain values
  • apply various statistical techniques that assume normality
  • establish confidence intervals and conduct hypothesis tests

Characteristics of the Normal Distribution

There are several key characteristics of the normal distribution:

  • mean, median, and mode are equal and located at the center of the distribution
  • the distribution is symmetric about the mean (i.e., the left half of the distribution is a mirror image of the right half); “Scores above and below the mean are equally likely to occur so that half of the probability under the curve (0.5) lies above the mean and half (0.5) below” (Meier, Brudney, and Bohte, 2011, p. 132)
  • the distribution resembles a bell-shaped curve (i.e., highest at the mean and tapers off towards the tails)
  • the standard deviation determines the SPREAD of the distribution (i.e., its height and width): a smaller standard deviation results in a steeper curve, while a larger standard deviation results in a flatter curve
  • the 68-95-99.7 RULE (also known as the EMPIRICAL RULE) can be used to summarize the distribution and calculate probabilities of event occurrence:
    – approximately 68% of the data falls within ±1 standard deviation of the mean
    – approximately 95% of the data falls within ±2 standard deviations of the mean
    – approximately 99.7% of the data falls within ±3 standard deviations of the mean
  • there is always a chance that values will fall outside ±3 standard deviations of the mean, but the probability of occurrence is less than 0.3%
  • the tails of the distribution never touch the horizontal axis: an extreme outlier may be unlikely, but it is always possible; thus, the upper and lower tails approach, but never reach, 0%
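The characteristics above can be checked empirically.  The Python sketch below (standard library only, using made-up simulated data) draws a large sample from a normal distribution and measures how much of it falls within one, two, and three standard deviations of the mean:

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible
# Draw a large sample from a normal distribution with mean 100 and sd 15
sample = [random.gauss(100, 15) for _ in range(100_000)]
mean = statistics.fmean(sample)
sd = statistics.stdev(sample)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in sample) / len(sample)

print(f"within 1 sd: {share_within(1):.1%}")  # close to 68%
print(f"within 2 sd: {share_within(2):.1%}")  # close to 95%
print(f"within 3 sd: {share_within(3):.1%}")  # close to 99.7%
```

With 100,000 draws the simulated shares land very close to the theoretical 68%, 95%, and 99.7%.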

Why the Normal Distribution is Common in Nature: The Central Limit Theorem

The CENTRAL LIMIT THEOREM states that the distribution of sample means for INDEPENDENT, IDENTICALLY DISTRIBUTED (IID) random variables will approximate a normal distribution, even when the variables themselves are not normally distributed, provided the sample size is large enough.  Thus, as long as we have a sufficiently large random sample, we can make inferences about population parameters (what we are interested in) from sample statistics (what we are often working with).
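A quick way to see the Central Limit Theorem at work is to simulate it.  The sketch below (standard library only; the exponential population is an arbitrary choice for illustration) repeatedly samples from a distinctly non-normal distribution, yet the sample means center on the population mean with the spread the theorem predicts:

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible
# An Exponential(1) population is skewed, not normal; its mean and sd are both 1
population_mean = 1.0

# Take 5,000 samples of size 50 and record each sample's mean
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(50))
    for _ in range(5_000)
]

# The sample means cluster around the population mean...
print(round(statistics.fmean(sample_means), 2))  # close to 1.0
# ...with spread close to the CLT prediction: sd / sqrt(n) = 1 / sqrt(50), about 0.14
print(round(statistics.stdev(sample_means), 2))  # close to 0.14
```

A histogram of `sample_means` would look bell-shaped even though the underlying exponential population is heavily skewed.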

What Does “IID” Mean?

Variables are considered independent if the value of one provides no information about the value of another (note that independence is not the same as mutual exclusivity: two mutually exclusive events with nonzero probability are necessarily dependent).  Variables are considered identically distributed if they share the same probability distribution (e.g., normal, Poisson) with the same parameters.

Do Outliers Matter?

In a normal distribution based on a large number of observations, it is unlikely that outliers will skew results.  If you are working with fewer observations, outliers are more likely to skew results; in these situations, you should identify, investigate, and decide how to handle outliers.
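The sketch below illustrates why outliers matter more in small samples: a single extreme value (the numbers are invented for illustration) pulls the mean substantially while barely moving the median:

```python
import statistics

# A small, hypothetical sample with and without one extreme outlier
small_sample = [10, 11, 12, 13, 14]
with_outlier = small_sample + [100]

# Without the outlier, the mean and median agree (both are 12)
print(statistics.mean(small_sample), statistics.median(small_sample))
# With the outlier, the mean jumps to about 26.7 while the median only moves to 12.5
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

This is why the median is often preferred as a measure of central tendency for small or skewed datasets.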

Example of a Normal Distribution: IQ Tests

IQ scores are a classic example of a normal probability distribution: the test is scaled so that the mean, median, and mode are equal and fall in the middle of the distribution (100).  The standard deviation on the IQ test is 15; applying the 68-95-99.7 rule, we can say with reasonable certainty:

  • 68% of the population will score between 85 and 115, or ±1 standard deviation from the mean
  • 95% of the population will score between 70 and 130, or ±2 standard deviations from the mean
  • 99.7% of the population will score between 55 and 145, or ±3 standard deviations from the mean

Rarely will you encounter a distribution as neatly normal as IQ scores, but we can calculate Z-SCORES to standardize values from any distribution, expressing each value as the number of standard deviations it falls above or below the mean.
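As a sketch of the idea, Python's standard-library NormalDist can convert an IQ score into a z-score and a percentile, assuming the mean of 100 and standard deviation of 15 described above:

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # the IQ distribution described in the text

score = 130
z = (score - iq.mean) / iq.stdev   # z-score: distance from the mean in sd units
percentile = iq.cdf(score)         # share of the population scoring at or below 130

print(z)                           # 2.0: two standard deviations above the mean
print(round(percentile, 3))        # about 0.977, i.e., roughly the 98th percentile
```

The same two lines work for any roughly normal variable once you plug in its mean and standard deviation.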

Frequency Distributions

A FREQUENCY DISTRIBUTION is a summary of how often each value or range of values occurs in a dataset.  It organizes data into a table or graph that displays the FREQUENCY (count) of each unique CLASS (a category, single value, or interval of values) within the dataset.

Frequency distributions are an important tool for understanding patterns in data.  They provide a clear visual summary, helping to identify features such as central tendency, dispersion, and skewness, and they condense large datasets into an easily interpretable format that facilitates initial data exploration and analysis.  As summary tools, frequency distributions can also aid decision-making and serve as a mechanism for communicating findings effectively to various stakeholders.

Frequency Tables

A FREQUENCY TABLE is a tabular representation of data that shows the number of occurrences (frequency) of each distinct case (value or category) in a dataset.  It organizes raw data into a summary format, making it easier to see how often each value appears.

While frequency tables are helpful, they do not provide as much information as relative frequency tables.  A RELATIVE FREQUENCY TABLE extends the frequency table by including the relative frequency (i.e., the PERCENTAGE DISTRIBUTION), which is the proportion or percentage of the total number of observations that fall into each case.  A relative frequency table provides a sense of the distribution of data in terms of its overall context.

A CUMULATIVE RELATIVE FREQUENCY TABLE shows the cumulative relative frequency (i.e., the CUMULATIVE FREQUENCY DISTRIBUTION), which is the running sum of the percentage distribution for all values up to and including the current value.  The cumulative percentage for the final value should equal 100% (or something close, depending on rounding).  A cumulative relative frequency table helps us understand how the data accumulate across the range of values.
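The three kinds of tables can be sketched together in a few lines of Python; the survey responses below are hypothetical:

```python
from collections import Counter

# Invented survey responses for illustration
responses = ["agree", "agree", "neutral", "disagree", "agree", "neutral"]

freq = Counter(responses)  # the frequency table: each case and its count
n = len(responses)

cumulative = 0.0
for value, count in freq.most_common():
    relative = count / n   # relative frequency: the case's share of all observations
    cumulative += relative # cumulative relative frequency: running sum of the shares
    print(f"{value:<10} {count:>3} {relative:>7.1%} {cumulative:>7.1%}")
```

The last cumulative figure reaches 100%, matching the property described above.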

Another extension of the frequency table is a CONTINGENCY TABLE (also known as a cross-tabulation or crosstab).  A contingency table is used to display the frequency distribution of two or more variables; it shows the relationship between two or more CATEGORICAL VARIABLES (i.e., nominal- or ordinal-level variables) by presenting the frequency of each combination of variable categories.
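A minimal contingency-table sketch, using invented data on two categorical variables (employment sector and a yes/no response):

```python
from collections import Counter

# Each record pairs a sector with a survey answer (hypothetical data)
records = [
    ("public", "yes"), ("public", "no"), ("private", "yes"),
    ("private", "yes"), ("public", "yes"), ("private", "no"),
]

# Count the frequency of each combination of categories
crosstab = Counter(records)

for sector in ("public", "private"):
    row = {answer: crosstab[(sector, answer)] for answer in ("yes", "no")}
    print(sector, row)
```

Each printed row corresponds to one row of the crosstab, with one cell per answer category.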

Charts and Graphs

There are numerous charts and graphs that can be used to display frequency distributions:  

  • BAR GRAPHS (or bar charts) and HISTOGRAMS are graphical representations of data that use rectangular bars to represent frequencies: a bar graph shows the frequency of each category of a categorical variable, while a histogram shows the frequency of intervals of values (i.e., BINS) of a continuous variable; both are useful for showing the distribution of a variable
  • a PIE CHART is a circular graph divided into slices to illustrate numerical proportions, with the size of each slice proportional to the quantity it represents; pie charts are useful for showing the relative frequencies of different categories within a whole
  • a LINE GRAPH (or line chart) displays information as a series of data points connected by straight line segments; line graphs are often used to show trends over time, and they can also be used to summarize frequency distributions of interval- and ratio-level variables
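As an illustrative sketch, a frequency distribution can even be charted as plain text, which makes the idea behind a bar graph concrete (the letter grades are invented):

```python
from collections import Counter

# Hypothetical letter grades for a small class
grades = ["B", "A", "C", "B", "B", "A", "D", "C", "B"]
freq = Counter(grades)

# One row per category; bar length encodes the frequency
for grade in sorted(freq):
    print(f"{grade} | {'#' * freq[grade]} ({freq[grade]})")
```

A plotting library such as matplotlib would render the same counts as a conventional bar chart.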