Measures of Dispersion

Measures of DISPERSION tell us how much the observations cluster around the expected value (i.e., the “typical” or average value) for a variable.  In other words, measures of dispersion tell us about the SPREAD of a distribution of values and the overall VARIATION in a measure.  This information can be used to understand the distribution of the data, identify the range of values the data take on, and determine how much confidence we can have in our expected values.

Min, Max, Range, IQR, Variance, & Standard Deviation

There are six measures of dispersion:

  • MIN and MAX – the minimum (lowest) and maximum (highest) values of a variable
  • RANGE – the difference between the maximum and minimum values of a variable; as a formula: Range = Max – Min
  • INTERQUARTILE RANGE (IQR) — the difference between the first quartile (Q1 / 25%) and the third quartile (Q3 / 75%), which corresponds to the range of the middle 50% of values of a variable; as a formula:  IQR = Q3 – Q1
  • VARIANCE — the average squared deviation of each value from the mean (i.e., the sum of the squared differences between each value and the mean, divided by the number of cases in the variable)
  • STANDARD DEVIATION — the average distance of each value from the mean, expressed in the same units as the data (i.e., the square root of the variance)

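To make these measures concrete, below is a minimal sketch in Python using the built-in statistics module; the variable name scores and the data are hypothetical.

  import statistics

  # Hypothetical data: exam scores (a ratio-level variable)
  scores = [62, 70, 75, 75, 80, 84, 88, 91, 95, 100]

  minimum = min(scores)
  maximum = max(scores)
  value_range = maximum - minimum                  # Range = Max - Min

  q1, q2, q3 = statistics.quantiles(scores, n=4)   # quartiles: Q1, median, Q3
  iqr = q3 - q1                                    # IQR = Q3 - Q1

  variance = statistics.pvariance(scores)          # average squared deviation from the mean
  std_dev = statistics.pstdev(scores)              # square root of the variance

  print(minimum, maximum, value_range, iqr, variance, std_dev)

Note that pvariance and pstdev treat the data as a complete population; statistics.variance and statistics.stdev would apply the sample (n - 1) correction instead.
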
As is the case with measures of central tendency, we cannot calculate all measures of dispersion on all levels of variables.  Range and IQR require rank ordering of values — which, in turn, requires that the variable has direction.  Variance and standard deviation can only be calculated if values are associated with real numbers that have equal intervals of measurement between them.  Recall that the hierarchy of measurement illustrates that any statistic that can be calculated for a lower level of measurement can be legitimately calculated and used for higher levels of measurement.  Therefore:

  • because min and max can be calculated for nominal level variables, they can also be calculated on ordinal, interval, and ratio variables
  • because range and IQR can be calculated for ordinal variables, they can also be calculated on interval and ratio variables
  • because variance and standard deviation can be calculated for interval variables, they can also be calculated for ratio variables 
Measure | Description | Levels of Measurement
MIN/MAX | Minimum and maximum values of a variable | Nominal + Ordinal + Interval + Ratio
RANGE | Difference between the maximum and minimum values of a variable | Ordinal + Interval + Ratio
IQR | Range of the middle 50% of values of a variable | Ordinal + Interval + Ratio
VARIANCE | Average squared deviation of each value of a variable from the mean | Interval + Ratio
STANDARD DEVIATION | Average distance of each value of a variable from the mean, expressed in the same units as the data; square root of the variance | Interval + Ratio
Table: Measures of Dispersion

Variance vs. Standard Deviation

Variance and standard deviation are measures that capture the same information (indeed, the standard deviation is simply the square root of the variance).  Does it matter which measure we report?  In fact, it does!

Generally speaking, standard deviation is more useful than the variance from an interpretation standpoint because it is in the same units as the original data (unlike variance, which is expressed in squared units of the original data).  This makes standard deviation easier to understand and communicate.  Thus, standard deviation allows for direct comparisons and provides a clearer picture of data spread.  This is especially true within the context of normal distributions.

For example, let’s assume we have a dataset of annual incomes and derive the following measures of central tendency and dispersion:

  • Mean income: 50,000 (dollars)
  • Standard deviation: 10,000 (dollars)
  • Variance: 100,000,000 (square dollars)

Interpreting the standard deviation is pretty straightforward: most people’s incomes are within $10,000 of the average income of $50,000.  Interpreting the variance, however, is trickier: the average squared deviation from the mean income is 100,000,000 square dollars.  As you can see, the interpretation of variance is less directly meaningful without further mathematical manipulation (i.e., taking the square root to find the standard deviation).
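
The relationship between the two measures is just a square root; a minimal sketch in Python using the figures above:

  import math

  mean_income = 50_000           # dollars
  variance = 100_000_000         # square dollars
  std_dev = math.sqrt(variance)  # back to dollars

  print(std_dev)                 # 10000.0 -- i.e., "within $10,000 of the mean"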

Why Does Variance Use Square Units?

If we try to measure spread by simply averaging the differences between each data point and the mean, the positive and negative differences (associated with values that fall above and below the mean) cancel each other out, and the sum is always zero.  Variance uses square units to ensure all deviations from the mean are positive, which in turn prevents positive and negative differences from cancelling out.
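
A small Python sketch of the cancellation problem, using a hypothetical set of values:

  # Hypothetical values with a mean of 6
  values = [2, 4, 6, 8, 10]
  mean = sum(values) / len(values)

  raw_deviations = [x - mean for x in values]
  squared_deviations = [(x - mean) ** 2 for x in values]

  print(sum(raw_deviations))                     # 0.0  -- positives and negatives cancel
  print(sum(squared_deviations))                 # 40.0 -- squaring keeps every deviation positive
  print(sum(squared_deviations) / len(values))   # 8.0  -- the variance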

Working with square units also has some useful mathematical properties. 

Measures of Central Tendency

Measures of CENTRAL TENDENCY tell us about the “typical” or average value for a variable.  In other words, measures of central tendency tell us how closely data in a variable group around some central point.  This information can be used to make an initial prediction of the EXPECTED VALUE that a variable will take on.

Mode, Median, and Mean

There are three measures of central tendency:

  • MODE — the value(s) that occurs most often (i.e., with greatest frequency) in a distribution of observations within a variable
    • Most of the time, the mode corresponds to one value; sometimes, the mode corresponds to two (BIMODAL) or more (MULTIMODAL) values
  • MEDIAN — the middle value when the observations within a variable are ranked in ascending order (i.e., from lowest to highest); in other words, the median is the observation with 50% of observations above and 50% of observations below it
    • If there are an even number of observations, the median is equal to the sum of the two middle observations, divided by two
  • MEAN — the arithmetic average of all the observations within a variable (i.e., the sum of values, divided by the number of values)

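All three measures can be sketched with Python's built-in statistics module; the variable and data below are hypothetical.

  import statistics

  # Hypothetical data: number of books read last year (a ratio-level variable)
  books = [0, 1, 1, 2, 2, 2, 3, 4, 5, 10]

  print(statistics.multimode(books))  # [2] -- value(s) occurring most often
  print(statistics.median(books))     # 2.0 -- average of the two middle values (even number of cases)
  print(statistics.mean(books))       # 3.0 -- sum of values divided by number of values
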
We cannot calculate all measures of central tendency on all levels of variables.  Median requires rank ordering of values — which, in turn, requires that the variable has direction.  Mean can only be calculated if values are associated with real numbers that have equal intervals of measurement between them.  The HIERARCHY OF MEASUREMENT illustrates that any statistic that can be calculated for a lower level of measurement can be legitimately calculated and used for higher levels of measurement.  Therefore:

  • because mode can be calculated for nominal level variables, it can also be calculated on ordinal, interval, and ratio variables
  • because median can be calculated for ordinal variables, it can also be calculated on interval and ratio variables
  • because mean can be calculated for interval variables, it can also be calculated for ratio variables 
Measure | Description | Levels of Measurement
MODE | Value that occurs most often (i.e., with greatest frequency) in a variable | Nominal + Ordinal + Interval + Ratio
MEDIAN | Middle value when observations of a variable are ranked in ascending order | Ordinal + Interval + Ratio
MEAN | Average of all the observations of a variable | Interval + Ratio
Table: Measures of Central Tendency

Mean vs. Median

Generally, we would opt for the mean over the median when either can be calculated on a particular variable.  However, the mean can be misleading when there are outliers in the data.  An OUTLIER is an extreme value of a variable.  When outliers are present, the mean is distorted: it is pulled towards the outlier.  In such situations, the median may be a more representative measure of central tendency.
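
A brief Python sketch of the outlier effect, using hypothetical incomes:

  import statistics

  # Hypothetical incomes (in dollars), with and without one extreme outlier
  incomes = [40_000, 45_000, 50_000, 55_000, 60_000]
  incomes_with_outlier = incomes + [1_000_000]

  print(statistics.mean(incomes), statistics.median(incomes))
  # 50000 50000 -- mean and median agree

  print(statistics.mean(incomes_with_outlier), statistics.median(incomes_with_outlier))
  # 208333.33... 52500.0 -- the mean is pulled toward the outlier; the median barely moves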

Frequency Distributions

A FREQUENCY DISTRIBUTION is a summary of how often each value or range of values occurs in a dataset.  It organizes data into a table or graph that displays the FREQUENCY (count) of each unique CLASS (category or value/interval of values) within the dataset.

Frequency distributions are an important tool for understanding the distribution and patterns of data.  Frequency distributions provide a clear visual summary of the data, helping to identify patterns such as central tendency, dispersion, and skewness.  Frequency distributions are also an important tool for summarizing data: they condense large datasets into an easily interpretable format.  This can facilitate initial data exploration and analysis.  Furthermore, as data summary tools, frequency distributions can also aid in decision-making processes and serve as a mechanism through which findings can be effectively communicated to various stakeholders.

Frequency Tables

A FREQUENCY TABLE is a tabular representation of data that shows the number of occurrences (frequency) of each distinct case (value or category in a dataset).  It organizes raw data into a summary format, making it easier to see how often each value appears.

While frequency tables are helpful, they do not provide as much information as relative frequency tables.  A RELATIVE FREQUENCY TABLE extends the frequency table by including the relative frequency (i.e., the PERCENTAGE DISTRIBUTION), which is the proportion or percentage of the total number of observations that fall into each case.  A relative frequency table provides a sense of the distribution of data in terms of its overall context.

A CUMULATIVE RELATIVE FREQUENCY TABLE shows the cumulative relative frequency (i.e., the CUMULATIVE FREQUENCY DISTRIBUTION), which is the sum of the percentage distributions for all values up to and including the current value.  The cumulative percentage for the last value should equal 100% (or something close, depending on rounding errors).  A cumulative relative frequency table helps us to understand the cumulative distribution of the data.
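
A minimal sketch of these tables in Python using collections.Counter; the survey variable and responses are hypothetical.

  from collections import Counter

  # Hypothetical ordinal variable: survey responses
  responses = ["Agree", "Agree", "Neutral", "Disagree", "Agree",
               "Neutral", "Agree", "Disagree", "Agree", "Neutral"]
  order = ["Disagree", "Neutral", "Agree"]

  counts = Counter(responses)
  n = len(responses)

  cumulative = 0.0
  print("Value       Freq   Rel. Freq   Cum. Rel. Freq")
  for value in order:
      freq = counts[value]    # frequency (count)
      rel = freq / n * 100    # relative frequency (percentage distribution)
      cumulative += rel       # cumulative relative frequency
      print(f"{value:<11} {freq:>4}   {rel:>8.1f}%   {cumulative:>8.1f}%")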

Another extension of the frequency table is a CONTINGENCY TABLE (also known as a cross-tabulation or crosstab).  A contingency table is used to display the frequency distribution of two or more variables; it shows the relationship between two or more CATEGORICAL VARIABLES (i.e., nominal- or ordinal-level variables) by presenting the frequency of each combination of variable categories.
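
If the pandas library is available, a contingency table can be sketched with pandas.crosstab; the two categorical variables and their values below are hypothetical.

  import pandas as pd

  # Hypothetical categorical variables for six respondents
  gender = ["Female", "Male", "Female", "Male", "Female", "Male"]
  vote   = ["Yes", "No", "Yes", "Yes", "No", "No"]

  # Rows are one variable, columns the other; each cell is the frequency
  # of that combination of categories
  table = pd.crosstab(pd.Series(gender, name="Gender"),
                      pd.Series(vote, name="Vote"))
  print(table)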

Charts and Graphs

There are numerous charts and graphs that can be used to display frequency distributions:  

  • BAR GRAPHS (or bar charts) and HISTOGRAMS are graphical representations of data that use rectangular bars to represent the frequency of each value or interval of values (i.e., BINS); bar charts are typically used for categorical variables, while histograms group numeric values into bins; both are useful for showing the distribution of variables
  • A PIE CHART is a circular graph divided into slices to illustrate numerical proportions, with the size of each slice proportional to the quantity it represents; pie charts are useful for showing the relative frequencies of different categories within a whole
  • A LINE GRAPH (or line chart) displays information as a series of data points connected by straight line segments.  Line graphs are often used to show trends over time; they can also be used to summarize frequency distributions of interval- and ratio-level variables
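
A brief sketch of these chart types using the matplotlib library (assumed to be installed); the data are hypothetical.

  import matplotlib.pyplot as plt

  # Hypothetical data
  categories = ["Agree", "Neutral", "Disagree"]
  counts = [5, 3, 2]
  ages = [21, 22, 22, 23, 25, 25, 25, 26, 28, 30, 31, 34, 35, 40, 41]
  years = [2019, 2020, 2021, 2022, 2023]
  enrollment = [120, 135, 150, 148, 160]

  fig, axes = plt.subplots(2, 2, figsize=(10, 8))
  axes[0, 0].bar(categories, counts)          # bar chart: frequency of each category
  axes[0, 1].hist(ages, bins=5)               # histogram: numeric values grouped into bins
  axes[1, 0].pie(counts, labels=categories)   # pie chart: relative frequencies within a whole
  axes[1, 1].plot(years, enrollment)          # line graph: trend over time
  plt.tight_layout()
  plt.show()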