UNIVARIATE STATISTICS

Measures of Central Tendency
Mode
Median
Mean
Measures of Dispersion
Range
Percentiles
Standard Deviation
Variance
Skew
Normal Curve

Measures of Central Tendency

What is the most common, most typical, or most often-occurring value of a variable? The following chart shows which measures of central tendency can be used with variables measured at the nominal, ordinal, or interval/ratio level.

              Nominal   Ordinal   Interval/Ratio
 Mode            X         X            X
 Median                    X            X
 Mean                                   X

Mode

The mode, or modal value, is the most commonly occurring value or category of a variable. To find the mode, look for the category that contains the highest number of observations. For example, in a survey of 88 cities, the most common form of city governance is the council/manager form.

 Type of Government    Number of Cities
 Commission                    4
 Weak Mayor                   17
 Strong Mayor                 22
 Council/Manager              45
 Total                        88

Note, however, that a variable may have two modal categories. For example, if the type of government had looked like this,

 Type of Government    Number of Cities
 Commission                   14
 Weak Mayor                   18
 Strong Mayor                 28
 Council/Manager              28
 Total                        88

then the variable would have two modal categories, "Strong Mayor" and "Council/Manager". This means that there is no single central tendency within the data for this variable. (More will be said about this under "Skew" below under "Measures of Dispersion.")
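The search for the modal category can be sketched in a few lines of Python. This is only an illustration: the function name modal_categories and the dictionary layout are assumptions made for this example, and the counts are the ones from the two tables above.

```python
def modal_categories(counts):
    """Return every category tied for the highest number of observations."""
    highest = max(counts.values())
    return [category for category, n in counts.items() if n == highest]


# Counts copied from the two city-government tables.
first_survey = {"Commission": 4, "Weak Mayor": 17,
                "Strong Mayor": 22, "Council/Manager": 45}
second_survey = {"Commission": 14, "Weak Mayor": 18,
                 "Strong Mayor": 28, "Council/Manager": 28}

print(modal_categories(first_survey))   # ['Council/Manager']
print(modal_categories(second_survey))  # ['Strong Mayor', 'Council/Manager']
```

Returning a list rather than a single value lets the same function report one mode or two.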

Median

The median is the value of the category or case that divides an ordered distribution into two equal parts. One half of the values will be higher than the median value; the other half of the values will be lower than the median value. To find the median, you must first put all the observations in order, from lowest to highest. Then use the formula (N+1)/2, where N is the number of observations, to find the position of the median in that ordered list.

For example, if there are 7 categories of employee pay, the median category will be category number 4, or (7+1)/2=4. In the example below, the median pay category is \$24,000. The value of this category can also be interpreted as the median pay value.

 Pay: \$12,000 \$17,000 \$18,000 \$24,000 \$25,000 \$27,000 \$30,000

However, if there are 8 categories of employee pay, the median pay value will fall in between two categories. The median category is category 4.5, or (8+1)/2=9/2=4.5. Add the values of the fourth and the fifth categories and divide by two, or (\$24,000+\$25,000)/2=\$49,000/2=\$24,500.

 Pay: \$12,000 \$17,000 \$18,000 \$24,000 \$25,000 \$27,000 \$30,000 \$58,000
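The (N+1)/2 rule for both the odd and the even case can be sketched as follows. The function name median_by_position is an assumption made for this illustration, and the sketch assumes a non-empty list of numbers.

```python
def median_by_position(values):
    """Find the median using the (N+1)/2 position rule."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2      # 1-based position in the ordered list
    lower = ordered[int(position) - 1]     # value at or just below that position
    if position == int(position):
        return lower                       # odd N: the position lands on one value
    upper = ordered[int(position)]         # even N: average the two neighbours
    return (lower + upper) / 2


seven_salaries = [12_000, 17_000, 18_000, 24_000, 25_000, 27_000, 30_000]
eight_salaries = seven_salaries + [58_000]

print(median_by_position(seven_salaries))  # 24000
print(median_by_position(eight_salaries))  # 24500.0
```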

If you have grouped data, that is, ranges of values, as well as the number of people found in each group or range, there is a more precise way to calculate the median. For example, assume that the pay categories are ranges of pay, and employees are distributed among them as follows:

 Pay Range             Number of employees    Cumulative Frequency
 \$20,000-\$29,000               9                      9
 \$30,000-\$39,000              14                     23
 \$40,000-\$49,000              16                     39
 \$50,000-\$59,000              21                     60
 Total                         60                      -

The median pay is found by using the formula for grouped data, N/2. In this case, there are 60 employees, so the median is the 60/2 = 30th observation.

We can see from the cumulative distribution that the 30th observation will be found in the category of \$40,000-\$49,000. We can estimate the median by calculating the mid-point of this category, by adding the lower boundary value to the higher boundary value and dividing by two, or (\$40,000+\$49,000)/2=\$89,000/2=\$44,500.

To calculate the median more exactly, we can look at how many observations into the \$40,000-\$49,000 category we must go to find the 30th observation.

There are 16 observations in this category. We must go to the 7th observation to find the 30th total observation of the sample. So we can calculate the value of going 7/16th of the way through this category.

The category has 10 values, in thousands of dollars (40,41,42,43,44,45,46,47,48,49). So 7/16 x 10 = 4.375, or \$4,375.

We add this to the lower boundary value of the category to get the median salary value of \$40,000+\$4,375=\$44,375.

Note that we must assume that the observations are evenly distributed within the categories; if the sample size is large enough in relation to the number of categories, this is usually not a problem.
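The grouped-data steps above can be sketched in Python. The function name grouped_median, the (lower boundary, count) layout, and the width argument of 10,000 (the "10 values" of the category, in dollars) are assumptions made for this illustration.

```python
def grouped_median(groups, width):
    """Estimate the median from (lower_boundary, count) pairs in ascending order.

    Assumes observations are spread evenly within each category.
    """
    total = sum(count for _, count in groups)
    target = total / 2                        # N/2: the observation we are looking for
    cumulative = 0                            # observations counted before this category
    for lower, count in groups:
        if cumulative + count >= target:
            steps_in = target - cumulative    # how far into this category we must go
            return lower + steps_in / count * width
        cumulative += count
    raise ValueError("groups must contain at least one category")


pay_groups = [(20_000, 9), (30_000, 14), (40_000, 16), (50_000, 21)]
print(grouped_median(pay_groups, width=10_000))  # 44375.0
```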

Mean

The mean, or average, is the arithmetic balance point of the distribution; it is not necessarily the same as the median or the mode. If you add up the distances from the mean of every observation that lies above the mean, that sum will equal the sum of the distances from the mean of every observation that lies below it.

To find the mean:
1) add up the value of all the observations in the sample, and
2) divide that sum by the total number of observations.

The average of the following employee salaries is equal to \$21,857.14.

 Salaries: \$12,000 \$17,000 \$18,000 \$24,000 \$25,000 \$27,000 \$30,000

However, the average of the following salaries is equal to \$26,375.

 Salaries: 12,000 17,000 18,000 24,000 25,000 27,000 30,000 58,000

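This demonstrates that the value of the mean is sensitive to very high or very low values. Adding a single salary of \$58,000 raised the mean by more than \$4,500, while the median moved only from \$24,000 to \$24,500. In a case like this, it may be better to use the median.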
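A short Python sketch can reproduce both averages and check the balance-point property of the mean. The names mean, seven_salaries, and eight_salaries are chosen for this illustration only.

```python
def mean(values):
    """Add up all the observations and divide by how many there are."""
    return sum(values) / len(values)


seven_salaries = [12_000, 17_000, 18_000, 24_000, 25_000, 27_000, 30_000]
eight_salaries = seven_salaries + [58_000]

print(round(mean(seven_salaries), 2))  # 21857.14
print(mean(eight_salaries))            # 26375.0

# Balance point: total distance above the mean equals total distance below it.
m = mean(eight_salaries)
distance_above = sum(v - m for v in eight_salaries if v > m)
distance_below = sum(m - v for v in eight_salaries if v < m)
print(distance_above, distance_below)  # 35875.0 35875.0
```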

To find the mean from grouped data:

1) find the midpoint for each range or category by adding the lower boundary value to the upper boundary value and dividing by two
2) multiply the number of employees in each range by the midpoint of the range
3) add up all the products of (number of employees times range midpoint)
4) divide that sum by the total number of observations

 Pay Range             Number of employees    Range Midpoint    Product
 \$20,000-\$29,000               9               \$24,500         \$220,500
 \$30,000-\$39,000              14               \$34,500         \$483,000
 \$40,000-\$49,000              16               \$44,500         \$712,000
 \$50,000-\$59,000              21               \$54,500         \$1,144,500
 Total                         60                  -             \$2,560,000

In this case, \$2,560,000/60=\$42,666.67.
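The four grouped-data steps can be sketched in Python as a check on the arithmetic. The function name grouped_mean and the (lower, upper, count) layout are assumptions made for this illustration.

```python
def grouped_mean(groups):
    """Estimate the mean from (lower_boundary, upper_boundary, count) triples."""
    total_count = sum(count for _, _, count in groups)
    # Steps 1-3: midpoint of each range times its count, summed over all ranges.
    total_product = sum((lower + upper) / 2 * count
                        for lower, upper, count in groups)
    # Step 4: divide by the total number of observations.
    return total_product / total_count


pay_groups = [(20_000, 29_000, 9), (30_000, 39_000, 14),
              (40_000, 49_000, 16), (50_000, 59_000, 21)]
print(round(grouped_mean(pay_groups), 2))  # 42666.67
```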

Measures of Dispersion

Measures of dispersion are the opposite of measures of central tendency. The latter attempt to describe the most typical or central value of a distribution of values of a variable. Measures of dispersion, in contrast, attempt to give an idea of how widely dispersed the values are and how different the observations are from one another.

The following chart shows which measures of variation and dispersion can be used with variables measured at the nominal, ordinal, or interval/ratio level.

                      Nominal   Ordinal   Interval/Ratio
 Range                             X            X
 Percentiles                       X            X
 Standard Deviation                             X
 Variance                                       X

Range

The range is the difference between the highest and the lowest values in an ordered distribution of the values of a variable. For example, if the highest paid employee makes \$58,000 per year and the lowest paid employee makes \$12,000 per year, the salary range is \$46,000. (Note that the average is \$26,375.)

 Pay: 12,000 17,000 18,000 24,000 25,000 27,000 30,000 58,000

However, if the highest paid employee makes \$29,000 per year and the lowest paid employee makes \$22,500 per year, the salary range is \$6,500. (Note that the average is \$26,312.50.)

 Pay: 22,500 24,500 25,000 26,000 27,000 28,000 28,500 29,000

Although these two organizations have very similar averages, they have very different ranges. For which organization would you rather be working?
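A brief Python sketch makes the comparison concrete by computing the range and the mean for both pay lists. The function name value_range and the variable names are chosen for this illustration.

```python
def value_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)


first_organization = [12_000, 17_000, 18_000, 24_000, 25_000, 27_000, 30_000, 58_000]
second_organization = [22_500, 24_500, 25_000, 26_000, 27_000, 28_000, 28_500, 29_000]

for pay in (first_organization, second_organization):
    average = sum(pay) / len(pay)
    print("range:", value_range(pay), "mean:", average)
# range: 46000 mean: 26375.0
# range: 6500 mean: 26312.5
```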

Percentiles

In general, percentiles are points in the distribution of the ordered values of a variable at which a known percentage of the observations fall below the point and the remaining percentage fall at or above the point.

For example, the 50th percentile is the same as the median; at the 50th percentile, half of the observations have higher values and half of the observations have lower values.

Percentiles are often used on standardized tests, such as the SAT or GRE. If you scored at the 75th percentile, that means that 75% of the other people scored below your score and 25% scored at or above your score.

Sometimes on tests for civil service, applicants are advised that they must score at a certain percentile or above to be considered for an interview, a promotion, etc.

When two organizations have very different ranges but similar averages, you may want to use the interquartile range, or the range between the 25th and 75th percentiles. The interquartile range contains the middle 50% of the observations.

To arrive at the interquartile range,

1) ignore the bottom 25% of the categories and the top 25% of the categories.
2) re-calculate the range, subtracting the new bottom category from the new top category.

For example, in this case, there are 8 categories, so one-quarter of 8 categories=2 categories. Ignoring the top two and the bottom two categories, the interquartile range for this organization is \$27,000-\$18,000=\$9,000.

 Pay: 12,000 17,000 18,000 24,000 25,000 27,000 30,000 58,000

For the second organization, shown below, the interquartile range is \$28,000-\$25,000=\$3,000.

 Pay: 22,500 24,500 25,000 26,000 27,000 28,000 28,500 29,000
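The two-step procedure above can be sketched in Python. The function name trimmed_interquartile_range is an assumption made for this illustration, and the sketch assumes the number of observations divides evenly by four. Statistical software often interpolates the 25th and 75th percentiles instead, so its results may differ somewhat from this simple method.

```python
def trimmed_interquartile_range(values):
    """Drop the bottom and top quarter of the ordered values, then take the range."""
    ordered = sorted(values)
    quarter = len(ordered) // 4                       # e.g. 8 observations -> 2
    middle = ordered[quarter:len(ordered) - quarter]  # the middle 50% of observations
    return middle[-1] - middle[0]


first_organization = [12_000, 17_000, 18_000, 24_000, 25_000, 27_000, 30_000, 58_000]
second_organization = [22_500, 24_500, 25_000, 26_000, 27_000, 28_000, 28_500, 29_000]

print(trimmed_interquartile_range(first_organization))   # 9000
print(trimmed_interquartile_range(second_organization))  # 3000
```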

Standard Deviation

The standard deviation is a measure of the average difference of each observation in a distribution from the average (mean) of the distribution. Given that you can calculate a mean, how different are most of the observations from that mean?

If most of the observations are near the mean in value, the standard deviation will be small. But if most of the observations are far from the mean in value, the standard deviation will be large.

The steps for calculating the standard deviation are:

1) calculate the mean
2) subtract the mean from the value of each observation
3) square the difference obtained in step 2 for each observation
4) add up all the squared differences obtained in step 3
5) divide the sum of the squared differences by the total number of observations (when the data are a sample rather than the whole population, many texts divide by the number of observations minus one instead)
6) take the square root of the result of step 5; this is the value of the standard deviation
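The six steps can be sketched in Python. The function name population_standard_deviation is chosen for this illustration; it divides by the total number of observations, as in step 5, and is applied here to the two pay lists from the "Range" section.

```python
import math


def population_standard_deviation(values):
    """Follow the six steps, dividing by the total number of observations."""
    mean = sum(values) / len(values)                               # step 1
    squared_differences = [(value - mean) ** 2 for value in values]  # steps 2-3
    total = sum(squared_differences)                               # step 4
    return math.sqrt(total / len(values))                          # steps 5-6


first_organization = [12_000, 17_000, 18_000, 24_000, 25_000, 27_000, 30_000, 58_000]
second_organization = [22_500, 24_500, 25_000, 26_000, 27_000, 28_000, 28_500, 29_000]

print(round(population_standard_deviation(first_organization)))   # 13162
print(round(population_standard_deviation(second_organization)))  # 2091
```

The two organizations have nearly the same mean, but the first has a standard deviation roughly six times as large, which matches the impression given by their ranges.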

Variance

The variance is an expression of the total amount of variability of the observations for a variable. The value of the variance is obtained by squaring the value of the standard deviation.

A variable with a large variance has a great deal of difference in the values of the various observations, while a variable with a small variance has less difference in the values of the various observations.
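The relationship between the variance and the standard deviation can be checked with Python's standard statistics module. This is a minimal sketch: pstdev and pvariance are the population versions, which divide by the total number of observations, and the pay list is the eight-salary example used earlier.

```python
import math
import statistics

pay = [12_000, 17_000, 18_000, 24_000, 25_000, 27_000, 30_000, 58_000]

standard_deviation = statistics.pstdev(pay)   # population standard deviation
variance = statistics.pvariance(pay)          # population variance

# The variance should equal the squared standard deviation (up to rounding).
print(math.isclose(variance, standard_deviation ** 2))  # True
```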

Skew

The skew refers to the "shape" of the distribution of the values of a variable. The values of a variable can be plotted on a chart.

If the values of the observations are distributed symmetrically around the mean of a variable in a bell shape, that is called a normal distribution. In this case, the mean, median, and mode will all coincide.

If the values of most of the observations are lower than the value of the mean, with a few very high values pulling the mean upward, then the distribution is called a positively skewed or right skewed distribution. In this case, the mode will have a lower value than the median, and the mean will have a higher value than the median.

If the values of most of the observations are higher than the value of the mean, with a few very low values pulling the mean downward, then the distribution is called a negatively skewed or left skewed distribution. In this case, the mode will have a higher value than the median, and the mean will have a lower value than the median.

An inspection of the skewness of a variable will help the researcher to decide which of the three measures of central tendency to use--mean, median, or mode--as the best indicator of the central tendency of the distribution of values for that variable.
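A rough first check of skew is to compare the mean with the median: by the usual convention, a mean above the median points to a positive (right) skew, and a mean below the median points to a negative (left) skew. The Python sketch below applies that rule of thumb to the two pay lists from the "Range" section. The function name skew_direction is an assumption made for this illustration, and the comparison is not a formal skewness coefficient.

```python
import statistics


def skew_direction(values):
    """Rule-of-thumb skew check based on comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive (right) skew"
    if mean < median:
        return "negative (left) skew"
    return "no skew indicated"


first_organization = [12_000, 17_000, 18_000, 24_000, 25_000, 27_000, 30_000, 58_000]
second_organization = [22_500, 24_500, 25_000, 26_000, 27_000, 28_000, 28_500, 29_000]

print(skew_direction(first_organization))   # positive (right) skew
print(skew_direction(second_organization))  # negative (left) skew
```

In the first list the single \$58,000 salary pulls the mean (\$26,375) above the median (\$24,500), which is why the median may be the better summary there.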

Normal Curve

The Normal Curve is a graph of the values of a variable where those values are distributed symmetrically about the mean of the variable. It has the following characteristics:

1) it has a bell-shaped, symmetrical curve

2) the mean, median, and mode all have the same value

3) the properties of the curve are known

4) it is useful in calculating estimates in inferential statistics

If the distribution of the values on a variable approaches a normal curve, we know that approximately 68% of the values will be within plus or minus one standard deviation from the mean; 95% of the values will be within plus or minus two standard deviations from the mean; and 99.7% of the values will be within plus or minus three standard deviations from the mean.
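These percentages can be verified with the NormalDist class in Python's standard statistics module (available in Python 3.8 and later). The helper name share_within is an assumption made for this sketch.

```python
from statistics import NormalDist

# Standard normal curve: mean of zero, standard deviation of one.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of values lying within plus or minus k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```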

This is useful because the value of any one observation can be converted to a standardized score, or z-score. A standardized score or z-score converts any observation to a measure of standard deviation units, where the value of the mean equals zero and the value of a standard deviation equals one.

The steps for calculating the z-score are:

1) calculate the mean of the variable
2) calculate the standard deviation of the variable
3) subtract the value of the mean from the value of the observation
4) divide the result of step 3 by the standard deviation of the variable; the result is the z-score

A z-score of +1.5 means that the value of the observation lies 1.5 standard deviation units above the mean. A z-score of -2.0 means that the value of the observation lies 2.0 standard deviation units below the mean.

If a student scores 40 out of 50 on a test of mathematics (with a mean of 41 and a standard deviation of 5), and 60 out of 75 on a test of language (with a mean of 55 and a standard deviation of 10), the scores are not directly comparable. Converting each of them to their respective z-scores allows them to be compared directly.

z-score for the math score of 40 = (40-41)/5 = -0.2

z-score for the language score of 60 = (60-55)/10 = +0.5

Although the student scored 80% of the total points available on each test, the student did slightly better than average (+0.5 standard deviations) on the language test, and slightly worse than average (-0.2 standard deviations) on the math test.
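The comparison of the two test scores can be sketched in Python. The function name z_score is chosen for this illustration; the scores, means, and standard deviations are the ones given for the math and language tests.

```python
def z_score(observation, mean, standard_deviation):
    """Distance of an observation from the mean, in standard deviation units."""
    return (observation - mean) / standard_deviation


math_z = z_score(40, mean=41, standard_deviation=5)
language_z = z_score(60, mean=55, standard_deviation=10)

print(math_z)      # -0.2
print(language_z)  # 0.5
```

A positive result means the score is above that test's mean and a negative result means it is below, so the two values can be compared directly even though the tests have different totals.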