Bivariate Statistics

PPA 696 RESEARCH METHODS

CONTINGENCY TABLES

Contingency Tables
Constructing a Contingency Table
Characteristics of a Contingency Table
Interpreting a Contingency Table
What is an Association
Measures of Association
Which Measure to Use
Nominal Measures of Association
Ordinal Measures of Association
Introducing Control Variables
Interpreting Control Tables

Contingency Tables

After examining the univariate frequency distribution of the values of each variable separately, the researcher is often interested in the joint occurrence and distribution of the values of the independent and dependent variable together. The joint distribution of two variables is called a bivariate distribution.

A contingency table shows the frequency distribution of the values of the dependent variable, given the occurrence of the values of the independent variable. Both variables must be grouped into a finite number of categories (usually no more than 2 or 3 categories) such as low, medium, or high; positive, neutral, or negative; male or female; etc.

Constructing a Contingency Table

1) obtain a frequency distribution for the values of the independent variable; if the variable is not divided into categories, decide on how to group the data.

2) obtain a frequency distribution for the values of the dependent variable; if the variable is not divided into categories, decide on how to group the data.

3) obtain the frequency distribution of the values of the dependent variable, given the values of the independent variable (either by tabulating the raw data, or from a computer program)

4) display the results of step 3 in a table

Example:
Independent Variable: Place of Residence
Categories: Inside City Limits=505
Outside City Limits=145

Dependent Variable: Attitude about Consolidation
Categories: Favor consolidation=327
No Opinion=168
Against consolidation=155

Joint Distribution:

Table 1. Attitudes toward Consolidation by Area of Residence

	Area of Residence
Attitude toward Consolidation	Inside City Limits	Outside City Limits
Against	98	57
No Opinion	134	34
For	273	54
Total	505	145
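
The four construction steps can be sketched in Python. This is only an illustration: the raw survey records are not available, so the sketch rebuilds one record per respondent from the cell counts in Table 1, and the names used (cell_counts, records, joint) are made up for the example.

```python
from collections import Counter

# Cell counts copied from Table 1. They are expanded into one
# (residence, attitude) record per respondent to stand in for raw data.
cell_counts = {
    ("Inside", "Against"): 98, ("Outside", "Against"): 57,
    ("Inside", "No Opinion"): 134, ("Outside", "No Opinion"): 34,
    ("Inside", "For"): 273, ("Outside", "For"): 54,
}
records = [pair for pair, n in cell_counts.items() for _ in range(n)]

# Steps 1 and 2: univariate frequency distribution of each variable.
residence_freq = Counter(residence for residence, _ in records)
attitude_freq = Counter(attitude for _, attitude in records)

# Step 3: joint (bivariate) frequency distribution.
joint = Counter(records)

# Step 4: display the table, independent variable across the columns.
columns = ["Inside", "Outside"]
rows = ["Against", "No Opinion", "For"]
print("Attitude\t" + "\t".join(columns))
for row in rows:
    print(row + "\t" + "\t".join(str(joint[(col, row)]) for col in columns))
print("Total\t" + "\t".join(str(residence_freq[col]) for col in columns))
```

The column totals printed in the last row should match the univariate distribution of the independent variable (505 and 145), which is a quick check that no observations were lost in tabulation.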

Characteristics of a Contingency Table:

1. Title

2. Categories of the Independent Variable head the tops of the columns

3. Categories of the Dependent Variable label the rows

4. Order categories of the two variables from lowest to highest (from left to right across the columns; from top to bottom along the rows).

5. Show totals at the foot of the columns

Interpreting a Contingency Table

1) Inspect the contingency table for patterns. This may be difficult if there are different totals of observations in the different categories of the independent variable.

2) Convert the observations in each cell to a percentage of the column total; be sure to still show the total number of observations for each column on which the percentages are based.

3) Compare the percentages across the categories of the dependent variable (the rows).

Example:
Table 1. Attitudes toward Consolidation by Area of Residence

	Area of Residence
Attitude toward Consolidation	Inside City Limits (N=505)	Outside City Limits (N=145)
Against	19%	39%
No Opinion	27%	23%
For	54%	37%
Total	100%	100%

According to this table, a larger percentage of city residents (54%) than of non-city residents (37%) are for consolidation. Conversely, a larger percentage of non-city residents (39%) than of city residents (19%) are against consolidation. About the same percentage of both groups have no opinion about consolidation.
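
As a small sketch of step 2, the column percentages can be computed from the cell counts in Table 1. The function name column_percentages is chosen for this example only, and rounding to whole percentages is an assumption that matches the table as printed.

```python
# Cell counts from Table 1, keyed by category of the independent variable.
counts = {
    "Inside": {"Against": 98, "No Opinion": 134, "For": 273},
    "Outside": {"Against": 57, "No Opinion": 34, "For": 54},
}

def column_percentages(column):
    """Convert one column of counts to whole percentages of its total."""
    total = sum(column.values())
    return {category: round(100 * n / total) for category, n in column.items()}

percentages = {area: column_percentages(col) for area, col in counts.items()}

for area, column in percentages.items():
    n = sum(counts[area].values())
    print(f"{area} (N={n}): {column}")
```

Because each percentage is rounded separately, a column may add up to 99% or 101% rather than exactly 100%, as happens here with the Outside City Limits column.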

The percentage distribution can suggest the strength of a relationship, but interpretation is up to each individual researcher. There is no minimum percentage difference that must be reached to indicate a strong or weak relationship between the two variables.

Does this mean that there is a relationship between the two variables, area of residence and attitude toward consolidation? Is one's attitude about consolidation associated with one's area of residence?

If there is a relationship, how strong is it? Are the results statistically significant? Are the results meaningfully significant? In order to answer these questions, we must turn to a set of statistics called Measures of Association.

What is an Association

Can the value of one variable be predicted, if we know the value of the other variable?

For example, say half the people participating in training programs get a job. What is the likelihood of any one participant getting a job? About fifty-fifty. So we would not be very good at predicting whether people will get jobs or not.

But if we introduce a second variable (the independent variable), does it help us to be more accurate in our predictions of the likelihood that someone will get a job?

Dependent variable: Obtaining a Job
No job=100
Gets a job=100

Independent Variable: Length of Training Program
Short=100
Long=100

Bivariate Distribution--Perfect Positive Relationship
(If training is good for getting a job)

	Length of Training Program
Obtains a Job	Short (N=100)	Long (N=100)
No	100%	0%
Yes	0%	100%
Total	100%	100%

If we know the length of the training program, we can perfectly predict the likelihood of getting a job. The longer the training program, the more likely the participant is to get a job and, conversely, the shorter the training program the less likely the participant is to get a job. That is, as the training program length increases, so does the likelihood of obtaining a job. The value of the measure of association would be +1.0.

Bivariate Distribution--Perfect Inverse Relationship
(If training is bad for getting a job)

	Length of Training Program
Obtains a Job	Short (N=100)	Long (N=100)
No	0%	100%
Yes	100%	0%
Total	100%	100%

If we know the length of the training program, we can perfectly predict the likelihood of getting a job. The longer the training program, the less likely the participant is to get a job and, conversely, the shorter the training program the more likely the participant is to get a job. That is, as the training program length increases, likelihood of obtaining a job decreases. The value of the measure of association would be -1.0.

Bivariate Distribution--No Relationship
(If training has no effect on getting a job)

	Length of Training Program
Obtains a Job	Short (N=100)	Long (N=100)
No	50%	50%
Yes	50%	50%
Total	100%	100%

Here we are back to a 50/50 guess. Knowing the length of the training program does not help in any way to predict the likelihood of getting a job. The value of the measure of association would be 0.0.
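
The three cases can be illustrated with a very simple stand-in measure. The notes do not name the statistic used here, so this sketch assumes the difference between the two column percentages (share of long-program participants who get a job minus the share of short-program participants who do). For these 2 x 2 tables it happens to give +1.0, -1.0, and 0.0. The function name percent_difference is hypothetical.

```python
def percent_difference(table):
    """Share obtaining a job in the Long column minus the share in the Short column."""
    short, long_ = table["Short"], table["Long"]
    p_short = short["Yes"] / (short["Yes"] + short["No"])
    p_long = long_["Yes"] / (long_["Yes"] + long_["No"])
    return p_long - p_short

# Counts behind the three percentage tables (100 participants per column).
perfect_positive = {"Short": {"No": 100, "Yes": 0}, "Long": {"No": 0, "Yes": 100}}
perfect_inverse = {"Short": {"No": 0, "Yes": 100}, "Long": {"No": 100, "Yes": 0}}
no_relationship = {"Short": {"No": 50, "Yes": 50}, "Long": {"No": 50, "Yes": 50}}

for name, table in [("perfect positive", perfect_positive),
                    ("perfect inverse", perfect_inverse),
                    ("no relationship", no_relationship)]:
    print(name, percent_difference(table))
```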

Measures of Association

Measures of Association are statistics that provide a standard against which to judge the relationship between the variables observed in contingency tables. They can indicate the strength of a relationship between two variables measured on a nominal or ordinal scale. For the latter, they can also indicate the direction of the relationship (positive or negative).

Measures of Association are descriptive statistics, so they can be used with samples which were not selected using a strict random sampling method. But they do not allow the researcher to infer whether the relationship observed in the sample is true of the general population.

Measures of Association do not indicate causality, but association--that is, whether one's score on one variable tends to be associated with one's score on another variable. The value of the measure of association statistic also indicates the strength of the relationship, whether weak, moderate, or strong.

Examples of Measures of Association:

Level of Measurement	Measure of Association	Values	Symmetric?
Nominal	Lambda	0.0 (weakest relationship) to 1.0 (strongest relationship)	Asymmetric
Ordinal	Gamma	0.0 (weakest relationship) to +1.0 or -1.0 (strongest relationship)	Symmetric

Measures of Association for variables measured at the nominal level generally vary from a low of 0.0 to a high of +1.0. Lower values indicate weaker associations, and higher values indicate stronger associations.

In addition, for variables measured at the ordinal level, Measures of Association vary from a low of 0.0, indicating the weakest level of association, to a high of either +1.0 or -1.0, which indicate the strongest level of association.

A value on the statistic between 0.0 and +1.0 indicates a positive (or direct) relationship. That is, as the value of one variable increases, the value of the other variable also increases. For example, as the number of hours spent studying increases, the student's grade on the test also increases. And conversely, as the number of hours spent studying decreases, the student's grade on the test also decreases.

A value on the statistic between 0.0 and -1.0 indicates a negative (or inverse) relationship. That is, as the value of one variable increases, the value of the other variable decreases. For example, as the number of librarians on duty increases, the number of patron complaints decreases. And conversely, as the number of librarians on duty decreases, the number of patron complaints increases.

Which Measure to Use

Choose a measure of association that meets the following criteria:

1) it is appropriate to the level of measurement of the data (nominal or ordinal);

2) it equals 0.0 for no relationship and 1.0 for a perfect relationship;

3) it is sensitive to subtle differences in the strength of a relationship;

4) the researcher is familiar with the statistic and knows how to interpret it;

5) it has been used in past research on this type of variable.

Note that some statistics take on different values, depending on which of the two variables is the independent variable and which is the dependent variable. These are called asymmetric measures of association. Symmetric measures of association take on the same value, no matter which variable is the independent variable and which is the dependent variable.

Note that the value of one statistic, such as gamma, cannot be directly compared with the value of another statistic, such as Tau. Each statistic has its own standard, and the value of the statistic obtained by the researcher must be compared with the standard for that statistic.

If the values of a number of statistics are obtained, and they all indicate a strong relationship between two variables, the researcher may take that as additional support for the existence of a relationship. However, if the values of a number of statistics are contradictory, with some indicating a strong relationship and others a weak relationship, the researcher must look more closely at the data. For example, there may be a non-linear relationship between the two variables.

Note that some measures of association are not useful when there is a non-linear relationship between the two variables. This can occur when there are three or more categories of values for the independent variable, and the values of the dependent variable do vary but not in a strictly linear fashion.

Nominal Measures of Association

Lambda is a measure of association that measures the Proportional Reduction in Error (PRE) obtained when the researcher uses the value of the independent variable to predict the value of the dependent variable.

If the researcher only has the value of the dependent variable, the researcher will make a number of errors trying to predict the values of the dependent variable for new observations. The amount of error made in trying to predict the dependent variable alone is called original error.

For example, say you asked the people in your organization to rate the personnel department. You know that the univariate distribution for this variable looks like this:

Rating of Personnel Department	Frequency
Poor	38
Satisfactory	32
Good	25
Total	95

Let's say you want to guess what the rating of another 95 people would be. Your best guess would be to pick the modal category, which is "Poor." That is, more people picked "Poor" than any other category. If you consistently pick "Poor," you will make the fewest wrong guesses: 38 right and 57 wrong (out of 95 total guesses). The original error is therefore 57.

Now, let's say that you are given one additional piece of information. You now know what the ratings of the personnel department are by the people who work in one of four departments: police, fire, public works, and planning.

	Department of Employment
Personnel Department Rating	Police	Fire	Public Works	Planning
Poor	10	15	5	8
Satisfactory	5	10	15	2
Good	15	5	5	0
Total	30	30	25	10

Now, if you had to guess the personnel department rating, you could qualify your best guess by knowing the department of employment. For each department, you would guess the modal category.

Department of Employment	Modal Category	Right Guesses	Wrong Guesses
Police (N=30)	"Good"	15	15
Fire (N=30)	"Poor"	15	15
Public Works (N=25)	"Satisfactory"	15	10
Planning (N=10)	"Poor"	8	2
Total		53	42

The total number of new errors (wrong guesses) is 42.

To calculate Lambda, subtract the number of new errors from the number of original errors and divide by the number of original errors. In this case, [(57-42)/57]=.263

By knowing a person's department, we can reduce the error in predicting how they rate the personnel department by 26.3%. This indicates a weak relationship between department of employment and perception of the personnel department. As the independent variable is measured on a nominal scale, there is no direction for the relationship (neither positive nor negative, just an association).
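
The lambda calculation above can be sketched in Python. The function name lambda_pre and the dictionary layout are assumptions made for this example. The sketch takes the table of department by rating, counts the original and new errors, and returns their proportional reduction.

```python
def lambda_pre(table):
    """Lambda for a table mapping each independent-variable category
    to a dict of {dependent-variable category: count}."""
    # Original error: always guess the overall modal category.
    overall = {}
    for column in table.values():
        for category, n in column.items():
            overall[category] = overall.get(category, 0) + n
    original_errors = sum(overall.values()) - max(overall.values())
    # New error: guess the modal category within each column.
    new_errors = sum(sum(col.values()) - max(col.values())
                     for col in table.values())
    return (original_errors - new_errors) / original_errors

ratings_by_department = {
    "Police": {"Poor": 10, "Satisfactory": 5, "Good": 15},
    "Fire": {"Poor": 15, "Satisfactory": 10, "Good": 5},
    "Public Works": {"Poor": 5, "Satisfactory": 15, "Good": 5},
    "Planning": {"Poor": 8, "Satisfactory": 2, "Good": 0},
}
print(round(lambda_pre(ratings_by_department), 3))  # (57 - 42) / 57

# Swap the roles of the two variables: predict department from rating.
departments_by_rating = {}
for department, column in ratings_by_department.items():
    for rating, n in column.items():
        departments_by_rating.setdefault(rating, {})[department] = n
print(round(lambda_pre(departments_by_rating), 3))  # (65 - 50) / 65
```

The second call gives a different value (about .231 instead of .263), which illustrates why lambda is described as an asymmetric measure: its value depends on which variable is treated as the dependent variable.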

Ordinal Measures of Association

Gamma is a measure of association that measures the Proportional Reduction in Error (PRE) obtained when the researcher uses the value of the independent variable to predict the value of the dependent variable.

Gamma varies from a value of 0.0 for the weakest level of association, to a value of +1.0 for the strongest level of association for a direct or positive or -1.0 for the strongest level of association for a negative or inverse relationship.

Note that both variables must be coded so that the values of the variable go from low to high, for example, dissatisfied=1, neutral=2, satisfied=3, or less than high school=1, high school=2, more than high school=3. The values of the variables in the contingency table should be arrayed from low to high as you read from left to right across the columns, and from low to high as you read from top to bottom along the rows.

Gamma can be used with two variables measured at the ordinal level, but is not good at reflecting non-linear relationships between two variables. In that case, a nominal measure of association should be used.

For example, let us hypothesize that there is a relationship between the length of time a person has been employed in an organization, and that person's opinion of that organization's personnel department: the longer employed, the better the opinion.

	Number of Years Employed
Opinion of the Personnel Department	Less than 1	1 to 5	More than 5
Poor	0	6	12
Satisfactory	0	6	0
Good	12	0	0
Total	12	12	12

To calculate gamma, we look at the number of observations that would support our hypothesis (called A) and the number of observations that would not support it (called D).

First we look for the number of observations in agreement (A). This consists in identifying the cells in the table that tend to support our hypothesis. We begin in the upper left hand corner, and work right and downward across the table.

We take the number of people who have worked less than 1 year and rate the department as poor (this would support our hypothesis). We multiply this number times the number of observations found in the cells which are under and to the right of this cell. These are the cells that contain the number of people who have worked either from 1-5 years or more than 5 years and who rate the department as either satisfactory or good.

Next we find the number of people who have worked less than 1 year and who rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the right of this cell. This includes the number of people who have worked from 1-5 years or more than 5 years and rate the department as good.

Next we find the number of people who have worked from 1-5 years and rate the department as poor. We multiply this number times the number of observations found in the cells which are under and to the right of this cell. This includes the number of people who have worked more than 5 years and rate the department as satisfactory or good.

Finally, we count the number of people who have worked from 1-5 years and rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the right of this cell. This includes the number of people who have worked more than 5 years and rate the department as good.

A=0 x (6+0+0+0) + 0 x (0 + 0) + 6 x (0 + 0) + 6 x (0)
A=0 x (6) + 0 x (0) +6 x (0) + 6 x (0)
A=0

Next we look for the number of observations in disagreement (D). This consists in identifying the cells in the table that tend to contradict our hypothesis. In this case, we would begin in the opposite (upper right hand) corner and work left and downward across the table.

In the table, there are 12 people who have worked more than five years who rate the personnel department as poor (this would disconfirm our hypothesis). We multiply this number times the number of observations found in the cells which are under and to the left of this cell. These are the cells that contain the number of people who have worked either less than one or from 1-5 years and who rate the department as either satisfactory or good.

Next we find the number of people who have worked more than 5 years who would rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the left of this cell. This includes the number of people who have worked less than 1 year or from 1-5 years and rate the department as good.

Next we find the number of people who have worked from 1-5 years and rate the department as poor. We multiply this number times the number of observations found in the cells which are under and to the left of this cell. This includes the number of people who have worked less than 1 year and rate the department as satisfactory or good.

Finally, we count the number of people who have worked from 1-5 years and rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the left of this cell. This includes the number of people who have worked less than 1 year and rate the department as good.

D=12 x (6+0+12+0) + 0 x (0 + 12) + 6 x (0 + 12) + 6 x (12)
D=12 x (18) + 0 x (12) +6 x (12) + 6 x (12)
D=216 + 0 + 72 + 72
D=360

Gamma is calculated by finding the number of observations in agreement minus the number of observations in disagreement, and dividing that by the number of observations in agreement plus the number of observations in disagreement.

Gamma=(0-360)/(0+360)=-1.0

This value of gamma tells us that we have a very strong relationship between the length of time employed and opinion of the personnel department, but the relationship runs in the opposite direction from the one we predicted. That is, as length of employment increases, opinion of the personnel department decreases.
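
The cell-by-cell procedure for A and D can be sketched in Python. The function name gamma_from_table is chosen for this example. The sketch assumes the table is given as a list of rows, with both rows and columns ordered from low to high as described above.

```python
def gamma_from_table(table):
    """Return (A, D, gamma) for a table given as a list of rows,
    with rows and columns both ordered from low to high."""
    n_rows, n_cols = len(table), len(table[0])
    agree = disagree = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair this cell with every cell in a lower row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:    # under and to the right: agreement
                        agree += table[i][j] * table[k][m]
                    elif m < j:  # under and to the left: disagreement
                        disagree += table[i][j] * table[k][m]
    return agree, disagree, (agree - disagree) / (agree + disagree)

# Rows: Poor, Satisfactory, Good. Columns: less than 1, 1 to 5, more than 5.
opinion_by_years = [
    [0, 6, 12],
    [0, 6, 0],
    [12, 0, 0],
]
A, D, gamma = gamma_from_table(opinion_by_years)
print(A, D, gamma)
```

Cells in the same row or the same column as the starting cell are skipped, which matches the hand calculation: only cells strictly below and to one side are multiplied.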

Introducing Control Variables

In establishing whether or not a relationship exists between two variables, it is not enough to obtain a high value on a measure of association. The researcher must also show that the purported relationship between the two variables is not spurious. A spurious relationship is one where two variables seem to be associated with one another, but the association can be explained away by the introduction of a third variable.

The introduction of a third, control, variable is called the specification or elaboration of the relationship observed between the original two variables. Control variables come from the researcher's experience; from a review of the literature; from a conceptual model that guides the research; or from a hypothesis.

For example, it is possible to establish that an association exists between the amount of ice cream sold and the number of assaults in any given city. However, this relationship is spurious: both the amount of ice cream sold and the number of assaults increase as the temperature increases. The temperature is associated with ice cream sales, and the temperature is associated with assaults, but ice cream and assaults are not related. This becomes apparent because when temperature is controlled, the value of the measure of association between ice cream sales and assaults will greatly diminish.

Previously, we established an apparent relationship between attitude toward consolidation and area of residence. But what if citizens' attitude toward consolidation is really influenced by their evaluation of their current public services?

Say that we have collected information on the third variable, evaluation of current public services. The variable is coded as either satisfactory or unsatisfactory. In order to introduce this as a control variable, we need to take the following steps.

1) obtain the original bivariate distribution table

2) obtain the frequency distribution for the control variable and divide the observations in the original table into groups according to the categories of the control variable

3) within each of these two new groups, re-create the original bivariate distribution table

4) compare the new bivariate distributions with the original distribution (in step 1)

5) interpret the results

Interpreting Control Tables

Step 1. Obtain the original bivariate distribution table

Attitudes toward Consolidation by Area of Residence

	Area of Residence
Attitude toward Consolidation	Inside City Limits (N=505)	Outside City Limits (N=145)
Against	19%	39%
No Opinion	27%	23%
For	54%	37%
Total	100%	100%

Step 2. Obtain the frequency distribution for the control variable.

Control Variable: Rating of Current Services
Categories: Satisfactory=388
Unsatisfactory=262

Divide the 650 observations in the original table into two groups: those who rate their current services as satisfactory, and those who rate their current services as unsatisfactory.

Step 3. Within each of these two new groups, re-create the original bivariate distribution table.

Control Table A. Current Services Rated as Satisfactory (N=388)

	Area of Residence
Attitude toward Consolidation	Inside City Limits	Outside City Limits
Against	15%	54%
No Opinion	20%	44%
For	65%	2%
Total	100%	100%

Control Table B. Current Services Rated as Unsatisfactory (N=262)

	Area of Residence
Attitude toward Consolidation	Inside City Limits	Outside City Limits
Against	27%	29%
No Opinion	39%	9%
For	34%	62%
Total	100%	100%

Step 4. Compare the new bivariate distributions with the original distribution (in step 1). There are three distinct possibilities: the original relationship is unchanged; the original relationship disappears; the original relationship is changed.

If the original relationship is unchanged, then the control variable has no effect, and can be disregarded in further analysis of the dependent variable.

If the original relationship disappears, then that relationship was spurious, and the control variable becomes the new independent variable in further analysis of the dependent variable.

If the original relationship is changed, then both variables are important, and must be considered in further analysis of the dependent variable.

In the original table, more city residents (54%) than non-city residents (37%) were for consolidation. This relationship is similar among the respondents in the first control table. For those who rate their current services as satisfactory, more city residents (65%) than non-city residents (2%) were for consolidation.

However, the relationship is reversed in the second control table. Among respondents who rate their current services as unsatisfactory, fewer city residents (34%) than non-city residents (62%) are for consolidation.

Step 5. Interpret the results.

In this case, both area of residence and perception of current services are important influences on a citizen's attitude toward consolidation. Those who live outside the city, and who are satisfied with their services, are opposed to consolidation, but those who live outside the city and are unsatisfied with their services favor consolidation.

Among city residents, the relationship is reversed: those who are satisfied favor consolidation (65%), while those who are unsatisfied are much less likely to favor it (34%). Perhaps those who are unsatisfied think that their services will deteriorate even further if the city and county are consolidated.

Another example concerns the attitude of organizational employees toward merit pay. We hypothesize that men will be more favorable to merit pay than women. We obtain the following bivariate distribution table:

Original Table: Attitude toward Merit Pay by Sex

	Sex
Attitude toward Merit Pay	Female (n=1506)	Male (n=228)
Negative	80%	20%
Positive	20%	80%
Total	100%	100%

This table seems to confirm our hypothesis: 80% of men favor merit pay but only 20% of women favor it. Values obtained for various measures of association are strong.

However, our MPA intern suggests that it is not sex but whether or not someone is in a management position that determines their attitude toward merit pay. We obtain the distribution for type of job, and find that of the original 1734 people in our study, 444 have management jobs and 1290 do not.

Control Table A: Management Jobs

	Sex
Attitude toward Merit Pay	Female (n=238)	Male (n=206)
Negative	13%	13%
Positive	87%	87%
Total	100%	100%

Here the relationship between sex and attitude completely disappears. Equally high percentages of women and men in management jobs are in favor of merit pay. The value obtained for the measure of association drops to nearly zero.

Control Table B: Non-management Jobs

	Sex
Attitude toward Merit Pay	Female (n=1268)	Male (n=22)
Negative	92%	91%
Positive	8%	9%
Total	100%	100%

Here the relationship between sex and attitude completely disappears. Equally high percentages of women and men in non-management jobs are opposed to merit pay. The value obtained for the measure of association drops to nearly zero.

In conclusion, we can discard the variable sex and concentrate on type of job (management or non-management) in our further analysis of the dependent variable, attitude toward merit pay.
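
A short arithmetic sketch shows why the original table looked so strong even though sex has no effect within either type of job. It recombines the two control tables, weighting each percentage by its group size. The group sizes and percentages come from the tables above; the function name overall_positive_share is made up for the example, and the results match the original table only up to rounding.

```python
# (group size, share positive toward merit pay) from Control Tables A and B.
groups = {
    "Female": {"management": (238, 0.87), "non_management": (1268, 0.08)},
    "Male": {"management": (206, 0.87), "non_management": (22, 0.09)},
}

def overall_positive_share(strata):
    """Combine job-type groups into one overall total and positive share."""
    total = sum(n for n, _ in strata.values())
    positive = sum(n * share for n, share in strata.values())
    return total, positive / total

for sex, strata in groups.items():
    total, share = overall_positive_share(strata)
    in_management = strata["management"][0] / total
    print(f"{sex}: n={total}, positive={share:.1%}, in management={in_management:.1%}")
```

Most of the men in the study hold management jobs, while most of the women do not, so the apparent difference between men and women in the original table reflects the difference between management and non-management employees.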
