Contingency Tables
Constructing a Contingency Table
Characteristics of a Contingency Table
Interpreting a Contingency Table
What is an Association
Measures of Association
Which Measure to Use
Nominal Measures of Association
Ordinal Measures of Association
Introducing Control Variables
Interpreting Control Tables
A contingency table shows the frequency distribution of the values of the dependent variable, given the occurrence of the values of the independent variable. Both variables must be grouped into a finite number of categories (usually no more than 2 or 3 categories) such as low, medium, or high; positive, neutral, or negative; male or female; etc.
2) obtain a frequency distribution for the values of the dependent variable; if the variable is not divided into categories, decide on how to group the data.
3) obtain the frequency distribution of the values of the dependent variable, given the values of the independent variable (either by tabulating the raw data, or from a computer program)
4) display the results of step 3 in a table
Example:
Independent Variable: Place of Residence
Categories: Inside City Limits=505
Outside City Limits=145
Dependent Variable: Attitude about Consolidation
Categories: Favor consolidation=327
No Opinion=168
Against consolidation=155
Joint Distribution:
Table 1. Attitudes toward Consolidation by Area of Residence
Attitude toward Consolidation | Area of Residence: Inside City Limits | Area of Residence: Outside City Limits |
Against | 98 | 57 |
No Opinion | 134 | 34 |
For | 273 | 54 |
Total | 505 | 145 |
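The tabulation in steps 3 and 4 can be sketched in Python. This is only an illustration under stated assumptions: the raw survey records are not available, so they are rebuilt here from the cell counts in Table 1, and `crosstab` is a hypothetical helper name.

```python
from collections import Counter

# Assumption: raw (area, attitude) records rebuilt from the cell counts
# in Table 1, because the original survey responses are not available.
cell_counts = {
    ("Inside", "Against"): 98, ("Outside", "Against"): 57,
    ("Inside", "No Opinion"): 134, ("Outside", "No Opinion"): 34,
    ("Inside", "For"): 273, ("Outside", "For"): 54,
}
records = [pair for pair, n in cell_counts.items() for _ in range(n)]

def crosstab(records):
    """Tally (independent, dependent) pairs into joint cell counts."""
    return Counter(records)

table = crosstab(records)
columns = ["Inside", "Outside"]          # independent variable
rows = ["Against", "No Opinion", "For"]  # dependent variable

# Display one row per category of the dependent variable.
for r in rows:
    print(r, [table[(c, r)] for c in columns])

# Column totals go at the foot of the table.
col_totals = {c: sum(table[(c, r)] for r in rows) for c in columns}
print("Total", [col_totals[c] for c in columns])
```

With real data, `records` would simply be the list of observed pairs, one per respondent.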
2. Categories of the Independent Variable head the tops of the columns
3. Categories of the Dependent Variable label the rows
4. Order categories of the two variables from lowest to highest (from left to right across the columns; from top to bottom along the rows).
5. Show totals at the foot of the columns
2) Convert the observations in each cell to a percentage of the column total; be sure to still show the total number of observations for each column on which the percentages are based.
3) Compare the percentages across the columns (the categories of the independent variable) within each row (each category of the dependent variable).
Example:
Table 1. Attitudes toward Consolidation by Area of Residence
Attitude toward Consolidation | Area of Residence: Inside City Limits (N=505) | Area of Residence: Outside City Limits (N=145) |
Against | 19% | 39% |
No Opinion | 27% | 23% |
For | 54% | 37% |
Total | 100% | 100% |
According to this table, a higher percentage of city residents (54%) than of non-city residents (37%) are for consolidation. Conversely, a higher percentage of non-city residents (39%) than of city residents (19%) are against consolidation. Similar percentages of both groups (27% and 23%) have no opinion about consolidation.
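The conversion to column percentages can be reproduced with a short Python sketch. The cell counts come from the frequency version of Table 1; the variable names are illustrative only.

```python
# Cell counts from Table 1: rows are attitudes, columns are areas.
counts = {
    "Against":    {"Inside": 98,  "Outside": 57},
    "No Opinion": {"Inside": 134, "Outside": 34},
    "For":        {"Inside": 273, "Outside": 54},
}
columns = ["Inside", "Outside"]

# Column totals: the N on which each column's percentages are based.
totals = {c: sum(row[c] for row in counts.values()) for c in columns}

# Convert each cell to a percentage of its column total.
percents = {
    attitude: {c: 100 * row[c] / totals[c] for c in columns}
    for attitude, row in counts.items()
}

print("N", [totals[c] for c in columns])
for attitude, row in percents.items():
    print(attitude, [f"{row[c]:.0f}%" for c in columns])
```

Each column sums to 100%, so the two groups can be compared directly even though they differ in size.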
The percentage distribution can suggest the strength of a relationship, but interpretation is up to each individual researcher. There is no minimum percentage difference that must be reached to indicate a strong or weak relationship between the two variables.
Does this mean that there is a relationship between the two variables, area of residence and attitude toward consolidation? Is one's attitude about consolidation associated with one's area of residence?
If there is a relationship, how strong is it? Are
the results statistically significant? Are the results meaningfully significant?
In order to answer these questions, we must turn to a set of statistics
called Measures of Association.
For example, say half the people participating in training programs get a job. What is the likelihood of any one participant getting a job? About fifty-fifty. So we would not be very good at predicting whether people will get jobs or not.
But if we introduce a second variable (the independent variable), does it help us to be more accurate in our predictions of the likelihood that someone will get a job?
Dependent variable: Obtaining a Job
No job=100
Gets a job=100
Independent Variable: Length of Training Program
Short=100
Long=100
Bivariate Distribution--Perfect Positive Relationship
(If training is good for getting a job)
Obtains a Job | Length of Training Program: Short (N=100) | Length of Training Program: Long (N=100) |
No | 100% | 0% |
Yes | 0% | 100% |
Total | 100% | 100% |
If we know the length of the training program, we can perfectly predict the likelihood of getting a job. The longer the training program, the more likely the participant is to get a job and, conversely, the shorter the training program the less likely the participant is to get a job. That is, as the training program length increases, so does the likelihood of obtaining a job. The value of the measure of association would be +1.0.
Bivariate Distribution--Perfect Inverse Relationship
(If training is bad for getting a job)
Obtains a Job | Length of Training Program: Short (N=100) | Length of Training Program: Long (N=100) |
No | 0% | 100% |
Yes | 100% | 0% |
Total | 100% | 100% |
If we know the length of the training program, we can perfectly predict the likelihood of getting a job. The longer the training program, the less likely the participant is to get a job and, conversely, the shorter the training program the more likely the participant is to get a job. That is, as the training program length increases, likelihood of obtaining a job decreases. The value of the measure of association would be -1.0.
Bivariate Distribution--No Relationship
(If training has no effect on getting a job)
Obtains a Job | Length of Training Program: Short (N=100) | Length of Training Program: Long (N=100) |
No | 50% | 50% |
Yes | 50% | 50% |
Total | 100% | 100% |
Here we are back to a 50/50 guess. Knowing the length
of the training program does not help in any way to predict the likelihood
of getting a job. The value of the measure of association would be 0.0.
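The text does not say which statistic yields these values of +1.0, -1.0, and 0.0. As an assumption for illustration, the sketch below uses gamma, which for a 2x2 table with cells a, b (top row) and c, d (bottom row) reduces to (ad - bc) / (ad + bc). The function name `gamma_2x2` is hypothetical.

```python
def gamma_2x2(a, b, c, d):
    """Gamma for a 2x2 table laid out as [[a, b], [c, d]].

    Rows and columns are assumed to be ordered from low to high.
    Undefined (division by zero) when both products are zero.
    """
    agree = a * d      # pairs consistent with a positive relationship
    disagree = b * c   # pairs consistent with a negative relationship
    return (agree - disagree) / (agree + disagree)

# Rows: No, Yes; columns: Short, Long. With N=100 per column,
# the percentages in the three tables are also the counts.
perfect_positive = gamma_2x2(100, 0, 0, 100)
perfect_inverse = gamma_2x2(0, 100, 100, 0)
no_relationship = gamma_2x2(50, 50, 50, 50)

print(perfect_positive, perfect_inverse, no_relationship)  # 1.0 -1.0 0.0
```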
Measures of Association are descriptive statistics, so they can be used with samples which were not selected using a strict random sampling method. But they do not allow the researcher to infer whether the relationship observed in the sample is true of the general population.
Measures of Association do not indicate causality, but association--that is, whether one's score on one variable tends to be associated with one's score on another variable. The value of the measure of association statistic also indicates the strength of the relationship, whether weak, moderate, or strong.
Examples of Measures of Association:
Level of Measurement | Measure of Association | Values | Symmetric? |
Nominal | Lambda | 0.0 (weakest relationship) to 1.0 (strongest relationship) | Lambda is asymmetric |
Ordinal | Gamma | 0.0 (weakest relationship) to +1.0 or -1.0 (strongest relationship) | Gamma is symmetric |
Measures of Association for variables measured at the nominal level generally vary from a low of 0.0 to a high of +1.0. Lower values indicate weaker associations, and higher values indicate stronger associations.
In addition, for variables measured at the ordinal level, Measures of Association vary from a low of 0.0, indicating the weakest level of association, to a high of either +1.0 or -1.0, which indicate the strongest level of association.
A value on the statistic between 0.0 and +1.0 indicates a positive (or direct) relationship. That is, as the value of one variable increases the value of the other variable also increases. For example, as the number of hours spent studying increases, the student's grade on the test also increases. And conversely, as the number of hours spent studying decreases, the student's grade on the test also decreases.
A value on the statistic between 0.0 and -1.0 indicates
a negative (or inverse) relationship. That is, as the value of one variable
increases the value of the other variable decreases. For example, as
the number of librarians on duty increases, the number of patron complaints
decreases. And conversely, as the number of librarians on duty decreases,
the number of patron complaints increases.
2) it equals 0.0 for no relationship and 1.0 for a perfect relationship;
3) it is sensitive to subtle differences in the strength of a relationship
4) the researcher is familiar with the statistic and knows how to interpret it
5) look at what has been done in the past with research on this type of variable
Note that some statistics take on different values, depending on which of the two variables is the independent variable and which is the dependent variable. These are called asymmetric measures of association. Symmetric measures of association take on the same value, no matter which variable is the independent variable and which is the dependent variable.
Note that the value of one statistic, such as gamma, cannot be directly compared with the value of another statistic, such as Tau. Each statistic has its own standard, and the value of the statistic obtained by the researcher must be compared with the standard for that statistic.
If the values of a number of statistics are obtained, and they all indicate a strong relationship between two variables, the researcher may take that as additional support for the existence of a relationship. However, if the values of a number of statistics are contradictory, with some indicating a strong relationship and others a weak relationship, the researcher must look more closely at the data. For example, there may be a non-linear relationship between the two variables.
Note that some measures of association are not useful
when there is a non-linear relationship between the two variables. This
can occur when there are three or more categories of values for the independent
variable, and the values of the dependent variable do vary but not in a
strictly linear fashion.
If the researcher knows only the distribution of the dependent variable, the researcher will make a number of errors trying to predict the values of the dependent variable for new observations. The amount of error made in trying to predict the dependent variable alone is called original error.
For example, say you asked the people in your organization
to rate the personnel department. You know that the univariate distribution
for this variable looks like this:
Rating of Personnel Department | Frequency |
Poor | 38 |
Satisfactory | 32 |
Good | 25 |
Total | 95 |
Let's say you want to guess what the ratings of another 95 people would be. Your best guess would be to pick the modal category, which is "Poor." That is, more people picked "Poor" than any other category. If you consistently pick "Poor," you will make the fewest wrong guesses: 38 right and 57 wrong (out of 95 total guesses). The original error is therefore 57.
Now, let's say that you are given one additional
piece of information. You now know what the ratings of the personnel department
are by the people who work in one of four departments: police, fire, public
works, and planning.
Personnel Department Rating | Department of Employment: Police | Fire | Public Works | Planning |
Poor | 10 | 15 | 5 | 8 |
Satisfactory | 5 | 10 | 15 | 2 |
Good | 15 | 5 | 5 | 0 |
Total | 30 | 30 | 25 | 10 |
Now, if you had to guess the personnel department
rating, you could qualify your best guess by knowing the department of
employment. For each department, you would guess the modal category.
Department of Employment | Modal Rating of Personnel Department | Right Guesses | Wrong Guesses |
Police (N=30) | "Good" | 15 | 15 |
Fire (N=30) | "Poor" | 15 | 15 |
Public Works (N=25) | "Satisfactory" | 15 | 10 |
Planning (N=10) | "Poor" | 8 | 2 |
Total (N=95) | | 53 | 42 |
The total number of new errors (wrong guesses) is 42.
To calculate Lambda, subtract the number of new errors from the number of original errors and divide by the number of original errors. In this case, [(57-42)/57]=.263
By knowing a person's department, we can reduce the
error in predicting how they rate the personnel department by 26.3%. This
indicates a weak relationship between department of employment and perception
of the personnel department. As the independent variable is measured on
a nominal scale, there is no direction for the relationship (neither positive
nor negative, just an association).
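The Lambda calculation can be sketched in Python as follows. This is a minimal illustration, assuming the table is passed as a list of rows with the dependent variable on the rows; `lambda_measure` is a hypothetical name, not a library function.

```python
def lambda_measure(table):
    """Lambda for predicting the row variable from the column variable.

    `table` is a list of rows; each row is a list of cell counts.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    columns = list(zip(*table))
    # Original error: always guess the overall modal row category.
    original_errors = n - max(row_totals)
    # New error: within each column, guess that column's modal category.
    new_errors = sum(sum(col) - max(col) for col in columns)
    return (original_errors - new_errors) / original_errors

# Rows: Poor, Satisfactory, Good.
# Columns: Police, Fire, Public Works, Planning.
ratings_by_department = [
    [10, 15, 5, 8],
    [5, 10, 15, 2],
    [15, 5, 5, 0],
]
lam = lambda_measure(ratings_by_department)
print(round(lam, 3))  # 0.263

# Swapping the roles of the two variables (transposing the table)
# gives a different value, which is why lambda is called asymmetric.
departments_by_rating = [list(col) for col in zip(*ratings_by_department)]
lam_reversed = lambda_measure(departments_by_rating)
print(round(lam_reversed, 3))  # 0.231
```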
Gamma varies from 0.0 for the weakest level of association to +1.0 for the strongest direct (positive) relationship, or to -1.0 for the strongest inverse (negative) relationship.
Note that both variables must be coded so that the values of the variable go from low to high, for example, dissatisfied=1, neutral=2, satisfied=3, or less than high school=1, high school=2, more than high school=3. The values of the variables in the contingency table should be arrayed from low to high as you read from left to right across the columns, and from low to high as you read from top to bottom along the rows.
Gamma can be used with two variables measured at the ordinal level, but it is not good at reflecting non-linear relationships between two variables. In that case, a nominal measure of association should be used.
For example, let us hypothesize that there is a relationship
between the length of time a person has been employed in an organization,
and that person's opinion of that organization's personnel department:
the longer employed, the better the opinion.
Opinion of the Personnel Department | Number of Years Employed: Less than 1 | 1 to 5 | More than 5 |
Poor | 0 | 6 | 12 |
Satisfactory | 0 | 6 | 0 |
Good | 12 | 0 | 0 |
Total | 12 | 12 | 12 |
To calculate gamma, we look at the number of observations that would support our hypothesis (called A) and the number of observations that would not support it (called D).
First we look for the number of observations in agreement (A). This consists in identifying the cells in the table that tend to support our hypothesis. We begin in the upper left hand corner, and work right and downward across the table.
We take the number of people who have worked less than 1 year and rate the department as poor (this would support our hypothesis). We multiply this number times the number of observations found in the cells which are under and to the right of this cell. These are the cells that contain the number of people who have worked either from 1-5 years or more than 5 years and who rate the department as either satisfactory or good.
Next we find the number of people who have worked less than 1 year and who rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the right of this cell. This includes the number of people who have worked from 1-5 years or more than 5 years and rate the department as good.
Next we find the number of people who have worked from 1-5 years and rate the department as poor. We multiply this number times the number of observations found in the cells which are under and to the right of this cell. This includes the number of people who have worked more than 5 years and rate the department as satisfactory or good.
Finally, we count the number of people who have worked from 1-5 years and rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the right of this cell. This includes the number of people who have worked more than 5 years and rate the department as good.
A=0 x (6+0+0+0) + 0 x (0 + 0) + 6 x (0 + 0) + 6 x (0)
A=0 x (6) + 0 x (0) +6 x (0) + 6 x (0)
A=0
Next we look for the number of observations in disagreement (D). This consists in identifying the cells in the table that tend to contradict our hypothesis. In this case, we begin in the opposite (upper right hand) corner and work left and downward across the table.
In the table, there are 12 people who have worked more than five years who rate the personnel department as poor (this would disconfirm our hypothesis). We multiply this number times the number of observations found in the cells which are under and to the left of this cell. These are the cells that contain the number of people who have worked either less than one or from 1-5 years and who rate the department as either satisfactory or good.
Next we find the number of people who have worked more than 5 years who would rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the left of this cell. This includes the number of people who have worked less than 1 year or from 1-5 years and rate the department as good.
Next we find the number of people who have worked from 1-5 years and rate the department as poor. We multiply this number times the number of observations found in the cells which are under and to the left of this cell. This includes the number of people who have worked less than 1 year and rate the department as satisfactory or good.
Finally, we count the number of people who have worked from 1-5 years and rate the department as satisfactory. We multiply this number times the number of observations found in the cells which are under and to the left of this cell. This includes the number of people who have worked less than 1 year and rate the department as good.
D=12 x (6+0+12+0) + 0 x (0 + 12) + 6 x (0 + 12) + 6 x (12)
D=12 x (18) + 0 x (12) +6 x (12) + 6 x (12)
D=216 + 0 + 72 + 72
D=360
Gamma is calculated by finding the number of observations in agreement minus the number of observations in disagreement, and dividing that by the number of observations in agreement plus the number of observations in disagreement.
Gamma=(0-360)/(0+360)=-1.0
This value of gamma tells us that we have a very
strong relationship between the length of time employed and opinion of
the personnel department, but the relationship is in the opposite direction
from the one we predicted. That is, as length of employment increases, opinion
of the personnel department decreases.
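The same A and D counts can be obtained with a short Python sketch. It assumes the table is a list of rows with both variables ordered from low to high; the function name `gamma` is illustrative and not taken from any library.

```python
def gamma(table):
    """Return (A, D, gamma) for a table ordered low to high on both axes."""
    n_rows, n_cols = len(table), len(table[0])
    agree = disagree = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Cells under and to the right count toward agreement (A).
            below_right = sum(table[k][m]
                              for k in range(i + 1, n_rows)
                              for m in range(j + 1, n_cols))
            # Cells under and to the left count toward disagreement (D).
            below_left = sum(table[k][m]
                             for k in range(i + 1, n_rows)
                             for m in range(j))
            agree += table[i][j] * below_right
            disagree += table[i][j] * below_left
    return agree, disagree, (agree - disagree) / (agree + disagree)

# Rows: Poor, Satisfactory, Good.
# Columns: less than 1, 1 to 5, more than 5 years employed.
opinion_by_tenure = [
    [0, 6, 12],
    [0, 6, 0],
    [12, 0, 0],
]
a, d, g = gamma(opinion_by_tenure)
print(a, d, g)  # 0 360 -1.0
```

The loop visits every cell, including those in the last row and the outer columns; these simply contribute zero because no cells lie under them or beside them in the relevant direction.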
The introduction of a third, control, variable is called the specification or elaboration of the relationship observed between the original two variables. Control variables come from the researcher's experience; from a review of the literature; from a conceptual model that guides the research; or from a hypothesis.
For example, it is possible to establish that an association exists between the amount of ice cream sold and the number of assaults in any given city. However, this relationship is spurious: both the amount of ice cream sold and the number of assaults increase as the temperature increases. The temperature is associated with ice cream sales, and the temperature is associated with assaults, but ice cream and assaults are not related. This becomes apparent because when temperature is controlled, the value of the measure of association between ice cream sales and assaults will greatly diminish.
Previously, we established an apparent relationship between attitude toward consolidation and area of residence. But what if citizens' attitude toward consolidation is really influenced by their evaluation of their current public services?
Say that we have collected information on the third variable, evaluation of current public services. The variable is coded as either satisfactory or unsatisfactory. In order to introduce this as a control variable, we need to take the following steps.
1) obtain the original bivariate distribution table
2) obtain the frequency distribution for the control variable and divide the observations in the original table into groups according to the categories of the control variable
3) within each of these two new groups, re-create the original bivariate distribution table
4) compare the new bivariate distributions with the original distribution (in step 1)
5) interpret the results
Step 1. Obtain the original bivariate distribution table
Attitudes toward Consolidation by Area of Residence
Attitude toward Consolidation | Area of Residence: Inside City Limits (N=505) | Area of Residence: Outside City Limits (N=145) |
Against | 19% | 39% |
No Opinion | 27% | 23% |
For | 54% | 37% |
Total | 100% | 100% |
Step 2. Obtain the frequency distribution for the control variable.
Control Variable: Rating of Current Services
Categories: Satisfactory=388
Unsatisfactory=262
Divide the 650 observations in the original table
into two groups: those who rate their current services as satisfactory,
and those who rate their current services as unsatisfactory.
Step 3. Within each of these two new groups, re-create the original bivariate distribution table.
Control Table A. Current Services Rated as Satisfactory (N=388)
Attitude toward Consolidation | Area of Residence: Inside City Limits | Area of Residence: Outside City Limits |
Against | 15% | 54% |
No Opinion | 20% | 44% |
For | 65% | 2% |
Total | 100% | 100% |
Control Table B. Current Services Rated as Unsatisfactory (N=262)
Attitude toward Consolidation | Area of Residence: Inside City Limits | Area of Residence: Outside City Limits |
Against | 27% | 29% |
No Opinion | 39% | 9% |
For | 34% | 62% |
Total | 100% | 100% |
Step 4. Compare the new bivariate distributions with the original distribution (in step 1). There are three distinct possibilities: the original relationship is unchanged; the original relationship disappears; the original relationship is changed.
If the original relationship is unchanged, then the control variable has no effect, and can be disregarded in further analysis of the dependent variable.
If the original relationship disappears, then that relationship was spurious, and the control variable becomes the new independent variable in further analysis of the dependent variable.
If the original relationship is changed, then both variables are important, and must be considered in further analysis of the dependent variable.
In the original table, a higher percentage of city residents (54%) than of non-city residents (37%) were for consolidation. The relationship runs in the same direction, and is stronger, among the respondents in the first control table: for those who rate their current services as satisfactory, 65% of city residents but only 2% of non-city residents were for consolidation.
However, the relationship is reversed in the second
control table. Among respondents who rate their current services as unsatisfactory,
fewer city residents (34%) than non-city residents (62%) are for consolidation.
Step 5. Interpret the results.
In this case, both area of residence and perception of current services are important influences on a citizen's attitude toward consolidation. Those who live outside the city, and who are satisfied with their services, are opposed to consolidation, but those who live outside the city and are unsatisfied with their services favor consolidation.
Among city residents, the pattern is reversed: those who are satisfied with their services mostly favor consolidation (65%), while those who are unsatisfied are much less likely to favor it (34%). Perhaps those who are unsatisfied think that their services will deteriorate even further if the city and county are consolidated.
Another example concerns the attitude of organizational employees toward merit pay. We hypothesize that men will be more favorable to merit pay than women. We obtain the following bivariate distribution table:
Original Table: Attitude toward Merit Pay by Sex
Attitude toward Merit Pay | Sex: Female (n=1506) | Sex: Male (n=228) |
Negative | 80% | 20% |
Positive | 20% | 80% |
Total | 100% | 100% |
This table seems to confirm our hypothesis: 80% of men favor merit pay but only 20% of women favor it. Values obtained for various measures of association are strong.
However, our MPA intern suggests that it is not sex but whether or not someone is in a management position that determines their attitude toward merit pay. We obtain the distribution for type of job, and find that of the original 1734 people in our study, 444 have management jobs and 1290 do not.
Control Table A: Management Jobs
Attitude toward Merit Pay | Sex: Female (n=238) | Sex: Male (n=206) |
Negative | 13% | 13% |
Positive | 87% | 87% |
Total | 100% | 100% |
Here the relationship between sex and attitude completely disappears. Equally high percentages of women and men in management jobs are in favor of merit pay. The value obtained for the measure of association drops to nearly zero.
Control Table B: Non-management Jobs
Attitude toward Merit Pay | Sex: Female (n=1268) | Sex: Male (n=22) |
Negative | 92% | 91% |
Positive | 8% | 9% |
Total | 100% | 100% |
Here the relationship between sex and attitude completely disappears. Equally high percentages of women and men in non-management jobs are opposed to merit pay. The value obtained for the measure of association drops to nearly zero.
In conclusion, we can discard the variable sex and concentrate on type of job (management or non-management) in our further analysis of the dependent variable, attitude toward merit pay.
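The merit pay example can be checked numerically with a rough Python sketch. Two assumptions are made here: the cell counts are approximated from the rounded percentages and subgroup sizes in the two control tables, and gamma (which for a 2x2 table is Yule's Q) stands in for the unnamed measure of association. Because sex is a nominal variable, only the size of the value matters, not its sign.

```python
def yules_q(neg_f, pos_f, neg_m, pos_m):
    """Gamma for a 2x2 table: rows Negative/Positive, columns Female/Male."""
    agree = neg_f * pos_m
    disagree = pos_f * neg_m
    return (agree - disagree) / (agree + disagree)

def approx_counts(n, share_positive):
    """Approximate (negative, positive) counts from a rounded percentage."""
    positive = round(n * share_positive)
    return n - positive, positive

# Subgroup sizes and share positive, from Control Tables A and B.
groups = {
    "management":     {"Female": (238, 0.87), "Male": (206, 0.87)},
    "non-management": {"Female": (1268, 0.08), "Male": (22, 0.09)},
}

pooled = {"Female": [0, 0], "Male": [0, 0]}
q_by_job = {}
for job, by_sex in groups.items():
    cells = {sex: approx_counts(n, p) for sex, (n, p) in by_sex.items()}
    for sex, (neg, pos) in cells.items():
        pooled[sex][0] += neg
        pooled[sex][1] += pos
    q_by_job[job] = yules_q(*cells["Female"], *cells["Male"])
    print(job, f"{q_by_job[job]:+.3f}")

# Pooling the two control tables recovers the original bivariate table.
q_pooled = yules_q(*pooled["Female"], *pooled["Male"])
print("pooled", f"{q_pooled:+.3f}")
for sex, (neg, pos) in pooled.items():
    print(sex, f"{100 * pos / (neg + pos):.0f}% positive")
```

Within each type of job the value is close to zero, while the pooled table shows a strong association. The pooled percentages come out near, but not exactly at, the 20% and 80% of the original table because the inputs are rounded.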