Tests for statistical significance are used to address the question: what is the probability that what we think is a relationship between two variables is really just a chance occurrence?
If we selected many samples from the same population, would we still find the same relationship between these two variables in every sample? If we could do a census of the population, would we also find that this relationship exists in the population from which the sample was drawn? Or is our finding due only to random chance?
Tests for statistical significance tell us the probability that the relationship we think we have found is due only to random chance: that is, the probability that we would be making an error if we conclude that a relationship exists.
We can never be 100% certain that a relationship exists between two variables. There are too many sources of error to control completely, for example, sampling error, researcher bias, problems with reliability and validity, simple mistakes, and so on.
But using probability theory and the normal curve, we can estimate the probability of being wrong, if we assume that our finding a relationship is true. If the probability of being wrong is small, then we say that our observation of the relationship is a statistically significant finding.
Statistical significance means that there is a good chance that we are right in finding that a relationship exists between two variables. But statistical significance is not the same as practical significance. We can have a statistically significant finding, but the implications of that finding may have no practical application. The researcher must always examine both the statistical and the practical significance of any research finding.
For example, we may find that there is a statistically significant relationship between a citizen's age and satisfaction with city recreation services. It may be that older citizens are 5% less satisfied than younger citizens with city recreation services. But is 5% a large enough difference to be concerned about?
Often, when differences are small but statistically significant, the reason is a very large sample size; in a smaller sample, the same differences would not be large enough to be statistically significant.
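This effect of sample size can be illustrated with a short calculation. The sketch below (plain Python; the satisfaction figures are invented for illustration) computes Chi Square, a test statistic discussed later in this document, for the same five-point difference in satisfaction at two different sample sizes:

```python
def chi_square(table):
    """Chi Square for a table of observed frequencies (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (expected - observed) ** 2 / expected
    return stat

# Same 5-point difference in satisfaction (60% vs. 55% satisfied):
small = [[60, 55], [40, 45]]          # 100 respondents per group
large = [[2400, 2200], [1600, 1800]]  # 4,000 respondents per group

print(chi_square(small))  # about 0.51: below the 3.84 needed for p = .05
print(chi_square(large))  # about 20.46: well above 3.84
```

With 100 people per group, the difference falls short of the critical value for p = .05 (3.84); with 4,000 per group, the identical percentage difference easily passes it.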
Two kinds of error can occur in tests for statistical significance. The first is called a Type I error. This occurs when the researcher concludes that a relationship exists when in fact it does not. In a Type I error, the researcher should accept the null hypothesis and reject the research hypothesis, but the opposite occurs. The probability of committing a Type I error is called alpha.
The second is called a Type II error. This occurs when the researcher concludes that a relationship does not exist when in fact it does. In a Type II error, the researcher should reject the null hypothesis and accept the research hypothesis, but the opposite occurs. The probability of committing a Type II error is called beta.
Generally, reducing the possibility of committing a Type I error increases the possibility of committing a Type II error, and vice versa: reducing the possibility of a Type II error increases the possibility of a Type I error.
Researchers generally try to minimize Type I errors, because when a researcher assumes a relationship exists when one really does not, things may be worse off than before. In Type II errors, the researcher misses an opportunity to confirm that a relationship exists, but is no worse off than before.
Suppose, for example, that the null hypothesis is that a county is not eligible for disaster relief. If a Type II error is committed, then the county is assumed to be ineligible for disaster relief when it really is eligible (the null hypothesis should be rejected, but it is accepted). The government may not spend disaster relief funds when it should, and farmers may go into bankruptcy.
Or suppose the null hypothesis is that a new drug is no better than the old one. If a Type II error is committed, then the new drug is assumed to be no better when it really is better (the null hypothesis should be rejected, but it is accepted). People may not be treated with the new drug, although they would be better off than with the old one.
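The claim that alpha is the probability of a Type I error can be checked with a small simulation. The sketch below (Python, with invented data) repeatedly draws two samples from the same population, so any "significant" difference it finds is, by construction, a Type I error:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def false_alarm_rate(critical_z, trials=2000, n=50):
    """How often two samples drawn from the SAME population look 'different'."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        diff = sum(a) / n - sum(b) / n
        se = (2 / n) ** 0.5          # standard error of the difference (variance 1)
        if abs(diff / se) > critical_z:
            hits += 1                # a Type I error: a false "relationship"
    return hits / trials

print(false_alarm_rate(1.96))  # near .05, matching alpha = .05
print(false_alarm_rate(2.58))  # near .01: stricter alpha, fewer Type I errors
```

Tightening the critical value cuts the false-alarm rate, but (as the text notes) the same tightening makes real differences harder to detect, raising beta.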
If the relationship between the two variables is
strong (as assessed by a Measure of Association), and the level chosen
for alpha is .05, then moderate or small sample sizes will detect it. As
relationships get weaker, however, and/or as the level of alpha gets smaller,
larger sample sizes will be needed for the research to reach statistical
significance.
Type of Training Attended | Number Attending Training
Vocational Education      | 200
Work Skills Training      | 250
Total                     | 450
Placed in a Job? | Number of Trainees
Yes              | 300
No               | 150
Total            | 450
To compute Chi Square, a table showing the joint distribution of the two variables is needed:
Table 1. Job Placement by Type of Training (Observed Frequencies)
                         Type of Training
Placed in a Job? | Vocational Education | Work Skills Training | Total
Yes              | 175                  | 125                  | 300
No               | 25                   | 125                  | 150
Total            | 200                  | 250                  | 450
Chi Square is computed by looking at the different parts of the table. The "cells" of the table are the squares in the middle of the table containing numbers that are completely enclosed. The cells contain the frequencies that occur in the joint distribution of the two variables. The frequencies that we actually find in the data are called the "observed" frequencies.
In this table, the cells contain the frequencies for vocational education trainees who got a job (n=175) and who didn't get a job (n=25), and the frequencies for work skills trainees who got a job (n=125) and who didn't get a job (n=125).
The "Total" columns and rows of the table show the marginal frequencies. The marginal frequencies are the frequencies that we would find if we looked at each variable separately by itself. For example, we can see in the "Total" column that there were 300 people who got a job and 150 people who didn't. We can see in the "Total" row that there were 200 people in vocational education training and 250 people in job skills training.
Finally, there is the total number of observations
in the whole table, called N. In this table, N=450.
To find the value of Chi Square, we first assume that there is no relationship between the type of training program attended and whether the trainee was placed in a job. If we look at the column total, we can see that 300 of 450 people found a job, or 66.7% of the total people in training found a job. We can also see that 150 of 450 people did not find a job, or 33.3% of the total people in training did not find a job.
If there were no relationship between the type of program attended and success in finding a job, then we would expect 66.7% of the trainees in both types of training programs to get a job, and 33.3% of the trainees in both types of programs to not get a job.
The first thing that Chi Square does is to calculate "expected" frequencies for each cell. The expected frequency is the frequency that we would have expected to appear in each cell if there was no relationship between type of training program and job placement.
The way to calculate the expected cell frequency is to multiply the column total for that cell, by the row total for that cell, and divide by the total number of observations for the whole table.
For the upper left hand corner cell, multiply 200 by 300 and divide
by 450=133.3
For the lower left hand corner cell, multiply 200 by 150 and divide
by 450=66.7
For the upper right hand corner cell, multiply 250 by 300 and divide
by 450=166.7
For the lower right hand corner cell, multiply 250 by 150 and divide
by 450=83.3
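These four calculations can be expressed compactly. The sketch below (Python) computes every expected frequency from the marginal totals:

```python
# Marginal totals from the observed-frequency table (Table 1)
row_totals = {"Yes": 300, "No": 150}
col_totals = {"Vocational Education": 200, "Work Skills Training": 250}
n = 450

# expected frequency = (row total * column total) / N, for every cell
expected = {
    (row, col): round(rt * ct / n, 1)
    for row, rt in row_totals.items()
    for col, ct in col_totals.items()
}

for cell, fe in expected.items():
    print(cell, fe)  # 133.3, 166.7, 66.7, 83.3
```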
Table 2. Job Placement by Type of Training (Expected Frequencies)
                         Type of Training
Placed in a Job? | Vocational Education | Work Skills Training | Total
Yes              | 133.3                | 166.7                | 300
No               | 66.7                 | 83.3                 | 150
Total            | 200                  | 250                  | 450
This table shows the distribution of "expected" frequencies, that is, the cell frequencies we would expect to find if there was no relationship between type of training and job placement.
Note that Chi Square is not reliable if any cell in the contingency table has an expected frequency of less than 5.
To calculate Chi Square, we need to compare the original,
observed frequencies with the new, expected frequencies. For each cell,
we perform the following calculations:
a) Subtract the value of the observed frequency from the value of the
expected frequency
b) square the result
c) divide the result by the value of the expected frequency
For each cell above,
fe - fo     | (fe - fo)²     | (fe - fo)² / fe        | Result
133.3 - 175 | (133.3 - 175)² | (133.3 - 175)² / 133.3 | 13.04
66.7 - 25   | (66.7 - 25)²   | (66.7 - 25)² / 66.7    | 26.07
166.7 - 125 | (166.7 - 125)² | (166.7 - 125)² / 166.7 | 10.43
83.3 - 125  | (83.3 - 125)²  | (83.3 - 125)² / 83.3   | 20.88
To calculate the value of Chi Square, add up the results for each cell--Total=70.42
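The whole computation can be checked in a few lines of Python. Note that carrying full precision gives 70.31 rather than 70.42; the small difference comes from rounding the expected frequencies to one decimal place in the hand calculation:

```python
# Observed frequencies from Table 1:
# rows = placed in a job? (Yes/No), columns = Vocational, Work Skills
observed = [[175, 125],
            [25, 125]]

row_totals = [sum(row) for row in observed]        # [300, 150]
col_totals = [sum(col) for col in zip(*observed)]  # [200, 250]
n = sum(row_totals)                                # 450

# Sum (fe - fo)^2 / fe over all four cells
chi_square = sum(
    (row_totals[i] * col_totals[j] / n - observed[i][j]) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
print(round(chi_square, 2))  # 70.31
```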
In theory, the value of the Chi Square statistic follows a known probability distribution, called the Chi Square distribution. (It is not a normal, bell-shaped curve; it is skewed to the right, and its exact shape depends on the degrees of freedom.) Because this distribution is known, we can use its properties to interpret the value obtained from our calculation of the Chi Square statistic.
If the value we obtain for Chi Square is large enough, then we can say that it indicates the level of statistical significance at which the relationship between the two variables can be presumed to exist.
However, whether the value is large enough depends on two things: the size of the contingency table from which the Chi Square statistic has been computed; and the level of alpha that we have selected.
The larger the size of the contingency table, the larger the value of Chi Square will need to be in order to reach statistical significance, if other things are equal. Similarly, the more stringent the level of alpha, the larger the value of Chi Square will need to be, in order to reach statistical significance, if other things are equal.
The term "degrees of freedom" is used to refer to the size of the contingency table on which the value of the Chi Square statistic has been computed. The degrees of freedom is calculated as the product of (the number of rows in the table minus 1) times (the number of columns in the table minus 1).
When reporting the level of alpha, it is usually
reported as being "less than" some level, using the "less than" sign or
<. Thus, it is reported as p<.05, or p<.01; unless you are
reporting the exact p-value, such as p=.04 or p=.22.
In the table, find the degrees of freedom (usually listed in a column down the side of the page). Next find the desired level of alpha (usually listed in a row across the top of the page). Find the intersection of the degrees of freedom and the level of alpha, and that is the value which the computed Chi Square must equal or exceed to reach statistical significance.
For example, for df=2 and p=.05, Chi Square must
equal or exceed 5.99 to indicate that the relationship between the two
variables is probably not due to chance. For df=4 and p=.05, Chi Square
must equal or exceed 9.49.
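A table lookup like this is easy to mechanize. The sketch below (Python; the critical values are copied from a standard published table for p = .05) combines the degrees-of-freedom formula with the lookup:

```python
# Critical values of Chi Square at p = .05, from a standard published table
CHI_SQUARE_CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07}

def is_significant(chi_square, n_rows, n_cols):
    """True if the computed Chi Square reaches significance at p = .05."""
    df = (n_rows - 1) * (n_cols - 1)
    return chi_square >= CHI_SQUARE_CRITICAL_05[df]

# The 2x2 job-placement table: df = (2-1)*(2-1) = 1
print(is_significant(70.42, 2, 2))  # True
```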
The computed value of Chi Square, at a given level
of alpha and with a given degree of freedom, is a type of "pass-fail" measurement.
It is not like a measure of association, which can vary from 0.0 to (plus
or minus) 1.0, and which can be interpreted at every point along the distribution.
Either the computed value of Chi Square reaches the required level for
statistical significance or it does not.
To calculate a value of t, we compare the difference between the two group means with the variability within the groups: t = (mean1 - mean2) / sqrt(variance1 / (n1 - 1) + variance2 / (n2 - 1)), where n is the number of observations in each group. The calculation is worked through step by step below.
Like other statistics, the t-test has a distribution that approaches the normal distribution, especially if the sample size is greater than 30. Since we know the properties of the normal curve, we can use it to tell us how far away from the mean of the distribution our calculated t-score is.
The normal curve is distributed about a mean of zero, with a standard deviation of one. A t-score can fall along the normal curve either above or below the mean; that is, either plus or minus some standard deviation units from the mean.
A t-score must fall far from the mean in order to achieve statistical significance. That is, it must be quite different from the value of the mean of the distribution, something that has only a low probability of occurring by chance if there is no relationship between the two variables. If we have chosen a value of p=.05 for alpha, we look for a value of t that falls into the extreme 5% of the distribution.
If we have a hypothesis that states the expected direction of the results, e.g., that male graduate assistant salaries are higher than female graduate assistant salaries, then we expect the calculated t-score to fall into only one end of the normal distribution. In that case we look for a value of t that falls into the extreme 5% at that one end of the distribution; this is called a "one-tailed" t-test.
If we have a hypothesis, however, that only states that there is some difference between two groups, but does not state which group is expected to have the higher score, then the calculated t-score can fall into either end of the normal distribution. For example, our hypothesis could be that we expect to find a difference between the average salaries of male and female graduate assistant members (but we do not know which is going to be higher, or which is going to be lower).
For a hypothesis which states no direction, we need to use a "two-tailed" t-test. That is, we must look for a value of t that falls into either one of the extreme ends ("tails") of the distribution. But since t can fall into either tail, if we select p=.05 for alpha, we must divide the 5% into two parts of 2-1/2% each. So a two-tailed test requires t to take on a more extreme value to reach statistical significance than a one-tailed test of t.
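The one-tailed and two-tailed cutoffs can be computed directly from the normal curve, which the text uses as the large-sample approximation for t. A short sketch using only the Python standard library:

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # the standard normal curve: mean 0, standard deviation 1

one_tailed = z.inv_cdf(1 - alpha)      # all 5% in one tail
two_tailed = z.inv_cdf(1 - alpha / 2)  # 2.5% in each tail

print(round(one_tailed, 2))  # 1.64
print(round(two_tailed, 2))  # 1.96
```

As the text says, the two-tailed cutoff (1.96) is more extreme than the one-tailed cutoff (1.64), so a two-tailed test is harder to pass at the same alpha.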
A t-score is calculated by comparing the average value on some variable obtained for two groups; the calculation also involves the variance of each group and the number of observations in each group. For example,
Table 3. Male and Female Graduate Assistant Salaries at CSULB
                       | Male Graduate Assistants | Female Graduate Assistants
Number of observations | 403                      | 132
Mean                   | $17,095                  | $14,885
Standard Deviation     | 6329                     | 4676
Variance               | 40045241                 | 21864976
To calculate t,
1) subtract the mean of the second group from the mean of the first
group
2) calculate, for each group, the variance divided by the number of
observations minus 1
3) add the results obtained for each group in step two together
4) take the square root of the results of step three
5) divide the results of step one by the results of step four
For example, using the figures in Table 3:
1) 17,095 - 14,885 = 2,210
2) 40,045,241 / 402 = 99,615.0 and 21,864,976 / 131 = 166,908.2
3) 99,615.0 + 166,908.2 = 266,523.2
4) the square root of 266,523.2 = 516.3
5) 2,210 / 516.3 = 4.28
In this example, the computed t-score of 4.28 exceeds the table value of t, so we can reject the null hypothesis of no relationship between graduate assistant gender and graduate assistant pay, and instead accept the research hypothesis and conclude that there is a relationship between graduate assistant gender and graduate assistant pay.
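The five calculation steps can be verified in a few lines of Python, using the figures from Table 3:

```python
from math import sqrt

# Figures from Table 3
n_male, mean_male, var_male = 403, 17095, 40045241
n_female, mean_female, var_female = 132, 14885, 21864976

diff = mean_male - mean_female                                   # step 1
within = var_male / (n_male - 1) + var_female / (n_female - 1)   # steps 2 and 3
t = diff / sqrt(within)                                          # steps 4 and 5
print(round(t, 2))  # 4.28
```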
Remember, however, that this is only one statistic,
based on just one sample, at one point in time, from one research project.
It is not absolute, conclusive proof that a relationship exists, but rather
support for the research hypothesis. It is only one piece of evidence,
that must be considered along with many other pieces of evidence on the
same subject.
A second method of reporting the results of tests for statistical significance is to report the test and its value, the degrees of freedom, and the p-value at the bottom of the contingency table or printout showing the data on which the calculations were based.
Table 1. Job Placement by Type of Training (Observed Frequencies)
                         Type of Training
Placed in a Job? | Vocational Education | Work Skills Training | Total
Yes              | 175                  | 125                  | 300
No               | 25                   | 125                  | 150
Total            | 200                  | 250                  | 450

Chi Square = 70.42, df = 1, p < .05
Table 3. Male and Female Graduate Assistant Salaries at CSULB
                       | Male Graduate Assistants | Female Graduate Assistants
Number of observations | 403                      | 132
Mean                   | $17,095                  | $14,885
Standard Deviation     | 6329                     | 4676
Variance               | 40045241                 | 21864976

t = 4.28, p < .05
The third way to report tests of statistical significance is to include them in tables showing the results of an extended analysis of the data, including a number of variables. For example, here are some results from a study of older Hispanic women in El Paso, TX, and Long Beach, CA.
Table 4. Characteristics of Workshop Participants Age 40 and Older
Characteristic                                   | El Paso (N=83) | Long Beach (N=131) | Value of t
Mean age                                         | 60.5 years     | 68.7 years         | 2.1*
Ethnic self-identification: Mexican-American (%) | 97.2           | 89.7               | 0.9
Language preference: Spanish only (%)            | 68.5           | 52.3               | 3.2**

* p < .05; ** p < .01
Tests for statistical significance are used because they constitute a common yardstick that can be understood by a great many people, and they communicate essential information about a research project that can be compared to the findings of other projects.
However, they do not assure that the research has been carefully designed and executed. In fact, tests for statistical significance can be misleading: because they are precise numbers, they can convey a false sense of certainty, and they say nothing about the practical significance of the findings.
Finally, one must always use measures of association along with tests for statistical significance. The latter estimate the probability that the relationship exists, while the former estimate the strength (and sometimes the direction) of the relationship. Each has its use, and they are best when used together.