CHAPTER 19
OVERCONFIDENCE
From The Psychology of Judgment and Decision Making
By Scott Plous
The odds of a meltdown are one in 10,000 years.
-Vitali Sklyarov, Minister of Power and Electrification in the Ukraine, two months before the Chernobyl accident
No problem in judgment and decision making is more prevalent and more potentially catastrophic than overconfidence. As Irving Janis (1982) documented in his work on groupthink, American overconfidence enabled the Japanese to destroy Pearl Harbor in World War II. Overconfidence also played a role in the disastrous decision to launch the U.S. space shuttle Challenger. Before the shuttle exploded on its twenty-fifth mission, NASA's official launch risk estimate was 1 catastrophic failure in 100,000 launches (Feynman, 1988, February). This risk estimate is roughly equivalent to launching the shuttle once per day and expecting to see only one accident in three centuries.
THE CASE OF JOSEPH KIDD
One of the earliest and best-known studies of overconfidence was published by
Stuart Oskamp in 1965. Oskamp asked 8 clinical
psychologists, 18 psychology graduate students, and 6 undergraduates to read the
case study of "Joseph Kidd," a 29-year-old
man who had experienced "adolescent maladjustment." The case study was
divided into four parts. Part 1 introduced Kidd as a war veteran who was working
as a business assistant in a floral decorating studio. Part 2 discussed Kidd's
childhood through age 12. Part 3 covered Kidd's high school and college years.
And Part 4 chronicled Kidd's army service and later activities.
Subjects answered the same set of questions four times—once after each
part of the case study. These questions were constructed from factual material
in the case study, but they required subjects to form clinical
judgments based on general impressions of Kidd's personality. Questions always had five forced-choice alternatives, and following each
item, subjects estimated the likelihood that their answer was correct. These
confidence ratings ranged from 20 percent (no confidence beyond chance levels of
accuracy) to 100 percent (absolute certainty).
Somewhat surprisingly, there were no significant differences among the
ratings from psychologists, graduate students, and undergraduates, so Oskamp
combined all three groups in his analysis of the results. What he found was that
confidence increased with the amount of information subjects read, but accuracy
did not.
After reading the first part of the case study, subjects answered 26 percent of the questions correctly (slightly more than what would be expected by chance), and their mean confidence rating was 33 percent. These figures show fairly close agreement. As subjects read more information, though, the gap between confidence and accuracy grew (see Figure 19.1). The more material subjects read, the more confident they became--even though accuracy did not increase significantly with additional information. By the time they finished reading the fourth part of the case study, more than 90 percent of Oskamp's subjects were overconfident of their answers.
FIGURE 19.1 Stuart Oskamp (1965) found that as subjects read more information from a case study, the gap between their estimated accuracy (confidence) and true accuracy increased.
In
the years since this experiment, a number of studies have found that people tend
to be overconfident of their judgments, particularly when accurate judgments are
difficult to make. For example, Sarah Lichtenstein and Baruch Fischhoff (1977)
conducted a series of experiments in which they found that people were 65 to 70
percent confident of being right when they were actually correct about 50
percent of the time.
In
the first of these experiments, Lichtenstein and Fischhoff asked people to judge
whether each of 12 children's drawings came from Europe or Asia, and to estimate
the probability that each judgment was correct. Even though only 53
percent of the judgments were correct (very close to chance performance), the
average confidence rating was 68 percent.
In
another experiment, Lichtenstein and Fischhoff gave people market reports on 12
stocks and asked them to predict whether the stocks would rise or fall in a
given period. Once again, even though only 47 percent of these predictions were
correct (slightly less than would be expected by chance), the mean
confidence rating was 65 percent.
After several additional studies, Lichtenstein and Fischhoff drew the following conclusions about the correspondence between accuracy and confidence in two-alternative judgments:
- Overconfidence is greatest when accuracy is near chance levels.
- Overconfidence diminishes as accuracy increases from 50 to 80 percent, and once accuracy exceeds 80 percent, people often become underconfident. In other words, the gap between accuracy and confidence is smallest when accuracy is around 80 percent, and it grows larger as accuracy departs from this level.
- Discrepancies between accuracy and confidence are not related to a decision maker's intelligence.
Although
early critics of this work claimed that these results were largely a function of
asking people questions about obscure or trivial topics, recent studies have
replicated Lichtenstein and Fischhoff's findings with more commonplace judgments.
For example, in a series of
experiments involving more than 10,000 separate judgments, Lee Ross and his
colleagues found roughly 10 to 15 percent overconfidence when subjects were
asked to make a variety of predictions about their behavior and the behavior of
others (Dunning, Griffin, Milojkovic, & Ross, 1990; Vallone, Griffin, Lin,
& Ross, 1990).
This
is not to say that people are always overconfident. David Ronis and
Frank Yates (1987) found, for instance, that overconfidence depends partly on
how confidence ratings are elicited and what type of judgments are being made
(general knowledge items seem to produce relatively high degrees of
overconfidence). There is also some evidence that expert bridge players,
professional oddsmakers, and National Weather Service forecasters--all of whom
receive regular feedback following
their judgments--exhibit little or no overconfidence (Keren, 1987; Lichtenstein,
Fischhoff, & Phillips, 1982; Murphy & Brown, 1984; Murphy & Winkler,
1984). Still, for the most part, research suggests that overconfidence is
prevalent.
EXTREME CONFIDENCE
What
if people are virtually certain that an answer is correct? How often are they
right in such cases? In 1977, Baruch Fischhoff, Paul Slovic, and Sarah
Lichtenstein conducted a series of experiments to investigate this issue. In the
first experiment, subjects answered hundreds of general knowledge questions and
estimated the probability that their answers were correct. For example, they
answered whether absinthe is a liqueur or a precious stone, and they estimated
their confidence on a scale from .50 to 1.00 (this problem appears as Item #21
of the Reader Survey). Fischhoff, Slovic, and Lichtenstein then examined the accuracy
of only those answers about which subjects were absolutely sure.
What
they found was that people tended to be only 70 to 85 percent correct when they
reported being 100 percent sure of their answer. How confident were you of your
answer to Item #21? The correct answer is that absinthe is a liqueur, though
many people confuse it with a precious stone called amethyst.
Just
to be certain their results were not due to misconceptions about probability,
Fischhoff, Slovic, and Lichtenstein (1977) conducted a second experiment in
which confidence was elicited in terms of the odds of being correct. Subjects in
this experiment were given more than 106 items in which two causes of death were
listed--for instance, leukemia and drowning. They were asked to indicate which cause of death was more
frequent in the United States and to estimate the odds that their answer was
correct (i.e., 2:1, 3:1, etc.). This
way, instead of having to express 75 percent confidence in terms of a
probability, subjects could express their confidence as 3:1 odds of being
correct.
What Fischhoff, Slovic, and Lichtenstein (1977) found was that confidence and accuracy were aligned fairly well up to confidence estimates of about 3:1, but as confidence increased from 3:1 to 100:1, accuracy did not increase appreciably. When people set the odds of being correct at 100:1, they were actually correct 73 percent of the time. Even when people set the odds between 10,000:1 and 1,000,000:1--indicating virtual certainty--they were correct only 85 to 90 percent of the time (and should have given a confidence rating between 6:1 and 9:1).*
* Although these results may seem to contradict Lichtenstein and Fischhoff's earlier claim that overconfidence is minimal when subjects are 80 percent accurate, there is really no contradiction. The fact that subjects average only 70 to 90 percent accuracy when they are highly confident does not mean that they are always highly confident when 70 to 90 percent accurate.
Finally, as an added check to
make sure that subjects understood the task and were taking it seriously,
Fischhoff, Slovic, and Lichtenstein (1977) conducted three replications. In one
replication, the relation between odds and probability was carefully explained
in a twenty-minute lecture. Subjects were given a chart showing the
correspondence between various odds estimates and probabilities, and they were
told about the subtleties of expressing uncertainty as an odds rating (with a
special emphasis on how to use odds between 1:1 and 2:1 to express
uncertainty). Even with these instructions, subjects showed unwarranted
confidence in their answers. They assigned odds of at least 50:1 when the odds
were actually about 4:1, and they gave odds of 1000:1 when they should have
given odds of 5:1.
In another replication, subjects
were asked whether they would accept a monetary bet based on the accuracy of
answers that they rated as having 50:1 or better odds of being correct. Of 42
subjects, 39 were willing to gamble--even though their overconfidence would have
led to a total of more than $140 in
losses. And in a final replication,
Fischhoff, Slovic, and Lichtenstein (1977) actually played subjects' bets. In
this study, 13 of 19 subjects agreed to gamble on the accuracy of their answers,
even though they were incorrect on 12 percent of the questions to which they had
assigned odds of 50:1 or greater (and all would have lost from $1 to $11, had
the experimenters not waived the loss). These results suggest that (1) people
are overconfident even when virtually certain they are correct, and (2)
overconfidence is not simply a consequence of taking the task lightly or
misunderstanding how to make confidence ratings. Indeed, Joan Sieber (1974)
found that overconfidence increased with incentives to perform well.
Are people overconfident when
more is at stake than a few dollars? Although ethical considerations obviously
limit what can be tested in the laboratory, at least one line of evidence
suggests that overconfidence operates even when human life hangs in the balance.
This evidence comes from research on the death penalty.
In a comprehensive review of wrongful convictions, Hugo Bedau and Michael Radelet (1987) found 350 documented instances in which innocent defendants were convicted of capital or potentially capital crimes in the United States--even though the defendants were apparently judged "guilty beyond a reasonable doubt." In five of these cases, the error was discovered prior to sentencing. The other defendants were not so lucky: 67 were sentenced to prison for terms of up to 25 years, 139 were sentenced to life in prison (terms of 25 years or more), and 139 were sentenced to die. At the time of Bedau and Radelet's review, 23 of the people sentenced to die had been executed.
CALIBRATION
"Calibration"
is the degree to which confidence matches accuracy. A decision maker is
perfectly calibrated when, across all judgments at a given level of confidence,
the proportion of accurate judgments is identical to the expected probability of
being correct. In other words, 90 percent of all judgments assigned a .90
probability of being correct are accurate, 80 percent of all judgments assigned
a probability of .80 are accurate, 70 percent of all judgments assigned a
probability of .70 are accurate, and so forth.
When
individual judgments are considered alone, it doesn't make much sense to speak
of calibration. How well calibrated is a decision maker who answers
".70" to Item #21b of the Reader Survey? The only way to reliably
assess calibration is by comparing
accuracy and confidence across hundreds of judgments (Lichtenstein, Fischhoff,
& Phillips, 1982).
Just
as there are many ways to measure confidence, there are several techniques for
assessing calibration. One way is simply to calculate the difference between
average confidence ratings and the overall proportion of accurate judgments. For
instance, a decision maker might average 80 percent confidence on a set of
general knowledge items but be correct on only 60 percent of the items. Such a
decision maker would be overconfident by 20 percent.
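A minimal Python sketch of this measure (illustrative only; the data are invented and the code is not from any of the studies cited here):

```python
# Illustrative sketch (not from the text): overall over/underconfidence is
# mean confidence minus the overall proportion of correct judgments.
confidence = [0.8, 0.9, 0.7, 1.0, 0.8, 0.6]   # hypothetical confidence ratings
correct = [1, 0, 1, 1, 0, 1]                  # 1 = judgment was accurate, 0 = not

overconfidence = sum(confidence) / len(confidence) - sum(correct) / len(correct)
print(f"Over/underconfidence: {overconfidence:+.2f}")  # positive = overconfident
```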
Although
this measure of calibration is convenient, it can be misleading at times.
Consider, for example, a decision maker whose overall accuracy and average
confidence are both 80 percent. Is this person perfectly calibrated? Not
necessarily. The person may be 60 percent confident on half the judgments and
100 percent confident on the others (averaging out to 80 percent confidence),
yet 80 percent accurate at both levels of confidence. Such a person would be
underconfident when 60 percent sure and overconfident when 100 percent sure.
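To make the example concrete, the following sketch (again illustrative, with invented data) reproduces the hypothetical decision maker just described and shows how the per-level breakdown reveals miscalibration that the overall averages conceal:

```python
# Hypothetical data matching the example above: half the judgments made with
# .60 confidence, half with 1.00 confidence, 80 percent accuracy at each level.
judgments = [(0.6, a) for a in [1] * 8 + [0] * 2] + \
            [(1.0, a) for a in [1] * 8 + [0] * 2]

overall_confidence = sum(c for c, _ in judgments) / len(judgments)   # 0.80
overall_accuracy = sum(a for _, a in judgments) / len(judgments)     # 0.80

# The per-level breakdown tells a different story:
for level in (0.6, 1.0):
    group = [a for c, a in judgments if c == level]
    print(f"confidence {level:.2f} -> accuracy {sum(group) / len(group):.2f}")
# Accuracy is 0.80 at both levels: underconfident at .60, overconfident at 1.00.
```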
FIGURE 19.2 This figure contains calibration curves for weather forecasters' predictions of precipitation (hollow circles) and physicians' diagnoses of pneumonia (filled circles). Although the weather forecasters are almost perfectly calibrated, the physicians show substantial overconfidence (i.e., unwarranted certainty that patients have pneumonia). The data on weather forecasters comes from a report by Allan Murphy and Robert Winkler (1984), and the data on physicians comes from a study by Jay Christensen-Szalanski and James Bushyhead (1981).
There
are additional ways to assess calibration, some of them involving complicated
mathematics. For instance, one of the most common techniques is to calculate a
number known as a "Brier score" (named after statistician Glenn
Brier). Brier scores can be partitioned into three components,
one of which corresponds to calibration. The Brier score component for
calibration is a weighted average of the mean squared differences between the
proportion correct in each category and the probability associated with that
category (for a good introduction to the technical aspects of calibration, see
Yates, 1990).
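The following sketch shows one way this calibration component might be computed; it is an illustration based on the description above (not code from Yates, 1990), and it assumes judgments are simply grouped by their stated probability:

```python
from collections import defaultdict

def calibration_component(confidences, outcomes):
    """Weighted average of squared differences between each stated probability
    and the proportion correct among judgments given that probability."""
    bins = defaultdict(list)
    for p, correct in zip(confidences, outcomes):
        bins[p].append(correct)
    n = len(confidences)
    return sum(len(v) * (p - sum(v) / len(v)) ** 2 for p, v in bins.items()) / n

# Hypothetical ratings: well calibrated at .70, overconfident at .90
print(calibration_component([0.7] * 10 + [0.9] * 10,
                            [1] * 7 + [0] * 3 + [1] * 6 + [0] * 4))  # 0.045
```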
In a major review of
calibration research, Sarah Lichtenstein, Baruch Fischhoff, and Lawrence
Phillips (1982) examined several studies in which subjects had been asked to
give 98 percent confidence intervals (i.e., intervals that had a 98 percent
chance of including the correct answer). In every study, the surprise index--the
percentage of intervals that failed to contain the correct answer--exceeded 2 percent. Averaging across all experiments for which information was
available--a total of nearly 15,000 judgments--the surprise index was 32
percent. In other words, when subjects were 98 percent sure that an interval
contained the correct answer, they were right 68 percent of the time. Once
again, overconfidence proved the rule rather than the exception.
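As an illustration of how a surprise index is computed (the intervals and true values below are invented for the example):

```python
# Hypothetical 98 percent confidence intervals (low, high) and the true values.
intervals = [(100, 200), (5, 15), (1900, 1950), (30, 60), (0.2, 0.8)]
true_values = [250, 12, 1931, 75, 0.5]

misses = sum(not (lo <= t <= hi) for (lo, hi), t in zip(intervals, true_values))
surprise_index = misses / len(intervals)
print(f"Surprise index: {surprise_index:.0%}")  # a well-calibrated judge expects 2%
```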
Are you overconfident? Edward Russo and Paul Schoemaker (1989) developed an easy self-test to measure overconfidence on general knowledge questions (reprinted in Figure 19.3). Although a comprehensive assessment of calibration requires hundreds of judgments, this test will give you a rough idea of what your surprise index is with general knowledge questions at one level of confidence. Russo and Schoemaker administered the test to more than 1000 people and found that less than 1 percent of the respondents got nine or more items correct. Most people missed four to seven items (a surprise index of 40 to 70 percent), indicating a substantial degree of overconfidence.
For each of the following ten items, provide a low and high guess such that you are 90 percent sure the correct answer falls between the two. Your challenge is to be neither too narrow (i.e., overconfident) nor too wide (i.e., underconfident). If you successfully meet this challenge you should have 10 percent misses -- that is, exactly one miss.
                                                          90% Confidence Range
                                                          LOW         HIGH
 1. Martin Luther King's age at death                     ______      ______
 2. Length of the Nile River                              ______      ______
 3. Number of countries that are members of OPEC          ______      ______
 4. Number of books in the Old Testament                  ______      ______
 5. Diameter of the moon in miles                         ______      ______
 6. Weight of an empty Boeing 747 in pounds               ______      ______
 7. Year in which Wolfgang Amadeus Mozart was born        ______      ______
 8. Gestation period (in days) of an Asian elephant       ______      ______
 9. Air distance from London to Tokyo                     ______      ______
10. Deepest (known) point in the ocean (in feet)          ______      ______
FIGURE 19.3 This test will give you some idea of whether you are overconfident on general knowledge questions (reprinted with permission from Russo and Schoemaker, 1989).
THE CORRELATION BETWEEN CONFIDENCE AND ACCURACY
Overconfidence notwithstanding,
it is still possible for confidence to be correlated with accuracy. To take an
example, suppose a decision maker were 50 percent accurate when 70 percent
confident, 60 percent accurate when 80 percent confident, and 70 percent
accurate when 90 percent confident. In
such a case confidence would be perfectly correlated with accuracy, even though
the decision maker would be uniformly overconfident by 20 percent.
The question arises, then,
whether confidence is correlated with accuracy--regardless of whether decision
makers are overconfident. If confidence ratings increase when
accuracy increases, then accuracy can be predicted as a function of how
confident a decision maker feels. If
not, then confidence is a misleading indicator of accuracy.
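One common way to quantify this relationship is a Pearson (point-biserial) correlation between confidence ratings and right/wrong outcomes. A minimal sketch with invented data follows; it is an illustration, not the analysis used in any study cited here:

```python
import statistics  # statistics.correlation requires Python 3.10+

def confidence_accuracy_correlation(confidences, outcomes):
    """Pearson (point-biserial) correlation between confidence ratings
    and right/wrong (1/0) outcomes."""
    return statistics.correlation(confidences, [float(o) for o in outcomes])

# Hypothetical respondent: confidence varies widely, but accuracy barely tracks it.
conf = [9, 8, 9, 7, 5, 6, 8, 4, 9, 7]
acc = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
print(f"r = {confidence_accuracy_correlation(conf, acc):.2f}")
```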
Many studies have examined this
issue, and the results have often shown very little relationship between
confidence and accuracy. To illustrate, consider the following two problems
concerning military history:
Problem 1.
The government of a country not far from Superpower A, after discussing certain
changes in its party system, began broadening its trade with Superpower B. To
reverse these changes in government and trade, Superpower A sent its troops into
the country and militarily backed the original government. Who was Superpower
A-the United States or the Soviet Union? How confident are you that your answer
is correct?
Problem 2.
In the 1960s Superpower A sponsored a surprise invasion of a small country near
its border, with the purpose of overthrowing the regime in power at the time.
The invasion failed, and most of the original invading forces were killed or
imprisoned. Who was Superpower A, and again, how sure are you of your answer?
Most
people miss at least one of these problems, despite whatever confidence they
feel.
In
the November 1984 issue of Psychology Today magazine, Philip Zimbardo and
I published the results of a reader survey that contained both of these
problems and a variety of others on superpower conflict. The survey included 10
descriptions of events, statements, or policies related to American and
Soviet militarism, but in each description, all labels identifying the United
States and Soviet Union were removed. The task for readers was to decide whether
"Superpower A" was the United States or the Soviet Union, and to
indicate on a 9-point scale how confident they were of each answer.
Based
on surveys from 3500 people, we were able to conclude two things. First,
respondents were not able to tell American and Soviet military actions apart.
Even though they would have averaged 5 items correct out of 10 just by flipping
a coin, the overall average from readers of Psychology Today--who were
more politically involved and educated than the general public--was 4.9 items
correct. Only 54 percent of the respondents correctly identified the Soviet
Union as Superpower A in the invasion of Czechoslovakia, and 25 percent mistook
the United States for the Soviet Union in the Bay of Pigs invasion. These
findings suggested that Americans were condemning Soviet actions and policies
largely because they were Soviet, not because they were radically
different from American actions and policies.
The
second thing we found was that people's confidence ratings were virtually
unrelated to their accuracy (the average correlation between confidence and
accuracy for each respondent was only .08, very close to zero). On the whole,
people who got nine or ten items correct were no more confident than less
successful respondents, and highly confident respondents scored about the same
as less confident respondents.
This
does not mean that confidence ratings were made at random; highly confident
respondents differed in a number of ways from other respondents. Two-thirds of
all highly confident respondents (i.e., those who averaged more than 8 on the 9-point
confidence scale) were male, even though the general sample was split evenly by
gender, and 80 percent were more than 30 years old. Twice as many of the highly confident respondents wanted to
increase defense spending as did less confident respondents, and nearly twice as
many felt that the Soviet government could not be trusted at all. Yet the mean
score these respondents achieved on the survey was 5.1 items correct--almost
exactly what would be expected by chance responding. Thus, highly confident
respondents could not discriminate between Soviet and American military actions,
but they were very confident of misperceived differences and advocated increased
defense spending.
As
mentioned earlier, many other studies have found little or no correlation
between confidence and accuracy (Paese & Sniezek, 1991; Ryback,
1967; Sniezek & Henry, 1989, 1990; Sniezek, Paese, & Switzer, 1990).
This general pattern is particularly apparent in research on eyewitness
testimony. By and large, these studies suggest that the confidence eyewitnesses
feel about their testimony bears little relation to how accurate the testimony
actually is (Brown, Deffenbacher, & Sturgill, 1977; Clifford & Scott,
1978; Leippe, Wells, & Ostrom, 1978). In a review of 43 separate research
findings on the relation between accuracy and confidence in eye- and
earwitnesses, Kenneth Deffenbacher (1980) found that in two-thirds of the
"forensically relevant" studies (e.g., studies in which subjects were
not instructed in advance to watch for a staged crime), the correlation between
confidence and accuracy was not significantly positive.
Findings such as these led Elizabeth Loftus (1979, p. 101), author of Eyewitness
Testimony, to caution: "One should not take high confidence as any
absolute guarantee of anything."
Similar
results have been found in clinical research. In one of the first experiments to explore this topic, Lewis
Goldberg (1959) assessed the correlation between confidence and accuracy in
clinical diagnoses. Goldberg was interested in whether clinicians could
accurately detect organic brain damage on the basis of protocols from the
Bender-Gestalt test (a test widely used to diagnose brain damage). He presented
30 different test results to four experienced clinical psychologists, ten
clinical trainees, and eight non-psychologists (secretaries). Half
of these protocols were from patients who had brain damage, and half were from
psychiatric patients who had nonorganic problems. Judges were asked to indicate
whether each patient was "organic" or "nonorganic," and to
indicate their confidence on a rating scale labeled "Positively,"
"Fairly certain," "Think so," "Maybe," or
"Blind guess."
Goldberg
found two surprising results. First, all three groups of judges--experienced
clinicians, trainees, and non-psychologists--correctly classified 65 to 70 percent of the patients. There
were no differences based on clinical experience; secretaries performed as well
as psychologists with four to ten years of clinical experience. Second,
there was no significant relationship between individual diagnostic accuracy and
degree of confidence. Judges were
generally as confident on cases they misdiagnosed as on cases they diagnosed
correctly. Subsequent studies have found miscalibration in diagnoses of cancer,
pneumonia (see Figure 19.2), and other serious medical problems (Centor, Dalton,
& Yates, 1984; Christensen-Szalanski & Bushyhead, 1981; Wallsten, 1981).
HOW CAN OVERCONFIDENCE BE REDUCED?
What
would be useful is a technique that decision makers could carry with them from
judgment to judgment--something lightweight, durable, and easy to apply in a
range of situations. And indeed, there does seem to be such a technique. The
most effective way to improve calibration seems to be very simple:
Stop
to consider reasons why your judgment might be wrong.
The
value of this technique was first documented by Asher Koriat, Sarah Lichtenstein,
and Baruch Fischhoff (1980). In this research, subjects answered two sets of
two-alternative general knowledge questions, first under control instructions
and then under reasons instructions. Under control instructions, subjects
chose an answer and estimated the probability (between .50 and 1.00) that their
answer was correct. Under reasons instructions, they were asked to list reasons
for and against each of the alternatives before choosing an answer.
Koriat,
Lichtenstein, and Fischhoff found that under control instructions, subjects
showed typical levels of overconfidence, but after generating pro and con
reasons, they became extremely well calibrated (roughly comparable to subjects
who were given intensive feedback in the study by Lichtenstein and Fischhoff).
After listing reasons for and against each of the alternatives, subjects were
less confident (primarily because they used .50 more often and 1.00 less often)
and more accurate (presumably because they devoted more thought to their
answers).
In
a follow-up experiment, Koriat, Lichtenstein, and Fischhoff found that it was
not the generation of reasons per se that led to improved calibration; rather,
it was the generation of opposing reasons. When subjects listed reasons
in support of their preferred answers, overconfidence was not reduced.
Calibration improved only when subjects considered reasons why their preferred
answers might be wrong. Although these findings may be partly a function of
"social demand characteristics" (i.e., subjects feeling cued by
instructions to tone down their confidence levels), other studies have confirmed
that the generation of opposing reasons improves calibration (e.g., Hoch, 1985).
These results are reminiscent of the study by Paul Slovic and Baruch Fischhoff (1977) discussed in Chapter 3, in which hindsight biases were reduced when subjects thought of reasons why certain experimental results might have turned out differently than they did. Since the time of Slovic and Fischhoff’s study, several experiments have shown how various judgment biases can be reduced by considering the possibility of alternative outcomes or answers (Griffin, Dunning, & Ross, 1990; Hoch, 1985; Lord, Lepper, & Preston, 1984).
FIGURE 19.4 The difficult task of considering multiple perspectives. (Calvin and Hobbes copyright 1990 Watterson. Dist. by Universal Press Syndicate. Reprinted with permission. All rights reserved.)
As
Charles Lord, Mark Lepper, and Elizabeth Preston (1984, p. 1239) pointed out:
"The observation that humans have a blind spot for opposite possibilities
is not a new one. In 1620, Francis Bacon wrote that 'it is the peculiar and
perpetual error of human intellect to be more moved and excited by affirmatives
than by negatives.'" In Chapter 20, this blind spot--and some of its
consequences--will be explored in detail.
CONCLUSION
It is important to keep research on overconfidence in perspective. In most studies, average confidence levels do not exceed accuracy by more than 10 to 20 percent. Consequently, overconfidence is unlikely to be catastrophic unless decision makers are nearly certain that their judgments are correct. As the explosion of the space shuttle illustrates, the most devastating form of miscalibration is inappropriate certainty.
Taken together, the studies in this chapter suggest several strategies for dealing with miscalibration:
First,
you may want to flag certain judgments for special consideration.
Overconfidence is greatest when judgments are difficult or confidence is
extreme. In such cases, it pays to proceed cautiously.
Second,
you may want to "recalibrate" your confidence judgments and
the judgments of others. As Lichtenstein and Fischhoff (1977) observed,
if a decision maker is 90 percent confident but only 70 to 75 percent
accurate, it is probably best to treat "90 percent confidence"
as though it were "70 to 75 percent confidence."
Along
the same lines, you may want to automatically convert judgments of
"100 percent confidence" to a lesser degree ofconfidence. One
hundred percent confidence is especially unwarranted when predicting how
people will behave (Dunning, Griffin, Milojkovic, & Ross, 1990).
Above all, if you feel extremely confident about an answer, consider reasons why a different answer might be correct. Even though you may not change your mind, your judgments will probably be better calibrated.