Applications With Probability: Learn It 1

  • Calculate conditional probability using Bayes’ Theorem
  • Solve counting problems
  • Calculate the average value of a random event

Bayes’ Theorem

The problem below provides an excellent example of how thinking carefully through a problem can provide more, and longer lasting, insight than would be obtained by memorizing a formula. Certainly, some formulas are handy once memorized (Bayes’ Theorem being one of them), but understanding the underlying conditions of the formula can be extremely valuable.

Suppose a certain disease has an incidence rate of [latex]0.1\%[/latex] (that is, it afflicts [latex]0.1\%[/latex] of the population). A test has been devised to detect this disease. The test does not produce false negatives (that is, anyone who has the disease will test positive for it), but the false positive rate is [latex]5\%[/latex] (that is, about [latex]5\%[/latex] of people who take the test will test positive, even though they do not have the disease). Suppose a randomly selected person takes the test and tests positive.  What is the probability that this person actually has the disease?

There are two ways to approach the solution to this problem. One involves an important result in probability theory called Bayes’ Theorem. We will discuss this theorem a bit later, but for now we will use an alternative approach that we hope is much more intuitive.

Let’s break down the information in the problem piece by piece as an example.

  • Suppose a certain disease has an incidence rate of [latex]0.1\%[/latex] (that is, it afflicts [latex]0.1\%[/latex] of the population)
    • The percentage [latex]0.1\%[/latex] can be converted to a decimal number by moving the decimal point two places to the left, to get [latex]0.001[/latex]. In turn, [latex]0.001[/latex] can be rewritten as a fraction: [latex]1/1000[/latex]. This tells us that about [latex]1[/latex] in every [latex]1000[/latex] people has the disease. (If we wanted, we could write [latex]P(\text{disease})=0.001[/latex].)
  • A test has been devised to detect this disease.  The test does not produce false negatives (that is, anyone who has the disease will test positive for it)
    •  This part is fairly straightforward: everyone who has the disease will test positive, or alternatively everyone who tests negative does not have the disease. (We could also say [latex]P(\text{positive} | \text{disease})=1[/latex].)
  • The false positive rate is [latex]5\%[/latex] (that is, about [latex]5\%[/latex] of people who take the test will test positive, even though they do not have the disease)
    •  This is even more straightforward. Another way of looking at it is that of every [latex]100[/latex] people who are tested and do not have the disease, [latex]5[/latex] will test positive even though they do not have the disease. (We could also say that [latex]P(\text{positive} | \text{no disease})=0.05[/latex].)
  • Suppose a randomly selected person takes the test and tests positive.  What is the probability that this person actually has the disease?
    • Here we want to compute [latex]P(\text{disease} | \text{positive})[/latex]. We already know that [latex]P(\text{positive} | \text{disease})=1[/latex], but remember that two conditional probabilities are generally not equal when the event and the condition are switched.

Rather than thinking in terms of all these probabilities we have developed, let’s create a hypothetical situation and apply the facts as set out above. First, suppose we randomly select [latex]1000[/latex] people and administer the test. How many do we expect to have the disease? Since about [latex]1/1000[/latex] of all people are afflicted with the disease, [latex]1/1000[/latex] of [latex]1000[/latex] people is [latex]1[/latex]. (Now you know why we chose [latex]1000[/latex].) Only [latex]1[/latex] of [latex]1000[/latex] test subjects actually has the disease; the other [latex]999[/latex] do not.

We also know that [latex]5\%[/latex] of all people who do not have the disease will test positive. There are [latex]999[/latex] disease-free people, so we would expect [latex](0.05)(999)=49.95[/latex] (so, about [latex]50[/latex]) people to test positive who do not have the disease.

Now back to the original question, computing [latex]P(\text{disease} | \text{positive})[/latex]. There are [latex]51[/latex] people who test positive in our example (the one unfortunate person who actually has the disease, plus the [latex]50[/latex] people who tested positive but don’t). Only one of these people has the disease, so

[latex]P(\text{disease} | \text{positive})\approx\frac{1}{51}\approx0.0196[/latex]

or less than [latex]2\%[/latex]. Does this surprise you? This means that of all people who test positive, over [latex]98\%[/latex] do not have the disease.
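The head count above can be reproduced in a few lines of Python. This is only a sketch of the reasoning: the variable names are our own, and the group of [latex]1000[/latex] people is the same hypothetical group used in the text. It skips the rounding of [latex]49.95[/latex] to [latex]50[/latex], so the result differs very slightly from [latex]1/51[/latex].

```python
# Natural-frequency count for a hypothetical group of 1000 people.
population = 1000
incidence = 0.001            # P(disease)
false_positive_rate = 0.05   # P(positive | no disease)

sick = population * incidence                    # about 1 person has the disease
healthy = population - sick                      # about 999 people do not
true_positives = sick * 1.0                      # no false negatives: every sick person tests positive
false_positives = healthy * false_positive_rate  # about 49.95 people

# Share of all positive tests that belong to people who really have the disease.
share_sick = true_positives / (true_positives + false_positives)
print(round(false_positives, 2), round(share_sick, 4))
```

The printed share is just under [latex]2\%[/latex], matching the count done by hand.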

The answer we got was only approximate because we rounded [latex]49.95[/latex] to [latex]50[/latex]. We could redo the problem with [latex]100,000[/latex] test subjects, [latex]100[/latex] of whom would have the disease and [latex](0.05)(99,900)=4995[/latex] of whom would test positive without having the disease, so the exact probability of having the disease if you test positive is

[latex]P(\text{disease} | \text{positive})=\frac{100}{5095}\approx0.0196[/latex]

which is pretty much the same answer.
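To see that the exact answer does not depend on how many test subjects we imagine, the same count can be done with exact fractions. The sketch below uses Python’s standard `fractions` module; the function name `share_with_disease` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def share_with_disease(population):
    """Exact share of positive tests that come from people with the disease."""
    incidence = Fraction(1, 1000)           # 0.1% of people have the disease
    false_positive_rate = Fraction(5, 100)  # 5% of disease-free people test positive
    sick = population * incidence           # everyone who is sick tests positive
    false_positives = (population - sick) * false_positive_rate
    return sick / (sick + false_positives)

for n in (1000, 100_000):
    exact = share_with_disease(n)
    print(n, exact, float(exact))
```

Both group sizes give the same fraction, [latex]20/1019[/latex], which is [latex]100/5095[/latex] in lowest terms.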

But back to the surprising result. Of all people who test positive, over [latex]98\%[/latex] do not have the disease. If your guess for the probability that a person who tests positive has the disease was wildly different from the right answer (about [latex]2\%[/latex]), don’t feel bad. The exact same problem was posed to doctors and medical students at Harvard Medical School, and the results, reported in a 1978 New England Journal of Medicine article, showed that only about [latex]18\%[/latex] of the participants got the right answer. Most of the rest thought the answer was closer to [latex]95\%[/latex] (perhaps they were misled by the false positive rate of [latex]5\%[/latex]).


This example can also be solved using Bayes’ Theorem.

Bayes’ Theorem

[latex]P(A|B)=\frac{P(A)P(B|A)}{P(A)P(B|A)+P(\bar{A})P(B|\bar{A})}[/latex]

In our earlier example, this translates to

[latex]P(\text{disease}|\text{positive})=\frac{P(\text{disease})P(\text{positive}|\text{disease})}{P(\text{disease})P(\text{positive}|\text{disease})+P(\text{no disease})P(\text{positive}|\text{no disease})}[/latex]

Plugging in the numbers gives

[latex]P(\text{disease}|\text{positive})=\frac{(0.001)(1)}{(0.001)(1)+(0.999)(0.05)}\approx0.0196[/latex]

which is exactly the same answer as our original solution.
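The formula is also easy to turn into a small function. The following is a minimal sketch; the function name `bayes` and its parameter names are our own, with `prior` standing for [latex]P(A)[/latex], `p_b_given_a` for [latex]P(B|A)[/latex], and `p_b_given_not_a` for [latex]P(B|\bar{A})[/latex].

```python
def bayes(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A)."""
    numerator = prior * p_b_given_a
    # P(not A) is 1 - P(A); the denominator is the total probability of B.
    denominator = numerator + (1 - prior) * p_b_given_not_a
    return numerator / denominator

# Disease example: A = "has the disease", B = "tests positive".
p = bayes(prior=0.001, p_b_given_a=1.0, p_b_given_not_a=0.05)
print(round(p, 4))  # prints 0.0196
```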

The trouble with the formula is that you (or the typical medical student, or even the typical math professor) are much more likely to remember the original solution than to remember Bayes’ Theorem. Psychologists such as Gerd Gigerenzer, author of Calculated Risks: How to Know When Numbers Deceive You, have advocated that the method used in the original solution (which Gigerenzer calls the method of “natural frequencies”) be employed in place of Bayes’ Theorem. Gigerenzer performed a study and found that those educated in the natural frequency method were able to recall it far longer than those who were taught Bayes’ Theorem. When one considers the possible life-and-death consequences associated with such calculations, it seems wise to heed his advice.