diff --git a/.gitignore b/.gitignore
index 1cee3417b..aedd23156 100644
--- a/.gitignore
+++ b/.gitignore
@@ -25,3 +25,5 @@ _freeze/
 *.pdf
 rsconnect
 *.md
+
+**/*.quarto_ipynb
diff --git a/_quarto-book.yml b/_quarto-book.yml
index 7f777a49d..f78818872 100644
--- a/_quarto-book.yml
+++ b/_quarto-book.yml
@@ -38,6 +38,7 @@ book:
     - appendices-are-prereqs.qmd
     - math-prereqs.qmd
     - probability.qmd
+    - classification.qmd
     - estimation.qmd
     - inference.qmd
     - intro-MLEs.qmd
diff --git a/classification.qmd b/classification.qmd
new file mode 100644
index 000000000..b85756ad5
--- /dev/null
+++ b/classification.qmd
@@ -0,0 +1,195 @@
+{{< include macros.qmd >}}
+
+# Classification {#sec-classification}
+
+---
+
+Classification problems occur frequently in epidemiology and diagnostic medicine.
+For example, we may need to determine whether an individual has a particular disease or condition based on test results or other indicators.
+
+---
+
+:::{#def-classification}
+
+#### Classification
+
+A **classification problem** is a statistical problem in which we seek to assign observations to one of two or more discrete categories (classes) based on observed features or predictors.
+In the binary case, we assign each observation to one of two classes, often labeled as "positive" or "negative", "diseased" or "healthy", etc.
+
+:::
+
+---
+
+Understanding how to interpret diagnostic tests requires knowledge of key statistical concepts including sensitivity, specificity, and predictive values.
+
+In this section, we explore how Bayes' theorem allows us to calculate the probability that a person has a disease given a positive test result.
+This is particularly important in public health decision-making, where we must understand not just how accurate a test is in general, but how to interpret test results for individuals in specific populations.
+
+---
+
+### Diagnostic test characteristics
+
+When evaluating a diagnostic test, we consider several key performance measures:
+
+:::{#def-sensitivity}
+
+#### Sensitivity
+
+The probability that the test is positive given that the person has the disease, denoted $\pmf{\text{positive} \mid \text{disease}}$.
+
+:::
+
+:::{#def-specificity}
+
+#### Specificity
+
+The probability that the test is negative given that the person does not have the disease, denoted $\pmf{\text{negative} \mid \text{no disease}}$.
+
+:::
+
+:::{#def-ppv}
+
+#### Positive Predictive Value (PPV)
+
+The probability that a person has the disease given that their test is positive, denoted $\pmf{\text{disease} \mid \text{positive}}$.
+
+:::
+
+:::{#def-npv}
+
+#### Negative Predictive Value (NPV)
+
+The probability that a person does not have the disease given that their test is negative, denoted $\pmf{\text{no disease} \mid \text{negative}}$.
+
+:::
+
+---
+
+### Example: COVID-19 testing
+
+Suppose we have a COVID-19 test with the following characteristics:
+
+- **99% sensitive**: If a person has COVID-19, the test will be positive 99% of the time
+- **99% specific**: If a person does not have COVID-19, the test will be negative 99% of the time
+
+---
+
+Let's define our events:
+
+- Let $D$ denote the event "person has COVID-19"
+- Let $+$ denote the event "test is positive"
+
+Then our test characteristics can be written as:
+
+$$
+\pmf{+ \mid D} = 0.99 \quad \text{(sensitivity)}
+$$
+
+$$
+\pmf{- \mid \neg D} = 0.99 \quad \text{(specificity)}
+$$
+
+---
+
+Note that if specificity is 0.99, then the false positive rate is:
+$$
+\pmf{+ \mid \neg D} = 1 - 0.99 = 0.01
+$$
+
+Suppose the **prevalence** of COVID-19 in the population is 7%:
+
+$$
+\pmf{D} = 0.07
+$$
+
+$$
+\pmf{\neg D} = 0.93
+$$
+
+---
+
+### Calculating positive predictive value
+
+The key question we want to answer is: **If someone tests positive, what is the probability they actually have COVID-19?**
+
+This is the positive predictive value:
+$$
+\pmf{D \mid +} = \, ?
+$$
+
+---
+
+We can use **Bayes' theorem** to calculate this:
+
+$$
+\pmf{D \mid +} = \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}}
+$$
+
+To find $\pmf{+}$, we use the **law of total probability**:
+
+$$
+\pmf{+} = \pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D}
+$$
+
+---
+
+Now we can calculate each component:
+
+**Probability of being positive with disease:**
+$$
+\pmf{+ \mid D} \cd \pmf{D} = 0.99 \times 0.07 = 0.0693
+$$
+
+**Probability of being positive without disease (false positive):**
+$$
+\pmf{+ \mid \neg D} \cd \pmf{\neg D} = 0.01 \times 0.93 = 0.0093
+$$
+
+---
+
+**Total probability of positive test:**
+$$
+\pmf{+} = 0.0693 + 0.0093 = 0.0786
+$$
+
+**Positive predictive value:**
+$$
+\pmf{D \mid +} = \frac{0.0693}{0.0786} = 0.88
+$$
+
+---
+
+Therefore, even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have COVID-19.
+This is because the disease prevalence is relatively low (7%), so false positives make up a meaningful fraction of all positive tests.
+
+::: notes
+This counterintuitive result demonstrates the importance of considering disease prevalence when interpreting test results.
+Even highly accurate tests can have relatively low positive predictive values when the disease is rare.
+:::
+
+---
+
+### Alternative formulation
+
+We can rearrange Bayes' theorem to express the positive predictive value in terms of the sensitivity, specificity, and disease prevalence:
+
+$$
+\begin{aligned}
+\pmf{D \mid +} &= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}} \\
+&= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D}} \\
+&= \frac{\pmf{D}}{\pmf{D} + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \pmf{\neg D}} \\
+&= \frac{1}{1 + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \frac{\pmf{\neg D}}{\pmf{D}}} \\
+&= \frac{1}{1 + \frac{1 - \text{spec}}{\text{sens}} \cd \frac{1 - \text{prev}}{\text{prev}}}
+\end{aligned}
+$$
+
+---
+
+This final form emphasizes the ratio of the false positive rate to the sensitivity, weighted by the ratio of non-diseased to diseased individuals in the population.
+It shows that even with a very high sensitivity and specificity, the positive predictive value depends strongly on disease prevalence.
+
+::: notes
+This algebraic form is useful for understanding how the different parameters interact.
+Notice how the prevalence ratio $\pmf{\neg D}/\pmf{D}$ appears explicitly in the denominator.
+When the disease is rare, this ratio is large, which reduces the positive predictive value.
+:::
diff --git a/macros.qmd b/macros.qmd
index c7588919f..8ef506c92 100644
--- a/macros.qmd
+++ b/macros.qmd
@@ -363,6 +363,7 @@
 \def\reglincomb{\vx \cdot \vb}
 \def\regbetasum{\beta_1 x_1+ \dots + \beta_p x_p}
 \def\pdf{\distop{f}}
+\providecommand{\pmf}[1]{\distop{P}\paren{#1}}
 \def\cdf{\distop{F}}
 \def\defLik{\Lik(\theta) \eqdef \p(\vX = \vx | \Theta = \theta)}
 \def\defLogLik{\lik \eqdef \logf{\Lik(\vx|\th)}}
diff --git a/probability.qmd b/probability.qmd
index 31062a145..fbde5c0ec 100644
--- a/probability.qmd
+++ b/probability.qmd
@@ -524,6 +524,7 @@ $\dsn{X}$.
 
 {{< include sec-CLT.qmd >}}
 
+{{< include classification.qmd >}}
 
 ## Additional resources
 
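
The worked example in `classification.qmd` (sensitivity 0.99, specificity 0.99, prevalence 0.07) can be checked numerically. The following is a minimal Python sketch and is not part of the patch; the function names `ppv` and `ppv_from_ratios` are hypothetical, and Python is used only for illustration. It computes the positive predictive value once from Bayes' theorem with the law of total probability and once from the rearranged form in "Alternative formulation", so the two expressions can be compared.

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value P(D | +) via Bayes' theorem.

    Hypothetical helper for illustration; not part of the patch.
    """
    true_pos = sens * prev                # P(+ | D) * P(D)
    false_pos = (1 - spec) * (1 - prev)   # P(+ | not D) * P(not D)
    return true_pos / (true_pos + false_pos)  # denominator is P(+)


def ppv_from_ratios(sens: float, spec: float, prev: float) -> float:
    """Same quantity, written as 1 / (1 + (1-spec)/sens * (1-prev)/prev)."""
    return 1 / (1 + ((1 - spec) / sens) * ((1 - prev) / prev))


if __name__ == "__main__":
    # Values from the COVID-19 example in the new chapter.
    direct = ppv(sens=0.99, spec=0.99, prev=0.07)
    rearranged = ppv_from_ratios(sens=0.99, spec=0.99, prev=0.07)
    print(round(direct, 4), round(rearranged, 4))  # both about 0.88
```

Calling `ppv` with a smaller `prev` while holding `sens` and `spec` fixed shows the effect described in the chapter's notes: the positive predictive value falls as the disease becomes rarer.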