d-morrison · Nov 14, 2025
diff --git a/.gitignore b/.gitignore
@@ -25,3 +25,5 @@ _freeze/
 *.pdf
 rsconnect
 *.md
+
+**/*.quarto_ipynb
diff --git a/_quarto-book.yml b/_quarto-book.yml
@@ -38,6 +38,7 @@ book:
     - appendices-are-prereqs.qmd
     - math-prereqs.qmd
     - probability.qmd
+    - classification.qmd
     - estimation.qmd
     - inference.qmd
     - intro-MLEs.qmd

diff --git a/classification.qmd b/classification.qmd
@@ -0,0 +1,195 @@
+{{< include macros.qmd >}}
+
+# Classification {#sec-classification}
+
+Classification is a fundamental concept in statistics and machine learning, where the goal is to assign items or individuals to categories based on observed features or test results. It is widely used in fields such as medicine, finance, and biology to make decisions or predictions. Understanding the performance of classification systems, such as medical tests, requires analyzing metrics like sensitivity, specificity, and predictive values. The following example illustrates how to calculate the positive predictive value of a test, which tells us how likely it is that a person actually has a condition given a positive test result.
+
+---
+
+Classification problems occur frequently in epidemiology and diagnostic medicine.
+For example, we may need to determine whether an individual has a particular disease or condition based on test results or other indicators.
+
+---
+
+:::{#def-classification}
+
+#### Classification
+
+A **classification problem** is a statistical problem in which we seek to assign observations to one of two or more discrete categories (classes) based on observed features or predictors.
+In the binary case, we assign each observation to one of two classes, often labeled as "positive" or "negative", "diseased" or "healthy", etc.
+
+:::
+
+---
+
+Understanding how to interpret diagnostic tests requires knowledge of key statistical concepts including sensitivity, specificity, and predictive values.
+
+In this section, we explore how Bayes' theorem allows us to calculate the probability that a person has a disease given a positive test result.
+This is particularly important in public health decision-making, where we must understand not just how accurate a test is in general, but how to interpret test results for individuals in specific populations.
+
+---
+
+### Diagnostic test characteristics
+
+When evaluating a diagnostic test, we consider several key performance measures:
+
+:::{#def-sensitivity}
+
+#### Sensitivity
+
+The probability that the test is positive given that the person has the disease, denoted $\pmf{\text{positive} \mid \text{disease}}$.
+
+:::
+
+:::{#def-specificity}
+
+#### Specificity
+
+The probability that the test is negative given that the person does not have the disease, denoted $\pmf{\text{negative} \mid \text{no disease}}$.
+
+:::
+
+:::{#def-ppv}
+
+#### Positive Predictive Value (PPV)
+
+The probability that a person has the disease given that their test is positive, denoted $\pmf{\text{disease} \mid \text{positive}}$.
+
+:::
+
+:::{#def-npv}
+
+#### Negative Predictive Value (NPV)
+
+The probability that a person does not have the disease given that their test is negative, denoted $\pmf{\text{no disease} \mid \text{negative}}$.
+
+:::
+
+---
+
+### Example: COVID-19 testing
+
+Suppose we have a COVID-19 test with the following characteristics:
+
+- **99% sensitive**: If a person has COVID-19, the test will be positive 99% of the time
+- **99% specific**: If a person does not have COVID-19, the test will be negative 99% of the time
+
+---
+
+Let's define our events:
+
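+- Let $D$ denote the event "person has COVID-19", and let $\neg D$ denote its complement, "person does not have COVID-19"
+- Let $+$ denote the event "test is positive", and let $-$ denote its complement, "test is negative"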
+
+Then our test characteristics can be written as:
+
+$$
+\pmf{+ \mid D} = 0.99 \quad \text{(sensitivity)}
+$$
+
+$$
+\pmf{- \mid \neg D} = 0.99 \quad \text{(specificity)}
+$$
+
+---
+
+Note that if specificity is 0.99, then the false positive rate is:
+$$
+\pmf{+ \mid \neg D} = 1 - 0.99 = 0.01
+$$
+
+Suppose the **prevalence** of COVID-19 in the population is 7%:
+
+$$
+\pmf{D} = 0.07
+$$
+
+$$
+\pmf{\neg D} = 0.93
+$$
+
+---
+
+### Calculating positive predictive value
+
+The key question we want to answer is: **If someone tests positive, what is the probability they actually have COVID-19?**
+
+This is the positive predictive value:
+$$
+\pmf{D \mid +} = \, ?
+$$
+
+---
+
+We can use **Bayes' theorem** to calculate this:
+
+$$
+\pmf{D \mid +} = \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}}
+$$
+
+To find $\pmf{+}$, we use the **law of total probability**:
+
+$$
+\pmf{+} = \pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D}
+$$
+
+---
+
+Now we can calculate each component:
+
+**Probability of being positive with disease:**
+$$
+\pmf{+ \mid D} \cd \pmf{D} = 0.99 \times 0.07 = 0.0693
+$$
+
+**Probability of being positive without disease (false positive):**
+$$
+\pmf{+ \mid \neg D} \cd \pmf{\neg D} = 0.01 \times 0.93 = 0.0093
+$$
+
+---
+
+**Total probability of positive test:**
+$$
+\pmf{+} = 0.0693 + 0.0093 = 0.0786
+$$
+
+**Positive predictive value:**
+$$
+\pmf{D \mid +} = \frac{0.0693}{0.0786} \approx 0.88
+$$
+
+---
+
+Therefore, even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have COVID-19.
+This is because the disease prevalence is relatively low (7%), so false positives account for about 12% of all positive tests ($0.0093 / 0.0786 \approx 0.12$).
+
+::: notes
+This counterintuitive result demonstrates the importance of considering disease prevalence when interpreting test results.
+Even highly accurate tests can have relatively low positive predictive values when the disease is rare.
+:::
+
+---
+
+### Alternative formulation
+
+We can rearrange Bayes' theorem to express the positive predictive value in terms of the sensitivity, specificity, and disease prevalence:
+
+$$
+\begin{aligned}
+\pmf{D \mid +} &= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}} \\
+&= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D}} \\
+&= \frac{\pmf{D}}{\pmf{D} + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \pmf{\neg D}} \\
+&= \frac{1}{1 + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \frac{\pmf{\neg D}}{\pmf{D}}} \\
+&= \frac{1}{1 + \frac{1 - \text{spec}}{\text{sens}} \cd \frac{1 - \text{prev}}{\text{prev}}}
+\end{aligned}
+$$
+
+---
+
+This final form emphasizes the ratio of the false positive rate to the sensitivity, weighted by the ratio of non-diseased to diseased individuals in the population.
+It shows that even with a very high sensitivity and specificity, the positive predictive value depends strongly on disease prevalence.
+
+::: notes
+This algebraic form is useful for understanding how the different parameters interact.
+Notice how the ratio $\pmf{\neg D}/\pmf{D}$ (the odds against having the disease) appears explicitly in the denominator.
+When the disease is rare, this ratio is large, which reduces the positive predictive value.
+:::
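
Review note: the arithmetic in the worked example of `classification.qmd` can be checked numerically. The sketch below is not part of the patch; the function names `ppv_bayes` and `ppv_closed_form` are hypothetical and chosen only for illustration. It computes the positive predictive value twice, once from Bayes' theorem with the law of total probability and once from the final line of the "Alternative formulation" derivation, and both should agree.

```python
def ppv_bayes(sens, spec, prev):
    """PPV from Bayes' theorem plus the law of total probability."""
    true_pos = sens * prev                # P(+ | D) * P(D)
    false_pos = (1 - spec) * (1 - prev)   # P(+ | not D) * P(not D)
    return true_pos / (true_pos + false_pos)


def ppv_closed_form(sens, spec, prev):
    """PPV from the rearranged form: 1 / (1 + (1-spec)/sens * (1-prev)/prev)."""
    return 1 / (1 + ((1 - spec) / sens) * ((1 - prev) / prev))


# Values from the COVID-19 example: 99% sensitive, 99% specific, 7% prevalence.
ppv = ppv_bayes(sens=0.99, spec=0.99, prev=0.07)
print(round(ppv, 4))  # 0.8817
print(abs(ppv - ppv_closed_form(0.99, 0.99, 0.07)) < 1e-12)  # True
```

The result rounds to 0.88, matching the value reported in the hunk.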
diff --git a/macros.qmd b/macros.qmd
@@ -363,6 +363,7 @@
 \def\reglincomb{\vx \cdot \vb}
 \def\regbetasum{\beta_1 x_1+ \dots + \beta_p x_p}
 \def\pdf{\distop{f}}
+\providecommand{\pmf}[1]{\distop{P}\paren{#1}}
 \def\cdf{\distop{F}}
 \def\defLik{\Lik(\theta) \eqdef \p(\vX = \vx | \Theta = \theta)}
 \def\defLogLik{\lik \eqdef \logf{\Lik(\vx|\th)}}

diff --git a/probability.qmd b/probability.qmd
@@ -524,6 +524,7 @@ $\dsn{X}$.
 
 {{< include sec-CLT.qmd >}}
 
+{{< include classification.qmd >}}
 
 ## Additional resources
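
Review note: the claim in `classification.qmd` that the positive predictive value depends strongly on prevalence could be illustrated with a small sweep. This is a sketch only, not part of the patch; the helper name `ppv` and the list of prevalences are assumptions made for illustration. The test characteristics (99% sensitivity and specificity) are taken from the example in the diff.

```python
def ppv(sens, spec, prev):
    """Closed-form PPV: 1 / (1 + (1-spec)/sens * (1-prev)/prev)."""
    return 1 / (1 + ((1 - spec) / sens) * ((1 - prev) / prev))


# Hypothetical prevalences chosen for illustration; 0.07 is the value in the example.
prevalences = [0.001, 0.01, 0.07, 0.5]

# Same test in every row; only the prevalence changes.
table = {prev: ppv(sens=0.99, spec=0.99, prev=prev) for prev in prevalences}

for prev, value in table.items():
    print(f"prevalence = {prev:.3f}  PPV = {value:.3f}")
```

With sensitivity and specificity both at 0.99, a prevalence of 1% gives a PPV of exactly one half, because true positives and false positives are then equally likely.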