add classification notes #265
Open
d-morrison wants to merge 14 commits into main from classification
14 commits
b5efbe4 add classification notes (d-morrison)
bf43dee Initial plan (Copilot)
11ba1a6 Complete and polish classification section introduction (Copilot)
4abab2a Update classification.qmd (d-morrison)
3a95514 Update classification.qmd (d-morrison)
e9990f3 Update classification.qmd (d-morrison)
9c4b4d2 Merge classification branch keeping polished version (Copilot)
4f30f9b Add newlines after sentences and use custom macros (Copilot)
42af021 Simplify sentence structure and add formal classification definition (Copilot)
814d956 Add slide breaks and speaker notes for RevealJS format (Copilot)
b4811b5 Fix LaTeX equation nesting error: use aligned instead of align (Copilot)
08992c4 Add definition blocks for diagnostic test characteristics and abbrevi… (Copilot)
c42ca40 Move classification section to appendix chapter (Copilot)
890fca9 Merge pull request #266 from d-morrison/copilot/sub-pr-265 (d-morrison)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,3 +25,5 @@ _freeze/ | |
| rsconnect | ||
| *.md | ||
|
|
||
| **/*.quarto_ipynb | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,195 @@ | ||
| {{< include macros.qmd >}} | ||
|
|
||
| # Classification {#sec-classification} | ||
|
|
||
| --- | ||
|
|
||
| Classification problems occur frequently in epidemiology and diagnostic medicine. | ||
| For example, we may need to determine whether an individual has a particular disease or condition based on test results or other indicators. | ||
|
|
||
| --- | ||
|
|
||
| :::{#def-classification} | ||
|
|
||
| #### Classification | ||
|
|
||
| A **classification problem** is a statistical problem in which we seek to assign observations to one of two or more discrete categories (classes) based on observed features or predictors. | ||
| In the binary case, we assign each observation to one of two classes, often labeled as "positive" or "negative", "diseased" or "healthy", etc. | ||
|
|
||
| ::: | ||
|
|
||
| --- | ||
|
|
||
| Understanding how to interpret diagnostic tests requires knowledge of key statistical concepts including sensitivity, specificity, and predictive values. | ||
|
|
||
| In this section, we explore how Bayes' theorem allows us to calculate the probability that a person has a disease given a positive test result. | ||
| This is particularly important in public health decision-making, where we must understand not just how accurate a test is in general, but how to interpret test results for individuals in specific populations. | ||
|
|
||
| --- | ||
|
|
||
| ### Diagnostic test characteristics | ||
|
|
||
| When evaluating a diagnostic test, we consider several key performance measures: | ||
|
|
||
| :::{#def-sensitivity} | ||
|
|
||
| #### Sensitivity | ||
|
|
||
| The probability that the test is positive given that the person has the disease, denoted $\pmf{\text{positive} \mid \text{disease}}$. | ||
|
|
||
| ::: | ||
|
|
||
| :::{#def-specificity} | ||
|
|
||
| #### Specificity | ||
|
|
||
| The probability that the test is negative given that the person does not have the disease, denoted $\pmf{\text{negative} \mid \text{no disease}}$. | ||
|
|
||
| ::: | ||
|
|
||
| :::{#def-ppv} | ||
|
|
||
| #### Positive Predictive Value (PPV) | ||
|
|
||
| The probability that a person has the disease given that their test is positive, denoted $\pmf{\text{disease} \mid \text{positive}}$. | ||
|
|
||
| ::: | ||
|
|
||
| :::{#def-npv} | ||
|
|
||
| #### Negative Predictive Value (NPV) | ||
|
|
||
| The probability that a person does not have the disease given that their test is negative, denoted $\pmf{\text{no disease} \mid \text{negative}}$. | ||
|
|
||
| ::: | ||
|
|
||
| --- | ||
|
|
||
| ### Example: COVID-19 testing | ||
|
|
||
| Suppose we have a COVID-19 test with the following characteristics: | ||
|
|
||
| - **99% sensitive**: If a person has COVID-19, the test will be positive 99% of the time | ||
| - **99% specific**: If a person does not have COVID-19, the test will be negative 99% of the time | ||
|
|
||
| --- | ||
|
|
||
| Let's define our events: | ||
|
|
||
| - Let $D$ denote the event "person has COVID-19", and let $\neg D$ denote its complement | ||
| - Let $+$ denote the event "test is positive", and let $-$ denote the event "test is negative" | ||
|
|
||
| Then our test characteristics can be written as: | ||
|
|
||
| $$ | ||
| \pmf{+ \mid D} = 0.99 \quad \text{(sensitivity)} | ||
| $$ | ||
|
|
||
| $$ | ||
| \pmf{- \mid \neg D} = 0.99 \quad \text{(specificity)} | ||
| $$ | ||
|
|
||
| --- | ||
|
|
||
| Note that if specificity is 0.99, then the false positive rate is: | ||
| $$ | ||
| \pmf{+ \mid \neg D} = 1 - 0.99 = 0.01 | ||
| $$ | ||
|
|
||
| Suppose the **prevalence** of COVID-19 in the population is 7%: | ||
|
|
||
| $$ | ||
| \pmf{D} = 0.07 | ||
| $$ | ||
|
|
||
| $$ | ||
| \pmf{\neg D} = 0.93 | ||
| $$ | ||
|
|
||
| --- | ||
|
|
||
| ### Calculating positive predictive value | ||
|
|
||
| The key question we want to answer is: **If someone tests positive, what is the probability they actually have COVID-19?** | ||
|
|
||
| This is the positive predictive value: | ||
| $$ | ||
| \pmf{D \mid +} = \, ? | ||
| $$ | ||
|
|
||
| --- | ||
|
|
||
| We can use **Bayes' theorem** to calculate this: | ||
|
|
||
| $$ | ||
| \pmf{D \mid +} = \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}} | ||
| $$ | ||
|
|
||
| To find $\pmf{+}$, we use the **law of total probability**: | ||
|
|
||
| $$ | ||
| \pmf{+} = \pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D} | ||
| $$ | ||
|
|
||
| --- | ||
|
|
||
| Now we can calculate each component: | ||
|
|
||
| **Probability of being positive with disease:** | ||
| $$ | ||
| \pmf{+ \mid D} \cd \pmf{D} = 0.99 \times 0.07 = 0.0693 | ||
| $$ | ||
|
|
||
| **Probability of being positive without disease (false positive):** | ||
| $$ | ||
| \pmf{+ \mid \neg D} \cd \pmf{\neg D} = 0.01 \times 0.93 = 0.0093 | ||
| $$ | ||
|
|
||
| --- | ||
|
|
||
| **Total probability of positive test:** | ||
| $$ | ||
| \pmf{+} = 0.0693 + 0.0093 = 0.0786 | ||
| $$ | ||
|
|
||
| **Positive predictive value:** | ||
| $$ | ||
| \pmf{D \mid +} = \frac{0.0693}{0.0786} \approx 0.88 | ||
| $$ | ||
|
|
||
| --- | ||
|
|
||
| Therefore, even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have COVID-19. | ||
| This is because the disease prevalence is relatively low (7%), so false positives make up a meaningful fraction of all positive tests. | ||
|
|
||
| ::: notes | ||
| This counterintuitive result demonstrates the importance of considering disease prevalence when interpreting test results. | ||
| Even highly accurate tests can have relatively low positive predictive values when the disease is rare. | ||
| ::: | ||
|
|
||
| --- | ||
|
|
||
| ### Alternative formulation | ||
|
|
||
| We can rearrange Bayes' theorem to express the positive predictive value in terms of the sensitivity, specificity, and disease prevalence: | ||
|
|
||
| $$ | ||
| \begin{aligned} | ||
| \pmf{D \mid +} &= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+}} \\ | ||
| &= \frac{\pmf{+ \mid D} \cd \pmf{D}}{\pmf{+ \mid D} \cd \pmf{D} + \pmf{+ \mid \neg D} \cd \pmf{\neg D}} \\ | ||
| &= \frac{\pmf{D}}{\pmf{D} + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \pmf{\neg D}} \\ | ||
| &= \frac{1}{1 + \frac{\pmf{+ \mid \neg D}}{\pmf{+ \mid D}} \cd \frac{\pmf{\neg D}}{\pmf{D}}} \\ | ||
| &= \frac{1}{1 + \frac{1 - \text{spec}}{\text{sens}} \cd \frac{1 - \text{prev}}{\text{prev}}} | ||
| \end{aligned} | ||
| $$ | ||
|
|
||
| --- | ||
|
|
||
| This final form emphasizes the ratio of the false positive rate to the sensitivity, weighted by the ratio of non-diseased to diseased individuals in the population. | ||
| It shows that even with a very high sensitivity and specificity, the positive predictive value depends strongly on disease prevalence. | ||
|
|
||
| ::: notes | ||
| This algebraic form is useful for understanding how the different parameters interact. | ||
| Notice how the ratio $\pmf{\neg D}/\pmf{D}$ (the odds against having the disease) appears explicitly in the denominator. | ||
| When the disease is rare, this ratio is large, which reduces the positive predictive value. | ||
| ::: | ||
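The worked example and the rearranged formula in `classification.qmd` can be checked numerically. The following is a minimal Python sketch and is not part of the diff. The function names `ppv_bayes` and `ppv_odds_form` are hypothetical and chosen only for illustration. The inputs are the values stated in the example (sensitivity 0.99, specificity 0.99, prevalence 0.07).

```python
def ppv_bayes(sens: float, spec: float, prev: float) -> float:
    """PPV from Bayes' theorem, with P(+) from the law of total probability."""
    true_pos = sens * prev                 # P(+ | D) * P(D)
    false_pos = (1 - spec) * (1 - prev)    # P(+ | not D) * P(not D)
    return true_pos / (true_pos + false_pos)


def ppv_odds_form(sens: float, spec: float, prev: float) -> float:
    """PPV from the rearranged form: 1 / (1 + (1 - spec)/sens * (1 - prev)/prev)."""
    return 1 / (1 + ((1 - spec) / sens) * ((1 - prev) / prev))


if __name__ == "__main__":
    # Hypothetical sweep: hold the test fixed and vary only the prevalence.
    for prev in (0.07, 0.01, 0.001):
        a = ppv_bayes(0.99, 0.99, prev)
        b = ppv_odds_form(0.99, 0.99, prev)
        print(f"prevalence={prev:<6} PPV (Bayes)={a:.4f}  PPV (odds form)={b:.4f}")
```

Both functions should agree up to floating-point rounding, since the second is an algebraic rearrangement of the first. With these test characteristics, a prevalence of 1% gives a PPV of one half, because the true-positive and false-positive masses are both 0.0099.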
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -524,6 +524,7 @@ $\dsn{X}$. | |
|
|
||
| {{< include sec-CLT.qmd >}} | ||
|
|
||
| {{< include classification.qmd >}} | ||
|
|
||
| ## Additional resources | ||
|
|
||
|
|
||
Missing section content. The "Introduction to classification" section header is followed immediately by a subsection without any introductory content explaining what classification is, why it's relevant, or how the positive predictive value example relates to the broader topic.