# ROUGE-1 Score Learning Guide

## Solution Explanation

ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation) is a fundamental metric for evaluating the quality of automatically generated summaries by comparing them to reference summaries. The "1" in ROUGE-1 refers to unigrams (single words), making it the most basic yet most widely used variant of the ROUGE metrics.

### Intuition

Imagine you're a teacher grading a student's book summary. You have a reference summary (the "gold standard") and want to measure how well the student's summary captures the key information. ROUGE-1 essentially counts how many important words from the reference summary appear in the student's summary.

The core idea is simple: **if a generated summary contains many of the same words as a high-quality reference summary, it is likely capturing similar content and is therefore of good quality.**

### Mathematical Foundation

ROUGE-1 is built on three fundamental components:

**1. Precision (P)**
$$P = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in generated summary}}$$

**2. Recall (R)**
$$R = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in reference summary}}$$

**3. F1-Score (F)**
$$F = \frac{2 \times P \times R}{P + R}$$

where an "overlapping unigram" is a word that appears in both the generated summary and the reference summary.

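To make these formulas concrete, here is a minimal Python sketch of the computation. The function name `rouge1_scores`, the lowercasing, and the whitespace tokenization are illustrative choices, not an official implementation; overlap is counted with clipped (minimum) counts per word, which matches the more formal definition given later in this guide.

```python
from collections import Counter


def rouge1_scores(generated: str, reference: str):
    """Return (precision, recall, f1) for ROUGE-1 between two summaries."""
    # Simple preprocessing: lowercase and split on whitespace.
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: each shared word counts min(times in generated, times in reference).
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```
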
### Step-by-Step Calculation Process

Let's work through a concrete example:

**Reference Summary:** "The quick brown fox jumps over the lazy dog"
**Generated Summary:** "A quick fox jumps over a lazy cat"

**Step 1: Tokenization**
- Reference tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
- Generated tokens: ["A", "quick", "fox", "jumps", "over", "a", "lazy", "cat"]

**Step 2: Identify Overlapping Unigrams**
Overlapping words (case-insensitive): ["quick", "fox", "jumps", "over", "lazy"]
- Count of overlapping unigrams: 5

**Step 3: Calculate Precision**
$$P = \frac{5}{8} = 0.625$$
*Interpretation: 62.5% of words in the generated summary appear in the reference*

**Step 4: Calculate Recall**
$$R = \frac{5}{9} = 0.556$$
*Interpretation: 55.6% of words in the reference summary are captured in the generated summary*

**Step 5: Calculate F1-Score**
$$F = \frac{2 \times 0.625 \times 0.556}{0.625 + 0.556} = \frac{0.695}{1.181} = 0.588$$

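Plugging the example into the `rouge1_scores` sketch from the Mathematical Foundation section reproduces these numbers (values rounded to three decimals):

```python
p, r, f = rouge1_scores(
    generated="A quick fox jumps over a lazy cat",
    reference="The quick brown fox jumps over the lazy dog",
)
print(f"Precision: {p:.3f}")  # 0.625  (5 / 8)
print(f"Recall:    {r:.3f}")  # 0.556  (5 / 9)
print(f"F1-Score:  {f:.3f}")  # 0.588
```
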
### Understanding the Components

**Precision answers:** "Of all the words in my generated summary, how many are actually relevant (appear in the reference)?"
- High precision means the generated summary doesn't contain many irrelevant words
- Low precision suggests the summary is verbose or off-topic

**Recall answers:** "Of all the important words in the reference, how many did my generated summary capture?"
- High recall means the generated summary covers most key information
- Low recall suggests the summary misses important content

**F1-Score provides:** A balanced measure that penalizes both missing important information (low recall) and including irrelevant information (low precision)

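One quick way to feel this trade-off is to run the `rouge1_scores` sketch from above on a few extreme cases (the inputs here are made up purely for illustration):

```python
reference = "The quick brown fox jumps over the lazy dog"

# Copying the whole reference: perfect recall (and here perfect precision too).
print(rouge1_scores(reference, reference))  # (1.0, 1.0, 1.0)

# A single correct word: perfect precision, but very low recall.
print(rouge1_scores("fox", reference))      # (1.0, ~0.111, 0.2)

# A long, padded summary: recall stays at 1.0 while precision drops to 0.6.
print(rouge1_scores(reference + " and then some extra unrelated words", reference))
```
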
### Advanced Considerations

**Preprocessing Steps** (a minimal sketch follows this list):
1. **Case normalization:** Convert all text to lowercase
2. **Tokenization:** Split text into individual words
3. **Stop word handling:** Optionally remove common words like "the", "and", "is"
4. **Stemming/Lemmatization:** Optionally reduce words to their root forms

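A minimal sketch of such a preprocessing pipeline using only the standard library (the stop-word list here is a tiny illustrative sample, not a standard one; stemming or lemmatization would typically come from a library such as NLTK and is omitted here):

```python
import re

# A tiny illustrative stop-word list; real lists (e.g. NLTK's) are much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}


def preprocess(text: str, remove_stop_words: bool = False) -> list[str]:
    """Lowercase, tokenize on word characters, and optionally drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(preprocess("The quick brown fox jumps over the lazy dog"))
print(preprocess("The quick brown fox jumps over the lazy dog", remove_stop_words=True))
```
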
**Mathematical Variants:**
- **ROUGE-1 Precision:** $P = \frac{\sum_{i} \text{Count}_{\text{match}}(\text{unigram}_i)}{\sum_{i} \text{Count}_{\text{gen}}(\text{unigram}_i)}$
- **ROUGE-1 Recall:** $R = \frac{\sum_{i} \text{Count}_{\text{match}}(\text{unigram}_i)}{\sum_{i} \text{Count}_{\text{ref}}(\text{unigram}_i)}$

where $\text{Count}_{\text{match}}(\text{unigram}_i)$ is the minimum of the counts of $\text{unigram}_i$ in the generated and reference summaries, $\text{Count}_{\text{gen}}(\text{unigram}_i)$ is its count in the generated summary, and $\text{Count}_{\text{ref}}(\text{unigram}_i)$ is its count in the reference summary.

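To see why the minimum (clipped) count matters, consider a generated summary that repeats a reference word more often than it appears in the reference. A small illustration with `collections.Counter` (the sentences are made up for this example):

```python
from collections import Counter

ref_counts = Counter("the cat sat on the mat".split())  # "the" appears twice
gen_counts = Counter("the the the cat".split())         # "the" appears three times

# Count_match clips each word at the smaller of its two frequencies: min(3, 2) = 2 for "the".
match_counts = gen_counts & ref_counts
print(match_counts)  # Counter({'the': 2, 'cat': 1})

precision = sum(match_counts.values()) / sum(gen_counts.values())  # 3 / 4 = 0.75
recall = sum(match_counts.values()) / sum(ref_counts.values())     # 3 / 6 = 0.50
print(precision, recall)
```
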
### Practical Implementation Insights

**Handling Multiple References:**
When multiple reference summaries exist, ROUGE-1 can be calculated against each reference separately, then the maximum score is typically taken:

$$\text{ROUGE-1} = \max_{j} \text{ROUGE-1}(\text{generated}, \text{reference}_j)$$

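Under that convention, the multi-reference score can be computed by scoring against each reference and keeping the best result. A sketch reusing the `rouge1_scores` function from earlier (taking the maximum F1 is one common convention, not the only possible choice):

```python
def rouge1_multi_ref(generated: str, references: list[str]) -> float:
    """Best ROUGE-1 F1 of the generated summary against any single reference."""
    return max(rouge1_scores(generated, ref)[2] for ref in references)


references = [
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog",
]
print(rouge1_multi_ref("A quick fox jumps over a lazy cat", references))  # ~0.588
```
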
**Limitations to Consider:**
- **Word order independence:** ROUGE-1 ignores sentence structure and word order
- **Semantic blindness:** Synonyms and paraphrases aren't recognized
- **Length bias:** Longer summaries may achieve higher recall simply by including more words

### Real-World Applications

ROUGE-1 is extensively used in:
- **Automatic summarization evaluation** (news articles, scientific papers)
- **Machine translation quality assessment** (as a secondary metric)
- **Question answering systems** (evaluating answer quality)
- **Chatbot response evaluation** (measuring relevance to expected responses)
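
In practice, most projects rely on an existing implementation rather than hand-rolled code. One widely used option is the `rouge-score` package on PyPI (`pip install rouge-score`); a minimal usage sketch, assuming that package is installed:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
result = scorer.score(
    "The quick brown fox jumps over the lazy dog",  # reference (target)
    "A quick fox jumps over a lazy cat",            # generated (prediction)
)
print(result["rouge1"])  # Score(precision=..., recall=..., fmeasure=...)
```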