
Commit 3d1abd4

Merge pull request #488 from kartik-git/add-q-rouge-score
Adding new question for ROUGE-1 Score #144
2 parents ca5ad32 + a674cb8 commit 3d1abd4

File tree

7 files changed: +166 -0 lines changed
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
## Problem

Implement the ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation) score to evaluate the quality of a generated summary by comparing it to a reference summary. ROUGE-1 focuses on unigram (single-word) overlaps between the candidate and reference texts.

Your task is to write a function that computes the ROUGE-1 recall, precision, and F1 score based on the number of overlapping unigrams.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
  "input": "rouge_1_score('the cat sat on the mat', 'the cat is on the mat')",
  "output": "{'precision': 0.8333333333333334, 'recall': 0.8333333333333334, 'f1': 0.8333333333333334}",
  "reasoning": "The reference text 'the cat sat on the mat' has 6 tokens, and the candidate text 'the cat is on the mat' has 6 tokens. The overlapping words are: 'the' (appears 2 times in the reference and 2 times in the candidate, so min(2,2) = 2), 'cat' (1,1 → 1), 'on' (1,1 → 1), and 'mat' (1,1 → 1). Total overlap = 2+1+1+1 = 5. Precision = 5/6 ≈ 0.833 (5 overlapping words out of 6 candidate words). Recall = 5/6 ≈ 0.833 (5 overlapping words out of 6 reference words). F1 = 2×(0.833×0.833)/(0.833+0.833) = 0.833 since precision equals recall."
}
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
# ROUGE-1 Score Learning Guide

## Solution Explanation

ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation) is a fundamental metric for evaluating the quality of automatically generated summaries by comparing them to reference summaries. The "1" in ROUGE-1 refers to unigrams (single words), making it the most basic but most widely used variant of the ROUGE metrics.

### Intuition

Imagine you're a teacher grading a student's book summary. You have a reference summary (the "gold standard") and want to measure how well the student's summary captures the key information. ROUGE-1 essentially counts how many important words from the reference summary appear in the student's summary.

The core idea is simple: **if a generated summary contains many of the same words as a high-quality reference summary, it is likely capturing similar content and is therefore of good quality.**

### Mathematical Foundation

ROUGE-1 is built on three fundamental components:

**1. Precision (P)**

$$P = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in generated summary}}$$

**2. Recall (R)**

$$R = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in reference summary}}$$

**3. F1-Score (F)**

$$F = \frac{2 \times P \times R}{P + R}$$

where an "overlapping unigram" is a word that appears in both the generated summary and the reference summary.
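Taken together, the three formulas are only a few lines of Python. The sketch below is a minimal illustration, assuming the overlap count and the two token totals have already been computed; the helper name `rouge_1_from_counts` is hypothetical and not part of the reference solution.

```python
def rouge_1_from_counts(overlap: int, n_candidate: int, n_reference: int) -> dict:
    """Compute ROUGE-1 precision, recall, and F1 from pre-computed counts."""
    precision = overlap / n_candidate if n_candidate else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    # F1 is the harmonic mean of precision and recall; guard against zero division
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# 5 overlapping unigrams, 8 candidate tokens, 9 reference tokens (the worked example below)
print(rouge_1_from_counts(5, 8, 9))
# {'precision': 0.625, 'recall': 0.555..., 'f1': 0.588...}
```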
### Step-by-Step Calculation Process

Let's work through a concrete example:

**Reference Summary:** "The quick brown fox jumps over the lazy dog"

**Generated Summary:** "A quick fox jumps over a lazy cat"

**Step 1: Tokenization**

- Reference tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
- Generated tokens: ["A", "quick", "fox", "jumps", "over", "a", "lazy", "cat"]

**Step 2: Identify Overlapping Unigrams**

Overlapping words (case-insensitive): ["quick", "fox", "jumps", "over", "lazy"]

- Count of overlapping unigrams: 5

**Step 3: Calculate Precision**

$$P = \frac{5}{8} = 0.625$$

*Interpretation: 62.5% of the words in the generated summary appear in the reference.*

**Step 4: Calculate Recall**

$$R = \frac{5}{9} \approx 0.556$$

*Interpretation: 55.6% of the words in the reference summary are captured in the generated summary.*

**Step 5: Calculate F1-Score**

$$F = \frac{2 \times 0.625 \times 0.556}{0.625 + 0.556} = \frac{0.695}{1.181} \approx 0.588$$
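The whole calculation can be reproduced in a few lines of Python. This is a quick sketch of the worked example above, not the reference solution; it uses the intersection of two `collections.Counter` objects, which clips each shared word to the minimum of its counts in the two texts.

```python
from collections import Counter

reference = "The quick brown fox jumps over the lazy dog"
generated = "A quick fox jumps over a lazy cat"

# Step 1: tokenize (lowercased whitespace split)
ref_tokens = reference.lower().split()
gen_tokens = generated.lower().split()

# Step 2: overlapping unigrams, clipped to the minimum count in either text
overlap = sum((Counter(ref_tokens) & Counter(gen_tokens)).values())  # 5

# Steps 3-5: precision, recall, F1
precision = overlap / len(gen_tokens)               # 5 / 8 = 0.625
recall = overlap / len(ref_tokens)                  # 5 / 9 ≈ 0.556
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.588

print(precision, recall, f1)
```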
### Understanding the Components

**Precision answers:** "Of all the words in my generated summary, how many are actually relevant (i.e., appear in the reference)?"

- High precision means the generated summary doesn't contain many irrelevant words
- Low precision suggests the summary is verbose or off-topic

**Recall answers:** "Of all the important words in the reference, how many did my generated summary capture?"

- High recall means the generated summary covers most of the key information
- Low recall suggests the summary misses important content

**F1-Score provides:** A balanced measure that penalizes both missing important information (low recall) and including irrelevant information (low precision)
### Advanced Considerations

**Preprocessing Steps** (a minimal sketch follows this list):

1. **Case normalization:** Convert all text to lowercase
2. **Tokenization:** Split the text into individual words
3. **Stop word handling:** Optionally remove common words like "the", "and", "is"
4. **Stemming/Lemmatization:** Optionally reduce words to their root forms
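The sketch below illustrates the first three steps using only the standard library. The `preprocess` helper and its stop-word set are illustrative assumptions, not part of the reference solution; stemming or lemmatization would typically require an external library such as NLTK.

```python
def preprocess(text: str, remove_stop_words: bool = False) -> list[str]:
    """Lowercase and whitespace-tokenize text, optionally dropping stop words."""
    # Illustrative stop-word list; real pipelines usually rely on a curated set
    stop_words = {"the", "a", "an", "and", "is", "of", "to", "in"}
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    return tokens

print(preprocess("The quick brown fox jumps over the lazy dog"))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(preprocess("The quick brown fox jumps over the lazy dog", remove_stop_words=True))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```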
**Mathematical Variants:**

- **ROUGE-1 Precision:** $P = \dfrac{\sum_{i} \text{Count}_{\text{match}}(\text{unigram}_i)}{\sum_{i} \text{Count}_{\text{gen}}(\text{unigram}_i)}$
- **ROUGE-1 Recall:** $R = \dfrac{\sum_{i} \text{Count}_{\text{match}}(\text{unigram}_i)}{\sum_{i} \text{Count}_{\text{ref}}(\text{unigram}_i)}$

where $\text{Count}_{\text{match}}(\text{unigram}_i)$ is the minimum of the counts of $\text{unigram}_i$ in the generated and reference summaries, $\text{Count}_{\text{gen}}$ counts occurrences in the generated summary, and $\text{Count}_{\text{ref}}$ counts occurrences in the reference summary.
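Clipping each word to the minimum of its two counts matters whenever a word is repeated. A small illustration with made-up strings (the `Counter` intersection used here is equivalent to summing the per-word minimums):

```python
from collections import Counter

reference = "the cat sat"
candidate = "the the the cat"

ref_counts = Counter(reference.split())   # {'the': 1, 'cat': 1, 'sat': 1}
cand_counts = Counter(candidate.split())  # {'the': 3, 'cat': 1}

# 'the' is clipped to min(1, 3) = 1, so the total overlap is 2, not 4
overlap = sum((ref_counts & cand_counts).values())
print(overlap)  # 2
```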
### Practical Implementation Insights

**Handling Multiple References:**

When multiple reference summaries exist, ROUGE-1 can be calculated against each reference separately, and the maximum score is typically taken:

$$\text{ROUGE-1} = \max_{j} \text{ROUGE-1}(\text{generated}, \text{reference}_j)$$
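A short sketch of that aggregation, assuming the `rouge_1_score(reference, candidate)` function from this problem is available and using the F1 value as the per-reference score (other aggregation choices are possible):

```python
def rouge_1_multi_reference(references: list[str], candidate: str) -> float:
    """Return the best ROUGE-1 F1 of the candidate against any single reference."""
    # Assumes rouge_1_score(reference, candidate) -> dict is already defined
    return max(rouge_1_score(ref, candidate)["f1"] for ref in references)

references = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",
]
print(rouge_1_multi_reference(references, "the cat is on the mat"))  # 0.833...
```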
**Limitations to Consider:**

- **Word order independence:** ROUGE-1 ignores sentence structure and word order
- **Semantic blindness:** Synonyms and paraphrases aren't recognized
- **Length bias:** Longer summaries may achieve higher recall simply by including more words

### Real-World Applications

ROUGE-1 is extensively used in:

- **Automatic summarization evaluation** (news articles, scientific papers)
- **Machine translation quality assessment** (as a secondary metric)
- **Question answering systems** (evaluating answer quality)
- **Chatbot response evaluation** (measuring relevance to expected responses)
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
{
  "id": "152",
  "title": "Implementing ROUGE Score",
  "difficulty": "medium",
  "category": "Machine Learning",
  "video": "",
  "likes": "0",
  "dislikes": "0",
  "contributor": [
    {
      "profile_link": "https://github.com/kartik-git",
      "name": "kartik-git"
    }
  ]
}
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
from collections import Counter


def rouge_1_score(reference: str, candidate: str) -> dict:
    """
    Compute ROUGE-1 score between reference and candidate texts.

    Returns a dictionary with precision, recall, and f1.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    ref_counter = Counter(ref_tokens)
    cand_counter = Counter(cand_tokens)

    # Count overlapping unigrams
    overlap = sum(min(ref_counter[w], cand_counter[w]) for w in cand_counter)

    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Implement your function below.

def rouge_1_score(reference: str, candidate: str) -> dict:
    """
    Compute ROUGE-1 score between reference and candidate texts.

    Returns a dictionary with precision, recall, and f1.
    """
    # Your code here
    pass
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
[
  {
    "test": "print(rouge_1_score('the cat sat on the mat', 'the cat is on the mat'))",
    "expected_output": "{'precision': 0.8333333333333334, 'recall': 0.8333333333333334, 'f1': 0.8333333333333334}"
  },
  {
    "test": "print(rouge_1_score('hello there', 'hello there'))",
    "expected_output": "{'precision': 1.0, 'recall': 1.0, 'f1': 1.0}"
  }
]
