-
-
Notifications
You must be signed in to change notification settings - Fork 323
Validate educational videos before quiz generation (#3769) #5356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi @Shhazzz! This pull request needs a peer review before it can be merged. Please request a review from a team member who is not:
Once a valid peer review is submitted, this check will pass automatically. Thank you! |
WalkthroughAdds a YouTube-based educational video submission flow: validates YouTube URLs, fetches transcripts, classifies content with OpenAI, generates JSON MCQ quizzes, exposes a UI page and API endpoint, and persists quizzes as EducationalVideo records. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User
participant Browser
participant App as Django App
participant YT as YouTubeTranscriptApi
participant OpenAI as OpenAI API
participant DB as Database
User->>Browser: Submit YouTube URL (POST)
Browser->>App: submit_educational_video(request)
App->>App: validate URL & extract video_id
App->>YT: fetch transcript(video_id)
YT-->>App: transcript text / error
App->>OpenAI: classify transcript (educational?)
OpenAI-->>App: classification result
alt educational
App->>OpenAI: generate quiz (prompt includes transcript)
OpenAI-->>App: quiz JSON
App->>DB: save EducationalVideo (quiz JSON)
DB-->>App: saved
App-->>Browser: 200 / render form + quiz
else not educational
App-->>Browser: 400 / render form + rejection message
end
Browser-->>User: display result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📊 Monthly LeaderboardHi @Shhazzz! Here's how you rank for December 2025:
Leaderboard based on contributions in December 2025. Keep up the great work! 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🧹 Nitpick comments (3)
website/api/views.py (1)
49-49: Import should be at the top of the file.The
youtube_transcript_apiimport is placed in the middle of the file. Move it to the import section at the top for consistency and PEP 8 compliance.website/views/education.py (2)
61-70: CodeQL warning: URL substring check may be bypassed.The static analysis tool flags that checking
"youtube.com" in parsed_url.netloc(line 66) could match malicious domains likeevil-youtube.com. However, sinceEducationalVideoForm.clean_video_url()already validates the exact netloc against an allowlist before this function is called, the risk is mitigated in the current flow.For defense-in-depth, consider using exact domain matching here as well:
🔎 Proposed fix for stricter validation
def extract_youtube_video_id(url): parsed_url = urlparse(url) + + youtube_domains = {"youtube.com", "www.youtube.com", "youtu.be"} + if parsed_url.netloc not in youtube_domains: + return None - if "youtu.be" in parsed_url.netloc: + if parsed_url.netloc == "youtu.be": return parsed_url.path.lstrip("/") - elif "youtube.com" in parsed_url.netloc: + elif parsed_url.netloc in {"youtube.com", "www.youtube.com"}: qs = parse_qs(parsed_url.query) return qs.get("v", [None])[0] return None
73-85: Empty string return on error could be improved.When transcript fetching fails, the function returns an empty string
"". This works butNonewould be more semantically correct for "no transcript available" and would be consistent with howgenerate_quizhandles missing input (line 143-144 checksif not transcript_text).🔎 Proposed fix
def get_transcript_text(video_id): try: ytt_api = YouTubeTranscriptApi() fetched_transcript = ytt_api.fetch(video_id) # fetched_transcript is iterable text = " ".join(snippet.text for snippet in fetched_transcript) return text.strip() except Exception as e: logger.error(f"Transcript fetch failed: {e}") - return "" + return None
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
blt/urls.pywebsite/api/views.pywebsite/forms.pywebsite/templates/education/submit_video.htmlwebsite/views/education.py
🧰 Additional context used
🧬 Code graph analysis (3)
website/api/views.py (1)
website/views/education.py (3)
generate_quiz(142-182)submit_educational_video(88-139)extract_youtube_video_id(61-70)
website/views/education.py (1)
website/forms.py (1)
EducationalVideoForm(610-632)
blt/urls.py (2)
website/api/views.py (1)
submit_educational_video(63-79)website/views/education.py (1)
submit_educational_video(88-139)
🪛 GitHub Check: CodeQL
website/views/education.py
[failure] 66-66: Incomplete URL substring sanitization
The string youtube.com may be at an arbitrary position in the sanitized URL.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run Tests
- GitHub Check: docker-test
🔇 Additional comments (3)
blt/urls.py (1)
747-751: URL routing looks correct; verify if public access is intended.The endpoint is wired correctly. Note that
submit_educational_videohas no authentication requirement, meaning anonymous users can submit videos and trigger OpenAI API calls. If this is intentional for public use, consider adding rate limiting to prevent abuse and cost overruns.website/views/education.py (2)
142-182: Quiz generation looks good with appropriate safeguards.The implementation includes:
- Early return for empty transcript
- Truncation to 3500 chars to stay within token limits
- Strict JSON-only prompt with clear format specification
- Low temperature (0.4) for consistent output
- Proper JSON parsing with error handling
185-221: Educational classification implementation is solid.Good use of:
- Temperature 0 for deterministic classification
- Transcript truncation (2000 chars)
- Clear binary YES/NO output format
- Explicit criteria for educational vs non-educational content
❌ Pre-commit checks failedThe pre-commit hooks found issues that need to be fixed. Please run the following commands locally to fix them: # Install pre-commit if you haven't already
pip install pre-commit
# Run pre-commit on all files
pre-commit run --all-files
# Or run pre-commit on staged files only
pre-commit runAfter running these commands, the pre-commit hooks will automatically fix most issues. 💡 Tip: You can set up pre-commit to run automatically on every commit by running: pre-commit installPre-commit outputFor more information, see the pre-commit documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
website/views/education.py (2)
20-28: This issue was already flagged in a previous review.The duplicate logger definition and scattered imports were already identified. Please refer to the previous review comment for the recommended fix.
91-142: This issue was already flagged in a previous review.The missing
@login_requireddecorator and exposure of costly API calls to unauthenticated users was already identified. Please refer to the previous review comment for details.
🧹 Nitpick comments (3)
website/views/education.py (3)
61-73: Clarify URL validation logic to address static analysis warning.The static analysis warning about substring sanitization is a false positive here because line 67 already validates
netlocagainst an exact allowlist. However, the nested condition at line 70 ("youtube.com" in netloc) is redundant and confusing sincenetlocwas already confirmed to be inallowed_domains.🔎 Suggested refactor for clarity
def extract_youtube_video_id(url): parsed_url = urlparse(url) netloc = parsed_url.netloc.lower() allowed_domains = ["youtu.be", "youtube.com", "www.youtube.com"] if netloc in allowed_domains: if netloc == "youtu.be": return parsed_url.path.lstrip("/") - elif "youtube.com" in netloc: + else: # youtube.com or www.youtube.com qs = parse_qs(parsed_url.query) return qs.get("v", [None])[0] return None
145-186: Consider making quiz length and transcript window configurable.The function hardcodes 20 questions (line 150) and uses only the first 3500 characters of the transcript (line 167). For short videos, 20 questions may be excessive; for longer educational content, the 3500-character limit captures only a small portion.
Consider:
- Scaling the number of questions based on transcript length (e.g., 1 question per 200 characters)
- Making the transcript window configurable or using a smarter sampling strategy for longer content
- Adding a parameter to control quiz difficulty or focus areas
These changes would improve the user experience across different video lengths.
188-224: Strengthen response parsing for edge cases.The classification relies on an exact
"YES"match after stripping and uppercasing (line 223-224). If OpenAI returns variations like"YES.","YES\n", or includes additional whitespace, the classification will fail and incorrectly reject educational videos.🔎 Proposed fix
answer = response.choices[0].message.content.strip().upper() - return answer == "YES" + # Handle common variations: "YES", "YES.", "YES\n", etc. + return answer.startswith("YES")Or more robustly:
answer = response.choices[0].message.content.strip().upper() - return answer == "YES" + # Extract first word to handle punctuation/whitespace + first_word = answer.split()[0] if answer else "" + return first_word == "YES"
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
website/views/education.py
🧰 Additional context used
🧬 Code graph analysis (1)
website/views/education.py (1)
website/forms.py (1)
EducationalVideoForm(610-632)
🪛 GitHub Check: CodeQL
website/views/education.py
[failure] 70-70: Incomplete URL substring sanitization
The string youtube.com may be at an arbitrary position in the sanitized URL.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run Tests
- GitHub Check: docker-test
🔇 Additional comments (1)
website/views/education.py (1)
76-88: No action required—API usage is correct.The
YouTubeTranscriptApi()instantiation withfetch()method is a documented and valid usage pattern in the youtube-transcript-api library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
website/views/education.py (1)
20-28: Import organization and duplicate logger issues persist.This was already flagged in a previous review: imports are scattered (should be grouped at top),
loggeris defined twice (lines 18 and 28), andopenai.api_keyassignment on line 27 is redundant since the client on line 25 already uses the API key.
🧹 Nitpick comments (2)
website/views/education.py (2)
61-73: Simplify conditional to eliminate CodeQL warning.The static analysis flags
"youtube.com" in netlocas incomplete URL sanitization. While the outernetloc in allowed_domainsguard makes this safe, the substring check is redundant sinceallowed_domainsonly contains"youtube.com"and"www.youtube.com". Simplify to avoid the warning.🔎 Proposed fix
def extract_youtube_video_id(url): parsed_url = urlparse(url) netloc = parsed_url.netloc.lower() allowed_domains = ["youtu.be", "youtube.com", "www.youtube.com"] if netloc in allowed_domains: if netloc == "youtu.be": return parsed_url.path.lstrip("/") - elif "youtube.com" in netloc: + else: # youtube.com or www.youtube.com qs = parse_qs(parsed_url.query) return qs.get("v", [None])[0] return None
103-130: Usereverse()instead of hardcoded URL paths.Multiple redirects use the hardcoded string
"/education/submit/". This is fragile if the URL pattern changes. Use Django'sreverse()function with the URL name for maintainability.🔎 Proposed fix
+from django.urls import reverse + # Then replace all occurrences: - return redirect("/education/submit/") + return redirect(reverse("submit_educational_video"))
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
website/views/education.py
🧰 Additional context used
🧬 Code graph analysis (1)
website/views/education.py (1)
website/forms.py (1)
EducationalVideoForm(610-632)
🪛 GitHub Check: CodeQL
website/views/education.py
[failure] 70-70: Incomplete URL substring sanitization
The string youtube.com may be at an arbitrary position in the sanitized URL.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run Tests
- GitHub Check: docker-test
🔇 Additional comments (1)
website/views/education.py (1)
76-88: LGTM!The transcript fetching logic handles errors gracefully by logging and returning an empty string, which the caller properly checks.
@mdkaifansari, |
|
Hi @DonnieBLT ! All review feedbacks have been addressed and CI is now passing. Please take a look when you have time! |
|
@coderabbitai whom should i ask to review the PR and merge? |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 2269 @Shhazzz Great news! You've already done the right thing by tagging @DonnieBLT in your previous comment. Based on the repository activity: ✅ Your PR is already APPROVED by reviewers What to do now:Wait for DonnieBLT to merge — they're the admin and handle all merges for this repository. Since you've already tagged them and the PR is approved with passing CI, it's in the merge queue. If you don't hear back within a few days, you can:
But based on recent activity (DonnieBLT merged PR #5357 just yesterday), they're actively reviewing and merging PRs, so your approved PR should be merged soon! 🎉 |
|
Other than this everything looks good to me , you can further ask @DonnieBLT for the review on it |
|
💬 Reminder: Unresolved Conversations Hi @Shhazzz! This pull request has 1 unresolved conversation that need to be addressed. Please review and resolve the pending discussions so we can move forward with merging this PR. Thank you! 🙏 |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 14002 🏁 Script executed: Length of output: 897 @Shhazzz Great question! I checked all conversation threads on your PR. Here's the status: ✅ All Actionable Review Comments Are ResolvedAll review comments from human reviewers and my previous feedback have been addressed and marked as resolved. 📋 Status Summary
🔍 The 2 "Unresolved" ThreadsThe 2 unresolved threads are automated security alerts from GitHub Advanced Security (CodeQL) about "Incomplete URL substring sanitization" in your URL validation code. However:
These outdated security alerts don't block your PR from merging. ✨ Current PR Status
🎯 Next StepsYou're done! 🎉 There's nothing else you need to address. Just wait for @DonnieBLT to merge your PR. Based on recent activity, they actively merge approved PRs, so it should happen soon. The outdated security alerts will be automatically dismissed once the PR is merged or you can ignore them since the code has been updated. |
|
Hi @DonnieBLT |
Validate educational videos before quiz generation (Fixes #3769)
What this PR does
Why
This aligns with maintainer feedback to:
Testing
Below is the working example:

For Educational Videos
For Non-Educational Videos

** For Vimeo Links**

Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.