feat: add automated semantic analysis workflow (no API calls) #79
This adds a fully automated knowledge base generation workflow that
performs semantic analysis without requiring manual intervention or API calls.
**New Components:**
1. **semantic_analyzer.py** - Local NLP/heuristic analyzer
- Extracts concepts from LaTeX sources using regex patterns
- Finds definitions: \definition{}{}, "X is defined as Y", etc.
- Extracts concepts from section titles
- Analyzes cited paper titles, abstracts, keywords
- Creates hierarchical structure automatically
- Generates Quartz-compatible markdown files
- NO external API calls - fully local processing
2. **knowledge-base-automated.yml** - Automated workflow
- Manually triggered with a generated branch name
- Runs semantic_analyzer.py automatically
- Generates markdown concept files
- Creates PR with results
- Fully automated end-to-end
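The regex-based extraction described for the analyzer can be sketched roughly as follows. This is a minimal illustration, not the code in `semantic_analyzer.py`: the pattern names, the `extract_concepts` helper, and the exact regexes are assumptions, and the "X is defined as Y" pattern only handles simple single-sentence phrasing.

```python
import re

# Hypothetical patterns for illustration; the real analyzer's patterns may differ.
# Matches \section{...}, \subsection{...}, \subsubsection{...} (optionally starred).
SECTION_RE = re.compile(r'\\(?:sub){0,2}section\*?\{([^}]+)\}')
# Matches "A system is defined as ... ." and captures the term and the definition.
DEFINED_AS_RE = re.compile(
    r'\b(?:An? |The )?([A-Za-z][A-Za-z -]*?) is defined as ([^.]+)\.'
)


def extract_concepts(latex: str) -> dict:
    """Collect section titles and 'X is defined as Y' definitions from LaTeX text."""
    sections = [m.group(1).strip() for m in SECTION_RE.finditer(latex)]
    definitions = {
        m.group(1).strip().lower(): m.group(2).strip()
        for m in DEFINED_AS_RE.finditer(latex)
    }
    return {"sections": sections, "definitions": definitions}
```

Section titles would then serve as candidate categories and the extracted terms as candidate concept files; sentences with leading clauses (for example "In this paper, a system is defined as ...") would need a stricter pattern.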
**Comparison with Manual Workflow:**
Manual (knowledge-base-pipeline.yml):
- High quality (LLM-powered via Claude Code)
- Requires human intervention
- Deep semantic understanding
- Best for production
Automated (NEW):
- Medium quality (heuristic-based)
- Fully automated
- Pattern matching + NLP
- Best for drafts/testing
**Benefits:**
- Test pipeline without manual steps
- Generate quick drafts for review
- No API costs
- 100% automation option
- Compare with manual Claude Code results
**Usage:**
```bash
# After main pipeline creates a generated branch:
gh workflow run "Knowledge Base — Automated Analysis (Experimental)" \
--field branch=knowledge-database/generated-XXXXX
```
**Design Philosophy:**
Both workflows kept intentionally separate:
- Manual workflow for quality (LLM-powered)
- Automated workflow for speed (heuristic-based)
- User chooses based on needs
- Can run both and compare
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
📚 Implementation Details
Automated Semantic Analyzer Features
1. LaTeX Definition Patterns:
2. Structural Extraction:
3. Cited Paper Analysis:
4. Relationship Inference:
Workflow Execution Flow
graph TD
A[Main Pipeline Generates Branch] --> B[User Triggers Automated Workflow]
B --> C[Checkout Generated Branch]
C --> D[Run semantic_analyzer.py]
D --> E[Extract LaTeX Concepts]
D --> F[Analyze Cited Papers]
E --> G[Organize Hierarchy]
F --> G
G --> H[Generate Markdown Files]
H --> I[Commit to Branch]
I --> J[Create PR]
J --> K[Review & Merge]
Quality Comparison
Manual Claude Code Analysis:
Automated Heuristic Analysis:
Example Output
For a LaTeX file containing:
\section{Formal Ontologies}
\definition{system}{A system is a set of related elements organized for a purpose}
The automated analyzer generates:
---
title: System
tags: [formal-ontologies, definition]
hierarchy: [formal-ontologies]
---
# System
## Definition
A system is a set of related elements organized for a purpose
## Related Concepts
- [[formal-ontologies]]
- [[organization]]
Testing Recommendations
Future Enhancements
Potential improvements to the analyzer:
Ready to merge when testing is complete! 🚀
Pull Request Overview
This PR introduces an experimental automated semantic analysis workflow that generates knowledge base content from LaTeX sources and cited papers using local heuristic analysis, providing a fully automated alternative to the manual Claude Code-based workflow.
- Adds `semantic_analyzer.py` for local NLP-based concept extraction from LaTeX and papers
- Implements automated workflow `knowledge-base-automated.yml` for end-to-end KB generation
- Updates workflow documentation to explain manual vs automated approaches
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| scripts/zotero-integration/semantic_analyzer.py | New Python script implementing heuristic-based semantic analysis with regex patterns for definition extraction, concept identification, and markdown generation |
| .github/workflows/knowledge-base-automated.yml | New GitHub Actions workflow for fully automated knowledge base generation with PR creation |
| .github/workflows/README.md | Updated documentation comparing manual (LLM-powered) vs automated (heuristic) workflows with usage examples and feature comparison |
Comments suppressed due to low confidence (1)
scripts/zotero-integration/semantic_analyzer.py:13
- Import of 'Tuple' is not used.
Import of 'Optional' is not used.
from typing import Dict, List, Set, Tuple, Optional
- **Total Concept Files**: $TOTAL_FILES
- **Categories**: $CATEGORIES
- **Source**: Automated heuristic analysis
- **Processing Time**: ~${{ github.run_duration }} seconds
The variable ${{ github.run_duration }} doesn't exist in GitHub Actions context. This will either output an empty value or cause an error. Consider using a computed duration by capturing timestamps at the start and end of the job, or remove this metric from the comment.
@copilot open a new pull request to apply changes based on this feedback
content = f"""---
title: {self.name}
tags: [{', '.join(sorted(self.keywords))}]
hierarchy: [{', '.join(hierarchy)}]
related: [{', '.join(sorted(self.related_concepts))}]
---
The YAML front matter generation doesn't escape special characters in concept names, which could break YAML parsing. For example, if self.name contains a colon (:) or quotes, the YAML will be invalid. Consider using proper YAML escaping or quoting:
content = f"""---
title: "{self.name.replace('"', '\\"')}"
tags: [{', '.join(sorted(self.keywords))}]
hierarchy: [{', '.join(hierarchy)}]
related: [{', '.join(sorted(self.related_concepts))}]
---
Or better yet, use a YAML library like pyyaml to generate the front matter safely.
@copilot open a new pull request to apply changes based on this feedback
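One way to act on the escaping concern without a third-party dependency is sketched below. It relies on the assumption that a JSON string literal is also an acceptable YAML double-quoted scalar, which holds for common cases such as colons and quotes; the helper names (`yaml_scalar`, `yaml_flow_list`, `front_matter`) are hypothetical, and a YAML library remains the more robust choice, as the comment above suggests.

```python
import json


def yaml_scalar(value: str) -> str:
    """Quote a string as a JSON literal, used here as a YAML double-quoted scalar."""
    return json.dumps(value, ensure_ascii=False)


def yaml_flow_list(values) -> str:
    """Render an iterable of strings as a YAML flow sequence with quoted items."""
    return "[" + ", ".join(yaml_scalar(v) for v in values) + "]"


def front_matter(name, keywords, hierarchy, related) -> str:
    """Build the front matter block with every value quoted."""
    lines = [
        "---",
        f"title: {yaml_scalar(name)}",
        f"tags: {yaml_flow_list(sorted(keywords))}",
        # The hierarchy is ordered, so it is deliberately not sorted.
        f"hierarchy: {yaml_flow_list(hierarchy)}",
        f"related: {yaml_flow_list(sorted(related))}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```

With this approach a concept named `Time: a "brief" history` produces a quoted title instead of an invalid mapping, at the cost of quoting every value even when it is not strictly needed.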
def _extract_definitions(self, content: str, source: str, category: str):
"""Extract explicit definitions from LaTeX content."""
# Pattern: \definition{term}{definition text}
def_pattern = r'\\definition\{([^}]+)\}\{([^}]+)\}'
The regex pattern r'\\definition\{([^}]+)\}\{([^}]+)\}' on line 134 will fail to correctly match nested braces in LaTeX definitions. For example, \definition{term}{text with \emph{nested} braces} would not be fully captured. Consider using a more robust parsing approach or recursive regex for handling nested braces, or document this limitation.
@copilot open a new pull request to apply changes based on this feedback
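The "more robust parsing approach" mentioned in the comment could look like the brace-counting scanner below. This is only a sketch of one alternative to a regex, not the fix that was eventually committed; `read_braced` and `find_definitions` are hypothetical names, and the scanner assumes the opening brace immediately follows the macro name with no whitespace.

```python
def read_braced(text: str, start: int):
    """Return (content, end_index) for the brace group opening at text[start], else None."""
    if start >= len(text) or text[start] != "{":
        return None
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2  # skip the backslash and the next character, e.g. \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
        i += 1
    return None  # braces never balanced


def find_definitions(text: str, macro: str = "\\definition"):
    """Collect (term, definition) pairs from \\definition{term}{definition} calls."""
    results = []
    pos = text.find(macro)
    while pos != -1:
        term = read_braced(text, pos + len(macro))
        if term is not None:
            body = read_braced(text, term[1])
            if body is not None:
                results.append((term[0], body[0]))
        pos = text.find(macro, pos + len(macro))
    return results
```

Counting depth lets the second argument contain nested groups such as `\emph{nested}`, which the `[^}]+` character class cuts short at the first closing brace.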
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#99)
* Initial plan
* fix: add error handling for empty concept generation
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
* Initial plan
* Fix non-existent github.run_duration variable with computed elapsed time
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
* Initial plan
* Fix YAML front matter escaping using yaml.safe_dump
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
* Initial plan
* Fix regex pattern to handle nested braces in LaTeX definitions
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>