feat: add automated semantic analysis workflow (no API calls) #79

Open
izzortsi wants to merge 11 commits into dev from feature/automated-semantic-analysis

Owner

izzortsi commented Nov 4, 2025

This adds a fully automated knowledge base generation workflow that
performs semantic analysis without requiring manual intervention or API calls.

**New Components:**

1. **semantic_analyzer.py** - Local NLP/heuristic analyzer
   - Extracts concepts from LaTeX sources using regex patterns
   - Finds definitions: \definition{}{}, "X is defined as Y", etc.
   - Extracts concepts from section titles
   - Analyzes cited paper titles, abstracts, keywords
   - Creates hierarchical structure automatically
   - Generates Quartz-compatible markdown files
   - NO external API calls - fully local processing

2. **knowledge-base-automated.yml** - Automated workflow
   - Manually triggered with a generated branch name
   - Runs semantic_analyzer.py automatically
   - Generates markdown concept files
   - Creates PR with results
   - Fully automated end-to-end

**Comparison with Manual Workflow:**

Manual (knowledge-base-pipeline.yml):
- High quality (LLM-powered via Claude Code)
- Requires human intervention
- Deep semantic understanding
- Best for production

Automated (NEW):
- Medium quality (heuristic-based)
- Fully automated
- Pattern matching + NLP
- Best for drafts/testing

**Benefits:**
- Test pipeline without manual steps
- Generate quick drafts for review
- No API costs
- 100% automation option
- Compare with manual Claude Code results

**Usage:**
```bash
# After main pipeline creates a generated branch:
gh workflow run "Knowledge Base — Automated Analysis (Experimental)" \
  --field branch=knowledge-database/generated-XXXXX
```

**Design Philosophy:**
Both workflows are kept intentionally separate:
- Manual workflow for quality (LLM-powered)
- Automated workflow for speed (heuristic-based)
- User chooses based on needs
- Can run both and compare

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
izzortsi added the automation (Automated processes and workflows), knowledge-base (Knowledge base generation and management), and experimental (Experimental features requiring testing and validation) labels on Nov 4, 2025
Owner Author

izzortsi commented Nov 4, 2025

📚 Implementation Details

Automated Semantic Analyzer Features

The semantic_analyzer.py script uses several extraction techniques:

1. LaTeX Definition Patterns:

  • \definition{term}{definition text}
  • "X is defined as Y"
  • "A X is a Y that..."
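
The three definition patterns listed above can be sketched with plain regular expressions. This is a minimal illustration, not the code in semantic_analyzer.py: the pattern names, the `extract_definitions` helper, and the choice to lower-case terms are all assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration; the analyzer's real regexes may differ.
MACRO_DEF = re.compile(r"\\definition\{([^}]+)\}\{([^}]+)\}")
DEFINED_AS = re.compile(r"\b([A-Z][\w-]*(?: [\w-]+)*?) is defined as ([^.\n]+)")
IS_A = re.compile(r"\bAn? ([\w-]+(?: [\w-]+)*?) is an? ([^.\n]+)")


def extract_definitions(text):
    """Return {term: definition} for the three literal patterns above."""
    found = {}
    for pattern in (MACRO_DEF, DEFINED_AS, IS_A):
        for term, body in pattern.findall(text):
            # The first pattern to see a term wins; later matches are ignored.
            found.setdefault(term.strip().lower(), body.strip())
    return found
```

Applied to a line such as `Entropy is defined as a measure of disorder.`, this yields the term `entropy` with the definition `a measure of disorder`. Being purely literal, it will miss definitions phrased any other way, which matches the "literal definition extraction" trade-off described in this comment.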

2. Structural Extraction:

  • Section titles (\section{}, \subsection{})
  • Citation context (concepts near \cite{})
  • Cross-references between sources
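
As a rough sketch of the structural step, section titles and citation keys can be pulled out as follows. The helper names `outline` and `cited_keys` are hypothetical, and the sketch only collects the keys; it does not attempt the "concepts near \cite{}" context analysis.

```python
import re

# Matches \section, \subsection and \subsubsection, with an optional star.
SECTION = re.compile(r"\\((?:sub){0,2})section\*?\{([^}]+)\}")
# Matches \cite, \citep, \citet, ... with optional [..] arguments.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]+)\}")


def outline(text):
    """Return (depth, title) pairs; depth 1 is \\section, 2 is \\subsection."""
    return [(len(m.group(1)) // 3 + 1, m.group(2).strip())
            for m in SECTION.finditer(text)]


def cited_keys(text):
    """Return every citation key in order of appearance."""
    keys = []
    for group in CITE.findall(text):
        keys.extend(k.strip() for k in group.split(",") if k.strip())
    return keys
```

The depth value gives a cheap way to place a concept extracted from a subsection under the concept extracted from its parent section.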

3. Cited Paper Analysis:

  • Title concept extraction (capitalized multi-word terms)
  • Abstract analysis (quoted terms, italics, bold)
  • Keyword integration
  • Technical hyphenated terms
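
The cited-paper heuristics might look roughly like this. It is a sketch under assumptions: `candidate_terms` and its arguments are invented for the example, only quoted terms are handled in the abstract (italics and bold would need markup-specific patterns), and all terms are lower-cased.

```python
import re

# Two or more consecutive capitalized words, e.g. "Formal Ontologies".
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")
# Hyphen-joined technical terms, e.g. "description-logic".
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")
# Terms in straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def candidate_terms(title, abstract="", keywords=()):
    """Collect candidate concept names from one cited paper's metadata."""
    terms = set()
    terms.update(m.lower() for m in CAPITALIZED.findall(title))
    terms.update(m.lower() for m in HYPHENATED.findall(title + " " + abstract))
    terms.update(m.lower() for m in QUOTED.findall(abstract))
    terms.update(k.strip().lower() for k in keywords if k.strip())
    return sorted(terms)
```

Title-cased headings will over-generate with the capitalized-word rule, so output like this is best treated as a candidate list for later filtering.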

4. Relationship Inference:

  • Co-occurrence in same sources
  • Shared citations (≥2 common references)
  • Hierarchical hints (name containment)
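
The three inference rules above can be combined in a single pairwise pass. The sketch below assumes each concept is stored as a dict with `sources` and `citations` sets; that layout and the `infer_relations` name are illustrative, not taken from the script.

```python
from itertools import combinations


def infer_relations(concepts, min_shared_citations=2):
    """Link two concepts if they share a source, share enough citations,
    or one name contains the other.

    `concepts` maps name -> {"sources": set, "citations": set} (assumed layout).
    """
    related = {name: set() for name in concepts}
    for a, b in combinations(sorted(concepts), 2):
        ca, cb = concepts[a], concepts[b]
        linked = (
            bool(ca["sources"] & cb["sources"])                                  # co-occurrence
            or len(ca["citations"] & cb["citations"]) >= min_shared_citations   # shared citations
            or a in b or b in a                                                  # name containment
        )
        if linked:
            related[a].add(b)
            related[b].add(a)
    return related
```

Plain substring containment is deliberately crude: it links "system" to "open system" but would also link it to "ecosystem", so a word-boundary check may be worth adding.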

Workflow Execution Flow

graph TD
    A[Main Pipeline Generates Branch] --> B[User Triggers Automated Workflow]
    B --> C[Checkout Generated Branch]
    C --> D[Run semantic_analyzer.py]
    D --> E[Extract LaTeX Concepts]
    D --> F[Analyze Cited Papers]
    E --> G[Organize Hierarchy]
    F --> G
    G --> H[Generate Markdown Files]
    H --> I[Commit to Branch]
    I --> J[Create PR]
    J --> K[Review & Merge]

Quality Comparison

Manual Claude Code Analysis:

  • Understands context and nuance
  • Infers implicit concepts
  • Writes clear, refined definitions
  • Creates intelligent hierarchies
  • Best for: Production knowledge base

Automated Heuristic Analysis:

  • Fast pattern matching
  • Literal definition extraction
  • Mechanical cross-referencing
  • Good coverage but less depth
  • Best for: Quick drafts, testing, structure validation

Example Output

For a LaTeX file containing:

\section{Formal Ontologies}
\definition{system}{A system is a set of related elements organized for a purpose}

The automated analyzer generates:

---
title: System
tags: [formal-ontologies, definition]
hierarchy: [formal-ontologies]
---

# System

## Definition

A system is a set of related elements organized for a purpose

## Related Concepts

- [[formal-ontologies]]
- [[organization]]
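
The wikilinks in this example use lower-case, hyphenated names. A small sketch of how such links could be produced, assuming concept files are named by slug as the example suggests (the `slugify` and `related_section` helpers are hypothetical):

```python
import re


def slugify(name):
    """Lower-case the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def related_section(names):
    """Render the '## Related Concepts' block as a list of wikilinks."""
    lines = ["## Related Concepts", ""]
    lines += [f"- [[{slugify(n)}]]" for n in sorted(names)]
    return "\n".join(lines)
```

Sorting the names keeps regenerated files stable between runs, which makes diffs in the resulting PR easier to review.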

Testing Recommendations

  1. First Run: Test on a small generated branch
  2. Compare: Run both automated and manual on same data
  3. Iterate: Refine heuristics based on results
  4. Hybrid: Use automated as baseline, manual for refinement

Future Enhancements

Potential improvements to the analyzer:

  • Add spaCy for better NLP
  • Implement noun phrase chunking
  • Add entity recognition
  • Improve definition quality scoring
  • Multi-language support
  • Concept similarity metrics

Ready to merge when testing is complete! 🚀

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces an experimental automated semantic analysis workflow that generates knowledge base content from LaTeX sources and cited papers using local heuristic analysis, providing a fully automated alternative to the manual Claude Code-based workflow.

  • Adds semantic_analyzer.py for local NLP-based concept extraction from LaTeX and papers
  • Implements automated workflow knowledge-base-automated.yml for end-to-end KB generation
  • Updates workflow documentation to explain manual vs automated approaches

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| scripts/zotero-integration/semantic_analyzer.py | New Python script implementing heuristic-based semantic analysis with regex patterns for definition extraction, concept identification, and markdown generation |
| .github/workflows/knowledge-base-automated.yml | New GitHub Actions workflow for fully automated knowledge base generation with PR creation |
| .github/workflows/README.md | Updated documentation comparing manual (LLM-powered) vs automated (heuristic) workflows with usage examples and feature comparison |
Comments suppressed due to low confidence (1)

scripts/zotero-integration/semantic_analyzer.py:13

  • Import of 'Tuple' is not used.
    Import of 'Optional' is not used.
from typing import Dict, List, Set, Tuple, Optional

Comment thread .github/workflows/knowledge-base-automated.yml
- **Total Concept Files**: $TOTAL_FILES
- **Categories**: $CATEGORIES
- **Source**: Automated heuristic analysis
- **Processing Time**: ~${{ github.run_duration }} seconds

Copilot AI Nov 14, 2025

The variable ${{ github.run_duration }} doesn't exist in GitHub Actions context. This will either output an empty value or cause an error. Consider using a computed duration by capturing timestamps at the start and end of the job, or remove this metric from the comment.
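
One way to compute the duration, sketched in Python on the assumption that an earlier step has stored the job's start time as epoch seconds. The `START_TIME` variable name and the `elapsed_seconds` helper are hypothetical; the thread does not show how the follow-up fix implements this.

```python
import os
import time


def elapsed_seconds(start_epoch, now=None):
    """Whole seconds between a recorded start timestamp and now (never negative)."""
    now = time.time() if now is None else now
    return max(0, int(now - float(start_epoch)))


if __name__ == "__main__":
    # START_TIME is an assumed name; fall back to "now" so the script still runs.
    start = os.environ.get("START_TIME", str(time.time()))
    print(f"- **Processing Time**: ~{elapsed_seconds(start)} seconds")
```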

Owner Author

@copilot open a new pull request to apply changes based on this feedback

Comment thread scripts/zotero-integration/semantic_analyzer.py Outdated
Comment on lines +34 to +39
content = f"""---
title: {self.name}
tags: [{', '.join(sorted(self.keywords))}]
hierarchy: [{', '.join(hierarchy)}]
related: [{', '.join(sorted(self.related_concepts))}]
---

Copilot AI Nov 14, 2025

The YAML front matter generation doesn't escape special characters in concept names, which could break YAML parsing. For example, if self.name contains a colon (:) or quotes, the YAML will be invalid. Consider using proper YAML escaping or quoting:

content = f"""---
title: "{self.name.replace('"', '\\"')}"
tags: [{', '.join(sorted(self.keywords))}]
hierarchy: [{', '.join(hierarchy)}]
related: [{', '.join(sorted(self.related_concepts))}]
---

Or better yet, use a YAML library like pyyaml to generate the front matter safely.
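
Note that the inline `replace` suggestion above puts a backslash inside an f-string expression, which is a syntax error before Python 3.12, and it leaves backslashes in the name unescaped. As a dependency-free alternative to a YAML library, the sketch below quotes every value with `json.dumps`, relying on JSON string literals being valid YAML double-quoted scalars for ordinary text. The function names are hypothetical, and this is not the `yaml.safe_dump` approach named in the later fix commit.

```python
import json


def yaml_scalar(value):
    # json.dumps escapes quotes and backslashes; the resulting literal is
    # also a valid YAML double-quoted scalar for ordinary text.
    return json.dumps(str(value), ensure_ascii=False)


def yaml_flow_list(values):
    """Render values as a YAML flow sequence of quoted scalars."""
    return "[" + ", ".join(yaml_scalar(v) for v in values) + "]"


def front_matter(title, keywords, hierarchy, related):
    """Build the front matter block with every value safely quoted."""
    lines = [
        "---",
        f"title: {yaml_scalar(title)}",
        f"tags: {yaml_flow_list(sorted(keywords))}",
        f"hierarchy: {yaml_flow_list(hierarchy)}",
        f"related: {yaml_flow_list(sorted(related))}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```

With this, a concept named `Time: a "brief" note` produces `title: "Time: a \"brief\" note"`, which parses cleanly despite the colon and quotes.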

Owner Author

@copilot open a new pull request to apply changes based on this feedback

Comment thread scripts/zotero-integration/semantic_analyzer.py Outdated
Comment thread scripts/zotero-integration/semantic_analyzer.py Outdated
Comment thread scripts/zotero-integration/semantic_analyzer.py Outdated
Comment thread scripts/zotero-integration/semantic_analyzer.py Outdated
def _extract_definitions(self, content: str, source: str, category: str):
"""Extract explicit definitions from LaTeX content."""
# Pattern: \definition{term}{definition text}
def_pattern = r'\\definition\{([^}]+)\}\{([^}]+)\}'

Copilot AI Nov 14, 2025

The regex pattern r'\\definition\{([^}]+)\}\{([^}]+)\}' on line 134 will fail to correctly match nested braces in LaTeX definitions. For example, \definition{term}{text with \emph{nested} braces} would not be fully captured. Consider using a more robust parsing approach or recursive regex for handling nested braces, or document this limitation.
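
Python's built-in `re` module has no recursive patterns, so one option is a small brace-counting scanner. The following is a sketch of that idea, not the fix that was eventually merged; `read_group` and `find_definitions` are hypothetical names, and it assumes no whitespace between the macro and its two arguments.

```python
def read_group(text, start):
    """Read a balanced {...} group starting at text[start].

    Returns (content, index_after_closing_brace), or None if there is no
    group at that position or the braces never balance.
    """
    if start >= len(text) or text[start] != "{":
        return None
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
        i += 1
    return None


def find_definitions(text, macro="\\definition"):
    """Return (term, definition) pairs, allowing nested braces in both."""
    results = []
    pos = text.find(macro)
    while pos != -1:
        term = read_group(text, pos + len(macro))
        body = read_group(text, term[1]) if term else None
        if term and body:
            results.append((term[0], body[0]))
            pos = text.find(macro, body[1])
        else:
            pos = text.find(macro, pos + len(macro))
    return results
```

On the example from this comment, `\definition{term}{text with \emph{nested} braces}`, the scanner returns the full definition text including the `\emph{nested}` part.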

Owner Author

@copilot open a new pull request to apply changes based on this feedback

Comment thread scripts/zotero-integration/semantic_analyzer.py Outdated
Contributor

Copilot AI commented Nov 14, 2025

@izzortsi I've opened a new pull request, #99, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI commented Nov 14, 2025

@izzortsi I've opened a new pull request, #100, to work on those changes. Once the pull request is ready, I'll request review from you.

izzortsi and others added 3 commits November 13, 2025 16:53
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI commented Nov 14, 2025

@izzortsi I've opened a new pull request, #101, to work on those changes. Once the pull request is ready, I'll request review from you.

izzortsi and others added 6 commits November 13, 2025 16:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#99)

* Initial plan

* fix: add error handling for empty concept generation

Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
* Initial plan

* Fix non-existent github.run_duration variable with computed elapsed time

Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
* Initial plan

* Fix YAML front matter escaping using yaml.safe_dump

Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>
Contributor

Copilot AI commented Nov 14, 2025

@izzortsi I've opened a new pull request, #102, to work on those changes. Once the pull request is ready, I'll request review from you.

* Initial plan

* Fix regex pattern to handle nested braces in LaTeX definitions

Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: izzortsi <35747979+izzortsi@users.noreply.github.com>