-
Notifications
You must be signed in to change notification settings - Fork 0
feat: Complete Linguist sync automation and snapshot publishing workflow #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add `supported_in_singularity` flag (defaults to false, explicitly true for our 24 languages) - Add `language_type` field aligned with Linguist's classification - Update all 24 language registrations with new fields - Source of truth: <https://github.com/github-linguist/linguist/blob/main/lib/linguist/languages.yml> ## Governance Model Language definitions now follow GitHub Linguist's standard: - Prevents ad-hoc language additions - Ensures consistency across ecosystem - Automatic tracking via Renovate (weekly) ## Build Script Enhancement Updated build.rs with future capability for: - Automatic Linguist languages.yml synchronization - Code generation from Linguist definitions - Auto-update when Linguist adds new languages ## Renovate Configuration - New rule to track Linguist releases (weekly) - Labels: linguist, language-registry - Manual review for language definition changes This prepares Singularity for scalable language support while maintaining explicit governance over what's actually supported. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
…cation ## What's New FileClassifier Module: Detect vendored, generated, and binary files - Uses patterns from GitHub Linguist (vendor.yml, generated.rb) - Supports: vendored detection, generated file detection, binary detection - Methods: is_vendored(), is_generated(), is_binary(), classify(), should_analyze() Phase 1: Language Definitions - DONE - Languages synced from Linguist languages.yml - supported_in_singularity flag for explicit support - Weekly Renovate alerts Phase 2: File Classification - READY - FileClassifier implementation complete - Ready to auto-generate from Linguist patterns - Supports: vendor paths, generated extensions, binary formats, documentation markers Phase 3: Detection Heuristics - PLANNED - Future: Auto-generate from Linguist heuristics.yml - Fallback language detection for ambiguous extensions New Files: - src/file_classifier.rs: File classification engine - LINGUIST_INTEGRATION.md: Complete documentation - Updated build.rs: 3-phase roadmap - Updated renovate.json5: Enhanced PR instructions Benefits: ✅ Skip vendored code (node_modules/, vendor/) ✅ Skip generated files (.pb.rs, .generated.ts, etc.) ✅ Skip binary files (images, archives, executables) ✅ Auto-updated with Linguist releases ✅ Reduces false positives in code analysis Testing: All tests pass, Clippy and fmt clean 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
Phase 2 Implementation: Auto-generate File Classification Patterns New Files Added: scripts/sync_linguist_patterns.py (200+ lines) - Downloads vendor.yml from Linguist - Downloads generated.rb from Linguist - Parses YAML and Ruby code - Extracts vendored, generated, and binary file patterns - Generates Rust code arrays for FileClassifier tools/linguist_sync.rs (130+ lines) - Rust implementation roadmap - Pattern parsing architecture - Code generation infrastructure Updated Files: build.rs: Enhanced documentation - Added manual synchronization workflow - Documented automated (future) workflow - Phase 2 in-progress status - Maintenance instructions justfile: New command - just sync-linguist: Run Python script to sync patterns - Provides step-by-step next actions - Integrates into development workflow LINGUIST_INTEGRATION.md: Detailed Phase 2 documentation - Status: FileClassifier, Script, Integration, CI - Manual + Automated sync workflows - Implementation details - Usage examples Workflow: For Maintainers (When Linguist Updates): just sync-linguist cargo test git add . git commit For Automation (Future): cargo xtask sync-linguist What Gets Synced: - Vendored paths: node_modules/, vendor/, .yarn/ - Generated files: .pb.rs, .generated.ts, .designer.cs - Binary formats: images, archives, executables 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
…tions Complete Automation: Linguist Sync via Renovate + GitHub Actions What's New: - 100% Pure Rust implementation (no Python/Perl/Bash) - GitHub Actions workflow for automatic sync - Enhanced Cargo.toml with required dependencies - Updated Renovate config with workflow info Workflow: 1. Renovate detects Linguist update (weekly) 2. Creates PR automatically 3. GitHub Actions triggers sync tool 4. Downloads vendor.yml, generated.rb, heuristics.yml 5. Parses and generates src/file_classifier_generated.rs 6. Validates with cargo test 7. Auto-commits changes 8. Posts summary on PR Phases Automated: - Phase 2: File classification (vendor, generated, binary) - Phase 3: Language detection heuristics (ambiguous extensions) Files Modified: - Cargo.toml: Added deps and bin definition - tools/linguist_sync.rs: Full Rust implementation - .github/workflows/sync-linguist.yml: GitHub Actions workflow - renovate.json5: Updated PR instructions - justfile: Updated sync command - LINGUIST_INTEGRATION.md: Full documentation 100% Pure Rust with Renovate + GitHub Actions automation 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
- Fix example usage.rs to properly load AtomicBool values with Ordering::Relaxed - Update doctest to use \`no_run\` to avoid test environment issues - Update test fixture to include all PatternSignatures fields with defaults This ensures compatibility with the updated LanguageInfo structure where ast_grep_supported is now an AtomicBool instead of a plain bool. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
PR Compliance Guide 🔍Below is a summary of compliance checks for this PR:
Compliance status legend🟢 - Fully Compliant🟡 - Partial Compliant 🔴 - Not Compliant ⚪ - Requires Further Human Verification 🏷️ - Compliance label |
||||||||||||||||||||||||||||
PR Code Suggestions ✨Explore these optional code suggestions:
|
|||||||||||||||
Use cargo:notice= instead of cargo:warning= for successful validation messages. This prevents successful builds from showing as warnings when the validation actually completed successfully. Only use cargo:warning= for actual issues and errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| #[non_exhaustive] | ||
| pub struct LanguageInfo { | ||
| /// Unique language identifier (e.g., `"rust"`, `"elixir"`) | ||
| /// Derived from GitHub Linguist language names (lowercased) | ||
| pub id: String, | ||
| /// Human-readable language name (e.g., `"Rust"`, `"Elixir"`) | ||
| pub name: String, | ||
| /// File extensions for this language (e.g., `rs`, or `ex`/`exs`) | ||
| /// Source: GitHub Linguist | ||
| pub extensions: Vec<String>, | ||
| /// Alternative names/aliases (e.g., `js`, `javascript`) | ||
| pub aliases: Vec<String>, | ||
| /// Whether this language is supported by Singularity's parsing engine | ||
| /// Default: false (only explicitly supported languages are true) | ||
| pub supported_in_singularity: bool, | ||
| /// Tree-sitter language name (if supported) | ||
| pub tree_sitter_language: Option<String>, | ||
| /// Whether RCA (rust-code-analysis) supports this language | ||
| pub rca_supported: bool, | ||
| /// Whether AST-Grep supports this language | ||
| pub ast_grep_supported: bool, | ||
| pub rca_supported: AtomicBool, | ||
| /// Whether AST-Grep supports this language (set at runtime by engines) | ||
| pub ast_grep_supported: AtomicBool, | ||
| /// MIME types for this language | ||
| pub mime_types: Vec<String>, | ||
| /// Language family (e.g., "BEAM", "C-like", "Web") | ||
| pub family: Option<String>, | ||
| /// Whether this is a compiled or interpreted language | ||
| pub is_compiled: bool, | ||
| /// Language type from Linguist: "programming", "markup", "data", "prose" | ||
| pub language_type: String, | ||
| /// Pattern signatures for cross-language pattern detection | ||
| #[serde(default)] | ||
| pub pattern_signatures: PatternSignatures, | ||
| /// Dynamic capability bits controlled by downstream engines | ||
| #[serde(skip)] | ||
| pub capabilities: AtomicU32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Derive adds serde bounds missing for atomic fields
The newly added atomics in LanguageInfo are still deriving Serialize/Deserialize, but AtomicBool and AtomicU32 do not implement those serde traits. The derive therefore cannot compile – the compiler will emit the trait Serialize is not implemented for AtomicBool (same for AtomicU32). Because this struct is used throughout the crate, the entire crate fails to build. Either drop the serde derives from LanguageInfo and rely on the new LanguageInfoSnapshot, or provide custom serialization helpers for the atomic fields.
Useful? React with 👍 / 👎.
…nused-result fixes
…at run inside nix devShell
🔍 Automated Checks🔍 Checking for stale files and out-of-scope changes... Stale File Check✅ No stale files detected Scope CheckChecking file relevance (blocks binaries, temp files, etc.)... ✅ All changes appear relevant (includes .github/ workflows, src/, docs, config) ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback. |
…uild Replace openssl-sys with pure Rust rustls-tls backend for reqwest. This allows sync-linguist binary to build without system OpenSSL libraries, enabling it to work in CI/CD environments without nix develop. - Changed reqwest to use rustls-tls feature - Disabled default-tls (OpenSSL) feature - Resolves CI/CD build failures for sync-linguist binary 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
🔍 Automated Checks🔍 Checking for stale files and out-of-scope changes... Stale File Check✅ No stale files detected Scope CheckChecking file relevance (blocks binaries, temp files, etc.)... ✅ All changes appear relevant (includes .github/ workflows, src/, docs, config) ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback. |
This commit significantly improves the linguist sync tool (Phase 2 & 3): ## Tool Improvements - Add proper logging support (log, env_logger) - Replace regex-based parsing with serde YAML deserialization - Add proper data structures for heuristics (Disambiguation, Rule, etc.) - Improve error handling with anyhow::Context - Write files directly to src/ instead of stdout redirection - Increase fetch timeout from 30s to 45s for reliability ## Generated Files - src/file_classifier_generated.rs (7.8K) - 167 vendored code patterns from vendor.yml - 82 generated file patterns from generated.rb - src/heuristics_generated.rs (117K) - 124 disambiguation groups from heuristics.yml - 21 named patterns - Full rule-based language detection support ## Workflow Updates - Update sync-linguist.yml to remove stdout redirect - Track both generated files in commits - Update documentation to mention both outputs ## Testing - All 17 tests pass - Tool successfully fetches and parses latest Linguist data - Deterministic output (idempotent runs) Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds comprehensive language metadata synchronization (Phase 4) to complement the existing pattern sync (Phases 2 & 3). ## New Features - Download and parse languages.yml (157KB, 789 languages) - Generate Rust types with full language metadata: - Extensions, filenames, interpreters - Syntax highlighting modes (ace_mode, tm_scope, codemirror) - Visual metadata (colors, aliases) - Language categorization (type, group) - Editor configuration (wrap, fs_name) - Save raw languages.yml to .github/linguist/ for snapshot workflow ## Generated Files - src/languages_metadata_generated.rs (448KB) - `LanguageMetadata` struct with all Linguist fields - `LANGUAGES` const array with 789 language definitions - .github/linguist/languages.yml (154KB) - Raw YAML for publish-snapshot workflow ## Workflow Updates - Update sync-linguist.yml to commit all 4 generated files - Update documentation to mention Phase 4 - Update PR comments to show complete sync status ## Architecture The tool now provides both: 1. Rust const data (embedded in binary) for performance 2. Raw YAML (for external tooling and snapshot generation) This gives downstream consumers flexibility to choose their integration approach. Co-Authored-By: Claude <noreply@anthropic.com>
🔍 Automated Checks🔍 Checking for stale files and out-of-scope changes... Stale File Check✅ No stale files detected Scope CheckChecking file relevance (blocks binaries, temp files, etc.)... ✅ All changes appear relevant (includes .github/ workflows, src/, docs, config) ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback. |
Release highlights: - Complete Linguist sync automation (Phases 2, 3 & 4) - 789 languages with full metadata - Automated snapshot publishing workflows - Enhanced development infrastructure See CHANGELOG.md for full details.
🔍 Automated Checks🔍 Checking for stale files and out-of-scope changes... Stale File Check✅ No stale files detected Scope CheckChecking file relevance (blocks binaries, temp files, etc.)... ✅ All changes appear relevant (includes .github/ workflows, src/, docs, config) ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback. |
The workspace configuration puts binaries in target/release/, not tools/*/target/release/. Updated both workflows to use the correct path.
🔍 Automated Checks🔍 Checking for stale files and out-of-scope changes... Stale File Check✅ No stale files detected Scope CheckChecking file relevance (blocks binaries, temp files, etc.)... ✅ All changes appear relevant (includes .github/ workflows, src/, docs, config) ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback. |
Use -p linguist_to_snapshot instead of --bin to properly build workspace member binaries.
🔍 Automated Checks🔍 Checking for stale files and out-of-scope changes... Stale File Check✅ No stale files detected Scope CheckChecking file relevance (blocks binaries, temp files, etc.)... ✅ All changes appear relevant (includes .github/ workflows, src/, docs, config) ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback. |
🔍 Automated Checks🔍 Checking for stale files and out-of-scope changes... Stale File Check✅ No stale files detected Scope CheckChecking file relevance (blocks binaries, temp files, etc.)... ✅ All changes appear relevant (includes .github/ workflows, src/, docs, config) ℹ️ Note: 1 .github/ file(s) changed - workflows/actions are critical infrastructure Claude is reviewing the code... Check the "Claude Code Review" step for detailed feedback. |
CI Feedback 🧐(Feedback updated until commit c44b5f4)A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
|
Overview
This PR establishes a complete automated workflow for synchronizing GitHub Linguist data and publishing language snapshots. It includes comprehensive pattern sync (Phases 2-4), automated workflows, and publishing infrastructure.
Key Features
🔄 Linguist Sync Tool (Phases 2, 3 & 4)
Phase 2: File Classification
vendor.ymlgenerated.rbPhase 3: Language Detection Heuristics
heuristics.ymlPhase 4: Language Metadata ⭐ NEW
languages.yml📦 Generated Files
src/file_classifier_generated.rs(7.8K)src/heuristics_generated.rs(117K)src/languages_metadata_generated.rs(448K) ⭐ NEW.github/linguist/languages.yml(154K) ⭐ NEW🤖 Automated Workflows
sync-linguist.yml
publish-snapshot.yml
mainvalidate-snapshot.yml
publish-docs.yml
mainTechnical Improvements
Sync Tool Refactor
anyhow::Contextenv_loggerArchitecture
Provides both:
This gives downstream consumers flexibility in integration approach.
Testing
✅ All 17 tests pass
✅ Clippy passes with pedantic + nursery lints
✅ Pre-commit/pre-push hooks pass
✅ Sync tool successfully fetches and parses latest Linguist data
Migration Notes
No breaking changes. This is purely additive functionality that enhances the existing language registry with automated sync capabilities.
Follow-up Work
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com