@mishig25
Contributor

Summary

This PR introduces a new command and workflow for populating the search engine with documentation from the HF doc-build dataset.

Changes

  • New markdown splitting logic: Split markdown content by headings (h1-h6) with hierarchy preservation
  • New module (process_hf_docs.py): Fetches and processes pre-built docs from the hf-doc-build/doc-build dataset
  • New CLI command (populate-search-engine): Replaces the old build-embeddings command, which has been removed
  • New GitHub Actions workflow: Automated documentation processing, including dependency management
  • Skip embeddings flag: Allows testing doc processing without generating embeddings
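
To illustrate the heading-based splitting listed above, here is a minimal sketch, not the code from this PR. It assumes ATX-style headings (`#` through `######`), skips heading-like lines inside fenced code blocks, and records the heading path for each chunk. The names `Chunk` and `split_markdown_by_headings` are hypothetical.

```python
import re
from dataclasses import dataclass

# ATX headings only: one to six '#' characters followed by whitespace and a title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
# Naive fence detection: any line starting with ``` or ~~~ toggles code mode.
FENCE_RE = re.compile(r"^\s*(```|~~~)")


@dataclass
class Chunk:
    headings: list  # heading path, outermost first
    text: str       # body text found under that path


def split_markdown_by_headings(markdown: str) -> list:
    chunks = []
    stack = []   # (level, title) pairs for the currently open headings
    buffer = []  # body lines collected since the last heading
    in_fence = False

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk([title for _, title in stack], text))
        buffer.clear()

    for line in markdown.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()  # close the previous section under the old heading path
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks
```

Each chunk carries its full heading path, so a search result can show where in a page the text came from. Sections with no body text are skipped in this sketch.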

Key Features

  1. Downloads pre-built documentation from HuggingFace dataset API
  2. Processes only markdown files (ignores HTML)
  3. Splits markdown by headings while maintaining hierarchy
  4. Generates proper URLs matching the HF docs structure
  5. Supports selective library processing for testing
  6. Uses a dedicated tool for fast dependency management
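
Features 2 and 5 above (markdown only, selective library processing) amount to filtering a list of file paths. The sketch below shows one way to do that, under two assumptions not confirmed by this PR: markdown files end in `.md`, and the first path component is the library name. `select_markdown_files` is a hypothetical helper, not a function from the module.

```python
from pathlib import PurePosixPath


def select_markdown_files(paths, libraries=None):
    """Keep only markdown files, optionally restricted to some libraries.

    Assumes (hypothetically) that each path looks like '<library>/.../file.md'.
    """
    wanted = set(libraries) if libraries else None
    selected = []
    for path in paths:
        parsed = PurePosixPath(path)
        if parsed.suffix.lower() != ".md":
            continue  # ignore HTML and any other non-markdown files
        if wanted is not None and (not parsed.parts or parsed.parts[0] not in wanted):
            continue  # library was not requested
        selected.append(path)
    return selected
```

Passing a short `libraries` list keeps a test run small; leaving it unset processes every library.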

Testing

The workflow includes a skip-embeddings flag for initial testing without the full embedding pipeline.
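
As a rough illustration of how such a flag can gate the embedding step, here is a small `argparse` sketch. The spelling `--skip-embeddings`, the step names, and the functions `build_parser` and `plan_steps` are assumptions made for the example; the actual CLI wiring in this PR may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag's purpose comes from the PR description.
    parser = argparse.ArgumentParser(prog="populate-search-engine")
    parser.add_argument(
        "--skip-embeddings",
        action="store_true",
        help="process docs but do not generate embeddings (for testing)",
    )
    return parser


def plan_steps(argv=None) -> list:
    """Return the pipeline steps that would run for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = ["download", "split"]
    if not args.skip_embeddings:
        steps.append("embed")  # step omitted when the flag is set
    return steps
```

With the flag set, only the download and split steps are planned, which matches the stated goal of testing doc processing in isolation.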

- Add new markdown splitting logic based on headings (h1-h6)
- Implement process_hf_docs.py to fetch and process docs from HF dataset
- Add populate-search-engine CLI command (replaces build-embeddings)
- Add GitHub Actions workflow for automated doc processing
- Support downloading pre-built docs from hf-doc-build/doc-build dataset
- Handle markdown chunking with heading hierarchy preservation
- Add skip-embeddings flag for testing without embedding generation
…workflow was responsible for building embeddings for various Hugging Face repositories and has been deprecated in favor of the new populate-search-engine command and workflow.
- Gradio docs use separate JSON format from gradio/docs dataset
- Gradio is not in hf-doc-build dataset, needs separate processing
- Uncomment cleanup-job and update to depend on both jobs
- Remove accidental 'on: push' trigger