
README Auto-Generation System

This project develops a multi-modal AI system that addresses a persistent developer problem: keeping documentation informative and up to date. The core service is the Automatic README Generation AI, which transforms raw GitHub code into structured Markdown documentation. This feature is integrated into a larger career-management platform that aims to streamline developer workflows and enhance job-seeking capabilities.

The Challenge of Documentation & Token Efficiency

Manually documenting extensive codebases is tedious and error-prone. Furthermore, feeding entire code files to Large Language Models (LLMs) is prohibitively expensive due to high token consumption.

Our Solution: Implement an Abstract Syntax Tree (AST) Analysis Pipeline to pre-process code, drastically reducing the token budget while maximizing the information density of the input prompt.
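As an illustration of the idea, here is a minimal sketch using Python's standard-library `ast` module (the project itself uses libcst, and the real pipeline extracts more node types than shown here):

```python
import ast

def extract_skeleton(source: str) -> dict:
    """Reduce raw Python source to the structural nodes an LLM prompt needs:
    imports, function signatures, and docstrings."""
    tree = ast.parse(source)
    imports, functions = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "doc": ast.get_docstring(node),
            })
    return {"imports": imports, "functions": functions}

sample = '''
import requests

def fetch_repo(url, token=None):
    "Download repository metadata."
    return requests.get(url)
'''
print(extract_skeleton(sample))
```

The LLM then sees only this compact skeleton instead of the full file body, which is where the token savings come from.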


📂 Branch/Commit Naming Convention

Examples:

  • feat/login-api
  • fix/comment-delete-bug
  • test/user-service-test

Type List:

| Type | Description |
| --- | --- |
| `feat` | Add a new feature |
| `fix` | Fix a bug |
| `refactor` | Improve code quality without changing functionality |
| `test` | Add or modify test code |
| `hotfix` | Apply an urgent fix |

1. Demonstration and Visual Context

1-1. Project Workflow Demonstration

This video showcases the end-to-end functionality, from inputting a GitHub repository URL to receiving the final, structured Markdown README file.

[Demonstration video]


2. Technical Architecture and Model Selection

The system utilizes a specialized combination of models for task-specific performance, leveraging both local GPU infrastructure and external, high-throughput APIs.

2-1. The README Generation Pipeline (QwenCoder + AST)

The README generation process is engineered for efficiency and code comprehension:

| Stage | Process | Key Technology / Model | Rationale |
| --- | --- | --- | --- |
| I. Code Ingestion | Retrieve target code files from a linked GitHub repository. | GitHub API integration | Ensures access to the most current codebase. |
| II. Token Optimization | Convert raw code (e.g., Python) into an Abstract Syntax Tree (AST); extract only critical nodes (function definitions, library imports, key logic flow). | AST parser (libcst or equivalent) | Critical for cost reduction and focus; reduces LLM input token count by up to 90%. |
| III. Generation | Transform the processed AST metadata into a concise, context-rich prompt and feed it to the LLM. | QwenCoder model | Selected for superior performance, as detailed below. |
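Stage III can be pictured as a small prompt builder over the extracted metadata. The metadata shape below (`imports` / `functions` keys) is a hypothetical illustration, not the project's actual schema:

```python
def build_prompt(meta: dict) -> str:
    """Turn extracted AST metadata into a compact README-generation prompt.
    `meta` is assumed to hold 'imports' and 'functions' lists (hypothetical shape)."""
    lines = [
        "Generate a README for a project with the following structure.",
        "Imports: " + ", ".join(meta["imports"]),
    ]
    for fn in meta["functions"]:
        signature = f"{fn['name']}({', '.join(fn['args'])})"
        doc = fn.get("doc") or "no docstring"
        lines.append(f"Function {signature}: {doc}")
    return "\n".join(lines)

meta = {
    "imports": ["requests"],
    "functions": [{"name": "fetch_repo", "args": ["url"], "doc": "Download repo."}],
}
print(build_prompt(meta))
```

A prompt of this shape carries the code's structure in a few hundred tokens, regardless of how long the original file bodies are.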

2-2. Model Selection Rationale (QwenCoder)

The QwenCoder model was chosen as the primary engine for README generation based on stringent performance criteria:

| Criterion | QwenCoder Performance | Justification |
| --- | --- | --- |
| Code Understanding | High | Proven ability to grasp complex code structure and context. |
| Multilingual Support | High | Essential for processing code in diverse programming languages. |
| MultiPL-E Score | High ranking | Verifies strong performance on the Multi-Programming Language Evaluation benchmark. |
| McEval Score | High ranking | Verifies superior performance on the Massively Multilingual Code Evaluation benchmark. |

2-3. Code Analysis for Tagging (Ensemble Learning)

A separate ensemble system analyzes repository content to recommend technical tags, prioritizing accuracy and diversity.

| Model / System | Role | Execution Method | Rationale |
| --- | --- | --- | --- |
| Gemini-thinking | Inference & reasoning | Multi-threaded API call | Strong reasoning and structural interpretation. |
| Qwen-coder-32b | Code-specific analysis | Multi-threaded API call | Robust, code-centric analysis. |
| llama-versatile | Auxiliary analysis | Multi-threaded API call | Contributes diverse perspectives and maintains high throughput. |
| Ensemble aggregation | Final decision | Bagging technique (JSON output) | Combines results by majority vote to significantly boost tag reliability. |
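The bagging step amounts to a majority vote over each model's proposed tags. A minimal sketch, with placeholder tag lists standing in for the three models' JSON outputs:

```python
from collections import Counter

def aggregate_tags(model_outputs: list[list[str]], min_votes: int = 2) -> list[str]:
    """Bagging-style aggregation: keep tags proposed by at least `min_votes` models."""
    votes = Counter(tag for tags in model_outputs for tag in set(tags))
    return sorted(tag for tag, count in votes.items() if count >= min_votes)

outputs = [
    ["python", "llm", "ast"],      # e.g. Gemini-thinking's proposal
    ["python", "llm", "fastapi"],  # e.g. Qwen-coder-32b's proposal
    ["python", "ast", "cli"],      # e.g. llama-versatile's proposal
]
print(aggregate_tags(outputs))  # → ['ast', 'llm', 'python']
```

Tags proposed by only one model (here `fastapi` and `cli`) are dropped, which is what suppresses individual-model hallucinations.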

3. Environment Setup and Execution

3-1. Development Requirements

| Dependency | Purpose | Installation |
| --- | --- | --- |
| Python | Core runtime environment. | Python 3.10+ |
| Libraries / SDKs | Required for LLM inference, AST analysis, and external model access. | `pip install -r requirements.txt` (see `requirements.txt` for exact versions) |

This project was developed and tested with the following key libraries:

| Package | Version | Purpose |
| --- | --- | --- |
| python | 3.10.x | Core runtime environment |
| libcst | 1.1.0 | AST-based code parsing |
| requests | 2.31.0 | GitHub API communication |
| python-dotenv | 1.0.1 | Environment variable management |
| groq | 0.9.0 | Groq API client |
| google-generativeai | 0.5.2 | Gemini model access |
| tqdm | 4.66.1 | Progress bar visualization |

For the full dependency list, see requirements.txt.


3-2. API Keys Configuration

To ensure secure access to external LLM services, create a file named .env in the project's root directory:

```
# --- External AI Service Credentials ---

# Google AI (Gemini) API Key for the Code Analysis Ensemble
# Used for its robust inference capabilities.
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"

# Groq API Key for the Code Analysis Ensemble
# Used for its high-speed inference in multi-model parallel processing.
GROQ_API_KEY="YOUR_GROQ_API_KEY_HERE"

# --- Local Server Configuration ---

# URL for the local vLLM server hosting the BART model (for summarization feature)
BART_VLLM_SERVER_URL="http://127.0.0.1:8000/v1/completions"
```
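In practice, python-dotenv's `load_dotenv()` handles reading this file. The stdlib-only sketch below illustrates the mechanism, using a throwaway file and a made-up variable name rather than a real credential:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: push KEY=VALUE pairs into os.environ.
    (The project lists python-dotenv, whose load_dotenv() does this robustly;
    this sketch only illustrates the idea.)"""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Existing environment variables win over file values.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Demo with a throwaway file and a hypothetical variable name.
with open("demo.env", "w", encoding="utf-8") as fh:
    fh.write('# comment line\nDEMO_GEMINI_API_KEY="demo-key"\n')
load_env("demo.env")
print(os.environ["DEMO_GEMINI_API_KEY"])
```

Keeping credentials in `.env` (and out of version control) is what makes the external API access "secure" in the sense used above.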

3-3. Execution Steps (current project)

  1. Install dependencies
  • Recommended: use a Python 3.10 virtual environment (venv or conda)
  • PowerShell (Windows) example:

```powershell
# venv example
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r .\requirements.txt

# or conda example
conda create -n gs-env python=3.10 -y
conda activate gs-env
python -m pip install -r .\requirements.txt
```
  2. Configure API keys (.env)
  • Create a .env file in the project root and set the required keys.
  • Main environment variables used by the code:
    • GROQ_API_KEY (used for README generation)
    • GOOGLE_API_KEY (used for Gemini / google-generativeai calls)
    • GITHUB_TOKEN (optional; recommended for GitHub API requests)

Example .env:

```
GROQ_API_KEY=your_groq_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
GITHUB_TOKEN=ghp_xxx...   # optional: use to increase rate limits / access private repos
```
  3. vLLM / local LLM server (optional)
  • By default, this project uses external APIs (Groq, Google) for README/tag generation; a local vLLM server or GPU is not required.
  • Set up a local vLLM server only if you plan to run local models; doing so may require changes to .env and the code.
  4. Run the script
  • The entry point is main.py. Pass the GitHub repository URL as the argument.
  • Basic run example (PowerShell):

```powershell
# Default run: generate README, extract tags, download image
python .\main.py "https://github.com/owner/repo"
```

  • Options:
    • --out <folder> : output directory to save results (default: output)
    • --no-readme : skip README generation
    • --no-tags : skip tag extraction
    • --no-image : skip image selection/download
  • Example (custom output folder, skip image):

```powershell
python .\main.py "https://github.com/owner/repo" --out .\results --no-image
```
  5. Output files and locations
  • Default output structure: output/<owner__repo>/
    • GENERATED_README.md : generated README (if produced)
    • TAGS.json : tag extraction result (if produced)
    • repo_image.<ext> : selected repository image (if produced)
  • If .env is missing or API keys are not set, some features (README generation, tag extraction) will not run; the script prints errors or skips those steps.
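The options and output layout above can be sketched as a small CLI shim. The flag names and the `<owner__repo>` folder convention come from this README; the implementation details are assumptions, not the actual main.py:

```python
import argparse
from urllib.parse import urlparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented main.py options (a sketch, not the real entry point)."""
    p = argparse.ArgumentParser(description="Generate a README for a GitHub repository.")
    p.add_argument("repo_url", help="GitHub repository URL, e.g. https://github.com/owner/repo")
    p.add_argument("--out", default="output", help="output directory (default: output)")
    p.add_argument("--no-readme", action="store_true", help="skip README generation")
    p.add_argument("--no-tags", action="store_true", help="skip tag extraction")
    p.add_argument("--no-image", action="store_true", help="skip image selection/download")
    return p

def output_dir(repo_url: str, base: str) -> str:
    """Derive the <base>/<owner__repo>/ folder from a GitHub URL
    (the naming convention is from this README; the parsing is a sketch)."""
    owner, repo = urlparse(repo_url).path.strip("/").split("/")[:2]
    return f"{base}/{owner}__{repo.removesuffix('.git')}"

args = build_parser().parse_args(
    ["https://github.com/owner/repo", "--out", "results", "--no-image"])
print(output_dir(args.repo_url, args.out))  # → results/owner__repo
```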


4. References & Related Works

This project is grounded in prior research and official technical documentation related to code analysis, token optimization, and large language models.

Abstract Syntax Tree (AST) & Code Analysis

  • LibCST Documentation – Concrete Syntax Tree for Python
    https://libcst.readthedocs.io/
  • Baxter, I. D., et al. Clone Detection Using Abstract Syntax Trees, IEEE, 1998.

Ensemble Learning

  • Dietterich, T. G. Ensemble Methods in Machine Learning, Springer, 2000.

About

Open-Source SW Term Project, Fall semester 2025.
