
README Auto-Generation System

This project develops a multi-modal AI system that addresses a persistent developer problem: keeping documentation informative and up to date. The core service is the Automatic README Generation AI, which transforms raw GitHub code into structured Markdown documentation. This feature is integrated into a larger career-management platform that aims to streamline developer workflows and enhance job-seeking capabilities.

The Challenge of Documentation & Token Efficiency

Manually documenting extensive codebases is tedious and error-prone. Furthermore, feeding entire code files to Large Language Models (LLMs) is prohibitively expensive due to high token consumption.

Our Solution: Implement an Abstract Syntax Tree (AST) Analysis Pipeline to pre-process code, drastically reducing the token budget while maximizing the information density of the input prompt.
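As an illustration of the idea, here is a minimal sketch using Python's standard-library `ast` module (the project itself uses libcst, and the real pipeline extracts more node types than shown here):

```python
import ast

def extract_skeleton(source: str) -> dict:
    """Reduce raw Python source to the structural nodes an LLM prompt needs:
    imports, function signatures, and docstrings."""
    tree = ast.parse(source)
    imports, functions = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "doc": ast.get_docstring(node),
            })
    return {"imports": imports, "functions": functions}

sample = '''
import requests

def fetch_repo(url, token=None):
    "Download repository metadata."
    return requests.get(url)
'''
print(extract_skeleton(sample))
```

The LLM then sees only this compact skeleton instead of the full file body, which is where the token savings come from.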


📂 Branch/Commit Naming Convention

Examples:

  • feat/login-api
  • fix/comment-delete-bug
  • test/user-service-test

Type List:

| Type | Description |
| --- | --- |
| `feat` | Add a new feature |
| `fix` | Fix a bug |
| `refactor` | Improve code quality without changing functionality |
| `test` | Add or modify test code |
| `hotfix` | Apply an urgent fix |

1. Demonstration and Visual Context

1-1. Project Workflow Demonstration

This video showcases the end-to-end functionality, from inputting a GitHub repository URL to receiving the final, structured Markdown README file.

[Demonstration video]


2. Technical Architecture and Model Selection

The system utilizes a specialized combination of models for task-specific performance, leveraging both local GPU infrastructure and external, high-throughput APIs.

2-1. The README Generation Pipeline (QwenCoder + AST)

The README generation process is engineered for efficiency and code comprehension:

| Stage | Process | Key Technology / Model | Rationale |
| --- | --- | --- | --- |
| I. Code Ingestion | Retrieve target code files from a linked GitHub repository. | GitHub API integration | Ensures access to the most current codebase. |
| II. Token Optimization | Convert raw code (e.g., Python) into an Abstract Syntax Tree (AST); extract only critical nodes (function definitions, library imports, key logic flow). | AST parser (libcst or equivalent) | Critical for cost reduction and focus; reduces LLM input token count by up to 90%. |
| III. Generation | Transform the processed AST metadata into a concise, context-rich prompt and feed it to the LLM. | QwenCoder model | Selected for superior performance, as detailed below. |
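Stage III can be pictured as a small prompt builder over the extracted metadata. The metadata shape below (`imports` / `functions` keys) is a hypothetical illustration, not the project's actual schema:

```python
def build_prompt(meta: dict) -> str:
    """Turn extracted AST metadata into a compact README-generation prompt.
    `meta` is assumed to hold 'imports' and 'functions' lists (hypothetical shape)."""
    lines = [
        "Generate a README for a project with the following structure.",
        "Imports: " + ", ".join(meta["imports"]),
    ]
    for fn in meta["functions"]:
        signature = f"{fn['name']}({', '.join(fn['args'])})"
        doc = fn.get("doc") or "no docstring"
        lines.append(f"Function {signature}: {doc}")
    return "\n".join(lines)

meta = {
    "imports": ["requests"],
    "functions": [{"name": "fetch_repo", "args": ["url"], "doc": "Download repo."}],
}
print(build_prompt(meta))
```

A prompt of this shape carries the code's structure in a few hundred tokens, regardless of how long the original file bodies are.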

2-2. Model Selection Rationale (QwenCoder)

The QwenCoder model was chosen as the primary engine for README generation based on stringent performance criteria:

| Criterion | QwenCoder Performance | Justification |
| --- | --- | --- |
| Code Understanding | High | Proven ability to grasp complex code structure and context. |
| Multilingual Support | High | Essential for processing code in diverse programming languages. |
| MultiPL-E Score | High ranking | Verifies strong performance on the Multi-Programming Language Evaluation benchmark. |
| McEval Score | High ranking | Verifies superior performance on the Massively Multilingual Code Evaluation benchmark. |

2-3. Code Analysis for Tagging (Ensemble Learning)

A separate ensemble system analyzes repository content to recommend technical tags, prioritizing accuracy and diversity.

| Model / System | Role | Execution Method | Rationale |
| --- | --- | --- | --- |
| Gemini-thinking | Inference & reasoning | Multi-threaded API call | Strong reasoning and structural interpretation. |
| Qwen-coder-32b | Code-specific analysis | Multi-threaded API call | Robust, code-centric analysis. |
| llama-versatile | Auxiliary analysis | Multi-threaded API call | Contributes diverse perspectives and maintains high throughput. |
| Ensemble aggregation | Final decision | Bagging technique (JSON output) | Combines results by majority vote to significantly boost tag reliability. |
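The bagging step amounts to a majority vote over each model's proposed tags. A minimal sketch, with placeholder tag lists standing in for the three models' JSON outputs:

```python
from collections import Counter

def aggregate_tags(model_outputs: list[list[str]], min_votes: int = 2) -> list[str]:
    """Bagging-style aggregation: keep tags proposed by at least `min_votes` models."""
    votes = Counter(tag for tags in model_outputs for tag in set(tags))
    return sorted(tag for tag, count in votes.items() if count >= min_votes)

outputs = [
    ["python", "llm", "ast"],      # e.g. Gemini-thinking's proposal
    ["python", "llm", "fastapi"],  # e.g. Qwen-coder-32b's proposal
    ["python", "ast", "cli"],      # e.g. llama-versatile's proposal
]
print(aggregate_tags(outputs))  # → ['ast', 'llm', 'python']
```

Tags proposed by only one model (here `fastapi` and `cli`) are dropped, which is what suppresses individual-model hallucinations.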

3. Environment Setup and Execution

3-1. Development Requirements

| Dependency | Purpose | Installation |
| --- | --- | --- |
| Python | Core runtime environment. | Python 3.10+ |
| Libraries / SDKs | Required for LLM inference, AST analysis, and external model access. | `pip install -r requirements.txt` (see `requirements.txt` for exact versions) |

This project was developed and tested with the following key libraries:

| Package | Version | Purpose |
| --- | --- | --- |
| python | 3.10.x | Core runtime environment |
| libcst | 1.1.0 | AST-based code parsing |
| requests | 2.31.0 | GitHub API communication |
| python-dotenv | 1.0.1 | Environment variable management |
| groq | 0.9.0 | Groq API client |
| google-generativeai | 0.5.2 | Gemini model access |
| tqdm | 4.66.1 | Progress bar visualization |

For the full dependency list, see requirements.txt.


3-2. API Keys Configuration

To ensure secure access to external LLM services, create a file named .env in the project's root directory:

```
# --- External AI Service Credentials ---

# Google AI (Gemini) API Key for the Code Analysis Ensemble
# Used for its robust inference capabilities.
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"

# Groq API Key for the Code Analysis Ensemble
# Used for its high-speed inference in multi-model parallel processing.
GROQ_API_KEY="YOUR_GROQ_API_KEY_HERE"

# --- Local Server Configuration ---

# URL for the local vLLM server hosting the BART model (for summarization feature)
BART_VLLM_SERVER_URL="http://127.0.0.1:8000/v1/completions"
```
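In practice, python-dotenv's `load_dotenv()` handles reading this file. The stdlib-only sketch below illustrates the mechanism, using a throwaway file and a made-up variable name rather than a real credential:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: push KEY=VALUE pairs into os.environ.
    (The project lists python-dotenv, whose load_dotenv() does this robustly;
    this sketch only illustrates the idea.)"""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Existing environment variables win over file values.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Demo with a throwaway file and a hypothetical variable name.
with open("demo.env", "w", encoding="utf-8") as fh:
    fh.write('# comment line\nDEMO_GEMINI_API_KEY="demo-key"\n')
load_env("demo.env")
print(os.environ["DEMO_GEMINI_API_KEY"])
```

Keeping credentials in `.env` (and out of version control) is what makes the external API access "secure" in the sense used above.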

3-3. Execution Steps (current project)

  1. Install dependencies
  • Recommended: use a Python 3.10 virtual environment (venv or conda)
  • PowerShell (Windows) example:

```powershell
# venv example
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r .\requirements.txt

# or conda example
conda create -n gs-env python=3.10 -y
conda activate gs-env
python -m pip install -r .\requirements.txt
```
  2. Configure API keys (.env)
  • Create a .env file in the project root and set the required keys.
  • Main environment variables used by the code:
    • GROQ_API_KEY (used for README generation)
    • GOOGLE_API_KEY (used for Gemini / google-generativeai calls)
    • GITHUB_TOKEN (optional; recommended for GitHub API requests)

Example .env:

```
GROQ_API_KEY=your_groq_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
GITHUB_TOKEN=ghp_xxx...   # optional: use to increase rate limits / access private repos
```
  3. vLLM / local LLM server (optional)
  • By default, this project uses external APIs (Groq, Google) for README/tag generation; a local vLLM server or GPU is not required.
  • Set up a local vLLM server only if you plan to run local models; doing so may require changes to .env and the code.
  4. Run the script
  • The entry point is main.py. Pass the GitHub repository URL as the argument.
  • Basic run example (PowerShell):

```powershell
# Default run: generate README, extract tags, download image
python .\main.py "https://github.com/owner/repo"
```

  • Options:
    • --out <folder> : output directory to save results (default: output)
    • --no-readme : skip README generation
    • --no-tags : skip tag extraction
    • --no-image : skip image selection/download
  • Example (custom output folder, skip image):

```powershell
python .\main.py "https://github.com/owner/repo" --out .\results --no-image
```
  5. Output files and locations
  • Default output structure: output/<owner__repo>/
    • GENERATED_README.md : generated README (if produced)
    • TAGS.json : tag extraction result (if produced)
    • repo_image.<ext> : selected repository image (if produced)
  • If .env is missing or API keys are not set, some features (README generation, tag extraction) will not run; the script prints errors or skips those steps.
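The options and output layout above can be sketched as a small CLI shim. The flag names and the `<owner__repo>` folder convention come from this README; the implementation details are assumptions, not the actual main.py:

```python
import argparse
from urllib.parse import urlparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented main.py options (a sketch, not the real entry point)."""
    p = argparse.ArgumentParser(description="Generate a README for a GitHub repository.")
    p.add_argument("repo_url", help="GitHub repository URL, e.g. https://github.com/owner/repo")
    p.add_argument("--out", default="output", help="output directory (default: output)")
    p.add_argument("--no-readme", action="store_true", help="skip README generation")
    p.add_argument("--no-tags", action="store_true", help="skip tag extraction")
    p.add_argument("--no-image", action="store_true", help="skip image selection/download")
    return p

def output_dir(repo_url: str, base: str) -> str:
    """Derive the <base>/<owner__repo>/ folder from a GitHub URL
    (the naming convention is from this README; the parsing is a sketch)."""
    owner, repo = urlparse(repo_url).path.strip("/").split("/")[:2]
    return f"{base}/{owner}__{repo.removesuffix('.git')}"

args = build_parser().parse_args(
    ["https://github.com/owner/repo", "--out", "results", "--no-image"])
print(output_dir(args.repo_url, args.out))  # → results/owner__repo
```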


4. References & Related Works

This project is grounded in prior research and official technical documentation related to code analysis, token optimization, and large language models.

Abstract Syntax Tree (AST) & Code Analysis

  • LibCST Documentation – Concrete Syntax Tree for Python
    https://libcst.readthedocs.io/
  • Baxter, I. D., et al. Clone Detection Using Abstract Syntax Trees, IEEE, 1998.

Ensemble Learning

  • Dietterich, T. G. Ensemble Methods in Machine Learning, Springer, 2000.

About

Open-Source SW Term Project, Fall semester 2025.
