GitHub - fedimoss/smart-data: A powerful tool for creating datasets for LLM fine-tuning 、RAG and Eval

A powerful tool for creating fine-tuning datasets for Large Language Models

Overview

Smart Data is an application specifically designed for building large language model (LLM) datasets. It features an intuitive interface, along with built-in powerful document parsing tools, intelligent segmentation algorithms, data cleaning and augmentation capabilities. The application can convert domain-specific documents in various formats into high-quality structured datasets, which are applicable to scenarios such as model fine-tuning, retrieval-augmented generation (RAG), and model performance evaluation.

News

🎉🎉 Smart Data Version 1.7.0 launches brand-new evaluation capabilities! You can effortlessly convert domain-specific documents into evaluation datasets (test sets) and automatically run multi-dimensional evaluation tasks. Additionally, it comes with a human blind test system, enabling you to easily meet needs such as vertical domain model evaluation, post-fine-tuning model performance assessment, and RAG recall rate evaluation.

Features

📄 Document Processing & Data Generation

Intelligent Document Processing: Supports PDF, Markdown, DOCX, TXT, EPUB and more formats with intelligent recognition
Intelligent Text Splitting: Multiple splitting algorithms (Markdown structure, recursive separators, fixed length, code-aware chunking), with customizable visual segmentation
Intelligent Question Generation: Auto-extract relevant questions from text segments, with question templates and batch generation
Domain Label Tree: Intelligently builds global domain label trees based on document structure, with auto-tagging capabilities
Answer Generation: Uses LLM API to generate comprehensive answers and Chain of Thought (COT), with AI optimization
Data Cleaning: Intelligent text cleaning to remove noise and improve data quality

🔄 Multiple Dataset Types

Single-Turn QA Datasets: Standard question-answer pairs for basic fine-tuning
Multi-Turn Dialogue Datasets: Customizable roles and scenarios for conversational format
Image QA Datasets: Generate visual QA data from images, with multiple import methods (directory, PDF, ZIP)
Data Distillation: Generate label trees and questions directly from domain topics without uploading documents

📊 Model Evaluation System

Evaluation Datasets: Generate true/false, single-choice, multiple-choice, short-answer, and open-ended questions
Automated Model Evaluation: Use Judge Model to automatically evaluate model answer quality with customizable scoring rules
Human Blind Test (Arena): Double-blind comparison of two models' answers for unbiased evaluation
AI Quality Assessment: Automatic quality scoring and filtering of generated datasets

🛠️ Advanced Features

Custom Prompts: Project-level customization of all prompt templates (question generation, answer generation, data cleaning, etc.)
GA Pair Generation: Genre-Audience pair generation to enrich data diversity
Task Management Center: Background batch task processing with monitoring and interruption support
Resource Monitoring Dashboard: Token consumption statistics, API call tracking, model performance analysis
Model Testing Playground: Compare up to 3 models simultaneously

📤 Export & Integration

Multiple Export Formats: Alpaca, ShareGPT, Multilingual-Thinking formats with JSON/JSONL file types
Balanced Export: Configure export counts per tag for dataset balancing
LLaMA Factory Integration: One-click LLaMA Factory configuration file generation
Hugging Face Upload: Direct upload datasets to Hugging Face Hub

🤖 Model Support

Wide Model Compatibility: Compatible with all LLM APIs that follow the OpenAI format
Multi-Provider Support: OpenAI, Ollama (local models), Zhipu AI, Alibaba Bailian, OpenRouter, and more
Vision Models: Support Gemini, Claude, etc. for PDF parsing and image QA

🌐 User Experience

User-Friendly Interface: Modern, intuitive UI designed for both technical and non-technical users
Multi-Language Support: Complete Chinese, English, and Turkish language support 🇹🇷
Dataset Square: Discover and explore public dataset resources
Desktop Clients: Available for Windows, macOS, and Linux

Local Run

Install with NPM

Clone the repository:

   git clone https://github.com/shengjian-tech/smart-data.git
   cd smart-data

Install dependencies:

   npm install

Start the development server:

   npm run build

   npm run start

Open your browser and visit http://localhost:1717

Using the Official Docker Image

Clone the repository:

git clone https://github.com/shengjian-tech/smart-data.git
cd smart-data

Modify the docker-compose.yml file:

services:
  smart-data:
    image: smart-data
    container_name: smart-data
    ports:
      - '1717:1717'
    volumes:
      - ./local-db:/app/local-db
      - ./prisma:/app/prisma
    restart: unless-stopped

Note: It is recommended to use the local-db and prisma folders in the current code repository directory as mount paths to maintain consistency with the database paths when starting via NPM.

Note: The database file will be automatically initialized on first startup, no need to manually run npm run db:push.

Start with docker-compose:

docker-compose up -d

Open a browser and visit http://localhost:1717

Building with a Local Dockerfile

If you want to build the image yourself, use the Dockerfile in the project root directory:

Clone the repository:

git clone https://github.com/shengjian-tech/smart-data.git
cd smart-data

Build the Docker image:

docker build -t smart-data .

Run the container:

docker run -d \
  -p 1717:1717 \
  -v ./local-db:/app/local-db \
  -v ./prisma:/app/prisma \
  --name smart-data \
  smart-data

Note: It is recommended to use the local-db and prisma folders in the current code repository directory as mount paths to maintain consistency with the database paths when starting via NPM.

Note: The database file will be automatically initialized on first startup, no need to manually run npm run db:push.

Open a browser and visit http://localhost:1717

Contributing

We welcome contributions from the community! If you'd like to contribute to Smart Data, please follow these steps:

Fork the repository
Create a new branch (git checkout -b feature/amazing-feature)
Make your changes
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request (submit to the DEV branch)

Please ensure that tests are appropriately updated and adhere to the existing coding style.

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 855 Commits
.github		.github
.husky		.husky
app		app
components		components
constant		constant
electron		electron
hooks		hooks
lib		lib
local-db		local-db
locales		locales
prisma		prisma
public/imgs		public/imgs
styles		styles
.dockerignore		.dockerignore
.env		.env
.gitignore		.gitignore
.npmrc		.npmrc
.prettierrc.js		.prettierrc.js
.windsurfrules		.windsurfrules
AGENTS.md		AGENTS.md
ARCHITECTURE.md		ARCHITECTURE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
README.tr.md		README.tr.md
README.zh-CN.md		README.zh-CN.md
commitlint.config.mjs		commitlint.config.mjs
docker-compose.yml		docker-compose.yml
docker-entrypoint.sh		docker-entrypoint.sh
jsconfig.json		jsconfig.json
next.config.js		next.config.js
package-lock.json		package-lock.json
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

News

Features

📄 Document Processing & Data Generation

🔄 Multiple Dataset Types

📊 Model Evaluation System

🛠️ Advanced Features

📤 Export & Integration

🤖 Model Support

🌐 User Experience

Local Run

Install with NPM

Using the Official Docker Image

Building with a Local Dockerfile

Contributing

License

About

Uh oh!

Releases

Packages

Contributors 26

Uh oh!

Languages

License

fedimoss/smart-data

Folders and files

Latest commit

History

Repository files navigation

Overview

News

Features

📄 Document Processing & Data Generation

🔄 Multiple Dataset Types

📊 Model Evaluation System

🛠️ Advanced Features

📤 Export & Integration

🤖 Model Support

🌐 User Experience

Local Run

Install with NPM

Using the Official Docker Image

Building with a Local Dockerfile

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 26

Uh oh!

Languages

Packages