
🔥 SOOT

AI-Ready File Ingestion & Tokenization Gateway

SOOT is a lightweight, secure file ingestion layer built to collect, validate, and prepare raw data for AI training pipelines.

It sits at the front of your AI stack — accepting unstructured files, enforcing format rules, and making data ready for tokenization, embedding, and model training.

⚡ What SOOT Does

SOOT solves one problem cleanly: getting raw data into your AI system without garbage.

Accepts files via drag-and-drop or click upload

Enforces strict file type validation

Tracks ingestion state in real time

Secures uploads with authentication & role control

Exposes clean data for downstream AI workflows

🧠 Built for AI Pipelines

SOOT is designed as a data ingestion gateway, not just a file uploader.

It enables:

Dataset collection

Tokenization pipelines

Embedding generation

Fine-tuning workflows

Internal AI tooling

Think of it as the intake valve for your model training system.

🚦 Ingestion Lifecycle

Each file moves through a clear, auditable lifecycle:

QUEUED → BURNED (accepted) or REJECTED (unsupported / failed validation)

This makes ingestion transparent, debuggable, and scalable.
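
As a rough sketch (the state names come from this README; the transition logic is assumed, not lifted from the codebase), the lifecycle boils down to:

```js
// Hypothetical sketch of the ingestion lifecycle. State names are from the
// README; the transition logic is an assumption, not the repo's actual code.
const STATES = Object.freeze({
  QUEUED: "QUEUED",
  BURNED: "BURNED",     // accepted into the dataset
  REJECTED: "REJECTED", // unsupported type or failed validation
});

// Each file record starts QUEUED and resolves to exactly one terminal state.
function resolveIngestion(file, isValid) {
  if (file.state !== STATES.QUEUED) {
    throw new Error(`Cannot resolve file in state ${file.state}`);
  }
  file.state = isValid ? STATES.BURNED : STATES.REJECTED;
  return file;
}
```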

📂 Supported Data Formats

Configurable allow-list on the backend, including:

.txt

.json

.jsonl

.md

.py

.js

.csv

.zip

❌ Unsupported formats are rejected automatically.
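
Since ingestion runs through Multer on the backend (see Tech Stack), the allow-list check could look like the sketch below; the constant name and wiring are illustrative assumptions, not the repo's actual code:

```js
// Illustrative sketch: enforcing the extension allow-list with Multer's
// fileFilter hook. Constant names and wiring are assumptions.
const path = require("path");
const multer = require("multer");

const ALLOWED_EXTENSIONS = new Set([
  ".txt", ".json", ".jsonl", ".md", ".py", ".js", ".csv", ".zip",
]);

const upload = multer({
  dest: "uploads/", // local storage; swap for a cloud backend as needed
  fileFilter: (req, file, cb) => {
    const ext = path.extname(file.originalname).toLowerCase();
    if (ALLOWED_EXTENSIONS.has(ext)) {
      cb(null, true);  // accept: file proceeds toward BURNED
    } else {
      cb(null, false); // skip the file; the route can then mark it REJECTED
    }
  },
});
```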

🔐 Security & Access Control

JWT-based authentication

Role-based permissions

Admin visibility across all ingested files

User-scoped uploads by default

Built to be safe for internal tools and early production use.
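
A minimal sketch of how JWT verification and a role gate typically compose as Express middleware, using the common jsonwebtoken package; the middleware names and the JWT_SECRET env var are assumptions:

```js
// Sketch only: JWT authentication + role gate as Express middleware.
const jwt = require("jsonwebtoken");

function authenticate(req, res, next) {
  const header = req.headers.authorization || "";
  const token = header.startsWith("Bearer ") ? header.slice(7) : null;
  if (!token) return res.status(401).json({ error: "Missing token" });
  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET); // e.g. { id, role }
    next();
  } catch {
    res.status(401).json({ error: "Invalid or expired token" });
  }
}

function requireRole(role) {
  return (req, res, next) =>
    req.user?.role === role
      ? next()
      : res.status(403).json({ error: "Forbidden" });
}

// Example wiring: admins see everything, users see only their own uploads.
// app.get("/files", authenticate, listOwnFiles);
// app.get("/admin/files", authenticate, requireRole("admin"), listAllFiles);
```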

🧱 Tech Stack

Frontend

React

Tailwind CSS

Native Drag & Drop API

Axios
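
On the frontend, a drag-and-drop upload with the native API plus Axios can be sketched like this; the /api/upload endpoint, field name, and token handling are assumptions:

```jsx
// Minimal sketch: native drag-and-drop + Axios upload in React.
import { useState } from "react";
import axios from "axios";

function DropZone({ token }) {
  const [status, setStatus] = useState("idle");

  async function handleDrop(e) {
    e.preventDefault(); // keep the browser from opening the dropped file
    const form = new FormData();
    for (const file of e.dataTransfer.files) form.append("files", file);
    setStatus("uploading");
    try {
      await axios.post("/api/upload", form, {
        headers: { Authorization: `Bearer ${token}` },
      });
      setStatus("queued"); // backend resolves QUEUED → BURNED or REJECTED
    } catch {
      setStatus("error");
    }
  }

  return (
    <div onDragOver={(e) => e.preventDefault()} onDrop={handleDrop}>
      Drop files here ({status})
    </div>
  );
}
```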

Backend

Node.js + Express

JWT Authentication

Multer (file ingestion)

Storage abstraction (local / cloud-ready)

RBAC (admin / user)
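
The storage abstraction might be as small as a save/read interface with swappable backends. The sketch below is an assumed shape, not the repo's actual module:

```js
// Hypothetical shape for the storage abstraction: one interface,
// swappable local or cloud implementations.
const fs = require("fs/promises");
const path = require("path");

class LocalStorage {
  constructor(root) {
    this.root = root;
  }
  async save(name, buffer) {
    await fs.mkdir(this.root, { recursive: true }); // ensure root exists
    const dest = path.join(this.root, name);
    await fs.writeFile(dest, buffer);
    return dest; // callers keep this reference, not a backend-specific path
  }
  async read(name) {
    return fs.readFile(path.join(this.root, name));
  }
}

// A cloud backend (S3, GCS, ...) would implement the same save/read
// methods, so routes depend on the interface rather than the medium.
```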

🧩 How It Fits Into Your Stack

Raw Files
  ↓
SOOT (Ingestion + Validation)
  ↓
Tokenizer / Chunker
  ↓
Embeddings / Datasets
  ↓
Model Training / Fine-Tuning

SOOT keeps ingestion isolated, clean, and replaceable — exactly how modern AI infrastructure should be designed.

🧪 Use Cases

Internal AI tooling

Dataset ingestion for LLMs

Code and document preprocessing

Research pipelines

Early-stage AI products

🧭 Roadmap

Content chunking

Token count estimation

Streaming uploads

Hashing & deduplication

Direct tokenizer integration

Dataset versioning

🔥 Why SOOT?

Because not all data deserves to make it into your model.

SOOT burns the noise and feeds the furnace with signal.
