A containerized application designed to run small and efficient Large Language Models (LLMs) for code generation on GPUs with limited VRAM (4GB or less).
Neuro-Cond leverages multiple optimization techniques to ensure fast inference while supporting CPU-GPU offloading for larger models. The project is designed to make code-specialized LLMs accessible on consumer-grade hardware.
- Run code generation models on GPUs with as little as 4GB VRAM
- Support for multiple optimization techniques (GGUF, GPTQ, BitsAndBytes)
- CPU-GPU offloading for larger models
- FastAPI backend with efficient batch processing
- React frontend with Tailwind CSS
Supported models:

- deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct (2.4B parameters)
- deepseek-ai/DeepSeek-Coder-V2-Lite-Base (2.4B parameters)
- google/codegemma-7b-GGUF (7B parameters)
- meta-llama/CodeLlama-7b-hf (7B parameters)
- Qwen/Qwen2.5-Coder-3B (3B parameters)
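As an illustration of the quantization and CPU-GPU offloading features listed above, the sketch below loads one of the supported models in 4-bit with BitsAndBytes and lets the layers that do not fit in VRAM spill over to CPU RAM. It assumes the `transformers`, `accelerate`, and `bitsandbytes` packages are installed; the model ID and generation parameters are illustrative, not necessarily what Neuro-Cond uses internally.

```python
# Minimal sketch (not the actual Neuro-Cond loading code): 4-bit quantization
# with CPU-GPU offloading via Hugging Face transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-Coder-3B"  # any of the supported models above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # layers that do not fit in VRAM are offloaded to CPU RAM
)

prompt = "# Write a Python function that reverses a string\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```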
- Host Operating System: Windows (via WSL 2) or Linux (Ubuntu 20.04+)
- Docker Runtime: Docker with NVIDIA Container Toolkit enabled
- GPU Support: NVIDIA GPU with CUDA (Compute Capability 7.0+ recommended)
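Before building the image, it can help to confirm that a CUDA-capable GPU is actually visible and has enough memory. A quick check with PyTorch (assuming it is available on the host or inside the container) might look like this:

```python
# Illustrative sanity check: confirm CUDA is visible and report VRAM and compute capability.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected; check NVIDIA drivers and the Container Toolkit.")

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")
print(f"Compute capability: {props.major}.{props.minor}")
```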
```bash
# Build the Docker image
docker build -t neuro-cond .

# Run the container with GPU support
docker run --gpus all -p 8000:8000 neuro-cond
```

Open a web browser and navigate to http://localhost:8000 to access the Neuro-Cond user interface.
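The backend can also be exercised directly over HTTP once the container is up. The endpoint path and request body below are assumptions for illustration only, not the documented Neuro-Cond API; adapt them to the routes the FastAPI backend actually exposes.

```python
# Hypothetical smoke test against the running container.
import requests

resp = requests.post(
    "http://localhost:8000/generate",  # hypothetical endpoint, not confirmed by the docs
    json={"prompt": "def fibonacci(n):", "max_new_tokens": 64},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```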
See the LICENSE file for details.