
🦾 Dum-E | The Embodied AI Agent

License: MIT · Discord · X Follow · Ask DeepWiki

What if your favorite AI agents have hands?

🚀 Get Started · 📖 Documentation · 🗺️ Roadmap · 🤝 Contributing


🎯 Mission

Inspired by Tony Stark's robotic assistant Dum-E, the mission of this project is to create an intelligent, voice & vision enabled AI agent with robotic arm(s) capable of real-time human interaction, physical operations, and orchestration of a wide range of tools and services.

✨ Key Features

  • 🎤 Real-time Voice Interface - Natural voice conversation with configurable voice and multi-language support
  • 🧠 Long-horizon Task Planning - Orchestrated by state-of-the-art VLMs with multi-modal reasoning and tool use
  • 📡 Asynchronous Task Execution - Tasks run in the background with streaming progress updates
  • 👁️ Hybrid Robot Control - Deep learning policies for generalizable physical operations combined with classical control for precise manipulation
  • 🔧 Modular Architecture - Flexible interfaces to add your own custom backends, embodiments and tools
  • 🌐 MCP Support - Agents and tools available via MCP for custom client integration

📺 Demo

🔊 Watch with sound to hear voice interactions

Dum-E.Demo.mp4

🚀 Get Started

This project supports two deployment patterns:

  1. Single Workstation: Run everything locally on a single machine with GPU
  2. Client-Server: Run the policy server on a separate GPU machine while keeping the lightweight client components local; useful when your local machine doesn't meet the GPU requirements

Choose the setup that best matches your needs and hardware availability. The following sections will guide you through the installation process.

⚙️ System Requirements

Common

  • Robot: SO100/101 robotic arm assembled with wrist camera
  • Webcam: 640p+ USB webcam for additional vision input

For Single Workstation

  • GPU: NVIDIA Ampere or later with at least 12GB VRAM (tested on RTX 3060) and driver ≥ 535.0
  • OS: Ubuntu 22.04/24.04 LTS or Windows with WSL2

For Client-Server

  • Server: Any machine with an NVIDIA GPU with 12GB+ VRAM, e.g. an AWS EC2 g4dn/g5/g6 instance
  • Client: Any machine with 4GB+ RAM
  • OS: Any OS (macOS/Windows/Linux)

🔧 Installation

📦 On Single Workstation or Server

Requires: 1) an NVIDIA GPU, and 2) Linux or WSL2

  1. Install required system dependencies

    sudo apt-get update
    sudo apt-get install ffmpeg libsm6 libxext6
  2. Install CUDA toolkit 12.4

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-toolkit-12-4
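
    Optionally verify the driver and toolkit before continuing (a quick sanity check; the nvcc path below assumes the default apt install location):

    # Driver version reported by the GPU should be >= 535.0
    nvidia-smi
    # CUDA 12.4 compiler installed by cuda-toolkit-12-4
    /usr/local/cuda-12.4/bin/nvcc --version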
  3. Create a conda or venv environment with Python 3.10 for the gr00t policy server:

    conda create -y -n gr00t python=3.10
    conda activate gr00t
  4. Clone Isaac-GR00T repository:

    git clone https://github.com/NVIDIA/Isaac-GR00T
  5. Install Isaac-GR00T:

    cd Isaac-GR00T
    # use the version tested on 12 Sep 2025
    git checkout b211007ed6698e6642d2fd7679dabab1d97e9e6c
    
    # conda activate gr00t
    pip install --upgrade setuptools
    pip install -e .[base]
    pip install --no-build-isolation flash-attn==2.7.1.post4

    Download a fine-tuned GR00T model from Hugging Face for the task you want to perform. For example, to pick up a fruit (apple, banana, orange, etc.) and put it on the plate, you can download our checkpoint by running:

    hf download aaronsu11/GR00T-N1.5-3B-FT-FRUIT-0810 --local-dir ./GR00T-N1.5-3B-FT --exclude "optimizer.pt"
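
    You can confirm the checkpoint files landed in the expected folder:

    ls -lh ./GR00T-N1.5-3B-FT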
  6. Start policy server

    Run the following command to start the gr00t policy server:

    python scripts/inference_service.py \
    --server \
    --model_path ./GR00T-N1.5-3B-FT \
    --embodiment_tag new_embodiment \
    --data_config so100_dualcam

    This needs to be running as long as you are using the gr00t policy for inference. Note down the IP address of the policy server (<policy_host>) and make sure port 5555 is accessible from the client.
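
    On the server itself, you can also confirm the service is listening before moving on (assuming the ss utility from iproute2 is available):

    # Expect a LISTEN entry on port 5555
    ss -ltn | grep 5555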

On Single Workstation or Client

  1. Clone the Repository

    git clone https://github.com/aaronsu11/Dum-E.git
  2. Create a conda or venv environment with Python 3.12 for the Dum-E client:

    conda create -y -n dum-e python=3.12
    conda activate dum-e
  3. Install Dum-E Dependencies

    cd Dum-E
    # conda activate dum-e
    pip install -r requirements.txt
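
    Optionally confirm the environment resolved without dependency conflicts:

    python -m pip check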

Note

If you have never set up SO-ARM before:

  • Find the wrist_cam_idx and front_cam_idx by running lerobot-find-cameras
  • Find the robot_port by running lerobot-find-port
  • Calibrate the robot following the instructions for SO-100 or SO-101 and note down your robot_id. For example with SO-101, run: lerobot-calibrate --robot.type=so101_follower --robot.port=<robot_port> --robot.id=<robot_id>
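
For convenience, the discovery and calibration commands from this note can be run back to back (exact output formats depend on your lerobot version):

    # List connected cameras and note the indices for the wrist and front cameras
    lerobot-find-cameras
    # Identify the serial port of the follower arm
    lerobot-find-port
    # Calibrate (SO-101 example from above)
    lerobot-calibrate --robot.type=so101_follower --robot.port=<robot_port> --robot.id=<robot_id>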
  4. Configure Dum-E

    cp config.example.yaml my-dum-e.yaml

    Edit my-dum-e.yaml:

    • Set controller.robot_type/robot_port/robot_id/wrist_cam_idx/front_cam_idx
    • Set controller.policy_host to your gr00t policy server IP (or localhost)
    • Optionally set agent.profile and voice.mode/profile to use different model presets
  5. Test policy execution

    # Uses controller.* from YAML
    python -m embodiment.so_arm10x.controller --config my-dum-e.yaml --instruction "<your-instruction>"

Note

If the robot is not moving, check if the gr00t policy server is running and the port is accessible from the client by running

  • nc -zv <policy_host> 5555 (on macOS/Linux) or
  • Test-NetConnection -ComputerName <policy_host> -Port 5555 (on Windows PowerShell).
  6. Environment configuration for agent and voice

    Sign up for free accounts with ElevenLabs, Deepgram, and Anthropic and obtain the API keys if you are using the default profile, or use your AWS credentials if you are using the aws profile.

    Copy the environment template and update the .env file with your credentials:

    cp .env.example .env

    Edit .env with your credentials (choose one of the following):

    • For default profile:
      • Anthropic API key for LLM
      • ElevenLabs API key for TTS
      • Deepgram API key for STT
    • For aws profile:
      • AWS Access Key ID
      • AWS Secret Access Key
      • AWS Region

Note

You can optionally configure Langfuse observability for agent and voice by setting LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST in the .env file.
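
For reference, a completed .env for the default profile looks roughly like the following. The Langfuse names are the ones listed above; the other variable names are illustrative placeholders, so check .env.example for the exact names the project expects:

    # Default profile credentials (placeholder names - confirm against .env.example)
    ANTHROPIC_API_KEY=...
    ELEVENLABS_API_KEY=...
    DEEPGRAM_API_KEY=...

    # Optional Langfuse observability
    LANGFUSE_PUBLIC_KEY=...
    LANGFUSE_SECRET_KEY=...
    LANGFUSE_HOST=https://cloud.langfuse.com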

  7. Test the robot agent

    # One-shot instruction (agent inherits controller.* unless overridden in agent section)
    python -m embodiment.so_arm10x.agent --config my-dum-e.yaml --instruction "<your-instruction>"
  8. Start Dum-E

    Finally, start the full stack with the voice interface, MCP server and robot agent:

    python dum_e.py --config my-dum-e.yaml

    This will launch the voice interface at http://localhost:7860 where you can connect and speak to Dum-E using your microphone. Have fun!

    You can also start only the voice interface and MCP servers, which is useful for testing them independently without the robot hardware:

    python dum_e.py --node servers --config my-dum-e.yaml
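
    In either mode, you can check that the web UI is being served before opening a browser (assuming curl is installed):

    # Should print an HTTP status code (e.g. 200) once the interface is up
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860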

🏗️ Architecture Overview

Core Components

graph TB
 subgraph subGraph0["Frontend"]
        A["Voice Interface<br>Pipecat"]
        B["External MCP Clients"]
  end
 subgraph subGraph1["Backend"]
        C["MCP Server<br>Streamable HTTP"]
        D["Fleet Manager<br>Multi-agent Management"]
        E["Task Manager<br>Task Management"]
        F["Message Broker<br>Event Streaming"]
  end
 subgraph subGraph2["Agent Layer"]
        G["Robot Agent<br>Multi-modal Orchestration"]
  end
 subgraph subGraph3["Controller Layer"]
        H["Classical Control<br>MoveIt"]
        I["VLA Policy<br>Gr00t Inference"]
        J["Custom Tools<br>Public MCP Servers"]
  end
 subgraph subGraph4["Hardware Layer"]
        K["Physical Robot<br>SO-ARM10x"]
  end
    A --> C
    B --> C
    C --> D & E & F & G
    G --> H & I & J
    H --> K
    I --> K

🔄 Data Flow

  1. Voice Interaction: Voice Streams → Cascaded / Speech-to-Speech Processor → Conversation and Task Delegation
  2. Task Execution: Task Manager → Robot Agent → VLM Reasoning → Robot Controller Tools → Robot Hardware
  3. Robot Control: Task Instruction + Camera Images → DL Policy / Classical Control → Joint Commands
  4. Streaming Feedback: Agent Streaming Responses → Message Broker → MCP Context → Voice Updates

🗺️ Roadmap

Q3 2025

  • Voice Interaction

    • Multi-language support (Mandarin, Japanese, Spanish etc.)
    • Emotional understanding with speech-to-speech models
  • MCP Servers

    • Access to robot agent via MCP
    • Configurable MCP server endpoints for Dum-E
  • Local Model Support

    • Integration with Ollama for local language model inference
  • Upgrade Dependency

    • LeRobot new hardware APIs

Q4 2025

  • Cross-Platform Support

    • Docker containers for platform-agnostic deployment
  • ROS2 Integration

    • Native ROS2 node implementation
    • Integration with existing ROS toolkits

🔬 Ongoing: Research Initiatives

  • Embodied AI Research

    • Generalizable & scalable policies for physical tasks
    • Efficient dataset collection and training
  • Human-Robot Interaction

    • Natural multi-modal understanding
    • Contextual conversation memory
    • Self-evolving personality and skillset

🤝 Contributing

We welcome contributions from the robotics and AI community! Here's how you can help:

🌟 Ways to Contribute

  • 🐛 Bug Reports: Found an issue? Create a detailed bug report
  • 💡 Feature Requests: Have ideas? Share them in our discussions or Discord
  • 📝 Documentation: Help improve our docs and tutorials
  • 🧪 Testing: Add test cases and improve coverage
  • 🚀 Code: Submit pull requests with new features or fixes

📋 Development Guidelines

  1. Fork the Repository

    # Fork the repo on GitHub first, then clone your fork
    git clone https://github.com/<your-username>/Dum-E.git
    cd Dum-E
    git checkout -b feature/issue-<number>/<your-feature-name>
  2. Follow Code Standards

    • Use Python 3.12 type hints
    • Follow PEP 8 style guidelines
    • Add comprehensive docstrings
    • Maintain test coverage > 50%
  3. Testing Requirements

    # Run tests before submitting
    python -m pytest tests/

    # Format code and sort imports
    python -m black <your-file-or-directory>
    python -m isort <your-file-or-directory>
  4. Pull Request Process

    • Create detailed PR description
    • Link related issues
    • Ensure CI/CD passes
    • Request review from maintainers

👥 Community

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

This project builds on top of open-source projects including LeRobot, NVIDIA Isaac-GR00T, and Pipecat.


⭐ Star us on GitHub — it motivates us a lot!

🚀 Get Started · 📖 Documentation · 🤝 Join Community · 💼 Commercial Use

Built with ❤️ for the future of robotics
