This repository contains the complete codebase for the master's thesis research on integrating Large Language Models (LLMs) with Reinforcement Learning (RL) agents across multiple game environments. The research explores how LLM-generated hints can improve RL agent training efficiency and performance.
This thesis investigates how Large Language Models (LLMs) can guide exploration in Reinforcement Learning (RL) without imposing hard constraints on agent decision-making. We hypothesize that LLMs encode broad world knowledge and commonsense procedural regularities that, when elicited via prompting, can bias exploration while preserving policy autonomy. To investigate this, we propose a soft-constraint integration wherein LLM-generated suggestions are provided as structured hints within the agent’s observation, accompanied by a hint-availability flag. This design preserves the original Markov Decision Process (MDP), leaves the learning objective unchanged, and allows the agent's policy to learn when to use or ignore guidance.
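A minimal sketch of this soft-constraint design (assuming a flat Box observation space; the wrapper, `hint_fn`, and `hint_prob` names are illustrative rather than the thesis implementation):

```python
# Sketch only: append an encoded LLM hint plus a hint-availability flag to the
# observation, leaving the underlying MDP and reward untouched.
import numpy as np
import gymnasium as gym


class HintObservationWrapper(gym.ObservationWrapper):
    def __init__(self, env, hint_dim, hint_fn, hint_prob=1.0):
        super().__init__(env)
        self.hint_dim = hint_dim    # size of the encoded hint vector
        self.hint_fn = hint_fn      # callable: raw observation -> hint vector or None
        self.hint_prob = hint_prob  # fraction of steps on which a hint is offered
        low = np.concatenate([env.observation_space.low.flatten(),
                              np.zeros(hint_dim + 1)])
        high = np.concatenate([env.observation_space.high.flatten(),
                               np.ones(hint_dim + 1)])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        hint = self.hint_fn(obs) if np.random.rand() < self.hint_prob else None
        if hint is None:
            hint_vec, flag = np.zeros(self.hint_dim, dtype=np.float32), 0.0
        else:
            hint_vec, flag = np.asarray(hint, dtype=np.float32), 1.0
        # Augmented observation = original state + hint vector + availability flag.
        return np.concatenate([obs.flatten(), hint_vec, [flag]]).astype(np.float32)
```

The policy remains free to exploit or ignore the appended hint dimensions, which is what preserves agent autonomy.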
Methodologically, we develop a prompting and encoding pipeline to translate compact state summaries into schema-constrained outputs that can be consumed by standard RL policies. The approach is algorithm-agnostic; we instantiate it with Proximal Policy Optimization (PPO) and evaluate across domains of varying structure and difficulty: Minigrid, TicTacToe, and Deal or No Deal. Supplementary experiments with DQN and REINFORCE are also provided.
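As an illustration of the schema-constrained output format (the field names and the use of Minigrid's primitive actions are assumptions for this example, not the exact thesis schema), an LLM reply can be validated and encoded before the policy ever sees it:

```python
# Sketch: parse an LLM reply into a small, validated hint object, then encode it
# as a fixed-length vector that a standard RL policy can consume.
import json
from dataclasses import dataclass
from typing import Optional

VALID_ACTIONS = ["left", "right", "forward", "pickup", "drop", "toggle", "done"]


@dataclass
class Hint:
    subgoal: str  # short natural-language subgoal, e.g. "go to the red door"
    action: str   # suggested next primitive action


def parse_hint(raw_llm_output: str) -> Optional[Hint]:
    """Return a validated Hint, or None if the LLM output violates the schema."""
    try:
        data = json.loads(raw_llm_output)
        hint = Hint(subgoal=str(data["subgoal"]), action=str(data["action"]))
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return hint if hint.action in VALID_ACTIONS else None


def encode_hint(hint: Optional[Hint]) -> list:
    """One-hot action suggestion plus a trailing hint-availability flag."""
    vec = [0.0] * (len(VALID_ACTIONS) + 1)
    if hint is not None:
        vec[VALID_ACTIONS.index(hint.action)] = 1.0
        vec[-1] = 1.0
    return vec
```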
Empirical results demonstrate that hints elicited via structure-preserving prompts—augmented with chain-of-thought reasoning where appropriate—are reliable and context-relevant. In Minigrid, integrating LLM hints as soft inputs yields improved sample efficiency and, on more difficult tasks, better final performance relative to tabula-rasa baselines, while remaining below an oracle upper bound. In compact domains such as TicTacToe and Deal or No Deal, the prompting pipeline produces interpretable, valid suggestions (e.g., higher action validity under masking and approximately 85% agreement with curated data in Deal or No Deal), though overall training gains are bounded by the short horizon and small state spaces. In all settings, RL agents learn to discount suboptimal hints, showcasing robustness to imperfect guidance.
We discuss key limitations, particularly the computational overhead of frequent LLM queries, and outline cost-aware extensions—including adaptive hint scheduling, distillation, and lightweight serving. Overall, our results support LLM-guided hints as a practical and robust mechanism for accelerating learning in sufficiently complex RL tasks while preserving agent autonomy.
master-thesis-codebase/
├── deal_or_no_deal/ # Deal or No Deal negotiation game RL-LLM integration
├── minigrid-rl-llm/ # BabyAI/Minigrid environments RL-LLM research pipeline
├── tictactoe_rl_llm/ # Tic-Tac-Toe game RL-LLM integration
└── thesis.pdf # Complete thesis document
This codebase implements a unified research methodology that investigates the integration of LLMs with RL agents, applying the same approach across three environments:
- Deal or No Deal: Multi-issue bargaining negotiation environment with RL agents trained using PPO and REINFORCE algorithms, enhanced with LLM-generated hints.
- Minigrid-RL-LLM: Comprehensive research pipeline for BabyAI/Minigrid environments, including ground truth data collection, LLM optimization, evaluation, RL training, and agent comparison.
- Tic-Tac-Toe RL-LLM: Simple grid-based game environment exploring LLM-guided RL training with various prompting strategies and board representations.
- Python 3.8+
- Conda (recommended for environment management)
- CUDA-capable GPU (recommended for training)
Each project has its own requirements. Navigate to the specific project directory and install dependencies:
# For Deal or No Deal
cd deal_or_no_deal
pip install -r requirements.txt
# For Minigrid-RL-LLM
cd minigrid-rl-llm
pip install -r requirements.txt
# For Tic-Tac-Toe RL-LLM
cd tictactoe_rl_llm
pip install -r requirements.txt
A comprehensive research pipeline for studying LLM-RL integration in BabyAI/Minigrid environments. The project is organized into sequential modules:
Module Structure:
- 1_GT_collection/: Ground truth data collection from BabyAI bot
- 2_LLM_optimization/: LLM fine-tuning and ablation studies using DSPy
- 3_LLM_evaluation/: Evaluation of trained LLM models
- 4_RL_agent_training/: RL agent training with Stable Baselines3
- 5_RL_agent_comparison/: Agent comparison and analysis
- 6_CoP_Approach/: Chain-of-Thought approach implementation
Key Features:
- Multiple observation encodings (natural language, ASCII, tuples, relative)
- DSPy-based LLM optimization for subgoal and action prediction (see the sketch after this list)
- Hint system with configurable frequency and sources
- Comprehensive evaluation and comparison tools
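A minimal DSPy-style sketch of such a predictor (the signature name, fields, and descriptions are illustrative; the actual optimization code lives in 2_LLM_optimization/):

```python
# Sketch: a DSPy signature for next-action suggestion in Minigrid; a language
# model must be configured in DSPy (e.g. via dspy.settings) before calling it.
import dspy


class SuggestNextAction(dspy.Signature):
    """Given the mission and an encoded Minigrid observation, suggest the next action."""
    mission = dspy.InputField(desc="natural-language mission, e.g. 'open the red door'")
    observation = dspy.InputField(desc="encoded view of the agent's surroundings")
    action = dspy.OutputField(desc="one of: left, right, forward, pickup, drop, toggle, done")


# Chain-of-Thought wrapping lets the model reason before emitting the action field.
suggest_action = dspy.ChainOfThought(SuggestNextAction)
# prediction = suggest_action(mission=..., observation=...); prediction.action holds the hint.
```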
Quick Start:
cd minigrid-rl-llm
# Follow the pipeline sequentially
cd 1_GT_collection && python gt_data_collection.py
cd ../2_LLM_optimization && python main_llm_optimization.py
# ... and so on
Documentation:
- Main README: minigrid-rl-llm/README.md
- Module-specific READMEs in each numbered directory
A simplified grid-based game environment exploring LLM-guided RL training with various configurations:
Key Features:
- Multiple board representations (1D, 2D, etc.); see the sketch after this list
- Various prompting methods (Zero-Shot, Chain-of-Thought, etc.)
- LLM suggestion integration with configurable probability
- Ground truth database for comparison
- Comprehensive experiment tracking
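A toy example of two board representations (illustrative encodings, not necessarily the exact ones used in tictactoe_rl_llm):

```python
# Sketch: the same Tic-Tac-Toe position as a flat 1D vector, a 2D grid, and an
# ASCII rendering suitable for an LLM prompt.
import numpy as np

# 0 = empty, 1 = X (agent), -1 = O (opponent)
board_1d = np.array([1, 0, -1,
                     0, 1, 0,
                     0, 0, -1], dtype=np.int8)

board_2d = board_1d.reshape(3, 3)  # 2D view of the same position


def to_ascii(board: np.ndarray) -> str:
    """Render a 3x3 board as an ASCII string for a grid-style prompt."""
    symbols = {1: "X", -1: "O", 0: "."}
    return "\n".join(" ".join(symbols[int(c)] for c in row) for row in board.reshape(3, 3))


print(to_ascii(board_2d))
# X . O
# . X .
# . . O
```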
Quick Start:
cd tictactoe_rl_llm
# Run RL experiments
python run_RL_experiments.py
# Run LLM experiments
python run_LLM_experiment.py
A Gym-compatible environment for the Deal or No Deal negotiation task, implementing multi-issue bargaining with coarse dialogue acts. The project includes:
- Environment: Custom Gymnasium environment for negotiation scenarios
- Algorithms: PPO and REINFORCE implementations
- LLM Integration: Hint injection system using LLM-generated guidance
- Evaluation: Comprehensive metrics and policy evaluation tools
Key Features:
- Multi-issue negotiation over 3 item types (books, hats, balls)
- Coarse dialogue acts: propose, insist, agree, disagree, end (see the sketch after this list)
- Support for LLM-generated hints during training
- Supervised expert training baseline
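An illustrative sketch of the coarse dialogue-act action space (names are hypothetical; the actual encoding is documented in deal_or_no_deal/deal_or_no_deal_env/):

```python
# Sketch: coarse dialogue acts plus an optional item split over (books, hats, balls).
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class DialogueAct(Enum):
    PROPOSE = "propose"
    INSIST = "insist"
    AGREE = "agree"
    DISAGREE = "disagree"
    END = "end"


@dataclass
class NegotiationMove:
    act: DialogueAct
    # Items claimed for oneself as (books, hats, balls); only meaningful for PROPOSE/INSIST.
    proposal: Optional[Tuple[int, int, int]] = None


# Example: insist on keeping both books and one ball.
move = NegotiationMove(act=DialogueAct.INSIST, proposal=(2, 0, 1))
```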
Quick Start:
cd deal_or_no_deal
conda activate rl # As per project setup
python train/train.py --config configs/ppo_config.yaml
Documentation:
- Environment details: deal_or_no_deal/deal_or_no_deal_env/README.md
- Dialog system: deal_or_no_deal/deal_or_no_dialog/README.md
All three projects follow a similar research methodology:
- Baseline Establishment: Train RL agents without LLM assistance
- LLM Integration: Incorporate LLM-generated hints/guidance during training (a condensed sketch follows this list)
- Evaluation: Compare sample efficiency and final performance
- Analysis: Statistical analysis and visualization of results
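A condensed sketch of steps 1 and 2 using Stable Baselines3 PPO (the BabyAI environment id, the zero-valued placeholder hint function, and the reuse of the HintObservationWrapper sketched in the overview are assumptions for illustration; the real training scripts live under minigrid-rl-llm/4_RL_agent_training/):

```python
# Sketch: train a tabula-rasa baseline and a hint-augmented agent with the same algorithm.
import gymnasium as gym
import numpy as np
import minigrid                               # registers MiniGrid/BabyAI environment ids
from minigrid.wrappers import FlatObsWrapper
from stable_baselines3 import PPO


def make_base_env():
    return FlatObsWrapper(gym.make("MiniGrid-DoorKey-5x5-v0"))


def train(env, total_timesteps=50_000):
    model = PPO("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    return model


def dummy_hint_fn(obs):
    return np.zeros(8, dtype=np.float32)      # placeholder; the thesis pipeline queries an LLM


baseline = train(make_base_env())             # 1. baseline, no LLM assistance
hinted = train(HintObservationWrapper(make_base_env(), hint_dim=8, hint_fn=dummy_hint_fn))
# 3.-4. evaluate both policies on held-out episodes and compare sample efficiency.
```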
Each project generates:
- Trained model checkpoints
- Training logs and metrics (CSV, JSON)
- Evaluation results
- Comparison plots and visualizations
- Configuration snapshots
Results are stored in project-specific directories:
- deal_or_no_deal/runs/
- minigrid-rl-llm/4_RL_agent_training/RL_Training_Results_*/
- tictactoe_rl_llm/experiments_results/
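A hedged sketch of how such CSV logs could be compared (the file paths and column names, e.g. timesteps and episode_reward, are assumptions; each project's README documents the actual log schema):

```python
# Sketch: overlay baseline and hint-augmented learning curves from CSV training logs.
import pandas as pd
import matplotlib.pyplot as plt

baseline = pd.read_csv("runs/baseline/progress.csv")      # hypothetical paths
hinted = pd.read_csv("runs/llm_hints/progress.csv")

plt.plot(baseline["timesteps"], baseline["episode_reward"], label="baseline PPO")
plt.plot(hinted["timesteps"], hinted["episode_reward"], label="PPO + LLM hints")
plt.xlabel("environment steps")
plt.ylabel("episode reward")
plt.legend()
plt.savefig("sample_efficiency_comparison.png")
```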
While each project has specific requirements, common dependencies include:
- PyTorch
- Gymnasium
- NumPy, Pandas, Matplotlib
- PyYAML
- tqdm
- Thesis: Full thesis PDF (thesis.pdf)
- Workshop paper: arXiv:2510.08779
Vaibhav Jain
Master's Thesis Research (in fulfillment of the MS Data Science and AI Program)
Saarland University (UdS)
Contact: vaja00001@uni-saarland.de (or create an issue on this repo)
For detailed documentation on each project, please refer to the README files in the respective project directories.