A systematic analysis of when Reinforcement Learning is applicable to the 0-1 Knapsack Problem, with comprehensive experiments on scalability, advantage scenarios, and hybrid strategies.
- Project Overview
- Key Contributions
- Research Questions
- Methodology
- Project Structure
- Experiment Results
- How to Run
- Conclusion
- Limitations & Future Work
This project investigates the applicability of Reinforcement Learning (RL) methods for solving the 0-1 Knapsack Problem. We compare traditional methods (Dynamic Programming, Greedy) with RL approaches (Policy Gradient, DQN, PPO) to answer three core research questions about scalability, advantage scenarios, and hybrid strategy design.
- Dynamic Programming (DP) — Exact optimal solution
- Greedy — Value density heuristic
- Policy Gradient (PG) — Linear policy RL
- Deep Q-Network (DQN) — Deep RL with experience replay
- Proximal Policy Optimization (PPO) — Stable RL algorithm
- Hybrid Strategies — Ensemble, Staged, Confidence-based
- Scalability advantage: RL (PG) is 2.2× faster than DP at n=50
- Scenario-specific value: On difficult cases, PG achieves 81.77% vs Greedy's 72.35% (+9.42%)
- Hybrid strategies: Ensemble achieves 96.91% solution quality, statistically significant improvement over Greedy (p=0.009)
- Ensemble vs Greedy: statistically significant (p=0.009)
- DQN vs PPO: statistically significant (p=0.0005)
| ID | Question | Focus |
|---|---|---|
| RQ1 | Does RL have time advantages on large-scale problems? | Scalability |
| RQ2 | In which scenarios does RL outperform traditional heuristics? | Applicability |
| RQ3 | How to effectively combine RL with traditional methods? | Hybrid Strategy |
| Category | Method | Description | Complexity |
|---|---|---|---|
| Exact | DP | Dynamic Programming | O(n×W) |
| Heuristic | Greedy | Value density sorting | O(n log n) |
| RL | PG | Linear policy | O(n) |
| RL | DQN | 3-layer MLP (256-256-128) | O(n) |
| RL | PPO | Actor-Critic (64-64) | O(n) |
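To make the complexity column concrete, minimal versions of the exact DP and the value-density Greedy can be written as below. This is an illustrative sketch, not the repository's `experiments/solvers.py` implementation:

```python
def dp_knapsack(weights, values, capacity):
    """Exact 0-1 knapsack via dynamic programming: O(n * W) time."""
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


def greedy_knapsack(weights, values, capacity):
    """Value-density heuristic: sort by value/weight, take greedily: O(n log n)."""
    order = sorted(range(len(weights)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    total_value, remaining = 0, capacity
    for i in order:
        if weights[i] <= remaining:
            remaining -= weights[i]
            total_value += values[i]
    return total_value
```

For weights `[1, 2, 3]`, values `[6, 10, 12]`, and capacity 5, DP finds the optimum 22 (items 2 and 3) while Greedy returns 16 — the kind of gap the "Difficult Real Cases" scenario below probes.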
| Strategy | Description |
|---|---|
| Ensemble | Run Greedy+PG+DQN, select best result |
| Staged Hybrid | PG selects first, Greedy fills remaining |
| Confidence-based | Choose method based on capacity ratio |
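The Ensemble strategy from the table above amounts to running several solvers and keeping the best feasible answer. A minimal sketch, assuming each solver returns `(value, selected_item_indices)` in the style of the project's `solve` interface (the actual repository code may differ):

```python
def ensemble_solve(weights, values, capacity, solvers):
    """Run every solver and keep the highest-value feasible solution.

    `solvers` is a list of callables with the (assumed) signature
    f(weights, values, capacity) -> (value, selected_item_indices).
    """
    best_value, best_solution = -1, None
    for solver in solvers:
        value, solution = solver(weights, values, capacity)
        # Recheck feasibility so one misbehaving solver cannot win
        # with an over-capacity selection.
        if sum(weights[i] for i in solution) <= capacity and value > best_value:
            best_value, best_solution = value, solution
    return best_value, best_solution
```

Because it only ever selects among candidate solutions, the ensemble is lower-bounded by its best member — which is why it can beat Greedy alone even though no individual RL method does.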
```
knapsack-rl-analysis/
├── experiments/
│   ├── rq1_scalability.py    # RQ1: Scalability analysis
│   ├── rq2_scenarios.py      # RQ2: Advantage scenarios
│   ├── rq3_hybrid.py         # RQ3: Hybrid strategies
│   ├── statistical_test.py   # Statistical significance tests
│   └── solvers.py            # Unified solver interface
├── models/                   # Trained models
├── results/                  # Experiment results & figures
├── visualization/            # Plotting scripts
└── data/                     # Dataset
```
| n | DP Time | PG Time | Speedup | Greedy Quality | PG Quality | DQN Quality | PPO Quality |
|---|---|---|---|---|---|---|---|
| 5 | 0.04ms | 0.37ms | 0.1× | 99.42% | 89.18% | 90.38% | 82.08% |
| 10 | 0.22ms | 0.43ms | 0.5× | 98.56% | 90.22% | 91.63% | 80.86% |
| 20 | 0.83ms | 0.84ms | 1.0× | 99.03% | 82.20% | 91.12% | 79.18% |
| 50 | 5.40ms | 2.44ms | 2.2× | 99.78% | 77.83% | 82.58% | 77.93% |
Key Finding: RL (PG) achieves 2.2× speedup over DP at n=50. For n>50, DP becomes impractical due to O(n×W) complexity.
- Left panel (Runtime Scalability): Shows computation time (log scale) for each method as problem size increases. DP time grows rapidly with n — its O(n×W) cost is pseudo-polynomial, scaling with both item count and capacity — while RL methods (PG, PPO) maintain near-linear growth. At n=50, PG is 2.2× faster than DP.
- Right panel (Solution Quality): Shows solution quality (% of optimal) for Greedy, PG, DQN, and PPO across scales n=5 to n=50. Greedy consistently achieves 98-99% quality, while RL methods show more variance (77-91%).
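A runtime comparison like the one above can be reproduced with a small harness; the sketch below is not the project's timing code, and `solve_fn` stands in for any solver callable:

```python
import random
import time


def time_solver(solve_fn, n, capacity, trials=20, seed=42):
    """Average wall-clock time of solve_fn over random instances of size n."""
    rng = random.Random(seed)  # per-harness RNG so trials are reproducible
    total = 0.0
    for _ in range(trials):
        weights = [rng.randint(1, 20) for _ in range(n)]
        values = [rng.randint(1, 100) for _ in range(n)]
        start = time.perf_counter()
        solve_fn(weights, values, capacity)
        total += time.perf_counter() - start
    return total / trials
```

Averaging over many random instances matters here: single-run timings at sub-millisecond scale (see the n=5 row) are dominated by measurement noise.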
| Scenario | Greedy | PG | DQN | PPO |
|---|---|---|---|---|
| High Value Density Variance | 98.88% | 87.74% | 92.01% | 81.72% |
| Difficult Real Cases | 72.35% | 81.77% | 76.57% | 75.39% |
| Larger Scale (n=20) | 99.10% | 84.81% | 90.11% | 80.08% |
Key Finding: On difficult real cases (where Greedy < 95%), PG achieves 81.77% vs Greedy's 72.35% (+9.42%). Note: This difference was not statistically significant in our t-test (p=0.802, n=10 cases), suggesting more samples are needed for conclusive evidence.
This figure compares four methods (Greedy, PG, DQN, PPO) across three scenarios:
- High Value Density Variance: Instances with diverse value-to-weight ratios. Greedy performs best (98.88%), DQN second (92.01%).
- Difficult Real Cases: Real dataset instances where Greedy achieves <95%. Here PG shows advantage (81.77% vs 72.35%).
- Larger Scale (n=20): Randomly generated instances with 20 items. Greedy dominates (99.10%), DQN performs reasonably (90.11%).
| Method | Quality | vs Greedy |
|---|---|---|
| Greedy | 95.40% | baseline |
| PG | 90.25% | -5.15% |
| DQN | 90.57% | -4.83% |
| PPO | 84.25% | -11.15% |
| Confidence-based | 91.60% | -3.80% |
| Staged Hybrid | 91.65% | -3.75% |
| Ensemble | 96.91% | +1.51% |
Key Finding: Ensemble strategy achieves the best performance, statistically significant improvement over Greedy (p=0.009).
This figure compares baseline methods (Greedy, PG, DQN) with hybrid strategies (Confidence-based, Staged Hybrid, Ensemble):
- Baselines: Greedy (95.40%) outperforms individual RL methods. PG (90.25%) and DQN (90.57%) show similar performance.
- Hybrid Strategies: Ensemble achieves the best result (96.91%), combining Greedy, PG, and DQN by selecting the best solution among them.
- Note on PPO: PPO is excluded from this comparison and the Ensemble strategy due to its significantly lower performance (84.25%) and high variance, which would degrade the hybrid strategy's effectiveness.
This figure compares solution quality of Greedy, PG, DQN, and PPO across problem scales (n=5, 10, 20, 50). Greedy consistently achieves ~99% quality across all scales, demonstrating its effectiveness as a baseline. Among RL methods, DQN performs best (82-91%), followed by PG (77-90%) and PPO (77-82%). Notably, all RL methods show performance degradation as problem size increases, while Greedy remains stable.
| Test | Result | p-value | Significant |
|---|---|---|---|
| Ensemble vs Greedy | +2.08% | 0.009 | ✓ Yes |
| DQN vs PPO | +6.45% | 0.0005 | ✓ Yes |
| PG vs DQN | -2.18% | 0.218 | ✗ No |
| PG vs PPO | +4.27% | 0.063 | ✗ No |
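The comparisons above are paired tests over the same instances. As an illustration of the statistic involved (the exact test variant in `statistical_test.py` is not shown here), a paired t-statistic can be computed with the standard library alone:

```python
import math
from statistics import mean, stdev


def paired_t_statistic(quality_a, quality_b):
    """Paired t-statistic for per-instance quality scores of two methods.

    The p-value then follows from the t-distribution with len(diffs) - 1
    degrees of freedom (e.g. via scipy.stats); omitted to stay stdlib-only.
    """
    diffs = [a - b for a, b in zip(quality_a, quality_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

Pairing is what gives the Ensemble-vs-Greedy comparison its power: per-instance differences have far lower variance than the raw quality scores.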
All experiments use fixed random seeds (seed=42) for reproducibility. Pretrained models are available in the models/ directory and can be loaded directly:
```python
from experiments.solvers import solve

value, solution = solve(weights, values, capacity, method='pg')  # or 'dqn', 'ppo'
```

For complete details on hyperparameters and experimental setup, see docs/EXPERIMENT_DETAILS.md.
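Seed fixing in the sense described above can be sketched as follows (the repository's actual seeding code is not shown; `set_seed` is an illustrative name):

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Make runs reproducible by seeding every RNG the experiments touch."""
    random.seed(seed)
    np.random.seed(seed)
    # For the RL training runs, torch.manual_seed(seed) (and, on GPU,
    # torch.cuda.manual_seed_all(seed)) would be called as well; omitted
    # here to keep the sketch free of heavy dependencies.


set_seed(42)
first_draw = np.random.rand(3)
set_seed(42)
assert np.allclose(first_draw, np.random.rand(3))  # identical draws after re-seeding
```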
```bash
pip install -r requirements.txt
python experiments/rq1_scalability.py
python experiments/rq2_scenarios.py
python experiments/rq3_hybrid.py
python experiments/statistical_test.py
python visualization/generate_figures_rq.py
```

```bash
docker build -t knapsack-rl .
docker run -v $(pwd)/results:/app/results knapsack-rl
```

This study systematically investigates the applicability of Reinforcement Learning methods for the 0-1 Knapsack Problem through three research questions addressing scalability, advantage scenarios, and hybrid strategies.
Our experiments reveal that RL methods offer meaningful time advantages on large-scale problems. At n=50, Policy Gradient achieves a 2.2× speedup over Dynamic Programming, and this advantage grows as problem size increases since DP's O(n×W) complexity becomes prohibitive while RL maintains O(n) inference time. However, this speed advantage comes at the cost of solution quality—Greedy consistently achieves 97-99% of optimal across all tested scales, while RL methods range from 77-91%.
The most significant finding is that the Ensemble strategy, which combines Greedy, PG, and DQN by selecting the best solution among them, achieves 96.91% solution quality with statistically significant improvement over Greedy alone (p=0.009). This demonstrates that RL methods, while not superior individually, provide complementary value when combined with traditional heuristics.
On difficult cases where Greedy performs poorly (<95%), PG shows potential with a +9.42% improvement. However, this result requires further validation as our statistical test did not reach significance (p=0.802) due to limited sample size (n=10).
Practical Recommendations (based on our experimental setup with Intel i7 CPU, Python 3.8+, PyTorch implementation):
| Scenario | Recommended Method | Reason |
|---|---|---|
| n < 20 | DP | Optimal solution, fast enough in our tests |
| n > 50 | PG | DP infeasible; in our implementation, PG showed fastest inference |
| Best quality needed | Ensemble | 96.91% quality, statistically validated (p=0.009) |
Our experiments show that RL methods struggle to consistently outperform Greedy on standard knapsack instances, achieving 77-90% quality versus Greedy's 97-99%. This gap suggests that for well-structured problems with effective heuristics, RL may not always be the best choice. Additionally, models trained on small instances show degraded performance on larger problems, indicating poor generalization.
PPO performed notably worse than expected (84.25% with high variance), likely due to insufficient training time, untuned hyperparameters, or mismatch between its continuous optimization approach and the discrete nature of knapsack decisions. We excluded it from the Ensemble strategy for this reason.
Future improvements could include attention mechanisms for better generalization, curriculum learning for scaling, and extended PPO training with reward shaping. RL may also be better suited for knapsack variants where Greedy struggles, such as dynamic problems with online arrivals, stochastic settings with uncertain attributes, or multi-dimensional constraints.
This figure shows the training dynamics of PPO over 5000 episodes:
- Left panel (Average Reward): The reward increases from approximately 0.45 to 0.58 over training, indicating that the model learns to make better decisions. However, the improvement plateaus after around 3000 episodes.
- Middle panel (Actor Loss): The policy loss fluctuates around zero, showing stable policy updates without divergence.
- Right panel (Value Loss): The value function loss decreases from ~0.017 to ~0.006, indicating improved value estimation over training.
Despite these positive training signals, PPO's final performance (84.25% average quality) remains significantly below Greedy (95.40%). This gap suggests that either more training is needed, the reward function requires redesign, or PPO's continuous policy optimization is fundamentally misaligned with the discrete decision structure of the knapsack problem.
This project is licensed under the MIT License.




