When to Use Reinforcement Learning for Knapsack Problems

A systematic analysis of when Reinforcement Learning is applicable to the 0-1 Knapsack Problem, with comprehensive experiments on scalability, advantage scenarios, and hybrid strategies.

Project Overview

This project investigates the applicability of Reinforcement Learning (RL) methods for solving the 0-1 Knapsack Problem. We compare traditional methods (Dynamic Programming, Greedy) with RL approaches (Policy Gradient, DQN, PPO) to answer three core research questions about scalability, advantage scenarios, and hybrid strategy design.

Key Contributions

1. Six knapsack solving algorithms

  • Dynamic Programming (DP) — Exact optimal solution
  • Greedy — Value density heuristic
  • Policy Gradient (PG) — Linear policy RL
  • Deep Q-Network (DQN) — Deep RL with experience replay
  • Proximal Policy Optimization (PPO) — Stable RL algorithm
  • Hybrid Strategies — Ensemble, Staged, Confidence-based
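As a concrete reference point for the two traditional baselines, here is a minimal sketch of the DP and value-density Greedy solvers. This is illustrative code under the standard 0-1 knapsack formulation, not the repo's actual `experiments/solvers.py` implementation:

```python
def solve_dp(weights, values, capacity):
    """Exact 0-1 knapsack via dynamic programming, O(n*W) time and O(W) space."""
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacity downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

def solve_greedy(weights, values, capacity):
    """Value-density heuristic: sort by value/weight, take every item that fits."""
    order = sorted(range(len(weights)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    total, remaining = 0, capacity
    for i in order:
        if weights[i] <= remaining:
            remaining -= weights[i]
            total += values[i]
    return total
```

On the instance `weights=[1, 2, 3]`, `values=[6, 10, 12]`, `capacity=5`, DP finds the optimum 22 (items 2 and 3) while the density heuristic returns 16, which is the kind of gap the "difficult cases" analysis below exploits.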

2. Systematic analysis of RL applicability

  • Scalability advantage: RL (PG) runs 2.2× faster than DP at n=50
  • Scenario-specific value: on difficult cases, PG achieves 81.77% vs Greedy's 72.35% (+9.42 percentage points)
  • Hybrid strategies: Ensemble achieves 96.91% solution quality, a statistically significant improvement over Greedy (p=0.009)

3. Statistical validation

  • Ensemble vs Greedy: statistically significant (p=0.009)
  • DQN vs PPO: statistically significant (p=0.0005)

Research Questions

| ID | Question | Focus |
|----|----------|-------|
| RQ1 | Does RL have time advantages on large-scale problems? | Scalability |
| RQ2 | In which scenarios does RL outperform traditional heuristics? | Applicability |
| RQ3 | How can RL be effectively combined with traditional methods? | Hybrid Strategy |

Methodology

Compared Methods

| Category | Method | Description | Complexity |
|----------|--------|-------------|------------|
| Exact | DP | Dynamic Programming | O(n×W) |
| Heuristic | Greedy | Value density sorting | O(n log n) |
| RL | PG | Linear policy | O(n) |
| RL | DQN | 3-layer MLP (256-256-128) | O(n) |
| RL | PPO | Actor-Critic (64-64) | O(n) |
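To illustrate why PG inference is O(n), here is a sketch of a linear policy making one include/skip decision per item: each item gets a fixed-size feature vector, a dot product with the learned parameters, and a sigmoid threshold. The three features and the 0.5 cutoff are illustrative assumptions, not the project's actual feature set:

```python
import math

def pg_linear_policy(weights, values, capacity, theta):
    """One O(n) inference pass of a linear policy: score each item with a dot
    product of simple per-item features and include it if the sigmoid
    probability exceeds 0.5 and it still fits the remaining capacity."""
    remaining = capacity
    chosen, total = [], 0
    for i in range(len(weights)):
        features = [values[i] / weights[i],   # value density
                    weights[i] / capacity,    # relative item weight
                    remaining / capacity]     # remaining capacity ratio
        score = sum(t * f for t, f in zip(theta, features))
        prob = 1.0 / (1.0 + math.exp(-score))
        if prob > 0.5 and weights[i] <= remaining:
            chosen.append(i)
            remaining -= weights[i]
            total += values[i]
    return total, chosen
```

Training adjusts `theta` via the policy-gradient update; at inference time only this single linear pass over the items is needed, which is what gives PG its speed advantage over DP at large n.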

Hybrid Strategies

| Strategy | Description |
|----------|-------------|
| Ensemble | Run Greedy+PG+DQN, select the best result |
| Staged Hybrid | PG selects first, Greedy fills remaining capacity |
| Confidence-based | Choose method based on capacity ratio |
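The Ensemble idea reduces to "run every solver, keep the best feasible answer." A minimal sketch follows; the repo combines Greedy, PG, and DQN, but here two simple heuristics stand in for the learned solvers:

```python
def greedy_by_density(weights, values, capacity):
    """Baseline heuristic: take items in order of value/weight ratio."""
    order = sorted(range(len(weights)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    total, remaining = 0, capacity
    for i in order:
        if weights[i] <= remaining:
            remaining -= weights[i]
            total += values[i]
    return total

def greedy_by_value(weights, values, capacity):
    """A second heuristic so the ensemble has alternatives to pick between."""
    order = sorted(range(len(weights)), key=lambda i: values[i], reverse=True)
    total, remaining = 0, capacity
    for i in order:
        if weights[i] <= remaining:
            remaining -= weights[i]
            total += values[i]
    return total

def ensemble(weights, values, capacity, solvers):
    """Ensemble strategy: run every solver on the instance, keep the best value.
    `solvers` is any list of callables sharing the same signature."""
    return max(s(weights, values, capacity) for s in solvers)
```

Because the ensemble's output is the maximum over its members, it can never score below the best individual solver on any instance, which is why it dominates Greedy alone in the RQ3 results.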

Project Structure

```
knapsack-rl-analysis/
├── experiments/
│   ├── rq1_scalability.py      # RQ1: Scalability analysis
│   ├── rq2_scenarios.py        # RQ2: Advantage scenarios
│   ├── rq3_hybrid.py           # RQ3: Hybrid strategies
│   ├── statistical_test.py     # Statistical significance tests
│   └── solvers.py              # Unified solver interface
├── models/                     # Trained models
├── results/                    # Experiment results & figures
├── visualization/              # Plotting scripts
└── data/                       # Dataset
```

Experiment Results & Figures

RQ1 — Scalability Analysis

| n | DP Time | PG Time | Speedup | Greedy Quality | PG Quality | DQN Quality | PPO Quality |
|---|---------|---------|---------|----------------|------------|-------------|-------------|
| 5 | 0.04 ms | 0.37 ms | 0.1× | 99.42% | 89.18% | 90.38% | 82.08% |
| 10 | 0.22 ms | 0.43 ms | 0.5× | 98.56% | 90.22% | 91.63% | 80.86% |
| 20 | 0.83 ms | 0.84 ms | 1.0× | 99.03% | 82.20% | 91.12% | 79.18% |
| 50 | 5.40 ms | 2.44 ms | 2.2× | 99.78% | 77.83% | 82.58% | 77.93% |

Key Finding: RL (PG) achieves 2.2× speedup over DP at n=50. For n>50, DP becomes impractical due to O(n×W) complexity.

Figure 1: Runtime and Solution Quality across Problem Scales


  • Left panel (Runtime Scalability): Shows computation time (log scale) for each method as problem size increases. DP time grows rapidly with n, since its O(n×W) cost scales with both item count and capacity, while RL methods (PG, PPO) maintain near-linear growth. At n=50, PG is 2.2× faster than DP.
  • Right panel (Solution Quality): Shows solution quality (% of optimal) for Greedy, PG, DQN, and PPO across scales n=5 to n=50. Greedy consistently achieves 98-99% quality, while RL methods show more variance (77-91%).

RQ2 — Advantage Scenarios

| Scenario | Greedy | PG | DQN | PPO |
|----------|--------|----|-----|-----|
| High Value Density Variance | 98.88% | 87.74% | 92.01% | 81.72% |
| Difficult Real Cases | 72.35% | 81.77% | 76.57% | 75.39% |
| Larger Scale (n=20) | 99.10% | 84.81% | 90.11% | 80.08% |

Key Finding: On difficult real cases (where Greedy < 95%), PG achieves 81.77% vs Greedy's 72.35% (+9.42 percentage points). Note: this difference was not statistically significant in our t-test (p=0.802, n=10 cases), so more samples are needed for conclusive evidence.

Figure 2: RL Performance across Different Scenarios


This figure compares four methods (Greedy, PG, DQN, PPO) across three scenarios:

  • High Value Density Variance: Instances with diverse value-to-weight ratios. Greedy performs best (98.88%), DQN second (92.01%).
  • Difficult Real Cases: Real dataset instances where Greedy achieves <95%. Here PG shows advantage (81.77% vs 72.35%).
  • Larger Scale (n=20): Randomly generated instances with 20 items. Greedy dominates (99.10%), DQN performs reasonably (90.11%).

RQ3 — Hybrid Strategies

| Method | Quality | vs Greedy |
|--------|---------|-----------|
| Greedy | 95.40% | baseline |
| PG | 90.25% | -5.15% |
| DQN | 90.57% | -4.83% |
| PPO | 84.25% | -11.15% |
| Confidence-based | 91.60% | -3.80% |
| Staged Hybrid | 91.65% | -3.75% |
| Ensemble | 96.91% | +1.51% |

Key Finding: The Ensemble strategy achieves the best performance, a statistically significant improvement over Greedy (p=0.009).

Figure 3: Hybrid Strategy Performance Comparison


This figure compares baseline methods (Greedy, PG, DQN) with hybrid strategies (Confidence-based, Staged Hybrid, Ensemble):

  • Baselines: Greedy (95.40%) outperforms individual RL methods. PG (90.25%) and DQN (90.57%) show similar performance.
  • Hybrid Strategies: Ensemble achieves the best result (96.91%), combining Greedy, PG, and DQN by selecting the best solution among them.
  • Note on PPO: PPO is excluded from this comparison and the Ensemble strategy due to its significantly lower performance (84.25%) and high variance, which would degrade the hybrid strategy's effectiveness.

RL Algorithm Comparison

Figure 4: RL Algorithm Quality Comparison across Problem Scales


This figure compares solution quality of Greedy, PG, DQN, and PPO across problem scales (n=5, 10, 20, 50). Greedy consistently achieves ~99% quality across all scales, demonstrating its effectiveness as a baseline. Among RL methods, DQN performs best (82-91%), followed by PG (77-90%) and PPO (77-82%). Notably, all RL methods show performance degradation as problem size increases, while Greedy remains stable.

Statistical Tests Summary

| Test | Result | p-value | Significant |
|------|--------|---------|-------------|
| Ensemble vs Greedy | +2.08% | 0.009 | ✓ Yes |
| DQN vs PPO | +6.45% | 0.0005 | ✓ Yes |
| PG vs DQN | -2.18% | 0.218 | ✗ No |
| PG vs PPO | +4.27% | 0.063 | ✗ No |
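Comparisons like these are typically done with a paired t-test over per-instance quality scores. Here is a stdlib sketch of the statistic (`scipy.stats.ttest_rel` would also return the p-value directly; the function name and the sample scores in the usage note are illustrative, not the repo's data):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t-statistic for per-instance quality scores of two methods.
    Returns (mean difference, t); degrees of freedom are len(xs) - 1."""
    diffs = [x - y for x, y in zip(xs, ys)]
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, mean / se
```

Pairing by instance matters here: the same knapsack instance is solved by both methods, so instance-to-instance difficulty variation cancels out of the differences, giving a more sensitive test than comparing the two methods' overall means.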

Reproducibility

All experiments use a fixed random seed (seed=42) for reproducibility. Pretrained models are available in the models/ directory and can be loaded directly:

```python
from experiments.solvers import solve

value, solution = solve(weights, values, capacity, method='pg')  # or 'dqn', 'ppo'
```

For complete details on hyperparameters and experimental setup, see docs/EXPERIMENT_DETAILS.md.

How to Run

Install dependencies

```bash
pip install -r requirements.txt
```

Run all experiments

```bash
python experiments/rq1_scalability.py
python experiments/rq2_scenarios.py
python experiments/rq3_hybrid.py
python experiments/statistical_test.py
```

Generate figures

```bash
python visualization/generate_figures_rq.py
```

Docker Support

```bash
docker build -t knapsack-rl .
docker run -v $(pwd)/results:/app/results knapsack-rl
```

Conclusion

This study systematically investigates the applicability of Reinforcement Learning methods for the 0-1 Knapsack Problem through three research questions addressing scalability, advantage scenarios, and hybrid strategies.

Our experiments reveal that RL methods offer meaningful time advantages on large-scale problems. At n=50, Policy Gradient achieves a 2.2× speedup over Dynamic Programming, and this advantage grows as problem size increases since DP's O(n×W) complexity becomes prohibitive while RL maintains O(n) inference time. However, this speed advantage comes at the cost of solution quality—Greedy consistently achieves 97-99% of optimal across all tested scales, while RL methods range from 77-91%.

The most significant finding is that the Ensemble strategy, which combines Greedy, PG, and DQN by selecting the best solution among them, achieves 96.91% solution quality with statistically significant improvement over Greedy alone (p=0.009). This demonstrates that RL methods, while not superior individually, provide complementary value when combined with traditional heuristics.

On difficult cases where Greedy performs poorly (<95%), PG shows potential with a +9.42 percentage-point improvement. However, this result requires further validation, as our statistical test did not reach significance (p=0.802) due to the limited sample size (n=10).

Practical Recommendations (based on our experimental setup with Intel i7 CPU, Python 3.8+, PyTorch implementation):

| Scenario | Recommended Method | Reason |
|----------|--------------------|--------|
| n < 20 | DP | Optimal solution, fast enough in our tests |
| n > 50 | PG | DP infeasible; in our implementation, PG showed fastest inference |
| Best quality needed | Ensemble | 96.91% quality, statistically validated (p=0.009) |

Limitations & Future Work

Current Limitations

Our experiments show that RL methods struggle to consistently outperform Greedy on standard knapsack instances, achieving 77-90% quality versus Greedy's 97-99%. This gap suggests that for well-structured problems with effective heuristics, RL may not always be the best choice. Additionally, models trained on small instances show degraded performance on larger problems, indicating poor generalization.

PPO performed notably worse than expected (84.25% with high variance), likely due to insufficient training time, untuned hyperparameters, or mismatch between its continuous optimization approach and the discrete nature of knapsack decisions. We excluded it from the Ensemble strategy for this reason.

Future Work

Future improvements could include attention mechanisms for better generalization, curriculum learning for scaling, and extended PPO training with reward shaping. RL may also be better suited for knapsack variants where Greedy struggles, such as dynamic problems with online arrivals, stochastic settings with uncertain attributes, or multi-dimensional constraints.

Appendix

PPO Training Curve

Figure 5: PPO Training Progress


This figure shows the training dynamics of PPO over 5000 episodes:

  • Left panel (Average Reward): The reward increases from approximately 0.45 to 0.58 over training, indicating that the model learns to make better decisions. However, the improvement plateaus after around 3000 episodes.
  • Middle panel (Actor Loss): The policy loss fluctuates around zero, showing stable policy updates without divergence.
  • Right panel (Value Loss): The value function loss decreases from ~0.017 to ~0.006, indicating improved value estimation over training.

Despite these positive training signals, PPO's final performance (84.25% average quality) remains significantly below Greedy (95.40%). This gap suggests that either more training is needed, the reward function requires redesign, or PPO's continuous policy optimization is fundamentally misaligned with the discrete decision structure of the knapsack problem.

License

This project is licensed under the MIT License.
