A systematic analysis of when Reinforcement Learning is applicable to the 0-1 Knapsack Problem, with comprehensive experiments on scalability, advantage scenarios, and hybrid strategies.
- Project Overview
- Key Contributions
- Research Questions
- Methodology
- Project Structure
- Experiment Results
- How to Run
- Conclusion
- Limitations & Future Work
This project investigates the applicability of Reinforcement Learning (RL) methods for solving the 0-1 Knapsack Problem. We compare traditional methods (Dynamic Programming, Greedy) with RL approaches (Policy Gradient, DQN, PPO) to answer three core research questions about scalability, advantage scenarios, and hybrid strategy design.
- Dynamic Programming (DP) — Exact optimal solution
- Greedy — Value density heuristic
- Policy Gradient (PG) — Linear policy RL
- Deep Q-Network (DQN) — Deep RL with experience replay
- Proximal Policy Optimization (PPO) — Stable RL algorithm
- Hybrid Strategies — Ensemble, Staged, Confidence-based
- Scalability advantage: RL (PG) is 2.2× faster than DP at n=50
- Scenario-specific value: On difficult cases, PG achieves 81.77% vs Greedy's 72.35% (+9.42%)
- Hybrid strategies: Ensemble achieves 96.91% solution quality, statistically significant improvement over Greedy (p=0.009)
- Ensemble vs Greedy: statistically significant (p=0.009)
- DQN vs PPO: statistically significant (p=0.0005)
| ID | Question | Focus |
|---|---|---|
| RQ1 | Does RL have time advantages on large-scale problems? | Scalability |
| RQ2 | In which scenarios does RL outperform traditional heuristics? | Applicability |
| RQ3 | How to effectively combine RL with traditional methods? | Hybrid Strategy |
| Category | Method | Description | Complexity |
|---|---|---|---|
| Exact | DP | Dynamic Programming | O(n×W) |
| Heuristic | Greedy | Value density sorting | O(n log n) |
| RL | PG | Linear policy | O(n) |
| RL | DQN | 3-layer MLP (256-256-128) | O(n) |
| RL | PPO | Actor-Critic (64-64) | O(n) |
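To make the complexity column concrete, minimal versions of the exact DP and the value-density Greedy can be written as below. This is an illustrative sketch, not the repository's `experiments/solvers.py` implementation:

```python
def dp_knapsack(weights, values, capacity):
    """Exact 0-1 knapsack via dynamic programming: O(n * W) time."""
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


def greedy_knapsack(weights, values, capacity):
    """Value-density heuristic: sort by value/weight, take greedily: O(n log n)."""
    order = sorted(range(len(weights)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    total_value, remaining = 0, capacity
    for i in order:
        if weights[i] <= remaining:
            remaining -= weights[i]
            total_value += values[i]
    return total_value
```

For weights `[1, 2, 3]`, values `[6, 10, 12]`, and capacity 5, DP finds the optimum 22 (items 2 and 3) while Greedy returns 16 — the kind of gap the "Difficult Real Cases" scenario below probes.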
| Strategy | Description |
|---|---|
| Ensemble | Run Greedy+PG+DQN, select best result |
| Staged Hybrid | PG selects first, Greedy fills remaining |
| Confidence-based | Choose method based on capacity ratio |
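The Ensemble strategy from the table above amounts to running several solvers and keeping the best feasible answer. A minimal sketch, assuming each solver returns `(value, selected_item_indices)` in the style of the project's `solve` interface (the actual repository code may differ):

```python
def ensemble_solve(weights, values, capacity, solvers):
    """Run every solver and keep the highest-value feasible solution.

    `solvers` is a list of callables with the (assumed) signature
    f(weights, values, capacity) -> (value, selected_item_indices).
    """
    best_value, best_solution = -1, None
    for solver in solvers:
        value, solution = solver(weights, values, capacity)
        # Recheck feasibility so one misbehaving solver cannot win
        # with an over-capacity selection.
        if sum(weights[i] for i in solution) <= capacity and value > best_value:
            best_value, best_solution = value, solution
    return best_value, best_solution
```

Because it only ever selects among candidate solutions, the ensemble is lower-bounded by its best member — which is why it can beat Greedy alone even though no individual RL method does.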
```
knapsack-rl-analysis/
├── experiments/
│   ├── rq1_scalability.py    # RQ1: Scalability analysis
│   ├── rq2_scenarios.py      # RQ2: Advantage scenarios
│   ├── rq3_hybrid.py         # RQ3: Hybrid strategies
│   ├── statistical_test.py   # Statistical significance tests
│   └── solvers.py            # Unified solver interface
├── models/                   # Trained models
├── results/                  # Experiment results & figures
├── visualization/            # Plotting scripts
└── data/                     # Dataset
```
| n | DP Time | PG Time | Speedup | Greedy Quality | PG Quality | DQN Quality | PPO Quality |
|---|---|---|---|---|---|---|---|
| 5 | 0.04ms | 0.37ms | 0.1× | 99.42% | 89.18% | 90.38% | 82.08% |
| 10 | 0.22ms | 0.43ms | 0.5× | 98.56% | 90.22% | 91.63% | 80.86% |
| 20 | 0.83ms | 0.84ms | 1.0× | 99.03% | 82.20% | 91.12% | 79.18% |
| 50 | 5.40ms | 2.44ms | 2.2× | 99.78% | 77.83% | 82.58% | 77.93% |
Key Finding: RL (PG) achieves 2.2× speedup over DP at n=50. For n>50, DP becomes impractical due to O(n×W) complexity.
- Left panel (Runtime Scalability): Shows computation time (log scale) for each method as problem size increases. DP time grows rapidly with n — its O(n×W) cost is pseudo-polynomial, scaling with both item count and capacity — while RL methods (PG, PPO) maintain near-linear growth. At n=50, PG is 2.2× faster than DP.
- Right panel (Solution Quality): Shows solution quality (% of optimal) for Greedy, PG, DQN, and PPO across scales n=5 to n=50. Greedy consistently achieves 98-99% quality, while RL methods show more variance (77-91%).
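A runtime comparison like the one above can be reproduced with a small harness; the sketch below is not the project's timing code, and `solve_fn` stands in for any solver callable:

```python
import random
import time


def time_solver(solve_fn, n, capacity, trials=20, seed=42):
    """Average wall-clock time of solve_fn over random instances of size n."""
    rng = random.Random(seed)  # per-harness RNG so trials are reproducible
    total = 0.0
    for _ in range(trials):
        weights = [rng.randint(1, 20) for _ in range(n)]
        values = [rng.randint(1, 100) for _ in range(n)]
        start = time.perf_counter()
        solve_fn(weights, values, capacity)
        total += time.perf_counter() - start
    return total / trials
```

Averaging over many random instances matters here: single-run timings at sub-millisecond scale (see the n=5 row) are dominated by measurement noise.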
| Scenario | Greedy | PG | DQN | PPO |
|---|---|---|---|---|
| High Value Density Variance | 98.88% | 87.74% | 92.01% | 81.72% |
| Difficult Real Cases | 72.35% | 81.77% | 76.57% | 75.39% |
| Larger Scale (n=20) | 99.10% | 84.81% | 90.11% | 80.08% |
Key Finding: On difficult real cases (where Greedy < 95%), PG achieves 81.77% vs Greedy's 72.35% (+9.42%). Note: This difference was not statistically significant in our t-test (p=0.802, n=10 cases), suggesting more samples are needed for conclusive evidence.
This figure compares four methods (Greedy, PG, DQN, PPO) across three scenarios:
- High Value Density Variance: Instances with diverse value-to-weight ratios. Greedy performs best (98.88%), DQN second (92.01%).
- Difficult Real Cases: Real dataset instances where Greedy achieves <95%. Here PG shows advantage (81.77% vs 72.35%).
- Larger Scale (n=20): Randomly generated instances with 20 items. Greedy dominates (99.10%), DQN performs reasonably (90.11%).
| Method | Quality | vs Greedy |
|---|---|---|
| Greedy | 95.40% | baseline |
| PG | 90.25% | -5.15% |
| DQN | 90.57% | -4.83% |
| PPO | 84.25% | -11.15% |
| Confidence-based | 91.60% | -3.80% |
| Staged Hybrid | 91.65% | -3.75% |
| Ensemble | 96.91% | +1.51% |
Key Finding: Ensemble strategy achieves the best performance, statistically significant improvement over Greedy (p=0.009).
This figure compares baseline methods (Greedy, PG, DQN) with hybrid strategies (Confidence-based, Staged Hybrid, Ensemble):
- Baselines: Greedy (95.40%) outperforms individual RL methods. PG (90.25%) and DQN (90.57%) show similar performance.
- Hybrid Strategies: Ensemble achieves the best result (96.91%), combining Greedy, PG, and DQN by selecting the best solution among them.
- Note on PPO: PPO is excluded from this comparison and the Ensemble strategy due to its significantly lower performance (84.25%) and high variance, which would degrade the hybrid strategy's effectiveness.
This figure compares solution quality of Greedy, PG, DQN, and PPO across problem scales (n=5, 10, 20, 50). Greedy consistently achieves ~99% quality across all scales, demonstrating its effectiveness as a baseline. Among RL methods, DQN performs best (82-91%), followed by PG (77-90%) and PPO (77-82%). Notably, all RL methods show performance degradation as problem size increases, while Greedy remains stable.
| Test | Result | p-value | Significant |
|---|---|---|---|
| Ensemble vs Greedy | +2.08% | 0.009 | ✓ Yes |
| DQN vs PPO | +6.45% | 0.0005 | ✓ Yes |
| PG vs DQN | -2.18% | 0.218 | ✗ No |
| PG vs PPO | +4.27% | 0.063 | ✗ No |
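The comparisons above are paired tests over the same instances. As an illustration of the statistic involved (the exact test variant in `statistical_test.py` is not shown here), a paired t-statistic can be computed with the standard library alone:

```python
import math
from statistics import mean, stdev


def paired_t_statistic(quality_a, quality_b):
    """Paired t-statistic for per-instance quality scores of two methods.

    The p-value then follows from the t-distribution with len(diffs) - 1
    degrees of freedom (e.g. via scipy.stats); omitted to stay stdlib-only.
    """
    diffs = [a - b for a, b in zip(quality_a, quality_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

Pairing is what gives the Ensemble-vs-Greedy comparison its power: per-instance differences have far lower variance than the raw quality scores.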
All experiments use fixed random seeds (seed=42) for reproducibility. Pretrained models are available in the models/ directory and can be loaded directly:
```python
from experiments.solvers import solve

value, solution = solve(weights, values, capacity, method='pg')  # or 'dqn', 'ppo'
```

For complete details on hyperparameters and experimental setup, see docs/EXPERIMENT_DETAILS.md.
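Seed fixing in the sense described above can be sketched as follows (the repository's actual seeding code is not shown; `set_seed` is an illustrative name):

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Make runs reproducible by seeding every RNG the experiments touch."""
    random.seed(seed)
    np.random.seed(seed)
    # For the RL training runs, torch.manual_seed(seed) (and, on GPU,
    # torch.cuda.manual_seed_all(seed)) would be called as well; omitted
    # here to keep the sketch free of heavy dependencies.


set_seed(42)
first_draw = np.random.rand(3)
set_seed(42)
assert np.allclose(first_draw, np.random.rand(3))  # identical draws after re-seeding
```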
```bash
pip install -r requirements.txt
python experiments/rq1_scalability.py
python experiments/rq2_scenarios.py
python experiments/rq3_hybrid.py
python experiments/statistical_test.py
python visualization/generate_figures_rq.py
```

```bash
docker build -t knapsack-rl .
docker run -v $(pwd)/results:/app/results knapsack-rl
```

This study systematically investigates the applicability of Reinforcement Learning methods for the 0-1 Knapsack Problem through three research questions addressing scalability, advantage scenarios, and hybrid strategies.
Our experiments reveal that RL methods offer meaningful time advantages on large-scale problems. At n=50, Policy Gradient achieves a 2.2× speedup over Dynamic Programming, and this advantage grows as problem size increases since DP's O(n×W) complexity becomes prohibitive while RL maintains O(n) inference time. However, this speed advantage comes at the cost of solution quality—Greedy consistently achieves 97-99% of optimal across all tested scales, while RL methods range from 77-91%.
The most significant finding is that the Ensemble strategy, which combines Greedy, PG, and DQN by selecting the best solution among them, achieves 96.91% solution quality with statistically significant improvement over Greedy alone (p=0.009). This demonstrates that RL methods, while not superior individually, provide complementary value when combined with traditional heuristics.
On difficult cases where Greedy performs poorly (<95%), PG shows potential with a +9.42% improvement. However, this result requires further validation as our statistical test did not reach significance (p=0.802) due to limited sample size (n=10).
Practical Recommendations (based on our experimental setup with Intel i7 CPU, Python 3.8+, PyTorch implementation):
| Scenario | Recommended Method | Reason |
|---|---|---|
| n < 20 | DP | Optimal solution, fast enough in our tests |
| n > 50 | PG | DP infeasible; in our implementation, PG showed fastest inference |
| Best quality needed | Ensemble | 96.91% quality, statistically validated (p=0.009) |
Our experiments show that RL methods struggle to consistently outperform Greedy on standard knapsack instances, achieving 77-90% quality versus Greedy's 97-99%. This gap suggests that for well-structured problems with effective heuristics, RL may not always be the best choice. Additionally, models trained on small instances show degraded performance on larger problems, indicating poor generalization.
PPO performed notably worse than expected (84.25% with high variance), likely due to insufficient training time, untuned hyperparameters, or mismatch between its continuous optimization approach and the discrete nature of knapsack decisions. We excluded it from the Ensemble strategy for this reason.
Future improvements could include attention mechanisms for better generalization, curriculum learning for scaling, and extended PPO training with reward shaping. RL may also be better suited for knapsack variants where Greedy struggles, such as dynamic problems with online arrivals, stochastic settings with uncertain attributes, or multi-dimensional constraints.
This figure shows the training dynamics of PPO over 5000 episodes:
- Left panel (Average Reward): The reward increases from approximately 0.45 to 0.58 over training, indicating that the model learns to make better decisions. However, the improvement plateaus after around 3000 episodes.
- Middle panel (Actor Loss): The policy loss fluctuates around zero, showing stable policy updates without divergence.
- Right panel (Value Loss): The value function loss decreases from ~0.017 to ~0.006, indicating improved value estimation over training.
Despite these positive training signals, PPO's final performance (84.25% average quality) remains significantly below Greedy (95.40%). This gap suggests that either more training is needed, the reward function requires redesign, or PPO's continuous policy optimization is fundamentally misaligned with the discrete decision structure of the knapsack problem.
This project is licensed under the MIT License.




