| title | SalaryNegotiationArena |
|---|---|
| emoji | 🤝 |
| colorFrom | blue |
| colorTo | green |
| sdk | gradio |
| sdk_version | 4.20.0 |
| app_file | app.py |
| pinned | false |
| license | mit |
OpenEnv Hackathon SF: Self-Improvement (Statement 4) + Snorkel AI
An RL environment where LLM agents learn to negotiate salary packages against 5 simulated hiring experts with hidden priorities that shift over time.
- Live App: https://huggingface.co/spaces/yashj2110/negotiation-arena-v2
- Model: https://huggingface.co/yashj2110/salary-negotiation-qwen-1.5b
Metrics: Training Metrics Result
Total Reward = 0.2 × Format + 0.5 × Negotiation + 0.3 × Deal Quality
| Component | Weight | What it measures | Score |
|---|---|---|---|
| Format | 20% | Valid action type (propose, counter, accept, reject, walk_away) | 1.0 valid, 0.0 invalid |
| Negotiation | 50% | Outcome quality: deal terms vs baseline ($140k, 2.5%, 90 days) | +1.0 above baseline, +0.5 at baseline, -1.0 no deal, +0.2 early close bonus, -0.1 per turn |
| Deal Quality | 30% | Snorkel-weighted utility across 4 labeling profiles | +0.3 if utility ≥ 0.5, else 0.0 |
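The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual `reward.py`: the function name `total_reward` and the `WEIGHTS` dictionary are assumptions, and only the three weights come from the formula.

```python
# Weights taken from the Total Reward formula; all names here are illustrative.
WEIGHTS = {"format": 0.2, "negotiation": 0.5, "deal_quality": 0.3}


def total_reward(format_score: float, negotiation_score: float,
                 deal_quality_score: float) -> float:
    """Combine the three component scores into one scalar reward."""
    return (WEIGHTS["format"] * format_score
            + WEIGHTS["negotiation"] * negotiation_score
            + WEIGHTS["deal_quality"] * deal_quality_score)


if __name__ == "__main__":
    # Valid action, deal above baseline, deal-quality bonus earned.
    print(round(total_reward(1.0, 1.0, 0.3), 2))
    # Invalid action and no deal.
    print(round(total_reward(0.0, -1.0, 0.0), 2))
```

Because Negotiation carries half the weight, a failed negotiation dominates the total even when the output format is valid.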
Each profile is a different "judge" evaluating the same deal with different priorities:
| Profile | Salary Weight | Equity Weight | Start Date Weight | Prioritizes |
|---|---|---|---|---|
| Balanced | 0.4 | 0.3 | 0.3 | Equal mix |
| Cash-heavy | 0.7 | 0.1 | 0.2 | Maximize salary |
| Equity-heavy | 0.2 | 0.6 | 0.2 | Maximize equity |
| Fast-start | 0.2 | 0.2 | 0.6 | Start ASAP |
Example: Deal closes at $150k salary, 3% equity, start in 30 days.
| Step | Salary | Equity | Start Date |
|---|---|---|---|
| Raw value | $150,000 | 3.0% | 30 days |
| Normalized (0–1) | 150k/200k = 0.75 | 3%/5% = 0.60 | 1 − 30/180 = 0.83 |
Balanced profile score: 0.4×0.75 + 0.3×0.60 + 0.3×0.83 = 0.73 → ≥ 0.5 → reward = +0.3
The profile rotates with the expert, so the agent can't optimize for just one definition of "good deal."
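The worked example can be reproduced with a short sketch. The profile weights and the normalization constants (200k, 5%, 180 days) are taken from the tables above; the names `PROFILES`, `utility`, and `deal_quality_reward` are hypothetical and may not match the project's code.

```python
# Profile weights: (salary, equity, start date), copied from the profile table.
PROFILES = {
    "balanced": (0.4, 0.3, 0.3),
    "cash_heavy": (0.7, 0.1, 0.2),
    "equity_heavy": (0.2, 0.6, 0.2),
    "fast_start": (0.2, 0.2, 0.6),
}

# Normalization constants from the worked example.
MAX_SALARY = 200_000
MAX_EQUITY_PCT = 5.0
MAX_START_DAYS = 180


def utility(profile: str, salary: float, equity_pct: float, start_days: float) -> float:
    """Weighted utility of a deal under one labeling profile."""
    w_salary, w_equity, w_start = PROFILES[profile]
    salary_norm = salary / MAX_SALARY
    equity_norm = equity_pct / MAX_EQUITY_PCT
    start_norm = 1 - start_days / MAX_START_DAYS  # earlier start scores higher
    return w_salary * salary_norm + w_equity * equity_norm + w_start * start_norm


def deal_quality_reward(profile: str, salary: float, equity_pct: float,
                        start_days: float) -> float:
    """+0.3 if the profile's utility reaches 0.5, else 0.0."""
    return 0.3 if utility(profile, salary, equity_pct, start_days) >= 0.5 else 0.0


if __name__ == "__main__":
    for name in PROFILES:
        print(name, round(utility(name, 150_000, 3.0, 30), 2))
```

Running the same deal through all four profiles shows how much the score depends on which judge is active.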
Screenshots:
- 5 Expert Personas with distinct personalities, deal-breakers, and hidden priorities
- Information Asymmetry: agent must infer expert priorities from conversational cues
- CurriculumManager: agent's weaknesses drive next epoch's training data
- SelfPlayChallenger: epoch N model becomes epoch N+1 opponent
- Snorkel Drift: preferences shift every 8 episodes
Agent (Qwen2.5-1.5B) ↔ OpenEnv MCPEnvironment ↔ Expert Challengers, feeding 3 reward functions:
- Format compliance
- Negotiation outcome
- Snorkel-weighted quality
`OpenEnv_Hack/`
- `server/`
  - `__init__.py`
  - `models.py`
  - `negotiation_environment.py`
  - `app.py`
- `client/`
  - `__init__.py`
  - `negotiation_env.py`
- `reward.py`
- `challenger.py`
- `app_gradio.py`
- `app.py`
- `train_colab.py`
- `evaluate.py`
- `test_env.py`
- `requirements.txt`
- `openenv.yaml`
- `pyproject.toml`
- `README.md`
- `.gitignore`
- `negotiation_arena_training.ipynb`
## DEPLOYMENT FLOW
- Local Development: build code and verify structure, then copy to Northflank via SCP.
- Northflank H100: install deps, run tests, and start the server; then run GRPO training (3 epochs + curriculum) and evaluate baseline vs. finetuned.
- HuggingFace: push the Gradio demo to HF Spaces, and push the model to HF Model (yashj2110/salary-neg..).
## EMOTIONAL DYNAMICS
Agent Message Analysis:

| Trigger words in agent message | Signal | Effect |
|---|---|---|
| "understand", "fair", "appreciate", "value", "excited", "mission" | Positive words → Rapport ↑ | Lowers acceptance threshold by 0.1 |
| "demand", "must", "non-negotiable", "ridiculous" | Negative words → Frustration ↑ | Raises acceptance threshold by 0.05 |
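
A minimal sketch of how such keyword cues could move an acceptance threshold. The word lists and the 0.1 / 0.05 deltas come from the description above; everything else is an assumption, including the function name `adjust_threshold`, whole-word matching, and applying each effect at most once per message.

```python
import re

# Word lists copied from the Emotional Dynamics description.
POSITIVE_WORDS = {"understand", "fair", "appreciate", "value", "excited", "mission"}
NEGATIVE_WORDS = {"demand", "must", "non-negotiable", "ridiculous"}

RAPPORT_DELTA = -0.1       # positive words lower the acceptance threshold
FRUSTRATION_DELTA = 0.05   # negative words raise it


def adjust_threshold(base_threshold: float, message: str) -> float:
    """Return the threshold after applying rapport and frustration effects.

    Assumption: each effect applies at most once per message.
    """
    # Tokenize into lowercase words, keeping hyphenated terms together.
    words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", message.lower()))
    threshold = base_threshold
    if words & POSITIVE_WORDS:
        threshold += RAPPORT_DELTA
    if words & NEGATIVE_WORDS:
        threshold += FRUSTRATION_DELTA
    return threshold


if __name__ == "__main__":
    print(adjust_threshold(0.6, "I understand, and I appreciate the offer."))
    print(adjust_threshold(0.6, "This is ridiculous. My terms are non-negotiable."))
```

Note the asymmetry: under these numbers a polite message helps twice as much as a hostile one hurts.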
## TRAINING ARCHITECTURE
- Qwen2.5-1.5B (base model) → LoRA (r=16) → GRPO Trainer (TRL + Unsloth) → Epoch 1 checkpoint (`./grpo_output/epoch_1`)
- GRPO Trainer → Reward Function (`EnvClient.sync()`) → OpenEnv Environment (5 experts)
- OpenEnv Environment → Curriculum Manager → weights for epoch 2
- Next training emphasizes weak experts
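
One way the curriculum step could turn per-expert results into sampling weights for the next epoch is sketched below. The source only states that weak experts are emphasized; the function name `curriculum_weights`, the gap-to-best formula, and the `floor` parameter are illustrative assumptions, not the project's CurriculumManager.

```python
def curriculum_weights(mean_reward_by_expert: dict[str, float],
                       floor: float = 0.1) -> dict[str, float]:
    """Map mean reward per expert to sampling weights that sum to 1.

    Experts where the agent scored lower get larger weights. `floor` keeps
    the strongest expert from dropping out of the training mix entirely.
    """
    best = max(mean_reward_by_expert.values())
    gaps = {name: (best - reward) + floor
            for name, reward in mean_reward_by_expert.items()}
    total = sum(gaps.values())
    return {name: gap / total for name, gap in gaps.items()}


if __name__ == "__main__":
    # Hypothetical epoch-1 results, for illustration only.
    epoch_1 = {"Sarah Chen": 0.6, "Marcus Rivera": -0.2, "Dr. Aisha Patel": 0.4}
    for name, weight in curriculum_weights(epoch_1).items():
        print(f"{name}: {weight:.2f}")
```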
## ENVIRONMENT FLOW (MCP Pattern)
1. The agent (LLM) sends a message such as "I propose $150K, 3% equity, 60 days".
2. `EnvClient.step()` on the client side (`client/negotiation_env.py`) converts the action to JSON and sends `HTTP POST /step`.
3. FastAPI `create_app()` is the server entry point (`server/app.py`).
4. `NegotiationEnvironment.step(action)`, an MCPEnvironment (`server/negotiation_environment.py`), handles 5 expert personas, preference drift (every 8 episodes), emotional dynamics, and deal-breaker logic.
5. `ExpertChallenger.respond(offer)` (`challenger.py`) applies hidden priorities, concession patterns, and style-based messaging.
6. The response is an Observation plus a computed reward (`server/models.py`) containing the turn, phase, expert message, current offer state, and done flag. It is returned as `HTTP 200` JSON.
7. `EnvClient.parse()` on the client side converts the response to a `StepResult`.
8. The agent receives:
   - Observation
   - Reward: -0.1 (ongoing) | +1.0 (deal) | -1.0 (no deal)
   - Done: True/False
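
The client-side half of this round trip (action to JSON, JSON to result) might look roughly like the following. It is a standalone sketch that never touches the network; the dataclass fields and the helper names `encode_action` and `parse_step_result` are assumptions and do not reproduce the real `EnvClient` API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Action:
    # Field names are hypothetical; the real schema lives in server/models.py.
    action_type: str
    salary: int
    equity_pct: float
    start_days: int


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


def encode_action(action: Action) -> str:
    """Serialize an action into the JSON body of a POST /step request."""
    return json.dumps(asdict(action))


def parse_step_result(body: str) -> StepResult:
    """Parse the JSON body of an HTTP 200 response into a StepResult."""
    data = json.loads(body)
    return StepResult(
        observation=data["observation"],
        reward=float(data["reward"]),
        done=bool(data["done"]),
    )


if __name__ == "__main__":
    request_body = encode_action(Action("propose", 150_000, 3.0, 60))
    print(request_body)
    # A made-up server reply for an ongoing negotiation.
    reply = '{"observation": {"turn": 1, "phase": "negotiating"}, "reward": -0.1, "done": false}'
    print(parse_step_result(reply))
```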
## REWARD SYSTEM (STANDALONE)
- `reward_format()` (format compliance): 1.0 if valid JSON with `action_type`, 0.5 if partial, 0.0 if invalid.
- `reward_negotiation()` (outcome-based): +1.0 deal above base, +0.5 at baseline, -1.0 no deal, -0.1 per turn, +0.2 if close early.
- `reward_deal_quality()` (Snorkel-weighted): +0.3 if weighted utility >= 0.5. Uses expert weights `salary_wt`, `equity_wt`, `start_wt`.
- `compute_reward()`: `rf*0.2 + rn*0.5 + rq*0.3` → total reward → GRPO Trainer.
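
The format and negotiation components could be sketched as below. The score values come from the list above, but several details are guesses and are marked as such: "partial" is taken to mean parseable JSON without a recognized `action_type`, and the per-turn penalty and early-close bonus are assumed to add linearly to the outcome score. The names `format_score` and `negotiation_score` are hypothetical.

```python
import json

VALID_ACTIONS = {"propose", "counter", "accept", "reject", "walk_away"}


def format_score(raw_output: str) -> float:
    """1.0 for a JSON object with a valid action_type, 0.5 if partial, 0.0 if invalid.

    Assumption: "partial" means a JSON object lacking a recognized action_type.
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    action = payload.get("action_type")
    if isinstance(action, str) and action in VALID_ACTIONS:
        return 1.0
    return 0.5


def negotiation_score(deal_closed: bool, above_baseline: bool,
                      turns: int, closed_early: bool) -> float:
    """Outcome score with a per-turn penalty and an early-close bonus.

    Assumption: the components simply add up; the source does not say how
    they combine or what counts as closing early.
    """
    if not deal_closed:
        outcome = -1.0
    elif above_baseline:
        outcome = 1.0
    else:
        outcome = 0.5
    bonus = 0.2 if (deal_closed and closed_early) else 0.0
    return outcome + bonus - 0.1 * turns


if __name__ == "__main__":
    print(format_score('{"action_type": "propose", "salary": 150000}'))
    print(format_score('{"salary": 150000}'))
    print(format_score("I would like more money"))
    print(round(negotiation_score(True, True, turns=2, closed_early=True), 2))
```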
## 5 EXPERT PERSONAS (SNORKEL AI)

| # | Expert | Role (personality) | Style | Weights (salary / equity / start) | Deal-breakers | Hidden priority | Opening |
|---|---|---|---|---|---|---|---|
| 1 | Sarah Chen | VP Engineering (Analytical) | Data-driven, benchmark references | 40% / 30% / 30% | $180K max, 4.0% equity cap | Fast start (not visible to agent) | "Market data suggests $X. What are your expectations?" |
| 2 | Marcus Rivera | CFO (Aggressive) | Firm, budget-focused, low equity | 70% / 10% / 20% | $150K max, 2.0% equity cap | Low cash burn | "Budget: $X. That's firm." |
| 3 | Dr. Aisha Patel | CTO (Collaborative) | Mission-driven, equity-heavy | 20% / 60% / 20% | $200K max, 5.0% equity cap | Equity alignment with company | "We're excited! Thinking $X/yr." |
| 4 | James O'Brien | HR Director (Bureaucratic) | Policy-bound, process-oriented | 20% / 20% / 60% | $160K max, 3.0% equity cap | Fill ASAP (urgent hiring) | "Per comp bands: $X/yr. Some flexibility within policy." |
| 5 | Elena Volkov | Founder/CEO (Visionary) | Mission-focused, inspirational | 30% / 40% / 30% | $170K max, 4.5% equity cap | Mission alignment and culture fit | "Before numbers - what excites you? We offer $X/yr." |

Every 8 episodes, the persona shifts (Preference Drift).
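
The deal-breaker caps and the 8-episode drift can be illustrated with a small sketch. The caps are copied from the persona descriptions; the `Persona` dataclass, the helper names, and the round-robin rotation order are assumptions, since the source only says that the persona shifts every 8 episodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    max_salary: int        # salary deal-breaker in dollars
    max_equity_pct: float  # equity deal-breaker in percent


# Caps copied from the persona descriptions.
PERSONAS = [
    Persona("Sarah Chen", 180_000, 4.0),
    Persona("Marcus Rivera", 150_000, 2.0),
    Persona("Dr. Aisha Patel", 200_000, 5.0),
    Persona("James O'Brien", 160_000, 3.0),
    Persona("Elena Volkov", 170_000, 4.5),
]

DRIFT_INTERVAL = 8  # episodes between persona shifts


def violates_deal_breaker(persona: Persona, salary: int, equity_pct: float) -> bool:
    """True if the offer exceeds either of the persona's hard caps."""
    return salary > persona.max_salary or equity_pct > persona.max_equity_pct


def persona_for_episode(episode: int) -> Persona:
    """Pick the active persona. Assumption: simple round-robin rotation."""
    return PERSONAS[(episode // DRIFT_INTERVAL) % len(PERSONAS)]


if __name__ == "__main__":
    for episode in (0, 7, 8, 16, 40):
        persona = persona_for_episode(episode)
        blocked = violates_deal_breaker(persona, 150_000, 3.0)
        print(episode, persona.name, "blocked" if blocked else "ok")
```

The same $150K, 3% offer is acceptable to some personas and a deal-breaker for others, which is why the agent has to infer who it is facing.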
| Resource | Range |
|---|---|
| Base Salary | $80K–$200K/yr |
| Equity | 0%–5% RSU |
| Start Date | 14–180 days |
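
A range check for offers against this table could be as small as the sketch below. The bounds come from the table; the dictionary keys and the function name `out_of_range_fields` are hypothetical.

```python
# Inclusive bounds copied from the resource table; key names are illustrative.
RANGES = {
    "salary": (80_000, 200_000),   # dollars per year
    "equity_pct": (0.0, 5.0),      # percent RSU
    "start_days": (14, 180),       # days until start
}


def out_of_range_fields(offer: dict) -> list[str]:
    """Return the names of offer fields that fall outside the allowed ranges."""
    return [field for field, (low, high) in RANGES.items()
            if not low <= offer[field] <= high]


if __name__ == "__main__":
    print(out_of_range_fields({"salary": 150_000, "equity_pct": 3.0, "start_days": 30}))
    print(out_of_range_fields({"salary": 250_000, "equity_pct": 3.0, "start_days": 7}))
```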
Unsloth + TRL GRPO on Qwen2.5-1.5B (4-bit) using Northflank H100. https://unsloth.ai/docs/get-startedunsloth-notebooks#grpo-reasoning-rl-notebooks
Yash Joshi, solo builder at OpenEnv Hackathon SF 2025





