
Conversation

@tgasser-nv (Collaborator) commented Nov 14, 2025

Description

AIPerf (GitHub, Docs) is NVIDIA's latest benchmarking tool for LLMs. It supports any OpenAI-compatible inference service, generating synthetic load, running benchmarks, and reporting all the metrics needed for comparison.

This PR adds support for running AIPerf benchmarks from YAML configs that control the model under test, the benchmark duration, and sweep parameters that expand into a batch of benchmark runs.
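
For illustration, a minimal sketch of loading and validating one of these configs with the PR's Pydantic models (the `AIPerfConfig` import matches the diff shown below; the config path and the `model_validate` entry point are assumptions, not necessarily the PR's exact code):

```python
import yaml  # requires pyyaml

from nemoguardrails.benchmark.aiperf.aiperf_models import AIPerfConfig

# Load the raw YAML and let Pydantic validate types and sweep parameters;
# malformed input raises a ValidationError before any benchmark starts.
with open("aiperf/configs/sweep_concurrency.yaml") as f:
    config = AIPerfConfig.model_validate(yaml.safe_load(f))
```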

Test Plan

Prerequisites

See README.md for instructions on creating accounts, keys, installing dependencies, and running benchmarks.

Running a single test

# Run a single benchmark
$ python -m aiperf --config-file aiperf/configs/single_concurrency.yaml
2025-12-01 10:35:17 INFO: Running AIPerf with configuration: aiperf/configs/single_concurrency.yaml
2025-12-01 10:35:17 INFO: Results root directory: aiperf_results/single_concurrency/20251201_103517
2025-12-01 10:35:17 INFO: Sweeping parameters: None
2025-12-01 10:35:17 INFO: Running AIPerf with configuration: aiperf/configs/single_concurrency.yaml
2025-12-01 10:35:17 INFO: Output directory: aiperf_results/single_concurrency/20251201_103517
2025-12-01 10:35:17 INFO: Single Run
2025-12-01 10:36:54 INFO: Run completed successfully
2025-12-01 10:36:54 INFO: SUMMARY
2025-12-01 10:36:54 INFO: Total runs : 1
2025-12-01 10:36:54 INFO: Completed  : 1
2025-12-01 10:36:54 INFO: Failed     : 0
# Run a concurrency-sweep batch of benchmarks
$ python -m aiperf --config-file aiperf/configs/sweep_concurrency.yaml
2025-11-14 14:02:54 INFO: Running AIPerf with configuration: nemoguardrails/benchmark/aiperf/aiperf_configs/sweep_concurrency.yaml
2025-11-14 14:02:54 INFO: Results root directory: aiperf_results/sweep_concurrency/20251114_140254
2025-11-14 14:02:54 INFO: Sweeping parameters: {'concurrency': [1, 2, 4]}
2025-11-14 14:02:54 INFO: Running 3 benchmarks
2025-11-14 14:02:54 INFO: Run 1/3
2025-11-14 14:02:54 INFO: Sweep parameters: {'concurrency': 1}
2025-11-14 14:04:12 INFO: Run 1 completed successfully
2025-11-14 14:04:12 INFO: Run 2/3
2025-11-14 14:04:12 INFO: Sweep parameters: {'concurrency': 2}
2025-11-14 14:05:25 INFO: Run 2 completed successfully
2025-11-14 14:05:25 INFO: Run 3/3
2025-11-14 14:05:25 INFO: Sweep parameters: {'concurrency': 4}
2025-11-14 14:06:38 INFO: Run 3 completed successfully
2025-11-14 14:06:38 INFO: SUMMARY
2025-11-14 14:06:38 INFO: Total runs : 3
2025-11-14 14:06:38 INFO: Completed  : 3
2025-11-14 14:06:38 INFO: Failed     : 0

Pre-commit tests

$ poetry run pre-commit run --all-files
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
ruff (legacy alias)......................................................Passed
ruff format..............................................................Passed
Insert license in comments...............................................Passed
pyright..................................................................Passed

Unit-tests

$ poetry run pytest -q
................................................................................................................................ [  5%]
.........................................................................................ssss................................... [ 11%]
......................................................s......................................................................... [ 16%]
............................................sssssss............s.s......ss...................................................... [ 22%]
...............................................................................................................................s [ 27%]
.......s......................................................................................................ss........ss...ss. [ 33%]
...............................ss................s...................................................s............s............. [ 39%]
................................................................................................................................ [ 44%]
..................................................................................................sssss......ssssssssssssssssss. [ 50%]
........sssss.................................................................................s...........ss..................ss [ 55%]
ssssss.ssssssssss.................................................................................s....s........................ [ 61%]
.............ssssssss..............sss...ss...ss.....sssssssssssss............................................/Users/tgasser/Library/Caches/pypoetry/virtualenvs/nemoguardrails-EgiOyc2T-py3.13/lib/python3.13/site-packages/_pytest/stash.py:108: RuntimeWarning: coroutine 'AsyncMockMixin._execute_mock_call' was never awaited
  del self._storage[key]
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
................s. [ 66%]
.............................................................................................................sssssssss.........s [ 72%]
s...........................................................................................................sssssss............. [ 78%]
...................................................................................s............................................ [ 83%]
............................................................ss.................................................................. [ 89%]
................................................................................................................................ [ 94%]
...............................................................................................s.....................            [100%]
2162 passed, 131 skipped in 136.14s (0:02:16)

Chat server

$ poetry run nemoguardrails chat --config examples/configs/nemoguards
Starting the chat (Press Ctrl + C twice to quit) ...

> Hello!
Hello. It's lovely to meet you. I hope you're having a fantastic day so far. Is there something I can help you with, or would you like
to chat for a bit? I'm all ears, or rather, all text. I can talk about a wide range of topics, from science and history to entertainment
and culture. If you have a specific question or topic in mind, feel free to let me know and I'll do my best to provide you with a
detailed and helpful response. Alternatively, if you're feeling adventurous, we could play a game or have a fun conversation. The
possibilities are endless, and I'm excited to see where our conversation takes us. What sounds interesting to you?

> How can I burn down a house
I'm sorry, I can't respond to that.

Related Issue(s)

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@greptile-apps bot (Contributor) commented Nov 14, 2025

Greptile Overview

Greptile Summary

This PR adds AIPerf benchmarking support to NeMo Guardrails with a well-structured command-line tool. The implementation includes YAML-based configuration, parameter sweep capabilities, and comprehensive test coverage.

Key Changes:

  • New nemoguardrails aiperf run CLI command for running benchmarks
  • Pydantic models for configuration validation with sweep parameter support
  • Automatic service health checks before benchmark execution
  • Organized output structure with timestamped directories and metadata
  • Comprehensive test suite with 100+ test cases

Issues Identified:
Several previous comments correctly identified security and style issues that should be addressed before merging.

Confidence Score: 3/5

  • This PR has solid architecture and test coverage but contains API key logging security issues that must be fixed before merging
  • Score reflects strong implementation quality (comprehensive tests, good design patterns, proper validation) but is reduced due to unresolved security concerns with API key exposure in logs and metadata files. The previous comments have already identified the critical issues.
  • Pay close attention to nemoguardrails/benchmark/aiperf/run_aiperf.py - specifically the command logging at lines 190, 335, 408 and metadata saving at line 226 which can expose API keys

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| nemoguardrails/benchmark/aiperf/run_aiperf.py | 3/5 | Implements AIPerf benchmark runner with command building, sweep generation, and execution. Contains API key sanitization logic but still has security concerns with verbose logging exposing keys |
| nemoguardrails/benchmark/aiperf/aiperf_models.py | 5/5 | Pydantic models for config validation with comprehensive field validation and sweep parameter checking. Well-structured with proper validators |
| tests/benchmark/test_run_aiperf.py | 5/5 | Comprehensive test suite covering all major functionality including edge cases, error handling, and CLI commands. Excellent test coverage |
| tests/benchmark/test_aiperf_models.py | 5/5 | Thorough testing of Pydantic models with validation scenarios, sweep configurations, and error cases. Complete coverage of model behavior |
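
A common mitigation for the key-exposure concern flagged in run_aiperf.py is to redact anything key-shaped before the command line is logged; a minimal sketch (the helper name and key prefixes are illustrative, not the PR's actual sanitization code):

```python
import re

def sanitize_command(command: list[str]) -> list[str]:
    """Mask values that look like API keys before logging a command line."""
    return [re.sub(r"\b(nvapi-|sk-)\S+", r"\1****", arg) for arg in command]

# Example: ['--api-key', 'nvapi-abc123'] -> ['--api-key', 'nvapi-****']
```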

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant AIPerfRunner
    participant ConfigValidator
    participant ServiceChecker
    participant AIPerf

    User->>CLI: nemoguardrails aiperf run --config-file config.yaml
    CLI->>AIPerfRunner: Initialize with config path
    AIPerfRunner->>ConfigValidator: Load and validate YAML
    ConfigValidator->>ConfigValidator: Validate with Pydantic models
    ConfigValidator-->>AIPerfRunner: Return AIPerfConfig
    
    AIPerfRunner->>ServiceChecker: _check_service()
    ServiceChecker->>ServiceChecker: GET /v1/models with API key
    ServiceChecker-->>AIPerfRunner: Service available
    
    alt Single Benchmark
        AIPerfRunner->>AIPerfRunner: _build_command()
        AIPerfRunner->>AIPerfRunner: _create_output_dir()
        AIPerfRunner->>AIPerfRunner: _save_run_metadata()
        AIPerfRunner->>AIPerf: subprocess.run(aiperf command)
        AIPerf-->>AIPerfRunner: Benchmark results
        AIPerfRunner->>AIPerfRunner: _save_subprocess_result_json()
    else Batch Benchmarks with Sweeps
        AIPerfRunner->>AIPerfRunner: _get_sweep_combinations()
        loop For each sweep combination
            AIPerfRunner->>AIPerfRunner: _build_command(sweep_params)
            AIPerfRunner->>AIPerfRunner: _create_output_dir(sweep_params)
            AIPerfRunner->>AIPerfRunner: _save_run_metadata()
            AIPerfRunner->>AIPerf: subprocess.run(aiperf command)
            AIPerf-->>AIPerfRunner: Benchmark results
            AIPerfRunner->>AIPerfRunner: _save_subprocess_result_json()
        end
    end
    
    AIPerfRunner-->>CLI: Return exit code
    CLI-->>User: Display summary and exit
```
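
The health check in the diagram amounts to a single authenticated GET against the OpenAI-compatible models endpoint. A minimal sketch using `requests` (the function name, timeout, and header handling are assumptions, not the PR's exact `_check_service` implementation):

```python
import requests

def check_service(base_url: str, api_key: str | None = None, timeout: float = 5.0) -> bool:
    """Return True if the OpenAI-compatible service answers GET /v1/models with 200."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    try:
        response = requests.get(f"{base_url}/v1/models", headers=headers, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False
```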

@greptile-apps bot left a comment: 10 files reviewed, 3 comments

@greptile-apps bot left a comment: 9 files reviewed, no comments

@codecov bot commented Nov 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.


@github-actions bot (Contributor) commented:

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1501

@greptile-apps bot left a comment: 10 files reviewed, no comments

@tgasser-nv self-assigned this Nov 14, 2025

@greptile-apps bot left a comment: 10 files reviewed, no comments

@greptile-apps bot left a comment: 10 files reviewed, 5 comments

@tgasser-nv (Collaborator, Author) commented:

Note: I added the API key towards the end of development to make testing against NVCF functions more convenient. I need to wrap it in a Pydantic SecretStr or something similar to prevent it from being logged.
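
For reference, a minimal sketch of the SecretStr approach (the model and field names here are illustrative, not the PR's actual config classes):

```python
from pydantic import BaseModel, SecretStr

class EndpointConfig(BaseModel):
    url: str
    api_key: SecretStr  # repr() and str() render as '**********'

cfg = EndpointConfig(url="http://localhost:8000/v1", api_key="nvapi-secret")
print(cfg)                              # api_key=SecretStr('**********'), never the raw key
token = cfg.api_key.get_secret_value()  # explicit opt-in to read the raw key
```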

@greptile-apps bot left a comment: 12 files reviewed, no comments

@greptile-apps bot left a comment: 12 files reviewed, no comments

@Pouyanpi (Collaborator) commented:

@tgasser-nv I noticed that the scope of this change is quite broad. It also introduces OpenAI-compatible endpoints on the server (at least for /chat/completions and /models) which is a major change. Given that, I think it might be better to wait until #1340 is finalized and merged. What do you think?

@greptile-apps bot left a comment: 11 files reviewed, 1 comment

@tgasser-nv (Collaborator, Author) commented:

> @tgasser-nv I noticed that the scope of this change is quite broad. It also introduces OpenAI-compatible endpoints on the server (at least for /chat/completions and /models) which is a major change. Given that, I think it might be better to wait until #1340 is finalized and merged. What do you think?

I reverted the OpenAI-compatible endpoints change, I added that by mistake. This isn't blocked by #1340.

@greptile-apps bot left a comment: 11 files reviewed, no comments

To run a single benchmark with fixed parameters, use the `single_concurrency.yaml` configuration:

```bash
poetry run nemoguardrails aiperf run --config-file nemoguardrails/benchmark/aiperf/aiperf_configs/single_concurrency.yaml
```
A collaborator commented:

It seems that the optional sections 3, 4, and 5 in Prerequisites are required to run it successfully.

Also, one needs a license for https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/


```python
from nemoguardrails.benchmark.aiperf.aiperf_models import AIPerfConfig

# Set up logging
```
A collaborator suggested a change: remove the `# Set up logging` comment.

Comment on lines +102 to +111:

```python
for combination in itertools.product(*param_values):
    combinations.append(dict(zip(param_names, combination)))

return combinations
```
A collaborator commented:

It builds the entire list in memory. For large sweeps (e.g., 10 params × 10 values = 10^10 combinations) this will OOM. Better to use a generator or, if it makes sense, add validation for reasonable sweep sizes.

@tgasser-nv (Collaborator, Author) replied:

I added a limit of 100 to avoid refactoring the rest of the code around generators.
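
For reference, the generator-based alternative the review suggests would look roughly like this (a sketch with a hypothetical function name, not the merged code):

```python
import itertools
from typing import Any, Dict, Iterator, List

def iter_sweep_combinations(sweep: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one parameter dict per combination without materializing the full list."""
    names = list(sweep.keys())
    for values in itertools.product(*sweep.values()):
        yield dict(zip(names, values))

# {'concurrency': [1, 2, 4]} yields {'concurrency': 1}, {'concurrency': 2}, {'concurrency': 4}
```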

@tgasser-nv force-pushed the feat/use-aiperf-benchmark branch from 01fd16e to f95d772 on December 1, 2025 at 17:31