Conversation

@carlschader-saronic (Contributor) commented Aug 31, 2025

Description

Added support for an arbitrary number of dynamic axes at any position in a tensor. Dynamic dimensions are no longer restricted to the first (batch) dimension. A quick illustration follows.
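For example (illustrative values only, using the -1 convention from TensorInfo below), an engine input can now expose several dynamic dimensions at once:

```rust
fn main() {
    // -1 marks a dynamic dimension, matching the TensorInfo convention
    // used elsewhere in this PR; the values here are illustrative only.
    let shape: Vec<i64> = vec![-1, 3, -1, -1]; // [batch, channels, height, width]
    println!("{shape:?}");
}
```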

Testing

All the examples run.
Also integrated into a Saronic downstream program, where it works as expected.

Notes

Finished adding support for dynamic axes in libinfer, on any dimension and in any position. Some benchmark comparisons:

libinfer 0.0.4 DETR benchmark.rs

2025-08-31T21:32:50.467814Z  INFO benchmark: inference calls    : 4096
2025-08-31T21:32:50.467820Z  INFO benchmark: total latency      : 54.53879
2025-08-31T21:32:50.467823Z  INFO benchmark: avg. frame latency : 0.0066575673
2025-08-31T21:32:50.467825Z  INFO benchmark: avg. frame fps     : 150.20502
2025-08-31T21:32:50.467826Z  INFO benchmark: avg. batch latency : 0.013315135
2025-08-31T21:32:50.467828Z  INFO benchmark: avg. batch fps     : 75.10251

libinfer 0.0.5 (dynamic axes of death) DETR benchmark.rs

2025-08-31T21:29:09.053947Z  INFO benchmark: inference calls    : 4096
2025-08-31T21:29:09.053951Z  INFO benchmark: total latency      : 30.404646
2025-08-31T21:29:09.053954Z  INFO benchmark: avg. batch latency : 0.0074230093
2025-08-31T21:29:09.053955Z  INFO benchmark: avg. batch fps     : 134.71625

libinfer 0.0.4 yolov8

2025-08-31T21:37:40.372793Z  INFO benchmark: inference calls    : 4096
2025-08-31T21:37:40.372798Z  INFO benchmark: total latency      : 50.759754
2025-08-31T21:37:40.372800Z  INFO benchmark: avg. frame latency : 0.012392518
2025-08-31T21:37:40.372802Z  INFO benchmark: avg. frame fps     : 80.69385
2025-08-31T21:37:40.372803Z  INFO benchmark: avg. batch latency : 0.012392518
2025-08-31T21:37:40.372805Z  INFO benchmark: avg. batch fps     : 80.69385

libinfer 0.0.5 yolov8

2025-08-31T21:39:45.790839Z  INFO benchmark: inference calls    : 4096
2025-08-31T21:39:45.790845Z  INFO benchmark: total latency      : 59.263123
2025-08-31T21:39:45.790847Z  INFO benchmark: avg. batch latency : 0.014468536
2025-08-31T21:39:45.790849Z  INFO benchmark: avg. batch fps     : 69.11549

I found a major optimization bug where we were prematurely synchronizing the CUDA stream; I introduced this in 0.0.4. Removing it gives a pretty massive performance improvement on larger models. Strangely, I am getting better performance on the new tracker-trained DETR model than on yolov8. The DETR model is quite a bit larger and has two transformers, so I am surprised. Not complaining though: this is nearly a 2x performance improvement.

We are still I/O-bound on f32 output tensors. I'll save that for 0.0.6.
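For illustration, a minimal sketch of the synchronization fix, using hypothetical `Stream`/`Context` wrappers (none of these names are libinfer's actual API): queue all of the work on the stream and wait exactly once, when outputs are read, instead of stalling the CPU mid-pipeline.

```rust
// Hypothetical wrappers for illustration; not libinfer's actual API.
struct Stream;
impl Stream {
    /// Blocks the CPU until all work queued on this stream has finished.
    fn synchronize(&self) {}
}

struct Context;
impl Context {
    fn copy_inputs_async(&mut self, _s: &Stream) {}  // H2D copy, queued
    fn enqueue_inference(&mut self, _s: &Stream) {}  // kernel launch, queued
    fn copy_outputs_async(&mut self, _s: &Stream) {} // D2H copy, queued

    fn infer(&mut self, stream: &Stream) {
        self.copy_inputs_async(stream);
        // 0.0.4 bug (roughly): a synchronize() at this point stalled the CPU
        // before the inference work was even queued, serializing every step.
        self.enqueue_inference(stream);
        self.copy_outputs_async(stream);
        stream.synchronize(); // wait exactly once, when outputs are needed
    }
}

fn main() {
    let mut ctx = Context;
    ctx.infer(&Stream);
}
```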

@freeman94 changed the title from "Version 0.0.5 Better dynamic axes" to "feat: better dynamic axes" on Sep 3, 2025
@freeman94 (Collaborator) left a comment


Mostly notes for future PRs

Comment on lines +22 to +24
pkgs-unstable = import nixpkgs-unstable {
inherit system;
config.allowUnfree = true;

Prefer to keep this on a stable nixpkgs release.

struct TensorInfo {
name: String,
dims: Vec<u32>,
shape: Vec<i64>, // -1 for dynamic dimensions

Would be nice if we could convert this into an enum at some point in the future, rather than having to check for -1 and then interpret the min/max/opt fields.
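One possible shape for that enum (a sketch only; `Dim` and its variants are assumptions, not existing libinfer types):

```rust
// Sketch: replace the -1 sentinel with an explicit enum (hypothetical names).
enum Dim {
    /// Statically known extent.
    Fixed(i64),
    /// Dynamic extent, carrying the optimization-profile bounds with it.
    Dynamic { min: i64, opt: i64, max: i64 },
}

struct TensorInfo {
    name: String,
    dims: Vec<Dim>, // no sentinel to check; bounds travel with each dimension
}
```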

name: String,
data: Vec<u8>,
shape: Vec<i64>, // this should always be positive, just i64 for convenience
dtype: TensorDataType,

This could also be genericized by dtype so that the data Vec is appropriately cast without the user having to do so.
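A rough sketch of what that could look like (the `Element` trait and the dtype variants shown are assumptions for illustration, not libinfer's current types):

```rust
// Sketch: a dtype-generic tensor so callers never cast raw bytes themselves.
// TensorDataType variants below are placeholders for whatever libinfer defines.
#[derive(Clone, Copy, PartialEq)]
enum TensorDataType {
    F32,
    I64,
}

/// Maps a Rust element type to its wire-level dtype tag.
trait Element: Copy {
    const DTYPE: TensorDataType;
}

impl Element for f32 { const DTYPE: TensorDataType = TensorDataType::F32; }
impl Element for i64 { const DTYPE: TensorDataType = TensorDataType::I64; }

struct Tensor<T: Element> {
    name: String,
    data: Vec<T>,    // typed storage; no manual Vec<u8> reinterpretation
    shape: Vec<i64>, // always positive by the time a tensor is materialized
}
```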


// ASSUMPTION: we always use optimization profile 0
// set the optimization profile to 0 so we can query output shapes after setting input shapes
mContext->setOptimizationProfileAsync(0, stream);

Probably want to make this configurable in the future.
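If that happens, it could be as small as a field on the engine options (hypothetical struct; 0 preserves today's hard-coded behavior):

```rust
// Sketch: surface the optimization profile through engine options
// (hypothetical names; not libinfer's current API).
pub struct EngineOptions {
    /// Index of the TensorRT optimization profile to activate.
    pub optimization_profile: i32,
}

impl Default for EngineOptions {
    fn default() -> Self {
        Self { optimization_profile: 0 }
    }
}
```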
