
Conversation

dvrogozh
Contributor

@dvrogozh dvrogozh commented Aug 26, 2025

Changes:

  1. Added a DeviceInterface::initializeFiltersContext() API which returns device-specific FilterGraph initialization settings
  2. Enabled filter graphs at the SingleStreamDecoder level
  3. Switched the CPU device interface to use the above changes
  4. Enabled filter graphs in the CUDA device interface to handle non-NV12 decoder output (10/12-bit videos) through scale_cuda
    • The filter graph converts 10/12-bit videos to NV12 (as scale_cuda does not currently support YUV-to-RGB conversion), and then the CUDA device interface converts NV12 to RGB (via the existing NPP path)

The basic idea behind this change is the following. Let the device interface perform trivial, performance-optimized conversions in the convertAVFrameToFrameOutput() method. If the device interface cannot handle a conversion in convertAVFrameToFrameOutput(), it can set up an FFmpeg filters pipeline by returning a valid description from initializeFiltersContext().

I tested the pipeline on 10-bit videos, h264 and h265. However, this setup should also be valid for 12-bit videos, which I did not try.

Note:

  1. On ffmpeg-n4.4 the filters pipeline falls back to CPU, as scale_cuda does not support format conversion in this ffmpeg version (it was added in n5.0)
  2. More work might be needed to align the scale_cuda converted outputs with the CPU implementation. I do see differences in the outputs, likely due to different handling of color standards. This is something I have not been able to overcome at the moment.

CC: @scotts @NicolasHug @eromomon

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 26, 2025
@dvrogozh
Contributor Author

I forgot to mention that I have not yet changed the CPU device interface in a way similar to CUDA (initializeFiltersContext() is not implemented for CPU). This would be a next step if this design is accepted.

@dvrogozh
Contributor Author

I've added one more commit to update the CPU device interface to use the filters path introduced in SingleStreamDecoder and DeviceInterface. Still a draft, as #831 needs to be reviewed first.

@dvrogozh dvrogozh marked this pull request as ready for review September 2, 2025 19:18
@dvrogozh
Contributor Author

dvrogozh commented Sep 2, 2025

I've rebased the PR on top of the recently landed #831 prerequisite. The change is ready for review. @scotts @NicolasHug

@dvrogozh
Contributor Author

dvrogozh commented Sep 4, 2025

I pushed an update to fix a linter issue which I had overlooked.

Error: Unable to download artifact(s): Artifact not found for name: pytorch_torchcodec__3.10_cu129_x86_64

This error in the CI test seems unrelated to the change.

@dvrogozh
Contributor Author

dvrogozh commented Sep 5, 2025

Pushed an update to address another lint issue in the last commit.

@dvrogozh
Contributor Author

dvrogozh commented Sep 5, 2025

Rebased on top of:

Contributor


I'm not sure what we're enabling by exposing DeviceInterface::initializeFiltersContext() and then calling it here in SingleStreamDecoder. That is, if DeviceInterface just didn't expose the concept of filter graphs at all, each device implementation could still decide to use a filter graph on its own.

We already have the ability to ask a device to do device-specific frame conversions with DeviceInterface::convertAVFrameToFrameOutput(). Pulling the filter graph capabilities out of that, and then forcing them to be orchestrated at the level of SingleStreamDecoder, does not, as far as I can tell, buy us any new capabilities. But it does come at a cost, because CpuDeviceInterface::convertAVFrameToFrameOutput() is now (to me) quite strange: it does swscale-based conversion and filter graph cleanup.

The logic in this PR is effectively:

if (stream.kind == AUDIO) {
    convertAudio(inFrame, outFrame);
}
else if (stream.kind == VIDEO && device != nullptr) {
    filterGraph = device->getFilterGraph();
    if (filterGraph != nullptr) {
        inFrame = filterGraph->convert(inFrame);
    }
    device->convertVideo(inFrame, outFrame);
}
return outFrame;

I don't see how the above buys us anything over:

if (stream.kind == AUDIO) {
    convertAudio(inFrame, outFrame);
}
else if (stream.kind == VIDEO && device != nullptr) {
     // device is free to use filter graph if it wants; it's
     // entirely an implementation detail for the device
     device->convertVideo(inFrame, outFrame);
}
return outFrame;

I do understand that in the future, we'll likely want to tell a device "Hey, here are the particular filters we want you to apply." But I don't think these interfaces actually buy us that capability. I think we can just as easily do that through the VideoStreamOptions we currently pass to DeviceInterface::convertAVFrameToFrameOutput().

Contributor Author


This PR moves ownership and management of the filter graph to SingleStreamDecoder. Otherwise this would be done in each device interface separately, while doing it in SingleStreamDecoder makes the behavior uniform and shared across all devices. There are further considerations around that which I highlight in #853 (comment).

@scotts
Contributor

scotts commented Sep 6, 2025

@dvrogozh, thank you for this work! I left a detailed comment where I'm trying to understand the value we're getting out of the new abstractions - please help me understand that better. At a higher level, I wonder if we could take the following approach:

  1. DeviceInterface is left unchanged; we don't add the concept of filter graph initialization to it.
  2. SingleStreamDecoder is left unchanged.
  3. CpuDeviceInterface is left unchanged.
  4. CudaDeviceInterface is changed to add the new Cuda-specific filter graph implementation.

I'm sorry if that's the direction you were already going in - I know I pushed to see what the full generalization would look like. One quirk of the approach above is that each device may end up implementing the same patterns for their filter graph implementation. In that case, we could potentially add some member functions to DeviceInterface, but they would be private. They wouldn't be called by SingleStreamDecoder, but just guides for how best to implement filter graph support for a device.

@dvrogozh
Contributor Author

dvrogozh commented Sep 8, 2025

@scotts : you have highlighted the alternate design approach to take. I think we have 2 options on the plate:

  1. Keep decoder output conversions (filter graphs, sws calls, npp calls, etc.) within the device interface classes
  2. Define output conversions separately from the device interface classes. In this case the device interface's convertAVFrameToFrameOutput() becomes responsible only for a trivial operation: wrapping an AVFrame into a Tensor without any conversions.

I am proposing to consider the 2nd direction, which gives the project a more modular structure and simplifies the device interface to a set of trivial operations. I am trying to reduce the complexity of writing a device interface for non-CUDA GPUs by reusing as much as possible.

In the sense of the 2nd direction, this PR is the first step. As you saw, it moves ownership and management of the FilterGraph out of the device interface (to SingleStreamDecoder). If we take this direction further, I suggest considering:

  1. Abstracting FilterGraph to allow multiple implementations, i.e. understanding under "filter graph" any conversion, not only via FFmpeg filters, but also via sws, NPP or other libraries.
  2. Implementing SwsGraph and NppGraph, i.e. moving these out of CpuDeviceInterface and CudaDeviceInterface
  3. The above steps leave the device interface supporting only the trivial wrapping/copying of an AVFrame into a Tensor, which is a backend-specific operation

Please share your thoughts on the above. If you want, we can start by going with the 1st direction (keep conversions in device interfaces) and continue weighing the 2nd approach. I think enabling encoders might contribute to its justification, as we will again consider filter operations, this time before encoding. If we decide to go with the 1st approach for now, then I will submit another PR focused on enabling 10-bit support in the CUDA device interface.

@scotts
Contributor

scotts commented Sep 12, 2025

@dvrogozh, thanks for explaining the direction you're going in! Based on where we are, I still prefer the first option: keeping all decoder output conversions within the device implementation.

I do think that in the future, we'll want better abstractions around filtergraph, swscale and NPP. That is, these are all ways to convert a raw decoded frame into our desired color space and potentially perform some native transforms. But exactly what abstractions we should build is not obvious to me right now. In particular, this PR leaves us in an in-between state where some of the logic remains in the device implementation, and some of it is outside.

By pursuing option 1, I think we'll eventually get to a state where we have three device implementations: CPU and CUDA in this repo, and XPU in an Intel repo. At that point, it will be more clear to us what the patterns are and what abstractions to build.

For: pytorch#776

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
@dvrogozh dvrogozh changed the title Use cuda filters to support 10-bit videos Abstract filters support Sep 12, 2025
@dvrogozh
Contributor Author

By pursuing option 1, I think we'll eventually get to a state...

@scotts : I have submitted a PR to follow option 1. Please help to review:

I will probably keep #853 open and rebase it on top of #899 to continue working on option 2.

@dvrogozh dvrogozh marked this pull request as draft September 12, 2025 23:49