Conversation
```cpp
// Heuristic thresholds: only pack "small" tensors.
constexpr int64_t kPackMaxBytesPerTensor = 256 * 1024; // 256KB
```
Do we need to limit the total packed size as well as the per-tensor size?
For example, a 32MB limit per packed buffer.
Good point.
For the per-tensor size, the threshold focuses the packing on small tensors, which benefit the most from it while adding little copying overhead on the CPU.
A very large total size could lead to problems such as allocation failures or higher allocation overhead. I adjusted the implementation so that multiple chunks are allocated if needed (32MB each by default, configurable); if more than 32MB is needed, multiple chunks are used.
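The chunking scheme described above can be sketched as follows. This is a pure-Python illustration only: the chunk size, the greedy first-fit policy, and all names are assumptions for the sketch, not the actual implementation.

```python
# Illustrative sketch (not the actual implementation): greedily assign
# tensor byte sizes to fixed-size staging chunks, opening a new chunk
# whenever the current one would overflow.

CHUNK_BYTES = 32 * 1024 * 1024  # assumed 32MB default, configurable

def plan_chunks(tensor_sizes, chunk_bytes=CHUNK_BYTES):
    """Return a list of chunks; each chunk is a list of (tensor_index, offset)."""
    chunks, current, used = [], [], 0
    for i, size in enumerate(tensor_sizes):
        if size > chunk_bytes:
            raise ValueError("tensor larger than one chunk")
        if used + size > chunk_bytes:  # current chunk would overflow
            chunks.append(current)
            current, used = [], 0
        current.append((i, used))  # record the offset within the chunk
        used += size
    if current:
        chunks.append(current)
    return chunks

# Three 16MB tensors do not fit into one 32MB chunk, so two chunks are used.
plan = plan_chunks([16 * 1024 * 1024] * 3)
print(len(plan))  # -> 2
```

Each chunk can then be allocated and copied independently, which keeps individual allocations bounded regardless of the total packed size.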
```rst
Results
-------

.. list-table:: Runtime and Speedup (mean +/- std over 10 runs)
```
can we add a performance compare with pytorch nested tensor?
The use-cases are different: With the copier, we can have more general input structures (e.g. dicts/lists/tuples containing tensors (typical for meta-data); individual inputs can have different dtypes and be on different devices).
However, the underlying implementation has similarities: a nested tensor also uses a single memory buffer. I made a small evaluation comparing the copy runtime against both an already created nested tensor and against copying by first creating a nested tensor from a list and then copying it (without splitting it back into a list). The results (copying 500 tensors with 32-1024 entries each; pinned memory is used when creating the nested tensor):
multi_tensor_copier: 0.388 ms
nested tensor (from list): 1.071 ms
nested tensor (pre-built): 0.158 ms
So, if lists are used, the multi_tensor_copier copy is faster, but using a pre-built nested tensor directly is faster still.
I would say that this is expected, as a nested tensor is already in a format similar to what the copier uses internally (and has to convert to and from).
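To make the comparison concrete, here is a hedged pure-Python sketch of the shared underlying idea: many small buffers are packed into one contiguous buffer, copied once, and split back by recorded offsets. All names and the bytes-based mechanics are invented for illustration; the real copier operates on device tensors, not Python bytes.

```python
# Illustrative sketch of the pack / copy-once / split idea behind the
# copier (and behind nested tensors). Names are hypothetical.

def pack(buffers):
    """Concatenate buffers into one contiguous buffer, recording offsets."""
    offsets, total = [], 0
    for b in buffers:
        offsets.append((total, len(b)))
        total += len(b)
    packed = bytearray(total)
    for b, (off, n) in zip(buffers, offsets):
        packed[off:off + n] = b  # stage each piece into the big buffer
    return bytes(packed), offsets

def unpack(packed, offsets):
    """Split a packed buffer back into the original pieces."""
    return [packed[off:off + n] for off, n in offsets]

srcs = [b"meta", b"data", b"tensors"]
packed, offs = pack(srcs)
# A single bulk copy of `packed` now replaces three small copies.
assert unpack(packed, offs) == srcs
```

A pre-built nested tensor skips the pack/unpack steps entirely, which is why it comes out fastest in the measurements above.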
Signed-off-by: Roman Schaffert <rschaffert@nvidia.com>
Thank you for the insightful comments @xupinjie! I prepared a new version. Apart from the changes related to your comments, I also reworked how streams are handled. Previously, some copy directions were not synchronized properly, and the way multiple streams were used was not meaningful (as only copy operations are involved).

That is great! Merged.
Description
Added the Multi-Tensor Copier functionality, along with the corresponding documentation, an example, and a simple evaluation.
Type of Change
Please select (at least one):
Testing
Checklist for testing:
scripts/run_tests.sh
Documentation, Examples, Tutorials, Demos
Checklist for documentation:
Code Quality
Checklist for dependencies:
pyproject.toml if/as needed
Related Issues / Context
If applicable, link related issues, discussions etc.
DCO / Sign-Off
Please refer to the section on Signing Your Work & Developer Certificate of Origin (DCO)
in the Contribution Guide before submitting your contribution.
References
For additional details, please refer to the Contribution Guide.
The following guides are available (referenced in the Contribution Guide for further details):
docs/guides/CONTRIBUTION_GUIDE.md
docs/guides/DEVELOPMENT_GUIDE.md
docs/guides/DOCUMENTATION_SETUP_GUIDE.md
docs/guides/FORMATTING_GUIDE.md
Please also refer to the summary checklist in the Contribution Guide,
which is a guideline for what to consider when submitting your contribution and covers the same topics as the checklists above.