-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Open
Labels
Description
Depthwise Conv2d throughput degrades significantly when spatial dimensions are not divisible by 16. This surfaces in multi-stage vision encoders where repeated 2x downsampling produces non-mod-16 intermediate feature maps.
Repro
Results
| Channels | Mod-16 res | Other res | Mod-16 ms | Other ms | Gap | Other mod 16 |
|---|---|---|---|---|---|---|
| 96 | 256x256 | 240x240 | 0.52 | 0.50 | -5% | 0 |
| 192 | 128x128 | 120x120 | 0.47 | 0.45 | -5% | 8 |
| 384 | 64x64 | 60x60 | 0.43 | 0.83 | +94% | 12 |
| 768 | 32x32 | 30x30 | 0.37 | 0.55 | +47% | 14 |
| 1536 | 16x16 | 15x15 | 0.37 | 0.47 | +27% | 15 |
When the non-mod-16 dimension is 240 (still divisible by 16), there is no penalty. The gap appears at 120 and below, where the remainder is nonzero, and compounds across stages to produce ~53% end-to-end throughput loss through a 5-stage encoder.
Expected behavior
Depthwise Conv2d throughput should scale proportionally with pixel count regardless of whether spatial dimensions are divisible by 16.
Reactions are currently unavailable