Latency is slower than TensorRT FP16

Hi, I tried to replicate your speed experiment. I tested deit_tiny with batch size 1 on an RTX 3090. After a few days of autotuning, it is still slower than TensorRT FP16.

Here are the results of my experiment:

<img width="259" alt="image" src="https://github.com/zkkli/I-ViT/assets/34906782/3a27098d-cba8-4ba2-9140-de0a724b1107">
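
In case the measurement method matters for the comparison, below is a minimal sketch of the kind of timing loop this applies to: discard some warm-up runs, then take the median over repeated runs. The names `measure_latency_ms` and `dummy_inference` are hypothetical placeholders, not the actual I-ViT, TVM, or TensorRT calls. On a GPU, `run_once` would also need to wait for the device to finish before the timer is read.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=10, iters=100):
    """Return the median wall-clock latency of run_once() in milliseconds."""
    # Warm-up runs are discarded so one-time costs (caching, lazy
    # initialization) do not skew the measurement.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        # Assumption: run_once blocks until the work is done. A real GPU
        # call must synchronize with the device before returning.
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional outliers than the mean.
    return statistics.median(samples)


def dummy_inference():
    # Hypothetical stand-in for a batch-size-1 forward pass.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"median latency: {measure_latency_ms(dummy_inference):.3f} ms")
```

Using the same warm-up count, iteration count, and statistic for both backends keeps the two numbers comparable.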