Hi,
I am attempting to reproduce the results for the Llama-3.1-8B-Instruct model by following the steps in the README. Everything is set up inside your Docker environment, and I am using vLLM for inference, running on a single H100 GPU with a batch size of 8, as specified in the example scripts.
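For reference, here is roughly how my run instantiates the engine. This is a minimal sketch rather than the exact repo script, and everything beyond the model name, the single GPU, and the batch size of 8 (e.g. `max_model_len`, `gpu_memory_utilization`, the sampling settings) is my own assumption:

```python
from vllm import LLM, SamplingParams

# Rough equivalent of my single-GPU setup; values marked "assumed" are my
# guesses, not copied from the example scripts.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,          # 128k context (assumed exact value)
    max_num_seqs=8,                # batch size 8, as in the example scripts
    gpu_memory_utilization=0.95,   # assumed
    tensor_parallel_size=1,        # single H100
)

params = SamplingParams(temperature=0.0, max_tokens=128)  # assumed decoding settings

# Prompts come from the synthetic-task inputs; placeholder shown here.
outputs = llm.generate(["<128k-token synthetic prompt>"], params)
print(outputs[0].outputs[0].text)
```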
With this configuration, processing the 128k context length (synthetic task) takes approximately 2 days. Is this runtime expected? If not, could you please share the configuration or optimizations you used to handle this context length efficiently?
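To make the second question concrete, these are the kinds of standard vLLM engine arguments I would experiment with on my side. All of the values below are my own guesses, not settings taken from your scripts, so knowing which of them (if any) you relied on would already help a lot:

```python
from vllm import LLM

# Hypothetical variations I could try; none of these are taken from the repo.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=131072,
    enable_chunked_prefill=True,    # chunk the very long prefill instead of one huge pass
    max_num_batched_tokens=8192,    # assumed per-step token budget when chunking
    gpu_memory_utilization=0.95,    # assumed
    tensor_parallel_size=1,         # or >1 if you sharded the model across GPUs
)
```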