talk-fast

Optimizing LLM inference for real-time customer support using compiler and runtime techniques. This project profiles inference bottlenecks in open-source LLMs (Phi-2, Mistral), applies torch.compile and quantization strategies, and demonstrates latency and memory improvements for conversational AI Co-Pilots.
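
Since the description mentions applying torch.compile and quantization to models like Phi-2 and Mistral, here is a minimal sketch of that kind of inference pipeline: load a checkpoint, compile the decoder's forward pass with torch.compile, and time generation before and after warm-up. This assumes the Hugging Face transformers library and a CUDA GPU; the checkpoint name, prompt, and generation settings are illustrative and not taken from this repository.

```python
# Minimal sketch: torch.compile for LLM generation latency measurement.
# Assumes transformers + a CUDA GPU; model name and settings are illustrative.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed checkpoint; Mistral models follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in half precision and compile the forward pass so that generate()
# routes every decoding step through the compiled graph.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()
model.forward = torch.compile(model.forward)

prompt = "Customer: My order hasn't arrived yet. Agent:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# The first call pays the compilation cost; the second (warm) call reflects
# steady-state latency, which is what matters for a real-time co-pilot.
for label in ("cold", "warm"):
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id
        )
    torch.cuda.synchronize()
    print(f"{label} latency: {time.perf_counter() - start:.2f}s")

print(tokenizer.decode(output[0], skip_special_tokens=True))

# A possible 4-bit quantization variant (assumes bitsandbytes is installed):
# from transformers import BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
```

The repository's own scripts may use different backends or quantization schemes; the commented 4-bit variant is only one common way the "quantization strategies" in the description could be realized.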
