Applied Research Engineer working on large-scale LLM inference and ML systems.
My work bridges research ideas and production systems, with an emphasis on GPU-level performance optimization and inference-time techniques. Areas of interest:
- LLM inference and serving systems
- GPU performance optimization (Triton / CUDA)
- Quantization and speculative decoding
- KV-cache optimization and batching strategies
- Production GenAI infrastructure
I occasionally write about GPU architecture, inference optimization, and ML systems.