I work on LLM inference at the engine and runtime level, focusing on performance, memory efficiency, and predictable behavior in production environments.
My experience covers optimizing inference across CPU and GPU backends, with hands-on work in CUDA, cuBLAS, and cuBLASLt, including custom kernels for transformer workloads. That work centers on practical improvements: quantization-aware execution, efficient KV-cache management, memory allocation strategies, and execution paths tuned to specific model architectures and hardware constraints.
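
To make the KV-cache side of that concrete, here is a minimal sketch of a paged block allocator of the kind such runtimes use; the class name, block size, and layout are illustrative assumptions, not the interface of any particular engine.

```cpp
// Minimal sketch of a paged KV-cache block allocator (hypothetical names,
// simplified layout). Each sequence maps logical token positions to
// fixed-size physical blocks, so memory grows in block-sized steps instead
// of requiring one contiguous allocation per sequence.
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr uint32_t kBlockTokens = 16;  // tokens per KV block (assumed)

class KvBlockAllocator {
 public:
  explicit KvBlockAllocator(uint32_t num_blocks) {
    free_blocks_.reserve(num_blocks);
    for (uint32_t b = 0; b < num_blocks; ++b) free_blocks_.push_back(b);
  }

  // Ensure the sequence has enough blocks to hold `token_count` tokens.
  void reserve(uint64_t seq_id, uint32_t token_count) {
    auto& table = block_tables_[seq_id];
    uint32_t needed = (token_count + kBlockTokens - 1) / kBlockTokens;
    while (table.size() < needed) {
      if (free_blocks_.empty()) throw std::runtime_error("KV cache exhausted");
      table.push_back(free_blocks_.back());
      free_blocks_.pop_back();
    }
  }

  // Physical block and in-block offset for a logical token position.
  std::pair<uint32_t, uint32_t> locate(uint64_t seq_id, uint32_t pos) const {
    const auto& table = block_tables_.at(seq_id);
    return {table.at(pos / kBlockTokens), pos % kBlockTokens};
  }

  // Return all blocks of a finished sequence to the free pool.
  void release(uint64_t seq_id) {
    auto it = block_tables_.find(seq_id);
    if (it == block_tables_.end()) return;
    for (uint32_t b : it->second) free_blocks_.push_back(b);
    block_tables_.erase(it);
  }

 private:
  std::vector<uint32_t> free_blocks_;
  std::unordered_map<uint64_t, std::vector<uint32_t>> block_tables_;
};

int main() {
  KvBlockAllocator alloc(/*num_blocks=*/256);
  alloc.reserve(/*seq_id=*/1, /*token_count=*/40);  // 3 blocks of 16 tokens
  auto [block, offset] = alloc.locate(1, 39);
  std::cout << "token 39 -> block " << block << ", offset " << offset << "\n";
  alloc.release(1);
  return 0;
}
```

Fixed-size blocks let memory for a growing sequence be allocated incrementally and returned to a shared pool when the sequence finishes, which keeps fragmentation and peak usage predictable.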
I build and adapt local, cloud-independent inference systems, customizing runtimes for different model families and deployment requirements rather than relying on fixed abstractions. The goal is stable, efficient inference that makes full use of available hardware under real operational conditions.
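
As a hypothetical illustration of what avoiding fixed abstractions means in practice, below is a small sketch of per-model execution-path selection; the path names, model fields, and hardware thresholds are invented for the example and do not describe a specific runtime.

```cpp
// Hypothetical sketch of per-model execution-path selection: the runtime
// inspects the model family and available hardware and picks a matching
// kernel path instead of routing everything through one generic code path.
#include <iostream>
#include <string>

enum class AttentionPath { kGenericFp16, kFusedGqa, kCpuInt8 };

struct DeviceInfo {
  bool has_gpu;
  int sm_major;  // GPU compute capability, major version
};

struct ModelInfo {
  std::string family;  // e.g. "llama" (illustrative value)
  bool uses_gqa;       // grouped-query attention
  bool weights_int8;   // quantized weights
};

// Choose an execution path; the rules and thresholds here are illustrative.
AttentionPath SelectAttentionPath(const ModelInfo& m, const DeviceInfo& d) {
  if (!d.has_gpu && m.weights_int8) return AttentionPath::kCpuInt8;
  if (d.has_gpu && d.sm_major >= 8 && m.uses_gqa) return AttentionPath::kFusedGqa;
  return AttentionPath::kGenericFp16;
}

int main() {
  ModelInfo model{"llama", /*uses_gqa=*/true, /*weights_int8=*/false};
  DeviceInfo device{/*has_gpu=*/true, /*sm_major=*/9};
  AttentionPath path = SelectAttentionPath(model, device);
  std::cout << "selected path: " << static_cast<int>(path) << "\n";
  return 0;
}
```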



