Learning Guide: AI Inference Engineering

Purpose

A curated collection of resources for engineers working on AI inference systems — covering LLM serving, GPU kernel programming, attention mechanisms, quantization, distributed inference, and production deployment. Compiled from the AER Labs community.

How to read

Recommended reading order:

  1. Read "Tier 1" for all topics first (foundational concepts)
  2. Read "Tier 2" for all topics (intermediate depth)
  3. Read "Tier 3" for all topics (advanced / cutting-edge)

Table of contents


1. LLM Inference Fundamentals

Tier 1

Tier 2

Tier 3

2. Inference Engines & Serving Systems

Tier 1

Tier 2

Tier 3

3. Attention Mechanisms & Memory Optimization

Tier 1

Tier 2

Tier 3

4. Quantization & Model Compression

Tier 1

Tier 2

Tier 3

5. CUDA & GPU Kernel Programming

Tier 1

Tier 2

Tier 3

6. Structured Output & Guided Decoding

Tier 1

Tier 2

7. Distributed & Multi-GPU Inference

Tier 1

Tier 2

Tier 3

  • How To Scale Your Model - JAX Team. Comprehensive book covering TPU/GPU architecture, inter-device communication, and parallelism strategies for training and inference at scale.
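To make the book's parallelism vocabulary concrete, here is a minimal sketch (ours, not taken from the book) of JAX's Mesh/NamedSharding API: the batch dimension of the activations is sharded across available devices while the weights are replicated. Assumes a recent JAX release and a batch size divisible by the device count.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D device mesh over whatever devices are available
# (degenerates to a single device on CPU-only machines).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard the batch dimension of the activations across the mesh;
# replicate the weight matrix on every device.
x = jax.device_put(jnp.ones((8, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 512)), NamedSharding(mesh, P(None, None)))

# jit compiles one program; XLA inserts whatever collectives the
# shardings imply (none here, since only the batch dim is partitioned).
y = jax.jit(jnp.dot)(x, w)
print(y.sharding)
```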

8. Post-Training & Fine-Tuning

Tier 1

  • Post-training 101 - Han Fang, Karthik A Sankararaman. A hitchhiker's guide to LLM post-training, covering RLHF, DPO, and modern alignment techniques.
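As a concrete anchor for one of the techniques the guide covers, here is a minimal sketch of the DPO objective in PyTorch. The argument names and the convention of pre-summed per-sequence log-probabilities are our assumptions, not taken from the guide.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a (batch,) tensor holding the summed log-probability
    of the chosen / rejected response under the policy or the frozen
    reference model.
    """
    # Log-ratios of policy vs. reference for each response.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected log-ratios.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)  # smaller when the policy prefers chosen more than the reference does
```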

Tier 2

9. Hardware Architecture & Co-Design

Tier 1

  • Domain-Specific Architectures - Fleetwood. Overview of domain-specific hardware design principles and their application to AI accelerator architectures.
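A recurring tool in such co-design discussions is the roofline model; the sketch below works through the arithmetic for a square GEMM. The peak-compute and bandwidth figures are illustrative placeholders, not vendor specifications.

```python
def attainable_tflops(flops, bytes_moved, peak_tflops, mem_bw_tbs):
    """Roofline: attainable = min(peak compute, arithmetic intensity * bandwidth)."""
    intensity = flops / bytes_moved          # FLOPs per byte
    return min(peak_tflops, intensity * mem_bw_tbs)

# fp16 GEMM: C[m,n] = A[m,k] @ B[k,n], 2 bytes per element,
# counting each operand as read or written exactly once.
m = n = k = 4096
flops = 2 * m * n * k
bytes_moved = 2 * (m * k + k * n + m * n)

print(f"arithmetic intensity: {flops / bytes_moved:.0f} FLOP/byte")
# Illustrative accelerator: 400 TFLOP/s peak fp16, 3 TB/s HBM (assumptions).
print(f"attainable: {attainable_tflops(flops, bytes_moved, 400, 3.0):.0f} TFLOP/s")
# intensity (~1365 FLOP/byte) puts this GEMM well into the compute-bound regime
```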

Tier 2

10. State-Space Models & Alternative Architectures

Tier 2

11. Compiler & DSL Approaches

Tier 1

  • Helion: Python-Embedded DSL for ML Kernels - PyTorch. A Python-embedded domain-specific language for writing fast, scalable ML kernels with minimal boilerplate, lowering the barrier to custom kernel development.
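For flavor, here is an elementwise-add kernel in the style of Helion's own introductory examples, reproduced from memory: the decorator and the hl.tile loop are our best recollection of the documented API and may differ across versions.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    # hl.tile partitions the iteration space; Helion autotunes the
    # tile sizes and generates the underlying GPU kernel.
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out

a = torch.randn(4096, device="cuda")
b = torch.randn(4096, device="cuda")
torch.testing.assert_close(add(a, b), a + b)
```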

Tier 2

  • AOTInductor: Ahead-of-Time Compilation for PyTorch - PyTorch. Official documentation for AOTInductor, enabling ahead-of-time compilation of PyTorch models for deployment without a Python runtime dependency; a minimal export-and-compile sketch follows this list.

  • Helion Flex Attention Example - PyTorch. Reference implementation of flexible attention variants using Helion DSL, demonstrating how to write custom attention kernels with minimal code.

  • CUDA Tile IR - NVIDIA. MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns targeting NVIDIA tensor cores.

  • cuTile Python Samples - PeaBrane. Sample implementations using the cuTile programming model for writing parallel GPU kernels.

  • Intel ISPC: Implicit SPMD Program Compiler - Intel. Open-source compiler for high-performance SIMD programming on CPU and GPU using an implicit SPMD model.
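As a concrete taste of the AOTInductor flow mentioned above, here is a minimal export-and-compile sketch. It assumes PyTorch 2.5 or newer; the exact entry points (torch._inductor.aoti_compile_and_package / aoti_load_package) may shift between releases.

```python
import torch

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = MLP().eval()
example_inputs = (torch.randn(8, 64),)

# 1. Capture a full graph with torch.export.
exported = torch.export.export(model, example_inputs)

# 2. Compile ahead of time into a self-contained .pt2 package.
package_path = torch._inductor.aoti_compile_and_package(exported)

# 3. Load and run the artifact; no Python model definition is needed.
compiled = torch._inductor.aoti_load_package(package_path)
print(compiled(*example_inputs).shape)   # torch.Size([8, 64])
```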

12. Confidential & Secure Inference

Tier 2

13. AI Agents & LLM Tooling

Tier 1

  • AgentKernelArena - AMD AGI. End-to-end benchmarking environment for evaluating LLM-powered coding agents (Cursor, Claude Code, Codex, SWE-agent, GEAK) on CUDA kernel writing tasks.
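For readers unfamiliar with the genre of task these arenas pose, here is an illustrative (not benchmark-sourced) example: writing and launching a CUDA vector-add through CuPy's RawKernel. Requires a CUDA-capable GPU with cupy installed.

```python
import cupy as cp

vector_add = cp.RawKernel(r'''
extern "C" __global__
void vector_add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}
''', 'vector_add')

n = 1 << 20
a = cp.random.rand(n, dtype=cp.float32)
b = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(a)

# One thread per element, rounded up to whole blocks.
threads = 256
blocks = (n + threads - 1) // threads
vector_add((blocks,), (threads,), (a, b, out, cp.int32(n)))
assert cp.allclose(out, a + b)
```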

Tier 2

14. Production Inference at Scale

Tier 2

Tier 3

15. Benchmarking & Profiling

Tier 1

  • Evaluation Guidebook - OpenEvals / HuggingFace. Comprehensive guide to evaluating AI models, covering methodologies, metrics, and best practices.

  • AI Hardware Benchmarking & Performance Analysis - Artificial Analysis. Benchmarking of AI accelerator systems for LLM inference across chip configurations, inference software stacks, and concurrent load scaling.
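A minimal sketch of the kind of measurement these benchmarks report: time-to-first-token (TTFT) and decode throughput for one streaming request against an OpenAI-compatible endpoint. The URL and model name are placeholders, and counting one token per SSE chunk is an approximation.

```python
import json
import time
import urllib.request

URL = "http://localhost:8000/v1/completions"    # placeholder endpoint
payload = {
    "model": "my-model",                        # placeholder model name
    "prompt": "Explain KV caching in one paragraph.",
    "max_tokens": 128,
    "stream": True,
}
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
first_token_at = None
chunks = 0
with urllib.request.urlopen(req) as resp:
    for line in resp:                           # server-sent events, one per line
        if not line.startswith(b"data:") or b"[DONE]" in line:
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1                             # ~1 token per chunk (approximate)

if first_token_at is not None:
    ttft = first_token_at - start
    decode_time = max(time.perf_counter() - first_token_at, 1e-9)
    print(f"TTFT: {ttft:.3f}s, decode: {chunks / decode_time:.1f} tok/s")
```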

Tier 2

16. Courses & Comprehensive Guides

Tier 1

Tier 2

17. Tools & Libraries

Tier 1

Tier 2

18. Reference Collections

  • GPU Performance Engineering Resources - Wafer AI. Comprehensive tiered learning guide for GPU kernel programming and optimization, covering fundamentals through production deployment.

  • AER Labs Blog - AER Labs. Technical blog covering AI inference optimization, vLLM architecture, PagedAttention, KV cache systems, and LLM deployment strategies.
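Since several of the blog's topics center on PagedAttention, a toy sketch of its core bookkeeping may help: a KV cache split into fixed-size blocks, with a per-sequence block table mapping logical token positions to physical slots. This is a simplification of what real engines do; it omits eviction, copy-on-write, and the GPU tensors themselves.

```python
BLOCK_SIZE = 16

class BlockAllocator:
    """Free list of physical KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Grab a fresh physical block only when the current one fills up,
        # so memory grows in BLOCK_SIZE increments instead of max-length chunks.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, pos):
        block = self.block_table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table), "blocks for", seq.num_tokens, "tokens")  # 3 blocks
print("token 37 lives in slot", seq.physical_slot(37))
```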


Contributing

Have a resource to share? Open a pull request or issue with the link, a brief description, and suggested category/tier placement.

License

MIT
