szymon-rd commented Jan 31, 2026

WIP (the diff shows spurious conflicts because the branch still needs to be rebased).
The DSL still needs cleanup; GShared is awkward in its current form.
Some sources still need to be removed.

Two critical bugs were causing the LLM to produce nonsensical text:

1. KVCachedAttention: workgroupSize=32 but headSize=64 meant only half the
   output dimensions were computed. Added second pass for dims 32-63.

2. KVCachedAttention: unlike the FlashAttention kernel, masked positions were not
   initialized to -10000, leaving garbage in shared memory that corrupted the
   softmax computation. The kernel now writes -10000 for masked positions.
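
A minimal CPU-side sketch of the second bug, not the actual kernel code: if score slots past the current position are left uninitialized, softmax normalizes over garbage; writing a large negative sentinel makes their weights vanish.

```scala
// Illustrative only: softmax over a shared-memory-sized score buffer where
// positions >= validLen are masked. Filling them with -10000 makes
// exp(score - max) effectively 0, so they no longer corrupt the normalization.
def softmax(scores: Array[Float]): Array[Float] =
  val maxScore = scores.max
  val exps     = scores.map(s => math.exp(s - maxScore).toFloat)
  val sum      = exps.sum
  exps.map(_ / sum)

val validLen = 3
val scores   = Array.fill(8)(Float.NaN)                       // stand-in for garbage in shared memory
for i <- 0 until validLen do scores(i) = 1.0f                 // real attention scores
for i <- validLen until scores.length do scores(i) = -10000f  // the fix: explicit mask value
val weights = softmax(scores)                                 // masked slots contribute ~0
```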

Also removed an incorrect transpose of the token embeddings: the GGUF column-major
layout [C,V] already stores token t at indices [t*C, (t+1)*C), which is exactly
what the GPU embedding lookup expects.
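
A small sketch of the index arithmetic behind that statement (hypothetical helper name): with the GGUF [C,V] layout, the C channels of token t are already contiguous, so the lookup is a plain slice and no transpose is needed.

```scala
// Hypothetical helper; shows the [t*C, (t+1)*C) slice described above.
// embeddings holds the raw GGUF tensor data laid out column-major as [C, V]:
// all C channels of token 0, then all C channels of token 1, and so on.
def tokenEmbedding(embeddings: Array[Float], token: Int, C: Int): Array[Float] =
  embeddings.slice(token * C, (token + 1) * C)
```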

- Consolidate all programs into /programs directory (remove inline duplicates from pipeline)
- Add optional weightOffset parameter to programs for layered/pipeline use
- Add GIO.readVec4 for vectorized buffer reads (Vec4[UInt32], Vec4[Float32])
- Add ReadBufferVec4U/F expressions with compiler support
- Add CopyProgram and EmbeddingProgram as standalone programs
- Remove QuantizedMatmulVecProgram (superseded by Q4K/Q6K)
- Clean up debug/exploration tests (keep regression tests)
- Add OPTIMIZATION_OPPORTUNITIES.md documenting Cyfra vs llama.cpp analysis
- Add Q4KBenchmarkTest for performance comparison

This reduces LlamaGPUPipeline from ~2400 to ~2000 lines by removing
duplicate program definitions. Programs now use default weightOffset=0
for standalone testing and accept offsets for pipeline integration.
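
A sketch of the weightOffset convention described above, using hypothetical program/config names: the default of 0 lets a program be tested against a buffer that contains only its own weights, while the pipeline passes the position of each layer's weights inside the packed weight buffer.

```scala
// Hypothetical config, not the actual Cyfra program API.
final case class MatmulVecConfig(rows: Int, cols: Int, weightOffset: Int = 0)

// Index into the packed weight buffer for one matrix element.
def weightIndex(cfg: MatmulVecConfig, row: Int, col: Int): Int =
  cfg.weightOffset + row * cfg.cols + col

// Standalone test: the buffer holds only this program's weights, offset defaults to 0.
val standalone = MatmulVecConfig(rows = 64, cols = 64)

// Pipeline use: this layer's weights start some distance into the shared buffer.
val inPipeline = MatmulVecConfig(rows = 64, cols = 64, weightOffset = 3 * 64 * 64)
```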

- Introduce GenerationLayout that holds both prefill and decode buffers
- Weights uploaded once to GPU, reused across all token generations
- KV cache allocated as GPU-only buffer, no CPU roundtrips
- Add a foldLeft pattern in KVCachedPipeline.generate() to build the full
  computation graph before a single runUnsafe call (see the sketch after this list)
- Add performance profiling tests and microbenchmarks
- Add .gguf to gitignore
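
A minimal sketch of the foldLeft pattern mentioned above, with hypothetical types standing in for the real pipeline values: each decode step threads its state into the next one purely, and the GPU is submitted to only once at the end.

```scala
// Hypothetical types; only the "fold the steps, then one runUnsafe" shape matters.
final case class DecodeState(tokens: Vector[Int], graph: Vector[String])

def appendStep(state: DecodeState, step: Int): DecodeState =
  // In the real pipeline this appends one decode step's programs
  // (attention, FFN, sampling) to the computation graph.
  state.copy(graph = state.graph :+ s"decode-step-$step")

val initial = DecodeState(tokens = Vector(1, 2, 3), graph = Vector("prefill"))

// Build the whole computation graph first...
val finalState = (0 until 32).foldLeft(initial)(appendStep)

// ...then hand it to the GPU once (placeholder for the single runUnsafe call).
println(s"submitting ${finalState.graph.size} graph nodes in one go")
```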

Known limitation: still O(N²) scaling, because each step's different seqLen
requires a separately compiled pipeline. Further optimization is needed to
make seqLen a runtime parameter in the attention kernels.

- Add AttentionParams struct with runtime seqLen field
- Change numScoreIterations to use maxSeqLen (compile-time loop bounds)
- Use takeWhile(_ < seqLenVal) for runtime early termination (see the sketch after this list)
- Single decode pipeline now works for all sequence lengths
- Eliminates O(N²) overhead from building N different pipelines
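
A rough sketch of the early-termination idea in plain Scala, not the actual Cyfra GSeq code: the loop bound is the compile-time maxSeqLen, while the runtime seqLen read from the AttentionParams uniform stops the scan early, so a single compiled pipeline serves every sequence length up to the maximum.

```scala
// Conceptual sketch only; in the shader the bound is baked in at compile time
// and seqLen comes from the AttentionParams uniform at runtime.
val maxSeqLen = 256  // compile-time bound (later raised to 2048)

def attentionScoreSum(seqLen: Int, score: Int => Float): Float =
  (0 until maxSeqLen)        // compile-time loop bound
    .takeWhile(_ < seqLen)   // runtime early termination
    .map(score)
    .sum
```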

Performance improvement: 0.2 tok/s -> 10.4 tok/s (50x speedup)
Scaling now linear: 32 tokens at 19.1 tok/s

The KV-cached decode pipeline was producing garbage output because RoPE
was using compile-time startPos=255 (maxSeqLen-1) for all decode steps,
instead of the actual runtime positions (5, 6, 7, ...).

Changes:
- RoPE now reads startPos from AttentionParams uniform at runtime
- Added attnParams field to all pipeline layouts (PipelineLayout,
  Q4KPipelineLayout, MixedQuantPipelineLayout, KVCachePipelineLayout)
- Updated all RoPE ProgramLayout usages to pass attnParams

Also adds:
- Interactive Chatbot app with top-p sampling
- Improved benchmark test (128 tokens, 5 runs, warmup)

Performance: 13.7 tok/s average on TinyLlama 1.1B with KV cache

- Increase MAX_SEQ_LEN to 2048 (8KB shared memory, well within limits)
- Update Chatbot and benchmark to use 2048 context
- Performance unchanged at ~13.6 tok/s (masked positions are cheap)
- KV cache now 44MB (was 5.5MB), still fits easily in GPU memory
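
Back-of-the-envelope arithmetic behind those numbers, under assumptions not stated in the commit (one F32 attention score per position in shared memory; the previous MAX_SEQ_LEN was 256, consistent with the earlier startPos=255 fix):

```scala
// Illustrative arithmetic only, not the pipeline's exact accounting.
val maxSeqLen      = 2048
val sharedMemBytes = maxSeqLen * 4              // assumed: one F32 score per position -> 8192 B = 8 KB
val oldMaxSeqLen   = 256                        // assumed previous MAX_SEQ_LEN
val growthFactor   = maxSeqLen / oldMaxSeqLen   // KV cache scales with context length: 8x, i.e. 5.5 MB -> 44 MB
```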

- Add GSeq.unroll DSL feature for generating [[unroll]] pragma in SPIR-V
- Optimize Q4K matmul: replace manual unrolling with GSeq.limit(8).unroll
- Shader size reduction: 45,000 lines → 219 lines (200x smaller!)
- Performance improvement: ~15 tok/s → 30 tok/s (2x speedup)
- Add tied embeddings support for models like Llama 3.2

The compact shader with proper unroll hints dramatically improves
instruction cache performance, leading to the 2x speedup.
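
A conceptual before/after sketch in plain Scala, not the real Q4K kernel or the Cyfra DSL: manual unrolling pastes the loop body into the generated shader once per iteration, while a fixed-bound loop carrying an unroll hint (the [[unroll]] pragma emitted by GSeq.unroll) keeps the emitted code compact and leaves the duplication to the driver's compiler.

```scala
// Conceptual only; the real kernel expresses this with the DSL's
// GSeq.limit(8).unroll rather than a hand-written while loop.
def subBlockDot(scale: Array[Float], q: Array[Float], x: Array[Float]): Float =
  var acc = 0.0f
  var i   = 0
  while i < 8 do                    // fixed bound, analogous to GSeq.limit(8)
    acc += scale(i) * q(i) * x(i)   // previously written out 8 times by hand
    i += 1
  acc
```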

- Add Float16 Value type for half-precision floats
- Add GIO.readF16 for reading F16 data from buffers
- Add ReadBufferF16 expression type
- Foundation for F16 weight support in future updates

Add complete Float16 math support throughout Cyfra:

DSL (cyfra-dsl):
- Add Float16 Value type alongside Float32
- Add ToFloat16 conversion expression
- Add Float16 subgroup operations (subgroupAddF16, subgroupMinF16, subgroupMaxF16)
- Add GIO subgroup methods for F16 with @targetName annotations
- Add GIO.readF16 for reading F16 from buffers

Compiler (cyfra-compiler):
- Add Float16Tag and LFloat16Tag type tags
- Add F16 type definition (OpTypeFloat 16) in SPIRV types
- Add F16 constant handling with floatToFloat16 conversion
- Add F16 type stride (2 bytes)
- Add F16 conversions (ToFloat16, ToFloat32 between F16/F32) using OpFConvert
- Add F16 to basic types list

This enables native F16 compute in shaders without F32 conversion,
saving 2x memory bandwidth for F16 weights and activations.
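
For context on the constant handling mentioned above, a small sketch of encoding an F32 literal into the 16-bit payload an F16 OpConstant carries, using the JDK's Float.floatToFloat16 (available since Java 20); whether Cyfra's floatToFloat16 is this helper or its own routine is an assumption here.

```scala
// Sketch: round-tripping a constant through binary16 to see what precision it keeps.
val value: Float     = 0.1f
val f16Bits: Short   = java.lang.Float.floatToFloat16(value)  // IEEE 754 binary16 bit pattern
val roundTrip: Float = java.lang.Float.float16ToFloat(f16Bits)

// binary16 keeps ~3 decimal digits (10-bit mantissa), which is why long
// reductions later have to accumulate in F32.
println(f"0x${f16Bits & 0xffff}%04x -> $roundTrip%.6f")
```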

Add LlamaF16Pipeline with native F16 compute support:

- F16PipelineLayout with all buffers in Float16 (except logits)
- F16CachedPipeline class for weight management
- F16MatmulVecProgram using native F16 operations
- F16ModelWeights structure for F16 bytes (no F32 conversion)

DSL improvements:
- Add BasicScalarAlgebra[Float16] for math operations
- Add Float16 conversions (asFloat32, asInt, etc.)
- Fix subgroup ops and constants for F16

This enables 2x memory savings for F16 models like Llama 3.2 1B.
The pipeline structure is ready; the programs still need completion.

- Add GGUFReader.readTensorF16Bytes() to read F16 tensors without conversion
- Add LlamaInference.loadF16Weights() for native F16 weight loading
- Add LlamaInference.getF16CachedPipeline() getter
- Add allWeightsAreF16 check for F16 model detection
- Integrate F16 pipeline with existing inference infrastructure

F16 models like Llama 3.2 1B can now be loaded without F32 conversion,
saving 2x memory (2.4GB instead of 4.8GB).

Add F16 math functions and clean DSL usage:

DSL Improvements:
- Add Float16(value: Float) factory for clean constant creation
- Add F16 math functions: sin, cos, tan, pow, sqrt, exp, max, min, abs
- Add asFloat16 to Int32 and UInt32 for direct conversion
- Add toF16 extension for Float32

F16 Programs (all native F16, no F32 conversion):
- F16MatmulVecProgram: matrix-vector multiply in F16
- F16RMSNormProgram: RMS normalization with F16 sqrt
- F16SiLUProgram: SiLU activation native F16
- F16SwiGLUProgram: SwiGLU activation native F16
- F16RoPEProgram: Rotary embeddings with F16 sin/cos

All programs use native F16 operations throughout for maximum performance.
2x memory bandwidth savings vs F32.

CRITICAL FIX: The F16 pipeline was accumulating dot products and sums
in Float16, causing catastrophic precision loss for large reductions
(2048+ elements).

Changes:
- F16MatmulVecProgram: accumulate in F32, convert to F16 at end
- F16RMSNormProgram: accumulate sum of squares in F32
- F16SwiGLUProgram: compute sigmoid in F32 for better precision
- F16OutputProgram: accumulate logit computation in F32
- F16AttentionProgram: accumulate weighted sum in F32

This matches how proper F16 inference works:
- Read F16 weights/activations
- Compute in F32 for precision
- Write F16 output
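
A CPU-side sketch of the precision problem, simulating F16 storage with the JDK's binary16 helpers (Float.floatToFloat16 / float16ToFloat, Java 20+; an illustration, not the pipeline's code): once the accumulator is large, per-step rounding to F16 silently discards small contributions, which is exactly what accumulating in F32 avoids.

```scala
// Simulate "store F16, accumulate" on the CPU by rounding every value through binary16.
def f16(x: Float): Float = java.lang.Float.float16ToFloat(java.lang.Float.floatToFloat16(x))

val n = 2048
// One large contribution followed by many small ones, a shape that large
// dot products and sums of squares routinely produce.
val terms = Array.tabulate(n)(i => if i == 0 then 100.0f else 0.01f)

// Buggy: the running sum is rounded back to F16 after every add, so once the
// accumulator is large, the small contributions round away entirely.
val accF16 = terms.foldLeft(0.0f)((acc, t) => f16(acc + f16(t)))

// Fixed: accumulate in F32 and convert once at the end.
val accF32 = f16(terms.foldLeft(0.0f)((acc, t) => acc + f16(t)))

println(s"accumulate in F16: $accF16, accumulate in F32: $accF32")
// Roughly 100.0 vs roughly 120.5: the F16 accumulator silently dropped 2047 small terms.
```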

Also adds F16KVCacheDebugTest for comparing CPU vs F16 GPU outputs.

The test showed:
- CPU: 'The capital of France is' -> 'Paris' (correct)
- F16 GPU before fix: -> '\n\n' (wrong)
- After fix: should match CPU