szymon-rd commented Jan 31, 2026

WIP (the diff shows spurious conflicts because the branch still needs to be rebased).
The DSL still needs cleanup; GShared is awkward in its current form.
Some sources still need to be removed.

Two critical bugs were causing the LLM to produce nonsensical text:

1. KVCachedAttention: workgroupSize=32 but headSize=64 meant only half the
   output dimensions were computed. Added second pass for dims 32-63.

2. KVCachedAttention: unlike the FlashAttention kernel, masked positions were not
   initialized to -10000, leaving garbage in shared memory that corrupted the
   softmax computation. The kernel now writes -10000 for masked positions.
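
A minimal CPU-side sketch of the second bug, not the actual kernel code: if score slots past the current position are left uninitialized, softmax normalizes over garbage; writing a large negative sentinel makes their weights vanish.

```scala
// Illustrative only: softmax over a shared-memory-sized score buffer where
// positions >= validLen are masked. Filling them with -10000 makes
// exp(score - max) effectively 0, so they no longer corrupt the normalization.
def softmax(scores: Array[Float]): Array[Float] =
  val maxScore = scores.max
  val exps     = scores.map(s => math.exp(s - maxScore).toFloat)
  val sum      = exps.sum
  exps.map(_ / sum)

val validLen = 3
val scores   = Array.fill(8)(Float.NaN)                       // stand-in for garbage in shared memory
for i <- 0 until validLen do scores(i) = 1.0f                 // real attention scores
for i <- validLen until scores.length do scores(i) = -10000f  // the fix: explicit mask value
val weights = softmax(scores)                                 // masked slots contribute ~0
```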

Also removed an incorrect transpose of the token embeddings: the GGUF column-major
layout [C,V] already stores token t at indices [t*C, (t+1)*C), which is exactly
what the GPU embedding lookup expects.
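
A small sketch of the index arithmetic behind that statement (hypothetical helper name): with the GGUF [C,V] layout, the C channels of token t are already contiguous, so the lookup is a plain slice and no transpose is needed.

```scala
// Hypothetical helper; shows the [t*C, (t+1)*C) slice described above.
// embeddings holds the raw GGUF tensor data laid out column-major as [C, V]:
// all C channels of token 0, then all C channels of token 1, and so on.
def tokenEmbedding(embeddings: Array[Float], token: Int, C: Int): Array[Float] =
  embeddings.slice(token * C, (token + 1) * C)
```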

- Consolidate all programs into /programs directory (remove inline duplicates from pipeline)
- Add optional weightOffset parameter to programs for layered/pipeline use
- Add GIO.readVec4 for vectorized buffer reads (Vec4[UInt32], Vec4[Float32])
- Add ReadBufferVec4U/F expressions with compiler support
- Add CopyProgram and EmbeddingProgram as standalone programs
- Remove QuantizedMatmulVecProgram (superseded by Q4K/Q6K)
- Clean up debug/exploration tests (keep regression tests)
- Add OPTIMIZATION_OPPORTUNITIES.md documenting Cyfra vs llama.cpp analysis
- Add Q4KBenchmarkTest for performance comparison

This reduces LlamaGPUPipeline from ~2400 to ~2000 lines by removing
duplicate program definitions. Programs now use default weightOffset=0
for standalone testing and accept offsets for pipeline integration.
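
A sketch of the weightOffset convention described above, using hypothetical program/config names: the default of 0 lets a program be tested against a buffer that contains only its own weights, while the pipeline passes the position of each layer's weights inside the packed weight buffer.

```scala
// Hypothetical config, not the actual Cyfra program API.
final case class MatmulVecConfig(rows: Int, cols: Int, weightOffset: Int = 0)

// Index into the packed weight buffer for one matrix element.
def weightIndex(cfg: MatmulVecConfig, row: Int, col: Int): Int =
  cfg.weightOffset + row * cfg.cols + col

// Standalone test: the buffer holds only this program's weights, offset defaults to 0.
val standalone = MatmulVecConfig(rows = 64, cols = 64)

// Pipeline use: this layer's weights start some distance into the shared buffer.
val inPipeline = MatmulVecConfig(rows = 64, cols = 64, weightOffset = 3 * 64 * 64)
```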

- Introduce GenerationLayout that holds both prefill and decode buffers
- Weights uploaded once to GPU, reused across all token generations
- KV cache allocated as GPU-only buffer, no CPU roundtrips
- Add a foldLeft pattern in KVCachedPipeline.generate() to build the full
  computation graph before a single runUnsafe call (see the sketch after this list)
- Add performance profiling tests and microbenchmarks
- Add .gguf to gitignore
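
A minimal sketch of the foldLeft pattern mentioned above, with hypothetical types standing in for the real pipeline values: each decode step threads its state into the next one purely, and the GPU is submitted to only once at the end.

```scala
// Hypothetical types; only the "fold the steps, then one runUnsafe" shape matters.
final case class DecodeState(tokens: Vector[Int], graph: Vector[String])

def appendStep(state: DecodeState, step: Int): DecodeState =
  // In the real pipeline this appends one decode step's programs
  // (attention, FFN, sampling) to the computation graph.
  state.copy(graph = state.graph :+ s"decode-step-$step")

val initial = DecodeState(tokens = Vector(1, 2, 3), graph = Vector("prefill"))

// Build the whole computation graph first...
val finalState = (0 until 32).foldLeft(initial)(appendStep)

// ...then hand it to the GPU once (placeholder for the single runUnsafe call).
println(s"submitting ${finalState.graph.size} graph nodes in one go")
```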

Known limitation: still O(N²) scaling, because each step's different seqLen
requires a separately compiled pipeline. Further optimization is needed to
make seqLen a runtime parameter in the attention kernels.

- Add AttentionParams struct with runtime seqLen field
- Change numScoreIterations to use maxSeqLen (compile-time loop bounds)
- Use takeWhile(_ < seqLenVal) for runtime early termination (see the sketch after this list)
- Single decode pipeline now works for all sequence lengths
- Eliminates O(N²) overhead from building N different pipelines
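
A rough sketch of the early-termination idea in plain Scala, not the actual Cyfra GSeq code: the loop bound is the compile-time maxSeqLen, while the runtime seqLen read from the AttentionParams uniform stops the scan early, so a single compiled pipeline serves every sequence length up to the maximum.

```scala
// Conceptual sketch only; in the shader the bound is baked in at compile time
// and seqLen comes from the AttentionParams uniform at runtime.
val maxSeqLen = 256  // compile-time bound (later raised to 2048)

def attentionScoreSum(seqLen: Int, score: Int => Float): Float =
  (0 until maxSeqLen)        // compile-time loop bound
    .takeWhile(_ < seqLen)   // runtime early termination
    .map(score)
    .sum
```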

Performance improvement: 0.2 tok/s -> 10.4 tok/s (50x speedup)
Scaling now linear: 32 tokens at 19.1 tok/s

The KV-cached decode pipeline was producing garbage output because RoPE
was using compile-time startPos=255 (maxSeqLen-1) for all decode steps,
instead of the actual runtime positions (5, 6, 7, ...).

Changes:
- RoPE now reads startPos from AttentionParams uniform at runtime
- Added attnParams field to all pipeline layouts (PipelineLayout,
  Q4KPipelineLayout, MixedQuantPipelineLayout, KVCachePipelineLayout)
- Updated all RoPE ProgramLayout usages to pass attnParams

Also adds:
- Interactive Chatbot app with top-p sampling
- Improved benchmark test (128 tokens, 5 runs, warmup)

Performance: 13.7 tok/s average on TinyLlama 1.1B with KV cache

- Increase MAX_SEQ_LEN to 2048 (8KB shared memory, well within limits)
- Update Chatbot and benchmark to use 2048 context
- Performance unchanged at ~13.6 tok/s (masked positions are cheap)
- KV cache now 44MB (was 5.5MB), still fits easily in GPU memory
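
Back-of-the-envelope arithmetic behind those numbers, under assumptions not stated in the commit (one F32 attention score per position in shared memory; the previous MAX_SEQ_LEN was 256, consistent with the earlier startPos=255 fix):

```scala
// Illustrative arithmetic only, not the pipeline's exact accounting.
val maxSeqLen      = 2048
val sharedMemBytes = maxSeqLen * 4              // assumed: one F32 score per position -> 8192 B = 8 KB
val oldMaxSeqLen   = 256                        // assumed previous MAX_SEQ_LEN
val growthFactor   = maxSeqLen / oldMaxSeqLen   // KV cache scales with context length: 8x, i.e. 5.5 MB -> 44 MB
```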

- Add GSeq.unroll DSL feature for generating [[unroll]] pragma in SPIR-V
- Optimize Q4K matmul: replace manual unrolling with GSeq.limit(8).unroll
- Shader size reduction: 45,000 lines → 219 lines (200x smaller!)
- Performance improvement: ~15 tok/s → 30 tok/s (2x speedup)
- Add tied embeddings support for models like Llama 3.2

The compact shader with proper unroll hints dramatically improves
instruction cache performance, leading to the 2x speedup.
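
A conceptual before/after sketch in plain Scala, not the real Q4K kernel or the Cyfra DSL: manual unrolling pastes the loop body into the generated shader once per iteration, while a fixed-bound loop carrying an unroll hint (the [[unroll]] pragma emitted by GSeq.unroll) keeps the emitted code compact and leaves the duplication to the driver's compiler.

```scala
// Conceptual only; the real kernel expresses this with the DSL's
// GSeq.limit(8).unroll rather than a hand-written while loop.
def subBlockDot(scale: Array[Float], q: Array[Float], x: Array[Float]): Float =
  var acc = 0.0f
  var i   = 0
  while i < 8 do                    // fixed bound, analogous to GSeq.limit(8)
    acc += scale(i) * q(i) * x(i)   // previously written out 8 times by hand
    i += 1
  acc
```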

- Add Float16 Value type for half-precision floats
- Add GIO.readF16 for reading F16 data from buffers
- Add ReadBufferF16 expression type
- Foundation for F16 weight support in future updates

Add complete Float16 math support throughout Cyfra:

DSL (cyfra-dsl):
- Add Float16 Value type alongside Float32
- Add ToFloat16 conversion expression
- Add Float16 subgroup operations (subgroupAddF16, subgroupMinF16, subgroupMaxF16)
- Add GIO subgroup methods for F16 with @targetName annotations
- Add GIO.readF16 for reading F16 from buffers

Compiler (cyfra-compiler):
- Add Float16Tag and LFloat16Tag type tags
- Add F16 type definition (OpTypeFloat 16) in SPIRV types
- Add F16 constant handling with floatToFloat16 conversion
- Add F16 type stride (2 bytes)
- Add F16 conversions (ToFloat16, ToFloat32 between F16/F32) using OpFConvert
- Add F16 to basic types list

This enables native F16 compute in shaders without F32 conversion,
saving 2x memory bandwidth for F16 weights and activations.
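
For context on the constant handling mentioned above, a small sketch of encoding an F32 literal into the 16-bit payload an F16 OpConstant carries, using the JDK's Float.floatToFloat16 (available since Java 20); whether Cyfra's floatToFloat16 is this helper or its own routine is an assumption here.

```scala
// Sketch: round-tripping a constant through binary16 to see what precision it keeps.
val value: Float     = 0.1f
val f16Bits: Short   = java.lang.Float.floatToFloat16(value)  // IEEE 754 binary16 bit pattern
val roundTrip: Float = java.lang.Float.float16ToFloat(f16Bits)

// binary16 keeps ~3 decimal digits (10-bit mantissa), which is why long
// reductions later have to accumulate in F32.
println(f"0x${f16Bits & 0xffff}%04x -> $roundTrip%.6f")
```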

Add LlamaF16Pipeline with native F16 compute support:

- F16PipelineLayout with all buffers in Float16 (except logits)
- F16CachedPipeline class for weight management
- F16MatmulVecProgram using native F16 operations
- F16ModelWeights structure for F16 bytes (no F32 conversion)

DSL improvements:
- Add BasicScalarAlgebra[Float16] for math operations
- Add Float16 conversions (asFloat32, asInt, etc.)
- Fix subgroup ops and constants for F16

This enables 2x memory savings for F16 models like Llama 3.2 1B.
The pipeline structure is ready; the programs still need completion.

- Add GGUFReader.readTensorF16Bytes() to read F16 tensors without conversion
- Add LlamaInference.loadF16Weights() for native F16 weight loading
- Add LlamaInference.getF16CachedPipeline() getter
- Add allWeightsAreF16 check for F16 model detection
- Integrate F16 pipeline with existing inference infrastructure

F16 models like Llama 3.2 1B can now be loaded without F32 conversion,
saving 2x memory (2.4GB instead of 4.8GB).

Add F16 math functions and clean DSL usage:

DSL Improvements:
- Add Float16(value: Float) factory for clean constant creation
- Add F16 math functions: sin, cos, tan, pow, sqrt, exp, max, min, abs
- Add asFloat16 to Int32 and UInt32 for direct conversion
- Add toF16 extension for Float32

F16 Programs (all native F16, no F32 conversion):
- F16MatmulVecProgram: matrix-vector multiply in F16
- F16RMSNormProgram: RMS normalization with F16 sqrt
- F16SiLUProgram: SiLU activation native F16
- F16SwiGLUProgram: SwiGLU activation native F16
- F16RoPEProgram: Rotary embeddings with F16 sin/cos

All programs use native F16 operations throughout for maximum performance.
2x memory bandwidth savings vs F32.

CRITICAL FIX: The F16 pipeline was accumulating dot products and sums
in Float16, causing catastrophic precision loss for large reductions
(2048+ elements).

Changes:
- F16MatmulVecProgram: accumulate in F32, convert to F16 at end
- F16RMSNormProgram: accumulate sum of squares in F32
- F16SwiGLUProgram: compute sigmoid in F32 for better precision
- F16OutputProgram: accumulate logit computation in F32
- F16AttentionProgram: accumulate weighted sum in F32

This matches how proper F16 inference works:
- Read F16 weights/activations
- Compute in F32 for precision
- Write F16 output
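
A CPU-side sketch of the precision problem, simulating F16 storage with the JDK's binary16 helpers (Float.floatToFloat16 / float16ToFloat, Java 20+; an illustration, not the pipeline's code): once the accumulator is large, per-step rounding to F16 silently discards small contributions, which is exactly what accumulating in F32 avoids.

```scala
// Simulate "store F16, accumulate" on the CPU by rounding every value through binary16.
def f16(x: Float): Float = java.lang.Float.float16ToFloat(java.lang.Float.floatToFloat16(x))

val n = 2048
// One large contribution followed by many small ones, a shape that large
// dot products and sums of squares routinely produce.
val terms = Array.tabulate(n)(i => if i == 0 then 100.0f else 0.01f)

// Buggy: the running sum is rounded back to F16 after every add, so once the
// accumulator is large, the small contributions round away entirely.
val accF16 = terms.foldLeft(0.0f)((acc, t) => f16(acc + f16(t)))

// Fixed: accumulate in F32 and convert once at the end.
val accF32 = f16(terms.foldLeft(0.0f)((acc, t) => acc + f16(t)))

println(s"accumulate in F16: $accF16, accumulate in F32: $accF32")
// Roughly 100.0 vs roughly 120.5: the F16 accumulator silently dropped 2047 small terms.
```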

Also adds F16KVCacheDebugTest for comparing CPU vs F16 GPU outputs.

The test showed:
- CPU: 'The capital of France is' -> 'Paris' (correct)
- F16 GPU before fix: -> '\n\n' (wrong)
- After fix: should match CPU