LLama inference implementation in Scala #82
Open
szymon-rd wants to merge 22 commits into main from llm.scala
Conversation
Fix two critical bugs that were causing the LLM to produce nonsensical text:
1. KVCachedAttention: workgroupSize=32 but headSize=64 meant only half the output dimensions were computed. Added a second pass for dims 32-63.
2. KVCachedAttention: unlike FlashAttention, masked positions were not initialized to -10000, leaving garbage in shared memory that corrupted the softmax computation. Masked positions are now written as -10000.

Also removed an incorrect transpose of the token embeddings: the GGUF column-major layout [C,V] already stores token t at indices [t*C, (t+1)*C), matching what the GPU embedding lookup expects.
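For reference, the logic the second fix restores looks roughly like the plain-Scala sketch below (illustrative only, not the Cyfra kernel; `maskedSoftmax` and its inputs are hypothetical names). The key point is that every score slot beyond the current position must be explicitly written with -10000 before softmax, otherwise stale buffer contents leak into the normalization.

```scala
// Plain-Scala reference for the masking logic (illustrative, not the Cyfra kernel).
// Scores for positions > pos are forced to -10000 so they vanish after softmax,
// instead of leaving whatever was previously in the (shared-memory) score buffer.
def maskedSoftmax(scores: Array[Float], pos: Int): Array[Float] = {
  val masked = Array.tabulate(scores.length) { i =>
    if (i <= pos) scores(i) else -10000f // masked positions must be written explicitly
  }
  val maxS = masked.max // subtract the max for numerical stability
  val exps = masked.map(s => math.exp(s - maxS).toFloat)
  val sum  = exps.sum
  exps.map(_ / sum)
}
```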
- Consolidate all programs into the /programs directory (remove inline duplicates from the pipeline)
- Add an optional weightOffset parameter to programs for layered/pipeline use
- Add GIO.readVec4 for vectorized buffer reads (Vec4[UInt32], Vec4[Float32])
- Add ReadBufferVec4U/F expressions with compiler support
- Add CopyProgram and EmbeddingProgram as standalone programs
- Remove QuantizedMatmulVecProgram (superseded by Q4K/Q6K)
- Clean up debug/exploration tests (keep regression tests)
- Add OPTIMIZATION_OPPORTUNITIES.md documenting the Cyfra vs llama.cpp analysis
- Add Q4KBenchmarkTest for performance comparison

This reduces LlamaGPUPipeline from ~2400 to ~2000 lines by removing duplicate program definitions. Programs now use a default weightOffset=0 for standalone testing and accept offsets for pipeline integration, as sketched below.
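A rough illustration of the weightOffset convention (a sketch with hypothetical names, not the actual program definition): standalone tests rely on the default of 0, while the fused pipeline passes each layer's offset into a shared packed weight buffer.

```scala
// Illustrative sketch of the weightOffset convention (not the real program signature):
// standalone tests rely on the default of 0, while the pipeline passes the offset of
// the current layer's weights inside a single packed weight buffer.
def matmulVecRow(weights: Array[Float], x: Array[Float], row: Int, cols: Int,
                 weightOffset: Int = 0): Float = {
  var acc = 0f
  var j = 0
  while (j < cols) {
    acc += weights(weightOffset + row * cols + j) * x(j)
    j += 1
  }
  acc
}
```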
- Introduce GenerationLayout, which holds both prefill and decode buffers
- Upload weights to the GPU once and reuse them across all token generations
- Allocate the KV cache as a GPU-only buffer, with no CPU roundtrips
- Use a foldLeft pattern in KVCachedPipeline.generate() to build the full computation graph before a single runUnsafe call (see the sketch below)
- Add performance profiling tests and microbenchmarks
- Add .gguf to gitignore

Known limitation: still O(N²) scaling, because a different seqLen per step requires a different compiled pipeline. Further optimization is needed to make seqLen a runtime parameter in the attention kernels.
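The foldLeft pattern can be sketched in plain Scala as follows; `Plan`, `appendDecodeStep`, and `runOnce` are hypothetical stand-ins for the pipeline's graph-building API. The point is that the fold only accumulates a description of the work, and the GPU is invoked once at the end.

```scala
// Sketch of the foldLeft pattern: accumulate one description of the whole decode
// loop, then execute it once. Plan, appendDecodeStep and runOnce are hypothetical
// stand-ins for the pipeline's real graph-building API.
final case class Plan(steps: Vector[Int])
def appendDecodeStep(plan: Plan, pos: Int): Plan = plan.copy(steps = plan.steps :+ pos)
def runOnce(plan: Plan): Unit = println(s"executing ${plan.steps.size} steps in one submission")

val positions = 0 until 32
val fullPlan = positions.foldLeft(Plan(Vector.empty[Int])) { (plan, pos) =>
  appendDecodeStep(plan, pos) // no GPU work yet, just graph construction
}
runOnce(fullPlan) // single runUnsafe-style call at the end
```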
- Add an AttentionParams struct with a runtime seqLen field
- Change numScoreIterations to use maxSeqLen (compile-time loop bounds)
- Use takeWhile(_ < seqLenVal) for runtime early termination (see the analogue below)
- A single decode pipeline now works for all sequence lengths
- Eliminates the O(N²) overhead of building N different pipelines

Performance improvement: 0.2 tok/s -> 10.4 tok/s (50x speedup). Scaling is now linear: 32 tokens at 19.1 tok/s.
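In plain Scala the trick looks like this (an analogue, not the shader code): the loop bound is the compile-time maxSeqLen, and takeWhile cuts the iteration off at the runtime seqLen, so a single compiled kernel covers every sequence length.

```scala
// Plain-Scala analogue of the runtime-seqLen trick: the loop bound is the
// compile-time maxSeqLen, but iterations past the runtime seqLen are skipped,
// so one compiled kernel serves every sequence length.
val maxSeqLen = 2048 // compile-time bound baked into the shader
def attendedPositions(seqLen: Int): Seq[Int] =
  (0 until maxSeqLen).takeWhile(_ < seqLen) // runtime early termination

assert(attendedPositions(7) == (0 until 7))
```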
The KV-cached decode pipeline was producing garbage output because RoPE used the compile-time startPos=255 (maxSeqLen-1) for all decode steps instead of the actual runtime positions (5, 6, 7, ...).

Changes:
- RoPE now reads startPos from the AttentionParams uniform at runtime
- Added an attnParams field to all pipeline layouts (PipelineLayout, Q4KPipelineLayout, MixedQuantPipelineLayout, KVCachePipelineLayout)
- Updated all RoPE ProgramLayout usages to pass attnParams

Also adds:
- Interactive Chatbot app with top-p sampling (sketched below)
- Improved benchmark test (128 tokens, 5 runs, warmup)

Performance: 13.7 tok/s average on TinyLlama 1.1B with KV cache.
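For context, a minimal top-p (nucleus) sampling sketch in plain Scala, assuming `probs` is an already-normalized distribution over the vocabulary; this is illustrative and not the Chatbot's exact implementation.

```scala
import scala.util.Random

// Keep the smallest set of highest-probability tokens whose mass reaches topP,
// renormalize within that set, and sample from it.
def sampleTopP(probs: Array[Float], topP: Float, rng: Random = new Random()): Int = {
  val sorted     = probs.zipWithIndex.sortBy(-_._1)        // highest probability first
  val cumulative = sorted.scanLeft(0f)(_ + _._1).tail      // running probability mass
  val cutoff = cumulative.indexWhere(_ >= topP) match {    // smallest nucleus covering topP
    case -1 => sorted.length - 1
    case i  => i
  }
  val nucleus = sorted.take(cutoff + 1)
  val mass    = nucleus.map(_._1).sum
  var r = rng.nextFloat() * mass                           // sample within the nucleus
  nucleus.find { case (p, _) => { r -= p; r <= 0f } }.map(_._2).getOrElse(nucleus.last._2)
}
```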
- Increase MAX_SEQ_LEN to 2048 (8 KB of shared memory, well within limits)
- Update Chatbot and the benchmark to use the 2048 context
- Performance unchanged at ~13.6 tok/s (masked positions are cheap)
- KV cache is now 44 MB (was 5.5 MB) and still fits easily in GPU memory (see the back-of-the-envelope check below)
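A back-of-the-envelope check of the sizes above, assuming TinyLlama 1.1B shapes (22 layers, 4 KV heads of dim 64, so kvDim = 256) and F32 cache entries; these shapes are an assumption for illustration, not taken from the code.

```scala
// Assumed TinyLlama 1.1B shapes for illustration only.
val layers        = 22
val kvDim         = 256 // 4 KV heads x headDim 64
val bytesPerEntry = 4   // F32 cache entries

def cacheBytes(maxSeqLen: Int): Long = layers.toLong * maxSeqLen * kvDim * bytesPerEntry

println(cacheBytes(256)  / 1024.0 / 1024.0) // ~5.5 MiB at the old 256-token limit
println(cacheBytes(2048) / 1024.0 / 1024.0) // ~44 MiB at 2048 tokens
println(2048 * 4 / 1024)                    // 8 KB of shared memory for F32 scores
```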
- Add a GSeq.unroll DSL feature that generates the [[unroll]] pragma in SPIR-V
- Optimize the Q4K matmul: replace manual unrolling with GSeq.limit(8).unroll
- Shader size reduction: 45,000 lines → 219 lines (200x smaller)
- Performance improvement: ~15 tok/s → 30 tok/s (2x speedup)
- Add tied-embeddings support for models like Llama 3.2

The compact shader with proper unroll hints dramatically improves instruction cache performance, which accounts for the 2x speedup.
- Add a Float16 Value type for half-precision floats
- Add GIO.readF16 for reading F16 data from buffers
- Add a ReadBufferF16 expression type
- Foundation for F16 weight support in future updates
Add complete Float16 math support throughout Cyfra.

DSL (cyfra-dsl):
- Add a Float16 Value type alongside Float32
- Add a ToFloat16 conversion expression
- Add Float16 subgroup operations (subgroupAddF16, subgroupMinF16, subgroupMaxF16)
- Add GIO subgroup methods for F16 with @targetName annotations
- Add GIO.readF16 for reading F16 from buffers

Compiler (cyfra-compiler):
- Add Float16Tag and LFloat16Tag type tags
- Add the F16 type definition (OpTypeFloat 16) to the SPIR-V types
- Add F16 constant handling with floatToFloat16 conversion
- Add the F16 type stride (2 bytes)
- Add F16 conversions (ToFloat16, ToFloat32 between F16/F32) using OpFConvert
- Add F16 to the basic types list

This enables native F16 compute in shaders without F32 conversion, saving 2x memory bandwidth for F16 weights and activations.
Add LlamaF16Pipeline with native F16 compute support:
- F16PipelineLayout with all buffers in Float16 (except logits)
- F16CachedPipeline class for weight management
- F16MatmulVecProgram using native F16 operations
- F16ModelWeights structure for F16 bytes (no F32 conversion)

DSL improvements:
- Add BasicScalarAlgebra[Float16] for math operations
- Add Float16 conversions (asFloat32, asInt, etc.)
- Fix subgroup ops and constants for F16

This enables 2x memory savings for F16 models like Llama 3.2 1B. The pipeline structure is ready; the programs still need to be completed.
- Add GGUFReader.readTensorF16Bytes() to read F16 tensors without conversion
- Add LlamaInference.loadF16Weights() for native F16 weight loading
- Add a LlamaInference.getF16CachedPipeline() getter
- Add an allWeightsAreF16 check for F16 model detection
- Integrate the F16 pipeline with the existing inference infrastructure

F16 models like Llama 3.2 1B can now be loaded without F32 conversion, saving 2x memory (2.4 GB instead of 4.8 GB).
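Loading F16 weights without expansion amounts to reading a tensor's raw bytes and keeping them as-is. The sketch below is illustrative plain Scala, not the actual GGUFReader API; the offset and element-count parameters are hypothetical inputs.

```scala
import java.io.RandomAccessFile

// Illustrative sketch of reading an F16 tensor's raw bytes without converting to F32
// (not the actual GGUFReader API; byteOffset and numElements are hypothetical inputs).
def readF16TensorBytes(path: String, byteOffset: Long, numElements: Int): Array[Byte] = {
  val file = new RandomAccessFile(path, "r")
  try {
    val buf = new Array[Byte](numElements * 2) // 2 bytes per F16 element, kept as-is
    file.seek(byteOffset)
    file.readFully(buf)
    buf // uploaded to the GPU without F32 expansion
  } finally file.close()
}
```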
Add F16 math functions and clean up DSL usage.

DSL improvements:
- Add a Float16(value: Float) factory for clean constant creation
- Add F16 math functions: sin, cos, tan, pow, sqrt, exp, max, min, abs
- Add asFloat16 to Int32 and UInt32 for direct conversion
- Add a toF16 extension for Float32

F16 programs (all native F16, no F32 conversion):
- F16MatmulVecProgram: matrix-vector multiply in F16
- F16RMSNormProgram: RMS normalization with F16 sqrt
- F16SiLUProgram: SiLU activation in native F16
- F16SwiGLUProgram: SwiGLU activation in native F16
- F16RoPEProgram: rotary embeddings with F16 sin/cos

All programs use native F16 operations throughout for maximum performance, with 2x memory bandwidth savings vs F32.
CRITICAL FIX: the F16 pipeline was accumulating dot products and sums in Float16, causing catastrophic precision loss for large reductions (2048+ elements).

Changes:
- F16MatmulVecProgram: accumulate in F32, convert to F16 at the end
- F16RMSNormProgram: accumulate the sum of squares in F32
- F16SwiGLUProgram: compute the sigmoid in F32 for better precision
- F16OutputProgram: accumulate the logit computation in F32
- F16AttentionProgram: accumulate the weighted sum in F32

This matches how proper F16 inference works:
- Read F16 weights/activations
- Compute in F32 for precision
- Write F16 output

Also adds F16KVCacheDebugTest for comparing CPU vs F16 GPU outputs. The test showed:
- CPU: 'The capital of France is' -> 'Paris' (correct)
- F16 GPU before the fix: -> '\n\n' (wrong)
- After the fix: should match the CPU output
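The accumulation pattern the fix restores can be shown with a plain-Scala CPU reference (not the GPU program); it assumes JDK 20+ for java.lang.Float.float16ToFloat / floatToFloat16.

```scala
// CPU reference for the accumulation fix (plain Scala, not the GPU program):
// inputs and output stay F16, but the running sum is kept in F32 so that long
// reductions (2048+ elements) do not lose precision.
// Assumes JDK 20+ for java.lang.Float.float16ToFloat / floatToFloat16.
def dotF16(weightsF16: Array[Short], xF16: Array[Short]): Short = {
  var acc = 0.0f // accumulate in F32
  var i = 0
  while (i < weightsF16.length) {
    acc += java.lang.Float.float16ToFloat(weightsF16(i)) *
           java.lang.Float.float16ToFloat(xF16(i))
    i += 1
  }
  java.lang.Float.floatToFloat16(acc) // convert to F16 only at the end
}
```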
WIP (shows spurious conflicts because the branch still needs to be rebased).
The DSL needs cleanup; GShared is awkward at the moment.
Some sources still need to be removed.