add llama2 70b lora training#3

Open
nbarbier-265 wants to merge 1224 commits into master from llama2-70b-lora

Conversation

@nbarbier-265

add llama2 70b lora training

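The PR title describes LoRA training for Llama-2 70B. As a rough illustration of the LoRA idea itself (a low-rank adapter on top of a frozen linear layer — not this branch's implementation, and all names below are hypothetical), a minimal NumPy sketch:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Linear layer with a LoRA adapter.

    W: frozen (out, in) weight. A: (r, in) and B: (out, r) are the
    trainable low-rank factors; only they receive gradients in training.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

# toy shapes: batch 2, in 8, out 4, rank 2
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
W = rng.normal(size=(4, 8))
A = rng.normal(size=(2, 8)) * 0.01  # A starts near zero
B = np.zeros((4, 2))                # B starts at zero, so the adapter is a no-op
y = lora_forward(x, W, A, B)
assert np.allclose(y, x @ W.T)      # with B = 0 the output matches the base layer
```

Initializing B to zero is the standard LoRA trick: the adapted model starts out exactly equal to the base model, and training only perturbs it through the low-rank path.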
sirhcm and others added 30 commits February 4, 2026 20:10
(tinygrad#14546)

* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16

* put that back

* cleaner

* do that once
* test asm_gemm in CI

* default float16

* use a smaller shape for multi

* smaller size

* smaller for CI

* smaller for ci

* need half
* grad_b uses custom gemm

* fix multi backward, acc is in float32

* test_gemm_batched

* square gemm

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
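The gemm commits above note that "acc is in float32". As a hedged sketch of why a float16 gemm accumulates in a wider type (a generic illustration, not this repo's custom gemm):

```python
import numpy as np

def gemm_fp16_acc32(a, b):
    """Multiply two float16 matrices, accumulating partial products in
    float32, then cast the result back down to float16."""
    return (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float16)

# summing thousands of small fp16 products needs a wider accumulator
a = np.full((1, 4096), 0.01, dtype=np.float16)
b = np.full((4096, 1), 0.01, dtype=np.float16)
wide = gemm_fp16_acc32(a, b)[0, 0]
ref = float(a.astype(np.float64) @ b.astype(np.float64))  # float64 reference
assert abs(float(wide) - ref) / ref < 0.01  # close to the wide-precision answer
```

Each partial product here is about 1e-4, far below float16's precision once the running sum grows, so a pure-fp16 accumulator would drift; accumulating in float32 keeps the reduction accurate before the final downcast.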
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp

* fix saturation in PYTHON_REMU

* simpler

* more tests, less lines

---------

Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
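The V_SUB_NC_U32 commits above fix saturation in the Python emulator. Assuming the usual meaning of the clamp bit on an unsigned integer subtract (saturate instead of wrap on underflow), a minimal model:

```python
def v_sub_nc_u32(a, b, clamp=False):
    """Toy model of a 32-bit unsigned subtract: wraps modulo 2**32 by
    default, saturates to 0 on underflow when clamp is set."""
    if clamp:
        return max(a - b, 0)       # unsigned underflow clamps to 0
    return (a - b) & 0xFFFFFFFF    # otherwise wrap modulo 2**32

assert v_sub_nc_u32(5, 7) == 2**32 - 2     # wrapping behavior
assert v_sub_nc_u32(5, 7, clamp=True) == 0 # saturating behavior
assert v_sub_nc_u32(7, 5, clamp=True) == 2 # no underflow: same either way
```

A failing test plus a fix like this is exactly the shape of the commit pair above: the emulator previously wrapped where the hardware would have clamped.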
symlinked model not allowed in latest onnxruntime
* pin onnxruntime to 1.23.2 for DSP

* list ml_dtypes instead

This reverts commit 84bb2cc.
* dtype decomps don't require bitshifts

* simplify shr/shl

* ruff
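The "dtype decomps don't require bitshifts" commit rests on a standard identity: for non-negative integers, shifts are just multiplication and floor division by powers of two. A quick check of that identity (illustrative only, not the decomposition code in this branch):

```python
def shl(x, n):
    return x * (2 ** n)   # x << n for non-negative x

def shr(x, n):
    return x // (2 ** n)  # x >> n for non-negative x

# exhaustively verify against the built-in shift operators for small values
for x in range(256):
    for n in range(8):
        assert shl(x, n) == x << n
        assert shr(x, n) == x >> n
```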
onnxruntime does not allow a symlink that points outside the model dir. Update snapshot_download to use local_dir instead of cache_dir, plus an ad hoc migration step to copy the existing model too.
* clean up linearize schedule [pr]

don't mix ScheduleItem and UOp in schedule queue

* ok
fixed wrong comments and simplified queue building
* rangeify always adds KernelInfo

* fix tests

* skip flaky test
* viz: cleanup amdgpu target mapping

* linter

* unwraps
* better

* bottom up earliest rewrites

* fix
* start

* x

* fix

* sdma

* c

* clean

* x

* hm

* clearer
chenyuxyz and others added 26 commits February 17, 2026 10:30
test that after calling .realize(), uop.is_realized is True. Currently not working for empty (and thus disk tensors) and const
concat schedules. separate out the execution part
automatically fixes is_realized issue for empty
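The three commits above tie is_realized to whether a backing buffer actually exists after the schedule runs. A hypothetical sketch of that flag's semantics (invented class, not tinygrad's actual LazyBuffer/UOp types):

```python
class LazyTensor:
    """Hypothetical lazy tensor: computation is deferred until realize()."""

    def __init__(self):
        self.buffer = None  # no device memory until realized

    @property
    def is_realized(self):
        # realized means a backing buffer has been allocated and filled
        return self.buffer is not None

    def realize(self):
        if self.buffer is None:
            self.buffer = bytearray(16)  # stand-in for executing the schedule
        return self

t = LazyTensor()
assert not t.is_realized   # nothing has run yet
t.realize()
assert t.is_realized       # after realize(), the flag must flip
```

Deriving the flag from the buffer's existence, rather than setting it separately, is what makes cases like empty tensors fall out "automatically" as the commit describes.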
LLVM should support e.g. SHL/SHR, but these were never actually rendered
this can crash, not sure why. skip 100 to see if it's better
* assign should be used as buffer

* late removed

* the fix

* better fix

* backward slice
* assign after copy shouldn't contig

* fix assign copy
* viz: start displaying pma

* s

* work

* colors

* cleaner

* max packets

* fine

* work

* pma

* diff cleanup
* setFocus is the clearer name

* do less
reshape is lazy now, so it's better to raise from the .axis call and not have the caller handle the invalid case
remove dead assert, also make it more like a view
* double e2m1 values for mxfp4

* check if assert equal works in ci

* Revert "check if assert equal works in ci"

This reverts commit 8cf902c.

* remove unnecessary whitespace change

* add test case that fails for old implementation but passes for new

* add note that the previous test is bad

* clarification on the methodology for the test

* fix the indent problem that happened to skip this test

* for now update mxfp4 block test to similarly use allclose (bad)

* add gist link and clearer explanation of process for computing test data
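The mxfp4 commits above revolve around the 4-bit e2m1 values. Per the OCP microscaling FP4 layout (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1), the eight magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6. A generic decoder sketch (not this repo's implementation):

```python
def decode_e2m1(code):
    """Decode a 4-bit e2m1 (FP4) code point to its float value."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    mant = code & 0b1
    if exp == 0:
        mag = mant * 0.5                           # subnormal: 0 or 0.5
    else:
        mag = (1.0 + 0.5 * mant) * 2.0 ** (exp - 1)  # normal, bias 1
    return sign * mag

assert [decode_e2m1(c) for c in range(8)] == [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Two e2m1 codes pack into each byte (the "double e2m1 values" in the first commit), so unpacking splits every byte into a low and a high nibble before decoding.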