dontmerge: support talkie 13b #508

Draft

georgewhewell wants to merge 3 commits into hellas-ai:master from georgewhewell:grw/feat/local-model+talkie

Conversation

georgewhewell (Contributor) commented Apr 29, 2026

```console
❯ cargo run --release --features metal --example llama -- --backend candle -p 'i think that category theory is ' -m /tmp/talkie-convert/talkie-base-out/ --raw -k -s 40 --dtype bf16
    Finished `release` profile [optimized] target(s) in 0.12s
     Running `target/release/examples/llama --backend candle -p 'i think that category theory is ' -m /tmp/talkie-convert/talkie-base-out/ --raw -k -s 40 --dtype bf16`
Model weights loaded for /tmp/talkie-convert/talkie-base-out/ in 2.46 seconds
i think that category theory is 1t is the only theory that will explain the facts. The facts are that the world is a world of change, and that the only thing that is permanent is the law of change. The law
40 tokens generated in 19 seconds. (2.10 tps)
```

`get_model_files` and `get_model_chat_template` now treat the model
identifier as a local directory if it's an existing path on disk; that
directory must look like a HuggingFace snapshot (config.json,
tokenizer.json, tokenizer_config.json, and either model.safetensors
or model.safetensors.index.json + shards). Otherwise the existing HF
hub download path is used unchanged.
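
Roughly, the local-path check amounts to the following (a Python sketch of the described behavior; the real code lives in `get_model_files`/`get_model_chat_template` on the Rust side, and every name below is illustrative):

```python
from pathlib import Path

REQUIRED = ("config.json", "tokenizer.json", "tokenizer_config.json")

def looks_like_hf_snapshot(model_id: str) -> bool:
    """True if model_id is a local directory shaped like a HF snapshot."""
    d = Path(model_id)
    if not d.is_dir():
        return False  # not a local path -> caller falls back to the HF hub
    if not all((d / f).is_file() for f in REQUIRED):
        return False
    # Weights: a single safetensors file, or an index plus its shards.
    return (d / "model.safetensors").is_file() or (
        (d / "model.safetensors.index.json").is_file()
        and any(d.glob("model-*.safetensors"))
    )
```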
Talkie is a 40-layer/40-head decoder-only transformer (talkie-lm.com,
github.com/talkie-lm/talkie) with the standard Llama backbone plus four
small departures, all expressible with existing catgrad operators
(sketched after the list):

  1. RMSNorm everywhere is unweighted (F.rms_norm with no gamma),
     including a norm immediately after the embedding.
  2. QK-norm — RMSNorm is applied to Q and K after RoPE.
  3. Per-head and per-layer learned gains — head_gain ([H]) on Q after
     QK-norm, and scalar attn_gain / mlp_gain / embed_skip on the
     residual branches.
  4. Embedding-skip residual — the post-input-norm activations are
     threaded through every block as e_x and added back via a learned
     scalar.
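
Put together, one block looks roughly like this (a minimal PyTorch sketch under the assumptions above; `attn.project`/`attn.rope`/`attn.out` and `mlp` are hypothetical callables standing in for the usual Llama projections, not catgrad's API):

```python
import torch
import torch.nn.functional as F

def talkie_block(x, e_x, attn, mlp, head_gain, attn_gain, mlp_gain, embed_skip):
    """One decoder block with Talkie's four departures from vanilla Llama.

    x          -- [B, T, D] residual stream
    e_x        -- [B, T, D] post-input-norm embeddings, threaded through every block
    head_gain  -- [H] learned per-head gain on Q
    attn_gain, mlp_gain, embed_skip -- learned scalars on the residual branches
    """
    norm = lambda t: F.rms_norm(t, (t.shape[-1],))   # (1) unweighted, no gamma

    h = norm(x)
    q, k, v = attn.project(h)                        # [B, H, T, d_head] each
    q, k = attn.rope(q), attn.rope(k)
    q, k = norm(q), norm(k)                          # (2) QK-norm after RoPE
    q = q * head_gain.view(1, -1, 1, 1)              # (3) per-head gain on Q
    a = attn.out(F.scaled_dot_product_attention(q, k, v, is_causal=True))
    x = x + attn_gain * a                            # (3) scalar attn_gain
    x = x + mlp_gain * mlp(norm(x))                  # (3) scalar mlp_gain
    return x + embed_skip * e_x                      # (4) embedding-skip residual
```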

The lm_head is an untied [V, D] parameter (not a Linear) scaled by a
learned scalar (lm_head_gain.w_g) before the final matmul. Talkie's
RoPE uses the opposite sin convention from catgrad's default; we negate
cache.sin once after init to match.
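
Concretely, with toy shapes (everything here is illustrative; only `lm_head`, `lm_head_gain.w_g`, and the sin negation come from the description above):

```python
import torch

B, T, D, V = 1, 8, 64, 1000                # toy shapes for illustration
h = torch.randn(B, T, D)                   # final hidden states
lm_head = torch.randn(V, D)                # untied [V, D] parameter, not a Linear
w_g = torch.tensor(0.9)                    # the learned lm_head_gain.w_g scalar

logits = (w_g * h) @ lm_head.T             # scale, then the final matmul -> [B, T, V]

# Sin convention: cos is even and sin is odd, so rotating by -theta is the
# same as rotating by +theta with the sin table negated. One negation of the
# cached table right after init therefore converts between the two conventions.
inv_freq = 1.0 / 10_000 ** (torch.arange(0, D, 2, dtype=torch.float32) / D)
sin_cache = torch.sin(torch.outer(torch.arange(T, dtype=torch.float32), inv_freq))
sin_cache = -sin_cache
```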

Architecture string: TalkieForCausalLM. End-to-end inference reproduces
the upstream PyTorch reference byte-for-byte at greedy argmax for short
sequences in bf16; on longer sequences the cross-implementation bf16
noise floor (Metal vs CPU) flips one borderline argmax per ~40 tokens
on some prompts. Test harness in scripts/compare/talkie_compare.sh.
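
The "token-level stability matrix" boils down to comparing greedy decodes token by token across backends; a minimal sketch of that comparison (the shell harness itself was more elaborate):

```python
def first_divergence(tokens_a, tokens_b):
    """Index of the first greedy-argmax disagreement between two decodes
    (e.g. Metal vs CPU, or catgrad vs the PyTorch reference), else None."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    return None
```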

Helpers:
  - scripts/convert_talkie.py: pickle -> safetensors + tokenizer + config
- scripts/llm_talkie.py:     greedy-argmax PyTorch reference
  - scripts/compare/talkie_compare.sh: token-level stability matrix

The decoder stack now reads from `model.embed.weight`,
`model.blocks.{i}.…` — matching the HF port at
`lewtun/talkie-1930-13b-it-hf` (`TalkieForCausalLM` with `self.model =
TalkieModel(…)` and `lm_head`/`lm_head_gain.w_g` at the root).
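
A quick way to sanity-check that a downloaded checkpoint matches this layout (an assumed snippet; only the key names quoted above come from the PR, and the block-internal names stay elided):

```python
from safetensors import safe_open

def check_talkie_layout(path: str) -> None:
    """Assert the root-level key layout described above is present."""
    with safe_open(path, framework="pt") as f:
        keys = set(f.keys())
    assert "model.embed.weight" in keys
    assert "lm_head" in keys and "lm_head_gain.w_g" in keys
    assert any(k.startswith("model.blocks.0.") for k in keys)
```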

That repo includes a full HF-format checkpoint plus a `tokenizer.json`
already in HF tokenizers form, so our pickle→safetensors converter and
greedy-argmax reference are no longer needed:

  - rm catgrad-llm/scripts/convert_talkie.py
  - rm catgrad-llm/scripts/llm_talkie.py
  - rm catgrad-llm/scripts/compare/talkie_compare.sh

End-to-end run:

```console
./target/release/examples/llama -m lewtun/talkie-1930-13b-it-hf \
  -k -s 60 --dtype bf16 -p "Write a short poem about the wireless telegraph."
```