Active development · Lean 4 formalization in progress

Proof that the claimed model ran and that the delivered answer was not tampered with

CommitLLM returns a compact receipt binding the checkpoint, deployment manifest, prompt, sampling randomness, decode policy, and the path that produced the delivered answer. A verifier checks on CPU that the delivered answer matches the committed execution trace, not a post-hoc rewrite.

  • Checkpoint + config bound
  • Prompt + randomness bound
  • Decode policy replayed
  • Delivered answer bound
Sample receipt
Response verified
  • Checkpoint + quantization bound
  • Decode policy replayed
  • Delivered answer bound
  • Challenge tok 67 · L7/23/41/58
  • Routine audit 1.3 ms/tok
  • Verifier CPU only

If a provider rewrites, cleans up, or post-processes the answer after decoding, the receipt no longer verifies.


The receipt binds the model, decode path, and delivered answer

CommitLLM does more than attest that some checkpoint ran: it binds the request, deployment choices, sampling policy, and answer path that determine the delivered answer.

Model + deployment

Which system actually ran

Checkpoint identity, quantization, adapter state, and runtime configuration.

Decode policy

How the answer was produced

Sampler, temperature, penalties, grammar constraints, and stop rules.

Delivered answer

What answer was actually delivered

Detokenization, cleanup, whitespace handling, and formatting on the committed path.

Receipt field          What it binds
input_spec_hash        Tokenizer, chat template, BOS/EOS, truncation, padding, and system prompt rules
request_binding        Prompt tokens, transcript state, retained audit window, and token count
model_spec_hash        Checkpoint identity, quantization, LoRA/adapter, RoPE config, RMSNorm ε
decode_spec_hash       Sampler, temperature, top-k/p, penalties, logit bias, grammar, and stop rules
sampling_randomness    The randomness actually used for non-greedy decoding, committed before replay of the output policy
output_spec_hash       Detokenization, cleanup, whitespace normalization, and output formatting
Consequence: if the provider rewrites, reformats, or otherwise tampers with the delivered answer outside the committed execution path, verification fails.
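As an illustration of how these fields could compose into a single commitment, here is a minimal Python sketch. The field names follow the table above, but the serialization, hash function, and nesting are assumptions of this sketch, not the protocol's actual encoding.

```python
# Illustrative receipt: bind the spec hashes and the delivered answer
# into one digest. Field names follow the receipt table; the hash
# construction and placeholder values here are assumptions.
import hashlib
import json

def h(obj) -> str:
    """Hash a canonically serialized object (illustrative encoding)."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

receipt = {
    "input_spec_hash":     h({"tokenizer": "placeholder", "chat_template": "placeholder"}),
    "request_binding":     h({"prompt_tokens": [1, 2, 3], "token_count": 3}),
    "model_spec_hash":     h({"checkpoint": "placeholder", "quant": "W8A8"}),
    "decode_spec_hash":    h({"sampler": "top_p", "temperature": 0.7}),
    "sampling_randomness": h({"seed": 12345}),  # committed before output replay
    "output_spec_hash":    h({"detok": "placeholder", "whitespace": "normalized"}),
    "answer_hash":         h({"text": "delivered answer"}),
}
receipt_digest = h(receipt)

# Any post-hoc rewrite of the delivered answer changes answer_hash,
# so the receipt digest no longer matches and verification fails.
tampered = dict(receipt, answer_hash=h({"text": "cleaned-up answer"}))
assert h(tampered) != receipt_digest
```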

Serve normally. Open on challenge. Verify on CPU.

CommitLLM stays close to the kept serving path. Expensive work appears only when a verifier asks to inspect a committed region.

Step 1

Serve and commit

The provider stays on the normal GPU path and returns the delivered answer plus a compact receipt binding the model, request, randomness, and output policy.

Step 2

Open on challenge

The verifier challenges token positions and layers after commitment. Routine audit samples the retained window; deep audit opens more of it.

Step 3

Verify the delivered answer

CPU-side checks tie the delivered answer back to model identity, decode policy, and the committed execution trace.


Measured on the current prototype

Current figures on the corrected replay path for Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8. Routine audit keeps verifier work CPU-only.

Current measured path: Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8
Verifier hardware: CPU only
Provider path: normal GPU serving with tracing

Verifier cost · Llama 70B

  • Routine audit: 1.3 ms per token
  • Full audit: 10 ms per token

Online tracing overhead

  • Base: baseline serving
  • +Trace: +12–14% over baseline

Attention corridor · Qwen2.5-7B-W8A8

  • L∞: 8
  • frac_eq: >92%
  • frac≤1: >99.8%

Attention corridor · Llama-3.1-8B-W8A8

  • L∞: 9
  • frac_eq: 94–96%
  • frac≤1: >99.9%
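The corridor metrics above compare the CPU attention replay against the committed output elementwise. A minimal sketch of how such metrics can be computed (the toy arrays below are made up for illustration; the real inputs are quantized activations):

```python
# Corridor metrics between a committed output and its CPU replay,
# computed elementwise over integer-quantized activations.
def corridor_metrics(committed, replayed):
    diffs = [abs(a - b) for a, b in zip(committed, replayed)]
    n = len(diffs)
    return {
        "L_inf":    max(diffs),                      # worst-case deviation
        "frac_eq":  sum(d == 0 for d in diffs) / n,  # exactly equal positions
        "frac_le1": sum(d <= 1 for d in diffs) / n,  # within one unit
    }

committed = [10, 12, 7, 7, 3, 0, 5, 9, 9, 4]
replayed  = [10, 12, 7, 8, 3, 0, 5, 9, 9, 4]
m = corridor_metrics(committed, replayed)
# One position differs by 1: L_inf = 1, frac_eq = 0.9, frac_le1 = 1.0
```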

What is exact, approximate, and statistical

Decode is fail-closed. The remaining boundary lives inside native attention and unopened retained state during routine audit.

  • Input — Exact
  • Embedding — Exact
  • Shell matmuls — Freivalds (algebraic check)
  • INT8 bridges — Exact
  • Prefix/KV — Statistical; exact in deep audit
  • Attention — Approximate (FP16/BF16)
  • Final tail — Exact
  • LM head — Freivalds (algebraic check)
  • Decode — Fail-closed

Routine audit stays cheap. Deep audit upgrades coverage.

CommitLLM uses the same receipt in both modes. Routine audit keeps steady-state verification light; deep audit opens the full retained window and upgrades prefix provenance to exact verification.

Routine audit

Low-friction spot checks

Designed for normal operation when you want frequent verification without opening the full trace every time.

  • Freivalds-based checks on large linear layers
  • Canonical replay for supported nonlinear subcomputations
  • Sampled prefix and KV provenance with statistical coverage
  • Bounded approximate attention replay on CPU
Deep audit

Escalate when the stakes are higher

Use the same commitment, but require a larger opening. This removes the routine-audit statistical gap on the retained prefix window.

  • Full-prefix and KV openings across the retained audit window
  • Exact prefix provenance instead of sampled coverage
  • The same algebraic, replay, and decode checks as routine audit
  • Higher bandwidth and storage cost, not a different serving path
Operationally: routine audit is the default posture; deep audit is the escalation path when a response is high value, disputed, or randomly selected for full review.

Built to sit beside real serving stacks

CommitLLM is not a replacement inference engine. The provider keeps the normal GPU path and produces request-scoped evidence alongside it.

Supported now
Per-request receipts survive batching

Continuous batching and paged attention

Many user requests can share the same GPU microbatch. CommitLLM still produces per-request receipts and per-request audits.

Supported now
No proof-specific kernel swap

Tensor parallelism and fused kernels

The tracing layer follows the existing execution path instead of replacing production kernels with proof-friendly substitutes.

Supported now
Measured on production-style W8A8

Quantized serving

Quantization metadata is receipt-bound, and the kept path is measured on production-style W8A8 checkpoints.

Not the current story
Unsupported shortcuts should fail closed

Cross-request cache reuse and shortcut decoding

Cross-request prefix caching, speculative decoding, and other semantics-changing shortcuts need more protocol work. Unsupported paths should fail closed.
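The fail-closed posture can be pictured as an allowlist over the deployment manifest: any serving feature the protocol does not cover becomes a verification failure rather than a silently skipped check. A sketch (the feature names are illustrative, not the protocol's actual manifest schema):

```python
# Fail-closed feature gating: a manifest feature the receipt protocol
# does not cover rejects verification instead of passing silently.
SUPPORTED = {"continuous_batching", "paged_attention", "tensor_parallel", "w8a8_quant"}

def verify_features(manifest_features):
    unsupported = set(manifest_features) - SUPPORTED
    if unsupported:
        # Fail closed: an unknown shortcut is a verification failure,
        # not a skipped check.
        raise ValueError(f"unsupported serving features: {sorted(unsupported)}")
    return True

assert verify_features({"paged_attention", "w8a8_quant"})
try:
    verify_features({"speculative_decoding"})  # semantics-changing shortcut
    rejected = False
except ValueError:
    rejected = True
assert rejected
```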


Execution integrity for every deployment

Model identity is only part of the story. CommitLLM matters wherever decode policy and the delivered answer need to be auditable.

Buyer protection

Enterprise procurement

Paying for Llama 70B? Get proof that the provider served that checkpoint and did not rewrite the delivered answer after decoding.

Compliance trail

Regulated deployments

Banks, hospitals, and legal teams need an auditable chain from prompt and model version to decode policy and the delivered answer.

Trustless marketplaces

Decentralized compute

Networks like Gensyn, Ritual, or Bittensor cannot rely on “trust the node.” CommitLLM provides the missing verification layer.

Action governance

Agent systems

When an agent takes action, model provenance and the untampered delivered answer become governance and liability questions.


Between fingerprints and full proving

Heuristics are too weak, full proofs are still too expensive, and deployments need a middle ground that binds the delivered answer.

Too weak

Fingerprinting

Statistical heuristics can provide evidence, but they do not exactly verify a specific response or rule out post-hoc rewriting by a determined provider.

CommitLLM

Commit-and-audit

Commitment-bound end-to-end. Large linear layers are checked algebraically, supported nonlinear pieces are replayed canonically, and the delivered answer is tied to the committed path.

Too expensive

Full proofs

Strong proof objects are attractive, but prover costs remain too high for production LLM serving at realistic scale today.


Setup once. Commit every response. Verify on challenge.

For readers who want the mechanics rather than the product pitch. The verifier holds a secret key derived from public weights, the provider commits inline during serving, and expensive openings happen only on challenge.

Phase 0 · Setup

Build the verifier key

From a public checkpoint, the verifier computes a Merkle root over weights, secret Freivalds vectors for eight matrix families (Wq, Wk, Wv, Wo, Wgate, Wup, Wdown, LM_head), and the model configuration needed for canonical replay.
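As a sketch of the weight-commitment half of setup, a minimal Merkle root over weight chunks might look like the following. The chunk size, leaf encoding, and hash choice here are assumptions; the real construction is defined by the protocol.

```python
# Minimal Merkle root over weight chunks. Leaf encoding, chunking,
# and domain separation are simplified for illustration.
import hashlib

def sha(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    level = [sha(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Pretend each chunk is a serialized slice of the checkpoint weights.
chunks = [bytes([i]) * 32 for i in range(8)]
root = merkle_root(chunks)

# Changing a single weight chunk changes the root, so the verifier key
# pins one exact checkpoint identity.
chunks[3] = b"\xff" * 32
assert merkle_root(chunks) != root
```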

Phase 1 · Commit

Serve normally, return a receipt

The provider runs inference on the normal GPU path with a tracing sidecar that captures retained state. It returns the response plus a compact receipt binding the execution trace, KV state, deployment manifest, prompt, sampling randomness, output policy, and token count.

Phase 2 · Audit

Challenge specific positions and layers

The verifier challenges token positions and layers after the commitment. The provider opens the requested region. Routine audit samples prefix state; deep audit opens everything in the retained window.

Phase 3 · Verify

CPU-only checks

Embedding Merkle proof. Freivalds on shell matmuls. Exact INT8 bridge recomputation. KV provenance. Attention replay against committed post-attention output. Final-token tail from captured residual. LM-head binding. Decode and output policy replay.


Verify huge matrix multiplies cheaply

The provider claims z = W·x for a public weight matrix W. Recomputing the full product is expensive. Freivalds’ algorithm gives a much cheaper check: the verifier precomputes v = rᵀW with a secret random vector r, then checks v·x ≟ rᵀ·z in the finite field F_p, where p = 2³² − 5.

If z ≠ W·x, the check fails with probability ≥ 1 − 1/p. This is information-theoretically sound. Transformers are mostly matrix multiplication; once those multiplies are cheap to audit, the verifier can check model identity without rerunning the full model.
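A toy version of the check over the same field F_p with p = 2³² − 5 (the real verifier fixes r at setup and precomputes v = rᵀW once per weight matrix; the 3×3 example here is only for illustration):

```python
# Freivalds-style check over the prime field F_p, p = 2**32 - 5.
import random

P = 2**32 - 5  # prime modulus used for the algebraic checks
random.seed(1234)  # deterministic for this demo; the real r stays secret

def precompute_v(W, r):
    """v = r^T W (mod p), computed once at setup for a public matrix W."""
    rows, cols = len(W), len(W[0])
    return [sum(r[i] * W[i][j] for i in range(rows)) % P for j in range(cols)]

def freivalds_check(v, r, x, z):
    """Accept iff v.x == r^T z (mod p); a wrong z passes w.p. <= 1/p."""
    lhs = sum(vj * xj for vj, xj in zip(v, x)) % P
    rhs = sum(ri * zi for ri, zi in zip(r, z)) % P
    return lhs == rhs

# Toy 3x3 instance: honest z = Wx versus a tampered z.
W = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]
x = [5, 7, 9]
z = [sum(W[i][j] * x[j] for j in range(3)) % P for i in range(3)]

r = [random.randrange(P) for _ in range(3)]  # verifier's secret vector
v = precompute_v(W, r)

assert freivalds_check(v, r, x, z)           # honest product verifies
z_bad = list(z)
z_bad[0] = (z_bad[0] + 1) % P
assert not freivalds_check(v, r, x, z_bad)   # tampered output is caught
```

Note that the verifier's online cost is two length-n inner products, not a full matrix-vector product, which is what makes routine audit cheap.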


Start with the paper, then inspect the implementation

The homepage gives the high-level story. The paper, code, and formalization carry the full technical detail.

Read

Paper (PDF)

Protocol, guarantee boundary, threat model, and measurements in full detail.

Open paper
Inspect

GitHub repository

Reference implementation in Rust, Python, and Lean. Some internal crate paths still use the legacy verilm-* prefix while the rename completes.

Browse repository
Formalize

Lean 4 formalization

The formalization tracks the proof obligations behind the protocol and makes the security claims explicit.

Open Lean directory
Reproduce

Benchmark and repro scripts

Scripts for reproducing measurements on the current prototype, including the corrected replay path and corridor checks.

Open scripts