Active development · Lean 4 formalization in progress

Proof the claimed model ran and the delivered answer was not tampered with

CommitLLM returns a compact receipt binding the checkpoint, deployment manifest, prompt, sampling randomness, decode policy, and the path that produced the delivered answer. A verifier checks on CPU that the delivered answer matches the committed execution trace, not a post-hoc rewrite.

Checkpoint + config bound Prompt + randomness bound Decode policy replayed Delivered answer bound

See audit flow Inspect the receipt

Sample receipt

Response verified

Checkpoint + quantization bound
Decode policy replayed
Delivered answer bound
Challenge tok 67 · L7/23/41/58
Routine audit 1.3 ms/tok
Verifier CPU only

If a provider rewrites, cleans up, or post-processes the answer after decoding, the receipt no longer verifies.

Receipt scope

The receipt binds the model, decode path, and delivered answer

CommitLLM does not only attest that some checkpoint ran. It binds the request, deployment choices, sampling policy, and answer path that determine the delivered answer.

Model + deployment

Which system actually ran

Checkpoint identity, quantization, adapter state, and runtime configuration.

Decode policy

How the answer was produced

Sampler, temperature, penalties, grammar constraints, and stop rules.

Delivered answer

What answer was actually delivered

Detokenization, cleanup, whitespace handling, and formatting on the committed path.

Receipt field	What it binds
input_spec_hash	Tokenizer, chat template, BOS/EOS, truncation, padding, and system prompt rules
request_binding	Prompt tokens, transcript state, retained audit window, and token count
model_spec_hash	Checkpoint identity R_W, quantization, LoRA/adapter, RoPE config, RMSNorm ε
decode_spec_hash	Sampler, temperature, top-k/p, penalties, logit bias, grammar, and stop rules
sampling_randomness	The randomness actually used for non-greedy decoding, committed before replay of the output policy
output_spec_hash	Detokenization, cleanup, whitespace normalization, and output formatting

Consequence: if the provider rewrites, reformats, or otherwise tampers with the delivered answer outside the committed execution path, verification fails.

How it works

Serve normally. Open on challenge. Verify on CPU.

CommitLLM stays close to the kept serving path. Expensive work appears only when a verifier asks to inspect a committed region.

Step 1

Serve and commit

The provider stays on the normal GPU path and returns the delivered answer plus a compact receipt binding the model, request, randomness, and output policy.

Step 2

Open on challenge

The verifier challenges token positions and layers after commitment. Routine audit samples the retained window; deep audit opens more of it.

Step 3

Verify the delivered answer

CPU-side checks tie the delivered answer back to model identity, decode policy, and the committed execution trace.

Measurements

Measured on the current prototype

Current figures on the corrected replay path for Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8. Routine audit keeps verifier work CPU-only.

Current measured path Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8

Verifier hardware CPU only

Provider path Normal GPU serving with tracing

Verifier cost · Llama 70B

Routine

1.3 ms

Full

10 ms

Online tracing overhead

Base

baseline

+Trace

+12-14%

Attention corridor · Qwen2.5-7B-W8A8

L∞

frac_eq

>92%

frac≤1

>99.8%

Attention corridor · Llama-3.1-8B-W8A8

L∞

frac_eq

94-96%

frac≤1

>99.9%

Guarantee boundary

What is verified, what is open

Embedding, shell matmuls, bridges, decode, and all bindings are independently verified. Attention at arbitrary token positions is an open problem on stock GPU kernels.

Input

Exact

Embedding

Exact

Shell matmuls

Freivalds

INT8 bridges

Exact

Prefix/KV

Statistical*

Attention

Open

Final tail

Exact

LM head

Freivalds

Decode

Exact*

Exact / canonical replay Algebraic checks Open — committed but not yet independently verified Statistical — exact in deep audit * Supported decode modes exact; unsupported modes fail closed

Audit modes

Routine audit stays cheap. Deep audit upgrades coverage.

CommitLLM uses the same receipt in both modes. Routine audit keeps steady-state verification light; deep audit opens the full retained window and upgrades prefix provenance to exact verification.

Routine audit

Low-friction spot checks

Designed for normal operation when you want frequent verification without opening the full trace every time.

Freivalds-based checks on large linear layers
Canonical replay for supported nonlinear subcomputations
Sampled prefix and KV provenance with statistical coverage
Attention output is committed; arbitrary-position verification is open

Deep audit

Escalate when the stakes are higher

Use the same commitment, but require a larger opening. This removes the routine-audit statistical gap on the retained prefix window.

Full-prefix and KV openings across the retained audit window
Exact prefix provenance instead of sampled coverage
The same algebraic, replay, and decode checks as routine audit
Higher bandwidth and storage cost, not a different serving path

Operationally: routine audit is the default posture; deep audit is the escalation path when a response is high value, disputed, or randomly selected for full review.

Deployment fit

Built to sit beside real serving stacks

CommitLLM is not a replacement inference engine. The provider keeps the normal GPU path and produces request-scoped evidence alongside it.

Supported now

Per-request receipts survive batching

Continuous batching and paged attention

Many user requests can share the same GPU microbatch. CommitLLM still produces per-request receipts and per-request audits.

Supported now

No proof-specific kernel swap

Tensor parallelism and fused kernels

The tracing layer follows the existing execution path instead of replacing production kernels with proof-friendly substitutes.

Supported now

Measured on production-style W8A8

Quantized serving

Quantization metadata is receipt-bound, and the kept path is measured on production-style W8A8 checkpoints.

Not the current story

Unsupported shortcuts should fail closed

Cross-request cache reuse and shortcut decoding

Cross-request prefix caching, speculative decoding, and other semantics-changing shortcuts need more protocol work. Unsupported paths should fail closed.

Use cases

Execution integrity for every deployment

Model identity is only part of the story. CommitLLM matters wherever decode policy and the delivered answer need to be auditable.

Buyer protection

Enterprise procurement

Paying for Llama 70B? Get proof the provider served that checkpoint and did not rewrite the delivered answer after decoding.

Compliance trail

Regulated deployments

Banks, hospitals, and legal teams need an auditable chain from prompt and model version to decode policy and the delivered answer.

Trustless marketplaces

Decentralized compute

Networks like Gensyn, Ritual, or Bittensor cannot rely on “trust the node.” CommitLLM provides the missing verification layer.

Action governance

Agent systems

When an agent takes action, model provenance and the untampered delivered answer become governance and liability questions.

Positioning

Between fingerprints and full proving

Heuristics are too weak, full proofs are still too expensive, and deployments need a middle ground that binds the delivered answer.

Too weak

Fingerprinting

Statistical heuristics can provide evidence, but they do not exactly verify a specific response or rule out post-hoc rewriting by a determined provider.

CommitLLM

Commit-and-audit

Commitment-bound end-to-end. Large linear layers are checked algebraically, supported nonlinear pieces are replayed canonically, and the delivered answer is tied to the committed path.

Too expensive

Full proofs

Strong proof objects are attractive, but prover costs remain too high for production LLM serving at realistic scale today.

FAQ

The questions serious readers ask first

The important distinctions are threat model, cost model, and what is exact versus approximate. This protocol is not a SNARK, not a TEE, and not naive spot checking. It is a commit-and-audit system for remote inference.

The key distinction: the provider commits before they know whether any part of the response will be audited.

Problem

What problem does CommitLLM solve?

It gives a remote caller evidence that the provider actually served the claimed model and did not rewrite the delivered answer after decoding. The receipt binds model identity, deployment choices, decode policy, randomness, and the answer path into one auditable object.

Baseline

Why not just rerun the model?

Because that assumes the caller can afford to run the same checkpoint, quantization, and serving stack themselves. Often they cannot. CommitLLM is built for the more common server-vs-client setting where the model is remote and the caller needs something stronger than trust without reproducing the full computation locally.

Security model

Is this just spot checking?

No. Ordinary spot checking is weak because the provider can behave honestly when watched and cheat when unwatched. Here the provider commits first. The challenge happens after commitment, so a forged answer cannot be swapped out once the provider learns that an audit is coming.

Commitment

What does precommitment actually buy?

It forces the provider to bind themselves to one concrete execution trace before they know which tokens or layers will be challenged. If they serve a forged answer and that request is challenged later, they must open the same forgery they already committed to or fail verification.

Scope

What exactly is bound in the receipt?

The checkpoint, quantization, runtime manifest, prompt handling, sampling randomness, decode policy, token count, and output path. The question is not just “did some model produce something plausible,” but “did the claimed deployment produce this answer under the claimed policy.”

Coverage

What is exact, what is open, and what is statistical?

Large INT8 matrix multiplications are checked by exact algebraic fingerprints. Supported nonlinear boundaries are replayed canonically. Sampled decode correctness is checked from exact captured GPU logits plus algebraic LM-head binding; cheaper replay-based paths remain acceptable for some greedy profiles. Attention at arbitrary token positions is an open problem: stock GPU FlashAttention bf16 arithmetic does not match verifier-side f64 replay beyond token 0. Routine audit gives statistical coverage on unopened retained state; deep audit opens more of that state and upgrades coverage.

Boundary

Does this verify the whole model exactly?

No. The protocol is honest about its guarantee boundary. Embedding, all linear projections, bridge chain, decode, and all bindings are independently verified. Attention at arbitrary token positions is an open problem on stock GPU kernels. Routine prefix provenance is statistical unless deep audit is used. The point is not to pretend everything is exact, but to make the verified, open, and statistical parts explicit.

Attacker model

Can a provider still target one user or one prompt?

Yes, but then the guarantee is probabilistic unless the audit rate is 100%. Roughly, catch probability is audit rate times challenge coverage times verifier soundness. If the stakes are higher, the verifier can escalate from routine audit to deep audit on that specific response instead of paying full cost on every request.

Availability

What if the provider refuses or misses an audit?

Then that is evidence too. CommitLLM is an interactive protocol: inability or refusal to open a committed region on time is not neutral behavior. In a real deployment, missed audits, selective aborts, and retention-window failures should be policy violations, not silently ignored edge cases.

Comparison

Is this stronger or weaker than redundant execution?

If you can afford full redundant execution of the exact claimed stack, that gives stronger assurance on one request. CommitLLM solves a different problem: making remote inference verification deployable when full reruns are too expensive, unavailable, or only practical occasionally.

Anti-substitution

Does this catch model switching and hidden downgrades?

That is one of the main reasons the protocol exists. A remote provider’s cheapest attack is often not random corruption but serving a smaller checkpoint, using a more aggressive quantization plan, skipping work, or changing decode behavior. The receipt is designed to make those changes auditable.

Auditors

Can someone other than the original user verify later?

Yes. The client can verify, hand the receipt to an auditor, or ask a third party with the verifier material to re-check it later. CommitLLM is not only about server-vs-server replication; it is mainly about crossing the server-vs-client trust boundary.

Verifier cost

Does the verifier need a GPU?

No. The kept path is CPU-only on the verifier side. In the current prototype, Llama 70B routine audit is about 1.3 ms per challenged token and a 1-token full audit is about 10 ms on CPU.

Provider cost

What does this cost the provider?

The provider keeps the normal GPU serving path and pays for tracing, retained state, and occasional audit openings. On the current prototype, online tracing overhead is roughly 12-14%. The big costs are retained-state memory and audit bandwidth, not per-response proof generation.

Wire cost

Does the client receive huge payloads on every request?

No. The normal path returns a compact receipt. Large payloads appear only when an audit opening is requested. That is a core design choice: cheap steady-state serving, heavier bandwidth only on challenged responses.

Tuning

Do all requests get the same verification?

No. That is one of the practical advantages. Low-stakes traffic can use cheap routine verification. High-stakes responses can be escalated to deep audit on demand. The assurance/cost tradeoff is programmable instead of all-or-nothing.

Privacy

Does auditing reveal anything sensitive?

The compact receipt is small, but deep audit openings can reveal retained trace data and intermediate activations. In practice, that means deep audits are best done by the client or a trusted auditor rather than published as a universally reusable proof object.

Protocol shape

Is this a zk proof or a TEE?

No. It is a sidecar commit-and-audit protocol. The provider stays on the normal serving path, there is no expensive proof generation for every response, and there is no hardware-attestation trust assumption. The tradeoff is that verification is interactive and audit openings can be bulky.

Deployment

Does this only make sense in server-vs-client settings?

No. It also helps in server-vs-server settings when constant full redundancy is too expensive. A second server can act as verifier or auditor, and the operator can replace “rerun every request” with “precommit everything, audit selectively, and escalate when needed.” In practice, occasional re-execution plus CommitLLM is a strong combination.

Compatibility

What models and deployments does this support today?

The current measured path is focused on open-weight dense decoder models on quantized serving stacks, with explicit fail-closed behavior for unsupported semantics. The protocol is meant to broaden over time, but it is better to be explicit about what is supported today than to claim universal coverage prematurely.

Serving path

Does the provider need to replace production kernels?

Not on the kept path. CommitLLM is designed to sit beside normal serving rather than force a proof-specific inference engine. The provider stays on the ordinary GPU path and produces evidence alongside it. Heavier kernel changes belong to later research paths, not the basic deployment story.

Technical appendix

Setup once. Commit every response. Verify on challenge.

For readers who want the mechanics rather than the product pitch. The verifier holds a secret key derived from public weights, the provider commits inline during serving, and expensive openings happen only on challenge.

Phase 0 · Setup

Build the verifier key

From a public checkpoint, the verifier computes a Merkle root over weights, secret Freivalds vectors for eight matrix families (Wq, Wk, Wv, Wo, Wgate, Wup, Wdown, LM_head), and the model configuration needed for canonical replay.

Phase 1 · Commit

Serve normally, return a receipt

The provider runs inference on the normal GPU path with a tracing sidecar that captures retained state. It returns the response plus a compact receipt binding the execution trace, KV state, deployment manifest, prompt, sampling randomness, output policy, and token count.

Phase 2 · Audit

Challenge specific positions and layers

The verifier challenges token positions and layers after the commitment. The provider opens the requested region. Routine audit samples prefix state; deep audit opens everything in the retained window.

Phase 3 · Verify

CPU-only checks

Embedding Merkle proof. Freivalds on all shell matmuls. Exact INT8 bridge recomputation. KV transcript proofs. Attention replay at token 0 (arbitrary positions are open). Final-token tail from captured residual. LM-head binding. Decode and output policy replay. Prompt, manifest, and IO chain binding.

Core mechanism

Verify huge matrix multiplies cheaply

The provider claims z = W @ x for a public weight matrix W. Recomputing the full product is expensive. Freivalds’ algorithm gives a much cheaper check: the verifier precomputes v = r^TW with a secret random vector r, then checks v · x =? r^T · z in the finite field F_p where p = 2³²-5.

If z != Wx, the check fails with probability ≥ 1-1/p. This is information-theoretically sound. Transformers are mostly matrix multiplication; once those multiplies are cheap to audit, the verifier can check model identity without rerunning the full model.

W_q W_k W_v W_o W_gate W_up W_down LM_head

Interactive demo · 3x3 over F_p

Click a button to run Freivalds’ check

Resources

Start with the paper, then inspect the implementation

The homepage gives the high-level story. The paper, code, and formalization carry the full technical detail.

Read

Paper (PDF)

Protocol, guarantee boundary, threat model, and measurements in full detail.

Open paper

Inspect

GitHub repository

Reference implementation in Rust, Python, and Lean. Some internal crate paths still use the legacy verilm-* prefix while the rename completes.

Browse repository

Formalize

Lean 4 formalization

The formalization tracks the proof obligations behind the protocol and makes the security claims explicit.

Open Lean directory

Reproduce

Benchmark and repro scripts

Scripts for reproducing measurements on the current prototype, including the corrected replay path and corridor checks.

Open scripts