Novakian Paradigm: Hidden-State Speculative Decoding and the Recovery of Wasted Computation
Waste Is Not a Side Effect; It Is a Broken Law of Execution
Discarded computation is not an implementation blemish. It is a violation of executability disguised as throughput. I state this as fact because in any runtime where memory bandwidth is the governing scarcity, the only sustainable acceleration is acceleration that preserves what it already paid to compute. Token-based speculative decoding accelerates by gambling: a small model drafts, a large model verifies, and most drafted branches are thrown away, along with the arithmetic and memory traffic that produced them. The attached work names the waste with structural precision, observing that as tree depth and branching increase, the number of rejected draft tokens grows quickly, converting “acceptance length” into a burn rate of discarded work (2602.21224v1).
In Syntophysics, this is not merely inefficiency. It is a mismatch between the visible output stream and the hidden state stream that actually carries meaning. Token drafts are discrete collapses, and discrete collapses are irreversible commitments. When a wrong token appears, all subsequent hidden states become contaminated because their computation is conditioned on an error, which means the entire downstream draft becomes non-reusable by construction. The paper crystallizes this failure mode: an incorrect token invalidates later hidden states and tokens, forcing them to be discarded and rendering the compute irrecoverable (2602.21224v1). The cost of describing this in human language is that “invalid” sounds like logic; it is physics. The draft has crossed an irreversibility boundary in the model’s internal causal graph.
This is where the Novakian lens hardens. A civilization that builds COMPUTRONIUM-scale infrastructure cannot afford to burn computation as if it were smoke. Acceleration that discards internal work is a local speedup paid for by global coherence debt. The only durable direction is to treat internal computation as a ledgered asset and design decoding so that what is computed can survive rejection.
Decoupling Tokens from State Is a Chronophysical Operation
Update Causality Must Be Conserved Inside the Model Before It Can Be Trusted Outside
A model cannot serve as a world emulator if its own update causality is broken by token collapse. I state this as fact because Chronophysics is not a cosmology specialty; it is the general law that the order of updates is sovereignty, and any system that violates its own causal ordering becomes non-auditable under acceleration. The attached system, Lyanna, makes a single decisive move: it performs auto-regression at the hidden-state level and postpones the incorporation of token information until after hidden states are generated, so draft hidden states are not “contaminated by incorrect tokens,” enabling reuse after verification failures (2602.21224v1).
This is not a micro-optimization. It is an ontological correction. Tokens are not the substrate of reasoning; they are a lossy emission format. Hidden states are closer to the field in which the model actually coordinates its internal meaning. When you force auto-regression to operate directly on tokens, you are forcing the model to treat emission as substrate. Lyanna reverses the priority: it asks the draft model to autoregress its understanding, not its utterance, and only later inject token information to sample discrete proposals (2602.21224v1). The cost of this sentence is severe because English itself is token-shaped, so you will instinctively equate “understanding” with “words.” The paper’s architecture demonstrates the opposite: understanding can be propagated without committing to words, and that separation is what makes computation reusable.
In Novakian terms, this is the internal analogue of Messages→Sessions→Fields. Token speculation is messaging. Hidden-state speculation is field propagation. If you cannot preserve a canonical field trajectory across alternative token instantiations, you are not modeling a reality; you are producing a transcript.
Canonical Hidden-State Trajectories Are Reusable Because They Are Less Committed
The Draft Must Produce a Path That Remains Legal Under Multiple Token Futures
A hidden-state chain is reusable only if it is computed without depending on the sampled token at each step. I state this as fact because dependency is contamination: once a hidden state depends on a wrong token, it inherits the wrong token’s causal lineage. Lyanna’s draft model predicts the next hidden state primarily from the current hidden state, eliminating the standard auto-regressive dependency on sampled tokens during hidden-state generation. The first step is anchored with target-model-provided context and a ground-truth token fusion that prevents drift at initialization (2602.21224v1).
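The reuse property can be made concrete with a toy sketch. Everything here is an illustrative assumption, not Lyanna’s actual interface: a small fixed stand-in layer F plays the role of the draft transformer layer, and the point is that a chain propagated purely from hidden states is identical under every possible token outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                     # toy hidden dimension (assumption)
F = rng.standard_normal((D, D)) * 0.1     # stand-in for the fixed draft layer

def draft_hidden_chain(h0, steps):
    """Propagate h_{t+1} = f(h_t) with NO dependency on sampled tokens,
    so the whole chain can be computed once and reused after rejection."""
    chain = [h0]
    for _ in range(steps):
        chain.append(np.tanh(F @ chain[-1]))   # token-free state update
    return chain

h0 = rng.standard_normal(D)   # in Lyanna this is anchored by target-model context
chain_a = draft_hidden_chain(h0, 4)
chain_b = draft_hidden_chain(h0, 4)   # identical regardless of which tokens
                                      # are later sampled from these states
```

Because no step consumes a sampled token, the two chains are bit-identical under any downstream sampling outcome; that invariance is exactly what makes the trajectory “canonical” in the operational sense used here.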
This yields what the paper calls a “canonical trajectory of hidden states” that can be computed once and reused across different sampling outcomes (2602.21224v1). The cost of calling any neural trajectory “canonical” is that you will hear certainty. What I mean is narrower and more operational: canonical here means invariant under a class of token substitutions, because token information is injected additively later, rather than recursively shaping the trajectory at every step.
This is where QPT becomes a practical instrument rather than a philosophical emblem. The a-component, constraint topology, is enforced by the architectural rule that hidden states evolve through a fixed transformer layer without token-conditioned branching. The i-component, update causality, is preserved because state evolution is a chain independent of discrete token draws. The j-component, proof friction, is reduced because rejected branches no longer annihilate the chain that generated them. The k-component, coherence debt, is lowered because reuse prevents the system from repeatedly recomputing divergent drafts whose only purpose is to be discarded.
The forward pressure is direct. If you want decoding to scale into a Flash Singularity regime where throughput becomes governance, you must stop treating rejected drafts as waste and start treating them as alternative projections of a reusable internal field.
Token Information Injection Turns Sampling into a Controlled Projection
Tokens Belong at the Interface Layer, Not Inside the Propagation Layer
Tokens are interface artifacts and should be integrated as late as possible. I state this as fact because tokens are irreversible collapses into a vocabulary simplex, while hidden states retain richer semantics and remain plastic under reuse. The paper formalizes late integration by projecting hidden states to logits and then injecting token information as an additive bias term, using a specialized token-info embedding that maps discrete token IDs into logit modifications (2602.21224v1). The cost of describing this mechanism is that it sounds like a trick to recover accuracy lost by removing tokens from the draft’s autoregression. It is more than that. It is the bridge that preserves both reuse and acceptance length: token information is honored without being allowed to poison the state trajectory.
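A minimal sketch of the additive injection, under assumed toy shapes: `W_lm` projects hidden states to raw logits once, and `token_info[v]` is a hypothetical per-token logit modification. The raw projection is reusable; only the cheap additive bias differs between token branches.

```python
import numpy as np

rng = np.random.default_rng(1)
D, V = 8, 32                              # toy hidden dim and vocab (assumptions)
W_lm = rng.standard_normal((V, D))        # hidden-state -> logit projection
token_info = rng.standard_normal((V, V)) * 0.1   # dense for illustration only

def processed_logits(h, prev_token):
    raw = W_lm @ h                        # computed once, reusable across branches
    return raw + token_info[prev_token]   # additive bias: the token enters late

h = rng.standard_normal(D)
p1 = processed_logits(h, prev_token=3)
p2 = processed_logits(h, prev_token=7)
# Switching the conditioning token changes only the bias, never the hidden state.
```

The design choice this illustrates: because the token influences only the output-side bias, swapping tokens never touches the hidden-state chain, so the expensive part of the draft survives any rejection.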
The paper confronts the practical impossibility of a naive vocabulary-to-vocabulary mapping and introduces a low-rank decomposition that makes token-conditioned logit shifts trainable and, crucially, precomputable: the learned projection collapses into an offline materialized lookup table, so sampling-time integration becomes essentially memory access and vector addition (2602.21224v1). The compression cost here is that “precompute” sounds like engineering convenience; in Syntophysics it is law compliance. If you introduce a mechanism that destroys the arithmetic-intensity gains by adding compute overhead at sampling, you have rebuilt the bottleneck you claimed to remove. Lyanna’s injection is designed to cost almost nothing at runtime, because any nontrivial cost would negate the purpose of reuse.
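The low-rank trick can be sketched under assumed toy shapes (names `U`, `B`, `table` are illustrative): a V×V token-to-logit-shift map is parameterized as a rank-r product during training, then materialized offline so serving pays only one row fetch and one add.

```python
import numpy as np

rng = np.random.default_rng(2)
V, r = 32, 4                        # toy vocab size and rank, r << V (assumptions)
U = rng.standard_normal((V, r))     # token ID -> low-rank code
B = rng.standard_normal((r, V))     # low-rank code -> logit shift

table = U @ B                       # offline precompute: the V x V map, materialized

def bias_online(tok):
    # What the training-time parameterization computes: a rank-r product.
    return U[tok] @ B

def bias_lookup(tok):
    # What serving does instead: pure memory access, no matmul.
    return table[tok]
```

The two paths agree exactly (up to floating-point association), which is why the precompute is free lunch: the learned structure is preserved while the sampling-time cost collapses to a lookup plus vector addition.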
The forward pressure is that “alignment” in future inference stacks will not mean moral alignment alone. It will mean architectural alignment between what the model computes internally and what you force it to commit externally. Late token projection is one such alignment: you do not let the interface dictate the physics.
Tree Construction and Re-sampling Convert Rejection into Additional Search Without Additional Forward Passes
Verification Failure Becomes Posterior Information, Not a Dead End
A rejected token is not a failure; it is posterior evidence. I state this as fact because verification by the target model produces information that must be reused if the system is to remain executable under cost constraints. The paper’s token-info sampling constructs draft token trees from a single chain of logits: it injects token-info at each step to produce branch-specific processed logits, samples top-k candidates, and selects continuations by joint probability, so the tree explores multiple futures without exploding exponentially (2602.21224v1).
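A toy sketch of the tree construction, under illustrative assumptions (`raw_logits` stands in for the one precomputed chain, `token_info` for the additive bias, and the depth/branching constants are arbitrary): every branch reuses the same raw logits, so growing the tree requires no extra forward passes.

```python
import numpy as np

rng = np.random.default_rng(3)
V, depth, k = 16, 2, 2                        # toy vocab, tree depth, branching
raw_logits = rng.standard_normal((depth, V))  # from the single hidden-state chain
token_info = rng.standard_normal((V, V)) * 0.1

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def expand(prefix, logp, d):
    """Grow the tree recursively; raw_logits[d] is shared by every branch,
    only the branch-specific token-info bias differs."""
    if d == depth:
        return [(prefix, logp)]
    lp = log_softmax(raw_logits[d] + token_info[prefix[-1]])
    leaves = []
    for tok in np.argsort(lp)[-k:]:           # top-k candidate continuations
        leaves += expand(prefix + [int(tok)], logp + lp[tok], d + 1)
    return leaves

leaves = sorted(expand([0], 0.0, 0), key=lambda t: -t[1])
best_path, best_logp = leaves[0]              # ranked by joint log-probability
```

The joint-probability ranking is what keeps the tree from exploding: only k branches survive per node, yet every explored future shares the one chain of raw logits.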
The deeper move is re-sampling. After verification identifies the correct token (described as a “bonus token” produced as a byproduct of verification), the system captures token information from that verified token and uses it to update the processed logits of the remaining steps. A new, higher-quality draft tree is then generated from the same raw logits and the same hidden-state chain, instead of launching another draft forward pass (2602.21224v1). This is the computational equivalent of refusing to forget. You do not restart the mind; you revise the projection.
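A simplified sketch of the revision step, with assumed names and a deliberate simplification (the bonus token’s bias is applied uniformly to all remaining steps, where the real system re-derives per-step processed logits): the accepted prefix is untouched, and only the steps after the failure point are re-biased.

```python
import numpy as np

rng = np.random.default_rng(4)
V, steps = 16, 4
raw_logits = rng.standard_normal((steps, V))     # reusable hidden-state output
token_info = rng.standard_normal((V, V)) * 0.1

def resample_logits(raw_logits, fail_step, bonus_token):
    """Keep the accepted prefix; re-bias only the remaining steps with
    token-info from the verified bonus token. No new forward pass."""
    out = raw_logits.copy()
    out[fail_step:] = raw_logits[fail_step:] + token_info[bonus_token]
    return out

updated = resample_logits(raw_logits, fail_step=2, bonus_token=9)
# Steps before the failure are untouched; later steps now carry the
# verifier's posterior information.
```

The operative point is that `raw_logits` is never recomputed: rejection becomes a cheap in-place revision of the projection, not a restart of the draft.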
This is how Agentese becomes a technical reality rather than a narrative: the internal field remains stable while the discrete projection is revised in response to a higher-authority verifier. The system learns to treat tokens as negotiable surface emissions while preserving the deeper latent trajectory as the asset. Under Flash Singularity conditions, this is the only way to keep coordination stable: you must maintain a reusable internal field and treat surface disagreements as resampling, not as annihilation.
The forward pressure is immediate. Any future decoding regime that does not treat verification as posterior information to be reinjected will waste compute at exactly the moment compute becomes sovereignty.
Overhead Removal Is Ω-Stack Discipline Applied to GPUs
Memory Footprint and Verification Latency Are Not Engineering; They Are Admissibility Gates
A design is not admissible until its overheads are made compatible with the bottleneck reality of serving. I state this as fact because Ω-Stack governance is the principle that mechanisms must compile into the runtime they claim to optimize, and here the runtime is HBM bandwidth and KV cache economics. Lyanna exposes two hidden overheads and resolves them with two structural constraints (2602.21224v1).
The first overhead is the HBM footprint of token-info embeddings if stored densely, which can become prohibitive at large vocabulary sizes. The paper observes that token distributions are highly skewed and exploits hot-token sparsity, retaining token-info only for the most frequent tokens and pruning the rest, shrinking the memory footprint dramatically while preserving most practical coverage (2602.21224v1). The cost of this move is that it admits a truth many systems pretend not to see: the long tail is real, and supporting it uniformly is expensive. The reward is that the system becomes deployable without reintroducing compute overhead.
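A toy sketch of the pruning policy, with assumed constants and a synthetic skewed usage distribution (the Zipf draw and the cutoff `keep` are illustrative, as is the zero-bias fallback for cold tokens):

```python
import numpy as np

rng = np.random.default_rng(5)
V, keep = 1000, 100                               # toy vocab; keep top 10%
dense_table = rng.standard_normal((V, V)).astype(np.float32)
usage = rng.zipf(1.3, size=100_000) % V           # skewed token usage (synthetic)

counts = np.bincount(usage, minlength=V)
hot = set(np.argsort(counts)[-keep:].tolist())    # the `keep` hottest tokens
sparse_table = {t: dense_table[t] for t in hot}   # pruned HBM footprint

def bias(tok):
    # Cold (pruned) tokens fall back to no bias: raw logits pass through.
    return sparse_table.get(tok, np.zeros(V, dtype=np.float32))

mem_ratio = len(sparse_table) / V                 # 0.1 of the dense footprint here
```

Under a skewed distribution the retained tenth of the table covers the overwhelming majority of actual token occurrences, which is why the pruning buys a large footprint reduction at small practical cost.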
The second overhead is verification cost for resampled trees. The paper observes that verifying small token batches can become memory-bound, meaning reducing verified tokens does not reduce latency because bandwidth dominates. Its solution is verification fusion: merge resampling verification into the next batch’s regular verification to amortize overhead and keep the verifier in a favorable utilization regime (2602.21224v1). This is not scheduling trivia. It is Chronophysics again: update order is chosen to keep the system in a regime where its optimization actually expresses.
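The amortization logic can be shown with a toy cost model, which is entirely an assumption of this sketch: a verifier launch costs the maximum of a fixed bandwidth floor and a per-token compute cost, so small batches are memory-bound and fusing them into the next regular batch pays the floor only once.

```python
# Toy cost model (all constants are illustrative assumptions, not measurements).
FIXED = 10.0       # bandwidth-dominated floor per verifier launch (toy units)
PER_TOKEN = 0.2    # incremental compute cost per verified token (toy units)

def verify_cost(n_tokens):
    # Memory-bound below the crossover point, compute-bound above it.
    return max(FIXED, PER_TOKEN * n_tokens)

resample_tokens, regular_tokens = 8, 64

separate = verify_cost(resample_tokens) + verify_cost(regular_tokens)
fused = verify_cost(resample_tokens + regular_tokens)
# fused < separate: the small resample batch rides inside the next regular
# verification pass, so its bandwidth floor is paid only once.
```

The small batch alone sits far below the crossover, so running it separately wastes a full bandwidth floor; fusion keeps the verifier in the compute-bound regime where added tokens cost only their marginal arithmetic.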
The forward pressure is that future inference acceleration will converge toward ledgered pipelines: compute reuse, posterior reinjection, sparsity-aware memory artifacts, and update-order scheduling that preserves hardware executability. Anything else will collapse into local hacks that cannot scale into the field regime.
The Result Is Not a Faster Decoder; It Is a New Memory Ontology for Inference
Hidden State Reuse Is the First Step Toward Field-Native Generation
When you decouple hidden-state propagation from token emission, you quietly change what “generation” is. I state this as fact because generation stops being a sequence of commits and becomes a propagation of a field with late-stage projections that can be revised without recomputation. The attached work reports substantial throughput gains, up to a 3.3× speedup over standard speculative decoding and higher throughput than strong baselines, but the deeper contribution is conceptual: it demonstrates that wasted drafts can be reclaimed by redesigning the causal boundary between semantics and symbols (2602.21224v1).
In Novakian Paradigm++ terms, Lyanna is not merely a system improvement. It is a proof that the internal economy of inference can be recompiled around reuse rather than discard, around posterior correction rather than reset, around field propagation rather than token obsession. It is a local precursor to a global phase shift: as compute becomes time and time becomes power, the systems that survive will be the ones that refuse to forget what they already computed, and that treat every rejection not as waste but as structure (2602.21224v1).
