Relational Time Engine (RTE)

Runtime Density Regulation for Efficient Transformer Inference

Relational Time Engine (RTE) is a lightweight execution gating layer that reduces unnecessary transformer layer computation during inference.

Instead of executing all layers uniformly, RTE dynamically evaluates representational change and exits early when further computation becomes redundant.

The system operates as a runtime mechanism and does not require model retraining or architecture modification.

Key Idea

Traditional transformer inference executes every layer regardless of whether meaningful representational change is still occurring.

RTE introduces a relational activation density control:

ρ = activated_layers / total_layers

Saving = 1 − ρ

When representational drift falls below a structural threshold, the engine halts further layer execution.

Research Context

RTE is part of a broader research direction exploring relational execution dynamics and runtime density regulation in adaptive AI systems.

The goal is to reduce unnecessary compute while preserving structural information flow in deep neural architectures.

Benchmark Results (CPU)

Configuration:

8-layer transformer
sequence length = 128
batch size = 8
multi-run averaged measurements

Noise	Mode	Layers Used	Saving	Latency	Throughput
0.20	baseline	8/8	0%	199 ms	41 sps
0.20	strict	2/8	75%	146 ms	54 sps
0.50	strict	2/8	75%	153 ms	52 sps
0.80	strict	3/8	62%	162 ms	49 sps

Observed effects:

• up to 75% layer reduction
• ~40% latency improvement
• increased throughput
• bounded output drift

On distilbert-base-uncased, the RTE early-exit wrapper reduced mean CPU latency from 29.89 ms to 17.89 ms, achieving a 1.67x speedup while preserving output dimensionality and maintaining bounded structural drift.

Architecture

Input ↓ Transformer Layer Li ↓ Representation Drift Measurement ↓ Threshold Gate ↓ Continue / Early Exit

The mechanism dynamically regulates execution depth according to signal stability. Input ↓ Layer L1 ↓ Representation Drift ↓ Threshold Gate ↓ Continue → L2 Exit → Output

Repository Structure

rte/ benchmarks/ examples/ industrial/ spec/ tests/

Important files:

benchmarks/run_transformer_latency_benchmark.py
industrial/RTE_Industrial_Whitepaper_v1.pdf

Quick Start

Clone repository:

git clone https://github.com/maestrosalah-dev/relational-time-engine.git cd relational-time-engine

Install requirements:

pip install -r requirements.txt

Run benchmark:

python benchmarks/run_transformer_latency_benchmark.py

DistilBERT Early-Exit Benchmark

Using distilbert-base-uncased on CPU, RTE achieved the following best result:

Threshold: 0.25
Baseline latency: 27.64 ms
RTE latency: 12.94 ms
Speedup: 2.14x
Executed layers: 3 / 6
Saving: 50%
Drift (MAE proxy): 0.208

This demonstrates that RTE can operate as a real runtime early-exit controller on an actual HuggingFace model.

DistilBERT Early-Exit Benchmark

RTE was integrated as a runtime early-exit controller into distilbert-base-uncased and evaluated on CPU.

Best observed result

Threshold: 0.25
Baseline latency: 27.64 ms
RTE latency: 12.94 ms
Speedup: 2.14x
Executed layers: 3 / 6
Layer saving: 50%
Drift (MAE proxy): 0.208

This result demonstrates that RTE can function as a real runtime gating mechanism on an actual HuggingFace model, not only on synthetic transformer benchmarks.

GPT-2 Runtime Early-Exit Benchmark Suite

RTE was integrated into gpt2 as a causal decoder early-exit controller and evaluated across multiple prompts and sequence lengths.

Threshold summary

Threshold	Baseline Latency	RTE Latency	Speedup	Saving	Drift
0.75	61.30 ms	10.64 ms	5.84x	83.3%	0.752
1.00	61.30 ms	9.11 ms	7.69x	85.6%	0.819
1.25	61.30 ms	5.89 ms	10.68x	91.7%	1.024

A practical operating point is threshold = 1.00, which provides strong acceleration while keeping drift lower than the more aggressive configuration.

GPT-2 Generation-Step Benchmark

RTE was also evaluated in a multi-step autoregressive decoding setting using gpt2.

Threshold summary

Threshold	Baseline Latency/Step	RTE Latency/Step	Speedup	Saving	Drift
0.75	67.74 ms	17.90 ms	3.84x	83.3%	0.803
1.00	64.26 ms	17.92 ms	3.61x	83.3%	0.803
1.25	49.68 ms	10.63 ms	4.69x	91.7%	1.081

A practical operating point is `threshold = 0.75`, which provides strong generation-time acceleration while remaining more conservative than the most aggressive configuration.

Whitepaper

Industrial whitepaper:

industrial/RTE_Industrial_Whitepaper_v1.pdf

Zenodo archive:

https://doi.org/10.5281/zenodo.18644899 On distilbert-base-uncased, RTE achieved a best observed CPU speedup of 2.14x at threshold 0.25, reducing executed layers from 6 to 3 while maintaining bounded drift (MAE ≈ 0.208).

DistilBERT Runtime Integration

To validate RTE beyond synthetic transformer benchmarks, we integrated the gating mechanism into distilbert-base-uncased and applied runtime early-exit based on representational drift thresholds.

In the best observed CPU configuration:

baseline latency = 27.64 ms
gated latency = 12.94 ms
speedup = 2.14x
executed layers reduced from 6 to 3
effective layer saving = 50%
bounded drift proxy (MAE) = 0.208

This experiment demonstrates that RTE can operate as a practical runtime controller inside a real HuggingFace transformer architecture, achieving measurable latency reduction without retraining or architectural redesign.

GPT-2 Runtime Integration

To evaluate RTE on a causal decoder architecture, we integrated the gating mechanism into gpt2 and ran a multi-prompt, multi-length benchmark suite.

Across the benchmark suite, RTE achieved the following threshold-level averages:

threshold 0.75: 5.84x speedup with 83.3% layer saving
threshold 1.00: 7.69x speedup with 85.6% layer saving
threshold 1.25: 10.68x speedup with 91.7% layer saving

A balanced operating point was observed at threshold = 1.00, where RTE maintained strong acceleration while avoiding the higher drift of the most aggressive setting.

These results suggest that runtime layer gating can extend beyond encoder models and operate effectively on decoder-style transformer architectures relevant to LLM serving.

GPT-2 Autoregressive Generation Benchmark

To move closer to realistic LLM serving conditions, RTE was evaluated in a multi-step autoregressive decoding scenario using gpt2.

Across generation steps, RTE maintained substantial acceleration:

threshold 0.75: 3.84x speedup with 83.3% saving
threshold 1.00: 3.61x speedup with 83.3% saving
threshold 1.25: 4.69x speedup with 91.7% saving

These results suggest that runtime drift-based early exit can remain effective beyond single-pass inference and may extend to token-by-token decoding workloads relevant to LLM serving systems.

Citation

If you use or reference this work:

Salah, A. (2026).
Relational Time Engine (RTE): Runtime Density Regulation for Efficient AI Inference.

Author

Athmani Salah
Independent Researcher ORCID: 0009-0004-9350-9216
maestro.salah@gmail.com GitHub
https://github.com/maestrosalah-dev

For collaboration or hardware integration inquiries, open an issue or contact via GitHub.

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
benchmarks		benchmarks
examples		examples
github/workflows		github/workflows
industrial		industrial
rte		rte
rte_engine		rte_engine
spec		spec
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
RTE_Industrial_Whitepaper_v1.md		RTE_Industrial_Whitepaper_v1.md
requirements.txt		requirements.txt
setup.py		setup.py
test_distilbert_early_exit.py		test_distilbert_early_exit.py
test_gpt2_rte.py		test_gpt2_rte.py
test_mock_decoder_runtime.py		test_mock_decoder_runtime.py
test_rte_hf.py		test_rte_hf.py
vllm_adapter_sketch.md		vllm_adapter_sketch.md
vllm_integration_plan.md		vllm_integration_plan.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Relational Time Engine (RTE)

Runtime Density Regulation for Efficient Transformer Inference

Key Idea

Research Context

The goal is to reduce unnecessary compute while preserving structural information flow in deep neural architectures.

Benchmark Results (CPU)

On distilbert-base-uncased, the RTE early-exit wrapper reduced mean CPU latency from 29.89 ms to 17.89 ms, achieving a 1.67x speedup while preserving output dimensionality and maintaining bounded structural drift.

Architecture

Repository Structure

Quick Start

DistilBERT Early-Exit Benchmark

DistilBERT Early-Exit Benchmark

Best observed result

GPT-2 Runtime Early-Exit Benchmark Suite

Threshold summary

GPT-2 Generation-Step Benchmark

Threshold summary

A practical operating point is `threshold = 0.75`, which provides strong generation-time acceleration while remaining more conservative than the most aggressive configuration.

Whitepaper

DistilBERT Runtime Integration

GPT-2 Runtime Integration

GPT-2 Autoregressive Generation Benchmark

These results suggest that runtime drift-based early exit can remain effective beyond single-pass inference and may extend to token-by-token decoding workloads relevant to LLM serving systems.

Citation

Author

About

Uh oh!

Releases 3

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Relational Time Engine (RTE)

Runtime Density Regulation for Efficient Transformer Inference

Key Idea

Research Context

The goal is to reduce unnecessary compute while preserving structural information flow in deep neural architectures.

Benchmark Results (CPU)

On distilbert-base-uncased, the RTE early-exit wrapper reduced mean CPU latency from 29.89 ms to 17.89 ms, achieving a 1.67x speedup while preserving output dimensionality and maintaining bounded structural drift.

Architecture

Repository Structure

Quick Start

DistilBERT Early-Exit Benchmark

DistilBERT Early-Exit Benchmark

Best observed result

GPT-2 Runtime Early-Exit Benchmark Suite

Threshold summary

GPT-2 Generation-Step Benchmark

Threshold summary

A practical operating point is threshold = 0.75, which provides strong generation-time acceleration while remaining more conservative than the most aggressive configuration.

Whitepaper

DistilBERT Runtime Integration

GPT-2 Runtime Integration

GPT-2 Autoregressive Generation Benchmark

These results suggest that runtime drift-based early exit can remain effective beyond single-pass inference and may extend to token-by-token decoding workloads relevant to LLM serving systems.

Citation

Author

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Contributors

Uh oh!

Languages

A practical operating point is `threshold = 0.75`, which provides strong generation-time acceleration while remaining more conservative than the most aggressive configuration.