Latency Benchmark Methodology

Benchmark methodology notebook with decode latency, component timing, thread scaling, and CSV export paths.

Gemma 3N E4B · Hyun Woo Kim

Experiment data

1,218.0 ms: Average decode mean across experiment runs
26.2 pp: FFN share reduction from short to long context
28.0 pp: KV-cache attention share increase

Decode latency breakdown

Per-run decode component timing from experiment_results.csv.

[Figure: Decode latency component breakdown by run. Axis: latency in ms. Components: FFN, QKV, output projection, attention, runtime overhead.]
Run totals: Run 1, 1,521.5 ms; Run 2, 1,488.4 ms; Run 3, 1,259.2 ms; Run 4, 926.6 ms; Run 5, 894.4 ms.
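The headline average can be cross-checked against the per-run totals shown in the chart. The sketch below is a minimal check that hardcodes the five run values from this page rather than reading experiment_results.csv, whose column names are not given here; the variable names are illustrative only.

```python
from statistics import mean

# Per-run decode totals in milliseconds, copied from the chart labels on this page.
run_decode_ms = {
    "Run 1": 1521.5,
    "Run 2": 1488.4,
    "Run 3": 1259.2,
    "Run 4": 926.6,
    "Run 5": 894.4,
}

# Arithmetic mean across runs, and the gap between slowest and fastest run.
average_ms = mean(run_decode_ms.values())
spread_ms = max(run_decode_ms.values()) - min(run_decode_ms.values())

print(f"average decode: {average_ms:.1f} ms")
print(f"slowest minus fastest run: {spread_ms:.1f} ms")
```

Rounded to one decimal place, the mean agrees with the 1,218.0 ms figure reported under Experiment data.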

FFN versus KV-cache bottleneck shift

Component share from context_sweep_results.csv as context length grows from short to long prompts.

[Figure: FFN and attention share across context length. Y-axis: component share, 0% to 100%. X-axis: input tokens (13, 17, 26, 58). Series: FFN share, KV-cache attention share.]

CPU-thread scaling

Decode throughput and FFN timing from thread_scaling_results.csv.

[Figure: Thread scaling for decode throughput and FFN timing. Series: decode tok/s, FFN ms trend.]
Decode throughput: 1 thread, 1.68 tok/s; 2 threads, 1.56 tok/s; 4 threads, 1.32 tok/s; 6 threads, 1.28 tok/s.
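One way to read the thread-scaling numbers is as speedup relative to the single-thread run, and as parallel efficiency (speedup divided by thread count). The sketch below assumes the four throughput values shown in the chart and hardcodes them; it does not read thread_scaling_results.csv, and the names are illustrative.

```python
# Decode throughput in tokens per second by thread count, from the chart on this page.
throughput_tok_s = {1: 1.68, 2: 1.56, 4: 1.32, 6: 1.28}

baseline = throughput_tok_s[1]

# Map thread count -> (speedup vs. 1 thread, parallel efficiency).
scaling = {}
for threads, tok_s in throughput_tok_s.items():
    speedup = tok_s / baseline
    efficiency = speedup / threads
    scaling[threads] = (speedup, efficiency)

for threads, (speedup, efficiency) in scaling.items():
    print(f"{threads} thread(s): speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

With these values every multi-thread speedup is below 1.0, meaning throughput falls as threads are added in this measurement.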

GEMV, GEMM, and memory-bandwidth relationship

Normalized curves from bottleneck_analysis.csv compare FFN weight-read pressure, KV-cache attention reads, and decode latency.

[Figure: Memory traffic and decode latency relationship. Y-axis: normalized value, 0% to 100%. X-axis: input tokens (13, 17, 26, 58). Series: FFN GEMV read/tok (max 1,680.0 MB), KV-cache read/tok (max 44.98 MB), decode latency (max 1,372.1 ms).]
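To relate the per-token read volumes to latency, a simple bandwidth-bound estimate divides the bytes read per token by the sustained memory bandwidth. The sketch below is a rough illustration only: the bandwidth value is a hypothetical placeholder, not a measurement from this notebook; decimal units (1 GB = 1000 MB) are assumed; and the two maxima from the chart legend are not necessarily taken at the same context length.

```python
def min_decode_ms(read_mb_per_token: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token time (ms) if the reads are purely bandwidth-bound."""
    seconds = (read_mb_per_token / 1000.0) / bandwidth_gb_s
    return seconds * 1000.0


# Maxima taken from the chart legend on this page.
FFN_GEMV_READ_MB = 1680.0
KV_CACHE_READ_MB = 44.98

# HYPOTHETICAL sustained bandwidth, chosen only to illustrate the arithmetic.
ASSUMED_BANDWIDTH_GB_S = 10.0

ffn_floor_ms = min_decode_ms(FFN_GEMV_READ_MB, ASSUMED_BANDWIDTH_GB_S)
kv_floor_ms = min_decode_ms(KV_CACHE_READ_MB, ASSUMED_BANDWIDTH_GB_S)

# Ratio of the two legend maxima: how large KV-cache reads are next to FFN weight reads.
kv_to_ffn_ratio = KV_CACHE_READ_MB / FFN_GEMV_READ_MB

print(f"FFN floor: {ffn_floor_ms:.1f} ms, KV floor: {kv_floor_ms:.2f} ms")
print(f"KV/FFN read ratio: {kv_to_ffn_ratio:.3f}")
```

Substituting a measured bandwidth for the placeholder gives a floor that can be compared against the observed decode latency to judge how close the run is to being memory-bound.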