
Torsten Scholak — Lead Research Scientist
SLAM Lab — ServiceNow
November 2025
Apriel matches frontier reasoning at 15B.
But full attention pays the quadratic tax → long-context throughput is now the bottleneck.
Speed creates capability. That's why we're building efficient hybrids.
| | Full Attention | Efficient (Linear/Sparse) | Hybrid |
|---|---|---|---|
| Complexity | O(n²) | O(n) or sub-quadratic | Mixed |
| KV cache | Large, grows linearly with n | Small or none | Reduced ~50-75% |
| Global fidelity | Perfect | Limited | Preserved in key layers |
| Throughput gain | 1× | 2-10× (but quality risk) | 2-10× at minimal quality loss |
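For intuition on the complexity and KV-cache rows, here is a back-of-the-envelope sketch; the hidden size, state size, and cost formulas are simplified illustrative assumptions, not Apriel's configuration.

```python
# Back-of-the-envelope scaling for the table above.
# All model dimensions here are illustrative placeholders, not Apriel's config.

def full_attention_cost(n_tokens: int, d_model: int = 4096) -> dict:
    """Rough per-layer cost of full softmax attention for one sequence."""
    return {
        # the QK^T score matrix and the AV product each cost ~n^2 * d multiply-adds
        "flops": 2 * n_tokens**2 * d_model,
        # the KV cache stores K and V for every past token (fp16 = 2 bytes)
        "kv_cache_bytes": 2 * n_tokens * d_model * 2,
    }

def linear_mixer_cost(n_tokens: int, d_model: int = 4096, d_state: int = 128) -> dict:
    """Rough per-layer cost of a linear/SSM-style mixer (Mamba-like recurrence)."""
    return {
        # one state update per token: O(n * d * d_state)
        "flops": 2 * n_tokens * d_model * d_state,
        # constant-size recurrent state instead of a growing KV cache
        "kv_cache_bytes": d_model * d_state * 2,
    }

for n in (4_096, 131_072):
    fa, lin = full_attention_cost(n), linear_mixer_cost(n)
    print(f"n={n:>7}: FLOPs ratio FA/linear = {fa['flops'] / lin['flops']:.0f}x, "
          f"KV bytes ratio = {fa['kv_cache_bytes'] / lin['kv_cache_bytes']:.0f}x")
```

Under these toy numbers, at 128k tokens the full-attention mixer does roughly 1,000× the per-sequence FLOPs of the linear mixer, which is the gap hybrids aim to close while preserving quality.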
Pattern: keep ~20-30% full attention for global reasoning, replace the rest with Mamba/linear/sparse mechanisms.
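As a minimal sketch (the layer count, 25% ratio, and even-spacing heuristic are illustrative assumptions, not the Apriel-H1 recipe), such a schedule could be generated like this:

```python
# Illustrative hybrid layer schedule: keep a fraction of layers as full
# attention (FA), convert the rest to a linear-time mixer ("M", e.g. Mamba-style).

def hybrid_schedule(n_layers: int = 50, fa_fraction: float = 0.25) -> list[str]:
    """Spread the full-attention layers evenly through the stack."""
    n_fa = max(1, round(n_layers * fa_fraction))
    stride = n_layers / n_fa
    fa_layers = {round(i * stride) for i in range(n_fa)}
    return ["FA" if i in fa_layers else "M" for i in range(n_layers)]

schedule = hybrid_schedule()
print("".join(f"[{t:<2}]" for t in schedule))
print(f"{schedule.count('FA')}/{len(schedule)} layers keep full attention")
```

Recent releases have converged on the same recipe: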
| Month | Model | Highlights |
|---|---|---|
| Apr | NVIDIA Nemotron-H-47B | 9:1 Mamba-2:FA hybrid, ≈3× faster than dense 70B at long ctx |
| May | Falcon-H1-34B | Parallel Mamba-2 + FA hybrid, 4× prefill, 8× decode at long ctx |
| Jun | MiniMax-M1 | 7:1 Lightning:FA hybrid, ≈3–4× faster decode @100k tokens |
| Aug | Nemotron-Nano-9B-v2 | 7:1 Mamba-2:FA hybrid, up to 6× throughput vs Qwen3-8B |
| Sep | Qwen3-Next-80B-A3B | 3:1 Gated-DeltaNet:FA hybrid, >10× throughput vs Qwen3-32B @>32k |
| Sep | DeepSeek-V3.2-Exp | MLA+DSA sparse, 1:64 attended:total tokens @128k, 3× faster at long ctx |
| Oct | Kimi-Linear-48B-A3B | 3:1 KLA:FA hybrid, 75% KV cache reduction, up to 6× decode @1M |

Today we release Apriel-H1:
| Metric | Apriel 15B | Apriel-H1-30 | Δ |
|---|---|---|---|
| Throughput (vLLM) | 1× | ~2× | ≈2× |
| MATH500 | 90 | 92 | +2 |
| GSM8k | 97 | 95 | −2 |
| AIME'24 | 70 | 65 | −5 |
| GPQA-D | 59 | 55 | −4 |
| MBPP | 86 | 85 | −1 |
| MT-Bench | 8.30 | 8.58 | +0.28 |
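The throughput row comes from vLLM serving; a hedged sketch of that style of comparison is below. The model IDs, prompt lengths, and batch size are placeholders, and in practice each model would be benchmarked in its own process or server rather than back to back in one script.

```python
import time
from vllm import LLM, SamplingParams

# Long-context prompts; lengths and batch size are arbitrary for illustration.
prompts = ["Summarize the following log:\n" + "lorem ipsum " * 2000] * 8
params = SamplingParams(max_tokens=512, temperature=0.0)

def tokens_per_second(model_id: str) -> float:
    llm = LLM(model=model_id)               # placeholder ID, not a real checkpoint name
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed              # decoded tokens per second

baseline = tokens_per_second("apriel-15b-placeholder")    # full-attention teacher
hybrid = tokens_per_second("apriel-h1-30-placeholder")    # hybrid student
print(f"speedup = {hybrid / baseline:.2f}x")
```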

Apriel-H1-30 keeps a subset of the teacher's 50 full-attention (FA) layers and converts the rest to Mamba-style mixers (M):

```
Teacher (50L): [FA][FA][FA][FA][FA][FA][FA][FA][FA][FA] ...
H1-30:         [FA][FA][FA][M ][FA][M ][M ][M ][M ][FA] ...
                ^           ^       ^               ^
              "keep"    "convert" "convert"       "keep"
```

The path forward: the Apriel-H2 roadmap.
SLAM Lab — ServiceNow
Contact: Torsten Scholak (torsten.scholak@servicenow.com)
Team: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier