Real-time video restoration (VR) for live streams requires high-resolution outputs under a strict per-frame latency budget. Existing one-step diffusion-based VR models remain hard to deploy on consumer GPUs because of two main bottlenecks: the quadratic cost of spatial attention at high resolutions and the latency and memory overhead of large video autoencoders.
We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention (MFSWA) gathers each spatial window into a dense tensor via deterministic indexing, keeping every attention call on the standard scaled dot-product attention path — no masks, cyclic shifts, padding, or hardware-specific sparse kernels. For autoencoding, a lightweight Restoration-aware Autoencoder (ReAE) enables fast chunk-wise decoding while preserving reconstruction quality.
On a single H100, SwiftVR sustains 31 FPS at \(2560\!\times\!1440\) and 14 FPS at \(3840\!\times\!2160\), whereas all compared diffusion-based VR baselines run out of memory at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at \(1920\!\times\!1080\). To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality at lower inference cost.
SwiftVR is a streaming, one-step generative VR framework. It processes video causally in fixed-size chunks, bounding the temporal length \(T\) of each DiT tensor. The DiT is optimized in three stages — full-attention latent flow matching, mask-free shifted-window distillation, and joint pixel-space fine-tuning with the ReAE — and deployed under a causal chunk-wise streaming protocol.
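The chunk-wise protocol can be pictured with a small generator. This is only a minimal sketch: `stream_in_chunks`, `restore_chunk`, and `chunk_size` are hypothetical names, the restoration model is replaced by an arbitrary callable, and the handling of a trailing partial chunk is an assumption not described in the source.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def stream_in_chunks(
    frames: Iterable[Frame],
    chunk_size: int,
    restore_chunk: Callable[[List[Frame]], List[Frame]],
) -> Iterator[Frame]:
    """Yield restored frames chunk by chunk, using only frames already received."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    buffer: List[Frame] = []
    for frame in frames:
        buffer.append(frame)
        # A chunk is processed as soon as it is full, so the temporal
        # length seen by the model never exceeds chunk_size.
        if len(buffer) == chunk_size:
            yield from restore_chunk(buffer)
            buffer = []
    # Assumption: a trailing partial chunk is processed as-is.
    if buffer:
        yield from restore_chunk(buffer)


if __name__ == "__main__":
    # Stand-in "model": doubles each value and records the chunk lengths it saw.
    seen_lengths = []

    def fake_restore(chunk):
        seen_lengths.append(len(chunk))
        return [2 * f for f in chunk]

    print(list(stream_in_chunks(range(7), 3, fake_restore)))
    print(seen_lengths)
```

Because the generator never looks ahead of the current chunk, output for a chunk is available before later frames arrive, which is what makes the protocol suitable for live input.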
Overview of the SwiftVR pipeline. (a) Stage 1: full-attention DiT learns the constant velocity \(v = z_\mathrm{LQ} - z_\mathrm{HQ}\) along \(z_t = (1{-}t)z_\mathrm{HQ} + t\,z_\mathrm{LQ}\). (b) Stage 2: distillation into a mask-free shifted-window student. (c) Stage 3: joint pixel-space fine-tuning with ReAE. (d) Causal streaming inference, chunk by chunk.
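The Stage 1 target in the caption can be checked numerically. The sketch below uses random NumPy arrays as stand-in latents; the mean-squared-error loss and the helper names (`interpolate`, `flow_matching_loss`, `one_step_restore`) are assumptions for illustration, not the authors' training code.

```python
import numpy as np

rng = np.random.default_rng(0)
z_hq = rng.normal(size=(4, 8))               # stand-in high-quality latent
z_lq = z_hq + 0.3 * rng.normal(size=(4, 8))  # stand-in low-quality latent


def interpolate(z_hq, z_lq, t):
    """Point on the straight path z_t = (1 - t) z_HQ + t z_LQ."""
    return (1.0 - t) * z_hq + t * z_lq


def flow_matching_loss(v_pred, z_hq, z_lq):
    """Assumed MSE between a predicted velocity and the constant target v."""
    v_target = z_lq - z_hq
    return float(np.mean((v_pred - v_target) ** 2))


def one_step_restore(z_lq, v_pred):
    """Single step from t = 1 (z_LQ) back to t = 0: z_HQ = z_LQ - v."""
    return z_lq - v_pred


v = z_lq - z_hq
# The velocity is the same at every t, so any z_t maps back to z_HQ via z_t - t * v.
for t in (0.25, 0.5, 1.0):
    assert np.allclose(interpolate(z_hq, z_lq, t) - t * v, z_hq)
# With a perfect velocity prediction, one step recovers z_HQ exactly.
assert np.allclose(one_step_restore(z_lq, v), z_hq)
```

Since the path is a straight line, its derivative with respect to \(t\) is the constant \(z_\mathrm{LQ} - z_\mathrm{HQ}\), which is why a single subtraction at \(t = 1\) suffices in the ideal case.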
MFSWA invokes attention through the standard SDPA interface with attn_mask=None and no padding tokens. Unlike Swin's SW-MSA, which uses cyclic shifts and attention masks, MFSWA realizes shifts via a deterministic priority-coherent scatter. Unlike SeedVR and SeedVR2, which use variable-sized boundary windows, MFSWA keeps a fixed window size and handles boundaries with a uniform-shape boundary-clamped gather. These choices remove the operations that would otherwise force SDPA off the dense path.
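To make the idea concrete, here is a 1D NumPy sketch of one plausible reading of a boundary-clamped gather followed by a deterministic scatter. Everything in it is an assumption for illustration: the function names, the clamping rule (shifted window starts are clipped into the valid range so every window keeps the same size), and the ownership rule used when scattering back (each position takes its output from the window whose unclamped span contains it). The actual MFSWA gather and scatter rules may differ, and plain softmax attention stands in for the SDPA call.

```python
import numpy as np


def clamped_window_starts(length, window, shift):
    """Starts of fixed-size windows on a grid offset by `shift`, clamped into range."""
    assert 0 <= shift < window <= length
    nominal = np.arange(-shift, length, window)       # unclamped starts
    return np.clip(nominal, 0, length - window)       # every window fits in [0, length)


def dense_attention(q, k, v):
    """Unmasked softmax attention over [num_windows, window, dim] tensors."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def windowed_attention_1d(x, window, shift):
    """Gather uniform windows, attend densely, scatter back by window ownership."""
    length, _ = x.shape
    starts = clamped_window_starts(length, window, shift)
    # Gather: integer indexing yields one dense [num_windows, window, dim] tensor.
    index = starts[:, None] + np.arange(window)[None, :]
    tokens = x[index]
    attended = dense_attention(tokens, tokens, tokens)
    # Scatter: clamped windows can overlap, so each position reads from exactly
    # one owner window (the one whose unclamped span covers it).
    positions = np.arange(length)
    owner = (positions + shift) // window
    return attended[owner, positions - starts[owner]]


if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=(8, 4))
    print(clamped_window_starts(8, 4, 2))              # starts of the shifted layer
    print(windowed_attention_1d(x, window=4, shift=2).shape)
```

In this toy setting no mask, padding token, or cyclic roll is needed: the half-window shift with length 8 and window 4 produces three windows of identical shape, and boundary windows simply see a few extra in-range neighbours.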
MFSWA. (a) Even-layer windows. (b) Half-window-shifted base partition. (c) Odd-layer effective windows, each pre-gathered into a dense tensor and processed by standard SDPA without masks, cyclic shifts, or padding.
| Method | Avg. Time (s) ↓ | FPS ↑ | Peak Mem. (GB) ↓ |
|---|---|---|---|
| DOVE (tile) | 27.615 | 0.87 | 59.24 |
| SeedVR2-3B (tile) | 17.320 | 1.39 | 35.35 |
| FlashVSR-Tiny | 2.493 | 9.61 | 34.35 |
| SwiftVR (Ours) | 0.766 | 31.32 | 38.01 |
At 3840×2160, every compared diffusion-based VR baseline runs out of memory (OOM) on a single H100, whereas SwiftVR sustains 13.84 FPS.
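For reference, a throughput figure converts to an average per-frame time by taking its reciprocal. The short sketch below applies this to the FPS values reported above; the helper name `frame_time_ms` is ours, and the numbers are copied from the table and the 4K result, not re-measured.

```python
def frame_time_ms(fps: float) -> float:
    """Average time per frame, in milliseconds, implied by a throughput in FPS."""
    return 1000.0 / fps


# FPS values as reported in the comparison table and the 4K result.
reported_fps = {
    "DOVE (tile)": 0.87,
    "SeedVR2-3B (tile)": 1.39,
    "FlashVSR-Tiny": 9.61,
    "SwiftVR (Ours)": 31.32,
    "SwiftVR (Ours), 3840x2160": 13.84,
}

for name, fps in reported_fps.items():
    print(f"{name:28s} {frame_time_ms(fps):8.1f} ms/frame")

# Throughput ratio between SwiftVR and the fastest compared baseline in the table.
speedup_vs_flashvsr = reported_fps["SwiftVR (Ours)"] / reported_fps["FlashVSR-Tiny"]
print(f"SwiftVR vs FlashVSR-Tiny: {speedup_vs_flashvsr:.2f}x")
```

Read this way, the table's 31.32 FPS corresponds to roughly 32 ms per frame, and the 4K figure of 13.84 FPS to roughly 72 ms per frame.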
| Variant | PSNR ↑ | LPIPS ↓ | DiT Time (ms) ↓ | FPS ↑ |
|---|---|---|---|---|
| Full Attention (Teacher) | 25.86 | 0.2417 | 1039.11 | 19.36 |
| Masked SWA | 25.34 | 0.2637 | 674.49 | 27.47 |
| MFSWA (Ours) | 25.58 | 0.2508 | 566.31 | 31.32 |
Masked SWA uses the same spatial partition as MFSWA, but its block-diagonal mask disables the fused Flash/cuDNN SDPA kernels, so it runs slower despite identical window geometry. MFSWA keeps every call on the dense path.
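The trade-off in the ablation can be summarized as ratios against the full-attention teacher. The sketch below only does arithmetic on the table's reported values; the dictionary layout and the helper `summarize` are our own.

```python
# Values copied from the ablation table (PSNR in dB, DiT time in ms).
ablation = {
    "Full Attention (Teacher)": {"psnr": 25.86, "lpips": 0.2417, "dit_ms": 1039.11, "fps": 19.36},
    "Masked SWA": {"psnr": 25.34, "lpips": 0.2637, "dit_ms": 674.49, "fps": 27.47},
    "MFSWA (Ours)": {"psnr": 25.58, "lpips": 0.2508, "dit_ms": 566.31, "fps": 31.32},
}


def summarize(variant: str, reference: str = "Full Attention (Teacher)") -> dict:
    """DiT speedup and quality gaps of `variant` relative to `reference`."""
    v, r = ablation[variant], ablation[reference]
    return {
        "dit_speedup": r["dit_ms"] / v["dit_ms"],
        "psnr_drop_db": r["psnr"] - v["psnr"],
        "lpips_increase": v["lpips"] - r["lpips"],
    }


for name in ("Masked SWA", "MFSWA (Ours)"):
    s = summarize(name)
    print(f"{name:14s} DiT speedup {s['dit_speedup']:.2f}x, "
          f"PSNR drop {s['psnr_drop_db']:.2f} dB, LPIPS +{s['lpips_increase']:.4f}")

# DiT time of the masked variant relative to the mask-free one, same partition.
mask_overhead = ablation["Masked SWA"]["dit_ms"] / ablation["MFSWA (Ours)"]["dit_ms"]
print(f"Masked SWA / MFSWA DiT time: {mask_overhead:.2f}x")
```

On these numbers, MFSWA cuts DiT time by about 1.8x relative to the teacher for a PSNR drop of 0.28 dB, and the masked variant takes about 1.2x the DiT time of MFSWA on the same partition.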
Real-world video clips. Compared with regression baselines and other one-step diffusion VR methods, SwiftVR yields sharper feather textures, cleaner branch boundaries, and better leaf separation, with fewer halos or color shifts.
Columns: LQ input, Real-ESRGAN, RealBasicVSR, RealViFormer, DOVE, SeedVR2-3B, FlashVSR-Tiny, SwiftVR (Ours). Best viewed at high magnification.
Additional comparisons on real-world videos covering distant building structures, illustrated facial patterns, animal fur, and bird plumage. SwiftVR restores clearer structural boundaries and more natural fine details — roof edges, facial contours, dog fur, feather textures — while maintaining stable color and fewer local artifacts.
Additional visualization results from the supplementary material.
@article{swiftvr2026,
title = {SwiftVR: Real-Time One-Step Generative Video Restoration},
author = {Yan Jiaqi and Chen Xiangyu and Zhong Xinlin and Huang Haibin and Zhang Chi and Liu Jie and Zhou Jiantao and Li Xuelong},
journal = {arXiv preprint arXiv:2606.09516},
year = {2026}
}