SwiftVR: Real-Time One-Step Generative Video Restoration

1 State Key Laboratory of Internet of Things for Smart City, Department of Computer and Information Science, University of Macau
2 Institute of Artificial Intelligence, China Telecom (TeleAI)
3 State Key Laboratory for Novel Software Technology, Nanjing University
SwiftVR teaser

SwiftVR is the first generative video restoration model to reach real-time 1080p streaming on a consumer-grade GPU. It sustains 31 FPS at QHD and 14 FPS at 4K on a single H100, and 26 FPS at 1080p on an RTX 5090.

26 FPS at 1920×1080 (RTX 5090)
31 FPS at 2560×1440 (H100)
14 FPS at 3840×2160 (H100)
1.62× throughput vs. the full-attention teacher

Video Results

Three side-by-side video comparisons (Samples 1 to 3), each showing the low-quality (LQ) input against the SwiftVR restoration.

Abstract

Real-time video restoration for live streams requires high-resolution outputs under strict per-frame latency. Existing one-step diffusion-based VR models remain hard to deploy on consumer GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency–memory overhead of large video autoencoders.

We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention (MFSWA) gathers each spatial window into a dense tensor via deterministic indexing, keeping every attention call on the standard scaled dot-product attention path — no masks, cyclic shifts, padding, or hardware-specific sparse kernels. For autoencoding, a lightweight Restoration-aware Autoencoder (ReAE) enables fast chunk-wise decoding while preserving reconstruction quality.

On a single H100, SwiftVR sustains 31 FPS at \(2560\!\times\!1440\) and 14 FPS at \(3840\!\times\!2160\), whereas all compared diffusion-based VR baselines exceed memory at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at \(1920\!\times\!1080\). To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost.

Method

SwiftVR is a streaming, one-step generative VR framework. It processes video causally in fixed-size chunks, bounding the temporal length \(T\) of each DiT tensor. The DiT is optimized in three stages — full-attention latent flow matching, mask-free shifted-window distillation, and joint pixel-space fine-tuning with the ReAE — and deployed under a causal chunk-wise streaming protocol.
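The causal chunk-wise protocol can be illustrated with a minimal sketch. It assumes a hypothetical `restore_chunk` callable standing in for the full restoration model, a hypothetical `chunk_size`, and no state carried between chunks; none of these details are specified above, and the handling of a trailing partial chunk is likewise an assumption.

```python
from typing import Callable, Iterable, Iterator, List

import numpy as np


def stream_restore(
    frames: Iterable[np.ndarray],
    restore_chunk: Callable[[np.ndarray], np.ndarray],  # hypothetical model wrapper
    chunk_size: int = 8,                                 # hypothetical chunk length
) -> Iterator[np.ndarray]:
    """Yield restored frames chunk by chunk without looking at future frames."""
    buffer: List[np.ndarray] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == chunk_size:
            # The temporal length T of the tensor handed to the model is bounded
            # by chunk_size, however long the stream runs.
            chunk = np.stack(buffer)  # shape (T, H, W, C)
            yield from restore_chunk(chunk)
            buffer.clear()
    if buffer:
        # Assumption: a trailing partial chunk is processed as a shorter chunk.
        yield from restore_chunk(np.stack(buffer))
```

Because the generator emits output as soon as each chunk is complete, latency is bounded by one chunk of buffering plus the per-chunk processing time, which is why a bounded \(T\) matters for live streams.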

SwiftVR pipeline overview

Overview of the SwiftVR pipeline. (a) Stage 1: full-attention DiT learns the constant velocity \(v = z_\mathrm{LQ} - z_\mathrm{HQ}\) along \(z_t = (1{-}t)z_\mathrm{HQ} + t\,z_\mathrm{LQ}\). (b) Stage 2: distillation into a mask-free shifted-window student. (c) Stage 3: joint pixel-space fine-tuning with ReAE. (d) Causal streaming inference, chunk by chunk.
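The Stage 1 objective in (a) implies a simple one-step update: since \(z_1 = z_\mathrm{LQ}\) and the velocity is constant along the path, a single Euler step from \(t = 1\) to \(t = 0\) gives \(z_\mathrm{HQ} = z_\mathrm{LQ} - v\). The sketch below checks this numerically on random arrays. The function names, the mean-squared-error loss, and the toy latent shapes are illustrative assumptions, not the actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
z_hq = rng.standard_normal((4, 16))                 # toy "clean" latents
z_lq = z_hq + 0.3 * rng.standard_normal((4, 16))    # toy "degraded" latents


def interpolate(z_hq, z_lq, t):
    """Point on the straight path z_t = (1 - t) z_HQ + t z_LQ."""
    return (1.0 - t) * z_hq + t * z_lq


def velocity_target(z_hq, z_lq):
    """dz_t/dt along the path; it does not depend on t."""
    return z_lq - z_hq


def flow_matching_loss(v_pred, z_hq, z_lq):
    """Assumed MSE between a predicted velocity and the constant target."""
    return float(np.mean((v_pred - velocity_target(z_hq, z_lq)) ** 2))


def one_step_restore(z_lq, v_pred):
    """Single Euler step from t = 1 (z_LQ) back to t = 0 (z_HQ)."""
    return z_lq - v_pred


# With a perfect velocity prediction, one step recovers the clean latent.
restored = one_step_restore(z_lq, velocity_target(z_hq, z_lq))
print(np.allclose(restored, z_hq))
```

In practice the network's prediction is only approximate, so the restored latent is an estimate of \(z_\mathrm{HQ}\) rather than an exact recovery.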

Mask-free Shifted-Window Self-Attention (MFSWA)

MFSWA invokes attention through the standard SDPA interface with attn_mask=None and no padding tokens. Unlike Swin SW-MSA, which uses cyclic shifts and attention masks, MFSWA realizes shifts via a deterministic, priority-coherent scatter. Unlike SeedVR / SeedVR2, which use variable-sized boundary windows, MFSWA keeps a fixed window size and handles boundaries with a uniform-shape, boundary-clamped gather. These choices remove the operations that would otherwise force SDPA off the dense path.

MFSWA illustration

MFSWA. (a) Even-layer windows. (b) Half-window-shifted base partition. (c) Odd-layer effective windows, each pre-gathered into a dense tensor and processed by standard SDPA without masks, cyclic shifts, or padding.
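One possible reading of the gather and scatter steps is sketched below in NumPy. It is not the released implementation. It assumes that every window start is clamped so that a full-size window stays inside the frame, and that each token takes its output from the window that owns it in the half-window-shifted base partition; this ownership rule is our interpretation of "priority-coherent". The helper names are hypothetical, queries, keys, and values are left unprojected, and `dense_attention` is a stand-in for a mask-free SDPA call such as `torch.nn.functional.scaled_dot_product_attention` with `attn_mask=None`.

```python
import numpy as np


def window_starts(size: int, ws: int, shift: int) -> np.ndarray:
    """Start index of each fixed-size window along one axis, clamped in bounds."""
    n = -(-(size + shift) // ws)  # ceil((size + shift) / ws) windows
    return np.clip(np.arange(n) * ws - shift, 0, size - ws)


def dense_attention(q, k, v):
    """Plain softmax attention over (num_windows, tokens, dim); no mask anywhere."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def shifted_window_attention(x: np.ndarray, ws: int, shift: int) -> np.ndarray:
    """x: (H, W, C) with H, W >= ws. shift=0 for even layers, ws // 2 for odd."""
    H, W, C = x.shape
    sy, sx = window_starts(H, ws, shift), window_starts(W, ws, shift)
    ny, nx = len(sy), len(sx)

    # Gather: every window has the same ws x ws shape, so the result is dense.
    rows = sy[:, None] + np.arange(ws)  # (ny, ws)
    cols = sx[:, None] + np.arange(ws)  # (nx, ws)
    win = x[rows[:, None, :, None], cols[None, :, None, :]]  # (ny, nx, ws, ws, C)
    tokens = win.reshape(ny * nx, ws * ws, C)

    out = dense_attention(tokens, tokens, tokens).reshape(ny, nx, ws, ws, C)

    # Scatter: clamped windows overlap, so each pixel deterministically reads
    # the result of its owner window in the (unclamped) shifted base partition.
    py, px = np.arange(H), np.arange(W)
    ky, kx = (py + shift) // ws, (px + shift) // ws  # owner window indices
    oy, ox = py - sy[ky], px - sx[kx]                # offsets inside the owner
    return out[ky[:, None], kx[None, :], oy[:, None], ox[None, :]]
```

For example, along an axis of length 8 with window size 4 and shift 2, the unclamped starts -2, 2, 6 become 0, 2, 4: three full windows that overlap at the borders instead of two truncated boundary windows plus padding or masking.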

Results

Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)

Method | Avg. Time (s) ↓ | FPS ↑ | Peak Mem. (GB) ↓
DOVE (tile) | 27.615 | 0.87 | 59.24
SeedVR2-3B (tile) | 17.320 | 1.39 | 35.35
FlashVSR-Tiny | 2.493 | 9.61 | 34.35
SwiftVR (Ours) | 0.766 | 31.32 | 38.01

At 3840×2160, every compared diffusion-based VR baseline runs out of memory (OOM) on a single H100, whereas SwiftVR sustains 13.84 FPS.

MFSWA ablation

Variant | PSNR ↑ | LPIPS ↓ | DiT Time (ms) ↓ | FPS ↑
Full Attention (Teacher) | 25.86 | 0.2417 | 1039.11 | 19.36
Masked SWA | 25.34 | 0.2637 | 674.49 | 27.47
MFSWA (Ours) | 25.58 | 0.2508 | 566.31 | 31.32

Masked SWA uses the same spatial partition as MFSWA, but its block-diagonal mask disables the fused Flash/cuDNN SDPA kernels, so it runs slower despite identical window geometry. MFSWA keeps every call on the dense path.
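The speedups implied by the ablation can be recomputed directly from the table. The short script below is only a worked arithmetic check on the reported numbers; the dictionary layout and function name are illustrative.

```python
# Reported ablation numbers: DiT time in milliseconds and end-to-end FPS.
rows = {
    "Full Attention (Teacher)": {"dit_ms": 1039.11, "fps": 19.36},
    "Masked SWA": {"dit_ms": 674.49, "fps": 27.47},
    "MFSWA (Ours)": {"dit_ms": 566.31, "fps": 31.32},
}
teacher = rows["Full Attention (Teacher)"]


def speedups(name: str) -> tuple:
    """Return (DiT-time speedup, end-to-end FPS speedup) relative to the teacher."""
    row = rows[name]
    return teacher["dit_ms"] / row["dit_ms"], row["fps"] / teacher["fps"]


for name in ("Masked SWA", "MFSWA (Ours)"):
    dit, fps = speedups(name)
    print(f"{name}: DiT {dit:.2f}x, end-to-end {fps:.2f}x")
```

For MFSWA this gives roughly 1.83× on DiT time and 1.62× end to end. The end-to-end gain is smaller than the DiT gain, presumably because stages outside the DiT, such as autoencoding, take the same time in all three variants.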

Qualitative Comparison

Real-world video clips. Compared with regression baselines and other one-step diffusion VR methods, SwiftVR yields sharper feather textures, cleaner branch boundaries, and better leaf separation, with fewer halos or color shifts.

Qualitative comparison on real-world videos

Columns: LQ input, Real-ESRGAN, RealBasicVSR, RealViFormer, DOVE, SeedVR2-3B, FlashVSR-Tiny, SwiftVR (Ours). Best viewed at high magnification.

Additional Qualitative Results

Additional comparisons on real-world videos covering distant building structures, illustrated facial patterns, animal fur, and bird plumage. SwiftVR restores clearer structural boundaries and more natural fine details — roof edges, facial contours, dog fur, feather textures — while maintaining stable color and fewer local artifacts.

Additional qualitative results

Additional visualization results from the supplementary material.

BibTeX

@article{swiftvr2026,
  title   = {SwiftVR: Real-Time One-Step Generative Video Restoration},
  author  = {Yan Jiaqi and Chen Xiangyu and Zhong Xinlin and Huang Haibin and Zhang Chi and Liu Jie and Zhou Jiantao and Li Xuelong},
  journal = {arXiv preprint arXiv:2606.09516},
  year    = {2026}
}