STeP: A General and Scalable Framework
for Solving Video Inverse Problems
with Spatiotemporal Diffusion Priors
Bingliang Zhang1,*
Zihui Wu1,*
Berthy T. Feng1
Yang Song2
Yisong Yue1
Katherine L. Bouman1
1California Institute of Technology
2OpenAI
*Equal contribution
in submission
TL;DR: A general and scalable framework for solving video inverse problems based on spatiotemporal diffusion priors, enabling efficient and high-quality video reconstruction.
We study general Bayesian inverse problems for videos using diffusion model priors. While a video diffusion prior is desirable for capturing complex temporal relationships, the computational and data requirements of training such a model have led prior work to instead rely on image diffusion priors applied to single frames, combined with heuristics to enforce temporal consistency. However, these approaches struggle to faithfully recover the underlying temporal relationships, particularly for tasks with high temporal uncertainty. In this paper, we demonstrate the feasibility of practical and accessible spatiotemporal diffusion priors by fine-tuning latent video diffusion models from pretrained image diffusion models using limited videos in specific domains. Leveraging this plug-and-play spatiotemporal diffusion prior, we introduce a general and scalable framework for solving video inverse problems. We then apply our framework to two challenging scientific video inverse problems: black hole imaging and dynamic MRI. Our framework produces diverse, high-fidelity video reconstructions that not only fit the observations but also recover multi-modal solutions. By incorporating a spatiotemporal diffusion prior, we significantly improve our ability to capture complex temporal relationships in the data while also enhancing spatial fidelity.
Results | Black Hole Video Reconstruction

We consider an astronomical imaging problem where we recover videos of the rapidly evolving Sagittarius A* black hole from highly sparse interferometric measurements.

Results | Dynamic Magnetic Resonance Imaging

We also investigate a dynamic magnetic resonance imaging (MRI) problem in cardiology. The reconstructions are obtained from accelerated sequences that require only 27% of the runtime of the original sequences.
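In accelerated MRI, acquisition time scales with the number of measured k-space (spatial-Fourier) samples, so an accelerated sequence records only a subset of Fourier coefficients per frame. The sketch below illustrates this kind of undersampled Fourier forward operator; the `mask` and sampling density are illustrative assumptions, not the paper's actual acquisition model.

```python
import numpy as np

def undersampled_fourier(video, mask):
    """Per-frame forward operator for accelerated MRI: keep only the
    k-space (2-D Fourier) samples selected by `mask`."""
    kspace = np.fft.fft2(video, axes=(-2, -1))
    return kspace * mask  # unmeasured coefficients are zeroed out

# Example: sampling ~27% of k-space mimics the paper's accelerated setting.
rng = np.random.default_rng(0)
frames, h, w = 8, 32, 32
mask = (rng.random((h, w)) < 0.27).astype(float)
video = rng.standard_normal((frames, h, w))
y = undersampled_fourier(video, mask)
```

Recovering the full video from such measurements is ill-posed, which is where the spatiotemporal prior comes in.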

Methodology

Prior works on video inverse problems typically rely on image diffusion priors combined with heuristics to enforce temporal consistency, reflecting the common belief that training a video diffusion model is computationally prohibitive and requires a large amount of video data, which is usually unavailable in scientific settings. These methods therefore apply image diffusion priors to single frames and enforce temporal consistency through techniques such as equivariant self-guidance and batch-consistent sampling. Such heuristics either rely on extracting optical flow from the measurements or assume a static temporal relationship between frames. We find that these approaches are largely limited to image restoration tasks and struggle to faithfully recover the underlying temporal relationships, particularly for tasks with high temporal uncertainty.

In this work, we propose STeP, a general and scalable framework for solving video inverse problems. We first show that it is feasible to train a spatiotemporal diffusion prior by fine-tuning a latent video diffusion model from a pretrained image diffusion model using limited videos from a specific domain. This lets us leverage the power of video diffusion models without extensive computational resources or large datasets. By combining this prior with the physical knowledge of the inverse problem in a plug-and-play solver, we can tackle a wide range of video inverse problems in a scalable and data-efficient manner. We demonstrate the effectiveness of our approach on two challenging scientific video inverse problems: black hole video reconstruction and dynamic MRI. Our results show that our framework generates diverse, high-fidelity video reconstructions that not only fit the observations but also recover multi-modal solutions.
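To make the plug-and-play idea concrete, here is a minimal toy sketch of diffusion posterior sampling: a prior denoising step alternates with a gradient step on the data-fidelity term, so the sample is pulled toward both the learned prior and the measurements. This is an illustration of the general technique, not the paper's exact algorithm; `plug_and_play_sample`, the DDIM-style update, and the guidance weight are all assumptions made for the sketch.

```python
import numpy as np

def plug_and_play_sample(y, A, denoiser, sigmas, guidance=0.5, seed=0):
    """Toy plug-and-play diffusion posterior sampler.

    Alternates a prior step (a deterministic move toward the denoiser's
    clean estimate, shrinking the noise level from `s` to `s_next`) with
    a guidance step (gradient descent on ||A x - y||^2), so the sample is
    consistent with both the prior and the physics of the measurements.
    """
    rng = np.random.default_rng(seed)
    x = sigmas[0] * rng.standard_normal(A.shape[1])
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoiser(x, s)                    # prior's clean estimate
        x = x0_hat + (s_next / s) * (x - x0_hat)   # shrink the noise level
        x = x - guidance * A.T @ (A @ x0_hat - y)  # enforce data fidelity
    return x

# Toy check: identity forward operator and a shrinkage denoiser
# (the MMSE denoiser for a zero-mean, unit-variance Gaussian prior).
A = np.eye(4)
y = np.array([1.0, 2.0, -1.0, 0.5])
denoiser = lambda x, s: x / (1.0 + s**2)
sigmas = np.geomspace(10.0, 1e-2, 60)
x = plug_and_play_sample(y, A, denoiser, sigmas)
```

In STeP the denoiser would be the fine-tuned spatiotemporal diffusion model acting in latent space and `A` the task's forward operator; the toy Gaussian setup here only serves to show the alternating structure of the sampler.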

Video

Citation

If you find our work interesting, please consider citing our paper:

@misc{zhang2025stepgeneralscalableframework,
  title={STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors},
  author={Bingliang Zhang and Zihui Wu and Berthy T. Feng and Yang Song and Yisong Yue and Katherine L. Bouman},
  year={2025},
  eprint={2504.07549},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07549},
}

Template adapted from Trellis, designed by Jianfeng Xiang.