SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Peng Dai1,2, Feitong Tan1*, Qiangeng Xu1*, David Futschik1, Ruofei Du1, Sean Fanello1, Xiaojuan Qi2, Yinda Zhang1
1Google 2The University of Hong Kong

Abstract

Generative video models have demonstrated an impressive ability to produce 2D monocular videos; however, generating 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The frame matrix framework leverages the video generation model to inpaint frames observed at different timestamps and from different views. This effective approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop an elaborate update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves over previous methods. Code will be released.
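As a rough illustration of the warping step mentioned above, the sketch below forward-warps one generated frame to the right-eye view using its estimated depth and records a disocclusion mask for the regions the inpainting framework must later fill. The function name, baseline, and focal length are hypothetical values chosen for illustration; the actual camera setup follows the paper.

```python
import numpy as np

def warp_to_right_eye(frame, depth, baseline=0.06, focal=500.0):
    """frame: (H, W, 3) float array; depth: (H, W) metric depth. Illustrative only."""
    h, w, _ = frame.shape
    # Disparity in pixels for a horizontal stereo baseline (hypothetical defaults).
    disparity = baseline * focal / np.clip(depth, 1e-3, None)
    warped = np.zeros_like(frame)
    filled = np.zeros((h, w), dtype=bool)
    # Forward-splat pixels from far to near so that nearer pixels, assigned last,
    # overwrite farther ones at the same target location.
    order = np.argsort(depth, axis=None)[::-1]
    ys, xs = np.unravel_index(order, depth.shape)
    xt = np.clip(np.round(xs - disparity[ys, xs]).astype(int), 0, w - 1)
    warped[ys, xt] = frame[ys, xs]
    filled[ys, xt] = True
    disocclusion_mask = ~filled  # holes to be inpainted by the video model
    return warped, disocclusion_mask
```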

Method Overview



Given a text prompt, our method first uses a video generation model to generate a monocular video, which is warped (using estimated depth) into pre-defined camera views to form a frame matrix with disocclusion masks. The disoccluded regions are then inpainted by denoising the frame sequences within the frame matrix. After denoising, we select the leftmost and rightmost columns and decode them to obtain a 3D stereoscopic video. Bottom: details of denoising the frame matrix. We initialize the latent matrix as a random noise map. For each noise level, we extend the resampling mechanism [16, 26] to alternately denoise and inpaint the temporal (column) sequences and the spatial (row) sequences N times. By denoising along both spatial and temporal directions, we obtain an inpainted latent that can be decoded into temporally smooth and semantically consistent sequences.
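A minimal sketch of this alternating denoising loop, assuming the video diffusion model is wrapped as a hypothetical `denoise_inpaint_step(latents, mask, known, t)` that runs one masked denoising step on a single frame sequence. The tensor layout, the loop structure, and the simplified treatment of resampling are illustrative assumptions, not the released implementation.

```python
import torch

def denoise_frame_matrix(latents, masks, known, denoise_inpaint_step,
                         timesteps, n_resample=4):
    """latents/masks/known: (V, T, C, H, W) -- V camera views x T timestamps."""
    V, T = latents.shape[:2]
    for t in timesteps:                      # high noise level -> low
        for _ in range(n_resample):          # repeated passes; re-noising details of resampling omitted
            # Temporal pass: each view's frames over time form one video sequence (a column).
            for v in range(V):
                latents[v] = denoise_inpaint_step(latents[v], masks[v], known[v], t)
            # Spatial pass: frames from all views at one timestamp form a camera sweep (a row).
            for i in range(T):
                latents[:, i] = denoise_inpaint_step(latents[:, i], masks[:, i], known[:, i], t)
    # The leftmost and rightmost view columns are later decoded into the stereo pair.
    return latents
```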

Ablation study

Consistent and smooth inpainting is the cornerstone of converting monocular videos into 3D stereoscopic videos.

The frame matrix helps generate semantically reasonable content and improves consistency across different views.

Disocclusion boundary re-injection helps reduce the negative effects of inpainted pixels and ensures high-quality generation (see the sketch below).
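The paper describes the exact re-injection recipe; as a rough, hedged sketch of the general idea, known latents in a thin band just outside each disoccluded hole can be re-imposed (at the current noise level) during denoising, so that errors inside the hole do not bleed into valid content. The band width, noise injection, and function names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reinject_boundary(latents, known_latents, disocc_mask, noise_level, band=3):
    """latents/known_latents: (C, H, W); disocc_mask: (H, W) bool, True = hole. Illustrative only."""
    m = disocc_mask[None, None].float()
    # Dilate the hole mask to find known pixels within `band` pixels of the boundary.
    dilated = F.max_pool2d(m, kernel_size=2 * band + 1, stride=1, padding=band)
    boundary = (dilated[0, 0] > 0) & (~disocc_mask)
    # Re-impose the known content at those pixels, noised to the current level (schematic).
    noised_known = known_latents + noise_level * torch.randn_like(known_latents)
    out = latents.clone()
    out[:, boundary] = noised_known[:, boundary]
    return out
```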


Here we show the effect of the components we introduce in our system.

Video ablations (right eye view): Ours vs. Without depth smoothing.

Comparison with previous methods

Compared to previous methods, our approach provides a high level of temporal and semantic consistency.


Left eye view (input); right eye view (generated) by DynIBaR vs. Ours.


Even in sequences with fast motion, our results show a high level of temporal consistency.

Left eye view (input); right eye view (generated) by E2FGVI vs. Ours.

Generated Video Trajectories

Reference left view

Using our frame matrix method, we are able to generate video trajectories through space (first two sequences) and time (last two sequences), consistent with the Method Overview.

Reference left view

The frame matrix formulation is flexible and also allows us to extract different trajectories. The sequences still appear consistent, thanks to the row-and-column constraints.
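As a toy illustration of how trajectories can be read out of the decoded frame matrix, a trajectory is simply a path of (view, time) indices through the matrix. The index patterns below are made-up examples, not the exact paths shown in the videos above.

```python
# frames is a decoded frame matrix indexed as frames[view][time] (illustrative layout).
def extract_trajectory(frames, path):
    return [frames[v][t] for v, t in path]

num_views, num_frames = 8, 16
space_sweep = [(v, 0) for v in range(num_views)]                # sweep across views at a fixed time
time_sweep = [(num_views - 1, t) for t in range(num_frames)]    # rightmost view played over time
diagonal = [(round(t * (num_views - 1) / (num_frames - 1)), t)  # move through space and time together
            for t in range(num_frames)]
```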

3D stereoscopic videos

If you happen to have a VR headset, we encourage you to view these videos in stereo!

Videos generated from a real monocular video