SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Peng Dai1,2, Feitong Tan1*, Qiangeng Xu1*, David Futschik1, Ruofei Du1, Sean Fanello1, Xiaojuan Qi2, Yinda Zhang1
1Google 2The University of Hong Kong

Abstract

Generative video models have demonstrated an impressive ability to produce 2D monocular videos; however, generating 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The frame matrix framework leverages the video generation model to inpaint frames observed at different timestamps and from different views. This effective approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop an elaborate update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves over previous methods. Code will be released.
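As a rough illustration of the warping step mentioned above, the sketch below forward-warps one generated frame to the right-eye view using its estimated depth and records a disocclusion mask for the regions the inpainting framework must later fill. The function name, baseline, and focal length are hypothetical values chosen for illustration; the actual camera setup follows the paper.

```python
import numpy as np

def warp_to_right_eye(frame, depth, baseline=0.06, focal=500.0):
    """frame: (H, W, 3) float array; depth: (H, W) metric depth. Illustrative only."""
    h, w, _ = frame.shape
    # Disparity in pixels for a horizontal stereo baseline (hypothetical defaults).
    disparity = baseline * focal / np.clip(depth, 1e-3, None)
    warped = np.zeros_like(frame)
    filled = np.zeros((h, w), dtype=bool)
    # Forward-splat pixels from far to near so that nearer pixels, assigned last,
    # overwrite farther ones at the same target location.
    order = np.argsort(depth, axis=None)[::-1]
    ys, xs = np.unravel_index(order, depth.shape)
    xt = np.clip(np.round(xs - disparity[ys, xs]).astype(int), 0, w - 1)
    warped[ys, xt] = frame[ys, xs]
    filled[ys, xt] = True
    disocclusion_mask = ~filled  # holes to be inpainted by the video model
    return warped, disocclusion_mask
```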

Method Overview



Given a text prompt, our method first uses a video generation model to generate a monocular video, which is warped (using estimated depth) into pre-defined camera views to form a frame matrix with disocclusion masks. The disoccluded regions are then inpainted by denoising the frame sequences within the frame matrix. After denoising, we select the leftmost and rightmost columns and decode them to obtain a 3D stereoscopic video. Bottom: details of denoising the frame matrix. We initialize the latent matrix as a random noise map. For each noise level, we extend the resampling mechanism [16, 26] to alternately denoise and inpaint the temporal (column) sequences and the spatial (row) sequences N times. By denoising along both spatial and temporal directions, we obtain an inpainted latent that can be decoded into temporally smooth and semantically consistent sequences.
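A minimal sketch of this alternating denoising loop, assuming the video diffusion model is wrapped as a hypothetical `denoise_inpaint_step(latents, mask, known, t)` that runs one masked denoising step on a single frame sequence. The tensor layout, the loop structure, and the simplified treatment of resampling are illustrative assumptions, not the released implementation.

```python
import torch

def denoise_frame_matrix(latents, masks, known, denoise_inpaint_step,
                         timesteps, n_resample=4):
    """latents/masks/known: (V, T, C, H, W) -- V camera views x T timestamps."""
    V, T = latents.shape[:2]
    for t in timesteps:                      # high noise level -> low
        for _ in range(n_resample):          # repeated passes; re-noising details of resampling omitted
            # Temporal pass: each view's frames over time form one video sequence (a column).
            for v in range(V):
                latents[v] = denoise_inpaint_step(latents[v], masks[v], known[v], t)
            # Spatial pass: frames from all views at one timestamp form a camera sweep (a row).
            for i in range(T):
                latents[:, i] = denoise_inpaint_step(latents[:, i], masks[:, i], known[:, i], t)
    # The leftmost and rightmost view columns are later decoded into the stereo pair.
    return latents
```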

Ablation study

Consistent and smooth inpainting is the cornerstone of converting monocular videos into 3D stereoscopic videos.

The frame matrix helps generate semantically reasonable content and improves consistency across different views.

Disocclusion boundary re-injection helps reduce the negative effects of inpainted pixels and ensures high-quality generation (see the sketch below).
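The paper describes the exact re-injection recipe; as a rough, hedged sketch of the general idea, known latents in a thin band just outside each disoccluded hole can be re-imposed (at the current noise level) during denoising, so that errors inside the hole do not bleed into valid content. The band width, noise injection, and function names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reinject_boundary(latents, known_latents, disocc_mask, noise_level, band=3):
    """latents/known_latents: (C, H, W); disocc_mask: (H, W) bool, True = hole. Illustrative only."""
    m = disocc_mask[None, None].float()
    # Dilate the hole mask to find known pixels within `band` pixels of the boundary.
    dilated = F.max_pool2d(m, kernel_size=2 * band + 1, stride=1, padding=band)
    boundary = (dilated[0, 0] > 0) & (~disocc_mask)
    # Re-impose the known content at those pixels, noised to the current level (schematic).
    noised_known = known_latents + noise_level * torch.randn_like(known_latents)
    out = latents.clone()
    out[:, boundary] = noised_known[:, boundary]
    return out
```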


Here we show the effect of the components we introduce in our system.

Video ablations (right eye view): Ours vs. Without depth smoothing.

Comparison with previous methods

Compared to previous methods, our approach provides a high level of temporal and semantic consistency.


Left eye view (input); right eye view (generated) by DynIBaR vs. Ours.


Even in sequences with fast motion, our results show a high level of temporal consistency.

Left eye view (input); right eye view (generated) by E2FGVI vs. Ours.

Generated Video Trajectories

Reference left view

Using our frame matrix method, we are able to generate video trajectories through space (first two sequences) and time (last two sequences), consistent with the Method Overview.

Reference left view

The frame matrix formulation is flexible and also allows us to extract different trajectories. The sequences still appear consistent, thanks to the row-and-column constraints.
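As a toy illustration of how trajectories can be read out of the decoded frame matrix, a trajectory is simply a path of (view, time) indices through the matrix. The index patterns below are made-up examples, not the exact paths shown in the videos above.

```python
# frames is a decoded frame matrix indexed as frames[view][time] (illustrative layout).
def extract_trajectory(frames, path):
    return [frames[v][t] for v, t in path]

num_views, num_frames = 8, 16
space_sweep = [(v, 0) for v in range(num_views)]                # sweep across views at a fixed time
time_sweep = [(num_views - 1, t) for t in range(num_frames)]    # rightmost view played over time
diagonal = [(round(t * (num_views - 1) / (num_frames - 1)), t)  # move through space and time together
            for t in range(num_frames)]
```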

3D stereoscopic videos

If you happen to have a VR headset, we encourage you to view these videos in stereo!

Videos generated from a real monocular video