While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework uses the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves inpainting quality by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or fitted to 4D Gaussians for spatial video synthesis. We validate the efficacy of our method on videos from various generative models, including Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves over previous approaches.
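The first stage above warps the generated monocular video into new camera viewpoints using estimated depth, leaving disoccluded holes to be inpainted. As a rough illustration of this idea, here is a minimal sketch of depth-based forward warping for a single frame and a purely horizontal camera shift; the function name, the 1D-shift simplification, and the nearest-pixel splatting are our own assumptions, not the paper's implementation.

```python
import numpy as np

def warp_with_depth(frame, depth, baseline, focal):
    """Forward-warp a frame to a horizontally shifted camera using
    per-pixel disparity = focal * baseline / depth.
    Returns the warped frame and a disocclusion mask (True where no
    source pixel landed, i.e. regions to be inpainted).
    Hypothetical simplification: 1D horizontal shift, nearest-pixel splat."""
    h, w = depth.shape
    warped = np.zeros_like(frame)
    hit = np.zeros((h, w), dtype=bool)
    disparity = focal * baseline / depth  # horizontal shift in pixels
    # Splat far-to-near so that nearer pixels overwrite farther ones.
    order = np.argsort(-depth, axis=None)
    ys, xs = np.unravel_index(order, depth.shape)
    xt = np.round(xs - disparity[ys, xs]).astype(int)
    valid = (xt >= 0) & (xt < w)
    warped[ys[valid], xt[valid]] = frame[ys[valid], xs[valid]]
    hit[ys[valid], xt[valid]] = True
    return warped, ~hit
```

Repeating this warp for every frame and every pre-defined viewpoint yields the frame matrix and its disocclusion masks.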
Top: Given a text prompt, our method first employs a video generation model to generate a monocular video, which is warped into pre-defined camera views using estimated depth to form a frame matrix with disocclusion masks. The disoccluded regions are then inpainted by denoising the sampled frame sequences within the frame matrix. After denoising, we decode the clean latent frame matrix into RGB frames, from which the 3D stereoscopic video is obtained by selecting the leftmost and rightmost columns. To generate spatial video, we optimize the decoded frames into a 4D representation (e.g., 4D Gaussians) that supports continuous view changes. Bottom: Details of denoising the frame matrix. We initialize the latent matrix as a random noise map. For each noise level, we extend the resampling mechanism to alternately denoise temporal (column) sequences and spatial (row) sequences N times. Each time, row or column sequences are denoised and inpainted. By denoising along both the spatial and temporal directions, we obtain an inpainted latent that can be decoded into temporally smooth and semantically consistent sequences.
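The alternating row/column denoising described above can be sketched in a few lines. In this hypothetical sketch, `denoise_fn` is a stand-in for one denoising step of the video diffusion model, and the resampling mechanism is reduced to a plain repeat loop; all names and the exact re-injection of known content are our assumptions, not the released implementation.

```python
import numpy as np

def denoise_frame_matrix(latents, masks, known, denoise_fn,
                         noise_levels, n_resample=2):
    """Alternately denoise temporal (column) and spatial (row) sequences
    of a frame matrix of latents, shaped (views, time, ...).
    `masks` is True on disoccluded (to-be-inpainted) pixels; `known`
    holds the warped latents that are re-injected wherever visible.
    Hypothetical sketch of the alternating scheme described above."""
    for t in noise_levels:              # coarse-to-fine noise schedule
        for _ in range(n_resample):     # N resampling iterations
            # Temporal pass: each column is a video from one viewpoint.
            for v in range(latents.shape[0]):
                latents[v] = denoise_fn(latents[v], t)
                latents[v] = np.where(masks[v], latents[v], known[v])
            # Spatial pass: each row sweeps across views at one timestamp.
            for k in range(latents.shape[1]):
                latents[:, k] = denoise_fn(latents[:, k], t)
                latents[:, k] = np.where(masks[:, k], latents[:, k],
                                         known[:, k])
    return latents
```

Because every sequence shares latents with its neighbors through the matrix, each pass propagates constraints across both views and time, which is what enforces the joint spatio-temporal consistency.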
Consistent and smooth inpainting is the cornerstone of converting monocular videos into 3D stereoscopic and spatial videos.
The frame matrix helps generate semantically sound video content and ensures consistency across views and time.
Disocclusion boundary re-injection reduces the impact of contaminated features and ensures high-quality outputs.
Outpainting is used to process partially visible objects, thereby generating complete object images.
Here we show the impact of these components →
Ours
No Frame Matrix
Video ablations (right eye view)
Compared to previous methods, our approach generates high-fidelity content in disoccluded regions and provides a high level of temporal and semantic consistency.
Left eye view
DynIBaR
Ours
Right eye view (generated)
Even in sequences with fast motion, our results demonstrate superior performance.
Left eye view
E2FGVI
Ours
Right eye view (generated)
When comparing with the concurrent 4D scene generation method (Free4D), our approach achieves robust and superior performance on objects with complex motions. (* denotes Free4D failed in 4D reconstruction and uses its inpainted novel-view video instead.)
FixedCamera
Ours
Reference left view
We can generate multi-view videos by decoding the whole frame matrix: we can fix the time and sweep across views (spatial direction, first two sequences), or fix a camera view and play the scene over time (temporal direction, last two sequences).
Reference left view
The pre-defined camera trajectory used to construct the frame matrix is flexible. In this case, we show a circle trajectory.
Stereoscopic video observes the scene from fixed, predefined viewpoints that cannot be changed.
Spatial videos enable viewpoint changes when observing the scene (stereoscopic novel view synthesis).
The rendered depth maps, which contain distinct foreground and background content, indicate reasonable 3D effects.
The left side shows the original video, and the right side shows the video after frame interpolation (via interpolating time in 4D, ×4). This frame-interpolation capability demonstrates the temporal continuity of our 4D representation.
Our stereoscopic video super-resolution design delivers more details.
Bilinear
Ours
If you happen to have a VR headset, we encourage you to view these videos in stereo!
Open this webpage in your VR headset and choose one of the cases below; you can then enjoy the 3D video.