S2VG: 3D Stereoscopic and Spatial Video Generation
via Denoising Frame Matrix



Abstract

While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or fitted into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method through experiments on videos from various generative models, including Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves over previous methods.
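For intuition, here is a minimal NumPy sketch of the depth-based warping step described above. It is not our implementation: the z-buffered splatting, the shared intrinsics K, and the helper name warp_to_view are illustrative assumptions.

```python
import numpy as np

def warp_to_view(rgb, depth, K, T_src_to_tgt):
    """Forward-warp an RGB frame into a target view using per-pixel depth.

    rgb:   (H, W, 3) source frame
    depth: (H, W) estimated depth for the source view
    K:     (3, 3) camera intrinsics (assumed shared by both views)
    T_src_to_tgt: (4, 4) relative camera pose
    Returns the warped frame and a disocclusion mask (True = hole).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)

    # Back-project pixels to 3D points in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts_h = np.concatenate([pts, np.ones((1, pts.shape[1]))], axis=0)

    # Transform to the target view and project back to the image plane.
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]
    proj = K @ pts_tgt
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)

    warped = np.zeros_like(rgb)
    zbuf = np.full((H, W), np.inf)
    mask = np.ones((H, W), dtype=bool)  # pixels stay holes until written
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (proj[2] > 0)
    src = rgb.reshape(-1, 3)
    for i in np.flatnonzero(valid):  # z-buffered splat: nearest surface wins
        if pts_tgt[2, i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = pts_tgt[2, i]
            warped[v[i], u[i]] = src[i]
            mask[v[i], u[i]] = False
    return warped, mask
```

The returned mask marks the disoccluded regions that the frame matrix inpainting framework later fills in.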

Method Overview



Top: Given a text prompt, our method first employs a video generation model to generate a monocular video, which is warped into pre-defined camera views using estimated depth to form a frame matrix with disocclusion masks. The disoccluded regions are then inpainted by denoising the sampled frame sequences within the frame matrix. After denoising, we decode the clean latent frame matrix into RGB frames, and the 3D stereoscopic video is obtained by selecting the leftmost and rightmost columns. To generate spatial video, we optimize the decoded frames into a 4D representation (e.g., 4D Gaussians) that supports continuous view changes. Bottom: Details of denoising the frame matrix. We initialize the latent matrix as a random noise map. For each noise level, we extend the resampling mechanism to alternately denoise temporal (column) sequences and spatial (row) sequences N times; each time, the row or column sequences are denoised and inpainted. By denoising along both the spatial and temporal directions, we obtain an inpainted latent that can be decoded into temporally smooth and semantically consistent sequences.
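The alternating schedule can be summarized in a few lines. This is a minimal sketch rather than our actual code: denoise_step is a hypothetical helper performing one masked denoising/inpainting step on a frame sequence, and the renoise-back step of resampling is elided.

```python
def denoise_frame_matrix(latents, masks, denoise_step, noise_levels, N=4):
    """Sketch of frame matrix denoising.

    latents: (V, T, C, h, w) noisy latent frame matrix (V views x T frames)
    masks:   (V, T, 1, h, w) disocclusion masks (1 = region to inpaint)
    denoise_step(seq, mask, level): hypothetical helper that denoises a
        frame sequence one step, re-imposing known (warped) content
        outside the mask.
    """
    V, T = latents.shape[:2]
    for level in noise_levels:          # from high noise to low noise
        for _ in range(N):              # resampling repetitions
            # Temporal pass: each column is a fixed-view video clip.
            for v in range(V):
                latents[v] = denoise_step(latents[v], masks[v], level)
            # Spatial pass: each row sweeps all views at a fixed timestamp.
            for f in range(T):
                latents[:, f] = denoise_step(latents[:, f], masks[:, f], level)
    return latents
```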

Ablation Study

Consistent and smooth inpainting is the cornerstone of converting monocular videos into 3D stereoscopic and spatial videos.
Frame matrix helps generate semantically reasonable content and benefits consistency across different viewpoints.
Disocclusion boundary re-injection helps to reduce the negative effects of inpainted pixels and ensures high-quality generation (see the sketch after this list).
Outpainting handles partially observed objects for complete object generation.
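One plausible reading of the boundary re-injection step, sketched below under stated assumptions: latents in a thin band around each hole are treated as unreliable (they can be polluted when frames containing holes pass through the latent space), so at every denoising step the appropriately noised known latent is re-injected everywhere outside the dilated hole. The function name, the band width, and the RePaint-style noising are illustrative, not our exact scheme.

```python
import torch
import torch.nn.functional as F

def reinject_known(latent_t, known_latent0, mask, t, alphas_cumprod, band=2):
    """Hedged sketch of boundary-aware re-injection during latent inpainting.

    latent_t:      (C, h, w) current noisy latent at step t
    known_latent0: (C, h, w) clean latent of the warped (known) content
    mask:          (1, h, w), 1 = disoccluded hole
    """
    # Dilate the hole so the boundary band counts as "to be inpainted".
    hole = F.max_pool2d(mask[None].float(), kernel_size=2 * band + 1,
                        stride=1, padding=band)[0] > 0

    # Noise the known latent to the current step and re-inject it
    # everywhere outside the (dilated) hole region.
    a = alphas_cumprod[t]
    noised_known = a.sqrt() * known_latent0 \
        + (1 - a).sqrt() * torch.randn_like(known_latent0)
    return torch.where(hole, latent_t, noised_known)
```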


Here we show the effect of the components we introduce in our system.

Ours

No Frame Matrix

Video ablations (right eye view)

Comparison with Previous Methods

Compared to previous methods, our approach generates high-fidelity content in disoccluded regions and provides a high level of temporal and semantic consistency.


Left eye view

DynIBaR

Ours

Right eye view (generated)


Even in sequences with fast motion, our results demonstrate superior performance.

Left eye view

E2FGVI

Ours

Right eye view (generated)


When comparing with the concurrent 4D scene generation method (Free4D), our approach achieves robust and superior performance on objects with complex motions. (* denotes that Free4D failed in 4D reconstruction, so we use its inpainted novel-view video instead.)


FixedCamera

Ours

Multi-View Videos

Reference left view

We can generate multi-view videos by decoding the whole frame matrix. We can fix the time and change views (spatial direction, first two sequences), or fix the view and play the scene through time (temporal direction, last two sequences).
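For concreteness, assuming the decoded frame matrix is stored as a (views, time, channels, H, W) array, these sequences are simple slices:

```python
import numpy as np

V, T, H, W = 7, 16, 256, 256
frame_matrix = np.zeros((V, T, 3, H, W))  # decoded RGB frame matrix (placeholder)

spatial_sweep  = frame_matrix[:, 0]   # fix time, vary view (spatial direction)
temporal_video = frame_matrix[0]      # fix view, vary time (temporal direction)
left, right = frame_matrix[0], frame_matrix[-1]  # leftmost/rightmost columns -> stereo pair
```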



Reference left view

The pre-defined camera trajectory used to construct the frame matrix is flexible. In this case, we show a circular trajectory.
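A minimal sketch of how such a circular trajectory could be constructed; the helper names and the look-at convention (+z forward) are illustrative assumptions, not our exact parameterization.

```python
import numpy as np

def look_at(cam_pos, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a camera-to-world rotation whose +z axis points at `target`."""
    fwd = target - cam_pos
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(up, fwd)
    right = right / np.linalg.norm(right)
    true_up = np.cross(fwd, right)
    return np.stack([right, true_up, fwd], axis=1)  # columns: x, y, z axes

def circle_trajectory(num_views, radius, target, center=np.zeros(3)):
    """Camera centers on a circle around `center` in the x-y plane,
    all aimed at a common scene point `target`."""
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False):
        pos = center + radius * np.array([np.cos(theta), np.sin(theta), 0.0])
        T = np.eye(4)                       # camera-to-world pose
        T[:3, :3] = look_at(pos, target)
        T[:3, 3] = pos
        poses.append(T)
    return poses

# Example: 8 views on a small circle, looking at a point 1m along +z.
poses = circle_trajectory(8, radius=0.05, target=np.array([0.0, 0.0, 1.0]))
```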

3D Stereoscopic Videos

Stereoscopic video provides a fixed pair of viewpoints from which to observe the scene.

3D Spatial Videos

Spatial videos enable viewpoint changes when observing the scene (stereoscopic novel view synthesis).

3D Effects Visualization

The rendered depth maps, which clearly distinguish foreground from background content, indicate reasonable 3D effects. The left side shows the RGB video.

Frame Interpolation via 4D Representation

The left side is the original video, and the right side is the video after frame interpolation (via interpolating time in the 4D representation, x4). The capability of frame interpolation demonstrates the continuity of our 4D representation in the time direction.
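Because the 4D representation is continuous in time, x4 interpolation amounts to rendering at quarter-step timestamps between the original frames. A small sketch, where render_4d is a hypothetical rendering call:

```python
import numpy as np

def interpolation_timestamps(num_frames, factor=4):
    """Quarter-step timestamps for x4 temporal interpolation."""
    return np.arange(0, num_frames - 1 + 1e-9, 1.0 / factor)

# frames = [render_4d(gaussians, camera, t)          # render_4d is hypothetical
#           for t in interpolation_timestamps(16)]
```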

Stereoscopic Video Super Resolution

Our stereoscopic video super-resolution design upsamples the video to provide more details.

Bilinear

Ours

3D Viewer

If you happen to have a VR headset, we encourage you to view these videos in stereo!

Using a VR headset, open this webpage and choose one of the stereo video cases; you can then enjoy the 3D videos.

