While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework uses the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves inpainting quality by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or fitted to 4D Gaussians for spatial video synthesis. We validate the efficacy of our method on videos from various generative models, including Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves over previous approaches.
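The first stage above warps the generated monocular video into new camera viewpoints using estimated depth, leaving disoccluded holes to be inpainted. As a rough illustration of this idea, here is a minimal sketch of depth-based forward warping for a single frame and a purely horizontal camera shift; the function name, the 1D-shift simplification, and the nearest-pixel splatting are our own assumptions, not the paper's implementation.

```python
import numpy as np

def warp_with_depth(frame, depth, baseline, focal):
    """Forward-warp a frame to a horizontally shifted camera using
    per-pixel disparity = focal * baseline / depth.
    Returns the warped frame and a disocclusion mask (True where no
    source pixel landed, i.e. regions to be inpainted).
    Hypothetical simplification: 1D horizontal shift, nearest-pixel splat."""
    h, w = depth.shape
    warped = np.zeros_like(frame)
    hit = np.zeros((h, w), dtype=bool)
    disparity = focal * baseline / depth  # horizontal shift in pixels
    # Splat far-to-near so that nearer pixels overwrite farther ones.
    order = np.argsort(-depth, axis=None)
    ys, xs = np.unravel_index(order, depth.shape)
    xt = np.round(xs - disparity[ys, xs]).astype(int)
    valid = (xt >= 0) & (xt < w)
    warped[ys[valid], xt[valid]] = frame[ys[valid], xs[valid]]
    hit[ys[valid], xt[valid]] = True
    return warped, ~hit
```

Repeating this warp for every frame and every pre-defined viewpoint yields the frame matrix and its disocclusion masks.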
Top: Given a text prompt, our method first employs a video generation model to generate a monocular video, which is warped into pre-defined camera views using estimated depth to form a frame matrix with disocclusion masks. The disoccluded regions are then inpainted by denoising the sampled frame sequences within the frame matrix. After denoising, we decode the clean latent frame matrix into RGB frames, from which the 3D stereoscopic video is obtained by selecting the leftmost and rightmost columns. To generate spatial video, we optimize the decoded frames into a 4D representation (e.g., 4D Gaussians) that supports continuous view changes. Bottom: Details of denoising the frame matrix. We initialize the latent matrix as a random noise map. For each noise level, we extend the resampling mechanism to alternately denoise temporal (column) sequences and spatial (row) sequences N times. Each time, row or column sequences are denoised and inpainted. By denoising along both the spatial and temporal directions, we obtain an inpainted latent that can be decoded into temporally smooth and semantically consistent sequences.
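The alternating row/column denoising described above can be sketched in a few lines. In this hypothetical sketch, `denoise_fn` is a stand-in for one denoising step of the video diffusion model, and the resampling mechanism is reduced to a plain repeat loop; all names and the exact re-injection of known content are our assumptions, not the released implementation.

```python
import numpy as np

def denoise_frame_matrix(latents, masks, known, denoise_fn,
                         noise_levels, n_resample=2):
    """Alternately denoise temporal (column) and spatial (row) sequences
    of a frame matrix of latents, shaped (views, time, ...).
    `masks` is True on disoccluded (to-be-inpainted) pixels; `known`
    holds the warped latents that are re-injected wherever visible.
    Hypothetical sketch of the alternating scheme described above."""
    for t in noise_levels:              # coarse-to-fine noise schedule
        for _ in range(n_resample):     # N resampling iterations
            # Temporal pass: each column is a video from one viewpoint.
            for v in range(latents.shape[0]):
                latents[v] = denoise_fn(latents[v], t)
                latents[v] = np.where(masks[v], latents[v], known[v])
            # Spatial pass: each row sweeps across views at one timestamp.
            for k in range(latents.shape[1]):
                latents[:, k] = denoise_fn(latents[:, k], t)
                latents[:, k] = np.where(masks[:, k], latents[:, k],
                                         known[:, k])
    return latents
```

Because every sequence shares latents with its neighbors through the matrix, each pass propagates constraints across both views and time, which is what enforces the joint spatio-temporal consistency.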
Consistent and smooth inpainting is the cornerstone of converting monocular videos into 3D stereoscopic and spatial videos.
The frame matrix helps generate semantically sound video content and ensures consistency across views and time.
Disocclusion boundary re-injection reduces the impact of contaminated features and ensures high-quality outputs.
Outpainting is used to process partially visible objects, thereby generating complete object images.
Here we show the impact of these components →
Ours
No Frame Matrix
Video ablations (right eye view)
Compared to previous methods, our approach generates high-fidelity content in disoccluded regions and provides a high level of temporal and semantic consistency.
Left eye view
DynIBaR
Ours
Right eye view (generated)
Even in sequences with fast motion, our results demonstrate superior performance.
Left eye view
E2FGVI
Ours
Right eye view (generated)
When comparing with the concurrent 4D scene generation method (Free4D), our approach achieves robust and superior performance on objects with complex motions. (* denotes Free4D failed in 4D reconstruction and uses its inpainted novel-view video instead.)
FixedCamera
Ours
Reference left view
We can generate multi-view videos by decoding the whole frame matrix: we can fix the time and sweep across views (spatial direction, first two sequences), or fix a camera view and play the scene over time (temporal direction, last two sequences).
Reference left view
The pre-defined camera trajectory used to construct the frame matrix is flexible. In this case, we show a circle trajectory.
Stereoscopic video observes the scene from fixed, predefined viewpoints that cannot be changed.
Spatial videos enable viewpoint changes when observing the scene (stereoscopic novel view synthesis).
The rendered depth maps, which contain distinct foreground and background content, indicate reasonable 3D effects.
The left side shows the original video, and the right side shows the video after frame interpolation (via interpolating time in 4D, ×4). This frame-interpolation capability demonstrates the temporal continuity of our 4D representation.
Our stereoscopic video super-resolution design delivers more details.
Bilinear
Ours
If you happen to have a VR headset, we encourage you to view these videos in stereo!
Open this webpage in your VR headset and choose one of the cases below; you can then enjoy the 3D video.