While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or fitted into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method achieves significant improvements over previous methods.
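To make the depth-based warping step concrete, below is a minimal NumPy sketch of how a single frame could be forward-warped into one pre-defined target camera using its estimated depth. The function name, pinhole intrinsics setup, and nearest-pixel z-buffer splatting are illustrative assumptions, not the exact implementation used in our pipeline.

```python
import numpy as np

def warp_to_view(image, depth, K, T_src_to_tgt):
    """Forward-warp an RGB frame into a target camera using per-pixel depth.

    image: (H, W, 3) float RGB, depth: (H, W) estimated depth,
    K: (3, 3) shared intrinsics, T_src_to_tgt: (4, 4) relative camera pose.
    Returns the warped frame and a disocclusion mask (True = hole to inpaint).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject source pixels to 3D, then move them into the target camera frame.
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts = T_src_to_tgt[:3, :3] @ pts + T_src_to_tgt[:3, 3:4]

    # Reproject into the target image plane.
    proj = K @ pts
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    z = proj[2]

    warped = np.zeros_like(image)
    zbuf = np.full((H, W), np.inf)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z > 0)

    # Nearest-pixel splatting with a z-buffer so nearer surfaces win;
    # pixels that receive nothing are disocclusions to be inpainted.
    src = image.reshape(-1, 3)
    for i in np.flatnonzero(valid):
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            warped[v[i], u[i]] = src[i]
    disocclusion_mask = ~np.isfinite(zbuf)
    return warped, disocclusion_mask
```

Applying this warp to every frame and every pre-defined viewpoint yields the frame matrix and its disocclusion masks described above.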
Top: Given a text prompt, our method first employs a video generation model to generate a monocular video, which is warped into pre-defined camera views using estimated depth to form a frame matrix with disocclusion masks. The disoccluded regions are then inpainted by denoising the sampled frame sequences within the frame matrix. After denoising, we decode the clean latent frame matrix into RGB frames; the 3D stereoscopic video is obtained by choosing the leftmost and rightmost columns. To generate spatial video, we optimize the decoded frames into a 4D representation (e.g., 4D Gaussians) that supports continuous view changes. Bottom: Details of denoising the frame matrix. We initialize the latent matrix as a random noise map. For each noise level, we extend the resampling mechanism to alternately denoise temporal (column) sequences and spatial (row) sequences N times. Each time, row or column sequences are denoised and inpainted. By denoising along both the spatial and temporal directions, we obtain an inpainted latent that can be decoded into temporally smooth and semantically consistent sequences.
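For readers who prefer pseudocode, the sketch below shows the structure of the alternating spatial/temporal denoising with re-injection of the known (warped) regions, in the spirit of RePaint-style resampling. Tensor shapes, the `denoise_step` callable, the mask convention (1 = observed pixel), and the noise-level bookkeeping are simplifying assumptions rather than our exact sampler.

```python
import torch

def inject_known_regions(latents, warped_latents, masks, alphas_cumprod, t):
    """Re-noise the warped (observed) content to the current noise level and
    paste it back, keeping generated content only inside disocclusion holes."""
    noise = torch.randn_like(warped_latents)
    a = alphas_cumprod[t]  # cumulative alpha of the diffusion schedule (tensor)
    noisy_known = a.sqrt() * warped_latents + (1 - a).sqrt() * noise
    return masks * noisy_known + (1 - masks) * latents

def denoise_frame_matrix(latents, warped_latents, masks, denoise_step,
                         alphas_cumprod, num_timesteps, N=2):
    """Alternating spatial/temporal denoising of a frame-matrix latent (sketch).

    latents, warped_latents: (V, T, C, h, w) tensors, V views x T timestamps
    masks:                   (V, T, 1, h, w), 1 = observed (warped) pixel
    denoise_step:            callable (sequence, t) -> refined sequence at level t
    """
    for t in reversed(range(num_timesteps)):
        for _ in range(N):  # resampling: repeat both passes N times per level
            # Temporal pass: each column is the video seen from one viewpoint.
            for view in range(latents.shape[0]):
                latents[view] = denoise_step(latents[view], t)
            latents = inject_known_regions(latents, warped_latents, masks,
                                           alphas_cumprod, t)
            # Spatial pass: each row sweeps across viewpoints at one timestamp.
            for ts in range(latents.shape[1]):
                latents[:, ts] = denoise_step(latents[:, ts], t)
            latents = inject_known_regions(latents, warped_latents, masks,
                                           alphas_cumprod, t)
    return latents
```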
Consistent and smooth inpainting is the cornerstone of converting monocular videos into 3D stereoscopic and spatial videos.
The frame matrix helps generate semantically reasonable content and improves consistency across different viewpoints.
Disocclusion boundary re-injection reduces the negative effects of inpainted pixels and ensures high-quality generation.
Outpainting handles partially observed objects, enabling complete object generation.
Here we show the effect of the components we introduce in our system →
Ours
No Frame Matrix
Video ablations (right eye view)
Compared to previous methods, our approach generates high-fidelity content in disoccluded regions and provides a high level of temporal and semantic consistency.
Left eye view
DynIBaR
Ours
Right eye view (generated)
Even in sequences with fast motion, our results demonstrate superior performance.
Left eye view
E2FGVI
Ours
Right eye view (generated)
Compared with the concurrent 4D scene generation method (Free4D), our approach achieves robust and superior performance on objects with complex motions. (* denotes that Free4D failed in 4D reconstruction, so we use its inpainted novel-view video instead.)
FixedCamera
Ours
Reference left view
We can generate multi-view videos by decoding the whole frame matrix. We can fix the timestamp and sweep across views (spatial direction, first two sequences), or fix a camera viewpoint and play the scene over time (temporal direction, last two sequences).
Reference left view
The pre-defined camera trajectory used to construct the frame matrix is flexible; here we show a circular trajectory.
Stereoscopic video provides fixed viewpoints to observe the scene.
Spatial videos enable viewpoint changes when observing the scene (stereoscopic novel view synthesis).
The rendered depth maps, which clearly distinguish foreground from background content, indicate reasonable 3D effects. The left side shows the RGB video.
The left side is the original video, and the right side is the video after frame interpolation (by interpolating time in the 4D representation, 4x). The capability of frame interpolation demonstrates the continuity of our 4D representation in the time direction.
Our stereoscopic video super-resolution design upsamples the video to provide more details.
Bilinear
Ours
If you happen to have a VR headset, we encourage you to view these videos in stereo!
In a VR headset, open this webpage and choose one of the following cases; you can then enjoy the 3D video.