DynVFX: Augmenting Real Videos with Dynamic Content
Supplementary Material
Our Results
We present sample results of our method, some of which are shown in Fig. 4 of the paper.
Original video | A majestic whale in the background |
Original video | A puppy peaking its head out of the box |
Original video | A majestic knight rides the horse |
Original video | A massive tsunami flooding the city, water everywhere, apocalyptic |
Original video | A playful golden retriever swimming trying to catch up with the joggers |
Original video | dinosaurs eating leaves from trees, taking big bites |
Original video | massive explosion from the mountain |
Original video | Many workers in the rice field, cutting crops |
Original video | A group of jellyfish floating |
Original video | A giraffe walking |
Original video | A elephant walking |
Original video | Balloons bouncing and swaying, surrounding the woman. Many, colorful. |
Original video | A puppy running beside the woman |
Comparisons to Baselines
Existing methods are incapable of maintaining scene fidelity while introducing new content.
Our method successfully adds new content to the scene, achieving high fidelity to the user instructions while allowing natural, seamless interactions between the original and added elements.
A bear dancing with the woman | Ours | Finetuning |
SDEdit | Gen-3 | DDIM Inversion |
A fire breathing dragon chasing the dog | Ours | Finetuning |
SDEdit | Gen-3 | DDIM Inversion |
A pair of deer drinking water from the creek | Ours | Finetuning |
SDEdit | Gen-3 | DDIM Inversion |
Ablations
(b) Excluding both AnchorExtAttn and the iterative refinement process results in significant misalignment with the original scene and poor harmonization (e.g., the size of the puppy relative to the scene, and boundary artifacts). (c) Omitting AnchorExtAttn leads to incorrect positioning of the new content. (d) Removing iterative refinement results in poor harmonization. (e) Our full method exhibits good localization and harmonization of the edit.
A puppy peaking its head out of the box
w/o AnchorExtAttn. and Iter. Refin. | w/o AnchorExtAttn. |
w/o Iter. Refin. | Our method |
A massive tsunami flooding the city
w/o AnchorExtAttn. and Iter. Refin. | w/o AnchorExtAttn. |
w/o Iter. Refin. | Our method |
Extended Attention Analysis
Controlling fidelity to the original scene using different extended attention mechanisms.
(a-b) SDEdit suffers from a trade-off between preserving the original scene and fidelity to the edit. (c-e) Three Extended Attention variants applied during sampling offer different levels of control: Full Extended Attention closely reconstructs the input scene; Masked Extended Attention allows new content to emerge but proves too constrained in overlapping regions; and our Anchor Extended Attention achieves the best results by applying dropout, extending attention only at sparse points within selected regions.
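The difference between the three variants can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the flat token layout, the single attention head, and the `keep_prob` dropout rate are all assumptions made for illustration. The only idea taken from the text is that queries from the edited video attend to extra keys and values from the original video, and that the variants differ in which original tokens are exposed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_anchor_indices(region_mask, keep_prob, rng):
    # Hypothetical helper: keep a sparse random subset of the token
    # positions inside the selected region (dropout on the region).
    region = np.flatnonzero(region_mask)
    keep = rng.random(region.size) < keep_prob
    return region[keep]

def extended_attention(q, k_edit, v_edit, k_orig, v_orig, orig_indices):
    # Queries of the edited video attend to their own tokens plus the
    # original-video tokens listed in `orig_indices`.
    k = np.concatenate([k_edit, k_orig[orig_indices]], axis=0)
    v = np.concatenate([v_edit, v_orig[orig_indices]], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
q, k_edit, v_edit, k_orig, v_orig = (
    rng.standard_normal((n_tokens, dim)) for _ in range(5)
)

# Assumed region of original-scene tokens selected for extension.
region_mask = np.zeros(n_tokens, dtype=bool)
region_mask[4:12] = True

full_idx = np.arange(n_tokens)            # Full: every original token
masked_idx = np.flatnonzero(region_mask)  # Masked: the whole region
anchor_idx = select_anchor_indices(region_mask, keep_prob=0.25, rng=rng)  # Anchor: sparse points

out_full = extended_attention(q, k_edit, v_edit, k_orig, v_orig, full_idx)
out_masked = extended_attention(q, k_edit, v_edit, k_orig, v_orig, masked_idx)
out_anchor = extended_attention(q, k_edit, v_edit, k_orig, v_orig, anchor_idx)
```

In this sketch, exposing fewer original tokens leaves more of the attention mass on the edited video's own tokens, which is one way to read the looser constraint that the anchor variant places on overlapping regions.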
A knight riding a horse | Sampling T=0.9 | Sampling T=0.6 | Our method |
Full Ext. Att. | Masked Ext. Att. | Anchor Ext. Att. |
Limitations
In some cases, the T2V diffusion model struggles to precisely follow the edit prompt.
Original video | A fish bowl encircling the boy's head, in bowl fish goldfish swimming around the boy's head |