DynVFX: Augmenting Real Videos with Dynamic Content
Supplementary Material
Comparisons to Baselines
Existing methods are incapable of maintaining scene fidelity while introducing new content.
Our method successfully adds new content to the scene, achieving high fidelity to the user instructions while allowing natural, seamless interactions between the original and added elements.
| A bear dancing with the woman |
Ours |
Finetuning |
|
|
|
|
| A fire breathing dragon chasing the dog |
Ours |
Finetuning |
|
|
|
|
| A pair of deer drinking water from the creek |
Ours |
Finetuning |
|
|
|
|
Comparisons to AnyV2V and I2VEdit
| Add a puppy playing with the woman |
Edited First Frame |
|
|
|
| Add a massive explosion from the mountain |
Edited First Frame |
|
|
|
Comparisons to Pika and Kling
| dinosaurs eating leaves from trees, taking big bites |
Input Frame |
|
|
|
| Enthusiastic audience, clapping and snapping photos, as the dancer takes a bow |
Input Frame |
|
|
|
Ablations
Ablations. (b) Excluding both AnchorExtAttn and the iterative refinement process results in significant misalignment with the original scene and poor harmonization (e.g., the size of the puppy relative to the scene, and boundary artifacts). (c) Omitting AnchorExtAttn leads to incorrect positioning of the new content. (d) Removing iterative refinement results in poor harmonization. (e) Our full method exhibits good localization and harmonization of the edit.
|
A puppy peeking its head out of the box
|
w/o AnchorExtAttn. and Iter. Refin. |
w/o AnchorExtAttn. |
|
|
|
|
|
w/o Iter. Refin. |
Our method |
|
|
|
|
A massive tsunami flooding the city
|
w/o AnchorExtAttn. and Iter. Refin. |
w/o AnchorExtAttn. |
|
|
|
|
|
w/o Iter. Refin. |
Our method |
|
|
|
Ablation of VLM protocol
To validate the importance of our VLM protocol, we performed an ablation in which the VLM is given a simplified system prompt that only asks it to caption the original video and provide the edit prompt. With the simplified system prompt, the VLM either misinterprets the instruction with respect to the scene or the edit fails to add new content at all.
|
A group of jellyfish floating
|
w/o VLM protocol. |
Ours |
|
|
|
|
|
A group of jellyfish floating
|
w/o VLM protocol. |
Ours |
|
|
|
|
Extended Attention Analysis
Controlling fidelity to the original scene using different extended attention mechanisms.
(a-b) SDEdit suffers from a trade-off between preserving the original scene and fidelity to the edit. (c-e) Three Extended Attention variants applied during sampling offer different levels of control: Full Extended Attention closely reconstructs the input scene; Masked Extended Attention allows new content to emerge but proves too constrained in overlapping regions; and our Anchor Extended Attention achieves the best results by applying dropout, extending attention only at sparse points within selected regions.
| A knight riding a horse |
Sampling T=0.9 |
Sampling T=0.6 |
Our method |
|
|
|
|
|
|
Full Ext. Attn. |
Masked Ext. Attn. |
Anchor Ext. Attn. |
|
|
|
|
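The difference between the three Extended Attention variants compared above can be illustrated with a toy example. The following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`select_source_tokens`, `extended_attention`), the `keep_prob` parameter, and the flat token layout are all hypothetical. It assumes that "extending" attention means appending keys and values from the original video to those of the edited video, and that the variants differ only in which original tokens are exposed.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_source_tokens(region_mask, mode, keep_prob=0.1, rng=None):
    """Pick which original-video tokens are exposed as extra keys/values.

    region_mask: boolean vector over original-video tokens marking the
                 selected regions (hypothetical input).
    mode:        "full"   -> every original token,
                 "masked" -> only tokens inside the selected regions,
                 "anchor" -> a sparse random subset (dropout) of those tokens.
    """
    n = region_mask.shape[0]
    if mode == "full":
        return np.ones(n, dtype=bool)
    if mode == "masked":
        return region_mask.copy()
    if mode == "anchor":
        if rng is None:
            rng = np.random.default_rng(0)
        keep = rng.random(n) < keep_prob  # dropout over token positions
        return region_mask & keep
    raise ValueError(f"unknown mode: {mode}")


def extended_attention(q, k_edit, v_edit, k_src, v_src, src_keep):
    # Append the selected original-video keys/values to the edited ones,
    # then run ordinary scaled dot-product attention.
    k = np.concatenate([k_edit, k_src[src_keep]], axis=0)
    v = np.concatenate([v_edit, v_src[src_keep]], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


# Toy usage: 10 tokens of dimension 4; the first 6 lie in the selected region.
rng = np.random.default_rng(42)
q, k_edit, v_edit, k_src, v_src = (rng.normal(size=(10, 4)) for _ in range(5))
region_mask = np.array([True] * 6 + [False] * 4)
for mode in ("full", "masked", "anchor"):
    keep = select_source_tokens(region_mask, mode, keep_prob=0.5, rng=rng)
    out = extended_attention(q, k_edit, v_edit, k_src, v_src, keep)
    print(mode, int(keep.sum()), out.shape)
```

In this sketch, fewer exposed original tokens means weaker conditioning on the input scene, which mirrors the range described in the caption: full extension tends toward reconstructing the input, while the sparse anchor variant leaves more room for new content.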
Limitations
In some cases, the T2V diffusion model struggles to precisely follow the edit prompt.
| Original video |
A fish bowl encircling the boy's head, in bowl fish goldfish swimming around the boy's head |
|
|
|