DynVFX: Augmenting Real Videos with Dynamic Content

Supplementary Material

 




 


Comparisons to Baselines

Existing methods are incapable of maintaining scene fidelity while introducing new content. Our method successfully adds new content to the scene, achieving high fidelity to the user instructions while allowing natural, seamless interactions between the original and added elements.


Edit prompt: A bear dancing with the woman
Panels: Ours | Finetuning | SDEdit | FlowEdit | Gen-3 | DDIM Inversion

 


Edit prompt: A fire breathing dragon chasing the dog
Panels: Ours | Finetuning | SDEdit | FlowEdit | Gen-3 | DDIM Inversion

 


Edit prompt: A pair of deer drinking water from the creek
Panels: Ours | Finetuning | SDEdit | FlowEdit | Gen-3 | DDIM Inversion

 

 

Comparisons to AnyV2V and I2VEdit


 


Edit prompt: Add a puppy playing with the woman
Panels: Edited First Frame | AnyV2V | I2VEdit | Ours
Edit prompt: Add a massive explosion from the mountain
Panels: Edited First Frame | AnyV2V | I2VEdit | Ours

Comparisons to Pika and Kling


dinosaurs eating leaves from trees, taking big bites Input Frame
Bikers Input Frame
Ours Pika Kling

 


Enthusiastic audience, clapping and snapping photos, as the dancer takes a bow Input Frame
Dancer Stage Input Frame
Ours Pika Kling

Ablations

Ablations. (b) Excluding both AnchorExtAttn and the iterative refinement process results in significant misalignment with the original scene and poor harmonization (e.g., the size of the puppy relative to the scene, and boundary artifacts). (c) Omitting AnchorExtAttn leads to incorrect positioning of the new content. (d) Removing iterative refinement results in poor harmonization. (e) Our full method exhibits good localization and harmonization of the edit.


Edit prompt: A puppy peeking its head out of the box
Panels: w/o AnchorExtAttn and Iter. Refin. | w/o AnchorExtAttn | w/o Iter. Refin. | Our method

 


Edit prompt: A massive tsunami flooding the city
Panels: w/o AnchorExtAttn and Iter. Refin. | w/o AnchorExtAttn | w/o Iter. Refin. | Our method

 


Ablation of VLM protocol

To validate the importance of our VLM protocol, we performed an ablation in which the VLM is given a simplified system prompt that only asks it to caption the original video and provide the edit prompt. With this simplified system prompt, the result either misinterprets the instruction with respect to the scene or fails to add new content at all.


A group of jellyfish floating w/o VLM protocol. Ours

 


A group of jellyfish floating w/o VLM protocol. Ours

 


Extended Attention Analysis

Controlling fidelity to the original scene using different extended attention mechanisms.

(a-b) SDEdit suffers from a trade-off between preserving the original scene and fidelity to the edit. (c-e) Three extended attention variants applied during sampling give different levels of control: Full Extended Attention closely reconstructs the input scene; Masked Extended Attention allows new content to emerge but proves too constrained in overlapping regions; and our Anchor Extended Attention gives the best balance of the three by applying dropout, extending attention only at sparse points within selected regions.
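To make the difference between the three variants concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation used for these results: the function names (`extended_attention`, `anchor_keep_mask`), the single-head layout, the way the selected region is represented (a boolean mask over source tokens), and the `keep_prob` value are all hypothetical. The only idea taken from the text above is that generated-video queries attend to source-video keys and values, and that the anchor variant keeps only a sparse, randomly dropped-out subset of source tokens inside the selected regions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extended_attention(q, k_gen, v_gen, k_src, v_src, src_keep=None):
    """Single-head attention whose keys/values are extended with source tokens.

    q, k_gen, v_gen: (n, d) tokens of the video being generated.
    k_src, v_src:    (m, d) tokens extracted from the original video.
    src_keep:        (m,) boolean mask of source tokens made available
                     (hypothetical interface; None means keep all of them).
    """
    if src_keep is None:
        src_keep = np.ones(k_src.shape[0], dtype=bool)
    k = np.concatenate([k_gen, k_src[src_keep]], axis=0)
    v = np.concatenate([v_gen, v_src[src_keep]], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def anchor_keep_mask(region_mask, keep_prob, rng):
    """Keep a sparse random subset of source tokens inside the selected region.

    keep_prob is an assumed hyperparameter; the dropout rate is not
    specified in the text above.
    """
    sampled = rng.random(region_mask.shape[0]) < keep_prob
    return region_mask & sampled

# Toy usage with random tokens.
rng = np.random.default_rng(0)
n, m, d = 4, 6, 8
q, k_gen, v_gen = (rng.normal(size=(n, d)) for _ in range(3))
k_src, v_src = (rng.normal(size=(m, d)) for _ in range(2))
region = np.array([True, True, True, True, False, False])  # selected source tokens

out_full = extended_attention(q, k_gen, v_gen, k_src, v_src)            # Full Ext. Attn.
out_masked = extended_attention(q, k_gen, v_gen, k_src, v_src, region)  # Masked Ext. Attn.
anchors = anchor_keep_mask(region, keep_prob=0.5, rng=rng)
out_anchor = extended_attention(q, k_gen, v_gen, k_src, v_src, anchors)  # Anchor Ext. Attn.
```

In this sketch the three variants differ only in which source tokens are exposed to the queries: all of them, all tokens in the selected region, or a sparse sample of that region. Exposing fewer source tokens loosens the constraint toward the original scene, which is one way to read the behaviour described above.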


Edit prompt: A knight riding a horse
Panels: Sampling T=0.9 | Sampling T=0.6 | Our method | Full Ext. Att. | Masked Ext. Att. | Anchor Ext. Att.

 


Limitations

In some cases, the T2V diffusion model struggles to precisely follow the edit prompt.


Original video A fish bowl encircling the boy's head, in bowl fish goldfish swimming around the boy's head