EchoFoley

Event-Centric Hierarchical Control for Video-Grounded Creative Sound Generation

Bingxuan Li1,2,4 · Yiming Cui1 · Yicheng He1 · Yiwei Wang3 · Shu Zhang1 · Longyin Wen1 · Yulei Niu1
1ByteDance Intelligent Creation    2University of Illinois Urbana-Champaign    3University of California, Merced    4University of California, Los Angeles

Motivation of EchoFoley: Given a silent video, we generate creative, story-aligned soundtracks with fine-grained, event-level control over how each sound is crafted and transformed over time.

Demos


We show example videos generated by EchoFoley, compared against three baseline models across a range of creative, event-centric instructions. An illustrative sketch of how such hierarchical instructions could be written down as data follows the examples.

Instruction 1 (Instance-Level Control)

Add a sound of match scratching to the ignition sound at 00:05.

Results: HunyuanVideo-Foley · MMAudio · ThinkSound · EchoFoley (ours)

Instruction 2 (Instance-Level Control)

Insert a metallic pulse explosion sound right after the finger touches the interface, when the circuit lines appear.

Results: HunyuanVideo-Foley · MMAudio · ThinkSound · EchoFoley (ours)

Instruction 3 (Instance-Level Control)

Make the golf ball sound like a rocket when it flies out.

Results: HunyuanVideo-Foley · MMAudio · ThinkSound · EchoFoley (ours)

Instruction 4 (Group-Level Control)

Make the cat first meow, then hiss, and hiss again while punching, illustrating an escalation of emotion.

Results: HunyuanVideo-Foley · MMAudio · ThinkSound · EchoFoley (ours)

Instruction 5 (Video-Level Control)

Render the entire video with a futuristic, sci-fi aesthetic.

Results: HunyuanVideo-Foley · MMAudio · ThinkSound · EchoFoley (ours)
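
To make the three control levels concrete, the short Python sketch below shows one way the instructions above could be represented as structured data, with each edit tagged by its level and an optional timestamp anchor. The ControlLevel and SoundEventEdit names, their fields, and the whole schema are illustrative assumptions for this page, not the EchoFoley interface or implementation.

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ControlLevel(Enum):
    # Hypothetical labels for the three levels of the control hierarchy shown above.
    INSTANCE = "instance"  # edit a single sound event
    GROUP = "group"        # coordinate a sequence of related events
    VIDEO = "video"        # apply a global style to the whole soundtrack


@dataclass
class SoundEventEdit:
    # One event-centric instruction over a silent-video timeline (illustrative schema).
    level: ControlLevel
    instruction: str                 # free-form natural-language edit
    start_s: Optional[float] = None  # timestamp the edit is anchored to, if any
    end_s: Optional[float] = None    # optional end of the affected span


# Three of the demo instructions above, expressed in this hypothetical schema.
demo_instructions = [
    SoundEventEdit(ControlLevel.INSTANCE,
                   "Add a sound of match scratching to the ignition sound.",
                   start_s=5.0),
    SoundEventEdit(ControlLevel.GROUP,
                   "Cat meows, then hisses, then hisses again while punching."),
    SoundEventEdit(ControlLevel.VIDEO,
                   "Render the entire video with a futuristic, sci-fi aesthetic."),
]

# Print each edit with its level and timing so the hierarchy can be inspected.
for edit in demo_instructions:
    anchor = f"@{edit.start_s:.1f}s" if edit.start_s is not None else "(untimed)"
    print(f"[{edit.level.value:>8}] {anchor} {edit.instruction}")

In a full system such edits would presumably be grounded to detected video events rather than raw timestamps; the loop at the end only prints the level, timing, and instruction text of each entry.
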

BibTeX


@article{li2025echofoley,
title={EchoFoley: Event-Centric Hierarchical Control for Video-Grounded Creative Sound Generation},
author={Li, Bingxuan and Cui, Yiming and He, Yicheng and Wang, Yiwei and Zhang, Shu and Wen, Longyin and Niu, Yulei},
journal={arXiv preprint arXiv:2512.24731},
year={2025}
}