EchoFoley Event-Centric Hierarchical Control

EchoFoley

Event-Centric Hierarchical Control for Video-Grounded Creative Sound Generation

¹ByteDance Intelligent Creation    ²University of Illinois Urbana-Champaign
³University of California, Merced    ⁴University of California, Los Angeles

Motivation of EchoFoley: Given a silent video, we generate creative, story-aligned soundtracks with fine-grained, event-level control over how each sound is crafted and transformed over time.

Showcase

Generation Results


Comparison of EchoVidia against three strong baselines across a range of complex, event-centric instructions.

Instruction 1
Instance Level Control

"Add a sound of match scratching to the ignition sound at 00:05."

HunyuanVideo-Foley
MMAudio
ThinkSound
EchoVidia (Ours)

Instruction 2
Instance Level Control

"Insert a metallic pulse explosion sound right after the finger touches the interface, when the circuit lines appear."

HunyuanVideo-Foley
MMAudio
ThinkSound
EchoVidia (Ours)

Instruction 3
Instance Level Control

"Make the golf ball sound like a rocket when it flies out."

HunyuanVideo-Foley
MMAudio
ThinkSound
EchoVidia (Ours)

Instruction 4
Group Level Control

"Make the cat first meow, then hiss, and hiss again while punching, illustrating an escalation of emotion."

HunyuanVideo-Foley
MMAudio
ThinkSound
EchoVidia (Ours)

Instruction 5
Video Level Control

"Render the entire video with a futuristic, sci‑fi aesthetic."

HunyuanVideo-Foley
MMAudio
ThinkSound
EchoVidia (Ours)

Citation


If you find this work useful, please consider citing our paper:

@article{li2025echofoley,
  title={EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation},
  author={Li, Bingxuan and Cui, Yiming and He, Yicheng and Wang, Yiwei and Zhang, Shu and Wen, Longyin and Niu, Yulei},
  journal={arXiv preprint arXiv:2512.24731},
  year={2025}
}