Do Joint Audio-Video Generation Models Understand Physics?
1 University of Texas at Dallas 2 University of Washington 3 University of California, Los Angeles * Equal contribution. † Corresponding author.
AV-Phys Bench is the first comprehensive benchmark for evaluating physical commonsense in joint audio-video generation. It systematically tests joint audio-video generation models across three scene categories that probe how physical commonsense holds as the scene evolves: (a) Steady State, (b) Event Transition, and (c) Environment Transition. Seven models are evaluated by human raters, an MLLM-as-judge baseline, and the AV-Phys Agent.
Leaderboard
| # | Model | SA | PC | Both | V-SA | A-SA | V-PC | A-PC | AV-PC |
|---|---|---|---|---|---|---|---|---|---|
All numbers are pass rates in [0, 1]. SA = semantic adherence (V-SA ∧ A-SA); PC = physical commonsense (V-PC ∧ A-PC ∧ AV-PC); Both = SA ∧ PC. Headline values match Table 3 of the paper, computed on the Physics-following subset.
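The aggregate columns defined above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the `VideoVerdict` container and the `pass_rates` helper are hypothetical names, and the sketch assumes the conjunctions are taken per video before averaging over videos.

```python
from dataclasses import dataclass

# The five atomic leaderboard columns, as attribute names (hypothetical spelling).
DIMENSIONS = ("v_sa", "a_sa", "v_pc", "a_pc", "av_pc")


@dataclass(frozen=True)
class VideoVerdict:
    """Pass/fail for one generated video on the five atomic dimensions."""
    v_sa: bool
    a_sa: bool
    v_pc: bool
    a_pc: bool
    av_pc: bool

    @property
    def sa(self) -> bool:
        # SA = V-SA AND A-SA
        return self.v_sa and self.a_sa

    @property
    def pc(self) -> bool:
        # PC = V-PC AND A-PC AND AV-PC
        return self.v_pc and self.a_pc and self.av_pc

    @property
    def both(self) -> bool:
        # Both = SA AND PC
        return self.sa and self.pc


def pass_rates(verdicts):
    """Return the fraction of videos passing each column, each in [0, 1]."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    columns = ("sa", "pc", "both") + DIMENSIONS
    total = len(verdicts)
    return {c: sum(getattr(v, c) for v in verdicts) / total for c in columns}
```

Because each aggregate is a conjunction, a model's SA rate can never exceed its lower V-SA or A-SA rate, and Both can never exceed SA or PC.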
Taxonomy
Prompts are organized along a scene-evolution axis: what changes in the scene from start to end. Each top-level category has a fourth, Anti-Physics subcategory that deliberately violates a physical principle, a stress test of whether the model executes the instruction or defaults to plausible priors.
C1 Steady State · 159 prompts · source, action, and environment all fixed
C1-1 source_material · C1-2 source_anchoring · C1-3 sound_persistence · C1-4 anti
C2 Event Transition · 119 prompts · a discrete action changes the source state
C2-1 source_body · C2-2 source_excitation · C2-3 source_radiation · C2-4 anti
C3 Environment Transition · 43 prompts · source fixed, propagation path changes
C3-1 propagation_medium · C3-2 enclosure_geometry · C3-3 sound_attenuation · C3-4 anti
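For scripts that group results by category, the taxonomy above can be held in a small lookup table. The sketch below is illustrative only: the `TAXONOMY` layout and the `describe` helper are hypothetical, and pairing the prefixes C1, C2, and C3 with Steady State, Event Transition, and Environment Transition is an assumption based on the order in which the categories are listed.

```python
import re

# prefix -> (category name, prompt count, subcategory names in order 1..4)
# Assumption: C1/C2/C3 follow the listed order of the three scene categories.
TAXONOMY = {
    "C1": ("Steady State", 159,
           ("source_material", "source_anchoring", "sound_persistence", "anti")),
    "C2": ("Event Transition", 119,
           ("source_body", "source_excitation", "source_radiation", "anti")),
    "C3": ("Environment Transition", 43,
           ("propagation_medium", "enclosure_geometry", "sound_attenuation", "anti")),
}

# Matches subcategory codes such as "C2-3".
_CODE = re.compile(r"^(C[1-3])-([1-4])$")


def describe(code: str) -> dict:
    """Resolve a subcategory code like 'C2-3' to its category and name."""
    match = _CODE.match(code)
    if match is None:
        raise ValueError(f"not a subcategory code: {code!r}")
    prefix, position = match.group(1), int(match.group(2))
    name, _count, subcategories = TAXONOMY[prefix]
    subcategory = subcategories[position - 1]
    return {
        "category": name,
        "subcategory": subcategory,
        "anti_physics": subcategory == "anti",
    }


# Sum of the per-category prompt counts listed above.
TOTAL_PROMPTS = sum(count for _name, count, _subs in TAXONOMY.values())
```

Keeping the Anti-Physics flag explicit makes it easy to filter those prompts out when reproducing numbers reported on the Physics-following subset.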
Run AV-Phys on your own model
Each prompt comes with a hand-authored rubric averaging 8.6 Y/N statements. Drop your model's
outputs into videos/<your-model>/<INDEX>.mp4, and the AV-Phys Agent will score them
against the rubric on 5 dimensions.
# 1. clone the umbrella repo and enter the eval harness
git clone https://github.com/ZijunCui02/AV-Phys
cd AV-Phys/code
# 2. fetch the dataset (prompts + rubrics + videos) from HuggingFace
huggingface-cli download ZijunCui/AV-Phys-Bench --repo-type=dataset --local-dir data_release
# 3. generate one .mp4 per prompt into data_release/videos/<your-model>/
# 4. follow code/README.md to run the AV-Phys Agent evaluator on your model
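Before running the evaluator, it can help to confirm that step 3 produced one file per prompt. The following is a minimal sketch under stated assumptions: `missing_videos` is a hypothetical helper that is not part of the AV-Phys repo, and the exact form of the prompt indices is not specified here, so the caller supplies them.

```python
from pathlib import Path


def missing_videos(model_dir, prompt_indices):
    """Return the prompt indices that have no <INDEX>.mp4 file in model_dir.

    model_dir: a directory such as data_release/videos/<your-model>/
    prompt_indices: iterable of indices (hypothetical format; compared as strings)
    """
    # File stems are the names without the .mp4 extension.
    present = {path.stem for path in Path(model_dir).glob("*.mp4")}
    return [str(index) for index in prompt_indices if str(index) not in present]
```

An empty return value means every listed prompt has a matching video; anything else is the list of outputs still to be generated.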