Do Joint Audio-Video Generation Models Understand Physics?

Zijun Cui1,*, Xiulong Liu2,*, Hao Fang2,*, Mingwei Xu2, Jiageng Liu3, Zexin Xu1, Weiguo Pian1, Shijian Deng1, Feiyu Du1, Chenming Ge2, Yapeng Tian1,†

1 University of Texas at Dallas    2 University of Washington    3 University of California, Los Angeles      * Equal contribution.     † Corresponding author.

AV-Phys Bench is the first comprehensive benchmark for evaluating physical commonsense in joint audio-video generation. It systematically tests joint audio-video generation models across three scene categories that probe how physical commonsense holds as the scene evolves: (a) Steady State, (b) Event Transition, and (c) Environment Transition. Seven models are evaluated by human raters, an MLLM-as-judge baseline, and the AV-Phys Agent.

Seedance 2.0
Kling 3.0 Omni
Veo 3.1

Prompt C2-2-20: A speaker plays music at low volume, sounding quiet and thin. Then the volume knob is turned up gradually until the music fills the room.

Leaderboard

# Model SA PC Both V-SA A-SA V-PC A-PC AV-PC

All numbers are pass rates in [0, 1]. SA = semantic adherence (V-SA ∧ A-SA); PC = physical commonsense (V-PC ∧ A-PC ∧ AV-PC); Both = SA ∧ PC. Headline values match Table 3 of the paper on the Physics-following subset.
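The conjunctions above can be sketched in a few lines of Python. This is a minimal illustration, not the AV-Phys Agent's actual code: it assumes each video already has one boolean verdict per base dimension, and the function names `aggregate` and `pass_rates` are hypothetical. How a dimension's verdict is derived from the rubric statements is not specified here and is left out.

```python
from typing import Dict, List

# The five base dimensions named in the leaderboard header.
DIMENSIONS = ("V-SA", "A-SA", "V-PC", "A-PC", "AV-PC")


def aggregate(verdicts: Dict[str, bool]) -> Dict[str, bool]:
    """Add the composite verdicts SA, PC, and Both for one video."""
    sa = verdicts["V-SA"] and verdicts["A-SA"]
    pc = verdicts["V-PC"] and verdicts["A-PC"] and verdicts["AV-PC"]
    return {**verdicts, "SA": sa, "PC": pc, "Both": sa and pc}


def pass_rates(per_video: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of videos passing each base and composite dimension."""
    if not per_video:
        raise ValueError("need at least one scored video")
    rows = [aggregate(v) for v in per_video]
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}
```

Because Both is a conjunction of SA and PC, its pass rate can never exceed either of them, which is a quick sanity check on any leaderboard row.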

Taxonomy

Prompts are organized along a scene-evolution axis: what changes in the scene from start to end. Each top-level category has a fourth, Anti-Physics subcategory that deliberately violates a physical principle, as a stress test for whether the model executes the instruction or defaults to plausible priors.

Run AV-Phys on your own model

Each prompt comes with a hand-authored rubric of 8.6 ± Y/N statements on average. Drop your model's outputs into videos/<your-model>/<INDEX>.mp4, and the AV-Phys Agent will score them against the rubric on five dimensions (V-SA, A-SA, V-PC, A-PC, and AV-PC).

```bash
# 1. clone the umbrella repo and enter the eval harness
git clone https://github.com/ZijunCui02/AV-Phys
cd AV-Phys/code

# 2. fetch the dataset (prompts + rubrics + videos) from HuggingFace
huggingface-cli download ZijunCui/AV-Phys-Bench --repo-type=dataset --local-dir data_release

# 3. generate one .mp4 per prompt into data_release/videos/<your-model>/

# 4. follow code/README.md to run the AV-Phys Agent evaluator on your model
```
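Before running the evaluator in step 4, it may help to confirm that step 3 produced one file per prompt. The helper below is a hypothetical sketch, not part of the AV-Phys harness: it assumes the videos/<your-model>/<INDEX>.mp4 layout described above and that you already have the list of prompt indices (for example, identifiers such as C2-2-20). How to load that list from the dataset is not covered here; see code/README.md.

```python
from pathlib import Path
from typing import Iterable, List


def missing_videos(
    videos_root: Path, model_name: str, prompt_indices: Iterable[str]
) -> List[str]:
    """Return the prompt indices that have no <INDEX>.mp4 for this model.

    Hypothetical helper: assumes files live at
    <videos_root>/<model_name>/<INDEX>.mp4.
    """
    model_dir = Path(videos_root) / model_name
    return [
        idx for idx in prompt_indices
        if not (model_dir / f"{idx}.mp4").is_file()
    ]


# Example call (paths and model name are placeholders):
# missing = missing_videos(Path("data_release/videos"), "my-model", ["C2-2-20"])
# if missing:
#     print(f"{len(missing)} prompts still need a video:", missing)
```

An empty return value means every listed prompt has a video in place; anything else is the set of prompts still to generate.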