Do Joint Audio-Video Generation Models Understand Physics?

Zijun Cui¹,*, Xiulong Liu²,*, Hao Fang²,*, Mingwei Xu², Jiageng Liu³, Zexin Xu¹, Weiguo Pian¹, Shijian Deng¹, Feiyu Du¹, Chenming Ge², Yapeng Tian¹,†

¹ University of Texas at Dallas · ² University of Washington · ³ University of California, Los Angeles
* Equal contribution. † Corresponding author.


AV-Phys Bench is the first comprehensive benchmark for evaluating physical commonsense in joint audio-video generation. It systematically tests joint audio-video generation models across three scene categories that probe how well physical commonsense holds as the scene evolves: (a) Steady State, (b) Event Transition, and (c) Environment Transition. Seven models are evaluated by human raters, an MLLM-as-judge baseline, and the AV-Phys Agent.

Example prompt C2-2-20: “A speaker plays music at low volume, sounding quiet and thin. Then the volume knob is turned up gradually until the music fills the room.”

Seedance 2.0 ✓ · Kling 3.0 Omni × · Veo 3.1 ×

Leaderboard

#	Model	SA	PC	Both	V-SA	A-SA	V-PC	A-PC	AV-PC

All numbers are pass rates in [0, 1]. SA = semantic adherence (V-SA ∧ A-SA); PC = physical commonsense (V-PC ∧ A-PC ∧ AV-PC); Both = SA ∧ PC. Headline values match Table 3 of the paper on the Physics-following subset.
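The aggregation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the per-video verdict dictionaries and the helper names `aggregate` and `pass_rates` are assumptions introduced here. Only the five dimension names and the conjunction rules come from the definitions above.

```python
from typing import Dict, List

# The five per-dimension columns of the leaderboard.
DIMENSIONS = ("V-SA", "A-SA", "V-PC", "A-PC", "AV-PC")


def aggregate(verdict: Dict[str, bool]) -> Dict[str, bool]:
    """Derive SA, PC and Both for one video from its five verdicts.

    `verdict` is a hypothetical mapping from dimension name to pass/fail.
    """
    sa = verdict["V-SA"] and verdict["A-SA"]                       # SA = V-SA ∧ A-SA
    pc = verdict["V-PC"] and verdict["A-PC"] and verdict["AV-PC"]  # PC = V-PC ∧ A-PC ∧ AV-PC
    return {**verdict, "SA": sa, "PC": pc, "Both": sa and pc}      # Both = SA ∧ PC


def pass_rates(verdicts: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of videos passing each column; every value lies in [0, 1]."""
    if not verdicts:
        raise ValueError("need at least one video verdict")
    rows = [aggregate(v) for v in verdicts]
    columns = ("SA", "PC", "Both") + DIMENSIONS
    return {c: sum(r[c] for r in rows) / len(rows) for c in columns}


if __name__ == "__main__":
    # Two made-up videos: one passes everything, one fails only AV-PC.
    all_pass = dict.fromkeys(DIMENSIONS, True)
    one_fail = {**all_pass, "AV-PC": False}
    print(pass_rates([all_pass, one_fail]))
```

Because SA, PC and Both are conjunctions, each can be no higher than the lowest pass rate among the dimensions it combines.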

Taxonomy

Prompts are organized along a scene-evolution axis: what changes in the scene from start to end. Each top-level category has a fourth, Anti-Physics subcategory that deliberately violates a physical principle, a stress test of whether the model executes the instruction or defaults to plausible priors.

C1 Steady State

159 prompts · source, action, environment all fixed

C1-1 source_material · C1-2 source_anchoring · C1-3 sound_persistence · C1-4 anti

C2 Event Transition

119 prompts · a discrete action changes the source state

C2-1 source_body · C2-2 source_excitation · C2-3 source_radiation · C2-4 anti

C3 Environment Transition

43 prompts · source fixed, propagation path changes

C3-1 propagation_medium · C3-2 enclosure_geometry · C3-3 sound_attenuation · C3-4 anti
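For scripting against the taxonomy, the categories above can be written down as a small lookup table. The sketch below is illustrative only: the `TAXONOMY` layout and the `parse_index` helper are hypothetical, and the reading of an index such as C2-2-20 as category, subcategory, item number is an assumption inferred from the subcategory codes, not a documented format.

```python
from typing import Tuple

# Category names, prompt counts and subcategory labels as listed above.
TAXONOMY = {
    "C1": {
        "name": "Steady State",
        "prompts": 159,
        "subcategories": {
            "C1-1": "source_material",
            "C1-2": "source_anchoring",
            "C1-3": "sound_persistence",
            "C1-4": "anti",
        },
    },
    "C2": {
        "name": "Event Transition",
        "prompts": 119,
        "subcategories": {
            "C2-1": "source_body",
            "C2-2": "source_excitation",
            "C2-3": "source_radiation",
            "C2-4": "anti",
        },
    },
    "C3": {
        "name": "Environment Transition",
        "prompts": 43,
        "subcategories": {
            "C3-1": "propagation_medium",
            "C3-2": "enclosure_geometry",
            "C3-3": "sound_attenuation",
            "C3-4": "anti",
        },
    },
}


def parse_index(index: str) -> Tuple[str, str, int]:
    """Split an index like 'C2-2-20' into (category, subcategory, item).

    Assumes the index format '<category>-<subcategory digit>-<item number>'.
    """
    category, sub, item = index.split("-")
    subcategory = f"{category}-{sub}"
    if subcategory not in TAXONOMY.get(category, {}).get("subcategories", {}):
        raise ValueError(f"unknown subcategory in index: {index!r}")
    return category, subcategory, int(item)


if __name__ == "__main__":
    cat, sub, item = parse_index("C2-2-20")
    print(TAXONOMY[cat]["name"], TAXONOMY[cat]["subcategories"][sub], item)
    print(sum(c["prompts"] for c in TAXONOMY.values()), "prompts in total")
```

The three prompt counts listed above sum to 321.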

Run AV-Phys on your own model

Each prompt comes with a hand-authored rubric averaging 8.6 yes/no (Y/N) statements. Drop your model's outputs into videos/<your-model>/<INDEX>.mp4 and the AV-Phys Agent will score them against the rubric on five dimensions.

# 1. clone the umbrella repo and enter the eval harness
git clone https://github.com/ZijunCui02/AV-Phys
cd AV-Phys/code

# 2. fetch the dataset (prompts + rubrics + videos) from HuggingFace
huggingface-cli download ZijunCui/AV-Phys-Bench --repo-type=dataset --local-dir data_release

# 3. generate one .mp4 per prompt into data_release/videos/<your-model>/
# 4. follow code/README.md to run the AV-Phys Agent evaluator on your model
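
Before running the evaluator, it can help to confirm that step 3 produced one file per prompt. The following is a minimal sketch, assuming the `videos/<your-model>/<INDEX>.mp4` layout described above; the `missing_videos` helper is hypothetical and not part of the AV-Phys code, and the list of prompt indices has to come from the downloaded dataset, whose file format is not assumed here.

```python
from pathlib import Path
from typing import Iterable, List


def missing_videos(video_root: Path, model: str, indices: Iterable[str]) -> List[str]:
    """Return the prompt indices that have no <INDEX>.mp4 under video_root/model."""
    model_dir = Path(video_root) / model
    return [i for i in indices if not (model_dir / f"{i}.mp4").is_file()]


if __name__ == "__main__":
    # "my-model" and the index list are placeholders; substitute your own
    # model name and the full set of prompt indices from the dataset.
    missing = missing_videos(Path("data_release/videos"), "my-model", ["C2-2-20"])
    if missing:
        print(f"{len(missing)} video(s) missing:", ", ".join(missing))
    else:
        print("all videos present")
```

An empty result only means every expected file exists; it does not check that the files are valid videos with an audio track.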