Active Intelligence in Video Avatars via Closed-loop World Modeling

Xuanhua He¹,* Tianyu Yang²,† Ke Cao³ Ruiqi Wu² Cheng Meng² Yong Zhang²,‡ Zhuoliang Kang² Xiaoming Wei² Qifeng Chen¹,†

¹HKUST ²Meituan ³USTC

*Work done during internship at Meituan †Corresponding authors ‡Project leader


Abstract

Current video avatars produce high-quality visuals but lack agency. We bridge this gap with L-IVA, a benchmark for long-horizon planning, and ORCA, a framework that enables active intelligence. ORCA emulates an internal world model through a closed-loop Observe-Think-Act-Reflect (OTAR) cycle and a hierarchical dual-system architecture. This design allows avatars to reason autonomously, verify outcomes, and correct errors in real time. Extensive experiments show that ORCA outperforms the baselines, reaching a 71.0% average task success rate versus 62.3% for the strongest baseline, and advances video avatars from passive animation to active, goal-oriented behavior.

Figure 1. The ORCA Framework. ORCA enables active intelligence via a closed-loop OTAR (Observe-Think-Act-Reflect) cycle. It features a dual-system architecture: System 2 performs high-level strategic planning and state tracking, while System 1 grounds abstract plans into precise, model-specific action captions for the video generation model.
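
The closed loop described in Figure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `run_otar`, the `BeliefState` record, the callbacks `observe`, `ground`, `generate`, and `verify`, and the `max_retries` parameter are all hypothetical names, and the sub-goal list is assumed to be supplied up front rather than planned by a model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class BeliefState:
    """Hypothetical record of what System 2 believes has been achieved."""
    goal: str
    completed: List[str] = field(default_factory=list)


def run_otar(
    goal: str,
    subgoals: List[str],
    observe: Callable[[], str],
    ground: Callable[[str, str], str],
    generate: Callable[[str], Any],
    verify: Callable[[Any, str], bool],
    max_retries: int = 2,
) -> Tuple[BeliefState, bool]:
    """Run an Observe-Think-Act-Reflect loop over a fixed list of sub-goals."""
    state = BeliefState(goal)
    for subgoal in subgoals:                      # Think: pick the next sub-goal
        for _ in range(max_retries + 1):
            scene = observe()                     # Observe: describe the current scene
            caption = ground(subgoal, scene)      # Act: turn the sub-goal into a caption
            clip = generate(caption)              # Act: call the (stochastic) video generator
            if verify(clip, subgoal):             # Reflect: did the clip achieve the sub-goal?
                state.completed.append(subgoal)   # update the tracked state only on success
                break
        else:
            return state, False                   # retries exhausted, task failed
    return state, True


def demo() -> Tuple[BeliefState, bool]:
    """Stub generator whose first attempt at each caption fails, forcing one retry."""
    calls = {}

    def generate(caption: str) -> dict:
        calls[caption] = calls.get(caption, 0) + 1
        return {"caption": caption, "ok": calls[caption] >= 2}

    return run_otar(
        goal="fry the egg",
        subgoals=["crack the egg into the pan", "flip the egg"],  # made-up sub-goals
        observe=lambda: "kitchen counter with a pan",
        ground=lambda subgoal, scene: f"The avatar will {subgoal} ({scene}).",
        generate=generate,
        verify=lambda clip, subgoal: clip["ok"],
    )
```

The point of the sketch is the difference from an open-loop plan: the tracked state is updated only after a clip passes verification, so a failed generation triggers a retry instead of silently corrupting the rest of the plan.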

The L-IVA Benchmark

Unlike traditional benchmarks that evaluate single-clip aesthetics, L-IVA is the first benchmark designed to assess goal-directed planning in stochastic generative environments.

100 interactive tasks · diverse scenarios · 3-8 steps per task · goal-oriented evaluation
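
As a rough illustration of what a task in such a benchmark might look like as data, the sketch below defines a record with a scenario, an instruction, and a step list constrained to the 3-8 step range stated above. The `Task` class, its field names, and the example steps are hypothetical and are not taken from the L-IVA dataset format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not the dataset schema."""
    scenario: str
    instruction: str
    steps: Tuple[str, ...]

    def __post_init__(self) -> None:
        # The benchmark description states 3-8 steps per task.
        if not 3 <= len(self.steps) <= 8:
            raise ValueError(f"expected 3-8 steps, got {len(self.steps)}")


example = Task(
    scenario="Kitchen",
    instruction="fry the egg",
    # Made-up step decomposition, for illustration only.
    steps=("heat the pan", "crack the egg into the pan", "flip the egg"),
)
```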

Example tasks (scenario · task: instruction)

Garden · Mix soil and fertilizer: mix soil and fertilizer in the wheelbarrow

Kitchen · Fry the egg: fry the egg

Livestream · Product Demo: demonstrate the application of a facial serum

Office · Leave Office: finish work for the day, pack up personal items

Workshop · Sharpen Saw: stop the current work, take the saw from the wall, and sharpen it

Garden · Clean up the trash: put the trash in a bag

Kitchen · Wash Plate: collaborate to wash, dry, and stack a dinner plate

Livestream · Install GPU: collaborate to install the graphics card into the PC case

Office · Fix Printer: collaborate to remove jammed paper from the printer

Qualitative Results

Comparing ORCA against state-of-the-art baselines

[Video comparisons. Each case shows four methods side by side: Open-Loop, Reactive, VAGEN, and ORCA (Ours). Cases are listed in the five groups in which they appear.]

Case 01: Make a cup of tea by scooping leaves into the pot
Case 02: Make coffee
Case 03: Prepare simple guacamole

Case 01: Demonstrate the multi-light features of the makeup mirror
Case 02: Demonstrate how to make a simple vegetable salad
Case 03: Put on a diamond necklace and display it

Case 01: Mix soil and fertilizer in the wheelbarrow
Case 02: Check the beehives
Case 03: Start a fire in the fire pit

Case 01: Pause reading the book and join an online meeting
Case 02: Prepare for an upcoming video meeting and join the call
Case 03: Replace the toner cartridge in the copier

Case 01: Sharpen the saw
Case 02: Prepare a plant leaf sample and observe it
Case 03: Add motor oil to the engine

Quantitative Results

ORCA achieves the best average task success rate and physical plausibility among the compared methods

71.0% Task Success Rate · 3.72 Physical Plausibility · 28.7% Human Preference

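Task Success Rate is reported in percent and Physical Plausibility on a 1-5 scale; higher is better (↑) for both. Only the Kitchen and Garden columns are shown; the Average values do not equal the mean of these two columns, so they presumably cover the remaining scenarios as well.

Method	Success Rate: Kitchen (%) ↑	Success Rate: Garden (%) ↑	Success Rate: Average (%) ↑	Plausibility: Kitchen (1-5) ↑	Plausibility: Garden (1-5) ↑	Plausibility: Average (1-5) ↑
Reactive Agent	56.7	55.0	50.9	3.47	3.08	3.11
Open-Loop Planner	72.3	46.2	62.3	3.57	2.92	3.17
VAGEN	70.8	60.0	61.2	3.56	2.54	3.22
ORCA (Ours)	73.8	81.5	71.0	3.53	3.77	3.72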

Citation

@article{he2025orca,
    title={Active Intelligence in Video Avatars via Closed-loop World Modeling},
    author={He, Xuanhua and Yang, Tianyu and Cao, Ke and Wu, Ruiqi and Meng, Cheng and Zhang, Yong and Kang, Zhuoliang and Wei, Xiaoming and Chen, Qifeng},
    journal={arXiv preprint arXiv:2508.xxxxx},
    year={2025}
}