FullDiT2

Efficient In-Context Conditioning for Video Diffusion Transformers
1The Hong Kong University of Science and Technology    2Kuaishou Technology    †Corresponding author
Paper (PDF)

Abstract

Fine-grained yet efficient controllability of video diffusion transformers is increasingly important for practical applications. Recently, in-context conditioning has emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating varying context conditioning signals with noisy video latents into one long unified token sequence and processing them jointly via full attention, e.g., FullDiT. Despite their effectiveness, these methods incur quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. First, to address token redundancy in context conditions, FullDiT2 leverages a dynamic token selection mechanism that adaptively identifies important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents across diffusion steps. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation, and in some cases improvement, in video generation quality.

Showcase: Diverse Capabilities

FullDiT2 maintains high fidelity across diverse generation tasks.

Task: Insert a specific identity (image) into a reference video while preserving motion/style.

Sample 1
Ref Video
ID Ref
FullDiT2 Output
Sample 2
Ref Video
ID Ref
FullDiT2 Output
Sample 3
Ref Video
ID Ref
FullDiT2 Output
Sample 4
Ref Video
ID Ref
FullDiT2 Output
Sample 5
Ref Video
ID Ref
FullDiT2 Output
Sample 6
Ref Video
ID Ref
FullDiT2 Output

Task: Swap the subject in the video with a target ID.

Sample 1
Ref Video
Target ID
FullDiT2 Output
Sample 2
Ref Video
Target ID
FullDiT2 Output
Sample 3
Ref Video
Target ID
FullDiT2 Output
Sample 4
Ref Video
Target ID
FullDiT2 Output
Sample 5
Ref Video
Target ID
FullDiT2 Output
Sample 6
Ref Video
Target ID
FullDiT2 Output

Task: Cleanly remove the main subject/object.

Sample 1
Ref Video
Deletion Output
Sample 2
Ref Video
Deletion Output
Sample 3
Ref Video
Deletion Output
Sample 4
Ref Video
Deletion Output
Sample 5
Ref Video
Deletion Output
Sample 6
Ref Video
Deletion Output

Task: Render new views based on a reference video and camera trajectory.

Sample 1
Ref Video
Cam Traj.
FullDiT2 Output
Sample 2
Ref Video
Cam Traj.
FullDiT2 Output
Sample 3
Ref Video
Cam Traj.
FullDiT2 Output
Sample 4
Ref Video
Cam Traj.
FullDiT2 Output
Sample 5
Ref Video
Cam Traj.
FullDiT2 Output
Sample 6
Ref Video
Cam Traj.
FullDiT2 Output

Task: Driven by pose sequences.

Sample 1
Pose Sequence
Generated Video
Sample 2
Pose Sequence
Generated Video
Sample 3
Pose Sequence
Generated Video
Sample 4
Pose Sequence
Generated Video
Sample 5
Pose Sequence
Generated Video
Sample 6
Pose Sequence
Generated Video

Task: Camera Trajectory + Text Prompt -> Video.

Sample 1
Cam Traj.
a fantastical treehouse city, rendered in a bright, expressive animated style.
Text
FullDiT2 Output
Sample 2
Cam Traj.
A dramatic Chinese ink painting of a waterfall
Text
FullDiT2 Output
Sample 3
Cam Traj.
A wonderful scene of Universe stars
Text
FullDiT2 Output
Sample 4
Cam Traj.
A fantastical underwater city of Atlantis
Text
FullDiT2 Output
Sample 5
Cam Traj.
A first-person perspective through an underwater environment.
Text
FullDiT2 Output
Sample 6
Cam Traj.
A collection of festival decorations.
Text
FullDiT2 Output

Our Approach: FullDiT2

Traditional approaches to conditional video generation, such as adapter-based methods, often require introducing additional network structures for specific tasks, which limits flexibility. As shown in Figure 1, In-Context Conditioning (ICC), as exemplified by models like FullDiT, offers a more unified solution: it concatenates condition tokens with noisy latents and processes them jointly, achieving diverse control capabilities. However, this token concatenation strategy, while effective, introduces a significant computational burden due to the quadratic complexity of full attention over the extended sequences. To address this challenge, we propose FullDiT2, an efficient ICC framework. FullDiT2 inherits the versatile context conditioning mechanism but introduces two key innovations to mitigate the computational overhead: 1) Dynamic Token Selection, which reduces the sequence length for full attention by identifying important context tokens, and 2) Selective Context Caching, which minimizes redundant computation by caching and skipping context tokens across diffusion steps and blocks. Our method thus realizes an efficient and effective ICC framework for controllable video generation and editing.

1. Dynamic Token Selection

To address token redundancy where many context tokens might be less informative, each Transformer block in FullDiT2 adaptively selects an informative subset of reference tokens (e.g., top 50% in our implementation) using a lightweight, learnable importance prediction network operating on reference Value vectors. This reduces the sequence length for attention involving reference tokens, lowering computational cost from $O((n_z+n_c)^2)$ towards $O((n_z+k)^2)$. Unselected reference tokens bypass the attention mechanism and are re-concatenated after the Feed-Forward Network to preserve their information for subsequent layers.
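The selection step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the importance predictor is reduced to a single linear scoring head (`w`, random here, learned jointly with the model in practice), and only the index bookkeeping is shown.

```python
import numpy as np

def dynamic_token_select(ref_values, keep_ratio=0.5, w=None):
    """Score reference Value vectors with a lightweight linear predictor
    and keep the top-k most informative tokens.

    ref_values: (n_c, d) array of reference Value vectors.
    keep_ratio: fraction of reference tokens kept for attention (0.5 here,
                matching the top-50% setting described in the text).
    w: (d,) weights of the hypothetical importance head; random if None.
    Returns (kept, dropped) index arrays. Kept tokens join the attention
    sequence; dropped tokens bypass attention and are re-concatenated
    after the FFN so later layers still see them.
    """
    n_c, d = ref_values.shape
    if w is None:
        rng = np.random.default_rng(0)
        w = rng.standard_normal(d) / np.sqrt(d)
    scores = ref_values @ w                       # (n_c,) importance scores
    k = max(1, int(n_c * keep_ratio))
    kept = np.argsort(scores)[-k:]                # indices of top-k tokens
    dropped = np.setdiff1d(np.arange(n_c), kept)  # bypass attention
    return kept, dropped
```

With n_z latent tokens and k kept reference tokens, attention then runs over n_z + k tokens instead of n_z + n_c, giving the cost reduction stated above.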

2. Selective Context Caching

To tackle computation redundancy across timesteps and layers, FullDiT2 first identifies important layers for reference token processing using a Block Importance Index. Only these pre-selected important layers (e.g., 4 layers with highest BI plus the first layer for token projection in our model) process reference information; intermediate layers only process noisy tokens, with reference representations passed directly between important layers. For temporal efficiency, especially given that context tokens are relatively static across diffusion steps compared to noisy latents, we cache the Key (K) and Value (V) of selected top-k reference tokens from the first sampling step ($T_0$). These cached K/V values are then reused in subsequent steps for the non-skipped layers, avoiding redundant re-computation. Decoupled attention is employed to maintain training-inference consistency during this caching process, as naive caching can lead to misalignment.
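The caching logic can be sketched as below. This is an assumed minimal interface, not the paper's code: it only demonstrates the two skipping rules, that non-important layers never attend to reference tokens, and that important layers compute reference K/V once at the first sampling step and reuse the cached values afterward.

```python
class ContextKVCache:
    """Sketch of selective context caching across diffusion steps/layers.

    important_layers: indices of the pre-selected layers (chosen by the
    Block Importance Index in the paper) that process reference tokens.
    All other layers process noisy tokens only.
    """

    def __init__(self, important_layers):
        self.important_layers = set(important_layers)
        self.cache = {}  # layer index -> cached (K_ref, V_ref)

    def kv_for(self, layer, step, compute_ref_kv):
        """Return reference (K, V) for this layer at this step.

        compute_ref_kv: callable producing fresh (K, V) for the selected
        reference tokens; invoked only at the first sampling step (T0)
        and only for important layers. Returns None for skipped layers.
        """
        if layer not in self.important_layers:
            return None                       # skipped layer: no ref tokens
        if step == 0:                         # first sampling step T0
            self.cache[layer] = compute_ref_kv()
        return self.cache[layer]              # later steps reuse the cache
```

Per the decoupled-attention note above, a real implementation must keep training and inference consistent when reusing cached K/V; that detail is omitted here.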

Comparison of our FullDiT2 with adapter-based methods and FullDiT.

Comparison

Figure 2: Comparison of our FullDiT2 with adapter-based methods and FullDiT.

FullDiT2 Framework Overview

Framework Overview

Figure 3: (Left) Dynamic Token Selection (DTS) module selects top-K reference tokens for attention. (Right) Selective Context Caching illustrates temporal-layer caching and skipping for efficiency.

Comparisons

ID Insertion Highlights: FullDiT2 can even outperform the baseline in ID insertion tasks.

  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141
  • CLIP-I (Baseline vs Ours): 0.568 vs 0.605 (Higher is better)
  • DINO-S (Baseline vs Ours): 0.254 vs 0.313 (Higher is better)
Case 1
Baseline
FullDiT2
Case 2
Baseline
FullDiT2
ID Insert
Input
ID Ref
Delta-DiT
FORA
FullDiT2
ID Swap
Input
ID Ref
Delta-DiT
FORA
FullDiT2
ID Delete
Input
Delta-DiT
FORA
FullDiT2
Re-Camera
Input
Delta-DiT
FORA
FullDiT2
Trajectory
Cam Traj.
Delta-DiT
FORA
FullDiT2
Pose
Pose Seq.
Delta-DiT
FORA
FullDiT2

Summary