FullDiT2

Efficient In-Context Conditioning for Video Diffusion Transformers
1The Hong Kong University of Science and Technology    2Kuaishou Technology    †Corresponding author
Paper (PDF)

Abstract

Fine-grained yet efficient controllability of video diffusion transformers is increasingly important for practical applications. Recently, in-context conditioning has emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating varying context conditioning signals with noisy video latents into one long unified token sequence and processing them jointly via full attention, e.g., FullDiT. Despite their effectiveness, these methods incur quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. First, to address token redundancy in context conditions, FullDiT2 leverages a dynamic token selection mechanism that adaptively identifies important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents across diffusion steps. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation, and in some cases improvement, in video generation quality.

Showcase: Diverse Capabilities

FullDiT2 maintains high fidelity across diverse generation tasks.

Task: Insert a specific identity (image) into a reference video while preserving motion/style.

Sample 1
Ref Video
ID Ref
FullDiT2 Output
Sample 2
Ref Video
ID Ref
FullDiT2 Output
Sample 3
Ref Video
ID Ref
FullDiT2 Output
Sample 4
Ref Video
ID Ref
FullDiT2 Output
Sample 5
Ref Video
ID Ref
FullDiT2 Output
Sample 6
Ref Video
ID Ref
FullDiT2 Output

Task: Swap the subject in the video with a target ID.

Sample 1
Ref Video
Target ID
FullDiT2 Output
Sample 2
Ref Video
Target ID
FullDiT2 Output
Sample 3
Ref Video
Target ID
FullDiT2 Output
Sample 4
Ref Video
Target ID
FullDiT2 Output
Sample 5
Ref Video
Target ID
FullDiT2 Output
Sample 6
Ref Video
Target ID
FullDiT2 Output

Task: Cleanly remove the main subject/object.

Sample 1
Ref Video
Deletion Output
Sample 2
Ref Video
Deletion Output
Sample 3
Ref Video
Deletion Output
Sample 4
Ref Video
Deletion Output
Sample 5
Ref Video
Deletion Output
Sample 6
Ref Video
Deletion Output

Task: Render new views based on a reference video and camera trajectory.

Sample 1
Ref Video
Cam Traj.
FullDiT2 Output
Sample 2
Ref Video
Cam Traj.
FullDiT2 Output
Sample 3
Ref Video
Cam Traj.
FullDiT2 Output
Sample 4
Ref Video
Cam Traj.
FullDiT2 Output
Sample 5
Ref Video
Cam Traj.
FullDiT2 Output
Sample 6
Ref Video
Cam Traj.
FullDiT2 Output

Task: Driven by pose sequences.

Sample 1
Pose Sequence
Generated Video
Sample 2
Pose Sequence
Generated Video
Sample 3
Pose Sequence
Generated Video
Sample 4
Pose Sequence
Generated Video
Sample 5
Pose Sequence
Generated Video
Sample 6
Pose Sequence
Generated Video

Task: Camera Trajectory + Text Prompt -> Video.

Sample 1
Cam Traj.
a fantastical treehouse city, rendered in a bright, expressive animated style.
Text
FullDiT2 Output
Sample 2
Cam Traj.
A dramatic Chinese ink painting of a waterfall
Text
FullDiT2 Output
Sample 3
Cam Traj.
A wonderful scene of Universe stars
Text
FullDiT2 Output
Sample 4
Cam Traj.
A fantastical underwater city of Atlantis
Text
FullDiT2 Output
Sample 5
Cam Traj.
A first-person perspective through an underwater environment.
Text
FullDiT2 Output
Sample 6
Cam Traj.
A collection of festival decorations.
Text
FullDiT2 Output

Our Approach: FullDiT2

Traditional approaches to conditional video generation, such as adapter-based methods, often require introducing additional network structures for specific tasks, which limits flexibility. As shown in Figure 1, In-Context Conditioning (ICC), as exemplified by models like FullDiT, offers a more unified solution: it concatenates condition tokens with noisy latents and processes them jointly, achieving diverse control capabilities. However, this token concatenation strategy, while effective, introduces a significant computational burden due to the quadratic complexity of full attention over the extended sequences. To address this challenge, we propose FullDiT2, an efficient ICC framework. FullDiT2 inherits the versatile context conditioning mechanism but introduces two key innovations to mitigate the computational overhead: 1) Dynamic Token Selection, which reduces the sequence length for full attention by identifying important context tokens, and 2) Selective Context Caching, which minimizes redundant computation by caching and skipping context tokens across diffusion steps and blocks. Our method thus realizes an efficient and effective ICC framework for controllable video generation and editing.

1. Dynamic Token Selection

To address token redundancy where many context tokens might be less informative, each Transformer block in FullDiT2 adaptively selects an informative subset of reference tokens (e.g., top 50% in our implementation) using a lightweight, learnable importance prediction network operating on reference Value vectors. This reduces the sequence length for attention involving reference tokens, lowering computational cost from $O((n_z+n_c)^2)$ towards $O((n_z+k)^2)$. Unselected reference tokens bypass the attention mechanism and are re-concatenated after the Feed-Forward Network to preserve their information for subsequent layers.
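The selection step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the importance predictor is reduced to a single linear scoring head (`w`, random here, learned jointly with the model in practice), and only the index bookkeeping is shown.

```python
import numpy as np

def dynamic_token_select(ref_values, keep_ratio=0.5, w=None):
    """Score reference Value vectors with a lightweight linear predictor
    and keep the top-k most informative tokens.

    ref_values: (n_c, d) array of reference Value vectors.
    keep_ratio: fraction of reference tokens kept for attention (0.5 here,
                matching the top-50% setting described in the text).
    w: (d,) weights of the hypothetical importance head; random if None.
    Returns (kept, dropped) index arrays. Kept tokens join the attention
    sequence; dropped tokens bypass attention and are re-concatenated
    after the FFN so later layers still see them.
    """
    n_c, d = ref_values.shape
    if w is None:
        rng = np.random.default_rng(0)
        w = rng.standard_normal(d) / np.sqrt(d)
    scores = ref_values @ w                       # (n_c,) importance scores
    k = max(1, int(n_c * keep_ratio))
    kept = np.argsort(scores)[-k:]                # indices of top-k tokens
    dropped = np.setdiff1d(np.arange(n_c), kept)  # bypass attention
    return kept, dropped
```

With n_z latent tokens and k kept reference tokens, attention then runs over n_z + k tokens instead of n_z + n_c, giving the cost reduction stated above.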

2. Selective Context Caching

To tackle computation redundancy across timesteps and layers, FullDiT2 first identifies important layers for reference token processing using a Block Importance Index. Only these pre-selected important layers (e.g., 4 layers with highest BI plus the first layer for token projection in our model) process reference information; intermediate layers only process noisy tokens, with reference representations passed directly between important layers. For temporal efficiency, especially given that context tokens are relatively static across diffusion steps compared to noisy latents, we cache the Key (K) and Value (V) of selected top-k reference tokens from the first sampling step ($T_0$). These cached K/V values are then reused in subsequent steps for the non-skipped layers, avoiding redundant re-computation. Decoupled attention is employed to maintain training-inference consistency during this caching process, as naive caching can lead to misalignment.
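The caching logic can be sketched as below. This is an assumed minimal interface, not the paper's code: it only demonstrates the two skipping rules, that non-important layers never attend to reference tokens, and that important layers compute reference K/V once at the first sampling step and reuse the cached values afterward.

```python
class ContextKVCache:
    """Sketch of selective context caching across diffusion steps/layers.

    important_layers: indices of the pre-selected layers (chosen by the
    Block Importance Index in the paper) that process reference tokens.
    All other layers process noisy tokens only.
    """

    def __init__(self, important_layers):
        self.important_layers = set(important_layers)
        self.cache = {}  # layer index -> cached (K_ref, V_ref)

    def kv_for(self, layer, step, compute_ref_kv):
        """Return reference (K, V) for this layer at this step.

        compute_ref_kv: callable producing fresh (K, V) for the selected
        reference tokens; invoked only at the first sampling step (T0)
        and only for important layers. Returns None for skipped layers.
        """
        if layer not in self.important_layers:
            return None                       # skipped layer: no ref tokens
        if step == 0:                         # first sampling step T0
            self.cache[layer] = compute_ref_kv()
        return self.cache[layer]              # later steps reuse the cache
```

Per the decoupled-attention note above, a real implementation must keep training and inference consistent when reusing cached K/V; that detail is omitted here.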

Comparison of our FullDiT2 with adapter-based methods and FullDiT.

Comparison

Figure 2: Comparison of our FullDiT2 with adapter-based methods and FullDiT.

FullDiT2 Framework Overview

Framework Overview

Figure 3: (Left) Dynamic Token Selection (DTS) module selects top-K reference tokens for attention. (Right) Selective Context Caching illustrates temporal-layer caching and skipping for efficiency.

Comparisons

ID Insertion Highlights: FullDiT2 can even outperform the baseline in ID insertion tasks.

  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141
  • CLIP-I (Baseline vs Ours): 0.568 vs 0.605 (Higher is better)
  • DINO-S (Baseline vs Ours): 0.254 vs 0.313 (Higher is better)
Case 1
Baseline
FullDiT2
Case 2
Baseline
FullDiT2
ID Insert
Input
ID Ref
Delta-DiT
FORA
FullDiT2
ID Swap
Input
ID Ref
Delta-DiT
FORA
FullDiT2
ID Delete
Input
Delta-DiT
FORA
FullDiT2
Re-Camera
Input
Delta-DiT
FORA
FullDiT2
Trajectory
Cam Traj.
Delta-DiT
FORA
FullDiT2
Pose
Pose Seq.
Delta-DiT
FORA
FullDiT2

Summary