---
license: apache-2.0
datasets:
- FastVideo/Wan-Syn_77x448x832_600k
base_model:
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
---

# FastVideo FastWan2.1-T2V-1.3B-Diffusers Model

FastVideo Team

Paper | Github

## Introduction

We're excited to introduce the FastWan2.1 series, a new line of models finetuned with our novel **Sparse-distill** strategy. This approach integrates DMD and VSA in a single training process: distillation shortens the number of diffusion steps, and sparse attention reduces attention computation, which together enable even faster video generation.

FastWan2.1-T2V-1.3B-Diffusers is built upon Wan-AI/Wan2.1-T2V-1.3B-Diffusers. It supports efficient 3-step inference and produces high-quality videos at 61×448×832 resolution. For training, we use the FastVideo 480P Synthetic Wan dataset, which contains 600k synthetic latents.

---

## Model Overview

- 3-step inference is supported and achieves up to **20 FPS** on a single **H100** GPU.
- The model is trained at **61×448×832** resolution. It can generate videos at any resolution, but quality may degrade away from the training resolution.
- Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:
  - [1 Node/GPU debugging finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
  - [Slurm training example script](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm)
  - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
- Try it out on **FastVideo**: we support a wide range of GPUs from **H100** to **4090**, and also support **Mac** users!

### Training Infrastructure

Training was conducted on **4 nodes with 32 H200 GPUs** in total, using a `global batch size = 64`. We enable `gradient checkpointing`, set `gradient_accumulation_steps=2`, and use `learning rate = 1e-5`.
We set **VSA attention sparsity** to 0.8, and training runs for **4000 steps (~12 hours)**.

If you use the FastWan2.1-T2V-1.3B-Diffusers model for your research, please cite our papers:

```
@article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
  author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.13389},
  year={2025}
}

@article{zhang2025fast,
  title={Fast video generation with sliding tile attention},
  author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
  journal={arXiv preprint arXiv:2502.04507},
  year={2025}
}
```
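The figures quoted under Training Infrastructure can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions that this card does not confirm: that all 32 GPUs act as independent data-parallel ranks (any sequence parallelism would change the per-GPU figure), that one training step consumes exactly one global batch, and that "600k" means 600,000 latents. The variable names are hypothetical and are not FastVideo configuration keys.

```python
# Values quoted in the Training Infrastructure section of this card.
num_gpus = 32
global_batch_size = 64
gradient_accumulation_steps = 2
training_steps = 4000

# ASSUMPTION: "600k synthetic latents" is taken as exactly 600,000.
dataset_size = 600_000

# ASSUMPTION: pure data parallelism, so every GPU is its own rank and
# global batch = num_gpus * per-GPU micro-batch * accumulation steps.
per_gpu_micro_batch, remainder = divmod(
    global_batch_size, num_gpus * gradient_accumulation_steps
)

# ASSUMPTION: each training step consumes one full global batch.
samples_seen = training_steps * global_batch_size
fraction_of_dataset = samples_seen / dataset_size

print(f"per-GPU micro-batch: {per_gpu_micro_batch} (remainder {remainder})")
print(f"samples seen:        {samples_seen}")
print(f"dataset fraction:    {fraction_of_dataset:.3f}")
```

Under these assumptions each GPU processes one sample per micro-step, and the 4000-step run sees 256,000 samples, which is less than one pass over the 600k-latent dataset.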