
A video model that accepts text, images, video clips, and audio files simultaneously. Seedance 2.0 creates multi-shot 2K videos up to 15 seconds with consistent character identity, precise camera control, and joint audio-video output — 150 credits per 5-second clip.
Seedance 2.0 accepts all four input types simultaneously: text, images, video clips, and audio files. Built on a unified multimodal architecture, it creates multi-shot 2K video up to 15 seconds with up to 6 independently controlled camera cuts in a single pass. An internal reference-locking system maintains character identity — same face, clothing, and proportions — across every shot. The Dual-Branch Diffusion Transformer produces layered audio-video output: spoken dialogue with phoneme-level lip-sync in 8+ languages, context-aware foley effects, and environmental ambience.
Key specifications of the Seedance 2.0 model.
Max Resolution: 2K
Multi-Shot Sequences: up to 6 camera cuts
Max Duration: 15 seconds
Describe the scene in natural language (up to 2,500 characters). Attach reference images for character likeness, video clips for motion style, or audio files for rhythm and dialogue timing. Use the @ system to assign each file a role.
Define the number of shots (up to 6) and specify camera movement for each — dolly zoom, tracking shot, handheld, or locked. Set overall duration (up to 15 seconds) and aspect ratio.
The Dual-Branch Diffusion Transformer processes all inputs and produces a multi-shot video with synchronized audio in one inference pass. Flat-rate pricing: 150 credits per 5 seconds, regardless of resolution or aspect ratio.
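The three steps above can be sketched as a request payload. This is a minimal illustration only: the field names (`prompt`, `shots`, `duration_seconds`, `aspect_ratio`) and the builder function are assumptions, not a documented Seedance API. The limits it enforces (2,500-character prompt, 6 shots, 15 seconds) are the ones stated above.

```python
# Hypothetical request builder for a Seedance 2.0 generation call.
# Field names are illustrative assumptions; the limits come from the docs.

def build_request(prompt, shots, duration_seconds=15, aspect_ratio="16:9"):
    """Validate the stated limits: 2,500-char prompt, 6 shots, 15 seconds."""
    if len(prompt) > 2500:
        raise ValueError("prompt exceeds the 2,500-character limit")
    if len(shots) > 6:
        raise ValueError("at most 6 shots per sequence")
    if duration_seconds > 15:
        raise ValueError("maximum duration is 15 seconds")
    return {
        "prompt": prompt,
        "shots": shots,  # one camera behavior per shot
        "duration_seconds": duration_seconds,
        "aspect_ratio": aspect_ratio,
    }

request = build_request(
    prompt="A courier sprints through a rain-soaked night market.",
    shots=[
        {"camera": "tracking shot"},
        {"camera": "dolly zoom"},
        {"camera": "handheld"},
    ],
    duration_seconds=10,
    aspect_ratio="9:16",
)
```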
The @ reference system lets you assign specific roles to each uploaded file: @face for character likeness, @motion for movement style, @style for visual tone, @audio for soundtrack sync. No other model offers this level of compositional control over multimodal inputs.
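The four @ roles can be modeled as a simple validated mapping. The helper below is a sketch under assumptions — file names, the dict structure, and the rendered `@role:file` syntax are illustrative, not the product's actual upload mechanism; only the four role names come from the description above.

```python
# Hypothetical tagging of uploaded files with the four @ roles
# described above. Structure and output format are illustrative.

ALLOWED_ROLES = {"face", "motion", "style", "audio"}

def tag_references(files):
    """files: dict of role -> filename. Rejects roles outside the four."""
    unknown = set(files) - ALLOWED_ROLES
    if unknown:
        raise ValueError(f"unknown @ roles: {sorted(unknown)}")
    return [f"@{role}:{name}" for role, name in sorted(files.items())]

refs = tag_references({
    "face": "protagonist.png",    # character likeness
    "motion": "parkour_clip.mp4", # movement style
    "audio": "dialogue.wav",      # rhythm and dialogue timing
})
```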
Dolly zooms, rack focuses, tracking shots, POV switches, smooth handheld — Seedance 2.0 scored 9/10 for camera control in benchmark testing, the highest among competing models. Each shot in a multi-shot sequence can have its own camera behavior.
ByteDance incorporated physics-aware training that penalizes impossible motion during generation. Cloth drapes and wrinkles naturally, water splashes with correct weight, collisions have impact, and characters shift balance when walking.
Audio and video are generated simultaneously through the Dual-Branch Diffusion Transformer — not as a post-processing step. The output includes phoneme-level lip-sync in 8+ languages, layered foley, and environmental ambience.
16:9, 9:16, 1:1, 4:3, 3:4, and 21:9. Same aspect ratio flexibility as Seedance 1.5 Pro, now at 2K resolution.
150 credits per 5 seconds, regardless of resolution or aspect ratio. Audio generation included at no extra cost. Simpler than Seedance 1.5 Pro's dynamic pricing, though not always cheaper for low-resolution short clips.
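The flat rate makes cost estimation a one-liner. The sketch below assumes billing rounds partial clips up to the next 5-second block — that rounding rule is my assumption, not stated in the source; the 150-credits-per-5-seconds rate is.

```python
import math

# Flat-rate pricing stated above: 150 credits per 5-second block,
# independent of resolution and aspect ratio. Rounding partial blocks
# up to a full increment is an assumption.

CREDITS_PER_BLOCK = 150
BLOCK_SECONDS = 5

def credit_cost(duration_seconds):
    blocks = math.ceil(duration_seconds / BLOCK_SECONDS)
    return blocks * CREDITS_PER_BLOCK

print(credit_cost(5))   # 150
print(credit_cost(15))  # 450 for a maximum-length clip
```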
Multi-shot sequences, physics-aware motion, and quad-modal input — all generated by Seedance models with no post-editing or compositing.






Multi-shot 2K video with quad-modal input, persistent character identity, and joint audio generation. 150 credits per 5 seconds.