
The first Seedance model with joint audio-video generation. Seedance 1.5 Pro produces videos with phoneme-level lip-sync across 8+ languages, and offers resolution-based pricing from 480p to 1080p with per-second duration control from 4 to 12 seconds.
Seedance 1.5 Pro is ByteDance's first model to ship joint audio-video generation. Audio and video are produced together — not layered after the fact — through the same Dual-Branch Diffusion Transformer architecture later carried into Seedance 2.0. Its standout capability is phoneme-level multilingual lip-sync: characters speak with accurate mouth shapes in English, Chinese, Japanese, Korean, Spanish, French, German, Portuguese, and more. Combined with three resolution tiers and duration from 4 to 12 seconds, it covers everything from quick social clips to longer narrative scenes.
Describe a scene and receive a video with motion, camera work, and optionally synchronized dialogue or sound effects. Prompts support detailed multi-part instructions with emotional tone and environmental context.
Upload a photo or illustration and the model infers natural motion — hair movement, fabric sway, body gestures — while preserving fine details from the source image including skin texture, accessories, and background elements.
Audio is generated in the same inference pass as video through a Dual-Branch Diffusion Transformer. The output includes layered sound: spoken dialogue with lip-sync, context-aware foley effects, and environmental ambience, mixed and balanced automatically.
Speaking characters exhibit mouth-shape accuracy at the phoneme level across 8+ languages. Unlike post-production dubbing, the lip movements are generated simultaneously with the audio, producing natural alignment without manual adjustment.
Describe your scene in text, or upload a reference image to animate. For image-to-video, you can also provide an end frame to define the final composition.
Pick 480p, 720p, or 1080p. Set duration anywhere from 4 to 12 seconds. If generating speech, select the target language for lip-sync.
The model produces video and audio together in one pass. Credits are calculated dynamically based on your resolution, aspect ratio, and duration — lower settings cost fewer credits.
Choose 480p for fast drafts, 720p for a balance of quality and cost, or 1080p for final output. Credits scale with pixel count: a 480p video costs roughly a quarter of a 1080p video at the same duration.
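As a rough check on how cost might track resolution, the sketch below compares pixels per frame across the three tiers. The frame sizes are assumptions (common 16:9 dimensions), since this page does not state the exact output dimensions, and the real credit table may not be strictly proportional to pixel count.

```python
# Assumed 16:9 frame sizes for each tier. The service's actual output
# dimensions are not stated on this page, so treat these as placeholders.
TIER_DIMENSIONS = {
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}


def pixel_count(tier: str) -> int:
    """Return pixels per frame for a resolution tier."""
    width, height = TIER_DIMENSIONS[tier]
    return width * height


def relative_pixels(tier: str, baseline: str = "1080p") -> float:
    """Return a tier's pixel count as a fraction of the baseline tier."""
    return pixel_count(tier) / pixel_count(baseline)


if __name__ == "__main__":
    for tier in TIER_DIMENSIONS:
        print(f"{tier}: {pixel_count(tier):,} px per frame, "
              f"{relative_pixels(tier):.2f}x of 1080p")
```

Under these assumed dimensions, 480p works out to about a fifth of the 1080p pixel count and 720p to a little under half, which is in the same range as the quarter-cost figure quoted above.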
Set any duration between 4 and 12 seconds. Unlike models that only offer fixed 5s or 10s lengths, 1.5 Pro gives per-second control — pay only for the length you need.
Six aspect ratios are supported: 16:9, 9:16, 1:1, 4:3, 3:4, and 21:9. The 21:9 ultrawide option is uncommon among AI video generators and suits widescreen and film-trailer formats.
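To see what these ratios could mean in pixels, here is a minimal sketch that derives a frame size from an aspect ratio and a resolution tier. It assumes the tier name refers to the shorter edge of the frame and that the longer edge is rounded to an even number. Both are illustrative assumptions, and `frame_size` is a hypothetical helper, not part of any Seedance API.

```python
from fractions import Fraction

SUPPORTED_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4", "21:9")


def frame_size(aspect_ratio: str, short_side: int) -> tuple[int, int]:
    """Return (width, height) with the shorter edge equal to short_side.

    Assumption: the tier (480, 720, 1080) names the shorter edge. The
    longer edge is rounded to the nearest even number, since many video
    encoders expect even dimensions.
    """
    if aspect_ratio not in SUPPORTED_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    w, h = (int(part) for part in aspect_ratio.split(":"))
    ratio = Fraction(max(w, h), min(w, h))
    long_side = 2 * round(short_side * ratio / 2)
    # Landscape and square ratios put the long edge on the width.
    return (long_side, short_side) if w >= h else (short_side, long_side)


if __name__ == "__main__":
    for ratio in SUPPORTED_RATIOS:
        print(ratio, frame_size(ratio, 720))
```

The actual output dimensions for each ratio may differ, so check the generated file rather than relying on these numbers.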
Lock the camera to a fixed position for product demos, interview framing, or any shot where camera movement would be distracting. When unlocked, the model generates natural camera motion based on scene content.
Pass a seed value to reproduce the same output across repeated generations with the same prompt and settings. Useful for A/B testing different prompts while keeping visual style consistent.
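The general idea behind seeding can be illustrated with a toy example. Diffusion-style generators typically start from random noise, and fixing the seed fixes that noise. The sketch below uses Python's standard `random` module as a stand-in; it is not how Seedance is implemented, and `initial_noise` is a hypothetical name.

```python
import random


def initial_noise(seed: int, size: int = 4) -> list[float]:
    """Draw a small vector of Gaussian noise from a seeded generator."""
    rng = random.Random(seed)  # independent generator, no global state
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = initial_noise(seed=42)
same_b = initial_noise(seed=42)
other = initial_noise(seed=43)

print(same_a == same_b)  # True: identical seed, identical noise
print(same_a == other)   # False: different seed, different noise
```

With the starting noise held fixed, differences between two outputs come from whatever else changed, such as the prompt, which is what makes a fixed seed handy for comparisons.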
Provide both a starting image and an ending image to define the first and last frames. The model interpolates between them, enabling controlled transitions and predictable story arcs.
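The options described above can be collected into a single validated settings object. The sketch below is a hypothetical client-side container: the class name, field names, and defaults are assumptions made for illustration and do not correspond to a documented Seedance API. Only the allowed values (three resolutions, 4 to 12 seconds, six aspect ratios) come from this page.

```python
from dataclasses import dataclass
from typing import Optional

RESOLUTIONS = {"480p", "720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "21:9"}


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for the generation options on this page."""

    prompt: str
    resolution: str = "720p"
    duration_seconds: int = 5
    aspect_ratio: str = "16:9"
    camera_fixed: bool = False          # True locks the camera position
    seed: Optional[int] = None          # None lets the service pick one
    first_frame: Optional[str] = None   # path or URL of the start image
    last_frame: Optional[str] = None    # path or URL of the end image

    def __post_init__(self) -> None:
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
        if not 4 <= self.duration_seconds <= 12:
            raise ValueError("duration_seconds must be between 4 and 12")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
        # An end frame only makes sense alongside a starting image.
        if self.last_frame is not None and self.first_frame is None:
            raise ValueError("last_frame requires first_frame")


settings = GenerationSettings(
    prompt="A barista pours latte art in a sunlit cafe",
    resolution="1080p",
    duration_seconds=8,
    aspect_ratio="21:9",
    camera_fixed=True,
    seed=42,
)
print(settings)
```

Validating on construction catches out-of-range values, such as a 15-second duration, before any credits would be spent on a request.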
Videos generated by Seedance models — joint audio-video output, multilingual lip-sync, and physics-aware motion across different genres and styles.

Joint audio-video generation with phoneme-level multilingual lip-sync. Three resolution tiers, 4-12 second duration, dynamic pricing from 15 credits.