
Kling 2.6 is Kuaishou's first AI video model with simultaneous audio-visual generation. It produces synchronized speech, sound effects, and ambient audio alongside 1080p visuals in a single unified pass, with no manual dubbing required.
Released on December 3, 2025, Kling 2.6 is the first model from Kuaishou to achieve simultaneous audio-visual generation. Unlike traditional workflows that produce silent video first and require manual dubbing afterward, Kling 2.6 generates synchronized audio and video together in a single pass. Built on a diffusion-based Transformer architecture with a proprietary 3D variational autoencoder, it delivers deep semantic alignment between real-world sounds and dynamic visuals at 1080p resolution and 48 frames per second.
Transform text prompts into fully realized videos complete with synchronized audio. Kling 2.6 interprets complex instructions with 15% higher compliance than previous versions, translating detailed scene descriptions into matching visuals and soundscapes.
Bring static images to life with fluid, natural motion. Upload a reference image and Kling 2.6 animates it with precise hand movements, expressive facial details, and full-body motion fidelity suited for dance routines, martial arts sequences, and natural human gestures.
Generate speech, dialogue, narration, singing, rap, ambient sound effects, and mixed audio, each on its own or in combination. Speech output is bilingual, in English and Chinese; prompts in other languages are automatically translated to English for voice generation.
Output videos at full HD 1080p resolution running at 48fps for exceptionally smooth motion. Choose from 16:9 landscape, 9:16 portrait, or 1:1 square aspect ratios to fit any platform requirement.
Choose text-to-video to generate entirely from a written prompt, or switch to image-to-video to animate a reference photo. You can also upload a motion reference clip (3-30 seconds) to guide the movement style of your output.
Set your aspect ratio (16:9, 9:16, or 1:1), pick a duration of 5 or 10 seconds, and decide whether to enable native audio generation. When audio is enabled, specify the type — dialogue, narration, sound effects, singing, or a combination.
Start generation and receive your completed video with fully synchronized audio and visuals. The diffusion-based Transformer architecture processes both modalities simultaneously, so what you download is a complete, ready-to-publish clip.
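The choices in the three steps above can be checked before submitting a job. The following is a minimal sketch in Python, not Kling's actual API: the class `GenerationSettings`, its field names, and the audio-type strings are all hypothetical, and only the allowed values (aspect ratios, durations, audio types, and the 3-30 second motion reference window) come from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values as stated on this page. All names below are illustrative
# assumptions, not part of any official Kling SDK or API.
MODES = {"text-to-video", "image-to-video"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
DURATIONS_S = {5, 10}
AUDIO_TYPES = {"dialogue", "narration", "sound_effects", "singing"}


@dataclass
class GenerationSettings:
    mode: str
    prompt: str = ""
    aspect_ratio: str = "16:9"
    duration_s: int = 5
    audio_types: frozenset = frozenset()        # empty set = audio disabled
    reference_image: Optional[str] = None       # path or URL (hypothetical)
    motion_reference_s: Optional[float] = None  # length of motion clip, if any

    def problems(self) -> list[str]:
        """Return readable problems; an empty list means the settings look valid."""
        issues = []
        if self.mode not in MODES:
            issues.append(f"unknown mode: {self.mode}")
        if self.mode == "text-to-video" and not self.prompt.strip():
            issues.append("text-to-video needs a prompt")
        if self.mode == "image-to-video" and not self.reference_image:
            issues.append("image-to-video needs a reference image")
        if self.aspect_ratio not in ASPECT_RATIOS:
            issues.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if self.duration_s not in DURATIONS_S:
            issues.append(f"unsupported duration: {self.duration_s}s")
        unknown = set(self.audio_types) - AUDIO_TYPES
        if unknown:
            issues.append(f"unknown audio types: {sorted(unknown)}")
        if self.motion_reference_s is not None and not 3 <= self.motion_reference_s <= 30:
            issues.append("motion reference must be 3 to 30 seconds long")
        return issues


settings = GenerationSettings(
    mode="text-to-video",
    prompt="A door slams in an empty hallway",
    aspect_ratio="9:16",
    duration_s=10,
    audio_types=frozenset({"sound_effects"}),
)
print(settings.problems())
```

Collecting every problem in a list, rather than raising on the first one, lets a form or script report all invalid choices at once.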
The defining breakthrough of Kling 2.6: audio and video are generated together in one pass rather than layered sequentially. This produces deep semantic alignment — footsteps match walking, doors slam when they close, and dialogue matches lip movements naturally.
Kling 2.6 delivers blur-free hand movements, accurate finger articulation, and nuanced facial expressions. Full-body movement fidelity captures complex choreography including dance routines, martial arts forms, and athletic sequences.
Speaking characters exhibit precise mouth-to-audio alignment. Whether generating dialogue, narration, singing, or rap, lip movements track the generated speech with frame-level accuracy across both English and Chinese output.
Upload a motion reference video between 3 and 30 seconds long to guide the movement patterns in your output. This enables uninterrupted motion sequences that follow specific choreography, camera paths, or action styles.
Maintain consistent character appearance and identity across different shots and scenes. Characters retain their facial features, clothing, and proportions throughout the generated video.
Native speech generation in English and Chinese with natural intonation and pacing. Prompts written in other languages are automatically translated to English for voice synthesis, broadening accessibility for international creators.
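The language rule described above amounts to a simple fallback: English and Chinese prompts keep their language for voice output, and everything else is voiced in English. Here is a minimal sketch of that rule, assuming prompts are tagged with ISO 639-1 style codes such as `en`, `zh-CN`, or `fr`. The function name and the use of language codes are assumptions for illustration; the page does not say how Kling detects or represents the prompt language.

```python
# Languages with native speech output, per this page. Representing them as
# ISO 639-1 codes is an assumption made for this sketch.
NATIVE_VOICE_LANGUAGES = {"en", "zh"}


def voice_language(prompt_language: str) -> str:
    """Return the language used for voice synthesis for a given prompt language.

    Region suffixes such as "zh-CN" or "en_US" are ignored. Any language
    without native speech output falls back to English.
    """
    code = prompt_language.strip().lower().replace("_", "-").split("-")[0]
    return code if code in NATIVE_VOICE_LANGUAGES else "en"


for lang in ("en", "zh-CN", "fr", "ja"):
    print(lang, "->", voice_language(lang))
```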
Explore videos generated by Kling models — synchronized audio, precise human movement, and detailed visual storytelling across diverse scenarios.






Experience Kuaishou's first AI video model with simultaneous audio-visual generation. Produce 1080p videos with speech, sound effects, and music in a single pass.