Kling 2.6 — AI Video with Simultaneous Audio Generation

Kling 2.6 is Kuaishou's first AI video model with simultaneous audio-visual generation. It produces synchronized speech, sound effects, and ambient audio alongside 1080p visuals in a single unified pass, with no manual dubbing required.

Kling 2.6: Simultaneous Audio-Visual Generation in One Pass

Released on December 3, 2025, Kling 2.6 is the first model from Kuaishou to achieve simultaneous audio-visual generation. Unlike traditional workflows that produce silent video first and require manual dubbing afterward, Kling 2.6 generates synchronized audio and video together in a single pass. Built on a diffusion-based Transformer architecture with a proprietary 3D variational autoencoder, it delivers deep semantic alignment between real-world sounds and dynamic visuals at 1080p resolution and 48 frames per second.

Text to Video with Integrated Audio

Transform text prompts into fully realized videos complete with synchronized audio. Kling 2.6 interprets complex instructions with 15% higher compliance than previous versions, translating detailed scene descriptions into matching visuals and soundscapes.

Image to Video Animation

Bring static images to life with fluid, natural motion. Upload a reference image and Kling 2.6 animates it with precise hand movements, expressive facial details, and full-body motion fidelity suited for dance routines, martial arts sequences, and natural human gestures.

Rich Audio Synthesis

Generate speech, dialogue, narration, singing, rap, ambient sound effects, and mixed audio, either standalone or combined. Speech output is available in English and Chinese; prompts in other languages are automatically translated to English for voice generation.

1080p at 48 Frames Per Second

Output videos at full HD 1080p resolution running at 48fps for exceptionally smooth motion. Choose from 16:9 landscape, 9:16 portrait, or 1:1 square aspect ratios to fit any platform requirement.

Generate Video with Integrated Audio

1. Select Your Input Mode

Choose text-to-video to generate entirely from a written prompt, or switch to image-to-video to animate a reference photo. You can also upload a motion reference clip (3-30 seconds) to guide the movement style of your output.

2. Configure Output Parameters

Set your aspect ratio (16:9, 9:16, or 1:1), pick a duration of 5 or 10 seconds, and decide whether to enable native audio generation. When audio is enabled, specify the type — dialogue, narration, sound effects, singing, or a combination.

3. Generate and Download

Start generation and receive your completed video with fully synchronized audio and visuals. The diffusion-based Transformer architecture processes both modalities simultaneously, so what you download is a complete, ready-to-publish clip.
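
The options in the three steps above can be checked before a job is submitted. The following is only a minimal sketch in Python: the class name, field names, and audio type labels are hypothetical and do not come from any official Kling or Kuaishou API. It assumes the limits stated on this page (5 or 10 seconds, three aspect ratios, a 3 to 30 second motion reference, 48 frames per second).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical constants that mirror the options described on this page.
# They are illustrative and not part of an official Kling API.
MODES = {"text-to-video", "image-to-video"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
DURATIONS_S = {5, 10}
AUDIO_TYPES = {"dialogue", "narration", "sound_effects", "singing"}
MOTION_REF_RANGE_S = (3.0, 30.0)
FPS = 48


@dataclass
class GenerationRequest:
    """Hypothetical container for the settings chosen in steps 1 and 2."""
    prompt: str
    mode: str = "text-to-video"
    aspect_ratio: str = "16:9"
    duration_s: int = 5
    audio_types: frozenset = frozenset()   # empty set means audio is disabled
    image_path: Optional[str] = None        # needed for image-to-video
    motion_ref_s: Optional[float] = None    # length of an optional motion clip

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        errors = []
        if self.mode not in MODES:
            errors.append(f"unknown mode: {self.mode}")
        if self.mode == "text-to-video" and not self.prompt.strip():
            errors.append("text-to-video needs a non-empty prompt")
        if self.mode == "image-to-video" and not self.image_path:
            errors.append("image-to-video needs a reference image")
        if self.aspect_ratio not in ASPECT_RATIOS:
            errors.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if self.duration_s not in DURATIONS_S:
            errors.append(f"unsupported duration: {self.duration_s}s")
        unknown = set(self.audio_types) - AUDIO_TYPES
        if unknown:
            errors.append(f"unknown audio types: {sorted(unknown)}")
        if self.motion_ref_s is not None:
            low, high = MOTION_REF_RANGE_S
            if not low <= self.motion_ref_s <= high:
                errors.append("motion reference must be 3 to 30 seconds long")
        return errors

    def frame_count(self) -> int:
        """Number of frames in the clip at the stated 48 fps."""
        return self.duration_s * FPS
```

Under these assumptions, a 10-second clip corresponds to 480 frames, and a request with an unsupported duration or an out-of-range motion reference is rejected before any credits are spent.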

One Pass: Audio and Video Together

Unified Audio-Visual Pipeline

The defining breakthrough of Kling 2.6: audio and video are generated together in one pass rather than layered sequentially. This produces deep semantic alignment — footsteps match walking, doors slam when they close, and dialogue matches lip movements naturally.

Precision Human Motion

Kling 2.6 delivers blur-free hand movements, accurate finger articulation, and nuanced facial expressions. Full-body movement fidelity captures complex choreography including dance routines, martial arts forms, and athletic sequences.

Accurate Lip Synchronization

Speaking characters exhibit precise mouth-to-audio alignment. Whether generating dialogue, narration, singing, or rap, lip movements track the generated speech with frame-level accuracy across both English and Chinese output.

Motion Reference Control

Upload a motion reference video between 3 and 30 seconds long to guide the movement patterns in your output. This enables uninterrupted motion sequences that follow specific choreography, camera paths, or action styles.

Cross-Shot Character Consistency

Maintain consistent character appearance and identity across different shots and scenes. Characters retain their facial features, clothing, and proportions throughout the generated video.

Bilingual Speech Output

Native speech generation in English and Chinese with natural intonation and pacing. Prompts written in other languages are automatically translated to English for voice synthesis, broadening accessibility for international creators.
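
The language rule described above can be written as a small lookup. This is an illustrative sketch only: the function name and the use of short language codes such as "en" and "zh" are assumptions, and the page does not say how the service detects the prompt language.

```python
# Languages with native speech output according to this page.
NATIVE_SPEECH_LANGUAGES = {"en", "zh"}


def speech_language(prompt_language: str) -> str:
    """Return the language the voice would be generated in (hypothetical helper).

    Prompts in English or Chinese keep their language; any other
    language falls back to English, as described above.
    """
    # Normalize tags such as "zh-CN" or "EN" to a bare lowercase code.
    code = prompt_language.strip().lower().split("-")[0]
    return code if code in NATIVE_SPEECH_LANGUAGES else "en"
```

For example, a prompt tagged "zh-CN" would keep Chinese speech, while a French prompt would be voiced in English.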

Showcases

Kling 2.6 Video Examples

Explore videos generated by Kling models — synchronized audio, precise human movement, and detailed visual storytelling across diverse scenarios.

Wartime Flag Ceremony
Old Craftsman in Golden Light
Suited Man Dancing
Industrial Drift Racing
Emotional Rain Scene
Game Character Selection Screen

Create Videos with Synchronized Audio Today

Experience Kuaishou's first AI video model with simultaneous audio-visual generation. Produce 1080p videos with speech, sound effects, and music in a single pass.