Synchronized Sound • Lip-Sync Speech • Dynamic Visuals • Creative Freedom
Alibaba's breakthrough Wan 2.5 model generates videos with native audio: speech, music, and sound effects synchronized to the visuals. Create videos of up to 10 seconds from text or images in 720p or 1080p. Maximum creative freedom for bold, dynamic content. No audio post-production needed.
Add Image: JPG, PNG, or WebP, max 10MB. The output video's aspect ratio will match your uploaded image.
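The upload limits stated here (JPG, PNG, or WebP, up to 10MB) could be checked before uploading with a minimal sketch like the one below. The function name `check_upload` is hypothetical and not part of any Wan 2.5 API. Reading "10MB" as 10 × 1024 × 1024 bytes and accepting the `.jpeg` spelling alongside `.jpg` are both assumptions.

```python
from pathlib import Path

# Formats named on the page; ".jpeg" is an assumed alias for JPG.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

# Assumption: "10MB" is interpreted as 10 MiB.
MAX_BYTES = 10 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

This only inspects the file name and size; it does not verify that the file contents are a valid image.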
See how Wan 2.5 transforms text and images into complete audio-visual experiences
Transform static images into dynamic videos with synchronized soundtracks, speech, and environmental audio
Create complete videos with visuals, speech, and music from text descriptions alone
Input
“A dimly lit jazz bar at night, wooden tables glowing under warm pendant lights. Patrons sip drinks and chat quietly while a three-piece band performs on stage. The saxophone player stands under a spotlight, gleaming instrument reflecting the light. No dialogue. Ambient audio: smooth live jazz music with saxophone and piano, clinking glasses, low murmur of audience conversations, occasional burst of laughter from a nearby table. Camera: slow pan across the crowd, then gentle zoom toward the saxophone player’s solo, focusing on expressive hand movements.”
Video AI model with native audio generation. Wan 2.5 eliminates audio post-production by creating synchronized soundtracks, speech, and sound effects during video generation. Unmatched creative freedom for diverse content styles.
Wan 2.5 generates video and audio simultaneously: speech synchronized with lip movements, background music matching the video's rhythm, environmental sounds, and ambient effects. No separate recording or audio editing is needed, because everything is created together in one process.
Advanced camera language with smooth transitions, stable object tracking, and consistent character continuity across frames. Eliminates common AI video issues like flickering, jittering, or morphing. Professional-grade cinematography with natural movement flow.
Generate 5-second or 10-second videos (longer than most competitors' 8s limit) in 720p or 1080p resolution. Multiple aspect ratios: 16:9 landscape, 9:16 portrait, 1:1 square. Optimized for YouTube, TikTok, Instagram, and all social platforms.
Lenient content moderation enables bold, dynamic, and impactful video creation. Supports text-to-video and image-to-video modes, with multimodal inputs including text, images, and audio references. Multilingual prompt support, including Chinese.
Generate professional videos with synchronized audio using Wan 2.5. No audio editing skills required: speech, music, and sound effects are created automatically with your video.
Text-to-Video: Describe your scene, camera movements, actions, and audio requirements. Image-to-Video: Upload a reference image and describe desired motion. Wan 2.5 will generate matching audio including speech, music, and environmental sounds.
Duration: 5 seconds (quick content) or 10 seconds (richer storytelling). Resolution: 720p (faster rendering) or 1080p (maximum quality). Aspect Ratio: 16:9 landscape, 9:16 vertical, or 1:1 square. Optional: Add negative prompts to exclude unwanted elements.
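The settings listed above can be written down as a small validated structure. This is a sketch only: the class `GenerationSettings`, its field names, and its default values are hypothetical and do not describe an actual Wan 2.5 API. The allowed values are the ones named on this page.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values as listed on this page.
DURATIONS = (5, 10)                       # seconds
RESOLUTIONS = ("720p", "1080p")
ASPECT_RATIOS = ("16:9", "9:16", "1:1")


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for one generation request's settings."""
    prompt: str
    duration_seconds: int = 5             # assumed default
    resolution: str = "720p"              # assumed default
    aspect_ratio: str = "16:9"            # assumed default
    negative_prompt: Optional[str] = None  # optional: elements to exclude

    def __post_init__(self) -> None:
        # Reject any value the page does not list as supported.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.duration_seconds not in DURATIONS:
            raise ValueError(f"duration must be one of {DURATIONS}")
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {RESOLUTIONS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect ratio must be one of {ASPECT_RATIOS}")
```

For example, `GenerationSettings(prompt="A jazz bar at night", duration_seconds=10, resolution="1080p", aspect_ratio="9:16")` describes a 10-second vertical clip at maximum quality, while an unsupported duration such as 8 raises `ValueError`.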
Click generate and Wan 2.5 creates your video with synchronized audio in minutes. Preview the complete video with sound, lip-synced speech, and background music. Download ready-to-use content for YouTube, TikTok, Instagram, or commercial projects.
Complete guide to Wan 2.5's audio-visual generation capabilities, pricing, content policies, and comparison with other AI video models such as Sora 2 and Veo 3.
Use our AI image prompt gallery to design scenes and characters, then bring them to life with Wan 2.5.
Browse AI image prompts →