3/10 Model Release 25 May 2026, 12:01 UTC

Meituan releases LongCat-Video-Avatar-1.5, a multimodal video generation model trending on Hugging Face.

Meituan's LongCat-Video-Avatar-1.5 combines audio, image, and text conditioning to push towards highly controllable, multimodal avatar generation. The inclusion of ONNX and safetensors support suggests a focus on deployment and efficient inference pipelines, not only research use. Engineers building digital human applications should evaluate it, including whether it can meet real-time latency targets.

What Happened

Meituan's AI team has open-sourced `LongCat-Video-Avatar-1.5` on Hugging Face. The repository is freshly published and shows 0 recorded downloads, likely due to caching or the recent upload, yet it has already accumulated 177 likes, which points to early community interest and immediate visibility in the open-source AI ecosystem.

Technical Details

The model targets the `audio-image-text-to-video` task. It ingests a reference image (the avatar's appearance), an audio track (for speech and lip-syncing), and text prompts (for styling, expression, or background context) to generate synchronized video output. Built on the `diffusers` library, the repository notably includes both `safetensors` and `onnx` formats. `safetensors` stores weights as raw tensor bytes behind a JSON header, so loading does not execute pickled code and can use memory-mapped, zero-copy reads. The `onnx` (Open Neural Network Exchange) files suggest that the model, or parts of it, can run on ONNX-compatible runtimes across hardware backends, sparing engineers the usual friction of converting a PyTorch research model for deployment. The export alone does not show how well optimized the model is; that has to be measured.
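To make the `safetensors` point concrete, here is a minimal sketch that reads only the header of a `.safetensors` file using the Python standard library. It relies on the published file layout (an 8-byte little-endian length, a JSON header, then raw tensor bytes). The file name `tiny.safetensors` and the helper names are hypothetical, and nothing here reflects the actual file names or tensor names in the LongCat repository.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON describing every tensor.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Map tensor name -> (dtype, shape, size in bytes), skipping metadata."""
    summary = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        start, end = info["data_offsets"]
        summary[name] = (info["dtype"], info["shape"], end - start)
    return summary


def write_tiny_example(path):
    """Write a hypothetical one-tensor file so the sketch runs without downloads."""
    header = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))  # 4 floats = 16 bytes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tiny.safetensors")  # hypothetical file name
        write_tiny_example(path)
        print(summarize(read_safetensors_header(path)))
```

Because the header is plain JSON and the payload is raw bytes, a checkpoint can be inspected for tensor names, dtypes, and sizes before any weights are loaded. The same header could be read from a downloaded model file to estimate its memory footprint.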

Why It Matters

The digital avatar space is highly competitive, yet open-source models frequently struggle with lip-sync accuracy, temporal consistency, or inference latency. By providing a multimodal conditioning mechanism, LongCat-Video-Avatar-1.5 gives developers separate controls over appearance, speech, and style. The immediate availability of ONNX files lowers the barrier to entry for edge deployment and optimized cloud inference, which is critical for real-time or near-real-time streaming applications, though production readiness still has to be confirmed by measurement.

What to Watch Next

Engineers should benchmark ONNX Runtime performance against existing open-source avatar solutions like SadTalker or Wav2Lip, specifically evaluating latency, VRAM consumption, and temporal stability over longer clips. Watch for downstream community applications in automated customer service, virtual streaming, and localized marketing content to validate its real-world viability.
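The latency part of that comparison can be sketched as a small timing harness. This is an illustration under stated assumptions, not a prescribed method: `generate` is a hypothetical zero-argument callable wrapping one generation call for whichever model is under test, and `clip_seconds` is the length of the video it produces. VRAM consumption and temporal stability are not covered and need separate tooling.

```python
import statistics
import time


def benchmark(generate, n_runs=10, warmup=2, clip_seconds=5.0):
    """Time a zero-argument callable and report latency statistics.

    `generate` is a hypothetical stand-in for one avatar-generation call
    that produces `clip_seconds` of video.
    """
    # Warm-up runs absorb one-off costs (model load, kernel compilation).
    for _ in range(warmup):
        generate()

    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)

    mean = statistics.fmean(samples)
    return {
        "mean_s": mean,
        "median_s": statistics.median(samples),
        "max_s": max(samples),
        # Real-time factor: below 1.0 means generation outpaces playback.
        "real_time_factor": mean / clip_seconds,
    }


if __name__ == "__main__":
    def fake_generate():
        # Placeholder workload; swap in a real model call to compare systems.
        time.sleep(0.01)

    print(benchmark(fake_generate, n_runs=5, warmup=1, clip_seconds=1.0))
```

Running the same harness with identical inputs and clip lengths for each model keeps the comparison fair. A real-time factor above 1.0 rules a configuration out for live streaming regardless of output quality.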

Sources

https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5

video-generation multimodal digital-avatars diffusers onnx