Virtual influencers for the Japanese metaverse


A Japanese metaverse startup, focused on virtual influencers, was interested in developing advanced AI models for generating real-time talking videos. They partnered with VMO Holdings to bring their vision to life. Leveraging groundbreaking innovations, they introduced ultra-realistic, AI-driven avatars that seamlessly interact in real-time, redefining the metaverse experience.

The innovative Japanese startup sought to generate high-quality video tailored to Japanese metaverse audiences. The scarcity of Japanese training datasets in existing commercial application programming interfaces (APIs) posed a significant obstacle to achieving the accurate speech-to-text capabilities the startup was looking for. In addition, the high cost of third-party commercial products made the development of an in-house AI product a necessity.

CHALLENGE

To develop cutting-edge AI technologies for real-time talking video generation, tailored to the needs of a Japanese metaverse audience focused on virtual influencers

SOLUTION

Innovative research on AI models to generate high-quality, real-time talking AI avatars for virtual applications, coupled with custom finetuning of speech recognition models for better performance in Japanese

BENEFITS

Real-time, high-quality talking AI characters
State-of-the-art solution
Enhanced AI capabilities
Superior performance in Japanese speech recognition
Improved efficiency in video generation
Faster processing times and higher resolution outputs

KEY PERFORMANCE INDICATOR
SPEECH-TO-TEXT

75% reduction in processing time
15% enhancement in Japanese speech-to-text accuracy compared to market standards

VIDEO GENERATION

100% improvement in alignment between voice and lip movements
Increased resolution, at equivalent GPU usage, from 96 x 96 to 256 x 256
Decreased real-time latency between input and output from 45 seconds to < 3 seconds
>90% reduction in cost of video creation, compared to traditional tools

Person-talking video

Together with VMO, the startup set out to revolutionize virtual communication for the metaverse platform.

The solution included two main components. The first involved research on and development of AI models for the generation of high-quality, real-time talking avatars. The models needed to support the creation of dynamic video with realistic lip-sync and facial expressions, optimized for low usage of graphics processing units (GPUs). This was complemented by finetuning of the speech-to-text model on Japanese data to compensate for the lack of existing commercial alternatives. The finetuning resulted in significant improvements in the recognition accuracy of Japanese speech. The improved efficiency in video generation enabled faster processing times and higher-resolution outputs.
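Conceptually, the two components form a simple pipeline: incoming audio is transcribed by the finetuned Japanese speech-to-text model, and the transcript plus audio drive the avatar renderer that produces lip-synced frames. A minimal structural sketch of that data flow (all names and stand-in functions here are hypothetical illustrations, not the startup's actual code):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TalkingAvatarPipeline:
    """Hypothetical sketch wiring a speech-to-text model to an avatar renderer.

    transcribe: maps raw audio bytes to text (e.g. a Japanese-finetuned STT model).
    render: maps (transcript, audio) to a sequence of lip-synced video frames.
    """
    transcribe: Callable[[bytes], str]
    render: Callable[[str, bytes], List[str]]

    def run(self, audio: bytes) -> dict:
        text = self.transcribe(audio)      # Japanese speech-to-text
        frames = self.render(text, audio)  # lip-synced avatar frames
        return {"transcript": text, "frames": frames}

# Dummy stand-ins to show the shape of the data flow:
pipeline = TalkingAvatarPipeline(
    transcribe=lambda audio: "こんにちは",
    render=lambda text, audio: [f"frame-{i}" for i in range(3)],
)
result = pipeline.run(b"raw-pcm-audio")
```

In a real deployment, the two callables would be swapped for the finetuned STT model and one of the video-generation models described below.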

The first AI lip-sync model entered production in September 2023, with simulated ("fake") video lip-sync. The current version, with real lip-sync, went into production in January 2024.

The tools

Model training was performed on NVIDIA H100 GPUs for intensive training tasks. Speech-to-text benchmarking was conducted with the Whisper API and various pretrained models, for comparison against industry benchmarks (OpenAI and Google ASR). The advanced AI models included SadTalker for high-quality, GPU-intensive video generation. ER-NeRF was deployed for static-body generation in the style of D-ID, transforming still photos into personalized streaming AI videos optimized for low GPU usage. Wav2Lip enabled efficient full-body motion with good quality. Custom Japanese datasets were used for finetuning, leveraging internal preprocessing and optimization techniques. State-of-the-art solutions achieved near real-time video generation, reducing processing time from 45 seconds to just 5 seconds at a resolution of 96 x 96.

The VMO AI models delivered enhanced video quality, achieving a resolution of 256 x 256 pixels with a generation time of only three seconds. In Japanese speech recognition, the finetuned model achieved superior performance, with a word error rate (WER) of 18.01% that far outperforms OpenAI's Whisper (21.11%) and Google ASR (27.74%), providing a significant edge in Japanese language support.
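For context, WER is the standard accuracy metric for speech recognition: the number of word substitutions, deletions, and insertions needed to turn the hypothesis transcript into the reference, divided by the reference length. A minimal sketch of how such a benchmark is scored (a generic word-level Levenshtein implementation, not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One wrong word out of four gives a WER of 0.25 (i.e. 25%):
score = wer("the cat sat down", "the cat stood down")
```

A WER of 18.01% thus means that, on the benchmark set, roughly 18 out of every 100 reference words required a correction, versus about 21 for Whisper and 28 for Google ASR.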

New horizons on the metaverse

This innovative application of AI technologies offers Japanese audiences previously unavailable advances in avatar generation, opening up new territories in the metaverse tailored to their needs and preferences.
