The artificial intelligence landscape is evolving at an unprecedented pace, and ElevenLabs, a leading voice in the text-to-speech domain, is positioning itself for a future in which its core technology becomes widely accessible. Mati Staniszewski, co-founder and chief executive of ElevenLabs, recently offered a candid outlook on AI audio models, asserting that they are destined for commoditization. This perspective, shared during his appearance at the TechCrunch Disrupt 2025 conference, underscores a broader industry trend and signals a significant pivot in ElevenLabs’ long-term strategy: moving beyond model creation alone toward deeper integration with applications.
The Genesis of AI Audio and ElevenLabs’ Role
The journey of AI voice synthesis has been marked by rapid advances, transforming robotic, monotone speech into remarkably natural and emotionally nuanced audio. Early text-to-speech (TTS) systems, often based on concatenative synthesis, stitched together pre-recorded speech fragments, producing choppy, unnatural output. The advent of deep learning, first with recurrent neural networks (RNNs) and later with transformer architectures, revolutionized the field. Google, Amazon, and Microsoft invested heavily in research, leading to services such as Google's WaveNet-powered text-to-speech, Amazon Polly, and Microsoft Azure AI Speech, which set new benchmarks for synthetic voice quality.
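The limitation of the concatenative approach described above can be made concrete with a toy sketch. Everything here is illustrative, not any production system: the "unit bank" stands in for a library of recorded speech fragments, and the hard splice is precisely what made early TTS sound choppy.

```python
# Toy sketch of concatenative TTS: select pre-recorded units and splice them.
# Unit names and "waveforms" are placeholders, not a real system's data.

def synthesize_concatenative(text, unit_bank):
    """Concatenate pre-recorded unit waveforms for each word in `text`.

    Joins are hard cuts with no prosody modeling or crossfading, which is
    why systems built this way sounded choppy and unnatural.
    """
    waveform = []
    for word in text.lower().split():
        if word not in unit_bank:
            raise KeyError(f"no recorded unit for {word!r}")
        waveform.extend(unit_bank[word])  # hard splice: no pitch or timing match
    return waveform

# Toy "recordings": short lists standing in for audio sample values.
bank = {
    "hello": [0.1, 0.2, 0.1],
    "world": [0.3, 0.2],
}
audio = synthesize_concatenative("Hello world", bank)
print(len(audio))  # 5 samples: the two units spliced end-to-end
```

Neural approaches replaced this lookup-and-splice step with a learned model that generates the waveform directly, which is what allowed prosody and emotional nuance to emerge.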
ElevenLabs, founded in 2022, emerged into this dynamic environment with a focus on delivering highly realistic and customizable AI voices, quickly gaining traction for its impressive voice cloning and text-to-speech capabilities. Its technology has found applications across various sectors, from content creation and audiobooks to gaming and virtual assistants, pushing the boundaries of what AI-generated audio can achieve. Staniszewski’s remarks at the conference reflect a company that understands its position at the vanguard of this technology but also keenly anticipates its inevitable maturation. He detailed both the immediate objectives and the long-term strategic vision for the AI audio sector, acknowledging the profound shifts that lie ahead.
The Inevitable March Towards Commoditization
Staniszewski articulated a clear timeline for this transformation. In the immediate future, perhaps over the next one to two years, ElevenLabs will continue to concentrate on fundamental model architecture challenges within the audio space. The company’s researchers have made significant strides in overcoming these hurdles, ensuring their models remain at the cutting edge. However, this short-term focus on foundational research is juxtaposed with a longer-term prediction: "Over the long term, it will commoditize – over the next couple of years," Staniszewski stated. He further elaborated, "Even if there’s differences – which I think will be the truth for some voices, some languages – on its own, the differences will be smaller." This implies a future where the core technology for generating high-quality AI audio becomes widely available, reducing proprietary advantage.
This prediction is not unique to AI audio; it mirrors the trajectory of many groundbreaking technologies. Initially, innovative breakthroughs are proprietary and command high value. As knowledge disseminates, tools become more accessible, and open-source alternatives emerge, the underlying technology becomes a commodity. Think of cloud computing infrastructure, once a specialized service, now a foundational utility, or even the evolution of graphics processing units (GPUs) from niche components to widely adopted hardware. The core distinction shifts from having the technology to how effectively it is applied.
The Strategic Paradox: Building What Will Be Commoditized
A natural question arises from Staniszewski’s forecast: why would ElevenLabs continue to invest heavily in building models if their commoditization is imminent? His explanation provides insight into the strategic thinking necessary in a rapidly evolving tech landscape. In the short term, proprietary models still represent the "biggest advantage and the biggest step change you can have today." The quality of AI voices and interactions remains a critical differentiator. If the synthetic voices lack authenticity, emotional range, or clarity, the user experience suffers, regardless of the application.
Staniszewski emphasized that the only way to achieve truly superior audio quality currently is through dedicated, in-house model development. This allows ElevenLabs to fine-tune algorithms, optimize for specific nuances, and respond swiftly to emerging demands for realism and expressiveness. While other players will eventually catch up, the current competitive edge is rooted in this specialized expertise. Furthermore, for businesses seeking reliable and scalable solutions, different models might still be preferred for distinct use cases, ensuring specific performance requirements are met. This nuanced approach acknowledges that while the base technology may commoditize, specialized, high-performance variants might retain value for particular applications.
The Pivot Towards Multimodality and Application Integration
Looking ahead, ElevenLabs envisions a future where the focus shifts dramatically towards integrated, multi-modal experiences. Staniszewski anticipates that within the next year or two, an increasing number of AI models will adopt "multi-modal or fused approaches." This means a seamless combination of different AI capabilities working in concert. He provided compelling examples: "So, you will create audio and video at the same time, or audio and LLMs at the same time in a conversational setting." This vision aligns with broader industry trends where AI is moving beyond isolated functions towards holistic, intelligent systems. Google’s Veo 3, which integrates various AI models to generate complex visual and auditory content, serves as a prime illustration of this potential.
To realize this multi-modal future, ElevenLabs plans to actively pursue partnerships with other companies and leverage open-source technologies. This collaborative strategy will allow the company to fuse its specialized audio expertise with the capabilities of other models, whether for advanced video generation, sophisticated conversational AI, or other yet-to-be-imagined applications. The long-term goal for ElevenLabs, therefore, is not merely to build superior models but to strategically focus on both model development and the creation of compelling applications. Staniszewski drew an insightful parallel to Apple’s enduring success: "The same way software and hardware was the magic for Apple, we think the product and AI will be the magic for the generation of the best use cases." This analogy highlights a strategic shift from being a pure technology provider to becoming a creator of integrated, user-centric experiences where the underlying AI powers a superior product.
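The "fused" conversational setting Staniszewski describes can be sketched in miniature. This is a hypothetical shape, not an ElevenLabs or LLM vendor API: `generate_reply` and `synthesize_speech` are stand-ins for an LLM call and a TTS call, and the point is only that one loop drives both text and audio generation together.

```python
# Hypothetical sketch of a fused text+audio conversational turn.
# `generate_reply` and `synthesize_speech` are stand-ins, not real APIs.

def generate_reply(history):
    # Stand-in for an LLM call; returns a canned response.
    return "Sure, I can help with that."

def synthesize_speech(text, voice="narrator"):
    # Stand-in for a TTS call; returns fake "audio" metadata instead of samples.
    return {"voice": voice, "chars": len(text)}

def conversational_turn(history, user_text):
    """One fused turn: append the user message, reply via the LLM stand-in,
    then voice the reply via the TTS stand-in, all in a single pipeline."""
    history = history + [("user", user_text)]
    reply = generate_reply(history)
    audio = synthesize_speech(reply)
    history.append(("assistant", reply))
    return history, reply, audio

history, reply, audio = conversational_turn([], "Can you book a table?")
print(audio["voice"], audio["chars"])
```

In a real fused system the two stand-ins would be replaced by streaming model calls, and the audio could begin playing before the full text reply is complete; the sketch only shows the coupling of the two modalities in one loop.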
Broader Market, Social, and Cultural Implications
The commoditization of AI audio models, coupled with a shift towards multi-modal applications, carries profound implications across various sectors.
Economic Impact: The accessibility of high-quality AI voices will democratize content creation, significantly lowering production costs for audiobooks, podcasts, e-learning materials, and video narration. This could empower independent creators and smaller businesses, fostering a more diverse content ecosystem. However, it also poses challenges for traditional voice actors and narrators, necessitating a re-evaluation of their roles and skills in an AI-augmented industry. New job categories, such as AI voice directors, ethical AI specialists, and prompt engineers for multimodal AI, are likely to emerge.
Accessibility and Inclusivity: Advanced AI audio can dramatically improve accessibility for individuals with visual impairments or reading disabilities, transforming how they consume information and interact with digital content. It can also facilitate cross-language communication through real-time, natural-sounding translations.
Content and Media: The ability to generate realistic audio and video simultaneously opens new frontiers for entertainment, marketing, and journalism. Imagine personalized news broadcasts, dynamically generated game characters with unique voices, or interactive digital companions. This could lead to an explosion of AI-generated content, blurring the lines between human and machine creativity.
Ethical and Societal Challenges: As AI audio becomes more sophisticated and accessible, concerns around deepfakes, voice impersonation, and misinformation intensify. The ease with which realistic synthetic voices can be created raises critical questions about authenticity, consent, and identity. The industry will face increasing pressure to develop robust watermarking technologies, ethical guidelines, and regulatory frameworks to mitigate potential misuse. Companies like ElevenLabs, by engaging in both model development and application, are uniquely positioned to integrate ethical considerations directly into product design, rather than treating them as afterthoughts.
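One family of techniques alluded to above, embedding an imperceptible identifier in generated audio, can be illustrated with a deliberately simplified least-significant-bit scheme. Production watermarks (including whatever any vendor actually ships) are far more robust to compression and editing; this sketch only shows the embed-and-extract principle on 16-bit-style integer samples.

```python
# Deliberately simplified audio watermarking illustration: hide an ID's bits
# in the least significant bit (LSB) of integer samples. Real systems use
# robust, inaudible schemes; this only demonstrates the basic principle.

def embed_watermark(samples, bits):
    """Overwrite the LSB of the first len(bits) samples with the bit pattern."""
    out = list(samples)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit
    return out

def extract_watermark(samples, n_bits):
    """Read the embedded bits back out of the LSBs."""
    return [samples[i] & 1 for i in range(n_bits)]

mark = [1, 0, 1, 1, 0, 0, 1, 0]                      # 8-bit identifier
audio = [1000, 1001, 1002, 999, 998, 1003, 1004, 1005]
marked = embed_watermark(audio, mark)
assert extract_watermark(marked, 8) == mark
print(max(abs(a - b) for a, b in zip(audio, marked)))  # prints 1: at most one LSB of distortion
```

The design trade-off the industry faces is exactly the one this sketch ignores: an LSB mark is inaudible but trivially destroyed by re-encoding, while robust marks must survive compression, trimming, and resampling without becoming audible.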
The Competitive Landscape and Future Outlook
The AI audio market is fiercely competitive, with tech giants like Google, Amazon, and Microsoft investing heavily, alongside a growing number of specialized startups. The prediction of commoditization suggests that the race will shift from who can build the best foundational model to who can build the best products and services leveraging those models. Open-source initiatives are accelerating this commoditization, providing powerful, freely available tools that push the baseline quality higher for everyone.
ElevenLabs’ strategy to focus on both model building and applications, combined with a willingness to partner and embrace open source, positions it to thrive in this evolving environment. By controlling the quality of the underlying AI while simultaneously developing compelling user experiences, the company aims to create a virtuous cycle of innovation. The future of AI audio is not just about voices; it’s about intelligent, immersive, and interactive experiences that seamlessly blend auditory information with other sensory inputs and cognitive processes. This transition represents a maturation of the AI industry, moving from foundational research to the widespread deployment of transformative, integrated solutions.
In essence, Staniszewski’s vision for ElevenLabs and the broader AI audio space is one of evolution: from raw technological prowess to sophisticated, user-centric integration. The journey from specialized models to commoditized components is a natural progression, and ElevenLabs’ strategic pivot towards multi-modal applications and strong product development reflects a proactive approach to maintaining relevance and leadership in an increasingly dynamic technological frontier. The "magic" will lie not just in the AI itself, but in how it is woven into the fabric of everyday digital life.