PreBuild is a software development company that aims to streamline the software planning and development process for businesses.
An AI phone translator that translates calls, messages, and more in real time
Confidential
9 months
AI
2024
AWS, Twilio, WebRTC, Whisper ASR, Google Translate API, ElevenLabs TTS, NodeJS, WebSockets, Docker
Translingo Call Translator is an AI-powered communication tool designed to eliminate language barriers by translating phone calls, messages, and more in real time. Leveraging advanced machine learning, it provides a seamless and reliable bridge for global communication while maintaining context and nuance.
Translingo Call Translator must deliver seamless, real-time communication by chaining together several services, each with its own latency, over real-time communication protocols. Several challenges stand in the way of achieving this goal:
– Cumulative latency across the multi-step pipeline: Real-time translation runs as a sequential chain of speech → text → translation → speech. Each step adds its own processing time, and together with network latency these steps add up to a significant delay.
– Maintaining context and turn-taking in real time: The system must use Voice Activity Detection (VAD) to determine accurately when one speaker has finished and the other has started, so that audio is segmented correctly for translation. Misidentified speech boundaries (caused by accents, background noise, or short pauses) make the system translate incomplete thoughts or interrupt the current speaker, leading to mistranslations and a disrupted conversational flow.
– High operational cost and resource management: The core process relies on computationally intensive, per-use external services (for speech recognition, translation, and text-to-speech) and highly concurrent infrastructure. Running these demanding services constantly for real-time translation leads to unpredictably high external service costs, significant internal resource consumption, and challenges in efficiently scaling the stateless processing backend.
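
The cumulative-latency challenge above can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the per-stage figures, the per-hop network cost, and the names `STAGE_MS`, `NETWORK_HOP_MS`, and `sequential_delay_ms` are assumptions made for this example, not measurements or code from the project.

```python
from typing import Dict

# Hypothetical per-stage processing times in milliseconds (illustrative only).
STAGE_MS: Dict[str, int] = {"asr": 400, "translate": 150, "tts": 300}

# Assumed one-way network cost per hop, in milliseconds (illustrative only).
NETWORK_HOP_MS = 50


def sequential_delay_ms(stage_ms: Dict[str, int], hop_ms: int) -> int:
    """Total delay when every stage waits for the previous one to finish.

    Assumes one network hop into each stage plus one final hop that
    delivers the synthesized audio to the listener.
    """
    hops = len(stage_ms) + 1
    return sum(stage_ms.values()) + hops * hop_ms


if __name__ == "__main__":
    print(sequential_delay_ms(STAGE_MS, NETWORK_HOP_MS))
```

With these made-up numbers the stages contribute 850 ms and the four hops another 200 ms, for 1050 ms in total, and that is before counting the time spent waiting for the speaker to finish the utterance.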
To overcome these challenges, Hola Tech designed the solution around the following key components:
– Implement streaming and chunk-based translation: Instead of waiting for a full sentence or speaker turn, segment the audio stream (WebRTC/Twilio) into chunks of 500 ms to 1 second. Use a streaming ASR solution (or an optimized, low-latency deployment of Whisper), translate these partial chunks, and use streaming TTS (ElevenLabs) to begin playing the translated audio immediately. Overlapping the processing steps in this way reduces the perceived latency.
– Integrate advanced Voice Activity Detection (VAD): Supplement the ASR module with a dedicated, tuned VAD component at the NodeJS/WebSocket layer. This component analyzes the incoming audio stream (WebRTC) to detect speaker start and stop points reliably. As a result, transcription (Whisper) is initiated only while speech is active, and the translation step (Google Translate API) receives complete, contextual speaker segments, which prevents interruptions and improves translation accuracy.
– Leverage AWS Spot Instances and container orchestration for inference: Package self-hosted ASR and TTS models (or their optimized open-source variants) in Docker containers. Deploy these containers on AWS ECS/EKS using Spot Instances for significant cost reduction, and scale the load dynamically based on real-time call volume. This replaces constant, expensive API calls with internal, scalable compute resources that are managed efficiently.
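
The chunking and VAD components above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it uses a plain RMS energy threshold as a stand-in for a tuned VAD, and the function names (`frames`, `rms`, `speech_segments`) and parameters (`threshold`, `hangover_frames`) are hypothetical. The hangover counter is what keeps a short pause from splitting one utterance into two segments.

```python
from typing import Iterator, List, Tuple


def frames(samples: List[float], sample_rate: int, frame_ms: int) -> Iterator[List[float]]:
    """Yield consecutive fixed-duration frames; the last one may be shorter."""
    size = max(1, sample_rate * frame_ms // 1000)
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of a frame (0.0 for an empty frame)."""
    return (sum(x * x for x in frame) / len(frame)) ** 0.5 if frame else 0.0


def speech_segments(
    samples: List[float],
    sample_rate: int,
    frame_ms: int = 20,
    threshold: float = 0.1,      # assumed energy threshold, needs tuning
    hangover_frames: int = 3,    # quiet frames tolerated inside an utterance
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans classified as speech.

    A span is closed only after `hangover_frames` consecutive quiet frames,
    and its end is placed where that quiet run began.
    """
    segments: List[Tuple[int, int]] = []
    start_ms = None
    quiet = 0
    count = 0
    for index, frame in enumerate(frames(samples, sample_rate, frame_ms)):
        count = index + 1
        t = index * frame_ms
        if rms(frame) >= threshold:
            if start_ms is None:
                start_ms = t          # speech begins
            quiet = 0
        elif start_ms is not None:
            quiet += 1
            if quiet >= hangover_frames:
                segments.append((start_ms, t - (quiet - 1) * frame_ms))
                start_ms, quiet = None, 0
    if start_ms is not None:
        # Stream ended while a segment was still open.
        segments.append((start_ms, (count - quiet) * frame_ms))
    return segments
```

In a setup like the one described, each closed segment would be handed to the ASR step, while the same `frames` helper with a larger `frame_ms` (for example 500) would produce the chunks fed to a streaming recognizer. A production VAD would typically replace the energy threshold with a trained model, since fixed thresholds are sensitive to background noise.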
– Supported simultaneous real-time translation for over 25 languages and major dialects
– Processed over 1.5 million minutes of translated audio since the platform’s launch
The implemented system met Translingo Call Translator’s requirements for real-time performance, communication flow integrity, and language coverage. The platform has lowered the friction of international communication and attracted over 50 major enterprise clients in customer support and cross-border sales. With more than 1.5 million minutes of audio translated, Translingo Call Translator enables businesses and individuals to conduct multilingual calls and messaging reliably, breaking down global language barriers.