This task can be performed using TEN Agent
Meet TEN, the World's First Truly Real-time Multimodal Agent Framework for Creating Next-Gen AI Agents
Best product for this task

The TEN Framework is an open-source framework that enables developers to quickly build real-time multimodal agents (voice, video, data stream, image, and text), making it easy for developers to experiment, integrate large language models, and create reusable extensions.

What to expect from an ideal product
- Uses lightweight microservices that exchange messages asynchronously instead of blocking on each response
- Splits complex tasks into smaller chunks that run at the same time
- Processes text, voice and video streams on the fly without buffering
- Keeps everything in memory rather than writing to disk to avoid delays
- Uses smart routing to pick the fastest path between components in real time
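
Several of the points above (non-blocking communication, splitting work into chunks that run concurrently, and keeping data in memory) can be illustrated with a small sketch. This is not TEN's API; it is a generic example built on Python's standard `asyncio` library, and every name in it (`source`, `process_chunk`, `stage`, `run_pipeline`, `run`) is hypothetical and chosen only for illustration. The "processing" step is a stand-in that uppercases text.

```python
import asyncio
from typing import Dict, Iterable, List

SENTINEL = None  # hypothetical end-of-stream marker


async def source(chunks: Iterable[str], queue: asyncio.Queue) -> None:
    # Hand each chunk to the next stage as soon as it is available,
    # then signal that the stream has ended.
    for index, chunk in enumerate(chunks):
        await queue.put((index, chunk))
    await queue.put(SENTINEL)


async def process_chunk(chunk: str) -> str:
    # Stand-in for real work (for example, a model call); the sleep
    # yields control to the event loop the way network I/O would.
    await asyncio.sleep(0.01)
    return chunk.upper()


async def _run_one(index: int, chunk: str, results: Dict[int, str]) -> None:
    results[index] = await process_chunk(chunk)


async def stage(queue: asyncio.Queue, results: Dict[int, str]) -> None:
    # Start work on every chunk the moment it arrives, without waiting
    # for earlier chunks to finish.
    tasks = []
    while True:
        item = await queue.get()
        if item is SENTINEL:
            break
        index, chunk = item
        tasks.append(asyncio.create_task(_run_one(index, chunk, results)))
    if tasks:
        await asyncio.gather(*tasks)


async def run_pipeline(chunks: Iterable[str]) -> List[str]:
    # The queue and the results dict live in memory; nothing is written to disk.
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    results: Dict[int, str] = {}
    await asyncio.gather(source(chunks, queue), stage(queue, results))
    # Chunks may finish out of order, so restore the original order here.
    return [results[i] for i in sorted(results)]


def run(chunks: Iterable[str]) -> List[str]:
    return asyncio.run(run_pipeline(chunks))
```

Calling `run(["hello", "ten"])` pushes both chunks through the pipeline concurrently and returns them in their original order. The bounded queue is a design choice of this sketch: it stops a fast producer from piling up unbounded data in memory. Routing between components is not covered here.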