What do y'all think about the latency/quality tradeoff with LLMs? Human voices d...

navanchauhan · 2026-03-02T18:22:50 1772475770

Not affiliated with Sesame, but this is what the realtime models are trying to solve. If you look at NVIDIA’s PersonaPlex release [0], it uses a duplex architecture. It’s based on Moshi [1], which aims to address this problem by allowing the model to listen and generate audio at the same time.

[0] https://github.com/NVIDIA/personaplex

[1] https://arxiv.org/abs/2410.00037