There were speech synthesis boxes in the 80s, so I doubt this is intractable. AT&T did a lot of work on it through 2010 (perhaps beyond, but I stopped paying attention), and the syllables are nearly perfect, assuming you can massage the source text using context clues to force the proper pronunciation. Is it "read" or "red"? Well, that depends on the context.
Even with voice-cloning TTS you have to deliberately respell things to get them pronounced correctly.
So it comes down to "guessing the context," which is something LLMs can do fairly well, and a Raspberry Pi can run a small LLM, so who knows. Sentiment analysis is one thing; I'd probably mess with prompting an LLM to "correct the grammar" and see which words change, or even "change words for synonyms that can't be misconstrued by the TTS engine, unless the tone of the phrase or passage changes dramatically, in which case, phonetically spell the word in question."
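The preprocessing idea above could be sketched roughly like this. The prompt wording, the heteronym table, and the past-tense flag are all just illustrative assumptions; in a real pipeline the LLM itself would judge which reading is intended:

```python
# Sketch of the two-stage idea: a prompt that asks an LLM to rewrite text
# so a TTS engine can't misread it, plus a phonetic-respelling fallback
# for known heteronyms. The heteronym table is a tiny illustrative sample.

HETERONYMS = {
    # word -> phonetic respelling for the past-tense/alternate reading
    "read": "red",
    "lead": "led",
    "tear": "tare",
}

# Hypothetical prompt for the LLM pass described in the comment.
REWRITE_PROMPT = (
    "Rewrite the following text for a text-to-speech engine. Replace any "
    "word whose pronunciation is ambiguous with an unambiguous synonym, "
    "unless that would change the tone, in which case respell the word "
    "phonetically. Return only the rewritten text.\n\nText: {text}"
)


def build_prompt(text: str) -> str:
    """Fill the rewrite prompt with the passage to disambiguate."""
    return REWRITE_PROMPT.format(text=text)


def respell(word: str, past_tense: bool) -> str:
    """Fallback: phonetically respell a known heteronym when the context
    (here a simple flag; in practice the LLM's judgment) indicates the
    alternate reading is intended."""
    if past_tense and word.lower() in HETERONYMS:
        return HETERONYMS[word.lower()]
    return word
```

So "I read the book yesterday" would come back as "I red the book yesterday" before it ever reaches the TTS engine, which sidesteps the guessing entirely.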
But I only think to do this because LLMs exist. If you had asked me 15 years ago how to automate "pronunciation" of English, I'd have said "if IBM and AT&T and Apple can't figure it out..."