It was never supposed to be nonsense. To anyone playing in Japanese or presumably other phonetically consistent languages, it's obviously a sped up, slurred/somewhat mangled version of what the text is saying. I guarantee every Japanese player noticed this, and none of them thinks it's pure gibberish. Japanese is not my native language and I noticed immediately; any native would have too.
The problem is that English is a phonetically inconsistent language, with a massive number of rules required to even begin to approximate the mapping from text to phonemes (and zillions of exceptions). So this kind of really dumb TTS not intended to be actually intelligible doesn't work at all in English. And so it sounds like actual nonsense.
It did work in English though. Maybe not as well, but I do remember noticing in the original Animal Crossing that there was some correlation between the sounds and the text. Even if it was just the timing and intonation, it could definitely be understood that the lines were actually being spoken in some form. I don't think it was TTS because it was too accurate in a really subtle way.
Or maybe it was and they tweaked it well enough to work. It's been a while, my memory could be off. I don't know if it would have been prohibitive on the GameCube to have audio of every line (there were a lot), but I wouldn't put it past them to have done so.
It's TTS. They absolutely don't go dubbing around every line for this, that'd be insane. That's the whole point of this system, to provide some fun sounding audio for the lines without having to actually dub them. You can tell because they speak your island name and your own name exactly the same way as the rest of the text, accurately. And because it's pretty monotone and consistent.
Besides, can you imagine voice actors dubbing this stuff in this kind of voice line by line? They'd go insane.
Yeah that makes sense. I guess I figured they could have processed the spoken audio, not that the actors would have actually talked like that. I was just fooled by the fact that their TTS sounds more passably human than any other I've heard (in an abstract way).
Banjo-Kazooie [1] was the first game on the N64 to use what Animal Crossing terms "Animalese/Bebebese" [2], and their intention was never to build a TTS engine. The first Banjo was released in 1998, while Dobutsu no Mori (Animal Crossing) for N64 didn't come out until 2001.
Nintendo was definitely talking with Rareware at the time and they exchanged ideas and techniques on game engine design, platformer mechanics, etc. Interviews from the Rare side admitted this (I'll need to dig up some sources to include here).
I'm curious if Nintendo picked up the Animal Crossing Bebebese voices directly from Banjo-Kazooie.
[1] https://www.youtube.com/watch?v=9ZE5A3DbHDk (Actually a video of the year 2000 sequel, Banjo-Tooie, but this is a better example of the same voice engine using various voices.)
I've played it in both Japanese and in English, and while Japanese is more phonetically consistent I don't understand how people could miss it in English either. You can tell right at the beginning when Rover says your name back to you. Even if it's usually difficult to make out, occasionally you should notice it's not totally random, especially when you start playing around with giving the villagers catchphrases.
This reminds me of the game killer7, where some NPCs would speak in a partially distorted voice. However, in that game it was always quite interesting listening to the TTS because the script an NPC is reading seems to be paraphrased from the subtitle (and maybe partially complete gibberish?) and it feels like you’re fading in and out of understanding the language as they speak [1], which is in keeping with the surreal style of the game and presumably intentional.
Fun fact: In the Japanese version it's literally just Mac TTS voices with machine-translated English.
The distortion effects were presumably added so people overseas wouldn't notice how jarring it is. It's such a cool effect it feels like something that should have been in the original release.
Oh wow that’s so interesting! Especially how incomprehensible most of the dialog still is. I just realised this account posted the same video with the non-distorted speech too [1]
That is assuming they tried to write their own TTS instead of just taking an existing, working English TTS and speeding it up and distorting it. Why would they do that?
To me it does sound exactly like that too in English.
I'd wager it's more obvious in the Japanese version because Japanese is the exceptional language. There are only 44 syllables in Japanese (English has about 16,000) and one would probably still notice this in otherwise unintelligible distorted speech.
Writing their own TTS sounds like what Nintendo would do, to be honest.
Spanish is about as phonetically consistent as Japanese. Most languages have simpler phonology than English. It's not about the number of possible syllables, it's about the rules to go from text to phonemes. The rules in English are immensely complicated and inconsistent. Meanwhile, I can accurately describe Spanish phonology, such that you'd be able to pronounce ~any Spanish word (English loanwords excluded) accurately including stress, in about one page. Written Japanese lacks pitch accent information, but otherwise works similarly.
(By the way, your stats are off; modern Japanese has about 106 possible syllables (morae) by rough count.)
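To make the "about one page of rules" point concrete, here is a minimal sketch of just one of those Spanish rules, the standard orthographic stress rule: a written accent marks the stressed syllable; otherwise words ending in a vowel, n, or s stress the second-to-last syllable, and everything else stresses the last. The function name and the assumption that the word arrives already split into a non-empty list of syllables are mine, purely for illustration; syllabification itself is not covered here.

```python
ACCENTED_VOWELS = set("áéíóú")


def stressed_syllable(syllables: list[str]) -> int:
    """Return the index of the stressed syllable of a Spanish word.

    Assumes (hypothetically) that the caller has already split the word
    into a non-empty list of syllables.
    """
    # Rule 1: a written accent always marks the stressed syllable.
    for index, syllable in enumerate(syllables):
        if any(ch in ACCENTED_VOWELS for ch in syllable.lower()):
            return index

    # A one-syllable word has only one candidate.
    if len(syllables) == 1:
        return 0

    # Rule 2: words ending in a vowel, n, or s stress the second-to-last syllable.
    last_char = syllables[-1][-1].lower()
    if last_char in "aeiouns":
        return len(syllables) - 2

    # Rule 3: all other words stress the last syllable.
    return len(syllables) - 1


print(stressed_syllable(["ca", "sa"]))     # casa
print(stressed_syllable(["ha", "blar"]))   # hablar
print(stressed_syllable(["can", "ción"]))  # canción
```

Nothing comparably short exists for English, where stress and vowel quality generally have to be looked up per word.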
True, but if they didn't write their own TTS none of that matters.
Most TTS software out there will be better at English than any other language despite the more complex phonology. Then it's solely about distinguishing language characteristics, and I think the number of syllables would have an effect on that. Japanese has a lower information density than English (at the same speed), so with the same amount of distortion Japanese should be more recognizable.
Luigi's Mansion 3 does something similar; I wouldn't be surprised if it's using the same or similar technology under the hood. For one of the characters, bits of the dialog sounded like incoherent Japanese and I could make out some words. I haven't played in a while, but I recall different characters had different sounding accents too. I like this approach for in-game dialog!
Reminds me of a song from 1972 by an Italian comedian. The song is called "Prisencolinensinainciusol" which means...well, nothing. It's gibberish. In fact, the entire song is nonsense lyrics made to sound like English.
Adriano Celentano is a singer-songwriter, actor, director, screenwriter, composer, film editor and TV author[1]. He definitely has humor and used irony throughout his career, but I wouldn't call him a comedian by any stretch.
It's also a pretty catchy song. But yeah it messes with my brain that is working hard to parse English. It's interesting. If a language is clearly not English I don't have the same issue.
Is English your native language? Because for me at least as a native English speaker I immediately recognize it as gibberish and have no such problem. I’m curious if it’s a native/non-native issue with parsing or not.
Yes it is. I'm a very strong English speaker/listener.
I get this in a lot of things. Human brains love finding patterns and everyone does this to an extent. I think mine does it so much more. For example if I ever see numbers anywhere I'm compulsively adding them to see if any nicer looking numbers result.
I wouldn't be surprised if this inclination to hear gibberish and try to parse it into language is a me thing.
Animalese is a simple phonetic translation of the text. It’s super noticeable in Spanish which is fairly straightforward. In English it’s a simplistic translation to phonemes as far as I can tell.
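A minimal sketch of what a "simple phonetic translation" could look like, assuming (this is a guess, not the game's confirmed method) that each letter maps to one short sound clip and everything else becomes a pause. The clip names and the function are made up for illustration; a real implementation would then play the clips back, sped up and pitch-shifted.

```python
import string

# Hypothetical clip table: one short recorded sound per letter.
LETTER_CLIPS = {ch: f"clip_{ch}" for ch in string.ascii_lowercase}
PAUSE = "pause"


def animalese_sequence(text: str) -> list[str]:
    """Map text to a list of clip names, one per character.

    Letters map to their clip; spaces, digits, and punctuation map to a
    pause, and runs of consecutive pauses are collapsed into one.
    """
    sequence: list[str] = []
    for ch in text.lower():
        clip = LETTER_CLIPS.get(ch, PAUSE)
        if clip == PAUSE and sequence and sequence[-1] == PAUSE:
            continue  # collapse runs of non-letters into a single pause
        sequence.append(clip)
    return sequence


print(animalese_sequence("Hi, Bob"))
```

A scheme like this would also explain why player and island names come out sounding like the rest of the dialog: they are just more text run through the same table.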
I first realized what they were doing when talking to Blathers. The hoots in his speech were easy to pick out in the audio which then led to hearing that the rest of the text was being spoken. After noticing it, it's been much easier to hear it happening for all characters.
Interestingly, we have a word in French to call these gibberish languages that sound like real ones but have no meaning.
It's called speaking "yogurt".
It was originally used because young French people wanted to sing the English songs that they heard, but didn't know the language, so they would make up sounds that sounded like English.
This exact thing is called "washawasheo" or "washawashear" in (Mexican) Spanish. It would be cool to know what languages have this, as I'm pretty sure not all of them do (Russian doesn't, for example).
> The Greeks used the term barbarian for all non-Greek-speaking peoples, including the Egyptians, Persians, Medes and Phoenicians, emphasizing their otherness. According to Greek writers, this was because the language they spoke sounded to Greeks like gibberish represented by the sounds "bar..bar..;"
Instead of the constant focus on graphics graphics graphics, why don’t companies improve other tech like speech synthesis for a while?
That would bring about a new boom (pun) of creativity by allowing indie devs to write complex stories with spoken dialogue without having to worry about hiring actors and immutable recording sessions.
There aren't super technical burdens to speech synthesis in games; for example, Jackbox Party Pack 7 runs just fine on the Switch and contains a speech synthesis engine for "Blather Round".
I imagine it's mostly a licensing/cost thing (since a "voice" for a speech synthesis engine still requires hiring an actor and doing a recording session).
> since a "voice" for a speech synthesis engine still requires hiring an actor
That's what I'm talking about. Surely we can do away with that if we try, just as we don't need real people to build 3D models -from- if we don't want them.
I’m not sure what you mean when you say we don’t need real people to build 3D models. If we want 3D models, someone has to make them. That person might be a traditional modeler working in Blender, or a sculptor in ZBrush, or might be someone with a 3D scanner doing photogrammetry. The tools are changing, but it is still people using the tools.
Just like how we invented an “automatic programming” system where the computer will do programming for you, and it turns out that once we’ve made automated programming we have more programmers and not fewer. The tools for making 3D models are getting better and easier to use, and as photogrammetry is being used more and more, we see larger teams of modelers, not smaller.
They're comparing making a bespoke 3D model of a character as opposed to scanning an actual person, saying similarly you could just create a voice from scratch rather than record a real one.
I thought that was obvious, so I guess my comment wasn’t clear.
A 3D scanner is an artist’s tool. By using a 3D scanner, you aren’t getting rid of artists, you are just changing how artists do their jobs. Vocal synthesizers and vocal transformers, similarly, aren’t making it possible to press a button and get reasonable sounding voice in your games if you don’t have a voice actor making it possible. If you aren’t convinced, then just look at soundtracks. You can press a button and your iMac will spit out the sounds of the BBC Symphony Orchestra violin section. In spite of this, getting a symphonic score is still expensive. It’s expensive enough that TV shows (with sizable budgets) often skip out on the symphonic score and do something cheaper. We haven’t gotten rid of musicians, it’s just that musicians are much more likely to have computers.
There is, in theory, nothing stopping you from buying like $200 in software and making a symphonic score right now. The problem is that you have no idea how to write for a symphony orchestra. For the same reason, if you have an iPhone 12 or something similar then you can start using the LIDAR features and making a 3D model using photogrammetry in moments, except for the fact that you have no idea how to make a 3D model.
I think there’s a trap that people fall into, thinking that technology is just around the corner that will get rid of job X, Y, or Z. Often what you end up with is MORE people doing job X, Y, and Z, it’s just that they use computers to do it, and have a different skill set.
What a lot of people fail to grasp is that there is a fixed cost to staying alive. The big problem for a lot of skills is that the production capacity of a human is not high enough to pay for that fixed cost, or is just barely high enough for the top 10,000 humans to have a career in that skill. If you somehow increase the productivity of humans with that skill, you massively reduce the barrier to entry, which means more people can make a living out of it.
There are companies like replicastudios.com tackling this. I imagine it's not likely a games company can build a massive lead in what is a very academic field.
Reminds me of glossolalia. The speaker believes they’re speaking a foreign language, but when you examine the “language” it’s just random phonemes from the speaker’s native language.
I'm curious if that's the same for other games that do similar "gibberish" sounds, like Mario Odyssey and Zelda games. I do think the gibberish "sounds" English-y. I can see them making it sound more like Japanese gibberish in Japan.
On not all gibberish being created equal: it's been long enough since "What Languages Sound Like To Foreigners" that some people here may have missed this.
I believe they run the dialogs into a program and generate the sounds from it. I play AC in Chinese and the words do sound like Chinese. It is just crazy to wrap my head around the amount of effort they put into this game.
When I was a kid the only way I could play video games was in 1 week bursts in the form of Hollywood Video rentals. When we picked up Animal Crossing for GameCube and I heard Animalese, I thought my disc was messed up and went back to swap it out. I was too young to really know how to google something like that.