
New speech synthesis for Atari


R0ger


In 2019, when the topic for the 2020 Forever party was announced to be "Robot", I came up with an idea to somehow improve the speech synthesis situation on Atari. Good old SAM can't talk with video on, is mostly only good for English, is not very good for singing, and nobody really understands how it works. And singing was something I was especially interested in.

I ended up writing a new speech synth from scratch.

Well, 2020 Forever didn't happen. Nor did Forever 2021, nor 2022. But it did finally happen this year. So after all those years, I barely managed to finish the demo in the last two weeks, and here is the result:

 

 

I plan to give a lecture about it in Czech at this year's Atariada, and I will make a video or megapost about it here in English soon after. That would be early May.

So at the moment I don't want to go too deep into explaining how it works. But here are a few points:

- it can do 2 voices with low DMA modes, like Antic D, or low res text mode. It can do 1 voice with full-screen hires. It doesn't like badlines though, so no hires text modes. And with a bit of work, it can also do other things in between, like play a POKEY song, and do slight animation.

- this demo plays the 2 voices in separate channels in stereo, but it is also possible to mix them by CPU and only use 1 channel for 2 voices of speech.

- of course it can also talk. It was supposed to be part of the demo, I just didn't manage to make that part work before the deadline.

- at the moment it sucks at English, as I didn't need it for the demo. But I'm working on it, as well as some other languages.

- compared to SAM (which I somewhat understand now, as some good soul rewrote it in C) it needs more memory (about 15k sample bank) but way less CPU. Features and output quality are similar.

- mine has some special features for singing, like 16 bit frequency control, vibrato, and frame perfect timing.

- POKEY music uses LZSS. The code is based on @rensoup's modified LZSSP, and I use his RMT2LZSS.

- speech output is every 2 lines, so about 8 kHz, 4 bits. I can also do 8-bit output, but it doesn't help much, and usually it's not worth the extra channel.
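
To put numbers on the "every 2 lines, about 8 kHz" and "16 bit frequency control" points, here is a minimal Python sketch. The clock and cycles-per-line constants are the standard Atari 8-bit figures; everything else (the function names, and the idea that pitch is driven by a 16-bit phase accumulator stepped once per output sample) is an assumption for illustration, not something taken from the demo's code.

```python
# Sketch only: the accumulator scheme and all names are assumptions,
# not the demo's actual implementation.

PAL_CPU_HZ = 1_773_447    # Atari 8-bit CPU clock, PAL
NTSC_CPU_HZ = 1_789_773   # Atari 8-bit CPU clock, NTSC (rounded)
CYCLES_PER_LINE = 114     # CPU cycles per scanline on both systems


def sample_rate(cpu_hz: int, lines_per_sample: int = 2) -> float:
    """Output rate when one sample is written every `lines_per_sample` scanlines."""
    return cpu_hz / (CYCLES_PER_LINE * lines_per_sample)


def phase_increment(pitch_hz: float, rate_hz: float) -> int:
    """16-bit step added to a 16-bit phase accumulator once per sample."""
    return round(pitch_hz * 65536 / rate_hz) & 0xFFFF


def count_periods(pitch_hz: float, rate_hz: float, n_samples: int) -> int:
    """Run the accumulator and count wrap-arounds (one wrap = one pitch period)."""
    step = phase_increment(pitch_hz, rate_hz)
    phase = 0
    wraps = 0
    for _ in range(n_samples):
        phase += step
        if phase > 0xFFFF:      # 16-bit overflow marks the start of a new period
            phase &= 0xFFFF
            wraps += 1
    return wraps


if __name__ == "__main__":
    for name, clock in (("PAL", PAL_CPU_HZ), ("NTSC", NTSC_CPU_HZ)):
        rate = sample_rate(clock)
        print(f"{name}: {rate:.1f} Hz, pitch resolution {rate / 65536:.3f} Hz")
```

Under these assumptions the output rate comes out a little under 8 kHz on both PAL and NTSC, and a 16-bit step gives a pitch resolution of a fraction of a hertz, which is the kind of precision that smooth vibrato and in-tune singing would need.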

 

Stay tuned for more demos, and one day, hopefully, something other people can use too.

 

PS. you need stereo for the demo !

zvonky.xex


Nice! Even though it's difficult to assess the speech quality because of the Czech language.

 

How about a downgraded mono version with just 2 channels for music? Stereo-only seems to imply some kind of cheat, even though there obviously isn't one.

 

Waiting for that paper too!

 


This is great news! Can you provide some details on the technology behind your code? SAM is not sample based, BTW. It is based on modelling the human vocal tract, which adds a lot of complexity but brings quite some flexibility. At least in theory, as nobody (except SoftVoice) knows how to adapt it to other languages. The latter would be possible, in principle, but it's unfortunately undocumented.


26 minutes ago, Rybags said:

I'd think SAM should do other languages mostly well since it takes phonetics as input.

 

Indeed it does. The issue is that every language has different phonetics. For example, the Czech "R" is completely different from the English one. What's worse, even the vowels are shifted a bit. You can't get clean Czech vowels out of SAM, which makes SAM sound like an uncle from America. And it's exactly the same the other way around. At the moment I don't have the English "R", I have poor support for diphthongs, and I'm missing a few other sounds .. which makes for really bad English. On the other hand, I can do Japanese or Italian just about perfectly, as they are very similar to Czech phonetically.

But don't worry, English is certainly on the list. I will probably make a mono version of the demo, and will probably try a German one, even without proper German sounds; I can always improve it later. And I'm thinking about Bad Apple with actual singing (but obviously without the video, or with much simpler animation, something like this demo). For those I just need to reuse the tech I have now, without improving it.

After that, English is on the table. That will require some more research and experimenting.


I keep running the demo. Ha - Brilliant! Love the mouth animations. Definitely be up for seeing some English songs with these two. Perhaps enter them in for Eurovision heh heh! ;)

 

Last time I was pleasantly impressed with A8 speech synthesis was in the Cyberpunk demo:

 

 

Edited by Beeblebrox

4 hours ago, Beeblebrox said:

Last time I was pleasantly impressed with A8 speech synthesis was in the Cyberpunk demo:

That sounds cool, but it seems to be just a really low base frequency for the speech. I encountered this effect during my tests. As soon as I get my talking working again, I'll post some examples.

 

Anyway .. here is the mono version with software mixing. Quite mediocre, I must say. I ran it like this for months and only switched to stereo a week ago, but now I can see (I mean hear) how superior hardware mixing is. There is also only one oscilloscope: it's a cheap effect, but only if I just reuse the value I'm sending to POKEY, which in this version is the mix.

It's also quieter, as I need extra room in the amplitude range to prevent overflow. And it's also bigger, as I didn't bother to pack it ;-)
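
The "extra room in the amplitude range" point can be illustrated with a small Python sketch. It assumes the straightforward approach of limiting each voice to half of the 4-bit range before adding them; the post does not say how the demo actually scales or mixes, and the function names are made up. The one hardware fact used is that bit 4 of a POKEY AUDCx register selects volume-only mode, with the low nibble as the output level.

```python
# Sketch only: the half-range scaling and the names are assumptions,
# not the demo's actual mixing code.

VOLUME_ONLY = 0x10  # POKEY AUDCx bit 4: volume-only mode, low nibble = level


def mix_voices(a: int, b: int) -> int:
    """Sum two voices, each pre-scaled to 0..7, so the mix fits in 4 bits."""
    if not (0 <= a <= 7 and 0 <= b <= 7):
        raise ValueError("each voice must be limited to 0..7 to leave headroom")
    return a + b  # 0..14, so it can never overflow the 0..15 range


def audc_value(level: int) -> int:
    """Byte to write to an AUDCx register to output `level` directly."""
    return VOLUME_ONLY | (level & 0x0F)


if __name__ == "__main__":
    loudest = mix_voices(7, 7)
    print(loudest, hex(audc_value(loudest)))
```

With hardware mixing each voice gets its own channel and can use the full 0..15 range, while in this kind of software mix each voice only gets about half of it, which would explain the lower volume.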

zvonky.mono.xex


11 hours ago, R0ger said:

 

Indeed it does. The issue is that every language has different phonetics. For example, the Czech "R" is completely different from the English one. What's worse, even the vowels are shifted a bit. You can't get clean Czech vowels out of SAM, which makes SAM sound like an uncle from America. And it's exactly the same the other way around. At the moment I don't have the English "R", I have poor support for diphthongs, and I'm missing a few other sounds .. which makes for really bad English. On the other hand, I can do Japanese or Italian just about perfectly, as they are very similar to Czech phonetically.

But don't worry, English is certainly on the list. I will probably make a mono version of the demo, and will probably try a German one, even without proper German sounds; I can always improve it later. And I'm thinking about Bad Apple with actual singing (but obviously without the video, or with much simpler animation, something like this demo). For those I just need to reuse the tech I have now, without improving it.

After that, English is on the table. That will require some more research and experimenting.

Probably you mean "indeed it does not". As you say completely correctly, the main issue is that SAM (and its related products, such as the Amiga narrator.device, also from SoftVoice) does not support the phonemes of any language other than English. There is no German "R", no German "Ü", no German "CH" (actually, we have two of them). Even with these phonemes present, the result would still sound like an American trying to speak German, as the "melody" of a sentence is not right.

