
Custom Speech in XB


SteveB


Hi,

 

with some hints from @arcadeshopper , the TI-Trek video of @pixelpedant and a posting from @HOME AUTOMATION about the first three bytes of CALL SPGET I was able to get my first custom words out of XB and the Speech Synthesizer.


I used Python_Wizard to create the LPC code (203 bytes) from a 1-second, 16-bit mono WAV file. I thought I had the volume at the maximum without clipping, but the result is low in volume and leaves some room for improvement. At first I did not use any tuning parameters, then I tried -V and -p as they looked easy. I am still not completely happy with the result, but a little proud that I got this far.

 

  1. How should I process an audio file for best output? Compressor? Vocoder?
  2. Does anyone have hints on parameters like WindowWidth, PitchRange, Framerate, PreemphasisAlpha, subMultipleThreshold ...? What are useful starting points and what do they influence?

 

Here is my little XB demo:

 

venkman

 

PS: Perhaps Pixelpedant was right in TI-Trek ... samples from movies / TV shows are sub-optimal sound sources...


A little theory:

 

The TMS5200  LPC-10 takes 40 frames/second and outputs 8 kHz samples. So you should keep frame rate at 40.

 


(Other chips support 50/100/150/200 frames/sec which are helpful in modeling higher frequency speakers, or singing.) 


 

A frame covers 25 ms of time and has a set of parameters for voiced/unvoiced, pitch, energy, K1-10.  
 

The algorithm is attempting to find the best estimate for the LPC parameters, which fit the input sample during that time period.
 

LPC doesn’t aim to re-create the input sample perfectly, but good enough that the output makes you hear the formants in human voiced speech (“aaaah”) and the character of unvoiced speech  ( “ssssh”, “zzzz”). 
 

As such, LPC can’t digitize just any signal. It is great for human voice, which has kind of a lumpy frequency spectrum with a few peaks—the peaks are the formants.
 

Human speech is also sometimes voiced, sometimes unvoiced. For voiced, LPC speech synthesis begins by shaping a repeated “chirp” signal into a resonating sound (“aaaah”). Unvoiced sound begins as white noise (same as SN76489 LFSR white noise) which is  filtered. 
 

K1-10 control the lattice filter, which is a model for the way the human vocal tract shapes sound. (It’s a bunch of multiply-add operations with a 10-sample memory.) 
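To make the "multiply-add operations with a 10-sample memory" concrete, here is a minimal sketch of an all-pole lattice filter driven by a simple excitation. This is not the chip's actual implementation: the impulse train stands in for the real chirp table, the coefficients are plain floats rather than table lookups, the sign convention is one common textbook choice, and the function names are made up for illustration.

```python
import random

def excitation(n_samples, pitch_period, seed=0):
    """Stand-in excitation: impulse train if voiced, +/- noise if unvoiced.

    Assumption: pitch_period is in samples; 0 means unvoiced.
    The real chip uses a stored chirp waveform instead of a single impulse.
    """
    if pitch_period > 0:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.choice((-0.5, 0.5)) for _ in range(n_samples)]

def lattice_synthesize(excite, k):
    """Run an excitation through an all-pole lattice filter.

    k[0] is K1, k[1] is K2, and so on. b[] is the filter memory
    (one delayed value per stage), which is the '10-sample memory'
    when len(k) == 10.
    """
    order = len(k)
    b = [0.0] * (order + 1)
    out = []
    for x in excite:
        f = x
        # Walk from the last stage (K10) back to the first (K1).
        for i in range(order - 1, -1, -1):
            f -= k[i] * b[i]
            b[i + 1] = b[i] + k[i] * f
        b[0] = f
        out.append(f)
    return out

if __name__ == "__main__":
    voiced = excitation(16, pitch_period=8)
    print(lattice_synthesize(voiced, [0.5, -0.3]))
```

With all coefficients at zero the filter passes the excitation through unchanged; nonzero values make each pulse ring, which is what shapes the resonances.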


Since the human vocal tract changes  slowly over time, 40 frames/sec is considered good enough. 
 

With 8 kHz input and output, each frame equals 200 samples. You could set Window width at 200, in which case each frame would be independent, or you could extend it to cover more samples  (before or centered on? Not sure!) 
 

Since we know the model changes slowly, the  parameter estimates should be similar whether we use 200 or 500 samples.  If the bigger window is less sensitive to outliers, great. But too big of a window will blur out the changing signal you are trying to model!
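The frame and window arithmetic above can be sketched in a few lines. I am assuming here that the analysis window is centered on the frame and clipped at the ends of the recording; whether *Wizard does it this way is exactly the part I am not sure about, and the function names are only for illustration.

```python
SAMPLE_RATE = 8000   # samples per second
FRAME_RATE = 40      # frames per second

def frame_length(sample_rate=SAMPLE_RATE, frame_rate=FRAME_RATE):
    """Samples per frame: 8000 / 40 = 200."""
    return sample_rate // frame_rate

def frame_windows(n_samples, window_width,
                  sample_rate=SAMPLE_RATE, frame_rate=FRAME_RATE):
    """Return (start, end) sample spans, one per frame.

    Assumption: each window is centered on its frame and clipped
    to the recording, so wide windows overlap their neighbours.
    """
    step = frame_length(sample_rate, frame_rate)
    spans = []
    for start in range(0, n_samples, step):
        center = start + step // 2
        lo = max(0, center - window_width // 2)
        hi = min(n_samples, center + (window_width - window_width // 2))
        spans.append((lo, hi))
    return spans

if __name__ == "__main__":
    print(1000 / FRAME_RATE, "ms per frame")
    print(frame_windows(8000, 200)[:3])   # independent frames
    print(frame_windows(8000, 500)[:3])   # overlapping windows
```

A window of 200 gives back-to-back spans with no overlap, while 500 makes each estimate also see 150 samples of the frames on either side.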


Pre-gain attempts to amplify the important frequencies in human speech while damping others. This lets the algorithm use its 12 degrees of freedom to best fit the formants that characterize voiced human speech.   

 

I think pre-emphasis is a first-order high-pass filter with one parameter, alpha: each output sample is the input sample minus alpha times the previous input sample. (In *Wizard, the pitch range could be parameters to the pre-emphasis filter.)
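For reference, the usual textbook form of a one-parameter pre-emphasis filter is a first-order difference, which boosts higher frequencies relative to lower ones. Here is a small sketch; I am assuming this is the form the alpha parameter refers to, which I have not checked against the *Wizard source.

```python
def pre_emphasis(samples, alpha):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    alpha near 1.0 removes most of the slowly varying content;
    alpha = 0.0 leaves the signal unchanged.
    """
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - alpha * prev)
        prev = x
    return out

if __name__ == "__main__":
    steady = [1.0, 1.0, 1.0, 1.0]          # low-frequency content
    alternating = [1.0, -1.0, 1.0, -1.0]   # high-frequency content
    print(pre_emphasis(steady, 0.5))
    print(pre_emphasis(alternating, 0.5))
```

The steady input comes out smaller after the first sample and the alternating one comes out larger, so a large alpha on a recording with little high-frequency content can leave very little signal for the encoder to work with.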
 

The pitch range of male speech is from about  80 or 160 Hz, to 1000 (?) Hz. Female and juvenile speech begins at 160 and goes much higher. So the content which you want to pre-emphasize depends on the speaker. 
 

In a spectrum analysis of a speech sample, formants are the peak frequencies.  (Higher energies than those close by.)

 

The pitch parameter for the frame stores the fundamental frequency of the voice, that is, the rate at which the chirp repeats; it sits below the formant peaks. Energy is the overall loudness of the frame. In voiced speech, the chirp signal is played at the given pitch.

 

Formant creation is not directly controlled by an LPC encoder. But an LPC filter is capable of fitting only a few formants well! So you want pre-emphasis to amplify the range covering the first couple of  formants. 

 

 

(In an LPC encoder-decoder system, the final stage before sound output should be a post-emphasis filter, to reverse this effect. I don’t know if our hardware does anything.)


Submultiple … I have no idea what this is. 


If your resulting speech has faint volume, you may be killing the signal in pre-emphasis. 

 

I would dump the decoded LPC and see what values are put in for energy. 
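A rough sketch of such a dump follows. It assumes the usual TMS5220-style field widths (energy 4 bits, repeat 1, pitch 6, K1-K10 as 5,5,4,4,4,4,4,3,3,3), that energy 0 means silence and 15 means stop, and that you have already turned the LPC bytes into a string of '0'/'1' characters in the order the chip consumes them. The bit order within each byte depends on how your tool writes the data, so that step is left out. All values are table indices, not actual levels.

```python
K_BITS = (5, 5, 4, 4, 4, 4, 4, 3, 3, 3)  # assumed widths of K1..K10

def parse_frames(bits):
    """Split a '0'/'1' string into frames and return them as dicts."""
    pos = 0

    def take(n):
        nonlocal pos
        if pos + n > len(bits):
            raise ValueError("bit stream ended in the middle of a frame")
        value = int(bits[pos:pos + n], 2)
        pos += n
        return value

    frames = []
    while pos + 4 <= len(bits):
        energy = take(4)
        if energy == 0:
            frames.append({"type": "silence", "energy": 0})
            continue
        if energy == 15:
            frames.append({"type": "stop", "energy": 15})
            break
        repeat = take(1)
        pitch = take(6)
        if repeat:
            frames.append({"type": "repeat", "energy": energy, "pitch": pitch})
            continue
        # Pitch index 0 marks an unvoiced frame, which carries only K1-K4.
        n_k = 4 if pitch == 0 else 10
        k = [take(width) for width in K_BITS[:n_k]]
        kind = "unvoiced" if pitch == 0 else "voiced"
        frames.append({"type": kind, "energy": energy, "pitch": pitch, "k": k})
    return frames

if __name__ == "__main__":
    demo = "0000" + "0101" + "0" + "010100" + "0" * 39 + "1111"
    for frame in parse_frames(demo):
        print(frame["type"], frame["energy"])
```

If most voiced frames show energy indices down near 1 to 4, the encoder is producing quiet output and the problem is upstream of the synthesizer.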
 

Finally. Encoding LPC parameters was never the final result of speech processing. A speech coding lab at Texas Instruments employed a linguist (famously, Kathleen M.  Goudie) and experienced editors. They would go through the rough LPC and adjust the pitch and energy frame-by-frame if necessary. (I suppose they could tweak the K1-K10 but that’s scary.)

 

 

*wizard may come up with erratic values for pitch. A speech editor could change these to vary smoothly. 
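As one simple way to tame erratic pitch values automatically, here is a sketch of a median-of-three filter over the per-frame pitch indices. This is only an illustration of the idea, not what the TI editors did and not a feature of *Wizard; it assumes pitch 0 marks an unvoiced frame and leaves those alone.

```python
def smooth_pitch(pitches):
    """Replace each voiced pitch with the median of itself and its neighbours.

    Frames next to an unvoiced frame (pitch 0) and the first and
    last frames are left unchanged.
    """
    out = list(pitches)
    for i in range(1, len(pitches) - 1):
        window = pitches[i - 1:i + 2]
        if 0 in window:
            continue
        out[i] = sorted(window)[1]
    return out

if __name__ == "__main__":
    print(smooth_pitch([20, 21, 45, 22, 23]))
```

A single outlier like the 45 above gets pulled back into line, while a genuine steady rise or fall in pitch passes through untouched.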
 

In software made available to 3rd parties, pitch smoothing (“intonation”) was a function. TE2 text-to-speech allowed a start pitch and slope. (Speech Education Module; Barcode Factory.) 

 

I have to end this post. I have not experimented with *wizard lately but I hope to. 
 

 

 


 


 

 


26 minutes ago, FarmerPotato said:

The TMS5200  LPC-10 takes 40 frames/second and outputs 8 kHz samples. So you should keep frame rate at 40.

Python Wizard has a default frame rate of 25, but perhaps that's the frame duration in ms?

We usually send 8 bytes at a time to the speech synth when the buffer is half full. How often do you have to do that to keep up with a frame rate of 40 fps? In other words, how many bytes per frame?


LPC Speech Encoding

You should look at it as bits-per-frame:

Voiced: 50 bits

Unvoiced: 29 bits

Repeated: 11 bits

Silence: 4 bits

Stop: 4 bits

So in the worst case, with every frame voiced, that's 50 bits (6.25 bytes) per frame every 25 ms, a rate of 2 bits per ms. At that rate you would need to refill half the buffer (64 bits, or 8 bytes) every 32 ms. If every frame were silence, at 4 bits (0.5 bytes) per frame, half the buffer would last 400 ms (a rate of 0.16 bits per ms). Since you could have any mix of frame types, you should be able to handle the highest rate by checking the buffer-low status bit at the VDP interrupt (about every 16.7 ms on NTSC or 20 ms on PAL).
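The same arithmetic as a small sketch, using the frame sizes listed above, a 25 ms frame, and 8 bytes (64 bits) per refill. The names are only for illustration.

```python
FRAME_MS = 25            # one frame is consumed every 25 ms
HALF_BUFFER_BITS = 64    # 8 bytes sent per refill

FRAME_BITS = {
    "voiced": 50,
    "unvoiced": 29,
    "repeated": 11,
    "silence": 4,
}

def refill_interval_ms(bits_per_frame):
    """Milliseconds that 64 bits last if every frame has this size."""
    return HALF_BUFFER_BITS * FRAME_MS / bits_per_frame

if __name__ == "__main__":
    for name, bits in FRAME_BITS.items():
        print(f"{name:9s} {bits / 8:5.2f} bytes/frame, "
              f"refill every {refill_interval_ms(bits):6.1f} ms")
```

Even the all-voiced case leaves 32 ms between refills, which is longer than one video frame on either NTSC or PAL.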

Edited by PeteE
