BTP2 demo


supercat

Here's a simple demo of my BTP2 music routine. The routine can produce a full chromatic scale over a five octave range, though the top and bottom octaves don't sound too wonderful (I doubt that the DPC would sound great on the really high or really low notes either).

 

Each of the four voices can independently play loudly or softly (though in this demo the volumes all tie together). This demo doesn't maintain vsync, though doing so probably wouldn't be too hard. The music player requires 46 cycles per scan line, leaving 30. Not enough to do a wonderful kernel, but enough to do something.

20 Comments


Lots of potential here. Can you give us some technical details?

I've thought about this from time to time. I originally thought that one could take the absolute value of the waveform and feed it into the volume register, but I've since second-guessed myself and I think it may sound better if one used a waveform with a DC offset (i.e. with the zero crossing at AUDVx=8) so you could get something closer to the actual waveform in there.

Link to comment
 

Each of the twelve pitches has a wave table for loud mode and one for quiet mode, as well as a step table. The wavetables vary in length from 32 bytes to 60 bytes, but each is then padded with an extra 16 bytes except for the 60-byte one (the pitch C) which is padded with 32 bytes (to allow for the highest C note). The wave tables together total 9 pages of code space.

 

The step table for any particular length simply holds (index mod length). The wave tables hold the amplitude values for a waveform in the range from middle C to the B above that. For loud mode, the amplitude values range 0-7; for soft mode they range 2-5.
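As a rough illustration of the table layout described above, here is a minimal Python sketch. The function names, the sine-shaped waveform, and the idea that the padding repeats the start of the wave are assumptions made only for this example; the post does not say what the real tables contain. The lengths, padding, and amplitude ranges are the ones stated above.

```python
import math

def build_step_table(length, pad):
    """Step table: entry i holds i mod length, for length + pad entries."""
    return [i % length for i in range(length + pad)]

def build_wave_table(length, pad, lo, hi):
    """One period of a sine wave (an assumed shape) scaled to lo..hi, then padded.

    The padding repeats the start of the wave so reads past the end wrap cleanly.
    """
    mid = (lo + hi) / 2
    amp = (hi - lo) / 2
    period = [round(mid + amp * math.sin(2 * math.pi * i / length))
              for i in range(length)]
    return period + period[:pad]

# Pitch C: 60-byte wave padded with 32 bytes, as described above.
next_c = build_step_table(60, 32)
loud_c = build_wave_table(60, 32, 0, 7)   # loud mode: 0-7
soft_c = build_wave_table(60, 32, 2, 5)   # soft mode: 2-5
```

With the padding in place, a lookup a few bytes past the end of the wave still lands on a valid entry.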

 

There are four different pieces of code for outputting wave data, which are run in sequence. In C form, they are:

  • AUDV0 = ptr1d[i1] + ptr2c[i2]; AUDV1 = ptr3b[i3]+ptr4a[i4]; i1=ptr1s[i1];
  • AUDV0 = ptr1a[i1] + ptr2d[i2]; AUDV1 = ptr3c[i3]+ptr4b[i4]; i2=ptr2s[i2];
  • AUDV0 = ptr1b[i1] + ptr2a[i2]; AUDV1 = ptr3d[i3]+ptr4c[i4]; i3=ptr3s[i3];
  • AUDV0 = ptr1c[i1] + ptr2b[i2]; AUDV1 = ptr3a[i3]+ptr4d[i4]; i4=ptr4s[i4];

Things are rearranged slightly to optimize the machine code, but the above arrangement is easier to understand.
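The rotation of the a/b/c/d pointers can be checked with a small Python model. This is only a sketch under simplifying assumptions: all four voices share one hypothetical 32-byte table, the middle-octave offsets are used, and the sample value stored at each position is the position itself, so the output shows which entry each voice reads.

```python
LENGTH, PAD = 32, 16    # illustrative table size (the shortest wave described above)

# Use the table position itself as the sample value so the output shows
# which entry each voice is reading.
wave = [i % LENGTH for i in range(LENGTH + PAD)]
step = [i % LENGTH for i in range(LENGTH + PAD)]

def play(lines):
    """Return, per scan line, the wave position read by each of the four voices."""
    idx = [0, 0, 0, 0]                      # i1..i4
    rows = []
    for line in range(lines):
        phase = line % 4                    # which of the four code pieces runs
        row = []
        for voice in range(4):
            # 0 = 'a' (first use after an update) ... 3 = 'd' (last use before one)
            offset = (3 - voice + phase) % 4
            row.append(wave[offset + idx[voice]])
        rows.append(row)
        # Only one index is advanced per line: ptrNs = ptrNd + 1, i.e. offset 4.
        idx[phase] = step[4 + idx[phase]]
    return rows

for row in play(8):
    print(row)
```

In this model every column advances by exactly one position per scan line, even though each index variable is only rewritten on every fourth line. On real hardware AUDV0 would receive the sum of the first two columns' samples and AUDV1 the sum of the last two.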

 

Edit: Code fixed above

 

There are five pointers associated with each voice. For voice 1, playing in the middle C octave, they are set as follows:

  • ptr1a points to the start of the wave table for that voice's pitch/amplitude
  • ptr1b = ptr1a + 1
  • ptr1c = ptr1b + 1
  • ptr1d = ptr1c + 1
  • ptr1s = ptr1d + 1 + (start of step table - start of wave table)

Thus, on successive scan lines, successive bytes of the wave table will be output, wrapping cleanly. To use higher octaves, replace the "+1" above with "+2" or "+4" (or, for C5, "+8"). To use the first lower octave, use

  • ptr1a points to the start of the wave table for that voice's pitch/amplitude
  • ptr1b = ptr1a
  • ptr1c = ptr1b + 1
  • ptr1d = ptr1c
  • ptr1s = ptr1d + 1 + (start of step table - start of wave table)

and for the bottom octave

  • ptr1a points to the start of the wave table for that voice's pitch/amplitude
  • ptr1b = ptr1a
  • ptr1c = ptr1b
  • ptr1d = ptr1c
  • ptr1s = ptr1d + 1 + (start of step table - start of wave table)
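The three pointer layouts above can be summarized in one helper. This is a sketch only: the octave numbering (0 for the middle C octave, negative for lower octaves) and the function name are hypothetical, and the returned values are offsets from the start of the wave table (a, b, c, d) and from the start of the step table (s).

```python
def pointer_offsets(octave):
    """Offsets for ptr a, b, c, d (wave table) and s (step table).

    octave 0 is the middle C octave; 1, 2, 3 use strides of 2, 4, 8
    (the post uses a stride of 8 only for the highest C); -1 and -2 are
    the two lower octaves.
    """
    if 0 <= octave <= 3:
        stride = 1 << octave
        a, b, c, d = 0, stride, 2 * stride, 3 * stride
        s = d + stride
    elif octave == -1:
        a, b, c, d = 0, 0, 1, 1      # each sample is held for two scan lines
        s = d + 1
    elif octave == -2:
        a, b, c, d = 0, 0, 0, 0      # each sample is held for four scan lines
        s = d + 1
    else:
        raise ValueError("octave out of range")
    return a, b, c, d, s

for octave in (-2, -1, 0, 1, 2, 3):
    print(octave, pointer_offsets(octave))
```

In every case s equals the number of table entries consumed over four scan lines, which is why the step-table lookup lands on the right next index.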

Link to comment
  • AUDV0 = ptr1d[i1] + ptr2c[i2]; AUDV1 = ptr3b[i3]+ptr4b[i4]; i1=ptr1s[i1];
  • AUDV0 = ptr1a[i1] + ptr2d[i2]; AUDV1 = ptr3c[i3]+ptr4c[i4]; i2=ptr1s[i2];
  • AUDV0 = ptr1b[i1] + ptr2a[i2]; AUDV1 = ptr3d[i3]+ptr4d[i4]; i3=ptr1s[i3];
  • AUDV0 = ptr1c[i1] + ptr2b[i2]; AUDV1 = ptr3a[i3]+ptr4a[i4]; i4=ptr1s[i4];

Hm, why do you mix the values that way? Why not just a, b, c, d? And why the same ptrs for both voices in V1, but not in V0? :)

Link to comment
Uh, maybe because I oopsed? (now fixed above)

 

I only update one index per scan line, and the pointers are named so that 'a' is always the first one used after the index is updated and 'd' is the last one used before it's updated. If I wanted to make a game run 262 scan lines instead of 264, I would swap the roles of i2 and i3; this would allow code on a couple of tricky scan lines to 'criss cross' voices 1/3 and 2/4 but using 264 scan lines should be fine.

Link to comment

Ok, that answers my last question. :)

 

I understand the meaning of the pointers, but why not just a, b, c, d for all 4 voices? :)

Link to comment
If the code were written:

AUDV0 = ptr1a[i1] + ptr2a[i2]; AUDV1 = ptr3a[i3]+ptr4a[i4];
AUDV0 = ptr1b[i1] + ptr2b[i2]; AUDV1 = ptr3b[i3]+ptr4b[i4];
AUDV0 = ptr1c[i1] + ptr2c[i2]; AUDV1 = ptr3c[i3]+ptr4c[i4];
AUDV0 = ptr1d[i1] + ptr2d[i2]; AUDV1 = ptr3d[i3]+ptr4d[i4];
i4=ptr4s[i4];
i1=ptr1s[i1];
i2=ptr2s[i2]; 
i3=ptr3s[i3];

Things would read nicely and straightforwardly. Unfortunately, three scan lines out of every four would take 38 cycles each while the fourth would take 71 (note nine cycles are wasted with register loads--doing 'i4' first saves three). For some game kernels this might be nicer than having the music take 46 cycles every scan line but I thought it would be more useful to have balanced CPU loading, so I distributed the pointer updates.

 

To understand how the a/b/c/d work, let me run an example: Generate a top-octave C.

  • ptr1a points to AMPL00 (00 is the pitch; 01 is C#, 02 is D, etc.)
  • ptr1b points to AMPL00+8
  • ptr1c points to AMPL00+16
  • ptr1d points to AMPL00+24
  • ptr1s points to NEXT00+32

If i1 starts at 0, start by outputting items 0, 8, 16, and 24 of AMPL00. Then set i1=NEXT00[32] which is 32, and output items 32, 40, 48, and 56. Since the wave is 60 bytes long, NEXT00[64] is 4. Thus set i1=4 and output items 4, 12, 20, and 28.

 

It's worth noting that after a few loops, i1 will equal 56. This will thus output data from bytes 56, 64, 72, and 80 of AMPL00 and then load i1=NEXT00[88]. The table is padded out with 32 extra bytes, though, so NEXT00[88]=28 and there's no problem (the next loop will set i1=NEXT00[60] which equals zero, so the whole process will repeat).

 

If I used something other than ptr1a after updating i1, then the data would get output in the wrong sequence. The pointers must run in the order abcd following an update, which means the order prior to the update must be 'rotated'.
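The top-octave C walk-through above can be traced with a few lines of Python. This is a sketch that models only the index arithmetic; NEXT00 is rebuilt here as (index mod 60) padded with 32 extra entries, following the description of the step tables, and the function name is made up for the example.

```python
LENGTH, PAD = 60, 32
NEXT00 = [i % LENGTH for i in range(LENGTH + PAD)]

def trace(updates):
    """Return (i1, AMPL00 positions read via ptr1a..ptr1d) for each group of four lines."""
    i1 = 0
    history = []
    for _ in range(updates):
        history.append((i1, [i1 + offset for offset in (0, 8, 16, 24)]))
        i1 = NEXT00[32 + i1]          # ptr1s points to NEXT00+32
    return history

for i1, positions in trace(16):
    print(i1, positions)
```

The index visits 0, 32, 4, 36, and so on, reaches 56 on the fourteenth group (reading positions 56, 64, 72, and 80, which is where the padding matters), then 28, and returns to 0 after fifteen groups.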

Link to comment
I just had a look at your (disassembled) code. It seems like you can update the index directly before using it and save 1 cycle.

 

How so? I see no way to get by with less than five (zp),y loads (four for amplitude and one for index), four zp loads (indices), and three zp stores (new index, AUDV0, and AUDV1). What instruction can I get rid of?

 

Note that there are two somewhat different sequences of operations, based upon whether the index to be updated is associated with an AUDV0 voice or an AUDV1 voice. In the former case, I update the index after use, with the left-over value in Y. In the latter case, I update the index in RAM before I use its value (left over in Y) for the audio lookup.
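The instruction count given above adds up as follows. This is just the arithmetic, assuming standard 6502 timings with no page crossing on the indexed loads, and 76 CPU cycles per scan line.

```python
# Standard 6502 timings, assuming no page crossing on the indexed loads.
LOAD_IND_Y = 5   # lda/adc (zp),y
LOAD_ZP = 3      # ldy zp
STORE_ZP = 3     # sta zp (the TIA audio registers sit in zero page)
SCANLINE = 76    # CPU cycles per scan line on the 2600

def music_cycles():
    """Five (zp),y loads, four zp loads, and three zp stores per scan line."""
    return 5 * LOAD_IND_Y + 4 * LOAD_ZP + 3 * STORE_ZP

print(music_cycles(), SCANLINE - music_cycles())   # 46 30
```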

Link to comment

You are updating the index like this:

  lda ($8a),y
  sta $82

Then later:

  ldy $82
  lda ($92),y

If you move the first part directly before the 2nd one, you can replace ldy $82 with tay.

Link to comment

I have a set of four macros that each take 46 cycles, do not trash X, require carry clear on entry, leave it clear on exit, and do not require A or Y to be held between invocations. All four macros write to AUDV0 on the 19th cycle and AUDV1 on the 46th.

 

Phase0 [address $82] is only used with AUDV0. Two consecutive writes to AUDV0 do not occur without an intervening write of AUDV1, and computing the value to be written there will require trashing both A and Y. Further, between the two writes of AUDV0 there will also be user code executed that will also likely trash A and Y.

 

If there were no other code running while the music plays, or if I didn't mind having different lines take different amounts of time to execute, it might be possible to save a cycle here and there via register optimization. I do not think such optimization is apt to be practical, however.

Link to comment

I fully understand. Still you can save that cycle in each macro, making them all 45 cycles.

 

I don't see how, unless I require that the value of A or Y be held between macro invocations. Such a requirement would add 5 cycles to the cost of most time-critical scan lines, and would thus seem to be a net loss.

 

Can you show me how you'd do it without having to hold over A or Y? It seems to me that each phase value is only loaded once, and one of them gets stored, so unless a phase value is held over I see no way to eliminate a load.

Link to comment

Since I don't have access to your code, I have to post Distella code:

.loop:
sta	WSYNC		   ; 3
;---------------------------------------
ldy	$84			 ; 3
lda	($a4),y		 ; 5
lda	($8a),y		 ; 5
sta	$82			 ; 3	
tay					; 2
adc	($aa),y		 ; 5
sta	AUDV0		   ; 3
ldy	$86			 ; 3
lda	($9e),y		 ; 5
ldy	$88			 ; 3
adc	($98),y		 ; 5
sta	AUDV1		   ; 3
sta	WSYNC		   ; 3
;---------------------------------------
ldy	$82			 ; 3
lda	($92),y		 ; 5
lda	($8c),y		 ; 5
sta	$84			 ; 3	
tay					; 2
adc	($ac),y		 ; 5
sta	AUDV0		   ; 3
ldy	$86			 ; 3
lda	($a6),y		 ; 5
ldy	$88			 ; 3
adc	($a0),y		 ; 5
sta	AUDV1		   ; 3
sta	WSYNC		   ; 3
;---------------------------------------
ldy	$82			 ; 3
lda	($9a),y		 ; 5
ldy	$84			 ; 3
adc	($94),y		 ; 5
sta	AUDV0		   ; 3
lda	($8e),y		 ; 5
sta	$86			 ; 3	
tay					; 2
lda	($ae),y		 ; 5
ldy	$88			 ; 3
adc	($a8),y		 ; 5
sta	AUDV1		   ; 3
sta	WSYNC		   ; 3
;---------------------------------------
ldy	$82			 ; 3
lda	($a2),y		 ; 5
ldy	$84			 ; 3
adc	($9c),y		 ; 5
sta	AUDV0		   ; 3
lda	($90),y		 ; 5
sta	$88			 ; 3
tay					; 2
lda	($b0),y		 ; 5
ldy	$86			 ; 3
adc	($96),y		 ; 5
sta	AUDV1		   ; 3

You need to adapt yours slightly, but as far as I can tell, that should be all.

Link to comment
You're loading Y with $84 (aka phase1) and then storing the value derived therefrom to $82 (phase0).

 

Here are the macros from my source code:

	mac	 PART0
	ldy	 phase1; 84
	lda	 (amp1c),y; a4
	ldy	 phase0; 82
	adc	 (amp0d),y; aa
	SA0
	lda	 (next0),y; 8a
	sta	 phase0; 82
	ldy	 phase2; 86
	lda	 (amp2b),y; 9e
	ldy	 phase3; 88
	adc	 (amp3a),y; 98
	SA1
	endm

	mac	 PART1
	ldy	 phase0; 82
	lda	 (amp0a),y; 92
	ldy	 phase1; 84
	adc	 (amp1d),y; ac
	SB0
	lda	 (next1),y; 8c
	sta	 phase1; 84
	ldy	 phase2; 86
	lda	 (amp2c),y; a6
	ldy	 phase3; 88
	adc	 (amp3b),y; a0
	SB1
	endm

	mac	 PART2
	ldy	 phase0; 82
	lda	 (amp0b),y; 9a
	ldy	 phase1; 84
	adc	 (amp1a),y; 94
	SA0
	ldy	 phase2; 86
	lda	 (next2),y; 8e
	sta	 phase2; 86
	lda	 (amp2d),y; ae
	ldy	 phase3; 88
	adc	 (amp3c),y; a8
	SA1
	endm

	mac	 PART3
	ldy	 phase0; 82
	lda	 (amp0c),y; a2
	ldy	 phase1; 84
	adc	 (amp1b),y; 9c
	SB0
	ldy	 phase3; 88
	lda	 (next3),y; 90
	sta	 phase3; 88
	lda	 (amp3d),y; b0
	ldy	 phase2; 86
	adc	 (amp2a),y; 96
	SB1
	endm

Whichever phase gets 'sta'ed in a step will be the same phase whose Y value was used in that step.
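The timing of the four macros can be tallied from their instruction sequences. The sketch below assumes that SA0, SB0, SA1, and SB1 (whose definitions are not shown) each expand to a single 3-cycle zero-page store to AUDV0 or AUDV1, that every lda/adc is a 5-cycle (zp),y access with no page crossing, and that every ldy/sta is a 3-cycle zero-page access.

```python
COST = {"ldy": 3, "lda": 5, "adc": 5, "sta": 3,
        "SA0": 3, "SB0": 3, "SA1": 3, "SB1": 3}   # SAx/SBx costs are assumed
AUDIO_STORES = ("SA0", "SB0", "SA1", "SB1")

# Mnemonics in the order they appear in the macros above.
PARTS = {
    "PART0": ["ldy", "lda", "ldy", "adc", "SA0",
              "lda", "sta", "ldy", "lda", "ldy", "adc", "SA1"],
    "PART1": ["ldy", "lda", "ldy", "adc", "SB0",
              "lda", "sta", "ldy", "lda", "ldy", "adc", "SB1"],
    "PART2": ["ldy", "lda", "ldy", "adc", "SA0",
              "ldy", "lda", "sta", "lda", "ldy", "adc", "SA1"],
    "PART3": ["ldy", "lda", "ldy", "adc", "SB0",
              "ldy", "lda", "sta", "lda", "ldy", "adc", "SB1"],
}

def write_cycles(ops):
    """Return the elapsed cycle count at the end of each audio store."""
    elapsed = 0
    writes = []
    for op in ops:
        elapsed += COST[op]
        if op in AUDIO_STORES:
            writes.append(elapsed)
    return writes

for name, ops in PARTS.items():
    print(name, write_cycles(ops))
```

Under those assumptions every macro finishes its first audio store on cycle 19 and its second on cycle 46, regardless of whether the index update happens before or after the second channel's first lookup.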

Link to comment

I don't get it. Where's my error?

 

Well, your code is performing (to start with):

 ldy phase1
 lda (amp1c),y
 lda (next0),y
 tay

I'm not quite clear on what you're intending there, but I don't see how two consecutive LDAs are going to do anything useful.

 

Your part2 and part3 don't have that particular problem, but part2 sets phase2=next2[phase1] instead of phase2=next2[phase2] and part3 sets phase3=next3[phase1] instead of phase3=next3[phase3].

Link to comment

I (finally :)) got the code now.

 

There still might be one cycle to save, but then you would have to carry Y from one line to the next, making the code in between more limited. Plus you would have to reverse the order of setting the two channels, which would probably hamper the sound. Not worth it.

Link to comment
That was my thinking. If I were only using one AUDVx channel (say, for 3 voices) and carrying Y over from one scan line to the next were possible, it would indeed be possible to save a cycle. In practice, though, on most scan lines where I want to do a lot of stuff I'm going to want to use the Y register.

 

If the code and data tables are in RAM, some other interesting techniques become possible. For example:

patch1:
 ldy ABS
 sty patch1+1
 lda wave1,y
patch2:
 ldy ABS
 sty patch2+1
 adc wave2,y

This requires a flat 11-12 cycles per voice (depending upon where the code is), without having to use multi-line unrolling techniques. If one adds such techniques, things become more interesting:

;line A
patch1a:
 ldy ABS
 sty patch1a+1
 sty patch1b+1
 lda wave1,y
patch2a:
 adc ABS
;
;line B
patch1b:
 lda ABS
patch2b:
 ldy ABS
 sty patch2a+1
 sty patch2b+1
 adc wave2,y

That's 9-10 cycles per voice with two-line unrolling; the approach does not benefit from going beyond that, and puts more annoying restrictions on data tables than my present approach.
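The cycle figures for the two self-modifying variants can be reproduced with a little arithmetic. The sketch assumes standard 6502 timings: 4 cycles for an absolute or absolute,Y load with no page crossing, and 3 or 4 cycles for the patching store depending on whether the patched code sits in zero page.

```python
def flat_cycles(code_in_zero_page):
    """ldy ABS ; sty patch+1 ; lda/adc wave,y for one voice on one line."""
    sty = 3 if code_in_zero_page else 4
    return 4 + sty + 4

def unrolled_cycles(code_in_zero_page):
    """Average cycles per voice per line with two-line unrolling."""
    sty = 3 if code_in_zero_page else 4
    update_line = 4 + 2 * sty + 4    # ldy ABS ; sty ; sty ; lda/adc wave,y
    reuse_line = 4                   # lda/adc ABS with an already patched operand
    return (update_line + reuse_line) / 2

print(flat_cycles(True), flat_cycles(False))          # 11 12
print(unrolled_cycles(True), unrolled_cycles(False))  # 9.0 10.0
```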

Link to comment

If I were generating wave samples in bulk (queueing them up for later replay) and could handle each pair of voices separately, then I think things would work out something like:

 ldy phase0
lp:
 lda (next0),y
 sta phase0
 lda (amp0d),y
 ldy phase1
 adc (amp1c),y
 sta SAMPLE1
 lda (next1),y
 sta phase1
 lda (amp1d),y
 ldy phase0
 adc (amp0a),y
 sta SAMPLE2
 lda (amp0b),y
 ldy phase1
 adc (amp1a),y
 sta SAMPLE3
 lda (amp1b),y
 ldy phase0
 adc (amp0c),y
 sta SAMPLE4
; loop back to lp

Doing four samples for a pair of voices would require 80 cycles, or an average of ten cycles per voice per line, including all other overhead. Actually, this would be an intriguing approach, especially with extra-RAM carts. One would have to spend six or twelve cycles per scan line updating AUDVx, but the efficiency of the approach above would allow the CPU to process four scan lines' worth of audio every three scan lines, thus allowing some audio to be buffered up for more "difficult" parts of the screen.
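A quick budget check for the buffering idea, as a sketch: it assumes 76 CPU cycles per scan line, zero-page targets for all stores, no page crossings, and it leaves out the branch back to lp and any buffer bookkeeping.

```python
SCANLINE = 76   # CPU cycles per scan line on the 2600

def pair_loop_cycles():
    """Cycles for one pass of the loop above (four samples for one voice pair)."""
    indirect_loads = 10   # lda/adc (zp),y at 5 cycles each
    index_loads = 4       # ldy zp at 3 cycles each
    stores = 6            # sta at 3 cycles each, assuming zero-page targets
    return indirect_loads * 5 + index_loads * 3 + stores * 3

def spare_cycles(audv_update):
    """Cycles left when four lines of audio are generated during three scan lines."""
    generate = 2 * pair_loop_cycles()   # two voice pairs
    replay = 3 * audv_update            # 6 or 12 cycles per scan line, per the post
    return 3 * SCANLINE - generate - replay

print(pair_loop_cycles(), spare_cycles(6), spare_cycles(12))   # 80 50 32
```

So under these assumptions, three scan lines offer enough time to generate four lines of samples and still replay one sample per line, with a few dozen cycles to spare.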

Link to comment