During a discussion with @SvOlli about how to optimize calculations required for a low-res, playfield based plasma effect within a 512 byte demo for the 2600, I started coding myself to verify my own ideas. This eventually led to the current code.
Usually the plasma effect is created by combining sine waves. Since one of the goals was to use only minimal ROM space, I started by using precalculated, small sine tables. This worked OK, but still needed some ROM space and also a lot of checks in the code when wrapping around the table index. This also affected my second goal negatively: display as many scanlines with the highest vertical resolution possible. While looking for improvements, I found this website, which uses easy to calculate parabolas for generating sine tables. A stock 2600 cannot make use of this, due to the limited RAM. But Svolli wanted to use CommaVid bankswitching, which allows up to 2K of RAM. Great! After a little optimizing, I came up with the following code. Since it is placed right after the initial clear loop, A and X are 0 already, so that saves a few bytes too.
    ldy #$3f                ; A = X = 0!
; Accumulate the delta (normal 16-bit addition):
.loopSine
; Reflect the value around for a sine wave:
;    clc                    ; this makes no difference
    pha                     ; = .delta
    adc .value
    sta .value
    lda .delta+1
    adc .value+1
    sta .value+1
    sta SinLstW + $c0,x
    sta SinLstW + $80,y
    eor #$7f
    sta SinLstW + $40,x
    sta SinLstW + $00,y
; Increase the delta, which creates the "acceleration" for a parabola:
    pla                     ; = .delta
    adc #$08                ; this value adds up to the proper amplitude
    bcc .skipHi
    inc .delta+1
.skipHi
; Loop:
    inx
    dey
    bpl .loopSine
The result is a 256-byte sine table with values ranging from 0 to 127. So now we have a large table whose 8-bit index automatically wraps around. Nice.
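To make the parabola trick easier to follow, here is a minimal Python sketch of what the loop above computes. It is a model, not the original code: the function name is made up, and it assumes the carry is clear at every ADC (the listing comments out CLC because the stray carry only touches the low byte).

```python
def build_sine_table():
    """Model of the table-building loop, assuming carry is clear at each ADC."""
    table = [0] * 256
    value = 0   # 16-bit accumulator, models .value
    delta = 0   # 16-bit step, models .delta (its low byte lives in A)
    x, y = 0, 0x3F
    for _ in range(64):
        value = (value + delta) & 0xFFFF   # running sum of a ramp = parabola
        hi = value >> 8                    # only the high byte is stored
        table[0xC0 + x] = hi               # rising out of the trough
        table[0x80 + y] = hi               # falling into the trough (mirrored)
        table[0x40 + x] = hi ^ 0x7F        # falling from the crest
        table[0x00 + y] = hi ^ 0x7F        # rising to the crest (mirrored)
        delta += 8                         # constant "acceleration"
        x += 1
        y -= 1
    return table


if __name__ == "__main__":
    t = build_sine_table()
    print(min(t), max(t))                              # 0 127
    print([t[i] for i in (0x00, 0x40, 0x80, 0xC0)])    # [64, 127, 63, 0]
```

One property of this model is that `t[(i + 128) & 0xFF] == 127 - t[i]` for every index, so a half-period shift is the same as inverting the value.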
For creating the plasma effect, several sines with different offsets and frequencies have to be combined. Svolli's idea was to aggregate two sines per axis and then aggregate the results per playfield pixel. If that result overflows (carry set), the pixel is set. Even for a mirrored playfield, the number of calculations required for the final aggregates exceeds the available CPU time by far. So the plan was (and still is) to do that on-the-fly during kernel display. Here is an excerpt of the original code:
LoopKernel
    lda YSinLst,y           ; from CV RAM, e.g. 50 aggregated sine values
    tay
    adc xSinLst+19          ; from ZP-RAM, 20 aggregated sine values
    ror .tmpPF0
    tya
    adc xSinLst+18
    ror .tmpPF0
    tya
    adc xSinLst+17
    ror .tmpPF0
    ...                     ; and so on for 20 pixel and 3 PF registers
    lda .tmpPF0
    sta PF0
This fully unrolled code would need 10 cycles per pixel for the sine list aggregation. So that's 200 cycles already, and with some overhead (e.g. colors), it would barely fit into 3 scanlines, and most likely need 4 scanlines. That's when Svolli contacted me, asking if I have any ideas how to optimize this.
Since I had never coded the effect, initially I barely understood all the details. But somewhere in the back of my mind I had the idea that there must be a simpler solution. I first thought about using deltas in xSinLst to avoid the TYA, and this would have saved 2 cycles per calculation. But then I came up with something completely different. Instead of adding the two sine lists, due to their symmetric nature, subtracting them might work as well. And since we only need the carry flag, we could use CMP instead of SBC. Which means we could use A for aggregating the carries. Here is the new, faster code:
    ldy #KERNEL_H-1
LoopKernel
    ldx YSinLst,y
    cpx xSinLst+19
    ror
    cpx xSinLst+18
    ror
    cpx xSinLst+17
    ror
    ...
    sta PF0
Now each pixel requires only 5 cycles, 50% saved! This makes the code fit into just two scanlines. But does it really work? Svolli was not convinced, so I started coding myself to test the idea. The initial results were OK, but not exactly what I was expecting. That was due to my lack of understanding of how to prepare the sine lists, and Svolli was kind enough to give me some detailed explanations. Later it turned out that the new kernel code works almost exactly like the original code, just with everything shifted by 180°. That doesn't matter for the plasma effect at all.
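One way to read the 180° remark is sketched below in Python. The helper names are hypothetical, and the model makes two assumptions that are not stated in the post: the carry going into each ADC of the original kernel is clear, and the sine table satisfies s(t + 128) = 127 - s(t), so shifting both X sines by half a period turns their sum x into 254 - x.

```python
def adc_carry(a, b, carry_in=0):
    """Carry flag after ADC: set when the 8-bit sum overflows."""
    return int(a + b + carry_in > 0xFF)


def cmp_carry(reg, operand):
    """Carry flag after CMP/CPX: set when reg >= operand (unsigned)."""
    return int(reg >= operand)


def shift_180(x_sum):
    """Sum of two 0..127 sines after shifting each by half a period."""
    return 254 - x_sum


def count_mismatches():
    """Compare both kernels for every pair of aggregated values (0..254)."""
    mismatches = 0
    for y in range(255):
        for x in range(255):
            old = adc_carry(y, x)               # ADC kernel, carry-in assumed 0
            new = cmp_carry(y, shift_180(x))    # CPX kernel on the shifted X list
            if old != new:
                assert y + x in (254, 255)      # only borderline sums differ
                mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(count_mismatches(), "of", 255 * 255, "pairs differ")
```

Under these assumptions the two kernels disagree only when the sum lands exactly on 254 or 255, which matches the "almost exactly" above.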
The missing piece was the calculation of the sine lists for X and Y axis. Since we combine two sine tables per axis with varying offsets, we first have to change their offset each frame. To make movement smooth, 16 bit math is used here. So that's four 16 bit additions. Again we can make use of the 256 byte table size to ignore any overflow checks. And then I had the idea, that we could do eight 8 bit additions instead. Which makes the loop a bit smaller and saves some bytes. I only had to rearrange the variables a bit.
offsetLst       ds NUM_SPEEDS*2
xOffsetAHi      = offsetLst
;xOffsetALo     = offsetLst+1
xOffsetBHi      = offsetLst+2
;xOffsetBLo     = offsetLst+3
yOffsetAHi      = offsetLst+4
;yOffsetALo     = offsetLst+5
yOffsetBHi      = offsetLst+6
;yOffsetBLo     = offsetLst+7
...
    ldx #8-1
.loopOffsets
    lda offsetLst,x
    adc SpeedTbl,x
    sta offsetLst,x
    dex
    bpl .loopOffsets
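The following Python sketch models that loop to show why the Hi/Lo ordering matters. It assumes SpeedTbl uses the same Hi/Lo layout as offsetLst (the post does not show the table), and the function name is made up. Because X counts down, each Lo byte is added first and its carry flows into the Hi byte right after it. The price of never clearing the carry is that an overflow of a Hi byte leaks into the Lo byte of the next pair.

```python
def update_offsets(offsets, speeds, carry=0):
    """Model of .loopOffsets: index runs from 7 down to 0, carry is never cleared."""
    out = list(offsets)
    for i in range(len(out) - 1, -1, -1):
        total = out[i] + speeds[i] + carry   # lda offsetLst,x / adc SpeedTbl,x
        out[i] = total & 0xFF                # sta offsetLst,x
        carry = total >> 8                   # carry survives dex / bpl
    return out


if __name__ == "__main__":
    # yOffsetB = $12F0, speed = $0120: behaves like one 16-bit addition
    print(update_offsets([0, 0, 0, 0, 0, 0, 0x12, 0xF0],
                         [0, 0, 0, 0, 0, 0, 0x01, 0x20]))
    # yOffsetB = $FFFF + $0001 wraps, and the carry leaks into yOffsetALo
    print(update_offsets([0, 0, 0, 0, 0, 0, 0xFF, 0xFF],
                         [0, 0, 0, 0, 0, 0, 0x00, 0x01]))
```

The leaked carry changes a Lo byte by at most 1, so it only nudges the fractional part of the neighbouring offset.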
What's left now, are the final calculations of the two sine lists from two sine tables each. This is pretty time consuming, as we have to do 20 calculations for the X-axis and about 100 calculations for the Y-axis.
Since the X-axis goes into ZP-RAM and I need the X and Y registers for the offsets, I make heavy use of the stack pointer here. For that I put the list at the beginning of the ZP-RAM, so that I can easily check the N-flag in the loop branch.
; setup X-list:
; A = xOffsetAHi from previous code
    ldx #xSinLst+PF_BITS-1  ; 2
    txs                     ; 2         SP also used as loop counter
    ldy xOffsetBHi          ; 3 =  7
LoopCopyX
    tax                     ; 2
;    clc                    ; 2
    lda SinLst,x            ; 4
    adc SinLst,y            ; 4
    pha                     ; 3
    tya                     ; 2 = 15
    adc #13                 ; 2
    tay                     ; 2
    txa                     ; 2
;    clc                    ; 2
    adc #-11                ; 2 =  8
    tsx                     ; 2
    bmi LoopCopyX           ; 3/2= 5/4
That's 28 cycles per loop. Nice!
Adding 13/-11 to the offsets for each column simulates using sine tables of higher frequencies. So we can use our single 256-byte sine table here too. The values used for adding are arbitrarily chosen, they just have to look nice. The code skips clearing the carry flag, because I found that the differences are hardly noticeable, only if you look very closely. Since my goal is to minimize the ROM space, this is an acceptable compromise, IMO.
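To get a feeling for how little the missing CLCs matter, here is a Python sketch that builds the X list twice: once as if every ADC were preceded by CLC, and once with the carry flowing from one ADC into the next as in the listing. All function names are made up, the sine table is a compact model of the parabola table, and the initial carry is assumed to be clear.

```python
def sine_table():
    """Compact model of the parabola-based 0..127 table (carry effects ignored)."""
    t = [0] * 256
    for i in range(64):
        hi = (4 * i * (i + 1)) >> 8
        t[0xC0 + i] = t[0xBF - i] = hi
        t[0x40 + i] = t[0x3F - i] = hi ^ 0x7F
    return t


def x_list_ideal(sin, a, b, columns=20):
    """What the loop would produce with a CLC before every ADC."""
    out = [0] * columns
    for k in range(columns):
        # PHA fills the list from its last entry downwards
        out[columns - 1 - k] = sin[(a - 11 * k) & 0xFF] + sin[(b + 13 * k) & 0xFF]
    return out


def x_list_6502(sin, a, b, columns=20, carry=0):
    """The loop as written: the carry is passed from one ADC to the next."""
    out = [0] * columns
    for k in range(columns):
        total = sin[a] + sin[b] + carry        # lda SinLst,x / adc SinLst,y
        out[columns - 1 - k] = total & 0xFF    # pha
        carry = total >> 8                     # 0 here, since 127 + 127 + 1 < 256
        total = b + 13 + carry                 # tya / adc #13 / tay
        b, carry = total & 0xFF, total >> 8
        total = a + 0xF5 + carry               # txa / adc #-11
        a, carry = total & 0xFF, total >> 8
    return out


if __name__ == "__main__":
    sin = sine_table()
    ideal = x_list_ideal(sin, 0x10, 0x80)
    real = x_list_6502(sin, 0x10, 0x80)
    print(max(abs(p - q) for p, q in zip(ideal, real)))
```

In this model the two lists differ only by a few units per column, on values that range up to 254.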
Now to the Y-axis. Here we have to do the same calculation for about 100 values, so this is very time consuming. Especially since I cannot use the stack pointer, so I have to use variables to keep track. Initially I planned to put the previous code into Overscan and the Y-axis calculations into VBlank. But then I would have wasted some remaining CPU time in Overscan, and I wanted to display as many scanlines of plasma as possible. So I had to split the calculation between Overscan and VBlank. Doubling the code was out of the question, and a subroutine would have made the code more complex and slower. Then I had the idea that I could check the timer during the loop, do the VSync when it is due and continue with the loop. The timer check would have cost me extra cycles, but the extra CPU time gained from Overscan more than made up for that. Still the timer check was bugging me, since reading INTIM takes 4 cycles. Eventually I realized that all my code execution timings are constant (or can be made constant), so I don't need the timers at all!
; setup Y-list:
.tmpX   = tmpVars
    lda yOffsetAHi          ; 3
    sta .tmpX               ; 3
    ldy yOffsetBHi          ; 3
    ldx #KERNEL_H-1         ; 2 =  8
LoopCopyY
    txs                     ; 2 =  2
    dec .tmpX               ; 5
    ldx .tmpX               ; 3
    tya                     ; 2
    adc #5                  ; 2
    tay                     ; 2 = 14
;    clc                    ; 2
    lda SinLst,x            ; 4
    adc SinLst,y            ; 4 =  8
    tsx                     ; 2
    sta YSinLstW,x          ; 5 =  7
; Instead of splitting the loop, do the vertical sync in the middle of the loop.
; This maximizes the available CPU time for the loop and minimizes the code.
    cpx #OVERSCAN_X         ; 2
    bne .skipVSync          ; 3/2= 5/4
    lda #%1110              ; each '1' bits generate a VSYNC ON line (bits 1..3)
.loopVSync
    sta WSYNC               ; 1st '0' bit resets Vsync, 2nd '0' bit exits loop
    sta VSYNC
    lsr
    bne .loopVSync          ; branch until VSYNC has been reset
.skipVSync
    dex                     ; 2
    bpl LoopCopyY           ; 3/2= 5/4
41 cycles per loop, ~3977 cycles (~52.3 scanlines at 76 cycles each) in total. That would have never fit into Overscan only, even if we deduct the 5 extra cycles for the VSync check.
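The small VSYNC loop inside the Y-list code packs its whole schedule into the constant %1110. The Python sketch below (a model with a made-up function name) lists the values the loop stores to VSYNC, one per WSYNC, relying on the fact that the TIA only looks at bit 1 of that register.

```python
def vsync_writes(a=0b1110):
    """Values stored to VSYNC by .loopVSync, one scanline (WSYNC) per write."""
    writes = []
    while True:
        writes.append(a)   # sta WSYNC / sta VSYNC
        a >>= 1            # lsr
        if a == 0:         # bne .loopVSync falls through
            break
    return writes


if __name__ == "__main__":
    for value in vsync_writes():
        state = "on" if value & 0b10 else "off"
        print(f"{value:04b} -> VSYNC {state}")
```

So the loop runs four times: three lines with VSYNC on, then one write that switches it off, after which A becomes zero and the Y-list loop simply continues.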
Finally I just have to waste the few remaining cycles (~150):
    ldx #VBLANK_X           ; waste remaining time
.waitTim
    dex
    bne .waitTim
    sta WSYNC
For now, everything was just black and white, which looked pretty dull, even with animation:
But the kernel code had ~25 cycles free within its two scanlines. By rearranging the code a bit, I was able to merge the free cycles into one block. And after a bit of experimenting, I came up with some code which mixes the values of PF1 and X. Not exactly what plasma usually looks like, but still nice looking. Maybe one could use a precalculated color table here, but what do I know.
; A = PF1, X = YSinLst,y
    and #$60                ; 2         ; 0, 2, 4, 6
    adc colorOr             ; 3         ; +2, 3, 4, 5, 6
    sta .tmpCol             ; 3         ; 2..12
    txa                     ; 2
    lsr                     ; 2
    lsr                     ; 2
    lsr                     ; 2
    lsr                     ; 2
    ora .tmpCol             ; 3
    sta.w COLUPF            ; 4 = 25    @17
To make the colors move lively, I change colorOr every 256 frames:
    inc frameCnt            ; 5         ; update color
    bne .skipColor          ; 3/2= 8/7
    lda colorOr             ; 3
    cmp #$50                ; 2
    bcc .ok                 ; 2/3
    sbc #$50+1              ; 2
.ok
    adc #$20                ; 2
    sta colorOr             ; 3         $28..$68
.skipColor
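The carry handling in this update is easy to misread, so here is a small Python model of it. The function name is made up, and the starting value $28 is an assumption taken from the "$28..$68" comment, since the post does not show how colorOr is initialized.

```python
def next_color_or(a):
    """Model of the colorOr update, including the carry passed between CMP, SBC and ADC."""
    carry = int(a >= 0x50)       # cmp #$50
    if carry:                    # bcc .ok not taken
        diff = a - 0x51          # sbc #$50+1 with carry set: no extra borrow
        carry = int(diff >= 0)   # carry stays set unless the subtraction borrows
        a = diff & 0xFF
    return (a + 0x20 + carry) & 0xFF   # adc #$20


if __name__ == "__main__":
    value = 0x28                 # assumed initial value
    for _ in range(5):
        value = next_color_or(value)
        print(f"${value:02X}")
```

Starting from $28, the model steps through $48, $68, $38, $58 and back to $28, so the hue nibble visits 2 to 6 in a shuffled order while the low nibble stays at 8.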
That looks much more interesting:
One last little trick is that I use BRK to jump back to the beginning of the main loop.
Now, what's the final result? The minimal code (no color) needs exactly 247 bytes, with colors added ~30 bytes more. And in the end I managed to display 97 double scanlines (NTSC, 117 for PAL). So a nice demo with sound and some bells and whistles within 512 bytes seems very doable.
I also experimented with a 32 pixel wide, non-mirrored playfield, but that takes at least 385 bytes and halves the vertical resolution:
Anyway, I am no demo coder, so I stopped here. But I learned some new tricks, which I can maybe use in the future.
I have attached some ROMs and source code files. PlasmaCompact.asm shows the cleaned code, with all options and further distracting stuff removed. Then I have the same code, prepared for PAL and NTSC, and with some assembler options. I also added my code for the non-mirrored playfield. And last but not least, a version with Svolli's nice full coloring.
The ROMs show the results with all options enabled, which means you can change the settings with the joystick:
- With left difficulty = B you can change the offset steps within the setup loops. Left and right for X offset steps, up and down for Y offset steps. Since there are two values each, you change the 2nd value by holding the fire button.
- With left difficulty = A you can change the speed of the initial offset changes. Again, left and right for X offset speeds, up and down for Y offset speeds. And since there are two values each here too, use the fire button for the 2nd one.
- By using RESET, you can reset all settings to their initial values.
The effects of the changes are hard to describe. But when you experiment with them, you will get an idea. Here is an example:
I hope that made sense or you have at least some fun playing with the plasma effect.
Thanks to @SvOlli for contacting me and helping me out quite a lot. Therefore I respectfully waited with the release of this entry until after the Nordlicht 2023 demo party (September 8. - 10.), where he successfully presented his demo.
Edited by Thomas Jentzsch