Jump to content
  • entries
    334
  • comments
    900
  • views
    258,393

Time, it is precious


Guest

1,560 views

Well, I mentally looked at the list to Leprechaun to-dos and decided that I needed to tackle the BLIT routine next. This is the code which copies the sprite graphics (running etc) from ROM to the strips of SuperCharger RAM that the kernel then uses. (Much like the 5200 & A8s.)And I'm glad I did, because I discovered (to my horror) that the routine barely fits in the time between the end of VSYNC and the start of the kernel. (Actually, I had to increase that time by a couple of lines even.) Each player & enemy typically needs over 600 CPU cycles to handle, with close to 500 cycles being the actual BLIT. (The rest being figuring out which player sprite gets used and setting up the BLIT.)I'm now a little concerned that I'm going to run short on time to squeeze the remaining features into the time between then end of the kernel and the start of VSYNC. (I'm particularly worried about adding a score display since that will automatically chew up a bunch of lines.)Anyway new features in the attached binary:- The colored blocks are now the right number of lines high. This really shows off the intelligent flicker.- player now changes color when dead

44 Comments


Recommended Comments



Hi there!

 

Hm... it seems I don't get you here. After you have fully determined which shape needs to be copied in which RAM area, you need like 600 cycles to copy it?!? I think your sprites are like 10 byte tall, that should be copied in less than 100 cycles, no?

 

And why do you copy it at all, can't you simply access the sprites via (),Y?

 

Greetings,

Manuel

Link to comment

Let's deal with the second question first. It comes down to LDA (zp),Y taking 5 cycles while LDA abs,Y only needing 4. Then most of my kernel collapses into a very simple routine:

LINESUB
STA	WSYNC	; do a single line
STA	COLUPF	; 0
LDA	FGBM0,Y	; 3
STA	GRP0	; 7
LDA	FGBM1+8,Y; 10
STA	GRP1	; 14
LDA	BGBM0,Y	; 17
STA	PF1	; 21
LDA	BGBM1,Y	; 24
STA	PF2	; 28
LDA	BGBM2,Y	; 31
STA	PF0	; 35
LDX	BGBM3,Y	; 38
STX	PF1	; 42
INY  ; 45
NOP  ; 47
AND	#$0F	; 49
STA	PF2	; 51
STA	PF0	; 54
RTS  ; 57-63

I'm using SuperCharger RAM abilities to treat the BGBM0-4 and FGBM0-4 as virtual bitmaps. Unfortunately, writes to SC RAM require a bunch of cycles. So my BLIT routine looks like:

BLIT0
LDY	YPOS+NUMSPR
3$
CPY	#8
BCS	1$
LDA	(SPR_PTR),Y
TXS
TAX
LDA	SCDATA,X; SuperCharger write
TSX
LDA	FGBM0,X; equivalent to STX FGBM0,X
BCC	2$
1$
LDA	SCDATA
NOP
LDA	FGBM0,X
2$
INY
INX
TXA
AND	#$0F
BNE	3$
BEQ	ZPUTSPR

Of course, any suggestions on how to reduce the number of cycles required would be appreciated. The full source is included in the attached ZIP.

Link to comment

Hi there!

 

Let's deal with the second question first.  It comes down to LDA (zp),Y taking 5 cycles while LDA abs,Y only needing 4.  Then most of my kernel collapses into a very simple routine

 

Hm... it seems like you save 6 cycles with that, but wouldn't getting rid of JSR and RTS alone save you 12?

 

I'm using SuperCharger RAM abilities to treat the BGBM0-4 and FGBM0-4 as virtual bitmaps.

 

Hm... wouldn't the sprites require only two slices of RAM (sizeof SCREENHEIGHT)? Since you can only display two per row? If there's more in vertical seperation, you could blit them into the same slice.

 

Unfortunately, writes to SC RAM require a bunch of cycles.  So my BLIT routine looks like:

BLIT0
LDY	YPOS+NUMSPR
3$
CPY	#8
BCS	1$
LDA	(SPR_PTR),Y
TXS
TAX
LDA	SCDATA,X; SuperCharger write
TSX
LDA	FGBM0,X; equivalent to STX FGBM0,X
BCC	2$
1$
LDA	SCDATA
NOP
LDA	FGBM0,X
2$
INY
INX
TXA
AND	#$0F
BNE	3$
BEQ	ZPUTSPR

 

Admittedly I don't understand it. What is the purpose of the major branch in the code used for? You either write your sprite data to the SC RAM, but when Y>=8 then it does what? Why?

 

Of course, it seems like it would be a lot faster, if you'd split this multipurpose routine into two seperate ones. If I get that right, it checks with each iteration of the loop wether it is doing a sprite or not? :)

 

If you'd have two seperate routines, it seems you'd also save a lot of register swaps and saves/restores. Or, maybe even better than two seperate routines, have one routine doing both tasks entwined, two loops, one after the other!

 

Of course, any suggestions on how to reduce the number of cycles required would be appreciated.  The full source is included in the attached ZIP.

 

Hm... I prefer discussing smaller/isolated parts, since I usually find it a hard task to deal with a complete source written by someone else :)

 

Greetings,

Manuel

Link to comment

The kernel is more than that single subroutine, which is used for 13 out of every 18 line sequence. Of the remaining 5 lines, two are used to reposition one of the player sprites (alternating) and the other 3 are fairly tight for cycles. So although I could save 12 cycles removing the JSR/RTS, I need the 6 cycles saved elsewhere in the kernel.

 

Hm... wouldn't the sprites require only two slices of RAM (sizeof SCREENHEIGHT)? Since you can only display two per row? If there's more in vertical seperation, you could blit them into the same slice.

Exactly. FGBM0/1 are each divided into 16 line zones and may contain multiple sprites if the sprites are vertically separated. In this way I have 5 sprites which only flicker if more than two are on the same line.

Admittedly I don't understand it. What is the purpose of the major branch in the code used for? You either write your sprite data to the SC RAM, but when Y>=8 then it does what? Why? . . . If I get that right, it checks with each iteration of the loop wether it is doing a sprite or not?

Sorry, forgot to comment that. The 1$ branch writes a zero to FGBM0 (there's actually a separate BLIT1 routine for handling FGBM1). So just like a normal SkipDraw routine, it writes either 0 or the sprite data to FGBM0. I have to do this because I have no way of knowing what part of that 16 line zone was used in the last frame (it could be top, bottom, or even in the middle). This routine also handles whether the sprite is at the top, bottom or middle of the 16 line zone.

Link to comment

Hi there!

 

 

Sorry, forgot to comment that.  The 1$ branch writes a zero to FGBM0 (there's actually a separate BLIT1 routine for handling FGBM1).  So just like a normal SkipDraw routine, it writes either 0 or the sprite data to FGBM0.  I have to do this because I have no way of knowing what part of that 16 line zone was used in the last frame (it could be top, bottom, or even in the middle).  This routine also handles whether the sprite is at the top, bottom or middle of the 16 line zone.

 

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)

 

Greetings,

Manuel

Link to comment
I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)

I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.

 

But the code above is also not looking very optimized. I'll have a closer look at it tonight.

Link to comment

Hi there!

 

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)

I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.

 

Well, right now it's writing 16 bytes, and my suggestion would indeed require 24, but I think 16 * the overhead in the big loop is much more than the additional ~80 cycles for the 8 superfluous writes :)

 

Greetings,

Manuel

Link to comment
I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)

I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.

 

But the code above is also not looking very optimized. I'll have a closer look at it tonight.

 

It might be possible to swap SC RAM with regular RAM for speed. Stuff that changes more gradually like scores or playfield graphics could be in SC RAM. Stuff that needs to be refreshed more aggressively could be in main RAM.

Link to comment

I think the loop has to be rewritten so that it's counting down instead of up for one thing.

 

And can't the indirect addressing within the interior loop be eliminated by precalculating the pointer before entering in?

 

It should also be possible to unroll the loop if you have enough rom(ram) space and your sprites are all the same height (or padded in ROM).

Link to comment

First, I'd like to say thanks for taking an interest. I really appreciate it and hope we can find some optimizations.

 

I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.

Correct, writes to SC RAM requires at least 9 cycles: 4 cycles read from $F0nn, then 5 cycles to read from the destination address for the actual write. (And the 5 cycle an SC requirement since the write occurs 5 address changes after the read from $F0nn.)

 

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)
Well, right now it's writing 16 bytes, and my suggestion would indeed require 24, but I think 16 * the overhead in the big loop is much more than the additional ~80 cycles for the 8 superfluous writes :)

You might be right. The outer loop takes more than 9 cycles per itteration. So unrolling the 16 writes to zero the zone would require less time. It would probably simplify the BLIT too. Of course, ROM space isn't infinite....

 

I don't understand how the 1$ branch works. Shouldn't X be 0?  :) And how is X initialized?

X is initialized to the top of the 16 line zone outside the loop. So it's 0,16,32.... depending upon where the sprite is onscreen vertically. Check out lepbank1.asm starting with PUTSPR.

 

It might be possible to swap SC RAM with regular RAM for speed.  Stuff that changes more gradually like scores or playfield graphics could be in SC RAM.  Stuff that needs to be refreshed more aggressively could be in main RAM.

And although I'd love to use regular RAM, I'd need 320 bytes or a major rewrite of the kernel. (If it's even possible to have an equivalent kernel without using SC RAM.)

 

I think the loop has to be rewritten so that it's counting down instead of up for one thing.

Yes, I know it's tradition (and typically more optimizable) in 2600 games to count backwards, but I find it very difficult to get things right when I use that method. Currently, counting backwards in this routine wouldn't gain me anything since neither index register is 0 at the end of the loop. But maybe if I separate the clear & BLIT operations as Manual suggests, then it will be possible.

 

And can't the indirect addressing within the interior loop be eliminated by precalculating the pointer before entering in?

My current plans are to store the sprites in "unused" areas of ROM space (e.g. after the virtual bitmaps) rather than in a single page, which means indirect addressing is about the only possibility. You should have seen the mess before I split the routine into separate ones for FGBM0 and FGBM1, with different Y offsets.

 

It should also be possible to unroll the loop if you have enough rom(ram) space and your sprites are all the same height (or padded in ROM).

All sprites are 8 lines high, but ROM space isn't infinite. I'm not sure I've got the space to pad out the sprites to 16 lines; especially if I have different sprites for the player and the enemies.

Link to comment

Some possible improvement suggestions:

1. split the loop

2. use LAX

3. when writing zeroes don't increase Y

4. when writing zeroes put INX instead of NOP between SC accesses (though according to my docs, 4 cycles are enough anyway)

5. instead of

  TXA
 AND #$0F

you could do

  CPX endX; precalculated

This is all pretty easy and should save you a little over 100 cycles.

 

If you need more cycles, you could try to use only one single counter (Y) and do something like

  LAX (SPR_PTR),Y 
 LDA SCDATA,X   
 LDA (SC_PTR),Y; setting up this pointer is a bit tricky
 INY          
 CPY endY 
 BNE 3$

Link to comment
Some possible improvement suggestions:

1. split the loop

2. use LAX

3. when writing zeroes don't increase Y

4. when writing zeroes put INX instead of NOP between SC accesses (though according to my docs, 4 cycles are enough anyway)

5. instead of

  TXA
 AND #$0F

you could do

  CPX endX; precalculated

This is all pretty easy and should save you a little over 100 cycles.

 

If you need more cycles, you could try to use only one single counter (Y) and do something like

  LAX (SPR_PTR),Y 
 LDA SCDATA,X   
 LDA (SC_PTR),Y; setting up this pointer is a bit tricky
 INY          
 CPY endY 
 BNE 3$

Tres cool. I hadn't considered LAX, but that does get around the lack of LDX (zp),Y. Setting up the pointers shouldn't be too bad since I should be able to set it up so Y counts down from 7 to 0 (which then saves some more cycles since the CPY can be skipped).

Link to comment
Tres cool.  I hadn't considered LAX, but that does get around the lack of LDX (zp),Y.  Setting up the pointers shouldn't be too bad since I should be able to set it up so Y counts down from 7 to 0 (which then saves some more cycles since the CPY can be skipped).

At the cost of some additional overhead outside the loop, yes.

 

But this was just peephole optimization. I wonder, if the general approach can't be optimized. Maybe you will come down to 400 cycles, but worst case each of the (5 or 5*2?) sprites can be in two 16 lines stripes. Right? So setting up those would require ~4000/8000 cycles, which is still way too much.

 

So maybe it would be better to track the animations and vertical sprite positions. So when a sprite isn't moving vertically, isn't animated between two frames and doesn't flicker, you don't have to BLIT it. With some scheduling, the first two points should be easy to handle. So only the flickering might become a problem. But then you could maybe use more than one RAM location/stripe.

 

Though, probably the easiest solution is getting rid of JSR/RTS and load the data directly from ROM. :)

 

It comes down to LDA (zp),Y taking 5 cycles while LDA abs,Y only needing 4.

And thinking a bit further, why don't you just use selfmodifying code here :)

Link to comment
Though, probably the easiest solution is getting rid of JSR/RTS and load the data directly from ROM. :)

And thinking a bit further, why don't you just use selfmodifying code here :)

This all depends on whether it is possible to create an equivalent kernel which doesn't require SC RAM. Perhaps you or Manuel has coded a kernel which I could reuse. So let's look at what my kernel does and maybe you (or Manuel) can provide some suggestions.

 

32 pixel x 160 line asymmetrical playfield (1LK) stored as 4 pages in SC RAM. Each 8 lines is a single row of the grid. PF colour changes used to make ropes & ladders. Blank line between rows (making a 180 line display) used to reposition one player sprite (alternating between rows).

 

There are 5 independent sprites (player + 4 enemies). 16<=Xpos<140, 0<=Ypos<156, 8 pixels wide, 8 lines high. Up to two sprites may be displayed on a single row without flicker. Partial sort algorithm used to provide intelligent flicker. So if there are no conflicts then no flickering. Each sprite independently coloured.

Link to comment

Hi there!

 

Though, probably the easiest solution is getting rid of JSR/RTS and load the data directly from ROM. :)

And thinking a bit further, why don't you just use selfmodifying code here :)

This all depends on whether it is possible to create an equivalent kernel which doesn't require SC RAM. Perhaps you or Manuel has coded a kernel which I could reuse. So let's look at what my kernel does and maybe you (or Manuel) can provide some suggestions.

 

I addmittely have nothing that'd be close enough. I either have sprite repositionings or complex playfield, but not both. Closest homebrew example I could think of is Hunchy 2, but it doesn't do the complex color changes you do.

 

After I had some closer look over your complete source code, I no longer think that getting rid of JSR/RTS will help you any or is even possible at all. I (and probably TJ as well) mistakenly thought that you only gain the difference between (),Y and $,X, but this is certainly not the case: When using (),Y you'd be forced to add switchdraw code for both sprites + sprite skipping logic - certainly a killer :)

 

A thought I had that may bring a great speedup boost though:

 

How about having lots of RAM stripes with 1 sprite only? Have insane padding of zeros on bottom and top.

 

And then just use a set of pointers inside the kernel, all properly adjusted to the set of sprites you're going to draw.

 

Then, you'd only need to copy each sprite shape to RAM once at startup.

 

(Of course you could do that with giant mostly zero filled ROM tables as well...)

 

Just another idea... :)

 

IIRC I ran into similar problems with Jumpman but since it was planned without extra RAM I ended up with the levels getting more and more symmetrical :)

 

Greetings,

Manuel

Link to comment
A thought I had that may bring a great speedup boost though:

 

How about having lots of RAM stripes with 1 sprite only? Have insane padding of zeros on bottom and top. And then just use a set of pointers inside the kernel, all properly adjusted to the set of sprites you're going to draw. Then, you'd only need to copy each sprite shape to RAM once at startup. (Of course you could do that with giant mostly zero filled ROM tables as well...)  Just another idea... :)

Unfortunately, ROM isn't infinite. I've only got 6K to work with. Right now I've got ~300 bytes of code space + 2K in the last bank (which I was saving for title / end / load / music?) for the remaining features. So dedicating a bank per sprite really isn't feasible.

 

However, I've coded up a new blit routine based on your & TJ's suggestions which has cut 600 cycles (8 lines) off the runtime. That gives me some breathing room. I'll put up a new release after I get home and merge together the various changes I've made over the weekend.

Link to comment
This all depends on whether it is possible to create an equivalent kernel which doesn't require SC RAM.

Probably I simply just still not get the problem. :)

 

Here is how I understand it:

- your BLIT code fills a 16 bytes RAM area with (partial) data of no more than one sprite at a time

- your kernel reads that data and copies it into GRPx

- you are doing this to save 1 cycle for each read inside the kernel

 

Right?

 

If yes, then just use selfmodifying code inside the kernel (using abs,Y or abs,X), which directly points to sprite data. Each sprite has to be padded with 15 blank bytes then.

 

I will cost you 15 bytes for each sprite, but you will make things much easier.

Link to comment
32 pixel x 160 line asymmetrical playfield (1LK) stored as 4 pages in SC RAM.  Each 8 lines is a single row of the grid.  PF colour changes used to make ropes & ladders.  Blank line between rows (making a 180 line display) used to reposition one player sprite (alternating between rows).

Sounds a bit like the POP kernel.

 

BTW: You could save some cycles by using a mirrored playfield. Then you would avoid any writes to PF0 and the AND. So you would have 7 reads and writes (=49 cycles) instead of 7 reads, 9 writes, AND and NOP (59 cycles).

Link to comment
Sounds a bit like the POP kernel.

Not surprising since PoP is also for the SuperCharger.

 

BTW: You could save some cycles by using a mirrored playfield. Then you would avoid any writes to PF0 and the AND. So you would have 7 reads and writes (=49 cycles) instead of 7 reads, 9 writes, AND and NOP (59 cycles).

The problem is I don't think you can update PF2 cleanly so right PF2 bit 7 and left PF2 bit 7 are each 4 pixels wide. It'd be okay if both bits were the same, but that would place an additional restriction on level development.

Link to comment
The problem is I don't think you can update PF2 cleanly so right PF2 bit 7 and left PF2 bit 7 are each 4 pixels wide.  It'd be okay if both bits were the same, but that would place an additional restriction on level development.

Sure you can! The 2nd write to PF2 must happen @48.

Link to comment

Hi there!

 

However, I've coded up a new blit routine based on your & TJ's suggestions which has cut 600 cycles (8 lines) off the runtime.  That gives me some breathing room.  I'll put up a new release after I get home and merge together the various changes I've made over the weekend.

 

Weird, weird, weird...

 

I just come home from the cinema, watching the new Harry Potter. The film is shown in 2 halfes, with a 15 minute break in the middle.

 

That was when I had another idea for your problem :)

 

It is a compromise again, but pretty reasonable (it seems to me):

 

It requires twice the ROM data for your sprites:

 

Just have it organized like

 

8 zeros

8 sprite1 bytes

8 zeros

8 sprite2 bytes

8 zeros

...

 

Then, in your blitting code, copy always 16 bytes, but pick the fitting range from ROM. So there's no more additional branch or skip/switch logic in the loop. :)

 

If you don't want to waste that much ROM, just have your sprite data *without* zeros there and sorta unpack the data *with* zeros into some RAM area at startup!

 

Greetings,

Manuel

Link to comment
Here is how I understand it:

- your BLIT code fills a 16 bytes RAM area with (partial) data of no more than one sprite at a time

- your kernel reads that data and copies it into GRPx

- you are doing this to save 1 cycle for each read inside the kernel

 

If yes, then just use selfmodifying code inside the kernel (using abs,Y or abs,X), which directly points to sprite data. Each sprite has to be padded with 15 blank bytes then.  I will cost you 15 bytes for each sprite, but you will make things much easier.

Ahh, but remember that I can reuse GRP0/1 multiple times on the same screen. So to implement your suggestion I'd need to have 20 copies of my LINESUB routine or somehow update that kernel during the sprite repositioning line.

 

The problem is I don't think you can update PF2 cleanly so right PF2 bit 7 and left PF2 bit 7 are each 4 pixels wide.

Sure you can! The 2nd write to PF2 must happen @48.

So, the STA PF2 starts at cycle 46 and the next Opcode is at cycle 49. Hmm.. Need to give it a try, but that would save a number of cycles in the kernel. Oooh, that might mean I'll have enough cycles for multi-colored sprites!

 

That was when I had another idea for your problem :) It is a compromise again, but pretty reasonable (it seems to me): It requires twice the ROM data for your sprites: Just have it organized like

 

8 zeros

8 sprite1 bytes

8 zeros

8 sprite2 bytes

8 zeros

...

 

Then, in your blitting code, copy always 16 bytes, but pick the fitting range from ROM. So there's no more additional branch or skip/switch logic in the loop. :)

 

If you don't want to waste that much ROM, just have your sprite data *without* zeros there and sorta unpack the data *with* zeros into some RAM area at startup!

First, there is no extra RAM as such to unpack into. The SC has 3 banks of 2K of writable ROM. So I'd just need to find enough ROM space to pad out the sprite data. It would save 28 cycles, or almost 2 lines total. Have to remember to make sure Y stays positive since Y is an unsigned offset for (ZP),Y. But, if I also use REFP0/1, then I'll halve the ROM requirements for the sprites. Might work, might work.

 

I see lots of updates in my immediate future, which of course Supercede the updates I've already done, sigh.

Link to comment
Of course, any suggestions on how to reduce the number of cycles required would be appreciated.  The full source is included in the attached ZIP.

 

If less than half of the memory locations that you might want to Blt to have 'stuff' in them, it may be faster to erase things after showing them on screen than to have to erase everything.

 

Alternatively, if you're trying to clear out a bunch of memory much of which is likely to be blank, and the number of bytes is a multiple of four, I'd suggest something like:

lp:
 lda $1000
 lda (ptr),y
 dey
 lda $1000
 lda (ptr),y
 dey
 lda $1000
 lda (ptr),y
 dey
 lda $1000
 lda (ptr),y
 dey
 bpl lp

By my count, that's (4+5+2)*4+3 cycles for four bytes, or 11.75 cycles/byte. I don't see room to get much better than that. Using more than one pointer could save some 'dey' instructions, but would increase the loop setup time.

 

One caveat is that when writing to the SuperCharger, you must ensure that there are no unexpected page-boundary crossings.

Link to comment

Guest
Add a comment...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...