New game - Great Escape

+Philsan · June 29, 2015

Humm... C64 games not using hardware sprites, like the ones you mentioned to me: Skool Daze and Saboteur. I would like the challenge of finding the memory for PMG overlays to colour the screens: - Skool Daze

Skool Daze and Skool Daze II (Back to Skool), games I'd love to see ported to A8...

José Pereira · June 29, 2015

I think the Skool series may not look good on the A8 because of the colours and the way the PMGs would have to be used. In Saboteur most of the colouring is in cell-based graphics that sit in front of the characters, and these are darker, so I think it will look good.

I never tried Skool games, I'll have to load the game map and extract each screen to see how it might look.

How many KB do the original Skool and Saboteur I & II games take? Just curious to see whether we would have memory available to add the PMG colouring...

Edited June 29, 2015 by José Pereira

Irgendwer · June 29, 2015

Yes, the game is quite fast now. Most important, it is faster than the C64 version. However, it is still slower than the Spectrum version. Also, I find it entertaining digging through the code and searching for better solutions.

I still can identify the potential for loop unrolling like already mentioned.

Another place:

711C: A2 07     LDX #$07
711E: BC 80 81  LDY $8180,X
7121: BD 10 9F  LDA $9F10,X
7124: 91 0B     STA ($B),Y
7126: CA        DEX
7127: 10 F5     BPL $711E

-->

LDY $8187
LDA $9F17
STA ($B),Y
LDY $8186
LDA $9F16
STA ($B),Y
LDY $8185
LDA $9F15
STA ($B),Y
etc...

And as mentioned in PM: changing the PMG shapes via IRQ and the GRAF... registers could reduce the memory footprint and the DMA load.

mono · June 29, 2015

Thanks. This particular routine is sometimes executed 500 times per frame, so it is worth optimizing. But I don't really want to spend another $200 bytes on lookup tables, and the savings still would not be very big (as you mentioned, optimizing the loop is most important).

Here is another example, the routine which scrolls the screen. The game screen consists of a framebuffer of 192 * 136 pixels (at address $CF40 in the C64 code, size $CC0) and a background tilemap of $198 bytes at $C008 in the C64 code. When the game needs to scroll in either direction, both buffers are moved. Here is one procedure for scrolling:

First it moves tilemap, and then framebuffer.

I attempted to optimize these in Atari version:

Any ideas to improve these?

C= uses a static address for the origin of the screen (you can only select the VIC bank). The Atari has a display list and, apparently, the possibility of setting any address for any line of the screen (the constraints imposed by ANTIC are quite small).

I think best optimization of this routine would be:

1. expand the game viewport to 256 pixels

2. change the screen address in the display list (on scroll you need to update only one byte per line)

3. use a pointer to the screen origin in the tile map

But I'm afraid this would require big changes in the engine.
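A minimal sketch of points 1 and 2 in Python, modelling the display list as a byte array rather than as real Atari code. It assumes a mode $0F line with an LMS on every line, a page-aligned buffer and a 256-byte row stride, so that only the low address byte changes on a scroll. The base address $4000, the row count and the helper names are made up for illustration, and any further ANTIC constraints are ignored.

```python
LMS = 0x40      # ANTIC "load memory scan" bit in a display list instruction
MODE_F = 0x0F   # assumed graphics mode; the game may use a different one

def build_display_list(base, rows, stride=256):
    """One 3-byte instruction per line: opcode, address low, address high."""
    dl = bytearray()
    for row in range(rows):
        addr = base + row * stride
        dl += bytes([LMS | MODE_F, addr & 0xFF, addr >> 8])
    return dl

def scroll_bytes(dl, delta):
    """Shift every line's start address by patching only its low byte."""
    for i in range(1, len(dl), 3):
        dl[i] = (dl[i] + delta) & 0xFF   # no carry into the high byte

def line_addresses(dl):
    """Decode the little-endian line start addresses back out of the list."""
    return [dl[i + 1] | (dl[i + 2] << 8) for i in range(0, len(dl), 3)]

dl = build_display_list(0x4000, rows=4)   # 4 rows only to keep the demo short
scroll_bytes(dl, 1)
print([hex(a) for a in line_addresses(dl)])
```

The one-byte patch only works while the low byte does not wrap, which is why the sketch gives each row its own page.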

TMR · June 29, 2015

C= uses static address for origin of the screen (you can select only VIC bank).

It's a little more flexible than that, $d018 can select the lower or upper 8K of a 16K VIC-II bank and six out of the eight available 8K blocks are free.
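To make the count concrete, here is a small Python sketch that enumerates the eight possible bitmap positions. It assumes the two excluded blocks are the ones overlapping the address ranges where the VIC-II sees the character ROM image ($1000-$1FFF and $9000-$9FFF). Banks are numbered in address order here, not by the $DD00 bit pattern, and the function names are illustrative.

```python
BANK_SIZE = 0x4000
# Address ranges where the VIC-II reads character ROM instead of RAM.
CHAR_ROM_SHADOWS = [(0x1000, 0x1FFF), (0x9000, 0x9FFF)]

def bitmap_base(bank, d018):
    """Bitmap start for bank 0-3 (address order); bit 3 of $D018 picks the 8K half."""
    return bank * BANK_SIZE + (0x2000 if d018 & 0x08 else 0x0000)

def overlaps_char_rom(base):
    """True if the 8K block starting at base touches a character ROM shadow."""
    end = base + 0x1FFF
    return any(base <= hi and lo <= end for lo, hi in CHAR_ROM_SHADOWS)

candidates = sorted({bitmap_base(b, v) for b in range(4) for v in (0x00, 0x08)})
usable = [a for a in candidates if not overlaps_char_rom(a)]
print([hex(a) for a in usable])
```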

mono · June 30, 2015

@TMR: Of course, but you can't set a new screen address one byte higher or lower than before. ANTIC allows that. This way you can quickly scroll the whole screen by one byte without moving memory.

@mariuszw: If you don't want to expand the viewport to 256 pixels, you can mask the unwanted data at the screen borders with sprites (ink colour as COLPF2, background colour as COLPF1).

Edited June 30, 2015 by mono

José Pereira · June 30, 2015

@mono, and how would he handle the wire at the sides? It also can't work because PF1 is the pixel colour here, and inverting everything would need a rewrite of the masking of the characters and the graphics. Or is there something I am not understanding?

P.S. And as it is now, the PMGs are used in such a way that none are left for the sides. OK, maybe there's one (I had it in mind for future use if he adds the flag...), but not two, one for each side.

Edited June 30, 2015 by José Pereira

TMR · June 30, 2015

@TMR: Of course, but you can't set a new screen address one byte higher or lower than before.

Just making sure the facts were presented correctly, s'all... if someone else picks up a similar project and believes that there's only four places a bitmap can be it's going to stuff them up if it's in one of the other two that are viable.

popmilo · June 30, 2015

...

LDY $8187
LDA $9F17
STA ($B),Y
LDY $8186
LDA $9F16
STA ($B),Y
LDY $8185
LDA $9F15
STA ($B),Y
etc...

What's at $8187, ..86, ..85...?

Could you replace it with LDY #nn?

mariuszw · June 30, 2015

Just making sure the facts were presented correctly, s'all... if someone else picks up a similar project and believes that there's only four places a bitmap can be it's going to stuff them up if it's in one of the other two that are viable.

I guess any coder attempting to port any C64 game to Atari should know such VIC details ;-) And http://icu64.blogspot.com/ may help to reveal details about C64 game display layout without even attempting to disassemble a game.

mariuszw · June 30, 2015

This is actually my code, already optimized ;-) The point is that lda $9f10,x has a self-modified address as its argument, and unrolling would add the cost of modifying the source addresses, which would eat all the benefit of unrolling.
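A rough cycle budget illustrates the trade-off. The Python sketch below uses the documented 6502 instruction timings and ignores page-crossing penalties and DMA. The patch cost is a hypothetical lower bound (one STA abs per extra operand byte), since the thread does not show how the address is actually modified, and it leaves out the work of computing the values to store.

```python
# Documented 6502 timings in cycles, without page-crossing penalties.
LDX_IMM, LDY_ABS, LDA_ABS = 2, 4, 4
LDY_ABS_X, LDA_ABS_X = 4, 4
STA_IND_Y, STA_ABS, DEX = 6, 4, 2
BPL_TAKEN, BPL_NOT_TAKEN = 3, 2
N = 8  # bytes copied per call

# Original loop: LDX #7, then 8 passes of LDY/LDA/STA/DEX/BPL.
loop = (LDX_IMM + N * (LDY_ABS_X + LDA_ABS_X + STA_IND_Y + DEX)
        + (N - 1) * BPL_TAKEN + BPL_NOT_TAKEN)

# Unrolled version: 8 LDY abs / LDA abs / STA (zp),Y triplets.
unrolled = N * (LDY_ABS + LDA_ABS + STA_IND_Y)
saved = loop - unrolled

# Hypothetical extra patching: 7 more LDA operands than the loop version has.
extra_patch_one_byte = (N - 1) * STA_ABS
extra_patch_two_bytes = (N - 1) * 2 * STA_ABS

print(loop, unrolled, saved, extra_patch_one_byte, extra_patch_two_bytes)
```

Under these assumptions unrolling saves 41 cycles per call, while patching the seven additional operands costs at least 28 cycles if one byte each changes and 56 if both bytes change.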

Can you elaborate a little bit about changing PMG via IRQ? What would the savings be? I'm already not using WSYNC much (actually only two times a frame, to sync the font base address change), so how would using IRQ help here?

mariuszw · June 30, 2015

C= uses a static address for the origin of the screen (you can only select the VIC bank). The Atari has a display list and, apparently, the possibility of setting any address for any line of the screen (the constraints imposed by ANTIC are quite small).

I think best optimization of this routine would be:

1. expand the game viewport to 256 pixels

2. change the screen address in the display list (on scroll you need to update only one byte per line)

3. use a pointer to the screen origin in the tile map

But I'm afraid this would require big changes in the engine.

This is a nice idea and it would definitely save the cost of scrolling, but it has a drawback: double buffering would be required to avoid tearing during display. This would definitely require many changes to the engine.

Irgendwer · June 30, 2015

This is actually my code, already optimized

I see. Thanks for the explanation.

Can you elaborate a little bit about changing PMG via IRQ? What would the savings be? I'm already not using WSYNC much (actually only two times a frame, to sync the font base address change), so how would using IRQ help here?

* If you switch PMG DMA on (via 559), the bus/CPU is blocked by ANTIC for 5 cycles on every line while it fetches the data (see the yellow blocks here: http://atariage.com/forums/uploads/monthly_05_2015/post-15480-0-25255000-1432812542.png )

* You need 640 bytes for the PMG data in double-line mode

If you install an IRQ for every change of PMG data and modify the shapes directly via the GRAFxx registers ($D00D-$D011), you can save both of the resources above.

Since the PMG data is relatively constant over larger areas in your case, the change seems beneficial...

dmsc · July 7, 2015

Hi!,

I attempted to optimize these in Atari version:

first loop of framebuffer:
 lda $cf41,y    ; 4+
 sta $cf40,y    ; 5
 iny            ; 2
 lda $cf41,y    ; 4+
 sta $cf40,y    ; 5
 iny            ; 2
 lda $cf41,y    ; 4+
 sta $cf40,y    ; 5
 iny            ; 2
 lda $cf41,y    ; 4+
 sta $cf40,y    ; 5
 iny            ; 2
 bne loop       ; (11*4+3)*64 = 3008
 lda addr,x     ; 4
 sta code1      ; 4
 sta code2      ; 4
 sta code3      ; 4
 sta code4      ; 4
 sta code5      ; 4
 sta code6      ; 4
 sta code7      ; 4
 sta code8      ; 4
 dex            ; 2
 bne loop       ; 3    ; (41+3008)*12 = 36,588

Any ideas to improve these?

If I understand the code correctly, you are moving the framebuffer 1 byte to the left, by copying the area from $CF41 to $CF40, of length $CC0. As you unrolled the loop, you use self-modifying code to change the 8 addresses.

Is this Ok?

Well, when you unroll loops on the 6502, it is faster to unroll row-wise instead of column-wise, this is an example:

loop:
 lda $CF41,x
 sta $CF40,x
 lda $D041,x
 sta $D040,x
 lda $D141,x
 sta $D140,x
....
 lda $DB41,x
 sta $DB40,x
 inx
 bne loop

As you see, this is much faster because you don't need to increment X on each copy, only once per loop.

*BUT*

There is a problem with this example. You can not move in parallel because there is a data dependency from one copy to the next!

But I suspect something: you are moving a rectangular region to the left, so you don't have a data dependency from one row to the next. As each row has 192 columns, the dependency is broken every 24 bytes. You then need to move in multiples of 24 bytes.

So, this should work (SCR is address of screen data, $CF40 in your example):

 ; First, copy 240 * 13 = 3120 bytes
 ldx #(256-240)
loopBig:
 lda SCR-(256-240)+1,x
 sta SCR-(256-240),x
 lda SCR-(256-240)+240+1,x
 sta SCR-(256-240)+240,x
 lda SCR-(256-240)+2*240+1,x
 sta SCR-(256-240)+2*240,x
 lda SCR-(256-240)+3*240+1,x
 sta SCR-(256-240)+3*240,x
 lda SCR-(256-240)+4*240+1,x
 sta SCR-(256-240)+4*240,x
 lda SCR-(256-240)+5*240+1,x
 sta SCR-(256-240)+5*240,x
 lda SCR-(256-240)+6*240+1,x
 sta SCR-(256-240)+6*240,x
 lda SCR-(256-240)+7*240+1,x
 sta SCR-(256-240)+7*240,x
 lda SCR-(256-240)+8*240+1,x
 sta SCR-(256-240)+8*240,x
 lda SCR-(256-240)+9*240+1,x
 sta SCR-(256-240)+9*240,x
 lda SCR-(256-240)+10*240+1,x
 sta SCR-(256-240)+10*240,x
 lda SCR-(256-240)+11*240+1,x
 sta SCR-(256-240)+11*240,x
 lda SCR-(256-240)+12*240+1,x
 sta SCR-(256-240)+12*240,x
 inx
 bne loopBig
 ; Now, copy the remaining 144 bytes:
 ldx #(256-144)
loopSmall:
 lda SCR-(256-144)+13*240+1,x
 sta SCR-(256-144)+13*240,x
 inx
 bne loopSmall
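
The row-boundary argument can be checked with a small Python model of the two loops. This is a sketch, not the game's code: the buffer holds distinct integers instead of screen bytes, one extra element stands in for the byte read just past the buffer, and the function names are illustrative.

```python
ROW_BYTES, ROWS = 24, 136
SIZE = ROW_BYTES * ROWS            # 3264 = $CC0
CHUNK, CHUNKS = 240, 13            # 13 * 240 = 3120, leaving 144 bytes

def sequential_shift(buf):
    """Reference: plain byte-by-byte move one position to the left."""
    out = list(buf)
    for i in range(SIZE):
        out[i] = out[i + 1]
    return out

def chunked_shift(buf):
    """Model of loopBig followed by loopSmall."""
    out = list(buf)
    for i in range(CHUNK):                      # loopBig: X walks one chunk
        for k in range(CHUNKS):                 # the 13 unrolled copies
            out[k * CHUNK + i] = out[k * CHUNK + i + 1]
    for i in range(CHUNKS * CHUNK, SIZE):       # loopSmall: remaining bytes
        out[i] = out[i + 1]
    return out

original = list(range(SIZE + 1))   # distinct values, plus one element past the end
a, b = sequential_shift(original), chunked_shift(original)
diff = [i for i in range(SIZE) if a[i] != b[i]]
print(len(diff), {i % ROW_BYTES for i in diff})
```

In this model the two results differ only in 12 bytes, all of them in the last column of a row, which is the column that receives wrapped-around data in a plain left shift anyway.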

Assuming that half the time the "LDA ABS,X" takes 5 cycles, you have 240*((4.5+5)*13+2+5)+144*(4.5+5+2+4) = 33552 cycles, this is about 13% faster than your loop (if you count 4.5 cycles per LDA in your loop also).

As the theoretical minimum with "ABS,x" is 29376 cycles (9 cycles per copy) this is already only 14% slower than that.
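
The arithmetic above can be reproduced in a few lines of Python. The first total uses the figures exactly as written in the post. The second is a variant under my own assumption that the loop overhead is INX (2 cycles) plus a taken BNE without a page crossing (3 cycles), which is the documented 6502 timing for that case.

```python
BYTES = 0xCC0                       # 3264 bytes in the framebuffer
LDA_ABS_X = 4.5                     # averaged: half the loads cross a page
STA_ABS_X = 5

# Figures exactly as used in the post (loop overhead of 2+5 and 2+4 cycles).
big = 240 * ((LDA_ABS_X + STA_ABS_X) * 13 + 2 + 5)
small = 144 * (LDA_ABS_X + STA_ABS_X + 2 + 4)
total = big + small

# Lower bound from the post: 4 + 5 cycles per byte and no loop overhead.
minimum = BYTES * 9

# Variant assumption: INX (2) plus a taken BNE with no page crossing (3).
total_short_branch = (240 * ((LDA_ABS_X + STA_ABS_X) * 13 + 5)
                      + 144 * (LDA_ABS_X + STA_ABS_X + 5))

print(total, minimum, round(total / minimum, 3), total_short_branch)
```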

Daniel.

mariuszw · July 7, 2015

Thank you very much for your help!
