[AQUARIUS] - Screen Block Data Copy

chjmartin2 · July 5, 2021

Hi,

It was funny to go back and read about stuff I started 10 years ago (wow @jaybird3rd, it has been more than a decade) and realize how much more I know now, and then sad to see how little I have progressed. I am super excited about having MAME, as it has really sped up my development cycle. I am working on a quick demo that I think everyone will find fun.

Here is my question. It is really easy to update a section of the screen if you want to update a whole line: just set up HL, BC and DE, have your screen data labelled .db style, and LDIR your way to success. Here is the rub: I am very cycle starved in my current program, and I only want to update a SECTION of the screen, not the whole line. I have several THEORIES about the best way to do this in the fewest t-states, but I am frustrated because this is clearly a solved problem and SOMEBODY knows the best way.

I have a "graphic" that is 4 characters wide by 6 characters tall, with both colors and characters. The total graphic is 48 bytes. I have a few versions of it and want to use them to make an animation. So let's say I wanted to draw it in the bottom left of the screen.

The character data starts at memory position 12,930 (upper left) and ends at 13,133 (lower right). The matching color data runs from 13,954 to 14,157.

Here are the methods I think I can use, but again I feel like this is something that somebody else has already optimized, so why reinvent the wheel? Anyway... I could:

ld de, 12930 ; 10 t-states
ld hl, IMAGECHARLINEONE ; 10 t-states
ld bc, 4 ; 10 t-states
LDIR ; 21, 21, 21, 16 t-states
ld de, 12970
ld hl, IMAGECHARLINETWO
ld bc, 4
LDIR
...

That is 109 t-states per row, so about 654 t-states for all six rows - an eternity, and then double it for the color.

This just seems like an awfully inefficient way of doing it, and it doesn't use PUSH, POP, IX/IY or anything else. My alternative is just:

ld hl, 12930 ; 10 t-states
ld (hl), nn ; 10 t-states
inc hl ; 6 t-states
ld (hl), nn ; 10 t-states
inc hl ; 6 t-states
ld (hl), nn ; 10 t-states
inc hl ; 6 t-states
ld (hl), nn ; 10 t-states
ld hl, 12970
...

That is 68 t-states per row, so 408 t-states for all six rows: better than LDIR, but still quite a bit, and again doubled for the color.

Is this a solved problem? Ideally I would have a more generic routine where I specify the upper left corner, point to the ROM location of the character/color data, and it draws it to the screen. Any thoughts would be appreciated. Sorry if it is a basic, obvious question.

Thanks,

Chris
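
One possible shape for the generic routine asked about above is to swap LDIR for four unrolled LDIs and step DE to the next screen row by hand. This is an untested sketch, not something from the thread: the name draw4, the register interface, the labels frame1_chars and frame1_colors, and the assumption that the image rows are stored back to back are all made up for illustration. LDI costs 16 t-states against 21 for each repeating LDIR pass, and HL simply runs on through the source data, so only DE needs adjusting between rows.

```z80
; draw4 (hypothetical name): copy a block 4 cells wide and B rows tall.
; in:  HL = source data, rows stored back to back (4 bytes per row)
;      DE = screen address of the top-left cell
;      B  = number of rows (1..255)
; out: AF, BC, DE, HL changed
draw4:
        ld   a,b            ; keep the row count in A, because LDI decrements BC
.row:
        ldi                 ; (DE) <- (HL), HL+1, DE+1, BC-1: 16 t-states each
        ldi
        ldi
        ldi
        ex   de,hl          ; 4   HL = screen pointer, DE = source pointer
        ld   bc,40-4        ; 10  gap from the end of this row to the next one
        add  hl,bc          ; 11
        ex   de,hl          ; 4   back to HL = source, DE = screen
        dec  a              ; 4
        jp   nz,.row        ; 10
        ret

; Usage for the 4x6 graphic at 12930, characters first and then colors.
; frame1_chars and frame1_colors are hypothetical 24-byte tables.
        ld   hl,frame1_chars
        ld   de,12930
        ld   b,6
        call draw4
        ld   hl,frame1_colors
        ld   de,13954
        ld   b,6
        call draw4
```

Each row costs 107 t-states here, so this is no faster than the hand-written LDIR version and slower than the fully unrolled "ld (hl),nn" code. What it buys is one routine that serves any screen position and any animation frame.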

jltursan · July 13, 2021

If you absolutely need to save cycles, the fastest way to dump a graphic is stack-blasting; but to do so "easily" you need to assume an even graphic width or use self-modifying code. This way you can POP registers (HL',DE',BC',HL,DE,BC,IX,IY: 16 bytes wide) from the source and PUSH them into the destination. It's tricky and a bit dirty, but there's no faster way to do it.


ld hl, 12930 ; 10 t-states
ld (hl), nn ; 10 t-states
inc hl ; 6 t-states
ld (hl), nn ; 10 t-states
inc hl ; 6 t-states
ld (hl), nn ; 10 t-states
inc hl ; 6 t-states
ld (hl), nn ; 10 t-states
ld hl, 12970

That's good enough. Sadly, you can't rely on "INC L" instead of "INC HL", because with the Aquarius's 40-byte screen width a row can straddle a 256-byte page boundary.

chjmartin2 · July 14, 2021

On 7/13/2021 at 3:32 AM, jltursan said:

If you absolutely need to save cycles, the fastest way to dump a graphic is stack-blasting; but to do so "easily" you need to assume an even graphic width or use self-modifying code. This way you can POP registers (HL',DE',BC',HL,DE,BC,IX,IY: 16 bytes wide) from the source and PUSH them into the destination. It's tricky and a bit dirty, but there's no faster way to do it.

So help me with that because I am very interested in it. I don't know what stack blasting is, nor do I know how to auto-modify code. Is there somewhere I can read about it?

jltursan · July 15, 2021

The stack-blasting techniques come from using the stack to quickly transfer (or fill) data. Consider that the fastest way to put a byte in RAM is a "LD (HL),A", which runs in 7 t-states. An "LDI" costs 16 t-states, though it can pay off if you can take advantage of the registers it updates for you (HL, DE and BC).
Now, using the stack, you can "PUSH xx" 2 bytes at a cost of only 11 t-states. Of course it's not quite that good in practice: you still need to load the PUSHed register and build the surrounding code needed to really transfer a block...

As an example:

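; sprite_data holds the image's six 16-bit words in reverse order (last word first)
ld (savesp),sp ; 20, saves the real stack
ld sp,sprite_data ; 10, POP reads upward, so start at the first byte
pop de ; 10
pop bc ; 10
exx ; 4
pop de ; 10
pop bc ; 10
pop ix ; 14
pop iy ; 14
ld sp,$3000+4 ; 10, PUSH writes downward, so aim just past the row
push ix ; 15
push iy ; 15
ld sp,$3000+40+4 ; 10
push de ; 11
push bc ; 11
ld sp,$3000+80+4 ; 10
exx ; 4
push de ; 11
push bc ; 11
ld sp,(savesp) ; 20, restores the real stack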

That untested, rough code dumps a 4x3 character image, without attributes, to a fixed screen position (top left area).
The code gets a lot more complex when you try to parametrize it and make it as "universal" as possible. Right now AF and HL are not being used; you can use them to implement some logic, or add them to the stack chain.

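This code runs in 190 t-states, not counting the 40 t-states spent saving and restoring SP, so it works out to about 16 t-states per byte transferred. A bare POP/PUSH pair moves 2 bytes in 21 t-states (about 10.5 per byte); the rest is the LD SP reloads and the slower IX/IY instructions.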

About the "auto-modifying" code (usually called self-modifying code): it just means that you can dynamically "overwrite" your code by poking data over it. Usually you're not going to overwrite instruction opcodes (although sometimes you can, and it's really handy); it's more common to change data values like the ones loaded by the "LD SP,nn" instructions above. The "nn" can be replaced easily, and then you're self-modifying your code.
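
As a rough illustration of patching an "LD SP,nn" operand, here is an untested sketch. The routine name put4, its labels and its register interface are hypothetical and not part of the example above. It relies on "LD SP,nn" being encoded as the opcode byte $31 followed by the low and high bytes of nn, so the operand sits one byte past the instruction's label. The code has to live in RAM for the patch to take effect, and it assumes nothing can interrupt while SP is pointing at the screen.

```z80
; put4 (hypothetical): write the four bytes held in IY and IX to one screen row.
; in:  HL = screen address of the leftmost cell of the row
;      IY = cells 0 and 1 (low byte = cell 0), IX = cells 2 and 3
; out: DE, HL changed
put4:
        ld   de,4
        add  hl,de              ; PUSH stores downward, so aim 4 bytes past the left edge
        ld   (put4_sp+1),hl     ; patch the nn operand of the LD SP,nn below
        ld   (put4_old+1),sp    ; patch the restore too, instead of using a savesp variable
put4_sp:
        ld   sp,0               ; $31,lo,hi: the operand was overwritten above
        push ix                 ; fills cells 2 and 3
        push iy                 ; fills cells 0 and 1
put4_old:
        ld   sp,0               ; operand now holds the caller's SP
        ret
```

Patching the restore instruction as well trades the 20 t-state "LD SP,(savesp)" for a 10 t-state "LD SP,nn".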

Hope you get the idea...

chjmartin2 · July 15, 2021

Yeah, so I have not used the stack at all in any of my programs and this is an application I want to get better at. I think I will try to create a 4x3 frame animation demo to see exactly how fast I can get it to run. 190 t-states is MUCH faster (like 60%) than my unrolled loop code.

jltursan · July 15, 2021

I never ended up using a "stack-blasting" sprite routine, as I usually need routines that are a little more universal. Of course they can be enhanced, but they're hard to adapt because you are using all the available registers (the more registers, the faster you can go).

However, this next routine is among my favorites; I use it to fill RAM with a value very fast:

; HL=start
; DE=length (must be a nonzero multiple of 8)
; A=byte
fillram:
       ld (stack),sp      ; Only remove if you're sure you don't need the stack
       add hl,de
       ld sp,hl           ; SP = FINAL_ADDRESS+1, PUSH fills downward from here
       srl d
       rr e               ; LENGTH_TO_FILL/2
       srl d
       rr e               ; LENGTH_TO_FILL/4
       srl d
       rr e               ; LENGTH_TO_FILL/8 = number of 8-byte passes
       ld h,a
       ld l,a             ; HL = fill byte in both halves
       ld b,e             ; B = low byte of the pass count (inner loop)
       dec de
       inc d              ; D = number of outer loops
.loop:
[4]    push hl            ; repeated 4 times: 8 bytes per pass
       djnz .loop
       dec d
       jp nz,.loop
       ld sp,(stack)
       ret
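
A usage sketch for the routine above, untested and resting on a few assumptions that are not stated in the thread: that character RAM is 40 x 25 = 1000 bytes starting at $3000, that color RAM sits 1024 bytes higher (which matches the addresses in the first post), and that $70 is an acceptable attribute byte. The names clear_screen, SCREEN and COLOR are made up for the example.

```z80
SCREEN  equ  $3000          ; character RAM base (assumed to span 1000 bytes)
COLOR   equ  $3400          ; color RAM base = SCREEN + 1024

clear_screen:
        ld   hl,SCREEN
        ld   de,1000        ; 1000 = 125 x 8, so the multiple-of-8 rule holds
        ld   a,' '
        call fillram        ; fill the characters with spaces
        ld   hl,COLOR
        ld   de,1000
        ld   a,$70          ; hypothetical attribute value
        call fillram        ; fill the colors
        ret

stack:  dw   0              ; scratch word where fillram parks SP; must be in RAM
```

While fillram runs, SP points into the area being filled, so anything that pushes onto the stack in the meantime (an interrupt, for instance) would land in that area.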

Edited July 15, 2021 by jltursan
Formatting fixes
