
pixels per second in Antic $0D



 

The 'benchmark' is by using Altirra and examining the 'start' and 'stop' jiffy counters.

 

Did so here, but I have no idea how you calculated your 4500 pps. The screen contains 160x96 = 15360 pixels. When running your program, I get a count of $32 (50) jiffies in the NTSC case, so that is roughly 0.83 seconds, which means ~18400 pps?!?
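
For reference, the measurement could be sketched like this in ca65 syntax. RTCLOK ($12-$14) is the OS frame counter, incremented once per VBlank with the low byte at $14. The names `draw_screen`, `start` and `elapsed` are hypothetical and not taken from the original program.

```asm
RTCLOK  = $12               ; OS jiffy counter, 3 bytes, low byte at RTCLOK+2

.import draw_screen         ; hypothetical: plots all 160x96 = 15360 pixels

.bss
start:   .res 1             ; hypothetical storage for the start count
elapsed: .res 1             ; hypothetical storage for the result

.code
benchmark:
        lda RTCLOK+2
wait:   cmp RTCLOK+2        ; wait for the next VBlank so we start on a frame edge
        beq wait
        lda RTCLOK+2
        sta start
        jsr draw_screen
        lda RTCLOK+2
        sec
        sbc start           ; only valid while the run takes fewer than 256 jiffies
        sta elapsed         ; jiffies used, e.g. $32 = 50
        rts
```

On NTSC the rate is then 15360 * 60 / elapsed, so 50 jiffies gives about 18400 pps.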

Edited by Irgendwer

THE SHEER JOY

 

The purpose was to benchmark & develop a fast individual pixel plot routine.

 

 

:-D - but my calculation says 400% :grin:
(I'll supply something soon, ATM I cannot build since my cc65 version is more recent and doesn't like your config)

 

Man, I'm terrible at math. Yeah that's why I am still on the older version, I got irritated with the later builds for changing stuff.


Well, it didn't take SBX, so I used AXS, which supposedly is a synonym. But it didn't work.

 

Found the reason: In the source code, 'plot_c' is the zero-page address where the color for the plot is located.

SBX/AXS works on immediate data (so XXL's 'sbx #$100-plot_c' is defined at compile time).

When using 'axs #$100-COLOR_2' everything works as expected, but of course you lose the freedom of having the color defined by a memory location, i.e. a variable.

The solution would be self-modifying code at this location and adapted COLOR_x values - taking the #$100-x complement into account.
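
A minimal sketch of that idea, under assumptions: I take it that the combined mask table is indexed by (x & 3) + 4 * colour, so the colour offsets are 0, 4, 8 and 12. The routine names and the COLOR_n definitions below are made up for illustration and are not from the original source. AXS (a.k.a. SBX) computes X = (A and X) - operand, which is why the offsets are stored pre-negated.

```asm
.setcpu "6502X"             ; lets ca65 accept the undocumented AXS mnemonic

; Assumed colour offsets 0, 4, 8, 12, stored as their $100-x complement
; because AXS subtracts its operand.
COLOR_0 = <($100 - 0)
COLOR_1 = <($100 - 4)
COLOR_2 = <($100 - 8)
COLOR_3 = <($100 - 12)

.code
; A = one of the COLOR_n values above
set_color:
        sta plot_c          ; patch the immediate operand of the AXS below
        rts

; X = pixel column on entry, table index on exit
make_index:
        lda #3
        axs #COLOR_0        ; X = (A and X) - operand = (x & 3) + colour offset
plot_c = *-1                ; address of the operand byte (self-modified)
        rts
```

The colour then costs one STA whenever it changes, while the per-pixel path keeps the immediate form.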

 

Edit: This change results in ~20000 pps (46/60) BTW.

Edited by Irgendwer

I think all we did was take your suggestion, XXL. Very cool! I'm going to modify it to store the color in that instruction instead of the reference... I think it will work out the same because there's a STA plot_c already involved on each pass, I'll just change it from a ZP location to the instruction.


I think all we did was take your suggestion, XXL.

 

Exactly. (And this is indeed a good example for undoc opcode usage that makes sense!)

-

Since most of the time per plot was spent on the subroutine call, I made a macro of the plot code. I had to abstain from undocumented opcodes, since self-modifying code in a macro won't work very well ;)...

 

In this opcode police friendly version I get ~21400 pps (43/60).

 

The drawback is 34 bytes for every 'call'.

Edited by Irgendwer

I thought about further optimizations, but they all go in a usage-dependent direction - like the one above with the empty source, or caching the latest y value and skipping the line-base calculation when it is unchanged (this would work very well in the test case) - or take much more memory or a somewhat more complicated memory layout:

 

The colour can be used as an offset for the indirect table access of the mask, so we can get rid of the add:

...
  txa
  and #3
  tax
 
  lda (ptr1),y
  and and_masks, x   ; needs only 4 bytes now, not 16!
  ora colortable, x  ; 16 bytes at page start (or in *one* page with adapted COLOR values..)
plot_c = *-2 
  sta (ptr1),y

...

This results in ~19600 pps (47/60).

Please note the saving of 12 bytes for the and_masks and that the colortable has to be page aligned or all values have to reside in one page with adapted COLOR definition values.
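
To make the layout concrete, here is a sketch of how the two tables and the colour definitions could look. The COLOR_n values and the `set_color`/`plot_core` labels are assumptions for illustration. The table contents follow from the 2-bits-per-pixel layout.

```asm
.zeropage
ptr1:   .res 2              ; points to the start of the screen row

.rodata
and_masks:                  ; clears the 2 bits of pixel (x & 3)
        .byte %00111111, %11001111, %11110011, %11111100

.align 256                  ; needs a matching align setting in the linker config
colortable:
        .byte %00000000, %00000000, %00000000, %00000000   ; colour 0
        .byte %01000000, %00010000, %00000100, %00000001   ; colour 1
        .byte %10000000, %00100000, %00001000, %00000010   ; colour 2
        .byte %11000000, %00110000, %00001100, %00000011   ; colour 3

; Assumed colour values: offset of the 4-byte group within the page
COLOR_0 = 0
COLOR_1 = 4
COLOR_2 = 8
COLOR_3 = 12

.code
; A = one of COLOR_0..COLOR_3
set_color:
        sta plot_c          ; becomes the low byte of the ORA operand
        rts

; X = pixel column, Y = byte offset within the row
plot_core:
        txa
        and #3
        tax
        lda (ptr1),y
        and and_masks,x
        ora colortable,x    ; low byte of this address is patched
plot_c = *-2
        sta (ptr1),y
        rts
```

Because the table starts at a page boundary, patching only the low byte is enough to select the colour group.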

 

Pros:

* opcode police friendly

* smallest memory footprint of the versions

 

Cons:

* it uses SMC

* slower than macro and undoc ops version

 

Since other things are calling, I'll stop here for now (for the sheer joy ;) ). But I have some homework for you: Create five 160 byte tables, so that the "txa, and #3, tax" is not needed anymore, and we have an opcode police friendly version which is faster than the violating one...

Edited by Irgendwer

But I have some homework for you: Create five 160 byte tables, so that the "txa, and #3, tax" is not needed anymore, and we have an opcode police friendly version which is faster than the violating one...

 

Hint: The plot_c is then the high-byte of the table, and we get ~20900 pps (44/60).
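
A sketch of what the hint amounts to, under assumptions: the five tables (and_masks plus colortable0..colortable3, 160 bytes each) are defined elsewhere, the colour tables are page aligned and sit in consecutive pages, and `set_color`/`plot_core` are illustrative names only.

```asm
.import   and_masks, colortable0   ; assumed to be defined elsewhere
.importzp ptr1                     ; zero-page pointer to the screen row

.code
; A = colour number 0..3
set_color:
        clc
        adc #>colortable0   ; assumes colortable0..3 occupy consecutive pages
        sta plot_c          ; patch the high byte of the ORA operand
        rts

; X = pixel column 0..159, Y = byte offset within the row
plot_core:
        lda (ptr1),y
        and and_masks,x     ; 160-byte table, so no "and #3" is needed
        ora colortable0,x   ; high byte of this operand selects the colour
plot_c = *-1
        sta (ptr1),y
        rts
```

Compared with the 16-byte version, plot_c moves from the low byte (*-2) to the high byte (*-1) of the operand.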

 

As said:

 

it may not be necessary to build (an) additional table(s)

 

I think 1.25k of tables (with some spare bytes between them) should be enough... ;-)


Amazing! Thanks, all, and especially Somebody. This was fun and I learned quite a bit. I just got the ethernet driver to open a TCP connection though, so I'm going to be focused on that for a while, but I do plan on implementing a little two-ship space war type game that can be played over the internet, which was what sparked the initial interest in plotting points.


Ok, final consideration and test:

 

Of course it is possible to use the latest version in conjunction with a macro, if the colour(-table) is given as argument:

...

and_masks:
.repeat 40
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
.endrepeat

.align 256
c_masks:
colortable0:
.repeat 40
.byte %00000000
.byte %00000000
.byte %00000000
.byte %00000000
.endrepeat    


.align 256
colortable1:
.repeat 40
.byte %01000000
.byte %00010000
.byte %00000100
.byte %00000001
.endrepeat

.align 256
colortable2:
.repeat 40
.byte %10000000
.byte %00100000
.byte %00001000
.byte %00000010
.endrepeat

.align 256
colortable3:
.repeat 40      
.byte %11000000
.byte %00110000
.byte %00001100
.byte %00000011
.endrepeat

...

.macro mac_plot colortable
  ldx plot_x
  ldy plot_y

  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1   ; ptr1 now points to start of screen row
  ldy x_div_by_4_lut,x
  
  lda (ptr1),y
  and and_masks,x
  ora colortable,x
  sta (ptr1),y
.endmacro

...

; usage, plot something with color2
mac_plot colortable2

...

This results in ~25600 pps (36/60).

 

Notes:

 

'Calling' now costs 27 bytes.

This macro variation would of course also work with the other variants (post #35 & undoc ops). They would be a bit slower, but would save a lot of table space.

The macro can only be used when the colour is static at compile time, which raises the question of whether this makes sense. (Could we then also supply byte position and mask instead of x? Why not toggle the plot byte directly, if everything is static...)

 

END OF SHEER JOY

 

Edit: I leave it to the reader to enhance the macro in case 'colortable3' is given as argument... (so that the 'and and_masks,x' would be saved)
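
One possible way to do that, as a sketch: it assumes plot_x, plot_y, ptr1, the line address tables, x_div_by_4_lut, and_masks and colortable0..colortable3 are defined as in the listing above. `mac_plot_opt` is a made-up name. The colortable0 branch is an extra along the same lines and was not part of the exercise.

```asm
.macro mac_plot_opt colortable
        ldx plot_x
        ldy plot_y

        lda line_addr_lo,y
        sta ptr1
        lda line_addr_hi,y
        sta ptr1+1              ; ptr1 now points to start of screen row
        ldy x_div_by_4_lut,x

        lda (ptr1),y
    .if .xmatch ({colortable}, {colortable3})
        ora colortable3,x       ; %11 sets both bits, so no clearing is needed
    .elseif .xmatch ({colortable}, {colortable0})
        and and_masks,x         ; %00 only clears, so the ORA would do nothing
    .else
        and and_masks,x
        ora colortable,x
    .endif
        sta (ptr1),y
.endmacro

; usage, plot something with color3 (skips the AND)
mac_plot_opt colortable3
```

The decision is made at assembly time by comparing the argument tokens, so there is no run-time cost for the check.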

Edited by Irgendwer

Nice effort.

 

Wouldn't it be interesting to use mode B (Gr.7 with 2 colours ;) ) ... just for getting the most speed ?

 

Using such pixel plotting routines in a demo, whirling on the screen and build objects on the fly... If you know what I mean ;)

This could get very impressive...


One more way to squeeze out a couple more plots would be to reorganize the screen layout.

Use LMS to set every scanline to the start of a page, so the Y coordinate is already the high byte of the screen address.

96 pages is not that much ;)

sty ad+1
ldy x_div_by_4_lut,x
lda (ad),y
and and_masks,x
ora colortable,x
sta (ad),y
Of course, the y coordinate of the upper left corner of the screen would then be 128, for example. You can use the 'left over' bytes on each page for a second buffer or other graphics data.
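
A display list for such a layout could be sketched like this. SCREEN_PAGE and the "DLIST" segment name are assumptions. The list is 294 bytes long, so it has to be placed where it does not cross a 1K boundary.

```asm
SCREEN_PAGE = $80           ; assumed: first screen row lives in page $80 (y = 128)

.segment "DLIST"            ; hypothetical segment, must not straddle a 1K boundary
dlist:
        .byte $70, $70, $70             ; 3 x 8 blank scanlines
    .repeat 96, row
        .byte $4D                       ; Antic mode $0D with the LMS bit ($40) set
        .word (SCREEN_PAGE + row) * 256 ; each row starts at offset 0 of its own page
    .endrepeat
        .byte $41                       ; JVB: wait for VBlank, then restart the list
        .word dlist
```

With this in place the low byte of the zero-page pointer `ad` is set to 0 once, and only ad+1 changes per plot.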

 

After all, my guess is that calculating where to plot all those pixels would take much more time than plotting them :)


For modes needing only one LMS, the "disable Antic DList DMA" trick could be used.

Would be needed in conjunction with a Timer IRQ in order to re-enable the DMA at the right time, though if full screen height is used the VBlank would fix the DMA address for the next frame anyway.

 

Again it's into the realm of scavenging maybe 70 cycles per frame - more than that could be gained by just using a custom streamlined VBlank handler.


page per line is quite handy... I used that in Arsantica 3 Intro. here you trade off a simple screen layout and some DMA for easier calculation, but your draw/plot/calc routines are simplified... and you get the possibility of having several buffers, as you can address each buffer with the X or Y reg easily.

 

having longer lines on A8 is really a plus for 3d stuff imho... compared to c64 layouts... just making some tests on the VICII ;)

Edited by Heaven/TQA

I'm more used to wasting memory on that "other" machine, but I was just wondering if you have considered a few other approaches?

Only use EOR plotting? Makes for possibly weird colours when crossing different colours.

Split in 4 different cases? It looks like you have basically 4 different masks that can then be immediates. Doesn't really work unless you also special case the code in either X or Y direction jumptables.

As said, special case either by line or by column so you can use known absolute indexed address modes. And you run into 6502 indexed jumping trouble again... (word handling is not a strong point for the 6502).

 

Looking at 6502 code sometimes makes me gnash my teeth. Is it faster to use a jumptable or to use zeropage? What is the fastest jumptable you can get away with? Can you use self-modification for something?

And it leaves me with this uneasy feeling that it can be done totally differently and so much faster by doing something incredibly clever.

Edited by NorthWay

That's the joy of 6502. Code is never "definitive". :)

True. After all these years it wasn't until today I realized that there is a way to shortcut a "*2" byte addressing: If the base address is page aligned then you can store the byte as the low byte of the address and then index address it with ,X or ,Y. Too bad there is not much use for it on a 6502, but I was looking at the 6809 when that struck me.

Then you have stuff like optimizing a 0 to (0-255)/3+1 range jumptable to jump to jumps.

Then it gets more crazy like the stack-based multiplexer sort that Gary Liddon(?) used in Tyger Tyger.

Then it gets rather esoteric when you look at Linus' 1541 turbo decoder.

And I don't know if someone actually implemented using chained CIA timers to calculate line slope values. Anyone know?

 

Back to topic, is it possible to morph the AND/OR into a table lookup if not using EOR? I guess that would need separate code for different Y positions already.

OTOH the existing code is very fine if you spin the pointers in a loop and don't intend to execute the setup for every pixel in a line. With a page-per-line screen mode, an X change is a simple INC/DEC, and a Y change only means recalculating the Y part again.


Back to topic, is it possible to morph the AND/OR into a table lookup if not using EOR?

 

When writing the optimization I thought about many other ways to speed up the code - including jump-tables and EORing the data. The problem is, you not only need quite a lot of memory to get things fast, but still have to spend more setup time than in the shown way.

 

EORing would melt the AND,x & OR,x into one operation, saving 4 cycles - BUT since the EOR mask is not only colour and position dependent but also has to take the source data into account, you need quite a big table.

And there are two problems with big tables: They not only need a lot of memory, they also exceed the 256 bytes which can be accessed somewhat fast. Going for SMC for a big table access is not a solution in the regular, non-ZP case here: An SMC write takes 4 cycles - and that alone would already eat the saving.

 

I also thought about optimized versions for colour patterns like '11' and '00' (you can save the AND or OR in such cases), but even if you put the colour as an address modifier for a JMP(ind) and spend the memory for three different plot routines, the 5 cycles for the JMP are 'lost' and a performance killer for the mixed pattern version.

(Of course you could just JSR directly into 4 different colour/plot routines with different run-times and get the saving in 2 out of 4 cases, but the comparable macro version is still faster.)

 

As mentioned: For specific use cases, there is still room for some speed.

And yes - being unsure if the taken way is really the best one, is part of the 6502 coding fun. (...and leaves always room for 'my code is better than yours'... ;) )

Edited by Irgendwer

True. After all these years it wasn't until today I realized that there is a way to shortcut a "*2" byte addressing: If the base address is page aligned then you can store the byte as the low byte of the address and then index address it with ,X or ,Y...

Then you have stuff like optimizing a 0 to (0-255)/3+1 range jumptable to jump to jumps.

Then it gets more crazy like the stack-based multiplexer sort that Gary Liddon(?) used in Tyger Tyger.

Could you elaborate on these three things?
