
Ridiculously slow blitter perf


cubanismo


I've written up some GPU code to draw a string of characters with the blitter. It works, but it is crazy slow. It takes about half a second per character blitting 6x12px characters from a 1bpp surface to another 1bpp surface. There are a few other things going on:

 

  • The 68k is in a tight loop reading a word from a main memory location to see if the GPU is done yet.
  • The GPU is servicing the object processor interrupt once a frame to reset the object list and do some crappy animation experiment stuff.
  • Two ~half-screen-sized 1bpp surfaces are being displayed by the OP.

 

So the system isn't idle, but it's not doing anything crazy either. Does anyone have theories on what's going on? Here's a link to the GPU code in question:

 

https://github.com/cubanismo/skunk_usb/blob/03de2e2fa4a/ffsgpu.s#L334

 

And you can find the rest of the related source files in that repo as well, e.g.:

 

https://github.com/cubanismo/skunk_usb/blob/master/usbffs.c

https://github.com/cubanismo/skunk_usb/blob/master/ffsobj.s

https://github.com/cubanismo/skunk_usb/blob/master/startffs.s

 

 

 

Edited by cubanismo
Edit: Fixed the link to the actual function to use an absolute git revision rather than pointing to latest version.

Stop dinkin' around and hurry up and get this figured out. They're going to need someone to help them port Tekken Hybrid to the Jag 

 

Tekken Bot (@BotTekken) Tweeted:
IT'S OFFICIAL: Tekken Hybrid: Remastered to be released next year featuring Rayman as a guest fighter! Only available on Hyperscan and Atari Jaguar https://t.co/6IjcP2bHgL https://twitter.com/BotTekken/status/1449341872099397632?s=20

Edited by JagChris

2 minutes ago, JagChris said:

Hurry up and get this figured out. They're going to need someone to help them port Tekken Hybrid to the Jag 

 

Tekken Bot (@BotTekken) Tweeted:
IT'S OFFICIAL: Tekken Hybrid: Remastered to be released next year featuring Rayman as a guest fighter! Only available on Hyperscan and Atari Jaguar https://t.co/6IjcP2bHgL https://twitter.com/BotTekken/status/1449341872099397632?s=20

Please don't pollute the coding forum with nonsense. If you can't help, or don't have a genuine question... well, you know the drill.

Edited by CyranoJ

11 hours ago, cubanismo said:

I've written up some GPU code to draw a string of characters with the blitter. It works, but it is crazy slow. It takes about half a second per character blitting 6x12px characters from a 1bpp surface to another 1bpp surface. There are a few other things going on:

I wonder how the blitter works internally. Does it copy pixel by pixel, one bit at a time in this case?


46 minutes ago, Cyprian said:

I wonder how the blitter works internally. Does it copy pixel by pixel, one bit at a time in this case?

 

For the courageous, @Shamus has done quite some reading on the Jaguar netlists and has derived a lot of facts about how the blitter works internally. You can check out the blitter source right here (and I bet that any corrections are welcome!)


@ggn thanks

 

After a quick look, if I'm not wrong, it is pixel based.

In the case of pixels smaller than 8 bits (1/2/4bpp), the blitter needs to read the previous value of the destination, OR the new value into it, and write the result back. That means 3 memory accesses per pixel.

In the case of 8/16/32bpp it is just a read and a write (two accesses).

 

// 1 bpp pixel read
#define PIXEL_SHIFT_1(a)      (((~a##_x) >> 16) & 7)
#define PIXEL_OFFSET_1(a)     (((((uint32_t)a##_y >> 16) * a##_width / 8) + (((uint32_t)a##_x >> 19) & ~7)) * (1 + a##_pitch) + (((uint32_t)a##_x >> 19) & 7))
#define READ_PIXEL_1(a)       ((JaguarReadByte(a##_addr+PIXEL_OFFSET_1(a), BLITTER) >> PIXEL_SHIFT_1(a)) & 0x01)
//#define READ_PIXEL_1(a)       ((JaguarReadByte(a##_addr+PIXEL_OFFSET_1(a)) >> PIXEL_SHIFT_1(a)) & 0x01)

// 2 bpp pixel read
#define PIXEL_SHIFT_2(a)      (((~a##_x) >> 15) & 6)
#define PIXEL_OFFSET_2(a)     (((((uint32_t)a##_y >> 16) * a##_width / 4) + (((uint32_t)a##_x >> 18) & ~7)) * (1 + a##_pitch) + (((uint32_t)a##_x >> 18) & 7))
#define READ_PIXEL_2(a)       ((JaguarReadByte(a##_addr+PIXEL_OFFSET_2(a), BLITTER) >> PIXEL_SHIFT_2(a)) & 0x03)
//#define READ_PIXEL_2(a)       ((JaguarReadByte(a##_addr+PIXEL_OFFSET_2(a)) >> PIXEL_SHIFT_2(a)) & 0x03)

// 4 bpp pixel read
#define PIXEL_SHIFT_4(a)      (((~a##_x) >> 14) & 4)
#define PIXEL_OFFSET_4(a)     (((((uint32_t)a##_y >> 16) * (a##_width/2)) + (((uint32_t)a##_x >> 17) & ~7)) * (1 + a##_pitch) + (((uint32_t)a##_x >> 17) & 7))
#define READ_PIXEL_4(a)       ((JaguarReadByte(a##_addr+PIXEL_OFFSET_4(a), BLITTER) >> PIXEL_SHIFT_4(a)) & 0x0f)
//#define READ_PIXEL_4(a)       ((JaguarReadByte(a##_addr+PIXEL_OFFSET_4(a)) >> PIXEL_SHIFT_4(a)) & 0x0f)

// 8 bpp pixel read
#define PIXEL_OFFSET_8(a)     (((((uint32_t)a##_y >> 16) * a##_width) + (((uint32_t)a##_x >> 16) & ~7)) * (1 + a##_pitch) + (((uint32_t)a##_x >> 16) & 7))
#define READ_PIXEL_8(a)       (JaguarReadByte(a##_addr+PIXEL_OFFSET_8(a), BLITTER))
//#define READ_PIXEL_8(a)       (JaguarReadByte(a##_addr+PIXEL_OFFSET_8(a)))

// 16 bpp pixel read
#define PIXEL_OFFSET_16(a)    (((((uint32_t)a##_y >> 16) * a##_width) + (((uint32_t)a##_x >> 16) & ~3)) * (1 + a##_pitch) + (((uint32_t)a##_x >> 16) & 3))
#define READ_PIXEL_16(a)       (JaguarReadWord(a##_addr+(PIXEL_OFFSET_16(a)<<1), BLITTER))
//#define READ_PIXEL_16(a)       (JaguarReadWord(a##_addr+(PIXEL_OFFSET_16(a)<<1)))

// 32 bpp pixel read
#define PIXEL_OFFSET_32(a)    (((((uint32_t)a##_y >> 16) * a##_width) + (((uint32_t)a##_x >> 16) & ~1)) * (1 + a##_pitch) + (((uint32_t)a##_x >> 16) & 1))
#define READ_PIXEL_32(a)      (JaguarReadLong(a##_addr+(PIXEL_OFFSET_32(a)<<2), BLITTER))
//#define READ_PIXEL_32(a)      (JaguarReadLong(a##_addr+(PIXEL_OFFSET_32(a)<<2)))

 

// 1 bpp pixel write
#define WRITE_PIXEL_1(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_1(a), (JaguarReadByte(a##_addr+PIXEL_OFFSET_1(a), BLITTER)&(~(0x01 << PIXEL_SHIFT_1(a))))|(d<<PIXEL_SHIFT_1(a)), BLITTER); }
//#define WRITE_PIXEL_1(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_1(a), (JaguarReadByte(a##_addr+PIXEL_OFFSET_1(a))&(~(0x01 << PIXEL_SHIFT_1(a))))|(d<<PIXEL_SHIFT_1(a))); }

// 2 bpp pixel write
#define WRITE_PIXEL_2(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_2(a), (JaguarReadByte(a##_addr+PIXEL_OFFSET_2(a), BLITTER)&(~(0x03 << PIXEL_SHIFT_2(a))))|(d<<PIXEL_SHIFT_2(a)), BLITTER); }
//#define WRITE_PIXEL_2(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_2(a), (JaguarReadByte(a##_addr+PIXEL_OFFSET_2(a))&(~(0x03 << PIXEL_SHIFT_2(a))))|(d<<PIXEL_SHIFT_2(a))); }

// 4 bpp pixel write
#define WRITE_PIXEL_4(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_4(a), (JaguarReadByte(a##_addr+PIXEL_OFFSET_4(a), BLITTER)&(~(0x0f << PIXEL_SHIFT_4(a))))|(d<<PIXEL_SHIFT_4(a)), BLITTER); }
//#define WRITE_PIXEL_4(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_4(a), (JaguarReadByte(a##_addr+PIXEL_OFFSET_4(a))&(~(0x0f << PIXEL_SHIFT_4(a))))|(d<<PIXEL_SHIFT_4(a))); }

// 8 bpp pixel write
#define WRITE_PIXEL_8(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_8(a), d, BLITTER); }
//#define WRITE_PIXEL_8(a,d)       { JaguarWriteByte(a##_addr+PIXEL_OFFSET_8(a), d); }

// 16 bpp pixel write
//#define WRITE_PIXEL_16(a,d)     {  JaguarWriteWord(a##_addr+(PIXEL_OFFSET_16(a)<<1),d); }
#define WRITE_PIXEL_16(a,d)     {  JaguarWriteWord(a##_addr+(PIXEL_OFFSET_16(a)<<1), d, BLITTER); if (specialLog) WriteLog("Pixel write address: %08X\n", a##_addr+(PIXEL_OFFSET_16(a)<<1)); }
//#define WRITE_PIXEL_16(a,d)     {  JaguarWriteWord(a##_addr+(PIXEL_OFFSET_16(a)<<1), d); if (specialLog) WriteLog("Pixel write address: %08X\n", a##_addr+(PIXEL_OFFSET_16(a)<<1)); }

// 32 bpp pixel write
#define WRITE_PIXEL_32(a,d)		{ JaguarWriteLong(a##_addr+(PIXEL_OFFSET_32(a)<<2), d, BLITTER); }
//#define WRITE_PIXEL_32(a,d)		{ JaguarWriteLong(a##_addr+(PIXEL_OFFSET_32(a)<<2), d); }

 

Edited by Cyprian

7 hours ago, CyranoJ said:

Do you have a binary?

Attached a slightly newer version. To run it, you need a Skunkboard with a FAT16- or FAT32-formatted flash drive in the first USB-A connector (the left-most one, viewed from the front of the Jaguar with the board in the cartridge slot). Then launch it and exercise the font blitting code as follows:

 

jcp -c usbffs.cof
<It dumps some debug info about your USB stick to the console>
> showlist
> clearlist
> drawstring
<If you're patient, or you don't have many files on your USB drive>
> ls
<Will very slowly render a directory listing on-screen and print it to the console>

 

2 hours ago, Cyprian said:

I wonder how the blitter works internally. Does it copy pixel by pixel, one bit at a time in this case?

 

11 minutes ago, Cyprian said:

In the case of pixels smaller than 8 bits (1/2/4bpp), the blitter needs to read the previous value of the destination, OR the new value into it, and write the result back. That means 3 memory accesses per pixel.

Yes, it goes pixel by pixel, bit by bit in this case, from my understanding, which matches the logic above (thanks @ggn). However, given the little trick it uses, it should only be 2 accesses (1-byte read and 1-byte write) per pixel on average, with an extra 1-byte read at the start of each line (using SRCENX, not SRCEN). For comparison, I added some code later last night to invert an entire line of characters using one blitter op. This should have roughly the same memory access pattern (1-byte read and 1-byte write per 1-bit pixel, after a trip through the LFU to invert the value), but it completes almost instantaneously even for rather large rects (well, large compared to a single character). Additionally, individual characters, each rendered using a single blitter operation, seem to pop up as a whole rather than line by line or pixel by pixel, and I'm not doing any double-buffering or anything.

 

The above leads me to believe it isn't a problem of raw throughput. Yes, the blitter is horribly inefficient at modifying 1bpp surfaces, but it shouldn't be any slower than modifying a similarly-sized 8bpp surface. Rather, there seems to be some problem where the blits aren't actually starting up or signalling completion as quickly as they should, but I don't know why. I grepped around the SDK and I seem to be waiting for blits to complete using almost the exact same code fragment the cpkdemo GPU rotation code does (See the .waitblit and .waitlast labels). I'm pre-calculating the parameters for the next blit in local registers before waiting for the blitter to complete the prior blit, in hopes of interleaving the work a bit.

usbffs.cof


Bleh, sorry for the noise. I looked at this with fresh eyes this evening, and was quickly able to spot the bug:

		movei	#B_COUNT, r4
		...
		movei	#B_CMD, r6
		...
		store	r5, (r6)			; Write op to B_CMD
		...
		load	(r4), r11			; (Always) Read back blit status
.waitblit:	btst	#0, r11				; See if bit 0 is set
		...

Simple cut-and-paste error: This function evolved from my prior blit function that stashes B_CMD in r4. Here, B_COUNT is in r4, and B_CMD is in r6, so the .waitblit loop was completing only when B_COUNT was odd, rather than when B_CMD indicated the blitter was idle. Apparently, rather than being too slow, the blitter is fast enough that it's rather hard to catch it on an odd pixel before it completes its work.

 

Interestingly, as an experiment I had moved on to writing some code to halt the 68k while it waited for the GPU to finish its work, in hopes that would speed things up. In that case, the blits usually hung entirely: stopping the 68k made the blits happen so fast that the GPU almost never caught B_COUNT on an odd count before the tiny glyph blit completed. I'll be leaving that code to stop the 68k in there.


15 hours ago, ggn said:

For the courageous, @Shamus has done quite some reading on the Jaguar netlists and has derived a lot of facts about how the blitter works internally. You can check out the blitter source right here (and I bet that any corrections are welcome!)

VJ has three blitter implementations:
MIDSUMMER_BLITTER, MIDSUMMER_BLITTER_MKII, and the 'normal' blitter.
Only two are supported by the emulator: MIDSUMMER_BLITTER_MKII and the 'normal' one.
For some reason, the most advanced one seems to be MIDSUMMER_BLITTER_MKII.
For example, the blitter2_e code runs well on it, but not on the original one (the character simply doesn't show up in the bar).

MIDSUMMER_BLITTER_MKII was planned for use in a future console (Jaguar II?) and was considered backward compatible with the 'original' blitter.

