
cog coding


EricBall


I've been doing a bunch of think-coding over on potatohead's Blog, trying to work through making Propeller code capable of generating a 240x240 resolution sprite display.

 

For those not familiar with the Propeller, it's an 8-way SMP processor in a very low cost ($13 each) package. Each processor (or cog) has 496x32 bits of RAM which functions as registers and code & data storage (self-modifying code is almost required), with 32K of shared RAM accessed in round-robin fashion (very deterministic once it gets going). Code is either written in a high-level interpreted language called Spin, or in native cog assembly. Guess which one I'm using...

 

Each cog also has a video generator, which leads to some interesting possibilities. The concept I'm working on is to use 240 entries of cog RAM as line RAM which gets written out in a very tight loop. Then once the line is displayed, the cog fetches sprite graphics for the next line while another cog is busy displaying the current line. The big advantage is each pixel is stored as a long and has the capability of covering the entire NTSC color gamut. The downside is I'm discovering how few sprites the system can actually display.

 

The last straw was clearing the line RAM:

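LineRAM	EQU	$100	' first of the 240 line RAM longs in cog RAM
D1	LONG	%0000_0000_0000_000000001_000000000	' a 1 in the destination field (bit 9)
Black	LONG	$01010101 * 7.5IRE	' placeholder: black level in all four bytes

MOV	count1, #240	' one pass per pixel
MOVD	ZeroLR, #LineRAM	' point the store at the start of line RAM
ZeroLR	MOV	LineRAM, Black	' self-modified: destination advances each pass
ADD	ZeroLR, D1	' bump the destination field by one long
DJNZ	count1, #ZeroLR	' 4 cycles per taken jump, 8 on the final fall-through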

This simple bit of code chews up 2892 cycles, or over half of the time per line. Ugh!
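For anyone checking the math, here is a rough Python tally of that loop. It assumes every cog instruction takes 4 clocks, that DJNZ costs 4 more when it falls through, and an 80 MHz system clock (the clock speed is an assumption, as are all the names below).

```python
# Rough cycle tally for the line RAM clear loop (a sketch, not measured on hardware).
CYCLES_PER_INSTR = 4           # normal Propeller cog instruction
DJNZ_FALLTHROUGH_EXTRA = 4     # DJNZ takes 8 clocks when it does not jump

SYSCLK_HZ = 80_000_000         # assumption: 5 MHz crystal with the 16x PLL
COLORBURST_HZ = 3_579_545      # NTSC color subcarrier
COLOR_CLOCKS_PER_LINE = 227.5  # NTSC line length in subcarrier cycles


def clear_loop_cycles(pixels=240):
    """Clocks used by the two setup instructions plus the three-instruction loop."""
    setup = 2 * CYCLES_PER_INSTR            # MOV count1 + MOVD ZeroLR
    loop = pixels * 3 * CYCLES_PER_INSTR    # MOV, ADD, DJNZ for every pixel
    return setup + loop + DJNZ_FALLTHROUGH_EXTRA


def cycles_per_line():
    """System clocks available in one NTSC scanline."""
    return SYSCLK_HZ * COLOR_CLOCKS_PER_LINE / COLORBURST_HZ


if __name__ == "__main__":
    used = clear_loop_cycles()
    print(used, "cycles, fraction of a line:", used / cycles_per_line())
```

Under those assumptions the loop lands on the 2892 figure, which is a bit more than half of one scanline.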

 

The solution is obvious - use 3 cogs for video. That doubles the number of cycles available for the write Line RAM routines.
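A quick sketch of why the third cog doubles the budget, assuming each video cog displays one line out of every N and spends the other N-1 line times building its next line. The function names are made up for illustration.

```python
# Line-time budget per video cog (a sketch under the assumption stated above).

def prep_lines(video_cogs):
    """Line times a cog has for filling line RAM between its own display lines."""
    if video_cogs < 2:
        raise ValueError("need at least two video cogs to overlap display and prep")
    return video_cogs - 1


def speedup_vs_two_cogs(video_cogs):
    """Prep time relative to the two-cog setup (one displays, one prepares)."""
    return prep_lines(video_cogs) / prep_lines(2)


if __name__ == "__main__":
    for cogs in (2, 3, 5):
        print(cogs, "cogs:", speedup_vs_two_cogs(cogs), "x prep time")
```

Going from 2 to 3 cogs doubles the prep time. Each cog after that adds one more line time, so the relative gain shrinks.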

 

I'm hoping it will also even out the overall output timing. The plan is for the output routine to finish after writing a set of synch pulse pixels. This should set the output bits to 0 so it won't interfere with the output of the active cog. But I'm guessing the video generator is still running, even though the cog isn't feeding it any new pixel data. Which means when the code goes back to writing, it's going to resynch back to the video generator clock. So because the pixel clock is 4.77MHz, that means there will be 303 pixels per line, which is 1/3 of a pixel short versus the spec. (Not a huge crime, but it's better to be in spec if possible.)

 

But since I'm using 3 cogs for video, I can accumulate that extra 1/3 pixel across the 3-line sequence. So each cog generates 303 pixels of output, and then spends 607 pixels of time generating its next line of output. And we're back in spec.
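The pixel counts can be checked with exact fractions. This sketch assumes the 4.77MHz pixel clock is exactly 4/3 of the 3.579545MHz color subcarrier and that an NTSC line is 227.5 subcarrier cycles. The names are illustrative only.

```python
# Exact pixel arithmetic for the 3-cog line sequence (a sketch, assumptions above).
from fractions import Fraction

COLOR_CLOCKS_PER_LINE = Fraction(455, 2)   # 227.5 subcarrier cycles per NTSC line
PIXELS_PER_COLOR_CLOCK = Fraction(4, 3)    # assumption: pixel clock = 4/3 colorburst


def pixels_per_line():
    """In-spec line length measured in pixel clocks (not a whole number)."""
    return COLOR_CLOCKS_PER_LINE * PIXELS_PER_COLOR_CLOCK


def cog_schedule(video_cogs=3, output_pixels=303):
    """Return (pixels output, pixel times left for building the next line)."""
    total = pixels_per_line() * video_cogs
    return output_pixels, total - output_pixels


if __name__ == "__main__":
    print("pixels per line:", pixels_per_line())
    print("per-cog schedule:", cog_schedule())
```

One line is 910/3 pixels, so three lines come to a whole 910 pixels, which splits into 303 of output and 607 of preparation per cog.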

 

I'm also hoping I can figure out how to synch the 3 cogs together so they are all starting together. Then I can use the round-robin access delay to start each cog 1/3 pixel from each other.

6 Comments



How about using five cogs for display, to allow a 4x speed boost?

I think at some point there's a law of diminishing returns. You could very easily end up without enough cog power left to actually manipulate those sprites.

 

Going with more video cogs might make it feasible to go back to the calculated byte->long translation for color, thus leaving half of the cog RAM for code which could do things during VSYNC. More time might also make a tiled background possible, but I think that would need too many hub cycles to be effective for anything more than lo-res.

 

I think 3 cogs is a reasonable sweet spot for my design. I particularly like the 1/3 pixel side-effect.

Link to comment
The Donkey Kong game in progress ended up with plenty of game power, after consuming 6 cogs for video. Still that's excessive and the ideas you've come up with here will improve on that. I will be able to do some coding next week, now that I'm back home with a display!

 

I've toyed with running two video generators together. At the time I ran them both as full screen bitmaps for a layering effect. It was hard to get the color synced, but otherwise they ran just fine. The outputs are OR'ed together. As long as you keep them fed with the proper waitvids, your idea of just running zeros should work just fine. I never thought about having them output their own scanlines! I was thinking about one master SYNC cog, with the others drawing into the graphics area. This method will take one less COG.

 

If you want to run some code, just let me know. I'm seriously impressed with your ability to just consider this stuff. How did you get that skill? It's just the way I need to work given my schedule and availability on the machine.

 

I know how to sync the COGS. Did that when trying to overlay two bitmaps.

 

You store the counter in SPIN, just prior to launching assembly code. No matter what, a small three line spin wrapper is required to start any program. On boot, cog 0 launches SPIN, which can then launch assembly.

 

Store a value, add some time to it, then have the COGs launch, watch the system counter, then begin executing code, based on that count. The system counter is shared among all COGs, cannot be zeroed and is 32 bit. One master assembly COG could do the same thing, with a shared HUB memory location as well.

 

Learned something about waitvid this morning too. It stops the COG, until the video generator is ready. Once the generator is loaded (4 cycles), the COG continues executing code. If instructions are well timed, delays per waitvid can be held to as few as 8 cycles.

 

Once started the video generators keep running. If they are not fed with waitvids, they grab whatever happens to be in the D & S registers at the time. Your idea of using the COG video registers together like this is interesting and new. All other game engines currently use one video generator and feed data to it. This is probably why so many COGs are being consumed. IMHO, keeping them all synced with similar waitvids will reduce throughput, but will keep things deterministic.

Link to comment
If you want to run some code, just let me know. I'm seriously impressed with your ability to just consider this stuff. How did you get that skill? It's just the way I need to work given my schedule and availability on the machine.

I've spent a lot of time over a lot of years working at the assembly level on a wide variety of processors. It's kinda like knowing several spoken languages, each new one becomes easier to learn (assuming it's not radically different from what you know).

I know how to sync the COGS. Did that when trying to overlay two bitmaps.

Yeah, I saw a post on the Parallax forum on how to sync the cogs. It was a kind of a "duh" moment. WAITCNT to sync to a fixed time on all cogs, then a hubop to sync them to their position on the hub.
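Here is a toy Python model of that two-step sync. It assumes each cog's hub window comes around every 16 system clocks and that consecutive cog numbers are 2 clocks apart. Which counter value lines up with cog 0's window is not something the model knows, so the phase is arbitrary. All names are hypothetical.

```python
# Toy model: WAITCNT to a shared target, then a hub op to lock onto the hub rotation.
HUB_PERIOD = 16             # system clocks between hub windows for any one cog
SLOT_SPACING = 2            # clocks between windows of consecutive cog numbers
COUNTER_MASK = 0xFFFFFFFF   # the system counter is 32 bits and wraps around


def sync_target(cnt_now, delay):
    """Counter value every cog waits for; wraps like the 32-bit counter."""
    return (cnt_now + delay) & COUNTER_MASK


def next_hub_window(cog_id, release_time):
    """First clock at or after release_time that falls in this cog's hub slot.

    Assumption: cog 0's slot sits at counter values divisible by 16.
    """
    slot = (cog_id * SLOT_SPACING) % HUB_PERIOD
    wait = (slot - release_time) % HUB_PERIOD
    return release_time + wait


if __name__ == "__main__":
    target = sync_target(1000, 3)
    for cog in (1, 2, 3):
        print("cog", cog, "resumes at clock", next_hub_window(cog, target))
```

In this model, after the hub op the cogs sit a fixed 2 clocks apart (modulo 16) no matter when the shared WAITCNT target falls. That is a fixed stagger, though not necessarily the 1/3 pixel spacing wanted.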

Learned something about waitvid this morning too. It stops the COG, until the video generator is ready. Once the generator is loaded (4 cycles), the COG continues executing code. If instructions are well timed, delays per waitvid can be held to as few as 8 cycles. Once started the video generators keep running. If they are not fed with waitvids, they grab whatever happens to be in the D & S registers at the time. Your idea of using the COG video registers together like this is interesting and new.

Uh-oh. The video generator just grabs D&S even without a WAITVID? I was assuming the shift registers would just go to zero until the next WAITVID. Which means the cogs which aren't outputting video need to do something to the video generator so they don't corrupt the output of the cog which is generating video. Might be sufficient to simply change the direction register to input for the cogs which aren't putting out video.

Link to comment

Yep, you gotta feed it.

 

There are solutions though. One could set the VSCL to different values to keep the generator at bay, so multiple waitvids don't get in the way of processing. At first, I thought the COG was totally consumed during a waitvid from start until end. This is not the case. Once the sync has happened, and the video generator grabs the D & S registers (instruction bytes and registers are referred to here and I'm not sure they are differentiated), the COG can go on about its business while the pixels are being shifted out.

Link to comment

Changing the direction registers might be interesting... The pins are common to all COGs. I'll have to do some reading on that aspect of the chip. The designer of the thing is not into non-flexible operation. I'll bet that was a design choice, just in case.

 

Was on a long drive back from helping a friend today thinking about this. I think all three video COGs can run the same code. Start one of them, and it looks to see if it's the first one running or not. Use a flag in HUB memory for this. It then starts the others, which make the same analysis. All of them work from a master line counter in HUB memory and shared parameters for resolution, number of lines, etc...

 

Any of them can output the SYNC signals necessary, if it's their role to do so.
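A small Python sketch of the same-code-on-every-cog idea, with a plain object standing in for the shared HUB memory. It assumes the cogs start one after another, so it ignores any race on the shared flag, and it assumes lines are handed out round-robin. Every name here is hypothetical.

```python
# Sketch: identical video cogs claim a role from shared memory, then split the lines.

class HubMemory:
    """Stand-in for the shared HUB variables all video cogs read."""

    def __init__(self, video_cogs=3):
        self.video_cogs = video_cogs
        self.cogs_started = 0    # the "am I first?" flag, kept as a running count

    def claim_role(self):
        """Called once by each cog at startup; role 0 is the first cog running."""
        role = self.cogs_started
        self.cogs_started += 1
        return role


def owner_of_line(line, video_cogs=3):
    """Which role outputs a given scanline, assuming a simple round-robin split."""
    return line % video_cogs


if __name__ == "__main__":
    hub = HubMemory()
    roles = [hub.claim_role() for _ in range(hub.video_cogs)]
    print("roles:", roles)
    print("first six lines go to:", [owner_of_line(n) for n in range(6)])
```

On real hardware the flag update would have to be safe against two cogs reading it at once, which this sketch does not attempt to model.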

Link to comment