Known fast line algorithms ?

R0ger · November 12, 2014

Thanks for tips (and support) !

Actually .. it seems to me that jmp is faster then branch in these cases. Branch is 3 or 4 cycles if jumping (2 if not), jmp is 3 cycles alwyas. At least based on reference I use. You have to read one more byte from memory, but you do no address math and no testing.

R0ger · November 14, 2014

So I added steep angles variant. It does not benefit that much from the unwrap. On average it makes 53 cycles per pixel.

But I want to make the steep variant to draw from both ends. So there is still some room for improvements.

Then I have to finish the routines with proper line start and end. I also have to somehow solve changing of colors ! That can only slow the code down. But I hope not much. In worst case I can have routines for every color. It's of course more of a problem in flat variant, steep variant works same as my original line.

Again there is huge disparity between vertical and diagonal lines. Pure vertical needs only 25 cycles per pixel, as the mask just stays static.

Current version: http://roger.questions.cz/atari/line-unwrap3.asm

Edited November 14, 2014 by Dr.Sid

dmsc · November 14, 2014

Hi!,

So I added steep angles variant. It does not benefit that much from the unwrap. On average it makes 53 cycles per pixel.

But I want to make the steep variant to draw from both ends. So there is still some room for improvements.

Then I have to finish the routines with proper line start and end. I also have to somehow solve changing of colors ! That can only slow the code down. But I hope not much. In worst case I can have routines for every color. It's of course more of a problem in flat variant, steep variant works same as my original line.

Again there is huge disparity between vertical and diagonal lines. Pure vertical needs only 25 cycles per pixel, as the mask just stays static.

Current version: http://roger.questions.cz/atari/line-unwrap3.asm

If you don't want to update the mask at all positions, you can replace the code:

DrawByte .macro x
        lda (pos),y
        ora #
        sta (pos),y

with:

DrawByte .macro x
        lda mask
        and #
        ora (pos),y
        sta (pos),y

It is a little slower, but simple.

For the line start/end you can simply use the old pixel-at-a-time code until the X (or Y) coordinates are divisible by 4, and then jump to the fast code, or at start you can special case the calculation of the code and jump to the middle of the already existing sequences....

Edited November 14, 2014 by dmsc

Heaven/TQA · November 14, 2014

have you checked here:

http://codebase64.org/doku.php?id=base:6502_6510_maths#the_art_of_3d

R0ger · November 14, 2014

Hi!,

If you don't want to update the mask at all positions, you can replace the code:
DrawByte .macro x
        lda (pos),y
        ora #
        sta (pos),y
with:
DrawByte .macro x
        lda mask
        and #
        ora (pos),y
        sta (pos),y
It is a little slower, but simple.

For the line start/end you can simply use the old pixel-at-a-time code until the X (or Y) coordinates are divisible by 4, and then jump to the fast code, or at start you can special case the calculation of the code and jump to the middle of the already existing sequences....

Yes, I didn't mean I don't know how to do it :-D

But thanks for reading the code !

R0ger · November 15, 2014

So I made the step angle variant which paints from both ends. And I hate it ! The code is really ugly. You can't use registers for anything. It's also pain in the ALU to have it all correct. But it is faster.

4 pixels per pattern has 48/53/55 cycles per pixel (best case, avg case, worst case). Yes, that 25 cycles per pixel yesterday for best case was some error.

8 pixels per pattern has 41/48/55 cycles per pixel.

flat variant for comparison, same method of measurement: 25/40/55 cycles per pixel. Note how the worst case is about the same no matter what method you use.

The difference is over 10% on average, almost 20% in best case. So I guess I will fly with it.

Sry, no code so far, it's a mess.

Edited November 15, 2014 by Dr.Sid

R0ger · November 19, 2014

Slow progress .. I have prolog and epilog for flat angle variant. I mean drawing the pixels before and after full byte patterns. I even managed to use the same loop for both prolog and epilog. I also introduced new variable with pattern count, which is computed in the beginning, so I can simply DEC it. So now I'm again few cycles faster (on longer lines at least, but that's what matters for me).

Well .. not sure if this line of mine will be fastest line ever, but it's good contender for the most complex one. The prolog and epilog combinations are pretty tough.

Normal case is you start with single pixels, then you switch to patterns, then again you switch to single pixels. But all those phases can be omitted in some cases. I have some nasty computations before I start to draw, mainly number of whole bytes I need to draw .. but that can be local optimized later, and it only happens once per line.

I hope I will have prolog for steep angles (no epilog needed here) tomorrow, so wait a bit longer for the code.

dmsc · November 19, 2014

Hi!,

Slow progress .. I have prolog and epilog for flat angle variant. I mean drawing the pixels before and after full byte patterns. I even managed to use the same loop for both prolog and epilog. I also introduced new variable with pattern count, which is computed in the beginning, so I can simply DEC it. So now I'm again few cycles faster (on longer lines at least, but that's what matters for me).

Well .. not sure if this line of mine will be fastest line ever, but it's good contender for the most complex one. The prolog and epilog combinations are pretty tough.

LOL, it really get's complicated... Funny thing is that if you restrict the line deltas to max. 160x192, and you have only 4x1 different start points (as each other start is equivalent), you have a total of 122880 possible lines. Seems you are writing almost all those combinations ;-)

I just searched a little about line-drawing papers, and found http://repository.cmu.edu/cgi/viewcontent.cgi?article=2551&context=compsci where they perform an approximation of drawing all segments with "n" movements the same, that approach could save some code.

R4ngerM4n · November 19, 2014

http://repository.cmu.edu/cgi/viewcontent.cgi?article=2551&context=compsci where they perform an approximation of drawing all segments with "n" movements the same, that approach could save some code.

There is something wrong with the URL - can't open it.

Edited November 19, 2014 by R4ngerM4n

MaPa · November 19, 2014

http://repository.cmu.edu/cgi/viewcontent.cgi?article=2551&context=compsci

Heaven/TQA · November 19, 2014

thx Mapa... esp. the "character based" approach is mainly used often in c64 demos as far as I have understood.

R0ger · November 21, 2014

Thanks for links. It has some good ideas, but mostly useless stuff at this stage. Btw. Dragonstomper, you are RGB author ? Czech COMRADE !? How come we didn't meet on Atariada ?

Anyway: http://roger.questions.cz/atari/line2.asm (read at your own risk).

Still lots of work. Mainly up variant. It's still not implemented, so do not use this variant yet ! I also have to clean all up .. improve comments. Make benchmark so it can be compared to ERU's line.

But I believe all cases (for down direction) now work for flat and steep. The flat variant produces pixel by pixel the same results as my old line. Steep variant differs in lower half, as this up/down variant is more symmetrical than I could achieve in old line.

I added color setting. It's quite cheap, at the price of setcolor being a bit slower. Still, in my 'project' setcolor will be called only rarely.

Basically instead ( ora $ff ) I now have ( ora mff ). 'mff' is zero page variable, which for color 3 contains $ff.

I don't need all 16 variants .. I only need 10, cause i don't need patterns like 11001100. You will see it clearly later when I post my zero page declarations.

For prolog and epilog I use my old zero page table 03,0c,30,c0 .. but I made it so that the table overlays with m03,m0c,m30,mc0 variables, so it does not take any more space.

On the other hand I need another table mask2, which has the same bytes in the reversed order, but I might yet solve that.

So basically when doing SetColor, I have to set those 10 (mxx+mask) + 4 (mask2) bytes in zero page. Which is no big deal.

So after all this, I did average benchmarking - by filling square 80x80, line starting in 0,0 to all pixels in lower and right edge (so all angles are included). All lines have 80 pixels.

This is the test, in which my old line does 64 cycles per pixel.

This new line does 42 cycles per pixel in this test. Which is 35% boost, pretty sweet.

Code length of the new line is A1C bytes, and it does not contain Up variant, so it will be almost double that. Quite a price, but perfectly acceptable for my purpose.

There is also one more idea, which can make it faster still. I now use Dlist with 80 graphic lines. I could align all lines to page boundary. That way I could only INC high byte of VRAM address to move down.

I have 3 buffers 40 bytes wide for triple-buffering, that will leave 136 bytes free in each line, which will be challenge to use. It will effectively cripple half of my basic 64k. Still I have another half for the code, and I think I might be able to store some static data, like texts, into those holes. Certainly worth try !

Edited November 21, 2014 by Dr.Sid

Irgendwer · November 21, 2014

Anyway: http://roger.questions.cz/atari/line2.asm (read at your own risk).

Check order of operations:

L_line		
		;dx
		lda x2
		sub x1
		sta dx

		;dy
		lda y2
		sub y1
		sta dy

		;dxy = dx-dy
		lda dx
		sub dy
		sta dxy


change to: 

L_line		
		;dy
 		lda y2
		sub y1
		sta dy

		;dx
		lda x2
		sub x1
		sta dx

		;dxy = dx-dy
		sub dy
		sta dxy

pseudografx · November 21, 2014

Thanks for links. It has some good ideas, but mostly useless stuff at this stage. Btw. Dragonstomper, you are RGB author ? Czech COMRADE !? How come we didn't meet on Atariada ?

Unfortunately, MaPa (not Dragonstomper) never visits Atariada, but he regularly visits Forever and Bytefest so perhaps we all can meet there!

R0ger · November 21, 2014

Thanks for reading the code. There is more small things like that I guess.

And I have another idea worth 2 cycles pixels. Check my decision tree for flat variant .. after each failed BCC there is SUB (in other words SEC) .. and after each successful BCC there is ADD (in other words CLC). Basically in whole decision tree carry state already is how I need it, and I don't have to set it or clear it.

With steep variant I use SUB in both cases, but that can be easily fixed.

I plan to visit some more meetings in future, so let's hope we'll meet some day !

R0ger · November 22, 2014

I removed redundant SEC and CLC, which gave me almost 2 cycles per pixel as expected. I also tried the page per line approach. I didn't really make it work, I just tested how fast would it be, by replacing every adw #40 with INC, and testing the speed. It goes from 41 cycles per pixel to 37. It helps only with vertical steps of course. I guess I will run my project without this variant at the moment, and use it later if needed, or generally leave this to some later stage of development.

R0ger · November 25, 2014

So finally .. I made the 'up' variant (put it all in standalone file, line3.asm), added masking of old content for benchmarking. There was still ton of errors I had to find and fix, but now it seems to be OK.

I'm attaching the source code and little demo.

This demo cannot be directly compared to older exes posted here. It uses different graphics mode, and other things I suspect. But I don't have source code of the old demos, so I can't make one comparable.

If you are curious what are you looking at though, this demo makes about 680 lines per second, lines being randomly chosen from rectangle 160x80 pixels. That older demos used at least twice as much pixels though, so it should be about twice as much slower.

There most probably still is room for some small local optimizations, especially jumps and subroutines reordering. I wish there was some compiler which could do that, it's not really job for humans :-D

line2.zip

main.xex

Edited November 25, 2014 by Dr.Sid

+Stephen · November 25, 2014

Damn - that is impressive.

emkay · November 25, 2014

If you are curious what are you looking at though, this demo makes about 680 lines per second, lines being randomly chosen from rectangle 160x80 pixels. That older demos used at least twice as much pixels though, so it should be about twice as much slower.

It's even slower than 1/2 because of the (additional 3200) DMA cycles per frame.

Optimised line draw code on the best mode for doing calculated graphics, shows where the A8 has it's strengh.

Hehe... now imagine "Rescue on Fractalus" with optimised drawing and look up table for the maths on 64K or even 128K machines...

Who'd need a faster CPU?

Rybags · November 25, 2014

It would be interesting to see optimisations of RoF, Mercenary, Star Raiders and a good number of other games by simply throwing lots of Ram at the problem.

Even most of the "greatest" games in a technical sense in the day usually had constraints like 48K compatability, and the flow-on effect was sacrificing speed for compactness.

Heaven/TQA · November 25, 2014

RoF works completly different and last 2 cents for RoF/Koronis Rift... RAM would not help much (ok... for the 3d maths yes like mul-tables, perspective, runrolled code) but not for the Fractal routine itself as it might use recursion... but what we discussed for RoF is precalc more of the map in advance... but that helps only for fixed flight paths though...

Edited November 25, 2014 by Heaven/TQA

R0ger · November 25, 2014

RoF works completly different and last 2 cents for RoF/Koronis Rift... RAM would not help much (ok... for the 3d maths yes like mul-tables, perspective, runrolled code) but not for the Fractal routine itself as it might use recursion... but what we discussed for RoF is precalc more of the map in advance... but that helps only for fixed flight paths though...

Any links to how exactly does RoF works ? I always wondered about that.

As for Mercenary, I did some profiling on it. Not exactly reverse engineering, but still I can say that their line is quite fast, and so is their math. What is good on Mercenary is that is has 2 versions, low res and high res. By measuring FPS on both you can estimate how much time is spend in drawing and how much in math.

Edited November 25, 2014 by Dr.Sid

Heaven/TQA · November 25, 2014

Regarding RoF yes... but you have to wait little bit then I can give you more insights... stay tuned.

- Loren Carpenter 6502 specific, only MSX port has same algo and Ballblazer Intro Fractal flight on Amiga (you can recon the "errors", if they are missing, adapted fractal code)

- Grid based

- no heightmap but "node map"

- special EOR filler
- fast perspective corrections

- no rotation of 2d points (only "tilting") as viewport is only 48 scanlines you can not recognise

- double buffering with long scanlines (line,x) x=index into buffer1 or buffer2.

...

- yet unknown... how the hell they managed "clipping".

Edited November 25, 2014 by Heaven/TQA

R0ger · November 25, 2014

Yes, it seems clear that they basically draw a grid, only the lines they use are funky. The 'crappy line' algorithm seem to me the most interested one. I could understand how they do all else pretty well.

It might be something like my pattern approach to lines .. except they use distorted patterns, not just straight lines.

Clipping can be done in many ways, and it's usually nothing really slow. I don't think it's worth investigating.

ilmenit · November 25, 2014

Drawing lines is easy. All you need is to add to height value precalculated in array where initial and final value is 0. F.e. [+0, +1, +2, +3, +6, ..., +2, +1, +0], like in midpoint displacement. I think RoF has only 3-4 precalulated arrays of length 64 and even these arrays look simple. There is one U shaped, one double U and one tripple U - you can see them on screenshots when two grid corners have similar height.

Some time ago I've been playing with RoF-like engine using raycasting and because of symmetry everything can be precalculated. In 64K of memory you can store all the arrays and have the engine without a single multiplication, only 8bit additions. I've lost the latest sources but this RoF.zip gives an idea.

Known fast line algorithms ?

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members