Raycasting demo

JohnPCAE · June 10, 2013

Lots more multiplication speed

JohnPCAE · June 12, 2013

Added a special-case fixed-point multiplication routine. It is only used when one of the multiplicand values is guaranteed to be in the range -1.0 ... +1.0. This actually covers most of the calls to the multiplication routine and results in a noticeable speed boost.
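
For readers new to 8.8 fixed point, here is a minimal Python sketch of why the restricted range helps. It is not the routine from the ZIP file; the function names are made up for illustration, and the only point is that a value in -1.0 ... +1.0 has a magnitude of 0..256 in 8.8 form, so the special case only has to cope with that small multiplier range.

```python
FRAC_BITS = 8  # 8.8 format: 8 integer bits, 8 fractional bits


def to_fixed(x: float) -> int:
    """Convert a float to a signed 8.8 fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))


def fixed_mul(a: int, b: int) -> int:
    """General 8.8 multiply: take the full product, then drop 8 fraction bits."""
    return (a * b) >> FRAC_BITS


def fixed_mul_unit(a: int, b: int) -> int:
    """Hypothetical special case: b must lie in -1.0 .. +1.0 (-256 .. +256 in 8.8).

    The magnitude of b then fits in 0..256, so only that small multiplier
    range has to be handled; the sign is applied afterwards.
    """
    if not -256 <= b <= 256:
        raise ValueError("b must be in the range -1.0 .. +1.0")
    product = (a * abs(b)) >> FRAC_BITS
    return -product if b < 0 else product


if __name__ == "__main__":
    a = to_fixed(3.5)
    b = to_fixed(0.5)
    print(fixed_mul_unit(a, b) / 256)
```
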

If anyone wanted to, say, port Wolf3D to the Inty, the frame rate might be good enough now.

raycast.zip

JohnPCAE · April 15, 2014

I've been taking a second look at my fixed-point 8.8 multiplication routines and it might be possible to optimize them some more. I built a list of multiplication "patterns" and a Java program that validates them and searches for the optimal ones, depending on whether a pattern is used for the first stage (low byte) or the second stage (high byte). I'd like to get it to the point where it can autogenerate the necessary CP1600 assembly; now that would be awesome.

As an example, my patterns look something like this:

Assuming the value to be multiplied is in R2:R0, encoded in the following format: 0000 0000 HHHH HHHH : LLLL LLLL 0000 0000

And the end 8.8 fixed-point result is to be in R5, with R4 used as an intermediate register

Some sample patterns for various multipliers (many more potential variations are possible):

Multiplier   Patterns

0            (empty)

1            +

2            ++    .+    -+++    --..+    --:+    .-.+

3            +++    +.+

4            ++++    ..+    :+

.

252          :-:::+

253          -.-.:::+

254          --::::+

255          +.+.+.+.+.+.+.+    -::::+

256          ::::+

+ means add R2:R0 to R5:R4 (remember the carry bit!)

- means subtract R2:R0 from R5:R4 (remember the carry/borrow bit!)

. means shift R2:R0 left by 1 bit

: means shift R2:R0 left by 2 bits

* means add R2 to R5 (this is only valid for phase 2, when R0 is always 0). Think of it as an abbreviated + operation; it is only used to substitute for the ending + in a pattern.
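
A pattern can be checked without touching any registers by tracking only the scale factors. The Python sketch below is not the Java validator mentioned in this post; `pattern_multiplier` and `SAMPLES` are names invented here. It treats the shifted operand (R2:R0) as a weight and the accumulator (R5:R4) as a running multiple, and it treats `*` the same as `+`, which is arithmetically equivalent when R0 is 0.

```python
def pattern_multiplier(pattern: str) -> int:
    """Return k such that running the pattern leaves k * value in the accumulator."""
    weight = 1  # how far the operand has been shifted so far (1, 2, 4, ...)
    total = 0   # multiple of the original value accumulated so far
    for op in pattern:
        if op in "+*":      # add the (shifted) operand; '*' is the short form of '+'
            total += weight
        elif op == "-":     # subtract the (shifted) operand
            total -= weight
        elif op == ".":     # shift the operand left by 1 bit
            weight <<= 1
        elif op == ":":     # shift the operand left by 2 bits
            weight <<= 2
        else:
            raise ValueError(f"unknown pattern element: {op!r}")
    return total


# The sample patterns listed in the post, keyed by the multiplier they claim.
SAMPLES = {
    0: [""],
    1: ["+"],
    2: ["++", ".+", "-+++", "--..+", "--:+", ".-.+"],
    3: ["+++", "+.+"],
    4: ["++++", "..+", ":+"],
    252: [":-:::+"],
    253: ["-.-.:::+"],
    254: ["--::::+"],
    255: ["+.+.+.+.+.+.+.+", "-::::+"],
    256: ["::::+"],
}

if __name__ == "__main__":
    for k, patterns in SAMPLES.items():
        for p in patterns:
            assert pattern_multiplier(p) == k, (k, p)
    print("all sample patterns check out")
```

This only validates the arithmetic; it says nothing about cycle counts, which depend on the actual CP1600 instructions chosen for each element.
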

These are the basic pattern elements, but there are more specialized ones if I'm able to use more registers or use them in different ways. Each operation has a cost in clock cycles, and the Java program can build some extra patterns based on existing ones. I have patterns using the basic operations above for multiplying by any value from 0-256, though there could always be some that I haven't found yet (not including specialized register-specific variants). There's a lot of potential to improve the multiplication speed, I think.

For example:

A pattern for multiplying by x+1 can always be built by +(pattern for x). This effectively eliminates the need to store separate patterns for odd multipliers.

Likewise, a pattern for multiplying by x-1 can always be built by -(pattern for x).

If I have the value for R2:R0 in a single 16-bit register before splitting it into R2:R0 format, then the ::::+ pattern can be replaced by a single operation that adds it to R5.
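
The x+1 and x-1 rules work because the operand is still unshifted at the start of a pattern, so a leading + or - contributes exactly one times the value. A small Python sketch of that derivation follows; the helper names are hypothetical, and the evaluator is a compact stand-in rather than the Java program described above.

```python
def pattern_multiplier(pattern: str) -> int:
    """Compact evaluator: returns the multiplier a pattern implements."""
    weight, total = 1, 0
    for op in pattern:
        if op == "+":
            total += weight
        elif op == "-":
            total -= weight
        elif op == ".":
            weight <<= 1
        elif op == ":":
            weight <<= 2
        else:
            raise ValueError(f"unknown pattern element: {op!r}")
    return total


def plus_one(pattern: str) -> str:
    """Pattern for x+1, built from a pattern for x by prepending '+'."""
    return "+" + pattern


def minus_one(pattern: str) -> str:
    """Pattern for x-1, built from a pattern for x by prepending '-'."""
    return "-" + pattern


if __name__ == "__main__":
    # Even-multiplier patterns taken from the sample list in the post.
    even = {2: ".+", 4: ":+", 254: "--::::+", 256: "::::+"}
    for k, p in even.items():
        print(k + 1, plus_one(p), "|", k - 1, minus_one(p))
```

Note that the extra element has to go at the front: appending it would add the value after it has already been shifted.
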

I'm still investigating more advanced ways of building patterns based on what register combinations are available at different times, but it has the potential to lead to a highly optimized way of multiplying 8.8 fixed-point numbers in native CP1600 code.

Edited April 15, 2014 by JohnPCAE

intvnut · April 15, 2014

This is interesting.

I had written a multiply generator some time back for integer multiplies that comes up with similar patterns to what you're computing. It didn't compute fixed-point MPYs, but I thought it might be fun to compare notes. I noticed many of our patterns are similar.

Attached is my C code and what it generated, if you'd like to take a look.

(The silly ".c.txt" extension is to get around AA's silly file extension restrictions.)

mpyk.asm

mult_by_constant.c.txt

JohnPCAE · January 29, 2015

Lately I did some more work on this and have an improved multiplication routine. It's a little faster and doesn't need separate hi- and low-byte kernels. See the top post for an updated ZIP file.

Edited January 29, 2015 by JohnPCAE

artrag · August 13, 2015

Where is the latest ROM with the improved multiplication?

JohnPCAE · August 13, 2015

The top post. I updated it when I updated the ROM.

artrag · August 16, 2015

Thanks! It looks awesome. Greetings!

On the MSX TR (which uses a lot more resources; it is another category of machine) I did this:

Edited August 16, 2015 by artrag

JohnPCAE · September 21, 2015

I couldn't sleep and wound up making some more optimizations, basically by removing overhead from the multiplication routine.

raycast_20150921.zip

Edited September 21, 2015 by JohnPCAE

JohnPCAE · September 22, 2015

A little more speed: removed some more overhead from the multiplication routine (mainly by splitting it up into two separate use cases) and sped up the colored squares rendering routine a bit.

raycast_20150921_2.zip

artrag · September 24, 2015

Very good.
By the way, would it be possible to build the screen in a hidden buffer?
How many GRAM tiles do you use now?

It seems you are using 10 x 6 tiles.

How many columns do you render?

If you render 10*8 = 80 columns, you could speed up the computation by using fewer angles, say 40.

All you need to do is set two pixels at a time in your GRAM cards and keep the window 10 tiles wide.

About the walls, I see you can reuse the same tile vertically for large portions of the image.

I have the feeling you "blit" all the GRAM tiles column by column, without exploiting the fact that you can replace a whole tile instead of passing over it bit by bit 8x8 = 64 times during the rendering.

From what I see, the time needed to update the GRAM is about one or two frames.
The tearing is very evident.
If you were able to use fewer than 32 tiles, you could swap between the two subsets of tiles at each scene update.

If, as I think, you blit the whole 10x6 tiles bit by bit, you have room to improve the rendering speed.

A simple strategy for filled walls could be:

Compute the height of each column with your raycasting engine and store it in an array in RAM (now 80 bytes).

For each column, compute how many whole tiles are needed (divide the 80 values by 8 with a shift) and group the columns 8 at a time.

Use a filled tile (no blitting, use GROM - CARD 95) to plot the part that is common to all 8 columns in a group (find the minimum of the 8 values).

Render the 8 leftover heights in a set of GRAM cards, this time bit by bit as you do now (use "and" and the minimum above to find the 8 remainders).
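
The steps above can be sketched in a few lines of Python. This is one reading of the proposal, not code from the demo: `split_group` and `plan_frame` are invented names, heights are assumed to be in pixels, and the sketch only counts cards; it does not decide where on screen they go.

```python
CARD = 8  # a card is 8 pixels wide and 8 pixels tall


def split_group(heights):
    """Split 8 adjacent column heights into a shared part and per-column leftovers.

    Returns (common_cards, remainders):
      common_cards - number of fully filled cards shared by all 8 columns,
                     which can be drawn with a ready-made solid card
      remainders   - pixels left over per column, to be drawn bit by bit
    """
    if len(heights) != CARD:
        raise ValueError("expected exactly 8 column heights")
    common_cards = min(heights) >> 3  # divide the minimum by 8 with a shift
    remainders = [h - common_cards * CARD for h in heights]
    return common_cards, remainders


def plan_frame(column_heights):
    """Apply split_group to every run of 8 columns (80 columns -> 10 groups)."""
    return [
        split_group(column_heights[i:i + CARD])
        for i in range(0, len(column_heights), CARD)
    ]


if __name__ == "__main__":
    print(split_group([20, 22, 24, 26, 28, 30, 32, 34]))
```

A leftover can exceed 8 pixels when the columns in a group differ a lot in height, so a group may still need more than one GRAM card for its ragged part.
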

Edited September 24, 2015 by artrag

JohnPCAE · September 24, 2015

You make some good points, though skimming them at 4am is causing most of them to sail over my head.

Actually, this program has two modes: GRAM and Colored Squares. The side buttons will toggle you between the two modes, and the numeric keys can be used to set the rendering distance. The frame rate in colored squares mode is MUCH faster, for different reasons (fewer pixels, and I draw them a whole card at a time).

I've made some more optimizations (this time to the main general-purpose multiplication routine as opposed to the special-case one). In colored-squares mode, the frame rate seems noticeably better.

raycast_20150924.zip

artrag · September 24, 2015

In colored square mode you plot 1/4 of the pixels but you blit column by column again.

I am proposing a more complex approach that allows you to plot the repetitive part of the walls as columns of the same card.

JohnPCAE · September 26, 2015

I think I see. Well, for a start, I changed the Colored Squares mode to first determine the wall heights and then render the image all at once. The tearing is no longer visible in that mode now. I also fixed several bugs in my multiplication routine and added some text that shows what the side buttons and keypad keys do.

raycast_20150926.zip

artrag · September 26, 2015

Very good. Actually I think that in colored squares mode you can plot two columns at a time by plotting the repetitive blocks with a single access to backtab vram.

JohnPCAE · September 26, 2015

That's what I do; I write one card to plot four squares at a time, for a total of 240 writes to BACKTAB.

I think the major performance bottleneck is in my FixedPtMultiply routine (the full one, not the limited-case one). I'm investigating using Joe's quarter-square implementation, and so far I've switched the limited-case version over to it (though I'm not noticing a performance improvement because I think the limited-case one isn't taking up that much time relative to everything else).
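
For reference, the quarter-square method rests on the identity a*b = floor((a+b)^2/4) - floor((a-b)^2/4), which turns a multiply into two table lookups and a subtraction. The Python sketch below shows the general idea for unsigned 8-bit operands; it is not Joe's implementation and not the code in the demo, and the table and function names are made up here.

```python
# Table of floor(n*n/4) for every possible sum of two 8-bit values (0..510).
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]


def qs_multiply(a: int, b: int) -> int:
    """Multiply two unsigned 8-bit values using two table lookups.

    (a+b) and (a-b) always have the same parity, so the two floor
    operations discard the same fraction and the difference is exact.
    """
    if not (0 <= a <= 255 and 0 <= b <= 255):
        raise ValueError("operands must be in 0..255")
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]


if __name__ == "__main__":
    print(qs_multiply(200, 123))
```

A 16-bit 8.8 multiply would combine several such 8-bit partial products; how the demo does that is not described in this thread.
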

Edited September 26, 2015 by JohnPCAE

artrag · October 1, 2015

You should change the color of each column according to the distance, e.g. using different levels of green.

It is simple and effective for increasing the realism.
It could also work in color stack mode, even if with color clash.

This is what you get when you have 256 colors to play with:

Edited October 1, 2015 by artrag

JohnPCAE · October 12, 2015

I was thinking of maybe trying for a Treasure of Tarmin look at some point, with alternating wall colors. For now, though, here is a new version with (hopefully) improved performance. I added a version of the main multiplication routine that uses the quarter-square method and set the code to use that instead. The shift-and-add version is still there as well, just not used.

raycast_20151012.zip

artrag · October 17, 2015

The speed in color square mode seems ok for a game but you should really use two colors at least for walls. E.g. dark green for N/S sides and light green for E/W sides. You could get the info from the final step of the ray casting loop.
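
To illustrate how the wall side falls out of the last step of the casting loop, here is a Python sketch of a generic grid-stepping (DDA) ray cast. It is an assumption that the demo's loop has this shape; the function name, the grid layout and the "EW"/"NS" labels are all invented for the example. The last step taken before hitting a wall tells you which kind of face was hit.

```python
import math


def cast_ray(grid, px, py, angle, max_steps=64):
    """Step a ray through a grid of cells (truthy = wall).

    Returns (distance, side). side is "EW" when the last step crossed a
    vertical grid line (the face hit points east or west) and "NS" when it
    crossed a horizontal one. Returns None if nothing is hit in max_steps.
    The map is assumed to be enclosed by walls.
    """
    dx, dy = math.cos(angle), math.sin(angle)
    map_x, map_y = int(px), int(py)
    # Ray length needed to cross one whole cell in x or in y.
    delta_x = abs(1 / dx) if dx else math.inf
    delta_y = abs(1 / dy) if dy else math.inf
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Ray length to the first vertical / horizontal grid line.
    side_x = ((map_x + 1 - px) if dx > 0 else (px - map_x)) * delta_x
    side_y = ((map_y + 1 - py) if dy > 0 else (py - map_y)) * delta_y
    for _ in range(max_steps):
        if side_x < side_y:
            dist, side = side_x, "EW"
            side_x += delta_x
            map_x += step_x
        else:
            dist, side = side_y, "NS"
            side_y += delta_y
            map_y += step_y
        if grid[map_y][map_x]:
            return dist, side
    return None


if __name__ == "__main__":
    room = [
        [1, 1, 1, 1, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [1, 1, 1, 1, 1],
    ]
    print(cast_ray(room, 2.5, 2.5, 0.0))
```

The renderer can then pick, say, a darker color for one side label and a lighter one for the other.
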

About the color stack mode, the frame tearing needs an approach like the one we discussed earlier.

BTW, for a game, I would focus on coloring walls in color square mode.

This would allow the GRAM to be used for sprites and items.

Edited October 17, 2015 by artrag

JohnPCAE · December 24, 2015

I did a bit more work on the raycasting engine to try to get some more speed out of it. I optimized the casting loop in RenderCS() so that it scales better to more distant walls. You'll only see the difference in colored-squares mode as that's the only routine I worked on, but porting it to the normal Render() routine would be straightforward (the one that deals with F-B mode). Anyway, the frame rate does seem a bit higher in colored-squares mode now.

In the back of my mind I've been thinking a bit about what it would take to allow for individual control over wall colors and types, but I wanted to see if I could first wring as much performance out of the engine as possible. I'm not sure how much more speed can be squeezed out of it at this point, but you never know.

raycast_20151224.zip

JohnPCAE · December 27, 2015

Sped up multiplication quite a bit :-D

(only did it for colored-squares mode)

raycast_20151227.zip

First Spear · December 28, 2015

Very cool. It quits immediately after the title screen when I launch it from jzIntv with the --jlp switch. Newbie curious, why would that happen?

JohnPCAE · December 28, 2015

The demo uses a lot of cart memory space for lookup tables. It's probably stepping on something special in JLP carts. That said, it probably wouldn't be hard to work around it if I knew where the address conflict was. It's written for plain-Jane, non-enhanced carts.

intvnut · December 28, 2015

JLP default RAM range is $8040 - $9F7F. If you move your RAM16 area down to there, and move your ROM out of that region, then it'd work well on JLP's default RAM range. I can always move the RAM (it's determined by firmware), but usually it's easy enough to rejigger the assembly.

EDIT: Also, putting _CARTRAM at $BE00 - $BFF isn't a great idea, as writes in this space will corrupt GRAM if done during vertical blank. There are write-only aliases of GRAM at $7800-$7FFF, $B800-$BFFF and $F800-$FFFF.
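
A quick way to sanity-check a memory map against the ranges given in this post is an interval-overlap test. The Python sketch below uses only the addresses quoted here; the function names are hypothetical, and the example segment assumes the `$BFF` above was meant as `$BFFF`.

```python
# Ranges quoted in this post (inclusive).
JLP_DEFAULT_RAM = (0x8040, 0x9F7F)
GRAM_WRITE_ALIASES = [(0x7800, 0x7FFF), (0xB800, 0xBFFF), (0xF800, 0xFFFF)]


def overlaps(a, b):
    """True if the inclusive address ranges a and b share any address."""
    return a[0] <= b[1] and b[0] <= a[1]


def check_segment(segment, is_ram):
    """Return a list of problems for one (start, end) segment of the memory map."""
    problems = []
    if not is_ram and overlaps(segment, JLP_DEFAULT_RAM):
        problems.append("ROM overlaps the JLP default RAM range")
    if is_ram and any(overlaps(segment, alias) for alias in GRAM_WRITE_ALIASES):
        problems.append("RAM overlaps a write-only GRAM alias")
    return problems


if __name__ == "__main__":
    # Assumed reading of the _CARTRAM range mentioned above.
    print(check_segment((0xBE00, 0xBFFF), is_ram=True))
    # RAM placed inside the JLP default range raises no complaint.
    print(check_segment((0x8040, 0x80FF), is_ram=True))
```
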

Edited December 28, 2015 by intvnut

intvnut · December 28, 2015

Ok, so I made a couple minor tweaks to raycast.src to make its memory map compatible w/ JLP's default memory map.

Also, out of curiosity, did you start from a disassembly of one of my games? (e.g. 4-Tris or Space Patrol)?

raycast_20151227_jz.zip
