JohnPCAE (Author) Posted June 10, 2013
Lots more multiplication speed.
Attachment: raycast.zip
JohnPCAE (Author) Posted June 12, 2013
Added a special-case fixed-point multiplication routine. It is only used when one of the multiplicand values is guaranteed to be in the range -1.0 ... +1.0. This actually covers most of the calls to the multiplication routine and results in a noticeable speed boost. If anyone wanted to, say, port Wolf3D to the Inty, the frame rate might be good enough now.
Attachment: raycast.zip
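A minimal sketch of why a bounded operand can make an 8.8 fixed-point multiply cheaper. This is an illustration in Python, not the routine from raycast.zip: the names `to_fixed`, `fixed_mul`, and `fixed_mul_unit` are hypothetical, and the idea that the special case works by using only the fractional byte is an assumption, since the post does not describe the routine's internals.

```python
FRAC_BITS = 8
ONE = 1 << FRAC_BITS  # 1.0 in 8.8 fixed point


def to_fixed(x):
    """Convert a float to 8.8 fixed point (hypothetical helper)."""
    return int(round(x * ONE))


def fixed_mul(a, b):
    """General 8.8 multiply: full product, then drop 8 fraction bits."""
    return (a * b) >> FRAC_BITS


def fixed_mul_unit(a, b):
    """8.8 multiply where b is known to lie in [-1.0, +1.0].

    Assumption for illustration: because |b| <= 1.0, its magnitude is either
    exactly 1.0 or fits in the 8 fraction bits, so only one byte-sized
    multiplier is needed instead of two.
    """
    if not -ONE <= b <= ONE:
        raise ValueError("b must be in the range -1.0 ... +1.0")
    magnitude = abs(b)
    if magnitude == ONE:
        product = a  # multiplying by exactly 1.0 needs no arithmetic
    else:
        product = (a * magnitude) >> FRAC_BITS  # single 8-bit multiplier
    return -product if b < 0 else product
```

For negative `b` the two functions can differ by one least-significant bit, because the general version rounds the signed product toward negative infinity while the special case rounds the magnitude first.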
JohnPCAE (Author) Posted April 15, 2014
I've been taking a second look at my fixed-point 8.8 multiplication routines and it might be possible to optimize them some more. I built a list of multiplication "patterns" and a Java program that validates them and searches for the optimal ones based on whether it's being used for the first stage (low byte) or for the second stage (high byte). I'd like to get it to the point where it can autogenerate the necessary CP1600 assembly; now that would be awesome.
As an example, my patterns look something like this:
Assuming the value to be multiplied is in R2:R0, encoded in the following format:
0000 0000 HHHH HHHH : LLLL LLLL 0000 0000
And the end 8.8 fixed-point result is to be in R5, with R4 used as an intermediate register.
Some sample patterns for various multipliers (many more potential variations are possible):
0     (empty)
1     +
2     ++   .+   -+++   --..+   --:+   .-.+
3     +++   +.+
4     ++++   ..+   :+
. . .
252   :-:::+
253   -.-.:::+
254   --::::+
255   +.+.+.+.+.+.+.+   -::::+
256   ::::+
+ means add R2:R0 to R5:R4 (remember the carry bit!)
- means subtract R2:R0 from R5:R4 (remember the carry/borrow bit!)
. means shift R2:R0 left by 1 bit
: means shift R2:R0 left by 2 bits
* means add R2 to R5 (this is only valid for phase 2 when R0 is always 0) -- think of it as an abbreviated + operation; it is only used to substitute for the ending + in a pattern
These are the basic pattern elements, but there are more specialized ones if I'm able to use more registers or use them in different ways. Each operation has a cost in clock cycles, and the Java program can build some extra patterns based on existing ones. I have patterns using the basic operations above for multiplying by any value from 0-256, though there could always be some that I haven't found yet (not including specialized register-specific variants). There's a lot of potential to improve the multiplication speed, I think.
For example:
A pattern for multiplying by x+1 can always be built by +(pattern for x), and a pattern for multiplying by x-1 can always be built by -(pattern for x). This effectively eliminates the need to store separate patterns for odd multipliers.
If I have the value for R2:R0 in a single 16-bit register before splitting it into R2:R0 format, then the ::::+ pattern can be replaced by a single operation that adds it to R5.
I'm still investigating more advanced ways of building patterns based on what register combinations are available at different times, but it has the potential to lead to a highly optimized way of multiplying 8.8 fixed-point numbers in native CP1600 code.
Edited April 15, 2014 by JohnPCAE
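The pattern notation in this post can be checked mechanically. The following is a small Python sketch, not the Java program mentioned in the post: `pattern_multiplier` and `SAMPLES` are hypothetical names, and the model deliberately ignores register widths, carry handling, and clock-cycle costs. It tracks only which multiple of the original value ends up in the accumulator.

```python
def pattern_multiplier(pattern):
    """Return the integer multiplier that a pattern string encodes.

    acc stands in for R5:R4 and weight for R2:R0, both expressed as
    multiples of the original input value.
    """
    acc = 0
    weight = 1
    for op in pattern:
        if op in "+*":      # '*' is treated as an abbreviated '+'
            acc += weight
        elif op == "-":
            acc -= weight
        elif op == ".":     # shift left by 1 bit
            weight <<= 1
        elif op == ":":     # shift left by 2 bits
            weight <<= 2
        else:
            raise ValueError(f"unknown pattern element: {op!r}")
    return acc


# Sample patterns copied from the post, keyed by the intended multiplier.
SAMPLES = {
    0: [""],
    1: ["+"],
    2: ["++", ".+", "-+++", "--..+", "--:+", ".-.+"],
    3: ["+++", "+.+"],
    4: ["++++", "..+", ":+"],
    252: [":-:::+"],
    253: ["-.-.:::+"],
    254: ["--::::+"],
    255: ["+.+.+.+.+.+.+.+", "-::::+"],
    256: ["::::+"],
}


def check_samples():
    """True if every sample pattern evaluates to its stated multiplier."""
    return all(
        pattern_multiplier(p) == k
        for k, patterns in SAMPLES.items()
        for p in patterns
    )
```

Prefixing works because the leading `+` or `-` runs before any shift, so it contributes exactly one unshifted copy of the value: `pattern_multiplier("+" + ":-:::+")` evaluates 252 + 1, and `pattern_multiplier("-" + "::::+")` evaluates 256 - 1.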
intvnut Posted April 15, 2014
This is interesting. I had written a multiply generator some time back for integer multiplies that comes up with similar patterns to what you're computing. It didn't compute fixed-point MPYs, but I thought it might be fun to compare notes. I noticed many of our patterns are similar. Attached is my C code and what it generated, if you'd like to take a look. (The silly ".c.txt" extension is to get around AA's silly file extension restrictions.)
Attachments: mpyk.asm, mult_by_constant.c.txt
JohnPCAE (Author) Posted January 29, 2015
Lately I did some more work on this and have an improved multiplication routine. It's a little faster and doesn't need separate hi- and low-byte kernels. See the top post for an updated ZIP file.
Edited January 29, 2015 by JohnPCAE
artrag Posted August 13, 2015
Where is the latest ROM with the improved multiplication?
JohnPCAE (Author) Posted August 13, 2015
The top post. I updated it when I updated the ROM.
artrag Posted August 16, 2015
Thanks! It looks awesome. Greetings! On MSX TR (which uses a lot more resources; it is another category of machine) I did this:
Edited August 16, 2015 by artrag
JohnPCAE (Author) Posted September 21, 2015
I couldn't sleep and wound up making some more optimizations, basically by removing overhead from the multiplication routine.
Attachment: raycast_20150921.zip
Edited September 21, 2015 by JohnPCAE
JohnPCAE (Author) Posted September 22, 2015
A little more speed: removed some more overhead from the multiplication routine (mainly by splitting it up into two separate use cases) and sped up the colored squares rendering routine a bit.
Attachment: raycast_20150921_2.zip
artrag Posted September 24, 2015
Very good. By the way, would it be possible to build the screen in a hidden buffer? How many GRAM tiles do you use now? It seems you are using 10 tiles x 6 tiles. How many columns do you render? If you render 10*8 = 80 columns, you could speed up the computation by using fewer angles, say 40. All you need to do is set two pixels at a time in your GRAM cards and keep the window 10 tiles wide.
About the walls, I see you can reuse the same tile vertically for large portions of the image. I have the feeling you "blit" all the GRAM tiles column by column without exploiting the fact that you can replace a whole tile instead of passing over it bit by bit 8x8 = 64 times during the rendering.
From what I see, the time needed to update the GRAM is about one or two frames. The tearing is very evident. If you were able to use fewer than 32 tiles, you could swap between the two subsets of tiles at each scene update.
If, as I think, you blit the whole 10x6 tiles bit by bit, I think you have room for improving the rendering speed. A simple strategy for filled walls could be:
1. Compute in an array in RAM the height of each column (now 80 bytes) using your raycasting engine.
2. Compute for each column how many whole tiles would be needed and group the columns 8 at a time (divide the 80 values by 8 with a shift).
3. Use a filled tile (no blitting, use GROM - CARD 95) to plot the minimum number of pixels in a set of 8, that is, the "common part" of the 8 columns (find the minimum out of 8 values).
4. Render the 8 spare heights in a set of GRAM cards, this time bit by bit as you do now (use "and" and the minimum above to find the 8 remainders).
Edited September 24, 2015 by artrag
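The filled-wall strategy in this post can be sketched in a few lines of Python. This is one reading of the steps, under stated assumptions: heights are in pixels, a card is 8x8 pixels, and the names `split_wall_group` and `plan_columns` are hypothetical. How the wall is centered on screen and how the leftover rows are actually drawn into GRAM are left out.

```python
CARD_SIZE = 8  # a card is 8 pixels wide and 8 pixels tall


def split_wall_group(heights):
    """Split one card-wide group of column heights into shared and spare parts.

    Returns (full_cards, remainders): full_cards is how many solid cards all
    columns in the group have in common, and remainders holds the pixel
    height each column still needs on top of that shared part.
    """
    full_cards = min(h >> 3 for h in heights)   # whole cards per column, then min
    common_pixels = full_cards << 3
    remainders = [h - common_pixels for h in heights]
    return full_cards, remainders


def plan_columns(heights):
    """Apply split_wall_group to each run of CARD_SIZE adjacent columns."""
    return [
        split_wall_group(heights[i:i + CARD_SIZE])
        for i in range(0, len(heights), CARD_SIZE)
    ]
```

With 80 column heights this yields 10 groups. Only the remainders need per-pixel work; the shared part of each group can be drawn by repeating a single solid card.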
JohnPCAE (Author) Posted September 24, 2015
You make some good points, though skimming them at 4am is causing most of them to sail over my head. Actually, this program has two modes: GRAM and Colored Squares. The side buttons will toggle you between the two modes, and the numeric keys can be used to set the rendering distance. The frame rate in colored squares mode is MUCH faster, for different reasons (fewer pixels, and I draw them a whole card at a time).
I've made some more optimizations (this time to the main general-purpose multiplication routine as opposed to the special-case one). In colored-squares mode, the frame rate seems noticeably better.
Attachment: raycast_20150924.zip
artrag Posted September 24, 2015
In colored square mode you plot 1/4 of the pixels but you blit column by column again. I am proposing a more complex approach that allows you to plot the repetitive part of the walls as columns of the same card.
JohnPCAE (Author) Posted September 26, 2015
I think I see. Well, for a start, I changed the Colored Squares mode to first determine the wall heights and then render the image all at once. The tearing is no longer visible in that mode now. I also fixed several bugs in my multiplication routine and added some text that shows what the side buttons and keypad keys do.
Attachment: raycast_20150926.zip
artrag Posted September 26, 2015
Very good. Actually I think that in colored squares mode you can plot two columns at a time by plotting the repetitive blocks with a single access to backtab vram.
JohnPCAE (Author) Posted September 26, 2015
artrag wrote: "Very good. Actually I think that in colored squares mode you can plot two columns at a time by plotting the repetitive blocks with a single access to backtab vram."
That's what I do; I write one card to plot four squares at a time, for a total of 240 writes to BACKTAB. I think the major performance bottleneck is in my FixedPtMultiply routine (the full one, not the limited-case one). I'm investigating using Joe's quarter-square implementation, and so far I've switched the limited-case version over to it (though I'm not noticing a performance improvement because I think the limited-case one isn't taking up that much time relative to everything else).
Edited September 26, 2015 by JohnPCAE
artrag Posted October 1, 2015
You should change the color of each column according to the distance, e.g. using different levels of green. It is a simple and effective way to increase the realism. It could also work in color stack mode, even if with color clash.
This is what you get when you have 256 colors to play with:
Edited October 1, 2015 by artrag
JohnPCAE (Author) Posted October 12, 2015
I was thinking of maybe trying for a Treasure of Tarmin look at some point, with alternating wall colors. For now, though, here is a new version with (hopefully) improved performance. I added a version of the main multiplication routine that uses the quarter-square method and set the code to use that instead. The shift-and-add version is still there as well, just not used.
Attachment: raycast_20151012.zip
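For readers unfamiliar with the quarter-square method named in this post: it replaces a multiply with two table lookups and a subtraction, using the identity a*b = floor((a+b)^2 / 4) - floor((a-b)^2 / 4). The Python sketch below shows the idea for unsigned 8-bit operands only. It is not the routine from raycast.zip, and the names `QUARTER_SQUARES` and `quarter_square_mul` are hypothetical; the real code also has to handle signs and the 8.8 fixed-point format.

```python
# Table of floor(n*n / 4) for every possible sum of two 8-bit values (0..510).
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]


def quarter_square_mul(a, b):
    """Multiply two unsigned 8-bit integers with two lookups and a subtract."""
    if not (0 <= a <= 255 and 0 <= b <= 255):
        raise ValueError("operands must be in the range 0..255")
    # a+b and a-b always have the same parity, so the fractions dropped by
    # the two floor divisions cancel and the result is exact.
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]
```

The cost moves from shifting and adding to memory: the table needs 511 entries here, which fits the later remark in the thread that the demo uses a lot of cart memory space for lookup tables.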
artrag Posted October 17, 2015
The speed in color square mode seems OK for a game, but you should really use at least two colors for walls, e.g. dark green for N/S sides and light green for E/W sides. You could get the info from the final step of the ray casting loop.
About the color stack mode, the frame tearing needs an approach like the one we discussed earlier.
BTW, for a game, I would focus on coloring walls in color square mode. This would allow the GRAM to be used for sprites and items.
Edited October 17, 2015 by artrag
JohnPCAE (Author) Posted December 24, 2015
I did a bit more work on the raycasting engine to try to get some more speed out of it. I optimized the casting loop in RenderCS() so that it scales better to more distant walls. You'll only see the difference in colored-squares mode as that's the only routine I worked on, but porting it to the normal Render() routine would be straightforward (the one that deals with F-B mode). Anyway, the frame rate does seem a bit higher in colored-squares mode now.
In the back of my mind I've been thinking a bit about what it would take to allow for individual control over wall colors and types, but I wanted to see if I could first wring as much performance out of the engine as possible. I'm not sure how much more speed can be squeezed out of it at this point, but you never know.
Attachment: raycast_20151224.zip
JohnPCAE (Author) Posted December 27, 2015
Sped up multiplication quite a bit (only did it for colored-squares mode).
Attachment: raycast_20151227.zip
First Spear Posted December 28, 2015
JohnPCAE wrote: "Sped up multiplication quite a bit (only did it for colored-squares mode)"
Very cool. It quits immediately after the title screen when I launch it from jzIntv with the --jlp switch. Newbie curious, why would that happen?
JohnPCAE (Author) Posted December 28, 2015
The demo uses a lot of cart memory space for lookup tables. It's probably stepping on something special in JLP carts. That said, it probably wouldn't be hard to work around it if I knew where the address conflict was. It's written for plain-Jane, non-enhanced carts.
intvnut Posted December 28, 2015
JLP default RAM range is $8040 - $9F7F. If you move your RAM16 area down to there, and move your ROM out of that region, then it'd work well on JLP's default RAM range. I can always move the RAM (it's determined by firmware), but usually it's easy enough to rejigger the assembly.
EDIT: Also, putting _CARTRAM at $BE00 - $BFF isn't a great idea, as writes in this space will corrupt GRAM if done during vertical blank. There are write-only aliases of GRAM at $7800-$7FFF, $B800-$BFFF and $F800-$FFFF.
Edited December 28, 2015 by intvnut
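The address constraints in this post can be turned into a quick sanity check. The Python sketch below uses only the ranges quoted above; the function names `ranges_overlap` and `check_region` are hypothetical, and the example region sizes in the usage note are made up for illustration, not taken from raycast.src.

```python
# Address ranges quoted in the post (inclusive bounds).
JLP_DEFAULT_RAM = (0x8040, 0x9F7F)
GRAM_WRITE_ALIASES = [(0x7800, 0x7FFF), (0xB800, 0xBFFF), (0xF800, 0xFFFF)]


def ranges_overlap(a, b):
    """True if two inclusive (start, end) ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def check_region(start, end):
    """Classify a proposed cartridge RAM region against the quoted ranges."""
    region = (start, end)
    return {
        "inside_jlp_ram": JLP_DEFAULT_RAM[0] <= start and end <= JLP_DEFAULT_RAM[1],
        "hits_gram_alias": any(ranges_overlap(region, alias) for alias in GRAM_WRITE_ALIASES),
    }
```

For instance, a region starting at $BE00 reports `hits_gram_alias` as true, while a region such as $8040-$80FF reports `inside_jlp_ram` as true and no alias conflict.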
intvnut Posted December 28, 2015
Ok, so I made a couple of minor tweaks to raycast.src to make its memory map compatible with JLP's default memory map.
Also, out of curiosity, did you start from a disassembly of one of my games (e.g. 4-Tris or Space Patrol)?
Attachment: raycast_20151227_jz.zip