+TheBF (Author) Posted February 15, 2021

Well, didn't this open an interesting can of worms. My great improvement was short-lived. On my Windows 10 computer using Classic99 and Camel99 Forth 2.66 I get a different timing than @Speccery did. I should have timed it first on my machine before I even started working on improvements (doh!) but here is the apples-to-apples comparison (timings below).

So after seeing some improvement from re-working the DO LOOP with an inline NEXT interpreter, I wondered if I could use some of that extra space to speed up other primitives.

Next I found this reference table, which gives some insight into which Forth words run most often, from "Stack Computers: The New Wave" by Phil Koopman.

6.3.1 Dynamic instruction frequencies

    NAMES      FRAC     LIFE     MATH     COMPILE  AVE
    CALL       11.16%   12.73%   12.59%   12.36%   12.21%
    EXIT       11.07%   12.72%   12.55%   10.60%   11.74%
    VARIABLE    7.63%   10.30%    2.26%    1.65%    5.46%
    @           7.49%    2.05%    0.96%   11.09%    5.40%
    0BRANCH     3.39%    6.38%    3.23%    6.11%    4.78%
    LIT         3.94%    5.22%    4.92%    4.09%    4.54%
    +           3.41%   10.45%    0.60%    2.26%    4.18%
    SWAP        4.43%    2.99%    7.00%    1.17%    3.90%
    R>          2.05%    0.00%   11.28%    2.23%    3.89%
    >R          2.05%    0.00%   11.28%    2.16%    3.87%
    CONSTANT    3.92%    3.50%    2.78%    4.50%    3.68%
    DUP         4.08%    0.45%    1.88%    5.78%    3.05%
    ROT         4.05%    0.00%    4.61%    0.48%    2.29%
    USER        0.07%    0.00%    0.06%    8.59%    2.18%
    C@          0.00%    7.52%    0.01%    0.36%    1.97%
    I           0.58%    6.66%    0.01%    0.23%    1.87%
    =           0.33%    4.48%    0.01%    1.87%    1.67%
    AND         0.17%    3.12%    3.14%    0.04%    1.61%
    BRANCH      1.61%    1.57%    0.72%    2.26%    1.54%
    EXECUTE     0.14%    0.00%    0.02%    2.45%    0.65%

I already have 0BRANCH, BRANCH, EXIT, CALL (DOCOL), LIT, @ and DROP in 16 bit RAM. Using this table I changed the NEXT macro in each of the following words to use the ILNEXT macro (inline NEXT, 3 instructions): DOVAR, +, SWAP, R>, >R, DOCON, ROT, DOUSER, C@, I and =.

After re-compiling the kernel with these changes, FIB2-BENCH ran a bit faster again.

    Fibonacci FIB2-BENCH Timings
    ----------------------------------------
    V2.66    1:46.80
    V2.67    1:46.03   0.7% better (DO LOOP change)
    V2.67b   1:44.50   2.2% better than original (inline NEXT on high-usage words)

Since a threaded Forth program spends about 50% of its time running the interpreter NEXT, you can get improvements by removing the branch through a register to get to it and placing it inline, as we can see. But... this consumed 46 bytes in my tiny kernel, so is it worth it? I will play with it more and run some more benchmarks before I make up my mind.
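For anyone who wants to check the percentages, here is a minimal sketch in standard Forth (nothing Camel99-specific). The word names >HUNDREDTHS and %BETTER are made up for this example. Times are converted to hundredths of a second so they fit in a 16-bit cell, and the result is in hundredths of a percent.

```forth
\ Sketch only: standard Forth, hypothetical word names.
DECIMAL

\ Convert min:sec.hundredths to hundredths of a second.
\ Stays inside a signed 16-bit cell for times under about 5 minutes.
: >HUNDREDTHS ( min sec hundredths -- n )
    SWAP 100 * +        \ sec*100 + hundredths
    SWAP 6000 * + ;     \ + min*6000

\ Improvement of NEW over OLD, in hundredths of a percent.
\ */ keeps a double-cell intermediate, so diff*10000 cannot overflow.
: %BETTER ( old new -- n )
    OVER SWAP -         \ -- old old-new
    10000 ROT */ ;      \ (old-new)*10000/old

1 46 80 >HUNDREDTHS  1 46 03 >HUNDREDTHS  %BETTER .   \ V2.67  vs V2.66
1 46 80 >HUNDREDTHS  1 44 50 >HUNDREDTHS  %BETTER .   \ V2.67b vs V2.66
```

This prints 72 and 215, i.e. 0.72% and 2.15%, which agrees with the rounded 0.7% and 2.2% in the timing table.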
+TheBF (Author) Posted February 21, 2021

In the course of optimizing EMIT for faster screen printing I was amazed that the run time of the SEVENs problem benchmark was not reduced by very much. I remembered that FbForth was doing Lee's version of the benchmark in less than 1 minute. I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler. I use an intermediate word called MOVEUP to scroll the screen by a chosen number of lines, and you call it as many times as you need to scroll 24 lines.

    : MOVEUP ( vaddr -- 'vaddr)
        C/L@ 8* >R               \ compute chunk size. 8* means 8 lines
        HERE 100 + OVER C/L@ +   ( -- 1stline heap 2ndline)
        OVER R@ VREAD
        OVER R@ VWRITE
        R> + ;                   \ 36 bytes

I re-wrote the scroll as one Forth word and used a buffer that was the size of the screen minus one line. That brought the time down to 51.5 seconds.

    : SCROLL ( -- )
        PAUSE
        TOPLN                    \ top of VDP screen memory
        C/SCR @ C/L@ - >R        \ C/SCR-1line to rstack
        HERE 100 + OVER C/L@ +   ( -- 1stline heap 2ndline)
        OVER R@ VREAD
        SWAP R> VWRITE
        0 17 AT-XY VPOS C/L@ BL VFILL   \ SEVENS = 51.5 SECS
    ;

I don't really want to use such a huge buffer, even though it's in un-allocated memory, because at some point it will crash into the stack in a big program project. This is especially true in 80 column mode. I also don't want to put the scroll buffer in low RAM since that is so useful for SAMS buffers.

Since I can rebuild the kernel in 5 seconds and re-run it on Classic99, I did the experiment to find out how the buffer size affected the speed of the benchmark. Here is the data. I think I will stay with my original decision to use an 8 line buffer, but at least I know now that in the SEVENs benchmark almost 5 extra seconds are being used just to scroll the screen. Amazing.

    Buffer Lines   Sevens     Time Reduction   Notes
     1             01:17.26
     2             01:07.83   -12.21%          Uses do loop
     4             01:02.83   -18.68%          Uses do loop
     8             01:00.90   -21.18%          Uses do loop
     8             01:00.06   -22.26%          MOVEUP MOVEUP MOVEUP
    12             00:59.36   -23.17%          MOVEUP MOVEUP
    24             00:51.50   -33.34%          Scroll is 1 word
+Lee Stewart Posted February 22, 2021

5 hours ago, TheBF said:
> I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler.

This is probably not terribly responsive, but I use only a one-line buffer with ALC:

    *
    *** SCROLLING ROUTINE
    *
    SCROLL MOV  @$SSTRT(U),R0     VRAM addr
           LI   R1,LINBUF         Line buffer
           MOV  @$SWDTH(U),R2     Count
           A    R2,R0             Start at line 2
    SCROL1 BLWP @VMBR
           S    R2,R0             One line back to write
           BLWP @VMBW
           A    R2,R0             Two lines ahead for next read
           A    R2,R0
           C    R0,@$SEND(U)      End of screen?
           JL   SCROL1
           MOV  R2,R1             Blank bottom row of screen
           LI   R0,>2000          Blank
           S    @$SEND(U),R2
           NEG  R2                Now contains address of start of last line
           MOV  LINK,R6
           BL   @FILL1            Write the blanks
           B    *R6

If you need details about missing definitions, I can supply them, but the comments will likely suffice.

...lee
+TheBF (Author) Posted February 22, 2021

Thanks. This is very concise. I should go down that road. Early on, to save space, I decided to limit functions to the Forth interface only, so I cannot BLWP or BL to VMBW or VMBR or FILL. It's not tricky to change, but in the beginning of this journey I had no room left in the 8K. Hell, I was still figuring out how to use the cross-compiler that I made.

I have about 80 bytes free in the existing system and I can play games with headless definitions and labels to save space. So I will take a run at this method too, since my social life is somewhat limited these days. Did get outside for a walk in the park with my brother-in-law this afternoon, so that was good.
GDMike Posted February 22, 2021

Different environment, but same ole outta memory. This is probably easier for you guys, but I did manage to make a type of loader for some of my routines to use free upper SAMS today. I was at 1014 bytes free; now, after pushing some things up, I'm at 2050 free with 2 routines in upper SAMS. I'm learning.
+TheBF (Author) Posted February 22, 2021

1 minute ago, GDMike said:
> Different environment, but same ole outta memory, I was able to, (and this is probably easier for you guys), but I did manage to make a type of loader for some of my routines to use free upper SAMs today. I was at 1014 bytes free, now after pushing some things up, I'm at 2050 free and 2 routines in upper SAMs. I'm learning.

Ya, a little different. I start with this tiny 8K piece and then I can add to it after it loads. But squashing everything into the first 8K has been a fun challenge. I have actually built a version where I don't have any loops or branching in the kernel, but then it compiles those when it starts. (I know what you are thinking: how does the kernel work without loops or branching? The cross compiler knows how to compile them to make the kernel, but the finished program does not have the BEGIN AGAIN, IF THEN etc. words. Crazy stuff.)
+TheBF (Author) Posted February 23, 2021

When you turn over a stone, don't be surprised that you find a worm. In the course of testing my VMBR/VMBW code as sub-routines I found it was slower than normal because I used a WHILE loop structure to protect from cnt=0 conditions. I changed that and will take responsibility for the risk. It made things much faster.

    CODE: SCRL ( buffer Vaddr len -- )
    \ TOS hold address of C/L user variable
    \ R0 VDP address
    \ R1 CPU BUFFER for routines, copy kept on stack
    \ R5 line length
          TOS R5 MOV,           \ line length -> R5
          *SP+ R0 MOV,          \ VDPRDaddr -> R0
          R5 R0 ADD,            \ start at line2
          BEGIN,
             R5 TOS MOV,        \ COPY length to R4 for VMBR
             *SP R1 MOV,        \ buffer -> R1
             RMODE @@ BL,
             VMBR @@ BL,        \ read line2 to buffer
             R5 R0 SUB,         \ One line back to write
             R5 TOS MOV,        \ set counter for the write
             *SP R1 MOV,        \ restore buffer address
             WMODE @@ BL,
             VMBW @@ BL,
             R0 3FFF ANDI,      \ strip off write bit
             R5 R0 ADD,         \ Two lines ahead for next read
             R5 R0 ADD,
             R0 C/SCR @@ CMP,   \ End of screen?
          HI UNTIL,
          TOS POP,              \ drop buffer
          TOS POP,              \ refill TOS register
          NEXT,
          END-CODE

    \ Buffer Lines   Sevens Speed
    \     1          00:58.16    26 bytes bigger
    : SCROLL ( -- )
        HERE 100 + TOPLN C/L@ SCRL
        0 17 AT-XY VPOS C/L@ BL VFILL ;

It turns out that since I am not using BLWP, again to save space, I need a few extra instructions in my loop to reset the control registers. I call a sub-routine to set up the VDP address each time since I didn't want to push/pop R11. I also decided to erase the last screen line in Forth because it's pretty fast, being mostly code words and only one line of code.

I pass the buffer, screen and length as parameters since my TOPLN can be in different places in VDP RAM if you use the SCREEN: word to create different VDP text screens. And I don't keep variables for screen-end and screen-start, so parameter passing was simplest.

The ALC version is 26 bytes bigger than this Forth code, which I created to do an "apples to apples" comparison.

    \ Notes: Using SEVENs program as a benchmark
    \ Buffer Lines   Sevens Speed
    \     1          01:08.71
    : SCROLL ( buffer vaddr -- )
        DUP C/L@ L/SCR * + SWAP   ( -- buffer SCRend SCRstart)
        DO
           I C/L@ + OVER C/L@ VREAD
           DUP I C/L@ VWRITE
        C/L@ +LOOP
        DROP
        0 17 AT-XY VPOS C/L@ BL VFILL ;

So we can see that the ALC scroll makes the benchmark program ~15% faster at the cost of 26 bytes, at least the way I did the ALC code.

To show how much my VDP code improved: the older method that used the 8 line buffer improved from 1:00.6 to 0:55.75, an 8.7% improvement, which I was very happy to see. It comes in 18 bytes bigger than the single line DO/LOOP method.

I will explore what happens now with this improved VDP code and a 2 line buffer, which seems like a reasonable trade-off.
+TheBF (Author) Posted February 23, 2021

Continuing on the scroll research... Adding this code to SCRL to clear the last line:

    \ erase last line adds 8 bytes, buys .4 seconds on benchmark
          C/SCR @@ R0 MOV,     \ end of screen Vaddr -> R0
          R5 R0 SUB,           \ go back 1 line
          R5 R2 MOV,           \ byte count for VFILL -> R2
          TOS 2000 LI,         \ space char -> R4
          WMODE @@ BL,
          _VFILL @@ BL,

Versus this code in Forth:

    0 17 AT-XY VPOS C/L@ BL VFILL

ALC for clearing the last line was not worth the trouble since it only improved the very long SEVENS benchmark by .4 seconds and consumed an extra 8 bytes. And I needed to reset the cursor after SCRL completed anyway.

I re-wrote my previous Forth SCROLL using the idea of keeping the buffer and VDP address arguments on the stack, but I un-rolled the DO/LOOP. It was a bit faster. The code below ran the benchmark in 1:08.06 and was 36 bytes smaller than using the ALC scroll.

Example 1:

    : MOVEUP ( buffer vaddr -- buffer 'vaddr)
        2DUP C/L@ + SWAP C/L@ VREAD
        2DUP C/L@ VWRITE
        C/L@ + ;

    : MOVE8 ( buffer Vaddr -- buffer 'Vaddr)
        MOVEUP MOVEUP MOVEUP MOVEUP
        MOVEUP MOVEUP MOVEUP MOVEUP ;

    [PUBLIC]
    : SCROLL ( -- )
        PAUSE
        HERE 100 + TOPLN
        MOVE8 MOVE8 MOVE8 2DROP
        0 17 AT-XY VPOS C/L@ BL VFILL ;

Re-writing the above to use a four line buffer and creating a code word to return C/L@ 4* was the only way I could get a faster scroll than the 1 line ALC code: 57 seconds vs 58 in ALC.

Size or speed. It's hard to get both.
+TheBF (Author) Posted February 24, 2021

And just to beat this horse until it is well and truly dead, let's try ALC scrolling with a 2 line buffer. Since I am using un-allocated memory for the buffer, it was only 3 extra instructions. This dropped the benchmark time from 58.16 to 56.00 seconds. Just 1 second off of the Forth time using a 24 line buffer! Gotta love that assembler.

Using all this fancy stuff created an 8,148 byte kernel, so I still have 48 bytes left over! And I now have VMBW and VMBR as BL-callable sub-routines. Nice progress overall.

And if I really need a slightly smaller kernel, I just change the TRUE below to FALSE and the kernel is 8,116 bytes with a slower scroll.

Spoiler

    \ Scroll in Assembler is a re-write of concept from FbForth Lee Stewart
    FALSE [IF]
    [PRIVATE]
    CODE: SCRL ( buffer Vaddr len -- )
    \ TOS hold address of C/L user variable
    \ R0 VDP address
    \ R1 CPU BUFFER for routines, copy kept on stack
    \ R5 line length
          TOS R5 MOV,           \ line length -> R5
          *SP+ R0 MOV,          \ VDPRDaddr -> R0
          R5 R0 ADD,            \ start at line2
          BEGIN,
             R5 TOS MOV,        \ COPY length to R4 for VMBR
             TOS TOS ADD,       \ *read 2 lines
             *SP R1 MOV,        \ buffer -> R1
             RMODE @@ BL,
             VMBR @@ BL,        \ read line2 to buffer
             R5 R0 SUB,         \ One line back to write
             R5 TOS MOV,        \ set counter for the write
             TOS TOS ADD,       \ *write 2 lines
             *SP R1 MOV,        \ restore buffer address
             WMODE @@ BL,
             VMBW @@ BL,
             R0 3FFF ANDI,      \ strip off write bit
             R5 R0 ADD,
             R5 R0 ADD,
             R5 R0 ADD,         \ *advance one extra line
             R0 C/SCR @@ CMP,   \ End of screen?
          HI UNTIL,
          TOS POP,              \ DROP buffer address
          TOS POP,              \ refill TOS register
          NEXT,
          END-CODE

    [PUBLIC]
    \ Buffer Lines   Sevens Speed
    \     1          00:58.16    26 bytes bigger
    : SCROLL ( -- )
        HERE 100 + TOPLN C/L@ SCRL
        0 17 AT-XY VPOS C/L@ BL VFILL ;

    [ELSE]
    \ [PRIVATE]
    \ Notes: Using SEVENs program as a benchmark
    \ Buffer Lines   Sevens Speed
    \     1          01:08.06
    \     2          01:02.00
    \                1:01.00 using 2LINES code word.
    \     4          00:57.28
    \     8          00:55.43

    \ CODE: 2LINES ( -- n)
    \       TOS PUSH,
    \       R1 STWP,
    \       2E (R1) TOS MOV,   \ read user var C/L
    \       TOS 1 SLA,         \ 2*
    \       NEXT,
    \       END-CODE

    [PUBLIC]
    : SCROLL ( -- )
        HERE 100 + TOPLN C/SCR @   ( -- buffer Vstart len)
        BOUNDS                     ( -- buffer SCRend SCRstart)
        DO
           I C/L@ + OVER C/L@ VREAD
           DUP I C/L@ VWRITE
        C/L@ +LOOP
        DROP
        0 17 AT-XY VPOS C/L@ BL VFILL ;
    [THEN]
+Lee Stewart Posted February 24, 2021

2 hours ago, TheBF said:
> \ Scroll in Assembler is a re-write of concept from fbForth Lee Stewart

Before credits get lost, I must hasten to say that the ALC scroll routine I posted is totally lifted from TI Forth. There...I feel better already!

...lee
+TheBF (Author) Posted February 25, 2021

While updating the list in Benchmarking Languages I started re-reading some of the posts by @matthew180 and @jedimatt42 where I was being soundly scolded for writing inefficient VDP routines. At the time I understood what they said, but I did not know how to implement it in my system because of my use of different workspaces in the multi-tasker. I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything.

A long time back I decided to use the workspace pointer register to define not just my register space but also local variable space above the registers for each task. I recently re-wrote my character output routine to use more inline code versus Forth, and to do that I have to access those local variables for COL and ROW using indexed addressing. Well... it suddenly occurred to me that I can also access my registers the same way.

This gave rise to a re-write of CPUT: (TOS is R4)

Edit: Took out 1 more instruction.

    CODE: CPUT ( char -- ?)   \ put a char at cursor position, return eol flag
          R1 STWP,               \ workspace is USER area base address
          32 (R1) R2 MOV,        \ vrow->r2
          2E (R1) R2 MPY,        \ vrow*c/l->r3
          34 (R1) R3 ADD,        \ add vcol
          VPG @@ R3 ADD,         \ add video page address
          0 LIMI,
          7 (R1) 8C02 @@ MOVB,   \ write odd byte from R3
          R3 4000 ORI,
          R3 8C02 @@ MOVB,
          9 (R1) VDPWD @@ MOVB,  \ Odd byte R4, write to screen
          2 LIMI,
          TOS CLR,
          34 (R1) INC,           \ bump VCOL
          34 (R1) 2E (R1) CMP,   \ compare VCOL = C/L
          EQ IF,
             TOS SETO,           \ set true flag
          ENDIF,
          NEXT,
          END-CODE
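The 7 (R1) and 9 (R1) operands in CPUT follow from the way the TMS9900 keeps its registers in memory: register Rn is the word at workspace pointer + 2n, stored big-endian, so its odd (low) byte sits one byte further on. A minimal sketch of that arithmetic in standard Forth, with made-up word names REG-MSB and REG-LSB:

```forth
\ Sketch only: standard Forth, hypothetical word names.
\ Byte offsets of workspace register Rn from the workspace pointer.
DECIMAL
: REG-MSB ( n -- offset )  2* ;      \ even (high) byte of Rn
: REG-LSB ( n -- offset )  2* 1+ ;   \ odd (low) byte of Rn

3 REG-LSB .   \ R3: the 7 (R1) operand in CPUT
4 REG-LSB .   \ R4 (TOS): the 9 (R1) operand in CPUT
```

This prints 7 and 9. Because STWP loads the current workspace address into R1 at run time, the same offsets work for every task's workspace.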
+TheBF (Author) Posted February 26, 2021

Armed with this new tool for faster VDP access, I had to re-write my VDP driver. You know I had to. It makes everything just a little more "perky". I like it. All the benchmarks that do anything with the screen, like the TURSI sprite benchmark, go faster, and even compiling is a touch quicker because we can get things in and out of the PAB faster. Only took me 5 years to get here.

It also made me look at some very early code and discover that I had used 4 instructions in my VWTR word where I only needed 2. I guess I am getting a little better at this Assembly Language thing.

With these faster Forth words I don't feel a real need for sub-routine access to VMBR and VMBW anymore. It pays to listen to the experts.

I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP.

Example: (edit: Found an extra instruction from the old version that was not needed)

    \ VSBR Forth style, on the stack
    CODE: VC@ ( VDP-adr -- char )   \ Video CHAR fetch
          0 LIMI,
          R1 STWP,
          9 (R1) 8C02 @@ MOVB,   \ write odd byte from TOS ie R4
          TOS 8C02 @@ MOVB,      \ write even bytes from TOS
          VDPRD @@ TOS MOVB,     \ READ char from VDP RAM into TOS
          TOS 8 SRL,             \ move the damned byte to correct half of the word
          2 LIMI,
          NEXT,
          END-CODE
matthew180 Posted February 26, 2021

12 hours ago, TheBF said:
> I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything.

That is just a slight optimization to the VDP routines, and you only incur a small increase (8us or something like that) by using other methods. I see you worked it out, even if it is backwards (i.e. written in Forth).

12 minutes ago, TheBF said:
> I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP.

This has been debated and tested quite a bit on the 99/4A, and IIRC there is only one very specific situation where you might be able to overrun the VDP (again, on the 99/4A). Then again, I think that one case was on a modified system, so it is probably not possible on a stock console. The threads about it are here on A.A. if you want to dig around.

The main clincher (again, IIRC) for the 99/4A is that VDP access triggers the wait-state generator, so you pretty much cannot overrun the VDP. Also, if the system has an F18A, it cannot be overrun on the retro computers that used the 9918A family of VDP. You need a CPU clock around 25MHz or faster to overrun the F18A.
+TheBF (Author) Posted February 26, 2021

3 minutes ago, matthew180 said:
> That is just a slight optimization to the VDP routines, and you only incur a small increase (8us or something like that) by using other methods. I see you worked it out, even if it is backwards (i.e. written in Forth).
>
> This has been debated and tested quite a bit on the 99/4A, and IIRC there is only one very specific situation where you might be able to overrun the VDP (again, on the 99/4A). Then again, I think that one case was on a modified system, so it is probably not possible on a stock console. The threads about it are here on A.A. if you want to dig around.
>
> The main clincher (again, IIRC) for the 99/4A is that VDP access triggers the wait-state generator, so you pretty much cannot overrun the VDP. Also, if the system has an F18A, it cannot be overrun on the retro computers that used the 9918A family of VDP. You need a CPU clock around 25MHz or faster to overrun the F18A.

"... even if it is backwards..." The good news is the code gets laid down in the correct order. I still find it amazing that you can write a functional assembler with structured branching and looping in 200 lines.

Thanks for the news on the over-run not being an issue. I only have stock hardware.
+TheBF (Author) Posted February 26, 2021

Once you start looking... This is faster again by not using SRL but CLR, and moving the data into the correct side of the register.

    \ VSBR Forth style, on the stack
    CODE: VC@ ( VDP-adr -- char )   \ Video CHAR fetch
          0 LIMI,
          R1 STWP,
          9 (R1) 8C02 @@ MOVB,     \ write odd byte from TOS ie R4
          TOS 8C02 @@ MOVB,        \ write even bytes from TOS
          TOS CLR,
          VDPRD @@ 9 (R1) MOVB,    \ READ char from VDP RAM into TOS
          2 LIMI,
          NEXT,
          END-CODE

I can do this in a number of places...
matthew180 Posted February 26, 2021

You might want to check out the 9900 datasheet and get a little familiar with the instruction timings (pg. 28). The slowest instructions are: DIV, MPY, Shift instructions, XOP, LDCR/STCR, BLWP. The shifter takes extra clock cycles for each bit shifted, so the more bits, the longer the instruction takes. So, even though shifting is faster than DIV and MPY for powers of 2, it is still slower than other instructions if you can do the same task with other instructions.
+TheBF (Author) Posted February 26, 2021

1 hour ago, matthew180 said:
> You might want to check out the 9900 datasheet and get a little familiar with the instruction timings (pg. 28). The slowest instructions are: DIV, MPY, Shift instructions, XOP, LDCR/STCR, BLWP. The barrel shifter takes a cycle for each bit-shift, so the more bits, the longer the instruction takes. So, even though shifting is faster than DIV and MPY for powers of 2, it is still slower than other instructions if you can do the same task with other instructions.

When I first started writing the low level code for this system I had that thing in front of me all the time. I use shifts for the routines called 2* 4* 8* for fast multiplication, and for 2/ which divides by 2. The challenge with text is when you grab a byte and Forth wants the byte in the odd byte of the register. SRL 8 is pretty slow, but it's even worse to mask with AI and then SWPB. I find it pretty hard to make the old 9900 give you anything for free.
matthew180 Posted February 27, 2021

8 hours ago, TheBF said:
> SRL 8 is pretty slow, but its even worse to mask with AI and then SWPB.

AI is 14 cycles, SWPB is 10 cycles, so 24 cycles total. This assumes register addressing to make it comparable to shift, since the shift can only operate on registers.

The shift instructions have two forms (three actually, but only two that apply here): using R0 for the count, or a fixed count. If the count is in R0, then the timing is 20+2N (N is the value read from R0). So in this case it would be 20+2*8 = 36 cycles. If the count is fixed (which is encoded as part of the instruction), then the timing is 12+2C. So in this case it would be 12+2*8 = 28 cycles.

So, AI + SWPB is still faster, by at least 4 cycles, than shifting by 8.
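The cycle formulas quoted above are easy to play with in Forth. A minimal sketch in standard Forth, using made-up names (SHIFT-FIXED, SHIFT-R0, AI-CYCLES, SWPB-CYCLES) and the cycle figures exactly as given in the post, register operands assumed. Note that 12 + 2*8 works out to 28.

```forth
\ Sketch only: standard Forth, hypothetical word names.
\ Cycle counts are the figures quoted in the thread, register operands.
DECIMAL
: SHIFT-FIXED ( count -- cycles )  2* 12 + ;   \ count encoded in the instruction
: SHIFT-R0    ( count -- cycles )  2* 20 + ;   \ count read from R0
14 CONSTANT AI-CYCLES
10 CONSTANT SWPB-CYCLES

8 SHIFT-FIXED .                \ shift by a fixed count of 8
8 SHIFT-R0 .                   \ shift by 8 with the count in R0
AI-CYCLES SWPB-CYCLES + .      \ AI followed by SWPB
```

This prints 28, 36 and 24. A 1-bit shift such as 2* costs only 14 cycles by the same formula, which is why shifts still pay off for small counts.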
+TheBF (Author) Posted February 27, 2021

You are correct. I recall now that I chose SRL in that case to save space since the speed difference was so small (24 vs 28). And bytes are important since I try to keep the kernel in an 8K package.
+TheBF (Author) Posted February 27, 2021

Bottom line on all these changes to the VDP driver. I reached into the archive and pulled out Camel99 V2.54, which also used a 2 line buffer for scrolling. I did a hex DUMP of 1024 bytes from address 0000, starting on the bottom line of the screen. I timed them by hand so that interrupts would not skew the result. Here are the results:

    V2.54   17.45 seconds
    V2.67   13.55 seconds

That's a 28% improvement in speed (17.45/13.55 = 1.288), or about 22% less time. I will keep it.
+TheBF (Author) Posted March 1, 2021

Seeing the Mandelbrot in BASIC made me wonder what it would take to do it in Forth. Rosetta Code has a version that uses floating point. Not going there. I found this excellent version that is all integer math here: Mandelbrot Set Rendered in ASCII Art Using Forth on 6502 Machine - YouTube

The code is in the spoiler.

Spoiler

    \ Setup constants to remove magic numbers to allow
    \ for greater zoom with different scale factors.
    \ SOURCE: https://github.com/Martin-H1/Forth-CS-101/blob/master/mandelbrot.fs
    DECIMAL
    20 CONSTANT MAXITER
    -39 CONSTANT MINVAL
    40 CONSTANT MAXVAL
    20 5 LSHIFT CONSTANT RESCALE
    RESCALE 4 * CONSTANT S_ESCAPE

    \ These variables hold values during the escape calculation.
    VARIABLE CREAL
    VARIABLE CIMAG
    VARIABLE ZREAL
    VARIABLE ZIMAG
    VARIABLE CNT

    \ Compute squares, but rescale to remove extra scaling factor.
    : ZR_SQ  ZREAL @ DUP RESCALE */ ;
    : ZI_SQ  ZIMAG @ DUP RESCALE */ ;

    \ Translate escape count to ascii greyscale.
    \ : .CHAR ( n --) S" ..,'~!^:;[/<&?oxOX# " DROP + 1 TYPE ;
    : .CHAR ( n --) S" ..,'~!^:;[/<&?oxOX# " DROP + C@ EMIT ;  \ BF. better :)

    \ Numbers above 4 will always escape, so compare to a scaled value.
    : ESCAPES?  S_ESCAPE > ;

    \ Increment count and compare to max iterations.
    : COUNT_AND_TEST?  CNT @ 1+ DUP CNT ! MAXITER > ;

    \ stores the row column values from the stack for the escape calculation.
    : INIT_VARS
        5 LSHIFT DUP CREAL ! ZREAL !
        5 LSHIFT DUP CIMAG ! ZIMAG !
        1 CNT ! ;

    \ Performs a single iteration of the escape calculation.
    : DOESCAPE
        ZR_SQ ZI_SQ 2DUP + ESCAPES?
        IF
           2DROP TRUE
        ELSE
           - CREAL @ +          \ leave result on stack
           ZREAL @ ZIMAG @ RESCALE */ 1 LSHIFT
           CIMAG @ + ZIMAG !
           ZREAL !              \ Store stack item into ZREAL
           COUNT_AND_TEST?
        THEN ;

    \ Iterates on a single cell to compute its escape factor.
    : DOCELL  INIT_VARS BEGIN DOESCAPE UNTIL CNT @ .CHAR ;

    \ For each cell in a row.
    : DOROW  MAXVAL MINVAL DO DUP I DOCELL LOOP DROP ;

    \ For each row in the set.
    : MANDELBROT  CR MAXVAL MINVAL DO I DOROW CR LOOP ;

    \ Run the computation.
    MANDELBROT

It works perfectly in GForth as you can see in the video. It compiles on Camel99 Forth and TurboForth but does not render correctly. I suspect it is a big-endian/little-endian problem but I have not figured it out yet.

[video: gforth 2021-03-01 17-20-40.mp4]
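For readers new to the integer trick in that program: as I read the code, a grid coordinate is shifted left by 5 (times 32) and RESCALE (20 times 32 = 640) stands for 1.0, so the grid runs from about -1.95 to 2.0. Each multiply of two scaled numbers has to be divided by RESCALE to get back to the same scale, which is what */ does. A minimal sketch in standard Forth; FX* and FX-SQ are made-up helper names that are not in the original program.

```forth
\ Sketch only: standard Forth, hypothetical helper names FX* and FX-SQ.
DECIMAL
20 5 LSHIFT CONSTANT RESCALE         \ 640 represents 1.0

\ Multiply two scaled numbers and bring the result back to scale.
\ */ keeps a double-cell intermediate product.
: FX*   ( a b -- a*b )  RESCALE */ ;
: FX-SQ ( a -- a*a )    DUP FX* ;

RESCALE .          \ the scale factor
960 FX-SQ .        \ 1.5 squared: 960*960/640
RESCALE 4 * .      \ the escape threshold 4.0, scaled
```

This prints 640, 1440 and 2560. 1440/640 is 2.25, which is 1.5 squared, and 2560 is the value the program keeps in S_ESCAPE.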
HOME AUTOMATION Posted March 1, 2021

Fast 'n smooth! I like that.
+TheBF (Author) Posted March 1, 2021

28 minutes ago, HOME AUTOMATION said:
> Fast 'n smooth! I like that.

It won't be "quite" that fast on the old 9900.
+TheBF (Author) Posted March 2, 2021

Over in another topic Mr @vol was talking about making the interrupt poll the 9901 timer. I thought that was just a splendid idea, so here is a version in Forth. Since Camel99 starts the timer when it boots, I just needed to write the interrupt handler to read the timer. I think this is correct, but man, counting at 21.3 uS per tick is REALLY fast!

    \ Interrupt polled 9901 timer
    NEEDS MOV,     FROM DSK1.LOWTOOLS
    NEEDS INSTALL  FROM DSK1.ISRSUPPORT

    DECIMAL
    \ ISR workspace registers
    \ R0,R1 32 bit timer variable
    \ R2, difference register
    \ R3 temp
    \ R4 previous time reading
    CREATE IWKSP  16 CELLS ALLOT
    IWKSP 16 CELLS 0 FILL

    CODE READ9901
          0 LIMI,
          IWKSP LWPI,
          R2 CLR,
          R12 2 LI,        \ load 9901 Timer CRU address
          -1 SBO,          \ SET bit 0 TO 1, Enter timer mode
          R2 14 STCR,      \ READ TIMER (14 bits)
          -1 SBZ,          \ RESET bit 0, exit timer mode
          2 LIMI,
          R4 R3 MOV,       \ old reading -> temp
          R2 R4 MOV,       \ save this read for next time
          R3 R2 SUB,       \ compute ticks since last read
          R2 ABS,
          R2 R1 ADD,       \ add ticks to timer registers
          OC IF,
             R0 INC,       \ deal with overflow to make 32bit value
          ENDIF,
          HEX 83E0 LWPI,   \ return to GPL workspace
          RT,
    ENDCODE
    REMOVE-TOOLS

    : T ( -- d )  IWKSP 2@ ;       \ read the workspace as memory
    : COLD  0 INSTALL COLD ;       \ disable interrupt before leaving Forth

    ISR' READ9901 INSTALL

    : TEST  PAGE BEGIN 10 10 AT-XY T DU. ?TERMINAL UNTIL ;

[video: 9901 ISR TIMER.mp4]
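To put the tick rate in perspective, here is a minimal sketch in standard Forth that converts a tick count to milliseconds. It assumes the 9901 timer decrements once every 64 cycles of the 3 MHz system clock, which is where the 21.3 uS figure comes from. TICKS>MS is a made-up name, and it takes a single-cell count rather than the double that T returns.

```forth
\ Sketch only: standard Forth, hypothetical word name.
\ Assumes one timer tick = 64 clocks at 3 MHz (about 21.3 uS).
DECIMAL
: TICKS>MS ( ticks -- ms )  64 3000 */ ;   \ ticks * 64 / 3000, truncated

16383 TICKS>MS .    \ the largest 14-bit timer value
```

This prints 349, so the 14-bit timer wraps roughly every 0.35 seconds. That is comfortably longer than the gap between video interrupts, so one read per interrupt should not miss a wrap.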
+TheBF (Author) Posted March 8, 2021

I took a look at my inline optimizer to see if it was possible to optimize Forth loop structures as code. While I was at it, things were getting a little complicated, so I reduced the business end of the process, copying kernel code snippets, into one nice word called CODE, .

I can now optimize DO LOOP, BEGIN UNTIL and BEGIN AGAIN with this version. I don't think I will go any further. In theory one could build a recursive descent compiler over the Forth code, but I think that's above my pay grade.

In order to keep the optimized loop info from getting mixed up with the Forth DATA stack, I made a little secondary LIFO called a control stack. This made it much simpler, and I can do nested loops without losing my mind managing mixed data on the Forth data stack.

The video shows the difference in speed for these test 64K iteration loops:

    : COUNTDN     FFFF BEGIN 1- DUP 0= UNTIL DROP ;
    : OPTCOUNTDN  FFFF INLINE[ BEGIN 1- DUP 0= UNTIL DROP ] ;
    : FORTHLOOP   FFFF 0 DO LOOP ;
    : OPTLOOP     INLINE[ FFFF 0 DO LOOP ] ;

Spoiler

    \ **not portable Forth code** Uses TMS9900/CAMEL99 CARNAL Knowledge
    NEEDS .S      FROM DSK1.TOOLS
    NEEDS CASE    FROM DSK1.CASE
    NEEDS LIFO:   FROM DSK1.STACKS
    NEEDS ELAPSE  FROM DSK1.ELAPSE

    MARKER /INLINE

    8 LIFO: CS    \ small control flow stack for loops and branching
    : >CS ( n -- ) CS PUSH ;
    : CS> ( -- n ) CS POP ;

    HEX
    \ need NORMAL copies of words that are WEIRD in the Camel99 kernel
    CODE @     C114 ,          NEXT, ENDCODE
    CODE C@    D114 , 0984 ,   NEXT, ENDCODE
    CODE DROP  C136 ,          NEXT, ENDCODE

    \ Heap management words
    : THERE  ( -- addr) H @ ;              \ returns end of Target memory in HEAP
    : HALLOT ( n -- )   H +! ;             \ Allocate n bytes of target memory.
    : T,     ( n -- )   THERE ! 2 HALLOT ; \ "target compile" n into memory
    : NEW    ( -- )     2000 2000 0 FILL  2000 H ! ;  \ clean HEAP memory

    045A CONSTANT 'NEXT'   \ 9900 CODE for B *R10 Camel99 Forth's NEXT code

    : CODE, ( xt --)   \ Read code word from kernel, compile into target memory
        >BODY DUP 80 CELLS +   \ set a max size for any code fragment
        SWAP ( -- IPend IPstart)
        BEGIN
           DUP @ 'NEXT' <>     \ the instruction is not 'NEXT'
        WHILE
           DUP @ ( -- IP instruction)
           T,                  \ compile instruction
           CELL+               \ advance IP
           2DUP < ABORT" End of code not found"
        REPEAT
        2DROP ;

    \ now we can steal code word from the kernel and compile it to target memory
    : DUP,   ['] DUP CODE, ;
    : DROP,  ['] DROP CODE, ;

    \ LIT, DUP TOS and LI n into R4
    : LIT, ( n -- ) DUP, 0204 T, ( n) T, ;

    \ <DO> is the preamble to setup return stack. Runs only once.
    \ THERE is the address that loop jumps back to
    : DO, ( -- there) ['] <DO> CODE, THERE >CS ;

    \ store a byte offset in odd byte of addr.
    \ Addr is the location of Jump instruction
    : RESOLVE ( addr offset --) 2- 2/ SWAP 1+ C! ;

    \ compute offset from addr addr' & complete the jump instruction
    : <BACK ( addr addr' -- ) TUCK - RESOLVE ;

    \ compile misc. jump instructions with no offset.
    : JMP, 1000 T, ;
    : JNO, 1900 T, ;
    : JEQ, 1300 T, ;
    : JNE, 1600 T, ;

    : LOOP,
        0597 T,                \ *RP INC,
        CS> THERE JNO, <BACK   \ compute offset between 2 THERE addresses
        ['] UNLOOP CODE,       \ collapse stack frame
        DROP                   \ ?? not sure what's going here
    ;

    : +LOOP,
        A5CA T,    \ TOS *RP ADD,
        DROP,      \ don't need TOS value anymore
        LOOP,      \ compile loop code
    ;

    : AGAIN,  CS> THERE JMP, <BACK ;

    : UNTIL, ( 8104 T, )
        1302 T,    \ 2 JEQ,
        DROP, AGAIN, DROP, ;

    \ CFA of a code word contains the address of the next cell
    : NOTCODE? ( XT -- ?) DUP @ 2- - ;

    : OPT-FORTH ( cfa)
        ['] DOCOL @ OVER @ =   \ is a colon definition?
        IF   \ colon definition
           CASE ( loop words)
              ['] DO     OF  DO, THERE   ENDOF
              ['] LOOP   OF  LOOP,       ENDOF
              ['] +LOOP  OF  +LOOP,      ENDOF
              ['] BEGIN  OF  THERE >CS   ENDOF
              ['] UNTIL  OF  UNTIL,      ENDOF
              ['] AGAIN  OF  AGAIN,      ENDOF
              TRUE ABORT" Can't optimize word"
           ENDCASE
           DROP
        ELSE   \ Forth DATA word
           DUP @   \ get the "executor" code routine address
           CASE ( data words )
              ['] DOVAR     OF  >BODY LIT,     ENDOF
              ['] DOCON     OF  EXECUTE LIT,   ENDOF
              ['] DOUSER @  OF  EXECUTE LIT,   ENDOF
              TRUE ABORT" Unknown data type"
           ENDCASE
           DROP
        THEN ;

    \ new interpreter loop for inlining
    : INLINE[ ( -- addr)   \ Returns address where code has been copied
        THERE ( -- XT)     \ execution token (XT) for the NEW compiled code
        DUP CELL+ T,       \ create the ITC header for CODE word
        BEGIN
           BL WORD CHAR+ C@ [CHAR] ] <>
        WHILE
           HERE FIND
           IF ( *it's a word in the dictionary* )
              DUP NOTCODE?
              IF ( -- cfa )
                 DUP OPT-FORTH
              ELSE       \ it's a CODE primitive
                 CODE,   \ compile code without NEXT
              THEN
           ELSE ( maybe its a number)
              COUNT NUMBER? ?ERR
              ( n ) LIT,   \ compile n as a literal
           THEN
        REPEAT
        \ CR .S ( debug line)
        'NEXT' T,   \ compile NEXT at end of new code word
        ,           \ compile CODE word's XT into Forth definition
    ; IMMEDIATE

[video: optimized-loops.mp4]
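The RESOLVE and <BACK words above depend on how the TMS9900 encodes jumps: the low byte of the instruction holds a signed displacement in words, counted from the address just after the jump. A minimal sketch of the same arithmetic in standard Forth, so it can be tried on any system. JMP-DISP is a made-up name, and the A000 addresses are arbitrary examples.

```forth
\ Sketch only: standard Forth, hypothetical word name and addresses.
\ Displacement byte for a TMS9900 jump located at jmp-addr going to target.
: JMP-DISP ( target jmp-addr -- byte )
    - 2-          \ byte distance from (jmp-addr + 2) to target
    2/            \ bytes -> words; arithmetic shift keeps the sign
    255 AND ;     \ keep only the low byte, as C! does in RESOLVE

HEX
A000 A000 JMP-DISP .    \ a jump to itself
A000 A006 JMP-DISP .    \ jump at A006 back to A000
DECIMAL
```

This prints FF and FC. FF is the familiar "JMP $" encoding (>10FF), and FC is -4 words, counted from A008 back to A000. The single displacement byte is also why an inlined loop body has a size limit: a backward jump can reach at most 128 words.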