Tursi Posted January 22, 2016

"While you're at it, how do you think GPL will compare with p-code? I have a very soft spot in my heart for Pascal, and would eventually like to develop a program for it on the TI, but I am concerned about its performance..."

I couldn't say, I've never run it beyond the work I did debugging Classic99. In all that the best I did was assemble some Hello World programs...

For benchmarking languages, really... just write comparable programs. Trying to compare languages and implementations was always a battle, even back in the day, since algorithm matters, what parts of the language you touch matters, what parts of the hardware you need to use matters, etc.

But off the top of my head, a good quick one for the TI might be something like manually moving a sprite around the outer edge of the screen, one pixel at a time (no auto-motion). See how fast you can get it whipping around. Make it loop 100 times and then exit, so that you can time the total runtime.

Starting with the simple in XB...

100 CALL CLEAR
110 CALL MAGNIFY(2)
120 CALL SPRITE(#1,42,2,1,1)
130 CNT=100
140 FOR X=1 TO 240 :: CALL LOCATE(#1,1,X):: NEXT X
150 FOR Y=1 TO 176 :: CALL LOCATE(#1,Y,240):: NEXT Y
160 FOR X=240 TO 1 STEP -1 :: CALL LOCATE(#1,176,X):: NEXT X
170 FOR Y=176 TO 1 STEP -1 :: CALL LOCATE(#1,Y,1):: NEXT Y
180 CNT=CNT-1 :: IF CNT>0 THEN 140
190 END

ASM and TurboForth in the spoiler tag.
ASM version:

* assumes startup from Editor/Assembler
    DEF START
    REF VDPWA,VDPWD

* make it work as EA5 if desired
    B @START

START
* call clear
    li r0,>0040        * write address >0000
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA
    li r1,>2000
    li r2,768
lp1 movb r1,@VDPWD
    dec r2
    jne lp1

* call magnify(2)
    li r0,>c181        * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA

* call sprite(#1,42,2,1,1)
    li r0,>0186        * vdp register 6 to >01 (sprite descriptor table to >0800)
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA
    li r0,>0043        * write address >0300
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA
    li r0,>002A        * 1,1 (minus 1) and 42
    movb r0,@VDPWD
    nop
    movb r0,@VDPWD
    swpb r0
    movb r0,@VDPWD
    li r0,>01d0        * color 2 (-1) and list terminator
    movb r0,@VDPWD
    swpb r0
    movb r0,@VDPWD

* cnt=100
    li r5,100

l140
* for x=1 to 240 (minus 1 for asm)
    clr r3
xlp1
* call locate(#1,1,x)
    li r0,>0143        * write address >0301 (X pos)
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA
    nop
    movb r3,@VDPWD
* next x
    ai r3,>0100
    ci r3,>f000
    jne xlp1

* for y=1 to 176
    clr r4
ylp1
* call locate(#1,y,240)
    li r0,>0043        * write address >0300 (Y pos)
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA
    nop
    movb r4,@VDPWD
* next y
    ai r4,>0100
    ci r4,>b000
    jne ylp1

* for x=240 to 1 step -1
    li r3,>ef00
xlp2
* call locate(#1,176,x)
    li r0,>0143        * write address >0301 (X pos)
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA
    nop
    movb r3,@VDPWD
* next x
    ai r3,>ff00
    ci r3,>ff00
    jne xlp2

* for y=176 to 1 step -1
    li r4,>af00
ylp2
* call locate(#1,y,1)
    li r0,>0043        * write address >0300 (Y pos)
    movb r0,@VDPWA
    swpb r0
    movb r0,@VDPWA
    nop
    movb r4,@VDPWD
* next y
    ai r4,>ff00
    ci r4,>ff00
    jne ylp2

* cnt=cnt-1
    dec r5
    jne l140

* end
    blwp @>0000
    end

TurboForth:

VARIABLE cnt
hex
: asterisk DATA 4 0028 107C 1028 0000 12a dchar ;
decimal
: test
  1 gmode page 1 magnify asterisk
  0 0 0 42 1 sprite
  100 dup cnt !
  begin
  while
    239 0 do 0 0 I sprloc loop
    175 0 do 0 I 239 sprloc loop
    0 239 do 0 175 I sprloc -1 +loop
    0 175 do 0 I 0 sprloc -1 +loop
    cnt @ 1- dup cnt !
  repeat
  bye
;

If porting - note how the corners overlap for one frame each! (For example, the X loop positions at 1,240, and then the Y loop ALSO positions at 1,240). Alllllso, for XB you might want to only time one lap and multiply it by 100.

My tests for the above test come out like so:

XB (estimated): 2000 seconds (33 mins)
Assembly (8-bit code): 7 seconds
TurboForth: 48 seconds

I attempted a UCSD Pascal version, but it kept saying it couldn't find the library on 'USES SPRITE' when I tried to compile, so I gave up... and I'm out of time for the GPL version.
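As a rough cross-check of the corner overlap and the "time one lap and multiply by 100" suggestion, here is a small Python sketch. It is not from the thread: the function names are made up for illustration, and it only models the coordinates the XB listing passes to CALL LOCATE, not any TI hardware.

```python
# Hypothetical model of the benchmark path; names are illustrative only.

def lap_positions():
    """Yield the (row, col) pairs one lap of the XB listing passes to CALL LOCATE."""
    for x in range(1, 241):          # line 140: top edge, left to right
        yield (1, x)
    for y in range(1, 177):          # line 150: right edge, top to bottom
        yield (y, 240)
    for x in range(240, 0, -1):      # line 160: bottom edge, right to left
        yield (176, x)
    for y in range(176, 0, -1):      # line 170: left edge, bottom to top
        yield (y, 1)


def estimate_total(one_lap_seconds, laps=100):
    """Extrapolate a full run from a single timed lap."""
    return one_lap_seconds * laps


lap = list(lap_positions())
calls_per_lap = len(lap)             # 240 + 176 + 240 + 176
total_calls = calls_per_lap * 100    # CNT=100 laps

# A position repeated on consecutive calls is one of the corner overlaps.
# The lap is treated as circular so the last call is compared with the first.
overlaps = [a for a, b in zip(lap, lap[1:] + lap[:1]) if a == b]

print("calls per lap:", calls_per_lap)
print("calls per run:", total_calls)
print("corner overlaps:", overlaps)
print("20 s per lap extrapolates to", estimate_total(20), "s")
```

Under this model each corner is written twice in a row, which matches the one-frame overlap described above.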
Willsy Posted January 22, 2016

Hmm.... it's academic but you might be able to make TF go faster by making it more like the assembly version, i.e. use V! to poke VDP memory. I'll have a look this evening and see if it'll be any faster. I was disappointed when I saw 48 seconds, but on the other hand SPRLOC and friends actually update a copy of the sprite attribute list in CPU RAM and copy portions of it to VDP, so there's a lot going on under the covers.

Edited January 22, 2016 by Willsy
sometimes99er Posted January 22, 2016

"Hmm.... it's academic but you might be able to make TF go faster by making it more like the assembly version. I.e use V! To poke VDP memory. I'll have a look this evening and see if it'll be any faster. I was disappointed when I saw 48 seconds, but ..."

It would then only be fair that time is spent to make the 2 other implementations faster.
Willsy Posted January 22, 2016

Yes of course!
Willsy Posted January 22, 2016

This one is based on Tursi's code, but pokes VDP directly. Some other little optimisations:

VARIABLE cnt
hex
: asterisk DATA 4 0028 107C 1028 0000 12a dchar ;
decimal
: test
  1 gmode page 1 magnify asterisk
  0 0 0 42 1 sprite
  100 cnt !
  begin
    cnt @ 0>
  while
    239 0 do i $301 v! loop
    175 0 do i $300 v! loop
    0 239 do i $301 v! -1 +loop
    0 175 do i $300 v! -1 +loop
    -1 cnt +!
  repeat
  bye
;

and here's one that removes the need for a variable:

hex
: asterisk DATA 4 0028 107C 1028 0000 12a dchar ;
decimal
: test
  1 gmode page 1 magnify asterisk
  0 0 0 42 1 sprite
  100 0 do
    239 0 do i $301 v! loop
    175 0 do i $300 v! loop
    0 239 do i $301 v! -1 +loop
    0 175 do i $300 v! -1 +loop
  loop
  bye
;

Both of them take 29 seconds. So that's 3.6 times slower than assembler and 69 times faster than XB. Rock on!

Edited January 22, 2016 by Willsy
+InsaneMultitasker Posted January 22, 2016

"and here's one that removes the need for a variable: Both of them take 29 seconds. So that's 3.6 times slower than assembler. Rock on!"

So it's about one 'forth' as fast?
Willsy Posted January 22, 2016

"So it's about one 'forth' as fast?"

Ha ha yes!
sometimes99er Posted January 22, 2016

"Both of them take 29 seconds. So that's 3.6 times slower than assembler and 69 times faster than XB."

So the newer language, with quite a few updates, gets optimized by its creator, and is then compared with the unoptimized versions. Now let's compile the XB and have the ASM run on the GPU. It can be done.
Willsy Posted January 22, 2016

Well, not really. I was just seeing if I could improve Tursi's time of 48 seconds.
sometimes99er Posted January 22, 2016

"Well not really. I was just seeing if I could improve tursi's time of 48 seconds."

Wow. Sure looks like you did compare them:

"Both of them take 29 seconds. So that's 3.6 times slower than assembler and 69 times faster than XB."

As Tursi said:

"Trying to compare languages and implementations was always a battle, ..."
lucien2 Posted January 22, 2016

GPL: 80 seconds

When we compared TF and GPL with the bricks demo 4 1/2 years ago they were closer.

       grom >6000
       data >aa00,>0100,>0000
       data menu
       data >0000,>0000,>0000,>0000
menu   data >0000
       data start
       stri 'BENCHMARK'

upcase equ >0018
x      equ arg
y      equ arg+1
xy     equ arg
cnt    equ arg+2

start
* magnify 2
       st >e1,@arg
       move 1,@arg,#1
* load uppercase character set
       dst >0900,@fac
       call upcase
* copy asterisk pattern to sprite char 0
       move 8,v@42*8+>800,v@>400
* define sprite 0 to character 0, color black
       dst >8001,v@>302
* locate sprite 0 to 1,1
       dst >0000,v@>300

       st 100,@cnt
L5     clr @x
L1     st @x,v@>301
       inc @x
       ch 239,@x
       br L1
       clr @y
L2     st @y,v@>300
       inc @y
       ch 175,@y
       br L2
       st 239,@x
L3     st @x,v@>301
       dec @x
       ceq 255,@x
       br L3
       st 175,@y
L4     st @y,v@>300
       dec @y
       ceq 255,@y
       br L4
       dec @cnt
       cz @cnt
       br L5
       exit
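The countdown loops in the GPL listing (dec, then ceq 255) and in the assembly version (ai >ff00, then ci >ff00) both end when the counter wraps below zero, so that position 0 still gets written. A minimal Python sketch of that wrap-around test follows. It is an illustration only: the function names are invented, and it assumes GPL's BR branches while the condition bit is reset, which is how the listing reads.

```python
# Illustrative model of the "count down until the byte wraps" loop test.

def gpl_countdown(start):
    """Model: L3 st @x,v@>301 / dec @x / ceq 255,@x / br L3 on an 8-bit byte."""
    x = start
    written = []
    while True:
        written.append(x)        # st @x,v@>301 writes the current position
        x = (x - 1) & 0xFF       # dec @x wraps from 0 to 255
        if x == 255:             # ceq 255,@x is true, so br L3 is not taken
            break
    return written


def asm_countdown(start_word):
    """Model: movb r3 / ai r3,>ff00 / ci r3,>ff00 / jne, position in the high byte."""
    r3 = start_word
    written = []
    while True:
        written.append(r3 >> 8)          # movb sends the high byte to the VDP
        r3 = (r3 + 0xFF00) & 0xFFFF      # adding >FF00 subtracts 1 from the high byte
        if r3 == 0xFF00:                 # wrapped past zero: jne falls through
            break
    return written


print(len(gpl_countdown(239)), gpl_countdown(239)[-1])
print(len(asm_countdown(0xEF00)), asm_countdown(0xEF00)[-1])
```

Both models write 240 positions (239 down to 0) for the X edge, which is why the loops compare against 255 or >FF00 rather than against zero.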
Tursi Posted January 23, 2016 (Author)

"Hmm.... it's academic but you might be able to make TF go faster by making it more like the assembly version. I.e use V! To poke VDP memory. I'll have a look this evening and see if it'll be any faster. I was disappointed when I saw 48 seconds, but on the other hand SPRLOC and friends actually update a copy of the sprite attribute list in cpu ram and copy portions of it to VDP so there's a lot going on under the covers."

Yeah, what I was trying to do was use the language's features. The intent was to compare to the baseline Extended BASIC code; once you start bypassing the language it becomes a debate whether it's a sensible comparison.

But the assembly version can be sped up with registers and scratchpad without changing the structure (also, the workspace was in 8-bit RAM, so I move that too. That's actually a bug, I never intended to not have the workspace in scratchpad):

* assumes startup from Editor/Assembler
    DEF START
    REF VDPWA,VDPWD

* make it work as EA5 if desired
    B @START

START
* performance
    lwpi >8300
    li r6,VDPWA
    li r7,VDPWD
    li r8,>0043
    li r9,>0143
    li r10,>ff00
    li r11,>0100
    li r12,>f000
    li r13,>b000
    li r0,l140
    li r1,>8320
sclp
    mov *r0+,*r1+
    ci r1,>8400
    jne sclp

* call clear
    li r0,>0040        * write address >0000
    movb r0,*R6
    swpb r0
    movb r0,*R6
    li r1,>2000
    li r2,768
lp1 movb r1,*R7
    dec r2
    jne lp1

* call magnify(2)
    li r0,>c181        * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
    movb r0,*R6
    swpb r0
    movb r0,*R6

* call sprite(#1,42,2,1,1)
    li r0,>0186        * vdp register 6 to >01 (sprite descriptor table to >0800)
    movb r0,*R6
    swpb r0
    movb r0,*R6
    mov r8,r0          * write address >0300
    movb r0,*R6
    swpb r0
    movb r0,*R6
    li r0,>002A        * 1,1 (minus 1) and 42
    movb r0,*R7
    nop
    movb r0,*R7
    swpb r0
    movb r0,*R7
    li r0,>01d0        * color 2 (-1) and list terminator
    movb r0,*R7
    swpb r0
    movb r0,*R7

* cnt=100
    li r5,100
    b @>8320

l140
* for x=1 to 240 (minus 1 for asm)
    clr r3
xlp1
* call locate(#1,1,x)
    mov r9,r0          * write address >0301 (X pos)
    movb r0,*R6
    swpb r0
    movb r0,*R6
    nop
    movb r3,*R7
* next x
    a r11,r3
    c r12,r3
    jne xlp1

* for y=1 to 176
    clr r4
ylp1
* call locate(#1,y,240)
    mov r8,r0          * write address >0300 (Y pos)
    movb r0,*R6
    swpb r0
    movb r0,*R6
    nop
    movb r4,*R7
* next y
    a r11,r4
    c r13,r4
    jne ylp1

* for x=240 to 1 step -1
    li r3,>ef00
xlp2
* call locate(#1,176,x)
    mov r9,r0          * write address >0301 (X pos)
    movb r0,*R6
    swpb r0
    movb r0,*R6
    nop
    movb r3,*R7
* next x
    a r10,r3
    c r10,r3
    jne xlp2

* for y=176 to 1 step -1
    li r4,>af00
ylp2
* call locate(#1,y,1)
    mov r8,r0          * write address >0300 (Y pos)
    movb r0,*R6
    swpb r0
    movb r0,*R6
    nop
    movb r4,*R7
* next y
    a r10,r4
    c r10,r4
    jne ylp2

* cnt=cnt-1
    dec r5
    jne l140

* end
    blwp @>0000
    end

That gets it down to 4.5 seconds - and it's the scratchpad workspace that makes most of the difference (1.5s)... running this code in scratchpad only saved about 1s. Since it spends all its time writing to VDP this program is multiplexer bound. So we'll round up for the table and say 5s.

All that said, I totally get the desire to optimize and there's no actual cheating in the TF version directly hitting VDP RAM, since it's built in. If XB had the ability to VPOKE we could try it there -- maybe an RXB version to see if it's faster.

"GPL: 80 seconds"

Thanks Lucien! I was hoping someone would take that on. Looks pretty good!

I'll split up first pass and optimized times to be fair - barring extreme bugs the first pass may be how someone new to the language would write it, optimized will be any interested party's best time (without changing the output of the program).

To be fair there, I've retimed the assembly version using VSBW etc, since that's how a new assembly programmer would normally start. That actually takes 17 seconds!
* assumes startup from Editor/Assembler
* slower version
    DEF START
    REF VSBW,VWTR,VMBW

* make it work as EA5 if desired
    B @START

sprdat data >0000,>2A01,>d000

START
* call clear
    clr r0
    li r1,>2000
    li r2,768
lp1 blwp @vsbw
    inc r0
    dec r2
    jne lp1

* call magnify(2)
    li r0,>01c1        * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
    blwp @vwtr

* call sprite(#1,42,2,1,1)
    li r0,>0601        * vdp register 6 to >01 (sprite descriptor table to >0800)
    blwp @vwtr
    li r0,>0300        * write address >0300
    li r1,sprdat       * sprite table
    li r2,5
    blwp @vmbw

* cnt=100
    li r5,100

l140
* for x=1 to 240 (minus 1 for asm)
    clr r3
xlp1
* call locate(#1,1,x)
    li r0,>0301        * write address >0301 (X pos)
    movb r3,r1
    blwp @vsbw
* next x
    ai r3,>0100
    ci r3,>f000
    jne xlp1

* for y=1 to 176
    clr r4
ylp1
* call locate(#1,y,240)
    li r0,>0300        * write address >0300 (Y pos)
    movb r4,r1
    blwp @vsbw
* next y
    ai r4,>0100
    ci r4,>b000
    jne ylp1

* for x=240 to 1 step -1
    li r3,>ef00
xlp2
* call locate(#1,176,x)
    li r0,>0301        * write address >0301 (X pos)
    movb r3,r1
    blwp @vsbw
* next x
    ai r3,>ff00
    ci r3,>ff00
    jne xlp2

* for y=176 to 1 step -1
    li r4,>af00
ylp2
* call locate(#1,y,1)
    li r0,>0300        * write address >0300 (Y pos)
    movb r4,r1
    blwp @vsbw
* next y
    ai r4,>ff00
    ci r4,>ff00
    jne ylp2

* cnt=cnt-1
    dec r5
    jne l140

* end
    blwp @>0000
    end

So we have:

Language      First Pass   Optimized
Assembly      17 sec       5 sec
TurboForth    48 sec       29 sec
GPL           80 sec       none yet
XB            2000 sec     none yet

Frankly it's looking good for all of them so far versus XB.
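The direct-port assembly listings load odd-looking constants such as >0043 and >c181, where the VSBW/VWTR version uses plain >0300 and >01c1. The Python sketch below shows how the direct-port values appear to be formed, judging from the comments in the listings: the word is sent to the VDP address port as two bytes (movb, swpb, movb), with the >40 write flag or the >80 register flag in the second byte. The helper names are invented for this illustration.

```python
# Illustrative helpers; they only reproduce the constants seen in the listings.

def vdp_write_address_word(addr):
    """Word for 'li r0,...' so that movb/swpb/movb sends low byte, then high byte | >40."""
    if not 0 <= addr < 0x4000:
        raise ValueError("VDP address must fit in 14 bits")
    low = addr & 0xFF
    high = (addr >> 8) | 0x40        # >40 marks the address as a write address
    return (low << 8) | high


def vdp_register_word(reg, value):
    """Word for a direct register write: value byte first, then >80 | register number."""
    if not (0 <= reg < 8 and 0 <= value < 0x100):
        raise ValueError("register 0-7 and byte value expected")
    return (value << 8) | (0x80 | reg)


for addr in (0x0000, 0x0300, 0x0301):
    print(f">{addr:04X} -> li r0,>{vdp_write_address_word(addr):04X}")
print(f"reg 1 = >C1 -> li r0,>{vdp_register_word(1, 0xC1):04X}")
print(f"reg 6 = >01 -> li r0,>{vdp_register_word(6, 0x01):04X}")
```

The slower version skips this encoding because VSBW, VWTR and VMBW take the plain address or register number in R0 and do the byte shuffling internally, at the cost of a BLWP per call.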
Willsy Posted January 23, 2016

Ah. I see. Yes I think that's fair and I see what Sometimes was saying now.
+InsaneMultitasker Posted January 23, 2016

For giggles I typed the program into Myarc's Advanced BASIC for the Geneve. It took approximately 8.2 minutes (490 or so seconds) to complete. Considering this BASIC is written in assembly (no GPL) I would have expected it to be a bit faster. I wonder if some of the sluggishness in both XB and ABASIC isn't related to all the floating point manipulation.
senior_falcon Posted January 23, 2016

51 seconds for compiled XB, 8-bit bus
37 seconds for compiled XB, 16-bit bus
Willsy Posted January 23, 2016

"51 seconds for compiled XB 8 bit bus 37 seconds for compiled XB 16 bit bus"

Wow that's really good!
Asmusr Posted January 23, 2016

This seems like a straightforward benchmark, but what does it actually mean to move a sprite around at a rate faster than 1/60s, resulting in visual frames being skipped?
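To put numbers on the skipped frames, the Python sketch below divides the position updates in one run by the reported run times. It is a back-of-the-envelope calculation, not a measurement: it assumes a 60 Hz display (the 1/60s in the question above), 832 updates per lap as in the XB listing, and the times quoted in the thread so far. The names are made up for the example.

```python
# Rough arithmetic only; the 60 Hz frame rate is an assumption.

UPDATES_PER_RUN = (240 + 176 + 240 + 176) * 100   # 832 updates per lap, 100 laps
FRAME_RATE_HZ = 60


def updates_per_frame(total_seconds):
    """Average number of sprite position writes per displayed frame."""
    return UPDATES_PER_RUN / total_seconds / FRAME_RATE_HZ


reported_seconds = {
    "Assembly (optimized)": 5,
    "Assembly (first pass)": 17,
    "TurboForth (optimized)": 29,
    "TurboForth (first pass)": 48,
    "GPL": 80,
    "XB (estimated)": 2000,
}

for name, seconds in reported_seconds.items():
    print(f"{name}: {updates_per_frame(seconds):.1f} updates per frame")
```

On these figures only XB moves the sprite less than once per frame; every other entry writes several positions between refreshes, so most of those positions are never displayed.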
globeron Posted January 23, 2016

I think I have the software somewhere (it is somewhere in the Tijdingen TI-GG NL magazine from the '80s), but there was this fun thing when changing the screen color continuously: it generated kind of moving bars on the screen. I think it only works on CRT televisions (50 Hz/60 Hz), as I tried it on an LCD but did not see it happening. It is very simple, something like

100 Call Screen(4)
110 Call screen(5)
120 Goto 100

and I did the same in TI-Basic, Extended Basic, TP99 (Turbo Pascal), C99 (C) and Assembler. The difference was that the stripes increased (e.g. Basic had 2 or 3 large bars alternating, but TP99 had several small stripes, and Assembler was very fast switching colours).

Not sure if it is a good benchmark to compare languages, but it was visual. I just tried in Classic99, but here colours switch fast.
+Retrospect Posted January 23, 2016

I didn't think BASIC on a TI would be able to do the raster CRT bars! ... cuz it uses CALLs which, I recently read, are one of the reasons for slow speed. I did this trick on a Spectrum though.
Asmusr Posted January 23, 2016

If it doesn't work, it is because of the emulator: the screen in some emulators is drawn too fast or is not drawn concurrently with the CPU (it does work in MESS). You should always get some type of raster bars if you change the background color at random intervals on the hardware (and are not timing it with the vertical refresh). It has nothing to do with CRT vs LCD AFAIK.

The problem on the TI is keeping the bars steady because the clocks of the CPU and the VDP are not synchronized. The only way I'm aware of to get a stable raster effect is to use the 5th sprite flag to measure when the VDP is reaching a specific scan line.

Edit: sorry for polluting this thread, the benchmark is fine as long as you realize it's basically about how fast you can update one VDP RAM byte with increasing or decreasing values.
+Lee Stewart Posted January 24, 2016

Here are the fbForth equivalents(?) of the two TurboForth sprite runs. First pass:

HEX
064 VARIABLE CNT
: TEST
   GRAPHICS PAGE 1 MAGNIFY
   0 0 1 02A 0 SPRITE
   BEGIN
      CNT @
   WHILE
      0EF 0 DO I 0 0 SPRPUT LOOP
      0AF 0 DO 0EF I 0 SPRPUT LOOP
      0 0EF DO I 0AF 0 SPRPUT -1 +LOOP
      0 0AF DO 0 I 0 SPRPUT -1 +LOOP
      -1 CNT +!
   REPEAT
   MON ;

and port of the TF optimized pass:

HEX
: TEST
   GRAPHICS PAGE 1 MAGNIFY
   0 0 1 02A 0 SPRITE
   064 0 DO
      0EF 0 DO I 301 VSBW LOOP
      0AF 0 DO I 300 VSBW LOOP
      0 0EF DO I 301 VSBW -1 +LOOP
      0 0AF DO I 300 VSBW -1 +LOOP
   LOOP
   MON ;
DECIMAL

The first took 70 seconds and the second took 58 seconds. I might be able to optimize further; but, fbForth cannot really compete with the scratchpad-optimized words of TurboForth that run on the 16-bit bus.

...lee

Edited April 22, 2022 by Lee Stewart: Prettified the code
Tursi Posted January 24, 2016 (Author)

Thanks for the continued updates folks! I'm finding this pretty interesting. And yeah, the output to the screen is irrelevant, it's just about taking a normal operation to hardware (moving a sprite) and using it to benchmark the performance of the language. This is certainly not comprehensive, but I wanted something that was quick to implement and still at least somewhat real-world.

So what I see so far:

Language      First Pass   Optimized
Assembly      17 sec       5 sec
TurboForth    48 sec       29 sec
Compiled XB   51 sec       37 sec
FbForth       70 sec       58 sec
GPL           80 sec       none yet
ABASIC        490 sec      none yet
XB            2000 sec     none yet

(I included ABASIC although I don't know if it's a fair comparison since it's a different computer!)
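For anyone who wants the table as ratios rather than raw seconds, here is a short Python sketch that divides the XB estimate by each entry. The numbers are copied from the table above; the dictionary layout and function name are just for this example, and the 2000-second XB figure is itself an estimate, so the ratios are approximate.

```python
# Times in seconds from the table above: (first pass, optimized or None).
results = {
    "Assembly": (17, 5),
    "TurboForth": (48, 29),
    "Compiled XB": (51, 37),
    "fbForth": (70, 58),
    "GPL": (80, None),
    "ABASIC": (490, None),
    "XB": (2000, None),
}

XB_SECONDS = results["XB"][0]


def speedup_vs_xb(seconds):
    """How many times faster than the estimated XB run a given time is."""
    return XB_SECONDS / seconds


for name, (first_pass, optimized) in results.items():
    line = f"{name:12s} first pass {speedup_vs_xb(first_pass):6.1f}x"
    if optimized is not None:
        line += f"   optimized {speedup_vs_xb(optimized):6.1f}x"
    print(line)
```

The optimized TurboForth time of 29 seconds works out to roughly 69 times faster than XB, in line with the figure quoted earlier in the thread.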
Tursi Posted January 24, 2016 (Author)

The original question was 'how does GPL compare?'... to be honest I'm surprised. While it is the slowest (non-BASIC) tested so far, it's not the slowest by much. Any of those languages would be just fine.

If I posted my Pascal attempt, would someone be able to help figure out why it doesn't compile?

Edited January 24, 2016 by Tursi
+Lee Stewart Posted January 24, 2016

I would like to revise the fbForth optimized code. The following is more in line with the TurboForth code I was attempting to port. It defines V! similar to how it is defined in TurboForth:

HEX
ASM: V!
   *SP+ R0 MOV,         ( pop addr)
   *SP+ R1 MOV,         ( pop value)
   R1 SWPB,             ( get LSB of value into MSB)
   0 LIMI,              ( disable interrupts)
   R0 4000 ORI,         ( tell VDP processor "hey, this is a *write*")
   R0 SWPB,             ( get low byte of address)
   R0 8C02 @() MOVB,    ( write it to vdp address register)
   R0 SWPB,             ( get high byte of address)
   R0 8C02 @() MOVB,    ( write it)
   R1 8C00 @() MOVB,    ( write payload)
   2 LIMI,              ( enable interrupts)
;ASM

: TEST
   GRAPHICS PAGE 1 MAGNIFY
   0 0 1 02A 0 SPRITE
   064 0 DO
      0EF 0 DO I 301 V! LOOP
      0AF 0 DO I 300 V! LOOP
      0 0EF DO I 301 V! -1 +LOOP
      0 0AF DO I 300 V! -1 +LOOP
   LOOP
   MON ;
DECIMAL

This runs in 26 seconds!

...lee

Edited April 22, 2022 by Lee Stewart: Prettified the code
senior_falcon Posted January 24, 2016

Doggone it, now I suppose I'll have to do the program in XB using CALL LOADs. Results later today.

Oops, just remembered that I need to write to VDP, not CPU. So maybe no results today.

Edited January 24, 2016 by senior_falcon