
why ZX has and atari hasnt?


Poison


The zero-page 16-bit add is 18 cycles; the absolute version would be 22 cycles. What's with the POP DE you keep using? You first have to store the constants on the stack, whereas normal code can have the constants already loaded at the correct position when the program loads.

 

The POP DE is a wonderful Z80 hack. POPs are 16-bit loads with address updates, and on the Z80 the SP is 16 bits, so it can point anywhere. So the optimisation is to retarget the stack pointer as an index pointer. It's an evil hack, which is why I used self-modifying zero-page 6502 code to compare against it.

 

Well, I'm familiar with SP pointers from 80x86, and you're not supposed to modify those unless you disable all interrupts and take care of the stack yourself. So it's a hack that restricts your IRQs and NMIs, then; not normal coding.


 

I imagine you have to make sure no interrupts can happen while you've re-purposed the stack pointer. I've used the 6502 stack pointer and the lower half of the stack to do a few sneaky things, but overall the 6502's stack instructions aren't really faster than using other tricks.


Z80 has a register called SP / stack pointer.

You set it up to point at the beginning of an array, and then POP or PUSH just 'browses' through that data without effort... Beautiful stuff...

5 A8 cycles for popping 2 bytes...

5.5 A8 cycles for pushing 2 bytes...

Compare that to the 6502's PHA and PLA, and its fixed stack position :(

Great for filling memory. PUSH after PUSH, and any place in memory (to set it you just load one register) gets filled at a rate of 2.75 A8 cycles/byte...

Even unrolled code on the A8 cannot come close to it...

And let's not even mention the C64's 1 MHz... :(

 

How do you end up with fractional cycles? Once you modify the SP register, any sort of maskable or worse nonmaskable interrupt will overwrite your data that you are browsing via SP.


Definitely no interrupts :) - but it's worth it for the speed-up. (The 6502 stack instructions are byte-only and confined to page 1, which is the real limit.)

 

Well, PHA is 3 cycles so it's like another zero-page write in cycle counts and still much faster than Z80.

 

The 80x86 can actually use the stack better, since it can also index into the stack through BP (for example [BP+offset]), so the SP can remain the same and you can address memory above the stack pointer.


The Z80 is clocked higher as standard, probably because it doesn't use two different clock signals (for example, in the MSX machines it's 3.58 MHz, which is 2x the Atari clock).

Just an FYI, MSX uses the same VDP as the TI. If it's like the TI all video memory is accessed through the VDP so that will totally alter things from the theoretical best.


I took the A8's cycle as the measurement unit :)
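The fractional figures fall out of converting Z80 T-states into A8 cycles. A rough sketch of that conversion, assuming the 3.58 MHz Z80 clock quoted earlier for MSX and a 1.79 MHz A8 clock (both clock values and the helper name are assumptions for illustration; POP at 10 and PUSH at 11 T-states are the documented Z80 timings):

```python
# Assumed clocks: 3.58 MHz Z80 (the MSX figure quoted earlier) and 1.79 MHz A8.
Z80_MHZ = 3.58
A8_MHZ = 1.79

def t_states_to_a8_cycles(t_states):
    """Express a Z80 instruction time in units of one A8 CPU cycle."""
    return t_states * A8_MHZ / Z80_MHZ

pop_a8 = t_states_to_a8_cycles(10)    # POP rr takes 10 T-states
push_a8 = t_states_to_a8_cycles(11)   # PUSH rr takes 11 T-states

print(f"POP : {pop_a8:.2f} A8 cycles per 2 bytes")
print(f"PUSH: {push_a8:.2f} A8 cycles per 2 bytes, {push_a8 / 2:.2f} per byte")
```

With a clock ratio of exactly 2, an odd T-state count such as PUSH's 11 always lands on a half cycle, which is where 5.5 and 2.75 come from.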

 

To be clear:

The Z80 in the ZX Spectrum can fill memory at 1 byte per 1.536 microseconds.

And more importantly, you only need to load the SP register and go on; no need for INX, INY, watching the high byte of the address and stuff like that...

The 6502 in the Atari 800 at best (unrolled absolute addressing code) can fill memory at 4 cycles/byte = 2.234 microseconds.

 

In real life it would be even worse: the A8 has no equivalent as simple as the Z80 routine.

On the Z80 you can unroll the PUSHes like this, so that loop branching becomes almost unimportant, and that is how fast the Spectrum can go without effort:

LD SP,address_to_fill ; set up SP (PUSH writes downward, so point it just past the end of the area)

LD DE,value_to_fill

PUSH DE

PUSH DE

...

PUSH DE

'loop' (crazy ace would need to help here / my Z80 coding skills are a little rusty :) )

The inner loop on the A8 would have to look something like this to allow the same freedom:

STA (ZP),Y

INY

8 cycles = 4.47 microseconds...
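For anyone who wants to check the microsecond figures, here is a small sketch that puts the fill loops side by side. The clock values are assumptions: the 1.536 figure matches a 3.58 MHz Z80, while a stock Spectrum runs at 3.5 MHz, so both are shown. Cycle counts are the standard ones (PUSH 11 T-states, STA abs 4, STA (zp),Y 6, INY 2).

```python
# Assumed clocks in MHz; none of these constants come from the posts themselves.
CLOCKS = {"z80_3.58": 3.58, "z80_3.50": 3.5, "a8": 1.79}

def us_per_byte(cycles, bytes_written, mhz):
    """Microseconds needed per byte for a sequence taking `cycles` clock cycles."""
    return cycles / bytes_written / mhz

fill_times = {
    "PUSH DE @ 3.58 MHz": us_per_byte(11, 2, CLOCKS["z80_3.58"]),
    "PUSH DE @ 3.50 MHz": us_per_byte(11, 2, CLOCKS["z80_3.50"]),
    "STA abs (unrolled)": us_per_byte(4, 1, CLOCKS["a8"]),
    "STA (zp),Y + INY": us_per_byte(6 + 2, 1, CLOCKS["a8"]),
}

for name, t in fill_times.items():
    print(f"{name:20s} {t:.3f} us/byte")
```

Under these assumptions the PUSH fill comes out at roughly 1.54 to 1.57 us/byte, against about 2.23 for unrolled STA abs and about 4.47 for the indirect loop body.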

 

There is no irq complication that can make this raw speed increase not worth doing.


Just an FYI, using the stack pointer as a fast pointer register is common practice on several non-6502 CPUs.

 

The 6800 only has one pointer register so it was common there and on the 6801/3 which is really an improved 6800 anyway. I even used it on the fastest version of the music player I ported to the 6803. The player is interrupt driven so it was a no brainer there... interrupts are already disabled.

 

It's commonly recommended that you use the user stack pointer on the 6809 before the IY index register because IY use involves a 2 byte instruction. Since the 6809 can push/pull multiple registers with one instruction, you can fill 8 bytes with each PUSH.

 

The 65816 could also use this as long as you were in 16bit mode and using the first 64K of RAM. Each push would be 4 clock cycles and a pull would be 5.


And here is a prime example of why you don't want to compare instruction vs instruction.

 

The PHA may be fast, but it is stuck on page 1, so how does it help you? You can't do a fast screen clear with it, you can't use it to clear a buffer fast unless it's small enough to fit in the stack memory, you can't use it in a memory copy... so it's basically useless in this case unless you want to erase something on the 256-byte stack. And keep in mind there is some overhead in saving and restoring the stack pointer, so there is a minimum number of pushes required for this optimization to even be practical in the first place.
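That break-even point can be estimated for the Z80 version of the trick. A rough sketch, assuming the fill is wrapped in DI / LD (saved_sp),SP / LD SP,nn ... LD SP,(saved_sp) / EI and compared against an unrolled LD (HL),E : INC HL body. The wrapper sequence, the comparison routine and the names are assumptions for illustration; the T-state counts are the documented Z80 timings.

```python
import math

# Z80 T-states for the one-off setup/teardown around an SP-based fill (assumed sequence).
OVERHEAD = {
    "DI": 4,
    "LD (saved_sp),SP": 20,
    "LD SP,nn": 10,
    "LD SP,(saved_sp)": 20,
    "EI": 4,
}

PUSH_PER_BYTE = 11 / 2    # PUSH DE: 11 T-states for 2 bytes
PLAIN_PER_BYTE = 7 + 6    # LD (HL),E (7) + INC HL (6), unrolled, 1 byte

def break_even_bytes():
    """Smallest byte count at which the SP trick beats the plain unrolled store."""
    overhead = sum(OVERHEAD.values())
    saving_per_byte = PLAIN_PER_BYTE - PUSH_PER_BYTE
    return math.ceil(overhead / saving_per_byte)

print(sum(OVERHEAD.values()), "T-states of overhead")
print(break_even_bytes(), "bytes to break even")
```

Under these assumptions the overhead is paid back after a handful of bytes, so the trick only loses on very short runs.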


In real life, it'll depend on the application. ANTIC can put zero scanlines or any other pattern. And if there's enough RAM, just change the video memory pointer. And just putting the same value in memory isn't that common a thing; more common is the memory copy, where the stack isn't much use.

 


I think your example isn't that useful. We are supposed to be comparing common instructions like LDA/STA/Add/Sub/etc., and memory copy is much more common than filling with a constant. Normally, the initialization phase does the clearing, so it's a one-time event. And in this very specific case, you are also hogging memory with loop unrolling.

 


 

For common usage of instructions like LDA/STA/Add/Sub, you cannot disable IRQs/NMIs. It's self-contradictory. We're trying to measure cycles for commonly used instructions and are using a specialized case of the SP on the Z80 to do the comparison. Suppose a DLI/VBI is executing (as is common on the A8): if the background code ever disables IRQs/NMIs, you have just screwed up the display.

 

If you want to talk specific specialized cases, well just POKE 54272,0 and POKE 53274,color and you have filled the screen with a constant value.


 

I'm not comparing memory copy using 6502 stack vs. z80 stack. I'm comparing a stack as a stack. To push a value on the stack, I do LDA #value, PHA. If you want to compare memory copy, 6502 doesn't need to use the stack; it'll win using its normal instructions while Z80 even with its specialized limited use of SP loses. So I'm not comparing instructions one-to-one, I'm comparing Z80 with 6502. Use whatever instructions you feel like, but don't tell me I have to disable my DLIs/VBIs in order to use it. Then it's no longer some commonly used item. And I have a bunch of specialized things I can do.


Memcpy would be

 

LDI for Z80 , 16 cycles

 

vs

 

lda src,x

sta dest,x

inx

 

10 cycles for the 6502 (8 cycles in the best case, where you unroll and don't need the INX every time).

 

So the 6502 is at best equal to the Z80 ( assuming the 2x clock ).
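A quick sketch of that comparison in microseconds, using the cycle figures from this post and assumed clocks of 3.58 MHz (Z80) and 1.79 MHz (6502). One caveat: standard NMOS 6502 timing tables list STA abs,X at 5 cycles, which would make the three-instruction indexed body 11 rather than 10; the sketch sticks to the figures as posted.

```python
# Assumed clocks in MHz (the "2x clock" case discussed in the thread).
Z80_MHZ = 3.58
A8_MHZ = 1.79

# (cycles per byte copied, clock in MHz); cycle figures as quoted in the post.
copy_cycles = {
    "Z80 LDI": (16, Z80_MHZ),
    "6502 LDA/STA/INX": (10, A8_MHZ),
    "6502 unrolled, no INX": (8, A8_MHZ),
}

copy_us = {name: cycles / mhz for name, (cycles, mhz) in copy_cycles.items()}

for name, t in copy_us.items():
    print(f"{name:24s} {t:.3f} us/byte")
```

At exactly twice the clock, 16 T-states and 8 cycles take the same time, which is the "at best equal" case.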

 

 

(Regarding ANTIC: well, that's not a Z80 vs 6502 comparison. ANTIC is way better than the video h/w on the Spectrum :) , in my opinion of course.)


Common example of filling memory is drawing polygons on screen...

Horizontal line drawing routine needs to fill memory just like that, and that is where bulk of rendering time goes...

So in my opinion that Z80 stack routine is perfect for that kind of stuff.

 

I'm not saying that A8 doesn't have tricks up its sleeve, you sure can do wonders with ANTIC.

One big advantage of the Atari over the Spectrum is the character screen. We can use characters to fill the inner portion of polygons (1 STA = 8 bytes); the Spectrum cannot compete with that!

 

Don't go into memory copy battle with Z80 - it will end up badly :)

LDI performs a "LD (DE),(HL)", then increases DE, HL, and decreases BC (counter)

 

Try to copy more than 256 bytes:

 

Z80:

LD HL,2000h ; source

LD DE,4000h ; destination

LD BC,1000h ; number of bytes to copy

LDIR

 

LDIR takes 21 cycles per byte copied and repeats the copy until BC reaches 0. It does the whole job by itself :)

 

A8:

You need LDA, STA (absolute or indirect addressing), INX or INY, a BNE to see whether you have reached the end of the page, and another one or two compares to see whether you have reached the end of the counter.

In the best case you have at least LDA, STA, INX, BNE (4+5+2+3+2=16 cycles).

The Z80 is more than 50% faster in that case...

And LDIR has no influence on interrupts, so I think it's safe to use ;)

The only source of interrupts on the ZX Spectrum is a non-programmable timer that ticks 50 times per second.

So you can be pretty sure when it is safe to use the stack and when it is not...


Back to the topic:

 

why ZX has and atari hasnt?

 

Why, really? Do we have any 3D game on the A8 that is less than 20-25 years old?

There is no technical reason for it (Numen and Project M show just a small part of what could be done...)

 

I really miss Freescape games on A8... :(

 

They are perfect for porting to atari... :)


Ok, I must have missed a change of gears in the discussion. I thought the original example was using the stack pointer as a fast memory pointer for something like the Yoomp render routine, something that would be useful on the Z80 but not on the 6502. Just remember that you can push/pull 16 bits at a time on the Z80 and only 8 on the 6502. Also, the X & Y registers have to go through the accumulator to be pushed or pulled on the 6502, so you only presented the best case. The 65C02 and 65816 support direct push/pull of X & Y.

 

I never said anything about DLIs/VBIs in my post... I have been saying all along you have to account for the hardware and can't just compare cpu vs cpu.


I'll accept they are about equal at 2X clock. However, there are various memcopy scenarios. If you are copying, let's say, sprite data (8*16 sprite), you can have it in zero page and have the loop unrolled (7 cycles/byte). If you know the data, you can just do LDA #val; STA abs, and it's 6 cycles/byte. And 10 cycles/byte would be the worst case, for large blocks that use abs addressing.

 


 

Okay, I won't bring in ANTIC but I was giving example of a specialized scenario. For 6502 specific case, I can use INC, DEC, ROL, ASL, ROR, and LSR to I/O ports to get two writes within 6 cycles. For example, INC 53774 can be used to acknowledge an IRQ.


Okay, for some special case of large polygon fills and zero interrupts. But even for polygons in general, you would be modifying your SP every line where only 2 or 3 bytes are being filled, so add that to the overhead, along with the fact that you may be filling less than a byte or an odd number of bytes, so the 16-bit push would be divided even further.

 


No, worst case for 6502 is: Repeat LDA abs,x:STA abs,x:Inx a few times and the BNE cycles become insignificant so it's 10 cycles/byte. And even in your example, the total cycles is 4+4+2+3 = 13.
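To see how much the disputed cycle counts matter, here is a small sketch comparing LDIR (21 T-states per byte) against the three 6502 figures mentioned so far: 16, 13 and the unrolled 10 cycles per byte. The clocks (3.58 MHz and 1.79 MHz) and the helper names are assumptions for illustration.

```python
# Assumed clocks in MHz (the "2x clock" case).
Z80_MHZ = 3.58
A8_MHZ = 1.79

def us(cycles, mhz):
    """Time in microseconds for `cycles` clock cycles at `mhz`."""
    return cycles / mhz

def percent_faster(slow_us, fast_us):
    """How much faster the second routine is, as a percentage (negative if slower)."""
    return (slow_us / fast_us - 1) * 100

ldir = us(21, Z80_MHZ)
loops = {"16 cycles": us(16, A8_MHZ), "13 cycles": us(13, A8_MHZ), "10 cycles": us(10, A8_MHZ)}

for name, t in loops.items():
    print(f"6502 at {name}: LDIR is {percent_faster(t, ldir):+.1f}% faster")
```

With these assumptions the "more than 50%" claim only holds for the 16-cycle count; at 13 cycles the gap is under a quarter, and at 10 cycles the 6502 loop comes out slightly ahead.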

 


 

Oh, I see: the ZX Spectrum doesn't have much hardware generating IRQs/NMIs to worry about, but drop a Z80 into an A8 and that trick is practically useless. So it can't be used to show Z80 superiority, since it depends on the system lacking other time-critical hardware.


 

Marketing, marketing, and .... not to forget .... marketing ;)

 

And, partially, false information about the hardware limits. Or missing information about the possibility of using real modulation on POKEY's sound generators... or the possibility of transferring "megabytes per second" into the RAM and the attached external drives...

 

And at last, not all coders are as clever and interested as ERU or Sheddy.

 

My main concern is "Altirra" and the speed problems of its sound emulation. It seems no one notices them. Somehow that reflects the interest of the community.


 

I'm not sure about the "interest of the community", but I'd say my interest isn't in how accurate the emulators are, since I use real H/W. I like Altirra, especially for its debugger, but I have a real machine set up right next to my desk here and use that far more than the emulators.

Edited by spookt

Yoomp stuff was in post #84 by CrazyAce where it was >2X slower. As for general instruction comparisons, even a 16-bit stack comparison: LD BC,val:PUSH BC is 21 cycles whereas A8 would be LDA #lsb:PHA:LDA #msb:PHA which is 10 cycles so still better even at 2X factor. DLI/VBI was related if using SP pointers to fill things. As far as pushing PHX and PHY go, you can write them directly to stack or zero page with STX and STY to save them (no need for 65c02).
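The 16-bit push comparison works out as follows. This is a sketch only: the cycle counts are the standard ones (LD BC,nn 10 and PUSH BC 11 T-states; LDA # 2 and PHA 3 cycles), while the clock values are assumptions.

```python
# Assumed clocks in MHz.
Z80_MHZ = 3.58
A8_MHZ = 1.79

z80_t_states = 10 + 11            # LD BC,val : PUSH BC
m6502_cycles = 2 + 3 + 2 + 3      # LDA #lsb : PHA : LDA #msb : PHA

z80_us = z80_t_states / Z80_MHZ
m6502_us = m6502_cycles / A8_MHZ

print(f"Z80 : {z80_t_states} T-states = {z80_us:.3f} us")
print(f"6502: {m6502_cycles} cycles   = {m6502_us:.3f} us")
```

At a 2X clock the two are within about 5% of each other, with the 6502 sequence marginally ahead under these assumptions.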


I guess there is no rational comparison of cycles of certain assembler commands that will persuade anyone reading this forum to lean on one side or another...

 

The only way for anyone to really like one side over another is a real game example like Spectrum already has, and unfortunately Atari still has not... :(

 

So, I call on coders to unite and produce something usable and make us proud :)
