
Z80 vs. 6502


BillyHW


I gave it a look too... thinking some more, 6809 efficiency is somewhat more than my previous estimate especially if the programmer is aiming for it.

 

Seems it was late to the party though. CPU culture is traditionally hard to break. Most prominent example obviously being the Mac with 68K then PPC then x86/64. But most of the big companies in the old days were set in their ways.


I gave it a look too... thinking some more, 6809 efficiency is somewhat more than my previous estimate especially if the programmer is aiming for it.

...

If you are only using the A register and Y as a counter for a simple loop, the 6502 can actually be faster than the 6809 due to the lack of a prefetch on the 6809. As long as you don't have a lot of indexing, can unroll loops, don't use the stack a lot, etc... the 6502 does pretty well.

As soon as you need to use much indexed addressing, the 6502 starts to show its weakness, and using 16-bit numbers/pointers pretty much swings things in favor of the 6809.

Add the auto increment, auto decrement, 2 accumulators or 16-bit D register, single-instruction stack operations, 2nd stack pointer, multiply, etc... and the time the 6502 saves thanks to its prefetch is more than eaten up by the extra instructions it needs.

Just being able to use a stack pointer as a data pointer can make a huge difference in code size. You can load multiple registers and update your data pointer with a single instruction.

The combined D register also lets you load two bytes of data with a single instruction.

 


If you are only using the A register and Y as a counter for a simple loop, the 6502 can actually be faster than the 6809 due to the lack of a prefetch on the 6809. As long as you don't have a lot of indexing, can unroll loops, don't use the stack a lot, etc... the 6502 does pretty well.

...

That should probably say "and X or Y as a counter".


Provided complex data structures are arranged optimally for the 8-bit accumulator, the only things I tend to miss on the 6502 (with linked lists and the like) are STX abs,Y and STY abs,X to complement the load instructions. As with PHA and PLA, a lot of register copying is necessary there that could have been nicely avoided. I've never used (ZP,X) once in twenty years of assembly coding, so I wouldn't miss that at all. Provided the code isn't in ROM, of course, self-modifying code can be a nice way of getting around the lack of LDA/STA (ZP),X if you happen to be using both registers at once to access data in a loop. I do try to avoid it, though, especially since I started to do a lot of cartridge coding.

Edited by flashjazzcat

But by the time XL came out they were using 150 ns chips, and pretty sure the XE had 90 ns chips.

IIRC the Amiga was designed around 150ns memory. Or was it 120ns? I'm pretty sure I read that the ST "overclocked" its chips compared to spec.

I'm guessing 150/90 was what was the cheapest available on the market at that time.

 

(And why does someone do something as idiotic as the Electron and split bytes into nibbles?)

(Why #2: why didn't the C= 128's 2MHz mode just sync down to 1MHz while the video was fetching memory?)


The C= design from the start seems kind of flawed - the initial idea was supposedly to run the CPU at full speed and put up with the sporadic DMA like Atari does. Supposedly a little later the decision was to go to the slightly faster system bus speed which gives the compressed horizontal screen you see. And for whatever reason they went with the 50/50 memory access split (of course with the odd extra steal of dedicated CPU cycles).

 

If you look at the allocation of cycles, you can see that even with all sprites active there would still be a few spare cycles the CPU otherwise could have had - plus in normal circumstances, VIC doesn't exactly use a lot of its cycles.

 

The Electron - I think it was simply cost-cutting. At the time they could get the 4-bit DRams cheaply, but apparently not quite cheap enough to just have 64K rather than the method they chose to give 32K.

 

Amiga - well, the system bus on base machines was ~ 7.2 MHz. Assuming access time needs to be about half a cycle duration that makes for 69 ns, about 62.5 ns for the ST at 8 MHz.
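
The half-cycle arithmetic above can be checked with a quick sketch. It assumes the round clock figures quoted in the post (about 7.2 MHz and 8 MHz) and the post's own rule of thumb that access time is about half a clock period; `half_cycle_ns` is just an illustrative helper name.

```python
def half_cycle_ns(clock_hz):
    """Half of one clock period, in nanoseconds."""
    return 1e9 / clock_hz / 2

amiga_ns = half_cycle_ns(7.2e6)   # assumed ~7.2 MHz Amiga bus clock
st_ns = half_cycle_ns(8.0e6)      # ST at 8 MHz

print(f"Amiga: {amiga_ns:.1f} ns, ST: {st_ns:.1f} ns")
```

Whether half a clock period is really the right budget is a separate question from the arithmetic itself.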


I'm pretty sure I read that the ST "overclocked" its chips compared to spec.

Ah, I found out where I had read that: http://www.dadhacker.com/blog/?p=1383

(Bit odd though when reading this a second time - the 68010 handled exceptions properly in 1982, and the 68020 came out in 1984 - too expensive I guess.)

(And if you read http://www.easy68k.com/paulrsm/dg/dg34.htm it seems strangely spot on in its Atari speculations about what was about to happen after the buyout...(page 21))


Amiga - well, the system bus on base machines was ~ 7.2 MHz. Assuming access time needs to be about half a cycle duration that makes for 69 ns, about 62.5 ns for the ST at 8 MHz.

The 68000 uses 4 clock cycles for a read or write cycle. That WOULD be ~560 ns; however, the 68000 only actually uses the last two cycles of the four to read/write the data. So the Amiga chipset was designed to use the first two cycles, leaving the next two free for the CPU. That theoretically made the CPU run at full speed while the chipset was active. However, not all 68000 instruction timings are multiples of four cycles, which adds a two-cycle wait to some instructions, and the chipset needs some of the CPU slots to do other things - in particular, more than 4 bitplanes in low-res mode, or more than 2 bitplanes in high-res mode, can steal CPU slots.

 

Anywho, a slot is therefore two cycles, or about 280 ns... VERY cheap memory. That was also one of the reasons Sega used the 68000 for the Genesis - really cheap slow roms. In fact, the VDP DMA on the Genesis is faster than the 68000 bus cycle, so it fails on old (slow) roms if you try to DMA directly from rom. Sega recommended copying/decompressing from rom to ram using the CPU, then DMAing from ram to vram. Newer carts were made using faster access times, so the game could DMA directly from rom to vram.
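
To put numbers on that, here is a small sketch of the bus-cycle arithmetic. It assumes the ~7.2 MHz clock quoted earlier in the thread (the exact frequency differs slightly between machines), and the constant and function names are made up for illustration.

```python
CLOCK_HZ = 7.2e6           # assumed round figure from the thread
CLOCKS_PER_BUS_CYCLE = 4   # one 68000 read or write cycle
CLOCKS_PER_SLOT = 2        # half of a bus cycle

def clocks_to_ns(clocks, clock_hz=CLOCK_HZ):
    """Duration of a number of CPU clocks, in nanoseconds."""
    return clocks * 1e9 / clock_hz

bus_cycle_ns = clocks_to_ns(CLOCKS_PER_BUS_CYCLE)
slot_ns = clocks_to_ns(CLOCKS_PER_SLOT)

print(f"bus cycle: {bus_cycle_ns:.0f} ns, slot: {slot_ns:.0f} ns")
```

Both results land just under the ~560 ns and ~280 ns figures used in the post.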


Worst-case, Amiga is losing 75% of cycles just for playfield DMA if 6 bitplanes active. So I would think they'd need the faster Ram.

I've got 3 machines here, might have to look inside one. Or just find some pics.

Of several "Amiga 500 motherboard" pics I looked at, most are 80ns and one was 60ns Ram.

Chip Ram needs to be faster since the chipset can potentially hit it on any cycle. The remaining Ram that the chipset can't access could potentially run slower.


I found an A1000 picture showing TMS4464-12NL chips. I don't know if I should interpret the datasheet to classify it as 120ns or 230ns (you kinda would like to read or write, not just access it?).

Then I found an A500 internal expansion using km41256ap-15 which seems to be 150/260ns.

(And worst case would be 4 bitplanes hires - grabbing all the cycles.)

Edited by NorthWay

Worst-case, Amiga is losing 75% of cycles just for playfield DMA if 6 bitplanes active. So I would think they'd need the faster Ram.

That's 50% of the CPU slots. Remember that the CPU takes four cycles to do a read or write, with the actual data being read or written in the last two cycles. The chipset always takes those first two "free" cycles, so the CPU has 100% of the slots it's CAPABLE of using free as long as the chipset doesn't take any of those slots. Five bitplanes (low-res) takes 25% of those slots, and six takes 50%. High-res bogs down more as 3 bitplanes takes 50% of those slots, and 4 bitplanes takes 100% of those slots. That's why stock Amiga 500s and 2000s are so effing slow in 16 color high-res - you only have free CPU slots in the blank periods when running in chip ram. That's why they recommend getting fast ram for your old Amiga.
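
Those percentages fall out of a very simple model, sketched below. It is only a simplification built from the figures in this post: it looks at the active display area only, and ignores blanking periods, sprites, the copper and the blitter. The function name is hypothetical.

```python
def cpu_slots_stolen(bitplanes, hires=False):
    """Fraction of CPU-usable slots taken by bitplane DMA in the display area."""
    # Planes that fit entirely in the "free" half of each bus cycle.
    free_planes = 2 if hires else 4
    # In this model there are as many CPU-usable slots per fetch group
    # as there are free chipset slots.
    cpu_slots = free_planes
    extra = max(0, bitplanes - free_planes)
    return min(extra, cpu_slots) / cpu_slots

for planes in (4, 5, 6):
    print("low-res ", planes, cpu_slots_stolen(planes))
for planes in (2, 3, 4):
    print("high-res", planes, cpu_slots_stolen(planes, hires=True))
```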

 

I've got 3 machines here, might have to look inside one. Or just find some pics.

Of several "Amiga 500 motherboard" pics I looked at, most are 80ns and one was 60ns Ram.

Chip Ram needs to be faster since the chipset can potentially hit it on any cycle. The remaining Ram that the chipset can't access could potentially run slower.

They used whatever memory was cheapest to get in bulk at the time the assembly plant needed ram, but 150ns access time was plenty fast. Remember that the chipset was designed to be a console in 1983. There's no way they would have designed it around 80ns ram.

 

The chipset uses two CPU clocks per cycle for all memory access operations. All COPPER timing is around that frequency. The chipset always has priority over the CPU when it needs more of those two-cycle slots. The BLITTER has a bus hog setting that can lock the CPU entirely until the blit is done, but most folks didn't do that.

 

All this is in the various hardware books Commodore put out. I've got them all. You can also find these hardware manuals online in various formats.

 

AGA changed this timing in two ways - first, they added what was called Double-CAS mode - they did two page mode reads in the same amount of time as one normal read, or one read per clock cycle (the 68020 ran at 2X the clock cycle, or 14.4 MHz). They also had Double-Wide mode, which fetched 32 bits instead of 16 bits. You could set either mode independent of the other, and one or the other or both were needed in order to fetch enough data for AGA modes. For example, the sprites normally fetch one word per access slot per bitplane per line. An AGA sprite would run in Double-CAS/Double-Wide mode to fetch two longs per bitplane per line, allowing sprites to be 64 pixels wide instead of 16. Likewise, switching to a faster fetch mode allowed low-res displays to show 8 bitplanes (or high-res to show 4 bitplanes) without using all available memory access slots like under the old chipset. High-res 8 bitplane still took all access slots. The new SuperHiRes mode (1280 wide instead of 640) took all access slots at 4 bitplanes even with the faster fetch mode.
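
As a rough sketch of how the two fetch modes combine, assuming each one simply doubles the data fetched per access slot as described above (the function name is invented for illustration):

```python
def bits_per_slot(double_cas=False, double_wide=False):
    """Bits fetched per access slot under the given AGA fetch modes."""
    bits = 16                 # one word per slot on the original chipset
    if double_cas:
        bits *= 2             # two page-mode reads in the time of one
    if double_wide:
        bits *= 2             # 32-bit fetches instead of 16-bit
    return bits

# One bit per pixel per bitplane, so this is also the sprite width in pixels.
for cas in (False, True):
    for wide in (False, True):
        print(f"double_cas={cas}, double_wide={wide}: {bits_per_slot(cas, wide)} pixels")
```

With both modes on, the result is the two longs per bitplane per line mentioned for 64-pixel sprites.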

Edited by Chilly Willy

I'm not sure I've used (ind,X) more than a few times, or whether the Atari OS even uses an instruction in that mode. If it were replaced with something else, I doubt it'd be missed much.

Those other instructions - largely covered by the later 65C02, but of course it's of little use when the established base was already there on the original CPU...

 

(zp,X) comes in handy once in a while in things like Forth interpreters, where X is typically used as the parameter stack pointer, and you want to use a stack cell as a pointer without having to copy it to a fixed zp location.

 

I used it in a couple of spots in my translation of the VTL-2 mini-interpreter from the 6800 to the 6502, but only in the degenerate case where X was 0. Here's a snip:


...
;------------------------------------------------------
; Delete/insert program line and restart command prompt
; entry:  Carry must be clear
; uses:   find, start, {@ _ # & * (}, linbuf
; 
skp2    tya             ;save linbuf offset pointer
        pha   
        jsr  find       ;locate first line >= {#}
        bcs  insrt 
        lda  lparen 
        cmp  pound      ;if line doesn't already exist
        bne  insrt      ;  then skip deletion process
        lda  lparen+1 
        eor  pound+1 
        bne  insrt 
        tax             ;x = 0
        lda  (at),y 
        tay             ;y = length of line to delete
        eor  #-1 
        adc  ampr       ;{&} = {&} - y
        sta  ampr 
        bcs  delt 
        dec  ampr+1 
delt    lda  at 
        sta  under      ;{_} = {@}
        lda  at+1 
        sta  under+1 
delt2   lda  under 
        cmp  ampr       ;delete the line
        lda  under+1 
        sbc  ampr+1 
        bcs  insrt 
        lda  (under),y 
        sta  (under,x) 
        inc  under 
        bne  delt2 
        inc  under+1 
        bcc  delt2      ;(always taken)
insrt   pla   
        tax             ;x = linbuf offset pointer
        lda  pound 
        pha             ;push the new line number on
        lda  pound+1    ;  the system stack
        pha   
        ldy  #2 
cntln   inx   
        iny             ;determine new line length in y
        lda  linbuf-1,x ;  and push statement string on
        pha             ;  the system stack
        bne  cntln 
        cpy  #4         ;if empty line then skip the
        bcc  jstart     ;  insertion process
        tax             ;x = 0
        tya   
        clc   
        adc  ampr       ;calculate new program end
        sta  under      ;{_} = {&} + y
        txa   
        adc  ampr+1 
        sta  under+1 
        lda  under 
        cmp  star 
        lda  under+1    ;if {_} >= {*} then the program
        sbc  star+1     ;  won't fit in available RAM,
        bcs  jstart     ;  so abort to the "OK" prompt
slide   lda  ampr 
        bne  slide2 
        dec  ampr+1 
slide2  dec  ampr 
        lda  ampr 
        cmp  at 
        lda  ampr+1 
        sbc  at+1 
        bcc  move       ;slide open a gap inside the
        lda  (ampr,x)   ;  program just big enough to
        sta  (ampr),y   ;  hold the new line
        bcs  slide      ;(always taken)
move    tya   
        tax             ;x = new line length
move2   pla             ;pull the statement string and
        dey             ;  the new line number and store
        sta  (at),y     ;  them in the program gap
        bne  move2 
        ldy  #2 
        txa   
        sta  (at),y     ;store length after line number
        lda  under 
        sta  ampr       ;{&} = {_}
        lda  under+1 
        sta  ampr+1 
jstart  jmp  start      ;dump stack, restart cmd prompt
...
Using it like that allowed me to open or close gaps <256 bytes wide in the program text without having to use and update two separate pointers for source and destination.
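
For anyone who doesn't read 6502 assembly, the two copy loops in the listing (`delt2` and `slide`) boil down to something like the sketch below. It is a paraphrase in Python, not a translation of the whole routine: the function and parameter names are hypothetical, and `gap` stands in for the Y register, which is why it has to stay under 256.

```python
def close_gap(mem, at, end, gap):
    """Delete `gap` bytes at `at` by sliding the tail down; return the new end."""
    new_end = end - gap
    ptr = at                          # single moving pointer, like {_}
    while ptr < new_end:
        mem[ptr] = mem[ptr + gap]     # lda (under),y / sta (under,x)
        ptr += 1
    return new_end

def open_gap(mem, at, end, gap):
    """Make room for `gap` bytes at `at` by sliding the tail up; return the new end."""
    ptr = end                         # single moving pointer, like {&}
    while ptr > at:
        ptr -= 1
        mem[ptr + gap] = mem[ptr]     # lda (ampr,x) / sta (ampr),y
    return end + gap

program = bytearray(b"AAABBBCDE")
end_after_delete = close_gap(program, at=3, end=9, gap=3)   # drop "BBB"
after_delete = bytes(program[:end_after_delete])
end_after_insert = open_gap(program, at=3, end=end_after_delete, gap=3)
print(after_delete, end_after_insert)
```

In both directions one pointer plus a fixed offset does the job, which is what (zp,X) with X = 0 next to (zp),Y buys on the 6502.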

 

Mike.

Edited by barrym95838

I forgot about another spot in the same interpreter that uses (zp,X) for non-zero values of X. It's the rough equivalent of Forth's @ word (pronounced fetch):

 

...
getval3 cmp  #'('       ;sub-expression?
        beq  eval       ;  yes: evaluate it recursively
        jsr  convp      ;  no: first set var[x] to the
        lda  (0,x)      ;    named variable's address,
        pha             ;    then replace that address
        inc  0,x        ;    with the variable's actual
        bne  getval4    ;    value before returning
        inc  1,x 
getval4 lda  (0,x) 
        sta  1,x 
        pla   
        sta  0,x 
getrts  rts   
...

The Z80, 6809, and even the 6800 have a code-size advantage for activities like this, due to their abilities to load and store 16-bit values with a single instruction, but the 6502 has the best average cycle/instruction ratio of the bunch, so the advantage isn't as large as it might seem.
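
The 16-bit fetch that the 6502 excerpt performs a byte at a time can be sketched in Python as follows. This only illustrates the data movement: the function name is made up, `mem` is a flat 64K array standing in for the address space, and zero-page wraparound of the cell itself is ignored.

```python
def fetch_in_place(mem, x):
    """Replace the 16-bit address stored at mem[x], mem[x+1] with the value it points to."""
    addr = mem[x] | (mem[x + 1] << 8)   # little-endian pointer held in zero page
    lo = mem[addr]                      # first  lda (0,x)
    hi = mem[(addr + 1) & 0xFFFF]       # pointer bumped by inc 0,x / inc 1,x, second lda (0,x)
    mem[x], mem[x + 1] = lo, hi         # sta 0,x / sta 1,x

mem = bytearray(0x10000)
mem[0x10], mem[0x11] = 0xFF, 0x20       # cell at $10 holds the address $20FF
mem[0x20FF], mem[0x2100] = 0x34, 0x12   # value $1234, straddling a page boundary
fetch_in_place(mem, 0x10)
print(hex(mem[0x10] | (mem[0x11] << 8)))
```

A CPU with 16-bit loads does the two byte reads in one instruction, which is where the code-size advantage mentioned above comes from.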

 

Mike.

 

 


  • 3 weeks later...
  • 1 year later...

So how come the Z80 shines on SMS but on Spectrum it all looks like regurgitated toe nails?

That has to do with the SMS graphics chips compared to the Spectrum. The SMS had additional chips to help with the load, while on the Spectrum the Z80 is responsible for practically everything, from drawing the graphics and tracking movement to parsing text. Plus, sprites on the Spectrum were stored in memory in black and white, and the color was added to the screen as they were drawn.


  • 2 weeks later...

FWIW, I've done a lot of coding on my 64/80 column graphics text code since this thread was written and there are a few things I discovered.

1. LDIR was not the fastest way to scroll the screen on the Z80. I used an unrolled loop of LDI instructions.
2. The 6803 in the MC-10, which is clocked at .89 MHz, runs the demo faster than the 1.7 MHz 6502 in the Atari. The 8-bit-only registers kill the 6502 on the scroll, and the screen is pretty ugly during a scroll due to the addressing mode I had to use. I scroll columns rather than rows. Using the A register in 16-bit mode for the scroll on the 65816 made the Atari version the fastest tested so far. The 65816 memory move instructions are slower than an unrolled loop. I should point out that the VZ200 version does have to move more data in the scroll due to the paged screen memory, so it might actually be faster without that.
3. To get the most speed out of the 6502, I had to reorganize the font data to suit the 6502 addressing modes, which made it possible to dump the costly multiply. It also means I can't just load a font anywhere in memory without modifying the character printing routines, since addresses are hard coded. A direct port from another CPU is going to be slow; you have to rebuild things to suit the 6502.
4. The 6803 (Motorola) addressing modes seem to be the most useful (IMHO) and I just used a multiply instruction to calculate the character offset in the font. The LDAA ##,X addressing just works better with data structures than LDA ####,x on the 6502.
5. The Z180 version of the Z80 code has one change (I haven't looked for others). It replaces 8 instructions with the new multiply instruction to calculate the address of the character in the font data. This alone drops a lot of clock cycles.
6. Using the really slow IX, IY addressing mode on the Z80 actually allowed me to make some of the code faster since I didn't have to shuffle values in registers and increment an index register. I used it for printing two characters at a time.
7. Since implementing the Atari 65816 code, I have looked at optimizing the Z80 code further. By positioning tables on 256 byte boundaries, I can increment the least significant bytes of the pointers instead of all 16 bits. It is faster, but I haven't compared it to the 65816 code yet. I don't think it's enough of a speed increase to make it the fastest again.
8. With the Z180, not only can I use a multiply instruction, but I could also use DMA to scroll the screen, which should be much faster than the 65816 code. And that's in addition to the 20% increase in execution speed on the same code. I haven't benchmarked it or looked at the number of memory cycles it takes per byte for the memory move.
9. I haven't written a 6809 version yet, but I wrote part of the character printing routine and scroll code. It will definitely be faster than the 6803.
10. The 6309 memory move instruction should make its scroll one of the fastest at 3 clock cycles per byte. The 65816 memory move instruction takes 7 clock cycles per byte and its unrolled loop is something like 5 clock cycles per byte.
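
To get a feel for what the per-byte figures in item 10 mean over a whole scroll, here is a rough sketch. The cycle counts are the ones quoted above (the unrolled-loop figure is only "something like 5"), the 6144-byte screen size is a made-up number for illustration, and the results are in cycles rather than time, because the CPUs involved run at different clock speeds.

```python
CYCLES_PER_BYTE = {
    "6309 memory move": 3,
    "65816 memory move": 7,
    "65816 unrolled loop": 5,    # approximate, per the post
}
SCREEN_BYTES = 6144              # hypothetical screen size

def scroll_cycles(cycles_per_byte, nbytes=SCREEN_BYTES):
    """Total cycles to move nbytes at a fixed per-byte cost."""
    return cycles_per_byte * nbytes

totals = {name: scroll_cycles(cost) for name, cost in CYCLES_PER_BYTE.items()}
for name, cycles in totals.items():
    print(f"{name}: {cycles} cycles")
```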


What's the best vintage hardware for writing and debugging 6507 programs for the Atari VCS? And when I say best, I also mean: what's the most readily available hardware today? I've read on the forums that the Apple IIe was used, and those are readily available for a good price. But maybe there's other vintage hardware out there that's better and similarly priced?


I've been reading through old issues of MICRO looking for info on a couple topics and I ran across some info related to this thread.

 

First of all... I located the info on the 6509.

The 6509's actual name was to be the SY6516 and the manufacturer was Synertek.

 

The first mention of the SY6516 that I ran across was in issue 21 (Feb 1980), page 11.

...

The first article is located in Issue 20 page 36.

It is very different from the 65816.


  • 2 weeks later...

The first article is located in Issue 20 page 36.

It is very different from the 65816.

I can't locate any info on page 36 about the SY6516.

 

 

I found a Follow Up Article here on PDF Pages 69-76:

http://archive.6502.org/publications/micro/micro_34_mar_1981.pdf

 

 

MarkO
