
Faster to load from RAM or ROM


bent_pin

Recommended Posts

I remember reading in a couple threads that loading from RAM is faster than loading from ROM, but as I look into it, I'm a bit confused. 

 

Loading from ROM with a Y offset takes 4 clocks.

Loading from the zero page with a Y offset takes 4 clocks.

Popping the stack into the accumulator takes 4 clocks. 

 

Except for the fact that popping from the stack saves decrementing y, in this case I cannot see a difference. Have I misinterpreted something?

 

To head off an XY problem: I'm looking at caching my bitmaps during overscan, but I don't see a real gain.


Loading from zeropage RAM is faster, not RAM in general. But only for non-indexed loads.

 

BTW: There is no opcode for lda zp,y. Instead the assembler creates lda abs,y, which requires 1 extra byte. But there is an opcode for lda zp,x. So to save space, using X when loading from ZP RAM is more efficient.
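For illustration, here is what the assembler does with each form (Var is a hypothetical zeropage location; bytes and cycle counts are the standard 6502 values):

Var	= $80		; hypothetical zeropage variable

	lda Var		; $a5 $80     - 2 bytes, 3 cycles (zeropage, non-indexed)
	lda Var,x	; $b5 $80     - 2 bytes, 4 cycles (zeropage,X exists)
	lda Var,y	; $b9 $80 $00 - 3 bytes, 4 cycles (silently becomes lda $0080,y)
	ldx Var,y	; $b6 $80     - 2 bytes, 4 cycles (zeropage,Y does exist for LDX)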


2 hours ago, Thomas Jentzsch said:

Loading from zeropage RAM is faster, not RAM in general. But only for non-indexed loads.

Thank you. I missed that in my reading.

So I could speed up the render if I created direct zp references instead of looping, exited the render early by branching to a point beyond it, and controlled the Y position of the render by adjusting the kernels above and below it. It'll just chew up a bit more ROM.


Indeed, loading with indexes is going to cost at least 4 cycles. The only saving with zeropage RAM is a direct, non-indexed load (3 cycles). Pulling from the stack does save on updating the index, as you say, for linear loads. If you aren't using the stack during a kernel, the TSX and TXS instructions cost only 2 cycles each, but it's only useful for one shot.
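For illustration, a minimal sketch of that stack trick, assuming a hypothetical zeropage buffer gfxBuf; on the 2600 the 6507 stack mirrors the same 128 bytes of RIOT RAM, so PLA effectively reads zeropage, and this only works while nothing else needs the stack:

gfxBuf	= $90		; hypothetical buffer of graphics bytes
GRP0	= $1b		; TIA registers, normally from vcs.h
GRP1	= $1c

	ldx #gfxBuf-1	; 2 cycles
	txs		; 2 cycles - the "one shot": point SP just below the buffer
	pla		; 4 cycles - reads gfxBuf+0, SP increments by itself
	sta GRP0	; 3 cycles
	pla		; 4 cycles - reads gfxBuf+1, no iny/dey bookkeeping
	sta GRP1	; 3 cycles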

 

The most awesome load is the immediate load from ROM (LDA #, LDX # or LDY #), which is only 2 cycles. It is quite powerful if you are using self-modifying code that runs in RAM, or a fast fetch engine; it is not very useful in a ROM kernel. If you make a kernel that runs in RAM, you can update the RAM locations that contain the # argument to the LDA instruction, and this changes what gets drawn on the screen. The downside is that the # arguments have to be updated during the overscan/vblank periods. In my experience, it works effectively for games that only update the screen every 6 or more frames.
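For illustration only, a stripped-down sketch of the idea (not the demo attached below): a tiny kernel image is copied into zeropage RAM, and the byte after the LDA opcode is the # argument that vblank/overscan code can rewrite. All labels and values are made up.

	processor 6502
	seg code
	org $f000

GRP0	= $1b			; TIA register, normally from vcs.h

Start	ldx #CodeEnd-CodeImg-1	; copy the 7-byte kernel image...
CopyLoop
	lda CodeImg,x
	sta $80,x		; ...into zeropage RAM at $80-$86
	dex
	bpl CopyLoop

	lda #$55		; "patch": rewrite the immediate operand at $81
	sta $81			; (in a real game this happens during vblank/overscan)
	jmp $0080		; run the kernel from RAM

CodeImg	.byte $a9,$00		; $80: lda #$00   <- $81 is the patchable # argument
	.byte $85,GRP0		; $82: sta GRP0   (zeropage store to the TIA)
	.byte $4c,$80,$00	; $84: jmp $0080  (loop forever; just a sketch)
CodeEnd

	org $fffc
	word Start		; reset vector
	word Start		; BRK/IRQ vector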


22 minutes ago, MarcoJ said:

Indeed, loading with indexes is going to cost at least 4 cycles. The only saving with zeropage RAM is a direct, non-indexed load (3 cycles). Pulling from the stack does save on updating the index, as you say, for linear loads. If you aren't using the stack during a kernel, the TSX and TXS instructions cost only 2 cycles each, but it's only useful for one shot.

The most awesome load is the immediate load from ROM (LDA #, LDX # or LDY #), which is only 2 cycles. It is quite powerful if you are using self-modifying code that runs in RAM, or a fast fetch engine; it is not very useful in a ROM kernel. If you make a kernel that runs in RAM, you can update the RAM locations that contain the # argument to the LDA instruction, and this changes what gets drawn on the screen. The downside is that the # arguments have to be updated during the overscan/vblank periods. In my experience, it works effectively for games that only update the screen every 6 or more frames.

For now, I ended up making direct references in zp and I load from those. Three zp direct loads and four TIA stores for 21 clocks, plus the WSYNC strobe. I ditched the stack for the time being. I'm going to try to take advantage of the vertical shift and make a one-line kernel that's split between two lines: the PF data loads right after WSYNC, then I flip the vertical shift to access the sprite data loaded in the visible section of the previous line, then load the next line's sprite data before the WSYNC. It takes an alternating action, so it's really two one-line kernels working in tandem.
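For the cycle accounting, one possible shape of such a line (pf0Buf/pf1Buf/pf2Buf are hypothetical zeropage buffers, the TIA names are the usual vcs.h ones, and the fourth store reuses a value already sitting in X):

	sta WSYNC	; strobe: start of the line
	lda pf0Buf	; 3 - direct zeropage load
	sta PF0		; 3
	lda pf1Buf	; 3
	sta PF1		; 3
	lda pf2Buf	; 3
	sta PF2		; 3
	stx GRP0	; 3 - fourth TIA store, X was loaded on the previous line
			; 3 loads + 4 stores = 21 cycles after the WSYNC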

 

Can you please point me to an example of a kernel that runs in RAM? I would greatly appreciate it.

 

I've been teaching kids how to program "see Spot run" games like Pong, Combat, and two-line kernels for years now. It makes for a great introduction to assembly before moving on to microcontroller assembly. This is my first foray into really trying to maximize the capabilities of the VCS, though. I appreciate all the helpful tips so far.


17 minutes ago, bent_pin said:

Can you please point me to an example of a kernel that runs in RAM? I would greatly appreciate it.

I thought some original Atari titles used self modifying code, but I can't find them.

 

Attached is a demo I wrote that shows a 16x12 tile map with 8x8 coloured pixels. It's quite complex but shows the power of RAM kernels. You probably don't need to understand how it works, but if you're keen:

 

It uses many kB of RAM for screen kernels, organised in blocks of 128 bytes. It gets updated during vblank/overscan when the map is scrolled. It doesn't use the 2-cycle immediate load specifically, but an indexed X load (4 cycles). Each 128-byte block represents 16 scanlines, or 1 row of 8 horizontal tiles. The screen update changes various locations within the 128-byte block to specify the 16-bit address associated with each tile's graphics and colours tables, which are 8 values long.


There are 2 kernels; below are the offsets for each 16-bit address, expressed in hex.

 

SxC = Colour, Kernel 1

SxCA = Colour, Kernel 2

SxG = Graphics, Kernel 1

SxGA = Graphics, Kernel 2

 

where x is 0 through F, representing which tile from left to right: 0 is the leftmost tile, F is the rightmost tile. If you leave the demo stationary, it's possible to poke RAM and change parts of the screen.

 

S0C                      +3f              
S0CA                     +0e              
S0G                      +35              
S0GA                     +09              
S1C                      +0e              
S1CA                     +39              
S1G                      +07              
S1GA                     +34              
S2C                      +49              
S2CA                     +18              
S2G                      +44              
S2GA                     +13              
S3C                      +18              
S3CA                     +4b              
S3G                      +13              
S3GA                     +3e              
S4C                      +3f              
S4CA                     +0e              
S4G                      +3a              
S4GA                     +09              
S5C                      +0e              
S5CA                     +42              
S5G                      +09              
S5GA                     +38              
S6C                      +49              
S6CA                     +18              
S6G                      +44              
S6GA                     +13              
S7C                      +18              
S7CA                     +4e              
S7G                      +13              
S7GA                     +49              
S8C                      +55              
S8CA                     +20              
S8G                      +52              
S8GA                     +1d              
S9C                      +20              
S9CA                     +57              
S9G                      +1d              
S9GA                     +54              
SAC                      +61              
SACA                     +2a              
SAG                      +5a              
SAGA                     +27              
SBC                      +2a              
SBCA                     +63              
SBG                      +27              
SBGA                     +5e              
SCC                      +57              
SCCA                     +27              
SCG                      +54              
SCGA                     +24              
SDC                      +24              
SDCA                     +63              
SDG                      +21              
SDGA                     +5e              
SEC                      +64              
SECA                     +31              
SEG                      +5f              
SEGA                     +2e              
SFC                      +30              
SFCA                     +74              
SFG                      +2b              
SFGA                     +74   

 

 

 

This guide has some ideas on how to do self modifying code for sprite positioning in a space invaders type game.

https://www.qotile.net/minidig/docs/2600_advanced_prog_guide.txt

 

 

 

16CHAR_NTSC_20210603.bin


2 hours ago, MarcoJ said:

I thought some original Atari titles used self modifying code, but I can't find them.

 

SARA is not fast enough to run code; @Thomas Jentzsch recently updated Stella to enforce this for Atari's F8SC, F6SC and F4SC bankswitching schemes. More info in GitHub issue 933, which links to this AtariAge discussion - start with @SvOlli's October 6th reply.


1 hour ago, splendidnut said:

Supercharger would be a good candidate for doing self-modifying kernels, with its three 2K RAM banks available in the cartridge memory area.

I have a pile of those.

I am basing my proposed supercart on a similar line of thinking: an internal processor and internal RAM that write directly to the TIA and read the inputs in a kind of waltz. It should make dumping it a bit tough, too.


On 7/11/2023 at 8:32 AM, MarcoJ said:

This guide has some ideas on how to do self modifying code for sprite positioning in a space invaders type game.

https://www.qotile.net/minidig/docs/2600_advanced_prog_guide.txt

This document has a broken link: http://www.tripoint.org/kevtris/files/sizes.txt

 

It's at the beginning of the bankswitching section. Do you happen to know of another source for this information, please?


17 minutes ago, Thomas Jentzsch said:

You could have a look at my 11 invaders demo. The main kernel uses self-modifying code.

 

 

Thank you, I most certainly will. Just the first picture is quite impressive.

The more I can keep these kids on computers that I can build for $40 or less, the more kids I can afford to give them to. This work of yours will be a big help in that regard. One of my students is getting ready to enter their fourth year at Purdue. I have worked with them for 14 years now. I couldn't be prouder if they were my own.


On 7/11/2023 at 10:06 PM, splendidnut said:

Supercharger would be a good candidate for doing self-modifying kernels, with its three 2K RAM banks available in the cartridge memory area.

I'm not so convinced about this, if you look at the end result and the motivation. Typically you need self-modifying code for two reasons: size or speed (or both). Writing self-modifying code just because it's self-modifying is not a valid argument; you want to improve your code or circumvent an obstacle using self-modification.

 

There are two ways to manipulate the SuperCharger RAM:

 

Option 1: without vector

Code:
$f100: cmp $f000,x
$f103: nop
$f104: cmp $f000,y

Bus access:
1: f100 dd
2: f101 00
3: f102 f0
4: f0XX __  <-- trigger and write XX to latch (don't care about __)
5: f103 ea
6: f104 d9
7: f105 00
8: f106 f2
9: f2YY __  WRITE XX now

 

Option 2: with vector

$f100: cmp $f000,x
$f103: cmp ($80),y

Bus access:
1: f100 dd
2: f101 00
3: f102 f0
4: f0XX __  <-- trigger and write XX to latch (don't care about __)
5: f103 d1
6: f104 80
7: 0080 00
8: 0081 f2
9: f2YY __  WRITE XX now

 

What happens there: on the write to the $f000 bank, the low byte of that address is stored into a latch. After the write to the $f000 bank, a counter starts, counting changes on the address bus. On the 5th change, the bus will be taken over from read to write and the data from the latch will be written to that address. (A slight simplification, but it's enough for the explanation.)

 

This works great while loading from tape. It still works somehow for creating tables. It sucks when trying to create code. Why? Writing to RAM costs you both index registers, which are typically required for generating anything, especially when code space and/or cycles are tight. When creating data tables it's already a pain. My 512-byte demos, when ported from CommaVid to SuperCharger, are then >550 bytes. Take a look at the source code of both versions for generating the sine wave to understand what it means to "lose" index registers. It's very annoying. On the bright side: timing is no issue there, as the tables are only generated once.

 

Nowadays, when a 32k ROM costs roughly the same as a 4k ROM, instead of generating the code in the cartridge at run time, generate the code in ROM during assembly. Way fewer problems: no microcontroller or anything else to emulate RAM, just a plain EPROM with a GAL/PAL for bankswitching, like it's been working for how many years now?

 

If you require self-modifying code, it's typically just a snippet that you can squeeze into the 6532 RAM. If there's too much stuff in there, you might be forced to use other RAM, but if it's the SuperCharger, better to use it as swap memory. Having done tight coding with self-modifying code on the 2600 for more than a decade now, nothing else made sense. Not once, and - believe me - I've tried.

 

No code generation in SARA, no code generation in the SuperCharger, not even in CommaVid. Why? Because of zero page addressing, generating code in 6532 RAM costs you one byte less in ROM for each time you write to RAM.
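Purely to illustrate the byte count (ExtraRAM is a hypothetical label standing in for a SARA-style write address):

ExtraRAM = $f000	; hypothetical write address of a cartridge-RAM scheme

	sta $80		; $85 $80     - 2 bytes per generated byte (6532 zeropage RAM)
	sta ExtraRAM	; $8d $00 $f0 - 3 bytes per generated byte (cartridge RAM)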

 

I'd like to be convinced otherwise, though... But if I had to reimplement something like my favourite plasma effect again, I'd definitely go for 6532 RAM again, no other RAM for code generation.


I'm poking around with two different one-line kernels: one uses direct references, and one runs completely from RAM. I think it's important that I better understand these concepts before using the Supercharger, but I will certainly be looking at it too, Frogger in particular.

 

I appreciate all the pointers so far. I will post some examples of my code in this thread.


	processor 6502
	
	seg code
	org $f000
	
RAM_Kernel:
	lda #$a9	; opcode for LDA immediate
	sta $80
	lda #5		; operand: first value for A
	sta $81
	lda #$a9	; LDA immediate again
	sta $82
	lda #21		; operand: second value for A
	sta $83
	lda #$4c	; opcode for JMP absolute
	sta $84
	lda #$80	; target low byte...
	sta $85
	lda #$00	; ...and high byte: JMP $0080
	sta $86
	
	jmp $0080	; run the code just written to zeropage RAM
	
	org $fffc
	word RAM_Kernel	; reset vector
	word RAM_Kernel	; BRK/IRQ vector

 

My first run-in-RAM code. Manual copy; it just flip-flops the value in the accumulator. Probably terrible, but it's fast and it works as expected.
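One possible refinement (just a suggestion, not from the thread): keep the same seven bytes as a table in ROM and copy them with a short loop. It's a drop-in replacement for the RAM_Kernel routine above and scales better as the RAM code grows.

RAM_Image
	.byte $a9,5		; lda #5
	.byte $a9,21		; lda #21
	.byte $4c,$80,$00	; jmp $0080

RAM_Kernel2
	ldx #6			; 7 bytes, indices 6..0
Copy	lda RAM_Image,x
	sta $80,x		; same zeropage target, $80-$86
	dex
	bpl Copy
	jmp $0080		; run it from RAM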


If you'd like to take a look at a real-world example, here is one generating a plasma effect.

In the source code:

  • the code that copies the code from ROM to RAM is at the bottom at line 247
  • the code that's been copied is right above at line 225
  • the adjustment of the code and calling it starts at line 167

But most probably the best way to get an understanding is to run the demo in Stella, once the part is displayed, enter the debugger and step through the code.

