
Differentiating data from code on Assembler


Hi team

 

I was reading Chapter 9 of the book Machine Language for Beginners:

 

https://www.atariarchives.org/mlb/chapter9.php

 

...this statement came to my attention: "In BASIC, DATA announces that the items following the word DATA are to be considered pieces of information (as opposed to being thought of as parts of the program). That is, the program will probably use this data, but the data are not BASIC commands. In ML, such a zone of "non-program" is called a table. It is unique only in that the program counter never starts trying to run through a table to carry out instructions. Program control is never transferred to a table since there are no meaningful instructions inside a table."

 

I did some research on this, and found this thread:

 

https://stackoverflow.com/questions/2022489/how-instructions-are-differentiated-from-data

 

It says that since machine code instructions are just binary numbers, as is data, there's no way to tell them apart.

Is there a practical way to differentiate data from code in assembly?

 

Kind regards,
Luis.


In assembly source, you of course write code and data differently, but at execution time the distinction is based on how the bytes are used, not what they were intended for. Code can be handled as data and data can be executed as code. The most direct example of this is DOS: sectors are loaded off disk and into memory as data, and they become code when DOS jumps to the starting address to begin program execution. There's no explicit command or mark to indicate when code becomes data or vice versa; it just happens when the bytes are used that way. In some cases, code is intentionally also used as data, such as 2600 games that reuse part of their program code as noise or flak graphics.
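To make the "loaded as data, then jumped to" point concrete, here is a minimal Python sketch. It is not how DOS works internally; the load address $2000, the six-byte "sector" and the three-opcode interpreter are all made up for illustration. The same bytes are first copied and summed as plain numbers, and only become code once the interpreter's program counter is pointed at them.

```python
# Toy model: 64 KB of memory and an interpreter that knows three 6502 opcodes.
memory = bytearray(0x10000)

# Bytes as they might arrive from a disk sector: just numbers at this point.
sector = bytes([0xA9, 0x2A,        # LDA #$2A
                0x8D, 0x00, 0x06,  # STA $0600
                0x60])             # RTS

LOAD_ADDR = 0x2000  # hypothetical load address, chosen for the example

# Step 1: the bytes are handled as data (copied, then summed like a checksum).
memory[LOAD_ADDR:LOAD_ADDR + len(sector)] = sector
checksum = sum(memory[LOAD_ADDR:LOAD_ADDR + len(sector)])


def run(pc):
    """Interpret bytes starting at pc until RTS; return the accumulator."""
    a = 0
    while True:
        opcode = memory[pc]
        if opcode == 0xA9:            # LDA #imm
            a = memory[pc + 1]
            pc += 2
        elif opcode == 0x8D:          # STA abs (little-endian operand)
            addr = memory[pc + 1] | (memory[pc + 2] << 8)
            memory[addr] = a
            pc += 3
        elif opcode == 0x60:          # RTS ends the toy program
            return a
        else:
            raise ValueError(f"opcode ${opcode:02X} not modelled in this sketch")


# Step 2: the very same bytes are used as code, simply by "jumping" to them.
result = run(LOAD_ADDR)
```

Nothing in memory changed between step 1 and step 2; the only difference is whether the bytes were read by a copy loop or fetched as instructions.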

 

This poses problems for disassemblers, which have to use heuristics to guess what is code and what is data. They have trouble with constructs like jump tables, where the bounds of the jump table are hard to determine statically. They're also often fooled by constructs like CLC+BCC to do an unconditional branch, as they can trace past the branch into regions that aren't actually executed.
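The CLC+BCC problem can be shown with a small flow-tracing sketch in Python. This is an illustration under stated assumptions, not how any particular disassembler is implemented: the opcode table covers only six instructions, the test program at $0600 is invented, and the `track_carry` option is a deliberately crude fix that only recognises a BCC immediately after a CLC.

```python
# opcode -> (length in bytes, kind); only a handful of 6502 opcodes are modelled.
OPCODES = {
    0x18: (1, "plain"),    # CLC
    0xA9: (2, "plain"),    # LDA #imm
    0x8D: (3, "plain"),    # STA abs
    0x90: (2, "branch"),   # BCC rel (the only branch in this table)
    0x4C: (3, "jump"),     # JMP abs
    0x60: (1, "return"),   # RTS
}


def trace(mem, base, entry, track_carry=False):
    """Follow control flow from entry; return the addresses marked as code."""
    starts, code_bytes = set(), set()
    todo = [entry]
    end = base + len(mem)
    while todo:
        pc = todo.pop()
        carry_clear = False  # nothing is known at the start of a path
        while base <= pc < end and pc not in starts:
            op = mem[pc - base]
            if op not in OPCODES:
                break
            length, kind = OPCODES[op]
            if pc + length > end:
                break
            starts.add(pc)
            code_bytes.update(range(pc, pc + length))
            if kind == "branch":
                offset = mem[pc - base + 1]
                if offset >= 0x80:           # sign-extend the relative offset
                    offset -= 0x100
                target = pc + 2 + offset
                if track_carry and carry_clear:
                    pc = target              # BCC right after CLC: always taken
                else:
                    todo.append(target)      # taken path, traced later
                    pc += length             # fall-through assumed reachable
            elif kind == "jump":
                pc = mem[pc - base + 1] | (mem[pc - base + 2] << 8)
            elif kind == "return":
                break
            else:
                pc += length
            carry_clear = (op == 0x18)       # only true directly after CLC
    return code_bytes


# $0600 CLC / $0601 BCC $0606 / $0603 three table bytes / $0606 LDA #$01 / RTS
PROGRAM = bytes([0x18, 0x90, 0x03, 0xA9, 0x10, 0x60, 0xA9, 0x01, 0x60])

naive = trace(PROGRAM, 0x0600, 0x0600)
smart = trace(PROGRAM, 0x0600, 0x0600, track_carry=True)
```

The naive trace follows the fall-through after BCC and decodes the table bytes A9 10 60 as LDA #$10 / RTS, so $0603-$0605 end up flagged as code. With `track_carry=True` those three bytes are left unmarked.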

 

On some CPU architectures, there is more of a distinction. Some older CPUs had tag bits to mark memory regions as code or data, and newer CPUs have read/write/execute permissions in the page tables. Some CPUs even had separate memory buses or address spaces for code and data. On the 8048, for instance, you can't even read code as data without external hardware support or using special MOVP instructions, and similarly it isn't normally possible to execute data because it's simply not mapped into code address space.

 


tbf, memorize the 6502 opcodes in hex along with their mnemonics, then when staring at a hex dump / file you'll spot patterns such as A9 XX 8D YY XX and be able to distinguish runs of instructions from data. like anything, practice, practice, practice.

 

but there are many different coding styles you'll come across. you'll see lots of data bytes reserved inline between functions, but then you'll also see data defined separately, say, contiguously across a specific page or more of memory.
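The eyeballing described above can be roughly automated. The following Python sketch scans a dump for the A9 XX 8D YY XX shape (LDA #imm followed by STA abs). The function name, the sample bytes and the $2000 base address are invented for the example, and a match is only a hint: data can contain the same byte pattern by coincidence.

```python
def find_lda_sta(data, base=0):
    """Return (address, immediate, target) for every A9 xx 8D lo hi run."""
    hits = []
    for i in range(len(data) - 4):
        if data[i] == 0xA9 and data[i + 2] == 0x8D:
            target = data[i + 3] | (data[i + 4] << 8)  # little-endian operand
            hits.append((base + i, data[i + 1], target))
    return hits


# Made-up dump: two LDA/STA pairs, an RTS, then the text "Hello".
dump = bytes.fromhex("A9 00 8D 08 D2 A9 03 8D 0F D2 60 48 65 6C 6C 6F")

for addr, imm, target in find_lda_sta(dump, base=0x2000):
    print(f"${addr:04X}: LDA #${imm:02X} / STA ${target:04X}")
```

Runs where several such hits sit back to back are likely code; long stretches with no recognisable pairs are better candidates for tables or text.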


19 hours ago, Wrathchild said:

but there are many different coding styles you'll come across. you'll see lots of data bytes reserved inline between functions, but then you'll also see data defined separately, say, contiguously across a specific page or more of memory.

What I really find annoying to disassemble is code like this:

 

	jsr PRINT

	.byte "Hello World!", $9b

	jsr PRINT

	.byte "Press any key", 0

	rts

PRINT:
	; get PC from stack
	; print string from PC+1 up to $9b or 0
	; put adjusted PC back on stack
	rts

 

I understand it's convenient for the programmer, especially with a macro generating the jsr and .byte sequence, but having to manually flag tens of regions of memory as data gets tedious very quickly.
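Some of that manual flagging could be scripted once the address of PRINT is known. Below is a rough Python sketch, assuming the convention from the listing above (string follows the JSR and ends at $9b or 0). The function names and the test image are hypothetical, and the scan is a plain byte-pattern match, so a 20 lo hi sequence inside a table would be a false positive.

```python
def flag_inline_strings(mem, base, print_addr, terminators=(0x9B, 0x00)):
    """Return (first, last) address pairs of inline strings after JSR print_addr.

    The range includes the terminator byte, since that is data too.
    """
    lo, hi = print_addr & 0xFF, print_addr >> 8
    regions = []
    i = 0
    while i + 2 < len(mem):
        if mem[i] == 0x20 and mem[i + 1] == lo and mem[i + 2] == hi:  # JSR
            start = j = i + 3
            while j < len(mem) and mem[j] not in terminators:
                j += 1
            if j < len(mem):                 # terminator found
                regions.append((base + start, base + j))
                i = j + 1                    # code resumes after the terminator
                continue
        i += 1
    return regions


def build_image(base):
    """Assemble the example from the post by hand (PRINT is just an RTS stub)."""
    s1 = b"Hello World!" + bytes([0x9B])
    s2 = b"Press any key" + bytes([0x00])
    print_addr = base + 3 + len(s1) + 3 + len(s2) + 1
    jsr = bytes([0x20, print_addr & 0xFF, print_addr >> 8])
    return jsr + s1 + jsr + s2 + bytes([0x60, 0x60]), print_addr


image, print_addr = build_image(0x2000)
regions = flag_inline_strings(image, 0x2000, print_addr)
for first, last in regions:
    print(f"data: ${first:04X}-${last:04X}")
```

The resulting ranges could then be fed to whatever "mark as data" mechanism the disassembler offers, instead of flagging each string by hand.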


The Happy 1050 uses that style for bank switching, with a JSR to the bank address followed by the target address. Makes disassembly fun, especially since the two banks overlap in address space.

 

The 1050 Turbo had an even nastier form of bank switching, which has a bunch of trampolines like this:

 

F804: 2C 01 F0          BIT $F001
F807: 2C 03 F0          BIT $F003
F80A: 2C 02 F0          BIT $F002
F80D: 4C 13 F8          JMP $F813

 

Why's it nasty? Because each one of those reads is actually a jump. The read triggered by those BIT instructions causes an immediate bank switch such that the next instruction comes from a different bank at the following address. It would have been worse had the trampolines not all been placed at the front of the bank where they were easier to track.

 

2600 games pull a lot of dirty tricks due to heavy use of bank switching and size-saving tricks. Tables are really fun: you see a lookup like LDA $F017,X and have to figure out it's actually LDA $F027-$10,X with X >= $10... or the sprite table that you tried to rewrite for GTIA also happened to overlap on one side with the sound table and on the other side with the score graphics.
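The biased-table case is just address arithmetic, which a few lines of Python can spell out. The helper names are made up; the numbers are the ones from the example above. The point is that the operand in the instruction is not where the table starts once the minimum value of X is known.

```python
def effective_address(operand, x):
    """Address accessed by an absolute,X instruction (16-bit wraparound)."""
    return (operand + x) & 0xFFFF


def real_table_start(operand, min_x):
    """First byte the lookup can actually touch, given the smallest X used."""
    return effective_address(operand, min_x)


OPERAND = 0xF017   # what the disassembly shows: LDA $F017,X
MIN_X = 0x10       # smallest index the code ever uses

table = real_table_start(OPERAND, MIN_X)
print(f"LDA ${OPERAND:04X},X with X >= ${MIN_X:02X} reads from ${table:04X} upward")
```

The hard part in practice is establishing the range of X, which usually means reading the surrounding code rather than the lookup itself.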


On 4/26/2024 at 6:57 PM, ivop said:

I understand it's convenient for the programmer, especially with a macro generating the jsr and .byte sequence, but having to manually flag tens of regions of memory as data gets tedious very quickly.

One of my favourite techniques, proliferating all over my code. Fortunately how easy it is to disassemble does not even appear on my list of priorities. :)


9 minutes ago, flashjazzcat said:

One of my favourite techniques, proliferating all over my code. Fortunately how easy it is to disassemble does not even appear on my list of priorities. :)

:)

 

I have used it myself, too. In midimon for example: https://github.com/ivop/midimuse/blob/master/software/midimon/cio.s That's actually the CIO macro library from the Atmas-II manual, converted to mads. If you don't need speed, it's indeed very helpful, and it's a quick way to get some messages on the screen.


3 hours ago, ivop said:

I have used it myself, too. In midimon for example: https://github.com/ivop/midimuse/blob/master/software/midimon/cio.s That's actually the CIO macro library from the Atmas-II manual, converted to mads. If you don't need speed, it's indeed very helpful, and it's a quick way to get some messages on the screen.

It's a nice technique. Not the fastest, but quite concise. I first noticed the technique used extensively in SDX - for example, the library 'printf' function. Nearly everything I've written for the past decade or more seems to be targeted at large-scale banked ROMs, and I ended up developing inter-bank JSRs called with macros and referencing bankswitching code positioned at the same address in each bank which pulls the target address and bank from in-lined data, manages a banking stack, etc. This has the additional advantage that you can load up all three registers with useful data prior to calling the macro and have them passed to the target function.

Edited by flashjazzcat

On 4/29/2024 at 5:17 PM, flashjazzcat said:

One of my favourite techniques, proliferating all over my code. Fortunately how easy it is to disassemble does not even appear on my list of priorities. :)

It was also sometimes done precisely because it did make disassembling more difficult. A very primitive kind of copy protection, since people couldn't just nick your best routines in one go.

