Xuel Posted December 10, 2014 (edited)

I wasn't satisfied with existing alternatives, so I wrote a simple 6502 disassembler in Perl. You can download from github here.

Some features:
- Statically traces code from entry points that you provide in order to distinguish code from data
- Automatically generates labels if desired
- Emits XASM/MADS syntax
- Emits "a:" as needed when absolute addressing is used for zero page addresses
- Can generate labels for addresses in the middle of instructions, e.g. "l1234 equ *-2". This occurs when BIT is used to skip an instruction, for example.
- Callers are annotated in a comment at every label so you can see who calls an address
- The current address and the raw data are annotated in a comment for every instruction
- Based on C= Hacking opcode table.

Example output:

    l1150               ; Callers: 111F 1124
        lda $10C3       ; 1150: AD C3 10
        lsr @           ; 1153: 4A
        lsr @           ; 1154: 4A
        cmp #$20        ; 1155: C9 20
        bcc l115E       ; 1157: 90 05
        beq l114F       ; 1159: F0 F4
        lda #$01        ; 115B: A9 01
        bit a:$00A9     ; 115D: 2C A9 00
    l115E equ *-2       ; Callers: 1157
        sta $1166       ; 1160: 8D 66 11
        jmp l1130       ; 1163: 4C 30 11
        dta $1          ; 1166: 01
        dta $2          ; 1167: 02
        dta $0          ; 1168: 00

I've tested a few dumps and I've confirmed that XASM is able to reproduce the exact image when given the output of disassembling the image. But there are probably still some bugs lurking, so take the output with a grain of salt. Suggestions for improvements are welcome.

It currently only handles raw memory dumps. I'll probably add a mode to handle XEX files. I also want to add a mechanism to supply user-defined labels. It should also detect the BIT trick and replace it with "dta $2C" so the skipped instruction can be disassembled.

Edited December 10, 2014 by Xuel
flashjazzcat Posted December 10, 2014

Excellent: will be checking this out! Thanks.
Tezz Posted December 10, 2014

Sounds great, I'll make sure to have a look over the weekend
Xuel Posted December 11, 2014 (edited)

I added support for Atari XEX and Commodore 64 PRG files. The disassembler automatically determines code entry points when disassembling such executables. Download from the github project page.

I've tested a few executables and verified that running XASM on the disassembled code produces the exact same executable. I'll try some more exhaustive tests soon.

The XEX mode is aware that segments can overlap and that RUN and INI segments can refer to previous segments. However, it doesn't yet understand that code from one segment could call code in another segment. If that occurs, then the code may be treated as data instead of code.

One tricky part of reproducing the exact same XEX is to emit a $FFFF segment header only when it was present in the original executable, but I did implement this.

Here's an example from Ransack. The first two code and INI segments disable BASIC and DMACTL. The next segment has a $FFFF segment header as it was just a separate executable compressed with exomizer. Notice also that all of the labels are prefixed with the segment number to avoid label collisions in case of overlapping segments. I may add an option to output MADS .local/.endl directives instead, or to prefix only the labels that actually collide.

        org $2000           ; end 2013
    s1l2000
        lda #$02            ; 2000: A9 02 <--- Entry
        ora $D301           ; 2002: 0D 01 D3
        sta $D301           ; 2005: 8D 01 D3
        lda #$00            ; 2008: A9 00
        sta $022F           ; 200A: 8D 2F 02
        lda $14             ; 200D: A5 14
    s1l200F                 ; Callers: 2011
        cmp $14             ; 200F: C5 14
        beq s1l200F         ; 2011: F0 FC
        rts                 ; 2013: 60
        ini $2000
        opt h-
        dta a($FFFF)        ; Segment header
        opt h+
        org $2000           ; end 3D47
    s2l2000
        ldy #$11            ; 2000: A0 11 <--- Entry
        tsx                 ; 2002: BA
    s2l2003                 ; Callers: 200A
        lda $3C76,x         ; 2003: BD 76 3C
        sta a:$00FC,x       ; 2006: 9D FC 00
        dex                 ; 2009: CA
        bne s2l2003         ; 200A: D0 F7
        jmp s2l3C3B         ; 200C: 4C 3B 3C
        dta $7E             ; 200F: 7E
        dta $B6             ; 2010: B6
        dta $20             ; 2011: 20
        ...
Edited December 11, 2014 by Xuel
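The XEX handling described in the post above (segments, the optional $FFFF header, and entry points taken from RUN and INI) can be illustrated with a small Python sketch. This is not the code from dis; the function names `parse_xex` and `entry_points` are hypothetical, and the sketch assumes the usual Atari binary load layout: an optional $FFFF marker, a little-endian start and end address, then the segment bytes, with RUN at $02E0 and INI at $02E2.

```python
import struct

RUNAD, INITAD = 0x02E0, 0x02E2  # standard Atari RUN and INI vector addresses


def parse_xex(data):
    """Split an XEX image into (had_ffff_header, start, end, body) tuples."""
    segments = []
    pos = 0
    while pos + 4 <= len(data):
        had_header = False
        (start,) = struct.unpack_from("<H", data, pos)
        if start == 0xFFFF:
            # Remember the marker so the exact file can be reproduced later.
            had_header = True
            pos += 2
            if pos + 4 > len(data):
                break
            (start,) = struct.unpack_from("<H", data, pos)
        (end,) = struct.unpack_from("<H", data, pos + 2)
        pos += 4
        body = data[pos:pos + end - start + 1]
        pos += len(body)
        segments.append((had_header, start, end, body))
    return segments


def entry_points(segments):
    """Collect addresses written to the RUN or INI vectors by any segment."""
    entries = []
    for _, start, end, body in segments:
        for vec in (RUNAD, INITAD):
            if start <= vec and vec + 1 <= end:
                off = vec - start
                entries.append(body[off] | (body[off + 1] << 8))
    return entries


# Made-up three-segment file: code at $2000, an INI segment pointing at it,
# then a second program that starts with its own $FFFF header.
DEMO = bytes.fromhex("ffff 0020 0020 60  e202 e302 0020  ffff 0030 0030 60")
SEGMENTS = parse_xex(DEMO)
print([(h, hex(s), hex(e)) for h, s, e, _ in SEGMENTS])
print([hex(a) for a in entry_points(SEGMENTS)])
```

Keeping the `had_header` flag per segment is what would let a disassembler emit `dta a($FFFF)` only where the original file had it.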
Heaven/TQA Posted December 11, 2014

will have a look, too... how do you decide if it's data and not code?
Xuel Posted December 11, 2014

Heaven/TQA said: will have a look, too... how do you decide if it's data and not code?

That's the "statically tracing" part. It starts by assuming everything is data. Then code entry points are determined from the executable's RUN and INI segments, or by manual specification by the user with -e XXXX. The disassembler traces starting from each entry point until it hits a JMP, JSR, BXX branch, RTI, RTS or illegal instruction. If it's a JMP, JSR or BXX, then it recursively traces the target addresses as new code entry points. So, theoretically if it knows the initial entry point, it can find all memory locations that correspond to code.

However, this is done statically, i.e. without knowledge of how the program may change memory at runtime. So there are several cases where it can't find code, e.g. self-modifying jumps, indirect jumps through memory that changes, PHA/PHA/RTS-style jumps, interrupt vectors that are changed at runtime, generated code, decompressed code, etc.

The best results are probably achieved using a memory dump taken after the executable has decompressed itself, but you're still only able to capture one state of the machine. If the executable goes on to create more speed code, change vectors, or otherwise self-modify then all bets are off.

That being said, you can still get decent results on code that doesn't use a lot of tricks. And you can use as many -e XXXX options as you like to tell the tracer to visit places it otherwise would have missed.

Static tracing is mostly conservative in what it treats as code, but there are some cases like an always taken branch where it might go off the rails into data. For those situations you can use -c XXXX to force the tracing to stop at specific addresses.
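The tracing idea in the post above can be sketched in a few lines of Python. This is an illustration, not the Perl code from dis: the `trace` function and the tiny `OPCODES` table are hypothetical, only a dozen opcodes are included, and the sketch assumes that tracing falls through after JSR and conditional branches while also queueing their targets. Any opcode missing from the table simply stops the trace here.

```python
# Sketch only: a tiny subset of the 6502 opcode table, enough for the demo.
# opcode -> (instruction length in bytes, kind)
OPCODES = {
    0xAD: (3, "normal"),  # LDA abs
    0xA9: (2, "normal"),  # LDA #imm
    0x4A: (1, "normal"),  # LSR A
    0xC9: (2, "normal"),  # CMP #imm
    0x2C: (3, "normal"),  # BIT abs
    0x8D: (3, "normal"),  # STA abs
    0x90: (2, "branch"),  # BCC rel
    0xF0: (2, "branch"),  # BEQ rel
    0x4C: (3, "jump"),    # JMP abs
    0x20: (3, "call"),    # JSR abs
    0x60: (1, "stop"),    # RTS
    0x40: (1, "stop"),    # RTI
}


def trace(mem, origin, entries):
    """Return the set of addresses that belong to traced instructions."""
    end = origin + len(mem)
    code = set()     # every byte covered by a traced instruction
    starts = set()   # addresses where a traced instruction begins
    work = list(entries)
    while work:
        pc = work.pop()
        while origin <= pc < end and pc not in starts:
            op = mem[pc - origin]
            if op not in OPCODES:
                break  # unknown to this sketch: stop tracing this path
            length, kind = OPCODES[op]
            if pc + length > end:
                break  # operand would run past the end of the image
            starts.add(pc)
            code.update(range(pc, pc + length))
            if kind in ("jump", "call"):
                work.append(mem[pc + 1 - origin] | (mem[pc + 2 - origin] << 8))
            elif kind == "branch":
                offset = mem[pc + 1 - origin]
                if offset >= 0x80:
                    offset -= 0x100  # branch offsets are signed 8-bit
                work.append(pc + 2 + offset)
            if kind in ("jump", "stop"):
                break  # nothing falls through a JMP, RTS or RTI
            pc += length
    return code


# The bytes from the example output in the first post, loaded at $1150.
IMAGE = bytes.fromhex(
    "AD C3 10 4A 4A C9 20 90 05 F0 F4 A9 01 2C A9 00 8D 66 11 4C 30 11 01 02 00"
)
CODE = trace(IMAGE, 0x1150, [0x1150])
DATA = [a for a in range(0x1150, 0x1150 + len(IMAGE)) if a not in CODE]
print([f"{a:04X}" for a in DATA])
```

Tracking instruction starts separately from covered bytes lets the BCC target at $115E, which lies inside the BIT instruction, be traced as its own instruction.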
Heaven/TQA Posted December 11, 2014

did you compare your output with Dis6502?
Xuel Posted December 11, 2014 (edited)

Heaven/TQA said: did you compare your output with Dis6502?

Some advantages of dis over dis6502:
- dis offers static tracing. As far as I know dis6502 does no tracing and attempts to disassemble everything as code. It appears to only use .byte for illegal opcodes and BRK.
- dis can disassemble illegal opcodes. dis6502 treats them as data.
- dis uses "a:" to distinguish absolute addressing from z-page addressing when the address lies within the zero page. So there are cases where assembling the output of dis6502 won't give you back the original file but dis will.
- dis uses unique labels for each segment. dis6502 may create duplicate labels if segments overlap.

Some advantages of dis6502 over dis:
- dis6502 has built-in system equates and supports user equates. dis doesn't yet.
- dis6502 creates labels for data and supports address ranges. dis doesn't do this yet.
- dis6502's GUI is very nice. dis is just a command-line tool.
- dis6502 supports more input file formats including ATR, XFD and CAS. dis only supports raw, XEX and PRG.
- dis6502 lets you redefine the assembler syntax. dis only supports XASM/MADS at the moment.
- dis6502 can put multiple data bytes on the same line. dis currently puts each byte on a separate line which can make the output huge.
- way more features in general

Edited December 11, 2014 by Xuel
Tezz Posted December 11, 2014

It certainly sounds like a time saver. I'll be interested to see it in action. I must admit that I quite enjoy disassembling manually with dis6502 although I have made use of IDA once for an Electron disassembly in the past.
snicklin Posted December 11, 2014

Thanks Xuel, this sounds very good. I have thought out such a system before, though you've gone much further and even implemented it. I thought that it would be possible to some extent.

Feature request (though it may take some time).... If it sees:

    LDA #0
    STA 559

... then it should add " ; Switching off screen" as a comment onto the end of the STA 559. You could load values from a csv which the user could supply and we as a community could build that file up.

It would only work for statically coded information i.e.

    LDA 203
    STA 559

... could not be commented.... unless it said, "Doing something with the screen"
bob_er Posted December 13, 2014

Hi,

Heh, I have a similar project on my HDD. Unfortunately, it isn't finished. Some nice features that mine has and yours doesn't (as far as I can see):
- detect reading from uninitialized memory (be careful with hardware registers),
- detect executing from uninitialized memory,
- advanced breakpoints (memory1=x, memory2=y, memory3>z and PC=xxx),
- call VBL roughly every 30000 cycles (to make some VBL clocks work).
pfeuh Posted December 15, 2014 (edited)

Hello,

And what about something like this (I have nothing to assemble with, just a keyboard at my desk)? As hidden is never called in the code... I confess that I didn't read all the posts, just your first one.

        * = $600
        jmp (toto)
    toto .word hidden
    hidden
        ; here some code...
    loop jmp loop

Edited December 15, 2014 by pfeuh
Xuel Posted December 15, 2014

bob_er said: Hi, Heh, I have a similar project on my HDD. Unfortunately, it isn't finished. Some nice features that mine has and yours doesn't (as far as I can see):
- detect reading from uninitialized memory (be careful with hardware registers),
- detect executing from uninitialized memory,
- advanced breakpoints (memory1=x, memory2=y, memory3>z and PC=xxx),
- call VBL roughly every 30000 cycles (to make some VBL clocks work).

This sounds like you are actually simulating a 6502. I'm just stepping forward one instruction at a time based on the instruction length. The only instructions which I interpret in any fashion are JMP, JSR, BXX, RTI, and RTS. I think your method has many advantages including being able to handle self-modifying code and the uninitialized memory checks you described. I'd like to see it in action!
Xuel Posted December 15, 2014

pfeuh said: Hello, And what about something like this (I have nothing to assemble with, just a keyboard at my desk)? As hidden is never called in the code... I confess that I didn't read all the posts, just your first one.

        * = $600
        jmp (toto)
    toto .word hidden
    hidden
        ; here some code...
    loop jmp loop

dis does handle indirect JMPs but only uses the value stored at "toto" (here, the address of "hidden") at the time of disassembly. If that value changes over the course of the run time, then the other values would have to be given manually with -e XXXX to ensure that dis can traverse them.
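To make the point about indirect JMPs concrete, here is a hedged Python sketch of resolving `jmp (toto)` from the bytes present in the image at disassembly time. The helper name `indirect_target` is hypothetical and this is not how dis is necessarily written. The byte image assumes pfeuh's snippet assembles with no code between `hidden` and `loop`, so both sit at $0605.

```python
def indirect_target(mem, origin, pc):
    """Resolve JMP (ptr) at pc from the image; None if ptr is outside it."""
    if mem[pc - origin] != 0x6C:  # 0x6C is the JMP (indirect) opcode
        raise ValueError("not an indirect JMP")
    ptr = mem[pc + 1 - origin] | (mem[pc + 2 - origin] << 8)
    # NMOS 6502 quirk: the high byte is fetched from the same page as the
    # low byte, so a pointer at $xxFF wraps around to $xx00.
    hi_addr = (ptr & 0xFF00) | ((ptr + 1) & 0x00FF)
    end = origin + len(mem)
    if not (origin <= ptr < end and origin <= hi_addr < end):
        return None  # pointer lives outside the image: nothing to read
    return mem[ptr - origin] | (mem[hi_addr - origin] << 8)


# $0600: jmp (toto)   toto at $0603: .word $0605   $0605: loop jmp loop
IMAGE = bytes.fromhex("6C 03 06 05 06 4C 05 06")
TARGET = indirect_target(IMAGE, 0x0600, 0x0600)
print(f"{TARGET:04X}")
```

The result reflects only one snapshot of memory. If the program rewrites the word at `toto` while running, the other targets still have to be supplied by hand with -e XXXX.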
Heaven/TQA Posted December 15, 2014

how do I run that python file?
Xuel Posted December 15, 2014

Heaven/TQA said: how do I run that python file?

It's a Perl script. On Linux and Mac, Perl is usually installed by default. Just "chmod +x dis.pl" to make it executable, then run it. On Windows you can get Perl with cygwin or mingw or you can install something like Strawberry or ActiveState Perl. I really like cygwin. I use it for all my development.
Heaven/TQA Posted December 16, 2014

ok. works so far. now let's see if MADS produces same binary cart image
Heaven/TQA Posted December 16, 2014

ok.... cart will be assembled correctly... good start... I was just wondering why some code is not disassembled but remains in DTA statements?
Xuel Posted December 16, 2014

Heaven/TQA said: ok.... cart will be assembled correctly... good start... I was just wondering why some code is not disassembled but remains in DTA statements?

You have to help it by supplying the code entry points that it can't determine statically. As I mentioned in post 6, there are many types of code paths that it can't trace statically. It traces as much as it can through JMP/JSR/BXX instructions, but interrupts, indirect jumps and self-modifying code can throw it off.

As an example, consider the Joust.rom file on Atarimania. We can get an initial pass with the following command:

    dis.pl Joust.rom -l -o 8000 -v bffa -v bffe > joust.asm

The -v options tell dis to trace from the code entry points specified by the Cartridge B start and init vectors. This will allow dis to trace the mainline code. However, Joust uses a deferred VBLANK routine which dis won't see:

        lda #$38        ; 840C: A9 38
        sta $0224       ; 840E: 8D 24 02
        lda #$A6        ; 8411: A9 A6
        sta $0225       ; 8413: 8D 25 02

And it uses indirect JMP instructions fed with entries from a couple of jump tables. One of them looks like this:

        lda $B691,y     ; B684: B9 91 B6
        sta $B7         ; B687: 85 B7
        lda $B692,y     ; B689: B9 92 B6
        sta $B8         ; B68C: 85 B8
        jmp ($00B7)     ; B68E: 6C B7 00
        dta $B9         ; B691: B9
        dta $B6         ; B692: B6
        dta $9D         ; B693: 9D
        dta $B6         ; B694: B6
        dta $9C         ; B695: 9C
        dta $BF         ; B696: BF
        dta $9D         ; B697: 9D
        dta $B6         ; B698: B6
        dta $B9         ; B699: B9
        dta $B6         ; B69A: B6
        dta $D4         ; B69B: D4
        dta $B6         ; B69C: B6

We can augment the command-line to tell dis to trace these as well:

    dis.pl Joust.rom -l -o 8000 -v bffa -v bffe -e a638 -e b6b9 -e b69d -e bf9c -e b6d4 > joust.asm

Basically, you can keep running dis with more -e switches until you're satisfied that it has covered all of the code. I'll probably add some options to ignore tracing and just force disassembly as well.
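Turning the jump table and the VBLANK vector from the post above into -e switches is mechanical: read each pair of bytes as a little-endian address and drop duplicates. The Python sketch below shows that step. The helper name `entry_switches` is made up, and the table length of 12 bytes is an assumption taken from the bytes shown in the post; the real table may be longer.

```python
def entry_switches(table, extra=()):
    """Format little-endian table entries as dis-style '-e xxxx' switches."""
    seen = list(extra)
    for i in range(0, len(table) - 1, 2):
        addr = table[i] | (table[i + 1] << 8)  # low byte first on the 6502
        if addr not in seen:
            seen.append(addr)  # keep first occurrence, preserve order
    return " ".join(f"-e {addr:04x}" for addr in seen)


# Bytes shown at $B691-$B69C in the listing above.
TABLE = bytes.fromhex("B9 B6 9D B6 9C BF 9D B6 B9 B6 D4 B6")
# Deferred VBLANK vector: #$38 stored to $0224 (low), #$A6 to $0225 (high).
VBLANK = 0x38 | (0xA6 << 8)
SWITCHES = entry_switches(TABLE, extra=[VBLANK])
print(SWITCHES)
```

The printed switches match the ones added by hand in the second command line above.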
Heaven/TQA Posted December 16, 2014

thanks... it works so far... I am glad that it is assembled 1:1... I guess the DIS6502 issue was that it uses forced absolute addressing even if it is ZP...
Xuel Posted December 16, 2014

I should also mention that dis doesn't really support banked cartridges. Currently, the best bet with a banked cartridge would be to separate it into smaller virtual carts, maybe 4K or 8K pieces depending on the banking scheme, and then disassemble them individually. But the number of -e switches that you'd have to keep track of would probably make this a really unpleasant process.

Perhaps I can apply some of the ideas I have for supporting XEX inter-segment calls to banked carts as well. At the minimum the -e flag would need a way to specify the bank number in addition to the code entry address.

Also, dis doesn't yet support the CAR format. You can fake it out by giving -o as 16 bytes before the real starting address to skip over the CART header, e.g. -o 7FF0. This will work fine for unbanked carts, but the banked cart issue remains.
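The -o workaround above relies on the CAR header being exactly 16 bytes. As a small illustration in Python, the sketch below computes the adjusted origin and, as an alternative, strips the header so the remainder can be treated as a raw ROM dump. The names `origin_workaround` and `split_car` are hypothetical, and the header layout (the "CART" signature, then a big-endian cartridge type, a checksum and four unused bytes) is stated from general knowledge of the format rather than from this thread.

```python
CAR_HEADER_SIZE = 16  # "CART" + type + checksum + unused, 4 bytes each


def origin_workaround(real_origin):
    """Origin to pass with -o so the header bytes land before the ROM."""
    return real_origin - CAR_HEADER_SIZE


def split_car(data):
    """Return (cartridge_type, rom_bytes) for a CAR image."""
    if len(data) < CAR_HEADER_SIZE or data[:4] != b"CART":
        raise ValueError("not a CAR image")
    cart_type = int.from_bytes(data[4:8], "big")
    return cart_type, data[CAR_HEADER_SIZE:]


# Fake two-byte "ROM" wrapped in a header, just to exercise the helpers.
FAKE = b"CART" + (1).to_bytes(4, "big") + bytes(8) + b"\xA9\x00"
print(f"{origin_workaround(0x8000):04X}", split_car(FAKE))
```

Either approach only helps with unbanked carts, as the post says.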
Heaven/TQA Posted December 17, 2014

Xuel... no worries... output assembles fine to a 32kb 5200 cart ($4000-$bfff) so no problem here...
Xuel Posted January 5, 2015

Release v0.4 is now available here. Major new features include:
- Added support for user-defined labels with optional address ranges
- Added support for reading options, including labels, from a set of files
- Added support for SAP format files
- Added tracing between XEX segments
- Fixed bugs - Thanks to fox!
- See change log for more details

Example of disassembling some of the Atari OS using sys.dop and hardware.dop:

    lC0E2                   ; Callers: SYSVBV -v 0222 C029
        inc RTCLOK+2        ; C0E2: E6 14
        bne lC0EE           ; C0E4: D0 08
        inc ATRACT          ; C0E6: E6 4D
        inc RTCLOK+1        ; C0E8: E6 13
        bne lC0EE           ; C0EA: D0 02
        inc RTCLOK          ; C0EC: E6 12
    lC0EE                   ; Callers: C0E4 C0EA
        lda #$FE            ; C0EE: A9 FE
        ldx #$00            ; C0F0: A2 00
        ldy ATRACT          ; C0F2: A4 4D
        bpl lC0FC           ; C0F4: 10 06
        sta ATRACT          ; C0F6: 85 4D
        ldx RTCLOK+1        ; C0F8: A6 13
        lda #$F6            ; C0FA: A9 F6
    lC0FC                   ; Callers: C0F4
        sta DRKMSK          ; C0FC: 85 4E
        stx COLRSH          ; C0FE: 86 4F
Fraggle Posted August 11, 2021

On 12/11/2014 at 11:22 AM, Xuel said: So there are several cases where it can't find code, e.g. self-modifying jumps, indirect jumps through memory that changes, PHA/PHA/RTS-style jumps, interrupt vectors that are changed at runtime, generated code, decompressed code, etc.

On 12/11/2014 at 11:22 AM, Xuel said: If the executable goes on to create more speed code, change vectors, or otherwise self-modify then all bets are off.

Absolutely cool project you did/do! I'd like to suggest the following: You mentioned the limits - what dis can and can't do/handle.

*1* How about making the list complete? (Collect ALL possibilities where dis would fail, instead of "etc.")

*2* How about DETECTING these cases? It would be great if dis could say e.g. "self modifying jump(s) detected at $...." or "stack jump(s) detected at $...." or "irq/nmi/reset vectors manipulated at $...." or "jump table detected at $...."! This would help a lot in deciding where to investigate further.

*3* Already mentioned in this thread: How about building up and sharing an additional info-file for e.g. C64 for VIC, SID and CIA addresses? Does nobody have interest in that? Why reinvent the wheel a thousand times individually?

It would be fantastic if these 3 points could get implemented!
mellis Posted August 11, 2021

12 hours ago, Fraggle said: It would be fantastic if these 3 points could get implemented!

Indeed, that would be fantastic -- why don't you get to work on it? The source code is available here: https://github.com/lybrown/dis