My Basic Parsing and Transformation Tool

+MrFish · September 9, 2015

This is probably a buffer overflow bug in the TBXL parser code. Can you post the lines 155 to 157 of the generated .LST? I will look at it later and try to understand where the overflow is.

Or, you can send me the full source, so I can test it myself.

155DIMN$(BA),O$(BA),R$(BA):DIMP$(BA),Q$(BA):DIMT$(219),U$(50),V$(245):DS=ADR(V$):CV=ADR(T$):BS=ADR(U$):C4=ADR(S$)

156BP=ADR(G$):BJ=ADR(N$):BH=ADR(O$):BT=ADR(R$):BE=ADR(P$):BF=ADR(Q$):ENDP.

157PROCB:AX=0:AW=0:B2=0:B5=0:DU=0:A_=0:BB=DR:BC=0:B0=0:B4=0:B6=0:Z=0:AM=0:AO=0:DV=0:DW=0:B1=0:B3=0:BD=0:AY=0:DX=0:BG=0

158BQ=0:BR=0:BK=0:BI=0:CW=183:BW=0:BV=1:C9=1:CC=65535:CD=1:CE=0:CH=0:CK=A4:B7=0:I=0:DC=0:I$="":ENDP.

I'll send you the sources in PM.

I know a big part of the purpose of your parser is to ease writing BASIC programs for submission in contests such as the "BASIC 10 Liner", where you want to pack as many characters as possible in each line. But I'm wondering -- based on problems with buffer overflow -- if it might be a good idea to have a command line switch for setting the line length?

vitoco · September 9, 2015

@dmsc: Thank you for this tool.

I tried it with some of my BASIC sources and have some comments:

With the "-l" option, I could recover the full free-format BASIC listing from my already compacted listing (generated by my own tool). Great!
When there is an IF-THEN with many statements before the newline, the "-l" option lists each one on a new line (with one more level of indentation), but parsing this again results in an error: "expected end of line, got 'then '".
I couldn't notice any difference when running with or without the "-f" option in the long output format ("-l").
Verbose mode shows the variable translation table, even if the "-f" option is used and variable names are preserved.
I wanted to preserve my own already-short variable names when the output is the default (compacted listing). Could the "-f" option be available to all output modes?
Is the "-x" option available just to remove the table of variable names to save some bytes in tokenized output?
The backslash char is used to translate hex codes to ATASCII chars, but it is also available on the real Atari keyboard, and therefore in BASIC programs. If two hex digits happen to follow a backslash, they will always be translated, even if that was not intended. Could it be escaped with another "\" or changed to another char that is unavailable on a real Atari, like "~"? Also, an option to convert decimal codes could be useful.

Thank you again.

dmsc · September 10, 2015

Hi!

155DIMN$(BA),O$(BA),R$(BA):DIMP$(BA),Q$(BA):DIMT$(219),U$(50),V$(245):DS=ADR(V$):CV=ADR(T$):BS=ADR(U$):C4=ADR(S$)

156BP=ADR(G$):BJ=ADR(N$):BH=ADR(O$):BT=ADR(R$):BE=ADR(P$):BF=ADR(Q$):ENDP.

157PROCB:AX=0:AW=0:B2=0:B5=0:DU=0:A_=0:BB=DR:BC=0:B0=0:B4=0:B6=0:Z=0:AM=0:AO=0:DV=0:DW=0:B1=0:B3=0:BD=0:AY=0:DX=0:BG=0

158BQ=0:BR=0:BK=0:BI=0:CW=183:BW=0:BV=1:C9=1:CC=65535:CD=1:CE=0:CH=0:CK=A4:B7=0:I=0:DC=0:I$="":ENDP.

I'll send you the sources in PM.

Wow, your code was superb!

And, I fixed another error and was able to run the program without (apparent) issues.

The error was that I counted the tokenized "%0" to "%3" as one byte each, but in short list mode I wrote the numbers "0" to "3" instead, as they were a little shorter in the listing. But the numbers use 6 bytes instead of one, so there was a possibility of overflow in the TBXL parser.

Now I write the tokens as they are; this produces a program that uses fewer bytes after loading in TurboBasic.
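
To illustrate why line 157 in particular could overflow, here is a rough Python sketch of the size mismatch described above. The byte costs (1 for a "%0" to "%3" token, 6 for a written-out number) are the figures given in this post; the regular expression and the helper name are illustrative assumptions, not code from tbxl-parser.

```python
import re

# Byte costs as stated in the post; not taken from the tbxl-parser sources.
COST_PERCENT_TOKEN = 1  # "%0".."%3" counted as one byte
COST_NUMBER = 6         # a written-out number uses 6 bytes

# Line 157 of the listing quoted earlier in the thread (line number removed).
LINE_157 = ("PROCB:AX=0:AW=0:B2=0:B5=0:DU=0:A_=0:BB=DR:BC=0:B0=0:B4=0:B6=0:"
            "Z=0:AM=0:AO=0:DV=0:DW=0:B1=0:B3=0:BD=0:AY=0:DX=0:BG=0")


def count_small_constants(line):
    """Count assignments whose right-hand side is exactly 0, 1, 2 or 3."""
    return len(re.findall(r"=[0-3](?=:|$)", line))


count = count_small_constants(LINE_157)
# Each small constant was budgeted at 1 byte but really needs 6.
underestimate = count * (COST_NUMBER - COST_PERCENT_TOKEN)
print(count, "small constants, size underestimated by", underestimate, "bytes")
```

With that many small constants packed into one line, the gap between the estimated and the real size adds up quickly.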

I know a big part of the purpose of your parser is to ease writing BASIC programs for submission in contests such as the "BASIC 10 Liner", where you want to pack as many characters as possible in each line. But I'm wondering -- based on problems with buffer overflow -- if it might be a good idea to have a command line switch for setting the line length?

You can already use "-n" to specify the maximum number of characters in one line. But I should first fix the binary output, as that is the best option if you want to produce a minimal BAS program.
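
As a rough idea of what a maximum line length means when packing, here is a small Python sketch of greedy packing. It is only an assumption about the general technique, not the parser's actual algorithm: the function name is made up, lines are simply numbered from 0, and it ignores statements that must stay at the end of a line (such as an IF/THEN).

```python
def pack_statements(statements, max_len):
    """Greedily join statements with ':' into numbered lines of at most
    max_len characters (line number included). Hypothetical helper."""
    lines = []
    current = []
    for stmt in statements:
        candidate = current + [stmt]
        text = str(len(lines)) + ":".join(candidate)
        if current and len(text) > max_len:
            # The statement does not fit: close the current line first.
            lines.append(str(len(lines)) + ":".join(current))
            current = [stmt]
        else:
            current = candidate
    if current:
        lines.append(str(len(lines)) + ":".join(current))
    return lines


print(pack_statements(["A=1", "B=2", "C=3"], 8))
```

A single statement longer than the limit still gets a line of its own in this sketch; a real tool would have to report that as an error.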

Attached is the newer version with the %* fixed, as always, full sources are in github: https://github.com/dmsc/tbxl-parser

basicParser-v2-3-g7f0a83a-win32.zip

dmsc · September 10, 2015

Hi!

@dmsc: Thank you for this tool.

I tried it with some of my BASIC sources and have some comments:

With "-l" option, I could recover the full free-format BASIC listing from my already compacted listing (generated by my own tool). Great!

When there is an IF-THEN with many statements before the newline, the "-l" option lists each one on a new line (with one more level of indentation), but parsing this again results in an error: "expected end of line, got 'then '".

I know, but I have not found a better way to "convert" those IF/THEN to the long format. Perhaps I should replace them with IF/ENDIF, or simply keep all the code in the same "very long" line.

I couldn't notice any difference when running with or without "-f" option in the long output format ("-l").

Yes, there are no differences. Currently, the "-f" option only works when writing binary (.BAS) files, as those use short variable names by default.

Verbose mode shows the variable translation table, even if the "-f" option is used and variable names are preserved.

Ha ha, you are right... Well, the current code always generates the short names at parsing time, and the renaming info is shown there. I should write the list *after* parsing; then it would be possible to only show it when needed.

I wanted to preserve my own already-short variable names when the output is the default (compacted listing). Could "-f" option be available to all output modes?

Preserving only short variable names would be difficult. Currently, *short* names are derived from the number of variables of the given type, sequentially, there is not really a "naming".

But, writing the full names is possible (after converting to upper-case and removing invalid characters). I will add support for that later.
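
To make the sequential naming above concrete, here is a minimal Python sketch. It assumes one counter per variable type and an ordering of "A" to "Z" followed by two-character names; the real order used by the parser is not stated in this thread, and the sketch does not skip names that collide with BASIC keywords.

```python
from itertools import islice
import string

FIRST = string.ascii_uppercase
# The listing above contains names such as "B2" and "A_", so digits and
# underscore are assumed to be valid second characters.
REST = string.ascii_uppercase + string.digits + "_"


def short_names():
    """Yield 'A'..'Z', then 'AA', 'AB', ... (assumed order)."""
    for first in FIRST:
        yield first
    for first in FIRST:
        for second in REST:
            yield first + second


def rename(variables):
    """Map each variable to the next short name of its type.
    Strings (names ending in '$') use their own counter."""
    numeric, strings = short_names(), short_names()
    table = {}
    for name in variables:
        if name in table:
            continue
        if name.endswith("$"):
            table[name] = next(strings) + "$"
        else:
            table[name] = next(numeric)
    return table


print(rename(["score", "name$", "lives", "score"]))
```

Because the short name depends only on the order of appearance, an original name that is already short is not kept unless it happens to land on the same position.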

Is the "-x" option available just to remove the table of variable names to save some bytes in tokenized output?

Yes, the program will use less memory, and also will be "protected".

The backslash char is used to translate hex codes to ATASCII chars, but it is also available on the real Atari keyboard, and therefore in BASIC programs. If two hex digits happen to follow a backslash, they will always be translated, even if that was not intended. Could it be escaped with another "\" or changed to another char that is unavailable on a real Atari, like "~"? Also, an option to convert decimal codes could be useful.

Mmm... you are right, the string "HOLA\\FEO" does not give what you expect. I can fix the parser to parse "\\" as "\", or you can write "HOLA\5CFEO", that is a lot harder to understand.

I don't like the idea of changing the "escape" char from '\', because in fact all characters are available in ATASCII (for example, the '~' character would be DELETE), so you would expect all of them to be possible in some sources, and '\' is universally used in modern programming languages.

Thank you again.

And thanks for the testing.

+MrFish · September 10, 2015

Wow, your code was superb!

Thanks, it's the font converter for FlashJazzCat's GUI.

And, I fixed another error and was able to run the program without (apparent) issues.

The error was that I counted the tokenized "%0" to "%3" as one byte each, but in short list mode I wrote the numbers "0" to "3" instead, as they were a little shorter in the listing. But the numbers use 6 bytes instead of one, so there was a possibility of overflow in the TBXL parser.

Now I write the tokens as they are; this produces a program that uses fewer bytes after loading in TurboBasic.

It makes sense why those lines in particular caused the overflow then, since so many variables were being initialized to %0 and %1. When I saw the output file I was wondering why you converted the %0 - %3 to 0 - 3. I thought you were attempting to save space to fit more on each line. In my case they're better off left alone for saving program RAM, which I'm always fighting against with this app.

Indeed, everything seems to be fine now after the changes you made. Converting to tokenized format still results in a program error, but I'm satisfied with listed for now, until you get tokenized dialed in.

My code was already pretty well optimized, but by using your converter I now have an extra 4,847 bytes available in main memory to play around with, which is golden because I was already down to 2,560 bytes (the reason why the three machine language subs were pushed out to files).

You can already use "-n" to specify the maximum number of characters in one line. But I should first fix the binary output, as that is the best option if you want to produce a minimal BAS program.

Attached is the newer version with the %* fixed, as always, full sources are in github: https://github.com/dmsc/tbxl-parser

Doh! I guess I was a little too tired last night to realise that, otherwise I would have tried using it.

Thanks again. I'll be putting this to good use, once I have time to start coding again. It integrates nicely in either ConTEXT or NotePad++.

Edited September 10, 2015 by MrFish

dmsc · September 12, 2015

Hi!

Indeed, everything seems to be fine now after the changes you made. Converting to tokenized format still results in a program error, but I'm satisfied with listed for now, until you get tokenized dialed in.

I made another release fixing all known bugs so-far: https://github.com/dmsc/tbxl-parser/releases/tag/v3

Now, the tokenized format produces a valid binary file for your program, but it does not currently work because the resulting size is small enough that the machine language routines that do the bank-switching end up stored below address $8000, inside the bank switching window!

So, it will probably make sense to move said ML routines to page 6 or another LOW or HIGH address.
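
A quick way to check this kind of problem is to test whether a routine overlaps the window. The Python sketch below assumes the usual 130XE-style extended RAM window from $4000 up to (but not including) $8000; only the $8000 upper bound comes from this post, and the function name is made up for illustration.

```python
BANK_WINDOW_START = 0x4000  # assumption: 130XE-style extended RAM window
BANK_WINDOW_END = 0x8000    # exclusive; code below $8000 is affected


def in_bank_window(start, length):
    """Return True if any byte of [start, start + length) lies in the window."""
    end = start + length
    return start < BANK_WINDOW_END and end > BANK_WINDOW_START


print(in_bank_window(0x7F80, 0x40))   # routine just below $8000
print(in_bank_window(0x0600, 0x100))  # page 6
```

A routine that merely straddles $8000 is still reported as affected, since part of it disappears when the bank is switched.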

dmsc · September 12, 2015

Hi!

When there is an IF-THEN with many statements before the newline, the "-l" option lists each one on a new line (with one more level of indentation), but parsing this again results in an error: "expected end of line, got 'then '".

This is now fixed in current (v3) release, you can get it at: https://github.com/dmsc/tbxl-parser/releases/tag/v3

+MrFish · September 12, 2015

I made another release fixing all known bugs so-far: https://github.com/dmsc/tbxl-parser/releases/tag/v3

Looks great. I'll check it out when I get a chance.

Now, the tokenized format produces a valid binary file for your program, but it does not currently work because the resulting size is small enough that the machine language routines that do the bank-switching end up stored below address $8000, inside the bank switching window!

So, it will probably make sense to move said ML routines to page 6 or another LOW or HIGH address.

Ah, yes I hadn't thought about that happening -- it's been a while since I've been in my own code. That'll be my problem to fix, but I like the reason why I have to do it. I never liked using page six, since there are instances where BASIC will use it for buffer overflow. I'll have to find another place to store them safely.

+MrFish · September 12, 2015

...the resulting size is small enough that the machine language routines that do the bank-switching end up stored below address $8000, inside the bank switching window!

It was close -- within less than 150 bytes. I'll bet that caused you a little headache.

And, actually, the listed version just makes it by an even closer margin of 61 bytes currently.

It's not really a good practice for bank-switching code. But it became such a non-concern after my code had grown beyond a certain point. It was already ending up well beyond $9000, and I was just looking for a quick way to get it out of page 6.

Anyway, good work, and good you found my note in there.

Edited September 12, 2015 by MrFish

vitoco · September 14, 2015

Preserving only short variable names would be difficult. Currently, *short* names are derived from the number of variables of the given type, sequentially, there is not really a "naming".

But, writing the full names is possible (after converting to upper-case and removing invalid characters). I will add support for that later.

What I tried to say was that I want to preserve all the original variables, not only the short ones. Most of the time, with short name or not, they had some mnemonic meaning, so I want to preserve them. That's why I would like to enable this "-f" option for all output types.

Mmm... you are right, the string "HOLA\\FEO" does not give what you expect. I can fix the parser to parse "\\" as "\"

Nice example!!! :-D I'll also wait for the support of this escape char.

This is now fixed in current (v3) release, you can get it at: https://github.com/dmsc/tbxl-parser/releases/tag/v3

Now IF-THEN is being converted into IF-:-ENDIF structure. Nice, but I'm not sure that bytes are saved.

Small bug detected: when parsing a file that does not exist, it complains and aborts in default mode, but it generates an empty output in list mode ("-l") and lists empty variable info on screen.

Roydea6 · September 14, 2015

Hi!

This is now fixed in current (v3) release, you can get it at: https://github.com/dmsc/tbxl-parser/releases/tag/v3

I think that I am having problems with the Turbobasic GO# LABEL with the parser. I get a line stating the unexpected end of line #LABEL and never writing the -o file...

dmsc · September 15, 2015

Hi!

I think that I am having problems with the Turbobasic GO# LABEL with the parser. I get a line stating the unexpected end of line #LABEL and never writing the -o file...

Can you share at least the offending line?

This is my current test code for labels:

  ' Test labels
  for i=0 to 10
# Resume
    if i = 1 then go# Label1
    if i = 2
      exec sub
    endif
  next i
  ? "End program"
  end

proc sub
  ? "Inside sub!"
endproc

#Label1
  ? "Label 1"
  i = i + 1
  on i go# noLabel, Resume, Label1

The source parses ok and gives the following short listing that works:

0F.A=0TO10
1#A:IFA=1THENGO#B
2IFA=2:EXECC:END.:N.A:?"End program":END
3PROCC:?"Inside sub!":ENDP.
4#B:?"Label 1":A=A+1:ONA GO#D,A,B

Roydea6 · September 15, 2015

I rearranged some lines and got it to work okay. I believe that I had some duplicate labels or mislabeled a GO# label.

How does the parser handle RESTORE 320 when the listing is rearranged to only have 260 lines? And what about an Atari BASIC listing with GOTO 290 when only 260 lines are generated?

.

dmsc · September 16, 2015

Hi!,

I rearranged some lines and got it to work okay. I believe that I had some duplicate labels or mislabeled a GO# label.

How does the parser handle RESTORE 320 when the listing is rearranged to only have 260 lines? And what about an Atari BASIC listing with GOTO 290 when only 260 lines are generated?

The parser always keeps existing line numbers, as it does not know if it is the target of some GOTO, GOSUB, RESTORE, etc. Even if the lines are empty after parsing, a single line with a minimal comment is written.
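
The rule above can be sketched in a few lines of Python. This is only an illustration: the placeholder text "REM" and the function name are assumptions, since the exact "minimal comment" the parser writes is not shown in this thread.

```python
PLACEHOLDER = "REM"  # hypothetical stand-in for the tool's minimal comment


def emit_lines(program):
    """program maps a line number to the statements left after parsing.
    Every original line number is kept, even when nothing is left."""
    out = []
    for number in sorted(program):
        statements = program[number]
        body = ":".join(statements) if statements else PLACEHOLDER
        out.append(f"{number} {body}")
    return out


print(emit_lines({320: [], 10: ["A=1", "B=2"]}))
```

Keeping line 320 alive this way means a RESTORE 320 or a computed GOTO still finds its target.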

dmsc · September 16, 2015

Hi!

Now IF-THEN is being converted into IF-:-ENDIF structure. Nice, but I'm not sure that bytes are saved.

The intention is not to save bytes, it is to produce a *valid* listing. As the purpose of the lister is to convert old basic programs to the long format, the output is transformed to a multi-line equivalent.
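
A simplified Python sketch of that IF/THEN to IF/ENDIF transformation follows. It is an assumption about the general idea, not the lister's code: it splits on every ":" (so it would break on strings containing a colon), it does not handle a nested IF after THEN, and it leaves "THEN <line number>" untouched because that form is a jump.

```python
def expand_if_then(line, indent="  "):
    """Turn 'IF cond THEN s1:s2' into a multi-line IF ... ENDIF block.
    Hypothetical helper; returns the line unchanged when it does not apply."""
    upper = line.upper()
    pos = upper.find(" THEN ")
    if not upper.startswith("IF ") or pos < 0:
        return [line]
    cond = line[3:pos].strip()
    body = line[pos + len(" THEN "):].strip()
    if body.isdigit():
        return [line]  # "THEN 100" is a jump, not a statement list
    statements = [s.strip() for s in body.split(":")]
    return ["IF " + cond] + [indent + s for s in statements] + ["ENDIF"]


for out in expand_if_then("IF A=1 THEN B=2:C=3"):
    print(out)
```

The result takes more lines than the original, which matches the point above: the goal is a listing that parses again, not a shorter program.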

Small bug detected: when parsing a file that does not exist, it complains and aborts in default mode, but it generates an empty output in list mode ("-l") and lists empty variable info on screen.

Well, the list mode always generates an output file, because it is helpful when converting bad sources to the long format. I added a message explaining that the output will be generated even on error.

Also, the parser accepted a file with only empty lines but not an empty file; I fixed the parser so that empty files are accepted.

Irgendwer · September 18, 2015

Can you share at least the offending line?

Yes, for a label related problem:

RESTORE #SETTINGS_DATA
READ A

# SETTINGS_DATA
DATA 3.14

results in an error at the 'RESTORE'-line:

expected end of line, got '#SETTINGS_DATA'

Fix welcome - I'll soon publish my scripts for the automation. The chain works quite nice!

Irgendwer · September 18, 2015

Another wish: It seems that the basicParser returns '0' even in case errors occurred, could you please change that?

dmsc · September 19, 2015

Hi!

So, I published a new version 4, download at: https://github.com/dmsc/tbxl-parser/releases

Yes, for a label related problem:
RESTORE #SETTINGS_DATA
READ A

# SETTINGS_DATA
DATA 3.14
results in an error at the 'RESTORE'-line:
expected end of line, got '#SETTINGS_DATA'

Fix welcome - I'll soon publish my scripts for the automation. The chain works quite nice!

Oh, this was an omission, I didn't allow for labels after a "RESTORE" statement. Fixed.

Another wish: It seems that the basicParser returns '0' even in case errors occurred, could you please change that?

Yes, now if any input file (you can pass multiple files in the command line) gives an error, the program returns failure.
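
For automation scripts, the behavior described above amounts to something like the following Python sketch. The names are hypothetical, and whether the real tool keeps processing the remaining files after the first error is an assumption here.

```python
import sys


def process_files(names, parse_one):
    """Parse every file; return 0 only if none of them failed.
    parse_one is any callable that raises ValueError on a parse error."""
    failed = 0
    for name in names:
        try:
            parse_one(name)
        except ValueError as err:
            print(f"{name}: error: {err}", file=sys.stderr)
            failed += 1
    return 1 if failed else 0


# int() stands in for a parser here: it raises ValueError on "x".
print(process_files(["1", "x", "2"], int))
```

A build chain can then stop on a non-zero status instead of silently using a stale output file.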

The backslash char is used to translate hex codes to ATASCII chars, but it is also available on the real Atari keyboard, and therefore in BASIC programs. If two hex digits happen to follow a backslash, they will always be translated, even if that was not intended. Could it be escaped with another "\" or changed to another char that is unavailable on a real Atari, like "~"? Also, an option to convert decimal codes could be useful.

I added this "\\" escape code.
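
The intended behavior can be sketched in Python as below. This is not the parser's implementation, only a small model of the rule discussed in the thread: "\\" gives a single backslash, a backslash followed by two hex digits gives the character with that code, and any other backslash is left alone (that last case is an assumption).

```python
import re

# A backslash followed by either another backslash or two hex digits.
_ESCAPE = re.compile(r"\\(\\|[0-9A-Fa-f]{2})")


def decode_escapes(text):
    """Decode '\\\\' and '\\HH' escapes in a string (illustrative helper)."""
    def replace(match):
        body = match.group(1)
        if body == "\\":
            return "\\"
        return chr(int(body, 16))
    return _ESCAPE.sub(replace, text)


print(decode_escapes(r"HOLA\\FEO"))   # doubled backslash
print(decode_escapes(r"HOLA\5CFEO"))  # same character written as hex 5C
```

Because the pattern is matched left to right, the doubled backslash is consumed first and the following "FE" is no longer read as a hex code.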

Thanks to all testers

Irgendwer · September 19, 2015

So, I published a new version 4, download at: https://github.com/dmsc/tbxl-parser/releases

That's great! Thank you for your quick support. I'll meet today some other A8 enthusiasts here in Berlin - I think the presentation of your tool will be a highlight!

Irgendwer · September 19, 2015

Found another issue:

If a line number isn't available any more, due to progressed code, I get the output:

error: line number 1 repeated

and your tool returns with '0'. (To repeat that behaviour just enter a '0' at line 20 of your sample-3.)

The wrong return value should be easy to fix (please do so), but I ask myself how to fix the reason of the problem:

IMHO line numbers make only sense now, if they form a 'calculated label' (like 'GOTO A*10'). Therefore an inside reassignment of such line numbers to valid ones would not only be a tedious feature to program, but fail in these cases - they have to be static...

Maybe a more meaningful error message would help for the time being - like "line number already in use, next free one is ..."

dmsc · September 19, 2015

Hi!

Found another issue:

If a line number isn't available any more, due to progressed code, I get the output:

error: line number 1 repeated

and your tool returns with '0'. (To repeat that behaviour just enter a '0' at line 20 of your sample-3.)

The wrong return value should be easy to fix (please do so), but I ask myself how to fix the reason of the problem:

IMHO line numbers make only sense now, if they form a 'calculated label' (like 'GOTO A*10'). Therefore an inside reassignment of such line numbers to valid ones would not only be a tedious feature to program, but fail in these cases - they have to be static...

Maybe a more meaningful error message would help for the time being - like "line number already in use, next free one is ..."

So, I made a new release, v4.1 see at https://github.com/dmsc/tbxl-parser/releases/

I used your suggested error message, also added the error to the list output and propagated the errors to the program output.
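
As an illustration of the suggested message, here is a short Python sketch. The exact wording printed by the tool is not quoted in this thread, so the message format and both function names are assumptions.

```python
def next_free_line(used, wanted):
    """Return the first line number >= wanted that is not in use."""
    number = wanted
    while number in used:
        number += 1
    return number


def check_line_number(used, wanted):
    """Return an error message if 'wanted' is taken, else None."""
    if wanted in used:
        free = next_free_line(used, wanted)
        return (f"error: line number {wanted} already in use, "
                f"next free one is {free}")
    return None


print(check_line_number({0, 1, 2}, 1))
```

Reporting the next free number tells the user directly which value to put in the source, without suggesting that the tool will renumber anything by itself.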

Thanks!

Irgendwer · September 19, 2015

(Yes, I'm not only sounding tired - I am! )

Edit: Due to video codec magic it seems that in the 2nd test the program is stopped before I hit the break key. I can assure you that in reality the order was correct...

Edited September 19, 2015 by Irgendwer

vitoco · September 20, 2015

Nice integration!

dmsc · September 21, 2015

Hi!

(Yes, I'm not only sounding tired - I am! )

Edit: Due to video codec magic it seems that in the 2nd test the program is stopped before I hit the break key. I can assure you that in reality the order was correct...

Nice environment. I see you are using sublime text, I'm more of a vim user :-)

I published a new release, v5. You will probably appreciate that I changed all error reporting to use the same format, always including the source file name and file line number; this makes it easier to correlate problem lines with the editor line. Also, I fixed some remaining bugs.

https://github.com/dmsc/tbxl-parser/releases

Since I'm more of a Vim user, attached is my current syntax file, with support for statements, tokens, line numbers, comments and labels.

tbxl.vim.gz

Irgendwer · September 21, 2015

I published a new release, v5. You will probably appreciate that I changed all error reporting to use the same format, always including the source file name and file line number; this makes it easier to correlate problem lines with the editor line. Also, I fixed some remaining bugs.

Thank you! I didn't dare to ask for similar error messages.

Since there were no other suggestions regarding the syntax, could you please support the binary include feature? I'm fine with your proposal.

BTW: Would you mind converting 0-3 automatically to %0-%3 to save memory?
