
tip: converting "De Re Intellivision" to UTF-8


ppelleti


The "De Re Intellivision" documents use an old MS-DOS character set (code page 437, as far as I can tell) and don't display well on modern systems.

 

Here is a command that will convert a "De Re Intellivision" file to UTF-8 encoding, which should be viewable in most modern programs:

 

iconv -f cp437 -t utf-8 dri_2.txt | tr '\020\021\036\037' '><^v' > dri_2-utf8.txt
(This should work on Linux and Mac OS X, and probably most UNIX systems. May also work on Windows if you have Cygwin, MSYS, or WSL installed.)
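If you want to convert several files in one go, the same pipeline can be wrapped in a small shell function. This is only a sketch: dri_to_utf8 is a made-up helper name (not something shipped with jzIntv), and it assumes the input file names end in .txt.

```shell
# Convert each named CP437 file to UTF-8, writing NAME-utf8.txt next to it.
# dri_to_utf8 is a hypothetical helper name used only for this example.
dri_to_utf8() {
    for src in "$@"; do
        # Strip the .txt suffix and append -utf8.txt for the output name.
        dst="${src%.txt}-utf8.txt"
        # iconv converts the bytes 0x80-0xFF; tr then replaces the four
        # CP437 arrow heads (0x10, 0x11, 0x1E, 0x1F) with ASCII look-alikes.
        iconv -f cp437 -t utf-8 "$src" | tr '\020\021\036\037' '><^v' > "$dst"
    done
}

# Usage: dri_to_utf8 dri_?.txt
```

The usage line uses dri_?.txt rather than dri_*.txt on purpose: the converted files also match dri_*.txt, so a wider pattern would pick them up again on a second run.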

  • 3 months later...
  • 2 years later...

While "iconv -f CP437 -t UTF-8" worked for most things, it seems to have missed a few characters.  At least, that is what I see under Cygwin on Windows; I can't speak for other platforms.

 

Below is a sed script (dri.sed) that replaces the special characters in the dri_*.txt files with their HTML character entities.  With this, you can wrap a whole dri_*.txt file inside an HTML <pre> tag and view it in your browser using, for example:

 

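# Single quotes keep an interactive bash from treating "!" as history expansion.
echo '<!DOCTYPE html>'     > dri_1.htm
echo '<html lang="en">'    >> dri_1.htm
echo '<head><title>dri_1</title></head><body><pre>' >> dri_1.htm
# LC_ALL=C makes sed treat the input as raw bytes.
LC_ALL=C sed -f dri.sed dri_1.txt >> dri_1.htm
echo '</pre></body></html>' >> dri_1.htm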
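The same steps can also be written so that the page goes to standard output and is redirected once, instead of being appended line by line. A minimal sketch, assuming a POSIX shell; dri_to_html is a hypothetical helper name, and the sed script path is passed as the first argument:

```shell
# Wrap one dri_*.txt file in a minimal HTML page on standard output.
# dri_to_html is a hypothetical helper: $1 is the sed script, $2 the text file.
dri_to_html() {
    script=$1
    src=$2
    # Use the file name without its .txt suffix as the page title.
    title=$(basename "$src" .txt)
    echo '<!DOCTYPE html>'
    echo '<html lang="en">'
    echo "<head><title>$title</title></head><body><pre>"
    # LC_ALL=C makes sed treat the input as raw bytes, not multibyte text.
    LC_ALL=C sed -f "$script" "$src"
    echo '</pre></body></html>'
}

# Usage: dri_to_html dri.sed dri_1.txt > dri_1.htm
```

Because the function only writes to standard output, the caller chooses the output file, and a loop over several dri_*.txt files needs just one redirection per file.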

 

----- Here is the dri.sed file -----

 

# Sed script to translate IBM Extended ascii characters in
# dri_*.txt files to HTML entities.
#
# The dri_*.txt encodings are *mostly* code page 437.  However, with cygwin
# on Windows, "iconv -f CP437 -t UTF-8 dri*.txt" seems to ignore the
# following input characters, leaving them (incorrectly) unmodified:
#
# 0x10 -- right arrow head (0x10 is ctrl-P)
# 0x11 -- left  arrow head (0x11 is ctrl-Q)
# 0x1e -- up    arrow head (0x1e is ctrl-^)
# 0x1f -- down  arrow head (0x1f is ctrl-underscore)
#
# And iconv treats the following as CP437 symbols, while dri_1.txt
# seems to use them as the slanted accent symbol above an 'e' or 'a'.
# 0xe1 -- &aacute;
# 0xe9 -- &eacute;
#
# Strip the CR from CR/LF line endings.
s/[\x0d]//g
# Strip trailing white space.
s/  *$//
# '&', '<' and '>' need to be escaped in HTML.
s/&/\&amp;/g
s/</\&lt;/g
s/>/\&gt;/g
s/[\x10]/\&#9658;/g
s/[\x11]/\&#9668;/g
s/[\x16]/\&#9644;/g
s/[\x1e]/\&#9650;/g
s/[\x1f]/\&#9660;/g
s/[\xb3]/\&#9474;/g
s/[\xb4]/\&#9508;/g
s/[\xb6]/\&#9570;/g
s/[\xb9]/\&#9571;/g
s/[\xba]/\&#9553;/g
s/[\xbb]/\&#9559;/g
s/[\xbc]/\&#9565;/g
s/[\xbd]/\&#9564;/g
s/[\xbf]/\&#9488;/g
s/[\xc0]/\&#9492;/g
s/[\xc1]/\&#9524;/g
s/[\xc2]/\&#9516;/g
s/[\xc3]/\&#9500;/g
s/[\xc4]/\&#9472;/g
s/[\xc5]/\&#9532;/g
s/[\xc6]/\&#9566;/g
s/[\xc8]/\&#9562;/g
s/[\xc9]/\&#9556;/g
s/[\xca]/\&#9577;/g
s/[\xcc]/\&#9568;/g
s/[\xcd]/\&#9552;/g
s/[\xce]/\&#9580;/g
s/[\xcf]/\&#9575;/g
s/[\xd0]/\&#9576;/g
s/[\xd2]/\&#9573;/g
s/[\xd7]/\&#9579;/g
s/[\xda]/\&#9484;/g
s/[\xdb]/\&#9608;/g
s/[\xd8]/\&#9578;/g
s/[\xd9]/\&#9496;/g
s/[\xe1]/\&aacute;/g
s/[\xe9]/\&eacute;/g
s/[\xf0]/\&#8801;/g
s/[\xf8]/\&#248;/g
#
# The replacements below fix a handful of spelling errors/typos.
#
s/\<accesories\>/accessories/g
s/\<Accomodates\>/Accommodates/g
# When fixing "acknnowledge", we need to add a space near the end
# of the line to keep the enclosing diagram character well aligned.
s/\<acknnowledge\.  Followed by IAD\. /acknowledge.  Followed by IAD.  /g
s/\<adderess\>/address/g
s/\<addess\>/address/g
s/\<advertizing\>/advertising/g
s/\<abreviation\>/abbreviation/g
s/\<appropirate\>/appropriate/g
s/\<Avalable\>/Available/g
s/\<B-17 Bomer\>/B-17 Bomber/g
s/\<B- 17 Bomber\>/B-17 Bomber/g
s/\<begining\>/beginning/g
s/\<best-remembered Intellivision game\>/best-remembered Intellivision games/g
s/\<bidrectional\>/bidirectional/g
s/\<cartidges\>/cartridges/g
s/\<casettes\>/cassettes/g
s/\<charater\>/character/g
s/\<Christmass\>/Christmas/g
s/\<Comission\>/Commission/g
s/\<Commision\>/Commission/g
s/\<componenet\>/component/g
s/\<componenets\>/components/g
s/\<Componenet\>/Component/g
s/\<Compnents\>/Components/g
s/\<conditons\>/conditions/g
s/\<connecter\>/connector/g
s/\<conprises\>/comprises/g
s/\<conputer\>/computer/g
s/\<consistant\>/consistent/g
s/\<contolled\>/controlled/g
s/\<criple\>/cripple/g
s/\<deliberite\>/deliberate/g
s/\<eariler\>/earlier/g
s/\<eather\>/either/g
s/\<effecitve\>/effective/g
s/\<empolyees\>/employees/g
s/\<ENVIROMENT\>/ENVIRONMENT/g
s/\<equiped\>/equipped/g
s/\<everytime\>/every time/g
s/\<exsisting\>/existing/g
s/\<exstensively\>/extensively/g
s/\<extrodinary\>/extraordinary/g
s/\<facilties\>/facilities/g
s/\<follwoing\>/following/g
s/\<frequencey\>/frequency/g
s/\<Ginini\>/Gimini/g
s/\<hoplessly\>/hopelessly/g
s/\<hrizon\>/horizon/g
s/\<in conjection sith\>/in conjunction with/g
s/\<Intellivison\>/Intellivision/g
s/\<Intellivsion\>/Intellivision/g
s/\<irrelevent\>/irrelevant/g
s/\<medievel\>/medieval/g
s/\<modifyed\>/modified/g
s/\<moring\>/morning/g
s/\<muliplexed\>/multiplexed/g
s/\<playtesting\>/play testing/g
s/\<possiblity\>/possibility/g
s/\<Preceeds\>/Precedes/g
s/\<PRESETNT\>/PRESENT/g
s/\<programable\>/programmable/g
s/\<programmed Jay\>/programmed by Jay/g
s/\<refered\>/referred/g
s/\<reliablility\>/reliability/g
s/\<reluctatly\>/reluctantly/g
s/\<remander\>/remainder/g
s/\<Richocheting\>/Ricocheting/g
s/\<selcted\>/selected/g
s/\<sheilding\>/shielding/g
s/\<SIGNFIY\>/SIGNIFY/g
s/\<souces\>/sources/g
s/\<sucessful\>/successful/g
s/\<sucessfully\>/successfully/g
s/\<successfuly\>/successfully/g
s/\<synchroniztation\>/synchronization/g
s/\<synchronizzation\>/synchronization/g
s/\<thsi\>/this/g
s/\<Tempsest\>/Tempest/g
s/\<to rights to\>/the rights to/g
s/\<the intellivision\>/the Intellivision/g
s/\<visability\>/visibility/g
s/\<volitile\>/volatile/g
 

 


1 hour ago, Peripheral said:

While "iconv -f CP437 -t UTF-8" worked for most things, it seems to have missed a few characters.  Well, at least under cygwin on Windows, can't speak for other platforms.

My experience, on Mac OS X and Linux, with the four dri_n.txt files that come with jzIntv, is that iconv successfully converts all of the CP437 characters that are above the ASCII range (i.e. those with the high bit set).

 

In my experience, iconv does not convert the characters below the printable ASCII range (i.e. control characters), even though those are printable characters in CP437.  This is why I pipe the output of iconv into tr:

 

iconv -f cp437 -t utf-8 dri_6.txt | tr '\020\021\036\037\026' '><^v-' > dri_6-utf8.txt

 

The five characters in question are four arrow heads, plus "black rectangle".  I replace these with ASCII characters, rather than Unicode characters, because the suggested Unicode replacements (at least the ones given on Wikipedia) are double width characters (►, ◄, ▲, ▼, and ▬), which would mess up the formatting.  So, I used ASCII characters, since those are single-width.
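To check which of these five bytes a particular file actually contains, something like the following should do. It is a sketch only: count_cp437_controls is a hypothetical helper name, and it is meant to be run on the original CP437 file, not the converted one.

```shell
# Print how often each of the five CP437 control-range glyphs occurs in a file.
# count_cp437_controls is a hypothetical helper name used only for this example.
count_cp437_controls() {
    for octal in 020 021 036 037 026; do
        # tr -cd deletes every byte except the one named; wc -c counts the rest.
        # LC_ALL=C keeps tr working on raw bytes regardless of the locale.
        n=$(( $(LC_ALL=C tr -cd "\\$octal" < "$1" | wc -c) ))
        # One line per byte: its octal code (as used with tr) and its count.
        printf '%s %d\n' "$octal" "$n"
    done
}

# Usage: count_cp437_controls dri_1.txt
```

The octal codes are printed in the same form the tr command uses, so a byte with a count of zero can simply be left out of the tr arguments for that file.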

 

1 hour ago, Peripheral said:

Below is a sed script (dri.sed) that replaces the special characters in the dri_*.txt files with their HTML character entities.  With this, you can wrap a whole dri_*.txt file inside an HTML <pre> tag and view it in your browser using, for example:

That's cool!  Personally, I prefer having a text file I can view in Emacs, or with "less" on the command line.  But I can see how HTML would be a useful format for many people.

Edited by ppelleti
pluralize "file"

I just realized that the sed translation

 

s/[\xf8]/\&#248;/g

 

in my earlier post's dri.sed script is actually a no-op, as 248 decimal is 0xf8 hex.

 

This is probably OK.  But a perhaps more appropriate translation would be

 

s/[\xf8]/\&oslash;/g

 

This gives an 'o' with a slash through it, representing the funky 'o' in Broderbund in dri_1.txt.
