
Attempting to create a hash-based comprehensive software library


x=usr(1536)


Note: the original title of this thread was, "Attempting to create a (mostly) deduplicated comprehensive software library."  As the focus has moved from deduplication as a primary goal and more towards identification and verification of software, the title has been changed to, "Attempting to create a hash-based comprehensive software library."  This change is to provide a more flexible foundation for archiving, preservation, and collecting efforts.  No posts have been changed, so references to deduplication should be considered anachronistic.

 

This may be biting off more than I can chew, but I'm at least going to give it a shot and see where it ends up.

 

In order to better structure my A8 software, I'll be embarking on a programme of expansion and deduplication of that library.  The idea is that this can be synced to tnfs.online for FujiNet users to access, and I won't need to maintain two separate sets of software.

 

My question is this: apart from Homesoft, the Holmes CDs, and what's on pigwa.net, can anyone recommend other collections to acquire?  I realise that there's quite a bit on archive.org, but the way they have the data structured makes it tedious to try to find.


1 hour ago, _The Doctor__ said:

aol poland

Oooh, good call.  Thanks for the reminder.

1 hour ago, _The Doctor__ said:

but you really will have to try to pick stuff that's deprotected and work on the most devices including fujinet.

That's my goal.  Pretty much 100% of what I run is either via FujiNet or a SIDE3, so compatibility with those devices is absolutely mandatory.


I wish you the best of luck on this. I've tried this route for years, and it's a painstaking nightmare to remove duplicates. I even purchased what I think is the best dupe finder out there (Duplicate File Detective) to help in this cause, and yes, it's great, but the interlinking of dupes is very much a PITA. That said, while I have had the time, my dedication has been lacking. Good luck on this... Can't think of other resources bar the ones that you know already.


3 hours ago, Mclaneinc said:

I wish you the best of luck on this, I've tried this route for years, and it's a painstaking nightmare to remove duplicates,

Tell me about it.  This is part of the reason I say this would be a mostly deduplicated collection: there are going to be cases where things like the same file appearing in multiple different .atr images, for example, will be unavoidable.  That's fine; I just want to get as many of the obvious candidates out of there as possible.

3 hours ago, Mclaneinc said:

I even purchased what I think is the best dupe finder out there (Duplicate File Detective) to help in this cause, and yes it's great but the interlinking of dupes is so much a PITA.

I've been doing this from the *nix side for a while, mainly using fdupes and jdupes along with utilities like find, shasum, diff, etc.  They work - and generally work very well - but the potential size of the A8 collection means that they're not going to be terribly efficient; this has been apparent for some time with the collection I currently have.  To address this, I'm planning on building a filestore specific to the A8 software that uses a deduplicating filesystem (TBD).  This should introduce both process improvements and a degree of sorely-needed automation into the process.
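For anyone curious what the hash-based core of that shasum/fdupes workflow looks like in scriptable form, here's a minimal Python sketch. The function names and the size prefilter policy are my own, not any particular tool's implementation:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large disk images never sit fully in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under `root` by content hash, hashing only files whose
    size matches at least one other file (the same prefilter fdupes uses)."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for p in root.rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size can't have a byte-identical twin
        for p in paths:
            by_hash[sha256_of(p)].append(p)
    return {h: ps for h, ps in by_hash.items() if len(ps) > 1}
```

Everything above is stdlib; the size prefilter is what keeps it tolerable on tens of thousands of files, since most of them never get hashed at all.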

 

This is my first foray into deduplicating filesystems - at least, from the standpoint of building one out from scratch that wasn't part of a vendor's storage silo - so there'll be a learning curve to get through.  This isn't something I expect to happen quickly ;)

2 hours ago, baktra said:

Fandal's collection at: a8.fandal.cz

Thank you!  These things are obvious when someone else mentions them, but trying to remember them when you're in the thick of thinking things through is not exactly easy :-D

1 hour ago, DrVenkman said:

A8 Software Preservation Initiative, plus Farb’s “straight cracks” in the CSS collection.

Added to the list, and thanks!


I admire anyone attempting a fully deduplicated library.

 

I tried doing it off and on throughout the ages and settled on a close-enuff state of affairs.


In the Apple II department I start with FTP ASIMOV and tack on my own personal stuff and material gathered from umpteen billion other sources. Searching and deduping as I go along.

 

In the 2600 department I rely on the descriptive file names instituted by Rom Hunter. It’s great for searching. And since file sizes are small for all classic computing images in relation to today’s multi-terabyte entry-level drives, duplicates in the name of organization mean more to me than forcefully minimizing everything.

 

By that I mean Missile Command may show up in several folders. One based on controller type, another sorted by company, and yet another alphabetically. And even a favorites folder too.


But the contents of those folders ARE definitely de-duped.

 

I run Windows exclusively and happily. Have done so since MS-DOS 5.0 & Windows 3.1.

 

Since the early XP days I’ve come to rely on Easy Duplicate Finder 2.41 and CloneSpy 3.43. Sometimes I dedupe by file name, or contents, or both.

 

I don’t often dig into individual disk images except, like, if I’m making a set of compilation disks. Or maybe working with my journal or my own personally programmed infantile Applesoft BASIC stuff. Stuff that prints swear words and stuff like that.

 

But THE number one tip I have to offer is keep revisions or backups of your deduplication efforts along the way. And all the original files. That means copies of everything in the raw. Each file with timestamps intact. Separate and away from your work-in-progress gig.

 

You just don’t know if or when you’ll want to backtrack a few steps and take a different approach, method, or use different criteria altogether, for a section, or all of it. As your organization and dedupe efforts move along you can delete the older backups as you mint new revisions. There isn’t a set threshold here.

 

I’ve been fucking with this off and on since the 80’s. And it’s a way of life. Constantly evolving. Changing. Being refined. Being compartmentalized. An hour a week here. 20 minutes in-between other activities there.

 

And that’s the number two tip I offer. Work piecemeal and never burn out. You never wanna associate tedium with this otherwise you’ll dread each session. Even if it’s 15-minutes a month.

 

It’s partly the reason why cartridge collecting remained fun (and still is for some folks). We got everything a bit at a time. Time enough for it to meld with the psyche and become nostalgic.


11 hours ago, DrVenkman said:

A8 Software Preservation Initiative, plus Farb’s “straight cracks” in the CSS collection. Both discussed extensively here on the forum in the last few years. 

These are going to be the cleanest copies you'll get anywhere (Atarimania is great too -- although not as clean, because open spots are filled in with non-original files; but no perfectly clean archive exists anyway). The straight cracks done by @DjayBee are good when ATX files can't be used. They're the closest things to the originals (found in the A8 Software Preservation collection) you can get. Most of the other cracks out there usually eliminate loading screens and such; his cracked versions retain all of those assets.

 

Homesoft is another excellent source. What you'll get there are XEX versions of files that are normally only available on ATR, all 5200 -> A8 conversions (including any recent homebrews available for conversion; the site owner does these himself), and versions that are clean from any "hacker" logos and weird changes (textual info hacks, etc). He also has versions modified with nice Graph2Font or RastaConverter images; these are all noted as such in the filenames.

 

[Edit]

AtariOnline.pl has a huge collection too (available in a single zip). They have multiple (sometimes many) versions of the same game in the same format (i.e.: ATR, XEX, etc.). Maybe unwieldy for what you're doing. I usually go there if I can't find a file somewhere else. Maybe the best source for Polish games too.

 

Fandal has a lot of stuff (also available in a single zip). Maybe a source for the most Czech games.

 


@x=usr(1536)  Good luck. Massive undertaking. Great you are ensuring Fujinet and SIDE3 compatibility.

 

Here is another site from Fellow Atarian Gury: 

https://gury.atari8.info/list.php?src=2&ch=a&th=0

 

and of course AtariOnline.pl: 

https://atarionline.pl/

 

Language-hacked versions are also a consideration. I recall when attempting to make up an Atarimax 8Mbit flash cart games compilation which included one of my favourite games, Blinky's Scary School, it took an age to find one that actually ran off the cart. When I finally got the compilation sorted and started to play Blinky - as soon as I got to the potion lists I realised they were all in Polish!!  The game file had been hacked and converted, which is pretty cool - but not great if you can't read Polish heh heh. In the end I managed to find a packed English version which also had cheat option hacks. (Not that I'd ever consider cheating on a game like Blinky! :) )


10 hours ago, Keatah said:

Can you elaborate further on what means interlinking of dupes?

@Keatah, yeah sure. In my case I was also archiving websites that had been captured on the Pool disks, etc. I soon discovered that you had to be selective about which folders you included in the scan, because if you accidentally included folders of web sites and then scanned for byte-identical copies, it would find thousands upon thousands, and going through the results one by one was time-consuming. So clever me did a mass delete of the dupes (I had backups), only to find that it was taking dupes from the website links, so breaking the webpages because the files they linked to were gone..

 

There are other awkward things too, i.e. being careful with collections from other folks if you want to keep them separate.


4 hours ago, Beeblebrox said:

Good luck. Massive undertaking.

Thank you ;-) The couple of days I've now spent looking into the details have been a constant process of re-evaluating assumptions as to how to address the problem.  This wasn't expected to be easy by any means, but there are some really interesting problems to solve.  My big three:

 

The first one is just how far to go with deduplication.  Initially, this was taken to mean the removal of duplicate files (including media images) existing within a filesystem.  That's still a part of it, but how far does one go with that?

 

As an example: we have multiple recognised software collections.  Homesoft, Fandal, atarionline.pl, APX, etc.  Theoretically, all of those could be merged and file-level tools turned loose on them to root out and deal with duplicates.  But then you're breaking the integrity of those collections, and the actual gain is minimal as a result: each one ends up fractured as duplicate content is removed and has to be amalgamated with the others, thus creating Yet Another Collection™.

 

That leads on to the question of determining the criteria by which software may or may not be a candidate for inclusion and/or exclusion.  Does a commercial release have priority over its executable appearing on a menu disk?  Is one source considered to be more or less authoritative than another?  What if a release took place on a combination of disk, cassette, and cartridge media?  Or from multiple publishers?

 

Granted, those are not new questions, and preservationists have been wrangling them for decades.  But they still apply, and in many cases there are no good answers for them.  That's not to say that they can't be handled with considered judgement calls, but rather that those calls will always be both objective and subjective at the same time: the goal may be to get down to as slim a collection of software as possible, but that's always going to be based on someone's qualitative assessment as to whether or not something is worthy of inclusion.

 

Finally, in the words of Rodney Dangerfield: "Who made you Pope of this dump?"

 

Basically, this is the question of what gives someone the right to decide that one copy in an existing collection is more or less important than one in another collection.  If I'm creating my own collection for my own use, it really doesn't matter and I can play God to my heart's content.  But for one that's potentially public, the idea of taking someone else's work, altering it by addition or subtraction, and packaging it as somehow better seems like a slap in the face of the people who built and published their collections.  Granted, we all get the same software from somewhere, but there is effort that goes into compiling, indexing, and presenting it, and it's that effort that may be disrespected even if only inadvertently.

 

Note that I'm not really losing sleep over these things - but they are considerations, and there are a number of different angles attached to them that have to be taken into account before even deciding where to start.  The relevance of those considerations may even change as part of that process.  However, taking care of as many of them (and others) as possible before diving into all of this is definitely preferable to going in blind and finding that they may be hindrances later.


47 minutes ago, Irgendwer said:

There is also the "vjetnam" collection here: http://atari.vjetnam.cz/

Thanks for this link - not seen this one before.

 

Heh heh - I have a fairly normal high-def resolution set on my laptop (1920×1080), and even with the browser running at 100% zoom, this is a very interesting lack of use of the available screen space and UI.

 

I'll get out my magnifying glass!!! :lol:

 

[attached screenshot: atari.vjetnam.cz as rendered at 100% zoom on a 1920×1080 display]

 

For comparison, here is what AA looks like on my lappy, just in case anyone thinks I have a weird screen setting ;)

 

[attached screenshot: AtariAge on the same laptop, for comparison]


When it comes to finding duplicate files, my favourite tool is "All Dup" (Windows, portable edition, because I don't like to install programs unless forced to).
You can download it here: https://www.alldup.info/ (search "portable")
It can look for duplicate files based on numerous criteria. My usual one is "file content" (a hash of the actual content).
You can add inclusive or exclusive filters to select files to compare (ex: *.atr|*.xex|*.bin|*.car etc...)
And when you get the list of duplicates, you can define plenty of possible actions (ex: select all duplicate files found on E:\ and leave the ones on D:\).
I suggest you give it a chance.
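That "leave D:\, select E:\" style of action boils down to ranking each duplicate group against an ordered list of preferred locations. Here is a rough Python sketch of the idea - my own naming throughout, and not a claim about how All Dup actually implements it:

```python
from pathlib import Path

def choose_keepers(dupe_groups: list[list[Path]],
                   preferred_roots: list[Path]) -> tuple[list[Path], list[Path]]:
    """For each group of byte-identical files, keep the copy under the
    earliest-listed preferred root and flag every other copy for removal."""
    def rank(p: Path) -> int:
        for i, root in enumerate(preferred_roots):
            if p.is_relative_to(root):  # Python 3.9+
                return i
        return len(preferred_roots)  # not under any preferred root: last resort

    keep, remove = [], []
    for group in dupe_groups:
        ordered = sorted(group, key=lambda p: (rank(p), str(p)))
        keep.append(ordered[0])
        remove.extend(ordered[1:])
    return keep, remove
```

The string tiebreak makes the choice deterministic when two copies live under the same root, which matters if you ever want to re-run the policy and get identical results.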


As for the sources of Software,

 

I visited these ones for .ATR:

Source = bit torrent download
Source = ftp spudster org
Source = ftp.atari.art.pl
Source = www.atarionline.pl
Source = www.atariworld.com
Source = www.langesite.com

 

I visited these ones for .XEX:

Source = www.atarionline.pl
Source = bit torrent download


2 hours ago, x=usr(1536) said:

Granted, those are not new questions, and preservationists have been wrangling them for decades.  But they still apply, and in many cases there are no good answers for them.  That's not to say that they can't be handled with considered judgement calls, but rather that those calls will always be both objective and subjective at the same time: the goal may be to get down to as slim a collection of software as possible, but that's always going to be based on someone's qualitative assessment as to whether or not something is worthy of inclusion.

 

Finally, in the words of Rodney Dangerfield: "Who made you Pope of this dump?"

 

Basically, this is the question of what gives someone the right to decide that one copy in an existing collection is more or less important than one in another collection.  If I'm creating my own collection for my own use, it really doesn't matter and I can play God to my heart's content.  But for one that's potentially public, the idea of taking someone else's work, altering it by addition or subtraction, and packaging it as somehow better seems like a slap in the face of the people who built and published their collections.  Granted, we all get the same software from somewhere, but there is effort that goes into compiling, indexing, and presenting it, and it's that effort that may be disrespected even if only inadvertently.

 

 

Probably not for your purposes, as you seem to be up to the challenge and are (sort of?) looking forward to the work, but there is so much interest here that I wonder if a solution couldn't be crowd-sourced to cut down the work - for all of us who have tried this before.  A group would probably have to agree on some principles (like non-copy-protected originals > straight cracks > existing cracks with intros, etc...) to get it started, though...


I am wondering if any of those dupe tools can perform a fuzzy compare, i.e. showing how likely (as a percentage) it is that two files are duplicates.

 

I would only suggest answering a simple question. How does my archive differ from other archives? Isn't it better to contribute to the existing ones?

 

Once I was considering an archive of monolithic .xex files (for whatever dubious reasons). Then I realized I'd better create a tool that harvests these files from existing archives.


2 hours ago, ldelsarte said:

When it comes to finding duplicate files, my favourite tool is "All Dup" (Windows, portable edition, because I don't like to install programs unless forced to)
You can download it here: https://www.alldup.info/ (search "portable")

Yep, familiar with that particular one ;-)  Appreciate the recommendation.  However, one thing I neglected to mention: I'm a Mac user for desktop stuff, with Linux and FreeBSD comprising the backend.  If it looks like it'll do something that other tools can't, though, I'll certainly give it a shot from a VM.

2 hours ago, ldelsarte said:

It can look out for duplicate files based on numerous criteria. My usual one is "file content" (hash of actual content).
You can add filters inclusive or exclusive to select files to compare (ex: *.atr|*.xex|*.bin|*.car etc...)
And when you get the list of duplicates, you can define plenty of possible actions (ex: select all duplicate files found on E:\ and leave the ones on D:\)
I suggest you give it a chance.

Sounds like it may have some potential.  Thank you!

2 hours ago, Atari8guy said:

Probably not for your purposes, as you seem to be up to the challenge and are (sort of?) looking forward to the work, but there is so much interest here that I wonder if a solution couldn't be crowd-sourced to cut down the work - for all of us who have tried this before.  A group would probably have to agree on some principles (like non-copy-protected originals > straight cracks > existing cracks with intros, etc...) to get it started, though...

I hear you on that, and certainly have no opposition to the idea of crowdsourcing the work.  But you do bring up a good point regarding it, namely agreeing on the value hierarchy of different copies of the same thing.  There's also the question of access to the data for the crowd: where to house it as it's being worked on as well as where to put it once it's been processed are definite considerations.  This would mean something cloud-based, ultimately.  That's not necessarily a bad thing, but it is one that would need to be addressed before setting the crowd loose on it.

36 minutes ago, baktra said:

I am wondering in any of those dup tools can perform a fuzzy compare, i.e. showing how much likely in % two files are duplicates.

There are a number of tools out there that can handle doing that.  However, it's computationally-intensive, and that means time-intensive as well when spread out over tens of thousands of files.  Granted, there's no deadline to be doing this by, but it is a consideration nonetheless.
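To make the cost concrete: Python's standard difflib produces exactly this kind of percentage, and it also illustrates why the approach is slow - the full comparison is roughly quadratic per file pair. A small sketch (note that autojunk must be disabled, or the default heuristic silently treats common byte values in binary data as junk):

```python
from difflib import SequenceMatcher

def similarity(a: bytes, b: bytes) -> float:
    """Fraction of matching content between two byte strings, 0.0 to 1.0."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def probably_similar(a: bytes, b: bytes, threshold: float = 0.9) -> bool:
    """Run the cheap upper-bound estimate first; only pay for the expensive
    full ratio() when quick_ratio() clears the threshold."""
    m = SequenceMatcher(None, a, b, autojunk=False)
    return m.quick_ratio() >= threshold and m.ratio() >= threshold
```

quick_ratio() is documented as an upper bound on ratio(), which is what makes it safe to use as a prefilter when sweeping a large collection pairwise.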

36 minutes ago, baktra said:

I would only suggest answering a simple question. How does my archive differ from other archives? Isn't it better to contribute to the existing ones?

The software collection I'm using right now is derived from other archives - so, ultimately, there is no difference except perhaps in the structure of how the data is stored.  Sure, there'll be a few standalones in there, but I'm not generating new content for it myself.  There wouldn't really be a contribution from my end to any other archive.

36 minutes ago, baktra said:

Once I was considering an archive of monolothic .xex files (for whatever dubious reasons). Then I realized I'd better create a tool that harvests these files from existing archives.

This is one of the realisations that I've come to.

 

It's a tricky one to address, too, because without being able to mount an .atr the same way as any other block device, direct visibility into that image's contents at an OS level is difficult to come by.  That, in turn, makes comparing files inside of disk images and figuring out what to do with them rather complicated (and slow).

 

Ideally, disk, cassette, and ROM images could be mounted as block devices, the files contained within them hashed and the hashes stored in a database, a duplicate map generated from there, and deduplication undertaken based on that duplicate map.  This would take care of a large chunk of the work up front, and executables could be handled in a very similar way.
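Short of true block-device mounting, a first approximation is to hash each image's sector payload while skipping the 16-byte ATR header (which begins with the magic word $0296 but otherwise carries only metadata), then build the duplicate map in SQLite. A sketch under those assumptions - it catches identical disks repackaged with differing header metadata, while hashing the individual files inside an image would require parsing the Atari DOS filesystem, the harder step not shown here:

```python
import hashlib
import sqlite3
import struct
from pathlib import Path

ATR_MAGIC = 0x0296  # little-endian word at offset 0 of every ATR header

def atr_payload_digest(path: Path) -> str:
    """SHA-256 of the sector data only. Two images that differ solely in
    header metadata hash identically."""
    data = path.read_bytes()
    (magic,) = struct.unpack_from("<H", data, 0)
    if magic != ATR_MAGIC:
        raise ValueError(f"{path} does not look like an ATR image")
    return hashlib.sha256(data[16:]).hexdigest()

def build_duplicate_map(paths, db=":memory:"):
    """Store (hash, path) rows, then return only hashes shared by 2+ images."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS images (sha256 TEXT, path TEXT)")
    con.executemany("INSERT INTO images VALUES (?, ?)",
                    [(atr_payload_digest(p), str(p)) for p in paths])
    rows = con.execute(
        "SELECT sha256, GROUP_CONCAT(path) FROM images "
        "GROUP BY sha256 HAVING COUNT(*) > 1").fetchall()
    con.close()
    return {h: sorted(joined.split(",")) for h, joined in rows}
```

Persisting the table to a real file instead of ":memory:" gives you the stored hash database mentioned above, so incremental re-scans only need to hash new or changed images.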

 

In some ways, this is a Big Data problem, only much, much smaller.  Many of the requirements to accomplish the sort of filtering being proposed are things that Big Data crunches on daily - it just doesn't typically happen at this scale.

 

Anyway, none of this is to say that it's an impossible task - but it is one that is going to require the right toolset to be in place from the start.  No plan survives contact with the enemy, so that toolset would be expected to evolve over time, but there is an as-ready-as-possible middle ground out there.  It just needs to be figured out ;-)


My previous post was mainly in regards to games, although AtariOnline has a database of non-game programs.

 

So... AtariWiki.org has a large number of non-game programs; and they're active in weeding out long-missing titles. They don't have an archive that I'm aware of; but you may be able to coax @luckybuck (Roland Wassenberg) and/or @cas (Carsten Strotmann) to provide you with something.

 

The other thing I had in mind was the Pool Disks. These may be available on Pigwa; so, you might already be covered there. To my recollection, these contain a lot of disk mags/user's group disks.

 


The PoolDisks are both available at pigwa:

http://ftp.pigwa.net/stuff/collections/PoolDisk One/

http://ftp.pigwa.net/stuff/collections/PoolDisk Too/

 

and there is also the Holmes collection and many many others:

http://ftp.pigwa.net/stuff/collections/holmes cd/

http://ftp.pigwa.net/stuff/collections/RK-DVD/

http://ftp.pigwa.net/stuff/collections/RK_CD/

http://ftp.pigwa.net/stuff/collections/SIO2SD_DVD/

http://ftp.pigwa.net/stuff/collections/nir_dary_cds/

 

(RK is an abbreviation for the German word "Raubkopie" = pirate copy, thus RK-DVD = pirate copy DVD and RK-CD = pirate copy CD)

 

