
Attempting to create a hash-based comprehensive software library


x=usr(1536)


14 hours ago, gnusto said:

You should be aware that @Farb is already hard at work on an exportable database of everything in the A8P collection, in rom manager .dat format. DAT format is quite simple to parse and while basic in feature set, provides everything required to then feed into something much more complex like a relational DB, but still has user tools that are free and super easy to use. I would highly recommend that as a source of truth, when it is finished. Not only does it already have hashes of the original software as it appeared on shipping disks, Farb already has duplicate associations identified.

Interesting, and thanks for that - I was totally unaware that something along those lines was being worked on.
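For anyone wanting to build on such a database, the Logiqx-style XML .dat format really is simple to parse. Here's a minimal sketch in Python; the game/rom entries are hypothetical and purely for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal Logiqx-style XML .dat fragment; the entries here are made up.
SAMPLE_DAT = """<?xml version="1.0"?>
<datafile>
  <game name="M.U.L.E.">
    <description>M.U.L.E. (USA)</description>
    <rom name="mule.atr" size="92176" crc="deadbeef"
         sha1="0123456789abcdef0123456789abcdef01234567"/>
  </game>
</datafile>"""

def parse_dat(xml_text):
    """Index a DAT by hash: sha1 -> (game name, rom name, size)."""
    root = ET.fromstring(xml_text)
    index = {}
    for game in root.iter("game"):
        for rom in game.iter("rom"):
            index[rom.get("sha1")] = (game.get("name"), rom.get("name"),
                                      int(rom.get("size")))
    return index

index = parse_dat(SAMPLE_DAT)
print(index["0123456789abcdef0123456789abcdef01234567"])
```

A hash-to-title index like this is all that's needed to feed a relational DB later.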

14 hours ago, gnusto said:

I would strongly recommend against using something like the Homesoft collection as a source of truth. To be clear it's a great collection and my absolute go-to for emulation needs. But he has revised disks in place before and presumably will again, as he makes further or better modifications to the software contained on them, meaning there could be multiple versions of a given disk and different hashes.

All of which is true.  Bear in mind that what I'm using Homesoft for is strictly an example; it's not meant to be representative of any sort of final configuration and/or workflow.  However, once things are a bit more solidified, there will be a methodology for tracking changes within collections.

14 hours ago, gnusto said:

And there is no clear attribution of his original source before he works his (super valuable!) magic and makes it able to run on almost anything and without protection. So something like Homesoft should be a leaf, not the root of your sourcing.

True, but bear in mind that the only relationship being established is that the file in question came from a particular disk.  No attribution is made beyond that: all we're saying is that it came from disk no. whatever in a particular collection.  The point I would ultimately like all of this to reach, however, is one where we can get percentage matches based on file content against multiple variants of a file (or disk image, for that matter).  From there we can start establishing reasonably solid root relationships.
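As a rough sketch of what a content-based percentage match could look like (the 128-byte block granularity is my assumption, chosen to match a single-density sector):

```python
import hashlib

BLOCK = 128  # Atari single-density sector size; block granularity is an assumption

def block_hashes(data: bytes):
    """Hash each fixed-size block of an image."""
    return [hashlib.sha1(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def similarity(a: bytes, b: bytes) -> float:
    """Fraction of aligned blocks that match between two images."""
    ha, hb = block_hashes(a), block_hashes(b)
    n = max(len(ha), len(hb))
    matches = sum(1 for x, y in zip(ha, hb) if x == y)
    return matches / n if n else 1.0

original = bytes(1024)                 # stand-in for a known-good image
patched = bytearray(original)
patched[0:4] = b"XXXX"                 # simulate a patched boot sector
print(similarity(original, bytes(patched)))   # 7 of 8 blocks match: 0.875
```

Two variants that share most of their data would score high against the same root, which is the kind of signal needed to establish those root relationships.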

14 hours ago, gnusto said:

The root should be the most definitive and accurate representation as it originally existed, which is likely A8P.

True, and I think we're in agreement on that.  More:

14 hours ago, gnusto said:

So for instance you have M.U.L.E. from A8P as your root, with a first level of the tree being all known published variants on original disk, then leaves are CSS crack/homesoft associated disk/fandal etc.

Which is essentially the approach I'm aiming for, except that it's data-centric as opposed to collection- or source-centric.  Collections and sources are more or less metadata in this instance, with each fingerprinted (for want of a better term) file being treated as a unique object related to others that may share data similarity.

14 hours ago, gnusto said:

The user experience in an ideal case is they have a front end manager, look up M.U.L.E., then have a choice of different versions to run but default to the best user experience (which is likely to be something like Homesoft).

That is definitely a use case that I've considered, but I remain to be convinced that it's the path this should ultimately go down.  Earlier on in all of this, I realised that there was potential for this to turn into another collection, which I don't feel is something we really need: the existing collection maintainers are doing a great job of keeping theirs going.  The value in doing all of this lies in identification, which in turn should help anyone maintaining a collection (at home or for public consumption) structure it as they see fit.


24 minutes ago, x=usr(1536) said:
15 hours ago, gnusto said:

You should be aware that @Farb is already hard at work on an exportable database of everything in the A8P collection, in rom manager .dat format. DAT format is quite simple to parse and while basic in feature set, provides everything required to then feed into something much more complex like a relational DB, but still has user tools that are free and super easy to use. I would highly recommend that as a source of truth, when it is finished. Not only does it already have hashes of the original software as it appeared on shipping disks, Farb already has duplicate associations identified.

Interesting, and thanks for that - I was totally unaware that something along those lines was being worked on.

It's not that easy.

The hashes in the database are those of our exact "master" dump. If that is only an ATR, then the hash will be usable.

 

But our preferred format is ATX, and each flux dump (even of the same disk) leads to a slightly different file and thus to a different hash value. The reason is that ATX stores the exact timing of each sector, which varies due to jitter.

To circumvent this problem and be able to really compare different dumps, Farb has written a specific tool which dissects the ATXs and compares their contents.

If we find dumps to be identical, then the "master" dump becomes the dump for both (or all) of the identical releases.
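The idea behind such a comparison can be sketched in a few lines. This is not Farb's actual tool, and the sector records below are a made-up, simplified layout purely for illustration (real ATX parsing is considerably more involved); the point is only that the timing field is excluded from the hash:

```python
import hashlib

def content_hash(sectors):
    """Hash track/sector numbers and payloads, ignoring per-sector timing."""
    h = hashlib.sha1()
    for track, sector, _timing, payload in sorted(sectors,
                                                  key=lambda s: (s[0], s[1])):
        h.update(bytes([track, sector]))
        h.update(payload)
    return h.hexdigest()

# Two flux dumps of the same disk: identical payloads, different timing values.
dump_a = [(0, 1, 1042, b"BOOT"), (0, 2, 987, b"DATA")]
dump_b = [(0, 2, 1011, b"DATA"), (0, 1, 998, b"BOOT")]
print(content_hash(dump_a) == content_hash(dump_b))  # True
```

Because the jittery timing values never enter the hash, two dumps of the same disk compare equal even though their raw ATX files would hash differently.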


11 hours ago, x=usr(1536) said:

Earlier on in all of this, I realised that there was potential for this to turn into another collection, which I don't feel is something we really need: the existing collection maintainers are doing a great job of keeping theirs going.  The value in doing all of this lies in identification, which in turn should help anyone maintaining a collection (at home or for public consumption) structure it as they see fit.

 

Ok, I see your reasoning on that now. The data has value in itself; once you establish the relationship between various versions as a collective "thing" (like the M.U.L.E. example), it's easy for scripts or the DB itself to parse the data and produce front-end output, which can be used against a collection.


10 hours ago, DjayBee said:

But our preferred format is ATX, and each flux dump (even of the same disk) leads to a slightly different file and thus to a different hash value. The reason is that ATX stores the exact timing of each sector, which varies due to jitter.

To circumvent this problem and be able to really compare different dumps, Farb has written a specific tool which dissects the ATXs and compares their contents.

 

Understood; the masters are the ones I would treat as "definitive". Absent a master, a reference taken from an original disk (analog read drift aside) is still a better determination than some altered copy of unknown origin.


On 1/13/2023 at 1:30 AM, gnusto said:

The user experience in an ideal case is they have a front end manager, look up M.U.L.E., then have a choice of different versions to run but default to the best user experience (which is likely to be something like Homesoft).

Bear in mind that there are also "single-disk" and "cartridge" conversions of multi-side disk releases which remove the disk-swap messages.


1 hour ago, Keatah said:

How about handling disks that write hi-scores and level/position saves?

Ideally, they'd be catalogued from original media.  However:

1 hour ago, Keatah said:

Or is that a total non-issue.

It's a really good question, is what it is :D

 

Quite honestly, I don't have a good answer for that one right now.  It has crossed my mind, and about the best idea I have is to give a little more weight in this scenario to data that can be matched in blocks.  This would at least allow for establishing a relationship to a known source, and noting a reason why the image/file is different - 'this game saves high scores', 'it's a database', 'Home Filing Manager is the greatest program ever', etc.
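In rough terms, the block-matching idea might look something like this (the block size and names here are assumptions for illustration, not a design decision):

```python
def changed_blocks(reference: bytes, candidate: bytes, block=128):
    """Return indices of fixed-size blocks that differ between two images."""
    n = max(len(reference), len(candidate))
    return [i // block for i in range(0, n, block)
            if reference[i:i + block] != candidate[i:i + block]]

master = bytes(512)                  # stand-in for a pristine image
played = bytearray(master)
played[256:260] = b"HISC"            # simulate a saved high-score table
print(changed_blocks(master, bytes(played)))  # [2]
# A catalogue entry could then record "matches master except block 2
# (high-score area)" as the annotated reason the hashes differ.
```

Localising the difference to a few blocks makes it possible to both keep the relationship to the known source and attach the reason for the mismatch.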

 

As for what that might look like in implementation, I have absolutely no idea and am certainly open to suggestions.  My suspicion is that until a workable framework for handling the data is established, though, it may be difficult to come up with something that would be a good fit.

