Digital Tinker

 
 
 

    Some days you get the data, some days the data gets you …
 
Data Cleanser Humble Origins
August 28th, 2008

Believe it or not, I started out in the sub-cellar of my old personal computer, cleaning up MP3 files that I had either downloaded or ripped from my CD collection. In the olden days, there was a clever website called CDDB.com (now gracenote.com) that kept a database of CDs keyed on the number of tracks and the length of each track. MusicMatch Jukebox, the software I was using then, would connect to CDDB and send it that information for a newly ripped CD. If CDDB found a match, it would return all of the data for the CD!

Once I learned the method that CDDB was using to store its information, I marveled at the elegance of the service.
First of all, this was very much like giving a CD its own fingerprint, since it is highly unlikely that any two CDs would have the exact same number of tracks, with the exact minutes and seconds for each track in the exact same order! Indeed, the only time CDDB returned multiple matches was when the CD was rereleased. (That was just my experience, though. I’m sure there were false positives.)
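
To make the fingerprint idea concrete, here is a minimal sketch in Python. It is not the real CDDB disc ID calculation, only an illustration of the principle described above: the track count and the ordered track lengths are combined into one lookup key. The function names and the dictionary standing in for the database are my own assumptions.

```python
import hashlib


def disc_fingerprint(track_lengths):
    """Build a lookup key from (minutes, seconds) pairs given in disc order.

    Simplified illustration only; the real CDDB disc ID is computed differently.
    """
    seconds = [m * 60 + s for m, s in track_lengths]
    # Track count plus every length, in order, so reordering changes the key.
    key = f"{len(seconds)}:" + ",".join(str(x) for x in seconds)
    return hashlib.sha1(key.encode("ascii")).hexdigest()[:16]


def lookup(database, track_lengths):
    """Return every album stored under this fingerprint (hypothetical database)."""
    return database.get(disc_fingerprint(track_lengths), [])


if __name__ == "__main__":
    ripped = [(3, 45), (4, 10), (2, 58)]
    database = {disc_fingerprint(ripped): ["Example Album"]}
    print(lookup(database, ripped))
```

A rereleased CD with identical track timings would land on the same key, which is why the lookup returns a list rather than a single album.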

Secondly, the user community that updated the database was very altruistic. As far as I know, the only interaction was through MusicMatch Jukebox: when a CD match was not found, I would laboriously type in the song titles and submit them to the CDDB database.


Read more about CDDB on Wikipedia.

During all of this activity, I learned about the extra data that gets stored along with the actual music in an MP3:

  • ID3v1 - the original MP3 tag, which stores a few items such as title, artist, album, track number (added in ID3v1.1) and genre
  • ID3v2 - an enhanced version that allows more space for the ID3v1 items and adds more, such as lyrics and cover art
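
For the curious, an ID3v1 tag is simply the last 128 bytes of the file, starting with the letters TAG and followed by fixed-width fields. The sketch below reads one in Python. The function names are my own, and it handles ID3v1 only; ID3v2 sits at the start of the file and is far more involved.

```python
def parse_id3v1(block):
    """Decode a 128-byte ID3v1 block into a dict, or return None if it is not a tag."""
    if len(block) != 128 or block[:3] != b"TAG":
        return None

    def text(raw):
        # Fields are padded with NUL bytes (or spaces) up to their fixed width.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    tag = {
        "title": text(block[3:33]),
        "artist": text(block[33:63]),
        "album": text(block[63:93]),
        "year": text(block[93:97]),
        "genre_id": block[127],  # index into the standard genre list
    }
    comment = block[97:127]
    # ID3v1.1 stores the track number in the last byte of the comment field.
    if comment[28] == 0 and comment[29] != 0:
        tag["track"] = comment[29]
        tag["comment"] = text(comment[:28])
    else:
        tag["track"] = None
        tag["comment"] = text(comment)
    return tag


def read_id3v1(path):
    """Read the trailing 128 bytes of a file and parse them as an ID3v1 tag."""
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to the end to learn the file size
        if f.tell() < 128:
            return None
        f.seek(-128, 2)
        return parse_id3v1(f.read(128))
```

The genre is stored as a single number rather than text, which goes some way toward explaining why genre tags were such a blunt instrument.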

As you might expect (or know from experience), downloaded music did not always have the correct tag information.
One of the benefits of using MusicMatch Jukebox to store my music was the ability to create a library and sort my music by ID3 tags.
Selecting a genre is naturally subjective, so I spent a lot of time changing ID3 tags.
MusicMatch Jukebox had a primitive ID3 tagging function that relied on the physical file name of the MP3. This became quite a chore, as I could never decide how best to name the files.

My first data cleansing project, therefore, was to buy a program called Dr.Tag. Since it was a dedicated application, it did a better job than the MusicMatch Jukebox software. After learning the nuances of renaming physical files based on ID3 tag information versus updating ID3 tags based on the file name, I was able to clean up my music library in a few hours.
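
The two directions mentioned above, tags from file name and file name from tags, can be sketched in a few lines of Python. This is not how Dr.Tag works internally, which I never saw; the "Artist - 01 - Title.mp3" naming pattern and the function names are assumptions made purely for illustration.

```python
import re

# Hypothetical naming convention: "Artist - 01 - Title.mp3"
PATTERN = re.compile(
    r"^(?P<artist>.+?) - (?P<track>\d+) - (?P<title>.+)\.mp3$", re.IGNORECASE
)


def tags_from_filename(name):
    """Derive tag values from a file name, or return None if it does not fit the pattern."""
    m = PATTERN.match(name)
    if not m:
        return None
    return {
        "artist": m["artist"].strip(),
        "track": int(m["track"]),
        "title": m["title"].strip(),
    }


def filename_from_tags(tags):
    """Build a file name from tag values, replacing characters file systems reject."""
    def clean(value):
        return re.sub(r'[\\/:*?"<>|]', "_", value).strip()

    return f"{clean(tags['artist'])} - {tags['track']:02d} - {clean(tags['title'])}.mp3"


if __name__ == "__main__":
    tags = tags_from_filename("Some Artist - 03 - Some Song.mp3")
    print(tags)
    print(filename_from_tags(tags))
```

The nuance is that the two directions are not symmetric: a file name can only hold what the pattern allows, while a tag can contain characters that have to be replaced before they are safe in a file name.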