Digital Tinker

 
 
 
  • About

    Some days you get the data, some days the data gets you …
 
How I Created the Ambient Orb Blog Index September 12th, 2008

Here’s a change of pace. In this article, I’ll show you how I used Microsoft ® Word, Microsoft ® Excel and Notepad to clean up an old blog post for republishing.

Project Purpose

For my personal blog, I decided to do a links posts, which is used to highlight older posts that new visitors may enjoy and long-time readers can revisit. There are some SEO benefits, so that incentive was there as well.

On my old blog, I regularly maintained one such post. It was called the Ambient Orb Blog Index and, rather than just having a few selected posts, I put all of them in this one post! I wouldn’t recommend that today, but back then, the website had no other way of effectively categorizing and showcasing blog posts.

Anyway, to give a sense of context for my new links post, I wanted to share the original Ambient Orb Blog Index. Of course, that meant that I would have to remove all of the dead links.

Examining the Raw Data

The original post was saved in Microsoft ® Word format.
Here are a few lines from the file:

<li><font size=2 color=blue><b><a href="http://www.writingup.com/mitchell_allen/am_i_an_artist_or_an_engineer">OOO</a>: Am I an Artist or an Engineer?</b></font>

<li><font size=2 color=blue><b><a href="http://www.writingup.com/mitchell_allen/beethoven_hearing_aids_and_suicide">OOO</a>: Beethoven, Hearing Aids and Suicide</b></font>

<li><font size=2 color=blue><b><a href="http://www.writingup.com/mitchell_allen/a_tortured_history_of_mass_storage">OOO</a>: A Tortured History of Mass Storage</b></font>

As each link was unique, I couldn’t do a mass search and replace. My first idea was to “bracket” the data that I wanted to delete. I figured that would allow me to use Microsoft ® Word’s wildcard search. Bracketing is a term I use to describe using a weird character to enclose or delimit (separate) substrings within a longer string. Usually, bracketing involves searching for a common substring and either appending the weird character to the end of the common substring or, if the common substring is unneeded, replacing that common substring altogether.

First Cleansing

For this project, I wanted to get rid of bits of unneeded text while bracketing the links. So, the first thing I did was to change every occurrence of

<a href="http://www.writingup.com/mitchell_allen/

to
~

To complete the bracket and get rid of even more useless text, I changed every occurrence of

">OOO</a>:

to the same ~.

Here are the modified lines at this point:

<li><font size=2 color=blue><b&gt~am_i_an_artist_or_an_engineer~Am I an Artist or an Engineer?</b></font>

<li><font size=2 color=blue><b&gt~beethoven_hearing_aids_and_suicide~Beethoven, Hearing Aids and Suicide</b></font>

<li><font size=2 color=blue><b&gt~a_tortured_history_of_mass_storage~A Tortured History of Mass Storage</b></font>

Now, I had to get rid of the bracketed text.

Weak Soap

In order to understand the wildcard search and replace, let me digress a moment. Look at this nonsense text:


frogface

frogs in the field

swollen river flows peacefully

Let’s perform a wildcard search and replace. To change every occurrence of text bracketed by the letter f to the (equally nonsensical) string consisting of fishf, we do the following:

1. Click Edit on the menu and select Replace … (or press Ctrl+H)

2. In the Find what: box, type f*f

3. Press the More button

4. Check Use wildcards

5. In the Replace with: box, type fishf

6. Press the Replace All button

This is the result of the wildcard search and replace:

fishface

fishfield

swollen river fishfully

The asterisk (*) is a wildcard operator that takes the place of zero or more arbitrary chararacters. When you don’t know or care what’s being matched, asterisk is your friend.

On the first line of nonsense text, the asterisk matches rog. On the second line, it matches rogs in the . Finally, on the third line, it matches lows peace.

Keep in mind, the search and replace function does things one at a time, so we don’t wind up with our asterisk matching ace and ield swollen river.

By the way, if you’d like to see all of the wildcard search operators, check the wildcard box and click the Special button.

Now, back to the project. I had fully expected to be able to replace ~*~ with a character string of my choice (actually, just deleting it would have been best, but I didn’t want to confuse you). However, no matches were found! I never figured out what I did wrong, but I must have introduced a spurious space character to the end of the search string because, when I recreated the files for this article, the wildcard search worked perfectly. So, instead of being done with this project, I had to try another tack.

After muttering a few choice curses, I went to my industrial-strength solution.

Strong Soap

Microsoft ® Excel has powerful conversion utilities. Once I had bracketed my data, I could literally show Microsoft ® Excel how to read it and convert it into a spreadsheet.

First, I pasted the slightly scrubbed data into Notepad and saved the file. I didn’t want any formatting characters to be imported into the spreadsheet. Next, I opened Microsoft ® Excel, clicked Data and selected Get External Data.

This invoked a powerful wizard that walked me through choosing the Notepad file, selecting the ~ as a delimiter and choosing the destination on the spreadsheet.

In seconds, I had a worksheet that looked like this:
worksheet 1

Click to view full size

Now that the dead links were separated, all I had to do was concatenate column A with column C and place the result in column D.

=A1 & C1

Here is a screen shot, with column B hidden:
worksheet 2

Click to view full size

Squeaky Clean

The only thing I had to do after all that, was to paste column D into a new Microsoft ® Word document and fix the syntax of the font tag by adding quotes (which weren’t needed on the old blog software).

The result can be viewed at my blog: Ambient Orb - Blog Index.

Misguided Advertising Copy September 3rd, 2008

No, this isn’t a copywriting article. It’s a plea to all office managers who send insertion text to webmasters.

This is also a horror story, so if you have a weak stomach, click away now!

Julie is the owner and publisher of the Fredericksburg Guidebook and its companion website, fredericksburgguidebook.com.
It is a quarterly publication. Advertisers pay for inclusion in the guidebook and can pay for a listing online.

Each quarter, the office manager solicits new business, while updating and deleting existing accounts.

Naturally, I get the results for updating the website.

“It was a dark and stormy night. Suddenly, my cell phone rang. Julie wanted an update to the list of sponsors on her website.”

You Want it WHEN?

The quarterly list of advertiser insertions, deletions and modifications was on its way to my in-box.
As usual, I was getting it at the last minute. The pressure was on! (I had never missed a deadline and I wasn’t about to start, now.)
I had just under a week to update the website.

This is NOT What I Asked For!

The first update I ever received from the office manager was in a Microsoft Excel spreadsheet. Cruella (not her real name) explained her system to me and I told her that I could work with that. I had written a custom database program that reads text and pumps out web pages. But for this project, I modified it to import the spreadsheet data into a database table.

Unfortunately, after just two updates, Cruella changed her system and started using Microsoft Word! So I had to re-tool my factory, so to speak. I added some widgets and doo-dads and fairy dust that allowed me to feed the required text file for this project.
Now, all I had to do was convert the Word document into the proper text format.

Are You Feeding Me A Line?

Long-time Word users know that there are three ways to generate a new line of text:

  • Natural break (you just keep on typing and Word “wraps” the text)
  • Manual line break (hold the Shift key down and press the Enter key)
  • Paragraph break (just press the Enter key)

Well, Cruella managed to get all three of them into one document! This simply means that she had no fixed format for her data. Actually, in most cases, there was a manual line break between the ad copy and the contact information. In addition, there were usually two paragraph breaks between advertisers. To make things interesting, she inserted three paragraph breaks between categories of advertisers. For example, all of the restaurants were listed together, followed by all of the beauty salons.

That helped.

I used Word’s search and replace to convert all the manual line breaks to paragraph breaks.
Next, I converted every paragraph break to this pair of characters:

~p

This removes all the formatting information, which is stored with the paragraph mark. (Believe me, you’ll go blind alternating between 14 pt headers and 10 pt body text!)
Then, I went through the document, looking for groups of these and stamping out inconsistencies.
Once I ensured that all the advertisers were separated from each other by exactly one ~p, I made sure each category was separated by two ~p ~p.

Finally, I converted each ~p back into a paragraph break.
I pasted the whole kit and caboodle into a plain text file and fed it to my Frankenstein monster.

Hyper Text

One gritty detail that I’ve hidden from you, gentle reader, is that fredericksburgguidebook.com originally had tons of duplicated ads listed in different categories.

Julie insisted that this is what her clients demanded. If you wanted to visit a restaurant for a juicy steak, you might miss Pablo’s Ice Cream Land, who serves steak, pizzas and ice cream! Poor Pablo would have to plaster his ad in three categories!

Pablo, a fictional businessman, had nothing on the real restaurant owners, some of whom had 14 copies of the same ad throughout the site.

After that update debacle, I insisted that the office manager learn to do her own updates!

WordPress to the Rescue!

My son Sheldon tweaked a WordPress theme so that the advertisers’ categories and subcategories would show up in the sidebar.

Meanwhile, I opened up the Frankenstein monster and plopped in yet another converter. Instead of pumping out web pages, I made it pump out WordPress static pages with no HTML (except for special characters).

We ended up condensing 380 insertions to 86 unique advertisers. We manually categorized each one and uploaded the lot to fredericksburgguidebook.com/guide.

Technically Speaking

My job was only half done. If Cruella couldn’t do the updates herself, my mission would have been a failure.
So, I wrote a user guide: Manage Your Online Directory.

Cruella used it.

The new office manager uses it.

I have more free time.

Data Cleanser Humble Origins August 28th, 2008

Believe it or not, I started out in the sub-cellar of my old personal computer, cleaning up MP3 files that I had either downloaded or ripped from my CD collection. In the olden days, there was a clever website called CDDB.com (now gracenote.com) that had a database of CDs based on the number of tracks and the length of time for each track. MusicMatch Jukebox, the software I was using then, would connect to CDDB and provide the information from a newly ripped CD. If CDDB found a match, it would return all of the data for the CD!

Once I learned the method that CDDB was using to store its information, I marveled at the elegance of the service.
First of all, this was very much like giving a CD its own fingerprint, since it is highly unlikely that any two CDs would have the exact same number of tracks, with the exact minutes and seconds for each track in the exact same order! Indeed, the only time CDDB returned multiple matches was when the CD was rereleased. (That was just my experience, though. I’m sure there were false positives.)

Secondly, the user community that updated the database was very altruistic. As far as I know, the only interaction was through MusicMatch Jukebox: when a CD match was not found, I would laboriously type in the song titles and submit them to the CDDB database.


read more about CDDB on wikipedia.

During all of this activity, I learned about the extra data that gets stored along with the actual music in an MP3:

  • ID3v1 - the original MP3 tag stored a few items, such as title, artist, album, track number and genre
  • ID3v2 - an enhanced version allows more space for the ID3v1 items and has even more items, such as lyrics, cover art

As you might expect (or know from experience), downloaded music did not always have the correct tag information.
One of the benefits of using MusicMatch Jukebox to store my music was the ability to create a library and sort my music by ID3 tags.
Naturally, selecting a genre is subjective and I spent a lot of time changing ID3 tags.
MusicMatch Jukebox had a primitive ID3 tagging function that relied on the physical file name of the MP3. This became quite a chore, as I could never decide how best to name the files.

My first data cleansing project, therefore, was to buy a program called Dr.Tag. Since it was a dedicated application, it did a better job than the MusicMatch Jukebox software. After learning the nuances of renaming physical files based on ID3 tag information versus updating ID3 tags based on the file name, I was able to clean up my music library in a few hours.