Here’s a change of pace. In this article, I’ll show you how I used Microsoft ® Word, Microsoft ® Excel and Notepad to clean up an old blog post for republishing.
Project Purpose
For my personal blog, I decided to do a links posts, which is used to highlight older posts that new visitors may enjoy and long-time readers can revisit. There are some SEO benefits, so that incentive was there as well.
On my old blog, I regularly maintained one such post. It was called the Ambient Orb Blog Index and, rather than just having a few selected posts, I put all of them in this one post! I wouldn’t recommend that today, but back then, the website had no other way of effectively categorizing and showcasing blog posts.
Anyway, to give a sense of context for my new links post, I wanted to share the original Ambient Orb Blog Index. Of course, that meant that I would have to remove all of the dead links.
Examining the Raw Data
The original post was saved in Microsoft ® Word format.
Here are a few lines from the file:
<li><font size=2 color=blue><b><a href="http://www.writingup.com/mitchell_allen/am_i_an_artist_or_an_engineer">OOO</a>: Am I an Artist or an Engineer?</b></font>
<li><font size=2 color=blue><b><a href="http://www.writingup.com/mitchell_allen/beethoven_hearing_aids_and_suicide">OOO</a>: Beethoven, Hearing Aids and Suicide</b></font>
<li><font size=2 color=blue><b><a href="http://www.writingup.com/mitchell_allen/a_tortured_history_of_mass_storage">OOO</a>: A Tortured History of Mass Storage</b></font>
As each link was unique, I couldn’t do a mass search and replace. My first idea was to “bracket” the data that I wanted to delete. I figured that would allow me to use Microsoft ® Word’s wildcard search. Bracketing is a term I use to describe using a weird character to enclose or delimit (separate) substrings within a longer string. Usually, bracketing involves searching for a common substring and either appending the weird character to the end of the common substring or, if the common substring is unneeded, replacing that common substring altogether.
First Cleansing
For this project, I wanted to get rid of bits of unneeded text while bracketing the links. So, the first thing I did was to change every occurrence of
<a href="http://www.writingup.com/mitchell_allen/
to
~
To complete the bracket and get rid of even more useless text, I changed every occurrence of
">OOO</a>:
to the same ~.
Here are the modified lines at this point:
<li><font size=2 color=blue><b>~am_i_an_artist_or_an_engineer~Am I an Artist or an Engineer?</b></font>
<li><font size=2 color=blue><b>~beethoven_hearing_aids_and_suicide~Beethoven, Hearing Aids and Suicide</b></font>
<li><font size=2 color=blue><b>~a_tortured_history_of_mass_storage~A Tortured History of Mass Storage</b></font>
Now, I had to get rid of the bracketed text.
Weak Soap
In order to understand the wildcard search and replace, let me digress a moment. Look at this nonsense text:
frogface
frogs in the field
swollen river flows peacefully
Let’s perform a wildcard search and replace. To change every occurrence of text bracketed by the letter f to the (equally nonsensical) string consisting of fishf, we do the following:
1. Click Edit on the menu and select Replace … (or press Ctrl+H)
2. In the Find what: box, type f*f
3. Press the More button
4. Check Use wildcards
5. In the Replace with: box, type fishf
6. Press the Replace All button
This is the result of the wildcard search and replace:
fishface
fishfield
swollen river fishfully
The asterisk (*) is a wildcard operator that takes the place of zero or more arbitrary chararacters. When you don’t know or care what’s being matched, asterisk is your friend.
On the first line of nonsense text, the asterisk matches rog. On the second line, it matches rogs in the . Finally, on the third line, it matches lows peace.
Keep in mind, the search and replace function does things one at a time, so we don’t wind up with our asterisk matching ace and ield swollen river.
By the way, if you’d like to see all of the wildcard search operators, check the wildcard box and click the Special button.
Now, back to the project. I had fully expected to be able to replace ~*~ with a character string of my choice (actually, just deleting it would have been best, but I didn’t want to confuse you). However, no matches were found! I never figured out what I did wrong, but I must have introduced a spurious space character to the end of the search string because, when I recreated the files for this article, the wildcard search worked perfectly. So, instead of being done with this project, I had to try another tack.
After muttering a few choice curses, I went to my industrial-strength solution.
Strong Soap
Microsoft ® Excel has powerful conversion utilities. Once I had bracketed my data, I could literally show Microsoft ® Excel how to read it and convert it into a spreadsheet.
First, I pasted the slightly scrubbed data into Notepad and saved the file. I didn’t want any formatting characters to be imported into the spreadsheet. Next, I opened Microsoft ® Excel, clicked Data and selected Get External Data.
This invoked a powerful wizard that walked me through choosing the Notepad file, selecting the ~ as a delimiter and choosing the destination on the spreadsheet.
In seconds, I had a worksheet that looked like this:

Click to view full size
Now that the dead links were separated, all I had to do was concatenate column A with column C and place the result in column D.
=A1 & C1
Here is a screen shot, with column B hidden:

Click to view full size
Squeaky Clean
The only thing I had to do after all that, was to paste column D into a new Microsoft ® Word document and fix the syntax of the font tag by adding quotes (which weren’t needed on the old blog software).
The result can be viewed at my blog: Ambient Orb - Blog Index.
|