Jumbalaya Jam
Before I learned about HTML in 2003, I learned how to get rid of the tags that define it.
One of my passions is designing games and I wanted to recreate Boggle® on the computer.
My wife loved to play an online game called Jumbalaya, which is similar to Boggle®.
I decided to borrow her word lists for my Boggle® dictionary, so that it would reject nonsense words.
Because she played so many games in one sitting, I wrote a program in Visual Basic to read through the Internet Explorer Cache, locate the pages, read each one and extract just the words.
Real Estate Ads
A more recent project involved sifting through online advertising for real estate. I wanted a neat list that I could store in my own database.
This time, I wrote a program to copy the actual web page (just like selecting a portion of text to paste into another location).
I felt guilty because I thought I was doing something called scraping a website. However, when I finally learned the true meaning of scraping, which is to republish other people’s website content as if it were your own, I realized that I wasn’t doing anything unethical.
Parsing Power
The real power behind both mini-projects was the parsing engine.
To parse means to separate a string of data into parts.
It is very useful to parse an HTML file into its tags and values. Your web browser does this in order to display the web page on your screen.
It is also useful to be able to parse a string of characters, in order to find specific patterns. Google does this when it separates your search into keywords, by finding the spaces and punctuation between the words.
You don’t always have to use a programming language: Microsoft® Access and Microsoft® Excel provide string manipulation functions that can help you parse strings.
Keep in mind, though, these two programs don’t always agree on the names or even how to use the function!
For example, suppose you want to see if a string contains a comma. If it does, you want to know where it starts (this is called the string position).
Here is the way each program expects you to use their function:
Access: InStr([start, ]string1, string2)
string1 is the string being searched
string2 is the smaller string being sought
start is an optional parameter which tells the function where in string1 to begin looking for string2
Excel: =SEARCH (find_text,within_text,start_num)
within_text is the string being searched
find_text is the smaller string being sought
start_num is an optional parameter which tells the function where in within_text to begin looking for find_text
If the search function does not find the string being sought, Access returns a 0, but Excel gives you a #VALUE error.
This means that you have to go through hoops to get Excel to parse arbitrary strings.
To understand my suggested solution below, open up Excel and type Digital,Tinker into cell A1.
In cell A2, type or paste the following formula:
=SEARCH(”,”, A1)
When you press Enter, A2 will display 8.
If you remove the comma from A1, A2 will display #VALUE.
Experts Only
I am always learning and never claim to know all the answers. I hope that, if you consider yourself an expert, you’ll do two things:
- Share your techniques with us
- Be opened-minded about unusual techniques you find by other readers and by me
Now, consider the reason the Excel function fails and think of the easiest way to avoid it.
Here’s what I think: just add the smaller string to the end of the string being searched!
Adding one string to another string is called concatenating and is symbolized by the ampersand ( & )
In cell A3, type or paste the following formula:
=SEARCH(”,”, A1 & “,”)*(SEARCH(”,”, A1 & “,”) <LEN(A1))
When you press Enter, A3 displays a 0.
Add the comma to A1, press Enter and both A2 and A3 will display the string position correctly.
This works because of a little-known feature of nearly all programming languages. Normally, when you evaluate an equation, the result is TRUE or FALSE.
However, if you use that result in another equation, Excel treats it as a 1 if TRUE or a 0 if FALSE.
To see this, type or paste just the evaluation into A4:
=(SEARCH(”,”, A1 & “,”) <LEN(A1))
A4 will display either TRUE or FALSE, depending on whether a comma is in A1.
Now, change A4 to multiply that evaluation by 1:
=(SEARCH(”,”, A1 & “,”) <LEN(A1))*1
Now you’ll see a 1 or a 0 in A4, depending on the presence of a comma in A1.
Looking again at cell A3, you can now see that by multiplying the result of the search by the result of the evaluation, we’re basically saying, “multiply the search result by either 0 or 1″.
As you may recall from elementary school mathematics, any number multiplied by 0 equals 0, while any number multiplied by 1 equals that number.
(In some programming languages, such as Visual Basic and Access, -1 is used. You could negate the final result or use the ABS( ) function to change it into a positive number.)
As for the mysterious LEN( ), Excel returns the length of the string of text in A1. So, the evaluation does the SEARCH, compares the result to the length of the original string and returns 1 (TRUE) if the string position of the comma is less than the length of the string in A1. That would mean that a comma was found before the one we concatenated to the end!
By the way, in Excel, you are always working with a copy of something in your formulas. We never really concatenate a comma to the end of A1. We are using a copy of A1 in our other cells.
If you come up with anything shorter than this suggestion, I’d love to hear it!
After years of manually processing documents, I wrote a program that greatly simplifies - and speeds up - tedious tasks such as this one.
Visit www.parsermonster.com for more information.
|
September 12th, 2008 at 8:09 pm
I read some more of your blog entries and must say, you are a whiz at this data stuff!
When I said I did data manipulation I meant that I did more manual work (moving columns and rows around, setting up some simple filters, and importing/exporting data).
Parsing is not my forte probably never will be
I only got as far as being able to combine first and last names and dates!
September 13th, 2008 at 1:43 pm
Bobbi Jo, I must admit that there is a deeper connection to data, strings of text and mathematics.
Drumroll…
Cryptography!
I’ve played with this field since I was about 8 years old. Parsing is fundamental to understanding codes and ciphers. I believe that’s what helps me “see” patterns in problems.
Cheers,
Mitch
August 21st, 2009 at 6:10 am
[…] Digital Tinker microwebblogs.com/digitaltinker/2008/parsing-strings-in-excel – view page – cached Before I learned about HTML in 2003, I learned how to get rid of the tags that define it. — From the page […]