Do shorthand and symbols cause problems with text mining?
Wednesday, October 22, 2008
Jared in SAS, Text Mining

I recently had to convert some open-ended customer feedback from Excel to SAS. The comments had been transcribed from handwritten comment cards into an Excel spreadsheet. After the conversion to a SAS dataset, about 10% of them displayed a character error of sorts. An affected comment looked like this:

"Your product is the greatest.  We will tell all our friends.  ?□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□ etc....."

Because so few comments were affected, I didn't notice the problem until I ran some analysis on the dataset and my log started throwing error messages.

I tracked the problem down to Excel's good ol' AutoCorrect. When the transcriber was typing up customer comments, they included the smiley face :) which was automatically converted to the smiley face symbol. SAS can't read the symbol, so it replaces it with a question mark and a heck of a lot of squares.
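For anyone who wants to catch this kind of thing before the import, here is a minimal Python sketch (not SAS, and not the fix I actually used). It swaps known symbols back to plain-text emoticons and lists any other non-ASCII characters it finds. The mapping table and the function name clean_comment are my own illustration, and the code points are an assumption: the character AutoCorrect really produces may differ depending on the font and how the file was exported.

```python
# Hypothetical mapping from symbol characters to plain-text emoticons.
# The code points are assumptions; check what your export actually contains.
SYMBOL_TO_TEXT = {
    "\u263a": ":)",  # WHITE SMILING FACE
    "\u2639": ":(",  # WHITE FROWNING FACE
}


def clean_comment(text):
    """Replace known symbols and return (cleaned_text, leftover_non_ascii)."""
    for symbol, replacement in SYMBOL_TO_TEXT.items():
        text = text.replace(symbol, replacement)
    # Anything still outside the ASCII range is reported for manual review.
    leftovers = sorted({ch for ch in text if ord(ch) > 127})
    return text, leftovers


if __name__ == "__main__":
    cleaned, leftovers = clean_comment("We will tell all our friends. \u263a")
    print(cleaned)
    print(leftovers)
```

Running every comment through a check like this before loading the data would flag the roughly 10% of rows that need attention, instead of leaving the log to reveal them later.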

It was an easy enough fix, but it got me wondering how automated text-mining software handles shorthand and symbols. In a world of cell phone texting and instant messaging, people have developed all sorts of shorthand: LOL, ROTFLOL, OMG, "ur" for "your", "2moro" for "tomorrow", and many more. And what about when people use the smiley face or the tongue smiley :P ?
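I don't know how commercial text-mining packages deal with this, but one simple approach would be a lookup table applied before analysis. The Python sketch below is only an illustration under that assumption: the SHORTHAND and EMOTICONS tables, the EMOTICON_ tokens, and the normalize function are all made up for this example, and a real list would need to be far longer.

```python
# Hypothetical lookup tables; a real list would be much longer.
SHORTHAND = {
    "ur": "your",
    "2moro": "tomorrow",
    "lol": "laughing out loud",
    "omg": "oh my god",
}
EMOTICONS = {
    ":)": "EMOTICON_SMILE",
    ":P": "EMOTICON_TONGUE",
}


def normalize(text):
    """Expand shorthand and turn emoticons into tokens an analysis can count."""
    tokens = []
    for raw in text.split():
        # Check emoticons first, before punctuation is stripped away.
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])
            continue
        word = raw.strip(".,!?\"'").lower()
        tokens.append(SHORTHAND.get(word, word))
    # Drop tokens that were nothing but punctuation.
    return " ".join(t for t in tokens if t)


if __name__ == "__main__":
    print(normalize("OMG ur product is great :)"))
```

A flat table like this can't resolve ambiguity ("ur" can also mean "you're"), which is exactly the kind of case I'd want to see real software handle.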

Once I get my hands on true text-mining software, I'll have to test how well it analyzes shorthand.

Article originally appeared on jaredprins (http://jaredprins.squarespace.com/).
See website for complete article licensing information.