Do shorthand and symbols cause problems with text mining?
I recently had to convert some open-ended customer feedback from Excel to SAS. The comments had been transcribed from handwritten comment cards into an Excel spreadsheet. After conversion to a SAS dataset, about 10% of them displayed a character error of sorts. An affected comment looked like this:
"Your product is the greatest. We will tell all our friends. ?□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□ etc....."
Because there were so few of these, I didn't notice them until I ran some analysis on the dataset and my log started throwing error messages.
I tracked the problem down to Excel's good ol' AutoCorrect. When the transcriber typed up the customer comments, they included the smiley face :), which Excel automatically converted to the smiley face symbol. SAS can't read the symbol, so it replaces it with a question mark and a heck of a lot of squares.
It was an easy enough fix, but it got me wondering how automated text-mining software handles shorthand and symbols. In a world of cell phone texting and instant messaging, people have developed all sorts of shorthand, such as LOL, ROTFLOL and OMG, "ur" for "your" and "2moro" for "tomorrow". And what about when people use the smiley face or the tongue smiley :P? There are many more.
Once I get my hands on true text-mining software, I'll have to test how it handles shorthand.

Reader Comments (5)
Hi Jared,
If the emoticons aren't graphics, you could map them to words via a synonym list:
:-) = 'happy'
:P = 'happy'
:-( = 'unhappy'
and so on. I'm not 100% sure how to get past graphics. You could probably approach those somehow with Base SAS; I am checking into that for you.
As for the new 'texting' language, those acronyms too can be handled via a synonym list. Some customers have already implemented this since call center agents are switching to this shorthand mode. We could easily create a standard set of synonyms out of this new communication style and push it out to customers.
Regards,
Manya Mayes
SAS Text Miner Product Manager
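The synonym-list idea in the comment above can be roughed out in Base SAS. This is only a sketch, not how SAS Text Miner implements synonym lists: the data set names (WORK.COMMENTS, WORK.COMMENTS_CLEAN), the variable names and the sample rows are all made up for illustration, and the mappings are the ones mentioned in the post and the comment.

```sas
/* Hypothetical sample data; data set and variable names are made up. */
data work.comments;
    length comment $200;
    comment = 'Your product is the greatest :-)'; output;
    comment = 'ur support line was slow :-(';     output;
    comment = 'See you 2moro :P';                 output;
run;

data work.comments_clean;
    set work.comments;
    length clean $400;

    /* Synonym list: each emoticon and the word that replaces it */
    array emo[3]  $3 _temporary_ (':-)' ':P' ':-(');
    array word[3] $7 _temporary_ ('happy' 'happy' 'unhappy');

    clean = comment;
    do i = 1 to dim(emo);
        /* STRIP keeps padding blanks out of the search and replacement text */
        clean = tranwrd(clean, strip(emo[i]), strip(word[i]));
    end;

    /* Shorthand: \b limits the match to whole words, so the "ur"
       inside "your" or "our" is left alone */
    clean = prxchange('s/\bur\b/your/i', -1, clean);
    clean = prxchange('s/\b2moro\b/tomorrow/i', -1, clean);

    drop i;
run;
```

TRANWRD is enough for the emoticons because they rarely occur inside other words; the shorthand terms go through PRXCHANGE so that a plain substring replace does not mangle ordinary words.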
Jared,
I've never looked into this before, but I imagine it's possible. We would simply have to recognize the character code for "smiley" and "frownie" and so on, and then tag the record accordingly. That could work for pure text sources.
In the case of importing from Excel (as you were doing), it might be more complicated than that because those characters might be lost in translation while SAS is reading from the Excel data source. I did a small test and found that "smiley" turned to "J" and "frownie" turned to "L". You might be able to preprocess such characters in Excel to turn them into "normal" ASCII text, and then perform your analysis once imported into SAS.
Chris
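The "recognize the character code and tag the record" idea from the comment above might look something like the following in a DATA step. It is a sketch under assumptions: the data set and variable names are hypothetical, BYTE(1) merely stands in for an unreadable symbol, and the code assumes a single-byte ASCII session. It would not catch the "J" and "L" case described above, because those arrive as ordinary letters.

```sas
/* Hypothetical sample data; BYTE(1) stands in for an unreadable symbol. */
data work.comments;
    length comment $200;
    comment = 'Your product is the greatest.';             output;
    comment = 'We will tell all our friends. ' || byte(1); output;
run;

data work.comments_flagged;
    set work.comments;

    /* 1 when any character falls outside printable ASCII (decimal 32-126) */
    has_symbol = (prxmatch('/[^\x20-\x7E]/', trim(comment)) > 0);

    /* Copy of the comment that keeps only printable ASCII characters */
    comment_ascii = compress(comment, collate(32, 126), 'k');
run;
```

Flagging the records first, instead of silently stripping characters, leaves room to inspect what was lost before deciding on a replacement word.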
Hi Manya. That's actually what I ended up doing. I changed all the graphic :) smilies to the word HAPPY and everything worked well.
Thanks for the information. It's good to know that others are able to incorporate these new communication styles into their analysis.
Hi Chris.
Most of my data currently comes in CSV or XLS, so I may have to look into your suggestion of preprocessing. Thanks for the tip :)
It's interesting that your smiley converted to a J. Mine didn't do that. Maybe we are converting the data using a different method?
I am using the following SAS code to convert my data (using SAS 9.2):
PROC IMPORT DATAFILE="c:\folder\file.xls"
    OUT=a.mydata
    DBMS=excel2002 REPLACE;
    SHEET="Sheet1";
    GETNAMES=yes;
    RANGE="Sheet1$A1:AE407";
    MIXED=yes;
    SCANTEXT=yes;
    USEDATE=no;
    SCANTIME=no;
    DBSASLABEL=none;
RUN;
some better emote faces r..
=D super happy
D= super sad
-.- bored
^.- u just made a very good point
O.o surprised
=S awkward
=X this was just one my m8 made up but it looks cool
=O surprised again
<3 love heart
>=( angry
<=( u should feel sorry for me
>=) sinister
<=) "awww"
^.^ abashed