How to parse tweet text from Twitter using Ruby to pull out ‘@’ and ‘#’
Posted by arjunghosh on March 5, 2009
A lot of us love @twitter and also Ruby, and sometimes work with both.
I had to do the following with a tweet quite often:
Take the ‘@’ (i.e. @replies) and ‘#’ (i.e. hashtags) out of a tweet and separate them from the text part.
For example, we have a tweet:
@myfriend1 @myfriend2 this is a sample text #link #text
Now I want this tweet to be separated into the following Arrays:
['myfriend1','myfriend2']
['link','text']
and the text only – ["this is a sample text "]
So first I had to build a RegEx, and then, using the ever-useful .gsub method of Ruby, I created the following:
parsed_text = tweet.text.gsub(/ ?(@\w+)| ?(#\w+)/) { |a| ((a.include?('#')) ? tags : replies) << a.strip.gsub(/#|@/, ''); '' }
So the parsed_text has the final text only. tags is an Array which will contain the hashtags and replies is an Array which will contain the @replies.
The RegEx / ?(@\w+)| ?(#\w+)/ matches the hashtags and the @replies (each with an optional leading space), and the block places them in two separate arrays.
The inner call .gsub(/#|@/, '') only removes the ‘@’ and ‘#’ symbols from the extracted array elements.
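To make the one-liner easier to try out, here is a minimal runnable sketch of the same idea. It assumes a plain String (tweet_text, a name made up for this example) in place of tweet.text, and it initialises tags and replies as empty Arrays first, which the one-liner takes for granted.

```ruby
# Sample tweet from the post; a plain String stands in for tweet.text (assumption).
tweet_text = "@myfriend1 @myfriend2 this is a sample text #link #text"

tags    = []  # will hold hashtags without the leading '#'
replies = []  # will hold @replies without the leading '@'

# For each match, push the bare name into the right array, then return ''
# from the block so the match is removed from the resulting text.
parsed_text = tweet_text.gsub(/ ?(@\w+)| ?(#\w+)/) do |match|
  (match.include?('#') ? tags : replies) << match.strip.gsub(/#|@/, '')
  ''
end

puts replies.inspect      # ["myfriend1", "myfriend2"]
puts tags.inspect         # ["link", "text"]
puts parsed_text.inspect  # " this is a sample text"
```

Note that the remaining text keeps a leading space here, because the space before "this" is not part of any match; calling .strip on the result removes it.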
And you can download it from Gist here http://gist.github.com/78498
Also while working on creating the above regular expressions, I found this interesting RegEx testing site called www.rubular.com which will help you write regular expressions very easily.


Yogesh Goel said
Hi,
Have added to my blog
http://www.ygoel.com/
the write-up & all the photographs of the Kolkata Bloggers Meet 2009.
Do take some time to visit and do not forget to put down your inputs for the same.
Regards & Love,
Yours ever in blogging,
Yogesh Goel
ygoel.com
arjunghosh said
Will do
raf said
with your current regexp you’d catch part of the domain on email addresses as well, look at this:
" @myfriend1 @myfriend2 someone@domain.com this is a sample text #link #text"
result:
tags # => ["link", "text"]
replies # => ["myfriend1", "myfriend2", "domain"]
what if we are tweeting about Ruby code? sometimes it happens
" @myfriend1 @myfriend2 someone@domain.com this is a sample text #link #text ActiveRecord#find"
tags # => ["link", "text", "find"]
replies # => ["myfriend1", "myfriend2", "domain"]
you can fix that by getting rid of the question marks in the regex => / (@\w+)| (#\w+)/
tags # => ["link", "text"]
replies # => ["myfriend1", "myfriend2"]
besides I’d add a .strip after the last } => << a.strip.gsub(/#|@/, ''); '' }.strip
and I'd use do/end instead of {/}
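Putting raf's three suggestions together (mandatory leading space, a final .strip, and do/end), a sketch could look like the first part below. One thing to watch: because the space is now required, a tweet that starts directly with "@name" would not have that first mention picked up, which is why raf's sample strings begin with a space. The second part is a variation that is not from the post or the comment: a hypothetical helper, split_tweet, that uses a negative lookbehind instead of a literal space.

```ruby
# Sample text from the comment above, including its leading space.
tweet_text = " @myfriend1 @myfriend2 someone@domain.com this is a sample text #link #text ActiveRecord#find"

tags    = []
replies = []

# The space before '@' or '#' is required, so "someone@domain.com"
# and "ActiveRecord#find" are left untouched.
parsed_text = tweet_text.gsub(/ (@\w+)| (#\w+)/) do |match|
  (match.include?('#') ? tags : replies) << match.strip.gsub(/#|@/, '')
  ''
end.strip

puts replies.inspect  # ["myfriend1", "myfriend2"]
puts tags.inspect     # ["link", "text"]
puts parsed_text      # someone@domain.com this is a sample text ActiveRecord#find

# Hypothetical helper (an assumption, not part of the original code):
# (?<!\S) means "not preceded by a non-whitespace character", so a mention
# at the very start of the string is matched too.
def split_tweet(text)
  tags = []
  replies = []
  remaining = text.gsub(/(?<!\S)([@#])(\w+)/) do
    ($1 == '#' ? tags : replies) << $2
    ''
  end
  # Collapse the spaces left behind by the removed tokens.
  [remaining.squeeze(' ').strip, replies, tags]
end

text, mentions, hashtags = split_tweet("@myfriend1 mail me at someone@domain.com #link")
puts mentions.inspect  # ["myfriend1"]
puts hashtags.inspect  # ["link"]
puts text              # mail me at someone@domain.com
```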
Thanks for sharing.
arjunghosh said
Thanks @Raf for pointing it! That improves the code.
Gamooo said
hi,
how can I do the same thing with ASP.NET MVC3 syntax ?
Thanks
voodoologic said
Howdy,
As a new RoR developer, I am so thankful I found your post! Additionally, I’d like to make a remark for any new developers who might be using the twitter-auth gem.
gsub will not work on an array!! You’ll need a @tweets.each { |twitstring| twitstring.gsub(RegEx)… }
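As a rough illustration of that point, the sketch below runs the post's regex over an Array of strings. The tweets array is made-up sample data standing in for whatever the Twitter client returns. It uses map rather than each, since gsub returns a new String and each would throw the cleaned text away.

```ruby
# Made-up sample data (assumption); real tweets would come from the Twitter client.
tweets = [
  "@myfriend1 hello #ruby",
  "no mentions here",
  "@myfriend2 @myfriend3 see you #rails #ruby"
]

tags    = []
replies = []

# gsub is a String method, so it is called once per element.
# map keeps the cleaned-up text of every tweet.
texts = tweets.map do |tweet_text|
  tweet_text.gsub(/ ?(@\w+)| ?(#\w+)/) do |match|
    (match.include?('#') ? tags : replies) << match.strip.gsub(/#|@/, '')
    ''
  end.strip
end

puts texts.inspect    # ["hello", "no mentions here", "see you"]
puts replies.inspect  # ["myfriend1", "myfriend2", "myfriend3"]
puts tags.inspect     # ["ruby", "rails", "ruby"]
```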
Here’s my code that tallies my jock & nerd tweets.
@tweets.each{|x| x.gsub(/(\+|-)\d\s(geek|nerd|jock)/){|a| ((a.include?('jock')) ? jock : nerd) <>jock=[]
=>[]
-voodoologic
Devesh said
It’s awesome… really, it’s very nice…
Mike Rossetti said
Hi. Great solution. Is there a way to store those @ and # arrays as objects in the database?