[Rcpp-devel] wordcloud

ian.fellows at stat.ucla.edu ian.fellows at stat.ucla.edu
Sat Jul 23 18:43:05 CEST 2011


>
> On 23 July 2011 at 09:02, ian.fellows at stat.ucla.edu wrote:
> | Hi all,
> |
> | I've just released an R package to CRAN that creates pretty looking
> | word clouds. I think it makes a good minimal example of how to
> | prototype an algorithm in R, and then bring the performance bottleneck
> | down to C++ to improve speed.
>
> Sweet!  I am still watching the whole onslaught of new or updated
> packages unfold so I haven't had a chance to even check if there were
> new Rcpp-using packages.  So welcome to the club :)
>
> | An example:
> |
> |
> | >install.packages("wordcloud",repos="http://cran.r-project.org",type="source")
> | >library(tm)
> | >data(crude)
> | >crude <- tm_map(crude, removePunctuation)
> | >crude <- tm_map(crude, function(x)removeWords(x,stopwords()))
> | >tdm <- TermDocumentMatrix(crude)
> | >m <- as.matrix(tdm)
> | >v <- sort(rowSums(m),decreasing=TRUE)
> | >d <- data.frame(word = names(v),freq=v)
> | >library(wordcloud)
> | Loading required package: Rcpp
> | >#using c++ to help layout the words
> | >system.time(wordcloud(d$word,d$freq,scale=c(8,.1),min.freq=0))
> |   user  system elapsed
> |  9.979   0.049   9.878
> | >#using R code to do the same layout
> | >system.time(wordcloud(d$word,d$freq,scale=c(8,.1),min.freq=0,use.r.layout=T))
> |   user  system elapsed
> | 151.919   0.716 146.737
>
> Ok, I'll be lazy now as I could just look at the code, but what type of
> layout operation did you move to C++? Is it a type of sorting /
> arranging / classifying / ... ?  Does it rely on other libraries or did
> you solve it with homegrown C++?  How many lines?

The layout algorithm takes each word and spirals out from the center of
the plot until it finds a place where the word wouldn't overlap with any
words already plotted. It then plots the word in that place.
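
A minimal sketch of the spiral search described above, in plain C++
without Rcpp. This is not the package's actual code: the names Point and
spiralSearch, the step sizes, and the callback that stands in for the
overlap check are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Hypothetical type for this sketch: a candidate position on the plot.
struct Point { double x; double y; };

// Walk outward along a spiral centred on (cx, cy) and stop at the first
// point that `fits` accepts. Returns false if the spiral radius exceeds
// rMax before any point is accepted. The step sizes are illustrative.
bool spiralSearch(double cx, double cy, double rMax,
                  const std::function<bool(double, double)>& fits,
                  Point& out,
                  double rStep = 0.005, double thetaStep = 0.1) {
    assert(rStep > 0.0 && thetaStep > 0.0);  // otherwise the loop never ends
    double r = 0.0;
    double theta = 0.0;
    while (r <= rMax) {
        const double x = cx + r * std::cos(theta);
        const double y = cy + r * std::sin(theta);
        if (fits(x, y)) {
            out.x = x;
            out.y = y;
            return true;
        }
        theta += thetaStep;  // turn a little
        r += rStep;          // and move a little further from the centre
    }
    return false;
}
```

Here `fits` would be the "does this word overlap anything already
plotted" test, evaluated once per candidate point. The first candidate is
the centre itself, so the first word lands there.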

The check to see whether the word has any overlaps at a particular point
is expensive and scales poorly. I tried something really smart and clever
in R to fix this, but it turns out that just doing the check in C++ is
faster than any cleverness I could come up with. The function is 24 lines
of C++ code replacing 23 lines of R code.
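
To make the cost of that check concrete, here is a sketch of one common
way to write it, assuming each word is represented by an axis-aligned
bounding box and the check is a linear scan over the boxes placed so far.
Box, boxesOverlap and overlapsAny are hypothetical names, and this is not
the 24-line function from the package.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed representation: lower-left corner plus width and height.
struct Box { double x; double y; double width; double height; };

// True if the two boxes share interior area; touching edges do not count.
bool boxesOverlap(const Box& a, const Box& b) {
    assert(a.width >= 0.0 && a.height >= 0.0);
    assert(b.width >= 0.0 && b.height >= 0.0);
    return a.x < b.x + b.width  && b.x < a.x + a.width &&
           a.y < b.y + b.height && b.y < a.y + a.height;
}

// Linear scan: compare the candidate against every box placed so far.
bool overlapsAny(const Box& candidate, const std::vector<Box>& placed) {
    for (std::size_t i = 0; i < placed.size(); ++i) {
        if (boxesOverlap(candidate, placed[i])) return true;
    }
    return false;
}
```

Under these assumptions every candidate point costs one pass over all
placed words, and each word may try many points before it fits, so the
total work grows quickly with the number of words. That is consistent
with a tight compiled loop beating a cleverer R version.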

>
> And lastly ... given that you also know Java so well: what works well /
> better with Rcpp for you?

Speed. wordcloud was a cute little weekend project, but for my
dissertation work, high performance is a primary concern, so I'm designing
it from the ground up using Rcpp (nothing public yet).

It is much slower going for me to code in C++, partly due to my lack of
experience, and partly due to my inability to find an IDE with good code
completion / syntax error detection. My understanding is that (due to
extensive use of templates), this is something I'll just have to live
with. I'm open to suggestions though. I'm currently using Eclipse CDT.

>
> Cheers, Dirk
>
> --
> Gauss once played himself in a zero-sum game and won $50.
>                       -- #11 at http://www.gaussfacts.com
>
>





