[R] Problems with tm package, removeWords and transformations
Renato Medei
medei.ren at gmail.com
Tue Feb 10 15:52:58 CET 2015
Dear all,
I'm sorry, but like all newbies I have a lot of problems to solve.
I'm using R 3.1.2 under OS X 10.10.2.
I'm working with tm to analyze some tweets, and I get some strange
errors when I try to remove stopwords (see Error 1 below), to transform
the content (see Error 2 below) and to create a document-term matrix (see
Error 3 below).
Could anyone help me?
Error 1
> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace,
"(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
"Riviera", "riviera"))
> tweets_corpus <- tm_map(tweets_corpus, stopwords("italian"))
Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
Error 2
> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> tweets_corpus
<<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace,
"(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
"Riviera", "riviera"))
> tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
Error 3
> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> tweets_corpus
<<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace,
"(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
"Riviera", "riviera"))
> dtm <- DocumentTermMatrix(tweets_corpus)
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow =
length(allTerms), :
'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow =
length(allTerms), :
NAs introduced by coercion
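For anyone who wants to try the same steps without Twitter access, a minimal sketch along these lines might serve as a reproducible example. The sample tweets are made up for illustration, VCorpus() is called explicitly (the session above printed a VCorpus), and the stop word step assumes that removeWords was the intended transformation, since stopwords("italian") returns a character vector rather than a function. It is not confirmed that this small input triggers the same warnings as the real tweets.

```r
library(tm)

# Hypothetical sample tweets (invented for illustration, not real search results)
tweets <- c(
  "RT @someone: Rimini in estate 2015 http://example.com/foto",
  "Una giornata al mare, via @altro",
  "La riviera di notte!"
)

tweets_corpus <- VCorpus(VectorSource(tweets))

# Replace every match of `pattern` with a single space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Same cleaning steps and patterns as in the sessions above
tweets_corpus <- tm_map(tweets_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "riviera"))

# Assumption: the stop word list is handed to removeWords instead of being
# passed to tm_map() as if it were a transformation function
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("italian"))

dtm <- DocumentTermMatrix(tweets_corpus)
dim(dtm)
```

If this runs cleanly while the real tweets still fail, the difference is more likely to lie in the tweet text itself than in the sequence of tm calls.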
Thank you for your help