[R] Fwd: problem with kmeans
Ranjan Maitra
maitra.mbox.ignored at inbox.com
Tue Apr 29 06:35:10 CEST 2014
Cassie,
I am sorry, but do you even know what k-means does? It is a locally
optimal algorithm, and different software packages implement the same
algorithm differently.
FYI, R uses the Hartigan-Wong (1979) algorithm by default, which is
probably the most efficient out there.
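
To make the point about local optima concrete, here is a minimal sketch
that is not part of the original exchange. It assumes base R's adist()
(Levenshtein distances) as a stand-in for stringdistmatrix(..., method = "lv")
so that it runs without extra packages, and the seeds 1 to 10 are arbitrary.
Each seed gives one random start; the total within-cluster sum of squares
reached may differ from one start to the next.

```r
test <- c('hematolgy','hemtology','oncology','onclogy',
          'oncolgy','dermatolgy','dermatoloy','dematology',
          'neurolog','nerology','neurolgy','nerology')

# Assumption: adist() from base R replaces stringdist here.
# Note that kmeans() treats each row of this matrix as a feature vector.
dis <- adist(test)

# One random start per seed, using the default algorithm named explicitly.
wss <- sapply(1:10, function(s) {
  set.seed(s)
  kmeans(dis, centers = 4, algorithm = "Hartigan-Wong")$tot.withinss
})

# Compare the objective values reached from the different starts.
print(round(wss, 2))
```

If the printed values are not all equal, some starts ended in a worse
local optimum than others.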
I suggest you first take a multivariate statistics class before
making such sweeping statements. (Btw, did these same "some people"
tell you that most other software does not provide the kinds of broad
abilities which R provides, and is therefore not even comparable?)
And then, please read the help page (?kmeans) for how to "improve" your
run of k-means using R.
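
As one possible reading of that advice, the sketch below uses the nstart
and iter.max arguments documented in ?kmeans to try several random starts
and keep the best one. It is an illustration, not the poster's code: it
assumes base R's adist() in place of stringdist, and the values 25 and 50
are arbitrary choices. It is not guaranteed to reproduce the grouping
reported by the other software.

```r
test <- c('hematolgy','hemtology','oncology','onclogy',
          'oncolgy','dermatolgy','dermatoloy','dematology',
          'neurolog','nerology','neurolgy','nerology')

# Assumption: adist() (base R Levenshtein distance) replaces stringdistmatrix().
dis <- adist(test)

set.seed(123)
# nstart = 25 runs 25 random initialisations and returns the solution
# with the smallest total within-cluster sum of squares.
cl <- kmeans(dis, centers = 4, nstart = 25, iter.max = 50)

# Group the words by assigned cluster (replaces the for loop in the post).
grp_cl <- split(test, cl$cluster)
print(grp_cl)
```

Comparing cl$tot.withinss with the value from a single start shows whether
the extra starts found a better solution.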
HTH,
Ranjan
On Tue, 29 Apr 2014 09:45:18 +0530 cassie jones
<cassiejones26 at gmail.com> wrote:
> Dear R-users,
>
> I am trying to run kmeans on a set comprising 100 observations. But R
> somehow cannot figure out the true underlying groups, although other
> software such as JMP and MINITAB produce the desired result.
>
> Following is a brief example of what I am doing.
>
> library(stringdist)
> test=c('hematolgy','hemtology','oncology','onclogy',
> 'oncolgy','dermatolgy','dermatoloy','dematology',
> 'neurolog','nerology','neurolgy','nerology')
>
> dis=stringdistmatrix(test,test, method = "lv")
>
> set.seed(123)
> cl=kmeans(dis,4)
>
>
> grp_cl=vector('list',4)
>
> for(i in 1:4)
> {
> grp_cl[[i]]=test[which(cl$cluster==i)]
> }
> grp_cl
>
> [[1]]
> [1] "oncology" "onclogy"
>
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
>
> [[3]]
> [1] "oncolgy"
>
> [[4]]
> [1] "hematolgy" "hemtology" "dermatolgy" "dermatoloy" "dematology"
>
> In the above example, the 'test' variable consists of a set of
> terminologies with various typos and I am trying to group the similar types
> of words based on their string distance. Unfortunately kmeans is not able
> to replicate the following result that the other software are able to
> produce.
> [[1]]
> [1] "oncology" "onclogy" "oncolgy"
>
> [[2]]
> [1] "neurolog" "nerology" "neurolgy" "nerology"
>
> [[3]]
> [1] "dermatolgy" "dermatoloy" "dematology"
>
> [[4]]
> [1] "hematolgy" "hemtology"
>
>
> Does anyone know if there is a way out? I have heard from a lot of people
> that multivariate analysis in R does not produce the desired result most of
> the time. Any help is really appreciated.
>
>
> Thanks in advance.
>
>
> Cassie
>
--
Important Notice: This mailbox is ignored: e-mails are set to be
deleted on receipt. Please respond to the mailing list if appropriate.
For those needing to send personal or professional e-mail, please use
appropriate addresses.