[R] help formatting data for clustering

Wed Nov 14 19:35:51 CET 2012

On Nov 13, 2012, at 5:41 PM, arun wrote:

> Hi,
> 
> You could also try:
> dta <- read.table(text="
> 1 , 45 , 32, 45, 23
> 2 , 34
> 4, 11, 43, 45
> ",sep=",",fill=TRUE)
> library(reshape)
>  dtanew<-reshape(dta,varying=2:5,v.names="brand",idvar="V1",direction="long")[,c(1,3)]

It's a bit puzzling to see package reshape loaded and then the reshape function being used. There is no reshape function in that package; 'reshape' is in the stats package, which is loaded by default.
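
To illustrate, here is a minimal sketch, assuming a fresh R session with only the default packages attached. It checks which namespace `reshape` comes from and runs the same wide-to-long conversion without `library(reshape)`. The `strip.white=TRUE` argument, the explicit `stats::` prefix and the names `reshape_home` and `tab` are my additions for the example, not part of the code quoted above.

```r
# Same data as in the quoted message; strip.white = TRUE trims the
# padding around each comma-separated field
dta <- read.table(text = "
1 , 45 , 32, 45, 23
2 , 34
4, 11, 43, 45
", sep = ",", fill = TRUE, strip.white = TRUE)

# Name of the namespace that the function found as 'reshape' lives in
reshape_home <- environmentName(environment(reshape))

# No library(reshape) needed: stats::reshape does the conversion
dtanew <- stats::reshape(dta, varying = 2:5, v.names = "brand",
                         idvar = "V1", direction = "long")[, c(1, 3)]
dtanew <- dtanew[complete.cases(dtanew), ]

# Cross-tabulate user ids against brand ids
tab <- table(id = dtanew$V1, brand = dtanew$brand)
```

Printing `reshape_home` and `tab` shows where the function was found and the resulting user-by-brand counts.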

-- 
David.

>  dtanew1<-dtanew[complete.cases(dtanew),]
>  dtanew1<-dtanew1[order(dtanew1$V1),]
>  colnames(dtanew1)[1]<-"id"
>  table(dtanew1$id,dtanew1$brand)
>    
>  #   11 23 32 34 43 45
>  # 1  0  1  1  0  0  2
>  # 2  0  0  0  1  0  0
>  # 4  1  0  0  0  1  1
> 
> 
> A.K.
> 
> ----- Original Message -----
> From: David Carlson <dcarlson at tamu.edu>
> To: 'Raphael Bauduin' <rblists at gmail.com>; r-help at r-project.org
> Cc: 
> Sent: Tuesday, November 13, 2012 5:38 PM
> Subject: Re: [R] help formatting data for clustering
> 
> This is easier if you read the data into a list instead of creating a
> data frame, since the number of values on each row is different. You may
> be able to modify this to fit your needs. The steps are: 1) read the file
> with readLines(); 2) split the lines into numeric vectors (one for each
> line); 3) repeat the first column (id) once for each brand in the line
> and build a data.frame with column names; 4) use table() to count how
> many times each brand appears for each id; 5) cluster using the table
> or, if necessary, convert it to a data frame (this will add X to the
> front of each brand number, since syntactically valid column names
> cannot begin with a digit).
> 
> dta <- readLines(con=stdin(), n=3)
> 1 , 45 , 32, 45, 23
> 2 , 34
> 4, 11, 43, 45
> 
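> # split each line on commas; as.numeric() ignores the surrounding spaces
> lst <- strsplit(dta, ",")
> # lapply() always returns a list, even if every line has the same length
> lst <- lapply(lst, as.numeric)
> # pair the id (first value) with each brand on the same line
> a <- lapply(seq_along(lst), function(x) cbind(rep(lst[[x]][[1]],
>       length(lst[[x]])-1), lst[[x]][-1]))
> a <- data.frame(do.call(rbind, a))
> colnames(a) <- c("id", "brand")
> # rows are ids, columns are brands, cells are counts (0 when absent)
> newdat <- table(a$id, a$brand)
> newdf <- data.frame(unclass(newdat))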
> 
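
Step 5 above is described but not shown. A minimal sketch of it follows, assuming Euclidean distance and average-linkage hierarchical clustering from the stats package. Those choices, the cut at k = 2 and the names `m`, `d`, `hc` and `groups` are illustrative assumptions, not something proposed in the thread.

```r
# Rebuild the user-by-brand count table from the three example lines
dta <- c("1 , 45 , 32, 45, 23", "2 , 34", "4, 11, 43, 45")
lst <- lapply(strsplit(dta, ","), as.numeric)
a <- data.frame(id    = rep(sapply(lst, `[`, 1), lengths(lst) - 1),
                brand = unlist(lapply(lst, `[`, -1)))
newdat <- table(a$id, a$brand)

# Plain matrix of counts: users in rows, brands in columns
m <- unclass(newdat)

# Euclidean distances between users, then average-linkage clustering
d  <- dist(m)
hc <- hclust(d, method = "average")

# Cut the tree into two groups (k = 2 is arbitrary for this toy data)
groups <- cutree(hc, k = 2)
```

`plot(hc)` draws the dendrogram, and `groups` is a vector of cluster labels named by user id.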
> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Raphael Bauduin
> Sent: Tuesday, November 13, 2012 4:47 AM
> To: r-help at r-project.org
> Subject: [R] help formatting data for clustering
> 
> Hi,
> 
> I'm an R beginner. I have data of this form:
> 
> user_id, brand_id1, brand_id2, .....
> 
> for example:
> 1 , 45 , 32, 45, 23
> 2 , 34
> 4, 11, 43, 45
> 
> I'm looking for the right procedure to be able to cluster users. I am
> especially interested to know which functions to use at each step.
> 
> I am currently able to load the data in a data frame, each row's name
> being the user id.
> 
> # extract user brands, i.e. all columns except the first
> user_brands <- userclustering[,-1]
> 
> # extract user ids, i.e. the first column
> user_ids  <- userclustering[,1]
> 
> # set user ids as row name
> row.names(user_brands) <- user_ids
> 
> But now I'm stuck replacing the brand ids by a count for each brand the
> user ordered, all other brand counters being implicitly 0 for that
> user.
> 
> Then I'll need to be sure I can use it for clustering (normalising,
> correct handling of brands absent from a user's list, etc).
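
On the normalising and absent-brand question, a rough sketch under stated assumptions: table() already records a brand a user never ordered as an explicit 0, and two common ways to put users on a comparable footing are row proportions or a presence/absence distance. Which one suits the data is a judgment call. The data frame `orders` and the names `counts`, `props` and `d_binary` are made up for this example.

```r
# Toy orders in long form: one row per (user, brand) order,
# matching the example data in the original question
orders <- data.frame(id    = c(1, 1, 1, 1, 2, 4, 4, 4),
                     brand = c(45, 32, 45, 23, 34, 11, 43, 45))

# table() fills every user/brand pair that never occurs with a 0
counts <- unclass(table(orders$id, orders$brand))

# Option 1: row proportions, so heavy and light buyers are comparable
props <- prop.table(counts, margin = 1)

# Option 2: ignore the counts and compare users on presence/absence only
d_binary <- dist((counts > 0) * 1, method = "binary")
```

Either `props` or `d_binary` can then be passed on to a clustering function such as hclust() (via dist() in the case of `props`).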
> 
> thanks in advance for your help!
> 
> Raph
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Alameda, CA, USA