[R] ggplot2 / reshape / Question on manipulating data

hadley wickham h.wickham at gmail.com
Thu Jul 12 09:35:27 CEST 2007


On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
> I'm an R newbie but recently discovered the ggplot2 and reshape
> packages which seem incredibly useful and much easier to use for a
> beginner.  Using the data from the IMDB, I'm trying to see how the
> average movie rating varies by year.  Here is what my data looks like:
>
> > ratings <- read.delim("groomed.list", header = TRUE, sep = "|", comment.char = "")
> > ratings <- subset(ratings, VoteCount > 100)
> > head(ratings)
>                              Title  Histogram VoteCount VoteMean Year
> 1                !Huff (2004) (TV) 0000000016       299      8.4 2004
> 8              'Allo 'Allo! (1982) 0000000125       829      8.6 1982
> 50              .hack//SIGN (2002) 0000001113       150      7.0 2002
> 56            1-800-Missing (2003) 0000000103       118      5.4 2003
> 66  Greatest Artists (2000) (mini) 00..000016       110      7.8 2000
> 77 00 Scariest Movie (2004) (mini) 00..000115       256      8.6 2004

Have you tried using the movies dataset included in ggplot2?  Or is
there some data that you want that is not in that dataset?

> The above data is not aggregated.  So after playing around with basic
> R functionality, I stumbled across the 'aggregate' function and was
> able to see the information in the manner I desired (average movie
> rating by year).
>
> > byYear <- aggregate(ratings$VoteMean, list(Year = ratings$Year), mean)
> > plot(byYear)
>
> Having just discovered ggplot2, I wanted to create the same graph but
> augment it with a color attribute based on the total number of votes
> in a year.  So first I tried to see if I could reproduce the above:
>
> > library(ggplot2)
> > qplot(Year, x, byYear)
>
> This did not work as expected because the x-axis contained labels for
> each and every year making it impossible to read whereas the plot
> created with basic R had nice x-axis labels.  How do I get 'qplot' to
> treat the x-axis in a similar manner to 'plot'?

The problem is probably that Year is a factor - and every level of a
factor gets its own axis label (even if the labels overlap - which is
a bug).  There's no terribly easy way to fix this, but converting the
factor back to numbers via its labels will work:

qplot(as.numeric(as.character(Year)), x, data=byYear)
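
To illustrate why the as.character step matters, here is a minimal
sketch with a made-up factor (the values are invented for illustration,
not taken from the IMDB data).  Calling as.numeric directly on a factor
returns the internal level codes rather than the years:

```r
# Toy factor standing in for byYear$Year (values are made up)
year_f <- factor(c("1982", "2000", "2004"))

# Direct conversion gives the internal level codes, not the labels
codes <- as.numeric(year_f)

# Going through the character labels recovers the actual years
years <- as.numeric(as.character(year_f))
```

Plotting codes instead of years would still give a numeric axis, but
it would run from 1 to the number of levels rather than over the years.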

> After playing around further, I was able to get 'qplot' to work in a
> manner similar to 'plot' with regards to the x-axis labels by using
> 'melt' and 'cast'.  The 'qplot' now behaves correctly:
>
> > mratings <- melt(ratings, id = c("Title", "Year"), measure = c("VoteCount", "VoteMean"))
> > byYear2 <- cast(mratings, Year ~ variable, mean, subset = variable == "VoteMean")
> > qplot(Year, VoteMean, data = byYear2)
>
> How do 'byYear' and 'byYear2' differ?  I am trying to use 'typeof' but
> both seem to be lists.  However, they are clearly different in some
> way because 'qplot' graphs them differently.

Try using str - it's much more helpful than typeof, and you should
see the difference quickly.
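
As a rough sketch of what to look for, assuming the difference really
is factor versus numeric (the two data frames below are made up, and
the real byYear and byYear2 may differ in other ways too):

```r
# Two toy data frames that print the same but store Year differently
a <- data.frame(Year = factor(c("1982", "2004")), x = c(8.6, 8.4))
b <- data.frame(Year = c(1982, 2004), x = c(8.6, 8.4))

# Every data frame is a list underneath, so typeof cannot tell them apart
type_a <- typeof(a)
type_b <- typeof(b)

# str shows the class of each column: Year is a Factor in a, num in b
str(a)
str(b)
```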

> Finally, I'd like to use a color attribute to 'qplot' to augment each
> point with a color based on the total number of votes for the year.
> Using attributes with 'qplot' seems simple, but I'm having a hard time
> grooming my data appropriately.  I believe this requires aggregation
> by summing the VoteCount column.  Is there a way to cast the data
> using different aggregation functions for various columns?  In my

Not easily, unfortunately.  However, you could do:

cast(mratings, Year ~ variable, c(mean, sum), subset = variable %in%
c("VoteMean", "VoteCount"))

which will give you a mean and sum for both.
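
If only one summary per column is needed, one possible workaround is
to skip reshape and combine two aggregate calls in base R.  This is
only a sketch: the data below is made up, and the name byYear3 is
hypothetical.

```r
# Made-up ratings data with the same column names as in the question
ratings <- data.frame(
  Year      = c(2003, 2004, 2004),
  VoteCount = c(118, 299, 256),
  VoteMean  = c(5.4, 8.4, 8.6)
)

# One aggregate() call per summary function
means <- aggregate(VoteMean ~ Year, data = ratings, FUN = mean)
sums  <- aggregate(VoteCount ~ Year, data = ratings, FUN = sum)

# Join the two results on Year: one row per year, one column per summary
byYear3 <- merge(means, sums, by = "Year")
```

Something like qplot(Year, VoteMean, data = byYear3, colour = VoteCount)
should then give the coloured plot.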

> case, I want the mean of the VoteMean column, and the sum of the
> VoteCount column.  Then I want to produce a graph showing the average
> movie rating per year but with each point colored to reflect the total
> number of votes for that year.  Any pointers?

Using the built-in movies data:

mm <- melt(movies, id=1:2, m=c("rating", "votes"))
msum <- cast(mm, year ~ variable, c(mean, sum))

qplot(year, rating_mean, data=msum, colour=votes_sum)
qplot(year, rating_mean, data=msum, colour=votes_sum, geom="line")

Hadley


