[R] Data manipulation problem

David Winsemius dwinsemius at comcast.net
Tue Apr 6 21:30:49 CEST 2010


On Apr 6, 2010, at 9:56 AM, moleps islon wrote:

> OK... next question.. Which is still a data manipulation problem so I
> believe the heading is still OK.
>
> ##So now I read my population data from excel.

No, you read it from a text file and providing the first ten lines of  
that text file should have been really easy. Read the Posting Guide  
for advice about offering datasets either as structure() objects with  
dput or dump or as attached files with "*.txt" extension (not .csv).  
Just change the file name with your file browser.
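
A minimal sketch of the dput() route, assuming a small made-up data frame
in place of the poster's real data (which was never shown). The object
name pop and its columns here are hypothetical:

```r
# Hypothetical stand-in for the poster's data; the real pop.csv was never posted.
pop <- data.frame(age   = c("0-4", "5-9"),
                  X1953 = c(120L, 115L),
                  X1954 = c(122L, 117L))

# dput() writes an R expression that rebuilds the object; paste that text
# into the email so readers can recreate the first rows exactly.
txt <- capture.output(dput(head(pop, 10)))
cat(txt, sep = "\n")

# A reader recreates the object by evaluating the pasted text.
pop2 <- eval(parse(text = txt))
```

The reader only has to assign the pasted structure(...) expression to a
name to get a working copy of the data.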

> pop<-read.csv("pop.csv")
>
> typeof(pop) ## yields a list

Really? I would have guessed it to yield just "list".

> where I have age-specific population rows
> and a yearly column population, where the years are suffixed by X

And had you used class(pop) you would have learned it was a dataframe  
and even more informative would have been str(pop).
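
To illustrate the difference, here is a short sketch on a made-up data
frame (the column names and values are assumptions, not the poster's data):

```r
# Made-up two-column data frame standing in for the result of read.csv().
pop <- data.frame(X1953 = c(10, 20), X1954 = c(11, 21))

typeof(pop)  # storage mode: a data frame is stored as a list
class(pop)   # the class attribute, which is what most functions dispatch on
str(pop)     # compact summary: dimensions, column names, column types
```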
>
> c<-(1953:2008)

No, no, no. Do not use variable names that are important function  
names. The R interpreter can (usually) keep things straight but it is  
our brains that experience problems.  Other  function names to avoid:  
data, df, cut, mean, sd, list, vector, matrix

> names(pop)<-c
> c.div<-cut(c,break=seq(1950,2010,by=5)

(You should have gotten an error here.) After fixing the error, did
you notice that there were only 3 of the first level???

Watch out for cut(). Its default convention is ( , ] , i.e. intervals
open on the left and closed on the right, which is backwards to what some
(most?) of us think natural. Because of that, a value equal to the lowest
break gets dropped (it becomes NA) unless you take special precautions
such as include.lowest=TRUE. That is undoubtedly why Harrell set up
his Hmisc::cut2 to have the default be [ , )
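
A small sketch of the two conventions using base cut() only, on the same
year range as in the question. The variable names yrs, brks, right.closed
and left.closed are made up for this illustration:

```r
yrs  <- 1953:2008
brks <- seq(1950, 2010, by = 5)

right.closed <- cut(yrs, breaks = brks)                 # default right = TRUE: (a, b]
left.closed  <- cut(yrs, breaks = brks, right = FALSE)  # [a, b)

table(right.closed)[1]  # first bin (1950,1955] holds 1953, 1954, 1955
table(left.closed)[1]   # first bin [1950,1955) holds 1953, 1954

# A value sitting exactly on the lowest break falls outside every (a, b] bin ...
cut(1950, breaks = brks)
# ... unless include.lowest = TRUE closes the first interval on the left as well.
cut(1950, breaks = brks, include.lowest = TRUE)
```

With right = FALSE the bins line up with the usual 1950-1954, 1955-1959,
... reading of five-year periods.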

Aggregating across columns? Certainly possible, but maybe not as  
natural a fit to functions like split as would occur with working  
across rows. I suppose you could use something like this untested  
(because _still_ no sample dataset provided) code:

apply(pop, 1,    # this works a row at a time
      function(x) tapply(x, list(c.div), sum))  # tapply does the aggregation

I'm not sure it will work, since I don't know if the column names  
would get carried over into "x" by apply(). You might need to create a  
separate index that used the numeric positions of the columns rather  
than their names. Perhaps use c.div <-  seq(0,(2008-1953)) %/% 5  or  
some such inside tapply.
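
Since no sample data was posted, the following is only a sketch on fake
data: three age groups as rows and yearly columns named X1953 ... X2008.
The names pop, yr, yr.bin and pop5, the row labels, and the choice of
[ , ) bins are all assumptions made for the illustration:

```r
# Fake data: 3 age groups (rows) by 56 yearly columns named X1953 ... X2008.
set.seed(42)
years <- 1953:2008
pop <- as.data.frame(matrix(sample(1000:2000, 3 * length(years), replace = TRUE),
                            nrow = 3))
names(pop) <- paste0("X", years)
rownames(pop) <- c("0-4", "5-9", "10-14")

# Recover the numeric year from the column names, then bin into [a, b) intervals.
yr     <- as.integer(sub("^X", "", names(pop)))
yr.bin <- cut(yr, breaks = seq(1950, 2010, by = 5), right = FALSE)

# apply() hands each row to tapply(), which sums within each 5-year bin.
# The result has bins as rows, so transpose to get age groups as rows again.
pop5 <- t(apply(as.matrix(pop), 1, function(x) tapply(x, yr.bin, sum)))
pop5
```

Because the grouping factor is built from the years themselves rather
than looked up through the names carried in by apply(), it does not
matter whether apply() preserves the column names.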

>
> Now I'd like to sum the agespecific population over the individual
> levels of -c.div- and generate a new table for this with agespecific
> rows and columns containing the 5-year bins instead of the original
> yearly data. Do I have to program this from scratch or is it possible
> to use an already existing function?

I think you ought to read more introductory material (and the Posting  
Guide regarding how to offer example datasets). In this case there are  
many functions that do data aggregation and most of them should be  
illustrated in a good introductory text.

-- 
David.
>
>
> //M
>
> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE
>
> On Mon, Apr 5, 2010 at 10:11 PM, moleps <moleps2 at gmail.com> wrote:
>>
>> Thx Erik,
>> I have no idea what went wrong with the other code snippet, but  
>> this one works.. Appreciate it.
>>
>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =  
>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))
>>
>> M
>>
>>
>> On 5. apr. 2010, at 21.45, Erik Iverson wrote:
>>
>>> I don't know what your data are like, since you haven't given a  
>>> reproducible example. I was imagining something like:
>>>
>>> ## generate fake data
>>> age <- sample(20:90, 100, replace = TRUE)
>>> year <- sample(1950:2000, 100, replace = TRUE)
>>>
>>> ##look at big table
>>> table(age, year)
>>>
>>> ## categorize data
>>> ## see include.lowest and right arguments to cut
>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10),
>>>                 include.lowest = TRUE)
>>>
>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
>>>                  include.lowest = TRUE)
>>>
>>> table(age.factor, year.factor)
>>>
>>> moleps wrote:
>>>> I already did try the regression modeling approach. However, the
>>>> epidemiologist (referee) turns out to be quite fond of comparing
>>>> the incidence rates to different standard populations, hence the
>>>> need for this laborious approach. And trying the "cutting"
>>>> approach I ended up with:
>>>>> table (age5)
>>>> age5
>>>>    (0,5]   (5,10]  (10,15]  (15,20]  (20,25]  (25,30]
>>>>       35       34       33       47       51      109
>>>>  (30,35]  (35,40]  (40,45]  (45,50]  (50,55]  (55,60]
>>>>      157      231      362      511      745      926
>>>>  (60,65]  (65,70]  (70,75]  (75,80]  (80,85] (85,100]
>>>>     1002      866      547      247       82       18
>>>>> table (yr5)
>>>> yr5
>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970]
>>>>           3           5           5           5
>>>> (1970,1975] (1975,1980] (1980,1985] (1985,1990]
>>>>           5           5           5           5
>>>> (1990,1995] (1995,2000] (2000,2005] (2005,2009]
>>>>           5           5           5           3
>>>>> table (yr5,age5)
>>>> Error in table(yr5, age5) : all arguments must have the same length
>>>> Sincerely,
>>>> M
>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote:
>>>>> You have tempted, and being weak, I yield to temptation:
>>>>>
>>>>> "Any good ideas?"
>>>>>
>>>>> Yes. Don't do this.
>>>>>
>>>>> (what you probably really want to do is fit a model with age as a factor,
>>>>> which can be done statistically e.g. by logistic regression; or graphically
>>>>> using conditioning plots, e.g. via trellis graphics (the lattice package).
>>>>> This avoids the arbitrariness and discontinuities of binning by age range.)
>>>>>
>>>>> Bert Gunter
>>>>> Genentech Nonclinical Biostatistics
>>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of moleps
>>>>> Sent: Monday, April 05, 2010 11:46 AM
>>>>> To: r-help at r-project.org
>>>>> Subject: [R] Data manipulation problem
>>>>>
>>>>> Dear R'ers.
>>>>>
>>>>> I've got a dataset with age and year of diagnosis. In order to
>>>>> age-standardize the incidence I need to transform the data into a matrix
>>>>> with age-groups (divided in 5 or 10 years) along one axis and year divided
>>>>> into 5 years along the other axis. Each cell should contain the number of
>>>>> cases for that age group and for that period.
>>>>> I.e.
>>>>> My data format now is
>>>>> ID-age (to one decimal)-year(yearly data).
>>>>>
>>>>> What I'd like is
>>>>>
>>>>> age 1960-1965 1966-1970 etc...
>>>>> 0-5 3 8 10 15
>>>>> 6-10 2 5 8 13
>>>>> etc..
>>>>>
>>>>>
>>>>> Any good ideas?
>>>>>
>>>>> Regards,
>>>>> M
>


David Winsemius, MD
West Hartford, CT


