[R] Selecting Variables
jim holtman
jholtman at gmail.com
Tue Aug 5 20:07:00 CEST 2008
I think that you have to be a little more explicit with a description
of your data. I am not clear as to what this means:
> There are lots of variables between each exposure and the values are nominal
> with up to 6 values.
Can you provide a more complete description? How many exposure
columns are there in your data? How many unique IDs? Depending on
these answers, you can probably read in a portion of your 5GB
database at a time, summarize each portion, and then aggregate the
summaries at the end, since I would expect the length of the
aggregated data to be just the number of unique IDs.
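As a rough sketch of that idea (untested against your real data, so treat it as a starting point only): the code below reads a whitespace-delimited file a few rows at a time, counts the non-missing "Exposure_" columns per row, and sums the partial counts per ID at the end. The function name count_exposures, the chunk_rows argument, and the temporary demo file are all made up for illustration, and I am assuming that '-' marks a missing value as in your example.

```r
# Hypothetical helper: count non-missing "Exposure_" columns per ID,
# reading the file in small chunks instead of all at once.
count_exposures <- function(path, chunk_rows = 2, na = "-") {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once so every chunk gets the same column names.
  cols <- strsplit(readLines(con, n = 1), "[[:space:]]+")[[1]]
  exposure_cols <- grep("^Exposure_", cols, value = TRUE)
  pieces <- list()
  repeat {
    # read.table() signals an error once the input is exhausted; this
    # sketch treats any error as end of input, which is crude.
    chunk <- tryCatch(
      read.table(con, nrows = chunk_rows, col.names = cols,
                 na.strings = na, stringsAsFactors = FALSE),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    # Keep only the per-row summary, not the chunk itself.
    pieces[[length(pieces) + 1]] <- data.frame(
      ID = chunk$ID,
      Unique_Exposure = rowSums(!is.na(chunk[, exposure_cols, drop = FALSE]))
    )
  }
  out <- do.call(rbind, pieces)
  # An ID could appear in more than one row or chunk, so sum per ID.
  out <- aggregate(Unique_Exposure ~ ID, data = out, FUN = sum)
  out$Max_Exposure <- length(exposure_cols)
  out
}

# Demo on the small example from the original post.
path <- tempfile(fileext = ".txt")
writeLines(c("ID Exposure_1 Exposure_2 Exposure_3",
             "1 y y y",
             "2 y y -",
             "3 y - -"), path)
res <- count_exposures(path, chunk_rows = 2)
print(res)
```

Only the small per-row summaries are kept in memory, so the cost is driven by chunk_rows and the number of unique IDs rather than by the size of the file. Whether that is good enough depends on the answers to the questions above.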
On Tue, Aug 5, 2008 at 11:54 AM, Michael Pearmain <mpearmain at google.com> wrote:
> Thanks for the help guys,
>
> I think I needed to be a bit more explicit, however (sorry).
>
> There are lots of variables between each exposure and the values are nominal
> with up to 6 values.
> And to add to the problem, the datasets I deal with range in size up to
> 5G.
>
> My guess is that the melt function would be inefficient in this situation.
>
> I was looking at the agrep function to count the number of Exposures in
> names(). I wasn't sure how to count whether there was a value in each one,
> but y[complete.cases(y),] looks like a nice approach.
>
> Is this a good path to follow?
>
>
>
>
> On Tue, Aug 5, 2008 at 3:09 PM, jim holtman <jholtman at gmail.com> wrote:
>>
>> I am not sure where the "Max" comes from, but this might be a start for
>> you:
>>
>> > x <- read.table(textConnection("ID Exposure_1 Exposure_2 Exposure_3
>> + 1 y y y
>> + 2 y y -
>> + 3 y - -"), header=TRUE, na.strings='-')
>> > closeAllConnections()
>> > require(reshape)
>> > y <- melt(x, id.vars='ID')
>> > # get rid of NAs
>> > y <- y[complete.cases(y),]
>> > y
>> ID variable value
>> 1 1 Exposure_1 y
>> 2 2 Exposure_1 y
>> 3 3 Exposure_1 y
>> 4 1 Exposure_2 y
>> 5 2 Exposure_2 y
>> 7 1 Exposure_3 y
>> > cbind(Unique=tapply(y$ID, y$ID, length))
>> Unique
>> 1 3
>> 2 2
>> 3 1
>> >
>>
>>
>> On Tue, Aug 5, 2008 at 9:21 AM, Michael Pearmain <mpearmain at google.com>
>> wrote:
>> > Hi All,
>> >
>> > I have a dataset that I want to dynamically inspect for the number of
>> > variables that start with "Exposure_" and then, for these, count the
>> > entries across each case, i.e.
>> >
>> > ID Exposure_1 Exposure_2 Exposure_3
>> > 1 y y y
>> > 2 y y -
>> > 3 y - -
>> >
>> > So the corresponding new variables that would be created are
>> >
>> > ID Max_Exposure Unique_Exposure
>> > 1 3 3
>> > 2 3 2
>> > 3 3 1
>> >
>> > I know this may seem fairly basic but it will give me the starting point
>> > to
>> > develop more advanced things with loop and nat lang
>> >
>> > Thanks in advance
>> >
>> > Mike
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>
>
>
> --
> Michael Pearmain
> Senior Statistical Analyst
>
>
> 1st Floor, 180 Great Portland St. London W1W 5QZ
> t +44 (0) 2032191684
> mpearmain at google.com
> mpearmain at doubleclick.com
>
>
> Doubleclick is a part of the Google group of companies
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?