[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sun Jul 3 18:05:02 CEST 2016


It is hard to imagine a use for putting character representations of dates in certain rows of a column. Your goal of identifying start and end dates seems reasonable enough, though. It can be accomplished using aggregate from base R (fewer external dependencies) or summarise from dplyr (faster, simpler syntax):

starts <- aggregate( date~ID, data=drug_study, FUN=min )
ends   <- aggregate( date~ID, data=drug_study, FUN=max )
result <- setNames( data.frame( starts, ends[2] ), c( "ID", "start", "end" ) )

or

library( dplyr )
result <- (   drug_study
          %>% group_by( ID )
          %>% summarise( start=min( date ), end=max( date) )
           )
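
To make the suggestion above concrete, here is a minimal sketch in base R on a small simulated data set. The column names (ID, date, drug_admin) follow the script quoted below; the simulated values, the restriction to rows with drug_admin == "Y", and the choice to fill study_id with the start date for rows in [start, end) are my assumptions based on the description in the original question, not something tested against the real data.

```r
# Simulated stand-in for drug_study (all values invented for illustration)
drug_study <- data.frame(
  ID = rep( c( "J1/3", "R1/3" ), each = 4 ),
  date = as.Date( c( "2007-04-10", "2007-05-02", "2007-11-20", "2008-02-14",
                     "2007-05-09", "2007-08-01", "2008-02-20", "2008-03-05" ) ),
  drug_admin = c( "N", "Y", "N", "Y",
                  "Y", "N", "Y", "N" ),
  stringsAsFactors = FALSE
)

# Assumption: only rows flagged drug_admin == "Y" mark start and end of treatment
admin  <- drug_study[ drug_study$drug_admin == "Y", ]
starts <- aggregate( date ~ ID, data = admin, FUN = min )
ends   <- aggregate( date ~ ID, data = admin, FUN = max )
result <- setNames( data.frame( starts, ends[ 2 ] ), c( "ID", "start", "end" ) )

# Look up each row's start/end by ID; match() keeps the original row order
idx <- match( drug_study$ID, result$ID )
in_window <- !is.na( idx ) &
  drug_study$date >= result$start[ idx ] &
  drug_study$date <  result$end[ idx ]

# Fill study_id with the start date for rows inside [start, end), else ""
drug_study$study_id <- ifelse( in_window, format( result$start[ idx ] ), "" )
print( drug_study )
```

The lookup is vectorised, so there is no explicit loop over rows or individuals; an ID without any "Y" rows simply gets an empty study_id instead of triggering an error. On the real data a join (merge in base R, or left_join in dplyr) would do the same lookup.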

-- 
Sent from my phone. Please excuse my brevity.

On July 3, 2016 5:19:01 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>Hi John, attached is the file in txt. Kindly let me know if it fails
>again..
>
>Regards
>-------------------------------------------------------------------------------
>Kevin Wamae | Ph.D. Student (IDeAL)
>KEMRI-Wellcome Trust Collaborative Research Programme
>Centre for Geographic Medicine Research
>P.O. Box 230-80108, Kilifi, Kenya
> 
>
>On 7/3/16, 3:16 PM, "John Kane" <jrkrideau at inbox.com> wrote:
>
>The data set did not show up. The R-help list tends to strip out most
>file types as a safety precaution.  Try renaming the file from xxx.csv
>to xxx.txt and it should come through alright.
>
>
>
>John Kane
>Kingston ON Canada
>
>
>> -----Original Message-----
>> From: kwamae at kemri-wellcome.org
>> Sent: Sun, 3 Jul 2016 09:39:59 +0000
>> To: jdnewmil at dcn.davis.ca.us, r-help at r-project.org
>> Subject: Re: [R] R - Populate Another Variable Based on Multiple
>> Conditions | For a Large Dataset
>> 
>> Hi Jeff, pardon me, I was surely not making it easy. I hope this time I
>> will ☺
>> 
>> Attached is a snippet of the dataset in csv format and below is the
>> R script I have managed so far.
>> 
>> ----------------------------------------------------------------------
>> 
>> drug_study <- read.csv("drug_study.csv", header = T); head(drug_study)
>> drug_study$date <- as.Date(drug_study$date, "%m/%d/%Y")
>> drug_study$study_id <- ""  #create new column
>> 
>> individual <- unique (drug_study$ID)  #vector of individuals
>> datalength <- dim(drug_study)[1]      #number of rows in dataframe
>> 
>> for (i in 1:length(individual)) {
>>   for (j in 1:datalength) {
>>     start_admin <- drug_study[c(drug_study$ID == individual[i] &
>>       drug_study$year == 2007 & drug_study$drug_admin == "Y" &
>>       drug_study$month == 5),2]  #capture date of start
>>     end_admin <- drug_study[(drug_study$ID == individual[i] &
>>       drug_study$year == 2008 & drug_study$drug_admin == "Y" &
>>       drug_study$month == 2),2]    #capture date of end
>> 
>>     if(drug_study[j,1] == individual[i] & drug_study[j,2] >= start_admin &
>>        drug_study[j,2] < end_admin) {
>>       drug_study[j,6] <- paste(start_admin) #populate respective row if condition is met
>>     }
>>   }
>> }
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> 
>> For this dataset, there exist three individuals: J1/3, R1/3, R10/1.
>> 
>> The script works for the last two individuals but not for J1/3, which
>> fails with the error below:
>> 
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> Error in if (drug_study[j, 1] == individual[i] & drug_study[j, 2] >=
>> start_admin &  :
>>   argument is of length zero
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> 
>> I figured it’s because this individual’s start_admin and end_admin dates
>> aren’t captured, so the if statement fails. That is my first problem:
>> there are thousands of individuals with varying start_admin and end_admin
>> dates and I need a script to capture these for every individual.
>> 
>> Secondly, the above script is taking almost an hour to run for the entire
>> dataset, just for the individuals whose start_admin and end_admin dates
>> can be captured by the if statement.
>> 
>> I need help in coming up with a script that will tackle the problem,
>> taking into account the different start_admin and end_admin dates, and
>> be efficient with regard to time.
>> 
>> Regards
>> -------------------------------------------------------------------------------
>> Kevin Kariuki
>> 
>> ######################################################################
>> 
>> On 7/3/16, 8:42 AM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>> 
>> You are making this hard on yourself by not paying attention to the
>> Posting Guide listed in the footer of every email on this list. You would
>> probably also find [1] helpful.
>> 
>> [1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> --
>> Sent from my phone. Please excuse my brevity.
>> 
>> On July 2, 2016 3:41:07 PM PDT, Kevin Wamae <KWamae at kemri-wellcome.org>
>> wrote:
>> >Hi Jeff, sorry for referring to you as Jennifer earlier, accept my
>> >apologies.
>> >
>> >I attached a sample dataset in the question, am afraid it must have
>> >failed to attach.
>> >
>> >I have attached it again..
>> >
>> >
>> >Regards
>> >-------------------------------------------------------------------------------
>> >Kevin Kariuki
>> >
>> >
>> >On 7/2/16, 7:37 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>> >
>> >I can understand you not wanting to supply your actual data online, but
>> >only you know what your data looks like so only you can create a
>> >simulated data set that we could show you how to work with.
>> >--
>> >Sent from my phone. Please excuse my brevity.
>> >
>> >On July 2, 2016 2:57:39 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org>
>> >wrote:
>> >>I have a drug-trial study dataset (attached image).
>> >>
>> >>Since it's a large and complex dataset (at least to me), I hope to be
>> >>as clear as possible with my question.
>> >>The dataset is from a study where individuals are given drugs and
>> >>followed up over a period spanning two consecutive years. Individuals
>> >>do not start treatment on the same day and once they start, the
>> >>variable "drug-admin" is marked "x" as well as the time they stop
>> >>treatment in the following year.
>> >>There exists another variable, "study_id", that I hope to populate as
>> >>can be seen in the dataset, with the following conditions:
>> >>
>> >>For every individual
>> >>•    if the individual has entries that show they received drugs both
>> >>on the start and end date (marked with the "x")
>> >>•    if the start of drug administration falls in month == 2 | 3 and
>> >>end of administration falls in month == 2 | 4
>> >>•    then, using the date that marks the start of drug administration,
>> >>populate the variable "study_id" in all the rows that fall within the
>> >>timeframe that the individual was given drugs but excluding the end of
>> >>drug administration.
>> >>I have tried my level best and while I have explored several examples
>> >>online, I haven't managed to solve this. The dataset contains close to
>> >>6000 individuals spanning 10 years and my best bet was to use a loop
>> >>which keeps crashing R after running for close to 30min. I have also
>> >>read that dplyr may do the job but my attempts have been in vain.
>> >>
>> >>sample code
>> >>----------------------------------------------------------------------
>> >>individual <- unique (df$ID)  #vector of individuals
>> >>datalength <- dim(df)[1]      #number of rows in dataframe
>> >>
>> >>for (i in 1:length(individual)) {
>> >>  for (j in 1:datalength) {
>> >>start_admin <- df[(df$year == 2007] & df$drug_admin == "x" & c(df$month
>> >>== 2 | df$month == 3),1]  #capture date of start
>> >>end_admin <- df[(df$year == 2008] & df$drug_admin == "x" & c(df$month
>> >>== 2 | df$month == 4),1]    #capture date of end
>> >>
>> >>if(df[datalength,1] == individual(i) & df[datalength,2] >= start_admin
>> >>& df[datalength,2] < end_admin) {
>> >>df[datalength,6] <- start_admin #populate respective row if condition is met
>> >>      }
>> >>    }
>> >>  }
>> >>
>> >>----------------------------------------------------------------------
>> >>
>> >>Above is the code that keeps failing..
>> >>
>> >>Any help is highly appreciated....
>>>> 
>>>> 
>>
>>>______________________________________________________________________
>>>> 
>> >>This e-mail contains information which is confidential. It is
>intended
>> >>only for the use of the named recipient. If you have received this
>> >>e-mail in error, please let us know by replying to the sender, and
>> >>immediately delete it from your system.  Please note, that in these
>> >>circumstances, the use, disclosure, distribution or copying of this
>> >>information is strictly prohibited. KEMRI-Wellcome Trust Programme
>> >>cannot accept any responsibility for the  accuracy or completeness
>of
>> >>this message as it has been transmitted over a public network.
>> >Although
>> >>the Programme has taken reasonable precautions to ensure no viruses
>> >are
>> >>present in emails, it cannot accept responsibility for any loss or
>> >>damage arising from the use of the email or attachments. Any views
>> >>expressed in this message are those of the individual sender,
>except
>> >>where the sender specifically states them to be the views of
>> >>KEMRI-Wellcome Trust Programme.
>>
>>>______________________________________________________________________
>>>> 
>>>> 
>>
>>>------------------------------------------------------------------------
>>>> 
>> >>______________________________________________
>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>> >>PLEASE do read the posting guide
>> >>http://www.R-project.org/posting-guide.html
>> >>and provide commented, minimal, self-contained, reproducible code.