[R] how to subset based on other row values and multiplicity

Wed Jul 16 15:07:44 CEST 2014

Hi R experts,

I have a dataset as sampled below. Values are only regarded as 'confirmed'
in an individual ('id') if they occur
more than once at least 30 days apart.

id   date       value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w

I wish to subset the data to retain rows where the value for the
individual is confirmed more than 30 days apart. So, after deleting all
rows with just one occurrence of id and value, the rest would be the
earliest occurrence of each value in each case id, provided 31 or more
days exist between the dates. If >1 value is present per id, each value
level needs to be assessed independently. This example would then reduce
to:

id   date       value
a    2000-01-01 x
c    2000-09-10 y
c    2000-10-11 z
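
To make the rule concrete, here is a minimal base R sketch of the logic on the sample data. It is only an illustration under stated assumptions: the names `dat`, `key`, `first`, `last` and `res` are made up for this example, and the threshold is taken as "more than 30 days" between the earliest and latest occurrence of a value within an id. It has not been tuned or timed for a large dataset.

```r
# Sample data from above; the object name `dat` is just for illustration.
dat <- data.frame(
  id    = c("a", "a", "b", "c", "c", "c", "c", "c", "d", "d"),
  date  = as.Date(c("2000-01-01", "2000-03-01", "2000-11-11", "2000-11-11",
                    "2000-10-01", "2000-09-10", "2000-12-12", "2000-10-11",
                    "2000-11-11", "2000-11-10")),
  value = c("x", "x", "w", "y", "y", "y", "z", "z", "w", "w"),
  stringsAsFactors = FALSE
)

# One grouping key per id/value pair, so each value level within an id
# is assessed independently.
key  <- paste(dat$id, dat$value, sep = "\r")

# Work on day counts so that ave() returns plain numbers.
days  <- as.numeric(dat$date)
first <- ave(days, key, FUN = min)   # earliest date in each group
last  <- ave(days, key, FUN = max)   # latest date in each group

# Assumption: "confirmed" means earliest and latest are more than 30 days apart.
# Keep only the earliest row of each confirmed group.
keep <- (last - first) > 30 & days == first
res  <- dat[keep, ]
res  <- res[order(res$id, res$date), ]
res  <- res[!duplicated(res[c("id", "value")]), ]  # guard against tied earliest dates
rownames(res) <- NULL
print(res)
```

If the rule is really "30 or more days", the comparison `> 30` would become `>= 30`; the sample data give the same three rows either way.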

I can do this via some crude loops and subsetting, but I am looking for as
much efficiency as possible, as the dataset has around 50 million rows to
assess. Any suggestions welcomed.

Thanks in advance

Scott Williams MD
Melbourne, Australia
