[R] remove
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Mon Feb 13 01:31:23 CET 2017
Your question mystifies me, since it looks to me like you already know the answer.
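For reference, here is a minimal sketch of that count for both data sets. It assumes the "two data sets" are the rows kept by the filter and the rows dropped by it, since the thread does not say which second set is meant. The names kept, dropped, n_kept and n_dropped are made up for illustration; DF and err2 follow the quoted message below.

```r
# Rebuild the example data from the quoted message
DF <- read.table( text =
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )

# Number of distinct last names per first name, repeated on every row
err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE ]
           , FUN = function( n ) length( unique( DF[ n, "last" ] ) )
           )

kept    <- DF[ 1 == err2, ]  # one consistent last name (result2 in the thread)
dropped <- DF[ 1 != err2, ]  # assumption: the "second" data set is the removed rows

n_kept    <- length( unique( kept$first ) )     # Bob and Cory
n_dropped <- length( unique( dropped$first ) )  # Alex
```

With the example data this should give 2 first names in the kept rows and 1 in the dropped rows.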
On February 12, 2017 3:30:49 PM PST, Val <valkremk at gmail.com> wrote:
>Hi Jeff and all,
> How do I get the number of unique first names in the two data sets?
>
>for the first one,
>result2 <- DF[ 1 == err2, ]
>length(unique(result2$first))
>
>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
><jdnewmil at dcn.davis.ca.us> wrote:
>> The "by" function aggregates and returns a result with generally fewer rows
>> than the original data. Since you are looking to index the rows in the
>> original data set, the "ave" function is better suited because it always
>> returns a vector that is just as long as the input vector:
>>
>> # I usually work with character data rather than factors if I plan
>> # to modify the data (e.g. removing rows)
>> DF <- read.table( text=
>> 'first week last
>> Alex 1 West
>> Bob 1 John
>> Cory 1 Jack
>> Cory 2 Jack
>> Bob 2 John
>> Bob 3 John
>> Alex 2 Joseph
>> Alex 3 West
>> Alex 4 West
>> ', header = TRUE, as.is = TRUE )
>>
>> err <- ave( DF$last
>> , DF[ , "first", drop = FALSE]
>> , FUN = function( lst ) {
>> length( unique( lst ) )
>> }
>> )
>> result <- DF[ "1" == err, ]
>> result
>>
>> Notice that the ave function returns a vector of the same type as was given
>> to it, so even though the function returns a number, the err vector is
>> character, which is why it is compared against "1" above.
>>
>> If you wanted to be able to examine more than one other column in
>> determining the keep/reject decision, you could do:
>>
>> err2 <- ave( seq_along( DF$first )
>> , DF[ , "first", drop = FALSE]
>> , FUN = function( n ) {
>> length( unique( DF[ n, "last" ] ) )
>> }
>> )
>> result2 <- DF[ 1 == err2, ]
>> result2
>>
>> and then you would have the option to re-use the "n" index to look at other
>> columns as well.
>>
>> Finally, here is a dplyr solution:
>>
>> library(dplyr)
>> result3 <- ( DF
>> %>% group_by( first ) # like a prep for ave or by
>> %>% mutate( err = length( unique( last ) ) ) # similar to ave
>> %>% filter( 1 == err ) # drop the rows with too many last names
>> %>% select( -err ) # drop the temporary column
>> %>% as.data.frame # convert back to a plain-jane data frame
>> )
>> result3
>>
>> which uses a small set of verbs in a pipeline of functions to go from input
>> to result in one pass.
>>
>> If your data set is really big (running out of memory big) then you might
>> want to investigate the data.table or sqlite packages, either of which can
>> be combined with dplyr to get a standardized syntax for managing larger
>> amounts of data. However, most people actually aren't running out of memory
>> so in most cases the extra horsepower isn't actually needed.
>>
>>
>> On Sun, 12 Feb 2017, P Tennant wrote:
>>
>>> Hi Val,
>>>
>>> The by() function could be used here. With the dataframe dfr:
>>>
>>> # split the data by first name and check for more than one last name for
>>> # each first name
>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>> # make the result more easily manipulated
>>> res <- as.table(res)
>>> res
>>> # first
>>> # Alex Bob Cory
>>> # TRUE FALSE FALSE
>>>
>>> # then use this result to subset the data
>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>> # sort if needed
>>> nw.dfr[order(nw.dfr$first) , ]
>>>
>>> first week last
>>> 2 Bob 1 John
>>> 5 Bob 2 John
>>> 6 Bob 3 John
>>> 3 Cory 1 Jack
>>> 4 Cory 2 Jack
>>>
>>>
>>> Philip
>>>
>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>
>>>> Hi all,
>>>> I have a big data set and want to remove rows conditionally.
>>>> In my data file each person was recorded for several weeks. Somehow
>>>> during the recording periods, their last name was misreported. For
>>>> each person, the last name should be the same; otherwise, the person
>>>> should be removed from the data. For example, in the following data
>>>> set, Alex was found to have two last names.
>>>>
>>>> Alex West
>>>> Alex Joseph
>>>>
>>>> Alex should be removed from the data. If this happens, then I want to
>>>> remove all rows with Alex. Here is my data set:
>>>>
>>>> df<- read.table(header=TRUE, text='first week last
>>>> Alex 1 West
>>>> Bob 1 John
>>>> Cory 1 Jack
>>>> Cory 2 Jack
>>>> Bob 2 John
>>>> Bob 3 John
>>>> Alex 2 Joseph
>>>> Alex 3 West
>>>> Alex 4 West ')
>>>>
>>>> Desired output
>>>>
>>>> first week last
>>>> 1 Bob 1 John
>>>> 2 Bob 2 John
>>>> 3 Bob 3 John
>>>> 4 Cory 1 Jack
>>>> 5 Cory 2 Jack
>>>>
>>>> Thank you in advance
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------