[R] Row exclude
David Carlson
dcarlson sending from tamu.edu
Sun Jan 30 05:15:45 CET 2022
It is possible for errors to occur on the same row in different
columns. This does not happen in your example, but if row 4 were "John6, 3BC,
175X" then row 4 would appear 3 times in the index vector, although we only
need to remove it once. Removing the duplicates is not necessary, since R
ignores repeated negative indices, but
length(unique(c(BadName, BadAge, BadWeight))) indicates how many rows are
being removed.
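
A small sketch of that point, using made-up index vectors for the hypothetical "John6, 3BC, 175X" case (the toy data frame and the index values are assumptions for illustration, not taken from your data):

```r
# Toy data frame with six rows; row 4 is assumed bad in all three columns
dat <- data.frame(id = 1:6)
BadName   <- 4L
BadAge    <- 4L
BadWeight <- c(1L, 4L)

idx <- c(BadName, BadAge, BadWeight)   # row 4 is listed three times

# Repeated negative indices have no extra effect, so both forms
# drop the same rows
with_unique    <- dat[-unique(idx), , drop = FALSE]
without_unique <- dat[-idx, , drop = FALSE]
identical(with_unique, without_unique)

# unique() only matters when counting the rows that are removed
length(idx)           # number of flagged cells
length(unique(idx))   # number of distinct rows removed
```

So the result is the same either way; unique() is there for the count, not for the subsetting.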
David
On Sat, Jan 29, 2022 at 8:32 PM Val <valkremk using gmail.com> wrote:
> Thank you David for your help.
>
> I just have one question on this. What is the purpose of using the
> "unique" function on this?
> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>
> I got the same result without using it.
> (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>
> My concern is that when I apply this to a large data set, the "unique"
> function may consume resources (time and memory).
>
> Thank you.
>
> On Sat, Jan 29, 2022 at 12:30 AM David Carlson <dcarlson using tamu.edu> wrote:
>
>> Given that you know which columns should be numeric and which should be
>> character, finding characters in numeric columns or numbers in character
>> columns is not difficult. Your data frame consists of three character
>> columns so you can use regular expressions as Bert mentioned. First you
>> should strip the whitespace out of your data:
>>
>> dat1 <-read.table(text="Name, Age, Weight
>> Alex, 20, 13X
>> Bob, 25, 142
>> Carol, 24, 120
>> John, 3BC, 175
>> Katy, 35, 160
>> Jack3, 34, 140",sep=",", header=TRUE, stringsAsFactors=FALSE,
>> strip.white=TRUE)
>>
>> Now check to see if all of the fields are character as expected.
>>
>> sapply(dat1, typeof)
>> #        Name         Age      Weight
>> # "character" "character" "character"
>>
>> Now identify character variables containing numbers and numeric variables
>> containing characters:
>>
>> BadName <- which(grepl("[[:digit:]]", dat1$Name))
>> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
>> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>>
>> Next remove those rows:
>>
>> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>> #    Name Age Weight
>> # 2   Bob  25    142
>> # 3 Carol  24    120
>> # 5  Katy  35    160
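
One possible variation on the steps above, offered as a sketch rather than a replacement: build a single logical flag per row and subset with it. The name `bad` is introduced here for illustration. This avoids unique() altogether and also sidesteps a pitfall of the which() form, where an empty index vector (no bad rows found) would drop every row.

```r
dat1 <- read.table(text = "Name, Age, Weight
Alex, 20, 13X
Bob, 25, 142
Carol, 24, 120
John, 3BC, 175
Katy, 35, 160
Jack3, 34, 140", sep = ",", header = TRUE, stringsAsFactors = FALSE,
strip.white = TRUE)

# One logical value per row: TRUE if any column fails its check
bad <- grepl("[[:digit:]]", dat1$Name) |
       grepl("[[:alpha:]]", dat1$Age)  |
       grepl("[[:alpha:]]", dat1$Weight)

dat2 <- dat1[!bad, ]   # keep only the rows that passed every check
sum(bad)               # number of rows removed

# Pitfall of negative integer indexing: -integer(0) is still integer(0),
# which selects zero rows instead of keeping all of them
nrow(dat1[-integer(0), ])
```

With the sample data this keeps rows 2, 3 and 5, the same as the which() version.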
>>
>> You still need to convert Age and Weight to numeric, e.g.
>> dat2$Age <- as.numeric(dat2$Age).
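
A minimal sketch of that conversion applied to both columns at once. The data frame is rebuilt by hand here so the snippet runs on its own, and `num_cols` is a name chosen for illustration:

```r
# Cleaned data as it stands after the bad rows are removed (all character)
dat2 <- data.frame(Name   = c("Bob", "Carol", "Katy"),
                   Age    = c("25", "24", "35"),
                   Weight = c("142", "120", "160"),
                   stringsAsFactors = FALSE)

# Columns that are expected to hold numbers
num_cols <- c("Age", "Weight")

# Convert each listed column in place; Name is left untouched
dat2[num_cols] <- lapply(dat2[num_cols], as.numeric)

sapply(dat2, typeof)
anyNA(dat2[num_cols])   # TRUE here would mean some value failed to convert
```

Checking for NA afterwards is a cheap way to confirm that no non-numeric value slipped through the regular expressions.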
>>
>> David Carlson
>>
>>
>> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4567 using gmail.com>
>> wrote:
>>
>>> As character 'polluted' entries will cause a column to be read in (via
>>> read.table and relatives) as factor or character data, this sounds like a
>>> job for regular expressions. If you are not familiar with this subject,
>>> time to learn. And, yes, some heavy lifting will be required.
>>> See ?regexp for a start maybe? Or the stringr package?
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>>
>>>
>>> On Fri, Jan 28, 2022, 7:08 PM Val <valkremk using gmail.com> wrote:
>>>
>>> > Hi All,
>>> >
>>> > I want to remove rows that contain a character string in an integer
>>> > column or a digit in a character column.
>>> >
>>> > Sample data
>>> >
>>> > dat1 <-read.table(text="Name, Age, Weight
>>> > Alex, 20, 13X
>>> > Bob, 25, 142
>>> > Carol, 24, 120
>>> > John, 3BC, 175
>>> > Katy, 35, 160
>>> > Jack3, 34, 140",sep=",",header=TRUE,stringsAsFactors=F)
>>> >
>>> > If the Age/Weight column contains any character(s) then remove that row;
>>> > if the Name column contains a digit then remove that row.
>>> > Desired output
>>> >
>>> > Name Age weight
>>> > 1 Bob 25 142
>>> > 2 Carol 24 120
>>> > 3 Katy 35 160
>>> >
>>> > Thank you,
>>> >
>>> > ______________________________________________
>>> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> > http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>> >