[R] Improvement: function cut
David Winsemius
dw|n@em|u@ @end|ng |rom comc@@t@net
Sat Sep 18 22:43:31 CEST 2021
On 9/18/21 5:28 AM, Leonard Mada via R-help wrote:
> Hello Andrew,
>
>
> I add this info as a completion (so other users can get a better
> understanding):
>
> If we want to perform a survival analysis, then the intervals should be
> closed on the right, but we should also include the first time point (as
> per Intention-to-Treat):
>
> [0, 4] (4, 8] (8, 12] (12, 16]
>
> [0, 4] (4, 8] (8, 12] (12, 16] (16, 20]
>
>
> So the series is extendible to the right without any errors!
>
> But the 1st interval (which is the same in both series) is different
> from the other intervals: [0, 4].
>
>
> I feel that this should have been the default behaviour for cut().
To Leonard:

If you do not like the behavior of `cut`, then you should "roll your
own". It's very unlikely that R Core will modify a base function like
cut. You might want to look at Hmisc::cut2. Frank Harrell didn't like
that default behavior and thought he could make a better cut, so he just
put it in his package. I did like his version better and often used it
when I was actively programming. I suspect there is also a tidyverse
cut-like function, but I'm not terribly familiar with that fork of R.
(It's really not the same language IMHO.)
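
For what it's worth, rolling your own can be only a few lines of base R.
What follows is a minimal sketch, not a tested proposal: the name
cut_closed and its warn argument are made up for illustration (neither
exists in base R or in Hmisc). It simply calls cut() with
include.lowest = TRUE and warns when values fall outside the breaks.

```r
# Hypothetical helper, not part of base R or Hmisc: a thin wrapper around
# base::cut() that closes the outermost interval and, optionally, warns
# about values that do not fall into any interval.
cut_closed <- function(x, breaks, right = TRUE, warn = TRUE, ...) {
    f <- cut(x, breaks, right = right, include.lowest = TRUE, ...)
    # cut() returns NA for missing input as well as for values outside
    # the breaks; count only the values that were dropped by the breaks.
    n_out <- sum(is.na(f) & !is.na(x))
    if (warn && n_out > 0L)
        warning(n_out, " value(s) fall outside the breaks and were set to NA")
    f
}

x <- 0:20
# All values are covered; the first interval is closed on both sides.
table(cut_closed(x, seq.int(0, 20, 4)))
# Values 17:20 lie beyond the last break, so this call warns.
table(cut_closed(x, seq.int(0, 16, 4)), useNA = "ifany")
```

With right = TRUE this yields the series [0,4] (4,8] (8,12] ... from the
quoted message; with right = FALSE it is the last interval that gets
closed instead.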

But it's a waste of time and energy to try to propose modifications of core
R functions unless *you* can show that it is stable across 20,000
packages and will not offend long-time users. The likelihood of that
happening for your proposal is vanishingly small in my estimation. You
shouldn't ask R Core to do that for you. They are busy fixing real bugs.

If you want to persist despite my negativity, then you should make a
complete proposal by submitting a proper diff file that incorporates
your tested efforts to the R-devel mailing list.
--
David
>
> Note:
>
> In my previous message I was led to think about a different situation,
> because you constructed intervals that are open on the right and also
> extended to the right. But for survival analysis the intervals should be
> as described in this mail, and that should probably be the default.
>
>
> Sincerely,
>
>
> Leonard
>
>
> On 9/18/2021 1:29 AM, Andrew Simmons wrote:
>> I disagree, I don't really think it's too long or ugly, but if you
>> think it is, you could abbreviate it as 'i'.
>>
>>
>> x <- 0:20
>> breaks1 <- seq.int(0, 16, 4)
>> breaks2 <- seq.int(0, 20, 4)
>> data.frame(
>>     cut(x, breaks1, right = FALSE, i = TRUE),
>>     cut(x, breaks2, right = FALSE, i = TRUE),
>>     check.names = FALSE
>> )
>>
>>
>> I hope this helps.
>>
>> On Fri, Sep 17, 2021 at 6:26 PM Leonard Mada <leo.mada using syonic.eu> wrote:
>>
>> Hello Andrew,
>>
>>
>> But "cut" generates factors. In most cases with real data one
>> expects to have also the ends of the interval: the argument
>> "include.lowest" is both ugly and too long.
>>
>> [The test-code on the ftable thread contains this error! I have
>> run into this error a couple of times.]
>>
>>
>> The only real situation that I can imagine to be problematic:
>>
>> - if the interval goes to +Inf (or -Inf): I do not know if there
>> would be any effects when including +Inf (or -Inf).
>>
>>
>> Leonard
>>
>>
>> On 9/18/2021 1:14 AM, Andrew Simmons wrote:
>>> While it is not explicitly mentioned anywhere in the
>>> documentation for .bincode, I suspect 'include.lowest = FALSE' is
>>> the default to keep the definitions of the bins consistent. For
>>> example:
>>>
>>>
>>> x <- 0:20
>>> breaks1 <- seq.int(0, 16, 4)
>>> breaks2 <- seq.int(0, 20, 4)
>>> cbind(
>>>     .bincode(x, breaks1, right = FALSE, include.lowest = TRUE),
>>>     .bincode(x, breaks2, right = FALSE, include.lowest = TRUE)
>>> )
>>>
>>>
>>> By having 'include.lowest = TRUE' with different ends, you can
>>> get inconsistent behaviour. While this probably wouldn't be an
>>> issue with 'real' data, this would seem like something you'd want
>>> to avoid by default. The definitions of the bins are
>>>
>>>
>>> [0, 4)
>>> [4, 8)
>>> [8, 12)
>>> [12, 16]
>>>
>>>
>>> and
>>>
>>>
>>> [0, 4)
>>> [4, 8)
>>> [8, 12)
>>> [12, 16)
>>> [16, 20]
>>>
>>>
>>> so you can see where the inconsistent behaviour comes from. You
>>> might be able to get R-core to add argument 'warn', but probably
>>> not to change the default of 'include.lowest'. I hope this helps.
>>>
>>>
>>> On Fri, Sep 17, 2021 at 6:01 PM Leonard Mada <leo.mada using syonic.eu> wrote:
>>>
>>> Thank you Andrew.
>>>
>>>
>>> Is there any reason not to make: include.lowest = TRUE the
>>> default?
>>>
>>>
>>> Regarding the NA:
>>>
>>> The user still has to suspect that some values were not
>>> included and run that test.
>>>
>>>
>>> Leonard
>>>
>>>
>>> On 9/18/2021 12:53 AM, Andrew Simmons wrote:
>>>> Regarding your first point, argument 'include.lowest'
>>>> already handles this specific case, see ?.bincode
>>>>
>>>> Your second point, maybe it could be helpful, but since both
>>>> 'cut.default' and '.bincode' return NA if a value isn't
>>>> within a bin, you could make something like this on your own.
>>>> Might be worth pitching to R-bugs on the wishlist.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 17, 2021, 17:45 Leonard Mada via R-help
>>>> <r-help using r-project.org> wrote:
>>>>
>>>> Hello List members,
>>>>
>>>>
>>>> the following improvements would be useful for function
>>>> cut (and .bincode):
>>>>
>>>>
>>>> 1.) Argument: Include extremes
>>>> extremes = TRUE
>>>> if(right == FALSE) {
>>>>     # include also right for last interval;
>>>> } else {
>>>>     # include also left for first interval;
>>>> }
>>>>
>>>>
>>>> 2.) Argument: warn = TRUE
>>>>
>>>> Warn if any values are not included in the intervals.
>>>>
>>>>
>>>> Motivation:
>>>> - reduce risk of errors when using function cut();
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>>
>>>> Leonard
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained,
>>>> reproducible code.
>>>>