[R] Difficulty with qqline in logarithmic context
François Pinard
pinard at iro.umontreal.ca
Fri Feb 3 20:08:32 CET 2006
[Brian Ripley]
>Is there a good reason to use qqnorm in a single-log context?
Yes. Googling around reveals this is not so uncommon.
> Should one not rather use
>>qqnorm(log(freq))
>>qqline(log(freq))
In the display produced by "qqnorm", the y-axis would then show
"log(value)" labels, while the user (me!) expects "value" labels.
>since you are (I guess) looking at log-normality of freq?
Once again, I was merely toying with "qqplot". I found it intriguing that,
while shuffling messages around between folders, for a good while, the
distribution of log(number of messages) per folder appears vaguely
normal, as I do not quickly see a reasonable justification for this.
>Another way to look at that is
>>qqplot(qlnorm(ppoints(length(freq))), freq, log="xy")
>the same plot, different scales.
Interesting, thanks for teaching me about "ppoints". Yet I remain
happier with the abscissa scale produced by "qqnorm". Besides, how would
one use "qqline" with the above?
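One possible answer, as a rough sketch only: on a log-log display, a reference line can be computed in log10 coordinates from the quartiles and drawn with "lines" instead of "abline". The data below are simulated (the name "freq_sim" is a stand-in, not the real "freq"), and the choice of the quartiles mirrors what "qqline" does for the normal case.

```r
# Simulated stand-in for the message counts (an assumption, not the real data).
set.seed(42)
freq_sim <- rlnorm(250, meanlog = 4, sdlog = 1.8)
n <- length(freq_sim)

# The log-log QQ plot against log-normal quantiles.
qqplot(qlnorm(ppoints(n)), freq_sim, log = "xy")

# Quartile points, expressed in log10 units on both axes.
probs <- c(0.25, 0.75)
lx <- log10(qlnorm(probs))
ly <- quantile(log10(freq_sim), probs, names = FALSE)

# Straight line through those two points, in log10-log10 coordinates.
slope <- diff(ly) / diff(lx)
intercept <- ly[1] - slope * lx[1]

# With a log x-axis, par("usr")[1:2] holds the log10 of the x limits.
xlim10 <- par("usr")[1:2]

# lines() takes coordinates in the original units, hence the 10^ back-transform.
lines(10^xlim10, 10^(intercept + slope * xlim10))
```

Drawing with "lines" and explicit endpoints avoids relying on how "abline" interprets its intercept and slope when an axis is logarithmic.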
>(I believe a QQ plot should always have comparable scales on the two
>axes.)
While comparable scales are somewhat simpler to compare, this is not
necessarily what is most adequate for the user. Proof is that while
quantiles are being compared here, scales do not show quantiles, but
units as meaningful to the user. One might want to compare variables
scaled very differently, maybe because of different units from the same
distribution, or from different but similar distributions using
different scales and shifted to different means. Or even, why not, if
this is what is meaningful for users, a log scale.
>The point is that qqline is tied to normality, not to log-normality.
As it stands, yes. As a convenience, it could be extended (probably
easily) to log-normality. "qqnorm" already does something sensible in
log-context, so a user might expect "qqline" to do equally well.
The real point might be that "qqline" is tied to "abline" a bit too
blindly. What is the meaning of intercept and slope of a straight line
on a graphic in log context? First, the intercept might not even exist.
Second, the "abline" interpretation depends on the clipping, and possibly
on the extrema of the pretty breakpoints chosen for scales, making it
hard to predict in ordinary use. There ought to be some reason for the
log-aware code in "abline", yet I did not find documentation for it.
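To make the question concrete, here is a small sketch of one reading of "intercept and slope" on a plot with a log y-axis: both are taken in log10(y) units, so the line that looks straight on the display is y = 10^(a + b*x). The values of "a" and "b" are arbitrary assumptions chosen for illustration, and the line is drawn with "lines" from explicit coordinates rather than with "abline".

```r
# Assumed intercept and slope, expressed in log10(y) units.
a <- 1
b <- 0.5

x <- seq(-3, 3, length.out = 50)

# Straight on a log y-axis: linear in log10(y).
y_loglinear <- 10^(a + b * x)

# Linear in the original units: appears curved on a log y-axis.
y_linear <- 10 + 3 * x

# Empty plot with a logarithmic y-axis, then both lines in original units.
plot(x, y_loglinear, log = "y", type = "n", ylab = "value")
lines(x, y_loglinear)          # drawn as a straight line
lines(x, y_linear, lty = 2)    # drawn as a curve
```

Under this reading, the "intercept" is the value of log10(y) at x = 0, which only exists on the plot when the x-axis is not itself logarithmic.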
The wisest for "abline", in my very humble opinion, would be for it to
complain if ever called in log context. Then, "qqline" would indirectly
complain through "abline", if "qqline" is not modified to do something
more proper. Moreover, if it is definitely out of the question that
"qqline" ever be meaningfully called in log context, then the same holds
for "qqnorm", which should then complain as well.
Currently, "qqline" misbehaves, in that it silently produces
a meaningless result, while it could either diagnose that the result is
meaningless, or produce a meaningful result.
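As a minimal sketch of the second option, assuming only the y-axis may be logarithmic and ignoring "datax" entirely: the helper below (the name "qqline_logaware" is hypothetical, not part of R) looks at par("ylog"), fits the quartile line to log10 of the data when needed, and draws it with "lines" in original units. The sample is simulated, not the real "freq".

```r
# Hypothetical helper, not part of R: a quartile line for a normal QQ plot
# whose y-axis may be logarithmic. The "datax" case is not handled.
qqline_logaware <- function(y, probs = c(0.25, 0.75), ...) {
  ylog <- par("ylog")                 # TRUE when the current plot has a log y-axis
  if (ylog) y <- log10(y)             # work in the units the axis is drawn in
  yq <- quantile(y, probs, names = FALSE, na.rm = TRUE)
  xq <- qnorm(probs)
  slope <- diff(yq) / diff(xq)
  intercept <- yq[1] - slope * xq[1]
  x <- par("usr")[1:2]                # x limits (the x-axis is linear here)
  yline <- intercept + slope * x
  if (ylog) yline <- 10^yline         # back to original units for lines()
  lines(x, yline, ...)
  invisible(c(intercept = intercept, slope = slope))
}

# Simulated stand-in for the message counts (an assumption).
set.seed(1)
freq_sim <- rlnorm(200, meanlog = 4, sdlog = 1.5)

qqnorm(freq_sim, log = "y")
fit <- qqline_logaware(freq_sim)
```

On a plot without a log axis the helper falls back to the ordinary quartile line, so the same call could serve in both contexts.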
[Remainder of the reply top-quoted, as usual on r-help.]
>On Wed, 1 Feb 2006, François Pinard wrote:
>>Hi, R friends. I had some difficulty with the following code:
>> qqnorm(freq, log='y')
>> qqline(freq)
>>as the line drawn was seemingly random. The exact data I used appears
>>below. After wandering a bit within the source code for "abline",
>>I figured out I should rather write:
>> qqnorm(freq, log='y')
>> par(ylog=FALSE)
>> qqline(log10(freq))
>> par(ylog=TRUE)
>>I'm proposing that this little stunt rather be hidden and
>>automatically effected within "qqline" proper, whenever par('ylog') is
>>TRUE. I thought about providing a patch, as "qqline" is so small. Yet
>>it would be more noise than useful, as I'm not familiar with the "datax"
>>argument usage, which should probably be addressed as well.
>>Here is the data, in case useful:
>>freq <-
>>as.integer(c(33, 79, 21, 436, 58, 18, 1106, 498, 1567, 393, 2,
>>104, 50, 67, 113, 76, 327, 331, 196, 145, 86, 59, 12, 215, 293,
>>154, 500, 314, 246, 587, 85, 23, 323, 3, 13, 576, 29, 37, 24,
>>21, 1230, 137, 13, 93, 3, 101, 72, 218, 59, 17, 2, 8, 86, 143,
>>150, 22, 19, 234, 119, 157, 4, 255, 146, 126, 76, 15, 271, 170,
>>4, 6, 16, 3048, 2175, 3350, 5017, 5706, 1610, 665, 322, 1, 16,
>>47, 51, 168, 94, 66, 154, 99, 11, 547, 953, 1, 1071, 80, 184,
>>168, 52, 187, 103, 187, 361, 46, 85, 135, 597, 121, 283, 26,
>>12, 20, 169, 9, 79, 15, 114, 75, 30, 111, 556, 173, 32, 99, 438,
>>2, 2, 1, 117, 5, 3, 51, 8, 41, 12, 23, 2, 13, 5, 1, 9, 4, 1,
>>7, 15, 5, 48, 16, 112, 6, 1, 39, 60, 5, 23, 5, 19, 1, 8, 32,
>>4, 13, 1, 14, 71, 5, 1, 35, 30, 100, 389, 22, 8, 1, 192, 40,
>>6, 3, 17, 2, 14, 71, 14, 1, 5, 4, 32, 21, 18, 13, 2, 2, 45, 342,
>>46, 144, 18, 131, 188, 112, 37, 85, 90, 8, 195, 173, 5, 53, 96,
>>37, 16, 16, 281, 64, 50, 92, 336, 31, 744, 4, 134, 74, 1, 227,
>>6, 48, 418, 64, 66, 59, 20, 45, 20, 370, 148, 22, 7, 30, 601,
>>29, 82, 113, 938, 252, 65, 137, 72, 22, 98, 12, 152, 212, 13,
>>8, 35, 3, 77))
>>Yet this really is the value of "courriel$freq" after "data(courriel)",
>>with a file ".../R/data/courriel.R" here, holding:
>>courriel <- read.table(pipe('grep -c \'^From \' ../courriel/*'),
>> sep=':', as.is=T, row.names=1,
>> col.names=c('fichier', 'freq'))
>>My goal, which is nothing serious, was merely to toy with the number of
>>messages per folder, for folders massaged out of R archives.
>>Version:
>>platform = i686-pc-linux-gnu
>>arch = i686
>>os = linux-gnu
>>system = i686, linux-gnu
>>status =
>>major = 2
>>minor = 2.1
>>year = 2005
>>month = 12
>>day = 20
>>svn rev = 36812
>>language = R
>>Locale:
>>LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=fr_CA.UTF-8;LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C
>>Search Path:
>>.GlobalEnv, package:methods, package:stats, package:graphics,
>>package:grDevices, package:utils, package:datasets, fp.etc, Autoloads,
>>package:base
>>--
>>François Pinard http://pinard.progiciels-bpi.ca
>--
>Brian D. Ripley, ripley at stats.ox.ac.uk
>Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>University of Oxford, Tel: +44 1865 272861 (self)
>1 South Parks Road, +44 1865 272866 (PA)
>Oxford OX1 3TG, UK Fax: +44 1865 272595
--
François Pinard http://pinard.progiciels-bpi.ca