[R] Comparing multiple distributions

Thu May 31 19:28:44 CEST 2007

On 2007-May-31  , at 18:56 , Bert Gunter wrote:
> While Ravi's suggestion of the "compositions" package is certainly
> appropriate, I suspect that the complex and extensive statistical
> "homework" you would need to do to use it might be overwhelming (the
> geometry of compositions is a simplex, and this makes things hard).

Yes I am reading the documentation now, which is well written but  
huge indeed...

> As a simple and perhaps useful alternative, use pairs() or splom() to
> plot your 5-D data, distinguishing the different treatments via color
> and/or symbol.
>
> In addition, it might be useful to do the same sort of plot on the
> first two principal components (?prcomp) of the first 4 dimensions of
> your 5 component vectors (since the 5th is determined by the first 4).
> Because of the simplicial geometry, this PCA approach is not right,
> but it may nevertheless be revealing. The same plotting ideas are in
> the compositions package done properly (in the correct geometry), so
> if you are motivated to do so, you can do these things there. Even if
> you don't dig into the details, using the compositions package version
> of the plots may be relatively easy to do, interpretable, and
> revealing -- more so than my "simple but wrong" suggestions. You can
> decide.
>
> I would not trust inference using ad hoc approaches in the
> untransformed data. That's what the package is for. But plotting the
> data should always be at least the first thing you do anyway. I often
> find it to be sufficient, too.
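
A minimal sketch of the pairs() suggestion above, run on simulated data because the real catches are not shown in this thread. The helper rdirichlet(), the column names depth1..depth5 and the two species labels are all assumptions made for illustration only.

```r
set.seed(1)

# Hypothetical helper: draw n compositions (rows summing to 1) by
# normalising independent gamma variates, i.e. a Dirichlet sample.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Simulated stand-in for the real data: 20 stations per species,
# 5 depth bins from surface (depth1) to bottom (depth5).
props <- rbind(rdirichlet(20, c(10, 6, 4, 2, 2)),   # surface-dwelling "A"
               rdirichlet(20, c(2, 2, 4, 6, 10)))   # bottom-dwelling "B"
colnames(props) <- paste0("depth", 1:5)
species <- factor(rep(c("A", "B"), each = 20))

# Scatterplot matrix of the 5 proportions, one colour and symbol per species.
pairs(props,
      col = c("blue", "red")[species],
      pch = c(1, 17)[species],
      main = "Depth proportions by species (simulated)")
```

Each panel shows one pair of depth bins; groups that separate in several panels have visibly different vertical distributions.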

Thank you for your suggestions on plotting, I will look into them. I
was using histograms of mean proportions + SE until now because that
was what seemed the most straightforward given my specific questions.
If we come back to my original data (abandoning the statistical
language for a while ;) ) I have proportions of fishes caught 1. near
the surface, 2. a bit below, ... 5. near the bottom. The questions I
want to ask are, for example: does the vertical distribution of
species A and species B differ? So I can plot the mean proportion at
each depth for both species and obtain a visual representation of the
vertical distribution of each.
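
The mean proportion + SE plot described here could be sketched roughly as follows. The data are simulated, and rdirichlet(), the depth1..depth5 names and the species labels are hypothetical stand-ins, not the actual data set.

```r
set.seed(1)

# Hypothetical helper: Dirichlet sample, each row sums to 1.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Simulated proportions: 20 stations per species, 5 depth bins.
props <- rbind(rdirichlet(20, c(10, 6, 4, 2, 2)),
               rdirichlet(20, c(2, 2, 4, 6, 10)))
colnames(props) <- paste0("depth", 1:5)
species <- factor(rep(c("A", "B"), each = 20))

# Mean proportion and standard error per depth bin, for each species
# (rows = species, columns = depth bins).
by_sp <- split(as.data.frame(props), species)
mean_by_sp <- t(sapply(by_sp, colMeans))
se_by_sp <- t(sapply(by_sp, function(d) apply(d, 2, sd) / sqrt(nrow(d))))

# Grouped bars with +/- 1 SE error bars.
bp <- barplot(mean_by_sp, beside = TRUE, legend.text = TRUE,
              ylim = c(0, max(mean_by_sp + se_by_sp) * 1.1),
              ylab = "Mean proportion of catch")
arrows(bp, mean_by_sp - se_by_sp, bp, mean_by_sp + se_by_sp,
       angle = 90, code = 3, length = 0.03)
```
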
At this stage differences between fishes that accumulate near the
surface or near the bottom are quite obvious. If I add error bars I
can get an idea of the variability of those distributions. The issue
arises when I want to *test* for a difference between the
distributions of species A and B. If I use a basic KS test I can only
compare the mean proportions for species A (5 points) to the mean
proportions of species B (5 points), and this has low power and does
not take into account the variability around those means. In addition
I may also want to know whether there is a difference among species A,
B and C, and pairwise KS tests would inflate the type I (alpha) error
risk. Am I explaining things correctly? Does this seem logical to you
too?
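
One possible way to test whole vectors rather than their means, sketched under assumptions and not taken from anyone in this thread: map each composition to additive log-ratios (with the fifth bin as an arbitrary reference) and compare species with manova(). With two groups this amounts to a Hotelling T-squared test. The data are simulated and rdirichlet() and the column names are hypothetical. Log-ratios need strictly positive proportions, so real catches with empty depth bins would need extra care, and the compositions package remains the more rigorous route.

```r
set.seed(1)

# Hypothetical helper: Dirichlet sample, each row sums to 1.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Simulated proportions: 20 stations per species, 5 depth bins.
props <- rbind(rdirichlet(20, c(10, 6, 4, 2, 2)),
               rdirichlet(20, c(2, 2, 4, 6, 10)))
colnames(props) <- paste0("depth", 1:5)
species <- factor(rep(c("A", "B"), each = 20))

# Additive log-ratio transform: 4 unconstrained coordinates per station,
# each the log of a bin's proportion relative to the fifth bin.
alr <- log(props[, 1:4] / props[, 5])
colnames(alr) <- paste0("alr", 1:4)

# One multivariate test on all stations, using the within-group variability.
fit <- manova(alr ~ species)
res <- summary(fit, test = "Hotelling-Lawley")
print(res)
p_value <- res$stats["species", "Pr(>F)"]
```

With a third species, the same call with a three-level factor gives a single overall test instead of several pairwise ones.
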
As for the PCA I must admit I don't really understand what you mean.
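
A rough sketch of what the PCA suggestion could look like, on simulated data with hypothetical names (rdirichlet(), depth1..depth5). It drops the fifth proportion, which is fixed once the other four are known, and plots the stations on the first two principal components. As the quoted message says, this ignores the simplex geometry, so it is exploratory only.

```r
set.seed(1)

# Hypothetical helper: Dirichlet sample, each row sums to 1.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Simulated proportions: 20 stations per species, 5 depth bins.
props <- rbind(rdirichlet(20, c(10, 6, 4, 2, 2)),
               rdirichlet(20, c(2, 2, 4, 6, 10)))
colnames(props) <- paste0("depth", 1:5)
species <- factor(rep(c("A", "B"), each = 20))

# PCA on the first 4 proportions only.
pc <- prcomp(props[, 1:4])

# Stations on the first two principal components, marked by species.
plot(pc$x[, 1], pc$x[, 2],
     col = c("blue", "red")[species], pch = c(1, 17)[species],
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(species),
       col = c("blue", "red"), pch = c(1, 17))

# Why 4 columns suffice: with all 5, the last component carries
# essentially no variance because the rows sum to 1.
sdev5 <- prcomp(props)$sdev
print(sdev5)
```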

Thank you very much again.

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of jiho
> Subject: Re: [R] Comparing multiple distributions
>
> Nobody answered my first request. I am sorry if I did not explain my
> problem clearly. English is not my native language and statistical
> English is even more difficult. I'll try to summarize my issue in
> more appropriate statistical terms:
>
> Each of my observations is not a single number but a vector of 5
> proportions (which add up to 1 for each observation). I want to
> compare the "shape" of those vectors between two treatments (i.e. how
> the quantities are distributed between the 5 values in treatment A
> with respect to treatment B).
>
> I was pointed to Hotelling T-squared. Does it seem appropriate? Are
> there other possibilities (I read many discussions about Hotelling
> vs. MANOVA but I could not see how any of those related to my
> particular case)?
>
> Thank you very much in advance for your insights. See below for my
> earlier, more detailed, e-mail.
>
> On 2007-May-21  , at 19:26 , jiho wrote:
>> I am studying the vertical distribution of plankton and want to
>> study how it varies with several factors (time of day,
>> species, water column structure etc.). So my data is special in
>> that, at each sampling site (each observation), I don't have *one*
>> number, I have *several* numbers (abundance of organisms in each
>> depth bin, I sample 5 depth bins) which describe a vertical
>> distribution.
>>
>> Then let's say I want to compare speciesA with speciesB, I would end
>> up trying to compare a group of several distributions with another
>> group of several distributions (where a "distribution" is a vector
>> of 5 numbers: an abundance for each depth bin). Does anyone know
>> how I could do this (with R obviously ;) )?
>>
>> Currently I kind of get around the problem in two ways:
>> - compute the mean abundance per depth bin within each group and
>> compare the two mean distributions with a ks.test, but this
>> obviously diminishes the power of the test (I only compare 5*2
>> "observations")
>> - restrict the information at each sampling site to the mean depth
>> weighted by the abundance of the species of interest. This way I
>> have one observation per station but I reduce the information to
>> the mean depths, while the actual distribution across depths is
>> also important.
>>
>> I know this is probably not directly R related but I have already
>> searched around for solutions and solicited my local statistics
>> expert... to no avail. So I hope that the stats experts on this
>> list will help me.
>>
>> Thank you very much in advance.

JiHO
---
http://jo.irisson.free.fr/
