[R] Non-normal data issues in PhD software engineering experiment

Thu Jul 10 19:07:06 CEST 2008

On Thu, Jul 10, 2008 at 11:06 AM, Daniel Malter <daniel at umd.edu> wrote:
>
> I hope you don't really want our patients :)
>
> It looks like you have an experiment with two groups. You have several
> trials for each group. And within each trial you observe your units at
> distinct points in time.
>
> The first advice for you is to graphically display your data. Before you
> start modeling your data wrong, you should have a strong feeling what the
> right approach will be. If your data is nonlinear, for example, you will
> take a different approach than when it is linear. So what I suggest is to
> plot your Ys (dependent variables) against time for each of your trials,
> optimally two plots, one for each group (but multiple plots are also okay).
> These plots should give you a firm intuition about how your dependent
> variable develops over time for each group. The modeling of your data in a
> regression model then depends on the presumed functional relationship
> between your dependent variable and your independent variables (time and
> group). An important question is the distribution of your dependent
> variable. Is it normally distributed? Or is it a proportion? All this is
> important information in deciding how to model your problem.

I'd suggest starting with looking at the overall distribution of sensitivity:

exp <- read.csv("data.csv")
library(ggplot2)

qplot(sensitivity, geom="histogram", data=exp, binwidth=.05)

This is revealing - sensitivity is discrete and quite clumpy. You
could then look at this distribution conditioned on version and
paradigm:

qplot(sensitivity, geom="histogram", data=exp, binwidth=.05,
  facets = version ~ paradigm)

This is a complex plot, but it rewards detailed study (and suggests
that accurate modelling is going to be challenging).  There's a clear
change in sensitivity in paradigm one after version 3, and in paradigm
two, versions 4, 9 and 10 look unusual.
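
As a rough numeric companion to the histograms, tabulating the values
shows how few distinct levels sensitivity takes and how they spread
over the paradigm/version cells. This is only a sketch: the `toy` data
frame below is a made-up stand-in (hypothetical values, not the real
data) so that the snippet runs on its own. With the real data you would
use the `exp` data frame read in above.

```r
# Made-up stand-in for the real data (hypothetical values, illustration only)
toy <- data.frame(
  paradigm    = rep(1:2, each = 6),
  version     = rep(1:3, times = 4),
  sensitivity = c(0.5, 0.5, 1, 0.5, 0.75, 1,
                  0.25, 0.5, 0.5, 0.25, 0.25, 0.5)
)

# How many distinct values does sensitivity take, and how often does each occur?
value_counts <- table(toy$sensitivity)
print(value_counts)

# The same counts split by paradigm and version, flattened for reading
by_cell <- xtabs(~ paradigm + version + sensitivity, data = toy)
print(ftable(by_cell))
```

If only a handful of distinct values show up, that supports treating
sensitivity as discrete when choosing a model.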

Looking at the scatterplot of sensitivity vs version:

qplot(version, sensitivity, data=exp, colour=factor(paradigm))

isn't very helpful because the discrete values of sensitivity mean
that many of the points are overplotted.  Jittering the points and
adding a smoothed line for each group helps a little, but it's not as
revealing as the histograms.

qplot(version, sensitivity, data=exp, colour=factor(paradigm),
  geom="jitter") + geom_smooth()
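
Another way to deal with the overplotting, sketched here as an
alternative to jittering rather than something I ran on the real data,
is to count how many observations share each position and map that
count to point size. The `toy` data frame and the helper column `n` are
made up for illustration, and the plot is written with ggplot() rather
than qplot().

```r
library(ggplot2)

# Made-up stand-in for the real data (hypothetical values, illustration only)
toy <- data.frame(
  paradigm    = rep(1:2, each = 6),
  version     = rep(1:3, times = 4),
  sensitivity = c(0.5, 0.5, 1, 0.5, 0.75, 1,
                  0.25, 0.5, 0.5, 0.25, 0.25, 0.5)
)

# Count the observations that share each (version, sensitivity, paradigm)
toy$n <- 1
counts <- aggregate(n ~ version + sensitivity + paradigm, data = toy, FUN = sum)

# Point size now encodes the number of coincident observations
p <- ggplot(counts, aes(version, sensitivity,
                        colour = factor(paradigm), size = n)) +
  geom_point()
print(p)
```

Unlike jittering, this keeps every point at its true value, at the cost
of needing a size legend to read the counts.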

Hadley

-- 
http://had.co.nz/