[R] Data visualization: overlay columns of train/test/validation datasets

David Winsemius dwinsemius at comcast.net
Wed Jul 2 23:47:01 CEST 2014

On Jul 2, 2014, at 11:42 AM, Supriya Jain wrote:

> Hi David,
> Thanks for your mail. 
> Here are the details of what I would like to do. 
> Given a dataset, I make two sets from it (for training and testing my model, respectively). But before the modeling, I would like to check the distributions of all columns in my dataset in order to make sure that the two split tables follow the same distributions. 
> With the code below (using the "attenu" dataset), I can overlay histograms, normalized to unit area, from the two split datasets for columns of type numeric. 
> ---------------------
> head(attenu, 10)
> nrow(attenu)
> indices <- sample(1:182, 50)
> t1 <- attenu[indices, ]
> t2 <- attenu[-indices, ]
> # overlay column "event" from t1 and t2:
> hist(t2$event, col = "red", density = 0, freq = FALSE, breaks = seq(1, 25, 2), xlab = "event", ylim = c(0, 0.2))
> par(new=TRUE) 
> hist(t1$event, col = "blue", density = 0, freq = FALSE, breaks = seq(1, 25, 2), xlab = "event", ylim = c(0, 0.2))
> #-------------
> However, for columns of type factor, although I can get the frequency of the different levels using the "summary" method for the columns separately, how do I plot their frequency distribution, after normalizing the frequencies by the total count, and overlay these distributions? 
> summary(t1$station)
> #--------output--------------
>    135     111     113     117    1027    1028    1052    1093    1095    1102     112    1219 
>       3       2       2       2       1       1       1       1       1       1       1       1 
>     126     127    1291    1293     130    1308    1383    1408    1409     141    1410    1418 
>       1       1       1       1       1       1       1       1       1       1       1       1 
>     266     270     272     411     412    5042    5043    5054    5060    5066    5069    5160 
>       1       1       1       1       1       1       1       1       1       1       1       1 
>    5165     952    c168    c266    1008    1011    1013    1014    1015    1016    1030    1032 
>       1       1       1       1       0       0       0       0       0       0       0       0 
>    1051    1083    1096     110    1117     116     125    1250    1251     128    1292    1298 
>       0       0       0       0       0       0       0       0       0       0       0       0 
>    1299    1376    1377    1411    1413    1422    1438    1445    1456    1492    2001    2316 
>       0       0       0       0       0       0       0       0       0       0       0       0 
>     262     269    2708    2714    2715    2728    2734     280     283     286     290    3501 
>       0       0       0       0       0       0       0       0       0       0       0       0 
>     475    5028    5044    5045    5047    5049    5050    5051    5052    5053    5055    5056 
>       0       0       0       0       0       0       0       0       0       0       0       0 
>    5057    5058 (Other)    NA's 
>       0       0       0       5 
> #----------------------------
> summary(t2$station)
> #--------output--------------
>    1028     117     475    1030    1083     112     113     116    1299    1377     269     283 
>       3       3       3       2       2       2       2       2       2       2       2       2 
>     290    5028    5053    5055    5056    5057    5058    5115     942     955     958    1008 
>       2       2       2       2       2       2       2       2       2       2       2       1 
>    1011    1013    1014    1015    1016    1032    1051    1093    1095    1096     110    1117 
>       1       1       1       1       1       1       1       1       1       1       1       1 
>    1219     125    1250    1251     128    1292    1298     130    1308    1376    1383    1411 
>       1       1       1       1       1       1       1       1       1       1       1       1 
>    1413    1418    1422    1438    1445    1456    1492    2001    2316     262     266    2708 
>       1       1       1       1       1       1       1       1       1       1       1       1 
>    2714    2715     272    2728    2734     280     286    3501     412    5044    5045    5047 
>       1       1       1       1       1       1       1       1       1       1       1       1 
>    5049    5050    5051    5052    5054    5059    5060    5061    5062    5067    5068    5070 
>       1       1       1       1       1       1       1       1       1       1       1       1 
>    5072    5073    5165     655     724     885     931     952    c118    c203    c204    1027 
>       1       1       1       1       1       1       1       1       1       1       1       0 
>    1052    1102 (Other)    NA's 
>       0       0       0      11 
> #---------------------------

It appears there may be a natural order to those categories, but the alphabetical ordering of the factor levels is making a hash of that fact. It also appears that the factor levels differ between the two datasets. It seems unlikely that you will get satisfactory plots for comparison using barplot.
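
For what it is worth, here is a minimal sketch of one way to put both tables on a common set of levels and compare relative frequencies side by side. It is an illustration under stated assumptions, not a recommended final plot: the helper name rel_freq, the objects props and top, the set.seed() call, and the choice of showing only the 15 most frequent stations are all made up for this example. Proportions are taken over the non-missing values only, since table() drops NAs by default.

```r
# Sketch only: relative frequencies of a factor column in two splits.
set.seed(1)                                   # assumption: added for reproducibility
indices <- sample(seq_len(nrow(attenu)), 50)
t1 <- attenu[indices, ]
t2 <- attenu[-indices, ]

# Use the level set of the full data so both tables are tabulated
# over exactly the same categories, in the same order.
lev <- levels(attenu$station)

# Hypothetical helper: counts per level divided by the non-NA total.
rel_freq <- function(x, lev) {
  counts <- table(factor(x, levels = lev))    # NAs are dropped here
  setNames(as.vector(counts) / sum(counts), lev)
}

# One row per split, one column per station level.
props <- rbind(t1 = rel_freq(t1$station, lev),
               t2 = rel_freq(t2$station, lev))

# With more than 100 levels a full barplot is hard to read, so
# (as an arbitrary choice) keep the 15 most frequent stations overall.
top <- names(head(sort(table(attenu$station), decreasing = TRUE), 15))

barplot(props[, top], beside = TRUE, col = c("blue", "red"),
        legend.text = TRUE, las = 2,
        xlab = "station", ylab = "proportion of non-missing rows")
```

Because each row of props sums to 1, the bar heights are directly comparable even though t1 and t2 have different numbers of rows. With this many sparse levels the comparison remains noisy, so it is at best a rough visual check.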

> Thanks in advance for any help with this,
> Supriya
> On Tue, Jul 1, 2014 at 6:42 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> On Jul 1, 2014, at 3:46 PM, Supriya Jain wrote:
> > Hello,
> >
> > Given two different datasets (having the same number and type of columns,
> > but different observations, as commonly encountered in data-mining as
> > train/test/validation datasets), is it possible to overlay plots
> > (histograms) and compare the different attributes from the separate
> > datasets, in order to check how similar the different datasets are?
> >
> > Is there a package available for such plotting together of similar columns
> > from different datasets?
> Possible. Assuming you just want frequency histograms (or ones using counts for that matter) it can be done in any of the three major plotting paradigms supported in R. No extra packages needed if using just base graphics.
> >
> > Thanks,
> > SJ
> >
> >       [[alternative HTML version deleted]]
> Oh, you must have missed the parts of the Posting Guide where plain text was requested. See below.
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> And you missed that section, as well.
> > and provide commented, minimal, self-contained, reproducible code.
> --
> David Winsemius
> Alameda, CA, USA

David Winsemius
Alameda, CA, USA