[R] Data visualization: overlay columns of train/test/validation datasets
David Winsemius
dwinsemius at comcast.net
Wed Jul 2 23:47:01 CEST 2014
On Jul 2, 2014, at 11:42 AM, Supriya Jain wrote:
> Hi David,
>
> Thanks for your mail.
>
> Here are the details of what I would like to do.
> Given a dataset, I make two sets from it (for training and testing my model, respectively). But before the modeling, I would like to check the distributions of all columns in my dataset in order to make sure that my split tables represent the same distributions.
>
> With the code below (using the "attenu" dataset), I can overlay histograms normalized to unit area from the two split datasets, for columns that are of type numeric.
> ---------------------
>
> head(attenu, 10)
> nrow(attenu)
> indices <- sample(nrow(attenu), 50)  # 50 random rows go to t1, the rest to t2
> t1 <- attenu[indices, ]
> t2 <- attenu[-indices, ]
>
> # overlay column "event" from t1 and t2:
>
> hist(t2$event, col = NA, border = "red", freq = FALSE, breaks = seq(1, 25, 2), xlab = "event", ylim = c(0, 0.2), main = "event")
> # add = TRUE draws on the existing axes instead of starting a new plot
> hist(t1$event, col = NA, border = "blue", freq = FALSE, breaks = seq(1, 25, 2), add = TRUE)
>
> #-------------
>
> However, for columns of type factor, although I can get the frequency of the different levels using the "summary" method for the columns separately, how do I plot their frequency distribution, after normalizing the frequencies by the total count, and overlay these distributions?
>
> summary(t1$station)
>
> #--------output--------------
>
>     135     111     113     117    1027    1028    1052    1093    1095    1102     112    1219
>       3       2       2       2       1       1       1       1       1       1       1       1
>     126     127    1291    1293     130    1308    1383    1408    1409     141    1410    1418
>       1       1       1       1       1       1       1       1       1       1       1       1
>     266     270     272     411     412    5042    5043    5054    5060    5066    5069    5160
>       1       1       1       1       1       1       1       1       1       1       1       1
>    5165     952    c168    c266    1008    1011    1013    1014    1015    1016    1030    1032
>       1       1       1       1       0       0       0       0       0       0       0       0
>    1051    1083    1096     110    1117     116     125    1250    1251     128    1292    1298
>       0       0       0       0       0       0       0       0       0       0       0       0
>    1299    1376    1377    1411    1413    1422    1438    1445    1456    1492    2001    2316
>       0       0       0       0       0       0       0       0       0       0       0       0
>     262     269    2708    2714    2715    2728    2734     280     283     286     290    3501
>       0       0       0       0       0       0       0       0       0       0       0       0
>     475    5028    5044    5045    5047    5049    5050    5051    5052    5053    5055    5056
>       0       0       0       0       0       0       0       0       0       0       0       0
>    5057    5058 (Other)    NA's
>       0       0       0       5
>
> #----------------------------
>
> summary(t2$station)
>
> #--------output--------------
>
>    1028     117     475    1030    1083     112     113     116    1299    1377     269     283
>       3       3       3       2       2       2       2       2       2       2       2       2
>     290    5028    5053    5055    5056    5057    5058    5115     942     955     958    1008
>       2       2       2       2       2       2       2       2       2       2       2       1
>    1011    1013    1014    1015    1016    1032    1051    1093    1095    1096     110    1117
>       1       1       1       1       1       1       1       1       1       1       1       1
>    1219     125    1250    1251     128    1292    1298     130    1308    1376    1383    1411
>       1       1       1       1       1       1       1       1       1       1       1       1
>    1413    1418    1422    1438    1445    1456    1492    2001    2316     262     266    2708
>       1       1       1       1       1       1       1       1       1       1       1       1
>    2714    2715     272    2728    2734     280     286    3501     412    5044    5045    5047
>       1       1       1       1       1       1       1       1       1       1       1       1
>    5049    5050    5051    5052    5054    5059    5060    5061    5062    5067    5068    5070
>       1       1       1       1       1       1       1       1       1       1       1       1
>    5072    5073    5165     655     724     885     931     952    c118    c203    c204    1027
>       1       1       1       1       1       1       1       1       1       1       1       0
>    1052    1102 (Other)    NA's
>       0       0       0      11
>
> #---------------------------
>
It appears there may be a natural order to those categories, but the alphabetical ordering of the factor levels is making a hash of that fact. It also appears that the levels actually observed differ between the two datasets. It seems unlikely that you will get satisfactory plots for comparison using barplot.
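
For what it is worth, here is a minimal sketch of one way to put the two tables on a common footing before any plotting. It assumes the split from the quoted code, re-levels "station" so that numeric codes sort numerically, and converts counts to proportions of the non-missing values. The seed and the names lev, num and props are illustrative only, not anything from the original post, and with 117 sparse levels the resulting barplot may still be hard to read.

```r
# Sketch only: compare relative frequencies of a factor column in two splits.
set.seed(42)                                  # assumed seed, for reproducibility
indices <- sample(nrow(attenu), 50)
t1 <- attenu[indices, ]
t2 <- attenu[-indices, ]

# One shared set of levels, ordered numerically where the code is a number.
lev <- levels(attenu$station)
num <- suppressWarnings(as.numeric(lev))      # NA for codes such as "c168"
lev <- lev[order(is.na(num), num, lev)]       # numeric codes first, then the rest

# Proportions of non-missing stations; rows are splits, columns are levels.
props <- rbind(
  t1 = prop.table(table(factor(t1$station, levels = lev))),
  t2 = prop.table(table(factor(t2$station, levels = lev)))
)

# Side-by-side bars on a common axis.
barplot(props, beside = TRUE, col = c("blue", "red"), border = NA,
        las = 2, cex.names = 0.4, ylab = "proportion", main = "station")
legend("topright", legend = rownames(props), fill = c("blue", "red"))
```

Because both rows of props share the same columns, levels that are absent from one split show up as zero-height bars rather than shifting the axis.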
--
David.
>
> Thanks in advance for any help with this,
> Supriya
>
>
>
>
>
> On Tue, Jul 1, 2014 at 6:42 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jul 1, 2014, at 3:46 PM, Supriya Jain wrote:
>
> > Hello,
> >
> > Given two different datasets (having the same number and type of columns,
> > but different observations, as commonly encountered in data-mining as
> > train/test/validation datasets), is it possible to overlay plots
> > (histograms) and compare the different attributes from the separate
> > datasets, in order to check how similar the different datasets are?
> >
> > Is there a package available for such plotting together of similar columns
> > from different datasets?
>
> Possible. Assuming you just want frequency histograms (or ones using counts for that matter) it can be done in any of the three major plotting paradigms supported in R. No extra packages needed if using just base graphics.
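
As an illustration of the base-graphics route, the following is a rough sketch that overlays density-scaled histograms for every numeric column of two data frames. It uses the built-in "attenu" data; the seed and the names train, test, num_cols and brk are hypothetical, and the breaks are taken from the pooled data so that the bars of both sets line up.

```r
# Sketch only: overlay density-scaled histograms of every numeric column.
set.seed(1)                                   # assumed seed
idx   <- sample(nrow(attenu), 50)
train <- attenu[idx, ]                        # hypothetical names for the two sets
test  <- attenu[-idx, ]

num_cols <- names(attenu)[sapply(attenu, is.numeric)]
op <- par(mfrow = c(2, 2))                    # attenu has four numeric columns
for (nm in num_cols) {
  # Shared breaks from the pooled data so both histograms use the same bins.
  brk <- pretty(range(attenu[[nm]], na.rm = TRUE), n = 12)
  h1 <- hist(train[[nm]], breaks = brk, plot = FALSE)
  h2 <- hist(test[[nm]],  breaks = brk, plot = FALSE)
  plot(h1, freq = FALSE, col = NA, border = "blue", main = nm, xlab = nm,
       ylim = c(0, max(h1$density, h2$density)))
  plot(h2, freq = FALSE, col = NA, border = "red", add = TRUE)
}
par(op)                                       # restore the previous layout
```

Computing the histograms first with plot = FALSE makes it possible to set ylim from both densities, so neither outline is clipped.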
>
>
> >
> > Thanks,
> > SJ
> >
> > [[alternative HTML version deleted]]
>
> Oh, you must have missed the parts of the Posting Guide where plain text was requested. See below.
>
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>
> And you missed that section, as well.
>
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> David Winsemius
> Alameda, CA, USA
>
>
David Winsemius
Alameda, CA, USA