[R] read.delim skips first column (why?)
Gabor Grothendieck
ggrothendieck at gmail.com
Tue Jul 14 13:37:19 CEST 2009
Try
count.fields("myfile.txt", sep = "\t")
read.delim uses sep = "\t" but there are trailing tabs
on some lines.
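
As a minimal sketch of that check, the following builds a small stand-in
file (hypothetical contents, not the poster's data) with trailing tabs on
two of its three lines and counts the fields per line. Each trailing tab
shows up as one extra field.

```r
# Hypothetical stand-in file: every line has 3 real columns.
tmp <- tempfile(fileext = ".txt")
writeLines(c("snp\tgene\tchromosome\t",   # header, 1 trailing tab
             "rs1\tRAPT2\t3",             # no trailing tab
             "rs2\tRAPT2\t3\t\t"),        # 2 trailing tabs
           tmp)

# Number of tab-separated fields found on each line of the file
n <- count.fields(tmp, sep = "\t")
print(n)

# Lines with more fields than the shortest line are worth inspecting
suspect <- which(n > min(n))
print(suspect)
```

If all lines of a file report the same count, trailing separators are
probably not the cause.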
The first line, i.e. with the headers, has three trailing tabs
so it thinks that there are 15 columns rather than 12.
The 5th line of the file (4th line of data) has 4 trailing
tabs so it thinks that there are up to 16 fields in each
data line.
Since it now believes that there are 16 fields of data but only
15 header fields, i.e. exactly one more, it assumes that the extra
field, i.e. the first one, holds the row names.
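
The row-name rule can be reproduced on a small stand-in file. This is
only a sketch: the file contents are hypothetical, and stripping the
trailing tabs with sub() is one possible workaround, not something that
was run on the poster's file. It is only safe when the last real column
never holds genuinely empty values, and it assumes an R version whose
read.delim accepts the text argument.

```r
# Hypothetical stand-in file: 3 real columns; the header has 1 trailing tab
# (4 fields) and the second data line has 2 trailing tabs (5 fields).
tmp <- tempfile(fileext = ".txt")
writeLines(c("snp\tgene\tchromosome\t",
             "rs1\tRAPT2\t3",
             "rs2\tRAPT2\t3\t\t"),
           tmp)

# The header has one field fewer than the widest data line, so the first
# column is taken as row names and every header shifts by one column.
broken <- read.delim(tmp)
print(rownames(broken))
print(broken$snp)

# Possible workaround: drop trailing tabs, then parse the cleaned lines.
cleaned <- sub("\t+$", "", readLines(tmp))
fixed <- read.delim(text = cleaned)
print(fixed)
```

In the broken read the ids end up as row names and snp holds the gene
names; after cleaning, the headers line up with their own columns.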
On Tue, Jul 14, 2009 at 5:11 AM, Giovanni Marco
Dall'Olio<dalloliogm at gmail.com> wrote:
> Hi,
> I have uploaded a copy of the file here:
> - http://pastebin.com/fd0edfab
>
> the file has also been passed through the Unix command-line tool unexpand,
> but that doesn't solve the problem.
>
> using head=TRUE instead of head=T also has the same effect.
>
> the output of print(names) is:
>> print(names(ngly), quote=TRUE)
> [1] "snp" "gene"
> [3] "chromosome" "distance_from_gene_center"
> [5] "position" "ame"
> [7] "csasia" "easia"
> [9] "eur" "mena"
> [11] "oce" "ssafr"
> [13] "X" "X.1"
> [15] "X.2"
>
> Thank you to all the people who replied to my email address, but I
> haven't been able to solve the problem yet.
>
>
> On Tue, Jul 14, 2009 at 12:36 AM, jim holtman <jholtman at gmail.com> wrote:
>
>> Can you send your file as an attachment since it is impossible to see
>> where the separator characters are.
>>
>> On Mon, Jul 13, 2009 at 1:27 PM, Giovanni Marco
>> Dall'Olio<dalloliogm at gmail.com> wrote:
>> > Hi people,
>> > I have a text file like this one posted:
>> >
>> > snp_id gene chromosome distance_from_gene_center position pop1 pop2 pop3 pop4 pop5 pop6 pop7
>> > rs2129081 RAPT2 3 -129993 "upstream" 0.439009 1.169210 NA 0.233020 0.093042 NA -0.902596
>> > rs1202698 RAPT2 3 -128695 "upstream" NA 1.815000 NA 0.399079 1.814270 1.382950 NA
>> > rs1163207 RAPT2 3 -128224 "upstream" NA NA NA NA NA NA NA
>> > rs1834127 RAPT2 3 -128106 "upstream" NA NA NA NA NA NA 2.180670
>> > rs2114211 RAPT2 3 -126738 "upstream" -0.468279 -1.447620 NA 0.010616 -0.414581 NA 0.550447
>> > rs2113151 RAPT2 3 -124620 "upstream" -0.897660 -1.971020 NA -0.920327 -0.764658 NA 0.337127
>> > rs2524130 RAPT2 3 -123029 "upstream" -0.109795 -0.004646 -0.412059 1.116740 0.667567 -0.924529 0.962841
>> > rs1381318 RAPT2 3 -12818 "upstream" -0.911662 -1.791580 NA -0.945716 -1.239640 NA 0.004876
>> > rs2113319 RAPT2 3 -122028 "upstream" -0.911662 -1.738610 NA -0.945716 -1.240950 NA -0.005318
>> >
>> > When I use read.delim (or any read function) on it, R skips the first
>> > column, and I don't understand why.
>> >
>> > For example:
>> > $: R
>> >> data = read.delim('snp_file.txt', head=T, sep='\t')
>> >
>> > Now, I would expect data$snp_id to contain snp ids, and data$gene to
>> > contain gene names; but it is not like this:
>> >
>> >> data$snp_id
>> > [1] RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2
>> > Levels: RAPT2
>> >> data$gene
>> > [1] 3 3 3 3 3 3 3 3 3
>> >
>> >> summary(data)
>> >    snp_id        gene     chromosome       distance_from_gene_center
>> >  RAPT2:9   Min.   :3   Min.   :-129993   upstream:9
>> >            1st Qu.:3   1st Qu.:-128224
>> >            Median :3   Median :-126738
>> >            Mean   :3   Mean   :-113806
>> >            3rd Qu.:3   3rd Qu.:-123029
>> >            Max.   :3   Max.   : -12818
>> >  ....
>> >
>> >> data$pop7
>> > [1] NA NA NA NA NA NA NA NA NA
>> >
>> >
>> > Notice that it did use snp_id as the header for the first column, but it
>> > completely skips all the data from that column, and all the fields are
>> > shifted, so the last column is filled with NA values.
>> >
>> > What am I doing wrong? Could it be a problem with my data files? I have
>> > tried to modify them a bit (add new columns, etc.) but it didn't work.
>> >
>> > I am running R from an Ubuntu system:
>> >> sessionInfo()
>> > R version 2.9.1 (2009-06-26)
>> > i486-pc-linux-gnu
>> >
>> > locale:
>> >
>> > LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C
>> >
>> > attached base packages:
>> > [1] stats graphics grDevices utils datasets methods base
>> >
>> >
>> >
>> >
>> > --
>> > Giovanni Dall'Olio, phd student
>> > Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
>> >
>> > My blog on bioinformatics: http://bioinfoblog.it
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>