[R] read.delim skips first column (why?)

Tue Jul 14 13:37:19 CEST 2009

Try

count.fields("myfile.txt", sep = "\t")

read.delim uses sep = "\t" but there are trailing tabs
on some lines.
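
To see which lines carry trailing tabs, and how many, something like the following may help. It is only a sketch: the temporary file and its contents are made up to stand in for "myfile.txt", and the names `raw` and `trailing` are mine.

```r
# Hypothetical stand-in for "myfile.txt": two of the three lines end in tabs.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc\t\t", "1\t2\t3", "4\t5\t6\t"), tmp)

raw <- readLines(tmp)

# Length of the run of tabs at the end of each line.
# regexpr() reports a match length of -1 when a line has no trailing tab,
# so pmax() clamps those to 0.
m <- regexpr("\t+$", raw)
trailing <- pmax(attr(m, "match.length"), 0L)

# Show only the lines that have trailing tabs, with the count for each.
print(data.frame(line = seq_along(raw), trailing_tabs = trailing)[trailing > 0, ])
```

Pointing readLines() at the real file instead of `tmp` should give the same kind of report for your data.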

The first line, i.e. the header, has three trailing tabs,
so read.delim counts 15 columns rather than 12.

The 5th line of the file (4th line of data) has 4 trailing
tabs. read.delim looks at the first five lines of the file to
decide how many columns there are, so it concludes that a data
line can have up to 16 fields.

Since it now sees 16 fields of data but only 15 header fields,
i.e. a header one field shorter than the data, it assumes the
extra field, the first one, holds the row names.
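
Here is a small sketch that reproduces the effect and shows one possible workaround, namely stripping the trailing tabs before parsing. The three-column file is invented for illustration (it is not your data), and the object names are mine. Note that this workaround would also remove genuinely empty fields at the end of a line, so it assumes missing values are written out as NA, as they are in your file.

```r
# Hypothetical miniature of the problem: 3 real columns, stray trailing tabs.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "snp\tgene\tchromosome\t",   # header plus 1 trailing tab
  "rs1\tRAPT2\t3",             # clean data line
  "rs2\tRAPT2\t3\t\t"          # data line plus 2 trailing tabs
), tmp)

# Fields per line as read.delim will see them.
n_fields <- count.fields(tmp, sep = "\t")

# The widest data line has one field more than the header, so the first
# column is taken as row names and the headers shift by one.
bad <- read.delim(tmp, header = TRUE, sep = "\t")

# Workaround: drop trailing tabs from every line, then parse the cleaned text.
clean <- sub("\t+$", "", readLines(tmp))
con <- textConnection(clean)
good <- read.delim(con, header = TRUE, sep = "\t")
close(con)

print(n_fields)
print(rownames(bad))   # the snp ids end up here
print(names(good))
```

Passing row.names = NULL is not enough on its own here, because the extra empty columns are still counted; removing the trailing tabs (in R as above, or in the file itself) deals with the cause.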

On Tue, Jul 14, 2009 at 5:11 AM, Giovanni Marco
Dall'Olio<dalloliogm at gmail.com> wrote:
> Hi,
> I have uploaded a copy of the file here:
> - http://pastebin.com/fd0edfab
>
> The file has also been passed through the Unix command-line tool unexpand,
> but that doesn't solve the problem.
>
> Using head=TRUE instead of head=T has the same effect.
>
> the output of print(names) is:
>> print(names(ngly), quote=TRUE)
>  [1] "snp"                       "gene"
>  [3] "chromosome"                "distance_from_gene_center"
>  [5] "position"                  "ame"
>  [7] "csasia"                    "easia"
>  [9] "eur"                       "mena"
> [11] "oce"                       "ssafr"
> [13] "X"                         "X.1"
> [15] "X.2"
>
> Thank you to everyone who replied to me directly, but I
> haven't been able to solve the problem yet.
>
>
> On Tue, Jul 14, 2009 at 12:36 AM, jim holtman <jholtman at gmail.com> wrote:
>
>> Can you send your file as an attachment since it is impossible to see
>> where the separator characters are.
>>
>> On Mon, Jul 13, 2009 at 1:27 PM, Giovanni Marco
>> Dall'Olio<dalloliogm at gmail.com> wrote:
>> > Hi people,
>> > I have a text file like this one posted:
>> >
>> > snp_id  gene  chromosome  distance_from_gene_center  position  pop1  pop2  pop3  pop4  pop5  pop6  pop7
>> > rs2129081  RAPT2  3  -129993  "upstream"  0.439009  1.169210  NA  0.233020  0.093042  NA  -0.902596
>> > rs1202698  RAPT2  3  -128695  "upstream"  NA  1.815000  NA  0.399079  1.814270  1.382950  NA
>> > rs1163207  RAPT2  3  -128224  "upstream"  NA  NA  NA  NA  NA  NA  NA
>> > rs1834127  RAPT2  3  -128106  "upstream"  NA  NA  NA  NA  NA  NA  2.180670
>> > rs2114211  RAPT2  3  -126738  "upstream"  -0.468279  -1.447620  NA  0.010616  -0.414581  NA  0.550447
>> > rs2113151  RAPT2  3  -124620  "upstream"  -0.897660  -1.971020  NA  -0.920327  -0.764658  NA  0.337127
>> > rs2524130  RAPT2  3  -123029  "upstream"  -0.109795  -0.004646  -0.412059  1.116740  0.667567  -0.924529  0.962841
>> > rs1381318  RAPT2  3  -12818  "upstream"  -0.911662  -1.791580  NA  -0.945716  -1.239640  NA  0.004876
>> > rs2113319  RAPT2  3  -122028  "upstream"  -0.911662  -1.738610  NA  -0.945716  -1.240950  NA  -0.005318
>> >
>> > When I use read.delim (or any read function) on it, R skips the first
>> > column, and I don't understand why.
>> >
>> > For example:
>> > $: R
>> >> data = read.delim('snp_file.txt', head=T, sep='\t')
>> >
>> > Now, I would expect data$snp_id to contain snp ids, and data$gene to
>> > contain gene names; but it is not like this:
>> >
>> >> data$snp_id
>> > [1] RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2
>> > Levels: RAPT2
>> >> data$gene
>> > [1] 3 3 3 3 3 3 3 3 3
>> >
>> >> summary(data)
>> >  snp_id       gene     chromosome      distance_from_gene_center
>> >  RAPT2:9   Min.   :3   Min.   :-129993   upstream:9
>> >           1st Qu.:3   1st Qu.:-128224
>> >           Median :3   Median :-126738
>> >           Mean   :3   Mean   :-113806
>> >           3rd Qu.:3   3rd Qu.:-123029
>> >           Max.   :3   Max.   : -12818
>> > ....
>> >
>> >> data$pop7
>> > [1] NA NA NA NA NA NA NA NA NA
>> >
>> >
>> > Notice that it did use snp_id as the header for the first column, but it
>> > completely skips all the data from that column, and all the fields are
>> > shifted, so the last column is filled with NA values.
>> >
>> > What am I doing wrong? Could it be a problem with my data files? I have
>> > tried to modify them a bit (add new columns, etc.) but it didn't work.
>> >
>> > I am running R from an Ubuntu system:
>> >> sessionInfo()
>> > R version 2.9.1 (2009-06-26)
>> > i486-pc-linux-gnu
>> >
>> > locale:
>> > LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C
>> >
>> > attached base packages:
>> > [1] stats     graphics  grDevices utils     datasets  methods   base
>> >
>> >
>> >
>> >
>> > --
>> > Giovanni Dall'Olio, phd student
>> > Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
>> >
>> > My blog on bioinformatics: http://bioinfoblog.it
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
>
>
>
> --
> Giovanni Dall'Olio, phd student
> Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
>
> My blog on bioinformatics: http://bioinfoblog.it