[R] Strange column shifting with read.table
David Winsemius
dwinsemius at comcast.net
Mon Aug 3 01:14:17 CEST 2009
On Aug 2, 2009, at 7:02 PM, Noah Silverman wrote:
> Hi,
>
> It seems as if the problem was caused by an odd quirk of the "scale"
> function.
>
> Some of my data have NA entries.
>
> So, I substitute 0 for any NA with:
> rawdata[is.na(rawdata)] <- 0
Perhaps this would have done what you intended:
rawdata[is.na(rawdata), ] <- 0
# But this is added _only_ as a matter of coding behavior. See below.
>
> I then scale the data.
>
> For some reason that I don't understand, I find some NA back in the
> data after the scale command.
> But, issuing the same 0 substitution AFTER the scale command makes
> everything work again.
> rawdata[is.na(rawdata)] <- 0
It "works" because rawdata has been converted by scale() to a matrix
which can be accessed as a vector.
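
A minimal sketch of that point, using a small made-up data frame (toy is
hypothetical, not the poster's data): scale() returns a numeric matrix, and
a logical matrix of the same shape indexes it element by element. The
constant column is only an assumption about one way NaN, which is.na() also
flags, can turn up after scaling. Nothing in this thread confirms that this
is what happened here.

```r
# Hypothetical toy data; column b is constant on purpose.
toy <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5))

scaled <- scale(toy)
is.matrix(scaled)        # scale() coerces its input to a matrix

# A constant column has standard deviation 0, so centering and
# dividing gives 0/0 = NaN, which is.na() reports as missing.
nan_in_b <- all(is.nan(scaled[, "b"]))

# Logical-matrix indexing assigns element by element.
scaled[is.na(scaled)] <- 0
```
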
>
The notion of substituting zeroes for NA seems "so wrong". And the idea
that you might get the same results from doing so before scale() as after
scale() seems additionally bizarre.
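
To illustrate why zero-filling is suspect, here is a small sketch with a
made-up vector x (hypothetical values). Per its documentation, scale() omits
missing values when it computes the column mean and standard deviation and
leaves them as NA in the result, whereas replacing them with 0 first changes
both statistics and therefore every scaled value.

```r
x <- c(1, NA, 3)                  # hypothetical values

# NA left in place: mean and sd come from 1 and 3 only.
kept <- as.numeric(scale(x))

# NA replaced by 0 first: the 0 is treated as a real observation.
filled <- as.numeric(scale(replace(x, is.na(x), 0)))

kept
filled
```
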
>
> VERY strange behavior.
>
Your behavior might be seen as VERY strange by some.
--
D
> -N
>
> On 8/2/09 3:57 PM, J Dougherty wrote:
>> On Sunday 02 August 2009 02:34:43 pm Noah Silverman wrote:
>>
>>> The column names have to be obfuscated, but here are 10 rows of the
>>> data.
>>>
>>> label c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13
>>> c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27
>>> c28 c29 c30 c31 c32 c33 c34 c35 c36 c37 c38 c39 c40 c41
>>> c42 c43 c44 c45 c46 c47 c48 c49 c50 c51 c52 c53 c54 c55
>>> c56 c57 c58 c59 c60 c61 c62 c63 c64 c65 c66
>>> sick 2008-12-28_1 95.609 5 3.3 1.35 0 1 35 9.6666 0 0
>>> 0.0833 1 0.0833 1 0.1428 7 3 2.035714286 6.5 94.8481
>>> 53.846 12 -4.69 1.25 0.5062 0.0522 0.1808 3 0.5126 0.0694
>>> 0.2061 94.9288 8.3125 0.0247 7.5833 9.3 35 9.6666 0 0
>>> 0.0833 1 0.0833 1 0.1428 7 3 2.035714286 6.5 94.8481
>>> 53.846 12 -4.69 1.25 0.5062 0.0522 0.1808 3 0.5126 0.0694
>>> 0.2061 94.9288 8.3125 0.0247 7.5833 9.3
>>> well 2008-12-28_1 95.338 1 11 3.2 3 2 11 7.0277 0.0555 2
>>> 0.1666 6 0.1666 5 0.238 18 11 2.541666667 2.022727273
>>> 94.7733
>>> 38.461 36 6.07 7.5555 0.5928 0.0955 0.2871 0 0.5434 0.0679
>>> 0.2283 95.9003 5.1736 0.0847 7.3333 28 11 7.0277 0.0555 2
>>> 0.1666 6 0.1666 5 0.238 18 11 2.541666667 2.022727273
>>> 94.7733
>>> 38.461 36 6.07 7.5555 0.5928 0.0955 0.2871 0 0.5434 0.0679
>>> 0.2283 95.9003 5.1736 0.0847 7.3333 28
>>> well 2008-12-28_1 95.204 2 7.4 2.75 4 1 22 8.4545 0 0
>>> 0 0 0 0 0 6 4 2.791666667 2.5625 94.8444 61.538 11 2.84
>>> 3.0909 0.5693 0.0641 0.2738 0 0.5874 0.1011 0.2803 94.9769
>>> 8.1363 0.0467 5.4545 10 22 8.4545 0 0 0 0 0 0 0 6 4
>>> 2.791666667 2.5625 94.8444 61.538 11 2.84 3.0909 0.5693
>>> 0.0641
>>> 0.2738 0 0.5874 0.1011 0.2803 94.9769 8.1363 0.0467 5.4545
>>> 10
>>> sick 2008-12-28_1 95.204 14 48
>>> 0 3 25 8.7045 0.0909 4 0.2045 9 0.2045 4 0.2666 11 8
>>> 4.409090909 0 95.0006 15.384 44 1.76 7.409 0.4475 0.0285
>>> 0.1206 0 0.5094 0.058 0.1931 92.9455 7.2613 0.0532 4.5227
>>> 82 25 8.7045 0.0909 4 0.2045 9 0.2045 4 0.2666 11 8
>>> 4.409090909 0 95.0006 15.384 44 1.76 7.409 0.4475 0.0285
>>> 0.1206 0 0.5094 0.058 0.1931 92.9455 7.2613 0.0532 4.5227
>>> 82
>>> well 2008-12-28_1 95.07 13 26
>>> 1 1 11 8.1 0.0666 2 0.1666 5 0.1666 0 0 21 16
>>> 2.571428571 1.984375 94.825 30.769 30 -4.69 -0.7999 0.5166
>>> 0.0624 0.2078 0 0.5306 0.0792 0.2398 95.2282 7.575 0.0715
>>> 3.4333 44 11 8.1 0.0666 2 0.1666 5 0.1666 0 0 21 16
>>> 2.571428571 1.984375 94.825 30.769 30 -4.69 -0.7999 0.5166
>>> 0.0624 0.2078 0 0.5306 0.0792 0.2398 95.2282 7.575 0.0715
>>> 3.4333 44
>>> well 2008-12-28_1 95.07 9 16
>>> 0 4 39 9.4117 0 0 0.0588 1 0.0588 0 0 3 25 3.916666667
>>> 2.96 94.8177 30.769 17 -20.84 -15.8234 0.8205 0.3333
>>> 0.6666 0
>>> 0.6054 0.1287 0.3292 95.3232 6.9117 0.076 2.647 16 39
>>> 9.4117 0 0 0.0588 1 0.0588 0 0 3 25 3.916666667 2.96
>>> 94.8177 30.769 17 -20.84 -15.8234 0.8205 0.3333 0.6666 0
>>> 0.6054 0.1287 0.3292 95.3232 6.9117 0.076 2.647 16
>>> sick 2008-12-28_1 94.936 6 11
>>> 4 1 28 7.725 0.075 3 0.125 5 0.125 0 0 6 2 4 1.75
>>> 94.7815 46.153 40 6.07 12.5 0.5014 0.0621 0.1972 6 0.523
>>> 0.0742 0.2035 95.794 6.0625 0.046 7.25 12 28 7.725 0.075 3
>>> 0.125 5 0.125 0 0 6 2 4 1.75 94.7815 46.153 40 6.07
>>> 12.5
>>> 0.5014 0.0621 0.1972 6 0.523 0.0742 0.2035 95.794 6.0625
>>> 0.046 7.25 12
>>> well 2008-12-28_1 94.803 11 13
>>> 0 5 35 7.125 0.0937 3 0.1562 5 0.1562 5 0.2 18 17
>>> 1.555555556 2.794117647 95.0398 38.461 32 10.38 8.4063 0.5804
>>> 0.0871 0.2627 1 0.558 0.0738 0.2324 92.4367 5.289 0.0722
>>> 9.125 16 35 7.125 0.0937 3 0.1562 5 0.1562 5 0.2 18 17
>>> 1.555555556 2.794117647 95.0398 38.461 32 10.38 8.4063 0.5804
>>> 0.0871 0.2627 1 0.558 0.0738 0.2324 92.4367 5.289 0.0722
>>> 9.125 16
>>> well 2008-12-28_1 94.67 4 38
>>> 5 1 11 8.9642 0.0357 1 0.1428 4 0.1428 4 0.2105 11 13
>>> 3.772727273 4.307692308 94.8451 23.076 28 -5.76 -4 0.3269 0
>>> 0.0833 0 0.5222 0.0616 0.2079 94.9668 8.6696 0.0663 4.6428
>>> 14 11 8.9642 0.0357 1 0.1428 4 0.1428 4 0.2105 11 13
>>> 3.772727273 4.307692308 94.8451 23.076 28 -5.76 -4 0.3269 0
>>> 0.0833 0 0.5222 0.0616 0.2079 94.9668 8.6696 0.0663 4.6428
>>> 14
>>> well 2008-12-28_1 94.537 12 39
>>> 0 1 35 9.4444 0 0 0 0 0 0 0 2 7 2.5 2.892857143
>>> 94.878
>>> 23.076 9 -12.23 -9.6666 0.4428 0 0.0857 0 0.5411 0.0849
>>> 0.25
>>> 94.54 8.9166 0.0296 6.1111 67 35 9.4444 0 0 0 0 0 0 0
>>> 2 7 2.5 2.892857143 94.878 23.076 9 -12.23 -9.6666 0.4428
>>> 0
>>> 0.0857 0 0.5411 0.0849 0.25 94.54 8.9166 0.0296 6.1111 67
>>>
>>>
>>>
>> Your initial post mentions 70 columns in your data table, yet the
>> example shows 67 counting the initial "labels" term in the header.
>> I would suggest adding "row.names = NULL" to force row numbers and
>> see how that behaves, e.g.
>>
>> rawdata <- read.table("r_work/train_data.csv", header=TRUE, sep=",",
>> na.strings=0, row.names = NULL)
>>
>> Otherwise, you might want to consult the R Manual where it states:
>>
>> header: a logical value indicating whether the file contains the
>> names of the variables as its first line. If missing, the value is
>> determined from the file format: header is set to TRUE if and only
>> if the first row contains one fewer field than the number of
>> columns.
>>
>> So, you might also want to count up your column names in the
>> header line.
>>
>> JWDougherty
>>
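
A quick way to check the header count is sketched below on a tiny made-up
file (the text in txt is hypothetical, not the poster's data). It uses
count.fields() to compare the number of fields on the header line with the
data lines, then shows what read.table() does when the header is one field
short, with and without row.names = NULL.

```r
# Hypothetical file contents: 2 header names, 3 fields per data row.
txt <- "c0,c1\nsick,95.6,5\nwell,95.3,1"

# Fields per line; the first entry is the header line.
con <- textConnection(txt)
n_fields <- count.fields(con, sep = ",")
close(con)
n_fields

# Header one field short: the first column is taken as row names,
# so the remaining columns line up under the header names.
shifted <- read.table(text = txt, header = TRUE, sep = ",")

# row.names = NULL keeps the first field as an ordinary column instead.
kept_col <- read.table(text = txt, header = TRUE, sep = ",",
                       row.names = NULL)
names(kept_col)
```

If the counts for the header and the data rows differ by anything other
than zero or one, the header line itself probably needs fixing before
read.table() can line the columns up.
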
David Winsemius, MD
Heritage Laboratories
West Hartford, CT