[R] readLines without skipNul=TRUE causes crash
Anthony Damico
ajdamico at gmail.com
Sun Jul 16 12:31:08 CEST 2017
sorry, typo, 80937 not 809367
On Sun, Jul 16, 2017 at 6:21 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> hi, thank you for attempting this. it looks like your unix machine
> unzipped the txt file without corruption -- if you copied over the same txt
> file to windows 7, i don't think that would reproduce the problem? i think
> it needs to be the corrupted text file where R.utils::countLines( txtfile
> ) gives 809367. i am able to reproduce on two distinct windows machines
> but no guarantee i'm not doing something dumb
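One way to tell whether a given copy of the text file is the damaged one, without loading it through readLines(), is to count its NUL bytes from raw chunks. The following is only a sketch: `count_nul_bytes` and its `chunk_size` argument are made-up names for illustration, not part of base R or of any package mentioned in this thread.

```r
# Count NUL bytes in a file by reading fixed-size raw chunks,
# so that no single huge character string is ever allocated.
# (count_nul_bytes is a hypothetical helper, not an existing function.)
count_nul_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  total <- 0  # kept as a double so very large counts cannot overflow
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    total <- total + sum(chunk == as.raw(0))
  }
  total
}

# Small demonstration file: two text lines with three NUL bytes mixed in
demo_file <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("ab"), as.raw(0), charToRaw("cd\n"), as.raw(c(0, 0)), charToRaw("ef\n")),
  demo_file
)
count_nul_bytes(demo_file)
```

A count of zero would suggest the file was extracted cleanly, in which case the crash would not be expected to reproduce with that copy.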
>
> On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
> wrote:
>
>> I am not able to reproduce your segfault on a Windows 7 platform either:
>>
>> ##########################
>> fn1 <- "d:/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
>> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
>> ##
>> ## Matrix products: default
>> ##
>> ## locale:
>> ## [1] LC_COLLATE=English_United States.1252
>> ## [2] LC_CTYPE=English_United States.1252
>> ## [3] LC_MONETARY=English_United States.1252
>> ## [4] LC_NUMERIC=C
>> ## [5] LC_TIME=English_United States.1252
>> ##
>> ## attached base packages:
>> ## [1] stats graphics grDevices utils datasets methods base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## d:/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>>
>> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>>
>> I am not able to reproduce this on a Linux platform:
>>>
>>> #######################3
>>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>>> sessionInfo()
>>> ## R version 3.4.1 (2017-06-30)
>>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>>> ## Running under: Ubuntu 14.04.5 LTS
>>> ##
>>> ## Matrix products: default
>>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>>> ##
>>> ## locale:
>>> ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>>> ## [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> ##
>>> ## attached base packages:
>>> ## [1] stats graphics grDevices utils datasets methods base
>>> ##
>>> ## loaded via a namespace (and not attached):
>>> ## [1] compiler_3.4.1
>>> tools::md5sum( fn1 )
>>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>>> dat <- readLines( fn1 )
>>> length( dat )
>>> ## [1] 4148721
>>>
>>> No segfault occurs.
>>>
>>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>>
>>> hi, i realized that the segfault happens on the text file in a new R
>>>> session. so, creating the segfault-generating text file requires a
>>>> contributed package, but prompting the actual segfault does not -- pretty
>>>> sure that means this is a base R bug? submitted here:
>>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i am
>>>> not doing something remarkably stupid. the text file itself is 4GB so
>>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
>>>> previous message, i think most or all of it needs to be there to trigger
>>>> the segfault. thanks!
>>>>
>>>>
>>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com>
>>>> wrote:
>>>>
>>>> hi, thanks Dr. Murdoch
>>>>>
>>>>>
>>>>> i'd appreciate if anyone on r-help could help me narrow this down? i
>>>>> believe the segfault occurs because there's a single line with 4GB and also
>>>>> embedded nuls, but i am not sure how to artificially construct that?
>>>>>
>>>>>
>>>>> the lodown package can be removed from my example.. it is just for file
>>>>> download caching, so `lodown::cachaca` can be replaced with
>>>>> `download.file`. my current example requires a huge download, so sort of
>>>>> painful to repeat but i'm pretty confident that's not the issue.
>>>>>
>>>>>
>>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>>>>> file and creates a text file with 80,937 lines. this file is 4GB:
>>>>>
>>>>> > file.size(infile)
>>>>> [1] 4078192743
>>>>>
>>>>>
>>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
>>>>> in the file. here's what happens when i create a file connection and scan
>>>>> through..
>>>>>
>>>>> > w <- file( infile , 'r' )
>>>>> >
>>>>> > first_80936_lines <- readLines( w , n = 80936 )
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "1000023930632009"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "36F2924009PAULO"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "AFONSO"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "BA11"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "00000"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "00"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "2924009PAULO"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "AFONSO"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "BA1111"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "467.20"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "346.10"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Read 1 item
>>>>> [1] "414.40"
>>>>> > scan( w , n = 1 , what = character() )
>>>>> Error in scan(w, n = 1, what = character()) :
>>>>> could not allocate memory (2048 Mb) in C function
>>>>> 'R_AllocStringBuffer'
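The claim that nearly all of the 4GB sits on one line can be checked without asking R to build that line as a string, by scanning raw chunks for newline bytes. This is a rough sketch under that assumption: `longest_line_bytes` is a hypothetical helper name, and it counts bytes between "\n" characters, so a trailing "\r" on Windows-style lines is included in the length.

```r
# Length in bytes of the longest line in a file, found by scanning raw chunks.
# (longest_line_bytes is a hypothetical helper, not an existing function.)
longest_line_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  newline <- as.raw(10L)
  longest <- 0  # doubles throughout, so lengths above 2^31 - 1 are representable
  current <- 0  # bytes seen since the most recent newline
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    nl <- which(chunk == newline)
    if (length(nl) == 0L) {
      # no newline in this chunk: the current line just keeps growing
      current <- current + length(chunk)
    } else {
      # lengths of the lines that end inside this chunk, newline excluded
      lens <- diff(c(0, nl)) - 1
      lens[1] <- lens[1] + current  # first one includes carried-over bytes
      longest <- max(longest, lens)
      current <- length(chunk) - nl[length(nl)]
    }
  }
  max(longest, current)  # account for an unterminated final line
}

# Small demonstration: lines of 2, 8 and 1 bytes
demo_file <- tempfile(fileext = ".txt")
writeBin(charToRaw("ab\nabcdefgh\nc"), demo_file)
longest_line_bytes(demo_file, chunk_size = 3)
```

A result above 2^31 - 1 bytes on the real file would be consistent with the 2048 Mb allocation failure shown above, though it would not by itself explain the segfault.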
>>>>>
>>>>>
>>>>>
>>>>> making a huge single-line file does not reproduce the problem, i think the
>>>>> embedded nuls have something to do with it--
>>>>>
>>>>>
>>>>> # WARNING do not run with less than 64GB RAM
>>>>> tf <- tempfile()
>>>>> a <- rep( "a" , 1000000000 )
>>>>> b <- paste( a , collapse = '' )
>>>>> writeLines( b , tf ) ; rm( b ) ; gc()
>>>>> d <- readLines( tf )
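On the question of how to construct embedded nuls artificially: writeBin() on a raw vector can place NUL bytes anywhere in a file, which writeLines() cannot do because R character strings cannot hold them. The sketch below works at a tiny scale only. It shows how readLines() treats such a line with and without skipNul; it is not expected to reproduce the segfault, which seems to need the multi-gigabyte line as well.

```r
# Build a three-line text file whose second line contains two embedded NULs.
tf <- tempfile(fileext = ".txt")
payload <- c(
  charToRaw("first line\n"),
  charToRaw("abc"), as.raw(c(0, 0)), charToRaw("def\n"),
  charToRaw("last line\n")
)
writeBin(payload, tf)

# skipNul = TRUE drops the NUL bytes and keeps the rest of the line
with_skip <- readLines(tf, skipNul = TRUE)

# the default (skipNul = FALSE) warns about the embedded nul;
# the warning is suppressed here only to keep the output short
without_skip <- suppressWarnings(readLines(tf))

print(with_skip)
print(without_skip)
```

Scaling this up would mean writing the raw payload in pieces to a connection opened in "wb" mode rather than building one giant vector in memory, but whether that triggers the same crash is untested here.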
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <
>>>>> murdoch.duncan at gmail.com>
>>>>> wrote:
>>>>>
>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>>
>>>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>>>>>> i think i should submit to https://bugs.r-project.org/ unless others have
>>>>>>> advice? thanks
>>>>>>>
>>>>>>>
>>>>>> Segfaults are usually worth reporting as bugs. Try to come up with a
>>>>>> self-contained example, not using the lodown and archive packages. I
>>>>>> imagine you can do this by uploading the file you downloaded, or enough of
>>>>>> a subset of it to trigger the segfault. If you can't do that, then likely
>>>>>> the bug is with one of those packages, not with R.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> install.packages( "devtools" )
>>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>>> devtools::install_github("jimhester/archive")
>>>>>>>
>>>>>>>
>>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>>
>>>>>>> tf <- tempfile()
>>>>>>>
>>>>>>> # large download! cachaca saves on your local disk if already downloaded
>>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>>
>>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>>
>>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>>
>>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>>
>>>>>>> # works
>>>>>>> R.utils::countLines( infile )
>>>>>>>
>>>>>>> # works with warning
>>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>>
>>>>>>> # crash
>>>>>>> my_file <- readLines( infile )
>>>>>>>
>>>>>>>
>>>>>>> # run just before crash
>>>>>>> sessionInfo()
>>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>>
>>>>>>> # Matrix products: default
>>>>>>>
>>>>>>> # locale:
>>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>>> # [4] LC_NUMERIC=C
>>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>>
>>>>>>> # attached base packages:
>>>>>>> # [1] stats graphics grDevices utils datasets methods base
>>>>>>>
>>>>>>> # loaded via a namespace (and not attached):
>>>>>>> # [1] httr_1.2.1 compiler_3.4.1 R6_2.2.1 withr_1.0.2
>>>>>>> # [5] tibble_1.3.3 curl_2.6 Rcpp_0.12.11 memoise_1.1.0
>>>>>>> # [9] R.methodsS3_1.7.1 git2r_0.18.0 digest_0.6.12 lodown_0.1.0
>>>>>>> # [13] R.utils_2.5.0 rlang_0.1.1 devtools_1.13.2 R.oo_1.21.0
>>>>>>> # [17] archive_0.0.0.9000
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>
>>
>
>