[R] Fixed Width EBCDIC Files in R

Brian Trautman btrautman84 at gmail.com
Thu Feb 5 23:45:33 CET 2015


First off, thank you very much for taking a look at this.  I didn't know
"raw=TRUE" would be necessary here.

Unfortunately, I'm stuck with the embedded nulls in the source data at this
point.  If worst comes to worst, does R have a way to do something like --

1.  Read the entire file in as raw binary.
2.  Replace all embedded nulls with spaces.
3.  Output the revised file (as binary) somewhere else.

?
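A minimal sketch of the three steps above in base R, assuming the whole file fits in memory. The function name `replace_nuls` and its arguments are hypothetical. One point to watch: the file is still EBCDIC at this stage, so the fill byte is the EBCDIC space (0x40) rather than the ASCII space (0x20).

```r
# Hypothetical helper: read a file as raw bytes, replace embedded NULs,
# and write the result to a new binary file.
replace_nuls <- function(infile, outfile, fill = as.raw(0x40)) {
  # 1. Read the entire file in as raw binary.
  size  <- file.info(infile)$size
  bytes <- readBin(infile, what = "raw", n = size)

  # 2. Replace every NUL byte (0x00) with the fill byte.
  #    0x40 is the space character in EBCDIC code pages such as IBM500.
  bytes[bytes == as.raw(0x00)] <- fill

  # 3. Write the revised bytes, still as binary, to the output file.
  writeBin(bytes, outfile)

  # Return the number of bytes processed, invisibly.
  invisible(length(bytes))
}
```

Something like `replace_nuls("EBCDIC_ZIPCODE", "EBCDIC_ZIPCODE_clean")` would then produce a copy that can be opened with the IBM500 encoding; the output file name here is only an example.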

I imagine it'd take a big performance penalty, but at least then I could proceed
with importing the revised file.
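For files around 1 GB, one way to limit the memory cost might be to stream the bytes through in fixed-size chunks instead of loading everything at once. The sketch below is an assumption about how that could look, not a measured solution; `replace_nuls_chunked` and `chunk_size` are hypothetical names, and 0x40 is again the EBCDIC space.

```r
# Hypothetical helper: copy infile to outfile chunk by chunk,
# replacing NUL bytes with the EBCDIC space along the way.
replace_nuls_chunked <- function(infile, outfile, chunk_size = 1e6,
                                 fill = as.raw(0x40)) {
  con_in <- file(infile, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "wb")
  on.exit(close(con_out), add = TRUE)

  total <- 0
  repeat {
    # Read up to chunk_size bytes; a zero-length result means end of file.
    bytes <- readBin(con_in, what = "raw", n = chunk_size)
    if (length(bytes) == 0L) break

    # Replace NULs within this chunk and append it to the output.
    bytes[bytes == as.raw(0x00)] <- fill
    writeBin(bytes, con_out)
    total <- total + length(bytes)
  }

  # Return the total number of bytes copied, invisibly.
  invisible(total)
}
```

Because the replacement is byte-for-byte, chunk boundaries do not need to line up with the 32-byte record width.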

Thanks again!

On Thu, Feb 5, 2015 at 2:06 PM, John McKown <john.archie.mckown at gmail.com>
wrote:

> On Thu, Feb 5, 2015 at 2:08 PM, Brian Trautman <btrautman84 at gmail.com>
> wrote:
>
>> I'm trying to read some mainframe data encoded as EBCDIC into R, and am at
>> a loss. I'd like to avoid using an external program to convert the files,
>> since I'm operating in a corporate environment.
>>
>> You can find the example files at the link below, with both ASCII and
>> EBCDIC versions. Note that there are no linebreaks in the EBCDIC versions
>> of the file -- instead, I'd be specifying the width of each line manually.
>> R has the IBM500 encoding available in my environment, which should be the
>> correct one for these files.
>>
>> However, when I run the following commands, R seems to fail entirely.  It
>> loads a single record with garbage characters, regardless of the encoding
>> I specified.
>>
>>
>> layout <- read.fwf("EBCDIC_LAYOUT", widths = c(80), fileEncoding='ibm500')
>>
>> data   <- read.fwf("EBCDIC_ZIPCODE", widths = c(32),
>> fileEncoding='ibm500')
>>
>>
>> Where might I go from here?
>>
>> Related -- some of the files I expect to use will be fairly large (1 GB or
>> so). Preferably, I'd like a solution that scales reasonably well. (I tried
>> packages like LaF, but they don't have the option to select encoding.)
>>
>> Thank you very much!
>>
>>
>> Example files --
>> https://drive.google.com/open?id=0ByvX1v-WqaaASTdwV2ZYS0pBV00&authuser=0
>>
>>
> I gave this a short try. What killed me (see below) is that your file
> EBCDIC_ZIPCODE has embedded NULL characters, \0. My transcript:
>
> > file<-file("EBCDIC_ZIPCODE",encoding="IBM500", raw=TRUE);
> > data=read.fwf(file,widths=c(32));
> Warning messages:
> 1: In readLines(file, n = thisblock) :
>   line 1 appears to contain an embedded nul
> 2: In readLines(file, n = thisblock) :
>   incomplete final line found on 'EBCDIC_ZIPCODE'
> > View(data)
>
> I don't know how to get past the embedded NULL. I'm a UNIX user, so my
> thought (not applicable with your restriction of "pure R"), would be to use
> "tr" to convert the \0 to spaces, then use the above.
>
>
> --
> He's about as useful as a wax frying pan.
>
> 10 to the 12th power microphones = 1 Megaphone
>
> Maranatha! <><
> John McKown
>
