[R] need help with excel data
Ista Zahn
istazahn at gmail.com
Thu Jan 22 03:58:09 CET 2015
I agree, R will be fine for this. Not being as expert with regex as
Jeff, I would tend to do this in a few steps, something like:
library(XLConnect)
DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )
library(stringi)
## insert a marker between the text and the numbers
txt <- stri_replace_all_regex(DF[[2]], "([^\\d]{2,})(\\d+ )", "$1|||$2")
## separate the text from the numbers
stringNums <- stri_split_fixed(txt, "|||", 2, simplify = TRUE)
## split the numbers apart
nums <- stri_split_regex(stringNums[, 2], "[^\\d]+", n = 5, simplify=TRUE)
## put it all back together
extracted <- data.frame(DF[, 1], stringNums[, 1], apply(nums, 2, as.numeric))
## put the names back
names(extracted) <- c(names(DF)[1], paste(names(DF)[2], 1:6, sep = "_"))
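
For anyone without XLConnect or stringi installed, the same step-by-step idea can be sketched in base R. This is only an illustration on made-up strings (ASCII placeholders stand in for the Cyrillic words, and the names x, digits and the cap of four numbers per row are my assumptions, not taken from the real file):

```r
## made-up stand-ins for the second spreadsheet column (not the real data)
x <- c("some words 12*23 34*45", "other words 12*23, 34*56")

## step 1: the text is everything before the first digit
txt <- trimws(sub("^(\\D*).*$", "\\1", x, perl = TRUE))

## step 2: pull out every run of digits, one character vector per row
digits <- regmatches(x, gregexpr("[0-9]+", x))

## step 3: one row per input string, four numbers per row
## (rows with fewer than four numbers are padded with NA)
nums <- t(vapply(digits, function(d) as.numeric(d[1:4]), numeric(4)))

## step 4: put it all back together
extracted <- data.frame(text = txt, nums, stringsAsFactors = FALSE)
```

Taking every run of digits, rather than splitting on a particular separator, means the "12*23 34*45" and "12*23, 34*56" variants are handled the same way.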
Best,
Ista
On Wed, Jan 21, 2015 at 8:02 PM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> I think R is quite capable of doing this. You would have to learn a
> comparable number of fiddly bits to accomplish this in R, Python or Perl.
>
> That is not to say that learning Perl or Python is a bad idea... but in
> terms of "shortest path" I think they are of comparable complexity. All
> three languages support regular expressions, which would be the key bit of
> knowledge to acquire regardless of which tool you use.
>
> Other fiddly bits might involve handling the Cyrillic strings as data,
> though you did not convey a desire to retain that information.
>
> One way (not extracting Cyrillic text):
>
> library(XLConnect)
> DF <- readWorksheetFromFile( "exampX.xlsx", sheet="examp" )
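> ## lazy .*? so the first group keeps all of its digits (a greedy .* leaves only the last one)
> pattern <- "^.*?(\\d+) *\\* *(\\d+)[^\\d]*(\\d+) *\\* *(\\d+).*$"
> idx <- grep( pattern, DF[[2]], perl=TRUE )
> dta <- sub( pattern, "\\1,\\2,\\3,\\4", DF[[2]][idx], perl=TRUE )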
> dtamatrix <- apply( do.call( rbind
> , strsplit( dta, "," ) )
> , 2
> , as.numeric
> )
> extracted <- data.frame( V1=DF[[1]][idx], dtamatrix )
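
One thing to watch with patterns of this shape is whether the leading `.*` is greedy or lazy, since that decides how many digits are left over for the first capture group. A small check on a made-up string (not the real data; the names s, greedy and lazy are just for illustration), using perl = TRUE throughout:

```r
## made-up test string in the "12*23 34*45" shape from the original question
s <- "some words 12*23 34*45"

## greedy: the leading .* takes as much as it can, including the "1" of "12"
greedy <- sub("^.*(\\d+) *\\* *(\\d+)[^\\d]*(\\d+) *\\* *(\\d+).*$",
              "\\1,\\2,\\3,\\4", s, perl = TRUE)

## lazy: .*? stops as early as it can, so the first group gets the whole "12"
lazy <- sub("^.*?(\\d+) *\\* *(\\d+)[^\\d]*(\\d+) *\\* *(\\d+).*$",
            "\\1,\\2,\\3,\\4", s, perl = TRUE)

print(greedy)  # "2,23,34,45"
print(lazy)    # "12,23,34,45"
```

The greedy form still matches every row, so the lost digits are easy to miss unless the output is compared against a few rows of the input.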
>
>
> On Wed, 21 Jan 2015, Collin Lynch wrote:
>
>> Dr. Polanski, I would recommend something else. Given the messy nature of
>> your data, I would suggest using a language like Python or Perl to extract
>> it to an appropriate format. Python has good regular expression support
>> and Unicode support. If you can save your data as a CSV file, or even as
>> text line by line, then it would be possible to write some code to read the
>> file, match the lines with a simple regular expression, and then write them
>> back out as a CSV file which you could read into R.
>>
>> I realize that this means learning a new language or finding someone with
>> the requisite skills, but I would recommend that over attempting to use R's
>> text processing.
>>
>> Collin.
>>
>> On Wed, Jan 21, 2015 at 3:31 PM, Dr Polanski <n.polyanskij at gmail.com>
>> wrote:
>>
>>> Hi all!
>>>
>>> Sorry to bother you. I am trying to learn some R via Coursera courses and
>>> other internet sources, yet haven't managed to get far.
>>>
>>> And now I need to do some, I hope, not too difficult things, which I
>>> think R can do, yet I have no idea how to make it do so.
>>>
>>> I have a big set of (empirical) data which was obtained by my colleagues
>>> and stored in an inconvenient way: all of the data is in two cells of an
>>> Excel table. An example of the data is in the attached file (the link):
>>>
>>>
>>>
>>> https://drive.google.com/file/d/0B64YMbf_hh5BS2tzVE9WVmV3bFU/view?usp=sharing
>>>
>>> So the first column has a number and the second has a whole vector (I
>>> guess it is) which looks like "some words in Cyrillic" (the length
>>> varies) and then the set of numbers "12*23 34*45" (another problem is
>>> that sometimes it is "12*23, 34*56").
>>>
>>> And the number of rows is about 3000, so it is impossible to do manually.
>>>
>>> What I need at the end is to have it separated into different Excel
>>> cells:
>>> - what is written in words - | 12 | 23 | 34 | 45 |
>>>
>>> Do you think it is possible to do so using R (or something else?)
>>>
>>> Thank you very much in advance, and sorry for asking for help with such
>>> a stupid question. The problem is, I am trying and yet haven't even
>>> managed to install openSUSE onto my laptop, only Ubuntu! :)
>>>
>>>
>>> Thank you very much!
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
> ---------------------------------------------------------------------------
> Jeff Newmiller
> DCN: <jdnewmil at dcn.davis.ca.us>
> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
>