[R] how to separate string from numbers in a large txt file
David Winsemius
dw|n@em|u@ @end|ng |rom comc@@t@net
Fri May 17 05:29:51 CEST 2019
On 5/16/19 3:53 PM, Michael Boulineau wrote:
> OK. So, I named the object test and then checked the 6347th item
>
>> test <- readLines("hangouts-conversation.txt")
>> test [6347]
> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
>
> Perhaps where it was getting screwed up is, since the end of this is a
> number (8242), then, given that there's no space between the number
> and what ought to be the next row, R didn't know where to draw the
> line. Sure enough, it looks like this when I go to the original file
> and control f "#8242"
>
> 2016-10-21 10:35:36 <Jane Doe> What's your login
> 2016-10-21 10:56:29 <John Doe> John_Doe
> 2016-10-21 10:56:37 <John Doe> Admit#8242
An octothorpe (#) is the default comment character in `read.table`, so
everything after it on a line is discarded. You can prevent that
interpretation with `comment.char = ""` in `read.table` (it is already
the default in `read.csv`). I don't understand why that should cause an
error or a failure to match that pattern.
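A small illustration of the comment-character behaviour, as a sketch only: the one-line tab-separated record below is made up for the demonstration and is not taken from the actual file. `readLines` does not treat "#" specially, which is why `test[6347]` above still shows "Admit#8242".

```r
# One made-up tab-separated record whose middle field contains a "#".
txt <- "2016-10-21\tAdmit#8242\tend"

# Default comment.char = "#": the rest of the line after "#" is discarded.
dropped <- read.table(text = txt, sep = "\t", quote = "")

# comment.char = "" switches comment handling off, so the field survives.
kept <- read.table(text = txt, sep = "\t", quote = "", comment.char = "")

ncol(dropped)   # fewer fields than the record really has
ncol(kept)
kept[1, 2]
```

A record that loses fields this way is one possible route to a "did not have 4 elements" error from `scan`.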
> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>
> Again, it doesn't look like that in the file. Gmail automatically
> formats it like that when I paste it in. More to the point, it looks
> like
>
> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
> 11:00:13 <Jane Doe> Okay so you have a discussion
>
> Notice Admit#82422016. So there's that.
>
> Then I built object test2.
>
> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
>
> This worked for 84 lines, then this happened.
It may have done something, but as you later discovered, my first code
for the pattern was incorrect. I had tested it (and pasted in the results
of the test). The way to refer to a capture group is with back-slashes
before the numbers, not forward-slashes. Try this:
> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> newvec
[1] "2016-07-01,02:50:35,<john>,hey"
[2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
[3] "2016-07-01,02:51:45,<john>,thinking about my boo"
[4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
[5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
[6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
[7] "2016-07-01,02:54:17,<john>,just know it's london"
[8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
[9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
[10] "2016-07-01 02:58:56 <jone>"
[11] "2016-07-01 02:59:34 <jane>"
[12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
I made note of the fact that the 10th and 11th lines had no commas.
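To see the difference between the two replacement strings side by side, here is a minimal sketch on a two-element vector copied from the sample above. The names `pat`, `wrong` and `right` are just for illustration.

```r
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:58:56 <jone>")
pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"

# Forward slashes are ordinary characters, so a matched line is replaced
# by the literal text "//1,//2,//3,//4".
wrong <- sub(pat, "//1,//2,//3,//4", chrvec)

# "\\1" ... "\\4" are back-references to the four capture groups.
right <- sub(pat, "\\1,\\2,\\3,\\4", chrvec)

wrong
right

# Elements that `sub` left untouched, i.e. lines the pattern did not match.
which(right == chrvec)
```

The second element has no comment after the name, so it fails to match and is returned unchanged in both cases.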
>
>> test2 [84]
> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
That line didn't have any "<" so wasn't matched.
You could remove all non-matching lines for the pattern
dates<space>times<space>"<"<name>">"<space><anything>
with:
chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
Do read:
?read.csv
?regex
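Since some comments contain commas themselves, reading the comma-joined strings with `read.csv` could split a comment across columns. One alternative, offered only as a sketch and not something tested on the real file, is to skip the comma step and pull the capture groups straight into a data frame with `regexec` and `regmatches`. The sample vector and the names `keep`, `parts` and `chat` are made up for the illustration.

```r
chrvec <- c("2016-06-28 21:12:43 *** John Doe ended a video chat",
            "2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:52:07 <jane> nothing crappy has happened, not really",
            "2016-07-01 02:58:56 <jone>")
pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"

# Keep only the lines of the form date, time, <name>, comment.
keep <- chrvec[grepl(pat, chrvec)]

# For each kept line: the full match followed by the four capture groups.
parts <- regmatches(keep, regexec(pat, keep))

# Bind into a matrix, drop the full-match column, convert to a data frame.
chat <- as.data.frame(do.call(rbind, parts)[, -1, drop = FALSE],
                      stringsAsFactors = FALSE)
names(chat) <- c("date", "time", "person", "comment")
chat
```

The "***" status line and the empty-comment line are dropped by the `grepl` filter, and the comma inside the comment stays in a single column.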
--
David
>> test2 [85]
> [1] "//1,//2,//3,//4"
>> test [85]
> [1] "2016-07-01 02:50:35 <John Doe> hey"
>
> Notice how I toggled back and forth between test and test2 there. So,
> whatever happened with the regex, it happened in the switch from 84 to
> 85, I guess. It went on like
>
> [990] "//1,//2,//3,//4"
> [991] "//1,//2,//3,//4"
> [992] "//1,//2,//3,//4"
> [993] "//1,//2,//3,//4"
> [994] "//1,//2,//3,//4"
> [995] "//1,//2,//3,//4"
> [996] "//1,//2,//3,//4"
> [997] "//1,//2,//3,//4"
> [998] "//1,//2,//3,//4"
> [999] "//1,//2,//3,//4"
> [1000] "//1,//2,//3,//4"
>
> up until line 1000, then I reached max.print.
> Michael
>
> On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius using comcast.net> wrote:
>>
>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
>>> Thanks for this tip on etiquette, David. I will be sure and not do that again.
>>>
>>> I tried the read.fwf from the utils package, with a code like this:
>>>
>>> d <- read.fwf("hangouts-conversation.txt",
>>> widths= c(10,10,20,40),
>>> col.names=c("date","time","person","comment"),
>>> strip.white=TRUE)
>>>
>>> But it threw this error:
>>>
>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
>>> line 6347 did not have 4 elements
>>
>> So what does line 6347 look like? (Use `readLines` and print it out.)
>>
>>> Interestingly, though, the error only happened when I increased the
>>> width size. But I had to increase the size, or else I couldn't "see"
>>> anything. The comment was so small that nothing was being captured by
>>> the size of the column, so to speak.
>>>
>>> It seems like what's throwing me is that there's no comma that
>>> demarcates the end of the text proper. For example:
>> Not sure why you thought there should be a comma. Lines usually end
>> with a <cr> and/or a <lf>.
>>
>>
>> Once you have the raw text in a character vector from `readLines` named,
>> say, 'chrvec', then you could selectively substitute commas for spaces
>> with regex. (Now that you no longer desire to remove the dates and times.)
>>
>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
>>
>> This will not do any replacements when the pattern is not matched. See
>> this test:
>>
>>
>> > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>> > newvec
>> [1] "2016-07-01,02:50:35,<john>,hey"
>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>> [7] "2016-07-01,02:54:17,<john>,just know it's london"
>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>> [10] "2016-07-01 02:58:56 <jone>"
>> [11] "2016-07-01 02:59:34 <jane>"
>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>
>>
>> You should probably remove the "empty comment" lines.
>>
>>
>> --
>>
>> David.
>>
>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
>>> lots of Starbucks in my day2016-07-01 15:35:47
>>>
>>> It was interesting, too, when I pasted the text into the email, it
>>> self-formatted into the way I wanted it to look. I had to manually
>>> make it look like it does above, since that's the way that it looks in
>>> the txt file. I wonder if it's being organized by XML or something.
>>>
>>> Anyways, There's always a space between the two sideways carrots, just
>>> like there is right now: <John Doe> See. Space. And there's always a
>>> space between the date and time. Like this. 2016-07-01 15:34:30 See.
>>> Space. But there's never a space between the end of the comment and
>>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
>>> See. starbucks and 2016 are smooshed together.
>>>
>>> This code is also on the table right now too.
>>>
>>> a <- read.table("E:/working
>>> directory/-189/hangouts-conversation2.txt", quote="\"",
>>> comment.char="", fill=TRUE)
>>>
>>> h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
>>>
>>> aa<-gsub("[^[:digit:]]","",h)
>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
>>>
>>> Those last lines are a work in progress. I wish I could import a
>>> picture of what it looks like when it's translated into a data frame.
>>> The fill=TRUE helped to get the data in table that kind of sort of
>>> works, but the comments keep bleeding into the date and time column.
>>> It's like
>>>
>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
>>> over there
>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>>>
>>> And then, maybe, the "seriously" will be in a column all to itself, as
>>> will be the "I've" and the "never" etc.
>>>
>>> I will use a regular expression if I have to, but it would be nice to
>>> keep the dates and times on there. Originally, I thought they were
>>> meaningless, but I've since changed my mind on that count. The time of
>>> day isn't so important. But, especially since, say, Gmail itself knows
>>> how to quickly recognize what it is, I know it can be done. I know
>>> this data has structure to it.
>>>
>>> Michael
>>>
>>>
>>>
>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius using comcast.net> wrote:
>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>>>> I have a wild and crazy text file, the head of which looks like this:
>>>>>
>>>>> 2016-07-01 02:50:35 <john> hey
>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>>>> 2016-07-01 02:54:17 <john> just know it's london
>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>>>> 2016-07-01 02:58:56 <jone>
>>>>> 2016-07-01 02:59:34 <jane>
>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>>>> Looks entirely not-"crazy". Typical log file format.
>>>>
>>>> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
>>>> (i.e. the `sub` function) to strip everything up to the "<". Read
>>>> `?regex`. Since "<" is not a metacharacter you could use a pattern
>>>> ".+<" and replace with "<" (so the "<" itself is kept).
>>>>
>>>> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
>>>> at least within hours of each, is considered poor manners.
>>>>
>>>>
>>>> --
>>>>
>>>> David.
>>>>
>>>>> It goes on for a while. It's a big file. But I feel like it's going to
>>>>> be difficult to annotate with the coreNLP library or package. I'm
>>>>> doing natural language processing. In other words, I'm curious as to
>>>>> how I would shave off the dates, that is, to make it look like:
>>>>>
>>>>> <john> hey
>>>>> <jane> waiting for plane to Edinburgh
>>>>> <john> thinking about my boo
>>>>> <jane> nothing crappy has happened, not really
>>>>> <john> plane went by pretty fast, didn't sleep
>>>>> <jane> no idea what time it is or where I am really
>>>>> <john> just know it's london
>>>>> <jane> you are probably asleep
>>>>> <jane> I hope fish was fishy in a good eay
>>>>> <jone>
>>>>> <jane>
>>>>> <john> British security is a little more rigorous...
>>>>>
>>>>> To be clear, then, I'm trying to clean a large text file by writing a
>>>>> regular expression such that I create a new object with no numbers or
>>>>> dates.
>>>>>
>>>>> Michael
>>>>>
>>>>> ______________________________________________
>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.