[R] text analysis errors

Gordon Ballingrud
Thu Jan 7 00:39:10 CET 2021


Hello all,

I have asked this question on many forums without a response. Although I
have made some progress on my own, I am stuck on how to respond to a
particular error message.

I have a question about text-analysis packages and code. I am trying to
run readability analyses on a collection of about 4,000 Word files. I would
like to run any of a number of such analyses, but the current problem is
getting R to recognize the files as data ready for analysis; instead, I
keep getting error messages. Here is what I have done so far. I use three
separate commands because I split the folder of 4,000 files into three:
evidently the collection was too large to be read in one pass. The three
folders are roughly equal in size and are called ‘WPSCASES’ one through
three. Here is my code, with the error message for each command recorded
below:

token <- tokenize("/Users/Gordon/Desktop/WPSCASESONE/", lang="en", doc_id="sample")

The code for the other two folders is identical except for the folder
name.

The error message reads:

*Error in nchar(tagged.text[, "token"], type = "width") : invalid multibyte
string, element 348*

The error message is the same for the other two commands, except that the
'element' number differs: it is 925 for the second folder and 4302 for the
third.

token2 <- tokenize("/Users/Gordon/Desktop/WPSCASES2/", lang="en", doc_id="sample")

token3 <- tokenize("/Users/Gordon/Desktop/WPSCASES3/", lang="en", doc_id="sample")

These are the other two commands, in case that is helpful.
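
For reference, the wording of the error says that nchar() met a string
whose bytes are not valid in the current encoding. Below is a minimal
sketch of how such strings can be located in a character vector with base
R. The vector `tokens` is a hypothetical stand-in, not data from the
folders above, and the Latin-1 repair is only an assumption about where a
stray byte might come from.

```r
# Hypothetical token vector standing in for the "token" column named in
# the error; the second element ends in a single Latin-1 byte (0xE9).
tokens <- c("court", "caf\xe9", "opinion")

# validUTF8() returns FALSE for elements that are not valid UTF-8,
# so which() gives the positions of the offending elements.
bad <- which(!validUTF8(tokens))

# One possible repair, ASSUMING the text is really Latin-1:
# re-encode every element to UTF-8.
fixed <- iconv(tokens, from = "latin1", to = "UTF-8")

# After re-encoding, every element should pass the validity check.
all_valid <- all(validUTF8(fixed))
```

If the check flags nothing on real data, the problem probably lies
elsewhere, for example in how the files themselves are being read.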

I’ve tried to discover whether the ‘element’ that the error message
mentions corresponds to the file of that number in the folder’s order. But
since folder 3 does not have 4,300 files in it, I think that is unlikely.
Please let me know if you can see how to fix this so that I can start to
use ‘koRpus’ functions, like ‘readability’ and its progeny.
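
The comparison between the element number and the number of files can be
sketched as follows. This is only an illustration: the temporary folder
and its three empty files are hypothetical stand-ins so that the snippet
runs anywhere, and `element` is the number reported for the third folder.

```r
# Hypothetical stand-in for one of the folders: a temporary directory
# holding three empty files.
folder <- file.path(tempdir(), "wpscases_demo")
dir.create(folder, showWarnings = FALSE)
file.create(file.path(folder, c("a.docx", "b.docx", "c.docx")))

# Count the files in the folder.
n_files <- length(list.files(folder))

# Element number reported in the error for the third folder.
element <- 4302

# FALSE means the element number cannot be a position in the file list.
element_could_be_file <- element <= n_files
```

With a real folder, `folder` would simply be the path passed to the
tokenize() call.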

Thank you,
Gordon
