[R] Huge data sets and RAM problems
Stella Pachidi
stella.pachidi at gmail.com
Thu Apr 22 10:41:55 CEST 2010
Dear all,
Thank you very much for your replies and help. I will try out your
suggestions and get back to you if I need anything more.
Kind regards,
Stella Pachidi
On Thu, Apr 22, 2010 at 5:30 AM, kMan <kchamberln at gmail.com> wrote:
> You set records to NULL perhaps (delete, shift up). Perhaps your system is
> susceptible to butterflies on the other side of the world.
>
> Your code may have 'worked' on a small section of data, but the data used
> did not include all of the cases needed to fully test your code. So... test
> your code!
>
> scan(), used with 'nlines', 'skip', 'sep', and 'what', will cut your read
> time by at least half while using less RAM, let you do most of your post
> processing in chunks, and give you something with which to test your code
> properly. If you drop 'nlines', though, you lose the time/memory advantage
> over read.table(). 'skip' will get you "right to the point" just before
> where things failed, and that would be an interesting small segment of data
> to test with.
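>
> A minimal sketch of that chunked scan() idea (the file name, the tab
> separator, and the all-character column template here are assumptions):
>
> cols <- 18
> tmpl <- rep(list(character()), cols)            # 'what': one type per column
> names(tmpl) <- paste("V", 1:cols, sep = "")
> chunk <- scan("file.txt", what = tmpl, sep = "\t",
>               skip = 1220000, nlines = 10000,   # jump to just before the failure
>               quote = "", comment.char = "", quiet = TRUE)
> chunk <- as.data.frame(chunk, stringsAsFactors = FALSE)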
>
> WordPad can read your file (and then some). Eventually.
>
> Sincerely,
> KeithC.
>
> -----Original Message-----
> From: Stella Pachidi [mailto:stella.pachidi at gmail.com]
> Sent: Monday, April 19, 2010 2:07 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Huge data sets and RAM problems
>
> Dear all,
>
> This is the first time I am sending a message to this mailing list, so I hope
> I do not make any mistakes...
>
> For the last few months I have been working on my MSc thesis project, applying
> data mining techniques to the user logs of a software-as-a-service application.
> The main problem I am experiencing is how to process the huge amount of data.
> More specifically:
>
> I am using R 2.10.1 on a laptop with 32-bit Windows 7, 2 GB of RAM and an
> Intel Core Duo 2 GHz CPU.
>
> The user log data come from a Crystal Reports query (.rpt file), which I
> transform with some Java code into a tab-separated file.
>
> Although everything manages to run with a small subset of my data, when I
> increase the data set I get several problems:
>
> The first problem is with the use of read.delim(). When I try to read a large
> amount of data (over 2,400,000 rows with 18 attributes per row), it does not
> seem to load the whole table into a data frame. In particular, the data frame
> returned has only 1,220,987 rows.
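>
> A quick diagnostic sketch of that mismatch (the file name is an assumption;
> stray quote or comment characters are just one possible cause):
>
> n.fields <- count.fields("file.txt", sep = "\t", quote = "", comment.char = "")
> length(n.fields)     # number of data lines actually in the file (incl. header)
> table(n.fields)      # every line should have 18 fields
> applicLog.dat <- read.delim("file.txt", quote = "", comment.char = "")
> nrow(applicLog.dat)  # compare with length(n.fields) - 1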
>
> Furthermore, as one of the attributes is a DateTime, when I try to split this
> column into two columns (one with the date and one with the time), the result
> is quite strange, as the two new columns appear to have more rows than the
> data frame:
>
> applicLog.dat <- read.delim("file.txt")
> # Process the syscreated column (Date time --> Date + time)
> copyDate <- applicLog.dat[["ï..syscreated"]]
> copyDate <- as.character(copyDate)
> splitDate <- strsplit(copyDate, " ")
> splitDate <- unlist(splitDate)
> splitDateIndex <- c(1:length(splitDate))
> sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]
> sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]
> sysCreatedDate <- strptime(sysCreatedDate, format = "%Y-%m-%d")
> op <- options(digits.secs = 3)
> sysCreatedTime <- strptime(sysCreatedTime, format = "%H:%M:%OS")
> applicLog.dat[["ï..syscreated"]] <- NULL
> applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)
>
> Then I get the error: Error in data.frame(..., check.names = FALSE) :
> arguments imply differing number of rows: 1221063, 1221062, 1220987
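>
> One likely reason for the differing counts is a syscreated value that does not
> split into exactly two pieces, which shifts the odd/even indexing after
> unlist(). A sketch of a per-row split that keeps the lengths aligned:
>
> splitDate <- strsplit(as.character(applicLog.dat[["ï..syscreated"]]), " ")
> dateStr <- sapply(splitDate, function(x) x[1])
> timeStr <- sapply(splitDate, function(x) x[2])  # NA where the time part is missing
> length(dateStr) == nrow(applicLog.dat)          # TRUE: one entry per record
> sysCreatedDate <- strptime(dateStr, format = "%Y-%m-%d")
> op <- options(digits.secs = 3)
> sysCreatedTime <- strptime(timeStr, format = "%H:%M:%OS")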
>
>
> Finally, another problem I have is when I perform association rule mining on
> the data set using the arules package: I turn the data frame into a
> transactions object and then run the apriori algorithm. When I set the support
> low enough to find the rules I need, the vector of rules becomes too big and I
> get memory problems such as:
>
> Error: cannot allocate vector of size 923.1 Mb
> In addition: Warning messages:
> 1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)
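>
> A minimal sketch of capping the apriori() search (the support and confidence
> values and the maxlen limit are assumptions, and the data frame is assumed to
> hold only factor columns before the coercion to transactions):
>
> library(arules)
> trans <- as(applicLog.dat, "transactions")
> rules <- apriori(trans,
>                  parameter = list(supp = 0.005,  # higher support -> fewer rules
>                                   conf = 0.8,
>                                   maxlen = 4))   # cap rule length to save memory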
>
> Could you please help me with how I could allocate more RAM? Or do you think
> there is a way to process the data from a file on disk instead of loading it
> all into RAM? Do you know how I could manage to read my whole data set?
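>
> A sketch of the Windows-only memory functions mentioned in the warning above
> (2047 MB is roughly the ceiling of a 32-bit R process):
>
> memory.size()              # MB currently used by R
> memory.limit()             # current allocation limit in MB
> memory.limit(size = 2047)  # ask Windows for (roughly) the 32-bit maximum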
>
> I would really appreciate your help.
>
> Kind regards,
> Stella Pachidi
>
> PS: Do you know any text editor that can read huge .txt files?
>
>
>
>
>
> --
> Stella Pachidi
> Master in Business Informatics student
> Utrecht University
>
>
>
>
--
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: S.Pachidi at students.uu.nl
tel: +31644478898