[R] Cleaning database: grep()? apply()?
jim holtman
jholtman at gmail.com
Tue Nov 13 23:25:14 CET 2007
Here is how to whittle it down for the first two parts of your
question. I am not exactly sure what you are after in the third part. Is
it that you want specific DATEs, or do you want the ratio of the value at
DATE[max] to the value at DATE[min]?
> x <- read.table(textConnection("CODE NAME DATE DATA1
+ 4813 'ADVANCED TELECOM' 1987 0.013
+ 3845 'ADVANCED THERAPEUTIC SYS LTD' 1987 10.1
+ 3845 'ADVANCED THERAPEUTIC SYS LTD' 1989 2.463
+ 3845 'ADVANCED THERAPEUTIC SYS LTD' 1988 1.563
+ 2836 'ADVANCED TISSUE SCI -CL A' 1987 0.847
+ 2836 'ADVANCED TISSUE SCI -CL A' 1989 0.872
+ 2836 'ADVANCED TISSUE SCI -CL A' 1988 0.529"), header=TRUE)
> # matches on things to delete
> delete_indx <- grep("-CL A$|-OLD$|-ADS$", x$NAME)
> # delete them
> x <- x[-delete_indx,]
> x
  CODE                         NAME DATE  DATA1
1 4813             ADVANCED TELECOM 1987  0.013
2 3845 ADVANCED THERAPEUTIC SYS LTD 1987 10.100
3 3845 ADVANCED THERAPEUTIC SYS LTD 1989  2.463
4 3845 ADVANCED THERAPEUTIC SYS LTD 1988  1.563
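
One caveat about the negative-index deletion above: if grep() finds no matches it returns integer(0), and x[-integer(0), ] selects zero rows, so the whole data frame would be dropped. A minimal sketch of an alternative using grepl(), with the sample data rebuilt by hand so it runs on its own (the names kept, suffix_pattern and no_match are illustrative, not from the original post):

```r
# Same sample data as in the transcript, built directly so the sketch is self-contained.
x <- data.frame(
  CODE  = c(4813, 3845, 3845, 3845, 2836, 2836, 2836),
  NAME  = c("ADVANCED TELECOM",
            rep("ADVANCED THERAPEUTIC SYS LTD", 3),
            rep("ADVANCED TISSUE SCI -CL A", 3)),
  DATE  = c(1987, 1987, 1989, 1988, 1987, 1989, 1988),
  DATA1 = c(0.013, 10.1, 2.463, 1.563, 0.847, 0.872, 0.529),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector, so negating it keeps every row
# when nothing matches.
suffix_pattern <- "-CL A$|-OLD$|-ADS$"
kept <- x[!grepl(suffix_pattern, x$NAME), ]

# For contrast, a pattern that matches nothing (hypothetical suffix "-XYZ"):
no_match <- grep("-XYZ$", x$NAME)          # integer(0)
rows_negative_index <- nrow(x[-no_match, ])               # all rows are lost
rows_logical_index  <- nrow(x[!grepl("-XYZ$", x$NAME), ]) # all rows are kept
```

On this sample both approaches give the same four rows; they only differ when the pattern matches nothing.
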
> # I assume you want to use NAME to check for ranges of data
> date_range <- tapply(x$DATE, x$NAME, function(dates) diff(range(dates)))
> date_range
            ADVANCED TELECOM ADVANCED THERAPEUTIC SYS LTD
                           0                            2
   ADVANCED TISSUE SCI -CL A
                          NA
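
The NA for 'ADVANCED TISSUE SCI -CL A' most likely appears because NAME was read in as a factor and the level of the deleted rows is still attached. Note also that diff(range(dates)) measures the span of years, not how many years are present, so a company with data only for 1987 and 1989 would pass the check. A sketch that drops unused levels and counts distinct years instead, assuming "at least 3 years of data" means at least three distinct DATE values (n_years, keep_names and x3 are illustrative names):

```r
# Sample data with NAME stored explicitly as a factor, to mirror the transcript.
x <- data.frame(
  NAME = factor(c("ADVANCED TELECOM",
                  rep("ADVANCED THERAPEUTIC SYS LTD", 3),
                  rep("ADVANCED TISSUE SCI -CL A", 3))),
  DATE = c(1987, 1987, 1989, 1988, 1987, 1989, 1988)
)

# Remove the suffixed names, then drop the now-unused factor level
# so tapply() does not report an NA group for it.
x <- x[!grepl("-CL A$", x$NAME), ]
x$NAME <- droplevels(x$NAME)

# Count distinct years per company rather than the span of years.
n_years <- tapply(x$DATE, x$NAME, function(dates) length(unique(dates)))

# Keep companies with at least three distinct years (assumed reading of the request).
keep_names <- names(n_years)[n_years >= 3]
x3 <- x[x$NAME %in% keep_names, ]
```
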
> # delete ones with less than 3 years
> names_to_delete <- names(date_range[date_range < 2])
> # delete those entries
> x <- x[!(x$NAME %in% names_to_delete),]
> x
  CODE                         NAME DATE  DATA1
2 3845 ADVANCED THERAPEUTIC SYS LTD 1987 10.100
3 3845 ADVANCED THERAPEUTIC SYS LTD 1989  2.463
4 3845 ADVANCED THERAPEUTIC SYS LTD 1988  1.563
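
For the third part, if the intent is the ratio DATA1(1989) / DATA1(1987) per company as stated in the question, one possible sketch is to split out the two years and merge them back by CODE and NAME. This is only one reading of the request; the column name RATIO1 and the helper names first, last and both are assumptions for illustration:

```r
# The rows that remain after the two deletion steps in the transcript.
x <- data.frame(
  CODE  = c(3845, 3845, 3845),
  NAME  = rep("ADVANCED THERAPEUTIC SYS LTD", 3),
  DATE  = c(1987, 1989, 1988),
  DATA1 = c(10.1, 2.463, 1.563),
  stringsAsFactors = FALSE
)

# Pull out the 1987 and 1989 observations separately.
first <- x[x$DATE == 1987, c("CODE", "NAME", "DATA1")]
last  <- x[x$DATE == 1989, c("CODE", "NAME", "DATA1")]

# Join them per company; the suffixes distinguish the two DATA1 columns.
both <- merge(first, last, by = c("CODE", "NAME"),
              suffixes = c("_1987", "_1989"))

# New data frame with CODE, NAME and the ratio.
ratios <- data.frame(
  CODE   = both$CODE,
  NAME   = both$NAME,
  RATIO1 = both$DATA1_1989 / both$DATA1_1987,
  stringsAsFactors = FALSE
)
```

Companies missing either year simply drop out of the merge, which may or may not be what you want. Ratios for other data variables could be added the same way, one column at a time.
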
On Nov 13, 2007 2:34 PM, Jonas Malmros <jonas.malmros at gmail.com> wrote:
> Dear R users,
>
> I have a huge database and I need to adjust it somewhat.
>
> Here is a very small excerpt from the database:
>
> CODE NAME                         DATE DATA1
> 4813 ADVANCED TELECOM             1987 0.013
> 3845 ADVANCED THERAPEUTIC SYS LTD 1987 10.1
> 3845 ADVANCED THERAPEUTIC SYS LTD 1989 2.463
> 3845 ADVANCED THERAPEUTIC SYS LTD 1988 1.563
> 2836 ADVANCED TISSUE SCI -CL A    1987 0.847
> 2836 ADVANCED TISSUE SCI -CL A    1989 0.872
> 2836 ADVANCED TISSUE SCI -CL A    1988 0.529
>
> What I need is:
> 1) Delete all cases containing -CL A (and also -OLD, -ADS, etc) at the end
> 2) Delete all cases that have less than 3 years of data
> 3) For each remaining case compute the ratio DATA1(1989) / DATA1(1987)
> [and then ratios involving other data variables] and output this into
> a new database consisting of CODE, NAME, RATIOs.
>
> Maybe someone can suggest an effective way to do these things? I
> imagine the first one would involve grep(), and 2 and 3 would involve
> the apply family of functions, but I cannot get my mind around the actual
> code to perform these adjustments. I am new to R; I do write code, but
> usually it consists of for loops and plotting. I would much
> appreciate your help.
> Thank you in advance!
> --
> Jonas Malmros
> Stockholm University
> Stockholm, Sweden
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?