[BioC] extracting character string

Mark Robinson mrobinson at wehi.EDU.AU
Wed Jun 17 01:19:12 CEST 2009


Hi Hari.

strsplit() will work, it's just sensitive.  For starters, you might try:

 > x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
+ "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239","mgc|BC034752:79")
 >
 > strsplit(x,"\\|")
[[1]]
[1] "ref"           "NM_004564"     "ref"           "PET112L:2131"
[5] "mgc"           "BC130348:2158"

[[2]]
[1] "ref"           "NM_007266"     "ref"           "XAB1:2255"
[5] "mgc"           "BC007451:2239"

[[3]]
[1] "mgc"         "BC034752:79"
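
As a side note on why the double backslash is needed: strsplit() treats its split argument as a regular expression by default, and an unescaped "|" is the alternation operator rather than a literal pipe. The sketch below is only an illustration on one of the strings above; fixed = TRUE is a standard strsplit() argument that can be used instead of escaping.

```r
x1 <- "mgc|BC034752:79"

# Unescaped: "|" is read as regex alternation between two empty patterns,
# so the string is broken up character by character
unescaped <- strsplit(x1, "|")[[1]]

# Escaped: "\\|" matches a literal pipe
escaped <- strsplit(x1, "\\|")[[1]]

# Alternative: treat the split string literally, no escaping needed
literal <- strsplit(x1, "|", fixed = TRUE)[[1]]
```

Here escaped and literal should both hold the two fields "mgc" and "BC034752:79", while unescaped does not.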


And, for extracting the first 2 columns, maybe you'll want to migrate  
towards something like:

 > t(sapply(x, FUN=function(u) strsplit(u, "\\|")[[1]][1:2],
+ USE.NAMES=FALSE))
      [,1]  [,2]
[1,] "ref" "NM_004564"
[2,] "ref" "NM_007266"
[3,] "mgc" "BC034752:79"
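
Note that the third row picks up "BC034752:79", which is not a RefSeq ID, because the first two fields are taken by position. If the goal is the RefSeq IDs specifically, one possible sketch is to match them directly with a regular expression. This assumes every ID of interest looks like "NM_" followed by digits, and the ";" separator and the name refseq are arbitrary choices for illustration.

```r
x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
       "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
       "mgc|BC034752:79",
       "ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790")

# Assumption: a RefSeq ID of interest is "NM_" followed by one or more digits
ids <- regmatches(x, gregexpr("NM_[0-9]+", x))

# One string per probe: multiple IDs are joined with ";",
# probes without any match become NA
refseq <- sapply(ids, function(u) {
  if (length(u) > 0) paste(u, collapse = ";") else NA_character_
})
```

The result has one element per probe, so it can be added as a new column next to the original annotation; ids keeps the individual matches as a list if they are needed separately.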

Hope that gets you started.

Cheers,
Mark


On 17/06/2009, at 7:54 AM, Hari Easwaran wrote:

> Hi all,
> I am working with Agilent microarray data and trying to extract only the
> accession numbers from the output probe annotation. Basically I have a
> column detailing the probe as follows:
>
> ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158
> ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239
> mgc|BC034752:79
> ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790
> ...
>
> I am trying to extract only the Refseq IDs (in this case NM_004564,
> NM_007266, NM_057094, NM_005209, NM_194302.....) and create a new column
> with the IDs. I am not able to figure out how to do this. I tried using the
> function 'strsplit', but it doesn't work.
> I am a newbie to R/Bioconductor and would appreciate if someone can help.
>
> Thanks.
> Hari

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852


