[BioC] extracting character string
Hervé Pagès
hpages at fhcrc.org
Wed Jun 17 01:57:40 CEST 2009
Hi Hari, Mark,
Mark Robinson wrote:
> Hi Hari.
>
> strsplit() will work, it's just sensitive. For starters, you might try:
>
> > x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
> + "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239","mgc|BC034752:79")
> >
> > strsplit(x,"\\|")
> [[1]]
> [1] "ref" "NM_004564" "ref" "PET112L:2131"
> [5] "mgc" "BC130348:2158"
>
> [[2]]
> [1] "ref" "NM_007266" "ref" "XAB1:2255"
> [5] "mgc" "BC007451:2239"
>
> [[3]]
> [1] "mgc" "BC034752:79"
Note that it's better here to use strsplit() with fixed=TRUE. Then there
is no need to escape the |, and strsplit() will also be much faster.
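
As a minimal sketch, reusing the x from Mark's example, the fixed=TRUE call
could look like the following. The names parts_fixed and parts_regex are
only illustrative.

```r
# Same vector as in Mark's example
x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
       "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
       "mgc|BC034752:79")

# fixed=TRUE treats "|" as a literal character, so no escaping is needed
parts_fixed <- strsplit(x, "|", fixed=TRUE)

# The escaped regular expression splits at the same places
parts_regex <- strsplit(x, "\\|")

# Check that both calls give the same list of fields
same_result <- identical(parts_fixed, parts_regex)
```
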
Cheers,
H.
>
>
> And, for extracting the first 2 columns, maybe you'll want to migrate
> towards something like:
>
> > t(sapply(x, FUN=function(u) strsplit(u, "\\|")[[1]][1:2],
> +     USE.NAMES=FALSE))
>      [,1]  [,2]
> [1,] "ref" "NM_004564"
> [2,] "ref" "NM_007266"
> [3,] "mgc" "BC034752:79"
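
For the RefSeq IDs themselves, one possible approach is to match them
directly instead of splitting. This is only a sketch: it assumes every ID
of interest looks like "NM_" followed by digits, and the names probes, hits
and refseq are made up for illustration.

```r
# The four example annotation strings from Hari's message
probes <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
            "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
            "mgc|BC034752:79",
            "ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790")

# Assumption: RefSeq IDs of interest are "NM_" followed by digits only.
# gregexpr() finds every match per string, regmatches() extracts them.
hits <- regmatches(probes, gregexpr("NM_[0-9]+", probes))

# Collapse to one string per probe; probes without an NM_ ID become NA
refseq <- vapply(hits, function(ids) {
    if (length(ids) > 0) paste(ids, collapse=";") else NA_character_
}, character(1))
```

The result has one element per probe, so it could be stored as a new
column next to the original annotation.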
>
> Hope that gets you started.
>
> Cheers,
> Mark
>
>
> On 17/06/2009, at 7:54 AM, Hari Easwaran wrote:
>
>> Hi all,
>> I am working with Agilent microarray data and trying to extract only the
>> accession numbers from the output probe annotation. Basically I have a
>> column detailing the probe as follows:
>>
>> ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158
>> ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239
>> mgc|BC034752:79
>> ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790
>>
>> ...
>>
>> I am trying to extract only the Refseq IDs (in this case NM_004564,
>> NM_007266, NM_057094, NM_005209, NM_194302.....) and create a new column
>> with the IDs. I am not able to figure out how to do this. I tried using the
>> function 'strsplit', but it doesn't work.
>> I am a newbie to R/Bioconductor and would appreciate it if someone could help.
>>
>> Thanks.
>> Hari
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> ------------------------------
> Mark Robinson, PhD (Melb)
> Epigenetics Laboratory, Garvan
> Bioinformatics Division, WEHI
> e: m.robinson at garvan.org.au
> e: mrobinson at wehi.edu.au
> p: +61 (0)3 9345 2628
> f: +61 (0)3 9347 0852
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319