[BioC] Gene names

Christopher Wilkinson christopher.wilkinson at adelaide.edu.au
Mon Nov 7 01:13:35 CET 2005


If you want to do this in R, the function you want is strsplit, telling 
it to split on the "|" character. However, strsplit treats its split 
argument as a regular expression, and "|" is special there (it means 
alternation), so we have to escape it with backslashes. As a word of 
advice, look up regular expressions (?regexp): they are extremely 
powerful for manipulating strings.

 > geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
 > strsplit(geneName,"\\|")
[[1]]
[1] "SFTPB"                                      "NM_000542.1"
[3] "4506904"                                    "surfactant, pulmonary-associated protein B"

Note that it returns a list, where you probably want a vector or a 
matrix, so something like
t(as.matrix(strsplit(geneName,"\\|")[[1]])) (a one-row matrix) or 
unlist(strsplit(geneName,"\\|")) (a character vector) will give
"SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
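
As an aside on the escaping, here is a minimal sketch (base R only) of 
what goes wrong with an unescaped "|" and of two other ways to split on 
a literal "|". The variable names are made up for illustration, and I 
have shortened the gene name to keep the lines short.

```r
x <- "SFTPB|NM_000542.1|4506904"

# Unescaped, "|" is regex alternation between two empty patterns, so it
# does not split on the pipe; the string gets broken into small pieces.
unescaped <- strsplit(x, "|")[[1]]

# Three ways of splitting on a literal "|".
escaped <- strsplit(x, "\\|")[[1]]             # backslash-escaped regex
bracket <- strsplit(x, "[|]")[[1]]             # "|" inside a character class
literal <- strsplit(x, "|", fixed = TRUE)[[1]] # no regex matching at all

print(escaped)
print(identical(escaped, bracket) && identical(escaped, literal))
```

fixed = TRUE is probably the easiest to read if the separator never 
needs to be a pattern.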

Now let's assume you have a vector of gene names to be split; you can 
use the sapply function.
 > geneNames <- rep(geneName,3)
 > geneNamesAsMatrix <- t(sapply(geneNames,function(x){unlist(strsplit(x,"\\|"))}))
 > rownames(geneNamesAsMatrix) <- NULL ## otherwise the whole string is the row name
 > geneNamesAsMatrix
     [,1]    [,2]          [,3]      [,4]
[1,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[2,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[3,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
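
A possible alternative for the vector case, as a sketch only: strsplit 
already accepts a whole character vector, so the pieces can be bound 
together with rbind without sapply. This assumes every gene name has the 
same number of fields (the stopifnot checks that), and the column labels 
are my own invention, not anything in the original file.

```r
geneName  <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
geneNames <- rep(geneName, 3)

# One list element (a character vector of fields) per input string.
fields <- strsplit(geneNames, "|", fixed = TRUE)

# rbind only gives a sensible matrix if all rows have the same length.
stopifnot(length(unique(sapply(fields, length))) == 1)

# Stack the per-gene vectors as rows of a character matrix.
geneMatrix <- do.call(rbind, fields)

# Hypothetical column labels, for illustration only.
colnames(geneMatrix) <- c("symbol", "accession", "id", "description")

# A data frame is often handier than a matrix for later subsetting.
geneTable <- as.data.frame(geneMatrix, stringsAsFactors = FALSE)
print(dim(geneMatrix))
```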

Of course you could do this on the command line with perl using 
something like
perl -ne 'my @F=split /\|/,$_;print join("\t",@F)' infile > outfile
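
If you would rather stay in R for the file conversion too, something 
along these lines should work. It is a sketch under assumptions: the 
file names are placeholders (temporary files here, so that it runs on 
its own), the input has no header line, and every line has the same 
number of "|"-separated fields.

```r
# Placeholder files; substitute your real input and output paths.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(rep("SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B", 2),
           infile)

# quote = "" and comment.char = "" stop apostrophes and "#" in the
# descriptions from being treated specially; colClasses keeps the
# numeric-looking field as text.
genes <- read.table(infile, sep = "|", quote = "", comment.char = "",
                    header = FALSE, colClasses = "character",
                    stringsAsFactors = FALSE)

# Write the four columns back out, tab separated, with no row or
# column names and no quoting.
write.table(genes, outfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

print(readLines(outfile))
```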

Cheers
Chris

>Date: Sun, 06 Nov 2005 02:13:39 +0000
>From: J.delasHeras at ed.ac.uk
>Subject: Re: [BioC] Gene names
>To: bioconductor at stat.math.ethz.ch
>Message-ID: <20051106021339.3x6viekhogs0w8w0 at www.staffmail.ed.ac.uk>
>Content-Type: text/plain;	charset=ISO-8859-1;	format="flowed"
>
>Quoting Narendra Kaushik <kaushiknk at Cardiff.ac.uk>:
>
>  
>
>>I have gene file in this format, everything in one column (no spaces at all):
>>SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
>>Is there any way to convert it in this format (into four columns) except
>>manually?
>>
>>SFTPB                        NM_000542.1               4506904
>>surfactant, pulmonary-associated protein B
>>
>>Any suggestions?
>>
>>Narendra
>>    
>>
>
>Maybe too obvious, but Excel is very good for this sort of thing. 
>Functions like Search allow you to obtain the position of a particular 
>character (like "|"), and knowing that you can select the text to the 
>left or right of it... if you do that consecutively you can sort it 
>like that. It'll take a minute.
>
>Jose
>
>  
>


-- 

Dr Chris Wilkinson

Senior Research Officer               | ARC Research Associate
Child Health Research Institute (CHRI)| Microarray Analysis Group
7th floor, Clarence Rieger Building   | Room 121
Women's and Children's Hospital       | School of Mathematical Sciences
72 King William Rd,                   | The University of Adelaide, 5005
North Adelaide, 5006                  | CRICOS Provider Number 00123M

Math's Office (Room 121)        Ph: 8303 3714
CHRI   Office (CR2 52A)         Ph: 8161 6363

Christopher.Wilkinson at adelaide.edu.au

http://mag.maths.adelaide.edu.au/crwilkinson.html

Organising Committee Member, 5th Australian Microarray Conference
29th Sept to 1st Oct 2005, Novatel Barossa Valley Resort
http://www.sapmea.asn.au/conventions/microarray/index.html


