[R] splitting a dataframe in R based on multiple gene names in a specific column

Sat Aug 26 11:53:58 CEST 2017

Very tidy. Amazing what is hidden away in R packages.

Jim

On Sat, Aug 26, 2017 at 5:26 AM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> If row numbers can be dispensed with, then tidyr makes this easy with the
> unnest function:
>
> #####
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #>     intersect, setdiff, setequal, union
> library(purrr)
> library(tidyr)
>
> df.sample.gene<-read.table(
>  text="Chr     Start       End Ref Alt Func.refGene  Gene.refGene
>  284 chr2  16080996  16080996   C   T ncRNA_exonic  GACAT3
>  448 chr2 113979920 113979920   C   T ncRNA_exonic  LINC01191,LOC100499194
>  465 chr2 131279347 131279347   C   G ncRNA_exonic  LOC440910
>  525 chr2 223777758 223777758   T   A       exonic  AP1S3
>  626 chr3  99794575  99794575   G   A       exonic  COL8A1
>  643 chr3 132601066 132601066   A   G       exonic  ACKR4
>  655 chr3 132601999 132601999   A   G       exonic  BCDF5,CDFG6",
>  header=TRUE,stringsAsFactors=FALSE)
>
> df.sample.out <- (   df.sample.gene
>                  %>% mutate( Gene.refGene = strsplit( Gene.refGene
>                                                     , ","
>                                                     )
>                            )
>                  %>% unnest( Gene.refGene )
>                  )
> df.sample.out
> #>    Chr     Start       End Ref Alt Func.refGene Gene.refGene
> #> 1 chr2  16080996  16080996   C   T ncRNA_exonic       GACAT3
> #> 2 chr2 113979920 113979920   C   T ncRNA_exonic    LINC01191
> #> 3 chr2 113979920 113979920   C   T ncRNA_exonic LOC100499194
> #> 4 chr2 131279347 131279347   C   G ncRNA_exonic    LOC440910
> #> 5 chr2 223777758 223777758   T   A       exonic        AP1S3
> #> 6 chr3  99794575  99794575   G   A       exonic       COL8A1
> #> 7 chr3 132601066 132601066   A   G       exonic        ACKR4
> #> 8 chr3 132601999 132601999   A   G       exonic        BCDF5
> #> 9 chr3 132601999 132601999   A   G       exonic        CDFG6
> #####
>
>
> On Wed, 23 Aug 2017, Jim Lemon wrote:
>
>> Hi Bogdan,
>> Messy, and very specific to your problem:
>>
>> df.sample.gene<-read.table(
>> text="Chr     Start       End Ref Alt Func.refGene  Gene.refGene
>> 284 chr2  16080996  16080996   C   T ncRNA_exonic  GACAT3
>> 448 chr2 113979920 113979920   C   T ncRNA_exonic  LINC01191,LOC100499194
>> 465 chr2 131279347 131279347   C   G ncRNA_exonic  LOC440910
>> 525 chr2 223777758 223777758   T   A       exonic  AP1S3
>> 626 chr3  99794575  99794575   G   A       exonic  COL8A1
>> 643 chr3 132601066 132601066   A   G       exonic  ACKR4
>> 655 chr3 132601999 132601999   A   G       exonic  BCDF5,CDFG6",
>> header=TRUE,stringsAsFactors=FALSE)
>>
>> # indices of rows that list more than one gene
>> multgenes<-grep(",",df.sample.gene$Gene.refGene)
>> # expand one multi-gene row into one row per gene
>> dup_row<-function(x) {
>> newrows<-x
>> lastcol<-dim(x)[2]
>> rep_genes<-unlist(strsplit(x[,lastcol],","))
>> for(i in 2:length(rep_genes)) newrows<-rbind(newrows,x)
>> newrows$Gene.refGene<-rep_genes
>> return(newrows)
>> }
>> # append the expanded rows, then drop the original multi-gene rows
>> for(multgene in multgenes)
>> df.sample.gene<-rbind(df.sample.gene,dup_row(df.sample.gene[multgene,]))
>> # guard: negative indexing with an empty vector would drop every row
>> if(length(multgenes)) df.sample.gene<-df.sample.gene[-multgenes,]
>> df.sample.gene
>>
>> I added a second line with multiple genes to make sure that it would
>> work with more than one line.
>>
>> Jim
>>
>>
>> On Wed, Aug 23, 2017 at 9:57 AM, Bogdan Tanasa <tanasa at gmail.com> wrote:
>>>
>>> I would appreciate a suggestion on how to do the following:
>>>
>>> I'm working with a data frame in R that contains multiple gene names
>>> in a specific column, e.g.:
>>>
>>>> df.sample.gene[15:20,2:8]
>>>
>>>      Chr     Start       End Ref Alt Func.refGene           Gene.refGene
>>> 284 chr2  16080996  16080996   C   T ncRNA_exonic                 GACAT3
>>> 448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
>>> 465 chr2 131279347 131279347   C   G ncRNA_exonic              LOC440910
>>> 525 chr2 223777758 223777758   T   A       exonic                  AP1S3
>>> 626 chr3  99794575  99794575   G   A       exonic                 COL8A1
>>> 643 chr3 132601066 132601066   A   G       exonic                  ACKR4
>>>
>>> How could I obtain a data frame in which each row that has multiple gene
>>> names (in the field Gene.refGene) is replicated, with only one gene name
>>> per row? For example, for the second row:
>>>
>>>   448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
>>>
>>> we should get in the final output (which contains all the rows):
>>>
>>>   448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191
>>>   448 chr2 113979920 113979920   C   T ncRNA_exonic LOC100499194
>>>
>>> Thanks a lot!
>>>
>>> -- bogdan
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
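
For readers who do need the original row numbers, which the unnest approach above gives up, here is a minimal base R sketch of the same row expansion. It is not code from any of the posters in this thread; the column name orig.row is a hypothetical choice made for illustration, and the sketch assumes every Gene.refGene entry is a non-empty, comma-separated string (an empty string would silently drop that row).

```r
# Same sample data as in the thread; the first field becomes the row names.
df.sample.gene <- read.table(
  text = "Chr     Start       End Ref Alt Func.refGene  Gene.refGene
284 chr2  16080996  16080996   C   T ncRNA_exonic  GACAT3
448 chr2 113979920 113979920   C   T ncRNA_exonic  LINC01191,LOC100499194
465 chr2 131279347 131279347   C   G ncRNA_exonic  LOC440910
525 chr2 223777758 223777758   T   A       exonic  AP1S3
626 chr3  99794575  99794575   G   A       exonic  COL8A1
643 chr3 132601066 132601066   A   G       exonic  ACKR4
655 chr3 132601999 132601999   A   G       exonic  BCDF5,CDFG6",
  header = TRUE, stringsAsFactors = FALSE)

# Split each entry into a character vector of gene names.
genes <- strsplit(df.sample.gene$Gene.refGene, ",", fixed = TRUE)

# Repeat each row index once per gene name found in that row.
idx <- rep(seq_len(nrow(df.sample.gene)), lengths(genes))

# Replicate the rows, then overwrite the gene column with single names.
df.sample.out <- df.sample.gene[idx, ]
df.sample.out$Gene.refGene <- unlist(genes)

# Store the original row label in a column (orig.row is a hypothetical
# name), because duplicated row names would otherwise get suffixes.
df.sample.out$orig.row <- rownames(df.sample.gene)[idx]
rownames(df.sample.out) <- NULL

df.sample.out
```

Because rep() and lengths() work on the whole column at once, no loop or repeated rbind is needed, and rows stay in their original order rather than being appended at the end.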