[BioC] Easy way to convert CharacterList to character, collapsing each element?

Hervé Pagès hpages at fhcrc.org
Thu Dec 19 21:03:57 CET 2013


Hi Ryan, Michael,

OK for unstrsplit. The generic is now in IRanges 1.21.18 (devel)
with methods for ordinary list and CharacterList.
There is also a method in Biostrings 2.31.6 (devel) for
XStringSetList objects.

See ?unstrsplit
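
For readers of the archive, here is a minimal sketch of how this might be used, not a definitive recipe. It assumes IRanges (devel or later) is installed and that unstrsplit() accepts an ordinary list with a 'sep' argument, as described above. The toy data and the names 'collapsed_base' and 'collapsed' are made up for illustration. The base R line is only a slow reference that shows what result to expect.

```r
## Toy input mirroring the example quoted further down in this thread.
x <- list(A=c("id35", "id2", "id18"), B=character(0), C="id4",
          D=c("id2", "id4"))

## Base R reference: one paste() call per list element.
## Fine for small lists, slow for very large ones.
collapsed_base <- vapply(x, paste, character(1), collapse=",")
print(collapsed_base)

## Same operation with unstrsplit(), run only if IRanges is available.
if (requireNamespace("IRanges", quietly=TRUE)) {
    suppressPackageStartupMessages(library(IRanges))
    collapsed <- unstrsplit(x, sep=",")
    print(collapsed)
}
```

An empty element such as B collapses to the empty string, so the result keeps one string per list element.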

Cheers,
H.


On 12/16/2013 06:51 PM, Michael Lawrence wrote:
> Btw, the name strunsplit is way better than my pasteCollapse. Maybe
> tweak it to unstrsplit? Feels more like a verb.
>
>
>
> On Mon, Dec 16, 2013 at 4:16 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>     Hi Ryan,
>
>     Here is one way to do this using Biostrings:
>
>        library(Biostrings)
>
>        strunsplit <- function(x, sep=",")
>        {
>          if (!is(x, "XStringSetList"))
>              x <- Biostrings:::XStringSetList("B", x)
>          if (!isSingleString(sep))
>              stop("'sep' must be a single character string")
>
>          ## unlist twice.
>          unlisted_x <- unlist(x, use.names=FALSE)
>          unlisted_ans0 <- unlist(unlisted_x, use.names=FALSE)
>
>          ## insert 'sep'.
>          unlisted_x_width <- width(unlisted_x)
>          x_partitioning <- PartitioningByEnd(x)
>          at <- cumsum(unlisted_x_width)[-end(x_partitioning)] + 1L
>          unlisted_ans <- replaceAt(unlisted_ans0, at, value=sep)
>
>          ## relist.
>          ans_width <- sum(relist(unlisted_x_width, x_partitioning))
>          x_eltlens <- width(x_partitioning)
>          idx <- which(x_eltlens >= 2L)
>          ans_width[idx] <- ans_width[idx] +
>                            (x_eltlens[idx] - 1L) * nchar(sep)
>          relist(unlisted_ans, PartitioningByWidth(ans_width))
>        }
>
>     Then:
>
>        > x <- CharacterList(A=c("id35", "id2", "id18"), B=NULL, C="id4",
>        +                    D=c("id2", "id4"))
>        > strunsplit(x)
>          A BStringSet instance of length 4
>            width seq                                               names
>        [1]    13 id35,id2,id18                                     A
>        [2]     0                                                   B
>        [3]     3 id4                                               C
>        [4]     7 id2,id4                                           D
>
>     I'll add this to Biostrings.
>
>     Cheers,
>     H.
>
>
>
>     On 12/16/2013 03:04 PM, Ryan C. Thompson wrote:
>
>         Hi all,
>
>         I have some annotation data in a DataFrame, and of course since
>         annotations are not one-to-one, some of the columns are
>         CharacterList or similar classes. I would like to know if there
>         is an efficient way to collapse a CharacterList to a character
>         vector of the same length, such that for elements of length > 1,
>         those elements are collapsed with a given separator. The
>         following is what I came up with, but it is very slow for large
>         CharacterLists:
>
>         library(stringr)
>         library(plyr)
>         flatten.CharacterList <- function(x, sep=",") {
>             if (is.list(x)) {
>               x[!is.na(x)] <- laply(x[!is.na(x)], str_c, collapse=sep,
>                                     .parallel=TRUE)
>               x <- as(x, "character")
>             }
>             x
>         }
>
>         -Ryan
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org
>     Phone: (206) 667-5791
>     Fax: (206) 667-1319
>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


