From kripa777 at hotmail.com Fri Jun 1 00:03:24 2012
From: kripa777 at hotmail.com (Kripa R)
Date: Thu, 31 May 2012 22:03:24 +0000
Subject: [BioC] LIMMA: plotMDS
In-Reply-To:
References: , , , , ,
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From nuin at genedrift.org Fri Jun 1 00:53:44 2012
From: nuin at genedrift.org (Paulo Nuin)
Date: Thu, 31 May 2012 18:53:44 -0400
Subject: [BioC] MacOS Package installation problems
In-Reply-To:
References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com>
Message-ID: <0EA23C73-2048-4516-83A9-54DD2573A20B@genedrift.org>

It is some type of Mac-only problem that happened in Seattle at the source. Just Google the error message and you should find a good answer within the first two hits.

Cheers
Paulo

On 2012-05-31, at 10:48 AM, Chinedu Orekie wrote:

> I am having the same problem installing packages (Line starting ' PUBLI ...' is malformed). It has not mattered which repository I cite, CRAN or
> BioC. I first observed this yesterday, 30 May 2012. This all seems quite recent
> since I was able to access packages only two weeks ago. My computer is Windows
> based by the way.
>
> Could this be some generic systems error?
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From dtenenba at fhcrc.org Fri Jun 1 00:57:07 2012
From: dtenenba at fhcrc.org (Dan Tenenbaum)
Date: Thu, 31 May 2012 15:57:07 -0700
Subject: [BioC] MacOS Package installation problems
In-Reply-To:
References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com>
Message-ID:

Hi Chinedu,

On Thu, May 31, 2012 at 7:48 AM, Chinedu Orekie wrote:
> I am having the same problem installing packages (Line starting ' PUBLI ...' is malformed). It has not mattered which repository I cite, CRAN or
> BioC. I first observed this yesterday, 30 May 2012. This all seems quite recent
> since I was able to access packages only two weeks ago. My computer is Windows
> based by the way.
>
> Could this be some generic systems error?
>

This is the first I have heard of this happening on Windows systems.
Can you send the command that caused the error, as well as the full
error message and the output of sessionInfo()?
Thanks,
Dan

From dtenenba at fhcrc.org Fri Jun 1 01:06:29 2012
From: dtenenba at fhcrc.org (Dan Tenenbaum)
Date: Thu, 31 May 2012 16:06:29 -0700
Subject: [BioC] Errors Installing Package qrqc under MacOsX
In-Reply-To:
References:
Message-ID:

Hi Sue,

On Thu, May 31, 2012 at 8:01 AM, Sue Jones wrote:
> I am using R version 2.15 under Mac OS X 10.6.8
> When I try to install the qrqc package using
>
> source("http://bioconductor.org/biocLite.R")
> biocLite("qrqc")
> library(qrqc)
>
> I get the following error message
> --------------------------------
> BioC_mirror: http://www.bioconductor.org
> Using R version 2.15, BiocInstaller version 1.4.4.

The latest version of BiocInstaller is 1.4.6 and your email was sent 8
hours ago. Are you still having this problem?

Dan

> Warning: unable to access index for repository
> http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.15
> Installing package(s) 'qrqc'
> Error: Line starting ' library(qrqc)
> Error in library(qrqc) : there is no package called ‘qrqc’
> ------------------------------------
>
> I think the first part of the error is perhaps related to the CRAN mirror -
> but even when I use this
> alternative suggested elsewhere on gmane
>
> biocLite("qrqc", type="source")
>
> I get a series of error messages - the end of which is
>
> --------------------
> ERROR: dependency ‘biovizBase’ is not available for package ‘qrqc’
> * removing ‘/Library/Frameworks/R.framework/Versions/2.15/Resources/library/qrqc’
>
> The downloaded source packages are in
>         ‘/private/var/folders/aS/aSrDK+ZQF+0QTLwvM0ZOXk+++TI/-Tmp-/RtmpVVshp3/downloaded_packages’
> Error: Line starting ' -----------------------
>
> Any help with what I am doing wrong would be appreciated.

From dtenenba at fhcrc.org Fri Jun 1 01:14:07 2012
From: dtenenba at fhcrc.org (Dan Tenenbaum)
Date: Thu, 31 May 2012 16:14:07 -0700
Subject: [BioC] MacOS Package installation problems
In-Reply-To:
References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com>
Message-ID:

On Thu, May 31, 2012 at 7:48 AM, Chinedu Orekie wrote:
> I am having the same problem installing packages (Line starting ' PUBLI ...' is malformed). It has not mattered which repository I cite, CRAN or
> BioC. I first observed this yesterday, 30 May 2012. This all seems quite recent
> since I was able to access packages only two weeks ago. My computer is Windows
> based by the way.
>
> Could this be some generic systems error?
>

I wonder if this is unrelated to the Mac OS problem and is instead a
problem with a firewall or proxy at your location.

What if you try this command:

download.file("http://google.com", tempfile())

Does it work, or produce an error?

If it produces an error, try starting R as follows:

R --internet2

And then try biocLite() again. Let us know what happens.
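
The connectivity check above can also be wrapped so that a failure is reported rather than stopping the session. This is only a minimal sketch: `check_download` is a hypothetical helper name (it is not part of R or BiocInstaller), and the `downloader` argument exists only so the function can be tried without network access.

```r
# Hypothetical helper (an assumption for illustration, not an R or
# Bioconductor function): attempt one download and report whether it
# succeeded, capturing the error message instead of stopping.
check_download <- function(url, downloader = utils::download.file) {
  dest <- tempfile()
  on.exit(unlink(dest), add = TRUE)   # clean up the temporary file
  tryCatch({
    status <- downloader(url, dest, quiet = TRUE)
    # download.file() returns 0 on success
    list(ok = identical(as.integer(status), 0L), message = "")
  }, error = function(e) {
    list(ok = FALSE, message = conditionMessage(e))
  })
}

# Example (needs network access, so it is left commented out):
# check_download("http://google.com")
```

If `ok` comes back FALSE, the captured `message` is the text worth pasting into a reply along with sessionInfo().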
Thanks,
Dan

From mdavy86 at gmail.com Fri Jun 1 01:16:03 2012
From: mdavy86 at gmail.com (Marcus Davy)
Date: Fri, 1 Jun 2012 11:16:03 +1200
Subject: [BioC] samtools error
In-Reply-To: <4FC6F036.8000000@fhcrc.org>
References: <4FC6F036.8000000@fhcrc.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From gorillayue at gmail.com Fri Jun 1 01:23:54 2012
From: gorillayue at gmail.com (Yue Li)
Date: Thu, 31 May 2012 19:23:54 -0400
Subject: [BioC] how to load RangedData for all chromosomes in one tab in UCSC browser using browserView from rtracklayer
In-Reply-To:
References: <0EB0F5EF-5DDD-4170-96D2-227E5A1137AA@gmail.com>
Message-ID: <30255D77-D99F-433B-9564-4A3E0D0CC736@gmail.com>

Not sure if you were asking a question or implying that it's not feasible with UCSC browser. Anyhow, I could just upload to UCSC the bed/bam/bedGraph/gff file containing ranges for multiple chromosomes. Then go to the browser-view to visualize any chromosome in a single window with my browser?

On 2012-05-31, at 2:12 PM, Steve Lianoglou wrote:

> Hi,
>
> On Thu, May 31, 2012 at 1:00 PM, Yue Li wrote:
>> Dear List,
>>
>> I wonder how to load RangedData for all chromosomes in a SINGLE tab in UCSC browser using browserView from rtracklayer.
>>
>> Currently, if I have a GRanges object for "mm9" build:
>>
>>> alignGR
>> GRanges with 238161 ranges and 0 elementMetadata cols:
>>                     seqnames                 ranges strand
>>
>>   SRR039212.1000031    chr19 [  8790316,   8790351]      -
>>   SRR039212.1000085     chr5 [106579844, 106579879]      +
>>   SRR039212.1000087     chr8 [109778747, 109778782]      +
>>   SRR039212.1000088     chr8 [ 93777537,  93777572]      +
>>   SRR039212.1000132     chr3 [128910749, 128910784]      +
>>   SRR039212.1000149     chr8 [127433402, 127433437]      +
>>   SRR039212.1000170    chr15 [ 93546853,  93546888]      +
>>   SRR039212.1000174    chr18 [ 32056273,  32056308]      -
>>   SRR039212.1000177     chr7 [ 90292453,  90292474]      -
>>                 ...      ...                    ...    ...
>>    SRR039212.999792     chr2 [162907151, 162907186]      -
>>    SRR039212.999805     chr6 [ 44021338,  44021373]      +
>>    SRR039212.999810     chr4 [121106682, 121106717]      -
>>    SRR039212.999844    chr19 [ 60848841,  60848876]      +
>>    SRR039212.999848     chr5 [117644397, 117644432]      -
>>    SRR039212.999854     chr6 [132445007, 132445042]      -
>>    SRR039212.999855     chr7 [108392362, 108392397]      -
>>    SRR039212.999892     chr9 [ 20946884,  20946919]      +
>>    SRR039212.999901     chr2 [168845152, 168845187]      +
>>   ---
>>   seqlengths:
>>         chr1     chr10     chr11     chr12     chr13     chr14 ...      chr8      chr9      chrM      chrX      chrY
>>    197195432 129993255 121843856 121257530 120284312 125194864 ... 131738871 124076172     16299 166650296  15902555
>>
>>
>> and I do this:
>>         session <- browserSession()
>>
>>         track(session, "read alignments") <- RangedData(alignGR)
>>
>>         # launch browser view
>>         browserView(session, alignGR)
>>
>>
>> This will open 22 tabs in my browser corresponding to 22 chromosomes in the GRanges object. I wonder if it would be possible to have just one single tab open for all 22 chromosomes to view on UCSC browser.
>
> Is there any way that you know of to have the genome browser do what
> you are asking "directly"?
>
> I mean, if you were just navigating w/ the browser alone (not using
> rtracklayer), how would you get it to do what you are asking?
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact

From dtenenba at fhcrc.org Fri Jun 1 04:00:45 2012
From: dtenenba at fhcrc.org (Dan Tenenbaum)
Date: Thu, 31 May 2012 19:00:45 -0700
Subject: [BioC] MacOS Package installation problems
In-Reply-To: <7554A32B-75A3-4CE3-BB2F-36590D6775CF@tntcmedia.us>
References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com> <7554A32B-75A3-4CE3-BB2F-36590D6775CF@tntcmedia.us>
Message-ID:

Hi,

On Thu, May 31, 2012 at 6:36 PM, Chinedu Orekie wrote:
> Dan:
>
> I get the same error at separate internet connections (at home and at work). Now these are locations that worked in the past. So I am doubtful that a change in net connection is the explanation.
>
> The sequence to seeing that error was: utils:::menuInstallPkgs() at the console and then on selecting a mirror site.

Which mirror site did you select? Did you try to install a package?
Which one? Was there any output besides the error message you sent?

> Your thoughts.

I don't know what could cause this. The recommended way to install
Bioconductor packages, and the only method we support and can
troubleshoot, is:

source("http://bioconductor.org/biocLite.R")
biocLite("pkgName")

Also, to rule out any problems with mirrors, you should run
chooseCRANmirror()
and
chooseBioCmirror()
and choose Switzerland and USA (WA 1) respectively.

Dan

>
> Chinedu
> "TNTC: The need to connect" | P: (908) 514-TNTC
>
> On May 31, 2012, at 6:57 PM, Dan Tenenbaum wrote:
>
>> Hi Chinedu,
>>
>> On Thu, May 31, 2012 at 7:48 AM, Chinedu Orekie wrote:
>>> I am having the same problem installing packages (Line starting ' PUBLI ...' is malformed). It has not mattered which repository I cite, CRAN or
>>> BioC.
>>> I first observed this yesterday, 30 May 2012. This all seems quite recent
>>> since I was able to access packages only two weeks ago. My computer is Windows
>>> based by the way.
>>>
>>> Could this be some generic systems error?
>>>
>>
>> This is the first I have heard of this happening on windows systems.
>> Can you send the command that caused the error, as well as the full
>> error code and the output of sessionInfo()?
>>
>> Thanks,
>> Dan
>>

From kasperdanielhansen at gmail.com Fri Jun 1 05:24:49 2012
From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen)
Date: Thu, 31 May 2012 23:24:49 -0400
Subject: [BioC] voom() vs. RPKM/FPKM or otherwise normalized counts, and GC correction, when fitting models to a small number of responses (per-feature counts)
In-Reply-To:
References:
Message-ID:

On Wed, May 30, 2012 at 7:11 PM, Tim Triche, Jr. wrote:
> Hi Dr. Smyth,
>
> Thank you for the helpful clarifications. It seems like RPM/CPM is
> useful for tasks such as plotting expression on a reasonably similar scale;
> taking logs and adjusting for mean-variance relationships can better
> satisfy expected mean-variance relationships for linear modeling and thus
> should dovetail better with the toolset in limma, offering a less
> computationally demanding alternative for exploratory analysis. On the
> other hand, if the primary goal is to detect differences, especially in
> rare or highly variably expressed features, an edgeR GLM with empirical
> Bayes estimates of the feature-wise dispersion is the most appropriate tool
> to maximize statistical power.
>
> Is this understanding reasonable? It would seem that, whether I use
> limma or rig up some sort of weighting for (e.g.)
> sparsenet, the output
> from voom() is most likely to be useful for my particular (EDA) needs at
> the moment.
>
> One last question (for anyone who wishes to answer, really) -- if
> gene/transcript length is not associated with the mean/variance
> relationship for read counts, why was it asserted in the original Mortazavi
> paper that:
>
> The sensitivity of RNA-Seq will be a function of both molar concentration
> and transcript length [nb: no citation given, presumably this is felt to be
> self-evident?]. We therefore quantified transcript levels in reads per
> kilobase of exon model per million mapped reads.
>
> It seems as if this is a red herring? GC% could clearly affect the degree
> to which a transcript "absorbs" read depth, but I continue to have
> difficulty understanding why the length of exon model is relevant in this
> context.

While the Mortazavi paper is a very good paper on RNA-seq, this
section is not their best.

Because RNA is fragmented, there will be a relationship between read
counts (number of reads mapped to a gene model) and gene length. This
is indisputable. The question is whether this is something we want to
include in our model, beyond the fact that longer genes have more
counts and therefore a bigger mean (and since higher mean leads to
lower variance, this is probably what Mortazavi meant here).

RPKM tries to make an expression measure that is comparable between
genes inside a single sample. This is for example necessary for making
the "titration" curve in the Mortazavi paper showing a nice
relationship between actual concentration and RPKM, since each of the
points on the curve is a different gene. Note that that plot has
nothing to do with differential expression, but rather absolute
quantification.

In a typical gene expression analysis we are
(1) not interested in comparing genes, only samples within a fixed gene.
(2) interested in relative changes, not absolute measurements.
This is really not something Mortazavi discusses.
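
The two points above can be illustrated with a toy calculation. This is only a sketch: the counts, gene lengths and library sizes below are invented for illustration, and the object names are arbitrary.

```r
# Toy data, invented purely for illustration (not from any real experiment).
counts <- matrix(c(100,  400,    # geneA: sample s1, sample s2
                   1000, 4000),  # geneB: sample s1, sample s2
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
gene_length_kb    <- c(geneA = 1, geneB = 10)   # assumed gene lengths
lib_size_millions <- c(s1 = 10, s2 = 20)        # assumed library sizes

# Counts per million: divide each column by its library size.
cpm  <- sweep(counts, 2, lib_size_millions, "/")
# RPKM: additionally divide each row by that gene's length in kb.
rpkm <- cpm / gene_length_kb

# Between-sample log fold change for each gene, computed both ways.
lfc_cpm  <- log2(cpm[, "s2"]  / cpm[, "s1"])
lfc_rpkm <- log2(rpkm[, "s2"] / rpkm[, "s1"])
```

Here `lfc_cpm` and `lfc_rpkm` agree, because the gene length cancels when the same gene is compared across samples. Within sample s1, on the other hand, geneA and geneB differ tenfold in CPM but are equal in RPKM, which is the between-gene comparison RPKM was designed for.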
edgeR and DESeq try to get at differential expression. And they
essentially use the fact that there is a mean-variance relationship to
improve their modeling.

Now, it is clear (I would argue) that mean does not in any way
perfectly predict variability, so it is entirely possible that a better
method may come along and improve on what we have. But such a method
would first have to prove itself.

Now, as I said above, gene length affects read counts through
fragmentation. In case fragmentation varies between samples, there
may be a problem. Same with GC content. We recently showed [1] that
GC content, and to a lesser extent gene length, can have a sample
specific effect. If that is the case, you need to account for that.
But that is because the effect is sample specific.

1. Removing technical variability in RNA-seq data using conditional
quantile normalization. Biostatistics 13, 204–216 (2012).

Kasper

>
> Thank you so much for your time and effort in explaining these rather
> subtle issues.
>
> Tim Triche, Jr.
> USC Biostatistics
>
> On Wed, May 30, 2012 at 1:23 AM, Gordon K Smyth wrote:
>
>> Hi Tim,
>>
>> On thinking about this a little more, voom() could easily output logRPKM
>> rather than logCPM, and the same weights would apply. Indeed you could
>> convert the voom() output to logRPKM yourself and, in principle, undertake
>> analyses using the values if you make use of the corresponding voom weights.
>>
>> However voom() does need to get raw counts as input, just like edgeR,
>> rather than RPKM. voom() can cope with a re-scaling of the counts, but not
>> with a transformation that is non-monotonic in the counts. RPKM is an
>> unhelpful measure from a statistical point of view, because it "forgets"
>> how large the count was in the first place.
>>
>> The aims of Yuval's package are complementary to edgeR or voom, certainly
>> neither replaces the other.
>> These results may inform how we do the
>> normalization step, but we have not yet reached the stage of doing this
>> routinely.
>>
>> Best wishes
>> Gordon
>>
>>
>> On Fri, 25 May 2012, Gordon K Smyth wrote:
>>
>>> Dear Tim,
>>>
>>> I don't follow what you are trying to do scientifically, and this makes
>>> all the difference when deciding what are the appropriate tools to use.
>>>
>>> If you are undertaking some sort of analysis that requires absolute gene
>>> (or feature) expression levels as responses, then you should not be using
>>> voom or limma or edgeR. limma and edgeR do not estimate absolute
>>> expression.
>>>
>>> If on the other hand, you want to detect differentially expressed genes
>>> (or features), which is what voom does, then there is no need to correct
>>> for gene length. The comments of Section 2.3 of the edgeR User's Guide and
>>> especially 2.3.2 "Adjustments for gene length, GC content, mappability and
>>> so on" are also relevant for voom. There is no need to correct for any
>>> characteristic of a gene that remains unchanged across samples.
>>>
>>> A good case has been made that GC content can have differential influence
>>> across samples, but that doesn't apply to gene length.
>>>
>>> voom does not work on RPKM or FPKM, or on the output from cufflinks. voom
>>> estimates a mean-variance relationship, and the variance is a function of
>>> count size, not of expression level.
>>>
>>> Yes, you need limma to use the output from voom, because other software
>>> does not generally have the ability to use quantitative weights. If you
>>> ignore the weights, then the output from voom is just logCPM, and you
>>> hardly need voom to compute that.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> ---------------------------------------------
>>> Professor Gordon K Smyth,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> smyth at wehi.edu.au
>>> http://www.wehi.edu.au
>>> http://www.statsci.org/smyth
>>>
>>> On Thu, 24 May 2012, Tim Triche, Jr. wrote:
>>>
>>>> Hi Dr. Smyth and Dr. Law,
>>>>
>>>> I have been reading the documentation for limma::voom() and trying to
>>>> understand why there seems to be no correction for the size of the feature
>>>> in the model:
>>>>
>>>> In an experiment, a count value is observed for each tag in each sample. A
>>>> tag-wise mean-variance trend is computed using lowess. The tag-wise mean is
>>>> the mean log2 count with an offset of 0.5, across samples for a given tag.
>>>> The tag-wise variance is the quarter-root-variance of normalized log2
>>>> counts per million values with an offset of 0.5, across samples for a given
>>>> tag. Tags with zero counts across all samples are not included in the
>>>> lowess fit. Optional normalization is performed using
>>>> normalizeBetweenArrays. Using fitted values of log2 counts from a linear
>>>> model fit by lmFit, variances from the mean-variance trend were
>>>> interpolated for each observation. This was carried out by approxfun.
>>>> Inverse variance weights can be used to correct for mean-variance trend in
>>>> the count data.
>>>>
>>>>
>>>> I don't see a reference to the feature size in all of this. (?) Am I
>>>> missing something? Probably something major (like, say, the relationship
>>>> of GC content or read length to variance)...
>>>> Is the idea that features with similar sequence properties/size and
>>>> abundance will have their mean-variance relationship modeled appropriately
>>>> and weights generated empirically?
>>>>
>>>> For comparison, what I have been doing (in lieu of knowing any better) is
>>>> as follows: align with Rsubread, run subjunc and splicegrapher, and count
>>>> against exon/gene/feature models:
>>>>
>>>> alignedToRPKM <- function(readcounts) { # the output of featureCounts()
>>>>   millionsMapped <- colSums(readcounts$counts)/1000000
>>>>   if('ExonLength' %in% names(readcounts$annotation)) {
>>>>     geneLengthsInKB <- readcounts$annotation$ExonLength/1000
>>>>   } else {
>>>>     # works fine for ncRNA and splice graph edges
>>>>     geneLengthsInKB <- readcounts$annotation$GeneLength/1000
>>>>   }
>>>>
>>>>   # example usage: readcounts$RPKM <- alignedToRPKM(readcounts)
>>>>   return( sweep(readcounts$counts, 2, millionsMapped, '/') / geneLengthsInKB )
>>>> }
>>>>
>>>> (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got
>>>> about the same results but slower, so I stuck with Rsubread. And
>>>> featureCounts() is really handy.)
>>>>
>>>> So, given the feature sizes in readcounts$annotation I can at least put
>>>> things on something like a similar scale. Most of my modeling currently is
>>>> focused on penalized local regressions and thus a performant (but accurate)
>>>> measure that can be used for linear modeling on a large scale is desirable.
>>>> Is the output of voom() what I want? Does one need to use limma/lmFit()
>>>> to make use of voom()'s output?
>>>>
>>>> Last but not least, should I use something like Yuval Benjamini's GCcorrect
>>>> package (http://www.stat.berkeley.edu/~yuvalb/YuvalWeb/Software.html)
>>>> before/during/instead of voom()?
>>>> And if the expression of a feature or several nearby features is often the
>>>> response, does it matter a great deal what I use?
>>>>
>>>> Thanks for any input you might have time to provide. I have to assume that
>>>> the minds at WEHI periodically scheme together how best to go about these
>>>> things...
>>>>
>>>>
>>>> --
>>>> *A model is a lie that helps you see the truth.*
>>>>
>>>> Howard Skipper
>>>> 1173.full.pdf
>>>>

From Ekta_Jain at jubilantbiosys.com Fri Jun 1 08:41:38 2012
From: Ekta_Jain at jubilantbiosys.com (Ekta Jain)
Date: Fri, 1 Jun 2012 12:11:38 +0530
Subject: [BioC] edgeR: topTags
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From delhomme at embl.de Fri Jun 1 10:00:39 2012
From: delhomme at embl.de (Nicolas Delhomme)
Date: Fri, 1 Jun 2012 10:00:39 +0200
Subject: [BioC] DESeq estimateDispersions() problem
In-Reply-To: <3821B825-66C6-4F8E-B610-85D802279B17@slu.se>
References: <3821B825-66C6-4F8E-B610-85D802279B17@slu.se>
Message-ID:

Dear Karl,

Can you please paste in your session info as well? That will help the maintainer. To get it, use the sessionInfo command in R after having loaded the DESeq package.

I would not expect such a function to disappear without being deprecated, but you can always look up the package's recent changes using the news command:

news(package="DESeq")

Depending on your R version, you can query more specific news, e.g. if you're using the current development version:

news(Version >= "1.9", package="DESeq")

As I said, since I do not expect such a drastic change in the DESeq API, make sure as well that nothing changed in your R environment: check that no previous session is restored when you start R (look for an .RData file in the startup directory), etc.
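
The environment checks suggested above could be scripted roughly as follows. This is a sketch under stated assumptions: both function names are hypothetical helpers made up for illustration (they are not part of DESeq or base R), and the workspace file is assumed to carry R's default name, `.RData`.

```r
# Hypothetical helpers for illustration only; not part of DESeq or base R.

# Is there a saved workspace in the directory R was started from?
has_saved_workspace <- function(dir = getwd()) {
  file.exists(file.path(dir, ".RData"))
}

# Which entries on the search path define a function of this name?
where_defined <- function(name) {
  hits <- vapply(search(), function(env) {
    exists(name, where = env, mode = "function", inherits = FALSE)
  }, logical(1))
  names(hits)[hits]
}

# Example usage after library(DESeq), left commented out because it
# assumes DESeq is installed:
# has_saved_workspace()
# where_defined("estimateDispersions")
```

An empty result from `where_defined()` after loading the package would point to a problem with the installation rather than with the call itself.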
Cheers,

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On May 31, 2012, at 2:29 PM, Karl Lund?n wrote:

> Dear all,
> I can not get the DESeq function estimateDispersions() to function as it usually
> does. Have there been any recent updates in DESeq or in R that explains why
> estimateDispersions doesn't work? I use R on a grid engine and everything worked
> fine earlier this spring.
> Thanks
> Karl
> ================================================
>> library(DESeq)
>> ?newCountDataSet()
>> library(DESeq)
>> ?estimateDispersions()
> Error in .helpForCall(topicExpr, parent.frame()) :
>   no methods for 'estimateDispersions' and no documentation for it as a function
>> estimateDispersions()
> Error: could not find function "estimateDispersions"
>
> ## Other functions do work
>> newCountDataSet()
> Error in as.matrix(countData) :
>   argument "countData" is missing, with no default
>

From guest at bioconductor.org Fri Jun 1 11:16:51 2012
From: guest at bioconductor.org (Sonal [guest])
Date: Fri, 1 Jun 2012 02:16:51 -0700 (PDT)
Subject: [BioC] Error in intgroup of arrayQualityMetrics package
Message-ID: <20120601091651.6CFA1133CD3@mamba.fhcrc.org>

I am using the arrayQualityMetrics package installed from the Bioconductor site, and the R version I am using is 2.15.0.
The input for the function was eset, and for the intgroup argument the character vector "Tissue". There is a column named Tissue in the phenoData of my eset.
But it still gives me an error saying the elements of intgroup do not match the column names of pData(eset).
I don't know what I am doing wrong. Can anybody suggest anything?
Thank you.

 -- output of sessionInfo():

Error in prepData(expressionset,intgroup=intgroup):
all elements of 'intgroup' should match column names of pData(expressionset)

--
Sent via the guest posting facility at bioconductor.org.

From daniel at intomics.com Fri Jun 1 11:19:55 2012
From: daniel at intomics.com (Daniel Aaen Hansen)
Date: Fri, 1 Jun 2012 11:19:55 +0200
Subject: [BioC] arrayQualityMetrics error with MAList
In-Reply-To: <4F802006.70802@embl.de>
References: <4F53C9F6.3010705@embl.de> <4F626A82.7090707@embl.de> <5065A458-37F5-4F40-9BB7-D4A589CBAF51@intomics.com> <4F802006.70802@embl.de>
Message-ID: <2E20B7B5-CDCE-4780-A815-EAA528A8BDED@intomics.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From Parisa.Razaz at icr.ac.uk Fri Jun 1 12:01:22 2012
From: Parisa.Razaz at icr.ac.uk (Parisa Razaz)
Date: Fri, 1 Jun 2012 11:01:22 +0100
Subject: [BioC] Is it possible to read in Bluefuse and Agilent files together using read.maimages() function in limma?
Message-ID:

Hi,

Is it possible to read in both Bluefuse and Agilent files together using the read.maimages() function in limma?

Thanks,

Parisa

From smyth at wehi.EDU.AU Fri Jun 1 12:19:29 2012
From: smyth at wehi.EDU.AU (Gordon K Smyth)
Date: Fri, 1 Jun 2012 20:19:29 +1000 (AUS Eastern Standard Time)
Subject: [BioC] Is it possible to read in Bluefuse and Agilent files together using read.maimages() function in limma?
In-Reply-To:
References:
Message-ID:

No, it can't combine any two types.
Read them in instead using separate calls to read.maimages().

Gordon

On Fri, 1 Jun 2012, Parisa Razaz wrote:

> Hi,
>
> Is it possible to read in both Bluefuse and Agilent files together using the read.maimages() function in limma?
>
> Thanks,
>
> Parisa

From Parisa.Razaz at icr.ac.uk Fri Jun 1 12:33:18 2012
From: Parisa.Razaz at icr.ac.uk (Parisa Razaz)
Date: Fri, 1 Jun 2012 11:33:18 +0100
Subject: [BioC] Is it possible to read in Bluefuse and Agilent files together using read.maimages() function in limma?
In-Reply-To:
References:
Message-ID: <12B90BF6-E4EE-4BA6-9103-BDAD63EB669E@icr.ac.uk>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From Guido.Hooiveld at wur.nl Fri Jun 1 14:20:31 2012
From: Guido.Hooiveld at wur.nl (Hooiveld, Guido)
Date: Fri, 1 Jun 2012 12:20:31 +0000
Subject: [BioC] frmaTools: error with 'convertPlatform'
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mailinglist.honeypot at gmail.com Fri Jun 1 16:06:53 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Fri, 1 Jun 2012 10:06:53 -0400
Subject: [BioC] DESeq estimateDispersions() problem
In-Reply-To: <3821B825-66C6-4F8E-B610-85D802279B17@slu.se>
References: <3821B825-66C6-4F8E-B610-85D802279B17@slu.se>
Message-ID:

Hi Karl,

estimateDispersions is still definitely there ... you can verify that
by looking through the DESeq Reference Manual:

http://bioconductor.org/packages/2.10/bioc/manuals/DESeq/man/DESeq.pdf

It seems there's something wonky going on with your install. As
Nicolas said, providing the output from `sessionInfo()` after you load
DESeq will be most helpful.
As a side note, I just want to point out that you are trying to invoke
the help incorrectly:

On Thu, May 31, 2012 at 8:29 AM, Karl Lund?n wrote:
[snip]
>> library(DESeq)
>> ?estimateDispersions()
> Error in .helpForCall(topicExpr, parent.frame()) :
>   no methods for 'estimateDispersions' and no documentation for it as a function

You should remove the open/close parens. So your call to help should be:

R> ?estimateDispersions

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From mailinglist.honeypot at gmail.com Fri Jun 1 16:29:11 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Fri, 1 Jun 2012 10:29:11 -0400
Subject: [BioC] how to load RangedData for all chromosomes in one tab in UCSC browser using browserView from rtracklayer
In-Reply-To: <30255D77-D99F-433B-9564-4A3E0D0CC736@gmail.com>
References: <0EB0F5EF-5DDD-4170-96D2-227E5A1137AA@gmail.com> <30255D77-D99F-433B-9564-4A3E0D0CC736@gmail.com>
Message-ID:

Hi Yue,

On Thu, May 31, 2012 at 7:23 PM, Yue Li wrote:
> Not sure if you were asking a question or implying that it's not feasible with UCSC browser. Anyhow, I could just upload to UCSC the bed/bam/bedGraph/gff file containing ranges for multiple chromosomes. Then go to the browser-view to visualize any chromosome in a single window with my browser?

I was just trying to get some clarification is all -- it sounded like
you wanted to see all of the different regions at once when you said:

"""I wonder how to load RangedData for all chromosomes in a SINGLE tab
in UCSC browser using browserView from rtracklayer."""

-steve

>
>
> On 2012-05-31, at 2:12 PM, Steve Lianoglou wrote:
>
>> Hi,
>>
>> On Thu, May 31, 2012 at 1:00 PM, Yue Li wrote:
>>> Dear List,
>>>
>>> I wonder how to load RangedData for all chromosomes in a SINGLE tab in UCSC browser using browserView from rtracklayer.
>>>
>>> Currently, if I have a GRanges object for "mm9" build:
>>>
>>>> alignGR
>>> GRanges with 238161 ranges and 0 elementMetadata cols:
>>>                     seqnames                 ranges strand
>>>                        <Rle>              <IRanges>  <Rle>
>>>   SRR039212.1000031    chr19 [  8790316,   8790351]      -
>>>   SRR039212.1000085     chr5 [106579844, 106579879]      +
>>>   SRR039212.1000087     chr8 [109778747, 109778782]      +
>>>   SRR039212.1000088     chr8 [ 93777537,  93777572]      +
>>>   SRR039212.1000132     chr3 [128910749, 128910784]      +
>>>   SRR039212.1000149     chr8 [127433402, 127433437]      +
>>>   SRR039212.1000170    chr15 [ 93546853,  93546888]      +
>>>   SRR039212.1000174    chr18 [ 32056273,  32056308]      -
>>>   SRR039212.1000177     chr7 [ 90292453,  90292474]      -
>>>                 ...      ...                    ...    ...
>>>    SRR039212.999792     chr2 [162907151, 162907186]      -
>>>    SRR039212.999805     chr6 [ 44021338,  44021373]      +
>>>    SRR039212.999810     chr4 [121106682, 121106717]      -
>>>    SRR039212.999844    chr19 [ 60848841,  60848876]      +
>>>    SRR039212.999848     chr5 [117644397, 117644432]      -
>>>    SRR039212.999854     chr6 [132445007, 132445042]      -
>>>    SRR039212.999855     chr7 [108392362, 108392397]      -
>>>    SRR039212.999892     chr9 [ 20946884,  20946919]      +
>>>    SRR039212.999901     chr2 [168845152, 168845187]      +
>>>   ---
>>>   seqlengths:
>>>         chr1     chr10     chr11     chr12     chr13     chr14 ...      chr8      chr9      chrM      chrX      chrY
>>>    197195432 129993255 121843856 121257530 120284312 125194864 ... 131738871 124076172     16299 166650296  15902555
>>>
>>>
>>> and I do this:
>>>         session <- browserSession()
>>>
>>>
>>>         track(session, "read alignments") <- RangedData(alignGR)
>>>
>>>
>>>         # launch browser view
>>>         browserView(session, alignGR)
>>>
>>>
>>>
>>> This will open 22 tabs in my browser corresponding to 22 chromosomes in the GRanges object. I wonder if it would be possible to have just one single tab open for all 22 chromosomes to view on UCSC browser.
>>
>> Is there any way that you know of to have the genome browser do what
>> you are asking "directly"?
>>
>> I mean, if you were just navigating w/ the browser alone (not using
>> rtracklayer), how would you get it to do what you are asking?
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From delhomme at embl.de Fri Jun 1 16:33:39 2012
From: delhomme at embl.de (Nicolas Delhomme)
Date: Fri, 1 Jun 2012 16:33:39 +0200
Subject: [BioC] DESeq estimateDispersions() problem
In-Reply-To:
References: <3821B825-66C6-4F8E-B610-85D802279B17@slu.se>
Message-ID: <9C5FC44E-5798-40D2-ABDB-D756F933CD85@embl.de>

Indeed; good catch Steve. Calling the help that way gives me a very similar error to the one you reported:

?estimateDispersions()
Error in .helpForCall(topicExpr, parent.frame()) :
  no documentation for function 'estimateDispersions' and signature 'object = "ANY"'
In addition: Warning message:
In .helpForCall(topicExpr, parent.frame()) :
  no method defined for function 'estimateDispersions' and signature 'object = "ANY"'

whereas this works as expected:

?estimateDispersions
Help on topic 'estimateDispersions'
was found in the following packages:

  Package               Library
  DESeq                 /Library/Frameworks/R.framework/Versions/2.15/Resources/library
  Biobase               /Library/Frameworks/R.framework/Versions/2.15/Resources/library

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DESeq_1.8.2        locfit_1.5-8       Biobase_2.16.0     BiocGenerics_0.2.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.18.0 DBI_0.2-5            IRanges_1.14.3
 [4] RColorBrewer_1.0-5   RSQLite_0.11.1       annotate_1.34.0
 [7] genefilter_1.38.0    geneplotter_1.34.0   grid_2.15.0
[10] lattice_0.20-6       splines_2.15.0       stats4_2.15.0
[13] survival_2.36-14     xtable_1.7-0

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On Jun 1, 2012, at 4:06 PM, Steve Lianoglou wrote:

> Hi Karl,
>
> estimateDispersions is still definitely there ... you can verify that
> by looking through the DESeq Reference Manual:
>
> http://bioconductor.org/packages/2.10/bioc/manuals/DESeq/man/DESeq.pdf
>
> It seems there's something wonky going on with your install. As
> Nicolas said, providing the output from `sessionInfo()` after you load
> DESeq will be most helpful.
>
> As a side note, I just want to point out that you are trying to invoke
> the help incorrectly:
>
> On Thu, May 31, 2012 at 8:29 AM, Karl Lundén wrote:
> [snip]
>>> library(DESeq)
>>> ?estimateDispersions()
>> Error in .helpForCall(topicExpr, parent.frame()) :
>>   no methods for 'estimateDispersions' and no documentation for it as a function
>
> You should remove the open/close parens.
> So your call to help should be:
>
> R> ?estimateDispersions
>
> HTH,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From karl.lunden at slu.se Fri Jun 1 16:42:43 2012
From: karl.lunden at slu.se (Karl Lundén)
Date: Fri, 1 Jun 2012 16:42:43 +0200
Subject: [BioC] DESeq estimateDispersions() problem
In-Reply-To: <9C5FC44E-5798-40D2-ABDB-D756F933CD85@embl.de>
References: <3821B825-66C6-4F8E-B610-85D802279B17@slu.se> <9C5FC44E-5798-40D2-ABDB-D756F933CD85@embl.de>
Message-ID: <42235816-DCE2-489C-8F1E-FC1B112E37D9@slu.se>

Thanks! I am working on the reinstallation.

Karl

On Jun 1, 2012, at 4:33 PM, Nicolas Delhomme wrote:

> Indeed; good catch Steve. Calling the help that way gives me a very similar error to the one you reported:
>
> ?estimateDispersions()
> Error in .helpForCall(topicExpr, parent.frame()) :
>   no documentation for function 'estimateDispersions' and signature 'object = "ANY"'
> In addition: Warning message:
> In .helpForCall(topicExpr, parent.frame()) :
>   no method defined for function 'estimateDispersions' and signature 'object = "ANY"'
>
> whereas this works as expected:
>
> ?estimateDispersions
> Help on topic 'estimateDispersions'
> was found in the following
> packages:
>
>   Package               Library
>   DESeq                 /Library/Frameworks/R.framework/Versions/2.15/Resources/library
>   Biobase               /Library/Frameworks/R.framework/Versions/2.15/Resources/library
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C/UTF-8/C/C/C/C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] DESeq_1.8.2        locfit_1.5-8       Biobase_2.16.0     BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
>  [1] AnnotationDbi_1.18.0 DBI_0.2-5            IRanges_1.14.3
>  [4] RColorBrewer_1.0-5   RSQLite_0.11.1       annotate_1.34.0
>  [7] genefilter_1.38.0    geneplotter_1.34.0   grid_2.15.0
> [10] lattice_0.20-6       splines_2.15.0       stats4_2.15.0
> [13] survival_2.36-14     xtable_1.7-0
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On Jun 1, 2012, at 4:06 PM, Steve Lianoglou wrote:
>
>> Hi Karl,
>>
>> estimateDispersions is still definitely there ... you can verify that
>> by looking through the DESeq Reference Manual:
>>
>> http://bioconductor.org/packages/2.10/bioc/manuals/DESeq/man/DESeq.pdf
>>
>> It seems there's something wonky going on with your install. As
>> Nicolas said, providing the output from `sessionInfo()` after you load
>> DESeq will be most helpful.
>>
>> As a side note, I just want to point out that you are trying to invoke
>> the help incorrectly:
>>
>> On Thu, May 31, 2012 at 8:29 AM, Karl Lundén wrote:
>> [snip]
>>>> library(DESeq)
>>>> ?estimateDispersions()
>>> Error in .helpForCall(topicExpr, parent.frame()) :
>>>   no methods for 'estimateDispersions' and no documentation for it as a function
>>
>> You should remove the open/close parens. So your call to help should be:
>>
>> R> ?estimateDispersions
>>
>> HTH,
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

From harryzs1981 at gmail.com Fri Jun 1 17:09:48 2012
From: harryzs1981 at gmail.com (sheng zhao)
Date: Fri, 1 Jun 2012 17:09:48 +0200
Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From curoli at gmail.com Fri Jun 1 17:17:49 2012
From: curoli at gmail.com (Oliver Ruebenacker)
Date: Fri, 1 Jun 2012 11:17:49 -0400
Subject: [BioC] BioPAX/SBPAX import for MicroArray Analysis and Hypothesis Building
In-Reply-To:
References:
Message-ID:

     Hello Tim,

On Thu, May 31, 2012 at 5:53 PM, Tim Triche, Jr. wrote:
> this seems clever, although I could swear that an existing package handles
> graphical models of similar sorts.

  We could turn BioPAX/SBPAX into R graphs similar to the way rsbml does it. I wonder if there already are packages for using R graphs representing biological connections to model differential expression data? Then we may be able to use these.
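
The upstream-node reasoning from the original proposal in this thread (reactions A -> X, B -> X and B -> Y) can be sketched in a few lines of base R. This is only a toy scoring rule assumed here for illustration, not the method of any existing package and not a statistical model; `edges` and `score_upstream` are hypothetical names.

```r
# Toy reaction network from the thread: A -> X, B -> X, B -> Y
edges <- data.frame(
  from = c("A", "B", "B"),
  to   = c("X", "X", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical scoring rule (an assumption, not an established method):
# each upstream node gains 1 for every downstream target observed as
# increased and loses 1 for every target that was not.
score_upstream <- function(edges, increased) {
  targets_by_source <- split(edges$to, edges$from)
  sapply(targets_by_source, function(targets) {
    sum(targets %in% increased) - sum(!(targets %in% increased))
  })
}

# Case 1: only X increased; case 2: both X and Y increased
only_x <- score_upstream(edges, increased = "X")
both   <- score_upstream(edges, increased = c("X", "Y"))
```

With only X increased, A scores higher than B; with both X and Y increased, B scores higher, which matches the informal argument in the proposal. A real implementation would also need inhibition edges and proper statistics over a large network.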
> but if the structural priors can be automatically pulled in (however that
> might happen) and updated it would be cool as hell.

  Such data could be made available as SBPAX. BioPAX/SBPAX are excellent for merging data from multiple sources. Can you recommend any significant providers of such public data?

  Thanks!

     Take care
     Oliver

>
>
> On Thu, May 31, 2012 at 2:47 PM, Oliver Ruebenacker
> wrote:
>>
>>      Hello,
>>
>>   I am exploring the idea of creating a package to import pathways as
>> BioPAX/SBPAX (Level 3) data to analyze various measurements. In
>> particular, differential microarray measurements could be used to
>> identify upstream pathway nodes that seem to play a critical role in
>> explaining the observed differences.
>>
>>   The basic idea is very simple: consider a pathway that contains
>> reactions A -> X, B -> X and B -> Y. If measurements show an increase
>> in X, but not in Y, this would suggest an increase in A. If, however,
>> we see increases of both X and Y, then this would point to an increase
>> in B. Analogous considerations apply to nodes upstream of A and B.
>> Negative correlations (by inhibition or depletion) will also be
>> included. Consider this applied to a large network and a large set of
>> measurements, which requires statistical tools to identify the most
>> relevant upstream nodes.
>>
>>   There are people using similar methods on similar data relying on
>> quite simple evaluation functions and turning it into a profitable
>> business.
>>
>>   We can start with a simple prototype that can be created and
>> deployed as quickly as possible as proof of concept to find interested
>> parties. If there is sufficient interest, a more sophisticated version
>> can be built.
>>
>>   Different implementation approaches are possible. It seems to be
>> simple and efficient to use rjava and have the reaction network
>> extracted from BioPAX/SBPAX by a Java package that uses OpenRDF Sesame
>> Rio (i.e.
>> a small fraction of Sesame dealing with RDF graph
>> representation and I/O).
>>
>>   Any comment or show of interest is greatly appreciated.
>>
>>   Thanks!
>>
>>      Take care
>>      Oliver
>>
>> --
>> Oliver Ruebenacker
>> Bioinformatics Consultant
>> (http://www.knowomics.com/wiki/Oliver_Ruebenacker)
>> Knowomics, The Bioinformatics Network (http://www.knowomics.com)
>> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> --
> A model is a lie that helps you see the truth.
>
> Howard Skipper
>

--
Oliver Ruebenacker
Bioinformatics Consultant
(http://www.knowomics.com/wiki/Oliver_Ruebenacker)
Knowomics, The Bioinformatics Network (http://www.knowomics.com)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)

From mailinglist.honeypot at gmail.com Fri Jun 1 17:28:52 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Fri, 1 Jun 2012 11:28:52 -0400
Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9
In-Reply-To:
References:
Message-ID:

Hi,

On Fri, Jun 1, 2012 at 11:09 AM, sheng zhao wrote:
> Dear all,
>
> I updated ChIPpeakAnno to 2.5.9 by:
>
> useDevel(TRUE)
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("ChIPpeakAnno")
>
>
>
> After that, I got the following error when starting
> ChIPpeakAnno:
>
> Loading required package: DBI
> Error : .onLoad failed in loadNamespace() for 'GO.db', details:
>   call: ls(envir, all.names = TRUE)
>   error: 7 arguments passed to .Internal(identical) which requires 6
> Error: package 'GO.db' could not be loaded
>
>
> Any suggestions? Thanks.
>
> ps: Working with ChIPpeakAnno 2.4.0 is fine.

I suspect you'll need to upgrade the rest of your packages to their devel versions if you want to use the devel version of ChIPpeakAnno (using the devel version works for me).

In light of the yearly R release cycle now, the bioc folks have outlined a strategy you might want to follow if you think you want to hop between release and devel versions of packages here:

http://bioconductor.org/developers/useDevel/

HTH,
-steve

>
> Regards,
> Sheng
>
>> sessionInfo()
> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>  [1] RSQLite_0.11.1                      DBI_0.2-5
>  [3] AnnotationDbi_1.19.9                BSgenome.Ecoli.NCBI.20080805_1.3.17
>  [5] BSgenome_1.25.1                     GenomicRanges_1.9.21
>  [7] Biostrings_2.25.4                   IRanges_1.15.11
>  [9] multtest_2.13.0                     Biobase_2.17.5
> [11] biomaRt_2.13.1                      BiocGenerics_0.3.0
> [13] gplots_2.10.1                       KernSmooth_2.23-7
> [15] caTools_1.13                        bitops_1.0-4.1
> [17] gdata_2.8.2                         gtools_2.6.2
>
> loaded via a namespace (and not attached):
> [1] MASS_7.3-18      RCurl_1.91-1     XML_3.9-4        splines_2.15.0
> [5] stats4_2.15.0    survival_2.36-14
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From dtenenba at fhcrc.org Fri Jun 1 17:39:53 2012
From: dtenenba at fhcrc.org (Dan Tenenbaum)
Date: Fri, 1 Jun 2012 08:39:53 -0700
Subject: [BioC] MacOS Package installation problems
In-Reply-To: <454E2EA6-4920-4422-AD0A-39857A99AD02@tntcmedia.us>
References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com> <7554A32B-75A3-4CE3-BB2F-36590D6775CF@tntcmedia.us> <454E2EA6-4920-4422-AD0A-39857A99AD02@tntcmedia.us>
Message-ID:

On Fri, Jun 1, 2012 at 8:35 AM, Chinedu Orekie wrote:
> Working from the drop down menu, I pointed to the CRAN mirror and on selecting "Install Packages" I get the error "Error in read.dcf(file = tmpf): Line starting '
> I see the same error when I select "Update Packages" as well.
>
> This bug is certainly not package specific since I never get to where I can request one.
>

I am not sure if this is a Bioconductor issue, but what happens if you try biocLite() as suggested earlier?

Dan

> Chinedu
> "TNTC: The need to connect" | P: (908) 514-TNTC
>
> On May 31, 2012, at 10:00 PM, Dan Tenenbaum wrote:
>
>> Hi,
>>
>>
>> On Thu, May 31, 2012 at 6:36 PM, Chinedu Orekie wrote:
>>> Dan:
>>>
>>> I get the same error at separate internet connections (at home and at work). Now these are locations that worked in the past. So I am doubtful that a change in net connection is the explanation.
>>>
>>> The sequence to seeing that error was: utils:::menuInstallPkgs() at the console and then on selecting a mirror site.
>>
>>
>> Which mirror site did you select? Did you try to install a package?
>> Which one? Was there any output besides the error message you sent?
>>
>>> Your thoughts.
>>
>> I don't know what could cause this.
>> The recommended way to install Bioconductor packages, and the only
>> method we support and can troubleshoot, is:
>>
>> source("http://bioconductor.org/biocLite.R")
>> biocLite("pkgName")
>>
>> Also, to rule out any problems with mirrors, you should run
>> chooseCRANmirror()
>> and
>> chooseBioCmirror()
>> and choose Switzerland
>> and
>> USA (WA 1)
>> respectively.
>>
>> Dan
>>
>>
>>>
>>> Chinedu
>>> "TNTC: The need to connect" | P: (908) 514-TNTC
>>>
>>> On May 31, 2012, at 6:57 PM, Dan Tenenbaum wrote:
>>>
>>>> Hi Chinedu,
>>>>
>>>> On Thu, May 31, 2012 at 7:48 AM, Chinedu Orekie wrote:
>>>>> I am having the same problem installing packages (Line starting ' PUBLI ...' is malformed). It has not mattered which repository I cite, CRAN or
>>>>> BioC. I first observed this yesterday, 30 May 2012. This all seems quite recent
>>>>> since I was able to access packages only two weeks ago. My computer is Windows
>>>>> based by the way.
>>>>>
>>>>> Could this be some generic systems error?
>>>>>
>>>>
>>>> This is the first I have heard of this happening on windows systems.
>>>> Can you send the command that caused the error, as well as the full
>>>> error code and the output of sessionInfo()?
>>>>
>>>> Thanks,
>>>> Dan
>>>>
>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>

From johnlinuxuser at yahoo.com Fri Jun 1 18:02:47 2012
From: johnlinuxuser at yahoo.com (John linux-user)
Date: Fri, 1 Jun 2012 09:02:47 -0700
Subject: [BioC] HUGO EXON_ID
In-Reply-To:
References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com> <7554A32B-75A3-4CE3-BB2F-36590D6775CF@tntcmedia.us> <454E2EA6-4920-4422-AD0A-39857A99AD02@tntcmedia.us>
Message-ID: <1338566567.13961.YahooMailNeo@web113616.mail.gq1.yahoo.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mccallm at gmail.com Fri Jun 1 18:19:14 2012
From: mccallm at gmail.com (Matthew McCall)
Date: Fri, 1 Jun 2012 12:19:14 -0400
Subject: [BioC] frmaTools: error with 'convertPlatform'
In-Reply-To:
References:
Message-ID:

Guido,

The frma and frmaTools packages use oligo (rather than AffyBatch) objects for the ST arrays, so what you're trying to do is a bit outside the intended functionality. I would also caution you against combining data from different platforms as probe behavior can change quite a bit.

That said, we can see whether there's some simple modification that could let you try out what you'd like. Can you figure out at what point in the convertPlatform function the error pops up?

Best,
Matt

On Fri, Jun 1, 2012 at 8:20 AM, Hooiveld, Guido wrote:
> Hi,
>
> I would like to use the function 'convertPlatform' (from the library
> frmaTools) to convert an Affybatch object from the MoGene ST v1.1 (GeneTitan
> array) format into that of the MoGene ST v1.0 format (cartridge array), but
> I run into an error.
> The reason that I would like to convert that Affybatch
> object is that I would like to combine 2 experiments performed on those 2
> platforms so I can normalize them together.
>
> In principle the content of the arrays is the same, that is the probeSETS
> should be identical, but the design and number of probes are different: the
> v1.0 array (cartridge) is square (1050cols x 1050rows) whereas the v1.1
> array is rectangular (990cols x 1190rows). I think this may be related to
> the error I experience. Note also that I would like to use a remapped CDF.
>
> Any suggestions?
>
> Thanks,
>
> Guido
>
>> affy.data <- ReadAffy(cdfname="mogene11stv1mmentrezg")
>> affy.data
> Loading required package: AnnotationDbi
>
> AffyBatch object
> size of arrays=1190x990 features (25 kb)
> cdf=mogene11stv1mmentrezg (21225 affyids)
> number of samples=23
> number of genes=21225
> annotation=mogene11stv1mmentrezg
> notes=
>> object.conv <- convertPlatform(affy.data, "mogene10stv1mmentrezg")
> Loading required package: mogene10stv1mmentrezgprobe
> Loading required package: mogene11stv1mmentrezgprobe
>
> Attaching package: 'mogene10stv1mmentrezgcdf'
>
> The following object(s) are masked from 'package:mogene11stv1mmentrezgcdf':
>
>     i2xy, xy2i
>
> Error in convertPlatform(affy.data, "mogene10stv1mmentrezg") :
>   subscript out of bounds
>>
>
> Some maybe relevant array characteristics:
>> library(affxparser)
>> GeneSTv1.0 <- readCelHeader("MouseTP_Brain_01_mGENE.CEL")
>> GeneSTv1.0
> $filename
> [1] "./MouseTP_Brain_01_mGENE.CEL"
>
> $version
> [1] 1
>
> $cols
> [1] 1050
>
> $rows
> [1] 1050
>
> $total
> [1] 1102500
>
> <>
>
>> GeneSTv1.1 <- readCelHeader("MouseBrain_1.CEL")
>> GeneSTv1.1
> $filename
> [1] "./MouseBrain_1.CEL"
>
> $version
> [1] 1
>
> $cols
> [1] 990
>
> $rows
> [1] 1190
>
> $total
> [1] 1178100
>
> <>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] frmaTools_1.8.0    affy_1.34.0        Biobase_2.16.0     BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.24.0         BiocInstaller_1.4.4   DBI_0.2-5
> [4] preprocessCore_1.18.0 zlibbioc_1.2.0
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email:      guido.hooiveld at wur.nl
> internet:
> http://nutrigene.4t.com
>
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
>
> http://www.researcherid.com/rid/F-4912-2010
>
>

--
Matthew N McCall, PhD
112 Arvine Heights
Rochester, NY 14611
Cell: 202-222-5880

From schoi at cornell.edu Fri Jun 1 20:34:53 2012
From: schoi at cornell.edu (Sang Chul Choi)
Date: Fri, 1 Jun 2012 18:34:53 +0000
Subject: [BioC] qrqc with error message of "unable to start device PNG" and "unable to open connection to X11 display"?
In-Reply-To: <4FC782C6.2040505@uw.edu>
References: <2BFEB327-AD6B-4DD8-9B46-58B86DB6D20A@cornell.edu> <4FC61EFD.2010203@uw.edu> <4FC782C6.2040505@uw.edu>
Message-ID: <7B59A467-C30C-414F-9AA2-11B01BCF094C@cornell.edu>

Thank you for the helpful message. I followed it, and it worked.

Thank you,

SangChul

On May 31, 2012, at 10:40 AM, James W. MacDonald wrote:

> Hi SangChul,
>
> On 5/31/2012 9:29 AM, Sang Chul Choi wrote:
>> Hi,
>>
>> Thank you for the helpful replies.
>>
>> The command line that causes the problem is:
>> makeReport(fq.file, outputDir=bwadir)
>> from qrqc package.
>>
>> I do not know what parts in makeReport of qrqc R package cause the problem. Appended are outputs of capabilities() run at the head and compute nodes. A session info at the compute is also following that.
>>
>> I do not have access to the head node to run makeReport command of qrqc because the command takes too much memory and would cause problems that could affect other users. So, I have to use a compute node. Clearly, the compute nodes capabilities() shows that png cannot be produced. Thank you for the command. It was helpful.
>>
>> I am not sure whether I have options. What I want to have from my raw RNA-seq data is to plot "per-base sequence quality," one that is shown at
>> http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
>>
>> I will appreciate if there are other ways.
>
> There are always other ways. This is R after all.
>
> First, please note two things.
> One, the makeReport() function is just a convenience function intended to create an HTML file. If you inspect the HTML template found in the /library/qrqc/extdata/fastq-report-template.html, you can see that the plot you are after is produced by qualPlot(). Two, you can assign the results of calling qualPlot() to a variable rather than plotting it outright.
>
> So you can always do this as a two step process; first do all the computationally intensive steps on the compute node, then create and save all the plots you want. You can then go to the head node where you have the correct capabilities and do the plotting there.
>
> As an example, using the data that come with the qrqc package:
>
> First create the plot you want, on the node.
>
> > s.fastq <- readSeqFile(system.file('extdata', 'test.fastq', package='qrqc'))
> > toplot <- qualPlot(s.fastq)
> > save(list = "toplot", file = "qualPlot.Rdata")
>
> You could do a whole bunch of different plots, and then save them all using
>
> > save(list = c("firstplot","secondplot","thirdplot"), file = "myplots.Rdata")
>
> Now move the qualPlot.Rdata file to the head node (or just read it in from the compute node).
>
> > load("qualPlot.Rdata")
> > library(ggplot2)
> > plot(toplot)
> > ggsave("./qualplot.png")
>
> Best,
>
> Jim
>
>>
>> Thank you,
>>
>> SangChul
>>
>> I also tried to install cairo. But, it has the following errors when I ran an install command:
>>> install.packages("Cairo")
>> ===================================================
>> checking for FreeType support in cairo... yes
>> checking whether FreeType needs additional flags... yes
>> checking whether pkg-config knows about fontconfig or freetype2... no
>> checking whether fontconfig/freetype2 location can be guessed... no
>> configure: error: Cannot find fontconfig/freetype2 although cairo claims to support it. Please check your cairo installation and/or update cairo if necessary or set CAIRO_CFLAGS/CAIRO_LIBS accordingly.
>> ERROR: configuration failed for package 'Cairo'
>>
>> At the head node:
>> ===================================================
>>> capabilities()
>>     jpeg      png     tiff    tcltk      X11     aqua http/ftp  sockets
>>     TRUE     TRUE     TRUE     TRUE     TRUE    FALSE     TRUE     TRUE
>>   libxml     fifo   cledit    iconv      NLS  profmem    cairo
>>     TRUE     TRUE     TRUE     TRUE     TRUE    FALSE    FALSE
>>
>> At the compute node:
>> ===================================================
>>> capabilities()
>>     jpeg      png     tiff    tcltk      X11     aqua http/ftp  sockets
>>    FALSE    FALSE    FALSE     TRUE    FALSE    FALSE     TRUE     TRUE
>>   libxml     fifo   cledit    iconv      NLS  profmem    cairo
>>     TRUE     TRUE     TRUE     TRUE     TRUE    FALSE    FALSE
>>
>>> sessionInfo()
>> R Under development (unstable) (2012-04-01 r58897)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>>  [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
>>  [7] LC_PAPER=C                     LC_NAME=C
>>  [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] qrqc_1.10.0         testthat_0.6        Rsamtools_1.8.5
>>  [4] GenomicRanges_1.8.6 xtable_1.7-0        brew_1.0-6
>>  [7] biovizBase_1.4.2    Biostrings_2.24.1   IRanges_1.14.3
>> [10] BiocGenerics_0.2.0  ggplot2_0.9.1       reshape_0.8.4
>> [13] plyr_1.7.1
>>
>> loaded via a namespace (and not attached):
>>  [1] AnnotationDbi_1.18.1  Biobase_2.16.0        biomaRt_2.12.0
>>  [4] bitops_1.0-4.1        BSgenome_1.24.0       cluster_1.14.2
>>  [7] colorspace_1.1-1      DBI_0.2-5             dichromat_1.2-4
>> [10] digest_0.5.2          evaluate_0.4.2        GenomicFeatures_1.8.1
>> [13] grid_2.16.0           Hmisc_3.9-3           labeling_0.1
>> [16] lattice_0.20-6        MASS_7.3-18           memoise_0.1
>> [19] munsell_0.3           proto_0.3-9.2         RColorBrewer_1.0-5
>> [22] RCurl_1.91-1          reshape2_1.2.1        RSQLite_0.11.1
>> [25] rtracklayer_1.16.1    scales_0.2.1          stats4_2.16.0
>> [28] stringr_0.6           tools_2.16.0          XML_3.9-4
>> [31] zlibbioc_1.2.0
>>
>>
>> On May 30, 2012, at
>> 12:06 PM, Dan Tenenbaum wrote:
>>
>>> Hi SangChul and Jim,
>>>
>>> On Wed, May 30, 2012 at 6:22 AM, James W. MacDonald wrote:
>>>> Hi SangChul,
>>>>
>>>>
>>>> On 5/29/2012 4:46 PM, Sang Chul Choi wrote:
>>>>> Hi,
>>>>>
>>>>> I am trying to use qrqc Bioc package in a linux machine used as a compute
>>>>> node. I have the following error:
>>>>>
>>>>> ====================
>>>>> Error in X11(paste("png::", filename, sep = ""), g$width, g$height,
>>>>> pointsize, :
>>>>>   unable to start device PNG
>>>>> In addition: Warning message:
>>>>> In grDevices::png(..., width = width, height = height, res = dpi, :
>>>>>   unable to open connection to X11 display ''
>>>>> ====================
>>>>>
>>>>> I have googled for an answer and bumped into following mailing list posts:
>>>>>
>>>>> https://stat.ethz.ch/pipermail/r-help/2008-March/155943.html
>>>>> https://stat.ethz.ch/pipermail/r-help/2008-February/155021.html
>>>>> https://stat.ethz.ch/pipermail/r-help/2008-February/155023.html
>>>>>
>>>>> It is not obvious what I should do. I will appreciate your answers.
>>>>
>>>> The simple answer is to run the code on the head rather than a compute node.
>>>> In my experience, compute nodes usually don't have 'GUI-type' software
>>>> installed (X11, png, etc), as it is not common for people to need that sort
>>>> of thing.
>>>>
>>> If Jim's suggestion to run the relevant computations on your Windows
>>> machine does not work out, you can go back on the Linux machine and
>>> start R. What is the output of the
>>> capabilities()
>>> command?
>>> And
>>> sessionInfo()
>>> for that matter?
>>>
>>> Can you tell what the exact command is that causes the error?
>>> Do you need your files to be in png format?
>>>
>>> If you install the Cairo package:
>>> biocLite("Cairo")
>>>
>>> Does qrqc then work?
>>> Is it practical to use bitmap() to generate your png?
>>>
>>> What Linux distribution are you on?
You may want to install Xvfb which >>> should solve the problem, though it can be sort of tricky, according >>> to this somewhat dated page for Ubuntu. >>> http://blog.martin-lyness.com/archives/installing-xvfb-on-ubuntu-9-10-karmic-koala >>> >>> Dan >>> >>> >>> >>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>>> Thank you, >>>>> >>>>> SangChul >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> -- >>>> James W. MacDonald, M.S. >>>> Biostatistician >>>> University of Washington >>>> Environmental and Occupational Health Sciences >>>> 4225 Roosevelt Way NE, # 100 >>>> Seattle WA 98105-6099 >>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > From mcarlson at fhcrc.org Fri Jun 1 21:18:56 2012 From: mcarlson at fhcrc.org (Marc Carlson) Date: Fri, 01 Jun 2012 12:18:56 -0700 Subject: [BioC] HUGO EXON_ID In-Reply-To: <1338566567.13961.YahooMailNeo@web113616.mail.gq1.yahoo.com> References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com> <7554A32B-75A3-4CE3-BB2F-36590D6775CF@tntcmedia.us> <454E2EA6-4920-4422-AD0A-39857A99AD02@tntcmedia.us> <1338566567.13961.YahooMailNeo@web113616.mail.gq1.yahoo.com> Message-ID: <4FC915A0.7040003@fhcrc.org> An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From Guido.Hooiveld at wur.nl Fri Jun 1 21:53:27 2012
From: Guido.Hooiveld at wur.nl (Hooiveld, Guido)
Date: Fri, 1 Jun 2012 19:53:27 +0000
Subject: [BioC] frmaTools: error with 'convertPlatform'
In-Reply-To:
References:
Message-ID:

Hi Matt,
Thanks for coming back on this.

First of all, I am fully aware that I am not using the preferred analysis route for Gene ST arrays (which indeed should go through e.g. oligo or XPS). But the possibilities of your function convertPlatform are so nice that I gave it a try with these arrays using the remapped CDFs (which AFAIK are valid CDFs; that is, they conform to all standards).

I decided to look at the source code of convertPlatform and execute it manually, step by step (since the code is not so long), checking the output of each line. By doing so I indeed identified the line where things go wrong. It happens at the second-to-last line of convertPlatform (i.e. exprs2[index,] <- exprs(object)[pmIndex,]).

# 1st rename object according to 'nomenclature' used when function convertPlatform is defined
# convertPlatform <- function(object, new.platform){........

> object <- affy.data
> new.platform <- "mogene10stv1mmentrezg"
> cleancdfname(cdfName(object))
[1] "mogene11stv1mmentrezgcdf"
> cdfname <- cleancdfname(cdfName(object))
> old.platform <- gsub("cdf","",cdfname)
> old.platform
[1] "mogene11stv1mmentrezg"
> map <- makeMaps(new.platform, old.platform)
> head(map)
     mogene10stv1mmentrezg mogene11stv1mmentrezg
[1,]                831891                213206
[2,]                237305                 15731
[3,]                 14720                511115
[4,]                615715                549916
[5,]                362313               1064843
[6,]               1080675                271008
> tmp <- new("AffyBatch", cdfName=new.platform)
> tmp
AffyBatch object
size of arrays=0x0 features (15 kb)
cdf=mogene10stv1mmentrezg (21225 affyids)
number of samples=0
number of genes=21225
annotation=
> pns <- probeNames(tmp)
> head(pns)
[1] "100008567_at" "100008567_at" "100008567_at" "100008567_at" "100008567_at"
[6] "100008567_at"

# check whether this identical output also occurs when 'real' Affybatch object (i.e. affy.data) is used as input
> head(probeNames(affy.data))
[1] "100008567_at" "100008567_at" "100008567_at" "100008567_at" "100008567_at"
[6] "100008567_at"
# yes, same output

> index <- unlist(pmindex(tmp))
> head(index)
100008567_at1 100008567_at2 100008567_at3 100008567_at4 100008567_at5
       831891        237305         14720        615715        362313
100008567_at6
      1080675
> mIndex <- match(index,map[,1])
> head(mIndex)
[1] 1 2 3 4 5 6
> pmIndex <- map[mIndex,2]
> head(pmIndex)
[1]  213206   15731  511115  549916 1064843  271008
> paste(new.platform,"cdf",sep="")
[1] "mogene10stv1mmentrezgcdf"
> env <- get(paste(new.platform,"dim",sep=""))

# check which environment is defined
> paste(new.platform,"dim",sep="")
[1] "mogene10stv1mmentrezgdim"
#

> nc <- env$NCOL
> head(nc)
[1] 1050
> nr <- env$NROW
> head(nr)
[1] 1050
> exprs2 <- matrix(nrow=nc*nr, ncol=length(object))
> dim(exprs2)
[1] 1102500      23
# Note, nr and nc are indeed the dimensions of the v1.0 (cartridge) array, as is the number of probes. See my first email.

> exprs2[index,] <- exprs(object)[pmIndex,]
Error: subscript out of bounds
>
^^^ here it goes wrong.
I *think* this is related to the fact that the v1.1 array (GeneTitan) is rectangular...
Compare the dimensions of the newly created v1.0 expression matrix:
> dim(exprs2)
[1] 1102500      23
with those of the input v1.1 expression matrix:
> dim(exprs(object))
[1] 1178100      23
>
The number of arrays matches, but the number of probes does not...

To me it naively looks as if some probes of the v1.1 array that do not match, or are not present on, the v1.0 array have to be deleted...??

Thanks again for looking into this,
Guido

BTW: if needed I can send you some CEL files from both platforms.

-----Original Message-----
From: Matthew McCall [mailto:mccallm at gmail.com]
Sent: Friday, June 01, 2012 18:19
To: Hooiveld, Guido
Cc: bioconductor (bioconductor at stat.math.ethz.ch)
Subject: Re: frmaTools: error with 'convertPlatform'

Guido,

The frma and frmaTools packages use oligo (rather than AffyBatch) objects for the ST arrays, so what you're trying to do is a bit outside the intended functionality. I would also caution you against combining data from different platforms as probe behavior can change quite a bit.

That said, we can see whether there's some simple modification that could let you try out what you'd like. Can you figure out at what point in the convertPlatform function the error pops up?

Best,
Matt

On Fri, Jun 1, 2012 at 8:20 AM, Hooiveld, Guido wrote:
> Hi,
>
> I would like to use the function 'convertPlatform' (from the library
> frmaTools) to convert an Affybatch object from the MoGene ST v1.1
> (GeneTitan array) format into that of the MoGene ST v1.0 format
> (cartridge array), but I run into an error. The reason that I would
> like to convert that Affybatch object is that I would like to combine
> 2 experiments performed on those 2 platforms so I can normalize them
> together.
>
> In principle the content of the arrays is the same, that is the
> probeSETS should be identical, but the design and number of probes are
> different: the v1.0 array (cartridge) is square (1050cols x 1050rows)
> whereas the v1.1 array is rectangular (990cols x 1190rows). I think
> this may be related to the error I experience. Note also that I would
> like to use a remapped CDF.
>
> Any suggestions?
>
> Thanks,
> Guido
>
> > affy.data <- ReadAffy(cdfname="mogene11stv1mmentrezg")
> > affy.data
> Loading required package: AnnotationDbi
>
> AffyBatch object
> size of arrays=1190x990 features (25 kb)
> cdf=mogene11stv1mmentrezg (21225 affyids)
> number of samples=23
> number of genes=21225
> annotation=mogene11stv1mmentrezg
> notes=
> > object.conv <- convertPlatform(affy.data, "mogene10stv1mmentrezg")
> Loading required package: mogene10stv1mmentrezgprobe
> Loading required package: mogene11stv1mmentrezgprobe
>
> Attaching package: 'mogene10stv1mmentrezgcdf'
>
> The following object(s) are masked from 'package:mogene11stv1mmentrezgcdf':
>
>     i2xy, xy2i
>
> Error in convertPlatform(affy.data, "mogene10stv1mmentrezg") :
>   subscript out of bounds
> >
>
> Some maybe relevant array characteristics:
> > library(affxparser)
> > GeneSTv1.0 <- readCelHeader("MouseTP_Brain_01_mGENE.CEL")
> > GeneSTv1.0
> $filename
> [1] "./MouseTP_Brain_01_mGENE.CEL"
>
> $version
> [1] 1
>
> $cols
> [1] 1050
>
> $rows
> [1] 1050
>
> $total
> [1] 1102500
>
> <>
>
> > GeneSTv1.1 <- readCelHeader("MouseBrain_1.CEL")
> > GeneSTv1.1
> $filename
> [1] "./MouseBrain_1.CEL"
>
> $version
> [1] 1
>
> $cols
> [1] 990
>
> $rows
> [1] 1190
>
> $total
> [1] 1178100
>
> <>
>
> > sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8
LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] frmaTools_1.8.0    affy_1.34.0        Biobase_2.16.0     BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.24.0         BiocInstaller_1.4.4   DBI_0.2-5
> [4] preprocessCore_1.18.0 zlibbioc_1.2.0
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism & Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email:      guido.hooiveld at wur.nl
> internet:   http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010

--
Matthew N McCall, PhD
112 Arvine Heights
Rochester, NY 14611
Cell: 202-222-5880

From mccallm at gmail.com Fri Jun 1 22:04:02 2012
From: mccallm at gmail.com (Matthew McCall)
Date: Fri, 1 Jun 2012 16:04:02 -0400
Subject: [BioC] frmaTools: error with 'convertPlatform'
In-Reply-To:
References:
Message-ID:

Guido,

Thanks for the line by line results. Can you send me the map object -- the result of: map <- makeMaps(new.platform, old.platform)?

Best,
Matt

On Fri, Jun 1, 2012 at 3:53 PM, Hooiveld, Guido wrote:
> Hi Matt,
> Thanks for coming back on this.
>
> [...]

--
Matthew N McCall, PhD
112 Arvine Heights
Rochester, NY 14611
Cell: 202-222-5880

From schoi at cornell.edu Fri Jun 1 22:23:29 2012
From: schoi at cornell.edu (Sang Chul Choi)
Date: Fri, 1 Jun 2012 20:23:29 +0000
Subject: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file.
Message-ID: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu>

Hi,

I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length.
It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. Thank you, SangChul From luciap at iscb.org Fri Jun 1 22:26:44 2012 From: luciap at iscb.org (Lucia Peixoto) Date: Fri, 1 Jun 2012 16:26:44 -0400 Subject: [BioC] samtools error In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From vsbuffalo at gmail.com Fri Jun 1 22:55:53 2012 From: vsbuffalo at gmail.com (Vince Buffalo) Date: Fri, 1 Jun 2012 13:55:53 -0700 Subject: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file. In-Reply-To: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu> References: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu> Message-ID: <3B4C55BA-D655-444A-91F3-F4107198E1EC@gmail.com> Hi SangChul, By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. Best, Vince Sent from my iPhone On Jun 1, 2012, at 1:23 PM, Sang Chul Choi wrote: > Hi, > > I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. 
It seems to fall apart when short reads are of different size. > > I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. > > Thank you, > > SangChul > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From johnlinuxuser at yahoo.com Fri Jun 1 23:37:40 2012 From: johnlinuxuser at yahoo.com (John linux-user) Date: Fri, 1 Jun 2012 14:37:40 -0700 (PDT) Subject: [BioC] HUGO EXON_ID In-Reply-To: <4FC915A0.7040003@fhcrc.org> References: <089B8CC2D95DD5498EB7CD66289A668F1C5167@pbrcas30.pbrc.edu> <11308_1338395632_4FC64BF0_11308_13916_1_CANeAVBngMQ-qo3XA670uLeMs2UPQmXwiLEv1_oOdomJkfppcSw@mail.gmail.com> <7554A32B-75A3-4CE3-BB2F-36590D6775CF@tntcmedia.us> <454E2EA6-4920-4422-AD0A-39857A99AD02@tntcmedia.us> <1338566567.13961.YahooMailNeo@web113616.mail.gq1.yahoo.com> <4FC915A0.7040003@fhcrc.org> Message-ID: <1338586660.74632.YahooMailNeo@web113618.mail.gq1.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From ysapkota at ualberta.ca Fri Jun 1 23:50:44 2012 From: ysapkota at ualberta.ca (Yadav Sapkota) Date: Fri, 1 Jun 2012 15:50:44 -0600 Subject: [BioC] Calculate heterozygosity % using SNP genotype data Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From smyth at wehi.EDU.AU Sat Jun 2 02:41:10 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Sat, 2 Jun 2012 10:41:10 +1000 (AUS Eastern Standard Time) Subject: [BioC] Is it possible to read in Bluefuse and Agilent files together using read.maimages() function in limma? 
In-Reply-To: <12B90BF6-E4EE-4BA6-9103-BDAD63EB669E@icr.ac.uk> References: <12B90BF6-E4EE-4BA6-9103-BDAD63EB669E@icr.ac.uk> Message-ID: Dear Parisa, I've never tried to normalize together intensity data from different image analysis programs. Even if the microarray platforms were the same in both cases, I would view that as a risky procedure. If you were to attempt it, you would need to attend very carefully to batch correction between the platforms as part of the downstream analysis. If you are processing Agilent data, please follow the case study in the limma User's Guide that deals with single-channel Agilent data. It is Section 11.8 titled "Agilent Single-Channel Data: Gene expression in thymus from female Wistar rats". Best wishes Gordon On Fri, 1 Jun 2012, Parisa Razaz wrote: > Hi, > > Thanks for getting back to me. > > If I read the two file types in separately, would it be possible to then > normalise and analyse the data together (after merging)? I am hoping to > follow a protocol similar to that outlined here: > http://matticklab.com/index.php?title=Single_channel_analysis_of_Agilent_microarray_data_with_Limma > > Thanks, > > Parisa > > > On 1 Jun 2012, at 11:19, Gordon K Smyth wrote: > > No, it can't combine any two types. Read them in instead using separate > calls to read.maimages(). > > Gordon > > On Fri, 1 Jun 2012, Parisa Razaz wrote: > > Hi, > > Is it possible to read in both Bluefuse and Agilent files together using the read.maimages() function in limma? 
>
> Thanks,
> Parisa
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}

From guest at bioconductor.org Sat Jun 2 03:02:20 2012
From: guest at bioconductor.org (Jorge Miró [guest])
Date: Fri, 1 Jun 2012 18:02:20 -0700 (PDT)
Subject: [BioC] XPS package working with Affymetrix GeneChip 1.0 ST at gene level
Message-ID: <20120602010220.A8769133D08@mamba.fhcrc.org>

I was wondering if there is some way of getting the XPS package working at gene level, as I need to get the gene expression from some Rat Gene chips (RaGene 1.0 ST r4) that I will analyze. I tried to use the Affy package before, but as far as I understand it needs a .CDF file to work and I only have CLF and PBG files for my chips.

Kind regards
Jorge

 -- output of sessionInfo():

-

--
Sent via the guest posting facility at bioconductor.org.

From mccallm at gmail.com Sat Jun 2 05:25:57 2012
From: mccallm at gmail.com (Matthew McCall)
Date: Fri, 1 Jun 2012 23:25:57 -0400
Subject: [BioC] frmaTools: error with 'convertPlatform'
In-Reply-To:
References:
Message-ID:

Guido,

Well, I've found the problem, but I'm not sure exactly what the solution is. The issue is that multiple probes on 1.0 are mapping to the same probe on 1.1:

> sum(duplicated(map[,1]))
[1] 0
> sum(duplicated(map[,2]))
[1] 1749

I think this may be a feature of the alternative CDF, but I'm not positive (perhaps someone else can weigh in on whether this is the case). But that is what is "breaking" the platform conversion. Sorry I couldn't be of more help.
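
For anyone who wants to repeat this kind of check on their own map, here is a minimal base-R sketch run on a made-up toy matrix. The values, the column names and the helper name check_map are assumptions for illustration and are not part of frmaTools; the only thing taken from the thread is that makeMaps() returns a two-column matrix with the new platform in column 1 and the old platform in column 2. The sketch only counts duplicated and out-of-range indices, and does not claim to show which of these triggers the error inside convertPlatform.

```r
# Toy stand-in for the matrix returned by makeMaps():
# column 1 = probe index on the new platform, column 2 = index on the old one.
# All values are invented for illustration.
map <- cbind(new = c(11L, 12L, 13L, 14L, 15L),
             old = c(101L, 102L, 102L, 104L, 102L))

# Hypothetical helper (not in frmaTools): summarise how one-to-one a map is.
check_map <- function(map, n_old) {
  reused <- duplicated(map[, 2])
  list(
    dup_new   = sum(duplicated(map[, 1])),  # new-platform probes listed more than once
    dup_old   = sum(reused),                # extra hits on an already-used old probe
    shared    = unique(map[reused, 2]),     # the old-platform probes that are reused
    # old-platform indices that fall outside a matrix with n_old rows
    out_range = sum(is.na(map[, 2]) | map[, 2] < 1 | map[, 2] > n_old)
  )
}

res <- check_map(map, n_old = 200L)
res$dup_new    # 0
res$dup_old    # 2
res$shared     # 102
res$out_range  # 0
```

On real data, n_old would be the number of rows of exprs(object), and res$shared would list the v1.1 probes that more than one v1.0 probe points to.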
Best,
Matt

On Fri, Jun 1, 2012 at 6:04 PM, Hooiveld, Guido wrote:
> Hi,
> I uploaded it here:
> https://sendit.wur.nl/Download.aspx?id=cb769829-a7e5-4f7f-9311-290df518ce5d
>
> Guido
>
> -----Original Message-----
> From: Matthew McCall [mailto:mccallm at gmail.com]
> Sent: Friday, June 01, 2012 22:04
> To: Hooiveld, Guido
> Cc: bioconductor (bioconductor at stat.math.ethz.ch)
> Subject: Re: frmaTools: error with 'convertPlatform'
>
> Guido,
>
> Thanks for the line by line results. Can you send me the map object -- the result of: map <- makeMaps(new.platform, old.platform)?
>
> Best,
> Matt
>
> On Fri, Jun 1, 2012 at 3:53 PM, Hooiveld, Guido wrote:
>> Hi Matt,
>> Thanks for coming back on this.
>>
>> [...]
DBI_0.2-5
>>> [4] preprocessCore_1.18.0 zlibbioc_1.2.0
>>>
>>> ---------------------------------------------------------
>>> Guido Hooiveld, PhD
>>> Nutrition, Metabolism & Genomics Group
>>> Division of Human Nutrition
>>> Wageningen University
>>> Biotechnion, Bomenweg 2
>>> NL-6703 HD Wageningen
>>> the Netherlands
>>> tel: (+)31 317 485788
>>> fax: (+)31 317 483342
>>> email: guido.hooiveld at wur.nl
>>> internet: http://nutrigene.4t.com
>>> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
>>> http://www.researcherid.com/rid/F-4912-2010
>>>
>>
>> --
>> Matthew N McCall, PhD
>> 112 Arvine Heights
>> Rochester, NY 14611
>> Cell: 202-222-5880
>>
>
> --
> Matthew N McCall, PhD
> 112 Arvine Heights
> Rochester, NY 14611
> Cell: 202-222-5880
>

--
Matthew N McCall, PhD
112 Arvine Heights
Rochester, NY 14611
Cell: 202-222-5880

From smyth at wehi.EDU.AU Sat Jun 2 09:24:17 2012
From: smyth at wehi.EDU.AU (Gordon K Smyth)
Date: Sat, 2 Jun 2012 17:24:17 +1000 (AUS Eastern Standard Time)
Subject: [BioC] voom() vs. RPKM/FPKM or otherwise normalized counts, and GC correction, when fitting models to a small number of responses (per-feature counts)
In-Reply-To:
References:
Message-ID:

Hi Kasper,

I look forward to benefiting from your work on sample-specific GC and gene length effects. As you know, we cited the article you mention in our recent edgeR glm paper (McCarthy et al NAR 2012), and edgeR works nicely with the normalization matrix from your cqn package. When I get a chance, I'll rewrite the section on normalization in the edgeR User's Guide to mention these possibilities. Until now, I hadn't realized that you were finding sample-specific effects for length as well as for GC content.

voom() isn't currently set up to accept the normalization matrix from cqn, but it will be down the track.

I don't understand your remarks on mean-variance relationships though.
For one thing, the normalization matrix from cqn affects the assumed link between expression levels and count size in edgeR. It doesn't have any effect on the form of the mean-variance relationship. Secondly, edgeR has never assumed that the mean perfectly predicts variability. edgeR uses the mean-variance relationship as a guide towards which to moderate the genewise variances, but the fitted variances remain gene-specific and are not a monotonic function of the mean. DESeq (as described in the Genome Biol paper) did assume that the variance is a perfect function of the mean, but edgeR has never had this limitation.

Best wishes
Gordon

On Thu, 31 May 2012, Kasper Daniel Hansen wrote:

> On Wed, May 30, 2012 at 7:11 PM, Tim Triche, Jr. wrote:
>> Hi Dr. Smyth,
>>
>> Thank you for the helpful clarifications. It seems like RPM/CPM is
>> useful for tasks such as plotting expression on a reasonably similar scale;
>> taking logs and adjusting for mean-variance relationships can better
>> satisfy expected mean-variance relationships for linear modeling and thus
>> should dovetail better with the toolset in limma, offering a less
>> computationally demanding alternative for exploratory analysis. On the
>> other hand, if the primary goal is to detect differences, especially in
>> rare or highly variably expressed features, an edgeR GLM with empirical
>> Bayes estimates of the feature-wise dispersion is the most appropriate tool
>> to maximize statistical power.
>>
>> Is this understanding reasonable? It would seem that, whether I use
>> limma or rig up some sort of weighting for (e.g.) sparsenet, the output
>> from voom() is most likely to be useful for my particular (EDA) needs
>> at the moment.
>>
>> One last question (for anyone who wishes to answer, really) -- if
>> gene/transcript length is not associated with the mean/variance
>> relationship for read counts, why was it asserted in the original
>> Mortazavi paper that:
>>
>> The sensitivity of RNA-Seq will be a function of both molar
>> concentration and transcript length [nb: no citation given, presumably
>> this is felt to be self-evident?]. We therefore quantified transcript
>> levels in reads per kilobase of exon model per million mapped reads.
>>
>> It seems as if this is a red herring? GC% could clearly affect the
>> degree to which a transcript "absorbs" read depth, but I continue to
>> have difficulty understanding why the length of exon model is relevant
>> in this context.
>
> While the Mortazavi paper is a very good paper on RNA-seq, this section
> is not their best.
>
> Because RNA is fragmented, there will be a relationship between read
> counts (number of reads mapped to a gene model) and gene length. This
> is indisputable. The question is whether this is something we want to
> include in our model, beyond the fact that longer genes have more counts
> and therefore a bigger mean (and since higher mean leads to lower
> variance, this is probably what Mortazavi meant here).
>
> RPKM tries to make an expression measure that is comparable between
> genes inside a single sample. This is for example necessary for making
> the "titration" curve in the Mortazavi paper showing a nice relationship
> between actual concentration and RPKM, since each of the points on the
> curve is a different gene. Note that that plot has nothing to do with
> differential expression, but rather absolute quantification.
>
> In a typical gene expression analysis we are
> (1) not interested in comparing genes, only samples within a fixed gene.
> (2) interested in relative changes, not absolute measurements
>
> This is really not something Mortazavi discusses.
>
> edgeR and DESeq try to get at differential expression. And they
> essentially use the fact that there is a mean-variance relationship to
> improve their modeling. Now, it is clear (I would argue) that mean
> does not in any way perfectly predict variability, so it is entirely
> possible that a better method may come along and improve on what we
> have. But such a method would first have to prove itself.
>
> Now, as I said above, gene length affects read counts through
> fragmentation. In case fragmentation varies between samples, there
> may be a problem. Same with GC content. We recently showed [1] that
> GC content, and to a lesser extent gene length, can have a sample
> specific effect. If that is the case, you need to account for that.
> But that is because the effect is sample specific.
>
> 1. Removing technical variability in RNA-seq data using conditional
> quantile normalization. Biostatistics 13, 204-216 (2012).
>
> Kasper
>
>>
>> Thank you so much for your time and effort in explaining these rather
>> subtle issues.
>>
>> Tim Triche, Jr.
>> USC Biostatistics
>>
>> On Wed, May 30, 2012 at 1:23 AM, Gordon K Smyth wrote:
>>
>>> Hi Tim,
>>>
>>> On thinking about this a little more, voom() could easily output logRPKM
>>> rather than logCPM, and the same weights would apply. Indeed you could
>>> convert the voom() output to logRPKM yourself and, in principle, undertake
>>> analyses using the values if you make use of the corresponding voom weights.
>>>
>>> However voom() does need to get raw counts as input, just like edgeR,
>>> rather than RPKM. voom() can cope with a re-scaling of the counts, but not
>>> with a transformation that is non-monotonic in the counts. RPKM is an
>>> unhelpful measure from a statistical point of view, because it "forgets"
>>> how large the count was in the first place.
>>>
>>> The aims of Yuval's package are complementary to edgeR or voom, certainly
>>> neither replaces the other.
>>> These results may inform how we do the
>>> normalization step, but we have not yet reached the stage of doing this
>>> routinely.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>> On Fri, 25 May 2012, Gordon K Smyth wrote:
>>>
>>>> Dear Tim,
>>>>
>>>> I don't follow what you are trying to do scientifically, and this makes
>>>> all the difference when deciding what are the appropriate tools to use.
>>>>
>>>> If you are undertaking some sort of analysis that requires absolute gene
>>>> (or feature) expression levels as responses, then you should not be using
>>>> voom or limma or edgeR. limma and edgeR do not estimate absolute
>>>> expression.
>>>>
>>>> If on the other hand, you want to detect differentially expressed genes
>>>> (or features), which is what voom does, then there is no need to correct
>>>> for gene length. The comments of Section 2.3 of the edgeR User's Guide and
>>>> especially 2.3.2 "Adjustments for gene length, GC content, mappability and
>>>> so on" are also relevant for voom. There is no need to correct for any
>>>> characteristic of a gene that remains unchanged across samples.
>>>>
>>>> A good case has been made that GC content can have differential influence
>>>> across samples, but that doesn't apply to gene length.
>>>>
>>>> voom does not work on RPKM or FPKM, or on the output from cufflinks. voom
>>>> estimates a mean-variance relationship, and the variance is a function of
>>>> count size, not of expression level.
>>>>
>>>> Yes, you need limma to use the output from voom, because other software
>>>> does not generally have the ability to use quantitative weights. If you
>>>> ignore the weights, then the output from voom is just logCPM, and you
>>>> hardly need voom to compute that.
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>> ---------------------------------------------
>>>> Professor Gordon K Smyth,
>>>> Bioinformatics Division,
>>>> Walter and Eliza Hall Institute of Medical Research,
>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>> smyth at wehi.edu.au
>>>> http://www.wehi.edu.au
>>>> http://www.statsci.org/smyth
>>>>
>>>> On Thu, 24 May 2012, Tim Triche, Jr. wrote:
>>>>
>>>>> Hi Dr. Smyth and Dr. Law,
>>>>>
>>>>> I have been reading the documentation for limma::voom() and trying to
>>>>> understand why there seems to be no correction for the size of the
>>>>> feature in the model:
>>>>>
>>>>> In an experiment, a count value is observed for each tag in each sample.
>>>>> A tag-wise mean-variance trend is computed using lowess. The tag-wise
>>>>> mean is the mean log2 count with an offset of 0.5, across samples for a
>>>>> given tag. The tag-wise variance is the quarter-root-variance of
>>>>> normalized log2 counts per million values with an offset of 0.5, across
>>>>> samples for a given tag. Tags with zero counts across all samples are
>>>>> not included in the lowess fit. Optional normalization is performed
>>>>> using normalizeBetweenArrays. Using fitted values of log2 counts from a
>>>>> linear model fit by lmFit, variances from the mean-variance trend were
>>>>> interpolated for each observation. This was carried out by approxfun.
>>>>> Inverse variance weights can be used to correct for mean-variance trend
>>>>> in the count data.
>>>>>
>>>>>
>>>>> I don't see a reference to the feature size in all of this. (?) Am I
>>>>> missing something? Probably something major (like, say, the relationship
>>>>> of GC content or read length to variance)...
>>>>> Is the idea that features with similar sequence properties/size and
>>>>> abundance will have their mean-variance relationship modeled
>>>>> appropriately and weights generated empirically?
>>>>>
>>>>> For comparison, what I have been doing (in lieu of knowing any better) is
>>>>> as follows: align with Rsubread, run subjunc and splicegrapher, and count
>>>>> against exon/gene/feature models:
>>>>>
>>>>> alignedToRPKM <- function(readcounts) { # the output of featureCounts()
>>>>>   millionsMapped <- colSums(readcounts$counts)/1000000
>>>>>   if('ExonLength' %in% names(readcounts$annotation)) {
>>>>>     geneLengthsInKB <- readcounts$annotation$ExonLength/1000
>>>>>   } else {
>>>>>     # works fine for ncRNA and splice graph edges
>>>>>     geneLengthsInKB <- readcounts$annotation$GeneLength/1000
>>>>>   }
>>>>>
>>>>>   # example usage: readcounts$RPKM <- alignedToRPKM(readcounts)
>>>>>   return( sweep(readcounts$counts, 2, millionsMapped, '/') / geneLengthsInKB )
>>>>> }
>>>>>
>>>>> (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got
>>>>> about the same results but slower, so I stuck with Rsubread. And
>>>>> featureCounts() is really handy.)
>>>>>
>>>>> So, given the feature sizes in readcounts$annotation I can at least put
>>>>> things on something like a similar scale. Most of my modeling currently
>>>>> is focused on penalized local regressions and thus a performant (but
>>>>> accurate) measure that can be used for linear modeling on a large scale
>>>>> is desirable.
>>>>> Is the output of voom() what I want? Does one need to use limma/lmFit()
>>>>> to make use of voom()'s output?
>>>>>
>>>>> Last but not least, should I use something like Yuval Benjamini's
>>>>> GCcorrect package
>>>>> (http://www.stat.berkeley.edu/~yuvalb/YuvalWeb/Software.html)
>>>>> before/during/instead of voom()?
>>>>> And if the expression of a feature or several nearby features is often
>>>>> the response, does it matter a great deal what I use?
>>>>>
>>>>> Thanks for any input you might have time to provide.
>>>>> I have to assume that the minds at WEHI periodically scheme together
>>>>> how best to go about these things...
>>>>>
>>>>>
>>>>> --
>>>>> *A model is a lie that helps you see the truth.*
>>>>>
>>>>> Howard Skipper
>>>>>
>>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and inte...{{dropped:18}}
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}

From harryzs1981 at gmail.com Sat Jun 2 09:25:05 2012
From: harryzs1981 at gmail.com (sheng zhao)
Date: Sat, 2 Jun 2012 09:25:05 +0200
Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From jinkeanlim at gmail.com Sat Jun 2 10:53:09 2012
From: jinkeanlim at gmail.com (KJ Lim)
Date: Sat, 2 Jun 2012 11:53:09 +0300
Subject: [BioC] edgeR: topTags
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From beniltoncarvalho at gmail.com Sat Jun 2 12:19:52 2012
From: beniltoncarvalho at gmail.com (Benilton Carvalho)
Date: Sat, 2 Jun 2012 11:19:52 +0100
Subject: [BioC] XPS package working with Affymetrix GeneChip 1.0 ST at gene level
In-Reply-To: <20120602010220.A8769133D08@mamba.fhcrc.org>
References: <20120602010220.A8769133D08@mamba.fhcrc.org>
Message-ID:

Another alternative, given that the affy package is not designed for this, is the oligo package....

b

On 2 June 2012 02:02, Jorge Mir?
[guest] wrote: > > I was wondering if there is some way of getting the XPS package working at gene level as I need to get the gene expression from some Rat Gene chips (RaGene 1.0 ST r4)that I will analyze. > > I tried to use the Affy package before but as far as I understand they need .CDF file to get working and I only have CLF and PBG files for my chips. > > Kind regards > Jorge > > ?-- output of sessionInfo(): > > - > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From alyamahmoud at gmail.com Sat Jun 2 14:57:46 2012 From: alyamahmoud at gmail.com (Alyaa Mahmoud) Date: Sat, 2 Jun 2012 15:57:46 +0300 Subject: [BioC] error in hclust function Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Sat Jun 2 15:08:24 2012 From: guest at bioconductor.org (Vedran Franke [guest]) Date: Sat, 2 Jun 2012 06:08:24 -0700 (PDT) Subject: [BioC] readGappedAlignmentPairs with multimapping reads Message-ID: <20120602130824.BABCA13445B@mamba.fhcrc.org> How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? Thank you in advance! 
-- output of sessionInfo():

R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
 [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
 [7] LC_PAPER=C                     LC_NAME=C
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0
[4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4

loaded via a namespace (and not attached):
[1] stats4_2.15.0 tools_2.15.0

--
Sent via the guest posting facility at bioconductor.org.

From james.reid at ifom-ieo-campus.it Sat Jun 2 16:25:29 2012
From: james.reid at ifom-ieo-campus.it (James F. Reid)
Date: Sat, 02 Jun 2012 15:25:29 +0100
Subject: [BioC] error in hclust function
In-Reply-To:
References:
Message-ID: <4FCA2259.5030501@ifom-ieo-campus.it>

Hi Alyaa,

you probably have missing values (see 'NA/NaN/Inf in foreign function call (arg 11)') in your bb matrix.

HTH.
J.

On 02/06/12 13:57, Alyaa Mahmoud wrote:
> Hi All
>
> I am trying to cluster 57 COGs in 24 datasets. I use the following code and
> run into this error:
>
> hc = NULL
> hc <- hclust(as.dist(1-cor(as.matrix(bb), method="spearman")),
> method="complete", members=NULL)
>
> Error in hclust(as.dist(1 - cor(as.matrix(bb), method = "spearman")), :
> NA/NaN/Inf in foreign function call (arg 11)
> In addition: Warning message:
> In cor(as.matrix(bb), method = "spearman") : the standard deviation is zero
>
> hr = NULL
> hr <- hclust(as.dist(1-cor(t(as.matrix(bb)), method="spearman")),
> method="complete", members=NULL)
>
> I tried to remove any rows that have sd of zero but there was none;
> ind <- apply(bb, 1, var) == 0
> subset <- bb[!ind,]
>
> or
>
> ind <- apply(bb, 1, sd) == 0
> subset <- bb[!ind,]
>
>
> any clue what could the problem be ?
>
> Thanks a lot for your help
> yours,
> Alyaa

From thomas.girke at ucr.edu Sat Jun 2 17:14:40 2012
From: thomas.girke at ucr.edu (Thomas Girke)
Date: Sat, 2 Jun 2012 08:14:40 -0700
Subject: [BioC] error in hclust function
In-Reply-To:
References:
Message-ID: <20120602151440.GA460@Thomas-Girkes-MacBook-Pro.local>

You probably forgot to remove the zero variance columns in your matrix. In the step where you are observing the error you are clustering the columns of bb not its rows, since the cor() function operates on the columns of a matrix not its rows. Running things stepwise might help to pinpoint the problem:

## Sample data
bb <- matrix(1:5, 5, 5, dimnames=list(paste("g", 1:5, sep=""), paste("t", 1:5, sep="")), byrow=TRUE)
> bb
   t1 t2 t3 t4 t5
g1  1  2  3  4  5
g2  1  2  3  4  5
g3  1  2  3  4  5
g4  1  2  3  4  5
g5  1  2  3  4  5

## cor() without t()
> cor(bb)
   t1 t2 t3 t4 t5
t1  1 NA NA NA NA
t2 NA  1 NA NA NA
t3 NA NA  1 NA NA
t4 NA NA NA  1 NA
t5 NA NA NA NA  1
Warning message:
In cor(bb) : the standard deviation is zero

## cor() with t()
> cor(t(bb))
   g1 g2 g3 g4 g5
g1  1  1  1  1  1
g2  1  1  1  1  1
g3  1  1  1  1  1
g4  1  1  1  1  1
g5  1  1  1  1  1

## hclust without t()
> hc <- hclust(as.dist(1-cor(bb)))
Error in hclust(as.dist(1 - cor(bb))) :
  NA/NaN/Inf in foreign function call (arg 11)
In addition: Warning message:
In cor(bb) : the standard deviation is zero

## hclust with t()
hc <- hclust(as.dist(1-cor(t(bb))))

Thomas

On Sat, Jun 02, 2012 at 12:57:46PM +0000, Alyaa Mahmoud wrote:
> Hi All
>
> I am trying to cluster 57 COGs in 24 datasets.
> I use the following code and
> run into this error:
>
> hc = NULL
> hc <- hclust(as.dist(1-cor(as.matrix(bb), method="spearman")),
> method="complete", members=NULL)
>
> Error in hclust(as.dist(1 - cor(as.matrix(bb), method = "spearman")), :
> NA/NaN/Inf in foreign function call (arg 11)
> In addition: Warning message:
> In cor(as.matrix(bb), method = "spearman") : the standard deviation is zero
>
> hr = NULL
> hr <- hclust(as.dist(1-cor(t(as.matrix(bb)), method="spearman")),
> method="complete", members=NULL)
>
> I tried to remove any rows that have sd of zero but there was none;
> ind <- apply(bb, 1, var) == 0
> subset <- bb[!ind,]
>
> or
>
> ind <- apply(bb, 1, sd) == 0
> subset <- bb[!ind,]
>
>
> any clue what could the problem be ?
>
> Thanks a lot for your help
> yours,
> Alyaa
> --
> Alyaa Mahmoud
>
> "Love all, trust a few, do wrong to none"- Shakespeare
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From cstrato at aon.at Sat Jun 2 17:14:43 2012
From: cstrato at aon.at (cstrato)
Date: Sat, 02 Jun 2012 17:14:43 +0200
Subject: [BioC] XPS package working with Affymetrix GeneChip 1.0 ST at gene level
In-Reply-To: <20120602010220.A8769133D08@mamba.fhcrc.org>
References: <20120602010220.A8769133D08@mamba.fhcrc.org>
Message-ID: <4FCA2DE3.1030103@aon.at>

Dear Jorge,

You can preprocess both RaExon and RaGene 1.0 ST arrays at the probeset and the gene level; simply set the option to either option="probeset" or option="transcript", e.g.:

data.rma <- rma(data.genome, "RaGeneRMAMetacore", filedir=datdir,
                background="antigenomic", normalize=TRUE,
                option="transcript", exonlevel="metacore+affx")

See the help file ?rma.

For further examples see the scripts "script4exon.R" and "script4xps.R" in the xps/examples directory.
See also script "script4schemes.R" how to create the scheme for the RaGene array. Best regards Christian _._._._._._._._._._._._._._._._._._ C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a V.i.e.n.n.a A.u.s.t.r.i.a e.m.a.i.l: cstrato at aon.at _._._._._._._._._._._._._._._._._._ On 6/2/12 3:02 AM, Jorge Mir? [guest] wrote: > > I was wondering if there is some way of getting the XPS package working at gene level as I need to get the gene expression from some Rat Gene chips (RaGene 1.0 ST r4)that I will analyze. > > I tried to use the Affy package before but as far as I understand they need .CDF file to get working and I only have CLF and PBG files for my chips. > > Kind regards > Jorge > > -- output of sessionInfo(): > > - > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > From jinkeanlim at gmail.com Sat Jun 2 17:19:32 2012 From: jinkeanlim at gmail.com (KJ Lim) Date: Sat, 2 Jun 2012 18:19:32 +0300 Subject: [BioC] edgeR: summary of differentially expressed genes or tags Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From cstrato at aon.at Sat Jun 2 18:19:56 2012 From: cstrato at aon.at (cstrato) Date: Sat, 02 Jun 2012 18:19:56 +0200 Subject: [BioC] XPS package working with Affymetrix GeneChip 1.0 ST at gene level In-Reply-To: References: <20120602010220.A8769133D08@mamba.fhcrc.org> <4FCA2DE3.1030103@aon.at> Message-ID: <4FCA3D2C.9060901@aon.at> Dear Jorge, Please see the vignette "APTvsXPS.pdf" for a comparison between APT and xps, and see Figure 25 for the reason why xps did not implement PLIER. I do not think that you can import the APT output into xps, at least not from within R (although it might be possible from C++, but I would have to check). 
Best regards Christian On 6/2/12 5:20 PM, Jorge Mir? wrote: > Hi Christian, > > Oh, very nice. Is there any way of running PLIER in XPS too? Or is there > any possibility of producing the expressions with Affymetrix APT and > then use the output files in XPS to do a quality analysis within XPS? > > Kindly > Jorge > > On Sat, Jun 2, 2012 at 5:14 PM, cstrato > wrote: > > Dear Jorge, > > You can preprocess both, RaExon and RaGene 1.0 ST arrays, at the > probeset and the gene level, simply set the option to either > option="probeset" or to option="transcript", e.g.: > > data.rma <- rma(data.genome ,"RaGeneRMAMetacore", filedir=datdir > background="antigenomic", normalize=TRUE, option="transcript", > exonlevel="metacore+affx") > > See the help file ?rma. > > For further examples see the scripts "script4exon.R" and > "script4xps.R" in the xps/examples directory. See also script > "script4schemes.R" how to create the scheme for the RaGene array. > > Best regards > Christian > _._._._._._._._._._._._._._._.___._._ > C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a > V.i.e.n.n.a A.u.s.t.r.i.a > e.m.a.i.l: cstrato at aon.at > _._._._._._._._._._._._._._._.___._._ > > > > > > On 6/2/12 3:02 AM, Jorge Mir? [guest] wrote: > > > I was wondering if there is some way of getting the XPS package > working at gene level as I need to get the gene expression from > some Rat Gene chips (RaGene 1.0 ST r4)that I will analyze. > > I tried to use the Affy package before but as far as I > understand they need .CDF file to get working and I only have > CLF and PBG files for my chips. > > Kind regards > Jorge > > -- output of sessionInfo(): > > - > > -- > Sent via the guest posting facility at bioconductor.org > . 
> > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/__listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > > > From thomas.girke at ucr.edu Sat Jun 2 20:19:17 2012 From: thomas.girke at ucr.edu (Thomas Girke) Date: Sat, 2 Jun 2012 11:19:17 -0700 Subject: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file. In-Reply-To: <874ba6aa6e9c4a9582b518ff20bac7b0@EXCH-NODE01.exch.ucr.edu> References: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu> <874ba6aa6e9c4a9582b518ff20bac7b0@EXCH-NODE01.exch.ucr.edu> Message-ID: <20120602181917.GB583@Thomas-Girkes-MacBook-Pro.local> Dear Vince, Have you thought about supporting ShortReadQ objects from ShortRead in your package. This way users could random sample reads from large fastq files with FastqSampler() which would reduce the memory requirements and speed things up to generate the really nice and useful quality plots of your package. Right this seems to be only possible by saving things back to files (random sample with ShortRead -> save to file -> reload with qrqc) which is not ideal, but perhaps there is a simpler solution to this already that I missed? Thomas On Fri, Jun 01, 2012 at 08:55:53PM +0000, Vince Buffalo wrote: > Hi SangChul, > > By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. > > Best, > Vince > > Sent from my iPhone > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi wrote: > > > Hi, > > > > I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. 
I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. > > > > I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. > > > > Thank you, > > > > SangChul > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From mtmorgan at fhcrc.org Sat Jun 2 20:48:52 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Sat, 02 Jun 2012 11:48:52 -0700 Subject: [BioC] readGappedAlignmentPairs with multimapping reads In-Reply-To: <20120602130824.BABCA13445B@mamba.fhcrc.org> References: <20120602130824.BABCA13445B@mamba.fhcrc.org> Message-ID: <4FCA6014.20605@fhcrc.org> On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote: > > How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? > > Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? The full details will eventually be documented on ?readBamGappedAlignmentPairs and ?makeGappedAlignmentPairs; the author is currently taking a few days off. 
My understanding is that each alignment is a separate record, and that in the SAM specification one mate tells the location of the second mate (and vice versa) so that pairs can be identified unambiguously (except when the mate alignments differ only in the pattern of indels within an end). More details will be forthcoming. Martin > Thank you in advance! > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 > > loaded via a namespace (and not attached): > [1] stats4_2.15.0 tools_2.15.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 From mtmorgan at fhcrc.org Sat Jun 2 20:59:07 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Sat, 02 Jun 2012 11:59:07 -0700 Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9 In-Reply-To: References: Message-ID: <4FCA627B.6030906@fhcrc.org> On 06/02/2012 12:25 AM, sheng zhao wrote: > Hi Steve, > > thank you for your help. 
> > But I still face this problem even after I reinstalled R (version 2.15.0) > and reinstalled all packages I need to devel versions . > > With other packages I am using, for example: cummerbund 1.99.2, under devel > version > , everything is working fine. > > I tested also all thing on windows, everything is working under devel > versions. > > Is it a bug for Mac 10.7.4 ? Steve is right that the problem is likely to be an outdated package; it is NOT related to operating system. Start a new R session, making sure that no '.Rhistory' or '.RData' are loaded (e.g., from the command line, R --vanilla) then issue the command source('http://bioconductor.org/biocLite.R') biocLite(character()) this should report any out-of-date packages; can you please post the result of this command? Since the original error occurred with GO.db, perhaps you could also report the result of packageDescription("GO.db")$Version. Can you simplify your session, e.g., cummbeRbund, fastcluster, reshape, and ggplot2 are not loaded by ChIPpeakAnno. 
Martin > > Regards, > Sheng > > sessionInfo() > > R version 2.15.0 (2012-03-30) > > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > > locale: > > [1] C > > > attached base packages: > > [1] grid stats graphics grDevices utils datasets methods > > [8] base > > > other attached packages: > > [1] cummeRbund_1.99.2 fastcluster_1.1.6 > > [3] reshape2_1.2.1 ggplot2_0.9.1 > > [5] RSQLite_0.11.1 DBI_0.2-5 > > [7] AnnotationDbi_1.19.9 BSgenome.Ecoli.NCBI.20080805_1.3.17 > > [9] BSgenome_1.25.1 GenomicRanges_1.9.21 > > [11] Biostrings_2.25.4 IRanges_1.15.11 > > [13] multtest_2.13.0 Biobase_2.17.5 > > [15] biomaRt_2.13.1 BiocGenerics_0.3.0 > > [17] gplots_2.10.1 KernSmooth_2.23-7 > > [19] caTools_1.13 bitops_1.0-4.1 > > [21] gdata_2.8.2 gtools_2.6.2 > > [23] BiocInstaller_1.5.10 > > > loaded via a namespace (and not attached): > > [1] MASS_7.3-18 RColorBrewer_1.0-5 RCurl_1.91-1 XML_3.9-4 > > > [5] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2 labeling_0.1 > > > [9] memoise_0.1 munsell_0.3 plyr_1.7.1 > proto_0.3-9.2 > > [13] scales_0.2.1 splines_2.15.0 stats4_2.15.0 stringr_0.6 > > > [17] survival_2.36-14 tools_2.15.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 From harryzs1981 at gmail.com Sat Jun 2 22:00:52 2012 From: harryzs1981 at gmail.com (sheng zhao) Date: Sat, 2 Jun 2012 22:00:52 +0200 Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9 In-Reply-To: <4FCA627B.6030906@fhcrc.org> References: <4FCA627B.6030906@fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From mtmorgan at fhcrc.org Sat Jun 2 22:05:14 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Sat, 02 Jun 2012 13:05:14 -0700 Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9 In-Reply-To: References: <4FCA627B.6030906@fhcrc.org> Message-ID: <4FCA71FA.7060908@fhcrc.org> On 06/02/2012 01:00 PM, sheng zhao wrote: > Hi Martin, > > thank you for your help. > > Follow your guide : > ------ > source('http://bioconductor.org/biocLite.R') > biocLite(character()) > ------ > > I updated all old packages. > I also checked the version of GO.db > > > packageDescription("GO.db")$Version > [1] "2.7.1" > > > But I am very sorry to say that I still face this problem.: Does, in a new R --vanilla session, simply trying to load GO.db cause the problem, library(GO.db) ? > > > > library(ChIPpeakAnno) > Loading required package: gplots > Loading required package: gtools > Loading required package: gdata > gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED. > > gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED. > > Attaching package: 'gdata' > > The following object(s) are masked from 'package:stats': > > nobs > > The following object(s) are masked from 'package:utils': > > object.size > > Loading required package: caTools > Loading required package: bitops > Loading required package: grid > Loading required package: KernSmooth > KernSmooth 2.23 loaded > Copyright M. P. 
Wand 1997-2009 > > Attaching package: 'gplots' > > The following object(s) are masked from 'package:stats': > > lowess > > Loading required package: BiocGenerics > > Attaching package: 'BiocGenerics' > > The following object(s) are masked from 'package:gdata': > > combine > > The following object(s) are masked from 'package:stats': > > xtabs > > The following object(s) are masked from 'package:base': > > Filter, Find, Map, Position, Reduce, anyDuplicated, cbind, > colnames, duplicated, eval, get, intersect, lapply, mapply, mget, > order, paste, pmax, pmax.int , pmin, pmin.int > , rbind, rep.int , > rownames, sapply, setdiff, table, tapply, union, unique > > Loading required package: biomaRt > Loading required package: multtest > Loading required package: Biobase > Welcome to Bioconductor > > Vignettes contain introductory material; view with > 'browseVignettes()'. To cite Bioconductor, see > 'citation("Biobase")', and for packages 'citation("pkgname")'. > > > Attaching package: 'multtest' > > The following object(s) are masked from 'package:gplots': > > wapply > > Loading required package: IRanges > > Attaching package: 'IRanges' > > The following object(s) are masked from 'package:gplots': > > space > > The following object(s) are masked from 'package:caTools': > > runmean > > The following object(s) are masked from 'package:gdata': > > trim > > Loading required package: Biostrings > Loading required package: BSgenome > Loading required package: GenomicRanges > Loading required package: BSgenome.Ecoli.NCBI.20080805 > Loading required package: GO.db > Loading required package: AnnotationDbi > > Attaching package: 'AnnotationDbi' > > The following object(s) are masked from 'package:BSgenome': > > species > > Loading required package: DBI > Error : .onLoad failed in loadNamespace() for 'GO.db', details: > call: ls(envir, all.names = TRUE) > error: 7 arguments passed to .Internal(identical) which requires 6 > Error: package 'GO.db' could not be loaded > > > > > 
sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] RSQLite_0.11.1 DBI_0.2-5 > [3] AnnotationDbi_1.19.9 > BSgenome.Ecoli.NCBI.20080805_1.3.17 > [5] BSgenome_1.25.1 GenomicRanges_1.9.21 > [7] Biostrings_2.25.4 IRanges_1.15.12 > [9] multtest_2.13.0 Biobase_2.17.5 > [11] biomaRt_2.13.1 BiocGenerics_0.3.0 > [13] gplots_2.10.1 KernSmooth_2.23-7 > [15] caTools_1.13 bitops_1.0-4.1 > [17] gdata_2.8.2 gtools_2.6.2 > [19] BiocInstaller_1.5.10 > > loaded via a namespace (and not attached): > [1] MASS_7.3-18 RCurl_1.91-1 XML_3.9-4 splines_2.15.0 > [5] stats4_2.15.0 survival_2.36-14 tools_2.15.0 > > source('http://bioconductor.org/biocLite.R') > > biocLite(character()) > BioC_mirror: http://bioconductor.org > Using R version 2.15, BiocInstaller version 1.5.10. > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 From lawrence.michael at gene.com Sun Jun 3 07:36:46 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Sat, 2 Jun 2012 22:36:46 -0700 Subject: [BioC] readGappedAlignmentPairs with multimapping reads In-Reply-To: <4FCA6014.20605@fhcrc.org> References: <20120602130824.BABCA13445B@mamba.fhcrc.org> <4FCA6014.20605@fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Sun Jun 3 14:25:40 2012 From: guest at bioconductor.org (Juliet [guest]) Date: Sun, 3 Jun 2012 05:25:40 -0700 (PDT) Subject: [BioC] estimateLogicle from flowCore produces \"Error in eval(expr, envir, enclos) : object \'tr\' not found\" error Message-ID: <20120603122540.96C3D13322C@mamba.fhcrc.org> I am trying to analyse some flow cytometry data using flowCore. 
The problem I have is that I am using the estimateLogicle function and this is giving me an error with one of my files when I run it from within a function: gating <- function(file, name, results.dir){ fcs <- read.FCS(file) tr <- estimateLogicle(fcs, col) fcs <- transform(fcs, tr) } For one of my files, I get the following error message: "Error in eval(expr, envir, enclos) : object 'tr' not found" I think that this comes from the transform function when you are trying to the apply the results from estimateLogicle... The weird thing is that when I actually run these arguments by hand then it works without giving me an error... but when I run it from within the function it gives me an error. Do you know what the problem is? -- output of sessionInfo(): > sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] flowCore_1.18.0 rrcov_1.3-01 pcaPP_1.9-43 mvtnorm_0.9-9991 robustbase_0.7-6 Biobase_2.12.2 loaded via a namespace (and not attached): [1] MASS_7.3-14 feature_1.2.7 graph_1.30.0 ks_1.8.2 stats4_2.13.1 tools_2.13.1 -- Sent via the guest posting facility at bioconductor.org. From guest at bioconductor.org Sun Jun 3 14:30:10 2012 From: guest at bioconductor.org (Juliet [guest]) Date: Sun, 3 Jun 2012 05:30:10 -0700 (PDT) Subject: [BioC] estimateLogicle from flowCore produces \"Error in eval(expr, envir, enclos) : object \'tr\' not found\" error Message-ID: <20120603123010.1C1C9133CD6@mamba.fhcrc.org> I am trying to analyse some flow cytometry data using flowCore. 
The problem I have is that I am using the estimateLogicle function and this is giving me an error with one of my files when I run it from within a function: gating <- function(file, name, results.dir){ fcs <- read.FCS(file) tr <- estimateLogicle(fcs, col) fcs <- transform(fcs, tr) } For one of my files, I get the following error message: "Error in eval(expr, envir, enclos) : object 'tr' not found" I think that this comes from the transform function when you are trying to the apply the results from estimateLogicle... The weird thing is that when I actually run these arguments by hand then it works without giving me an error... but when I run it from within the function it gives me an error. Do you know what the problem is? -- output of sessionInfo(): R version 2.13.1 (2011-07-08) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] flowCore_1.18.0 rrcov_1.3-01 pcaPP_1.9-43 mvtnorm_0.9-9991 robustbase_0.7-6 Biobase_2.12.2 loaded via a namespace (and not attached): [1] MASS_7.3-14 feature_1.2.7 graph_1.30.0 ks_1.8.2 stats4_2.13.1 tools_2.13.1 -- Sent via the guest posting facility at bioconductor.org. From mtmorgan at fhcrc.org Sun Jun 3 19:59:59 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Sun, 03 Jun 2012 10:59:59 -0700 Subject: [BioC] estimateLogicle from flowCore produces \"Error in eval(expr, envir, enclos) : object \'tr\' not found\" error In-Reply-To: <20120603123010.1C1C9133CD6@mamba.fhcrc.org> References: <20120603123010.1C1C9133CD6@mamba.fhcrc.org> Message-ID: <4FCBA61F.3040300@fhcrc.org> On 06/03/2012 05:30 AM, Juliet [guest] wrote: > > I am trying to analyse some flow cytometry data using flowCore. 
>
> The problem I have is that I am using the estimateLogicle function and this is giving me an error with one of my files when I run it from within a function:
>
> gating<- function(file, name, results.dir){
> fcs<- read.FCS(file)
> tr<- estimateLogicle(fcs, col)
> fcs<- transform(fcs, tr)
> }
>
> For one of my files, I get the following error message:
>
> "Error in eval(expr, envir, enclos) : object 'tr' not found"
>
> I think that this comes from the transform function when you are trying to the apply the results from estimateLogicle...
>
> The weird thing is that when I actually run these arguments by hand then it works without giving me an error... but when I run it from within the function it gives me an error.
>
> Do you know what the problem is?

You are using an old version of R, so update to 2.15.0 and install current packages using

  source("http://bioconductor.org/biocLite.R")
  biocLite("flowCore")

In your function 'col' is not defined, perhaps you meant to pass it as an argument? Try and provide a fully reproducible example. For instance even in 2.13.1 I can say

  library(flowCore)
  example(estimateLogicle)
  f <- function(fcs, col) {
      tr <- estimateLogicle(fcs, col)
      transform(fcs, tr)
  }
  f(samp, "FL1-H")

and there is no error; what do I need to do to make your error occur on my machine?
Martin > sessionInfo() R version 2.13.1 Patched (2011-09-04 r56932) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] flowCore_1.18.0 rrcov_1.3-00 pcaPP_1.8-3 mvtnorm_0.9-999 [5] robustbase_0.7-3 Biobase_2.12.1 loaded via a namespace (and not attached): [1] feature_1.2.8 graph_1.30.0 ks_1.8.8 MASS_7.3-12 stats4_2.13.1 [6] tools_2.13.1 > > -- output of sessionInfo(): > > R version 2.13.1 (2011-07-08) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C/en_US.UTF-8/C/C/C/C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] flowCore_1.18.0 rrcov_1.3-01 pcaPP_1.9-43 mvtnorm_0.9-9991 robustbase_0.7-6 Biobase_2.12.2 > > loaded via a namespace (and not attached): > [1] MASS_7.3-14 feature_1.2.7 graph_1.30.0 ks_1.8.2 stats4_2.13.1 tools_2.13.1 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. 
PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

From rsaber at comcast.net Sun Jun 3 21:15:19 2012
From: rsaber at comcast.net (Gregory Ryslik)
Date: Sun, 03 Jun 2012 15:15:19 -0400
Subject: [BioC] pairwiseAlignment of PDB files to canonical protein structure
Message-ID: <4FCBB7C7.6010905@comcast.net>

Hi Everyone,

I am new to this list so please forgive me if I miss something.

Over the past few weeks, I have been attempting to match the positions provided by the PDB to the canonical protein structure. For instance, if a PDB file puts a CA Leucine residue at position 5, that does not mean that position 5 in the canonical protein structure (as shown by UniProt or other databases) is a Leucine. That is because the PDB numbering is different.

Using CIF files from the PDB database I am more or less able to reconstruct the canonical numbering for about 70% of all files. However, I would like to also align the residues I pull from the CIF file with the canonical structure for the structures that my algorithm fails to process.

To do this, I am using the pairwiseAlignment function in the Biostrings package. This function seems to work very well; however, I am new to alignment and am thus wondering what the best parameters are for my problem.

Suppose I have the canonical protein sequence in "canonical.protein" and the CIF sequence that I pull from the PDB database in "protein.extracted". I then run "pairwiseAlignment(pattern = canonical.protein, subject=protein.extracted)", and use the default settings for the other parameters.

If someone has done something similar, can they tell me whether there are parameters that are optimal? Especially for things like gapOpening, gapExtension, etc...
Thank you for your help,
Greg

From ddhervas at yahoo.es Mon Jun 4 00:52:02 2012
From: ddhervas at yahoo.es (David Hervás)
Date: Sun, 3 Jun 2012 23:52:02 +0100 (BST)
Subject: [BioC] GWAS with Affymetrix SNP 6.0
Message-ID: <1338763922.30446.YahooMailNeo@web132402.mail.ird.yahoo.com>

Hello,

I'm new to Bioconductor and after searching and reading lots of documentation files from different packages I still haven't figured out how to perform GWAS with Affymetrix SNP 6.0 arrays.

As far as I know I need the "oligo", "pd.genomewidesnp.6" and "snpStats" packages, but I haven't found an example of how to put all the commands together.

I've managed to read my .CEL files with the following code provided in the oligo package documentation:

setwd("d:\\CELS\\")
fullFilenames <- list.celfiles(full.names=TRUE)
outputDir <- file.path(getwd(), "crlmmResults")
if (!file.exists(outputDir))
    crlmm(fullFilenames, outputDir)
crlmmOut <- getCrlmmSummaries(outputDir)

I get a SnpSuperSet object, and I also know how to get the genotypes from it with

genotype<-calls(crlmmOut)

What are the next steps to perform the association analysis? Documentation in the oligo package suggests "snpStats", but I have no idea how to make this package read my SnpSuperSet object or my genotype matrix, since snpStats needs a SnpMatrix object. I've also tried other packages but none seem to be able to read SnpSuperSet objects.

Thanks in advance for your help

________________
David Hervás Marín
Biostatistician in IIS La Fe - Valencia

From stvjc at channing.harvard.edu Mon Jun 4 01:04:21 2012
From: stvjc at channing.harvard.edu (Vincent Carey)
Date: Sun, 3 Jun 2012 19:04:21 -0400
Subject: [BioC] GWAS with Affymetrix SNP 6.0
In-Reply-To: <1338763922.30446.YahooMailNeo@web132402.mail.ird.yahoo.com>
References: <1338763922.30446.YahooMailNeo@web132402.mail.ird.yahoo.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available URL: From fboehm at biostat.wisc.edu Mon Jun 4 01:42:22 2012 From: fboehm at biostat.wisc.edu (Fred Boehm) Date: Sun, 03 Jun 2012 18:42:22 -0500 Subject: [BioC] GWAS with Affymetrix SNP 6.0 In-Reply-To: References: <1338763922.30446.YahooMailNeo@web132402.mail.ird.yahoo.com> Message-ID: <4FCBF65E.1050207@biostat.wisc.edu> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From smyth at wehi.EDU.AU Mon Jun 4 03:11:37 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Mon, 4 Jun 2012 11:11:37 +1000 (AUS Eastern Standard Time) Subject: [BioC] edgeR: summary of differentially expressed genes or tags In-Reply-To: References: Message-ID: Dear KJ Lim, When testing for multiple coefficients, the easiest way to learn the number of differentially expressed genes is by: FDR <- p.adjust(ltr$PValue, method="BH") sum(FDR < 0.05) There is however no unambiguous way to break this down by up and down genes. There is no unambiguous way to classify a gene as up or down with respect to multiple coefficients, because the coefficients may change in different directions for the same gene. Best wishes Gordon > Date: Sat, 2 Jun 2012 18:19:32 +0300 > From: KJ Lim > To: Bioconductor mailing list > Subject: [BioC] edgeR: summary of differentially expressed genes or > tags > > Dear edgeR community, > > Good day. > > I can learn summary of the up and down regulated genes/tags from > >summary(de <- decideTestsDGE(lrt)) > when the *coef *of *glmLRT*(the likelihood ratio test) is set to one degree > of freedom. > > When the *coef* is set to i.e. 2:5; the decideTestsDGE doesn't work. It > could be nice to see the summary of up and down regulated genes/tags when > the *coef* is set i.e. 2:5. > > Thus, may I ask is there any method to learn the number of up and down > regulated genes/tags when the *coef* of *glmLRT*(the likelihood ratio test) is > set to all groups? > > Thank you very much for your guys time and help. > > Have a nice weekend. 
> > Best regards, > KJ Lim ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From smyth at wehi.EDU.AU Mon Jun 4 03:16:27 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Mon, 4 Jun 2012 11:16:27 +1000 (AUS Eastern Standard Time) Subject: [BioC] edgeR: summary of differentially expressed genes or tags In-Reply-To: References: Message-ID: Correcting a typo in my previous post. Code should have been: FDR <- p.adjust(lrt$table$PValue, method="BH") sum(FDR < 0.05) Gordon ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From smelov at buckinstitute.org Mon Jun 4 05:08:16 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Sun, 3 Jun 2012 20:08:16 -0700 Subject: [BioC] oligo and pdInfoBuilder Message-ID: I've been beating my head against the wall for several hours trying to see why I cant build a library for a nimblegen expression array, using the oligo and pdInfoBuilder packages. I dont see any errors when building the library for a 12plex yeast genome expression array.... makePdInfoPackage(seed, destDir = "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/") ================================================================================ Building annotation package for Nimblegen Expression Array NDF: 100718_Scer_EXP.ndf XYS: 532785A01_Chip6.330.2012.5.8_532.xys ================================================================================ Parsing file: 100718_Scer_EXP.ndf... OK Parsing file: 532785A01_Chip6.330.2012.5.8_532.xys... OK Merging NDF and XYS files... OK Preparing contents for featureSet table... OK Preparing contents for bgfeature table... OK Preparing contents for pmfeature table... OK Creating package in /Library/Frameworks/R.framework/Versions/2.15/Resources/library//pd.100718.scer.exp Inserting 5777 rows into table featureSet... 
OK
Inserting 137040 rows into table pmfeature... OK
Counting rows in featureSet
Counting rows in pmfeature
Creating index idx_pmfsetid on pmfeature... OK
Creating index idx_pmfid on pmfeature... OK
Creating index idx_fsfsetid on featureSet... OK
Saving DataFrame object for PM.
Done.
>

but when I try to use the library in oligo, it fails for some reason

library(pd.100718.scer.exp)
Error in library(pd.100718.scer.exp) :
  'pd.100718.scer.exp' is not a valid installed package
>

Can anyone help? I'm at my wits end...

From stvjc at channing.harvard.edu Mon Jun 4 05:33:23 2012
From: stvjc at channing.harvard.edu (Vincent Carey)
Date: Sun, 3 Jun 2012 23:33:23 -0400
Subject: [BioC] oligo and pdInfoBuilder
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From jinkeanlim at gmail.com Mon Jun 4 07:45:14 2012
From: jinkeanlim at gmail.com (KJ Lim)
Date: Mon, 4 Jun 2012 08:45:14 +0300
Subject: [BioC] edgeR: summary of differentially expressed genes or tags
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From D.Strbenac at garvan.org.au Mon Jun 4 09:00:14 2012
From: D.Strbenac at garvan.org.au (Dario Strbenac)
Date: Mon, 4 Jun 2012 17:00:14 +1000 (EST)
Subject: [BioC] Rsamtools yieldTabix Skips Comment Lines
Message-ID: <20120604170014.BXE76127@gimr.garvan.unsw.edu.au>

Hello,

In a previous version, I was able to read a tabix file, including the first line that started with # and had column names. Now with Rsamtools 1.8.4, it skips that line and the first element of the character vector is the first record of the tabix file. Any way to get the old behaviour back so that I can know the column names?

anno <- "http://genomesavant.com/savant/data/hg18/hg18.refGene.gz"
txTabix <- TabixFile(anno)
txStrings <- yieldTabix(txTabix, yieldSize = 100000)
close(txTabix)
txStrings[[1]] # Not the row of column names any longer.
-------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia From smyth at wehi.EDU.AU Mon Jun 4 09:10:12 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Mon, 4 Jun 2012 17:10:12 +1000 (AUS Eastern Standard Time) Subject: [BioC] edgeR: summary of differentially expressed genes or tags In-Reply-To: References: Message-ID: Page 51 of the edgeR User's Guide (4 May 2012) gives an example of this. Gordon On Mon, 4 Jun 2012, KJ Lim wrote: > Dear Prof Gordon, > > Thanks for your suggestion and explanation. > > How could I never thought about the expression (up or down-regulated) of > same genes may change when test on multiple coefficients! Thanks for your > time and help. Have a nice day. > > Best regards, > KJ Lim > > On 4 June 2012 04:11, Gordon K Smyth wrote: > >> Dear KJ Lim, >> >> When testing for multiple coefficients, the easiest way to learn the >> number of differentially expressed genes is by: >> >> FDR <- p.adjust(ltr$PValue, method="BH") >> sum(FDR < 0.05) >> >> There is however no unambiguous way to break this down by up and down >> genes. There is no unambiguous way to classify a gene as up or down with >> respect to multiple coefficients, because the coefficients may change in >> different directions for the same gene. >> >> Best wishes >> Gordon >> >> Date: Sat, 2 Jun 2012 18:19:32 +0300 >>> From: KJ Lim >>> To: Bioconductor mailing list >>> Subject: [BioC] edgeR: summary of differentially expressed genes or >>> tags >>> >>> Dear edgeR community, >>> >>> Good day. >>> >>> I can learn summary of the up and down regulated genes/tags from >>> >summary(de <- decideTestsDGE(lrt)) >>> when the *coef *of *glmLRT*(the likelihood ratio test) is set to one >>> degree >>> of freedom. >>> >>> When the *coef* is set to i.e. 2:5; the decideTestsDGE doesn't work. 
It >>> could be nice to see the summary of up and down regulated genes/tags when >>> the *coef* is set i.e. 2:5. >>> >>> Thus, may I ask is there any method to learn the number of up and down >>> regulated genes/tags when the *coef* of *glmLRT*(the likelihood ratio >>> test) is >>> set to all groups? >>> >>> Thank you very much for your guys time and help. >>> >>> Have a nice weekend. >>> >>> Best regards, >>> KJ Lim >>> >> >> >> ______________________________**______________________________**__________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________**______________________________**__________ >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From philip.degroot at wur.nl Mon Jun 4 09:38:51 2012 From: philip.degroot at wur.nl (Groot, Philip de) Date: Mon, 4 Jun 2012 07:38:51 +0000 Subject: [BioC] XPS package working with Affymetrix GeneChip 1.0 ST at gene level In-Reply-To: References: <20120602010220.A8769133D08@mamba.fhcrc.org> Message-ID: <8E281A84192EF947ADA6322A4C824F1508FC0F@SCOMP0936.wurnet.nl> I created some CDF-files for these chips that you can also use (not sure on the r4 though): http://nmg-r.bioinformatics.nl/NuGO_R.html Regards, Dr. Philip de Groot Bioinformatician / Microarray analysis expert Wageningen University / TIFN Netherlands Nutrigenomics Center (NNC) Nutrition, Metabolism & Genomics Group Division of Human Nutrition PO Box 8129, 6700 EV Wageningen Visiting Address: "De Valk" ("Erfelijkheidsleer"), Building 304, Verbindingsweg 4, 6703 HC Wageningen Room: 0052a T: 0317 485786 F: 0317 483342 E-mail: Philip.deGroot at wur.nl I:???????? 
http://humannutrition.wur.nl https://madmax.bioinformatics.nl http://www.nutrigenomicsconsortium.nl -----Original Message----- From: Benilton Carvalho [mailto:beniltoncarvalho at gmail.com] Sent: zaterdag 2 juni 2012 12:20 To: Jorge Mir? [guest] Cc: jorgma86 at gmail.com; bioconductor at r-project.org Subject: Re: [BioC] XPS package working with Affymetrix GeneChip 1.0 ST at gene level Another alternative, given that the affy package is not designed for this, is the oligo package.... b On 2 June 2012 02:02, Jorge Mir? [guest] wrote: > > I was wondering if there is some way of getting the XPS package working at gene level as I need to get the gene expression from some Rat Gene chips (RaGene 1.0 ST r4)that I will analyze. > > I tried to use the Affy package before but as far as I understand they need .CDF file to get working and I only have CLF and PBG files for my chips. > > Kind regards > Jorge > > ?-- output of sessionInfo(): > > - > > -- > Sent via the guest posting facility at bioconductor.org. 
> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From jpflorido at gmail.com Mon Jun 4 10:58:39 2012 From: jpflorido at gmail.com (=?ISO-8859-1?Q?Javier_P=E9rez_Florido?=) Date: Mon, 4 Jun 2012 10:58:39 +0200 Subject: [BioC] oligo: error in crlmm function Message-ID: <4FCC78BF.4090208@gmail.com> Dear list, I'm trying to obtain genotype calls from genome wide SNP arrays (affymetrix 6.0) using oligo package, but the following error comes up: library(oligo) fullFilenames<-list.celfiles(full.names=TRUE) fullFilenames [1] "./E10897_(GenomeWideSNP_6)_2.CEL" "./E10905_(GenomeWideSNP_6).CEL" [3] "./E10906_(GenomeWideSNP_6).CEL" "./E10915_(GenomeWideSNP_6).CEL" [5] "./E10916_(GenomeWideSNP_6).CEL" outputDir<-file.path(getwd(),"crlmmResults") crlmm(fullFilenames,outputDir) Loading required package: pd.genomewidesnp.6 Loading required package: RSQLite Loading required package: DBI Loading results from previous normalization/summarization step. Error in readChar(con, 5L, useBytes = TRUE) : cannot open connection Adem?s: Mensajes de aviso perdidos In readChar(con, 5L, useBytes = TRUE) : cannot open compressed file '/media/data/ArraysSNPs/PTC/prueba/crlmmResults/NormalizationSummarizationOutput.rda', probable reason 'No such file or directory' The output directory exists and has correct permissions. Any tips? 
Thanks in advance, Javier > sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=es_ES.UTF-8 [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] pd.genomewidesnp.6_1.2.2 RSQLite_0.11.1 DBI_0.2-5 [4] oligo_1.16.2 preprocessCore_1.14.0 oligoClasses_1.14.0 [7] Biobase_2.12.2 loaded via a namespace (and not attached): [1] affxparser_1.24.0 affyio_1.20.0 Biostrings_2.20.4 bit_1.1-8 [5] ff_2.2-7 IRanges_1.10.6 splines_2.13.1 From ovokeraye at gmail.com Mon Jun 4 14:03:53 2012 From: ovokeraye at gmail.com (Ovokeraye Achinike-Oduaran) Date: Mon, 4 Jun 2012 14:03:53 +0200 Subject: [BioC] Replacing commas with new lines in R Message-ID: Hi all, I have a file that looks like this (the genes column from different DAVID functional annotation chart categories). I would like to replace the commas with new lines if possible so that I can have a single gene per row. How can I possibly do this in R? Thanks. 
Regards, Avoks [1] DCD, PPARG, FTO, IGF2BP2, TSPAN8, CDKAL1, LGR5, KCNJ11, TCF7L2, THADA, NOTCH2, HHEX, ADAMTS9, CDKN2A, CDKN2B, VEGFA, SYN2, CDC123, JAZF1, ADAM30, CAMK1D [2] HHEX, CDKN2A, CDKN2B, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, TCF7L2 [3] HHEX, CDKN2B, FTO, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, TCF7L2 [4] HHEX, CDKN2A, CDKN2B, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, TCF7L2 [5] STRAP, U2AF2, SNRPD1, XAB2, YBX1, SART1, NONO, HNRNPA3, HNRNPK, SRRM2, CDC40, DHX15, SRRM1, QKI, LSM2, PRPF31, PTBP1, SF1, RNPS1, CDC5L, SF3A2, HNRNPU, RBMY1A1, EIF4A3, SFPQ, KHSRP, RBM38, SNRPE [6] MAEA, AURKAIP1, PML, SMAD3, IGF1, CDC16, GAS1, RCC1, PROX1, TGFB1, TRIAP1, CDKN2A, CDKN2B, HNF4A, CDC123, PEBP1, RBM38, FOXC1, TPR, RAD17, SMARCA4, APC, DLG1 From sdavis2 at mail.nih.gov Mon Jun 4 14:16:52 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 4 Jun 2012 08:16:52 -0400 Subject: [BioC] Replacing commas with new lines in R In-Reply-To: References: Message-ID: On Mon, Jun 4, 2012 at 8:03 AM, Ovokeraye Achinike-Oduaran wrote: > Hi all, > > I have a file that looks like this (the genes column from different > DAVID functional annotation chart categories). I would like to replace > the commas with new lines if possible so that I can have a single gene > per row. How can I possibly do this in R? Have a look at strsplit(). Sean > Thanks. 
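For anyone finding this thread in the archives, here is a minimal sketch of the strsplit() suggestion. The input strings are shortened from the lists in the question, and the variable names are made up for illustration; how the real file is read in (readLines(), read.delim(), ...) is left open because the file format is not shown.

```r
# Example input: one string per DAVID category, genes separated by ", "
# (shortened from the lists in the original question).
genes_by_category <- c(
  "HHEX, CDKN2A, CDKN2B, PPARG",
  "HHEX, CDKN2B, FTO, PPARG"
)

# strsplit() returns a list with one character vector per input string;
# the pattern ",\\s*" consumes the comma and any spaces after it.
split_genes <- strsplit(genes_by_category, ",\\s*")

# Stack into a data frame with one gene per row, remembering which
# category each gene came from.
one_per_row <- data.frame(
  category = rep(seq_along(split_genes), lengths(split_genes)),
  gene = unlist(split_genes),
  stringsAsFactors = FALSE
)

print(one_per_row)

# To write a plain text file with one gene per line instead:
# writeLines(one_per_row$gene, "genes_one_per_line.txt")
```

Keeping the category index next to each gene means the result stays a regular R data frame, which seems to be what the poster wanted.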
> > Regards, > > Avoks > > [1] DCD, PPARG, FTO, IGF2BP2, TSPAN8, CDKAL1, LGR5, KCNJ11, TCF7L2, > THADA, NOTCH2, HHEX, ADAMTS9, CDKN2A, CDKN2B, VEGFA, SYN2, CDC123, > JAZF1, ADAM30, CAMK1D > [2] HHEX, CDKN2A, CDKN2B, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, > TCF7L2 > [3] HHEX, CDKN2B, FTO, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, TCF7L2 > [4] HHEX, CDKN2A, CDKN2B, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, > TCF7L2 > [5] STRAP, U2AF2, SNRPD1, XAB2, YBX1, SART1, NONO, HNRNPA3, HNRNPK, > SRRM2, CDC40, DHX15, SRRM1, QKI, LSM2, PRPF31, PTBP1, SF1, RNPS1, > CDC5L, SF3A2, HNRNPU, RBMY1A1, EIF4A3, SFPQ, KHSRP, RBM38, SNRPE > [6] MAEA, AURKAIP1, PML, SMAD3, IGF1, CDC16, GAS1, RCC1, PROX1, TGFB1, > TRIAP1, CDKN2A, CDKN2B, HNF4A, CDC123, PEBP1, RBM38, FOXC1, TPR, > RAD17, SMARCA4, APC, DLG1 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From beniltoncarvalho at gmail.com Mon Jun 4 14:45:14 2012 From: beniltoncarvalho at gmail.com (Benilton Carvalho) Date: Mon, 4 Jun 2012 13:45:14 +0100 Subject: [BioC] oligo: error in crlmm function In-Reply-To: <4FCC78BF.4090208@gmail.com> References: <4FCC78BF.4090208@gmail.com> Message-ID: Hi Javier, I'll take a look at this... But, in the meantime, I'd strongly recommend you to use the crlmm package for SNP 5 and SNP6 (or Illumina) chips. The reason is that the algorithm implemented there is an upgrade on the one in oligo and provides more precise and accurate calls. 
benilton

On 4 June 2012 09:58, Javier Pérez Florido wrote:
> Dear list,
> I'm trying to obtain genotype calls from genome wide SNP arrays (affymetrix
> 6.0) using oligo package, but the following error comes up:
>
> library(oligo)
> fullFilenames<-list.celfiles(full.names=TRUE)
> fullFilenames
> [1] "./E10897_(GenomeWideSNP_6)_2.CEL" "./E10905_(GenomeWideSNP_6).CEL"
> [3] "./E10906_(GenomeWideSNP_6).CEL"   "./E10915_(GenomeWideSNP_6).CEL"
> [5] "./E10916_(GenomeWideSNP_6).CEL"
> outputDir<-file.path(getwd(),"crlmmResults")
>
> crlmm(fullFilenames,outputDir)
> Loading required package: pd.genomewidesnp.6
> Loading required package: RSQLite
> Loading required package: DBI
> Loading results from previous normalization/summarization step.
> Error in readChar(con, 5L, useBytes = TRUE) :
>   cannot open connection
> Además: Mensajes de aviso perdidos
> In readChar(con, 5L, useBytes = TRUE) :
>   cannot open compressed file
> '/media/data/ArraysSNPs/PTC/prueba/crlmmResults/NormalizationSummarizationOutput.rda',
> probable reason 'No such file or directory'
>
> The output directory exists and has correct permissions.
> Any tips?
> Thanks in advance,
> Javier
>
>
>
>> sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=es_ES.UTF-8
>  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] pd.genomewidesnp.6_1.2.2 RSQLite_0.11.1           DBI_0.2-5
> [4] oligo_1.16.2             preprocessCore_1.14.0    oligoClasses_1.14.0
> [7] Biobase_2.12.2
>
> loaded via a namespace (and not attached):
> [1] affxparser_1.24.0 affyio_1.20.0
Biostrings_2.20.4 bit_1.1-8 > [5] ff_2.2-7 ? ? ? ? ?IRanges_1.10.6 ? ?splines_2.13.1 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From ovokeraye at gmail.com Mon Jun 4 14:46:46 2012 From: ovokeraye at gmail.com (Ovokeraye Achinike-Oduaran) Date: Mon, 4 Jun 2012 14:46:46 +0200 Subject: [BioC] Replacing commas with new lines in R In-Reply-To: References: Message-ID: Hi Sean and Nicholas, Thanks. strsplit() worked great for my purposes. And Nicholas, I wanted it to remain in a structure that worked for R. Sorry I wasn't very clear on that. Thanks again. -Avoks On Mon, Jun 4, 2012 at 2:16 PM, Sean Davis wrote: > On Mon, Jun 4, 2012 at 8:03 AM, Ovokeraye Achinike-Oduaran > wrote: >> Hi all, >> >> I have a file that looks like this (the genes column from different >> DAVID functional annotation chart categories). I would like to replace >> the commas with new lines if possible so that I can have a single gene >> per row. How can I possibly do this in R? > > Have a look at strsplit(). > > Sean > > >> Thanks. 
>> >> Regards, >> >> Avoks >> >> [1] DCD, PPARG, FTO, IGF2BP2, TSPAN8, CDKAL1, LGR5, KCNJ11, TCF7L2, >> THADA, NOTCH2, HHEX, ADAMTS9, CDKN2A, CDKN2B, VEGFA, SYN2, CDC123, >> JAZF1, ADAM30, CAMK1D >> [2] HHEX, CDKN2A, CDKN2B, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, >> TCF7L2 >> [3] HHEX, CDKN2B, FTO, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, TCF7L2 >> [4] HHEX, CDKN2A, CDKN2B, PPARG, IGF2BP2, CDKAL1, SLC30A8, KCNJ11, >> TCF7L2 >> [5] STRAP, U2AF2, SNRPD1, XAB2, YBX1, SART1, NONO, HNRNPA3, HNRNPK, >> SRRM2, CDC40, DHX15, SRRM1, QKI, LSM2, PRPF31, PTBP1, SF1, RNPS1, >> CDC5L, SF3A2, HNRNPU, RBMY1A1, EIF4A3, SFPQ, KHSRP, RBM38, SNRPE >> [6] MAEA, AURKAIP1, PML, SMAD3, IGF1, CDC16, GAS1, RCC1, PROX1, TGFB1, >> TRIAP1, CDKN2A, CDKN2B, HNF4A, CDC123, PEBP1, RBM38, FOXC1, TPR, >> RAD17, SMARCA4, APC, DLG1 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From ddhervas at yahoo.es Mon Jun 4 14:50:02 2012 From: ddhervas at yahoo.es (=?iso-8859-1?Q?David_Herv=E1s?=) Date: Mon, 4 Jun 2012 13:50:02 +0100 (BST) Subject: [BioC] GWAS with Affymetrix SNP 6.0 In-Reply-To: References: <1338763922.30446.YahooMailNeo@web132402.mail.ird.yahoo.com> Message-ID: <1338814202.97754.YahooMailNeo@web132404.mail.ird.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Mon Jun 4 15:09:49 2012 From: guest at bioconductor.org (Fleur [guest]) Date: Mon, 4 Jun 2012 06:09:49 -0700 (PDT) Subject: [BioC] Gene list annotation Message-ID: <20120604130949.6BC76134467@mamba.fhcrc.org> Hi, I'm trying to measure the significance of my gene annotation list. 
Gene list of interest is composed of 185 genes ( among 6000 genes).In a first step, i performed a GO term analysis but i would like to know if i could have the same result by chance. My idea was to randomly select 185 genes from the 6000 genes ( 100 times for example) and annotate those list and see if i could have the same terms by chance ... But how calculate a pvalue for each term at each repetition of the permutation ? Any idea ? Someone know a package which do this king of thing ? Thanks in advance for your help -- output of sessionInfo(): R version 2.13.0 (2011-04-13) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C [5] LC_TIME=French_France.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base -- Sent via the guest posting facility at bioconductor.org. From sdavis2 at mail.nih.gov Mon Jun 4 15:14:13 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 4 Jun 2012 09:14:13 -0400 Subject: [BioC] Gene list annotation In-Reply-To: <20120604130949.6BC76134467@mamba.fhcrc.org> References: <20120604130949.6BC76134467@mamba.fhcrc.org> Message-ID: On Mon, Jun 4, 2012 at 9:09 AM, Fleur [guest] wrote: > > Hi, > I'm trying to measure the significance of my gene annotation list. > Gene list of interest is composed of 185 genes ( among 6000 genes).In a first step, i performed a GO term analysis but i would like to know if i could have the same result by chance. > My idea was to randomly select 185 genes from the 6000 genes ( 100 times for example) and annotate those list and see if i could have the same terms by chance ... But how calculate a pvalue for each term at each repetition of the permutation ? Any idea ? > Someone know a package which do this king of thing ? GOstats, topGo, limma (roast, romer, and friends), and several others might be applicable. 
You mentioned that you had done a "GO term analysis", so you may have
already done this, depending on what you meant by that statement.

Sean

> Thanks in advance for your help
>
>
>
>
> -- output of sessionInfo():
>
> R version 2.13.0 (2011-04-13)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252
> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
> [5] LC_TIME=French_France.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From mtmorgan at fhcrc.org Mon Jun 4 15:25:46 2012
From: mtmorgan at fhcrc.org (Martin Morgan)
Date: Mon, 04 Jun 2012 06:25:46 -0700
Subject: [BioC] Rsamtools yieldTabix Skips Comment Lines
In-Reply-To: <20120604170014.BXE76127@gimr.garvan.unsw.edu.au>
References: <20120604170014.BXE76127@gimr.garvan.unsw.edu.au>
Message-ID: <4FCCB75A.10201@fhcrc.org>

Hi Dario --

On 06/04/2012 12:00 AM, Dario Strbenac wrote:
> Hello,
>
> In a previous version, I was able to read a tabix file, including the first line that started with # and had column names. Now with Rsamtools 1.8.4, it skips that line and the first element of the character vector is the first record of the tabix file. Any way to get the old behaviour back so that I can know the column names ?
>
> anno<- "http://genomesavant.com/savant/data/hg18/hg18.refGene.gz"
> txTabix<- TabixFile(anno)
> txStrings<- yieldTabix(txTabix, yieldSize = 100000)
> close(txTabix)
> txStrings[[1]] # Not the row of column names any longer.
> tail(headerTabix(txTabix)$header, 1) [1] "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\texonCount\texonStarts\texonEnds\tscore\tname2\tcdsStartStat\tcdsEndStat\texonFrames" > > -------------------------------------- > Dario Strbenac > Research Assistant > Cancer Epigenetics > Garvan Institute of Medical Research > Darlinghurst NSW 2010 > Australia > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 From flower_des_iles at hotmail.com Mon Jun 4 15:50:07 2012 From: flower_des_iles at hotmail.com (=?iso-8859-1?B?U3TpcGhhbmllIGhhYWFhYWFhYQ==?=) Date: Mon, 4 Jun 2012 13:50:07 +0000 Subject: [BioC] FW: Gene list annotation In-Reply-To: References: <20120604130949.6BC76134467@mamba.fhcrc.org>, , Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From Trent.Simmons at oicr.on.ca Mon Jun 4 16:14:18 2012 From: Trent.Simmons at oicr.on.ca (Trent Simmons) Date: Mon, 4 Jun 2012 14:14:18 +0000 Subject: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' Message-ID: <1060A4C56F2F1943AEB02A72CEBB47810912442D@exmb3.ad.oicr.on.ca> Hi All, I am trying to perform a GCRMA normalization of microarray data that was run on Mouse Gene 1.1 Affymetrix arrays. I am running into an issue with the CDF file for that array (the CDF file is mogene11stv1mmentrezgcdf). I obtained the latest version of the CDF file from Brainarray. When I attempted to perform the compute.affinities step, the error message: .Error in tmp.exprs[pmIndex[subIndex]] = apm : NAs are not allowed in subscripted assignments is returned. 
Is anyone else experiencing this issue, or have any suggestions for how to resolve it? I have also tried to perform basic RMA analysis using one of my CEL files and the same CDF file, just in case it was an issue solely with GCRMA. RMA didn't work either, but in this case, it seems to be that the RMA script is looking for a CDF file called 'mogene11stv1cdf' instead. Once again, anyone else with this issue or suggestions for a fix? Thanks in advance, Trent ### START CODE ### ### LOAD LIBRARIES ### library(affy) library(gcrma) library(AnnotationDbi) library(mogene11stv1mmentrezgcdf) library(mogene11stv1mmentrezgprobe) > ### RUN GCRMA ### > compute.affinities('mogene11stv1mmentrezg', verbose = TRUE) Computing affinities.Error in tmp.exprs[pmIndex[subIndex]] = apm : NAs are not allowed in subscripted assignments > > traceback() 1: compute.affinities("mogene11stv1mmentrezg", verbose = TRUE) > ### RUN RMA ### > setwd('~/Documents/test/mouse/CEL') > > data <- ReadAffy() > eset <- rma(data) Error in getCdfInfo(object) : Could not obtain CDF environment, problems encountered: Specified environment does not contain MoGene-1_1-st-v1 Library - package mogene11stv1cdf not installed Bioconductor - mogene11stv1cdf not available > > traceback() 12: stop(paste("Could not obtain CDF environment, problems encountered:", paste(unlist(badOut), collapse = "\n"), sep = "\n")) 11: getCdfInfo(object) 10: .local(object, which, ...) 9: indexProbes(object, "pm", genenames = genenames) 8: indexProbes(object, "pm", genenames = genenames) 7: .local(object, ...) 6: pmindex(object, genenames) 5: pmindex(object, genenames) 4: .local(object, ...) 
3: probeNames(object, subset) 2: probeNames(object, subset) 1: rma(data) > > ### SESSION INFO ### > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] mogene11stv1mmentrezgprobe_15.1.0 mogene11stv1mmentrezgcdf_15.1.0 [3] AnnotationDbi_1.18.1 gcrma_2.28.0 [5] BiocInstaller_1.4.6 affy_1.34.0 [7] Biobase_2.16.0 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] affyio_1.24.0 Biostrings_2.24.1 DBI_0.2-5 [4] IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 [7] splines_2.15.0 stats4_2.15.0 tools_2.15.0 [10] zlibbioc_1.2.0 ## END CODE ## From sdavis2 at mail.nih.gov Mon Jun 4 16:17:34 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 4 Jun 2012 10:17:34 -0400 Subject: [BioC] FW: Gene list annotation In-Reply-To: References: <20120604130949.6BC76134467@mamba.fhcrc.org> Message-ID: On Mon, Jun 4, 2012 at 9:50 AM, St?phanie haaaaaaaa wrote: > > I have done the analysis with DAVID but someone told me to do this kind of analysis. because he believes that if i randomly select 185 others genes among the 6000 i will have the same annotation results .... > DAVID uses a fisher-exact-based statistic for computing the probability of enrichment. Did you get p-values from DAVID? If so, you may have already accomplished your task. If not, you can use DAVID to get such a result, but that is for another list.... 
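To make the idea concrete, here is a rough base-R sketch of a per-term enrichment p-value for a 185-gene list drawn from 6000 genes, alongside the random-resampling check proposed in the original question. Everything in it is simulated: the gene names, the size of the GO term (300) and the 25 planted hits are assumptions for illustration only, and this is a plain one-sided Fisher/hypergeometric test, not necessarily the exact statistic DAVID reports.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical numbers: a universe of 6000 genes, 185 genes of interest,
# and one GO term that annotates 300 genes of the universe.
universe   <- paste0("gene", 1:6000)
term_genes <- sample(universe, 300)                 # made-up annotation
gene_list  <- c(sample(term_genes, 25),             # 25 hits planted on purpose
                sample(setdiff(universe, term_genes), 160))

hits <- sum(gene_list %in% term_genes)

# One-sided hypergeometric p-value: probability of seeing at least
# `hits` annotated genes when 185 genes are drawn at random.
p_hyper <- phyper(hits - 1,
                  length(term_genes),
                  length(universe) - length(term_genes),
                  length(gene_list),
                  lower.tail = FALSE)

# The same test written as a one-sided Fisher exact test on a 2x2 table
# (rows: in term / not in term; columns: in list / not in list).
tab <- matrix(c(hits,
                length(gene_list) - hits,
                length(term_genes) - hits,
                length(universe) - length(term_genes) - length(gene_list) + hits),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Empirical check by resampling: how often does a random list of 185
# genes contain at least as many annotated genes as the real list?
n_perm <- 1000
random_hits <- replicate(
  n_perm,
  sum(sample(universe, length(gene_list)) %in% term_genes)
)
p_empirical <- (sum(random_hits >= hits) + 1) / (n_perm + 1)

print(c(hypergeometric = p_hyper, fisher = p_fisher, empirical = p_empirical))
```

The analytic p-value already answers the "could I get this by chance" question for a single term, so the resampling loop is mostly a sanity check. With many GO terms tested at once, a multiple-testing adjustment (for example p.adjust()) would still be needed.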
Sean >> Date: Mon, 4 Jun 2012 09:14:13 -0400 >> Subject: Re: [BioC] Gene list annotation >> From: sdavis2 at mail.nih.gov >> To: guest at bioconductor.org >> CC: bioconductor at r-project.org; flower_des_iles at hotmail.com >> >> On Mon, Jun 4, 2012 at 9:09 AM, Fleur [guest] wrote: >> > >> > Hi, >> > I'm trying to measure the significance of my gene annotation list. >> > Gene list of interest is composed of 185 genes ( among 6000 genes).In a first step, i performed a GO term analysis but i would like to know if i could have the same result by chance. >> > My idea was to randomly select 185 genes from the 6000 genes ( 100 times for example) and annotate those list and see if i could have the same terms by chance ... But how calculate a pvalue for each term at each repetition of the permutation ? Any idea ? >> > Someone know a package which do this king of thing ? >> >> GOstats, topGo, limma (roast, romer, and friends), and several others >> might be applicable. ?You mentioned that you had done a "GO term >> analysis", so you may have already done this, depending on what you >> meant by that statement. >> >> Sean >> >> >> > Thanks in advance for your help >> > >> > >> > >> > >> > ?-- output of sessionInfo(): >> > >> > R version 2.13.0 (2011-04-13) >> > Platform: i386-pc-mingw32/i386 (32-bit) >> > >> > locale: >> > [1] LC_COLLATE=French_France.1252 ?LC_CTYPE=French_France.1252 >> > [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C >> > [5] LC_TIME=French_France.1252 >> > >> > attached base packages: >> > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >> > >> > -- >> > Sent via the guest posting facility at bioconductor.org. >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ? ? ? 
?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From jmacdon at uw.edu Mon Jun 4 16:30:33 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Mon, 04 Jun 2012 10:30:33 -0400 Subject: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' In-Reply-To: <1060A4C56F2F1943AEB02A72CEBB47810912442D@exmb3.ad.oicr.on.ca> References: <1060A4C56F2F1943AEB02A72CEBB47810912442D@exmb3.ad.oicr.on.ca> Message-ID: <4FCCC689.4050201@uw.edu> Hi Trent, On 6/4/2012 10:14 AM, Trent Simmons wrote: > Hi All, > > I am trying to perform a GCRMA normalization of microarray data that was run on Mouse Gene 1.1 Affymetrix arrays. I am running into an issue with the CDF file for that array (the CDF file is mogene11stv1mmentrezgcdf). > > I obtained the latest version of the CDF file from Brainarray. When I attempted to perform the compute.affinities step, the error message: > > .Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments > > is returned. Is anyone else experiencing this issue, or have any suggestions for how to resolve it? The default for gcrma() is to use the MM probes to estimate GC-specific background estimates. However, the Gene ST chips are PM-only chips. However, there is an argument to gcrma (NCprobe) that you can use to point to the negative control probes. > > I have also tried to perform basic RMA analysis using one of my CEL files and the same CDF file, just in case it was an issue solely with GCRMA. RMA didn't work either, but in this case, it seems to be that the RMA script is looking for a CDF file called 'mogene11stv1cdf' instead. Once again, anyone else with this issue or suggestions for a fix? See the cdfname argument to ReadAffy. 
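A sketch of what the cdfname suggestion might look like in practice. The helper name is made up, and the cdfname value is an assumption taken from the string Trent passed to compute.affinities(); as far as I understand, affy appends "cdf" to that name when it looks for the package, but check ?ReadAffy for your version. It needs the affy package and the custom CDF package installed, so it is not run here.

```r
# Sketch only: requires the Bioconductor package 'affy' and the custom
# CDF package (mogene11stv1mmentrezgcdf) to be installed.
# The function name and default value are illustrative assumptions.
rma_with_custom_cdf <- function(cel_dir, cdf = "mogene11stv1mmentrezg") {
  # cdfname overrides the chip type recorded in the CEL files, so affy
  # no longer goes looking for the missing 'mogene11stv1cdf' package.
  raw <- affy::ReadAffy(celfile.path = cel_dir, cdfname = cdf)
  affy::rma(raw)
}

# Usage (directory taken from the original post):
# eset <- rma_with_custom_cdf("~/Documents/test/mouse/CEL")
```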
Best, Jim > > > Thanks in advance, > > Trent > > ### START CODE ### > ### LOAD LIBRARIES ### > library(affy) > library(gcrma) > library(AnnotationDbi) > library(mogene11stv1mmentrezgcdf) > library(mogene11stv1mmentrezgprobe) > >> ### RUN GCRMA ### >> compute.affinities('mogene11stv1mmentrezg', verbose = TRUE) > Computing affinities.Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments >> traceback() > 1: compute.affinities("mogene11stv1mmentrezg", verbose = TRUE) > >> ### RUN RMA ### >> setwd('~/Documents/test/mouse/CEL') >> >> data<- ReadAffy() >> eset<- rma(data) > Error in getCdfInfo(object) : > Could not obtain CDF environment, problems encountered: > Specified environment does not contain MoGene-1_1-st-v1 > Library - package mogene11stv1cdf not installed > Bioconductor - mogene11stv1cdf not available >> traceback() > 12: stop(paste("Could not obtain CDF environment, problems encountered:", > paste(unlist(badOut), collapse = "\n"), sep = "\n")) > 11: getCdfInfo(object) > 10: .local(object, which, ...) > 9: indexProbes(object, "pm", genenames = genenames) > 8: indexProbes(object, "pm", genenames = genenames) > 7: .local(object, ...) > 6: pmindex(object, genenames) > 5: pmindex(object, genenames) > 4: .local(object, ...) 
> 3: probeNames(object, subset) > 2: probeNames(object, subset) > 1: rma(data) >> ### SESSION INFO ### >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] mogene11stv1mmentrezgprobe_15.1.0 mogene11stv1mmentrezgcdf_15.1.0 > [3] AnnotationDbi_1.18.1 gcrma_2.28.0 > [5] BiocInstaller_1.4.6 affy_1.34.0 > [7] Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affyio_1.24.0 Biostrings_2.24.1 DBI_0.2-5 > [4] IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 > [7] splines_2.15.0 stats4_2.15.0 tools_2.15.0 > [10] zlibbioc_1.2.0 > > ## END CODE ## > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From Trent.Simmons at oicr.on.ca Mon Jun 4 16:46:26 2012 From: Trent.Simmons at oicr.on.ca (Trent Simmons) Date: Mon, 4 Jun 2012 14:46:26 +0000 Subject: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' In-Reply-To: <4FCCC689.4050201@uw.edu> References: <1060A4C56F2F1943AEB02A72CEBB47810912442D@exmb3.ad.oicr.on.ca>, <4FCCC689.4050201@uw.edu> Message-ID: <1060A4C56F2F1943AEB02A72CEBB478109124595@exmb3.ad.oicr.on.ca> Hi Jim, That has solved the RMA issue. Thank you very much! Any idea for the GCRMA issue? 
Best, Trent ________________________________________ From: James W. MacDonald [jmacdon at uw.edu] Sent: Monday, June 04, 2012 10:30 AM To: Trent Simmons Cc: bioconductor at r-project.org Subject: Re: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' Hi Trent, On 6/4/2012 10:14 AM, Trent Simmons wrote: > Hi All, > > I am trying to perform a GCRMA normalization of microarray data that was run on Mouse Gene 1.1 Affymetrix arrays. I am running into an issue with the CDF file for that array (the CDF file is mogene11stv1mmentrezgcdf). > > I obtained the latest version of the CDF file from Brainarray. When I attempted to perform the compute.affinities step, the error message: > > .Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments > > is returned. Is anyone else experiencing this issue, or have any suggestions for how to resolve it? The default for gcrma() is to use the MM probes to estimate GC-specific background estimates. However, the Gene ST chips are PM-only chips. However, there is an argument to gcrma (NCprobe) that you can use to point to the negative control probes. > > I have also tried to perform basic RMA analysis using one of my CEL files and the same CDF file, just in case it was an issue solely with GCRMA. RMA didn't work either, but in this case, it seems to be that the RMA script is looking for a CDF file called 'mogene11stv1cdf' instead. Once again, anyone else with this issue or suggestions for a fix? See the cdfname argument to ReadAffy. 
Best, Jim > > > Thanks in advance, > > Trent > > ### START CODE ### > ### LOAD LIBRARIES ### > library(affy) > library(gcrma) > library(AnnotationDbi) > library(mogene11stv1mmentrezgcdf) > library(mogene11stv1mmentrezgprobe) > >> ### RUN GCRMA ### >> compute.affinities('mogene11stv1mmentrezg', verbose = TRUE) > Computing affinities.Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments >> traceback() > 1: compute.affinities("mogene11stv1mmentrezg", verbose = TRUE) > >> ### RUN RMA ### >> setwd('~/Documents/test/mouse/CEL') >> >> data<- ReadAffy() >> eset<- rma(data) > Error in getCdfInfo(object) : > Could not obtain CDF environment, problems encountered: > Specified environment does not contain MoGene-1_1-st-v1 > Library - package mogene11stv1cdf not installed > Bioconductor - mogene11stv1cdf not available >> traceback() > 12: stop(paste("Could not obtain CDF environment, problems encountered:", > paste(unlist(badOut), collapse = "\n"), sep = "\n")) > 11: getCdfInfo(object) > 10: .local(object, which, ...) > 9: indexProbes(object, "pm", genenames = genenames) > 8: indexProbes(object, "pm", genenames = genenames) > 7: .local(object, ...) > 6: pmindex(object, genenames) > 5: pmindex(object, genenames) > 4: .local(object, ...) 
> 3: probeNames(object, subset) > 2: probeNames(object, subset) > 1: rma(data) >> ### SESSION INFO ### >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] mogene11stv1mmentrezgprobe_15.1.0 mogene11stv1mmentrezgcdf_15.1.0 > [3] AnnotationDbi_1.18.1 gcrma_2.28.0 > [5] BiocInstaller_1.4.6 affy_1.34.0 > [7] Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affyio_1.24.0 Biostrings_2.24.1 DBI_0.2-5 > [4] IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 > [7] splines_2.15.0 stats4_2.15.0 tools_2.15.0 > [10] zlibbioc_1.2.0 > > ## END CODE ## > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. 
Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From Guido.Hooiveld at wur.nl Mon Jun 4 16:52:18 2012 From: Guido.Hooiveld at wur.nl (Hooiveld, Guido) Date: Mon, 4 Jun 2012 14:52:18 +0000 Subject: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' In-Reply-To: <4FCCC689.4050201@uw.edu> References: <1060A4C56F2F1943AEB02A72CEBB47810912442D@exmb3.ad.oicr.on.ca> <4FCCC689.4050201@uw.edu> Message-ID: Hi Trent, In addition to Jim's comments you also may want to have a look at the suggestions Mike Smith provided some time ago: http://article.gmane.org/gmane.science.biology.informatics.conductor/32506 Note: I didn't try it myself, so I cannot comment any further on it. HTH, Guido --------------------------------------------------------- Guido Hooiveld, PhD Nutrition, Metabolism & Genomics Group Division of Human Nutrition Wageningen University Biotechnion, Bomenweg 2 NL-6703 HD Wageningen the Netherlands tel: (+)31 317 485788 fax: (+)31 317 483342 email: guido.hooiveld at wur.nl internet: http://nutrigene.4t.com http://scholar.google.com/citations?user=qFHaMnoAAAAJ http://www.researcherid.com/rid/F-4912-2010 -----Original Message----- From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of James W. MacDonald Sent: Monday, June 04, 2012 16:31 To: Trent Simmons Cc: bioconductor at r-project.org Subject: Re: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' Hi Trent, On 6/4/2012 10:14 AM, Trent Simmons wrote: > Hi All, > > I am trying to perform a GCRMA normalization of microarray data that was run on Mouse Gene 1.1 Affymetrix arrays. I am running into an issue with the CDF file for that array (the CDF file is mogene11stv1mmentrezgcdf). > > I obtained the latest version of the CDF file from Brainarray. 
When I attempted to perform the compute.affinities step, the error message: > > .Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments > > is returned. Is anyone else experiencing this issue, or have any suggestions for how to resolve it? The default for gcrma() is to use the MM probes to estimate GC-specific background estimates. However, the Gene ST chips are PM-only chips. However, there is an argument to gcrma (NCprobe) that you can use to point to the negative control probes. > > I have also tried to perform basic RMA analysis using one of my CEL files and the same CDF file, just in case it was an issue solely with GCRMA. RMA didn't work either, but in this case, it seems to be that the RMA script is looking for a CDF file called 'mogene11stv1cdf' instead. Once again, anyone else with this issue or suggestions for a fix? See the cdfname argument to ReadAffy. Best, Jim > > > Thanks in advance, > > Trent > > ### START CODE ### > ### LOAD LIBRARIES ### > library(affy) > library(gcrma) > library(AnnotationDbi) > library(mogene11stv1mmentrezgcdf) > library(mogene11stv1mmentrezgprobe) > >> ### RUN GCRMA ### >> compute.affinities('mogene11stv1mmentrezg', verbose = TRUE) > Computing affinities.Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments >> traceback() > 1: compute.affinities("mogene11stv1mmentrezg", verbose = TRUE) > >> ### RUN RMA ### >> setwd('~/Documents/test/mouse/CEL') >> >> data<- ReadAffy() >> eset<- rma(data) > Error in getCdfInfo(object) : > Could not obtain CDF environment, problems encountered: > Specified environment does not contain MoGene-1_1-st-v1 Library - > package mogene11stv1cdf not installed Bioconductor - mogene11stv1cdf > not available >> traceback() > 12: stop(paste("Could not obtain CDF environment, problems encountered:", > paste(unlist(badOut), collapse = "\n"), sep = "\n")) > 11: getCdfInfo(object) > 10: .local(object, which, ...) 
> 9: indexProbes(object, "pm", genenames = genenames) > 8: indexProbes(object, "pm", genenames = genenames) > 7: .local(object, ...) > 6: pmindex(object, genenames) > 5: pmindex(object, genenames) > 4: .local(object, ...) > 3: probeNames(object, subset) > 2: probeNames(object, subset) > 1: rma(data) >> ### SESSION INFO ### >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] mogene11stv1mmentrezgprobe_15.1.0 mogene11stv1mmentrezgcdf_15.1.0 > [3] AnnotationDbi_1.18.1 gcrma_2.28.0 > [5] BiocInstaller_1.4.6 affy_1.34.0 > [7] Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affyio_1.24.0 Biostrings_2.24.1 DBI_0.2-5 > [4] IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 > [7] splines_2.15.0 stats4_2.15.0 tools_2.15.0 > [10] zlibbioc_1.2.0 > > ## END CODE ## > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. 
Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From Trent.Simmons at oicr.on.ca Mon Jun 4 16:52:33 2012 From: Trent.Simmons at oicr.on.ca (Trent Simmons) Date: Mon, 4 Jun 2012 14:52:33 +0000 Subject: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' In-Reply-To: <4FCCC689.4050201@uw.edu> References: <1060A4C56F2F1943AEB02A72CEBB47810912442D@exmb3.ad.oicr.on.ca>, <4FCCC689.4050201@uw.edu> Message-ID: <1060A4C56F2F1943AEB02A72CEBB4781091245A5@exmb3.ad.oicr.on.ca> Oh sheesh. I missed the explanation. Sorry, and thank you again! Trent Simmons Volunteer-Student Ontario Institute for Cancer Research MaRS Centre, South Tower 101 College Street, Suite 800 Toronto, Ontario, Canada M5G 0A3 Toll-free: 1-866-678-6427 www.oicr.on.ca This message and any attachments may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this message in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this message may not be that of the organization. ________________________________________ From: James W. MacDonald [jmacdon at uw.edu] Sent: Monday, June 04, 2012 10:30 AM To: Trent Simmons Cc: bioconductor at r-project.org Subject: Re: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' Hi Trent, On 6/4/2012 10:14 AM, Trent Simmons wrote: > Hi All, > > I am trying to perform a GCRMA normalization of microarray data that was run on Mouse Gene 1.1 Affymetrix arrays. 
I am running into an issue with the CDF file for that array (the CDF file is mogene11stv1mmentrezgcdf). > > I obtained the latest version of the CDF file from Brainarray. When I attempted to perform the compute.affinities step, the error message: > > .Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments > > is returned. Is anyone else experiencing this issue, or have any suggestions for how to resolve it? The default for gcrma() is to use the MM probes to estimate GC-specific background estimates. However, the Gene ST chips are PM-only chips. However, there is an argument to gcrma (NCprobe) that you can use to point to the negative control probes. > > I have also tried to perform basic RMA analysis using one of my CEL files and the same CDF file, just in case it was an issue solely with GCRMA. RMA didn't work either, but in this case, it seems to be that the RMA script is looking for a CDF file called 'mogene11stv1cdf' instead. Once again, anyone else with this issue or suggestions for a fix? See the cdfname argument to ReadAffy. 
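A minimal sketch of the cdfname suggestion, assuming the Brainarray package mogene11stv1mmentrezgcdf is installed and the CEL files sit in the directory from the original post. The helper name rma_with_custom_cdf is made up for illustration, and passing the package name without its trailing "cdf" is an assumption about how affy resolves the CDF package:

```r
# Hypothetical helper (not part of affy): run RMA against a custom CDF
# instead of the default one that affy derives from the CEL file header.
rma_with_custom_cdf <- function(cel_dir, cdf = "mogene11stv1mmentrezg") {
  if (!requireNamespace("affy", quietly = TRUE)) {
    stop("the Bioconductor package 'affy' is required")
  }
  # cdfname overrides the chip type recorded in the CEL files
  # (MoGene-1_1-st-v1), so affy should not go looking for mogene11stv1cdf.
  batch <- affy::ReadAffy(celfile.path = cel_dir, cdfname = cdf)
  affy::rma(batch)
}

# Usage, with the path from the original post (needs the CEL files on disk):
# eset <- rma_with_custom_cdf("~/Documents/test/mouse/CEL")
```

Nothing is read until the helper is called, so the definition itself can be sourced on a machine without the data.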
Best, Jim > > > Thanks in advance, > > Trent > > ### START CODE ### > ### LOAD LIBRARIES ### > library(affy) > library(gcrma) > library(AnnotationDbi) > library(mogene11stv1mmentrezgcdf) > library(mogene11stv1mmentrezgprobe) > >> ### RUN GCRMA ### >> compute.affinities('mogene11stv1mmentrezg', verbose = TRUE) > Computing affinities.Error in tmp.exprs[pmIndex[subIndex]] = apm : > NAs are not allowed in subscripted assignments >> traceback() > 1: compute.affinities("mogene11stv1mmentrezg", verbose = TRUE) > >> ### RUN RMA ### >> setwd('~/Documents/test/mouse/CEL') >> >> data<- ReadAffy() >> eset<- rma(data) > Error in getCdfInfo(object) : > Could not obtain CDF environment, problems encountered: > Specified environment does not contain MoGene-1_1-st-v1 > Library - package mogene11stv1cdf not installed > Bioconductor - mogene11stv1cdf not available >> traceback() > 12: stop(paste("Could not obtain CDF environment, problems encountered:", > paste(unlist(badOut), collapse = "\n"), sep = "\n")) > 11: getCdfInfo(object) > 10: .local(object, which, ...) > 9: indexProbes(object, "pm", genenames = genenames) > 8: indexProbes(object, "pm", genenames = genenames) > 7: .local(object, ...) > 6: pmindex(object, genenames) > 5: pmindex(object, genenames) > 4: .local(object, ...) 
> 3: probeNames(object, subset) > 2: probeNames(object, subset) > 1: rma(data) >> ### SESSION INFO ### >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] mogene11stv1mmentrezgprobe_15.1.0 mogene11stv1mmentrezgcdf_15.1.0 > [3] AnnotationDbi_1.18.1 gcrma_2.28.0 > [5] BiocInstaller_1.4.6 affy_1.34.0 > [7] Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affyio_1.24.0 Biostrings_2.24.1 DBI_0.2-5 > [4] IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 > [7] splines_2.15.0 stats4_2.15.0 tools_2.15.0 > [10] zlibbioc_1.2.0 > > ## END CODE ## > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From jmacdon at uw.edu Mon Jun 4 16:52:31 2012 From: jmacdon at uw.edu (James W. 
MacDonald) Date: Mon, 04 Jun 2012 10:52:31 -0400 Subject: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' In-Reply-To: <1060A4C56F2F1943AEB02A72CEBB478109124595@exmb3.ad.oicr.on.ca> References: <1060A4C56F2F1943AEB02A72CEBB47810912442D@exmb3.ad.oicr.on.ca>, <4FCCC689.4050201@uw.edu> <1060A4C56F2F1943AEB02A72CEBB478109124595@exmb3.ad.oicr.on.ca> Message-ID: <4FCCCBAF.2090809@uw.edu> Hi Trent, On 6/4/2012 10:46 AM, Trent Simmons wrote: > Hi Jim, > > That has solved the RMA issue. Thank you very much! Any idea for the GCRMA issue? Sure, as I noted below, you can use the NCprobe argument to gcrma() to point to the negative control probes on the chip. I have no idea if these exist on the remapped cdf - it's up to you to figure that part out. Best, Jim > > Best, > > Trent > ________________________________________ > From: James W. MacDonald [jmacdon at uw.edu] > Sent: Monday, June 04, 2012 10:30 AM > To: Trent Simmons > Cc: bioconductor at r-project.org > Subject: Re: [BioC] Unable to perform GCRMA/RMA with cdf file 'mogene11stv1mmentrezgcdf' > > Hi Trent, > > On 6/4/2012 10:14 AM, Trent Simmons wrote: >> Hi All, >> >> I am trying to perform a GCRMA normalization of microarray data that was run on Mouse Gene 1.1 Affymetrix arrays. I am running into an issue with the CDF file for that array (the CDF file is mogene11stv1mmentrezgcdf). >> >> I obtained the latest version of the CDF file from Brainarray. When I attempted to perform the compute.affinities step, the error message: >> >> .Error in tmp.exprs[pmIndex[subIndex]] = apm : >> NAs are not allowed in subscripted assignments >> >> is returned. Is anyone else experiencing this issue, or have any suggestions for how to resolve it? > The default for gcrma() is to use the MM probes to estimate GC-specific > background estimates. However, the Gene ST chips are PM-only chips. > However, there is an argument to gcrma (NCprobe) that you can use to > point to the negative control probes.
> > >> I have also tried to perform basic RMA analysis using one of my CEL files and the same CDF file, just in case it was an issue solely with GCRMA. RMA didn't work either, but in this case, it seems to be that the RMA script is looking for a CDF file called 'mogene11stv1cdf' instead. Once again, anyone else with this issue or suggestions for a fix? > See the cdfname argument to ReadAffy. > > Best, > > Jim > > >> >> Thanks in advance, >> >> Trent >> >> ### START CODE ### >> ### LOAD LIBRARIES ### >> library(affy) >> library(gcrma) >> library(AnnotationDbi) >> library(mogene11stv1mmentrezgcdf) >> library(mogene11stv1mmentrezgprobe) >> >>> ### RUN GCRMA ### >>> compute.affinities('mogene11stv1mmentrezg', verbose = TRUE) >> Computing affinities.Error in tmp.exprs[pmIndex[subIndex]] = apm : >> NAs are not allowed in subscripted assignments >>> traceback() >> 1: compute.affinities("mogene11stv1mmentrezg", verbose = TRUE) >> >>> ### RUN RMA ### >>> setwd('~/Documents/test/mouse/CEL') >>> >>> data<- ReadAffy() >>> eset<- rma(data) >> Error in getCdfInfo(object) : >> Could not obtain CDF environment, problems encountered: >> Specified environment does not contain MoGene-1_1-st-v1 >> Library - package mogene11stv1cdf not installed >> Bioconductor - mogene11stv1cdf not available >>> traceback() >> 12: stop(paste("Could not obtain CDF environment, problems encountered:", >> paste(unlist(badOut), collapse = "\n"), sep = "\n")) >> 11: getCdfInfo(object) >> 10: .local(object, which, ...) >> 9: indexProbes(object, "pm", genenames = genenames) >> 8: indexProbes(object, "pm", genenames = genenames) >> 7: .local(object, ...) >> 6: pmindex(object, genenames) >> 5: pmindex(object, genenames) >> 4: .local(object, ...) 
>> 3: probeNames(object, subset) >> 2: probeNames(object, subset) >> 1: rma(data) >>> ### SESSION INFO ### >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 >> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] mogene11stv1mmentrezgprobe_15.1.0 mogene11stv1mmentrezgcdf_15.1.0 >> [3] AnnotationDbi_1.18.1 gcrma_2.28.0 >> [5] BiocInstaller_1.4.6 affy_1.34.0 >> [7] Biobase_2.16.0 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.24.0 Biostrings_2.24.1 DBI_0.2-5 >> [4] IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 >> [7] splines_2.15.0 stats4_2.15.0 tools_2.15.0 >> [10] zlibbioc_1.2.0 >> >> ## END CODE ## >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From ahmetzehir at gmail.com Mon Jun 4 17:01:43 2012 From: ahmetzehir at gmail.com (Ahmet ZEHIR) Date: Mon, 4 Jun 2012 11:01:43 -0400 Subject: [BioC] DEXSeq package: Error in DEXSeqHTML Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From mtmorgan at fhcrc.org Mon Jun 4 20:18:33 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Mon, 04 Jun 2012 11:18:33 -0700 Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9 In-Reply-To: <4FCA75FF.7020206@fhcrc.org> References: <4FCA627B.6030906@fhcrc.org> <4FCA71FA.7060908@fhcrc.org> <4FCA75FF.7020206@fhcrc.org> Message-ID: <4FCCFBF9.8010303@fhcrc.org> (cc'ing list, for posterity) On 06/02/2012 01:22 PM, Martin Morgan wrote: > On 06/02/2012 01:08 PM, sheng zhao wrote: >> Hi Martin, >> >> My answer is yes. Please see following: >> >> >> mac$ R --vanilla >> >> R version 2.15.0 (2012-03-30) >> Copyright (C) 2012 The R Foundation for Statistical Computing >> ISBN 3-900051-07-0 >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> R is free software and comes with ABSOLUTELY NO WARRANTY. >> You are welcome to redistribute it under certain conditions. >> Type 'license()' or 'licence()' for distribution details. >> >> Natural language support but running in an English locale >> >> R is a collaborative project with many contributors. >> Type 'contributors()' for more information and >> 'citation()' on how to cite R or R packages in publications. >> >> Type 'demo()' for some demos, 'help()' for on-line help, or >> 'help.start()' for an HTML browser interface to help. >> Type 'q()' to quit R. 
>> >> > library(GO.db) >> Loading required package: AnnotationDbi >> Loading required package: BiocGenerics >> >> Attaching package: 'BiocGenerics' >> >> The following object(s) are masked from 'package:stats': >> >> xtabs >> >> The following object(s) are masked from 'package:base': >> >> Filter, Find, Map, Position, Reduce, anyDuplicated, cbind, >> colnames, duplicated, eval, get, intersect, lapply, mapply, mget, >> order, paste, pmax, pmax.int , pmin, pmin.int >> , rbind, rep.int , >> rownames, sapply, setdiff, table, tapply, union, unique >> >> Loading required package: Biobase >> Welcome to Bioconductor >> >> Vignettes contain introductory material; view with >> 'browseVignettes()'. To cite Bioconductor, see >> 'citation("Biobase")', and for packages 'citation("pkgname")'. >> >> Loading required package: DBI >> Error : .onLoad failed in loadNamespace() for 'GO.db', details: >> call: ls(envir, all.names = TRUE) >> error: 7 arguments passed to .Internal(identical) which requires 6 >> Error: package/namespace load failed for 'GO.db' >> > > I am looking at Prof. Ripley's post here > > http://r.789695.n4.nabble.com/7-arguments-passed-to-Internal-identical-which-requires-6-td4548460.html > > > and the time stamp on your version of R (March 30). I do not really have > a good suggestion; perhaps (a) traceback() after the error; (b) > installing GO.db from source (biocLite("GO.db", type="source")). And since > > trace(loadNamespace, tracer=quote(print(package))) > library(GO.db) > > ends with > > Loading required package: DBI > Tracing loadNamespace(package, c(which.lib.loc, lib.loc)) on entry > [1] "DBI" > Tracing loadNamespace(package, c(which.lib.loc, lib.loc)) on entry > [1] "RSQLite" > > for me these two packages also are candidates for suspicion. > > Martin > > > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. 
PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 From dtenenba at fhcrc.org Mon Jun 4 20:29:12 2012 From: dtenenba at fhcrc.org (Dan Tenenbaum) Date: Mon, 4 Jun 2012 11:29:12 -0700 Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9 In-Reply-To: <4FCCFBF9.8010303@fhcrc.org> References: <4FCA627B.6030906@fhcrc.org> <4FCA71FA.7060908@fhcrc.org> <4FCA75FF.7020206@fhcrc.org> <4FCCFBF9.8010303@fhcrc.org> Message-ID: On Mon, Jun 4, 2012 at 11:18 AM, Martin Morgan wrote: > (cc'ing list, for posterity) > > On 06/02/2012 01:22 PM, Martin Morgan wrote: >> >> On 06/02/2012 01:08 PM, sheng zhao wrote: >>> >>> Hi Martin, >>> >>> My answer is yes. Please see following: >>> >>> >>> mac$ R --vanilla >>> >>> R version 2.15.0 (2012-03-30) >>> Copyright (C) 2012 The R Foundation for Statistical Computing >>> ISBN 3-900051-07-0 >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> R is free software and comes with ABSOLUTELY NO WARRANTY. >>> You are welcome to redistribute it under certain conditions. >>> Type 'license()' or 'licence()' for distribution details. >>> >>> Natural language support but running in an English locale >>> >>> R is a collaborative project with many contributors. >>> Type 'contributors()' for more information and >>> 'citation()' on how to cite R or R packages in publications. >>> >>> Type 'demo()' for some demos, 'help()' for on-line help, or >>> 'help.start()' for an HTML browser interface to help. >>> Type 'q()' to quit R. 
>>> >>> > library(GO.db) >>> Loading required package: AnnotationDbi >>> >>> Loading required package: BiocGenerics >>> >>> Attaching package: 'BiocGenerics' >>> >>> The following object(s) are masked from 'package:stats': >>> >>> xtabs >>> >>> The following object(s) are masked from 'package:base': >>> >>> Filter, Find, Map, Position, Reduce, anyDuplicated, cbind, >>> colnames, duplicated, eval, get, intersect, lapply, mapply, mget, >>> order, paste, pmax, pmax.int , pmin, pmin.int >>> , rbind, rep.int , >>> rownames, sapply, setdiff, table, tapply, union, unique >>> >>> Loading required package: Biobase >>> Welcome to Bioconductor >>> >>> Vignettes contain introductory material; view with >>> 'browseVignettes()'. To cite Bioconductor, see >>> 'citation("Biobase")', and for packages 'citation("pkgname")'. >>> >>> Loading required package: DBI >>> Error : .onLoad failed in loadNamespace() for 'GO.db', details: >>> call: ls(envir, all.names = TRUE) >>> error: 7 arguments passed to .Internal(identical) which requires 6 >>> Error: package/namespace load failed for 'GO.db' >>> >> I experienced the same problem, but fixed it by following Prof. Ripley's advice and updating to R-patched (which you can get at http://r.research.att.com/). Thanks, Dan >> I am looking at Prof. Ripley's post here >> >> >> http://r.789695.n4.nabble.com/7-arguments-passed-to-Internal-identical-which-requires-6-td4548460.html >> >> >> and the time stamp on your version of R (March 30). I do not really have >> a good suggestion; perhaps (a) traceback() after the error; (b) >> installing GO.db from source (biocLite("GO.db", type="source")). 
And since >> >> trace(loadNamespace, tracer=quote(print(package))) >> library(GO.db) >> >> ends with >> >> Loading required package: DBI >> Tracing loadNamespace(package, c(which.lib.loc, lib.loc)) on entry >> [1] "DBI" >> Tracing loadNamespace(package, c(which.lib.loc, lib.loc)) on entry >> [1] "RSQLite" >> >> for me these two packages also are candidates for suspicion. >> >> Martin >> >> >> >> > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From smelov at buckinstitute.org Tue Jun 5 00:26:26 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Mon, 4 Jun 2012 15:26:26 -0700 Subject: [BioC] oligo and pdInfoBuilder In-Reply-To: References: Message-ID: That was it, Ben set me straight, sorry for the time waster! On Jun 3, 2012, at 8:33 PM, Vincent Carey wrote: On Sun, Jun 3, 2012 at 11:08 PM, Simon Melov > wrote: I've been beating my head against the wall for several hours trying to see why I cant build a library for a nimblegen expression array, using the oligo and pdInfoBuilder packages. I dont see any errors when building the library for a 12plex yeast genome expression array.... makePdInfoPackage(seed, destDir = "/Library/Frameworks/R.framework/Versions/2.15/Resources/library/") ================================================================================ Building annotation package for Nimblegen Expression Array NDF: 100718_Scer_EXP.ndf XYS: 532785A01_Chip6.330.2012.5.8_532.xys ================================================================================ Parsing file: 100718_Scer_EXP.ndf... OK Parsing file: 532785A01_Chip6.330.2012.5.8_532.xys... 
OK Merging NDF and XYS files... OK Preparing contents for featureSet table... OK Preparing contents for bgfeature table... OK Preparing contents for pmfeature table... OK Creating package in /Library/Frameworks/R.framework/Versions/2.15/Resources/library//pd.100718.scer.exp Inserting 5777 rows into table featureSet... OK Inserting 137040 rows into table pmfeature... OK Counting rows in featureSet Counting rows in pmfeature Creating index idx_pmfsetid on pmfeature... OK Creating index idx_pmfid on pmfeature... OK Creating index idx_fsfsetid on featureSet... OK Saving DataFrame object for PM. Done. > but when I try to use the library in oligo, it fails for some reason library(pd.100718.scer.exp) Error in library(pd.100718.scer.exp) : 'pd.100718.scer.exp' is not a valid installed package What happened when you ran R CMD INSTALL (or install.packages?) sessionInfo()? > Can anyone help? I'm at my wits end... _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From robert.castelo at upf.edu Tue Jun 5 09:45:11 2012 From: robert.castelo at upf.edu (Robert Castelo) Date: Tue, 5 Jun 2012 09:45:11 +0200 (CEST) Subject: [BioC] Rsubread crashes in 32bit linux Message-ID: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> hi, the computer room at my university where we do practicals on R & Bioconductor runs a 32bit linux distribution and when i tried to run the latest version of the Rsubread package (1.6.3) it crashes when calling the buildindex() function on a multifasta file with the yeast genome. this does *not* happen under a 64bit linux distribution. i have verified that installing the version before (1.4.4) on the current R 2.15 it also crashes (on the 32bit), but two versions before, the 1.1.1, it does *not* and it works smoothly on this 32bit linux distribution. 
i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 where allChr.fa is the multifasta file with the yeast genome. so i can manage by now the problem by using the 1.1.1 version on R 2.15 for my teaching but i wonder whether there would be some easy solution for this, or even if it could be a symptom of something else that the Rsubread developers should worry about. i know that using a 32bit system nowadays is quite obsolete but this is what i got for teaching :( and i would be happy to let my students play with the latest version of Rsubread in the future. thanks!!! robert. ======================Rsubread 1.6.3 on R 2.15======================= > library(Rsubread) > sessionInfo() R version 2.15.0 (2012-03-30) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsubread_1.6.3 > buildindex(basename="subreadindex", reference="allChr.fa", memory=2500) Building a base-space index. Size of memory used=2500 MB Base name of the built index = subreadindex *** caught segfault *** address 0xdf670cc0, cause 'memory not mapped' Traceback: 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = as.character(cmd), PACKAGE = "Rsubread") 2: buildindex(basename = "subreadindex", reference = "allChr.fa", memory = 2500) Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace Selection: ======================Rsubread 1.1.1 on R 2.15======================= > library(Rsubread) > buildindex(basename="subreadindex", reference="allChr.fa", memory=2500) Building the index in the base space. 
Size of memory requested=2500 MB Index base name = subreadindex INDEX ITEMS PER PARTITION = 275940352 completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps completed=81.76%; time used=2.4s; rate=4111.8k bps/s; total=12m bps All the chromosome files are processed. | Dumping index [===========================================================>] Index subreadindex is successfully built. > sessionInfo() R version 2.15.0 (2012-03-30) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsubread_1.1.1 From alejandro.reyes at embl.de Tue Jun 5 10:17:05 2012 From: alejandro.reyes at embl.de (Alejandro Reyes) Date: Tue, 5 Jun 2012 10:17:05 +0200 Subject: [BioC] DEXSeq package: Error in DEXSeqHTML In-Reply-To: References: Message-ID: <4FCDC081.2070104@embl.de> Dear Ahmet, Thanks for your report! You are using a quite old version of DEXSeq (1.0.2). I would recommend updating to one of the most recent versions; in both the current stable release and the current devel version this should no longer be a problem. Let me know if that is not the case. Best wishes, Alejandro > Dear list, > > I am using DEXseq to look at differential exon usage and everything seems > to work just fine.
When I want to create an HTML report however, I get the > following error: > >> DEXSeqHTML(ecsA673vshMSC, FDR = 0.001, color = c("#FF000080", > "#0000FF80")) > Error in plot.new() : figure margins too large > In addition: Warning message: > In plotDEXSeq(ecs, geneID = gene, FDR = FDR, lwd = 2, expression = opts[1], > : > This gene contains more than 42 transcripts annotated, only the first 42 > will be plotted > > After this error, the function quits and the output is not complete. Is > there a way to turn this message off and continue outputting the rest of > the HTML report? > > Thanks, > >> sessionInfo() > R version 2.14.2 (2012-02-29) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C/en_US.UTF-8/C/C/C/C > > attached base packages: > [1] grDevices datasets splines graphics utils grid stats > methods base > > other attached packages: > [1] edgeR_2.4.6 limma_3.10.3 biomaRt_2.10.0 DEXSeq_1.0.2 > Biobase_2.14.0 plyr_1.7.1 reshape2_1.2.1 > [8] survival_2.36-14 RSQLite_0.11.1 DBI_0.2-5 knitr_0.5 > gplots_2.10.1 KernSmooth_2.23-7 caTools_1.12 > [15] bitops_1.0-4.1 gdata_2.8.2 gtools_2.6.2 > RColorBrewer_1.0-5 ggplot2_0.9.1 > > loaded via a namespace (and not attached): > [1] MASS_7.3-18 RCurl_1.91-1 Rcpp_0.9.10 XML_3.9-4 > codetools_0.2-8 colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2 > [9] evaluate_0.4.2 formatR_0.4 highlight_0.3.1 hwriter_1.3 > labeling_0.1 memoise_0.1 munsell_0.3 parser_0.0-14 > [17] proto_0.3-9.2 scales_0.2.1 statmod_1.4.14 stringr_0.6 > tools_2.14.2 > From harryzs1981 at gmail.com Tue Jun 5 10:13:44 2012 From: harryzs1981 at gmail.com (sheng zhao) Date: Tue, 5 Jun 2012 10:13:44 +0200 Subject: [BioC] [ChIPpeakAnno] Can not start ChIPpeakAnno after update to version 2.5.9 In-Reply-To: References: <4FCA627B.6030906@fhcrc.org> <4FCA71FA.7060908@fhcrc.org> <4FCA75FF.7020206@fhcrc.org> <4FCCFBF9.8010303@fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From joseph.barry at embl.de Tue Jun 5 13:01:09 2012 From: joseph.barry at embl.de (Joseph Barry) Date: Tue, 5 Jun 2012 13:01:09 +0200 Subject: [BioC] Bioconductor package cellHTS2 In-Reply-To: References: Message-ID: <4CF55F31-3215-4281-9C48-402D510DEDDA@embl.de> Dear Juliane, I have cc'd this email to the bioconductor mailing list as it may be of general interest. Thanks for isolating this issue in a neat example. It really makes things easier to debug! As a result I have recreated the error locally and been able to resolve it. In a nutshell, the error occurs due to an ordering problem of the measurementNames argument to buildCellHTS2(). order(c('Channel 1', 'Channel 2', 'Channel 10')) makes it clear why. channelNames() enforces precisely this ordering of measurementNames later in the buildCellHTS2() function, which results in the error that you report. I have introduced an extra ordering step into the buildCellHTS2() function in cellHTS2-devel to resolve the issue. If you check out and install the latest cellHTS2-devel (see http://wiki.fhcrc.org/bioc/SvnHowTo for details), this should resolve your problem. Best wishes, Joseph Barry On Jun 4, 2012, at 9:44 AM, Siebourg Juliane wrote: > Dear Joseph Barry, > > I am using your R Bioconductor package cellHTS2 and came across something I do not understand (it might be a bug?). > I have screening data with many dimensions (channels). When I create cellHTS objects with the function 'buildCellHTS2(xd)' I noticed the following: Whenever there are more than 9 different Channels, they somehow get misnamed or shuffled in the cellHTS object.
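The ordering problem Joseph describes in his reply above can be reproduced in base R without cellHTS2. This is only an illustrative sketch of how character sorting treats names with numeric suffixes, using the Channel_1 to Channel_10 names from Juliane's example; it does not show what channelNames() does internally:

```r
# Channel names as built in the example from the original post.
channels <- paste("Channel", 1:10, sep = "_")

# Character sorting compares the names character by character, so
# "Channel_10" lands right after "Channel_1" instead of at the end.
# method = "radix" sorts in the C locale, which keeps the result
# independent of the session's collation settings.
alphabetical <- sort(channels, method = "radix")
print(alphabetical)

# Ordering by the numeric suffix keeps the channels in their original order.
suffix <- as.integer(sub("^Channel_", "", channels))
numeric_order <- channels[order(suffix)]
print(numeric_order)
```

With nine channels or fewer the two orderings coincide, which would fit the observation that the problem only shows up from the tenth channel on.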
> Here is a small example showing the issue: > > > wells = sprintf("%s%02d", rep(LETTERS[1:4], each=6), 1:6) > > xd = expand.grid(well=wells,plate=1:3, replicate=1:2) > > nc<-10 #number of channels > > data<-matrix(rep(1:nc,nrow(xd)),nrow=nrow(xd),byrow=TRUE) > > colnames(data)<-paste('Channel',1:nc,sep='_') > > xd<-cbind(xd,data) > > x = buildCellHTS2(xd) > > > head(xd) > > head(Data(x)[,1,]) > > If you change nc to something smaller than 10 everything runs fine. Am I doing something wrong? > I use the buildCellHTS2 function since I get my data already in a csv format. > > Thanks in advance for your help, > Juliane Siebourg From ovokeraye at gmail.com Tue Jun 5 13:53:41 2012 From: ovokeraye at gmail.com (Ovokeraye Achinike-Oduaran) Date: Tue, 5 Jun 2012 13:53:41 +0200 Subject: [BioC] BiomaRt query error Message-ID: Hi all, I ran a list of genes through biomaRt with the following code and it gives me this error in the snp retrieval aspect of it. I doubt it's a connection/proxy problem because I have that taken care of, I think and every step prior to that seemed to have worked just fine. Any ideas what the problem might be? Thanks. 
-Avoks mart = useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl") genes = read.delim("DAVID_BFE_Genes_4_06_2012.txt", header = TRUE) results = getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name","strand", "transcript_start", "transcript_end" ), filters = "hgnc_symbol", values = genes$Symbol, mart = mart) mart2 = useMart(biomart="snp", dataset="hsapiens_snp") results2 = getBM(attributes = c("refsnp_id", "allele", "snp", "chrom_strand", "cds_start","cds_end","validated", "consequence_type_tv","phenotype_name"), filters = "ensembl_gene", values = results$ensembl_gene_id, mart = mart2) sessionInfo() R version 2.15.0 (2012-03-30) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_1252 LC_CTYPE=English_.1252 [3] LC_MONETARY=English_.1252 LC_NUMERIC=C [5] LC_TIME=English_.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.4.6 biomaRt_2.12.0 loaded via a namespace (and not attached): [1] RCurl_1.91-1.1 tools_2.15.0 XML_3.9-4.1 > From ovokeraye at gmail.com Tue Jun 5 13:56:17 2012 From: ovokeraye at gmail.com (Ovokeraye Achinike-Oduaran) Date: Tue, 5 Jun 2012 13:56:17 +0200 Subject: [BioC] BiomaRt Query error Edited Message-ID: My apologies. I omitted the "error" in my initial post. Hi all, I ran a list of genes through biomaRt with the following code and it gives me this error in the snp retrieval aspect of it. I doubt it's a connection/proxy problem because I have that taken care of, I think and every step prior to that seemed to have worked just fine. Any ideas what the problem might be? Thanks. 
-Avoks

mart = useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
genes = read.delim("DAVID_BFE_Genes_4_06_2012.txt", header = TRUE)
results = getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name",
                               "strand", "transcript_start", "transcript_end"),
                filters = "hgnc_symbol", values = genes$Symbol, mart = mart)
mart2 = useMart(biomart="snp", dataset="hsapiens_snp")
results2 = getBM(attributes = c("refsnp_id", "allele", "snp", "chrom_strand",
                                "cds_start", "cds_end", "validated",
                                "consequence_type_tv", "phenotype_name"),
                 filters = "ensembl_gene", values = results$ensembl_gene_id, mart = mart2)

ERROR: The requested URL could not be retrieved

Error in getBM(attributes = c("refsnp_id", "allele", "snp", "chrom_strand", :
  The query to the BioMart webservice returned an invalid result: the number of columns in the result table does not equal the number of attributes in the query. Please report this to the mailing list.
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_1252 LC_CTYPE=English_.1252
[3] LC_MONETARY=English_.1252 LC_NUMERIC=C
[5] LC_TIME=English_.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocInstaller_1.4.6 biomaRt_2.12.0

loaded via a namespace (and not attached):
[1] RCurl_1.91-1.1 tools_2.15.0 XML_3.9-4.1
>

From guest at bioconductor.org Tue Jun 5 15:09:17 2012
From: guest at bioconductor.org (Senhao Zhang [guest])
Date: Tue, 5 Jun 2012 06:09:17 -0700 (PDT)
Subject: [BioC] edgeR Error in `colnames<-`(`*tmp*`, value = c(\"1\", \"2\", \"3\", \"4\")) :
Message-ID: <20120605130917.0CE62133CCB@mamba.fhcrc.org>

Hi,

I am learning edgeR. I have no prior programming experience. While working through the exercises in case study 1 of the edgeR User's Guide, I got an error.

> d <- readDGE(targets, skip=5, comment.char="!")
Error in `colnames<-`(`*tmp*`, value = c("1", "2", "3", "4")) :
  length of 'dimnames' [2] not equal to array extent

I searched around, but I haven't fixed it yet. Any responses would be greatly appreciated. Thank you!
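One thing worth checking, offered as a guess rather than a confirmed diagnosis: the targets table in the session output of this post has a column named `file`, whereas readDGE() is assumed here to look up the file names in a column named `files`. If that assumption holds, readDGE() would find no files, build a count matrix with zero columns, and then fail when attaching the four row labels "1" to "4" as column names, which would match the error reported. The base-R sketch below only demonstrates the column lookup and the rename; it does not call edgeR, and the data frame is retyped from the post.

```r
# Targets table retyped from the post; note the column is named "file".
targets <- data.frame(
  file = c("NC1.txt", "NC2.txt", "Tu98.txt", "Tu102.txt"),
  group = c("NC", "NC", "Tu", "Tu"),
  description = c("Normal colon", "Normal colon", "tumour", "tumour"),
  stringsAsFactors = FALSE
)

# `$` partial-matches only when the name given is a prefix of a column name,
# so asking for "files" on a column called "file" returns NULL.
files_before <- targets$files

# Rename the column to the name readDGE() is ASSUMED to expect.
names(targets)[names(targets) == "file"] <- "files"
files_after <- targets$files

print(is.null(files_before))
print(files_after)
```

If the column name is indeed the cause, re-running readDGE(targets, skip=5, comment.char="!") with the renamed table should get past the `colnames<-` error; whether the count files themselves are then read correctly is a separate question.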
-- output of sessionInfo(): > targets <- read.delim("Targets.txt",stringsAsFactors=FALSE) > targets file group description 1 NC1.txt NC Normal colon 2 NC2.txt NC Normal colon 3 Tu98.txt Tu tumour 4 Tu102.txt Tu tumour > d <- readDGE(targets, skip=5, comment.char="!") Error in `colnames<-`(`*tmp*`, value = c("1", "2", "3", "4")) : length of 'dimnames' [2] not equal to array extent > summary(targets) file group description Length:4 Length:4 Length:4 Class :character Class :character Class :character Mode :character Mode :character Mode :character > names(targets) [1] "file" "group" "description" > str(targets) 'data.frame': 4 obs. of 3 variables: $ file : chr "NC1.txt" "NC2.txt" "Tu98.txt" "Tu102.txt" $ group : chr "NC" "NC" "Tu" "Tu" $ description: chr "Normal colon" "Normal colon" "tumour" "tumour" -- Sent via the guest posting facility at bioconductor.org. From ahmetzehir at gmail.com Tue Jun 5 15:39:30 2012 From: ahmetzehir at gmail.com (Ahmet ZEHIR) Date: Tue, 5 Jun 2012 09:39:30 -0400 Subject: [BioC] DEXSeq package: Error in DEXSeqHTML In-Reply-To: <4FCDC081.2070104@embl.de> References: <4FCDC081.2070104@embl.de> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From tooyoung at gmail.com Tue Jun 5 15:43:37 2012 From: tooyoung at gmail.com (Dan Du) Date: Tue, 05 Jun 2012 15:43:37 +0200 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> Message-ID: <1338903817.2062.284.camel@yangdu-desktop> Hi Robert, I have been experiencing something else, possibly related to yours, on a 64bit ubuntu laptop with 6g of ram. As I recall, when bumping to Bioc 2.10, the Rsubread installation kind of ate all the memory, basically froze the system so I had to call it off, yet building it on the server side turned out fine. 
So I think I just accepted that the new version may be 'computationally heavy' thus not suitable for a normal pc, though I did not find any mentioning of this increased memory requirement in the NEWS file. So currently Rsubread stays at 1.4.4 on that pc, all subsequent versions of Rsubread drain the memory in the same way when compiling Rsubread.so. Now I think I can confirm this on a 32-bit opensuse box, it did successfully built, but when running the example code in the manual, same segfault happens. > library(Rsubread) > ref <- system.file("extdata","reference.fa",package="Rsubread") > path <- system.file("extdata",package="Rsubread") > buildindex(basename=file.path(path,"reference_index"),reference=ref) Building a base-space index. Size of memory used=3700 MB Base name of the built index = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index *** caught segfault *** address 0xdf03ee80, cause 'memory not mapped' Traceback: 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = as.character(cmd), PACKAGE = "Rsubread") 2: buildindex(basename = file.path(path, "reference_index"), reference = ref) > sessionInfo() R version 2.15.0 Patched (2012-06-04 r59517) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsubread_1.6.3 Regards, Dan On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: > hi, > > the computer room at my university where we do practicals on R & Bioconductor runs a 32bit linux distribution and when i tried to run the latest version of the Rsubread package (1.6.3) it crashes when calling the buildindex() function on a multifasta file with the yeast genome. 
this does *not* happen under a 64bit linux distribution. > > i have verified that installing the version before (1.4.4) on the current R 2.15 it also crashes (on the 32bit), but two versions before, the 1.1.1, it does *not* and it works smoothly on this 32bit linux distribution. > > i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 where allChr.fa is the multifasta file with the yeast genome. > > so i can manage by now the problem by using the 1.1.1 version on R 2.15 for my teaching but i wonder whether there would be some easy solution for this, or even if it could be a symptom of something else that the Rsubread developers should worry about. i know that using a 32bit system nowadays is quite obsolete but this is what i got for teaching :( and i would be happy to let my students play with the latest version of Rsubread in the future. > > > thanks!!! > robert. > > ======================Rsubread 1.6.3 on R 2.15======================= > > > library(Rsubread) > > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsubread_1.6.3 > > > buildindex(basename="subreadindex", reference="allChr.fa", memory=2500) > > Building a base-space index. 
> Size of memory used=2500 MB > Base name of the built index = subreadindex > > *** caught segfault *** > address 0xdf670cc0, cause 'memory not mapped' > > Traceback: > 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = as.character(cmd), PACKAGE = "Rsubread") > 2: buildindex(basename = "subreadindex", reference = "allChr.fa", memory = 2500) > > Possible actions: > 1: abort (with core dump, if enabled) > 2: normal R exit > 3: exit R without saving workspace > 4: exit R saving workspace > Selection: > > > ======================Rsubread 1.1.1 on R 2.15======================= > > > library(Rsubread) > > buildindex(basename="subreadindex", reference="allChr.fa", memory=2500) > > Building the index in the base space. > Size of memory requested=2500 MB > Index base name = subreadindex > INDEX ITEMS PER PARTITION = 275940352 > > completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps completed=81.76%; time used=2.4s; rate=4111.8k bps/s; total=12m bps > All the chromosome files are processed. > | Dumping index [===========================================================>] > Index subreadindex is successfully built. 
> > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsubread_1.1.1 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From ahmetzehir at gmail.com Tue Jun 5 15:43:27 2012 From: ahmetzehir at gmail.com (Ahmet ZEHIR) Date: Tue, 5 Jun 2012 09:43:27 -0400 Subject: [BioC] DEXSeq package: Error in DEXSeqHTML In-Reply-To: References: <4FCDC081.2070104@embl.de> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From alejandro.reyes at embl.de Tue Jun 5 15:51:02 2012 From: alejandro.reyes at embl.de (Alejandro Reyes) Date: Tue, 5 Jun 2012 15:51:02 +0200 Subject: [BioC] DEXSeq package: Error in DEXSeqHTML In-Reply-To: References: <4FCDC081.2070104@embl.de> Message-ID: <4FCE0EC6.8090806@embl.de> Dear Ahmet, It depends on the R version you are using, 1.0.2 is the release for the R version you are using (2.14.2), if you update your R to a newest version and install it again via "biocLite", it will install the newest version of DEXSeq. Best wishes, Alejandro > oh, I just typed this: > > > source("http://bioconductor.org/biocLite.R") > biocLite("DEXSeq") > > It installed 1.0.2 again. I'll download the updated version and install > it manually but something might be broken there. 
(I check for package > updates frequently and DEXSeq never updated there either) > > > Ahmet > On Tue, Jun 5, 2012 at 9:39 AM, Ahmet ZEHIR > wrote: > > Dear Alejandro, > > I think back in the day I installed DEXSeq through NCI's mirror and > it seems like the package is not updated there. I'm assuming that's > why I missed the updates (here 1.0.2 is cited as the current > version: > http://watson.nci.nih.gov/bioc_mirror/packages/2.9/bioc/html/DEXSeq.html) > > I will update to 1.2 right away. Thanks for your reply! > > Cheers, > > Ahmet Z. > > > On Tue, Jun 5, 2012 at 4:17 AM, Alejandro Reyes > > wrote: > > Dear Ahmet, > > Thanks for your report! > You are using a quite old version of DEXSeq (1.0.2), I would > recommend you try updating to one of the most recent versions, > both in the current stable or the current devel this should not > be a problem any more.Let me know if its not the case. > > Best wishes, > Alejandro > > > > Dear list, > > I am using DEXseq to look at differential exon usage and > everything seems > to work just fine. When I want to create an HTML report > however, I get the > following error: > > DEXSeqHTML(ecsA673vshMSC, FDR = 0.001, color = > c("#FF000080", > > "#0000FF80")) > Error in plot.new() : figure margins too large > In addition: Warning message: > In plotDEXSeq(ecs, geneID = gene, FDR = FDR, lwd = 2, > expression = opts[1], > : > This gene contains more than 42 transcripts annotated, > only the first 42 > will be plotted > > After this error, the function quits and the output is not > complete. Is > there a way to turn this message off and continue outputting > the rest of > the HTML report? 
> > Thanks, > > sessionInfo() > > R version 2.14.2 (2012-02-29) > Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit) > > locale: > [1] C/en_US.UTF-8/C/C/C/C > > attached base packages: > [1] grDevices datasets splines graphics utils grid > stats > methods base > > other attached packages: > [1] edgeR_2.4.6 limma_3.10.3 biomaRt_2.10.0 > DEXSeq_1.0.2 > Biobase_2.14.0 plyr_1.7.1 reshape2_1.2.1 > [8] survival_2.36-14 RSQLite_0.11.1 DBI_0.2-5 > knitr_0.5 > gplots_2.10.1 KernSmooth_2.23-7 caTools_1.12 > [15] bitops_1.0-4.1 gdata_2.8.2 gtools_2.6.2 > RColorBrewer_1.0-5 ggplot2_0.9.1 > > loaded via a namespace (and not attached): > [1] MASS_7.3-18 RCurl_1.91-1 Rcpp_0.9.10 > XML_3.9-4 > codetools_0.2-8 colorspace_1.1-1 dichromat_1.2-4 > digest_0.5.2 > [9] evaluate_0.4.2 formatR_0.4 highlight_0.3.1 > hwriter_1.3 > labeling_0.1 memoise_0.1 munsell_0.3 > parser_0.0-14 > [17] proto_0.3-9.2 scales_0.2.1 statmod_1.4.14 > stringr_0.6 > tools_2.14.2 > > > > > > -- > /Ahmet Z./ > > > > > -- > /Ahmet Z./ From huangji at ohsu.edu Tue Jun 5 18:44:16 2012 From: huangji at ohsu.edu (Jing Huang) Date: Tue, 5 Jun 2012 09:44:16 -0700 Subject: [BioC] predicting transcription factors Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From pshannon at fhcrc.org Tue Jun 5 19:24:18 2012 From: pshannon at fhcrc.org (Paul Shannon) Date: Tue, 5 Jun 2012 10:24:18 -0700 Subject: [BioC] predicting transcription factors In-Reply-To: References: Message-ID: <129A8C7E-F590-4E9F-9194-9DE1E16D6090@fhcrc.org> Dear Jing Huang, Let me make sure I correctly grasp your problem. 1) You have a set of co-regulated genes 2) You wish to identify possible shared transcription factors for these genes If this is an accurate statement, then one good approach is to 1) Obtain promoter sequence of each gene, often estimated to be 3k upstream, and 300 bases downstream, of the transcription start site. BioC provides excellent tools and data for this. 
2) Search for enriched transcription factor binding sites in these promoters. The meme website (or the downloaded meme software) is one traditional way to do the search. We have some BioC packages, including MotIV, and my soon-to-be-released collection of transcription factor matrices, which provide a good solution for this also, and have the advantage that your analysis can be performed entirely in R, reproducibly. Is this helpful? I can provide more detail. - Paul On Jun 5, 2012, at 9:44 AM, Jing Huang wrote: > Hi Experts, > > I am interested in predicting transcription factors for a specific family of genes. According to my readings, it is possible to predict transcription factors for the genes that are expressed accordingly with similar pattern (up or down regulated). I don't know how. > > Can somebody provide advices? > > Many thanks > > Jing Huang PhD > > OHSU > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From huangji at ohsu.edu Tue Jun 5 19:32:24 2012 From: huangji at ohsu.edu (Jing Huang) Date: Tue, 5 Jun 2012 10:32:24 -0700 Subject: [BioC] predicting transcription factors In-Reply-To: <129A8C7E-F590-4E9F-9194-9DE1E16D6090@fhcrc.org> Message-ID: THANK you Paul. Your interpretation is right on my problem. Jing On 6/5/12 10:24 AM, "Paul Shannon" wrote: >Dear Jing Huang, > >Let me make sure I correctly grasp your problem. > > 1) You have a set of co-regulated genes > 2) You wish to identify possible shared transcription factors for these >genes > >If this is an accurate statement, then one good approach is to > > 1) Obtain promoter sequence of each gene, often estimated to be 3k >upstream, and 300 bases downstream, of the transcription start site. > BioC provides excellent tools and data for this. 
> > 2) Search for enriched transcription factor binding sites in these >promoters. > >The meme website (or the downloaded meme software) is one traditional way >to do the search. We have some BioC packages, including MotIV, and my >soon-to-be-released collection of transcription factor matrices, which >provide a good solution for this also, and have the advantage that your >analysis can be performed entirely in R, reproducibly. > >Is this helpful? I can provide more detail. > > - Paul > > > >On Jun 5, 2012, at 9:44 AM, Jing Huang wrote: > >> Hi Experts, >> >> I am interested in predicting transcription factors for a specific >>family of genes. According to my readings, it is possible to predict >>transcription factors for the genes that are expressed accordingly with >>similar pattern (up or down regulated). I don't know how. >> >> Can somebody provide advices? >> >> Many thanks >> >> Jing Huang PhD >> >> OHSU >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >>http://news.gmane.org/gmane.science.biology.informatics.conductor > From huangji at ohsu.edu Tue Jun 5 19:33:56 2012 From: huangji at ohsu.edu (Jing Huang) Date: Tue, 5 Jun 2012 10:33:56 -0700 Subject: [BioC] predicting transcription factors In-Reply-To: <129A8C7E-F590-4E9F-9194-9DE1E16D6090@fhcrc.org> Message-ID: Hi Paul, Your interpretations are right on my problem. Jing On 6/5/12 10:24 AM, "Paul Shannon" wrote: >Dear Jing Huang, > >Let me make sure I correctly grasp your problem. > > 1) You have a set of co-regulated genes > 2) You wish to identify possible shared transcription factors for these >genes > >If this is an accurate statement, then one good approach is to > > 1) Obtain promoter sequence of each gene, often estimated to be 3k >upstream, and 300 bases downstream, of the transcription start site. 
> BioC provides excellent tools and data for this. > > 2) Search for enriched transcription factor binding sites in these >promoters. > >The meme website (or the downloaded meme software) is one traditional way >to do the search. We have some BioC packages, including MotIV, and my >soon-to-be-released collection of transcription factor matrices, which >provide a good solution for this also, and have the advantage that your >analysis can be performed entirely in R, reproducibly. > >Is this helpful? I can provide more detail. > > - Paul > > > >On Jun 5, 2012, at 9:44 AM, Jing Huang wrote: > >> Hi Experts, >> >> I am interested in predicting transcription factors for a specific >>family of genes. According to my readings, it is possible to predict >>transcription factors for the genes that are expressed accordingly with >>similar pattern (up or down regulated). I don't know how. >> >> Can somebody provide advices? >> >> Many thanks >> >> Jing Huang PhD >> >> OHSU >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >>http://news.gmane.org/gmane.science.biology.informatics.conductor > From katharine.coyte at jesus.ox.ac.uk Tue Jun 5 12:02:44 2012 From: katharine.coyte at jesus.ox.ac.uk (Katharine Coyte) Date: Tue, 5 Jun 2012 10:02:44 +0000 Subject: [BioC] GOSemSim comparison between species Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From schoi at cornell.edu Tue Jun 5 21:01:09 2012 From: schoi at cornell.edu (Sang Chul Choi) Date: Tue, 5 Jun 2012 19:01:09 +0000 Subject: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file. 
In-Reply-To: <3B4C55BA-D655-444A-91F3-F4107198E1EC@gmail.com>
References: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu> <3B4C55BA-D655-444A-91F3-F4107198E1EC@gmail.com>
Message-ID: <03E26781-7909-4DFD-9BD8-8092A9A8F237@cornell.edu>

I have tried to turn off the option when reading sequences of variable length from a gzipped FASTQ file (2 GB) using readSeqFile. The computer has 16 GB of memory, and the call used up all of it, leaving R "Dead" or no longer running.

Is there a way of sidestepping this problem?

Thank you,

SangChul

On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote:

> Hi SangChul,
>
> By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease.
>
> Best,
> Vince
>
> Sent from my iPhone
>
> On Jun 1, 2012, at 1:23 PM, Sang Chul Choi wrote:
>
>> Hi,
>>
>> I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size.
>>
>> I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem.
>> >> Thank you, >> >> SangChul >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From vsbuffalo at gmail.com Tue Jun 5 21:03:23 2012 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Tue, 5 Jun 2012 12:03:23 -0700 Subject: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file. In-Reply-To: <03E26781-7909-4DFD-9BD8-8092A9A8F237@cornell.edu> References: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu> <3B4C55BA-D655-444A-91F3-F4107198E1EC@gmail.com> <03E26781-7909-4DFD-9BD8-8092A9A8F237@cornell.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From kurinji.pandiyan at gmail.com Tue Jun 5 22:11:01 2012 From: kurinji.pandiyan at gmail.com (Kurinji Pandiyan) Date: Tue, 5 Jun 2012 13:11:01 -0700 Subject: [BioC] Increasing Stringency of GRanges Overlap Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From tim.triche at gmail.com Tue Jun 5 22:35:35 2012 From: tim.triche at gmail.com (Tim Triche, Jr.) Date: Tue, 5 Jun 2012 13:35:35 -0700 Subject: [BioC] Increasing Stringency of GRanges Overlap In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From li.wang at ttu.edu Wed Jun 6 01:40:59 2012 From: li.wang at ttu.edu (Wang, Li) Date: Tue, 5 Jun 2012 18:40:59 -0500 Subject: [BioC] cummeRbund errors In-Reply-To: <129A8C7E-F590-4E9F-9194-9DE1E16D6090@fhcrc.org> References: , <129A8C7E-F590-4E9F-9194-9DE1E16D6090@fhcrc.org> Message-ID: Dear list members I am struggling with cummeRbund. I tried some codes listed here, and am confronted with some errors. 
> cuff_data <- readCufflinks('diff_out') > csDensity(genes(cuff_data)) Error in dat$fpkm + pseudocount : non-numeric argument to binary operator > diffGeneIDs <- getSig(cuff_data, level="genes", alpha=0.05) > diffGenes <- getGenes(cuff_data, diffGeneIDs) Error in sqliteExecStatement(conn, statement, ...) : RS-DBI driver: (RS_SQLite_exec: could not execute1: cannot start a transaction within a transaction) > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] cummeRbund_1.2.0 reshape2_1.2.1 ggplot2_0.9.1 RSQLite_0.11.1 DBI_0.2-5 loaded via a namespace (and not attached): [1] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2 grid_2.15.0 [5] labeling_0.1 MASS_7.3-18 memoise_0.1 munsell_0.3 [9] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5 scales_0.2.1 [13] stringr_0.6 I cannot figure out the reason, could anyone give me some hints? Thanks in advance! Best wishes Li From hbolouri at fhcrc.org Wed Jun 6 02:55:39 2012 From: hbolouri at fhcrc.org (Hamid Bolouri) Date: Tue, 05 Jun 2012 17:55:39 -0700 (PDT) Subject: [BioC] "graphite" Biocarta 'native' graphs different from Biocarta web site? In-Reply-To: Message-ID: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> Graphite's native Biocarta pathways seem to have a different node list than that given by the Biocarta "PROTEIN LIST" link on Biocarta pathway pages (presumably what the pathway authors consider the 'true' pathway membership). There seem to be 2 categories of difference: (1) Some genes listed by Biocarta are absent from graphite's version (see ??? marks in the example below). (2) Because the native format nodes are annotated variously, it's necessary to do a node conversion. 
In particular, Biocarta's "PROTEIN LIST" gives _specific_ members of enzyme families, whereas graphite seems to replace EC numbers with all family members. However, I have trouble explaining how some enzymes are on/off the list (see --- marks in the example below). Am I misinterpreting things? If not, is there any way to get pathway graphs with node lists more closely matching what Biocarta lists online? Thanks, Hamid Bolouri -- http://labs.fhcrc.org/bolouri Example: > biocarta[["epo signaling pathway"]] "epo signaling pathway" pathway from BioCarta Number of nodes = 10 Number of edges = 24 Type of identifiers = native Retrieved on = 2011-05-12 > nodes(biocarta[["epo signaling pathway"]]) [1] "EntrezGene:2056" "EntrezGene:2057" [3] "EntrezGene:2885" "EntrezGene:3265" [5] "EntrezGene:6464" "EntrezGene:6654" [7] "EnzymeConsortium:2.7.1.112" "EnzymeConsortium:3.1.3.48" [9] "EnzymeConsortium:3.1.4.11" "STAT5" > PE <- convertIdentifiers(biocarta[["epo signaling pathway"]],type="entrez") > nodes(PE) [1] "2056" "2057" "2885" "3265" "6464" "6654" "52" "993" [9] "994" "995" "1843" "1844" "1845" "1846" "1847" "1848" [17] "1849" "1850" "1852" "5770" "5777" "5778" "5781" "5787" [25] "5788" "5792" "5795" "5797" "5798" "5799" "5801" "5803" [33] "8555" "8556" "11072" "11221" "56940" "80824" "84867" "5330" [41] "5331" "5332" "5333" "5335" "5336" "23236" "84812" "113026" > PS <- convertIdentifiers(biocarta[["epo signaling pathway"]],type="symbol") > nodes(PS) [1] "EPO" "EPOR" "GRB2" "HRAS" "SHC1" "SOS1" "ACP1" "CDC25A" [9] "CDC25B" "CDC25C" "DUSP1" "DUSP2" "DUSP3" "DUSP4" "DUSP5" "DUSP6" [17] "DUSP7" "DUSP8" "DUSP9" "PTPN1" "PTPN6" "PTPN7" "PTPN11" "PTPRB" [25] "PTPRC" "PTPRF" "PTPRJ" "PTPRM" "PTPRN" "PTPRN2" "PTPRR" "PTPRZ1" [33] "CDC14B" "CDC14A" "DUSP14" "DUSP10" "DUSP22" "DUSP16" "PTPN5" "PLCB2" [41] "PLCB3" "PLCB4" "PLCD1" "PLCG1" "PLCG2" "PLCB1" "PLCD4" "PLCD3" Compare the above with what I get from: http://www.biocarta.com/pathfiles/PathwayProteinList.asp?showPFID=69 
erythropoietin                                                 2056  ***
erythropoietin receptor                                        2057  ***
growth factor receptor-bound protein 2                         2885  ***
son of sevenless homolog 1 (Drosophila)                        6654  ***
v-Ha-ras Harvey rat sarcoma viral oncogene homolog             3265  ***
signal transducer and activator of transcription 5A            6776  ***
signal transducer and activator of transcription 5B            6777  ***
SHC (Src homology 2 domain containing) transforming protein 1  6464  ***
v-fos FBJ murine osteosarcoma viral oncogene homolog           2353  ???
v-raf-1 murine leukemia viral oncogene homolog 1               5894  ???
ELK1, member of ETS oncogene family                            2002  ???
jun oncogene                                                   3725  ???
casein kinase 2, alpha 1 polypeptide                           1457  ???
Janus kinase 2 (a protein tyrosine kinase)                     3717  ???
mitogen-activated protein kinase 3                             5595  ---
mitogen-activated protein kinase 8                             5599  ---
mitogen-activated protein kinase kinase 1                      5604  ---
phospholipase C, gamma 1                                       5335  ok
protein tyrosine phosphatase, non-receptor type 6              5777  ok

HBcomment: ***== in graphite, ???==missing from graphite, ---==specific enzymes in Biocarta are mapped to large (& unrelated?)
families in graphite ### > sessionInfo() R version 2.15.0 (2012-03-30) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] graphite_1.2.0 AnnotationDbi_1.18.1 Biobase_2.16.0 [4] BiocGenerics_0.2.0 RSQLite_0.11.1 DBI_0.2-5 [7] graph_1.34.0 loaded via a namespace (and not attached): [1] IRanges_1.14.3 org.Hs.eg.db_2.7.1 stats4_2.15.0 tools_2.15.0 ### From mtmorgan at fhcrc.org Wed Jun 6 04:57:09 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Tue, 05 Jun 2012 19:57:09 -0700 Subject: [BioC] BioC2012: upcoming deadlines Message-ID: <4FCEC705.3030009@fhcrc.org> A reminder that the deadline for the Bioconductor annual meeting (BioC2012) travel scholarship is rapidly approaching; see https://secure.bioconductor.org/BioC2012/ Also, we have an excellent line up of talks and tutorials! We're really looking forward to seeing you here in Seattle, July 24-25 (July 23 for developer day). Best, Martin Morgan -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From shi at wehi.EDU.AU Wed Jun 6 06:49:28 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Wed, 6 Jun 2012 14:49:28 +1000 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> Message-ID: Dear Robert, We do not have a 32bit linux machine here, but we managed to reproduce the problem you have encountered using a 32bit Virtual Machine running on a 64bit linux machine. 
It turned out that the problem occurred when one of our calls to the malloc() function failed to obtain memory from the system, which means the system ran out of memory and could not allocate any more to the buildindex() function. We tried letting buildindex() request a smaller amount of memory (1000MB), and this was found to solve the problem. So I recommend that you set the 'memory' parameter of buildindex() to 1000. The yeast genome is very small, so you do not need 2500MB of memory to build the index for it. The buildindex() function requires at least 1000MB of memory no matter how big or small the reference genome is, in order to build hash tables and remove highly repetitive 16-mers.

Also note that the mapping results are not affected by the amount of memory requested in the index building step. The amount of memory used only affects the running time. For example, using 8GB of memory to build the index for the mouse genome will give you a mapping speed twice as fast as using 4GB of memory. But for yeast, the entire index will always be loaded into memory in one go, because its genome is tiny and the minimum memory used by buildindex() is 1GB, which is big enough to accommodate the hash table, the genome sequences and other related information.

Finally, the problem you encountered did not occur in version 1.1.1 because genome sequences were not included in the built index by default in that version, whereas they are included in the index in newer versions.

Hope this solves your problem. Please let us know if it doesn't.

Cheers,
Wei

On Jun 5, 2012, at 5:45 PM, Robert Castelo wrote:

> hi,
>
> the computer room at my university where we do practicals on R & Bioconductor runs a 32bit linux distribution and when i tried to run the latest version of the Rsubread package (1.6.3) it crashes when calling the buildindex() function on a multifasta file with the yeast genome.
this does *not* happen under a 64bit linux distribution. > > i have verified that installing the version before (1.4.4) on the current R 2.15 it also crashes (on the 32bit), but two versions before, the 1.1.1, it does *not* and it works smoothly on this 32bit linux distribution. > > i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 where allChr.fa is the multifasta file with the yeast genome. > > so i can manage by now the problem by using the 1.1.1 version on R 2.15 for my teaching but i wonder whether there would be some easy solution for this, or even if it could be a symptom of something else that the Rsubread developers should worry about. i know that using a 32bit system nowadays is quite obsolete but this is what i got for teaching :( and i would be happy to let my students play with the latest version of Rsubread in the future. > > > thanks!!! > robert. > > ======================Rsubread 1.6.3 on R 2.15======================= > >> library(Rsubread) >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsubread_1.6.3 > >> buildindex(basename="subreadindex", reference="allChr.fa", memory=2500) > > Building a base-space index. 
> Size of memory used=2500 MB > Base name of the built index = subreadindex > > *** caught segfault *** > address 0xdf670cc0, cause 'memory not mapped' > > Traceback: > 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = as.character(cmd), PACKAGE = "Rsubread") > 2: buildindex(basename = "subreadindex", reference = "allChr.fa", memory = 2500) > > Possible actions: > 1: abort (with core dump, if enabled) > 2: normal R exit > 3: exit R without saving workspace > 4: exit R saving workspace > Selection: > > > ======================Rsubread 1.1.1 on R 2.15======================= > >> library(Rsubread) >> buildindex(basename="subreadindex", reference="allChr.fa", memory=2500) > > Building the index in the base space. > Size of memory requested=2500 MB > Index base name = subreadindex > INDEX ITEMS PER PARTITION = 275940352 > > completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps completed=81.76%; time used=2.4s; rate=4111.8k bps/s; total=12m bps > All the chromosome files are processed. > | Dumping index [===========================================================>] > Index subreadindex is successfully built. 
>> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsubread_1.1.1 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
From shi at wehi.EDU.AU Wed Jun 6 06:56:58 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Wed, 6 Jun 2012 14:56:58 +1000 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <1338903817.2062.284.camel@yangdu-desktop> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> Message-ID: <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> Dear Dan, It is probably because including genome sequences in the index slowed down your laptop, but I believe this should be alleviated if you give a smaller value to the 'memory' parameter of the buildindex() function. Also, index building is a one-off operation; you do not need to redo it even when new releases come out. For your 32-bit opensuse box, I guess the problem will be solved if you change the amount of memory requested to 1000MB. Cheers, Wei On Jun 5, 2012, at 11:43 PM, Dan Du wrote: > Hi Robert, > > I have been experiencing something else, possibly related to yours, > on a 64bit ubuntu laptop with 6g of ram.
> > As I recall, when bumping to Bioc 2.10, the Rsubread installation kind > of ate all the memory, basically froze the system so I had to call it > off, yet building it on the server side turned out fine. So I think I > just accepted that the new version may be 'computationally heavy' thus > not suitable for a normal pc, though I did not find any mentioning of > this increased memory requirement in the NEWS file. > > So currently Rsubread stays at 1.4.4 on that pc, all subsequent versions > of Rsubread drain the memory in the same way when compiling Rsubread.so. > > Now I think I can confirm this on a 32-bit opensuse box, it did > successfully built, but when running the example code in the manual, > same segfault happens. > > >> library(Rsubread) >> ref <- system.file("extdata","reference.fa",package="Rsubread") >> path <- system.file("extdata",package="Rsubread") >> buildindex(basename=file.path(path,"reference_index"),reference=ref) > > Building a base-space index. > Size of memory used=3700 MB > Base name of the built index > = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index > > *** caught segfault *** > address 0xdf03ee80, cause 'memory not mapped' > > Traceback: > 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = > as.character(cmd), PACKAGE = "Rsubread") > 2: buildindex(basename = file.path(path, "reference_index"), reference > = ref) > >> sessionInfo() > R version 2.15.0 Patched (2012-06-04 r59517) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] Rsubread_1.6.3 > > > Regards, > Dan
From haroonnaim at yahoo.com Wed Jun 6 06:59:15 2012 From: haroonnaim at yahoo.com (haroon naeem) Date: Wed, 6 Jun 2012 05:59:15 +0100 (BST) Subject: [BioC] minfi - missing values Message-ID: <1338958755.76383.YahooMailNeo@web29601.mail.ird.yahoo.com> An embedded and charset-unspecified text was scrubbed...
Name: not available URL:
From tooyoung at gmail.com Wed Jun 6 08:50:21 2012 From: tooyoung at gmail.com (Dan Du) Date: Wed, 06 Jun 2012 08:50:21 +0200 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> Message-ID: <1338965421.5663.36.camel@yangdu-desktop> Dear Wei, Unfortunately, reducing the memory parameter to 1000 still causes the segfault. I guess that with the 3g ram limit on a 32bit system, there is still a good chance that you cannot request a contiguous 1g block. For that 64bit laptop, the 6g memory draining is still strange. It happens during installation, when compiling the shared library Rsubread.so, not when running the buildindex function. Btw, the gcc version there is 4.4.3. The server and the opensuse box run gcc versions 4.3.1 and 4.5.0 respectively. Regards, Dan
From Peter.Sorensen2 at agrsci.dk Wed Jun 6 11:35:28 2012 From: Peter.Sorensen2 at agrsci.dk (Peter Sørensen (HAG)) Date: Wed, 6 Jun 2012 11:35:28 +0200 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <1338965421.5663.36.camel@yangdu-desktop> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au>, <1338965421.5663.36.camel@yangdu-desktop> Message-ID: <9F0721FDD4F12D4B95AD894274F388EC02767C9E2F83@DJFEXMBX01.djf.agrsci.dk> Dear Wei, I also encountered a (perhaps related) problem using Rsubread (see below). Kind regards Peter > library(Rsubread) Error in dyn.load(file, DLLpath = DLLpath, ...)
: unable to load shared object '/opt/ghpc/R-2.15.0/lib64/R/library/Rsubread/libs/Rsubread.so': /opt/ghpc/R-2.15.0/lib64/R/library/Rsubread/libs/Rsubread.so: cannot map zero-fill pages: Cannot allocate memory Error: package/namespace load failed for 'Rsubread' > ref <- system.file("extdata","reference.fa",package="Rsubread") > path <- system.file("extdata",package="Rsubread") > buildindex(basename=file.path(path,"reference_index"),reference=ref) Error: could not find function "buildindex" > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base >
From shi at wehi.EDU.AU Wed Jun 6 12:10:18 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Wed, 6 Jun 2012 20:10:18 +1000 (EST) Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <1338965421.5663.36.camel@yangdu-desktop> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> Message-ID: <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> Dear Dan, It didn't seem to be a problem of requesting a contiguous 1GB block in our investigation.
We tracked the memory usage of the buildindex() function when running it on the yeast genome in a 32-bit VM, and found that, with the memory parameter set to 2500, the segfault happened right after a request for a few KB of memory was sent to the system. However, the problem went away when the memory parameter was changed to 1000. Removing highly repetitive 16-mers required a contiguous 1GB block of memory, but this step always executed successfully. This step is also included in the old version of Rsubread (1.1.1), and it did not cause a problem there either. Could you please send us the complete code for running your test, along with your session info? This will help us diagnose the problem, because we couldn't reproduce what you saw on our end. For the compilation issue on your 64bit laptop, could you give us more details as well, including the messages output by gcc? Thanks, Wei > Dear Wei, > > Unfortunately reducing the memory parameter to 1000, still causes the > segfault. I guess with 3g ram limit on a 32bit system, there is still a > fat chance that you can not request a continuous 1g block. > > For that 64bit laptop, it is still strange about the 6g memory draining. > It is happing during the installation when compiling the shared library > Rsubread.so, not running the buildindex function. Btw, the gcc version > is 4.4.3. > > Server and opensuse box runs gcc version 4.3.1 and 4.5.0 respectively. > > Regards, > Dan > > On Wed, 2012-06-06 at 14:56 +1000, Wei Shi wrote: >> Dear Dan, >> >> It is probably because including genome sequences into the index slowed >> down your laptop. But I believe it should be alleviated if you give >> smaller values to the 'memory ' parameter of the buildindex() function. >> Also, the index building is an one-off operation, you do not need to >> redo it even when new releases come. >> >> For your 32-bit opensuse box, I guess the problem will be solved if you >> change the amount of memory requested to be 1000MB.
>> >> Cheers, >> Wei >> >> On Jun 5, 2012, at 11:43 PM, Dan Du wrote: >> >> > Hi Robert, >> > >> > I have been experiencing something else, possibly related to yours, >> > on a 64bit ubuntu laptop with 6g of ram. >> > >> > As I recall, when bumping to Bioc 2.10, the Rsubread installation kind >> > of ate all the memory, basically froze the system so I had to call it >> > off, yet building it on the server side turned out fine. So I think I >> > just accepted that the new version may be 'computationally heavy' thus >> > not suitable for a normal pc, though I did not find any mentioning of >> > this increased memory requirement in the NEWS file. >> > >> > So currently Rsubread stays at 1.4.4 on that pc, all subsequent >> versions >> > of Rsubread drain the memory in the same way when compiling >> Rsubread.so. >> > >> > Now I think I can confirm this on a 32-bit opensuse box, it did >> > successfully built, but when running the example code in the manual, >> > same segfault happens. >> > >> > >> >> library(Rsubread) >> >> ref <- system.file("extdata","reference.fa",package="Rsubread") >> >> path <- system.file("extdata",package="Rsubread") >> >> buildindex(basename=file.path(path,"reference_index"),reference=ref) >> > >> > Building a base-space index. 
>> > Size of memory used=3700 MB >> > Base name of the built index >> > = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index >> > >> > *** caught segfault *** >> > address 0xdf03ee80, cause 'memory not mapped' >> > >> > Traceback: >> > 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >> > as.character(cmd), PACKAGE = "Rsubread") >> > 2: buildindex(basename = file.path(path, "reference_index"), reference >> > = ref) >> > >> >> sessionInfo() >> > R version 2.15.0 Patched (2012-06-04 r59517) >> > Platform: i686-pc-linux-gnu (32-bit) >> > >> > locale: >> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> > [7] LC_PAPER=C LC_NAME=C >> > [9] LC_ADDRESS=C LC_TELEPHONE=C >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> > >> > attached base packages: >> > [1] stats graphics grDevices utils datasets methods >> > base >> > >> > other attached packages: >> > [1] Rsubread_1.6.3 >> > >> > >> > Regards, >> > Dan >> > >> > On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: >> >> hi, >> >> >> >> the computer room at my university where we do practicals on R & >> Bioconductor runs a 32bit linux distribution and when i tried to run >> the latest version of the Rsubread package (1.6.3) it crashes when >> calling the buildindex() function on a multifasta file with the yeast >> genome. this does *not* happen under a 64bit linux distribution. >> >> >> >> i have verified that installing the version before (1.4.4) on the >> current R 2.15 it also crashes (on the 32bit), but two versions >> before, the 1.1.1, it does *not* and it works smoothly on this 32bit >> linux distribution. >> >> >> >> i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 >> where allChr.fa is the multifasta file with the yeast genome. 
>> >> >> >> so i can manage by now the problem by using the 1.1.1 version on R >> 2.15 for my teaching but i wonder whether there would be some easy >> solution for this, or even if it could be a symptom of something else >> that the Rsubread developers should worry about. i know that using a >> 32bit system nowadays is quite obsolete but this is what i got for >> teaching :( and i would be happy to let my students play with the >> latest version of Rsubread in the future. >> >> >> >> >> >> thanks!!! >> >> robert. >> >> >> >> ======================Rsubread 1.6.3 on R 2.15======================= >> >> >> >>> library(Rsubread) >> >>> sessionInfo() >> >> R version 2.15.0 (2012-03-30) >> >> Platform: i686-pc-linux-gnu (32-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >> >> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >> >> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >> >> [7] LC_PAPER=C LC_NAME=C >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> >> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] Rsubread_1.6.3 >> >> >> >>> buildindex(basename="subreadindex", reference="allChr.fa", >> memory=2500) >> >> >> >> Building a base-space index. 
>> >> Size of memory used=2500 MB >> >> Base name of the built index = subreadindex >> >> >> >> *** caught segfault *** >> >> address 0xdf670cc0, cause 'memory not mapped' >> >> >> >> Traceback: >> >> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >> as.character(cmd), PACKAGE = "Rsubread") >> >> 2: buildindex(basename = "subreadindex", reference = "allChr.fa", >> memory = 2500) >> >> >> >> Possible actions: >> >> 1: abort (with core dump, if enabled) >> >> 2: normal R exit >> >> 3: exit R without saving workspace >> >> 4: exit R saving workspace >> >> Selection: >> >> >> >> >> >> ======================Rsubread 1.1.1 on R 2.15======================= >> >> >> >>> library(Rsubread) >> >>> buildindex(basename="subreadindex", reference="allChr.fa", >> memory=2500) >> >> >> >> Building the index in the base space. >> >> Size of memory requested=2500 MB >> >> Index base name = subreadindex >> >> INDEX ITEMS PER PARTITION = 275940352 >> >> >> >> completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps >> completed=81.76%; time used=2.4s; rate=4111.8k >> bps/s; total=12m bps >> >> All the chromosome files are processed. >> >> | Dumping index >> [===========================================================>] >> >> Index subreadindex is successfully built. 
>> >>> sessionInfo() >> >> R version 2.15.0 (2012-03-30) >> >> Platform: i686-pc-linux-gnu (32-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >> >> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >> >> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >> >> [7] LC_PAPER=C LC_NAME=C >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> >> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] Rsubread_1.1.1 >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor at r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for >> the addressee. >> You must not disclose, forward, print or use it without the permission >> of the sender. 
>> ______________________________________________________________________ > > > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

From shi at wehi.EDU.AU Wed Jun 6 12:56:00 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Wed, 6 Jun 2012 20:56:00 +1000 (EST) Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <9F0721FDD4F12D4B95AD894274F388EC02767C9E2F83@DJFEXMBX01.djf.agrsci.dk> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au>, <1338965421.5663.36.camel@yangdu-desktop> <9F0721FDD4F12D4B95AD894274F388EC02767C9E2F83@DJFEXMBX01.djf.agrsci.dk> Message-ID: <6b6ef69de4af7ed3bdc6db76c885528c.squirrel@homebase.wehi.edu.au>

Dear Peter,

It looks like your installation of Rsubread was unsuccessful, which is why you got an error when you loaded it into your R session. This issue is different from the one with running the buildindex() function on a 32-bit machine. Did you notice any errors while it was being installed?

You can reinstall it using the following commands in your R session:

source("http://bioconductor.org/biocLite.R")
biocLite("Rsubread")

Cheers,
Wei

> Dear Wei,
> I also encounter a (perhaps related) problem using Rsubread (see below).
> Kind regards
> Peter
>
>
>> library(Rsubread)
> Error in dyn.load(file, DLLpath = DLLpath, ...)
: > unable to load shared object > '/opt/ghpc/R-2.15.0/lib64/R/library/Rsubread/libs/Rsubread.so': > /opt/ghpc/R-2.15.0/lib64/R/library/Rsubread/libs/Rsubread.so: cannot map > zero-fill pages: Cannot allocate memory > Error: package/namespace load failed for 'Rsubread' >> ref <- system.file("extdata","reference.fa",package="Rsubread") >> path <- system.file("extdata",package="Rsubread") >> buildindex(basename=file.path(path,"reference_index"),reference=ref) > Error: could not find function "buildindex" >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > locale: > [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US > [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US > [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base >> > > > ________________________________________ > From: bioconductor-bounces at r-project.org > [bioconductor-bounces at r-project.org] On Behalf Of Dan Du > [tooyoung at gmail.com] > Sent: Wednesday, June 06, 2012 8:50 AM > To: Wei Shi > Cc: Yang Liao; bioconductor at r-project.org mailman > Subject: Re: [BioC] Rsubread crashes in 32bit linux > > Dear Wei, > > Unfortunately reducing the memory parameter to 1000, still causes the > segfault. I guess with 3g ram limit on a 32bit system, there is still a > fat chance that you can not request a continuous 1g block. > > For that 64bit laptop, it is still strange about the 6g memory draining. > It is happing during the installation when compiling the shared library > Rsubread.so, not running the buildindex function. Btw, the gcc version > is 4.4.3. > > Server and opensuse box runs gcc version 4.3.1 and 4.5.0 respectively. > > Regards, > Dan
From tooyoung at gmail.com Wed Jun 6 13:08:54 2012 From: tooyoung at gmail.com (Dan Du) Date: Wed, 06 Jun 2012 13:08:54 +0200 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu>
<1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> Message-ID: <1338980934.5663.85.camel@yangdu-desktop>

Dear Wei,

Here is a standard biocLite update. I think it is at the last step, when compiling Rsubread.so, that the memory usage exceeds 5.5g; the system then freezes and I have to call it off. I get the same result when running 'R CMD INSTALL Rsubread_1.6.3.tar.gz' from the shell, or when manually compiling all the .c files and running the last gcc statement. I guess there might just be a minimum RAM requirement somewhere higher than 6g... I will do some more poking when I have time.

'gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o gene-value-index.o hashtable.o index-builder.o input-files.o processExons.o propmapped.o qualityScores.o readSummary.o removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR'

The sessionInfo() output and the full gcc version are further down; please let me know if you need more information.

Regards,
Dan

--------------------------------------------------------------------
> source('http://www.bioconductor.org/biocLite.R')
> biocLite('')
BioC_mirror: http://bioconductor.org
Using R version 2.15, BiocInstaller version 1.4.6.
Installing package(s) ''
Old packages: 'Rsubread'
Update all/some/none? [a/s/n]: a
trying URL 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz'
Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb)
opened URL
==================================================
downloaded 20.9 Mb

WARNING: ignoring environment value of R_HOME
* installing *source* package ‘Rsubread’ ...
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c R_wrapper.c -o R_wrapper.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c SNP_calling.c -o SNP_calling.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c aligner.c -o aligner.o
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c atgcContent.c -o atgcContent.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c detectionCall.c -o detectionCall.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c detectionCallAnnotation.c -o detectionCallAnnotation.o
detectionCallAnnotation.c: In function ‘calculateExonGCContent’:
detectionCallAnnotation.c:175: warning: ignoring return value of ‘fgets’, declared with attribute warn_unused_result
detectionCallAnnotation.c: In function ‘calculateIRGCContent’:
detectionCallAnnotation.c:262: warning: ignoring return value of ‘fgets’, declared with attribute warn_unused_result
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c exon-algorithms.c -o exon-algorithms.o
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c exon-align.c -o exon-align.o
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c fullscan.c -o fullscan.o
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c gene-algorithms.c -o gene-algorithms.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c gene-value-index.c -o gene-value-index.o
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c hashtable.c -o hashtable.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c index-builder.c -o index-builder.o
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c input-files.c -o input-files.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c processExons.c -o processExons.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c propmapped.c -o propmapped.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c qualityScores.c -o qualityScores.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c readSummary.c -o readSummary.o
readSummary.c: In function ‘readSummary’:
readSummary.c:122: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘long int’
readSummary.c:122: warning: format ‘%d’ expects type ‘int’, but argument 6 has type ‘long int’
readSummary.c:39: warning: ignoring return value of ‘getline’, declared with attribute warn_unused_result
readSummary.c:52: warning: ignoring return value of ‘getline’, declared with attribute warn_unused_result
readSummary.c:55: warning: ignoring return value of ‘getline’, declared with attribute warn_unused_result
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c removeDuplicatedReads.c -o removeDuplicatedReads.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c sam2bed.c -o sam2bed.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic -O3 -pipe -g -c sorted-hashtable.c -o sorted-hashtable.o
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’ declared but never defined
gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared but never defined
gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o gene-value-index.o hashtable.o index-builder.o input-files.o processExons.o propmapped.o qualityScores.o readSummary.o removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR
^Cmake: *** Deleting file `Rsubread.so'
make: *** [Rsubread.so] Interrupt
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
   ‘Rsubread.Rnw’
** testing if installed package can be loaded
Error in library.dynam(lib, package, package.lib) :
  shared object ‘Rsubread.so’ not found
Error: loading failed
Execution halted
--------------------------------------------------------------------
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=C LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base
--------------------------------------------------------------------
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.3-4ubuntu5.1' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-plugin --enable-objc-gc --disable-werror --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) -------------------------------------------------------------------- On Wed, 2012-06-06 at 20:10 +1000, Wei Shi wrote: > Dear Dan, > > It didn't seem to be problem of requesting a continuous 1GB block in our > investigation. We tracked the memory usage of buildindex() function when > running it on yeast genome using a 32-bit VM, and found that the segfault > happened right after a request of a few KB of memory was sent to the > system when the memory parameter was set to 2500. However, the problem was > gone when the memory parameter was changed to 1000. > > Removing highly repetitive 16 mers required a continuous 1GB block of > memory, but this step was always executed successfully. This step also > included in the old version of Rsubread (1.1.1), and it did not have > problem there either. > > Could you please provide us your complete code for running your test and > also session info? This will help us to diagnose what the problem could be > because we couldn't reproduce what you saw from our end. > > For the compilation issue on your 64bit laptop, could you provide us more > details as well, including the message output from gcc? 
> > Thanks, > Wei > > > Dear Wei, > > > > Unfortunately reducing the memory parameter to 1000, still causes the > > segfault. I guess with 3g ram limit on a 32bit system, there is still a > > fat chance that you can not request a continuous 1g block. > > > > For that 64bit laptop, it is still strange about the 6g memory draining. > > It is happing during the installation when compiling the shared library > > Rsubread.so, not running the buildindex function. Btw, the gcc version > > is 4.4.3. > > > > Server and opensuse box runs gcc version 4.3.1 and 4.5.0 respectively. > > > > Regards, > > Dan > > > > On Wed, 2012-06-06 at 14:56 +1000, Wei Shi wrote: > >> Dear Dan, > >> > >> It is probably because including genome sequences into the index slowed > >> down your laptop. But I believe it should be alleviated if you give > >> smaller values to the 'memory ' parameter of the buildindex() function. > >> Also, the index building is an one-off operation, you do not need to > >> redo it even when new releases come. > >> > >> For your 32-bit opensuse box, I guess the problem will be solved if you > >> change the amount of memory requested to be 1000MB. > >> > >> Cheers, > >> Wei > >> > >> On Jun 5, 2012, at 11:43 PM, Dan Du wrote: > >> > >> > Hi Robert, > >> > > >> > I have been experiencing something else, possibly related to yours, > >> > on a 64bit ubuntu laptop with 6g of ram. > >> > > >> > As I recall, when bumping to Bioc 2.10, the Rsubread installation kind > >> > of ate all the memory, basically froze the system so I had to call it > >> > off, yet building it on the server side turned out fine. So I think I > >> > just accepted that the new version may be 'computationally heavy' thus > >> > not suitable for a normal pc, though I did not find any mentioning of > >> > this increased memory requirement in the NEWS file. 
> >> > > >> > So currently Rsubread stays at 1.4.4 on that pc, all subsequent > >> versions > >> > of Rsubread drain the memory in the same way when compiling > >> Rsubread.so. > >> > > >> > Now I think I can confirm this on a 32-bit opensuse box, it did > >> > successfully built, but when running the example code in the manual, > >> > same segfault happens. > >> > > >> > > >> >> library(Rsubread) > >> >> ref <- system.file("extdata","reference.fa",package="Rsubread") > >> >> path <- system.file("extdata",package="Rsubread") > >> >> buildindex(basename=file.path(path,"reference_index"),reference=ref) > >> > > >> > Building a base-space index. > >> > Size of memory used=3700 MB > >> > Base name of the built index > >> > = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index > >> > > >> > *** caught segfault *** > >> > address 0xdf03ee80, cause 'memory not mapped' > >> > > >> > Traceback: > >> > 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = > >> > as.character(cmd), PACKAGE = "Rsubread") > >> > 2: buildindex(basename = file.path(path, "reference_index"), reference > >> > = ref) > >> > > >> >> sessionInfo() > >> > R version 2.15.0 Patched (2012-06-04 r59517) > >> > Platform: i686-pc-linux-gnu (32-bit) > >> > > >> > locale: > >> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > >> > [7] LC_PAPER=C LC_NAME=C > >> > [9] LC_ADDRESS=C LC_TELEPHONE=C > >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> > > >> > attached base packages: > >> > [1] stats graphics grDevices utils datasets methods > >> > base > >> > > >> > other attached packages: > >> > [1] Rsubread_1.6.3 > >> > > >> > > >> > Regards, > >> > Dan > >> > > >> > On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: > >> >> hi, > >> >> > >> >> the computer room at my university where we do practicals on R & > >> Bioconductor runs a 32bit linux distribution and when i tried 
to run > >> the latest version of the Rsubread package (1.6.3) it crashes when > >> calling the buildindex() function on a multifasta file with the yeast > >> genome. this does *not* happen under a 64bit linux distribution. > >> >> > >> >> i have verified that installing the version before (1.4.4) on the > >> current R 2.15 it also crashes (on the 32bit), but two versions > >> before, the 1.1.1, it does *not* and it works smoothly on this 32bit > >> linux distribution. > >> >> > >> >> i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 > >> where allChr.fa is the multifasta file with the yeast genome. > >> >> > >> >> so i can manage by now the problem by using the 1.1.1 version on R > >> 2.15 for my teaching but i wonder whether there would be some easy > >> solution for this, or even if it could be a symptom of something else > >> that the Rsubread developers should worry about. i know that using a > >> 32bit system nowadays is quite obsolete but this is what i got for > >> teaching :( and i would be happy to let my students play with the > >> latest version of Rsubread in the future. > >> >> > >> >> > >> >> thanks!!! > >> >> robert. 
> >> >> > >> >> ======================Rsubread 1.6.3 on R 2.15======================= > >> >> > >> >>> library(Rsubread) > >> >>> sessionInfo() > >> >> R version 2.15.0 (2012-03-30) > >> >> Platform: i686-pc-linux-gnu (32-bit) > >> >> > >> >> locale: > >> >> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > >> >> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > >> >> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > >> >> [7] LC_PAPER=C LC_NAME=C > >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> >> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > >> >> > >> >> attached base packages: > >> >> [1] stats graphics grDevices utils datasets methods base > >> >> > >> >> other attached packages: > >> >> [1] Rsubread_1.6.3 > >> >> > >> >>> buildindex(basename="subreadindex", reference="allChr.fa", > >> memory=2500) > >> >> > >> >> Building a base-space index. > >> >> Size of memory used=2500 MB > >> >> Base name of the built index = subreadindex > >> >> > >> >> *** caught segfault *** > >> >> address 0xdf670cc0, cause 'memory not mapped' > >> >> > >> >> Traceback: > >> >> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = > >> as.character(cmd), PACKAGE = "Rsubread") > >> >> 2: buildindex(basename = "subreadindex", reference = "allChr.fa", > >> memory = 2500) > >> >> > >> >> Possible actions: > >> >> 1: abort (with core dump, if enabled) > >> >> 2: normal R exit > >> >> 3: exit R without saving workspace > >> >> 4: exit R saving workspace > >> >> Selection: > >> >> > >> >> > >> >> ======================Rsubread 1.1.1 on R 2.15======================= > >> >> > >> >>> library(Rsubread) > >> >>> buildindex(basename="subreadindex", reference="allChr.fa", > >> memory=2500) > >> >> > >> >> Building the index in the base space. 
> >> >> Size of memory requested=2500 MB > >> >> Index base name = subreadindex > >> >> INDEX ITEMS PER PARTITION = 275940352 > >> >> > >> >> completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps > >> completed=81.76%; time used=2.4s; rate=4111.8k > >> bps/s; total=12m bps > >> >> All the chromosome files are processed. > >> >> | Dumping index > >> [===========================================================>] > >> >> Index subreadindex is successfully built. > >> >>> sessionInfo() > >> >> R version 2.15.0 (2012-03-30) > >> >> Platform: i686-pc-linux-gnu (32-bit) > >> >> > >> >> locale: > >> >> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > >> >> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > >> >> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > >> >> [7] LC_PAPER=C LC_NAME=C > >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> >> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > >> >> > >> >> attached base packages: > >> >> [1] stats graphics grDevices utils datasets methods base > >> >> > >> >> other attached packages: > >> >> [1] Rsubread_1.1.1 > >> >> > >> >> _______________________________________________ > >> >> Bioconductor mailing list > >> >> Bioconductor at r-project.org > >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > >> > _______________________________________________ > >> > Bioconductor mailing list > >> > Bioconductor at r-project.org > >> > https://stat.ethz.ch/mailman/listinfo/bioconductor > >> > Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> ______________________________________________________________________ > >> The information in this email is confidential and intended solely for > >> the addressee. > >> You must not disclose, forward, print or use it without the permission > >> of the sender. 
> >> ______________________________________________________________________ > > > > > > > > > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:5}}

From Peter.Sorensen2 at agrsci.dk Wed Jun 6 14:33:50 2012
From: Peter.Sorensen2 at agrsci.dk (Peter Sørensen (HAG))
Date: Wed, 6 Jun 2012 14:33:50 +0200
Subject: [BioC] Rsubread crashes in 32bit linux
In-Reply-To: <6b6ef69de4af7ed3bdc6db76c885528c.squirrel@homebase.wehi.edu.au>
References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au>, <1338965421.5663.36.camel@yangdu-desktop> <9F0721FDD4F12D4B95AD894274F388EC02767C9E2F83@DJFEXMBX01.djf.agrsci.dk>, <6b6ef69de4af7ed3bdc6db76c885528c.squirrel@homebase.wehi.edu.au>
Message-ID: <9F0721FDD4F12D4B95AD894274F388EC02767C9E2F8B@DJFEXMBX01.djf.agrsci.dk>

Hi Wei,

I have tried to install Rsubread locally in my own directory, but I was not successful. The error and warning messages are below.

regards
Peter

R version 2.15.0 (2012-03-30)
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> source("http://bioconductor.org/biocLite.R") BiocInstaller version 1.4.6, ?biocLite for help > biocLite("Rsubread") BioC_mirror: http://bioconductor.org Using R version 2.15, BiocInstaller version 1.4.6. Installing package(s) 'Rsubread' Warning in install.packages(pkgs = pkgs, lib = lib, repos = repos, ...) : 'lib = "/usr/hag/biogen/pso/R/x86_64-unknown-linux-gnu-library/2.15"' is not writable Would you like to create a personal library ~/R/x86_64-unknown-linux-gnu-library/2.15 to install packages into? (y/n) y trying URL 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz' Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb) opened URL ================================================== downloaded 20.9 Mb * installing *source* package 'Rsubread' ... ** libs gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c R_wrapper.c -o R_wrapper.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c SNP_calling.c -o SNP_calling.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c aligner.c -o aligner.o gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c atgcContent.c -o atgcContent.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c detectionCall.c -o detectionCall.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include 
-DMAKE_FOR_EXON -fpic -g -O2 -c detectionCallAnnotation.c -o detectionCallAnnotation.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c exon-algorithms.c -o exon-algorithms.o gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c exon-align.c -o exon-align.o gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c fullscan.c -o fullscan.o gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c gene-algorithms.c -o gene-algorithms.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c gene-value-index.c -o gene-value-index.o 
gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c hashtable.c -o hashtable.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c index-builder.c -o index-builder.o gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c input-files.c -o input-files.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c processExons.c -o processExons.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c propmapped.c -o propmapped.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c qualityScores.c -o qualityScores.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c readSummary.c -o readSummary.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c removeDuplicatedReads.c -o removeDuplicatedReads.o gcc -std=gnu99 
-I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c sam2bed.c -o sam2bed.o gcc -std=gnu99 -I/opt/ghpc/R-2.15.0/lib64/R/include -DNDEBUG -I/opt/ghpc/include -DMAKE_FOR_EXON -fpic -g -O2 -c sorted-hashtable.c -o sorted-hashtable.o gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gene-algorithms.h:23:13: warning: inline function 'add_gene_vote_weighted' declared but never defined gene-algorithms.h:22:13: warning: inline function 'add_gene_vote' declared but never defined gcc -std=gnu99 -shared -L/opt/ghpc/lib64 -L/opt/ghpc/lib -o Rsubread.so R_wrapper.o SNP_calling.o aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o gene-value-index.o hashtable.o index-builder.o input-files.o processExons.o propmapped.o qualityScores.o readSummary.o removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread -DMAKE_FOR_EXON -L/opt/ghpc/R-2.15.0/lib64/R/lib -lR installing to /usr/hag/biogen/pso/R/x86_64-unknown-linux-gnu-library/2.15/Rsubread/libs ** R ** inst ** preparing package for lazy loading ** help *** installing help indices ** building package indices ** installing vignettes 'Rsubread.Rnw' ** testing if installed package can be loaded Error in dyn.load(file, DLLpath = DLLpath, ...) 
: unable to load shared object '/usr/hag/biogen/pso/R/x86_64-unknown-linux-gnu-library/2.15/Rsubread/libs/Rsubread.so':
/usr/hag/biogen/pso/R/x86_64-unknown-linux-gnu-library/2.15/Rsubread/libs/Rsubread.so: cannot map zero-fill pages: Cannot allocate memory
Error: loading failed
Execution halted
ERROR: loading failed
* removing '/usr/hag/biogen/pso/R/x86_64-unknown-linux-gnu-library/2.15/Rsubread'

The downloaded source packages are in
'/tmp/Rtmp5zvTug/downloaded_packages'
Warning messages:
1: In install.packages(pkgs = pkgs, lib = lib, repos = repos, ...) :
installation of package 'Rsubread' had non-zero exit status
2: installed directory not writable, cannot update packages 'gdata', 'GEOquery', 'Rgraphviz'
>
________________________________________
From: Wei Shi [shi at wehi.EDU.AU]
Sent: Wednesday, June 06, 2012 12:56 PM
To: Peter Sørensen (HAG)
Cc: Dan Du; Wei Shi; Yang Liao; bioconductor at r-project.org mailman
Subject: RE: [BioC] Rsubread crashes in 32bit linux

Dear Peter,

It looks like your installation of Rsubread was unsuccessful, which is why you got an error when you loaded it into your R session. This issue is different from the issue with running the buildindex() function on a 32-bit machine.

Did you notice any errors when it was being installed? You can reinstall it using the following commands in your R session:

source("http://bioconductor.org/biocLite.R")
biocLite("Rsubread")

Cheers,
Wei

> Dear Wei,
> I also encounter a (perhaps related) problem using Rsubread (see below).
> Kind regards
> Peter
>
>
>> library(Rsubread)
> Error in dyn.load(file, DLLpath = DLLpath, ...)
:
> unable to load shared object
> '/opt/ghpc/R-2.15.0/lib64/R/library/Rsubread/libs/Rsubread.so':
> /opt/ghpc/R-2.15.0/lib64/R/library/Rsubread/libs/Rsubread.so: cannot map
> zero-fill pages: Cannot allocate memory
> Error: package/namespace load failed for 'Rsubread'
>> ref <- system.file("extdata","reference.fa",package="Rsubread")
>> path <- system.file("extdata",package="Rsubread")
>> buildindex(basename=file.path(path,"reference_index"),reference=ref)
> Error: could not find function "buildindex"
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
> [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>>
>
>
> ________________________________________
> From: bioconductor-bounces at r-project.org
> [bioconductor-bounces at r-project.org] On Behalf Of Dan Du
> [tooyoung at gmail.com]
> Sent: Wednesday, June 06, 2012 8:50 AM
> To: Wei Shi
> Cc: Yang Liao; bioconductor at r-project.org mailman
> Subject: Re: [BioC] Rsubread crashes in 32bit linux
>
> Dear Wei,
>
> Unfortunately, reducing the memory parameter to 1000 still causes the
> segfault. I guess with the 3g ram limit on a 32bit system, there is still a
> good chance that you cannot get a contiguous 1g block.
>
> For that 64bit laptop, the 6g memory draining is still strange.
> It is happening during the installation when compiling the shared library
> Rsubread.so, not when running the buildindex function. Btw, the gcc version
> is 4.4.3.
>
> The server and the opensuse box run gcc version 4.3.1 and 4.5.0 respectively.
>
> Regards,
> Dan
>
> On Wed, 2012-06-06 at 14:56 +1000, Wei Shi wrote:
>> Dear Dan,
>>
>> It is probably because including genome sequences into the index slowed
>> down your laptop.
But I believe it should be alleviated if you give >> smaller values to the 'memory ' parameter of the buildindex() function. >> Also, the index building is an one-off operation, you do not need to >> redo it even when new releases come. >> >> For your 32-bit opensuse box, I guess the problem will be solved if you >> change the amount of memory requested to be 1000MB. >> >> Cheers, >> Wei >> >> On Jun 5, 2012, at 11:43 PM, Dan Du wrote: >> >> > Hi Robert, >> > >> > I have been experiencing something else, possibly related to yours, >> > on a 64bit ubuntu laptop with 6g of ram. >> > >> > As I recall, when bumping to Bioc 2.10, the Rsubread installation kind >> > of ate all the memory, basically froze the system so I had to call it >> > off, yet building it on the server side turned out fine. So I think I >> > just accepted that the new version may be 'computationally heavy' thus >> > not suitable for a normal pc, though I did not find any mentioning of >> > this increased memory requirement in the NEWS file. >> > >> > So currently Rsubread stays at 1.4.4 on that pc, all subsequent >> versions >> > of Rsubread drain the memory in the same way when compiling >> Rsubread.so. >> > >> > Now I think I can confirm this on a 32-bit opensuse box, it did >> > successfully built, but when running the example code in the manual, >> > same segfault happens. >> > >> > >> >> library(Rsubread) >> >> ref <- system.file("extdata","reference.fa",package="Rsubread") >> >> path <- system.file("extdata",package="Rsubread") >> >> buildindex(basename=file.path(path,"reference_index"),reference=ref) >> > >> > Building a base-space index. 
>> > Size of memory used=3700 MB >> > Base name of the built index >> > = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index >> > >> > *** caught segfault *** >> > address 0xdf03ee80, cause 'memory not mapped' >> > >> > Traceback: >> > 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >> > as.character(cmd), PACKAGE = "Rsubread") >> > 2: buildindex(basename = file.path(path, "reference_index"), reference >> > = ref) >> > >> >> sessionInfo() >> > R version 2.15.0 Patched (2012-06-04 r59517) >> > Platform: i686-pc-linux-gnu (32-bit) >> > >> > locale: >> > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> > [7] LC_PAPER=C LC_NAME=C >> > [9] LC_ADDRESS=C LC_TELEPHONE=C >> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> > >> > attached base packages: >> > [1] stats graphics grDevices utils datasets methods >> > base >> > >> > other attached packages: >> > [1] Rsubread_1.6.3 >> > >> > >> > Regards, >> > Dan >> > >> > On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: >> >> hi, >> >> >> >> the computer room at my university where we do practicals on R & >> Bioconductor runs a 32bit linux distribution and when i tried to run >> the latest version of the Rsubread package (1.6.3) it crashes when >> calling the buildindex() function on a multifasta file with the yeast >> genome. this does *not* happen under a 64bit linux distribution. >> >> >> >> i have verified that installing the version before (1.4.4) on the >> current R 2.15 it also crashes (on the 32bit), but two versions >> before, the 1.1.1, it does *not* and it works smoothly on this 32bit >> linux distribution. >> >> >> >> i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 >> where allChr.fa is the multifasta file with the yeast genome. 
>> >> >> >> so i can manage by now the problem by using the 1.1.1 version on R >> 2.15 for my teaching but i wonder whether there would be some easy >> solution for this, or even if it could be a symptom of something else >> that the Rsubread developers should worry about. i know that using a >> 32bit system nowadays is quite obsolete but this is what i got for >> teaching :( and i would be happy to let my students play with the >> latest version of Rsubread in the future. >> >> >> >> >> >> thanks!!! >> >> robert. >> >> >> >> ======================Rsubread 1.6.3 on R 2.15======================= >> >> >> >>> library(Rsubread) >> >>> sessionInfo() >> >> R version 2.15.0 (2012-03-30) >> >> Platform: i686-pc-linux-gnu (32-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >> >> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >> >> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >> >> [7] LC_PAPER=C LC_NAME=C >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> >> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] Rsubread_1.6.3 >> >> >> >>> buildindex(basename="subreadindex", reference="allChr.fa", >> memory=2500) >> >> >> >> Building a base-space index. 
>> >> Size of memory used=2500 MB >> >> Base name of the built index = subreadindex >> >> >> >> *** caught segfault *** >> >> address 0xdf670cc0, cause 'memory not mapped' >> >> >> >> Traceback: >> >> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >> as.character(cmd), PACKAGE = "Rsubread") >> >> 2: buildindex(basename = "subreadindex", reference = "allChr.fa", >> memory = 2500) >> >> >> >> Possible actions: >> >> 1: abort (with core dump, if enabled) >> >> 2: normal R exit >> >> 3: exit R without saving workspace >> >> 4: exit R saving workspace >> >> Selection: >> >> >> >> >> >> ======================Rsubread 1.1.1 on R 2.15======================= >> >> >> >>> library(Rsubread) >> >>> buildindex(basename="subreadindex", reference="allChr.fa", >> memory=2500) >> >> >> >> Building the index in the base space. >> >> Size of memory requested=2500 MB >> >> Index base name = subreadindex >> >> INDEX ITEMS PER PARTITION = 275940352 >> >> >> >> completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps >> completed=81.76%; time used=2.4s; rate=4111.8k >> bps/s; total=12m bps >> >> All the chromosome files are processed. >> >> | Dumping index >> [===========================================================>] >> >> Index subreadindex is successfully built. 
>> >>> sessionInfo() >> >> R version 2.15.0 (2012-03-30) >> >> Platform: i686-pc-linux-gnu (32-bit) >> >> >> >> locale: >> >> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >> >> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >> >> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >> >> [7] LC_PAPER=C LC_NAME=C >> >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> >> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >> >> >> >> attached base packages: >> >> [1] stats graphics grDevices utils datasets methods base >> >> >> >> other attached packages: >> >> [1] Rsubread_1.1.1 >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor at r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> ______________________________________________________________________ >> The information in this email is confidential and inte...{{dropped:6}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}} From shi at wehi.EDU.AU Wed Jun 6 14:47:17 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Wed, 6 Jun 2012 22:47:17 +1000 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <1338980934.5663.85.camel@yangdu-desktop> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> 
<1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> <1338980934.5663.85.camel@yangdu-desktop>
Message-ID: <5853451A-DE04-4916-BFD3-50CAF7A00792@wehi.edu.au>

Dear Dan,

Thanks a lot for the information. The final step of the gcc compilation, which links all the object files and creates the dynamic library Rsubread.so, should not consume much memory and should finish fairly quickly. So there seems to be a problem with this step.

Which Linux distribution are you running on your laptop? Would you please install the C version of the subread aligner on your laptop? This will help figure out whether it is C or R that caused the problem. The C version can be downloaded from http://subread.sourceforge.net

Thanks,
Wei

On Jun 6, 2012, at 9:08 PM, Dan Du wrote:

> Dear Wei,
>
> Here is a standard biocLite update. I think it is at the last step, when
> compiling Rsubread.so, that the memory usage exceeds 5.5g, then the system
> freezes and I have to call it off. Same result when running 'R CMD INSTALL
> Rsubread_1.6.3.tar.gz' from shell, or manually compiling all .c files and
> running the last gcc statement. I guess there might just be a minimum ram
> requirement somewhere higher than 6g... I will do some more poking when
> I have time.
>
> 'gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o
> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o
> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o
> gene-value-index.o hashtable.o index-builder.o input-files.o
> processExons.o propmapped.o qualityScores.o readSummary.o
> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread
> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR'
>
> Also below are the sessionInfo and full gcc version, please let me
> know if you need more information.
> > Regards, > Dan > -------------------------------------------------------------------- >> source('http://www.bioconductor.org/biocLite.R') >> biocLite('') > BioC_mirror: http://bioconductor.org > Using R version 2.15, BiocInstaller version 1.4.6. > Installing package(s) '' > Old packages: 'Rsubread' > Update all/some/none? [a/s/n]: a > trying URL > 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz' > Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb) > opened URL > ================================================== > downloaded 20.9 Mb > > WARNING: ignoring environment value of R_HOME > * installing *source* package ?Rsubread? ... > ** libs > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c R_wrapper.c -o R_wrapper.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c SNP_calling.c -o SNP_calling.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c aligner.c -o aligner.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c atgcContent.c -o atgcContent.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c detectionCall.c -o detectionCall.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c detectionCallAnnotation.c -o detectionCallAnnotation.o > detectionCallAnnotation.c: In function ?calculateExonGCContent?: > detectionCallAnnotation.c:175: warning: ignoring return value of > ?fgets?, declared with attribute warn_unused_result > detectionCallAnnotation.c: In function ?calculateIRGCContent?: > detectionCallAnnotation.c:262: warning: ignoring return value of > ?fgets?, declared with attribute warn_unused_result > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c exon-algorithms.c -o exon-algorithms.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c exon-align.c -o exon-align.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c fullscan.c -o fullscan.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c gene-algorithms.c -o gene-algorithms.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c gene-value-index.c -o gene-value-index.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c hashtable.c -o hashtable.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c index-builder.c -o index-builder.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared
> but never defined
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c input-files.c -o input-files.o
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c processExons.c -o processExons.o
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c propmapped.c -o propmapped.o
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c qualityScores.c -o qualityScores.o
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c readSummary.c -o readSummary.o
> readSummary.c: In function ‘readSummary’:
> readSummary.c:122: warning: format ‘%d’ expects type ‘int’, but argument
> 5 has type ‘long int’
> readSummary.c:122: warning: format ‘%d’ expects type ‘int’, but argument
> 6 has type ‘long int’
> readSummary.c:39: warning: ignoring return value of ‘getline’, declared
> with attribute warn_unused_result
> readSummary.c:52: warning: ignoring return value of ‘getline’, declared
> with attribute warn_unused_result
> readSummary.c:55: warning: ignoring return value of ‘getline’, declared
> with attribute warn_unused_result
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c removeDuplicatedReads.c -o removeDuplicatedReads.o
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c sam2bed.c -o sam2bed.o
> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
> -O3 -pipe -g -c sorted-hashtable.c -o sorted-hashtable.o
> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
> declared but never defined
> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
> but never defined
> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
> declared but never defined
> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’
declared
> but never defined
> gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o
> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o
> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o
> gene-value-index.o hashtable.o index-builder.o input-files.o
> processExons.o propmapped.o qualityScores.o readSummary.o
> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread
> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR
> ^Cmake: *** Deleting file `Rsubread.so'
> make: *** [Rsubread.so] Interrupt
> ** R
> ** inst
> ** preparing package for lazy loading
> ** help
> *** installing help indices
> ** building package indices
> ** installing vignettes
>    ‘Rsubread.Rnw’
> ** testing if installed package can be loaded
> Error in library.dynam(lib, package, package.lib) :
>   shared object ‘Rsubread.so’ not found
> Error: loading failed
> Execution halted
> --------------------------------------------------------------------
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
>  [7] LC_PAPER=C                LC_NAME=C
>  [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> --------------------------------------------------------------------
> $ gcc -v
> Using built-in specs.
> Target: x86_64-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> 4.4.3-4ubuntu5.1'
> --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> --enable-shared --enable-multiarch --enable-linker-build-id
> --with-system-zlib --libexecdir=/usr/lib --without-included-gettext
> --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4
> --program-suffix=-4.4 --enable-nls --enable-clocale=gnu
> --enable-libstdcxx-debug --enable-plugin --enable-objc-gc
> --disable-werror --with-arch-32=i486 --with-tune=generic
> --enable-checking=release --build=x86_64-linux-gnu
> --host=x86_64-linux-gnu --target=x86_64-linux-gnu
> Thread model: posix
> gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1)
> --------------------------------------------------------------------
>
> On Wed, 2012-06-06 at 20:10 +1000, Wei Shi wrote:
>> Dear Dan,
>>
>> In our investigation, requesting a contiguous 1GB block did not seem to
>> be the problem. We tracked the memory usage of the buildindex() function
>> when running it on the yeast genome in a 32-bit VM, and found that the
>> segfault happened right after a request for a few KB of memory was sent
>> to the system when the memory parameter was set to 2500. However, the
>> problem was gone when the memory parameter was changed to 1000.
>>
>> Removing highly repetitive 16-mers requires a contiguous 1GB block of
>> memory, but this step was always executed successfully. This step is also
>> included in the old version of Rsubread (1.1.1), and it did not cause a
>> problem there either.
>>
>> Could you please provide us with your complete code for running your test
>> and also your session info? This will help us to diagnose what the problem
>> could be, because we couldn't reproduce what you saw on our end.
>>
>> For the compilation issue on your 64bit laptop, could you provide us with
>> more details as well, including the message output from gcc?
>>
>> Thanks,
>> Wei
>>
>>> Dear Wei,
>>>
>>> Unfortunately, reducing the memory parameter to 1000 still causes the
>>> segfault. I guess that with the 3g ram limit on a 32bit system, there is
>>> still a good chance that you cannot request a contiguous 1g block.
>>>
>>> For that 64bit laptop, the 6g memory draining is still strange.
>>> It is happening during the installation, when compiling the shared library
>>> Rsubread.so, not when running the buildindex function. Btw, the gcc version
>>> is 4.4.3.
>>>
>>> The server and the opensuse box run gcc version 4.3.1 and 4.5.0 respectively.
>>>
>>> Regards,
>>> Dan
>>>
>>> On Wed, 2012-06-06 at 14:56 +1000, Wei Shi wrote:
>>>> Dear Dan,
>>>>
>>>> It is probably because including genome sequences in the index slowed
>>>> down your laptop. But I believe it should be alleviated if you give
>>>> smaller values to the 'memory' parameter of the buildindex() function.
>>>> Also, index building is a one-off operation; you do not need to
>>>> redo it even when new releases come.
>>>>
>>>> For your 32-bit opensuse box, I guess the problem will be solved if you
>>>> change the amount of memory requested to 1000MB.
>>>>
>>>> Cheers,
>>>> Wei
>>>>
>>>> On Jun 5, 2012, at 11:43 PM, Dan Du wrote:
>>>>
>>>>> Hi Robert,
>>>>>
>>>>> I have been experiencing something else, possibly related to yours,
>>>>> on a 64bit ubuntu laptop with 6g of ram.
>>>>>
>>>>> As I recall, when bumping to Bioc 2.10, the Rsubread installation kind
>>>>> of ate all the memory and basically froze the system, so I had to call it
>>>>> off, yet building it on the server side turned out fine. So I think I
>>>>> just accepted that the new version may be 'computationally heavy' and thus
>>>>> not suitable for a normal pc, though I did not find any mention of
>>>>> this increased memory requirement in the NEWS file.
>>>>> >>>>> So currently Rsubread stays at 1.4.4 on that pc, all subsequent >>>> versions >>>>> of Rsubread drain the memory in the same way when compiling >>>> Rsubread.so. >>>>> >>>>> Now I think I can confirm this on a 32-bit opensuse box, it did >>>>> successfully built, but when running the example code in the manual, >>>>> same segfault happens. >>>>> >>>>> >>>>>> library(Rsubread) >>>>>> ref <- system.file("extdata","reference.fa",package="Rsubread") >>>>>> path <- system.file("extdata",package="Rsubread") >>>>>> buildindex(basename=file.path(path,"reference_index"),reference=ref) >>>>> >>>>> Building a base-space index. >>>>> Size of memory used=3700 MB >>>>> Base name of the built index >>>>> = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index >>>>> >>>>> *** caught segfault *** >>>>> address 0xdf03ee80, cause 'memory not mapped' >>>>> >>>>> Traceback: >>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >>>>> as.character(cmd), PACKAGE = "Rsubread") >>>>> 2: buildindex(basename = file.path(path, "reference_index"), reference >>>>> = ref) >>>>> >>>>>> sessionInfo() >>>>> R version 2.15.0 Patched (2012-06-04 r59517) >>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>> [7] LC_PAPER=C LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods >>>>> base >>>>> >>>>> other attached packages: >>>>> [1] Rsubread_1.6.3 >>>>> >>>>> >>>>> Regards, >>>>> Dan >>>>> >>>>> On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: >>>>>> hi, >>>>>> >>>>>> the computer room at my university where we do practicals on R & >>>> Bioconductor runs a 32bit linux distribution and when i tried to run >>>> the latest version of the Rsubread package 
(1.6.3) it crashes when >>>> calling the buildindex() function on a multifasta file with the yeast >>>> genome. this does *not* happen under a 64bit linux distribution. >>>>>> >>>>>> i have verified that installing the version before (1.4.4) on the >>>> current R 2.15 it also crashes (on the 32bit), but two versions >>>> before, the 1.1.1, it does *not* and it works smoothly on this 32bit >>>> linux distribution. >>>>>> >>>>>> i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 >>>> where allChr.fa is the multifasta file with the yeast genome. >>>>>> >>>>>> so i can manage by now the problem by using the 1.1.1 version on R >>>> 2.15 for my teaching but i wonder whether there would be some easy >>>> solution for this, or even if it could be a symptom of something else >>>> that the Rsubread developers should worry about. i know that using a >>>> 32bit system nowadays is quite obsolete but this is what i got for >>>> teaching :( and i would be happy to let my students play with the >>>> latest version of Rsubread in the future. >>>>>> >>>>>> >>>>>> thanks!!! >>>>>> robert. >>>>>> >>>>>> ======================Rsubread 1.6.3 on R 2.15======================= >>>>>> >>>>>>> library(Rsubread) >>>>>>> sessionInfo() >>>>>> R version 2.15.0 (2012-03-30) >>>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsubread_1.6.3 >>>>>> >>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", >>>> memory=2500) >>>>>> >>>>>> Building a base-space index. 
>>>>>> Size of memory used=2500 MB >>>>>> Base name of the built index = subreadindex >>>>>> >>>>>> *** caught segfault *** >>>>>> address 0xdf670cc0, cause 'memory not mapped' >>>>>> >>>>>> Traceback: >>>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >>>> as.character(cmd), PACKAGE = "Rsubread") >>>>>> 2: buildindex(basename = "subreadindex", reference = "allChr.fa", >>>> memory = 2500) >>>>>> >>>>>> Possible actions: >>>>>> 1: abort (with core dump, if enabled) >>>>>> 2: normal R exit >>>>>> 3: exit R without saving workspace >>>>>> 4: exit R saving workspace >>>>>> Selection: >>>>>> >>>>>> >>>>>> ======================Rsubread 1.1.1 on R 2.15======================= >>>>>> >>>>>>> library(Rsubread) >>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", >>>> memory=2500) >>>>>> >>>>>> Building the index in the base space. >>>>>> Size of memory requested=2500 MB >>>>>> Index base name = subreadindex >>>>>> INDEX ITEMS PER PARTITION = 275940352 >>>>>> >>>>>> completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps >>>> completed=81.76%; time used=2.4s; rate=4111.8k >>>> bps/s; total=12m bps >>>>>> All the chromosome files are processed. >>>>>> | Dumping index >>>> [===========================================================>] >>>>>> Index subreadindex is successfully built. 
>>>>>>> sessionInfo() >>>>>> R version 2.15.0 (2012-03-30) >>>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsubread_1.1.1 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> ______________________________________________________________________ >>>> The information in this email is confidential and intended solely for >>>> the addressee. >>>> You must not disclose, forward, print or use it without the permission >>>> of the sender. >>>> ______________________________________________________________________ >>> >>> >>> >>> >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. 
>> ______________________________________________________________________
>

From alyamahmoud at gmail.com Wed Jun 6 15:37:18 2012
From: alyamahmoud at gmail.com (Alyaa Mahmoud)
Date: Wed, 6 Jun 2012 16:37:18 +0300
Subject: [BioC] error in hclust function
In-Reply-To: <20120602151440.GA460@Thomas-Girkes-MacBook-Pro.local>
References: <20120602151440.GA460@Thomas-Girkes-MacBook-Pro.local>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From alyamahmoud at gmail.com Wed Jun 6 15:41:08 2012
From: alyamahmoud at gmail.com (Alyaa Mahmoud)
Date: Wed, 6 Jun 2012 16:41:08 +0300
Subject: [BioC] pvclust error
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From stefanie.tauber at univie.ac.at Wed Jun 6 16:03:24 2012
From: stefanie.tauber at univie.ac.at (Stefanie)
Date: Wed, 6 Jun 2012 14:03:24 +0000
Subject: [BioC] makeTranscriptDbFromBiomart error
Message-ID:

Hi,

here is my code:
library(GenomicFeatures)
humanDB = makeTranscriptDbFromBiomart(biomart = "ensembl", dataset =
"hsapiens_gene_ensembl")

This is the error I get:
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error in
.extractCdsRangesFromBiomartTable(bm_table) :
  BioMart data anomaly: some 5' UTR have a start > end

This seems to be a problem with BioMart, not with R?
Does anybody have an idea?

Best,
Stefanie

sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] de_AT.UTF-8/de_AT.UTF-8/de_AT.UTF-8/C/de_AT.UTF-8/de_AT.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicFeatures_1.6.5 AnnotationDbi_1.16.10 Biobase_2.14.0 GenomicRanges_1.6.4 IRanges_1.12.5 biomaRt_2.10.0

loaded via a namespace (and not attached):
[1] Biostrings_2.22.0 BSgenome_1.22.0 DBI_0.2-5 RCurl_1.8-0 RSQLite_0.11.1 rtracklayer_1.14.4 tools_2.14.1 XML_3.6-2
[9] zlibbioc_1.0.0

From Julie.Zhu at umassmed.edu Wed Jun 6 16:46:10 2012
From: Julie.Zhu at umassmed.edu (Zhu, Lihua (Julie))
Date: Wed, 6 Jun 2012 14:46:10 +0000
Subject: [BioC] Question about ChIPpeakAnno
In-Reply-To:
Message-ID:

Dear Petros,

You can safely ignore the warning message and proceed. I will modify the
code for the next release to replace multiple with select internally.

Best regards,

Julie


On 6/6/12 4:04 AM, "Petros Kolovos" wrote:

> Dear Dr Julie Zhu,
>
> Good morning
>
> My name is Petros Kolovos and I am a PhD student at Rotterdam, Netherlands.
>
> I was using your library in order to annotate some chip peaks and in order
> to make some venn diagrams.
>
> But I faced some problems with the venn diagrams
>
> Here is what I am doing
>
>> venn=makeVennDiagram(RangedDataList (pr, Anno_exon),NameOfPeaks =
> c("peaks","exon"),maxgap = 0, totalTest = 500000)
> Warning message:
> In findOverlappingPeaks(Peaks[[1]], Peaks[[2]], NameOfPeaks1 =
> NameOfPeaks[1], :
> Please use select instead of multiple!
>
> What should I do?
>
> Could you please help me as I am a rookie in this field
>
> Thank you in advance
>
> Yours sincerely
> Petros Kolovos
>
>
>

From vilanew at gmail.com Wed Jun 6 17:00:07 2012
From: vilanew at gmail.com (David martin)
Date: Wed, 6 Jun 2012 17:00:07 +0200
Subject: [BioC] shortread quality
Message-ID:

Hi,
I'm reading a fastq file from the solexa sequencer.
I would like to know how many reads have a phred score above Q29. The thing
is that I get the densities, so I don't really know how many reads out of
the total pass that filter. It's probably easy for you, so any hint would
be helpful.

library("ShortRead")
fqpattern <- "1102sdd_SN148_A_s_3_seq_GJH-85.txt"

path = getwd()
sp <- SolexaPath(path,dataPath=path,analysisPath=path)

# Read fastq File and save report
fq <- readFastq(sp, fqpattern)
qaSummary <- qa(fq,fqpattern)
save(qaSummary, file=file.path("./", paste(fqpattern,".rda",sep="" )))
report(qaSummary,dest="report")

#Quality

idx = which(qaSummary[["readQualityScore"]]["quality"] > 29)
a = cbind( qaSummary[["readQualityScore"]][idx,"quality"] ,
qaSummary[["readQualityScore"]][idx,"density"])
a #reads with a quality >Q29

#How to get the total number ? or percent compared to the total number
#of reads ?

thanks

From projectbasu at gmail.com Wed Jun 6 19:37:46 2012
From: projectbasu at gmail.com (swaraj basu)
Date: Wed, 6 Jun 2012 19:37:46 +0200
Subject: [BioC] Uploading BED file: rtracklayer
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From lawrence.michael at gene.com Wed Jun 6 19:46:51 2012
From: lawrence.michael at gene.com (Michael Lawrence)
Date: Wed, 6 Jun 2012 10:46:51 -0700
Subject: [BioC] Uploading BED file: rtracklayer
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From lawrence.michael at gene.com Wed Jun 6 19:50:43 2012
From: lawrence.michael at gene.com (Michael Lawrence)
Date: Wed, 6 Jun 2012 10:50:43 -0700
Subject: [BioC] Uploading BED file: rtracklayer
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From schoi at cornell.edu Wed Jun 6 20:14:00 2012
From: schoi at cornell.edu (Sang Chul Choi)
Date: Wed, 6 Jun 2012 18:14:00 +0000
Subject: [BioC] qrqc with variable length of short reads?
 - readSeqFile could not handle a 2GB zipped file.
In-Reply-To:
References: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu>
 <3B4C55BA-D655-444A-91F3-F4107198E1EC@gmail.com>
 <03E26781-7909-4DFD-9BD8-8092A9A8F237@cornell.edu>
Message-ID: <76217387-CBC5-4D89-8505-880B7557F9CA@cornell.edu>

Thank you for offering help. Appended are the R script and its output,
mostly from the sessionInfo() function.

Thank you,

SangChul

The R script:
========================================================================================
library(qrqc)
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 3)
{
  cat ("Rscript job-stat.R 1.fq.gz 1.fq.RData sanger\n")
  quit("yes")
}
fq.name <- args[1]
fq.quality <- args[3]
sessionInfo()
# fq.file <- readSeqFile(fq.name,quality=fq.quality,hash=FALSE)
# toplot <- qualPlot(fq.file)
# fq.plot <- args[2]
# save(list="toplot", file = fq.plot)
========================================================================================

Output from the R script above:
========================================================================================
Loading required package: reshape
Loading required package: plyr

Attaching package: 'reshape'

The following object(s) are masked from 'package:plyr':

    rename, round_any

Loading required package: ggplot2
Loading required package: methods
Loading required package: Biostrings
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'

The following object(s) are masked from 'package:stats':

    xtabs

The following object(s) are masked from 'package:base':

    anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
    get, intersect, lapply, Map, mapply, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, table, tapply, union, unique

Loading required package: IRanges

Attaching package: 'IRanges'

The following object(s) are masked from 'package:reshape':

    rename

The following object(s) are masked from 'package:plyr':

    compact, desc,
    rename

Loading required package: biovizBase
Loading required package: brew
Loading required package: xtable
Loading required package: Rsamtools
Loading required package: GenomicRanges
Loading required package: testthat
R Under development (unstable) (2012-04-01 r58897)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
 [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
 [7] LC_PAPER=C                     LC_NAME=C
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base

other attached packages:
 [1] qrqc_1.10.0         testthat_0.6        Rsamtools_1.8.5
 [4] GenomicRanges_1.8.6 xtable_1.7-0        brew_1.0-6
 [7] biovizBase_1.4.2    Biostrings_2.24.1   IRanges_1.14.3
[10] BiocGenerics_0.2.0  ggplot2_0.9.1       reshape_0.8.4
[13] plyr_1.7.1

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.18.1  Biobase_2.16.0        biomaRt_2.12.0
 [4] bitops_1.0-4.1        BSgenome_1.24.0       cluster_1.14.2
 [7] colorspace_1.1-1      DBI_0.2-5             dichromat_1.2-4
[10] digest_0.5.2          evaluate_0.4.2        GenomicFeatures_1.8.1
[13] grid_2.16.0           Hmisc_3.9-3           labeling_0.1
[16] lattice_0.20-6        MASS_7.3-18           memoise_0.1
[19] munsell_0.3           proto_0.3-9.2         RColorBrewer_1.0-5
[22] RCurl_1.91-1          reshape2_1.2.1        RSQLite_0.11.1
[25] rtracklayer_1.16.1    scales_0.2.1          stats4_2.16.0
[28] stringr_0.6           tools_2.16.0          XML_3.9-4
[31] zlibbioc_1.2.0
=========================================================================================

On Jun 5, 2012, at 3:03 PM, Vince S. Buffalo wrote:

> Hi SangChul,
>
> Can you attach your sessionInfo()? I will take a look into this issue.
>
> best,
> Vince
>
> On Tue, Jun 5, 2012 at 12:01 PM, Sang Chul Choi wrote:
> I have tried turning off the hash option when reading sequences of variable lengths in a gzipped FASTQ file (2GB) using readSeqFile.
The computer has 16 GB memory, and it used up all of the memory, leaving R in "Dead" or not running any more. Is there a way of sidestepping this problem? > > Thank you, > > SangChul > > On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote: > > > Hi SangChul, > > > > By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease. > > > > Best, > > Vince > > > > Sent from my iPhone > > > > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi wrote: > > > >> Hi, > >> > >> I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end. Then, I tried to read in this second FASTQ file of reads of variable length. It used up all of the 16 GB memory, and not using CPUs at all. It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message. It seems to fall apart when short reads are of different size. > >> > >> I wish to see how the trimming change the base-quality plots, and this is a problem. I am wondering if there is a way of sidestepping this problem. > >> > >> Thank you, > >> > >> SangChul > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Vince Buffalo > Statistical Programmer > Bioinformatics Core > UC Davis Genome Center > vincebuffalo.com twitter.com/vsbuffalo > From vsbuffalo at gmail.com Wed Jun 6 20:25:35 2012 From: vsbuffalo at gmail.com (Vince S. 
Buffalo) Date: Wed, 6 Jun 2012 11:25:35 -0700 Subject: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file. In-Reply-To: <76217387-CBC5-4D89-8505-880B7557F9CA@cornell.edu> References: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu> <3B4C55BA-D655-444A-91F3-F4107198E1EC@gmail.com> <03E26781-7909-4DFD-9BD8-8092A9A8F237@cornell.edu> <76217387-CBC5-4D89-8505-880B7557F9CA@cornell.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From Julie.Zhu at umassmed.edu Wed Jun 6 21:05:52 2012 From: Julie.Zhu at umassmed.edu (Zhu, Lihua (Julie)) Date: Wed, 6 Jun 2012 19:05:52 +0000 Subject: [BioC] Question about ChIPpeakAnno In-Reply-To: <632bfa5730487e958a3a04f7118f3db3.squirrel@webmail.erasmusmc.nl> Message-ID: Petros, Totaltest specifies the total number of tests performed to obtain the list of peaks. This is one of the frequently asked questions and many have contributed wisdoms to address this question. Could you please take a look at the slides at http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChI PpeakAnno.pdf, esp. the faq slide near the end? Please do cc Bioconductor list (bioconductor ) for sharing. Thanks! Best regards, Julie On 6/6/12 11:02 AM, "Petros Kolovos" wrote: > Dear Julie, > > I would like to ask you also something else. > > In the makeVennDiagrams function what do you mean by total tests? > > For example i have the following code > >> venn=makeVennDiagram(RangedDataList (pr, intron),NameOfPeaks = >> c("peaks","intron"),maxgap = 0, totalTest = 320000) > > PR=my peaks has 50000 peaks > intron= table aquired from UCSC with all the exons 250000 peaks/coordinates > > So how much should I put the totalTest > > Is there a problem if I make a venn diagram of peaks vs introns? > > Thanks > > Best regards > Petros > > > >> Dear Petros, >> >> You can safely ignore the warning message and proceed. 
I will modify the
>> code for the next release to replace multiple with select internally.
>>
>> Best regards,
>>
>> Julie
>>
>>
>> On 6/6/12 4:04 AM, "Petros Kolovos" wrote:
>>
>>> Dear Dr Julie Zhu,
>>>
>>> Good morning
>>>
>>> My name is Petros Kolovos and I am a PhD student at Rotterdam,
>>> Netherlands.
>>>
>>> I was using your library in order to annotate some chip peaks and in
>>> order to make some venn diagrams.
>>>
>>> But I faced some problems with the venn diagrams
>>>
>>> Here is what I am doing
>>>
>>>> venn=makeVennDiagram(RangedDataList (pr, Anno_exon),NameOfPeaks =
>>> c("peaks","exon"),maxgap = 0, totalTest = 500000)
>>> Warning message:
>>> In findOverlappingPeaks(Peaks[[1]], Peaks[[2]], NameOfPeaks1 =
>>> NameOfPeaks[1], :
>>> Please use select instead of multiple!
>>>
>>> What should I do?
>>>
>>> Could you please help me as I am a rookie in this field
>>>
>>> Thank you in advance
>>>
>>> Yours sincerely
>>> Petros Kolovos
>>>
>>>
>>>
>>>
>>
>>
>>
>
>

From mtmorgan at fhcrc.org Wed Jun 6 21:11:43 2012
From: mtmorgan at fhcrc.org (Martin Morgan)
Date: Wed, 6 Jun 2012 12:11:43 -0700
Subject: [BioC] shortread quality
In-Reply-To:
References:
Message-ID: <4FCFAB6F.9020108@fhcrc.org>

Hi,

On 06/06/2012 08:00 AM, David martin wrote:
> Hi,
> I'm reading a fastq file from the solexa sequencer.
> I would like to know how many reads have a phred score (>Q29). The thing

If you mean the average base quality score, then

    fq <- readFastq(sp, fqpattern)
    score <- alphabetScore(fq)

gives the sum of the base quality scores for each read, so it is a vector
with one element per read. The average is

    aveScore <- score / width(fq)

and then you're in the realm of familiar R again, e.g.,

    hist(aveScore)
    table(aveScore > 29)

etc.

Hope that helps,

Martin

> is that i get the densities so i don't really know how many reads from
> the total pass that filter.
It's probaly easy for you so any hint would > be helpful > > library("ShortRead") > fqpattern <- "1102sdd_SN148_A_s_3_seq_GJH-85.txt" > > path = getwd() > sp <- SolexaPath(path,dataPath=path,analysisPath=path) > > # Read fastq File and save report > fq <- readFastq(sp, fqpattern) > qaSummary <- qa(fq,fqpattern) > save(qaSummary, file=file.path("./", paste(fqpattern,".rda",sep="" ))) > report(qaSummary,dest="report") > > #Quality > > idx = which(qaSummary[["readQualityScore"]]["quality"] > 29) > a = cbind( qaSummary[["readQualityScore"]][idx,"quality"] , > qaSummary[["readQualityScore"]][idx,"density"]) > a #reads with a quality >Q29 > > #How to get the total number ? or percent compared to the total number > of reads ? > > thanks > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From Julie.Zhu at umassmed.edu Wed Jun 6 21:26:48 2012 From: Julie.Zhu at umassmed.edu (Zhu, Lihua (Julie)) Date: Wed, 6 Jun 2012 19:26:48 +0000 Subject: [BioC] Question about ChIPpeakAnno In-Reply-To: Message-ID: Sorry, Petros! It is - instead of _. http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/ChI PpeakAnno-BioC2011.pdf For your other questions in a separate email, you need to give sessionInfo() and tell us whether you are able to get the example work. Thanks! Best regards, Julie On 6/6/12 3:22 PM, "Petros Kolovos" wrote: > Dear Julie > > the > http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChIPp > eakAnno.pdf > doesn't work. Could you please check it? 
> > Best regards > Petros > > >> Petros, >> >> Totaltest specifies the total number of tests performed to obtain the list >> of peaks. >> >> This is one of the frequently asked questions and many have contributed >> wisdoms to address this question. Could you please take a look at the >> slides >> at >> http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChI >> PpeakAnno.pdf, esp. the faq slide near the end? >> >> Please do cc Bioconductor list (bioconductor >> ) for sharing. Thanks! >> >> Best regards, >> >> Julie >> >> >> On 6/6/12 11:02 AM, "Petros Kolovos" wrote: >> >>> Dear Julie, >>> >>> I would like to ask you also something else. >>> >>> In the makeVennDiagrams function what do you mean by total tests? >>> >>> For example i have the following code >>> >>>> venn=makeVennDiagram(RangedDataList (pr, intron),NameOfPeaks = >>>> c("peaks","intron"),maxgap = 0, totalTest = 320000) >>> >>> PR=my peaks has 50000 peaks >>> intron= table aquired from UCSC with all the exons 250000 >>> peaks/coordinates >>> >>> So how much should I put the totalTest >>> >>> Is there a problem if I make a venn diagram of peaks vs introns? >>> >>> Thanks >>> >>> Best regards >>> Petros >>> >>> >>> >>>> Dear Petros, >>>> >>>> You can safely ignore the warning message and proceed. I will modify >>>> the >>>> code for the next release to replace multiple with select internally. >>>> >>>> Best regards, >>>> >>>> Julie >>>> >>>> >>>> On 6/6/12 4:04 AM, "Petros Kolovos" wrote: >>>> >>>>> Dear Dr Julie Zhu, >>>>> >>>>> Good morning >>>>> >>>>> My name is Petros Kolovos and I am a PhD student at Rotterdam, >>>>> Netherlands. >>>>> >>>>> I was using your library in order to annotate some chip peaks and in >>>>> order >>>>> to make some venn diagrams. 
>>>>> >>>>> But I faced some problems with the venn diagrams >>>>> >>>>> Here is what I am doing >>>>> >>>>>> venn=makeVennDiagram(RangedDataList (pr, Anno_exon),NameOfPeaks = >>>>> c("peaks","exon"),maxgap = 0, totalTest = 500000) >>>>> Warning message: >>>>> In findOverlappingPeaks(Peaks[[1]], Peaks[[2]], NameOfPeaks1 = >>>>> NameOfPeaks[1], : >>>>> Please use select instead of multiple! >>>>> >>>>> What should I do? >>>>> >>>>> Could you please help me as I am a rookie in this field >>>>> >>>>> Thank you in advance >>>>> >>>>> Yours sincerely >>>>> Petros Kolovos >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> >> >> >> > > From mailinglist.honeypot at gmail.com Wed Jun 6 21:59:14 2012 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 6 Jun 2012 15:59:14 -0400 Subject: [BioC] makeTranscriptDbFromBiomart error In-Reply-To: References: Message-ID: Hi Stefanie, On Wed, Jun 6, 2012 at 10:03 AM, Stefanie wrote: > Hi, > > here is my code: > library(GenomicFeatures) > humanDB = makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = > "hsapiens_gene_ensembl") > > This is the error I get thrown: > Download and preprocess the 'transcripts' data frame ... OK > Download and preprocess the 'chrominfo' data frame ... OK > Download and preprocess the 'splicings' data frame ... Fehler in > .extractCdsRangesFromBiomartTable(bm_table) : > ?BioMart data anomaly: some 5' UTR have a start > end > > Seems to be a problem of biomart not of R? > Anybody any idea? I don't think this is the problem, but I'd first start by updating R to 2.15.x and reinstall your bioconductor packages (via `biocLite`) so that your playing w/ the latest and greatest. Second, you might try building the DB you need via UCSC using their ensGene table ... I guess it should be pretty much what you need, no? 
For example: R> library(GenomicFeatures) R> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="ensGene") HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From mtmorgan at fhcrc.org Wed Jun 6 22:34:48 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Wed, 6 Jun 2012 13:34:48 -0700 Subject: [BioC] shortread quality In-Reply-To: <4FCFAB6F.9020108@fhcrc.org> References: <4FCFAB6F.9020108@fhcrc.org> Message-ID: <4FCFBEE8.2000803@fhcrc.org> On 06/06/2012 12:11 PM, Martin Morgan wrote: > Hi, > > On 06/06/2012 08:00 AM, David martin wrote: >> Hi, >> I'm reading a fastq file from the solexa sequencer. >> I would like to know how many reads have a phred score (>Q29). The thing > > If you mean the average base quality score, then > > fq <- readFastq(sp, fqpattern) > score <- alphabetScore(fq) > > gives the sum of the base quality scores for each read, so is a vector > as long as the length of the reads. The average is > > aveScore <- score / width(fq) > > and then you're in the realm of familiar R again, e.g., > > hist(aveScore) > table(aveScore > 29) > > etc. > > Hope that heps, I guess the qa object already gets you further, as you've indicated df <- qaSummary[["readQualityScore"]] the 'density' column (apparently not really a density) could be turned into a cumulative density cdensity <- cumsum(df$density) / sum(df$density) and then look up the cumulative density nearest the quality that you're interested in cdensity[findInterval(29, df$quality)] You'd want to do these steps separately for each lane, if there were several in df. Martin > > Martin > > > >> is that i get the densities so i don't really know how many reads from >> the total pass that filter. 
It's probaly easy for you so any hint would >> be helpful >> >> library("ShortRead") >> fqpattern <- "1102sdd_SN148_A_s_3_seq_GJH-85.txt" >> >> path = getwd() >> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >> >> # Read fastq File and save report >> fq <- readFastq(sp, fqpattern) >> qaSummary <- qa(fq,fqpattern) >> save(qaSummary, file=file.path("./", paste(fqpattern,".rda",sep="" ))) >> report(qaSummary,dest="report") >> >> #Quality >> >> idx = which(qaSummary[["readQualityScore"]]["quality"] > 29) >> a = cbind( qaSummary[["readQualityScore"]][idx,"quality"] , >> qaSummary[["readQualityScore"]][idx,"density"]) >> a #reads with a quality >Q29 >> >> #How to get the total number ? or percent compared to the total number >> of reads ? >> >> thanks >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From seungyeul.yoo at mssm.edu Wed Jun 6 23:36:29 2012 From: seungyeul.yoo at mssm.edu (Yoo, Seungyeul) Date: Wed, 6 Jun 2012 21:36:29 +0000 Subject: [BioC] Mapping genes to their gene sets for GSEA Message-ID: <094C64D2-E578-457D-A137-559A8C8D8C86@mssm.edu> An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From hpages at fhcrc.org Thu Jun 7 01:33:58 2012 From: hpages at fhcrc.org (=?ISO-8859-1?Q?Herv=E9_Pag=E8s?=) Date: Wed, 06 Jun 2012 16:33:58 -0700 Subject: [BioC] readGappedAlignmentPairs with multimapping reads In-Reply-To: <20120602130824.BABCA13445B@mamba.fhcrc.org> References: <20120602130824.BABCA13445B@mamba.fhcrc.org> Message-ID: <4FCFE8E6.2040108@fhcrc.org> Hi Vedran, On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote: > > How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? Good question and I wish there was a simple answer... readGappedAlignmentPairs() delegates to findMateAlignment() for doing the pairing. findMateAlignment() does not have a full man page yet but will have one soon. The man page will explain the algorithm used for doing the pairing of records loaded from a BAM file. Here is roughly how it works. First only records with flag bit 0x1 set to 1, flag bit 0x4 set to 0, and flag bit 0x8 set to 0 are candidates for pairing (see the SAM Spec for a description of flag bits and fields). Any other record is discarded. That is, records that correspond to single end reads, and records that correspond to paired end reads where one or both ends are unmapped, are discarded. 
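The candidate filter described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `is_candidate` is a hypothetical helper (not a GenomicRanges function) and the `flag` values are made up; only the bit masks 0x1, 0x4 and 0x8 come from the description above.

```r
# Hypothetical helper (not part of GenomicRanges): TRUE for records that are
# candidates for pairing, i.e. paired (0x1 set) with neither the read itself
# (0x4) nor its mate (0x8) flagged as unmapped.
is_candidate <- function(flag) {
    bitwAnd(flag, 0x1L) != 0L &
        bitwAnd(flag, 0x4L) == 0L &
        bitwAnd(flag, 0x8L) == 0L
}

# Made-up SAM flags, for illustration only:
#    99 = paired, proper pair, mate on minus strand, first segment
#   147 = paired, proper pair, read on minus strand, last segment
#     0 = single-end read
#    73 = paired, mate unmapped, first segment
#   133 = paired, read unmapped, last segment
flag <- c(99L, 147L, 0L, 73L, 133L)

keep <- is_candidate(flag)
print(keep)
```

Only the first two flags pass the filter; the single-end record and the two records with an unmapped end are dropped.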
Then the algorithm looks at the following fields and flag bits:

  (A) QNAME
  (B) RNAME, RNEXT
  (C) POS, PNEXT
  (D) Flag bits 0x10 and 0x20
  (E) Flag bits 0x40 and 0x80

2 records rec(i) and rec(j) are considered mates iff all the following conditions are satisfied:

  (A) They have the same QNAME
  (B) RNEXT(i) == RNAME(j) and RNEXT(j) == RNAME(i)
  (C) PNEXT(i) == POS(j) and PNEXT(j) == POS(i)
  (D) Flag bit 0x20 of rec(i) == Flag bit 0x10 of rec(j) and Flag bit 0x20 of rec(j) == Flag bit 0x10 of rec(i)
  (E) rec(i) corresponds to the first segment in the template and rec(j) corresponds to the last segment in the template OR rec(j) corresponds to the first segment in the template and rec(i) corresponds to the last segment in the template

This algorithm will find almost all pairs unambiguously, even when the same pair of reads maps to several places in the genome. Note that when a given pair maps to a single place in the genome, looking at (A) is enough to pair the 2 corresponding records. The additional conditions (B), (C), (D) and (E) are only here to help in the situation where more than 2 records share the same QNAME. And that works most of the time, but there are still situations where this is not enough to solve the pairing problem unambiguously.
For example, here are 4 records (loaded in a GappedAlignments object) that cannot be paired with the above algorithm:

** Showing the 4 records as a GappedAlignments object of length 4:

GappedAlignments with 4 alignments and 2 elementMetadata cols:
                    seqnames strand      cigar qwidth   start     end
  SRR031714.2658602    chr2R      + 21M384N16M     37 6983850 6984270
  SRR031714.2658602    chr2R      + 21M384N16M     37 6983850 6984270
  SRR031714.2658602    chr2R      - 13M372N24M     37 6983858 6984266
  SRR031714.2658602    chr2R      - 13M378N24M     37 6983858 6984272
                    width ngap |  mrnm    mpos
  SRR031714.2658602   421    1 | chr2R 6983858
  SRR031714.2658602   421    1 | chr2R 6983858
  SRR031714.2658602   409    1 | chr2R 6983850
  SRR031714.2658602   415    1 | chr2R 6983850

Note that the BAM fields show up in the following columns:

  - QNAME: the names of the GappedAlignments object (unnamed col)
  - RNAME: the seqnames col
  - POS: the start col
  - RNEXT: the mrnm col
  - PNEXT: the mpos col

As you can see, the aligner has aligned the same pair to the same location twice! The only difference between the 2 aligned pairs is in the cigar i.e. one end of the pair is aligned twice to the same location with exactly the same cigar while the other end of the pair is aligned twice to the same location but with slightly different cigars.

** Now showing the corresponding flag bits:

     isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand
[1,]        1            1               0               0             0
[2,]        1            1               0               0             0
[3,]        1            1               0               0             1
[4,]        1            1               0               0             1
     isMateMinusStrand isFirstMateRead isSecondMateRead isNotPrimaryRead
[1,]                 1               0                1                0
[2,]                 1               0                1                0
[3,]                 0               1                0                0
[4,]                 0               1                0                0
     isNotPassingQualityControls isDuplicate
[1,]                           0           0
[2,]                           0           0
[3,]                           0           0
[4,]                           0           0

As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4) are both first mates. But looking at (A), (B), (C), (D) and (E), the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to disambiguate!
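Conditions (A) to (E), applied to the four records shown above, can be sketched in base R as follows. This is an illustration only and not the actual findMateAlignment() code: the data frame `recs`, its column names and the helper `is_mate` are assumptions made for this sketch; the values are copied from the tables above.

```r
# The four records from the example, one row per record. Column names are
# chosen for this sketch; they are not GenomicRanges accessors.
recs <- data.frame(
    qname      = rep("SRR031714.2658602", 4),
    rname      = "chr2R",
    pos        = c(6983850L, 6983850L, 6983858L, 6983858L),
    rnext      = "chr2R",
    pnext      = c(6983858L, 6983858L, 6983850L, 6983850L),
    minus      = c(FALSE, FALSE, TRUE, TRUE),   # flag bit 0x10
    mate_minus = c(TRUE, TRUE, FALSE, FALSE),   # flag bit 0x20
    first      = c(FALSE, FALSE, TRUE, TRUE),   # flag bit 0x40
    last       = c(TRUE, TRUE, FALSE, FALSE),   # flag bit 0x80
    stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where rec(i) and rec(j) satisfy (A) to (E).
# Vectorized over i and j so it can be used with outer().
is_mate <- function(i, j) {
    recs$qname[i] == recs$qname[j] &                 # (A)
        recs$rnext[i] == recs$rname[j] &             # (B)
        recs$rnext[j] == recs$rname[i] &
        recs$pnext[i] == recs$pos[j] &               # (C)
        recs$pnext[j] == recs$pos[i] &
        recs$mate_minus[i] == recs$minus[j] &        # (D)
        recs$mate_minus[j] == recs$minus[i] &
        ((recs$first[i] & recs$last[j]) |            # (E)
         (recs$first[j] & recs$last[i]))
}

idx <- seq_len(nrow(recs))
mates <- outer(idx, idx, is_mate)   # 4 x 4 matrix of compatible pairs
n_candidates <- rowSums(mates)      # number of possible mates per record

print(mates)
print(n_candidates)
```

Every record ends up with two compatible mates rather than one, which is the ambiguity described above: rec(1) and rec(2) are each compatible with both rec(3) and rec(4).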
Also note that everything is tagged as proper pair (flag bit 0x2 is set to 1) and primary read (flag bit 0x100 is set to 0), so using this information would not help disambiguate here. I'm wondering if there is some other place we should look at in the BAM file in order to disambiguate (e.g. a tag?), or if those ambiguous pairings are just part of the life with the SAM Spec. Not sure whether this is a weakness of the Spec? Or A feature? Any input on this would be appreciated. In the meantime, findMateAlignment() is just ignoring those ambiguous pairings (with a warning) i.e. records that cannot be paired unambiguously are not paired at all. Concretely that means that readGappedAlignmentPairs() is guaranteed to return a GappedAlignmentPairs object where every pair could be formed in an non-ambiguous way. Note that AFAICS in practice this approach doesn't seem to leave aside a lot of records because ambiguous pairing events seem pretty rare. Cheers, H. > > Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? > > Thank you in advance! > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 > > loaded via a namespace (and not attached): > [1] stats4_2.15.0 tools_2.15.0 > > > -- > Sent via the guest posting facility at bioconductor.org. 
> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 From hpages at fhcrc.org Thu Jun 7 01:46:26 2012 From: hpages at fhcrc.org (=?ISO-8859-1?Q?Herv=E9_Pag=E8s?=) Date: Wed, 06 Jun 2012 16:46:26 -0700 Subject: [BioC] readGappedAlignmentPairs with multimapping reads In-Reply-To: <20120602130824.BABCA13445B@mamba.fhcrc.org> References: <20120602130824.BABCA13445B@mamba.fhcrc.org> Message-ID: <4FCFEBD2.9010102@fhcrc.org> There was a small formatting problem with my previous email (due to some copy/paste from the SAM Spec opened in Acroread to my mailer), so I'm sending this again. Sorry for the noise... On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote: > > How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? Good question and I wish there was a simple answer... readGappedAlignmentPairs() delegates to findMateAlignment() for doing the pairing. findMateAlignment() does not have a full man page yet but will have one soon. The man page will explain the algorithm used for doing the pairing of records loaded from a BAM file. Here is roughly how it works. First only records with flag bit 0x1 set to 1, flag bit 0x4 set to 0, and flag bit 0x8 set to 0 are candidates for pairing (see the SAM Spec for a description of flag bits and fields). Any other record is discarded. That is, records that correspond to single end reads, and records that correspond to paired end reads where one or both ends are unmapped, are discarded. 
Then the algorithm looks at the following fields and flag bits:

  (A) QNAME
  (B) RNAME, RNEXT
  (C) POS, PNEXT
  (D) Flag bits 0x10 and 0x20
  (E) Flag bits 0x40 and 0x80

2 records rec(i) and rec(j) are considered mates iff all the following conditions are satisfied:

  (A) They have the same QNAME
  (B) RNEXT(i) == RNAME(j) and RNEXT(j) == RNAME(i)
  (C) PNEXT(i) == POS(j) and PNEXT(j) == POS(i)
  (D) Flag bit 0x20 of rec(i) == Flag bit 0x10 of rec(j) and Flag bit 0x20 of rec(j) == Flag bit 0x10 of rec(i)
  (E) rec(i) corresponds to the first segment in the template and rec(j) corresponds to the last segment in the template OR rec(j) corresponds to the first segment in the template and rec(i) corresponds to the last segment in the template

This algorithm will find almost all pairs unambiguously, even when the same pair of reads maps to several places in the genome. Note that when a given pair maps to a single place in the genome, looking at (A) is enough to pair the 2 corresponding records. The additional conditions (B), (C), (D) and (E) are only here to help in the situation where more than 2 records share the same QNAME. And that works most of the time, but there are still situations where this is not enough to solve the pairing problem unambiguously.
For example, here are 4 records (loaded in a GappedAlignments object) that cannot be paired with the above algorithm:

** Showing the 4 records as a GappedAlignments object of length 4:

GappedAlignments with 4 alignments and 2 elementMetadata cols:
                    seqnames strand      cigar qwidth   start     end
  SRR031714.2658602    chr2R      + 21M384N16M     37 6983850 6984270
  SRR031714.2658602    chr2R      + 21M384N16M     37 6983850 6984270
  SRR031714.2658602    chr2R      - 13M372N24M     37 6983858 6984266
  SRR031714.2658602    chr2R      - 13M378N24M     37 6983858 6984272
                    width ngap |  mrnm    mpos
  SRR031714.2658602   421    1 | chr2R 6983858
  SRR031714.2658602   421    1 | chr2R 6983858
  SRR031714.2658602   409    1 | chr2R 6983850
  SRR031714.2658602   415    1 | chr2R 6983850

Note that the BAM fields show up in the following columns:

  - QNAME: the names of the GappedAlignments object (unnamed col)
  - RNAME: the seqnames col
  - POS: the start col
  - RNEXT: the mrnm col
  - PNEXT: the mpos col

As you can see, the aligner has aligned the same pair to the same location twice! The only difference between the 2 aligned pairs is in the cigar i.e. one end of the pair is aligned twice to the same location with exactly the same cigar while the other end of the pair is aligned twice to the same location but with slightly different cigars.

** Now showing the corresponding flag bits:

     isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand
[1,]        1            1               0               0             0
[2,]        1            1               0               0             0
[3,]        1            1               0               0             1
[4,]        1            1               0               0             1
     isMateMinusStrand isFirstMateRead isSecondMateRead isNotPrimaryRead
[1,]                 1               0                1                0
[2,]                 1               0                1                0
[3,]                 0               1                0                0
[4,]                 0               1                0                0
     isNotPassingQualityControls isDuplicate
[1,]                           0           0
[2,]                           0           0
[3,]                           0           0
[4,]                           0           0

As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4) are both first mates. But looking at (A), (B), (C), (D) and (E), the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to disambiguate!
Also note that everything is tagged as proper pair (flag bit 0x2 is set to 1) and primary read (flag bit 0x100 is set to 0), so using this information would not help disambiguate here. I'm wondering if there is some other place we should look at in the BAM file in order to disambiguate (e.g. a tag?), or if those ambiguous pairings are just part of the life with the SAM Spec. Not sure whether this is a weakness of the Spec? Or A feature? Any input on this would be appreciated. In the meantime, findMateAlignment() is just ignoring those ambiguous pairings (with a warning) i.e. records that cannot be paired unambiguously are not paired at all. Concretely that means that readGappedAlignmentPairs() is guaranteed to return a GappedAlignmentPairs object where every pair was formed in an non-ambiguous way. Note that AFAICS in practice this approach doesn't seem to leave aside a lot of records because ambiguous pairing events seem pretty rare. Cheers, H. > > Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? > > Thank you in advance! > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 > > loaded via a namespace (and not attached): > [1] stats4_2.15.0 tools_2.15.0 > > > -- > Sent via the guest posting facility at bioconductor.org. 
> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 From yuchuan at stat.berkeley.edu Thu Jun 7 06:53:38 2012 From: yuchuan at stat.berkeley.edu (Yu Chuan Tai) Date: Wed, 6 Jun 2012 21:53:38 -0700 (PDT) Subject: [BioC] Amplicon and exon level read counts and GC content In-Reply-To: <4FC5B75C.7060900@fhcrc.org> References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org> Message-ID: Hi Martin, More questions on your approaches below. If my BAM files are generated by Bowtie2 on pair-end fastq files, for scanBamFlag(), should I set isPaired=TRUE? Do I need to worry about other input arguments for scanBamFlag() or ScanBamParam(), if I want to calculate coverage properly? Also, summarizeOverlaps() doesn't seem to handle paired-end reads. How to get around this, or it won't affect coverage calculation? Finally, is there any way to calculate base-specific coverage at any genomic locus or interval in Rsamtools? Thanks! 
Best, Yu Chuan > More specifically, after > > library(Rsamtools) > example(scanBam) # defines 'fl', a path to a bam file > > for a _single_ genomic range > > param = ScanBamParam(what="seq", > which=GRanges("seq1", IRanges(100, 500))) > dna = scanBam(fl, param=param)[[1]][["seq"]] > length(dna) # 365 reads overlap region > alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC > > though you'd likely want to specify several regions (vector arguments to > GRanges) and think about flags (scanBamFlag() and the flag argument to > ScanBamParam), read mapping quality, reads overlapping more than one region, > etc. (summarizeOverlaps implements several counting strategies, but it is > 'easy' to implement arbitrary approaches). > >> >> Martin >> >>> >>> Thanks for any input! >>> >>> Best, >>> Yu Chuan >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > From yuchuan at stat.berkeley.edu Thu Jun 7 07:43:53 2012 From: yuchuan at stat.berkeley.edu (Yu Chuan Tai) Date: Wed, 6 Jun 2012 22:43:53 -0700 (PDT) Subject: [BioC] base-specific read counts Message-ID: Hi, Is there any way to calculate base-specific read counts for a given genomic interval (including 1-base interval), for paired-end data aligned by Bowtie2 in BAM format? Thanks! 
Best, Yu Chuan From p.kolovos at erasmusmc.nl Wed Jun 6 21:22:45 2012 From: p.kolovos at erasmusmc.nl (Petros Kolovos) Date: Wed, 6 Jun 2012 21:22:45 +0200 Subject: [BioC] Question about ChIPpeakAnno In-Reply-To: References: Message-ID: Dear Julie the http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChIPpeakAnno.pdf doesn't work. Could you please check it? Best regards Petros > Petros, > > Totaltest specifies the total number of tests performed to obtain the list > of peaks. > > This is one of the frequently asked questions and many have contributed > wisdoms to address this question. Could you please take a look at the > slides > at > http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChI > PpeakAnno.pdf, esp. the faq slide near the end? > > Please do cc Bioconductor list (bioconductor > ) for sharing. Thanks! > > Best regards, > > Julie > > > On 6/6/12 11:02 AM, "Petros Kolovos" wrote: > >> Dear Julie, >> >> I would like to ask you also something else. >> >> In the makeVennDiagrams function what do you mean by total tests? >> >> For example i have the following code >> >>> venn=makeVennDiagram(RangedDataList (pr, intron),NameOfPeaks = >>> c("peaks","intron"),maxgap = 0, totalTest = 320000) >> >> PR=my peaks has 50000 peaks >> intron= table aquired from UCSC with all the exons 250000 >> peaks/coordinates >> >> So how much should I put the totalTest >> >> Is there a problem if I make a venn diagram of peaks vs introns? >> >> Thanks >> >> Best regards >> Petros >> >> >> >>> Dear Petros, >>> >>> You can safely ignore the warning message and proceed. I will modify >>> the >>> code for the next release to replace multiple with select internally. >>> >>> Best regards, >>> >>> Julie >>> >>> >>> On 6/6/12 4:04 AM, "Petros Kolovos" wrote: >>> >>>> Dear Dr Julie Zhu, >>>> >>>> Good morning >>>> >>>> My name is Petros Kolovos and I am a PhD student at Rotterdam, >>>> Netherlands. 
>>>> >>>> I was using your library in order to annotate some chip peaks and in >>>> order >>>> to make some venn diagrams. >>>> >>>> But I faced some problems with the venn diagrams >>>> >>>> Here is what I am doing >>>> >>>>> venn=makeVennDiagram(RangedDataList (pr, Anno_exon),NameOfPeaks = >>>> c("peaks","exon"),maxgap = 0, totalTest = 500000) >>>> Warning message: >>>> In findOverlappingPeaks(Peaks[[1]], Peaks[[2]], NameOfPeaks1 = >>>> NameOfPeaks[1], : >>>> Please use select instead of multiple! >>>> >>>> What should I do? >>>> >>>> Could you please help me as I am a rookie in this field >>>> >>>> Thank you in advance >>>> >>>> Yours sincerely >>>> Petros Kolovos >>>> >>>> >>>> >>>> >>> >>> >>> >> >> > > > From p.kolovos at erasmusmc.nl Wed Jun 6 21:30:27 2012 From: p.kolovos at erasmusmc.nl (Petros Kolovos) Date: Wed, 6 Jun 2012 21:30:27 +0200 Subject: [BioC] Question about ChIPpeakAnno In-Reply-To: References: Message-ID: Dear Julie, Thank you very much However I still cannot make the venn diagrams appeared with the aforementioned code in the previous emails venn=makeVennDiagram(RangedDataList (pr, Anno_exon),NameOfPeaks = c("peaks","exon"),maxgap = 0, totalTest = 500000) Warning message: In findOverlappingPeaks(Peaks[[1]], Peaks[[2]], NameOfPeaks1 = NameOfPeaks[1], : Please use select instead of multiple! I will post the session info () Thanks Petros > Sorry, Petros! > > It is - instead of _. > > http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/ChI > PpeakAnno-BioC2011.pdf > > For your other questions in a separate email, you need to give > sessionInfo() > and tell us whether you are able to get the example work. Thanks! > > Best regards, > > Julie > > > On 6/6/12 3:22 PM, "Petros Kolovos" wrote: > >> Dear Julie >> >> the >> http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChIPp >> eakAnno.pdf >> doesn't work. Could you please check it? 
>> >> Best regards >> Petros >> >> >>> Petros, >>> >>> Totaltest specifies the total number of tests performed to obtain the >>> list >>> of peaks. >>> >>> This is one of the frequently asked questions and many have contributed >>> wisdoms to address this question. Could you please take a look at the >>> slides >>> at >>> http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChI >>> PpeakAnno.pdf, esp. the faq slide near the end? >>> >>> Please do cc Bioconductor list (bioconductor >>> ) for sharing. Thanks! >>> >>> Best regards, >>> >>> Julie >>> >>> >>> On 6/6/12 11:02 AM, "Petros Kolovos" wrote: >>> >>>> Dear Julie, >>>> >>>> I would like to ask you also something else. >>>> >>>> In the makeVennDiagrams function what do you mean by total tests? >>>> >>>> For example i have the following code >>>> >>>>> venn=makeVennDiagram(RangedDataList (pr, intron),NameOfPeaks = >>>>> c("peaks","intron"),maxgap = 0, totalTest = 320000) >>>> >>>> PR=my peaks has 50000 peaks >>>> intron= table aquired from UCSC with all the exons 250000 >>>> peaks/coordinates >>>> >>>> So how much should I put the totalTest >>>> >>>> Is there a problem if I make a venn diagram of peaks vs introns? >>>> >>>> Thanks >>>> >>>> Best regards >>>> Petros >>>> >>>> >>>> >>>>> Dear Petros, >>>>> >>>>> You can safely ignore the warning message and proceed. I will modify >>>>> the >>>>> code for the next release to replace multiple with select internally. >>>>> >>>>> Best regards, >>>>> >>>>> Julie >>>>> >>>>> >>>>> On 6/6/12 4:04 AM, "Petros Kolovos" wrote: >>>>> >>>>>> Dear Dr Julie Zhu, >>>>>> >>>>>> Good morning >>>>>> >>>>>> My name is Petros Kolovos and I am a PhD student at Rotterdam, >>>>>> Netherlands. >>>>>> >>>>>> I was using your library in order to annotate some chip peaks and in >>>>>> order >>>>>> to make some venn diagrams. 
>>>>>> >>>>>> But I faced some problems with the venn diagrams >>>>>> >>>>>> Here is what I am doing >>>>>> >>>>>>> venn=makeVennDiagram(RangedDataList (pr, Anno_exon),NameOfPeaks = >>>>>> c("peaks","exon"),maxgap = 0, totalTest = 500000) >>>>>> Warning message: >>>>>> In findOverlappingPeaks(Peaks[[1]], Peaks[[2]], NameOfPeaks1 = >>>>>> NameOfPeaks[1], : >>>>>> Please use select instead of multiple! >>>>>> >>>>>> What should I do? >>>>>> >>>>>> Could you please help me as I am a rookie in this field >>>>>> >>>>>> Thank you in advance >>>>>> >>>>>> Yours sincerely >>>>>> Petros Kolovos >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >>> >> >> > > > From vilanew at gmail.com Thu Jun 7 09:50:31 2012 From: vilanew at gmail.com (David martin) Date: Thu, 7 Jun 2012 09:50:31 +0200 Subject: [BioC] shortread quality In-Reply-To: <4FCFBEE8.2000803@fhcrc.org> References: <4FCFAB6F.9020108@fhcrc.org> <4FCFBEE8.2000803@fhcrc.org> Message-ID: thanks Martin, That's exactly what i wanted to get !!! On 06/06/2012 10:34 PM, Martin Morgan wrote: > On 06/06/2012 12:11 PM, Martin Morgan wrote: >> Hi, >> >> On 06/06/2012 08:00 AM, David martin wrote: >>> Hi, >>> I'm reading a fastq file from the solexa sequencer. >>> I would like to know how many reads have a phred score (>Q29). The thing >> >> If you mean the average base quality score, then >> >> fq <- readFastq(sp, fqpattern) >> score <- alphabetScore(fq) >> >> gives the sum of the base quality scores for each read, so is a vector >> as long as the length of the reads. The average is >> >> aveScore <- score / width(fq) >> >> and then you're in the realm of familiar R again, e.g., >> >> hist(aveScore) >> table(aveScore > 29) >> >> etc. 
>> >> Hope that heps, > > I guess the qa object already gets you further, as you've indicated > > df <- qaSummary[["readQualityScore"]] > > the 'density' column (apparently not really a density) could be turned > into a cumulative density > > cdensity <- cumsum(df$density) / sum(df$density) > > and then look up the cumulative density nearest the quality that you're > interested in > > cdensity[findInterval(29, df$quality)] > > You'd want to do these steps separately for each lane, if there were > several in df. > > Martin > > > >> >> Martin >> >> >> >>> is that i get the densities so i don't really know how many reads from >>> the total pass that filter. It's probaly easy for you so any hint would >>> be helpful >>> >>> library("ShortRead") >>> fqpattern <- "1102sdd_SN148_A_s_3_seq_GJH-85.txt" >>> >>> path = getwd() >>> sp <- SolexaPath(path,dataPath=path,analysisPath=path) >>> >>> # Read fastq File and save report >>> fq <- readFastq(sp, fqpattern) >>> qaSummary <- qa(fq,fqpattern) >>> save(qaSummary, file=file.path("./", paste(fqpattern,".rda",sep="" ))) >>> report(qaSummary,dest="report") >>> >>> #Quality >>> >>> idx = which(qaSummary[["readQualityScore"]]["quality"] > 29) >>> a = cbind( qaSummary[["readQualityScore"]][idx,"quality"] , >>> qaSummary[["readQualityScore"]][idx,"density"]) >>> a #reads with a quality >Q29 >>> >>> #How to get the total number ? or percent compared to the total number >>> of reads ? 
>>> >>> thanks >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > From projectbasu at gmail.com Thu Jun 7 10:07:57 2012 From: projectbasu at gmail.com (swaraj basu) Date: Thu, 7 Jun 2012 10:07:57 +0200 Subject: [BioC] Uploading BED file: rtracklayer In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Thu Jun 7 10:36:07 2012 From: guest at bioconductor.org (bestbird7788 [guest]) Date: Thu, 7 Jun 2012 01:36:07 -0700 (PDT) Subject: [BioC] problem about set operation and computation after split Message-ID: <20120607083607.E6914138E14@mamba.fhcrc.org> hi, I met some problems in R, please help me. 1. How to do a intersect operation among several groups in one list, without a loop statement? (I think It may be a list) create data: myData <- data.frame(product = c(1,2,3,1,2,3,1,2,2), year=c(2009,2009,2009,2010,2010,2010,2011,2011,2011),value=c(1104,608,606,1504,508,1312,900,1100,800)) mySplit<- split(myData,myData$year) mySplit $`2009` product year value 1 1 2009 1104 2 2 2009 608 3 3 2009 606 $`2010` product year value 4 1 2010 1504 5 2 2010 508 6 3 2010 1312 $`2011` product year value 7 1 2011 900 8 2 2011 1100 9 2 2011 800 I want to get intersection of product between every year. 
I know the basic is:
intersect(intersect(mySplit[[1]]$product, mySplit[[2]]$product),mySplit[[3]]$product)
this will give the correct answer:
[1] 1 2
above code lacks reusability, so It should use a for loop:
myIntersect<-mySplit[[1]]$product
for (i in 1:length(mySplit)-1){
    myIntersect<-intersect(myIntersect,mySplit[[i+1]]$product)
}
It's correct too, but still too complex, so my question is:
Can I do the same thing just use another similar intersect function (without for/repeat/while).
What's this simple function's name ?

2.how to do a relative computation after split (notice: not before split)?

create data:
myData1 <- data.frame(product = c(1,2,3,1,2,3), year=c(2009,2009,2009,2010,2010,2010),value=c(1104,608,606,1504,508,1312),relative=0)
mySplit1<- split(myData1,myData1$year)
mySplit1
$`2009`
  product year value relative
1       1 2009  1104        0
2       2 2009   608        0
3       3 2009   606        0

$`2010`
  product year value relative
4       1 2010  1504        0
5       2 2010   508        0
6       3 2010  1312        0

I want compute relative value in the every group, what I mean is , I want get the result is just like below:
$`2009`
  product year value relative
1       1 2009  1104        0
2       2 2009   608     -496
3       3 2009   606       -2

$`2010`
  product year value relative
4       1 2010  1504        0
5       2 2010   508     -996
6       3 2010  1312      804

I think to use a loop maybe work, but Is there no direct method on list?

3.how to do a sorting after split, It's just like above question, what I want is sorting by value:
$`2009`
  product year value relative
3       3 2009   606        0
2       2 2009   608        0
1       1 2009  1104        0

$`2010`
  product year value relative
5       2 2010   508        0
6       3 2010  1312        0
4       1 2010  1504        0

4.
how to do a filtering after split, Yes, It's just like above question, what I want is to keep only the rows whose value is more than 1000:
$`2009`
  product year value relative
1       1 2009  1104        0

$`2010`
  product year value relative
4       1 2010  1504        0
6       3 2010  1312        0

 -- output of sessionInfo():

R version 2.15.0 Patched (2012-04-26 r59206)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

loaded via a namespace (and not attached):
[1] tools_2.15.0

--
Sent via the guest posting facility at bioconductor.org.

From stefanie.tauber at univie.ac.at Thu Jun 7 11:16:39 2012
From: stefanie.tauber at univie.ac.at (Stefanie Tauber)
Date: Thu, 7 Jun 2012 11:16:39 +0200
Subject: [BioC] makeTranscriptDbFromBiomart error
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From lawrence.michael at gene.com Thu Jun 7 14:12:34 2012
From: lawrence.michael at gene.com (Michael Lawrence)
Date: Thu, 7 Jun 2012 05:12:34 -0700
Subject: [BioC] Uploading BED file: rtracklayer
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From schoi at cornell.edu Thu Jun 7 14:27:45 2012
From: schoi at cornell.edu (Sang Chul Choi)
Date: Thu, 7 Jun 2012 12:27:45 +0000
Subject: [BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file.
In-Reply-To:
References: <8234A7F2-8FB7-45E0-BC7C-B5C3386C4EF1@cornell.edu> <3B4C55BA-D655-444A-91F3-F4107198E1EC@gmail.com> <03E26781-7909-4DFD-9BD8-8092A9A8F237@cornell.edu> <76217387-CBC5-4D89-8505-880B7557F9CA@cornell.edu>,
Message-ID: <01C32E26614616429B1D8B381358EFA81A1D2F04@MBXD-01.exchange.cornell.edu>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From guest at bioconductor.org Thu Jun 7 14:58:15 2012
From: guest at bioconductor.org (German Gonzalez [guest])
Date: Thu, 7 Jun 2012 05:58:15 -0700 (PDT)
Subject: [BioC] gplots error
Message-ID: <20120607125815.43FC2138E2B@mamba.fhcrc.org>

Yesterday a new version of the gdata package (2.8.2) was released. After upgrading the package, gplots stopped working.

Error : object 'nobs' is not exported by 'namespace:gdata'
ERROR: lazy loading failed for package 'gplots'

The only way I had to fix this was to uninstall both packages and install an older version manually.

Regards

 -- output of sessionInfo():

R version 2.14.1 (2011-12-22)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] gdata_2.8.2  gtools_2.6.2

--
Sent via the guest posting facility at bioconductor.org.

From mtmorgan at fhcrc.org Thu Jun 7 15:00:42 2012
From: mtmorgan at fhcrc.org (Martin Morgan)
Date: Thu, 07 Jun 2012 06:00:42 -0700
Subject: [BioC] base-specific read counts
In-Reply-To:
References:
Message-ID: <4FD0A5FA.6030301@fhcrc.org>

On 06/06/2012 10:43 PM, Yu Chuan Tai wrote:
> Hi,
>
> Is there any way to calculate base-specific read counts for a given
> genomic interval (including 1-base interval), for paired-end data
> aligned by Bowtie2 in BAM format?

Thanks for posting to the Bioc mailing list! Functions like readGappedAlignments, scanBam, etc.
take an argument 'param', a ScanBamParam object that in turn has an argument 'which' to specify, using GRanges, the regions of a bam file you want to query

  gwhich <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100),
      c("+", "+", "-"))
  param <- ScanBamParam(which=gwhich)
  scanBam("my.bam", param=param)

Base-level coverage is also available with ?applyPileups, see example(applyPileups).

Martin

> Thanks!
>
> Best,
> Yu Chuan
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

From mtmorgan at fhcrc.org Thu Jun 7 15:08:42 2012
From: mtmorgan at fhcrc.org (Martin Morgan)
Date: Thu, 07 Jun 2012 06:08:42 -0700
Subject: [BioC] Amplicon and exon level read counts and GC content
In-Reply-To:
References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org>
Message-ID: <4FD0A7DA.1040009@fhcrc.org>

On 06/06/2012 09:53 PM, Yu Chuan Tai wrote:
> Hi Martin,
>
> More questions on your approaches below. If my BAM files are
> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(),
> should I set isPaired=TRUE? Do I need to worry about other input
> arguments for scanBamFlag() or ScanBamParam(), if I want to
> calculate coverage properly?

It really depends on what you're interested in doing; see for instance the post by Herve the other day

https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html

>
> Also, summarizeOverlaps() doesn't seem to handle paired-end reads.
> How to get around this, or it won't affect coverage calculation?
There is better support for paired-end reads in the 'devel' version of
Bioconductor; see

http://bioconductor.org/developers/useDevel/

whether and what aspects of paired-endedness are important depends on
how you are using your coverage.

>
> Finally, is there any way to calculate base-specific coverage at any
> genomic locus or interval in Rsamtools? Thanks!

I tried to answer this in your other post.

Martin

>
> Best, Yu Chuan
>
>> More specifically, after
>>
>> library(Rsamtools)
>> example(scanBam) # defines 'fl', a path to a bam file
>>
>> for a _single_ genomic range
>>
>> param = ScanBamParam(what="seq", which=GRanges("seq1", IRanges(100, 500)))
>> dna = scanBam(fl, param=param)[[1]][["seq"]]
>> length(dna) # 365 reads overlap region
>> alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>>
>> though you'd likely want to specify several regions (vector
>> arguments to GRanges) and think about flags (scanBamFlag() and the
>> flag argument to ScanBamParam), read mapping quality, reads
>> overlapping more than one region, etc. (summarizeOverlaps
>> implements several counting strategies, but it is 'easy' to
>> implement arbitrary approaches).
>>
>>>
>>> Martin
>>>
>>>>
>>>> Thanks for any input!
>>>>
>>>> Best, Yu Chuan
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>>
>>>>
>>>> --
>> Computational Biology Fred Hutchinson Cancer Research Center 1100
>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861 Telephone: 206 667-2793
>>

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

From mtmorgan at fhcrc.org Thu Jun 7 15:12:56 2012
From: mtmorgan at fhcrc.org (Martin Morgan)
Date: Thu, 7 Jun 2012 06:12:56 -0700
Subject: [BioC] Question about ChIPpeakAnno
In-Reply-To:
References:
Message-ID: <4FD0A8D8.6040707@fhcrc.org>

On 06/06/2012 12:22 PM, Petros Kolovos wrote:
> Dear Julie
>
> the
> http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChIPpeakAnno.pdf
> doesn't work. Could you please check it?

Hi Petros -- see

http://www.bioconductor.org/help/course-materials/2011/BioC2011/

and (I think)

http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/ChIPpeakAnno-BioC2011.pdf

Martin

>
> Best regards
> Petros
>
>
>> Petros,
>>
>> totalTest specifies the total number of tests performed to obtain the list
>> of peaks.
>>
>> This is one of the frequently asked questions and many have contributed
>> wisdoms to address this question. Could you please take a look at the
>> slides at
>> http://www.bioconductor.org/help/course-materials/2011/BioC2011/BioC2011_ChIPpeakAnno.pdf,
>> esp. the faq slide near the end?
>>
>> Please do cc Bioconductor list (bioconductor
>> ) for sharing. Thanks!
>>
>> Best regards,
>>
>> Julie
>>
>>
>> On 6/6/12 11:02 AM, "Petros Kolovos" wrote:
>>
>>> Dear Julie,
>>>
>>> I would like to ask you also something else.
>>>
>>> In the makeVennDiagrams function what do you mean by total tests?
>>>
>>> For example i have the following code
>>>
>>>> venn=makeVennDiagram(RangedDataList (pr, intron),NameOfPeaks =
>>>> c("peaks","intron"),maxgap = 0, totalTest = 320000)
>>>
>>> PR=my peaks has 50000 peaks
>>> intron= table acquired from UCSC with all the exons 250000
>>> peaks/coordinates
>>>
>>> So how much should I put the totalTest
>>>
>>> Is there a problem if I make a venn diagram of peaks vs introns?
>>>
>>> Thanks
>>>
>>> Best regards
>>> Petros
>>>
>>>
>>>
>>>> Dear Petros,
>>>>
>>>> You can safely ignore the warning message and proceed. I will modify the
>>>> code for the next release to replace multiple with select internally.
>>>>
>>>> Best regards,
>>>>
>>>> Julie
>>>>
>>>>
>>>> On 6/6/12 4:04 AM, "Petros Kolovos" wrote:
>>>>
>>>>> Dear Dr Julie Zhu,
>>>>>
>>>>> Good morning
>>>>>
>>>>> My name is Petros Kolovos and I am a PhD student at Rotterdam,
>>>>> Netherlands.
>>>>>
>>>>> I was using your library in order to annotate some chip peaks and in
>>>>> order to make some venn diagrams.
>>>>>
>>>>> But I faced some problems with the venn diagrams
>>>>>
>>>>> Here is what I am doing
>>>>>
>>>>>> venn=makeVennDiagram(RangedDataList (pr, Anno_exon),NameOfPeaks =
>>>>> c("peaks","exon"),maxgap = 0, totalTest = 500000)
>>>>> Warning message:
>>>>> In findOverlappingPeaks(Peaks[[1]], Peaks[[2]], NameOfPeaks1 =
>>>>> NameOfPeaks[1], :
>>>>> Please use select instead of multiple!
>>>>>
>>>>> What should I do?
>>>>>
>>>>> Could you please help me as I am a rookie in this field
>>>>>
>>>>> Thank you in advance
>>>>>
>>>>> Yours sincerely
>>>>> Petros Kolovos
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

From nac at sanger.ac.uk Thu Jun 7 15:28:26 2012
From: nac at sanger.ac.uk (nathalie)
Date: Thu, 07 Jun 2012 14:28:26 +0100
Subject: [BioC] TEQC package isssue with chromosome format
Message-ID: <4FD0AC7A.8010206@sanger.ac.uk>

Hi,
I would like to analyse the coverage of my BAM files using the TEQC package. They have been aligned on a reference with the following format:
chr number (1-19, X, Y), start (integer), end (integer)
The chromosome names do not carry the "chr" prefix.
When I try to create the target file with the Nochr nomenclature, it fails with the following error message

> targets<-get.targets("NOchr.txt", chrcol=1,startcol=2,endcol=3, zerobased=F, sep="\t",skip=0)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
  solving row 1: range cannot be determined from the supplied arguments (too many NAs)

This works when I change the format to use "chr" prefixes.

> head(targets)
RangedData with 6 rows and 0 value columns across 21 spaces
     space             ranges |
                              |
1     chr1 [3206100, 3207051] |
2     chr1 [3411780, 3411984] |
3     chr1 [3660630, 3661431] |
4     chr1 [4334678, 4340174] |
5     chr1 [4341988, 4342164] |
6     chr1 [4342280, 4342908] |

But then my bams are in the wrong format as they don't have those prefixes....

> head(mybams.bam)
RangedData with 6 rows and 1 value column across 211 spaces
     space             ranges |                               ID
                              |
1        1 [3000748, 3000822] | HS10_07304:1:1301:15698:141841#2
2        1 [3000748, 3000822] |   HS2_07343:1:2107:4612:106954#2
3        1 [3000748, 3000822] |   HS2_07343:2:1204:4374:169685#2
4        1 [3000818, 3000892] | HS10_07304:1:1301:15698:141841#2
5        1 [3000818, 3000892] |   HS2_07343:1:2107:4612:106954#2
6        1 [3000818, 3000892] |   HS2_07343:2:1204:4374:169685#2

Is it possible to make the function accept the NoChr coordinates, or is the only way to change everything back to chr prefixes?
many thanks
Nat

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] TEQC_2.4.0          hwriter_1.3         Rsamtools_1.8.4
[4] Biostrings_2.24.1   GenomicRanges_1.8.3 IRanges_1.14.2
[7] BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] Biobase_2.16.0 bitops_1.0-4.1 stats4_2.15.0  zlibbioc_1.2.0

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From drnevich at illinois.edu Thu Jun 7 16:20:03 2012
From: drnevich at illinois.edu (Zadeh, Jenny Drnevich)
Date: Thu, 7 Jun 2012 14:20:03 +0000
Subject: [BioC] Question about locked environments
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mailinglist.honeypot at gmail.com Thu Jun 7 16:25:06 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Thu, 7 Jun 2012 10:25:06 -0400
Subject: [BioC] makeTranscriptDbFromBiomart error
In-Reply-To:
References:
Message-ID:

Hi Stefanie,

On Thu, Jun 7, 2012 at 5:16 AM, Stefanie Tauber wrote:
> Hi
>
> I just tried it with R 2.15, I get the same error.
>
> If I follow your suggestion:
>
> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="ensGene")
>
>
> I get:
>
> Download the ensGene table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... Error in
> download.file(url, destfile, quiet = TRUE) :
>
cannot open URL
> 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz'
> In addition: There were 50 or more warnings (use warnings() to see the first
> 50)

[snip]

Strange ... I also get the same warnings you get (the "cds cumulative length is not a multiple of 3") for some transcripts, but I think this is something beyond our control. I don't get any error(s) when downloading and building the TxDB, so it completes fine for me.

I'm actually running the *-devel versions of the bioc packages w/ R-2.15.x so it's not very easy for me to check the current released GenomicFeatures package, but I'd be a bit surprised if the error is there.

Could you paste the output of `sessionInfo()` after you call `library(GenomicFeatures)` when running your new R-2.15.x install?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From mailinglist.honeypot at gmail.com Thu Jun 7 16:36:15 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Thu, 7 Jun 2012 10:36:15 -0400
Subject: [BioC] problem about set operation and computation after split
In-Reply-To: <20120607083607.E6914138E14@mamba.fhcrc.org>
References: <20120607083607.E6914138E14@mamba.fhcrc.org>
Message-ID:

Hi,

Your questions aren't bioconductor/bioinformatics related, and should really go to R-help. Still, I'll offer one suggestion -- for more help, please repost the question to r-help:

On Thu, Jun 7, 2012 at 4:36 AM, bestbird7788 [guest] wrote:
>
> hi,
>    I met some problems in R, please help me.
> 1. How to do a intersect operation among several groups in one list, without a loop statement? (I think It may be a list)
>   create data:
>   myData <- data.frame(product = c(1,2,3,1,2,3,1,2,2), year=c(2009,2009,2009,2010,2010,2010,2011,2011,2011),value=c(1104,608,606,1504,508,1312,900,1100,800))
>
mySplit<- split(myData,myData$year)
>   mySplit
> $`2009`
>   product year value
> 1       1 2009  1104
> 2       2 2009   608
> 3       3 2009   606
>
> $`2010`
>   product year value
> 4       1 2010  1504
> 5       2 2010   508
> 6       3 2010  1312
>
> $`2011`
>   product year value
> 7       1 2011   900
> 8       2 2011  1100
> 9       2 2011   800
>    I want to get intersection of product between every year. I know the basic is:
>    intersect(intersect(mySplit[[1]]$product, mySplit[[2]]$product),mySplit[[3]]$product)
>    this will give the correct answer:
>    [1] 1 2
>    above code lacks reusability, so It should use a for loop:
>    myIntersect<-mySplit[[1]]$product
>    for (i in 1:length(mySplit)-1){
>        myIntersect<-intersect(myIntersect,mySplit[[i+1]]$product)
>    }
>    It's correct too, but stll too complex, so my question is:
>    Can I do the same thing just use another similar intersect function (without for/repeat/while).
>    What's this simple function's name ?

I think your for loop is "fine", but if you want a more functional way of doing it, you could do:

myi <- Reduce(intersect, lapply(mySplit, '[[', 'product'))

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From yuchuan at stat.berkeley.edu Thu Jun 7 16:54:48 2012
From: yuchuan at stat.berkeley.edu (Yu Chuan Tai)
Date: Thu, 7 Jun 2012 07:54:48 -0700 (PDT)
Subject: [BioC] Amplicon and exon level read counts and GC content
In-Reply-To: <4FD0A7DA.1040009@fhcrc.org>
References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org> <4FD0A7DA.1040009@fhcrc.org>
Message-ID:

Hi Martin,

Thanks! I will look into the links below. By 'better support for paired-end reads in the 'devel' version', which package are you referring to?

Best,
Yu Chuan

On Thu, 7 Jun 2012, Martin Morgan wrote:

> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote:
>> Hi Martin,
>>
>> More questions on your approaches below. If my BAM files are
>> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(),
>> should I set isPaired=TRUE? Do I need to worry about other input
>> arguments for scanBamFlag() or ScanBamParam(), if I want to
>> calculate coverage properly?
>
> It really depends on what you're interested in doing; see for instance the
> post by Herve the other day
>
> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html
>
>>
>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads.
>> How to get around this, or it won't affect coverage calculation?
>
> There is better support for paired-end reads in the 'devel' version of
> Bioconductor; see
>
> http://bioconductor.org/developers/useDevel/
>
> whether and what aspects of paired-endedness are important depends on how you
> are using your coverage.
>
>>
>> Finally, is there any way to calculate base-specific coverage at any
>> genomic locus or interval in Rsamtools? Thanks!
>
> I tried to answer this in your other post.
>
> Martin
>
>>
>> Best, Yu Chuan
>>
>>> More specifically, after
>>>
>>> library(Rsamtools)
>>> example(scanBam) # defines 'fl', a path to a bam file
>>>
>>> for a _single_ genomic range
>>>
>>> param = ScanBamParam(what="seq", which=GRanges("seq1", IRanges(100, 500)))
>>> dna = scanBam(fl, param=param)[[1]][["seq"]]
>>> length(dna) # 365 reads overlap region
>>> alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>>>
>>> though you'd likely want to specify several regions (vector
>>> arguments to GRanges) and think about flags (scanBamFlag() and the
>>> flag argument to ScanBamParam), read mapping quality, reads
>>> overlapping more than one region, etc. (summarizeOverlaps
>>> implements several counting strategies, but it is 'easy' to
>>> implement arbitrary approaches).
>>>
>>>>
>>>> Martin
>>>>
>>>>>
>>>>> Thanks for any input!
>>>>>
>>>>> Best, Yu Chuan
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>>
>>>
>>>>>
>>>>>
> --
>>> Computational Biology Fred Hutchinson Cancer Research Center 1100
>>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>> Location: M1-B861 Telephone: 206 667-2793
>>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

From yuchuan at stat.berkeley.edu Thu Jun 7 17:03:43 2012
From: yuchuan at stat.berkeley.edu (Yu Chuan Tai)
Date: Thu, 7 Jun 2012 08:03:43 -0700 (PDT)
Subject: [BioC] base-specific read counts
In-Reply-To: <4FD0A5FA.6030301@fhcrc.org>
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID:

Thanks! In your code below, to take care of the paired-end reads, is it correct that at least I need to set isPaired=TRUE in scanBamFlag()?

Best,
Yu Chuan

On Thu, 7 Jun 2012, Martin Morgan wrote:

> On 06/06/2012 10:43 PM, Yu Chuan Tai wrote:
>> Hi,
>>
>> Is there any way to calculate base-specific read counts for a given
>> genomic interval (including 1-base interval), for paired-end data
>> aligned by Bowtie2 in BAM format?
>
> Thanks for posting to the Bioc mailing list! Functions like
> readGappedAlignments, scanBam, etc. take an argument ScanBamParam that in
> turn has an argument 'which' to specify, using GRanges, the regions of a bam
> file you want to query
>
> gwhich <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100),
> c("+", "+", "-"))
> param <- ScanBamParam(which=gwhich)
> scanBam("my.bam", param=param)
>
> Base-level coverage is also available with ?applyPileups, see
> example(applyPileups).
>
> Martin
>
>> Thanks!
>>
>> Best,
>> Yu Chuan
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

From sdavis2 at mail.nih.gov Thu Jun 7 17:06:37 2012
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 7 Jun 2012 11:06:37 -0400
Subject: [BioC] base-specific read counts
In-Reply-To:
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID:

On Thu, Jun 7, 2012 at 11:03 AM, Yu Chuan Tai wrote:
> Thanks! In your code below, to take care of the paired-end reads, is it
> correct that at least I need to set isPaired=TRUE in scanBamFlag()?

No. The "isXXX" stuff is for filtering the data. Assuming that you want all your reads to be included (and not just paired reads), you do not need to set isPaired.

Sean

> Best,
> Yu Chuan
>
> On Thu, 7 Jun 2012, Martin Morgan wrote:
>
>> On 06/06/2012 10:43 PM, Yu Chuan Tai wrote:
>>>
>>> Hi,
>>>
>>> Is there any way to calculate base-specific read counts for a given
>>> genomic interval (including 1-base interval), for paired-end data
>>> aligned by Bowtie2 in BAM format?
>>
>>
>> Thanks for posting to the Bioc mailing list! Functions like
>> readGappedAlignments, scanBam, etc. take an argument ScanBamParam that in
>> turn has an argument 'which' to specify, using GRanges, the regions of a bam
>> file you want to query
>>
>>  gwhich <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100),
>>      c("+", "+", "-"))
>>  param <- ScanBamParam(which=gwhich)
>>  scanBam("my.bam", param=param)
>>
>> Base-level coverage is also available with ?applyPileups, see
>> example(applyPileups).
>>
>> Martin
>>
>>> Thanks!
>>>
>>> Best,
>>> Yu Chuan
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

From yuchuan at stat.berkeley.edu Thu Jun 7 17:24:36 2012
From: yuchuan at stat.berkeley.edu (Yu Chuan Tai)
Date: Thu, 7 Jun 2012 08:24:36 -0700 (PDT)
Subject: [BioC] base-specific read counts
In-Reply-To:
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID:

I see. So, which input arguments of scanBamFlag() or ScanBamParam() take care of paired-end reads? Or should I even worry about the paired-end nature when calculating coverage?

Thanks!
Yu Chuan

On Thu, 7 Jun 2012, Sean Davis wrote:

> On Thu, Jun 7, 2012 at 11:03 AM, Yu Chuan Tai wrote:
>> Thanks! In your code below, to take care of the paired-end reads, is it
>> correct that at least I need to set isPaired=TRUE in scanBamFlag()?
>
> No. The "isXXX" stuff is for filtering the data. Assuming that you
> want all your reads to be included (and not just paired reads), you do
> not need to set isPaired.
>
> Sean
>
>> Best,
>> Yu Chuan
>>
>> On Thu, 7 Jun 2012, Martin Morgan wrote:
>>
>>> On 06/06/2012 10:43 PM, Yu Chuan Tai wrote:
>>>>
>>>> Hi,
>>>>
>>>> Is there any way to calculate base-specific read counts for a given
>>>> genomic interval (including 1-base interval), for paired-end data
>>>> aligned by Bowtie2 in BAM format?
>>>
>>>
>>> Thanks for posting to the Bioc mailing list! Functions like
>>> readGappedAlignments, scanBam, etc. take an argument ScanBamParam that in
>>> turn has an argument 'which' to specify, using GRanges, the regions of a bam
>>> file you want to query
>>>
>>>  gwhich <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100),
>>>      c("+", "+", "-"))
>>>  param <- ScanBamParam(which=gwhich)
>>>  scanBam("my.bam", param=param)
>>>
>>> Base-level coverage is also available with ?applyPileups, see
>>> example(applyPileups).
>>>
>>> Martin
>>>
>>>> Thanks!
>>>>
>>>> Best,
>>>> Yu Chuan
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

From sdavis2 at mail.nih.gov Thu Jun 7 17:35:14 2012
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 7 Jun 2012 11:35:14 -0400
Subject: [BioC] base-specific read counts
In-Reply-To:
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID:

On Thu, Jun 7, 2012 at 11:24 AM, Yu Chuan Tai wrote:
> I see. So, which input arguments of scanBamFlag() or ScanBamParam() take
> care of paired-end reads? Or should I even worry about the paired-end
> nature when calculating coverage?

There are situations when you want to calculate coverage based on the extreme ends of pairs, but that would be for finding structural variants and the like and not for determining base-level coverage. You do not need to do anything special to read in paired-end data. As Martin mentioned, there is more complete handling of paired-end data in development, but that is really orthogonal to the isPaired flag of scanBamFlag().

Sean

>
> Thanks!
> Yu Chuan
>
> On Thu, 7 Jun 2012, Sean Davis wrote:
>
>> On Thu, Jun 7, 2012 at 11:03 AM, Yu Chuan Tai wrote:
>>> Thanks! In your code below, to take care of the paired-end reads, is it
>>> correct that at least I need to set isPaired=TRUE in scanBamFlag()?
>>
>> No.  The "isXXX" stuff is for filtering the data.  Assuming that you
>> want all your reads to be included (and not just paired reads), you do
>> not need to set isPaired.
>>
>> Sean
>>
>>> Best,
>>> Yu Chuan
>>>
>>> On Thu, 7 Jun 2012, Martin Morgan wrote:
>>>
>>>> On 06/06/2012 10:43 PM, Yu Chuan Tai wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Is there any way to calculate base-specific read counts for a given
>>>>> genomic interval (including 1-base interval), for paired-end data
>>>>> aligned by Bowtie2 in BAM format?
>>>>
>>>>
>>>> Thanks for posting to the Bioc mailing list! Functions like
>>>> readGappedAlignments, scanBam, etc. take an argument ScanBamParam that in
>>>> turn has an argument 'which' to specify, using GRanges, the regions of a bam
>>>> file you want to query
>>>>
>>>>  gwhich <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100),
>>>>      c("+", "+", "-"))
>>>>  param <- ScanBamParam(which=gwhich)
>>>>  scanBam("my.bam", param=param)
>>>>
>>>> Base-level coverage is also available with ?applyPileups, see
>>>> example(applyPileups).
>>>>
>>>> Martin
>>>>
>>>>> Thanks!
>>>>>
>>>>> Best,
>>>>> Yu Chuan
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>

From stefanie.tauber at univie.ac.at Thu Jun 7 17:50:31 2012
From: stefanie.tauber at univie.ac.at (Stefanie Tauber)
Date: Thu, 7 Jun 2012 17:50:31 +0200
Subject: [BioC] makeTranscriptDbFromBiomart error
In-Reply-To:
References:
Message-ID: <66F91D70-6173-4567-89DE-0DE60A7EFD0B@univie.ac.at>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mtmorgan at fhcrc.org Thu Jun 7 17:54:00 2012
From: mtmorgan at fhcrc.org (Martin Morgan)
Date: Thu, 07 Jun 2012 08:54:00 -0700
Subject: [BioC] Amplicon and exon level read counts and GC content
In-Reply-To:
References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org> <4FD0A7DA.1040009@fhcrc.org>
Message-ID: <4FD0CE98.20002@fhcrc.org>

On 06/07/2012 07:54 AM, Yu Chuan Tai wrote:
> Hi Martin,
>
> Thanks! I will look into the links below. By 'better support for
> paired-end reads in the 'devel' version', which package are you
> referring to?

Mostly GenomicRanges, e.g., readGappedAlignmentPairs, building on additional facilities in Rsamtools. Herve is responsible for this.
Martin

>
> Best,
> Yu Chuan
>
> On Thu, 7 Jun 2012, Martin Morgan wrote:
>
>> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote:
>>> Hi Martin,
>>>
>>> More questions on your approaches below. If my BAM files are
>>> generated by Bowtie2 on paired-end fastq files, for scanBamFlag(),
>>> should I set isPaired=TRUE? Do I need to worry about other input
>>> arguments for scanBamFlag() or ScanBamParam(), if I want to
>>> calculate coverage properly?
>>
>> It really depends on what you're interested in doing; see for instance
>> the post by Herve the other day
>>
>> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html
>>
>>>
>>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads.
>>> How to get around this, or it won't affect coverage calculation?
>>
>> There is better support for paired-end reads in the 'devel' version of
>> Bioconductor; see
>>
>> http://bioconductor.org/developers/useDevel/
>>
>> whether and what aspects of paired-endedness are important depends on
>> how you are using your coverage.
>>
>>>
>>> Finally, is there any way to calculate base-specific coverage at any
>>> genomic locus or interval in Rsamtools? Thanks!
>>
>> I tried to answer this in your other post.
>>
>> Martin
>>
>>>
>>> Best, Yu Chuan
>>>
>>>> More specifically, after
>>>>
>>>> library(Rsamtools)
>>>> example(scanBam) # defines 'fl', a path to a bam file
>>>>
>>>> for a _single_ genomic range
>>>>
>>>> param = ScanBamParam(what="seq", which=GRanges("seq1", IRanges(100, 500)))
>>>> dna = scanBam(fl, param=param)[[1]][["seq"]]
>>>> length(dna) # 365 reads overlap region
>>>> alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>>>>
>>>> though you'd likely want to specify several regions (vector
>>>> arguments to GRanges) and think about flags (scanBamFlag() and the
>>>> flag argument to ScanBamParam), read mapping quality, reads
>>>> overlapping more than one region, etc.
(summarizeOverlaps >>>> implements several counting strategies, but it is 'easy' to >>>> implement arbitrary approaches). >>>> >>>>> >>>>> Martin >>>>> >>>>>> >>>>>> Thanks for any input! >>>>>> >>>>>> Best, Yu Chuan >>>>>> >>>>>> _______________________________________________ Bioconductor >>>>>> mailing list Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >>>>>> archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>> >>>> >>>> >>>>>> >>>>>> >> -- >>>> Computational Biology Fred Hutchinson Cancer Research Center 1100 >>>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>> >>>> Location: M1-B861 Telephone: 206 667-2793 >>>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From mcarlson at fhcrc.org Thu Jun 7 19:40:32 2012 From: mcarlson at fhcrc.org (Marc Carlson) Date: Thu, 07 Jun 2012 10:40:32 -0700 Subject: [BioC] makeTranscriptDbFromBiomart error In-Reply-To: <66F91D70-6173-4567-89DE-0DE60A7EFD0B@univie.ac.at> References: <66F91D70-6173-4567-89DE-0DE60A7EFD0B@univie.ac.at> Message-ID: <4FD0E790.1000101@fhcrc.org> Hi Stefanie, This is related to a bug with the 5' and 3' starts/ends that was in the latest version of biomaRt. We reported it to them a couple weeks ago because it immediately started to break some of our quality control tests for GenomicFeatures. At that time, they told us that it has been fixed, but it will still take a couple of weeks for their correction to propagate out. In the meantime, using either makeTranscriptDbFromUCSC() or the stock annotation packages for human, might be a good work-around for you. 
The warning that you saw for makeTranscriptDbFromUCSC() was another quality control check. We expect that when an annotation resource tells us the range for a CDS that this range should be divisible by three. When this doesn't happen, we issue the warning you were seeing for makeTranscriptDbFromUCSC(). Hope that this clarifies things, Marc On 06/07/2012 08:50 AM, Stefanie Tauber wrote: > Hi, > > here is my sessionInfo: > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.8.0 AnnotationDbi_1.18.0 Biobase_2.16.0 > [4] GenomicRanges_1.8.1 IRanges_1.14.2 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.12.0 Biostrings_2.24.0 bitops_1.0-4.1 BSgenome_1.24.0 > [5] DBI_0.2-5 RCurl_1.91-1 Rsamtools_1.8.0 RSQLite_0.11.1 > [9] rtracklayer_1.16.0 stats4_2.15.0 tools_2.15.0 XML_3.9-4 > [13] zlibbioc_1.2.0 > > I updated GenomicFeatures to 1.8.1, but unfortunately did not help. > > > BUT: makeTranscriptDbFromUCSC did work :) > >> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="ensGene") > Download the ensGene table ... OK > Extract the 'transcripts' data frame ... OK > Extract the 'splicings' data frame ... OK > Download and preprocess the 'chrominfo' data frame ... OK > Prepare the 'metadata' data frame ... metadata: OK > Make the TranscriptDb object ... 
OK > There were 50 or more warnings (use warnings() to see the first 50) > >> txdb > TranscriptDb object: > | Db type: TranscriptDb > | Supporting package: GenomicFeatures > | Data source: UCSC > | Genome: hg19 > | Genus and Species: Homo sapiens > | UCSC Table: ensGene > | Resource URL: http://genome.ucsc.edu/ > | Type of Gene ID: Ensembl gene ID > | Full dataset: yes > | miRBase build ID: NA > | transcript_nrow: 181648 > | exon_nrow: 541825 > | cds_nrow: 278798 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2012-06-07 17:48:45 +0200 (Thu, 07 Jun 2012) > | GenomicFeatures version at creation time: 1.8.1 > | RSQLite version at creation time: 0.11.1 > | DBSCHEMAVERSION: 1.0 > >> warnings() > Warning messages: > 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : > UCSC data anomaly in transcript ENST00000513161: the cds cumulative length is not a multiple of 3 > 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : > UCSC data anomaly in transcript ENST00000417833: the cds cumulative length is not a multiple of 3 > 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : > UCSC data anomaly in transcript ENST00000450884: the cds cumulative length is not a multiple of 3 > > > Best, > Stefanie > > Am 07.06.2012 um 16:25 schrieb Steve Lianoglou: > >> Hi Stefanie, >> >> On Thu, Jun 7, 2012 at 5:16 AM, Stefanie Tauber >> wrote: >>> Hi >>> >>> I just tried it with R 2.15, I get the same error. >>> >>> If I follow your suggestion: >>> >>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="ensGene") >>> >>> >>> I get: >>> >>> Download the ensGene table ... OK >>> Extract the 'transcripts' data frame ... OK >>> Extract the 'splicings' data frame ... OK >>> Download and preprocess the 'chrominfo' data frame ... 
Error in >>> download.file(url, destfile, quiet = TRUE) : >>> cannot open URL >>> 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz' >>> In addition: There were 50 or more warnings (use warnings() to see the first >>> 50) >> [snip] >> >> Strange ... I also get the same warnings you get (the "cds cumulative >> length is not a multiple of 3") for some transcripts, but I think this >> is something beyond our control. I don't get any error(s) when >> downloading and building the TxDB, so it completes fine for me. >> >> I'm actually running the *-devel versions of the bioc packages w/ >> R-2.15.x so it's not very easy for me to check the current released >> GenomicFeatures package, but I'd be a bit surprised if the error is >> there. >> >> Could you paste the output of `sessionInfo()` after you call >> `library(GenomicFeatures)` when running your new R-2.15.x install? >> >> -steve >> >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact > DI Stefanie Tauber > > Center for Integrative Bioinformatics Vienna (CIBIV) > (CIBIV is a joint institute of Vienna University, Medical University, and University of Veterinary Medicine, Vienna, Austria) > Max F. Perutz Laboratories (MFPL) > Campus Vienna Biocenter 5 (VBC5), Ebene 1, Room 1812.2 > Dr. 
Bohr Gasse 9 > A-1030 Wien, Austria > Phone: ++43 +1 / 42772-4030 > Fax: ++43 +1 / 42772-4098 > email: stefanie.tauber at univie.ac.at > www.cibiv.at > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From kurinji.pandiyan at gmail.com Thu Jun 7 20:06:21 2012 From: kurinji.pandiyan at gmail.com (Kurinji Pandiyan) Date: Thu, 7 Jun 2012 11:06:21 -0700 Subject: [BioC] Changing the Scale of GViz Data track Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From durinck.steffen at gene.com Thu Jun 7 20:09:35 2012 From: durinck.steffen at gene.com (Steffen Durinck) Date: Thu, 7 Jun 2012 11:09:35 -0700 Subject: [BioC] Changing the Scale of GViz Data track In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From dtenenba at fhcrc.org Thu Jun 7 20:49:30 2012 From: dtenenba at fhcrc.org (Dan Tenenbaum) Date: Thu, 7 Jun 2012 11:49:30 -0700 Subject: [BioC] gplots error In-Reply-To: <20120607125815.43FC2138E2B@mamba.fhcrc.org> References: <20120607125815.43FC2138E2B@mamba.fhcrc.org> Message-ID: On Thu, Jun 7, 2012 at 5:58 AM, German Gonzalez [guest] wrote: > > Yesterday a new version gdata package (2.8.2) was released. After upgrading the package, gplots stopped working. > > Error : object ???nobs??? is not exported by 'namespace:gdata' > ERROR: lazy loading failed for package ???gplots??? > > The only way I had to fix this was to uninstall both packages and install an older version manually. > Our build machines have gdata 2.10.0 installed. The "new" version of gdata is 2.8.2. 
According to R, 2.8.2 is not newer than 2.10.0:

> package_version("2.8.2") > package_version("2.10.0")
[1] FALSE

Hence gdata was not updated on our build machines. I will update it manually. Many packages were affected, in devel and release; they ought to be corrected by mid-morning tomorrow Seattle time.

Thanks,
Dan

> Regards
>
>
>  -- output of sessionInfo():
>
> R version 2.14.1 (2011-12-22)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] gdata_2.8.2       gtools_2.6.2
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From german at bioinformaticos.com.ar Thu Jun 7 20:57:00 2012
From: german at bioinformaticos.com.ar (Germán González)
Date: Thu, 7 Jun 2012 15:57:00 -0300
Subject: [BioC] gplots error
In-Reply-To:
References: <20120607125815.43FC2138E2B@mamba.fhcrc.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mcarlson at fhcrc.org Thu Jun 7 21:32:12 2012
From: mcarlson at fhcrc.org (Marc Carlson)
Date: Thu, 07 Jun 2012 12:32:12 -0700
Subject: [BioC] makeTranscriptDbFromBiomart error
In-Reply-To: <29165_1339090914_4FD0E7E1_29165_11025_1_4FD0E790.1000101@fhcrc.org>
References: <66F91D70-6173-4567-89DE-0DE60A7EFD0B@univie.ac.at> <29165_1339090914_4FD0E7E1_29165_11025_1_4FD0E790.1000101@fhcrc.org>
Message-ID: <4FD101BC.1030800@fhcrc.org>

One more thing:

The uswest Ensembl biomart mirror has apparently been updated with the fix (for reasons that are not known to me, the default has still not been updated). So if you look at the manual page for

?makeTranscriptDbFromBiomart

you can see an example of how to use the uswest.ensembl.org host by specifying the biomart and host arguments.

Marc

On 06/07/2012 10:40 AM, Marc Carlson wrote:
> Hi Stefanie,
>
> This is related to a bug with the 5' and 3' starts/ends that was in
> the latest version of biomaRt. We reported it to them a couple weeks
> ago because it immediately started to break some of our quality
> control tests for GenomicFeatures. At that time, they told us that it
> has been fixed, but it will still take a couple of weeks for their
> correction to propagate out. In the meantime, using either
> makeTranscriptDbFromUCSC() or the stock annotation packages for human,
> might be a good work-around for you.
>
> The warning that you saw for makeTranscriptDbFromUCSC() was another
> quality control check. We expect that when an annotation resource
> tells us the range for a CDS that this range should be divisible by
> three. When this doesn't happen, we issue the warning you were seeing
> for makeTranscriptDbFromUCSC().
> > Hope that this clarifies things, > > > Marc > > > > On 06/07/2012 08:50 AM, Stefanie Tauber wrote: >> Hi, >> >> here is my sessionInfo: >> >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicFeatures_1.8.0 AnnotationDbi_1.18.0 Biobase_2.16.0 >> [4] GenomicRanges_1.8.1 IRanges_1.14.2 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.12.0 Biostrings_2.24.0 bitops_1.0-4.1 >> BSgenome_1.24.0 >> [5] DBI_0.2-5 RCurl_1.91-1 Rsamtools_1.8.0 >> RSQLite_0.11.1 >> [9] rtracklayer_1.16.0 stats4_2.15.0 tools_2.15.0 XML_3.9-4 >> [13] zlibbioc_1.2.0 >> >> I updated GenomicFeatures to 1.8.1, but unfortunately did not help. >> >> >> BUT: makeTranscriptDbFromUCSC did work :) >> >>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="ensGene") >> Download the ensGene table ... OK >> Extract the 'transcripts' data frame ... OK >> Extract the 'splicings' data frame ... OK >> Download and preprocess the 'chrominfo' data frame ... OK >> Prepare the 'metadata' data frame ... metadata: OK >> Make the TranscriptDb object ... 
OK >> There were 50 or more warnings (use warnings() to see the first 50) >> >>> txdb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Supporting package: GenomicFeatures >> | Data source: UCSC >> | Genome: hg19 >> | Genus and Species: Homo sapiens >> | UCSC Table: ensGene >> | Resource URL: http://genome.ucsc.edu/ >> | Type of Gene ID: Ensembl gene ID >> | Full dataset: yes >> | miRBase build ID: NA >> | transcript_nrow: 181648 >> | exon_nrow: 541825 >> | cds_nrow: 278798 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2012-06-07 17:48:45 +0200 (Thu, 07 Jun 2012) >> | GenomicFeatures version at creation time: 1.8.1 >> | RSQLite version at creation time: 0.11.1 >> | DBSCHEMAVERSION: 1.0 >> >>> warnings() >> Warning messages: >> 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], >> exon_locs$start[[i]], ... : >> UCSC data anomaly in transcript ENST00000513161: the cds >> cumulative length is not a multiple of 3 >> 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], >> exon_locs$start[[i]], ... : >> UCSC data anomaly in transcript ENST00000417833: the cds >> cumulative length is not a multiple of 3 >> 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], >> exon_locs$start[[i]], ... : >> UCSC data anomaly in transcript ENST00000450884: the cds >> cumulative length is not a multiple of 3 >> >> >> Best, >> Stefanie >> >> Am 07.06.2012 um 16:25 schrieb Steve Lianoglou: >> >>> Hi Stefanie, >>> >>> On Thu, Jun 7, 2012 at 5:16 AM, Stefanie Tauber >>> wrote: >>>> Hi >>>> >>>> I just tried it with R 2.15, I get the same error. >>>> >>>> If I follow your suggestion: >>>> >>>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="ensGene") >>>> >>>> >>>> I get: >>>> >>>> Download the ensGene table ... OK >>>> Extract the 'transcripts' data frame ... OK >>>> Extract the 'splicings' data frame ... OK >>>> Download and preprocess the 'chrominfo' data frame ... 
Error in >>>> download.file(url, destfile, quiet = TRUE) : >>>> cannot open URL >>>> 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz' >>>> >>>> In addition: There were 50 or more warnings (use warnings() to see >>>> the first >>>> 50) >>> [snip] >>> >>> Strange ... I also get the same warnings you get (the "cds cumulative >>> length is not a multiple of 3") for some transcripts, but I think this >>> is something beyond our control. I don't get any error(s) when >>> downloading and building the TxDB, so it completes fine for me. >>> >>> I'm actually running the *-devel versions of the bioc packages w/ >>> R-2.15.x so it's not very easy for me to check the current released >>> GenomicFeatures package, but I'd be a bit surprised if the error is >>> there. >>> >>> Could you paste the output of `sessionInfo()` after you call >>> `library(GenomicFeatures)` when running your new R-2.15.x install? >>> >>> -steve >>> >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact >> DI Stefanie Tauber >> >> Center for Integrative Bioinformatics Vienna (CIBIV) >> (CIBIV is a joint institute of Vienna University, Medical University, >> and University of Veterinary Medicine, Vienna, Austria) >> Max F. Perutz Laboratories (MFPL) >> Campus Vienna Biocenter 5 (VBC5), Ebene 1, Room 1812.2 >> Dr. 
Bohr Gasse 9 >> A-1030 Wien, Austria >> Phone: ++43 +1 / 42772-4030 >> Fax: ++43 +1 / 42772-4098 >> email: stefanie.tauber at univie.ac.at >> www.cibiv.at >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From mtmorgan at fhcrc.org Fri Jun 8 02:22:20 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Thu, 07 Jun 2012 17:22:20 -0700 Subject: [BioC] Question about locked environments In-Reply-To: References: Message-ID: <4FD145BC.9050905@fhcrc.org> On 06/07/2012 07:20 AM, Zadeh, Jenny Drnevich wrote: > Hi Michael, > > I'm sorry you are having trouble with the RemoveProbes() function I posted the BioC mailing list many years ago. I have not had to use that function myself in years, and did not know it wasn't working with newer versions of R. I didn't write the original code, Ariel Chernomoretz did. I only modified it, and I'm not sure I know enough to solve the problem. I'm posting this to the BioC mailing list to see if anyone can help. Below is my reproducible code (link to download the "RemoveProbes.RData" file is below ), showing where the problem occurs. It appears that the environments containing the Affymetrix probe and probe set information that the code is trying to change in now locked. I have no idea if there is a way to overcome this. 
> > Thanks in advance to anyone for any help, > Jenny > > https://netfiles.uiuc.edu/xythoswfs/webui/_xy-42144579_2-t_YuabdiYC (link expires 7/7/12) > >> library(affy) > Loading required package: BiocGenerics > > Attaching package: 'BiocGenerics' > > The following object(s) are masked from 'package:stats': > > xtabs > > The following object(s) are masked from 'package:base': > > anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int, > pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, setdiff, table, tapply, union, unique > > Loading required package: Biobase > Welcome to Bioconductor > > Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages > 'citation("pkgname")'. > >> load("RemoveProbes.RData") >> ls() > [1] "nonsoygenes" "RemoveProbes" "ResetEnvir" "soygenes" >> >> cleancdf<- "soybean" >> >> ResetEnvir(cleancdf) > Loading required package: soybeancdf > Loading required package: AnnotationDbi > > Loading required package: soybeanprobe >> >> RemoveProbes(listOutProbeSets=nonsoygenes, cleancdf=cleancdf) > Error in assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) : > cannot change value of locked binding for 'soybeanprobe' >> >> debug(RemoveProbes) >> RemoveProbes(listOutProbeSets=nonsoygenes, cleancdf=cleancdf) > debugging in: RemoveProbes(listOutProbeSets = nonsoygenes, cleancdf = cleancdf) > debug: { > cdfpackagename<- paste(cleancdf, "cdf", sep = "") > probepackagename<- paste(cleancdf, "probe", sep = "") > require(cdfpackagename, character.only = TRUE) > require(probepackagename, character.only = TRUE) > probe.env.orig<- get(probepackagename) > if (!is.null(listOutProbes)) { > probes<- unlist(lapply(listOutProbes, function(x) { > a<- strsplit(x, "at") > aux1<- paste(a[[1]][1], "at", sep = "") > aux2<- as.integer(a[[1]][2]) > c(aux1, aux2) > })) > n1<- 
as.character(probes[seq(1, (length(probes)/2)) * > 2 - 1]) > n2<- as.integer(probes[seq(1, (length(probes)/2)) * > 2]) > probes<- data.frame(I(n1), n2) > probes[, 1]<- as.character(probes[, 1]) > probes[, 2]<- as.integer(probes[, 2]) > pset<- unique(probes[, 1]) > for (i in seq(along = pset)) { > ii<- grep(pset[i], probes[, 1]) > iout<- probes[ii, 2] > a<- get(pset[i], env = get(cdfpackagename)) > a<- a[-iout, ] > assign(pset[i], a, env = get(cdfpackagename)) > } > } > if (!is.null(listOutProbeSets)) { > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > } > tmp<- get("xy2indices", paste("package:", cdfpackagename, > sep = "")) > newAB<- new("AffyBatch", cdfName = cleancdf) > pmIndex<- unlist(indexProbes(newAB, "pm")) > subIndex<- match(tmp(probe.env.orig$x, probe.env.orig$y, > cdf = cdfpackagename), pmIndex) > rm(newAB) > iNA<- which(is.na(subIndex)) > if (length(iNA)> 0) { > ipos<- grep(probepackagename, search()) > assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) I think you can replace assign() with assignInNamespace(). I don't know whether that is a good idea or not... From ?assignInNamespace They should not be used in production code. but I don't think this is any more dire than what the original code was doing, before the introduction of package name spaces. 
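
The locked-binding error above can be reproduced without the soybean packages. What follows is only a minimal sketch in base R: 'probe_env' and 'soyprobe' are made-up names standing in for the attached package environment and its probe table, and unlockBinding() is shown as an illustration of the mechanism rather than as the fix suggested in this thread. The same caution about production code applies to it.

```r
# Hypothetical stand-in for an attached package environment;
# 'probe_env' and 'soyprobe' are illustrative names only.
probe_env <- new.env()
assign("soyprobe", data.frame(x = 1:3, y = 4:6), envir = probe_env)

# Lock the binding, much as R locks the bindings of an attached package.
lockBinding("soyprobe", probe_env)

# assign() on a locked binding fails; capture the message instead of stopping.
res <- tryCatch(
    assign("soyprobe", probe_env$soyprobe[-1, ], envir = probe_env),
    error = function(e) conditionMessage(e)
)
print(res)

# Once the binding is unlocked, the same assignment goes through.
unlockBinding("soyprobe", probe_env)
assign("soyprobe", probe_env$soyprobe[-1, ], envir = probe_env)
print(nrow(probe_env$soyprobe))
```

bindingIsLocked("soyprobe", probe_env) can be used to check the state of a binding before attempting the assignment.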
Martin > } > } > Browse[2]> > debug: cdfpackagename<- paste(cleancdf, "cdf", sep = "") > Browse[2]> > debug: probepackagename<- paste(cleancdf, "probe", sep = "") > Browse[2]> > debug: require(cdfpackagename, character.only = TRUE) > Browse[2]> > debug: require(probepackagename, character.only = TRUE) > Browse[2]> > debug: probe.env.orig<- get(probepackagename) > Browse[2]> > debug: if (!is.null(listOutProbes)) { > probes<- unlist(lapply(listOutProbes, function(x) { > a<- strsplit(x, "at") > aux1<- paste(a[[1]][1], "at", sep = "") > aux2<- as.integer(a[[1]][2]) > c(aux1, aux2) > })) > n1<- as.character(probes[seq(1, (length(probes)/2)) * 2 - > 1]) > n2<- as.integer(probes[seq(1, (length(probes)/2)) * 2]) > probes<- data.frame(I(n1), n2) > probes[, 1]<- as.character(probes[, 1]) > probes[, 2]<- as.integer(probes[, 2]) > pset<- unique(probes[, 1]) > for (i in seq(along = pset)) { > ii<- grep(pset[i], probes[, 1]) > iout<- probes[ii, 2] > a<- get(pset[i], env = get(cdfpackagename)) > a<- a[-iout, ] > assign(pset[i], a, env = get(cdfpackagename)) > } > } > Browse[2]> > debug: NULL > Browse[2]> > debug: if (!is.null(listOutProbeSets)) { > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > } > Browse[2]> > debug: rm(list = listOutProbeSets, envir = get(cdfpackagename)) > Browse[2]> > debug: tmp<- get("xy2indices", paste("package:", cdfpackagename, sep = "")) > Browse[2]> > debug: newAB<- new("AffyBatch", cdfName = cleancdf) > Browse[2]> > debug: pmIndex<- unlist(indexProbes(newAB, "pm")) > Browse[2]> > debug: subIndex<- match(tmp(probe.env.orig$x, probe.env.orig$y, cdf = cdfpackagename), > pmIndex) > Browse[2]> > debug: rm(newAB) > Browse[2]> > debug: iNA<- which(is.na(subIndex)) > Browse[2]> > debug: if (length(iNA)> 0) { > ipos<- grep(probepackagename, search()) > assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) > } > Browse[2]> > debug: ipos<- grep(probepackagename, search()) > Browse[2]> > debug: assign(probepackagename, probe.env.orig[-iNA, ], 
pos = ipos) > > #The line above is what causes the error > > Browse[2]> probepackagename > [1] "soybeanprobe" > There were 50 or more warnings (use warnings() to see the first 50) > Browse[2]> warnings()[1:3] > $`object 'AFFX-BioB-3_at' not found` > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > > $`object 'AFFX-BioB-5_at' not found` > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > > $`object 'AFFX-BioB-M_at' not found` > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > > #Not sure what the above warnings mean or if they are related > > Browse[2]> ?assign > starting httpd help server ... done > Browse[2]> ?lockBinding > Browse[2]> environmentIsLocked(as.environment(ipos)) > [1] TRUE > > Browse[2]> assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) > Error in assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) : > cannot change value of locked binding for 'soybeanprobe' > In addition: There were 50 or more warnings (use warnings() to see the first 50) >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] soybeanprobe_2.10.0 soybeancdf_2.10.0 AnnotationDbi_1.18.1 affy_1.34.0 Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affyio_1.24.0 BiocInstaller_1.4.6 DBI_0.2-5 IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 stats4_2.15.0 > [8] tools_2.15.0 zlibbioc_1.2.0 > > > > > Jenny Drnevich, Ph.D. > > Functional Genomics Bioinformatics Specialist > W.M. Keck Center for Comparative and Functional Genomics > Roy J. 
Carver Biotechnology Center > High Performance Biological Computing Program > University of Illinois, Urbana-Champaign > > 330 ERML > 1201 W. Gregory Dr. > Urbana, IL 61801 > USA > > NOTE NEW PHONE NUMBER > ph: 217-300-6543 > fax: 217-265-5066 > e-mail: drnevich at illinois.edu > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From hbolouri at fhcrc.org Fri Jun 8 02:43:12 2012 From: hbolouri at fhcrc.org (Hamid Bolouri) Date: Thu, 07 Jun 2012 17:43:12 -0700 (PDT) Subject: [BioC] DEGraph graph format? In-Reply-To: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> Message-ID: <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org> hello; Can anyone tell me how to use DEGraph with the pathways in NCIGraphData? The DEGraph Demo: >data("Loi2008_DEGraphVignette", package="DEGraph") >classData <- classLoi2008 >exprData <- exprLoi2008 >annData <- annLoi2008 >grList <- grListKEGG >res <- testOneGraph(grList[[1]],exprData,classData,verbose=T,prop=0.2) works fine for me. But replacing grList with NCI.cyList from NCIGraph: >library(NCIgraphData) >data("NCI-cyList") > NCI.cyList[[1]] A graphNEL graph with directed edges Number of Nodes = 35 Number of Edges = 40 I get this error: >res <- testOneGraph(NCI.cyList[[1]],exprData,classData,verbose=T,prop=0.2) Keeping genes in the graph *and* the expression data set... 35 genes of the graph were not found in the expression data set: chr [1:35] "6749854621221256793-pid_m_25632-674985462-829166685-pid_m_100726" ... 
227 genes of the expression data set are absent from the graph: chr [1:227] "31" "32" "207" "208" "355" "356" "369" "572" ... Error: all.equal(dataGN, graphGN) is not TRUE Keeping genes in the graph *and* the expression data set...done I get the same error with 'reactome.cyList' graphs and with graphs generated by 'parseNCInetwork'. Thanks Hamid Bolouri > sessionInfo() R version 2.15.0 (2012-03-30) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] NCIgraphData_0.99.4 DEGraph_1.8.0 R.utils_1.12.1 [4] R.oo_1.9.3 R.methodsS3_1.2.2 loaded via a namespace (and not attached): [1] BiocGenerics_0.2.0 graph_1.34.0 grid_2.15.0 KEGGgraph_1.12.0 [5] lattice_0.20-6 mvtnorm_0.9-9992 NCIgraph_1.4.0 RBGL_1.32.0 [9] RCurl_1.91-1.1 RCytoscape_1.6.3 Rgraphviz_1.34.1 rrcov_1.3-01 [13] stats4_2.15.0 tools_2.15.0 XML_3.9-4.1 XMLRPC_0.2-4 From shi at wehi.EDU.AU Fri Jun 8 04:10:25 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Fri, 8 Jun 2012 12:10:25 +1000 Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <1338980934.5663.85.camel@yangdu-desktop> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> <1338980934.5663.85.camel@yangdu-desktop> Message-ID: <33FCD1BB-DCA5-402E-A973-7FB6F73F208F@wehi.edu.au> Dear Robert, Dan and Peter, We have made changes to a number of functions in the package to reduce the memory allocated to Rsubread by the operating system when it was loaded. The new version has been committed to both bioc release (Rsubread 1.6.4) and bioc devel (Rsubread 1.7.4). 
They should be available to you in a day or two. Also, the buildindex() function no longer needs the allocation of a 1GB contiguous memory region. But it will still consume at least 1GB of memory when it is running, no matter what the given value of the 'memory' parameter is.

We have tested the new version on our 32-bit VM machine (it has 3GB of memory and the value of the 'memory' parameter used by buildindex was 2500) and it solves all the reported problems, so we are pretty happy with it. I hope the new version works on your computers/laptops, but please do let us know if it doesn't.

Sorry about the problems you have encountered. It's always a challenge to develop an R package with so much C code in it!

Cheers,
Wei

On Jun 6, 2012, at 9:08 PM, Dan Du wrote:

> Dear Wei,
>
> Here is a standard biocLite update, I think it is at the last step when
> compiling Rsubread.so, the memory usage exceeds 5.5g, then the system freezes
> and I have to call it off. Same result when running 'R CMD INSTALL
> Rsubread_1.6.3.tar.gz' from shell, or manually compiling all .c files and
> running the last gcc statement. I guess there might just be a minimum ram
> requirement somewhere higher than 6g... I will do some more poking when
> I have time.
>
> 'gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o
> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o
> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o
> gene-value-index.o hashtable.o index-builder.o input-files.o
> processExons.o propmapped.o qualityScores.o readSummary.o
> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread
> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR'
>
> Also down there are the sessionInfo and full gcc version, please let me
> know if you need more information.
> > Regards, > Dan > -------------------------------------------------------------------- >> source('http://www.bioconductor.org/biocLite.R') >> biocLite('') > BioC_mirror: http://bioconductor.org > Using R version 2.15, BiocInstaller version 1.4.6. > Installing package(s) '' > Old packages: 'Rsubread' > Update all/some/none? [a/s/n]: a > trying URL > 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz' > Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb) > opened URL > ================================================== > downloaded 20.9 Mb > > WARNING: ignoring environment value of R_HOME > * installing *source* package ?Rsubread? ... > ** libs > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c R_wrapper.c -o R_wrapper.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c SNP_calling.c -o SNP_calling.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c aligner.c -o aligner.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c atgcContent.c -o atgcContent.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c detectionCall.c -o detectionCall.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c detectionCallAnnotation.c -o detectionCallAnnotation.o > detectionCallAnnotation.c: In function ?calculateExonGCContent?: > detectionCallAnnotation.c:175: warning: ignoring return value of > ?fgets?, declared with attribute warn_unused_result > detectionCallAnnotation.c: In function ?calculateIRGCContent?: > detectionCallAnnotation.c:262: warning: ignoring return value of > ?fgets?, declared with attribute warn_unused_result > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c exon-algorithms.c -o exon-algorithms.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c exon-align.c -o exon-align.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c fullscan.c -o fullscan.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c gene-algorithms.c -o gene-algorithms.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c gene-value-index.c -o gene-value-index.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c hashtable.c -o hashtable.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c index-builder.c -o index-builder.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared > but never defined > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c input-files.c -o input-files.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c processExons.c -o processExons.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c propmapped.c -o propmapped.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c qualityScores.c -o qualityScores.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c readSummary.c -o readSummary.o > readSummary.c: In function ?readSummary?: > readSummary.c:122: warning: format ?%d? expects type ?int?, but argument > 5 has type ?long int? > readSummary.c:122: warning: format ?%d? expects type ?int?, but argument > 6 has type ?long int? > readSummary.c:39: warning: ignoring return value of ?getline?, declared > with attribute warn_unused_result > readSummary.c:52: warning: ignoring return value of ?getline?, declared > with attribute warn_unused_result > readSummary.c:55: warning: ignoring return value of ?getline?, declared > with attribute warn_unused_result > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c removeDuplicatedReads.c -o removeDuplicatedReads.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c sam2bed.c -o sam2bed.o > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > -O3 -pipe -g -c sorted-hashtable.c -o sorted-hashtable.o > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > but never defined > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > declared but never defined > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared > but never defined > gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o > aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o > exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o > gene-value-index.o hashtable.o index-builder.o input-files.o > processExons.o propmapped.o qualityScores.o readSummary.o > removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread > -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR > ^Cmake: *** Deleting file `Rsubread.so' > make: *** [Rsubread.so] Interrupt > ** R > ** inst > ** preparing package for lazy loading > ** help > *** installing help indices > ** building package indices > ** installing vignettes > ?Rsubread.Rnw? > ** testing if installed package can be loaded > Error in library.dynam(lib, package, package.lib) : > shared object ?Rsubread.so? not found > Error: loading failed > Execution halted > -------------------------------------------------------------------- >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 > [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > -------------------------------------------------------------------- > $ gcc -v > Using built-in specs. 
> Target: x86_64-linux-gnu > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > 4.4.3-4ubuntu5.1' > --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr > --enable-shared --enable-multiarch --enable-linker-build-id > --with-system-zlib --libexecdir=/usr/lib --without-included-gettext > --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 > --program-suffix=-4.4 --enable-nls --enable-clocale=gnu > --enable-libstdcxx-debug --enable-plugin --enable-objc-gc > --disable-werror --with-arch-32=i486 --with-tune=generic > --enable-checking=release --build=x86_64-linux-gnu > --host=x86_64-linux-gnu --target=x86_64-linux-gnu > Thread model: posix > gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) > -------------------------------------------------------------------- > > On Wed, 2012-06-06 at 20:10 +1000, Wei Shi wrote: >> Dear Dan, >> >> It didn't seem to be problem of requesting a continuous 1GB block in our >> investigation. We tracked the memory usage of buildindex() function when >> running it on yeast genome using a 32-bit VM, and found that the segfault >> happened right after a request of a few KB of memory was sent to the >> system when the memory parameter was set to 2500. However, the problem was >> gone when the memory parameter was changed to 1000. >> >> Removing highly repetitive 16 mers required a continuous 1GB block of >> memory, but this step was always executed successfully. This step also >> included in the old version of Rsubread (1.1.1), and it did not have >> problem there either. >> >> Could you please provide us your complete code for running your test and >> also session info? This will help us to diagnose what the problem could be >> because we couldn't reproduce what you saw from our end. >> >> For the compilation issue on your 64bit laptop, could you provide us more >> details as well, including the message output from gcc? 
>> >> Thanks, >> Wei >> >>> Dear Wei, >>> >>> Unfortunately reducing the memory parameter to 1000, still causes the >>> segfault. I guess with 3g ram limit on a 32bit system, there is still a >>> fat chance that you can not request a continuous 1g block. >>> >>> For that 64bit laptop, it is still strange about the 6g memory draining. >>> It is happing during the installation when compiling the shared library >>> Rsubread.so, not running the buildindex function. Btw, the gcc version >>> is 4.4.3. >>> >>> Server and opensuse box runs gcc version 4.3.1 and 4.5.0 respectively. >>> >>> Regards, >>> Dan >>> >>> On Wed, 2012-06-06 at 14:56 +1000, Wei Shi wrote: >>>> Dear Dan, >>>> >>>> It is probably because including genome sequences into the index slowed >>>> down your laptop. But I believe it should be alleviated if you give >>>> smaller values to the 'memory ' parameter of the buildindex() function. >>>> Also, the index building is an one-off operation, you do not need to >>>> redo it even when new releases come. >>>> >>>> For your 32-bit opensuse box, I guess the problem will be solved if you >>>> change the amount of memory requested to be 1000MB. >>>> >>>> Cheers, >>>> Wei >>>> >>>> On Jun 5, 2012, at 11:43 PM, Dan Du wrote: >>>> >>>>> Hi Robert, >>>>> >>>>> I have been experiencing something else, possibly related to yours, >>>>> on a 64bit ubuntu laptop with 6g of ram. >>>>> >>>>> As I recall, when bumping to Bioc 2.10, the Rsubread installation kind >>>>> of ate all the memory, basically froze the system so I had to call it >>>>> off, yet building it on the server side turned out fine. So I think I >>>>> just accepted that the new version may be 'computationally heavy' thus >>>>> not suitable for a normal pc, though I did not find any mentioning of >>>>> this increased memory requirement in the NEWS file. 
>>>>> >>>>> So currently Rsubread stays at 1.4.4 on that pc, all subsequent >>>> versions >>>>> of Rsubread drain the memory in the same way when compiling >>>> Rsubread.so. >>>>> >>>>> Now I think I can confirm this on a 32-bit opensuse box, it did >>>>> successfully built, but when running the example code in the manual, >>>>> same segfault happens. >>>>> >>>>> >>>>>> library(Rsubread) >>>>>> ref <- system.file("extdata","reference.fa",package="Rsubread") >>>>>> path <- system.file("extdata",package="Rsubread") >>>>>> buildindex(basename=file.path(path,"reference_index"),reference=ref) >>>>> >>>>> Building a base-space index. >>>>> Size of memory used=3700 MB >>>>> Base name of the built index >>>>> = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index >>>>> >>>>> *** caught segfault *** >>>>> address 0xdf03ee80, cause 'memory not mapped' >>>>> >>>>> Traceback: >>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >>>>> as.character(cmd), PACKAGE = "Rsubread") >>>>> 2: buildindex(basename = file.path(path, "reference_index"), reference >>>>> = ref) >>>>> >>>>>> sessionInfo() >>>>> R version 2.15.0 Patched (2012-06-04 r59517) >>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>> [7] LC_PAPER=C LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods >>>>> base >>>>> >>>>> other attached packages: >>>>> [1] Rsubread_1.6.3 >>>>> >>>>> >>>>> Regards, >>>>> Dan >>>>> >>>>> On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: >>>>>> hi, >>>>>> >>>>>> the computer room at my university where we do practicals on R & >>>> Bioconductor runs a 32bit linux distribution and when i tried to run >>>> the latest version of the Rsubread package 
(1.6.3) it crashes when >>>> calling the buildindex() function on a multifasta file with the yeast >>>> genome. this does *not* happen under a 64bit linux distribution. >>>>>> >>>>>> i have verified that installing the version before (1.4.4) on the >>>> current R 2.15 it also crashes (on the 32bit), but two versions >>>> before, the 1.1.1, it does *not* and it works smoothly on this 32bit >>>> linux distribution. >>>>>> >>>>>> i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 >>>> where allChr.fa is the multifasta file with the yeast genome. >>>>>> >>>>>> so i can manage by now the problem by using the 1.1.1 version on R >>>> 2.15 for my teaching but i wonder whether there would be some easy >>>> solution for this, or even if it could be a symptom of something else >>>> that the Rsubread developers should worry about. i know that using a >>>> 32bit system nowadays is quite obsolete but this is what i got for >>>> teaching :( and i would be happy to let my students play with the >>>> latest version of Rsubread in the future. >>>>>> >>>>>> >>>>>> thanks!!! >>>>>> robert. >>>>>> >>>>>> ======================Rsubread 1.6.3 on R 2.15======================= >>>>>> >>>>>>> library(Rsubread) >>>>>>> sessionInfo() >>>>>> R version 2.15.0 (2012-03-30) >>>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsubread_1.6.3 >>>>>> >>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", >>>> memory=2500) >>>>>> >>>>>> Building a base-space index. 
>>>>>> Size of memory used=2500 MB >>>>>> Base name of the built index = subreadindex >>>>>> >>>>>> *** caught segfault *** >>>>>> address 0xdf670cc0, cause 'memory not mapped' >>>>>> >>>>>> Traceback: >>>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >>>> as.character(cmd), PACKAGE = "Rsubread") >>>>>> 2: buildindex(basename = "subreadindex", reference = "allChr.fa", >>>> memory = 2500) >>>>>> >>>>>> Possible actions: >>>>>> 1: abort (with core dump, if enabled) >>>>>> 2: normal R exit >>>>>> 3: exit R without saving workspace >>>>>> 4: exit R saving workspace >>>>>> Selection: >>>>>> >>>>>> >>>>>> ======================Rsubread 1.1.1 on R 2.15======================= >>>>>> >>>>>>> library(Rsubread) >>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", >>>> memory=2500) >>>>>> >>>>>> Building the index in the base space. >>>>>> Size of memory requested=2500 MB >>>>>> Index base name = subreadindex >>>>>> INDEX ITEMS PER PARTITION = 275940352 >>>>>> >>>>>> completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps >>>> completed=81.76%; time used=2.4s; rate=4111.8k >>>> bps/s; total=12m bps >>>>>> All the chromosome files are processed. >>>>>> | Dumping index >>>> [===========================================================>] >>>>>> Index subreadindex is successfully built. 
>>>>>>> sessionInfo() >>>>>> R version 2.15.0 (2012-03-30) >>>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsubread_1.1.1 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> ______________________________________________________________________ >>>> The information in this email is confidential and intended solely for >>>> the addressee. >>>> You must not disclose, forward, print or use it without the permission >>>> of the sender. >>>> ______________________________________________________________________ >>> >>> >>> >>> >> >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. 
>> ______________________________________________________________________
>
> ______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}

From yuchuan at stat.berkeley.edu Fri Jun 8 08:06:33 2012
From: yuchuan at stat.berkeley.edu (Yu Chuan Tai)
Date: Thu, 7 Jun 2012 23:06:33 -0700 (PDT)
Subject: [BioC] base-specific read counts
In-Reply-To: <4FD0A5FA.6030301@fhcrc.org>
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID: 

Hi Martin,

One more question. Is there any way in Rsamtools to calculate SNVs/INDELS frequency directly using the output file from samtools?

Thanks!

Best,
Yu Chuan

On Thu, 7 Jun 2012, Martin Morgan wrote:

> On 06/06/2012 10:43 PM, Yu Chuan Tai wrote:
>> Hi,
>>
>> Is there any way to calculate base-specific read counts for a given
>> genomic interval (including 1-base interval), for paired-end data
>> aligned by Bowtie2 in BAM format?
>
> Thanks for posting to the Bioc mailing list! Functions like
> readGappedAlignments, scanBam, etc. take an argument ScanBamParam that in
> turn has an argument 'which' to specify, using GRanges, the regions of a bam
> file you want to query
>
>   gwhich <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100),
>                     c("+", "+", "-"))
>   param <- ScanBamParam(which=gwhich)
>   scanBam("my.bam", param=param)
>
> Base-level coverage is also available with ?applyPileups, see
> example(applyPileups).
>
> Martin
>
>> Thanks!
>>
>> Best,
>> Yu Chuan
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

From tooyoung at gmail.com Fri Jun 8 08:09:07 2012
From: tooyoung at gmail.com (Dan Du)
Date: Fri, 08 Jun 2012 08:09:07 +0200
Subject: [BioC] Rsubread crashes in 32bit linux
In-Reply-To: <33FCD1BB-DCA5-402E-A973-7FB6F73F208F@wehi.edu.au>
References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> <1338980934.5663.85.camel@yangdu-desktop> <33FCD1BB-DCA5-402E-A973-7FB6F73F208F@wehi.edu.au>
Message-ID: <1339135747.2008.12.camel@yangdu-desktop>

Dear Wei,

Good work, memory freed. Just checkout the devel version, package built and installed successfully with no hiccup, test codes runs fine.

Regards,
Dan

On Fri, 2012-06-08 at 12:10 +1000, Wei Shi wrote:
> Dear Robert, Dan and Peter,
>
> We have made changes to a number of functions in the package to reduce the memory allocated to Rsubread by the operating system when it was loaded. The new version has been committed to both bioc release (Rsubread 1.6.4) and bioc devel (Rsubread 1.7.4). They should be available to you in a day or two.
>
> Also, the buildindex() function no longer needs the allocation of 1GB continuous memory region. But it will still consume at least 1GB of memory when it is running, no matter what the given value of the 'memory' parameter is.
>
> We have tested the new version on our 32-bit VM machine (it has 3GB of memory and the value of 'memory' parameter used by buildindex was 2500) and it solves all the reported problems, so we are pretty happy with it. I hope the new version works in your computers/laptops, but please do let us know if it doesn't.
>
> Sorry about the problems you have encountered. It's always a challenge to develop a R package with so much C code in it!
> > Cheers, > Wei > > > > On Jun 6, 2012, at 9:08 PM, Dan Du wrote: > > > Dear Wei, > > > > Here is a standard bioclite update, I think it is at the last step when > > compiling Rsubread.so, the memory usage exceeds 5.5g, then system freeze > > and I have to call it off. Same result when runing 'R CMD INSTALL > > Rsubread_1.6.3.tar.gz' from shell, or manually compile all .c file and > > run the last gcc statement. I guess there might just be a minimum ram > > requirement somewhere higher than 6g... I will do some more poking when > > I have time. > > > > 'gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o > > aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o > > exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o > > gene-value-index.o hashtable.o index-builder.o input-files.o > > processExons.o propmapped.o qualityScores.o readSummary.o > > removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread > > -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR' > > > > Also down there are the sessionInfo and full gcc version, please let me > > know if you need more information. > > > > Regards, > > Dan > > -------------------------------------------------------------------- > >> source('http://www.bioconductor.org/biocLite.R') > >> biocLite('') > > BioC_mirror: http://bioconductor.org > > Using R version 2.15, BiocInstaller version 1.4.6. > > Installing package(s) '' > > Old packages: 'Rsubread' > > Update all/some/none? [a/s/n]: a > > trying URL > > 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz' > > Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb) > > opened URL > > ================================================== > > downloaded 20.9 Mb > > > > WARNING: ignoring environment value of R_HOME > > * installing *source* package ?Rsubread? ... 
> > ** libs > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c R_wrapper.c -o R_wrapper.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c SNP_calling.c -o SNP_calling.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c aligner.c -o aligner.o > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c atgcContent.c -o atgcContent.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c detectionCall.c -o detectionCall.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c detectionCallAnnotation.c -o detectionCallAnnotation.o > > detectionCallAnnotation.c: In function ?calculateExonGCContent?: > > detectionCallAnnotation.c:175: warning: ignoring return value of > > ?fgets?, declared with attribute warn_unused_result > > detectionCallAnnotation.c: In function ?calculateIRGCContent?: > > detectionCallAnnotation.c:262: warning: ignoring return value of > > ?fgets?, declared with attribute warn_unused_result > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c exon-algorithms.c -o exon-algorithms.o > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? 
> > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c exon-align.c -o exon-align.o > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c fullscan.c -o fullscan.o > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c gene-algorithms.c -o gene-algorithms.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c gene-value-index.c -o gene-value-index.o > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared > > but never defined > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c hashtable.c -o hashtable.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c index-builder.c -o index-builder.o > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c input-files.c -o input-files.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c processExons.c -o processExons.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c propmapped.c -o propmapped.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c qualityScores.c -o qualityScores.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c readSummary.c -o readSummary.o > > readSummary.c: In function ?readSummary?: > > readSummary.c:122: warning: format ?%d? expects type ?int?, but argument > > 5 has type ?long int? > > readSummary.c:122: warning: format ?%d? expects type ?int?, but argument > > 6 has type ?long int? 
> > readSummary.c:39: warning: ignoring return value of ?getline?, declared > > with attribute warn_unused_result > > readSummary.c:52: warning: ignoring return value of ?getline?, declared > > with attribute warn_unused_result > > readSummary.c:55: warning: ignoring return value of ?getline?, declared > > with attribute warn_unused_result > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c removeDuplicatedReads.c -o removeDuplicatedReads.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c sam2bed.c -o sam2bed.o > > gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic > > -O3 -pipe -g -c sorted-hashtable.c -o sorted-hashtable.o > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? > > declared but never defined > > gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared > > but never defined > > gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o > > aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o > > exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o > > gene-value-index.o hashtable.o index-builder.o input-files.o > > processExons.o propmapped.o qualityScores.o readSummary.o > > removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread > > -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR > > ^Cmake: *** Deleting file `Rsubread.so' > > make: *** [Rsubread.so] Interrupt > > ** R > > ** inst > > ** preparing package for lazy loading > > ** help > > *** installing help indices > > ** building package indices > > ** installing vignettes > > ?Rsubread.Rnw? > > ** testing if installed package can be loaded > > Error in library.dynam(lib, package, package.lib) : > > shared object ?Rsubread.so? 
not found > > Error: loading failed > > Execution halted > > -------------------------------------------------------------------- > >> sessionInfo() > > R version 2.15.0 (2012-03-30) > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > > [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 > > [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > -------------------------------------------------------------------- > > $ gcc -v > > Using built-in specs. > > Target: x86_64-linux-gnu > > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > > 4.4.3-4ubuntu5.1' > > --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs > > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr > > --enable-shared --enable-multiarch --enable-linker-build-id > > --with-system-zlib --libexecdir=/usr/lib --without-included-gettext > > --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 > > --program-suffix=-4.4 --enable-nls --enable-clocale=gnu > > --enable-libstdcxx-debug --enable-plugin --enable-objc-gc > > --disable-werror --with-arch-32=i486 --with-tune=generic > > --enable-checking=release --build=x86_64-linux-gnu > > --host=x86_64-linux-gnu --target=x86_64-linux-gnu > > Thread model: posix > > gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) > > -------------------------------------------------------------------- > > > > On Wed, 2012-06-06 at 20:10 +1000, Wei Shi wrote: > >> Dear Dan, > >> > >> It didn't seem to be problem of requesting a continuous 1GB block in our > >> investigation. 
We tracked the memory usage of buildindex() function when > >> running it on yeast genome using a 32-bit VM, and found that the segfault > >> happened right after a request of a few KB of memory was sent to the > >> system when the memory parameter was set to 2500. However, the problem was > >> gone when the memory parameter was changed to 1000. > >> > >> Removing highly repetitive 16 mers required a continuous 1GB block of > >> memory, but this step was always executed successfully. This step also > >> included in the old version of Rsubread (1.1.1), and it did not have > >> problem there either. > >> > >> Could you please provide us your complete code for running your test and > >> also session info? This will help us to diagnose what the problem could be > >> because we couldn't reproduce what you saw from our end. > >> > >> For the compilation issue on your 64bit laptop, could you provide us more > >> details as well, including the message output from gcc? > >> > >> Thanks, > >> Wei > >> > >>> Dear Wei, > >>> > >>> Unfortunately reducing the memory parameter to 1000, still causes the > >>> segfault. I guess with 3g ram limit on a 32bit system, there is still a > >>> fat chance that you can not request a continuous 1g block. > >>> > >>> For that 64bit laptop, it is still strange about the 6g memory draining. > >>> It is happing during the installation when compiling the shared library > >>> Rsubread.so, not running the buildindex function. Btw, the gcc version > >>> is 4.4.3. > >>> > >>> Server and opensuse box runs gcc version 4.3.1 and 4.5.0 respectively. > >>> > >>> Regards, > >>> Dan > >>> > >>> On Wed, 2012-06-06 at 14:56 +1000, Wei Shi wrote: > >>>> Dear Dan, > >>>> > >>>> It is probably because including genome sequences into the index slowed > >>>> down your laptop. But I believe it should be alleviated if you give > >>>> smaller values to the 'memory ' parameter of the buildindex() function. 
> >>>> Also, the index building is an one-off operation, you do not need to > >>>> redo it even when new releases come. > >>>> > >>>> For your 32-bit opensuse box, I guess the problem will be solved if you > >>>> change the amount of memory requested to be 1000MB. > >>>> > >>>> Cheers, > >>>> Wei > >>>> > >>>> On Jun 5, 2012, at 11:43 PM, Dan Du wrote: > >>>> > >>>>> Hi Robert, > >>>>> > >>>>> I have been experiencing something else, possibly related to yours, > >>>>> on a 64bit ubuntu laptop with 6g of ram. > >>>>> > >>>>> As I recall, when bumping to Bioc 2.10, the Rsubread installation kind > >>>>> of ate all the memory, basically froze the system so I had to call it > >>>>> off, yet building it on the server side turned out fine. So I think I > >>>>> just accepted that the new version may be 'computationally heavy' thus > >>>>> not suitable for a normal pc, though I did not find any mentioning of > >>>>> this increased memory requirement in the NEWS file. > >>>>> > >>>>> So currently Rsubread stays at 1.4.4 on that pc, all subsequent > >>>> versions > >>>>> of Rsubread drain the memory in the same way when compiling > >>>> Rsubread.so. > >>>>> > >>>>> Now I think I can confirm this on a 32-bit opensuse box, it did > >>>>> successfully built, but when running the example code in the manual, > >>>>> same segfault happens. > >>>>> > >>>>> > >>>>>> library(Rsubread) > >>>>>> ref <- system.file("extdata","reference.fa",package="Rsubread") > >>>>>> path <- system.file("extdata",package="Rsubread") > >>>>>> buildindex(basename=file.path(path,"reference_index"),reference=ref) > >>>>> > >>>>> Building a base-space index. 
> >>>>> Size of memory used=3700 MB > >>>>> Base name of the built index > >>>>> = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index > >>>>> > >>>>> *** caught segfault *** > >>>>> address 0xdf03ee80, cause 'memory not mapped' > >>>>> > >>>>> Traceback: > >>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = > >>>>> as.character(cmd), PACKAGE = "Rsubread") > >>>>> 2: buildindex(basename = file.path(path, "reference_index"), reference > >>>>> = ref) > >>>>> > >>>>>> sessionInfo() > >>>>> R version 2.15.0 Patched (2012-06-04 r59517) > >>>>> Platform: i686-pc-linux-gnu (32-bit) > >>>>> > >>>>> locale: > >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > >>>>> [7] LC_PAPER=C LC_NAME=C > >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >>>>> > >>>>> attached base packages: > >>>>> [1] stats graphics grDevices utils datasets methods > >>>>> base > >>>>> > >>>>> other attached packages: > >>>>> [1] Rsubread_1.6.3 > >>>>> > >>>>> > >>>>> Regards, > >>>>> Dan > >>>>> > >>>>> On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: > >>>>>> hi, > >>>>>> > >>>>>> the computer room at my university where we do practicals on R & > >>>> Bioconductor runs a 32bit linux distribution and when i tried to run > >>>> the latest version of the Rsubread package (1.6.3) it crashes when > >>>> calling the buildindex() function on a multifasta file with the yeast > >>>> genome. this does *not* happen under a 64bit linux distribution. > >>>>>> > >>>>>> i have verified that installing the version before (1.4.4) on the > >>>> current R 2.15 it also crashes (on the 32bit), but two versions > >>>> before, the 1.1.1, it does *not* and it works smoothly on this 32bit > >>>> linux distribution. 
> >>>>>> > >>>>>> i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 > >>>> where allChr.fa is the multifasta file with the yeast genome. > >>>>>> > >>>>>> so i can manage by now the problem by using the 1.1.1 version on R > >>>> 2.15 for my teaching but i wonder whether there would be some easy > >>>> solution for this, or even if it could be a symptom of something else > >>>> that the Rsubread developers should worry about. i know that using a > >>>> 32bit system nowadays is quite obsolete but this is what i got for > >>>> teaching :( and i would be happy to let my students play with the > >>>> latest version of Rsubread in the future. > >>>>>> > >>>>>> > >>>>>> thanks!!! > >>>>>> robert. > >>>>>> > >>>>>> ======================Rsubread 1.6.3 on R 2.15======================= > >>>>>> > >>>>>>> library(Rsubread) > >>>>>>> sessionInfo() > >>>>>> R version 2.15.0 (2012-03-30) > >>>>>> Platform: i686-pc-linux-gnu (32-bit) > >>>>>> > >>>>>> locale: > >>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > >>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > >>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > >>>>>> [7] LC_PAPER=C LC_NAME=C > >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > >>>>>> > >>>>>> attached base packages: > >>>>>> [1] stats graphics grDevices utils datasets methods base > >>>>>> > >>>>>> other attached packages: > >>>>>> [1] Rsubread_1.6.3 > >>>>>> > >>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", > >>>> memory=2500) > >>>>>> > >>>>>> Building a base-space index. 
> >>>>>> Size of memory used=2500 MB > >>>>>> Base name of the built index = subreadindex > >>>>>> > >>>>>> *** caught segfault *** > >>>>>> address 0xdf670cc0, cause 'memory not mapped' > >>>>>> > >>>>>> Traceback: > >>>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = > >>>> as.character(cmd), PACKAGE = "Rsubread") > >>>>>> 2: buildindex(basename = "subreadindex", reference = "allChr.fa", > >>>> memory = 2500) > >>>>>> > >>>>>> Possible actions: > >>>>>> 1: abort (with core dump, if enabled) > >>>>>> 2: normal R exit > >>>>>> 3: exit R without saving workspace > >>>>>> 4: exit R saving workspace > >>>>>> Selection: > >>>>>> > >>>>>> > >>>>>> ======================Rsubread 1.1.1 on R 2.15======================= > >>>>>> > >>>>>>> library(Rsubread) > >>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", > >>>> memory=2500) > >>>>>> > >>>>>> Building the index in the base space. > >>>>>> Size of memory requested=2500 MB > >>>>>> Index base name = subreadindex > >>>>>> INDEX ITEMS PER PARTITION = 275940352 > >>>>>> > >>>>>> completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps > >>>> completed=81.76%; time used=2.4s; rate=4111.8k > >>>> bps/s; total=12m bps > >>>>>> All the chromosome files are processed. > >>>>>> | Dumping index > >>>> [===========================================================>] > >>>>>> Index subreadindex is successfully built. 
> >>>>>>> sessionInfo() > >>>>>> R version 2.15.0 (2012-03-30) > >>>>>> Platform: i686-pc-linux-gnu (32-bit) > >>>>>> > >>>>>> locale: > >>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C > >>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 > >>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 > >>>>>> [7] LC_PAPER=C LC_NAME=C > >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C > >>>>>> > >>>>>> attached base packages: > >>>>>> [1] stats graphics grDevices utils datasets methods base > >>>>>> > >>>>>> other attached packages: > >>>>>> [1] Rsubread_1.1.1 > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Bioconductor mailing list > >>>>>> Bioconductor at r-project.org > >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at r-project.org > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>>> > >>>> ______________________________________________________________________ > >>>> The information in this email is confidential and intended solely for > >>>> the addressee. > >>>> You must not disclose, forward, print or use it without the permission > >>>> of the sender. > >>>> ______________________________________________________________________ > >>> > >>> > >>> > >>> > >> > >> > >> > >> ______________________________________________________________________ > >> The information in this email is confidential and intended solely for the addressee. > >> You must not disclose, forward, print or use it without the permission of the sender. 
> >> ______________________________________________________________________
> >
> >
> >
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:5}}

From yuchuan at stat.berkeley.edu Fri Jun 8 08:09:00 2012
From: yuchuan at stat.berkeley.edu (Yu Chuan Tai)
Date: Thu, 7 Jun 2012 23:09:00 -0700 (PDT)
Subject: [BioC] base-specific read counts
In-Reply-To:
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID:

Hi Sean,

I see. Thanks for your clarification!

Best,
Yu Chuan

On Thu, 7 Jun 2012, Sean Davis wrote:

> On Thu, Jun 7, 2012 at 11:24 AM, Yu Chuan Tai wrote:
>> I see. So, which input arguments of scanBamFlag() or ScanBamParam() take
>> care of paired-end reads? Or should I even worry about the paired-end
>> nature when calculating coverage?
>
> There are situations when you want to calculate coverage based on the
> extreme ends of pairs, but that would be for finding structural
> variants and the like and not for determining base-level coverage.
> You do not need to do anything special to read in paired-end data. As
> Martin mentioned, there is more complete handling of paired-end data
> in development, but that is really orthogonal to the scanBamFlag()
> isPaired flag.
>
> Sean
>
>>
>> Thanks!
>> Yu Chuan
>>
>> On Thu, 7 Jun 2012, Sean Davis wrote:
>>
>>> On Thu, Jun 7, 2012 at 11:03 AM, Yu Chuan Tai wrote:
>>>> Thanks! In your code below, to take care of the paired-end reads, is it
>>>> correct that at least I need to set isPaired=TRUE in scanBamFlag()?
>>>
>>> No. The "isXXX" stuff is for filtering the data. Assuming that you
>>> want all your reads to be included (and not just paired reads), you do
>>> not need to set isPaired.
>>>
>>> Sean
>>>
>>>> Best,
>>>> Yu Chuan
>>>>
>>>> On Thu, 7 Jun 2012, Martin Morgan wrote:
>>>>
>>>>> On 06/06/2012 10:43 PM, Yu Chuan Tai wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is there any way to calculate base-specific read counts for a given
>>>>>> genomic interval (including 1-base interval), for paired-end data
>>>>>> aligned by Bowtie2 in BAM format?
>>>>>
>>>>>
>>>>> Thanks for posting to the Bioc mailing list! Functions like
>>>>> readGappedAlignments, scanBam, etc. take an argument ScanBamParam that in
>>>>> turn has an argument 'which' to specify, using GRanges, the regions of a bam
>>>>> file you want to query
>>>>>
>>>>>  gwhich <- GRanges("chr1", IRanges(c(1000, 2000, 3000), width=100),
>>>>>      c("+", "+", "-"))
>>>>>  param <- ScanBamParam(which=gwhich)
>>>>>  scanBam("my.bam", param=param)
>>>>>
>>>>> Base-level coverage is also available with ?applyPileups, see
>>>>> example(applyPileups).
>>>>>
>>>>> Martin
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Best,
>>>>>> Yu Chuan
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>

From shi at wehi.EDU.AU Fri Jun 8 08:16:24 2012
From: shi at wehi.EDU.AU (Wei Shi)
Date: Fri, 8 Jun 2012 16:16:24 +1000
Subject: [BioC] Rsubread crashes in 32bit linux
In-Reply-To: <1339135747.2008.12.camel@yangdu-desktop>
References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> <1338980934.5663.85.camel@yangdu-desktop> <33FCD1BB-DCA5-402E-A973-7FB6F73F208F@wehi.edu.au> <1339135747.2008.12.camel@yangdu-desktop>
Message-ID: <8306290A-A980-48EA-9FB6-07000447C9E4@wehi.edu.au>

Dear Dan,

Thanks for letting us know. That's great!

Cheers,
Wei

On Jun 8, 2012, at 4:09 PM, Dan Du wrote:

> Dear Wei,
>
> Good work, memory freed. Just checked out the devel version, package built
> and installed successfully with no hiccup, test code runs fine.
>
> Regards,
> Dan
>
> On Fri, 2012-06-08 at 12:10 +1000, Wei Shi wrote:
>> Dear Robert, Dan and Peter,
>>
>> We have made changes to a number of functions in the package to reduce the memory allocated to Rsubread by the operating system when it was loaded. The new version has been committed to both bioc release (Rsubread 1.6.4) and bioc devel (Rsubread 1.7.4). They should be available to you in a day or two.
>>
>> Also, the buildindex() function no longer needs the allocation of a contiguous 1GB memory region. But it will still consume at least 1GB of memory when it is running, no matter what the given value of the 'memory' parameter is.
>>
>> We have tested the new version on our 32-bit VM machine (it has 3GB of memory and the value of 'memory' parameter used by buildindex was 2500) and it solves all the reported problems, so we are pretty happy with it. I hope the new version works in your computers/laptops, but please do let us know if it doesn't.
>>
>> Sorry about the problems you have encountered. It's always a challenge to develop an R package with so much C code in it!
>>
>> Cheers,
>> Wei
>>
>>
>>
>> On Jun 6, 2012, at 9:08 PM, Dan Du wrote:
>>
>>> Dear Wei,
>>>
>>> Here is a standard biocLite update, I think it is at the last step when
>>> compiling Rsubread.so, the memory usage exceeds 5.5g, then the system freezes
>>> and I have to call it off. Same result when running 'R CMD INSTALL
>>> Rsubread_1.6.3.tar.gz' from shell, or manually compiling all .c files and
>>> running the last gcc statement. I guess there might just be a minimum ram
>>> requirement somewhere higher than 6g... I will do some more poking when
>>> I have time.
>>>
>>> 'gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o
>>> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o
>>> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o
>>> gene-value-index.o hashtable.o index-builder.o input-files.o
>>> processExons.o propmapped.o qualityScores.o readSummary.o
>>> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread
>>> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR'
>>>
>>> Also down there are the sessionInfo and full gcc version, please let me
>>> know if you need more information.
>>>
>>> Regards,
>>> Dan
>>> --------------------------------------------------------------------
>>>> source('http://www.bioconductor.org/biocLite.R')
>>>> biocLite('')
>>> BioC_mirror: http://bioconductor.org
>>> Using R version 2.15, BiocInstaller version 1.4.6.
>>> Installing package(s) ''
>>> Old packages: 'Rsubread'
>>> Update all/some/none?
[a/s/n]: a >>> trying URL >>> 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz' >>> Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb) >>> opened URL >>> ================================================== >>> downloaded 20.9 Mb >>> >>> WARNING: ignoring environment value of R_HOME >>> * installing *source* package ?Rsubread? ... >>> ** libs >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c R_wrapper.c -o R_wrapper.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c SNP_calling.c -o SNP_calling.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c aligner.c -o aligner.o >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared >>> but never defined >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c atgcContent.c -o atgcContent.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c detectionCall.c -o detectionCall.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c detectionCallAnnotation.c -o detectionCallAnnotation.o >>> detectionCallAnnotation.c: In function ?calculateExonGCContent?: >>> detectionCallAnnotation.c:175: warning: ignoring return value of >>> ?fgets?, declared with attribute warn_unused_result >>> detectionCallAnnotation.c: In function ?calculateIRGCContent?: >>> detectionCallAnnotation.c:262: warning: ignoring return value of >>> ?fgets?, declared with attribute warn_unused_result >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c exon-algorithms.c -o exon-algorithms.o >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c exon-align.c -o exon-align.o >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared >>> but never defined >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c fullscan.c -o fullscan.o >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c gene-algorithms.c -o gene-algorithms.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c gene-value-index.c -o gene-value-index.o >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c hashtable.c -o hashtable.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c index-builder.c -o index-builder.o >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? 
declared >>> but never defined >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c input-files.c -o input-files.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c processExons.c -o processExons.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c propmapped.c -o propmapped.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c qualityScores.c -o qualityScores.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c readSummary.c -o readSummary.o >>> readSummary.c: In function ?readSummary?: >>> readSummary.c:122: warning: format ?%d? expects type ?int?, but argument >>> 5 has type ?long int? >>> readSummary.c:122: warning: format ?%d? expects type ?int?, but argument >>> 6 has type ?long int? >>> readSummary.c:39: warning: ignoring return value of ?getline?, declared >>> with attribute warn_unused_result >>> readSummary.c:52: warning: ignoring return value of ?getline?, declared >>> with attribute warn_unused_result >>> readSummary.c:55: warning: ignoring return value of ?getline?, declared >>> with attribute warn_unused_result >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c removeDuplicatedReads.c -o removeDuplicatedReads.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c sam2bed.c -o sam2bed.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic >>> -O3 -pipe -g -c sorted-hashtable.c -o sorted-hashtable.o >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? >>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gene-algorithms.h:23: warning: inline function ?add_gene_vote_weighted? 
>>> declared but never defined >>> gene-algorithms.h:22: warning: inline function ?add_gene_vote? declared >>> but never defined >>> gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o >>> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o >>> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o >>> gene-value-index.o hashtable.o index-builder.o input-files.o >>> processExons.o propmapped.o qualityScores.o readSummary.o >>> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread >>> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR >>> ^Cmake: *** Deleting file `Rsubread.so' >>> make: *** [Rsubread.so] Interrupt >>> ** R >>> ** inst >>> ** preparing package for lazy loading >>> ** help >>> *** installing help indices >>> ** building package indices >>> ** installing vignettes >>> ?Rsubread.Rnw? >>> ** testing if installed package can be loaded >>> Error in library.dynam(lib, package, package.lib) : >>> shared object ?Rsubread.so? not found >>> Error: loading failed >>> Execution halted >>> -------------------------------------------------------------------- >>>> sessionInfo() >>> R version 2.15.0 (2012-03-30) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 >>> [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> -------------------------------------------------------------------- >>> $ gcc -v >>> Using built-in specs. 
>>> Target: x86_64-linux-gnu >>> Configured with: ../src/configure -v --with-pkgversion='Ubuntu >>> 4.4.3-4ubuntu5.1' >>> --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs >>> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr >>> --enable-shared --enable-multiarch --enable-linker-build-id >>> --with-system-zlib --libexecdir=/usr/lib --without-included-gettext >>> --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 >>> --program-suffix=-4.4 --enable-nls --enable-clocale=gnu >>> --enable-libstdcxx-debug --enable-plugin --enable-objc-gc >>> --disable-werror --with-arch-32=i486 --with-tune=generic >>> --enable-checking=release --build=x86_64-linux-gnu >>> --host=x86_64-linux-gnu --target=x86_64-linux-gnu >>> Thread model: posix >>> gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) >>> -------------------------------------------------------------------- >>> >>> On Wed, 2012-06-06 at 20:10 +1000, Wei Shi wrote: >>>> Dear Dan, >>>> >>>> It didn't seem to be problem of requesting a continuous 1GB block in our >>>> investigation. We tracked the memory usage of buildindex() function when >>>> running it on yeast genome using a 32-bit VM, and found that the segfault >>>> happened right after a request of a few KB of memory was sent to the >>>> system when the memory parameter was set to 2500. However, the problem was >>>> gone when the memory parameter was changed to 1000. >>>> >>>> Removing highly repetitive 16 mers required a continuous 1GB block of >>>> memory, but this step was always executed successfully. This step also >>>> included in the old version of Rsubread (1.1.1), and it did not have >>>> problem there either. >>>> >>>> Could you please provide us your complete code for running your test and >>>> also session info? This will help us to diagnose what the problem could be >>>> because we couldn't reproduce what you saw from our end. 
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}

From yuchuan at stat.berkeley.edu Fri Jun 8 08:34:23 2012
From: yuchuan at stat.berkeley.edu (Yu Chuan Tai)
Date: Thu, 7 Jun 2012 23:34:23 -0700 (PDT)
Subject: [BioC] Amplicon and exon level read counts and GC content
In-Reply-To: <4FD0CE98.20002@fhcrc.org>
References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org> <4FD0A7DA.1040009@fhcrc.org> <4FD0CE98.20002@fhcrc.org>
Message-ID:

Hi Martin,

Is it correct that the code below calculates the number of reads overlapping an amplicon, where overlapping means an overlap of at least 1 base, so that a read does not have to lie fully within the amplicon? If a read overlaps 2 amplicons, will it be counted twice? When I used this approach to calculate amplicon-level read counts, I found that the number of reads overlapping all the amplicons is larger than the total number of reads in the BAM file, and I wonder if that's because a read could be counted more than once.

I found that the code below gives more read counts than using summarizeOverlaps().
I think it's because the latter counts a read at most once. If I want to calculate the coverage of SNVs/INDELs output by samtools, is it correct that using summarizeOverlaps() will under-estimate the coverage, since a read may overlap several SNVs/INDELs?

Thanks!
Yu Chuan

> param = ScanBamParam(what="seq", which=GRanges("seq1",IRanges(100, 500)))
> dna = scanBam(fl, param=param)[[1]][["seq"]]
> length(dna) # 365 reads overlap region
> alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC

On Thu, 7 Jun 2012, Martin Morgan wrote:

> On 06/07/2012 07:54 AM, Yu Chuan Tai wrote:
>> Hi Martin,
>>
>> Thanks! I will look into the links below. By 'better support for
>> paired-end reads in the 'devel' version', which package are you
>> referring to?
>
> Mostly GenomicRanges, e.g., readGappedAlignmentPairs, building on additional
> facilities in Rsamtools. Herve is responsible for this.
>
> Martin
>
>>
>> Best,
>> Yu Chuan
>>
>> On Thu, 7 Jun 2012, Martin Morgan wrote:
>>
>>> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote:
>>>> Hi Martin,
>>>>
>>>> More questions on your approaches below. If my BAM files are
>>>> generated by Bowtie2 on paired-end fastq files, for scanBamFlag(),
>>>> should I set isPaired=TRUE? Do I need to worry about other input
>>>> arguments for scanBamFlag() or ScanBamParam(), if I want to
>>>> calculate coverage properly?
>>>
>>> It really depends on what you're interested in doing; see for instance
>>> the post by Herve the other day
>>>
>>> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html
>>>
>>>>
>>>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads.
>>>> How to get around this, or it won't affect coverage calculation?
>>>
>>> There is better support for paired-end reads in the 'devel' version of
>>> Bioconductor; see
>>>
>>> http://bioconductor.org/developers/useDevel/
>>>
>>> whether and what aspects of paired-endedness are important depends on
>>> how you are using your coverage.
>>>
>>>>
>>>> Finally, is there any way to calculate base-specific coverage at any
>>>> genomic locus or interval in Rsamtools? Thanks!
>>>
>>> I tried to answer this in your other post.
>>>
>>> Martin
>>>
>>>>
>>>> Best, Yu Chuan
>>>>
>>>>> More specifically, after
>>>>>
>>>>> library(Rsamtools)
>>>>> example(scanBam) # defines 'fl', a path to a bam file
>>>>>
>>>>> for a _single_ genomic range
>>>>>
>>>>> param = ScanBamParam(what="seq", which=GRanges("seq1", IRanges(100, 500)))
>>>>> dna = scanBam(fl, param=param)[[1]][["seq"]]
>>>>> length(dna) # 365 reads overlap region
>>>>> alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC
>>>>>
>>>>> though you'd likely want to specify several regions (vector
>>>>> arguments to GRanges) and think about flags (scanBamFlag() and the
>>>>> flag argument to ScanBamParam), read mapping quality, reads
>>>>> overlapping more than one region, etc. (summarizeOverlaps
>>>>> implements several counting strategies, but it is 'easy' to
>>>>> implement arbitrary approaches).
>>>>>
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>> Thanks for any input!
>>>>>>>
>>>>>>> Best, Yu Chuan
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: M1-B861
>>>>> Telephone: 206 667-2793
>>>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

From robert.castelo at upf.edu Fri Jun 8 09:37:44 2012
From: robert.castelo at upf.edu (Robert Castelo)
Date: Fri, 08 Jun 2012 09:37:44 +0200
Subject: [BioC] Rsubread crashes in 32bit linux
In-Reply-To: <33FCD1BB-DCA5-402E-A973-7FB6F73F208F@wehi.edu.au>
References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> <1338980934.5663.85.camel@yangdu-desktop> <33FCD1BB-DCA5-402E-A973-7FB6F73F208F@wehi.edu.au>
Message-ID: <4FD1ABC8.5040506@upf.edu>

Dear Wei,

it also works on my side, the 32 bit machines do not crash anymore. thanks for solving this so quickly!

robert.

On 06/08/2012 04:10 AM, Wei Shi wrote:
> Dear Robert, Dan and Peter,
>
> We have made changes to a number of functions in the package to reduce the memory allocated to Rsubread by the operating system when it is loaded. The new version has been committed to both bioc release (Rsubread 1.6.4) and bioc devel (Rsubread 1.7.4). They should be available to you in a day or two.
>
> Also, the buildindex() function no longer needs to allocate a contiguous 1GB memory region. But it will still consume at least 1GB of memory when it is running, no matter what value is given to the 'memory' parameter.
>
> We have tested the new version on our 32-bit VM (it has 3GB of memory and the value of the 'memory' parameter used by buildindex was 2500) and it solves all the reported problems, so we are pretty happy with it. I hope the new version works on your computers/laptops, but please do let us know if it doesn't.
>
> Sorry about the problems you have encountered. It's always a challenge to develop an R package with so much C code in it!
>
> Cheers,
> Wei
>
>
>
> On Jun 6, 2012, at 9:08 PM, Dan Du wrote:
>
>> Dear Wei,
>>
>> Here is a standard biocLite update. I think it is at the last step, when
>> compiling Rsubread.so, that the memory usage exceeds 5.5g; then the system
>> freezes and I have to call it off. Same result when running 'R CMD INSTALL
>> Rsubread_1.6.3.tar.gz' from the shell, or when manually compiling all .c
>> files and running the last gcc statement. I guess there might just be a
>> minimum ram requirement somewhere higher than 6g... I will do some more
>> poking when I have time.
>>
>> 'gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o
>> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o
>> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o
>> gene-value-index.o hashtable.o index-builder.o input-files.o
>> processExons.o propmapped.o qualityScores.o readSummary.o
>> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread
>> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR'
>>
>> Also below are the sessionInfo and full gcc version, please let me
>> know if you need more information.
>>
>> Regards,
>> Dan
>> --------------------------------------------------------------------
>>> source('http://www.bioconductor.org/biocLite.R')
>>> biocLite('')
>> BioC_mirror: http://bioconductor.org
>> Using R version 2.15, BiocInstaller version 1.4.6.
>> Installing package(s) ''
>> Old packages: 'Rsubread'
>> Update all/some/none? [a/s/n]: a
>> trying URL
>> 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz'
>> Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb)
>> opened URL
>> ==================================================
>> downloaded 20.9 Mb
>>
>> WARNING: ignoring environment value of R_HOME
>> * installing *source* package ‘Rsubread’ ...
>> ** libs
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c R_wrapper.c -o R_wrapper.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c SNP_calling.c -o SNP_calling.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c aligner.c -o aligner.o
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c atgcContent.c -o atgcContent.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c detectionCall.c -o detectionCall.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c detectionCallAnnotation.c -o detectionCallAnnotation.o
>> detectionCallAnnotation.c: In function ‘calculateExonGCContent’:
>> detectionCallAnnotation.c:175: warning: ignoring return value of
>> ‘fgets’, declared with attribute warn_unused_result
>> detectionCallAnnotation.c: In function ‘calculateIRGCContent’:
>> detectionCallAnnotation.c:262: warning: ignoring return value of
>> ‘fgets’, declared with attribute warn_unused_result
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c exon-algorithms.c -o exon-algorithms.o
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c exon-align.c -o exon-align.o
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c fullscan.c -o fullscan.o
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c gene-algorithms.c -o gene-algorithms.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c gene-value-index.c -o gene-value-index.o
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’
declared
>> but never defined
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c hashtable.c -o hashtable.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c index-builder.c -o index-builder.o
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c input-files.c -o input-files.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c processExons.c -o processExons.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c propmapped.c -o propmapped.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c qualityScores.c -o qualityScores.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c readSummary.c -o readSummary.o
>> readSummary.c: In function ‘readSummary’:
>> readSummary.c:122: warning: format ‘%d’ expects type ‘int’, but argument
>> 5 has type ‘long int’
>> readSummary.c:122: warning: format ‘%d’ expects type ‘int’, but argument
>> 6 has type ‘long int’
>> readSummary.c:39: warning: ignoring return value of ‘getline’, declared
>> with attribute warn_unused_result
>> readSummary.c:52: warning: ignoring return value of ‘getline’, declared
>> with attribute warn_unused_result
>> readSummary.c:55: warning: ignoring return value of ‘getline’, declared
>> with attribute warn_unused_result
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c removeDuplicatedReads.c -o removeDuplicatedReads.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c sam2bed.c -o sam2bed.o
>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON -fpic
>> -O3 -pipe -g -c sorted-hashtable.c -o sorted-hashtable.o
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gene-algorithms.h:23: warning: inline function ‘add_gene_vote_weighted’
>> declared but never defined
>> gene-algorithms.h:22: warning: inline function ‘add_gene_vote’ declared
>> but never defined
>> gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o
>> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o
>> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o
>> gene-value-index.o hashtable.o index-builder.o input-files.o
>> processExons.o propmapped.o qualityScores.o readSummary.o
>> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread
>> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR
>> ^Cmake: *** Deleting file `Rsubread.so'
>> make: *** [Rsubread.so] Interrupt
>> ** R
>> ** inst
>> ** preparing package for lazy loading
>> ** help
>> *** installing help indices
>> ** building package indices
>> ** installing vignettes
>> ‘Rsubread.Rnw’
>> ** testing if installed package can be loaded
>> Error in library.dynam(lib, package, package.lib) :
>> shared object ‘Rsubread.so’
not found >> Error: loading failed >> Execution halted >> -------------------------------------------------------------------- >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C >> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 >> [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> -------------------------------------------------------------------- >> $ gcc -v >> Using built-in specs. >> Target: x86_64-linux-gnu >> Configured with: ../src/configure -v --with-pkgversion='Ubuntu >> 4.4.3-4ubuntu5.1' >> --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs >> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr >> --enable-shared --enable-multiarch --enable-linker-build-id >> --with-system-zlib --libexecdir=/usr/lib --without-included-gettext >> --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 >> --program-suffix=-4.4 --enable-nls --enable-clocale=gnu >> --enable-libstdcxx-debug --enable-plugin --enable-objc-gc >> --disable-werror --with-arch-32=i486 --with-tune=generic >> --enable-checking=release --build=x86_64-linux-gnu >> --host=x86_64-linux-gnu --target=x86_64-linux-gnu >> Thread model: posix >> gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) >> -------------------------------------------------------------------- >> >> On Wed, 2012-06-06 at 20:10 +1000, Wei Shi wrote: >>> Dear Dan, >>> >>> It didn't seem to be problem of requesting a continuous 1GB block in our >>> investigation. We tracked the memory usage of buildindex() function when >>> running it on yeast genome using a 32-bit VM, and found that the segfault >>> happened right after a request of a few KB of memory was sent to the >>> system when the memory parameter was set to 2500. 
However, the problem was >>> gone when the memory parameter was changed to 1000. >>> >>> Removing highly repetitive 16 mers required a continuous 1GB block of >>> memory, but this step was always executed successfully. This step also >>> included in the old version of Rsubread (1.1.1), and it did not have >>> problem there either. >>> >>> Could you please provide us your complete code for running your test and >>> also session info? This will help us to diagnose what the problem could be >>> because we couldn't reproduce what you saw from our end. >>> >>> For the compilation issue on your 64bit laptop, could you provide us more >>> details as well, including the message output from gcc? >>> >>> Thanks, >>> Wei >>> >>>> Dear Wei, >>>> >>>> Unfortunately reducing the memory parameter to 1000, still causes the >>>> segfault. I guess with 3g ram limit on a 32bit system, there is still a >>>> fat chance that you can not request a continuous 1g block. >>>> >>>> For that 64bit laptop, it is still strange about the 6g memory draining. >>>> It is happing during the installation when compiling the shared library >>>> Rsubread.so, not running the buildindex function. Btw, the gcc version >>>> is 4.4.3. >>>> >>>> Server and opensuse box runs gcc version 4.3.1 and 4.5.0 respectively. >>>> >>>> Regards, >>>> Dan >>>> >>>> On Wed, 2012-06-06 at 14:56 +1000, Wei Shi wrote: >>>>> Dear Dan, >>>>> >>>>> It is probably because including genome sequences into the index slowed >>>>> down your laptop. But I believe it should be alleviated if you give >>>>> smaller values to the 'memory ' parameter of the buildindex() function. >>>>> Also, the index building is an one-off operation, you do not need to >>>>> redo it even when new releases come. >>>>> >>>>> For your 32-bit opensuse box, I guess the problem will be solved if you >>>>> change the amount of memory requested to be 1000MB. 
>>>>> >>>>> Cheers, >>>>> Wei >>>>> >>>>> On Jun 5, 2012, at 11:43 PM, Dan Du wrote: >>>>> >>>>>> Hi Robert, >>>>>> >>>>>> I have been experiencing something else, possibly related to yours, >>>>>> on a 64bit ubuntu laptop with 6g of ram. >>>>>> >>>>>> As I recall, when bumping to Bioc 2.10, the Rsubread installation kind >>>>>> of ate all the memory, basically froze the system so I had to call it >>>>>> off, yet building it on the server side turned out fine. So I think I >>>>>> just accepted that the new version may be 'computationally heavy' thus >>>>>> not suitable for a normal pc, though I did not find any mentioning of >>>>>> this increased memory requirement in the NEWS file. >>>>>> >>>>>> So currently Rsubread stays at 1.4.4 on that pc, all subsequent >>>>> versions >>>>>> of Rsubread drain the memory in the same way when compiling >>>>> Rsubread.so. >>>>>> >>>>>> Now I think I can confirm this on a 32-bit opensuse box, it did >>>>>> successfully built, but when running the example code in the manual, >>>>>> same segfault happens. >>>>>> >>>>>> >>>>>>> library(Rsubread) >>>>>>> ref<- system.file("extdata","reference.fa",package="Rsubread") >>>>>>> path<- system.file("extdata",package="Rsubread") >>>>>>> buildindex(basename=file.path(path,"reference_index"),reference=ref) >>>>>> >>>>>> Building a base-space index. 
>>>>>> Size of memory used=3700 MB >>>>>> Base name of the built index >>>>>> = /home/opensuse/R-patched/library/Rsubread/extdata/reference_index >>>>>> >>>>>> *** caught segfault *** >>>>>> address 0xdf03ee80, cause 'memory not mapped' >>>>>> >>>>>> Traceback: >>>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >>>>>> as.character(cmd), PACKAGE = "Rsubread") >>>>>> 2: buildindex(basename = file.path(path, "reference_index"), reference >>>>>> = ref) >>>>>> >>>>>>> sessionInfo() >>>>>> R version 2.15.0 Patched (2012-06-04 r59517) >>>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods >>>>>> base >>>>>> >>>>>> other attached packages: >>>>>> [1] Rsubread_1.6.3 >>>>>> >>>>>> >>>>>> Regards, >>>>>> Dan >>>>>> >>>>>> On Tue, 2012-06-05 at 09:45 +0200, Robert Castelo wrote: >>>>>>> hi, >>>>>>> >>>>>>> the computer room at my university where we do practicals on R& >>>>> Bioconductor runs a 32bit linux distribution and when i tried to run >>>>> the latest version of the Rsubread package (1.6.3) it crashes when >>>>> calling the buildindex() function on a multifasta file with the yeast >>>>> genome. this does *not* happen under a 64bit linux distribution. >>>>>>> >>>>>>> i have verified that installing the version before (1.4.4) on the >>>>> current R 2.15 it also crashes (on the 32bit), but two versions >>>>> before, the 1.1.1, it does *not* and it works smoothly on this 32bit >>>>> linux distribution. >>>>>>> >>>>>>> i'm pasting below the output of using the 1.6.3 and 1.1.1 on R 2.15 >>>>> where allChr.fa is the multifasta file with the yeast genome. 
>>>>>>> >>>>>>> so i can manage by now the problem by using the 1.1.1 version on R >>>>> 2.15 for my teaching but i wonder whether there would be some easy >>>>> solution for this, or even if it could be a symptom of something else >>>>> that the Rsubread developers should worry about. i know that using a >>>>> 32bit system nowadays is quite obsolete but this is what i got for >>>>> teaching :( and i would be happy to let my students play with the >>>>> latest version of Rsubread in the future. >>>>>>> >>>>>>> >>>>>>> thanks!!! >>>>>>> robert. >>>>>>> >>>>>>> ======================Rsubread 1.6.3 on R 2.15======================= >>>>>>> >>>>>>>> library(Rsubread) >>>>>>>> sessionInfo() >>>>>>> R version 2.15.0 (2012-03-30) >>>>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >>>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >>>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] Rsubread_1.6.3 >>>>>>> >>>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", >>>>> memory=2500) >>>>>>> >>>>>>> Building a base-space index. 
>>>>>>> Size of memory used=2500 MB >>>>>>> Base name of the built index = subreadindex >>>>>>> >>>>>>> *** caught segfault *** >>>>>>> address 0xdf670cc0, cause 'memory not mapped' >>>>>>> >>>>>>> Traceback: >>>>>>> 1: .C("R_buildindex_wrapper", argc = as.integer(n), argv = >>>>> as.character(cmd), PACKAGE = "Rsubread") >>>>>>> 2: buildindex(basename = "subreadindex", reference = "allChr.fa", >>>>> memory = 2500) >>>>>>> >>>>>>> Possible actions: >>>>>>> 1: abort (with core dump, if enabled) >>>>>>> 2: normal R exit >>>>>>> 3: exit R without saving workspace >>>>>>> 4: exit R saving workspace >>>>>>> Selection: >>>>>>> >>>>>>> >>>>>>> ======================Rsubread 1.1.1 on R 2.15======================= >>>>>>> >>>>>>>> library(Rsubread) >>>>>>>> buildindex(basename="subreadindex", reference="allChr.fa", >>>>> memory=2500) >>>>>>> >>>>>>> Building the index in the base space. >>>>>>> Size of memory requested=2500 MB >>>>>>> Index base name = subreadindex >>>>>>> INDEX ITEMS PER PARTITION = 275940352 >>>>>>> >>>>>>> completed=40.88%; time used=1.7s; rate=2955.1k bps/s; total=12m bps >>>>> completed=81.76%; time used=2.4s; rate=4111.8k >>>>> bps/s; total=12m bps >>>>>>> All the chromosome files are processed. >>>>>>> | Dumping index >>>>> [===========================================================>] >>>>>>> Index subreadindex is successfully built. 
>>>>>>>> sessionInfo() >>>>>>> R version 2.15.0 (2012-03-30) >>>>>>> Platform: i686-pc-linux-gnu (32-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C >>>>>>> [3] LC_TIME=ca_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8 >>>>>>> [5] LC_MONETARY=ca_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8 >>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=ca_ES.UTF-8 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] Rsubread_1.1.1 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>>> ______________________________________________________________________ >>>>> The information in this email is confidential and intended solely for >>>>> the addressee. >>>>> You must not disclose, forward, print or use it without the permission >>>>> of the sender. >>>>> ______________________________________________________________________ >>>> >>>> >>>> >>>> >>> >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the addressee. >>> You must not disclose, forward, print or use it without the permission of the sender. 
>>> ______________________________________________________________________
>>
>>
>
>
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the addressee.
> You must not disclose, forward, print or use it without the permission of the sender.
> ______________________________________________________________________
>
>

--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550

From Alogmail2 at aol.com Fri Jun 8 09:39:21 2012
From: Alogmail2 at aol.com (Alogmail2 at aol.com)
Date: Fri, 8 Jun 2012 03:39:21 -0400 (EDT)
Subject: [BioC] arrayQualityMetrics() doesn't work for one-color non Affy arrays
Message-ID: <29016.4baaefe3.3d030629@aol.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From sdavis2 at mail.nih.gov Fri Jun 8 11:50:48 2012
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Fri, 8 Jun 2012 05:50:48 -0400
Subject: [BioC] base-specific read counts
In-Reply-To:
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID:

On Fri, Jun 8, 2012 at 2:06 AM, Yu Chuan Tai wrote:
> Hi Martin,
>
> One more question. Is there any way in Rsamtools to calculate SNVs/INDELS
> frequency directly using the output file from samtools?

By "output file from samtools", I assume you mean a VCF file. If so, take a look at the VariantAnnotation package and readVcf(). From there, you'll need to do the calculation yourself, but that would be a step on the way to accomplishing your task.
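The "calculation yourself" step could look roughly like the sketch below, if "frequency" means the fraction of reads supporting the variant. It is only an illustration and rests on assumptions that are not in the thread: that the VCF was written by samtools/bcftools and therefore carries a DP4 tag in its INFO column (high-quality ref-forward, ref-reverse, alt-forward and alt-reverse counts), and that the INFO strings have already been pulled out, whether from an object returned by readVcf() or from the raw text. The helper name alt_fraction_from_dp4 and the toy INFO strings are made up for the example.

```r
# Toy INFO strings in the style written by samtools/bcftools (assumed format).
info <- c("DP=30;DP4=10,12,3,5;MQ=40",
          "INDEL;DP=20;DP4=4,6,5,5;MQ=38",
          "DP=15;MQ=35")                      # no DP4 tag, so the result is NA

# Hypothetical helper: fraction of high-quality reads supporting the
# alternate allele, computed from the DP4 tag of each INFO string.
alt_fraction_from_dp4 <- function(info) {
  vapply(info, function(x) {
    m <- regmatches(x, regexpr("DP4=[0-9]+,[0-9]+,[0-9]+,[0-9]+", x))
    if (length(m) == 0L) return(NA_real_)     # tag absent
    counts <- as.numeric(strsplit(sub("DP4=", "", m), ",")[[1]])
    sum(counts[3:4]) / sum(counts)            # (alt fwd + alt rev) / all four
  }, numeric(1), USE.NAMES = FALSE)
}

freq <- alt_fraction_from_dp4(info)
```

For the toy records this gives 8/30 for the first site and 0.5 for the second; a record without DP4 is returned as NA rather than guessed.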
Sean From shi at wehi.EDU.AU Fri Jun 8 13:55:01 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Fri, 8 Jun 2012 21:55:01 +1000 (EST) Subject: [BioC] Rsubread crashes in 32bit linux In-Reply-To: <4FD1ABC8.5040506@upf.edu> References: <20360057.1338882311461.JavaMail.oracle@rif2.s.upf.edu> <1338903817.2062.284.camel@yangdu-desktop> <25927CA4-11CE-46F1-8492-739AE8B10C05@wehi.edu.au> <1338965421.5663.36.camel@yangdu-desktop> <2333c7eb3c4b13297612a68e8334d48d.squirrel@homebase.wehi.edu.au> <1338980934.5663.85.camel@yangdu-desktop> <33FCD1BB-DCA5-402E-A973-7FB6F73F208F@wehi.edu.au> <4FD1ABC8.5040506@upf.edu> Message-ID: <3ad253450b1dbbbc239f0e0630c58d5d.squirrel@mailgw.wehi.edu.au> Dear Robert, That's great. Thanks for letting us know. Cheers, Wei > Dear Wei, > > it also works on my side, the 32 bit machines do not crash anymore. > > thanks for solving this so quickly! > robert. > > On 06/08/2012 04:10 AM, Wei Shi wrote: >> Dear Robert, Dan and Peter, >> >> We have made changes to a number of functions in the package to reduce >> the memory allocated to Rsubread by the operating system when it was >> loaded. The new version has been committed to both bioc release >> (Rsubread 1.6.4) and bioc devel (Rsubread 1.7.4). They should be >> available to you in a day or two. >> >> Also, the buildindex() function no longer needs the allocation of 1GB >> continuous memory region. But it will still consume at least 1GB of >> memory when it is running, no matter what the given value of the >> 'memory' parameter is. >> >> We have tested the new version on our 32-bit VM machine (it has 3GB of >> memory and the value of 'memory' parameter used by buildindex was 2500) >> and it solves all the reported problems, so we are pretty happy with it. >> I hope the new version works in your computers/laptops, but please do >> let us know if it doesn't. >> >> Sorry about the problems you have encountered. It's always a challenge >> to develop a R package with so much C code in it! 
>> >> Cheers, >> Wei >> >> >> >> On Jun 6, 2012, at 9:08 PM, Dan Du wrote: >> >>> Dear Wei, >>> >>> Here is a standard bioclite update, I think it is at the last step when >>> compiling Rsubread.so, the memory usage exceeds 5.5g, then system >>> freeze >>> and I have to call it off. Same result when runing 'R CMD INSTALL >>> Rsubread_1.6.3.tar.gz' from shell, or manually compile all .c file and >>> run the last gcc statement. I guess there might just be a minimum ram >>> requirement somewhere higher than 6g... I will do some more poking when >>> I have time. >>> >>> 'gcc -std=gnu99 -shared -o Rsubread.so R_wrapper.o SNP_calling.o >>> aligner.o atgcContent.o detectionCall.o detectionCallAnnotation.o >>> exon-algorithms.o exon-align.o fullscan.o gene-algorithms.o >>> gene-value-index.o hashtable.o index-builder.o input-files.o >>> processExons.o propmapped.o qualityScores.o readSummary.o >>> removeDuplicatedReads.o sam2bed.o sorted-hashtable.o -lpthread >>> -DMAKE_FOR_EXON -L/usr/lib64/R/lib -lR' >>> >>> Also down there are the sessionInfo and full gcc version, please let me >>> know if you need more information. >>> >>> Regards, >>> Dan >>> -------------------------------------------------------------------- >>>> source('http://www.bioconductor.org/biocLite.R') >>>> biocLite('') >>> BioC_mirror: http://bioconductor.org >>> Using R version 2.15, BiocInstaller version 1.4.6. >>> Installing package(s) '' >>> Old packages: 'Rsubread' >>> Update all/some/none? [a/s/n]: a >>> trying URL >>> 'http://www.bioconductor.org/packages/2.10/bioc/src/contrib/Rsubread_1.6.3.tar.gz' >>> Content type 'application/x-gzip' length 21891723 bytes (20.9 Mb) >>> opened URL >>> ================================================== >>> downloaded 20.9 Mb >>> >>> WARNING: ignoring environment value of R_HOME >>> * installing *source* package ?Rsubread? ... 
>>> ** libs >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON >>> -fpic >>> -O3 -pipe -g -c R_wrapper.c -o R_wrapper.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON >>> -fpic >>> -O3 -pipe -g -c SNP_calling.c -o SNP_calling.o >>> gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DMAKE_FOR_EXON >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From mtmorgan at fhcrc.org Fri Jun 8 15:03:04 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Fri, 08 Jun 2012 06:03:04 -0700 Subject: [BioC] Amplicon and exon level read counts and GC content In-Reply-To: References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org> <4FD0A7DA.1040009@fhcrc.org> <4FD0CE98.20002@fhcrc.org> Message-ID: <4FD1F808.3080305@fhcrc.org> On 06/07/2012 11:34 PM, Yu Chuan Tai wrote: > Hi Martin, > > Is it correct that the below code calculate the number of reads overlapping > with an amplicon, and overlapping means at least 1 base overlap, and it > doesn't have to be fully within the amplicon? In the case that a read > overlaps > with 2 amplicons, will it be counted twice? When I used this approach to > calculate > amplicon-level read counts, I found the number of read counts > overlapping with > all the amplicons is larger than the total number of read counts in the > BAM file, > and wonder if that's b/c a read could be counted more than once? yes > I found that the code below gives more read counts than using > summarizeOverlaps(). > I think it's b/c the latter counts a read at most once. see ?summarizeOverlaps for how counting occurs with different modes. > > If I want to calculate the coverage of SNVs/INDELs outputted from samtools, > is it correct that using summarizeOverlaps() will under-estimate the > coverage, since > a read may overlap with several SNVs/INDELs? 
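The double counting discussed above can be reproduced with a few toy intervals in base R. The sketch below is not the Rsamtools or GenomicRanges implementation; the read and amplicon coordinates and the helper overlaps_any_base are invented for illustration, and "overlap" is taken to mean at least one shared base.

```r
# Toy data (made up): closed intervals [start, end] on one chromosome.
reads     <- data.frame(start = c(100, 180, 260), end = c(149, 229, 309))
amplicons <- data.frame(start = c(90, 200),       end = c(210, 320))

# Logical matrix: TRUE where read i shares at least one base with amplicon j.
overlaps_any_base <- function(reads, amplicons) {
  outer(seq_len(nrow(reads)), seq_len(nrow(amplicons)),
        function(i, j) reads$start[i] <= amplicons$end[j] &
                       reads$end[i]   >= amplicons$start[j])
}

hits <- overlaps_any_base(reads, amplicons)

per_amplicon  <- colSums(hits)           # a read is counted once per amplicon it hits
reads_counted <- sum(rowSums(hits) > 0)  # each read is counted at most once
```

Here sum(per_amplicon) is 4 although there are only 3 reads, because the second read spans both amplicons. A scheme that counts each read at most once, as the thread suggests summarizeOverlaps() does, cannot exceed the number of reads.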
It is 'easy' using findOverlaps() or countOverlaps() to create counting schemes that are different from those implemented in samtools / scanBam, summarizeOverlaps, etc. Martin > > Thanks! > Yu Chuan > > > param = ScanBamParam(what="seq", which=GRanges("seq1",IRanges(100, > 500))) > > dna = scanBam(fl, param=param)[[1]][["seq"]] > > length(dna) # 365 reads overlap region > > alphabetFrequency(dna, collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC > > > > On Thu, 7 Jun 2012, Martin Morgan wrote: > >> On 06/07/2012 07:54 AM, Yu Chuan Tai wrote: >>> Hi Martin, >>> >>> Thanks! I will look into the links below. By 'better support for >>> paired-end reads in the 'devel' version', which package are you >>> referring to? >> >> Mostly GenomicRanges, e.g., readGappedAlignmentPairs, building on >> additional facilities in Rsamtools. Herve is responsible for this. >> >> Martin >> >>> >>> Best, >>> Yu Chuan >>> >>> On Thu, 7 Jun 2012, Martin Morgan wrote: >>> >>>> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote: >>>>> Hi Martin, >>>>> >>>>> More questions on your approaches below. If my BAM files are >>>>> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(), >>>>> should I set isPaired=TRUE? Do I need to worry about other input >>>>> arguments for scanBamFlag() or ScanBamParam(), if I want to >>>>> calculate coverage properly? >>>> >>>> It really depends on what you're interested in doing; see for instance >>>> the post by Herve the other day >>>> >>>> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html >>>> >>>>> >>>>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads. >>>>> How to get around this, or it won't affect coverage calculation? >>>> >>>> There is better support for paired-end reads in the 'devel' version of >>>> Biocondcutor; see >>>> >>>> http://bioconductor.org/developers/useDevel/ >>>> >>>> whether and what aspects of paired-endedness are important depends on >>>> how you are using your coverage. 
>>>> >>>>> >>>>> Finally, is there any way to calculate base-specific coverage at any >>>>> genomic locus or interval in Rsamtools? Thanks! >>>> >>>> I tried to answer this in your other post. >>>> >>>> Martin >>>> >>>>> >>>>> Best, Yu Chuan >>>>> >>>>>> More specifically, after >>>>>> >>>>>> library(Rsamtools) example(scanBam) # defines 'fl', a path to a >>>>>> bam file >>>>>> >>>>>> for a _single_ genomic range >>>>>> >>>>>> param = ScanBamParam(what="seq", which=GRanges("seq1", >>>>>> IRanges(100, 500))) dna = scanBam(fl, param=param)[[1]][["seq"]] >>>>>> length(dna) # 365 reads overlap region alphabetFrequency(dna, >>>>>> collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC >>>>>> >>>>>> though you'd likely want to specify several regions (vector >>>>>> arguments to GRanges) and think about flags (scanBamFlag() and the >>>>>> flag argument to ScanBamParam), read mapping quality, reads >>>>>> overlapping more than one region, etc. (summarizeOverlaps >>>>>> implements several counting strategies, but it is 'easy' to >>>>>> implement arbitrary approaches). >>>>>> >>>>>>> >>>>>>> Martin >>>>>>> >>>>>>>> >>>>>>>> Thanks for any input! >>>>>>>> >>>>>>>> Best, Yu Chuan >>>>>>>> >>>>>>>> _______________________________________________ Bioconductor >>>>>>>> mailing list Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >>>>>>>> archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>>>> >>>>>>>> >>>> -- >>>>>> Computational Biology Fred Hutchinson Cancer Research Center 1100 >>>>>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>>>> >>>>>> Location: M1-B861 Telephone: 206 667-2793 >>>>>> >>>> >>>> >>>> -- >>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N. 
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

From guest at bioconductor.org Fri Jun 8 16:44:40 2012
From: guest at bioconductor.org (Femke [guest])
Date: Fri, 8 Jun 2012 07:44:40 -0700 (PDT)
Subject: [BioC] batch effects 450K
Message-ID: <20120608144440.1E803138FB5@mamba.fhcrc.org>

Dear All,

I have Infinium 450K data for 56 breast cancer tumors. As a first analysis I wanted to do a clustering and see the distribution of the samples. For this I used the minfi package. Unfortunately, the assays were done in 2 batches and there is a clear batch effect. I looked into Combat and SVA to remove the batch effect. As far as I understand, to use these approaches I need to have a phenotype/variable of interest. In the tutorial ("The SVA package for removing batch effects and other unwanted variation in high-throughput experiments - Modified: October 24, 2011 Compiled: April 25, 2012") the variable of interest is cancer status. However, I do not have normals. Does anyone have suggestions on how I should tackle these batch effects?

Many thanks in advance and all the best!
Femke -- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] C attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] bladderbatch_1.0.3 [2] sva_3.2.1 [3] mgcv_1.7-17 [4] corpcor_1.6.3 [5] IlluminaHumanMethylation450kmanifest_0.2.1 [6] gplots_2.10.1 [7] KernSmooth_2.23-7 [8] caTools_1.13 [9] bitops_1.0-4.1 [10] gdata_2.8.2 [11] gtools_2.6.2 [12] minfi_1.2.0 [13] GenomicRanges_1.8.6 [14] IRanges_1.14.3 [15] reshape_0.8.4 [16] plyr_1.7.1 [17] lattice_0.20-6 [18] Biobase_2.16.0 [19] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.18.1 BiocInstaller_1.4.4 Biostrings_2.24.1 [4] DBI_0.2-5 MASS_7.3-18 Matrix_1.0-6 [7] R.methodsS3_1.2.2 RColorBrewer_1.0-5 RSQLite_0.11.1 [10] affyio_1.24.0 annotate_1.34.0 beanplot_1.1 [13] bit_1.1-8 codetools_0.2-8 crlmm_1.14.0 [16] ellipse_0.3-7 ff_2.2-7 foreach_1.4.0 [19] genefilter_1.38.0 iterators_1.0.6 limma_3.12.0 [22] matrixStats_0.5.0 mclust_3.4.11 multtest_2.12.0 [25] mvtnorm_0.9-9992 nlme_3.1-104 nor1mix_1.1-3 [28] oligoClasses_1.18.0 preprocessCore_1.18.0 siggenes_1.30.0 [31] splines_2.15.0 stats4_2.15.0 survival_2.36-14 [34] xtable_1.7-0 zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org. From a.teschendorff at ucl.ac.uk Fri Jun 8 17:03:38 2012 From: a.teschendorff at ucl.ac.uk (Teschendorff, Andrew) Date: Fri, 8 Jun 2012 15:03:38 +0000 Subject: [BioC] batch effects 450K In-Reply-To: <20120608144440.1E803138FB5@mamba.fhcrc.org> References: <20120608144440.1E803138FB5@mamba.fhcrc.org> Message-ID: <0324FA851FD85F4DA890FAC7DDB83EC60F8C14BB@DB3PRD0104MB129.eurprd01.prod.exchangelabs.com> Hi Femke, For COMBAT you do not need to specify a phenotype of interest. Read the original paper presenting COMBAT. 
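One way to act on this, sketched under stated assumptions: when there is no phenotype to protect, the model matrix handed to ComBat can contain only an intercept. The simulated matrix, the even 28/28 batch split and the size of the batch shift below are all invented, and the call to ComBat() from the sva package is only attempted if sva is installed; its arguments should be checked against the help page of the installed version.

```r
set.seed(1)

# Simulated stand-in for a probes x samples matrix (e.g. M-values); made up.
n_probes  <- 500
n_samples <- 56
batch <- factor(rep(c("batch1", "batch2"), each = n_samples / 2))  # assumed split
vals  <- matrix(rnorm(n_probes * n_samples), nrow = n_probes,
                dimnames = list(paste0("probe", seq_len(n_probes)),
                                paste0("tumor", seq_len(n_samples))))
vals[, batch == "batch2"] <- vals[, batch == "batch2"] + 1  # artificial batch shift

# Intercept-only design: no variable of interest is supplied.
pheno <- data.frame(batch = batch, row.names = colnames(vals))
mod0  <- model.matrix(~ 1, data = pheno)

# Run ComBat only when sva is available; otherwise leave the result as NULL.
adjusted <- NULL
if (requireNamespace("sva", quietly = TRUE)) {
  adjusted <- sva::ComBat(dat = vals, batch = batch, mod = mod0)
}
```

The adjusted matrix, when produced, has the same dimensions as the input and can be passed to the clustering step in place of the unadjusted values.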
rgds A *********************************************************************************************************************************************** Andrew E Teschendorff PhD Heller Research Fellow Statistical Cancer Genomics Paul O'Gorman Building UCL Cancer Institute University College London 72 Huntley Street London WC1E 6BT, UK. Tel: +44 (0)20 7679 0727 Mob: +44 (0)7876 561263 Email: a.teschendorff at ucl.ac.uk http://www.ucl.ac.uk/cancer/rescancerbiol/statisticalgenomics ******************************************************************************************************************************************** ________________________________________ From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Femke [guest] [guest at bioconductor.org] Sent: 08 June 2012 15:44 To: bioconductor at r-project.org; f.simmer at ncmls.ru.nl Subject: [BioC] batch effects 450K Dear All, I have Infinium 450K data for 56 breast cancer tumors. As a first analysis I wanted to do a clustering and see the distribution of the samples. For this I used the minfi package. Unfortunately, the assays were done in 2 batches and there is a clear batch effect. I looked into Combat and SVA to remove the batch effect. As far as I understand, to use these approaches I need to have a phenotype/variable of interest. In the tutorial ("The SVA package for removing batch effects and other unwanted variation in high-throughput experiments ??? Modified: October 24, 2011 Compiled: April 25, 2012") the variable of interest is cancer status. However, I do not have normals. Does anyone have suggestions on how I should tackle these batch effects? Many thanks in advance and all the best! 
Femke -- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] C attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] bladderbatch_1.0.3 [2] sva_3.2.1 [3] mgcv_1.7-17 [4] corpcor_1.6.3 [5] IlluminaHumanMethylation450kmanifest_0.2.1 [6] gplots_2.10.1 [7] KernSmooth_2.23-7 [8] caTools_1.13 [9] bitops_1.0-4.1 [10] gdata_2.8.2 [11] gtools_2.6.2 [12] minfi_1.2.0 [13] GenomicRanges_1.8.6 [14] IRanges_1.14.3 [15] reshape_0.8.4 [16] plyr_1.7.1 [17] lattice_0.20-6 [18] Biobase_2.16.0 [19] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] AnnotationDbi_1.18.1 BiocInstaller_1.4.4 Biostrings_2.24.1 [4] DBI_0.2-5 MASS_7.3-18 Matrix_1.0-6 [7] R.methodsS3_1.2.2 RColorBrewer_1.0-5 RSQLite_0.11.1 [10] affyio_1.24.0 annotate_1.34.0 beanplot_1.1 [13] bit_1.1-8 codetools_0.2-8 crlmm_1.14.0 [16] ellipse_0.3-7 ff_2.2-7 foreach_1.4.0 [19] genefilter_1.38.0 iterators_1.0.6 limma_3.12.0 [22] matrixStats_0.5.0 mclust_3.4.11 multtest_2.12.0 [25] mvtnorm_0.9-9992 nlme_3.1-104 nor1mix_1.1-3 [28] oligoClasses_1.18.0 preprocessCore_1.18.0 siggenes_1.30.0 [31] splines_2.15.0 stats4_2.15.0 survival_2.36-14 [34] xtable_1.7-0 zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From guest at bioconductor.org Fri Jun 8 19:07:20 2012 From: guest at bioconductor.org (Emily P [guest]) Date: Fri, 8 Jun 2012 10:07:20 -0700 (PDT) Subject: [BioC] ChIPpeakAnno Message-ID: <20120608170720.0C63013ADAA@mamba.fhcrc.org> When running this program I have the error message below. I was thinking there was something I should update but could not find any updates. 
-- output of sessionInfo():

> annotatedPeak=annotatePeakInBatch(KLLNChIP, AnnotationData = TSS.human.GRCh37, output="both", select=NULL, maxgap = 0)
Warning message:
'matchMatrix' is deprecated.
Use 'as.matrix' instead.

--
Sent via the guest posting facility at bioconductor.org.

From bpederse at gmail.com Fri Jun 8 19:58:19 2012
From: bpederse at gmail.com (Brent Pedersen)
Date: Fri, 8 Jun 2012 11:58:19 -0600
Subject: [BioC] batch effects 450K
In-Reply-To: <20120608144440.1E803138FB5@mamba.fhcrc.org>
References: <20120608144440.1E803138FB5@mamba.fhcrc.org>
Message-ID:

On Fri, Jun 8, 2012 at 8:44 AM, Femke [guest] wrote:
>
> Dear All,
>
> I have Infinium 450K data for 56 breast cancer tumors. As a first analysis I wanted to do a clustering and see the distribution of the samples. For this I used the minfi package. Unfortunately, the assays were done in 2 batches and there is a clear batch effect. I looked into Combat and SVA to remove the batch effect. As far as I understand, to use these approaches I need to have a phenotype/variable of interest. In the tutorial ("The SVA package for removing batch effects and other unwanted variation in high-throughput experiments - Modified: October 24, 2011 Compiled: April 25, 2012") the variable of interest is cancer status. However, I do not have normals. Does anyone have suggestions on how I should tackle these batch effects?
>
> Many thanks in advance and all the best!
>
> Femke
>
>
>  -- output of sessionInfo():
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>  [1] bladderbatch_1.0.3
>  [2] sva_3.2.1
>  [3] mgcv_1.7-17
>  [4] corpcor_1.6.3
>  [5] IlluminaHumanMethylation450kmanifest_0.2.1
>  [6] gplots_2.10.1
>  [7] KernSmooth_2.23-7
>  [8] caTools_1.13
>  [9] bitops_1.0-4.1
> [10] gdata_2.8.2
> [11] gtools_2.6.2
> [12] minfi_1.2.0
> [13] GenomicRanges_1.8.6
> [14] IRanges_1.14.3
> [15] reshape_0.8.4
> [16] plyr_1.7.1
> [17] lattice_0.20-6
> [18] Biobase_2.16.0
> [19] BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
>  [1] AnnotationDbi_1.18.1  BiocInstaller_1.4.4   Biostrings_2.24.1
>  [4] DBI_0.2-5             MASS_7.3-18           Matrix_1.0-6
>  [7] R.methodsS3_1.2.2     RColorBrewer_1.0-5    RSQLite_0.11.1
> [10] affyio_1.24.0         annotate_1.34.0       beanplot_1.1
> [13] bit_1.1-8             codetools_0.2-8       crlmm_1.14.0
> [16] ellipse_0.3-7         ff_2.2-7              foreach_1.4.0
> [19] genefilter_1.38.0     iterators_1.0.6       limma_3.12.0
> [22] matrixStats_0.5.0     mclust_3.4.11         multtest_2.12.0
> [25] mvtnorm_0.9-9992      nlme_3.1-104          nor1mix_1.1-3
> [28] oligoClasses_1.18.0   preprocessCore_1.18.0 siggenes_1.30.0
> [31] splines_2.15.0        stats4_2.15.0         survival_2.36-14
> [34] xtable_1.7-0          zlibbioc_1.2.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Since the batch is known, why not just include it in your model and run with limma or lm()? But what's your study-design if you don't have controls?

From dtenenba at fhcrc.org Fri Jun 8 20:56:28 2012
From: dtenenba at fhcrc.org (Dan Tenenbaum)
Date: Fri, 8 Jun 2012 11:56:28 -0700
Subject: [BioC] DEGraph graph format?
In-Reply-To: <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org> References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org> Message-ID: I'm CC'ing the maintainer of DEGraph... Dan On Thu, Jun 7, 2012 at 5:43 PM, Hamid Bolouri wrote: > hello; > > Can anyone tell me how to use DEGraph with the pathways in NCIGraphData? > > The DEGraph Demo: > >>data("Loi2008_DEGraphVignette", package="DEGraph") >>classData <- classLoi2008 >>exprData <- exprLoi2008 >>annData <- annLoi2008 >>grList <- grListKEGG >>res <- testOneGraph(grList[[1]],exprData,classData,verbose=T,prop=0.2) > > works fine for me. But replacing grList with NCI.cyList from NCIGraph: > >>library(NCIgraphData) >>data("NCI-cyList") >> NCI.cyList[[1]] > A graphNEL graph with directed edges > Number of Nodes = 35 > Number of Edges = 40 > > I get this error: > >>res <- testOneGraph(NCI.cyList[[1]],exprData,classData,verbose=T,prop=0.2) > Keeping genes in the graph *and* the expression data set... > ?35 genes of the graph were not found in the expression data set: > ?chr [1:35] "6749854621221256793-pid_m_25632-674985462-829166685-pid_m_100726" ... > ?227 genes of the expression data set are absent from the graph: > ?chr [1:227] "31" "32" "207" "208" "355" "356" "369" "572" ... > Error: all.equal(dataGN, graphGN) is not TRUE > Keeping genes in the graph *and* the expression data set...done > > I get the same error with 'reactome.cyList' graphs and with graphs generated by 'parseNCInetwork'. > > Thanks > > Hamid Bolouri > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 ?LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? 
base > > other attached packages: > [1] NCIgraphData_0.99.4 DEGraph_1.8.0 ? ? ? R.utils_1.12.1 > [4] R.oo_1.9.3 ? ? ? ? ?R.methodsS3_1.2.2 > > loaded via a namespace (and not attached): > ?[1] BiocGenerics_0.2.0 graph_1.34.0 ? ? ? grid_2.15.0 ? ? ? ?KEGGgraph_1.12.0 > ?[5] lattice_0.20-6 ? ? mvtnorm_0.9-9992 ? NCIgraph_1.4.0 ? ? RBGL_1.32.0 > ?[9] RCurl_1.91-1.1 ? ? RCytoscape_1.6.3 ? Rgraphviz_1.34.1 ? rrcov_1.3-01 > [13] stats4_2.15.0 ? ? ?tools_2.15.0 ? ? ? XML_3.9-4.1 ? ? ? ?XMLRPC_0.2-4 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From dtenenba at fhcrc.org Fri Jun 8 20:57:49 2012 From: dtenenba at fhcrc.org (Dan Tenenbaum) Date: Fri, 8 Jun 2012 11:57:49 -0700 Subject: [BioC] GOSemSim comparison between species In-Reply-To: References: Message-ID: CC'ing the maintainer of GOSemSim. Dan On Tue, Jun 5, 2012 at 3:02 AM, Katharine Coyte wrote: > Hi, > > I'd like to use GOSemSim to calculate the functional similarity between yeast and human proteins. It looks like it can compare human with human and yeast with yeast, but not proteins between species. > > Having looked at the documentation I assume the important change is to create a new information contents file that combines the yeast and human ones. However, I can't work out how to do this. Is this the correct approach to take, and does anyone have any advice on how to do so? > > Thanks and best wishes, > > Katharine Coyte > > 1st Year DPhil Candidate > Systems Biology DTC > University of Oxford > > ? ? ? 
?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From Julie.Zhu at umassmed.edu Fri Jun 8 21:14:42 2012 From: Julie.Zhu at umassmed.edu (Zhu, Lihua (Julie)) Date: Fri, 8 Jun 2012 19:14:42 +0000 Subject: [BioC] ChIPpeakAnno In-Reply-To: <20120608170720.0C63013ADAA@mamba.fhcrc.org> Message-ID: Emily, You can ignore the warning message and proceed. If you wish, you could try BioC Devel Version at http://www.bioconductor.org/developers/useDevel/. Best regards, Julie On 6/8/12 1:07 PM, "Emily P [guest]" wrote: > > When running this program I have the error message below. > I was thinking there was something I should update but could not find any > updates. > > -- output of sessionInfo(): > >> annotatedPeak=annotatePeakInBatch(KLLNChIP, AnnotationData = >> TSS.human.GRCh37, output="both", select=NULL, maxgap = 0) > Warning message: > 'matchMatrix' is deprecated. > Use 'as.matrix' instead. > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From laurent.jacob at gmail.com Fri Jun 8 22:30:35 2012 From: laurent.jacob at gmail.com (laurent jacob) Date: Fri, 8 Jun 2012 13:30:35 -0700 Subject: [BioC] DEGraph graph format? 
In-Reply-To: <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org>
References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org>
Message-ID:

Hi Hamid,

2012/6/7 Hamid Bolouri :
>>library(NCIgraphData)
>>data("NCI-cyList")
>> NCI.cyList[[1]]
> A graphNEL graph with directed edges
> Number of Nodes = 35
> Number of Edges = 40

The graphs NCI-cyList cannot directly be used with DEGraph: they are raw representations of the NCI biopax files. In particular, the nodes of the graph do not correspond to genes:

> library(graph)
> nodes(NCI.cyList[[1]])
[1] "6749854621221256793-pid_m_25632-674985462-829166685-pid_m_100726"
[2] "6749854621221256792-pid_m_25631-674985462-829166685-pid_m_100726"
[3] "674985462-829169511-pid_m_100441-674985462-829143405-pid_m_101095"
[4] "674985462-829168394-pid_m_100592-674985462-829143405-pid_m_101095"
[5] "674985462-829169605-pid_m_100410-674985462-829143405-pid_m_101095"
[...]

which is why testOneGraph doesn't manage to associate the graph nodes with exprData.

The package NCIgraph converts these raw graphs to gene graphs that can be used with the DEGraph package:

library('NCIgraph')
grList <- getNCIPathways(cyList=NCI.cyList, verbose=verbose)$pList

Now on my computer testOneGraph fails on NCIgraph objects which have zero or one gene in exprData. I couldn't figure out why this is the case now and wasn't the case when I wrote the package. It will be fixed in the next release. In the meantime, if you want to test all the pathways in grList, you can check whether length(intersect(translateNCI2GeneID(gr), rownames(exprData))) > 1; if yes, call testOneGraph, if not, return NULL.

For example in the Loi2008 demo, replace

if(min(length(nodes(gr)),length(gr@edgeData@data))>0)

by

if(min(length(nodes(gr)),length(gr@edgeData@data))>0 && length(intersect(translateNCI2GeneID(gr), rownames(exprData))) > 1)

Note that only 11 networks out of the 460 in grList will have strictly more than one gene in common with the exprData of Loi2008.

Best,

Laurent

--
Laurent Jacob
Department of Statistics
UC Berkeley
http://cbio.ensmp.fr/~ljacob

From tim.triche at gmail.com Fri Jun 8 23:16:59 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Fri, 8 Jun 2012 14:16:59 -0700
Subject: [BioC] batch effects 450K
In-Reply-To:
References: <20120608144440.1E803138FB5@mamba.fhcrc.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From hbolouri at fhcrc.org Fri Jun 8 23:32:18 2012
From: hbolouri at fhcrc.org (Hamid Bolouri)
Date: Fri, 08 Jun 2012 14:32:18 -0700 (PDT)
Subject: [BioC] DEGraph graph format?
In-Reply-To:
Message-ID:

It works! Thanks very much indeed Laurent.

Best wishes;
Hamid

----- Original Message -----
From: "laurent jacob"
To: "Hamid Bolouri"
Cc: bioconductor at r-project.org
Sent: Friday, June 8, 2012 1:30:35 PM
Subject: Re: [BioC] DEGraph graph format?

Hi Hamid,

2012/6/7 Hamid Bolouri :
>>library(NCIgraphData)
>>data("NCI-cyList")
>> NCI.cyList[[1]]
> A graphNEL graph with directed edges
> Number of Nodes = 35
> Number of Edges = 40

The graphs NCI-cyList cannot directly be used with DEGraph: they are raw representations of the NCI biopax files.
In particular, the nodes of the graph do not correspond to genes: > library(graph) > nodes(NCI.cyList[[1]]) [1] "6749854621221256793-pid_m_25632-674985462-829166685-pid_m_100726" [2] "6749854621221256792-pid_m_25631-674985462-829166685-pid_m_100726" [3] "674985462-829169511-pid_m_100441-674985462-829143405-pid_m_101095" [4] "674985462-829168394-pid_m_100592-674985462-829143405-pid_m_101095" [5] "674985462-829169605-pid_m_100410-674985462-829143405-pid_m_101095" [...] which is why testOneGraph doesn't manage to associate the graph nodes with exprData. The package NCIgraph converts these raw graphs to gene graphs that can be used with the DEGraph package: library('NCIgraph') grList <- getNCIPathways(cyList=NCI.cyList, verbose=verbose)$pList Now on my computer testOneGraph fails on NCIgraph objects which have zero or one gene in exprData. I couldn't figure why this is the case now and wasn't the case when I wrote the package. It will be fixed in the next release, in the meantime if you want to test all the pathways in grList you can check whether length(intersect(translateNCI2GeneID(gr), rownames(exprData))) > 1, if yes call testOneGraph, if not return NULL. For example in the Loi2008 demo, replace if(min(length(nodes(gr)),length(gr at edgeData@data))>0) by if(min(length(nodes(gr)),length(gr at edgeData@data))>0 && length(intersect(translateNCI2GeneID(gr), rownames(exprData))) > 1) Note that only 11 networks out of the 460 in grList will have strictly more than one gene in common with the exprData of Loi2008. 
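The check described above (call testOneGraph only when a pathway shares more than one gene with exprData) can be sketched in base R. This is a minimal illustration rather than code from DEGraph or NCIgraph: `testable_pathways` and the toy data are hypothetical, and `gene_ids_of` merely stands in for translateNCI2GeneID(), so the sketch runs without either package installed.

```r
# Hypothetical helper (not part of DEGraph or NCIgraph): keep only the
# pathways that share more than one gene with the expression matrix.
# 'gene_ids_of' stands in for translateNCI2GeneID(); any function that maps
# a graph to a character vector of gene IDs can be supplied.
testable_pathways <- function(graphs, expr_genes, gene_ids_of) {
  n_shared <- vapply(
    graphs,
    function(gr) length(intersect(gene_ids_of(gr), expr_genes)),
    integer(1)
  )
  graphs[n_shared > 1]
}

# Toy stand-ins: each "graph" is simply a character vector of gene IDs.
toy_graphs <- list(
  p1 = c("1", "2", "3"),  # shares two genes with the expression data
  p2 = c("4"),            # shares one gene
  p3 = character(0)       # shares none
)
toy_expr_genes <- c("1", "2", "4", "9")

kept <- testable_pathways(toy_graphs, toy_expr_genes, gene_ids_of = identity)
print(names(kept))
```

With the real objects, one would presumably pass grList, rownames(exprData) and translateNCI2GeneID, and then run testOneGraph only on the pathways that are kept.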
Best,

Laurent

--
Laurent Jacob
Department of Statistics
UC Berkeley
http://cbio.ensmp.fr/~ljacob

--
http://labs.fhcrc.org/bolouri

From dupan.mail at gmail.com Fri Jun 8 23:37:06 2012
From: dupan.mail at gmail.com (Pan Du)
Date: Fri, 8 Jun 2012 14:37:06 -0700
Subject: [BioC] output from "getChipInfo" prevents Illumina ProbeID conversion
In-Reply-To:
References:
Message-ID:

Hi Chris

The "summary(lumi_vst)" call returns the information shown in the header of your input data file, while getChipInfo returns the best match to the tables in the HumanIDMapping package. To save storage space, the HumanIDMapping package does not include all sub-version tables. If two sub-versions have the same probes, it keeps only the latest version. So you can use "HumanHT12_V3_0_R3_11283641_A" to map your data.

If you provide your own annotationFile, the function assumes your data is annotated with Illumina ProbeID or TargetID. Since your data uses "Array_Address_Id", it reports an error.

Pan

On Fri, Jun 8, 2012 at 2:11 PM, Chris Gaiteri wrote:
> Hello Dr. Du,
>
> There is a difference between the output of "getChipInfo" and "summary"
> functions applied to the same lumi object, which is preventing us from mapping
> the Illumina ProbeID's to gene symbols or other useful output. Have you
> encountered this situation before? I do appreciate the time you put into
> these packages. Annotated output below.
>
> Regards,
> Chris Gaiteri
>
> "lumi_vst" is the Beadstudio output, processed through the lumi package.
>> summary(lumi_vst)
>
> Summary of data information:
>          Data File Information:
>                 Illumina Inc. GenomeStudio version 1.0.6
>                 Normalization = none
>                 Array Content = HumanHT-12_V3_0_R2_11283641_A.bgx.xml
>
> (the above does not match the output of "getChipInfo" for instance "R2" vs
> "R3" and the extra chip version detected by getChipInfo)
>
>
>
>> getChipInfo(lumi_vst)
> $chipVersion
> [1] "HumanHT12_V3_0_R3_11283641_A" "HumanWG6_V3_0_R3_11282955_A"
>
> $species
> [1] "Human"
>
> $IDType
> [1] "Array_Address_Id" "Array_Address_Id"
>
> $chipProbeNumber
> [1] 48803
>
> $inputProbeNumber
> [1] 48803
>
> $matchedProbeNumber
> [1] 48803
>
>
> (attempting to generate nuID's fails, using what I know is the correct
> annotation file)
> lumi_vst = addNuID2lumi(lumi_vst,
> annotationFile='HumanHT-12_V3_0_R2_11283641_A.txt')
> Error in addNuID2lumi(lumi_vst, annotationFile =
> "HumanHT-12_V3_0_R2_11283641_A.txt") : The annotation file does not match
> the data!
>
> (In fact "addNuID2lumi" does not consider that any annotation file matches -
> no matter which suggested version of annotation I use)
>
> other attempts to convert to useful ID's also fail:
>> featureNames(lumi_vst)[1:10]
>  [1] "6450255" "2570615" "6370619" "2600039" "2650615" "5340672" "1090041"
> "6380561" "7570255" "4920477"
>> nuIDs = IlluminaID2nuID(featureNames(lumi_vst))
> Warning message:
> In if (!is.na(chipInfo$IDType)) { :
>   the condition has length > 1 and only the first element will be used
>
> (I think this comes from the two chip types that are listed under
> getChipInfo$chipVersion )
>
>
>> nuIDs[1:10]
>  [1] "ILMN_5579"   "ILMN_175569" "ILMN_18893"  "ILMN_18532"  "ILMN_6386"
> "ILMN_6386"   "ILMN_2229"   "ILMN_2229"
>  [9] "ILMN_19974"  "ILMN_31305"
>
> (these do not appear to be NuIDs and prevent further processing with
> lumiHumanIDMapping package)

From laurent.jacob at gmail.com Sat Jun 9 00:08:07 2012
From: laurent.jacob at gmail.com (laurent jacob)
Date: Fri, 8 Jun 2012 15:08:07 -0700
Subject: [BioC] DEGraph graph format?
In-Reply-To:
References:
Message-ID:

2012/6/8 Hamid Bolouri :
> It works!
>
> Thanks very much indeed Laurent.

Great, I'm glad this helped. Let me know if you encounter other problems.

Best,

Laurent

--
Laurent Jacob
Department of Statistics
UC Berkeley
http://cbio.ensmp.fr/~ljacob

From p.leo at uq.edu.au Sat Jun 9 01:04:20 2012
From: p.leo at uq.edu.au (Paul Leo)
Date: Fri, 8 Jun 2012 23:04:20 +0000
Subject: [BioC] Working with GERP scores
Message-ID: <8A20559E21BC8944BB3C74DD5200442609822C@UQEXMDA1.soe.uq.edu.au>

Can anyone advise on an elegant solution for working with genome-wide GERP scores in R? I would like to use them in coverage-like calculations. I am not sure how large they are as a run-length encoded object.

I am thinking that I could:

Prefilter to just the regions I want.
OR
Restrict to GERP score thresholds (but filtered to BED-like formats as text files they are about 53Gb for GERP score > 1).
OR
Use an Rsql-based solution, if that would work better.
OR
Put them in VCF format, compress them, and use Rsamtools-like VCF readers.

Has anyone tried any of these? What works?

Thanks
Paul

From mcarlson at fhcrc.org Sat Jun 9 02:10:39 2012
From: mcarlson at fhcrc.org (Marc Carlson)
Date: Fri, 08 Jun 2012 17:10:39 -0700
Subject: [BioC] BiomaRt Query error Edited
In-Reply-To:
References:
Message-ID: <4FD2947F.7090605@fhcrc.org>

Hi Avoks,

Could you maybe give us a few genes that cause the error for you? We don't have your file, so we have no idea what that read.delim call will put into 'genes'.

Marc

On 06/05/2012 04:56 AM, Ovokeraye Achinike-Oduaran wrote:
> My apologies. I omitted the "error" in my initial post.
>
> Hi all,
>
> I ran a list of genes through biomaRt with the following code and it
> gives me this error in the snp retrieval aspect of it. I doubt it's a
> connection/proxy problem because I have that taken care of, I think
> and every step prior to that seemed to have worked just fine.
>
> Any ideas what the problem might be?
>
> Thanks.
> > -Avoks > > mart = useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl") > genes = read.delim("DAVID_BFE_Genes_4_06_2012.txt", header = TRUE) > results = getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", > "chromosome_name","strand", "transcript_start", "transcript_end" ), > filters = "hgnc_symbol", values = genes$Symbol, mart = mart) > mart2 = useMart(biomart="snp", dataset="hsapiens_snp") > results2 = getBM(attributes = c("refsnp_id", "allele", "snp", > "chrom_strand", "cds_start","cds_end","validated", > "consequence_type_tv","phenotype_name"), > filters = "ensembl_gene", values = results$ensembl_gene_id, mart = mart2) > > http://www.w3.org/TR/html4/loose.dtd> > 2 HTTP-EQUIV=Content-Type CONTENT=text/html; charset=iso-8859-1> > 3 > ERROR: The requested URL could not be retrieved > 4 > 5 > > 6 >

> Error in getBM(attributes = c("refsnp_id", "allele", "snp", "chrom_strand", : > The query to the BioMart webservice returned an invalid result: the > number of columns in the result table does not equal the number of > attributes in the query. Please report this to the mailing list. > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_1252 LC_CTYPE=English_.1252 > [3] LC_MONETARY=English_.1252 LC_NUMERIC=C > [5] LC_TIME=English_.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BiocInstaller_1.4.6 biomaRt_2.12.0 > > loaded via a namespace (and not attached): > [1] RCurl_1.91-1.1 tools_2.15.0 XML_3.9-4.1 > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From patel.rimple at yahoo.com Sat Jun 9 02:56:35 2012 From: patel.rimple at yahoo.com (Rimple Patel) Date: Fri, 8 Jun 2012 17:56:35 -0700 (PDT) Subject: [BioC] (no subject) Message-ID: <1339203395.72276.YahooMailNeo@web45714.mail.sp1.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From m_raherison at hotmail.com Sat Jun 9 03:21:02 2012 From: m_raherison at hotmail.com (Elie M. RAHERISON) Date: Sat, 9 Jun 2012 01:21:02 +0000 Subject: [BioC] How to get TAIR identifiers corresponding to sequences of a non model species In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From yuchuan at stat.berkeley.edu Sat Jun 9 04:15:22 2012 From: yuchuan at stat.berkeley.edu (Yu Chuan Tai) Date: Fri, 8 Jun 2012 19:15:22 -0700 (PDT) Subject: [BioC] base-specific read counts In-Reply-To: References: <4FD0A5FA.6030301@fhcrc.org> Message-ID: Hi Sean, I didn't find any function in the VariantAnnotation package that can calculate mutant freq. Do you mean after reading in a VCF file using readVcf(), I need to calculate the base-level coverage first (for example, using the way Martin had suggested), and convert coverage to frequency myself? Then why do I need to use VariantAnnotation package for this purpose, given the fact that I already have a text file with all the SNVs/INDELs with their genomic coordinates? Best, Yu Chuan On Fri, 8 Jun 2012, Sean Davis wrote: > On Fri, Jun 8, 2012 at 2:06 AM, Yu Chuan Tai wrote: >> Hi Martin, >> >> One more question. Is there any way in Rsamtools to calculate SNVs/INDELS >> frequency directly using the output file from samtools? > > By "output file from samtools", I assume you mean a VCF file. If so, > take a look a the VariantAnnotation package and readVcf(). From > there, you'll need to do the calculation yourself, but that would be a > step on the way to accomplishing your task. > > Sean > From yuchuan at stat.berkeley.edu Sat Jun 9 04:18:00 2012 From: yuchuan at stat.berkeley.edu (Yu Chuan Tai) Date: Fri, 8 Jun 2012 19:18:00 -0700 (PDT) Subject: [BioC] Amplicon and exon level read counts and GC content In-Reply-To: <4FD0A7DA.1040009@fhcrc.org> References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org> <4FD0A7DA.1040009@fhcrc.org> Message-ID: Hi Martin, A quick question. Does it matter if I don't have the strand info. for the interval that I am interested in, when I specify the input arguments for ScanBamParam()? Best, Yu Chuan On Thu, 7 Jun 2012, Martin Morgan wrote: > On 06/06/2012 09:53 PM, Yu Chuan Tai wrote: >> Hi Martin, >> >> More questions on your approaches below. 
If my BAM files are >> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(), >> should I set isPaired=TRUE? Do I need to worry about other input >> arguments for scanBamFlag() or ScanBamParam(), if I want to >> calculate coverage properly? > > It really depends on what you're interested in doing; see for instance the > post by Herve the other day > > https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html > >> >> Also, summarizeOverlaps() doesn't seem to handle paired-end reads. >> How to get around this, or it won't affect coverage calculation? > > There is better support for paired-end reads in the 'devel' version of > Biocondcutor; see > > http://bioconductor.org/developers/useDevel/ > > whether and what aspects of paired-endedness are important depends on how you > are using your coverage. > >> >> Finally, is there any way to calculate base-specific coverage at any >> genomic locus or interval in Rsamtools? Thanks! > > I tried to answer this in your other post. > > Martin > >> >> Best, Yu Chuan >> >>> More specifically, after >>> >>> library(Rsamtools) example(scanBam) # defines 'fl', a path to a >>> bam file >>> >>> for a _single_ genomic range >>> >>> param = ScanBamParam(what="seq", which=GRanges("seq1", >>> IRanges(100, 500))) dna = scanBam(fl, param=param)[[1]][["seq"]] >>> length(dna) # 365 reads overlap region alphabetFrequency(dna, >>> collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC >>> >>> though you'd likely want to specify several regions (vector >>> arguments to GRanges) and think about flags (scanBamFlag() and the >>> flag argument to ScanBamParam), read mapping quality, reads >>> overlapping more than one region, etc. (summarizeOverlaps >>> implements several counting strategies, but it is 'easy' to >>> implement arbitrary approaches). >>> >>>> >>>> Martin >>>> >>>>> >>>>> Thanks for any input! 
>>>>> >>>>> Best, Yu Chuan >>>>> >>>>> _______________________________________________ Bioconductor >>>>> mailing list Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >>>>> archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> >>> >>>>> >>>>> > -- >>> Computational Biology Fred Hutchinson Cancer Research Center 1100 >>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>> >>> Location: M1-B861 Telephone: 206 667-2793 >>> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > From mtmorgan at fhcrc.org Sat Jun 9 04:25:34 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Fri, 08 Jun 2012 19:25:34 -0700 Subject: [BioC] Amplicon and exon level read counts and GC content In-Reply-To: References: <4FC5B3D3.4030007@fhcrc.org> <4FC5B75C.7060900@fhcrc.org> <4FD0A7DA.1040009@fhcrc.org> Message-ID: <4FD2B41E.50207@fhcrc.org> On 06/08/2012 07:18 PM, Yu Chuan Tai wrote: > Hi Martin, > > A quick question. Does it matter if I don't have the strand info. for > the interval that I am interested in, when I specify the input arguments > for ScanBamParam()? See ?ScanBamParam which says which: A 'GRanges', 'RangesList', 'RangedData', or missing object, from which a 'IRangesList' instance will be constructed. Names of the 'IRangesList' correspond to reference sequences, and ranges to the regions on that reference sequence for which matches are desired. Because data types are coerced to 'IRangesList', 'which' does _not_ include strand information (use the 'flag' argument instead). Only records with a read overlapping the specified ranges are returned. All ranges must have ends less than or equal to 536870912. If you provide GRanges, with strand information, the strand information is ignored. 
If you want reads only on the plus strand, see (on the same help page) isMinusStrand: A logical(1) indicating whether reads aligned to the plus (FALSE), minus (TRUE), or any (NA) strand should be returned. so ScanBamParam(flag=scanBamFlag(isMinusStrand=FALSE)) Martin > > Best, > Yu Chuan > > On Thu, 7 Jun 2012, Martin Morgan wrote: > >> On 06/06/2012 09:53 PM, Yu Chuan Tai wrote: >>> Hi Martin, >>> >>> More questions on your approaches below. If my BAM files are >>> generated by Bowtie2 on pair-end fastq files, for scanBamFlag(), >>> should I set isPaired=TRUE? Do I need to worry about other input >>> arguments for scanBamFlag() or ScanBamParam(), if I want to >>> calculate coverage properly? >> >> It really depends on what you're interested in doing; see for instance >> the post by Herve the other day >> >> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046052.html >> >>> >>> Also, summarizeOverlaps() doesn't seem to handle paired-end reads. >>> How to get around this, or it won't affect coverage calculation? >> >> There is better support for paired-end reads in the 'devel' version of >> Biocondcutor; see >> >> http://bioconductor.org/developers/useDevel/ >> >> whether and what aspects of paired-endedness are important depends on >> how you are using your coverage. >> >>> >>> Finally, is there any way to calculate base-specific coverage at any >>> genomic locus or interval in Rsamtools? Thanks! >> >> I tried to answer this in your other post. 
>> >> Martin >> >>> >>> Best, Yu Chuan >>> >>>> More specifically, after >>>> >>>> library(Rsamtools) example(scanBam) # defines 'fl', a path to a >>>> bam file >>>> >>>> for a _single_ genomic range >>>> >>>> param = ScanBamParam(what="seq", which=GRanges("seq1", >>>> IRanges(100, 500))) dna = scanBam(fl, param=param)[[1]][["seq"]] >>>> length(dna) # 365 reads overlap region alphabetFrequency(dna, >>>> collapse=TRUE, baseOnly=TRUE) # 2838 + 3003 GC >>>> >>>> though you'd likely want to specify several regions (vector >>>> arguments to GRanges) and think about flags (scanBamFlag() and the >>>> flag argument to ScanBamParam), read mapping quality, reads >>>> overlapping more than one region, etc. (summarizeOverlaps >>>> implements several counting strategies, but it is 'easy' to >>>> implement arbitrary approaches). >>>> >>>>> >>>>> Martin >>>>> >>>>>> >>>>>> Thanks for any input! >>>>>> >>>>>> Best, Yu Chuan >>>>>> >>>>>> _______________________________________________ Bioconductor >>>>>> mailing list Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >>>>>> archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>> >>>> >>>> >>>>>> >>>>>> >> -- >>>> Computational Biology Fred Hutchinson Cancer Research Center 1100 >>>> Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>> >>>> Location: M1-B861 Telephone: 206 667-2793 >>>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. 
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

From whuber at embl.de Sat Jun 9 10:02:29 2012
From: whuber at embl.de (Wolfgang Huber)
Date: Sat, 09 Jun 2012 10:02:29 +0200
Subject: [BioC] arrayQualityMetrics() doesn't work for one-color non Affy arrays
In-Reply-To: <29016.4baaefe3.3d030629@aol.com>
References: <29016.4baaefe3.3d030629@aol.com>
Message-ID: <4FD30315.3040502@embl.de>

Dear Alex

Thanks for reporting this. It looks like the range of values in your ExpressionSet esetPROC is outside of what the plotting function for MA plots expects. Can you please do the following:

1. Update to the most recent release of 'arrayQualityMetrics' or, if you're up to it, even better the current devel version:

http://www.bioconductor.org/packages/release/bioc/html/arrayQualityMetrics.html
http://www.bioconductor.org/packages/devel/bioc/html/arrayQualityMetrics.html

2. Establish that the reports work for you with a "well-behaved" ExpressionSet x for which the following holds:

!any(is.na(exprs(x)))
all(exprs(x) > 0)
all(is.finite(exprs(x)))

3. With a non-well-behaved data set, the software might produce nonsensical plots, but it should not stop with an error. If it does, can you please send me the offending data object esetPROC so that I can reproduce the error and catch it more gracefully in future versions.

Best wishes
Wolfgang

Jun/8/12 9:39 AM, Alogmail2 at aol.com scripsit::
> Dear List,
>
> Could you share your experience with arrayQualityMetrics() for one-color
> non Affy arrays: it doesn't work for me (please see the code below).
> > Thanks > > Alex Loguinov > > UC, Berkeley > > > > >> options(error = recover, warn = 2) >> options(bitmapType = "cairo") >> .HaveDummy = !interactive() >> if(.HaveDummy) pdf("dummy.pdf") > >> library("arrayQualityMetrics") > >> head(targets) > FileName Treatment GErep Time Conc > T0-Control-Cu_61_new_252961010035_2_4 > T0-Control-Cu_61_new_252961010035_2_4.txt C.t0.0 0 0 0 > T0-Control-Cu_62_new_252961010036_2_1 > T0-Control-Cu_62_new_252961010036_2_1.txt C.t0.0 0 0 0 > T0-Control-Cu_64_252961010031_2_2 > T0-Control-Cu_64_252961010031_2_2.txt C.t0.0 0 0 0 > T0-Control-Cu_65_new_252961010037_2_2 > T0-Control-Cu_65_new_252961010037_2_2.txt C.t0.0 0 0 0 > T04h-Contr_06_new_252961010037_2_4 > T04h-Contr_06_new_252961010037_2_4.txt C.t4.0 1 4 0 > T04h-Contr_10_new_252961010035_1_2 > T04h-Contr_10_new_252961010035_1_2.txt C.t4.0 1 4 0 > > >> ddaux = read.maimages(files = targets$FileName, source = "agilent", > other.columns = list(IsFound = "gIsFound", IsWellAboveBG = > "IsWellAboveBG",gIsPosAndSignif="gIsPosAndSignif", > IsSaturated = "gIsSaturated", IsFeatNonUnifOF = "gIsFeatNonUnifOL", > IsFeatPopnOL = "gIsFeatPopnOL", ChrCoord = > "chr_coord",Row="Row",Column="Col"), > columns = list(Rf = "gProcessedSignal", Gf = "gMeanSignal", > Rb = "gBGMedianSignal", Gb = "gBGUsed"), verbose = T, > sep = "\t", quote = "") > > >> class(ddaux) > [1] "RGList" > attr(,"package") > [1] "limma" >> names(ddaux) > [1] "R" "G" "Rb" "Gb" "targets" "genes" "source" > "printer" "other" > > > I could apply: >> >> class(ddaux$G) > [1] "matrix" > >> all(rownames(targets)==colnames(ddaux$G)) > [1] TRUE > >> esetPROC = new("ExpressionSet", exprs = ddaux$G) > > But it results in errors: > >> arrayQualityMetrics(expressionset=esetPROC,outdir ="esetPROC",force =T) > > The directory 'esetPROC' has been created. 
> Error: no function to return from, jumping to top level > > Enter a frame number, or 0 to exit > > 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC", > force = T) > 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle = > reporttitle, outdir = outdir) > 3: reportModule(p = p, module = modules[[i]], currentIndex = currentIndex, > arrayTable = arrayTableCompact, outdir = outdir) > 4: makePlot(module) > 5: print(_x at plot_ (mailto:x at plot) ) > 6: print.trellis(_x at plot_ (mailto:x at plot) ) > 7: printFunction(x, ...) > 8: tryCatch(checkArgsAndCall(panel, pargs), error = function(e) > panel.error(e)) > 9: tryCatchList(expr, classes, parentenv, handlers) > 10: tryCatchOne(expr, names, parentenv, handlers[[1]]) > 11: doTryCatch(return(expr), name, parentenv, handler) > 12: checkArgsAndCall(panel, pargs) > 13: do.call(FUN, args) > 14: function (x, y = NULL, subscripts, groups, panel.groups = > "panel.xyplot", ..., col = "black", col.line = superpose.line$col, col.symbol = > superpose.symb > 15: .signalSimpleWarning("closing unused connection 5 > (Report_for_exampleSet/index.html)", quote(NULL)) > 16: withRestarts({ > 17: withOneRestart(expr, restarts[[1]]) > 18: doWithOneRestart(return(expr), restart) > > Selection: 0 > > > Error in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, : > (converted from warning) Binning grid too coarse for current (small) > bandwidth: consider increasing 'gridsize' > > Enter a frame number, or 0 to exit > > 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC", > force = T) > 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle = > reporttitle, outdir = outdir) > 3: reportModule(p = p, module = modules[[i]], currentIndex = currentIndex, > arrayTable = arrayTableCompact, outdir = outdir) > 4: makePlot(module) > 5: do.call(_x at plot_ (mailto:x at plot) , args = list()) > 6: function () > 7: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard deviation of the > 
intensities", xlab = "Rank(mean of intensities)") > 8: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard deviation of the > intensities", xlab = "Rank(mean of intensities)") > 9: smoothScatter(res$px, res$py, xlab = xlab, ylab = ylab, ...) > 10: grDevices:::.smoothScatterCalcDensity(x, nbin, bandwidth) > 11: KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, range.x > = range.x) > 12: warning("Binning grid too coarse for current (small) bandwidth: > consider increasing 'gridsize'") > 13: .signalSimpleWarning("Binning grid too coarse for current (small) > bandwidth: consider increasing 'gridsize'", quote(KernSmooth::bkde2D(x, > bandwidth = ba > 14: withRestarts({ > 15: withOneRestart(expr, restarts[[1]]) > 16: doWithOneRestart(return(expr), restart) > > Selection: 0 > > >> sessionInfo() > R version 2.14.2 (2012-02-29) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] CCl4_1.0.11 vsn_3.22.0 > arrayQualityMetrics_3.10.0 Agi4x44PreProcess_1.14.0 genefilter_1.36.0 > [6] annotate_1.32.3 AnnotationDbi_1.16.19 limma_3.10.3 > Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] affy_1.32.1 affyio_1.22.0 affyPLM_1.30.0 > beadarray_2.4.2 BiocInstaller_1.2.1 Biostrings_2.22.0 > [7] Cairo_1.5-1 cluster_1.14.2 colorspace_1.1-1 > DBI_0.2-5 grid_2.14.2 Hmisc_3.9-3 > [13] hwriter_1.3 IRanges_1.12.6 KernSmooth_2.23-7 > lattice_0.20-6 latticeExtra_0.6-19 plyr_1.7.1 > [19] preprocessCore_1.16.0 RColorBrewer_1.0-5 reshape2_1.2.1 > RSQLite_0.11.1 setRNG_2011.11-2 splines_2.14.2 > [25] stringr_0.6 survival_2.36-14 SVGAnnotation_0.9-0 > tools_2.14.2 XML_3.9-4.1 xtable_1.7-0 > [31] zlibbioc_1.0.1 > > > > [[alternative HTML version deleted]] > > 
_______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber From whuber at embl.de Sat Jun 9 10:18:20 2012 From: whuber at embl.de (Wolfgang Huber) Date: Sat, 09 Jun 2012 10:18:20 +0200 Subject: [BioC] Error in intgroup of arrayQualityMetrics package In-Reply-To: <20120601091651.6CFA1133CD3@mamba.fhcrc.org> References: <20120601091651.6CFA1133CD3@mamba.fhcrc.org> Message-ID: <4FD306CC.8020805@embl.de> Dear Sonal thank you for reporting this. Sorry that you have trouble. But we will need more information about your problem to help you. What do you get as output from the following lines of R code: sessionInfo() intgroup str(intgroup) pData(eset) colnames(pData(eset)) Best wishes Wolfgang Jun/1/12 11:16 AM, Sonal [guest] scripsit:: > > I am using arraQualityMetrics package installed from Bioconductor site and R version that I am using is 2.15.0 > > The input for the function was eset and for the intgroup argument character vector "Tissue". There is a column named Tissue in my phenoData of the eset. > > But it still gives me an error saying the elements of intgroup do not match the column names of the pData(eset). > I don't know what wrong I am doing. > Can anybody suggest anything. > Thank You. > > > -- output of sessionInfo(): > > Error in prepData(expressionset,intgroup=intgroup): > all elements of 'intgroup' should match column names of pData(expressionset) > > -- > Sent via the guest posting facility at bioconductor.org. 
> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber From curoli at gmail.com Sat Jun 9 12:09:22 2012 From: curoli at gmail.com (Oliver Ruebenacker) Date: Sat, 9 Jun 2012 06:09:22 -0400 Subject: [BioC] DEGraph graph format? In-Reply-To: References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org> Message-ID: Hello, On Fri, Jun 8, 2012 at 4:30 PM, laurent jacob wrote: > Hi Hamid, > > 2012/6/7 Hamid Bolouri : > >>>library(NCIgraphData) >>>data("NCI-cyList") >>> NCI.cyList[[1]] >> A graphNEL graph with directed edges >> Number of Nodes = 35 >> Number of Edges = 40 > > The graphs NCI-cyList cannot directly be used with DEGraph: they are > raw representations of the NCI biopax files. In particular, the nodes > of the graph do not correspond to genes: Does this package use a BioPAX parser that could be used for other BioPAX data? Does it read level 2 or level 3 or both? I'm looking for a BioPAX parser and would be willing to help build one if none exists. Thanks! Take care Oliver -- Oliver Ruebenacker Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) Knowomics, The Bioinformatics Network (http://www.knowomics.com) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) From sdavis2 at mail.nih.gov Sat Jun 9 14:15:17 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sat, 9 Jun 2012 08:15:17 -0400 Subject: [BioC] base-specific read counts In-Reply-To: References: <4FD0A5FA.6030301@fhcrc.org> Message-ID: On Fri, Jun 8, 2012 at 10:15 PM, Yu Chuan Tai wrote: > Hi Sean, > > I didn't find any function in ?the VariantAnnotation package that can > calculate mutant freq. 
Do you mean after reading in a VCF file using
> readVcf(), I need to calculate the base-level coverage first (for example,
> using the way Martin had suggested), and convert coverage to frequency
> myself? Then why do I need to use VariantAnnotation package for this
> purpose, given the fact that I already have a text file with all the
> SNVs/INDELs with their genomic coordinates?

My mistake. I thought you meant the frequency of the variant in your
samples. You are talking about allele counts? If so, you'll need the
bam files, as Martin has suggested. Sorry to mislead you.

Sean

> On Fri, 8 Jun 2012, Sean Davis wrote:
>
>> On Fri, Jun 8, 2012 at 2:06 AM, Yu Chuan Tai
>> wrote:
>>>
>>> Hi Martin,
>>>
>>> One more question. Is there any way in Rsamtools to calculate SNVs/INDELS
>>> frequency directly using the output file from samtools?
>>
>>
>> By "output file from samtools", I assume you mean a VCF file. If so,
>> take a look at the VariantAnnotation package and readVcf(). From
>> there, you'll need to do the calculation yourself, but that would be a
>> step on the way to accomplishing your task.
>>
>> Sean
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

From yuchuan at stat.berkeley.edu Sat Jun 9 17:20:33 2012
From: yuchuan at stat.berkeley.edu (Yu Chuan Tai)
Date: Sat, 9 Jun 2012 08:20:33 -0700 (PDT)
Subject: [BioC] base-specific read counts
In-Reply-To:
References: <4FD0A5FA.6030301@fhcrc.org>
Message-ID:

Hi Sean,

No worries. I actually want mutant frequencies for each sample, but I didn't see any function in VariantAnnotation for that. Anyway, I just found that samtools/bcftools may calculate that directly. Thanks for your help!
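
For readers following the "convert coverage to frequency myself" step discussed in this thread, the arithmetic can be sketched in base R. This is only an illustration under stated assumptions: it presumes the per-base read counts at a position are already in hand (for instance from the BAM-based approach Martin suggested), and `nonref_frequency` and the toy counts are hypothetical names, not functions or data from Rsamtools or VariantAnnotation.

```r
# Hypothetical helper: fraction of reads at one position that do not carry
# the reference base. 'base_counts' is a named vector of per-base read counts.
nonref_frequency <- function(base_counts, ref_base) {
  depth <- sum(base_counts)
  if (depth == 0) {
    return(NA_real_)  # no coverage, so the frequency is undefined
  }
  1 - base_counts[[ref_base]] / depth
}

# Toy counts at a single site: 18 reads support the reference "A", 6 support "G".
toy_counts <- c(A = 18, C = 0, G = 6, T = 0)
print(nonref_frequency(toy_counts, ref_base = "A"))
```

Returning NA for uncovered positions is a design choice in this sketch; it keeps zero-depth sites from being mistaken for sites with no mutant reads.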
Best, Yu Chuan On Sat, 9 Jun 2012, Sean Davis wrote: > On Fri, Jun 8, 2012 at 10:15 PM, Yu Chuan Tai wrote: >> Hi Sean, >> >> I didn't find any function in ?the VariantAnnotation package that can >> calculate mutant freq. Do you mean after reading in a VCF file using >> readVcf(), I need to calculate the base-level coverage first (for example, >> using the way Martin had suggested), and convert coverage to frequency >> myself? Then why do I need to use VariantAnnotation package for this >> purpose, given the fact that I already have a text file with all the >> SNVs/INDELs with their genomic coordinates? > > My mistake. I thought you meant the frequency of the variant in your > samples. You are talking about allele counts? If so, you'll need the > bam files, as Martin has suggested. Sorry to mislead you. > > Sean > > >> On Fri, 8 Jun 2012, Sean Davis wrote: >> >>> On Fri, Jun 8, 2012 at 2:06 AM, Yu Chuan Tai >>> wrote: >>>> >>>> Hi Martin, >>>> >>>> One more question. Is there any way in Rsamtools to calculate SNVs/INDELS >>>> frequency directly using the output file from samtools? >>> >>> >>> By "output file from samtools", I assume you mean a VCF file. ?If so, >>> take a look a the VariantAnnotation package and readVcf(). ?From >>> there, you'll need to do the calculation yourself, but that would be a >>> step on the way to accomplishing your task. 
>>> >>> Sean >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > From wng.peter at gmail.com Sat Jun 9 23:18:08 2012 From: wng.peter at gmail.com (wang peter) Date: Sat, 9 Jun 2012 17:18:08 -0400 Subject: [BioC] shangao Message-ID: how to design a model matrix my data is composed of 35 samples, have two factor time and treatment i want to find DE genes cross the time, considering control so do you think such coding is right? how to get the DE from the lrt.ede raw.data <- read.table("expression-table.txt",row.names=1) lib_size <- read.table("lib_size.txt"); lib_size <- unlist(lib_size) d <- DGEList(counts = raw.data, lib.size = lib_size) dge <- d[rowSums(d$counts) >= length(lib_size)/2,] #normalization dge <- calcNormFactors(dge) treatment=factor(c(rep('control',6),rep('treated',24),rep('control',5))) time=factor(c('0h','0h','0h','24h','24h','24h','0h','0h','0h','6h','6h','6h','6h','12h','12h','12h','12h','18h','18h','18h','18h', '24h','24h','24h','36h','36h','36h','48h','48h','48h','6h','12h','18h','24h','36h')) design <- model.matrix(~time+treatment*time) dge <- estimateGLMCommonDisp(dge, design) dge <- estimateGLMTagwiseDisp(dge, design) glmfit.dge <- glmFit(dge, design,dispersion=dge$common.dispersion) lrt.dge <- glmLRT(dge, glmfit.dge, coef=2) -- shan gao Room 231(Dr.Fei lab) Boyce Thompson Institute Cornell University Tower Road, Ithaca, NY 14853-1801 Office phone: 1-607-254-1267(day) Official email:sg839 at cornell.edu Facebook:http://www.facebook.com/profile.php?id=100001986532253 From donttrustben at gmail.com Sun Jun 10 02:20:29 2012 From: donttrustben at gmail.com (Ben Woodcroft) Date: Sun, 10 Jun 2012 10:20:29 +1000 Subject: [BioC] SRAdb: is the database missing some entries? 
Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From smyth at wehi.EDU.AU Sun Jun 10 05:12:23 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Sun, 10 Jun 2012 13:12:23 +1000 (AUS Eastern Standard Time) Subject: [BioC] reading single channel Agilent data with limma [was arrayQualityMetrics doesn't work...] Message-ID: Hi Alex, I don't know arrayQualityMetrics, but you are using the limma package to read single-channel Agilent data in a way that I think might cause problems with down-stream analyses. Basically you're creating a two-color data object when your data is not actually of that type. This was a time when I suggested this sort of work-around as a stop-gap measure for some data problems, but hasn't been necessary for quite a few years. I'd also recommend that you do some background correction. If I understand your code correctly, I don't think it is currently making use of the background intensity column. There is a case study in the limma User's Guide that deals with single channel Agilent data. Could you please have a read of that for a cleaner way to read Agilent data? I don't know whether that will be enough to solve your arrayQualityMetrics problem, but perhaps it might. Best wishes Gordon ------------- original message ------------- [BioC] arrayQualityMetrics() doesn't work for one-color non Affy arrays Alogmail2 at aol.com Alogmail2 at aol.com Fri Jun 8 09:39:21 CEST 2012 Dear List, Could you share your experience with arrayQualityMetrics() for one-color non Affy arrays: it doesn't work for me (please see the code below). 
Thanks Alex Loguinov UC, Berkeley >options(error = recover, warn = 2) >options(bitmapType = "cairo") >.HaveDummy = !interactive() > if(.HaveDummy) pdf("dummy.pdf") >library("arrayQualityMetrics") >head(targets) FileName Treatment GErep Time Conc T0-Control-Cu_61_new_252961010035_2_4 T0-Control-Cu_61_new_252961010035_2_4.txt C.t0.0 0 0 0 T0-Control-Cu_62_new_252961010036_2_1 T0-Control-Cu_62_new_252961010036_2_1.txt C.t0.0 0 0 0 T0-Control-Cu_64_252961010031_2_2 T0-Control-Cu_64_252961010031_2_2.txt C.t0.0 0 0 0 T0-Control-Cu_65_new_252961010037_2_2 T0-Control-Cu_65_new_252961010037_2_2.txt C.t0.0 0 0 0 T04h-Contr_06_new_252961010037_2_4 T04h-Contr_06_new_252961010037_2_4.txt C.t4.0 1 4 0 T04h-Contr_10_new_252961010035_1_2 T04h-Contr_10_new_252961010035_1_2.txt C.t4.0 1 4 0 > ddaux = read.maimages(files = targets$FileName, source = "agilent", other.columns = list(IsFound = "gIsFound", IsWellAboveBG = "IsWellAboveBG",gIsPosAndSignif="gIsPosAndSignif", IsSaturated = "gIsSaturated", IsFeatNonUnifOF = "gIsFeatNonUnifOL", IsFeatPopnOL = "gIsFeatPopnOL", ChrCoord = "chr_coord",Row="Row",Column="Col"), columns = list(Rf = "gProcessedSignal", Gf = "gMeanSignal", Rb = "gBGMedianSignal", Gb = "gBGUsed"), verbose = T, sep = "\t", quote = "") > class(ddaux) [1] "RGList" attr(,"package") [1] "limma" > names(ddaux) [1] "R" "G" "Rb" "Gb" "targets" "genes" "source" "printer" "other" I could apply: > > class(ddaux$G) [1] "matrix" >all(rownames(targets)==colnames(ddaux$G)) [1] TRUE >esetPROC = new("ExpressionSet", exprs = ddaux$G) But it results in errors: >arrayQualityMetrics(expressionset=esetPROC,outdir ="esetPROC",force =T) The directory 'esetPROC' has been created. 
Error: no function to return from, jumping to top level Enter a frame number, or 0 to exit 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC", force = T) 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle = reporttitle, outdir = outdir) 3: reportModule(p = p, module = modules[[i]], currentIndex = currentIndex, arrayTable = arrayTableCompact, outdir = outdir) 4: makePlot(module) 5: print(_x at plot_ (mailto:x at plot) ) 6: print.trellis(_x at plot_ (mailto:x at plot) ) 7: printFunction(x, ...) 8: tryCatch(checkArgsAndCall(panel, pargs), error = function(e) panel.error(e)) 9: tryCatchList(expr, classes, parentenv, handlers) 10: tryCatchOne(expr, names, parentenv, handlers[[1]]) 11: doTryCatch(return(expr), name, parentenv, handler) 12: checkArgsAndCall(panel, pargs) 13: do.call(FUN, args) 14: function (x, y = NULL, subscripts, groups, panel.groups = "panel.xyplot", ..., col = "black", col.line = superpose.line$col, col.symbol = superpose.symb 15: .signalSimpleWarning("closing unused connection 5 (Report_for_exampleSet/index.html)", quote(NULL)) 16: withRestarts({ 17: withOneRestart(expr, restarts[[1]]) 18: doWithOneRestart(return(expr), restart) Selection: 0 Error in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, : (converted from warning) Binning grid too coarse for current (small) bandwidth: consider increasing 'gridsize' Enter a frame number, or 0 to exit 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC", force = T) 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle = reporttitle, outdir = outdir) 3: reportModule(p = p, module = modules[[i]], currentIndex = currentIndex, arrayTable = arrayTableCompact, outdir = outdir) 4: makePlot(module) 5: do.call(_x at plot_ (mailto:x at plot) , args = list()) 6: function () 7: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard deviation of the intensities", xlab = "Rank(mean of intensities)") 8: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard 
deviation of the intensities", xlab = "Rank(mean of intensities)") 9: smoothScatter(res$px, res$py, xlab = xlab, ylab = ylab, ...) 10: grDevices:::.smoothScatterCalcDensity(x, nbin, bandwidth) 11: KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, range.x = range.x) 12: warning("Binning grid too coarse for current (small) bandwidth: consider increasing 'gridsize'") 13: .signalSimpleWarning("Binning grid too coarse for current (small) bandwidth: consider increasing 'gridsize'", quote(KernSmooth::bkde2D(x, bandwidth = ba 14: withRestarts({ 15: withOneRestart(expr, restarts[[1]]) 16: doWithOneRestart(return(expr), restart) Selection: 0 > sessionInfo() R version 2.14.2 (2012-02-29) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] CCl4_1.0.11 vsn_3.22.0 arrayQualityMetrics_3.10.0 Agi4x44PreProcess_1.14.0 genefilter_1.36.0 [6] annotate_1.32.3 AnnotationDbi_1.16.19 limma_3.10.3 Biobase_2.14.0 loaded via a namespace (and not attached): [1] affy_1.32.1 affyio_1.22.0 affyPLM_1.30.0 beadarray_2.4.2 BiocInstaller_1.2.1 Biostrings_2.22.0 [7] Cairo_1.5-1 cluster_1.14.2 colorspace_1.1-1 DBI_0.2-5 grid_2.14.2 Hmisc_3.9-3 [13] hwriter_1.3 IRanges_1.12.6 KernSmooth_2.23-7 lattice_0.20-6 latticeExtra_0.6-19 plyr_1.7.1 [19] preprocessCore_1.16.0 RColorBrewer_1.0-5 reshape2_1.2.1 RSQLite_0.11.1 setRNG_2011.11-2 splines_2.14.2 [25] stringr_0.6 survival_2.36-14 SVGAnnotation_0.9-0 tools_2.14.2 XML_3.9-4.1 xtable_1.7-0 [31] zlibbioc_1.0.1 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From laurent.jacob at gmail.com Sun Jun 10 07:44:22 2012 From: laurent.jacob at gmail.com (laurent jacob) Date: Sat, 9 Jun 2012 
22:44:22 -0700
Subject: [BioC] DEGraph graph format?
In-Reply-To:
References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org>
Message-ID:

Hi Oliver,

2012/6/9 Oliver Ruebenacker :
> Hello,
>
> Does this package use a BioPAX parser that could be used for other
> BioPAX data? Does it read level 2 or level 3 or both?
>
> I'm looking for a BioPAX parser and would be willing to help build
> one if none exists.

I think Rredland
(http://bioconductor.org/packages/2.4/bioc/html/Rredland.html) parses
BioPAX in R.

I didn't use it for this package because after that I also needed to
convert the read structure to graphNEL objects, which was not
straightforward. I read the BioPAX files in Cytoscape, then used
RCytoscape (http://bioconductor.org/packages/release/bioc/html/RCytoscape.html)
to read the networks built by Cytoscape.

Best,

Laurent

--
Laurent Jacob
Department of Statistics
UC Berkeley
http://cbio.ensmp.fr/~ljacob

From whuber at embl.de Sun Jun 10 11:17:10 2012
From: whuber at embl.de (Wolfgang Huber)
Date: Sun, 10 Jun 2012 11:17:10 +0200
Subject: [BioC] reading single channel Agilent data with limma [was arrayQualityMetrics doesn't work...]
In-Reply-To:
References:
Message-ID: <4FD46616.7000409@embl.de>

Dear Gordon,

you are right, that use of limma's read.maimages is rather odd. But
from there Alex puts his data into an ExpressionSet 'esetPROC', and the
arrayQualityMetrics error occurs with that.

Any analysis of these data (including arrayQualityMetrics) will only
make sense after proper preprocessing, as you suggest. This is what
Alex should do, and this is what the arrayQualityMetrics report should
have told him to do.

Bottom line, my goal (to which we are close, but not there) is that
arrayQualityMetrics should react gracefully even to the wildest
instances of data, rather than stop with an error.
Best wishes Wolfgang Jun/10/12 5:12 AM, Gordon K Smyth scripsit:: > Hi Alex, > > I don't know arrayQualityMetrics, but you are using the limma package to > read single-channel Agilent data in a way that I think might cause > problems with down-stream analyses. Basically you're creating a > two-color data object when your data is not actually of that type. This > was a time when I suggested this sort of work-around as a stop-gap > measure for some data problems, but hasn't been necessary for quite a > few years. > > I'd also recommend that you do some background correction. If I > understand your code correctly, I don't think it is currently making use > of the background intensity column. > > There is a case study in the limma User's Guide that deals with single > channel Agilent data. Could you please have a read of that for a cleaner > way to read Agilent data? > > I don't know whether that will be enough to solve your > arrayQualityMetrics problem, but perhaps it might. > > Best wishes > Gordon > > ------------- original message ------------- > [BioC] arrayQualityMetrics() doesn't work for one-color non Affy arrays > Alogmail2 at aol.com Alogmail2 at aol.com > Fri Jun 8 09:39:21 CEST 2012 > > Dear List, > > Could you share your experience with arrayQualityMetrics() for one-color > non Affy arrays: it doesn't work for me (please see the code below). 
> > Thanks > > Alex Loguinov > > UC, Berkeley > > > > >> options(error = recover, warn = 2) >> options(bitmapType = "cairo") >> .HaveDummy = !interactive() >> if(.HaveDummy) pdf("dummy.pdf") > >> library("arrayQualityMetrics") > >> head(targets) > FileName Treatment GErep Time Conc > T0-Control-Cu_61_new_252961010035_2_4 > T0-Control-Cu_61_new_252961010035_2_4.txt C.t0.0 0 0 0 > T0-Control-Cu_62_new_252961010036_2_1 > T0-Control-Cu_62_new_252961010036_2_1.txt C.t0.0 0 0 0 > T0-Control-Cu_64_252961010031_2_2 > T0-Control-Cu_64_252961010031_2_2.txt C.t0.0 0 0 0 > T0-Control-Cu_65_new_252961010037_2_2 > T0-Control-Cu_65_new_252961010037_2_2.txt C.t0.0 0 0 0 > T04h-Contr_06_new_252961010037_2_4 > T04h-Contr_06_new_252961010037_2_4.txt C.t4.0 1 4 0 > T04h-Contr_10_new_252961010035_1_2 > T04h-Contr_10_new_252961010035_1_2.txt C.t4.0 1 4 0 > > >> ddaux = read.maimages(files = targets$FileName, source = "agilent", > other.columns = list(IsFound = "gIsFound", IsWellAboveBG = > "IsWellAboveBG",gIsPosAndSignif="gIsPosAndSignif", > IsSaturated = "gIsSaturated", IsFeatNonUnifOF = "gIsFeatNonUnifOL", > IsFeatPopnOL = "gIsFeatPopnOL", ChrCoord = > "chr_coord",Row="Row",Column="Col"), > columns = list(Rf = "gProcessedSignal", Gf = "gMeanSignal", > Rb = "gBGMedianSignal", Gb = "gBGUsed"), verbose = T, > sep = "\t", quote = "") > > >> class(ddaux) > [1] "RGList" > attr(,"package") > [1] "limma" >> names(ddaux) > [1] "R" "G" "Rb" "Gb" "targets" "genes" "source" > "printer" "other" > > > I could apply: >> >> class(ddaux$G) > [1] "matrix" > >> all(rownames(targets)==colnames(ddaux$G)) > [1] TRUE > >> esetPROC = new("ExpressionSet", exprs = ddaux$G) > > But it results in errors: > >> arrayQualityMetrics(expressionset=esetPROC,outdir ="esetPROC",force =T) > > The directory 'esetPROC' has been created. 
> Error: no function to return from, jumping to top level > > Enter a frame number, or 0 to exit > > 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC", > force = T) > 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle = > reporttitle, outdir = outdir) > 3: reportModule(p = p, module = modules[[i]], currentIndex = currentIndex, > arrayTable = arrayTableCompact, outdir = outdir) > 4: makePlot(module) > 5: print(_x at plot_ (mailto:x at plot) ) > 6: print.trellis(_x at plot_ (mailto:x at plot) ) > 7: printFunction(x, ...) > 8: tryCatch(checkArgsAndCall(panel, pargs), error = function(e) > panel.error(e)) > 9: tryCatchList(expr, classes, parentenv, handlers) > 10: tryCatchOne(expr, names, parentenv, handlers[[1]]) > 11: doTryCatch(return(expr), name, parentenv, handler) > 12: checkArgsAndCall(panel, pargs) > 13: do.call(FUN, args) > 14: function (x, y = NULL, subscripts, groups, panel.groups = > "panel.xyplot", ..., col = "black", col.line = superpose.line$col, > col.symbol = > superpose.symb > 15: .signalSimpleWarning("closing unused connection 5 > (Report_for_exampleSet/index.html)", quote(NULL)) > 16: withRestarts({ > 17: withOneRestart(expr, restarts[[1]]) > 18: doWithOneRestart(return(expr), restart) > > Selection: 0 > > > Error in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, : > (converted from warning) Binning grid too coarse for current (small) > bandwidth: consider increasing 'gridsize' > > Enter a frame number, or 0 to exit > > 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC", > force = T) > 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle = > reporttitle, outdir = outdir) > 3: reportModule(p = p, module = modules[[i]], currentIndex = currentIndex, > arrayTable = arrayTableCompact, outdir = outdir) > 4: makePlot(module) > 5: do.call(_x at plot_ (mailto:x at plot) , args = list()) > 6: function () > 7: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard deviation of the > 
intensities", xlab = "Rank(mean of intensities)") > 8: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard deviation of the > intensities", xlab = "Rank(mean of intensities)") > 9: smoothScatter(res$px, res$py, xlab = xlab, ylab = ylab, ...) > 10: grDevices:::.smoothScatterCalcDensity(x, nbin, bandwidth) > 11: KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, range.x > = range.x) > 12: warning("Binning grid too coarse for current (small) bandwidth: > consider increasing 'gridsize'") > 13: .signalSimpleWarning("Binning grid too coarse for current (small) > bandwidth: consider increasing 'gridsize'", quote(KernSmooth::bkde2D(x, > bandwidth = ba > 14: withRestarts({ > 15: withOneRestart(expr, restarts[[1]]) > 16: doWithOneRestart(return(expr), restart) > > Selection: 0 > > >> sessionInfo() > R version 2.14.2 (2012-02-29) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] CCl4_1.0.11 vsn_3.22.0 > arrayQualityMetrics_3.10.0 Agi4x44PreProcess_1.14.0 genefilter_1.36.0 > [6] annotate_1.32.3 AnnotationDbi_1.16.19 limma_3.10.3 > Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] affy_1.32.1 affyio_1.22.0 affyPLM_1.30.0 > beadarray_2.4.2 BiocInstaller_1.2.1 Biostrings_2.22.0 > [7] Cairo_1.5-1 cluster_1.14.2 colorspace_1.1-1 > DBI_0.2-5 grid_2.14.2 Hmisc_3.9-3 > [13] hwriter_1.3 IRanges_1.12.6 KernSmooth_2.23-7 > lattice_0.20-6 latticeExtra_0.6-19 plyr_1.7.1 > [19] preprocessCore_1.16.0 RColorBrewer_1.0-5 reshape2_1.2.1 > RSQLite_0.11.1 setRNG_2011.11-2 splines_2.14.2 > [25] stringr_0.6 survival_2.36-14 SVGAnnotation_0.9-0 > tools_2.14.2 XML_3.9-4.1 xtable_1.7-0 > [31] zlibbioc_1.0.1 > > 
______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:19}} From sdavis2 at mail.nih.gov Sun Jun 10 13:53:56 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sun, 10 Jun 2012 07:53:56 -0400 Subject: [BioC] SRAdb: is the database missing some entries? In-Reply-To: References: Message-ID: On Sat, Jun 9, 2012 at 8:20 PM, Ben Woodcroft wrote: > Hi, > > Firstly thanks to the creators of this very useful package. > > I've come across SRA identifiers that don't appear to be in the database (a > minority, but still). Here's a few: > > SRA036600 > DRX001436 > SRA049463 > ERA062401 > ERA062401 > > For example: >> library(SRAdb) >> sra_con = dbConnect(SQLite(),'SRAmetadb.sqlite') >> sraConvert(c('SRA036600'), sra_con= sra_con) > [1] submission study ? ? ?sample ? ? experiment run > <0 rows> (or 0-length row.names) > > However this isn't a bogus accession because I can see it on the NCBI SRA > website. > > I could be wrong but I don't think it is as simple as the metadata being > out of date because the submission dates are often relatively old > (SRA036600 was 2011-05-13) and there's metadata from more recent SRA > submissions in the SRAdb). Thanks, Ben. We'll look into it. Sorry for the inconvenience. Sean From curoli at gmail.com Sun Jun 10 16:51:14 2012 From: curoli at gmail.com (Oliver Ruebenacker) Date: Sun, 10 Jun 2012 10:51:14 -0400 Subject: [BioC] DEGraph graph format? In-Reply-To: References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org> Message-ID: Hello Laurent, Thanks for the response. To my knowledge, Rredland is not maintained any more. Since I am very familiar with Java (and OpenRDF Sesame), I was thinking of using RJava to drive Sesame. Can you explain some more why you choose not to use Rredland? It seems almost certainly relevant for the design of a BioPAX package. 
Did the issue have to do with separating the actual reaction network
from other types of information?

Thanks!

Take care
Oliver

On Sun, Jun 10, 2012 at 1:44 AM, laurent jacob wrote:
> Hi Oliver,
>
> 2012/6/9 Oliver Ruebenacker :
>> Hello,
>>
>> Does this package use a BioPAX parser that could be used for other
>> BioPAX data? Does it read level 2 or level 3 or both?
>>
>> I'm looking for a BioPAX parser and would be willing to help build
>> one if none exists.
>
> I think Rredland
> (http://bioconductor.org/packages/2.4/bioc/html/Rredland.html) parses
> biopax in R.
>
> I didn't use it for this package because after that I also needed to
> convert the read structure to graphNEL objects, which was not
> straightforward. I read the biopax files in Cytoscape, then used
> RCytoscape (http://bioconductor.org/packages/release/bioc/html/RCytoscape.html)
> to read the networks built by Cytoscape.
>
> Best,
>
> Laurent
>
> --
> Laurent Jacob
> Department of Statistics
> UC Berkeley
> http://cbio.ensmp.fr/~ljacob

--
Oliver Ruebenacker
Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker)
Knowomics, The Bioinformatics Network (http://www.knowomics.com)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)

From laurent.jacob at gmail.com Sun Jun 10 22:21:44 2012
From: laurent.jacob at gmail.com (laurent jacob)
Date: Sun, 10 Jun 2012 13:21:44 -0700
Subject: [BioC] DEGraph graph format?
In-Reply-To:
References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org>
Message-ID:

Hi Oliver,

2012/6/10 Oliver Ruebenacker :
> Can you explain some more why you choose not to use Rredland? It
> seems almost certainly relevant for the design of a BioPAX package.
> Did the issue have to do with separating the actual reaction network
> from other types of information?
If I remember correctly, I wasn't sure how to reconstruct the network
structure from Rredland output, in particular how to recover the edges
from the parsed BioPAX statements. It's not that the parsing done by
Rredland was problematic; it's more that additional work (which seemed
non-trivial to me at the time) was required to convert the output to
graph objects.

Are you planning to develop a Bioconductor package or an independent
Java parser? If you plan on using Java, you may want to look at what
the mskcc people did for their Cytoscape plugin, which I used for my
own package: http://cbio.mskcc.org/cytoscape/plugins/biopax/

Best,

Laurent

--
Laurent Jacob
Department of Statistics
UC Berkeley
http://cbio.ensmp.fr/~ljacob

From curoli at gmail.com Sun Jun 10 23:27:24 2012
From: curoli at gmail.com (Oliver Ruebenacker)
Date: Sun, 10 Jun 2012 17:27:24 -0400
Subject: [BioC] DEGraph graph format?
In-Reply-To:
References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org>
Message-ID:

Hello Laurent,

On Sun, Jun 10, 2012 at 4:21 PM, laurent jacob wrote:
> Hi Oliver,
>
> 2012/6/10 Oliver Ruebenacker :
>
>> Can you explain some more why you choose not to use Rredland? It
>> seems almost certainly relevant for the design of a BioPAX package.
>> Did the issue have to do with separating the actual reaction network
>> from other types of information?
>
> If I remember well, I wasn't sure how to reconstruct the network
> structure from Rredland output, in particular how to recover the edges
> from the parsed BioPAX statements. It's not that the parsing done by
> Rredland was problematic, it's more that additional work (which seemed
> non-trivial to me at the time) was required to convert the output to
> graph objects.

What kind of graph are you constructing? Is it a bipartite graph
where every physical entity is a node and every reaction is a node,
and you connect every reaction with its reactants, catalysts and
products?
In BioPAX Level 2, getting that graph was quite tricky, but Level 3
is much easier (although catalysts and modulators are still a bit
awkward).

> Are you planning to develop a bioconductor package or an independent
> Java parser? If you plan on using Java, you may want to look at what
> the mskcc people did for their Cytoscape plugin, which I used for my
> own package: http://cbio.mskcc.org/cytoscape/plugins/biopax/

I'd love to submit to Bioconductor, if that is not too difficult.

The Cytoscape plugin is based on PaxTools, the official BioPAX Java
library. The reason I'm not using PaxTools is that I'm combining
BioPAX data with other data, such as SBPAX, which is not yet part of
BioPAX (although hopefully will be soon), and is therefore not (yet)
supported by PaxTools.

Take care
Oliver

--
Oliver Ruebenacker
Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker)
Knowomics, The Bioinformatics Network (http://www.knowomics.com)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)

From mattia.pelizzola at gmail.com Mon Jun 11 10:45:51 2012
From: mattia.pelizzola at gmail.com (mattia pelizzola)
Date: Mon, 11 Jun 2012 10:45:51 +0200
Subject: [BioC] [JOB] NGS and computational genomics postdoc positions
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From amitonbiochem at gmail.com Mon Jun 11 14:18:27 2012
From: amitonbiochem at gmail.com (Amit Kumar Kashyap)
Date: Mon, 11 Jun 2012 14:18:27 +0200
Subject: [BioC] Kegg pathways overlay with log fold change values
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From alessandro.brozzi at gmail.com Mon Jun 11 14:24:18 2012
From: alessandro.brozzi at gmail.com (alessandro brozzi)
Date: Mon, 11 Jun 2012 14:24:18 +0200
Subject: [BioC] Kegg pathways overlay with log fold change values
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From pshannon at fhcrc.org Mon Jun 11 14:42:40 2012
From: pshannon at fhcrc.org (Paul Shannon)
Date: Mon, 11 Jun 2012 05:42:40 -0700
Subject: [BioC] Kegg pathways overlay with log fold change values
In-Reply-To:
References:
Message-ID: <2EBC841A-CE7D-47A2-9B4B-B36C18AF2FA1@fhcrc.org>

With RCytoscape

   http://bioconductor.org/packages/2.10/bioc/html/RCytoscape.html
   http://db.systemsbiology.net:8080/cytoscape/RCytoscape/versions/current/index.html

it is straightforward to

  1) Display a KEGG network retrieved by Jitao David Zhang's KEGGgraph package
  2) control all visual attributes of the graph, using simple 'vizmap' rules and data (e.g., from limma and SPIA)

For instance, to control node color:

   cw <- new.CytoscapeWindow ('setNodeColorRule.test', graph=makeSimpleGraph())
   displayGraph (cw)
   layoutNetwork (cw, 'jgraph-spring')
   control.points <- c (-3.0, 0.0, 3.0)   # typical range of log-fold-change ratio values
     # paint negative values shades of green, positive values shades of
     # red, out-of-range low values are dark green; out-of-range high
     # values are dark red
   node.colors <- c ("#00AA00", "#00FF00", "#FFFFFF", "#FF0000", "#AA0000")
   setNodeColorRule (cw, node.attribute.name='lfc', control.points, node.colors, mode='interpolate')
   redraw (cw)   # applies all current vizmap rules

Reproducing the original KEGG layout can be a bit of work. In principle, we could obtain the coordinates from the kgml file, and apply them with repeated calls to RCytoscape's setNodePosition method. In practice, I usually use a combination of automatic layout, and manual layout, saving the result via a call to the saveLayout method.
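As a rough illustration of what an 'interpolate' color rule with these control points amounts to, here is a minimal base-R sketch. It is not RCytoscape code and makes no claim about how RCytoscape or Cytoscape computes colors internally: lfc.to.color is a hypothetical helper, and it assumes that the first and last entries of node.colors are reserved for out-of-range values while the middle entries are interpolated linearly, per RGB channel, between the control points.

```r
# Hypothetical helper (not part of RCytoscape): map log-fold-change values
# to hex colors by linear interpolation between control points.
# Assumptions: length(node.colors) == length(control.points) + 2, the first
# and last colors are used for values below and above the control range,
# and lfc contains no NA values.
lfc.to.color <- function(lfc, control.points, node.colors) {
  n <- length(control.points)
  stopifnot(length(node.colors) == n + 2)
  # 3 x n matrix of red, green, blue values for the in-range colors
  in.range <- col2rgb(node.colors[2:(n + 1)])
  # interpolate one channel; rule = 2 clamps values outside the range
  channel <- function(i) {
    approx(control.points, in.range[i, ], xout = lfc, rule = 2)$y
  }
  colors <- rgb(channel(1), channel(2), channel(3), maxColorValue = 255)
  # overwrite out-of-range values with the reserved end colors
  colors[lfc < min(control.points)] <- node.colors[1]
  colors[lfc > max(control.points)] <- node.colors[n + 2]
  colors
}

control.points <- c(-3.0, 0.0, 3.0)
node.colors <- c("#00AA00", "#00FF00", "#FFFFFF", "#FF0000", "#AA0000")
example.colors <- lfc.to.color(c(-5, -3, 0, 3, 5), control.points, node.colors)
```

The result is a character vector with one hex color per input value, which can be handy for checking a color scale (for example on limma log-fold-changes) before applying the rule to a network.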
For a fully worked up example, see http://rcytoscape.systemsbiology.net/versions/current/gallery/fendtYeast/

 - Paul

On Jun 11, 2012, at 5:24 AM, alessandro brozzi wrote:

> hi,
> you might try this:
>
> http://bioinformatics.oxfordjournals.org/content/25/11/1470.short
> http://bioc.ism.ac.jp/2.8/bioc/html/KEGGgraph.html
>
> Alex
> On Mon, Jun 11, 2012 at 2:18 PM, Amit Kumar Kashyap
> wrote:
>
>> Hello all,
>>
>> does anyone knows how we can overlay expression data { coloring with up
>> and down genes } in pathways using R package or any other script.
>>
>> Especially the results from SPIA package , we get the kegg link like this
>> ...
>>
>>
>> http://www.genome.jp/dbget-bin/show_pathway?hsa05200+999+22798+3915+3673+3675+6776+4233+2260+2263+7039+3082+5899+5337+5579+112399+2034+9063+1029+355+650+5743+596+7188+1612+2113+4254
>>
>>
>> Now I would like to color up and down regulated genes according to
>> differentially expressed genes from limma package results.
>>
>>
>> Thanks in advance.
>>
>> Kind Regards
>> -Amit Kumar
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From stefanie.tauber at univie.ac.at Mon Jun 11 14:43:40 2012
From: stefanie.tauber at univie.ac.at (Stefanie)
Date: Mon, 11 Jun 2012 12:43:40 +0000
Subject: [BioC] makeTranscriptDbFromBiomart error
References: <66F91D70-6173-4567-89DE-0DE60A7EFD0B@univie.ac.at> <4FD0E790.1000101@fhcrc.org>
Message-ID:

Hi Marc,

thanks for the background info, always nice to
know the source of some errors or warnings...
In any case using makeTranscriptDbFromUCSC() is fine for me,

Thanks!
Stefanie

From smitra at liverpool.ac.uk Mon Jun 11 17:00:19 2012
From: smitra at liverpool.ac.uk (suparna mitra)
Date: Mon, 11 Jun 2012 16:00:19 +0100
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From friedman at cancercenter.columbia.edu Mon Jun 11 17:03:07 2012
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Mon, 11 Jun 2012 11:03:07 -0400
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix
In-Reply-To:
References:
Message-ID: <8B2302CD-71B2-42A1-9BD3-87CDEEA8D06F@cancercenter.columbia.edu>

Dear Suparna,

Both GCRMA and MAS 5 require mismatch probes, which are not present on
HuGene-1_0-st-v1, so they cannot be used on this chip.

Best wishes,
Rich
------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

"School is an evil plot to suppress my individuality"
Rose Friedman, age15

On Jun 11, 2012, at 11:00 AM, suparna mitra wrote:

> Hi,
> I am very new to Bioconductor and microarrays. I have limited experience
> with R.
> I am trying to use Bioconductor for analyzing HuGene-1_0-st-v1 microarray
> data. I selected different normalization methods (rma, gcrma and
> mas5).
> For my data rma worked but for for gcrma and mas5 both I have problem. > For gcrma it gives me a error like: Computing affinitiesError: > length(prlen) == 1 is not TRUE > > And for mas 5 it seems working but I get only a whole list of NA. > > Here is what I have done. > >> mydata <- ReadAffy() >> mydata > AffyBatch object > size of arrays=1050x1050 features (16 kb) > cdf=HuGene-1_0-st-v1 (32321 affyids) > number of samples=18 > number of genes=32321 > annotation=hugene10stv1 > >> eset <- rma(mydata) > Background correcting > Normalizing > Calculating Expression >> eset_justrma=justRMA() >> eset_mas5 <- mas5(mydata) > background correction: mas > PM/MM correction : mas > expression values: mas > background correcting...done. > 32321 ids to be processed > | | > |####################| >> eset_gcrma <- gcrma(mydata) > Adjusting for optical effect..................Done. > Computing affinitiesError: length(prlen) == 1 is not TRUE Here is > the > error > >> eset_justrma # this worked fine > ExpressionSet (storageMode: lockedEnvironment) > assayData: 32321 features, 18 samples > element names: exprs, se.exprs > protocolData > sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st- > v1).CEL ... > MC9_(HuGene-1_0-st-v1).CEL (18 total) > varLabels: ScanDate > varMetadata: labelDescription > phenoData > sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st- > v1).CEL ... > MC9_(HuGene-1_0-st-v1).CEL (18 total) > varLabels: sample > varMetadata: labelDescription > featureData: none > experimentData: use 'experimentData(object)' > Annotation: hugene10stv1 >> eset_mas5 # this seems worked fine but resulted all NA > ExpressionSet (storageMode: lockedEnvironment) > assayData: 32321 features, 18 samples > element names: exprs, se.exprs > protocolData > sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st- > v1).CEL ... 
> MC9_(HuGene-1_0-st-v1).CEL (18 total) > varLabels: ScanDate > varMetadata: labelDescription > phenoData > sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st- > v1).CEL ... > MC9_(HuGene-1_0-st-v1).CEL (18 total) > varLabels: sample > varMetadata: labelDescription > featureData: none > experimentData: use 'experimentData(object)' > Annotation: hugene10stv1 >> write.exprs(eset_justrma,file="eset_justrma.csv") >> write.exprs(eset_mas5,file="eset_mas5.csv") >> write.exprs(eset,file="eset.csv") > > Any help in this will be really great. Being a novice, I am very > sorry if I > am doing any silly mistake. > Thanks a lot, > Suparna. > > -- > Dr. Suparna Mitra > Wolfson Centre for Personalised Medicine > Department of Molecular and Clinical Pharmacology > Institute of Translational Medicine University of Liverpool > Block A: Waterhouse Buildings, L69 3GL Liverpool > > Tel. +44 (0)151 795 5414, Internal ext: 55414 > M: +44 (0) 7523228621 > Email id: smitra at liverpool.ac.uk > Alternative Email id: suparna.mitra.sm at gmail.com > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From smitra at liverpool.ac.uk Mon Jun 11 17:06:57 2012 From: smitra at liverpool.ac.uk (suparna mitra) Date: Mon, 11 Jun 2012 16:06:57 +0100 Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix In-Reply-To: <8B2302CD-71B2-42A1-9BD3-87CDEEA8D06F@cancercenter.columbia.edu> References: <8B2302CD-71B2-42A1-9BD3-87CDEEA8D06F@cancercenter.columbia.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From cstrato at aon.at Mon Jun 11 17:12:39 2012
From: cstrato at aon.at (cstrato)
Date: Mon, 11 Jun 2012 17:12:39 +0200
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix
In-Reply-To:
References:
Message-ID: <4FD60AE7.1000601@aon.at>

Dear Suparna,

In principle you could use package xps to run mas5 for HuGene arrays;
however, I would suggest staying with rma (which xps also supports).

Best regards
Christian
_._._._._._._._._._._._._._._._._._
C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
V.i.e.n.n.a A.u.s.t.r.i.a
e.m.a.i.l: cstrato at aon.at
_._._._._._._._._._._._._._._._._._

On 6/11/12 5:00 PM, suparna mitra wrote:
> Hi,
> I am very new to Bioconductor and microarrays. I have limited experience
> with R.
> I am trying to use Bioconductor for analyzing HuGene-1_0-st-v1 microarray
> data. I selected different normalization methods (rma, gcrma and mas5).
> For my data rma worked, but for gcrma and mas5 both I have problems.
> For gcrma it gives me an error like: Computing affinitiesError:
> length(prlen) == 1 is not TRUE
>
> And for mas 5 it seems to be working, but I get only a whole list of NA.
>
> Here is what I have done.
>
>> mydata <- ReadAffy()
>> mydata
> AffyBatch object
> size of arrays=1050x1050 features (16 kb)
> cdf=HuGene-1_0-st-v1 (32321 affyids)
> number of samples=18
> number of genes=32321
> annotation=hugene10stv1
>
>> eset <- rma(mydata)
> Background correcting
> Normalizing
> Calculating Expression
>> eset_justrma=justRMA()
>> eset_mas5 <- mas5(mydata)
> background correction: mas
> PM/MM correction : mas
> expression values: mas
> background correcting...done.
> 32321 ids to be processed
> |                    |
> |####################|
>> eset_gcrma <- gcrma(mydata)
> Adjusting for optical effect..................Done.
> Computing affinitiesError: length(prlen) == 1 is not TRUE    Here is the error
>
>> eset_justrma # this worked fine
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 32321 features, 18 samples
>   element names: exprs, se.exprs
> protocolData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: ScanDate
>   varMetadata: labelDescription
> phenoData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: sample
>   varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation: hugene10stv1
>> eset_mas5 # this seems worked fine but resulted all NA
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 32321 features, 18 samples
>   element names: exprs, se.exprs
> protocolData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: ScanDate
>   varMetadata: labelDescription
> phenoData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: sample
>   varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation: hugene10stv1
>> write.exprs(eset_justrma,file="eset_justrma.csv")
>> write.exprs(eset_mas5,file="eset_mas5.csv")
>> write.exprs(eset,file="eset.csv")
>
> Any help in this will be really great. Being a novice, I am very sorry if I
> am doing any silly mistake.
> Thanks a lot,
> Suparna.
>

From jmacdon at uw.edu Mon Jun 11 17:18:47 2012
From: jmacdon at uw.edu (James W. MacDonald)
Date: Mon, 11 Jun 2012 11:18:47 -0400
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix
In-Reply-To:
References:
Message-ID: <4FD60C57.6070709@uw.edu>

Hi Suparna,

On 6/11/2012 11:00 AM, suparna mitra wrote:
> Hi,
> I am very new to Bioconductor and microarrays.
> I have limited experience with R.
> I am trying to use Bioconductor for analyzing HuGene-1_0-st-v1 microarray
> data. I selected different normalization methods (rma, gcrma and mas5).
> For my data rma worked, but for gcrma and mas5 both I have problems.

This is to be expected. The HuGene array is a PM-only design, so mas5()
won't work (because the mas5 algorithm requires subtracting MM from PM,
and there are no MM probes). In addition, the default for gcrma() is to
estimate the background for probes based on the GC content, using bins of
MM probes. Again, without any MM probes, this won't work.

Note however that gcrma() has an 'NCprobe' argument that you can use to
specify an index of negative control probes. This is a non-trivial thing
to do, and may be beyond your abilities if you are very new to R and BioC.

To get the index of these probes, you will need to decide which probes
are negative control probes, and then you can use the indexProbes()
function, passing a character vector of the negative control probes to
the genenames argument. This will return a list of indices for each
probeset that you can unlist() prior to feeding in to gcrma().

Or you could just stick with rma(), like the vast majority of people do.

Best,

Jim

> For gcrma it gives me an error like: Computing affinitiesError:
> length(prlen) == 1 is not TRUE
>
> And for mas 5 it seems to be working, but I get only a whole list of NA.
>
> Here is what I have done.
>
>> mydata <- ReadAffy()
>> mydata
> AffyBatch object
> size of arrays=1050x1050 features (16 kb)
> cdf=HuGene-1_0-st-v1 (32321 affyids)
> number of samples=18
> number of genes=32321
> annotation=hugene10stv1
>
>> eset <- rma(mydata)
> Background correcting
> Normalizing
> Calculating Expression
>> eset_justrma=justRMA()
>> eset_mas5 <- mas5(mydata)
> background correction: mas
> PM/MM correction : mas
> expression values: mas
> background correcting...done.
> 32321 ids to be processed
> |                    |
> |####################|
>> eset_gcrma <- gcrma(mydata)
> Adjusting for optical effect..................Done.
> Computing affinitiesError: length(prlen) == 1 is not TRUE    Here is the error
>
>> eset_justrma # this worked fine
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 32321 features, 18 samples
>   element names: exprs, se.exprs
> protocolData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: ScanDate
>   varMetadata: labelDescription
> phenoData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: sample
>   varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation: hugene10stv1
>> eset_mas5 # this seems worked fine but resulted all NA
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 32321 features, 18 samples
>   element names: exprs, se.exprs
> protocolData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: ScanDate
>   varMetadata: labelDescription
> phenoData
>   sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ...
>     MC9_(HuGene-1_0-st-v1).CEL (18 total)
>   varLabels: sample
>   varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation: hugene10stv1
>> write.exprs(eset_justrma,file="eset_justrma.csv")
>> write.exprs(eset_mas5,file="eset_mas5.csv")
>> write.exprs(eset,file="eset.csv")
> Any help in this will be really great. Being a novice, I am very sorry if I
> am doing any silly mistake.
> Thanks a lot,
> Suparna.
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

From friedman at cancercenter.columbia.edu Mon Jun 11 17:24:27 2012
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Mon, 11 Jun 2012 11:24:27 -0400
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix NEGATIVE CONTROL PROBES
In-Reply-To: <4FD60C57.6070709@uw.edu>
References: <4FD60C57.6070709@uw.edu>
Message-ID: <626E2627-5B97-4A1C-84E2-9566ACA604A6@cancercenter.columbia.edu>

Jim and list,

Thank you for bringing the NCprobe option to my attention.
I did not know that it had been implemented.

Does anyone out there have a list of negative control probes for
Human st 1.0 and for Mouse st 1.0?

Thanks and best wishes,
Rich

On Jun 11, 2012, at 11:18 AM, James W. MacDonald wrote:

> Hi Suparna,
>
> On 6/11/2012 11:00 AM, suparna mitra wrote:
>> Hi,
>> I am very new to Bioconductor and microarrays. I have limited experience
>> with R.
>> I am trying to use Bioconductor for analyzing HuGene-1_0-st-v1 microarray
>> data. I selected different normalization methods (rma, gcrma and mas5).
>> For my data rma worked, but for gcrma and mas5 both I have problems.
>
> This is to be expected. The HuGene array is a PM-only design, so
> mas5() won't work (because the mas5 algorithm requires subtracting
> MM from PM, and there are no MM probes). In addition, the default
> for gcrma() is to estimate the background for probes based on the GC
> content, using bins of MM probes. Again, without any MM probes, this
> won't work.
>
> Note however that gcrma() has an 'NCprobe' argument that you can use
> to specify an index of negative control probes. This is a non-trivial
> thing to do, and may be beyond your abilities if you are very new to
> R and BioC.
>
> To get the index of these probes, you will need to decide which
> probes are negative control probes, and then you can use the
> indexProbes() function, passing a character vector of the negative
> control probes to the genenames argument. This will return a list of
> indices for each probeset that you can unlist() prior to feeding in
> to gcrma().
>
> Or you could just stick with rma(), like the vast majority of people do.
>
> Best,
>
> Jim

From smitra at liverpool.ac.uk Mon Jun 11 17:48:07 2012
From: smitra at liverpool.ac.uk (suparna mitra)
Date: Mon, 11 Jun 2012 16:48:07 +0100
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix
In-Reply-To: <4FD60C57.6070709@uw.edu>
References: <4FD60C57.6070709@uw.edu>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From jmacdon at uw.edu Mon Jun 11 17:49:42 2012
From: jmacdon at uw.edu (James W. MacDonald)
Date: Mon, 11 Jun 2012 11:49:42 -0400
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix NEGATIVE CONTROL PROBES
In-Reply-To: <626E2627-5B97-4A1C-84E2-9566ACA604A6@cancercenter.columbia.edu>
References: <4FD60C57.6070709@uw.edu> <626E2627-5B97-4A1C-84E2-9566ACA604A6@cancercenter.columbia.edu>
Message-ID: <4FD61396.5080006@uw.edu>

Hi Rich,

On 6/11/2012 11:24 AM, Richard Friedman wrote:
> Jim and list,
>
> Thank you for bringing the NCprobe option to my attention.
> I did not know that it had been implemented.
>
> Does anyone out there have a list of negative control probes for
> Human st 1.0 and for Mouse st 1.0?

No, but it's easy enough to get.

> library(pd.hugene.1.0.st.v1)
> con <- db(pd.hugene.1.0.st.v1)
> dbListTables(con)
[1] "chrom_dict" "core_mps"   "featureSet" "level_dict" "pmfeature"
[6] "table_info" "type_dict"
> dbGetQuery(con, "select * from type_dict;")
  type                   type_id
1    1                      main
2    2             control->affx
3    3             control->chip
4    4 control->bgp->antigenomic
5    5     control->bgp->genomic
6    6            normgene->exon
7    7          normgene->intron
8    8  rescue->FLmRNA->unmapped

So let's say we want to call just bgp probes background. Now I happen to
know we want the featureSet table, but you can look to see what is in
each table using dbListFields()

> dbListFields(con, "featureSet")
 [1] "fsetid"                "strand"                "start"
 [4] "stop"                  "transcript_cluster_id" "exon_id"
 [7] "crosshyb_type"         "level"                 "chrom"
[10] "type"

You can also do something like

dbGetQuery(con, "select * from featureSet limit 10;")

to get an idea what is in a given table.

So to get the probesets we want,

> x <- dbGetQuery(con, "select fsetid from featureSet where type in ('4','5');")
> head(x)
   fsetid
1 7892601
2 7892698
3 7892756
4 7892815
5 7892916
6 7892943

Now there may be a further complication that I don't have the time or
desire to check out.
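If transcript-level IDs are needed, the select-by-type query can be combined with a join, and it may help to try the statement on a toy database first. What follows is only a sketch under stated assumptions: the in-memory tables and all their rows are made up, the DBI and RSQLite packages are assumed to be installed, and the core_mps column names used here (meta_fsetid, fsetid) are an assumption that should be checked with dbListFields(con, "core_mps") before running anything against the real annotation package.

```r
library(DBI)

# Toy stand-in for the annotation database (hypothetical rows, not real IDs).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "featureSet",
             data.frame(fsetid = c(101L, 102L, 103L, 104L),
                        type   = c(1L, 4L, 5L, 4L)))
# Assumed layout: one row per probeset, pointing at its transcript-level ID.
dbWriteTable(con, "core_mps",
             data.frame(meta_fsetid = c(9001L, 9002L, 9002L, 9003L),
                        fsetid      = c(101L, 102L, 103L, 104L)))

# One query: keep the background probesets (type 4 or 5) and return the
# distinct transcript-level IDs they map to.
bgp <- dbGetQuery(con, "
  SELECT DISTINCT m.meta_fsetid
  FROM featureSet AS f
  INNER JOIN core_mps AS m ON f.fsetid = m.fsetid
  WHERE f.type IN (4, 5)
  ORDER BY m.meta_fsetid;")

dbDisconnect(con)
```

Selecting f.fsetid instead and dropping the join gives back the plain probeset-level list, as in the query on featureSet above.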
These are probeset level IDs, and most people do things at the transcript
level (and if you are doing gcrma() this is all you can do). So I don't
know if you need to convert these fsetids to meta_fsetids, which are
transcript level probesets. If so, you can map from fsetid to meta_fsetid
using the core_mps table.

I leave doing that mapping up to you, grasshopper. As a further test of
your SQL awesomeness, you could figure out how to use an inner join
statement so you can get the meta_fsetids in one database query. Knowing
how to do that sort of thing can come in really handy - you can even link
different .db packages to do cross-database queries, which can make your
life much better if you need to do some complex mappings. There are some
examples in one of the AnnotationDbi vignettes.

Best,

Jim

>
> Thanks and best wishes,
> Rich
>
>
> On Jun 11, 2012, at 11:18 AM, James W. MacDonald wrote:
>
>> Hi Suparna,
>>
>> On 6/11/2012 11:00 AM, suparna mitra wrote:
>>> Hi,
>>> I am very new to Bioconductor and microarrays. I have limited experience
>>> with R.
>>> I am trying to use Bioconductor for analyzing HuGene-1_0-st-v1 microarray
>>> data. I selected different normalization methods (rma, gcrma and mas5).
>>> For my data rma worked, but for gcrma and mas5 both I have problems.
>>
>> This is to be expected. The HuGene array is a PM-only design, so
>> mas5() won't work (because the mas5 algorithm requires subtracting MM
>> from PM, and there are no MM probes). In addition, the default for
>> gcrma() is to estimate the background for probes based on the GC
>> content, using bins of MM probes. Again, without any MM probes, this
>> won't work.
>>
>> Note however that gcrma() has an 'NCprobe' argument that you can use
>> to specify an index of negative control probes. This is a non-trivial
>> thing to do, and may be beyond your abilities if you are very new to
>> R and BioC.
>>
>> To get the index of these probes, you will need to decide which
>> probes are negative control probes, and then you can use the
>> indexProbes() function, passing a character vector of the negative
>> control probes to the genenames argument. This will return a list of
>> indices for each probeset that you can unlist() prior to feeding in
>> to gcrma().
>>
>> Or you could just stick with rma(), like the vast majority of people do.
>>
>> Best,
>>
>> Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

From MEC at stowers.org Mon Jun 11 18:53:05 2012
From: MEC at stowers.org (Cook, Malcolm)
Date: Mon, 11 Jun 2012 11:53:05 -0500
Subject: [BioC] SRAdb: is the database missing some entries?
In-Reply-To:
Message-ID:

Hi,

In case it helps, I thought I might remind us of our prior similar
experience:

https://stat.ethz.ch/pipermail/bioc-devel/2011-October/002849.html

Perhaps the solution this time will be similar..

Cheers,

~Malcolm

On 6/10/12 6:53 AM, "Sean Davis" wrote:

>On Sat, Jun 9, 2012 at 8:20 PM, Ben Woodcroft wrote:
>> Hi,
>>
>> Firstly thanks to the creators of this very useful package.
>>
>> I've come across SRA identifiers that don't appear to be in the
>> database (a minority, but still). Here's a few:
>>
>> SRA036600
>> DRX001436
>> SRA049463
>> ERA062401
>> ERA062401
>>
>> For example:
>>> library(SRAdb)
>>> sra_con = dbConnect(SQLite(),'SRAmetadb.sqlite')
>>> sraConvert(c('SRA036600'), sra_con= sra_con)
>> [1] submission study sample experiment run
>> <0 rows> (or 0-length row.names)
>>
>> However this isn't a bogus accession because I can see it on the NCBI
>> SRA website.
>>
>> I could be wrong but I don't think it is as simple as the metadata being
>> out of date because the submission dates are often relatively old
>> (SRA036600 was 2011-05-13) and there's metadata from more recent SRA
>> submissions in the SRAdb).
>
>Thanks, Ben.
>
>We'll look into it. Sorry for the inconvenience.
>
>Sean
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor

From smitra at liverpool.ac.uk Mon Jun 11 14:41:53 2012
From: smitra at liverpool.ac.uk (suparna mitra)
Date: Mon, 11 Jun 2012 13:41:53 +0100
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From smitra at liverpool.ac.uk Mon Jun 11 16:53:03 2012
From: smitra at liverpool.ac.uk (suparna mitra)
Date: Mon, 11 Jun 2012 15:53:03 +0100
Subject: [BioC] gcrma problem while processing HuGene-1_0-st-v1 genechip from Affymetrix
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From jamespower371 at gmail.com Mon Jun 11 17:49:01 2012
From: jamespower371 at gmail.com (james power)
Date: Mon, 11 Jun 2012 16:49:01 +0100
Subject: [BioC] statistics from snpStats objects
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From amitonbiochem at gmail.com Mon Jun 11 19:14:46 2012
From: amitonbiochem at gmail.com (Amit Kumar Kashyap)
Date: Mon, 11 Jun 2012 19:14:46 +0200
Subject: [BioC] Kegg pathways overlay with log fold change values
In-Reply-To: <2EBC841A-CE7D-47A2-9B4B-B36C18AF2FA1@fhcrc.org>
References: <2EBC841A-CE7D-47A2-9B4B-B36C18AF2FA1@fhcrc.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From joseph.barry at embl.de Mon Jun 11 20:37:07 2012
From: joseph.barry at embl.de (Joseph Barry)
Date: Mon, 11 Jun 2012 20:37:07 +0200
Subject: [BioC] Question regarding cellhts2 output
In-Reply-To: <907543CDE7D2764C84BA88D7EB0890480E8AD403@EX-MBX1.ad.tgen.org>
References: <907543CDE7D2764C84BA88D7EB0890480E8AD18E@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD403@EX-MBX1.ad.tgen.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From joseph.barry at embl.de Mon Jun 11 20:53:26 2012
From: joseph.barry at embl.de (Joseph Barry)
Date: Mon, 11 Jun 2012 20:53:26 +0200
Subject: [BioC] Question regarding cellhts2 output
In-Reply-To:
References: <907543CDE7D2764C84BA88D7EB0890480E8AD18E@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD403@EX-MBX1.ad.tgen.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From laurent.jacob at gmail.com Mon Jun 11 21:05:59 2012
From: laurent.jacob at gmail.com (laurent jacob)
Date: Mon, 11 Jun 2012 12:05:59 -0700
Subject: [BioC] DEGraph graph format?
In-Reply-To:
References: <68bd42f3-34e5-4c01-90ff-19b07d5220d1@zimbra2.fhcrc.org> <314fb72a-08e5-4eb2-884b-d0222c232127@zimbra2.fhcrc.org>
Message-ID:

Hi Oliver,

2012/6/10 Oliver Ruebenacker :
> What kind of graph are you constructing? Is it a bi-partite graph
> where every physical entity is a node and every reaction is a node,
> and you connect every reaction with its reactants, catalysts and
> products?

Ultimately, the graph I construct has nodes corresponding exclusively
to genes, with only one node per gene, and edges corresponding to
expected correlations between gene expressions. For example, if the
protein encoded by gene A activates a complex which promotes the
expression of gene B, I draw a positive edge between gene A and gene
B.

But as a first step I load the graph that cytoscape builds from the
BioPAX files. An example of such a graph is given in the vignette
http://bioconductor.org/packages/2.8/bioc/html/NCIgraph.html.

> In BioPAX Level 2, getting that graph was quite tricky, but
> Level 3 is much easier (although catalysts and modulators are still a
> bit awkward).

The NCI PID people sent me BioPAX Level 2 data; I don't know if Level 3
is available for all their networks.

>> Are you planning to develop a bioconductor package or an independent
>> Java parser? If you plan on using Java, you may want to look at what
>> the mskcc people did for their Cytoscape plugin, which I used for my
>> own package: http://cbio.mskcc.org/cytoscape/plugins/biopax/
>
> I'd love to submit to Bioconductor, if that is not too difficult.

Great, good luck with the development.
Best,

Laurent

--
Laurent Jacob
Department of Statistics
UC Berkeley
http://cbio.ensmp.fr/~ljacob

From zheng.alex.fu at gmail.com Tue Jun 12 02:59:23 2012
From: zheng.alex.fu at gmail.com (ZHENG FU)
Date: Mon, 11 Jun 2012 20:59:23 -0400
Subject: [BioC] Question about MCRestimate
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From shalmom1 at gmail.com Tue Jun 12 11:29:40 2012
From: shalmom1 at gmail.com (mali salmon)
Date: Tue, 12 Jun 2012 12:29:40 +0300
Subject: [BioC] get genomic location
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From alyamahmoud at gmail.com Tue Jun 12 11:46:39 2012
From: alyamahmoud at gmail.com (Alyaa Mahmoud)
Date: Tue, 12 Jun 2012 12:46:39 +0300
Subject: [BioC] processed gene expression datasets
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From sdavis2 at mail.nih.gov Tue Jun 12 11:57:44 2012
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 12 Jun 2012 05:57:44 -0400
Subject: [BioC] processed gene expression datasets
In-Reply-To:
References:
Message-ID:

On Tue, Jun 12, 2012 at 5:46 AM, Alyaa Mahmoud wrote:
> Hi All
>
> I would highly appreciate it if anyone would direct me to some resource
> for already analysed gene expression data for different human conditions,
> e.g. cancer, diseases, etc. I just need to collect the expression status of
> as many genes as possible in different conditions.
>
> I know of GEO of course, but it's mostly .CEL files and I need
> ready-processed and analysed data.

Actually, the RAW data in GEO can include .CEL files, but the data in
GEO (available via the website or via the GEOquery package) are
processed but do not include analysis in terms of groups.
You may want to look at the gene expression atlas:

http://www.ebi.ac.uk/gxa/

From stefanie.tauber at univie.ac.at Tue Jun 12 12:09:21 2012
From: stefanie.tauber at univie.ac.at (Stefanie)
Date: Tue, 12 Jun 2012 10:09:21 +0000
Subject: [BioC] restrict transcriptDB object to "known" genes
Message-ID:

Dear list,

I would like to retrieve genomic ranges via makeTranscriptDbFromUCSC.
At the moment I just use:

humanDB = makeTranscriptDbFromUCSC(genome = "hg19", tablename = "ensGene")

Is there an automatic way to restrict my humanDB to those genes that have
a "known" status (not novel)?

best,
stefanie

From stefanie.tauber at univie.ac.at Tue Jun 12 12:24:00 2012
From: stefanie.tauber at univie.ac.at (Stefanie)
Date: Tue, 12 Jun 2012 10:24:00 +0000
Subject: [BioC] restrict transcriptDB object to "known" genes
References:
Message-ID:

Hi,

I just managed to do it the following way:

I get the transcript IDs of all known genes by the following command:

library(biomaRt)
ensembl = useMart("ensembl")  # connect to the Ensembl BioMart first
ensembl = useDataset("hsapiens_gene_ensembl", mart = ensembl)
res = getBM(attributes = "ensembl_transcript_id", filters = "status",
            values = "known", mart = ensembl)

library(GenomicFeatures)
humanDb = makeTranscriptDbFromUCSC(genome = "hg19", tablename = "ensGene",
                                   transcript_ids = as.character(res[,1]))
tx = transcriptsBy(humanDb)

Works! :)

From alyamahmoud at gmail.com Tue Jun 12 12:52:53 2012
From: alyamahmoud at gmail.com (Alyaa Mahmoud)
Date: Tue, 12 Jun 2012 13:52:53 +0300
Subject: [BioC] processed gene expression datasets
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From january.weiner at mpiib-berlin.mpg.de Tue Jun 12 13:15:44 2012
From: january.weiner at mpiib-berlin.mpg.de (January Weiner)
Date: Tue, 12 Jun 2012 13:15:44 +0200
Subject: [BioC] Changing results depending on context
Message-ID:

Dear all,

I am evaluating a collection of two-color arrays.
There are several sets corresponding to different tissues, there is a
treatment (common for each tissue), and the treatment effect is analysed
independently in each of the tissues. I am analysing the data using limma.

The results in terms of the number of significant genes vary widely
depending on whether the tissues are analysed separately or jointly. Since
the matter is even a bit more complicated, I include only pseudocode
below. I guess that this depends on the background variation estimation
in eBayes, but the trend is not consistent, e.g. for one tissue, more
(many more!) significant results are found when analysed separately, and
for another tissue the results are much "better" (i.e. more significant
results) if analysed along with other data.

targets <- readTargets( "targets.txt" )

# Agilent two-color microarrays
rg.1 <- read.maimages( targets,
  columns= list( G= "gMedianSignal", Gb= "gBGMedianSignal",
                 R= "rMedianSignal", Rb= "rBGMedianSignal" ),
  annotation= c( "Row", "Col", "FeatureNum", "ControlType", "ProbeName",
                 "GeneName", "SystematicName", "Description" ) )

rg.2 <- backgroundCorrect( rg.1, method= "normexp", offset=50 )
rg.3 <- normalizeWithinArrays( rg.2, method= "loess" )
rg <- normalizeBetweenArrays( rg.3, method= "quantile" )
rg <- rg[ rg$genes$ControlType == 0, ]

# analysing all data together:
t2 <- targetsA2C( targets )
design <- model.matrix( ~ 0 + t2$Target )
colnames( design ) <- levels( t2$Target )
corfit <- intraspotCorrelation( rg, design )
fit <- lmscFit( rg, design, correlation= corfit$consensus.correlation )
cmtx <- makeContrasts( "A.T1-A.T2", "B.T1-B.T2", levels= design )
fit <- contrasts.fit( fit, cmtx )
fit <- eBayes( fit )

Above, A and B are tissues, T1 and T2 are treatments.
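As a rough illustration of why a joint fit and separate fits can give different test statistics for the same contrast, the base-R sketch below compares the residual standard deviation of a one-way model fitted to all groups with the one obtained per tissue. It is not limma's moderated statistic: the single "gene", the group size of 8 and the standard deviations are simulated assumptions, chosen only to show that the joint estimate pools the within-group variance of every group in the model.

```r
set.seed(1)

# Hypothetical data for one gene: 4 groups of 8 arrays each.
# Tissue A is simulated as noisy (sd = 2), tissue B as quiet (sd = 0.5).
n     <- 8
group <- factor(rep(c("A.T1", "A.T2", "B.T1", "B.T2"), each = n))
y     <- c(rnorm(n, 0, 2),   rnorm(n, 1, 2),
           rnorm(n, 0, 0.5), rnorm(n, 1, 0.5))

# Residual standard deviation of an ordinary (unmoderated) group-means fit
# restricted to the observations selected by 'keep'.
resid_sd <- function(keep) {
  g   <- droplevels(group[keep])
  fit <- lm(y[keep] ~ 0 + g)
  summary(fit)$sigma
}

sd_joint <- resid_sd(rep(TRUE, length(y)))   # all four groups in one model
sd_A     <- resid_sd(grepl("^A", group))     # tissue A alone
sd_B     <- resid_sd(grepl("^B", group))     # tissue B alone

# With equal group sizes the pooled variance is the average of the two
# tissue-specific variances, so the joint denominator is smaller than the
# separate one for the noisier tissue and larger for the quieter one.
print(c(joint = sd_joint, A = sd_A, B = sd_B))
```

Since the t-statistic divides the same log fold change by a quantity proportional to this standard deviation, a contrast within the noisier tissue looks more significant in the joint model and a contrast within the quieter tissue looks less significant.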
Here, some exemplary results showing the number of significant genes: adj.P.Val < 0.05: A.T1-A.T2: 5601 B.T1-B.T2: 3914 adj.P.Val < 1e-5 A.T1-A.T2: 672 B.T1-B.T2: 758 Now, I analyse the data separately: A.targets <- targets[ 1:16, ] A.rg <- rg[ 1:16, ] A.t2 <- targetsA2C( A.targets ) A.design <- model.matrix( ~ 0 + A.t2$Target ) colnames( A.design ) <- levels( A.t2$Target ) A.corfit <- intraspotCorrelation( A.rg, A.design ) A.fit <- lmscFit( A.rg, A.design, correlation= A.corfit$consensus.correlation ) A.cmtx <- makeContrasts( "A.T1-A.T2", levels= A.design ) A.fit <- contrasts.fit( A.fit, cmtx ) A.fit <- eBayes( A.fit ) B.targets <- targets[ 17:32, ] B.rg <- rg[ 17:32, ] B.t2 <- targetsA2C( B.targets ) B.design <- model.matrix( ~ 0 + B.t2$Target ) colnames( B.design ) <- levels( B.t2$Target ) B.corfit <- intraspotCorrelation( B.rg, B.design ) B.fit <- lmscFit( B.rg, B.design, correlation= B.corfit$consensus.correlation ) B.cmtx <- makeContrasts( "B.T1-B.T2", levels= B.design ) B.fit <- contrasts.fit( B.fit, cmtx ) B.fit <- eBayes( B.fit ) Numbers of significant data are now much, much different: adj.P.Val < 0.05: A.T1-A.T2: 3347 B.T1-B.T2: 5443 adj.P.Val < 1e-5 A.T1-A.T2: 108 B.T1-B.T2: 1102 six times less for tissue A, but almost 50% more for tissue B! How can this be? The log fold changes are stable (i.e., they do not change between the various sets), which is to be expected -- changing the context influences the moderated t statistics, but not the estimation of log fold change. However, the p-values are not simply smaller or larger, there is little correlation of the p-values from one context to another. I'm a bit lost in here. Kind regards, January -- -------- Dr. January Weiner 3 -------------------------------------- Max Planck Institute for Infection Biology Charit?platz 1 D-10117 Berlin, Germany Web?? : www.mpiib-berlin.mpg.de Tel? ?? : +49-30-28460514 Fax ? 
?: +49-30-28450505 From alyamahmoud at gmail.com Tue Jun 12 14:30:17 2012 From: alyamahmoud at gmail.com (Alyaa Mahmoud) Date: Tue, 12 Jun 2012 15:30:17 +0300 Subject: [BioC] processed gene expression datasets In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jmacdon at uw.edu Tue Jun 12 15:24:07 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Tue, 12 Jun 2012 09:24:07 -0400 Subject: [BioC] Changing results depending on context In-Reply-To: References: Message-ID: <4FD742F7.4000804@uw.edu> Hi January, When you analyze all data together, the denominator of your t-statistic is based on an average of the intra-group variability. In general this will tend to make an analysis more powerful, because you are using more data to compute the variance estimate (which is not a very efficient statistic, so more data tend to improve the estimate). Conversely, when you analyze the two groups separately, the denominator of your t-statistic is based only on the intra-group variance for the two groups under consideration. Without having you data in hand, I can only guess what is going on. However, to me the most likely cause of this is a high variance in one or more of your 'A' groups. This would explain why when you combine the analyses you get fewer genes in the 'B' group and more genes in the 'A' group, and vice versa when you do things separately. There are several ways to see if you have one or more problem chips in the A group. You could look at MA plots using plotMA(), you could use arrayWeights() or arrayWeightsSimple() to see if some of the arrays get severely down-weighted. You could also do a PCA plot of the normalized M values to look at the overall grouping structure. Best, Jim On 6/12/2012 7:15 AM, January Weiner wrote: > Dear all, > > I am evaluating a collection of two-color arrays. 
There are several > sets corresponding to different tissues, there is a treatment (common > for each tissue), and the treatment effect is analysed independently > in each of the tissues. I am analysing the data using limma. > > The results in terms of the number of significant genes vary widely > depending whether the tissues are analysed separately or jointly. > Since the matter is even a bit more complicated, I include only > pseudocode below. I guess that this depends on the background > variation estimation in eBayes, but the trend is not consistent, e.g. > for one tissue, more (many more!) significant results are found when > analysed separately, and for another tissue the results are much > "better" (ie. more singificant results) if analysed along other data. > > targets<- readTargets( "targets.txt" ) > # Agilent two-color microarrays > rg.1<- read.maimages( targets, columns= list( G= "gMedianSignal", > Gb= "gBGMedianSignal", R="rMedianSignal", Rb = "rBGMedianSignal"), > annotation = c("Row", "Col","FeatureNum", "ControlType","ProbeName", > "GeneName", "SystematicName", "Description" )) > rg.2<- backgroundCorrect( rg.1, method= "normexp", offset=50 ) > rg.3<- normalizeWithinArrays( rg.2, method= "loess" ) > rg<- normalizeBetweenArrays( rg.3, method= "quantile" ) > rg<- rg[ rg$genes$ControlType == 0, ] > > # analysing all data together: > t2<- targetsA2C( targets ) > design<- model.matrix( ~ 0 + t2$Target ) > colnames( design )<- levels( t2$Target ) > corfit<- intraspotCorrelation( rg, design ) > fit<- lmscFit( rg, design, correlation= corfit$consensus.correlation ) > cmtx<- makeContrasts( "A.T1-A.T2", "B.T1-B.T2", levels= design ) > fit<- contrasts.fit( fit, cmtx ) > fit<- eBayes( fit ) > > Above, A and B are tissues, T1 and T2 are treatments. 
> > Here, some exemplary results showing the number of significant genes: > > adj.P.Val< 0.05: > A.T1-A.T2: 5601 > B.T1-B.T2: 3914 > > adj.P.Val< 1e-5 > A.T1-A.T2: 672 > B.T1-B.T2: 758 > > Now, I analyse the data separately: > > A.targets<- targets[ 1:16, ] > A.rg<- rg[ 1:16, ] > A.t2<- targetsA2C( A.targets ) > A.design<- model.matrix( ~ 0 + A.t2$Target ) > colnames( A.design )<- levels( A.t2$Target ) > A.corfit<- intraspotCorrelation( A.rg, A.design ) > A.fit<- lmscFit( A.rg, A.design, correlation= A.corfit$consensus.correlation ) > A.cmtx<- makeContrasts( "A.T1-A.T2", levels= A.design ) > A.fit<- contrasts.fit( A.fit, cmtx ) > A.fit<- eBayes( A.fit ) > > > B.targets<- targets[ 17:32, ] > B.rg<- rg[ 17:32, ] > B.t2<- targetsA2C( B.targets ) > B.design<- model.matrix( ~ 0 + B.t2$Target ) > colnames( B.design )<- levels( B.t2$Target ) > B.corfit<- intraspotCorrelation( B.rg, B.design ) > B.fit<- lmscFit( B.rg, B.design, correlation= B.corfit$consensus.correlation ) > B.cmtx<- makeContrasts( "B.T1-B.T2", levels= B.design ) > B.fit<- contrasts.fit( B.fit, cmtx ) > B.fit<- eBayes( B.fit ) > > Numbers of significant data are now much, much different: > > adj.P.Val< 0.05: > A.T1-A.T2: 3347 > B.T1-B.T2: 5443 > > adj.P.Val< 1e-5 > A.T1-A.T2: 108 > B.T1-B.T2: 1102 > > six times less for tissue A, but almost 50% more for tissue B! How can this be? > > The log fold changes are stable (i.e., they do not change between the > various sets), which is to be expected -- changing the context > influences the moderated t statistics, but not the estimation of log > fold change. However, the p-values are not simply smaller or larger, > there is little correlation of the p-values from one context to > another. I'm a bit lost in here. > > Kind regards, > > January > -- James W. MacDonald, M.S. 
Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From guest at bioconductor.org Tue Jun 12 15:24:24 2012 From: guest at bioconductor.org (William Lappner [guest]) Date: Tue, 12 Jun 2012 06:24:24 -0700 (PDT) Subject: [BioC] Bioconductor IT Security polici Message-ID: <20120612132424.2D151133D10@mamba.fhcrc.org> How secure is the A4 reporting feature engine? -- output of sessionInfo(): Blah blah blah -- Sent via the guest posting facility at bioconductor.org. From chenyao.bioinfor at gmail.com Tue Jun 12 16:10:04 2012 From: chenyao.bioinfor at gmail.com (Yao Chen) Date: Tue, 12 Jun 2012 10:10:04 -0400 Subject: [BioC] Limma-include interaction term Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From johnscrn at med.umich.edu Tue Jun 12 13:58:44 2012 From: johnscrn at med.umich.edu (Johnson, Craig) Date: Tue, 12 Jun 2012 11:58:44 +0000 Subject: [BioC] importing iScan data into beadarray Message-ID: <95EF0FCAA4055F40B282433D3EA7399F29ECC3@UHEXMBSPR13.umhs.med.umich.edu> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From drnevich at illinois.edu Tue Jun 12 17:43:01 2012 From: drnevich at illinois.edu (Zadeh, Jenny Drnevich) Date: Tue, 12 Jun 2012 15:43:01 +0000 Subject: [BioC] Question about locked environments In-Reply-To: <4FD145BC.9050905@fhcrc.org> References: <4FD145BC.9050905@fhcrc.org> Message-ID: Hi Martin, Thanks for your suggestion about using assignInNamespace() instead of assign(). That seems to work, except the ResetEnvir() function no longer restores the original probes and probe sets in that R session. However, just quitting and re-starting R will restore the original probes/sets, so just use that if you need it, Michael. A text file with the revised code for RemoveProbes is attached. 
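For readers following the thread: the "locked binding" error quoted below can be reproduced in miniature with base R alone. This is only a sketch under stated assumptions. The environment `e` and the object name `probe` are made up for illustration and stand in for the attached package environment and `soybeanprobe`. It uses `lockBinding()` and `unlockBinding()` rather than `assignInNamespace()`, so it shows the mechanism behind the error, not the revised RemoveProbes code.

```r
# Hypothetical stand-in for a package environment holding a probe table.
e <- new.env()
assign("probe", data.frame(x = 1:3, y = 4:6), envir = e)

# Mimic what happens to objects provided by a loaded package: lock the binding.
lockBinding("probe", e)

# assign() on a locked binding fails; capture the message instead of stopping.
msg <- tryCatch({
  assign("probe", data.frame(x = 1:2, y = 4:5), envir = e)
  "assigned"
}, error = function(err) conditionMessage(err))
print(msg)

# After unlocking the binding, the same assignment goes through.
unlockBinding("probe", e)
assign("probe", data.frame(x = 1:2, y = 4:5), envir = e)
print(nrow(get("probe", envir = e)))
```

Whether unlocking or assignInNamespace() is the better route in real code is a judgment call; both change an object that the package author intended to be read-only, and the change only lasts for the current R session.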
Thanks again, Jenny -----Original Message----- From: Martin Morgan [mailto:mtmorgan at fhcrc.org] Sent: Thursday, June 07, 2012 7:22 PM To: Zadeh, Jenny Drnevich Cc: Seidl, M.F. (Michael); bioconductor at r-project.org; Ariel Chernomoretz Subject: Re: [BioC] Question about locked environments On 06/07/2012 07:20 AM, Zadeh, Jenny Drnevich wrote: > Hi Michael, > > I'm sorry you are having trouble with the RemoveProbes() function I posted the BioC mailing list many years ago. I have not had to use that function myself in years, and did not know it wasn't working with newer versions of R. I didn't write the original code, Ariel Chernomoretz did. I only modified it, and I'm not sure I know enough to solve the problem. I'm posting this to the BioC mailing list to see if anyone can help. Below is my reproducible code (link to download the "RemoveProbes.RData" file is below ), showing where the problem occurs. It appears that the environments containing the Affymetrix probe and probe set information that the code is trying to change in now locked. I have no idea if there is a way to overcome this. > > Thanks in advance to anyone for any help, Jenny > > https://netfiles.uiuc.edu/xythoswfs/webui/_xy-42144579_2-t_YuabdiYC > (link expires 7/7/12) > >> library(affy) > Loading required package: BiocGenerics > > Attaching package: 'BiocGenerics' > > The following object(s) are masked from 'package:stats': > > xtabs > > The following object(s) are masked from 'package:base': > > anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int, > pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, > sapply, setdiff, table, tapply, union, unique > > Loading required package: Biobase > Welcome to Bioconductor > > Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages > 'citation("pkgname")'. 
> >> load("RemoveProbes.RData") >> ls() > [1] "nonsoygenes" "RemoveProbes" "ResetEnvir" "soygenes" >> >> cleancdf<- "soybean" >> >> ResetEnvir(cleancdf) > Loading required package: soybeancdf > Loading required package: AnnotationDbi > > Loading required package: soybeanprobe >> >> RemoveProbes(listOutProbeSets=nonsoygenes, cleancdf=cleancdf) > Error in assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) : > cannot change value of locked binding for 'soybeanprobe' >> >> debug(RemoveProbes) >> RemoveProbes(listOutProbeSets=nonsoygenes, cleancdf=cleancdf) > debugging in: RemoveProbes(listOutProbeSets = nonsoygenes, cleancdf = > cleancdf) > debug: { > cdfpackagename<- paste(cleancdf, "cdf", sep = "") > probepackagename<- paste(cleancdf, "probe", sep = "") > require(cdfpackagename, character.only = TRUE) > require(probepackagename, character.only = TRUE) > probe.env.orig<- get(probepackagename) > if (!is.null(listOutProbes)) { > probes<- unlist(lapply(listOutProbes, function(x) { > a<- strsplit(x, "at") > aux1<- paste(a[[1]][1], "at", sep = "") > aux2<- as.integer(a[[1]][2]) > c(aux1, aux2) > })) > n1<- as.character(probes[seq(1, (length(probes)/2)) * > 2 - 1]) > n2<- as.integer(probes[seq(1, (length(probes)/2)) * > 2]) > probes<- data.frame(I(n1), n2) > probes[, 1]<- as.character(probes[, 1]) > probes[, 2]<- as.integer(probes[, 2]) > pset<- unique(probes[, 1]) > for (i in seq(along = pset)) { > ii<- grep(pset[i], probes[, 1]) > iout<- probes[ii, 2] > a<- get(pset[i], env = get(cdfpackagename)) > a<- a[-iout, ] > assign(pset[i], a, env = get(cdfpackagename)) > } > } > if (!is.null(listOutProbeSets)) { > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > } > tmp<- get("xy2indices", paste("package:", cdfpackagename, > sep = "")) > newAB<- new("AffyBatch", cdfName = cleancdf) > pmIndex<- unlist(indexProbes(newAB, "pm")) > subIndex<- match(tmp(probe.env.orig$x, probe.env.orig$y, > cdf = cdfpackagename), pmIndex) > rm(newAB) > iNA<- which(is.na(subIndex)) > 
if (length(iNA)> 0) { > ipos<- grep(probepackagename, search()) > assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) I think you can replace assign() with assignInNamespace(). I don't know whether that is a good idea or not... From ?assignInNamespace They should not be used in production code. but I don't think this is any more dire than what the original code was doing, before the introduction of package name spaces. Martin > } > } > Browse[2]> > debug: cdfpackagename<- paste(cleancdf, "cdf", sep = "") Browse[2]> > debug: probepackagename<- paste(cleancdf, "probe", sep = "") > Browse[2]> > debug: require(cdfpackagename, character.only = TRUE) Browse[2]> > debug: require(probepackagename, character.only = TRUE) Browse[2]> > debug: probe.env.orig<- get(probepackagename) Browse[2]> > debug: if (!is.null(listOutProbes)) { > probes<- unlist(lapply(listOutProbes, function(x) { > a<- strsplit(x, "at") > aux1<- paste(a[[1]][1], "at", sep = "") > aux2<- as.integer(a[[1]][2]) > c(aux1, aux2) > })) > n1<- as.character(probes[seq(1, (length(probes)/2)) * 2 - > 1]) > n2<- as.integer(probes[seq(1, (length(probes)/2)) * 2]) > probes<- data.frame(I(n1), n2) > probes[, 1]<- as.character(probes[, 1]) > probes[, 2]<- as.integer(probes[, 2]) > pset<- unique(probes[, 1]) > for (i in seq(along = pset)) { > ii<- grep(pset[i], probes[, 1]) > iout<- probes[ii, 2] > a<- get(pset[i], env = get(cdfpackagename)) > a<- a[-iout, ] > assign(pset[i], a, env = get(cdfpackagename)) > } > } > Browse[2]> > debug: NULL > Browse[2]> > debug: if (!is.null(listOutProbeSets)) { > rm(list = listOutProbeSets, envir = get(cdfpackagename)) } > Browse[2]> > debug: rm(list = listOutProbeSets, envir = get(cdfpackagename)) > Browse[2]> > debug: tmp<- get("xy2indices", paste("package:", cdfpackagename, sep = > "")) Browse[2]> > debug: newAB<- new("AffyBatch", cdfName = cleancdf) Browse[2]> > debug: pmIndex<- unlist(indexProbes(newAB, "pm")) Browse[2]> > debug: subIndex<- match(tmp(probe.env.orig$x, 
probe.env.orig$y, cdf = cdfpackagename), > pmIndex) > Browse[2]> > debug: rm(newAB) > Browse[2]> > debug: iNA<- which(is.na(subIndex)) > Browse[2]> > debug: if (length(iNA)> 0) { > ipos<- grep(probepackagename, search()) > assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) } > Browse[2]> > debug: ipos<- grep(probepackagename, search()) Browse[2]> > debug: assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) > > #The line above is what causes the error > > Browse[2]> probepackagename > [1] "soybeanprobe" > There were 50 or more warnings (use warnings() to see the first 50) > Browse[2]> warnings()[1:3] $`object 'AFFX-BioB-3_at' not found` > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > > $`object 'AFFX-BioB-5_at' not found` > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > > $`object 'AFFX-BioB-M_at' not found` > rm(list = listOutProbeSets, envir = get(cdfpackagename)) > > #Not sure what the above warnings mean or if they are related > > Browse[2]> ?assign > starting httpd help server ... 
done > Browse[2]> ?lockBinding > Browse[2]> environmentIsLocked(as.environment(ipos)) > [1] TRUE > > Browse[2]> assign(probepackagename, probe.env.orig[-iNA, ], pos = > ipos) Error in assign(probepackagename, probe.env.orig[-iNA, ], pos = ipos) : > cannot change value of locked binding for 'soybeanprobe' > In addition: There were 50 or more warnings (use warnings() to see the > first 50) >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] soybeanprobe_2.10.0 soybeancdf_2.10.0 AnnotationDbi_1.18.1 affy_1.34.0 Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affyio_1.24.0 BiocInstaller_1.4.6 DBI_0.2-5 IRanges_1.14.3 preprocessCore_1.18.0 RSQLite_0.11.1 stats4_2.15.0 > [8] tools_2.15.0 zlibbioc_1.2.0 > > > > > Jenny Drnevich, Ph.D. > > Functional Genomics Bioinformatics Specialist W.M. Keck Center for > Comparative and Functional Genomics Roy J. Carver Biotechnology Center > High Performance Biological Computing Program University of Illinois, > Urbana-Champaign > > 330 ERML > 1201 W. Gregory Dr. > Urbana, IL 61801 > USA > > NOTE NEW PHONE NUMBER > ph: 217-300-6543 > fax: 217-265-5066 > e-mail: drnevich at illinois.edu > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. 
PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From mtmorgan at fhcrc.org Tue Jun 12 17:46:18 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Tue, 12 Jun 2012 08:46:18 -0700 Subject: [BioC] Bioconductor IT Security polici In-Reply-To: <20120612132424.2D151133D10@mamba.fhcrc.org> References: <20120612132424.2D151133D10@mamba.fhcrc.org> Message-ID: <4FD7644A.90400@fhcrc.org> On 06/12/2012 06:24 AM, William Lappner [guest] wrote: > > How secure is the A4 reporting feature engine? I can't speak for a4 (? case matters in R), but in terms of your subject line Bioconductor provides no guarantee of 'security'. This http://www.r-project.org/certification.html discusses compliance and validation of (base) R, which is quite far removed from your question but the most relevant document I know of. Martin > > -- output of sessionInfo(): > > Blah blah blah > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From guido.leoni at gmail.com Tue Jun 12 18:00:41 2012 From: guido.leoni at gmail.com (Guido Leoni) Date: Tue, 12 Jun 2012 18:00:41 +0200 Subject: [BioC] Question about sampling Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jmacdon at uw.edu Tue Jun 12 18:21:24 2012 From: jmacdon at uw.edu (James W. 
MacDonald)
Date: Tue, 12 Jun 2012 12:21:24 -0400
Subject: [BioC] Limma-include interaction term
In-Reply-To:
References:
Message-ID: <4FD76C84.5070307@uw.edu>

Hi Jack,

On 6/12/2012 10:10 AM, Yao Chen wrote:
> Dear All,
>
> I try to find differential expressed genes between treat and untreated
> samples, and also I want to include the age effects.
>
> The design matrix is like this:
>
> treat untreated age
> 1 0 30
> 0 1 40
> 1 0 35
>
>
> The "treat" is factor, but "age" is continuous. How can I set the
> "cont.matrix"?

Pretty much just like you (or at least I) would expect:

contrast <- makeContrasts(treat - untreated, levels = design)

But note that the design you are specifying allows different intercepts, but the slope is assumed to be the same for treated and untreated. If you want to allow different slopes as well, you need to introduce an age:treatment interaction term. Here I am assuming you have more than three samples.

Best,

Jim

>
> Thanks,
>
> Jack
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

From chenyao.bioinfor at gmail.com Tue Jun 12 18:57:01 2012
From: chenyao.bioinfor at gmail.com (Yao Chen)
Date: Tue, 12 Jun 2012 12:57:01 -0400
Subject: [BioC] Limma-include interaction term
In-Reply-To: <4FD76C84.5070307@uw.edu>
References: <4FD76C84.5070307@uw.edu>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From jmacdon at uw.edu Tue Jun 12 19:33:43 2012
From: jmacdon at uw.edu (James W. MacDonald)
Date: Tue, 12 Jun 2012 13:33:43 -0400
Subject: [BioC] Limma-include interaction term
In-Reply-To:
References: <4FD76C84.5070307@uw.edu>
Message-ID: <4FD77D77.2010104@uw.edu>

Hi Jack,

The conventional method is to use the model.matrix() function. I have no idea what your data look like, so here is a random example:

> treat <- factor(rep(0:1, each = 5))
> treat
 [1] 0 0 0 0 0 1 1 1 1 1
Levels: 0 1
> age <- sample(25:35, 10, TRUE)
> age
 [1] 32 30 32 35 29 26 27 25 33 34
> model.matrix(~treat*age)
   (Intercept) treat1 age treat1:age
1            1      0  32          0
2            1      0  30          0
3            1      0  32          0
4            1      0  35          0
5            1      0  29          0
6            1      1  26         26
7            1      1  27         27
8            1      1  25         25
9            1      1  33         33
10           1      1  34         34
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$treat
[1] "contr.treatment"

Note that this uses a different parameterization. In this case the treat1 coefficient is the difference between the treated and untreated samples (so you wouldn't specify a contrasts matrix, you just do lmFit() and then eBayes()). The treat1:age coefficient captures the difference between the slopes for the treated and untreated samples.

So topTable(fit2, coef=2) gives you genes that are differentially expressed between treated and untreated, and topTable(fit2, coef=4) gives you genes where the change in expression at different ages varies between treated and untreated subjects.

Best,

Jim

On 6/12/2012 12:57 PM, Yao Chen wrote:
> Thanks, James
>
> How to include "age:treatment" interaction in the design matrix?
>
> Jack
>
> 2012/6/12 James W. MacDonald
>
> Hi Jack,
>
>
> On 6/12/2012 10:10 AM, Yao Chen wrote:
>
> Dear All,
>
> I try to find differential expressed genes between treat and
> untreated
> samples, and also I want to include the age effects.
>
> The design matrix is like this:
>
> treat untreated age
> 1 0 30
> 0 1 40
> 1 0 35
>
>
> The "treat" is factor, but "age" is continuous. How can I set the
> "cont.matrix"?
> > > Pretty much just like you (or at least I) would expect: > > contrast <- makeContrasts(treat - untreat, levels = design) > > But note that the design you are specifying allows different > intercepts, but the slope is assumed to be the same for treated > and untreated. If you want to allow different slopes as well, you > need to introduce an age:treatment interaction term. Here I am > assuming you have more than three samples. > > Best, > > Jim > > > > Thanks, > > Jack > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From chenyao.bioinfor at gmail.com Tue Jun 12 20:22:23 2012 From: chenyao.bioinfor at gmail.com (Yao Chen) Date: Tue, 12 Jun 2012 14:22:23 -0400 Subject: [BioC] Limma-include interaction term In-Reply-To: <4FD78633.8030902@uw.edu> References: <4FD76C84.5070307@uw.edu> <4FD77D77.2010104@uw.edu> <4FD78633.8030902@uw.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jmacdon at uw.edu Tue Jun 12 21:14:44 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Tue, 12 Jun 2012 15:14:44 -0400 Subject: [BioC] Limma-include interaction term In-Reply-To: References: <4FD76C84.5070307@uw.edu> <4FD77D77.2010104@uw.edu> <4FD78633.8030902@uw.edu> Message-ID: <4FD79524.4010100@uw.edu> Hi Jack, On 6/12/2012 2:22 PM, Yao Chen wrote: > I am confused. 
> If I want to get differential expressed genes between
> treatment and control with age interaction. Which one should I use
> :topTable (fit2, coef=2) or topTable (fit2, coef=4) .

I have no idea what you are asking, so I will just try to restate things.

If you fit a model with treatment and age, then any gene with a significant treatment - control contrast is one where, after adjusting for the age of the subjects, there is a difference in expression between treated and control subjects. One way to conceptualize this model is that you are fitting two lines to your data, one for treated and one for control, but you are constraining these lines to have the same slope (i.e., they are parallel).

If you fit a model that includes a treatment:age interaction, you are doing the same exact thing, only now you are fitting two lines and allowing the slopes to vary as well.

So in the example I gave you, coef 2 tells you which genes are differentially expressed between treatment and control, after adjusting for age of the subjects and allowing the slopes to vary. Coef 4 tells you whether or not the slopes are different for the treated and control subjects.

The reason coef 4 might be interesting is that it will allow you to find genes that (as an example) are expressed at the same level in untreated subjects regardless of age, but in treated subjects the expression increases dramatically as a function of age.

Best,

Jim

>
> Thanks,
>
> Jack
>
> 2012/6/12 James W. MacDonald
>
> Hi Jack,
>
>
> On 6/12/2012 1:54 PM, Yao Chen wrote:
>
> Thanks James. That's exactly what I want to know.
>
> But I am not sure I fully understand the differential
> expressed genes in topTable. For (fit2, coef=2), did I get the
> genes without considering treat:age interaction, as my
> previous design matrix . And (fit2, coef=4) gives me the genes
> considering treat:age interation.
>
>
> No.
When you fit a model with a bunch of coefficients, a given > coefficient measures the marginal effect of the coefficient after > accounting for all other coefficients in the model. > > In conventional linear modeling (where you aren't fitting > thousands of models at once), you would probably fit a model with > and without the interaction term and then test to see if the > interaction term is significant. This is difficult to do in the > context of a microarray analysis, so people generally just throw a > bunch of coefficients in a model and look for significant genes. > > If you then wanted to do some other tests with a subset of your > genes I suppose you could, but people generally pick 'interesting' > genes and go to functional studies. > > Best, > > Jim > > > > Jack > > 2012/6/12 James W. MacDonald >> > > Hi Jack, > > The conventional method is to use the model.matrix() > function. I > have no idea what your data look like, so here is a random > example: > > > treat <- factor(rep(0:1, each = 5)) > > treat > [1] 0 0 0 0 0 1 1 1 1 1 > Levels: 0 1 > > age <- sample(25:35, 10, TRUE) > > age > [1] 32 30 32 35 29 26 27 25 33 34 > > model.matrix(~treat*age) > (Intercept) treat1 age treat1:age > 1 1 0 32 0 > 2 1 0 30 0 > 3 1 0 32 0 > 4 1 0 35 0 > 5 1 0 29 0 > 6 1 1 26 26 > 7 1 1 27 27 > 8 1 1 25 25 > 9 1 1 33 33 > 10 1 1 34 34 > attr(,"assign") > [1] 0 1 2 3 > attr(,"contrasts") > attr(,"contrasts")$treat > [1] "contr.treatment" > > Note that this uses a different parameterization. In this > case the > treat1 coefficient is the difference between the treated and > untreated samples (so you wouldn't specify a > contrasts.matrix, you > just do lmFit() and then eBayes()). The treat1:age coefficient > captures the difference between the slopes for the treated and > untreated samples. 
> > So topTable(fit2, coef=2) gives you genes that are > differentially > expressed between treated and untreated and topTable(fit2, > coef=4) > gives you genes where the change in expression at different > ages > varies between treated and untreated subjects. > > Best, > > Jim > > > > > > On 6/12/2012 12:57 PM, Yao Chen wrote: > > Thanks, James > > How to include "age:treatment" interaction in the > design matrix? > > Jack > > 2012/6/12 James W. MacDonald > > > > > >>> > > > Hi Jack, > > > On 6/12/2012 10:10 AM, Yao Chen wrote: > > Dear All, > > I try to find differential expressed genes between > treat and > untreated > samples, and also I want to include the age effects. > > The design matrix is like this: > > treat untreated age > 1 0 30 > 0 1 40 > 1 0 35 > > > The "treat" is factor, but "age" is continuous. How > can I set the > "cont.matrix"? > > > Pretty much just like you (or at least I) would expect: > > contrast <- makeContrasts(treat - untreat, levels = > design) > > But note that the design you are specifying allows > different > intercepts, but the slope is assumed to be the same > for treated > and untreated. If you want to allow different slopes as > well, you > need to introduce an age:treatment interaction term. > Here I am > assuming you have more than three samples. > > Best, > > Jim > > > > Thanks, > > Jack > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > > > > > >> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > > -- James W. MacDonald, M.S. 
> Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From weixiaokuan at yahoo.com Tue Jun 12 22:03:28 2012 From: weixiaokuan at yahoo.com (Xiaokuan Wei) Date: Tue, 12 Jun 2012 13:03:28 -0700 Subject: [BioC] where to find a host server with R Message-ID: <1339531408.46662.YahooMailNeo@web114209.mail.gq1.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Wed Jun 13 02:58:54 2012 From: guest at bioconductor.org (Sam McInturf [guest]) Date: Tue, 12 Jun 2012 17:58:54 -0700 (PDT) Subject: [BioC] Wheat annotation Message-ID: <20120613005854.5EFE8134449@mamba.fhcrc.org> Hello, I am working on a set of Affymetrix wheat array, with some very ancient annotations included. I have been looking, and many sequenced organisms have a xxx.db annotation package that is used in conjunction with annotationDBi. I have found the wheatcdf, which seems to describe the environment (probe mapping), but this does not seem to be the same information as the xxx.db has. Is there such a package, or is wheat without such annotations? Thanks! Sam -- output of sessionInfo(): = -- Sent via the guest posting facility at bioconductor.org. From hm3286 at gmail.com Wed Jun 13 04:42:53 2012 From: hm3286 at gmail.com (HIMANSHU MITTAL) Date: Wed, 13 Jun 2012 08:12:53 +0530 Subject: [BioC] Max Common Subgraph Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From tim.triche at gmail.com Wed Jun 13 05:31:22 2012 From: tim.triche at gmail.com (Tim Triche, Jr.) Date: Tue, 12 Jun 2012 20:31:22 -0700 Subject: [BioC] restrict transcriptDB object to "known" genes In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From pshannon at fhcrc.org Wed Jun 13 05:58:19 2012 From: pshannon at fhcrc.org (Paul Shannon) Date: Tue, 12 Jun 2012 20:58:19 -0700 Subject: [BioC] Max Common Subgraph In-Reply-To: References: Message-ID: RBGL would be worth taking a look at, if you haven't already. It is an R wrapper around the Boost Graph Library: http://www.boost.org/doc/libs/1_49_0/libs/graph/doc/index.html Try this: biocLite ('RBGL') library (RBGL) help(package='RBGL') Perhaps some elements of the solution may be found there. Methods on the BioC graph class may provide some other elements -- you may already have looked at this as well: help(package=graph) - Paul On Jun 12, 2012, at 7:42 PM, HIMANSHU MITTAL wrote: > Hello, > I want to implement the Maximum Common Subgraph(MCS) problem in R. > I have used igraph but it doesn't allow me to compare subgraphs on basis of > vertex or edge attributes(" labelled isomorphism") > > Is there any package in Bioconductor that has this feature or can in any > way make it easy to find the MCS on basis of attributes? > > Regards > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From donttrustben at gmail.com Wed Jun 13 07:46:26 2012 From: donttrustben at gmail.com (Ben Woodcroft) Date: Wed, 13 Jun 2012 15:46:26 +1000 Subject: [BioC] SRAdb: is the database missing some entries? In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL:
From Ingrid.Mercier at ipbs.fr Tue Jun 12 17:01:42 2012 From: Ingrid.Mercier at ipbs.fr (Ingrid Mercier) Date: Tue, 12 Jun 2012 17:01:42 +0200 Subject: [BioC] design matrix Limma design for paired t-test Message-ID: <4FD759D6.2060202@ipbs.fr>
An embedded and charset-unspecified text was scrubbed... Name: not available URL:
From phipson at wehi.EDU.AU Wed Jun 13 08:45:45 2012 From: phipson at wehi.EDU.AU (Belinda Phipson) Date: Wed, 13 Jun 2012 16:45:45 +1000 Subject: [BioC] design matrix Limma design for paired t-test In-Reply-To: <4FD759D6.2060202@ipbs.fr> References: <4FD759D6.2060202@ipbs.fr> Message-ID: <004201cd4930$2d7802b0$88680810$@edu.au>
Hi Ingrid,
The problem with your code is the following line:
> Time=Treat=factor(Targets$Time)
This chained assignment sets both Time and Treat to factor(Targets$Time), so Treat no longer holds the treatment factor.
Cheers, Belinda
-----Original Message----- From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Ingrid Mercier Sent: Wednesday, 13 June 2012 1:02 AM To: bioconductor at r-project.org; smyth at wehi.edu.au Subject: [BioC] design matrix Limma design for paired t-test
Dear list and Gordon,
I am having trouble computing a moderated paired t-test in the linear model. Here is my experimental plan:
I used a single-channel Agilent microarray.
2 types of cells: Control (S) and Treated (T)
Five human donors: 4-5-6-7-8
Two treatment times: 4 hours and 18 hours
I want to compare the differentially expressed genes between C and T at 4 hours and then at 18 hours.
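For readers following the diagnosis above, here is a minimal base-R sketch of what the chained assignment does to a design matrix. The toy `targets` data frame is a hypothetical stand-in for the `Targets` frame in this thread (two donors only), and nothing below calls limma; it only illustrates the aliasing and one way to avoid it.

```r
# Hypothetical stand-in for the Targets frame: 2 donors x 2 treatments x 2 times
targets <- data.frame(
  Treatment = rep(c("T", "C"), times = 4),
  Donor     = rep(c(4, 5), each = 4),
  Time      = rep(c(4, 4, 18, 18), times = 2)
)

Donor <- factor(targets$Donor)

# Problematic pattern: a chained assignment gives BOTH names the same value,
# so the treatment factor is overwritten by the time factor.
Treat <- factor(targets$Treatment, levels = c("C", "T"))
Time <- Treat <- factor(targets$Time)
bad <- model.matrix(~ Donor + Treat + Time)
aliased <- identical(bad[, "Treat18"], bad[, "Time18"])  # the two columns coincide

# Assigning the two factors separately keeps them distinct.
Treat <- factor(targets$Treatment, levels = c("C", "T"))
Time  <- factor(targets$Time)
good <- model.matrix(~ Donor + Treat + Time)
full_rank <- qr(good)$rank == ncol(good)  # every coefficient is estimable

print(colnames(good))
```

With the factors kept separate, the treatment effect sits in the `TreatT` column rather than in a duplicate of `Time18`. Note that in a design built this way the first column is the intercept, not a treatment contrast.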
Here is my design:

My targets frame is:

> Targets
   X       FileName                                           Treatment Donor Time
1  DC_4_4  US10463851_252665214446_S01_GE1_1010_Sep10_1_2.txt T         4     4
2  SC_4_4  US10463851_252665214448_S01_GE1_1010_Sep10_1_2.txt C         4     4
3  DC_18_4 US10463851_252665214447_S01_GE1_1010_Sep10_1_2.txt T         4     18
4  SC_18_4 US10463851_252665214444_S01_GE1_1010_Sep10_1_3.txt C         4     18
5  DC_4_5  US10463851_252665214448_S01_GE1_1010_Sep10_1_4.txt T         5     4
6  SC_4_5  US10463851_252665214444_S01_GE1_1010_Sep10_1_1.txt C         5     4
7  DC_18_5 US10463851_252665214446_S01_GE1_1010_Sep10_1_3.txt T         5     18
8  SC_18_5 US10463851_252665214447_S01_GE1_1010_Sep10_1_4.txt C         5     18
9  DC_4_6  US10463851_252665214445_S01_GE1_1010_Sep10_1_4.txt T         6     4
10 SC_4_6  US10463851_252665214447_S01_GE1_1010_Sep10_1_3.txt C         6     4
11 DC_18_6 US10463851_252665214448_S01_GE1_1010_Sep10_1_3.txt T         6     18
12 SC_18_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_3.txt C         6     18
13 DC_4_7  US10463851_252665214444_S01_GE1_1010_Sep10_1_4.txt T         7     4
14 SC_4_7  US10463851_252665214445_S01_GE1_1010_Sep10_1_2.txt C         7     4
15 DC_18_7 US10463851_252665214447_S01_GE1_1010_Sep10_1_1.txt T         7     18
16 SC_18_7 US10463851_252665214446_S01_GE1_1010_Sep10_1_1.txt C         7     18
17 DC_4_8  US10463851_252665214444_S01_GE1_1010_Sep10_1_2.txt T         8     4
18 SC_4_8  US10463851_252665214446_S01_GE1_1010_Sep10_1_4.txt C         8     4
19 DC_18_8 US10463851_252665214445_S01_GE1_1010_Sep10_1_1.txt T         8     18
20 SC_18_8 US10463851_252665214448_S01_GE1_1010_Sep10_1_1.txt C         8     18

Then I create my design matrix:

> Donor
 [1] 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8
Levels: 4 5 6 7 8
> Treat=factor(Targets$Treatment,levels=c("C","T"))
> Treat
 [1] T C T C T C T C T C T C T C T C T C T C
Levels: C T
> Time=Treat=factor(Targets$Time)
> Time
 [1] 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18
Levels: 4 18
> design=model.matrix(~Donor+Treat+Time)
> design
   (Intercept) Donor5 Donor6 Donor7 Donor8 Treat18 Time18
1            1      0      0      0      0       0      0
2            1      0      0      0      0       0      0
3            1      0      0      0      0       1      1
4            1      0      0      0      0       1      1
5            1      1      0      0      0       0      0
6            1      1      0      0      0       0      0
7            1      1      0      0      0       1      1
8            1      1      0      0      0       1      1
9            1      0      1      0      0       0      0
10           1      0      1      0      0       0      0
11           1      0      1      0      0       1      1
12           1      0      1      0      0       1      1
13           1      0      0      1      0       0      0
14           1      0      0      1      0       0      0
15           1      0      0      1      0       1      1
16           1      0      0      1      0       1      1
17           1      0      0      0      1       0      0
18           1      0      0      0      1       0      0
19           1      0      0      0      1       1      1
20           1      0      0      0      1       1      1
attr(,"assign")
[1] 0 1 1 1 1 2 3
attr(,"contrasts")
attr(,"contrasts")$Donor
[1] "contr.treatment"
attr(,"contrasts")$Treat
[1] "contr.treatment"
attr(,"contrasts")$Time
[1] "contr.treatment"

I think something is wrong in this design matrix, because the Treat18 column is identical to the Time18 column. I don't understand why. As a result, the following code fails, and the differentially expressed genes look odd. Can somebody help me? Thanks all.

> fit=lmFit(test_norm,design)
Coefficients not estimable: Time18
Warning message:
Partial NA coefficients for 34183 probe(s)
> fit2=eBayes(fit)
Warning message:
In ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim, :
  Estimation of var.prior failed - set to default value
> table = topTable(fit2,1, number=5000, p.value=0.05,adjust.method="BH",sort.by="logFC",lfc=2)
> head(table)
      ID            logFC    AveExpr  t         P.Value      adj.P.Val    B
6509  A_33_P3396434 18.44159 18.41239 245.14490 1.308161e-31 2.353520e-28 53.41519
22398 A_33_P3223592 18.25824 18.24591 242.75647 1.545005e-31 2.514901e-28 53.36821
10771 A_33_P3244165 18.21029 18.02229  90.76191 2.796577e-24 2.467615e-23 44.59915
6149  A_33_P3346552 18.14780 18.12098 207.18556 2.282464e-30 1.147374e-27 52.50960
23554 A_33_P3210160 18.08158 18.21026 239.64192 1.924175e-31 2.560908e-28 53.30521
20924 A_33_P3286278 18.04425 18.07312 179.72121 2.558128e-29 5.025546e-27 51.56876

Best, Ingrid
-- Ingrid MERCIER Mycobacterial Interactions with Host Cells Team Institute of Pharmacology & Structural Biology CNRS - University of Toulouse BP 64182 F-31077 Toulouse Cedex France Tel +33 (0)5 61 17 54 63
[[alternative HTML version deleted]]
_______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From aheider at trm.uni-leipzig.de Wed Jun 13 09:14:03 2012 From: aheider at trm.uni-leipzig.de (Andreas Heider) Date: Wed, 13 Jun 2012 09:14:03 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From aheider at trm.uni-leipzig.de Wed Jun 13 09:47:12 2012 From: aheider at trm.uni-leipzig.de (Andreas Heider) Date: Wed, 13 Jun 2012 09:47:12 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: <20120613092954.77903xxzifb183n6@webmail.ugent.be> References: <20120613092954.77903xxzifb183n6@webmail.ugent.be> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From stefanie.tauber at univie.ac.at Wed Jun 13 09:48:47 2012 From: stefanie.tauber at univie.ac.at (Stefanie Tauber) Date: Wed, 13 Jun 2012 09:48:47 +0200 Subject: [BioC] restrict transcriptDB object to "known" genes In-Reply-To: References: Message-ID: <484F7D45-9096-475C-8F83-109A8951A9E1@univie.ac.at> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From nsahgal at well.ox.ac.uk Wed Jun 13 12:40:41 2012 From: nsahgal at well.ox.ac.uk (Natasha Sahgal) Date: Wed, 13 Jun 2012 10:40:41 +0000 Subject: [BioC] paired analysis in RNA Seq data using DESeq In-Reply-To: <4FC7113B.7090102@embl.de> References: <4FC7113B.7090102@embl.de> Message-ID: Dear List, I have some RNASeq data, paired samples (n=4) for a 2 vs 2 comparison: not tumour-normal pairs. I was wondering if DESeq can now deal with paired analysis, especially using the multi-factor design section, as it could not earlier? 
Though I am aware that the edgeR package deals with paired analysis. Many Thanks, Natasha From nac at sanger.ac.uk Wed Jun 13 13:07:49 2012 From: nac at sanger.ac.uk (nathalie) Date: Wed, 13 Jun 2012 12:07:49 +0100 Subject: [BioC] TEQC package very slow Message-ID: <4FD87485.3050103@sanger.ac.uk> HI, I am analysing coverage data using TEQC package from bioC for quality assessment of target enrichment experiment . I am using a computer cluster farm to do the analysis and asked for large memory to be allocated, my bam files are 11 Gb in size and it seems that the analysis is taking very long, several hours, and then my session exit. Do I need to ask for this to be put on a long queue, more than 12 hours job? Do people use TEQC with large files? How can I be more efficient with this analysis? these are my commands: #get reads myread<-get.reads("reads.bam",filetype="bam") #get pair reads : at that point this will fail :in the doc it is stated " To run the function can be quite time consuming, depending on the number of reads" myreadpair<-reads2pairs(myread) #drop single reads myread<-myread[!(myread$ID %in% myreadpair$singleReads$ID), , drop=TRUE] I have used efficiently these functions on smaller files with miSeq data, but not yet with HiSeq ... 
Many thanks for sharing your experience in getting QC for large files efficiently Nathalie > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=C [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] TEQC_2.4.0 hwriter_1.3 Rsamtools_1.8.4 [4] Biostrings_2.24.1 GenomicRanges_1.8.3 IRanges_1.14.2 [7] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] Biobase_2.16.0 bitops_1.0-4.1 stats4_2.15.0 zlibbioc_1.2.0 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From vobencha at fhcrc.org Wed Jun 13 15:46:33 2012 From: vobencha at fhcrc.org (Valerie Obenchain) Date: Wed, 13 Jun 2012 06:46:33 -0700 Subject: [BioC] get genomic location In-Reply-To: References: Message-ID: <4FD899B9.2000408@fhcrc.org> Hi Mali, Unfortunately we don't have a function that converts from protein-based to genomic-based coordinates - at least I'm unaware of one. The conversion functions we do have are transcriptLocsToRefLocs() in GenomicFeatures and refLocsToLocalLocs() in VariantAnnotation. The first converts from transcript-based to genome-based and the second from genome-based to transcript and protein-based. Michael and I have plans to consolidate these coordinate-mapping functions into the single 'Map' generic in IRanges. This will likely be done for the next BioC release. Valerie On 06/12/2012 02:29 AM, mali salmon wrote: > Hello List > I have a basic question. > Is there an easy way to convert a location in protein and mRNA to genomic > location? 
(except of alignment...) > For example, I have a list of point mutations in amino acids positions > and/or positions in mRNA, and I would like to get their genomic locations. > Thanks > Mali > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > From jmacdon at uw.edu Wed Jun 13 16:07:04 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 13 Jun 2012 10:07:04 -0400 Subject: [BioC] Wheat annotation In-Reply-To: <20120613005854.5EFE8134449@mamba.fhcrc.org> References: <20120613005854.5EFE8134449@mamba.fhcrc.org> Message-ID: <4FD89E88.9050805@uw.edu> Hi Sam, On 6/12/2012 8:58 PM, Sam McInturf [guest] wrote: > Hello, > I am working on a set of Affymetrix wheat array, with some very ancient annotations included. I have been looking, and many sequenced organisms have a xxx.db annotation package that is used in conjunction with annotationDBi. I have found the wheatcdf, which seems to describe the environment (probe mapping), but this does not seem to be the same information as the xxx.db has. Is there such a package, or is wheat without such annotations? There is no such package. These packages depend on some fairly extensive infrastructure (they need an organism-level .db package, as well as an intermediate 'db0' package that Marc Carlson makes). Which packages have been supported is based on the cost/benefit where benefit is defined loosely as the number of end users who are likely to use the package. Unfortunately anopheles and yeast are the only plants to have jumped that hurdle so far. I checked the biomaRt package as well, and I don't see any support for Triticum there either. 
You could always just do something bootleg like downloading the annotations from Affy http://www.affymetrix.com/analysis/downloads/na32/ivt/wheat.na32.annot.csv.zip and parsing by hand. Best, Jim > > Thanks! > Sam > > -- output of sessionInfo(): > > = > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From vobencha at fhcrc.org Wed Jun 13 16:30:21 2012 From: vobencha at fhcrc.org (Valerie Obenchain) Date: Wed, 13 Jun 2012 07:30:21 -0700 Subject: [BioC] Calculate heterozygosity % using SNP genotype data In-Reply-To: References: Message-ID: <4FD8A3FD.6050308@fhcrc.org> Take a look at snpStats, SNPchip and genoset packages. Specifically, see the snpStats Intro Vignette http://bioconductor.org/packages/2.11/bioc/html/snpStats.html Depending on the origin of your data this thread may also be of interest - https://stat.ethz.ch/pipermail/bioconductor/2012-May/045636.html Valerie On 06/01/2012 02:50 PM, Yadav Sapkota wrote: > Hello, > > I am trying to validate few LOH regions using SNP genotype data. I am > assuming that if it is a LOH, it will contain predominantly homozygous > genotypes. For simplicity, I chose 15 SNPs per ~70 kb LOH region. > > Now I need to calculate the heterozygosity for LOHs in each samples using > genotype data of 15 SNPs. > > Does anyone know the way to calculate the heterozygous % per sample using a > set of SNP genotype data? > > Your help will be greatly appreciated. 
> > --Yadav > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > From lawrence.michael at gene.com Wed Jun 13 16:41:16 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Wed, 13 Jun 2012 07:41:16 -0700 Subject: [BioC] get genomic location In-Reply-To: <4FD899B9.2000408@fhcrc.org> References: <4FD899B9.2000408@fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jmacdon at uw.edu Wed Jun 13 16:47:03 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 13 Jun 2012 10:47:03 -0400 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: Message-ID: <4FD8A7E7.2080102@uw.edu> Hi Andreas, On 6/13/2012 3:14 AM, Andreas Heider wrote: > Dear mailing list, > I know this was on the list couple of times, and I think I read it all, but > actually I still don't get it right. So here is my problem: > > I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene 1.0 > ST) in a similar fashion to eg. HG-U133 arrays. > That means, I want to finally have it accessible as an ExpressionSet object > with a right Bioconductor annotation assigned. This should include GENE > SYMBOLS, RefSeq IDs and ENTREZ IDs. The problem here is that you want to do something that AFAIK isn't easy to do. The Gene ST arrays allow you to summarize all the probes that interrogate a particular transcript (e.g., all the exon-level probesets are collapsed to transcript level, and then you summarize). However, for the Exon ST arrays that isn't the case, unless there is something in xps to allow for that - I know next to nothing about that package, so Cristian Stratowa will have to chime in if I am missing something. 
For the Exon chips, you are always summarizing at the same probeset level, where there are <= 4 probes per probeset, and there can be any number of probesets that interrogate a given exon. Lots of these probesets interrogate regions that aren't even transcribed, according to current knowledge of the genome. When you choose core, extended or full probesets, you are just changing the number of probesets being used, not summarizing at a different level as with the Gene ST chip. So when you say you want gene symbols, refseq ids and gene ids, what exactly are you after? If a given probeset is in the intron of a gene do you want to annotate it as being part of that gene? How about if it is in the UTR (or really close to the UTR)? What do you want to do with the probesets where one or more of the probes binds in multiple positions in the genome? These are all questions that the exonmap package tries to consider, and it gets really complicated. That's why Affy went with the Gene ST chips - they unleashed the Exon chips on us and couldn't sell them because people were saying WTF do I do with this thing? I don't think there is an easy or obvious answer to your question. If you were to come up with what you think are reasonable answers to my questions, then it wouldn't be much work to extract the chr, start, end from the pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., findOverlaps()) to decide what regions are being interrogated, and annotate from there. Best, Jim > > I can import it as a AffyBatch and generate an ExpressionSet with the help > of the Xmap/exonmap supplied CDF, but there is no annotation attached to it. > > OR > > I can import the CEL files with the "oligo" package as a Exon Array object > and generate an ExpressionSet from it. > However in that case it still have no annotation. 
> > Surprisingly on the Bioconductor website there are all packages needed to > deal with Mouse Gene 1.0 ST arrays but the informtion to work with Mouse > Exon 1.0 ST arrays seems missing! > > What am I doing wrong here? Has someone else had such problems? > > Thanks in advance for your effort, > Andreas > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From nac at sanger.ac.uk Wed Jun 13 16:53:43 2012 From: nac at sanger.ac.uk (nathalie) Date: Wed, 13 Jun 2012 15:53:43 +0100 Subject: [BioC] TEQC package very slow In-Reply-To: <4FD87485.3050103@sanger.ac.uk> References: <4FD87485.3050103@sanger.ac.uk> Message-ID: <4FD8A977.20802@sanger.ac.uk> HI, This is the error message produced at the myreadpair<-reads2pairs(myread) stage after it running for 7 hours: > readpairs4_2_PigS<-reads2pairs(reads4_2_PigS) [1] "there were 1453928 reads found without matching second read, or whose second read matches to a different chromosome" Error in endoapply(reads, mergefun) : 'FUN' did not produce an endomorphism > Terminated that may help, thanks, On 13/06/12 12:07, nathalie wrote: > HI, > I am analysing coverage data using TEQC package from bioC for quality > assessment of target enrichment experiment . > I am using a computer cluster farm to do the analysis and asked for > large memory to be allocated, my bam files are 11 Gb in size and it > seems that the analysis is taking very long, several hours, and then > my session exit. Do I need to ask for this to be put on a long queue, > more than 12 hours job? Do people use TEQC with large files? 
How can I > be more efficient with this analysis? > these are my commands: > #get reads > myread<-get.reads("reads.bam",filetype="bam") > #get pair reads : at that point this will fail :in the doc it is > stated " To run the function can be quite time consuming, depending on > the number of reads" > myreadpair<-reads2pairs(myread) > > #drop single reads > myread<-myread[!(myread$ID %in% myreadpair$singleReads$ID), , drop=TRUE] > > > I have used efficiently these functions on smaller files with miSeq > data, but not yet with HiSeq ... > Many thanks for sharing your experience in getting QC for large files > efficiently > Nathalie > > > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=C > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] TEQC_2.4.0 hwriter_1.3 Rsamtools_1.8.4 > [4] Biostrings_2.24.1 GenomicRanges_1.8.3 IRanges_1.14.2 > [7] BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 stats4_2.15.0 zlibbioc_1.2.0 > -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From juliet.hannah at gmail.com Wed Jun 13 17:01:32 2012 From: juliet.hannah at gmail.com (Juliet Hannah) Date: Wed, 13 Jun 2012 11:01:32 -0400 Subject: [BioC] understanding multiples matches between probesets and entrezgene (biomart) Message-ID: All, I understand the concept of multiple probesets corresponding to one identifier. 
But what is the meaning of a probeset corresponding to multiple identifiers? And below, given that 220547_s_at has a match, why should another row be returned with NA. Did I happen to choose a few probesets where the gene definition is changing, or am I misunderstanding something else, such as the biomart syntax. Thanks, Juliet library("biomaRt") probeSets <- c("219666_at", "220547_s_at", "218034_at") ensembl = useMart("ensembl") ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl) getBM(attributes = c("affy_hg_u133a", "entrezgene"), filters = "affy_hg_u133a",values = probeSets, mart = ensembl) affy_hg_u133a entrezgene 1 220547_s_at 54537 2 218034_at 51024 3 220547_s_at NA 4 219666_at 64231 5 220547_s_at 414241 6 220547_s_at 439965 > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.12.0 BiocInstaller_1.4.6 loaded via a namespace (and not attached): [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4 From vobencha at fhcrc.org Wed Jun 13 17:40:03 2012 From: vobencha at fhcrc.org (Valerie Obenchain) Date: Wed, 13 Jun 2012 08:40:03 -0700 Subject: [BioC] BioC2012 deadlines Message-ID: <4FD8B453.6030806@fhcrc.org> A reminder of upcoming deadlines for the BioC2012 meeting in Seattle, July 24-25. Lab practical submissions are due June 15 and poster abstracts on July 1. Current submissions can be viewed on the web site https://secure.bioconductor.org/BioC2012/ There is also the option to give a 'flashlight' talk on Developer Day (July 23). 
These 10-15 min talks can be anything from current work, an introduction to new or developing packages, or ideas for future development directions. Questions or comments should be directed to biocworkshop at fhcrc.org Valerie From curoli at gmail.com Wed Jun 13 18:09:32 2012 From: curoli at gmail.com (Oliver Ruebenacker) Date: Wed, 13 Jun 2012 12:09:32 -0400 Subject: [BioC] Wheat annotation In-Reply-To: <4FD89E88.9050805@uw.edu> References: <20120613005854.5EFE8134449@mamba.fhcrc.org> <4FD89E88.9050805@uw.edu> Message-ID: Hello, On Wed, Jun 13, 2012 at 10:07 AM, James W. MacDonald wrote: > Unfortunately > anopheles and yeast are the only plants to have jumped that hurdle so far. Neither of these is a plant, though :) Take care Oliver -- Oliver Ruebenacker Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) Knowomics, The Bioinformatics Network (http://www.knowomics.com) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) From jmacdon at uw.edu Wed Jun 13 18:13:53 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 13 Jun 2012 12:13:53 -0400 Subject: [BioC] understanding multiples matches between probesets and entrezgene (biomart) In-Reply-To: References: Message-ID: <4FD8BC41.5070804@uw.edu> Hi Juliet, On 6/13/2012 11:01 AM, Juliet Hannah wrote: > All, > > I understand the concept of multiple probesets corresponding to one > identifier. But what is the meaning of > a probeset corresponding to multiple identifiers? And below, given > that 220547_s_at has a match, > why should another row be returned with NA. > > Did I happen to choose a few probesets where the gene definition is > changing, or am I misunderstanding > something else, such as the biomart syntax. I'm not sure about the NA being returned. That probably has something to do with how the Biomart database is set up. 
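As an aside for anyone tidying a result like the one quoted in this thread, here is a short base-R sketch of one possible cleanup: drop the NA rows, flag probesets that map to more than one Entrez ID, and collapse them to one row per probeset. The data frame `res` is typed in by hand from the output Juliet posted rather than fetched with getBM(), so it runs offline; joining IDs with ";" is only a convention assumed for this example, not something biomaRt does for you.

```r
# Result copied by hand from the getBM() output quoted in this thread
res <- data.frame(
  affy_hg_u133a = c("220547_s_at", "218034_at", "220547_s_at",
                    "219666_at", "220547_s_at", "220547_s_at"),
  entrezgene    = c(54537, 51024, NA, 64231, 414241, 439965),
  stringsAsFactors = FALSE
)

# Remove rows that carry no Entrez ID
res <- res[!is.na(res$entrezgene), ]

# Count Entrez IDs per probeset; more than one flags a multi-mapping probeset
n_ids <- table(res$affy_hg_u133a)
multi <- names(n_ids)[n_ids > 1]

# One row per probeset, with multiple IDs joined by ";"
collapsed <- aggregate(
  entrezgene ~ affy_hg_u133a, data = res,
  FUN = function(x) paste(sort(unique(x)), collapse = ";")
)

print(multi)
print(collapsed)
```

Whether to keep, collapse, or discard multi-mapping probesets is an analysis decision; the sketch only makes them easy to spot.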
As for the multiple genes per probeset, this has to do with the fact that a 25-mer isn't really long enough to distinguish between genes with relatively high homology. This is supposed to be reflected in the probeset ID, although things have changed quite a bit since UniGene build 133. The probeset you are showing below has a _s_at identifier, which indicates that it cross-hybridizes to multiple members of a related gene family (in this case the FAM35 gene family). There are other identifiers like the _x_at which indicates cross-hybridization to unrelated genes. http://www.affymetrix.com/support/help/faqs/hgu133/index.jsp Best, Jim > > Thanks, > > Juliet > > library("biomaRt") > probeSets<- c("219666_at", "220547_s_at", "218034_at") > ensembl = useMart("ensembl") > ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl) > > getBM(attributes = c("affy_hg_u133a", "entrezgene"), filters = > "affy_hg_u133a",values = probeSets, mart = ensembl) > > > affy_hg_u133a entrezgene > 1 220547_s_at 54537 > 2 218034_at 51024 > 3 220547_s_at NA > 4 219666_at 64231 > 5 220547_s_at 414241 > 6 220547_s_at 439965 > > > > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] biomaRt_2.12.0 BiocInstaller_1.4.6 > > loaded via a namespace (and not attached): > [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. 
MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
From jmacdon at uw.edu Wed Jun 13 18:16:41 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 13 Jun 2012 12:16:41 -0400 Subject: [BioC] Wheat annotation In-Reply-To: References: <20120613005854.5EFE8134449@mamba.fhcrc.org> <4FD89E88.9050805@uw.edu> Message-ID: <4FD8BCE9.7020507@uw.edu>
On 6/13/2012 12:09 PM, Oliver Ruebenacker wrote:
> Hello,
>
> On Wed, Jun 13, 2012 at 10:07 AM, James W. MacDonald wrote:
>> Unfortunately
>> anopheles and yeast are the only plants to have jumped that hurdle so far.
> Neither of these is a plant, though :)
So you say. Next time an Anopheles *plants* its proboscis in your arm, give me a call. ;-D
>
> Take care
> Oliver
>
-- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
From mailinglist.honeypot at gmail.com Wed Jun 13 18:16:57 2012 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 13 Jun 2012 12:16:57 -0400 Subject: [BioC] understanding multiples matches between probesets and entrezgene (biomart) In-Reply-To: References: Message-ID:
Hi,
On Wed, Jun 13, 2012 at 11:01 AM, Juliet Hannah wrote:
> All,
>
> I understand the concept of multiple probesets corresponding to one
> identifier. But what is the meaning of
> a probeset corresponding to multiple identifiers? And below, given
> that 220547_s_at has a match,
> why should another row be returned with NA.
[snip]
Given the output for the probeset IDs you entered (below, in the remaining quoted text), the multiple Entrez IDs returned for the same probeset are:
http://www.ncbi.nlm.nih.gov/gene?term=414241
http://www.ncbi.nlm.nih.gov/gene?term=54537
http://www.ncbi.nlm.nih.gov/gene?term=439965
They're all w/in the same family and there is at least one pseudo gene -- in their "Gene description" field, they all mention that they have "high sequence similarity 35". Given that information, I guess we can take a guess as to why this might be happening. You might consider looking into the CDFs the "brainarray" people are publishing to perhaps avoid these probes altogether:
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
Not sure about the NA part of your question ...
HTH, -steve
>
> Did I happen to choose a few probesets where the gene definition is
> changing, or am I misunderstanding
> something else, such as the biomart syntax.
>
> Thanks,
>
> Juliet
>
> library("biomaRt")
> probeSets <- c("219666_at", "220547_s_at", "218034_at")
> ensembl = useMart("ensembl")
> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>
> getBM(attributes = c("affy_hg_u133a", "entrezgene"), filters =
> "affy_hg_u133a",values = probeSets, mart = ensembl)
>
>
>   affy_hg_u133a entrezgene
> 1   220547_s_at      54537
> 2     218034_at      51024
> 3   220547_s_at         NA
> 4     219666_at      64231
> 5   220547_s_at     414241
> 6   220547_s_at     439965
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] biomaRt_2.12.0      BiocInstaller_1.4.6
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
-- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
From dupan.mail at gmail.com Wed Jun 13 18:59:54 2012 From: dupan.mail at gmail.com (Pan Du) Date: Wed, 13 Jun 2012 09:59:54 -0700 Subject: [BioC] Question about lumiMethyN() in lumi package In-Reply-To: References: Message-ID:
Hi Niles,
Thanks for reporting this. The warning message is produced when "smoothQuantileNormalization" tries to use "rlm" (in the MASS package) to detect outliers in the low or high intensity range. Sometimes "rlm" fails to converge in these regions, especially in the low-intensity region. Since the number of low-intensity probes (usually these are failed probes) is very low for DNA methylation data, it usually should not affect the overall processing. I should probably suppress this kind of warning message.
Pan
On Wed, Jun 13, 2012 at 9:19 AM, Niles Oien wrote:
>
> I am a research assistant at the University of Colorado.
>
> We have some methylation data that we are trying to normalize by calling
> lumiMethyN() in the lumi package, however, we are finding that we get the
> message that "rlm did not converge in 20 steps". We have tried both quantile
> and ssn methods, but still get this message.
> I do notice that instead of
> specifying "quantile" or "ssn" as the method, one can specify a user defined
> function whose input and output should be a intensity matrix (pool of
> methylated and unmethylated probe intensities). Do you think that is what we
> should be doing? And could you elaborate about how we do that?
>
> Thanks, any ideas you have would be appreciated -
>
> Niles Oien.
>

From cstrato at aon.at Wed Jun 13 19:37:00 2012
From: cstrato at aon.at (cstrato)
Date: Wed, 13 Jun 2012 19:37:00 +0200
Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays
In-Reply-To: <4FD8A7E7.2080102@uw.edu>
References: <4FD8A7E7.2080102@uw.edu>
Message-ID: <4FD8CFBC.5070701@aon.at>

Dear Andreas,

As Jim already mentioned, package xps is able to preprocess MoExon 1.0 ST
arrays at the probeset and the gene level, see also my earlier reply to a
similar question:
https://www.stat.math.ethz.ch/pipermail/bioconductor/2012-June/045958.html

Best regards
Christian
_._._._._._._._._._._._._._._._._._
C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
V.i.e.n.n.a A.u.s.t.r.i.a
e.m.a.i.l: cstrato at aon.at
_._._._._._._._._._._._._._._._._._

On 6/13/12 4:47 PM, James W. MacDonald wrote:
> Hi Andreas,
>
> On 6/13/2012 3:14 AM, Andreas Heider wrote:
>> Dear mailing list,
>> I know this was on the list couple of times, and I think I read it
>> all, but
>> actually I still don't get it right. So here is my problem:
>>
>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse
>> Gene 1.0
>> ST) in a similar fashion to eg. HG-U133 arrays.
>> That means, I want to finally have it accessible as an ExpressionSet
>> object
>> with a right Bioconductor annotation assigned. This should include GENE
>> SYMBOLS, RefSeq IDs and ENTREZ IDs.
>
> The problem here is that you want to do something that AFAIK isn't easy
> to do.
> The Gene ST arrays allow you to summarize all the probes that
> interrogate a particular transcript (e.g., all the exon-level probesets
> are collapsed to transcript level, and then you summarize). However, for
> the Exon ST arrays that isn't the case, unless there is something in xps
> to allow for that - I know next to nothing about that package, so
> Cristian Stratowa will have to chime in if I am missing something.
>
> For the Exon chips, you are always summarizing at the same probeset
> level, where there are <= 4 probes per probeset, and there can be any
> number of probesets that interrogate a given exon. Lots of these
> probesets interrogate regions that aren't even transcribed, according to
> current knowledge of the genome. When you choose core, extended or full
> probesets, you are just changing the number of probesets being used, not
> summarizing at a different level as with the Gene ST chip.
>
> So when you say you want gene symbols, refseq ids and gene ids, what
> exactly are you after? If a given probeset is in the intron of a gene do
> you want to annotate it as being part of that gene? How about if it is
> in the UTR (or really close to the UTR)? What do you want to do with the
> probesets where one or more of the probes binds in multiple positions in
> the genome? These are all questions that the exonmap package tries to
> consider, and it gets really complicated. That's why Affy went with the
> Gene ST chips - they unleashed the Exon chips on us and couldn't sell
> them because people were saying WTF do I do with this thing?
>
> I don't think there is an easy or obvious answer to your question. If
> you were to come up with what you think are reasonable answers to my
> questions, then it wouldn't be much work to extract the chr, start, end
> from the pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g.,
> findOverlaps()) to decide what regions are being interrogated, and
> annotate from there.
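
The overlap step suggested in the paragraph above could be sketched roughly
as follows. This is only an illustration under stated assumptions: the
coordinates, probeset IDs and gene symbols are made up, and in practice the
probeset positions would come from the pd.moex.1.0.st.v1 package and the
gene regions from whatever gene models you decide to trust. findOverlaps()
is called here on GRanges objects from the GenomicRanges package.

```r
library(GenomicRanges)  # provides GRanges(), findOverlaps(); loads IRanges

# Hypothetical probeset locations (chr, start, end), for illustration only
probesets <- GRanges(
  seqnames    = c("chr1", "chr1", "chr2"),
  ranges      = IRanges(start = c(100, 5000, 300), end = c(125, 5025, 325)),
  probeset_id = c("ps1", "ps2", "ps3")
)

# Hypothetical gene regions; how far these extend (introns, UTRs, flanks)
# is the annotation decision discussed in the message above
genes <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges   = IRanges(start = c(50, 1000), end = c(1000, 2000)),
  symbol   = c("GeneA", "GeneB")
)

# Pair every probeset with every gene region it overlaps
hits <- findOverlaps(probesets, genes)

annotated <- data.frame(
  probeset_id = probesets$probeset_id[queryHits(hits)],
  symbol      = genes$symbol[subjectHits(hits)],
  stringsAsFactors = FALSE
)
print(annotated)
```

Probesets that overlap no region simply do not appear in the result, and a
probeset overlapping several regions appears once per region, so the
ambiguous cases stay visible instead of being resolved silently.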
>
> Best,
>
> Jim
>
>
>>
>> I can import it as an AffyBatch and generate an ExpressionSet with the
>> help
>> of the Xmap/exonmap supplied CDF, but there is no annotation attached
>> to it.
>>
>> OR
>>
>> I can import the CEL files with the "oligo" package as an Exon Array
>> object
>> and generate an ExpressionSet from it.
>> However in that case it still has no annotation.
>>
>> Surprisingly on the Bioconductor website there are all packages needed to
>> deal with Mouse Gene 1.0 ST arrays but the information to work with Mouse
>> Exon 1.0 ST arrays seems missing!
>>
>> What am I doing wrong here? Has someone else had such problems?
>>
>> Thanks in advance for your effort,
>> Andreas
>>
>> [[alternative HTML version deleted]]
>>
>

From aheider at trm.uni-leipzig.de Wed Jun 13 19:47:05 2012
From: aheider at trm.uni-leipzig.de (Andreas Heider)
Date: Wed, 13 Jun 2012 19:47:05 +0200
Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays
In-Reply-To: <4FD8CFBC.5070701@aon.at>
References: <4FD8A7E7.2080102@uw.edu> <4FD8CFBC.5070701@aon.at>
Message-ID:
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From cstrato at aon.at Wed Jun 13 20:33:34 2012
From: cstrato at aon.at (cstrato)
Date: Wed, 13 Jun 2012 20:33:34 +0200
Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays
In-Reply-To:
References: <4FD8A7E7.2080102@uw.edu> <4FD8CFBC.5070701@aon.at>
Message-ID: <4FD8DCFE.7080508@aon.at>

Dear Andreas,

Please note that I am talking only about package xps, which contains its own
annotation, based on the Affymetrix annotation files, in this case the files
"MoEx-1_0-st-v1.na32.mm9.probeset.csv" and
"MoEx-1_0-st-v1.na32.mm9.transcript.csv", respectively. Thus with xps you
can do rma() on the transcript level and get the transcript annotation.

Package xps first creates a "scheme" file (see e.g. script
"script4schemes.R") which contains the Affymetrix annotation files for
probesets and transcripts, including the MoEx 1.0 ST identifiers.

Best regards
Christian

On 6/13/12 7:47 PM, Andreas Heider wrote:
> Yes, you are right!
> rma(target=()) can be used to collapse to transcript or probeset level.
> However, the problem is still there, as I am left with a nice
> ExpressionSet object that has values mapped to transcripts (if I decide
> so) but they are only annotated by something like 4701234. That is a
> probeset/transcript name for example. Now that wouldn't be a problem
> given that normally such an identifier could be easily translated via
> Bioconductor's annotation packages.
>
> But here comes the most significant part: There is no annotation package
> available that includes MoEx 1.0 ST identifiers!
>
> I am trying to get my package to work on these Exon arrays. And the
> package expects a proper annotation package such as eg. "mouse4302" to
> be attached to the annotation slot of the ExpressionSet.
>
> I'm still puzzled.

From smcinturf at gmail.com Wed Jun 13 20:39:59 2012
From: smcinturf at gmail.com (Sam McInturf)
Date: Wed, 13 Jun 2012 13:39:59 -0500
Subject: [BioC] Wheat annotation
In-Reply-To: <4FD8BCE9.7020507@uw.edu>
References: <20120613005854.5EFE8134449@mamba.fhcrc.org> <4FD89E88.9050805@uw.edu> <4FD8BCE9.7020507@uw.edu>
Message-ID:
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From beniltoncarvalho at gmail.com Wed Jun 13 21:11:09 2012
From: beniltoncarvalho at gmail.com (Benilton Carvalho)
Date: Wed, 13 Jun 2012 20:11:09 +0100
Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays
In-Reply-To: <4FD8A7E7.2080102@uw.edu>
References: <4FD8A7E7.2080102@uw.edu>
Message-ID:

FWIW, remember that you can obtain the contents of the annotation
files (the NA32 Affymetrix files) with:

library(Biobase)
library(oligo)
raw = read.celfiles(list.celfiles())
eset = rma(raw, target='transcript')
featureData(eset) = getNetAffx(eset, 'transcript')
head(fData(eset))

b

On 13 June 2012 15:47, James W. MacDonald wrote:
> Hi Andreas,
>
>
> On 6/13/2012 3:14 AM, Andreas Heider wrote:
>>
>> Dear mailing list,
>> I know this was on the list couple of times, and I think I read it all,
>> but
>> actually I still don't get it right. So here is my problem:
>>
>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene
>> 1.0
>> ST) in a similar fashion to eg. HG-U133 arrays.
>> That means, I want to finally have it accessible as an ExpressionSet
>> object
>> with a right Bioconductor annotation assigned.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

From beniltoncarvalho at gmail.com Wed Jun 13 21:37:39 2012
From: beniltoncarvalho at gmail.com (Benilton Carvalho)
Date: Wed, 13 Jun 2012 20:37:39 +0100
Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays
In-Reply-To:
References: <4FD8A7E7.2080102@uw.edu>
Message-ID:

please correct the code below to:

eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is available)

and if you want results at the exon level

eset = rma(raw, target='probeset')
featureData(eset) = getNetAffx(raw, 'probeset')

apologies for the mistake below.

b

On 13 June 2012 20:11, Benilton Carvalho wrote:
> FWIW, remember that you can obtain the contents of the annotation
> files (the NA32 Affymetrix files) with:
>
> library(Biobase)
> library(oligo)
> raw = read.celfiles(list.celfiles())
> eset = rma(raw, target='transcript')
> featureData(eset) = getNetAffx(eset, 'transcript')
> head(fData(eset))
>
> b
>
> On 13 June 2012 15:47, James W. MacDonald wrote:
>> Hi Andreas,
>>
>>
>> On 6/13/2012 3:14 AM, Andreas Heider wrote:
>>>
>>> Dear mailing list,
>>> I know this was on the list couple of times, and I think I read it all,
>>> but
>>> actually I still don't get it right. So here is my problem:
>>>
>>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene
>>> 1.0
>>> ST) in a similar fashion to eg. HG-U133 arrays.
>>> That means, I want to finally have it accessible as an ExpressionSet
>>> object
>>> with a right Bioconductor annotation assigned. This should include GENE
>>> SYMBOLS, RefSeq IDs and ENTREZ IDs.

From seungyeul.yoo at mssm.edu Thu Jun 14 00:25:52 2012
From: seungyeul.yoo at mssm.edu (Yoo, Seungyeul)
Date: Wed, 13 Jun 2012 22:25:52 +0000
Subject: [BioC] Scaling color range of heatmap.2 with scaled row option
Message-ID: <7A44F123-A489-4147-9AB3-5A8ECD9F7FB7@mssm.edu>

Hi,

I am trying to draw a heatmap of microarray datasets. I'm using the
heatmap.2 function of gplots as follows:

heatmap.2(var_exp,col=greenred,trace="none",scale="row",symkey=TRUE,hclust=function(x) hclust(x,method="complete"),distfun=function(x) as.dist(1-cor(t(x))))

It generates a fine heatmap, but most points are very dark green or dark
red because of the range of the z-scores generated by the scale option with
"row". The color key histogram is shown in the attachment.

How can I scale the z-scores to get more reddish and greenish colors in the
heatmap?

I actually tried the "breaks" argument.

z_5000<-z_function(var_exp)
low_z<-min(z_5000)
lowest<-floor(low_z)
high_z<-max(z_5000)
highest<-ceiling(high_z)
br <- c()
if((abs(lowest))>highest){
br <- c(seq(-(highest),-2.2,by=0.2),seq(-2,-1.1,by=0.1),seq(-1,1,0.05),seq(1.1,2,by=0.1),seq(2.2,highest,by=0.2))
br <- c(lowest,br)
} else if((abs(lowest))

References: <4FD759D6.2060202@ipbs.fr> <004201cd4930$2d7802b0$88680810$@edu.au>
Message-ID: <4FD88B9C.5090102@ipbs.fr>

Thanks a lot Belinda!!

I made a mistake, so I replaced Time=Treat by Time only, and it works now.

So, I have one last question: I'm confused by the different coefs in topTable.
I get genes, but I tested several coefs without understanding their meaning.
Can somebody explain what coef="TreatT", coef="Time18", coef="Donor5",
coef="Donor6", coef="Donor7" and coef="Donor8" mean?

My main objective is to identify the differentially expressed genes between
the Control donors and the Treated donors at 4 hours or 18 hours. I have no
idea which coef I have to use.

Cheers,

Ingrid

Ingrid MERCIER
Mycobacterial Interactions with Host Cells Team
Institute of Pharmacology & Structural Biology
CNRS - University of Toulouse
BP 64182
F-31077 Toulouse Cedex France
Tel +33 (0)5 61 17 54 63

On 13/06/2012 08:45, Belinda Phipson wrote:
> Hi Ingrid
>
> The problem with your code is the following line:
>> Time=Treat=factor(Targets$Time)
> Where you essentially set the time factor equal to the treat factor.
>
> Cheers,
> Belinda
>
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org
> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Ingrid Mercier
> Sent: Wednesday, 13 June 2012 1:02 AM
> To: bioconductor at r-project.org; smyth at wehi.edu.au
> Subject: [BioC] design matrix Limma design for paired t-test
>
> Dear list and Gordon,
>
> I am having trouble computing a moderated paired t-test in the linear
> model.
> Here is my experimental plan :
>
> I used a single channel Agilent microarray.
> 2 types of cells : Control (S) and Treated (T)
> Five human donors : 4-5-6-7-8
> Two times of treatment : 4 hours and 18 hours
>
> I want to compare the differentially expressed genes between my C versus T at 4
> hours and then at 18 hours.
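
The experimental plan described above (five donors, control versus treated
cells, two time points) could be encoded along the following lines. This is
only a sketch, not code from the thread: the `Targets` frame is rebuilt with
expand.grid() as a stand-in for the real targets file, the combined `Group`
factor and the contrast names are illustrative choices, and `test_norm` (the
normalized data object in the original post) appears only in comments.

```r
# Stand-in for the targets frame: 5 donors x 2 time points x 2 treatments,
# in the same row order as the table in the post (Treatment varies fastest)
Targets <- expand.grid(Treatment = c("T", "C"),
                       Time      = c(4, 18),
                       Donor     = 4:8)

Donor <- factor(Targets$Donor)

# One level per treatment/time combination: C.4, T.4, C.18, T.18
Group <- factor(paste(Targets$Treatment, Targets$Time, sep = "."),
                levels = c("C.4", "T.4", "C.18", "T.18"))

# Donor acts as a blocking factor, which makes the comparisons paired
design <- model.matrix(~ 0 + Group + Donor)
colnames(design) <- sub("^Group", "", colnames(design))
print(colnames(design))

# With limma installed, treated-versus-control at each time point can be
# written as a contrast between two Group columns
if (requireNamespace("limma", quietly = TRUE)) {
  cont <- limma::makeContrasts(T.vs.C.4h  = T.4  - C.4,
                               T.vs.C.18h = T.18 - C.18,
                               levels = design)
  print(cont)
  # Not run here, because test_norm is not defined in this sketch:
  # fit  <- limma::lmFit(test_norm, design)
  # fit2 <- limma::eBayes(limma::contrasts.fit(fit, cont))
  # limma::topTable(fit2, coef = "T.vs.C.4h")
}
```

With a design of this shape, each contrast is a within-donor comparison of
treated versus control cells at a single time point, so the coef passed to
topTable() names the comparison directly.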
>
> Here is my design :
>
>
> My targets frame is :
>> Targets
>          X                                           FileName Treatment Donor Time
>  1  DC_4_4 US10463851_252665214446_S01_GE1_1010_Sep10_1_2.txt         T     4    4
>  2  SC_4_4 US10463851_252665214448_S01_GE1_1010_Sep10_1_2.txt         C     4    4
>  3 DC_18_4 US10463851_252665214447_S01_GE1_1010_Sep10_1_2.txt         T     4   18
>  4 SC_18_4 US10463851_252665214444_S01_GE1_1010_Sep10_1_3.txt         C     4   18
>  5  DC_4_5 US10463851_252665214448_S01_GE1_1010_Sep10_1_4.txt         T     5    4
>  6  SC_4_5 US10463851_252665214444_S01_GE1_1010_Sep10_1_1.txt         C     5    4
>  7 DC_18_5 US10463851_252665214446_S01_GE1_1010_Sep10_1_3.txt         T     5   18
>  8 SC_18_5 US10463851_252665214447_S01_GE1_1010_Sep10_1_4.txt         C     5   18
>  9  DC_4_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_4.txt         T     6    4
> 10  SC_4_6 US10463851_252665214447_S01_GE1_1010_Sep10_1_3.txt         C     6    4
> 11 DC_18_6 US10463851_252665214448_S01_GE1_1010_Sep10_1_3.txt         T     6   18
> 12 SC_18_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_3.txt         C     6   18
> 13  DC_4_7 US10463851_252665214444_S01_GE1_1010_Sep10_1_4.txt         T     7    4
> 14  SC_4_7 US10463851_252665214445_S01_GE1_1010_Sep10_1_2.txt         C     7    4
> 15 DC_18_7 US10463851_252665214447_S01_GE1_1010_Sep10_1_1.txt         T     7   18
> 16 SC_18_7 US10463851_252665214446_S01_GE1_1010_Sep10_1_1.txt         C     7   18
> 17  DC_4_8 US10463851_252665214444_S01_GE1_1010_Sep10_1_2.txt         T     8    4
> 18  SC_4_8 US10463851_252665214446_S01_GE1_1010_Sep10_1_4.txt         C     8    4
> 19 DC_18_8 US10463851_252665214445_S01_GE1_1010_Sep10_1_1.txt         T     8   18
> 20 SC_18_8 US10463851_252665214448_S01_GE1_1010_Sep10_1_1.txt         C     8   18
>
>
> then I create my design matrix :
>
>> Donor
> [1] 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8
> Levels: 4 5 6 7 8
>> Treat=factor(Targets$Treatment,levels=c("C","T"))
>> Treat
> [1] T C T C T C T C T C T C T C T C T C T C
> Levels: C T
>> Time=Treat=factor(Targets$Time)
>> Time
> [1] 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18
> Levels: 4 18
>
>> design=model.matrix(~Donor+Treat+Time)
>> design
>    (Intercept) Donor5 Donor6 Donor7 Donor8 Treat18 Time18
> 1            1      0      0      0      0       0      0
> 2            1      0      0      0      0       0      0
> 3            1      0      0      0      0       1      1
> 4            1      0      0      0      0       1      1
> 5            1      1      0      0      0       0      0
> 6            1      1      0      0      0       0      0
> 7            1      1      0      0      0       1      1
> 8            1      1      0      0      0       1      1
> 9            1      0      1      0      0       0      0
> 10           1      0      1      0      0       0      0
> 11           1      0      1      0      0       1      1
> 12           1      0      1      0      0       1      1
> 13           1      0      0      1      0       0      0
> 14           1      0      0      1      0       0      0
> 15           1      0      0      1      0       1      1
> 16           1      0      0      1      0       1      1
> 17           1      0      0      0      1       0      0
> 18           1      0      0      0      1       0      0
> 19           1      0      0      0      1       1      1
> 20           1      0      0      0      1       1      1
> attr(,"assign")
> [1] 0 1 1 1 1 2 3
> attr(,"contrasts")
> attr(,"contrasts")$Donor
> [1] "contr.treatment"
>
> attr(,"contrasts")$Treat
> [1] "contr.treatment"
>
> attr(,"contrasts")$Time
> [1] "contr.treatment"
>
>
> In this design matrix I think something is wrong, because the column
> Treat18 is the same as Time18.
> I don't understand why.
> So, the following code failed, and the differentially expressed genes are
> odd.
>
> Can somebody help me!!! Thanks all.
>
>
>> fit=lmFit(test_norm,design)
> Coefficients not estimable: Time18
> Warning message:
> Partial NA coefficients for 34183 probe(s)
>> fit2=eBayes(fit)
> Warning message:
> In ebayes(fit = fit, proportion = proportion, stdev.coef.lim =
> stdev.coef.lim, :
> Estimation of var.prior failed - set to default value
>
>
>> table = topTable(fit2,1, number=5000,
> p.value=0.05,adjust.method="BH",sort.by="logFC",lfc=2)
>> head(table)
>                  ID    logFC  AveExpr         t      P.Value    adj.P.Val        B
> 6509  A_33_P3396434 18.44159 18.41239 245.14490 1.308161e-31 2.353520e-28 53.41519
> 22398 A_33_P3223592 18.25824 18.24591 242.75647 1.545005e-31 2.514901e-28 53.36821
> 10771 A_33_P3244165 18.21029 18.02229  90.76191 2.796577e-24 2.467615e-23 44.59915
> 6149  A_33_P3346552 18.14780 18.12098 207.18556 2.282464e-30 1.147374e-27 52.50960
> 23554 A_33_P3210160 18.08158 18.21026 239.64192 1.924175e-31 2.560908e-28 53.30521
> 20924 A_33_P3286278 18.04425 18.07312 179.72121 2.558128e-29 5.025546e-27 51.56876
>
>
> Best,
>
> Ingrid
>
>

From Alogmail2 at aol.com Thu Jun 14 01:25:57 2012
From: Alogmail2 at aol.com (Alogmail2 at aol.com)
Date: Wed, 13 Jun 2012 19:25:57
 -0400 (EDT)
Subject: [BioC] reading single channel Agilent data with limma [was arrayQualityMetrics d...
Message-ID: <2993f.7851507f.3d0a7b84@aol.com>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From smyth at wehi.EDU.AU Thu Jun 14 01:57:06 2012
From: smyth at wehi.EDU.AU (Gordon K Smyth)
Date: Thu, 14 Jun 2012 09:57:06 +1000 (AUS Eastern Standard Time)
Subject: [BioC] reading single channel Agilent data with limma [was arrayQualityMetrics d...
In-Reply-To: <2993f.7851507f.3d0a7b84@aol.com>
References: <2993f.7851507f.3d0a7b84@aol.com>
Message-ID:

Dear Alex,

When I answer a question on the Bioconductor list, I am always referring to
the current Bioconductor release unless stated otherwise. I don't think it
is unreasonable to ask you to look at the current documentation:

http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc/usersguide.pdf

Please send questions about arrayQualityMetrics to the authors of that
package.

I'm not an author of the Agi4x44PreProcess package. You could write to the
authors and suggest they update the import procedure.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Wed, 13 Jun 2012, Alogmail2 at aol.com wrote:

> Hi Gordon,
>
> Why creating ExpressionSet
>
>> esetPROC = new("ExpressionSet", exprs = ddaux$G)
>
> results in error if then running
>
>> arrayQualityMetrics(expressionset=esetPROC,outdir ="esetPROC",force =T)
>
> ?
>
> Developers (Audrey Kauffmann and Wolfgang Huber) claim that
> "expressionset is an object of class ExpressionSet for one color non
> Affymetrix data"
> (see definitions for prepdata() in their reference manual).
>
> As to bg-correction: I agree that it makes sense usually.
> I looked at the LIMMA user guide(11 November 2011): Agilent is mentioned in
> pp.16, 19, and 26 ...
> and nothing special about one-color Agilent arrays. > > But your read.maimages() is used in read.AgilentFE() from package > Agi4x44PreProcess to import Agilent one-color data sets as RGlist. > > Thanks > > Alex > > > > In a message dated 6/9/2012 8:12:37 P.M. Pacific Daylight Time, > smyth at wehi.EDU.AU writes: > > Hi Alex, > > I don't know arrayQualityMetrics, but you are using the limma package to > read single-channel Agilent data in a way that I think might cause > problems with down-stream analyses. Basically you're creating a two-color > data object when your data is not actually of that type. This was a time > when I suggested this sort of work-around as a stop-gap measure for some > data problems, but hasn't been necessary for quite a few years. > > I'd also recommend that you do some background correction. If I > understand your code correctly, I don't think it is currently making use > of the background intensity column. > > There is a case study in the limma User's Guide that deals with single > channel Agilent data. Could you please have a read of that for a cleaner > way to read Agilent data? > > I don't know whether that will be enough to solve your arrayQualityMetrics > problem, but perhaps it might. > > Best wishes > Gordon > > ------------- original message ------------- > [BioC] arrayQualityMetrics() doesn't work for one-color non Affy arrays > Alogmail2 at aol.com Alogmail2 at aol.com > Fri Jun 8 09:39:21 CEST 2012 > > Dear List, > > Could you share your experience with arrayQualityMetrics() for one-color > non Affy arrays: it doesn't work for me (please see the code below). 
> > Thanks > > Alex Loguinov > > UC, Berkeley > > > > >> options(error = recover, warn = 2) >> options(bitmapType = "cairo") >> .HaveDummy = !interactive() >> if(.HaveDummy) pdf("dummy.pdf") > >> library("arrayQualityMetrics") > >> head(targets) > FileName Treatment GErep Time Conc > T0-Control-Cu_61_new_252961010035_2_4 > T0-Control-Cu_61_new_252961010035_2_4.txt C.t0.0 0 0 0 > T0-Control-Cu_62_new_252961010036_2_1 > T0-Control-Cu_62_new_252961010036_2_1.txt C.t0.0 0 0 0 > T0-Control-Cu_64_252961010031_2_2 > T0-Control-Cu_64_252961010031_2_2.txt C.t0.0 0 0 0 > T0-Control-Cu_65_new_252961010037_2_2 > T0-Control-Cu_65_new_252961010037_2_2.txt C.t0.0 0 0 0 > T04h-Contr_06_new_252961010037_2_4 > T04h-Contr_06_new_252961010037_2_4.txt C.t4.0 1 4 0 > T04h-Contr_10_new_252961010035_1_2 > T04h-Contr_10_new_252961010035_1_2.txt C.t4.0 1 4 0 > > >> ddaux = read.maimages(files = targets$FileName, source = "agilent", > other.columns = list(IsFound = "gIsFound", IsWellAboveBG = > "IsWellAboveBG",gIsPosAndSignif="gIsPosAndSignif", > IsSaturated = "gIsSaturated", IsFeatNonUnifOF = "gIsFeatNonUnifOL", > IsFeatPopnOL = "gIsFeatPopnOL", ChrCoord = > "chr_coord",Row="Row",Column="Col"), > columns = list(Rf = "gProcessedSignal", Gf = "gMeanSignal", > Rb = "gBGMedianSignal", Gb = "gBGUsed"), verbose = T, > sep = "\t", quote = "") > > >> class(ddaux) > [1] "RGList" > attr(,"package") > [1] "limma" >> names(ddaux) > [1] "R" "G" "Rb" "Gb" "targets" "genes" "source" > "printer" "other" > > > I could apply: >> >> class(ddaux$G) > [1] "matrix" > >> all(rownames(targets)==colnames(ddaux$G)) > [1] TRUE > >> esetPROC = new("ExpressionSet", exprs = ddaux$G) > > But it results in errors: > >> arrayQualityMetrics(expressionset=esetPROC,outdir ="esetPROC",force =T) > > The directory 'esetPROC' has been created. 
> Error: no function to return from, jumping to top level
>
> Enter a frame number, or 0 to exit
>
> 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC",
> force = T)
> 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle =
> reporttitle, outdir = outdir)
> 3: reportModule(p = p, module = modules[[i]], currentIndex =
> currentIndex,
> arrayTable = arrayTableCompact, outdir = outdir)
> 4: makePlot(module)
> 5: print(x@plot)
> 6: print.trellis(x@plot)
> 7: printFunction(x, ...)
> 8: tryCatch(checkArgsAndCall(panel, pargs), error = function(e)
> panel.error(e))
> 9: tryCatchList(expr, classes, parentenv, handlers)
> 10: tryCatchOne(expr, names, parentenv, handlers[[1]])
> 11: doTryCatch(return(expr), name, parentenv, handler)
> 12: checkArgsAndCall(panel, pargs)
> 13: do.call(FUN, args)
> 14: function (x, y = NULL, subscripts, groups, panel.groups =
> "panel.xyplot", ..., col = "black", col.line = superpose.line$col,
> col.symbol =
> superpose.symb
> 15: .signalSimpleWarning("closing unused connection 5
> (Report_for_exampleSet/index.html)", quote(NULL))
> 16: withRestarts({
> 17: withOneRestart(expr, restarts[[1]])
> 18: doWithOneRestart(return(expr), restart)
>
> Selection: 0
>
>
> Error in KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, :
> (converted from warning) Binning grid too coarse for current (small)
> bandwidth: consider increasing 'gridsize'
>
> Enter a frame number, or 0 to exit
>
> 1: arrayQualityMetrics(expressionset = esetPROC, outdir = "esetPROC",
> force = T)
> 2: aqm.writereport(modules = m, arrayTable = x$pData, reporttitle =
> reporttitle, outdir = outdir)
> 3: reportModule(p = p, module = modules[[i]], currentIndex =
> currentIndex,
> arrayTable = arrayTableCompact, outdir = outdir)
> 4: makePlot(module)
> 5: do.call(x@plot, args = list())
> 6: function ()
> 7: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard deviation of the
>
intensities", xlab = "Rank(mean of intensities)") > 8: meanSdPlot(x$M, cex.axis = 0.9, ylab = "Standard deviation of the > intensities", xlab = "Rank(mean of intensities)") > 9: smoothScatter(res$px, res$py, xlab = xlab, ylab = ylab, ...) > 10: grDevices:::.smoothScatterCalcDensity(x, nbin, bandwidth) > 11: KernSmooth::bkde2D(x, bandwidth = bandwidth, gridsize = nbin, range.x > = range.x) > 12: warning("Binning grid too coarse for current (small) bandwidth: > consider increasing 'gridsize'") > 13: .signalSimpleWarning("Binning grid too coarse for current (small) > bandwidth: consider increasing 'gridsize'", quote(KernSmooth::bkde2D(x, > bandwidth = ba > 14: withRestarts({ > 15: withOneRestart(expr, restarts[[1]]) > 16: doWithOneRestart(return(expr), restart) > > Selection: 0 > > >> sessionInfo() > R version 2.14.2 (2012-02-29) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] CCl4_1.0.11 vsn_3.22.0 > arrayQualityMetrics_3.10.0 Agi4x44PreProcess_1.14.0 genefilter_1.36.0 > [6] annotate_1.32.3 AnnotationDbi_1.16.19 limma_3.10.3 > Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] affy_1.32.1 affyio_1.22.0 affyPLM_1.30.0 > beadarray_2.4.2 BiocInstaller_1.2.1 Biostrings_2.22.0 > [7] Cairo_1.5-1 cluster_1.14.2 colorspace_1.1-1 > DBI_0.2-5 grid_2.14.2 Hmisc_3.9-3 > [13] hwriter_1.3 IRanges_1.12.6 KernSmooth_2.23-7 > lattice_0.20-6 latticeExtra_0.6-19 plyr_1.7.1 > [19] preprocessCore_1.16.0 RColorBrewer_1.0-5 reshape2_1.2.1 > RSQLite_0.11.1 setRNG_2011.11-2 splines_2.14.2 > [25] stringr_0.6 survival_2.36-14 SVGAnnotation_0.9-0 > tools_2.14.2 XML_3.9-4.1 xtable_1.7-0 > [31] zlibbioc_1.0.1 ______________________________________________________________________ 
The information in this email is confidential and intend...{{dropped:4}}

From normanpavelka at gmail.com Thu Jun 14 03:35:10 2012
From: normanpavelka at gmail.com (Norman Pavelka)
Date: Thu, 14 Jun 2012 09:35:10 +0800
Subject: [BioC] Fwd: FW: plgem
In-Reply-To:
References:
Message-ID:

Dear Olli,

Thanks for your email and your continued interest in plgem. From your error message it looks like there might be a version problem. Could you please send me the output of sessionInfo() ?

I am also copying the Bioconductor mailing list, so the thread gets archived.

Cheers,
Norman

-----Original Message-----
From: Olli Kannaste [mailto:ojkann at utu.fi]
Sent: Wednesday, 13 June, 2012 9:07 PM
To: Norman Pavelka (SIgN)
Subject: plgem

Hi Norman,

I approached you about 2.5 years ago regarding problems i was having with the plgem analysis. You were kind enough to provide me an R script, which automated the analysis. That worked fine and helped me a great deal.

I am now trying plgem again using the script with some other data, and having some difficulties... I'm working on a different computer now and have installed the latest version of plgem. My guess is that for some reason the script is not working properly with the new plgem version. It lets me input my parameters and specify data files but fails to proceed right after that, generating the following error message:

Error in run.plgem(get(expressionSetName), signLev = pVal, rank = 100, :
  unused argument(s) (trimAllZeroRows = TRUE, zeroMeanOrSD = "replace")

Could you perhaps help me out with this one? I'm attaching the script and my input files in the message.
Best regards,
Olli

From olshansky at wehi.EDU.AU Thu Jun 14 03:44:37 2012
From: olshansky at wehi.EDU.AU (Moshe Olshansky)
Date: Thu, 14 Jun 2012 11:44:37 +1000 (EST)
Subject: [BioC] design matrix Limma design for paired t-test
In-Reply-To: <4FD88B9C.5090102@ipbs.fr>
References: <4FD759D6.2060202@ipbs.fr> <004201cd4930$2d7802b0$88680810$@edu.au> <4FD88B9C.5090102@ipbs.fr>
Message-ID: <5266de74b101cfe9b43bb86abb9fd56b.squirrel@wehimail.alpha.wehi.edu.au>

Hi Ingrid,

With your design your "base" level is patient 4, Control, 4 hours (let's call it B). The mean for, say, patient 6, Treatment, 18 hours is:

B + Donor6 + TreatT + Time18

where Donor6 is the difference between Donor4 and Donor6 (same for any treatment and time), TreatT is the difference between Treatment and Control (independent of patient and time) and Time18 is the difference between 18 hours and 4 hours (independent of patient and treatment).

If you think that the effect of Treatment versus Control is the same at 4 hours and 18 hours, then what you did is all right. If you think that the effect of the treatment at 4 hours may be different from the one at 18 hours, you need to change your design.

Best regards,
Moshe.

> Thanks a lot Belinda !!
>
> I mistaked so I replaced Time=Treat by Time only, and it's good.
> So, I have a last question : I 'm confused with the differents coef in
> topTable.
> I get genes but I tested several coef without understanding their
> significance.
> Somebody can explain me what mean coef="TreatT", or coef= "Time18",coef=
> " Donor5",coef= " Donor6", coef= "Donor7",coef= " Donor8".
> My main objective is to identidy the differential expressed genes
> between the Control donors and Treated Donors at 4 hours or 18 hours.
> I have no idea, which coef I have to use it.
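
The reading of the coefficients given in the reply above can be checked with base R alone. What follows is a minimal sketch, not the poster's actual analysis: the Targets data frame is a hypothetical stand-in that only copies the Donor / Treatment / Time layout from the thread, and the nested formula is one possible way (an assumption, not something stated in the thread) to let the treatment effect differ between 4 hours and 18 hours.

```r
# Hypothetical stand-in for the Targets frame in the thread:
# 5 donors x 2 time points x 2 treatments, no expression data.
Targets <- data.frame(
  Donor     = rep(4:8, each = 4),
  Time      = rep(c(4, 4, 18, 18), times = 5),
  Treatment = rep(c("T", "C"), times = 10)
)
Donor <- factor(Targets$Donor)
Treat <- factor(Targets$Treatment, levels = c("C", "T"))
Time  <- factor(Targets$Time)

# Additive model: one treatment effect shared by both time points.
additive <- model.matrix(~ Donor + Treat + Time)

# Row for donor 6, treated, 18 hours. Its fitted mean is the sum of the
# coefficients whose design entry is 1:
# (Intercept) + Donor6 + TreatT + Time18.
i <- which(Donor == "6" & Treat == "T" & Time == "18")
active <- colnames(additive)[additive[i, ] == 1]
print(active)

# Nested formula: a separate treatment-vs-control effect at each time point.
nested <- model.matrix(~ Donor + Time + Time:Treat)
print(colnames(nested))

# With limma loaded and 'test_norm' as in the original post (not run here):
# fit <- eBayes(lmFit(test_norm, nested))
# topTable(fit, coef = "Time4:TreatT")   # T versus C at 4 hours
# topTable(fit, coef = "Time18:TreatT")  # T versus C at 18 hours
```

In the additive design, coef="TreatT" is the single treatment-versus-control effect, while the Donor coefficients only absorb donor-to-donor differences and are rarely of interest in themselves.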
> > Cheers, > > Ingrid > > Ingrid MERCIER > Mycobacterial Interactions with Host Cells Team > Institute of Pharmacology& Structural Biology > CNRS - University of Toulouse > BP 64182 > F-31077 Toulouse Cedex France > Tel +33 (0)5 61 17 54 63 > > > > > Le 13/06/2012 08:45, Belinda Phipson a ?crit : >> Hi Ingrid >> >> The problem with your code is the following line: >>> Time=Treat=factor(Targets$Time) >> Where you essentially set the time factor equal to the treat factor. >> >> Cheers, >> Belinda >> >> >> -----Original Message----- >> From: bioconductor-bounces at r-project.org >> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Ingrid Mercier >> Sent: Wednesday, 13 June 2012 1:02 AM >> To: bioconductor at r-project.org; smyth at wehi.edu.au >> Subject: [BioC] design matrix Limma design for paired t-test >> >> Dear list and Gordon, >> >> I have some troubles to computed a moderated paired t-test in the linear >> model. >> Here is my experimental plan : >> >> I used a single channel Agilent microarray. >> 2 types of cells : Control (S) and Treated (T) >> Fives human donors : 4-5-6-7-8 >> Two times of treatment : 4 hours and 18 hours >> >> I want to compare teh differential expresed genes between my C versus T >> at 4 >> hours and then at 18 hours. 
>> >> Here is my design : >> >> >> My targets frame is : >>> Targets >> X FileName >> Treatment >> Donor Time >> 1 DC_4_4 US10463851_252665214446_S01_GE1_1010_Sep10_1_2.txt T >> 4 4 >> 2 SC_4_4 US10463851_252665214448_S01_GE1_1010_Sep10_1_2.txt C >> 4 4 >> 3 DC_18_4 US10463851_252665214447_S01_GE1_1010_Sep10_1_2.txt T >> 4 18 >> 4 SC_18_4 US10463851_252665214444_S01_GE1_1010_Sep10_1_3.txt C >> 4 18 >> 5 DC_4_5 US10463851_252665214448_S01_GE1_1010_Sep10_1_4.txt T >> 5 4 >> 6 SC_4_5 US10463851_252665214444_S01_GE1_1010_Sep10_1_1.txt C >> 5 4 >> 7 DC_18_5 US10463851_252665214446_S01_GE1_1010_Sep10_1_3.txt T >> 5 18 >> 8 SC_18_5 US10463851_252665214447_S01_GE1_1010_Sep10_1_4.txt C >> 5 18 >> 9 DC_4_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_4.txt T >> 6 4 >> 10 SC_4_6 US10463851_252665214447_S01_GE1_1010_Sep10_1_3.txt C >> 6 4 >> 11 DC_18_6 US10463851_252665214448_S01_GE1_1010_Sep10_1_3.txt T >> 6 18 >> 12 SC_18_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_3.txt C >> 6 18 >> 13 DC_4_7 US10463851_252665214444_S01_GE1_1010_Sep10_1_4.txt T >> 7 4 >> 14 SC_4_7 US10463851_252665214445_S01_GE1_1010_Sep10_1_2.txt C >> 7 4 >> 15 DC_18_7 US10463851_252665214447_S01_GE1_1010_Sep10_1_1.txt T >> 7 18 >> 16 SC_18_7 US10463851_252665214446_S01_GE1_1010_Sep10_1_1.txt C >> 7 18 >> 17 DC_4_8 US10463851_252665214444_S01_GE1_1010_Sep10_1_2.txt T >> 8 4 >> 18 SC_4_8 US10463851_252665214446_S01_GE1_1010_Sep10_1_4.txt C >> 8 4 >> 19 DC_18_8 US10463851_252665214445_S01_GE1_1010_Sep10_1_1.txt T >> 8 18 >> 20 SC_18_8 US10463851_252665214448_S01_GE1_1010_Sep10_1_1.txt C >> 8 18 >> >> >> then I create my design matrix : >> >>> Donor >> [1] 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 >> Levels: 4 5 6 7 8 >>> Treat=factor(Targets$Treatment,levels=c("C","T")) >>> Treat >> [1] T C T C T C T C T C T C T C T C T C T C >> Levels: C T >>> Time=Treat=factor(Targets$Time) >>> Time >> [1] 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18 >> Levels: 4 18 >> >>> design=model.matrix(~Donor+Treat+Time) >>> 
design >> (Intercept) Donor5 Donor6 Donor7 Donor8 Treat18 Time18 >> 1 1 0 0 0 0 0 0 >> 2 1 0 0 0 0 0 0 >> 3 1 0 0 0 0 1 1 >> 4 1 0 0 0 0 1 1 >> 5 1 1 0 0 0 0 0 >> 6 1 1 0 0 0 0 0 >> 7 1 1 0 0 0 1 1 >> 8 1 1 0 0 0 1 1 >> 9 1 0 1 0 0 0 0 >> 10 1 0 1 0 0 0 0 >> 11 1 0 1 0 0 1 1 >> 12 1 0 1 0 0 1 1 >> 13 1 0 0 1 0 0 0 >> 14 1 0 0 1 0 0 0 >> 15 1 0 0 1 0 1 1 >> 16 1 0 0 1 0 1 1 >> 17 1 0 0 0 1 0 0 >> 18 1 0 0 0 1 0 0 >> 19 1 0 0 0 1 1 1 >> 20 1 0 0 0 1 1 1 >> attr(,"assign") >> [1] 0 1 1 1 1 2 3 >> attr(,"contrasts") >> attr(,"contrasts")$Donor >> [1] "contr.treatment" >> >> attr(,"contrasts")$Treat >> [1] "contr.treatment" >> >> attr(,"contrasts")$Time >> [1] "contr.treatment" >> >> >> In this design matrix I think something is wrong, because of the column >> Treat18 is the same as Time18. >> I don't understand why. >> So, the following code failed, and the differential expressed genes are >> odds. >> >> Somebody can help me !!! Thanks all. >> >> >>> fit=lmFit(test_norm,design) >> Coefficients not estimable: Time18 >> Message d'avis : >> Partial NA coefficients for 34183 probe(s) >>> fit2=eBayes(fit) >> Message d'avis : >> In ebayes(fit = fit, proportion = proportion, stdev.coef.lim = >> stdev.coef.lim, : >> Estimation of var.prior failed - set to default value >> >> >>> table = topTable(fit2,1, number=5000, >> p.value=0.05,adjust.method="BH",sort.by="logFC",lfc=2) >>> head(table) >> ID logFC AveExpr t P.Value >> adj.P.Val >> B >> 6509 A_33_P3396434 18.44159 18.41239 245.14490 1.308161e-31 >> 2.353520e-28 >> 53.41519 >> 22398 A_33_P3223592 18.25824 18.24591 242.75647 1.545005e-31 >> 2.514901e-28 >> 53.36821 >> 10771 A_33_P3244165 18.21029 18.02229 90.76191 2.796577e-24 >> 2.467615e-23 >> 44.59915 >> 6149 A_33_P3346552 18.14780 18.12098 207.18556 2.282464e-30 >> 1.147374e-27 >> 52.50960 >> 23554 A_33_P3210160 18.08158 18.21026 239.64192 1.924175e-31 >> 2.560908e-28 >> 53.30521 >> 20924 A_33_P3286278 18.04425 18.07312 179.72121 2.558128e-29 >> 5.025546e-27 >> 51.56876 >> 
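
The duplicated Treat18 and Time18 columns in the design printed above can be reproduced without any expression data. This is a minimal sketch in base R, assuming a hypothetical Targets data frame with the same Donor / Treatment / Time layout as in the post: the chained assignment Time=Treat=factor(Targets$Time) overwrites Treat with the time factor, so model.matrix() returns two identical columns and one coefficient cannot be estimated.

```r
# Hypothetical stand-in for the Targets frame in the post.
Targets <- data.frame(
  Donor     = rep(4:8, each = 4),
  Time      = rep(c(4, 4, 18, 18), times = 5),
  Treatment = rep(c("T", "C"), times = 10)
)
Donor <- factor(Targets$Donor)

# As posted: the chained assignment gives Treat the same values as Time.
Treat <- factor(Targets$Treatment, levels = c("C", "T"))
Time <- Treat <- factor(Targets$Time)
bad <- model.matrix(~ Donor + Treat + Time)

# Corrected: each factor is built from its own column.
Treat <- factor(Targets$Treatment, levels = c("C", "T"))
Time  <- factor(Targets$Time)
good <- model.matrix(~ Donor + Treat + Time)

print(colnames(bad))                     # has both "Treat18" and "Time18"
print(c(rank = qr(bad)$rank, columns = ncol(bad)))    # rank is one short
print(colnames(good))                    # has "TreatT" and "Time18"
print(c(rank = qr(good)$rank, columns = ncol(good)))  # full rank
```

Comparing the rank of the design with its number of columns before calling lmFit() is a cheap way to catch this kind of mistake early.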
>> >> Best, >> >> Ingrid >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From Alogmail2 at aol.com Thu Jun 14 06:35:20 2012 From: Alogmail2 at aol.com (Alogmail2 at aol.com) Date: Thu, 14 Jun 2012 00:35:20 -0400 (EDT) Subject: [BioC] reading single channel Agilent data with limma [was arrayQualityMetrics d... Message-ID: <5871f.4730adea.3d0ac408@aol.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jpflorido at gmail.com Thu Jun 14 10:26:01 2012 From: jpflorido at gmail.com (=?ISO-8859-1?Q?Javier_P=E9rez_Florido?=) Date: Thu, 14 Jun 2012 10:26:01 +0200 Subject: [BioC] Is there an R package to export genotype data to PED /MAP files? Message-ID: <4FD9A019.4010902@gmail.com> Dear list, Is there a way to convert genotype Affy data files (chp files - Affy SNP 6.0 arrays) to PED / MAP files (used by PLINK) by means of some R package? Thanks for your time, All the best, Javier From karthikuttan at gmail.com Thu Jun 14 11:51:12 2012 From: karthikuttan at gmail.com (Karthik K N) Date: Thu, 14 Jun 2012 15:21:12 +0530 Subject: [BioC] AgiMicroRna and Replicates Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From netazuck at gmail.com Thu Jun 14 01:47:37 2012 From: netazuck at gmail.com (Neta) Date: Wed, 13 Jun 2012 23:47:37 +0000 Subject: [BioC] using Limma to read 2-channel dye-swap in Agilent scanner Message-ID: Hello, I am using Limma (version 3.12.0) to read 2-channel dye-swap files from an Agilent image analysis scanner, where each sample has 2 files - one for cy3 and one for cy5. 
However, the only reference to such a 2-file input in the limma user guide is to "ImaGene". When I try to read the files and the target file using the following commands: RG = read.maimages(file_names, ...) targets = readTargets("targets.txt") files = targets[,c("FileNameCy3","FileNameCy5")]; RG = read.maimages(files,source="imagene"); I get the following error message: Error in read.imagene(files = files, path = path, ext = ext, names = names, : Can't find Field Dimensions in ImaGene header In addition: Warning message: In readImaGeneHeader(fullname) : End of file encountered before End Header When I try to use source="Agilent" instead of "imagene", the command doesn't seem to understand that I have 2 file names, and gives the following error message: Error in read.maimages(files, source = "agilent") : targets frame doesn't contain FileName column I tried every single option for "source" that was listed in the help of "read.maimages" but it seems like "imagene" is the only one that is able to digest the two headers for the file names. I am stuck and would appreciate any help. Thank you, Neta. From michael.s.rooney at gmail.com Thu Jun 14 03:02:52 2012 From: michael.s.rooney at gmail.com (Michael Rooney) Date: Wed, 13 Jun 2012 21:02:52 -0400 Subject: [BioC] factorial designs in limma Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From rpietro at duke.edu Thu Jun 14 05:19:03 2012 From: rpietro at duke.edu (Ricardo Pietrobon) Date: Wed, 13 Jun 2012 23:19:03 -0400 Subject: [BioC] [R] where to find a host server with R In-Reply-To: <1339531408.46662.YahooMailNeo@web114209.mail.gq1.yahoo.com> References: <1339531408.46662.YahooMailNeo@web114209.mail.gq1.yahoo.com> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From ojkann at utu.fi Thu Jun 14 10:52:37 2012
From: ojkann at utu.fi (Olli Kannaste)
Date: Thu, 14 Jun 2012 08:52:37 +0000
Subject: [BioC] FW: plgem
In-Reply-To:
References:
Message-ID: <5E1269353FE3EB4A92F4A98B155985F80A737D2D@exch-mbx-02.utu.fi>

Hi Norman,

Sure thing. Here is the output:

R version 2.7.2 (2008-08-25)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] plgem_1.12.0 MASS_7.2-44 Biobase_2.0.1

--
Thanks,
Olli

________________________________________
From: Norman Pavelka [normanpavelka at gmail.com]
Sent: 14 June 2012 4:35
To: Olli Kannaste
Cc: bioconductor at r-project.org
Subject: Fwd: FW: plgem

Dear Olli,

Thanks for your email and your continued interest in plgem. From your error message it looks like there might be a version problem. Could you please send me the output of sessionInfo() ?

I am also copying the Bioconductor mailing list, so the thread gets archived.

Cheers,
Norman

-----Original Message-----
From: Olli Kannaste [mailto:ojkann at utu.fi]
Sent: Wednesday, 13 June, 2012 9:07 PM
To: Norman Pavelka (SIgN)
Subject: plgem

Hi Norman,

I approached you about 2.5 years ago regarding problems i was having with the plgem analysis. You were kind enough to provide me an R script, which automated the analysis. That worked fine and helped me a great deal.

I am now trying plgem again using the script with some other data, and having some difficulties... I'm working on a different computer now and have installed the latest version of plgem. My guess is that for some reason the script is not working properly with the new plgem version.
It lets me input my parameters and specify data files but fails to proceed right after that, generating the following error message: Error in run.plgem(get(expressionSetName), signLev = pVal, rank = 100, : unused argument(s) (trimAllZeroRows = TRUE, zeroMeanOrSD = "replace") Could you perhaps help me out with this one? I'm attaching the script and my input files in the message. Best regards, Olli From friedman at cancercenter.columbia.edu Thu Jun 14 15:38:08 2012 From: friedman at cancercenter.columbia.edu (Richard Friedman) Date: Thu, 14 Jun 2012 09:38:08 -0400 Subject: [BioC] AgiMicroRna and Replicates In-Reply-To: References: Message-ID: <1E4718D8-D6F3-4B82-8775-713CCB6F79E0@cancercenter.columbia.edu> Dear Karthik, I am pretty sure that AgiMicroRna will normalize one treated and one control. The problem comes later in terms of the reproducibility of the effect, or to phrase it differently, whether the observed effect is a general statement about the population of treatments and controls. In more specific terms, the subsequent LIMMA analysis will not compute a p-value. With hopes that this helps, Rich ------------------------------------------------------------ Richard A. Friedman, PhD Associate Research Scientist, Biomedical Informatics Shared Resource Herbert Irving Comprehensive Cancer Center (HICCC) Lecturer, Department of Biomedical Informatics (DBMI) Educational Coordinator, Center for Computational Biology and Bioinformatics (C2B2)/ National Center for Multiscale Analysis of Genomic Networks (MAGNet) Room 824 Irving Cancer Research Center Columbia University 1130 St. Nicholas Ave New York, NY 10032 (212)851-4765 (voice) friedman at cancercenter.columbia.edu http://cancercenter.columbia.edu/~friedman/ "School is an evil plot to suppress my individuality" Rose Friedman, age15 On Jun 14, 2012, at 5:51 AM, Karthik K N wrote: > Dear Members, > > Do we need replicates to carry out analysis with AgiMicroRna package > in > bioconductor? 
I have one control and one treated samples. Can I go > ahead > with AgiMicroRna with these two datasets? > > Thank you. > > -- > Karthik K.N > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From karthikuttan at gmail.com Thu Jun 14 15:45:13 2012 From: karthikuttan at gmail.com (Karthik K N) Date: Thu, 14 Jun 2012 19:15:13 +0530 Subject: [BioC] AgiMicroRna and Replicates In-Reply-To: <1E4718D8-D6F3-4B82-8775-713CCB6F79E0@cancercenter.columbia.edu> References: <1E4718D8-D6F3-4B82-8775-713CCB6F79E0@cancercenter.columbia.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From friedman at cancercenter.columbia.edu Thu Jun 14 15:54:46 2012 From: friedman at cancercenter.columbia.edu (Richard Friedman) Date: Thu, 14 Jun 2012 09:54:46 -0400 Subject: [BioC] AgiMicroRna and Replicates In-Reply-To: References: <1E4718D8-D6F3-4B82-8775-713CCB6F79E0@cancercenter.columbia.edu> Message-ID: Dear Karthik, On Jun 14, 2012, at 9:45 AM, Karthik K N wrote: > Dear Rich, > > Thanks a lot for your reply. I am just worried because somewhere I > remember reading that LIMMA can't be used without replicates ( in > case of mRNA microarray analysis); so just wanted to check if it is > the same with AgiMicroRna package also. I think that LIMMA can be used without replicates but it cannot give you a p-value. I suggest a minimum of 3 biological replicates per sample. > > Also, Do you think it is a good idea to carry out GeneSpring > analysis upon the data analyzed by biocondcutor? Will it give any > statistically more reliable output that using either of them alone? 
I do not know how GeneSpring normalizes Agilent MicroRNA data but AgiMicroRna is the best way to normalize such data described in the open literature of which I am aware. The statistical methods in LIMMA are more reliable than those in GeneSpring in general. However, with only one replicate of each condition neither program can give a p-value, only a log2FC. You can get the same result in an Excel spreadsheet.

>
> Also, since the data from bioconductor has already normalized, if we
> again give this as an input file for genespring and go on with
> analysis, then do we need to do normalization step again in
> GeneSpirng? Won't this be a double-normalization?

I haven't used GeneSpring for a long time. I am not sure what it will do to your data. If you only have one replicate you might consider using AgiMicroRna to normalize and Excel to compute the log2FC.

Best wishes,
Rich

>
> I am new to R/Bioconductor, so any suggestions from you will be
> extremely helpful.
>
> Thanks a lot,
>
> Karthik
>
>
> On Thu, Jun 14, 2012 at 7:08 PM, Richard Friedman
> wrote:
> Dear Karthik,
>
> I am pretty sure that AgiMicroRna will normalize one treated
> and
> one control. The problem comes later in terms of the reproducibility
> of the effect, or to phrase it differently, whether the observed
> effect
> is a general statement about the population of treatments and
> controls.
> In more specific terms, the subsequent LIMMA analysis will not
> compute a p-value.
>
> With hopes that this helps,
> Rich
> ------------------------------------------------------------
> Richard A.
Friedman, PhD > Associate Research Scientist, > Biomedical Informatics Shared Resource > Herbert Irving Comprehensive Cancer Center (HICCC) > Lecturer, > Department of Biomedical Informatics (DBMI) > Educational Coordinator, > Center for Computational Biology and Bioinformatics (C2B2)/ > National Center for Multiscale Analysis of Genomic Networks (MAGNet) > Room 824 > Irving Cancer Research Center > Columbia University > 1130 St. Nicholas Ave > New York, NY 10032 > (212)851-4765 (voice) > friedman at cancercenter.columbia.edu > http://cancercenter.columbia.edu/~friedman/ > > "School is an evil plot to suppress my individuality" > > Rose Friedman, age15 > > > > > > > > > > > > On Jun 14, 2012, at 5:51 AM, Karthik K N wrote: > > Dear Members, > > Do we need replicates to carry out analysis with AgiMicroRna package > in > bioconductor? I have one control and one treated samples. Can I go > ahead > with AgiMicroRna with these two datasets? > > Thank you. > > -- > Karthik K.N > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Karthik K.N > Cancer Discovery Biology Laboratory > Division of Molecular Medicine > Amrita Center for Nanosciences and Molecular Medicine > Amrita Institute of Medical Sciences > AIMS-Ponekkara (P.O), Kochi > Cochin, Kerala - 682 041 > +91-484-280 1234 x 8720 > +91-9400193907 > From martin.preusse at googlemail.com Thu Jun 14 15:55:02 2012 From: martin.preusse at googlemail.com (Martin Preusse) Date: Thu, 14 Jun 2012 15:55:02 +0200 Subject: [BioC] Biostring: print sequence alignment to file In-Reply-To: References: <1F5E43701380457FBD1953187D5FE071@googlemail.com> <4c299b85491a4b8cb0d0cbe2fdf3e3dc@EXCH-NODE02.exch.ucr.edu> <20120417184928.GA439@genomics-57-164.bulk.ucr.edu> 
<20120417201341.GA587@genomics-57-164.bulk.ucr.edu> <2F44456508644188839DCAB9C8D6B9B9@googlemail.com> <140068C6B7EC41B5BAE2B3A6964BF731@googlemail.com> <4F91FB7F.2000506@fhcrc.org>
Message-ID: <78563C69AD804989BBEA7D94269D47BA@googlemail.com>

Hi guys,

anything new on the sequence output? Maybe I missed something :) please tell me if you need testing etc.

Cheers
Martin

On Saturday, 21 April 2012 at 11:55, Martin Preusse wrote:

> Hi Hervé,
>
> thanks for your help! If you need suggestions, help or testing, just say the word.
>
> Will you implement the header also? If you do so, I would be thankful for an option like "header=F" for the output.
>
>
> Cheers
> Martin
>
>
> On Saturday, 21 April 2012 at 02:12, Hervé Pagès wrote:
>
> > Thanks Martin and Thomas for the useful feedback. The 'pair' and
> > 'markx0' formats supported by Emboss seem indeed appropriate for
> > printing the output of pairwiseAlignment() to a file. I'll add
> > support for those 2 formats in Biostrings. Won't be before 1 week
> > or 2 though...
> >
> > Cheers,
> > H.
> >
> > On 04/18/2012 03:20 AM, Martin Preusse wrote:
> > > Hi,
> > >
> > > I just found this function to print a pairwise alignments in blocks. Doesn't add the match/mismatch indicators between sequences, but might be a starting point:
> > >
> > > http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment
> > >
> > >
> > > Cheers
> > > Martin
> > >
> > >
> > >
> > > On Wednesday, 18 April 2012 at 12:16, Martin Preusse wrote:
> > >
> > > > Hi everybody,
> > > >
> > > > I think the output format depends on the purpose of the alignment.
> > > >
> > > > A pairwise sequence alignment is usually done to compare two sequences base by base. In my case, I compare sequencing results of cloned expression constructs with the desired sequence. Thus, the best output format would be "BLAST like".
> > > >
> > > > seq1: 1 ATCTGC 7
> > > > | | | . .
| > > > > seq2: 1 ATCAAC 7 > > > > > > > > When doing MSA, most people might rather be interested in the consensus sequence. E.g. in the context of conservation between species. > > > > > > > > So write.PairwiseAlignedXStringSet() and write.MultipleAlignment() are quite different and BLAST doesn't make much sense for multiple alignments. This means it would be best to put the output in the PairwiseAlignment/MultipleAlignment and not to the XStringSet, right? > > > > > > > > This is an overview of sequence alignment formats used by EMBOSS: > > > > http://emboss.sourceforge.net/docs/themes/AlignFormats.html > > > > > > > > 'pair' or 'markx0' would be perfectly fine. > > > > > > > > > > > > Cheers > > > > Martin > > > > > > > > > > > > > > > > Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke: > > > > > > > > > Hi Herv?, > > > > > > > > > > To me, the most basic and versatile MSA or pairwise alignment format to output > > > > > to would be FASTA since it is compatible with almost any other alignment > > > > > editing software. For text-based viewing purposes my preference would be > > > > > to also output to a format similar to the one shown in the following > > > > > example. When there are only two sequences then one could show instead > > > > > of a consensus line the pipe characters between the two sequences to > > > > > indicate identical residues which mimics the blast output. A more > > > > > standardized version of this pairwise alignment format can be found > > > > > here: > > > > > http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html > > > > > > > > > > library(Biostrings) > > > > > p450<- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta") > > > > > > > > > > StringSet2html<- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) 
{ > > > > > if(class(msa)=="AAStringSet") msa<- AAStringSet(msa, start=start, end=end) > > > > > if(class(msa)=="DNAStringSet") msa<- DNAStringSet(msa, start=start, end=end) > > > > > msavec<- sapply(msa, toString) > > > > > offset<- (counter-1)-nchar(nchar(msavec[1])) > > > > > legend<- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0, > > > > > nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="") > > > > > consensus<- consensusString(msavec, ambiguityMap=".", ...) > > > > > msavec<- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ") > > > > > msavec<- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec, > > > > > consensus), sep=" ") > > > > > msavec<- c("
", msavec,"
") > > > > > writeLines(msavec, file) > > > > > if(browser==TRUE) { browseURL(file) } > > > > > } > > > > > StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0) > > > > > StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0) > > > > > > > > > > > > > > > Thomas > > > > > > > > > > On Tue, Apr 17, 2012 at 07:43:30PM +0000, Herv? Pag?s wrote: > > > > > > Hi Thomas, > > > > > > > > > > > > On 04/17/2012 11:49 AM, Thomas Girke wrote: > > > > > > > What about providing an option in pairwiseAlignment to output to the > > > > > > > MultipleAlignment class in Biostrings and then write the latter to > > > > > > > different alignment formats? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Or we could provide coercion methods to switch between > > > > > > PairwiseAlignedXStringSet and MultipleAlignment. > > > > > > > > > > > > Anyway that kind of moves Martin's problem from having a > > > > > > write.PairwiseAlignedXStringSet() function that produces BLAST output > > > > > > to having a write.MultipleAlignment() function that produces BLAST > > > > > > output. For the specific case of BLAST output, would it make sense > > > > > > to support it for MultipleAlignment? Can someone point me to an example > > > > > > of such output? Or even better, to the specs of such format? 
> > > > > > > > > > > > Note that right now there is the write.phylip() function in Biostrings > > > > > > for writing a MultipleAlignment object to a file but the Phylip format > > > > > > looks very different from the BLAST output: > > > > > > > > > > > > hpages at latitude:~$ head -n 20 phylip_test.txt > > > > > > 9 2343 > > > > > > Mask 0000000000 0000000000 0000000000 0000000000 0000000000 > > > > > > Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG > > > > > > Chimp ---------- ---------- ---------- ---------- ---------- > > > > > > Cow ---------- ---------- ---------- ---------- ---------- > > > > > > Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG > > > > > > Rat ---------- ---------- ---------- ---------- ---------- > > > > > > Dog ---------- ---------- ---------- ---------- ---------- > > > > > > Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC > > > > > > Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG > > > > > > > > > > > > 0000000000 0000000000 0000000000 0001111111 1111111111 > > > > > > CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT > > > > > > ---------- ---------- ---------- ---------- ---A-TGGCT > > > > > > ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT > > > > > > CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT > > > > > > ---------- ---------- ---------- ---------- ---A-TGGCT > > > > > > ---------- ---------- ---------- ---------- ---A-TGGCT > > > > > > AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC > > > > > > CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT > > > > > > > > > > > > Thanks! > > > > > > H. > > > > > > > > > > > > > > > > > > > > Thomas > > > > > > > > > > > > > > On Tue, Apr 17, 2012 at 05:59:24PM +0000, Herv? Pag?s wrote: > > > > > > > > Hi Martin, > > > > > > > > > > > > > > > > On 04/16/2012 04:06 AM, Martin Preusse wrote: > > > > > > > > > Hi Charles, > > > > > > > > > > > > > > > > > > thanks! 
Your solution allows to print the two alignment strings separately. > > > > > > > > > > > > > > > > > > I was thinking of an output as generated by alignment tools: > > > > > > > > > > > > > > > > > > AGT-TCTAT > > > > > > > > > | | | | | | | | | > > > > > > > > > AGTATCTAT > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This looks like BLAST output. Is this what you have in mind? Note that > > > > > > > > there are many alignment tools and many ways to output the result to a > > > > > > > > file. I'm not really familiar with the BLAST output format. Is it > > > > > > > > specified somewhere? Would that make sense to add something like a > > > > > > > > write.PairwiseAlignedXStringSet() function to Biostrings for writing > > > > > > > > the result of pairwiseAlignment() to a file? We could do this and > > > > > > > > support the BLAST format if that's a commonly used format. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > H. > > > > > > > > > > > > > > > > > > > > > > > > > > For this I would have to write a function to output the strings in blocks of e.g. 60 nucleotides, right? > > > > > > > > > > > > > > > > > > Cheers > > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Am Freitag, 13. 
April 2012 um 19:21 schrieb Chu, Charles: > > > > > > > > > > > > > > > > > > > write.XStringSet > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > Bioconductor mailing list > > > > > > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) > > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > > > > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Herv? Pag?s > > > > > > > > > > > > > > > > Program in Computational Biology > > > > > > > > Division of Public Health Sciences > > > > > > > > Fred Hutchinson Cancer Research Center > > > > > > > > 1100 Fairview Ave. N, M1-B514 > > > > > > > > P.O. Box 19024 > > > > > > > > Seattle, WA 98109-1024 > > > > > > > > > > > > > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org) > > > > > > > > Phone: (206) 667-5791 > > > > > > > > Fax: (206) 667-1319 > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > Bioconductor mailing list > > > > > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > > > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Herv? Pag?s > > > > > > > > > > > > Program in Computational Biology > > > > > > Division of Public Health Sciences > > > > > > Fred Hutchinson Cancer Research Center > > > > > > 1100 Fairview Ave. N, M1-B514 > > > > > > P.O. 
Box 19024
> > > > > > Seattle, WA 98109-1024
> > > > > >
> > > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > > > Phone: (206) 667-5791
> > > > > > Fax: (206) 667-1319
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
>

From martin.preusse at googlemail.com Thu Jun 14 16:01:29 2012
From: martin.preusse at googlemail.com (Martin Preusse)
Date: Thu, 14 Jun 2012 16:01:29 +0200
Subject: [BioC] BioPAX parsing
Message-ID: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com>

Many biological pathway resources provide their data in the BioPAX format (http://www.biopax.org/index.php), a special XML format for biological interaction networks. Examples are Pathway Commons (http://www.pathwaycommons.org/pc/) and Reactome (http://www.reactome.org).

A Java library for parsing BioPAX files exists: http://www.biopax.org/paxtools.php

Has anybody used BioPAX files with R? Is it possible to read BioPAX files into any R-based graph structure?

A solution similar to the KEGGgraph package for KEGG pathways would be great, since more and more databases are starting to use BioPAX.

Any ideas are appreciated!

Cheers
Martin

From mailinglist.honeypot at gmail.com Thu Jun 14 16:02:03 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Thu, 14 Jun 2012 10:02:03 -0400
Subject: [BioC] AgiMicroRna and Replicates
In-Reply-To:
References:
Message-ID:

Hi,

On Thu, Jun 14, 2012 at 5:51 AM, Karthik K N wrote:
> Dear Members,
>
> Do we need replicates to carry out analysis with AgiMicroRna package in
> bioconductor? I have one control and one treated samples. Can I go ahead
> with AgiMicroRna with these two datasets?
It depends on what you mean by "need". I suspect that it is *technically* possible to carry out an analysis without replicates so that at the end of day you get *some* results. In that sense, you might not "need" replicates. If by "need" you mean that you want results that you can have some faith in, then yes: you need replicates ... always. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From karthikuttan at gmail.com Thu Jun 14 17:00:54 2012 From: karthikuttan at gmail.com (Karthik K N) Date: Thu, 14 Jun 2012 20:30:54 +0530 Subject: [BioC] AgiMicroRna and Replicates In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From karthikuttan at gmail.com Thu Jun 14 17:01:49 2012 From: karthikuttan at gmail.com (Karthik K N) Date: Thu, 14 Jun 2012 20:31:49 +0530 Subject: [BioC] AgiMicroRna and Replicates In-Reply-To: References: <1E4718D8-D6F3-4B82-8775-713CCB6F79E0@cancercenter.columbia.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From davisjwa at health.missouri.edu Thu Jun 14 18:15:28 2012 From: davisjwa at health.missouri.edu (Davis, Wade) Date: Thu, 14 Jun 2012 16:15:28 +0000 Subject: [BioC] glmFit options in edgeR not passed to mglmLS? Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From friedman at cancercenter.columbia.edu Thu Jun 14 18:39:32 2012 From: friedman at cancercenter.columbia.edu (Richard Friedman) Date: Thu, 14 Jun 2012 12:39:32 -0400 Subject: [BioC] How many tests do you need to moderate the t-statistic in Limma Message-ID: Dear List, Is there a lower limit for the number of tests you need to produce a moderated t-statistic with Limma? I am now analyzing a peptide array which after filtering has about 200 present spots. 
Can Limma be applied to borrow power over only 200 tests? Should it be applied to this situation?

Thanks and best wishes,
Rich
------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer, Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

"School is an evil plot to suppress my individuality"
Rose Friedman, age15

From guest at bioconductor.org Thu Jun 14 19:45:15 2012
From: guest at bioconductor.org (Skanda [guest])
Date: Thu, 14 Jun 2012 10:45:15 -0700 (PDT)
Subject: [BioC] [ArrayExpress] Multiple keyword search via queryAE
Message-ID: <20120614174515.42CDA133D15@mamba.fhcrc.org>

Hi there!

I am trying to run the following queryAE search on ArrayExpress:

Crohns_Series <- queryAE("\"Crohn's+Disease\"", "homo+sapiens")

but get the following error:

Error in download.file(qr, queryfilename, mode = "wb") :
  cannot open destfile 'query"Crohn's+Disease"homo+sapiens.xml', reason 'Invalid argument'
Error: XML content does not seem to be XML, nor to identify a file name 'query"Crohn's+Disease"homo+sapiens.xml'

I suspect that, because I am trying a multiple-word keyword search on ArrayExpress (all tied together with quotes), it tries to save the XML file with the quotes in the filename and thus errors out. It works without the internal quotes (i.e. \"), but the results are not the same.

Is this actually the problem? Is there a workaround for this?

Thanks!
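The suspicion above can be checked in base R without ArrayExpress: Windows does not allow a double quote in a file name, so any code that pastes the raw query into a destination file name will fail when the file is opened. A minimal sketch, assuming the file name is built as query<keywords><species>.xml as the error message suggests. sanitize_filename() is a hypothetical helper for illustration only, not part of ArrayExpress, and whether queryAE() can be made to use a cleaned name is not something this sketch shows.

```r
# Replace characters that Windows rejects in file names: < > : " / \ | ? *
# (hypothetical helper, for illustration only)
sanitize_filename <- function(x, replacement = "_") {
  gsub("[<>:\"/\\\\|?*]", replacement, x, perl = TRUE)
}

keywords <- "\"Crohn's+Disease\""
species  <- "homo+sapiens"

# File name as it appears in the error message, quotes included
raw_name  <- paste0("query", keywords, species, ".xml")
# Same name with the offending characters replaced
safe_name <- sanitize_filename(raw_name)

print(raw_name)
print(safe_name)
```

If the cleaned name can be opened with file.create() in a temporary directory while the raw name cannot, the quotes in the file name are the likely cause of the 'Invalid argument' error.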
-- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] compiler stats graphics grDevices utils datasets methods base other attached packages: [1] ArrayExpress_1.16.0 Biobase_2.16.0 BiocGenerics_0.2.0 BiocInstaller_1.4.6 loaded via a namespace (and not attached): [1] affy_1.34.0 affyio_1.24.0 limma_3.12.1 preprocessCore_1.18.0 tools_2.15.0 XML_3.9-4.1 [7] zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org. From wuests at tcd.ie Thu Jun 14 19:55:10 2012 From: wuests at tcd.ie (Samuel Wuest) Date: Thu, 14 Jun 2012 19:55:10 +0200 Subject: [BioC] problems with annotation package (error in AnnotationDbi v1.18.1 ?) Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From hans.thompson1 at gmail.com Thu Jun 14 20:49:48 2012 From: hans.thompson1 at gmail.com (Hans Thompson) Date: Thu, 14 Jun 2012 10:49:48 -0800 Subject: [BioC] Can someone recommend a package for SNP cluster analysis of Fluidigm microarrays? In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Thu Jun 14 20:55:30 2012 From: guest at bioconductor.org (Lucia Spangenberg [guest]) Date: Thu, 14 Jun 2012 11:55:30 -0700 (PDT) Subject: [BioC] About Rsubread Message-ID: <20120614185530.517AD134498@mamba.fhcrc.org> Dear list, I have a question about the arguments of the align function in the Rsubread package. 
I have mapped my RNA-seq SOLiD data (single-end, 16 samples, 50 bp reads, human) with Rsubread using the align() function in three versions:

- default parameters (1)
- unique=TRUE and tieBreakQS=TRUE (2)
- unique=TRUE (3)

To my surprise, the percentage of mapped reads is ordered like this: (1)>(2)>(3), for all samples.
Why is it that when unique=TRUE and tieBreakQS=TRUE are both used (2), I get more mapped reads than with unique=TRUE alone (3)? The tieBreakQS argument should only decide, when two reads are equally optimally aligned, which read is kept. I expected something like this:
(1)>(2)=(3) approximately.
Where is the mistake in my reasoning?

On the other hand, after the counting procedure using the featureCounts() function (GTF with genes only), I retrieved more genes in some samples with alignment (2) than with (1). I thought that the set of mapped reads of (2) should be contained in (1). Is this also wrong? It does not happen in many samples, and the difference is not that big, but it is unexpected for me.

So, if anyone could help and sees where my thinking mistake is, I would be very thankful!

Cheers,
Lucía

-- output of sessionInfo():

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
[5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Rsubread_1.6.3

loaded via a namespace (and not attached):
[1] tools_2.15.0

--
Sent via the guest posting facility at bioconductor.org.
From wjiang2 at fhcrc.org Thu Jun 14 22:32:37 2012 From: wjiang2 at fhcrc.org (Jiang, Mike) Date: Thu, 14 Jun 2012 13:32:37 -0700 Subject: [BioC] [Bioc-devel] flowCore 1.22.0 broken for some FCS files (which it previously read without errors) Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From phipson at wehi.EDU.AU Fri Jun 15 00:58:44 2012 From: phipson at wehi.EDU.AU (Belinda Phipson) Date: Fri, 15 Jun 2012 08:58:44 +1000 Subject: [BioC] How many tests do you need to moderate the t-statistic in Limma In-Reply-To: References: Message-ID: <000901cd4a81$43e992f0$cbbcb8d0$@edu.au> Hi Richard The lower limit for the number of tests you need to produce a moderated t-statistic with limma is 2. I would definitely recommend using limma to borrow power over 200 tests, as you would certainly need to apply a multiple testing correction for that many tests, and a moderated t statistic will have more power and fewer false discoveries than an ordinary t statistic. Cheers, Belinda -----Original Message----- From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Richard Friedman Sent: Friday, 15 June 2012 2:40 AM To: Bioconductor mailing list Subject: [BioC] How many tests do you need to moderate the t-statistic in Limma Dear List, Is there a lower limit for the number of tests you need to produce a moderated t-statistic with Limma? I am now analyzing a peptide array which after filtering has about 200 present spots. Can Limma be applied to borrow power over only 200 tests? Should it be applied to this situation? Thanks and best wishes, Rich ------------------------------------------------------------ Richard A. 
Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer, Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

"School is an evil plot to suppress my individuality"
Rose Friedman, age15

From phipson at wehi.EDU.AU Fri Jun 15 01:16:48 2012
From: phipson at wehi.EDU.AU (Belinda Phipson)
Date: Fri, 15 Jun 2012 09:16:48 +1000
Subject: [BioC] using Limma to read 2-channel dye-swap in Agilent scanner
In-Reply-To:
References:
Message-ID: <000a01cd4a83$c986fa90$5c94efb0$@edu.au>

Hi Neta

On pages 16 and 17 of the limma vignette (25 March 2012) there is an explanation of the column names in the data files. You should check what the header names of your Agilent files are and then change the read.maimages() command accordingly. For example:

> RG <- read.maimages(files,
+ columns=list(R="F635 Mean",G="F532 Mean",Rb="B635 Median",Gb="B532 Median"))

The default in limma for ImaGene is to extract the signal mean, but for Agilent it is the signal median. You probably need to specify that you want to extract the medians.

Hope this helps.
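A minimal base R sketch of the header check suggested above, run on an invented toy file rather than a real scanner file. find_header() is a hypothetical helper and not a limma function, and the toy file layout and probe name are assumptions made for illustration. The column names rMedianSignal, gMedianSignal, rBGMedianSignal and gBGMedianSignal are used here as examples of what an Agilent file may contain; check them against your own files.

```r
# Return the fields of the first line that contains every name in `wanted`
# (hypothetical helper, for illustration only)
find_header <- function(path, wanted, sep = "\t", max_lines = 50) {
  lines <- readLines(path, n = max_lines, warn = FALSE)
  for (ln in lines) {
    fields <- strsplit(ln, sep, fixed = TRUE)[[1]]
    if (all(wanted %in% fields)) return(fields)
  }
  NULL  # no line carried all of the requested names
}

# Toy tab-delimited file standing in for a scanner output file (contents invented)
toy <- tempfile(fileext = ".txt")
writeLines(c(
  "TYPE\ttext\ttext",
  "FEATURES\tProbeName\trMedianSignal\tgMedianSignal\trBGMedianSignal\tgBGMedianSignal",
  "DATA\tA_23_P1\t1520\t1311\t48\t52"
), toy)

cols <- find_header(toy, wanted = c("rMedianSignal", "gMedianSignal"))
print(cols)
```

Whatever names turn up in the real header are the ones to pass as R, G, Rb and Gb in the columns=list(...) argument of read.maimages().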
Cheers, Belinda -----Original Message----- From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Neta Sent: Thursday, 14 June 2012 9:48 AM To: bioconductor at stat.math.ethz.ch Subject: [BioC] using Limma to read 2-channel dye-swap in Agilent scanner Hello, I am using Limma (version 3.12.0) to read 2-channel dye-swap files from an Agilent image analysis scanner, where each sample has 2 files - one for cy3 and one for cy5. However, the only reference to such a 2-file input in the limma user guide is to "ImaGene". When I try to read the files and the target file using the following commands: RG = read.maimages(file_names, ...) targets = readTargets("targets.txt") files = targets[,c("FileNameCy3","FileNameCy5")]; RG = read.maimages(files,source="imagene"); I get the following error message: Error in read.imagene(files = files, path = path, ext = ext, names = names, : Can't find Field Dimensions in ImaGene header In addition: Warning message: In readImaGeneHeader(fullname) : End of file encountered before End Header When I try to use source="Agilent" instead of "imagene", the command doesn't seem to understand that I have 2 file names, and gives the following error message: Error in read.maimages(files, source = "agilent") : targets frame doesn't contain FileName column I tried every single option for "source" that was listed in the help of "read.maimages" but it seems like "imagene" is the only one that is able to digest the two headers for the file names. I am stuck and would appreciate any help. Thank you, Neta. 
From shi at wehi.EDU.AU Fri Jun 15 02:09:48 2012
From: shi at wehi.EDU.AU (Wei Shi)
Date: Fri, 15 Jun 2012 10:09:48 +1000
Subject: [BioC] About Rsubread
In-Reply-To: <20120614185530.517AD134498@mamba.fhcrc.org>
References: <20120614185530.517AD134498@mamba.fhcrc.org>
Message-ID: <3D0EF8AE-3BD8-4812-9CE9-54EF94AA470A@wehi.edu.au>

Dear Lucia,

For your version 2 of the alignment, the align() function first used mapping quality scores to break ties when more than one best location was found for a read. If one of the mapping locations had a higher mapping quality score than the others, that location was selected and reported for the read. In your version 3, however, the same read is reported as an unmapped read, because it mapped to multiple locations and the 'unique=TRUE' option instructs align() not to report it. So 'tieBreakQS=TRUE' changes a multi-mapping read into a uniquely mapping read. This is why you got more mapped reads in (2) than in (3). I would recommend setting 'tieBreakQS=TRUE' if you are worried about multi-mapping reads.

For your second question, you are right that the set of mapped reads reported in version 2 should be a subset of the mapped reads reported in version 1. However, the mapping location can differ for the same read if it mapped to more than one location. In version 1, the mapping location for such a read was chosen at random, whereas in version 2 the location with the best mapping quality score was chosen.
If the mapping locations of such reads reported by version 2 fell within a gene region in your annotation but the locations reported by version 1 did not, these reads will be counted in read summarization 2 but not in summarization 1.

So the align() and featureCounts() functions did what they are supposed to do for your different runs. Let me know if this is unclear to you.

Cheers,
Wei

On Jun 15, 2012, at 4:55 AM, Lucia Spangenberg [guest] wrote:

>
> Dear list,
> I have a question about the arguments of the align function in the Rsubread package. I have mapped my RNA-seq SOLiD data (single-end, 16 samples, 50bp long reads, human) with Rsubread using the align() function in 3 versions:
>
> -default parameters (1)
> -unique=TRUE and tieBreakQS=TRUE (2)
> -unique=TRUE (3)
>
> For my surprise, the percentage of mapped reads is ordered like this: (1)>(2)>(3), for all samples.
> Why is it, that when unique and tieBreakQS=TRUE (2) is used, I get more mapped reads than only with unique (3)? tieBreaksQS argument should only decide, when two reads are equally optimally aligned, which read has to be kept. I expected something like this:
> (1)>(2)=(3) approximately.
> Where is my reasoning mistake?
>
> On the other hand, after the counting procedure using the featureCounts() function (gtf only with genes), I retrieved in some samples more genes with alignment (2) than (1). I thought that the set of mapped reads of (2) should be contained in (1)? Is this also wrong? It does not happen in many samples, and the difference is not that big, but is unexpected for me.
>
> So, if anyone could help and sees where my thinking mistake is, I would be very thankful!
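The ordering (1)>(2)>(3) explained in the reply above can be reproduced with a toy model in base R. This is only a sketch of the behaviour as described in this thread, not of the actual Rsubread implementation: the reads and quality scores are invented, and it assumes that a read whose best locations are also tied on quality stays unreported when unique=TRUE.

```r
# Toy reads: each entry holds the mapping quality scores of the equally good
# candidate locations found for that read (all numbers invented)
reads <- list(
  r1 = c(40),          # one best location: uniquely mapped
  r2 = c(30, 12),      # two best locations, quality separates them
  r3 = c(25, 25),      # two best locations, quality is tied as well
  r4 = c(35, 20, 10),  # three best locations, quality separates them
  r5 = numeric(0)      # no location found
)

n_loc <- lengths(reads)

# (1) defaults: any read with at least one location is reported
mapped_default <- n_loc >= 1

# (3) unique=TRUE: only reads with exactly one best location are reported
mapped_unique <- n_loc == 1

# (2) unique=TRUE plus tieBreakQS=TRUE: ties are first broken by quality,
# so a read is reported when a single location holds the top score
mapped_tiebreak <- vapply(
  reads,
  function(q) length(q) > 0 && sum(q == max(q)) == 1,
  logical(1)
)

counts <- c(default         = sum(mapped_default),
            unique_tiebreak = sum(mapped_tiebreak),
            unique_only     = sum(mapped_unique))
print(counts)
```

In this toy set, r2 and r4 are the reads that tieBreakQS=TRUE rescues. They are also the reads whose reported location can differ between runs (1) and (2), which is how a gene can gain counts in (2) even though (2) reports fewer reads overall.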
> > Cheers, > Luc?a > > > > > -- output of sessionInfo(): > > sessionInfo() > +R version 2.15.0 (2012-03-30) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C > [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 > [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsubread_1.6.3 > > loaded via a namespace (and not attached): > [1] tools_2.15.0 > > > -- > Sent via the guest posting facility at bioconductor.org. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}} From normanpavelka at gmail.com Fri Jun 15 05:04:42 2012 From: normanpavelka at gmail.com (Norman Pavelka) Date: Fri, 15 Jun 2012 11:04:42 +0800 Subject: [BioC] FW: plgem In-Reply-To: <5E1269353FE3EB4A92F4A98B155985F80A737D2D@exch-mbx-02.utu.fi> References: <5E1269353FE3EB4A92F4A98B155985F80A737D2D@exch-mbx-02.utu.fi> Message-ID: Hi Olli, As I suspected, you are using a very old version of R and Bioconductor (2008). Try downloading the latest version of R and re-install the latest plgem package by typing: source("http://bioconductor.org/biocLite.R") biocLite("plgem") Let me know if this solves the problem. Cheers, Norman On Thu, Jun 14, 2012 at 4:52 PM, Olli Kannaste wrote: > Hi Norman, > > Sure thing. Here is the output: > > R version 2.7.2 (2008-08-25) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252 > > attached base packages: > [1] tools ? ? stats ? ? graphics ?grDevices utils ? ? datasets ?methods > [8] base > > other attached packages: > [1] plgem_1.12.0 ?MASS_7.2-44 ? 
Biobase_2.0.1
>
> --
>
> Thanks,
> Olli
> ________________________________________
> From: Norman Pavelka [normanpavelka at gmail.com]
> Sent: 14 June 2012 4:35
> To: Olli Kannaste
> Cc: bioconductor at r-project.org
> Subject: Fwd: FW: plgem
>
> Dear Olli,
>
> Thanks for your email and your continued interest in plgem. From your
> error message it looks like there might be a version problem. Could
> you please send me the output of sessionInfo() ?
> I am also copying the Bioconductor mailing list, so the thread gets archived.
>
> Cheers,
> Norman
>
> -----Original Message-----
> From: Olli Kannaste [mailto:ojkann at utu.fi]
> Sent: Wednesday, 13 June, 2012 9:07 PM
> To: Norman Pavelka (SIgN)
> Subject: plgem
>
> Hi Norman,
>
> I approached you about 2.5 years ago regarding problems I was having
> with the plgem analysis. You were kind enough to provide me an R
> script, which automated the analysis. That worked fine and helped me a
> great deal. I am now trying plgem again using the script with some
> other data, and having some difficulties... I'm working on a different
> computer now and have installed the latest version of plgem. My guess
> is that for some reason the script is not working properly with the
> new plgem version. It lets me input my parameters and specify data
> files but fails to proceed right after that, generating the following
> error message:
>
> Error in run.plgem(get(expressionSetName), signLev = pVal, rank = 100, :
>   unused argument(s) (trimAllZeroRows = TRUE, zeroMeanOrSD = "replace")
>
> Could you perhaps help me out with this one? I'm attaching the script
> and my input files in the message.
> > Best regards, > Olli From guest at bioconductor.org Fri Jun 15 08:30:16 2012 From: guest at bioconductor.org (Sonal Bakiwala [guest]) Date: Thu, 14 Jun 2012 23:30:16 -0700 (PDT) Subject: [BioC] RE : Error in intgroup of arrayQualityMetrics package Message-ID: <20120615063016.7AB89134FEF@mamba.fhcrc.org> I am using arraQualityMetrics package installed from Bioconductor site and R version that I am using is 2.15.0 The input for the function was eset and for the intgroup argument character vector "Tissue". There is a column named Tissue in my phenoData of the eset. But it still gives me an error saying the elements of intgroup do not match the column names of the pData(eset). I don't know what wrong I am doing. The error look like this : Error in prepData(expressionset,intgroup=intgroup): all elements of 'intgroup' should match column names of pData(expressionset) -- output of sessionInfo(): > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.4.6 arrayQualityMetrics_3.12.0 [3] affy_1.34.0 limma_3.12.1 [5] Biobase_2.16.0 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] affyio_1.24.0 affyPLM_1.32.0 annotate_1.34.0 [4] AnnotationDbi_1.18.1 beadarray_2.6.0 BeadDataPackR_1.8.0 [7] Biostrings_2.24.1 Cairo_1.5-1 cluster_1.14.2 [10] colorspace_1.1-1 DBI_0.2-5 genefilter_1.38.0 [13] grid_2.15.0 Hmisc_3.9-3 hwriter_1.3 [16] IRanges_1.14.3 lattice_0.20-6 latticeExtra_0.6-19 [19] plyr_1.7.1 preprocessCore_1.18.0 RColorBrewer_1.0-5 [22] reshape2_1.2.1 RSQLite_0.11.1 setRNG_2011.11-2 [25] splines_2.15.0 stats4_2.15.0 stringr_0.6 [28] survival_2.36-12 
SVGAnnotation_0.9-0 tools_2.15.0
[31] vsn_3.24.0 XML_3.9-4 xtable_1.7-0
[34] zlibbioc_1.2.0

> intgroup
[1] "Tissue"
> str(intgroup)
 chr "Tissue"

Sorry, I won't be able to provide the detailed contents of the pData, but colnames(pData(eset)) includes a column named "Tissue", and the class of this column is factor.

Thank you.

--
Sent via the guest posting facility at bioconductor.org.

From mark.robinson at imls.uzh.ch Fri Jun 15 10:36:10 2012
From: mark.robinson at imls.uzh.ch (Mark Robinson)
Date: Fri, 15 Jun 2012 10:36:10 +0200
Subject: [BioC] 2012 Bioconductor European Developers' Workshop - annoucement
Message-ID: <4BD32519-3F10-4721-87E5-B29252690DFA@imls.uzh.ch>

Dear Bioconductors,

We are very pleased to announce that the 2012 Bioconductor European Developers' Workshop will take place in Zürich, Switzerland on December 13th-14th.

This workshop is aimed at Bioconductor contributors and bioinformaticians who wish to contribute packages to the Bioconductor project. The aim of the meeting is to foster the exchange of technical expertise, to keep contributors up to speed with the latest developments and to coordinate related efforts.

Topics this year will include:
* New developments in Bioconductor core packages and infrastructure
* Infrastructure and methods for understanding gene regulation
* Next Generation Data handling and analysis
* Reproducible research and software engineering
* Tools for the management and analysis of proteomics datasets

Further details can be found here: http://www.fgcz.ch/Bioconductor2012

We hope to see you in Zürich in December!

Best wishes,
Mark Robinson, University of Zürich
Michal Okoniewski, Functional Genomics Centre Zürich

From alix.tofigh at gmail.com Fri Jun 15 11:35:47 2012
From: alix.tofigh at gmail.com (Ali Tofigh)
Date: Fri, 15 Jun 2012 05:35:47 -0400
Subject: [BioC] Microarray experiment design issues
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From ojkann at utu.fi Fri Jun 15 11:54:47 2012
From: ojkann at utu.fi (Olli Kannaste)
Date: Fri, 15 Jun 2012 09:54:47 +0000
Subject: [BioC] FW: plgem
In-Reply-To:
References: <5E1269353FE3EB4A92F4A98B155985F80A737D2D@exch-mbx-02.utu.fi>
Message-ID: <5E1269353FE3EB4A92F4A98B155985F80A737D70@exch-mbx-02.utu.fi>

Yes, that sorted it out! Thanks for the tip.

Cheers,
Olli
________________________________________
From: Norman Pavelka [normanpavelka at gmail.com]
Sent: 15 June 2012 6:04
To: Olli Kannaste
Cc: bioconductor at r-project.org
Subject: Re: FW: plgem

Hi Olli,

As I suspected, you are using a very old version of R and Bioconductor (2008). Try downloading the latest version of R and re-install the latest plgem package by typing:

 source("http://bioconductor.org/biocLite.R")
 biocLite("plgem")

Let me know if this solves the problem.

Cheers,
Norman

On Thu, Jun 14, 2012 at 4:52 PM, Olli Kannaste wrote:
> Hi Norman,
>
> Sure thing. Here is the output:
>
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] tools stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] plgem_1.12.0 MASS_7.2-44 Biobase_2.0.1
>
> --
>
> Thanks,
> Olli
> ________________________________________
> From: Norman Pavelka [normanpavelka at gmail.com]
> Sent: 14 June 2012 4:35
> To: Olli Kannaste
> Cc: bioconductor at r-project.org
> Subject: Fwd: FW: plgem
>
> Dear Olli,
>
> Thanks for your email and your continued interest in plgem. From your
> error message it looks like there might be a version problem. Could
> you please send me the output of sessionInfo() ?
> I am also copying the Bioconductor mailing list, so the thread gets archived.
> > Cheers, > Norman > > -----Original Message----- > From: Olli Kannaste [mailto:ojkann at utu.fi] > Sent: Wednesday, 13 June, 2012 9:07 PM > To: Norman Pavelka (SIgN) > Subject: plgem > > Hi Norman, > > I approached you about 2.5 years ago regarding problems i was having > with the plgem analysis. You were kind enough to provide me an R > script, which automated the analysis. That worked fine and helped me a > great deal. I am now trying plgem again using the script with some > other data, and having some difficulties... I'm working on a different > computer now and have installed the latest version of plgem. My guess > is that for some reason the script is not working properly with the > new plgem version. It lets me input my parameters and specify data > files but fails to proceed right after that, generating the following > error message: > > Error in run.plgem(get(expressionSetName), signLev = pVal, rank = 100, : > unused argument(s) (trimAllZeroRows = TRUE, zeroMeanOrSD = "replace") > > Could you perhaps help me out with this one? I'm attaching the script > and my input files in the message. > > Best regards, > Olli From normanpavelka at gmail.com Fri Jun 15 11:56:10 2012 From: normanpavelka at gmail.com (Norman Pavelka) Date: Fri, 15 Jun 2012 17:56:10 +0800 Subject: [BioC] FW: plgem In-Reply-To: <5E1269353FE3EB4A92F4A98B155985F80A737D70@exch-mbx-02.utu.fi> References: <5E1269353FE3EB4A92F4A98B155985F80A737D2D@exch-mbx-02.utu.fi> <5E1269353FE3EB4A92F4A98B155985F80A737D70@exch-mbx-02.utu.fi> Message-ID: I'm glad it worked. Hope the results are interesting, too! :-) Cheers, Norman On Fri, Jun 15, 2012 at 5:54 PM, Olli Kannaste wrote: > Yes, that sorted it out! Thanks for the tip. > > Cheers, > Olli > ________________________________________ > L?hett?j?: Norman Pavelka [normanpavelka at gmail.com] > L?hetetty: 15. 
kes?kuuta 2012 6:04 > Vastaanottaja: Olli Kannaste > Cc: bioconductor at r-project.org > Aihe: Re: FW: plgem > > Hi Olli, > > As I suspected, you are using a very old version of R and Bioconductor > (2008). Try downloading the latest version of R and re-install the > latest plgem package by typing: > > ?source("http://bioconductor.org/biocLite.R") > ?biocLite("plgem") > > Let me know if this solves the problem. > > Cheers, > Norman > > On Thu, Jun 14, 2012 at 4:52 PM, Olli Kannaste wrote: >> Hi Norman, >> >> Sure thing. Here is the output: >> >> R version 2.7.2 (2008-08-25) >> i386-pc-mingw32 >> >> locale: >> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252 >> >> attached base packages: >> [1] tools ? ? stats ? ? graphics ?grDevices utils ? ? datasets ?methods >> [8] base >> >> other attached packages: >> [1] plgem_1.12.0 ?MASS_7.2-44 ? Biobase_2.0.1 >> >> -- >> >> Thanks, >> Olli >> ________________________________________ >> L?hett?j?: Norman Pavelka [normanpavelka at gmail.com] >> L?hetetty: 14. kes?kuuta 2012 4:35 >> Vastaanottaja: Olli Kannaste >> Cc: bioconductor at r-project.org >> Aihe: Fwd: FW: plgem >> >> Dear Olli, >> >> Thanks for your email and your continued interest in plgem. From your >> error message it looks like there might be a version problem. Could >> you please send me the output of sessionInfo() ? >> I am also copying the Bioconductor mailing list, so the thread gets archived. >> >> Cheers, >> Norman >> >> -----Original Message----- >> From: Olli Kannaste [mailto:ojkann at utu.fi] >> Sent: Wednesday, 13 June, 2012 9:07 PM >> To: Norman Pavelka (SIgN) >> Subject: plgem >> >> Hi Norman, >> >> I approached you about 2.5 years ago regarding problems i was having >> with the plgem analysis. You were kind enough to provide me an R >> script, which automated the analysis. That worked fine and helped me a >> great deal. 
I am now trying plgem again using the script with some
>> other data, and having some difficulties... I'm working on a different
>> computer now and have installed the latest version of plgem. My guess
>> is that for some reason the script is not working properly with the
>> new plgem version. It lets me input my parameters and specify data
>> files but fails to proceed right after that, generating the following
>> error message:
>>
>> Error in run.plgem(get(expressionSetName), signLev = pVal, rank = 100, :
>>  unused argument(s) (trimAllZeroRows = TRUE, zeroMeanOrSD = "replace")
>>
>> Could you perhaps help me out with this one? I'm attaching the script
>> and my input files in the message.
>>
>> Best regards,
>> Olli

From david at harsk.dk Fri Jun 15 12:27:44 2012
From: david at harsk.dk (David Westergaard)
Date: Fri, 15 Jun 2012 12:27:44 +0200
Subject: [BioC] Microarray experiment design issues
In-Reply-To:
References:
Message-ID:

Hi Ali,

I don't think this list is appropriate for these questions, since they don't really involve any Bioconductor packages.

However, I don't really see why you would have a problem in A. Are both groups of cells not exposed to the same variation in humidity levels, temperature, oxygen levels, etc., so that you would expect the same variation due to these factors in both treated and untreated, and thus the net variation due to these factors would be approximately zero?

Also, in setup B, could the technical noise not cause too great a distance between arrays for any treatment-related variation to be picked up?

I don't really have any experience with experimental setup, so the above comments are just my logical conclusions from working with microarray data.

Best,
David

2012/6/15 Ali Tofigh :
> Our goal is to measure the effects of a treatment on a specific cell line
> using gene expression microarrays (agilent 2-color).
There are two possible > experimental designs: > > A) perform the entire experiment in one day: split cells into 6 groups, > treat 3 with compound and leave 3 untreated. This setup minimizes technical > variation, but the list of differentially expressed genes will include some > that are differentially expressed mainly due to the specific conditions on > the day of the experiment (humidty levels, temperature, oxygen levels, > etc). > > B) perform the experiment on three separate occasions: each day, split > cells into two groups, treat only one with compound. An paired analyis > would be appropriate here. This setup introduces noise (technical noise > because of separate handling of the three pairs and noise from daily > variation of the environmental conditions) and so we lose some statistical > power. However, since the experiment is performed under slightly different > environmental conditions, some of the condition-specific genes will no > longer show up as differentially expressed and the list of genes would in > this sense be more robust/reproducible. > > Does anyone have experience with both setups? I would like to know if the > amount of variance that is introduced in setup B can be expected to be low > enough to not lose too much power while producing a more robust set of > differentially expressed genes. > > Cheers > /Ali > > ? ? ? 
[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From whuber at embl.de Fri Jun 15 14:48:44 2012
From: whuber at embl.de (Wolfgang Huber)
Date: Fri, 15 Jun 2012 14:48:44 +0200
Subject: [BioC] Microarray experiment design issues
In-Reply-To:
References:
Message-ID: <4FDB2F2C.5050104@embl.de>

Hi Ali

I think it is fine to discuss this type of question on this list (and it seems to be consistent with the statement on http://www.bioconductor.org/help/mailing-list ).

If your aim is to make a general statement about the compound effect, rather than on what happened on a particular day, and all other costs being equal, then B is preferable. I would not term this "loss of power": effects that you see in A but not in B are not easily reproducible, and thus plausibly of lower interest.

However, I am not sure how important this issue is compared to many of your other choices, and the biases and errors that they might introduce, such as: choice of cell line, choice of compound dose and incubation time, the sensitivity and specificity of the particular array platform. If you are worried about robust inference, then perhaps you should also consider which of these factors need to be scanned.

Most importantly, what will the resulting gene list be used for next? Nobody expects these lists to end up as "standalone truths"; they usually have a purpose (e.g. hit picking for subsequent single gene research; search for biological themes and stories that give you a warm fuzzy feeling; elucidation of the molecular target(s) of the drug, and perhaps again their downstream targets; clustering of drugs by similarity of the response; etc.) I often find that once you have sorted out these questions, the data analytic strategy also becomes more apparent.
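As a rough illustration of the paired analysis that setup B calls for, here is a minimal base-R sketch of a design matrix that blocks on day. The sample sheet, the factor names (day, treatment) and the sample ordering are assumptions made for illustration, not details from the thread; it also assumes one expression value per sample rather than two-color log-ratios.

```r
# Hypothetical sample sheet for setup B: three days, with one untreated and
# one treated sample per day (names and ordering are assumptions).
targets <- data.frame(
  day       = factor(rep(c("d1", "d2", "d3"), each = 2)),
  treatment = factor(rep(c("untreated", "treated"), times = 3),
                     levels = c("untreated", "treated"))
)

# Including 'day' as a blocking factor means the treatment effect is
# estimated from within-day differences only.
design <- model.matrix(~ day + treatment, data = targets)
colnames(design)
```

A matrix like this could be handed to limma's lmFit() so that the treatment coefficient is estimated within days. If each day's treated and untreated samples are co-hybridized on one two-color array, the per-array log-ratio already carries the pairing and a simpler design may be enough.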
Best wishes Wolfgang Ali Tofigh scripsit 06/15/2012 11:35 AM: > Our goal is to measure the effects of a treatment on a specific cell line > using gene expression microarrays (agilent 2-color). There are two possible > experimental designs: > > A) perform the entire experiment in one day: split cells into 6 groups, > treat 3 with compound and leave 3 untreated. This setup minimizes technical > variation, but the list of differentially expressed genes will include some > that are differentially expressed mainly due to the specific conditions on > the day of the experiment (humidty levels, temperature, oxygen levels, > etc). > > B) perform the experiment on three separate occasions: each day, split > cells into two groups, treat only one with compound. An paired analyis > would be appropriate here. This setup introduces noise (technical noise > because of separate handling of the three pairs and noise from daily > variation of the environmental conditions) and so we lose some statistical > power. However, since the experiment is performed under slightly different > environmental conditions, some of the condition-specific genes will no > longer show up as differentially expressed and the list of genes would in > this sense be more robust/reproducible. > > Does anyone have experience with both setups? I would like to know if the > amount of variance that is introduced in setup B can be expected to be low > enough to not lose too much power while producing a more robust set of > differentially expressed genes. 
> > Cheers > /Ali > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber From whuber at embl.de Fri Jun 15 15:00:50 2012 From: whuber at embl.de (Wolfgang Huber) Date: Fri, 15 Jun 2012 15:00:50 +0200 Subject: [BioC] RE : Error in intgroup of arrayQualityMetrics package In-Reply-To: <20120615063016.7AB89134FEF@mamba.fhcrc.org> References: <20120615063016.7AB89134FEF@mamba.fhcrc.org> Message-ID: <4FDB3202.2060107@embl.de> Dear Sonal, you asked the same question before: https://stat.ethz.ch/pipermail/bioconductor/2012-June/045922.html and I replied, asking you for more information that is needed to diagnose your problem. Now, you post this again, adding that you won't be able to provide the needed information. Sorry, but this means I am not able to provide help. The 'intgroup'-related functionality in the package works, as is shown for several examples in the package vignette. best wishes Wolfgang Sonal Bakiwala [guest] scripsit 06/15/2012 08:30 AM: > > I am using arraQualityMetrics package installed from Bioconductor site and R version that I am using is 2.15.0 > > The input for the function was eset and for the intgroup argument character vector "Tissue". There is a > column named Tissue in my phenoData of the eset. > > But it still gives me an error saying the elements of intgroup do not match the column names of the pData(eset). > I don't know what wrong I am doing. 
> > The error look like this : > > Error in prepData(expressionset,intgroup=intgroup): > all elements of 'intgroup' should match column names of pData(expressionset) > > > > -- output of sessionInfo(): > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BiocInstaller_1.4.6 arrayQualityMetrics_3.12.0 > [3] affy_1.34.0 limma_3.12.1 > [5] Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affyio_1.24.0 affyPLM_1.32.0 annotate_1.34.0 > [4] AnnotationDbi_1.18.1 beadarray_2.6.0 BeadDataPackR_1.8.0 > [7] Biostrings_2.24.1 Cairo_1.5-1 cluster_1.14.2 > [10] colorspace_1.1-1 DBI_0.2-5 genefilter_1.38.0 > [13] grid_2.15.0 Hmisc_3.9-3 hwriter_1.3 > [16] IRanges_1.14.3 lattice_0.20-6 latticeExtra_0.6-19 > [19] plyr_1.7.1 preprocessCore_1.18.0 RColorBrewer_1.0-5 > [22] reshape2_1.2.1 RSQLite_0.11.1 setRNG_2011.11-2 > [25] splines_2.15.0 stats4_2.15.0 stringr_0.6 > [28] survival_2.36-12 SVGAnnotation_0.9-0 tools_2.15.0 > [31] vsn_3.24.0 XML_3.9-4 xtable_1.7-0 > [34] zlibbioc_1.2.0 >> intgroup > [1] "Tissue" >> str(intgroup) > chr "Tissue" > > Sorry I wont be able to provide you with the detailed information of the pData. > But the colnames(pData(eset)) has one of columns named as "Tissue" and the class of the this column is factor. > > Thank you. > > -- > Sent via the guest posting facility at bioconductor.org. 
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

From atarca at med.wayne.edu Fri Jun 15 15:47:40 2012
From: atarca at med.wayne.edu (Tarca, Adi)
Date: Fri, 15 Jun 2012 13:47:40 +0000
Subject: [BioC] charm package using MEDIP protocol
Message-ID: <6DE578F501A8B2489DBD4893CEC996BA0315A86F@MED-CORE07B.med.wayne.edu>

Dear all,

I am trying to use the charm package to analyze a Human 2.1M Deluxe Promoter Array dataset produced using the MeDIP protocol (i.e. untreated sample on one channel and the same sample enriched for methylated DNA on the other channel, unlike the McrBC protocol used to illustrate the use of charm, where the treated sample is enriched for unmethylated DNA).

Although the "Human 2.1M Deluxe Promoter Array (pd.081229.hg18.promoter.medip.hx1)" is supported by the charm package, I did not see anywhere how one specifies the type of protocol used (MeDIP vs McrBC).

It appears to me that the methylation estimate would just change meaning from (0=unmethylated; 1=methylated) to (1=unmethylated; 0=methylated) if MeDIP arrays are used instead of McrBC arrays in charm, but any suggestions would be appreciated.

Thanks,
Adi Tarca

From curoli at gmail.com Fri Jun 15 18:32:50 2012
From: curoli at gmail.com (Oliver Ruebenacker)
Date: Fri, 15 Jun 2012 12:32:50 -0400
Subject: [BioC] BioPAX parsing
In-Reply-To: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com>
References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com>
Message-ID:

Hello Martin,

I'm currently looking into reading BioPAX into R using RJava and OpenRDF Sesame. If there is interest, I may be looking into submitting a package to BioConductor.

It would be very helpful if you could tell me what you need the BioPAX data for, and in what form it would be best for you. Possible options are:

- A data frame of the RDF/OWL triples
- A graph of the RDF/OWL triples
- A data frame with one row for each reaction-participant
- A bi-partite graph with nodes for reactions and nodes for substances
- A graph with nodes for substances only, with edges for interactions
- A genetic interaction graph

This list is roughly sorted from the easiest to the most difficult to provide.

Take care
Oliver

On Thu, Jun 14, 2012 at 10:01 AM, Martin Preusse wrote:
> Many biological pathway resourced provide their data in the BioPAX format (http://www.biopax.org/index.php), a special XML format for biological interaction networks. Examples are pathway commons (http://www.pathwaycommons.org/pc/) and Reactome (http://www.reactome.org (http://www.reactome.org/)).
>
> A JAVA library for parsing BioPAX files exists: http://www.biopax.org/paxtools.php
>
> Has anybody used BioPAX files with R? Is it possible to read BioPAX files in any R based graph structure? A solution similar to the KEGGgraph package for KEGG pahways would be great, since more and more databases start using BioPAX.
>
>
> Any ideas are appreciated!
> > Cheers > Martin > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Oliver Ruebenacker Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) Knowomics, The Bioinformatics Network (http://www.knowomics.com) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) From lgoff at csail.mit.edu Fri Jun 15 19:22:39 2012 From: lgoff at csail.mit.edu (Loyal Goff) Date: Fri, 15 Jun 2012 13:22:39 -0400 Subject: [BioC] cummeRbund errors In-Reply-To: References: , <129A8C7E-F590-4E9F-9194-9DE1E16D6090@fhcrc.org> Message-ID: Hi Li, Sorry for the delayed response. I have not been able to reproduce the first error, can you email me directly your cuffData.db file and I will see if I can figure out what's going on.. As for the second error. This arises from the fact that the csDensity function is dying in the middle of a transaction. Another call to readCufflinks() should re-establish a connection to the database and resolve this issue. Cheers, Loyal On Jun 5, 2012, at 7:40 PM, Wang, Li wrote: > Dear list members > > I am struggling with cummeRbund. I tried some codes listed here, and am confronted with some errors. > >> cuff_data <- readCufflinks('diff_out') >> csDensity(genes(cuff_data)) > > Error in dat$fpkm + pseudocount : non-numeric argument to binary operator > >> diffGeneIDs <- getSig(cuff_data, level="genes", alpha=0.05) >> diffGenes <- getGenes(cuff_data, diffGeneIDs) > > Error in sqliteExecStatement(conn, statement, ...) 
: > RS-DBI driver: (RS_SQLite_exec: could not execute1: cannot start a transaction within a transaction) > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] cummeRbund_1.2.0 reshape2_1.2.1 ggplot2_0.9.1 RSQLite_0.11.1 DBI_0.2-5 > > loaded via a namespace (and not attached): > [1] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2 grid_2.15.0 > [5] labeling_0.1 MASS_7.3-18 memoise_0.1 munsell_0.3 > [9] plyr_1.7.1 proto_0.3-9.2 RColorBrewer_1.0-5 scales_0.2.1 > [13] stringr_0.6 > > I cannot figure out the reason, could anyone give me some hints? > > Thanks in advance! > > Best wishes > Li > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From alix.tofigh at gmail.com Fri Jun 15 19:50:00 2012 From: alix.tofigh at gmail.com (Ali Tofigh) Date: Fri, 15 Jun 2012 13:50:00 -0400 Subject: [BioC] Microarray experiment design issues In-Reply-To: <4FDB2F2C.5050104@embl.de> References: <4FDB2F2C.5050104@embl.de> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From ying_chen at live.com Fri Jun 15 20:08:26 2012 From: ying_chen at live.com (ying chen) Date: Fri, 15 Jun 2012 14:08:26 -0400 Subject: [BioC] Is there a package that maps short sequence to exon? Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From martin.preusse at googlemail.com Fri Jun 15 21:03:45 2012 From: martin.preusse at googlemail.com (Martin Preusse) Date: Fri, 15 Jun 2012 21:03:45 +0200 Subject: [BioC] BioPAX parsing In-Reply-To: References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com> Message-ID: <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com> Hi Oliver, I think there is a lot interest in a bioconductor package! Personally, I would like to read pathways stored in the BioPAX format into any kind of graph. It's a philosophical question if reactions should have nodes or should sit on the edges :) So far I have not used any R graph package. But I assume there are some very generic packages which are flexible enough to support both direct and bi-partite pathway structure. I used e.g. the JUNG graph API for JAVA extensively. I'm not sure what you mean with RDF/OWL triples. For me BioPAX is only a format to store a pathway. And I would like to bring it back into its natural form: a network! Do you have any code to test? I have used RJava before. All this RDF and XML file format stuff kind of puzzles me though ? :) Cheers Martin Am Freitag, 15. Juni 2012 um 18:32 schrieb Oliver Ruebenacker: > Hello Martin, > > I'm currently looking into reading BioPAX into R using RJava and > OpenRDF Sesame. If there is interest, I may be looking into submitting > a package to BioConductor. > > It would be very helpful if you could tell me what you need the > BioPAX data for, and in what form it would be best for you. Possible > options are: > > - A data frame of the RDF/OWL triples > - A graph of the RDF/OWL triples > - A data frame with one row for each reaction-participant > - A bi-partite graph with nodes for reactions and nodes for substances > - A with nodes for substances only, with edges for interactions > - A genetic interaction graph > > This list is roughly sorted form the one most easy to the most > difficult to provide. 
> > Take care > Oliver > > On Thu, Jun 14, 2012 at 10:01 AM, Martin Preusse > wrote: > > Many biological pathway resourced provide their data in the BioPAX format (http://www.biopax.org/index.php), a special XML format for biological interaction networks. Examples are pathway commons (http://www.pathwaycommons.org/pc/) and Reactome (http://www.reactome.org (http://www.reactome.org/)). > > > > A JAVA library for parsing BioPAX files exists: http://www.biopax.org/paxtools.php > > > > Has anybody used BioPAX files with R? Is it possible to read BioPAX files in any R based graph structure? A solution similar to the KEGGgraph package for KEGG pahways would be great, since more and more databases start using BioPAX. > > > > > > Any ideas are appreciated! > > > > Cheers > > Martin > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > -- > Oliver Ruebenacker > Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) > Knowomics, The Bioinformatics Network (http://www.knowomics.com) > SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) > From gingerplum at hotmail.com Fri Jun 15 21:04:45 2012 From: gingerplum at hotmail.com (JiangMei) Date: Sat, 16 Jun 2012 03:04:45 +0800 Subject: [BioC] How to print out normalized Cy5 and Cy3 signals Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From curoli at gmail.com Fri Jun 15 21:23:26 2012
From: curoli at gmail.com (Oliver Ruebenacker)
Date: Fri, 15 Jun 2012 15:23:26 -0400
Subject: [BioC] BioPAX parsing
In-Reply-To: <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com>
References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com> <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com>
Message-ID:

Hello Martin,

I don't have code in R to test yet, but I do have extensive experience handling BioPAX in Java, so I'm assuming reading BioPAX using RJava should not be too difficult.

The best target format depends on what people would like to do with the data. For visualization, a bi-partite graph in a popular graph-layout package should be best. Is there any particular graph package in BioConductor or R in general you would recommend?

For actual analysis, people probably have more specific requirements.

BioPAX is a format based on RDF/OWL, which in turn is based on organizing data in triples, which could be stored in a three-column data frame (or perhaps a fourth column for data type). For example (incomplete, for illustration only):

 ex:mapPhosphorylization rdf:type bp:BiochemicalReaction.
 ex:atp rdf:type bp:SmallMolecule.
 ex:adp rdf:type bp:SmallMolecule.
 ex:map rdf:type bp:Protein.
 ex:mapPhosphorylized rdf:type bp:Protein.
 ex:mapPhosphorylization bp:left ex:atp.
 ex:mapPhosphorylization bp:left ex:map.
 ex:mapPhosphorylization bp:right ex:adp.
 ex:mapPhosphorylization bp:right ex:mapPhosphorylized.

Take care
Oliver

On Fri, Jun 15, 2012 at 3:03 PM, Martin Preusse wrote:
> Hi Oliver,
>
> I think there is a lot interest in a bioconductor package!
>
> Personally, I would like to read pathways stored in the BioPAX format into any kind of graph. It's a philosophical question if reactions should have nodes or should sit on the edges :) So far I have not used any R graph package.
But I assume there are some very generic packages which are flexible enough to support both direct and bi-partite pathway structure. I used e.g. the JUNG graph API for JAVA extensively. > > I'm not sure what you mean with RDF/OWL triples. For me BioPAX is only a format to store a pathway. And I would like to bring it back into its natural form: a network! > > Do you have any code to test? I have used RJava before. All this RDF and XML file format stuff kind of puzzles me though ? :) > > Cheers > Martin > > > > Am Freitag, 15. Juni 2012 um 18:32 schrieb Oliver Ruebenacker: > >> Hello Martin, >> >> I'm currently looking into reading BioPAX into R using RJava and >> OpenRDF Sesame. If there is interest, I may be looking into submitting >> a package to BioConductor. >> >> It would be very helpful if you could tell me what you need the >> BioPAX data for, and in what form it would be best for you. Possible >> options are: >> >> - A data frame of the RDF/OWL triples >> - A graph of the RDF/OWL triples >> - A data frame with one row for each reaction-participant >> - A bi-partite graph with nodes for reactions and nodes for substances >> - A with nodes for substances only, with edges for interactions >> - A genetic interaction graph >> >> This list is roughly sorted form the one most easy to the most >> difficult to provide. >> >> Take care >> Oliver >> >> On Thu, Jun 14, 2012 at 10:01 AM, Martin Preusse >> wrote: >> > Many biological pathway resourced provide their data in the BioPAX format (http://www.biopax.org/index.php), a special XML format for biological interaction networks. Examples are pathway commons (http://www.pathwaycommons.org/pc/) and Reactome (http://www.reactome.org (http://www.reactome.org/)). >> > >> > A JAVA library for parsing BioPAX files exists: http://www.biopax.org/paxtools.php >> > >> > Has anybody used BioPAX files with R? Is it possible to read BioPAX files in any R based graph structure? 
A solution similar to the KEGGgraph package for KEGG pahways would be great, since more and more databases start using BioPAX. >> > >> > >> > Any ideas are appreciated! >> > >> > Cheers >> > Martin >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> >> >> >> >> -- >> Oliver Ruebenacker >> Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) >> Knowomics, The Bioinformatics Network (http://www.knowomics.com) >> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) >> > > > -- Oliver Ruebenacker Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) Knowomics, The Bioinformatics Network (http://www.knowomics.com) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) From mtmorgan at fhcrc.org Fri Jun 15 22:06:46 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Fri, 15 Jun 2012 13:06:46 -0700 Subject: [BioC] Is there a package that maps short sequence to exon? In-Reply-To: References: Message-ID: <4FDB95D6.5050708@fhcrc.org> On 06/15/2012 11:08 AM, ying chen wrote: > > > > > Hi guys, I just wonder if there is any Bioconductor package that can > take a short nucleotide sequence (25 mer) or its genomic coordinate > (chr& pos) as input and return exon number it maps to? 
Thanks a lot

Take a look at GenomicRanges / GenomicFeatures. You might read your alignments with readGappedAlignments, or just create a GRanges() object,

  library(GenomicRanges)
  reads = GRanges(c("chr1", "chr7"), IRanges(start=c(12614, 195554), width=1))

and then use a package like TxDb.Hsapiens

  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  ex = exons(TxDb.Hsapiens.UCSC.hg19.knownGene)

and findOverlaps

  hits = findOverlaps(reads, ex)

to discover that your 'query' (reads) overlaps the exons

  > queryHits(hits)
  [1] 1 1 2
  > values(ex)$exon_id[subjectHits(hits)]
  [1]     5     2 98786

Martin

> for the help! Ying Chen [[alternative HTML version deleted]]
>
> _______________________________________________ Bioconductor mailing
> list Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

From mcarlson at fhcrc.org Fri Jun 15 22:33:28 2012
From: mcarlson at fhcrc.org (Marc Carlson)
Date: Fri, 15 Jun 2012 13:33:28 -0700
Subject: [BioC] Wheat annotation
In-Reply-To:
References: <20120613005854.5EFE8134449@mamba.fhcrc.org> <4FD89E88.9050805@uw.edu> <4FD8BCE9.7020507@uw.edu>
Message-ID: <4FDB9C18.2070600@fhcrc.org>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mcarlson at fhcrc.org Fri Jun 15 23:05:23 2012
From: mcarlson at fhcrc.org (Marc Carlson)
Date: Fri, 15 Jun 2012 14:05:23 -0700
Subject: [BioC] problems with annotation package (error in AnnotationDbi v1.18.1 ?)
In-Reply-To:
References:
Message-ID: <4FDBA393.90307@fhcrc.org>

Hi Sam,

There are a couple of things going on here. The first is that you are using the same package, AGILENTv2.db, with both bioc 2.10 and bioc 2.9.
That is guaranteed to be a bad idea, because each version of Bioconductor (including the annotation packages) is tested only with the software packages from that same version. Now the AGILENTv2.db package does not look like one that you got from us, so I can't tell you which version of bioc you should be using it with; that would depend on which version it was made with. But I can tell you that it should probably not be used with any version other than the one it was made with.

The second thing that is going on is that a chipDb package, which is what AGILENTv2.db appears to be, does not really have much annotation data in it. If you look inside the DB that is wrapped in a chipDb package, you will notice that there is not really much data in it. That is because packages like this point to an organism package like org.At.tair.db for most of their data. The organism package does have a lot of data in it, and is joined to the chipDb package via gene identifiers to get the data that you are looking at in your example. So the fact that you have two different versions of bioc installed, each with its own version-matched organism package, is causing the differences that you are seeing.

So if you really wanted to make the results the same, you would need to use the same version of the org packages in both cases. But please don't do that! Instead, please remember that those packages are made for each version of bioc. Because of the testing limitations inherent to a system where all the packages may change for every single release, we can only test to make sure that they work with the other packages within that specific version of bioc. If you mix and match packages across different versions, it is possible, and even somewhat likely, that you will get some unexpected results.
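
One way to spot this kind of mismatch is to list the installed versions of the packages involved on each machine and compare them. What follows is only a minimal sketch in base R: the helper name report_versions is made up for illustration and is not part of AnnotationDbi or any other package, and the package names are simply the ones mentioned in this thread.

```r
# Hypothetical helper (an assumption for illustration, not a Bioconductor
# function): report the installed version of each named package, base R only.
report_versions <- function(pkgs) {
  # installed.packages() returns a character matrix, one row per package
  ip <- installed.packages()[, c("Package", "Version", "Built"), drop = FALSE]
  found <- ip[ip[, "Package"] %in% pkgs, , drop = FALSE]
  # "Built" is the R version the package was built under
  data.frame(found, row.names = NULL, stringsAsFactors = FALSE)
}

# Package names taken from the thread; any that are not installed are
# simply absent from the result.
report_versions(c("AnnotationDbi", "org.At.tair.db", "AGILENTv2.db"))
```

Running this in both R installations and comparing the rows should show whether the organism package versions differ, which is the situation described above.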
Please let me know if you have more questions, Marc On 06/14/2012 10:55 AM, Samuel Wuest wrote: > Hi all, > > I have a problem that concerns the use of annotation packages, here in an > example for a custom-made AGILENT microarray for Arabidopsis (but I think > it also concerns the org.At.tair.db package). I don't quite understand how > this issue arises, so sorry for the vague title (my guess is that it > concerns AnnotationDbi package version 1.18.1). > > The issue is: if I query a GO-ID from my AGILENTv2GO2ALLPROBES bimap, I get > a very different result using a newer BioC version when compared to the > result using an older BioC version. Please note here that the annotation > package itself is the same in both versions used! > > Here is the output from a query in the NEW version: > > >> library(AGILENTv2.db) >> genes<- get("GO:0006351", AGILENTv2GO2ALLPROBES) >> length(unique(genes)) > [1] 1825 > > #### ok, so that would be 1825 unique genes/probes obtained from the query > >> sessionInfo() > R version 2.15.0 (2012-03-30) > > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > > locale: > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > other attached packages: > > [1] AGILENTv2.db_2.6.4 org.At.tair.db_2.7.1 RSQLite_0.11.1 > DBI_0.2-5 AnnotationDbi_1.18.1 > > [6] Biobase_2.16.0 BiocGenerics_0.2.0 > > > loaded via a namespace (and not attached): > > [1] IRanges_1.14.3 stats4_2.15.0 > > ------------------------------------------ > Now please compare this with the output from the OLD version. 
> >> library(AGILENTv2.db) >> genes<- get("GO:0006351", AGILENTv2GO2ALLPROBES) >> length(unique(genes)) > [1] 2122 > #### here, there are 2122 unique genes/probes obtained, so many more than > the 1825 above, even though the AGILENTv2.db used was the same (version > 2.6.4) >> sessionInfo() > R version 2.14.0 (2011-10-31) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_IE.UTF-8/en_IE.UTF-8/en_IE.UTF-8/C/en_IE.UTF-8/en_IE.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] AGILENTv2.db_2.6.4 org.At.tair.db_2.6.4 RSQLite_0.11.1 > DBI_0.2-5 AnnotationDbi_1.16.10 > [6] Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] IRanges_1.12.5 > > ------------------------- > > Any suggestions? I am not sure in which package the problem is, because the > annotation package should be identical. I think the same problem occurs in > the org.At.tair.db package, however, I have also different versions of the > annotation package installed on the different computers... > > Thanks for any help, > > Sam > From pshannon at fhcrc.org Sat Jun 16 00:08:29 2012 From: pshannon at fhcrc.org (Paul Shannon) Date: Fri, 15 Jun 2012 15:08:29 -0700 Subject: [BioC] BioPAX parsing In-Reply-To: References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com> <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com> Message-ID: Oliver and Martin, It would be very helpful to have easy access to BioPAX data in Biocondcutor. Just now, at the weekly Bioconductor dev-team meeting, we discussed your ideas, and want to endorse them. Oliver's proposal to parse the RDF triples into a data.frame has lots to recommend it. It would be immediately useful, and yet also allow for more sophisticated uses later. 
With these relationships in R, annotated as BioPAX data often are, we can imagine interested parties writing S4 classes which use the data, which might provide flexible querying capabilities, and be able to transform those triples into graphs and networks, for further computation and display. Please let us know if we can help. - Paul On Jun 15, 2012, at 12:23 PM, Oliver Ruebenacker wrote: > Hello Martin, > > I don't have code in R to test yet, but I do have extensive > experience handling BioPAX in Java, so I'm assuming reading BioPAX > using RJava should not be too difficult. > > The best target format depends on what people would like to do with > the data. For visualization, a bi-partite graph in a popular > graph-layout package should be best. Is there any particular graph > package in BioConductor or R in general you would recommend? > > For actual analysis, people probably have more specific requirements. > > BioPAX is a format based on RDF/OWL, which in turn is based on > organizing data in triples, which could be stored in a three-column > data frame (or perhaps a fourth column for data type). For example > (incomplete, for illustration only): > > ex:mapPhosphorylization rdf:type bp:BiochemicalReaction. > ex:atp rdf:type bp:SmallMolecule. > ex:adp rdf:type bp:SmallMolecule. > ex:map rdf:type bp:Protein. > ex:mapPhosphorylized rdf:type bp:Protein. > ex:mapPhosphorylization bp:left ex:atp. > ex:mapPhosphorylization bp:left ex:map. > ex:mapPhosphorylization bp:right ex:adp. > ex:mapPhosphorylization bp:right ex:mapPhosphorylized. > > Take care > Oliver > > On Fri, Jun 15, 2012 at 3:03 PM, Martin Preusse > wrote: >> Hi Oliver, >> >> I think there is a lot interest in a bioconductor package! >> >> Personally, I would like to read pathways stored in the BioPAX format into any kind of graph. It's a philosophical question if reactions should have nodes or should sit on the edges :) So far I have not used any R graph package. 
But I assume there are some very generic packages which are flexible enough to support both direct and bi-partite pathway structure. I used e.g. the JUNG graph API for JAVA extensively. >> >> I'm not sure what you mean with RDF/OWL triples. For me BioPAX is only a format to store a pathway. And I would like to bring it back into its natural form: a network! >> >> Do you have any code to test? I have used RJava before. All this RDF and XML file format stuff kind of puzzles me though ? :) >> >> Cheers >> Martin >> >> >> >> Am Freitag, 15. Juni 2012 um 18:32 schrieb Oliver Ruebenacker: >> >>> Hello Martin, >>> >>> I'm currently looking into reading BioPAX into R using RJava and >>> OpenRDF Sesame. If there is interest, I may be looking into submitting >>> a package to BioConductor. >>> >>> It would be very helpful if you could tell me what you need the >>> BioPAX data for, and in what form it would be best for you. Possible >>> options are: >>> >>> - A data frame of the RDF/OWL triples >>> - A graph of the RDF/OWL triples >>> - A data frame with one row for each reaction-participant >>> - A bi-partite graph with nodes for reactions and nodes for substances >>> - A with nodes for substances only, with edges for interactions >>> - A genetic interaction graph >>> >>> This list is roughly sorted form the one most easy to the most >>> difficult to provide. >>> >>> Take care >>> Oliver >>> >>> On Thu, Jun 14, 2012 at 10:01 AM, Martin Preusse >>> wrote: >>>> Many biological pathway resourced provide their data in the BioPAX format (http://www.biopax.org/index.php), a special XML format for biological interaction networks. Examples are pathway commons (http://www.pathwaycommons.org/pc/) and Reactome (http://www.reactome.org (http://www.reactome.org/)). >>>> >>>> A JAVA library for parsing BioPAX files exists: http://www.biopax.org/paxtools.php >>>> >>>> Has anybody used BioPAX files with R? Is it possible to read BioPAX files in any R based graph structure? 
A solution similar to the KEGGgraph package for KEGG pahways would be great, since more and more databases start using BioPAX. >>>> >>>> >>>> Any ideas are appreciated! >>>> >>>> Cheers >>>> Martin >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >>> >>> >>> >>> -- >>> Oliver Ruebenacker >>> Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) >>> Knowomics, The Bioinformatics Network (http://www.knowomics.com) >>> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) >>> >> >> >> > > > > -- > Oliver Ruebenacker > Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) > Knowomics, The Bioinformatics Network (http://www.knowomics.com) > SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From mgarciao at ufl.edu Sat Jun 16 02:53:15 2012 From: mgarciao at ufl.edu (Garcia Orellana,Miriam) Date: Sat, 16 Jun 2012 00:53:15 +0000 Subject: [BioC] PROBLEM LOADING THE covdesc FILE IN SIMPLEAFFY Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F948CB@UFEXCH-MBXN04.ad.ufl.edu> An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From stvjc at channing.harvard.edu Sat Jun 16 04:03:29 2012
From: stvjc at channing.harvard.edu (Vincent Carey)
Date: Fri, 15 Jun 2012 22:03:29 -0400
Subject: [BioC] BioPAX parsing
In-Reply-To:
References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com> <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From muralidharanv89 at gmail.com Sat Jun 16 06:49:05 2012
From: muralidharanv89 at gmail.com (Muralidharan V)
Date: Sat, 16 Jun 2012 10:19:05 +0530
Subject: [BioC] Problem in filtering of single color microarrray data using LIMMA
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From curoli at gmail.com Sat Jun 16 12:10:38 2012
From: curoli at gmail.com (Oliver Ruebenacker)
Date: Sat, 16 Jun 2012 06:10:38 -0400
Subject: [BioC] BioPAX parsing
In-Reply-To:
References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com> <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com>
Message-ID:

Hello,

Thanks a lot for the endorsement! I will try to create a prototype in the next few days, and then you can probably advise me on how to turn that into a package of the desired quality.

Take care
Oliver

On Fri, Jun 15, 2012 at 6:08 PM, Paul Shannon wrote:
> Oliver and Martin,
>
> It would be very helpful to have easy access to BioPAX data in Bioconductor.
>
> Just now, at the weekly Bioconductor dev-team meeting, we discussed your ideas, and want to endorse them. Oliver's proposal to parse the RDF triples into a data.frame has lots to recommend it. It would be immediately useful, and yet also allow for more sophisticated uses later. With these relationships in R, annotated as BioPAX data often are, we can imagine interested parties writing S4 classes which use the data, which might provide flexible querying capabilities, and be able to transform those triples into graphs and networks, for further computation and display.
> > Please let us know if we can help. > > - Paul > > > On Jun 15, 2012, at 12:23 PM, Oliver Ruebenacker wrote: > >> ? ? Hello Martin, >> >> ?I don't have code in R to test yet, but I do have extensive >> experience handling BioPAX in Java, so I'm assuming reading BioPAX >> using RJava should not be too difficult. >> >> ?The best target format depends on what people would like to do with >> the data. For visualization, a bi-partite graph in a popular >> graph-layout package should be best. Is there any particular graph >> package in BioConductor or R in general you would recommend? >> >> ?For actual analysis, people probably have more specific requirements. >> >> ?BioPAX is a format based on RDF/OWL, which in turn is based on >> organizing data in triples, which could be stored in a three-column >> data frame (or perhaps a fourth column for data type). For example >> (incomplete, for illustration only): >> >> ?ex:mapPhosphorylization ? rdf:type ? bp:BiochemicalReaction. >> ?ex:atp ? rdf:type ? bp:SmallMolecule. >> ?ex:adp ? rdf:type ? bp:SmallMolecule. >> ?ex:map ? rdf:type ? bp:Protein. >> ?ex:mapPhosphorylized ? rdf:type ? bp:Protein. >> ?ex:mapPhosphorylization ? bp:left ? ex:atp. >> ?ex:mapPhosphorylization ? bp:left ? ex:map. >> ?ex:mapPhosphorylization ? bp:right ? ex:adp. >> ?ex:mapPhosphorylization ? bp:right ? ex:mapPhosphorylized. >> >> ? ? Take care >> ? ? Oliver >> >> On Fri, Jun 15, 2012 at 3:03 PM, Martin Preusse >> wrote: >>> Hi Oliver, >>> >>> I think there is a lot interest in a bioconductor package! >>> >>> Personally, I would like to read pathways stored in the BioPAX format into any kind of graph. It's a philosophical question if reactions should have nodes or should sit on the edges :) So far I have not used any R graph package. But I assume there are some very generic packages which are flexible enough to support both direct and bi-partite pathway structure. I used e.g. the JUNG graph API for JAVA extensively. 
>>> >>> I'm not sure what you mean with RDF/OWL triples. For me BioPAX is only a format to store a pathway. And I would like to bring it back into its natural form: a network! >>> >>> Do you have any code to test? I have used RJava before. All this RDF and XML file format stuff kind of puzzles me though ? :) >>> >>> Cheers >>> Martin >>> >>> >>> >>> Am Freitag, 15. Juni 2012 um 18:32 schrieb Oliver Ruebenacker: >>> >>>> Hello Martin, >>>> >>>> I'm currently looking into reading BioPAX into R using RJava and >>>> OpenRDF Sesame. If there is interest, I may be looking into submitting >>>> a package to BioConductor. >>>> >>>> It would be very helpful if you could tell me what you need the >>>> BioPAX data for, and in what form it would be best for you. Possible >>>> options are: >>>> >>>> - A data frame of the RDF/OWL triples >>>> - A graph of the RDF/OWL triples >>>> - A data frame with one row for each reaction-participant >>>> - A bi-partite graph with nodes for reactions and nodes for substances >>>> - A with nodes for substances only, with edges for interactions >>>> - A genetic interaction graph >>>> >>>> This list is roughly sorted form the one most easy to the most >>>> difficult to provide. >>>> >>>> Take care >>>> Oliver >>>> >>>> On Thu, Jun 14, 2012 at 10:01 AM, Martin Preusse >>>> wrote: >>>>> Many biological pathway resourced provide their data in the BioPAX format (http://www.biopax.org/index.php), a special XML format for biological interaction networks. Examples are pathway commons (http://www.pathwaycommons.org/pc/) and Reactome (http://www.reactome.org (http://www.reactome.org/)). >>>>> >>>>> A JAVA library for parsing BioPAX files exists: http://www.biopax.org/paxtools.php >>>>> >>>>> Has anybody used BioPAX files with R? Is it possible to read BioPAX files in any R based graph structure? A solution similar to the KEGGgraph package for KEGG pahways would be great, since more and more databases start using BioPAX. 
>>>>> >>>>> >>>>> Any ideas are appreciated! >>>>> >>>>> Cheers >>>>> Martin >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Oliver Ruebenacker >>>> Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) >>>> Knowomics, The Bioinformatics Network (http://www.knowomics.com) >>>> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) >>>> >>> >>> >>> >> >> >> >> -- >> Oliver Ruebenacker >> Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) >> Knowomics, The Bioinformatics Network (http://www.knowomics.com) >> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Oliver Ruebenacker Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) Knowomics, The Bioinformatics Network (http://www.knowomics.com) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) From karthikuttan at gmail.com Sat Jun 16 12:19:59 2012 From: karthikuttan at gmail.com (Karthik K N) Date: Sat, 16 Jun 2012 15:49:59 +0530 Subject: [BioC] Access to Source Codes of Packges Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From alyamahmoud at gmail.com Sat Jun 16 12:23:45 2012 From: alyamahmoud at gmail.com (Alyaa Mahmoud) Date: Sat, 16 Jun 2012 13:23:45 +0300 Subject: [BioC] biomaRt installation error ?? Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From david at harsk.dk Sat Jun 16 12:42:45 2012
From: david at harsk.dk (David Westergaard)
Date: Sat, 16 Jun 2012 12:42:45 +0200
Subject: [BioC] biomaRt installation error ??
In-Reply-To:
References:
Message-ID:

Dear Alyaa,

It seems your problem is related to RCurl, and not the biomaRt package. A quick Google search on 'Cannot find curl-config' seems to return a very relevant hit: http://www.omegahat.org/RCurl/FAQ.html

Could you try this, and then try installing RCurl again?

Best,
David

2012/6/16 Alyaa Mahmoud :
> Hi All
>
> I am having a problem installing biomaRt. Did the package names change or
> something?
>
> I normally install biomaRt from biocLite
> source("http://bioconductor.org/biocLite.R")
> biocLite("biomaRt")
>
> but I get these errors (XML and RCurl):
>
> ERROR: configuration failed for package 'XML'
> * removing '/home/alyaa/R/x86_64-pc-linux-gnu-library/2.11/XML'
> During startup - Warning message:
> Setting LC_CTYPE failed, using "C"
> * installing *source* package 'RCurl' ...
> checking for curl-config... no
> Cannot find curl-config
> ERROR: configuration failed for package 'RCurl'
> * removing '/home/alyaa/R/x86_64-pc-linux-gnu-library/2.11/RCurl'
> During startup - Warning message:
> Setting LC_CTYPE failed, using "C"
> ERROR: dependencies 'XML', 'RCurl' are not available for package 'biomaRt'
>
> I then try to install RCurl or XML independently, but I get the same error:
>
> ERROR: configuration failed for package 'RCurl'
> * removing '/home/alyaa/R/x86_64-pc-linux-gnu-library/2.11/RCurl'
>
> The downloaded packages are in
> '/tmp/Rtmp6dsDAX/downloaded_packages'
> Warning message:
> In install.packages("RCurl") :
>   installation of package 'RCurl' had non-zero exit status
>
> I was using biomaRt normally 3-4 months ago, so I am not sure if something has
> been updated. Any ideas?
>
> Thanks a lot
> Alyaa
> --
> Alyaa Mahmoud
>
> "Love all, trust a few, do wrong to none"- Shakespeare
>
?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From karthikuttan at gmail.com Sat Jun 16 12:50:08 2012 From: karthikuttan at gmail.com (Karthik K N) Date: Sat, 16 Jun 2012 16:20:08 +0530 Subject: [BioC] Access to Source Codes of Packges In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From sdavis2 at mail.nih.gov Sat Jun 16 12:50:57 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sat, 16 Jun 2012 06:50:57 -0400 Subject: [BioC] Access to Source Codes of Packges In-Reply-To: References: Message-ID: On Sat, Jun 16, 2012 at 6:19 AM, Karthik K N wrote: > Hello members, > > I am using AgiMicroRna and Limma packges for analyzing Agilent microarray > data. I would like to know if it is possible to have access to the source > codes of these packages so that I can better understand what is goind on in > the background when we call each functions. Yes. The source code is available from the website. Just find the package you like and download the source code. Alternatively, the SVN repository for bioconductor is open. See the developer section of the website for how to access it. > Also, is it possible to edit the source code and run the package? Yes. You'll want to read at least some sections of the "Writing R Extensions" manual available from CRAN (or a mirror). Sean > > Thanks a lot in advance, > > Regards, > > -- > Karthik > > ? ? ? 
[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From sdavis2 at mail.nih.gov Sat Jun 16 13:19:11 2012
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Sat, 16 Jun 2012 07:19:11 -0400
Subject: [BioC] Access to Source Codes of Packges
In-Reply-To:
References:
Message-ID:

On Sat, Jun 16, 2012 at 6:50 AM, Karthik K N wrote:
> Dear David,
>
> Thank you for your help. Can you please tell me how exactly you edit the
> source code of a package and then use this 'modified package'?

You simply download the source code, open up the relevant files (you have to know which those are) with any text editor you like, edit what you like, test your new code and install the edited package. You'll need to do some homework on this, though.

> For example, I was wondering if it is possible to add some 'filter by
> flags' option in limma so that I can filter out probes that do not meet the
> specified flag criteria from my Agilent single color microarray data set
> (I am not sure if limma already has this function). Is this possible in
> limma?

This is already possible with limma/bioconductor, but you can certainly change limma to your heart's content if you have an idea on how to improve things for your own use. That is the wonder of open source and development.

Sean

> On Sat, Jun 16, 2012 at 4:06 PM, David Westergaard wrote:
>
>> Dear Karthik,
>>
>> If you go to the Bioconductor website (www.bioconductor.org), and
>> search for a package, (e.g. limma:
>> http://www.bioconductor.org/packages/release/bioc/html/limma.html), at
>> the very bottom you have the option to download the source. If you are
>> just curious as to how the code looks, you can also view the source by
>> calling the function without any parameters.
For instance, just >> running 'lmFit' would return the functions source code. >> >> Yes, it is also possible to edit the source code. >> >> Best, >> David >> >> 2012/6/16 Karthik K N : >> > Hello members, >> > >> > I am using AgiMicroRna and Limma packges for analyzing Agilent microarray >> > data. I would like to know if it is possible to have access to the source >> > codes of these packages so that I can better understand what is goind on >> in >> > the background when we call each functions. >> > >> > Also, is it possible to edit the source code and run the package? >> > >> > Thanks a lot in advance, >> > >> > Regards, >> > >> > -- >> > Karthik >> > >> > ? ? ? ?[[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From alyamahmoud at gmail.com Sat Jun 16 14:28:11 2012 From: alyamahmoud at gmail.com (Alyaa Mahmoud) Date: Sat, 16 Jun 2012 15:28:11 +0300 Subject: [BioC] biomaRt installation error ?? In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From Heidi.Dvinge at cancer.org.uk Fri Jun 15 22:06:19 2012
From: Heidi.Dvinge at cancer.org.uk (Heidi Dvinge)
Date: Fri, 15 Jun 2012 21:06:19 +0100
Subject: [BioC] HTqPCR
In-Reply-To:
References: <6D0043C9-4BAE-469C-8369-8733D7D53644@cancer.org.uk> <6C95A3CE-902D-4068-B64A-0A2813071A1A@cancer.org.uk> <50B7F68B-3762-4FF4-8F97-692ED30F06AE@cancer.org.uk>
Message-ID: <99093F24-FF41-4FA0-BE55-BB0AF5C3D010@cancer.org.uk>

Hi Silvia,

On 15 Jun 2012, at 18:45, Silvia Halim wrote:

> Hi Heidi,
>
> I ran into below problem when using plotCtReps.
>
> > plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = c(0,50))
> Error in split.data[[s]] : subscript out of bounds
> In addition: Warning messages:
> 1: In min(x, na.rm = na.rm) :
>   no non-missing arguments to min; returning Inf
> 2: In max(x, na.rm = na.rm) :
>   no non-missing arguments to max; returning -Inf
> > plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = c(0,50))
> Error in split.data[[s]] : subscript out of bounds
> In addition: Warning messages:
> 1: In min(x, na.rm = na.rm) :
>   no non-missing arguments to min; returning Inf
> 2: In max(x, na.rm = na.rm) :
>   no non-missing arguments to max; returning -Inf
> > plotCtReps(temp, card = 2, percent = 20, xlim = c(0,100), ylim = c(0,100))
> Error in split.data[[s]] : subscript out of bounds
> In addition: Warning messages:
> 1: In min(x, na.rm = na.rm) :
>   no non-missing arguments to min; returning Inf
> 2: In max(x, na.rm = na.rm) :
>   no non-missing arguments to max; returning -Inf

What's the output from traceback(), i.e. exactly where does the function break?

>

A couple of things you can try:

- plotCtReps is meant to be used in cases where there are exactly 2 replicates of the features on your assay. Is this the case? For example, with the data below there are 190 features that will be plotted, and 1 that will be skipped:

    > data(qPCRraw)
    > table(table(featureNames(qPCRraw)))

      2   4
    190   1

- are there any NAs in your data? E.g.
sum(is.na(qPCRraw))>0. HTH \Heidi > Here is how ?temp? looks like > > temp > An object of class "qPCRset" > Size: 96 features, 96 samples > Feature types: Reference, Test > Feature names: b-Actin b-Actin b-Actin ... > Feature classes: > Feature categories: OK > Sample names: NTC_4 PMPT352 NTC_3 ... > > Do you know why it is complaining about split.data? > > Thanks, > Silvia > > -----Original Message----- > From: Heidi Dvinge > Sent: 11 June 2012 6:11 PM > To: Silvia Halim > Subject: Re: HTqPCR > > Ok, so you already have a 96 by 96 matrix, so you don't need changeCtLayout. > Good luck with the rest, and let me know if you encounter any problems. > > On 11 Jun 2012, at 19:05, Silvia Halim wrote: > > > Hi Heidi, > > > > Thank you for your clarification. > > > > Btw this is how it looks like when I type 'temp' > >> temp > > An object of class "qPCRset" > > Size: 96 features, 96 samples > > Feature types: Reference, Test > > Feature names: b-Actin b-Actin b-Actin ... > > Feature classes: > > Feature categories: OK > > Sample names: NTC_4 PMPT352 NTC_3 ... > > > > Cheers, > > Silvia > > > > -----Original Message----- > > From: Heidi Dvinge > > Sent: 08 June 2012 7:12 PM > > To: Silvia Halim > > Subject: Re: HTqPCR > > > > Hi Silvia, > > > > what are the dimensions of the "temp" object that you read in? I.e. > > what does it look like if you just type > >> temp > > > > If you read in the data with n.features=96 and n.data=96, then you should already have an object with 96 rows and 96 columns, in which case you don't need to change the layout. > > > > Best, > > \Heidi > > > > On 8 Jun 2012, at 19:13, Silvia Halim wrote: > > > >> Hi Heidi, > >> > >> I finally have time to try out your HTqPCR bioconductor package again and I was trying to use 'changeCtLayout' function. 
However, I got following error message: > >> > >>> qPCRnew <- changeCtLayout(temp, sample.order = sample_order) > >> Error in data.frame(..., check.names = FALSE) : > >> arguments imply differing number of rows: 0, 96 In addition: Warning > >> message: > >> In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : > >> data length is not a multiple of split variable > >> > >> The commands that I run are as follows: > >>> temp <- readCtData("110614 BENIGN_1 DATA 96X96.csv", path = getwd(), > >>> n.features = 96, n.data=96, flag = 9, feature = 5, type= 6, Ct = 7, > >>> position = 1, skip = 12, sep = ",") sample_order <- > >>> rep(sampleNames(temp), each = 96) qPCRnew <- changeCtLayout(temp, > >>> sample.order = sample_order) > >> > >> I've tried to follow what's written in changeCtLayout function description. Can you please advise what went wrong? > >> > >> Thanks, > >> Silvia > >> > >> -----Original Message----- > >> From: Heidi Dvinge > >> Sent: 29 April 2012 8:18 PM > >> To: Silvia Halim > >> Subject: Re: HTqPCR > >> > >> HI Silvia, > >> > >> I'm glad you got it working. Depending on what you're supposed to do with the data, you may need to tweak some functions slightly, as you mention. Let me know if you run into any more trouble. > >> > >> Cheers > >> \Heidi > >> > >> On 26 Apr 2012, at 18:37, Silvia Halim wrote: > >> > >>> Hi Heidi, > >>> > >>> Thanks for the help! It's working for me now. Right now I'm figuring it out how I can use the functions that you described in the vignette. I might have to tweak the parameters for using the functions on Fluidigm data. > >>> > >>> Cheers, > >>> Silvia > >>> > >>> -----Original Message----- > >>> From: Heidi Dvinge > >>> Sent: 25 April 2012 8:56 AM > >>> To: Silvia Halim > >>> Subject: Re: HTqPCR > >>> > >>> Hiya, > >>> > >>> sorry, I only just now realised that you'd attached a file. 
When I saved as csv, the following command worked: > >>> > >>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", > >>>> format="BioMark", > >>>> n.features=96*96) raw > >>> An object of class "qPCRset" > >>> Size: 9216 features, 1 samples > >>> Feature types: > >>> Feature names: b-Actin b-Actin b-Actin ... > >>> Feature classes: > >>> Feature categories: OK > >>> Sample names: 110614 BENIGN_1 DATA 96x96 ... > >>> > >>> The data isn't transformed into a 96x96 format immediately though (in case you read in multiple arrays, and want to normalise them independently). If you want to change this, you can use changeCtLayout(). Alternatively you can say: > >>> > >>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", > >>>> format="BioMark", n.features=96, n.data=96) raw > >>> An object of class "qPCRset" > >>> Size: 96 features, 96 samples > >>> Feature types: > >>> Feature names: b-Actin b-Actin b-Actin ... > >>> Feature classes: > >>> Feature categories: OK > >>> Sample names: Sample1 Sample2 Sample3 ... > >>>> plotCtArray(raw) > >>> > >>> HTH > >>> \Heidi > >>> > >>> On 24 Apr 2012, at 17:55, Silvia Halim wrote: > >>> > >>>> Hi Heidi, > >>>> > >>>> I have some problems updating R on lustre. Therefore, I chose to run HTqPCR on my desktop for the moment. > >>>> > >>>> Reading in your sample file looks fine, however, reading in the > >>>> file that I showed you just now gave me below error message. 
(The > >>>> file is as attached) > >>>> > >>>>> temp <- readCtData("110614 BENIGN_1 DATA 96x96.xlsx", path = > >>>>> getwd() , n.features = 96*96, flag = 9, feature = 5, type= 6, Ct = > >>>>> 7,position = 1, skip = 12, sep = ",") > >>>> Error in read.table(file = file, header = header, sep = sep, quote = quote, : > >>>> no lines available in input > >>>> In addition: Warning message: > >>>> In readLines(file, skip) : > >>>> incomplete final line found on 'C:/Users/halim01/Documents/20110627_RossAdamsH_DN_Fluid/110614 BENIGN_1 DATA 96x96.xlsx' > >>>>> sessionInfo() > >>>> R version 2.14.0 (2011-10-31) > >>>> Platform: x86_64-pc-mingw32/x64 (64-bit) > >>>> > >>>> locale: > >>>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C LC_TIME=English_United Kingdom.1252 > >>>> > >>>> attached base packages: > >>>> [1] stats graphics grDevices utils datasets methods base > >>>> > >>>> other attached packages: > >>>> [1] Biostrings_2.22.0 IRanges_1.12.6 BiocInstaller_1.2.1 marray_1.32.0 HTqPCR_1.8.0 limma_3.10.3 RColorBrewer_1.0-5 Biobase_2.14.0 gdata_2.8.2 > >>>> > >>>> loaded via a namespace (and not attached): > >>>> [1] affy_1.32.1 affyio_1.22.0 gplots_2.10.1 gtools_2.6.2 preprocessCore_1.16.0 tools_2.14.0 zlibbioc_1.0.1 > >>>>> > >>>> > >>>> I did a quick check on the file and it only has 9228 lines including 12 header lines which I had skipped when reading in the file. Do you know what could possibly go wrong? > >>>> > >>>> Cheers, > >>>> Silvia > >>>> > >>>> -----Original Message----- > >>>> From: Heidi Dvinge > >>>> Sent: 24 April 2012 5:09 PM > >>>> To: Silvia Halim > >>>> Subject: Re: HTqPCR > >>>> > >>>> Hm, that looks like it may be x11 acting up. I often have similar issues when I work on a remote server. > >>>> > >>>> Actually, the processing of Fluidigm files is very computationally light. So you can easily do it on your desktop, if you can't update on lustre. 
> >>>> > >>>> I can also email you and older version of the vignette if you want to have a look. However, in HTqPCR 1.2.0 I don't even think I had a dedicated function for plotting the Fluidigm assays yet (the plotCtArray shown in the vignette). > >>>> > >>>> Cheers > >>>> \Heidi > >>>> > >>>> On 24 Apr 2012, at 16:39, Silvia Halim wrote: > >>>> > >>>>> Hi Heidi, > >>>>> > >>>>> This is what I got when accessing the vignette. > >>>>> > >>>>>> openVignette(package="HTqPCR") > >>>>> Please select a vignette: > >>>>> > >>>>> 1: HTqPCR - qPCR analysis in R > >>>>> > >>>>> Selection: 1 > >>>>> Opening /home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf > >>>>>> xprop: unable to open display '' > >>>>> /usr/local/bin/xdg-open: line 370: firefox: command not found > >>>>> /usr/local/bin/xdg-open: line 370: mozilla: command not found > >>>>> /usr/local/bin/xdg-open: line 370: netscape: command not found > >>>>> xdg-open: no method available for opening '/home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf' > >>>>> > >>>>> Sorry for the confusion, you are right that I was looking at a newer version of HTqPCR than the one installed on lustre. I think that's because I have different installations of HTqPCR on lustre and on my desktop. If I can update the one on lustre, I'll go ahead with the update. > >>>>> > >>>>> Thank you, > >>>>> Silvia > >>>>> > >>>>> -----Original Message----- > >>>>> From: Heidi Dvinge > >>>>> Sent: 24 April 2012 4:28 PM > >>>>> To: Silvia Halim > >>>>> Subject: Re: HTqPCR > >>>>> > >>>>> Ah, right, it looks like you have an older version of R, and therefore also HTqPCR. > >>>>> > >>>>> The most current release version is 1.10.0. In that version, readCtData() was modified to accept different types of input data, including from Fluidigm. Before that, this sort of data had to be read in 'manually'. > >>>>> > >>>>> I guess the vignette that you were looking at comes from a version > >>>>> of HTqPCR that's newer than the one you have installed? 
If you > >>>>> access the vignette corresponding to your HTqPCR version via > >>>>>> openVignette(package="HTqPCR") > >>>>> what do you get then? > >>>>> > >>>>> If you get an older version, then depending on how old it is, there may be a section towards the end giving an example of how to process Fluidigm data more 'manually'. If not, an update may be your best bet. > >>>>> > >>>>> Cheers > >>>>> \Heidi > >>>>> > >>>>> > >>>>> > >>>>> On 24 Apr 2012, at 16:15, Silvia Halim wrote: > >>>>> > >>>>>> Hi Heidi, > >>>>>> > >>>>>> Thanks for looking into the matter. Below is the output of my > >>>>>> sessionInfo() > >>>>>> > >>>>>>> sessionInfo() > >>>>>> R version 2.13.0 (2011-04-13) > >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>>>>> > >>>>>> locale: > >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 > >>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >>>>>> > >>>>>> attached base packages: > >>>>>> [1] stats graphics grDevices utils datasets methods base > >>>>>> > >>>>>> other attached packages: > >>>>>> [1] marray_1.26.0 Biostrings_2.20.1 IRanges_1.10.3 HTqPCR_1.2.0 > >>>>>> [5] limma_3.6.9 RColorBrewer_1.0-2 Biobase_2.12.1 gdata_2.8.0 > >>>>>> > >>>>>> loaded via a namespace (and not attached): > >>>>>> [1] affy_1.26.1 affyio_1.20.0 gplots_2.8.0 > >>>>>> [4] gtools_2.6.2 preprocessCore_1.14.0 > >>>>>>> > >>>>>> > >>>>>> Cheers, > >>>>>> Silvia > >>>>>> > >>>>>> -----Original Message----- > >>>>>> From: Heidi Dvinge > >>>>>> Sent: 24 April 2012 4:07 PM > >>>>>> To: Silvia Halim > >>>>>> Subject: HTqPCR > >>>>>> > >>>>>> Hi Silvia, > >>>>>> > >>>>>> I just tested the read fluidigm from the vignette, and it works on both my mac and a single unix system that I've tested. 
Although from the errors you were getting, it seemed like the headers weren't been read correctly/at all. > >>>>>> > >>>>>> Would you mind sending me the output of your sessionInfo(), so I can compare which package versions we have? > >>>>>> > >>>>>> Best, > >>>>>> \Heidi > >>>>>> > >>>>>>> sessionInfo() > >>>>>> R version 2.15.0 (2012-03-30) > >>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > >>>>>> > >>>>>> locale: > >>>>>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > >>>>>> > >>>>>> attached base packages: > >>>>>> [1] tools stats graphics grDevices utils datasets methods base > >>>>>> > >>>>>> other attached packages: > >>>>>> [1] HTqPCR_1.10.0 limma_3.12.0 RColorBrewer_1.0-5 Biobase_2.16.0 > >>>>>> [5] BiocGenerics_0.2.0 > >>>>>> > >>>>>> loaded via a namespace (and not attached): > >>>>>> [1] affy_1.34.0 affyio_1.24.0 BiocInstaller_1.4.3 > >>>>>> [4] gdata_2.8.2 gplots_2.10.1 gtools_2.6.2 > >>>>>> [7] preprocessCore_1.18.0 stats4_2.15.0 zlibbioc_1.2.0 > >>>>>> > >>>>>> > >>>>> > >>>> > >>>> <110614 BENIGN_1 DATA 96x96.xlsx> > >>> > >> > > > NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for ...{{dropped:16}} From steven.segbroek at gmail.com Thu Jun 14 18:12:40 2012 From: steven.segbroek at gmail.com (steven segbroek) Date: Thu, 14 Jun 2012 18:12:40 +0200 Subject: [BioC] some help requested for constructing an appropriate design matrix in LIMMA Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From Ingrid.Mercier at ipbs.fr  Fri Jun 15 11:55:59 2012
From: Ingrid.Mercier at ipbs.fr (Ingrid Mercier)
Date: Fri, 15 Jun 2012 11:55:59 +0200
Subject: [BioC] design matrix Limma design for paired t-test
In-Reply-To: <5266de74b101cfe9b43bb86abb9fd56b.squirrel@wehimail.alpha.wehi.edu.au>
References: <4FD759D6.2060202@ipbs.fr> <004201cd4930$2d7802b0$88680810$@edu.au> <4FD88B9C.5090102@ipbs.fr> <5266de74b101cfe9b43bb86abb9fd56b.squirrel@wehimail.alpha.wehi.edu.au>
Message-ID: <4FDB06AF.6040609@ipbs.fr>

Thanks Moshe for your reply! It's very clear!

As you wrote, what I want to test is "if the effect of the treatment at 4 hours is different from the one at 18 hours, between Control and Treated cells", but I don't see how to change my design.
Can somebody help me?

Cheers,

Ingrid

Ingrid MERCIER
Mycobacterial Interactions with Host Cells Team
Institute of Pharmacology & Structural Biology
CNRS - University of Toulouse
BP 64182
F-31077 Toulouse Cedex France
Tel +33 (0)5 61 17 54 63

On 14/06/2012 03:44, Moshe Olshansky wrote:
> Hi Ingrid,
>
> With your design your "base" level is patient 4, Control, 4 hours (let's
> call it B).
> The mean for, say, patient 6, Treatment, 18 hours is:
> B + Donor6 + TreatT + Time18
> where Donor6 is the difference between Donor4 and Donor6 (same for any
> treatment and time), TreatT is the difference between Treatment and
> Control (independent of patient and time) and Time18 is the difference
> between 18 hours and 4 hours (independent of patient and treatment).
>
> If you think that the effect of Treatment versus Control is the same at 4
> hours and 18 hours, then what you did is all right. If you think that the
> effect of the treatment at 4 hours may be different from the one at 18
> hours, you need to change your design.
>
> Best regards,
> Moshe.
>
>> Thanks a lot Belinda !!
>>
>> I made a mistake, so I replaced Time=Treat by Time only, and it's good.
>> So, I have a last question: I'm confused by the different coefs in
>> topTable.
>> I get genes, but I tested several coefs without understanding their
>> meaning.
>> Can somebody explain to me what coef="TreatT", coef="Time18",
>> coef="Donor5", coef="Donor6", coef="Donor7" and coef="Donor8" mean?
>> My main objective is to identify the differentially expressed genes
>> between the Control donors and Treated donors at 4 hours or 18 hours.
>> I have no idea which coef I have to use.
>>
>> Cheers,
>>
>> Ingrid
>>
>> Ingrid MERCIER
>> Mycobacterial Interactions with Host Cells Team
>> Institute of Pharmacology & Structural Biology
>> CNRS - University of Toulouse
>> BP 64182
>> F-31077 Toulouse Cedex France
>> Tel +33 (0)5 61 17 54 63
>>
>>
>>
>>
>> On 13/06/2012 08:45, Belinda Phipson wrote:
>>> Hi Ingrid
>>>
>>> The problem with your code is the following line:
>>>> Time=Treat=factor(Targets$Time)
>>> Where you essentially set the time factor equal to the treat factor.
>>>
>>> Cheers,
>>> Belinda
>>>
>>>
>>> -----Original Message-----
>>> From: bioconductor-bounces at r-project.org
>>> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Ingrid Mercier
>>> Sent: Wednesday, 13 June 2012 1:02 AM
>>> To: bioconductor at r-project.org; smyth at wehi.edu.au
>>> Subject: [BioC] design matrix Limma design for paired t-test
>>>
>>> Dear list and Gordon,
>>>
>>> I have some trouble computing a moderated paired t-test in the linear
>>> model.
>>> Here is my experimental plan :
>>>
>>> I used a single channel Agilent microarray.
>>> 2 types of cells : Control (S) and Treated (T)
>>> Five human donors : 4-5-6-7-8
>>> Two times of treatment : 4 hours and 18 hours
>>>
>>> I want to compare the differentially expressed genes between my C versus T
>>> at 4
>>> hours and then at 18 hours.
>>>
>>> Here is my design :
>>>
>>>
>>> My targets frame is :
>>>> Targets
>>>    X       FileName                                           Treatment Donor Time
>>> 1  DC_4_4  US10463851_252665214446_S01_GE1_1010_Sep10_1_2.txt T         4     4
>>> 2  SC_4_4  US10463851_252665214448_S01_GE1_1010_Sep10_1_2.txt C         4     4
>>> 3  DC_18_4 US10463851_252665214447_S01_GE1_1010_Sep10_1_2.txt T         4     18
>>> 4  SC_18_4 US10463851_252665214444_S01_GE1_1010_Sep10_1_3.txt C         4     18
>>> 5  DC_4_5  US10463851_252665214448_S01_GE1_1010_Sep10_1_4.txt T         5     4
>>> 6  SC_4_5  US10463851_252665214444_S01_GE1_1010_Sep10_1_1.txt C         5     4
>>> 7  DC_18_5 US10463851_252665214446_S01_GE1_1010_Sep10_1_3.txt T         5     18
>>> 8  SC_18_5 US10463851_252665214447_S01_GE1_1010_Sep10_1_4.txt C         5     18
>>> 9  DC_4_6  US10463851_252665214445_S01_GE1_1010_Sep10_1_4.txt T         6     4
>>> 10 SC_4_6  US10463851_252665214447_S01_GE1_1010_Sep10_1_3.txt C         6     4
>>> 11 DC_18_6 US10463851_252665214448_S01_GE1_1010_Sep10_1_3.txt T         6     18
>>> 12 SC_18_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_3.txt C         6     18
>>> 13 DC_4_7  US10463851_252665214444_S01_GE1_1010_Sep10_1_4.txt T         7     4
>>> 14 SC_4_7  US10463851_252665214445_S01_GE1_1010_Sep10_1_2.txt C         7     4
>>> 15 DC_18_7 US10463851_252665214447_S01_GE1_1010_Sep10_1_1.txt T         7     18
>>> 16 SC_18_7 US10463851_252665214446_S01_GE1_1010_Sep10_1_1.txt C         7     18
>>> 17 DC_4_8  US10463851_252665214444_S01_GE1_1010_Sep10_1_2.txt T         8     4
>>> 18 SC_4_8  US10463851_252665214446_S01_GE1_1010_Sep10_1_4.txt C         8     4
>>> 19 DC_18_8 US10463851_252665214445_S01_GE1_1010_Sep10_1_1.txt T         8     18
>>> 20 SC_18_8 US10463851_252665214448_S01_GE1_1010_Sep10_1_1.txt C         8     18
>>>
>>>
>>> then I create my design matrix :
>>>
>>>> Donor
>>> [1] 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8
>>> Levels: 4 5 6 7 8
>>>> Treat=factor(Targets$Treatment,levels=c("C","T"))
>>>> Treat
>>> [1] T C T C T C T C T C T C T C T C T C T C
>>> Levels: C T
>>>> Time=Treat=factor(Targets$Time)
>>>> Time
>>> [1] 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18
>>> Levels: 4 18
>>>
>>>> design=model.matrix(~Donor+Treat+Time)
>>>> design
>>>    (Intercept) Donor5 Donor6 Donor7 Donor8 Treat18 Time18
>>> 1            1      0      0      0      0       0      0
>>> 2            1      0      0      0      0       0      0
>>> 3            1      0      0      0      0       1      1
>>> 4            1      0      0      0      0       1      1
>>> 5            1      1      0      0      0       0      0
>>> 6            1      1      0      0      0       0      0
>>> 7            1      1      0      0      0       1      1
>>> 8            1      1      0      0      0       1      1
>>> 9            1      0      1      0      0       0      0
>>> 10           1      0      1      0      0       0      0
>>> 11           1      0      1      0      0       1      1
>>> 12           1      0      1      0      0       1      1
>>> 13           1      0      0      1      0       0      0
>>> 14           1      0      0      1      0       0      0
>>> 15           1      0      0      1      0       1      1
>>> 16           1      0      0      1      0       1      1
>>> 17           1      0      0      0      1       0      0
>>> 18           1      0      0      0      1       0      0
>>> 19           1      0      0      0      1       1      1
>>> 20           1      0      0      0      1       1      1
>>> attr(,"assign")
>>> [1] 0 1 1 1 1 2 3
>>> attr(,"contrasts")
>>> attr(,"contrasts")$Donor
>>> [1] "contr.treatment"
>>>
>>> attr(,"contrasts")$Treat
>>> [1] "contr.treatment"
>>>
>>> attr(,"contrasts")$Time
>>> [1] "contr.treatment"
>>>
>>>
>>> In this design matrix I think something is wrong, because the column
>>> Treat18 is the same as Time18.
>>> I don't understand why.
>>> So, the following code failed, and the differentially expressed genes
>>> look odd.
>>>
>>> Can somebody help me? Thanks all.
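A minimal sketch, in base R only, of the two design points raised in this thread: building Treat and Time in separate assignments (rather than the chained `Time=Treat=factor(Targets$Time)`), and adding a Treat:Time interaction so that the treatment effect is allowed to differ between 4 and 18 hours. The `Targets` frame below is a hypothetical stand-in rebuilt from the sample layout in the message, not the original file, and the limma calls in the trailing comments are indicative only.

```r
# Hypothetical stand-in for the Targets frame: 5 donors x 2 times x 2 treatments,
# in the same row order as the table in the message (T, C at 4 h; T, C at 18 h).
Targets <- data.frame(
  Treatment = rep(c("T", "C"), times = 10),
  Donor     = rep(4:8, each = 4),
  Time      = rep(rep(c(4, 18), each = 2), times = 5)
)

# Build each factor in its own assignment so Treat is not overwritten by Time.
Donor <- factor(Targets$Donor)
Treat <- factor(Targets$Treatment, levels = c("C", "T"))
Time  <- factor(Targets$Time, levels = c(4, 18))

# Additive model: one treatment effect shared by both time points.
design_add <- model.matrix(~ Donor + Treat + Time)

# Interaction model: the extra column TreatT:Time18 is the difference between
# the treatment effect at 18 hours and the treatment effect at 4 hours.
design_int <- model.matrix(~ Donor + Treat * Time)

colnames(design_int)
qr(design_int)$rank == ncol(design_int)  # TRUE when every coefficient is estimable

# With limma loaded and a normalised expression object (assumed, not shown):
#   fit <- eBayes(lmFit(test_norm, design_int))
#   topTable(fit, coef = "TreatT")         # T vs C at 4 hours
#   topTable(fit, coef = "TreatT:Time18")  # does the T vs C effect change at 18 hours?
```

In the interaction model the coefficient named TreatT refers to the reference time point only (4 hours here), which is why the two coefficients answer different questions.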
>>>
>>>
>>>> fit=lmFit(test_norm,design)
>>> Coefficients not estimable: Time18
>>> Warning message:
>>> Partial NA coefficients for 34183 probe(s)
>>>> fit2=eBayes(fit)
>>> Warning message:
>>> In ebayes(fit = fit, proportion = proportion, stdev.coef.lim =
>>> stdev.coef.lim, :
>>> Estimation of var.prior failed - set to default value
>>>
>>>
>>>> table = topTable(fit2,1, number=5000,
>>> p.value=0.05,adjust.method="BH",sort.by="logFC",lfc=2)
>>>> head(table)
>>>       ID            logFC    AveExpr  t         P.Value      adj.P.Val    B
>>> 6509  A_33_P3396434 18.44159 18.41239 245.14490 1.308161e-31 2.353520e-28 53.41519
>>> 22398 A_33_P3223592 18.25824 18.24591 242.75647 1.545005e-31 2.514901e-28 53.36821
>>> 10771 A_33_P3244165 18.21029 18.02229  90.76191 2.796577e-24 2.467615e-23 44.59915
>>> 6149  A_33_P3346552 18.14780 18.12098 207.18556 2.282464e-30 1.147374e-27 52.50960
>>> 23554 A_33_P3210160 18.08158 18.21026 239.64192 1.924175e-31 2.560908e-28 53.30521
>>> 20924 A_33_P3286278 18.04425 18.07312 179.72121 2.558128e-29 5.025546e-27 51.56876
>>>
>>>
>>> Best,
>>>
>>> Ingrid
>>>
>>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}

From ragowthaman at gmail.com  Fri Jun 15 21:22:36 2012
From: ragowthaman at gmail.com (gowtham)
Date: Fri, 15 Jun 2012 12:22:36 -0700
Subject: [BioC] BCV increases with an increasing counts per million in RNAseq (edgeR)
Message-ID:

Hi Everyone,
I analyse my current RNAseq data set (two groups; each group with two replicates) using classic edgeR. I see a couple of strange results that I am trying to make sense of. I would really appreciate any help from the list.

1) After filtering out tags with low reads (minimum of 1 cpm in each of 4 samples: dge[rowSums((cpm.dge > 1)) >=4, ]) and normalizing (calcNormFactors), I create the BCV plot (attached: norm_filt_bcv.png). I see CV going up along with CPM. But when I don't filter and don't normalize, I see a traditional BCV plot (attached: nonorm_nofilt_bcv.png). Any idea why this is the case? Especially since the normalization factors are close to 1 (0.9747020, 0.9756064, 0.9769463, 1.0764226) and filtering for all samples with a minimum of 1 CPM removed only 800 genes out of 8000 genes.

2) Most of the genes seem to have dispersion lower than the common dispersion. Aren't they supposed to be distributed on either side (which is the case with nofilt-nonorm)?

3) Similarly, I see a different MDS plot for the filtered (and normalized) and the unfiltered (non-normalized) datasets (attached).

Wondering what is going on? Any suggestions/comments will be very helpful.

Thanks a lot in advance,
Gowthaman

PS: The calculated common dispersion is rather high. Disp = 0.14757 , BCV = 0.3841

--
Gowthaman
Bioinformatics Systems Programmer.
SBRI, 307 West lake Ave N Suite 500
Seattle, WA. 98109-5219
Phone : LAB 206-256-7188 (direct).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nonorm_nofilt_bcv.png
Type: image/png
Size: 57366 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: norm_filt_bcv.png
Type: image/png
Size: 54540 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nonorm_nofilt_mds.png
Type: image/png
Size: 12447 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: norm_filt_mds.png
Type: image/png
Size: 11921 bytes
Desc: not available
URL:

From ragowthaman at gmail.com  Sat Jun 16 00:06:16 2012
From: ragowthaman at gmail.com (gowtham)
Date: Fri, 15 Jun 2012 15:06:16 -0700
Subject: [BioC] BCV increases with an increasing counts per million in RNAseq (edgeR)
Message-ID:

Hi All,
I am resending this email with fewer figures (in .gif format) as my previous message was held for moderation (due to its size). And I don't seem to be able to cancel the message either (the link was broken). Sorry about this.

#---#

Hi Everyone,
I analyse my current RNAseq data set (two groups; each group with two replicates) using classic edgeR. I see a couple of strange results that I am trying to make sense of. I would really appreciate any help from the list.

1) After filtering out tags with low reads (minimum of 1 cpm in each of 4 samples: dge[rowSums((cpm.dge > 1)) >=4, ]) and normalizing (calcNormFactors), I create the BCV plot (attached: norm_filt_bcv.png). I see CV going up along with CPM. But when I don't filter and don't normalize, I see a traditional BCV plot (attached: nonorm_nofilt_bcv.png). Any idea why this is the case? Especially since the normalization factors are close to 1 (0.9747020, 0.9756064, 0.9769463, 1.0764226) and filtering for all samples with a minimum of 1 CPM removed only 800 genes out of 8000 genes.

2) Most of the genes seem to have dispersion lower than the common dispersion. Aren't they supposed to be distributed on either side (which is the case with nofilt-nonorm)?

3) Similarly, I see a different MDS plot for the filtered (and normalized) and the unfiltered (non-normalized) datasets (attached).

Wondering what is going on? Any suggestions/comments will be very helpful.

Thanks a lot in advance,
Gowthaman

PS: The calculated common dispersion is rather high. Disp = 0.14757 , BCV = 0.3841

--
Gowthaman
Bioinformatics Systems Programmer.
SBRI, 307 West lake Ave N Suite 500
Seattle, WA. 98109-5219
Phone : LAB 206-256-7188 (direct).
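
The filtering rule and the dispersion/BCV pair quoted in the message above can be sketched in base R without edgeR. The count matrix below is simulated and purely hypothetical, and `cpm_mat` is computed by hand as counts per million of each library's total, which is an assumption about what `cpm.dge` holds in the message. The sketch only illustrates the arithmetic; it does not attempt to explain the BCV plots.

```r
set.seed(1)
# Hypothetical counts: 8000 genes x 4 samples (two groups of two replicates),
# with a per-gene mean drawn from an exponential so that some genes are low.
n_genes <- 8000
mu      <- rexp(n_genes, rate = 1 / 50)
counts  <- matrix(rnbinom(n_genes * 4, mu = rep(mu, 4), size = 5), ncol = 4,
                  dimnames = list(NULL, c("A1", "A2", "B1", "B2")))

# Counts per million: scale each column by its library size.
lib_size <- colSums(counts)
cpm_mat  <- t(t(counts) / lib_size) * 1e6

# Keep genes with more than 1 CPM in all 4 samples, as in the message.
keep     <- rowSums(cpm_mat > 1) >= 4
filtered <- counts[keep, , drop = FALSE]

# Dropping rows can change the column totals; compare before and after.
rbind(before = lib_size, after = colSums(filtered))

# BCV is the square root of the dispersion, so the reported pair is consistent:
sqrt(0.14757)  # about 0.384
```

Whether library sizes are recomputed after filtering is a choice the analyst makes; the comparison of column totals simply makes that choice visible.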

From lawrence.michael at gene.com  Sat Jun 16 15:54:47 2012
From: lawrence.michael at gene.com (Michael Lawrence)
Date: Sat, 16 Jun 2012 06:54:47 -0700
Subject: [BioC] BioPAX parsing
In-Reply-To:
References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com> <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From curoli at gmail.com  Sat Jun 16 17:03:21 2012
From: curoli at gmail.com (Oliver Ruebenacker)
Date: Sat, 16 Jun 2012 11:03:21 -0400
Subject: [BioC] BioPAX parsing
In-Reply-To:
References: <2FE00F99DC9449FC80176CF37E29B5DA@googlemail.com> <2E774CD91DCD4D57A17218DB93AB053A@googlemail.com>
Message-ID:

Hello Michael,

I'm planning to use rJava to drive OpenRDF Sesame, with which I am very familiar.

Take care
Oliver

On Sat, Jun 16, 2012 at 9:54 AM, Michael Lawrence wrote:
> Were you guys planning on using Rredland for this?
>
>
> On Sat, Jun 16, 2012 at 3:10 AM, Oliver Ruebenacker
> wrote:
>>
>> Hello,
>>
>> Thanks a lot for the endorsement!
>>
>> I will try to create a prototype in the next days, and then you can
>> probably advise me on how to turn that into a package of desired
>> quality.
>>
>> Take care
>> Oliver
>>
>> On Fri, Jun 15, 2012 at 6:08 PM, Paul Shannon wrote:
>> > Oliver and Martin,
>> >
>> > It would be very helpful to have easy access to BioPAX data in
>> > Bioconductor.
>> >
>> > Just now, at the weekly Bioconductor dev-team meeting, we discussed your
>> > ideas, and want to endorse them. Oliver's proposal to parse the RDF triples
>> > into a data.frame has lots to recommend it. It would be immediately useful,
>> > and yet also allow for more sophisticated uses later.
>> > With these
>> > relationships in R, annotated as BioPAX data often are, we can imagine
>> > interested parties writing S4 classes which use the data, which might
>> > provide flexible querying capabilities, and be able to transform those
>> > triples into graphs and networks, for further computation and display.
>> >
>> > Please let us know if we can help.
>> >
>> > - Paul
>> >
>> >
>> > On Jun 15, 2012, at 12:23 PM, Oliver Ruebenacker wrote:
>> >
>> >> Hello Martin,
>> >>
>> >> I don't have code in R to test yet, but I do have extensive
>> >> experience handling BioPAX in Java, so I'm assuming reading BioPAX
>> >> using RJava should not be too difficult.
>> >>
>> >> The best target format depends on what people would like to do with
>> >> the data. For visualization, a bi-partite graph in a popular
>> >> graph-layout package should be best. Is there any particular graph
>> >> package in BioConductor or R in general you would recommend?
>> >>
>> >> For actual analysis, people probably have more specific requirements.
>> >>
>> >> BioPAX is a format based on RDF/OWL, which in turn is based on
>> >> organizing data in triples, which could be stored in a three-column
>> >> data frame (or perhaps a fourth column for data type). For example
>> >> (incomplete, for illustration only):
>> >>
>> >> ex:mapPhosphorylization   rdf:type   bp:BiochemicalReaction.
>> >> ex:atp   rdf:type   bp:SmallMolecule.
>> >> ex:adp   rdf:type   bp:SmallMolecule.
>> >> ex:map   rdf:type   bp:Protein.
>> >> ex:mapPhosphorylized   rdf:type   bp:Protein.
>> >> ex:mapPhosphorylization   bp:left   ex:atp.
>> >> ex:mapPhosphorylization   bp:left   ex:map.
>> >> ex:mapPhosphorylization   bp:right   ex:adp.
>> >> ex:mapPhosphorylization   bp:right   ex:mapPhosphorylized.
>> >>
>> >> Take care
>> >> Oliver
>> >>
>> >> On Fri, Jun 15, 2012 at 3:03 PM, Martin Preusse
>> >> wrote:
>> >>> Hi Oliver,
>> >>>
>> >>> I think there is a lot interest in a bioconductor package!
>> >>>
>> >>> Personally, I would like to read pathways stored in the BioPAX format
>> >>> into any kind of graph. It's a philosophical question if reactions should
>> >>> have nodes or should sit on the edges :) So far I have not used any R graph
>> >>> package. But I assume there are some very generic packages which are
>> >>> flexible enough to support both direct and bi-partite pathway structure. I
>> >>> used e.g. the JUNG graph API for JAVA extensively.
>> >>>
>> >>> I'm not sure what you mean with RDF/OWL triples. For me BioPAX is only
>> >>> a format to store a pathway. And I would like to bring it back into its
>> >>> natural form: a network!
>> >>>
>> >>> Do you have any code to test? I have used RJava before. All this RDF
>> >>> and XML file format stuff kind of puzzles me though :)
>> >>>
>> >>> Cheers
>> >>> Martin
>> >>>
>> >>>
>> >>>
>> >>> On Friday, 15 June 2012 at 18:32, Oliver Ruebenacker wrote:
>> >>>
>> >>>> Hello Martin,
>> >>>>
>> >>>> I'm currently looking into reading BioPAX into R using RJava and
>> >>>> OpenRDF Sesame. If there is interest, I may be looking into
>> >>>> submitting
>> >>>> a package to BioConductor.
>> >>>>
>> >>>> It would be very helpful if you could tell me what you need the
>> >>>> BioPAX data for, and in what form it would be best for you. Possible
>> >>>> options are:
>> >>>>
>> >>>> - A data frame of the RDF/OWL triples
>> >>>> - A graph of the RDF/OWL triples
>> >>>> - A data frame with one row for each reaction-participant
>> >>>> - A bi-partite graph with nodes for reactions and nodes for
>> >>>> substances
>> >>>> - A graph with nodes for substances only, with edges for interactions
>> >>>> - A genetic interaction graph
>> >>>>
>> >>>> This list is roughly sorted from the one most easy to the most
>> >>>> difficult to provide.
>> >>>> >> >>>> Take care >> >>>> Oliver >> >>>> >> >>>> On Thu, Jun 14, 2012 at 10:01 AM, Martin Preusse >> >>>> > >>>> (mailto:martin.preusse at googlemail.com)> wrote: >> >>>>> Many biological pathway resourced provide their data in the BioPAX >> >>>>> format (http://www.biopax.org/index.php), a special XML format for >> >>>>> biological interaction networks. Examples are pathway commons >> >>>>> (http://www.pathwaycommons.org/pc/) and Reactome (http://www.reactome.org >> >>>>> (http://www.reactome.org/)). >> >>>>> >> >>>>> A JAVA library for parsing BioPAX files exists: >> >>>>> http://www.biopax.org/paxtools.php >> >>>>> >> >>>>> Has anybody used BioPAX files with R? Is it possible to read BioPAX >> >>>>> files in any R based graph structure? A solution similar to the KEGGgraph >> >>>>> package for KEGG pahways would be great, since more and more databases start >> >>>>> using BioPAX. >> >>>>> >> >>>>> >> >>>>> Any ideas are appreciated! >> >>>>> >> >>>>> Cheers >> >>>>> Martin >> >>>>> >> >>>>> _______________________________________________ >> >>>>> Bioconductor mailing list >> >>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) >> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>>>> Search the archives: >> >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> -- >> >>>> Oliver Ruebenacker >> >>>> Bioinformatics Consultant >> >>>> (http://www.knowomics.com/wiki/Oliver_Ruebenacker) >> >>>> Knowomics, The Bioinformatics Network (http://www.knowomics.com) >> >>>> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) >> >>>> >> >>> >> >>> >> >>> >> >> >> >> >> >> >> >> -- >> >> Oliver Ruebenacker >> >> Bioinformatics Consultant >> >> (http://www.knowomics.com/wiki/Oliver_Ruebenacker) >> >> Knowomics, The Bioinformatics Network (http://www.knowomics.com) >> >> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) >> >> >> >> 
_______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor at r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> >> >> -- >> Oliver Ruebenacker >> Bioinformatics Consultant >> (http://www.knowomics.com/wiki/Oliver_Ruebenacker) >> Knowomics, The Bioinformatics Network (http://www.knowomics.com) >> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Oliver Ruebenacker Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker) Knowomics, The Bioinformatics Network (http://www.knowomics.com) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) From maziz at tgen.org Sun Jun 17 02:05:12 2012 From: maziz at tgen.org (maziz at tgen.org) Date: Sun, 17 Jun 2012 00:05:12 +0000 Subject: [BioC] Question regarding cellhts2 output In-Reply-To: References: <907543CDE7D2764C84BA88D7EB0890480E8AD18E@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD403@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD424@EX-MBX1.ad.tgen.org> <38883702-1483-4147-81B3-7AD596BC4FCB@embl.de> <907543CDE7D2764C84BA88D7EB0890480E8AD5A8@EX-MBX1.ad.tgen.org> <086B1CE9-C09A-41B2-B146-2A39EB9E3561@embl.de> <907543CDE7D2764C84BA88D7EB0890480E8AD670@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD683@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD6F2@EX-MBX1.ad.tgen.org> Message-ID: <907543CDE7D2764C84BA88D7EB0890480E8AD926@EX-MBX1.ad.tgen.org> An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From maziz at tgen.org Sun Jun 17 08:48:46 2012 From: maziz at tgen.org (maziz at tgen.org) Date: Sun, 17 Jun 2012 06:48:46 +0000 Subject: [BioC] Question regarding cellhts2 output References: <907543CDE7D2764C84BA88D7EB0890480E8AD18E@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD403@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD424@EX-MBX1.ad.tgen.org> <38883702-1483-4147-81B3-7AD596BC4FCB@embl.de> <907543CDE7D2764C84BA88D7EB0890480E8AD5A8@EX-MBX1.ad.tgen.org> <086B1CE9-C09A-41B2-B146-2A39EB9E3561@embl.de> <907543CDE7D2764C84BA88D7EB0890480E8AD670@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD683@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD6F2@EX-MBX1.ad.tgen.org> Message-ID: <907543CDE7D2764C84BA88D7EB0890480E8AD95C@EX-MBX1.ad.tgen.org> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From suri_ghani at yahoo.com Sun Jun 17 16:26:40 2012 From: suri_ghani at yahoo.com (suri ghani) Date: Sun, 17 Jun 2012 07:26:40 -0700 (PDT) Subject: [BioC] (no subject) Message-ID: <1339943200.35542.YahooMailNeo@web120002.mail.ne1.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Sun Jun 17 18:23:30 2012 From: guest at bioconductor.org (papori [guest]) Date: Sun, 17 Jun 2012 09:23:30 -0700 (PDT) Subject: [BioC] down-expression and high-expression in single cell + amplification Message-ID: <20120617162330.C0B3112F963@mamba.fhcrc.org> Hi all, First of all, I am new to this field so i am sorry if i am not clear.. I will try to explain what is my aim, and what i did before DESeq. I am trying to do Differential expression analysis using DESeq for De-Novo invertebrate . We had an experiment of 3 conditions with 3 biological replicate for each.(total of 9 samples) We used hiseq2000 50bp single end reads. 
We had a different library size for each (this was a single-cell experiment, so we had an amplification step, which yields variance in the library sizes).
We reconstructed the transcriptome using Trinity, estimating counts with RSEM.
And then I used DESeq.

I see weird behaviour in the data, and I don't know if it is because of something wrong that I did.
I am always getting down-expression from condition 1 to condition 2 and high-expression from condition 2 to condition 3 (for all the transcripts, no outliers).

The numbers of counts mapped to the reference transcriptome for each condition were: 32M, 27M, 40M respectively.
That made me think that, because condition 2 has the lowest count, it shows down-expression from 1 to 2 and high-expression from 2 to 3.
If my conclusion is right, I am in a big mess. (Normalization??)

My DESeq script is:

Conditions = c("C1", "C2", "C3", "C1", "C2", "C3","C1", "C2", "C3")
Counts<-round(MultiGeneMat,0)
cds <- newCountDataSet(Counts,Conditions)
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds,method="per-condition",sharingMode="maximum",fitType="local")

res_1vs2 <- nbinomTest(cds,condA="C1",condB="C2")
sigDESeq_1vs2 <- res_1vs2[res_1vs2$padj <= 0.1, ]
sigDESeq_1vs2 <- na.omit(sigDESeq_1vs2)

res_2vs3 <- nbinomTest(cds,condA="C2",condB="C3")
sigDESeq_2vs3 <- res_2vs3[res_2vs3$padj <= 0.1, ]
sigDESeq_2vs3 <- na.omit(sigDESeq_2vs3)

res_1vs3 <- nbinomTest(cds,condA="1",condB="C3")
sigDESeq_1vs3 <- res_1vs3[res_1vs3$padj <= 0.1, ]
sigDESeq_1vs3 <- na.omit(sigDESeq_1vs3)

Is there anything wrong here, or anywhere else?
If I wasn't clear enough, tell me where and I will try to explain.

Any help will be appreciated!
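
Two points in the script above can be checked in base R alone. The sketch below (a) tests whether every condition label passed to a comparison actually occurs in `Conditions`, and (b) illustrates median-of-ratios size factors, the kind of library-size normalisation that `estimateSizeFactors` is documented to perform. The toy counts are hypothetical, and `size_factors()` is an illustrative helper written for this sketch, not a DESeq function.

```r
Conditions <- c("C1", "C2", "C3", "C1", "C2", "C3", "C1", "C2", "C3")

# (a) Labels used in the three comparisons of the script, as typed there.
used <- c("C1", "C2", "C2", "C3", "1", "C3")
setdiff(used, Conditions)  # any label printed here matches no sample

# (b) Median-of-ratios size factors on toy data (illustrative helper).
size_factors <- function(counts) {
  log_geo_mean <- rowMeans(log(counts))   # per-gene log geometric mean
  ok <- is.finite(log_geo_mean)           # drop genes with a zero count
  apply(counts, 2, function(col)
    exp(median(log(col[ok]) - log_geo_mean[ok])))
}

# Same underlying expression sequenced at three depths (ratio 32 : 27 : 40).
base   <- c(10, 50, 200, 1000, 5000)
counts <- cbind(s1 = base * 32, s2 = base * 27, s3 = base * 40)

sf <- size_factors(counts)
sf / sf[1]                  # recovers the depth ratios relative to s1
sweep(counts, 2, sf, "/")   # normalised counts agree across samples
```

On this toy input, a pure difference in sequencing depth is absorbed by the size factors, so depth alone would not be expected to produce a uniform shift between conditions after normalisation; whether that holds for amplified single-cell libraries is a separate question that this sketch does not address.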
Thanks, Pap -- output of sessionInfo(): R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_2.4.6 limma_3.10.3 DESeq_1.6.1 locfit_1.5-8 Biobase_2.14.0 loaded via a namespace (and not attached): [1] annotate_1.32.3 AnnotationDbi_1.16.19 DBI_0.2-5 genefilter_1.36.0 geneplotter_1.32.1 [6] grid_2.14.0 IRanges_1.12.6 lattice_0.20-6 RColorBrewer_1.0-5 RSQLite_0.11.1 [11] splines_2.14.0 survival_2.36-14 tools_2.14.0 xtable_1.7-0 > -- Sent via the guest posting facility at bioconductor.org. From tim.triche at gmail.com Sun Jun 17 18:29:04 2012 From: tim.triche at gmail.com (Tim Triche, Jr.) Date: Sun, 17 Jun 2012 09:29:04 -0700 Subject: [BioC] down-expression and high-expression in single cell + amplification In-Reply-To: <20120617162330.C0B3112F963@mamba.fhcrc.org> References: <20120617162330.C0B3112F963@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From whuber at embl.de Sun Jun 17 23:58:23 2012 From: whuber at embl.de (Wolfgang Huber) Date: Sun, 17 Jun 2012 23:58:23 +0200 Subject: [BioC] Question regarding cellhts2 output In-Reply-To: <907543CDE7D2764C84BA88D7EB0890480E8AD95C@EX-MBX1.ad.tgen.org> References: <907543CDE7D2764C84BA88D7EB0890480E8AD18E@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD403@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD424@EX-MBX1.ad.tgen.org> <38883702-1483-4147-81B3-7AD596BC4FCB@embl.de> <907543CDE7D2764C84BA88D7EB0890480E8AD5A8@EX-MBX1.ad.tgen.org> <086B1CE9-C09A-41B2-B146-2A39EB9E3561@embl.de> <907543CDE7D2764C84BA88D7EB0890480E8AD670@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD683@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD6F2@EX-MBX1.ad.tgen.org> <907543CDE7D2764C84BA88D7EB0890480E8AD95C@EX-MBX1.ad.tgen.org> Message-ID: <4FDE52FF.3090504@embl.de> Dear Meraj I am not aware of an easy way to get at an FDR from data as yours, but FPR you can get as follows: pool the data from the negative controls with the rest of your data, as if the negative controls were tested genes. Then estimate FPR from how many of them are found as 'hits'. For this to be valid, you need to make sure that the variation in the negative controls represents the variation otherwise, e.g. at the very least you want to convince yourself that edge-effects or gradients are negligible. Best wishes Wolfgang Jun/17/12 8:48 AM, maziz at tgen.org scripsit:: > I am using CellHTS2 to calculate Bscores. My experiment has only one replicate. > There are approx 900 genes (x4 siRNA). > > From: Meraj Aziz > Sent: Saturday, June 16, 2012 5:05 PM > To: 'Joseph Barry' > Cc: 'bioconductor at r-project.org' > Subject: RE: Question regarding cellhts2 output > > Hi, > > Is there a way to calculate the False Discovery Rate (FDR) for an RNAi > Experiment. 
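[Editorial sketch, not part of the original thread.] Wolfgang's suggestion of estimating the false positive rate from negative controls can be illustrated in base R. Everything below is made up for illustration: the scores are simulated, the cutoff of -2 is hypothetical, and estimate_fpr is not a cellHTS2 function. As noted in the reply, the estimate is only meaningful if the negative controls vary like the rest of the plate.

```r
set.seed(42)

# Simulated scores: 900 tested genes (50 of them true hits) and 60
# negative-control wells, all on the same score scale.
sample_scores  <- c(rnorm(850), rnorm(50, mean = -4))
control_scores <- rnorm(60)

threshold <- -2   # hypothetical hit cutoff on the score

# Fraction of negative controls that would be called hits at this cutoff.
estimate_fpr <- function(control_scores, threshold) {
  mean(control_scores <= threshold)
}

fpr_hat <- estimate_fpr(control_scores, threshold)
n_hits  <- sum(sample_scores <= threshold)

print(fpr_hat)
print(n_hits)
```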
> > Thanks, > Meraj > > > From: Joseph Barry [mailto:joseph.barry at embl.de] > Sent: Wednesday, June 13, 2012 2:45 PM > To: Meraj Aziz > Subject: Re: Question regarding cellhts2 output > > Hi Meraj, > > Yes, that would be great. Thanks for being understanding. > > Best wishes, > Joseph > > On Jun 13, 2012, at 7:49 PM,> > wrote: > > So next time I ask a question I will include bioconductor at r-project.org > In my CC. > > I apologize for this. > > Thanks, > Meraj > > From: Joseph Barry [mailto:joseph.barry at embl.de] > Sent: Wednesday, June 13, 2012 2:55 AM > To: Meraj Aziz > Subject: Re: Question regarding cellhts2 output > > Hi Meraj, > > The negative/positive controls are defined by the user, and their "significance" varies greatly from experiment to experiment. Some have no negative controls, others do. It depends on experimental design. Most of the time they are for quality control, as you say. However, the normalization method "negatives" does make use of this information. See the package documentation for further details. > > As regards the assignment of probabilities, I would not interpret the Z or Bscores in this way. Each well can be viewed as being independent from the others (again depending on exp design) so you are not really sampling in the way you are suggesting. The setting of a threshold is usually an arbitrary choice based on the data. It is fine to just state your threshold and present the results directly. > > I am happy to answer any further questions, should you have any. However, it would be great if you could send any such questions out through the bioconductor mailing list so that other users may contribute to the discussion and benefit from the commentary. > > Many thanks, > Joseph > > On Jun 13, 2012, at 1:46 AM,> > wrote: > > > Hi Joseph > > One more question regarding CellHTS2 is the use of negative controls. > When I run CellHTS2 with Bscore normalization, what is the significance of the negative > Controls on the plates. 
Are negative and positive controls only for quality control and visualization > purposes, or are they actually used somehow in the Bscore calculations? > > Thanks, > Meraj > > > > From: Meraj Aziz > Sent: Tuesday, June 12, 2012 1:12 PM > To: 'Joseph Barry' > Subject: RE: Question regarding cellhts2 output > > Hi Joseph, > > Thanks for your reply. > The reason I was interested in knowing whether my screen was normally distributed is that I want to use > the Bscores (assuming the scores are standard deviations from the median) to assign a probability > to each siRNA (using something like Zscore-to-probability tables/calculators). > > The conversion from Z/Bscore to probability should give the probability that the given siRNA effect was observed by chance. > > For example: > > At threshold "2" (Bscore) the probability is 0.023 or 2.3%. > This 2.3% means that the probability of an siRNA giving you the observed effect by chance > is less than 2.3%. > For threshold "3" the probability is 0.00135 or 0.135%. > > For that I need to be sure whether the assumption of normality holds. > > I hope I am interpreting the results from CellHTS2 Bscore normalization the right way. > Our aim is to justify why we are using a particular Bscore cutoff. > > Thanks, > Meraj > > > From: Joseph Barry [mailto:joseph.barry at embl.de] > Sent: Tuesday, June 12, 2012 12:07 PM > To: Meraj Aziz > Subject: Re: Question regarding cellhts2 output > > Hi Meraj, > > The density plot just shows the distribution of scores for your screen and conveniently marks the positions of positive/negative controls. Your screen is not fully normal as it does not have the classical bell shape. However I would not read too much into whether a screen is normally distributed or not. Scores which seem to break the trend (such as your SMG1) tend to lie further from the line on the Q-Q plot but I would not waste too much time looking at this. They are primarily for quality control, to check that the distribution does not look "funny".
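[Editorial sketch, not part of the original thread.] The tail probabilities discussed in the quoted message for score thresholds 2 and 3 can be checked with pnorm. This assumes the scores follow a standard normal distribution, which, as the reply points out, need not hold for a real screen; the variable names are illustrative.

```r
thresholds <- c(2, 3)

# One-sided tail: probability of a score at least this extreme in one direction
one_sided <- pnorm(-thresholds)

# Two-sided tail: extreme in either direction
two_sided <- 2 * pnorm(-thresholds)

print(signif(one_sided, 3))
print(signif(two_sided, 3))
```

The one-sided values are roughly 0.0228 and 0.00135; the two-sided values are twice that.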
> > Best wishes, > Joseph > > On Jun 12, 2012, at 8:18 PM,> wrote: > > Hi Joseph > > The Q-Q plots gives a measure of testing for normality of our RNAi distribution. > Attached is my screens Q-Q plot. > > What does the density plot imply and is my screen normally distributed? > You have been really helpful. > > Thanks, > Meraj > > > > > From: Joseph Barry [mailto:joseph.barry at embl.de] > Sent: Monday, June 11, 2012 12:55 PM > To: Meraj Aziz > Subject: Re: Question regarding cellhts2 output > > Hi Meraj, > > My apologies, I had not spotted the line: > > xsc=scoreReplicates(xn, sign = "-", method = Score) > > , which is calculating the zscore at this stage and multiplying by -1. This is absolutely fine. > > Therefore I don't think there is anything wrong with your analysis. I would not be concerned that you get a score of -76 s.d.. This is perfectly reasonable, given that the standard deviation is ~0.07, i.e. the scores seem high simply because you divide by a small number. > > Hope this helps, > Joseph > > On Jun 11, 2012, at 9:26 PM, Joseph Barry wrote: > > Hi Meraj, > > I noticed in your output that > VarianceAdjust="none" > so I guess that you have not divided by the MAD (or standard deviation) using cellHTS2, but have rather done this as a post-processing step? > > Can you check that you have not made a mistake in calculating the zscore? In R, I quickly manually divided by MAD and obtained a more conservative range: > > range(x$normalized_r1_ch1/mad(x$normalized_r1_ch1, na.rm=TRUE), na.rm=TRUE) > [1] -12.46033 63.93462 > > The median is zero, as it should be, so the subtraction of the median is working fine. > > As a solution, I recommend you reanalyze your data with the VarianceAdjust="byPlate" option turned on. > > Best wishes, > Joseph > > > > On Jun 11, 2012, at 9:04 PM,> > wrote: > > Attached is my output from CellHTS2. > So I was interested in gene "SMG1" and at a cutoff of "-2 BScore" > I get all 4 siRNA, which is good. 
But the score in the negative goes upto > -76.26 Standard deviations which seems a lot. > > My parameters are as follows: > > orgDir=getwd() > setwd("/temp/cellHTS2/JOB5676000587616137010") > Indir="/temp/cellHTS2/JOB5676000587616137010" > zz<- file("/temp/cellHTS2/JOB5676000587616137010_RUN1370378843309182339/R_OUTPUT.TXT", open="w") > sink(file=zz,type="message" ) > Name="SCNA_with_pos_ctrl" > Outdir_report="/temp/cellHTS2/JOB5676000587616137010_RUN1370378843309182339" > LogTransform=FALSE > PlateList="Platelist.txt" > Plateconf="PlateConfig.txt" > Description="Description.txt" > NormalizationMethod="Bscore" > NormalizationScaling="additive" > VarianceAdjust="none" > SummaryMethod="mean" > Screenlog="Screenlog.txt" > Score="zscore" > Annotation="GeneIDs.txt" > library(cellHTS2) > x=readPlateList(PlateList, name = Name, path = Indir) > x=configure(x, descripFile=Description, confFile=Plateconf, logFile=Screenlog,path=Indir) > xn=normalizePlates(x, scale =NormalizationScaling , log =LogTransform,method=NormalizationMethod, varianceAdjust=VarianceAdjust) > comp=compare2cellHTS(x, xn) > xsc=scoreReplicates(xn, sign = "-", method = Score) > xsc=summarizeReplicates(xsc, summary = SummaryMethod) > scores=Data(xsc) > ylim=quantile(scores, c(0.001, 0.999), na.rm = TRUE) > xsc=annotate(xsc, geneIDFile = Annotation) > out=writeReport(raw = x, normalized = xn, scored = xsc, outdir = Outdir_report, force = TRUE, settings = list(xrange = c(0.5,3),zrange = c(-4, 8), ar = 1)) > setwd(orgDir) > sink() > > Any comments from you will really help guiding me towards the right direction. > > meraj > > From: Joseph Barry [mailto:joseph.barry at embl.de] > Sent: Monday, June 11, 2012 11:53 AM > To: Meraj Aziz > Cc: bioconductor at r-project.org > Subject: Re: Question regarding cellhts2 output > > Hi Meraj, > > One clarification: the Bscore method in cellHTS2 does not automatically divide by the MAD. One must explicitly specify varianceAdjust="byPlate" to enforce this. 
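[Editorial sketch, not part of the original thread.] The effect of dividing by the MAD, and how a median-polish score differs from a plain robust z-score, can be seen on a simulated plate in base R. This is only an illustration of the idea under stated assumptions (an 8 x 12 plate with an artificial row gradient); it is not cellHTS2's implementation, and the object names are made up.

```r
set.seed(1)

# Simulated 96-well plate (8 rows x 12 columns) with a row gradient added
plate <- matrix(rnorm(96), nrow = 8, ncol = 12)
plate <- plate + 0.2 * (row(plate) - 1)

# Robust z-score: subtract the plate median, divide by the plate MAD
robust_z <- (plate - median(plate)) / mad(plate)

# Two-way median polish removes row and column effects; the residuals are
# the unscaled scores (analogous to running without variance adjustment)
polished        <- medpolish(plate, trace.iter = FALSE)
bscore_unscaled <- polished$residuals

# Dividing the residuals by their MAD puts them on a "robust s.d." scale
bscore_scaled <- bscore_unscaled / mad(bscore_unscaled)

print(round(range(robust_z), 2))
print(round(range(bscore_unscaled), 2))
print(round(range(bscore_scaled), 2))
```

Without the division step, the scores stay in the units of the measurement, so their magnitude depends on how small the spread of the data is.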
> > Best wishes, > Joseph > > > On Jun 11, 2012, at 8:37 PM, Joseph Barry wrote: > > > > Hi Meraj, > > I would recommend that you use the method="median" and varianceAdjust="byPlate" (or alternatively "byExperiment" or "byBatch", depending on the context) options to normalizePlates. This will subtract the median and divide by the median absolute deviation (MAD), which is slightly more robust than the classical zscore, where one subtracts the mean and divides by the standard deviation. > > The Bscore normalization method subtracts the plate median and divides by the plate MAD, but also applies a two-way median polish to correct for row and column effects. Thus it is essentially a zscore with a few more bells and whistles attached, if you will. The references at the bottom of the ?Bscore documentation explain this in more detail and will help you to decide whether or not this is appropriate for your data. > > (cc'd to the bioconductor mailing list for future googlers :) ) > > Best wishes, > Joseph > > On Jun 11, 2012, at 8:11 PM,> > wrote: > > > > Hi Joseph > > I have a question regarding the scores generated by cellhts2. > I would really appreciate if you can answer them. > > In your paper > http://genomebiology.com/content/pdf/gb-2006-7-7-r66.pdf > you mention zscore as the basis of your score. Online > cellhts2 does not have a zscore normalization mechanism/option. > > Question is: > 1) How can I only choose zscore normalization. > 2) And if I choose Bscore normalization. Is the score really standard > deviation from the mean/median. 
> > In the R_OUTPUT file I see: > NormalizationMethod="Bscore" > Score="zscore" > (what exactly does this imply) > > Thank you for your help > > Meraj > > > > > > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber From guest at bioconductor.org Mon Jun 18 00:33:05 2012 From: guest at bioconductor.org (Josh [guest]) Date: Sun, 17 Jun 2012 15:33:05 -0700 (PDT) Subject: [BioC] Getting the start and end positions of a list of genes Message-ID: <20120617223305.0F95D133CAB@mamba.fhcrc.org> Dear listserv, I am a long-time R user, novice Bioconductor user. I am quickly realizing they are not the same thing. I have a very basic question that I hope you can help me with. I have a list of genes in Arabidopsis thaliana. I want to input this list into R/Bioconductor and output a table listing the start and end positions of each gene. Specific code that will get the job done will be the most helpful for me. Also, please tell me the specific packages and databases and such I must load into memory. I am a total newbie at this. Thanks in advance, ----------------------------------- Josh Banta, Ph.D Assistant Professor Department of Biology The University of Texas at Tyler Tyler, TX 75799 Tel: (903) 565-5655 http://plantevolutionaryecology.org -- output of sessionInfo(): > gene.pos <- data.frame(matrix(nrow = 3, ncol = 4)) > gene.list <- c("At5g35790", "AT5g60910", "AT1g16560") > gene.pos[,1] <- gene.list > colnames(gene.pos) <- c("gene", "chromosome", "nuc_sequence_start" , "nuc_sequence_end") > > gene.pos gene chromosome nuc_sequence_start nuc_sequence_end 1 At5g35790 NA NA NA 2 AT5g60910 NA NA NA 3 AT1g16560 NA NA NA > > #now what? 
How do I fill in the blanks? -- Sent via the guest posting facility at bioconductor.org. From stvjc at channing.harvard.edu Mon Jun 18 00:57:25 2012 From: stvjc at channing.harvard.edu (Vincent Carey) Date: Sun, 17 Jun 2012 18:57:25 -0400 Subject: [BioC] Getting the start and end positions of a list of genes In-Reply-To: <20120617223305.0F95D133CAB@mamba.fhcrc.org> References: <20120617223305.0F95D133CAB@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From phipson at wehi.EDU.AU Mon Jun 18 02:45:28 2012 From: phipson at wehi.EDU.AU (Belinda Phipson) Date: Mon, 18 Jun 2012 10:45:28 +1000 Subject: [BioC] Problem in filtering of single color microarrray data using LIMMA In-Reply-To: References: Message-ID: <002a01cd4ceb$abb19560$0314c020$@edu.au> Hi Murali There is no automatic filtering in the limma package, if that is what you are asking. There are different approaches one could take to filter out genes - often lowly expressed genes are filtered out, or in the case of Illumina microarrays, you can filter out probes with low detection values across all samples. Cheers, Belinda -----Original Message----- From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Muralidharan V Sent: Saturday, 16 June 2012 2:49 PM To: bioconductor at r-project.org Subject: [BioC] Problem in filtering of single color microarrray data using LIMMA Hai all, I used LIMMA for doing the preprocessing of microarray data analysis and got the result also but there is some problem in the filtering process. I just want to know how the filtering process is done in LIMMA pacakge using R language. 
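[Editorial sketch, not part of the original thread.] Belinda's suggestion of filtering out lowly expressed genes before the limma fit could look roughly like this in base R. The matrix is simulated, and the cutoff of 6 and the requirement of 3 arrays are arbitrary assumptions chosen for illustration; filter_low_expression is not a limma function.

```r
set.seed(7)

# Simulated log2 expression matrix: 1000 probes x 6 arrays
expr <- matrix(rnorm(6000, mean = 7, sd = 1.5), nrow = 1000,
               dimnames = list(paste0("probe", 1:1000),
                               paste0("array", 1:6)))

# Keep probes whose expression exceeds `cutoff` on at least `min_arrays` arrays
filter_low_expression <- function(expr, cutoff, min_arrays) {
  keep <- rowSums(expr > cutoff) >= min_arrays
  expr[keep, , drop = FALSE]
}

filtered <- filter_low_expression(expr, cutoff = 6, min_arrays = 3)
print(dim(filtered))
```

The filtered matrix would then be passed to lmFit in place of the full one.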
Thanks Murali -- [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From phipson at wehi.EDU.AU Mon Jun 18 02:54:01 2012 From: phipson at wehi.EDU.AU (Belinda Phipson) Date: Mon, 18 Jun 2012 10:54:01 +1000 Subject: [BioC] How to print out normalized Cy5 and Cy3 signals In-Reply-To: References: Message-ID: <002b01cd4cec$dd15aaf0$974100d0$@edu.au> Hi Mei Check the names of your data object: > names(data) to figure out where the normalized data is and then use the > write.csv(data$...,file="norm.csv") which can write matrices or data frames to a file which can be opened in excel. Cheers, Belinda -----Original Message----- From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of JiangMei Sent: Saturday, 16 June 2012 5:05 AM To: bioconductor at r-project.org Subject: [BioC] How to print out normalized Cy5 and Cy3 signals Hi All. Sorry to bother you. I used limma package to normalize my two-color microarray data. I want to export the normalized Cy5 and Cy3 signals. Does anyone know how to do that? Thanks very much in advance. 
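[Editorial sketch, not part of the original thread.] If the normalized object stores M and A values rather than separate channels, the log2 Cy5 and Cy3 intensities follow from M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2. The base R sketch below uses made-up numbers and an illustrative helper, ma_to_log_channels; limma's own RG.MA() is, as far as I know, the packaged way to do the same conversion on an MAList.

```r
# Made-up normalized values for 3 spots on 2 arrays
M <- matrix(c(1, -2, 0, 0.5, 0, -1), nrow = 3)
A <- matrix(c(10, 8, 12, 9, 11, 7), nrow = 3)

# Invert the M/A definitions to recover the log2 channel intensities
ma_to_log_channels <- function(M, A) {
  list(logR = A + M / 2,   # Cy5 (red)
       logG = A - M / 2)   # Cy3 (green)
}

channels <- ma_to_log_channels(M, A)

# Write the first array's channels to a CSV file that opens in Excel
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(logCy5 = channels$logR[, 1],
                     logCy3 = channels$logG[, 1]),
          file = out_file, row.names = FALSE)
print(read.csv(out_file))
```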
Best, Mei [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From phipson at wehi.EDU.AU Mon Jun 18 03:02:51 2012 From: phipson at wehi.EDU.AU (Belinda Phipson) Date: Mon, 18 Jun 2012 11:02:51 +1000 Subject: [BioC] some help requested for constructing an appropriate design matrix in LIMMA In-Reply-To: References: Message-ID: <002c01cd4cee$18b0f4b0$4a12de10$@edu.au> Hi Steven You could just include cell line in your linear model rather than using duplicateCorrelation(). > design <- model.matrix(~factor(targets$cellline)+factor(targets$fenotype)) > fit <- lmFit(eset,design) > fit <- eBayes(fit) This will test R vs S taking into account cell line. You could also filter out lowly expressed genes across all samples to improve your power to detect differentially expressed genes as your sample size is quite small. Cheers, Belinda -----Original Message----- From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of steven segbroek Sent: Friday, 15 June 2012 2:13 AM To: bioconductor at r-project.org Subject: [BioC] some help requested for constructing an appropriate design matrix in LIMMA Dear R-users, I want to analyse a single channel micro array experiment which looks like the following: > targets File cellline fenotype 1 A 1 R 2 B 2 R 3 C 3 R 4 D 1 S 5 E 2 S 6 F 3 S There are three different cell lines, each of which comes in two versions. Every cell line has a variant that is resistant to a specific drug and another variant that is sensitive to this drug. 
We treated both variants of the three cell lines with this drug and then extracted RNA, which was then hybridised to a microarray. The question we want to resolve is: which genes are differentially regulated between resistant (R) and sensitive (S) versions of these cell lines? There is quite some biological variation between the cell lines, so grouping them by fenotype and then searching for differentially regulated genes would be a bad idea. So the idea is to construct a model that accounts for this biological variation between the cell lines and identifies which genes are consistently up- or down-regulated between resistant and sensitive versions of these three cell lines. I am a bit puzzled about how to set up an appropriate design matrix for this particular experiment. I have come up with the following code:

> design
  celline fenotR fenotS
1       1      1      0
2       2      1      0
3       3      1      0
4       1      0      1
5       2      0      1
6       3      0      1
attr(,"assign")
[1] 1 2 2
attr(,"contrasts")
attr(,"contrasts")$fenot
[1] "contr.treatment"
>block<-c(1,2,3,1,2,3)
>eset<-exprs(BSData.log2.quantile)
>cor<-duplicateCorrelation(eset, ndups=1, block=block, design=design)
>fit <- lmFit(eset, design, block=block, cor=cor$consensus)
>fit<- eBayes(fit)
>cont.matrix<-makeContrasts(resvssens= fenotR - fenotS, levels=design)
>fit2<-contrasts.fit(fit,cont.matrix)
>fit2<-eBayes(fit2)
>topTable(fit2)

However, this code results in an adj.p-value of "0.9999541" for every gene. Is there a better way to analyse this?
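[Editorial sketch, not part of the original thread.] The design suggested in the reply, with cell line as a factor plus the phenotype, can be built and inspected in base R. The targets table mirrors the one in the post; the expression values for the single toy gene are made up. With this balanced layout, the fenotypeS coefficient equals the mean within-cell-line difference S minus R, which is the paired comparison being asked for.

```r
# Targets as described in the post, with cell line treated as a factor
targets <- data.frame(
  File     = LETTERS[1:6],
  cellline = factor(c(1, 2, 3, 1, 2, 3)),
  fenotype = factor(c("R", "R", "R", "S", "S", "S"), levels = c("R", "S"))
)

# Additive model: one baseline per cell line plus a common S-vs-R effect
design <- model.matrix(~ cellline + fenotype, data = targets)
print(design)

# Made-up log2 expression of one gene across the six arrays
y <- c(8.0, 9.5, 7.2, 9.1, 10.4, 8.4)

coefs            <- lm.fit(design, y)$coefficients
paired_mean_diff <- mean(y[4:6] - y[1:3])

print(coefs["fenotypeS"])
print(paired_mean_diff)
```

In limma, the same design would go to lmFit and the phenotype column would be selected with topTable(fit, coef = "fenotypeS").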
Kind regards, Steven [[alternative HTML version deleted]] From drordh at gmail.com Mon Jun 18 07:24:07 2012 From: drordh at gmail.com (Dror Hibsh) Date: Mon, 18 Jun 2012 08:24:07 +0300 Subject: [BioC] down-expression and high-expression in single cell + amplification In-Reply-To: References: <20120617162330.C0B3112F963@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From mayba.oleg at gene.com Mon Jun 18 08:11:22 2012 From: mayba.oleg at gene.com (Oleg Mayba) Date: Sun, 17 Jun 2012 23:11:22 -0700 Subject: [BioC] nearest() for GRanges Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From drordh at gmail.com Mon Jun 18 09:20:37 2012 From: drordh at gmail.com (Dror Hibsh) Date: Mon, 18 Jun 2012 10:20:37 +0300 Subject: [BioC] down-expression and high-expression in single cell + amplification In-Reply-To: References: <20120617162330.C0B3112F963@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed...
Name: not available URL: From olshansky at wehi.EDU.AU Mon Jun 18 09:26:44 2012 From: olshansky at wehi.EDU.AU (Moshe Olshansky) Date: Mon, 18 Jun 2012 17:26:44 +1000 (EST) Subject: [BioC] design matrix Limma design for paired t-test In-Reply-To: <4FDB06AF.6040609@ipbs.fr> References: <4FD759D6.2060202@ipbs.fr> <004201cd4930$2d7802b0$88680810$@edu.au> <4FD88B9C.5090102@ipbs.fr> <5266de74b101cfe9b43bb86abb9fd56b.squirrel@wehimail.alpha.wehi.edu.au> <4FDB06AF.6040609@ipbs.fr> Message-ID: <3d8b4ea5c869eff1a410f5065685d8d6.squirrel@wehimail.alpha.wehi.edu.au> Hi Ingrid, If I understand correctly, you would like to find genes which are differentially expressed (DE) between Treatment and Control at 4 hours and compare them with those which are DE at 18 hours. One way to do it is to split your data into two separate sets ( 4 hours and 18 hours) and find DE genes for each part separately (and then you omit your Time column). But by doing so you reduce your ability to estimate the variances. So a preferable way would be to omit the time and have 4 conditions: C4,T4,C18 and T18 (Control and Treatment at 4 and 18 hours). The you may use MakeContrasts function of limma to find DE genes between T4 and C4 and between T18 and C18. If x is your targets file, i.e. 
> x X FileName Treatment Donor Time 1 DC_4_4 US10463851_252665214446_S01_GE1_1010_Sep10_1_2.txt T 4 4 2 SC_4_4 US10463851_252665214448_S01_GE1_1010_Sep10_1_2.txt C 4 4 3 DC_18_4 US10463851_252665214447_S01_GE1_1010_Sep10_1_2.txt T 4 18 4 SC_18_4 US10463851_252665214444_S01_GE1_1010_Sep10_1_3.txt C 4 18 5 DC_4_5 US10463851_252665214448_S01_GE1_1010_Sep10_1_4.txt T 5 4 6 SC_4_5 US10463851_252665214444_S01_GE1_1010_Sep10_1_1.txt C 5 4 7 DC_18_5 US10463851_252665214446_S01_GE1_1010_Sep10_1_3.txt T 5 18 8 SC_18_5 US10463851_252665214447_S01_GE1_1010_Sep10_1_4.txt C 5 18 9 DC_4_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_4.txt T 6 4 10 SC_4_6 US10463851_252665214447_S01_GE1_1010_Sep10_1_3.txt C 6 4 11 DC_18_6 US10463851_252665214448_S01_GE1_1010_Sep10_1_3.txt T 6 18 12 SC_18_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_3.txt C 6 18 13 DC_4_7 US10463851_252665214444_S01_GE1_1010_Sep10_1_4.txt T 7 4 14 SC_4_7 US10463851_252665214445_S01_GE1_1010_Sep10_1_2.txt C 7 4 15 DC_18_7 US10463851_252665214447_S01_GE1_1010_Sep10_1_1.txt T 7 18 16 SC_18_7 US10463851_252665214446_S01_GE1_1010_Sep10_1_1.txt C 7 18 17 DC_4_8 US10463851_252665214444_S01_GE1_1010_Sep10_1_2.txt T 8 4 18 SC_4_8 US10463851_252665214446_S01_GE1_1010_Sep10_1_4.txt C 8 4 19 DC_18_8 US10463851_252665214445_S01_GE1_1010_Sep10_1_1.txt T 8 18 20 SC_18_8 US10463851_252665214448_S01_GE1_1010_Sep10_1_1.txt C 8 18 > you can do > y <- cbind(x$Donor,paste(x$Treatment,x$Time,sep="_")) > colnames(y) <- c("Donor","Cond_Tim") > y <- data.frame(y) > y Donor Cond_Tim 1 4 T_4 2 4 C_4 3 4 T_18 4 4 C_18 5 5 T_4 6 5 C_4 7 5 T_18 8 5 C_18 9 6 T_4 10 6 C_4 11 6 T_18 12 6 C_18 13 7 T_4 14 7 C_4 15 7 T_18 16 7 C_18 17 8 T_4 18 8 C_4 19 8 T_18 20 8 C_18 > Then > design <- model.matrix(~Donor+Cond_Tim,y) > colnames(design) <- gsub("Cond_Tim","",colnames(design)) > colnames(design)[1] <- "Intercept" > design Intercept Donor5 Donor6 Donor7 Donor8 C_4 T_18 T_4 1 1 0 0 0 0 0 0 1 2 1 0 0 0 0 1 0 0 3 1 0 0 0 0 0 1 0 4 1 0 0 0 0 0 0 0 5 1 
1 0 0 0 0 0 1 6 1 1 0 0 0 1 0 0 7 1 1 0 0 0 0 1 0 8 1 1 0 0 0 0 0 0 9 1 0 1 0 0 0 0 1 10 1 0 1 0 0 1 0 0 11 1 0 1 0 0 0 1 0 12 1 0 1 0 0 0 0 0 13 1 0 0 1 0 0 0 1 14 1 0 0 1 0 1 0 0 15 1 0 0 1 0 0 1 0 16 1 0 0 1 0 0 0 0 17 1 0 0 0 1 0 0 1 18 1 0 0 0 1 1 0 0 19 1 0 0 0 1 0 1 0 20 1 0 0 0 1 0 0 0 attr(,"assign") [1] 0 1 1 1 1 2 2 2 attr(,"contrasts") attr(,"contrasts")$Donor [1] "contr.treatment" attr(,"contrasts")$Cond_Tim [1] "contr.treatment" So now your base level is Dono4, Control at 18 hours. I do not know whether you version of R will produce identical results. Assuming it is you can proceed as following (if X is your normalized and log-transformed expression matrix): > contr <- makeContrasts(h_18=T_18,h_4=T_4-C_4,levels=design) > fit <- lmFit(X,design=design) > fit <- contrasts.fit(fit,contr) > fit <- eBayes(fit) > tp4 <- topTable(fit,coef="h_4",number=nrow(X)) > tp18 <- topTable(fit,coef="h_18",number=nrow(X)) Now from tp4 and tp18 you can find genes which are DE at 4 hours and 18 hours respectively and then compare the two lists. Just one warning: if you criterion for DE is logFC of 1 (fold change of 2) and adjusted p-value of 0.05, it may happen that certain gene has logFC of 1.05 at 4 hours and 0.97 at 18 hours (and good p-value in both cases) and then it is DE at 4 hours and not DE at 18 hours, but actually it's behavior is not really different. Or it may happen that it has good logFC under both conditions but adjusted p-value of 0.04 at 4 hours and of 0.06 at 18 hours, so once again it is DE at 4 hours and not DE at 18 hours, but once again it's behavior is not very different. So watch out for such genes - they are not what you are looking for. Best regards, Moshe. > Thanks Moshe for your reply ! > It's very clear ! As you wrote, I want to test is " if the effect of > the treatment at 4 hours is different from the one at 18 hours, between > Control and Treated cells ", but I don't see how change my design. > Somebody can help me ? 
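[Editorial sketch, not part of the original thread.] The question of whether the treatment effect at 4 hours differs from the one at 18 hours can be tested directly as a difference of the two contrasts, rather than by comparing two gene lists. The base R sketch below rebuilds the donor plus combined-condition design with C_18 as the reference level (the levels are set explicitly so that the result does not depend on the sorting locale) and checks the contrast vectors on noise-free made-up data. The effect sizes and the helper contrast_vec are assumptions for illustration.

```r
donor <- factor(rep(4:8, each = 4))
treat <- rep(c("T", "C", "T", "C"), times = 5)
time  <- rep(c(4, 4, 18, 18), times = 5)

# Combined condition factor; C_18 is the reference level
cond <- factor(paste(treat, time, sep = "_"),
               levels = c("C_18", "C_4", "T_18", "T_4"))

design <- model.matrix(~ donor + cond)
colnames(design) <- sub("^cond", "", colnames(design))

# Contrast vector: +1 on the `plus` columns, -1 on the `minus` columns
contrast_vec <- function(plus, minus = character(0)) {
  v <- setNames(numeric(ncol(design)), colnames(design))
  v[plus]  <- 1
  v[minus] <- -1
  v
}

contr <- cbind(
  h_4  = contrast_vec("T_4", "C_4"),               # T vs C at 4 hours
  h_18 = contrast_vec("T_18"),                     # T vs C at 18 hours
  diff = contrast_vec(c("T_18", "C_4"), "T_4")     # h_18 - h_4
)

# Noise-free toy gene: treatment effect 1 at 4 hours and 3 at 18 hours
donor_effect <- c("4" = 0, "5" = 0.5, "6" = -0.3, "7" = 0.2, "8" = 0.1)
cond_effect  <- c(C_18 = 0, C_4 = 0.4, T_18 = 3, T_4 = 1.4)
y <- 8 + donor_effect[as.character(donor)] + cond_effect[as.character(cond)]

coefs     <- lm.fit(design, unname(y))$coefficients
estimates <- drop(crossprod(contr, coefs))
print(round(estimates, 6))
```

With limma, the same difference could presumably be added to the makeContrasts call as diff = T_18 - (T_4 - C_4), after renaming the intercept column as done in the reply.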
> > Cheers, > > Ingrid > > Ingrid MERCIER > Mycobacterial Interactions with Host Cells Team > Institute of Pharmacology& Structural Biology > CNRS - University of Toulouse > BP 64182 > F-31077 Toulouse Cedex France > Tel +33 (0)5 61 17 54 63 > > > > > Le 14/06/2012 03:44, Moshe Olshansky a ?crit : >> Hi Ingrid, >> >> With your design your "base" level is patient 4, Control, 4 hours (let's >> call it B). >> The mean for, say, patient 6, Treatment, 18 hours is: >> B + Donor6 + TreatT + Time18 >> where Donor6 is the difference between Donor4 and Donor6 (same for any >> treatment and time), TreatT is the difference between Treatment and >> Control (independent of patient and time) and Time18 is the difference >> between 18 hours and 4 hours (independent of patient and treatment). >> >> If you think that the effect of Treatment versus Control is the same at >> 4 >> hours and 18 hours, then what you did is all right. If you think that >> the >> effect of the treatment at 4 hours may be different from the one at 18 >> hours, you need to change your design. >> >> Best regards, >> Moshe. >> >>> Thanks a lot Belinda !! >>> >>> I mistaked so I replaced Time=Treat by Time only, and it's good. >>> So, I have a last question : I 'm confused with the differents coef in >>> topTable. >>> I get genes but I tested several coef without understanding their >>> significance. >>> Somebody can explain me what mean coef="TreatT", or coef= >>> "Time18",coef= >>> " Donor5",coef= " Donor6", coef= "Donor7",coef= " Donor8". >>> My main objective is to identidy the differential expressed genes >>> between the Control donors and Treated Donors at 4 hours or 18 hours. >>> I have no idea, which coef I have to use it. 
>>> >>> Cheers, >>> >>> Ingrid >>> >>> Ingrid MERCIER >>> Mycobacterial Interactions with Host Cells Team >>> Institute of Pharmacology& Structural Biology >>> CNRS - University of Toulouse >>> BP 64182 >>> F-31077 Toulouse Cedex France >>> Tel +33 (0)5 61 17 54 63 >>> >>> >>> >>> >>> Le 13/06/2012 08:45, Belinda Phipson a ?crit : >>>> Hi Ingrid >>>> >>>> The problem with your code is the following line: >>>>> Time=Treat=factor(Targets$Time) >>>> Where you essentially set the time factor equal to the treat factor. >>>> >>>> Cheers, >>>> Belinda >>>> >>>> >>>> -----Original Message----- >>>> From:bioconductor-bounces at r-project.org >>>> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Ingrid >>>> Mercier >>>> Sent: Wednesday, 13 June 2012 1:02 AM >>>> To:bioconductor at r-project.org;smyth at wehi.edu.au >>>> Subject: [BioC] design matrix Limma design for paired t-test >>>> >>>> Dear list and Gordon, >>>> >>>> I have some troubles to computed a moderated paired t-test in the >>>> linear >>>> model. >>>> Here is my experimental plan : >>>> >>>> I used a single channel Agilent microarray. >>>> 2 types of cells : Control (S) and Treated (T) >>>> Fives human donors : 4-5-6-7-8 >>>> Two times of treatment : 4 hours and 18 hours >>>> >>>> I want to compare teh differential expresed genes between my C versus >>>> T >>>> at 4 >>>> hours and then at 18 hours. 
>>>> >>>> Here is my design : >>>> >>>> >>>> My targets frame is : >>>>> Targets >>>> X FileName >>>> Treatment >>>> Donor Time >>>> 1 DC_4_4 US10463851_252665214446_S01_GE1_1010_Sep10_1_2.txt >>>> T >>>> 4 4 >>>> 2 SC_4_4 US10463851_252665214448_S01_GE1_1010_Sep10_1_2.txt >>>> C >>>> 4 4 >>>> 3 DC_18_4 US10463851_252665214447_S01_GE1_1010_Sep10_1_2.txt >>>> T >>>> 4 18 >>>> 4 SC_18_4 US10463851_252665214444_S01_GE1_1010_Sep10_1_3.txt >>>> C >>>> 4 18 >>>> 5 DC_4_5 US10463851_252665214448_S01_GE1_1010_Sep10_1_4.txt >>>> T >>>> 5 4 >>>> 6 SC_4_5 US10463851_252665214444_S01_GE1_1010_Sep10_1_1.txt >>>> C >>>> 5 4 >>>> 7 DC_18_5 US10463851_252665214446_S01_GE1_1010_Sep10_1_3.txt >>>> T >>>> 5 18 >>>> 8 SC_18_5 US10463851_252665214447_S01_GE1_1010_Sep10_1_4.txt >>>> C >>>> 5 18 >>>> 9 DC_4_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_4.txt >>>> T >>>> 6 4 >>>> 10 SC_4_6 US10463851_252665214447_S01_GE1_1010_Sep10_1_3.txt >>>> C >>>> 6 4 >>>> 11 DC_18_6 US10463851_252665214448_S01_GE1_1010_Sep10_1_3.txt >>>> T >>>> 6 18 >>>> 12 SC_18_6 US10463851_252665214445_S01_GE1_1010_Sep10_1_3.txt >>>> C >>>> 6 18 >>>> 13 DC_4_7 US10463851_252665214444_S01_GE1_1010_Sep10_1_4.txt >>>> T >>>> 7 4 >>>> 14 SC_4_7 US10463851_252665214445_S01_GE1_1010_Sep10_1_2.txt >>>> C >>>> 7 4 >>>> 15 DC_18_7 US10463851_252665214447_S01_GE1_1010_Sep10_1_1.txt >>>> T >>>> 7 18 >>>> 16 SC_18_7 US10463851_252665214446_S01_GE1_1010_Sep10_1_1.txt >>>> C >>>> 7 18 >>>> 17 DC_4_8 US10463851_252665214444_S01_GE1_1010_Sep10_1_2.txt >>>> T >>>> 8 4 >>>> 18 SC_4_8 US10463851_252665214446_S01_GE1_1010_Sep10_1_4.txt >>>> C >>>> 8 4 >>>> 19 DC_18_8 US10463851_252665214445_S01_GE1_1010_Sep10_1_1.txt >>>> T >>>> 8 18 >>>> 20 SC_18_8 US10463851_252665214448_S01_GE1_1010_Sep10_1_1.txt >>>> C >>>> 8 18 >>>> >>>> >>>> then I create my design matrix : >>>> >>>>> Donor >>>> [1] 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 >>>> Levels: 4 5 6 7 8 >>>>> Treat=factor(Targets$Treatment,levels=c("C","T")) >>>>> Treat >>>> [1] T C T 
C T C T C T C T C T C T C T C T C >>>> Levels: C T >>>>> Time=Treat=factor(Targets$Time) >>>>> Time >>>> [1] 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18 4 4 18 18 >>>> Levels: 4 18 >>>> >>>>> design=model.matrix(~Donor+Treat+Time) >>>>> design >>>> (Intercept) Donor5 Donor6 Donor7 Donor8 Treat18 Time18 >>>> 1 1 0 0 0 0 0 0 >>>> 2 1 0 0 0 0 0 0 >>>> 3 1 0 0 0 0 1 1 >>>> 4 1 0 0 0 0 1 1 >>>> 5 1 1 0 0 0 0 0 >>>> 6 1 1 0 0 0 0 0 >>>> 7 1 1 0 0 0 1 1 >>>> 8 1 1 0 0 0 1 1 >>>> 9 1 0 1 0 0 0 0 >>>> 10 1 0 1 0 0 0 0 >>>> 11 1 0 1 0 0 1 1 >>>> 12 1 0 1 0 0 1 1 >>>> 13 1 0 0 1 0 0 0 >>>> 14 1 0 0 1 0 0 0 >>>> 15 1 0 0 1 0 1 1 >>>> 16 1 0 0 1 0 1 1 >>>> 17 1 0 0 0 1 0 0 >>>> 18 1 0 0 0 1 0 0 >>>> 19 1 0 0 0 1 1 1 >>>> 20 1 0 0 0 1 1 1 >>>> attr(,"assign") >>>> [1] 0 1 1 1 1 2 3 >>>> attr(,"contrasts") >>>> attr(,"contrasts")$Donor >>>> [1] "contr.treatment" >>>> >>>> attr(,"contrasts")$Treat >>>> [1] "contr.treatment" >>>> >>>> attr(,"contrasts")$Time >>>> [1] "contr.treatment" >>>> >>>> >>>> In this design matrix I think something is wrong, because of the >>>> column >>>> Treat18 is the same as Time18. >>>> I don't understand why. >>>> So, the following code failed, and the differential expressed genes >>>> are >>>> odds. >>>> >>>> Somebody can help me !!! Thanks all. 
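
The identical Treat18 and Time18 columns come from the chained assignment Time=Treat=factor(Targets$Time) in the session above, which overwrites Treat with the Time factor. A minimal sketch, base R only, that rebuilds the three factors from the values shown in the targets frame; the names ending in _bug are illustrative and not from the original post:

```r
# Rebuild the factors in the row order of the targets frame:
# 5 donors, each with T/C samples at 4 h and 18 h.
Donor <- factor(rep(4:8, each = 4))
Treat <- factor(rep(c("T", "C"), times = 10), levels = c("C", "T"))
Time  <- factor(rep(rep(c(4, 18), each = 2), times = 5), levels = c(4, 18))

# Chained assignment, as in the post: both names end up holding the Time factor.
Time_bug <- Treat_bug <- Time
design_bug <- model.matrix(~ Donor + Treat_bug + Time_bug)

# Separate assignments keep treatment and time as distinct columns.
design_ok <- model.matrix(~ Donor + Treat + Time)

print(colnames(design_bug))
print(colnames(design_ok))

# Compare the rank with the number of columns: the buggy design is rank
# deficient (one coefficient cannot be estimated), the corrected one is not.
print(c(rank = qr(design_bug)$rank, ncol = ncol(design_bug)))
print(c(rank = qr(design_ok)$rank,  ncol = ncol(design_ok)))
```

With the corrected design the treatment effect is the coefficient named TreatT. Note also that topTable(fit2, 1, ...) ranks probes on the first coefficient, which is the intercept in this design; that would explain the logFC values of about 18 in the table that follows.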
>>>> >>>> >>>>> fit=lmFit(test_norm,design) >>>> Coefficients not estimable: Time18 >>>> Message d'avis : >>>> Partial NA coefficients for 34183 probe(s) >>>>> fit2=eBayes(fit) >>>> Message d'avis : >>>> In ebayes(fit = fit, proportion = proportion, stdev.coef.lim = >>>> stdev.coef.lim, : >>>> Estimation of var.prior failed - set to default value >>>> >>>> >>>>> table = topTable(fit2,1, number=5000, >>>> p.value=0.05,adjust.method="BH",sort.by="logFC",lfc=2) >>>>> head(table) >>>> ID logFC AveExpr t P.Value >>>> adj.P.Val >>>> B >>>> 6509 A_33_P3396434 18.44159 18.41239 245.14490 1.308161e-31 >>>> 2.353520e-28 >>>> 53.41519 >>>> 22398 A_33_P3223592 18.25824 18.24591 242.75647 1.545005e-31 >>>> 2.514901e-28 >>>> 53.36821 >>>> 10771 A_33_P3244165 18.21029 18.02229 90.76191 2.796577e-24 >>>> 2.467615e-23 >>>> 44.59915 >>>> 6149 A_33_P3346552 18.14780 18.12098 207.18556 2.282464e-30 >>>> 1.147374e-27 >>>> 52.50960 >>>> 23554 A_33_P3210160 18.08158 18.21026 239.64192 1.924175e-31 >>>> 2.560908e-28 >>>> 53.30521 >>>> 20924 A_33_P3286278 18.04425 18.07312 179.72121 2.558128e-29 >>>> 5.025546e-27 >>>> 51.56876 >>>> >>>> >>>> Best, >>>> >>>> Ingrid >>>> >>>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for >> the addressee. >> You must not disclose, forward, print or use it without the permission >> of the sender. 
>> ______________________________________________________________________
> ______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}

From anders at embl.de Mon Jun 18 09:53:16 2012
From: anders at embl.de (Simon Anders)
Date: Mon, 18 Jun 2012 09:53:16 +0200
Subject: [BioC] down-expression and high-expression in single cell + amplification
In-Reply-To:
References: <20120617162330.C0B3112F963@mamba.fhcrc.org>
Message-ID: <4FDEDE6C.7030303@embl.de>

Hi Pap

a few comments in addition to what has already been said:

- If you have very different library sizes, it is normal that you see
  fewer changes in the direction from the shallowly sequenced condition
  to the deeply sequenced one. This is because your power depends on the
  absolute read count, due to Poisson noise. Hence, if a gene has many
  reads in the shallow condition and few in the deep one, you have better
  power to say whether this is real than in the opposite case.

- However, in your case, the differences in size are less than 1:2,
  which is usually not much of a problem. Must be something else. Maybe
  post an MA plot.

- I am worried that you used RSEM for quantification. RSEM infers
  isoform abundances, i.e., each count value has a specific uncertainty
  attached due to the ambiguity in assigning reads mapping to shared
  exons, and this uncertainty can be huge and dramatically inflate false
  positives if a subsequent test is not informed of it. DESeq is not
  designed to work with RSEM, and the uncertainty information will get
  lost. (Actually, it isn't even calculated if you run RSEM in EM rather
  than Bayes mode, IIRC.)

- I'm not convinced that removing PCR duplicates in RNA-Seq is a good
  idea. If you have 50 bp single-end reads, you constrain the value range
  of your counts to 0:50, i.e., you lose all the advantages in dynamic
  range that RNA-Seq has over microarrays.
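
Simon's first point, that power depends on the absolute read count, can be illustrated with a small sketch. It assumes purely Poisson-distributed counts (no biological variability), and the expected counts are made-up values for illustration, not numbers from this thread:

```r
# For a Poisson count the variance equals the mean, so the relative noise
# (standard deviation divided by mean) shrinks as 1 / sqrt(mean).
expected_counts <- c(10, 100, 1000)        # hypothetical expected read counts
poisson_sd      <- sqrt(expected_counts)   # standard deviation of a Poisson count
relative_noise  <- poisson_sd / expected_counts

noise_table <- data.frame(expected_counts, poisson_sd, relative_noise)
print(noise_table)
```

A gene with about 10 expected reads carries roughly ten times the relative noise of a gene with about 1000 reads, which is why a change is easier to establish when the larger count sits in the shallowly sequenced sample.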
Simon

From smitra at liverpool.ac.uk Mon Jun 18 11:13:43 2012
From: smitra at liverpool.ac.uk (suparna mitra)
Date: Mon, 18 Jun 2012 10:13:43 +0100
Subject: [BioC] package ‘hugene10stv1.r3cdf’ or ‘hugene10stv1’ is not available
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From Guido.Hooiveld at wur.nl Mon Jun 18 11:46:28 2012
From: Guido.Hooiveld at wur.nl (Hooiveld, Guido)
Date: Mon, 18 Jun 2012 09:46:28 +0000
Subject: [BioC] package ‘hugene10stv1.r3cdf’ or ‘hugene10stv1’ is not available
In-Reply-To:
References:
Message-ID:

Hi,

First of all, please be aware that for the Gene ST arrays 'unofficial' CDFs are provided. That is, although Affymetrix released a CDF file for these arrays, these should be considered experimental. The preferred way of analysing these arrays is through the packages 'oligo' or 'xps'. They make use of all files provided by Affymetrix (pgf, clf, annotation CSV, etc). Please see the respective vignettes for more details.

Having said this, for the analysis you are currently performing you need:
http://www.bioconductor.org/packages/2.10/data/annotation/html/hugene10stv1cdf.html

HTH,
Guido

---------------------------------------------------------
Guido Hooiveld, PhD
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
Wageningen University
Biotechnion, Bomenweg 2
NL-6703 HD Wageningen
the Netherlands
tel: (+)31 317 485788
fax: (+)31 317 483342
email: guido.hooiveld at wur.nl
internet: http://nutrigene.4t.com
http://scholar.google.com/citations?user=qFHaMnoAAAAJ
http://www.researcherid.com/rid/F-4912-2010

-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of suparna mitra
Sent: Monday, June 18, 2012 11:14
To: bioconductor at r-project.org
Subject: [BioC] package ‘hugene10stv1.r3cdf’ or ‘hugene10stv1’
is not available Hi members, I'm trying for quite some time to get an analysis started for affy microarray files which has HuGene-1_0-st-v1. justrma step worked. > eset_justrma ExpressionSet (storageMode: lockedEnvironment) assayData: 32321 features, 18 samples element names: exprs, se.exprs protocolData sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ... MC9_(HuGene-1_0-st-v1).CEL (18 total) varLabels: ScanDate varMetadata: labelDescription phenoData sampleNames: MC1_(HuGene-1_0-st-v1).CEL MC10_(HuGene-1_0-st-v1).CEL ... MC9_(HuGene-1_0-st-v1).CEL (18 total) varLabels: sample varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: hugene10stv1 But I am not able to do the annotation. I tried many possible options. I keep getting this Error. > biocLite("hugene10stv1.r3cdf") Using R version 2.12.2, biocinstall version 2.7.7. Installing Bioconductor version 2.7 packages: [1] "hugene10stv1.r3cdf" Please wait... Warning message: In getDependencies(pkgs, dependencies, available, lib) : package ?hugene10stv1.r3cdf? is not available > library("annotate") > library("hugene10stv1") Error in library("hugene10stv1") : there is no package called 'hugene10stv1' > library("hugene10stv1.db") Error in library("hugene10stv1.db") : there is no package called 'hugene10stv1.db' > biocLite("hugene10stv1.r3cdf", type = "source") Using R version 2.12.2, biocinstall version 2.7.7. Installing Bioconductor version 2.7 packages: [1] "hugene10stv1.r3cdf" Please wait... Warning message: In getDependencies(pkgs, dependencies, available, lib) : package ?hugene10stv1.r3cdf? is not available Can anybody please help. Thanks a lot in advance. Best wishes, Suparna. 
[[alternative HTML version deleted]] From patel.rimple at yahoo.com Mon Jun 18 12:41:00 2012 From: patel.rimple at yahoo.com (Rimple Patel) Date: Mon, 18 Jun 2012 03:41:00 -0700 (PDT) Subject: [BioC] (no subject) Message-ID: <1340016060.97974.YahooMailNeo@web45704.mail.sp1.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From karthikuttan at gmail.com Mon Jun 18 13:09:16 2012 From: karthikuttan at gmail.com (Karthik K N) Date: Mon, 18 Jun 2012 16:39:16 +0530 Subject: [BioC] Annotation Database for Agilent 8x60K Human Gene Expression Arrays Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From tfrayner at gmail.com Mon Jun 18 14:01:12 2012 From: tfrayner at gmail.com (Tim Rayner) Date: Mon, 18 Jun 2012 13:01:12 +0100 Subject: [BioC] RE : Error in intgroup of arrayQualityMetrics package In-Reply-To: <20120615063016.7AB89134FEF@mamba.fhcrc.org> References: <20120615063016.7AB89134FEF@mamba.fhcrc.org> Message-ID: Hi Sonal, You could try rearranging pData(eset) so that the "Tissue" column is the first column, or within the first few columns. There's some preprocessing code in the arrayQualityMetrics:::cleanUpPhenoData function which limits the number of columns which will be carried forward into the QC (maxcol=10). Also, the contents of the "Tissue" column must not be either all the same or all different (a quite reasonable requirement). Cheers, Tim -- Tim Rayner Bioinformatician Smith Lab, CIMR University of Cambridge On 15 June 2012 07:30, Sonal Bakiwala [guest] wrote: > > I am using arraQualityMetrics package installed from Bioconductor site and R version that I am using is 2.15.0 > > The input for the function was eset and for the intgroup argument character vector "Tissue". There is a > column named Tissue in my phenoData of the eset. > > But it still gives me an error saying the elements of intgroup do not match the column names of the pData(eset). 
> I don't know what wrong I am doing. > > The error look like this : > > Error in prepData(expressionset,intgroup=intgroup): > all elements of 'intgroup' should match column names of pData(expressionset) > > > > ?-- output of sessionInfo(): > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C > ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 > ?[5] LC_MONETARY=en_US.UTF-8 ? ?LC_MESSAGES=en_US.UTF-8 > ?[7] LC_PAPER=C ? ? ? ? ? ? ? ? LC_NAME=C > ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base > > other attached packages: > [1] BiocInstaller_1.4.6 ? ? ? ?arrayQualityMetrics_3.12.0 > [3] affy_1.34.0 ? ? ? ? ? ? ? ?limma_3.12.1 > [5] Biobase_2.16.0 ? ? ? ? ? ? BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > ?[1] affyio_1.24.0 ? ? ? ? affyPLM_1.32.0 ? ? ? ?annotate_1.34.0 > ?[4] AnnotationDbi_1.18.1 ?beadarray_2.6.0 ? ? ? BeadDataPackR_1.8.0 > ?[7] Biostrings_2.24.1 ? ? Cairo_1.5-1 ? ? ? ? ? cluster_1.14.2 > [10] colorspace_1.1-1 ? ? ?DBI_0.2-5 ? ? ? ? ? ? genefilter_1.38.0 > [13] grid_2.15.0 ? ? ? ? ? Hmisc_3.9-3 ? ? ? ? ? hwriter_1.3 > [16] IRanges_1.14.3 ? ? ? ?lattice_0.20-6 ? ? ? ?latticeExtra_0.6-19 > [19] plyr_1.7.1 ? ? ? ? ? ?preprocessCore_1.18.0 RColorBrewer_1.0-5 > [22] reshape2_1.2.1 ? ? ? ?RSQLite_0.11.1 ? ? ? ?setRNG_2011.11-2 > [25] splines_2.15.0 ? ? ? ?stats4_2.15.0 ? ? ? ? stringr_0.6 > [28] survival_2.36-12 ? ? ?SVGAnnotation_0.9-0 ? tools_2.15.0 > [31] vsn_3.24.0 ? ? ? ? ? ?XML_3.9-4 ? ? ? ? ? ? xtable_1.7-0 > [34] zlibbioc_1.2.0 >> intgroup > [1] "Tissue" >> str(intgroup) > ?chr "Tissue" > > Sorry I wont be able to provide you with the detailed information of the pData. 
> But the colnames(pData(eset)) has one of columns named as "Tissue" and the class of the this column is factor. > > Thank you. > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From YWETZEL at its.jnj.com Mon Jun 18 15:51:34 2012 From: YWETZEL at its.jnj.com (Wetzels, Yves [JRDBE Extern]) Date: Mon, 18 Jun 2012 15:51:34 +0200 Subject: [BioC] affyPara - rmaPara fails using cdfname Message-ID: <86496AE74EFEB44A8FFA0AF866AC21B3240155@JNJBEBEGMS06.eu.jnj.com> Dear I am investigating whether affyPara can be used to analyze a large number of Microarray data. As a test case I have 20 CEL files. This works just fine running ... expressionSetRma<- rmaPara(files) ... If I want to use the "cdfname" parameter however ... expressionSetRma<- rmaPara(files, cdfname="hgu133plus2hsentrezg", verbose=TRUE) ... I receive following error: Error in dimnames(eset_mat) <- list(ids, samples.names) : length of 'dimnames' [1] not equal to array extent Calls: rmaPara -> preproPara -> .doSummarizationPara In addition: Warning messages: 1: In is.na(xel) : is.na() applied to non-(list or vector) of type 'S4' 2: In is.na(xel) : is.na() applied to non-(list or vector) of type 'S4' Execution halted I saw a thread http://answerpot.com/showthread.php?1408276-affyPara mentioning a bug in the .initAffyBatchSF function date 21/10/2010. Might this be the same bug ? Many thanks for your help and/or ideas. Kind Regards Yves Wetzels Contractor on behalf of Janssen Turnhoutseweg 30 B-2340-Beerse, Belgium Below you`ll find the logfile/environment settings for both test runs. 
****************************************************************** Log for expressionSetRma<- rmaPara(files) => OK ****************************************************************** ubuntu at ip-10-239-95-215:~/test$ cat runAffyPara.R.log.withoutcdf R version 2.15.0 (2012-03-30) Copyright (C) 2012 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > library(affyPara) Loading required package: affy Loading required package: BiocGenerics Attaching package: ?BiocGenerics? The following object(s) are masked from ?package:stats?: xtabs The following object(s) are masked from ?package:base?: anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, setdiff, table, tapply, union, unique Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Loading required package: snow Loading required package: vsn Loading required package: aplpack Loading required package: tcltk Loading Tcl/Tk interface ... done Attaching package: ?affyPara? 
The following object(s) are masked from ?package:snow?: makeCluster, stopCluster Warning message: In fun(libname, pkgname) : no DISPLAY variable so Tk is not available > library(hgu133plus2hsentrezgcdf) > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] tcltk stats graphics grDevices utils datasets methods [8] base other attached packages: [1] hgu133plus2hsentrezgcdf_12.1.0 affyPara_1.16.0 [3] aplpack_1.2.6 vsn_3.24.0 [5] snow_0.3-9 affy_1.34.0 [7] Biobase_2.16.0 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] affyio_1.24.0 BiocInstaller_1.4.6 grid_2.15.0 [4] lattice_0.20-6 limma_3.12.1 preprocessCore_1.18.0 [7] tools_2.15.0 zlibbioc_1.2.0 > path <- "/home/ubuntu/test/cel_files" > makeCluster(2) > files <- list.celfiles(path, full.names=TRUE) > files [1] "/home/ubuntu/test/cel_files/GSM686287.CEL" [2] "/home/ubuntu/test/cel_files/GSM686289.CEL" [3] "/home/ubuntu/test/cel_files/GSM686290.CEL" [4] "/home/ubuntu/test/cel_files/GSM686291.CEL" [5] "/home/ubuntu/test/cel_files/GSM686298.CEL" [6] "/home/ubuntu/test/cel_files/GSM686300.CEL" [7] "/home/ubuntu/test/cel_files/GSM686301.CEL" [8] "/home/ubuntu/test/cel_files/GSM686303.CEL" [9] "/home/ubuntu/test/cel_files/GSM686304.CEL" [10] "/home/ubuntu/test/cel_files/GSM686305.CEL" [11] "/home/ubuntu/test/cel_files/GSM686310.CEL" [12] "/home/ubuntu/test/cel_files/GSM686311.CEL" [13] "/home/ubuntu/test/cel_files/GSM686314.CEL" [14] "/home/ubuntu/test/cel_files/GSM686316.CEL" [15] "/home/ubuntu/test/cel_files/GSM686319.CEL" [16] "/home/ubuntu/test/cel_files/GSM686320.CEL" [17] "/home/ubuntu/test/cel_files/GSM686322.CEL" [18] "/home/ubuntu/test/cel_files/GSM686323.CEL" [19] 
"/home/ubuntu/test/cel_files/GSM686324.CEL" [20] "/home/ubuntu/test/cel_files/GSM686325.CEL" > expressionSetRma<- rmaPara(files) Loading required package: AnnotationDbi Attaching package: ?hgu133plus2cdf? The following object(s) are masked from ?package:hgu133plus2hsentrezgcdf?: i2xy, xy2i Warning messages: 1: In is.na(xel) : is.na() applied to non-(list or vector) of type 'S4' 2: In is.na(xel) : is.na() applied to non-(list or vector) of type 'S4' > stopCluster() > write.exprs(expressionSetRma,file="/home/ubuntu/test/expressionSetRma.Rda") > ****************************************************************** Log for expressionSetRma<- rmaPara(files) => ERROR ****************************************************************** ubuntu at ip-10-239-95-215:~/test$ more runAffyPara.R.log.withcdf R version 2.15.0 (2012-03-30) Copyright (C) 2012 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > library(affyPara) Loading required package: affy Loading required package: BiocGenerics Attaching package: ?BiocGenerics? 
The following object(s) are masked from ?package:stats?: xtabs The following object(s) are masked from ?package:base?: anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, setdiff, table, tapply, union, unique Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Loading required package: snow Loading required package: vsn Loading required package: aplpack Loading required package: tcltk Loading Tcl/Tk interface ... done Attaching package: ?affyPara? The following object(s) are masked from ?package:snow?: makeCluster, stopCluster Warning message: In fun(libname, pkgname) : no DISPLAY variable so Tk is not available > library(hgu133plus2hsentrezgcdf) > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] tcltk stats graphics grDevices utils datasets methods [8] base other attached packages: [1] hgu133plus2hsentrezgcdf_12.1.0 affyPara_1.16.0 [3] aplpack_1.2.6 vsn_3.24.0 [5] snow_0.3-9 affy_1.34.0 [7] Biobase_2.16.0 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] affyio_1.24.0 BiocInstaller_1.4.6 grid_2.15.0 [4] lattice_0.20-6 limma_3.12.1 preprocessCore_1.18.0 [7] tools_2.15.0 zlibbioc_1.2.0 > path <- "/home/ubuntu/test/cel_files" > makeCluster(2) > files <- list.celfiles(path, full.names=TRUE) > files [1] "/home/ubuntu/test/cel_files/GSM686287.CEL" [2] "/home/ubuntu/test/cel_files/GSM686289.CEL" [3] 
"/home/ubuntu/test/cel_files/GSM686290.CEL" [4] "/home/ubuntu/test/cel_files/GSM686291.CEL" [5] "/home/ubuntu/test/cel_files/GSM686298.CEL" [6] "/home/ubuntu/test/cel_files/GSM686300.CEL" [7] "/home/ubuntu/test/cel_files/GSM686301.CEL" [8] "/home/ubuntu/test/cel_files/GSM686303.CEL" [9] "/home/ubuntu/test/cel_files/GSM686304.CEL" [10] "/home/ubuntu/test/cel_files/GSM686305.CEL" [11] "/home/ubuntu/test/cel_files/GSM686310.CEL" [12] "/home/ubuntu/test/cel_files/GSM686311.CEL" [13] "/home/ubuntu/test/cel_files/GSM686314.CEL" [14] "/home/ubuntu/test/cel_files/GSM686316.CEL" [15] "/home/ubuntu/test/cel_files/GSM686319.CEL" [16] "/home/ubuntu/test/cel_files/GSM686320.CEL" [17] "/home/ubuntu/test/cel_files/GSM686322.CEL" [18] "/home/ubuntu/test/cel_files/GSM686323.CEL" [19] "/home/ubuntu/test/cel_files/GSM686324.CEL" [20] "/home/ubuntu/test/cel_files/GSM686325.CEL" > expressionSetRma<- rmaPara(files, cdfname="hgu133plus2hsentrezg", verbose=TRUE) Error in dimnames(eset_mat) <- list(ids, samples.names) : length of 'dimnames' [1] not equal to array extent Calls: rmaPara -> preproPara -> .doSummarizationPara In addition: Warning messages: 1: In is.na(xel) : is.na() applied to non-(list or vector) of type 'S4' 2: In is.na(xel) : is.na() applied to non-(list or vector) of type 'S4' Execution halted From smitra at liverpool.ac.uk Mon Jun 18 16:28:25 2012 From: smitra at liverpool.ac.uk (suparna mitra) Date: Mon, 18 Jun 2012 15:28:25 +0100 Subject: [BioC] =?windows-1252?q?package_=91hugene10stv1=2Er3cdf=92_or_=91?= =?windows-1252?q?hugene10stv1=92_is_not_available?= In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From guest at bioconductor.org Mon Jun 18 16:41:21 2012 From: guest at bioconductor.org (Laurie Irwin [guest]) Date: Mon, 18 Jun 2012 07:41:21 -0700 (PDT) Subject: [BioC] Opportunity to Build a Biostatistic/Bioinformatic Group Message-ID: <20120618144121.5179E133213@mamba.fhcrc.org> Interested in a new position? -- output of sessionInfo(): A dynamic early stage development company located in Salt Lake City Utah, is seeking to hire a Head of Biostatistics and Bioinformatics. The successful candidate will have experience in the discovery and validation of biomarkers by genomic and/or proteomic technologies. Strong technical expertise in the statistical analysis of large complex data sets using R/Bioconductor is required. An understanding of systems biology and pathway analysis will be important. If you are interested in learning more about this exciting opportunity please contact: Laurie Irwin VP with FPC of Cambridge (978) 535-9920x117 lirwin at fpccambridge.com -- Sent via the guest posting facility at bioconductor.org. From drordh at gmail.com Mon Jun 18 17:17:50 2012 From: drordh at gmail.com (Dror Hibsh) Date: Mon, 18 Jun 2012 18:17:50 +0300 Subject: [BioC] down-expression and high-expression in single cell + amplification In-Reply-To: References: <20120617162330.C0B3112F963@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From MEC at stowers.org Mon Jun 18 18:07:54 2012 From: MEC at stowers.org (Cook, Malcolm) Date: Mon, 18 Jun 2012 11:07:54 -0500 Subject: [BioC] Getting the start and end positions of a list of genes In-Reply-To: Message-ID: Hi, I'll get you a step further: On 6/17/12 5:57 PM, "Vincent Carey" wrote: >good spec, but i can't get through the whole thing just now. 
this could >get you started > >source("http://bioconductor.org/biocLite.R") > biocLite("TxDb.Athaliana.BioMart.plantsmart12") >library(TxDb.Athaliana.BioMart.plantsmart12) >txdb = TxDb.Athaliana.BioMart.plantsmart12 >tr = transcriptsBy(txdb, by="gene") # assuming that for each gene's coordinate, you want the extreme starts and ends of its (potentially multiple) transcripts: gene.gr <- reduce(tr) # ISA GenomicRange gene.df<-as(gene.gr,'data.frame') # whose names are the gene identifiers Now its a matter of coercing column names, and selecting from the BioMart data just the rows for your identifiers (and checking they are all there, and complaining if not). Cheers, Malcolm Cook > >> tr >GRangesList of length 33602: >$AT1G01010 >GRanges with 1 range and 2 elementMetadata cols: > seqnames ranges strand | tx_id tx_name > | > [1] 1 [3631, 5899] + | 9694 AT1G01010.1 > >$AT1G01020 >GRanges with 2 ranges and 2 elementMetadata cols: > seqnames ranges strand | tx_id tx_name > [1] 1 [5928, 8737] - | 29355 AT1G01020.1 > [2] 1 [6790, 8737] - | 29354 AT1G01020.2 > >$AT1G01030 >GRanges with 1 range and 2 elementMetadata cols: > seqnames ranges strand | tx_id tx_name > [1] 1 [11649, 13714] - | 26358 AT1G01030.1 > >... ><33599 more elements> >--- >seqlengths: > 3 4 1 5 2 Pt Mt > NA NA NA NA NA NA NA > >you could use an org.At* package a bit more simply, use the CHRLOC and >CHRLOCEND >elements. please look at the metadata page of bioconductor.org >INSTALL node for your >organism. this should be a standard use case or faq, perhaps > > > >On Sun, Jun 17, 2012 at 6:33 PM, Josh [guest] >wrote: > >> >> Dear listserv, >> >> I am a long-time R user, novice Bioconductor user. I am quickly >>realizing >> they are not the same thing. I have a very basic question that I hope >>you >> can help me with. >> >> I have a list of genes in Arabidopsis thaliana. I want to input this >>list >> into R/Bioconductor and output a table listing the start and end >>positions >> of each gene. 
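
The step Malcolm describes (one start and one end per gene, taken from the outermost transcript coordinates, then restricted to the requested identifiers with a complaint about any that are missing) can be sketched in base R. This is only an illustration: the small transcript table is copied from the GRangesList printout above, and the helper gene_extent() is a made-up name, not part of any package.

```r
# Transcript coordinates as shown in the printout (all on chromosome 1).
tx <- data.frame(
  gene   = c("AT1G01010", "AT1G01020", "AT1G01020", "AT1G01030"),
  chr    = "1",
  start  = c(3631, 5928, 6790, 11649),
  end    = c(5899, 8737, 8737, 13714),
  stringsAsFactors = FALSE
)

# Hypothetical helper: outermost start and end per gene for the requested IDs.
gene_extent <- function(tx, ids) {
  ids <- toupper(ids)                     # so "At1g01020" matches "AT1G01020"
  missing_ids <- setdiff(ids, tx$gene)
  if (length(missing_ids) > 0)
    warning("no coordinates found for: ", paste(missing_ids, collapse = ", "))
  hits   <- tx[tx$gene %in% ids, ]
  starts <- tapply(hits$start, hits$gene, min)   # smallest start over transcripts
  ends   <- tapply(hits$end,   hits$gene, max)   # largest end over transcripts
  data.frame(gene               = names(starts),
             chromosome         = hits$chr[match(names(starts), hits$gene)],
             nuc_sequence_start = as.vector(starts),
             nuc_sequence_end   = as.vector(ends),
             stringsAsFactors   = FALSE)
}

gene.pos <- gene_extent(tx, c("At1g01020", "AT1G01030"))
print(gene.pos)
```

The output columns follow the names used in the gene.pos data frame from the original question. With real data, the tx table would be built from the transcript ranges rather than typed in by hand.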
>> >> Specific code that will get the job done will be the most helpful for >>me. >> Also, please tell me the specific packages and databases and such I must >> load into memory. I am a total newbie at this. >> >> Thanks in advance, >> ----------------------------------- >> Josh Banta, Ph.D >> Assistant Professor >> Department of Biology >> The University of Texas at Tyler >> Tyler, TX 75799 >> Tel: (903) 565-5655 >> http://plantevolutionaryecology.org >> >> -- output of sessionInfo(): >> >> > gene.pos <- data.frame(matrix(nrow = 3, ncol = 4)) >> > gene.list <- c("At5g35790", "AT5g60910", "AT1g16560") >> > gene.pos[,1] <- gene.list >> > colnames(gene.pos) <- c("gene", "chromosome", "nuc_sequence_start" , >> "nuc_sequence_end") >> > >> > gene.pos >> gene chromosome nuc_sequence_start nuc_sequence_end >> 1 At5g35790 NA NA NA >> 2 AT5g60910 NA NA NA >> 3 AT1g16560 NA NA NA >> > >> > #now what? How do I fill in the blanks? >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor From mgarciao at ufl.edu Mon Jun 18 18:46:14 2012 From: mgarciao at ufl.edu (Garcia Orellana,Miriam) Date: Mon, 18 Jun 2012 16:46:14 +0000 Subject: [BioC] Best package or code to filter Affymetrix probes by present calls?? Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu> An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From mtmorgan at fhcrc.org Mon Jun 18 21:39:19 2012
From: mtmorgan at fhcrc.org (Martin Morgan)
Date: Mon, 18 Jun 2012 12:39:19 -0700
Subject: [BioC] nearest() for GRanges
In-Reply-To:
References:
Message-ID: <4FDF83E7.40505@fhcrc.org>

Hi Oleg --

On 06/17/2012 11:11 PM, Oleg Mayba wrote:
> Hi,
>
> I just noticed that a piece of logic I was relying on with GRanges before
> does not seem to work anymore. Namely, I expect the behavior of nearest()
> with a single GRanges object as an argument to be the same as that of
> IRanges (example below), but it's not anymore. I expect nearest(GR1) NOT
> to behave trivially but to return the closest range OTHER than the range
> itself. I could swear that was the case before, but isn't any longer:

I think you're right that there is an inconsistency here; Val will likely help clarify in a day or so.

My two cents...

I think, certainly, that GRanges on a single chromosome on the "+" strand should behave like an IRanges

    library(GenomicRanges)
    library(RUnit)

    r <- IRanges(c(1,5,10), c(2,7,12))
    g <- GRanges("chr1", r, "+")

    ## first two ok, third should work but fails
    checkEquals(precede(r), precede(g))
    checkEquals(follow(r), follow(g))
    try(checkEquals(nearest(r), nearest(g)))

Also, on the "-" strand I think we're expecting

    g <- GRanges("chr1", r, "-")

    ## first two ok, third should work but fails
    checkEquals(follow(r), precede(g))
    checkEquals(precede(r), follow(g))
    try(checkEquals(nearest(r), nearest(g)))

For "*" (which was your example) I think the situation is (a) different in devel than in release; and (b) not so clear. In devel, "*" is (from method?"nearest,GenomicRanges,missing") "x on '*' strand can match to ranges on any of ''+'', ''-'' or ''*''" and in particular I think these are always true:

    checkEquals(precede(g), follow(g))
    checkEquals(nearest(r), follow(g))

I would also expect

    try(checkEquals(nearest(g), follow(g)))

though this seems not to be the case.
In 'release', "*" is coerced and behaves as if on the "+" strand (I think).

Martin

>
>> z=IRanges(start=c(1,5,10), end=c(2,7,12))
>> z
> IRanges of length 3
>     start end width
> [1]     1   2     2
> [2]     5   7     3
> [3]    10  12     3
>> nearest(z)
> [1] 2 1 2
>>
>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10),
> end=c(2,7,12)))
>> z
> GRanges with 3 ranges and 0 elementMetadata cols:
>       seqnames    ranges strand
>
>   [1]     chr1  [ 1,  2]      *
>   [2]     chr1  [ 5,  7]      *
>   [3]     chr1  [10, 12]      *
>   ---
>   seqlengths:
>    chr1
>      NA
>> nearest(z)
> [1] 1 2 3
>>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] datasets  utils     grDevices graphics  stats     methods   base
>
> other attached packages:
> [1] GenomicRanges_1.8.6 IRanges_1.14.3      BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.0
>>
>
>
>
> I want the IRanges behavior and not what seems currently to be the GRanges
> behavior, since I have some code that depends on it. Is there a quick way
> to make nearest() do that for me again?
>
> Thanks!
>
> Oleg.
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 From mgarciao at ufl.edu Mon Jun 18 21:40:50 2012 From: mgarciao at ufl.edu (Garcia Orellana,Miriam) Date: Mon, 18 Jun 2012 19:40:50 +0000 Subject: [BioC] Best package or code to filter Affymetrix probes by present calls?? In-Reply-To: <980E6D85-9979-42E1-B989-6613786EB5BF@bigelow.org> References: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu>, <980E6D85-9979-42E1-B989-6613786EB5BF@bigelow.org> Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F9AA42@UFEXCH-MBXN01.ad.ufl.edu> Ben: I followed the steps you gave and the working directory was well set but I still getting the same problem, next is the detail of what I got. Please anyone having other idea of this wrong warning or other approach to filter my data. Thanks. > library(simpleaffy) > library(gcrma) > getwd() [1] "C:/Users/miriam/Documents/1studytemp/RESULTS/liver2008 gene/CEL_files" > dir() [1] "4367.CEL" "4368.CEL" "4381.CEL" "4384.CEL" [5] "4387.CEL" "4388.CEL" "4394.CEL" "4395.CEL" [9] "4396.CEL" "4398.CEL" "4399.CEL" "4400.CEL" [13] "4402.CEL" "4404.CEL" "4409.CEL" "4410.CEL" [17] "4413.CEL" "4429.CEL" "affymetrix_gcrma.txt" "covdesc.prn" [21] "DD_CD.txt" "eset.gcrma.Rdata" > raw.data <- ReadAffy() > gcrma.eset <- call.exprs(raw.data, "gcrma") Adjusting for optical effect..................Done. Computing affinities.Done. Adjusting for non-specific binding..................Done. 
Normalizing Calculating Expression > raw.data <- read.affy() ##read data in working directory Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open file './covdesc': No such file or directory > raw.data<- read.affy("covdesc") Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open file './covdesc': No such file or directory ******************************** Miriam Garcia, MS PhD candidate Department of Animal Sciences University of Florida ________________________________________ From: Ben Tupper [btupper at bigelow.org] Sent: Monday, June 18, 2012 2:30 PM To: Garcia Orellana,Miriam Subject: Re: [BioC] Best package or code to filter Affymetrix probes by present calls?? Hi, On Jun 18, 2012, at 12:46 PM, Garcia Orellana,Miriam wrote: Dear R users: First thank to all users for their direct or indirect support with previous question. Now. I am rephrasing this question since I did not get any help the last 3 days. I am having hard time to analyze my microarray data, since the use of R environment is a new world for me. I have 18 affymetrix bovine arrays from liver samples of 30d old calves that born from cows fed 3 types of prepartum dam diets (factor DD, 6 arrays per DD) and were fed just milk replacer the first 30d of life ( factor MR, 9 array per MR). Biologically I will expect that the main factor driving any difference will be the MR rather than the DD (unless some imprinting genes are expressed). So I have the idea to filter non expressed genes using the simpleaffy package using the manual but I don't know what is wrong when I try to load the covdesc file I got error. I have a folder in my directory that contains all 18 CEL files and also the covdesc (extension .prn - is this the right one?). 
Since I was able to run the gcrma normalization so the working directory maybe well set, what I got is the next when using the option read.affy to read the covdesc file. > raw.data <- ReadAffy() > gcrma.eset <- call.exprs(raw.data, "gcrma") Loading required package: AnnotationDbi Adjusting for optical effect..................Done. Computing affinities.Done. Adjusting for non-specific binding..................Done. Normalizing Calculating Expression > raw.data <- read.affy() ##read data in working directory Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open file './covdesc': No such file or directory > raw.data<- read.affy("covdesc") Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : cannot open file './covdesc': No such file or directory I would really appreciate if you can suggest me any simple method to filter my genes ( I want to keep probes that are present in at least 4 of the 9 arrays in at least one of the MR groups, or do you think I should consider the interaction prepartum diet * milk replacer (then 3 arrays per interaction group and try to have at least 2 present genes in at least 1 of the 6 interactions) Thanks in advance for any help. Miriam Try to confirm that the current working directory is where you think it is... > getwd() If not then you'll need to use setwd() to set the correct directory. If your R session 'resides' in your desired directory, then check that the contents of the directory include what read.affy() expects... > dir() Cheers, Ben Ben Tupper Bigelow Laboratory for Ocean Sciences 180 McKown Point Rd. P.O. 
Box 475 West Boothbay Harbor, Maine 04575-0475 http://www.bigelow.org From sorokin at wisc.edu Mon Jun 18 22:02:40 2012 From: sorokin at wisc.edu (Elena Sorokin) Date: Mon, 18 Jun 2012 15:02:40 -0500 Subject: [BioC] interpreting DEXSeq output Message-ID: Hello, How should we be interpreting output from DEXSeq in which some geneIDs within the DEU results table are denoted by multiple genes separated by + signs? I can send examples of what I mean to the developers, if my question is unclear. Especially when the architecture of the two or even three genes is quite different, this type of output perplexes me. Sorry if my post was answered elsewhere! Best wishes, Elena From jmacdon at uw.edu Mon Jun 18 22:02:48 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Mon, 18 Jun 2012 16:02:48 -0400 Subject: [BioC] Best package or code to filter Affymetrix probes by present calls?? In-Reply-To: <7F10E9EDBB347E4CA0765A3139C110BB14F9AA42@UFEXCH-MBXN01.ad.ufl.edu> References: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu>, <980E6D85-9979-42E1-B989-6613786EB5BF@bigelow.org> <7F10E9EDBB347E4CA0765A3139C110BB14F9AA42@UFEXCH-MBXN01.ad.ufl.edu> Message-ID: <4FDF8968.7010005@uw.edu> Hi Miriam, On 6/18/2012 3:40 PM, Garcia Orellana,Miriam wrote: > Ben: > I followed the steps you gave and the working directory was well set but I still getting the same problem, next is the detail of what I got. Please anyone having other idea of this wrong warning or other approach to filter my data. Thanks. 
> >> library(simpleaffy) >> library(gcrma) >> getwd() > [1] "C:/Users/miriam/Documents/1studytemp/RESULTS/liver2008 gene/CEL_files" >> dir() > [1] "4367.CEL" "4368.CEL" "4381.CEL" "4384.CEL" > [5] "4387.CEL" "4388.CEL" "4394.CEL" "4395.CEL" > [9] "4396.CEL" "4398.CEL" "4399.CEL" "4400.CEL" > [13] "4402.CEL" "4404.CEL" "4409.CEL" "4410.CEL" > [17] "4413.CEL" "4429.CEL" "affymetrix_gcrma.txt" "covdesc.prn" > [21] "DD_CD.txt" "eset.gcrma.Rdata" >> raw.data<- ReadAffy() >> gcrma.eset<- call.exprs(raw.data, "gcrma") > Adjusting for optical effect..................Done. > Computing affinities.Done. > Adjusting for non-specific binding..................Done. > Normalizing > Calculating Expression >> raw.data<- read.affy() ##read data in working directory You already read the data in, using ReadAffy() above. There is no reason to do this again. > Error in file(file, "rt") : cannot open the connection > In addition: Warning message: > In file(file, "rt") : > cannot open file './covdesc': No such file or directory >> raw.data<- read.affy("covdesc") > Error in file(file, "rt") : cannot open the connection > In addition: Warning message: > In file(file, "rt") : > cannot open file './covdesc': No such file or directory The error here is pretty clear; there isn't a file called 'covdesc' in your working directory. As for filtering your data, have a look at the genefilter package. Best, Jim > > > > > ******************************** > Miriam Garcia, MS > PhD candidate > Department of Animal Sciences > University of Florida > > ________________________________________ > From: Ben Tupper [btupper at bigelow.org] > Sent: Monday, June 18, 2012 2:30 PM > To: Garcia Orellana,Miriam > Subject: Re: [BioC] Best package or code to filter Affymetrix probes by present calls?? > > Hi, > > On Jun 18, 2012, at 12:46 PM, Garcia Orellana,Miriam wrote: > Dear R users: > First thank to all users for their direct or indirect support with previous question. Now. 
I am rephrasing this question since I did not get any help the last 3 days. I am having hard time to analyze my microarray data, since the use of R environment is a new world for me. > I have 18 affymetrix bovine arrays from liver samples of 30d old calves that born from cows fed 3 types of prepartum dam diets (factor DD, 6 arrays per DD) and were fed just milk replacer the first 30d of life ( factor MR, 9 array per MR). Biologically I will expect that the main factor driving any difference will be the MR rather than the DD (unless some imprinting genes are expressed). > So I have the idea to filter non expressed genes using the simpleaffy package using the manual but I don't know what is wrong when I try to load the covdesc file I got error. > I have a folder in my directory that contains all 18 CEL files and also the covdesc (extension .prn - is this the right one?). Since I was able to run the gcrma normalization so the working directory maybe well set, what I got is the next when using the option read.affy to read the covdesc file. > >> raw.data<- ReadAffy() >> gcrma.eset<- call.exprs(raw.data, "gcrma") > Loading required package: AnnotationDbi > > Adjusting for optical effect..................Done. > Computing affinities.Done. > Adjusting for non-specific binding..................Done. 
> Normalizing > Calculating Expression >> raw.data<- read.affy() ##read data in working directory > Error in file(file, "rt") : cannot open the connection > In addition: Warning message: > In file(file, "rt") : > cannot open file './covdesc': No such file or directory >> raw.data<- read.affy("covdesc") > Error in file(file, "rt") : cannot open the connection > In addition: Warning message: > In file(file, "rt") : > cannot open file './covdesc': No such file or directory > > I would really appreciate if you can suggest me any simple method to filter my genes ( I want to keep probes that are present in at least 4 of the 9 arrays in at least one of the MR groups, or do you think I should consider the interaction prepartum diet * milk replacer (then 3 arrays per interaction group and try to have at least 2 present genes in at least 1 of the 6 interactions) > Thanks in advance for any help. > Miriam > > > Try to confirm that the current working directory is where you think it is... > >> getwd() > > If not then you'll need to use setwd() to set the correct directory. > > If your R session 'resides' in your desired directory, then check that the contents of the directory include what read.affy() expects... > >> dir() > Cheers, > Ben > > > Ben Tupper > Bigelow Laboratory for Ocean Sciences > 180 McKown Point Rd. P.O. Box 475 > West Boothbay Harbor, Maine 04575-0475 > http://www.bigelow.org > > > > > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. 
Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From MEC at stowers.org Mon Jun 18 23:25:18 2012 From: MEC at stowers.org (Cook, Malcolm) Date: Mon, 18 Jun 2012 16:25:18 -0500 Subject: [BioC] nearest() for GRanges In-Reply-To: <4FDF83E7.40505@fhcrc.org> Message-ID: Martin, Oleg, Val, all, I too have script logic that benefitted from and depends upon what the behavior of nearest,GenomicRanges,missing as reported by Oleg. Thanks for the unit tests Martin. If it helps in sleuthing, in my hands, the 3rd test used to pass (if my memory serves), but does not pass now, as the attached transcript shows. Hoping it helps find a speedy resolution to this issue, ~ Malcolm Cook > r <- IRanges(c(1,5,10), c(2,7,12)) > g <- GRanges("chr1", r, "+") > checkEquals(precede(r), precede(g)) [1] TRUE > checkEquals(follow(r), follow(g)) [1] TRUE > try(checkEquals(nearest(r), nearest(g))) Error in checkEquals(nearest(r), nearest(g)) : Mean relative difference: 0.6 > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] tools splines parallel stats graphics grDevices utils datasets methods base other attached packages: [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 bit_1.1-8 evd_2.2-6 ipred_0.8-13 prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 ada_2.0-2 rpart_3.1-53 e1071_1.6 class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 svUnit_0.7-10 doParallel_1.0.1 iterators_1.0.6 foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 RSQLite.extfuns_0.0.1 RSQLite_0.11.1 
chron_2.3-42 gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 functional_0.1 reshape_0.8.4 plyr_1.7.1 [55] stringr_0.6 gtools_2.6.2 loaded via a namespace (and not attached): [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 digest_0.5.2 grid_2.15.0 labeling_0.1 memoise_0.1 munsell_0.3 reshape2_1.2.1 scales_0.2.1 stats4_2.15.0 tcltk_2.15.0 zlibbioc_1.2.0 On 6/18/12 2:39 PM, "Martin Morgan" wrote: >Hi Oleg -- > >On 06/17/2012 11:11 PM, Oleg Mayba wrote: >> Hi, >> >> I just noticed that a piece of logic I was relying on with GRanges >>before >> does not seem to work anymore. Namely, I expect the behavior of >>nearest() >> with a single GRanges object as an argument to be the same as that of >> IRanges (example below), but it's not anymore. I expect nearest(GR1) >>NOT >> to behave trivially but to return the closest range OTHER than the range >> itself. I could swear that was the case before, but isn't any longer: > >I think you're right that there is an inconsistency here; Val will >likely help clarify in a day or so. My two cents... > >I think, certainly, that GRanges on a single chromosome on the "+" >strand should behave like an IRanges > > library(GenomicRanges) > library(RUnit) > > r <- IRanges(c(1,5,10), c(2,7,12)) > g <- GRanges("chr1", r, "+") > > ## first two ok, third should work but fails > checkEquals(precede(r), precede(g)) > checkEquals(follow(r), follow(g)) > try(checkEquals(nearest(r), nearest(g))) > >Also, on the "-" strand I think we're expecting > > g <- GRanges("chr1", r, "-") > > ## first two ok, third should work but fails > checkEquals(follow(r), precede(g)) > checkEquals(precede(r), follow(g)) > try(checkEquals(nearest(r), nearest(g))) > >For "*" (which was your example) I think the situation is (a) different >in devel than in release; and (b) not so clear. 
In devel, "*" is (from >method?"nearest,GenomicRanges,missing") "x on '*' strand can match to >ranges on any of ''+'', ''-'' or ''*''" and in particular I think these >are always true: > > checkEquals(precede(g), follow(g)) > checkEquals(nearest(r), follow(g)) > >I would also expect > > try(checkEquals(nearest(g), follow(g))) > >though this seems not to be the case. In 'release', "*" is coereced and >behaves as if on the "+" strand (I think). > >Martin > >> >>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>> z >> IRanges of length 3 >> start end width >> [1] 1 2 2 >> [2] 5 7 3 >> [3] 10 12 3 >>> nearest(z) >> [1] 2 1 2 >>> >>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >> end=c(2,7,12))) >>> z >> GRanges with 3 ranges and 0 elementMetadata cols: >> seqnames ranges strand >> >> [1] chr1 [ 1, 2] * >> [2] chr1 [ 5, 7] * >> [3] chr1 [10, 12] * >> --- >> seqlengths: >> chr1 >> NA >>> nearest(z) >> [1] 1 2 3 >>> >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] datasets utils grDevices graphics stats methods base >> >> other attached packages: >> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] stats4_2.15.0 >>> >> >> >> >> I want the IRanges behavior and not what seems currently to be the >>GRanges >> behavior, since I have some code that depends on it. Is there a quick >>way >> to make nearest() do that for me again? >> >> Thanks! >> >> Oleg. 
>> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >>http://news.gmane.org/gmane.science.biology.informatics.conductor > > >-- >Computational Biology / Fred Hutchinson Cancer Research Center >1100 Fairview Ave. N. >PO Box 19024 Seattle, WA 98109 > >Location: Arnold Building M1 B861 >Phone: (206) 667-2793 > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor From iaingallagher at btopenworld.com Mon Jun 18 23:47:40 2012 From: iaingallagher at btopenworld.com (Iain Gallagher) Date: Mon, 18 Jun 2012 22:47:40 +0100 (BST) Subject: [BioC] Best package or code to filter Affymetrix probes by present calls?? In-Reply-To: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu> References: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu> Message-ID: <1340056060.56022.YahooMailNeo@web87703.mail.ir2.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jspidlen at bccrc.ca Tue Jun 19 00:00:57 2012 From: jspidlen at bccrc.ca (Josef Spidlen) Date: Mon, 18 Jun 2012 15:00:57 -0700 Subject: [BioC] flowCore 1.22.0 broken for some FCS files In-Reply-To: References: Message-ID: <4FDFA519.9060004@bccrc.ca> Hi Mike, I agree that empty keyword values are illegal according to the FCS data file standard. Unfortunately, there are several vendors breaking this rule (e.g., CELLQuest/FACSCalibur, Partec, Applied Biosystems / Attune). Consequently, I agree with Kieran that it would be better if flowCore "closed one eye" and allowed reading of those files. 
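
A relaxed reading of the TEXT segment that tolerates empty keyword values could be sketched roughly as follows. This is only an illustration in plain R, not flowCore's actual parser: the function name `parse_text_relaxed` is hypothetical, and the sketch assumes that the first character of the segment is the delimiter and that keyword names never contain the delimiter. An empty value is recognised when the delimiter that opens a value is followed by exactly one further delimiter rather than a doubled (escaped) one.

```r
# Hypothetical helper, for illustration only (not part of flowCore).
# Assumptions: txt starts with the delimiter character, and keyword
# names never contain the delimiter.
parse_text_relaxed <- function(txt) {
  chars <- strsplit(txt, "", fixed = TRUE)[[1]]
  n <- length(chars)
  delim <- chars[1]
  keys <- character(0)
  vals <- character(0)
  i <- 2L                                  # skip the leading delimiter
  while (i <= n) {
    # Read the keyword name up to the next delimiter
    j <- i
    while (j <= n && chars[j] != delim) j <- j + 1L
    if (j > n || j == i) break             # no value follows, or empty name
    key <- paste(chars[i:(j - 1L)], collapse = "")
    k <- j + 1L                            # first character after the delimiter
    val <- character(0)
    if (k <= n && chars[k] == delim &&
        !(k < n && chars[k + 1L] == delim)) {
      # A single delimiter right away: treat as an empty value (relaxed rule)
      i <- k + 1L
    } else {
      # Normal value; a doubled delimiter stands for one literal delimiter
      while (k <= n) {
        if (chars[k] != delim) {
          val <- c(val, chars[k])
          k <- k + 1L
        } else if (k < n && chars[k + 1L] == delim) {
          val <- c(val, delim)
          k <- k + 2L
        } else {
          break                            # closing delimiter of this value
        }
      }
      i <- k + 1L
    }
    keys <- c(keys, key)
    vals <- c(vals, paste(val, collapse = ""))
  }
  setNames(vals, keys)
}

# Example segments using "|" as the delimiter
print(parse_text_relaxed("|$COM||$CYT|Partec PAS|"))
print(parse_text_relaxed("|$COM||| Delimiter starts my comment|"))
print(parse_text_relaxed("|$COM|My comment|"))
```

The relaxed rule is applied only at the start of a value; inside a value a doubled delimiter is still read as an escaped literal, so a strictly conforming file should parse the same way under both readings as long as the assumption about keyword names holds.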
Technically, I believe it can still be done while being able to distinguish whether the <delimiter> is an actual delimiter or part of the keyword value. When starting to read a keyword value, your parser could distinguish the following states. The stream with the keyword value, right after reading the initiating <delimiter>, starts with:

1) <delimiter><delimiter>, which means that the actual keyword value starts with <delimiter>.
For example: "|$COM||| Delimiter starts my comment|" (| is the <delimiter> in my examples)

2) <delimiter>x, where x is not a <delimiter>, which means that the vendor broke the standard and saved a keyword with an empty value.
For example: "|$COM||$CYT|Partec PAS|"
I know, this only works assuming that there are no keyword names that would include the <delimiter> as part of the name. I believe that this is a safe assumption after having seen many, many FCS files. In the example, this "relaxed" interpretation would mean that there are two keywords, "$COM" (empty value) and "$CYT" (value "Partec PAS"). A strict FCS-compatible implementation reads this as a single keyword named "$COM|$CYT" with a value of "Partec PAS".

3) x, where x is not a <delimiter>, which simply means that the keyword value starts with character x.
For example: "|$COM|My comment|"

It comes down to the question of whether it is good practice to read broken files, which essentially sends a message to vendors that it is OK to generate broken files. I hate that message, but in the end I think it is even more important to make users happy, which is why I would argue for changing flowCore and making it more relaxed as described. FlowJo and some other tools took this path, which is greatly appreciated by their users.

Best regards,
Josef

Btw.
A minor correction to Kieran's note from another email: I have been only involved in the FCS 3.1 revision but haven't been around in the 90s when the FCS 3.0 standard was developed :-) On 12-06-15 03:00 AM, bioconductor-request at r-project.org wrote: > Date: Thu, 14 Jun 2012 13:32:37 -0700 > From: "Jiang, Mike" > To: > Subject: Re: [BioC] [Bioc-devel] flowCore 1.22.0 broken for some FCS > files (which it previously read without errors) > Message-ID: > Content-Type: text/plain > > Kieran, > > I looked at your FCS, it has empty keyword value which does not conform to FCS 3.0 standard: > "3.2.9 Keywords and keyword values must have lengths greater than zero. "(http://murphylab.cbi.cmu.edu/FCSAPI/FCS3.html). > > Particularly, this occurs at $ENDSTEXT keyword-value pairs :"\\$ENDSTEXT\\\\$ETIM..." > Which is "byte offset to end of the supplemental TEXT segment" and really shouldn't be empty (normally it is put as "0") > > And "\\" is used as delimiter here, FCS 3.0 allows delimiter appears in the keyword value or keyword name as long as it is " immediately followed by a second delimiter". So the characters "\\\\" after "$ENDSTEXT" keyword is misunderstood as part of "$ETIM" by the parser here, which further messed up the parsing of subsequent string. That is why the parser is reporting error. > > Originally,flowCore did not handle this delimiter issue properly. It might read FCS successfully with the incorrect keyword values without notifying the user. Now,we thought it may be helpful to throw the error and let user know the issue with the TEXT segment of FCS. > > I have attached the TEXT Segment of your FCS file. > > Let me know if you have questions. 
> > Thanks, > Mike >> >From: Kieran O'Neill >> >Subject: [Bioc-devel] flowCore 1.22.0 broken for some FCS files (which it previously read without errors) >> >Date: June 13, 2012 3:53:17 PM PDT >> >To:bioc-devel at r-project.org >> >Hi all >> > >> >I just recently came back to a project I was previously working on, >> >and found that the most recent version of flowCore, 1.22.0, no longer >> >reads some of my FCS files (those generated by one instrument in >> >particular). >> > >> >The error it gives is: >> > >> >Error in fcs_text_parse(txt) : ERROR! no end found >> > >> >Previous versions of flowCore had no trouble reading these files, and >> >the current version seems to read most other FCS files I have from >> >other instruments. However, since parsing FCS files into something >> >usable in R is probably the most important functionality in the >> >package, having it broken is rather bad. >> > >> >It is also quite frustrating for me, in that no previous version of >> >flowCore works in the current version of R (2.15.0), so I would need >> >to downgrade the whole of R in order to downgrade to a working version >> >of flowCore to analyse these files. >> > >> >I would be happy to send a sample file for debugging if needed. >> > >> >Thanks, >> >Kieran >> > >> >_______________________________________________ >> >Bioc-devel at r-project.org mailing list >> >https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Josef Spidlen, Ph.D. Terry Fox Laboratory, BC Cancer Agency 675 West 10th Avenue, V5Z 1L3 Vancouver, BC, Canada Tel: +1 (604) 675-8000 x 7755 http://www.terryfoxlab.ca/people/rbrinkman/josef.aspx From alejandro.reyes at embl.de Tue Jun 19 10:29:26 2012 From: alejandro.reyes at embl.de (Alejandro Reyes) Date: Tue, 19 Jun 2012 10:29:26 +0200 Subject: [BioC] interpreting DEXSeq output In-Reply-To: References: Message-ID: <4FE03866.8050104@embl.de> Dear Elena, Thanks for your email! 
Multiple genes are merged into a single one when they share exons, because a shared exon cannot be assigned unambiguously to a single gene. You can see this in more detail by calling "plotDEXSeq" with the transcripts displayed. So far I have not seen this cause a big problem, but I can imagine one situation: if the merged genes are themselves differentially expressed, that would show up as differences in exon usage that are in reality differential expression... Is it introducing messy results for you?

Alejandro

> Hello,
>
> How should we be interpreting output from DEXSeq in which some geneIDs
> within the DEU results table are denoted by multiple genes separated
> by + signs? I can send examples of what I mean to the developers, if
> my question is unclear.
>
> Especially when the architecture of the two or even three genes is
> quite different, this type of output perplexes me. Sorry if my post
> was answered elsewhere!
>
> Best wishes,
> Elena
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

From okko at clevert.de Tue Jun 19 11:17:39 2012
From: okko at clevert.de (Djork-Arné Clevert)
Date: Tue, 19 Jun 2012 11:17:39 +0200
Subject: [BioC] Best package or code to filter Affymetrix probes by present calls??
In-Reply-To: <1340056060.56022.YahooMailNeo@web87703.mail.ir2.yahoo.com>
References: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu> <1340056060.56022.YahooMailNeo@web87703.mail.ir2.yahoo.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From muralidharanv89 at gmail.com Tue Jun 19 11:33:56 2012
From: muralidharanv89 at gmail.com (Muralidharan V)
Date: Tue, 19 Jun 2012 15:03:56 +0530
Subject: [BioC] C.V function in Agi4x44PreProcess
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From alessandro.brozzi at gmail.com Tue Jun 19 12:51:28 2012
From: alessandro.brozzi at gmail.com (alessandro brozzi)
Date: Tue, 19 Jun 2012 12:51:28 +0200
Subject: [BioC] merge the two cards in a single dataset
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From gbayon at gmail.com Tue Jun 19 12:57:12 2012
From: gbayon at gmail.com (Gustavo Fernández Bayón)
Date: Tue, 19 Jun 2012 12:57:12 +0200
Subject: [BioC] Newbie methylation and stats question
Message-ID: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>

Hi everybody.

As a newbie to bioinformatics, I often find it difficult to see how biological knowledge mixes with statistics. I come from the Machine Learning field, and usually have problems with the naming conventions (well, among several other things, I must admit). Besides, I am not an expert in statistics, having used only the bare minimum needed to validate my work.

Well, let's try to be more precise. One of the topics I am working on most right now is the analysis of methylation array data. As you surely know, the final processed (and normalized) beta values are presented in a p x n matrix, where there are p different probes and n different samples or individuals from which we have obtained the beta values. I am not currently working with the raw data.

Imagine, for a moment, that we have identified two regions of probes, A and B, with a group of nA probes belonging to A, another group (of nB probes) that belongs to B, and the intersection is empty.
Say that we want to find a way to show there is a statistically significant difference between the methylation values of the two regions. As far as I have seen in the literature, comparisons (statistical tests) are always done by comparing the values of the same probe between case and control groups of individuals or samples, for example when we are trying to find differentially methylated probes. However, if I think of directly comparing all the beta values from region A (nA * n values) against the ones in region B (nB * n values) with, say, a t-test, I get the suspicion that something is not being done the way it should. My knowledge of Biology and Statistics is still limited and I cannot explain why, but I have the feeling that there is something formally wrong in this approach. Am I right?

What I have done in similar experiments has been to find differentially methylated probes, and then test the proportion of such probes relative to the total number of probes, so that I could assign a p-value to show that the region of reference had a significant influence.

Several questions here: what would be a coherent approach to the regions A and B problem stated above? Is there any problem with methylation data that I am not aware of and that makes only the within-probe analysis valid? Are there any bibliographic references that could help me see the subtleties involved?

As you can see, the concepts are quite tangled in my mind, so any help would be greatly appreciated.
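
To make the second strategy concrete (per-probe tests first, then a test on the proportion of flagged probes per region), here is a minimal sketch in R. Everything in it is an assumption made for illustration: the beta values are simulated, the region labels are hypothetical, and the choice of Welch t-tests, Benjamini-Hochberg adjustment and Fisher's exact test is just one plausible reading of "a test on the proportion", not an established recommendation.

```r
# Toy illustration only: simulated beta values, not real methylation data.
set.seed(42)

n_probes  <- 200                                   # p probes
n_samples <- 20                                    # n samples
group  <- rep(c("case", "control"), each = n_samples / 2)
region <- rep(c("A", "other"), times = c(40, 160)) # hypothetical region labels

# Simulated beta values in (0, 1): probes in rows, samples in columns
beta <- matrix(rbeta(n_probes * n_samples, shape1 = 2, shape2 = 5),
               nrow = n_probes, ncol = n_samples)

# Add an artificial case/control shift to half of the probes in region A
shifted <- 1:20
beta[shifted, group == "case"] <-
  pmin(beta[shifted, group == "case"] + 0.3, 0.99)

# Step 1: one within-probe test per row (case versus control)
p_values <- apply(beta, 1, function(x) {
  t.test(x[group == "case"], x[group == "control"])$p.value
})
flagged <- p.adjust(p_values, method = "BH") < 0.05

# Step 2: does the proportion of flagged probes differ inside region A?
counts <- table(region  = region,
                flagged = factor(flagged, levels = c(FALSE, TRUE)))
region_test <- fisher.test(counts)

print(counts)
print(region_test$p.value)
```

Each probe is tested only against itself across samples, so the comparison stays at the probe level; whether this is the most appropriate design for a given study is a separate question.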
Regards, Gustavo --------------------------- Enviado con Sparrow (http://www.sparrowmailapp.com/?sig) From stvjc at channing.harvard.edu Tue Jun 19 15:15:36 2012 From: stvjc at channing.harvard.edu (Vincent Carey) Date: Tue, 19 Jun 2012 09:15:36 -0400 Subject: [BioC] changes to RBGL Message-ID: The Bioc 2.10 release image of RBGL has been modified in two significant ways 1) boost (www.boost.org) C++ headers for graph algorithms are now derived from boost 1.49 2) RBGL C++ sources have been modified to compile cleanly with g++ 4.7 This has necessitated (at least temporary) removal of boost-based graph layout functions. The loss seems unimportant given the availability of Rgraphviz There has been a minor change to the behavior of incremental.components Similar changes are in place in the devel version. Users and developers who encounter difficulties with these changes should post details to the list. From Heidi.Dvinge at cancer.org.uk Tue Jun 19 15:04:32 2012 From: Heidi.Dvinge at cancer.org.uk (Heidi Dvinge) Date: Tue, 19 Jun 2012 14:04:32 +0100 Subject: [BioC] HTqPCR In-Reply-To: References: <6D0043C9-4BAE-469C-8369-8733D7D53644@cancer.org.uk> <6C95A3CE-902D-4068-B64A-0A2813071A1A@cancer.org.uk> <50B7F68B-3762-4FF4-8F97-692ED30F06AE@cancer.org.uk> <99093F24-FF41-4FA0-BE55-BB0AF5C3D010@cancer.org.uk> Message-ID: Hi Silvia, On 18 Jun 2012, at 17:51, Silvia Halim wrote: > Hi Heidi, > > The function breaks at plotCtReps. >> traceback() > 1: plotCtReps(temp, card = 2, percent = 20, xlim = c(0, 100), ylim = c(0, > 100)) > > You've pointed out the problem about the duplicates as I have 3 replicates on my assay. I got confused reading the manual as it says plotCtReps can be used for a sample containing duplicate measurements (which I thought to be 2 or more measurements). 
>> table(table(featureNames(temp))) > > 3 6 > 30 1 > If you try running the examples for plotCtReps, you'll see that the function directly plots two replicates of a feature against each other on the (x,y) axis. 3D (x,y,z) plots aren't implemented, so features that are replicated 3 times can't be plotted. I'll try to clarify the text for the function. Perhaps something like plotCtVariation() will give you what you're after? If you only want to visually inspect your data, then grep("plot", ls("package:HTqPCR"), value=TRUE) will list all the plotting functions available in HTqPCR. HTH \Heidi > Btw there's no NA in my data. >> sum(is.na(temp)) > [1] 0 > Warning message: > In is.na(temp) : is.na() applied to non-(list or vector) of type 'S4' >> > > Thanks, > Silvia > > -----Original Message----- > From: Heidi Dvinge > Sent: 15 June 2012 9:06 PM > To: Silvia Halim > Cc: bioconductor at r-project.org > Subject: Re: HTqPCR > > Hi Silvia, > > On 15 Jun 2012, at 18:45, Silvia Halim wrote: > >> Hi Heidi, >> >> I ran into below problem when using plotCtReps. 
>> >>> plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = >>> c(0,50)) >> Error in split.data[[s]] : subscript out of bounds In addition: >> Warning messages: >> 1: In min(x, na.rm = na.rm) : >> no non-missing arguments to min; returning Inf >> 2: In max(x, na.rm = na.rm) : >> no non-missing arguments to max; returning -Inf >>> plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = >>> c(0,50)) >> Error in split.data[[s]] : subscript out of bounds In addition: >> Warning messages: >> 1: In min(x, na.rm = na.rm) : >> no non-missing arguments to min; returning Inf >> 2: In max(x, na.rm = na.rm) : >> no non-missing arguments to max; returning -Inf >>> plotCtReps(temp, card = 2, percent = 20, xlim = c(0,100), ylim = >>> c(0,100)) >> Error in split.data[[s]] : subscript out of bounds In addition: >> Warning messages: >> 1: In min(x, na.rm = na.rm) : >> no non-missing arguments to min; returning Inf >> 2: In max(x, na.rm = na.rm) : >> no non-missing arguments to max; returning -Inf > > What's the output from traceback(), i.e. exactly where does the function break? >> > A couple of things you can try: > > - plotCtReps is meant to be used in cases where there are exactly 2 replicates of the features on your assay. Is this the case? For example, with the data below there are 190 features that will be plotted, and 1 that will be skipped: >> data(qPCRraw) >> table(table(featureNames(qPCRraw))) > 2 4 > 190 1 > > - are there any NAs in your data? E.g. sum(is.na(qPCRraw))>0. > > HTH > \Heidi > >> Here is how 'temp' looks like >>> temp >> An object of class "qPCRset" >> Size: 96 features, 96 samples >> Feature types: Reference, Test >> Feature names: b-Actin b-Actin b-Actin ... >> Feature classes: >> Feature categories: OK >> Sample names: NTC_4 PMPT352 NTC_3 ... >> >> Do you know why it is complaining about split.data? 
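The replicate check suggested above, `table(table(featureNames(...)))`, can be tried without any HTqPCR objects. The following is only a base-R sketch under stated assumptions: `feature_names` is a made-up vector standing in for `featureNames(temp)` (30 features in triplicate plus one feature measured six times, which matches the 96 features and the counts reported in this thread), and the final filter merely illustrates which features have exactly two replicates. It is not a description of what plotCtReps does internally.

```r
# Hypothetical feature names standing in for featureNames(temp):
# 30 genes measured three times each and one reference measured six times.
feature_names <- c(rep(paste0("gene", 1:30), each = 3), rep("b-Actin", 6))

# How many times does each feature occur on the assay?
reps_per_feature <- table(feature_names)

# How many features have each replicate count?
# This is the same idea as table(table(featureNames(temp))).
replicate_summary <- table(reps_per_feature)
print(replicate_summary)

# Features with exactly two replicates, i.e. the only ones that a
# replicate-versus-replicate scatter plot could place on an (x, y) plane.
exactly_two <- names(reps_per_feature)[as.vector(reps_per_feature) == 2]
length(exactly_two)
```

With triplicate data such as this, no feature has exactly two replicates, which would be consistent with a two-replicate plot having nothing to draw.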
>> >> Thanks, >> Silvia >> >> -----Original Message----- >> From: Heidi Dvinge >> Sent: 11 June 2012 6:11 PM >> To: Silvia Halim >> Subject: Re: HTqPCR >> >> Ok, so you already have a 96 by 96 matrix, so you don't need changeCtLayout. >> Good luck with the rest, and let me know if you encounter any problems. >> >> On 11 Jun 2012, at 19:05, Silvia Halim wrote: >> >>> Hi Heidi, >>> >>> Thank you for your clarification. >>> >>> Btw this is how it looks like when I type 'temp' >>>> temp >>> An object of class "qPCRset" >>> Size: 96 features, 96 samples >>> Feature types: Reference, Test >>> Feature names: b-Actin b-Actin b-Actin ... >>> Feature classes: >>> Feature categories: OK >>> Sample names: NTC_4 PMPT352 NTC_3 ... >>> >>> Cheers, >>> Silvia >>> >>> -----Original Message----- >>> From: Heidi Dvinge >>> Sent: 08 June 2012 7:12 PM >>> To: Silvia Halim >>> Subject: Re: HTqPCR >>> >>> Hi Silvia, >>> >>> what are the dimensions of the "temp" object that you read in? I.e. >>> what does it look like if you just type >>>> temp >>> >>> If you read in the data with n.features=96 and n.data=96, then you should already have an object with 96 rows and 96 columns, in which case you don't need to change the layout. >>> >>> Best, >>> \Heidi >>> >>> On 8 Jun 2012, at 19:13, Silvia Halim wrote: >>> >>>> Hi Heidi, >>>> >>>> I finally have time to try out your HTqPCR bioconductor package again and I was trying to use 'changeCtLayout' function. However, I got following error message: >>>> >>>>> qPCRnew <- changeCtLayout(temp, sample.order = sample_order) >>>> Error in data.frame(..., check.names = FALSE) : >>>> arguments imply differing number of rows: 0, 96 In addition: >>>> Warning >>>> message: >>>> In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) 
: >>>> data length is not a multiple of split variable >>>> >>>> The commands that I run are as follows: >>>>> temp <- readCtData("110614 BENIGN_1 DATA 96X96.csv", path = >>>>> getwd(), n.features = 96, n.data=96, flag = 9, feature = 5, type= >>>>> 6, Ct = 7, position = 1, skip = 12, sep = ",") sample_order <- >>>>> rep(sampleNames(temp), each = 96) qPCRnew <- changeCtLayout(temp, >>>>> sample.order = sample_order) >>>> >>>> I've tried to follow what's written in changeCtLayout function description. Can you please advise what went wrong? >>>> >>>> Thanks, >>>> Silvia >>>> >>>> -----Original Message----- >>>> From: Heidi Dvinge >>>> Sent: 29 April 2012 8:18 PM >>>> To: Silvia Halim >>>> Subject: Re: HTqPCR >>>> >>>> HI Silvia, >>>> >>>> I'm glad you got it working. Depending on what you're supposed to do with the data, you may need to tweak some functions slightly, as you mention. Let me know if you run into any more trouble. >>>> >>>> Cheers >>>> \Heidi >>>> >>>> On 26 Apr 2012, at 18:37, Silvia Halim wrote: >>>> >>>>> Hi Heidi, >>>>> >>>>> Thanks for the help! It's working for me now. Right now I'm figuring it out how I can use the functions that you described in the vignette. I might have to tweak the parameters for using the functions on Fluidigm data. >>>>> >>>>> Cheers, >>>>> Silvia >>>>> >>>>> -----Original Message----- >>>>> From: Heidi Dvinge >>>>> Sent: 25 April 2012 8:56 AM >>>>> To: Silvia Halim >>>>> Subject: Re: HTqPCR >>>>> >>>>> Hiya, >>>>> >>>>> sorry, I only just now realised that you'd attached a file. When I saved as csv, the following command worked: >>>>> >>>>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", >>>>>> format="BioMark", >>>>>> n.features=96*96) raw >>>>> An object of class "qPCRset" >>>>> Size: 9216 features, 1 samples >>>>> Feature types: >>>>> Feature names: b-Actin b-Actin b-Actin ... >>>>> Feature classes: >>>>> Feature categories: OK >>>>> Sample names: 110614 BENIGN_1 DATA 96x96 ... 
>>>>> >>>>> The data isn't transformed into a 96x96 format immediately though (in case you read in multiple arrays, and want to normalise them independently). If you want to change this, you can use changeCtLayout(). Alternatively you can say: >>>>> >>>>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", >>>>>> format="BioMark", n.features=96, n.data=96) raw >>>>> An object of class "qPCRset" >>>>> Size: 96 features, 96 samples >>>>> Feature types: >>>>> Feature names: b-Actin b-Actin b-Actin ... >>>>> Feature classes: >>>>> Feature categories: OK >>>>> Sample names: Sample1 Sample2 Sample3 ... >>>>>> plotCtArray(raw) >>>>> >>>>> HTH >>>>> \Heidi >>>>> >>>>> On 24 Apr 2012, at 17:55, Silvia Halim wrote: >>>>> >>>>>> Hi Heidi, >>>>>> >>>>>> I have some problems updating R on lustre. Therefore, I chose to run HTqPCR on my desktop for the moment. >>>>>> >>>>>> Reading in your sample file looks fine, however, reading in the >>>>>> file that I showed you just now gave me below error message. 
(The >>>>>> file is as attached) >>>>>> >>>>>>> temp <- readCtData("110614 BENIGN_1 DATA 96x96.xlsx", path = >>>>>>> getwd() , n.features = 96*96, flag = 9, feature = 5, type= 6, Ct >>>>>>> = 7,position = 1, skip = 12, sep = ",") >>>>>> Error in read.table(file = file, header = header, sep = sep, quote = quote, : >>>>>> no lines available in input >>>>>> In addition: Warning message: >>>>>> In readLines(file, skip) : >>>>>> incomplete final line found on 'C:/Users/halim01/Documents/20110627_RossAdamsH_DN_Fluid/110614 BENIGN_1 DATA 96x96.xlsx' >>>>>>> sessionInfo() >>>>>> R version 2.14.0 (2011-10-31) >>>>>> Platform: x86_64-pc-mingw32/x64 (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C LC_TIME=English_United Kingdom.1252 >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] Biostrings_2.22.0 IRanges_1.12.6 BiocInstaller_1.2.1 marray_1.32.0 HTqPCR_1.8.0 limma_3.10.3 RColorBrewer_1.0-5 Biobase_2.14.0 gdata_2.8.2 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] affy_1.32.1 affyio_1.22.0 gplots_2.10.1 gtools_2.6.2 preprocessCore_1.16.0 tools_2.14.0 zlibbioc_1.0.1 >>>>>>> >>>>>> >>>>>> I did a quick check on the file and it only has 9228 lines including 12 header lines which I had skipped when reading in the file. Do you know what could possibly go wrong? >>>>>> >>>>>> Cheers, >>>>>> Silvia >>>>>> >>>>>> -----Original Message----- >>>>>> From: Heidi Dvinge >>>>>> Sent: 24 April 2012 5:09 PM >>>>>> To: Silvia Halim >>>>>> Subject: Re: HTqPCR >>>>>> >>>>>> Hm, that looks like it may be x11 acting up. I often have similar issues when I work on a remote server. >>>>>> >>>>>> Actually, the processing of Fluidigm files is very computationally light. So you can easily do it on your desktop, if you can't update on lustre. 
>>>>>> >>>>>> I can also email you and older version of the vignette if you want to have a look. However, in HTqPCR 1.2.0 I don't even think I had a dedicated function for plotting the Fluidigm assays yet (the plotCtArray shown in the vignette). >>>>>> >>>>>> Cheers >>>>>> \Heidi >>>>>> >>>>>> On 24 Apr 2012, at 16:39, Silvia Halim wrote: >>>>>> >>>>>>> Hi Heidi, >>>>>>> >>>>>>> This is what I got when accessing the vignette. >>>>>>> >>>>>>>> openVignette(package="HTqPCR") >>>>>>> Please select a vignette: >>>>>>> >>>>>>> 1: HTqPCR - qPCR analysis in R >>>>>>> >>>>>>> Selection: 1 >>>>>>> Opening >>>>>>> /home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf >>>>>>>> xprop: unable to open display '' >>>>>>> /usr/local/bin/xdg-open: line 370: firefox: command not found >>>>>>> /usr/local/bin/xdg-open: line 370: mozilla: command not found >>>>>>> /usr/local/bin/xdg-open: line 370: netscape: command not found >>>>>>> xdg-open: no method available for opening '/home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf' >>>>>>> >>>>>>> Sorry for the confusion, you are right that I was looking at a newer version of HTqPCR than the one installed on lustre. I think that's because I have different installations of HTqPCR on lustre and on my desktop. If I can update the one on lustre, I'll go ahead with the update. >>>>>>> >>>>>>> Thank you, >>>>>>> Silvia >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Heidi Dvinge >>>>>>> Sent: 24 April 2012 4:28 PM >>>>>>> To: Silvia Halim >>>>>>> Subject: Re: HTqPCR >>>>>>> >>>>>>> Ah, right, it looks like you have an older version of R, and therefore also HTqPCR. >>>>>>> >>>>>>> The most current release version is 1.10.0. In that version, readCtData() was modified to accept different types of input data, including from Fluidigm. Before that, this sort of data had to be read in 'manually'. 
>>>>>>> >>>>>>> I guess the vignette that you were looking at comes from a >>>>>>> version of HTqPCR that's newer than the one you have installed? >>>>>>> If you access the vignette corresponding to your HTqPCR version >>>>>>> via >>>>>>>> openVignette(package="HTqPCR") >>>>>>> what do you get then? >>>>>>> >>>>>>> If you get an older version, then depending on how old it is, there may be a section towards the end giving an example of how to process Fluidigm data more 'manually'. If not, an update may be your best bet. >>>>>>> >>>>>>> Cheers >>>>>>> \Heidi >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 24 Apr 2012, at 16:15, Silvia Halim wrote: >>>>>>> >>>>>>>> Hi Heidi, >>>>>>>> >>>>>>>> Thanks for looking into the matter. Below is the output of my >>>>>>>> sessionInfo() >>>>>>>> >>>>>>>>> sessionInfo() >>>>>>>> R version 2.13.0 (2011-04-13) >>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >>>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] marray_1.26.0 Biostrings_2.20.1 IRanges_1.10.3 HTqPCR_1.2.0 >>>>>>>> [5] limma_3.6.9 RColorBrewer_1.0-2 Biobase_2.12.1 gdata_2.8.0 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] affy_1.26.1 affyio_1.20.0 gplots_2.8.0 >>>>>>>> [4] gtools_2.6.2 preprocessCore_1.14.0 >>>>>>>>> >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Silvia >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Heidi Dvinge >>>>>>>> Sent: 24 April 2012 4:07 PM >>>>>>>> To: Silvia Halim >>>>>>>> Subject: HTqPCR >>>>>>>> >>>>>>>> Hi Silvia, >>>>>>>> >>>>>>>> I just tested the read fluidigm from the vignette, and it 
works on both my mac and a single unix system that I've tested. Although from the errors you were getting, it seemed like the headers weren't being read correctly/at all.
>>>>>>>>
>>>>>>>> Would you mind sending me the output of your sessionInfo(), so I can compare which package versions we have?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> \Heidi
>>>>>>>>
>>>>>>>>> sessionInfo()
>>>>>>>> R version 2.15.0 (2012-03-30)
>>>>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>>>>>
>>>>>>>> locale:
>>>>>>>> [1]
>>>>>>>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>>>>>>>
>>>>>>>> attached base packages:
>>>>>>>> [1] tools stats graphics grDevices utils datasets methods base
>>>>>>>>
>>>>>>>> other attached packages:
>>>>>>>> [1] HTqPCR_1.10.0 limma_3.12.0 RColorBrewer_1.0-5 Biobase_2.16.0
>>>>>>>> [5] BiocGenerics_0.2.0
>>>>>>>>
>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>> [1] affy_1.34.0 affyio_1.24.0 BiocInstaller_1.4.3
>>>>>>>> [4] gdata_2.8.2 gplots_2.10.1 gtools_2.6.2
>>>>>>>> [7] preprocessCore_1.18.0 stats4_2.15.0 zlibbioc_1.2.0
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> <110614 BENIGN_1 DATA 96x96.xlsx>
>>>>>
>>>>
>>>
>>
>

NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for ...{{dropped:16}}

From mark.robinson at imls.uzh.ch Tue Jun 19 16:17:23 2012
From: mark.robinson at imls.uzh.ch (Mark Robinson)
Date: Tue, 19 Jun 2012 16:17:23 +0200
Subject: [BioC] Newbie methylation and stats question
In-Reply-To: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID: <6ED7C085-9733-410B-9B2F-FF25CE8A0192@imls.uzh.ch>

Hi Gustavo,

I've inserted a few "reactions" below.

On 19.06.2012, at 12:57, Gustavo Fernández Bayón wrote:

> Hi everybody.
>
> As a newbie to bioinformatics, it is not uncommon to find difficulties in the way biological knowledge mixes with statistics.
I come from the Machine Learning field, and usually have problems with the naming conventions (well, among several other things, I must admit). Besides, I am not an expert in statistics, having used the barely necessary for the validation of my work.
>
> Well, let's try to be more precise. One of the topics I am working more right now is the analysis of methylation array data. As you surely know, the final processed (and normalized) beta values are presented in a p x n matrix, where there are p different probes and n different samples or individuals from which we have obtained the beta-values. I am not currently working with the raw data.
>
> Imagine, for a moment, that we have identified two regions of probes, A and B, with a group of nA probes belonging to A, another group (of nB probes) that belongs to B, and the intersection is empty. Say that we want to find a way to show there is a statistically significant difference between the methylation values of both regions.
> As far as I have seen in the literature, comparisons (statistical tests) are always done comparing the same probe values between case and control groups of individuals or samples. For example, when we are trying to find differentiated probes.

You can do differential analyses at the probe level or a regional level. An example of the latter (perhaps less popular or less established or less known) is:
http://ije.oxfordjournals.org/content/41/1/200.abstract

> However, if I think of directly comparing all the beta values from region A (nA * n values) against the ones in region B (nB * n values) with a, say, t test, I get the suspicion that something is not being done the way it should. My knowledge of Biology and Statistics is still limited and I cannot explain why, but I have the feeling that there is something formally wrong in this approximation. Am I right?

First of all, I feel this is an unusual comparison to make.
Presumably, region A and region B are different regions of the genome - what does it mean if methylation levels in region A and B are different? Maybe you could expand on the biological question here? Second, if this is the comparison you really want to make, what role do your n samples play here? Do you have cases and controls? It may be sensible to fit a model to allow you to decompose effects of case/control from those of interest (A/B). But again, this needs to be geared to your biological question, which I don't yet understand. Best, Mark > What I have done in similar experiments has been to find differentiated probes, and then do a test to the proportion of differentiated probes to total number of them, so I could assign a p-value to prove that there was a significant influence of the region of reference. > Several questions here: which could be a coherent approximation to the regions A and B problem stated above? Is there any problem with methylation data I am not aware of which makes only the in-probe analysis valid? Any bibliographic references that could help me seeing the subtleties around? > > As you can see, concepts are quite interleaved in my mind, so any help would be very appreciated. > Regards, > Gustavo > > > > > --------------------------- > Enviado con Sparrow (http://www.sparrowmailapp.com/?sig) > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------- Prof. Dr. 
Mark Robinson
Bioinformatics
Institute of Molecular Life Sciences
University of Zurich
Winterthurerstrasse 190
8057 Zurich
Switzerland

v: +41 44 635 4848
f: +41 44 635 6898
e: mark.robinson at imls.uzh.ch
o: Y11-J-16
w: http://tiny.cc/mrobin

----------
http://www.fgcz.ch/Bioconductor2012

From tim.triche at gmail.com Tue Jun 19 16:20:17 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 07:20:17 -0700
Subject: [BioC] Newbie methylation and stats question
In-Reply-To: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From MEC at stowers.org Tue Jun 19 16:35:18 2012
From: MEC at stowers.org (Cook, Malcolm)
Date: Tue, 19 Jun 2012 09:35:18 -0500
Subject: [BioC] [Engineers for ensemblgenomes.org #251937] BioMart data anomaly: for some transcripts, the cds cumulative length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart
In-Reply-To:
Message-ID:

Hi,

I am chiming in as the original reporter, and cc:ing Herve Pages from the BioConductor project who was instrumental in providing diagnostic feedback and coded much of the inner workings of the 'R' part.

When I now follow the steps I originally reported, now using today's biomart (Ensembl 67), I find that transcripts are still identified as having the reported anomaly.

However, for my purposes, I now find the problem greatly ameliorated in that:

- there are only 5 such transcripts
- they are all in the same alternatively spliced gene
- the BioConductor package now more gracefully raises a warning with a detailed report instead of an error.

I believe that examining the detailed report, included in my transcript below, will reveal the remaining root cause to you.

Thanks for following up!

I hope this helps, and am looking forward to the ticket being closed on this one!
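The mismatch named in the subject line can be checked by hand for the first transcript in the report that follows (FBtr0084080). This is a simplified base-R sketch, not the code GenomicFeatures itself runs: it assumes every UTR interval lies inside an exon, so the CDS length is taken as the total exon width minus the total UTR width. The helper `width` and the variable names are made up for illustration; the coordinates are copied from the report.

```r
# Exon coordinates for transcript FBtr0084080, copied from the anomaly report.
exons <- data.frame(
  start = c(17203010, 17202541, 17202324, 17195184, 17200782),
  end   = c(17203121, 17202798, 17202463, 17195967, 17201634)
)

# UTR coordinates from the same report: two 5' UTR pieces and one 3' UTR piece.
utrs <- data.frame(
  start = c(17203010, 17202749, 17195184),
  end   = c(17203121, 17202798, 17195428)
)

# Width of a closed interval [start, end] (hypothetical helper).
width <- function(x) x$end - x$start + 1

# Simplifying assumption: UTRs are fully contained in exons.
inferred_cds_length <- sum(width(exons)) - sum(width(utrs))
reported_cds_length <- 887  # "cds_length" attribute from BioMart

inferred_cds_length
inferred_cds_length == reported_cds_length
```

Under these assumptions the exons sum to 2147 bases and the UTRs to 407, giving an inferred CDS length of 1740 against a reported 887. The gap of 853 happens to equal the width of the exon listed with rank 5; whether that points to the root cause is for the Ensembl Genomes team to confirm.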
~ Malcolm Cook

$ R
# use the package (assuming it and dependencies are installed)
library(GenomicFeatures)
# and try to build the TranscriptDb (expect error/warning here)
txdb <- makeTranscriptDbFromBiomart(biomart="ensembl", dataset="dmelanogaster_gene_ensembl", circ_seqs=NULL)
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
metadata: OK
Make the TranscriptDb object ... OK
Warning message:
In .warningWithBioMartDataAnomalyReport(bm_table, idx, id_prefix, :
  BioMart data anomaly: in the following transcripts, the CDS total length inferred from the exon and UTR info doesn't match the "cds_length" attribute from BioMart.

1. Transcript FBtr0084080:

  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA        887
2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA        887
3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA        887
4     -1    4         17195184       17195967   FBgn0002781:39          NA        NA    17195184  17195428        887
5     -1    5         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA        887

2. Transcript FBtr0084077:

  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    3         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       -213
2     -1    4         17202541       17202798   FBgn0002781:29    17202755  17202798          NA        NA       -213
3     -1    1         17202324       17202463 FBgn0002781:28-B          NA        NA          NA        NA       -213
4     -1    2         17177331       17177608    FBgn0002781:1          NA        NA    17177331  17177387       -213
5     -1    5         17200782       17201634 FBgn0002781:27-A          NA        NA          NA        NA       -213

3. Transcript FBtr0084082:

  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    3         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       -466
2     -1    4         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       -466
3     -1    1         17202324       17202463 FBgn0002781:28-B          NA        NA          NA        NA       -466
4     -1    5         17200782       17201634 FBgn0002781:27-A          NA        NA          NA        NA       -466
5     -1    2         17193632       17193960   FBgn0002781:37          NA        NA    17193632  17193935       -466

4. Transcript FBtr0084079:

  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       1572
2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       1572
3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA       1572
4     -1    4         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA       1572
5     -1    5         17186112       17186276   FBgn0002781:31          NA        NA    17186112  17186276       1572
6     -1    6         17186350       17187009   FBgn0002781:32          NA        NA    17186350  17186803       1572

5. Transcript FBtr0084085:

  strand rank exon_chrom_start exon_chrom_end  ensembl_exon_id 5_utr_start 5_utr_end 3_utr_start 3_utr_end cds_length
1     -1    1         17203010       17203121   FBgn0002781:30    17203010  17203121          NA        NA       1729
2     -1    2         17202541       17202798   FBgn0002781:29    17202749  17202798          NA        NA       1729
3     -1    3         17202324       17202463 FBgn0002781:28-A          NA        NA          NA        NA       1729
4     -1    4         17200782       17201634 FBgn0002781:27-B          NA        NA          NA        NA       1729
5     -1    5         17187120       17187332   FBgn0002781:33          NA        NA    17187120  17187332       1729
6     -1    6         17187392       17187860   FBgn0002781:34          NA        NA    17187392  17187545       1729

# show off the txdb's metadata
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: BioMart
| Genus and Species: Drosophila melanogaster
| Resource URL: www.biomart.org:80
| BioMart database: ensembl
| BioMart database version: ENSEMBL GENES 67 (SANGER UK)
| BioMart dataset: dmelanogaster_gene_ensembl
| BioMart dataset description: Drosophila melanogaster genes (BDGP5)
| BioMart dataset version: BDGP5
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 25415
| exon_nrow: 74818
| cds_nrow: 62601
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-06-19 09:13:33 -0500 (Tue, 19 Jun 2012)
| GenomicFeatures version at creation time: 1.8.1
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0

# show off details about the version of R and libraries used.
> sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 Biobase_2.16.0 GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 BiocInstaller_1.4.6 loaded via a namespace (and not attached): [1] BSgenome_1.24.0 Biostrings_2.24.1 DBI_0.2-5 RCurl_1.91-1 RSQLite_0.11.1 Rsamtools_1.8.5 XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 rtracklayer_1.16.1 stats4_2.15.0 tools_2.15.0 zlibbioc_1.2.0 > On 6/19/12 8:35 AM, "kmegy at ebi.ac.uk via RT" wrote: >Which species was this again? Drosophila? > >I fixed something about STOP codons for Droso., but it's probably not >what he is talking about. > > >On 19 Jun 2012, at 14:32, Dan Staines wrote: > >> I believe that Karyn fixed this but Dan L & co are probably in a better >>position to comment. >> >> On 06/19/2012 01:36 PM, Bert Overduin via RT wrote: >>> Hi Dan, >>> >>> Has this been fixed in EG14? >>> >>> Cheers, >>> Bert >>> >>> On Sun, Apr 15, 2012 at 5:56 PM, Dan Staines via RT >>> wrote: >>>> Hi Malcolm, >>>> >>>> I've just asked for an update on this. Fixes that we've applied >>>>recently do not >>>> unfortunately appear to fix the issue. However, we're continuing to >>>>investigate >>>> how to fix this and are aiming for a fix for EG14 in May. >>>> >>>> Best, >>>> >>>> Dan. >>>> >>>> . >>>> >>>> -- >>>> Ticket Details>>>https://rt.sanger.ac.uk/SelfService/Display.html?id=251937> >>>> >>>> >>>> -- >>>> The Wellcome Trust Sanger Institute is operated by Genome Research >>>> Limited, a charity registered in England with number 1021457 and a >>>> company registered in England with number 2742969, whose registered >>>> office is 215 Euston Road, London, NW1 2BE. 
>>>
>>>
>>>
>>
>> --
>> Dan Staines, PhD               Ensembl Genomes Technical Coordinator
>> EMBL-EBI                       Tel: +44-(0)1223-492507
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>--
>Ticket Details https://rt.sanger.ac.uk/SelfService/Display.html?id=251937
>
>
>
>--
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.

From chenyao.bioinfor at gmail.com Tue Jun 19 16:46:30 2012
From: chenyao.bioinfor at gmail.com (Yao Chen)
Date: Tue, 19 Jun 2012 10:46:30 -0400
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From gbayon at gmail.com Tue Jun 19 16:56:41 2012
From: gbayon at gmail.com (Gustavo Fernández Bayón)
Date: Tue, 19 Jun 2012 16:56:41 +0200
Subject: [BioC] Newbie methylation and stats question
In-Reply-To: <6ED7C085-9733-410B-9B2F-FF25CE8A0192@imls.uzh.ch>
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com> <6ED7C085-9733-410B-9B2F-FF25CE8A0192@imls.uzh.ch>
Message-ID: <1615E8496E0F4118AF01E41657F8FEB9@gmail.com>

Hi Mark.

First of all, thank you for your kind answer. I am answering you below (or at least trying to). ;)

---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Tuesday 19 June 2012 at 16:17, Mark Robinson wrote:

> [...]
> You can do differential analyses at the probe level or a regional level. An example of the latter (perhaps less popular or less established or less known) is:
> http://ije.oxfordjournals.org/content/41/1/200.abstract

I have just given it a super-fast read, and it seems very interesting.
I am going to read it more carefully, and see if it can help me to understand better where I am standing. If I have got the idea right, the authors seem to do some kind of regression or model fitting using the methylation values against the (maybe relative) position of the probes, in order to detect contiguous regions where differential methylation exists. Am I right?

> [...]
> First of all, I feel this is an unusual comparison to make. Presumably, region A and region B are different regions of the genome - what does it mean if methylation levels in region A and B are different? Maybe you could expand on the biological question here?

Yes, of course. A fellow wants to prove that a given region is differentially methylated between two sets of individuals. She has 6 case and 5 control individuals, along with their methylation beta values for a given set of probes (small, around 27 subdivided among 4 regions). Visually, she is able to see that there is a difference in methylation between the control and case group and, what is more, that the differentiation occurs 99% of the time in a given region. She asked me for a statistical test, so she could have a p-value showing that, not only are the two groups differentially methylated, but also that the methylation difference happens at exactly one region. Kind of a "how can I show that this region is different and the others aren't?"

>
> Second, if this is the comparison you really want to make, what role do your n samples play here? Do you have cases and controls? It may be sensible to fit a model to allow you to decompose effects of case/control from those of interest (A/B). But again, this needs to be geared to your biological question, which I don't yet understand.

I don't know if the explanation above is helping. Feel free to ask me anything you need. The biggest problem, I know, is that sometimes I do not know how to put all of this into words.
Well, I hope that is going to improve with time (I have only been in bioinformatics for two months).

>
> Best,
> Mark

Regards,
Gustavo

From gbayon at gmail.com Tue Jun 19 17:16:46 2012
From: gbayon at gmail.com (Gustavo Fernández Bayón)
Date: Tue, 19 Jun 2012 17:16:46 +0200
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

Hi Tim.

Thank you for your answer. I'll try to "defend" myself the best I can below. ;)

---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Tuesday 19 June 2012 at 16:20, Tim Triche, Jr. wrote:

> Look up Andrew Jaffe and Rafa Irizarry's paper on "bump hunting" for regional differences.

I think both Mark and you have agreed on the paper. That surely is a good reason for me to read it thoroughly.

> Or run a smooth over it (caveat: I just wrote smoothing "the way I want it" yesterday, after being provoked by a collaborator, so you might have to use lumi).

I am not sure if I understand what you are trying to tell me here. ;) Sorry. I know lumi, although I thought it covered only the necessary stages until normalization of data.

> The function "dmrFinder" in the "charm" package is specifically meant for this sort of thing.

I had looked at the charm vignette in the past few days, but thought it was designed for technology different from ours. For me, sometimes it is difficult to just "understand" the goals or targets of different packages. I am currently biocLite'ing it while I am writing this, so I'll take a look at dmrFinder and tell you.

> Also, if you're doing linear tests, be careful with normalization,

I thought (too naively, I guess) that, when given the beta values, everything was normalized. I.e., that I was safe unless I worked with raw data.

> mask your SNPs and chrX probes,

I am currently doing something well :) At least, the chrX part. How could I mask the SNPs?
> and maybe use M-values (logit(beta)) for the task.

Yes, that's a point I have been reading a lot about lately. As far as I think I have understood, M-values have better statistical properties for spotting DMRs, haven't they?

> The latter is more important for epidemiological datasets than something like cancer, where every single interesting result from M-value testing has been reproduced using untransformed beta values when I ran comparisons (e.g. HELP hg17 methylation differences for IDH1/2 mutants vs. Illumina hm450 differences for IDH1/2 mutants, the complete absence of any differences for TET2 mutants regardless of platform, etc.)

Well. I have to admit that I do not understand completely what you have written above. ;) Don't worry, it's not your problem. I'm sure it's mine. I am sometimes quite overwhelmed by the huge amount of information in this field.

>
> Mark Robinson just chimed in, I see. Probably a good idea to read his reply carefully as well.

I have done. And both your answer and his have been very helpful, constructive, and kind. Thank you very much.

Regards,
Gustavo

From gbayon at gmail.com Tue Jun 19 17:22:54 2012
From: gbayon at gmail.com (Gustavo Fernández Bayón)
Date: Tue, 19 Jun 2012 17:22:54 +0200
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID: <884463BB79B1402F87C10D8216088427@gmail.com>

Hi Yao.

First of all, thank you for your answer.

---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Tuesday 19 June 2012 at 16:46, Yao Chen wrote:

> As far as I know, there is no standard normalization for methylation data.

As I have said to Tim in a previous post, I really thought I did not have to deal with normalization issues. Now it seems I have to start worrying about it.

> For me, I prefer keeping the raw value and just adjusting the technical variants. If anyone has a better solution, please let me know.
I thought that, for a beginner like me, it was better not to deal with the normalization stages, and just start working with the beta values.

>
> Back to the question, I agree with Mark. It's unusual to compare different regions. These regions may have different background methylation status and are hard to compare directly.

Thanks to the three of you, I think I am starting to see things clearly. The fact is that I just didn't know how to put it into words. We should not compare the methylation status of different regions because their magnitudes and behaviors are not comparable. Am I getting close?

>
> Jack

Regards,
Gustavo

From Kaat.DeCremer at biw.kuleuven.be  Tue Jun 19 17:44:27 2012
From: Kaat.DeCremer at biw.kuleuven.be (Kaat De Cremer)
Date: Tue, 19 Jun 2012 15:44:27 +0000
Subject: [BioC] design matrix edge R pairwise comparison at different timepoints after infection with replicates
Message-ID: <3D4A97F14E343F4584925219C1C1ACEF05B95961@ICTS-S-MBX7.luna.kuleuven.be>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mail.yong.li at googlemail.com  Tue Jun 19 18:16:40 2012
From: mail.yong.li at googlemail.com (Yong Li)
Date: Tue, 19 Jun 2012 18:16:40 +0200
Subject: [BioC] How to print out normalized Cy5 and Cy3 signals
In-Reply-To: <002b01cd4cec$dd15aaf0$974100d0$@edu.au>
References: <002b01cd4cec$dd15aaf0$974100d0$@edu.au>
Message-ID:

Hi,

I think one additional step before doing what Belinda suggested is to run RG.MA(), because after normalization in limma you get an MAList, and the Cy5 and Cy3 signals are not stored in an MAList. You need to convert it back to an RGList.

Kind regards,
Yong

On Mon, Jun 18, 2012 at 2:54 AM, Belinda Phipson wrote:
> Hi Mei
>
> Check the names of your data object:
>> names(data)
> to figure out where the normalized data is and then use the
>> write.csv(data$...,file="norm.csv")
> which can write matrices or data frames to a file which can be opened in
> excel.
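
The two steps above (convert back with RG.MA(), then export with write.csv()) can be sketched as follows. This is only a minimal illustration, assuming the normalized data sit in an MAList; the object `MA` below is simulated stand-in data and the output file names are arbitrary.

```r
library(limma)

# Simulated stand-in for a normalized MAList (for example, the result of
# normalizeWithinArrays()); real data would replace this object.
set.seed(1)
MA <- new("MAList", list(
  M = matrix(rnorm(8), nrow = 4, ncol = 2),        # log2(Cy5/Cy3)
  A = matrix(runif(8, 8, 12), nrow = 4, ncol = 2)  # average log2 intensity
))

# Convert back to an RGList: R holds the Cy5 (red) and G the Cy3 (green)
# intensities on the unlogged scale.
RG <- RG.MA(MA)

# Write each channel to its own CSV file (file names are arbitrary).
write.csv(RG$R, file = "norm_Cy5.csv")
write.csv(RG$G, file = "norm_Cy3.csv")
```

With real data, row names or a probe annotation column would normally be added before writing, so that the rows in the CSV files can be matched back to probes.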
>
> Cheers,
> Belinda
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org
> [mailto:bioconductor-bounces at r-project.org] On Behalf Of JiangMei
> Sent: Saturday, 16 June 2012 5:05 AM
> To: bioconductor at r-project.org
> Subject: [BioC] How to print out normalized Cy5 and Cy3 signals
>
>
> Hi All. Sorry to bother you.
>
> I used the limma package to normalize my two-color microarray data. I want to
> export the normalized Cy5 and Cy3 signals. Does anyone know how to do that?
> Thanks very much in advance.
>
>
> Best, Mei
>
> [[alternative HTML version deleted]]

From mail.yong.li at googlemail.com  Tue Jun 19 18:47:51 2012
From: mail.yong.li at googlemail.com (Yong Li)
Date: Tue, 19 Jun 2012 18:47:51 +0200
Subject: [BioC] Using limma for quantitative proteomics data
Message-ID:

Hello,

limma has been so valuable in microarray data analysis, but has anyone used limma for finding differentially expressed proteins from quantitative proteomics data? The data I got has tumor/normal ratios of thousands of proteins, and both tumor and normal have a number of replicates. Could such data be analyzed with limma?

If limma cannot be used here, what statistical method is suitable so that we can get statistically significant proteins with p-values? Any suggestion is appreciated.

Kind regards,
Yong

From tim.triche at gmail.com  Tue Jun 19 19:45:21 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 10:45:21 -0700
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From tim.triche at gmail.com  Tue Jun 19 19:57:22 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 10:57:22 -0700
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From tim.triche at gmail.com  Tue Jun 19 20:01:45 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 11:01:45 -0700
Subject: [BioC] Newbie methylation and stats question
In-Reply-To: <884463BB79B1402F87C10D8216088427@gmail.com>
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com> <884463BB79B1402F87C10D8216088427@gmail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From chenyao.bioinfor at gmail.com  Tue Jun 19 20:12:40 2012
From: chenyao.bioinfor at gmail.com (Yao Chen)
Date: Tue, 19 Jun 2012 14:12:40 -0400
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From amackey at virginia.edu  Tue Jun 19 20:18:31 2012
From: amackey at virginia.edu (Aaron Mackey)
Date: Tue, 19 Jun 2012 14:18:31 -0400
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From tim.triche at gmail.com  Tue Jun 19 20:19:53 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 11:19:53 -0700
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mcarlson at fhcrc.org  Tue Jun 19 20:58:15 2012
From: mcarlson at fhcrc.org (Marc Carlson)
Date: Tue, 19 Jun 2012 11:58:15 -0700
Subject: [BioC] Annotation Database for Agilent 8x60K Human Gene Expression Arrays
In-Reply-To:
References:
Message-ID: <4FE0CBC7.2070709@fhcrc.org>

Hi Karthik,

The 8x60 arrays have a different design ID, so out of caution I would not recommend that you use that package for this. Even if all the IDs from the other design ID are present here (and that is a big if), you may still in certain circumstances end up thinking that you tried to measure a bunch of things that you did not try to measure. The design ID was put into the name to try to avoid this confusion and to warn you about possible inconsistencies.

Here is the page from Agilent's website that gives the design IDs for the platforms in question, along with links to their eArray web application, which gives access to the probe-mapping files that they supply.

http://www.genomics.agilent.com/CollectionSubpage.aspx?PageType=Product&SubPageType=ProductData&PageID=1516

From that eArray link you should be able to just download the file (the one for the 8x60 arrays) and then follow the directions in the SQLForge vignette from the AnnotationDbi package to make an annotation package for the 8x60 arrays. It's pretty easy to make an annotation package. The tricky part is usually getting the mapping information from the manufacturers, but it sounds like you have already done that part.
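
As a rough illustration of the package-building step described above, the call used in the SQLForge vignette looks roughly like the sketch below. Everything specific here is an assumption for illustration: the wrapper function name, the package prefix, the chip name, and the version are made up, and the input is assumed to be a headerless, tab-delimited, two-column file (probe ID, GenBank accession) prepared from the eArray download. Building also needs the human.db0 package installed. In later Bioconductor releases makeDBPackage() is provided by AnnotationForge rather than AnnotationDbi, so the namespace may need adjusting.

```r
# Sketch only: wraps the makeDBPackage() call from the SQLForge vignette.
# The function is defined here but not run, because it needs a real
# probe-mapping file and the human.db0 package.
build_8x60k_annotation <- function(id_file, out_dir = ".") {
  AnnotationDbi::makeDBPackage(
    "HUMANCHIP_DB",                  # schema for a human chip package
    affy = FALSE,                    # not an Affymetrix annotation file
    prefix = "HsAgilent8x60k",       # hypothetical package prefix
    fileName = id_file,              # two columns: probe ID, GenBank accession
    baseMapType = "gb",              # identifiers in column 2 are GenBank IDs
    outputDir = out_dir,
    version = "0.0.1",               # arbitrary version string
    manufacturer = "Agilent",
    chipName = "Agilent 8x60K Human Gene Expression (placeholder name)",
    manufacturerUrl = "http://www.agilent.com"
  )
}
```

A call such as build_8x60k_annotation("probe_to_genbank.txt") (a hypothetical file name) would then write the package source directory into the working directory, ready to be installed with R CMD INSTALL.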

Here is where you can find that SQLForge vignette:

http://www.bioconductor.org/packages/2.10/bioc/vignettes/AnnotationDbi/inst/doc/SQLForge.pdf

For more information about Agi4x44PreProcess, you should look at the vignette for that package here:

http://www.bioconductor.org/packages/2.10/bioc/vignettes/Agi4x44PreProcess/inst/doc/Agi4x44PreProcess.pdf

The vignette certainly makes it seem like you can use this package with other Agilent annotations, but you might want to ask Pedro exactly how accommodating his design is intended to be.

Hope this helps,

Marc

On 06/18/2012 04:09 AM, Karthik K N wrote:
> Dear Members (and especially Marc Carlson),
>
> I have a few questions:
>
> 1. I was wondering whether HsAgilentDesign026652.db is the latest updated
> database for Agilent 8x60K human gene expression arrays or is there any
> other file? I have been told by Axel that HsAgilentDesign026652.db covers
> the oligos used by the Agilent 4x44Kv2 arrays, so I am not sure if I can
> use this for 8x60K arrays.
>
> 2. If I can indeed use HsAgilentDesign026652.db for 8x60K arrays, is it
> possible to use it as annotation file for Agi4x44PreProcess package so that
> I can use the same code from this package to analyze 8x60K arrays?
>
> 3. If I want to create an annotation file to be used in any of the
> packages, what are the requirements (files, tools etc) for it? I have an
> annotation file in excel format given to me by my microarray service
> provider, so if I want to use that annotation file in Agi4x44PreProcess
> package (or any other package that requires this file) what should I do?
>
> Thanks a lot in advance for all your suggestions!
>
> Cheers,
>
> Karthik
>
> [[alternative HTML version deleted]]

From murat.tasan at utoronto.ca  Tue Jun 19 21:27:23 2012
From: murat.tasan at utoronto.ca (Murat Tasan)
Date: Tue, 19 Jun 2012 15:27:23 -0400
Subject: [BioC] Gviz plot dimensions
Message-ID:

hi all -- i've been using the Gviz package for plotting annotation tracks, and i've recently run into issues where the device i set up for plotting is too small (along the height dimension) for the eventual plot, thus causing errors.

the quick fix is to make the device height much larger in anticipation of tall tracks being plotted, but when the tracks are not tall (or missing, in some cases), Gviz defaults to trying to fill the device, leading to vertically stretched out tracks that look rather unsightly.

does anyone know of any existing tricks to pre-compute the minimum dimension size(s) required for a plotTracks() call? what i'm thinking of is something similar in spirit to calling the hist() function with the "plot = FALSE" option set, where the values needed to set up one's own plot dimensions are returned, and then the final plot can be executed. (i.e. something like a dry run at a plot.)

in the end, what i'm really looking for is a way to automatically estimate the size of the device, such that the features of the plot (e.g. the ideogram track, or the axis track) are always the same vertical size, and the rest of the device is re-sized to handle the amount of data in the annotation/genome/data tracks.

cheers!

-murat

From mail.yong.li at googlemail.com  Tue Jun 19 21:39:23 2012
From: mail.yong.li at googlemail.com (Yong Li)
Date: Tue, 19 Jun 2012 21:39:23 +0200
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID:

Dear Aaron,

thank you for your quick answer! I have checked the help page of voom(), but it seems to be meant for count data. My data are just tumor/normal ratios. I am wondering if you could provide more details?

Best regards,
Yong

On Tue, Jun 19, 2012 at 8:18 PM, Aaron Mackey wrote:
> yes, it should be possible with a voom()-based analysis to get the variances
> "right".
>
> -Aaron
>
> On Tue, Jun 19, 2012 at 12:47 PM, Yong Li
> wrote:
>>
>> Hello,
>>
>> limma has been so valuable in microarray data analysis, but has anyone
>> used limma for finding differentially expressed proteins from
>> quantitative proteomics data? The data I got has tumor/normal ratios
>> of thousands of proteins, and both tumor and normal have a number of
>> replicates. Could such data be analyzed with limma?
>>
>> If limma cannot be used here, what statistical method is suitable so
>> that we can get statistically significant proteins with p-values? Any
>> suggestion is appreciated.
>>
>> Kind regards,
>> Yong
>

From friedman at cancercenter.columbia.edu  Tue Jun 19 21:50:13 2012
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Tue, 19 Jun 2012 15:50:13 -0400
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID: <66AC202F-785E-475F-B5CF-995F812DF0A1@cancercenter.columbia.edu>

Dear Yong,

It would be helpful if you could say something about the method used to identify differentially expressed proteins from quantitative proteomics data. Is it a protein microarray, and if so, which platform? Is it mass spec? I would think, and somebody please correct me if I am wrong, that continuous protein data could be analyzed similarly to continuous mRNA data as far as differential expression goes, although the preprocessing might be significantly different. For example, I am currently analyzing a JPT peptide array, and I am doing the preprocessing with Rapmad and the differential expression with limma.

With hopes that this helps,
Rich

On Jun 19, 2012, at 3:39 PM, Yong Li wrote:

> Dear Aaron,
>
> thank you for your quick answer! I have checked the help page of
> voom() but it seems to be used for count data. My data are just
> tumor/normal ratios. I am wondering if you could provide more details?
>
> Best regards,
> Yong
>
> On Tue, Jun 19, 2012 at 8:18 PM, Aaron Mackey
> wrote:
>> yes, it should be possible with a voom()-based analysis to get the
>> variances "right".
>>
>> -Aaron
>>
>> On Tue, Jun 19, 2012 at 12:47 PM, Yong Li
>> wrote:
>>>
>>> Hello,
>>>
>>> limma has been so valuable in microarray data analysis, but has anyone
>>> used limma for finding differentially expressed proteins from
>>> quantitative proteomics data?
>>> The data I got has tumor/normal ratios
>>> of thousands of proteins, and both tumor and normal have a number of
>>> replicates. Could such data be analyzed with limma?
>>>
>>> If limma cannot be used here, what statistical method is suitable so
>>> that we can get statistically significant proteins with p-values? Any
>>> suggestion is appreciated.
>>>
>>> Kind regards,
>>> Yong
>>

From tim.triche at gmail.com  Tue Jun 19 21:50:31 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 12:50:31 -0700
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From amackey at virginia.edu  Tue Jun 19 23:09:46 2012
From: amackey at virginia.edu (Aaron Mackey)
Date: Tue, 19 Jun 2012 17:09:46 -0400
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From guest at bioconductor.org  Tue Jun 19 23:20:30 2012
From: guest at bioconductor.org (spf385 [guest])
Date: Tue, 19 Jun 2012 14:20:30 -0700 (PDT)
Subject: [BioC] RnaSeqTutorial Package
Message-ID: <20120619212030.475E312690B@mamba.fhcrc.org>

Greetings,

I'm working my way through the tutorials for several RNA-Seq packages in R / Bioconductor (edgeR, DESeq, DEXSeq, ShortRead, etc.), and I've somewhat backed up to using the package RnaSeqTutorial (EBI, October 2011; Nicolas Delhomme).

As I work my way through the documentation, I stumble at section 2.3, Loading the annotation.

It provides

> library(BSgenome.Dmelanogaster.UCSC.dm3)

and I get the error message:

Error in library(BSgenome.Dmelanogaster.UCSC.dm3); there is no package called 'BSgenome.Dmelanogaster.UCSC.dm3'

Is the above a typo? Should it read something to the effect of

>library(BSgenome)
>Dmelano <- data(Dmelanogaster.UCSC.dm3, package = "BSgenome")

Or is it that BSgenome is built under a different version of R? I have downloaded the source package and can locate said "Dmelanogaster.UCSC.dm3-seed", but I can't seem to work my way through this step. Any suggestions? Thanks in advance.

-- output of sessionInfo():

> library(BSgenome.Dmelanogaster.UCSC.dm3)
Error in library(BSgenome.Dmelanogaster.UCSC.dm3); there is no package called 'BSgenome.Dmelanogaster.UCSC.dm3'

--
Sent via the guest posting facility at bioconductor.org.

From tim.triche at gmail.com  Tue Jun 19 23:50:22 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 14:50:22 -0700
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mailinglist.honeypot at gmail.com  Tue Jun 19 23:54:38 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Tue, 19 Jun 2012 17:54:38 -0400
Subject: [BioC] RnaSeqTutorial Package
In-Reply-To: <20120619212030.475E312690B@mamba.fhcrc.org>
References: <20120619212030.475E312690B@mamba.fhcrc.org>
Message-ID:

Hi,

On Tue, Jun 19, 2012 at 5:20 PM, spf385 [guest] wrote:
>
> Greetings,
>
> I'm working my way through the tutorials for several RNA-Seq packages in R / Bioconductor (edgeR, DESeq, DEXSeq, ShortRead, etc.), and I've somewhat backed up to using the package RnaSeqTutorial (EBI, October 2011; Nicolas Delhomme).
>
> As I work my way through the documentation I stumble at section 2.3 Loading the annotation.
>
> it provides,
>
>> library(BSgenome.Dmelanogaster.UCSC.dm3)
>
> and I get the error message: Error in library(BSgenome.Dmelanogaster.UCSC.dm3); there is no package called 'BSgenome.Dmelanogaster.UCSC.dm3'

You have to install this package, as it is not installed by default. For example:

R> source("http://bioconductor.org/biocLite.R")
R> biocLite("BSgenome.Dmelanogaster.UCSC.dm3")

> is the above a typo? and should it read something to the effect of
>
>>library(BSgenome)
>
>>Dmelano <- data(Dmelanogaster.UCSC.dm3, package = "BSgenome")
>
>
> Or is it that BSgenome is built under a different version of R? I have downloaded the source package and can locate said "Dmelanogaster.UCSC.dm3-seed"

Please don't take this the wrong way, but it seems like you're still a bit shaky with R. I'd strongly recommend getting a better handle on R basics before diving into any more advanced analysis (like the tutorials you are reading through). The investment of time in the basics will more than pay for itself and save you lots of frustration, cf. the old "tortoise and the hare" fable ...

HTH and good luck!

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From mgarciao at ufl.edu  Tue Jun 19 23:55:40 2012
From: mgarciao at ufl.edu (Garcia Orellana,Miriam)
Date: Tue, 19 Jun 2012 21:55:40 +0000
Subject: [BioC] Best package or code to filter Affymetrix probes by present calls??
In-Reply-To:
References: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu> <1340056060.56022.YahooMailNeo@web87703.mail.ir2.yahoo.com>,
Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F9B3DD@UFEXCH-MBXN01.ad.ufl.edu>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From tim.triche at gmail.com  Wed Jun 20 00:01:41 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Tue, 19 Jun 2012 15:01:41 -0700
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From patel.rimple at yahoo.com  Wed Jun 20 06:21:01 2012
From: patel.rimple at yahoo.com (Rimple Patel)
Date: Tue, 19 Jun 2012 21:21:01 -0700 (PDT)
Subject: [BioC] (no subject)
Message-ID: <1340166061.50410.YahooMailNeo@web45714.mail.sp1.yahoo.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From yongganw at oceanridgebio.com  Tue Jun 19 17:52:21 2012
From: yongganw at oceanridgebio.com (Yonggan Wu)
Date: Tue, 19 Jun 2012 11:52:21 -0400
Subject: [BioC] Can I do model comparison with type III ANOVA?
Message-ID: <4FE0A035.8010707@oceanridgebio.com>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From dcanvhet at gmail.com  Tue Jun 19 23:10:35 2012
From: dcanvhet at gmail.com (Dave Canvhet)
Date: Tue, 19 Jun 2012 23:10:35 +0200
Subject: [BioC] Differential drug effect on clinical groups
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From jaudall at gmail.com  Tue Jun 19 23:57:44 2012
From: jaudall at gmail.com (Joshua Udall)
Date: Tue, 19 Jun 2012 15:57:44 -0600
Subject: [BioC] DESeq and contrasts
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From bernd.fischer at embl.de  Wed Jun 20 08:09:52 2012
From: bernd.fischer at embl.de (Bernd Fischer)
Date: Wed, 20 Jun 2012 08:09:52 +0200
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID: <4FE16930.7090403@embl.de>

Dear Yong!

I used limma for ion count data. First I computed log-ratios per peptide and then summarized the log-ratios per protein. The protein log-ratios were then analyzed with limma. Have a look at our paper:

Castello, Fischer, et al., Insights into RNA Biology from an Atlas of Mammalian mRNA-Binding Proteins, Cell, 2012

Best,
Bernd

On 06/19/2012 06:47 PM, Yong Li wrote:
> Hello,
>
> limma has been so valuable in microarray data analysis, but has anyone
> used limma for finding differentially expressed proteins from
> quantitative proteomics data? The data I got has tumor/normal ratios
> of thousands of proteins, and both tumor and normal have a number of
> replicates. Could such data be analyzed with limma?
>
> If limma cannot be used here, what statistical method is suitable so
> that we can get statistically significant proteins with p-values? Any
> suggestion is appreciated.
>
> Kind regards,
> Yong

From gbayon at gmail.com  Wed Jun 20 09:40:57 2012
From: gbayon at gmail.com (Gustavo Fernández Bayón)
Date: Wed, 20 Jun 2012 09:40:57 +0200
Subject: [BioC] Newbie methylation and stats question
In-Reply-To:
References: <5E0BC3AD0EA44F6889E740B6DE1027A5@gmail.com>
Message-ID:

Well, to sum up, I wanted to thank you all for your kind and constructive answers. Now I am going to work through the references you provided. There are a lot of things to learn in this field, and I am still at the beginning. If I still have problems, be sure I'll be back on the list to ask.

Regards,
Gus

---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Tuesday, 19 June 2012 at 20:19, Tim Triche, Jr. wrote:

> Oh, I don't disagree that improper normalization is a bad idea. However, quantile normalization on the overall raw intensities (for example), assuming there are not gross differences in copy number, seems to work OK in many cases. I have seen people quantile normalizing on the summary statistics, which strikes me as perverse, but it's their data and their papers, not mine.
>
> I do tend to believe that methods which take into account the peculiarities of the platform are preferable to those that don't, but the former do exist; the trouble is that few systematic comparisons have been conducted, mostly on small or unusual datasets.
>
> As you point out, failing to take into account the differences between expression data (sparse transcripts, mostly absent) and genomic DNA (whether genotyping or "epigenotyping" arrays) can be expected to lead to poor results.
> I'm not a fan of blindly applying anything, hence the suggestion to plot the data first and ask questions thereafter :-)
>
> Cheers,
>
> --t
>
>
>
> On Tue, Jun 19, 2012 at 11:12 AM, Yao Chen wrote:
> > Hi Tim.
> >
> > I didn't mean that we shouldn't normalize methylation data because there is no standard method. What I want to say is that most of the existing normalization methods are derived from microarrays and don't fit methylation data. Most of these methods, such as quantile normalization, assume that most genes are not differentially expressed. However, in DNA methylation data, global hypomethylation is observed in many diseases such as cancer. An improper normalization method would erase the real biological difference.
> >
> > Jack
> >
> > 2012/6/19 Tim Triche, Jr.
> > > On Tue, Jun 19, 2012 at 7:46 AM, Yao Chen wrote:
> > > > As far as I know, there is no standard normalization for methylation data.
> > >
> > >
> > > As far as I know, there is no standard for microarray or RNAseq normalization either! But that doesn't mean an investigator should ignore the issue of technical (as opposed to biological) fixed or varying effects in their data. Especially if it could materially impact the outcome of a study. lumi offers quantile normalization, minfi & methylumi will do dye bias normalization, etc.
> > >
> > > For example, GenomeStudio appears to choose a reference array for dye bias adjustment within each batch of 450k samples, and correct using the normalization controls so that the chips in the run have equivalent Cy3:Cy5 bias to the reference. This is less than optimal if you then want to compare with another, separate batch. Personally I feel that it's better to start from IDATs.
> > >
> > > Another possibility is pernicious batch effects -- something like ComBat seems to work very well for those, usually, although as noted it's always up to the investigator to ensure that they are reporting on biologically (vs. technically) interesting differences.
> > >
> > > See for example http://www.biomedcentral.com/1755-8794/4/84
> > >
> > > > For me, I prefer keeping the raw value and just adjusting the technical variants. If anyone has a better solution, please let me know.
> > >
> > >
> > >
> > > See above. If the usual MDS plots indicate a supervised effect, one should fix it, preferably on the logit scale with ComBat, SVA, or something else appropriate to the task (i.e. if you're doing unsupervised analyses, a different method might be optimal).
> > >
> > > thanks,
> > >
> > > --t
> > >
> > >
> > >
> > > > Jack
> > > >
> > > > 2012/6/19 Tim Triche, Jr.
> > > > > Look up Andrew Jaffe and Rafa Irizarry's paper on "bump hunting" for
> > > > > regional differences. Or run a smooth over it (caveat: I just wrote
> > > > > smoothing "the way I want it" yesterday, after being provoked by a
> > > > > collaborator, so you might have to use lumi).
> > > > >
> > > > > The function "dmrFinder" in the "charm" package is specifically meant for
> > > > > this sort of thing.
> > > > >
> > > > > Also, if you're doing linear tests, be careful with normalization, mask
> > > > > your SNPs and chrX probes, and maybe use M-values (logit(beta)) for the
> > > > > task. The latter is more important for epidemiological datasets than
> > > > > something like cancer, where every single interesting result from M-value
> > > > > testing has been reproduced using untransformed beta values when I ran
> > > > > comparisons (e.g. HELP hg17 methylation differences for IDH1/2 mutants vs.
> > > > > Illumina hm450 differences for IDH1/2 mutants, the complete absence of any
> > > > > differences for TET2 mutants regardless of platform, etc.)
> > > > >
> > > > > Mark Robinson just chimed in, I see. Probably a good idea to read his
> > > > > reply carefully as well.
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 19, 2012 at 3:57 AM, Gustavo Fernández Bayón
> > > > > wrote:
> > > > >
> > > > > > Hi everybody.
> > > > > >
> > > > > > As a newbie to bioinformatics, it is not uncommon to find difficulties in
> > > > > > the way biological knowledge mixes with statistics. I come from the Machine
> > > > > > Learning field, and usually have problems with the naming conventions
> > > > > > (well, among several other things, I must admit). Besides, I am not an
> > > > > > expert in statistics, having used only what was barely necessary for the
> > > > > > validation of my work.
> > > > > >
> > > > > > Well, let's try to be more precise. One of the topics I am working on most
> > > > > > right now is the analysis of methylation array data. As you surely know, the
> > > > > > final processed (and normalized) beta values are presented in a p x n matrix,
> > > > > > where there are p different probes and n different samples or individuals
> > > > > > from which we have obtained the beta values. I am not currently working
> > > > > > with the raw data.
> > > > > >
> > > > > > Imagine, for a moment, that we have identified two regions of probes, A
> > > > > > and B, with a group of nA probes belonging to A, another group (of nB
> > > > > > probes) that belongs to B, and the intersection is empty. Say that we want
> > > > > > to find a way to show there is a statistically significant difference
> > > > > > between the methylation values of both regions.
> > > > > > As far as I have seen in the literature, comparisons (statistical tests)
> > > > > > are always done comparing the same probe values between case and control
> > > > > > groups of individuals or samples. For example, when we are trying to find
> > > > > > differentiated probes.
> > > > > >
> > > > > > However, if I think of directly comparing all the beta values from region
> > > > > > A (nA * n values) against the ones in region B (nB * n values) with, say, a
> > > > > > t test, I get the suspicion that something is not being done the way it
> > > > > > should.
> > > > > > My knowledge of Biology and Statistics is still limited and I
> > > > > > cannot explain why, but I have the feeling that there is something formally
> > > > > > wrong in this approach. Am I right?
> > > > > >
> > > > > > What I have done in similar experiments has been to find differentiated
> > > > > > probes, and then do a test on the proportion of differentiated probes to the
> > > > > > total number of them, so I could assign a p-value to show that there was a
> > > > > > significant influence of the region of reference.
> > > > > >
> > > > > > Several questions here: what could be a coherent approach to the
> > > > > > regions A and B problem stated above? Is there any problem with methylation
> > > > > > data I am not aware of which makes only the in-probe analysis valid? Any
> > > > > > bibliographic references that could help me see the subtleties around it?
> > > > > >
> > > > > > As you can see, concepts are quite interleaved in my mind, so any help
> > > > > > would be very much appreciated.
> > > > > > Regards,
> > > > > > Gustavo
> > > > > >
> > > > > > ---------------------------
> > > > > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > > > >
> > > > > --
> > > > > *A model is a lie that helps you see the truth.*
> > > > >
> > > > > Howard Skipper
> > > > >
> > > > > [[alternative HTML version deleted]]
> > >
> > > --
> > > A model is a lie that helps you see the truth.
> > >
> > > Howard Skipper (http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf)
>
> --
> A model is a lie that helps you see the truth.
> > Howard Skipper (http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf) From Heidi.Dvinge at cancer.org.uk Wed Jun 20 10:45:47 2012 From: Heidi.Dvinge at cancer.org.uk (Heidi Dvinge) Date: Wed, 20 Jun 2012 09:45:47 +0100 Subject: [BioC] HTqPCR In-Reply-To: References: <6D0043C9-4BAE-469C-8369-8733D7D53644@cancer.org.uk> <6C95A3CE-902D-4068-B64A-0A2813071A1A@cancer.org.uk> <50B7F68B-3762-4FF4-8F97-692ED30F06AE@cancer.org.uk> <99093F24-FF41-4FA0-BE55-BB0AF5C3D010@cancer.org.uk> Message-ID: <19700BD8-ECBB-4D72-AA23-1C5D617531E7@cancer.org.uk> Hi Silvia, On 19 Jun 2012, at 18:48, Silvia Halim wrote: > Hi Heidi, > > Thanks for your tips. I figured I could probably use plotCtVariation(). I am able to use the function to plot variation across samples but how can I use it to plot across genes (or features)? > I tried following commands: > plotCtVariation(temp[,1:10], variation = "sd", log = TRUE, main = "SD of replicated features", col = "lightgrey") > plotCtVariation(temp[1:10,], variation = "sd", log = TRUE, main = "SD of replicated features", col = "lightgrey") Here you're just subsetting your qPCRset object before plotting, but you're not changing the actual plots. > There's a difference in the plots but both plots give me same labels on x-axis, i.e. sample names, though I was expecting the second command would give me gene names on x-axis label. > If you look at the plotCtVariation help files (especially the 'Examples' and 'Details' sections), the parameters sample.reps and feature.reps control whether you plot the variation for each gene across or within samples. In order to get gene names, you have to set sample.reps, to indicate which samples are replicates of each other. By default, the function calculates the variation between replicated features within each of your samples, and plots the distribution (boxplot) of this variation for each sample.
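As a rough illustration of the default calculation described above (the SD between replicated features within each sample), here is a minimal base-R sketch. It does not call HTqPCR: the Ct matrix `ct`, the gene names and the sample names are all made up for the example, and it is not meant to reproduce what plotCtVariation() does internally.

```r
# Hypothetical Ct values: 3 genes, each replicated twice, measured in 3 samples.
# The matrix is filled column by column, so each block of 6 values is one sample.
ct <- matrix(c(20.1, 20.5, 25.0, 26.0, 30.2, 30.0,
               21.0, 21.4, 24.8, 25.2, 31.0, 29.0,
               19.9, 20.3, 25.5, 25.1, 30.5, 30.9),
             nrow = 6,
             dimnames = list(rep(c("geneA", "geneB", "geneC"), each = 2),
                             c("sample1", "sample2", "sample3")))

# For each sample (column), compute the SD across the replicate rows of each gene.
sd_by_gene <- apply(ct, 2, function(x) tapply(x, rownames(ct), sd))

# Rows are genes, columns are samples.
print(round(sd_by_gene, 3))
```

Passing `sd_by_gene` to boxplot() would then give one box per sample, which is roughly the kind of per-sample summary described for the default plot.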
If you want to check individual features or samples more specifically, you have to use type="detail" and possibly add.featurenames=TRUE. There are some examples included in the plotCtVariation help file. > Also, the manual says we can exclude unreliable or undetermined data by setting the Ct values to NA using filterCategory. I am wondering how I can get rid of NA data from the plate. I also cannot exclude this kind of data or those having 'Failed' flags the very first time before reading in the input as a qPCRset object because the input has to be something like 48 x 48 or 96 x 96. > The question is, why do you want to remove the NA values? If you just leave them as NAs, then they're ignored during e.g. the calculation of differential expression and for most plotting purposes. You can't remove them as such, since (as you note), the object has to be in a certain features x samples format. If you want, you can replace them though, if you e.g. want to set all NA values to Ct=40: exprs(temp)[is.na(exprs(temp))] <- 40. But beware, because in that case the value '40' will be included in all numerical calculations, which may not be what you want. HTH \Heidi > Many thanks, > Silvia > > -----Original Message----- > From: Heidi Dvinge > Sent: 19 June 2012 2:05 PM > To: Silvia Halim > Cc: bioconductor at r-project.org > Subject: Re: HTqPCR > > Hi Silvia, > > On 18 Jun 2012, at 17:51, Silvia Halim wrote: > >> Hi Heidi, >> >> The function breaks at plotCtReps. >>> traceback() >> 1: plotCtReps(temp, card = 2, percent = 20, xlim = c(0, 100), ylim = c(0, >> 100)) >> >> You've pointed out the problem about the duplicates as I have 3 replicates on my assay. I got confused reading the manual as it says plotCtReps can be used for a sample containing duplicate measurements (which I thought to be 2 or more measurements).
>>> table(table(featureNames(temp))) >> >> 3 6 >> 30 1 >> > If you try running the examples for plotCtReps, you'll see that the function directly plots two replicates of a feature against each other on the (x,y) axis. 3D (x,y,z) plots aren't implemented, so features that are replicated 3 times can't be plotted. I'll try to clarify the text for the function. > > Perhaps something like plotCtVariation() will give you what you're after? If you only want to visually inspect your data, then grep("plot", ls("package:HTqPCR"), value=TRUE) will list all the plotting functions available in HTqPCR. > > HTH > \Heidi > >> Btw there's no NA in my data. >>> sum(is.na(temp)) >> [1] 0 >> Warning message: >> In is.na(temp) : is.na() applied to non-(list or vector) of type 'S4' >>> >> >> Thanks, >> Silvia >> >> -----Original Message----- >> From: Heidi Dvinge >> Sent: 15 June 2012 9:06 PM >> To: Silvia Halim >> Cc: bioconductor at r-project.org >> Subject: Re: HTqPCR >> >> Hi Silvia, >> >> On 15 Jun 2012, at 18:45, Silvia Halim wrote: >> >>> Hi Heidi, >>> >>> I ran into below problem when using plotCtReps. 
>>> >>>> plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = >>>> c(0,50)) >>> Error in split.data[[s]] : subscript out of bounds In addition: >>> Warning messages: >>> 1: In min(x, na.rm = na.rm) : >>> no non-missing arguments to min; returning Inf >>> 2: In max(x, na.rm = na.rm) : >>> no non-missing arguments to max; returning -Inf >>>> plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = >>>> c(0,50)) >>> Error in split.data[[s]] : subscript out of bounds In addition: >>> Warning messages: >>> 1: In min(x, na.rm = na.rm) : >>> no non-missing arguments to min; returning Inf >>> 2: In max(x, na.rm = na.rm) : >>> no non-missing arguments to max; returning -Inf >>>> plotCtReps(temp, card = 2, percent = 20, xlim = c(0,100), ylim = >>>> c(0,100)) >>> Error in split.data[[s]] : subscript out of bounds In addition: >>> Warning messages: >>> 1: In min(x, na.rm = na.rm) : >>> no non-missing arguments to min; returning Inf >>> 2: In max(x, na.rm = na.rm) : >>> no non-missing arguments to max; returning -Inf >> >> What's the output from traceback(), i.e. exactly where does the function break? >>> >> A couple of things you can try: >> >> - plotCtReps is meant to be used in cases where there are exactly 2 replicates of the features on your assay. Is this the case? For example, with the data below there are 190 features that will be plotted, and 1 that will be skipped: >>> data(qPCRraw) >>> table(table(featureNames(qPCRraw))) >> 2 4 >> 190 1 >> >> - are there any NAs in your data? E.g. sum(is.na(qPCRraw))>0. >> >> HTH >> \Heidi >> >>> Here is how 'temp' looks like >>>> temp >>> An object of class "qPCRset" >>> Size: 96 features, 96 samples >>> Feature types: Reference, Test >>> Feature names: b-Actin b-Actin b-Actin ... >>> Feature classes: >>> Feature categories: OK >>> Sample names: NTC_4 PMPT352 NTC_3 ... >>> >>> Do you know why it is complaining about split.data? 
>>> >>> Thanks, >>> Silvia >>> >>> -----Original Message----- >>> From: Heidi Dvinge >>> Sent: 11 June 2012 6:11 PM >>> To: Silvia Halim >>> Subject: Re: HTqPCR >>> >>> Ok, so you already have a 96 by 96 matrix, so you don't need changeCtLayout. >>> Good luck with the rest, and let me know if you encounter any problems. >>> >>> On 11 Jun 2012, at 19:05, Silvia Halim wrote: >>> >>>> Hi Heidi, >>>> >>>> Thank you for your clarification. >>>> >>>> Btw this is how it looks like when I type 'temp' >>>>> temp >>>> An object of class "qPCRset" >>>> Size: 96 features, 96 samples >>>> Feature types: Reference, Test >>>> Feature names: b-Actin b-Actin b-Actin ... >>>> Feature classes: >>>> Feature categories: OK >>>> Sample names: NTC_4 PMPT352 NTC_3 ... >>>> >>>> Cheers, >>>> Silvia >>>> >>>> -----Original Message----- >>>> From: Heidi Dvinge >>>> Sent: 08 June 2012 7:12 PM >>>> To: Silvia Halim >>>> Subject: Re: HTqPCR >>>> >>>> Hi Silvia, >>>> >>>> what are the dimensions of the "temp" object that you read in? I.e. >>>> what does it look like if you just type >>>>> temp >>>> >>>> If you read in the data with n.features=96 and n.data=96, then you should already have an object with 96 rows and 96 columns, in which case you don't need to change the layout. >>>> >>>> Best, >>>> \Heidi >>>> >>>> On 8 Jun 2012, at 19:13, Silvia Halim wrote: >>>> >>>>> Hi Heidi, >>>>> >>>>> I finally have time to try out your HTqPCR bioconductor package again and I was trying to use 'changeCtLayout' function. However, I got following error message: >>>>> >>>>>> qPCRnew <- changeCtLayout(temp, sample.order = sample_order) >>>>> Error in data.frame(..., check.names = FALSE) : >>>>> arguments imply differing number of rows: 0, 96 In addition: >>>>> Warning >>>>> message: >>>>> In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) 
: >>>>> data length is not a multiple of split variable >>>>> >>>>> The commands that I run are as follows: >>>>>> temp <- readCtData("110614 BENIGN_1 DATA 96X96.csv", path = >>>>>> getwd(), n.features = 96, n.data=96, flag = 9, feature = 5, type= >>>>>> 6, Ct = 7, position = 1, skip = 12, sep = ",") sample_order <- >>>>>> rep(sampleNames(temp), each = 96) qPCRnew <- changeCtLayout(temp, >>>>>> sample.order = sample_order) >>>>> >>>>> I've tried to follow what's written in changeCtLayout function description. Can you please advise what went wrong? >>>>> >>>>> Thanks, >>>>> Silvia >>>>> >>>>> -----Original Message----- >>>>> From: Heidi Dvinge >>>>> Sent: 29 April 2012 8:18 PM >>>>> To: Silvia Halim >>>>> Subject: Re: HTqPCR >>>>> >>>>> HI Silvia, >>>>> >>>>> I'm glad you got it working. Depending on what you're supposed to do with the data, you may need to tweak some functions slightly, as you mention. Let me know if you run into any more trouble. >>>>> >>>>> Cheers >>>>> \Heidi >>>>> >>>>> On 26 Apr 2012, at 18:37, Silvia Halim wrote: >>>>> >>>>>> Hi Heidi, >>>>>> >>>>>> Thanks for the help! It's working for me now. Right now I'm figuring it out how I can use the functions that you described in the vignette. I might have to tweak the parameters for using the functions on Fluidigm data. >>>>>> >>>>>> Cheers, >>>>>> Silvia >>>>>> >>>>>> -----Original Message----- >>>>>> From: Heidi Dvinge >>>>>> Sent: 25 April 2012 8:56 AM >>>>>> To: Silvia Halim >>>>>> Subject: Re: HTqPCR >>>>>> >>>>>> Hiya, >>>>>> >>>>>> sorry, I only just now realised that you'd attached a file. When I saved as csv, the following command worked: >>>>>> >>>>>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", >>>>>>> format="BioMark", >>>>>>> n.features=96*96) raw >>>>>> An object of class "qPCRset" >>>>>> Size: 9216 features, 1 samples >>>>>> Feature types: >>>>>> Feature names: b-Actin b-Actin b-Actin ... 
>>>>>> Feature classes: >>>>>> Feature categories: OK >>>>>> Sample names: 110614 BENIGN_1 DATA 96x96 ... >>>>>> >>>>>> The data isn't transformed into a 96x96 format immediately though (in case you read in multiple arrays, and want to normalise them independently). If you want to change this, you can use changeCtLayout(). Alternatively you can say: >>>>>> >>>>>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", >>>>>>> format="BioMark", n.features=96, n.data=96) raw >>>>>> An object of class "qPCRset" >>>>>> Size: 96 features, 96 samples >>>>>> Feature types: >>>>>> Feature names: b-Actin b-Actin b-Actin ... >>>>>> Feature classes: >>>>>> Feature categories: OK >>>>>> Sample names: Sample1 Sample2 Sample3 ... >>>>>>> plotCtArray(raw) >>>>>> >>>>>> HTH >>>>>> \Heidi >>>>>> >>>>>> On 24 Apr 2012, at 17:55, Silvia Halim wrote: >>>>>> >>>>>>> Hi Heidi, >>>>>>> >>>>>>> I have some problems updating R on lustre. Therefore, I chose to run HTqPCR on my desktop for the moment. >>>>>>> >>>>>>> Reading in your sample file looks fine, however, reading in the >>>>>>> file that I showed you just now gave me below error message. 
(The >>>>>>> file is as attached) >>>>>>> >>>>>>>> temp <- readCtData("110614 BENIGN_1 DATA 96x96.xlsx", path = >>>>>>>> getwd() , n.features = 96*96, flag = 9, feature = 5, type= 6, Ct >>>>>>>> = 7,position = 1, skip = 12, sep = ",") >>>>>>> Error in read.table(file = file, header = header, sep = sep, quote = quote, : >>>>>>> no lines available in input >>>>>>> In addition: Warning message: >>>>>>> In readLines(file, skip) : >>>>>>> incomplete final line found on 'C:/Users/halim01/Documents/20110627_RossAdamsH_DN_Fluid/110614 BENIGN_1 DATA 96x96.xlsx' >>>>>>>> sessionInfo() >>>>>>> R version 2.14.0 (2011-10-31) >>>>>>> Platform: x86_64-pc-mingw32/x64 (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C LC_TIME=English_United Kingdom.1252 >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] Biostrings_2.22.0 IRanges_1.12.6 BiocInstaller_1.2.1 marray_1.32.0 HTqPCR_1.8.0 limma_3.10.3 RColorBrewer_1.0-5 Biobase_2.14.0 gdata_2.8.2 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] affy_1.32.1 affyio_1.22.0 gplots_2.10.1 gtools_2.6.2 preprocessCore_1.16.0 tools_2.14.0 zlibbioc_1.0.1 >>>>>>>> >>>>>>> >>>>>>> I did a quick check on the file and it only has 9228 lines including 12 header lines which I had skipped when reading in the file. Do you know what could possibly go wrong? >>>>>>> >>>>>>> Cheers, >>>>>>> Silvia >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Heidi Dvinge >>>>>>> Sent: 24 April 2012 5:09 PM >>>>>>> To: Silvia Halim >>>>>>> Subject: Re: HTqPCR >>>>>>> >>>>>>> Hm, that looks like it may be x11 acting up. I often have similar issues when I work on a remote server. >>>>>>> >>>>>>> Actually, the processing of Fluidigm files is very computationally light. 
So you can easily do it on your desktop, if you can't update on lustre. >>>>>>> >>>>>>> I can also email you and older version of the vignette if you want to have a look. However, in HTqPCR 1.2.0 I don't even think I had a dedicated function for plotting the Fluidigm assays yet (the plotCtArray shown in the vignette). >>>>>>> >>>>>>> Cheers >>>>>>> \Heidi >>>>>>> >>>>>>> On 24 Apr 2012, at 16:39, Silvia Halim wrote: >>>>>>> >>>>>>>> Hi Heidi, >>>>>>>> >>>>>>>> This is what I got when accessing the vignette. >>>>>>>> >>>>>>>>> openVignette(package="HTqPCR") >>>>>>>> Please select a vignette: >>>>>>>> >>>>>>>> 1: HTqPCR - qPCR analysis in R >>>>>>>> >>>>>>>> Selection: 1 >>>>>>>> Opening >>>>>>>> /home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf >>>>>>>>> xprop: unable to open display '' >>>>>>>> /usr/local/bin/xdg-open: line 370: firefox: command not found >>>>>>>> /usr/local/bin/xdg-open: line 370: mozilla: command not found >>>>>>>> /usr/local/bin/xdg-open: line 370: netscape: command not found >>>>>>>> xdg-open: no method available for opening '/home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf' >>>>>>>> >>>>>>>> Sorry for the confusion, you are right that I was looking at a newer version of HTqPCR than the one installed on lustre. I think that's because I have different installations of HTqPCR on lustre and on my desktop. If I can update the one on lustre, I'll go ahead with the update. >>>>>>>> >>>>>>>> Thank you, >>>>>>>> Silvia >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Heidi Dvinge >>>>>>>> Sent: 24 April 2012 4:28 PM >>>>>>>> To: Silvia Halim >>>>>>>> Subject: Re: HTqPCR >>>>>>>> >>>>>>>> Ah, right, it looks like you have an older version of R, and therefore also HTqPCR. >>>>>>>> >>>>>>>> The most current release version is 1.10.0. In that version, readCtData() was modified to accept different types of input data, including from Fluidigm. Before that, this sort of data had to be read in 'manually'. 
>>>>>>>> >>>>>>>> I guess the vignette that you were looking at comes from a >>>>>>>> version of HTqPCR that's newer than the one you have installed? >>>>>>>> If you access the vignette corresponding to your HTqPCR version >>>>>>>> via >>>>>>>>> openVignette(package="HTqPCR") >>>>>>>> what do you get then? >>>>>>>> >>>>>>>> If you get an older version, then depending on how old it is, there may be a section towards the end giving an example of how to process Fluidigm data more 'manually'. If not, an update may be your best bet. >>>>>>>> >>>>>>>> Cheers >>>>>>>> \Heidi >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 24 Apr 2012, at 16:15, Silvia Halim wrote: >>>>>>>> >>>>>>>>> Hi Heidi, >>>>>>>>> >>>>>>>>> Thanks for looking into the matter. Below is the output of my >>>>>>>>> sessionInfo() >>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>> R version 2.13.0 (2011-04-13) >>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >>>>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] marray_1.26.0 Biostrings_2.20.1 IRanges_1.10.3 HTqPCR_1.2.0 >>>>>>>>> [5] limma_3.6.9 RColorBrewer_1.0-2 Biobase_2.12.1 gdata_2.8.0 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] affy_1.26.1 affyio_1.20.0 gplots_2.8.0 >>>>>>>>> [4] gtools_2.6.2 preprocessCore_1.14.0 >>>>>>>>>> >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Silvia >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: Heidi Dvinge >>>>>>>>> Sent: 24 April 2012 4:07 PM >>>>>>>>> To: Silvia Halim >>>>>>>>> Subject: HTqPCR >>>>>>>>> >>>>>>>>> Hi Silvia, >>>>>>>>> >>>>>>>>> I 
just tested the read fluidigm from the vignette, and it works on both my mac and a single unix system that I've tested. Although from the errors you were getting, it seemed like the headers weren't been read correctly/at all. >>>>>>>>> >>>>>>>>> Would you mind sending me the output of your sessionInfo(), so I can compare which package versions we have? >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> \Heidi >>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>> R version 2.15.0 (2012-03-30) >>>>>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] >>>>>>>>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] tools stats graphics grDevices utils datasets methods base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] HTqPCR_1.10.0 limma_3.12.0 RColorBrewer_1.0-5 Biobase_2.16.0 >>>>>>>>> [5] BiocGenerics_0.2.0 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] affy_1.34.0 affyio_1.24.0 BiocInstaller_1.4.3 >>>>>>>>> [4] gdata_2.8.2 gplots_2.10.1 gtools_2.6.2 >>>>>>>>> [7] preprocessCore_1.18.0 stats4_2.15.0 zlibbioc_1.2.0 >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> <110614 BENIGN_1 DATA 96x96.xlsx> >>>>>> >>>>> >>>> >>> >> > NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for ...{{dropped:16}} From okko at clevert.de Wed Jun 20 10:57:17 2012 From: okko at clevert.de (=?iso-8859-1?Q?Djork-Arn=E9_Clevert?=) Date: Wed, 20 Jun 2012 10:57:17 +0200 Subject: [BioC] Best package or code to filter Affymetrix probes by present calls?? 
In-Reply-To: <7F10E9EDBB347E4CA0765A3139C110BB14F9B3DD@UFEXCH-MBXN01.ad.ufl.edu> References: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu> <1340056060.56022.YahooMailNeo@web87703.mail.ir2.yahoo.com> <7F10E9EDBB347E4CA0765A3139C110BB14F9B3DD@UFEXCH-MBXN01.ad.ufl.edu> Message-ID: <422D2675-1DBE-4682-A6FD-0E2771676ACA@clevert.de> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From zeynep.ozkeserli at gmail.com Wed Jun 20 11:18:48 2012 From: zeynep.ozkeserli at gmail.com (=?ISO-8859-1?Q?zeynep_=F6zkeserli?=) Date: Wed, 20 Jun 2012 12:18:48 +0300 Subject: [BioC] Non-Specific Filtering with "nsFilter" Question Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From sudeep.sahadevan at scai-extern.fraunhofer.de Wed Jun 20 11:36:13 2012 From: sudeep.sahadevan at scai-extern.fraunhofer.de (Sudeep Sahadevan) Date: Wed, 20 Jun 2012 11:36:13 +0200 (CEST) Subject: [BioC] WGCNA chooseTopHubInEachModule function In-Reply-To: <1029755442.1043986.1340184648948.JavaMail.root@scai-extern.fraunhofer.de> Message-ID: <1541271595.1044256.1340184972969.JavaMail.root@scai-extern.fraunhofer.de> Hi all, In WGCNA R package the default "power" argument for the function "chooseTopHubInEachModule" is 2. My question is there anyway to test what would be the optimum argument to use for a signed network ? Thank you in advance. Regards, Sudeep. From vilanew at gmail.com Wed Jun 20 12:34:57 2012 From: vilanew at gmail.com (David martin) Date: Wed, 20 Jun 2012 12:34:57 +0200 Subject: [BioC] HTqPCR problem Message-ID: I'm manually building a qPCRset object that used to work until i switch from R 2.12 to 2.15. 
>dim(X) #data matrix (for the moment it contains only zero values) [1] 3 72 >dim(cat)#data matrix with charactacter string("OK") [1] 3 72 #Build Qpcr Object out <- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=cat) > out An object of class "qPCRset" Size: 0 features, 72 samples Feature types: Feature names: NA NA NA ... Feature classes: Error in `row.names<-.data.frame`(`*tmp*`, value = value) : invalid 'row.names' length > What is the problem ???? From vernonvisser at sun.ac.za Wed Jun 20 12:55:03 2012 From: vernonvisser at sun.ac.za (Visser, V, Dr ) Date: Wed, 20 Jun 2012 12:55:03 +0200 Subject: [BioC] Problems installing EBImage Message-ID: <63F358F0ED08BD40AADA9D735420310F03316437B717@STBEVS08.stb.sun.ac.za> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jpflorido at gmail.com Wed Jun 20 13:20:41 2012 From: jpflorido at gmail.com (=?ISO-8859-1?Q?Javier_P=E9rez_Florido?=) Date: Wed, 20 Jun 2012 13:20:41 +0200 Subject: [BioC] boxplot/histograms on preprocessed SNP affy data with crlmm package Message-ID: <4FE1B209.4060305@gmail.com> Dear list, I have an SNP 6.0 affymetrix data set and before genotype calling, an exploratory analysis (boxplot, histograms, MAplots) on raw data are run by means of the oligo package. However, I would like to run such exploratory analysis once the data is normalized. crlmm function from crlmm package performs genotype calling and returns an SNPSet object (calls, confs, SNP quality score and SNR information). However, I don't know how to access to normalized data to run such exploratory analysis and compare such analysis with the one run with raw data. Any suggestions? All the best, Javier From huwenhuo at gmail.com Wed Jun 20 13:44:37 2012 From: huwenhuo at gmail.com (wenhuo hu) Date: Wed, 20 Jun 2012 07:44:37 -0400 Subject: [BioC] HTqPCR problem In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From guest at bioconductor.org Wed Jun 20 13:53:47 2012 From: guest at bioconductor.org (Igor Ulitsky [guest]) Date: Wed, 20 Jun 2012 04:53:47 -0700 (PDT) Subject: [BioC] .wig files for strand-specific paired-end RNA-Seq Message-ID: <20120620115347.6E651133D06@mamba.fhcrc.org> Hi, Is there a simple way to make strand-specific .wig file (i.e., a separate track for + and - strand) from paired-end data (where the second read maps to the other strand)? I've tried using this: library(Rsamtools) library(rtracklayer) myReads <- readGappedAlignments("RNAseqMapping.bam") coveragePlus <- coverage(myReads[strand(myReads) == '+']) export(coveragePlus, "RNAplus.wig") coverageMinus <- coverage(myReads[strand(myReads) == '-']) export(coverageMinus, "RNAminus.wig") But it appears that the second read in the pair contributes to the other strand, generating similar tracks for the + and the - strands. Is there a way to deal with this better? Thanks! Igor. -- output of sessionInfo(): R version 2.13.1 (2011-07-08) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.13.1 -- Sent via the guest posting facility at bioconductor.org. From beniltoncarvalho at gmail.com Wed Jun 20 14:16:14 2012 From: beniltoncarvalho at gmail.com (Benilton Carvalho) Date: Wed, 20 Jun 2012 13:16:14 +0100 Subject: [BioC] boxplot/histograms on preprocessed SNP affy data with crlmm package In-Reply-To: <4FE1B209.4060305@gmail.com> References: <4FE1B209.4060305@gmail.com> Message-ID: Probably http://www.jstatsoft.org/v40/i12/paper is relevant for what you want to achieve? Also, the accessors calls(), confs() should work to get the results you want... 
SNR is stored in the phenoData slot, so SNPSetObject$SNR should get you what you need (check head(pData(SNPSetObject)) )... for feature information, check head(fData(SNPSetObject)) . b On 20 June 2012 12:20, Javier Pérez Florido wrote: > Dear list, > I have an SNP 6.0 affymetrix data set and before genotype calling, an > exploratory analysis (boxplot, histograms, MAplots) on raw data are run by > means of the oligo package. > > However, I would like to run such exploratory analysis once the data is > normalized. crlmm function from crlmm package performs genotype calling and > returns an SNPSet object (calls, confs, SNP quality score and SNR > information). However, I don't know how to access to normalized data to run > such exploratory analysis and compare such analysis with the one run with > raw data. > > Any suggestions? > All the best, > Javier > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From heidi at ebi.ac.uk Wed Jun 20 14:37:12 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Wed, 20 Jun 2012 13:37:12 +0100 Subject: [BioC] HTqPCR problem In-Reply-To: References: Message-ID: <6a7176d98befcb716fc5330cd51a33fa.squirrel@webmail.ebi.ac.uk> > Is it possible from repeat row names in X? It could be something along those lines, although duplicated row names in X should throw an error already when creating the qPCRset object. In the later versions, HTqPCR has been modified to inherit from eSet classes, and be more strict about formats. David, could you please provide some more information about X and cat, or a reproducible example?
For me, the following works: > X <- matrix(0, ncol=4, nrow=4) > cat <- matrix("OK", nrow=4, ncol=4) > out <- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=cat) > out An object of class "qPCRset" Size: 0 features, 4 samples Feature types: Feature names: NA NA NA ... Feature classes: Feature categories: OK, OK, OK, OK Sample names: 1 2 3 ... Also, it looks like the object is created alright, but it's the 'show' method that's not working. More specifically, the feature categories. Does the following work? featureCategory(out) And if not, what happens if you say out <- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=as.matrix(cat)) out Best, \Heidi > On Jun 20, 2012 6:36 AM, "David martin" wrote: > >> I'm manually building a qPCRset object that used to work until i switch >> from R 2.12 to 2.15. >> >> >> >dim(X) #data matrix (for the moment it contains only zero values) >> [1] 3 72 >> >> >dim(cat)#data matrix with charactacter string("OK") >> [1] 3 72 >> >> #Build Qpcr Object >> out <- new("qPCRset", exprs=X, flag=as.data.frame(X), >> featureCategory=cat) >> >> >> > out >> An object of class "qPCRset" >> Size: 0 features, 72 samples >> Feature types: >> Feature names: NA NA NA ... >> Feature classes: >> Error in `row.names<-.data.frame`(`***tmp*`, value = value) : >> invalid 'row.names' length >> > >> >> What is the problem ???? 
>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > From vilanew at gmail.com Wed Jun 20 14:38:51 2012 From: vilanew at gmail.com (David martin) Date: Wed, 20 Jun 2012 14:38:51 +0200 Subject: [BioC] HTqPCR problem In-Reply-To: References: Message-ID: No, the X matrix is empty. Here is a snippet to reproduce the problem. Test on your side and see if you get the same error: library(HTqPCR) X <- matrix(0,3,72) cat <- data.frame(matrix("OK", ncol=72, nrow=3), stringsAsFactors=FALSE) out <- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=cat) out Any idea why this is not working?
> sessionInfo() R version 2.15.0 (2012-03-30) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid splines stats graphics grDevices utils datasets [8] methods base other attached packages: [1] HH_2.3-15 latticeExtra_0.6-19 leaps_2.9 [4] lattice_0.20-6 gplots_2.10.1 KernSmooth_2.23-7 [7] caTools_1.13 bitops_1.0-4.1 gdata_2.8.2 [10] gtools_2.6.2 multcomp_1.2-12 survival_2.36-14 [13] mvtnorm_0.9-9992 pROC_1.5.1 plyr_1.7.1 [16] HTqPCR_1.10.0 limma_3.12.0 RColorBrewer_1.0-5 [19] Biobase_2.16.0 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] affy_1.34.0 affyio_1.24.0 BiocInstaller_1.4.6 [4] preprocessCore_1.18.0 stats4_2.15.0 tools_2.15.0 [7] zlibbioc_1.2.0 > On 06/20/2012 01:44 PM, wenhuo hu wrote: > Is it possible from repeat row names in X? > On Jun 20, 2012 6:36 AM, "David martin" wrote: > >> I'm manually building a qPCRset object that used to work until i switch >> from R 2.12 to 2.15. >> >> >>> dim(X) #data matrix (for the moment it contains only zero values) >> [1] 3 72 >> >>> dim(cat)#data matrix with charactacter string("OK") >> [1] 3 72 >> >> #Build Qpcr Object >> out<- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=cat) >> >> >>> out >> An object of class "qPCRset" >> Size: 0 features, 72 samples >> Feature types: >> Feature names: NA NA NA ... >> Feature classes: >> Error in `row.names<-.data.frame`(`***tmp*`, value = value) : >> invalid 'row.names' length >>> >> >> What is the problem ???? 
>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > From heidi at ebi.ac.uk Wed Jun 20 14:56:37 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Wed, 20 Jun 2012 13:56:37 +0100 Subject: [BioC] HTqPCR In-Reply-To: References: Message-ID: Hi Simon, thanks for your email, and sorry if the example files are shoddy. I'm travelling the next couple of days, but will have a look as soon as I'm back. Off the top of my head, I seem to remember that for some of the BioMark files I originally had access to, the sample names were in slightly different formats. Therefore, as a temporary measure I just ignored sample names in the file, in which case the sample names can be added with the 'samples' parameter to readCtData(). Or alternatively, added with sampleNames(object) <- c("",...) later. Obviously, that's not optimal though, and I should fix that for future versions.
In the case of the example file, a (somewhat messy) workaround may be to say:

> exPath <- system.file("exData", package = "HTqPCR")
> temp <- read.csv(file.path(exPath, "BioMark_sample.csv"), as.is=TRUE, skip=11)
> raw1 <- readCtData(files = "BioMark_sample.csv", path = exPath, format = "BioMark", n.features = 48, n.data = 48, samples=temp$Name[seq(1,nrow(temp), 48)])
> head(sampleNames(raw1))
[1] "no preamp"     "preamp neat"   "no preamp.1"   "preamp neat.1" "no preamp.2"
[6] "preamp neat.2"

Apart from the missing sample names, what exactly are the problems you're seeing when importing your own data into a qPCRset object?

HTH
\Heidi

> Hi Heidi,
> I'm working on getting your package routinely going for our biomark
> data, and it looks really good. However, I've been trying to use your
> example Biomark file in the package as an example, and I'm running into a
> few problems. Firstly, it's pretty normal to have uneven numbers of samples
> per plate. We never do a single sample per plate! Remember, on the small
> plates, there are 2304 PCR reactions, and on the large plates, 9216. Due
> to the large volume of samples which can be run, we typically do multiple
> different samples with the same suite of genes per plate, with varying
> replicates. However, it can be tough to make things symmetrical due to
> running of controls (no template controls, etc). We also strive to have
> the samples fields present in the CSV files as it makes things easier for
> our existing analysis.
>
> In the file you supply, the samples do not appear to be recognized by your
> scripts.
If you do > pData(raw1) > > on the newly imported Biomark file you supply, you just get a generic list > of sample ID's for each of the 48 chambers, and the > > raw1 <- readCtData(files = "BioMark_sample2.csv",path = exPath, format = > "BioMark", n.features = 48,n.data = 48) > > command does not appear to import the sample names associated with each of > the genes which are as follows: > > > no preamp > preamp neat > preamp 1:10 > preamp 1:100 > preamp 1:1000 > > In the biomark file, there are also a number of samples which do not have > any identifiers (blank), so I filled these in as "Blank". Further, the > number of assays being run for each of these samples is quite variable, > ranging form a low of 48 assays, up to 1152 assays. > > I'm having trouble using the examples you provided to reformat the data so > that I can use your excellent tools. I'm also puzzled why the sample names > within the file are not being imported along with the Ct and gene data. > > I'd really like to be able to try various normalizing methods, but I cant > get there till we can import the data so it makes sense. > > best > > Simon. > > Simon Melov Ph.D. > Associate Professor & > Director of Genomics > Buck Institute for Research on Aging > 8001 Redwood Blvd > Novato, CA 94945 > > Office: 415 209 2068 > Cell: 415 827 4979 > Fax: 415 209 9920 > > > > > > > From lawrence.michael at gene.com Wed Jun 20 14:59:34 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Wed, 20 Jun 2012 05:59:34 -0700 Subject: [BioC] .wig files for strand-specific paired-end RNA-Seq In-Reply-To: <20120620115347.6E651133D06@mamba.fhcrc.org> References: <20120620115347.6E651133D06@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL:

From heidi at ebi.ac.uk Wed Jun 20 15:06:59 2012
From: heidi at ebi.ac.uk (Heidi Dvinge)
Date: Wed, 20 Jun 2012 14:06:59 +0100
Subject: [BioC] HTqPCR problem
In-Reply-To:
References:
Message-ID: <603214044bb4d43a9b38a9abe727f21a.squirrel@webmail.ebi.ac.uk>

Hi David,

> No, the X matrix is empty.
>
> Here is a snippet to reproduce the problem. Test on your side and see if
> you get the same error
>
> library(HTqPCR)
> X <- matrix(0,3,72)
> cat <- data.frame(matrix("OK", ncol=72, nrow=3), stringsAsFactors=FALSE)
> out <- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=cat)
> out
>

Thanks for the example, I get the same output now. It turns out the problem is due to missing featureNames: they can't be taken from the rownames of your X (it has none), which is what qPCRset expects by default. For example:

> X <- matrix(0,3,72)
> cat <- data.frame(matrix("OK", ncol=72, nrow=3))
> out <- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=cat)
> out
An object of class "qPCRset"
Size: 0 features, 72 samples
Feature types:
Feature names: NA NA NA ...
Feature classes:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  invalid 'row.names' length
> head(featureNames(out))
character(0)
> featureNames(out) <- paste("ft", 1:nrow(out), sep="")
> out
An object of class "qPCRset"
Size: 3 features, 72 samples
Feature types:
Feature names: ft1 ft2 ft3 ...
Feature classes:
Feature categories: OK
Sample names: 1 2 3 ...

I'll add a more informative error message. In the meantime, if you convert 'cat' to a matrix, it seems to work fine (at least this part):

out <- new("qPCRset", exprs=X, flag=as.data.frame(X), featureCategory=as.matrix(cat))

HTH
\Heidi

>
> Any idea why this is not working?
> > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] grid splines stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] HH_2.3-15 latticeExtra_0.6-19 leaps_2.9 > [4] lattice_0.20-6 gplots_2.10.1 KernSmooth_2.23-7 > [7] caTools_1.13 bitops_1.0-4.1 gdata_2.8.2 > [10] gtools_2.6.2 multcomp_1.2-12 survival_2.36-14 > [13] mvtnorm_0.9-9992 pROC_1.5.1 plyr_1.7.1 > [16] HTqPCR_1.10.0 limma_3.12.0 RColorBrewer_1.0-5 > [19] Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affy_1.34.0 affyio_1.24.0 BiocInstaller_1.4.6 > [4] preprocessCore_1.18.0 stats4_2.15.0 tools_2.15.0 > [7] zlibbioc_1.2.0 > > > > > > > On 06/20/2012 01:44 PM, wenhuo hu wrote: >> Is it possible from repeat row names in X? >> On Jun 20, 2012 6:36 AM, "David martin" wrote: >> >>> I'm manually building a qPCRset object that used to work until i switch >>> from R 2.12 to 2.15. >>> >>> >>>> dim(X) #data matrix (for the moment it contains only zero values) >>> [1] 3 72 >>> >>>> dim(cat)#data matrix with charactacter string("OK") >>> [1] 3 72 >>> >>> #Build Qpcr Object >>> out<- new("qPCRset", exprs=X, flag=as.data.frame(X), >>> featureCategory=cat) >>> >>> >>>> out >>> An object of class "qPCRset" >>> Size: 0 features, 72 samples >>> Feature types: >>> Feature names: NA NA NA ... >>> Feature classes: >>> Error in `row.names<-.data.frame`(`***tmp*`, value = value) : >>> invalid 'row.names' length >>>> >>> >>> What is the problem ???? 
>>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor >>> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > From atarca at med.wayne.edu Wed Jun 20 15:13:42 2012 From: atarca at med.wayne.edu (Tarca, Adi) Date: Wed, 20 Jun 2012 13:13:42 +0000 Subject: [BioC] error in dmrFinder of charm package when there is one sample per group Message-ID: <6DE578F501A8B2489DBD4893CEC996BA0315DF6A@MED-CORE07B.med.wayne.edu> Dear all, I was using dmrFinder in a case with one sample per group and noticed that it fails due to a call to the function rowMedians which requires a matrix but it gets a vector when grp1 or grp2 is made of one sample: "mat = cbind(rowMedians(l[,grp1]), rowMedians(l[,grp2]))" I realize that more samples per group would be a better case to be in but, given the function dmrFinder is prepared it seems for situations with one sample per group with a message like " grp1 has only 1 array!", it should still run without an error and give at least the difference in methylation scores per region. Moreover, I was wondering if anyone used charm with data generated from the MEDIP protocol (instead of McrBC), when the treated channel is methyl-enriched (instead of methyl-depleted). 
My take on this is that just switching (ut = "_532.xys", md = "_635.xys") which is default in readCharm with (ut = "_635.xys", md = "_532.xys") would keep the meaning of the methylation scores intact and hence nothing else needs to change in the analysis, but I would be interested in the experience of other bioconductors with this issue. Thanks, Adi Tarca From jmacdon at uw.edu Wed Jun 20 15:36:00 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 20 Jun 2012 09:36:00 -0400 Subject: [BioC] Differential drug effect on clinical groups In-Reply-To: References: Message-ID: <4FE1D1C0.8040900@uw.edu> Hi Dave, On 6/19/2012 5:10 PM, Dave Canvhet wrote: > Dear all, > > > I have 32 transcriptomics profile of A. thaliana (single color), among > which15 received a drug treatment and 15 are the control group. For all > these samples, 2 biological observations were also obtained : > - life time of the plant (short or long) > - expression of an integrine (with or without) > > I would like to get the following contrast : > (short life time without integrine) versus (long life time with integrine) > into treated samples > versus > (short life time without integrine) versus (long life time with integrine) > into control samples This isn't really clear, and I might be way off base with this answer, but it looks to me like you are after an interaction term. If I were to restate, I would say that you are looking for genes that react differently to treatment between the long lived integrine positive samples and the short lived integrine negative samples. If true, this isn't difficult to set up, although I wouldn't do it the way you are. Personally, I would combine the samples into four types, based on life and integrine (where for brevity, life is long/short and integrine is +/-): long+ short+ long- short- Now your interaction as I understand it will only utilize the long+ and short- samples, so you would restrict your samples to just those samples that fulfill those criteria. 
Then you could make a lifeinteg factor that is long+ and short- and create a design matrix

design <- model.matrix(~drug*lifeinteg)

and the drug:lifeinteg coefficient (the last column of the design matrix) is the interaction, and gives you the genes that react differently to the drug based on being long+ or short-.

Best,

Jim

>
>
> I've set up my design matrix (target is below):
> > drug = as.factor(targetATH$drug)
> > integr = as.factor(targetATH$integrin)
> > lifetime = as.factor(targetATH$lifetime)
> > design = model.matrix(~drug+integr+lifetime)
>
> I can't figure out how to set up the correct contrast matrix to get the
> coefficient I want.
> I would be very grateful if you could give any pieces of advice for that.
> I hope I have enough samples to get enough power to detect some genes.
>
>
> many thanks in advance, best regards,
> --
> Dave
>
>
> target :
>> targetATH
>    FileName      drug lifetime integrin
> 1  sample1.cel   Y    S        +
> 2  sample2.cel   Y    S        +
> 3  sample3.cel   Y    S        +
> 4  sample4.cel   Y    S        +
> 5  sample5.cel   Y    L        +
> 6  sample6.cel   Y    L        +
> 7  sample7.cel   Y    L        +
> 8  sample8.cel   Y    L        +
> 9  sample9.cel   Y    S        -
> 10 sample10.cel  Y    S        -
> 11 sample11.cel  Y    S        -
> 12 sample12.cel  Y    S        -
> 13 sample13.cel  Y    L        -
> 14 sample14.cel  Y    L        -
> 15 sample15.cel  Y    L        -
> 16 sample16.cel  Y    L        -
> 17 sample17.cel  N    S        +
> 18 sample18.cel  N    S        +
> 19 sample19.cel  N    S        +
> 20 sample20.cel  N    S        +
> 21 sample21.cel  N    L        +
> 22 sample22.cel  N    L        +
> 23 sample23.cel  N    L        +
> 24 sample24.cel  N    L        +
> 25 sample25.cel  N    S        -
> 26 sample26.cel  N    S        -
> 27 sample27.cel  N    S        -
> 28 sample28.cel  N    S        -
> 29 sample29.cel  N    L        -
> 30 sample30.cel  N    L        -
> 31 sample31.cel  N    L        -
> 32 sample32.cel  N    L        -
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

From tim.triche at gmail.com Wed Jun 20 15:38:04 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Wed, 20 Jun 2012 06:38:04 -0700
Subject: [BioC] WGCNA chooseTopHubInEachModule function
In-Reply-To: <1541271595.1044256.1340184972969.JavaMail.root@scai-extern.fraunhofer.de>
References: <1029755442.1043986.1340184648948.JavaMail.root@scai-extern.fraunhofer.de> <1541271595.1044256.1340184972969.JavaMail.root@scai-extern.fraunhofer.de>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From karthikuttan at gmail.com Wed Jun 20 15:59:14 2012
From: karthikuttan at gmail.com (Karthik K N)
Date: Wed, 20 Jun 2012 19:29:14 +0530
Subject: [BioC] BioC package for miRNA target scanning (or displaying results from databases)
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From eleonore.gravier at curie.net Wed Jun 20 16:00:58 2012
From: eleonore.gravier at curie.net (eleonore.gravier at curie.net)
Date: Wed, 20 Jun 2012 16:00:58 +0200
Subject: [BioC] Eleonore Gravier/CURIE has left the Institut Curie
Message-ID: <1668ca$bro4j@mail.curie.net>

I will be away from 20/06/2012 to 28/10/2012.

Hello,

I left the Institut Curie permanently on Tuesday, 19 June 2012. For any questions, please contact Bernard Asselain: bernard.asselain at curie.net

Kind regards,
Eléonore Gravier

Because the integrity of this message cannot be guaranteed over the Internet, the Institut Curie cannot be held responsible for its content. If you are not the intended recipient of this confidential message, please destroy it and notify the sender immediately. To help protect the environment, please print this e-mail only if necessary.

From jmacdon at uw.edu Wed Jun 20 16:06:43 2012
From: jmacdon at uw.edu (James W.
MacDonald)
Date: Wed, 20 Jun 2012 10:06:43 -0400
Subject: [BioC] Can I do model comparison with type III ANOVA?
In-Reply-To: <4FE0A035.8010707@oceanridgebio.com>
References: <4FE0A035.8010707@oceanridgebio.com>
Message-ID: <4FE1D8F3.6000908@uw.edu>

Hi Yonggan,

This has nothing to do with Bioconductor, so you should ask elsewhere. Note that type III models have been beaten to death on R-help, so a simple search of that list's archive would be likely to answer your question.

Best,

Jim

On 6/19/2012 11:52 AM, Yonggan Wu wrote:
> Hi,
>
> I just found out type III ANOVA won't give me a model comparison result,
> am I not supposed to do model comparison with type III ANOVA?
>
> Say we have two models
> m1=lm(data~factor1+factor2,data=table)
> m2=lm(data~factor1*factor2,data=table)
>
> Type I ANOVA comparison
> anv1=anova(m1,m2)
>
> Type III ANOVA Comparison, from car package
> anv2=Anova(m1,m2,type="III")
>
> anv1 gives me the correct result
> anv2's result is similar to Anova(m1, type="III")
>
> Can anyone please explain why?
>
> Thanks,
> Yonggan
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

From vilanew at gmail.com Wed Jun 20 16:36:45 2012
From: vilanew at gmail.com (David martin)
Date: Wed, 20 Jun 2012 16:36:45 +0200
Subject: [BioC] HTqPCR problem
In-Reply-To: <603214044bb4d43a9b38a9abe727f21a.squirrel@webmail.ebi.ac.uk>
References: <603214044bb4d43a9b38a9abe727f21a.squirrel@webmail.ebi.ac.uk>
Message-ID:

Perfect, works!!!
On 06/20/2012 03:06 PM, Heidi Dvinge wrote: > out<- new("qPCRset", exprs=X, flag=as.data.frame(X), > featureCategory=as.matrix(cat)) From lawrence.michael at gene.com Wed Jun 20 16:41:34 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Wed, 20 Jun 2012 07:41:34 -0700 Subject: [BioC] .wig files for strand-specific paired-end RNA-Seq In-Reply-To: References: <20120620115347.6E651133D06@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From sorokin at wisc.edu Wed Jun 20 16:45:20 2012 From: sorokin at wisc.edu (Elena Sorokin) Date: Wed, 20 Jun 2012 09:45:20 -0500 Subject: [BioC] interpreting DEXSeq output In-Reply-To: <4FE03866.8050104@embl.de> References: <4FE03866.8050104@embl.de> Message-ID: <4FE1E200.4050209@wisc.edu> Hi Alejandro, Yes, the merged genes I find are difficult to interpret - because the differential exon usage is probably just differences in total gene expression. I still wonder about it, because in my mapping procedure, I do a very stringent alignment where reads that map to more than one place in the transcriptome get thrown out of the BAM file. However, I would say this only affects less than 1 in 10 of my differential exon results, so I believe I can work around it. I did use the plot function, and it's very helpful. Thanks and best wishes, Elena On 6/19/2012 3:29 AM, Alejandro Reyes wrote: > Dear Elena, > > Thanks for your email! The reason that multiple genes are merged into > a single one is because they share exons, and it is not obvious to > assign this exon to a single gene. You can see more in detail if you > do a "plotDEXSeq" displaying the transcripts. So far, I have not seen > a big problem on it but I can imagine a situation in which the merged > genes are differentially expressed: there would be differences in exon > usage that are differential expression in reality... > > Is it introducing messy results for you? 
> > Alejandro > > >> Hello, >> >> How should we be interpreting output from DEXSeq in which some >> geneIDs within the DEU results table are denoted by multiple genes >> separated by + signs? I can send examples of what I mean to the >> developers, if my question is unclear. >> >> Especially when the architecture of the two or even three genes is >> quite different, this type of output perplexes me. Sorry if my post >> was answered elsewhere! >> >> Best wishes, >> Elena >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > From jmacdon at uw.edu Wed Jun 20 16:46:04 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 20 Jun 2012 10:46:04 -0400 Subject: [BioC] Non-Specific Filtering with "nsFilter" Question In-Reply-To: References: Message-ID: <4FE1E22C.3030403@uw.edu> Hi Zeynep, On 6/20/2012 5:18 AM, zeynep ?zkeserli wrote: > Hi All, > > I am trying to apply Non-Specific Filtering to Affymetrix GeneChip hgu133 > plus2 data. > > Since it has been shown that there are multiple probe sets mapping to the > same gene in Affymetrix GeneChips (ref: > http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/), I thought it is > necessary to filter those. So I decided to use nsFilter{geneFilter}. > > First I preprocessed the data, obtained an ExpressionSet object and then I > set my criterion as it was suggested as an example for nsFilter. > > - used require.entrez= TRUE, which filters out features without Entrez Gene > ID's. > - used remove.dupEntrez=TRUE, which filters features mapping to the same > Entrez Gene ID. (I turned off the variance filter to see how many will be > removed because of mapping to the same Entrez Gene ID.) > > And, > > - first filter removed 13009 features > - second filter removed 21629 features. > > "feature" here being genes. 
> Because this filter is under geneFilter, which
> filters genes :). Am I wrong?

Well, you are sort of wrong. In this context, feature means probeset, and each probeset is designed to interrogate either a gene transcript or a putative gene transcript.

>
> And here are my questions:
>
> - If I did not perform the filtering wrongly, is it possible that there are
> this many duplicates? Or is it really too many? Because in hgu133 arrays
> data sheet It says that "Analyzes the relative expression level of more
> than 47,000 transcripts and variants, including more than 38,500 well
> characterized genes and UniGenes."
> (ref:
> http://media.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf
> )

There is no telling if you did it right or wrong, as you neglected to show us your code. What you did and what you think you did may actually be different things. I can tell you this:

> length(unique(Rkeys(hgu133plus2ENTREZID)))
[1] 42094

So there are 42,094 unique Entrez Gene IDs represented on this array. Note carefully that Affy states '47,000 transcripts and variants', so they include transcript variants in that count, and these transcript variants will by definition have the same Entrez Gene ID.

>
> - Can anybody suggest a mind-map to follow while performing non-specific
> filtering? I think this must be done very carefully.

Agreed. I have never personally been fond of non-specific filtering, as to my mind it is a fairly blunt ax where a scalpel is required. Additionally, it is intended to 'fix' problems that I am not sure are either fixable or even exist.

For instance, removing duplicated genes assumes that any feature with the same Entrez Gene is by definition intended to measure the same thing. If there were no transcript variants this would be true. But there are transcript variants, so you end up removing things that may well be measuring different things. Not much of a fix IMO.
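
To make the two filtering counts discussed above concrete, here is a minimal base-R sketch. It assumes a toy probeset-to-Entrez mapping: the vector `entrez`, its probeset names and its IDs are all made up for illustration and are not taken from hgu133plus2.db. It only mimics the counting of "no Entrez ID" and "duplicated Entrez ID" probesets; it does not reproduce how nsFilter picks which duplicate to keep.

```r
# Toy mapping from probeset ID to Entrez Gene ID (made-up values;
# NA means the probeset has no Entrez Gene annotation).
entrez <- c("ps1_at"   = "100", "ps2_at"   = "200", "ps3_s_at" = "100",
            "ps4_at"   = NA,    "ps5_x_at" = "300", "ps6_at"   = "200",
            "ps7_at"   = NA,    "ps8_a_at" = "100")

# Step 1: probesets without an Entrez Gene ID
# (the ones a require.entrez-style filter would drop).
has_id      <- !is.na(entrez)
n_no_entrez <- sum(!has_id)

# Step 2: among the remaining probesets, count those that share an ID
# with an earlier probeset (the ones a remove.dupEntrez-style filter would drop).
kept  <- entrez[has_id]
n_dup <- sum(duplicated(kept))

# What is left is one probeset per unique Entrez Gene ID.
n_unique <- length(unique(kept))

cat("no Entrez ID:", n_no_entrez,
    "| duplicates:", n_dup,
    "| unique genes left:", n_unique, "\n")
```

With a real annotation package the same arithmetic should apply: total probesets = probesets without an ID + duplicates + unique IDs, which is a quick sanity check on the numbers a filter reports.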
In addition, one rationale for filtering genes is to reduce the number of multiple comparisons. This makes sense to a certain extent if you are simply computing a statistic of some sort and then ranking genes in a univariate manner. I say to a certain extent because things like FDR are monotonic transforms - you aren't changing the order, just moving the cutoff between 'interesting' and 'uninteresting'. That's sort of passe these days - instead of looking for individual genes, we have moved on to looking for perturbed pathways or gene sets, and for that I think removing data is a hindrance not a help. > > And another question regarding the filtering process. > > To my understanding, we should not use features mapping to the same Entrez > Gene ID, because they represent non-specific hybridization, thus they give > exaggerated signal intensities. So, does it effect preprocessing? If it > does, is it meaningful to filter them out after the preprocessing step? Or > am I doing it wrong from the first step? Should this filtering be done > before the preprocessing? I'm not sure where you got that idea, but I think it is wrong. Why would having more than one feature that purports to measure transcript from the same gene represent non-specific hybridization? It might represent duplicate measurement of the same thing, which would be bad because you are increasing the number of comparisons without actually comparing more things. You might be talking about features that might measure more than one transcript, and these may well exist. In fact, the probeset IDs are supposed to alert you to this possibility: http://www.affymetrix.com/support/help/faqs/hgu133_2/faq_7.jsp The short version of that FAQ is that _a_at indicates the probeset may bind to multiple transcripts of the same gene, the _s_at indicates that the probeset may bind to multiple transcripts from the same gene family, and the _x_at indicates that the probeset may bind to multiple transcripts from unrelated genes. 
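
A quick way to see how many probesets fall into each of these suffix classes is to tabulate the IDs. This is only a sketch in base R: the helper name `classify_probeset` is hypothetical, the example IDs are used purely for their suffixes, and the class descriptions simply restate the summary given above.

```r
# Classify Affymetrix-style probeset IDs by their suffix.
# The labels restate the FAQ summary from the text; anything without
# one of the three suffixes is reported as "plain _at / other".
classify_probeset <- function(ids) {
  type <- rep("plain _at / other", length(ids))
  type[grepl("_a_at$", ids)] <- "_a_at: several transcripts, same gene"
  type[grepl("_s_at$", ids)] <- "_s_at: several transcripts, same gene family"
  type[grepl("_x_at$", ids)] <- "_x_at: several transcripts, unrelated genes"
  type
}

# Example IDs, chosen only to show each suffix.
ids <- c("1007_s_at", "1053_at", "117_at", "1552256_a_at",
         "1552264_a_at", "1553185_x_at")

suffix_class  <- classify_probeset(ids)
suffix_counts <- table(suffix_class)
print(suffix_counts)
```

On a real array one could pass `featureNames(eset)` (assuming an ExpressionSet called `eset`) instead of the toy `ids` vector.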
For that you can either take these probesets with a grain of salt, or you might look at the MBNI remapped cdfs, which attempt to remove probes that behave poorly. Best, Jim > > I am a little puzzled here. So any help would be appreciated. > > Thank you, > > Zeynep Ozkeserli > Ankara University Biotechnology Institute > Genomics Unit > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From tim.triche at gmail.com Wed Jun 20 16:58:35 2012 From: tim.triche at gmail.com (Tim Triche, Jr.) Date: Wed, 20 Jun 2012 07:58:35 -0700 Subject: [BioC] interpreting DEXSeq output In-Reply-To: <4FE1E200.4050209@wisc.edu> References: <4FE03866.8050104@embl.de> <4FE1E200.4050209@wisc.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From iaingallagher at btopenworld.com Wed Jun 20 17:42:06 2012 From: iaingallagher at btopenworld.com (Iain Gallagher) Date: Wed, 20 Jun 2012 16:42:06 +0100 (BST) Subject: [BioC] xps scheme building Message-ID: <1340206926.81116.YahooMailNeo@web87705.mail.ir2.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From zeynep.ozkeserli at gmail.com Wed Jun 20 17:47:49 2012 From: zeynep.ozkeserli at gmail.com (=?ISO-8859-1?Q?zeynep_=F6zkeserli?=) Date: Wed, 20 Jun 2012 18:47:49 +0300 Subject: [BioC] Non-Specific Filtering with "nsFilter" Question In-Reply-To: <4FE1E22C.3030403@uw.edu> References: <4FE1E22C.3030403@uw.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From dcanvhet at gmail.com Wed Jun 20 18:07:28 2012 From: dcanvhet at gmail.com (Dave Canvhet) Date: Wed, 20 Jun 2012 18:07:28 +0200 Subject: [BioC] Differential drug effect on clinical groups In-Reply-To: <4FE1D1C0.8040900@uw.edu> References: <4FE1D1C0.8040900@uw.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jmacdon at uw.edu Wed Jun 20 18:31:00 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 20 Jun 2012 12:31:00 -0400 Subject: [BioC] Non-Specific Filtering with "nsFilter" Question In-Reply-To: References: <4FE1E22C.3030403@uw.edu> Message-ID: <4FE1FAC4.1050405@uw.edu> Hi Zeynep, On 6/20/2012 11:47 AM, zeynep ?zkeserli wrote: > Hi James, > > Thank you for your detailed answer which covered all the black holes > on this subject on my mind. > > In fact, the problem started with the control probes. The problem was > that, when I performed limma analysis without any filters, the control > probes were on top of the differentially expressed gene list. I > couldn't find out why, it didn't seem to be an experimental defect (I > concluded it from QC Reports). So while I was trying to find out a > solution for this, I also started to think on filtering to reduce the > number of multiple comparisons (and my misunderstandings on probe > design suddenly popped out, sorry for some of the unnecessary > questions.) Do you have any idea why control probes would appear to be > significantly differentially expressed? Is it logical to just move them? Ugh. I hate when that happens. So, it depends on what you mean by control probes, as there are various types. If you are talking about the beta-actin or other 'housekeeping' genes, then it isn't clear to me if this is a problem or not. The general assumption is that these genes are constituitively up-regulated, and never vary. But I have always wondered about that. 
It's sort of like the 'no two snow flakes are alike' hypothesis - in general circulation, but by definition untestable. So housekeeping genes make me wonder, but don't really cause much teeth gnashing. The same is true for the 'normalizing control set' of 100 probesets that Affy claim are not differentially expressed in different tissues. I think that really depends. I had one study back in the day where they were comparing normal C. elegans to C. elegans that had some deadly mutation, and something like 95% of the genes were differentially expressed. It was just ridiculous. But the point to me was that you can't know if a gene or set of genes are never affected - it is too context dependent. That said, I would recommend ensuring that everything is OK. I don't know what you mean by QC Reports - perhaps you used the affyQCReport package, or arrayQualityMetrics? I would certainly run these data through one of those packages. I would also do things like PCA plots of the expression values, and maybe image plots that you can generate using the affyPLM package. Now if you have things like the Poly-A controls or the Hybridization controls popping up, then you may have a real problem, as those are spiked in during the processing. This could indicate big technical variability between batches that may not be resolvable. > > And about getting rid of the "passe" analysis pipeline; does the > search for interesting pathways start after deciding "important" genes > set or is it another approach which seeks those sets in the whole data > set in a different manner? Can you please recommend me any papers > where I could learn this approach? Well, the general idea started with Gene Ontology analyses where you take the 'top' genes, based on a cutoff, and try to find GO terms that are over or under-represented in the set of significant genes. The underlying weakness there is that you are relying on a cutoff, which can be fairly arbitrarily set. 
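
The cutoff-based approach described above is commonly tested with a hypergeometric (one-sided Fisher) test. The following is a minimal base-R sketch under stated assumptions: all four counts are made up, and this is one common formulation, not necessarily what any particular GO package computes.

```r
# Made-up counts for one GO term.
n_universe <- 1000  # all genes tested on the chip
n_in_term  <- 50    # genes annotated to the GO term
n_selected <- 100   # 'top' genes passing the (arbitrary) cutoff
n_overlap  <- 15    # selected genes that are also in the GO term

# Expected overlap by chance alone.
expected <- n_selected * n_in_term / n_universe

# P(overlap >= n_overlap) under the hypergeometric distribution.
p_over <- phyper(n_overlap - 1,
                 m = n_in_term,
                 n = n_universe - n_in_term,
                 k = n_selected,
                 lower.tail = FALSE)

# The same test written as a one-sided Fisher exact test on the 2x2 table.
tab <- matrix(c(n_overlap,
                n_in_term - n_overlap,
                n_selected - n_overlap,
                n_universe - n_in_term - n_selected + n_overlap),
              nrow = 2,
              dimnames = list(selected = c("yes", "no"),
                              in_term  = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

cat("expected overlap:", expected, "\n")
cat("over-representation p-value:", format(p_over, digits = 3), "\n")
```

Changing `n_selected` (that is, moving the cutoff) changes the p-value, which is exactly the weakness pointed out above.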
Another way to think about it is to just take your ranked list of genes (all genes on the chip, ranked by some statistic), and then see if a certain group of genes (where 'group' is defined as an existing gene set that somebody else already found, or a set of genes in a GO category, or what have you) is 'higher up' in the ranked list than would be expected by chance. For this approach you really need to filter down to a set of unique genes, but in general I don't think you filter further. I'm no expert on the literature, but I think one of the seminal papers is by Tian: http://www.pnas.org/content/102/38/13544.short There are also several out of Robert Gentleman's group that I have found helpful. Do a Google Scholar of gsea gentleman, and they will be near the top. Best, Jim > > Thanks again for your help and comments. Very much appreciated. > > Zeynep > > > > On Wed, Jun 20, 2012 at 5:46 PM, James W. MacDonald > wrote: > > Hi Zeynep, > > > On 6/20/2012 5:18 AM, zeynep ?zkeserli wrote: > > Hi All, > > I am trying to apply Non-Specific Filtering to Affymetrix > GeneChip hgu133 > plus2 data. > > Since it has been shown that there are multiple probe sets > mapping to the > same gene in Affymetrix GeneChips (ref: > http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/), I > thought it is > necessary to filter those. So I decided to use > nsFilter{geneFilter}. > > First I preprocessed the data, obtained an ExpressionSet > object and then I > set my criterion as it was suggested as an example for nsFilter. > > - used require.entrez= TRUE, which filters out features > without Entrez Gene > ID's. > - used remove.dupEntrez=TRUE, which filters features mapping > to the same > Entrez Gene ID. (I turned off the variance filter to see how > many will be > removed because of mapping to the same Entrez Gene ID.) > > And, > > - first filter removed 13009 features > - second filter removed 21629 features. > > "feature" here being genes. 
Because this filter is under > geneFilter, which > filters genes :). Am I wrong? > > > Well, you are sort of wrong. In this context, feature means > probeset, and each probeset is designed to interrogate either a > gene transcript or a putative gene transcript. > > > > And here are my questions: > > - If I did not perform the filtering wrongly, is it possible > that there are > this many duplicates? Or is it really too many? Because in > hgu133 arrays > data sheet It says that "Analyzes the relative expression > level of more > than 47,000 transcripts and variants, including more than > 38,500 well > characterized genes and UniGenes." > (ref: > http://media.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf > ) > > > There is no telling if you did it right or wrong, as you neglected > to show us your code. What you did and what you think you did may > actually be different things. I can tell you this: > > > length(unique(Rkeys(hgu133plus2ENTREZID))) > [1] 42094 > > So there are 42,094 unique Entrez Gene IDs represented on this > array. Note carefully that Affy states '47,000 transcripts and > variants', so they include transcript variants in that count, and > these transcript variants will by definition have the same Entrez > Gene ID. > > > > - Can anybody suggest a mind-map to follow while performing > non-specific > filtering? I think this must be done very carefully. > > > Agreed. I have never personally been fond of non-specific > filtering, as to my mind it is a fairly blunt ax where a scalpel > is required. Additionally, it is intended to 'fix' problems that I > am not sure are either fixable or even exist. > > For instance, removing duplicated genes assumes that any feature > with the same Entrez Gene is by definition intended to measure the > same thing. If there were no transcript variants this would be > true. But there are transcript variants, so you end up removing > things that may well be measuring different things. 
Not much of a > fix IMO. > > In addition, one rationale for filtering genes is to reduce the > number of multiple comparisons. This makes sense to a certain > extent if you are simply computing a statistic of some sort and > then ranking genes in a univariate manner. I say to a certain > extent because things like FDR are monotonic transforms - you > aren't changing the order, just moving the cutoff between > 'interesting' and 'uninteresting'. That's sort of passe these days > - instead of looking for individual genes, we have moved on to > looking for perturbed pathways or gene sets, and for that I think > removing data is a hindrance not a help. > > > > And another question regarding the filtering process. > > To my understanding, we should not use features mapping to the > same Entrez > Gene ID, because they represent non-specific hybridization, > thus they give > exaggerated signal intensities. So, does it effect > preprocessing? If it > does, is it meaningful to filter them out after the > preprocessing step? Or > am I doing it wrong from the first step? Should this filtering > be done > before the preprocessing? > > > I'm not sure where you got that idea, but I think it is wrong. Why > would having more than one feature that purports to measure > transcript from the same gene represent non-specific > hybridization? It might represent duplicate measurement of the > same thing, which would be bad because you are increasing the > number of comparisons without actually comparing more things. > > You might be talking about features that might measure more than > one transcript, and these may well exist. 
In fact, the probeset > IDs are supposed to alert you to this possibility: > > http://www.affymetrix.com/support/help/faqs/hgu133_2/faq_7.jsp > > The short version of that FAQ is that _a_at indicates the probeset > may bind to multiple transcripts of the same gene, the _s_at > indicates that the probeset may bind to multiple transcripts from > the same gene family, and the _x_at indicates that the probeset > may bind to multiple transcripts from unrelated genes. > > For that you can either take these probesets with a grain of salt, > or you might look at the MBNI remapped cdfs, which attempt to > remove probes that behave poorly. > > Best, > > Jim > > > > I am a little puzzled here. So any help would be appreciated. > > Thank you, > > Zeynep Ozkeserli > Ankara University Biotechnology Institute > Genomics Unit > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From jmacdon at uw.edu Wed Jun 20 18:35:58 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Wed, 20 Jun 2012 12:35:58 -0400 Subject: [BioC] Differential drug effect on clinical groups In-Reply-To: References: <4FE1D1C0.8040900@uw.edu> Message-ID: <4FE1FBEE.7050007@uw.edu> Hi Dave, On 6/20/2012 12:07 PM, Dave Canvhet wrote: > Hi James, > > This isn't really clear, and I might be way off base with this > answer, but it looks to me like you are after an interaction term. 
> If I were to restate, I would say that you are looking for genes
> that react differently to treatment between the long lived
> integrine positive samples and the short lived integrine negative
> samples.
>
>
> This is exactly what I want, so thanks to your clear restate.
>
> If true, this isn't difficult to set up, although I wouldn't do it
> the way you are. Personally, I would combine the samples into four
> types, based on life and integrine (where for brevity, life is
> long/short and integrine is +/-):
>
> long+
> short+
> long-
> short-
>
> Now your interaction as I understand it will only utilize the
> long+ and short- samples, so you would restrict your samples to
> just those samples that fulfill those criteria. Then you could
> make a lifeinteg factor that is long+ and short- and create a
> design matrix
>
>
> design <- model.matrix(~drug*lifeinteg)
>
>
>
> OK I still to progress on the differences between interaction model
> and additive model (with which I'm more familiar)
> Do you think it will be useful to set up an Intercept ?
> design <- model.matrix(~0+drug*lifeinteg)

There won't be a difference. As an example:

> drug <- factor(rep(1:2, 4))
> lifeinteg <- factor(rep(1:2, each = 4))
> model.matrix(~drug*lifeinteg)
  (Intercept) drug2 lifeinteg2 drug2:lifeinteg2
1           1     0          0                0
2           1     1          0                0
3           1     0          0                0
4           1     1          0                0
5           1     0          1                0
6           1     1          1                1
7           1     0          1                0
8           1     1          1                1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$drug
[1] "contr.treatment"

attr(,"contrasts")$lifeinteg
[1] "contr.treatment"

> model.matrix(~0+drug*lifeinteg)
  drug1 drug2 lifeinteg2 drug2:lifeinteg2
1     1     0          0                0
2     0     1          0                0
3     1     0          0                0
4     0     1          0                0
5     1     0          1                0
6     0     1          1                1
7     1     0          1                0
8     0     1          1                1
attr(,"assign")
[1] 1 1 2 3
attr(,"contrasts")
attr(,"contrasts")$drug
[1] "contr.treatment"

attr(,"contrasts")$lifeinteg
[1] "contr.treatment"

So the interaction term will be drug2:lifeinteg2 regardless of how you specify the model.

Best,

Jim

>
> Again many for your time and your help.
> > Bests > -- > Dave > > > > > and the lifeinteg2 coefficient is the interaction, and gives you > the genes that react differently to the drug based on being long+ > or short-. > > Best, > > Jim > > > > > > I've set up my design matrix (target is below): > > drug = as.factor(targetATH$drug) > > integr = as.factor(targetATH$integrin) > > lifetime = as.factor(targetATH$lifetime) > > design = model.matrix(~drug+integr+lifetime) > > I can't figure out how to set up the correct contrast matrix > to get the > coefficient I want. > I would be very grateful if you could give any pieces of > advices for that. > I hope I have enough sample to get enough power to detect some > genes. > > > many thanks by advance, best regards, > -- > Dave > > > target : > > targetATH > > FileName drug lifetime integrin > 1 sample1.cel Y S + > 2 sample2.cel Y S + > 3 sample3.cel Y S + > 4 sample4.cel Y S + > 5 sample5.cel Y L + > 6 sample6.cel Y L + > 7 sample7.cel Y L + > 8 sample8.cel Y L + > 9 sample9.cel Y S - > 10 sample10.cel Y S - > 11 sample11.cel Y S - > 12 sample12.cel Y S - > 13 sample13.cel Y L - > 14 sample14.cel Y L - > 15 sample15.cel Y L - > 16 sample16.cel Y L - > 17 sample17.cel N S + > 18 sample18.cel N S + > 19 sample19.cel N S + > 20 sample20.cel N S + > 21 sample21.cel N L + > 22 sample22.cel N L + > 23 sample23.cel N L + > 24 sample24.cel N L + > 25 sample25.cel N S - > 26 sample26.cel N S - > 27 sample27.cel N S - > 28 sample28.cel N S - > 29 sample29.cel N L - > 30 sample30.cel N L - > 31 sample31.cel N L - > 32 sample32.cel N L - > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > James W. MacDonald, M.S. 
> Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From sudeep.sahadevan at scai-extern.fraunhofer.de Wed Jun 20 18:36:10 2012 From: sudeep.sahadevan at scai-extern.fraunhofer.de (Sudeep Sahadevan) Date: Wed, 20 Jun 2012 18:36:10 +0200 (CEST) Subject: [BioC] WGCNA chooseTopHubInEachModule function In-Reply-To: Message-ID: <588573744.1102383.1340210170326.JavaMail.root@scai-extern.fraunhofer.de> Dear Steve, Thank you for your reply (and thanks to Tim for forwarding the mail) Regards, Sudeep. ----- Original Message ----- From: "Steve Horvath" To: ttriche at usc.edu, "Sudeep Sahadevan" Cc: bioconductor at r-project.org, "Peter Langfelder" Sent: Wednesday, June 20, 2012 6:19:52 PM Subject: RE: [BioC] WGCNA chooseTopHubInEachModule function Dear Tim and Sudeep, regarding your question one can invoke a general rule linking unsigned and signed networks: if a power of beta is chosen for an unsigned network then one should choose a power of 2*beta for corresponding signed network. Therefore, I suggest to use a power of 4 for a signed network. In any event, the good news is that weighted networks are fairly robust with respect to (soft) threshold choices (i.e. the power) so the result should be fairly robust irrespective of the choice of beta. Steve ________________________________ From: Tim Triche, Jr. 
[tim.triche at gmail.com] Sent: Wednesday, June 20, 2012 6:38 AM To: Sudeep Sahadevan Cc: bioconductor at r-project.org; Horvath, Steve; Peter.Langfelder at gmail.com Subject: Re: [BioC] WGCNA chooseTopHubInEachModule function WGCNA is not a BioC package, you should cc: the authors (Steve Horvath and Peter Langfelder) on your email (IMHO) On Wed, Jun 20, 2012 at 2:36 AM, Sudeep Sahadevan > wrote: Hi all, In WGCNA R package the default "power" argument for the function "chooseTopHubInEachModule" is 2. My question is there anyway to test what would be the optimum argument to use for a signed network ? Thank you in advance. Regards, Sudeep. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- A model is a lie that helps you see the truth. Howard Skipper ________________________________ IMPORTANT WARNING: This email (and any attachments) is o...{{dropped:2}} From vobencha at fhcrc.org Wed Jun 20 22:13:27 2012 From: vobencha at fhcrc.org (Valerie Obenchain) Date: Wed, 20 Jun 2012 13:13:27 -0700 Subject: [BioC] nearest() for GRanges In-Reply-To: References: Message-ID: <4FE22EE7.90404@fhcrc.org> Hi Oleg, Malcom, Thanks for the bug report. This is now fixed in devel 1.9.28. Over the past months we've done an overhaul of the precede/follow code in devel. The new nearest method is based on the new precede and follow and is documented at ?'nearest,GenomicRanges,GenomicRanges-method' Let me know if you run into problems. Valerie On 06/18/2012 02:25 PM, Cook, Malcolm wrote: > Martin, Oleg, Val, all, > > I too have script logic that benefitted from and depends upon what the > behavior of nearest,GenomicRanges,missing as reported by Oleg. > > Thanks for the unit tests Martin. 
> > If it helps in sleuthing, in my hands, the 3rd test used to pass (if my > memory serves), but does not pass now, as the attached transcript shows. > > Hoping it helps find a speedy resolution to this issue, > > ~ Malcolm Cook > > > >> r<- IRanges(c(1,5,10), c(2,7,12)) >> g<- GRanges("chr1", r, "+") >> checkEquals(precede(r), precede(g)) > [1] TRUE >> checkEquals(follow(r), follow(g)) > [1] TRUE >> try(checkEquals(nearest(r), nearest(g))) > Error in checkEquals(nearest(r), nearest(g)) : > Mean relative difference: 0.6 > > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] tools splines parallel stats graphics grDevices utils > datasets methods base > > other attached packages: > [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 > RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 > bit_1.1-8 evd_2.2-6 ipred_0.8-13 > prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 > survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 > ada_2.0-2 rpart_3.1-53 e1071_1.6 > class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 > rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 > lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 > [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 > Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 > GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 > BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 > svUnit_0.7-10 doParallel_1.0.1 iterators_1.0.6 > foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 > RSQLite.extfuns_0.0.1 RSQLite_0.11.1 chron_2.3-42 > gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 > functional_0.1 reshape_0.8.4 plyr_1.7.1 > [55] stringr_0.6 gtools_2.6.2 > > loaded via a namespace (and not attached): > [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 > codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 > digest_0.5.2 grid_2.15.0 labeling_0.1 memoise_0.1 > munsell_0.3 reshape2_1.2.1 scales_0.2.1 stats4_2.15.0 > tcltk_2.15.0 zlibbioc_1.2.0 > > > > > > > On 
6/18/12 2:39 PM, "Martin Morgan" wrote:
>
>> Hi Oleg --
>>
>> On 06/17/2012 11:11 PM, Oleg Mayba wrote:
>>> Hi,
>>>
>>> I just noticed that a piece of logic I was relying on with GRanges before
>>> does not seem to work anymore. Namely, I expect the behavior of nearest()
>>> with a single GRanges object as an argument to be the same as that of
>>> IRanges (example below), but it's not anymore. I expect nearest(GR1) NOT
>>> to behave trivially but to return the closest range OTHER than the range
>>> itself. I could swear that was the case before, but isn't any longer:
>> I think you're right that there is an inconsistency here; Val will
>> likely help clarify in a day or so. My two cents...
>>
>> I think, certainly, that GRanges on a single chromosome on the "+"
>> strand should behave like an IRanges
>>
>> library(GenomicRanges)
>> library(RUnit)
>>
>> r <- IRanges(c(1,5,10), c(2,7,12))
>> g <- GRanges("chr1", r, "+")
>>
>> ## first two ok, third should work but fails
>> checkEquals(precede(r), precede(g))
>> checkEquals(follow(r), follow(g))
>> try(checkEquals(nearest(r), nearest(g)))
>>
>> Also, on the "-" strand I think we're expecting
>>
>> g <- GRanges("chr1", r, "-")
>>
>> ## first two ok, third should work but fails
>> checkEquals(follow(r), precede(g))
>> checkEquals(precede(r), follow(g))
>> try(checkEquals(nearest(r), nearest(g)))
>>
>> For "*" (which was your example) I think the situation is (a) different
>> in devel than in release; and (b) not so clear. In devel, "*" is (from
>> method?"nearest,GenomicRanges,missing") "x on '*' strand can match to
>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think these
>> are always true:
>>
>> checkEquals(precede(g), follow(g))
>> checkEquals(nearest(r), follow(g))
>>
>> I would also expect
>>
>> try(checkEquals(nearest(g), follow(g)))
>>
>> though this seems not to be the case. In 'release', "*" is coerced and
>> behaves as if on the "+" strand (I think).
>> >> Martin >> >>>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>>> z >>> IRanges of length 3 >>> start end width >>> [1] 1 2 2 >>> [2] 5 7 3 >>> [3] 10 12 3 >>>> nearest(z) >>> [1] 2 1 2 >>>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >>> end=c(2,7,12))) >>>> z >>> GRanges with 3 ranges and 0 elementMetadata cols: >>> seqnames ranges strand >>> >>> [1] chr1 [ 1, 2] * >>> [2] chr1 [ 5, 7] * >>> [3] chr1 [10, 12] * >>> --- >>> seqlengths: >>> chr1 >>> NA >>>> nearest(z) >>> [1] 1 2 3 >>>> sessionInfo() >>> R version 2.15.0 (2012-03-30) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] datasets utils grDevices graphics stats methods base >>> >>> other attached packages: >>> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >>> >>> loaded via a namespace (and not attached): >>> [1] stats4_2.15.0 >>> >>> >>> I want the IRanges behavior and not what seems currently to be the >>> GRanges >>> behavior, since I have some code that depends on it. Is there a quick >>> way >>> to make nearest() do that for me again? >>> >>> Thanks! >>> >>> Oleg. >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. 
>> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From mgarciao at ufl.edu Wed Jun 20 22:24:51 2012 From: mgarciao at ufl.edu (Garcia Orellana,Miriam) Date: Wed, 20 Jun 2012 20:24:51 +0000 Subject: [BioC] Best package or code to filter Affymetrix probes by present calls?? In-Reply-To: <422D2675-1DBE-4682-A6FD-0E2771676ACA@clevert.de> References: <7F10E9EDBB347E4CA0765A3139C110BB14F999D3@UFEXCH-MBXN01.ad.ufl.edu> <1340056060.56022.YahooMailNeo@web87703.mail.ir2.yahoo.com> <7F10E9EDBB347E4CA0765A3139C110BB14F9B3DD@UFEXCH-MBXN01.ad.ufl.edu>, <422D2675-1DBE-4682-A6FD-0E2771676ACA@clevert.de> Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F9B562@UFEXCH-MBXN01.ad.ufl.edu> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From whuber at embl.de Wed Jun 20 22:58:27 2012 From: whuber at embl.de (Wolfgang Huber) Date: Wed, 20 Jun 2012 22:58:27 +0200 Subject: [BioC] RE : Error in intgroup of arrayQualityMetrics package In-Reply-To: References: <20120615063016.7AB89134FEF@mamba.fhcrc.org> Message-ID: <4FE23973.4070504@embl.de> Dear Tim thanks for spotting this. That's indeed an unintended interaction of this cleanUpPhenoData function (which we needed after seeing some highly verbose and redundant ExpressionSet objects generated from ArrayExpress) and the 'intgroup' functionality. As a workaround, what you suggest below is perfect. 
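
A minimal sketch of that workaround in base R, assuming the sample annotation is an ordinary data frame. The data frame pd below is a hypothetical stand-in for pData(eset), with made-up columns and tissue labels; the limit of 10 columns is the maxcol value reported in the quoted message, not something verified here.

```r
# Hypothetical sample annotation standing in for pData(eset):
# 12 uninformative columns followed by the column of interest
pd <- data.frame(matrix("x", nrow = 4, ncol = 12), stringsAsFactors = FALSE)
pd$Tissue <- factor(c("liver", "liver", "kidney", "kidney"))

# Move "Tissue" to the front so it falls within the first few columns
pd <- pd[, c("Tissue", setdiff(colnames(pd), "Tissue")), drop = FALSE]

# The column should be neither all the same nor all different
n_levels <- length(unique(pd$Tissue))
ok <- n_levels > 1 && n_levels < nrow(pd)

print(colnames(pd)[1:3])
print(ok)

# With a real ExpressionSet the reordered table would then be assigned back,
# e.g. pData(eset) <- pd (assumes Biobase is loaded and eset exists)
```
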
Fixing this, as well as the suggestions by Daniel Aaen Hansen of 1 June, are on the to-do list for this summer. best wishes Wolfgang Jun/18/12 2:01 PM, Tim Rayner scripsit:: > Hi Sonal, > > You could try rearranging pData(eset) so that the "Tissue" column is > the first column, or within the first few columns. There's some > preprocessing code in the arrayQualityMetrics:::cleanUpPhenoData > function which limits the number of columns which will be carried > forward into the QC (maxcol=10). Also, the contents of the "Tissue" > column must not be either all the same or all different (a quite > reasonable requirement). > > Cheers, > > Tim > > > -- > Tim Rayner > Bioinformatician > Smith Lab, CIMR > University of Cambridge > > > On 15 June 2012 07:30, Sonal Bakiwala [guest] wrote: >> >> I am using arraQualityMetrics package installed from Bioconductor site and R version that I am using is 2.15.0 >> >> The input for the function was eset and for the intgroup argument character vector "Tissue". There is a >> column named Tissue in my phenoData of the eset. >> >> But it still gives me an error saying the elements of intgroup do not match the column names of the pData(eset). >> I don't know what wrong I am doing. 
>> >> The error look like this : >> >> Error in prepData(expressionset,intgroup=intgroup): >> all elements of 'intgroup' should match column names of pData(expressionset) >> >> >> >> -- output of sessionInfo(): >> >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] BiocInstaller_1.4.6 arrayQualityMetrics_3.12.0 >> [3] affy_1.34.0 limma_3.12.1 >> [5] Biobase_2.16.0 BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.24.0 affyPLM_1.32.0 annotate_1.34.0 >> [4] AnnotationDbi_1.18.1 beadarray_2.6.0 BeadDataPackR_1.8.0 >> [7] Biostrings_2.24.1 Cairo_1.5-1 cluster_1.14.2 >> [10] colorspace_1.1-1 DBI_0.2-5 genefilter_1.38.0 >> [13] grid_2.15.0 Hmisc_3.9-3 hwriter_1.3 >> [16] IRanges_1.14.3 lattice_0.20-6 latticeExtra_0.6-19 >> [19] plyr_1.7.1 preprocessCore_1.18.0 RColorBrewer_1.0-5 >> [22] reshape2_1.2.1 RSQLite_0.11.1 setRNG_2011.11-2 >> [25] splines_2.15.0 stats4_2.15.0 stringr_0.6 >> [28] survival_2.36-12 SVGAnnotation_0.9-0 tools_2.15.0 >> [31] vsn_3.24.0 XML_3.9-4 xtable_1.7-0 >> [34] zlibbioc_1.2.0 >>> intgroup >> [1] "Tissue" >>> str(intgroup) >> chr "Tissue" >> >> Sorry I wont be able to provide you with the detailed information of the pData. >> But the colnames(pData(eset)) has one of columns named as "Tissue" and the class of the this column is factor. >> >> Thank you. >> >> -- >> Sent via the guest posting facility at bioconductor.org. 
>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber From cstrato at aon.at Wed Jun 20 23:05:00 2012 From: cstrato at aon.at (cstrato) Date: Wed, 20 Jun 2012 23:05:00 +0200 Subject: [BioC] xps scheme building In-Reply-To: <1340206926.81116.YahooMailNeo@web87705.mail.ir2.yahoo.com> References: <1340206926.81116.YahooMailNeo@web87705.mail.ir2.yahoo.com> Message-ID: <4FE23AFC.6030109@aon.at> Dear Iain, Everything is ok, this is just a note (and not a warning message) where I list the columns from the Affymetrix annotation file(s) which have somehow changed or were deleted by Affymetrix. The reason is that during the years Affymetrix has changed/added/deleted columns from their annotation files, so this is mainly an information for me. Best regards, Christian _._._._._._._._._._._._._._._._._._ C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a V.i.e.n.n.a A.u.s.t.r.i.a e.m.a.i.l: cstrato at aon.at _._._._._._._._._._._._._._._._._._ On 6/20/12 5:42 PM, Iain Gallagher wrote: > Hello List > > I have a set of Rat Gene ST CEL files to analyse and have started looking into using xps for this. I have begun by building the required scheme file for the arrays. 
I have downloaded the relevant files from affymetrix (RaGene-1_0-st-v1.r4.clf, RaGene-1_0-st-v1.r4.pgf, RaGene-1_0-st-v1.na32.rn4.probeset.csv& RaGene-1_0-st-v1.na32.rn4.transcript.csv) and built the scheme file as follows: > > libdir<- paste(getwd(), 'xpsAnalysis/annotationFiles', sep='/') # the annotation files are here > > xps.scheme<- import.exon.scheme('Scheme_RaGene10stv1r4', filedir=paste(getwd(), 'xpsAnalysis/rootScheme', sep='/'), layoutfile = paste(libdir,'RaGene-1_0-st-v1.r4.clf', sep='/'), schemefile = paste(libdir,'RaGene-1_0-st-v1.r4.pgf',sep='/'), probeset = paste(libdir,'RaGene-1_0-st-v1.na32.rn4.probeset.csv', sep='/'), transcript = paste(libdir, 'RaGene-1_0-st-v1.na32.rn4.transcript.csv', sep='/')) # create the scheme > > > Note that I use the import.exon.scheme command as per advice from cstrato for the r4 annotation files (some weblink I can't find just now). > > > This goes well and I get a .root file in the right directory. However I noticed that during the process the following warning is issued: > > Importing as... > Note: The following header columns are missing or in wrong order: > > > > > > > Something to be concerned about? > > I'm not sure what the mouse data would be doing in the annotations for the rat chips but I'm not yet familiar with the platform. > > Advice appreciated. 
> > Thanks > > iain > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C > [3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8 > [5] LC_MONETARY=en_GB.utf8 LC_MESSAGES=en_GB.utf8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] xps_1.16.0 > > loaded via a namespace (and not attached): > [1] tools_2.15.0 > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From zhujack at mail.nih.gov Wed Jun 20 23:30:22 2012 From: zhujack at mail.nih.gov (Jack Zhu) Date: Wed, 20 Jun 2012 17:30:22 -0400 Subject: [BioC] SRAdb: is the database missing some entries? (Ben Woodcroft) Message-ID: Hi Ben and all, Sorry for late response - just came back from a vacation. I found the problem - our newest SRAdb SQLite file was not copied to the web server due to permission issue. 
I have fixed it:

> sraConvert(c('SRA036600','SRA049463','ERA062401'), sra_con= sra_con)
   submission     study    sample experiment       run
1   ERA062401 ERP000941 ERS066098  ERX024719 ERR047656
2   ERA062401 ERP000941 ERS066098  ERX024722 ERR047659
3   ERA062401 ERP000941 ERS066097  ERX024712 ERR047649
4   ERA062401 ERP000941 ERS066097  ERX024710 ERR047647
5   ERA062401 ERP000941 ERS066098  ERX024721 ERR047658
6   ERA062401 ERP000941 ERS066097  ERX024711 ERR047648
7   ERA062401 ERP000941 ERS066097  ERX024708 ERR047645
8   ERA062401 ERP000941 ERS066097  ERX024715 ERR047652
9   ERA062401 ERP000941 ERS066098  ERX024720 ERR047657
10  ERA062401 ERP000941 ERS066097  ERX024713 ERR047650
11  ERA062401 ERP000941 ERS066097  ERX024709 ERR047646
12  ERA062401 ERP000941 ERS066098  ERX024723 ERR047660
13  ERA062401 ERP000941 ERS066098  ERX024717 ERR047654
14  ERA062401 ERP000941 ERS066097  ERX024714 ERR047651
15  ERA062401 ERP000941 ERS066098  ERX024718 ERR047655
16  ERA062401 ERP000941 ERS066098  ERX024716 ERR047653
17  SRA036600 SRP006780 SRS193106  SRX062801 SRR205889

BTW, "SRA049463" is in 'unpublished' status.

Thanks for your message. Your comments will be highly appreciated.

Jack

---------------------------------------------------------------------

Hi,

Firstly thanks to the creators of this very useful package.

I've come across SRA identifiers that don't appear to be in the database (a minority, but still). Here's a few:

SRA036600
DRX001436
SRA049463
ERA062401
ERA062401

For example:

> library(SRAdb)
> sra_con = dbConnect(SQLite(),'SRAmetadb.sqlite')
> sraConvert(c('SRA036600'), sra_con= sra_con)
[1] submission study sample experiment run
<0 rows> (or 0-length row.names)

However this isn't a bogus accession because I can see it on the NCBI SRA website.

I could be wrong but I don't think it is as simple as the metadata being out of date because the submission dates are often relatively old (SRA036600 was 2011-05-13) and there's metadata from more recent SRA submissions in the SRAdb).

Any ideas?
Thanks in advance, ben From smyth at wehi.EDU.AU Thu Jun 21 01:56:00 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Thu, 21 Jun 2012 09:56:00 +1000 (AUS Eastern Standard Time) Subject: [BioC] DESeq and contrasts In-Reply-To: References: Message-ID: Dear Joshua, edgeR does exactly this. It has the ability to fit any designed experiment with any number of factors, and has an easy interface to extract any contrast from such a fit. See page 17 of the edgeR user's guide for a very simple example of the use of a contrast: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf Best wishes Gordon > Date: Tue, 19 Jun 2012 15:57:44 -0600 > From: Joshua Udall > To: bioconductor at r-project.org > Subject: [BioC] DESeq and contrasts > > Hello all, > > Was there any additional discussion about investigating contrasts of > multifactor designs within DESeq? I saw this thread: > http://article.gmane.org/gmane.science.biology.informatics.conductor/40048/match=deseq+contrast > > but no response was posted to the list. I am essentially looking for the > same guidance (R syntax within DESeq for contrasts and interactions). > > Thanks. > Joshua Udall ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From MEC at stowers.org Thu Jun 21 02:20:00 2012 From: MEC at stowers.org (Cook, Malcolm) Date: Wed, 20 Jun 2012 19:20:00 -0500 Subject: [BioC] nearest() for GRanges In-Reply-To: <4FE22EE7.90404@fhcrc.org> Message-ID: Hi Valerie, Very glad you found and fixed the root cause. I don't know the overhead it would take for you, but, this being a regression, might you consider fixing in Bioconductor 2.10 as, say GenomicRanges_1.8. Thanks for your consideration, Malcolm On 6/20/12 3:13 PM, "Valerie Obenchain" wrote: >Hi Oleg, Malcom, > >Thanks for the bug report. This is now fixed in devel 1.9.28. 
Over the >past months we've done an overhaul of the precede/follow code in devel. >The new nearest method is based on the new precede and follow and is >documented at > >?'nearest,GenomicRanges,GenomicRanges-method' > >Let me know if you run into problems. > >Valerie > > > >On 06/18/2012 02:25 PM, Cook, Malcolm wrote: >> Martin, Oleg, Val, all, >> >> I too have script logic that benefitted from and depends upon what the >> behavior of nearest,GenomicRanges,missing as reported by Oleg. >> >> Thanks for the unit tests Martin. >> >> If it helps in sleuthing, in my hands, the 3rd test used to pass (if my >> memory serves), but does not pass now, as the attached transcript shows. >> >> Hoping it helps find a speedy resolution to this issue, >> >> ~ Malcolm Cook >> >> >> >>> r<- IRanges(c(1,5,10), c(2,7,12)) >>> g<- GRanges("chr1", r, "+") >>> checkEquals(precede(r), precede(g)) >> [1] TRUE >>> checkEquals(follow(r), follow(g)) >> [1] TRUE >>> try(checkEquals(nearest(r), nearest(g))) >> Error in checkEquals(nearest(r), nearest(g)) : >> Mean relative difference: 0.6 >> >> >>> sessionInfo() >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] tools splines parallel stats graphics grDevices utils >> datasets methods base >> >> other attached packages: >> [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 >> RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 >> bit_1.1-8 evd_2.2-6 ipred_0.8-13 >> prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 >> survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 >> ada_2.0-2 rpart_3.1-53 e1071_1.6 >> class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 >> rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 >> lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 >> [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 >> Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 >> GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 >> BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 >> svUnit_0.7-10 
doParallel_1.0.1 iterators_1.0.6 >> foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 >> RSQLite.extfuns_0.0.1 RSQLite_0.11.1 chron_2.3-42 >> gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 >> functional_0.1 reshape_0.8.4 plyr_1.7.1 >> [55] stringr_0.6 gtools_2.6.2 >> >> loaded via a namespace (and not attached): >> [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 >> codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 >> digest_0.5.2 grid_2.15.0 labeling_0.1 memoise_0.1 >> munsell_0.3 reshape2_1.2.1 scales_0.2.1 stats4_2.15.0 >> tcltk_2.15.0 zlibbioc_1.2.0 >> >> >> >> >> >> >> On 6/18/12 2:39 PM, "Martin Morgan" wrote: >> >>> Hi Oleg -- >>> >>> On 06/17/2012 11:11 PM, Oleg Mayba wrote: >>>> Hi, >>>> >>>> I just noticed that a piece of logic I was relying on with GRanges >>>> before >>>> does not seem to work anymore. Namely, I expect the behavior of >>>> nearest() >>>> with a single GRanges object as an argument to be the same as that of >>>> IRanges (example below), but it's not anymore. I expect nearest(GR1) >>>> NOT >>>> to behave trivially but to return the closest range OTHER than the >>>>range >>>> itself. I could swear that was the case before, but isn't any longer: >>> I think you're right that there is an inconsistency here; Val will >>> likely help clarify in a day or so. My two cents... 
>>>
>>> I think, certainly, that GRanges on a single chromosome on the "+"
>>> strand should behave like an IRanges
>>>
>>> library(GenomicRanges)
>>> library(RUnit)
>>>
>>> r<- IRanges(c(1,5,10), c(2,7,12))
>>> g<- GRanges("chr1", r, "+")
>>>
>>> ## first two ok, third should work but fails
>>> checkEquals(precede(r), precede(g))
>>> checkEquals(follow(r), follow(g))
>>> try(checkEquals(nearest(r), nearest(g)))
>>>
>>> Also, on the "-" strand I think we're expecting
>>>
>>> g<- GRanges("chr1", r, "-")
>>>
>>> ## first two ok, third should work but fails
>>> checkEquals(follow(r), precede(g))
>>> checkEquals(precede(r), follow(g))
>>> try(checkEquals(nearest(r), nearest(g)))
>>>
>>> For "*" (which was your example) I think the situation is (a) different
>>> in devel than in release; and (b) not so clear. In devel, "*" is (from
>>> method?"nearest,GenomicRanges,missing") "x on '*' strand can match to
>>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think these
>>> are always true:
>>>
>>> checkEquals(precede(g), follow(g))
>>> checkEquals(nearest(r), follow(g))
>>>
>>> I would also expect
>>>
>>> try(checkEquals(nearest(g), follow(g)))
>>>
>>> though this seems not to be the case. In 'release', "*" is coerced and
>>> behaves as if on the "+" strand (I think).
>>> >>> Martin >>> >>>>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>>>> z >>>> IRanges of length 3 >>>> start end width >>>> [1] 1 2 2 >>>> [2] 5 7 3 >>>> [3] 10 12 3 >>>>> nearest(z) >>>> [1] 2 1 2 >>>>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >>>> end=c(2,7,12))) >>>>> z >>>> GRanges with 3 ranges and 0 elementMetadata cols: >>>> seqnames ranges strand >>>> >>>> [1] chr1 [ 1, 2] * >>>> [2] chr1 [ 5, 7] * >>>> [3] chr1 [10, 12] * >>>> --- >>>> seqlengths: >>>> chr1 >>>> NA >>>>> nearest(z) >>>> [1] 1 2 3 >>>>> sessionInfo() >>>> R version 2.15.0 (2012-03-30) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] datasets utils grDevices graphics stats methods base >>>> >>>> other attached packages: >>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] stats4_2.15.0 >>>> >>>> >>>> I want the IRanges behavior and not what seems currently to be the >>>> GRanges >>>> behavior, since I have some code that depends on it. Is there a quick >>>> way >>>> to make nearest() do that for me again? >>>> >>>> Thanks! >>>> >>>> Oleg. >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. 
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>>http://news.gmane.org/gmane.science.biology.informatics.conductor
>

From mayba.oleg at gene.com Thu Jun 21 02:40:34 2012
From: mayba.oleg at gene.com (Oleg Mayba)
Date: Wed, 20 Jun 2012 17:40:34 -0700
Subject: [BioC] nearest() for GRanges
In-Reply-To:
References: <4FE22EE7.90404@fhcrc.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From pau.gregoire at gene.com Thu Jun 21 03:19:22 2012
From: pau.gregoire at gene.com (Gregoire Pau)
Date: Wed, 20 Jun 2012 18:19:22 -0700
Subject: [BioC] Problems installing EBImage
In-Reply-To: <63F358F0ED08BD40AADA9D735420310F03316437B717@STBEVS08.stb.sun.ac.za>
References: <63F358F0ED08BD40AADA9D735420310F03316437B717@STBEVS08.stb.sun.ac.za>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From ulitskyi at gmail.com Wed Jun 20 15:06:05 2012
From: ulitskyi at gmail.com (Igor Ulitsky)
Date: Wed, 20 Jun 2012 16:06:05 +0300
Subject: [BioC] .wig files for strand-specific paired-end RNA-Seq
In-Reply-To:
References: <20120620115347.6E651133D06@mamba.fhcrc.org>
Message-ID:

Yes, I'm trying to get the strand of transcription. If the XS flags
are correct (I assume they will be if I run tophat with the appropriate
library-type?) then the export will also be on the right strand?

Thanks!

Igor.

On Wed, Jun 20, 2012 at 3:59 PM, Michael Lawrence wrote:
>
>
> On Wed, Jun 20, 2012 at 4:53 AM, Igor Ulitsky [guest]
> wrote:
>>
>>
>> Hi,
>>
>> Is there a simple way to make strand-specific .wig file (i.e., a separate
>> track for + and - strand) from paired-end data (where the second read maps
>> to the other strand)? I've tried using this:
>>
>> library(Rsamtools)
>> library(rtracklayer)
>> myReads <- readGappedAlignments("RNAseqMapping.bam")
>> coveragePlus <- coverage(myReads[strand(myReads) == '+'])
>> export(coveragePlus, "RNAplus.wig")
>> coverageMinus <- coverage(myReads[strand(myReads) == '-'])
>> export(coverageMinus, "RNAminus.wig")
>>
>> But it appears that the second read in the pair contributes to the other
>> strand, generating similar tracks for the + and the - strands.
>> Is there a way to deal with this better?
>>
>
> Are you trying to generate coverages for the actual strand of transcription?
> If so, you would probably get that information from the XS tag and set it as
> your strand prior to export, but unless you used a special protocol the XS
> information would be incomplete. Btw, I would recommend BigWig export of
> those coverage tracks.
>
> Michael
>
>>
>> Thanks!
>>
>> Igor.
>>
>> -- output of sessionInfo():
>>
>> R version 2.13.1 (2011-07-08)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252
>> [2] LC_CTYPE=English_United States.1252
>> [3] LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.13.1
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
> > From SHorvath at mednet.ucla.edu Wed Jun 20 18:19:52 2012 From: SHorvath at mednet.ucla.edu (Horvath, Steve) Date: Wed, 20 Jun 2012 09:19:52 -0700 Subject: [BioC] WGCNA chooseTopHubInEachModule function In-Reply-To: References: <1029755442.1043986.1340184648948.JavaMail.root@scai-extern.fraunhofer.de> <1541271595.1044256.1340184972969.JavaMail.root@scai-extern.fraunhofer.de>, Message-ID: Dear Tim and Sudeep, regarding your question one can invoke a general rule linking unsigned and signed networks: if a power of beta is chosen for an unsigned network then one should choose a power of 2*beta for corresponding signed network. Therefore, I suggest to use a power of 4 for a signed network. In any event, the good news is that weighted networks are fairly robust with respect to (soft) threshold choices (i.e. the power) so the result should be fairly robust irrespective of the choice of beta. Steve ________________________________ From: Tim Triche, Jr. [tim.triche at gmail.com] Sent: Wednesday, June 20, 2012 6:38 AM To: Sudeep Sahadevan Cc: bioconductor at r-project.org; Horvath, Steve; Peter.Langfelder at gmail.com Subject: Re: [BioC] WGCNA chooseTopHubInEachModule function WGCNA is not a BioC package, you should cc: the authors (Steve Horvath and Peter Langfelder) on your email (IMHO) On Wed, Jun 20, 2012 at 2:36 AM, Sudeep Sahadevan > wrote: Hi all, In WGCNA R package the default "power" argument for the function "chooseTopHubInEachModule" is 2. My question is there anyway to test what would be the optimum argument to use for a signed network ? Thank you in advance. Regards, Sudeep. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- A model is a lie that helps you see the truth. 
Howard Skipper

________________________________
IMPORTANT WARNING: This email (and any attachments) is o...{{dropped:9}}

From Kaat.DeCremer at biw.kuleuven.be Thu Jun 21 11:42:41 2012
From: Kaat.DeCremer at biw.kuleuven.be (Kaat De Cremer)
Date: Thu, 21 Jun 2012 09:42:41 +0000
Subject: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates
Message-ID: <3D4A97F14E343F4584925219C1C1ACEF05B95CE6@ICTS-S-MBX7.luna.kuleuven.be>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From laxvid at gmail.com Thu Jun 21 14:40:13 2012
From: laxvid at gmail.com (Lakshmanan Iyer)
Date: Thu, 21 Jun 2012 08:40:13 -0400
Subject: [BioC] Coverage on GappedAlignmentPairs
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From nunesf at gmail.com Thu Jun 21 14:50:02 2012
From: nunesf at gmail.com (Flavia Nunes)
Date: Thu, 21 Jun 2012 14:50:02 +0200
Subject: [BioC] estimateDispersions in DESeq
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From david at harsk.dk Thu Jun 21 17:01:13 2012
From: david at harsk.dk (David Westergaard)
Date: Thu, 21 Jun 2012 17:01:13 +0200
Subject: [BioC] Printing arrayQualityMetrics report
Message-ID:

Hello,

I am trying to include an arrayQualityMetrics report in my appendix,
but I am finding it very difficult to produce a pretty version as a
pdf. I have tried simply to print to file in both Google Chrome and
Iceweasel, and both of them are having trouble placing the figures
correctly in the pdf. I have also tried to convert index.html to LaTeX
using html2latex, which did not work either.

Is it possible to produce a non-interactive pdf (or something similar)
which can be included in a thesis, written in LaTeX, from
arrayQualityMetrics? I could not find any information about this in
the documentation.

Best,
David

From polyphemus421 at gmail.com Thu Jun 21 17:51:00 2012
From: polyphemus421 at gmail.com (P. Murakami)
Date: Thu, 21 Jun 2012 11:51:00 -0400
Subject: [BioC] error in dmrFinder of charm package when there is one sample per group
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From tim.triche at gmail.com Thu Jun 21 20:29:32 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Thu, 21 Jun 2012 11:29:32 -0700
Subject: [BioC] using disjoin() for copy number as Rle() columns of a SummarizedExperiment
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From hpages at fhcrc.org Thu Jun 21 22:37:32 2012
From: hpages at fhcrc.org (Hervé Pagès)
Date: Thu, 21 Jun 2012 13:37:32 -0700
Subject: [BioC] BSgenome packages for new UCSC rn5 and galGal4 assemblies
Message-ID: <4FE3860C.7090005@fhcrc.org>

Hi there,

Just packaged BSgenome.Rnorvegicus.UCSC.rn5 (Rat) and
BSgenome.Ggallus.UCSC.galGal4 (Chicken), and dropped them
into our BioC 2.11 (devel) repository. They should become
available via biocLite() within the next hour or so. Note
that only the source tarballs will be available for the moment
so Windows and Mac users will need to install with:

  install.packages(..., type="source")

BTW, if nobody still needs them, I'd like to stop supporting mm8
(from Feb. 2006) and sacCer1 (Oct. 2003) at some point. They've
both been superseded by newer assemblies: by mm9 and sacCer2 for
a while, and more recently by mm10 and sacCer3. The current plan
is that they will be part of the next release (Bioc 2.11, October
2012) but will be removed from Bioc 2.12 (if nobody says anything
in the meantime).

Thanks,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O.
Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

From whuber at embl.de Fri Jun 22 00:02:59 2012
From: whuber at embl.de (Wolfgang Huber)
Date: Fri, 22 Jun 2012 00:02:59 +0200
Subject: [BioC] Printing arrayQualityMetrics report
In-Reply-To:
References:
Message-ID: <4FE39A13.3010004@embl.de>

Dear David

I get a reasonable PDF with default settings on Firefox 13 on a Mac OS X
10.6.8. I will send you this off-list. However, it is not perfect and
some twiddling with printer formatting may improve things.

I do not perceive the production of reports in (or fully automated
conversion into) PDF as a high priority, and since your use case is a
'one-off', why don't you take the LaTeX that you get from html2latex and
manually edit it to your needs? After all, the report is not exactly a
very complex document, just a couple of text blocks with figures
interspersed.

Hope this helps
Wolfgang

Jun/21/12 5:01 PM, David Westergaard scripsit::
> Hello,
>
> I am trying to include an arrayQualityMetrics report in my appendix,
> but I am finding it very difficult to produce a pretty version as a
> pdf. I have tried simply to print to file in both Google Chrome and
> Iceweasel, and both of them are having trouble placing the figures
> correctly in the pdf. I have also tried to convert index.html to LaTeX
> using html2latex, which did not work either.
>
> Is it possible to produce a non-interactive pdf (or something similar)
> which can be included in a thesis, written in LaTeX, from
> arrayQualityMetrics? I could not find any information about this in
> the documentation.
> > Best, > David > -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber From whuber at embl.de Fri Jun 22 00:29:03 2012 From: whuber at embl.de (Wolfgang Huber) Date: Fri, 22 Jun 2012 00:29:03 +0200 Subject: [BioC] estimateDispersions in DESeq In-Reply-To: References: Message-ID: <4FE3A02F.9080107@embl.de> Dear Flavia Thank you for your detailed and informative feedback! Simon will be better able to address the question regarding what exactly changed between the versions with respect to the different parameter choices (sharingMode, fitType, method). Two other comments: 1. The fact that you see such a strong dependence on these choices might be a symptom of there being outliers (either whole samples being outliers, or certain measurements in some samples). Outlier detection is generally difficult to automate, yet outliers can have an excessive impact on inference, especially with such small sample sizes. Did you have a look at the 'pairs' plot between all 6 communities (on a log or log-like scale)? 2. The dispersion-mean model was developed on RNA-Seq data. I have no experience how well it fits to OTU counts in metagenomics. Your observation on the plot may indicate a problem. If you don't mind, I'd be interested in having a look at your data (you can anonymize the OTUs) to see how well the dispersion-mean model used by DESeq fits that. Best wishes Wolfgang -> More below. -> Jun/21/12 2:50 PM, Flavia Nunes scripsit:: > Dear List, > > I am trying to use DESeq to analyse a dataset where we have samples of 3 > healthy and 3 diseased microbial communities, and we are trying to > establish which OTUs are significantly more or less abundant in the healthy > vs diseased samples. > > I tried running the new version of DESeq (1.8.3) on both a Mac and a PC > running the latest version of R (2.15). Both versions give a strange > result, where all OTUs have a padj value that is >0.7. 
I found this to be
> strange, because when looking at the raw count data, it is obvious that
> some OTUs are abundant in the one treatment (say, high counts in all of the
> healthy samples) and absent in the other (0 or close to 0 on all of the
> diseased samples).

Due to the variance-sharing of 'genes' with similar mean counts, this
could easily happen if there are *other* OTUs with similar counts, but
highly discordant values within groups, forcing a high dispersion
estimate, and costing power even for those OTUs that you mention.

To avoid the dispersion-mean model, if you are willing to make the jump:
with 3 vs 3 samples you're in a place where sharingMode =
"gene-est-only" might already be useful - but I would consider the above
suggestions first.

>
> I asked a colleague to help me with the analysis, and he ran the analysis
> on an older version of DESeq (1.4), using the estimateVarianceFunction
> command instead of estimateDispersions. We saw that in the help file for
> estimateDispersions, that by using the sharingMode="fit-only",
> fitType="local" options, we should be able to get the same result as the
> estimateVarianceFunction. However, this is not the case. DESeq 1.4 was
> able to find 54 OTUs that were significantly different from healthy vs
> diseased samples, while DESeq 1.8.3 found that none of the OTUs were
> significantly different in healthy vs diseased.
>
> In a second attempt, we used the option method="per-condition" and this
> worked - I got the same 54 significant p-values as in the analysis with
> DESeq 1.4. But when I continued the analysis on other datasets (we have a
> number of different conditions), I again started to get odd p-values, such
> as 1.00 for every OTU. I changed the setting for the estimateDispersions
> command, trying different methods, and each time I would get a different
> set of p-values, but usually very high numbers, close to 1.
>
> It seems to me that the results are really sensitive to the method used to
> estimate dispersions, and I was wondering what are the properties of the
> data that I might have to look for in order to select the best method.
>
> Another unusual thing that I have noticed is that when I plot the
> Dispersion Estimates, the fit line deviates from the points towards the
> right side of the graph. This suggests to me that there must be something
> wrong with the fit estimate, but I do not know how I might be able to
> change the settings to get a better fit.
>
> I wanted to know if anyone on the list has come across a similar problem?
>
> I am using the commands below in DESeq. I can provide files of the data,
> as well as the results that I am receiving to anyone that might be
> interested in taking a closer look.
>
> WBDCountTable <- read.table( file.choose(), header=TRUE, row.names=1 )
> WBDDesign <- data.frame(row.names = colnames( WBDCountTable ), condition =
> c( "D1", "D2", "D3", "H1", "H2", "H3"), libType = c( "single-end",
> "single-end", "single-end", "single-end", "single-end", "single-end" ) )
> conds <- factor( c( "D", "D", "D", "H", "H", "H" ) )
> cds <- newCountDataSet( WBDCountTable, conds )
> cds <- estimateSizeFactors( cds )
> cds <- estimateDispersions( cds, method="per-condition", fitType="local" )
>
> plotDispEsts <- function( cds )
> {
> plot(
> rowMeans( counts( cds ) ),
> fitInfo(cds)$perGeneDispEsts,
> pch = 16, cex=1, log="xy" )
> xg <- 10^seq( -.5, 5, length.out=300 )
> lines( xg, fitInfo(cds)$dispFun( xg ), col="red" , lwd=3)
> }
>
> res <- nbinomTest( cds, "D", "H" )
>
> plotDE <- function( res )
> plot(res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.3, col =
> ifelse( res$padj < .1, "red", "black" ) )
> plotDE( res )
>
> res
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
>

--
Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

From ragowthaman at gmail.com Fri Jun 22 01:07:01 2012
From: ragowthaman at gmail.com (gowtham)
Date: Thu, 21 Jun 2012 16:07:01 -0700
Subject: [BioC] edgeR: calcNormFactors question
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From ragowthaman at gmail.com Fri Jun 22 01:13:59 2012
From: ragowthaman at gmail.com (gowtham)
Date: Thu, 21 Jun 2012 16:13:59 -0700
Subject: [BioC] edgeR: calcNormFactors question
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From ragowthaman at gmail.com Fri Jun 22 01:39:42 2012
From: ragowthaman at gmail.com (gowtham)
Date: Thu, 21 Jun 2012 16:39:42 -0700
Subject: [BioC] edgeR: calcNormFactors question
In-Reply-To:
References:
Message-ID:

Sorry about the repeated mailing: I have attached a smear plot of the
data in case that helps anyone attempting to answer my question.

On Thu, Jun 21, 2012 at 4:07 PM, gowtham wrote:
> Hi Everyone,
> I am analyzing an RNAseq experiment with two groups each having two
> replicates. One out of 4 libraries has only half as many reads mapping to
> the genome.
>
> Lib Fe+.1 has only 4 million reads while the others have 9 million or more.
> But still the norm.factors are not much different. With my naive
> understanding I would expect Fe+.1 to be very different from the others.
> I would like to know if what I see is okay?
>
> > oldsetDGE <- calcNormFactors(oldsetDGE)
> > oldsetDGE$samples
>       group lib.size norm.factors
> fe-.1     2  9664343    0.9865411
> fe-.2     2 11248827    1.0812947
> fe+.1     1  4194124    0.9662389
> fe+.2     1  9963626    0.9701888
>
>
> Thanks very much,
> Gowthaman
> --
> Gowthaman
>
> Bioinformatics Systems Programmer.
> SBRI, 307 West lake Ave N Suite 500
> Seattle, WA. 98109-5219
> Phone : LAB 206-256-7188 (direct).
>

--
Gowthaman

Bioinformatics Systems Programmer.
SBRI, 307 West lake Ave N Suite 500
Seattle, WA. 98109-5219
Phone : LAB 206-256-7188 (direct).

From phipson at wehi.EDU.AU Fri Jun 22 01:58:42 2012
From: phipson at wehi.EDU.AU (Belinda Phipson)
Date: Fri, 22 Jun 2012 09:58:42 +1000
Subject: [BioC] edgeR: calcNormFactors question
In-Reply-To:
References:
Message-ID: <000901cd5009$ccf3a620$66daf260$@edu.au>

Hi Gowthaman

Your output looks fine. What is more important is that library size is
taken into account as an offset later on when you fit the glm. See
help(glmFit).

Cheers,
Belinda

-----Original Message-----
From: bioconductor-bounces at r-project.org
[mailto:bioconductor-bounces at r-project.org] On Behalf Of gowtham
Sent: Friday, 22 June 2012 9:40 AM
To: bioconductor
Subject: Re: [BioC] edgeR: calcNormFactors question

Sorry about the repeated mailing: I have attached a smear plot of the
data in case that helps anyone attempting to answer my question.

On Thu, Jun 21, 2012 at 4:07 PM, gowtham wrote:
> Hi Everyone,
> I am analyzing an RNAseq experiment with two groups each having two
> replicates. One out of 4 libraries has only half as many reads
> mapping to the genome.
>
> Lib Fe+.1 has only 4 million reads while the others have 9 million or
> more. But still the norm.factors are not much different. With my naive
> understanding I would expect Fe+.1 to be very different from the
> others. I would like to know if what I see is okay?
>
> > oldsetDGE <- calcNormFactors(oldsetDGE)
> > oldsetDGE$samples
>       group lib.size norm.factors
> fe-.1     2  9664343    0.9865411
> fe-.2     2 11248827    1.0812947
> fe+.1     1  4194124    0.9662389
> fe+.2     1  9963626    0.9701888
>
>
> Thanks very much,
> Gowthaman
> --
> Gowthaman
>
> Bioinformatics Systems Programmer.
> SBRI, 307 West lake Ave N Suite 500
> Seattle, WA. 98109-5219
> Phone : LAB 206-256-7188 (direct).
>

--
Gowthaman

Bioinformatics Systems Programmer.
SBRI, 307 West lake Ave N Suite 500
Seattle, WA. 98109-5219
Phone : LAB 206-256-7188 (direct).

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}

From kasperdanielhansen at gmail.com Fri Jun 22 02:43:55 2012
From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen)
Date: Thu, 21 Jun 2012 20:43:55 -0400
Subject: [BioC] BSgenome packages for new UCSC rn5 and galGal4 assemblies
In-Reply-To: <4FE3860C.7090005@fhcrc.org>
References: <4FE3860C.7090005@fhcrc.org>
Message-ID:

I would prefer sacCer1 to be kept around; we have packaged data that
was mapped to this version. And it is small.

I would suggest keeping mm8 around as well, but I am not working in
mouse myself. I don't know what the situation is in mouse, but plenty
of people are still using hg18. Perhaps it is the same for mouse.

Kasper

On Thu, Jun 21, 2012 at 4:37 PM, Hervé Pagès wrote:
> Hi there,
>
> Just packaged BSgenome.Rnorvegicus.UCSC.rn5 (Rat) and
> BSgenome.Ggallus.UCSC.galGal4 (Chicken), and dropped them
> into our BioC 2.11 (devel) repository. They should become
> available via biocLite() within the next hour or so. Note
> that only the source tarballs will be available for the moment
> so Windows and Mac users will need to install with:
>
>   install.packages(..., type="source")
>
> BTW, if nobody still needs them, I'd like to stop supporting mm8
> (from Feb. 2006) and sacCer1 (Oct. 2003) at some point. They've
> both been superseded by newer assemblies: by mm9 and sacCer2 for
> a while, and more recently by mm10 and sacCer3. The current plan
> is that they will be part of the next release (Bioc 2.11, October
> 2012) but will be removed from Bioc 2.12 (if nobody says anything
> in the meantime).
>
> Thanks,
> H.
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax:
(206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

From tim.triche at gmail.com Fri Jun 22 02:53:59 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Thu, 21 Jun 2012 17:53:59 -0700
Subject: [BioC] BSgenome packages for new UCSC rn5 and galGal4 assemblies
In-Reply-To:
References: <4FE3860C.7090005@fhcrc.org>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From kasperdanielhansen at gmail.com Fri Jun 22 03:09:54 2012
From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen)
Date: Thu, 21 Jun 2012 21:09:54 -0400
Subject: [BioC] BSgenome packages for new UCSC rn5 and galGal4 assemblies
In-Reply-To:
References: <4FE3860C.7090005@fhcrc.org>
Message-ID:

In my opinion that pretty much settles it: I see no reason to
deprecate these genome builds.

It may be different for genomes that are still in early draft stages,
where the improvement in the new build is sometimes so big that no-one
in their right mind would continue to work on the old.

Kasper

On Thu, Jun 21, 2012 at 8:53 PM, Tim Triche, Jr. wrote:
> there are plenty of experiments in GEO and elsewhere that were aligned to
> mm8, although I don't know if you so much need the BSgenome package to deal
> with them (i.e. as opposed to liftOver).  As recently as last year people
> were still aligning to it
>
>
> On Thu, Jun 21, 2012 at 5:43 PM, Kasper Daniel Hansen
> wrote:
>>
>> I would prefer sacCer1 to be kept around; we have packaged data that
>> was mapped to this version. And it is small.
>>
>> I would suggest keeping mm8 around as well, but I am not working in
>> mouse myself. I don't know what the situation is in mouse, but plenty
>> of people are still using hg18. Perhaps it is the same for mouse.
>>
>> Kasper
>>
>> On Thu, Jun 21, 2012 at 4:37 PM, Hervé Pagès wrote:
>> > Hi there,
>> >
>> > Just packaged BSgenome.Rnorvegicus.UCSC.rn5 (Rat) and
>> > BSgenome.Ggallus.UCSC.galGal4 (Chicken), and dropped them
>> > into our BioC 2.11 (devel) repository. They should become
>> > available via biocLite() within the next hour or so. Note
>> > that only the source tarballs will be available for the moment
>> > so Windows and Mac users will need to install with:
>> >
>> >   install.packages(..., type="source")
>> >
>> > BTW, if nobody still needs them, I'd like to stop supporting mm8
>> > (from Feb. 2006) and sacCer1 (Oct. 2003) at some point. They've
>> > both been superseded by newer assemblies: by mm9 and sacCer2 for
>> > a while, and more recently by mm10 and sacCer3. The current plan
>> > is that they will be part of the next release (Bioc 2.11, October
>> > 2012) but will be removed from Bioc 2.12 (if nobody says anything
>> > in the meantime).
>> >
>> > Thanks,
>> > H.
>> >
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fhcrc.org
>> > Phone: (206) 667-5791
>> > Fax: (206) 667-1319
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> > http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> --
> A model is a lie that helps you see the truth.
>
> Howard Skipper
>

From lawrence.michael at gene.com Fri Jun 22 04:27:11 2012
From: lawrence.michael at gene.com (Michael Lawrence)
Date: Thu, 21 Jun 2012 19:27:11 -0700
Subject: [BioC] using disjoin() for copy number as Rle() columns of a SummarizedExperiment
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From hpages at fhcrc.org Fri Jun 22 06:27:21 2012
From: hpages at fhcrc.org (Hervé Pagès)
Date: Thu, 21 Jun 2012 21:27:21 -0700
Subject: [BioC] BSgenome packages for new UCSC rn5 and galGal4 assemblies
In-Reply-To:
References: <4FE3860C.7090005@fhcrc.org>
Message-ID: <4FE3F429.9000104@fhcrc.org>

OK, thanks for the useful feedback. I was under the impression that
maybe people didn't care/need those old assemblies anymore but it
seems that it might be useful to keep them around for now. So please
forget my proposal to deprecate them in the next release.

Cheers,
H.

On 06/21/2012 06:09 PM, Kasper Daniel Hansen wrote:
> In my opinion that pretty much settles it: I see no reason to
> deprecate these genome builds.
>
> It may be different for genomes that are still in early draft stages,
> where the improvement in the new build is sometimes so big that no-one
> in their right mind would continue to work on the old.
>
> Kasper
>
> On Thu, Jun 21, 2012 at 8:53 PM, Tim Triche, Jr. wrote:
>> there are plenty of experiments in GEO and elsewhere that were aligned to
>> mm8, although I don't know if you so much need the BSgenome package to deal
>> with them (i.e. as opposed to liftOver). As recently as last year people
>> were still aligning to it
>>
>>
>> On Thu, Jun 21, 2012 at 5:43 PM, Kasper Daniel Hansen
>> wrote:
>>>
>>> I would prefer sacCer1 to be kept around; we have packaged data that
>>> was mapped to this version. And it is small.
>>>
>>> I would suggest keeping mm8 around as well, but I am not working in
>>> mouse myself.
I don't know what the situation is in mouse, but plenty
>>> of people are still using hg18. Perhaps it is the same for mouse.
>>>
>>> Kasper
>>>
>>> On Thu, Jun 21, 2012 at 4:37 PM, Hervé Pagès wrote:
>>>> Hi there,
>>>>
>>>> Just packaged BSgenome.Rnorvegicus.UCSC.rn5 (Rat) and
>>>> BSgenome.Ggallus.UCSC.galGal4 (Chicken), and dropped them
>>>> into our BioC 2.11 (devel) repository. They should become
>>>> available via biocLite() within the next hour or so. Note
>>>> that only the source tarballs will be available for the moment
>>>> so Windows and Mac users will need to install with:
>>>>
>>>>   install.packages(..., type="source")
>>>>
>>>> BTW, if nobody still needs them, I'd like to stop supporting mm8
>>>> (from Feb. 2006) and sacCer1 (Oct. 2003) at some point. They've
>>>> both been superseded by newer assemblies: by mm9 and sacCer2 for
>>>> a while, and more recently by mm10 and sacCer3. The current plan
>>>> is that they will be part of the next release (Bioc 2.11, October
>>>> 2012) but will be removed from Bioc 2.12 (if nobody says anything
>>>> in the meantime).
>>>>
>>>> Thanks,
>>>> H.
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O.
Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>>
>> --
>> A model is a lie that helps you see the truth.
>>
>> Howard Skipper
>>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

From donttrustben at gmail.com Fri Jun 22 07:09:08 2012
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Fri, 22 Jun 2012 15:09:08 +1000
Subject: [BioC] SRAdb: is the database missing some entries? (Ben Woodcroft)
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From tim.triche at gmail.com Fri Jun 22 07:43:56 2012
From: tim.triche at gmail.com (Tim Triche, Jr.)
Date: Thu, 21 Jun 2012 22:43:56 -0700
Subject: [BioC] using disjoin() for copy number as Rle() columns of a SummarizedExperiment
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From james.reid at ifom.eu Thu Jun 21 18:40:28 2012
From: james.reid at ifom.eu (James F.Reid)
Date: Thu, 21 Jun 2012 16:40:28 +0000
Subject: [BioC] BioC package for miRNA target scanning (or displaying results from databases)
References:
Message-ID:

Hi Kart,

Karthik K N writes:
>
> Dear Members,
>
> Are there packages for searching miRNA targets in Bioconductor?
>
> I Have a list of miRNAs each of whose common target I want to pull out from
> TargetScan, PicTar, miRanda, miRDB etc. Is this really possible?
>
> As a last resort, which BioC/R package can help me search the
> aforementioned databases and save the targets (that is common to all the
> databases) in a txt or excel format?

You could try the BioC RmiR and RmiR.Hs.miRNA packages, which integrate
different databases of predicted human gene targets. Alternatively, use
an online tool such as mirDIP. The only single database that is
available on BioC is targetscan, for human and mouse only
('targetscan.Hs.eg.db', 'targetscan.Mm.eg.db').

HTH.
J.

>
> It is taking hell lot of time searching all databases!
>
> Thanks a lot,
>
> Regards,
>
> Kart
>

From reidjf at gmail.com Fri Jun 22 10:11:45 2012
From: reidjf at gmail.com (James F. Reid)
Date: Fri, 22 Jun 2012 09:11:45 +0100
Subject: [BioC] Printing arrayQualityMetrics report
In-Reply-To: <4FE39A13.3010004@embl.de>
References: <4FE39A13.3010004@embl.de>
Message-ID: <4FE428C1.7000309@gmail.com>

Hi,

suggestion below

On 21/06/12 23:02, Wolfgang Huber wrote:
> Dear David
>
> I get a reasonable PDF with default settings on Firefox 13 on a Mac OS X
> 10.6.8. I will send you this off-list. However, it is not perfect and
> some twiddling with printer formatting may improve things.
> > I do not perceive the production of reports in (or fully automated > conversion into) PDF as a high priority, and since your use case is a > 'one-off', why don't you take the LaTeX that you get from html2latex and > manually edit it to your needs? After all, the report is not exactly a > very complex document, just a couple of text blocks with figures > interspersed. > > Hope this helps > Wolfgang > > > > > > Jun/21/12 5:01 PM, David Westergaard scripsit:: >> Hello, >> >> I am trying to include an arrayQualityMetrics report in my appendix, >> but I am finding it very difficult to produce a pretty version as a >> pdf. I have tried simply to print to file in both Google Chrome and >> Iceweasel, and both of them are having trouble placing the figures >> correctly in the pdf. I have also tried to convert index.html to LaTeX >> using html2latex, which did not work either. >> >> Is it possible to produce a non-interactive pdf (Or something similar) >> which can be included in a thesis, written in LaTe, from >> arrayQualityMetrtics? I could not find any information about this in >> the documentation. >> >> Best, >> David >> > > you could extract the table contents using the readHTMLTable function from the 'XML' package and for the figures just include the pdfs as figures and add a caption to them. HTH, J. From ragowthaman at gmail.com Fri Jun 22 11:18:51 2012 From: ragowthaman at gmail.com (gowtham) Date: Fri, 22 Jun 2012 02:18:51 -0700 Subject: [BioC] edgeR: calcNormFactors question In-Reply-To: <000901cd5009$ccf3a620$66daf260$@edu.au> References: <000901cd5009$ccf3a620$66daf260$@edu.au> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From heidi at ebi.ac.uk Fri Jun 22 11:27:37 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Fri, 22 Jun 2012 10:27:37 +0100 Subject: [BioC] re-shape table for HtqPCR In-Reply-To: References: Message-ID: Hi Alessandro, > hi hiedi, > how are you? 
> > I 'd like to use HTqPCR to analyze our expression data. > > I have been given the following dataset (attached) that consists of 80 > samples and 48 targets. Each target has been measured twice so totally > there are 80 * 48 * 2 = 7680 measurements done in cards containing 48 > targets and 4 samples (so totally 20 cards). > > How would you shape this table in a format that might be read into HTqPCR? > There are (at least) 2 different ways of doing this. 1) Read your data into R using any standard function, such as read.table(). Create your qPCRset-object using new("qPCRset", ...) 2) Read your data into R using readCtData(). Re-format the object according to your data using readCtLayout. There are examples of both approaches in the vignette, section 13.6. Which one you prefer is a question of personal preference. The second may be useful when you have a lot of information in addition to your Ct values (such as feature classes, categories etc.), but that doesn't seem to be the case for your data. You can try one of these two, and if it doesn't seem to work, let me know. HTH \Heidi > thanks very much for your help! > Alessandro > From Heidi.Dvinge at cancer.org.uk Fri Jun 22 11:17:18 2012 From: Heidi.Dvinge at cancer.org.uk (Heidi Dvinge) Date: Fri, 22 Jun 2012 10:17:18 +0100 Subject: [BioC] HTqPCR In-Reply-To: References: <6D0043C9-4BAE-469C-8369-8733D7D53644@cancer.org.uk> <6C95A3CE-902D-4068-B64A-0A2813071A1A@cancer.org.uk> <50B7F68B-3762-4FF4-8F97-692ED30F06AE@cancer.org.uk> <99093F24-FF41-4FA0-BE55-BB0AF5C3D010@cancer.org.uk> <19700BD8-ECBB-4D72-AA23-1C5D617531E7@cancer.org.uk> Message-ID: Hi Silvia, On 20 Jun 2012, at 19:25, Silvia Halim wrote: > Hi Heidi, > > I've tried sample.reps and feature.reps parameters on plotCtVariation, however, I couldn't make it work. It keeps saying 'figure margins too large'. 
This isn't related to HTqPCR as such, but rather a general R statement
saying that you're trying to plot something that's too large for your
current plotting device. Most devices (e.g. quartz(), X11()) will let
you set a 'width' and 'height' parameter, or you can perhaps plot
directly to a file, where you can set the dimensions as required.

Just for the record, plotting a 96x96 object works for me. What is the
output of the following on your computer:

# Load 48x48 example data
exPath <- system.file("exData", package = "HTqPCR")
raw <- readCtData(files = "BioMark_sample.csv", path = exPath, format = "BioMark", n.features = 48, n.data = 48)
# Create 'pseudo' 96x96 data set
tmp <- cbind(raw, raw)
raw96 <- rbind(tmp, tmp)
# Plot
plotCtVariation(raw96)

> Actually how should I set either of these 2 parameters?
> In your example in the help page, your command is plotCtVariation(qPCRraw[1:40,], sample.reps=rep(1:2,3)). Do you specify 'rep(1:2,3)' for sample.reps because each feature is replicated exactly twice only and you want to plot only 3 samples?
>
Typing rep(1:2, 3) in the console will give you a vector
c(1,2,1,2,1,2), which in this case indicates that the 6 samples in my
object fall into 2 different groups. Here they're just named '1' and
'2' for the sake of ease, but I might as well have said e.g.
sample.reps=c("control", "treatment", "control", "treatment",
"control", "treatment") or whatever applies to a given experiment.

The same goes for feature.reps. In the example,
feature.reps=paste("test", rep(1:96, each=4)) simply refers to the fact
that on a 384-well assay there are 96 individual features, and each
feature is present 4 times after each other. C.f.

> paste("test", rep(1:96, each=4))[1:10]
 [1] "test 1" "test 1" "test 1" "test 1" "test 2" "test 2" "test 2" "test 2" "test 3"
[10] "test 3"

Typically, feature.reps would simply be the featureNames of your object,
given that each feature name appears multiple times. This is also the
default behaviour of plotCtVariation.
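
As a small illustration of the two grouping vectors described above, here is a sketch in base R only. No HTqPCR function is called, and the variable names (sample.groups, sample.labels, feature.groups) are made up for this example; the "control"/"treatment" labels are the hypothetical ones mentioned above.

```r
# Grouping vector for 6 samples that alternate between two groups
sample.groups <- rep(1:2, 3)
print(sample.groups)

# The same grouping written with descriptive (hypothetical) labels
sample.labels <- rep(c("control", "treatment"), 3)
print(sample.labels)

# 96 distinct features, each present 4 times in a row (384 wells in total)
feature.groups <- paste("test", rep(1:96, each = 4))
print(feature.groups[1:10])

# Sanity check: how many labels occur how many times?
# (here every one of the 96 labels should occur exactly 4 times)
print(table(table(feature.groups)))
```

Either kind of vector could then be supplied as sample.reps or feature.reps, on the assumption that its length matches the number of samples or features in the qPCRset object being plotted.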
HTH
\Heidi

> And how should I interpret ' feature.reps=paste("test", rep(1:96, each=4))' in the other example of plotCtVariation?
>
> Thanks,
> Silvia
>
> -----Original Message-----
> From: Heidi Dvinge
> Sent: 20 June 2012 9:46 AM
> To: Silvia Halim
> Cc: bioconductor at r-project.org
> Subject: Re: HTqPCR
>
> Hi Silvia,
>
> On 19 Jun 2012, at 18:48, Silvia Halim wrote:
>
>> Hi Heidi,
>>
>> Thanks for your tips. I figured I could probably use plotCtVariation(). I am able to use the function to plot variation across samples but how can I use it to plot across genes (or features)?
>> I tried following commands:
>> plotCtVariation(temp[,1:10], variation = "sd", log = TRUE, main = "SD
>> of replicated features", col = "lightgrey")
>> plotCtVariation(temp[1:10,], variation = "sd", log = TRUE, main = "SD
>> of replicated features", col = "lightgrey")
>
> Here you're just subsetting your qPCRset object before plotting, but you're not changing the actual plots.
>
>> There's a difference in the plots but both plots give me same labels on x-axis, i.e. sample names, though I was expecting the second command would give me gene names on x-axis label.
>>
> If you look at the plotCtVariation help files (especially the 'Examples' and 'Details' section), the parameters sample.reps and feature.reps control whether you plot the variation for each gene across or within samples. In order to get gene names, you have to set sample.reps, to indicate which samples are replicates of each other.
>
> Per default, the function calculates the variation between replicated features within each of your samples, and plots the distribution (boxplot) of this variation for each sample. If you want to check individual features or samples more specifically, you have to use type="detail" and possibly add.featurenames=TRUE. There are some examples included in the plotCtVariation help file.
> >> Also, the manual says we can exclude unreliable or undetermined data by setting the Ct values to NA using filterCategory. I am wondering how I can get rid of NA data from the plate. I also cannot exclude this kind of data or those having 'Failed' flags the very first time before reading in the input as a qPCRset object because the input has to be something like 48 x 48 or 96 x 96. >> > The question is, why do you want to remove the NA values? If you just leave them as NAs, then they're ignored during e.g. the calculation of differential expression and for most plotting purposes. > > You can't remove them as such, since (as you note), the object has to be in a certain features x samples format. If you want, you can replace them though, if you e.g. want to set all NA values to Ct=40: exprs(temp)[is.na(exprs(temp))] <- 40. But beware, because in that case the value '40' will be include into all numerical calculations, which may not be what you want. > > HTH > \Heidi > >> Many thanks, >> Silvia >> >> -----Original Message----- >> From: Heidi Dvinge >> Sent: 19 June 2012 2:05 PM >> To: Silvia Halim >> Cc: bioconductor at r-project.org >> Subject: Re: HTqPCR >> >> Hi Silvia, >> >> On 18 Jun 2012, at 17:51, Silvia Halim wrote: >> >>> Hi Heidi, >>> >>> The function breaks at plotCtReps. >>>> traceback() >>> 1: plotCtReps(temp, card = 2, percent = 20, xlim = c(0, 100), ylim = c(0, >>> 100)) >>> >>> You've pointed out the problem about the duplicates as I have 3 replicates on my assay. I got confused reading the manual as it says plotCtReps can be used for a sample containing duplicate measurements (which I thought to be 2 or more measurements). >>>> table(table(featureNames(temp))) >>> >>> 3 6 >>> 30 1 >>> >> If you try running the examples for plotCtReps, you'll see that the function directly plots two replicates of a feature against each other on the (x,y) axis. 3D (x,y,z) plots aren't implemented, so features that are replicated 3 times can't be plotted. 
I'll try to clarify the text for the function. >> >> Perhaps something like plotCtVariation() will give you what you're after? If you only want to visually inspect your data, then grep("plot", ls("package:HTqPCR"), value=TRUE) will list all the plotting functions available in HTqPCR. >> >> HTH >> \Heidi >> >>> Btw there's no NA in my data. >>>> sum(is.na(temp)) >>> [1] 0 >>> Warning message: >>> In is.na(temp) : is.na() applied to non-(list or vector) of type 'S4' >>>> >>> >>> Thanks, >>> Silvia >>> >>> -----Original Message----- >>> From: Heidi Dvinge >>> Sent: 15 June 2012 9:06 PM >>> To: Silvia Halim >>> Cc: bioconductor at r-project.org >>> Subject: Re: HTqPCR >>> >>> Hi Silvia, >>> >>> On 15 Jun 2012, at 18:45, Silvia Halim wrote: >>> >>>> Hi Heidi, >>>> >>>> I ran into below problem when using plotCtReps. >>>> >>>>> plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = >>>>> c(0,50)) >>>> Error in split.data[[s]] : subscript out of bounds In addition: >>>> Warning messages: >>>> 1: In min(x, na.rm = na.rm) : >>>> no non-missing arguments to min; returning Inf >>>> 2: In max(x, na.rm = na.rm) : >>>> no non-missing arguments to max; returning -Inf >>>>> plotCtReps(temp, card = 1, percent = 20, xlim = c(0,50), ylim = >>>>> c(0,50)) >>>> Error in split.data[[s]] : subscript out of bounds In addition: >>>> Warning messages: >>>> 1: In min(x, na.rm = na.rm) : >>>> no non-missing arguments to min; returning Inf >>>> 2: In max(x, na.rm = na.rm) : >>>> no non-missing arguments to max; returning -Inf >>>>> plotCtReps(temp, card = 2, percent = 20, xlim = c(0,100), ylim = >>>>> c(0,100)) >>>> Error in split.data[[s]] : subscript out of bounds In addition: >>>> Warning messages: >>>> 1: In min(x, na.rm = na.rm) : >>>> no non-missing arguments to min; returning Inf >>>> 2: In max(x, na.rm = na.rm) : >>>> no non-missing arguments to max; returning -Inf >>> >>> What's the output from traceback(), i.e. exactly where does the function break? 
>>>> >>> A couple of things you can try: >>> >>> - plotCtReps is meant to be used in cases where there are exactly 2 replicates of the features on your assay. Is this the case? For example, with the data below there are 190 features that will be plotted, and 1 that will be skipped: >>>> data(qPCRraw) >>>> table(table(featureNames(qPCRraw))) >>> 2 4 >>> 190 1 >>> >>> - are there any NAs in your data? E.g. sum(is.na(qPCRraw))>0. >>> >>> HTH >>> \Heidi >>> >>>> Here is how 'temp' looks like >>>>> temp >>>> An object of class "qPCRset" >>>> Size: 96 features, 96 samples >>>> Feature types: Reference, Test >>>> Feature names: b-Actin b-Actin b-Actin ... >>>> Feature classes: >>>> Feature categories: OK >>>> Sample names: NTC_4 PMPT352 NTC_3 ... >>>> >>>> Do you know why it is complaining about split.data? >>>> >>>> Thanks, >>>> Silvia >>>> >>>> -----Original Message----- >>>> From: Heidi Dvinge >>>> Sent: 11 June 2012 6:11 PM >>>> To: Silvia Halim >>>> Subject: Re: HTqPCR >>>> >>>> Ok, so you already have a 96 by 96 matrix, so you don't need changeCtLayout. >>>> Good luck with the rest, and let me know if you encounter any problems. >>>> >>>> On 11 Jun 2012, at 19:05, Silvia Halim wrote: >>>> >>>>> Hi Heidi, >>>>> >>>>> Thank you for your clarification. >>>>> >>>>> Btw this is how it looks like when I type 'temp' >>>>>> temp >>>>> An object of class "qPCRset" >>>>> Size: 96 features, 96 samples >>>>> Feature types: Reference, Test >>>>> Feature names: b-Actin b-Actin b-Actin ... >>>>> Feature classes: >>>>> Feature categories: OK >>>>> Sample names: NTC_4 PMPT352 NTC_3 ... >>>>> >>>>> Cheers, >>>>> Silvia >>>>> >>>>> -----Original Message----- >>>>> From: Heidi Dvinge >>>>> Sent: 08 June 2012 7:12 PM >>>>> To: Silvia Halim >>>>> Subject: Re: HTqPCR >>>>> >>>>> Hi Silvia, >>>>> >>>>> what are the dimensions of the "temp" object that you read in? I.e. 
>>>>> what does it look like if you just type >>>>>> temp >>>>> >>>>> If you read in the data with n.features=96 and n.data=96, then you should already have an object with 96 rows and 96 columns, in which case you don't need to change the layout. >>>>> >>>>> Best, >>>>> \Heidi >>>>> >>>>> On 8 Jun 2012, at 19:13, Silvia Halim wrote: >>>>> >>>>>> Hi Heidi, >>>>>> >>>>>> I finally have time to try out your HTqPCR bioconductor package again and I was trying to use 'changeCtLayout' function. However, I got following error message: >>>>>> >>>>>>> qPCRnew <- changeCtLayout(temp, sample.order = sample_order) >>>>>> Error in data.frame(..., check.names = FALSE) : >>>>>> arguments imply differing number of rows: 0, 96 In addition: >>>>>> Warning >>>>>> message: >>>>>> In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : >>>>>> data length is not a multiple of split variable >>>>>> >>>>>> The commands that I run are as follows: >>>>>>> temp <- readCtData("110614 BENIGN_1 DATA 96X96.csv", path = >>>>>>> getwd(), n.features = 96, n.data=96, flag = 9, feature = 5, type= >>>>>>> 6, Ct = 7, position = 1, skip = 12, sep = ",") sample_order <- >>>>>>> rep(sampleNames(temp), each = 96) qPCRnew <- changeCtLayout(temp, >>>>>>> sample.order = sample_order) >>>>>> >>>>>> I've tried to follow what's written in changeCtLayout function description. Can you please advise what went wrong? >>>>>> >>>>>> Thanks, >>>>>> Silvia >>>>>> >>>>>> -----Original Message----- >>>>>> From: Heidi Dvinge >>>>>> Sent: 29 April 2012 8:18 PM >>>>>> To: Silvia Halim >>>>>> Subject: Re: HTqPCR >>>>>> >>>>>> HI Silvia, >>>>>> >>>>>> I'm glad you got it working. Depending on what you're supposed to do with the data, you may need to tweak some functions slightly, as you mention. Let me know if you run into any more trouble. >>>>>> >>>>>> Cheers >>>>>> \Heidi >>>>>> >>>>>> On 26 Apr 2012, at 18:37, Silvia Halim wrote: >>>>>> >>>>>>> Hi Heidi, >>>>>>> >>>>>>> Thanks for the help! 
It's working for me now. Right now I'm figuring it out how I can use the functions that you described in the vignette. I might have to tweak the parameters for using the functions on Fluidigm data. >>>>>>> >>>>>>> Cheers, >>>>>>> Silvia >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Heidi Dvinge >>>>>>> Sent: 25 April 2012 8:56 AM >>>>>>> To: Silvia Halim >>>>>>> Subject: Re: HTqPCR >>>>>>> >>>>>>> Hiya, >>>>>>> >>>>>>> sorry, I only just now realised that you'd attached a file. When I saved as csv, the following command worked: >>>>>>> >>>>>>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", >>>>>>>> format="BioMark", >>>>>>>> n.features=96*96) raw >>>>>>> An object of class "qPCRset" >>>>>>> Size: 9216 features, 1 samples >>>>>>> Feature types: >>>>>>> Feature names: b-Actin b-Actin b-Actin ... >>>>>>> Feature classes: >>>>>>> Feature categories: OK >>>>>>> Sample names: 110614 BENIGN_1 DATA 96x96 ... >>>>>>> >>>>>>> The data isn't transformed into a 96x96 format immediately though (in case you read in multiple arrays, and want to normalise them independently). If you want to change this, you can use changeCtLayout(). Alternatively you can say: >>>>>>> >>>>>>>> raw <- readCtData("110614 BENIGN_1 DATA 96x96.csv", >>>>>>>> format="BioMark", n.features=96, n.data=96) raw >>>>>>> An object of class "qPCRset" >>>>>>> Size: 96 features, 96 samples >>>>>>> Feature types: >>>>>>> Feature names: b-Actin b-Actin b-Actin ... >>>>>>> Feature classes: >>>>>>> Feature categories: OK >>>>>>> Sample names: Sample1 Sample2 Sample3 ... >>>>>>>> plotCtArray(raw) >>>>>>> >>>>>>> HTH >>>>>>> \Heidi >>>>>>> >>>>>>> On 24 Apr 2012, at 17:55, Silvia Halim wrote: >>>>>>> >>>>>>>> Hi Heidi, >>>>>>>> >>>>>>>> I have some problems updating R on lustre. Therefore, I chose to run HTqPCR on my desktop for the moment. >>>>>>>> >>>>>>>> Reading in your sample file looks fine, however, reading in the >>>>>>>> file that I showed you just now gave me below error message. 
>>>>>>>> (The file is as attached) >>>>>>>> >>>>>>>>> temp <- readCtData("110614 BENIGN_1 DATA 96x96.xlsx", path = >>>>>>>>> getwd() , n.features = 96*96, flag = 9, feature = 5, type= 6, >>>>>>>>> Ct = 7,position = 1, skip = 12, sep = ",") >>>>>>>> Error in read.table(file = file, header = header, sep = sep, quote = quote, : >>>>>>>> no lines available in input >>>>>>>> In addition: Warning message: >>>>>>>> In readLines(file, skip) : >>>>>>>> incomplete final line found on 'C:/Users/halim01/Documents/20110627_RossAdamsH_DN_Fluid/110614 BENIGN_1 DATA 96x96.xlsx' >>>>>>>>> sessionInfo() >>>>>>>> R version 2.14.0 (2011-10-31) >>>>>>>> Platform: x86_64-pc-mingw32/x64 (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C LC_TIME=English_United Kingdom.1252 >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] Biostrings_2.22.0 IRanges_1.12.6 BiocInstaller_1.2.1 marray_1.32.0 HTqPCR_1.8.0 limma_3.10.3 RColorBrewer_1.0-5 Biobase_2.14.0 gdata_2.8.2 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] affy_1.32.1 affyio_1.22.0 gplots_2.10.1 gtools_2.6.2 preprocessCore_1.16.0 tools_2.14.0 zlibbioc_1.0.1 >>>>>>>>> >>>>>>>> >>>>>>>> I did a quick check on the file and it only has 9228 lines including 12 header lines which I had skipped when reading in the file. Do you know what could possibly go wrong? >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Silvia >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Heidi Dvinge >>>>>>>> Sent: 24 April 2012 5:09 PM >>>>>>>> To: Silvia Halim >>>>>>>> Subject: Re: HTqPCR >>>>>>>> >>>>>>>> Hm, that looks like it may be x11 acting up. I often have similar issues when I work on a remote server. >>>>>>>> >>>>>>>> Actually, the processing of Fluidigm files is very computationally light. 
So you can easily do it on your desktop, if you can't update on lustre. >>>>>>>> >>>>>>>> I can also email you and older version of the vignette if you want to have a look. However, in HTqPCR 1.2.0 I don't even think I had a dedicated function for plotting the Fluidigm assays yet (the plotCtArray shown in the vignette). >>>>>>>> >>>>>>>> Cheers >>>>>>>> \Heidi >>>>>>>> >>>>>>>> On 24 Apr 2012, at 16:39, Silvia Halim wrote: >>>>>>>> >>>>>>>>> Hi Heidi, >>>>>>>>> >>>>>>>>> This is what I got when accessing the vignette. >>>>>>>>> >>>>>>>>>> openVignette(package="HTqPCR") >>>>>>>>> Please select a vignette: >>>>>>>>> >>>>>>>>> 1: HTqPCR - qPCR analysis in R >>>>>>>>> >>>>>>>>> Selection: 1 >>>>>>>>> Opening >>>>>>>>> /home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf >>>>>>>>>> xprop: unable to open display '' >>>>>>>>> /usr/local/bin/xdg-open: line 370: firefox: command not found >>>>>>>>> /usr/local/bin/xdg-open: line 370: mozilla: command not found >>>>>>>>> /usr/local/bin/xdg-open: line 370: netscape: command not found >>>>>>>>> xdg-open: no method available for opening '/home/mib-cri/local/lib64/R/library/HTqPCR/doc/HTqPCR.pdf' >>>>>>>>> >>>>>>>>> Sorry for the confusion, you are right that I was looking at a newer version of HTqPCR than the one installed on lustre. I think that's because I have different installations of HTqPCR on lustre and on my desktop. If I can update the one on lustre, I'll go ahead with the update. >>>>>>>>> >>>>>>>>> Thank you, >>>>>>>>> Silvia >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: Heidi Dvinge >>>>>>>>> Sent: 24 April 2012 4:28 PM >>>>>>>>> To: Silvia Halim >>>>>>>>> Subject: Re: HTqPCR >>>>>>>>> >>>>>>>>> Ah, right, it looks like you have an older version of R, and therefore also HTqPCR. >>>>>>>>> >>>>>>>>> The most current release version is 1.10.0. In that version, readCtData() was modified to accept different types of input data, including from Fluidigm. 
Before that, this sort of data had to be read in 'manually'. >>>>>>>>> >>>>>>>>> I guess the vignette that you were looking at comes from a >>>>>>>>> version of HTqPCR that's newer than the one you have installed? >>>>>>>>> If you access the vignette corresponding to your HTqPCR version >>>>>>>>> via >>>>>>>>>> openVignette(package="HTqPCR") >>>>>>>>> what do you get then? >>>>>>>>> >>>>>>>>> If you get an older version, then depending on how old it is, there may be a section towards the end giving an example of how to process Fluidigm data more 'manually'. If not, an update may be your best bet. >>>>>>>>> >>>>>>>>> Cheers >>>>>>>>> \Heidi >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 24 Apr 2012, at 16:15, Silvia Halim wrote: >>>>>>>>> >>>>>>>>>> Hi Heidi, >>>>>>>>>> >>>>>>>>>> Thanks for looking into the matter. Below is the output of my >>>>>>>>>> sessionInfo() >>>>>>>>>> >>>>>>>>>>> sessionInfo() >>>>>>>>>> R version 2.13.0 (2011-04-13) >>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>>> >>>>>>>>>> locale: >>>>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >>>>>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>>>>> >>>>>>>>>> attached base packages: >>>>>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>>>>> >>>>>>>>>> other attached packages: >>>>>>>>>> [1] marray_1.26.0 Biostrings_2.20.1 IRanges_1.10.3 HTqPCR_1.2.0 >>>>>>>>>> [5] limma_3.6.9 RColorBrewer_1.0-2 Biobase_2.12.1 gdata_2.8.0 >>>>>>>>>> >>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>> [1] affy_1.26.1 affyio_1.20.0 gplots_2.8.0 >>>>>>>>>> [4] gtools_2.6.2 preprocessCore_1.14.0 >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Silvia >>>>>>>>>> >>>>>>>>>> -----Original Message----- >>>>>>>>>> From: Heidi Dvinge >>>>>>>>>> Sent: 24 April 2012 
4:07 PM >>>>>>>>>> To: Silvia Halim >>>>>>>>>> Subject: HTqPCR >>>>>>>>>> >>>>>>>>>> Hi Silvia, >>>>>>>>>> >>>>>>>>>> I just tested the read fluidigm from the vignette, and it works on both my mac and a single unix system that I've tested. Although from the errors you were getting, it seemed like the headers weren't been read correctly/at all. >>>>>>>>>> >>>>>>>>>> Would you mind sending me the output of your sessionInfo(), so I can compare which package versions we have? >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> \Heidi >>>>>>>>>> >>>>>>>>>>> sessionInfo() >>>>>>>>>> R version 2.15.0 (2012-03-30) >>>>>>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>>>>>>>> >>>>>>>>>> locale: >>>>>>>>>> [1] >>>>>>>>>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >>>>>>>>>> >>>>>>>>>> attached base packages: >>>>>>>>>> [1] tools stats graphics grDevices utils datasets methods base >>>>>>>>>> >>>>>>>>>> other attached packages: >>>>>>>>>> [1] HTqPCR_1.10.0 limma_3.12.0 RColorBrewer_1.0-5 Biobase_2.16.0 >>>>>>>>>> [5] BiocGenerics_0.2.0 >>>>>>>>>> >>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>> [1] affy_1.34.0 affyio_1.24.0 BiocInstaller_1.4.3 >>>>>>>>>> [4] gdata_2.8.2 gplots_2.10.1 gtools_2.6.2 >>>>>>>>>> [7] preprocessCore_1.18.0 stats4_2.15.0 zlibbioc_1.2.0 >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> <110614 BENIGN_1 DATA 96x96.xlsx> >>>>>>> >>>>>> >>>>> >>>> >>> >> > NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for ...{{dropped:16}} From ragowthaman at gmail.com Fri Jun 22 11:31:49 2012 From: ragowthaman at gmail.com (gowtham) Date: Fri, 22 Jun 2012 02:31:49 -0700 Subject: [BioC] edgeR: calcNormFactors question In-Reply-To: References: <000901cd5009$ccf3a620$66daf260$@edu.au> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From mark.robinson at imls.uzh.ch Fri Jun 22 11:50:29 2012 From: mark.robinson at imls.uzh.ch (Mark Robinson) Date: Fri, 22 Jun 2012 11:50:29 +0200 Subject: [BioC] edgeR: calcNormFactors question In-Reply-To: References: <000901cd5009$ccf3a620$66daf260$@edu.au> Message-ID: <187BC2EF-2DCC-4E65-A99E-00843C44C3B8@imls.uzh.ch> Hi Gowthaman, You shouldn't manually specify the offset in glmFit(), unless you have a specific need to. Short answer, you should use: fit <- glmFit(d, design) >>>> Lib Fe+.1 has only 4 million reads while other are 9 million +. But >>>> still the norm.factors are not much different. With my naive >>>> understanding i expect Fe+.1 to be very different from others. I would >>>> like to know if what I see is okay? This is ok, since the offset used in the downstream modeling is actually the product of the lib.size and norm.factors columns. Best, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics Institute of Molecular Life Sciences University of Zurich Winterthurerstrasse 190 8057 Zurich Switzerland v: +41 44 635 4848 f: +41 44 635 6898 e: mark.robinson at imls.uzh.ch o: Y11-J-16 w: http://tiny.cc/mrobin ---------- http://www.fgcz.ch/Bioconductor2012 On 22.06.2012, at 11:31, gowtham wrote: > Hi Belinda, > I think, i am bit confused now. The help document suggest, i should use > only one of "offset" and "lib.size". Seems like both of them take the > library size into account. And sounds like "offset" has a preference when > both are supplied. > > So, my question is do I have to explicitly ask for one or other? And do I > have to explicitly give it a value? > > > fit <- glmFit(d, design) > > OR > > > fit <- glmFit(d, design, offset=NULL) > > OR > > fit <- glmFit(d, design, lib.size=c(9664343, 11248827, 4194124, 9963626)) > > should I supply some values for "lib.sizes". Note, my DGEList already has > library size information in it. > > > Once again thanks for your answer and pointer to glmFit. 
> Gowthaman > > On Fri, Jun 22, 2012 at 2:18 AM, gowtham wrote: > >> Thanks very much Belinda. That is comforting. >> >> My DGEList object has library sizes added to it. Do I still need to supply >> a numeric vector with library sizes while fiting glm? Or is it >> automatically pulled from DGEList object? >> >> Reading help, i understand its automatic. Please advice me if I am wrong. >> " If y is a DGEList object then the default for lib.size is the product >> of the library sizes and the normalization factors (in the samples slot >> of the object). " >> >> Thanks, >> Gowthaman >> >> >> >> >> On Thu, Jun 21, 2012 at 4:58 PM, Belinda Phipson wrote: >> >>> Hi Gowthaman >>> >>> Your output looks fine. What is more important is that library size is >>> taken into account as an offset later on when you fit the glm. See >>> help(glmFit). >>> >>> Cheers, >>> Belinda >>> >>> -----Original Message----- >>> From: bioconductor-bounces at r-project.org [mailto: >>> bioconductor-bounces at r-project.org] On Behalf Of gowtham >>> Sent: Friday, 22 June 2012 9:40 AM >>> To: bioconductor >>> Subject: Re: [BioC] edgeR: calcNormFactors question >>> >>> Sorry about repeated mailing: I have attached a smear plot of the data >>> incase that helps anyone attempting to answer my doubt..... >>> >>> >>> On Thu, Jun 21, 2012 at 4:07 PM, gowtham wrote: >>> >>>> Hi Everyone, >>>> I am analyzing a RNAseq experiment with two groups each having two >>>> replicates. One out of 4 libraries have only half as much reads >>>> mapping to genome. >>>> >>>> Lib Fe+.1 has only 4 million reads while other are 9 million +. But >>>> still the norm.factors are not much different. With my naive >>>> understanding i expect Fe+.1 to be very different from others. I would >>>> like to know if what I see is okay? 
>>>> >>>>> oldsetDGE <- calcNormFactors(oldsetDGE) oldsetDGE$samples >>>> group lib.size norm.factors >>>> fe-.1 2 9664343 0.9865411 >>>> fe-.2 2 11248827 1.0812947 >>>> fe+.1 1 4194124 0.9662389 >>>> fe+.2 1 9963626 0.9701888 >>>> >>>> >>>> Thanks very much, >>>> Gowthaman >>>> -- >>>> Gowthaman >>>> >>>> Bioinformatics Systems Programmer. >>>> SBRI, 307 West lake Ave N Suite 500 >>>> Seattle, WA. 98109-5219 >>>> Phone : LAB 206-256-7188 (direct). >>>> >>> >>> >>> >>> -- >>> Gowthaman >>> >>> Bioinformatics Systems Programmer. >>> SBRI, 307 West lake Ave N Suite 500 >>> Seattle, WA. 98109-5219 >>> Phone : LAB 206-256-7188 (direct). >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for the >>> addressee. >>> You must not disclose, forward, print or use it without the permission of >>> the sender. >>> ______________________________________________________________________ >>> >> >> >> >> -- >> Gowthaman >> >> Bioinformatics Systems Programmer. >> SBRI, 307 West lake Ave N Suite 500 >> Seattle, WA. 98109-5219 >> Phone : LAB 206-256-7188 (direct). >> > > > > -- > Gowthaman > > Bioinformatics Systems Programmer. > SBRI, 307 West lake Ave N Suite 500 > Seattle, WA. 98109-5219 > Phone : LAB 206-256-7188 (direct). 
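
The point made in this thread, that downstream modelling works from the product of lib.size and norm.factors rather than from norm.factors alone, can be checked by hand. The following is only a minimal base-R sketch using the numbers from the oldsetDGE$samples table quoted above; the column names eff.lib.size and offset are made up for illustration, and the log-scale form of the offset is an assumption about how a log-link GLM would use it, not a statement about edgeR internals.

```r
# Values copied from the oldsetDGE$samples table quoted in this thread.
samples <- data.frame(
  lib.size     = c(9664343, 11248827, 4194124, 9963626),
  norm.factors = c(0.9865411, 1.0812947, 0.9662389, 0.9701888),
  row.names    = c("fe-.1", "fe-.2", "fe+.1", "fe+.2")
)

# Effective library size: raw library size scaled by the normalization factor.
# (eff.lib.size is a hypothetical column name used only in this sketch.)
samples$eff.lib.size <- samples$lib.size * samples$norm.factors

# Log-scale offset of the kind a log-link GLM would use (assumed form).
samples$offset <- log(samples$eff.lib.size)

print(samples)
```

The smaller depth of fe+.1 therefore shows up in the effective library size even though its norm.factors value is close to 1.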
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From mark.robinson at imls.uzh.ch Fri Jun 22 12:02:57 2012
From: mark.robinson at imls.uzh.ch (Mark Robinson)
Date: Fri, 22 Jun 2012 12:02:57 +0200
Subject: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates
In-Reply-To: <3D4A97F14E343F4584925219C1C1ACEF05B95CE6@ICTS-S-MBX7.luna.kuleuven.be>
References: <3D4A97F14E343F4584925219C1C1ACEF05B95CE6@ICTS-S-MBX7.luna.kuleuven.be>
Message-ID:

Hi Kaat,

It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest through the 'coef' or 'contrast' argument of glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point.

First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). Then also include a time variable in your design. Some decisions might need to be made about which interactions to include.

Hope that gets you started.

Best,
Mark

----------
Prof. Dr.
Mark Robinson Bioinformatics Institute of Molecular Life Sciences University of Zurich Winterthurerstrasse 190 8057 Zurich Switzerland v: +41 44 635 4848 f: +41 44 635 6898 e: mark.robinson at imls.uzh.ch o: Y11-J-16 w: http://tiny.cc/mrobin ---------- http://www.fgcz.ch/Bioconductor2012 On 21.06.2012, at 11:42, Kaat De Cremer wrote: > Dear all, > > > I am using edgeR to find genes differentially expressed between infected and mock-infected control plants, at 3 time points after infection. > I have RNAseq data for 3 independent tests, so for every single test I have 6 libraries (control + infected at 3 time points). > Having three replicates this makes 18 libraries in total. > > What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: > >> head(x) > C1 C2 C3 T1 T2 T3 > 1 0 1 2 0 0 0 > 2 13 6 4 10 8 12 > 3 17 16 9 10 8 11 > 4 2 1 2 2 3 2 > 5. 1 3 1 2 1 3 0 > 6 958 457 438 565 429 518 > >> treatment<-factor(c("C","C","C","T","T","T")) >> test<-factor(c(1,2,3,1,2,3)) >> y<-DGEList(counts=x,group=treatment) > Calculating library sizes from column totals. >> cpm.y<-cpm(y) >> y<-y[rowSums(cpm.y>2)>=3,] >> y<-calcNormFactors(y) >> design<-model.matrix(~test+treat) >> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) > Disp = 0.0265 , BCV = 0.1628 >> y<-estimateGLMTrendedDisp(y,design) > Loading required package: splines >> y<-estimateGLMTagwiseDisp(y,design) >> fit<-glmFit(y,design) >> lrt<-glmLRT(y,fit) > > > This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? > Unfortunately I cannot figure out how to design the matrix. 
> > I hope someone can help me, > > Kaat > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From ragowthaman at gmail.com Fri Jun 22 12:32:35 2012 From: ragowthaman at gmail.com (gowtham) Date: Fri, 22 Jun 2012 03:32:35 -0700 Subject: [BioC] edgeR: calcNormFactors question In-Reply-To: <187BC2EF-2DCC-4E65-A99E-00843C44C3B8@imls.uzh.ch> References: <000901cd5009$ccf3a620$66daf260$@edu.au> <187BC2EF-2DCC-4E65-A99E-00843C44C3B8@imls.uzh.ch> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From whuber at embl.de Fri Jun 22 12:57:20 2012 From: whuber at embl.de (Wolfgang Huber) Date: Fri, 22 Jun 2012 12:57:20 +0200 Subject: [BioC] Printing arrayQualityMetrics report In-Reply-To: <4FE428C1.7000309@gmail.com> References: <4FE39A13.3010004@embl.de> <4FE428C1.7000309@gmail.com> Message-ID: <4FE44F90.1050603@embl.de> Dear James that is one way to go about it. A more direct way for getting at this table is to keep the return value of the call to the function arrayQualityMetrics: myReport = arrayQualityMetrics( eset, ...) and access the table via myReport$arrayTable Please use arrayQualityMetrics >= 3.13.5 for this (unfortunately in previous versions due to an oversight of mine this object was not propagated all the way to the return value of arrayQualityMetrics). Version 3.13.5 is in svn and should also soon be on the website / in the package repository. It also fixes the 'intgroup' issue reported by Sonal and Tim. Best wishes Wolfgang James F. Reid scripsit 06/22/2012 10:11 AM: ...[snip] > you could extract the table contents using the readHTMLTable function > from the 'XML' package and for the figures just include the pdfs as > figures and add a caption to them. > > HTH, > J. 
Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber From michael.dondrup at uni.no Fri Jun 22 14:20:15 2012 From: michael.dondrup at uni.no (Michael Dondrup) Date: Fri, 22 Jun 2012 14:20:15 +0200 Subject: [BioC] Gviz: Error plotting C.elegans ideogram Message-ID: Hi, I am trying to plot an ideogram track for C. elegans using Gviz. However I cannot generate the ideogram track: > itrack <- IdeogramTrack(genome = "ce6", chromosome = "chrI" ) Error in normArgTrack(track, trackids) : Unknown track: cytoBandIdeo I have tried also "ce4, and ce10" and for the chromosome "chr1, chrII, 1" with the same effect. Other genomes (hgu19, mm9) worked. Do I have to use a different genome identifier? Best Michael Dondrup > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid stats graphics grDevices utils datasets methods base other attached packages: [1] IRanges_1.14.3 BiocGenerics_0.2.0 Gviz_1.0.1 loaded via a namespace (and not attached): [1] AnnotationDbi_1.18.1 Biobase_2.16.0 biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 [7] DBI_0.2-5 GenomicRanges_1.8.6 lattice_0.20-6 RColorBrewer_1.0-5 RCurl_1.91-1 Rsamtools_1.8.5 [13] RSQLite_0.11.1 rtracklayer_1.16.1 stats4_2.15.0 tools_2.15.0 XML_3.9-4 zlibbioc_1.2.0 > From liyer01 at tufts.edu Fri Jun 22 15:01:24 2012 From: liyer01 at tufts.edu (Lakshmanan Iyer) Date: Fri, 22 Jun 2012 09:01:24 -0400 Subject: [BioC] Coverage on GappedAlignment Pairs Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From Kaat.DeCremer at biw.kuleuven.be Fri Jun 22 15:11:41 2012 From: Kaat.DeCremer at biw.kuleuven.be (Kaat De Cremer) Date: Fri, 22 Jun 2012 13:11:41 +0000 Subject: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates In-Reply-To: References: <3D4A97F14E343F4584925219C1C1ACEF05B95CE6@ICTS-S-MBX7.luna.kuleuven.be> Message-ID: <3D4A97F14E343F4584925219C1C1ACEF05B9631A@ICTS-S-MBX7.luna.kuleuven.be> Hi Mark, Thank you for your suggestion, I really appreciate your time. Working in R is new to me so it has been a struggle using edgeR, but I think I managed it using only 2 factors (test and treatment). Now that I will be including 3 factors (test, treatment and time) in one analysis it is clear to me that I still don't understand how it works exactly. Below you can see my workspace with the only design matrix I could come up with, but I don't see which coefficients I should include or which contrast vector to use in the glmLRT function to make the comparison of control-treatment at each time point separate, ignoring the other 2 time points. Is this possible with this design matrix? Or is the matrix wrong for this purpose? Thanks! 
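
One way to get a separate control-versus-treatment comparison at each time point from a single fit is to nest the treatment effect within time. The following is a minimal base-R sketch of such a design matrix, not a confirmed recommendation from this thread: the factor layout is assumed from the column order of the count table (12hpi C1..C3, T1..T3, then 24hpi, then 48hpi), and treat_cols is a hypothetical helper name.

```r
# Sample annotation for the 18 libraries, in the column order of the count table
# (3 time points x {control, treated} x 3 independent tests).
time  <- factor(rep(c("12hpi", "24hpi", "48hpi"), each = 6))
treat <- factor(rep(rep(c("C", "T"), each = 3), times = 3))
test  <- factor(rep(1:3, times = 6))

# Nest the treatment effect within each time point: there is no main effect
# for 'treat', so the time:treat term gives one treated-vs-control
# coefficient per time point.
design <- model.matrix(~ test + time + time:treat)

print(colnames(design))

# Columns whose names contain "treatT" are the per-time-point treatment effects.
treat_cols <- grep("treatT", colnames(design))
print(treat_cols)
```

Each index in treat_cols could then be tried as the 'coef' argument of glmLRT() to test treatment against control at one time point, with the other two time points still contributing to the dispersion estimates.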
Kaat

> head(x)
            12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2
Lsa000001.1        0        1        1        2        0        2        1        1
Lsa000002.1        5        4        0        5        6        6        6        4
Lsa000003.1       10        9        7        5        5        8        6        2
Lsa000004.1        1        1        1        1        1        1        1        3
Lsa000005.1        1        0        1        0        2        0        0        1
Lsa000006.1      510      223      228      287      222      268      303      358
            24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1
Lsa000001.1        0        1        1        0        0        0        0        2
Lsa000002.1        7        5        2        5       10        6       12       12
Lsa000003.1        7        5        4        2        6        5        8        2
Lsa000004.1        1        3        1        2        1        3        2        3
Lsa000005.1        0        1        0        0        1        0        0        2
Lsa000006.1      372      362      237      320      472      440      411      858
            48hpi T2 48hpi T3
Lsa000001.1        0        0
Lsa000002.1        1        5
Lsa000003.1        1        0
Lsa000004.1        0        2
Lsa000005.1        1        0
Lsa000006.1      375      275
> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C","C","C","T","T","T"))
> test<-factor(c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
> time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi","24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hpi","48hpi","48hpi"))
> y<-DGEList(counts=x,group=treat)
Calculating library sizes from column totals.
> cpm.y<-cpm(y)
> y<-y[rowSums(cpm.y>2)>=3,]
> y<-calcNormFactors(y)
> design<-model.matrix(~test+treat+time)
> design
   (Intercept) test2 test3 treatT time24hpi time48hpi
1            1     0     0      0         0         0
2            1     1     0      0         0         0
3            1     0     1      0         0         0
4            1     0     0      1         0         0
5            1     1     0      1         0         0
6            1     0     1      1         0         0
7            1     0     0      0         1         0
8            1     1     0      0         1         0
9            1     0     1      0         1         0
10           1     0     0      1         1         0
11           1     1     0      1         1         0
12           1     0     1      1         1         0
13           1     0     0      0         0         1
14           1     1     0      0         0         1
15           1     0     1      0         0         1
16           1     0     0      1         0         1
17           1     1     0      1         0         1
18           1     0     1      1         0         1
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$test
[1] "contr.treatment"

attr(,"contrasts")$treat
[1] "contr.treatment"

attr(,"contrasts")$time
[1] "contr.treatment"

> y<-estimateGLMCommonDisp(y,design,verbose=TRUE)
Disp = 0.07299 , BCV = 0.2702
> y<-estimateGLMTrendedDisp(y,design)
Loading required package: splines
> y<-estimateGLMTagwiseDisp(y,design)
Warning message:
In maximizeInterpolant(spline.pts, apl.smooth[j, ]) :
  max iterations exceeded
> fit<-glmFit(y,design)

-----Original Message-----
From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch]
Sent: Friday 22 June 2012 12:03
To: Kaat De Cremer
Cc: bioconductor list
Subject: Re: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates

Hi Kaat,

It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest through the 'coef' or 'contrast' argument of glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point.

First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). Then also include a time variable in your design. Some decisions might need to be made about which interactions to include.

Hope that gets you started.

Best,
Mark

----------
Prof. Dr.
Mark Robinson Bioinformatics Institute of Molecular Life Sciences University of Zurich Winterthurerstrasse 190 8057 Zurich Switzerland v: +41 44 635 4848 f: +41 44 635 6898 e: mark.robinson at imls.uzh.ch o: Y11-J-16 w: http://tiny.cc/mrobin ---------- http://www.fgcz.ch/Bioconductor2012 On 21.06.2012, at 11:42, Kaat De Cremer wrote: > Dear all, > > > I am using edgeR to find genes differentially expressed between infected and mock-infected control plants, at 3 time points after infection. > I have RNAseq data for 3 independent tests, so for every single test I have 6 libraries (control + infected at 3 time points). > Having three replicates this makes 18 libraries in total. > > What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: > >> head(x) > C1 C2 C3 T1 T2 T3 > 1 0 1 2 0 0 0 > 2 13 6 4 10 8 12 > 3 17 16 9 10 8 11 > 4 2 1 2 2 3 2 > 5. 1 3 1 2 1 3 0 > 6 958 457 438 565 429 518 > >> treatment<-factor(c("C","C","C","T","T","T")) >> test<-factor(c(1,2,3,1,2,3)) >> y<-DGEList(counts=x,group=treatment) > Calculating library sizes from column totals. >> cpm.y<-cpm(y) >> y<-y[rowSums(cpm.y>2)>=3,] >> y<-calcNormFactors(y) >> design<-model.matrix(~test+treat) >> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) > Disp = 0.0265 , BCV = 0.1628 >> y<-estimateGLMTrendedDisp(y,design) > Loading required package: splines >> y<-estimateGLMTagwiseDisp(y,design) >> fit<-glmFit(y,design) >> lrt<-glmLRT(y,fit) > > > This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? > Unfortunately I cannot figure out how to design the matrix. 
> > I hope someone can help me, > > Kaat > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From laxvid at gmail.com Fri Jun 22 15:56:41 2012 From: laxvid at gmail.com (Lakshmanan Iyer) Date: Fri, 22 Jun 2012 09:56:41 -0400 Subject: [BioC] readGapppedAlignmentpairs questions Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From chenyao.bioinfor at gmail.com Fri Jun 22 15:59:28 2012 From: chenyao.bioinfor at gmail.com (Yao Chen) Date: Fri, 22 Jun 2012 09:59:28 -0400 Subject: [BioC] [Limma] Calculate the relation between mRNA and miRNA Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From steven.segbroek at gmail.com Fri Jun 22 16:26:08 2012 From: steven.segbroek at gmail.com (steven segbroek) Date: Fri, 22 Jun 2012 16:26:08 +0200 Subject: [BioC] some help requested for constructing an appropriate design matrix in LIMMA In-Reply-To: <002c01cd4cee$18b0f4b0$4a12de10$@edu.au> References: <002c01cd4cee$18b0f4b0$4a12de10$@edu.au> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From hpages at fhcrc.org Fri Jun 22 20:11:22 2012 From: hpages at fhcrc.org (=?ISO-8859-1?Q?Herv=E9_Pag=E8s?=) Date: Fri, 22 Jun 2012 11:11:22 -0700 Subject: [BioC] readGapppedAlignmentpairs questions In-Reply-To: References: Message-ID: <4FE4B54A.2020601@fhcrc.org> Hi Iyer, On 06/22/2012 06:56 AM, Lakshmanan Iyer wrote: > Hi > My apologies for multiple posting if it happens-- I sent the last mails > from other accounts which may not be registered with Bioc-list > -Lax > Two questions: > > 1. Is readGapppedAlignmentPairs - the most efficient way to read a > paired-end bam file with mulit-mapped reads? 
> I am asking as it takes an enormous amount of time to process and load.

With a recent version of Rsamtools (>= 1.9.10), last time I tried it took between 30 and 40 min to call readGappedAlignmentPairs() on a BAM file with 70 million records (35 million pairs). Even for a fixed number of records and on the same machine, timing can vary a lot depending on how "hard" it is to pair the records, which mostly depends on the average number of records sharing the same QNAME (the lower, the easier). You can get that number using the new quickBamCounts() utility from Rsamtools:

> quickBamCounts("AML_330-0_gsnap_filter.bam")
                                group |    nb of |    nb of | mean / max
                                   of |  records |   unique | records per
                              records | in group |   QNAMEs | unique QNAME
All records........................ A | 70446309 | 35407913 | 1.99 / 2
  o template has single segment.... S |        0 |        0 |   NA / NA
  o template has multiple segments. M | 70446309 | 35407913 | 1.99 / 2
      - first segment.............. F | 35313360 | 35313360 |    1 / 1
      - last segment............... L | 35132949 | 35132949 |    1 / 1
      - other segment.............. O |        0 |        0 |   NA / NA
Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.

Here the average number of records per unique QNAME is 1.99 (and the max is 2), which is ideal.

I don't remember the exact amount of memory needed to load that file with 70M records though. All I remember is that you need a lot :-/ (probably > 20GB). Make sure you have enough memory and that your system is not swapping.

readGappedAlignmentPairs() is basically calling readGappedAlignments() followed by findMateAlignment(). Each of the 2 operations is expensive, and I'm not sure there are obvious/easy optimizations that could be done on readGappedAlignments() itself. However, for findMateAlignment(), there are a few easy ones that have been on my list for a couple of months now and that I will implement ASAP. They should improve speed and reduce memory usage.

>
> 2.
How does one work with coverage on GappedAlignmentPairs in the context
> of RNASeq?
> The simplest way is to consider each left and right read as separate -
> essentially lose the "paired" information and calculate coverage.
> However, if both the left and right reads of a pair fall within a feature of
> interest - say an exon, does it imply coverage of the region of the exon
> between the reads too?
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>      LLLLLLLLLL---------------------------RRRRRRRRRR
>                     ^^^^^^^^^^^^^^^^^
>
> In the figure above, the exon is represented by ">" and L and R represent
> the left and right reads aligned to the exon.
> I am talking about the region represented by "^". Do we assume coverage
> for this region too?
> Does coverage on GappedAlignmentPairs do this?

No, coverage on a GappedAlignmentPairs does not do this, but 'coverage(range(grglist(galp)))' will do this.

Cheers,
H.

>
> -best
> -Lax
> Center for Neuroscience Research
> Tufts University School of Medicine
> Boston, MA
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

From smwilson at hpc.unm.edu Fri Jun 22 20:26:10 2012
From: smwilson at hpc.unm.edu (Susan Wilson)
Date: Fri, 22 Jun 2012 12:26:10 -0600
Subject: [BioC] How to generate bar files?
Message-ID: <4FE4B8C2.7000602@hpc.unm.edu>

Hi,

Does anyone know how to generate bar files with rMAT?
help(NormalizeProbes) says "The output can be saved as BAR file if the BAR argument specifies a filename, or as a parsed BAR file if argument output specifies a filename." Other than that, I cannot find anything relevant. Thanks. Susan From hpages at fhcrc.org Fri Jun 22 21:09:38 2012 From: hpages at fhcrc.org (=?UTF-8?B?SGVydsOpIFBhZ8Oocw==?=) Date: Fri, 22 Jun 2012 12:09:38 -0700 Subject: [BioC] Biostring: print sequence alignment to file In-Reply-To: <78563C69AD804989BBEA7D94269D47BA@googlemail.com> References: <1F5E43701380457FBD1953187D5FE071@googlemail.com> <4c299b85491a4b8cb0d0cbe2fdf3e3dc@EXCH-NODE02.exch.ucr.edu> <20120417184928.GA439@genomics-57-164.bulk.ucr.edu> <20120417201341.GA587@genomics-57-164.bulk.ucr.edu> <2F44456508644188839DCAB9C8D6B9B9@googlemail.com> <140068C6B7EC41B5BAE2B3A6964BF731@googlemail.com> <4F91FB7F.2000506@fhcrc.org> <78563C69AD804989BBEA7D94269D47BA@googlemail.com> Message-ID: <4FE4C2F2.1000001@fhcrc.org> Hi Martin, On 06/14/2012 06:55 AM, Martin Preusse wrote: > Hi guys, > > anything new on the sequence output? Maybe I missed something :) please tell me if you need testing etc. Still on my list. Will work on this in the next couple of weeks. I'll let you know. Thanks for the reminder. H. > > Cheers > Martin > > > Am Samstag, 21. April 2012 um 11:55 schrieb Martin Preusse: > >> Hi Herv?, >> >> thanks for your help! If you need suggestions, help or testing, just say the word. >> >> Will you implement the header also? If you do so, I would be thankful for an option like "header=F" for the output. >> >> >> Cheers >> Martin >> >> >> Am Samstag, 21. April 2012 um 02:12 schrieb Herv? Pag?s: >> >>> Thanks Martin and Thomas for the useful feedback. The 'pair' and >>> 'markx0' formats supported by Emboss seem indeed appropriate for >>> printing the output of pairwiseAlignment() to a file. I'll add >>> support for those 2 formats in Biostrings. Won't be before 1 week >>> or 2 though... >>> >>> Cheers, >>> H. 
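
Until Biostrings supports the 'pair' or 'markx0' formats mentioned above, a BLAST-like block layout can be produced by hand. The following is a minimal base-R sketch and not part of Biostrings: format_pairwise_blocks is a hypothetical helper, it assumes the two aligned strings (gaps written as "-") have already been extracted as plain character strings of equal length, and the match-line symbols are an arbitrary choice rather than the exact EMBOSS convention.

```r
# Hypothetical helper (not a Biostrings function): lay out two already-aligned
# strings of equal length in fixed-width blocks with a match line in between.
format_pairwise_blocks <- function(pattern, subject, width = 60) {
  stopifnot(nchar(pattern) > 0,
            nchar(pattern) == nchar(subject),
            width > 0)
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  # "|" marks identical residues, "." a mismatch, " " a gap in either sequence
  m <- ifelse(p == "-" | s == "-", " ", ifelse(p == s, "|", "."))
  out <- character(0)
  for (i in seq(1, length(p), by = width)) {
    idx <- i:min(i + width - 1, length(p))
    out <- c(out,
             paste(p[idx], collapse = ""),
             paste(m[idx], collapse = ""),
             paste(s[idx], collapse = ""),
             "")  # blank line between blocks
  }
  out
}

# The example alignment quoted in this thread.
aln_lines <- format_pairwise_blocks("AGT-TCTAT", "AGTATCTAT", width = 60)
writeLines(aln_lines)
```

Passing a file name as the second argument of writeLines() would send the same lines to a file instead of the console.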
>>> >>> On 04/18/2012 03:20 AM, Martin Preusse wrote: >>>> Hi, >>>> >>>> I just found this function to print a pairwise alignments in blocks. Doesn't add the match/mismatch indicators between sequences, but might be a starting point: >>>> >>>> http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment >>>> >>>> >>>> Cheers >>>> Martin >>>> >>>> >>>> >>>> Am Mittwoch, 18. April 2012 um 12:16 schrieb Martin Preusse: >>>> >>>>> Hi everybody, >>>>> >>>>> I think the output format depends on the purpose of the alignment. >>>>> >>>>> A pairwise sequence alignment is usually done to compare two sequences base by base. In my case, I compare sequencing results of cloned expression constructs with the desired sequence. Thus, the best output format would be "BLAST like". >>>>> >>>>> seq1: 1 ATCTGC 7 >>>>> | | | . . | >>>>> seq2: 1 ATCAAC 7 >>>>> >>>>> When doing MSA, most people might rather be interested in the consensus sequence. E.g. in the context of conservation between species. >>>>> >>>>> So write.PairwiseAlignedXStringSet() and write.MultipleAlignment() are quite different and BLAST doesn't make much sense for multiple alignments. This means it would be best to put the output in the PairwiseAlignment/MultipleAlignment and not to the XStringSet, right? >>>>> >>>>> This is an overview of sequence alignment formats used by EMBOSS: >>>>> http://emboss.sourceforge.net/docs/themes/AlignFormats.html >>>>> >>>>> 'pair' or 'markx0' would be perfectly fine. >>>>> >>>>> >>>>> Cheers >>>>> Martin >>>>> >>>>> >>>>> >>>>> Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke: >>>>> >>>>>> Hi Herv?, >>>>>> >>>>>> To me, the most basic and versatile MSA or pairwise alignment format to output >>>>>> to would be FASTA since it is compatible with almost any other alignment >>>>>> editing software. 
For text-based viewing purposes my preference would be >>>>>> to also output to a format similar to the one shown in the following >>>>>> example. When there are only two sequences then one could show instead >>>>>> of a consensus line the pipe characters between the two sequences to >>>>>> indicate identical residues which mimics the blast output. A more >>>>>> standardized version of this pairwise alignment format can be found >>>>>> here: >>>>>> http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html >>>>>> >>>>>> library(Biostrings) >>>>>> p450<- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta") >>>>>> >>>>>> StringSet2html<- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) { >>>>>> if(class(msa)=="AAStringSet") msa<- AAStringSet(msa, start=start, end=end) >>>>>> if(class(msa)=="DNAStringSet") msa<- DNAStringSet(msa, start=start, end=end) >>>>>> msavec<- sapply(msa, toString) >>>>>> offset<- (counter-1)-nchar(nchar(msavec[1])) >>>>>> legend<- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0, >>>>>> nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="") >>>>>> consensus<- consensusString(msavec, ambiguityMap=".", ...) >>>>>> msavec<- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ") >>>>>> msavec<- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec, >>>>>> consensus), sep=" ") >>>>>> msavec<- c("
", msavec,"
") >>>>>> writeLines(msavec, file) >>>>>> if(browser==TRUE) { browseURL(file) } >>>>>> } >>>>>> StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0) >>>>>> StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0) >>>>>> >>>>>> >>>>>> Thomas >>>>>> >>>>>> On Tue, Apr 17, 2012 at 07:43:30PM +0000, Herv? Pag?s wrote: >>>>>>> Hi Thomas, >>>>>>> >>>>>>> On 04/17/2012 11:49 AM, Thomas Girke wrote: >>>>>>>> What about providing an option in pairwiseAlignment to output to the >>>>>>>> MultipleAlignment class in Biostrings and then write the latter to >>>>>>>> different alignment formats? >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Or we could provide coercion methods to switch between >>>>>>> PairwiseAlignedXStringSet and MultipleAlignment. >>>>>>> >>>>>>> Anyway that kind of moves Martin's problem from having a >>>>>>> write.PairwiseAlignedXStringSet() function that produces BLAST output >>>>>>> to having a write.MultipleAlignment() function that produces BLAST >>>>>>> output. For the specific case of BLAST output, would it make sense >>>>>>> to support it for MultipleAlignment? Can someone point me to an example >>>>>>> of such output? Or even better, to the specs of such format? 
>>>>>>> >>>>>>> Note that right now there is the write.phylip() function in Biostrings >>>>>>> for writing a MultipleAlignment object to a file but the Phylip format >>>>>>> looks very different from the BLAST output: >>>>>>> >>>>>>> hpages at latitude:~$ head -n 20 phylip_test.txt >>>>>>> 9 2343 >>>>>>> Mask 0000000000 0000000000 0000000000 0000000000 0000000000 >>>>>>> Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG >>>>>>> Chimp ---------- ---------- ---------- ---------- ---------- >>>>>>> Cow ---------- ---------- ---------- ---------- ---------- >>>>>>> Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG >>>>>>> Rat ---------- ---------- ---------- ---------- ---------- >>>>>>> Dog ---------- ---------- ---------- ---------- ---------- >>>>>>> Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC >>>>>>> Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG >>>>>>> >>>>>>> 0000000000 0000000000 0000000000 0001111111 1111111111 >>>>>>> CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT >>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT >>>>>>> ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT >>>>>>> CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT >>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT >>>>>>> ---------- ---------- ---------- ---------- ---A-TGGCT >>>>>>> AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC >>>>>>> CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT >>>>>>> >>>>>>> Thanks! >>>>>>> H. >>>>>>> >>>>>>>> >>>>>>>> Thomas >>>>>>>> >>>>>>>> On Tue, Apr 17, 2012 at 05:59:24PM +0000, Herv? Pag?s wrote: >>>>>>>>> Hi Martin, >>>>>>>>> >>>>>>>>> On 04/16/2012 04:06 AM, Martin Preusse wrote: >>>>>>>>>> Hi Charles, >>>>>>>>>> >>>>>>>>>> thanks! Your solution allows to print the two alignment strings separately. 
>>>>>>>>>> >>>>>>>>>> I was thinking of an output as generated by alignment tools: >>>>>>>>>> >>>>>>>>>> AGT-TCTAT >>>>>>>>>> | | | | | | | | | >>>>>>>>>> AGTATCTAT >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> This looks like BLAST output. Is this what you have in mind? Note that >>>>>>>>> there are many alignment tools and many ways to output the result to a >>>>>>>>> file. I'm not really familiar with the BLAST output format. Is it >>>>>>>>> specified somewhere? Would that make sense to add something like a >>>>>>>>> write.PairwiseAlignedXStringSet() function to Biostrings for writing >>>>>>>>> the result of pairwiseAlignment() to a file? We could do this and >>>>>>>>> support the BLAST format if that's a commonly used format. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> H. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> For this I would have to write a function to output the strings in blocks of e.g. 60 nucleotides, right? >>>>>>>>>> >>>>>>>>>> Cheers >>>>>>>>>> Martin >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles: >>>>>>>>>> >>>>>>>>>>> write.XStringSet >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Herv? Pag?s >>>>>>>>> >>>>>>>>> Program in Computational Biology >>>>>>>>> Division of Public Health Sciences >>>>>>>>> Fred Hutchinson Cancer Research Center >>>>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>>>> P.O. 
Box 19024
>>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>>
>>>>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>>>>> Phone: (206) 667-5791
>>>>>>>>> Fax: (206) 667-1319
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>>>>>> Phone: (206) 667-5791
>>>>>>> Fax: (206) 667-1319
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O.
Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 From laxvid at gmail.com Fri Jun 22 22:34:46 2012 From: laxvid at gmail.com (Lakshmanan Iyer) Date: Fri, 22 Jun 2012 16:34:46 -0400 Subject: [BioC] readGapppedAlignmentpairs questions In-Reply-To: <4FE4B54A.2020601@fhcrc.org> References: <4FE4B54A.2020601@fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From vobencha at fhcrc.org Fri Jun 22 23:00:32 2012 From: vobencha at fhcrc.org (Valerie Obenchain) Date: Fri, 22 Jun 2012 14:00:32 -0700 Subject: [BioC] nearest() for GRanges In-Reply-To: References: Message-ID: <4FE4DCF0.5010206@fhcrc.org> On 06/20/2012 05:20 PM, Cook, Malcolm wrote: > Hi Valerie, > > Very glad you found and fixed the root cause. > > I don't know the overhead it would take for you, but, this being a > regression, might you consider fixing in Bioconductor 2.10 as, say > GenomicRanges_1.8. > Yes, I will fix this in release too. If not today then first thing next week. Valerie > Thanks for your consideration, > > Malcolm > > On 6/20/12 3:13 PM, "Valerie Obenchain" wrote: > >> Hi Oleg, Malcom, >> >> Thanks for the bug report. This is now fixed in devel 1.9.28. Over the >> past months we've done an overhaul of the precede/follow code in devel. >> The new nearest method is based on the new precede and follow and is >> documented at >> >> ?'nearest,GenomicRanges,GenomicRanges-method' >> >> Let me know if you run into problems. >> >> Valerie >> >> >> >> On 06/18/2012 02:25 PM, Cook, Malcolm wrote: >>> Martin, Oleg, Val, all, >>> >>> I too have script logic that benefitted from and depends upon what the >>> behavior of nearest,GenomicRanges,missing as reported by Oleg. >>> >>> Thanks for the unit tests Martin. >>> >>> If it helps in sleuthing, in my hands, the 3rd test used to pass (if my >>> memory serves), but does not pass now, as the attached transcript shows. 
>>> >>> Hoping it helps find a speedy resolution to this issue, >>> >>> ~ Malcolm Cook >>> >>> >>> >>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>> g<- GRanges("chr1", r, "+") >>>> checkEquals(precede(r), precede(g)) >>> [1] TRUE >>>> checkEquals(follow(r), follow(g)) >>> [1] TRUE >>>> try(checkEquals(nearest(r), nearest(g))) >>> Error in checkEquals(nearest(r), nearest(g)) : >>> Mean relative difference: 0.6 >>> >>> >>>> sessionInfo() >>> R version 2.15.0 (2012-03-30) >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> locale: >>> [1] C >>> >>> attached base packages: >>> [1] tools splines parallel stats graphics grDevices utils >>> datasets methods base >>> >>> other attached packages: >>> [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 >>> RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 >>> bit_1.1-8 evd_2.2-6 ipred_0.8-13 >>> prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 >>> survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 >>> ada_2.0-2 rpart_3.1-53 e1071_1.6 >>> class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 >>> rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>> lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 >>> [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 >>> Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 >>> GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 >>> BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 >>> svUnit_0.7-10 doParallel_1.0.1 iterators_1.0.6 >>> foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 >>> RSQLite.extfuns_0.0.1 RSQLite_0.11.1 chron_2.3-42 >>> gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 >>> functional_0.1 reshape_0.8.4 plyr_1.7.1 >>> [55] stringr_0.6 gtools_2.6.2 >>> >>> loaded via a namespace (and not attached): >>> [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 >>> codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 >>> digest_0.5.2 grid_2.15.0 labeling_0.1 memoise_0.1 >>> munsell_0.3 reshape2_1.2.1 scales_0.2.1 stats4_2.15.0 >>> tcltk_2.15.0 zlibbioc_1.2.0 >>> >>> >>> >>> >>> >>> >>> On 6/18/12 2:39 PM, "Martin 
Morgan" wrote: >>> >>>> Hi Oleg -- >>>> >>>> On 06/17/2012 11:11 PM, Oleg Mayba wrote: >>>>> Hi, >>>>> >>>>> I just noticed that a piece of logic I was relying on with GRanges >>>>> before >>>>> does not seem to work anymore. Namely, I expect the behavior of >>>>> nearest() >>>>> with a single GRanges object as an argument to be the same as that of >>>>> IRanges (example below), but it's not anymore. I expect nearest(GR1) >>>>> NOT >>>>> to behave trivially but to return the closest range OTHER than the >>>>> range >>>>> itself. I could swear that was the case before, but isn't any longer: >>>> I think you're right that there is an inconsistency here; Val will >>>> likely help clarify in a day or so. My two cents... >>>> >>>> I think, certainly, that GRanges on a single chromosome on the "+" >>>> strand should behave like an IRanges >>>> >>>> library(GenomicRanges) >>>> library(RUnit) >>>> >>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>> g<- GRanges("chr1", r, "+") >>>> >>>> ## first two ok, third should work but fails >>>> checkEquals(precede(r), precede(g)) >>>> checkEquals(follow(r), follow(g)) >>>> try(checkEquals(nearest(r), nearest(g))) >>>> >>>> Also, on the "-" strand I think we're expecting >>>> >>>> g<- GRanges("chr1", r, "-") >>>> >>>> ## first two ok, third should work but fails >>>> checkEquals(follow(r), precede(g)) >>>> checkEquals(precede(r), follow(g)) >>>> try(checkEquals(nearest(r), nearest(g))) >>>> >>>> For "*" (which was your example) I think the situation is (a) different >>>> in devel than in release; and (b) not so clear. In devel, "*" is (from >>>> method?"nearest,GenomicRanges,missing") "x on '*' strand can match to >>>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think these >>>> are always true: >>>> >>>> checkEquals(precede(g), follow(g)) >>>> checkEquals(nearest(r), follow(g)) >>>> >>>> I would also expect >>>> >>>> try(checkEquals(nearest(g), follow(g))) >>>> >>>> though this seems not to be the case. 
In 'release', "*" is coereced and >>>> behaves as if on the "+" strand (I think). >>>> >>>> Martin >>>> >>>>>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>>>>> z >>>>> IRanges of length 3 >>>>> start end width >>>>> [1] 1 2 2 >>>>> [2] 5 7 3 >>>>> [3] 10 12 3 >>>>>> nearest(z) >>>>> [1] 2 1 2 >>>>>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >>>>> end=c(2,7,12))) >>>>>> z >>>>> GRanges with 3 ranges and 0 elementMetadata cols: >>>>> seqnames ranges strand >>>>> >>>>> [1] chr1 [ 1, 2] * >>>>> [2] chr1 [ 5, 7] * >>>>> [3] chr1 [10, 12] * >>>>> --- >>>>> seqlengths: >>>>> chr1 >>>>> NA >>>>>> nearest(z) >>>>> [1] 1 2 3 >>>>>> sessionInfo() >>>>> R version 2.15.0 (2012-03-30) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>> [7] LC_PAPER=C LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] datasets utils grDevices graphics stats methods base >>>>> >>>>> other attached packages: >>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] stats4_2.15.0 >>>>> >>>>> >>>>> I want the IRanges behavior and not what seems currently to be the >>>>> GRanges >>>>> behavior, since I have some code that depends on it. Is there a quick >>>>> way >>>>> to make nearest() do that for me again? >>>>> >>>>> Thanks! >>>>> >>>>> Oleg. 
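For readers following the thread, the behaviour Oleg expects (each range matched to the closest range other than itself) can be sketched in base R without IRanges. This is only an illustration under stated assumptions: `nearest_other()` is a hypothetical helper, not part of IRanges or GenomicRanges; it ignores strand and sequence names, and it resolves ties by taking the first index, which may differ from the packages' tie-breaking rules.

```r
# Toy version of "nearest range other than itself" for closed integer
# intervals. Hypothetical helper for illustration, not the IRanges API.
nearest_other <- function(start, end) {
  n <- length(start)
  stopifnot(n == length(end), n >= 2)
  # Pairwise gap widths: 0 for overlapping ranges, otherwise the number
  # of positions strictly between the two ranges.
  gap <- pmax(outer(start, start, pmax) - outer(end, end, pmin) - 1, 0)
  diag(gap) <- Inf           # a range is never matched to itself
  apply(gap, 1, which.min)   # first index wins on ties
}

idx <- nearest_other(start = c(1, 5, 10), end = c(2, 7, 12))
print(idx)
```

For the three ranges in Oleg's example this returns 2 1 2, the same indices as the IRanges transcript quoted above, whereas the GRanges result he reports (1 2 3) matches every range to itself.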
>>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> -- >>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N. >>>> PO Box 19024 Seattle, WA 98109 >>>> >>>> Location: Arnold Building M1 B861 >>>> Phone: (206) 667-2793 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor From MEC at stowers.org Sat Jun 23 00:25:02 2012 From: MEC at stowers.org (Cook, Malcolm) Date: Fri, 22 Jun 2012 17:25:02 -0500 Subject: [BioC] nearest() for GRanges In-Reply-To: <4FE4DCF0.5010206@fhcrc.org> Message-ID: Great news, Valerie... thanks very much... I will take immediate advantage of this... after re-reading your report of 'an overhaul' I would well understand if back-porting your fix in dev to release would be onerous to impossible. I hope it goes quickly and smoothly.... Cheers, Malcolm On 6/22/12 4:00 PM, "Valerie Obenchain" wrote: >On 06/20/2012 05:20 PM, Cook, Malcolm wrote: >> Hi Valerie, >> >> Very glad you found and fixed the root cause. >> >> I don't know the overhead it would take for you, but, this being a >> regression, might you consider fixing in Bioconductor 2.10 as, say >> GenomicRanges_1.8. >> >Yes, I will fix this in release too. 
If not today then first thing next >week. > >Valerie >> Thanks for your consideration, >> >> Malcolm >> >> On 6/20/12 3:13 PM, "Valerie Obenchain" wrote: >> >>> Hi Oleg, Malcom, >>> >>> Thanks for the bug report. This is now fixed in devel 1.9.28. Over the >>> past months we've done an overhaul of the precede/follow code in devel. >>> The new nearest method is based on the new precede and follow and is >>> documented at >>> >>> ?'nearest,GenomicRanges,GenomicRanges-method' >>> >>> Let me know if you run into problems. >>> >>> Valerie >>> >>> >>> >>> On 06/18/2012 02:25 PM, Cook, Malcolm wrote: >>>> Martin, Oleg, Val, all, >>>> >>>> I too have script logic that benefitted from and depends upon what the >>>> behavior of nearest,GenomicRanges,missing as reported by Oleg. >>>> >>>> Thanks for the unit tests Martin. >>>> >>>> If it helps in sleuthing, in my hands, the 3rd test used to pass (if >>>>my >>>> memory serves), but does not pass now, as the attached transcript >>>>shows. >>>> >>>> Hoping it helps find a speedy resolution to this issue, >>>> >>>> ~ Malcolm Cook >>>> >>>> >>>> >>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>> g<- GRanges("chr1", r, "+") >>>>> checkEquals(precede(r), precede(g)) >>>> [1] TRUE >>>>> checkEquals(follow(r), follow(g)) >>>> [1] TRUE >>>>> try(checkEquals(nearest(r), nearest(g))) >>>> Error in checkEquals(nearest(r), nearest(g)) : >>>> Mean relative difference: 0.6 >>>> >>>> >>>>> sessionInfo() >>>> R version 2.15.0 (2012-03-30) >>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>> >>>> locale: >>>> [1] C >>>> >>>> attached base packages: >>>> [1] tools splines parallel stats graphics grDevices >>>>utils >>>> datasets methods base >>>> >>>> other attached packages: >>>> [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 >>>> RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 >>>> bit_1.1-8 evd_2.2-6 ipred_0.8-13 >>>> prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 >>>> survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 >>>> ada_2.0-2 rpart_3.1-53 e1071_1.6 >>>> 
class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 >>>> rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>>> lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 >>>> [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 >>>> Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 >>>> GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 >>>> BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 >>>> svUnit_0.7-10 doParallel_1.0.1 iterators_1.0.6 >>>> foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 >>>> RSQLite.extfuns_0.0.1 RSQLite_0.11.1 chron_2.3-42 >>>> gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 >>>> functional_0.1 reshape_0.8.4 plyr_1.7.1 >>>> [55] stringr_0.6 gtools_2.6.2 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 >>>>bitops_1.0-4.1 >>>> codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 >>>> digest_0.5.2 grid_2.15.0 labeling_0.1 memoise_0.1 >>>> munsell_0.3 reshape2_1.2.1 scales_0.2.1 stats4_2.15.0 >>>> tcltk_2.15.0 zlibbioc_1.2.0 >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 6/18/12 2:39 PM, "Martin Morgan" wrote: >>>> >>>>> Hi Oleg -- >>>>> >>>>> On 06/17/2012 11:11 PM, Oleg Mayba wrote: >>>>>> Hi, >>>>>> >>>>>> I just noticed that a piece of logic I was relying on with GRanges >>>>>> before >>>>>> does not seem to work anymore. Namely, I expect the behavior of >>>>>> nearest() >>>>>> with a single GRanges object as an argument to be the same as that >>>>>>of >>>>>> IRanges (example below), but it's not anymore. I expect >>>>>>nearest(GR1) >>>>>> NOT >>>>>> to behave trivially but to return the closest range OTHER than the >>>>>> range >>>>>> itself. I could swear that was the case before, but isn't any >>>>>>longer: >>>>> I think you're right that there is an inconsistency here; Val will >>>>> likely help clarify in a day or so. My two cents... 
>>>>> >>>>> I think, certainly, that GRanges on a single chromosome on the "+" >>>>> strand should behave like an IRanges >>>>> >>>>> library(GenomicRanges) >>>>> library(RUnit) >>>>> >>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>> g<- GRanges("chr1", r, "+") >>>>> >>>>> ## first two ok, third should work but fails >>>>> checkEquals(precede(r), precede(g)) >>>>> checkEquals(follow(r), follow(g)) >>>>> try(checkEquals(nearest(r), nearest(g))) >>>>> >>>>> Also, on the "-" strand I think we're expecting >>>>> >>>>> g<- GRanges("chr1", r, "-") >>>>> >>>>> ## first two ok, third should work but fails >>>>> checkEquals(follow(r), precede(g)) >>>>> checkEquals(precede(r), follow(g)) >>>>> try(checkEquals(nearest(r), nearest(g))) >>>>> >>>>> For "*" (which was your example) I think the situation is (a) >>>>>different >>>>> in devel than in release; and (b) not so clear. In devel, "*" is >>>>>(from >>>>> method?"nearest,GenomicRanges,missing") "x on '*' strand can match to >>>>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think >>>>>these >>>>> are always true: >>>>> >>>>> checkEquals(precede(g), follow(g)) >>>>> checkEquals(nearest(r), follow(g)) >>>>> >>>>> I would also expect >>>>> >>>>> try(checkEquals(nearest(g), follow(g))) >>>>> >>>>> though this seems not to be the case. In 'release', "*" is coereced >>>>>and >>>>> behaves as if on the "+" strand (I think). 
>>>>> >>>>> Martin >>>>> >>>>>>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>>>>>> z >>>>>> IRanges of length 3 >>>>>> start end width >>>>>> [1] 1 2 2 >>>>>> [2] 5 7 3 >>>>>> [3] 10 12 3 >>>>>>> nearest(z) >>>>>> [1] 2 1 2 >>>>>>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >>>>>> end=c(2,7,12))) >>>>>>> z >>>>>> GRanges with 3 ranges and 0 elementMetadata cols: >>>>>> seqnames ranges strand >>>>>> >>>>>> [1] chr1 [ 1, 2] * >>>>>> [2] chr1 [ 5, 7] * >>>>>> [3] chr1 [10, 12] * >>>>>> --- >>>>>> seqlengths: >>>>>> chr1 >>>>>> NA >>>>>>> nearest(z) >>>>>> [1] 1 2 3 >>>>>>> sessionInfo() >>>>>> R version 2.15.0 (2012-03-30) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] datasets utils grDevices graphics stats methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] stats4_2.15.0 >>>>>> >>>>>> >>>>>> I want the IRanges behavior and not what seems currently to be the >>>>>> GRanges >>>>>> behavior, since I have some code that depends on it. Is there a >>>>>>quick >>>>>> way >>>>>> to make nearest() do that for me again? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> Oleg. 
>>>>>> >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> -- >>>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N. >>>>> PO Box 19024 Seattle, WA 98109 >>>>> >>>>> Location: Arnold Building M1 B861 >>>>> Phone: (206) 667-2793 > From phipson at wehi.EDU.AU Sat Jun 23 02:30:21 2012 From: phipson at wehi.EDU.AU (Belinda Phipson) Date: Sat, 23 Jun 2012 10:30:21 +1000 (EST) Subject: [BioC] [Limma] Calculate the relation between mRNA and miRNA In-Reply-To: References: Message-ID: Hi Jack I think you have misunderstood what lmFit does. lmFit takes the data object/matrix, call it x, and fits a user-specified design matrix to it, call it design, i.e. > fit <- lmFit(x,design) From your message I don't understand what format your data is in. However, if you have two vectors and you want to calculate a correlation, you could just use > cor(vector1, vector2, use="complete.obs") which would take care of the missing values. Cheers, Belinda > Hi all, > > I have a problem using lmFit to calculate the correlation between mRNA and > miRNA, because my miRNA data contain "NA" values. 
> > So if I use lmFit(mRNA,miRNA), I get the message "Error in qr.default(x) > : > NA/NaN/Inf in foreign function call (arg 1)". > > The original "lm" function allows "NA" values to be included, so why can't lmFit? > > Thanks, > > Jack > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > From phipson at wehi.EDU.AU Sat Jun 23 02:43:00 2012 From: phipson at wehi.EDU.AU (Belinda Phipson) Date: Sat, 23 Jun 2012 10:43:00 +1000 (EST) Subject: [BioC] some help requested for constructing an appropriate design matrix in LIMMA In-Reply-To: References: <002c01cd4cee$18b0f4b0$4a12de10$@edu.au> Message-ID: Hi Steven A common problem with small sample sizes! There are some things you can try: 1) You can try using a function called ComBat() in the sva package to remove the cell line effect. 2) You can have a look at what fit$df.prior is giving you. A larger value will result in more significant differential expression, and a small value will result in less differential expression. If you plot log(fit$sigma) on the y-axis and fit$Amean on the x-axis you may find some strange highly variable genes that may need to be filtered out. This can affect the estimate of the prior degrees of freedom. You may also decide to filter your genes based on a variance (fit$sigma) cut-off and re-run eBayes(). 3) You could try running eBayes with a trend: > fit <- eBayes(fit,trend=TRUE) It may not make any difference, but you can try! Other than that, have a closer look at the individual expression values of the top 10 genes, even if they're not significant, to determine whether there is a difference between R and S. 
They may still be worth following up. I find a barplot() helpful to visualise the individual sample values. Good luck! Cheers. Belinda > Hi Belinda, > > Thank you for the suggestion. > I filtered out the probes with very low expression values and then used > your design matrix. > Next I used: >>myresults<-topTable(fit, coef=4, number=50000) > > This list gives me not a single significant differentially expressed probe > after correction for mult hypothesis testing. > When I used multi dimensional plotting on my dataset, I noticed that there > was a very big biological difference between the cell lines (first > dimension), but that the difference between resistent and sensitive (2nd > dimension) are quite small. > Could this be the cause of the relatively high p-values? Is there a way to > correct for this? > > 2012/6/18 Belinda Phipson > >> Hi Steven >> >> You could just include cell line in your linear model rather than using >> duplicateCorrelation(). >> >> > design <- >> model.matrix(~factor(targets$cellline)+factor(targets$fenotype)) >> > fit <- lmFit(eset,design) >> > fit <- eBayes(fit) >> >> This will test R vs S taking into account cell line. You could also >> filter >> out lowly expressed genes across all samples to improve your power to >> detect >> differentially expressed genes as your sample size is quite small. 
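The blocked design suggested here can be inspected in base R before any limma call. The sketch below is a minimal illustration, assuming a `targets` table rebuilt by hand from the six arrays described in the original post; it only shows which columns `model.matrix()` produces and how few residual degrees of freedom remain.

```r
# Hand-built copy of the six-array targets table from the original post.
targets <- data.frame(
  File     = c("A", "B", "C", "D", "E", "F"),
  cellline = c(1, 2, 3, 1, 2, 3),
  fenotype = c("R", "R", "R", "S", "S", "S")
)

# Cell line as a blocking factor, phenotype as the effect of interest.
design <- model.matrix(~ factor(cellline) + factor(fenotype), data = targets)
print(colnames(design))

# Residual degrees of freedom left for estimating per-gene variance.
resid_df <- nrow(design) - qr(design)$rank
print(resid_df)
```

The fourth column, `factor(fenotype)S`, is the S versus R difference within cell line, which is why `coef=4` is the coefficient to pass to `topTable()`. With six arrays and four coefficients only two residual degrees of freedom remain, which is consistent with the low power discussed in the thread.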
>> >> Cheers, >> Belinda >> >> -----Original Message----- >> From: bioconductor-bounces at r-project.org >> [mailto:bioconductor-bounces at r-project.org] On Behalf Of steven segbroek >> Sent: Friday, 15 June 2012 2:13 AM >> To: bioconductor at r-project.org >> Subject: [BioC] some help requested for constructing an appropriate >> design >> matrix in LIMMA >> >> Dear R-users, >> >> I want to analyse a single channel micro array experiment which looks >> like >> the following: >> >> > targets >> File cellline fenotype >> 1 A 1 R >> 2 B 2 R >> 3 C 3 R >> 4 D 1 S >> 5 E 2 S >> 6 F 3 S >> >> There are three different cell lines, each of which comes in two >> versions. >> >> Every cell line has a variant that is resistant to a specific drug and >> another variant that is sensitive to this drug. >> >> We treated both variant of the three cell lines with this drug and then >> extracted RNA which was then hybridised to a micro array. >> >> The question we want to resolve is: which genes are differentially >> regulated between resistant (R) and sensitive (S) versions of these cell >> lines. >> >> There is quite some biological variation between the cell lines, so >> grouping them by fenotype and then searching for differentially >> regulated >> genes would be a bad idea. >> >> So, the idea is to to construct a model that accounts for this >> biological >> variation between the cell lines and looks which genes are consistently >> up >> or down regulated between resistant and sensitive versions of these >> three >> cell lines. >> >> I am a bit puzzled on how to setup an appropriate design matrix for this >> particular setup. 
>> >> I have come up with the following code: >> > design >> celline fenotR fenotS >> 1 1 1 0 >> 2 2 1 0 >> 3 3 1 0 >> 4 1 0 1 >> 5 2 0 1 >> 6 3 0 1 >> attr(,"assign") >> [1] 1 2 2 >> attr(,"contrasts") >> attr(,"contrasts")$fenot >> [1] "contr.treatment" >> >> >block<-c(1,2,3,1,2,3) >> >eset<-exprs(BSData.log2.quantile) >> >cor<-duplicateCorrelation(eset, ndups=1, block=block, design=design) >> >fit <- lmFit(eset, design, block=block, cor=cor$consensus) >> >fit<- eBayes(fit) >> >cont.matrix<-makeContrasts(resvssens= fenotR - fenotS, levels=design) >> >fit2<-contrasts.fit(fit,cont.matrix) >> >fit2<-eBayes(fit2) >> >topTable(fit2) >> >> However, this code results in an adj.p-value that is "0.9999541" for >> every >> gene. >> >> Is there a better way to analyse this? >> >> Kind regards, >> Steven >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for >> the >> addressee. >> You must not disclose, forward, print or use it without the permission >> of >> the sender. >> ______________________________________________________________________ >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From chenyao.bioinfor at gmail.com Sat Jun 23 04:21:39 2012 From: chenyao.bioinfor at gmail.com (Yao Chen) Date: Fri, 22 Jun 2012 22:21:39 -0400 Subject: [BioC] [Limma] Calculate the relation between mRNA and miRNA In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From shi at wehi.EDU.AU Fri Jun 22 16:22:52 2012 From: shi at wehi.EDU.AU (Wei Shi) Date: Sat, 23 Jun 2012 00:22:52 +1000 Subject: [BioC] readGapppedAlignmentpairs questions In-Reply-To: References: Message-ID: <23B95E09-A15B-43BA-A47B-EF33117DAF12@wehi.edu.au> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From mtmorgan at fhcrc.org Sat Jun 23 13:21:59 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Sat, 23 Jun 2012 04:21:59 -0700 Subject: [BioC] readGapppedAlignmentpairs questions In-Reply-To: <23B95E09-A15B-43BA-A47B-EF33117DAF12@wehi.edu.au> References: <23B95E09-A15B-43BA-A47B-EF33117DAF12@wehi.edu.au> Message-ID: <4FE5A6D7.9070307@fhcrc.org> On 06/22/2012 07:22 AM, Wei Shi wrote: > Dear Lakshmanan, > > If the purpose of your analysis is to count reads falling within each feature, you may consider using the featureCounts() function in the Rsubread package. It takes only about 2 minutes to summarize 10 million reads into a count table. But it only accepts SAM files (you can use samtools to convert your BAM files to SAM files) and it only works on unix. See ?featureCounts() for more info. Hi Wei -- can you clarify how you are counting reads? From a quick scan of your man page / C source code it looks like you're counting each end of a paired-end read separately, and looking for a read whose start position is in an exon / gene? This elementary counting scheme (on a bam file) is just ## what features? any GRanges or GRangesList, e.g., library(Rsamtools) library(TxDb.Hsapiens.UCSC.hg19.knownGene) exByGn = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") ## what reads? GRanges from 'rname' and 'pos' param = ScanBamParam(what=c("rname", "pos"), flag=scanBamFlag(isUnmappedRead=FALSE)) with(scanBam(aBamFile, param=param)[[1]], { GRanges(rname, IRanges(pos, width=1)) }) ## count, 'top level' of GRangesList, so counts per gene countOverlaps(exByGn, reads) This will be fast and memory friendly. 
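The start-position counting scheme described above can be illustrated without any Bioconductor objects. The following base R sketch uses made-up coordinates on a single sequence; `features`, `read_start` and `count_starts()` are hypothetical names chosen for illustration and are not part of Rsamtools, GenomicRanges or Rsubread.

```r
# Made-up exon annotation on one sequence: two exons for g1, one for g2.
features <- data.frame(
  gene  = c("g1", "g1", "g2"),
  start = c(100, 300, 1000),
  end   = c(200, 400, 1200)
)

# Made-up read start positions (one entry per read).
read_start <- c(150, 199, 250, 350, 1000, 1200, 1201)

count_starts <- function(features, read_start) {
  # hit[i, j] is TRUE when read i starts inside exon j.
  hit <- outer(read_start, features$start, ">=") &
         outer(read_start, features$end,   "<=")
  genes <- unique(features$gene)
  # Collapse exons to genes: a read counts at most once per gene.
  vapply(genes, function(g) {
    sum(rowSums(hit[, features$gene == g, drop = FALSE]) > 0)
  }, integer(1))
}

counts <- count_starts(features, read_start)
print(counts)
```

Reads starting at 250 and 1201 fall outside every exon and are not counted, so the toy data give 3 reads for g1 and 2 for g2. A read that starts just before an exon but extends into it would also be missed by a scheme that looks only at start positions.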
?countBam is another alternative, also memory efficient and taking this simple approach. ?summarizeOverlaps gives better counting schemes for single-end reads, and is also reasonably fast (and in devel space efficient, iterating over the bam file, and with some support for paired-end reads). ?readGappedAlignmentPairs, from the original post, tries to make sense of paired end reads, and is less memory / speed friendly (but the OP has a lot of memory). Martin > > For your second question, if the pair of reads is indeed mapped as a pair, then the region between them will be covered as well if the two reads are on the same exon. But the reality is that not every read pair can be successfully mapped as pairs. You may get only one end mapped, or the two ends are mapped to two locations which have a distance much bigger than the average fragment lengths. In these cases, you don't even know what are the exons which lie between the two reads. > > Hope this helps. > > Cheers, > Wei > > On Jun 22, 2012, at 11:56 PM, Lakshmanan Iyer wrote: > >> Hi >> My apologies for multiple posting if it happens-- I sent the last mails >> from other accounts which may not be registered with Bioc-list >> -Lax >> Two questions: >> >> 1. Is readGapppedAlignmentPairs - the most efficient way to read a >> paired-end bam file with mulit-mapped reads? >> I am asking as it takes an enormous amount of time to process and load. >> >> 2. How does one work with coverage on GappedAlignmentPairs in the context >> of RNASeq? >> The simplest way is to consider each left and right read as separate - >> essentially loose the "paired" information and calculate coverage. 
>> However, if both the left and right pair reads fall within a feature of >> interest - say an exon, does it imply coverage of the region of the exon >> between the reads too >> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >> LLLLLLLLLL---------------------------RRRRRRRRRR >> ^^^^^^^^^^^^^^^^^ >> >> In the figure above, the exon is represented by ">" and L and R represents >> the left and right reads aligned to the exon. >> I am talking about the region represented by "^". Do we assume coverage >> for this region too? >> Does Coverage on GappedAlignmentPairs do this? >> >> -best >> -Lax >> Center for Neuroscience Research >> Tufts Univeristy School of Medicine >> Boston, MA >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:17}} From mtmorgan at fhcrc.org Sat Jun 23 13:44:13 2012 From: mtmorgan at fhcrc.org (Martin Morgan) Date: Sat, 23 Jun 2012 04:44:13 -0700 Subject: [BioC] readGapppedAlignmentpairs questions In-Reply-To: <4FE5A6D7.9070307@fhcrc.org> References: <23B95E09-A15B-43BA-A47B-EF33117DAF12@wehi.edu.au> <4FE5A6D7.9070307@fhcrc.org> Message-ID: <4FE5AC0D.5080301@fhcrc.org> On 06/23/2012 04:21 AM, Martin Morgan wrote: > On 06/22/2012 07:22 AM, Wei Shi wrote: >> Dear Lakshmanan, >> >> If the purpose of your analysis is to count reads falling within each >> feature, you may consider using the featureCounts() function in >> Rsubread package. It takes only about 2 minutes to summarize 10 >> million reads into a count table. But it only accept SAM files (you >> can use samtools to convert your BAM files to SAM files) and it only >> works on unix. 
See ?featureCounts() for more info. > > Hi Wei -- can you clarify how you are counting reads? From a quick scan > of your man page / C source code it looks like you're counting each end > of a paired-end read separately, and looking for a read whose start position > is in an exon / gene? This elementary counting scheme (on a bam file) is > just > > ## what features? any GRanges or GRangesList, e.g., > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > exByGn = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") > > ## what reads? GRanges from 'rname' and 'pos' > param = ScanBamParam(what=c("rname", "pos"), > flag=scanBamFlag(isUnmappedRead=FALSE)) > with(scanBam(aBamFile, param=param)[[1]], { > GRanges(rname, IRanges(pos, width=1)) > }) Oops, should have assigned that GRanges to 'reads' reads = with(scanBam(aBamFile, param=param)[[1]], { GRanges(rname, IRanges(pos, width=1)) }) The scheme has obvious limitations: counting each end of a paired-end read separately, counting reads that overhang gene or exon boundaries (though sometimes that might be ok), counting overhanging reads that start in a range but not those that end in it, etc. Martin > > ## count, 'top level' of GRangesList, so counts per gene > countOverlaps(exByGn, reads) > > This will be fast and memory friendly. ?countBam is another alternative, > also memory efficient and taking this simple approach. > > ?summarizeOverlaps gives better counting schemes for single-end reads, > and is also reasonably fast (and in devel space efficient, iterating > over the bam file, and with some support for paired-end reads). > > ?readGappedAlignmentPairs, from the original post, tries to make sense > of paired end reads, and is less memory / speed friendly (but the OP has > a lot of memory). > > Martin > >> >> For your second question, if the pair of reads is indeed mapped as a >> pair, then the region between them will be covered as well if the two >> reads are on the same exon. 
But the reality is that not every read >> pair can be successfully mapped as pairs. You may get only one end >> mapped, or the two ends are mapped to two locations which have a >> distance much bigger than the average fragment lengths. In these >> cases, you don't even know what are the exons which lie between the >> two reads. >> >> Hope this helps. >> >> Cheers, >> Wei >> >> On Jun 22, 2012, at 11:56 PM, Lakshmanan Iyer wrote: >> >>> Hi >>> My apologies for multiple posting if it happens-- I sent the last mails >>> from other accounts which may not be registered with Bioc-list >>> -Lax >>> Two questions: >>> >>> 1. Is readGapppedAlignmentPairs - the most efficient way to read a >>> paired-end bam file with mulit-mapped reads? >>> I am asking as it takes an enormous amount of time to process and load. >>> >>> 2. How does one work with coverage on GappedAlignmentPairs in the >>> context >>> of RNASeq? >>> The simplest way is to consider each left and right read as separate - >>> essentially loose the "paired" information and calculate coverage. >>> However, if both the left and right pair reads fall within a feature of >>> interest - say an exon, does it imply coverage of the region of the exon >>> between the reads too >>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>> LLLLLLLLLL---------------------------RRRRRRRRRR >>> ^^^^^^^^^^^^^^^^^ >>> >>> In the figure above, the exon is represented by ">" and L and R >>> represents >>> the left and right reads aligned to the exon. >>> I am talking about the region represented by "^". Do we assume coverage >>> for this region too? >>> Does Coverage on GappedAlignmentPairs do this? 
>>>
>>> -best
>>> -Lax
>>> Center for Neuroscience Research
>>> Tufts University School of Medicine
>>> Boston, MA

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

From shi at wehi.EDU.AU  Sat Jun 23 14:21:44 2012
From: shi at wehi.EDU.AU (Wei Shi)
Date: Sat, 23 Jun 2012 22:21:44 +1000
Subject: [BioC] readGapppedAlignmentpairs questions
In-Reply-To: <4FE5AC0D.5080301@fhcrc.org>
References: <23B95E09-A15B-43BA-A47B-EF33117DAF12@wehi.edu.au> <4FE5A6D7.9070307@fhcrc.org> <4FE5AC0D.5080301@fhcrc.org>
Message-ID: <72449FAF-53C0-45B8-B88A-40E372602674@wehi.EDU.AU>

Hi Martin,

featureCounts does count each end of a paired-end read separately. But this is actually my favorite counting approach for paired-end reads, because it helps include in the analysis those fragments which have only one end mapped, or both ends mapped at a distance greater than the average fragment length (such as those fragments which span two or more distant exons or contain chimeric sequences).

featureCounts assigns a read to an exon when they have at least 1bp overlap, so it does count overhanging reads. By default, it uses RefSeq annotation.
But it also accepts user-provided annotation (users can simply provide a data frame to it, which includes gene id, chr, start and end).

This function is not only fast, but also extremely memory efficient. In total it uses only a few megabytes of memory (for storing annotation). It does not read the entire SAM file into memory. There is only one read in memory at any time, so no matter how big the SAM file is, the memory usage is always a few megabytes.

Hope this makes things clearer.

Cheers,
Wei

From smyth at wehi.EDU.AU  Sun Jun 24 01:55:43 2012
From: smyth at wehi.EDU.AU (Gordon K Smyth)
Date: Sun, 24 Jun 2012 09:55:43 +1000 (AUS Eastern Standard Time)
Subject: [BioC] Calculate the relation between mRNA and miRNA
In-Reply-To:
References:
Message-ID:

Dear Jack,

lmFit() allows NA values in the expression values but not in the design matrix. If you must put NA values in a column of the design, then you need to remove them like this:

   anyna <- apply(is.na(miRNA),1,any)
   fit <- lmFit(mRNA[,!anyna], miRNA[!anyna,])

This code identifies rows of the design matrix containing NAs, then removes those samples from the data. lmFit does not do this automatically, because I feel that this is something that a user should make a decision about deliberately.
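
A minimal sketch of the filtering step just described, using base R only. The toy matrices, their dimensions and the sample names are made up for illustration (the thread does not show the real data), and the lmFit() call is left as a comment because it needs limma to be installed:

```r
set.seed(1)

# Hypothetical expression data: 5 genes (rows) x 6 samples (columns).
mRNA <- matrix(rnorm(30), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("s", 1:6)))

# Hypothetical design: intercept plus one miRNA covariate, one row per sample.
miRNA <- cbind(Intercept = 1, miR = c(0.2, NA, 1.1, 0.7, NA, 0.4))
rownames(miRNA) <- colnames(mRNA)

# Flag samples (rows of the design) that contain at least one NA.
anyna <- apply(is.na(miRNA), 1, any)

# Drop those samples from both the data (columns) and the design (rows).
mRNA_ok  <- mRNA[, !anyna, drop = FALSE]
miRNA_ok <- miRNA[!anyna, , drop = FALSE]

# With limma installed, the fit from the message would then be:
# fit <- limma::lmFit(mRNA_ok, miRNA_ok)
```

The point of doing this by hand is that the number of dropped samples (sum(anyna)) is visible before the fit, so the decision to discard them is an explicit one.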
Best wishes
Gordon

> Date: Fri, 22 Jun 2012 09:59:28 -0400
> From: Yao Chen
> To: bioconductor at r-project.org
> Subject: [BioC] [Limma] Calculate the relation between mRNA and miRNA
>
> Hi all,
>
> I have a problem using lmFit to calculate the correlation between mRNA and
> miRNA, because my miRNA data contain "NA" values.
>
> So if I use lmFit(mRNA,miRNA), I get the message "Error in qr.default(x) :
> NA/NaN/Inf in foreign function call (arg 1)".
>
> The original "lm" function allows "NA" values to be included; why can't lmFit?
>
> Thanks,
>
> Jack

From smyth at wehi.EDU.AU  Sun Jun 24 02:41:49 2012
From: smyth at wehi.EDU.AU (Gordon K Smyth)
Date: Sun, 24 Jun 2012 10:41:49 +1000 (AUS Eastern Standard Time)
Subject: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates
In-Reply-To:
References:
Message-ID:

Hi Kaat,

I'll jump in and continue on from Mark's help.

To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. I'll assume that your "tests" are independent replicates of the whole experiment. If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be:

   design <- model.matrix(~test+time+time:treat)

This produces a design matrix with the following columns:

   > colnames(design)
   [1] "(Intercept)"      "test2"            "test3"            "time24hpi"
   [5] "time48hpi"        "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT"

So testing for treatment effects at each time is easy. To test for treatment effect at time 12h:

   fit <- glmFit(y, design)
   lrt <- glmLRT(y, fit, coef="time12hpi:treatT")

etc. To test for treatment effect at time 24h:

   lrt <- glmLRT(y, fit, coef="time24hpi:treatT")

and so on.
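
A minimal sketch of the design above, using base R only, so the column names can be checked without edgeR. The factor vectors are rebuilt here on the assumption that the 18 libraries are ordered as in the count table quoted below (time point, then treatment, then test); the edgeR calls are left as comments because they need a DGEList with dispersions already estimated:

```r
# Factors for the 18 libraries: 3 time points x 2 treatments x 3 tests.
time  <- factor(rep(c("12hpi", "24hpi", "48hpi"), each = 6))
treat <- factor(rep(rep(c("C", "T"), each = 3), times = 3))
test  <- factor(rep(1:3, times = 6))

# Batch (test) effect, time effect, and a treatment effect nested in each time.
design <- model.matrix(~ test + time + time:treat)
colnames(design)

# The 8 columns should be linearly independent for this balanced layout.
qr(design)$rank

# With edgeR installed, the per-time-point tests from the message would be,
# in the argument order used in the message:
# fit <- glmFit(y, design)
# lrt <- glmLRT(y, fit, coef = "time12hpi:treatT")
```

Each "timeXX:treatT" column is 1 only for treated libraries at that time point, which is why a single coefficient gives the treated-versus-control comparison at that time. Later edgeR releases take the fitted object as the first argument of glmLRT(), so the commented calls may need adjusting to the installed version.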
Best wishes
Gordon

> Date: Fri, 22 Jun 2012 13:11:41 +0000
> From: Kaat De Cremer
> To: Mark Robinson
> Cc: bioconductor list
> Subject: Re: [BioC] design matrix edge R pairwise comparison at
>   different time points after infection with replicates
>
> Hi Mark,
> Thank you for your suggestion,
> I really appreciate your time.
>
> Working in R is new to me so it has been a struggle using edgeR, but I
> think I managed it using only 2 factors (test and treatment). Now that I
> will be including 3 factors (test, treatment and time) in one analysis
> it is clear to me that I still don't understand how it works exactly.
> Below you can see my workspace with the only design matrix I could come
> up with, but I don't see which coefficients I should include or which
> contrast vector to use in the glmLRT function to make the comparison of
> control-treatment at each time point separate, ignoring the other 2 time
> points. Is this possible with this design matrix? Or is the matrix wrong
> for this purpose?
>
>
> Thanks!
> Kaat
>
>
>> head(x)
>             12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2
> Lsa000001.1        0        1        1        2        0        2        1        1
> Lsa000002.1        5        4        0        5        6        6        6        4
> Lsa000003.1       10        9        7        5        5        8        6        2
> Lsa000004.1        1        1        1        1        1        1        1        3
> Lsa000005.1        1        0        1        0        2        0        0        1
> Lsa000006.1      510      223      228      287      222      268      303      358
>             24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1
> Lsa000001.1        0        1        1        0        0        0        0        2
> Lsa000002.1        7        5        2        5       10        6       12       12
> Lsa000003.1        7        5        4        2        6        5        8        2
> Lsa000004.1        1        3        1        2        1        3        2        3
> Lsa000005.1        0        1        0        0        1        0        0        2
> Lsa000006.1      372      362      237      320      472      440      411      858
>             48hpi T2 48hpi T3
> Lsa000001.1        0        0
> Lsa000002.1        1        5
> Lsa000003.1        1        0
> Lsa000004.1        0        2
> Lsa000005.1        1        0
> Lsa000006.1      375      275
>> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C","C","C","T","T","T"))
>> test<-factor(c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
>> time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi","24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hpi","48hpi","48hpi"))
>> y<-DGEList(counts=x,group=treat)
> Calculating library sizes from column totals.
>> cpm.y<-cpm(y)
>> y<-y[rowSums(cpm.y>2)>=3,]
>> y<-calcNormFactors(y)
>> design<-model.matrix(~test+treat+time)
>> design
>    (Intercept) test2 test3 treatT time24hpi time48hpi
>  1           1     0     0      0         0         0
>  2           1     1     0      0         0         0
>  3           1     0     1      0         0         0
>  4           1     0     0      1         0         0
>  5           1     1     0      1         0         0
>  6           1     0     1      1         0         0
>  7           1     0     0      0         1         0
>  8           1     1     0      0         1         0
>  9           1     0     1      0         1         0
> 10           1     0     0      1         1         0
> 11           1     1     0      1         1         0
> 12           1     0     1      1         1         0
> 13           1     0     0      0         0         1
> 14           1     1     0      0         0         1
> 15           1     0     1      0         0         1
> 16           1     0     0      1         0         1
> 17           1     1     0      1         0         1
> 18           1     0     1      1         0         1
> attr(,"assign")
> [1] 0 1 1 2 3 3
> attr(,"contrasts")
> attr(,"contrasts")$test
> [1] "contr.treatment"
>
> attr(,"contrasts")$treat
> [1] "contr.treatment"
>
> attr(,"contrasts")$time
> [1] "contr.treatment"
>
>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE)
> Disp = 0.07299 , BCV = 0.2702
>> y<-estimateGLMTrendedDisp(y,design)
> Loading required package: splines
>> y<-estimateGLMTagwiseDisp(y,design)
> Warning message:
> In maximizeInterpolant(spline.pts, apl.smooth[j, ]) :
>   max iterations exceeded
>> fit<-glmFit(y,design)
>
>
> -----Original Message-----
> From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch]
> Sent: Friday 22 June 2012 12:03
> To: Kaat De Cremer
> Cc: bioconductor list
> Subject: Re: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates
>
> Hi Kaat,
>
> It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest through the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on.
>
> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design.
Some decisions might need to be made about interactions to include.
>
> Hope that gets you started.
>
> Best,
> Mark
>
>
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics
> Institute of Molecular Life Sciences
> University of Zurich
> Winterthurerstrasse 190
> 8057 Zurich
> Switzerland
>
> v: +41 44 635 4848
> f: +41 44 635 6898
> e: mark.robinson at imls.uzh.ch
> o: Y11-J-16
> w: http://tiny.cc/mrobin
>
> ----------
> http://www.fgcz.ch/Bioconductor2012
>
> On 21.06.2012, at 11:42, Kaat De Cremer wrote:
>
>> Dear all,
>>
>>
>> I am using edgeR to find genes differentially expressed between
>> infected and mock-infected control plants, at 3 time points after
>> infection.
>> I have RNAseq data for 3 independent tests, so for every single test I
>> have 6 libraries (control + infected at 3 time points).
>> Having three replicates this makes 18 libraries in total.
>>
>> What I did until now is look at each time point separately and calculate DE genes at that time point as shown in this script:
>>
>>> head(x)
>>    C1  C2  C3  T1  T2  T3
>> 1   0   1   2   0   0   0
>> 2  13   6   4  10   8  12
>> 3  17  16   9  10   8  11
>> 4   2   1   2   2   3   2
>> 5. 1 3 1 2 1 3 0
>> 6 958 457 438 565 429 518
>>
>>> treatment<-factor(c("C","C","C","T","T","T"))
>>> test<-factor(c(1,2,3,1,2,3))
>>> y<-DGEList(counts=x,group=treatment)
>> Calculating library sizes from column totals.
>>> cpm.y<-cpm(y)
>>> y<-y[rowSums(cpm.y>2)>=3,]
>>> y<-calcNormFactors(y)
>>> design<-model.matrix(~test+treatment)
>>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE)
>> Disp = 0.0265 , BCV = 0.1628
>>> y<-estimateGLMTrendedDisp(y,design)
>> Loading required package: splines
>>> y<-estimateGLMTagwiseDisp(y,design)
>>> fit<-glmFit(y,design)
>>> lrt<-glmLRT(y,fit)
>>
>>
>> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference?
>> Unfortunately I cannot figure out how to design the matrix.
>>
>> I hope someone can help me,
>>
>> Kaat
>>

From mgarciao at ufl.edu  Sun Jun 24 03:29:09 2012
From: mgarciao at ufl.edu (Garcia Orellana,Miriam)
Date: Sun, 24 Jun 2012 01:29:09 +0000
Subject: [BioC] Getting different results with 2 models for factorial designs with LIMMA
Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F9B85C@UFEXCH-MBXN01.ad.ufl.edu>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From mgarciao at ufl.edu  Sun Jun 24 07:10:50 2012
From: mgarciao at ufl.edu (Garcia Orellana,Miriam)
Date: Sun, 24 Jun 2012 05:10:50 +0000
Subject: [BioC] Getting different results with 2 models for factorial designs with LIMMA
In-Reply-To: <7F10E9EDBB347E4CA0765A3139C110BB14F9B85C@UFEXCH-MBXN01.ad.ufl.edu>
References: <7F10E9EDBB347E4CA0765A3139C110BB14F9B85C@UFEXCH-MBXN01.ad.ufl.edu>
Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F9B86B@UFEXCH-MBXN01.ad.ufl.edu>

Dear ALL (special request to Dr. G. Smyth):

Just to give some additional information to my first email. NOW I AM SURE both of the models are using the references that I was expecting. Also, for each of my 5 contrasts both models give me a top table with the same numerical values for AveExp, t, Pvalue, adjPvalue and B. However, the logFC for contrasts 1, 2 and 3 (the contrasts are detailed in the lines below) in model B is exactly half of that in model A, while the logFC for contrasts 4 and 5 in model B is exactly one fourth of that in model A. How can that be possible if all other values are the same? And which one should I follow?

Thanks so much.
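
One possible reading of the halved and quartered logFC values, offered as an assumption rather than something confirmed in the thread: when a two-level factor is coded -1/+1 (as MR is in model B of the original message further down), its coefficient is half of the difference between the two levels, and the interaction of two such codings is one quarter of the difference of differences, while the t-statistics and p-values are unaffected by the rescaling. A minimal base R sketch with a made-up, noise-free 2 x 2 layout (the factor names A and B and the cell values are hypothetical):

```r
# Hypothetical 2 x 2 layout, one noise-free value per cell.
# Row order from expand.grid: (lo,lo), (hi,lo), (lo,hi), (hi,hi).
d <- expand.grid(A = c("lo", "hi"), B = c("lo", "hi"))
d$A <- factor(d$A, levels = c("lo", "hi"))
d$B <- factor(d$B, levels = c("lo", "hi"))
d$y <- c(1, 3, 2, 8)

# Code both factors as -1 / +1, as done for MR in model B.
contrasts(d$A) <- c(-1, 1)
contrasts(d$B) <- c(-1, 1)
fit <- lm(y ~ A * B, data = d)

# Quantities on the scale of the original measurements.
main_A <- mean(d$y[d$A == "hi"]) - mean(d$y[d$A == "lo"])  # hi minus lo
diff_of_diffs <- (8 - 2) - (3 - 1)                          # A effect at B=hi minus at B=lo

# Ratio of each coefficient to the corresponding raw difference.
ratio_main <- unname(coef(fit)["A1"]) / main_A
ratio_int  <- unname(coef(fit)["A1:B1"]) / diff_of_diffs
```

Under this reading, the explicit averages in makeContrasts() of model A report differences on the log2 scale directly, whereas the coefficients of a contrast-coded model are scaled by the coding; which scaling applies to the three-level DD contrasts depends on the numbers placed in the contrast matrix.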
Miriam
________________________________________
From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] on behalf of Garcia Orellana,Miriam [mgarciao at ufl.edu]
Sent: Saturday, June 23, 2012 9:29 PM
To: bioconductor at r-project.org
Subject: [BioC] Getting different results with 2 models for factorial designs with LIMMA

Dear Users:

I hope someone can help me to understand why the two models I analyzed for my data are giving me different outputs regarding differentially expressed genes. For example, MODEL A for the MR effect gives me 47 up- and 84 down-regulated genes (adjPvalue <0.05 rawFC=1.5), while model B gives me only 18 up- and only 2 down-regulated genes under the same cut-offs. So in addition to the big difference in the number of genes, the proportion of UP/DOWN in model A is also lower than in MODEL B. I am also wondering how I can be sure the program is using the right factors as the reference to calculate the logFC. My goal is to have as references the factors indicated below.

Briefly, my data is a factorial design of 3 dam diets (DD: CTL, SFA, EFA) and 2 milk replacers (MR: LLA, HLA). I have three replicates for each of the interaction factors, so a total of 18 arrays. The data was filtered for informative/noninformative probes and plotted for array quality. So from an initial 24118 bovine probes I end up with 8026 probes.

My interest is to compare:
1. Feeding FAT at prepartum = (SFA + EFA) vs CTL, with CTL as ref
2. Feeding EFA prepartum = EFA vs SFA, with SFA as ref
3. Feeding MR to calves = HLA vs LLA, with LLA as reference
4. Interaction of feeding FAT by MR: (SFA + EFA) vs CTL by MR, with (SFA+EFA) vs CTL by LLA as ref
5. Interaction of feeding EFA by MR: EFA vs SFA by MR, with EFA vs SFA by LLA as ref

MODEL A (I created this following the LIMMA user guide for a factorial design):

   TS <- paste(phenoDiet$DD, phenoDiet$MR, sep=".")
   TS
   TS <- factor(TS, levels=c("Ctl.LLA", "Ctl.HLA","SFA.LLA","SFA.HLA","EFA.LLA", "EFA.HLA"))
   design <- model.matrix(~0+TS)
   colnames(design) <- levels(TS)
   fit <- lmFit(eset2, design, method="robust", maxit=1000)
   efit <- eBayes(fit)
   #Contrast results
   MatContrast=makeContrasts(
      FAT=(SFA.LLA + SFA.HLA + EFA.LLA + EFA.HLA)/4 - (Ctl.LLA + Ctl.HLA)/2,
      FA=(EFA.LLA + EFA.HLA)/2 - (SFA.LLA + SFA.HLA)/2,
      MR=(EFA.HLA+SFA.HLA+Ctl.HLA)/3 - (EFA.LLA+SFA.LLA+Ctl.LLA)/3,
      FATbyMR=((EFA.HLA+SFA.HLA)/2 - Ctl.HLA) - ((EFA.LLA+SFA.LLA)/2-Ctl.LLA),
      FAbyMR=(EFA.HLA-SFA.HLA)-(EFA.LLA - SFA.LLA),
      levels=design)
   fitMat<-contrasts.fit(fit,MatContrast)
   Contrast.eBayes=eBayes(fitMat)

MODEL B (this model was kindly provided by Dr G. Smyth):

   DD <- factor(phenoDiet$DD, levels = c("Ctl", "SFA", "EFA"))
   MR <- factor(phenoDiet$MR, levels = c("LLA", "HLA"))
   contrasts(DD) <- cbind(SFAEFAvsCtl=c(-2,1,1), EFAvsSFA=c(0,-1,1))
   contrasts(MR) <- c(-1,1)
   design <- model.matrix(~DD*MR)
   design
   fit <- lmFit(eset2, design, method="robust", maxit=1000)
   efit <- eBayes(fit)

Thanks in advance,
Miriam

From maiteiriondo at gmail.com  Sun Jun 24 15:01:21 2012
From: maiteiriondo at gmail.com (Maite Iriondo)
Date: Sun, 24 Jun 2012 15:01:21 +0200
Subject: [BioC] Loading ArrayVision microarray data into Limma
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From Kaat.DeCremer at biw.kuleuven.be  Mon Jun 25 10:11:35 2012
From: Kaat.DeCremer at biw.kuleuven.be (Kaat De Cremer)
Date: Mon, 25 Jun 2012 08:11:35 +0000
Subject: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates
In-Reply-To:
References:
Message-ID: <3D4A97F14E343F4584925219C1C1ACEF05B964E0@ICTS-S-MBX7.luna.kuleuven.be>

Thank you so much for clarifying this!

Kaat

From bzguan at ucdavis.edu  Mon Jun 25 10:27:48 2012
From: bzguan at ucdavis.edu (bzguan at ucdavis.edu)
Date: Mon, 25 Jun 2012 01:27:48 -0700 (PDT)
Subject: [BioC] beadarray package: get error message when using summarize function
Message-ID: <201206250827.q5P8RmIB012836@melipona.ucdavis.edu>

Hi,

I am trying to use the summarize function in beadarray to group together the beads in a beadLevelData object according to their ArrayAddressID. I have data from an Illumina Human660w_quad bead chip. I didn't specify the annotation when using the readIllumina function because my platform is not among the options that show up when using the suggestAnnotation function. I have successfully read in the data using the readIllumina function. I am hoping that I can use the summarize function without having to specify the annotation. Below is the message I get when using the summarize function and the traceback function. Any suggestion on how to fix this error would be greatly appreciated.

thanks,
Anna

> datasum <- summarize(data, removeUnMappedProbes= FALSE)
No sample factor specified.
Summarizing each section separately
Finding list of unique probes in beadLevelData
210732 unique probeIDs found
Summarizing G channel
Processing Array 1
Removing outliers
Using exprFun
Using varFun
Summarizing G channel
Processing Array 2
Removing outliers
Using exprFun
Using varFun
Summarizing G channel
Processing Array 3
Removing outliers
Using exprFun
Using varFun
Summarizing G channel
Processing Array 4
Removing outliers
Using exprFun
Using varFun
Summarizing G channel
Processing Array 5
Removing outliers
Using exprFun
Using varFun
Summarizing G channel
Processing Array 6
Removing outliers
Using exprFun
Using varFun
Summarizing G channel
Processing Array 7
Removing outliers
Using exprFun
Using varFun
Summarizing G channel
Processing Array 8
Removing outliers
Using exprFun
Using varFun
Making summary object
Could not map ArrayAddressIDs: No annotation specified
Error in value[[3L]](cond) :
  row names supplied are of the wrong length
  AnnotatedDataFrame 'initialize' could not update varMetadata:
  perhaps pData and varMetadata are inconsistent?
> traceback()
10: stop(conditionMessage(err), "\n AnnotatedDataFrame 'initialize' could not update varMetadata:",
        "\n perhaps pData and varMetadata are inconsistent?")
9: value[[3L]](cond)
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
7: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch({
       if (missing(varMetadata)) {
           if (!missing(data))
               checkClass(data, "data.frame", class(.Object))
           varMetadata <- data.frame(labelDescription = rep(NA, ncol(data)))
           row.names(varMetadata) <- as.character(colnames(data))
       }
       else {
           checkClass(varMetadata, "data.frame", class(.Object))
           if (!"labelDescription" %in% colnames(varMetadata))
               varMetadata[["labelDescription"]] <- rep(NA, nrow(varMetadata))
           row.names(varMetadata) <- names(data)
       }
       varMetadata[["labelDescription"]] <- as.character(varMetadata[["labelDescription"]])
   }, error = function(err) {
       stop(conditionMessage(err), "\n AnnotatedDataFrame 'initialize' could not update varMetadata:",
           "\n perhaps pData and varMetadata are inconsistent?")
   })
5: .local(.Object, ...)
4: initialize(value, ...)
3: initialize(value, ...)
2: new("AnnotatedDataFrame", data.frame(sampInfo, row.names = newNames))
1: summarize(data, removeUnMappedProbes = FALSE)
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] beadarray_2.6.0 ggplot2_0.9.1 Biobase_2.16.0 BiocGenerics_0.2.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.18.1 BeadDataPackR_1.8.0 colorspace_1.1-1 DBI_0.2-5 dichromat_1.2-4 digest_0.5.2 grid_2.15.0
 [8] IRanges_1.14.3 labeling_0.1 limma_3.12.1 MASS_7.3-18 memoise_0.1 munsell_0.3 plyr_1.7.1
[15] proto_0.3-9.2 RColorBrewer_1.0-5 reshape2_1.2.1 RSQLite_0.11.1 scales_0.2.1 stats4_2.15.0 stringr_0.6
[22] tools_2.15.0

From efthimiosm at bii.a-star.edu.sg Mon Jun 25 13:15:23 2012
From: efthimiosm at bii.a-star.edu.sg (efthimiosm)
Date: Mon, 25 Jun 2012 19:15:23 +0800
Subject: [BioC] help with multiple testing
Message-ID: <4FE8484B.7080006@bii.a-star.edu.sg>

Hi all,

My name is Mike and I am a post-doctoral fellow in Bioinformatics. I have a question regarding multiple testing p-values adjustment and I wonder if someone could give me a piece of advice.

I have multiple gene pairs (approximately 8,256) composed by all possible combinations of 129 genes. For each pair A-B (A different from B) four values are recorded: number of tumors found in both A and B (TT), number of tumors only in A (TF), number of tumors only in B (FT), number of tumors found neither in A nor in B (FF). The data are in the form of 2x2 contingency tables. E.g.

Gene 1  Gene 2  TT  TF  FT  FF
g1      g2       5   1   1  27
g1      g3       4   1   1  28
g2      g3       4   2   0  28
...
...
...

Notice that each gene is paired with all others and thus it is represented 128 times in this list.
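The bookkeeping described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame holds just the three example rows shown, the names `gene_pairs` and `pair_pvalue` are hypothetical, and Fisher's exact test (`fisher.test`) and the BH adjustment (`p.adjust`) stand in for whichever test and correction are eventually chosen.

```r
# Toy copy of the three example rows above (tumor counts per category).
gene_pairs <- data.frame(
  gene1 = c("g1", "g1", "g2"),
  gene2 = c("g2", "g3", "g3"),
  TT = c(5, 4, 4),
  TF = c(1, 1, 2),
  FT = c(1, 1, 0),
  FF = c(27, 28, 28)
)

# Number of unordered pairs that can be formed from 129 genes.
n_pairs <- choose(129, 2)

# Hypothetical helper: build the 2x2 table for one pair
# (rows: in A / not in A, columns: in B / not in B)
# and return the two-sided Fisher exact p-value.
pair_pvalue <- function(TT, TF, FT, FF) {
  tab <- matrix(c(TT, FT, TF, FF), nrow = 2,
                dimnames = list(A = c("yes", "no"), B = c("yes", "no")))
  fisher.test(tab)$p.value
}

# One raw p-value per pair, then one possible multiplicity adjustment (BH).
gene_pairs$p <- mapply(pair_pvalue,
                       gene_pairs$TT, gene_pairs$TF,
                       gene_pairs$FT, gene_pairs$FF)
gene_pairs$p_BH <- p.adjust(gene_pairs$p, method = "BH")
```

With the full list, `gene_pairs` would have one row per unordered pair of the 129 genes (the value held in `n_pairs`), and the same two calls would apply unchanged.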
I want to find which of the 8,256 gene pairs (tests) show significant associations between rows (in A, not in A) and columns (in B, not in B) by Fisher or Barnard test. Subsequently I have to perform p-value adjustment for multiple testing.

At 5% I find approximately 500 significant gene pairs but, naturally, all p-value adjustment procedures I tried (for independent tests: BH, q-value; for dependent tests: BY, adaptiveBH and BlaRoq from package "multtest") produce adj. p-values > 0.3. I think that the problem is that the highly dependent nature of the data (50% of the genes have a very small number of mutations, which gives high p-values for all pairs they generate) dramatically affects the adjustment procedure.

Is there a better way (method) to run the p-values adjustment?

Do you think that creating multiple lists of gene pairs, where each gene is represented only once, and then estimating q-values (multiple q-values for each pair) would be an appropriate solution?

Thank you,
Mike

From chenyao.bioinfor at gmail.com Mon Jun 25 15:38:10 2012
From: chenyao.bioinfor at gmail.com (Yao Chen)
Date: Mon, 25 Jun 2012 09:38:10 -0400
Subject: [BioC] help with multiple testing
In-Reply-To: <4FE8484B.7080006@bii.a-star.edu.sg>
References: <4FE8484B.7080006@bii.a-star.edu.sg>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From xydavis at ncsu.edu Mon Jun 25 15:46:34 2012
From: xydavis at ncsu.edu (Xin Davis)
Date: Mon, 25 Jun 2012 09:46:34 -0400
Subject: [BioC] CQN Package
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available URL: From kasperdanielhansen at gmail.com Mon Jun 25 15:55:46 2012 From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen) Date: Mon, 25 Jun 2012 09:55:46 -0400 Subject: [BioC] CQN Package In-Reply-To: References: Message-ID: On Mon, Jun 25, 2012 at 9:46 AM, Xin Davis wrote: > Dear All, > > I try to use CQN package to normalize RNAseq data for modeling, but > could not find CQN vignette at Bioconductor website. > > The example from the manual is not clear to me (shown below). I should be > able to run the code by replacing montgomery.subset with our dataset, but > it is not the case. > > I assume montgomery.subset is the data set, what is sizeFactors.subset ? > Other pacakge (DESeq, edgeR) will calculate sizeFactors. How about uCovar ? > I Should provide dataset, the package will calculate whatever required by > the package. The explanations are not clear to me. There are many suggested ways to estimate sizeFactors. You have the option of inputting whatever you want. The default will use colSums(data) You need to give the function information about the gc content and the gene lengths. uCovar in the vignette is a matrix with 2 columns (gc content and gene length). What these values are, depend entirely on how you got your count matrix. The uCovar in the package fits together with how the data in the package was computed from the aligned reads, but of course it will most likely not fit with your data. If you want to, you can inspect these objects yourself by loading the package. Kasper > > I would appreciate it if anyone provide guidance on this. > > Thanks, > Xin Davis > > > > > data(montgomery.subset) > data(sizeFactors.subset) > data(uCovar) > cqn.subset <- cqn(montgomery.subset, lengths = uCovar$length, > x = uCovar$gccontent, sizeFactors = sizeFactors.subset, verbose = TRUE) > > ? ? ? 
[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From Christopher.Bare at systemsbiology.org Mon Jun 25 17:20:32 2012
From: Christopher.Bare at systemsbiology.org (Christopher Bare)
Date: Mon, 25 Jun 2012 08:20:32 -0700
Subject: [BioC] 2012 Systems Bioinformatics Workshop
Message-ID:

Hello,

The 2012 Systems Bioinformatics Workshop will be held at the Institute for Systems Biology in Seattle on September 10th & 11th. We invite your participation.

The workshop will be a two day meeting featuring talks, tutorials and a hackathon, bringing together engineers and scientists building software for biological data analysis. Themes for this year's workshop include networks, visualization, and software architecture for collaborative research computing.

Speakers
============
Paramvir Dehal, Computational Research Scientist/Engineer, MicrobesOnline, KBase
Max Franz, Software Engineer and User Interface Designer, Cytoscape Web
Michael Kellen, Director of Technology at Sage Bionetworks
Mike Smoot, Chief Architect of the Cytoscape project
James Taylor, The Galaxy project
Matt Wood, Amazon Web Services

Tutorials
============
Regulatory network inference with cMonkey
Cloud Computing with EC2
Graph data storage with Neo4j
Visualization and D3

More information can be found on the workshop website:
http://gaggle.systemsbiology.net/workshop2012/

From hyao at mdanderson.org Mon Jun 25 17:51:01 2012
From: hyao at mdanderson.org (Yao,Hui)
Date: Mon, 25 Jun 2012 10:51:01 -0500
Subject: [BioC] A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes
Message-ID:

Dear DEXSeq authors and users,

I am using DEXSeq package (1.2.0) to do splicing analysis for human genome.
One part of the results from the analysis is really confusing me. It seems that for many genes the fitted expression, the fitted splicing and the estimated fold changes are inconsistent with (actually reversed from) the normalized counts. To explain the problem clearly, I am showing a simple example with only two genes and 30 samples below.

> load("toBioConductor.RData")
> library(DEXSeq)
> ls()
[1] "testgene"

### For this data set, our interest is to investigate the differential usage of exons between two tissue "type", X and N.
### The samples were collected from two batches, so we need to adjust for the batch effects in the following model.

> formu <- count ~ sample + (exon + from)*type
> testgene <- estimateDispersions(testgene, formula=formu)
> testgene <- fitDispersionFunction(testgene)
> f0 <- count~sample + from*exon + type
> f1 <- count~sample + from*exon + type*I(exon==exonID)
> testgene <- testForDEU(testgene,formula0=f0, formula1=f1)
> testgene <- estimatelog2FoldChanges(testgene,fitExpToVar="type" )
> res <- DEUresultTable(testgene)

In the attached figure, toBioConductor-plotDEXSeq.pdf, for E011, the normalized counts of "N" are clearly smaller than those of "X". However, for both Fitted splicing and Fitted expression, the level of "N" is larger than that of "X". And if we check the estimated fold change as below, the fold change is consistent with the fitted expression and fitted splicing.

> res["ENSG00000001497:011",]
                             geneID exonID dispersion       pvalue    padjust
ENSG00000001497:011 ENSG00000001497   E011  0.4096852 0.0006160211 0.01334712
                    meanBase log2fold(X/N)
ENSG00000001497:011 5.714405     -1.936449

We also explicitly checked the normalized counts for E011 of this gene. The median for each type is shown below; that of "N" is clearly smaller than that of "X".
> dtbl <- data.frame(counts=counts(testgene,normalized=T)["ENSG00000001497:011",],type=pData(testgene)$type)
> with(dtbl,tapply(counts,type,median))
       X        N
8.534784 2.064785

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
 [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
 [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] DEXSeq_1.2.0 Biobase_2.14.0

loaded via a namespace (and not attached):
[1] biomaRt_2.10.0 hwriter_1.3 plyr_1.7.1 RCurl_1.91-1 statmod_1.4.14
[6] stringr_0.6 tools_2.14.0 XML_3.9-4

So, have I made any mistakes in the analysis?

Many thanks in advance,
Hui

Hui Yao, Ph.D.
Principal Statistical Analyst
MD Anderson Cancer Center

-------------- next part --------------
A non-text attachment was scrubbed...
Name: toBioConductor-plotDEXSeq.pdf
Type: application/pdf
Size: 18588 bytes
Desc: toBioConductor-plotDEXSeq.pdf
URL:

From whuber at embl.de Mon Jun 25 20:10:58 2012
From: whuber at embl.de (Wolfgang Huber)
Date: Mon, 25 Jun 2012 20:10:58 +0200
Subject: [BioC] help with multiple testing
In-Reply-To: <4FE8484B.7080006@bii.a-star.edu.sg>
References: <4FE8484B.7080006@bii.a-star.edu.sg>
Message-ID: <4FE8A9B2.1050505@embl.de>

Dear Mike

I'd be surprised if this problem were cracked by a brute force purely 'statistical' approach. You could try to reduce the number of tests by first grouping the genes into 'pathways' or functional modules. With a lot of luck, the data may then just be large enough.

Best wishes
Wolfgang

Jun/25/12 1:15 PM, efthimiosm scripsit::
> Hi all,
>
> My name is Mike and I am a post-doctoral fellow in Bioinformatics. I
> have a question regarding multiple testing p-values adjustment and I
> wonder if someone could give me a piece of advice.
> > I have multiple gene pairs (approximately 8,256) composed by all > possible combinations of 129 genes. For each pair A-B (A different from > B) four values are recorded: number of tumors found in both A and B > (TT), number of tumors only in A (TF), number of tumors only in B (FT), > number of tumors found neither in A nor in B (FF). The data are in the > form of 2x2 contingency tables. E.g. > > Gene 1 Gene 2 TT TF FT FF > g1 g2 5 1 1 27 > g1 g3 4 1 1 28 > g2 g3 4 2 0 28 > ... > ... > ... > > Notice that each gene is paired with all others and thus it is > represented 128 times in this list. I want to find which of the 8,256 > gene pairs (tests) show significant associations between rows (in A, not > in A) and columns (in B, not in B) by Fisher or Barnard test. > Subsequently I have to perform p-value adjustment for multiple testing. > > At 5% I find approximately 500 significant gene pairs but, naturally, > all p-value adjustment procedures I tried (for independent tests: BH, > q-value; for dependent tests: BY, adaptiveBH and BlaRoq from package > "multtest") produce adj. p-values > 0.3. I think that the problem is > that the highly dependent nature of the data (50% of the genes have very > small number of mutations which gives high p-values for all pair they > generate) affects dramatically the adjustment procedure. > > Is there a better way (method) to run the p-values adjustment? > > Do you think if I created multiple lists of gene pairs, where each gene > is represented only once, and then estimate q-value (multiple q-values > for each pair) would be an appropriate solution? 
>
>
> Thank you,
> Mike
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

From vobencha at fhcrc.org Mon Jun 25 20:53:12 2012
From: vobencha at fhcrc.org (Valerie Obenchain)
Date: Mon, 25 Jun 2012 11:53:12 -0700
Subject: [BioC] nearest() for GRanges
In-Reply-To:
References:
Message-ID: <4FE8B398.5070409@fhcrc.org>

This is now fixed in release, Bioc 2.10, GenomicRanges 1.8.7.

Note the behavior of '*' is different from previous behavior (i.e., <= v 1.8.6). Treatment of '*' ranges was one of the aspects we clarified and enforced in the recent update of precede, follow and nearest.

Previously in release '*' was treated as a '+' range,

g <- GRanges("chr1", IRanges(c(1,5,10), c(2,7,12)), "*")
> g
GRanges with 3 ranges and 0 elementMetadata cols:
      seqnames    ranges strand
  [1]     chr1  [ 1,  2]      *
  [2]     chr1  [ 5,  7]      *
  [3]     chr1  [10, 12]      *
  ---
  seqlengths:
   chr1
     NA
> precede(g)
[1]  2  3 NA
> follow(g)
[1] NA  1  2
> nearest(g)
[1] 2 1 2

The new behavior of '*' (in both release and devel) considers both '+' and '-' possibilities. For details see the 'matching by strand' section described in precede() on the man page for ?GRanges.

> precede(g)
[1] 2 1 2
> follow(g)
[1] 2 1 2
> nearest(g)
[1] 2 1 2

Valerie

On 06/22/2012 03:25 PM, Cook, Malcolm wrote:
> Great news, Valerie... thanks very much... I will take immediate advantage
> of this... after re-reading your report of 'an overhaul' I would well
> understand if back-porting your fix in dev to release would be onerous to
> impossible.
>
> I hope it goes quickly and smoothly....
> > Cheers, > > Malcolm > > > On 6/22/12 4:00 PM, "Valerie Obenchain" wrote: > >> On 06/20/2012 05:20 PM, Cook, Malcolm wrote: >>> Hi Valerie, >>> >>> Very glad you found and fixed the root cause. >>> >>> I don't know the overhead it would take for you, but, this being a >>> regression, might you consider fixing in Bioconductor 2.10 as, say >>> GenomicRanges_1.8. >>> >> Yes, I will fix this in release too. If not today then first thing next >> week. >> >> Valerie >>> Thanks for your consideration, >>> >>> Malcolm >>> >>> On 6/20/12 3:13 PM, "Valerie Obenchain" wrote: >>> >>>> Hi Oleg, Malcom, >>>> >>>> Thanks for the bug report. This is now fixed in devel 1.9.28. Over the >>>> past months we've done an overhaul of the precede/follow code in devel. >>>> The new nearest method is based on the new precede and follow and is >>>> documented at >>>> >>>> ?'nearest,GenomicRanges,GenomicRanges-method' >>>> >>>> Let me know if you run into problems. >>>> >>>> Valerie >>>> >>>> >>>> >>>> On 06/18/2012 02:25 PM, Cook, Malcolm wrote: >>>>> Martin, Oleg, Val, all, >>>>> >>>>> I too have script logic that benefitted from and depends upon what the >>>>> behavior of nearest,GenomicRanges,missing as reported by Oleg. >>>>> >>>>> Thanks for the unit tests Martin. >>>>> >>>>> If it helps in sleuthing, in my hands, the 3rd test used to pass (if >>>>> my >>>>> memory serves), but does not pass now, as the attached transcript >>>>> shows. 
>>>>> >>>>> Hoping it helps find a speedy resolution to this issue, >>>>> >>>>> ~ Malcolm Cook >>>>> >>>>> >>>>> >>>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>>> g<- GRanges("chr1", r, "+") >>>>>> checkEquals(precede(r), precede(g)) >>>>> [1] TRUE >>>>>> checkEquals(follow(r), follow(g)) >>>>> [1] TRUE >>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>> Error in checkEquals(nearest(r), nearest(g)) : >>>>> Mean relative difference: 0.6 >>>>> >>>>> >>>>>> sessionInfo() >>>>> R version 2.15.0 (2012-03-30) >>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>>> >>>>> locale: >>>>> [1] C >>>>> >>>>> attached base packages: >>>>> [1] tools splines parallel stats graphics grDevices >>>>> utils >>>>> datasets methods base >>>>> >>>>> other attached packages: >>>>> [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 >>>>> RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 >>>>> bit_1.1-8 evd_2.2-6 ipred_0.8-13 >>>>> prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 >>>>> survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 >>>>> ada_2.0-2 rpart_3.1-53 e1071_1.6 >>>>> class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 >>>>> rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>>>> lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 >>>>> [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 >>>>> Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 >>>>> GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 >>>>> BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 >>>>> svUnit_0.7-10 doParallel_1.0.1 iterators_1.0.6 >>>>> foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 >>>>> RSQLite.extfuns_0.0.1 RSQLite_0.11.1 chron_2.3-42 >>>>> gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 >>>>> functional_0.1 reshape_0.8.4 plyr_1.7.1 >>>>> [55] stringr_0.6 gtools_2.6.2 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 >>>>> bitops_1.0-4.1 >>>>> codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 >>>>> digest_0.5.2 grid_2.15.0 labeling_0.1 memoise_0.1 >>>>> munsell_0.3 reshape2_1.2.1 
scales_0.2.1 stats4_2.15.0 >>>>> tcltk_2.15.0 zlibbioc_1.2.0 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 6/18/12 2:39 PM, "Martin Morgan" wrote: >>>>> >>>>>> Hi Oleg -- >>>>>> >>>>>> On 06/17/2012 11:11 PM, Oleg Mayba wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I just noticed that a piece of logic I was relying on with GRanges >>>>>>> before >>>>>>> does not seem to work anymore. Namely, I expect the behavior of >>>>>>> nearest() >>>>>>> with a single GRanges object as an argument to be the same as that >>>>>>> of >>>>>>> IRanges (example below), but it's not anymore. I expect >>>>>>> nearest(GR1) >>>>>>> NOT >>>>>>> to behave trivially but to return the closest range OTHER than the >>>>>>> range >>>>>>> itself. I could swear that was the case before, but isn't any >>>>>>> longer: >>>>>> I think you're right that there is an inconsistency here; Val will >>>>>> likely help clarify in a day or so. My two cents... >>>>>> >>>>>> I think, certainly, that GRanges on a single chromosome on the "+" >>>>>> strand should behave like an IRanges >>>>>> >>>>>> library(GenomicRanges) >>>>>> library(RUnit) >>>>>> >>>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>>> g<- GRanges("chr1", r, "+") >>>>>> >>>>>> ## first two ok, third should work but fails >>>>>> checkEquals(precede(r), precede(g)) >>>>>> checkEquals(follow(r), follow(g)) >>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>> >>>>>> Also, on the "-" strand I think we're expecting >>>>>> >>>>>> g<- GRanges("chr1", r, "-") >>>>>> >>>>>> ## first two ok, third should work but fails >>>>>> checkEquals(follow(r), precede(g)) >>>>>> checkEquals(precede(r), follow(g)) >>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>> >>>>>> For "*" (which was your example) I think the situation is (a) >>>>>> different >>>>>> in devel than in release; and (b) not so clear. 
In devel, "*" is >>>>>> (from >>>>>> method?"nearest,GenomicRanges,missing") "x on '*' strand can match to >>>>>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think >>>>>> these >>>>>> are always true: >>>>>> >>>>>> checkEquals(precede(g), follow(g)) >>>>>> checkEquals(nearest(r), follow(g)) >>>>>> >>>>>> I would also expect >>>>>> >>>>>> try(checkEquals(nearest(g), follow(g))) >>>>>> >>>>>> though this seems not to be the case. In 'release', "*" is coereced >>>>>> and >>>>>> behaves as if on the "+" strand (I think). >>>>>> >>>>>> Martin >>>>>> >>>>>>>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>>>>>>> z >>>>>>> IRanges of length 3 >>>>>>> start end width >>>>>>> [1] 1 2 2 >>>>>>> [2] 5 7 3 >>>>>>> [3] 10 12 3 >>>>>>>> nearest(z) >>>>>>> [1] 2 1 2 >>>>>>>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >>>>>>> end=c(2,7,12))) >>>>>>>> z >>>>>>> GRanges with 3 ranges and 0 elementMetadata cols: >>>>>>> seqnames ranges strand >>>>>>> >>>>>>> [1] chr1 [ 1, 2] * >>>>>>> [2] chr1 [ 5, 7] * >>>>>>> [3] chr1 [10, 12] * >>>>>>> --- >>>>>>> seqlengths: >>>>>>> chr1 >>>>>>> NA >>>>>>>> nearest(z) >>>>>>> [1] 1 2 3 >>>>>>>> sessionInfo() >>>>>>> R version 2.15.0 (2012-03-30) >>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] datasets utils grDevices graphics stats methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] stats4_2.15.0 >>>>>>> >>>>>>> >>>>>>> I want the IRanges behavior and not what seems currently to be the >>>>>>> GRanges >>>>>>> 
behavior, since I have some code that depends on it. Is there a >>>>>>> quick >>>>>>> way >>>>>>> to make nearest() do that for me again? >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> Oleg. >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> -- >>>>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>>>> 1100 Fairview Ave. N. >>>>>> PO Box 19024 Seattle, WA 98109 >>>>>> >>>>>> Location: Arnold Building M1 B861 >>>>>> Phone: (206) 667-2793 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor From sorokin at wisc.edu Mon Jun 25 21:15:11 2012 From: sorokin at wisc.edu (Elena Sorokin) Date: Mon, 25 Jun 2012 14:15:11 -0500 Subject: [BioC] interactions between variables in DEXSeq Message-ID: <4FE8B8BF.1050002@wisc.edu> Hello again, I'm writing this time to ask about setting up more complex formulae to test for significant interaction between independent variables in DEXSeq. Using the example from the manual, to test for an interaction between library type and condition, how would I set this up? The syntax here is a bit more involved than with DESeq, and I can't seem to find anything in the archives that answers my question... 
Thanks,
Elena

From vobencha at fhcrc.org Mon Jun 25 21:22:52 2012
From: vobencha at fhcrc.org (Valerie Obenchain)
Date: Mon, 25 Jun 2012 12:22:52 -0700
Subject: [BioC] Bioc2012 : July 24-25
Message-ID: <4FE8BA8C.3030701@fhcrc.org>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From fongchun at interchange.ubc.ca Mon Jun 25 22:30:23 2012
From: fongchun at interchange.ubc.ca (Fong Chun Chan)
Date: Mon, 25 Jun 2012 13:30:23 -0700
Subject: [BioC] Extracting a .CEL file from an AffyBatch Object
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From Kaat.DeCremer at biw.kuleuven.be Mon Jun 25 22:43:55 2012
From: Kaat.DeCremer at biw.kuleuven.be (Kaat De Cremer)
Date: Mon, 25 Jun 2012 20:43:55 +0000
Subject: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates
In-Reply-To:
References:
Message-ID: <3D4A97F14E343F4584925219C1C1ACEF05B965F0@ICTS-S-MBX7.luna.kuleuven.be>

Again, thank you both very much for your reply.

I have now analyzed my time course data in several different ways and noticed some differences:

1) When I analyze the data of all time points together and look for DE genes at one time point, I find fewer DE genes than when I use only the data of that one time point in edgeR. I assume this is because the dispersion is larger when I include all the different time points at once? In that case, is this the right way to go? I would think the dispersion can be larger at one time point than at another.

2) I noticed in the edgeR user's guide that in some examples you correct the library size after filtering for lowly expressed genes, and in other examples you don't. Correcting the library size gives fewer DE genes for my data at all time points when I analyze all data together and then look for DE genes at each time point. I don't see this difference when I only look at the data of one time point in edgeR.
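To make point 2 concrete, here is a minimal base-R sketch of what recomputing the library sizes after filtering amounts to. The count matrix is made up for illustration (all values and the names `counts`, `keep`, `lib_size_before` and `lib_size_after` are hypothetical), and the filter is a plain CPM rule of the form "CPM > 2 in at least 3 libraries". It does not use edgeR itself, so it only shows the arithmetic, not the package's own interface.

```r
# Made-up counts: 5 genes x 6 libraries (hypothetical values).
counts <- matrix(
  c(      0,       1,       1,       2,       0,       2,
          5,       4,       0,       5,       6,       6,
        510,     223,     228,     287,     222,     268,
    1000000, 1000000, 1000000, 1000000, 1000000, 1000000,
          1,       0,       1,       0,       2,       0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("lib", 1:6))
)

# Library sizes as column totals of the unfiltered counts.
lib_size_before <- colSums(counts)

# Counts per million, computed with the original library sizes.
cpm <- sweep(counts, 2, lib_size_before, "/") * 1e6

# Keep genes with CPM > 2 in at least 3 libraries.
keep <- rowSums(cpm > 2) >= 3
filtered <- counts[keep, , drop = FALSE]

# "Correcting" the library size: column totals of the filtered counts.
lib_size_after <- colSums(filtered)

# Reads removed from each library by the filter.
lib_size_before - lib_size_after
```

In this toy example the two sets of library sizes differ only by the handful of reads belonging to the dropped genes; whether that difference matters for a real data set depends on how many reads the filtered genes carry.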
I hope you can comment on this,

Thank you,
Kaat

-----Original Message-----
From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU]
Sent: Sunday 24 June 2012 2:42
To: Kaat De Cremer
Cc: Bioconductor mailing list; Mark Robinson
Subject: design matrix edge R pairwise comparison at different time points after infection with replicates

Hi Kaat,

I'll jump in and continue on from Mark's help.

To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. I'll assume that your "tests" are independent replicates of the whole experiment. If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be:

design <- model.matrix(~test+time+time:treat)

This produces a design matrix with the following columns:

> colnames(design)
[1] "(Intercept)"      "test2"            "test3"            "time24hpi"
[5] "time48hpi"        "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT"

So testing for treatment effects at each time is easy. To test for treatment effect at time 12h:

fit <- glmFit(y, design)
lrt <- glmLRT(y, fit, coef="time12hpi:treatT")

etc. To test for treatment effect at time 24h:

lrt <- glmLRT(y, fit, coef="time24hpi:treatT")

and so on.

Best wishes
Gordon

> Date: Fri, 22 Jun 2012 13:11:41 +0000
> From: Kaat De Cremer
> To: Mark Robinson
> Cc: bioconductor list
> Subject: Re: [BioC] design matrix edge R pairwise comparison at
> different time points after infection with replicates
>
> Hi Mark,
> Thank you for your suggestion,
> I really appreciate your time.
>
> Working in R is new to me so it has been a struggle using edgeR, but I
> think I managed it using only 2 factors (test and treatment). Now that
> I will be including 3 factors (test, treatment and time) in one
> analysis it is clear to me that I still don't understand how it works exactly.
> Below you can see my workspace with the only design matrix I could > come up with, but I don't see which coefficients I should include or > which contrast vector to use in the glmLRT function to make the > comparison of control-treatment at each time point separate, ignoring > the other 2 time points. Is this possible with this design matrix? Or > is the matrix wrong for this purpose? > > > Thanks! > Kaat > > >> head(x) > 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 > Lsa000001.1 0 1 1 2 0 2 1 1 > Lsa000002.1 5 4 0 5 6 6 6 4 > Lsa000003.1 10 9 7 5 5 8 6 2 > Lsa000004.1 1 1 1 1 1 1 1 3 > Lsa000005.1 1 0 1 0 2 0 0 1 > Lsa000006.1 510 223 228 287 222 268 303 358 > 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 > Lsa000001.1 0 1 1 0 0 0 0 2 > Lsa000002.1 7 5 2 5 10 6 12 12 > Lsa000003.1 7 5 4 2 6 5 8 2 > Lsa000004.1 1 3 1 2 1 3 2 3 > Lsa000005.1 0 1 0 0 1 0 0 2 > Lsa000006.1 372 362 237 320 472 440 411 858 > 48hpi T2 48hpi T3 > Lsa000001.1 0 0 > Lsa000002.1 1 5 > Lsa000003.1 1 0 > Lsa000004.1 0 2 > Lsa000005.1 1 0 > Lsa000006.1 375 275 >> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C"," >> C","C","T","T","T")) >> test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) > time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi" > ,"24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hp > i","48hpi","48hpi")) >> y<-DGEList(counts=x,group=treat) > Calculating library sizes from column totals. 
>> cpm.y<-cpm(y) >> y<-y[rowSums(cpm.y>2)>=3,] >> y<-calcNormFactors(y) > design<-model.matrix(~test+treat+time) >> design > (Intercept) test2 test3 treatT time24hpi time48hpi > 1 1 0 0 0 0 0 > 2 1 1 0 0 0 0 > 3 1 0 1 0 0 0 > 4 1 0 0 1 0 0 > 5 1 1 0 1 0 0 > 6 1 0 1 1 0 0 > 7 1 0 0 0 1 0 > 8 1 1 0 0 1 0 > 9 1 0 1 0 1 0 > 10 1 0 0 1 1 0 > 11 1 1 0 1 1 0 > 12 1 0 1 1 1 0 > 13 1 0 0 0 0 1 > 14 1 1 0 0 0 1 > 15 1 0 1 0 0 1 > 16 1 0 0 1 0 1 > 17 1 1 0 1 0 1 > 18 1 0 1 1 0 1 > attr(,"assign") > [1] 0 1 1 2 3 3 > attr(,"contrasts") > attr(,"contrasts")$test > [1] "contr.treatment" > > attr(,"contrasts")$treat > [1] "contr.treatment" > > attr(,"contrasts")$time > [1] "contr.treatment" > >> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) > Disp = 0.07299 , BCV = 0.2702 >> y<-estimateGLMTrendedDisp(y,design) > Loading required package: splines >> y<-estimateGLMTagwiseDisp(y,design) > Warning message: > In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : > max iterations exceeded >> fit<-glmFit(y,design) > > > > > > > > -----Original Message----- > From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch] > Sent: vrijdag 22 juni 2012 12:03 > To: Kaat De Cremer > Cc: bioconductor list > Subject: Re: [BioC] design matrix edge R pairwise comparison at > different time points after infection with replicates > > Hi Kaat, > > It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. > >> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design. 
Some decisions might need to be made about interactions to include. > > Hope that gets you started. > > Best, > Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics > Institute of Molecular Life Sciences > University of Zurich > Winterthurerstrasse 190 > 8057 Zurich > Switzerland > > v: +41 44 635 4848 > f: +41 44 635 6898 > e: mark.robinson at imls.uzh.ch > o: Y11-J-16 > w: http://tiny.cc/mrobin > > ---------- > http://www.fgcz.ch/Bioconductor2012 > > On 21.06.2012, at 11:42, Kaat De Cremer wrote: > >> Dear all, >> >> >> I am using edgeR to find genes differentially expressed between >> infected and mock-infected control plants, at 3 time points after >> infection. >> I have RNAseq data for 3 independent tests, so for every single test >> I have 6 libraries (control + infected at 3 time points). >> Having three replicates this makes 18 libraries in total. >> >> What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: >> >>> head(x) >> C1 C2 C3 T1 T2 T3 >> 1 0 1 2 0 0 0 >> 2 13 6 4 10 8 12 >> 3 17 16 9 10 8 11 >> 4 2 1 2 2 3 2 >> 5. 1 3 1 2 1 3 0 >> 6 958 457 438 565 429 518 >> >>> treatment<-factor(c("C","C","C","T","T","T")) >>> test<-factor(c(1,2,3,1,2,3)) >>> y<-DGEList(counts=x,group=treatment) >> Calculating library sizes from column totals. >>> cpm.y<-cpm(y) >>> y<-y[rowSums(cpm.y>2)>=3,] >>> y<-calcNormFactors(y) >>> design<-model.matrix(~test+treat) >>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >> Disp = 0.0265 , BCV = 0.1628 >>> y<-estimateGLMTrendedDisp(y,design) >> Loading required package: splines >>> y<-estimateGLMTagwiseDisp(y,design) >>> fit<-glmFit(y,design) >>> lrt<-glmLRT(y,fit) >> >> >> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? >> Unfortunately I cannot figure out how to design the matrix. 
>> >> I hope someone can help me, >> >> Kaat >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}} From jmacdon at uw.edu Mon Jun 25 23:33:23 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Mon, 25 Jun 2012 17:33:23 -0400 Subject: [BioC] Extracting a .CEL file from an AffyBatch Object In-Reply-To: References: Message-ID: <4FE8D923.5040002@uw.edu> Hi Fong, On 6/25/2012 4:30 PM, Fong Chun Chan wrote: > Hi, > > I have a rather strange task that needs to be done due to an unfortunate > situation. I have an AffyBatch object and I was wondering if it was > possible to extract the raw .CEL files from this AffyBatch file? Any sort > of general idea of how to do it would be greatly helpful. I believe you can do this using the affxparser package. I would look at createCel and updateCel in particular. Best, Jim > > Thanks, > > Fong > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From MEC at stowers.org Mon Jun 25 23:53:39 2012 From: MEC at stowers.org (Cook, Malcolm) Date: Mon, 25 Jun 2012 16:53:39 -0500 Subject: [BioC] nearest() for GRanges In-Reply-To: <4FE8B398.5070409@fhcrc.org> Message-ID: Hi Valerie, Indeed good news. However, I am finding that this newest version is not yet available view biocLite from repository at bioconductor.org. I am still picking up 1.8.6 with biocLite('GenomicRanges'). Should I expect to wait, or perhaps is there a 'push' at your end that needs attending? Please advise if I'm expecting it to appear before its time ;) Thanks! 
Malcolm On 6/25/12 1:53 PM, "Valerie Obenchain" wrote: >This is now fixed in release, BiocC 2.10, GenomicRanges 1.8.7. > >Note the behavior of '*' is different from previous behavior (i.e., <= v >1.8.6). Treatment of '*' ranges was one of the aspects we clarified and >enforced in the the recent update of precede, follows and nearest. > >Previously in release '*' was treated as a '+' range, > >g <- GRanges("chr1", IRanges(c(1,5,10), c(2,7,12)), "*") > > g >GRanges with 3 ranges and 0 elementMetadata cols: > seqnames ranges strand > > [1] chr1 [ 1, 2] * > [2] chr1 [ 5, 7] * > [3] chr1 [10, 12] * > --- > seqlengths: > chr1 > NA > > precede(g) >[1] 2 3 NA > > follow(g) >[1] NA 1 2 > > nearest(g) >[1] 2 1 2 > > >The new behavior of '*' (in both release and devel) considers both '+' >and '-' possibilities. For details see the 'matching by strand' section >described in precede() on the man page for ?GRanges. > > > precede(g) >[1] 2 1 2 > > follow(g) >[1] 2 1 2 > > nearest(g) >[1] 2 1 2 > > >Valerie > >On 06/22/2012 03:25 PM, Cook, Malcolm wrote: >> Great news, Valerie... thanks very much... I will take immediate >>advantage >> of this... after re-reading your report of 'an overhaul' I would well >> understand if back-porting your fix in dev to release would be onerous >>to >> impossible. >> >> I hope it goes quickly and smoothly.... >> >> Cheers, >> >> Malcolm >> >> >> On 6/22/12 4:00 PM, "Valerie Obenchain" wrote: >> >>> On 06/20/2012 05:20 PM, Cook, Malcolm wrote: >>>> Hi Valerie, >>>> >>>> Very glad you found and fixed the root cause. >>>> >>>> I don't know the overhead it would take for you, but, this being a >>>> regression, might you consider fixing in Bioconductor 2.10 as, say >>>> GenomicRanges_1.8. >>>> >>> Yes, I will fix this in release too. If not today then first thing next >>> week. 
>>> >>> Valerie >>>> Thanks for your consideration, >>>> >>>> Malcolm >>>> >>>> On 6/20/12 3:13 PM, "Valerie Obenchain" wrote: >>>> >>>>> Hi Oleg, Malcom, >>>>> >>>>> Thanks for the bug report. This is now fixed in devel 1.9.28. Over >>>>>the >>>>> past months we've done an overhaul of the precede/follow code in >>>>>devel. >>>>> The new nearest method is based on the new precede and follow and is >>>>> documented at >>>>> >>>>> ?'nearest,GenomicRanges,GenomicRanges-method' >>>>> >>>>> Let me know if you run into problems. >>>>> >>>>> Valerie >>>>> >>>>> >>>>> >>>>> On 06/18/2012 02:25 PM, Cook, Malcolm wrote: >>>>>> Martin, Oleg, Val, all, >>>>>> >>>>>> I too have script logic that benefitted from and depends upon what >>>>>>the >>>>>> behavior of nearest,GenomicRanges,missing as reported by Oleg. >>>>>> >>>>>> Thanks for the unit tests Martin. >>>>>> >>>>>> If it helps in sleuthing, in my hands, the 3rd test used to pass (if >>>>>> my >>>>>> memory serves), but does not pass now, as the attached transcript >>>>>> shows. 
>>>>>> >>>>>> Hoping it helps find a speedy resolution to this issue, >>>>>> >>>>>> ~ Malcolm Cook >>>>>> >>>>>> >>>>>> >>>>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>>>> g<- GRanges("chr1", r, "+") >>>>>>> checkEquals(precede(r), precede(g)) >>>>>> [1] TRUE >>>>>>> checkEquals(follow(r), follow(g)) >>>>>> [1] TRUE >>>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>> Error in checkEquals(nearest(r), nearest(g)) : >>>>>> Mean relative difference: 0.6 >>>>>> >>>>>> >>>>>>> sessionInfo() >>>>>> R version 2.15.0 (2012-03-30) >>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] C >>>>>> >>>>>> attached base packages: >>>>>> [1] tools splines parallel stats graphics grDevices >>>>>> utils >>>>>> datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 >>>>>> RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 >>>>>> bit_1.1-8 evd_2.2-6 ipred_0.8-13 >>>>>> prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 >>>>>> survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 >>>>>> ada_2.0-2 rpart_3.1-53 e1071_1.6 >>>>>> class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 >>>>>> rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>>>>> lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 >>>>>> [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 >>>>>> Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 >>>>>> GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 >>>>>> BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 >>>>>> svUnit_0.7-10 doParallel_1.0.1 iterators_1.0.6 >>>>>> foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 >>>>>> RSQLite.extfuns_0.0.1 RSQLite_0.11.1 chron_2.3-42 >>>>>> gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 >>>>>> functional_0.1 reshape_0.8.4 plyr_1.7.1 >>>>>> [55] stringr_0.6 gtools_2.6.2 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 >>>>>> bitops_1.0-4.1 >>>>>> codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 >>>>>> digest_0.5.2 grid_2.15.0 
labeling_0.1 memoise_0.1 >>>>>> munsell_0.3 reshape2_1.2.1 scales_0.2.1 stats4_2.15.0 >>>>>> tcltk_2.15.0 zlibbioc_1.2.0 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 6/18/12 2:39 PM, "Martin Morgan" wrote: >>>>>> >>>>>>> Hi Oleg -- >>>>>>> >>>>>>> On 06/17/2012 11:11 PM, Oleg Mayba wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> I just noticed that a piece of logic I was relying on with GRanges >>>>>>>> before >>>>>>>> does not seem to work anymore. Namely, I expect the behavior of >>>>>>>> nearest() >>>>>>>> with a single GRanges object as an argument to be the same as that >>>>>>>> of >>>>>>>> IRanges (example below), but it's not anymore. I expect >>>>>>>> nearest(GR1) >>>>>>>> NOT >>>>>>>> to behave trivially but to return the closest range OTHER than the >>>>>>>> range >>>>>>>> itself. I could swear that was the case before, but isn't any >>>>>>>> longer: >>>>>>> I think you're right that there is an inconsistency here; Val will >>>>>>> likely help clarify in a day or so. My two cents... >>>>>>> >>>>>>> I think, certainly, that GRanges on a single chromosome on the "+" >>>>>>> strand should behave like an IRanges >>>>>>> >>>>>>> library(GenomicRanges) >>>>>>> library(RUnit) >>>>>>> >>>>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>>>> g<- GRanges("chr1", r, "+") >>>>>>> >>>>>>> ## first two ok, third should work but fails >>>>>>> checkEquals(precede(r), precede(g)) >>>>>>> checkEquals(follow(r), follow(g)) >>>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>>> >>>>>>> Also, on the "-" strand I think we're expecting >>>>>>> >>>>>>> g<- GRanges("chr1", r, "-") >>>>>>> >>>>>>> ## first two ok, third should work but fails >>>>>>> checkEquals(follow(r), precede(g)) >>>>>>> checkEquals(precede(r), follow(g)) >>>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>>> >>>>>>> For "*" (which was your example) I think the situation is (a) >>>>>>> different >>>>>>> in devel than in release; and (b) not so clear. 
In devel, "*" is >>>>>>> (from >>>>>>> method?"nearest,GenomicRanges,missing") "x on '*' strand can match >>>>>>>to >>>>>>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think >>>>>>> these >>>>>>> are always true: >>>>>>> >>>>>>> checkEquals(precede(g), follow(g)) >>>>>>> checkEquals(nearest(r), follow(g)) >>>>>>> >>>>>>> I would also expect >>>>>>> >>>>>>> try(checkEquals(nearest(g), follow(g))) >>>>>>> >>>>>>> though this seems not to be the case. In 'release', "*" is coereced >>>>>>> and >>>>>>> behaves as if on the "+" strand (I think). >>>>>>> >>>>>>> Martin >>>>>>> >>>>>>>>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>>>>>>>> z >>>>>>>> IRanges of length 3 >>>>>>>> start end width >>>>>>>> [1] 1 2 2 >>>>>>>> [2] 5 7 3 >>>>>>>> [3] 10 12 3 >>>>>>>>> nearest(z) >>>>>>>> [1] 2 1 2 >>>>>>>>> z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >>>>>>>> end=c(2,7,12))) >>>>>>>>> z >>>>>>>> GRanges with 3 ranges and 0 elementMetadata cols: >>>>>>>> seqnames ranges strand >>>>>>>> >>>>>>>> [1] chr1 [ 1, 2] * >>>>>>>> [2] chr1 [ 5, 7] * >>>>>>>> [3] chr1 [10, 12] * >>>>>>>> --- >>>>>>>> seqlengths: >>>>>>>> chr1 >>>>>>>> NA >>>>>>>>> nearest(z) >>>>>>>> [1] 1 2 3 >>>>>>>>> sessionInfo() >>>>>>>> R version 2.15.0 (2012-03-30) >>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] datasets utils grDevices graphics stats methods >>>>>>>>base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] stats4_2.15.0 >>>>>>>> >>>>>>>> >>>>>>>> I want 
the IRanges behavior and not what seems currently to be the GRanges behavior, since I have some code that depends on it. Is there a quick way to make nearest() do that for me again?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Oleg.
>>>>>>>>
>>>>>>> --
>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N.
>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>> Location: Arnold Building M1 B861
>>>>>>> Phone: (206) 667-2793
>

From mgarciao at ufl.edu Tue Jun 26 00:00:30 2012
From: mgarciao at ufl.edu (Garcia Orellana,Miriam)
Date: Mon, 25 Jun 2012 22:00:30 +0000
Subject: [BioC] 2 different factorial analysis codes in LIMMA give different logFC but same values for other components of TOPTABLE
Message-ID: <7F10E9EDBB347E4CA0765A3139C110BB14F9BA33@UFEXCH-MBXN01.ad.ufl.edu>

Dear Dr. Smyth and all:

I am sorry to bother you with this matter. My understanding of microarray analysis is really basic, and I am having a long, hard time finishing the analysis of my data.

Now I can't figure out what is happening with the two models I am applying to evaluate my data when I use the TOPTABLE option to request the values for all of the filtered genes (8026). Both models (A and B) give me, for each of my 5 contrasts, a top table with the same numerical values for AveExp, t, Pvalue, adjPvalue and B. However, the logFC for contrasts 1, 2 and 3 in model B is exactly half of that in model A, while the logFC for contrasts 4 and 5 in model B is exactly one fourth of that in model A.

Because of this difference in logFC, I am getting different numbers of differentially expressed genes when using a cut-off of adjPvalue lower than or equal to 0.05 and a raw logFC greater than or equal to 1.5. For example, for contrast 3 (MR effect) I get 47 up- and 84 down-regulated genes with model A, while with model B I get only 18 up- and only 2 down-regulated genes under the same cut-offs. How can that be possible if all the other values are the same? And which model should I follow?

Briefly, my data come from a factorial design of 3 dam diets (DD: CTL, SFA, EFA) and 2 milk replacers (MR: LLA, HLA). I have three replicates for each diet-by-replacer combination, for a total of 18 arrays. The data were filtered for informative/non-informative probes and plotted for array quality, so from an initial 24118 bovine probes I end up with 8026 probes.

My interest is to compare:

1. Feeding FAT at prepartum = (SFA + EFA) vs CTL, with CTL as ref
2. Feeding EFA prepartum = EFA vs SFA, with SFA as ref
3. Feeding MR to calves = HLA vs LLA, with LLA as reference
4. Interaction of feeding FAT by MR: (SFA + EFA) vs CTL by MR, with (SFA + EFA) vs CTL by LLA as ref
5. Interaction of feeding EFA by MR: EFA vs SFA by MR, with EFA vs SFA by LLA as ref

MODEL A (I created this following the LIMMA user guide for a factorial design):

  TS <- paste(phenoDiet$DD, phenoDiet$MR, sep=".")
  TS
  TS <- factor(TS, levels=c("Ctl.LLA", "Ctl.HLA", "SFA.LLA", "SFA.HLA", "EFA.LLA", "EFA.HLA"))
  design <- model.matrix(~0+TS)
  colnames(design) <- levels(TS)
  fit <- lmFit(eset2, design, method="robust", maxit=1000)
  efit <- eBayes(fit)
  #Contrast results
  MatContrast=makeContrasts(
    FAT=(SFA.LLA + SFA.HLA + EFA.LLA + EFA.HLA)/4 - (Ctl.LLA + Ctl.HLA)/2,
    FA=(EFA.LLA + EFA.HLA)/2 - (SFA.LLA + SFA.HLA)/2,
    MR=(EFA.HLA+SFA.HLA+Ctl.HLA)/3 - (EFA.LLA+SFA.LLA+Ctl.LLA)/3,
    FATbyMR=((EFA.HLA+SFA.HLA)/2 - Ctl.HLA) - ((EFA.LLA+SFA.LLA)/2-Ctl.LLA),
    FAbyMR=(EFA.HLA-SFA.HLA)-(EFA.LLA - SFA.LLA),
    levels=design)
  fitMat<-contrasts.fit(fit,MatContrast)
  Contrast.eBayes=eBayes(fitMat)

MODEL B (this model was kindly provided by Dr G. Smyth):

  DD <- factor(phenoDiet$DD, levels = c("Ctl", "SFA", "EFA"))
  MR <- factor(phenoDiet$MR, levels = c("LLA", "HLA"))
  contrasts(DD) <- cbind(SFAEFAvsCtl=c(-2,1,1), EFAvsSFA=c(0,-1,1))
  contrasts(MR) <- c(-1,1)
  design <- model.matrix(~DD*MR)
  design
  fit <- lmFit(eset2, design, method="robust", maxit=1000)
  efit <- eBayes(fit)

Thanks so much in advance,
Miriam

From mail.yong.li at googlemail.com Tue Jun 26 00:03:10 2012
From: mail.yong.li at googlemail.com (Yong Li)
Date: Tue, 26 Jun 2012 00:03:10 +0200
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID:

Dear Aaron,

thank you and others for the suggestions. My data are really ratios and not absolute values for normal and tumor. Sorry, but I am still not quite sure how to move forward with limma once I take log2 of the ratios. It looks like I will then have the M component of the MAList, but how can I construct the A to make an MAList? Or am I missing something here?
Kind regards,
Yong

On Tue, Jun 19, 2012 at 11:09 PM, Aaron Mackey wrote:
> There's a thread on the bioconductor mailing list about using voom for
> RSEM-based RNA-seq quantification, in which Gordon Smyth explained that
> while voom() was designed for count data, it doesn't require it. As Tim
> Triche has suggested, if your raw data is really ratios (and not absolute
> values for normal and tumor), then you should take log2 of those ratios and
> use limma from there; you can then also hijack the arrayQualityMetrics
> package to check QC (MA plots, mean-variance relationships, etc.)
>
> -Aaron
>
> On Tue, Jun 19, 2012 at 3:39 PM, Yong Li wrote:
>>
>> Dear Aaron,
>>
>> thank you for your quick answer! I have checked the help page of
>> voom() but it seems to be used for count data. My data are just
>> tumor/normal ratios. I am wondering if you could provide more details?
>>
>> Best regards,
>> Yong
>>
>> On Tue, Jun 19, 2012 at 8:18 PM, Aaron Mackey wrote:
>> > yes, it should be possible with a voom()-based analysis to get the
>> > variances "right".
>> >
>> > -Aaron
>> >
>> > On Tue, Jun 19, 2012 at 12:47 PM, Yong Li wrote:
>> >>
>> >> Hello,
>> >>
>> >> limma has been so valuable in microarray data analysis, but has anyone
>> >> used limma for finding differentially expressed proteins from
>> >> quantitative proteomics data? The data I got has tumor/normal ratios
>> >> of thousands of proteins, and both tumor and normal have a number of
>> >> replicates. Could such data be analyzed with limma?
>> >>
>> >> If limma can not be used here, what statistics method is suitable so
>> >> that we can get statistically significant proteins with p-values? Any
>> >> suggestion is appreciated.
>> >>
>> >> Kind regards,
>> >> Yong

From mail.yong.li at googlemail.com Tue Jun 26 00:05:22 2012
From: mail.yong.li at googlemail.com (Yong Li)
Date: Tue, 26 Jun 2012 00:05:22 +0200
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To: <4FE16930.7090403@embl.de>
References: <4FE16930.7090403@embl.de>
Message-ID:

Dear Bernd,

thank you for your answer and the paper. The paper mentions that the R script used for the analysis is in a Bioconductor data package named mRNAinteractomeHeLa; however, I couldn't find it on the Bioc web site. The paper is new, so probably I should wait for the next Bioc release?

Best regards,
Yong

On Wed, Jun 20, 2012 at 8:09 AM, Bernd Fischer wrote:
> Dear Yong!
>
> I used limma for ion count data. First I computed log-ratios per peptide and
> then summarized log-ratios per protein. Protein log-ratios were then
> analyzed by limma.
> Have a look at our paper:
> Castello, Fischer, et al., Insights into RNA Biology from an Atlas of
> Mammalian mRNA-Binding Proteins, Cell, 2012
>
> Best,
> Bernd
>
>
> On 06/19/2012 06:47 PM, Yong Li wrote:
>>
>> Hello,
>>
>> limma has been so valuable in microarray data analysis, but has anyone
>> used limma for finding differentially expressed proteins from
>> quantitative proteomics data? The data I got has tumor/normal ratios
>> of thousands of proteins, and both tumor and normal have a number of
>> replicates.
Could such data be analyzed with limma?
>>
>> If limma can not be used here, what statistics method is suitable so
>> that we can get statistically significant proteins with p-values? Any
>> suggestion is appreciated.
>>
>> Kind regards,
>> Yong

From dtenenba at fhcrc.org Tue Jun 26 00:09:58 2012
From: dtenenba at fhcrc.org (Dan Tenenbaum)
Date: Mon, 25 Jun 2012 15:09:58 -0700
Subject: [BioC] nearest() for GRanges
In-Reply-To:
References: <4FE8B398.5070409@fhcrc.org>
Message-ID:

On Mon, Jun 25, 2012 at 2:53 PM, Cook, Malcolm wrote:
> Hi Valerie,
>
> Indeed good news.
>
> However, I am finding that this newest version is not yet available via
> biocLite from repository at bioconductor.org. I am still picking up 1.8.6
> with biocLite('GenomicRanges').
>
> Should I expect to wait, or perhaps is there a 'push' at your end that
> needs attending?
>
> Please advise if I'm expecting it to appear before its time ;)

Our build cycle runs once a day, so expect to see the next version
tomorrow morning around 10AM Seattle time. If you want to get it
before then, you can check it out from the svn repository.

Thanks,
Dan

>
> Thanks!
>
> Malcolm

From guest at bioconductor.org Tue Jun 26 00:19:36 2012
From: guest at bioconductor.org (Grant Izmirlian [guest])
Date: Mon, 25 Jun 2012 15:19:36 -0700 (PDT)
Subject: [BioC] Quality Diagnostics of Affy Arrays using PLM
Message-ID: <20120625221936.90BB9134515@mamba.fhcrc.org>

Hi:
I have been following examples listed in section 3.5.1 of "Bioinformatics and Computational Biology using R and Bioconductor", which deals with quality diagnostics of affy arrays using PLM.
I am trying to produce a composite plot displaying per chip residuals from the PLM model using my own data. Following the example, starting with the AffyBatch object, MyDat.AffyBatch, which contains 40 arrays,

MyDat.plm <- fitPLM(MyDat.AffyBatch)
par(mfrow=c(4,10))
image(MyDat.plm, type="resids", which=1)
image(MyDat.plm, type="resids", which=2)
image(MyDat.plm, type="resids", which=3)
.
.
.
image(MyDat.plm, type="resids", which=40)

The problem is that the par(mfrow=c(4,10)) is ignored and I get
40 new plots. I tried setting 'add=TRUE' to the argument list above--still no luck.

The example in the text makes it appear that this works. What's going on?

 -- output of sessionInfo():

R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133plus2cdf_2.9.1  AnnotationDbi_1.16.10 limma_3.10.0
[4] affyPLM_1.30.0        preprocessCore_1.16.0 gcrma_2.26.0
[7] affy_1.32.0           Biobase_2.14.0

loaded via a namespace (and not attached):
 [1] affyio_1.22.0       BiocInstaller_1.2.1 Biostrings_2.22.0
 [4] DBI_0.2-5           IRanges_1.12.5      RSQLite_0.11.1
 [7] splines_2.14.0      tcltk_2.14.0        tools_2.14.0
[10] zlibbioc_1.0.0

--
Sent via the guest posting facility at bioconductor.org.
From izmirlig at mail.nih.gov Tue Jun 26 00:40:33 2012 From: izmirlig at mail.nih.gov (Grant Izmirlian) Date: Mon, 25 Jun 2012 18:40:33 -0400 Subject: [BioC] Quality Diagnostics of Affy Arrays using PLM In-Reply-To: <20120625221936.90BB9134515@mamba.fhcrc.org> References: <20120625221936.90BB9134515@mamba.fhcrc.org> Message-ID: <3557213.lnbuKIhuFO@omega1> Just testing this out to see how replies work On Monday, June 25, 2012 06:19:36 PM you wrote: > Hi: > I have been following examples listed in section 3.5.1 of "Bioinformatics > and Computational Biology using R and Bioconductor", which deals with > quality diagnostics of affy arrays using PLM. I am trying to produce a > composite plot displaying per chip residuals from the PLM model using my > own data. Following the example, starting with the AffyBatch object, > MyDat.AffyBatch, which contains 40 arrays, > > MyDat.plm <- fitPLM(MyDat.AffyBatch) > par(mfrow=c(4,10)) > image(MyDat.plm, type="resids", which=1) > image(MyDat.plm, type="resids", which=2) > image(MyDat.plm, type="resids", which=3) > . > . > . > image(MyDat.plm, type="resids", which=40) > > The problem is that the par(mfrow=c(4,10)) is ignored and I get > 40 new plots. I tried setting 'add=TRUE' to the argument list above--still > no luck. > > The example in the text makes it appear that this works. What's going on? 
> > > > > > -- output of sessionInfo(): > > R version 2.14.0 (2011-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] hgu133plus2cdf_2.9.1 AnnotationDbi_1.16.10 limma_3.10.0 > [4] affyPLM_1.30.0 preprocessCore_1.16.0 gcrma_2.26.0 > [7] affy_1.32.0 Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] affyio_1.22.0 BiocInstaller_1.2.1 Biostrings_2.22.0 > [4] DBI_0.2-5 IRanges_1.12.5 RSQLite_0.11.1 > [7] splines_2.14.0 tcltk_2.14.0 tools_2.14.0 > [10] zlibbioc_1.0.0 > > > -- > Sent via the guest posting facility at bioconductor.org. From smyth at wehi.EDU.AU Tue Jun 26 01:07:31 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Tue, 26 Jun 2012 09:07:31 +1000 (AUS Eastern Standard Time) Subject: [BioC] 2 different factorial analysis codes in LIMMA give different logFC but same values for other components of TOPTABLE In-Reply-To: <7F10E9EDBB347E4CA0765A3139C110BB14F9BA33@UFEXCH-MBXN01.ad.ufl.edu> References: <7F10E9EDBB347E4CA0765A3139C110BB14F9BA33@UFEXCH-MBXN01.ad.ufl.edu> Message-ID: Dear Miriam, It isn't necessary to send emails to my personal email address. You've already asked the same question on the Bioconductor mailing list, so I've got your question four times now. If you must refer to me personally, could you please spell my name correctly? The logFC you get from a contrast depends on how you scale the contrast. A contrast of c(-2,1,1) and a contrast of c(-1,0.5,0.5) are entirely equivalent from the point of view of testing the null hypothesis of no change, but they will give contrast values that differ by a factor of two. 
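The scaling point can be checked numerically. What follows is only a minimal base-R sketch, assuming simulated data and ordinary least squares via lm() rather than limma; the helper contrast_stats() and the three-group layout are made up for illustration. It compares the contrast weights c(-2,1,1) and c(-1,0.5,0.5) on the same fit.

```r
# Simulated one-way layout with three groups (made-up data, not from the thread)
set.seed(42)
group <- factor(rep(c("A", "B", "C"), each = 4))
y <- rnorm(12, mean = c(5, 6, 7)[as.integer(group)])

# Cell-means fit: one coefficient per group
fit  <- lm(y ~ 0 + group)
beta <- coef(fit)
V    <- vcov(fit)

# Hypothetical helper: estimate and t-statistic for contrast weights w
contrast_stats <- function(w) {
  est <- sum(w * beta)
  se  <- sqrt(drop(t(w) %*% V %*% w))
  c(estimate = est, t = est / se)
}

c1 <- contrast_stats(c(-2, 1, 1))      # B + C - 2*A
c2 <- contrast_stats(c(-1, 0.5, 0.5))  # (B + C)/2 - A

# The estimates differ by a factor of two; the t-statistics coincide
print(rbind(c1, c2))
```

Because the standard error scales with the contrast in the same way as the estimate, the ratio (and therefore the p-value) is unchanged.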
Hence the logFC in limma will change by a factor of two, while the p-values and everything else will stay the same. Similarly, defining a contrast by (B+C)/2-A or by B+C-2*A would be equivalent except for a factor of two. They would give the same p-value but different logFC. Obviously if you change the logFC but use the same fold change cutoff when assessing differential expression, then you will change the number of genes that you define as differentially expressed. It is not apparent to me that you can choose FC=1.5 as a meaningful cutoff regardless of the meaning or definition of the contrast. If you are not very familiar with contrasts, then just use the model A approach, which is clear and explicit and obviously correct. That's why I recommend it in the User's Guide! Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. http://www.wehi.edu.au http://www.statsci.org/smyth On Mon, 25 Jun 2012, Garcia Orellana,Miriam wrote: > Dear Dr. Smith and all: I am sorry to bother you with this matter, my understanting of the microarray anylisis is really basic and I am having hard and long time to finish the analysis of my data. Now, I can't figure out what is happening with the two models I am applying to evaluate my data, because each one of them on the TOPTABLE option when requesting the values for all the filtered genes ( 8026). Both models, (A and B) for each of my 5 contrasts are giving me the top table with the same numerical values for: AveExp, t, Pvalue, adjPvalue and B. However the logFC for contrasts 1, 2 and 3 in model B is exactly half of that in model A, while the logFC for contrasts 4 and 5 in model B is exactly one fourth of that one in model A. 
Because this difference in logFC, I am getting different numbers of differentially expresed genes when using a cut off of adjPvalue lower or equal to 0.05 and a rawlogFC greater or equal to 1.5 for example for contrast 3(MR effect) it gives me 47 up- and 84 down-regulated genes with model A, while with model B it gives me only 18 up- and only 2 down-regulated genes under same cut-offs. How that can be possible if all other values are the same? and so what should I follow? Briefly me data is a factorial design of 3 dam diets (DD: CTL, SFA, EFA) and 2 milk replacers (MR: LLA, HLA), I have three replicates for each of the interaction factors, then a total of 18 arrays. The data was filtered for informative/noninformative probes and plotted for array quality. So from a initial of 24118 bovine probes I endup with 8026 probes. My interest is to compare: 1. Feeding FAT at prepartum= (SFA +EFA) vs CTL, with CTL as ref 2. Feeding EFA prepartum = EFA vs SFA, with SFA as ref 3. Feeding MR to calves= HLA vs LLA, with LLA as reference 4. Interaction of feeding FAT by MR: (SFA +EFA) vs CTL by MR, with (SFA+EFA) vs CTL by LLA as ref 5. 
Interaction of feeding EFA by MR: EFA vs SFA by MR, with EFA vs SFA by LLA as ref
>
> MODEL A (I created that with the guide of the LIMMA user guide for a factorial design:
>
> TS <- paste(phenoDiet$DD, phenoDiet$MR, sep=".")
> TS
> TS <- factor(TS, levels=c("Ctl.LLA", "Ctl.HLA","SFA.LLA","SFA.HLA","EFA.LLA", "EFA.HLA"))
> design <- model.matrix(~0+TS)
> colnames(design) <- levels(TS)
> fit <- lmFit(eset2, design, method="robust", maxit=1000)
> efit <- eBayes(fit)
> #Contrast results
> MatContrast=makeContrasts(
>   FAT=(SFA.LLA + SFA.HLA + EFA.LLA + EFA.HLA)/4 - (Ctl.LLA + Ctl.HLA)/2,
>   FA=(EFA.LLA + EFA.HLA)/2 - (SFA.LLA + SFA.HLA)/2,
>   MR=(EFA.HLA+SFA.HLA+Ctl.HLA)/3 - (EFA.LLA+SFA.LLA+Ctl.LLA)/3,
>   FATbyMR=((EFA.HLA+SFA.HLA)/2 - Ctl.HLA) - ((EFA.LLA+SFA.LLA)/2-Ctl.LLA),
>   FAbyMR=( EFA.HLA-SFA.HLA)-(EFA.LLA - SFA.LLA),
>   levels=design)
> fitMat<-contrasts.fit(fit,MatContrast)
> Contrast.eBayes=eBayes(fitMat)
>
> MODEL B (this model was kindly provided by Dr G. Smith):
>
> DD <-factor(phenoDie$DD, levels = c("Ctl", "SFA", "EFA"))
> MR <-factor(phenoDie$MR, levels = c("LLA", "HLA"))
> contrasts (DD) <- cbind (SFAEFAvsCtl=c(-2,1,1),EFAvsSFA=c(0,-1,1))
> contrasts (MR) <- c(-1,1)
> design <-model.matrix (~DD*MR)
> design
> fit <- lmFit (eset2, design, method="robust",maxit=1000)
> efit <- eBayes(fit)
>
> Thanks so much in advance,
> Miriam

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}

From fongchun at interchange.ubc.ca Tue Jun 26 01:27:53 2012
From: fongchun at interchange.ubc.ca (Fong Chun Chan)
Date: Mon, 25 Jun 2012 16:27:53 -0700
Subject: [BioC] Extracting a .CEL file from an AffyBatch Object
In-Reply-To: <4FE8D923.5040002@uw.edu>
References: <4FE8D923.5040002@uw.edu>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available URL: From smyth at wehi.EDU.AU Tue Jun 26 01:29:30 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Tue, 26 Jun 2012 09:29:30 +1000 (AUS Eastern Standard Time) Subject: [BioC] Loading ArrayVision microarray data into Limma In-Reply-To: References: Message-ID: Dear Maite, Well, it's pretty hard to provide support for file formats that change or aren't documented. The column headings in your file are not the same as the column headings in ArrayVision files that people have sent me in the past, and a google search doesn't find any documentation about what an ArrayVision file should contain. Your file seems to have multiple columns with the same column heading (Dens and Bkgd both occur twice). This makes the file pretty inscrutable, and pretty hard to read correctly using any software. Your email is hard to read, with different rows from the files just run together. I wonder whether this is correct output from ArrayVision, or whether the file has been edited in some way. It would seem that the file or your email description of it is in error. Best wishes Gordon > Date: Sun, 24 Jun 2012 15:01:21 +0200 > From: Maite Iriondo > To: bioconductor at r-project.org > Subject: [BioC] Loading ArrayVision microarray data into Limma > > Hello, > > I have got .csv data from a* Pseudomonas putida* microarray which was > analysed with ArrayVision image analysis software, in which the numbers (1, > 2,..5) correspond to different times when the samples were taken and the > "c" stands for control. > I am using Limma to upload my data, however I get an error in which says > that the foreground columns for the file in which I store the intensities > are not found. 
> My data for the two-channel microarray was given in this format: > > AA30_1 AA30_1c SpotNr SpotType Gene Dens Bkgd Dens Bkgd Spot0001 A > korA2 224.16 110 491.68 235 Spot0002 A korA2 203.44 113 433.33 246.5 > Spot0003 A mpfH 167.12 116 386.53 262.5 Spot0004 A mpfH 145.63 113 392.1 > 249 Spot0005 C DMSO 126.23 116 275.03 260.5 Spot0006 C DMSO 126.61 116 > 254.69 275.5 > I defined the targets file: > targets<- readTargets(file="TargetsAA30.csv", sep=",") > > FileName Cy3 Cy5 AA301.csv AA30_1 AA30_1c AA302.csv AA30_2 AA30_2c > AA303.csv AA30_3 AA30_3c AA304.csv AA30_4 AA30_4c AA305.csv AA30_5 AA30_5c > > and when reading the raw data I have the following error: > >> RG<- read.maimages(targets$FileName, "arrayvision") > Error in read.maimages(targets$FileName, "arrayvision") : > Cannot find foreground columns in AA301.csv > > I would like to know if the structure of my initial file is correct for the > read.maimage function using source= "arrayvision", or if there is any > changes I can make in order to upload my data. > Thank you in advanced, > > Maite Iriondo > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} From akulan at mail.nih.gov Tue Jun 26 01:32:23 2012 From: akulan at mail.nih.gov (Akula, Nirmala (NIH/NIMH) [C]) Date: Mon, 25 Jun 2012 19:32:23 -0400 Subject: [BioC] DEXSeq question Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From smyth at wehi.EDU.AU Tue Jun 26 02:03:56 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Tue, 26 Jun 2012 10:03:56 +1000 (AUS Eastern Standard Time) Subject: [BioC] design matrix edge R pairwise comparison at different time points after infection with replicates In-Reply-To: <3D4A97F14E343F4584925219C1C1ACEF05B965F0@ICTS-S-MBX7.luna.kuleuven.be> References: <3D4A97F14E343F4584925219C1C1ACEF05B965F0@ICTS-S-MBX7.luna.kuleuven.be> Message-ID: Dear Kaat, 1) It is not generally true that you will find more DE genes analysing just one time separately rather than using all the libraries in one linear model. It is possible in principle that the later time may show more variability, and that might justify separate analysis of the different times. However I would do that only when you have a good a priori biological reason for expecting such an effect and data exploration (such as an MDS plot) confirms it. Otherwise the extra instability of dispersion estimation with a small number of libraries is not justified. I would not make such decisions merely on the basis of which analysis gives more DE genes. 2) I don't think there is yet a definitive rule regarding filtering and library sizes, but I prefer to recompute library sizes and scale normalize (calcNormFactors) after filtering. Recomputing the library sizes doesn't make a lot of difference, because scale normalization will self correct anyway. Are you normalizing your data? Best wishes Gordon On Mon, 25 Jun 2012, Kaat De Cremer wrote: > Again, > Thank you both very much for your reply. > > I have analyzed my time course data now in several different ways and > noticed some differences: > > > 1) when I analyze the data of all time points together and look for DE > genes at one time point, I find less DE genes compared to when I use > only the data of that one time point in edgeR. I assume this is because > the dispersion is larger when I include all the different time points at > once? 
In that case, is this the right way to go? The dispersion can be > larger at one time point compared to another I would think. > > 2) I noticed in the edgeR user's guide that in some examples you correct > the library size after filtering for low expressed genes, and in other > examples you don't. Correcting this library size gives less DE genes for > my data at all time points when I analyze all data together and then > look for DE genes at each time point. I don't see this difference when I > only look at the data of one time point in edgeR. > > I hope you can comment on this, > > > Thank you, > Kaat > > > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] > Sent: zondag 24 juni 2012 2:42 > To: Kaat De Cremer > Cc: Bioconductor mailing list; Mark Robinson > Subject: design matrix edge R pairwise comparison at different time points after infection with replicates > > Hi Kaat, > > I'll jump in and continue on from Mark's help. > > To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. > > I'll assume that your "tests" are independent replicates of the whole experiment. If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be: > > design <- model.matrix(~test+time+time:treat) > > This produces a design matrix with the following columns: > > > colnames(design) > [1] "(Intercept)" "test2" "test3" "time24hpi" > [5] "time48hpi" "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT" > > So testing for treatment effects at each time is easy. To test for treatment effect as time 12h: > > fit <- glmFit(y, design) > lrt <- glmLRT(y, fit, coef="time12hpi:treatT") > > etc. To test for treatment effect at time 24h: > > lrt <- glmLRT(y, fit, coef="time24hpi:treatT") > > and so on. 
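The column names quoted above can be reproduced with base R alone. This is a minimal sketch, assuming 18 libraries arranged as 3 time points x 2 treatments x 3 tests; the exact sample order is an assumption for illustration and is not taken from the real count table.

```r
# Hypothetical sample layout: 3 times x 2 treatments x 3 independent tests
time  <- factor(rep(c("12hpi", "24hpi", "48hpi"), each = 6))
treat <- factor(rep(rep(c("C", "T"), each = 3), times = 3))
test  <- factor(rep(1:3, times = 6))

# Batch (test) effect, time effect, and a separate treatment effect within each time
design <- model.matrix(~ test + time + time:treat)

# One "timeXX:treatT" column per time point; each can be passed as `coef`
print(colnames(design))

# The design should be of full column rank, otherwise coefficients are not estimable
print(qr(design)$rank == ncol(design))
```

Each of the three interaction columns is the treated-versus-control log fold change at one time point, which is why testing a single coefficient is enough.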
> > Best wishes > Gordon > >> Date: Fri, 22 Jun 2012 13:11:41 +0000 >> From: Kaat De Cremer >> To: Mark Robinson >> Cc: bioconductor list >> Subject: Re: [BioC] design matrix edge R pairwise comparison at >> different time points after infection with replicates >> >> Hi Mark, >> Thank you for your suggestion, >> I really appreciate your time. >> >> Working in R is new to me so it has been a struggle using edgeR, but I >> think I managed it using only 2 factors (test and treatment). Now that >> I will be including 3 factors (test, treatment and time) in one >> analysis it is clear to me that I still don't understand how it works exactly. > >> Below you can see my workspace with the only design matrix I could >> come up with, but I don't see which coefficients I should include or >> which contrast vector to use in the glmLRT function to make the >> comparison of control-treatment at each time point separate, ignoring >> the other 2 time points. Is this possible with this design matrix? Or >> is the matrix wrong for this purpose? >> >> >> Thanks! 
>> Kaat >> >> >>> head(x) >> 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 >> Lsa000001.1 0 1 1 2 0 2 1 1 >> Lsa000002.1 5 4 0 5 6 6 6 4 >> Lsa000003.1 10 9 7 5 5 8 6 2 >> Lsa000004.1 1 1 1 1 1 1 1 3 >> Lsa000005.1 1 0 1 0 2 0 0 1 >> Lsa000006.1 510 223 228 287 222 268 303 358 >> 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 >> Lsa000001.1 0 1 1 0 0 0 0 2 >> Lsa000002.1 7 5 2 5 10 6 12 12 >> Lsa000003.1 7 5 4 2 6 5 8 2 >> Lsa000004.1 1 3 1 2 1 3 2 3 >> Lsa000005.1 0 1 0 0 1 0 0 2 >> Lsa000006.1 372 362 237 320 472 440 411 858 >> 48hpi T2 48hpi T3 >> Lsa000001.1 0 0 >> Lsa000002.1 1 5 >> Lsa000003.1 1 0 >> Lsa000004.1 0 2 >> Lsa000005.1 1 0 >> Lsa000006.1 375 275 >>> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C"," >>> C","C","T","T","T")) >>> test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) >> time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi" >> ,"24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48hp >> i","48hpi","48hpi")) >>> y<-DGEList(counts=x,group=treat) >> Calculating library sizes from column totals. 
>>> cpm.y<-cpm(y) >>> y<-y[rowSums(cpm.y>2)>=3,] >>> y<-calcNormFactors(y) >> design<-model.matrix(~test+treat+time) >>> design >> (Intercept) test2 test3 treatT time24hpi time48hpi >> 1 1 0 0 0 0 0 >> 2 1 1 0 0 0 0 >> 3 1 0 1 0 0 0 >> 4 1 0 0 1 0 0 >> 5 1 1 0 1 0 0 >> 6 1 0 1 1 0 0 >> 7 1 0 0 0 1 0 >> 8 1 1 0 0 1 0 >> 9 1 0 1 0 1 0 >> 10 1 0 0 1 1 0 >> 11 1 1 0 1 1 0 >> 12 1 0 1 1 1 0 >> 13 1 0 0 0 0 1 >> 14 1 1 0 0 0 1 >> 15 1 0 1 0 0 1 >> 16 1 0 0 1 0 1 >> 17 1 1 0 1 0 1 >> 18 1 0 1 1 0 1 >> attr(,"assign") >> [1] 0 1 1 2 3 3 >> attr(,"contrasts") >> attr(,"contrasts")$test >> [1] "contr.treatment" >> >> attr(,"contrasts")$treat >> [1] "contr.treatment" >> >> attr(,"contrasts")$time >> [1] "contr.treatment" >> >>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >> Disp = 0.07299 , BCV = 0.2702 >>> y<-estimateGLMTrendedDisp(y,design) >> Loading required package: splines >>> y<-estimateGLMTagwiseDisp(y,design) >> Warning message: >> In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : >> max iterations exceeded >>> fit<-glmFit(y,design) >> >> >> >> >> >> >> >> -----Original Message----- >> From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch] >> Sent: vrijdag 22 juni 2012 12:03 >> To: Kaat De Cremer >> Cc: bioconductor list >> Subject: Re: [BioC] design matrix edge R pairwise comparison at >> different time points after infection with replicates >> >> Hi Kaat, >> >> It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. >> >>> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). 
And, then also include a time variable in your design. Some decisions might need to be made about interactions to include. >> >> Hope that gets you started. >> >> Best, >> Mark >> >> >> ---------- >> Prof. Dr. Mark Robinson >> Bioinformatics >> Institute of Molecular Life Sciences >> University of Zurich >> Winterthurerstrasse 190 >> 8057 Zurich >> Switzerland >> >> v: +41 44 635 4848 >> f: +41 44 635 6898 >> e: mark.robinson at imls.uzh.ch >> o: Y11-J-16 >> w: http://tiny.cc/mrobin >> >> ---------- >> http://www.fgcz.ch/Bioconductor2012 >> >> On 21.06.2012, at 11:42, Kaat De Cremer wrote: >> >>> Dear all, >>> >>> >>> I am using edgeR to find genes differentially expressed between >>> infected and mock-infected control plants, at 3 time points after >>> infection. > >>> I have RNAseq data for 3 independent tests, so for every single test >>> I have 6 libraries (control + infected at 3 time points). > >>> Having three replicates this makes 18 libraries in total. >>> >>> What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: >>> >>>> head(x) >>> C1 C2 C3 T1 T2 T3 >>> 1 0 1 2 0 0 0 >>> 2 13 6 4 10 8 12 >>> 3 17 16 9 10 8 11 >>> 4 2 1 2 2 3 2 >>> 5. 1 3 1 2 1 3 0 >>> 6 958 457 438 565 429 518 >>> >>>> treatment<-factor(c("C","C","C","T","T","T")) >>>> test<-factor(c(1,2,3,1,2,3)) >>>> y<-DGEList(counts=x,group=treatment) >>> Calculating library sizes from column totals. >>>> cpm.y<-cpm(y) >>>> y<-y[rowSums(cpm.y>2)>=3,] >>>> y<-calcNormFactors(y) >>>> design<-model.matrix(~test+treat) >>>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >>> Disp = 0.0265 , BCV = 0.1628 >>>> y<-estimateGLMTrendedDisp(y,design) >>> Loading required package: splines >>>> y<-estimateGLMTagwiseDisp(y,design) >>>> fit<-glmFit(y,design) >>>> lrt<-glmLRT(y,fit) >>> >>> >>> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? 
>>> Unfortunately I cannot figure out how to design the matrix. >>> >>> I hope someone can help me, >>> >>> Kaat >>> > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}} From kasperdanielhansen at gmail.com Tue Jun 26 05:27:03 2012 From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen) Date: Mon, 25 Jun 2012 23:27:03 -0400 Subject: [BioC] matrix like object with Rle columns Message-ID: Do we have a matrix-like object, but where the columns are Rle's? Kasper From lawrence.michael at gene.com Tue Jun 26 05:36:34 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Mon, 25 Jun 2012 20:36:34 -0700 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From kasperdanielhansen at gmail.com Tue Jun 26 05:56:11 2012 From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen) Date: Mon, 25 Jun 2012 23:56:11 -0400 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: Message-ID: On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence wrote: > Patrick and I had talked about this a long time ago (essentially putting a > "dim" attribute on an Rle), but the closest thing today is a DataFrame with > Rle columns. > > Use case? Say I have whole-genome data (for example coverage) on multiple samples. Usually, this is far easier to think of as a matrix (in my opinion) with ~3B rows and I often want to do rowSums(), colSums() etc (in fact, probably the whole API from matrixStats). This is especially nice when you have multiple coverage-like tracks on each sample, so you could have trackA : genome by samples trackB : genome by samples ... You could think of this as a SummarizedExperiment, but with _extremely_ big matrices in the assay slot. 
I want to take advantage of the Rle structure to store the data more efficiently and also to do potentially faster computations. This is actually closer to my use case where I currently use matrices with ~30M rows (which works fine), but I would like to expand to ~800M rows (which would suck a bit). You could also think of a matrix-like object with Rle columns as an alternative sparse matrix structure. In a typical sparse matrix you only store the non-zero entities, here we only store the change-points. Depending on the structure of the matrix this could be an efficient storage of an otherwise dense matrix. So essentially, what I want, is to have mathematical operations on this object, where I would utilize that I know that all entities are numbers so the typical matrix operations makes sense. [ side question which could be relevant in this discussion: for a numeric Rle is there some notion of precision - say I have truly numeric values with tons of digits, and I want to consider two numbers part of the same run if |x1 -x2| > Michael > > On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen > wrote: >> >> Do we have a matrix-like object, but where the columns are Rle's? 
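The change-point idea can be sketched with base R's rle(), used here purely as a stand-in for the Rle class discussed in the thread. The two "coverage" vectors and the list-of-columns layout are hypothetical; this is not the IRanges API.

```r
# Two hypothetical coverage columns with long constant stretches
sampleA <- c(rep(0, 5), rep(3, 4), rep(0, 6))
sampleB <- c(rep(1, 10), rep(0, 5))

# A "matrix" stored as a list of run-length encoded columns
cols <- list(A = rle(sampleA), B = rle(sampleB))

# Only the runs are stored, not every position
n_runs <- sapply(cols, function(r) length(r$values))
print(n_runs)

# Column sums computed directly on the runs, without expanding them
col_sums <- sapply(cols, function(r) sum(r$values * r$lengths))
print(col_sums)

# The encoding is lossless: the dense column can be recovered when needed
print(identical(inverse.rle(cols$A), sampleA))
```

Column-wise summaries are cheap in this layout; row-wise operations such as rowSums() are the harder part, since the runs of different columns do not line up.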
>>
>> Kasper
>
>

From kasperdanielhansen at gmail.com Tue Jun 26 06:11:53 2012
From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen)
Date: Tue, 26 Jun 2012 00:11:53 -0400
Subject: [BioC] matrix like object with Rle columns
In-Reply-To:
References:
Message-ID:

On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen wrote:
> On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence
> wrote:
>> Patrick and I had talked about this a long time ago (essentially putting a
>> "dim" attribute on an Rle), but the closest thing today is a DataFrame with
>> Rle columns.
>>
>> Use case?
>
> Say I have whole-genome data (for example coverage) on multiple
> samples. Usually, this is far easier to think of as a matrix (in my
> opinion) with ~3B rows and I often want to do rowSums(), colSums() etc
> (in fact, probably the whole API from matrixStats). This is
> especially nice when you have multiple coverage-like tracks on each
> sample, so you could have
>   trackA : genome by samples
>   trackB : genome by samples
>   ...
>
> You could think of this as a SummarizedExperiment, but with
> _extremely_ big matrices in the assay slot.
>
> I want to take advantage of the Rle structure to store the data more
> efficiently and also to do potentially faster computations.
>
> This is actually closer to my use case where I currently use matrices
> with ~30M rows (which works fine), but I would like to expand to ~800M
> rows (which would suck a bit).
>
> You could also think of a matrix-like object with Rle columns as an
> alternative sparse matrix structure. In a typical sparse matrix you
> only store the non-zero entities, here we only store the
> change-points.
> Depending on the structure of the matrix this could be
> an efficient storage of an otherwise dense matrix.
>
> So essentially, what I want, is to have mathematical operations on
> this object, where I would utilize that I know that all entities are
> numbers so the typical matrix operations makes sense.
>
> [ side question which could be relevant in this discussion: for a
> numeric Rle is there some notion of precision - say I have truly
> numeric values with tons of digits, and I want to consider two numbers
> part of the same run if |x1 -x2|
> Kasper
>
>>
>> Michael
>>
>> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen
>> wrote:
>>>
>>> Do we have a matrix-like object, but where the columns are Rle's?
>>>
>>> Kasper

From bmb at bmbolstad.com Tue Jun 26 06:50:14 2012
From: bmb at bmbolstad.com (Benjamin Bolstad)
Date: Mon, 25 Jun 2012 21:50:14 -0700
Subject: [BioC] Quality Diagnostics of Affy Arrays using PLM
In-Reply-To: <20120625221936.90BB9134515@mamba.fhcrc.org>
References: <20120625221936.90BB9134515@mamba.fhcrc.org>
Message-ID: <0A7D1284-4F95-4F70-BB5D-372DBB348BD5@bmbolstad.com>

Hi Grant,

I cannot reproduce this issue at my end. But my suspicion is that when you put the "add=TRUE" argument into your image() call on the PLMset, it is being expanded to "add.legend=TRUE". Internally the image() method for the PLMset object uses "layout" in this situation (which will not interact very well with "par(mfrow)").

Best,

Ben

On Jun 25, 2012, at 3:19 PM, Grant Izmirlian [guest] wrote:

>
> Hi:
> I have been following examples listed in section 3.5.1 of "Bioinformatics and Computational Biology using R and Bioconductor", which deals with quality diagnostics of affy arrays using PLM.
I am trying to produce a composite plot displaying per chip residuals from the PLM model using my own data. Following the example, starting with the AffyBatch object, MyDat.AffyBatch, which contains 40 arrays, > > MyDat.plm <- fitPLM(MyDat.AffyBatch) > par(mfrow=c(4,10)) > image(MyDat.plm, type="resids", which=1) > image(MyDat.plm, type="resids", which=2) > image(MyDat.plm, type="resids", which=3) > . > . > . > image(MyDat.plm, type="resids", which=40) > > The problem is that the par(mfrow=c(4,10)) is ignored and I get > 40 new plots. I tried setting 'add=TRUE' to the argument list above--still no luck. > > The example in the text makes it appear that this works. What's going on? > > > > > > -- output of sessionInfo(): > > R version 2.14.0 (2011-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] hgu133plus2cdf_2.9.1 AnnotationDbi_1.16.10 limma_3.10.0 > [4] affyPLM_1.30.0 preprocessCore_1.16.0 gcrma_2.26.0 > [7] affy_1.32.0 Biobase_2.14.0 > > loaded via a namespace (and not attached): > [1] affyio_1.22.0 BiocInstaller_1.2.1 Biostrings_2.22.0 > [4] DBI_0.2-5 IRanges_1.12.5 RSQLite_0.11.1 > [7] splines_2.14.0 tcltk_2.14.0 tools_2.14.0 > [10] zlibbioc_1.0.0 > > > -- > Sent via the guest posting facility at bioconductor.org. 
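The suspected expansion of add=TRUE to add.legend=TRUE is ordinary R partial argument matching. The sketch below shows the mechanism with two made-up stand-in functions; plot_resids() and plot_resids2() are hypothetical and are not the affyPLM image() method, whose real signature should be checked with its help page.

```r
# Stand-in whose formal `add.legend` comes BEFORE `...`
plot_resids <- function(x, which = 0, add.legend = FALSE, ...) {
  if (add.legend) "legend branch" else "plain branch"
}

# `add` is not a formal, but it is a unique prefix of `add.legend`,
# so R silently matches it to that argument
r1 <- plot_resids(1, which = 1, add = TRUE)
print(r1)

# Formals placed AFTER `...` must be named in full, so here `add`
# is swallowed by `...` and add.legend keeps its default
plot_resids2 <- function(x, which = 0, ..., add.legend = FALSE) {
  if (add.legend) "legend branch" else "plain branch"
}
r2 <- plot_resids2(1, which = 1, add = TRUE)
print(r2)
```

If a method takes the legend branch and sets up its own layout(), a previously set par(mfrow) would no longer control the panel arrangement, which is consistent with the behaviour described above.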
From huwenhuo at gmail.com Tue Jun 26 08:02:17 2012
From: huwenhuo at gmail.com (wenhuo hu)
Date: Tue, 26 Jun 2012 02:02:17 -0400
Subject: [BioC] Kaplan Meier curve
Message-ID:

Hi --

I am trying to use the survival package to draw a Kaplan-Meier curve, but it returned an unexpected result. Here is my data, d (status: 1 means died, 0 means still alive):

   gene status days
1    ko      1   81
2    ko      1  119
3    ko      1   92
4  ctrl      1   19
5  ctrl      1   15
6    ko      0   41
7  ctrl      1   16
8    ko      0   41
9  ctrl      1   21
10   ko      0   41
11 ctrl      0   41
12   ko      0   41
13 ctrl      1   31

survfit(Surv(days, status)~gene, data=d) -> fit
plot(fit, conf.int=F, col=2:3, lty=1:2)

As attached, the green curve is the ko genotype, which can live longer. But there should be alive mice left, 4 out of 7, as the data shows. Could anyone help me on this point?

Wenhuo Hu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: survival.png
Type: image/png
Size: 7738 bytes
Desc: not available
URL:

From axel.klenk at actelion.com Tue Jun 26 08:19:05 2012
From: axel.klenk at actelion.com (axel.klenk at actelion.com)
Date: Tue, 26 Jun 2012 08:19:05 +0200
Subject: [BioC] Using limma for quantitative proteomics data
In-Reply-To:
References:
Message-ID: <7645_1340691640_4FE954B8_7645_72441_1_OF6688FF2C.B036D1C4-ONC1257A29.00227A6B-C1257A29.0022D9EE@actelion.com>

Dear Yong,

I don't think you need an MAList -- all limma functions will accept a simple matrix of your log2 ratios... or at least, all limma functions I have ever used, will do that...
:-) Cheers, - axel Axel Klenk Research Informatician Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / Switzerland From: Yong Li To: bioconductor at r-project.org Date: 26.06.2012 00:01 Subject: Re: [BioC] Using limma for quantitative proteomics data Sent by: bioconductor-bounces at r-project.org Dear Aaron, thank you and others for suggestions. My data is really ratios and not absolute values for normal and tumor. Sorry that I am still not quite sure how to move forward with limma when I take log2 of the ratios. It looks like I then will have the M component of the MAList, but how can I construct the A to make an MAList? Or I am missing something here? Kind regards, Yong On Tue, Jun 19, 2012 at 11:09 PM, Aaron Mackey wrote: > There's a thread on the bioconductor mailing list about using voom for > RSEM-based RNA-seq quantification, in which Gordon Smythe explained that > while voom() was designed for count data, it doesn't require it. As Tim > Triche has suggested, if you're raw data is really ratios (and not absolute > values for normal and tumor), then you should take log2 of those ratios and > use limma from there; you can then also hijack the arrayQualityMetrics > package to check QC (MA plots, mean-variance relationships, etc.) > > -Aaron > > On Tue, Jun 19, 2012 at 3:39 PM, Yong Li > wrote: >> >> Dear Aaron, >> >> thank you for your quick answer! I have checked the help page of >> voom() but it seems to be used for count data. My data are just >> tumor/normal ratios. I am wondering if you could provide more details? >> >> Best regards, >> Yong >> >> On Tue, Jun 19, 2012 at 8:18 PM, Aaron Mackey >> wrote: >> > yes, it should be possible with a voom()-based analysis to get the >> > variances >> > "right". 
>> > >> > -Aaron >> > >> > On Tue, Jun 19, 2012 at 12:47 PM, Yong Li >> > wrote: >> >> >> >> Hello, >> >> >> >> limma has been so valuable in microarray data analysis, but has anyone >> >> used limma for finding differentially expressed proteins from >> >> quantitative proteomics data? The data I got has tumor/normal ratios >> >> of thousands proteins, and both tumor and normal have a number of >> >> replicates. Could such data be analyzed with limma? >> >> >> >> If limma can not be used here, what statistics method is suitable so >> >> that we can get statistically significant proteins with p-values? Any >> >> suggestion is appreciated. >> >> >> >> Kind regards, >> >> Yong >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor at r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor The information of this email and in any file transmitted with it is strictly confidential and may be legally privileged. It is intended solely for the addressee. If you are not the intended recipient, any copying, distribution or any other use of this email is prohibited and may be unlawful. In such case, you should please notify the sender immediately and destroy this email. The content of this email is not legally binding unless confirmed by letter. 
Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorised to state them to be the views of the sender's company. For further information about Actelion please see our website at http://www.actelion.com From heidi at ebi.ac.uk Tue Jun 26 09:57:09 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Tue, 26 Jun 2012 08:57:09 +0100 Subject: [BioC] Kaplan Meier curve In-Reply-To: References: Message-ID: <5d680f43938e2546f548c59cfdc75634.squirrel@webmail.ebi.ac.uk> Hi Wenhuo, > Hi -- > > I am trying the survival package to draw kaplan meier curve but returned > with unexpected result. > here is my data: d, status: 1 mean died, 0 means still alive. > > gene status days > 1 ko 1 81 > 2 ko 1 119 > 3 ko 1 92 > 4 ctrl 1 19 > 5 ctrl 1 15 > 6 ko 0 41 > 7 ctrl 1 16 > 8 ko 0 41 > 9 ctrl 1 21 > 10 ko 0 41 > 11 ctrl 0 41 > 12 ko 0 41 > 13 ctrl 1 31 > I think there might be a slight problem with your data here, more specifically the "days" column. The 12th data point looks like it's formatted differently than the others (a space after it). That might give rise to the one alive mouse showing up as a cross at ~day 100. Incidentally, something like this looks right: d <- data.frame(gene=c(1,1,1,2,2,1,2,1,2,1,2,1,2), status=c(1,1,1,1,1,0,1,0,1,0,0,0,1), days=c(81,119,92,19,15,41,16,41,21,41,41,41,31)) survfit(Surv(days, status)~gene, data=d) -> fit plot(fit, conf.int=F, col=2:3, lty=1:2) That gives 8 deaths in total, and 5 alive mice (plotted on top of each other at day41). HTH \Heidi > survfit(Surv(days, status)~gene, data=d) -> fit > plot(fit, conf.int=F, col=2:3, lty=1:2) > > as attached, the green one is ko genotype which can live longer. But there > should have alive mice left, 4 out of 7, as the data shows. Could any help > me on this point? 
> > > > Wenhuo Hu > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From heidi at ebi.ac.uk Tue Jun 26 10:10:23 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Tue, 26 Jun 2012 09:10:23 +0100 Subject: [BioC] other questions for HTqPCR In-Reply-To: References: Message-ID: <24d4ab3512855f74d47c17069fadd7f8.squirrel@webmail.ebi.ac.uk> Hi Simon, > Hi Heidi, > hopefully you can respond to the last email with this one. Should I write > you on the regular forum? I'm a bit unclear about the etiquette here. > In general, writing to the forum is preferable. 1) any email to [BioC] has a greater chance of getting noticed in my inbox, 2) it serves as a Q&A repository, and 3) you may get a useful reply from other people. > In any case, I have 5 plates of data in a 48.48 biomark format, with > different layouts on each plate with regards to the samples, but the same > genes are on every plate. > > after following your last set of suggestions, I can get the sample ID's > associated with the correct genes. I'm having some basic trouble > generating some of the plots though, as I'm getting out of range errors. > This is due to the fact that we've got a lot of groups (roughly 30 groups > or so). This leads to incorrect assignation of margin size on the plots, a > fairly typical problem in R. My question is, do you have some general > guidelines to solve this? e.g., a series of iterative steps to try when a > plot is not working correctly due to wrong margin settings etc. > The short answer would be no. What you can possibly try though is for example: - setting the margins yourself before plotting, using e.g. 
par(mar=c(3,2,2,1)); plotXXX()
- plotting directly to a file, where you can increase the height and width dimensions
- opening a plotting device such as X11 or quartz, likewise with height and width specified, before calling the plot command

More specifically, what are the dimensions of your qPCRset, i.e. how many sample groups do you have? And which plots are failing? If this is likely to become common as qPCR platforms increase in size, I should look into it.

> The other question is combining the 5 plates I have into one object.
> You've covered this to some extent in your guide, but I was wondering
> whether or not having exactly the same distribution of samples per plate
> affects the merging. For example, some of our groups have 5 samples per
> group, while as others have 4, and this varies from plate to plate
>
I assume you're using cbind(), in which case it shouldn't matter. As long as the genes are in the same order, it's okay. With 5 48.48 plates you should just end up with 48 features (rows) x 240 columns. The content of the columns can vary in any way, as long as you indicate it correctly in the calls to functions that require the data to be grouped by samples.

Hope this helps, and I apologise for being so tardy with my replies. As you've undoubtedly noticed, HTqPCR wasn't originally planned for data in the BioMark format, and I'm just trying to catch up with it. If you have any analysis steps/issues that you think ought to be explained in the vignette, then please let me know.

\Heidi

> I hope I'm clear!
>
> thanks again for all your efforts, you guys do a terrific job in helping
> many people.
>
> best
>
> s
>
> Simon Melov Ph.D.
> Associate Professor & > Director of Genomics > Buck Institute for Research on Aging > 8001 Redwood Blvd > Novato, CA 94945 > > Office: 415 209 2068 > Cell: 415 827 4979 > Fax: 415 209 9920 > > > > > > > From heidi at ebi.ac.uk Tue Jun 26 10:50:01 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Tue, 26 Jun 2012 09:50:01 +0100 Subject: [BioC] HTqPCR In-Reply-To: <1340614784.8482.YahooMailNeo@web132505.mail.ird.yahoo.com> References: <1340319194.4904.BPMail_high_noncarrier@web132504.mail.ird.yahoo.com> <48c1ef538be818c58eab9946116e3dbb.squirrel@webmail.ebi.ac.uk> <1340614784.8482.YahooMailNeo@web132505.mail.ird.yahoo.com> Message-ID: Hello Deborah, > Good morning Heidi, > > thank you for your response. > > I have 6 samples with the "LightCycler" format (96 genes one each sample) > : one control and its duplicate, one "thirty minutes" and its duplicate > and one "three hours" and its duplicate. > I did the ttestCtdata, the mannwhitneyCtData and the limmaCtdata on my > samples for compare all the results. 
> I did the commands :
>>PlaqueControle<-readCtData("essai_CT.txt",
>> column.info=list(feature="Name",position="Pos",Ct="Cp"), n.features=96,
>> format="LightCycler")
>>Plaque30min<-readCtData("essai_30min.txt",
>> column.info=list(feature="Name",position="Pos",Ct="Cp"), n.features=96,
>> format="LightCycler")
>>Plaque3h<-readCtData("essai_3h.txt",
>> column.info=list(feature="Name",position="Pos",Ct="Cp"), n.features=96,
>> format="LightCycler")
>>PlaqueControleB<-readCtData("essai_CT_BIS.txt",
>> column.info=list(feature="Name",position="Pos",Ct="Cp"), n.features=96,
>> format="LightCycler")
>>Plaque30minB<-readCtData("essai_30min_BIS.txt",
>> column.info=list(feature="Name",position="Pos",Ct="Cp"), n.features=96,
>> format="LightCycler")
>>Plaque3hB<-readCtData("essai_3h_BIS.txt",
>> column.info=list(feature="Name",position="Pos",Ct="Cp"), n.features=96,
>> format="LightCycler")
>
>>essai<-cbind(PlaqueControle,Plaque30min,Plaque3h,PlaqueControleB,Plaque30minB,Plaque3hB)
>>files_essai<-read.table("files_essai.txt",header=T)

This all looks okay. Incidentally, you can also read in all the files together, using readCtData(files=c("essai_CT.txt", "essai_30min.txt", ...), n.data=6), in case you don't want to do the cbind() afterwards.

>>essai.cat<- setCategory(essai, groups = files_essai$Treatment,quantile =
>> 0.8)
>>deltaCtnorm <- normalizeCtData(essai.cat, norm = "deltaCt",deltaCt.genes
>> = c("gene85", "gene86", "gene87", "gene88", "gene89", "gene90",
>> "gene91"))

Just out of curiosity, did you try other normalisation methods as well? I've seen a few cases where one of the supposed housekeeper genes has been really off the chart.

>
> Then I applied the ttestCtData, I compared the "control" samples with the
> "three hours" samples :
>>essai_ttest <- ttestCtData(deltaCtnorm[,c(1,3,6,4)], groups =
>> files_essai$Treatment[1:4],calibrator = "Controle")
>
> mannwhitneyCtData on the same samples :
>>essaimwtest<-mannwhitneyCtData(deltaCtnorm[,c(1,3,6,4)], groups =
>> files_essai$Treatment[1:4],calibrator = "Controle")
>
Looks okay. I assume the order of your samples is different in your qPCRset compared to your files_essai$Treatment.

>
> And finally, I did the limmaCtData :
>>essai_design<-model.matrix(~0 + files_essai$Treatment)
>>colnames(essai_design) <- c("Controle", "Trente_m", "Trois_h")
>>print(essai_design)
>
Again, your sample order in files_essai is definitely different from your object, but you do seem to use c(1,3,6,4,2,5) further down, so I guess it should work.

>>essai_contrasts <- makeContrasts(Trois_h-Controle,Trois_h - Trente_m,
>> Trente_m - Controle, (Trente_m +Trois_h)/2 - Controle, levels =
>> essai_design)
>>colnames(essai_contrasts) <- c("3h-CT", "3h-30min", "30min-CT",
>> "30min&3h-CT")
>>print(essai_contrasts)
>
>
>>deltaCtnorm2 <- deltaCtnorm[order(featureNames(deltaCtnorm)),]
>
You don't actually have to reorder when ndups=1. Of course, it doesn't hurt though.

>>essai_limma <- limmaCtData(deltaCtnorm2[,c(1,3,6,4,2,5)], design =
>> essai_design,contrasts = essai_contrasts, ndups = 1)
>
>
> I didn't obtain the same conclusions for the same genes in the three
> tests. So I think that I did a lot of error on the commands...
>
From just glancing over it, it looks okay, although I of course can't actually see what your initial and resulting objects look like.

> I give you the head screenshots of my results for the three tests in
> attachments.
>
> For example, if I take the "gene10", with limmaCtData, I obtained that
> "gene10" in the "control" samples is significatively different to "gene10"
> in "three hours" samples (p-value = 0.0057). With ttestCtData, "gene10" is
> not significatively different to "gene 10" in the "three hours" samples
> (p-value = 0.063) ; ditto with mannwhitneyCtData (p-value =0.25).
>
> In the part 10.3 of "HTqPCR.pdf", you say about limma that "The result is
> a list with one component per comparison. Each component is similar to the
> result from using ttestCtData." ; so I suppose that my results are not
> consistent.
>
The results from limma are *in principle* similar to a t-test, i.e. each component is the result from one of the individual tests specified by your contrasts. However, since you're using 3 different types of statistical tests, it's not surprising that the results you get vary.

Mann-Whitney is mainly used when the data aren't normally distributed. It's a non-parametric test, so it has less statistical power than the other two. It's therefore not surprising that the p-values are less significant.

limma uses a modified, more advanced version of the standard t-test. It always considers data from all samples, not just the ones being compared for any given contrast, and it "borrows" information across all features on your assay to achieve a more robust estimate. It is thus expected to give more significant values than a standard t-test, which only considers the samples and genes being compared in each individual test in isolation.

There's no hard-and-fast rule for which of the 3 tests to use, since it depends on (the distribution of) your starting data etc. It also depends on what you're using your data for. Do you want something where you're 100% sure that the genes do indeed differ between your conditions, e.g. as a validation? Or is this preliminary data that will be used for further studies in the lab, so that you're more interested in e.g. the top 1-10 hits? Depending on this, you can always have a look at the actual data for some of your genes (either the values themselves or some of the plots from HTqPCR).

The tests mainly differ in how conservative they are. With this in mind, you can check whether your results are relatively consistent by, for example, comparing the order of the resulting p-values.
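A minimal sketch of such a consistency check, using simulated Ct values rather than the data from this thread. The matrix `ct`, the group labels and the effect size are all made up for illustration, and only base R tests are used (not ttestCtData, mannwhitneyCtData or limmaCtData):

```r
# Simulated Ct values: 96 genes x 6 samples (3 control, 3 treated).
# All names and numbers here are illustrative assumptions.
set.seed(1)
n.genes <- 96
groups  <- factor(rep(c("Controle", "Trois_h"), each = 3))
ct <- matrix(rnorm(n.genes * 6, mean = 25, sd = 1), nrow = n.genes,
             dimnames = list(paste0("gene", seq_len(n.genes)), NULL))

# Shift the first 10 genes in the treated group so that some true differences exist
ct[1:10, groups == "Trois_h"] <- ct[1:10, groups == "Trois_h"] + 3

# Per-gene p-values from a t-test and from a Mann-Whitney (Wilcoxon rank-sum) test
p.t  <- apply(ct, 1, function(x) t.test(x ~ groups)$p.value)
p.mw <- apply(ct, 1, function(x) wilcox.test(x ~ groups, exact = TRUE)$p.value)

# Do the two tests rank the genes in a similar order?
rho <- cor(p.t, p.mw, method = "spearman")
print(rho)

# With 3 vs 3 samples the exact two-sided Mann-Whitney p-value cannot go below 2/20 = 0.1
print(min(p.mw))

# Scatter of one set of p-values against the other, on a -log10 scale
plot(-log10(p.t), -log10(p.mw),
     xlab = "-log10 p (t-test)", ylab = "-log10 p (Mann-Whitney)")
```

A clear monotone trend in such a plot would suggest the tests agree on the ranking and differ mainly in how conservative they are; the floor on the Mann-Whitney p-value with very few replicates is one reason its results can all look non-significant.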
For example, if you plot the p-values from one test versus another, are they completely randomly scattered, or do they follow a general trend? At the moment it sounds like your results mainly differ in the level of the p-value, not in whether a gene is e.g. up- or down-regulated in a given comparison. If the direction of change did differ between tests, then something would definitely be wrong, and I'd have to look into that.

I'm sorry that I can't give you any simple reply, or tell you which of the 3 tests to use. But it really depends on both the purpose of your study, and your data.

\Heidi

> Furthermore, all my results with the mannwhitneyCtData are not
> significant... I don't know if it was the good way to use these tests.
>
> Sincerely,
>
> Deborah.
>
>
> ________________________________
> From: Heidi Dvinge
> To: Deborah Ung
> Sent: Friday 22 June 2012, 11:33
> Subject: Re: HTqPCR
>
> Hello Deborah,
>>
>> Good morning Heidi Dvinge,
>>
>> I am a French student and I am currently a trainee in a biotechnology
>> company. I would like to know if you have other documentations about
>> Limma
>> applied to HTqPCR because I have some problems to analyze my results.
>>
> I'm afraid the only documentation is the examples in the help files for
> limmaCtData, and the examples in the vignette section 10, i.e.
>
> ?limmaCtData
> openVignette("HTqPCR")
>
> If you can email the commands you used and the resulting error messages,
> (or why you think the results don't make sense), we can try to dissect the
> problem.
>
> Best
> \Heidi
>
>> Sincerely,
>>
>> Deborah.
>>

From alejandro.reyes at embl.de Tue Jun 26 11:14:19 2012
From: alejandro.reyes at embl.de (Alejandro Reyes)
Date: Tue, 26 Jun 2012 11:14:19 +0200
Subject: [BioC] A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes
In-Reply-To:
References:
Message-ID: <4FE97D6B.7090808@embl.de>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From lawrence.michael at gene.com Tue Jun 26 12:41:06 2012
From: lawrence.michael at gene.com (Michael Lawrence)
Date: Tue, 26 Jun 2012 03:41:06 -0700
Subject: [BioC] matrix like object with Rle columns
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From k.brand at erasmusmc.nl Tue Jun 26 12:43:05 2012
From: k.brand at erasmusmc.nl (Karl Brand)
Date: Tue, 26 Jun 2012 12:43:05 +0200
Subject: [BioC] How do you find your orthologues?
Message-ID: <4FE99239.9080702@erasmusmc.nl>

Esteemed Bioconductor UseRs and Devs,

How do you find your orthologues?

We have a variety of data (proteomic, expression) from a variety of species (mouse, rat, pig, human) with lots of primary identifiers (GI-number, Ensembl gene ID, gene symbol). What we need to do is take such an identifier, say a mouse Ensembl gene ID, and have a list of identifiers for the mapped orthologue (in, say, rat and pig) returned. It seems Compara will do this, via a Perl API

http://www.ensembl.org/info/docs/api/compara/index.html

Can it also be accessed via R/Bioconductor? Which package? biomaRt? Or is there another Bioconductor package employed for this task I should be looking at?
With thanks in advance for tips and reflections before i spend another day hunting for a package to achieve this, Karl -- Karl Brand Dept of Cardiology and Dept of Bioinformatics Erasmus MC Dr Molewaterplein 50 3015 GE Rotterdam T +31 (0)10 703 2460 |M +31 (0)642 777 268 |F +31 (0)10 704 4161 From mail.yong.li at googlemail.com Tue Jun 26 13:11:19 2012 From: mail.yong.li at googlemail.com (Yong Li) Date: Tue, 26 Jun 2012 13:11:19 +0200 Subject: [BioC] Using limma for quantitative proteomics data In-Reply-To: <3570_1340691640_4FE954B8_3570_19469_1_OF6688FF2C.B036D1C4-ONC1257A29.00227A6B-C1257A29.0022D9EE@actelion.com> References: <3570_1340691640_4FE954B8_3570_19469_1_OF6688FF2C.B036D1C4-ONC1257A29.00227A6B-C1257A29.0022D9EE@actelion.com> Message-ID: Dear Axel, thanks for your answer. You are right, a matrix can be given to lmFit() and in this case just the Amean is not calculated in the returned object. Best regards, Yong On Tue, Jun 26, 2012 at 8:19 AM, wrote: > Dear Yong, > > I don't think you need an MAList -- all limma functions will accept a > simple matrix > of your log2 ratios... or at least, all limma functions I have ever used, > will do that... :-) > > Cheers, > > ?- axel > > > Axel Klenk > Research Informatician > Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / > Switzerland > > > > > From: > Yong Li > To: > bioconductor at r-project.org > Date: > 26.06.2012 00:01 > Subject: > Re: [BioC] Using limma for quantitative proteomics data > Sent by: > bioconductor-bounces at r-project.org > > > > Dear Aaron, > > thank you and others for suggestions. My data is really ratios and not > absolute values for normal and tumor. Sorry that I am still not quite > sure how to move forward with limma when I take log2 of the ratios. It > looks like I then will have the M component of the MAList, but how can > I construct the A to make an MAList? Or I am missing something here? 
> > Kind regards, > Yong > > On Tue, Jun 19, 2012 at 11:09 PM, Aaron Mackey > wrote: >> There's a thread on the bioconductor mailing list about using voom for >> RSEM-based RNA-seq quantification, in which ?Gordon Smythe explained > that >> while voom() was designed for count data, it doesn't require it. ?As Tim >> Triche has suggested, if you're raw data is really ratios (and not > absolute >> values for normal and tumor), then you should take log2 of those ratios > and >> use limma from there; you can then also hijack the arrayQualityMetrics >> package to check QC (MA plots, mean-variance relationships, etc.) >> >> -Aaron >> >> On Tue, Jun 19, 2012 at 3:39 PM, Yong Li >> wrote: >>> >>> Dear Aaron, >>> >>> thank you for your quick answer! I have checked the help page of >>> voom() but it seems to be used for count data. My data are just >>> tumor/normal ratios. I am wondering if you could provide more details? >>> >>> Best regards, >>> Yong >>> >>> On Tue, Jun 19, 2012 at 8:18 PM, Aaron Mackey >>> wrote: >>> > yes, it should be possible with a voom()-based analysis to get the >>> > variances >>> > "right". >>> > >>> > -Aaron >>> > >>> > On Tue, Jun 19, 2012 at 12:47 PM, Yong Li > >>> > wrote: >>> >> >>> >> Hello, >>> >> >>> >> limma has been so valuable in microarray data analysis, but has > anyone >>> >> used limma for finding differentially expressed proteins from >>> >> quantitative proteomics data? The data I got has tumor/normal ratios >>> >> of thousands proteins, and both tumor and normal have a number of >>> >> replicates. Could such data be analyzed with limma? >>> >> >>> >> If limma can not be used here, what statistics method is suitable so >>> >> that we can get statistically significant proteins with p-values? > Any >>> >> suggestion is appreciated. 
>>> >> >>> >> Kind regards, >>> >> Yong >>> >> >>> >> _______________________________________________ >>> >> Bioconductor mailing list >>> >> Bioconductor at r-project.org >>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> Search the archives: >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> > >>> > >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > The information of this email and in any file transmitted with it is strictly confidential and may be legally privileged. > It is intended solely for the addressee. If you are not the intended recipient, any copying, distribution or any other use of this email is prohibited and may be unlawful. In such case, you should please notify the sender immediately and destroy this email. > The content of this email is not legally binding unless confirmed by letter. > Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorised to state them to be the views of the sender's company. For further information about Actelion please see our website at http://www.actelion.com > From alejandro.reyes at embl.de Tue Jun 26 13:16:18 2012 From: alejandro.reyes at embl.de (Alejandro Reyes) Date: Tue, 26 Jun 2012 13:16:18 +0200 Subject: [BioC] DEXSeq question In-Reply-To: References: Message-ID: <4FE99A02.9020208@embl.de> An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From alejandro.reyes at embl.de Tue Jun 26 13:17:33 2012 From: alejandro.reyes at embl.de (Alejandro Reyes) Date: Tue, 26 Jun 2012 13:17:33 +0200 Subject: [BioC] interactions between variables in DEXSeq In-Reply-To: <4FE99927.8090603@embl.de> References: <4FE99927.8090603@embl.de> Message-ID: <4FE99A4D.6050907@embl.de> Dear Elena, Thank you for your email. If I understood correctly, to test for an interaction between condition and type with the exons, it would be like this (example in the pasilla dataset): data("pasillaExons", package="pasilla") pasillaExons<- estimateSizeFactors( pasillaExons ) pasillaExons<- estimateDispersions( pasillaExons, count ~ sample + (condition * type) * exon ) formula0<- count ~ sample + exon + condition + type + condition:type formula1<- count ~ sample + exon + condition + type + (condition:type) * I(exon == exonID) pasillaExons<- testForDEU( pasillaExons, formula0=formula0, formula1=formula1 ) Best wishes, Alejandro Reyes > Hello again, > > I'm writing this time to ask about setting up more complex formulae to > test for significant interaction between independent variables in > DEXSeq. Using the example from the manual, to test for an interaction > between library type and condition, how would I set this up? The > syntax here is a bit more involved than with DESeq, and I can't seem > to find anything in the archives that answers my question... > > Thanks, > Elena > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From hrh at fmi.ch Tue Jun 26 15:00:48 2012 From: hrh at fmi.ch (Hans-Rudolf Hotz) Date: Tue, 26 Jun 2012 15:00:48 +0200 Subject: [BioC] How do you find your orthologues? 
In-Reply-To: <4FE99239.9080702@erasmusmc.nl>
References: <4FE99239.9080702@erasmusmc.nl>
Message-ID: <4FE9B280.1000304@fmi.ch>

Hi Karl

Yes, you can do it with biomaRt. Have a look at the 'getLDS' function.

quick example:

> library(biomaRt)
>
> ensembl <- useMart("ensembl")
>
> human = useDataset("hsapiens_gene_ensembl", mart=ensembl)
> mouse = useMart("ensembl", dataset="mmusculus_gene_ensembl")
>
> getLDS(attributes=c("ensembl_gene_id"), filters="ensembl_gene_id", values=c("ENSMUSG00000021111"), mart=mouse,attributesL=c("hgnc_symbol", "ensembl_gene_id"), martL=human)
                  V1     V2              V3
1 ENSMUSG00000021111 PAPOLA ENSG00000090060
>

Hope this helps

Regards, Hans

On 06/26/2012 12:43 PM, Karl Brand wrote:
> Esteemed Bioconductor UseRs and Devs,
>
> How do you find your orthologues?
>
> We have a variety of data (proteomic, expression) from a variety of
> species (mouse, rat, pig, human) with lots of primary identifiers
> (GI-number, ensembl gene ID, gene-symbol). What we need to do is take
> such an identifier, say a mouse ensembl gene ID, and have a list of
> identifiers for the mapped orthologue (to say rat and pig) returned. It
> seems compara will do this, via a Perl API
>
> http://www.ensembl.org/info/docs/api/compara/index.html
>
> Can it also be accessed via R/BioConductor? Which package? BiomaRt? Or
> is there another BioConductor package employed for this task i should be
> looking at?
>
> With thanks in advance for tips and reflections before i spend another
> day hunting for a package to achieve this,
>
> Karl
>
>

From alessandro.brozzi at gmail.com Tue Jun 26 15:15:21 2012
From: alessandro.brozzi at gmail.com (alessandro brozzi)
Date: Tue, 26 Jun 2012 15:15:21 +0200
Subject: [BioC] How do you find your orthologues?
In-Reply-To: <4FE99239.9080702@erasmusmc.nl>
References: <4FE99239.9080702@erasmusmc.nl>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available URL: From guest at bioconductor.org Tue Jun 26 15:58:51 2012 From: guest at bioconductor.org (narges [guest]) Date: Tue, 26 Jun 2012 06:58:51 -0700 (PDT) Subject: [BioC] pvalue and padj in DESeq Message-ID: <20120626135851.0B56B139009@mamba.fhcrc.org> Hi i have a proper count table of RNA-Seq and i have applied edgeR package for obtaining differentially expressed genes and I have obtained nice acceptable result. But now I am applying also DESeq over the same data but the pval and padj columns of the nbinomTest over them is strange, it is almost 1.00 or NA. Why is this so? thanks a lot for your answers in advance -- output of sessionInfo(): > head (res$padj) [1] 1 NA 1 1 1 1 -- Sent via the guest posting facility at bioconductor.org. From christopher.fletez-brant at nih.gov Tue Jun 26 16:03:38 2012 From: christopher.fletez-brant at nih.gov (Fletez-Brant, Christopher (NIH/VRC) [C]) Date: Tue, 26 Jun 2012 10:03:38 -0400 Subject: [BioC] HTqPCR - limmaCtData using samples with different number of replicates? Message-ID: Dear List, I am using HTqPCR to analyze Fluidigm data (specifically, the 96*96 format). I have been using specifically limmaCtData to look for differential expression between groups, always having the same number of replicates. However, I currently have a dataset of several groups with 3 replicates and 1 group with 2 replicates. As far as I can tell, limmaCtData expects the number of replicates to be the same between groups (i.e. there is only one 'ndups' parameter). Has anyone else encountered this issue? Thank you, Kipper Fletez-Brant From hyao at mdanderson.org Tue Jun 26 16:24:42 2012 From: hyao at mdanderson.org (Yao,Hui) Date: Tue, 26 Jun 2012 09:24:42 -0500 Subject: [BioC] A question with DEXSeq package: inconsistency between normalized counts vs. 
fitted expression, fitted splicing or fold changes In-Reply-To: <4FE97D6B.7090808@embl.de> References: <4FE97D6B.7090808@embl.de> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From bulak.arpat at unil.ch Mon Jun 25 16:54:46 2012 From: bulak.arpat at unil.ch (Bulak Arpat) Date: Mon, 25 Jun 2012 14:54:46 +0000 Subject: [BioC] =?utf-8?q?DEXSeq=3A_problem_with_dexseq=5Fprepare=5Fannota?= =?utf-8?q?tion=2Epy?= References: Message-ID: Stephen Turner writes: > > Alejandro, Simon, Wolfgang, et al.: > > I'm trying to use the dexseq_prepare_annotation.py script to parse the > UCSC hg18 genes.gtf GTF file included with the Illumina igenomes > packages (http://tophat.cbcb.umd.edu/igenomes.html). I'm getting an > error: > > Traceback (most recent call last): > File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 93, in > raise ValueError, "Same name found on two chromosomes: %s, %s" % ( > str(l[i]), str(l[i+1]) ) > ValueError: Same name found on two chromosomes: exonic_part 'CFB' at chr6_qbl_hap2: 3167392 -> 3167602 (strand '+')>, > > 3360325 (strand '+')> > > I'm guessing this is because the same gene name is found in two > separate places. I'm not entirely sure what these two chromosomal > segments refer to, but I removed them from the GTF file and the python > script threw another error: > > Traceback (most recent call last): > File "/home/sdt5z/bin/dexseq_prepare_annotation.py", line 91, in > assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early" > AssertionError: chr1: 148079388 -> 148078883 (strand '-')> starts too early > > I'm really unsure what to make of this or how to fix it. The script > works without issues with the Ensembl GTF. Any help would be greatly > appreciated. > > Stephen > > ----------------------------------------- > Stephen D. Turner, Ph.D. 
> Bioinformatics Core Director > University of Virginia School of Medicine > bioinformatics.virginia.edu > > _______________________________________________ > Bioconductor mailing list > Bioconductor at ... > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > Dear Stephen, I had the same problem when I tried dexseq_prepare_annotation.py with the mm9 or mm10 GTF files from the Illumina igenomes collection. And like you have mentioned it worked well with an Ensembl version. Going through the script and the data files, I have realized all the problems go back to one root: Ensembl has unique gene_id for each locus whereas other files have gene_id generated from gene_name attribute. This replicates the gene_id for some loci as there are multiple (cis/trans) coding regions. For a quick fix I have done the following modification to the script (around line# 28): f.attr['gene_id'] = f.iv.chrom + '_' + f.attr['gene_id'].replace( ":", "_" ) + f.iv.strand This generates a 'unique' gene_id for the script by combining the chromosome number, gene name and strand information. As I said, it is a quick fix but it seems to work so far without problems. I hope it might be of use for you. Best, Bulak Arpat, PhD Bioinformatician Center for Integrative Genomics University of Lausanne From m.sara at imperial.ac.uk Tue Jun 26 01:47:45 2012 From: m.sara at imperial.ac.uk (Mitchell, Sara N) Date: Mon, 25 Jun 2012 23:47:45 +0000 Subject: [BioC] Compressed boxplots after 'normexp+offset' background correction of Agilent one color microarrays in LIMMA Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From guest at bioconductor.org Tue Jun 26 17:07:01 2012
From: guest at bioconductor.org (Gavin Blackburn [guest])
Date: Tue, 26 Jun 2012 08:07:01 -0700 (PDT)
Subject: [BioC] mzR error
Message-ID: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org>

We are getting the following error when trying to install and run mzR on a 64-bit Windows 7 machine:

> library(mzR)
Loading required package: Rcpp
Error : .onLoad failed in loadNamespace() for 'mzR', details:
  call: value[[3L]](cond)
  error: failed to load module Ramp from package mzR
could not find function "errorOccured"
Error: package/namespace load failed for 'mzR'

Do you know what might be causing it?

Cheers,

Gavin.

 -- output of sessionInfo():

sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Rcpp_0.9.12 BiocInstaller_1.4.7

loaded via a namespace (and not attached):
[1] Biobase_2.16.0 BiocGenerics_0.2.0 tools_2.15.1

--
Sent via the guest posting facility at bioconductor.org.

From heidi at ebi.ac.uk Tue Jun 26 18:14:46 2012
From: heidi at ebi.ac.uk (Heidi Dvinge)
Date: Tue, 26 Jun 2012 17:14:46 +0100
Subject: [BioC] HTqPCR - limmaCtData using samples with different number of replicates?
In-Reply-To:
References:
Message-ID:

Hi Kipper,

> Dear List,
>
> I am using HTqPCR to analyze Fluidigm data (specifically, the 96*96
> format). I have been using specifically limmaCtData to look for
> differential expression between groups, always having the same number of
> replicates. However, I currently have a dataset of several groups with 3
> replicates and 1 group with 2 replicates.
As far as I can tell,
> limmaCtData expects the number of replicates to be the same between groups
> (i.e. there is only one 'ndups' parameter). Has anyone else encountered
> this issue?
>

ndups actually refers to the number of replicate *features*, i.e. genes, on your plate. As long as you use the same platform for all your samples, this should be constant.

The number of samples within each group is indicated using the design matrix in limmaCtData, and the parameter 'groups' in ttestCtData. So in limmaCtData you can just use a design matrix along the lines of:

> samples <- rep(c("treat1", "treat2", "treat3", "control"), c(3,2,3,3))
> design <- model.matrix(~0+samples)

You then indicate your comparisons of interest in the contrast matrix, and use both matrices in your call to limmaCtData.

If I've misunderstood you and you actually have a variable number of replicates of each gene, then I'm afraid limmaCtData isn't capable of handling this with ndups.

HTH
\Heidi

> Thank you,
> Kipper Fletez-Brant
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

From guest at bioconductor.org Tue Jun 26 18:17:05 2012
From: guest at bioconductor.org (narges [guest])
Date: Tue, 26 Jun 2012 09:17:05 -0700 (PDT)
Subject: [BioC] DESeq analysis
Message-ID: <20120626161705.06E4D13ADBF@mamba.fhcrc.org>

Hi all,

I am doing some RNA-seq analysis with DESeq. I have applied nbinomTest to my dataset, which I know has many differentially expressed genes. The first problem is that the values in the "padj" column are almost all NA, and sometimes 1. Also, when I take a slice of my data frame, the result is not meaningful to me.
 -- output of sessionInfo():

res <- nbinomTest(cds, "Male", "Female")
> head(res)
               id   baseMean baseMeanA  baseMeanB foldChange log2FoldChange
1 ENSG00000000003  0.1130534  0.000000  0.2261067        Inf            Inf
2 ENSG00000000005  0.0000000  0.000000  0.0000000        NaN            NaN
3 ENSG00000000419 14.3767155 17.162610 11.5908205  0.6753530     -0.5662863
4 ENSG00000000457 17.0174761 15.342800 18.6921526  1.2183013      0.2848710
5 ENSG00000000460  3.9414822  2.855099  5.0278659  1.7610131      0.8164056
6 ENSG00000000938 16.0894945 18.350117 13.8288718  0.7536122     -0.4081058
       pval padj
1 0.9959638    1
2        NA   NA
3 0.3208560    1
4 0.5942512    1
5 0.4840607    1
6 0.5409953    1
> res1 <- res[res$padj<0.1,]
> head(res1)
       id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
NA   <NA>       NA        NA        NA         NA             NA   NA   NA
NA.1 <NA>       NA        NA        NA         NA             NA   NA   NA
NA.2 <NA>       NA        NA        NA         NA             NA   NA   NA
NA.3 <NA>       NA        NA        NA         NA             NA   NA   NA
NA.4 <NA>       NA        NA        NA         NA             NA   NA   NA
NA.5 <NA>       NA        NA        NA         NA             NA   NA   NA

My first question is why, although I know there are some differentially expressed genes in my data, all the padj values are NA or 1. My second question concerns the "NA.1", "NA.2", ... labels that appear as the first column of the object "res1" instead of the gene names.

Thank you so much

Regards

--
Sent via the guest posting facility at bioconductor.org.

From laurent.gatto at gmail.com Tue Jun 26 18:19:57 2012
From: laurent.gatto at gmail.com (Laurent Gatto)
Date: Tue, 26 Jun 2012 17:19:57 +0100
Subject: [BioC] mzR error
In-Reply-To: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org>
References: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org>
Message-ID:

Dear Gavin,

I can't reproduce this, but I do not have the same configuration at hand for the moment - this could be an incompatibility with the latest Rcpp.
What version of mzR have you - packageVersion("mzR")

Best wishes,

Laurent

On 26 June 2012 16:07, Gavin Blackburn [guest] wrote:
>
> We are getting the following error when trying to install and run mzR on a 64-bit Windows 7 machine:
>> library(mzR)
> Loading required package: Rcpp
> Error : .onLoad failed in loadNamespace() for 'mzR', details:
>   call: value[[3L]](cond)
>   error: failed to load module Ramp from package mzR
> could not find function "errorOccured"
> Error: package/namespace load failed for 'mzR'
>
>
> Do you know what might be causing it?
>
> Cheers,
>
> Gavin.
>
>
>  -- output of sessionInfo():
>
> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252
> [2] LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Rcpp_0.9.12 BiocInstaller_1.4.7
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.16.0 BiocGenerics_0.2.0 tools_2.15.1
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
[ Laurent Gatto | slashhome.be ]

From k.brand at erasmusmc.nl Tue Jun 26 18:20:58 2012
From: k.brand at erasmusmc.nl (Karl Brand)
Date: Tue, 26 Jun 2012 18:20:58 +0200
Subject: [BioC] How do you find your orthologues?
In-Reply-To:
References: <4FE99239.9080702@erasmusmc.nl>
Message-ID: <4FE9E16A.1070108@erasmusmc.nl>

Bruce, Hans, Alex,

Many thanks for the tips and the example.

Both biomaRt and HomoVert look excellent to map across species.
But, and i could be wrong, i could be out of luck for a direct mapping to/from GI numbers.

I'll either be back with a solution. Or a new thread :)

Thanks again,

Karl

On 26/06/12 12:47, Bruce Moran(External) wrote:
> Hi Karl,
>
> biomaRt is great for this:
>
> http://www.bioconductor.org/packages/2.2/bioc/vignettes/biomaRt/inst/doc/biomaRt.pdf
>
> Pretty nice walkthroughs in the manual above too.
>
> Good luck,
>
> Bruce.
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org
> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Karl Brand
> Sent: 26 June 2012 11:43
> To: bioconductor at r-project.org
> Subject: [BioC] How do you find your orthologues?
>
> Esteemed Bioconductor UseRs and Devs,
>
> How do you find your orthologues?
>
> We have a variety of data (proteomic, expression) from a variety of
> species (mouse, rat, pig, human) with lots of primary identifiers
> (GI-number, ensembl gene ID, gene-symbol). What we need to do is take
> such an identifier, say a mouse ensembl gene ID, and have a list of
> identifiers for the mapped orthologue (to say rat and pig) returned. It
> seems compara will do this, via a Perl API
>
> http://www.ensembl.org/info/docs/api/compara/index.html
>
> Can it also be accessed via R/BioConductor? Which package? BiomaRt? Or
> is there another BioConductor package employed for this task i should be
> looking at?
> > With thanks in advance for tips and reflections before i spend another > day hunting for a package to achieve this, > > Karl > > -- Karl Brand Dept of Cardiology and Dept of Bioinformatics Erasmus MC Dr Molewaterplein 50 3015 GE Rotterdam T +31 (0)10 703 2460 |M +31 (0)642 777 268 |F +31 (0)10 704 4161 From heidi at ebi.ac.uk Tue Jun 26 18:21:11 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Tue, 26 Jun 2012 17:21:11 +0100 Subject: [BioC] Compressed boxplots after 'normexp+offset' background correction of Agilent one color microarrays in LIMMA In-Reply-To: References: Message-ID: <3705fd84be4a412ad054f67a216b6ce9.squirrel@webmail.ebi.ac.uk> > Dear All, > Hi Sara, > I am currently using an Agilent 4x44K model organism mosquito array for a > one color time-course course experiment. I am analyzing using the LIMMA > package and have performed background correction and between array > normalization as follows. > > RoBb.corr <- backgroundCorrect(RoB, method="normexp", offset=16) > > RoBb.corr.norm <- normalizeBetweenArrays(RoBb.corr, method="quantile") > > However, after background correction the boxplot for a number of arrays > become compressed (see example here: > https://dl.dropbox.com/u/407047/Work/Catteruccia/minExample.html ). > With such an extreme correction, it looks like it might be your arrays that are the problem, rather than the specific background correction method. Have you tried producing similar boxplots just for the background values? Or even looking at the actual images from the scan. If some of the arrays have a uniformly high signal for both foreground and background values, it could indicate that the hybridisation somehow failed (maybe too much salt in the sample, which causes an all-over high intensity?). Apart from that, yes I have tried using Agilent arrays without any background correction methods. But I definitely wouldn't recommend it in this case, until you figure out what's going on with those outlier arrays. 
And I would NOT recommend using quantile normalisation just to make the distributions on all arrays similar when they're so highly different to begin with.

HTH
\Heidi

> I am not sure what is causing this compression although the quantile
> between array normalisation seems to correct for this. However I am
> concerned about the possible effect on the data. Has anyone else seen
> this compression with normexp correction?
>
> I have read that background correction is not always optimal for Agilent
> arrays (Zahurak et al. 2007 BMC Bioinformatics).
>
>
> Do others routinely omit the background correction for Agilent arrays?
>
> Best regards
>
>
> Dr Sara Mitchell
>
> Imperial College London
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

From alejandro.reyes at embl.de Tue Jun 26 18:28:37 2012
From: alejandro.reyes at embl.de (Alejandro Reyes)
Date: Tue, 26 Jun 2012 18:28:37 +0200
Subject: [BioC] A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes
In-Reply-To:
References: <4FE97D6B.7090808@embl.de>
Message-ID: <4FE9E335.5090907@embl.de>

Dear Yao Hui,

It was an error in the code: the levels in one of your variables in the design matrix were not sorted! At some point part of the code was (wrongly) relying on the order of the factors, so the conditions were mixed up.

You could stay with the same version and do this before starting the analysis:

levels( pData(testgene)$type ) <- sort( levels( pData(testgene)$type ) )

Or update to version DEXSeq_1.3.3 (it is already fixed there).

Apologies for the bug!
Alejandro

From laurent.gatto at gmail.com Tue Jun 26 18:37:48 2012
From: laurent.gatto at gmail.com (Laurent Gatto)
Date: Tue, 26 Jun 2012 17:37:48 +0100
Subject: [BioC] mzR error
In-Reply-To:
References: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org>
Message-ID:

On 26 June 2012 17:19, Laurent Gatto wrote:
> Dear Gavin,
>
> I can't reproduce this, but I do not have the same configuration at
> hand for the moment - this could be an incompatibility with the latest
> Rcpp.

Ok, I can now reproduce on a Windows box with mzR 1.2.1 (latest stable) and Rcpp 0.9.12. Downgrading to Rcpp 0.9.10 [1] fixes the issue. I will bring it up on the Rcpp list.

Thank you for the report.

Laurent

[1] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip

> What version of mzR have you - packageVersion("mzR")
>
> Best wishes,
>
> Laurent
>
> On 26 June 2012 16:07, Gavin Blackburn [guest] wrote:
>>
>> We are getting the following error when trying to install and run mzR on a 64-bit Windows 7 machine:
>>> library(mzR)
>> Loading required package: Rcpp
>> Error : .onLoad failed in loadNamespace() for 'mzR', details:
>>   call: value[[3L]](cond)
>>   error: failed to load module Ramp from package mzR
>> could not find function "errorOccured"
>> Error: package/namespace load failed for 'mzR'
>>
>>
>> Do you know what might be causing it?
>>
>> Cheers,
>>
>> Gavin.
>>
>>
>>  -- output of sessionInfo():
>>
>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United Kingdom.1252
>> [2] LC_CTYPE=English_United Kingdom.1252
>> [3] LC_MONETARY=English_United Kingdom.1252
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=English_United Kingdom.1252
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Rcpp_0.9.12 BiocInstaller_1.4.7
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.16.0
BiocGenerics_0.2.0 tools_2.15.1 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > [ Laurent Gatto | slashhome.be ] -- [ Laurent Gatto | slashhome.be ] From jmacdon at uw.edu Tue Jun 26 18:45:45 2012 From: jmacdon at uw.edu (James W. MacDonald) Date: Tue, 26 Jun 2012 12:45:45 -0400 Subject: [BioC] How do you find your orthologues? In-Reply-To: <4FE9E16A.1070108@erasmusmc.nl> References: <4FE99239.9080702@erasmusmc.nl> <4FE9E16A.1070108@erasmusmc.nl> Message-ID: <4FE9E739.4030209@uw.edu> Hi Karl, You might also look at the hom.Xx.inp.db packages, which map orthologues via InParanoid. There is a convenience function inpIDMapper() in AnnotationDbi that does the mappings. Or you could have some SQL fun and do your own ;-D Best, Jim On 6/26/2012 12:20 PM, Karl Brand wrote: > Bruce, Hans, Alex, > > Many thanks for the tips and the example. > > Both biomaRt and HomoVert look excellent to map across species. But, > and i could be wrong, i could be out of luck for for a direct mapping > to/from GI numbers. > > I'll either be back with a solution. Or a new thread :) > > Thanks again, > > Karl > > > On 26/06/12 12:47, Bruce Moran(External) wrote: >> Hi Karl, >> >> biomaRt is great for this: >> >> http://www.bioconductor.org/packages/2.2/bioc/vignettes/biomaRt/inst/doc >> /biomaRt.pdf >> >> Pretty nice walkthroughs in the manual above too. >> >> Good luck, >> >> Bruce. >> >> -----Original Message----- >> From: bioconductor-bounces at r-project.org >> [mailto:bioconductor-bounces at r-project.org] On Behalf Of Karl Brand >> Sent: 26 June 2012 11:43 >> To: bioconductor at r-project.org >> Subject: [BioC] How do you find your orthologues? 
>> >> Esteemed Bioconductor UseRs and Devs, >> >> How do you find your orthologues? >> >> We have a variety of data (proteomic, expression) from a variety of >> species (mouse, rat, pig, human) with lots of primary identifiers >> (GI-number, ensembl gene ID, gene-symbol). What we need to do is take >> such an identifier, say a mouse ensembl gene ID, and have a list of >> identifers for the mapped orthologue (to say rat and pig) returned. It >> seems compara will do this, via a Perl API >> >> http://www.ensembl.org/info/docs/api/compara/index.html >> >> Can it also be accessed via R/BioConductor? WHich package? BiomaRt? Or >> is there another BioConductor package employed for this task i should be >> >> looking at? >> >> With thanks in advance for tips and reflections before i spend another >> day hunting for a package to achieve this, >> >> Karl >> >> > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 From hyao at mdanderson.org Tue Jun 26 19:16:38 2012 From: hyao at mdanderson.org (Yao,Hui) Date: Tue, 26 Jun 2012 12:16:38 -0500 Subject: [BioC] A question with DEXSeq package: inconsistency between normalized counts vs. fitted expression, fitted splicing or fold changes In-Reply-To: <4FE9E335.5090907@embl.de> References: <4FE97D6B.7090808@embl.de> <4FE9E335.5090907@embl.de> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From christopher.fletez-brant at nih.gov Tue Jun 26 19:32:31 2012 From: christopher.fletez-brant at nih.gov (Fletez-Brant, Christopher (NIH/VRC) [C]) Date: Tue, 26 Jun 2012 13:32:31 -0400 Subject: [BioC] HTqPCR - limmaCtData using samples with different number of replicates? In-Reply-To: Message-ID: Hi Heidi, Thanks for setting me straight - I in fact have a panel of 96 unrepeated features, used to produce experiments with a variable number of replicates per group. This works very well. 
Thanks! On 6/26/12 12:14 PM, "Heidi Dvinge" wrote: >Hi Kipper, > >> Dear List, >> >> I am using HTqPCR to analyze Fluidigm data (specifically, the 96*96 >> format). I have been using specifically limmaCtData to look for >> differential expression between groups, always having the same number of >> replicates. However, I currently have a dataset of several groups with >>3 >> replicates and 1 group with 2 replicates. As far as I can tell, >> limmaCtData expects the number of replicates to be the same between >>groups >> (i.e. there is only one 'ndups' parameter). Has anyone else encountered >> this issue? >> >ndups actually refers to the number of replicate *features*, i.e. genes, >on your plate. As long as you use the same platform for all your samples, >this should be constant. > >The number of samples within each group is indicated using the design >matrix in limmaCtData, and the parameter 'groups' in ttestCtData. So in >limmaCtData you can just use a design matrix along the lines of: > >> samples <- rep(c("treat1", "treat2", "treat3", "control"), c(3,2,3,3)) >> design <- model.matrix(~0+samples) > >You then indicate your comparisons of interest in the contrast matrix, and >use both matrices in your call to limmaCtData. > >If I've misunderstood you and you actually have a variable number of >replicates of each gene, then I'm afraid limmaCtData isn't capable of >handling this with ndups. > >HTH >\Heidi > >> Thank you, >> Kipper Fletez-Brant >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > From MEC at stowers.org Tue Jun 26 19:59:27 2012 From: MEC at stowers.org (Cook, Malcolm) Date: Tue, 26 Jun 2012 12:59:27 -0500 Subject: [BioC] nearest() for GRanges In-Reply-To: Message-ID: Excellent. All's well that ends well! Thanks much.... 
cooking with gas again....

--Malcolm

On 6/25/12 5:09 PM, "Dan Tenenbaum" wrote:

>On Mon, Jun 25, 2012 at 2:53 PM, Cook, Malcolm wrote:
>> Hi Valerie,
>>
>> Indeed good news.
>>
>> However, I am finding that this newest version is not yet available via
>> biocLite from repository at bioconductor.org. I am still picking up
>> 1.8.6 with biocLite('GenomicRanges').
>>
>> Should I expect to wait, or perhaps is there a 'push' at your end that
>> needs attending?
>>
>> Please advise if I'm expecting it to appear before its time ;)
>
>Our build cycle runs once a day, so expect to see the next version
>tomorrow morning around 10AM Seattle time. If you want to get it
>before then, you can check it out from the svn repository.
>
>Thanks,
>Dan
>
>
>>
>> Thanks!
>>
>> Malcolm
>>
>>
>>
>> On 6/25/12 1:53 PM, "Valerie Obenchain" wrote:
>>
>>>This is now fixed in release, BioC 2.10, GenomicRanges 1.8.7.
>>>
>>>Note the behavior of '*' is different from previous behavior (i.e., <= v
>>>1.8.6). Treatment of '*' ranges was one of the aspects we clarified and
>>>enforced in the recent update of precede, follow and nearest.
>>>
>>>Previously in release '*' was treated as a '+' range,
>>>
>>>g <- GRanges("chr1", IRanges(c(1,5,10), c(2,7,12)), "*")
>>> > g
>>>GRanges with 3 ranges and 0 elementMetadata cols:
>>>      seqnames    ranges strand
>>>         <Rle> <IRanges>  <Rle>
>>>  [1]     chr1  [ 1,  2]      *
>>>  [2]     chr1  [ 5,  7]      *
>>>  [3]     chr1  [10, 12]      *
>>>  ---
>>>  seqlengths:
>>>   chr1
>>>     NA
>>> > precede(g)
>>>[1] 2 3 NA
>>> > follow(g)
>>>[1] NA 1 2
>>> > nearest(g)
>>>[1] 2 1 2
>>>
>>>
>>>The new behavior of '*' (in both release and devel) considers both '+'
>>>and '-' possibilities. For details see the 'matching by strand' section
>>>described in precede() on the man page for ?GRanges.
>>>
>>> > precede(g)
>>>[1] 2 1 2
>>> > follow(g)
>>>[1] 2 1 2
>>> > nearest(g)
>>>[1] 2 1 2
>>>
>>>
>>>Valerie
>>>
>>>On 06/22/2012 03:25 PM, Cook, Malcolm wrote:
>>>> Great news, Valerie... thanks very much...
I will take immediate >>>>advantage >>>> of this... after re-reading your report of 'an overhaul' I would well >>>> understand if back-porting your fix in dev to release would be onerous >>>>to >>>> impossible. >>>> >>>> I hope it goes quickly and smoothly.... >>>> >>>> Cheers, >>>> >>>> Malcolm >>>> >>>> >>>> On 6/22/12 4:00 PM, "Valerie Obenchain" wrote: >>>> >>>>> On 06/20/2012 05:20 PM, Cook, Malcolm wrote: >>>>>> Hi Valerie, >>>>>> >>>>>> Very glad you found and fixed the root cause. >>>>>> >>>>>> I don't know the overhead it would take for you, but, this being a >>>>>> regression, might you consider fixing in Bioconductor 2.10 as, say >>>>>> GenomicRanges_1.8. >>>>>> >>>>> Yes, I will fix this in release too. If not today then first thing >>>>>next >>>>> week. >>>>> >>>>> Valerie >>>>>> Thanks for your consideration, >>>>>> >>>>>> Malcolm >>>>>> >>>>>> On 6/20/12 3:13 PM, "Valerie Obenchain" wrote: >>>>>> >>>>>>> Hi Oleg, Malcom, >>>>>>> >>>>>>> Thanks for the bug report. This is now fixed in devel 1.9.28. Over >>>>>>>the >>>>>>> past months we've done an overhaul of the precede/follow code in >>>>>>>devel. >>>>>>> The new nearest method is based on the new precede and follow and >>>>>>>is >>>>>>> documented at >>>>>>> >>>>>>> ?'nearest,GenomicRanges,GenomicRanges-method' >>>>>>> >>>>>>> Let me know if you run into problems. >>>>>>> >>>>>>> Valerie >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 06/18/2012 02:25 PM, Cook, Malcolm wrote: >>>>>>>> Martin, Oleg, Val, all, >>>>>>>> >>>>>>>> I too have script logic that benefitted from and depends upon what >>>>>>>>the >>>>>>>> behavior of nearest,GenomicRanges,missing as reported by Oleg. >>>>>>>> >>>>>>>> Thanks for the unit tests Martin. >>>>>>>> >>>>>>>> If it helps in sleuthing, in my hands, the 3rd test used to pass >>>>>>>>(if >>>>>>>> my >>>>>>>> memory serves), but does not pass now, as the attached transcript >>>>>>>> shows. 
>>>>>>>> >>>>>>>> Hoping it helps find a speedy resolution to this issue, >>>>>>>> >>>>>>>> ~ Malcolm Cook >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>>>>>> g<- GRanges("chr1", r, "+") >>>>>>>>> checkEquals(precede(r), precede(g)) >>>>>>>> [1] TRUE >>>>>>>>> checkEquals(follow(r), follow(g)) >>>>>>>> [1] TRUE >>>>>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>>>> Error in checkEquals(nearest(r), nearest(g)) : >>>>>>>> Mean relative difference: 0.6 >>>>>>>> >>>>>>>> >>>>>>>>> sessionInfo() >>>>>>>> R version 2.15.0 (2012-03-30) >>>>>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] C >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] tools splines parallel stats graphics >>>>>>>>grDevices >>>>>>>> utils >>>>>>>> datasets methods base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] RUnit_0.4.26 log4r_0.1-4 vwr_0.1 >>>>>>>> RecordLinkage_0.4-1 ffbase_0.5 ff_2.2-7 >>>>>>>> bit_1.1-8 evd_2.2-6 ipred_0.8-13 >>>>>>>> prodlim_1.3.1 KernSmooth_2.23-7 nnet_7.3-1 >>>>>>>> survival_2.36-14 mlbench_2.1-0 MASS_7.3-18 >>>>>>>> ada_2.0-2 rpart_3.1-53 e1071_1.6 >>>>>>>> class_7.3-3 XLConnect_0.1-9 XLConnectJars_0.1-4 >>>>>>>> rJava_0.9-3 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>>>>>>> lattice_0.20-6 doMC_1.2.5 multicore_0.1-7 >>>>>>>> [28] BSgenome_1.24.0 rtracklayer_1.16.1 Rsamtools_1.8.5 >>>>>>>> Biostrings_2.24.1 GenomicFeatures_1.8.1 AnnotationDbi_1.18.1 >>>>>>>> GenomicRanges_1.8.6 IRanges_1.14.3 Biobase_2.16.0 >>>>>>>> BiocGenerics_0.2.0 data.table_1.8.0 compare_0.2-3 >>>>>>>> svUnit_0.7-10 doParallel_1.0.1 iterators_1.0.6 >>>>>>>> foreach_1.4.0 ggplot2_0.9.1 sqldf_0.4-6.4 >>>>>>>> RSQLite.extfuns_0.0.1 RSQLite_0.11.1 chron_2.3-42 >>>>>>>> gsubfn_0.6-3 proto_0.3-9.2 DBI_0.2-5 >>>>>>>> functional_0.1 reshape_0.8.4 plyr_1.7.1 >>>>>>>> [55] stringr_0.6 gtools_2.6.2 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] RCurl_1.91-1 XML_3.9-4 biomaRt_2.12.0 >>>>>>>> 
bitops_1.0-4.1 >>>>>>>> codetools_0.2-8 colorspace_1.1-1 compiler_2.15.0 dichromat_1.2-4 >>>>>>>> digest_0.5.2 grid_2.15.0 labeling_0.1 memoise_0.1 >>>>>>>> munsell_0.3 reshape2_1.2.1 scales_0.2.1 stats4_2.15.0 >>>>>>>> tcltk_2.15.0 zlibbioc_1.2.0 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 6/18/12 2:39 PM, "Martin Morgan" wrote: >>>>>>>> >>>>>>>>> Hi Oleg -- >>>>>>>>> >>>>>>>>> On 06/17/2012 11:11 PM, Oleg Mayba wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I just noticed that a piece of logic I was relying on with >>>>>>>>>>GRanges >>>>>>>>>> before >>>>>>>>>> does not seem to work anymore. Namely, I expect the behavior of >>>>>>>>>> nearest() >>>>>>>>>> with a single GRanges object as an argument to be the same as >>>>>>>>>>that >>>>>>>>>> of >>>>>>>>>> IRanges (example below), but it's not anymore. I expect >>>>>>>>>> nearest(GR1) >>>>>>>>>> NOT >>>>>>>>>> to behave trivially but to return the closest range OTHER than >>>>>>>>>>the >>>>>>>>>> range >>>>>>>>>> itself. I could swear that was the case before, but isn't any >>>>>>>>>> longer: >>>>>>>>> I think you're right that there is an inconsistency here; Val >>>>>>>>>will >>>>>>>>> likely help clarify in a day or so. My two cents... 
>>>>>>>>> >>>>>>>>> I think, certainly, that GRanges on a single chromosome on the >>>>>>>>>"+" >>>>>>>>> strand should behave like an IRanges >>>>>>>>> >>>>>>>>> library(GenomicRanges) >>>>>>>>> library(RUnit) >>>>>>>>> >>>>>>>>> r<- IRanges(c(1,5,10), c(2,7,12)) >>>>>>>>> g<- GRanges("chr1", r, "+") >>>>>>>>> >>>>>>>>> ## first two ok, third should work but fails >>>>>>>>> checkEquals(precede(r), precede(g)) >>>>>>>>> checkEquals(follow(r), follow(g)) >>>>>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>>>>> >>>>>>>>> Also, on the "-" strand I think we're expecting >>>>>>>>> >>>>>>>>> g<- GRanges("chr1", r, "-") >>>>>>>>> >>>>>>>>> ## first two ok, third should work but fails >>>>>>>>> checkEquals(follow(r), precede(g)) >>>>>>>>> checkEquals(precede(r), follow(g)) >>>>>>>>> try(checkEquals(nearest(r), nearest(g))) >>>>>>>>> >>>>>>>>> For "*" (which was your example) I think the situation is (a) >>>>>>>>> different >>>>>>>>> in devel than in release; and (b) not so clear. In devel, "*" is >>>>>>>>> (from >>>>>>>>> method?"nearest,GenomicRanges,missing") "x on '*' strand can >>>>>>>>>match >>>>>>>>>to >>>>>>>>> ranges on any of ''+'', ''-'' or ''*''" and in particular I think >>>>>>>>> these >>>>>>>>> are always true: >>>>>>>>> >>>>>>>>> checkEquals(precede(g), follow(g)) >>>>>>>>> checkEquals(nearest(r), follow(g)) >>>>>>>>> >>>>>>>>> I would also expect >>>>>>>>> >>>>>>>>> try(checkEquals(nearest(g), follow(g))) >>>>>>>>> >>>>>>>>> though this seems not to be the case. In 'release', "*" is >>>>>>>>>coereced >>>>>>>>> and >>>>>>>>> behaves as if on the "+" strand (I think). 
>>>>>>>>> >>>>>>>>> Martin >>>>>>>>> >>>>>>>>>>> z=IRanges(start=c(1,5,10), end=c(2,7,12)) >>>>>>>>>>> z >>>>>>>>>> IRanges of length 3 >>>>>>>>>> start end width >>>>>>>>>> [1] 1 2 2 >>>>>>>>>> [2] 5 7 3 >>>>>>>>>> [3] 10 12 3 >>>>>>>>>>> nearest(z) >>>>>>>>>> [1] 2 1 2 >>>>>>>>>>> >>>>>>>>>>>z=GRanges(seqnames=rep('chr1',3),ranges=IRanges(start=c(1,5,10), >>>>>>>>>> end=c(2,7,12))) >>>>>>>>>>> z >>>>>>>>>> GRanges with 3 ranges and 0 elementMetadata cols: >>>>>>>>>> seqnames ranges strand >>>>>>>>>> >>>>>>>>>> [1] chr1 [ 1, 2] * >>>>>>>>>> [2] chr1 [ 5, 7] * >>>>>>>>>> [3] chr1 [10, 12] * >>>>>>>>>> --- >>>>>>>>>> seqlengths: >>>>>>>>>> chr1 >>>>>>>>>> NA >>>>>>>>>>> nearest(z) >>>>>>>>>> [1] 1 2 3 >>>>>>>>>>> sessionInfo() >>>>>>>>>> R version 2.15.0 (2012-03-30) >>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>>> >>>>>>>>>> locale: >>>>>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>>>>> >>>>>>>>>> attached base packages: >>>>>>>>>> [1] datasets utils grDevices graphics stats methods >>>>>>>>>>base >>>>>>>>>> >>>>>>>>>> other attached packages: >>>>>>>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3 BiocGenerics_0.2.0 >>>>>>>>>> >>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>> [1] stats4_2.15.0 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I want the IRanges behavior and not what seems currently to be >>>>>>>>>>the >>>>>>>>>> GRanges >>>>>>>>>> behavior, since I have some code that depends on it. Is there a >>>>>>>>>> quick >>>>>>>>>> way >>>>>>>>>> to make nearest() do that for me again? >>>>>>>>>> >>>>>>>>>> Thanks! >>>>>>>>>> >>>>>>>>>> Oleg. 
>>>>>>>>>> >>>>>>>>>> [[alternative HTML version deleted]] >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>>>>> >>>>>>>>>>http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> -- >>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>>>>>>> 1100 Fairview Ave. N. >>>>>>>>> PO Box 19024 Seattle, WA 98109 >>>>>>>>> >>>>>>>>> Location: Arnold Building M1 B861 >>>>>>>>> Phone: (206) 667-2793 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >>http://news.gmane.org/gmane.science.biology.informatics.conductor From smelov at buckinstitute.org Tue Jun 26 21:01:16 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Tue, 26 Jun 2012 12:01:16 -0700 Subject: [BioC] HTqPCR problems Message-ID: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> Hi, I'm having some troubles selectively sub-setting, and graphing up QPCR data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). 
I'd like to be able to visualize specific genes, with specific groups we run routinely on our Biomark system. Typical runs are across multiple plates, and have multiple biological replicates, and usually 2 or more technical replicates (although we are moving away from technical reps, as the CVs are so tight). Can anyone help with this? Heidi? raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, n.data=48, samples=samples) #Ive read the samples in from a separate file, as when you read it in, it doesnt take the sample names supplied in the biomark output# #Next, I want to plot genes of interest, with samples of interest, and I'm having trouble getting an appropriate output# g=featureNames(raw6)[1:2] plotCtOverview(raw6, genes=g, groups=groupID$Treatment, col=rainbow(5)) #This plots 1 gene across all 48 samples# #but the legend doesnt behave, its placed on top of the plot, and I cant get it to display in a non-overlapping fashion# #I've tried all sorts of things in par, but nothing seems to shift the legend's position# #I now want to plot a subset of the samples for specific genes# > LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== "LFY") > LOY Sample Treatment 2 L20 LFY 5 L30 LFY 7 L45 LO 20 L40 LO 27 L43 LO 33 L29 LFY 36 L38 LO 40 L39 LO 43 L23 LFY > plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) Warning messages: 1: In split.default(t(x), sample.split) : data length is not a multiple of split variable 2: In qt(p, df, lower.tail, log.p) : NaNs produced > #it displays the two groups defined by treatment, but doesnt do so nicely, very skinny bars, and the legend doesnt reflect what its displaying# #again, I've tried monkeying around with par, but not sure what HTqPCR is calling to make the plots# please help! thanks Simon. 
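A minimal base-R sketch of the subsetting step in the message above, assuming a small hypothetical groupID table (the first four rows are copied from the listing; the "L11"/"HI" row is invented for illustration). It only shows that subset() returns a data frame, while the treatment labels themselves live in one column of it. Whether plotCtOverview() accepts the labels in this form is not checked here.

```r
# Hypothetical sample key; in the message above it is read from a CSV file.
groupID <- data.frame(
  Sample    = c("L20", "L30", "L45", "L40", "L11"),
  Treatment = c("LFY", "LFY", "LO", "LO", "HI"),
  stringsAsFactors = FALSE
)

# subset() keeps whole rows, so LOY is a data frame, not a vector.
LOY <- subset(groupID, Treatment == "LO" | Treatment == "LFY")
is_df <- is.data.frame(LOY)

# The grouping labels are a single column of that data frame.
loy_groups <- LOY$Treatment

print(is_df)
print(loy_groups)
```

Note that the labels in loy_groups describe only the selected samples, so any data object they are paired with would need to be restricted to the same samples.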
From heidi at ebi.ac.uk Tue Jun 26 21:48:32 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Tue, 26 Jun 2012 20:48:32 +0100 Subject: [BioC] HTqPCR problems In-Reply-To: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> Message-ID: <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> > Hi, > I'm having some troubles selectively sub-setting, and graphing up QPCR > data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). I'd > like to be able to visualize specific genes, with specific groups we run > routinely on our Biomark system. Typical runs are across multiple plates, > and have multiple biological replicates, and usually 2 or more technical > replicates (although we are moving away from technical reps, as the CVs > are so tight). > > Can anyone help with this? Heidi? > > raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, > n.data=48, samples=samples) > #Ive read the samples in from a separate file, as when you read it in, it > doesnt take the sample names supplied in the biomark output# > #Next, I want to plot genes of interest, with samples of interest, and I'm > having trouble getting an appropriate output# > > g=featureNames(raw6)[1:2] > plotCtOverview(raw6, genes=g, groups=groupID$Treatment, col=rainbow(5)) > > #This plots 1 gene across all 48 samples# > #but the legend doesnt behave, its placed on top of the plot, and I cant > get it to display in a non-overlapping fashion# > #I've tried all sorts of things in par, but nothing seems to shift the > legend's position# > As the old saying goes, whenever you want a job done well, you'll have to do it yourself ;). In this case, the easiest thing is probably to use legend=FALSE in plotCtOverview, and then afterwards add it yourself in the desired location using legend(). 
That way, if you have a lot of different features or groups to display, you can also use the ncol parameter in legend to make several columns within the legend, such as 3x4 instead of the default 12x1. Alternatively, you can use either xlim or ylim in plotCtOverview to add some empty space on the side where there's then room for the legend. > #I now want to plot a subset of the samples for specific genes# >> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== "LFY") >> LOY > Sample Treatment > 2 L20 LFY > 5 L30 LFY > 7 L45 LO > 20 L40 LO > 27 L43 LO > 33 L29 LFY > 36 L38 LO > 40 L39 LO > 43 L23 LFY > > >> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) > Warning messages: > 1: In split.default(t(x), sample.split) : > data length is not a multiple of split variable > 2: In qt(p, df, lower.tail, log.p) : NaNs produced >> Does it make sense if you split by groups=LOY$Treatment? It looks like the object LOY itself is a data frame, rather than the expected vector. Also, you may have to 'repeat' the col=rainbow() argument to fit your number of features. > > #it displays the two groups defined by treatment, but doesnt do so nicely, > very skinny bars, and the legend doesnt reflect what its displaying# > #again, I've tried monkeying around with par, but not sure what HTqPCR is > calling to make the plots# > If the bars are very skinny, it's probably because you're displaying a lot of features. Nothing much to do about that, except increasing the width or your plot :(. \Heidi > please help! > > thanks > > Simon. 
> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > From huwenhuo at gmail.com Tue Jun 26 23:07:04 2012 From: huwenhuo at gmail.com (wenhuo hu) Date: Tue, 26 Jun 2012 17:07:04 -0400 Subject: [BioC] Kaplan Meier curve In-Reply-To: <5d680f43938e2546f548c59cfdc75634.squirrel@webmail.ebi.ac.uk> References: <5d680f43938e2546f548c59cfdc75634.squirrel@webmail.ebi.ac.uk> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From ysapkota at ualberta.ca Wed Jun 27 00:14:31 2012 From: ysapkota at ualberta.ca (Yadav Sapkota) Date: Tue, 26 Jun 2012 16:14:31 -0600 Subject: [BioC] Overlapping ROC plots Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From smyth at wehi.EDU.AU Wed Jun 27 00:57:58 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Wed, 27 Jun 2012 08:57:58 +1000 (AUS Eastern Standard Time) Subject: [BioC] edge R design matrix pairwise comparison at different time points after infection with replicates In-Reply-To: <3D4A97F14E343F4584925219C1C1ACEF05B9672F@ICTS-S-MBX7.luna.kuleuven.be> References: <3D4A97F14E343F4584925219C1C1ACEF05B965F0@ICTS-S-MBX7.luna.kuleuven.be> <3D4A97F14E343F4584925219C1C1ACEF05B9672F@ICTS-S-MBX7.luna.kuleuven.be> Message-ID: On Tue, 26 Jun 2012, Kaat De Cremer wrote: > Dear Gordon, > > I do normalize the data, after filtering for low expressed genes. Below > you can see my first y$samples (when I forgot to correct the library > size) and the corrected version. This correction for library size (which > is small and of the same order for each library) causes a big difference > in number of differentially expressed genes. 
But what I find odd is that > I don't see this difference when I analyze each time point separate, > including only data of one time point in edgeR. I notice that the > normalization factors are more consistent within each time point > (especially for the first time point), could this cause the difference > between the different approaches? No. > group lib.size norm.factors > 12hpi C1 C 9498975 1.2036962 > 12hpi C2 C 5757034 1.1233049 > 12hpi C3 C 6367418 1.1285791 > 12hpi T1 T 6250384 1.1524029 > 12hpi T2 T 5606716 1.1351563 > 12hpi T3 T 7294081 1.1001216 > 24hpi C1 C 5873429 0.9832233 > 24hpi C2 C 6357801 0.8988242 > 24hpi C3 C 6778635 0.9467255 > 24hpi T1 T 6596149 1.0076682 > 24hpi T2 T 5451307 0.8962813 > 24hpi T3 T 5726377 0.9136752 > 48hpi C1 C 8232847 0.9945134 > 48hpi C2 C 8079906 0.9643586 > 48hpi C3 C 7632570 0.9228554 > 48hpi T1 T 15413395 0.9086659 > 48hpi T2 T 6289516 0.9171062 > 48hpi T3 T 5003424 0.8942136 > > Corrected for library size: > > group lib.size norm.factors > 12hpi C1 C 9485607 1.2036962 > 12hpi C2 C 5749033 1.1233049 > 12hpi C3 C 6358791 1.1285791 > 12hpi T1 T 6242328 1.1524029 > 12hpi T2 T 5598562 1.1351563 > 12hpi T3 T 7285598 1.1001216 > 24hpi C1 C 5866969 0.9832233 > 24hpi C2 C 6351387 0.8988242 > 24hpi C3 C 6771865 0.9467255 > 24hpi T1 T 6588283 1.0076682 > 24hpi T2 T 5445788 0.8962813 > 24hpi T3 T 5721236 0.9136752 > 48hpi C1 C 8223672 0.9945134 > 48hpi C2 C 8071573 0.9643586 > 48hpi C3 C 7624662 0.9228554 > 48hpi T1 T 15393879 0.9086659 > 48hpi T2 T 6282481 0.9171062 > 48hpi T3 T 4997554 0.8942136 > > > About the different approaches of analyzing the data (all data together > or each time point separate), I included my MDS plot in attachment. I do > see more variability at the later time point (especially between treated > and control samples). This isn't variability. It's a treatment effect. The treated samples at 48h are closely clustered showing low variability. 
The MDS plot shows that you have a treatment effect mainly at time 48h. > To estimate the dispersion, does edgeR compare all different samples as > if they were independent, or does edgeR bring in account which samples > are of the same time point and same treatment? Yes, certainly it takes into account which samples are of the same time and the same treatment. It uses the design matrix you give it. > still feel that my 3 time points look very different and that combining > all data will overestimate the dispersion. Of course the time points are different, but because of systematic effects not because of random variability. Best wishes Gordon > Thank you so much for your answers, > > Kaat > > > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] > Sent: dinsdag 26 juni 2012 2:04 > To: Kaat De Cremer > Cc: Mark Robinson; Bioconductor mailing list > Subject: RE: design matrix edge R pairwise comparison at different time points after infection with replicates > > Dear Kaat, > > 1) It is not generally true that you will find more DE genes analysing > just one time separately rather than using all the libraries in one > linear model. It is possible in principle that the later time may show > more variability, and that might justify separate analysis of the > different times. However I would do that only when you have a good a > priori biological reason for expecting such an effect and data > exploration (such as an MDS plot) confirms it. Otherwise the extra > instability of dispersion estimation with a small number of libraries is > not justified. I would not make such decisions merely on the basis of > which analysis gives more DE genes. > > 2) I don't think there is yet a definitive rule regarding filtering and > library sizes, but I prefer to recompute library sizes and scale > normalize (calcNormFactors) after filtering. 
Recomputing the library > sizes doesn't make a lot of difference, because scale normalization will > self correct anyway. > > Are you normalizing your data? > > Best wishes > Gordon > > On Mon, 25 Jun 2012, Kaat De Cremer wrote: > >> Again, >> Thank you both very much for your reply. >> >> I have analyzed my time course data now in several different ways and >> noticed some differences: >> >> >> 1) when I analyze the data of all time points together and look for DE >> genes at one time point, I find less DE genes compared to when I use >> only the data of that one time point in edgeR. I assume this is >> because the dispersion is larger when I include all the different time >> points at once? In that case, is this the right way to go? The >> dispersion can be larger at one time point compared to another I would think. >> >> 2) I noticed in the edgeR user's guide that in some examples you >> correct the library size after filtering for low expressed genes, and >> in other examples you don't. Correcting this library size gives less >> DE genes for my data at all time points when I analyze all data >> together and then look for DE genes at each time point. I don't see >> this difference when I only look at the data of one time point in edgeR. >> >> I hope you can comment on this, >> >> >> Thank you, >> Kaat >> >> >> >> -----Original Message----- >> From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] >> Sent: zondag 24 juni 2012 2:42 >> To: Kaat De Cremer >> Cc: Bioconductor mailing list; Mark Robinson >> Subject: design matrix edge R pairwise comparison at different time >> points after infection with replicates >> >> Hi Kaat, >> >> I'll jump in and continue on from Mark's help. >> >> To test for treatment effects separately at each time, the easiest way is to include the terms "time+time:treat" in your model formula. >> >> I'll assume that your "tests" are independent replicates of the whole experiment. 
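As a rough base-R illustration of the "time+time:treat" terms mentioned above, the sketch below builds the design matrix for an assumed balanced layout (3 time points, 2 treatments, 3 replicates each). The factor layout is an assumption for illustration, not the poster's actual sample table, and no edgeR functions are called.

```r
# Assumed balanced layout: 6 libraries per time point, 3 control + 3 treated.
time  <- factor(rep(c("12hpi", "24hpi", "48hpi"), each = 6))
treat <- factor(rep(rep(c("C", "T"), each = 3), times = 3))

# Time main effect plus a separate treatment effect within each time point.
design <- model.matrix(~ time + time:treat)

design_cols <- colnames(design)
design_rank <- qr(design)$rank

print(design_cols)
print(dim(design))
```

With this parametrisation each "timeXXhpi:treatT" column corresponds to the treated-versus-control difference at one time point, which is why a single coefficient can be tested per time.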
If there are batch effects associated with the tests that you need to correct for, then your complete design matrix might be: >> >> design <- model.matrix(~test+time+time:treat) >> >> This produces a design matrix with the following columns: >> >> > colnames(design) >> [1] "(Intercept)" "test2" "test3" "time24hpi" >> [5] "time48hpi" "time12hpi:treatT" "time24hpi:treatT" "time48hpi:treatT" >> >> So testing for treatment effects at each time is easy. To test for treatment effect as time 12h: >> >> fit <- glmFit(y, design) >> lrt <- glmLRT(y, fit, coef="time12hpi:treatT") >> >> etc. To test for treatment effect at time 24h: >> >> lrt <- glmLRT(y, fit, coef="time24hpi:treatT") >> >> and so on. >> >> Best wishes >> Gordon >> >>> Date: Fri, 22 Jun 2012 13:11:41 +0000 >>> From: Kaat De Cremer >>> To: Mark Robinson >>> Cc: bioconductor list >>> Subject: Re: [BioC] design matrix edge R pairwise comparison at >>> different time points after infection with replicates >>> >>> Hi Mark, >>> Thank you for your suggestion, >>> I really appreciate your time. >>> >>> Working in R is new to me so it has been a struggle using edgeR, but >>> I think I managed it using only 2 factors (test and treatment). Now >>> that I will be including 3 factors (test, treatment and time) in one >>> analysis it is clear to me that I still don't understand how it works exactly. >> >>> Below you can see my workspace with the only design matrix I could >>> come up with, but I don't see which coefficients I should include or >>> which contrast vector to use in the glmLRT function to make the >>> comparison of control-treatment at each time point separate, ignoring >>> the other 2 time points. Is this possible with this design matrix? Or >>> is the matrix wrong for this purpose? >>> >>> >>> Thanks! 
>>> Kaat >>> >>> >>>> head(x) >>> 12hpi C1 12hpi C2 12hpi C3 12hpi T1 12hpi T2 12hpi T3 24hpi C1 24hpi C2 >>> Lsa000001.1 0 1 1 2 0 2 1 1 >>> Lsa000002.1 5 4 0 5 6 6 6 4 >>> Lsa000003.1 10 9 7 5 5 8 6 2 >>> Lsa000004.1 1 1 1 1 1 1 1 3 >>> Lsa000005.1 1 0 1 0 2 0 0 1 >>> Lsa000006.1 510 223 228 287 222 268 303 358 >>> 24hpi C3 24hpi T1 24hpi T2 24hpi T3 48hpi C1 48hpi C2 48hpi C3 48hpi T1 >>> Lsa000001.1 0 1 1 0 0 0 0 2 >>> Lsa000002.1 7 5 2 5 10 6 12 12 >>> Lsa000003.1 7 5 4 2 6 5 8 2 >>> Lsa000004.1 1 3 1 2 1 3 2 3 >>> Lsa000005.1 0 1 0 0 1 0 0 2 >>> Lsa000006.1 372 362 237 320 472 440 411 858 >>> 48hpi T2 48hpi T3 >>> Lsa000001.1 0 0 >>> Lsa000002.1 1 5 >>> Lsa000003.1 1 0 >>> Lsa000004.1 0 2 >>> Lsa000005.1 1 0 >>> Lsa000006.1 375 275 >>>> treat<-factor(c("C","C","C","T","T","T","C","C","C","T","T","T","C"," >>>> C","C","T","T","T")) >>>> test<-factor(c(1,1,2,3,1,2,3,2,3,1,2,3,1,2,3,1,2,3)) >>> time<-factor(c("12hpi","12hpi","12hpi","12hpi","12hpi","12hpi","24hpi" >>> ,"24hpi","24hpi","24hpi","24hpi","24hpi","48hpi","48hpi","48hpi","48h >>> p >>> i","48hpi","48hpi")) >>>> y<-DGEList(counts=x,group=treat) >>> Calculating library sizes from column totals. 
>>>> cpm.y<-cpm(y) >>>> y<-y[rowSums(cpm.y>2)>=3,] >>>> y<-calcNormFactors(y) >>> design<-model.matrix(~test+treat+time) >>>> design >>> (Intercept) test2 test3 treatT time24hpi time48hpi >>> 1 1 0 0 0 0 0 >>> 2 1 1 0 0 0 0 >>> 3 1 0 1 0 0 0 >>> 4 1 0 0 1 0 0 >>> 5 1 1 0 1 0 0 >>> 6 1 0 1 1 0 0 >>> 7 1 0 0 0 1 0 >>> 8 1 1 0 0 1 0 >>> 9 1 0 1 0 1 0 >>> 10 1 0 0 1 1 0 >>> 11 1 1 0 1 1 0 >>> 12 1 0 1 1 1 0 >>> 13 1 0 0 0 0 1 >>> 14 1 1 0 0 0 1 >>> 15 1 0 1 0 0 1 >>> 16 1 0 0 1 0 1 >>> 17 1 1 0 1 0 1 >>> 18 1 0 1 1 0 1 >>> attr(,"assign") >>> [1] 0 1 1 2 3 3 >>> attr(,"contrasts") >>> attr(,"contrasts")$test >>> [1] "contr.treatment" >>> >>> attr(,"contrasts")$treat >>> [1] "contr.treatment" >>> >>> attr(,"contrasts")$time >>> [1] "contr.treatment" >>> >>>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >>> Disp = 0.07299 , BCV = 0.2702 >>>> y<-estimateGLMTrendedDisp(y,design) >>> Loading required package: splines >>>> y<-estimateGLMTagwiseDisp(y,design) >>> Warning message: >>> In maximizeInterpolant(spline.pts, apl.smooth[j, ]) : >>> max iterations exceeded >>>> fit<-glmFit(y,design) >>> >>> >>> >>> >>> >>> >>> >>> -----Original Message----- >>> From: Mark Robinson [mailto:mark.robinson at imls.uzh.ch] >>> Sent: vrijdag 22 juni 2012 12:03 >>> To: Kaat De Cremer >>> Cc: bioconductor list >>> Subject: Re: [BioC] design matrix edge R pairwise comparison at >>> different time points after infection with replicates >>> >>> Hi Kaat, >>> >>> It is probably better to fit all your data with a single call to glmFit(), over all 18 samples; you can test the differences of interest trough the 'coef' or 'contrast' argument on glmLRT(). That would afford you more degrees of freedom and presumably better estimates of dispersion, and so on. >>> >>>> From your description, I can't quite figure out your design matrix. You have three factors: treatment, test and time point. 
First, you need to input all 18 samples and extend your 'treatment' and 'test' factor variables to have 18 values (corresponding to the columns of your table). And, then also include a time variable in your design. Some decisions might need to be made about interactions to include. >>> >>> Hope that gets you started. >>> >>> Best, >>> Mark >>> >>> >>> ---------- >>> Prof. Dr. Mark Robinson >>> Bioinformatics >>> Institute of Molecular Life Sciences >>> University of Zurich >>> Winterthurerstrasse 190 >>> 8057 Zurich >>> Switzerland >>> >>> v: +41 44 635 4848 >>> f: +41 44 635 6898 >>> e: mark.robinson at imls.uzh.ch >>> o: Y11-J-16 >>> w: http://tiny.cc/mrobin >>> >>> ---------- >>> http://www.fgcz.ch/Bioconductor2012 >>> >>> On 21.06.2012, at 11:42, Kaat De Cremer wrote: >>> >>>> Dear all, >>>> >>>> >>>> I am using edgeR to find genes differentially expressed between >>>> infected and mock-infected control plants, at 3 time points after >>>> infection. >> >>>> I have RNAseq data for 3 independent tests, so for every single test >>>> I have 6 libraries (control + infected at 3 time points). >> >>>> Having three replicates this makes 18 libraries in total. >>>> >>>> What I did until now is look at each time point separate and calculate DEgenes at that time point as shown in this script: >>>> >>>>> head(x) >>>> C1 C2 C3 T1 T2 T3 >>>> 1 0 1 2 0 0 0 >>>> 2 13 6 4 10 8 12 >>>> 3 17 16 9 10 8 11 >>>> 4 2 1 2 2 3 2 >>>> 5. 1 3 1 2 1 3 0 >>>> 6 958 457 438 565 429 518 >>>> >>>>> treatment<-factor(c("C","C","C","T","T","T")) >>>>> test<-factor(c(1,2,3,1,2,3)) >>>>> y<-DGEList(counts=x,group=treatment) >>>> Calculating library sizes from column totals. 
>>>>> cpm.y<-cpm(y) >>>>> y<-y[rowSums(cpm.y>2)>=3,] >>>>> y<-calcNormFactors(y) >>>>> design<-model.matrix(~test+treat) >>>>> y<-estimateGLMCommonDisp(y,design,verbose=TRUE) >>>> Disp = 0.0265 , BCV = 0.1628 >>>>> y<-estimateGLMTrendedDisp(y,design) >>>> Loading required package: splines >>>>> y<-estimateGLMTagwiseDisp(y,design) >>>>> fit<-glmFit(y,design) >>>>> lrt<-glmLRT(y,fit) >>>> >>>> >>>> This works fine but I wonder if I should do the analysis of the different time points all at once? Will this make a difference? >>>> Unfortunately I cannot figure out how to design the matrix. >>>> >>>> I hope someone can help me, >>>> >>>> Kaat >>>> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ >> > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}} From smelov at buckinstitute.org Wed Jun 27 01:14:04 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Tue, 26 Jun 2012 16:14:04 -0700 Subject: [BioC] HTqPCR problems In-Reply-To: <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> Message-ID: <88EB4818-F48D-472C-887B-4DE867E7C481@buckinstitute.org> Thanks! I will work on it some more, didn't realize groups was defined by vector. On Jun 26, 2012, at 12:48 PM, Heidi Dvinge wrote: >> Hi, >> I'm having some troubles selectively sub-setting, and graphing up QPCR >> data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). I'd >> like to be able to visualize specific genes, with specific groups we run >> routinely on our Biomark system. 
Typical runs are across multiple plates, >> and have multiple biological replicates, and usually 2 or more technical >> replicates (although we are moving away from technical reps, as the CVs >> are so tight). >> >> Can anyone help with this? Heidi? >> >> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >> n.data=48, samples=samples) >> #Ive read the samples in from a separate file, as when you read it in, it >> doesnt take the sample names supplied in the biomark output# >> #Next, I want to plot genes of interest, with samples of interest, and I'm >> having trouble getting an appropriate output# >> >> g=featureNames(raw6)[1:2] >> plotCtOverview(raw6, genes=g, groups=groupID$Treatment, col=rainbow(5)) >> >> #This plots 1 gene across all 48 samples# >> #but the legend doesnt behave, its placed on top of the plot, and I cant >> get it to display in a non-overlapping fashion# >> #I've tried all sorts of things in par, but nothing seems to shift the >> legend's position# >> > As the old saying goes, whenever you want a job done well, you'll have to > do it yourself ;). In this case, the easiest thing is probably to use > legend=FALSE in plotCtOverview, and then afterwards add it yourself in the > desired location using legend(). That way, if you have a lot of different > features or groups to display, you can also use the ncol parameter in > legend to make several columns within the legend, such as 3x4 instead of > the default 12x1. > > Alternatively, you can use either xlim or ylim in plotCtOverview to add > some empty space on the side where there's then room for the legend. 
> >> #I now want to plot a subset of the samples for specific genes# >>> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== "LFY") >>> LOY >> Sample Treatment >> 2 L20 LFY >> 5 L30 LFY >> 7 L45 LO >> 20 L40 LO >> 27 L43 LO >> 33 L29 LFY >> 36 L38 LO >> 40 L39 LO >> 43 L23 LFY >> >> >>> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) >> Warning messages: >> 1: In split.default(t(x), sample.split) : >> data length is not a multiple of split variable >> 2: In qt(p, df, lower.tail, log.p) : NaNs produced >>> > > Does it make sense if you split by groups=LOY$Treatment? It looks like the > object LOY itself is a data frame, rather than the expected vector. > > Also, you may have to 'repeat' the col=rainbow() argument to fit your > number of features. > >> >> #it displays the two groups defined by treatment, but doesnt do so nicely, >> very skinny bars, and the legend doesnt reflect what its displaying# >> #again, I've tried monkeying around with par, but not sure what HTqPCR is >> calling to make the plots# >> > If the bars are very skinny, it's probably because you're displaying a lot > of features. Nothing much to do about that, except increasing the width or > your plot :(. > > \Heidi > >> please help! >> >> thanks >> >> Simon. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > From smyth at wehi.EDU.AU Wed Jun 27 01:25:22 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Wed, 27 Jun 2012 09:25:22 +1000 (AUS Eastern Standard Time) Subject: [BioC] glmFit options in edgeR not passed to mglmLS? In-Reply-To: References: Message-ID: Dear Wade, Thanks very much for the bug report. The "..." argument was missing in the generic function, although it is correct in the methods. I've committed a fix today. 
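The kind of problem described here can be reproduced with a toy S3 generic. This is a sketch using hypothetical functions fitA and fitB, not edgeR's actual source: when the generic lacks "...", R rejects the extra argument while matching the call to the generic, even though the method would have accepted it.

```r
# Hypothetical generic WITHOUT "...": extra arguments are rejected
# when the call to the generic itself is matched.
fitA <- function(y, design) UseMethod("fitA")
fitA.default <- function(y, design, ...) {
  extra <- list(...)
  length(extra)
}

# Same method, but the generic now carries "..." through to the method.
fitB <- function(y, design, ...) UseMethod("fitB")
fitB.default <- function(y, design, ...) {
  extra <- list(...)
  length(extra)
}

# Capture the error message instead of stopping the script.
errA <- tryCatch(fitA(1, 2, maxit = 100),
                 error = function(e) conditionMessage(e))
okB <- fitB(1, 2, maxit = 100)

print(errA)
print(okB)
```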
You should be able to install the updated edgeR from Bioconductor within a couple of days. Best wishes Gordon > From: Davis, Wade > Sent: Thursday, June 14, 2012 11:15 AM > To: bioconductor at r-project.org > Subject: glmFit options in edgeR not passed to mglmLS? > > Dear Gordon and Davis, > I tried passing maxit to mglmLS when using glmFit via the following code, and I get this message: > >> fit.filt.tgw <- glmFit(y=dge.filt.tgw, design=rip.design,maxit = 100) > > Error in glmFit(y = dge.filt.tgw, design = rip.design, maxit = 100) : > unused argument(s) (maxit = 100) > > Everything works fine when maxit is removed from the call. Also note that the method used is linesearch: > > fit.filt.tgw <- glmFit(y=dge.filt.tgw, design=rip.design) > >> fit.filt.tgw$method > [1] "linesearch" > > The help page for glmFit reads > ... > > other arguments are passed to lower-level functions, for example to mglmLS. > > > Looking inside glmFit.default, everything looks fine to me: > > fit <- switch(method, linesearch = mglmLS(y, design = design, > dispersion = dispersion, start = start, offset = offset, > ...), oneway = mglmOneWay(y, design = design, dispersion = dispersion, > offset = offset), levenberg = mglmLevenberg(y, design = design, > dispersion = dispersion, offset = offset), simple = mglmSimple(y, > design = design, dispersion = dispersion, offset = offset, > weights = weights)) > > To make sure that mglmLS is using linesearch I made the call as follows but got the same result: > >> fit.filt.tgw <- glmFit(y=dge.filt.tgw, design=rip.design, method="linesearch", maxit = 100) > Error in glmFit(y = dge.filt.tgw, design = rip.design, method = "linesearch", : > unused argument(s) (maxit = 100) > > Calling mglmLS directly works (although I know it isn't intended to be called this way): > > temp<-mglmLS (y=as.matrix(dge.filt.tgw), > design=rip.design, > maxit = 50, # also used maxit=100 with no problem > dispersion= dge.filt.tgw$tagwise.dispersion) > > Am I doing something wrong? 
> > Thanks, > Wade > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.14.4 latticeExtra_0.6-23 RColorBrewer_1.0-5 Rsamtools_1.8.5 > [5] lattice_0.20-8 Biostrings_2.24.1 GenomicRanges_1.8.6 IRanges_1.14.3 > [9] BiocGenerics_0.2.0 plyr_1.7.1 edgeR_2.6.7 limma_3.12.1 > > loaded via a namespace (and not attached): > [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3 stats4_2.15.0 tools_2.15.0 > [7] zlibbioc_1.2.0 > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}} From david at harsk.dk Wed Jun 27 01:43:44 2012 From: david at harsk.dk (David Westergaard) Date: Wed, 27 Jun 2012 01:43:44 +0200 Subject: [BioC] Printing arrayQualityMetrics report In-Reply-To: <4FE44F90.1050603@embl.de> References: <4FE39A13.3010004@embl.de> <4FE428C1.7000309@gmail.com> <4FE44F90.1050603@embl.de> Message-ID: Hi, I actually managed to produce a sufficient pdf to include just by upgrading my browser and fixing minor issues in inkscape. Thanks for the help, David 2012/6/22 Wolfgang Huber : > Dear James > > that is one way to go about it. A more direct way for getting at this table > is to keep the return value of the call to the function arrayQualityMetrics: > > myReport = arrayQualityMetrics( eset, ...) > > and access the table via > > myReport$arrayTable > > Please use arrayQualityMetrics >= 3.13.5 for this (unfortunately in previous > versions due to an oversight of mine this object was not propagated all the > way to the return value of arrayQualityMetrics). 
> > Version 3.13.5 is in svn and should also soon be on the website / in the > package repository. It also fixes the 'intgroup' issue reported by Sonal and > Tim. > > Best wishes > Wolfgang > > > > > > James F. Reid scripsit 06/22/2012 10:11 AM: > > ...[snip] > > >> you could extract the table contents using the readHTMLTable function >> from the 'XML' package and for the figures just include the pdfs as >> figures and add a caption to them. >> >> HTH, >> J. > > > Best wishes > Wolfgang > > Wolfgang Huber > EMBL > http://www.embl.de/research/units/genome_biology/huber > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From smelov at buckinstitute.org Wed Jun 27 02:14:17 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Tue, 26 Jun 2012 17:14:17 -0700 Subject: [BioC] HTqPCR problems In-Reply-To: <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> Message-ID: <7BE89EC0-5910-4758-881F-6EFCB8E79A5F@buckinstitute.org> Thanks for the help Heidi, but I'm still having troubles, your comments on the plotting helped me solve the outputs. But if I want to just display some groups (for example the LO group in the example below), how do I associate a group with multiple samples (ie biological reps)? 
Currently I'm associating genes with samples by reading in the file as below plate6=read.delim("plate6Sample.txt", header=FALSE) #this is a file to associate sample ID with the genes in the biomark data, as currently HTqPCR does not seem to associate the sample info in the Biomark output to the gene IDs samples=as.vector(t(plate6)) raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, n.data=48, samples=samples) #now I have samples and genes similar to your example in the guide, but I want to associate samples to groups now. In the guide, you have an example where you have entire files as distinct samples, but in our runs, we never have that situation. I have a file which associates samples to groups, which I read in... groupID=read.csv("plate6key.csv") but how do I associate the samples with their appropriate groups for biological replicates with any of the functions in HtQPCR? You recommend below using a vector, but I dont see how that helps me associate the samples in the Expression set. thanks again! s On Jun 26, 2012, at 12:48 PM, Heidi Dvinge wrote: >> Hi, >> I'm having some troubles selectively sub-setting, and graphing up QPCR >> data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). I'd >> like to be able to visualize specific genes, with specific groups we run >> routinely on our Biomark system. Typical runs are across multiple plates, >> and have multiple biological replicates, and usually 2 or more technical >> replicates (although we are moving away from technical reps, as the CVs >> are so tight). >> >> Can anyone help with this? Heidi? 
>> >> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >> n.data=48, samples=samples) >> #Ive read the samples in from a separate file, as when you read it in, it >> doesnt take the sample names supplied in the biomark output# >> #Next, I want to plot genes of interest, with samples of interest, and I'm >> having trouble getting an appropriate output# >> >> g=featureNames(raw6)[1:2] >> plotCtOverview(raw6, genes=g, groups=groupID$Treatment, col=rainbow(5)) >> >> #This plots 1 gene across all 48 samples# >> #but the legend doesnt behave, its placed on top of the plot, and I cant >> get it to display in a non-overlapping fashion# >> #I've tried all sorts of things in par, but nothing seems to shift the >> legend's position# >> > As the old saying goes, whenever you want a job done well, you'll have to > do it yourself ;). In this case, the easiest thing is probably to use > legend=FALSE in plotCtOverview, and then afterwards add it yourself in the > desired location using legend(). That way, if you have a lot of different > features or groups to display, you can also use the ncol parameter in > legend to make several columns within the legend, such as 3x4 instead of > the default 12x1. > > Alternatively, you can use either xlim or ylim in plotCtOverview to add > some empty space on the side where there's then room for the legend. > >> #I now want to plot a subset of the samples for specific genes# >>> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== "LFY") >>> LOY >> Sample Treatment >> 2 L20 LFY >> 5 L30 LFY >> 7 L45 LO >> 20 L40 LO >> 27 L43 LO >> 33 L29 LFY >> 36 L38 LO >> 40 L39 LO >> 43 L23 LFY >> >> >>> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) >> Warning messages: >> 1: In split.default(t(x), sample.split) : >> data length is not a multiple of split variable >> 2: In qt(p, df, lower.tail, log.p) : NaNs produced >>> > > Does it make sense if you split by groups=LOY$Treatment? 
It looks like the > object LOY itself is a data frame, rather than the expected vector. > > Also, you may have to 'repeat' the col=rainbow() argument to fit your > number of features. > >> >> #it displays the two groups defined by treatment, but doesnt do so nicely, >> very skinny bars, and the legend doesnt reflect what its displaying# >> #again, I've tried monkeying around with par, but not sure what HTqPCR is >> calling to make the plots# >> > If the bars are very skinny, it's probably because you're displaying a lot > of features. Nothing much to do about that, except increasing the width or > your plot :(. > > \Heidi > >> please help! >> >> thanks >> >> Simon. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > From jhardcas at fhcrc.org Tue Jun 26 20:03:52 2012 From: jhardcas at fhcrc.org (Hardcastle, Justin) Date: Tue, 26 Jun 2012 11:03:52 -0700 (PDT) Subject: [BioC] Unable to open database file, cummeRbund error. In-Reply-To: Message-ID: <010b815d-28b1-45e0-98fb-3bce4777eeec@zimbra4.fhcrc.org> Hi, I'm having an issue running cummeRbund on my cuffdiff output. CummeRbund is giving me a DB error and not creating the DB. The code and error are below. library("cummeRbund") dir = "~/Test" outdir = "output/cuffdiff" cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) The error given is > cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) Creating database ~/Test/output/cuffdiff/cuffData.db Error in sqliteNewConnection(drv, ...) : RS-DBI driver: (could not connect to dbname: unable to open database file ) I am running cummeRbund 1.2.0, and Cufflinks 2.0.1. Thanks for any help. 
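One possible cause of the "unable to open database file" error above (an assumption; the thread does not confirm it) is that the "~" in the directory is passed to SQLite unexpanded, or that the cuffdiff output directory is missing or not writable. A minimal base R sketch for checking this before calling readCufflinks() follows; check_db_dir is a hypothetical helper, not part of cummeRbund.

```r
# Hypothetical helper (not part of cummeRbund): expand a leading "~" and
# report whether the directory exists and is writable by the current user.
check_db_dir <- function(dir) {
  full <- path.expand(dir)
  list(
    path     = full,
    exists   = dir.exists(full),
    # file.access() returns 0 on success and -1 on failure; mode = 2 tests write access
    writable = unname(file.access(full, mode = 2)) == 0
  )
}

# Directory names copied from the post above
info <- check_db_dir(file.path("~/Test", "output/cuffdiff"))
print(info)

# If both checks pass, the expanded path could be handed on, for example:
# cuff <- readCufflinks(dir = info$path, rebuild = TRUE)
```

If exists or writable comes back FALSE, the database file cannot be created in that location regardless of the cummeRbund version.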
From thomas.bartlett.10 at ucl.ac.uk Wed Jun 27 10:26:35 2012 From: thomas.bartlett.10 at ucl.ac.uk (Bartlett, Thomas) Date: Wed, 27 Jun 2012 08:26:35 +0000 Subject: [BioC] Finding TSS locations Message-ID: <2B7EB2D2AC7BAA46BC82166CF856FC3C344654AD@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com> Hi, I'm trying to find a way to get the locations of the TSS (transcriptional start site) for genes (I need this for work analysing Illumina 450K methylation data). I've tried the package GenomicFeatures, and have successfully downloaded and loaded the package, however the relevant command data(geneHuman) doesn't seem to work, producing the following error message: Warning message: In data(geneHuman) : data set 'geneHuman' not found I'm currently using R 2.15 on Windows Vista (I also have access to a Unix-type machine) thanks in advance for your help Tom Bartlett From msbootwalla at gmail.com Wed Jun 27 10:49:30 2012 From: msbootwalla at gmail.com (Moiz Bootwalla) Date: Wed, 27 Jun 2012 01:49:30 -0700 Subject: [BioC] Finding TSS locations In-Reply-To: <2B7EB2D2AC7BAA46BC82166CF856FC3C344654AD@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com> References: <2B7EB2D2AC7BAA46BC82166CF856FC3C344654AD@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com> Message-ID: <40E25B66-FC12-4004-89B6-34228DE4D998@gmail.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From gbayon at gmail.com Wed Jun 27 10:51:53 2012 From: gbayon at gmail.com (Gustavo Fernández Bayón) Date: Wed, 27 Jun 2012 10:51:53 +0200 Subject: [BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data Message-ID: Hi everybody. I am experiencing quite a few problems while trying to download and parse a dataset of methylation values. These are not technical problems, IMHO. GEOquery works perfectly, and it really makes getting this kind of data an easy task.
However, I think I do not understand exactly the lifecycle of GEO series data, and I would like to ask in this list for any hint on this behavior, so I could try to fix it. What I first did was to download and parse the desired GSE data file, with the default value of the GSEMatrix parameter (TRUE). Besides, I extracted the ExpressionSet and the assayData I was looking for. my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/') my.expr.set <- my.gse[[1]] beta.values <- exprs(my.expr.set) What really gave me a surprise at first was to see many strange values (all containing the 'NA' string) in the featureNames of the expression set. >head(featureNames(es), n=20) [1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3" [8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734" [15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12" If I select an individual GSM in the series, and download it, the featureNames are ok. If I try to download the GSE with GSEMatrix=FALSE, I get a list of GSM data sets, and the result is again good. This made me suspect the intermediate, pre-parsed, matrix form. I haven't found a clue about the lifecycle of this kind of data. I mean, how the matrix is built. Is it a manual process? Is it automatic? If it is a manual process, then I guess I will have to contact whoever is responsible for uploading the data to see if they can fix it. But, if it is not, I would like to know if this is something relating to BioC or, more plausibly, to GEO. Any help would be appreciated.
Regards, Gustavo --------------------------- Sent with Sparrow (http://www.sparrowmailapp.com/?sig) From gavin.blackburn at strath.ac.uk Wed Jun 27 11:21:10 2012 From: gavin.blackburn at strath.ac.uk (Gavin Blackburn) Date: Wed, 27 Jun 2012 10:21:10 +0100 Subject: [BioC] mzR error In-Reply-To: References: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org> Message-ID: <8C45594961256A438A8AC49E0BB51DBA028326C9E104@E2K7-MS2.ds.strath.ac.uk> Hi Laurent, Thanks very much, I'll check the Rcpp list to see when it is solved and will downgrade it for now. Cheers, Gavin. -----Original Message----- From: Laurent Gatto [mailto:laurent.gatto at gmail.com] Sent: 26 June 2012 17:38 To: Gavin Blackburn [guest] Cc: bioconductor at r-project.org; Gavin Blackburn Subject: Re: [BioC] mzR error On 26 June 2012 17:19, Laurent Gatto wrote: > Dear Gavin, > > I can't reproduce this, but I do not have the same configuration at > hand for the moment - this could be an incompatibility with the latest > Rcpp. Ok, I can now reproduce on a Windows box with mzR 1.2.1 (latest stable) and Rcpp 0.9.12. Downgrading to Rcpp 0.9.10 [1] fixes the issue. I will bring it up on the Rcpp list. Thank you for the report. Laurent [1] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip > What version of mzR have you - packageVersion("mzR") > > Best wishes, > > Laurent > > On 26 June 2012 16:07, Gavin Blackburn [guest] wrote: >> >> We are getting the following error when trying to install and run mzR on a 64-bit Windows 7 machine: >>> library(mzR) >> Loading required package: Rcpp >> Error : .onLoad failed in loadNamespace() for 'mzR', details: >> call: value[[3L]](cond) >> error: failed to load module Ramp from package mzR could not find >> function "errorOccured" >> Error: package/namespace load failed for 'mzR' >> >> >> Do you know what might be causing it? >> >> Cheers, >> >> Gavin.
>> >> >> -- output of sessionInfo(): >> >> sessionInfo() >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-pc-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United Kingdom.1252 [2] >> LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United >> Kingdom.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Rcpp_0.9.12 BiocInstaller_1.4.7 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.16.0 BiocGenerics_0.2.0 tools_2.15.1 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > [ Laurent Gatto | slashhome.be ] -- [ Laurent Gatto | slashhome.be ] From laurent.gatto at gmail.com Wed Jun 27 11:27:25 2012 From: laurent.gatto at gmail.com (Laurent Gatto) Date: Wed, 27 Jun 2012 10:27:25 +0100 Subject: [BioC] mzR error In-Reply-To: <8C45594961256A438A8AC49E0BB51DBA028326C9E104@E2K7-MS2.ds.strath.ac.uk> References: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org> <8C45594961256A438A8AC49E0BB51DBA028326C9E104@E2K7-MS2.ds.strath.ac.uk> Message-ID: On 27 June 2012 10:21, Gavin Blackburn wrote: > Hi Laurent, > > Thanks very much, I'll check the Rcpp list to see when it is solved and will downgrade it for now. It is not clear (at least to me) what happens [1]; it might fix itself with new Rcpp and mzR binaries. I will post an update here, anyway. Best wishes, Laurent [1] http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-June/thread.html#3940 > Cheers, > > Gavin.
> > -----Original Message----- > From: Laurent Gatto [mailto:laurent.gatto at gmail.com] > Sent: 26 June 2012 17:38 > To: Gavin Blackburn [guest] > Cc: bioconductor at r-project.org; Gavin Blackburn > Subject: Re: [BioC] mzR error > > On 26 June 2012 17:19, Laurent Gatto wrote: >> Dear Gavin, >> >> I can't reproduce this, but I do not have the same configuration at >> hand for the moment - this could be an incompatibility with the latest >> Rcpp. > > Ok, I can now reproduce on a Windows box with mzR 1.2.1 (latest > stable) and Rcpp 0.9.12. Downgrading to Rcpp 0.9.10 [1] fixes the issue. I will bring it up on the Rcpp list. > > Thank you for the report. > > Laurent > > [1] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip > > >> What version of mzR have you - packageVersion("mzR") >> >> Best wishes, >> >> Laurent >> >> On 26 June 2012 16:07, Gavin Blackburn [guest] wrote: >>> >>> We are getting the following error when trying to install and run mzR on a 64-bit Windows 7 machine: >>>> library(mzR) >>> Loading required package: Rcpp >>> Error : .onLoad failed in loadNamespace() for 'mzR', details: >>> ?call: value[[3L]](cond) >>> ?error: failed to load module Ramp from package mzR could not find >>> function "errorOccured" >>> Error: package/namespace load failed for ???mzR??? >>> >>> >>> Do you know what might be causing it? >>> >>> Cheers, >>> >>> Gavin. >>> >>> >>> ?-- output of sessionInfo(): >>> >>> ?sessionInfo() >>> R version 2.15.1 (2012-06-22) >>> Platform: x86_64-pc-mingw32/x64 (64-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United Kingdom.1252 [2] >>> LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United >>> Kingdom.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 >>> >>> attached base packages: >>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >>> >>> other attached packages: >>> [1] Rcpp_0.9.12 ? ? ? ? 
BiocInstaller_1.4.7 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.16.0 ? ? BiocGenerics_0.2.0 tools_2.15.1 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> -- >> [ Laurent Gatto | slashhome.be ] > > > > -- > [ Laurent Gatto | slashhome.be ] -- [ Laurent Gatto | slashhome.be ] From guest at bioconductor.org Wed Jun 27 11:54:04 2012 From: guest at bioconductor.org (Thileepan [guest]) Date: Wed, 27 Jun 2012 02:54:04 -0700 (PDT) Subject: [BioC] How to create a phenodata Message-ID: <20120627095404.7D5E013C727@mamba.fhcrc.org> Hi, I have a dataset GSE37859 which consist of 3 groups namely "NSC","iNSC","MEF".Now I have to create phenodata for the same. I created a separate txt file describing group of each samples. Now I wanted to create pData for my raw data. Can anyone help me? -- output of sessionInfo(): > phenodata #phenodata created in a text file and read inR V1 V2 1 GSM929093 iNSC1 2 GSM929094 iNSC1 3 GSM929095 iNSC1 4 GSM929096 iNSC2 5 GSM929097 iNSC2 6 GSM929098 iNSC2 7 GSM929099 NSC 8 GSM929100 NSC 9 GSM929101 NSC 10 GSM929102 MEF 11 GSM929103 MEF 12 GSM929104 MEF #This what I get when I give pData(data.raw) > pData(data.raw) sample GSM929093_01.iNSC1.1.CEL.gz 1 GSM929094_02.iNSC1.2.CEL.gz 2 GSM929095_03.iNSC1.3.CEL.gz 3 GSM929096_04.iNSC2.1.CEL.gz 4 GSM929097_05.iNSC2.2.CEL.gz 5 GSM929098_06.iNSC2.3.CEL.gz 6 GSM929099_07.WT.NSC.1.CEL.gz 7 GSM929100_08.WT.NSC.2.CEL.gz 8 GSM929101_09.WT.NSC.3.CEL.gz 9 GSM929102_10.WT.MEFs.1.CEL.gz 10 GSM929103_11.WT.MEFs.3.CEL.gz 11 GSM929104_12.WT.MEFs.5.CEL.gz 12 -- Sent via the guest posting facility at bioconductor.org. 
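A minimal sketch of one way to build the phenodata table asked about above, using base R only. It assumes the GSM accession is the part of each CEL file name before the first underscore, and it retypes the values shown in the post rather than reading the poster's files. The column names accession and group are illustrative choices, not names required by any package.

```r
# Group table as shown in the post (V1 = GSM accession, V2 = group)
phenodata <- data.frame(
  V1 = sprintf("GSM%d", 929093:929104),
  V2 = rep(c("iNSC1", "iNSC2", "NSC", "MEF"), each = 3),
  stringsAsFactors = FALSE
)

# CEL file names as listed by pData(data.raw) in the post
cel_names <- c(
  "GSM929093_01.iNSC1.1.CEL.gz",   "GSM929094_02.iNSC1.2.CEL.gz",
  "GSM929095_03.iNSC1.3.CEL.gz",   "GSM929096_04.iNSC2.1.CEL.gz",
  "GSM929097_05.iNSC2.2.CEL.gz",   "GSM929098_06.iNSC2.3.CEL.gz",
  "GSM929099_07.WT.NSC.1.CEL.gz",  "GSM929100_08.WT.NSC.2.CEL.gz",
  "GSM929101_09.WT.NSC.3.CEL.gz",  "GSM929102_10.WT.MEFs.1.CEL.gz",
  "GSM929103_11.WT.MEFs.3.CEL.gz", "GSM929104_12.WT.MEFs.5.CEL.gz"
)

# Keep the accession (everything before the first "_") and use it to
# put the group table in the same order as the CEL files
gsm <- sub("_.*$", "", cel_names)
pd  <- phenodata[match(gsm, phenodata$V1), ]
names(pd)    <- c("accession", "group")
rownames(pd) <- cel_names

# Every CEL file should have found a group
stopifnot(!anyNA(pd$group))
print(table(pd$group))

# With Biobase loaded, the table could then be attached to the raw data,
# provided its row names equal sampleNames(data.raw):
# pData(data.raw) <- pd
```

Matching on the accession rather than relying on row order guards against the text file and the CEL files being listed in different orders.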
From guest at bioconductor.org Wed Jun 27 12:19:47 2012 From: guest at bioconductor.org (Thileepan [guest]) Date: Wed, 27 Jun 2012 03:19:47 -0700 (PDT) Subject: [BioC] Finding differential expressed genes for GSE Message-ID: <20120627101947.EF6AA10A627@mamba.fhcrc.org> I am Thileepan Sekaran, pursuing my Masters in Bioinformatics. Currently I am doing gene expression profiling of Affymetrix data in R. I found the AffyExpress package from Bioconductor very interesting and want to use it for my analysis, but I got stuck when I wanted to find the differentially expressed genes using AffyRegress. The dataset GSE37859, which was generated on the MoGene-1_0-st platform, consists of two groups, Fibroblast and iNSC cells, and I want to find the differentially expressed genes between these two groups with a fold change of 2 and a p-value of .05. Can anyone help me in finding the differentially expressed genes for the dataset between the two groups? -- output of sessionInfo(): "Error in function (classes, fdef, mtable) : unable to find an inherited method for function "annotation", for signature "matrix" -- Sent via the guest posting facility at bioconductor.org. From sdavis2 at mail.nih.gov Wed Jun 27 12:40:21 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 27 Jun 2012 06:40:21 -0400 Subject: [BioC] Finding differential expressed genes for GSE In-Reply-To: <20120627101947.EF6AA10A627@mamba.fhcrc.org> References: <20120627101947.EF6AA10A627@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From sdavis2 at mail.nih.gov Wed Jun 27 13:37:57 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 27 Jun 2012 07:37:57 -0400 Subject: [BioC] Finding differential expressed genes for GSE In-Reply-To: References: <20120627101947.EF6AA10A627@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed...
Name: not available URL: From guest at bioconductor.org Wed Jun 27 13:39:12 2012 From: guest at bioconductor.org (Carl Hinrichs [guest]) Date: Wed, 27 Jun 2012 04:39:12 -0700 (PDT) Subject: [BioC] gene2pathway retrain() error: species 'hsa' unknown Message-ID: <20120627113912.2D7A0131524@mamba.fhcrc.org> Hi, when using retrain() in the gene2pathway library I get the following error: library(gene2pathway) > retrain() Retrieving KEGG information via SOAP ... xmlns: URI SOAP/KEGG is not absolute xmlns: URI SOAP/KEGG is not absolute Error in gene2pathway:::buildTrainingSet(minnmap = minnmap, level1Only = level1Only, : Organism 'hsa' is unknown in KEGG! Please refer to for a complete list of supported organisms. library(gene2pathway) leads to the following output: Warning messages: 1: Class "VirtualSOAPClass" is defined (with package slot 'SSOAP') but no metadata object found to revise subclass information---not exported? Making a copy in package '.GlobalEnv' 2: Class "VirtualXMLSchemaClass" is defined (with package slot 'XMLSchema') but no metadata object found to revise subclass information---not exported? Making a copy in package '.GlobalEnv' I have no idea how to solve that problem...
Best regards Carl -- output of sessionInfo(): > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] gene2pathway_2.10.0 keggorthology_2.8.0 hgu95av2.db_2.7.1 [4] org.Hs.eg.db_2.7.1 org.Dm.eg.db_2.7.1 RSQLite_0.11.1 [7] DBI_0.2-5 AnnotationDbi_1.18.1 Biobase_2.16.0 [10] BiocGenerics_0.2.0 RBGL_1.32.1 graph_1.34.0 [13] KEGGSOAP_1.30.0 biomaRt_2.12.0 kernlab_0.9-14 loaded via a namespace (and not attached): [1] codetools_0.2-8 IRanges_1.14.3 RCurl_1.91-1 SSOAP_0.8-0 [5] stats4_2.15.0 tools_2.15.0 XML_3.9-4 XMLSchema_0.7-2 -- Sent via the guest posting facility at bioconductor.org. From michael.dondrup at uni.no Wed Jun 27 13:58:14 2012 From: michael.dondrup at uni.no (Michael Dondrup) Date: Wed, 27 Jun 2012 13:58:14 +0200 Subject: [BioC] Gviz: Error plotting C.elegans ideogram In-Reply-To: References: Message-ID: Just found the obvious. The error is caused by the fact that the c.elegans genome doesn't have a cytoband track. Not all genomes have this track. Maybe the error message could be made more clear, something like: "There is no cytoband information for chromosome 1 in UCSC genome 'ce4'" Cheers Michael On Jun 22, 2012, at 2:20 PM, Michael Dondrup wrote: > Hi, > > I am trying to plot an ideogram track for C. elegans using Gviz. However I cannot generate the ideogram track: > >> itrack <- IdeogramTrack(genome = "ce6", chromosome = "chrI" ) > Error in normArgTrack(track, trackids) : Unknown track: cytoBandIdeo > > I have tried also "ce4, and ce10" and for the chromosome "chr1, chrII, 1" with the same effect. > Other genomes (hgu19, mm9) worked. 
Do I have to use a different genome identifier? > > Best > Michael Dondrup > >> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods base > > other attached packages: > [1] IRanges_1.14.3 BiocGenerics_0.2.0 Gviz_1.0.1 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.18.1 Biobase_2.16.0 biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 > [7] DBI_0.2-5 GenomicRanges_1.8.6 lattice_0.20-6 RColorBrewer_1.0-5 RCurl_1.91-1 Rsamtools_1.8.5 > [13] RSQLite_0.11.1 rtracklayer_1.16.1 stats4_2.15.0 tools_2.15.0 XML_3.9-4 zlibbioc_1.2.0 >> > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From thomas.bartlett.10 at ucl.ac.uk Wed Jun 27 15:51:24 2012 From: thomas.bartlett.10 at ucl.ac.uk (Bartlett, Thomas) Date: Wed, 27 Jun 2012 13:51:24 +0000 Subject: [BioC] Finding TSS locations In-Reply-To: <40E25B66-FC12-4004-89B6-34228DE4D998@gmail.com> References: <2B7EB2D2AC7BAA46BC82166CF856FC3C344654AD@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com>, <40E25B66-FC12-4004-89B6-34228DE4D998@gmail.com> Message-ID: <2B7EB2D2AC7BAA46BC82166CF856FC3C3446557E@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com> An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From jperkins at biochem.ucl.ac.uk Wed Jun 27 16:00:30 2012 From: jperkins at biochem.ucl.ac.uk (James Perkins) Date: Wed, 27 Jun 2012 16:00:30 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: Hi, I wasn't sure if this was worth starting a new thread for this, since my question is very much related to this thread... Is there any plan to include the "comprehensive" exon array mappings? E.g. for rat: If one goes here http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 Then to Technical Documentation tab And downloads the "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, extended and comprehensive rn4" data http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip There are the core/extended/full ps and mps files here. However there is also a comprehensive mps file. Full, core and extended are from 2006. The comprehensive is from 2010 (and gets updated more regularly), so perhaps would be a better file to use for getNetAffx ? Apologies if this has been covered before. I am never sure of what is the best way to analyse exon array data at the gene level. Thanks, Jim On 13 June 2012 21:37, Benilton Carvalho wrote: > > please correct the code below to: > > eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is available) > > and if you want results at the exon level > > eset = rma(raw, target='probeset') > featureData(eset) = getNetAffx(raw, 'probeset') > > apologies for the mistake below. 
> > b > > On 13 June 2012 20:11, Benilton Carvalho wrote: > > FWIW, remember that you can obtain the contents of the annotation > > files (the NA32 Affymetrix files) with: > > > > library(Biobase) > > library(oligo) > > raw = read.celfiles(list.celfiles()) > > eset = rma(raw, target='transcript') > > featureData(eset) = getNetAffx(eset, 'transcript') > > head(fData(eset)) > > > > b > > > > On 13 June 2012 15:47, James W. MacDonald wrote: > >> Hi Andreas, > >> > >> > >> On 6/13/2012 3:14 AM, Andreas Heider wrote: > >>> > >>> Dear mailing list, > >>> I know this was on the list couple of times, and I think I read it all, > >>> but > >>> actually I still don't get it right. So here is my problem: > >>> > >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene > >>> 1.0 > >>> ST) in a similar fashion to eg. HG-U133 arrays. > >>> That means, I want to finally have it accessible as an ExpressionSet > >>> object > >>> with a right Bioconductor annotation assigned. This should include GENE > >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. > >> > >> > >> The problem here is that you want to do something that AFAIK isn't easy to > >> do. The Gene ST arrays allow you to summarize all the probes that > >> interrogate a particular transcript (e.g., all the exon-level probesets are > >> collapsed to transcript level, and then you summarize). However, for the > >> Exon ST arrays that isn't the case, unless there is something in xps to > >> allow for that - I know next to nothing about that package, so Cristian > >> Stratowa will have to chime in if I am missing something. > >> > >> For the Exon chips, you are always summarizing at the same probeset level, > >> where there are <= 4 probes per probeset, and there can be any number of > >> probesets that interrogate a given exon. Lots of these probesets interrogate > >> regions that aren't even transcribed, according to current knowledge of the > >> genome. 
When you choose core, extended or full probesets, you are just > >> changing the number of probesets being used, not summarizing at a different > >> level as with the Gene ST chip. > >> > >> So when you say you want gene symbols, refseq ids and gene ids, what exactly > >> are you after? If a given probeset is in the intron of a gene do you want to > >> annotate it as being part of that gene? How about if it is in the UTR (or > >> really close to the UTR)? What do you want to do with the probesets where > >> one or more of the probes binds in multiple positions in the genome? These > >> are all questions that the exonmap package tries to consider, and it gets > >> really complicated. That's why Affy went with the Gene ST chips - they > >> unleashed the Exon chips on us and couldn't sell them because people were > >> saying WTF do I do with this thing? > >> > >> I don't think there is an easy or obvious answer to your question. If you > >> were to come up with what you think are reasonable answers to my questions, > >> then it wouldn't be much work to extract the chr, start, end from the > >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., > >> ?findOverlaps()) to decide what regions are being interrogated, and annotate > >> from there. > >> > >> Best, > >> > >> Jim > >> > >> > >> > >>> > >>> I can import it as a AffyBatch and generate an ExpressionSet with the help > >>> of the Xmap/exonmap supplied CDF, but there is no annotation attached to > >>> it. > >>> > >>> OR > >>> > >>> I can import the CEL files with the "oligo" package as a Exon Array object > >>> and generate an ExpressionSet from it. > >>> However in that case it still have no annotation. > >>> > >>> Surprisingly on the Bioconductor website there are all packages needed to > >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work with Mouse > >>> Exon 1.0 ST arrays seems missing! > >>> > >>> What am I doing wrong here? Has someone else had such problems? 
> >>> > >>> Thanks in advance for your effort, > >>> Andreas > >>> > >>> ? ? ? ?[[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> -- > >> James W. MacDonald, M.S. > >> Biostatistician > >> University of Washington > >> Environmental and Occupational Health Sciences > >> 4225 Roosevelt Way NE, # 100 > >> Seattle WA 98105-6099 > >> > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From aheider at trm.uni-leipzig.de Wed Jun 27 16:03:18 2012 From: aheider at trm.uni-leipzig.de (Andreas Heider) Date: Wed, 27 Jun 2012 16:03:18 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From msbootwalla at gmail.com Wed Jun 27 16:13:40 2012 From: msbootwalla at gmail.com (Moiz Bootwalla) Date: Wed, 27 Jun 2012 07:13:40 -0700 Subject: [BioC] Finding TSS locations In-Reply-To: <2B7EB2D2AC7BAA46BC82166CF856FC3C3446557E@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com> References: <2B7EB2D2AC7BAA46BC82166CF856FC3C344654AD@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com>, <40E25B66-FC12-4004-89B6-34228DE4D998@gmail.com> <2B7EB2D2AC7BAA46BC82166CF856FC3C3446557E@DB3PRD0104MB120.eurprd01.prod.exchangelabs.com> Message-ID: <456FA530-0775-4848-8049-7C773497A352@gmail.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jperkins at biochem.ucl.ac.uk Wed Jun 27 16:25:46 2012 From: jperkins at biochem.ucl.ac.uk (James Perkins) Date: Wed, 27 Jun 2012 16:25:46 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: Thanks for the pointer Andreas, How did you go from probe sets for a given gene to the transcript level? And how did you know if it was "core", "extended", "full" confidence? Also, how did you summarise the probeset expression levels to make a transcript? Using biomart I get ~25k unique ensembl genes mapping to probe set ids, which is much higher than when I follow the oligo pipeline and perform RMA at core/extended/full level, and use getAffx for annotation. Thanks, Jim On 27 June 2012 16:03, Andreas Heider wrote: > Dear Jim, > I pulled all relevant annotation via biomaRt, as biomart was all mappings of > exon array probeset IDs to eg ENTREZID or GENESYMBOL. Than you can go on > from that. > > Cheers, > Andreas > > > 2012/6/27 James Perkins >> >> Hi, >> >> I wasn't sure if this was worth starting a new thread for this, since >> my question is very much related to this thread... >> >> Is there any plan to include the "comprehensive" exon array mappings? >> >> E.g. 
for rat: >> >> If one goes here >> >> >> http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 >> >> Then to Technical Documentation tab >> >> And downloads the >> >> "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, >> extended and comprehensive rn4" data >> >> >> http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip >> >> There are the core/extended/full ps and mps files here. >> >> However there is also a comprehensive mps file. >> >> Full, core and extended are from 2006. >> >> The comprehensive is from 2010 (and gets updated more regularly), so >> perhaps would be a better file to use for getNetAffx ? >> >> Apologies if this has been covered before. I am never sure of what is >> the best way to analyse exon array data at the gene level. >> >> Thanks, >> >> Jim >> >> >> >> >> On 13 June 2012 21:37, Benilton Carvalho >> wrote: >> > >> > please correct the code below to: >> > >> > eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is >> > available) >> > >> > and if you want results at the exon level >> > >> > eset = rma(raw, target='probeset') >> > featureData(eset) = getNetAffx(raw, 'probeset') >> > >> > apologies for the mistake below. >> > >> > b >> > >> > On 13 June 2012 20:11, Benilton Carvalho >> > wrote: >> > > FWIW, remember that you can obtain the contents of the annotation >> > > files (the NA32 Affymetrix files) with: >> > > >> > > library(Biobase) >> > > library(oligo) >> > > raw = read.celfiles(list.celfiles()) >> > > eset = rma(raw, target='transcript') >> > > featureData(eset) = getNetAffx(eset, 'transcript') >> > > head(fData(eset)) >> > > >> > > b >> > > >> > > On 13 June 2012 15:47, James W. 
MacDonald wrote: >> > >> Hi Andreas, >> > >> >> > >> >> > >> On 6/13/2012 3:14 AM, Andreas Heider wrote: >> > >>> >> > >>> Dear mailing list, >> > >>> I know this was on the list couple of times, and I think I read it >> > >>> all, >> > >>> but >> > >>> actually I still don't get it right. So here is my problem: >> > >>> >> > >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse >> > >>> Gene >> > >>> 1.0 >> > >>> ST) in a similar fashion to eg. HG-U133 arrays. >> > >>> That means, I want to finally have it accessible as an ExpressionSet >> > >>> object >> > >>> with a right Bioconductor annotation assigned. This should include >> > >>> GENE >> > >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. >> > >> >> > >> >> > >> The problem here is that you want to do something that AFAIK isn't >> > >> easy to >> > >> do. The Gene ST arrays allow you to summarize all the probes that >> > >> interrogate a particular transcript (e.g., all the exon-level >> > >> probesets are >> > >> collapsed to transcript level, and then you summarize). However, for >> > >> the >> > >> Exon ST arrays that isn't the case, unless there is something in xps >> > >> to >> > >> allow for that - I know next to nothing about that package, so >> > >> Cristian >> > >> Stratowa will have to chime in if I am missing something. >> > >> >> > >> For the Exon chips, you are always summarizing at the same probeset >> > >> level, >> > >> where there are <= 4 probes per probeset, and there can be any number >> > >> of >> > >> probesets that interrogate a given exon. Lots of these probesets >> > >> interrogate >> > >> regions that aren't even transcribed, according to current knowledge >> > >> of the >> > >> genome. When you choose core, extended or full probesets, you are >> > >> just >> > >> changing the number of probesets being used, not summarizing at a >> > >> different >> > >> level as with the Gene ST chip. 
>> > >> >> > >> So when you say you want gene symbols, refseq ids and gene ids, what >> > >> exactly >> > >> are you after? If a given probeset is in the intron of a gene do you >> > >> want to >> > >> annotate it as being part of that gene? How about if it is in the UTR >> > >> (or >> > >> really close to the UTR)? What do you want to do with the probesets >> > >> where >> > >> one or more of the probes binds in multiple positions in the genome? >> > >> These >> > >> are all questions that the exonmap package tries to consider, and it >> > >> gets >> > >> really complicated. That's why Affy went with the Gene ST chips - >> > >> they >> > >> unleashed the Exon chips on us and couldn't sell them because people >> > >> were >> > >> saying WTF do I do with this thing? >> > >> >> > >> I don't think there is an easy or obvious answer to your question. If >> > >> you >> > >> were to come up with what you think are reasonable answers to my >> > >> questions, >> > >> then it wouldn't be much work to extract the chr, start, end from the >> > >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., >> > >> ?findOverlaps()) to decide what regions are being interrogated, and >> > >> annotate >> > >> from there. >> > >> >> > >> Best, >> > >> >> > >> Jim >> > >> >> > >> >> > >> >> > >>> >> > >>> I can import it as a AffyBatch and generate an ExpressionSet with >> > >>> the help >> > >>> of the Xmap/exonmap supplied CDF, but there is no annotation >> > >>> attached to >> > >>> it. >> > >>> >> > >>> OR >> > >>> >> > >>> I can import the CEL files with the "oligo" package as a Exon Array >> > >>> object >> > >>> and generate an ExpressionSet from it. >> > >>> However in that case it still have no annotation. >> > >>> >> > >>> Surprisingly on the Bioconductor website there are all packages >> > >>> needed to >> > >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work with >> > >>> Mouse >> > >>> Exon 1.0 ST arrays seems missing! 
>> > >>> >> > >>> What am I doing wrong here? Has someone else had such problems? >> > >>> >> > >>> Thanks in advance for your effort, >> > >>> Andreas >> > >>> >> > >>> ? ? ? ?[[alternative HTML version deleted]] >> > >>> >> > >>> _______________________________________________ >> > >>> Bioconductor mailing list >> > >>> Bioconductor at r-project.org >> > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> > >>> Search the archives: >> > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >> > >> >> > >> -- >> > >> James W. MacDonald, M.S. >> > >> Biostatistician >> > >> University of Washington >> > >> Environmental and Occupational Health Sciences >> > >> 4225 Roosevelt Way NE, # 100 >> > >> Seattle WA 98105-6099 >> > >> >> > >> >> > >> _______________________________________________ >> > >> Bioconductor mailing list >> > >> Bioconductor at r-project.org >> > >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> > >> Search the archives: >> > >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor > > From aheider at trm.uni-leipzig.de Wed Jun 27 16:45:51 2012 From: aheider at trm.uni-leipzig.de (Andreas Heider) Date: Wed, 27 Jun 2012 16:45:51 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From lgoff at csail.mit.edu Wed Jun 27 16:57:31 2012 From: lgoff at csail.mit.edu (Loyal Goff) Date: Wed, 27 Jun 2012 10:57:31 -0400 Subject: [BioC] Unable to open database file, cummeRbund error. 
In-Reply-To: <010b815d-28b1-45e0-98fb-3bce4777eeec@zimbra4.fhcrc.org> References: <010b815d-28b1-45e0-98fb-3bce4777eeec@zimbra4.fhcrc.org> Message-ID: <72669F21-F1FE-46F7-8BA2-6300E68CEFFB@csail.mit.edu> Hi Justin, Can you confirm that "~/Test/output/cuffdiff" is a valid path? I cannot seem to re-create this issue using a similar approach to yours. Alternatively, can you just provide the directory path directly to readCufflinks instead of going through file.path()? -Loyal On Jun 26, 2012, at 2:03 PM, Hardcastle, Justin wrote: > Hi, > I'm having an issue running cummeRbund on my cuffdiff output. CummeRbund is giving me a DB error and not creating the DB. The code and error are below. > > library("cummeRbund") > > dir = "~/Test" > outdir = "output/cuffdiff" > cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) > > The error given is > >> cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) > Creating database ~/Test/output/cuffdiff/cuffData.db > Error in sqliteNewConnection(drv, ...) : > RS-DBI driver: (could not connect to dbname: > unable to open database file > ) > > I am running cummeRbund 1.2.0, and Cufflinks 2.0.1. > > Thanks for any help. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From ovokeraye at gmail.com Wed Jun 27 17:05:09 2012 From: ovokeraye at gmail.com (Ovokeraye Achinike-Oduaran) Date: Wed, 27 Jun 2012 17:05:09 +0200 Subject: [BioC] BiomaRt Details Message-ID: Hi all, I've been working with biomaRt 2.12.0 and would like to know what versions of Ensembl, dbSNP, Variation, etc it's using. Is it the exact same as on the web interface (www.biomart.org)? 
Thanks and regards, Avoks From stvjc at channing.harvard.edu Wed Jun 27 17:06:39 2012 From: stvjc at channing.harvard.edu (Vincent Carey) Date: Wed, 27 Jun 2012 11:06:39 -0400 Subject: [BioC] How to create a phenodata In-Reply-To: <20120627095404.7D5E013C727@mamba.fhcrc.org> References: <20120627095404.7D5E013C727@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jperkins at biochem.ucl.ac.uk Wed Jun 27 17:07:09 2012 From: jperkins at biochem.ucl.ac.uk (James Perkins) Date: Wed, 27 Jun 2012 17:07:09 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: Thanks Andreas! That's really useful information, I will have a look. Out of interest, did you look at the distribution of expression levels for the different probe-sets? If you are including all probe-sets, I would guess that if there were a lot of predicted/intronic probe sets that aren't expressed that could bias your gene-level estimation, i.e. if the proportion is above the break-down point of the summarisation/aggregation method. Although perhaps the CDF from annmap takes care of that? Cheers! Jim On 27 June 2012 16:45, Andreas Heider wrote: > Ok, sorry, that was the "short answer". Here comes the longer one: > 1. get a CDF for the chip, get it at http://annmap.picr.man.ac.uk/download/ > 2. load CEL files using standard affy package > 3. assign the downloaded CDF to your AffyBatch object > 4. calculate RMA or whatever you want (NOTE: this will get you all > probesets, no restrictions as in e.g. "core") > 5. pull the whole set of identifiers from biomaRt and annotate your > expression matrix with this information > 6. "collapse" probesets targeting the same identifier to its mean, median > or medpolish, whatever suits your needs best via functions such as "recast" or > "aggregate" > 7. have fun with your new expression matrix! 
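Step 6 of the list above (collapsing probesets that target the same identifier) can be sketched in base R. This is only an illustration on made-up data, not the code actually used in the thread: the matrix `expr`, the probeset IDs and the `gene` mapping are hypothetical stand-ins for the RMA output and the annotation pulled from biomaRt.

```r
# Toy probeset-level expression matrix (hypothetical values):
# rows are probesets, columns are samples.
expr <- matrix(c(5.0, 7.0, 6.0, 2.0,
                 5.5, 6.5, 8.0, 3.0),
               nrow = 4,
               dimnames = list(c("ps1", "ps2", "ps3", "ps4"),
                               c("sampleA", "sampleB")))

# Hypothetical probeset-to-gene mapping, e.g. as obtained from biomaRt.
gene <- c(ps1 = "GeneX", ps2 = "GeneX", ps3 = "GeneY", ps4 = "GeneY")

# Collapse probesets that map to the same gene by taking the median
# per sample; swap in mean() if that suits the data better.
collapsed <- aggregate(as.data.frame(expr),
                       by = list(gene = gene[rownames(expr)]),
                       FUN = median)

# Turn the result back into a gene-by-sample matrix.
gene_expr <- as.matrix(collapsed[, -1])
rownames(gene_expr) <- as.character(collapsed$gene)
gene_expr
```

With the median, a gene-level value stays stable only while fewer than half of a gene's probesets are unexpressed, so the choice of summary function matters when many intronic or predicted probesets are included.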
> > Hope that helps, I needed also some time to figure out the individual steps. > > > 2012/6/27 James Perkins >> >> Thanks for the pointer Andreas, >> >> How did you go from probe sets for a given gene to the transcript >> level? And how did you know if it was "core", "extended", "full" >> confidence? >> >> Also, how did you summarise the probeset expression levels to make a >> transcript? Using biomart I get ~25k unique ensembl genes mapping to >> probe set ids, which is much higher than when I follow the oligo >> pipeline and perform RMA at core/extended/full level, and use getAffx >> for annotation. >> >> Thanks, >> >> Jim >> >> On 27 June 2012 16:03, Andreas Heider wrote: >> > Dear Jim, >> > I pulled all relevant annotation via biomaRt, as biomart was all >> > mappings of >> > exon array probeset IDs to eg ENTREZID or GENESYMBOL. Than you can go on >> > from that. >> > >> > Cheers, >> > Andreas >> > >> > >> > 2012/6/27 James Perkins >> >> >> >> Hi, >> >> >> >> I wasn't sure if this was worth starting a new thread for this, since >> >> my question is very much related to this thread... >> >> >> >> Is there any plan to include the "comprehensive" exon array mappings? >> >> >> >> E.g. for rat: >> >> >> >> If one goes here >> >> >> >> >> >> >> >> http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 >> >> >> >> Then to Technical Documentation tab >> >> >> >> And downloads the >> >> >> >> "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, >> >> extended and comprehensive rn4" data >> >> >> >> >> >> >> >> http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip >> >> >> >> There are the core/extended/full ps and mps files here. >> >> >> >> However there is also a comprehensive mps file. >> >> >> >> Full, core and extended are from 2006. 
>> >> >> >> The comprehensive is from 2010 (and gets updated more regularly), so >> >> perhaps would be a better file to use for getNetAffx ? >> >> >> >> Apologies if this has been covered before. I am never sure of what is >> >> the best way to analyse exon array data at the gene level. >> >> >> >> Thanks, >> >> >> >> Jim >> >> >> >> >> >> >> >> >> >> On 13 June 2012 21:37, Benilton Carvalho >> >> wrote: >> >> > >> >> > please correct the code below to: >> >> > >> >> > eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is >> >> > available) >> >> > >> >> > and if you want results at the exon level >> >> > >> >> > eset = rma(raw, target='probeset') >> >> > featureData(eset) = getNetAffx(raw, 'probeset') >> >> > >> >> > apologies for the mistake below. >> >> > >> >> > b >> >> > >> >> > On 13 June 2012 20:11, Benilton Carvalho >> >> > wrote: >> >> > > FWIW, remember that you can obtain the contents of the annotation >> >> > > files (the NA32 Affymetrix files) with: >> >> > > >> >> > > library(Biobase) >> >> > > library(oligo) >> >> > > raw = read.celfiles(list.celfiles()) >> >> > > eset = rma(raw, target='transcript') >> >> > > featureData(eset) = getNetAffx(eset, 'transcript') >> >> > > head(fData(eset)) >> >> > > >> >> > > b >> >> > > >> >> > > On 13 June 2012 15:47, James W. MacDonald wrote: >> >> > >> Hi Andreas, >> >> > >> >> >> > >> >> >> > >> On 6/13/2012 3:14 AM, Andreas Heider wrote: >> >> > >>> >> >> > >>> Dear mailing list, >> >> > >>> I know this was on the list couple of times, and I think I read >> >> > >>> it >> >> > >>> all, >> >> > >>> but >> >> > >>> actually I still don't get it right. So here is my problem: >> >> > >>> >> >> > >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT >> >> > >>> Mouse >> >> > >>> Gene >> >> > >>> 1.0 >> >> > >>> ST) in a similar fashion to eg. HG-U133 arrays. 
>> >> > >>> That means, I want to finally have it accessible as an >> >> > >>> ExpressionSet >> >> > >>> object >> >> > >>> with a right Bioconductor annotation assigned. This should >> >> > >>> include >> >> > >>> GENE >> >> > >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. >> >> > >> >> >> > >> >> >> > >> The problem here is that you want to do something that AFAIK isn't >> >> > >> easy to >> >> > >> do. The Gene ST arrays allow you to summarize all the probes that >> >> > >> interrogate a particular transcript (e.g., all the exon-level >> >> > >> probesets are >> >> > >> collapsed to transcript level, and then you summarize). However, >> >> > >> for >> >> > >> the >> >> > >> Exon ST arrays that isn't the case, unless there is something in >> >> > >> xps >> >> > >> to >> >> > >> allow for that - I know next to nothing about that package, so >> >> > >> Cristian >> >> > >> Stratowa will have to chime in if I am missing something. >> >> > >> >> >> > >> For the Exon chips, you are always summarizing at the same >> >> > >> probeset >> >> > >> level, >> >> > >> where there are <= 4 probes per probeset, and there can be any >> >> > >> number >> >> > >> of >> >> > >> probesets that interrogate a given exon. Lots of these probesets >> >> > >> interrogate >> >> > >> regions that aren't even transcribed, according to current >> >> > >> knowledge >> >> > >> of the >> >> > >> genome. When you choose core, extended or full probesets, you are >> >> > >> just >> >> > >> changing the number of probesets being used, not summarizing at a >> >> > >> different >> >> > >> level as with the Gene ST chip. >> >> > >> >> >> > >> So when you say you want gene symbols, refseq ids and gene ids, >> >> > >> what >> >> > >> exactly >> >> > >> are you after? If a given probeset is in the intron of a gene do >> >> > >> you >> >> > >> want to >> >> > >> annotate it as being part of that gene? How about if it is in the >> >> > >> UTR >> >> > >> (or >> >> > >> really close to the UTR)? 
What do you want to do with the >> >> > >> probesets >> >> > >> where >> >> > >> one or more of the probes binds in multiple positions in the >> >> > >> genome? >> >> > >> These >> >> > >> are all questions that the exonmap package tries to consider, and >> >> > >> it >> >> > >> gets >> >> > >> really complicated. That's why Affy went with the Gene ST chips - >> >> > >> they >> >> > >> unleashed the Exon chips on us and couldn't sell them because >> >> > >> people >> >> > >> were >> >> > >> saying WTF do I do with this thing? >> >> > >> >> >> > >> I don't think there is an easy or obvious answer to your question. >> >> > >> If >> >> > >> you >> >> > >> were to come up with what you think are reasonable answers to my >> >> > >> questions, >> >> > >> then it wouldn't be much work to extract the chr, start, end from >> >> > >> the >> >> > >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., >> >> > >> ?findOverlaps()) to decide what regions are being interrogated, >> >> > >> and >> >> > >> annotate >> >> > >> from there. >> >> > >> >> >> > >> Best, >> >> > >> >> >> > >> Jim >> >> > >> >> >> > >> >> >> > >> >> >> > >>> >> >> > >>> I can import it as a AffyBatch and generate an ExpressionSet with >> >> > >>> the help >> >> > >>> of the Xmap/exonmap supplied CDF, but there is no annotation >> >> > >>> attached to >> >> > >>> it. >> >> > >>> >> >> > >>> OR >> >> > >>> >> >> > >>> I can import the CEL files with the "oligo" package as a Exon >> >> > >>> Array >> >> > >>> object >> >> > >>> and generate an ExpressionSet from it. >> >> > >>> However in that case it still have no annotation. >> >> > >>> >> >> > >>> Surprisingly on the Bioconductor website there are all packages >> >> > >>> needed to >> >> > >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work >> >> > >>> with >> >> > >>> Mouse >> >> > >>> Exon 1.0 ST arrays seems missing! >> >> > >>> >> >> > >>> What am I doing wrong here? Has someone else had such problems? 
>> >> > >>> >> >> > >>> Thanks in advance for your effort, >> >> > >>> Andreas >> >> > >>> >> >> > >>> ? ? ? ?[[alternative HTML version deleted]] >> >> > >>> >> >> > >>> _______________________________________________ >> >> > >>> Bioconductor mailing list >> >> > >>> Bioconductor at r-project.org >> >> > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> > >>> Search the archives: >> >> > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > >> >> >> > >> >> >> > >> -- >> >> > >> James W. MacDonald, M.S. >> >> > >> Biostatistician >> >> > >> University of Washington >> >> > >> Environmental and Occupational Health Sciences >> >> > >> 4225 Roosevelt Way NE, # 100 >> >> > >> Seattle WA 98105-6099 >> >> > >> >> >> > >> >> >> > >> _______________________________________________ >> >> > >> Bioconductor mailing list >> >> > >> Bioconductor at r-project.org >> >> > >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> > >> Search the archives: >> >> > >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > >> >> > _______________________________________________ >> >> > Bioconductor mailing list >> >> > Bioconductor at r-project.org >> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> > Search the archives: >> >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > > > From kasperdanielhansen at gmail.com Wed Jun 27 17:07:16 2012 From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen) Date: Wed, 27 Jun 2012 11:07:16 -0400 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: Message-ID: One comment: since matrix is a vector with a dim attribute I see that the natural parallel is doing the same for Rle. Nevertheless, that would put an upper limit on the number of runLengths in the entire matrix. 
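To make the idea of a matrix-like object with run-length-encoded columns concrete, here is a rough base-R sketch. It uses base `rle()` rather than the IRanges `Rle` class, and `rle_col_sums()` is a hypothetical helper written for this illustration, not an existing Bioconductor function.

```r
# Dense toy matrix with long runs in each column (hypothetical coverage data).
m <- cbind(s1 = c(0, 0, 0, 2, 2, 2),
           s2 = c(1, 1, 1, 1, 1, 4))

# A plain matrix is just a vector with a "dim" attribute.
stopifnot(identical(dim(m), c(6L, 2L)))

# Store each column as its own run-length encoding (base R rle()).
cols <- lapply(seq_len(ncol(m)), function(j) rle(m[, j]))
names(cols) <- colnames(m)

# Column sums only need the runs: sum(value * run length) per column.
rle_col_sums <- function(rle_cols) {
  vapply(rle_cols, function(r) sum(r$values * r$lengths), numeric(1))
}

rle_col_sums(cols)
```

Column-wise summaries fall out of the runs directly; row-wise summaries are harder in this layout because the run boundaries of different columns generally do not line up.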
My impression (which could be wrong) is that we would need to implement essentially all matrix-like numeric operations from scratch anyway, so it may be worthwhile to consider using a list of Rle's where each Rle is a column, instead of a single Rle to represent all columns. Clearly that depends on implementation details, but if we really need to do everything from scratch, a list of columns might be more flexible (and perhaps even easier to code). Kasper On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence wrote: > Seems like it could be a nice thing to have. Presumably one would create an > Array subclass of Vector that would add a "dim" attribute. Then Matrix could > extend that to constrain dim to length two (unfortunately colliding with the > Matrix class in the Matrix package). Then RleMatrix extends Matrix to > implement the actual data storage and many of the accelerated methods. As > you said, row-oriented methods would be tough. > > Any takers? > > Michael > > On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen > wrote: >> >> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen >> wrote: >> > On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence >> > wrote: >> >> Patrick and I had talked about this a long time ago (essentially >> >> putting a >> >> "dim" attribute on an Rle), but the closest thing today is a DataFrame >> >> with >> >> Rle columns. >> >> >> >> Use case? >> > >> > Say I have whole-genome data (for example coverage) on multiple >> > samples. Usually, this is far easier to think of as a matrix (in my >> > opinion) with ~3B rows and I often want to do rowSums(), colSums() etc >> > (in fact, probably the whole API from matrixStats). This is >> > especially nice when you have multiple coverage-like tracks on each >> > sample, so you could have >> > trackA : genome by samples >> > trackB : genome by samples >> > ... >> > >> > You could think of this as a SummarizedExperiment, but with >> > _extremely_ big matrices in the assay slot. 
>> > >> > I want to take advantage of the Rle structure to store the data more >> > efficiently and also to do potentially faster computations. >> > >> > This is actually closer to my use case where I currently use matrices >> > with ~30M rows (which works fine), but I would like to expand to ~800M >> > rows (which would suck a bit). >> > >> > You could also think of a matrix-like object with Rle columns as an >> > alternative sparse matrix structure. ?In a typical sparse matrix you >> > only store the non-zero entities, here we only store the >> > change-points. ?Depending on the structure of the matrix this could be >> > an efficient storage of an otherwise dense matrix. >> > >> > So essentially, what I want, is to have mathematical operations on >> > this object, where I would utilize that I know that all entities are >> > numbers so the typical matrix operations makes sense. >> > >> > [ side question which could be relevant in this discussion: for a >> > numeric Rle is there some notion of precision - say I have truly >> > numeric values with tons of digits, and I want to consider two numbers >> > part of the same run if |x1 -x2|> >> You can see that Pete has had similar thoughts in >> genoset/R/DataFrame-methods.R, although he only has colMeans (which is >> the easy one). >> >> Kasper >> >> > Kasper >> > >> >> >> >> Michael >> >> >> >> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen >> >> wrote: >> >>> >> >>> Do we have a matrix-like object, but where the columns are Rle's? 
>> >>> >> >>> Kasper >> >>> >> >>> _______________________________________________ >> >>> Bioconductor mailing list >> >>> Bioconductor at r-project.org >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>> Search the archives: >> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> > > From aheider at trm.uni-leipzig.de Wed Jun 27 17:15:02 2012 From: aheider at trm.uni-leipzig.de (Andreas Heider) Date: Wed, 27 Jun 2012 17:15:02 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From lawrence.michael at gene.com Wed Jun 27 17:21:01 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Wed, 27 Jun 2012 08:21:01 -0700 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From beniltoncarvalho at gmail.com Wed Jun 27 17:27:56 2012 From: beniltoncarvalho at gmail.com (Benilton Carvalho) Date: Wed, 27 Jun 2012 16:27:56 +0100 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: Hi Jim, I'll make sure to add the comprehensive MPS as soon as I get more info about it from the specialists... However, note that the contents of the MPS files are not used by getNetAffx(), which only uses the probeset/transcript annotation file... Thanks, benilton On 27 June 2012 15:00, James Perkins wrote: > Hi, > > I wasn't sure if this was worth starting a new thread for this, since > my question is very much related to this thread... > > Is there any plan to include the "comprehensive" exon array mappings? > > E.g. 
for rat: > > If one goes here > > http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 > > Then to Technical Documentation tab > > And downloads the > > "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, > extended and comprehensive rn4" data > > http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip > > There are the core/extended/full ps and mps files here. > > However there is also a comprehensive mps file. > > Full, core and extended are from 2006. > > The comprehensive is from 2010 (and gets updated more regularly), so > perhaps would be a better file to use for getNetAffx ? > > Apologies if this has been covered before. I am never sure of what is > the best way to analyse exon array data at the gene level. > > Thanks, > > Jim > > > > > On 13 June 2012 21:37, Benilton Carvalho wrote: >> >> please correct the code below to: >> >> eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is available) >> >> and if you want results at the exon level >> >> eset = rma(raw, target='probeset') >> featureData(eset) = getNetAffx(raw, 'probeset') >> >> apologies for the mistake below. >> >> b >> >> On 13 June 2012 20:11, Benilton Carvalho wrote: >> > FWIW, remember that you can obtain the contents of the annotation >> > files (the NA32 Affymetrix files) with: >> > >> > library(Biobase) >> > library(oligo) >> > raw = read.celfiles(list.celfiles()) >> > eset = rma(raw, target='transcript') >> > featureData(eset) = getNetAffx(eset, 'transcript') >> > head(fData(eset)) >> > >> > b >> > >> > On 13 June 2012 15:47, James W. MacDonald wrote: >> >> Hi Andreas, >> >> >> >> >> >> On 6/13/2012 3:14 AM, Andreas Heider wrote: >> >>> >> >>> Dear mailing list, >> >>> I know this was on the list couple of times, and I think I read it all, >> >>> but >> >>> actually I still don't get it right. 
So here is my problem: >> >>> >> >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene >> >>> 1.0 >> >>> ST) in a similar fashion to eg. HG-U133 arrays. >> >>> That means, I want to finally have it accessible as an ExpressionSet >> >>> object >> >>> with a right Bioconductor annotation assigned. This should include GENE >> >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. >> >> >> >> >> >> The problem here is that you want to do something that AFAIK isn't easy to >> >> do. The Gene ST arrays allow you to summarize all the probes that >> >> interrogate a particular transcript (e.g., all the exon-level probesets are >> >> collapsed to transcript level, and then you summarize). However, for the >> >> Exon ST arrays that isn't the case, unless there is something in xps to >> >> allow for that - I know next to nothing about that package, so Cristian >> >> Stratowa will have to chime in if I am missing something. >> >> >> >> For the Exon chips, you are always summarizing at the same probeset level, >> >> where there are <= 4 probes per probeset, and there can be any number of >> >> probesets that interrogate a given exon. Lots of these probesets interrogate >> >> regions that aren't even transcribed, according to current knowledge of the >> >> genome. When you choose core, extended or full probesets, you are just >> >> changing the number of probesets being used, not summarizing at a different >> >> level as with the Gene ST chip. >> >> >> >> So when you say you want gene symbols, refseq ids and gene ids, what exactly >> >> are you after? If a given probeset is in the intron of a gene do you want to >> >> annotate it as being part of that gene? How about if it is in the UTR (or >> >> really close to the UTR)? What do you want to do with the probesets where >> >> one or more of the probes binds in multiple positions in the genome? These >> >> are all questions that the exonmap package tries to consider, and it gets >> >> really complicated. 
That's why Affy went with the Gene ST chips - they >> >> unleashed the Exon chips on us and couldn't sell them because people were >> >> saying WTF do I do with this thing? >> >> >> >> I don't think there is an easy or obvious answer to your question. If you >> >> were to come up with what you think are reasonable answers to my questions, >> >> then it wouldn't be much work to extract the chr, start, end from the >> >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., >> >> ?findOverlaps()) to decide what regions are being interrogated, and annotate >> >> from there. >> >> >> >> Best, >> >> >> >> Jim >> >> >> >> >> >> >> >>> >> >>> I can import it as a AffyBatch and generate an ExpressionSet with the help >> >>> of the Xmap/exonmap supplied CDF, but there is no annotation attached to >> >>> it. >> >>> >> >>> OR >> >>> >> >>> I can import the CEL files with the "oligo" package as a Exon Array object >> >>> and generate an ExpressionSet from it. >> >>> However in that case it still have no annotation. >> >>> >> >>> Surprisingly on the Bioconductor website there are all packages needed to >> >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work with Mouse >> >>> Exon 1.0 ST arrays seems missing! >> >>> >> >>> What am I doing wrong here? Has someone else had such problems? >> >>> >> >>> Thanks in advance for your effort, >> >>> Andreas >> >>> >> >>> ? ? ? ?[[alternative HTML version deleted]] >> >>> >> >>> _______________________________________________ >> >>> Bioconductor mailing list >> >>> Bioconductor at r-project.org >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>> Search the archives: >> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> >> >> -- >> >> James W. MacDonald, M.S. 
>> >> Biostatistician >> >> University of Washington >> >> Environmental and Occupational Health Sciences >> >> 4225 Roosevelt Way NE, # 100 >> >> Seattle WA 98105-6099 >> >> >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor at r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From aheider at trm.uni-leipzig.de Wed Jun 27 17:30:02 2012 From: aheider at trm.uni-leipzig.de (Andreas Heider) Date: Wed, 27 Jun 2012 17:30:02 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From jperkins at biochem.ucl.ac.uk Wed Jun 27 17:33:27 2012 From: jperkins at biochem.ucl.ac.uk (James Perkins) Date: Wed, 27 Jun 2012 17:33:27 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: Could you expand on that a little? Do you mean you can change the level of confidence of the ps ids mapping to the ENSEMBL gene using biomart? On 27 June 2012 17:30, Andreas Heider wrote: > Also remember, that this will be influenced by your selection of identifiers > in biomart! > > > 2012/6/27 Andreas Heider >> >> The AnnMap CDF should take care of that. >> >> >> 2012/6/27 James Perkins >>> >>> Thanks Andreas! That's really useful information, I will have a look. >>> >>> Out of interest, did you look at the distribution of expression levels >>> for the different prob-sets? 
If you are including all probe-sets, I
>>> would guess that if there were a lot of predicted/intronic probe sets
>>> that aren't expressed that could bias your gene-level estimation, i.e.
>>> if the proportion is above the break-down point of the
>>> summarisation/aggregation method.
>>>
>>> Although perhaps the CDF from annmap takes care of that?
>>>
>>> Cheers!
>>>
>>> Jim
>>>
>>> On 27 June 2012 16:45, Andreas Heider wrote:
>>> > Ok, sorry, that was the "short answer". Here comes the longer one:
>>> > 1. get a CDF for the chip, get it at
>>> >    http://annmap.picr.man.ac.uk/download/
>>> > 2. load CEL files using standard affy package
>>> > 3. assign the downloaded CDF to your AffyBatch object
>>> > 4. calculate RMA or whatever you want (NOTE: this will get you all
>>> >    probesets, no restrictions as in eg "core")
>>> > 5. pull the whole set of identifiers from biomaRt and annotate your
>>> >    expression matrix with this information
>>> > 6. "collapse" probesets targeting the same identifier to its mean,
>>> >    median or medpolish, whatever suits your needs best, via functions
>>> >    such as "recast" or "aggregate"
>>> > 7. have fun with your new expression matrix!
>>> >
>>> > Hope that helps, I also needed some time to figure out the individual
>>> > steps.
>>> >
>>> >
>>> > 2012/6/27 James Perkins
>>> >>
>>> >> Thanks for the pointer Andreas,
>>> >>
>>> >> How did you go from probe sets for a given gene to the transcript
>>> >> level? And how did you know if it was "core", "extended", "full"
>>> >> confidence?
>>> >>
>>> >> Also, how did you summarise the probeset expression levels to make a
>>> >> transcript? Using biomart I get ~25k unique ensembl genes mapping to
>>> >> probe set ids, which is much higher than when I follow the oligo
>>> >> pipeline and perform RMA at core/extended/full level, and use
>>> >> getNetAffx for annotation.
>>> >> >>> >> Thanks, >>> >> >>> >> Jim >>> >> >>> >> On 27 June 2012 16:03, Andreas Heider >>> >> wrote: >>> >> > Dear Jim, >>> >> > I pulled all relevant annotation via biomaRt, as biomart was all >>> >> > mappings of >>> >> > exon array probeset IDs to eg ENTREZID or GENESYMBOL. Than you can >>> >> > go on >>> >> > from that. >>> >> > >>> >> > Cheers, >>> >> > Andreas >>> >> > >>> >> > >>> >> > 2012/6/27 James Perkins >>> >> >> >>> >> >> Hi, >>> >> >> >>> >> >> I wasn't sure if this was worth starting a new thread for this, >>> >> >> since >>> >> >> my question is very much related to this thread... >>> >> >> >>> >> >> Is there any plan to include the "comprehensive" exon array >>> >> >> mappings? >>> >> >> >>> >> >> E.g. for rat: >>> >> >> >>> >> >> If one goes here >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 >>> >> >> >>> >> >> Then to Technical Documentation tab >>> >> >> >>> >> >> And downloads the >>> >> >> >>> >> >> "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, >>> >> >> full, >>> >> >> extended and comprehensive rn4" data >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip >>> >> >> >>> >> >> There are the core/extended/full ps and mps files here. >>> >> >> >>> >> >> However there is also a comprehensive mps file. >>> >> >> >>> >> >> Full, core and extended are from 2006. >>> >> >> >>> >> >> The comprehensive is from 2010 (and gets updated more regularly), >>> >> >> so >>> >> >> perhaps would be a better file to use for getNetAffx ? >>> >> >> >>> >> >> Apologies if this has been covered before. I am never sure of what >>> >> >> is >>> >> >> the best way to analyse exon array data at the gene level. 
>>> >> >> >>> >> >> Thanks, >>> >> >> >>> >> >> Jim >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> On 13 June 2012 21:37, Benilton Carvalho >>> >> >> >>> >> >> wrote: >>> >> >> > >>> >> >> > please correct the code below to: >>> >> >> > >>> >> >> > eset = rma(raw, target='full') ## or 'core', 'extended' (whatever >>> >> >> > is >>> >> >> > available) >>> >> >> > >>> >> >> > and if you want results at the exon level >>> >> >> > >>> >> >> > eset = rma(raw, target='probeset') >>> >> >> > featureData(eset) = getNetAffx(raw, 'probeset') >>> >> >> > >>> >> >> > apologies for the mistake below. >>> >> >> > >>> >> >> > b >>> >> >> > >>> >> >> > On 13 June 2012 20:11, Benilton Carvalho >>> >> >> > >>> >> >> > wrote: >>> >> >> > > FWIW, remember that you can obtain the contents of the >>> >> >> > > annotation >>> >> >> > > files (the NA32 Affymetrix files) with: >>> >> >> > > >>> >> >> > > library(Biobase) >>> >> >> > > library(oligo) >>> >> >> > > raw = read.celfiles(list.celfiles()) >>> >> >> > > eset = rma(raw, target='transcript') >>> >> >> > > featureData(eset) = getNetAffx(eset, 'transcript') >>> >> >> > > head(fData(eset)) >>> >> >> > > >>> >> >> > > b >>> >> >> > > >>> >> >> > > On 13 June 2012 15:47, James W. MacDonald >>> >> >> > > wrote: >>> >> >> > >> Hi Andreas, >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> On 6/13/2012 3:14 AM, Andreas Heider wrote: >>> >> >> > >>> >>> >> >> > >>> Dear mailing list, >>> >> >> > >>> I know this was on the list couple of times, and I think I >>> >> >> > >>> read >>> >> >> > >>> it >>> >> >> > >>> all, >>> >> >> > >>> but >>> >> >> > >>> actually I still don't get it right. So here is my problem: >>> >> >> > >>> >>> >> >> > >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT >>> >> >> > >>> Mouse >>> >> >> > >>> Gene >>> >> >> > >>> 1.0 >>> >> >> > >>> ST) in a similar fashion to eg. HG-U133 arrays. 
>>> >> >> > >>> That means, I want to finally have it accessible as an >>> >> >> > >>> ExpressionSet >>> >> >> > >>> object >>> >> >> > >>> with a right Bioconductor annotation assigned. This should >>> >> >> > >>> include >>> >> >> > >>> GENE >>> >> >> > >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> The problem here is that you want to do something that AFAIK >>> >> >> > >> isn't >>> >> >> > >> easy to >>> >> >> > >> do. The Gene ST arrays allow you to summarize all the probes >>> >> >> > >> that >>> >> >> > >> interrogate a particular transcript (e.g., all the exon-level >>> >> >> > >> probesets are >>> >> >> > >> collapsed to transcript level, and then you summarize). >>> >> >> > >> However, >>> >> >> > >> for >>> >> >> > >> the >>> >> >> > >> Exon ST arrays that isn't the case, unless there is something >>> >> >> > >> in >>> >> >> > >> xps >>> >> >> > >> to >>> >> >> > >> allow for that - I know next to nothing about that package, so >>> >> >> > >> Cristian >>> >> >> > >> Stratowa will have to chime in if I am missing something. >>> >> >> > >> >>> >> >> > >> For the Exon chips, you are always summarizing at the same >>> >> >> > >> probeset >>> >> >> > >> level, >>> >> >> > >> where there are <= 4 probes per probeset, and there can be any >>> >> >> > >> number >>> >> >> > >> of >>> >> >> > >> probesets that interrogate a given exon. Lots of these >>> >> >> > >> probesets >>> >> >> > >> interrogate >>> >> >> > >> regions that aren't even transcribed, according to current >>> >> >> > >> knowledge >>> >> >> > >> of the >>> >> >> > >> genome. When you choose core, extended or full probesets, you >>> >> >> > >> are >>> >> >> > >> just >>> >> >> > >> changing the number of probesets being used, not summarizing >>> >> >> > >> at a >>> >> >> > >> different >>> >> >> > >> level as with the Gene ST chip. 
>>> >> >> > >> >>> >> >> > >> So when you say you want gene symbols, refseq ids and gene >>> >> >> > >> ids, >>> >> >> > >> what >>> >> >> > >> exactly >>> >> >> > >> are you after? If a given probeset is in the intron of a gene >>> >> >> > >> do >>> >> >> > >> you >>> >> >> > >> want to >>> >> >> > >> annotate it as being part of that gene? How about if it is in >>> >> >> > >> the >>> >> >> > >> UTR >>> >> >> > >> (or >>> >> >> > >> really close to the UTR)? What do you want to do with the >>> >> >> > >> probesets >>> >> >> > >> where >>> >> >> > >> one or more of the probes binds in multiple positions in the >>> >> >> > >> genome? >>> >> >> > >> These >>> >> >> > >> are all questions that the exonmap package tries to consider, >>> >> >> > >> and >>> >> >> > >> it >>> >> >> > >> gets >>> >> >> > >> really complicated. That's why Affy went with the Gene ST >>> >> >> > >> chips - >>> >> >> > >> they >>> >> >> > >> unleashed the Exon chips on us and couldn't sell them because >>> >> >> > >> people >>> >> >> > >> were >>> >> >> > >> saying WTF do I do with this thing? >>> >> >> > >> >>> >> >> > >> I don't think there is an easy or obvious answer to your >>> >> >> > >> question. >>> >> >> > >> If >>> >> >> > >> you >>> >> >> > >> were to come up with what you think are reasonable answers to >>> >> >> > >> my >>> >> >> > >> questions, >>> >> >> > >> then it wouldn't be much work to extract the chr, start, end >>> >> >> > >> from >>> >> >> > >> the >>> >> >> > >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., >>> >> >> > >> ?findOverlaps()) to decide what regions are being >>> >> >> > >> interrogated, >>> >> >> > >> and >>> >> >> > >> annotate >>> >> >> > >> from there. 
>>> >> >> > >> >>> >> >> > >> Best, >>> >> >> > >> >>> >> >> > >> Jim >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > >>> >>> >> >> > >>> I can import it as a AffyBatch and generate an ExpressionSet >>> >> >> > >>> with >>> >> >> > >>> the help >>> >> >> > >>> of the Xmap/exonmap supplied CDF, but there is no annotation >>> >> >> > >>> attached to >>> >> >> > >>> it. >>> >> >> > >>> >>> >> >> > >>> OR >>> >> >> > >>> >>> >> >> > >>> I can import the CEL files with the "oligo" package as a Exon >>> >> >> > >>> Array >>> >> >> > >>> object >>> >> >> > >>> and generate an ExpressionSet from it. >>> >> >> > >>> However in that case it still have no annotation. >>> >> >> > >>> >>> >> >> > >>> Surprisingly on the Bioconductor website there are all >>> >> >> > >>> packages >>> >> >> > >>> needed to >>> >> >> > >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work >>> >> >> > >>> with >>> >> >> > >>> Mouse >>> >> >> > >>> Exon 1.0 ST arrays seems missing! >>> >> >> > >>> >>> >> >> > >>> What am I doing wrong here? Has someone else had such >>> >> >> > >>> problems? >>> >> >> > >>> >>> >> >> > >>> Thanks in advance for your effort, >>> >> >> > >>> Andreas >>> >> >> > >>> >>> >> >> > >>> ? ? ? ?[[alternative HTML version deleted]] >>> >> >> > >>> >>> >> >> > >>> _______________________________________________ >>> >> >> > >>> Bioconductor mailing list >>> >> >> > >>> Bioconductor at r-project.org >>> >> >> > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> >> > >>> Search the archives: >>> >> >> > >>> >>> >> >> > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> -- >>> >> >> > >> James W. MacDonald, M.S. 
>>> >> >> > >> Biostatistician >>> >> >> > >> University of Washington >>> >> >> > >> Environmental and Occupational Health Sciences >>> >> >> > >> 4225 Roosevelt Way NE, # 100 >>> >> >> > >> Seattle WA 98105-6099 >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> _______________________________________________ >>> >> >> > >> Bioconductor mailing list >>> >> >> > >> Bioconductor at r-project.org >>> >> >> > >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> >> > >> Search the archives: >>> >> >> > >> >>> >> >> > >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > >>> >> >> > _______________________________________________ >>> >> >> > Bioconductor mailing list >>> >> >> > Bioconductor at r-project.org >>> >> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> >> > Search the archives: >>> >> >> > http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > >>> >> > >>> > >>> > >> >> > From jperkins at biochem.ucl.ac.uk Wed Jun 27 17:37:54 2012 From: jperkins at biochem.ucl.ac.uk (James Perkins) Date: Wed, 27 Jun 2012 17:37:54 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: Sorry, I meant at the rma(target=) level, not the getNetAffx level, which I *assume* uses the mps files to map between ps and transcripts? Cheers, Jim On 27 June 2012 17:27, Benilton Carvalho wrote: > Hi Jim, > > I'll make sure to add the comprehensive MPS as soon as I get more info > about it from the specialists... > > However, note that the contents of the MPS files are not used by > getNetAffx(), which only uses the probeset/transcript annotation > file... > > Thanks, > > benilton > > On 27 June 2012 15:00, James Perkins wrote: >> Hi, >> >> I wasn't sure if this was worth starting a new thread for this, since >> my question is very much related to this thread... 
>> >> Is there any plan to include the "comprehensive" exon array mappings? >> >> E.g. for rat: >> >> If one goes here >> >> http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 >> >> Then to Technical Documentation tab >> >> And downloads the >> >> "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, >> extended and comprehensive rn4" data >> >> http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip >> >> There are the core/extended/full ps and mps files here. >> >> However there is also a comprehensive mps file. >> >> Full, core and extended are from 2006. >> >> The comprehensive is from 2010 (and gets updated more regularly), so >> perhaps would be a better file to use for getNetAffx ? >> >> Apologies if this has been covered before. I am never sure of what is >> the best way to analyse exon array data at the gene level. >> >> Thanks, >> >> Jim >> >> >> >> >> On 13 June 2012 21:37, Benilton Carvalho wrote: >>> >>> please correct the code below to: >>> >>> eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is available) >>> >>> and if you want results at the exon level >>> >>> eset = rma(raw, target='probeset') >>> featureData(eset) = getNetAffx(raw, 'probeset') >>> >>> apologies for the mistake below. >>> >>> b >>> >>> On 13 June 2012 20:11, Benilton Carvalho wrote: >>> > FWIW, remember that you can obtain the contents of the annotation >>> > files (the NA32 Affymetrix files) with: >>> > >>> > library(Biobase) >>> > library(oligo) >>> > raw = read.celfiles(list.celfiles()) >>> > eset = rma(raw, target='transcript') >>> > featureData(eset) = getNetAffx(eset, 'transcript') >>> > head(fData(eset)) >>> > >>> > b >>> > >>> > On 13 June 2012 15:47, James W. 
MacDonald wrote: >>> >> Hi Andreas, >>> >> >>> >> >>> >> On 6/13/2012 3:14 AM, Andreas Heider wrote: >>> >>> >>> >>> Dear mailing list, >>> >>> I know this was on the list couple of times, and I think I read it all, >>> >>> but >>> >>> actually I still don't get it right. So here is my problem: >>> >>> >>> >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene >>> >>> 1.0 >>> >>> ST) in a similar fashion to eg. HG-U133 arrays. >>> >>> That means, I want to finally have it accessible as an ExpressionSet >>> >>> object >>> >>> with a right Bioconductor annotation assigned. This should include GENE >>> >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. >>> >> >>> >> >>> >> The problem here is that you want to do something that AFAIK isn't easy to >>> >> do. The Gene ST arrays allow you to summarize all the probes that >>> >> interrogate a particular transcript (e.g., all the exon-level probesets are >>> >> collapsed to transcript level, and then you summarize). However, for the >>> >> Exon ST arrays that isn't the case, unless there is something in xps to >>> >> allow for that - I know next to nothing about that package, so Cristian >>> >> Stratowa will have to chime in if I am missing something. >>> >> >>> >> For the Exon chips, you are always summarizing at the same probeset level, >>> >> where there are <= 4 probes per probeset, and there can be any number of >>> >> probesets that interrogate a given exon. Lots of these probesets interrogate >>> >> regions that aren't even transcribed, according to current knowledge of the >>> >> genome. When you choose core, extended or full probesets, you are just >>> >> changing the number of probesets being used, not summarizing at a different >>> >> level as with the Gene ST chip. >>> >> >>> >> So when you say you want gene symbols, refseq ids and gene ids, what exactly >>> >> are you after? If a given probeset is in the intron of a gene do you want to >>> >> annotate it as being part of that gene? 
How about if it is in the UTR (or >>> >> really close to the UTR)? What do you want to do with the probesets where >>> >> one or more of the probes binds in multiple positions in the genome? These >>> >> are all questions that the exonmap package tries to consider, and it gets >>> >> really complicated. That's why Affy went with the Gene ST chips - they >>> >> unleashed the Exon chips on us and couldn't sell them because people were >>> >> saying WTF do I do with this thing? >>> >> >>> >> I don't think there is an easy or obvious answer to your question. If you >>> >> were to come up with what you think are reasonable answers to my questions, >>> >> then it wouldn't be much work to extract the chr, start, end from the >>> >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., >>> >> ?findOverlaps()) to decide what regions are being interrogated, and annotate >>> >> from there. >>> >> >>> >> Best, >>> >> >>> >> Jim >>> >> >>> >> >>> >> >>> >>> >>> >>> I can import it as a AffyBatch and generate an ExpressionSet with the help >>> >>> of the Xmap/exonmap supplied CDF, but there is no annotation attached to >>> >>> it. >>> >>> >>> >>> OR >>> >>> >>> >>> I can import the CEL files with the "oligo" package as a Exon Array object >>> >>> and generate an ExpressionSet from it. >>> >>> However in that case it still have no annotation. >>> >>> >>> >>> Surprisingly on the Bioconductor website there are all packages needed to >>> >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work with Mouse >>> >>> Exon 1.0 ST arrays seems missing! >>> >>> >>> >>> What am I doing wrong here? Has someone else had such problems? >>> >>> >>> >>> Thanks in advance for your effort, >>> >>> Andreas >>> >>> >>> >>> ? ? ? 
?[[alternative HTML version deleted]]
>>> >>>
>>> >>> _______________________________________________
>>> >>> Bioconductor mailing list
>>> >>> Bioconductor at r-project.org
>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >>> Search the archives:
>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >>
>>> >>
>>> >> --
>>> >> James W. MacDonald, M.S.
>>> >> Biostatistician
>>> >> University of Washington
>>> >> Environmental and Occupational Health Sciences
>>> >> 4225 Roosevelt Way NE, # 100
>>> >> Seattle WA 98105-6099
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> Bioconductor mailing list
>>> >> Bioconductor at r-project.org
>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >> Search the archives:
>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

From gbayon at gmail.com Wed Jun 27 17:38:07 2012
From: gbayon at gmail.com (Gustavo Fernández Bayón)
Date: Wed, 27 Jun 2012 17:38:07 +0200
Subject: [BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data
In-Reply-To:
References:
Message-ID: <2EBD24B1B09B4A27BE11DA663EC436A6@gmail.com>

Hi again.

I would like to add a little more information on this issue. I have been
debugging inside the parseGSEMatrix() function in the GEOquery source code.
The suspicious NAs appeared when execution reached the following line:

    ## Apparently, NCBI GEO uses case-insensitive matching
    ## between platform IDs and series ID Refs
    dat <- dat[match(tolower(rownames(datamat)),tolower(rownames(dat))),]

The problem here is that 'datamat' has the correct number of rows, which is
around 480K, BUT 'dat' doesn't.
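A minimal sketch of how that match() line can produce NA rows when 'dat' lacks some of the probes in 'datamat'. The probe IDs and the two toy tables are made up for illustration; this only mimics the quoted line in base R and is not the actual GEOquery code path:

```r
# Toy stand-ins (hypothetical data): 'datamat' holds four probes, while the
# platform table 'dat' only describes two of them (in upper case).
datamat <- matrix(1:8, nrow = 4,
                  dimnames = list(c("cg0001", "cg0002", "cg0003", "cg0004"),
                                  c("s1", "s2")))
dat <- data.frame(ID = c("CG0002", "CG0004"),
                  CHR = c("1", "7"),
                  row.names = c("CG0002", "CG0004"),
                  stringsAsFactors = FALSE)

# Same case-insensitive lookup as the line quoted above.
idx <- match(tolower(rownames(datamat)), tolower(rownames(dat)))

# Probes absent from 'dat' come back as NA in the index.
print(idx)

# Indexing the data frame with those NAs yields all-NA placeholder rows.
aligned <- dat[idx, ]
print(rownames(aligned))

# Probes present in the series matrix but missing from the platform table.
missing_ids <- rownames(datamat)[is.na(idx)]
print(missing_ids)
```

Checking sum(is.na(idx)) on the real objects would show how many probes the platform table fails to cover.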
At a glance, 'datamat' comes from the series matrix file while 'dat' comes
from the GPL. If you go to the GEO page of that GPL
(http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=djaxxiayqmwyspu&acc=GPL13534),
you'll find it says that the GPL description table has exactly 485577 rows,
which is kind of logical: a description for each probeset. However, inside
the code, 'dat' has only 143889 rows. Replicating directly from the R
console:

>gpl <- getGEO('GPL13534',destdir='../../GEO/')
>Meta(gpl)$data_row_count
[1] "485577"
>t <- Table(gpl)
>dim(t)
[1] 143889 37

I was really surprised to find this, and I do not know enough to tell
whether it reflects some constraint I am unaware of. Is that ok? Or is there
a bug in the GPL processing code?

Now I'm going home, but I'll try to continue debugging to see what is really
happening inside.

Any help will be very much appreciated.

Regards,
Gus

---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Wednesday, 27 June 2012 at 10:51, Gustavo Fernández Bayón wrote:

> Hi everybody.
>
> I am experiencing quite a few problems while trying to download and parse a dataset of methylation values. These are not technical problems, IMHO. GEOquery works perfectly, and it really makes getting this kind of data an easy task. However, I think I do not understand exactly the lifecycle of GEO series data, and I would like to ask this list for any hints about this behavior, so I can try to fix it.
>
> What I first did was to download and parse the desired GSE data file, with the default value of the GSEMatrix parameter (TRUE). I then extracted the ExpressionSet and the assayData I was looking for.
>
> my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/')
> my.expr.set <- my.gse[[1]]
> beta.values <- exprs(my.expr.set)
>
> What really surprised me at first was to see many strange values (all containing the 'NA' string) in the featureNames of the expression set.
>
> > head(featureNames(my.expr.set), n=20)
> [1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3"
> [8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734"
> [15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12"
>
>
>
> If I select an individual GSM in the series and download it, the featureNames are ok. If I try to download the GSE with GSEMatrix=FALSE, I get a list of GSM data sets, and the result is again good. This made me suspect the intermediate, pre-parsed matrix form. I haven't found a clue about the lifecycle of this kind of data, I mean, how the matrix is built. Is it a manual process? Is it automatic?
>
> If it is a manual process, then I guess I will have to contact whoever is responsible for uploading the data to see if they can fix it. But, if it is not, I would like to know if this is something relating to BioC or, more plausibly, to GEO.
>
> Any help would be appreciated.
>
> Regards,
> Gustavo
>
>
> ---------------------------
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

From aheider at trm.uni-leipzig.de Wed Jun 27 17:41:46 2012
From: aheider at trm.uni-leipzig.de (Andreas Heider)
Date: Wed, 27 Jun 2012 17:41:46 +0200
Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays
In-Reply-To:
References: <4FD8A7E7.2080102@uw.edu>
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From beniltoncarvalho at gmail.com Wed Jun 27 17:44:07 2012
From: beniltoncarvalho at gmail.com (Benilton Carvalho)
Date: Wed, 27 Jun 2012 16:44:07 +0100
Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays
In-Reply-To:
References: <4FD8A7E7.2080102@uw.edu>
Message-ID:

That's correct... the summarisation step does use the MPS... and I'll
add support for our next release.
b On 27 June 2012 16:37, James Perkins wrote: > Sorry, I meant at the rma(target=) level, not the getNetAffx level, > which I *assume* uses the mps files to map between ps and transcripts? > > Cheers, > > Jim > > > On 27 June 2012 17:27, Benilton Carvalho wrote: >> Hi Jim, >> >> I'll make sure to add the comprehensive MPS as soon as I get more info >> about it from the specialists... >> >> However, note that the contents of the MPS files are not used by >> getNetAffx(), which only uses the probeset/transcript annotation >> file... >> >> Thanks, >> >> benilton >> >> On 27 June 2012 15:00, James Perkins wrote: >>> Hi, >>> >>> I wasn't sure if this was worth starting a new thread for this, since >>> my question is very much related to this thread... >>> >>> Is there any plan to include the "comprehensive" exon array mappings? >>> >>> E.g. for rat: >>> >>> If one goes here >>> >>> http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 >>> >>> Then to Technical Documentation tab >>> >>> And downloads the >>> >>> "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, >>> extended and comprehensive rn4" data >>> >>> http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip >>> >>> There are the core/extended/full ps and mps files here. >>> >>> However there is also a comprehensive mps file. >>> >>> Full, core and extended are from 2006. >>> >>> The comprehensive is from 2010 (and gets updated more regularly), so >>> perhaps would be a better file to use for getNetAffx ? >>> >>> Apologies if this has been covered before. I am never sure of what is >>> the best way to analyse exon array data at the gene level. 
>>> >>> Thanks, >>> >>> Jim >>> >>> >>> >>> >>> On 13 June 2012 21:37, Benilton Carvalho wrote: >>>> >>>> please correct the code below to: >>>> >>>> eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is available) >>>> >>>> and if you want results at the exon level >>>> >>>> eset = rma(raw, target='probeset') >>>> featureData(eset) = getNetAffx(raw, 'probeset') >>>> >>>> apologies for the mistake below. >>>> >>>> b >>>> >>>> On 13 June 2012 20:11, Benilton Carvalho wrote: >>>> > FWIW, remember that you can obtain the contents of the annotation >>>> > files (the NA32 Affymetrix files) with: >>>> > >>>> > library(Biobase) >>>> > library(oligo) >>>> > raw = read.celfiles(list.celfiles()) >>>> > eset = rma(raw, target='transcript') >>>> > featureData(eset) = getNetAffx(eset, 'transcript') >>>> > head(fData(eset)) >>>> > >>>> > b >>>> > >>>> > On 13 June 2012 15:47, James W. MacDonald wrote: >>>> >> Hi Andreas, >>>> >> >>>> >> >>>> >> On 6/13/2012 3:14 AM, Andreas Heider wrote: >>>> >>> >>>> >>> Dear mailing list, >>>> >>> I know this was on the list couple of times, and I think I read it all, >>>> >>> but >>>> >>> actually I still don't get it right. So here is my problem: >>>> >>> >>>> >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene >>>> >>> 1.0 >>>> >>> ST) in a similar fashion to eg. HG-U133 arrays. >>>> >>> That means, I want to finally have it accessible as an ExpressionSet >>>> >>> object >>>> >>> with a right Bioconductor annotation assigned. This should include GENE >>>> >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. >>>> >> >>>> >> >>>> >> The problem here is that you want to do something that AFAIK isn't easy to >>>> >> do. The Gene ST arrays allow you to summarize all the probes that >>>> >> interrogate a particular transcript (e.g., all the exon-level probesets are >>>> >> collapsed to transcript level, and then you summarize). 
However, for the >>>> >> Exon ST arrays that isn't the case, unless there is something in xps to >>>> >> allow for that - I know next to nothing about that package, so Cristian >>>> >> Stratowa will have to chime in if I am missing something. >>>> >> >>>> >> For the Exon chips, you are always summarizing at the same probeset level, >>>> >> where there are <= 4 probes per probeset, and there can be any number of >>>> >> probesets that interrogate a given exon. Lots of these probesets interrogate >>>> >> regions that aren't even transcribed, according to current knowledge of the >>>> >> genome. When you choose core, extended or full probesets, you are just >>>> >> changing the number of probesets being used, not summarizing at a different >>>> >> level as with the Gene ST chip. >>>> >> >>>> >> So when you say you want gene symbols, refseq ids and gene ids, what exactly >>>> >> are you after? If a given probeset is in the intron of a gene do you want to >>>> >> annotate it as being part of that gene? How about if it is in the UTR (or >>>> >> really close to the UTR)? What do you want to do with the probesets where >>>> >> one or more of the probes binds in multiple positions in the genome? These >>>> >> are all questions that the exonmap package tries to consider, and it gets >>>> >> really complicated. That's why Affy went with the Gene ST chips - they >>>> >> unleashed the Exon chips on us and couldn't sell them because people were >>>> >> saying WTF do I do with this thing? >>>> >> >>>> >> I don't think there is an easy or obvious answer to your question. If you >>>> >> were to come up with what you think are reasonable answers to my questions, >>>> >> then it wouldn't be much work to extract the chr, start, end from the >>>> >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., >>>> >> ?findOverlaps()) to decide what regions are being interrogated, and annotate >>>> >> from there. 
>>>> >> >>>> >> Best, >>>> >> >>>> >> Jim >>>> >> >>>> >> >>>> >> >>>> >>> >>>> >>> I can import it as a AffyBatch and generate an ExpressionSet with the help >>>> >>> of the Xmap/exonmap supplied CDF, but there is no annotation attached to >>>> >>> it. >>>> >>> >>>> >>> OR >>>> >>> >>>> >>> I can import the CEL files with the "oligo" package as a Exon Array object >>>> >>> and generate an ExpressionSet from it. >>>> >>> However in that case it still have no annotation. >>>> >>> >>>> >>> Surprisingly on the Bioconductor website there are all packages needed to >>>> >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work with Mouse >>>> >>> Exon 1.0 ST arrays seems missing! >>>> >>> >>>> >>> What am I doing wrong here? Has someone else had such problems? >>>> >>> >>>> >>> Thanks in advance for your effort, >>>> >>> Andreas >>>> >>> >>>> >>> ? ? ? ?[[alternative HTML version deleted]] >>>> >>> >>>> >>> _______________________________________________ >>>> >>> Bioconductor mailing list >>>> >>> Bioconductor at r-project.org >>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >>> Search the archives: >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >>>> >> >>>> >> -- >>>> >> James W. MacDonald, M.S. 
>>>> >> Biostatistician >>>> >> University of Washington >>>> >> Environmental and Occupational Health Sciences >>>> >> 4225 Roosevelt Way NE, # 100 >>>> >> Seattle WA 98105-6099 >>>> >> >>>> >> >>>> >> _______________________________________________ >>>> >> Bioconductor mailing list >>>> >> Bioconductor at r-project.org >>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >> Search the archives: >>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From jperkins at biochem.ucl.ac.uk Wed Jun 27 17:49:31 2012 From: jperkins at biochem.ucl.ac.uk (James Perkins) Date: Wed, 27 Jun 2012 17:49:31 +0200 Subject: [BioC] Analysis and annotation (full) of Affymetrix Mouse Exon 1.0 ST arrays In-Reply-To: References: <4FD8A7E7.2080102@uw.edu> Message-ID: Great, Thanks, I'll look out for it! And thanks a lot Andreas for the suggestion of using ensembl exon ids, that sounds good, thanks for all your help. Cheers! Jim On 27 June 2012 17:44, Benilton Carvalho wrote: > That's correct... the summarisation step does use the MPS... and I'll > add support for our next release. b > > On 27 June 2012 16:37, James Perkins wrote: >> Sorry, I meant at the rma(target=) level, not the getNetAffx level, >> which I *assume* uses the mps files to map between ps and transcripts? >> >> Cheers, >> >> Jim >> >> >> On 27 June 2012 17:27, Benilton Carvalho wrote: >>> Hi Jim, >>> >>> I'll make sure to add the comprehensive MPS as soon as I get more info >>> about it from the specialists... >>> >>> However, note that the contents of the MPS files are not used by >>> getNetAffx(), which only uses the probeset/transcript annotation >>> file... 
>>> >>> Thanks, >>> >>> benilton >>> >>> On 27 June 2012 15:00, James Perkins wrote: >>>> Hi, >>>> >>>> I wasn't sure if this was worth starting a new thread for this, since >>>> my question is very much related to this thread... >>>> >>>> Is there any plan to include the "comprehensive" exon array mappings? >>>> >>>> E.g. for rat: >>>> >>>> If one goes here >>>> >>>> http://www.affymetrix.com/estore/browse/products.jsp?productId=131489&categoryId=35748&productName=GeneChip-Rat-Exon-1.0-ST-Array#1_1 >>>> >>>> Then to Technical Documentation tab >>>> >>>> And downloads the >>>> >>>> "Rat Exon 1.0 ST Array Probeset, and Meta Probeset Files, core, full, >>>> extended and comprehensive rn4" data >>>> >>>> http://www.affymetrix.com/Auth/support/downloads/library_files/RaEx-1_0-st-v1.r2.dt1.rn4.ps.zip >>>> >>>> There are the core/extended/full ps and mps files here. >>>> >>>> However there is also a comprehensive mps file. >>>> >>>> Full, core and extended are from 2006. >>>> >>>> The comprehensive is from 2010 (and gets updated more regularly), so >>>> perhaps would be a better file to use for getNetAffx ? >>>> >>>> Apologies if this has been covered before. I am never sure of what is >>>> the best way to analyse exon array data at the gene level. >>>> >>>> Thanks, >>>> >>>> Jim >>>> >>>> >>>> >>>> >>>> On 13 June 2012 21:37, Benilton Carvalho wrote: >>>>> >>>>> please correct the code below to: >>>>> >>>>> eset = rma(raw, target='full') ## or 'core', 'extended' (whatever is available) >>>>> >>>>> and if you want results at the exon level >>>>> >>>>> eset = rma(raw, target='probeset') >>>>> featureData(eset) = getNetAffx(raw, 'probeset') >>>>> >>>>> apologies for the mistake below. 
>>>>> >>>>> b >>>>> >>>>> On 13 June 2012 20:11, Benilton Carvalho wrote: >>>>> > FWIW, remember that you can obtain the contents of the annotation >>>>> > files (the NA32 Affymetrix files) with: >>>>> > >>>>> > library(Biobase) >>>>> > library(oligo) >>>>> > raw = read.celfiles(list.celfiles()) >>>>> > eset = rma(raw, target='transcript') >>>>> > featureData(eset) = getNetAffx(eset, 'transcript') >>>>> > head(fData(eset)) >>>>> > >>>>> > b >>>>> > >>>>> > On 13 June 2012 15:47, James W. MacDonald wrote: >>>>> >> Hi Andreas, >>>>> >> >>>>> >> >>>>> >> On 6/13/2012 3:14 AM, Andreas Heider wrote: >>>>> >>> >>>>> >>> Dear mailing list, >>>>> >>> I know this was on the list couple of times, and I think I read it all, >>>>> >>> but >>>>> >>> actually I still don't get it right. So here is my problem: >>>>> >>> >>>>> >>> I want to be able to work with Mouse Exon 1.0 ST Arrays (NOT Mouse Gene >>>>> >>> 1.0 >>>>> >>> ST) in a similar fashion to eg. HG-U133 arrays. >>>>> >>> That means, I want to finally have it accessible as an ExpressionSet >>>>> >>> object >>>>> >>> with a right Bioconductor annotation assigned. This should include GENE >>>>> >>> SYMBOLS, RefSeq IDs and ENTREZ IDs. >>>>> >> >>>>> >> >>>>> >> The problem here is that you want to do something that AFAIK isn't easy to >>>>> >> do. The Gene ST arrays allow you to summarize all the probes that >>>>> >> interrogate a particular transcript (e.g., all the exon-level probesets are >>>>> >> collapsed to transcript level, and then you summarize). However, for the >>>>> >> Exon ST arrays that isn't the case, unless there is something in xps to >>>>> >> allow for that - I know next to nothing about that package, so Cristian >>>>> >> Stratowa will have to chime in if I am missing something. >>>>> >> >>>>> >> For the Exon chips, you are always summarizing at the same probeset level, >>>>> >> where there are <= 4 probes per probeset, and there can be any number of >>>>> >> probesets that interrogate a given exon. 
Lots of these probesets interrogate >>>>> >> regions that aren't even transcribed, according to current knowledge of the >>>>> >> genome. When you choose core, extended or full probesets, you are just >>>>> >> changing the number of probesets being used, not summarizing at a different >>>>> >> level as with the Gene ST chip. >>>>> >> >>>>> >> So when you say you want gene symbols, refseq ids and gene ids, what exactly >>>>> >> are you after? If a given probeset is in the intron of a gene do you want to >>>>> >> annotate it as being part of that gene? How about if it is in the UTR (or >>>>> >> really close to the UTR)? What do you want to do with the probesets where >>>>> >> one or more of the probes binds in multiple positions in the genome? These >>>>> >> are all questions that the exonmap package tries to consider, and it gets >>>>> >> really complicated. That's why Affy went with the Gene ST chips - they >>>>> >> unleashed the Exon chips on us and couldn't sell them because people were >>>>> >> saying WTF do I do with this thing? >>>>> >> >>>>> >> I don't think there is an easy or obvious answer to your question. If you >>>>> >> were to come up with what you think are reasonable answers to my questions, >>>>> >> then it wouldn't be much work to extract the chr, start, end from the >>>>> >> pd.moex.1.0.st.v1 package, and then use GenomicFeatures (e.g., >>>>> >> ?findOverlaps()) to decide what regions are being interrogated, and annotate >>>>> >> from there. >>>>> >> >>>>> >> Best, >>>>> >> >>>>> >> Jim >>>>> >> >>>>> >> >>>>> >> >>>>> >>> >>>>> >>> I can import it as a AffyBatch and generate an ExpressionSet with the help >>>>> >>> of the Xmap/exonmap supplied CDF, but there is no annotation attached to >>>>> >>> it. >>>>> >>> >>>>> >>> OR >>>>> >>> >>>>> >>> I can import the CEL files with the "oligo" package as a Exon Array object >>>>> >>> and generate an ExpressionSet from it. >>>>> >>> However in that case it still have no annotation. 
>>>>> >>> >>>>> >>> Surprisingly on the Bioconductor website there are all packages needed to >>>>> >>> deal with Mouse Gene 1.0 ST arrays but the informtion to work with Mouse >>>>> >>> Exon 1.0 ST arrays seems missing! >>>>> >>> >>>>> >>> What am I doing wrong here? Has someone else had such problems? >>>>> >>> >>>>> >>> Thanks in advance for your effort, >>>>> >>> Andreas >>>>> >>> >>>>> >>> ? ? ? ?[[alternative HTML version deleted]] >>>>> >>> >>>>> >>> _______________________________________________ >>>>> >>> Bioconductor mailing list >>>>> >>> Bioconductor at r-project.org >>>>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >>> Search the archives: >>>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >> >>>>> >> >>>>> >> -- >>>>> >> James W. MacDonald, M.S. >>>>> >> Biostatistician >>>>> >> University of Washington >>>>> >> Environmental and Occupational Health Sciences >>>>> >> 4225 Roosevelt Way NE, # 100 >>>>> >> Seattle WA 98105-6099 >>>>> >> >>>>> >> >>>>> >> _______________________________________________ >>>>> >> Bioconductor mailing list >>>>> >> Bioconductor at r-project.org >>>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >> Search the archives: >>>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From sdavis2 at mail.nih.gov Wed Jun 27 17:54:44 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 27 Jun 2012 11:54:44 -0400 Subject: [BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data In-Reply-To: <2EBD24B1B09B4A27BE11DA663EC436A6@gmail.com> References: <2EBD24B1B09B4A27BE11DA663EC436A6@gmail.com> Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available
URL:

From durinck.steffen at gene.com Wed Jun 27 18:04:12 2012
From: durinck.steffen at gene.com (Steffen Durinck)
Date: Wed, 27 Jun 2012 09:04:12 -0700
Subject: [BioC] BiomaRt Details
In-Reply-To:
References:
Message-ID:

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From reidjf at gmail.com Wed Jun 27 18:40:31 2012
From: reidjf at gmail.com (James F. Reid)
Date: Wed, 27 Jun 2012 17:40:31 +0100
Subject: [BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data
In-Reply-To:
References: <2EBD24B1B09B4A27BE11DA663EC436A6@gmail.com>
Message-ID: <4FEB377F.2050309@gmail.com>

Dear Sean and Gustavo,

I cannot reproduce this error. See below.

On 27/06/12 16:54, Sean Davis wrote:
> On Wed, Jun 27, 2012 at 11:38 AM, Gustavo Fernández Bayón
> wrote:
>
>> Hi again.
>>
>> I would like to add a little bit more information on this issue. I have
>> been debugging inside the parseGSEMatrix() function in the GEOquery
>> source code. The suspicious NA's appeared when execution arrived at the
>> following line:
>>
>> ## Apparently, NCBI GEO uses case-insensitive matching
>> ## between platform IDs and series ID Refs ???
>> dat <- dat[match(tolower(rownames(datamat)),tolower(rownames(dat))),]
>>
>> The problem here is that 'datamat' has the correct number of rows, which
>> is around 480K, BUT 'dat' doesn't. At a glance, 'datamat' comes from the
>> series matrix file while 'dat' comes from the GPL.
>>
>> If you go to the GEO page of that GPL (
>> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=djaxxiayqmwyspu&acc=GPL13534),
>> you'll find it says that the GPL description table has exactly 485577
>> rows, which is kind of logical, a description for each probeset. However,
>> inside the code, 'dat' has only 143889 rows.
>>
>> Replicating directly from the R console:
>>
>> > gpl <- getGEO('GPL13534',destdir='../../GEO/')
>> > Meta(gpl)$data_row_count
>> [1] "485577"
>>
>> > t <- Table(gpl)
>> > dim(t)
>> [1] 143889 37
>>
>> I was really surprised to find this, and I do not know enough to tell
>> whether it reflects some constraint I am unaware of. Is that ok? Or is
>> there a bug in the GPL processing code? Now I'm going home, but I'll try
>> to continue debugging to see what is really happening inside.
>>
>
> This is most likely a bug in GPL parsing. There are A LOT of edge cases
> that I have tried to deal with, some not very appropriately. Often, the
> error is due to an extraneous quote in an unexpected location. I'll look
> into this one. Could you do me a favor and send along sessionInfo() just
> so I know?
>
> Thanks,
> Sean
>
>
>> Any help will be very much appreciated.
>>
>> Regards,
>> Gus
>>
>>
>> ---------------------------
>> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>>
>>
>> On Wednesday, 27 June 2012 at 10:51, Gustavo Fernández Bayón wrote:
>>
>>> Hi everybody.
>>>
>>> I am experiencing quite a few problems while trying to download and
>>> parse a dataset of methylation values. These are not technical
>>> problems, IMHO. GEOquery works perfectly, and it really makes getting
>>> this kind of data an easy task. However, I think I do not understand
>>> exactly the lifecycle of GEO series data, and I would like to ask this
>>> list for any hint on this behavior, so I could try to fix it.
>>>
>>> What I first did was to download and parse the desired GSE data file,
>>> with the default value of the GSEMatrix parameter (TRUE). Besides, I
>>> extracted the ExpressionSet and the assayData I was looking for.
>>>
>>> my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/')
>>> my.expr.set <- my.gse[[1]]
>>> beta.values <- exprs(my.expr.set)
>>>
>>> What really surprised me at first was to see many strange values (all
>>> containing the 'NA' string) in the featureNames of the expression set.
>>>
>>> > head(featureNames(es), n=20)
>>> [1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3"
>>> [8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734"
>>> [15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12"
>>>
>>> If I select an individual GSM in the series, and download it, the
>>> featureNames are ok. If I try to download the GSE with GSEMatrix=FALSE,
>>> I get a list of GSM data sets, and the result is again good. This made
>>> me suspect the intermediate, pre-parsed, matrix form. I haven't found a
>>> clue about the lifecycle of this kind of data. I mean, how the matrix is
>>> built. Is it a manual process? Is it automatic?
>>>
>>> If it is a manual process, then I guess I will have to contact the
>>> person responsible for uploading the data to see if they can fix it.
>>> But, if it is not, I would like to know if this is something relating
>>> to BioC or, more plausibly, to GEO.
>>>
>>> Any help would be appreciated.
>>>
>>> Regards,
>>> Gustavo
>>>
>>>
>>> ---------------------------
>>> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

library(GEOquery)
my.gse <- getGEO('GSE30870', destdir=".")
featureNames(my.gse[[1]])[1:10]
# [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
# [6] "cg00000289" "cg00000292" "cg00000321" "cg00000363" "cg00000622"

all(featureNames(my.gse[[1]]) == rownames(exprs(my.gse[[1]])))
#[1] TRUE

gpl <- getGEO('GPL13534',destdir=".")
Meta(gpl)$data_row_count == nrow(Table(gpl))
# [1] TRUE

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  tools     methods
[8] base

other attached packages:
[1] GEOquery_2.23.5    Biobase_2.16.0     BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] RCurl_1.91-1 XML_3.9-4

HTH,
J.
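
[Editorial sketch] The "NA", "NA.1", ... feature names reported in this thread are what base R produces when a data frame is row-indexed with NA positions, which is what match() returns for IDs that are absent from an incompletely parsed annotation table. The following is a minimal base-R sketch of that mechanism only. The probe IDs, the two-row `dat` table and the names `series_ids`, `annotated` and `missing_ids` are made up for illustration; they stand in for the series matrix and the GPL table and are not taken from GEOquery itself.

```r
# Hypothetical stand-in for the probe IDs found in the series matrix.
series_ids <- c("cg01", "cg02", "cg03", "cg04")

# Hypothetical stand-in for a platform (GPL) table that was only
# partially parsed, so two of the four probes are missing.
dat <- data.frame(ID = c("CG01", "CG03"),
                  gene = c("A", "C"),
                  row.names = c("CG01", "CG03"),
                  stringsAsFactors = FALSE)

# Case-insensitive lookup, in the style of the line quoted from
# parseGSEMatrix(): IDs without a match get an NA position.
idx <- match(tolower(series_ids), tolower(rownames(dat)))

# Row-indexing with NA positions yields all-NA rows whose row names
# are made unique as "NA", "NA.1", ...
annotated <- dat[idx, ]
print(rownames(annotated))

# Quick diagnostic: which series IDs have no annotation row?
missing_ids <- series_ids[is.na(idx)]
print(missing_ids)
```

Under these assumptions, comparing `length(missing_ids)` with the difference between `Meta(gpl)$data_row_count` and `nrow(Table(gpl))` would be one way to check whether a truncated GPL table accounts for the NA feature names on a given installation.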

From mailinglist.honeypot at gmail.com Wed Jun 27 19:20:04 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Wed, 27 Jun 2012 13:20:04 -0400
Subject: [BioC] Cleaning up after getSeq(BSgenome, GRanges)
Message-ID:

Howdy,

Say I'd like to fetch muchos sequences from hg19 that are defined in a
GRanges object that spans all hg19 chromosomes. I can make my life easy
and do:

R> library(BSgenome.Hsapiens.UCSC.hg19)
R> seqs <- getSeq(Hsapiens, my.GRanges)

But while my life has been made easy, life for my CPU has been made
harder as I (think that I) have now all of the Hsapiens chromosomes
loaded up into (I think) the Hsapiens@.seqs_cache.

I reckon I can do something like:

R> rm(list=ls(Hsapiens@.seqs_cache), envir=Hsapiens@.seqs_cache)
R> gc()

to try to remedy the situation myself, but I wonder if I'm missing
something else?

Perhaps having a clearCache,BSgenome method to do some cleanup might be
handy?

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From xiaoheyiyh at yahoo.com Wed Jun 27 19:36:01 2012
From: xiaoheyiyh at yahoo.com (heyi xiao)
Date: Wed, 27 Jun 2012 10:36:01 -0700
Subject: [BioC] Illumine bead array PROBE_TYPE values
Message-ID: <1340818561.99867.YahooMailClassic@web125405.mail.ne1.yahoo.com>

Hello list,

I am working on Illumina bead array expression data. The data was output
from GenomeStudio with probe annotation. There is a column called
PROBE_TYPE, with three values: A, I and S. What do these values mean:
antisense, intron and sense? Antisense and sense are relative to the
target gene, not to the reference genome, right? Why are intron and
antisense probes needed? I couldn't find a description on the Internet.

Thanks a lot!
Heyi

From whuber at embl.de Wed Jun 27 19:36:13 2012
From: whuber at embl.de (Wolfgang Huber)
Date: Wed, 27 Jun 2012 19:36:13 +0200
Subject: [BioC] DESeq analysis
In-Reply-To: <20120626161705.06E4D13ADBF@mamba.fhcrc.org>
References: <20120626161705.06E4D13ADBF@mamba.fhcrc.org>
Message-ID: <4FEB448D.1020105@embl.de>

Dear Narges,

thank you for the feedback. Your second question is easy: use the idiom

res1 <- subset(res, padj<0.1)

instead; this will avoid the creation of rows full of NA whenever
res$padj is NA. Alternatively

res[order(res$padj)[1:n], ]

with 'n' your favourite lucky number might be useful. Have a look at the
R-intro manual for more on subsetting of arrays and data frames in R.

Your first question: can you show us the data for the genes where you
know that they are differentially expressed? Perhaps then it might become
more apparent why DESeq / nbinomTest did not agree. Also, what does the
dispersion plot for cds look like? (This is the plot produced by
plotDispEsts in the vignette.)

Best wishes
Wolfgang

narges [guest] wrote 06/26/2012 06:17 PM:
>
> Hi all
>
> I am doing some RNA seq analysis with DESeq. I have applied the
> nbinomTest to my dataset, which I know has many differentially expressed
> genes, but the first problem is that the values in the "padj" column are
> almost all NA and sometimes 1, and when I want to take a slice from my
> data frame the result is not meaningful to me.
>
> -- output of sessionInfo():
>
> res <- nbinomTest(cds, "Male", "Female")
>
> > head(res)
>                id   baseMean baseMeanA  baseMeanB foldChange log2FoldChange
> 1 ENSG00000000003  0.1130534  0.000000  0.2261067        Inf            Inf
> 2 ENSG00000000005  0.0000000  0.000000  0.0000000        NaN            NaN
> 3 ENSG00000000419 14.3767155 17.162610 11.5908205  0.6753530     -0.5662863
> 4 ENSG00000000457 17.0174761 15.342800 18.6921526  1.2183013      0.2848710
> 5 ENSG00000000460  3.9414822  2.855099  5.0278659  1.7610131      0.8164056
> 6 ENSG00000000938 16.0894945 18.350117 13.8288718  0.7536122     -0.4081058
>        pval padj
> 1 0.9959638    1
> 2        NA   NA
> 3 0.3208560    1
> 4 0.5942512    1
> 5 0.4840607    1
> 6 0.5409953    1
>
> > res1 <- res[res$padj<0.1,]
> > head(res1)
>      id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
> NA            NA        NA        NA         NA             NA   NA   NA
> NA.1          NA        NA        NA         NA             NA   NA   NA
> NA.2          NA        NA        NA         NA             NA   NA   NA
> NA.3          NA        NA        NA         NA             NA   NA   NA
> NA.4          NA        NA        NA         NA             NA   NA   NA
> NA.5          NA        NA        NA         NA             NA   NA   NA
>
> my first question is that why although I know there are some
> differentially expressed genes in the my data, all the padj values are NA
> or 1 and the second question is this "NA.1" , "NA.2", ..... which are
> emerged as the first column of object "res1"instead of name of genes
>
> Thank you so much
> Regards
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

--
Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

From jtleek at gmail.com Wed Jun 27 19:46:38 2012
From: jtleek at gmail.com (Jeff Leek)
Date: Wed, 27 Jun 2012 13:46:38 -0400
Subject: [BioC] matrix like object with Rle columns
In-Reply-To:
References:
Message-ID:

I would love/use all the time this feature if it existed.
Jeff On Wed, Jun 27, 2012 at 11:21 AM, Michael Lawrence wrote: > On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen < > kasperdanielhansen at gmail.com> wrote: > >> One comment: ?since matrix is a vector with a dim attribute I see that >> the natural parallel is doing the same for Rle. > > > Right, in the original plan, the Array class would bring the dim attribute, > and RleMatrix would contain both Matrix and Rle. > > >> ?Nevertheless, that >> would put an upper limit on the number of runLengths in the entire >> matrix. ?My impression (which could be wrong) is that we would need to >> implement essentially all matrix-like numeric operations from scratch >> anyway, so it may be worthwhile to consider using a list of Rle's >> where each Rle is a column, instead of a single Rle to represent all >> columns. ?Clearly that depends on implementation details, but if we >> really need to do everything from scratch, a list of columns might be >> more flexible (and perhaps even easier to code). >> >> > This would make it harder to treat RleMatrix as an Rle (which is a nice > feature of base R matrices). If the problem is the vector length limit, > then I'd rather wait for Luke's fix, which apparently is coming along. > > Kasper >> >> On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence >> wrote: >> > Seems like it could be a nice thing to have. Presumably one would create >> an >> > Array subclass of Vector that would add a "dim" attribute. Then Matrix >> could >> > extend that to constrain dim to length two (unfortunately colliding with >> the >> > Matrix class in the Matrix package). Then RleMatrix extends Matrix to >> > implement the actual data storage and many of the accelerated methods. As >> > you said, row-oriented methods would be tough. >> > >> > Any takers? 
>> > >> > Michael >> > >> > On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen >> > wrote: >> >> >> >> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen >> >> wrote: >> >> > On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence >> >> > wrote: >> >> >> Patrick and I had talked about this a long time ago (essentially >> >> >> putting a >> >> >> "dim" attribute on an Rle), but the closest thing today is a >> DataFrame >> >> >> with >> >> >> Rle columns. >> >> >> >> >> >> Use case? >> >> > >> >> > Say I have whole-genome data (for example coverage) ?on multiple >> >> > samples. ?Usually, this is far easier to think of as a matrix (in my >> >> > opinion) with ~3B rows and I often want to do rowSums(), colSums() etc >> >> > (in fact, probably the whole API from matrixStats). ?This is >> >> > especially nice when you have multiple coverage-like tracks on each >> >> > sample, so you could have >> >> > ?trackA : genome by samples >> >> > ?trackB : genome by samples >> >> > ?... >> >> > >> >> > You could think of this as a SummarizedExperiment, but with >> >> > _extremely_ big matrices in the assay slot. >> >> > >> >> > I want to take advantage of the Rle structure to store the data more >> >> > efficiently and also to do potentially faster computations. >> >> > >> >> > This is actually closer to my use case where I currently use matrices >> >> > with ~30M rows (which works fine), but I would like to expand to ~800M >> >> > rows (which would suck a bit). >> >> > >> >> > You could also think of a matrix-like object with Rle columns as an >> >> > alternative sparse matrix structure. ?In a typical sparse matrix you >> >> > only store the non-zero entities, here we only store the >> >> > change-points. ?Depending on the structure of the matrix this could be >> >> > an efficient storage of an otherwise dense matrix. 
>> >> > >> >> > So essentially, what I want, is to have mathematical operations on >> >> > this object, where I would utilize that I know that all entities are >> >> > numbers so the typical matrix operations makes sense. >> >> > >> >> > [ side question which could be relevant in this discussion: for a >> >> > numeric Rle is there some notion of precision - say I have truly >> >> > numeric values with tons of digits, and I want to consider two numbers >> >> > part of the same run if |x1 -x2|> >> >> >> You can see that Pete has had similar thoughts in >> >> genoset/R/DataFrame-methods.R, although he only has colMeans (which is >> >> the easy one). >> >> >> >> Kasper >> >> >> >> > Kasper >> >> > >> >> >> >> >> >> Michael >> >> >> >> >> >> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen >> >> >> wrote: >> >> >>> >> >> >>> Do we have a matrix-like object, but where the columns are Rle's? >> >> >>> >> >> >>> Kasper >> >> >>> >> >> >>> _______________________________________________ >> >> >>> Bioconductor mailing list >> >> >>> Bioconductor at r-project.org >> >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> >>> Search the archives: >> >> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> >> >> >> > >> > >> > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From ales.maver at gmail.com Wed Jun 27 20:39:33 2012 From: ales.maver at gmail.com (=?UTF-8?Q?Ale=C5=A1_Maver?=) Date: Wed, 27 Jun 2012 20:39:33 +0200 Subject: [BioC] Operations on GenomicRanges metadata information Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From hpages at fhcrc.org Wed Jun 27 21:30:57 2012 From: hpages at fhcrc.org (=?ISO-8859-1?Q?Herv=E9_Pag=E8s?=) Date: Wed, 27 Jun 2012 12:30:57 -0700 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: Message-ID: <4FEB5F71.5050906@fhcrc.org> Hi guys, Note that some of the things in the "matrix API" seem to work on standard data frames: > df <- data.frame(aa=1:5, bb=100) > rowSums(df) [1] 101 102 103 104 105 > colSums(df) aa bb 15 500 > max(df) [1] 100 > min(df) [1] 1 > range(df) [1] 1 100 > df + df aa bb 1 2 200 2 4 200 3 6 200 4 8 200 5 10 200 > df <= 3 aa bb [1,] TRUE FALSE [2,] TRUE FALSE [3,] TRUE FALSE [4,] FALSE FALSE [5,] FALSE FALSE etc... But none of them work on DataFrame. Maybe if they were we wouldn't need RleMatrix? Using DataFrame instead of RleMatrix would be nice because it reuses what we already have. It would also avoid the pitfall of having the length of an RleMatrix not being representable with a 32-bit int when let's say the nb of rows is 800M and there are a few nb of cols (like in Kasper's use case). No need to wait for Luke's "big vector" hack. Cheers, H. On 06/27/2012 10:46 AM, Jeff Leek wrote: > I would love/use all the time this feature if it existed. > > Jeff > > On Wed, Jun 27, 2012 at 11:21 AM, Michael Lawrence > wrote: >> On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen < >> kasperdanielhansen at gmail.com> wrote: >> >>> One comment: since matrix is a vector with a dim attribute I see that >>> the natural parallel is doing the same for Rle. >> >> >> Right, in the original plan, the Array class would bring the dim attribute, >> and RleMatrix would contain both Matrix and Rle. >> >> >>> Nevertheless, that >>> would put an upper limit on the number of runLengths in the entire >>> matrix. 
My impression (which could be wrong) is that we would need to >>> implement essentially all matrix-like numeric operations from scratch >>> anyway, so it may be worthwhile to consider using a list of Rle's >>> where each Rle is a column, instead of a single Rle to represent all >>> columns. Clearly that depends on implementation details, but if we >>> really need to do everything from scratch, a list of columns might be >>> more flexible (and perhaps even easier to code). >>> >>> >> This would make it harder to treat RleMatrix as an Rle (which is a nice >> feature of base R matrices). If the problem is the vector length limit, >> then I'd rather wait for Luke's fix, which apparently is coming along. >> >> Kasper >>> >>> On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence >>> wrote: >>>> Seems like it could be a nice thing to have. Presumably one would create >>> an >>>> Array subclass of Vector that would add a "dim" attribute. Then Matrix >>> could >>>> extend that to constrain dim to length two (unfortunately colliding with >>> the >>>> Matrix class in the Matrix package). Then RleMatrix extends Matrix to >>>> implement the actual data storage and many of the accelerated methods. As >>>> you said, row-oriented methods would be tough. >>>> >>>> Any takers? >>>> >>>> Michael >>>> >>>> On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen >>>> wrote: >>>>> >>>>> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen >>>>> wrote: >>>>>> On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence >>>>>> wrote: >>>>>>> Patrick and I had talked about this a long time ago (essentially >>>>>>> putting a >>>>>>> "dim" attribute on an Rle), but the closest thing today is a >>> DataFrame >>>>>>> with >>>>>>> Rle columns. >>>>>>> >>>>>>> Use case? >>>>>> >>>>>> Say I have whole-genome data (for example coverage) on multiple >>>>>> samples. 
Usually, this is far easier to think of as a matrix (in my >>>>>> opinion) with ~3B rows and I often want to do rowSums(), colSums() etc >>>>>> (in fact, probably the whole API from matrixStats). This is >>>>>> especially nice when you have multiple coverage-like tracks on each >>>>>> sample, so you could have >>>>>> trackA : genome by samples >>>>>> trackB : genome by samples >>>>>> ... >>>>>> >>>>>> You could think of this as a SummarizedExperiment, but with >>>>>> _extremely_ big matrices in the assay slot. >>>>>> >>>>>> I want to take advantage of the Rle structure to store the data more >>>>>> efficiently and also to do potentially faster computations. >>>>>> >>>>>> This is actually closer to my use case where I currently use matrices >>>>>> with ~30M rows (which works fine), but I would like to expand to ~800M >>>>>> rows (which would suck a bit). >>>>>> >>>>>> You could also think of a matrix-like object with Rle columns as an >>>>>> alternative sparse matrix structure. In a typical sparse matrix you >>>>>> only store the non-zero entities, here we only store the >>>>>> change-points. Depending on the structure of the matrix this could be >>>>>> an efficient storage of an otherwise dense matrix. >>>>>> >>>>>> So essentially, what I want, is to have mathematical operations on >>>>>> this object, where I would utilize that I know that all entities are >>>>>> numbers so the typical matrix operations makes sense. >>>>>> >>>>>> [ side question which could be relevant in this discussion: for a >>>>>> numeric Rle is there some notion of precision - say I have truly >>>>>> numeric values with tons of digits, and I want to consider two numbers >>>>>> part of the same run if |x1 -x2|>>>> >>>>> You can see that Pete has had similar thoughts in >>>>> genoset/R/DataFrame-methods.R, although he only has colMeans (which is >>>>> the easy one). 
>>>>>
>>>>> Kasper
>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Do we have a matrix-like object, but where the columns are Rle's?
>>>>>>>>
>>>>>>>> Kasper
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

From kasperdanielhansen at gmail.com Wed Jun 27 21:54:04 2012
From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen)
Date: Wed, 27 Jun 2012 15:54:04 -0400
Subject: [BioC] matrix like object with Rle columns
In-Reply-To: <4FEB5F71.5050906@fhcrc.org>
References: <4FEB5F71.5050906@fhcrc.org>
Message-ID:

On Wed, Jun 27, 2012 at 3:30 PM, Hervé Pagès wrote:
> Hi guys,
>
> Note that some of the things in the "matrix API" seem to work on
> standard data frames:
>
>> df <- data.frame(aa=1:5, bb=100)
>> rowSums(df)
> [1] 101 102 103 104 105
>> colSums(df)
>  aa  bb
>  15 500
>> max(df)
> [1] 100
>> min(df)
> [1] 1
>> range(df)
> [1]   1 100
>> df + df
>   aa  bb
> 1  2 200
> 2  4 200
> 3  6 200
> 4  8 200
> 5 10 200
>> df <= 3
>         aa    bb
> [1,]  TRUE FALSE
> [2,]  TRUE FALSE
> [3,]  TRUE FALSE
> [4,] FALSE FALSE
> [5,] FALSE FALSE
>
> etc...
>
> But none of them work on DataFrame. Maybe if they were we wouldn't need
> RleMatrix? Using DataFrame instead of RleMatrix would be nice because it
> reuses what we already have. It would also avoid the pitfall of having
> the length of an RleMatrix not being representable with a 32-bit int
> when let's say the nb of rows is 800M and there are a few nb of cols
> (like in Kasper's use case). No need to wait for Luke's "big vector"
> hack.

This is totally fine with me, as long as coercion from Rle to a normal
vector is avoided. But it might make sense to have a derivative class
ensuring that all columns are numeric in nature.

Kasper

>
> Cheers,
> H.
>
>
> On 06/27/2012 10:46 AM, Jeff Leek wrote:
>>
>> I would love/use all the time this feature if it existed.
>>
>> Jeff
>>
>> On Wed, Jun 27, 2012 at 11:21 AM, Michael Lawrence
>> wrote:
>>>
>>> On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> One comment:  since matrix is a vector with a dim attribute I see that
>>>> the natural parallel is doing the same for Rle.
>>>
>>>
>>>
>>> Right, in the original plan, the Array class would bring the dim
>>> attribute,
>>> and RleMatrix would contain both Matrix and Rle.
>>>
>>>
>>>>  Nevertheless, that
>>>> would put an upper limit on the number of runLengths in the entire
>>>> matrix.
?My impression (which could be wrong) is that we would need to >>>> implement essentially all matrix-like numeric operations from scratch >>>> anyway, so it may be worthwhile to consider using a list of Rle's >>>> where each Rle is a column, instead of a single Rle to represent all >>>> columns. ?Clearly that depends on implementation details, but if we >>>> really need to do everything from scratch, a list of columns might be >>>> more flexible (and perhaps even easier to code). >>>> >>>> >>> This would make it harder to treat RleMatrix as an Rle (which is a nice >>> feature of base R matrices). If the problem is the vector length limit, >>> then I'd rather wait for Luke's fix, which apparently is coming along. >>> >>> Kasper >>>> >>>> >>>> On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence >>>> wrote: >>>>> >>>>> Seems like it could be a nice thing to have. Presumably one would >>>>> create >>>> >>>> an >>>>> >>>>> Array subclass of Vector that would add a "dim" attribute. Then Matrix >>>> >>>> could >>>>> >>>>> extend that to constrain dim to length two (unfortunately colliding >>>>> with >>>> >>>> the >>>>> >>>>> Matrix class in the Matrix package). Then RleMatrix extends Matrix to >>>>> implement the actual data storage and many of the accelerated methods. >>>>> As >>>>> you said, row-oriented methods would be tough. >>>>> >>>>> Any takers? >>>>> >>>>> Michael >>>>> >>>>> On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen >>>>> wrote: >>>>>> >>>>>> >>>>>> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen >>>>>> wrote: >>>>>>> >>>>>>> On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence >>>>>>> wrote: >>>>>>>> >>>>>>>> Patrick and I had talked about this a long time ago (essentially >>>>>>>> putting a >>>>>>>> "dim" attribute on an Rle), but the closest thing today is a >>>> >>>> DataFrame >>>>>>>> >>>>>>>> with >>>>>>>> Rle columns. >>>>>>>> >>>>>>>> Use case? 
>>>>>>> >>>>>>> >>>>>>> Say I have whole-genome data (for example coverage) ?on multiple >>>>>>> samples. ?Usually, this is far easier to think of as a matrix (in my >>>>>>> opinion) with ~3B rows and I often want to do rowSums(), colSums() >>>>>>> etc >>>>>>> (in fact, probably the whole API from matrixStats). ?This is >>>>>>> especially nice when you have multiple coverage-like tracks on each >>>>>>> sample, so you could have >>>>>>> ?trackA : genome by samples >>>>>>> ?trackB : genome by samples >>>>>>> ?... >>>>>>> >>>>>>> You could think of this as a SummarizedExperiment, but with >>>>>>> _extremely_ big matrices in the assay slot. >>>>>>> >>>>>>> I want to take advantage of the Rle structure to store the data more >>>>>>> efficiently and also to do potentially faster computations. >>>>>>> >>>>>>> This is actually closer to my use case where I currently use matrices >>>>>>> with ~30M rows (which works fine), but I would like to expand to >>>>>>> ~800M >>>>>>> rows (which would suck a bit). >>>>>>> >>>>>>> You could also think of a matrix-like object with Rle columns as an >>>>>>> alternative sparse matrix structure. ?In a typical sparse matrix you >>>>>>> only store the non-zero entities, here we only store the >>>>>>> change-points. ?Depending on the structure of the matrix this could >>>>>>> be >>>>>>> an efficient storage of an otherwise dense matrix. >>>>>>> >>>>>>> So essentially, what I want, is to have mathematical operations on >>>>>>> this object, where I would utilize that I know that all entities are >>>>>>> numbers so the typical matrix operations makes sense. 
>>>>>>> >>>>>>> [ side question which could be relevant in this discussion: for a >>>>>>> numeric Rle is there some notion of precision - say I have truly >>>>>>> numeric values with tons of digits, and I want to consider two >>>>>>> numbers >>>>>>> part of the same run if |x1 -x2|>>>>> >>>>>> >>>>>> You can see that Pete has had similar thoughts in >>>>>> genoset/R/DataFrame-methods.R, although he only has colMeans (which is >>>>>> the easy one). >>>>>> >>>>>> Kasper >>>>>> >>>>>>> Kasper >>>>>>> >>>>>>>> >>>>>>>> Michael >>>>>>>> >>>>>>>> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> Do we have a matrix-like object, but where the columns are Rle's? >>>>>>>>> >>>>>>>>> Kasper >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> >>>>>>>> >>>>> >>>>> >>>> >>> >>> ? ? ? ?[[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Herv? Pag?s > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: ?(206) 667-5791 > Fax: ? 
?(206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor From lawrence.michael at gene.com Wed Jun 27 22:35:48 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Wed, 27 Jun 2012 13:35:48 -0700 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: <4FEB5F71.5050906@fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From hpages at fhcrc.org Wed Jun 27 22:37:12 2012 From: hpages at fhcrc.org (=?ISO-8859-1?Q?Herv=E9_Pag=E8s?=) Date: Wed, 27 Jun 2012 13:37:12 -0700 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: Message-ID: <4FEB6EF8.8030108@fhcrc.org> Hi Kasper, On 06/25/2012 08:56 PM, Kasper Daniel Hansen wrote: [...] > [ side question which could be relevant in this discussion: for a > numeric Rle is there some notion of precision - say I have truly > numeric values with tons of digits, and I want to consider two numbers > part of the same run if |x1 -x2| all.equal(sqrt(3)^2, 3) [1] TRUE > sqrt(3)^2 == 3 [1] FALSE > Rle(c(sqrt(3)^2, 3)) 'numeric' Rle of length 2 with 2 runs Lengths: 1 1 Values : 3 3 Note that base::rle() does the same: > rle(c(sqrt(3)^2, 3)) Run Length Encoding lengths: int [1:2] 1 1 values : num [1:2] 3 3 I can see that using a "|x1 -x2| x <- c(sqrt(3)^2, 3) > identical(as.vector(Rle(x)), x) [1] TRUE > identical(inverse.rle(rle(x)), x) [1] TRUE Also the "|x1 -x2| > Kasper > >> >> Michael >> >> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen >> wrote: >>> >>> Do we have a matrix-like object, but where the columns are Rle's? 
>>> >>> Kasper >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 From lawrence.michael at gene.com Wed Jun 27 22:48:42 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Wed, 27 Jun 2012 13:48:42 -0700 Subject: [BioC] Operations on GenomicRanges metadata information In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From lawrence.michael at gene.com Wed Jun 27 22:58:00 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Wed, 27 Jun 2012 13:58:00 -0700 Subject: [BioC] matrix like object with Rle columns In-Reply-To: <4FEB6EF8.8030108@fhcrc.org> References: <4FEB6EF8.8030108@fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From guest at bioconductor.org Wed Jun 27 23:07:55 2012 From: guest at bioconductor.org (Cindy [guest]) Date: Wed, 27 Jun 2012 14:07:55 -0700 (PDT) Subject: [BioC] oligo: ordering of \"backgroundCorrect\" and \"normalize\" output values Message-ID: <20120627210755.A9210135043@mamba.fhcrc.org> Dear Bioconductor team, I am currently using the oligo package for analyzing tiling array data (Affymetrix GeneChip Human Tiling 2.0R Array). 
I have already built a design platform using PlatformDesign package and
was able to read in the raw .CEL files. I guess my question is when I use
backgroundCorrect and normalize functions for data pre-processing, I get
just the values back and no labels (as shown below: str(normalized)). Can
I assume that the ordering of the values is the same as the raw pm data
that I input into the backgroundCorrect function? In other words, does the
first value following background correction and normalization correspond
to the input from probe "2566" and sample "1_10_(Hs35b_P02R_v01).CEL"
(shown below in str(raw.pm)). Is there any way to bring out the labels for
both the probes and the samples? Thank you very much for your time!

# load expression data
raw.data <- read.celfiles(cel.files);

# get intensity data for pm only
raw.pm <- pm(raw.data);
str(raw.pm)
 num [1:6003165, 1:200] 56 73 155 98 60 103 91 184 176 110 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:6003165] "2566" "2567" "2568" "2569" ...
  ..$ : chr [1:200] "1_10_(Hs35b_P02R_v01).CEL" "1_11_(Hs35b_P02R_v01).CEL" "1_12_(Hs35b_P02R_v01).CEL" "1_13_(Hs35b_P02R_v01).CEL" ...

# perform preprocessing using RMA
bgCorrected <- backgroundCorrect(raw.pm, method = "rma");
normalized <- normalize(bgCorrected, method = "quantile");
str(normalized)
 num [1:6003165, 1:200] 18.4 42.2 178.9 79.7 24.2 ...
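
If the processed matrix does come back in the same row and column order as the input (an assumption that should be confirmed with the package maintainer or the oligo documentation; nothing in this message establishes it), the labels can be copied across in base R. A minimal sketch with a made-up 3 x 2 matrix, where unname(log2(...)) is only a stand-in for a preprocessing step that drops dimnames:

```r
# Toy stand-in for raw.pm: probes in rows, samples in columns (hypothetical values)
raw <- matrix(c(56, 73, 155, 98, 60, 103), nrow = 3,
              dimnames = list(c("2566", "2567", "2568"),
                              c("a.CEL", "b.CEL")))

# Stand-in for a step that returns bare values without labels
processed <- unname(log2(raw))

# Only copy labels when the shape is unchanged; this does not prove the
# ordering is unchanged, it merely guards against an obvious mismatch
stopifnot(identical(dim(processed), dim(raw)))
dimnames(processed) <- dimnames(raw)

str(processed)
```

The dimension check is a safeguard, not a proof: whether row and column order survives the real backgroundCorrect() and normalize() calls still has to be verified separately.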
Best regards, Cindy -- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] pd.hs35b.p02r.v01_0.0.1 RSQLite_0.11.1 DBI_0.2-5 [4] affxparser_1.28.0 oligo_1.20.1 oligoClasses_1.18.0 loaded via a namespace (and not attached): [1] Biobase_2.16.0 BiocGenerics_0.2.0 BiocInstaller_1.4.3 [4] Biostrings_2.24.1 IRanges_1.14.2 affyio_1.24.0 [7] bit_1.1-8 codetools_0.2-8 ff_2.2-6 [10] foreach_1.4.0 iterators_1.0.6 preprocessCore_1.18.0 [13] splines_2.15.0 stats4_2.15.0 tools_2.15.0 [16] zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org. From beniltoncarvalho at gmail.com Wed Jun 27 23:20:11 2012 From: beniltoncarvalho at gmail.com (Benilton Carvalho) Date: Wed, 27 Jun 2012 22:20:11 +0100 Subject: [BioC] oligo: ordering of \"backgroundCorrect\" and \"normalize\" output values In-Reply-To: <20120627210755.A9210135043@mamba.fhcrc.org> References: <20120627210755.A9210135043@mamba.fhcrc.org> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From heidi at ebi.ac.uk Wed Jun 27 23:26:52 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Wed, 27 Jun 2012 22:26:52 +0100 Subject: [BioC] HTqPCR problems In-Reply-To: <7BE89EC0-5910-4758-881F-6EFCB8E79A5F@buckinstitute.org> References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> <7BE89EC0-5910-4758-881F-6EFCB8E79A5F@buckinstitute.org> Message-ID: <009c6493df7d27088ee45793182ed7a8.squirrel@webmail.ebi.ac.uk> Hi Simon, > Thanks for the help Heidi, > but I'm still having troubles, your comments on the plotting helped me > solve the outputs. But if I want to just display some groups (for example > the LO group in the example below), how do I associate a group with > multiple samples (ie biological reps)? 
> > Currently I'm associating genes with samples by reading in the file as > below > plate6=read.delim("plate6Sample.txt", header=FALSE) > #this is a file to associate sample ID with the genes in the biomark data, > as currently HTqPCR does not seem to associate the sample info in the > Biomark output to the gene IDs > Erm, no, it doesn't :-/ > samples=as.vector(t(plate6)) > raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, > n.data=48, samples=samples) > #now I have samples and genes similar to your example in the guide, but I > want to associate samples to groups now. In the guide, you have an example > where you have entire files as distinct samples, but in our runs, we never > have that situation. I have a file which associates samples to groups, > which I read in... > > groupID=read.csv("plate6key.csv") > > but how do I associate the samples with their appropriate groups for > biological replicates with any of the functions in HtQPCR? I'm afraid I'm slightly confused here (sorry, long day). Exactly how is your data formatted? I.e. are the columns either individual samples, or from files containing multiple samples? So for example for a single 48.48 arrays, is your qPCRset object 2304 x 1 or 48 x 48? >From your readCtData command I'm guessing you have 48 x 48, i.e. all 48 samples from your 1 array are in columns. In that case the 'groups' parameter in plotCtOverview will need to be a vector of length 48, indicating how you want the 48 columns in your qPCRset object to be grouped together. Below is an example of (admittedly ugly) plots. I don't know if that's similar to your situation at all. 
\Heidi

> # Reading in data
> exPath <- system.file("exData", package = "HTqPCR")
> raw1 <- readCtData(files = "BioMark_sample.csv", path = exPath, format = "BioMark", n.features = 48, n.data = 48)
> # Check sample names
> head(sampleNames(raw1))
[1] "Sample1" "Sample2" "Sample3" "Sample4" "Sample5" "Sample6"
> # Associate samples with (randomly chosen) groups
> anno <- data.frame(sampleID=sampleNames(raw1), Group=rep(c("A", "B", "C", "D"), times=c(4,24,5,15)))
> head(anno)
  sampleID Group
1  Sample1     A
2  Sample2     A
3  Sample3     A
4  Sample4     A
5  Sample5     B
6  Sample6     B
> # Plot the first gene - for each sample individually
> plotCtOverview(raw1, genes=featureNames(raw1)[1], legend=FALSE, col=1:nrow(anno))
> # Plot the first gene - for each group
> plotCtOverview(raw1, genes=featureNames(raw1)[1], groups=anno$Group, legend=FALSE, col=1:length(unique(anno$Group)))
> # Plot the first gene, using group "A" as a control
> plotCtOverview(raw1, genes=featureNames(raw1)[1], groups=anno$Group, legend=FALSE, col=1:length(unique(anno$Group)), calibrator="A")

> You recommend below using a vector, but I dont see how that helps me
> associate the samples in the Expression set.
>
> thanks again!
>
> s
>
> On Jun 26, 2012, at 12:48 PM, Heidi Dvinge wrote:
>
>>> Hi,
>>> I'm having some troubles selectively sub-setting, and graphing up QPCR
>>> data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates).
>>> I'd
>>> like to be able to visualize specific genes, with specific groups we
>>> run
>>> routinely on our Biomark system. Typical runs are across multiple
>>> plates,
>>> and have multiple biological replicates, and usually 2 or more
>>> technical
>>> replicates (although we are moving away from technical reps, as the CVs
>>> are so tight).
>>>
>>> Can anyone help with this? Heidi?
>>> >>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>> n.data=48, samples=samples) >>> #Ive read the samples in from a separate file, as when you read it in, >>> it >>> doesnt take the sample names supplied in the biomark output# >>> #Next, I want to plot genes of interest, with samples of interest, and >>> I'm >>> having trouble getting an appropriate output# >>> >>> g=featureNames(raw6)[1:2] >>> plotCtOverview(raw6, genes=g, groups=groupID$Treatment, col=rainbow(5)) >>> >>> #This plots 1 gene across all 48 samples# >>> #but the legend doesnt behave, its placed on top of the plot, and I >>> cant >>> get it to display in a non-overlapping fashion# >>> #I've tried all sorts of things in par, but nothing seems to shift the >>> legend's position# >>> >> As the old saying goes, whenever you want a job done well, you'll have >> to >> do it yourself ;). In this case, the easiest thing is probably to use >> legend=FALSE in plotCtOverview, and then afterwards add it yourself in >> the >> desired location using legend(). That way, if you have a lot of >> different >> features or groups to display, you can also use the ncol parameter in >> legend to make several columns within the legend, such as 3x4 instead of >> the default 12x1. >> >> Alternatively, you can use either xlim or ylim in plotCtOverview to add >> some empty space on the side where there's then room for the legend. 
>> >>> #I now want to plot a subset of the samples for specific genes# >>>> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== >>>> "LFY") >>>> LOY >>> Sample Treatment >>> 2 L20 LFY >>> 5 L30 LFY >>> 7 L45 LO >>> 20 L40 LO >>> 27 L43 LO >>> 33 L29 LFY >>> 36 L38 LO >>> 40 L39 LO >>> 43 L23 LFY >>> >>> >>>> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) >>> Warning messages: >>> 1: In split.default(t(x), sample.split) : >>> data length is not a multiple of split variable >>> 2: In qt(p, df, lower.tail, log.p) : NaNs produced >>>> >> >> Does it make sense if you split by groups=LOY$Treatment? It looks like >> the >> object LOY itself is a data frame, rather than the expected vector. >> >> Also, you may have to 'repeat' the col=rainbow() argument to fit your >> number of features. >> >>> >>> #it displays the two groups defined by treatment, but doesnt do so >>> nicely, >>> very skinny bars, and the legend doesnt reflect what its displaying# >>> #again, I've tried monkeying around with par, but not sure what HTqPCR >>> is >>> calling to make the plots# >>> >> If the bars are very skinny, it's probably because you're displaying a >> lot >> of features. Nothing much to do about that, except increasing the width >> or >> your plot :(. >> >> \Heidi >> >>> please help! >>> >>> thanks >>> >>> Simon. 
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
>

From heidi at ebi.ac.uk Wed Jun 27 23:36:01 2012
From: heidi at ebi.ac.uk (Heidi Dvinge)
Date: Wed, 27 Jun 2012 22:36:01 +0100
Subject: [BioC] plotCtArray example request
In-Reply-To: <42184EC7DF030842B7CF0F34F583DABA0581B9C3@SOAANCMSG01.soa.alaska.gov>
References: <42184EC7DF030842B7CF0F34F583DABA0581B9C3@SOAANCMSG01.soa.alaska.gov>
Message-ID: <2905b88e5eab502b4e0113a4a55666f8.squirrel@webmail.ebi.ac.uk>

Hi Hans/Charles,

> Hello Heidi Dvinge.
>
> I am looking to use your package for Fluidigm M96.96 and the new M192.24
> microarrays. I'm relatively new to R but have been able to learn on the
> job by using the examples listed in package descriptions. I'm looking
> at a few examples in HTqPCR and it appears that some functions/commands
> have been cut by the edge of the page so I can't read all of the
> arguments. Specifically, since I am using the Biomark system I am
> looking to see plotCtArray. Could you please send me the full
> argumentation you used for the example please?
>
There are some examples available in the help file for each function
individually. These can be accessed with

?plotCtArray

or run directly with

example(plotCtArray)

All the R-code in the vignette can be obtained by typing the commands
mentioned on page 1:

all.R.commands <- system.file("doc", "HTqPCR.Rnw", package = "HTqPCR")
Stangle(all.R.commands)

This will create the file HTqPCR.R in your current working directory.
Incidentally, that is the same file as you can also just download from the
HTqPCR page on the Bioconductor site, from the "R Script" link under the
header "Documentation". This is the case for all Bioconductor packages, in
case you ever get tired of copy-pasting from the pdf vignette.
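
To see what Stangle() does without touching the HTqPCR vignette, here is a minimal sketch on a throwaway Sweave file. The file name demo.Rnw and its contents are invented for illustration; only Stangle() itself and base R file helpers are used.

```r
# Write a tiny Sweave document to the temporary directory
# (hypothetical stand-in for a package vignette such as HTqPCR.Rnw)
rnw <- file.path(tempdir(), "demo.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Some text around the code.",
  "<<chunk1>>=",
  "x <- 1:10",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw)

# Stangle() writes its output into the current working directory,
# so switch there for the call and switch back afterwards
old.wd <- setwd(tempdir())
Stangle(rnw)
setwd(old.wd)

# The extracted script holds only the code chunks, plus comment headers
code <- readLines(file.path(tempdir(), "demo.R"))
print(code)
```

The resulting demo.R contains the two code lines from the chunk and none of the surrounding LaTeX, which is why the extracted script is easier to copy from than the PDF.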
>
>
> Also, how can I learn more about the dimensions of how this information
> is held in tables? Thanks!
>
What do you mean by 'tables' here? The tables (file) where your input data
is kept, or the objects in R, such as qPCRsets? For the qPCRset objects all
the standard R-functions have been made to work, e.g. dim(), nrow() and
ncol(). If you can be a bit more specific about what you want to know, I
can try to give you some pointers.

HTH
\Heidi

> Hans Thompson
> Fisheries Biologist I
> Alaska Department of Fish and Game
>
>

From heidi at ebi.ac.uk Wed Jun 27 23:53:12 2012
From: heidi at ebi.ac.uk (Heidi Dvinge)
Date: Wed, 27 Jun 2012 22:53:12 +0100
Subject: [BioC] HTqPCR
In-Reply-To: <1340795248.74516.YahooMailNeo@web132504.mail.ird.yahoo.com>
References: <1340319194.4904.BPMail_high_noncarrier@web132504.mail.ird.yahoo.com>
 <48c1ef538be818c58eab9946116e3dbb.squirrel@webmail.ebi.ac.uk>
 <1340614784.8482.YahooMailNeo@web132505.mail.ird.yahoo.com>
 <1340795248.74516.YahooMailNeo@web132504.mail.ird.yahoo.com>
Message-ID: <6fda8a02565e0fbea3457e2bf445abf1.squirrel@webmail.ebi.ac.uk>

Hi Deborah,

> Good morning Heidi,
>
> Yes, the order of my samples is different in my qPCRset compared to my
> files_essai$Treatment. Do I have to order them in the same way ?
>
The order doesn't have to be the same, as long as you make sure you always
remember to re-order one of them when you use them both in the same
function, such as the limmaCtData examples you provided last time. I don't
trust myself to always remember such things (or to do it correctly!), so I
always order the samples in the same way ;). But that's not a requirement.

>
>
> And with normalizeCtData I obtained :
>> deltaCtnorm <- normalizeCtData(essai.cat, norm = "deltaCt", deltaCt.genes = c("gene85", "gene86", "gene87", "gene88", "gene89", "gene90","gene91"))
> Calculating deltaCt values
>         Using control gene(s): gene85 gene86 gene87 gene88 gene89 gene90 gene91
>         Card 1: Mean=20.93      Stdev=2.74
>         Card 2: Mean=21.58      Stdev=2.73
>         Card 3: Mean=21.73      Stdev=2.81
>         Card 4: Mean=20.96      Stdev=2.73
>         Card 5: Mean=21.69      Stdev=2.69
>         Card 6: Mean=21.73      Stdev=2.83
> I chose this method because my director gave me a file where he has chosen
> only seven housekeeper genes on the eight (one of them has different Cp
> results in almost each sample) so I did the same thing.
>
In this case the values of all these housekeepers do look robust across
your samples, although the Ct values are higher than for typical
housekeepers such as b-actin. (Which BTW isn't necessarily a bad thing).
So as long as you/your boss is happy with it, I guess that's fine.

> And about the other normalization methods, I tried them only for seeing
> the difference between them.
> I did :
>>par(mfrow=c(3,2))
>>plot(exprs(essai),exprs(essai_g.mean),pch=20,main="Normalisation avec geometric.mean",col=rep(brewer.pal(6,"Spectral"),each=96))
>>plot(exprs(essai),exprs(essai_scale.rank),pch=20,main="Normalisation avec scale.rankinvariant", col=rep(brewer.pal(6,"Spectral"), each=96))
>>plot(exprs(essai),exprs(essai_deltaCt),pch=20,main="Normalisation avec deltaCt",col=rep(brewer.pal(6,"Spectral"),each=96))
>>plot(exprs(essai),exprs(essai_q.norm),pch=20,main="Normalisation avec quantile",col=rep(brewer.pal(6,"Spectral"),each=96))
>>plot(exprs(essai),exprs(essai_norm.rank),pch=20,main="Normalisation avec norm.rankinvariant", col=rep(brewer.pal(6,"Spectral"), each=96))
>
In your case you have relatively few genes (96), which may not be quite
enough for some of the methods. If the deltaCt-normalised data doesn't look
too discrepant from all the other methods you're probably fine.

> And then I wanted to compare only one of the sample as your example and I
> used abline() but it didn't work.
>>plot(exprs(DU145)[,3],exprs(essai_g.mean)[,3],pch=20,col="magenta") >>abline(exprs(DU145)[,3],exprs(essai_scale.rank)[,3],pch=20,col="blue") >>abline(exprs(DU145)[,3],exprs(essai_deltaCt)[,3],pch=20,col="purple") > I did'nt get error message but there was only one plot. I also changed the > xlim value... > Well, you only use one plot() command, hence only one plot gets produced. Do you perhaps want to add data from multiple objects using points()? plot(exprs(DU145)[,3],exprs(essai_g.mean)[,3],pch=20,col="magenta") points(exprs(DU145)[,3],exprs(essai_norm.rank)[,3],pch=20,col="blue") ...etc.. > About the p.value, I tried to plot them as you had suggest me to do. > Here are the p.values of each test. >> MWTEST<-read.csv("essai_3h_CT_mwest1.csv", sep = ";", >> dec=",",header=TRUE) >> MWTEST$p.value > ?[1] 0.2452781 0.2452781 0.2452781 1.0000000 0.2452781 1.0000000 0.2452781 > 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 > 0.2452781 > [16] 0.2452781 1.0000000 0.2452781 0.2452781 0.6985354 0.2452781 0.2452781 > 0.2452781 0.6985354 0.2452781 0.6985354 0.2452781 0.6985354 0.2452781 > 0.2452781 > [31] 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 > 0.2452781 0.2452781 1.0000000 0.2452781 1.0000000 0.2452781 0.2452781 > 0.2452781 > [46] 0.2452781 0.2452781 0.2452781 1.0000000 0.2452781 0.6985354 0.2452781 > 0.6985354 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 1.0000000 > 0.2452781 > [61] 0.2452781 0.2452781 0.2452781 0.2452781 0.2452781 1.0000000 0.2452781 > 0.2452781 0.2452781 0.6985354 0.2452781 0.2452781 0.2452781 0.2452781 > 0.2452781 > [76] 0.2452781 0.2452781 0.2452781 0.2452781 1.0000000 0.2452781 0.2452781 > 0.2452781 0.2452781 > >> TTEST<-read.csv("essai_3h_CT_ttestFINAL.csv",sep=";",dec=",") >> TTEST$p.value > ?[1] 0.0000445916 0.0007203397 0.0011614062 0.0013499661 0.0021628924 > 0.0026335148 0.0028568178 0.0031900227 0.0047685877 0.0048604855 > 0.0058520334 > [12] 0.0061487450 0.0101863324 0.0101863324 
0.0101863324 0.0101863324 > 0.0101863324 0.0101863324 0.0101863324 0.0101863324 0.0118721483 > 0.0130066108 > [23] 0.0166399589 0.0169334479 0.0170523949 0.0209596603 0.0364914159 > 0.0411074450 0.0427607922 0.0448349199 0.0476675408 0.0494563227 > 0.0512518277 > [34] 0.0514256601 0.0572253657 0.0625769920 0.0656449529 0.0797613537 > 0.0804359345 0.0819791697 0.0821601769 0.0909080090 0.0918788852 > 0.0986962901 > [45] 0.0993682993 0.1208709609 0.1261907225 0.1331849915 0.1338834931 > 0.1590798074 0.1611312123 0.1657960803 0.1718536210 0.1844114022 > 0.2035267116 > [56] 0.2092967748 0.2111576859 0.2192894000 0.2223339619 0.2393817321 > 0.2416781885 0.2479843103 0.2570206534 0.2570800840 0.2610404909 > 0.2755461365 > [67] 0.2886380998 0.3133822666 0.4574691996 0.4790123864 0.4963391483 > 0.5780428714 0.5827604076 0.6029711831 0.6738622120 0.6905548966 > 0.7800699292 > [78] 0.8045384637 0.8399336418 0.9347460142 0.9531859762 0.9719743053 > 0.9774759886 0.9934469629 > >> LIMMATEST<-read.csv("essai_limmaFINAL.csv",sep=";",dec=",") >> LIMMATEST$X3h.CT.p.value > ?[1] 4.538818e-01 9.722424e-01 9.681357e-01 1.478327e-02 9.765722e-01 > 1.224899e-01 4.579647e-01 1.035570e-03 6.190137e-02 6.862192e-03 > 1.828032e-02 > [12] 2.354413e-02 2.027634e-01 3.924912e-03 1.245438e-03 9.538478e-07 > 5.915158e-04 2.714424e-01 2.449646e-04 9.943747e-02 8.115928e-02 > 1.014429e-01 > [23] 1.730959e-04 2.283943e-01 4.106429e-02 8.292733e-01 7.384857e-01 > 9.053543e-04 3.031922e-05 4.381594e-02 8.697809e-05 1.730959e-04 > 9.949012e-01 > [34] 3.584419e-03 3.713434e-04 7.691588e-01 1.336464e-01 3.141131e-01 > 3.500428e-02 5.853026e-06 6.234777e-02 1.096195e-01 5.065608e-01 > 1.425943e-02 > [45] 7.720779e-01 2.074906e-05 2.596116e-04 6.080595e-02 6.472036e-01 > 1.730959e-04 6.924510e-05 8.243564e-03 2.010885e-01 9.367344e-01 > 2.535135e-01 > [56] 9.788777e-01 1.730959e-04 6.717992e-02 1.041109e-01 3.951307e-04 > 1.152792e-01 2.552804e-04 8.276034e-01 6.578508e-03 3.226937e-02 > 1.730959e-04 > 
[67] 1.730959e-04 1.052211e-01 6.826300e-05 1.730959e-04 2.939883e-01 > 1.116254e-02 2.997326e-01 5.701757e-02 2.319393e-03 3.023084e-02 > 8.304573e-01 > [78] 4.892519e-01 6.178556e-01 4.863336e-01 8.506124e-02 1.730959e-04 > 1.380221e-01 7.850957e-03 > > So I did : > par(mfrow=c(1,3)) > plot(LIMMATEST$X3h.CT.p.value,col="green",main="p-value LIMMA") > plot(TTEST$p.value,col="red",main="p-value Student") > plot(MWTEST$p.value,col="blue",main="p-value Mann-Whitney") > Considering the graphs that I obtained, I can say that the p.values don't > follow a general trend... So there is a real problem somewhere... Is that > alright ? > I'm sorry for not being clear, I meant plot them against each other. For example plot(TTEST$p.value, MWTEST$p.value). > Actually, I have a question about the "Summary" component of the > limmaCtData : how do it do to calculate if it is up- or down-regulation ? > Because when I calculated the expression (2^ddCt) of the gene18, I > obtained expression = 11.41. > So the gene18 is up-regulated whereas in the "Summary" there is no change. > Is there a link between the "Summary" and the expression ? > In Summary, -1/0/1 should correspond to down-regulation/no difference/up-regulation respectively. The summary is linked to the expression, but it requires that the change in expression is statistically significant at p<0.05. Otherwise it's just "0" in the output. Best, \Heidi > Thank you again for your help and your advice, > > Deborah. 
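A minimal sketch of the -1/0/1 coding described in the reply above. This is not HTqPCR's own code: the helper `summarise_change`, the gene names other than gene18, and all p-values are made up for illustration. It assumes the fold change is given as a ratio where values above 1 mean higher expression in the test group.

```r
# Hypothetical helper: code each gene as -1 / 0 / 1.
# fc: fold change (values > 1 taken to mean higher expression in the test group)
# p:  p-value for the same comparison
summarise_change <- function(fc, p, alpha = 0.05) {
  stopifnot(length(fc) == length(p))
  out <- integer(length(fc))        # default 0: no significant change
  sig <- !is.na(p) & p < alpha      # only significant genes get a direction
  out[sig & fc > 1] <- 1L
  out[sig & fc < 1] <- -1L
  out
}

# Made-up values: gene18 has a large fold change but a non-significant p-value
fc <- c(gene18 = 11.41, geneA = 0.20, geneB = 3.00)
p  <- c(0.12, 0.01, 0.03)
res <- summarise_change(fc, p)
print(res)
```

Under these assumptions a gene with an 11-fold change still gets 0 when its p-value is above the cutoff, which would match the "no change" seen for gene18.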
> From smelov at buckinstitute.org Wed Jun 27 23:59:24 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Wed, 27 Jun 2012 14:59:24 -0700 Subject: [BioC] HTqPCR problems In-Reply-To: <009c6493df7d27088ee45793182ed7a8.squirrel@webmail.ebi.ac.uk> References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> <7BE89EC0-5910-4758-881F-6EFCB8E79A5F@buckinstitute.org> <009c6493df7d27088ee45793182ed7a8.squirrel@webmail.ebi.ac.uk> Message-ID: Hi Heidi, you are correct, yes 48.48. The example you provide below is exactly what I needed for clarification for groups. I was trying to reverse engineer what you had done with the original expression set package for microarrays, but from below, I can get this to work now. Just to be clear, I have 5 48.48 plates. Should I normalize each individually prior to combining, or should I reformat to a 2304x1 each, combine, then normalize (not sure if you can do that or not) thanks again for your prompt responses! best s On Jun 27, 2012, at 2:26 PM, Heidi Dvinge wrote: > Hi Simon, > >> Thanks for the help Heidi, >> but I'm still having troubles, your comments on the plotting helped me >> solve the outputs. But if I want to just display some groups (for example >> the LO group in the example below), how do I associate a group with >> multiple samples (ie biological reps)? 
>> >> Currently I'm associating genes with samples by reading in the file as >> below >> plate6=read.delim("plate6Sample.txt", header=FALSE) >> #this is a file to associate sample ID with the genes in the biomark data, >> as currently HTqPCR does not seem to associate the sample info in the >> Biomark output to the gene IDs >> > Erm, no, it doesn't :-/ > >> samples=as.vector(t(plate6)) >> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >> n.data=48, samples=samples) >> #now I have samples and genes similar to your example in the guide, but I >> want to associate samples to groups now. In the guide, you have an example >> where you have entire files as distinct samples, but in our runs, we never >> have that situation. I have a file which associates samples to groups, >> which I read in... >> >> groupID=read.csv("plate6key.csv") >> >> but how do I associate the samples with their appropriate groups for >> biological replicates with any of the functions in HtQPCR? > > I'm afraid I'm slightly confused here (sorry, long day). Exactly how is > your data formatted? I.e. are the columns either individual samples, or > from files containing multiple samples? So for example for a single 48.48 > arrays, is your qPCRset object 2304 x 1 or 48 x 48? > > From your readCtData command I'm guessing you have 48 x 48, i.e. all 48 > samples from your 1 array are in columns. In that case the 'groups' > parameter in plotCtOverview will need to be a vector of length 48, > indicating how you want the 48 columns in your qPCRset object to be > grouped together. > > Below is an example of (admittedly ugly) plots. I don't know if that's > similar to your situation at all. 
> > \Heidi > >> # Reading in data >> exPath <- system.file("exData", package = "HTqPCR") >> raw1 <- readCtData(files = "BioMark_sample.csv", path = exPath, format = > "BioMark", n.features = 48, n.data = 48) >> # Check sample names >> head(sampleNames(raw1)) > [1] "Sample1" "Sample2" "Sample3" "Sample4" "Sample5" "Sample6" >> # Associate samples with (randomly chosen) groups >> anno <- data.frame(sampleID=sampleNames(raw1), Group=rep(c("A", "B", > "C", "D"), times=c(4,24,5,15))) >> head(anno) > sampleID Group > 1 Sample1 A > 2 Sample2 A > 3 Sample3 A > 4 Sample4 A > 5 Sample5 B > 6 Sample6 B >> # Plot the first gene - for each sample individually >> plotCtOverview(raw1, genes=featureNames(raw1)[1], legend=FALSE, > col=1:nrow(anno)) >> # Plot the first gene - for each group >> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, > legend=FALSE, col=1:length(unique(anno$Group))) >> # Plot the first gene, using group "A" as a control >> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, > legend=FALSE, col=1:length(unique(anno$Group)), calibrator="A") > > > >> You recommend below using a vector, but I dont see how that helps me >> associate the samples in the Expression set. >> >> thanks again! >> >> s >> >> On Jun 26, 2012, at 12:48 PM, Heidi Dvinge wrote: >> >>>> Hi, >>>> I'm having some troubles selectively sub-setting, and graphing up QPCR >>>> data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). >>>> I'd >>>> like to be able to visualize specific genes, with specific groups we >>>> run >>>> routinely on our Biomark system. Typical runs are across multiple >>>> plates, >>>> and have multiple biological replicates, and usually 2 or more >>>> technical >>>> replicates (although we are moving away from technical reps, as the CVs >>>> are so tight). >>>> >>>> Can anyone help with this? Heidi? 
>>>> >>>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>>> n.data=48, samples=samples) >>>> #Ive read the samples in from a separate file, as when you read it in, >>>> it >>>> doesnt take the sample names supplied in the biomark output# >>>> #Next, I want to plot genes of interest, with samples of interest, and >>>> I'm >>>> having trouble getting an appropriate output# >>>> >>>> g=featureNames(raw6)[1:2] >>>> plotCtOverview(raw6, genes=g, groups=groupID$Treatment, col=rainbow(5)) >>>> >>>> #This plots 1 gene across all 48 samples# >>>> #but the legend doesnt behave, its placed on top of the plot, and I >>>> cant >>>> get it to display in a non-overlapping fashion# >>>> #I've tried all sorts of things in par, but nothing seems to shift the >>>> legend's position# >>>> >>> As the old saying goes, whenever you want a job done well, you'll have >>> to >>> do it yourself ;). In this case, the easiest thing is probably to use >>> legend=FALSE in plotCtOverview, and then afterwards add it yourself in >>> the >>> desired location using legend(). That way, if you have a lot of >>> different >>> features or groups to display, you can also use the ncol parameter in >>> legend to make several columns within the legend, such as 3x4 instead of >>> the default 12x1. >>> >>> Alternatively, you can use either xlim or ylim in plotCtOverview to add >>> some empty space on the side where there's then room for the legend. 
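The legend advice above can be sketched with base graphics. This uses a plain barplot() as a stand-in for plotCtOverview(..., legend=FALSE); the data, group names and the ylim value are made up for illustration.

```r
# Made-up mean Ct values for 2 genes across 12 groups (illustration only)
set.seed(1)
ct <- matrix(runif(24, 20, 30), nrow = 12,
             dimnames = list(paste0("Group", 1:12), c("GeneA", "GeneB")))
cols <- rainbow(nrow(ct))

# Draw the bars without a legend, leaving headroom above them via ylim
bp <- barplot(ct, beside = TRUE, col = cols, ylim = c(0, 45), ylab = "Ct")

# Add the legend by hand: 4 columns (3 rows) instead of one long column
legend("top", legend = rownames(ct), fill = cols, ncol = 4,
       cex = 0.8, bty = "n")
```

With the real data, the same legend() call would presumably follow the plotCtOverview call, with the labels and colours taken from whatever was passed as groups and col.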
>>> >>>> #I now want to plot a subset of the samples for specific genes# >>>>> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== >>>>> "LFY") >>>>> LOY >>>> Sample Treatment >>>> 2 L20 LFY >>>> 5 L30 LFY >>>> 7 L45 LO >>>> 20 L40 LO >>>> 27 L43 LO >>>> 33 L29 LFY >>>> 36 L38 LO >>>> 40 L39 LO >>>> 43 L23 LFY >>>> >>>> >>>>> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) >>>> Warning messages: >>>> 1: In split.default(t(x), sample.split) : >>>> data length is not a multiple of split variable >>>> 2: In qt(p, df, lower.tail, log.p) : NaNs produced >>>>> >>> >>> Does it make sense if you split by groups=LOY$Treatment? It looks like >>> the >>> object LOY itself is a data frame, rather than the expected vector. >>> >>> Also, you may have to 'repeat' the col=rainbow() argument to fit your >>> number of features. >>> >>>> >>>> #it displays the two groups defined by treatment, but doesnt do so >>>> nicely, >>>> very skinny bars, and the legend doesnt reflect what its displaying# >>>> #again, I've tried monkeying around with par, but not sure what HTqPCR >>>> is >>>> calling to make the plots# >>>> >>> If the bars are very skinny, it's probably because you're displaying a >>> lot >>> of features. Nothing much to do about that, except increasing the width >>> or >>> your plot :(. >>> >>> \Heidi >>> >>>> please help! >>>> >>>> thanks >>>> >>>> Simon. 
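A minimal sketch of the point about passing a vector rather than a data frame. The sample names and treatments below are made up; only the column names Sample and Treatment follow the listing above. The idea is one group label per column of the data, in column order, with any subsetting applied to samples and labels together.

```r
# Made-up stand-ins for the real objects (illustration only)
sample_names <- c("L20", "L45", "L11", "L30", "L40", "L07")  # column order of the data
groupID <- data.frame(Sample    = c("L45", "L20", "L30", "L40", "L11", "L07"),
                      Treatment = c("LO", "LFY", "LFY", "LO", "HI", "HI"),
                      stringsAsFactors = FALSE)

# One group label per column, in the same order as the columns
groups <- groupID$Treatment[match(sample_names, groupID$Sample)]

# Keep only the LO and LFY samples, and subset the labels the same way
keep        <- groups %in% c("LO", "LFY")
groups_sub  <- groups[keep]
samples_sub <- sample_names[keep]
print(data.frame(Sample = samples_sub, Group = groups_sub))
```

Assuming column subsetting with `[` works on the qPCRset as it does for other Bioconductor containers, the plot call would then take the subsetted object together with groups=groups_sub, so that the two have the same length.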
>>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> > > From heidi at ebi.ac.uk Thu Jun 28 00:27:19 2012 From: heidi at ebi.ac.uk (Heidi Dvinge) Date: Wed, 27 Jun 2012 23:27:19 +0100 Subject: [BioC] HTqPCR problems In-Reply-To: References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> <7BE89EC0-5910-4758-881F-6EFCB8E79A5F@buckinstitute.org> <009c6493df7d27088ee45793182ed7a8.squirrel@webmail.ebi.ac.uk> Message-ID: <6d2982152de2217869a0f9db60da3904.squirrel@webmail.ebi.ac.uk> > Hi Heidi, > you are correct, yes 48.48. > The example you provide below is exactly what I needed for clarification > for groups. I was trying to reverse engineer what you had done with the > original expression set package for microarrays, but from below, I can get > this to work now. > Glad it works. Hopefully by the next BioConductor release I'll remember to clarify the plotCtOverview help file. > Just to be clear, I have 5 48.48 plates. Should I normalize each > individually prior to combining, or should I reformat to a 2304x1 each, > combine, then normalize (not sure if you can do that or not) > Hm, that's one of the questions I've also been asking myself, so I would be curious to hear what your results from this are. If you suspect that there are some major factors influencing the 5 plates systematically, then normalising them in a 2304 x 5 object should (hopefully) correct for that. For example, they may have been run on different days, by different people, or perhaps there was a short power cut during the processing of one of them. This might be visible if you have for example a boxplot of Ct from all 48*5 samples, and you see blocks of them shifted up or down. 
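The boxplot check mentioned above could look roughly like this. The Ct values are simulated, and the shift added to the third plate is made up so that there is something to see; nothing here uses HTqPCR itself.

```r
set.seed(42)
n_plates <- 5; n_samples <- 48; n_genes <- 48
plate <- rep(paste0("Plate", 1:n_plates), each = n_samples)

# Simulated Ct values: genes in rows, 5 x 48 = 240 samples in columns
ct <- matrix(rnorm(n_genes * n_plates * n_samples, mean = 25, sd = 2),
             nrow = n_genes)
# Pretend plate 3 was shifted by 1.5 cycles
ct[, plate == "Plate3"] <- ct[, plate == "Plate3"] + 1.5

# One box per sample, coloured by plate; a shifted block stands out
boxplot(ct, col = as.integer(factor(plate)) + 1, outline = FALSE,
        xaxt = "n", xlab = "Samples (grouped by plate)", ylab = "Ct")

# Per-plate medians of the sample medians give a quick numeric check
plate_medians <- tapply(apply(ct, 2, median), plate, median)
print(round(plate_medians, 2))
```

If no block of boxes sits visibly above or below the others, that would be an argument for skipping the between-plate step.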
Obviously, normalising across plates in this way doesn't take care of
normalisation between your samples within each plate. If you suspect your
samples to have some systematic variation that you need to account for,
then you can normalise each plate individually (as a 48 x 48 object).
Alternatively, you can try to combine within- and between-sample
normalisation by taking all 48 x 240 values at once.

In principle, you can split, reformat and then recombine the data in however
many ways you like. Personally, with any sort of data I prefer to go with as
little preprocessing as possible, since each additional step can potentially
introduce its own biases into the data. So unless there is some obvious
variation between your 5 plates, I'd probably stick with just normalisation
between the samples, e.g. using a 48 x 240 object.

Of course, you may have different preferences, or find out that a completely
different approach is required for this particular data set.

\Heidi

> thanks again for your prompt responses!
>
> best
>
> s
>
> On Jun 27, 2012, at 2:26 PM, Heidi Dvinge wrote:
>
>> Hi Simon,
>>
>>> Thanks for the help Heidi,
>>> but I'm still having troubles, your comments on the plotting helped me
>>> solve the outputs. But if I want to just display some groups (for example
>>> the LO group in the example below), how do I associate a group with
>>> multiple samples (ie biological reps)?
>>> >>> Currently I'm associating genes with samples by reading in the file as >>> below >>> plate6=read.delim("plate6Sample.txt", header=FALSE) >>> #this is a file to associate sample ID with the genes in the biomark >>> data, >>> as currently HTqPCR does not seem to associate the sample info in the >>> Biomark output to the gene IDs >>> >> Erm, no, it doesn't :-/ >> >>> samples=as.vector(t(plate6)) >>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>> n.data=48, samples=samples) >>> #now I have samples and genes similar to your example in the guide, but >>> I >>> want to associate samples to groups now. In the guide, you have an >>> example >>> where you have entire files as distinct samples, but in our runs, we >>> never >>> have that situation. I have a file which associates samples to groups, >>> which I read in... >>> >>> groupID=read.csv("plate6key.csv") >>> >>> but how do I associate the samples with their appropriate groups for >>> biological replicates with any of the functions in HtQPCR? >> >> I'm afraid I'm slightly confused here (sorry, long day). Exactly how is >> your data formatted? I.e. are the columns either individual samples, or >> from files containing multiple samples? So for example for a single >> 48.48 >> arrays, is your qPCRset object 2304 x 1 or 48 x 48? >> >> From your readCtData command I'm guessing you have 48 x 48, i.e. all 48 >> samples from your 1 array are in columns. In that case the 'groups' >> parameter in plotCtOverview will need to be a vector of length 48, >> indicating how you want the 48 columns in your qPCRset object to be >> grouped together. >> >> Below is an example of (admittedly ugly) plots. I don't know if that's >> similar to your situation at all. 
>> >> \Heidi >> >>> # Reading in data >>> exPath <- system.file("exData", package = "HTqPCR") >>> raw1 <- readCtData(files = "BioMark_sample.csv", path = exPath, format >>> = >> "BioMark", n.features = 48, n.data = 48) >>> # Check sample names >>> head(sampleNames(raw1)) >> [1] "Sample1" "Sample2" "Sample3" "Sample4" "Sample5" "Sample6" >>> # Associate samples with (randomly chosen) groups >>> anno <- data.frame(sampleID=sampleNames(raw1), Group=rep(c("A", "B", >> "C", "D"), times=c(4,24,5,15))) >>> head(anno) >> sampleID Group >> 1 Sample1 A >> 2 Sample2 A >> 3 Sample3 A >> 4 Sample4 A >> 5 Sample5 B >> 6 Sample6 B >>> # Plot the first gene - for each sample individually >>> plotCtOverview(raw1, genes=featureNames(raw1)[1], legend=FALSE, >> col=1:nrow(anno)) >>> # Plot the first gene - for each group >>> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, >> legend=FALSE, col=1:length(unique(anno$Group))) >>> # Plot the first gene, using group "A" as a control >>> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, >> legend=FALSE, col=1:length(unique(anno$Group)), calibrator="A") >> >> >> >>> You recommend below using a vector, but I dont see how that helps me >>> associate the samples in the Expression set. >>> >>> thanks again! >>> >>> s >>> >>> On Jun 26, 2012, at 12:48 PM, Heidi Dvinge wrote: >>> >>>>> Hi, >>>>> I'm having some troubles selectively sub-setting, and graphing up >>>>> QPCR >>>>> data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). >>>>> I'd >>>>> like to be able to visualize specific genes, with specific groups we >>>>> run >>>>> routinely on our Biomark system. Typical runs are across multiple >>>>> plates, >>>>> and have multiple biological replicates, and usually 2 or more >>>>> technical >>>>> replicates (although we are moving away from technical reps, as the >>>>> CVs >>>>> are so tight). >>>>> >>>>> Can anyone help with this? Heidi? 
>>>>> >>>>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>>>> n.data=48, samples=samples) >>>>> #Ive read the samples in from a separate file, as when you read it >>>>> in, >>>>> it >>>>> doesnt take the sample names supplied in the biomark output# >>>>> #Next, I want to plot genes of interest, with samples of interest, >>>>> and >>>>> I'm >>>>> having trouble getting an appropriate output# >>>>> >>>>> g=featureNames(raw6)[1:2] >>>>> plotCtOverview(raw6, genes=g, groups=groupID$Treatment, >>>>> col=rainbow(5)) >>>>> >>>>> #This plots 1 gene across all 48 samples# >>>>> #but the legend doesnt behave, its placed on top of the plot, and I >>>>> cant >>>>> get it to display in a non-overlapping fashion# >>>>> #I've tried all sorts of things in par, but nothing seems to shift >>>>> the >>>>> legend's position# >>>>> >>>> As the old saying goes, whenever you want a job done well, you'll have >>>> to >>>> do it yourself ;). In this case, the easiest thing is probably to use >>>> legend=FALSE in plotCtOverview, and then afterwards add it yourself in >>>> the >>>> desired location using legend(). That way, if you have a lot of >>>> different >>>> features or groups to display, you can also use the ncol parameter in >>>> legend to make several columns within the legend, such as 3x4 instead >>>> of >>>> the default 12x1. >>>> >>>> Alternatively, you can use either xlim or ylim in plotCtOverview to >>>> add >>>> some empty space on the side where there's then room for the legend. 
>>>> >>>>> #I now want to plot a subset of the samples for specific genes# >>>>>> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== >>>>>> "LFY") >>>>>> LOY >>>>> Sample Treatment >>>>> 2 L20 LFY >>>>> 5 L30 LFY >>>>> 7 L45 LO >>>>> 20 L40 LO >>>>> 27 L43 LO >>>>> 33 L29 LFY >>>>> 36 L38 LO >>>>> 40 L39 LO >>>>> 43 L23 LFY >>>>> >>>>> >>>>>> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) >>>>> Warning messages: >>>>> 1: In split.default(t(x), sample.split) : >>>>> data length is not a multiple of split variable >>>>> 2: In qt(p, df, lower.tail, log.p) : NaNs produced >>>>>> >>>> >>>> Does it make sense if you split by groups=LOY$Treatment? It looks like >>>> the >>>> object LOY itself is a data frame, rather than the expected vector. >>>> >>>> Also, you may have to 'repeat' the col=rainbow() argument to fit your >>>> number of features. >>>> >>>>> >>>>> #it displays the two groups defined by treatment, but doesnt do so >>>>> nicely, >>>>> very skinny bars, and the legend doesnt reflect what its displaying# >>>>> #again, I've tried monkeying around with par, but not sure what >>>>> HTqPCR >>>>> is >>>>> calling to make the plots# >>>>> >>>> If the bars are very skinny, it's probably because you're displaying a >>>> lot >>>> of features. Nothing much to do about that, except increasing the >>>> width >>>> or >>>> your plot :(. >>>> >>>> \Heidi >>>> >>>>> please help! >>>>> >>>>> thanks >>>>> >>>>> Simon. 
>>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>> >>> >> >> > > From anders at embl.de Thu Jun 28 09:17:00 2012 From: anders at embl.de (Simon Anders) Date: Thu, 28 Jun 2012 09:17:00 +0200 Subject: [BioC] pvalue and padj in DESeq In-Reply-To: <20120626135851.0B56B139009@mamba.fhcrc.org> References: <20120626135851.0B56B139009@mamba.fhcrc.org> Message-ID: <4FEC04EC.7040807@embl.de> Hi On 2012-06-26 15:58, narges [guest] wrote: > i have a proper count table of RNA-Seq and i have applied edgeR package for obtaining differentially expressed genes and I have obtained nice acceptable result. > But now I am applying also DESeq over the same data but the pval and padj columns of the nbinomTest over them is strange, it is almost 1.00 or NA. > Why is this so? You know, we are not clairvoyants. Without any more information about your experiment and data, and without seeing the R code that you used to perform the analyses, any answer would just be guessing. Most likely you made a mistake in using the software. Or, maybe the two tools do come to different conclusions about the signal-to-noise ratio of your data. EdgeR and DESeq use similar but not identical methods. I understand that you prefer the result that says that your experiment worked, but this does not seem a good argument to decide which method is correct. Simon From ovokeraye at gmail.com Thu Jun 28 11:36:24 2012 From: ovokeraye at gmail.com (Ovokeraye Achinike-Oduaran) Date: Thu, 28 Jun 2012 11:36:24 +0200 Subject: [BioC] BiomaRt Details In-Reply-To: References: Message-ID: Thanks a bunch, Steffen. -Avoks On Wed, Jun 27, 2012 at 6:04 PM, Steffen Durinck wrote: > Hi Avoks, > > By default biomaRt queries www.biomart.org so these are always in sync. 
> You can get the version of the BioMarts by using the listMarts function:
>
>> listMarts()
>                biomart                            version
> 1              ensembl       ENSEMBL GENES 67 (SANGER UK)
> 2                  snp   ENSEMBL VARIATION 67 (SANGER UK)
> 3  functional_genomics  ENSEMBL REGULATION 67 (SANGER UK)
> 4                 vega                VEGA 47 (SANGER UK)
> ....
>
>
> Cheers,
> Steffen
>
>
>
> On Wed, Jun 27, 2012 at 8:05 AM, Ovokeraye Achinike-Oduaran
> wrote:
>>
>> Hi all,
>>
>> I've been working with biomaRt 2.12.0 and would like to know what
>> versions of Ensembl, dbSNP, Variation, etc it's using. Is it the exact
>> same as on the web interface (www.biomart.org)?
>>
>> Thanks and regards,
>>
>> Avoks
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>

From uschmitt at mineway.de Thu Jun 28 11:47:14 2012
From: uschmitt at mineway.de (Uwe Schmitt)
Date: Thu, 28 Jun 2012 11:47:14 +0200
Subject: [BioC] XCMS / mzR / Rcpp dependency issue
Message-ID: <4FEC2822.9050908@mineway.de>

Hi,

I have some problems to install xcms for R 2.5.1 on Windows 7, 64 bit.
I follow the installation instructions and enter:

> source("http://bioconductor.org/biocLite.R")
> biocLite("xcms", dep=T)

this loads some packages and gives me a warning:
installed directory not writable, cannot update packages 'boot', 'class',
'KernSmooth', 'MASS', 'nnet', 'rpart', 'spatial'

If I want to check if xcms is installed I get:

> require("xcms")
Lade nötiges Paket: xcms
Lade nötiges Paket: mzR
Lade nötiges Paket: Rcpp
Error : .onLoad failed in loadNamespace() for 'mzR', details:
call: value[[3L]](cond)
error: failed to load module Ramp from package mzR
konnte Funktion "errorOccured" nicht finden
Failed with error: ‘Paket ‘mzR’ konnte nicht geladen werden’

I try to translate the messages to english:

load required package: xcms
load required package: mzR
load required package: Rcpp
Error : .onLoad failed in loadNamespace() for 'mzR', details:
call: value[[3L]](cond)
error: failed to load module Ramp from package mzR
could not find function "errorOccured"
Failed with error: 'package 'mzR' could not be loaded'

Any hints what is going wrong ? I installed xcms on an older machine six
months ago and I had no problems at all.

Kind Regards,

Uwe

--
Dr. rer. nat. Uwe Schmitt
Leitung F/E Mathematik

mineway GmbH
Gebäude 4
Im Helmerswald 2
66121 Saarbrücken

Telefon: +49 (0)681 8390 5334
Telefax: +49 (0)681 830 4376

uschmitt at mineway.de
www.mineway.de

Geschäftsführung: Dr.-Ing. Mathias Bauer
Amtsgericht Saarbrücken HRB 12339

From laurent.gatto at gmail.com Thu Jun 28 12:16:31 2012
From: laurent.gatto at gmail.com (Laurent Gatto)
Date: Thu, 28 Jun 2012 11:16:31 +0100
Subject: [BioC] XCMS / mzR / Rcpp dependency issue
In-Reply-To: <4FEC2822.9050908@mineway.de>
References: <4FEC2822.9050908@mineway.de>
Message-ID:

Dear Uwe,

Something is happening with mzR and Rcpp 0.9.12 on Windows - see [1] and [2].
As a temporary fix, you can downgrade to Rcpp 0.9.10 [3] and proceed normally.
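A rough sketch of how the downgrade could be checked and carried out. The helper `is_affected` is made up for illustration, and the commented-out part assumes the Windows binary has first been downloaded to a local zip file; the path is a placeholder.

```r
# Hypothetical helper: TRUE if the given version is the problematic one
is_affected <- function(installed, bad = "0.9.12") {
  package_version(installed) == package_version(bad)
}

print(is_affected("0.9.12"))
print(is_affected("0.9.10"))

# On the affected machine one could then check and reinstall, roughly:
# if (is_affected(as.character(packageVersion("Rcpp")))) {
#   remove.packages("Rcpp")
#   install.packages("<local path to the downloaded Rcpp zip>", repos = NULL)
# }
```

Installing with repos = NULL takes the package from the given file instead of a repository, so the older binary is not replaced by the current release during the install.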

Best wishes,

Laurent

[1] http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-June/thread.html#3940
[2] https://stat.ethz.ch/pipermail/bioconductor/2012-June/thread.html#46503
[3] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip

On 28 June 2012 10:47, Uwe Schmitt wrote:
> Hi,
>
> I have some problems to install xcms for R 2.5.1 on Windows 7, 64 bit.
> I follow the installation instructions and enter:
>
>> source("http://bioconductor.org/biocLite.R")
>> biocLite("xcms", dep=T)
>
> this loads some packages and gives me a warning:
> installed directory not writable, cannot update packages 'boot', 'class',
> 'KernSmooth', 'MASS', 'nnet', 'rpart', 'spatial'
>
> If I want to check if xcms is installed I get:
>
>> require("xcms")
>
> Lade nötiges Paket: xcms
> Lade nötiges Paket: mzR
> Lade nötiges Paket: Rcpp
> Error : .onLoad failed in loadNamespace() for 'mzR', details:
> call: value[[3L]](cond)
> error: failed to load module Ramp from package mzR
> konnte Funktion "errorOccured" nicht finden
> Failed with error: ‘Paket ‘mzR’ konnte nicht geladen werden’
>
> I try to translate the messages to english:
>
> load required package: xcms
> load required package: mzR
> load required package: Rcpp
> Error : .onLoad failed in loadNamespace() for 'mzR', details:
> call: value[[3L]](cond)
> error: failed to load module Ramp from package mzR
> could not find function "errorOccured"
> Failed with error: 'package 'mzR' could not be loaded'
>
> Any hints what is going wrong ? I installed xcms on an older machine six
> months ago and I had no problems at all.
>
> Kind Regards,
>
> Uwe
>
>
> --
> Dr. rer. nat. Uwe Schmitt
> Leitung F/E Mathematik
>
> mineway GmbH
> Gebäude 4
> Im Helmerswald 2
> 66121 Saarbrücken
>
> Telefon: +49 (0)681 8390 5334
> Telefax: +49 (0)681 830 4376
>
> uschmitt at mineway.de
> www.mineway.de
>
> Geschäftsführung: Dr.-Ing.
Mathias Bauer
> Amtsgericht Saarbrücken HRB 12339
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

From uschmitt at mineway.de Thu Jun 28 14:10:04 2012
From: uschmitt at mineway.de (Uwe Schmitt)
Date: Thu, 28 Jun 2012 14:10:04 +0200
Subject: [BioC] XCMS / mzR / Rcpp dependency issue
In-Reply-To:
References: <4FEC2822.9050908@mineway.de>
Message-ID: <4FEC499C.20409@mineway.de>

Thanks for your help, downgrading Rcpp fixed the problem.

Kind Regards,

Uwe.

On 28.06.2012 12:16, Laurent Gatto wrote:
> Dear Uwe,
>
> Something is happening with mzR and Rcpp 0.9.12 on Windows - see [1] and [2].
> As a temporary fix, you can downgrade to Rcpp 0.9.10 [3] and proceed normally.
>
> Best wishes,
>
> Laurent
>
> [1] http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-June/thread.html#3940
> [2] https://stat.ethz.ch/pipermail/bioconductor/2012-June/thread.html#46503
> [3] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip
>
>
> On 28 June 2012 10:47, Uwe Schmitt wrote:
>> Hi,
>>
>> I have some problems to install xcms for R 2.5.1 on Windows 7, 64 bit.
>> I follow the installation instructions and enter:
>>
>>> source("http://bioconductor.org/biocLite.R")
>>> biocLite("xcms", dep=T)
>> this loads some packages and gives me a warning:
>> installed directory not writable, cannot update packages 'boot', 'class',
>> 'KernSmooth', 'MASS', 'nnet', 'rpart', 'spatial'
>>
>> If I want to check if xcms is installed I get:
>>
>>> require("xcms")
>> Lade nötiges Paket: xcms
>> Lade nötiges Paket: mzR
>> Lade nötiges Paket: Rcpp
>> Error : .onLoad failed in loadNamespace() for 'mzR', details:
>> call: value[[3L]](cond)
>> error: failed to load module Ramp from package mzR
>> konnte Funktion "errorOccured" nicht finden
>> Failed with error: ‘Paket ‘mzR’
konnte nicht geladen werden’
>>
>> I try to translate the messages to english:
>>
>> load required package: xcms
>> load required package: mzR
>> load required package: Rcpp
>> Error : .onLoad failed in loadNamespace() for 'mzR', details:
>> call: value[[3L]](cond)
>> error: failed to load module Ramp from package mzR
>> could not find function "errorOccured"
>> Failed with error: 'package 'mzR' could not be loaded'
>>
>> Any hints what is going wrong ? I installed xcms on an older machine six
>> months ago and I had no problems at all.
>>
>> Kind Regards,
>>
>> Uwe
>>
>>
>> --
>> Dr. rer. nat. Uwe Schmitt
>> Leitung F/E Mathematik
>>
>> mineway GmbH
>> Gebäude 4
>> Im Helmerswald 2
>> 66121 Saarbrücken
>>
>> Telefon: +49 (0)681 8390 5334
>> Telefax: +49 (0)681 830 4376
>>
>> uschmitt at mineway.de
>> www.mineway.de
>>
>> Geschäftsführung: Dr.-Ing. Mathias Bauer
>> Amtsgericht Saarbrücken HRB 12339
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Dr. rer. nat. Uwe Schmitt
Leitung F/E Mathematik

mineway GmbH
Gebäude 4
Im Helmerswald 2
66121 Saarbrücken

Telefon: +49 (0)681 8390 5334
Telefax: +49 (0)681 830 4376

uschmitt at mineway.de
www.mineway.de

Geschäftsführung: Dr.-Ing. Mathias Bauer
Amtsgericht Saarbrücken HRB 12339

From grimbough at gmail.com Thu Jun 28 15:54:10 2012
From: grimbough at gmail.com (Mike Smith)
Date: Thu, 28 Jun 2012 14:54:10 +0100
Subject: [BioC] Illumine bead array PROBE_TYPE values
In-Reply-To: <1340818561.99867.YahooMailClassic@web125405.mail.ne1.yahoo.com>
References: <1340818561.99867.YahooMailClassic@web125405.mail.ne1.yahoo.com>
Message-ID:

Hi Heyi,

The PROBE_TYPE value doesn't refer to the strand. It represents the isoform specificity of the probe i.e.
does it target all known isoforms, only one isoform or is there only one known isoform. The assignments are as follows:

A - probe targets all isoforms
I - probe targets only one of multiple isoforms
S - probe targets the only isoform

On 27 June 2012 18:36, heyi xiao wrote:
>
> Hello list,
> I am working on Illumina bead array expression data. The data was output from GenomeStudio with probe annotation. There is a column called PROBE_TYPE, with three values: A, I and S. What do these values mean, antisense, intron and sense? Antisense and sense are string relative to the target gene, not to the reference genome, right? Why intron and antisense probes are needed? I couldn't find the description info on Internet. Thanks a lot!
> Heyi
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Mike Smith
PhD Student
Computational Biology Group
Cambridge University

From gbayon at gmail.com Thu Jun 28 16:08:37 2012
From: gbayon at gmail.com (Gustavo Fernández Bayón)
Date: Thu, 28 Jun 2012 16:08:37 +0200
Subject: [BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data
In-Reply-To:
References: <2EBD24B1B09B4A27BE11DA663EC436A6@gmail.com> <4FEB377F.2050309@gmail.com>
Message-ID:

Dear Sean and James,

first of all, I would like to apologize for my late reply. There were a lot of storms yesterday here at Oviedo, and a lot of derived technical problems at my workplace which kept me from accessing the net.

Thank you for your kind replies.
Having read James' answer, I tried to replicate the problematic commands again:

> gpl <- getGEO('GPL13534', destdir='/Users/gbayon/Documents/GEO/')
Using locally cached version of GPL13534 found here:
/Users/gbayon/Documents/GEO//GPL13534.soft
> Meta(gpl)$data_row_count == nrow(Table(gpl))
[1] FALSE
> Meta(gpl)$data_row_count
[1] "485577"
> nrow(Table(gpl))
[1] 143889
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] GEOquery_2.23.5 Biobase_2.16.0 BiocGenerics_0.2.0

loaded via a namespace (and not attached):
[1] RCurl_1.91-1 XML_3.9-4

As you can see, the problem was still there. The GEOquery, Biobase and BiocGenerics versions are the same as in James' case. The R version is 2.15.0 instead of 2.15.1. I see James executed getGEO() directly, while I used a cached version of the SOFT file. So, I decided to run the previous commands again, but forcing the download:

> gpl <- getGEO('GPL13534')
File stored at:
/var/folders/rs/6p03vvrs5xjcts16s73025lm0000gn/T//RtmpTSHxXq/GPL13534.soft
> Meta(gpl)$data_row_count == nrow(Table(gpl))
[1] TRUE
> nrow(Table(gpl))
[1] 485577

oO. Now it worked. In order to find out whether there was any problem with the cached version on disk, I then tried to replicate the experiment again, but this time using the SOFT file in the temporary folder.

> gpl2 <- getGEO('GPL13534',destdir='/var/folders/rs/6p03vvrs5xjcts16s73025lm0000gn/T//RtmpTSHxXq/')
Using locally cached version of GPL13534 found here:
/var/folders/rs/6p03vvrs5xjcts16s73025lm0000gn/T//RtmpTSHxXq//GPL13534.soft
> Meta(gpl2)$data_row_count == nrow(Table(gpl2))
[1] TRUE

Ok. Now it worked smoothly. It seems that my cached version could be corrupted.
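
A minimal sketch of how that suspicion could be checked from within R, assuming that a complete GPL SOFT file ends with the line `!platform_table_end` (which is what the fresh download in this message ends with). `soft_is_complete()` is a hypothetical helper written for illustration, not a GEOquery function, and it reads the whole file into memory, so it will be slow on a file of a couple of hundred megabytes:

```r
# Hypothetical helper (not part of GEOquery): returns TRUE when the last
# non-empty line of a GPL SOFT file is the platform table terminator.
# Assumption: a fully downloaded GPL SOFT file ends with "!platform_table_end".
soft_is_complete <- function(path) {
  lines <- readLines(path, warn = FALSE)   # reads the entire file
  lines <- lines[nzchar(lines)]            # drop empty trailing lines
  length(lines) > 0 && lines[length(lines)] == "!platform_table_end"
}

# Small self-contained demonstration with two made-up files
complete  <- tempfile(fileext = ".soft")
truncated <- tempfile(fileext = ".soft")
writeLines(c("!platform_table_begin", "ID\tName",
             "cg00000029\tx", "!platform_table_end"), complete)
writeLines(c("!platform_table_begin", "ID\tName",
             "cg00000029\tx"), truncated)

print(soft_is_complete(complete))
print(soft_is_complete(truncated))
```

Running such a check on the cached file before passing `destdir` to getGEO() would probably have flagged the truncated copy; on the command line, `tail -n 1` does the same job much faster.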
Let me see:

gbayon$ ls -lh
total 538984
-rw-r--r-- 1 gbayon staff 203M 28 jun 15:53 GPL13534.soft
-rw-r--r-- 1 gbayon staff 60M 28 jun 15:54 GPL13534_prev.soft

OMG, you are going to kill me... that difference in size... :-O

gbayon$ tail -n 1 GPL13534.soft
!platform_table_end
gbayon$ tail -n 1 GPL13534_prev.soft
cg05516328 cg05516328 21692480 ATTCACCAAATAACCTAACAAAATAATCACAAAACACAAAACTCAAAAAC II GCTCTGCTGTCCCGAGCCACTCATGGTGAGCTGCCTCCCTACGAATTCCAGCCGCTGTCG[CG]CCCTTGAGTTCTGTGTCCTGTGACCACCTTGCCAGGCCACTTGGTGAACACGC

Now it makes sense. It seems the cached file I had was not complete, maybe due to a broken or timed-out connection. getGEO() read exactly the number of probes the file had, no more, no less.

I am sorry for wasting your time chasing a GEOquery bug that turned out not to exist. Maybe, just a suggestion, it would be nice if getGEO() could write a warning message or something if it did not find the '!platform_table_end' string. Just in case another newbie arrives at the same point.

Thank you again for your hints. And for the new, corrected, GPL data I have now.

Regards,
Gus

---------------------------
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Wednesday, 27 June 2012 at 21:18, Sean Davis wrote:

> Thanks, James, for doing my work for me. ; )
>
> Sean
>
>
> On Wed, Jun 27, 2012 at 12:40 PM, James F. Reid wrote:
> > Dear Sean and Gustavo,
> >
> >
> > I cannot reproduce this error. See below.
> >
> > On 27/06/12 16:54, Sean Davis wrote:
> > > On Wed, Jun 27, 2012 at 11:38 AM, Gustavo Fernández Bayón
> > > wrote:
> > >
> > > > Hi again.
> > > >
> > > > I would like to add a little bit more of information on this issue. I have
> > > > been debugging inside the parseGSEMatrix() function in GEOquery source
> > > > code. The suspicious NA's appeared when execution arrived to the following
> > > > line:
> > > >
> > > > ## Apparently, NCBI GEO uses case-insensitive matching
> > > > ## between platform IDs and series ID Refs ???
> > > > dat <- dat[match(tolower(rownames(datamat)),tolower(rownames(dat))),] > > > > > > > > > > > > > > > > The problem here is that 'datamat' has the correct number of rows, which > > > > is around 480K, BUT 'dat' doesn't. At a glance, 'datamat' comes from the > > > > series matrix file while 'dat' comes from the GPL. > > > > > > > > If you go to the GEO page of that GPL ( > > > > http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?token=djaxxiayqmwyspu&acc=GPL13534), > > > > you'll find it says that the GPL decryption table has exactly 485577 rows, > > > > which is kind of logical, a description for each probeset. However, inside > > > > the code, 'dat' has only 143889 rows. > > > > > > > > Replicating directly from R console: > > > > > > > > > gpl <- getGEO('GPL13534',destdir='../../GEO/') > > > > > Meta(gpl)$data_row_count > > > > > > > > [1] "485577" > > > > > > > > > t <- Table(gpl) > > > > > dim(t) > > > > > > > > [1] 143889 37 > > > > > > > > > > > > > > > > I was really surprised to find this, and I do not have enough knowledge as > > > > to know if it responds to an unknown constraint I happen to ignore. Is that > > > > ok? Or is there any bug in the GPL processing code? Now I'm going home, but > > > > I'll try to continue debugging to see what is really happening inside. > > > > > > This is most likely a bug in GPL parsing. There are A LOT of edge cases > > > that I have tried to deal with, some not very appropriately. Often, the > > > error is due to an extraneous quote in an unexpected location. I'll look > > > into this one. Could you do me a favor and send along sessionInfo() just > > > so I know? > > > > > > Thanks, > > > Sean > > > > > > > > > > > > > Any help will be very much appreciated. 
> > > > > > > > Regards, > > > > Gus > > > > > > > > > > > > --------------------------- > > > > Enviado con Sparrow (http://www.sparrowmailapp.com/?sig) > > > > > > > > > > > > El mi??rcoles 27 de junio de 2012 a las 10:51, Gustavo Fern??ndez Bay??n > > > > escribi??: > > > > > > > > > Hi everybody. > > > > > > > > > > I am experiencing quite a few problems while trying to download and > > > > parse a dataset of methylation values. These are not technical problems, > > > > IMHO. GEOquery works perfectly, and it really makes getting this kind of > > > > data an easy task. However, I think I do not understand exactly the > > > > lifecycle of GEO series data, and I would like to ask in this list for any > > > > hint on this behavior, so I could try to fix it. > > > > > > > > > > What I first did was to download and parse the desired GSE data file, > > > > with the default value of GSMMatrix parameter (TRUE). Besides, I extracted > > > > the ExpressionSet and the assayData I was looking for. > > > > > > > > > > my.gse <- getGEO('GSE30870', destdir='/Users/gbayon/Documents/GEO/') > > > > > my.expr.set <- my.gse[[1]] > > > > > beta.values <- exprs(my.expr.set) > > > > > > > > > > What really gave me a surprise at first, was to see many strange values > > > > (all containing the 'NA' string) in the featureNames of the expression set. > > > > > > > > > > > head(featureNames(es), n=20) > > > > > [1] "NA" "cg00000108" "cg00000109" "cg00000165" "NA.1" "NA.2" "NA.3" > > > > > [8] "NA.4" "cg00000363" "NA.5" "NA.6" "NA.7" "NA.8" "cg00000734" > > > > > [15] "NA.9" "cg00000807" "cg00000884" "NA.10" "NA.11" "NA.12" > > > > > > > > > > > > > > > > > > > > If I select an individual GSM in the series, and download it, the > > > > featureNames are ok. If I try to download the GSE with GSEMatrix=FALSE, I > > > > get a list of GSM data sets, and the results is again good. This made me > > > > suspect of the intermediate, pre-parsed, matrix form. 
I haven't found a > > > > clue about the lifecycle of this kind of data. I mean, how the matrix is > > > > built. Is it a manual process? Is it automatic? > > > > > > > > > > If it is a manual process, then I guess I will have to contact the > > > > responsible of uploading the data to see if they can fix it. But, if it is > > > > not, I would like to know if this is something relating to BioC or, more > > > > plausibly, to GEO. > > > > > > > > > > Any help would be appreciated. > > > > > > > > > > Regards, > > > > > Gustavo > > > > > > > > > > > > > > > --------------------------- > > > > > Enviado con Sparrow (http://www.sparrowmailapp.com/?sig) > > > > > > > > > > > > _______________________________________________ > > > > Bioconductor mailing list > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > > Search the archives: > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org) > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > library(GEOquery) > > > > my.gse <- getGEO('GSE30870', destdir=".") > > > > featureNames(my.gse[[1]])[1:10] > > # [1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236" > > # [6] "cg00000289" "cg00000292" "cg00000321" "cg00000363" "cg00000622" > > all(featureNames(my.gse[[1]]) == rownames(exprs(my.gse[[1]]))) > > #[1] TRUE > > > > gpl <- getGEO('GPL13534',destdir=".") > > Meta(gpl)$data_row_count == nrow(Table(gpl)) > > # [1] TRUE > > > > > > > sessionInfo() > > R version 2.15.1 (2012-06-22) > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_GB.UTF-8 
LC_NUMERIC=C > > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets tools methods > > [8] base > > > > other attached packages: > > [1] GEOquery_2.23.5 Biobase_2.16.0 BiocGenerics_0.2.0 > > > > loaded via a namespace (and not attached): > > [1] RCurl_1.91-1 XML_3.9-4 > > > > > > HTH, > > J. > From sdavis2 at mail.nih.gov Thu Jun 28 16:15:01 2012 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 28 Jun 2012 10:15:01 -0400 Subject: [BioC] GEOquery, GSEMatrix parameter and lifecycle of GEO series data In-Reply-To: References: <2EBD24B1B09B4A27BE11DA663EC436A6@gmail.com> <4FEB377F.2050309@gmail.com> Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From seungyeul.yoo at mssm.edu Thu Jun 28 18:26:35 2012 From: seungyeul.yoo at mssm.edu (Yoo, Seungyeul) Date: Thu, 28 Jun 2012 16:26:35 +0000 Subject: [BioC] Merging two tilingfeaturesets into one set. Message-ID: Hello, I'm working on dna methylation microarray. I need to merge two sets into one set. > rawData $v1 TilingFeatureSet (storageMode: lockedEnvironment) assayData: 2197815 features, 59 samples element names: channel1, channel2 protocolData rowNames: LT290677RU_D1_2011-02-16 LT286300LU_D1_2010-07-24 ... LT003990RU_D1_2010-11-04 (59 total) varLabels: filenamesChannel1 filenamesChannel2 dates1 dates2 varMetadata: labelDescription channel phenoData rowNames: LT290677RU_D1_2011-02-16 LT286300LU_D1_2010-07-24 ... LT003990RU_D1_2010-11-04 (59 total) varLabels: sampleID tissue ... 
Annotation (5 total) varMetadata: labelDescription channel featureData: none experimentData: use 'experimentData(object)' Annotation: pd.feinberg.hg18.me.hx1 $v1.1 TilingFeatureSet (storageMode: lockedEnvironment) assayData: 2197815 features, 17 samples element names: channel1, channel2 protocolData rowNames: LT282562RM_D1_2010-11-22 LT280646RU_D1_2010-11-22 ... LT093297LU_D1_2010-11-12 (17 total) varLabels: filenamesChannel1 filenamesChannel2 dates1 dates2 varMetadata: labelDescription channel phenoData rowNames: LT282562RM_D1_2010-11-22 LT280646RU_D1_2010-11-22 ... LT093297LU_D1_2010-11-12 (17 total) varLabels: sampleID tissue ... Annotation (5 total) varMetadata: labelDescription channel featureData: none experimentData: use 'experimentData(object)' Annotation: pd.feinberg.hg18.me.hx1 I want to my merged dataset (dataset$v1+dataset$v1.1) has the same number of features (2197815) but added number of samples(59+17=76). How can I merge them to create new dataset? Thanks, Best regards, Seungyeul Yoo Postdoctoral fellow Institute of Genomics and Multiscale Biology Department of Genetics and Genomic Sciences Mount Sinai School of Medicine From hpages at fhcrc.org Thu Jun 28 19:58:39 2012 From: hpages at fhcrc.org (=?ISO-8859-1?Q?Herv=E9_Pag=E8s?=) Date: Thu, 28 Jun 2012 10:58:39 -0700 Subject: [BioC] matrix like object with Rle columns In-Reply-To: References: <4FEB6EF8.8030108@fhcrc.org> Message-ID: <4FEC9B4F.4060708@fhcrc.org> Hi Michael, On 06/27/2012 01:58 PM, Michael Lawrence wrote: > > > On Wed, Jun 27, 2012 at 1:37 PM, Herv? Pag?s > wrote: > > Hi Kasper, > > On 06/25/2012 08:56 PM, Kasper Daniel Hansen wrote: > [...] 
> > [ side question which could be relevant in this discussion: for a > numeric Rle is there some notion of precision - say I have truly > numeric values with tons of digits, and I want to consider two > numbers > part of the same run if |x1 -x2| > > The comparison of 2 doubles is done at the C level with ==, which > AFAIK is the same as doing == in R (as long as we deal with non-NA > and non-NaN values). See the _fill_Rle_slots_with_double___vals() helper > function in IRanges/src/Rle_class.c for the details. > > Therefore: > > > all.equal(sqrt(3)^2, 3) > [1] TRUE > > sqrt(3)^2 == 3 > [1] FALSE > > Rle(c(sqrt(3)^2, 3)) > 'numeric' Rle of length 2 with 2 runs > Lengths: 1 1 > Values : 3 3 > > Note that base::rle() does the same: > > > rle(c(sqrt(3)^2, 3)) > Run Length Encoding > lengths: int [1:2] 1 1 > values : num [1:2] 3 3 > > I can see that using a "|x1 -x2| give better compression (less runs) but then the compression would not > be lossless as it is right now: > > > x <- c(sqrt(3)^2, 3) > > identical(as.vector(Rle(x)), x) > [1] TRUE > > identical(inverse.rle(rle(x)), x) > [1] TRUE > > Also the "|x1 -x2| complications due to the fact that the criteria is not transitive > anymore i.e. you can have |x1 -x2| without having |x1 -x3| becomes some kind of clustering problem with several possible > strategies, some of them very simple but not necessarily with > the "good properties". > > > One simple "clustering" would be to round to some fixed level of > precision. One could multiple by some power of 10 and coerce to integer > to avoid any floating point issues. Like for example Rle(round(x, digits=4)). If people feel that this would be useful, we could add the 'digits' arg to the Rle() constructor so the rounding is taken care of by the constructor itself. With default to NA for no rounding at all (like now), so the good properties are preserved e.g. 
lossless compression and the fact that unique, duplicated, is.unsorted, sort, order, rank etc (anything involving comparison between doubles) will behave exactly the same way on x and Rle(x) (there is code around that relies on such behavior). Also maybe we could consider doing signif() instead of round(). Cheers, H. > > H. > > > > Kasper > > > Michael > > On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen > > wrote: > > > Do we have a matrix-like object, but where the columns > are Rle's? > > Kasper > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/__listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/__listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > > > > > -- > Herv? Pag?s > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > -- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 From ales.maver at gmail.com Thu Jun 28 20:00:28 2012 From: ales.maver at gmail.com (=?UTF-8?Q?Ale=C5=A1_Maver?=) Date: Thu, 28 Jun 2012 20:00:28 +0200 Subject: [BioC] Operations on GenomicRanges metadata information In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From aditi_rambani at yahoo.com Thu Jun 28 21:01:50 2012 From: aditi_rambani at yahoo.com (Aditi Rambani) Date: Thu, 28 Jun 2012 12:01:50 -0700 Subject: [BioC] Contrast Problem References: Message-ID: <1340910110.46672.YahooMailNeo@web113104.mail.gq1.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From smelov at buckinstitute.org Thu Jun 28 23:11:19 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Thu, 28 Jun 2012 14:11:19 -0700 Subject: [BioC] HTqPCR problems In-Reply-To: <6d2982152de2217869a0f9db60da3904.squirrel@webmail.ebi.ac.uk> References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> <7BE89EC0-5910-4758-881F-6EFCB8E79A5F@buckinstitute.org> <009c6493df7d27088ee45793182ed7a8.squirrel@webmail.ebi.ac.uk> <6d2982152de2217869a0f9db60da3904.squirrel@webmail.ebi.ac.uk> Message-ID: <4EBF34FE-0178-4640-8F6B-D6C4212B844B@buckinstitute.org> Hi Heidi, getting there, hopefully if you can clarify the following issue, all will be well and good. After yesterdays correspondence, I'm now producing nice plots, when I check the actual values being plotted, they dont match up to the sample ID's. In fact, if I dont bother assigning groups, the sample ID's dont match to their respective gene CT values. I'm worried there is some underlying problem with the data structure I'm not understanding. I understand the code, its just the samples dont match the reported gene values in the csv file. for example > head(groupID) Sample Treatment 1 S28 SMY 2 L20 LFY 3 M26 MMY 4 L1 LFR 5 L30 LFY 6 K13 KMO >plotCtOverview(raw6, genes=featureNames(raw6)[1], group=groupID$Treatment,legend=FALSE, col=1:length(unique(groupID$Treatment))) produces a nice plot of a tubulin gene across the groups, as you suggested yesterday . Yet if I look at the values, they dont match the CSV values for specific genes/samples I used. 
If I turn off groups, and look at samples without merging by group, I can see that the values dont match the appropriate gene being displayed. My question is, where is the sample order being drawn from in the CSV file? Is there a simple check I can use to see that what is being plotted, is what I think is being plotted? The group ID sample-Treatment is correct, and all the samples in the original CSV file are correct. Is it possible that the package is assigning gene/sample ID in some other order than that I've supplied? I just want to be sure that when HTqPCR pulls the sample ID and maps it to the appropriate gene/Group, some transformation is not happening. Fluidigm suggests a particular order in loading samples and genes. These are numbered 1-48 (sample), and 1-48 (gene) for a 48.48 plate (and the same for a 96.96 plate). This is the order I supplied the sample IDs in the groupID file above. How do you map the raw csv output to gene/sample id? Is there a way of checking that the sample/gene/group ID is correct? as always, thanks in advance for your help best s On Jun 27, 2012, at 3:27 PM, Heidi Dvinge wrote: >> Hi Heidi, >> you are correct, yes 48.48. >> The example you provide below is exactly what I needed for clarification >> for groups. I was trying to reverse engineer what you had done with the >> original expression set package for microarrays, but from below, I can get >> this to work now. >> > Glad it works. Hopefully by the next BioConductor release I'll remember to > clarify the plotCtOverview help file. > >> Just to be clear, I have 5 48.48 plates. Should I normalize each >> individually prior to combining, or should I reformat to a 2304x1 each, >> combine, then normalize (not sure if you can do that or not) >> > Hm, that's one of the questions I've also been asking myself, so I would > be curious to hear what your results from this are. 
> > If you suspect that there are some major factors influencing the 5 plates > systematically, then normalising them in a 2304 x 5 object should > (hopefully) correct for that. For example, they may have been run on > different days, by different people, or perhaps there was a short power > cut during the processing of one of them. This might be visible if you > have for example a boxplot of Ct from all 48*5 samples, and you see blocks > of them shifted up or down. > > Obviously, this doesn't take care of normalisation between your samples > within each plate though. If you suspect your samples to have some > systematic variation that you need to account for, then you can normalise > each plate individually (as a 48 x 48) object. Alternatively, you can try > to combine within- and between-sample normalisation by taking all 48 x 240 > values at once. > > In principle, you can split, reformat and the recombine the data in > however many ways you like. Personally, with any sort of data I prefer to > go with as little preprocessing as possible, since each additional step > can potentially introduce its own biases into the data. So unless there > are some obvious variation between your 5 plates, I'd probably stick with > just normalisation between the samples, e.. using a 48 x 240 object. > > Of course, you may have different preferences, or find out that a > completely different approach is required for this particular data set. > > \Heidi > >> thanks again for your prompt responses! >> >> best >> >> s >> >> On Jun 27, 2012, at 2:26 PM, Heidi Dvinge wrote: >> >>> Hi Simon, >>> >>>> Thanks for the help Heidi, >>>> but I'm still having troubles, your comments on the plotting helped me >>>> solve the outputs. But if I want to just display some groups (for >>>> example >>>> the LO group in the example below), how do I associate a group with >>>> multiple samples (ie biological reps)? 
>>>> >>>> Currently I'm associating genes with samples by reading in the file as >>>> below >>>> plate6=read.delim("plate6Sample.txt", header=FALSE) >>>> #this is a file to associate sample ID with the genes in the biomark >>>> data, >>>> as currently HTqPCR does not seem to associate the sample info in the >>>> Biomark output to the gene IDs >>>> >>> Erm, no, it doesn't :-/ >>> >>>> samples=as.vector(t(plate6)) >>>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>>> n.data=48, samples=samples) >>>> #now I have samples and genes similar to your example in the guide, but >>>> I >>>> want to associate samples to groups now. In the guide, you have an >>>> example >>>> where you have entire files as distinct samples, but in our runs, we >>>> never >>>> have that situation. I have a file which associates samples to groups, >>>> which I read in... >>>> >>>> groupID=read.csv("plate6key.csv") >>>> >>>> but how do I associate the samples with their appropriate groups for >>>> biological replicates with any of the functions in HtQPCR? >>> >>> I'm afraid I'm slightly confused here (sorry, long day). Exactly how is >>> your data formatted? I.e. are the columns either individual samples, or >>> from files containing multiple samples? So for example for a single >>> 48.48 >>> arrays, is your qPCRset object 2304 x 1 or 48 x 48? >>> >>> From your readCtData command I'm guessing you have 48 x 48, i.e. all 48 >>> samples from your 1 array are in columns. In that case the 'groups' >>> parameter in plotCtOverview will need to be a vector of length 48, >>> indicating how you want the 48 columns in your qPCRset object to be >>> grouped together. >>> >>> Below is an example of (admittedly ugly) plots. I don't know if that's >>> similar to your situation at all. 
>>> >>> \Heidi >>> >>>> # Reading in data >>>> exPath <- system.file("exData", package = "HTqPCR") >>>> raw1 <- readCtData(files = "BioMark_sample.csv", path = exPath, format >>>> = >>> "BioMark", n.features = 48, n.data = 48) >>>> # Check sample names >>>> head(sampleNames(raw1)) >>> [1] "Sample1" "Sample2" "Sample3" "Sample4" "Sample5" "Sample6" >>>> # Associate samples with (randomly chosen) groups >>>> anno <- data.frame(sampleID=sampleNames(raw1), Group=rep(c("A", "B", >>> "C", "D"), times=c(4,24,5,15))) >>>> head(anno) >>> sampleID Group >>> 1 Sample1 A >>> 2 Sample2 A >>> 3 Sample3 A >>> 4 Sample4 A >>> 5 Sample5 B >>> 6 Sample6 B >>>> # Plot the first gene - for each sample individually >>>> plotCtOverview(raw1, genes=featureNames(raw1)[1], legend=FALSE, >>> col=1:nrow(anno)) >>>> # Plot the first gene - for each group >>>> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, >>> legend=FALSE, col=1:length(unique(anno$Group))) >>>> # Plot the first gene, using group "A" as a control >>>> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, >>> legend=FALSE, col=1:length(unique(anno$Group)), calibrator="A") >>> >>> >>> >>>> You recommend below using a vector, but I dont see how that helps me >>>> associate the samples in the Expression set. >>>> >>>> thanks again! >>>> >>>> s >>>> >>>> On Jun 26, 2012, at 12:48 PM, Heidi Dvinge wrote: >>>> >>>>>> Hi, >>>>>> I'm having some troubles selectively sub-setting, and graphing up >>>>>> QPCR >>>>>> data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). >>>>>> I'd >>>>>> like to be able to visualize specific genes, with specific groups we >>>>>> run >>>>>> routinely on our Biomark system. Typical runs are across multiple >>>>>> plates, >>>>>> and have multiple biological replicates, and usually 2 or more >>>>>> technical >>>>>> replicates (although we are moving away from technical reps, as the >>>>>> CVs >>>>>> are so tight). >>>>>> >>>>>> Can anyone help with this? Heidi? 
>>>>>> >>>>>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>>>>> n.data=48, samples=samples) >>>>>> #Ive read the samples in from a separate file, as when you read it >>>>>> in, >>>>>> it >>>>>> doesnt take the sample names supplied in the biomark output# >>>>>> #Next, I want to plot genes of interest, with samples of interest, >>>>>> and >>>>>> I'm >>>>>> having trouble getting an appropriate output# >>>>>> >>>>>> g=featureNames(raw6)[1:2] >>>>>> plotCtOverview(raw6, genes=g, groups=groupID$Treatment, >>>>>> col=rainbow(5)) >>>>>> >>>>>> #This plots 1 gene across all 48 samples# >>>>>> #but the legend doesnt behave, its placed on top of the plot, and I >>>>>> cant >>>>>> get it to display in a non-overlapping fashion# >>>>>> #I've tried all sorts of things in par, but nothing seems to shift >>>>>> the >>>>>> legend's position# >>>>>> >>>>> As the old saying goes, whenever you want a job done well, you'll have >>>>> to >>>>> do it yourself ;). In this case, the easiest thing is probably to use >>>>> legend=FALSE in plotCtOverview, and then afterwards add it yourself in >>>>> the >>>>> desired location using legend(). That way, if you have a lot of >>>>> different >>>>> features or groups to display, you can also use the ncol parameter in >>>>> legend to make several columns within the legend, such as 3x4 instead >>>>> of >>>>> the default 12x1. >>>>> >>>>> Alternatively, you can use either xlim or ylim in plotCtOverview to >>>>> add >>>>> some empty space on the side where there's then room for the legend. 
>>>>> >>>>>> #I now want to plot a subset of the samples for specific genes# >>>>>>> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== >>>>>>> "LFY") >>>>>>> LOY >>>>>> Sample Treatment >>>>>> 2 L20 LFY >>>>>> 5 L30 LFY >>>>>> 7 L45 LO >>>>>> 20 L40 LO >>>>>> 27 L43 LO >>>>>> 33 L29 LFY >>>>>> 36 L38 LO >>>>>> 40 L39 LO >>>>>> 43 L23 LFY >>>>>> >>>>>> >>>>>>> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) >>>>>> Warning messages: >>>>>> 1: In split.default(t(x), sample.split) : >>>>>> data length is not a multiple of split variable >>>>>> 2: In qt(p, df, lower.tail, log.p) : NaNs produced >>>>>>> >>>>> >>>>> Does it make sense if you split by groups=LOY$Treatment? It looks like >>>>> the >>>>> object LOY itself is a data frame, rather than the expected vector. >>>>> >>>>> Also, you may have to 'repeat' the col=rainbow() argument to fit your >>>>> number of features. >>>>> >>>>>> >>>>>> #it displays the two groups defined by treatment, but doesnt do so >>>>>> nicely, >>>>>> very skinny bars, and the legend doesnt reflect what its displaying# >>>>>> #again, I've tried monkeying around with par, but not sure what >>>>>> HTqPCR >>>>>> is >>>>>> calling to make the plots# >>>>>> >>>>> If the bars are very skinny, it's probably because you're displaying a >>>>> lot >>>>> of features. Nothing much to do about that, except increasing the >>>>> width >>>>> or >>>>> your plot :(. >>>>> >>>>> \Heidi >>>>> >>>>>> please help! >>>>>> >>>>>> thanks >>>>>> >>>>>> Simon. 
>>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> > > From guest at bioconductor.org Thu Jun 28 23:16:07 2012 From: guest at bioconductor.org (DaveW [guest]) Date: Thu, 28 Jun 2012 14:16:07 -0700 (PDT) Subject: [BioC] R 2.15.1 ReadAffy error Message-ID: <20120628211607.AEB4B13C71D@mamba.fhcrc.org> I'm attempting to read Affymetrix CEL files and failing miserably. Any thoughts. Error in read.celfile.header(as.character(filenames[[1]])) : Is /home/dw/HDgenotypes/CEL files/Titan_0020_772G_Hannotte_772_001_D06.CEL really a CEL file? tried reading as text, gzipped text, binary, gzipped binary, command console and gzipped command console formats Here is the output of the first few lines of one of the CEL files in case this helps anyone to spot the issue: dw at dw-laptop:~/HDgenotypes/CEL files$ head Titan*H09.CEL ;J.!affymetrix-calvin-multi-intensity60000065535-1336523928-0000026962-0000029358-0000011478en-US0affymetrix-algorithm-nameHHT Image Calibration Cell Generation text/plain affymetrix-algorithm-version3.2.0.1515 text/plainaffymetrix-array-type???Axiom_GW_Gal_SNP_1 text/plainaffymetrix-library-package???Universal text/plainaffymetrix-cel-rows???text/x-calvin-integer-32affymetrix-cel-cols???text/x-calvin-integer-32program-company Affymetrix, Inc. 
text/plain program-nameFAffymetrix Genechip Command Console text/plain program-id3.2.0.1515 text/plain)affymetrix-algorithm-param-NumPixelsToUsetext/x-calvin-integer-32+affymetrix-algorithm-param-ImageCalibratioTRUE text/plain,affymetrix-algorithm-param-FeatureExtraction -- output of sessionInfo(): > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 [4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] affy_1.34.0 Biobase_2.16.0 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] affyio_1.24.0 BiocInstaller_1.4.7 preprocessCore_1.18.0 tools_2.15.1 [5] zlibbioc_1.2.0 -- Sent via the guest posting facility at bioconductor.org. From jhardcas at fhcrc.org Thu Jun 28 20:30:30 2012 From: jhardcas at fhcrc.org (Hardcastle, Justin) Date: Thu, 28 Jun 2012 11:30:30 -0700 (PDT) Subject: [BioC] Unable to open database file, cummeRbund error. In-Reply-To: <72669F21-F1FE-46F7-8BA2-6300E68CEFFB@csail.mit.edu> Message-ID: I've managed to make it run using the absolute path. I thought I was using the absolute path, but it turns out I could have been more explicit which made it start at least. I'm getting a new error now though. The error is below. Reading /Volumes/home/Justin/Projects/Test/output/cuffdiff/cds.diff Writing CDSDiffData table Indexing Tables... Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: database is locked) cds.diff exists and has data, and the permissions on the DB file look fine for my user. Thanks for any help. 
-Justin ----- Original Message ----- From: "Loyal Goff" To: "Justin Hardcastle" Cc: bioconductor at r-project.org Sent: Wednesday, June 27, 2012 7:57:31 AM Subject: Re: [BioC] Unable to open database file, cummeRbund error. Hi Justin, Can you confirm that "~/Test/output/cuffdiff" is a valid path? I cannot seem to re-create this issue using a similar approach to yours. Alternatively, can you just provide the directory path directly to readCufflinks instead of going through file.path()? -Loyal On Jun 26, 2012, at 2:03 PM, Hardcastle, Justin wrote: > Hi, > I'm having an issue running cummeRbund on my cuffdiff output. CummeRbund is giving me a DB error and not creating the DB. The code and error are below. > > library("cummeRbund") > > dir = "~/Test" > outdir = "output/cuffdiff" > cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) > > The error given is > >> cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) > Creating database ~/Test/output/cuffdiff/cuffData.db > Error in sqliteNewConnection(drv, ...) : > RS-DBI driver: (could not connect to dbname: > unable to open database file > ) > > I am running cummeRbund 1.2.0, and Cufflinks 2.0.1. > > Thanks for any help. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From Michael.Salbaum at pbrc.edu Fri Jun 29 00:43:09 2012 From: Michael.Salbaum at pbrc.edu (Michael Salbaum) Date: Thu, 28 Jun 2012 17:43:09 -0500 Subject: [BioC] Differential gene expression: EdgeR / DESeq and identifying noise/outliers Message-ID: <089B8CC2D95DD5498EB7CD66289A668F1C51CB@pbrcas30.pbrc.edu> An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From smyth at wehi.EDU.AU Fri Jun 29 12:52:52 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Fri, 29 Jun 2012 20:52:52 +1000 (AUS Eastern Standard Time) Subject: [BioC] Differential gene expression: EdgeR / DESeq and identifying noise/outliers In-Reply-To: References: Message-ID: Hi Michael, Here's a link to a previous reply that I made to a similar question last month: https://www.stat.math.ethz.ch/pipermail/bioconductor/2012-May/045483.html The short answer is to set the argument prior.n of the estimateTagwiseDisp() function in edgeR to a smaller value. The default in edgeR is to use a largish value for the prior df, which means that edgeR squeezes the tagwise dispersions strongly towards the global value, meaning that it isn't able to adapt sufficiently to individual genes with outliers such as yours. Your data has 14 libraries and two groups, so there are 12 residual degrees of freedom for each gene. The prior degrees of freedom are set to 20 by default, so the prior number of observations defaults to prior.n = 20/12 = 1.67 for your data. Try instead a smaller value like prior.n = 6/12. The smaller prior.n is, the more edgeR will de-prioritize those hypervariable genes. It would be preferable if edgeR did this for you, adapting to the characteristics of your data automatically. A new version of edgeR should be able to do that in a few months. Best wishes Gordon > Date: Thu, 28 Jun 2012 17:43:09 -0500 > From: "Michael Salbaum" > To: > Subject: [BioC] Differential gene expression: EdgeR / DESeq and > identifying noise/outliers > > Hi everyone, > > I am working on a differential gene expression paradigm; n=7, 3'-expression tag sequencing, AB SOLiD5500XL. We look for expression of ~26,000 RefSeq genes; as input, I'm using a bottom-shaved counts table (~16,500 genes), i.e. rows with a total count sum of 12 or less have been trimmed off.
Using edgeR, I find 2569 genes (1224 up / 1345 down) to be differentially expressed at FDR better than 0.05. > I've noticed that in this list are many genes where the differential expression call appears to be driven by one 'outlier' on one side of the paradigm, for instance: > Nodal WT: 7 7 5 4 19 6 1 Het: 320 2 16 8 6 1 13 logFC: 2.65 FDR: 0.00814198 > > Of course, this is not desirable for follow-up studies, and I'm wondering whether there's a way to filter out such situations. > > Right now, I've resorted to looking at the coefficient of variation calculated from normalized data as a potential discriminator. > > I've also used DESeq on the same count data set, which identifies 1569 genes (719 up / 850 down) at padj<0.05. Of these 1569, 1544 are also called by edgeR, so, excellent agreement. Relaxing the cutoff to padj<0.1 gives another 761 genes, of which 585 make the FDR<0.05 cutoff in edgeR. > > Plotting -log10(FDR) against the coefficient of variation shows that ~90% of the genes called by both DESeq and edgeR (at padj<0.05) have a CV of 0.3 or better. The same plot for the genes called only by edgeR (at padj<0.05) seems to show that this group contains many genes at higher CV; I suspect these are the outlier-driven ones mentioned above. > > I'm perfectly happy with the outcome (and with the procedure of using > two programs and continue with the intersect between the two), but > those single outlier-driven genes were irksome, particularly so because > they were of a nature that got us excited, in biological terms. So, to > avoid the letdown, I'd appreciate advice how not to get those in the > first place. > > And I apologize for the long post, but as a newbie, I figured I better include what info I have. > > I'm not sure this is proper etiquette here, but here is a link to the plots: > http://inlinethumb04.webshots.com/51523/2373514640050256648S600x600Q85.jpg > > R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows" > 'edgeR'
version 2.6.7 > > library(edgeR) > Loading required package: limma > x <- read.delim("/Users/jms/Desktop/SAGE3/SAGE3F.txt",row.names="Gene") >> head(x) > WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7 HET_1 HET_2 HET_3 HET_4 HET_5 HET_6 HET_7 > 0610005C13Rik 5 1 6 0 5 2 18 34 7 13 18 28 1 13 >> group <- factor(c(1,1,1,1,1,1,1,2,2,2,2,2,2,2)) >> y <- DGEList(counts=x,group=group) > Calculating library sizes from column totals. >> y <- estimateCommonDisp(y, verbose=TRUE) > Disp = 0.06254 , BCV = 0.2501 >> y <- estimateTagwiseDisp(y) >> et <- exactTest(y) >> topTags(et) >> summary(de <- decideTestsDGE(et, p=0.05, adjust="BH")) > [,1] > -1 1345 > 0 14155 > 1 1224 > > > > 'DESeq' version 1.8.3 > library(DESeq) > Loading required package: Biobase > Loading required package: BiocGenerics > Attaching package: 'BiocGenerics' > The following object(s) are masked from 'package:stats': > xtabs > The following object(s) are masked from 'package:base': > anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, > mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, > setdiff, table, tapply, union, unique > Welcome to Bioconductor > Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see > 'citation("Biobase")', and for packages 'citation("pkgname")'.
> Loading required package: locfit > locfit 1.5-8 2012-04-25 >> f = "/Users/jms/Desktop/SAGE3/SAGE3F.txt" >> countsTable <- read.table( f, header=TRUE, row.names=1, stringsAsFactors=TRUE ) >> head(countsTable) > WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7 HET_1 HET_2 HET_3 HET_4 HET_5 HET_6 HET_7 > 0610005C13Rik 5 1 6 0 5 2 18 34 7 13 18 28 1 13 >> conds <- c( "WT", "WT", "WT", "WT", "WT", "WT", "WT", "HET", "HET", "HET", "HET", "HET", "HET", "HET" ) >> cds <- newCountDataSet( countsTable, conds ) >> cds <- estimateSizeFactors( cds ) >> sizeFactors( cds ) > WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7 HET_1 HET_2 HET_3 HET_4 HET_5 > 1.3598468 0.9759626 1.0788248 1.0080744 1.0097703 0.5502473 0.8747237 1.0877753 0.5544828 0.9594141 1.6036028 1.4868868 > HET_6 HET_7 > 0.8422686 1.5328149 >> cds <- estimateDispersions( cds ) >> res <- nbinomTest( cds, "WT", "HET" ) >> write.table (res, sep = "\t", file = "/Users/jms/Desktop/SAGE3/SAGE3F_DE.txt", col.names = NA) >> write.table (counts( cds, normalized=TRUE ), sep = "\t", file = "/Users/jms/Desktop/SAGE3/SAGE3F_NORM.txt", col.names = NA) > > Cheers, michael > > > J. Michael Salbaum, Ph.D. > Associate Professor > Pennington Biomedical Research Center > Louisiana State University System > 6400 Perkins Road > Baton Rouge, LA 70808 > > (225) 763-2782 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}} From smyth at wehi.EDU.AU Fri Jun 29 13:19:42 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Fri, 29 Jun 2012 21:19:42 +1000 (AUS Eastern Standard Time) Subject: [BioC] Contrast Problem In-Reply-To: References: Message-ID: Dear Aditi, Thanks for going to quite a bit of effort to describe your problem, but I'm afraid that I still don't follow entirely what you're trying to do. I wonder how you have arrived at the contrasts you have defined. 
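As a side note on how the contrast vectors relate to the design, here is a minimal base R sketch that rebuilds the three factors from the quoted script and prints the coefficient names of two parametrisations. The second design (`design_means`, one coefficient per group) is an assumption added for illustration and is not part of the original script. Which accession level corresponds to F1, Tom, Tx or Mx is not stated in the thread, so no labels are attached.

```r
# Factors as defined in the quoted script: 8 groups x 3 replicates = 24 libraries.
accessions <- factor(rep(1:4, each = 6))
genomes    <- factor(rep(rep(1:2, each = 3), times = 4))
groups     <- factor(rep(1:8, each = 3))

# Interaction parametrisation used in the quoted script.
# Its coefficients are an intercept, accession effects, a genome effect
# and accession-by-genome interactions, not one mean per group.
design2 <- model.matrix(~ accessions * genomes)
print(colnames(design2))

# Assumed alternative for illustration: one coefficient per group.
# With this design a vector such as c(1, -1, 0, 0, 0, 0, 0, 0)
# compares group 1 with group 2 directly.
design_means <- model.matrix(~ 0 + groups)
print(colnames(design_means))
```

Under the interaction design, a vector such as c(1,-1,0,0,0,0,0,0) compares the intercept with the second coefficient rather than two group means, so it may be worth checking the column names before writing contrasts. This is only a sketch of the bookkeeping, not a recommendation for the analysis itself.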
When you say differential expression (DE) between genomes for all the accessions, do you mean DE between genomes for each of the accessions separately? That is, DE genes for A.F1 vs D.F1, for A.Tom vs D.Tom, for A.Tx vs D.Tx and for A.Mx vs D.Mx? When you say DE between accessions, is that for each genome separately? In other words, do you expect the differences between the accessions to be relatively the same for each genome, or will the differences between accessions be genome-specific? Your comments about zero expression and not detecting significantly DE genes don't make sense to me. I won't try to respond to those comments because I think sorting out the above questions will probably solve other perceived problems as well. Best wishes Gordon > Date: Thu, 28 Jun 2012 12:01:50 -0700 > From: Aditi Rambani > To: "bioconductor at stat.math.ethz.ch" > Subject: [BioC] Contrast Problem > > Hello, > > I am a graduate student at Brigham Young University working on polyploid > cotton RNA seq data. Our study design has two explanatory variables, one > is 'accession' with four levels (F1,Tom,Tx,Mx) and the other one is 'genome' > with two levels (A genome or D genome). We want to detect differential > expression of genes between 'genomes' from all the accessions and also > find genes that are differentially expressed between accessions. We > built a (Accession*Genome) model and did a contrast for two levels of > 'genomes'. In contrast results we see that many genes with zero > expression (0 RPKM) have significant FDRs and some significantly > differentially expressed genes are not detected. We are not sure why > this is happening, any help will be greatly appreciated. > > Thanks > > Aditi > > We are using the following script to do our analysis but
> > > library("edgeR") > > counts <- read.table(INFILE, header=T, row.names=1) > groups <- factor(c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8)) > accessions <- factor(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)) > genomes <- factor(c(1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2)) > > > design2 <- model.matrix(~accessions*genomes) > > dge <- DGEList(counts=counts, group=groups) > dge <- calcNormFactors(dge) > dge2 <- estimateGLMCommonDisp(dge, design2) > dge2 <- estimateGLMTrendedDisp(dge2, design2) > dge2 <- estimateGLMTagwiseDisp(dge2, design2) > > fit2 <- glmFit(dge2, design2) > > lrt.acc1 <- glmLRT(dge2, fit2, contrast=c(0,0,1,1,-1,-1,0,0)) > lrt.acc2 <- glmLRT(dge2, fit2, contrast=c(0,0,1,1,1,1,-2,-2)) > lrt.acc3 <- glmLRT(dge2, fit2, contrast=c(-3,-3,1,1,1,1,1,1)) > lrt.F1 <- glmLRT(dge2, fit2, contrast=c(1,-1,0,0,0,0,0,0)) > lrt.Mx <- glmLRT(dge2, fit2, contrast=c(0,0,1,-1,0,0,0,0)) > lrt.Tx <- glmLRT(dge2, fit2, contrast=c(0,0,0,0,1,-1,0,0)) > lrt.Tom <- glmLRT(dge2, fit2, contrast=c(0,0,0,0,0,0,1,-1)) > > > write.table(topTags(lrt.acc1, n=10000), file="acc1.results", sep="\t", quote=F) > write.table(topTags(lrt.acc2, n=10000), file="acc2.results", sep="\t", quote=F) > write.table(topTags(lrt.acc3, n=10000), file="acc3.results", sep="\t", quote=F) > write.table(topTags(lrt.F1, n=10000), file="F1.results", sep="\t", quote=F) > write.table(topTags(lrt.Mx, n=10000), file="Mx.results", sep="\t", quote=F) > write.table(topTags(lrt.Tx, n=10000), file="Tx.results", sep="\t", quote=F) > write.table(topTags(lrt.Tom, n=10000), file="Tom.results", sep="\t", quote=F) ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}} From fatsey at utu.fi Fri Jun 29 13:26:26 2012 From: fatsey at utu.fi (Fatemehsadat Seyednasrollah) Date: Fri, 29 Jun 2012 11:26:26 +0000 Subject: [BioC] FW: DESeq analysis In-Reply-To: <26055A619290434EA444B0C7E5DEFAF451E4D1AC@exch-mbx-01.utu.fi> 
References: <20120626161705.06E4D13ADBF@mamba.fhcrc.org> <26055A619290434EA444B0C7E5DEFAF451E4D1AC@exch-mbx-01.utu.fi> Message-ID: <26055A619290434EA444B0C7E5DEFAF451E4D1D4@exch-mbx-01.utu.fi> ________________________________________ From: Fatemehsadat Seyednasrollah Sent: Friday, June 29, 2012 12:00 PM To: narges [guest] Subject: RE: DESeq analysis Hi, First thanks a lot for your answer. Actually I have used a subset of a public data from Bowtie(the Montgomery) and below are the reduced codes of my work both from edgeR and DEseq. I wanted to know have I done something wrong to obtain very different answers ( 85 from DESeq and 407 from edgeR) or it is natural to have this hude difference and it is related to the algorithms? edgeR: > g1 <- read.delim ("count1.txt", row.names = 1) > head(g1) NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F ENSG00000000003 0 0 0 0 1 0 0 ENSG00000000005 0 0 0 0 0 0 0 ENSG00000000419 10 24 19 20 19 8 14 ENSG00000000457 17 15 13 18 21 18 21 ENSG00000000460 2 3 5 2 4 6 8 ENSG00000000938 20 4 35 16 10 17 19 NA10847F ENSG00000000003 0 ENSG00000000005 0 ENSG00000000419 6 ENSG00000000457 15 ENSG00000000460 2 ENSG00000000938 9 > group <- factor(rep(c("Male", "Female"), each= 4)) > dge <- DGEList(counts = g1 , group = group ) Calculating library sizes from column totals. 
> dge <- calcNormFactors(dge) > dge <- estimateCommonDisp(dge) > sqrt (dge$common.dispersion) [1] 0.3858996 > test <- exactTest(dge) > head(test$table) logFC logCPM PValue ENSG00000000003 -2.3441897 -3.042057 1.0000000 ENSG00000000005 0.0000000 -Inf 1.0000000 ENSG00000000419 0.5777309 3.850993 0.2791539 ENSG00000000457 -0.3054489 4.080866 0.5592668 ENSG00000000460 -0.7792622 1.966865 0.3274528 ENSG00000000938 0.3909100 3.997866 0.4269672 > sum (test$table$PValue <0.01) [1] 407 DESeq: > g1 <- read.table("count1.txt", header = TRUE, row.names = 1) > conds <- factor(rep(c("Male", "Female"), each= 4)) > dataPack <- data.frame(row.names = colnames(g1), condition =rep( c("Male", "Female"), each= 4)) > dataPack condition NA06994M Male NA07051M Male NA07347M Male NA07357M Male NA07000F Female NA07037F Female NA07346F Female NA10847F Female > cds <- newCountDataSet(g1, conds) > head(cds) CountDataSet (storageMode: environment) assayData: 1 features, 8 samples element names: counts protocolData: none phenoData sampleNames: NA06994M NA07051M ... 
NA10847F (8 total) varLabels: sizeFactor condition varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: > head(counts(cds) + ) NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F ENSG00000000003 0 0 0 0 1 0 0 ENSG00000000005 0 0 0 0 0 0 0 ENSG00000000419 10 24 19 20 19 8 14 ENSG00000000457 17 15 13 18 21 18 21 ENSG00000000460 2 3 5 2 4 6 8 ENSG00000000938 20 4 35 16 10 17 19 NA10847F ENSG00000000003 0 ENSG00000000005 0 ENSG00000000419 6 ENSG00000000457 15 ENSG00000000460 2 ENSG00000000938 9 > cds <- estimateSizeFactors(cds) > sizeFactors(cds) NA06994M NA07051M NA07347M NA07357M NA07000F NA07037F NA07346F NA10847F 0.8599841 1.1102643 1.0869086 1.1157556 1.1056726 1.0666049 0.9152017 0.9402086 > head(counts(cds, normalized= TRUE)) > cds <- estimateDispersions(cds) > result <- nbinomTest(cds, "Male", "Female") > nrow(subset(result, result$pval <0.01)) [1] 85 Again thank you so much With Best Regards, Narges________________________________________ From: narges [guest] [guest at bioconductor.org] Sent: Tuesday, June 26, 2012 7:17 PM To: bioconductor at r-project.org; Fatemehsadat Seyednasrollah Subject: DESeq analysis Hi all I am doing some RNA seq analysis with DESeq. I have applied the nbinomTest to my dataset which I know have many differentially expressed genes but the first problem is that the result values for "padj"column is almost NA and sometimes 1. and when I want to have a splice from my fata frame the result is not meaningful for me. 
-- output of sessionInfo(): res <- nbinomTest(cds, "Male", "Female") > head(res) id baseMean baseMeanA baseMeanB foldChange log2FoldChange 1 ENSG00000000003 0.1130534 0.000000 0.2261067 Inf Inf 2 ENSG00000000005 0.0000000 0.000000 0.0000000 NaN NaN 3 ENSG00000000419 14.3767155 17.162610 11.5908205 0.6753530 -0.5662863 4 ENSG00000000457 17.0174761 15.342800 18.6921526 1.2183013 0.2848710 5 ENSG00000000460 3.9414822 2.855099 5.0278659 1.7610131 0.8164056 6 ENSG00000000938 16.0894945 18.350117 13.8288718 0.7536122 -0.4081058 pval padj 1 0.9959638 1 2 NA NA 3 0.3208560 1 4 0.5942512 1 5 0.4840607 1 6 0.5409953 1 > res1 <- res[res$padj<0.1,] > head(res1) id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj NA NA NA NA NA NA NA NA NA.1 NA NA NA NA NA NA NA NA.2 NA NA NA NA NA NA NA NA.3 NA NA NA NA NA NA NA NA.4 NA NA NA NA NA NA NA NA.5 NA NA NA NA NA NA NA My first question is why, although I know there are some differentially expressed genes in my data, all the padj values are NA or 1. My second question is about the "NA.1", "NA.2", ... labels that appear as the first column of object "res1" instead of the gene names. Thank you so much Regards -- Sent via the guest posting facility at bioconductor.org. From lawrence.michael at gene.com Fri Jun 29 14:06:59 2012 From: lawrence.michael at gene.com (Michael Lawrence) Date: Fri, 29 Jun 2012 05:06:59 -0700 Subject: [BioC] GappedAlignmentPairs requests Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From kasperdanielhansen at gmail.com Fri Jun 29 14:45:53 2012 From: kasperdanielhansen at gmail.com (Kasper Daniel Hansen) Date: Fri, 29 Jun 2012 08:45:53 -0400 Subject: [BioC] Unable to open database file, cummeRbund error. In-Reply-To: References: <72669F21-F1FE-46F7-8BA2-6300E68CEFFB@csail.mit.edu> Message-ID: Loyal, You might want to start using normalizePath() and path.expand() in your code.
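For readers following the path issue in this thread, a minimal base R sketch of what tilde expansion and path normalisation do. The base R functions are path.expand() and normalizePath(). The path string is taken from the quoted report and is used only as text here; the commented readCufflinks() call is a hypothetical use and is not run.

```r
# A tilde path like the one in the quoted report (used only as a string).
p <- file.path("~", "Test", "output", "cuffdiff")

# path.expand() replaces a leading "~" with the user's home directory.
expanded <- path.expand(p)
print(expanded)

# normalizePath() returns an absolute, canonical path. It is shown on a
# directory that is guaranteed to exist, because the result for a
# non-existent path is platform-dependent.
abs_dir <- normalizePath(tempdir(), mustWork = TRUE)
print(abs_dir)

# Hypothetical use with the call from the thread (not run):
# cuff <- readCufflinks(dir = normalizePath(p), rebuild = TRUE)
```

Whether passing an expanded path is sufficient for a given package depends on how that package opens its files, so this is a sketch of the base R behaviour only.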
Kasper On Thu, Jun 28, 2012 at 2:30 PM, Hardcastle, Justin wrote: > I've managed to make it run using the absolute path. I thought I was using the absolute path, but it turns out I could have been more explicit which made it start at least. I'm getting a new error now though. The error is below. > > Reading /Volumes/home/Justin/Projects/Test/output/cuffdiff/cds.diff > Writing CDSDiffData table > Indexing Tables... > Error in sqliteExecStatement(con, statement, bind.data) : > RS-DBI driver: (error in statement: database is locked) > > cds.diff exists and has data, and the permissions on the DB file look fine for my user. > > Thanks for any help. > -Justin > > ----- Original Message ----- > From: "Loyal Goff" > To: "Justin Hardcastle" > Cc: bioconductor at r-project.org > Sent: Wednesday, June 27, 2012 7:57:31 AM > Subject: Re: [BioC] Unable to open database file, cummeRbund error. > > Hi Justin, > Can you confirm that "~/Test/output/cuffdiff" is a valid path? I cannot seem to re-create this issue using a similar approach to yours. Alternatively, can you just provide the directory path directly to readCufflinks instead of going through file.path()? > > -Loyal > > > On Jun 26, 2012, at 2:03 PM, Hardcastle, Justin wrote: > >> Hi, >> I'm having an issue running cummeRbund on my cuffdiff output. CummeRbund is giving me a DB error and not creating the DB. The code and error are below. >> >> library("cummeRbund") >> >> dir = "~/Test" >> outdir = "output/cuffdiff" >> cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) >> >> The error given is >> >>> cuff <- readCufflinks(dir = file.path(dir, outdir), rebuild = TRUE) >> Creating database ~/Test/output/cuffdiff/cuffData.db >> Error in sqliteNewConnection(drv, ...) : >> RS-DBI driver: (could not connect to dbname: >> unable to open database file >> ) >> >> I am running cummeRbund 1.2.0, and Cufflinks 2.0.1. >> >> Thanks for any help.
>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From orenis1 at gmail.com Thu Jun 28 19:41:14 2012 From: orenis1 at gmail.com (Oren Schaedel) Date: Thu, 28 Jun 2012 10:41:14 -0700 Subject: [BioC] DEseq and FDR correction Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From alla.bullashevska at fdm.uni-freiburg.de Fri Jun 29 16:44:22 2012 From: alla.bullashevska at fdm.uni-freiburg.de (Alla Bulashevska) Date: Fri, 29 Jun 2012 16:44:22 +0200 Subject: [BioC] Affymetrix GeneChip Human Promoter 1.0R Array Message-ID: Dear Bioconductor users, i am analysing data from Affymetrix GeneChip Human Promoter 1.0R Array. The library file i use is Hs_PromPR_v02-3_NCBIv36.bpmap My question is, what is the simplest way to obtain for each probe on the array the Entrez Gene ID of the gene, the promoter of which this probe is testing. Generally, I would like to have for the subsequent GO analysis the whole list of genes scanned with this promoter array. Thank you in Advance, Alla Bulashevska, University of Freiburg. 
From smelov at buckinstitute.org Fri Jun 29 17:58:23 2012 From: smelov at buckinstitute.org (Simon Melov) Date: Fri, 29 Jun 2012 08:58:23 -0700 Subject: [BioC] HTqPCR problems In-Reply-To: <4EBF34FE-0178-4640-8F6B-D6C4212B844B@buckinstitute.org> References: <7B8EC902-D981-4467-B997-002FC1CFC368@buckinstitute.org> <10ebef5ba3907a11607fea19c1f11c3b.squirrel@webmail.ebi.ac.uk> <7BE89EC0-5910-4758-881F-6EFCB8E79A5F@buckinstitute.org> <009c6493df7d27088ee45793182ed7a8.squirrel@webmail.ebi.ac.uk> <6d2982152de2217869a0f9db60da3904.squirrel@webmail.ebi.ac.uk> <4EBF34FE-0178-4640-8F6B-D6C4212B844B@buckinstitute.org> Message-ID: <43857F42-62E5-4B37-B2EF-0C7250307EDE@buckinstitute.org> Hi Heidi, I think I've identified the problem. Currently it appears as though HTqPCR reads the sample ID's and genes in from top to bottom of the CSV output from the Biomark. This is not the sample order we load in. As long as thats made clear in the vignette, it will prevent any confusion. We typically have a loading list, in which we associate samples with groups (numbered 1-48, or 1-96 for both formats). I was getting confusing results (laid out below) as I assumed HTqPCR associated sample IDs in the loading order, not the CSV format top to bottom. Am I correct here in how HTqPCR reads in the data from the CSV file? thanks again, best Simon On Jun 28, 2012, at 2:11 PM, Simon Melov wrote: > Hi Heidi, > getting there, hopefully if you can clarify the following issue, all will be well and good. > > After yesterdays correspondence, I'm now producing nice plots, when I check the actual values being plotted, they dont match up > to the sample ID's. In fact, if I dont bother assigning groups, the sample ID's dont match to their respective gene CT values. I'm > worried there is some underlying problem with the data structure I'm not understanding. > > I understand the code, its just the samples dont match the reported gene values in the csv file. 
> > for example > >> head(groupID) > Sample Treatment > 1 S28 SMY > 2 L20 LFY > 3 M26 MMY > 4 L1 LFR > 5 L30 LFY > 6 K13 KMO > >> plotCtOverview(raw6, genes=featureNames(raw6)[1], group=groupID$Treatment,legend=FALSE, col=1:length(unique(groupID$Treatment))) > > produces a nice plot of a tubulin gene across the groups, as you suggested yesterday . Yet if I look at the values, they dont match > the CSV values for specific genes/samples I used. If I turn off groups, and look at samples without merging by group, I can see that the values dont match the appropriate gene being > displayed. My question is, where is the sample order being drawn from in the CSV file? Is there a simple check I can use to see that what is being plotted, > is what I think is being plotted? The group ID sample-Treatment is correct, and all the samples in the original CSV file are correct. > > Is it possible that the package is assigning gene/sample ID in some other order than that I've supplied? > I just want to be sure that when HTqPCR pulls the sample ID and maps it to the appropriate gene/Group, some transformation is not happening. > > Fluidigm suggests a particular order in loading samples and genes. These are numbered 1-48 (sample), and 1-48 (gene) for a 48.48 plate (and the same for a 96.96 plate). > This is the order I supplied the sample IDs in the groupID file above. How do you map the raw csv output to gene/sample id? > > Is there a way of checking that the sample/gene/group ID is correct? > > as always, thanks in advance for your help > > best > > s > On Jun 27, 2012, at 3:27 PM, Heidi Dvinge wrote: > >>> Hi Heidi, >>> you are correct, yes 48.48. >>> The example you provide below is exactly what I needed for clarification >>> for groups. I was trying to reverse engineer what you had done with the >>> original expression set package for microarrays, but from below, I can get >>> this to work now. >>> >> Glad it works. 
Hopefully by the next BioConductor release I'll remember to >> clarify the plotCtOverview help file. >> >>> Just to be clear, I have 5 48.48 plates. Should I normalize each >>> individually prior to combining, or should I reformat to a 2304x1 each, >>> combine, then normalize (not sure if you can do that or not) >>> >> Hm, that's one of the questions I've also been asking myself, so I would >> be curious to hear what your results from this are. >> >> If you suspect that there are some major factors influencing the 5 plates >> systematically, then normalising them in a 2304 x 5 object should >> (hopefully) correct for that. For example, they may have been run on >> different days, by different people, or perhaps there was a short power >> cut during the processing of one of them. This might be visible if you >> have for example a boxplot of Ct from all 48*5 samples, and you see blocks >> of them shifted up or down. >> >> Obviously, this doesn't take care of normalisation between your samples >> within each plate though. If you suspect your samples to have some >> systematic variation that you need to account for, then you can normalise >> each plate individually (as a 48 x 48) object. Alternatively, you can try >> to combine within- and between-sample normalisation by taking all 48 x 240 >> values at once. >> >> In principle, you can split, reformat and the recombine the data in >> however many ways you like. Personally, with any sort of data I prefer to >> go with as little preprocessing as possible, since each additional step >> can potentially introduce its own biases into the data. So unless there >> are some obvious variation between your 5 plates, I'd probably stick with >> just normalisation between the samples, e.. using a 48 x 240 object. >> >> Of course, you may have different preferences, or find out that a >> completely different approach is required for this particular data set. >> >> \Heidi >> >>> thanks again for your prompt responses! 
>>> >>> best >>> >>> s >>> >>> On Jun 27, 2012, at 2:26 PM, Heidi Dvinge wrote: >>> >>>> Hi Simon, >>>> >>>>> Thanks for the help Heidi, >>>>> but I'm still having troubles, your comments on the plotting helped me >>>>> solve the outputs. But if I want to just display some groups (for >>>>> example >>>>> the LO group in the example below), how do I associate a group with >>>>> multiple samples (ie biological reps)? >>>>> >>>>> Currently I'm associating genes with samples by reading in the file as >>>>> below >>>>> plate6=read.delim("plate6Sample.txt", header=FALSE) >>>>> #this is a file to associate sample ID with the genes in the biomark >>>>> data, >>>>> as currently HTqPCR does not seem to associate the sample info in the >>>>> Biomark output to the gene IDs >>>>> >>>> Erm, no, it doesn't :-/ >>>> >>>>> samples=as.vector(t(plate6)) >>>>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>>>> n.data=48, samples=samples) >>>>> #now I have samples and genes similar to your example in the guide, but >>>>> I >>>>> want to associate samples to groups now. In the guide, you have an >>>>> example >>>>> where you have entire files as distinct samples, but in our runs, we >>>>> never >>>>> have that situation. I have a file which associates samples to groups, >>>>> which I read in... >>>>> >>>>> groupID=read.csv("plate6key.csv") >>>>> >>>>> but how do I associate the samples with their appropriate groups for >>>>> biological replicates with any of the functions in HtQPCR? >>>> >>>> I'm afraid I'm slightly confused here (sorry, long day). Exactly how is >>>> your data formatted? I.e. are the columns either individual samples, or >>>> from files containing multiple samples? So for example for a single >>>> 48.48 >>>> arrays, is your qPCRset object 2304 x 1 or 48 x 48? >>>> >>>> From your readCtData command I'm guessing you have 48 x 48, i.e. all 48 >>>> samples from your 1 array are in columns. 
In that case the 'groups' >>>> parameter in plotCtOverview will need to be a vector of length 48, >>>> indicating how you want the 48 columns in your qPCRset object to be >>>> grouped together. >>>> >>>> Below is an example of (admittedly ugly) plots. I don't know if that's >>>> similar to your situation at all. >>>> >>>> \Heidi >>>> >>>>> # Reading in data >>>>> exPath <- system.file("exData", package = "HTqPCR") >>>>> raw1 <- readCtData(files = "BioMark_sample.csv", path = exPath, format >>>>> = >>>> "BioMark", n.features = 48, n.data = 48) >>>>> # Check sample names >>>>> head(sampleNames(raw1)) >>>> [1] "Sample1" "Sample2" "Sample3" "Sample4" "Sample5" "Sample6" >>>>> # Associate samples with (randomly chosen) groups >>>>> anno <- data.frame(sampleID=sampleNames(raw1), Group=rep(c("A", "B", >>>> "C", "D"), times=c(4,24,5,15))) >>>>> head(anno) >>>> sampleID Group >>>> 1 Sample1 A >>>> 2 Sample2 A >>>> 3 Sample3 A >>>> 4 Sample4 A >>>> 5 Sample5 B >>>> 6 Sample6 B >>>>> # Plot the first gene - for each sample individually >>>>> plotCtOverview(raw1, genes=featureNames(raw1)[1], legend=FALSE, >>>> col=1:nrow(anno)) >>>>> # Plot the first gene - for each group >>>>> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, >>>> legend=FALSE, col=1:length(unique(anno$Group))) >>>>> # Plot the first gene, using group "A" as a control >>>>> plotCtOverview(raw1, genes=featureNames(raw1)[1], group=anno$Group, >>>> legend=FALSE, col=1:length(unique(anno$Group)), calibrator="A") >>>> >>>> >>>> >>>>> You recommend below using a vector, but I dont see how that helps me >>>>> associate the samples in the Expression set. >>>>> >>>>> thanks again! >>>>> >>>>> s >>>>> >>>>> On Jun 26, 2012, at 12:48 PM, Heidi Dvinge wrote: >>>>> >>>>>>> Hi, >>>>>>> I'm having some troubles selectively sub-setting, and graphing up >>>>>>> QPCR >>>>>>> data within HTqPCR for Biomark plates (both 48.48 and 96.96 plates). 
>>>>>>> I'd >>>>>>> like to be able to visualize specific genes, with specific groups we >>>>>>> run >>>>>>> routinely on our Biomark system. Typical runs are across multiple >>>>>>> plates, >>>>>>> and have multiple biological replicates, and usually 2 or more >>>>>>> technical >>>>>>> replicates (although we are moving away from technical reps, as the >>>>>>> CVs >>>>>>> are so tight). >>>>>>> >>>>>>> Can anyone help with this? Heidi? >>>>>>> >>>>>>> raw6=readCtData(files="Chip6.csv", format="BioMark", n.features=48, >>>>>>> n.data=48, samples=samples) >>>>>>> #Ive read the samples in from a separate file, as when you read it >>>>>>> in, >>>>>>> it >>>>>>> doesnt take the sample names supplied in the biomark output# >>>>>>> #Next, I want to plot genes of interest, with samples of interest, >>>>>>> and >>>>>>> I'm >>>>>>> having trouble getting an appropriate output# >>>>>>> >>>>>>> g=featureNames(raw6)[1:2] >>>>>>> plotCtOverview(raw6, genes=g, groups=groupID$Treatment, >>>>>>> col=rainbow(5)) >>>>>>> >>>>>>> #This plots 1 gene across all 48 samples# >>>>>>> #but the legend doesnt behave, its placed on top of the plot, and I >>>>>>> cant >>>>>>> get it to display in a non-overlapping fashion# >>>>>>> #I've tried all sorts of things in par, but nothing seems to shift >>>>>>> the >>>>>>> legend's position# >>>>>>> >>>>>> As the old saying goes, whenever you want a job done well, you'll have >>>>>> to >>>>>> do it yourself ;). In this case, the easiest thing is probably to use >>>>>> legend=FALSE in plotCtOverview, and then afterwards add it yourself in >>>>>> the >>>>>> desired location using legend(). That way, if you have a lot of >>>>>> different >>>>>> features or groups to display, you can also use the ncol parameter in >>>>>> legend to make several columns within the legend, such as 3x4 instead >>>>>> of >>>>>> the default 12x1. 
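The advice above about drawing the legend by hand can be sketched in base R graphics. The data here are toy values standing in for Ct summaries, and the group names are hypothetical. Only barplot() and legend() are used, so the sketch does not depend on HTqPCR; the commented plotCtOverview() call repeats the one from the thread and is not run.

```r
# Toy values standing in for per-group Ct summaries (hypothetical data).
ct <- setNames(seq(20, 31, length.out = 12), paste0("Group", 1:12))
cols <- rainbow(length(ct))

# In a non-interactive session, draw to a null PDF device so no file is written.
if (!interactive()) pdf(NULL)

# Leave head-room above the bars with ylim so the legend does not overlap them.
barplot(ct, col = cols, axisnames = FALSE, ylim = c(0, 45), ylab = "Ct")

# Place the legend manually; ncol = 4 lays the 12 entries out in
# 4 columns of 3 rows instead of a single column of 12.
lg <- legend("top", legend = names(ct), fill = cols, ncol = 4,
             cex = 0.8, bty = "n")

# With HTqPCR the same idea would start from the call in the thread (not run):
# plotCtOverview(raw6, genes = g, groups = groupID$Treatment, legend = FALSE)
# followed by a legend() call like the one above.
```

The position keyword ("top"), the number of columns and the ylim range are choices made for this toy plot and would need adjusting for real data.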
>>>>>> >>>>>> Alternatively, you can use either xlim or ylim in plotCtOverview to >>>>>> add >>>>>> some empty space on the side where there's then room for the legend. >>>>>> >>>>>>> #I now want to plot a subset of the samples for specific genes# >>>>>>>> LOY=subset(groupID,groupID$Treatment=="LO" | groupID$Treatment== >>>>>>>> "LFY") >>>>>>>> LOY >>>>>>> Sample Treatment >>>>>>> 2 L20 LFY >>>>>>> 5 L30 LFY >>>>>>> 7 L45 LO >>>>>>> 20 L40 LO >>>>>>> 27 L43 LO >>>>>>> 33 L29 LFY >>>>>>> 36 L38 LO >>>>>>> 40 L39 LO >>>>>>> 43 L23 LFY >>>>>>> >>>>>>> >>>>>>>> plotCtOverview(raw6, genes=g, groups=LOY, col=rainbow(5)) >>>>>>> Warning messages: >>>>>>> 1: In split.default(t(x), sample.split) : >>>>>>> data length is not a multiple of split variable >>>>>>> 2: In qt(p, df, lower.tail, log.p) : NaNs produced >>>>>>>> >>>>>> >>>>>> Does it make sense if you split by groups=LOY$Treatment? It looks like >>>>>> the >>>>>> object LOY itself is a data frame, rather than the expected vector. >>>>>> >>>>>> Also, you may have to 'repeat' the col=rainbow() argument to fit your >>>>>> number of features. >>>>>> >>>>>>> >>>>>>> #it displays the two groups defined by treatment, but doesnt do so >>>>>>> nicely, >>>>>>> very skinny bars, and the legend doesnt reflect what its displaying# >>>>>>> #again, I've tried monkeying around with par, but not sure what >>>>>>> HTqPCR >>>>>>> is >>>>>>> calling to make the plots# >>>>>>> >>>>>> If the bars are very skinny, it's probably because you're displaying a >>>>>> lot >>>>>> of features. Nothing much to do about that, except increasing the >>>>>> width >>>>>> or >>>>>> your plot :(. >>>>>> >>>>>> \Heidi >>>>>> >>>>>>> please help! >>>>>>> >>>>>>> thanks >>>>>>> >>>>>>> Simon. 
>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From dtenenba at fhcrc.org Fri Jun 29 19:34:46 2012 From: dtenenba at fhcrc.org (Dan Tenenbaum) Date: Fri, 29 Jun 2012 10:34:46 -0700 Subject: [BioC] XCMS / mzR / Rcpp dependency issue In-Reply-To: References: <4FEC2822.9050908@mineway.de> Message-ID: On Thu, Jun 28, 2012 at 3:16 AM, Laurent Gatto wrote: > Dear Uwe, > > Something is happening with mzR and Rcpp 0.9.12 on Windows - see [1] and [2]. > As a temporary fix, you can downgrade to Rcpp 0.9.10 [3] and proceed normally. > This problem is fixed in the latest release and devel versions of mzR, available now via biocLite(). mzR just needed a version bump to trigger a fresh build to propagate. Dan > Best wishes, > > Laurent > > [1] http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-June/thread.html#3940 > [2] https://stat.ethz.ch/pipermail/bioconductor/2012-June/thread.html#46503 > [3] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip > > > On 28 June 2012 10:47, Uwe Schmitt wrote: >> Hi, >> >> I have some problems to install xcms for R 2.5.1 on Windows 7, 64 bit. 
>> I follow the installation instructions and enter:
>>
>>> source("http://bioconductor.org/biocLite.R")
>>> biocLite("xcms", dep=T)
>>
>> this loads some packages and gives me a warning:
>> installed directory not writable, cannot update packages 'boot', 'class',
>> 'KernSmooth', 'MASS', 'nnet', 'rpart', 'spatial'
>>
>> If I want to check if xcms is installed I get:
>>
>>> require("xcms")
>>
>> Lade nötiges Paket: xcms
>> Lade nötiges Paket: mzR
>> Lade nötiges Paket: Rcpp
>> Error : .onLoad failed in loadNamespace() for 'mzR', details:
>> call: value[[3L]](cond)
>> error: failed to load module Ramp from package mzR
>> konnte Funktion "errorOccured" nicht finden
>> Failed with error: 'Paket 'mzR' konnte nicht geladen werden'
>>
>> I try to translate the messages to English:
>>
>> Loading required package: xcms
>> Loading required package: mzR
>> Loading required package: Rcpp
>> Error : .onLoad failed in loadNamespace() for 'mzR', details:
>> call: value[[3L]](cond)
>> error: failed to load module Ramp from package mzR
>> could not find function "errorOccured"
>> Failed with error: 'package 'mzR' could not be loaded'
>>
>> Any hints what is going wrong? I installed xcms on an older machine six
>> months ago and I had no problems at all.
>>
>> Kind Regards,
>>
>> Uwe
>>
>>
>> --
>> Dr. rer. nat. Uwe Schmitt
>> Leitung F/E Mathematik
>>
>> mineway GmbH
>> Gebäude 4
>> Im Helmerswald 2
>> 66121 Saarbrücken
>>
>> Telefon: +49 (0)681 8390 5334
>> Telefax: +49 (0)681 830 4376
>>
>> uschmitt at mineway.de
>> www.mineway.de
>>
>> Geschäftsführung: Dr.-Ing.
Mathias Bauer >> Amtsgericht Saarbr?cken HRB 12339 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From laurent.gatto at gmail.com Fri Jun 29 20:10:58 2012 From: laurent.gatto at gmail.com (Laurent Gatto) Date: Fri, 29 Jun 2012 19:10:58 +0100 Subject: [BioC] mzR error In-Reply-To: References: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org> <8C45594961256A438A8AC49E0BB51DBA028326C9E104@E2K7-MS2.ds.strath.ac.uk> Message-ID: Dear Gavin, The issue was the result of compiler/linker error and required a fresh mzR build, which is not available through biocLite. Best wishes, Laurent On 27 June 2012 10:27, Laurent Gatto wrote: > On 27 June 2012 10:21, Gavin Blackburn wrote: >> Hi Laurent, >> >> Thanks very much, I'll check the Rcpp list to see when it is solved and will downgrade it for now. > > It is not clear (at least to me) what happens [1]; it might fix itself > with new Rcpp and mzR binaries. I will post an update here, anyway. > > Best wishes, > > Laurent > > [1] http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-June/thread.html#3940 > >> Cheers, >> >> Gavin. 
>> >> -----Original Message----- >> From: Laurent Gatto [mailto:laurent.gatto at gmail.com] >> Sent: 26 June 2012 17:38 >> To: Gavin Blackburn [guest] >> Cc: bioconductor at r-project.org; Gavin Blackburn >> Subject: Re: [BioC] mzR error >> >> On 26 June 2012 17:19, Laurent Gatto wrote: >>> Dear Gavin, >>> >>> I can't reproduce this, but I do not have the same configuration at >>> hand for the moment - this could be an incompatibility with the latest >>> Rcpp. >> >> Ok, I can now reproduce on a Windows box with mzR 1.2.1 (latest >> stable) and Rcpp 0.9.12. Downgrading to Rcpp 0.9.10 [1] fixes the issue. I will bring it up on the Rcpp list. >> >> Thank you for the report. >> >> Laurent >> >> [1] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip >> >> >>> What version of mzR have you - packageVersion("mzR") >>> >>> Best wishes, >>> >>> Laurent >>> >>> On 26 June 2012 16:07, Gavin Blackburn [guest] wrote: >>>> >>>> We are getting the following error when trying to install and run mzR on a 64-bit Windows 7 machine: >>>>> library(mzR) >>>> Loading required package: Rcpp >>>> Error : .onLoad failed in loadNamespace() for 'mzR', details: >>>> ?call: value[[3L]](cond) >>>> ?error: failed to load module Ramp from package mzR could not find >>>> function "errorOccured" >>>> Error: package/namespace load failed for ???mzR??? >>>> >>>> >>>> Do you know what might be causing it? >>>> >>>> Cheers, >>>> >>>> Gavin. >>>> >>>> >>>> ?-- output of sessionInfo(): >>>> >>>> ?sessionInfo() >>>> R version 2.15.1 (2012-06-22) >>>> Platform: x86_64-pc-mingw32/x64 (64-bit) >>>> >>>> locale: >>>> [1] LC_COLLATE=English_United Kingdom.1252 [2] >>>> LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United >>>> Kingdom.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 >>>> >>>> attached base packages: >>>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base >>>> >>>> other attached packages: >>>> [1] Rcpp_0.9.12 ? ? ? ? 
BiocInstaller_1.4.7 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] Biobase_2.16.0 ? ? BiocGenerics_0.2.0 tools_2.15.1 >>>> >>>> -- >>>> Sent via the guest posting facility at bioconductor.org. >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> >>> -- >>> [ Laurent Gatto | slashhome.be ] >> >> >> >> -- >> [ Laurent Gatto | slashhome.be ] > > > > -- > [ Laurent Gatto | slashhome.be ] -- [ Laurent Gatto | slashhome.be ] From laurent.gatto at gmail.com Fri Jun 29 20:13:03 2012 From: laurent.gatto at gmail.com (Laurent Gatto) Date: Fri, 29 Jun 2012 19:13:03 +0100 Subject: [BioC] mzR error In-Reply-To: References: <20120626150701.A4B1F13ADB0@mamba.fhcrc.org> <8C45594961256A438A8AC49E0BB51DBA028326C9E104@E2K7-MS2.ds.strath.ac.uk> Message-ID: On 29 June 2012 19:10, Laurent Gatto wrote: > Dear Gavin, > > The issue was the result of compiler/linker error and required a fresh > mzR build, which is not available through biocLite. ...which is *now* available through biocLite. > Best wishes, > > Laurent > > On 27 June 2012 10:27, Laurent Gatto wrote: >> On 27 June 2012 10:21, Gavin Blackburn wrote: >>> Hi Laurent, >>> >>> Thanks very much, I'll check the Rcpp list to see when it is solved and will downgrade it for now. >> >> It is not clear (at least to me) what happens [1]; it might fix itself >> with new Rcpp and mzR binaries. I will post an update here, anyway. >> >> Best wishes, >> >> Laurent >> >> [1] http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2012-June/thread.html#3940 >> >>> Cheers, >>> >>> Gavin. 
>>> >>> -----Original Message----- >>> From: Laurent Gatto [mailto:laurent.gatto at gmail.com] >>> Sent: 26 June 2012 17:38 >>> To: Gavin Blackburn [guest] >>> Cc: bioconductor at r-project.org; Gavin Blackburn >>> Subject: Re: [BioC] mzR error >>> >>> On 26 June 2012 17:19, Laurent Gatto wrote: >>>> Dear Gavin, >>>> >>>> I can't reproduce this, but I do not have the same configuration at >>>> hand for the moment - this could be an incompatibility with the latest >>>> Rcpp. >>> >>> Ok, I can now reproduce on a Windows box with mzR 1.2.1 (latest >>> stable) and Rcpp 0.9.12. Downgrading to Rcpp 0.9.10 [1] fixes the issue. I will bring it up on the Rcpp list. >>> >>> Thank you for the report. >>> >>> Laurent >>> >>> [1] http://cran.us.r-project.org/bin/windows/contrib/2.13/Rcpp_0.9.10.zip >>> >>> >>>> What version of mzR have you - packageVersion("mzR") >>>> >>>> Best wishes, >>>> >>>> Laurent >>>> >>>> On 26 June 2012 16:07, Gavin Blackburn [guest] wrote: >>>>> >>>>> We are getting the following error when trying to install and run mzR on a 64-bit Windows 7 machine: >>>>>> library(mzR) >>>>> Loading required package: Rcpp >>>>> Error : .onLoad failed in loadNamespace() for 'mzR', details: >>>>> ?call: value[[3L]](cond) >>>>> ?error: failed to load module Ramp from package mzR could not find >>>>> function "errorOccured" >>>>> Error: package/namespace load failed for ???mzR??? >>>>> >>>>> >>>>> Do you know what might be causing it? >>>>> >>>>> Cheers, >>>>> >>>>> Gavin. >>>>> >>>>> >>>>> ?-- output of sessionInfo(): >>>>> >>>>> ?sessionInfo() >>>>> R version 2.15.1 (2012-06-22) >>>>> Platform: x86_64-pc-mingw32/x64 (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_COLLATE=English_United Kingdom.1252 [2] >>>>> LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United >>>>> Kingdom.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 >>>>> >>>>> attached base packages: >>>>> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? 
base >>>>> >>>>> other attached packages: >>>>> [1] Rcpp_0.9.12 ? ? ? ? BiocInstaller_1.4.7 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] Biobase_2.16.0 ? ? BiocGenerics_0.2.0 tools_2.15.1 >>>>> >>>>> -- >>>>> Sent via the guest posting facility at bioconductor.org. >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> >>>> -- >>>> [ Laurent Gatto | slashhome.be ] >>> >>> >>> >>> -- >>> [ Laurent Gatto | slashhome.be ] >> >> >> >> -- >> [ Laurent Gatto | slashhome.be ] > > > > -- > [ Laurent Gatto | slashhome.be ] -- [ Laurent Gatto | slashhome.be ] From primej at MedImmune.com Fri Jun 29 20:46:21 2012 From: primej at MedImmune.com (Prime, John) Date: Fri, 29 Jun 2012 18:46:21 +0000 Subject: [BioC] Exonmap & xmapcore covdesc variable problem Message-ID: <2D4DB57D287AD14DA261524E57B48BF136E71A97@GBCB1EMP001.medimmune.com> Dear All I'm fairly new to R & Bioconductor and am using Exonmap to analyse a large batch of affy exon arrays. All was working ok with the original covdesc file which just listed the location and file name of each .cel, but I needed to apply a more complex design table to the data so reran the script with the updated covdesc only to find it's having problems. 

Here's an example segment of the problem covdesc file:

                        expt.grp  id        cell.line  tissue  genetic.background  phase  tissue.of.origin  sex
Phase_IId/P815-S28.CEL  P815_Sp   P815-S28  P815       Spleen  DBA_2J              2d     Mast_cells        Female
Phase_IId/P815-S29.CEL  P815_Sp   P815-S29  P815       Spleen  DBA_2J              2d     Mast_cells        Female
Phase_IId/P815-S30.CEL  P815_Sp   P815-S30  P815       Spleen  DBA_2J              2d     Mast_cells        Female

My script is as follows:

Sys.setenv(R_XMAP_CONF_DIR="~/.exonmap")
library(exonmap)
xmapConnect("hssl")
setwd("/home/primej/Desktop/Syngeneic_Data")
raw.data <- read.exon(covdesc="covdesc.txt")
raw.data@cdfName <- "exon.pmcdf"
tes <- rma(raw.data)

The new covdesc.txt file doesn't create any errors when being read, and the background correction, RMA normalisation and calculation of expression all work fine; but when I call pData(tes) to access the pheno data the actual variable entries all contain . str(tes) shows the same problem. See below for the pData output:

              sample  expt.grp  id  cell.line  tissue  genetic.background  phase  tissue.of.origin  sex
P815-S28.CEL  1
P815-S29.CEL  2
P815-S30.CEL  3

The covdesc file is read without any errors being flagged, and the file names and column headers all read ok, so I have no clear idea why the variable entries aren't accepted while the rest is. Are there length or character limits for variables in covdesc files? (I could find no info on this.) I got the same problem with xmapcore as well.

Any ideas what is causing it to fail to read the variables? I'm hoping it's something obvious that can easily be fixed and most likely due to my inexperience. I've given the sessionInfo at the bottom.

Many thanks for your help.
John Prime ================================================================== sessionInfo() R version 2.12.1 (2010-12-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] exon.pmcdf_1.1 exonmap_2.8.0 RMySQL_0.8-0 DBI_0.2-5 RColorBrewer_1.0-5 [6] genefilter_1.32.0 affy_1.28.1 Biobase_2.10.0 loaded via a namespace (and not attached): [1] affyio_1.18.0 annotate_1.28.1 AnnotationDbi_1.12.1 preprocessCore_1.12.0 [5] RSQLite_0.9-4 splines_2.12.1 survival_2.36-2 tools_2.12.1 [9] xtable_1.5-6 To the extent this electronic communication or any of its attachments contain information that is not in the public domain, such information is considered by MedImmune to be confidential and proprietary. This communication is expected to be read and/or used only by the individual(s) for whom it is intended. If you have received this electronic communication in error, please reply to the sender advising of the error in transmission and delete the original message and any accompanying documents from your system immediately, without copying, reviewing or otherwise using them for any purpose. Thank you for your cooperation. From tim.triche at gmail.com Fri Jun 29 22:24:59 2012 From: tim.triche at gmail.com (Tim Triche, Jr.) Date: Fri, 29 Jun 2012 13:24:59 -0700 Subject: [BioC] =?windows-1252?q?GenomicFeatures_installation_error_--_obj?= =?windows-1252?q?ect_=91readDNAStringSet=92_is_not_exported_by_=27?= =?windows-1252?q?namespace=3ABiostrings=27?= Message-ID: An embedded and charset-unspecified text was scrubbed... 
Name: not available URL: From jluo.rhelp at gmail.com Fri Jun 29 22:41:01 2012 From: jluo.rhelp at gmail.com (Jack Luo) Date: Fri, 29 Jun 2012 16:41:01 -0400 Subject: [BioC] question regarding using cummeRbund package Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From dtenenba at fhcrc.org Fri Jun 29 22:42:34 2012 From: dtenenba at fhcrc.org (Dan Tenenbaum) Date: Fri, 29 Jun 2012 13:42:34 -0700 Subject: [BioC] =?windows-1252?q?GenomicFeatures_installation_error_--_obj?= =?windows-1252?q?ect_=91readDNAStringSet=92_is_not_exported_by_=27?= =?windows-1252?q?namespace=3ABiostrings=27?= In-Reply-To: References: Message-ID: Looks like you are mixing and matching release and devel packages (AnnotationDbi_1.19.15 is a devel package). This can lead to unpredictable results. You can either maintain two separate R installations for release and devel, or maintain one but use the technique described here: http://bioconductor.org/developers/useDevel/ Dan On Fri, Jun 29, 2012 at 1:24 PM, Tim Triche, Jr. wrote: > Installing package(s) 'GenomicFeatures' > trying URL ' > http://www.bioconductor.org/packages/2.10/bioc/src/contrib/GenomicFeatures_1.8.2.tar.gz > ' > Content type 'application/x-gzip' length 1745766 bytes (1.7 Mb) > opened URL > ================================================== > downloaded 1.7 Mb > > * installing *source* package ?GenomicFeatures? ... > ** R > ** inst > ** preparing package for lazy loading > Error : object ?readDNAStringSet? is not exported by 'namespace:Biostrings' > ERROR: lazy loading failed for package ?GenomicFeatures? > * removing ?/usr/lib64/R/library/GenomicFeatures? > * restoring previous ?/usr/lib64/R/library/GenomicFeatures? > > The downloaded source packages are in > ? ? ? ??/tmp/Rtmp3aHGQx/downloaded_packages? > Updating HTML index of packages in '.Library' > Making packages.html ?... 
done > > sessionInfo(): > > R> sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C > LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 > LC_MONETARY=en_US.UTF-8 ? ?LC_MESSAGES=en_US.UTF-8 > ?[7] LC_PAPER=C ? ? ? ? ? ? ? ? LC_NAME=C ? ? ? ? ? ? ? ? ?LC_ADDRESS=C > ? ? ? ? ?LC_TELEPHONE=C ? ? ? ? ? ? LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] stats ? ? graphics ?grDevices datasets ?utils ? ? methods ? base > > other attached packages: > ?[1] AnnotationDbi_1.19.15 GenomicRanges_1.9.28 ?IRanges_1.15.15 > matrixStats_0.5.0 ? ? MASS_7.3-19 ? ? ? ? ? methylumi_2.3.4 > ggplot2_0.9.1 > ?[8] reshape2_1.2.1 ? ? ? ?scales_0.2.1 ? ? ? ? ?Biobase_2.17.6 > ?BiocGenerics_0.3.0 ? ?BiocInstaller_1.4.7 ? dataframe_2.5 > devtools_0.7 > [15] gtools_2.7.0 > > loaded via a namespace (and not attached): > ?[1] annotate_1.35.2 ? ?biomaRt_2.13.1 ? ? Biostrings_2.24.1 > ?colorspace_1.1-1 ? DBI_0.2-5 ? ? ? ? ?dichromat_1.2-4 ? ?digest_0.5.2 > ?grid_2.15.0 ? ? ? ?httr_0.1.1 > [10] labeling_0.1 ? ? ? lattice_0.20-6 ? ? memoise_0.1 ? ? ? ?munsell_0.3 > ? ? plyr_1.7.1 ? ? ? ? proto_0.3-9.2 ? ? ?R.methodsS3_1.4.2 > ?RColorBrewer_1.0-5 RCurl_1.91-1 > [19] RSQLite_0.11.1 ? ? stats4_2.15.0 ? ? ?stringr_0.6 ? ? ? ?tools_2.15.0 > ? ? ?XML_3.9-4 ? ? ? ? ?xtable_1.7-0 ? ? ? zlibbioc_1.3.0 > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper > > ? ? ? ?[[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From tim.triche at gmail.com Fri Jun 29 22:44:36 2012 From: tim.triche at gmail.com (Tim Triche, Jr.) 
Date: Fri, 29 Jun 2012 13:44:36 -0700 Subject: [BioC] =?windows-1252?q?GenomicFeatures_installation_error_--_obj?= =?windows-1252?q?ect_=91readDNAStringSet=92_is_not_exported_by_=27?= =?windows-1252?q?namespace=3ABiostrings=27?= In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From tim.triche at gmail.com Fri Jun 29 22:47:41 2012 From: tim.triche at gmail.com (Tim Triche, Jr.) Date: Fri, 29 Jun 2012 13:47:41 -0700 Subject: [BioC] =?windows-1252?q?GenomicFeatures_installation_error_--_obj?= =?windows-1252?q?ect_=91readDNAStringSet=92_is_not_exported_by_=27?= =?windows-1252?q?namespace=3ABiostrings=27?= In-Reply-To: References: Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available URL: From dtenenba at fhcrc.org Fri Jun 29 22:56:28 2012 From: dtenenba at fhcrc.org (Dan Tenenbaum) Date: Fri, 29 Jun 2012 13:56:28 -0700 Subject: [BioC] question regarding using cummeRbund package In-Reply-To: References: Message-ID: Hi Jack, On Fri, Jun 29, 2012 at 1:41 PM, Jack Luo wrote: > Hi, > > I am a newbie to RNA-seq data and try to learn using the cummeRbund > package. Some simple questions are: > > 1. My OS is windows 7. I have no problem installing the cummeRbund in R, > but do I need to install cufflinks? I thought there is only linux and mac > version of cufflinks, not windows. Although the vignette says you need cufflinks, you don't, really. You just need some files produced by cufflinks. > > 2. When I was running the example provided in: > http://bioconductor.org/packages/devel/bioc/vignettes/cummeRbund/inst/doc/cummeRbund-manual.pdf > > I run into several errors: > A. Error: could not find function "dispersionPlot" > B. dend.rep<-csDendro(genes(cuff),replicates=T) > Error in csDendro(genes(cuff), replicates = T) : > ?unused argument(s) (replicates = T) > C. 
mCount<-MAplot(genes(cuff),"hESC","Fibroblasts",useCount=T) > Error in .local(object, x, y, logMode, pseudocount, ...) : > ?unused argument(s) (useCount = TRUE) > How did you install cummeRbund? Try installing it as follows: biocLite("cummeRbund") Also, when reporting errors, please report the command(s) that caused the errors and include the output of sessionInfo(). Thanks, Dan > There are some other similar errors. > > Thanks, > > -Jack > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor From aditi_rambani at yahoo.com Sat Jun 30 01:54:44 2012 From: aditi_rambani at yahoo.com (Aditi Rambani) Date: Fri, 29 Jun 2012 16:54:44 -0700 (PDT) Subject: [BioC] Contrast Problem In-Reply-To: References: Message-ID: <1341014084.75215.YahooMailNeo@web113108.mail.gq1.yahoo.com> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From smyth at wehi.EDU.AU Sat Jun 30 02:19:19 2012 From: smyth at wehi.EDU.AU (Gordon K Smyth) Date: Sat, 30 Jun 2012 10:19:19 +1000 (AUS Eastern Standard Time) Subject: [BioC] Contrast Problem In-Reply-To: <1341014084.75215.YahooMailNeo@web113108.mail.gq1.yahoo.com> References: <1341014084.75215.YahooMailNeo@web113108.mail.gq1.yahoo.com> Message-ID: Dear Aditi, The reason you are getting unexpected results is that your contrasts are incorrect. The meaning of a contrast is that it makes a comparison between the coefficients of the linear model that you have fitted. 
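
As a small illustration of that point, the sketch below rebuilds the interaction design from the factor layout in the quoted script and fits it to made-up, noise-free cell means. It is only a sketch under stated assumptions: the cell means are hypothetical, and an ordinary least-squares fit (lm.fit) is used in place of edgeR's negative binomial GLM, so only the interpretation of the coefficients carries over, not the numbers.

```r
# Factors laid out as in the quoted script (3 replicates per cell).
accessions <- factor(rep(1:4, each = 6))
genomes    <- factor(rep(rep(1:2, each = 3), times = 4))
design2    <- model.matrix(~ accessions * genomes)

# Hypothetical noise-free cell means (rows: accessions, columns: genomes).
cell_means <- matrix(c(5, 7, 6, 9,    # genome 1, accessions 1 to 4
                       9, 7, 4, 9),   # genome 2, accessions 1 to 4
                     nrow = 4)
y <- cell_means[cbind(as.integer(accessions), as.integer(genomes))]

# With exact data the coefficients are simple functions of the cell means.
beta <- lm.fit(design2, y)$coefficients

beta[["(Intercept)"]]   # mean of accession 1, genome 1
beta[["accessions2"]]   # accession 2 minus accession 1, within genome 1
beta[["genomes2"]]      # genome 2 minus genome 1, within accession 1

# c(1,-1,0,...) is the intercept minus the accession-2 effect, that is
# 2 * mean(acc1, gen1) - mean(acc2, gen1): not a genome comparison.
odd_contrast <- sum(c(1, -1, 0, 0, 0, 0, 0, 0) * beta)
odd_contrast
```

In this parameterisation the genome difference within the first accession is the single coefficient genomes2, which is why a vector such as c(1,-1,0,0,0,0,0,0) does not test it.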
Here are the column names of your design matrix: > colnames(design2) "(Intercept)" "accessions2" "accessions3" "accessions4" "genomes2" "accessions2:genomes2" "accessions3:genomes2" "accessions4:genomes2" So when you take the contrast c(1,-1,0,0,0,0,0,0) you are comparing the first coefficient, which is the intercept term, with the second coefficient, which is a difference between accession2 and accession1 for the first genome. It is not a comparison of any interest. The interaction model that you have fitted is inappropriate for the questions that you want to answer. I suggest that you instead redefine a single experiment factor with 8 levels, corresponding to all combinations of accession and genome, so that you can fit your model as a one way layout. I think that will produce coefficients that you will find easier to understand and to take contrasts of. I will not be able to provide more help for some days. Best wishes Gordon On Fri, 29 Jun 2012, Aditi Rambani wrote: Hello, Thanks for your reply, sorry for not being clear enough. > When you say different expression (DE) between genomes for all the > accessions, do you mean DE between genomes for each of the accessions > separately?? That is DE genes for A.F1 vs D.F1, for A.Tom vs D.Tom, for > A.Tx vs D.Tx and for A.Mx vs D.Mx? Yes, that is what i meant. We do a contrast for A.F1 vs D.F1 like this : [contrast=c(1,-1,0,0,0,0,0,0)].? Our problem is that even when contrast is between A.F1(0 rpkm) ?vs D. F1 (0 rpkm) it has significant FDR, I dont understand how it can contrast two columns with zero values and show a significant differential expression. Also, it sometimes ignores genes with?significant?bias but that number is not very high. > When you say DE between accessions, is that for each genome separately? > In other words, do you expect the differences between the accessions to > be relatively the same for each genome, or will the differences between > accessions be genome-specific? 

We are not absolutely certain about what differences to expect from genomes between accessions. We do know that some accessions are more closely related than others and we can assume they will have a similar differential expression pattern. For accessions we did our contrasts like this:

acc1 = Mx vs Tx [lrt.acc1 <- glmLRT(dge2, fit2, contrast=c(0,0,1,1,-1,-1,0,0))]
acc2 = Tom vs Mx/Tx [lrt.acc2 <- glmLRT(dge2, fit2, contrast=c(0,0,1,1,1,1,-2,-2))]
acc3 = F1 vs Mx/Tx/Tom [lrt.acc3 <- glmLRT(dge2, fit2, contrast=c(-3,-3,1,1,1,1,1,1))]

I have no way to test the accuracy of these results because I don't know what is expected. I was hoping that if we implemented the package correctly we could trust the results.

> Your comments about zero expression and not detecting significantly DE
> genes don't make sense to me. I won't try to respond to those comments
> because I think sorting out the above questions will probably solve
> other perceived problems as well.

I will appreciate your input on this matter and thanks for looking into this.

Aditi

> Date: Thu, 28 Jun 2012 12:01:50 -0700
> From: Aditi Rambani
> To: "bioconductor at stat.math.ethz.ch"
> Subject: [BioC] Contrast Problem
>
> Hello,
>
> I am a graduate student at Brigham Young University working on polyploid
> cotton RNA seq data. Our study design has two explanatory variables, one
> is 'accession' with four levels (F1,Tom,Tx,Mx) and other one is 'genome'
> with two levels (A genome or D genome). We want to detect differential
> expression of genes between 'genomes' from all the accessions and also
> find genes that are differentially expressed between accessions. We
> built a (Accession*Genome) model and did a contrast for two levels of
> 'genomes'. In contrast results we see that many genes with zero
> expression (0 RPKM) have significant FDRs and some significantly
> differentially expressed genes are not detected. We are not sure why
> this is happening, any help will be greatly appreciated.
>
> Thanks
>
> Aditi
>
> We are using following script to do our analysis but
>
> library("edgeR")
>
> counts <- read.table(INFILE, header=T, row.names=1)
> groups <- factor(c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8))
> accessions <- factor(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4))
> genomes <- factor(c(1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2))
>
> design2 <- model.matrix(~accessions*genomes)
>
> dge <- DGEList(counts=counts, group=groups)
> dge <- calcNormFactors(dge)
> dge2 <- estimateGLMCommonDisp(dge, design2)
> dge2 <- estimateGLMTrendedDisp(dge2, design2)
> dge2 <- estimateGLMTagwiseDisp(dge2, design2)
>
> fit2 <- glmFit(dge2, design2)
>
> lrt.acc1 <- glmLRT(dge2, fit2, contrast=c(0,0,1,1,-1,-1,0,0))
> lrt.acc2 <- glmLRT(dge2, fit2, contrast=c(0,0,1,1,1,1,-2,-2))
> lrt.acc3 <- glmLRT(dge2, fit2, contrast=c(-3,-3,1,1,1,1,1,1))
> lrt.F1 <- glmLRT(dge2, fit2, contrast=c(1,-1,0,0,0,0,0,0))
> lrt.Mx <- glmLRT(dge2, fit2, contrast=c(0,0,1,-1,0,0,0,0))
> lrt.Tx <- glmLRT(dge2, fit2, contrast=c(0,0,0,0,1,-1,0,0))
> lrt.Tom <- glmLRT(dge2, fit2, contrast=c(0,0,0,0,0,0,1,-1))
>
> write.table(topTags(lrt.acc1, n=10000), file="acc1.results", sep="\t", quote=F) > write.table(topTags(lrt.acc2, n=10000), file="acc2.results", sep="\t", quote=F) > write.table(topTags(lrt.acc3, n=10000), file="acc3.results", sep="\t", quote=F) > write.table(topTags(lrt.F1, n=10000), file="F1.results", sep="\t", quote=F) > write.table(topTags(lrt.Mx, n=10000), file="Mx.results", sep="\t", quote=F) > write.table(topTags(lrt.Tx, n=10000), file="Tx.results", sep="\t", quote=F) > write.table(topTags(lrt.Tom, n=10000), file="Tom.results", sep="\t", quote=F) ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}} From hpages at fhcrc.org Sat Jun 30 09:42:02 2012 From: hpages at fhcrc.org (=?ISO-8859-1?Q?Herv=E9_Pag=E8s?=) Date: Sat, 30 Jun 2012 00:42:02 -0700 Subject: [BioC] Cleaning up after getSeq(BSgenome, GRanges) In-Reply-To: References: Message-ID: <4FEEADCA.5020907@fhcrc.org> Hi Steve, The intention was really that the DNAStringSet object returned by getSeq() would not hold any reference to the chromosomes that getSeq() would load in the cache during the extraction so everything would get automatically uncached at the first gc() opportunity after getSeq() returns. Unfortunately this was broken because of an issue with a low-level helper in IRanges (the "xvcopy" method for XRawList objects to be precise). 
The problem is fixed in IRanges 1.15.16 (I'll apply the fix to release too):

> library(BSgenome.Hsapiens.UCSC.hg19)

> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1265019 67.6    1710298 91.4  1476915 78.9
Vcells  585626  4.5    1162592  8.9   901241  6.9

> options(verbose=TRUE)  # so uncaching events will be reported

## Extracting the first 10 nucleotides from each chromosome:
> first10 <- getSeq(Hsapiens, end=10)
uncaching chr1
uncaching chr10
uncaching chr11_gl000202_random
uncaching chr11
uncaching chr12
uncaching chr13
uncaching chr15
uncaching chr14
uncaching chr16
uncaching chr17_gl000203_random
uncaching chr17_gl000206_random
uncaching chr19
uncaching chr19_gl000208_random
uncaching chr18_gl000207_random
uncaching chr18
uncaching chr17_gl000205_random
uncaching chr17_gl000204_random
uncaching chr17_ctg5_hap1
uncaching chr1_gl000192_random
uncaching chr1_gl000191_random
uncaching chr19_gl000209_random
uncaching chr17
uncaching chr2
uncaching chr21_gl000210_random
uncaching chr21
uncaching chr20
uncaching chr22
uncaching chr3
uncaching chr4_gl000193_random
uncaching chr4_ctg9_hap1
uncaching chr4_gl000194_random
uncaching chr4
uncaching chr5
uncaching chr6_cox_hap2
uncaching chr6_dbb_hap3
uncaching chr6_apd_hap1
uncaching chr6_mcf_hap5
uncaching chr6_mann_hap4
uncaching chr6
uncaching chr7
uncaching chr7_gl000195_random
uncaching chr6_ssto_hap7
uncaching chr6_qbl_hap6
uncaching chr8_gl000197_random
uncaching chr8_gl000196_random
uncaching chr8
uncaching chr9_gl000199_random
uncaching chrM
uncaching chrUn_gl000213
uncaching chrUn_gl000214
uncaching chrUn_gl000212
uncaching chrUn_gl000211
uncaching chr9_gl000201_random
uncaching chr9_gl000200_random
uncaching chr9_gl000198_random
uncaching chrUn_gl000217
uncaching chrUn_gl000220
uncaching chrUn_gl000223
uncaching chrUn_gl000227
uncaching chrUn_gl000230
uncaching chrUn_gl000234
uncaching chrUn_gl000238
uncaching chrUn_gl000242
uncaching chrUn_gl000243
uncaching chrUn_gl000241
uncaching chrUn_gl000240
uncaching chrUn_gl000239
uncaching chrUn_gl000237
uncaching chrUn_gl000236
uncaching chrUn_gl000235
uncaching chrUn_gl000233
uncaching chrUn_gl000232
uncaching chrUn_gl000231
uncaching chrUn_gl000229
uncaching chrUn_gl000228
uncaching chrUn_gl000226
uncaching chrUn_gl000225
uncaching chrUn_gl000224
uncaching chrUn_gl000222
uncaching chrUn_gl000221
uncaching chrUn_gl000219
uncaching chrUn_gl000218
uncaching chrUn_gl000216
uncaching chrUn_gl000215
uncaching chrUn_gl000246
uncaching chrUn_gl000249
uncaching chrUn_gl000248
uncaching chrUn_gl000247
uncaching chrUn_gl000245
uncaching chrUn_gl000244
uncaching chrX
uncaching chr9

> first10
  A DNAStringSet instance of length 93
     width seq
 [1]    10 NNNNNNNNNN
 [2]    10 NNNNNNNNNN
 [3]    10 NNNNNNNNNN
 [4]    10 NNNNNNNNNN
 [5]    10 NNNNNNNNNN
 [6]    10 NNNNNNNNNN
 [7]    10 NNNNNNNNNN
 [8]    10 NNNNNNNNNN
 [9]    10 NNNNNNNNNN
 ...   ... ...
[85]    10 GATCTGAAGA
[86]    10 GATCATGCCT
[87]    10 GATCTTCAGG
[88]    10 GATCTGCGCA
[89]    10 GATCAGATAG
[90]    10 GATCTTAAGC
[91]    10 GATCTAAGTT
[92]    10 GATCTGTCAT
[93]    10 GATCACCAAG

> ls(Hsapiens@.seqs_cache)
[1] "chrY"

> gc()
Garbage collection 177 = 120+21+36 (level 2) ...
69.6 Mbytes of cons cells used (66%)
61.8 Mbytes of vectors used (17%)
uncaching chrY
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1301932 69.6    1967602 105.1  1967602 105.1
Vcells 8094983 61.8   48876866 373.0 58058596 443.0

> ls(Hsapiens@.seqs_cache)
character(0)

> gc()
Garbage collection 178 = 120+21+37 (level 2) ...
69.5 Mbytes of cons cells used (66%)
4.6 Mbytes of vectors used (2%)
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1300073 69.5    1967602 105.1  1967602 105.1
Vcells  600775  4.6   39101492 298.4 58058596 443.0

Memory used is almost the same as before getSeq() was called.

Thanks for reporting the issue!

H.

On 06/27/2012 10:20 AM, Steve Lianoglou wrote:
> Howdy,
>
> Say I'd like to fetch muchos sequences from hg19 that are defined in a
> GRanges object that spans all hg19 chromosomes.
>
> I can make my life easy and do:
>
> R> library(BSgenome.Hsapiens.UCSC.hg19)
> R> seqs <- getSeq(Hsapiens, my.GRanges)
>
> But while my life has been made easy, life for my CPU has been made
> harder as I (think that I) have now all of the Hsapiens chromosomes
> loaded up into (I think) the Hsapiens@.seqs_cache.
>
> I reckon I can do something like:
>
> R> rm(list=ls(Hsapiens@.seqs_cache), envir=Hsapiens@.seqs_cache)
> R> gc()
>
> to try to remedy the situation myself, but I wonder if I'm missing
> something else?
>
> Perhaps having a clearCache,BSgenome method to do some cleanup might be handy?
>
> Thanks,
> -steve
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

From mailinglist.honeypot at gmail.com Sat Jun 30 20:35:26 2012
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Sat, 30 Jun 2012 14:35:26 -0400
Subject: [BioC] Cleaning up after getSeq(BSgenome, GRanges)
In-Reply-To: <4FEEADCA.5020907@fhcrc.org>
References: <4FEEADCA.5020907@fhcrc.org>
Message-ID:

Thank you very much!

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
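
For anyone still on an IRanges version without the fix, the manual cleanup proposed in this thread can be wrapped in a small helper. This is a rough sketch under stated assumptions: clear_env_cache() is a hypothetical name and not a BSgenome function, and toy_cache is a plain environment standing in for the cache so that the example runs without any genome package installed. On an affected setup one would presumably pass Hsapiens@.seqs_cache, the slot named earlier in the thread, although poking at an internal slot like this is not something the thread confirms as supported.

```r
# Hypothetical helper (not part of BSgenome): remove every object from an
# environment that is being used as a cache, and return the removed names.
clear_env_cache <- function(cache_env) {
  stopifnot(is.environment(cache_env))
  cached <- ls(cache_env, all.names = TRUE)
  rm(list = cached, envir = cache_env)
  invisible(cached)
}

# Stand-in for a sequence cache; the real one would hold chromosomes.
toy_cache <- new.env()
assign("chr1", raw(1e6), envir = toy_cache)
assign("chr2", raw(1e6), envir = toy_cache)

removed <- clear_env_cache(toy_cache)
invisible(gc())  # give R a chance to reclaim the memory right away

print(removed)
print(ls(toy_cache))
```

With the fix in place this should not be necessary, since the cache empties itself once gc() runs after getSeq() returns.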