[R] filter a tab delimited text file

Duke duke.lists at gmx.com
Fri Sep 10 23:17:17 CEST 2010


On 9/10/10 4:24 PM, Gabor Grothendieck wrote:
> On Fri, Sep 10, 2010 at 4:20 PM, Duke <duke.lists at gmx.com> wrote:
>> On 9/10/10 2:49 PM, Gabor Grothendieck wrote:
>>> On Fri, Sep 10, 2010 at 1:24 PM, Duke <duke.lists at gmx.com> wrote:
>>>>   Hi all,
>>>>
>>>> I have to filter a tab-delimited text file like below:
>>>>
>>>> "GeneNames"    "value1"    "value2"    "log2(Fold_change)"    "log2(Fold_change) normalized"    "Signature(abs(log2(Fold_change) normalized) > 4)"
>>>> ENSG00000209350    4    35    -3.81131293562629    -4.14357714689656    TRUE
>>>> ENSG00000177133    142    2    5.46771720082336    5.13545298955309    FALSE
>>>> ENSG00000116285    115    1669    -4.54130810709955    -4.87357231836982    TRUE
>>>> ENSG00000009724    10    162    -4.69995182667858    -5.03221603794886    FALSE
>>>> ENSG00000162460    3    31    -4.05126372834704    -4.38352793961731    TRUE
>>>>
>>>> based on the last column (keeping only the TRUE rows), and then write
>>>> the result to a new text file. I should get something like this:
>>>>
>>>> "GeneNames"    "value1"    "value2"    "log2(Fold_change)"    "log2(Fold_change) normalized"    "Signature(abs(log2(Fold_change) normalized) > 4)"
>>>> ENSG00000209350    4    35    -3.81131293562629    -4.14357714689656    TRUE
>>>> ENSG00000116285    115    1669    -4.54130810709955    -4.87357231836982    TRUE
>>>> ENSG00000162460    3    31    -4.05126372834704    -4.38352793961731    TRUE
>>>>
>>>> I used read.table and write.table, but I am still not very satisfied
>>>> with the results. Here is what I did:
>>>>
>>>> expFC <- read.table("test.txt", header = TRUE, sep = "\t")
>>>> expFC.TRUE <- expFC[expFC[[ncol(expFC)]] == "TRUE", ]
>>>> write.table(expFC.TRUE, file = "test_TRUE.txt", row.names = FALSE, sep = "\t")
>>>>
>>>> Result:
>>>>
>>>> "GeneNames"    "value1"    "value2"    "log2.Fold_change."    "log2.Fold_change..normalized"    "Signature.abs.log2.Fold_change..normalized....4."
>>>> "ENSG00000209350"    4    35    -3.81131293562629    -4.14357714689656    TRUE
>>>> "ENSG00000116285"    115    1669    -4.54130810709955    -4.87357231836982    TRUE
>>>> "ENSG00000162460"    3    31    -4.05126372834704    -4.38352793961731    TRUE
>>>>
>>>> As you can see, there are two problems:
>>>>
>>>> 1. The headers were altered. All the special characters were converted
>>>> to dots (.).
>>>> 2. The gene names (first column) were quoted, which they were not in
>>>> the original file.
>>>>
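Both points can probably be handled without leaving read.table/write.table. The sketch below is only an illustration: the function name filter_true_rows is made up for this example, and it assumes the last column holds TRUE/FALSE as in the sample above. check.names = FALSE stops read.table from rewriting the headers, quote = FALSE stops write.table from quoting the gene names, and the header line is copied from the input so its original quotes survive.

```r
# Hypothetical helper (not from the original post): keep rows whose last
# column is TRUE, leaving the header line and the gene names as they were.
filter_true_rows <- function(infile, outfile) {
  dat <- read.table(infile, header = TRUE, sep = "\t",
                    check.names = FALSE,       # keep "(", ")", ">" etc. in names
                    stringsAsFactors = FALSE)  # gene names stay character
  # Works whether the last column was read as logical or as character
  keep <- which(dat[[ncol(dat)]] == "TRUE")
  # Copy the header line exactly as it appears in the input file
  writeLines(readLines(infile, n = 1), outfile)
  # Append the selected rows without quotes and without a second header
  write.table(dat[keep, ], file = outfile, sep = "\t", append = TRUE,
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}

# e.g., with the file names used above:
# filter_true_rows("test.txt", "test_TRUE.txt")
```

Note that the numeric columns still pass through R's number formatting on the way out, so this approach does not guarantee that every digit is reproduced character for character.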
>>> This will copy input lines matching pattern as well as the header to
>>> the output verbatim preserving all quotes, spacing, etc.
>>>
>>> myFilter <- function(infile, outfile, pattern = "TRUE$") {
>>>         L <- readLines(infile)
>>>         # sep = "" keeps cat() from adding a space before each newline
>>>         cat(L[1], "\n", file = outfile, sep = "")
>>>         L2 <- grep(pattern, L[-1], value = TRUE)
>>>         for (el in L2) cat(el, "\n", file = outfile, append = TRUE, sep = "")
>>> }
>>>
>>> # e.g.
>>> myFilter("infile.txt", "outfile.txt")
>>>
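The same line-based idea can also be written without the loop. This is just a sketch of a possible variant, not the function posted above: the name myFilterLines is invented here, and it relies on writeLines() writing each element of a character vector followed by a newline.

```r
# Hypothetical loop-free variant: copy the header and every later line
# that matches `pattern` to `outfile`, one line per element.
myFilterLines <- function(infile, outfile, pattern = "TRUE$") {
  L <- readLines(infile)
  body <- L[-1]                       # everything after the header
  keep <- grepl(pattern, body)        # TRUE where the line matches
  writeLines(c(L[1], body[keep]), outfile)
}

# e.g.
# myFilterLines("infile.txt", "outfile.txt")
```

Because the lines are never parsed into columns, quotes, spacing, and number formatting are left untouched; the trade-off is that the match is purely textual, so the pattern has to identify the wanted rows on its own.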
>> I love this the best! Even though it is not as simple as the bash one-liner
>> (system( "cat infile.txt | grep -v FALSE > outfile.txt", wait=TRUE )), I
>> am very happy to learn that R has functions similar to those in bash.
>> If there is a document or a list of all such functions, that would be
>> excellent.
>>
>> Thanks Gabor,
>>
> Check out these help files:
>
> help.search(keyword = "character", package = "base")
>

Great! Thanks so much Gabor.

D.


