[R] sorting during xtabs? sorting by "individual" order?
    Fridolin Wild 
    fridolin.wild at wu-wien.ac.at
       
    Wed Nov  9 00:03:51 CET 2005
    
    
  
Hi all,
while refactoring a package (before it is released),
I ran across the following problem.
I have two directories containing different text files.
I want to read the first and construct a document-term
matrix from it (one row per term, i.e. word, one column per
file, with occurrence frequencies as the values).
The second directory contains different files. It
also needs to be read in to construct a document-term
matrix, but with the same terms in the same row order, to
enable similarity comparisons in a vector space of the
same format.
Let's make a (fake) example:
(1) support function
    # directory 1 contains 2 files (F1 & F2):
       F1 = c("word4", "word3", "word2")
       F2 = c("word1", "word4", "word2")
    # directory 2 also contains 2 files (F3 & F4):
       F3 = c("word1", "word2", "bla")
       F4 = c("word1", "word2", "word3")
    # I read in the first directory, file by file, and
    # create triples of the form (file, word, frequency)
        F1tab = sort(table(F1), decreasing = TRUE)
        F2tab = sort(table(F2), decreasing = TRUE)
    # and create a dataframe
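        # as.vector() keeps Freq a plain count column even if F1tab has class "table"
        F1frame = data.frame( docs="F1", terms=names(F1tab),
                              Freq = as.vector(F1tab), row.names = NULL)
        F2frame = data.frame( docs="F2", terms = names(F2tab),
                              Freq = as.vector(F2tab), row.names = NULL)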
(2) textmatrix function
    The data frames of all files are bound together and
    converted with xtabs into a document-term matrix:
        dummy = list(F1frame, F2frame)
        dtm = t(xtabs(Freq ~ ., data = do.call("rbind", dummy)))
        =>
               docs
        terms   F1 F2
          word2  1  1
          word3  1  0
          word4  1  1
          word1  0  1
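
    For illustration, steps (1) and (2) can be wrapped into two small
    functions. This is only a sketch: the names make_triples and build_dtm
    are made up here, the input is assumed to be a named list with one
    non-empty character vector of words per file, and the per-file sort is
    omitted. Because terms is stored as character in this sketch, xtabs
    orders the rows alphabetically, which may differ from the order
    printed above.

```r
# Hypothetical helper: one (docs, terms, Freq) block for a single file.
# Assumes `words` is a non-empty character vector.
make_triples <- function(doc, words) {
  tab <- table(words)
  data.frame(docs = doc,
             terms = names(tab),
             Freq = as.vector(tab),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: bind the blocks of all files and cross-tabulate,
# giving terms in rows and documents in columns.
build_dtm <- function(files) {
  triples <- do.call("rbind", Map(make_triples, names(files), files))
  t(xtabs(Freq ~ docs + terms, data = triples))
}

files <- list(F1 = c("word4", "word3", "word2"),
              F2 = c("word1", "word4", "word2"))
dtm <- build_dtm(files)
```
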
    Now I want to re-use this to construct another
    document-term matrix from files F3 & F4, with the same terms
    in exactly the same order. First, I need to add
        F3clean = F3[F3 %in% rownames(dtm)]
        F4clean = F4[F4 %in% rownames(dtm)]
    to keep "unwanted" terms out of the tables.
    And here is my problem:
    I need to reformat the resulting document-term matrix
    (as produced by running step 2 again
    with F3clean and F4clean) so that its rows follow the
    order of rownames(dtm) from the first directory.
    How can I do this cheaply (the matrices I have to
    deal with are usually really big)? Hopefully just
    by adding something to the xtabs call?
    As an example of what I need: dtm2 should
    look exactly like this (document order is not important):
        =>
               docs
        terms   F3 F4
          word2  1  1
          word3  0  1
          word4  0  0
          word1  1  1
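
    One direction that might work, given only as a sketch: store terms
    as a factor whose levels are the reference terms in the reference
    order before calling xtabs. xtabs keeps unused factor levels by
    default (drop.unused.levels = FALSE), so terms missing from the new
    files should remain as all-zero rows. The helper name triples_for and
    the variable ref_terms are made up for this example; in practice
    ref_terms would be rownames(dtm). Whether this is cheap enough for
    really big matrices is not tested here.

```r
# Reference term order, taken from the first directory's matrix
# (written out here so the example is self-contained).
ref_terms <- c("word2", "word3", "word4", "word1")

F3 <- c("word1", "word2", "bla")
F4 <- c("word1", "word2", "word3")

# Hypothetical helper: (docs, terms, Freq) block for one file.
# Words outside `levels` are dropped; `terms` carries the reference levels.
triples_for <- function(doc, words, levels) {
  tab <- table(words[words %in% levels])
  data.frame(docs = rep(doc, length(tab)),
             terms = factor(names(tab), levels = levels),
             Freq = as.vector(tab))
}

new_triples <- rbind(triples_for("F3", F3, ref_terms),
                     triples_for("F4", F4, ref_terms))

# Rows follow the factor levels, i.e. the reference order.
dtm2 <- t(xtabs(Freq ~ docs + terms, data = new_triples))
```

    With this, rownames(dtm2) should come out as word2, word3, word4,
    word1, with word4 kept as an all-zero row and "bla" dropped.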
    Can anybody help me?
Best,
Fridolin
-- 
Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746
    
    