[R] Basic question on concatenating factors

Stavros Macrakis macrakis at alum.mit.edu
Sun Nov 23 17:53:38 CET 2008


I originally wrote:

>>> This algorithm depends crucially on union preserving the order of the
>>> elements of its arguments. As far as I can tell, the spec of union
>>> does not require this.  If union were to (for example) sort its
>>> arguments then merge them (generally a more efficient algorithm), this
>>> function would no longer work.

On Sun, 23 Nov 2008, jim holtman replied
>
>> You are right.  union used 'unique(c(x,y))' and I am not sure if
>> 'unique' preserves the order, but the help page seems to indicate that
>> "an element is omitted if it is identical to any previous element "...

On Sun, Nov 23, 2008 at 2:36 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
>
> It says
>
>     'unique' returns a vector, data frame or array like 'x' but with
>     duplicate elements/rows removed.
>
> Although it is a generic function, it is hard to see how that can be
> interpreted to allow the order to be changed.

First of all, the original issue was about "union", not about
"unique".  I see nothing in the "union" help page specifying the order
of the resulting vector. Indeed, since union is specified to perform
"set union", and the normal definition of set does not include a
notion of order, it seems fair to assume that the result order is
undefined in the absence of an explicit statement.  What's more, there
could well be set representations (e.g. bitstrings) which are
inherently unordered -- but union is not (currently) generic.

The fact that union is currently implemented using unique should be of
no concern of the user.

> The claim that union would be more efficiently implemented via sorting is made with no evidence:

I simply gave sorted lists as an *example* of a possible alternative
implementation that the spec appears to allow.

> do look up a basic computer science textbook for this kind of thing

Having taught algorithms and data structure courses at a local
university, and having written the "set" package for the Maxima
computer algebra system, I am fairly well versed in the alternatives,
thank you.

> as well as how R actually does it

How "R actually does it" today is not the issue.  Do look up a basic
computer science textbook on data abstraction or software engineering
for this kind of thing.

           -s

Stavros Macrakis
(oh, and since the habit on this list seems to be to include degrees:
PhD, Computer Science)
Cambridge, MA



More information about the R-help mailing list