[R] [R-pkgs] new package 'bit64' - 1000x faster than 'int64' sponsored by Google

Jens Oehlschlägel Jens.Oehlschlaegel at truecluster.com
Tue Feb 21 23:13:08 CET 2012


Dear R-Core team,
Dear Rcpp team and other package teams,
Dear R users,

The new package 'bit64' is available on CRAN for beta-testing and 
code-reviewing.

Package 'bit64' provides fast serializable S3 atomic 64bit (signed) 
integers that can be used in vectors, matrices, arrays and data.frames. 
Methods are available for coercion from and to logicals, integers, 
doubles and characters, as well as many elementwise and summary functions.
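
A minimal sketch of the kind of usage this enables (values are 
illustrative only, assuming 'bit64' is installed from CRAN):

    library(bit64)
    x <- as.integer64(c(1, 2, 3))            # coercion from double
    big <- as.integer64("9007199254740993")  # 2^53 + 1, parsed from character to avoid double rounding
    x + 1L                                   # elementwise arithmetic stays integer64
    sum(x)                                   # summary functions such as sum() have methods
    as.character(big)                        # coercion back to character keeps all digits
    d <- data.frame(id = x)                  # usable as a data.frame column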

Package 'bit64' has the following advantages over package 'int64' (which 
was sponsored by Google):
- true atomic vectors usable with length, dim, names etc.
- only S3, not S4 class system used to dispatch methods
- less RAM consumption, by a factor of 7 (under a 64 bit OS; see the 
short sketch after this list)
- faster operations, by a factor of 4 to 2000 (under a 64 bit OS)
- no slow-down of R's garbage collection (as caused by the mere 
existence of 'int64' objects)
- pure GPL, no copyright held by a transnational commercial company
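
The following rough sketch illustrates the atomic vector behaviour and 
the per-element memory footprint (figures are approximate and platform 
dependent):

    v <- as.integer64(c(10, 20, 30, 40))     # a true atomic vector, not a list of S4 objects
    names(v) <- c("a", "b", "c", "d")        # names() works as on any atomic vector
    length(v)                                # 4
    object.size(integer64(1e6))              # about 8 MB, i.e. 8 bytes per element like 'double'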

While the advantage of the atomic S3 design over the complicated S4 
object design is obvious, it is less obvious that an external package is 
the best way to enrich R with 64bit integers. An external package will 
not give us literals such as 1LL or directly allow us to address larger 
vectors than possible with base R. But it allows us to properly address 
larger vectors in other packages such as 'ff' or 'bigmemory' and it 
allows us to properly work with large surrogate keys from external 
databases. An external package realizing just one data type also makes a 
perfect test bed to play with innovative performance enhancements. 
Performance tuned sorting and hashing are planned for the next release, 
which will give us fast versions of sort, order, merge, duplicated, 
unique, and table - for 64bit integers.
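
For example, surrogate keys from an external database can be handled 
like this (the key values below are made up for illustration):

    # keys beyond .Machine$integer.max, e.g. delivered by a database driver as strings
    keys <- as.integer64(c("4294967297", "4294967298", "9007199254740993"))
    keys > as.integer64(.Machine$integer.max)   # comparisons are exact
    keys - keys[1]                              # differences without precision loss
    as.character(keys)                          # lossless round-trip back to character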

For those who still hope that R's 'integer' will be 64bit some day, here 
is my key learning: migrating R's 'integer' from 32 to 64 bit would be 
RAM expensive. It would most likely also require migrating R's 'double' 
from 64 to 128 bit - in order to again have a data type to which we can 
losslessly coerce. The assumption that 'integer' is a proper subset of 
'double' is scattered throughout R's semantics. We all expect that binary 
and n-ary functions such as '+' and 'c' return 'double' and do not 
destroy information. If we extend only to 64bit integers but not to 
128bit doubles, we get semantic changes that may disappoint such 
expectations: integer64 + double returns integer64 and drops the 
decimals. I did my best to make operations involving integer64 consistent 
and numerically stable - please consult the documentation at ?bit64 for details.
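
A small sketch of the behaviour described above (assuming 'bit64' is 
attached; see ?bit64 for the exact coercion rules):

    1L + 0.5                           # base R: integer + double is promoted to double, giving 1.5
    as.integer64(1) + 0.5              # integer64 + double stays integer64, the fraction is lost
    as.integer64(1) + as.integer64(1)  # arithmetic between integer64 values is exact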

Since this package is 'at risk' of attracting a lot of dependencies from 
other packages, I'd appreciate serious beta-testing and also 
code review, ideally from the R-Core team. Please check the 
'Limitations' sections on the help page and the numerics involving "long 
double" in C. If the conclusion is that this would be better done in 
base R - I happily donate the code and drop this package. If we have to 
go with an external package for 64bit integers, it would be great if 
this work could convince the Rcpp team, including Romain, of the 
advantages of this approach. Shouldn't we join forces here?

Best regards

Jens Oehlschlägel
Munich, 21.2.2012

_______________________________________________
R-packages mailing list
R-packages at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages


