[datatable-help] data.table - sort key - columns with real numbers

mdowle at mdowle.plus.com mdowle at mdowle.plus.com
Wed Jun 30 19:13:59 CEST 2010


Thanks Desmond for your comments.

One reason for integers is that radix sorting can be used on integers, and
that's amazingly fast (Tom added radix to data.table using ?sort.list).
In its basic form the radix algorithm _only_ works for integers; see
Wikipedia.
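
As a rough illustration of the integer case (a sketch using base R only, not
data.table's internal code), sort.list() accepts method = "radix". The data
here is made up:

```r
# Integer keys with a small range: the case radix sorting is suited to.
set.seed(1)
x <- sample.int(1000L, 1e5, replace = TRUE)

# Ordering permutation computed with the radix method.
ord_radix <- sort.list(x, method = "radix")

# Applying the permutation gives the same result as an ordinary sort.
same_as_sort <- identical(x[ord_radix], sort(x))
```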

Also, keys are usually used in an equi-join, and this requires a test of
equality (==) internally. Integer equality doesn't have the machine
tolerance issues of double.
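
A small sketch of the tolerance issue, in plain R. The helper to_key() is a
hypothetical name used only for this example, and the scale of 1000 is an
assumption:

```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison of doubles: the two values differ in their last bits.
double_equal <- (a == b)

# Comparison within a tolerance, which is what all.equal() does.
tolerant_equal <- isTRUE(all.equal(a, b))

# Hypothetical helper: scale to a whole number of thousandths, stored as integer.
to_key <- function(x) as.integer(round(x * 1000))

# Integer comparison is exact, so the scaled values compare equal.
int_equal <- (to_key(a) == to_key(b))
```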

Essentially, the idea of keys is that they represent unique, discrete
things, whereas floating point is continuous.

If the distinct set of items happens to be described by floating point
numbers, perhaps like the longitude and latitude of distinct places on the
earth, then scaling to integer as you are doing (e.g. by *1000) is what
other people do, or else using factor() to store the floats as strings.
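
Both workarounds can be sketched in a few lines of base R. The latitudes are
made up, and the scale of 1000 assumes three decimal places are enough:

```r
lat <- c(51.507, 48.857, 40.713)   # made-up latitudes, 3 decimal places

# Option 1: scale to integer. round() comes first because as.integer()
# truncates, and a product that lands just below a whole number would
# otherwise come out one too small.
lat_int <- as.integer(round(lat * 1000))

# Option 2: a factor stores integer codes plus the values as character levels.
lat_fac <- factor(lat)

typeof(lat_int)    # "integer"
levels(lat_fac)    # the original values, as strings
```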

To make it easier, you could define your own small class for your
datatype, say coord(). The print method would automatically divide by 1000
for you, so you wouldn't have to remember each time. It's pretty quick and
easy to do. That way you retain the speed and memory advantage of integer
(it's half as big as double, and sorts and queries many times faster) but
it _appears_ to be float. The particular implementation depends on your
particular data, so it's something you would do rather than the data.table
package. If it really is truly continuous, then how can it be in a key?
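
A minimal sketch of such a class using S3. Everything here (coord(),
coord_value(), the "scale" attribute) is hypothetical and made up for this
example; none of it is part of data.table:

```r
# Constructor: store the value as an integer count of 1/scale units.
coord <- function(x, scale = 1000L) {
  structure(as.integer(round(x * scale)), scale = scale, class = "coord")
}

# Recover the value as a double; as.integer() drops the attributes first.
coord_value <- function(x) {
  as.integer(x) / attr(x, "scale")
}

# Print method: show the scaled-down values rather than the stored integers.
print.coord <- function(x, ...) {
  print(coord_value(x), ...)
  invisible(x)
}

lat <- coord(c(51.507, -0.128))   # made-up coordinates
typeof(lat)                       # "integer": the storage stays integer
lat                               # prints the values divided by 1000
```

A real class would also need methods such as `[` so that the class survives
subsetting, which is left out here to keep the sketch short.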

However, having said that, I may be easily persuaded to give it higher
priority if someone can explain (e.g. provide an example) why float in
keys is more valid than I currently think it is?

Maybe we should create a decimal() class?  A fixed precision float, stored
as integer.  Maybe that could be in data.table.

When I made large changes internally earlier this year, I did it in such
a way that we could switch on type (integer or double). Before that change,
the switch would have slowed things down too much as it would have been too
deep. Now, maybe. Or maybe a decimal() class. Maybe that exists already
somewhere?

Matthew


> Dear Matthew & Tom,
>
> I would like to thank you very much for contributing such an excellent
> and useful package. I have been trying to write some R code to
> overcome some of the limitations of data.frame, and you have addressed
> those issues.
>
> I have a data table with columns containing decimal numbers. Your
> current setkey() only allows integer mode and does not allow decimals. I
> figured out that to overcome the problem I need to multiply the column
> by 10^7 to convert it to integer and then divide by 10^7 to obtain the
> actual value. It is a very messy and cumbersome process. Could you
> please make changes to allow keys to have real numbers and maybe other
> modes too? I would like to suggest that your code do the conversions,
> which would make the package more elegant.
>
> Please let me know whether you will be modifying your package and how
> soon you will be attempting to incorporate the changes.
>
> Thanks again.
>
>
> Regards,
> Desmond Wee



