[datatable-help] bug in merge when a table is keyed?

Carlos Alberto Arnillas carlosalberto.arnillas at gmail.com
Tue Feb 24 00:31:10 CET 2015


Hi.
The version is 1.9.4.
As for how I ended up with a table that is not properly sorted: that
table is a small subset (in terms of rows and columns) of a larger
table, and the key of the larger table includes that variable as its
third column. So my guess is that the new table inherited the key only
for the columns that are in the subset, but the index was not rebuilt,
so the table ended up unsorted...
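
In case it is useful, here is a minimal sketch of how to check whether
a keyed table is really stored in key order. The helper name
key_is_valid is made up for illustration (it is not part of
data.table), and the key is marked by hand with setattr() only to
imitate the bad state; I do not know the actual code path that
produced it.

```r
library(data.table)

# Hypothetical helper (not a data.table function): returns TRUE when the
# rows are already ordered by the key columns, NA when there is no key.
key_is_valid <- function(dt) {
  k <- key(dt)
  if (is.null(k)) return(NA)
  cols <- lapply(k, function(col) dt[[col]])
  # method = "radix" orders in the C locale and is stable, so a correctly
  # sorted table yields exactly 1, 2, ..., nrow(dt).
  o <- do.call(order, c(cols, list(method = "radix")))
  identical(o, seq_len(nrow(dt)))
}

yy2 <- data.table(Spp = c("eugra", "vicr", "festuca"),
                  rel_cover = c(0.048780487804878,
                                0.0609756097560976,
                                0.0975609756097561))

# Assumption for illustration: mark the unsorted table as keyed by hand.
setattr(yy2, "sorted", "Spp")
before <- key_is_valid(yy2)   # FALSE: key is claimed but rows are unsorted

# Drop the stale key attribute, then let setkey() sort and re-key.
setattr(yy2, "sorted", NULL)
setkey(yy2, Spp)
after <- key_is_valid(yy2)    # TRUE
```

Running a check like this on the subset right after it is created
should show at which step the key and the row order stop agreeing.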

Carlos Alberto

On Mon, Feb 23, 2015 at 6:23 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Hi Carlos,
>
> It’d be helpful to have an MRE showing how you ended up with a data.table
> that has a key set when it’s not really ordered properly. Also, could you
> please test on the devel version as well (I don’t know the version you’re
> running)?
>
> --
> Arun
>
> On 22 Feb 2015 at 00:41:51, Carlos Alberto Arnillas
> (carlosalberto.arnillas at gmail.com) wrote:
>
> Hello
> I am running the latest version of R and data.table; however, I found a
> problem that I think was reported for previous versions and that I
> assumed had been fixed.
>
> Here is the data (as obtained from dput from a larger code)
> yy1 <- structure(list(Spp = c("vicr", "festuca"),
>                       rel_cover = c(0.0365853658536585,
>                                     0.0609756097560976)),
>                  row.names = c(NA, -2L),
>                  class = c("data.table", "data.frame"),
>                  .Names = c("Spp", "rel_cover"))
>
> yy2 <- structure(list(Spp = c("eugra", "vicr", "festuca"),
>                       rel_cover = c(0.048780487804878,
>                                     0.0609756097560976,
>                                     0.0975609756097561)),
>                  row.names = c(NA, -3L),
>                  class = c("data.table", "data.frame"),
>                  .Names = c("Spp", "rel_cover"), sorted = "Spp")
>> yy2
>        Spp  rel_cover
> 1:   eugra 0.04878049
> 2:    vicr 0.06097561
> 3: festuca 0.09756098
>
> For some reason, the yy2 dataset had a key assigned (Spp) but wrongly
> applied (in fact, I never sorted that dataset, or the one that I used to
> create it, by that variable). Then, if I try to merge both, I get a
> wrong result:
>
>> merge(yy1,yy2, by="Spp",all=T)
>        Spp rel_cover.x rel_cover.y
> 1:   eugra          NA  0.04878049
> 2: festuca  0.06097561          NA
> 3: festuca          NA  0.09756098
> 4:    vicr  0.03658537  0.06097561
>
> However, if I set the key on each table, I first get a warning,
> and then the right result:
>
>> setkey(yy1, Spp)
>> setkey(yy2, Spp)
> Warning message:
> In setkeyv(x, cols, verbose = verbose, physical = physical) :
> Already keyed by this key but had invalid row order, key rebuilt. If
> you didn't go under the hood please let datatable-help know so the
> root cause can be fixed.
>
>
>> merge(yy1,yy2, by="Spp",all=T)
> Spp rel_cover.x rel_cover.y
> 1: eugra NA 0.04878049
> 2: festuca 0.06097561 0.09756098
> 3: vicr 0.03658537 0.06097561
>
>
> As a temporary workaround, I am using merge.data.frame, but I
> would prefer to keep all my data in data.table.
>
> If it is not a bug and I can do something to fix it, please let me know.
>
> Thanks in advance
>
> Carlos Alberto


More information about the datatable-help mailing list