[datatable-help] unique, by full row (not just key)

Ricardo Saporta saporta at scarletmail.rutgers.edu
Thu Mar 7 17:45:02 CET 2013


I have a keyed data.table, DT, with 800k rows, of which about 0.5% are
duplicates that need to be removed.

Using unique(DT) of course whittles the whole table down to one row per key.

I would like to get results similar to unique.data.frame(DT), but there are
two problems with using unique.data.frame:  (1) speed  (2) loss of key(DT)

So instead I'm using a wrapper that
  (1) caches key(DT) (2) removes the key (3) calls unique on DT (4) then
reapplies the key

However, this is convoluted (and also requires modifying setkey(.) and
getdots(.)).
It occurs to me that I might be overlooking a simpler alternative.

Any thoughts?

Thanks,
Rick


_Here is what I am using_:

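 uniqueRows <- function(DT) {
    # If DT is not a data.table, or has no key, regular unique(DT)
    # already compares full rows
    if (!is.data.table(DT) || !haskey(DT))
      return(unique(DT))

    .key <- key(DT)       # cache the key columns
    setkey(DT, NULL)      # remove the key (by reference, so the caller's DT loses it too)
    setkeyE(unique(DT), eval(.key))   # dedupe on full rows, then reapply the key
  }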


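  getdotsWithEval <- function () {
      # Capture the caller's ... arguments as character strings
      dots <-
        as.character(match.call(sys.function(-1), call = sys.call(-1),
            expand.dots = FALSE)$...)

      # A single argument of the form eval(...) is evaluated in the frame
      # that called setkeyE(), so variables local to that frame are found
      if (length(dots) == 1L && grepl("^eval\\(", dots) && grepl("\\)$", dots))
        return(eval(parse(text = dots), envir = parent.frame(2)))
      return(dots)
  }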

  setkeyE <- function (x, ..., verbose = getOption("datatable.verbose")) {
    # SAME AS setkey(.) WITH ADDITION THAT
    # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED
      if (is.character(x))
          stop("x may no longer be the character name of the data.table. ",
               "The possibility was undocumented and has been removed.")
      #** THIS IS THE MODIFIED LINE **#
      # OLD**:  cols = getdots()
      cols <- getdotsWithEval()
      if (!length(cols))
          cols = colnames(x)
      else if (identical(cols, "NULL"))
          cols = NULL
      setkeyv(x, cols, verbose = verbose)
  }
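
The four steps described above (cache the key, drop it, call unique, reapply
it) can also be sketched without touching setkey(.) or getdots(.), because
setkeyv(.) takes the key columns as a character vector. This is only a sketch
under assumptions: the name uniqueRowsV is made up for illustration, and the
copy(DT) is an addition so that the caller's table keeps its key, at the cost
of a temporary copy of the table.

```r
library(data.table)

# Hypothetical helper (not part of data.table): full-row unique on a keyed
# data.table, returning a result that carries the same key as the input.
uniqueRowsV <- function(DT) {
  # Nothing special to do for non-data.tables or unkeyed tables
  if (!is.data.table(DT) || !haskey(DT))
    return(unique(DT))

  key_cols <- key(DT)      # (1) cache the key as a character vector
  tmp <- copy(DT)          # assumption: work on a copy so DT itself is untouched
  setkey(tmp, NULL)        # (2) remove the key from the copy
  out <- unique(tmp)       # (3) with no key set, whole rows are compared
  setkeyv(out, key_cols)   # (4) reapply the key; setkeyv accepts a character vector
  out
}
```

A small usage example, with made-up data:

```r
DT  <- data.table(a = c(1, 1, 1, 2), b = c("x", "x", "y", "z"), key = "a")
res <- uniqueRowsV(DT)
nrow(res)   # 3: only the exact duplicate row (1, "x") is dropped
key(res)    # "a"
```

If the extra copy is too expensive for a large table, the copy(DT) line could
be dropped and the key reapplied to DT afterwards instead.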


-- 
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu