<html><head><style>body{font-family:Helvetica,Arial;font-size:13px}</style></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div id="bloop_customfont" style="margin: 0px; "><blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-size: 13px; font-family: Helvetica, Arial; "><div lang="EN-US" link="#4183C4" vlink="#4183C4" xml:lang="EN-US"><div class="WordSection1"><p class="MsoNormal"><span style="font-size: 11pt; font-family: Calibri, sans-serif; color: rgb(31, 73, 125); ">However there’s another aspect.  While I’m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.</span></p></div></div></blockquote><div><div lang="EN-US" link="#4183C4" vlink="#4183C4" xml:lang="EN-US"><div class="WordSection1"><p class="MsoNormal"><font face="helvetica, arial">`data.table` is designed with *really large* data sets in mind (> 100 or 200 GB in memory, even). As a design feature, it therefore trades "referential transparency" for the ability to manipulate data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time the two go hand in hand).</font></p><p class="MsoNormal"><font face="helvetica, arial">This is perhaps the biggest design choice one needs to be aware of when choosing or working with data.tables. It is possible to modify objects by reference using data.table: all the functions whose names begin with "set*" modify objects by reference. 
The only other one, outside the "set*" functions, is the `:=` operator.</font></p><p class="MsoNormal"><font face="helvetica, arial"><br></font></p><p class="MsoNormal"><font face="helvetica, arial">HTH</font></p><p class="MsoNormal" style="color: rgb(0, 0, 0); font-size: 13px; font-family: Helvetica, Arial; "><span style="font-family: helvetica, arial; ">Arun</span></p></div></div></div></div> <div style="color:black"><br>From: <span style="color:black">Ron Hylton</span> <a href="mailto:rhylton@verizon.net">rhylton@verizon.net</a><br>Reply: <span style="color:black">Ron Hylton</span> <a href="mailto:rhylton@verizon.net">rhylton@verizon.net</a><br>Date: <span style="color:black">June 14, 2014 at 2:52:04 AM</span><br>To: <span style="color:black">datatable-help@lists.r-forge.r-project.org</span> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>Subject: <span style="color:black"> Re: [datatable-help] data.table is asking for help <br></span></div><br> <blockquote type="cite" class="clean_bq"><span><div lang="EN-US" link="#4183C4" vlink="#4183C4" xml:lang="EN-US"><div></div><div>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">
I suspected it was something like this. As one clarification,
there is a setkey(test,id) before any setkey(.SD). If
setkey(test,id) is changed to setkey(test), so that all columns are in
the original data.table key, then the warning goes away.</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">
</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">
However there’s another aspect. While I’m relatively new to R
my understanding is that a function argument should be modifiable
within the function body without affecting the caller, which
perhaps conflicts with the behavior of .SD.</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">
</span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:11.0pt;font-family:"Calibri","sans-serif"">From:</span></b>
<span style="font-size:11.0pt;font-family:"Calibri","sans-serif"">Arunkumar
Srinivasan [mailto:aragorn168b@gmail.com]<br>
<b>Sent:</b> Friday, June 13, 2014 8:23 PM<br>
<b>To:</b> Ron Hylton;
datatable-help@lists.r-forge.r-project.org<br>
<b>Subject:</b> Re: [datatable-help] data.table is asking for
help</span></p>
</div>
</div>
<p class="MsoNormal"> </p>
<p>Nicely reproducible post. Reproducible in v1.9.3 (latest commit)
as well.</p>
<p>This is a tricky one. It happens because you’re setting a key on
<code><span style="font-size:10.0pt">.SD</span></code>, which should
normally not be allowed. What happens is that the first time you set the
key, there’s no key set (here), and therefore the key is set on
all the columns <code><span style="font-size:10.0pt">x1</span></code>, <code><span style="font-size:10.0pt">x2</span></code> and <code><span style="font-size:10.0pt">x3</span></code>.</p>
<p>Now, when the next group (in the <code><span style="font-size:10.0pt">by=id</span></code>) is passed to your function,
it’ll have the <code><span style="font-size:10.0pt">key</span></code> already set to
<code><span style="font-size:10.0pt">x1,x2,x3</span></code>
(because <code><span style="font-size:10.0pt">setkey</span></code>
modifies the object by reference), but <code><span style="font-size:10.0pt">.SD</span></code> has obtained
<strong>new</strong> data corresponding to <em>this</em> group.
<code><span style="font-size:10.0pt">data.table</span></code> then sorts
this data, knowing that it already has the key set. If the key is
already set, the sort order must come out as 1:n. But it doesn’t, as this data
isn’t sorted. <code><span style="font-size:10.0pt">data.table</span></code> warns in those
scenarios, and that’s why you get the warning.</p>
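<p>The by-reference behaviour described above can be seen outside of any grouping. The following is only a minimal sketch: the table <code>dt</code> and the helper <code>keyInside</code> are made-up names for illustration and are not part of the code in this thread.</p>
<pre><code>library(data.table)

# Hypothetical toy table with a single unsorted column.
dt <- data.table(x = c(3, 1, 2))

# Hypothetical helper: it only calls setkey() on its argument.
keyInside <- function(d) {
  setkey(d)          # keys d by all of its columns, in place
  invisible(NULL)
}

keyInside(dt)

key(dt)   # the caller's table now carries the key
dt$x      # and its rows have been physically reordered</code></pre>
<p>No copy is made when <code>dt</code> is passed to the function, so the key and the new row order are visible to the caller afterwards.</p>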
<p>To verify this, you can try:</p>
<div style="mso-element:para-border-div;border:solid #CCCCCC 1.0pt;padding:3.0pt 6.0pt 3.0pt 6.0pt;background:#F8F8F8">
<pre><code>conflictsTable1 <- function(f) {
  # setkey() without columns keys f (here .SD) by all of its columns, by reference
  u <- unique(setkey(f))
  # clear the key again so the next group's .SD is not already marked as keyed
  setattr(f, 'sorted', NULL)
  if (nrow(u) == 1) return(NULL)
  u
}</code></pre></div>
<p>Basically, we reset the key of <code><span style="font-size:10.0pt">f</span></code> (which is the same object as
<code><span style="font-size:10.0pt">.SD</span></code>, as it’s only
modified by reference) to <code><span style="font-size:10.0pt">NULL</span></code> every time afterwards, so that
<code><span style="font-size:10.0pt">.SD</span></code> for the new
group will not have the key set.</p>
<p>The ideal scenario here, IIUC, is that <code><span style="font-size:10.0pt">setkey(.SD)</span></code>, or setting a key on anything pointing to
<code><span style="font-size:10.0pt">.SD</span></code>, should not
be possible (locking the binding doesn’t seem to affect things done by
reference). <code><span style="font-size:10.0pt">.SD</span></code> should, however, retain the key
of the data.table, if a key was set, wherever possible.</p>
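<p>Until something like that is in place, one way to leave <code>.SD</code> untouched is to key a private copy inside the function. This is only a sketch under assumptions: <code>conflictsTableCopy</code> and the small <code>test</code> table below are made up for illustration, and copying every group is expected to cost extra time and memory.</p>
<pre><code>library(data.table)

# Hypothetical variant of the function from this thread: it keys a copy,
# so .SD itself is never modified by reference.
conflictsTableCopy <- function(f) {
  u <- unique(setkey(copy(f)))
  if (nrow(u) == 1) return(NULL)
  u
}

# Small made-up table: ids "a" and "c" have conflicting values, "b" does not.
test <- data.table(id = rep(c("a", "b", "c"), each = 2),
                   x1 = c(2, 1, 5, 5, 9, 7),
                   x2 = c(1, 1, 3, 3, 4, 4))
setkey(test, id)

res <- test[, conflictsTableCopy(.SD), by = id]
res</code></pre>
<p>Because only the copy is keyed, the next group’s <code>.SD</code> never starts out marked as sorted, and the key of <code>test</code> itself stays on <code>id</code>.</p>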
<div id="bloop_customfont">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Helvetica","sans-serif""> </span></p>
</div>
<div id="bloop_sign_1402704505278157056">
<div>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Helvetica","sans-serif"">Arun</span></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="color:black"><br>
From: Ron Hylton <a href="mailto:rhylton@verizon.net">rhylton@verizon.net</a><br>
Reply: Ron Hylton <a href="mailto:rhylton@verizon.net">rhylton@verizon.net</a><br>
Date: June 14, 2014 at 1:55:53 AM<br>
To: <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a>
<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
Subject: [datatable-help] data.table is asking for
help</span></p>
</div>
<p class="MsoNormal"><br>
<br></p>
<blockquote style="margin-left:0in;margin-top:11.25pt;margin-right:0in;margin-bottom:11.25pt">
<div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The code below
generates the warning:</p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> </p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;word-break:break-all">
<span style="font-size:10.0pt;font-family:"Lucida Console";color:black;background:#E1E2E5">
In setkeyv(x, cols, verbose = verbose) :</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;word-break:break-all">
<span style="font-size:10.0pt;font-family:"Lucida Console";color:black;background:#E1E2E5">
Already keyed by this key but had invalid row order, key
rebuilt. If you didn't go under the hood please let datatable-help
know so the root cause can be fixed.</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;word-break:break-all">
<span style="font-size:10.0pt;font-family:"Lucida Console";color:black;background:#E1E2E5">
</span></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">This is my
first attempt at using data.table, so I probably did something dumb,
but maybe that’s useful for someone. The first case is the
one that gives the warnings.</p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> </p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’m also
surprised at the timings. I wrote the original algorithm
using a data.frame and ddply, and I expected data.table to be
substantially faster; the opposite is true.</p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> </p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The algorithm
does the following: certain columns in the table are keys and the
others are values, in the sense that all rows with the same set of
keys should have the same set of values. Find all the key
sets for which this is not true, and return those key sets together
with their conflicting value sets.</p>
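<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The description above can also be sketched without calling a function on each group. This is an illustration only, not the code timed in this post: the table <code>dt</code> is made up, and the sketch assumes a data.table version (1.9.8 or later) in which <code>unique()</code> compares all columns by default.</p>
<pre><code>library(data.table)

# Made-up example: id "a" has two different value sets, id "b" has one.
dt <- data.table(id = c("a", "a", "b", "b"),
                 x1 = c(1, 2, 3, 3),
                 x2 = c(1, 1, 4, 4))

# Drop exact duplicate rows once, for the whole table.
u <- unique(dt)

# Keep only the key sets that still have more than one distinct row.
conflicts <- u[, if (.N > 1L) .SD, by = id]
conflicts</code></pre>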
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> </p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Insight into
the performance would be appreciated.</p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> </p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Regards,</p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Ron</p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> </p>
<pre><code>library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
                   x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))

setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))

print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))

print(system.time(uf <- ddply(test, .(id), conflictsFrame)))</code></pre>
</div>
<p class="MsoNormal">
_______________________________________________<br>
datatable-help mailing list<br>
<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help">
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a></p>
</div>
</div>
</blockquote>
</div>
</div></div></span></blockquote></body></html>