From kevinushey at gmail.com Sat Feb 1 20:50:41 2014 From: kevinushey at gmail.com (Kevin Ushey) Date: Sat, 1 Feb 2014 11:50:41 -0800 Subject: [datatable-help] R-devel breaks data.table Message-ID: Hi guys, See the commit here: https://github.com/wch/r-source/commit/d0aece456bae5377245eb550a7434ba517be12fe Now if I run the following code, I see an error: library(data.table) DT <- data.table(x=1, y=2, z=3) DT[, k := 4] Error in `[.data.table`(DT, , `:=`(k, 4)) : attempt to set index 3/3 in SET_STRING_ELT Is this R-devel being overly picky about data.table's overallocation, or is this a bug in data.table? This is with data.table 1.8.11 from R-forge (version from yesterday). R Under development (unstable) (2014-02-01 r64910) Platform: x86_64-apple-darwin13.0.0 (64-bit) locale: [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.8.11 knitr_1.5.15 devtools_1.4.1.99 BiocInstaller_1.13.3 loaded via a namespace (and not attached): [1] compiler_3.1.0 digest_0.6.4 evaluate_0.5.1 formatR_0.10 httr_0.2 memoise_0.1 [7] parallel_3.1.0 plyr_1.8 RCurl_1.95-4.1 reshape2_1.2.2 stringr_0.6.2 tools_3.1.0 [13] whisker_0.3-2 -Kevin From mdowle at mdowle.plus.com Sun Feb 2 03:06:59 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Sun, 02 Feb 2014 02:06:59 +0000 Subject: [datatable-help] R-devel breaks data.table In-Reply-To: References: Message-ID: <52EDA843.80400@mdowle.plus.com> Hi Kevin, Yes R-devel has a new check as of Friday. No problem in data.table, just R getting stricter which is a good. Fixed and v1.8.11 (r1108) works again on latest R-devel 2014-02-01 r64910. Thanks, Matt On 01/02/14 19:50, Kevin Ushey wrote: > Hi guys, > > See the commit here: > > https://github.com/wch/r-source/commit/d0aece456bae5377245eb550a7434ba517be12fe > > Now if I run the following code, I see an error: > > library(data.table) > > DT <- data.table(x=1, y=2, z=3) > DT[, k := 4] > > Error in `[.data.table`(DT, , `:=`(k, 4)) : > attempt to set index 3/3 in SET_STRING_ELT > > Is this R-devel being overly picky about data.table's overallocation, > or is this a bug in data.table? > > This is with data.table 1.8.11 from R-forge (version from yesterday). > > R Under development (unstable) (2014-02-01 r64910) > Platform: x86_64-apple-darwin13.0.0 (64-bit) > > locale: > [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.8.11 knitr_1.5.15 devtools_1.4.1.99 > BiocInstaller_1.13.3 > > loaded via a namespace (and not attached): > [1] compiler_3.1.0 digest_0.6.4 evaluate_0.5.1 formatR_0.10 > httr_0.2 memoise_0.1 > [7] parallel_3.1.0 plyr_1.8 RCurl_1.95-4.1 reshape2_1.2.2 > stringr_0.6.2 tools_3.1.0 > [13] whisker_0.3-2 > > -Kevin > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From ggrothendieck at gmail.com Sun Feb 2 13:27:36 2014 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 2 Feb 2014 07:27:36 -0500 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval Message-ID: The benchmark at the bottom of this post shows a problem where a data.table roll="next" took nearly 150x longer than a base findInterval() solution. 
(The data.table solution is easier to write though.) This suggests an area for possible speed improvement. http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sun Feb 2 19:57:43 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Sun, 02 Feb 2014 18:57:43 +0000 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: Message-ID: <52EE9527.10608@mdowle.plus.com> But this is at the *micro* second level ?!! I confirm those results on my slow netbook but remember these are **micro** seconds i.e. 71,000 here is less than 0.1 of a second. > microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y)) Unit: microseconds expr min lq median uq max neval flodel(X, Y) 330.798 369.369 402.7935 455.3225 17996.26 100 GG1(X, Y) 14287.380 14370.038 14466.5990 16010.5440 121082.77 100 GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62 100 To put it in some perspective: > system.time(GG2(X,Y)) user system elapsed 0.072 0.000 0.072 > system.time(GG2(X,Y)) user system elapsed 0.080 0.000 0.079 > system.time(GG2(X,Y)) user system elapsed 0.072 0.000 0.072 Where those times are in seconds. So the task in question here takes 0.07 seconds ?! The 150x longer figure is actually (using figures from the S.O. answer) 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds (0.000168 seconds). 0.024 seconds / 0.000168 = "150 times". If you rounded to milliseconds you could say data.table is infinitely slower (24ms / 0ms = Inf). I can believe there's scope for improvement, sure, but not from this benchmark. The vectors need to be *much* bigger and replications need to be *much* smaller, say 3. The task being timed needs to take a meaningful amount of time (say 5 seconds) *for a single run*. Matt On 02/02/14 12:27, Gabor Grothendieck wrote: > The benchmark at the bottom of this post shows a problem where a > data.table roll="next" took nearly 150x longer than a base > findInterval() solution. (The data.table solution is easier to write > though.) This suggests an area for possible speed improvement. > > http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Feb 3 12:46:23 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Mon, 03 Feb 2014 11:46:23 +0000 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: <52EE9527.10608@mdowle.plus.com> References: <52EE9527.10608@mdowle.plus.com> Message-ID: <52EF818F.8090907@mdowle.plus.com> Gabor, With that said about it being a micro benchmark, by-without-by might be at play in GG2(X,Y) here; i.e. running j for each row of i, where it could run once.
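A minimal sketch of that difference, using the dtx/dty tables from the Stack Overflow answer (both forms appear later in this thread; shown here for orientation only, untested):

# by-without-by: j is re-evaluated once per row of dtx, i.e. one abs() call per group
dty[dtx, abs(x - y), roll = "nearest"]

# join first, then evaluate j once over the whole joined result
# (x1/y1 are plain copies of the key columns, added beforehand with
#  dtx[, x1 := x] and dty[, y1 := y])
dty[dtx, roll = "nearest"][, abs(x1 - y1)]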
I remember you and others quite rightly said by-without-by should be explicit ... still got to make that change. A similar speed issue came up recently somewhere else as well, which the change in default should help with. Matt On 02/02/14 18:57, Matt Dowle wrote: > > But this is at the *micro* second level ?!! > > I confirm those results on my slow netbook but remember these are > **micro** seconds i.e. 71,000 here is less than 0.1 of a second. > > > microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y)) > Unit: microseconds > expr min lq median uq max neval > flodel(X, Y) 330.798 369.369 402.7935 455.3225 17996.26 100 > GG1(X, Y) 14287.380 14370.038 14466.5990 16010.5440 121082.77 100 > GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62 100 > > To put it in some perspective : > > > system.time(GG2(X,Y)) > user system elapsed > 0.072 0.000 0.072 > > system.time(GG2(X,Y)) > user system elapsed > 0.080 0.000 0.079 > > system.time(GG2(X,Y)) > user system elapsed > 0.072 0.000 0.072 > > Where those times are in seconds. So the task in question here, takes > 0.07 seconds ?! > > The 150x longer figure is actually (using figures from the S.O. > answer) 24695 microseconds (i.e. 0.024 seconds) divided by 168 > microseconds (0.000168 seconds). 0.024 seconds / 0.000168 = "150 > times". If you rounded to milliseconds you could say data.table is > infinitely slower (24ms / 0ms = Inf). > > I can believe there's scope for improvement, sure, but not from this > benchmark. The vectors need to be *much* bigger and replications needs > to be *much* smaller, say 3. The task being timed needs to take a > meaningful amount of time (say 5 seconds) *for a single run*. > > Matt > > > On 02/02/14 12:27, Gabor Grothendieck wrote: >> The benchmark at the bottom of this post shows a problem where a >> data.table roll="next" took nearly 150x longer than a base >> findInterval() solution. (The data.table solution is easier to write >> though.) This suggests an area for possible speed improvement. >> >> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Wed Feb 5 02:36:38 2014 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 4 Feb 2014 20:36:38 -0500 Subject: [datatable-help] merging data tables on date ranges Message-ID: I'm trying to figure out how to merge two data tables using the dates in both. One table is a set of people, which has the to and from dates and their address at each place they've lived. The other has their work history, also with to and from dates. Obviously, there isn't a one-to-one relationship; individuals may have several jobs while staying in the same place, several homes over the course of a job, and any sort of overlapping you can imagine. Both tables are reasonably large; the residences table has about 950k rows, and the employment table about 1.2M rows.
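For readers coming to this thread later: overlap joins of exactly this kind were later added to data.table as foverlaps() (1.9.4+). A sketch under stated assumptions, namely that the fromdate/todate columns have been converted to something comparable (e.g. an integer year*12 + month) and that open-ended todate values have been filled with a far-future value (untested):

setkey(residences, icrdn, fromdate, todate)   # interval columns must come last in the key
pairs <- foverlaps(work.history, residences,
                   by.x = c("icrdn", "fromdate", "todate"),
                   type = "any", nomatch = 0L)
# one row per overlapping job-residence pair per person; nomatch = 0L drops rows with no overlap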
To give you a bit of flavor, the first ten rows of each: > work.history[1:10, list(icrdn, fromdate, todate, state, postalcode)] icrdn fromdate todate state postalcode 1: 145 Apr 1988 Jan 1990 FL 33432 2: 145 Jan 1990 Jan 1997 FL 33432 3: 145 Jan 1997 Dec 2011 FL 33444 4: 145 Jan 1997 Dec 2011 FL 33444 5: 145 Jan 1997 Dec 2011 FL 33444 6: 170 Oct 1983 Apr 2002 NE 68114 7: 170 Sep 1972 Dec 2011 IL 60443 8: 170 Sep 1972 Dec 2011 IL 61821-3066 9: 183 Aug 2000 Dec 2011 GA 30305 10: 183 Aug 2000 Dec 2011 GA 30305 > residences[1:10] icrdn fromdate todate state postalcode 1: 145 10/1992 03/2004 FL 33432 2: 145 03/2004 FL 33487 3: 170 09/1995 IL 61821 4: 183 05/1993 08/2000 GA 30342 5: 183 08/2000 09/2001 GA 30342 6: 183 09/2001 08/2004 GA 30305 7: 183 08/2004 GA 30073 8: 183 02/2005 GA 30342 9: 183 06/2006 GA 30075 10: 183 07/1974 05/1993 GA 30338 The 'icrdn' column is an identifier unique to each person. What I'm looking for is a data table with a row for each residence-job pair. Any residence that doesn't have a job in the sample can be safely dropped, and vice-versa. Thanks in advance for any help anyone can offer. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Wed Feb 5 16:22:32 2014 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Wed, 5 Feb 2014 10:22:32 -0500 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: <52EF818F.8090907@mdowle.plus.com> References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: There was anoither benchmark posted with larger data and longer times but this time data.table stopped with an error. See: http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 On Mon, Feb 3, 2014 at 6:46 AM, Matt Dowle wrote: > Gabor, > > With that said about it being a micro benchmark, by-without-by might be at > play in GG2(X,Y) here; i.e. running j for each row of i, where it could run > once. I remember you and others quite rightly said by-without-by should be > explicit ... still got to make that change. A similar speed issue came up > recently somewhere else as well which the change in default should help. > > Matt > > > On 02/02/14 18:57, Matt Dowle wrote: > > > But this is at the *micro* second level ?!! > > I confirm those results on my slow netbook but remember these are **micro** > seconds i.e. 71,000 here is less than 0.1 of a second. > >> microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y)) > Unit: microseconds > expr min lq median uq max neval > flodel(X, Y) 330.798 369.369 402.7935 455.3225 17996.26 100 > GG1(X, Y) 14287.380 14370.038 14466.5990 16010.5440 121082.77 100 > GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62 100 > > To put it in some perspective : > >> system.time(GG2(X,Y)) > user system elapsed > 0.072 0.000 0.072 >> system.time(GG2(X,Y)) > user system elapsed > 0.080 0.000 0.079 >> system.time(GG2(X,Y)) > user system elapsed > 0.072 0.000 0.072 > > Where those times are in seconds. So the task in question here, takes > 0.07 seconds ?! > > The 150x longer figure is actually (using figures from the S.O. answer) > 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds > (0.000168 seconds). 0.024 seconds / 0.000168 = "150 times". If you > rounded to milliseconds you could say data.table is infinitely slower (24ms > / 0ms = Inf). 
> > I can believe there's scope for improvement, sure, but not from this > benchmark. The vectors need to be *much* bigger and replications needs to be > *much* smaller, say 3. The task being timed needs to take a meaningful > amount of time (say 5 seconds) *for a single run*. > > Matt > > > On 02/02/14 12:27, Gabor Grothendieck wrote: > > The benchmark at the bottom of this post shows a problem where a data.table > roll="next" took nearly 150x longer than a base findInterval() solution. > (The data.table solution is easier to write though.) This suggests an area > for possible speed improvement. > > http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From aragorn168b at gmail.com Wed Feb 5 16:32:03 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 5 Feb 2014 16:32:03 +0100 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: Just tested. Works just fine (on 1.8.11). Takes 16 seconds as opposed to Flodel's which takes 1.4 seconds on my laptop. Also identical returned TRUE. Will see where's the delay coming from. On Wed, Feb 5, 2014 at 4:22 PM, Gabor Grothendieck wrote: > There was anoither benchmark posted with larger data and longer times > but this time data.table stopped with an error. See: > > > http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 > > On Mon, Feb 3, 2014 at 6:46 AM, Matt Dowle wrote: > > Gabor, > > > > With that said about it being a micro benchmark, by-without-by might be > at > > play in GG2(X,Y) here; i.e. running j for each row of i, where it could > run > > once. I remember you and others quite rightly said by-without-by should > be > > explicit ... still got to make that change. A similar speed issue came > up > > recently somewhere else as well which the change in default should help. > > > > Matt > > > > > > On 02/02/14 18:57, Matt Dowle wrote: > > > > > > But this is at the *micro* second level ?!! > > > > I confirm those results on my slow netbook but remember these are > **micro** > > seconds i.e. 71,000 here is less than 0.1 of a second. > > > >> microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y)) > > Unit: microseconds > > expr min lq median uq max neval > > flodel(X, Y) 330.798 369.369 402.7935 455.3225 17996.26 100 > > GG1(X, Y) 14287.380 14370.038 14466.5990 16010.5440 121082.77 100 > > GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62 100 > > > > To put it in some perspective : > > > >> system.time(GG2(X,Y)) > > user system elapsed > > 0.072 0.000 0.072 > >> system.time(GG2(X,Y)) > > user system elapsed > > 0.080 0.000 0.079 > >> system.time(GG2(X,Y)) > > user system elapsed > > 0.072 0.000 0.072 > > > > Where those times are in seconds. So the task in question here, takes > > 0.07 seconds ?! 
> > > > The 150x longer figure is actually (using figures from the S.O. answer) > > 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds > > (0.000168 seconds). 0.024 seconds / 0.000168 = "150 times". If you > > rounded to milliseconds you could say data.table is infinitely slower > (24ms > > / 0ms = Inf). > > > > I can believe there's scope for improvement, sure, but not from this > > benchmark. The vectors need to be *much* bigger and replications needs > to be > > *much* smaller, say 3. The task being timed needs to take a meaningful > > amount of time (say 5 seconds) *for a single run*. > > > > Matt > > > > > > On 02/02/14 12:27, Gabor Grothendieck wrote: > > > > The benchmark at the bottom of this post shows a problem where a > data.table > > roll="next" took nearly 150x longer than a base findInterval() solution. > > (The data.table solution is easier to write though.) This suggests an > area > > for possible speed improvement. > > > > > http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 > > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com > > > > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Wed Feb 5 16:42:10 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 5 Feb 2014 16:42:10 +0100 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: Seems like the "by-without-by" is what's slowing things down: require(data.table) dtx <- data.table(x=which(X), key="x") dty <- data.table(y=which(Y), key="y") dtx[, x1 := x] dty[, y1 := y] system.time(ans <- dty[dtx, roll="nearest"][, abs(x1-y1)]) user system elapsed 1.321 0.076 1.396 system.time(ans2 <- flodel(x,y)) user system elapsed 0.936 0.044 0.977 identical(ans, ans2) # [1] TRUE On Wed, Feb 5, 2014 at 4:32 PM, Arunkumar Srinivasan wrote: > Just tested. Works just fine (on 1.8.11). Takes 16 seconds as opposed to > Flodel's which takes 1.4 seconds on my laptop. Also identical returned TRUE. > Will see where's the delay coming from. > > > On Wed, Feb 5, 2014 at 4:22 PM, Gabor Grothendieck < > ggrothendieck at gmail.com> wrote: > >> There was anoither benchmark posted with larger data and longer times >> but this time data.table stopped with an error. See: >> >> >> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 >> >> On Mon, Feb 3, 2014 at 6:46 AM, Matt Dowle >> wrote: >> > Gabor, >> > >> > With that said about it being a micro benchmark, by-without-by might >> be at >> > play in GG2(X,Y) here; i.e. running j for each row of i, where it could >> run >> > once. 
I remember you and others quite rightly said by-without-by >> should be >> > explicit ... still got to make that change. A similar speed issue came >> up >> > recently somewhere else as well which the change in default should help. >> > >> > Matt >> > >> > >> > On 02/02/14 18:57, Matt Dowle wrote: >> > >> > >> > But this is at the *micro* second level ?!! >> > >> > I confirm those results on my slow netbook but remember these are >> **micro** >> > seconds i.e. 71,000 here is less than 0.1 of a second. >> > >> >> microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y)) >> > Unit: microseconds >> > expr min lq median uq max >> neval >> > flodel(X, Y) 330.798 369.369 402.7935 455.3225 17996.26 >> 100 >> > GG1(X, Y) 14287.380 14370.038 14466.5990 16010.5440 121082.77 >> 100 >> > GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62 >> 100 >> > >> > To put it in some perspective : >> > >> >> system.time(GG2(X,Y)) >> > user system elapsed >> > 0.072 0.000 0.072 >> >> system.time(GG2(X,Y)) >> > user system elapsed >> > 0.080 0.000 0.079 >> >> system.time(GG2(X,Y)) >> > user system elapsed >> > 0.072 0.000 0.072 >> > >> > Where those times are in seconds. So the task in question here, takes >> > 0.07 seconds ?! >> > >> > The 150x longer figure is actually (using figures from the S.O. answer) >> > 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds >> > (0.000168 seconds). 0.024 seconds / 0.000168 = "150 times". If you >> > rounded to milliseconds you could say data.table is infinitely slower >> (24ms >> > / 0ms = Inf). >> > >> > I can believe there's scope for improvement, sure, but not from this >> > benchmark. The vectors need to be *much* bigger and replications needs >> to be >> > *much* smaller, say 3. The task being timed needs to take a meaningful >> > amount of time (say 5 seconds) *for a single run*. >> > >> > Matt >> > >> > >> > On 02/02/14 12:27, Gabor Grothendieck wrote: >> > >> > The benchmark at the bottom of this post shows a problem where a >> data.table >> > roll="next" took nearly 150x longer than a base findInterval() solution. >> > (The data.table solution is easier to write though.) This suggests an >> area >> > for possible speed improvement. >> > >> > >> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 >> > >> > -- >> > Statistics & Software Consulting >> > GKX Group, GKX Associates Inc. >> > tel: 1-877-GKX-GROUP >> > email: ggrothendieck at gmail.com >> > >> > >> > _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >> > >> > >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aragorn168b at gmail.com Wed Feb 5 17:12:03 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 5 Feb 2014 17:12:03 +0100 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: Have edited here now: http://stackoverflow.com/a/21500855/559784 On Wed, Feb 5, 2014 at 4:42 PM, Arunkumar Srinivasan wrote: > Seems like the "by-without-by" is what's slowing things down: > > require(data.table) > dtx <- data.table(x=which(X), key="x") > dty <- data.table(y=which(Y), key="y") > dtx[, x1 := x] > dty[, y1 := y] > system.time(ans <- dty[dtx, roll="nearest"][, abs(x1-y1)]) > user system elapsed > 1.321 0.076 1.396 > system.time(ans2 <- flodel(x,y)) > user system elapsed > 0.936 0.044 0.977 > > identical(ans, ans2) # [1] TRUE > > > On Wed, Feb 5, 2014 at 4:32 PM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> Just tested. Works just fine (on 1.8.11). Takes 16 seconds as opposed to >> Flodel's which takes 1.4 seconds on my laptop. Also identical returned TRUE. >> Will see where's the delay coming from. >> >> >> On Wed, Feb 5, 2014 at 4:22 PM, Gabor Grothendieck < >> ggrothendieck at gmail.com> wrote: >> >>> There was anoither benchmark posted with larger data and longer times >>> but this time data.table stopped with an error. See: >>> >>> >>> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 >>> >>> On Mon, Feb 3, 2014 at 6:46 AM, Matt Dowle >>> wrote: >>> > Gabor, >>> > >>> > With that said about it being a micro benchmark, by-without-by might >>> be at >>> > play in GG2(X,Y) here; i.e. running j for each row of i, where it >>> could run >>> > once. I remember you and others quite rightly said by-without-by >>> should be >>> > explicit ... still got to make that change. A similar speed issue >>> came up >>> > recently somewhere else as well which the change in default should >>> help. >>> > >>> > Matt >>> > >>> > >>> > On 02/02/14 18:57, Matt Dowle wrote: >>> > >>> > >>> > But this is at the *micro* second level ?!! >>> > >>> > I confirm those results on my slow netbook but remember these are >>> **micro** >>> > seconds i.e. 71,000 here is less than 0.1 of a second. >>> > >>> >> microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y)) >>> > Unit: microseconds >>> > expr min lq median uq max >>> neval >>> > flodel(X, Y) 330.798 369.369 402.7935 455.3225 17996.26 >>> 100 >>> > GG1(X, Y) 14287.380 14370.038 14466.5990 16010.5440 121082.77 >>> 100 >>> > GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62 >>> 100 >>> > >>> > To put it in some perspective : >>> > >>> >> system.time(GG2(X,Y)) >>> > user system elapsed >>> > 0.072 0.000 0.072 >>> >> system.time(GG2(X,Y)) >>> > user system elapsed >>> > 0.080 0.000 0.079 >>> >> system.time(GG2(X,Y)) >>> > user system elapsed >>> > 0.072 0.000 0.072 >>> > >>> > Where those times are in seconds. So the task in question here, >>> takes >>> > 0.07 seconds ?! >>> > >>> > The 150x longer figure is actually (using figures from the S.O. answer) >>> > 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds >>> > (0.000168 seconds). 0.024 seconds / 0.000168 = "150 times". If you >>> > rounded to milliseconds you could say data.table is infinitely slower >>> (24ms >>> > / 0ms = Inf). >>> > >>> > I can believe there's scope for improvement, sure, but not from this >>> > benchmark. 
The vectors need to be *much* bigger and replications needs >>> to be >>> > *much* smaller, say 3. The task being timed needs to take a >>> meaningful >>> > amount of time (say 5 seconds) *for a single run*. >>> > >>> > Matt >>> > >>> > >>> > On 02/02/14 12:27, Gabor Grothendieck wrote: >>> > >>> > The benchmark at the bottom of this post shows a problem where a >>> data.table >>> > roll="next" took nearly 150x longer than a base findInterval() >>> solution. >>> > (The data.table solution is easier to write though.) This suggests an >>> area >>> > for possible speed improvement. >>> > >>> > >>> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855 >>> > >>> > -- >>> > Statistics & Software Consulting >>> > GKX Group, GKX Associates Inc. >>> > tel: 1-877-GKX-GROUP >>> > email: ggrothendieck at gmail.com >>> > >>> > >>> > _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> > >>> > >>> > >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Thu Feb 6 12:55:41 2014 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Thu, 6 Feb 2014 06:55:41 -0500 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: On Wed, Feb 5, 2014 at 10:42 AM, Arunkumar Srinivasan wrote: > Seems like the "by-without-by" is what's slowing things down: > > require(data.table) > dtx <- data.table(x=which(X), key="x") > dty <- data.table(y=which(Y), key="y") > dtx[, x1 := x] > dty[, y1 := y] > system.time(ans <- dty[dtx, roll="nearest"][, abs(x1-y1)]) > user system elapsed > 1.321 0.076 1.396 > system.time(ans2 <- flodel(x,y)) > user system elapsed > 0.936 0.044 0.977 > > identical(ans, ans2) # [1] TRUE What will the code look like after the explicit by-without-by feature is added? -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From aragorn168b at gmail.com Thu Feb 6 14:23:31 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 6 Feb 2014 14:23:31 +0100 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: In this case? Then nothing'll be different. I'm not sure what you mean because the problem here is that this *doesn't* require *by-without-by* as the j-operations are not necessary to be performed *during* the join. So, we can just perform the join and then take the "abs" once at the end, rather than calling it about 1e5+ times (the number of groups). So, if your question is: "apart from this question, how would an explicit by-without-by look like?", then I guess it'd be the same as the normal aggregation, but "by" would take a data.table as well. 
This has not yet been discussed or conceptualised. But this is how I imagine it to be: DT1[, list(...), by=DT2] Where, DT1's key columns have to be set as usual. On Thu, Feb 6, 2014 at 12:55 PM, Gabor Grothendieck wrote: > On Wed, Feb 5, 2014 at 10:42 AM, Arunkumar Srinivasan > wrote: > > Seems like the "by-without-by" is what's slowing things down: > > > > require(data.table) > > dtx <- data.table(x=which(X), key="x") > > dty <- data.table(y=which(Y), key="y") > > dtx[, x1 := x] > > dty[, y1 := y] > > system.time(ans <- dty[dtx, roll="nearest"][, abs(x1-y1)]) > > user system elapsed > > 1.321 0.076 1.396 > > system.time(ans2 <- flodel(x,y)) > > user system elapsed > > 0.936 0.044 0.977 > > > > identical(ans, ans2) # [1] TRUE > > What will the code look like after the explicit by-without-by feature is > added? > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Thu Feb 6 14:45:10 2014 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Thu, 6 Feb 2014 08:45:10 -0500 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: On Thu, Feb 6, 2014 at 8:23 AM, Arunkumar Srinivasan wrote: > In this case? Then nothing'll be different. > > I'm not sure what you mean because the problem here is that this *doesn't* > require *by-without-by* as the j-operations are not necessary to be > performed *during* the join. So, we can just perform the join and then take > the "abs" once at the end, rather than calling it about 1e5+ times (the > number of groups). > > So, if your question is: "apart from this question, how would an explicit > by-without-by look like?", then I guess it'd be the same as the normal > aggregation, but "by" would take a data.table as well. This has not yet been > discussed or conceptualised. But this is how I imagine it to be: > > DT1[, list(...), by=DT2] > > Where, DT1's key columns have to be set as usual. My original code was this: dtx <- data.table(x = which(x)) dty <- data.table(y = which(y), key = "y") dty[dtx, abs(x - y), roll = "nearest"] With that feature would this code not use by-within-by and therefore become fast? From aragorn168b at gmail.com Thu Feb 6 14:53:18 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 6 Feb 2014 14:53:18 +0100 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: Not really. Because it still doing a "by". Meaning, for every grouping in "by" - abs(x-y) will be evaluated. If there are 1e5 groups, there'll be 1e5 calls. And that can be expensive depending on the function + the time to call eval from within C. However, since it's not necessary to do a by-without-by, we can perform the join and then compute once the difference between columns. There's no grouping, no eval from C, and no multiple calls to abs. Hope this clears it up? On Thu, Feb 6, 2014 at 2:45 PM, Gabor Grothendieck wrote: > On Thu, Feb 6, 2014 at 8:23 AM, Arunkumar Srinivasan > wrote: > > In this case? Then nothing'll be different. 
> > > > I'm not sure what you mean because the problem here is that this > *doesn't* > > require *by-without-by* as the j-operations are not necessary to be > > performed *during* the join. So, we can just perform the join and then > take > > the "abs" once at the end, rather than calling it about 1e5+ times (the > > number of groups). > > > > So, if your question is: "apart from this question, how would an explicit > > by-without-by look like?", then I guess it'd be the same as the normal > > aggregation, but "by" would take a data.table as well. This has not yet > been > > discussed or conceptualised. But this is how I imagine it to be: > > > > DT1[, list(...), by=DT2] > > > > Where, DT1's key columns have to be set as usual. > > My original code was this: > > dtx <- data.table(x = which(x)) > dty <- data.table(y = which(y), key = "y") > dty[dtx, abs(x - y), roll = "nearest"] > > With that feature would this code not use by-within-by and therefore > become fast? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Thu Feb 6 15:20:37 2014 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Thu, 6 Feb 2014 09:20:37 -0500 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: On Thu, Feb 6, 2014 at 8:53 AM, Arunkumar Srinivasan wrote: > Not really. Because it still doing a "by". Meaning, for every grouping in > "by" - abs(x-y) will be evaluated. If there are 1e5 groups, there'll be 1e5 > calls. And that can be expensive depending on the function + the time to > call eval from within C. > > However, since it's not necessary to do a by-without-by, we can perform the > join and then compute once the difference between columns. There's no > grouping, no eval from C, and no multiple calls to abs. Hope this clears it > up? > > In that case what is the proposed user interface? I thought that the idea was that one would have to explicitly specify the by= clause for by-within-by it to occur. In the code I had just posted there is a join = "nearest" but no by= clause is specified. From aragorn168b at gmail.com Thu Feb 6 15:58:28 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 6 Feb 2014 15:58:28 +0100 Subject: [datatable-help] datatable roll="next" takes 150 times longer than findInterval In-Reply-To: References: <52EE9527.10608@mdowle.plus.com> <52EF818F.8090907@mdowle.plus.com> Message-ID: Gabor, I think now I understand what your earlier post was about. You mean after the external by-without-by, doing DT1[DT2, ..., ] will be faster as it shouldn't do a by-without-by. Yes, that's true. So basically, the statement: dty[dtx, abs(x - y), roll = "nearest"] once external by-without-by is implemented, will/should first do the join and then do the "j' operation. And therefore it'll be as fast as the solution I wrote. If one wants to perform the j-operation for each group, then they'll have to do something like DT1[, j, by=DT2] (or any other solutions we end up on) Sorry for the misunderstanding. On Thu, Feb 6, 2014 at 3:20 PM, Gabor Grothendieck wrote: > On Thu, Feb 6, 2014 at 8:53 AM, Arunkumar Srinivasan > wrote: > > Not really. Because it still doing a "by". Meaning, for every grouping in > > "by" - abs(x-y) will be evaluated. If there are 1e5 groups, there'll be > 1e5 > > calls. And that can be expensive depending on the function + the time to > > call eval from within C. 
> > > > However, since it's not necessary to do a by-without-by, we can perform > the > > join and then compute once the difference between columns. There's no > > grouping, no eval from C, and no multiple calls to abs. Hope this clears > it > > up? > > > > > > In that case what is the proposed user interface? > > I thought that the idea was that one would have to explicitly specify > the by= clause for by-within-by it to occur. In the code I had just > posted there is a join = "nearest" but no by= clause is specified. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yikelu.home at gmail.com Fri Feb 7 00:38:49 2014 From: yikelu.home at gmail.com (Yike Lu) Date: Thu, 6 Feb 2014 17:38:49 -0600 Subject: [datatable-help] integer64 group by doesn't find all groups Message-ID: After a long hiatus, I am back to using data.table. Unfortunately, I've encountered a problem. Am I doing something wrong here? require(data.table) dt = data.table(idx = 1:100 %% 3, 1:100) dt[, list(sum(V2)), by = idx] # normal require(bit64) dt2 = data.table(idx = integer64(100) + 1:100 %% 3, 1:100) dt2[, list(sum(V2)), by = idx] # only has one group: # idx V1 #1: 1 5050 -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Wed Feb 12 17:01:56 2014 From: caneff at gmail.com (caneff at gmail.com) Date: Wed, 12 Feb 2014 16:01:56 +0000 Subject: [datatable-help] Infinite numeric key doesn't collapse Message-ID: I have a numeric key in a data.table that sometimes has infinite values. I discovered today that Inf does not collapse when used in a by. Is this expected? It surprised me: DT <- data.table(x=rep(c(1,Inf), each=10), y=1:20) DT[, sum(y), by=x] # The x==1 cases collapse, but the Inf cases don't -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Wed Feb 12 17:04:07 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 12 Feb 2014 17:04:07 +0100 Subject: [datatable-help] Infinite numeric key doesn't collapse In-Reply-To: References: Message-ID: Caneff, I'm guessing you're using 1.8.10. This has been fixed a while ago in the current devel version 1.8.11. Or you can wait until the next release (which should be very soon now). Arun From:?caneff at gmail.com caneff at gmail.com Reply:?caneff at gmail.com caneff at gmail.com Date:?February 12, 2014 at 5:02:14 PM To:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? [datatable-help] Infinite numeric key doesn't collapse I have a numeric key in a data.table that sometimes has infinite values. I discovered today that Inf does not collapse when used in a by. ?Is this expected? It surprised me: DT <- data.table(x=rep(c(1,Inf), each=10), y=1:20) DT[, sum(y), by=x] # The x==1 cases collapse, but the Inf cases don't _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Wed Feb 12 17:07:04 2014 From: caneff at gmail.com (caneff at gmail.com) Date: Wed, 12 Feb 2014 16:07:04 +0000 Subject: [datatable-help] Infinite numeric key doesn't collapse References: Message-ID: Whoops! 
Sorry I try to keep synced to the latest devel version, but sometimes because of work related updates packages get overwritten back to the latest public version. Sorry about that. I also found an easy workaround since the number of unique values is low, I can make it an ordered factor. On Wed Feb 12 2014 at 11:04:10 AM, Arunkumar Srinivasan < aragorn168b at gmail.com> wrote: > Caneff, > > I'm guessing you're using 1.8.10. This has been fixed a while ago in the > current devel version 1.8.11. Or you can wait until the next release (which > should be very soon now). > Arun > ------------------------------ > From: caneff at gmail.com caneff at gmail.com > Reply: caneff at gmail.com caneff at gmail.com > Date: February 12, 2014 at 5:02:14 PM > To: datatable-help at lists.r-forge.r-project.org > datatable-help at lists.r-forge.r-project.org > Subject: [datatable-help] Infinite numeric key doesn't collapse > > I have a numeric key in a data.table that sometimes has infinite values. I > discovered today that Inf does not collapse when used in a by. Is this > expected? It surprised me: > > DT <- data.table(x=rep(c(1,Inf), each=10), y=1:20) > > DT[, sum(y), by=x] # The x==1 cases collapse, but the Inf cases don't > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Feb 12 17:22:26 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Wed, 12 Feb 2014 16:22:26 +0000 Subject: [datatable-help] integer64 group by doesn't find all groups In-Reply-To: References: Message-ID: <52FB9FC2.4000305@mdowle.plus.com> Hi, You're doing nothing wrong. Although you can load integer64 using fread and create them directly, data.table's grouping and keys don't work on them yet. Sorry, just not yet implemented. Because integer64 are internally stored as type double (a good idea by package bit64), data.table sees them internally as double and doesn't catch that the type isn't supported yet (hence no error message such as you get for type 'complex'). The particular integer64 numbers in this example are quite small so will use the lower bits. In double, those are the most precise part of the significand, which would explain why only one group comes out here since data.table groups and joins floating point data within tolerance. Matt On 06/02/14 23:38, Yike Lu wrote: > After a long hiatus, I am back to using data.table. Unfortunately, > I've encountered a problem. Am I doing something wrong here? > > require(data.table) > > dt = data.table(idx = 1:100 %% 3, 1:100) > dt[, list(sum(V2)), by = idx] > # normal > > require(bit64) > > dt2 = data.table(idx = integer64(100) + 1:100 %% 3, 1:100) > dt2[, list(sum(V2)), by = idx] > # only has one group: > # idx V1 > #1: 1 5050 > From caneff at gmail.com Wed Feb 12 17:26:06 2014 From: caneff at gmail.com (caneff at gmail.com) Date: Wed, 12 Feb 2014 16:26:06 +0000 Subject: [datatable-help] integer64 group by doesn't find all groups References: <52FB9FC2.4000305@mdowle.plus.com> Message-ID: FYI (and this is a long outstanding argument) this is why I don't like the bit64 package. These sorts of errors happen silently. I understand that data.table can't use the other integer64 package, but at least there it is obvious when things are being coerced. 
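A sketch of the kind of workaround described in the next paragraph, applied to the dt2 example above (as.character() keeps integer64 IDs exact, while as.numeric() can lose precision for very large values; untested):

dt2[, idx := as.character(idx)]   # or as.numeric(idx) where lost precision is acceptable
dt2[, list(sum(V2)), by = idx]    # now finds all three groups 0, 1 and 2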
In my situations, if I am grouping by a int64, it is usually either an ID so I can just make it a character vector instead, or it is something where I don't mind lost precision so I just make it numeric. On Wed Feb 12 2014 at 11:22:40 AM, Matt Dowle wrote: > > Hi, > > You're doing nothing wrong. Although you can load integer64 using fread > and create them directly, data.table's grouping and keys don't work on > them yet. Sorry, just not yet implemented. Because integer64 are > internally stored as type double (a good idea by package bit64), > data.table sees them internally as double and doesn't catch that the > type isn't supported yet (hence no error message such as you get for > type 'complex'). The particular integer64 numbers in this example are > quite small so will use the lower bits. In double, those are the most > precise part of the significand, which would explain why only one group > comes out here since data.table groups and joins floating point data > within tolerance. > > Matt > > On 06/02/14 23:38, Yike Lu wrote: > > After a long hiatus, I am back to using data.table. Unfortunately, > > I've encountered a problem. Am I doing something wrong here? > > > > require(data.table) > > > > dt = data.table(idx = 1:100 %% 3, 1:100) > > dt[, list(sum(V2)), by = idx] > > # normal > > > > require(bit64) > > > > dt2 = data.table(idx = integer64(100) + 1:100 %% 3, 1:100) > > dt2[, list(sum(V2)), by = idx] > > # only has one group: > > # idx V1 > > #1: 1 5050 > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/ > listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Feb 12 17:39:44 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Wed, 12 Feb 2014 16:39:44 +0000 Subject: [datatable-help] integer64 group by doesn't find all groups In-Reply-To: References: <52FB9FC2.4000305@mdowle.plus.com> Message-ID: <52FBA3D0.60109@mdowle.plus.com> Sometimes we take the hard road in data.table, to get to a better place. Once bit64::integer64 is fully supported, it'll be much easier. All the recent radix work for double applies almost automatically to integer64 for example, but that radix work had to be done first. On 12/02/14 16:26, caneff at gmail.com wrote: > FYI (and this is a long outstanding argument) this is why I don't like > the bit64 package. These sorts of errors happen silently. I > understand that data.table can't use the other integer64 package, but > at least there it is obvious when things are being coerced. > > In my situations, if I am grouping by a int64, it is usually either an > ID so I can just make it a character vector instead, or it is > something where I don't mind lost precision so I just make it numeric. > > On Wed Feb 12 2014 at 11:22:40 AM, Matt Dowle > wrote: > > > Hi, > > You're doing nothing wrong. Although you can load integer64 using > fread > and create them directly, data.table's grouping and keys don't > work on > them yet. Sorry, just not yet implemented. Because integer64 are > internally stored as type double (a good idea by package bit64), > data.table sees them internally as double and doesn't catch that the > type isn't supported yet (hence no error message such as you get for > type 'complex'). The particular integer64 numbers in this > example are > quite small so will use the lower bits. 
In double, those are the most > precise part of the significand, which would explain why only one > group > comes out here since data.table groups and joins floating point data > within tolerance. > > Matt > > On 06/02/14 23:38, Yike Lu wrote: > > After a long hiatus, I am back to using data.table. Unfortunately, > > I've encountered a problem. Am I doing something wrong here? > > > > require(data.table) > > > > dt = data.table(idx = 1:100 %% 3, 1:100) > > dt[, list(sum(V2)), by = idx] > > # normal > > > > require(bit64) > > > > dt2 = data.table(idx = integer64(100) + 1:100 %% 3, 1:100) > > dt2[, list(sum(V2)), by = idx] > > # only has one group: > > # idx V1 > > #1: 1 5050 > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Wed Feb 12 18:17:16 2014 From: caneff at gmail.com (caneff at gmail.com) Date: Wed, 12 Feb 2014 17:17:16 +0000 Subject: [datatable-help] integer64 group by doesn't find all groups References: <52FB9FC2.4000305@mdowle.plus.com> <52FBA3D0.60109@mdowle.plus.com> Message-ID: Yes this isn't a data.table criticism, just a bit64 one in general. On Wed Feb 12 2014 at 11:39:47 AM, Matt Dowle wrote: > > Sometimes we take the hard road in data.table, to get to a better place. > Once bit64::integer64 is fully supported, it'll be much easier. All the > recent radix work for double applies almost automatically to integer64 for > example, but that radix work had to be done first. > > > On 12/02/14 16:26, caneff at gmail.com wrote: > > FYI (and this is a long outstanding argument) this is why I don't like the > bit64 package. These sorts of errors happen silently. I understand that > data.table can't use the other integer64 package, but at least there it is > obvious when things are being coerced. > > In my situations, if I am grouping by a int64, it is usually either an > ID so I can just make it a character vector instead, or it is something > where I don't mind lost precision so I just make it numeric. > > On Wed Feb 12 2014 at 11:22:40 AM, Matt Dowle > wrote: > > > Hi, > > You're doing nothing wrong. Although you can load integer64 using fread > and create them directly, data.table's grouping and keys don't work on > them yet. Sorry, just not yet implemented. Because integer64 are > internally stored as type double (a good idea by package bit64), > data.table sees them internally as double and doesn't catch that the > type isn't supported yet (hence no error message such as you get for > type 'complex'). The particular integer64 numbers in this example are > quite small so will use the lower bits. In double, those are the most > precise part of the significand, which would explain why only one group > comes out here since data.table groups and joins floating point data > within tolerance. > > Matt > > On 06/02/14 23:38, Yike Lu wrote: > > After a long hiatus, I am back to using data.table. Unfortunately, > > I've encountered a problem. Am I doing something wrong here? 
> > > > require(data.table) > > > > dt = data.table(idx = 1:100 %% 3, 1:100) > > dt[, list(sum(V2)), by = idx] > > # normal > > > > require(bit64) > > > > dt2 = data.table(idx = integer64(100) + 1:100 %% 3, 1:100) > > dt2[, list(sum(V2)), by = idx] > > # only has one group: > > # idx V1 > > #1: 1 5050 > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.laing at gmail.com Wed Feb 12 18:24:27 2014 From: john.laing at gmail.com (John Laing) Date: Wed, 12 Feb 2014 12:24:27 -0500 Subject: [datatable-help] Force evaluation of first argument to [ Message-ID: Let's say I merge together several data.tables such that I wind up with lots of NAs: require(data.table) foo <- data.table(k=1:4, foo=TRUE, key="k") bar <- data.table(k=3:6, bar=TRUE, key="k") qux <- data.table(k=5:8, qux=TRUE, key="k") fbq <- merge(merge(foo, bar, all=TRUE), qux, all=TRUE) print(fbq) # k foo bar qux # 1: 1 TRUE NA NA # 2: 2 TRUE NA NA # 3: 3 TRUE TRUE NA # 4: 4 TRUE TRUE NA # 5: 5 NA TRUE TRUE # 6: 6 NA TRUE TRUE # 7: 7 NA NA TRUE # 8: 8 NA NA TRUE I want to go through those columns and turn each NA into FALSE. I can do this by writing code for each column: fbq.cp <- copy(fbq) fbq.cp[is.na(foo), foo:=FALSE] fbq.cp[is.na(bar), bar:=FALSE] fbq.cp[is.na(qux), qux:=FALSE] print(fbq.cp) # k foo bar qux # 1: 1 TRUE FALSE FALSE # 2: 2 TRUE FALSE FALSE # 3: 3 TRUE TRUE FALSE # 4: 4 TRUE TRUE FALSE # 5: 5 FALSE TRUE TRUE # 6: 6 FALSE TRUE TRUE # 7: 7 FALSE FALSE TRUE # 8: 8 FALSE FALSE TRUE But I can't figure out how to do it in a loop. More precisely, I can't figure out how to make the [ operator evaluate its first argument in the context of the data.table. All of these have no effect: for (x in c("foo", "bar", "qux")) fbq[is.na(x), eval(x):=FALSE] for (x in c("foo", "bar", "qux")) fbq[is.na(eval(x)), eval(x):=FALSE] for (x in c("foo", "bar", "qux")) fbq[eval(is.na(x)), eval(x):=FALSE] I'm running R 3.0.2 on Linux, data.table 1.8.10. Thanks in advance, John -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Feb 12 18:44:11 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Wed, 12 Feb 2014 17:44:11 +0000 Subject: [datatable-help] Force evaluation of first argument to [ In-Reply-To: References: Message-ID: <52FBB2EB.2070000@mdowle.plus.com> Hi John, In examples like this I'd use set() and [[, since it's a bit easier to write but memory efficient too. for (x in c("foo", "bar", "qux")) set(fbq, is.na(fbq[[x]]), x, FALSE) [untested] A downside here is one repetition of the "fbq" symbol, but can live with that. If you have a large number of columns (and I've been surprised just how many columns some poeple have!) then calling set() many times has lower overhead than DT[, :=], see ?set. Note also that [[ is base R, doesn't copy the column and often useful to use with data.table. Or, use get() in either i or j rather than eval(). 
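A sketch of that get() route for this example (hypothetical and untested; the (x) := form on the left-hand side needs a newer data.table than the 1.8.10 used in this thread):

for (x in c("foo", "bar", "qux")) fbq[is.na(get(x)), (x) := FALSE]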
HTH, Matt On 12/02/14 17:24, John Laing wrote: > Let's say I merge together several data.tables such that I wind up > with lots of NAs: > > require(data.table) > foo <- data.table(k=1:4, foo=TRUE, key="k") > bar <- data.table(k=3:6, bar=TRUE, key="k") > qux <- data.table(k=5:8, qux=TRUE, key="k") > fbq <- merge(merge(foo, bar, all=TRUE), qux, all=TRUE) > print(fbq) > # k foo bar qux > # 1: 1 TRUE NA NA > # 2: 2 TRUE NA NA > # 3: 3 TRUE TRUE NA > # 4: 4 TRUE TRUE NA > # 5: 5 NA TRUE TRUE > # 6: 6 NA TRUE TRUE > # 7: 7 NA NA TRUE > # 8: 8 NA NA TRUE > > I want to go through those columns and turn each NA into FALSE. I can > do this by writing code for each column: > > fbq.cp <- copy(fbq) > fbq.cp[is.na (foo), foo:=FALSE] > fbq.cp[is.na (bar), bar:=FALSE] > fbq.cp[is.na (qux), qux:=FALSE] > print(fbq.cp) > # k foo bar qux > # 1: 1 TRUE FALSE FALSE > # 2: 2 TRUE FALSE FALSE > # 3: 3 TRUE TRUE FALSE > # 4: 4 TRUE TRUE FALSE > # 5: 5 FALSE TRUE TRUE > # 6: 6 FALSE TRUE TRUE > # 7: 7 FALSE FALSE TRUE > # 8: 8 FALSE FALSE TRUE > > But I can't figure out how to do it in a loop. More precisely, I can't > figure out how to make the [ operator evaluate its first argument in > the context of the data.table. All of these have no effect: > for (x in c("foo", "bar", "qux")) fbq[is.na (x), > eval(x):=FALSE] > for (x in c("foo", "bar", "qux")) fbq[is.na (eval(x)), > eval(x):=FALSE] > for (x in c("foo", "bar", "qux")) fbq[eval(is.na (x)), > eval(x):=FALSE] > > I'm running R 3.0.2 on Linux, data.table 1.8.10. > > Thanks in advance, > John > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.laing at gmail.com Wed Feb 12 18:58:57 2014 From: john.laing at gmail.com (John Laing) Date: Wed, 12 Feb 2014 12:58:57 -0500 Subject: [datatable-help] Force evaluation of first argument to [ In-Reply-To: <52FBB2EB.2070000@mdowle.plus.com> References: <52FBB2EB.2070000@mdowle.plus.com> Message-ID: Thanks, Matt! With a slight amendment that works great: for (x in c("foo", "bar", "qux")) set(fbq, which(is.na(fbq[[x]])), x, FALSE) Which highlights an opportunity to say that I really appreciate the unusually helpful error messages in this package. -John On Wed, Feb 12, 2014 at 12:44 PM, Matt Dowle wrote: > > Hi John, > > In examples like this I'd use set() and [[, since it's a bit easier to > write but memory efficient too. > > for (x in c("foo", "bar", "qux")) set(fbq, is.na(fbq[[x]]), x, > FALSE) [untested] > > A downside here is one repetition of the "fbq" symbol, but can live with > that. If you have a large number of columns (and I've been surprised just > how many columns some poeple have!) then calling set() many times has lower > overhead than DT[, :=], see ?set. Note also that [[ is base R, doesn't > copy the column and often useful to use with data.table. > > Or, use get() in either i or j rather than eval(). 
> > HTH, Matt > > > > On 12/02/14 17:24, John Laing wrote: > > Let's say I merge together several data.tables such that I wind up > with lots of NAs: > > require(data.table) > foo <- data.table(k=1:4, foo=TRUE, key="k") > bar <- data.table(k=3:6, bar=TRUE, key="k") > qux <- data.table(k=5:8, qux=TRUE, key="k") > fbq <- merge(merge(foo, bar, all=TRUE), qux, all=TRUE) > print(fbq) > # k foo bar qux > # 1: 1 TRUE NA NA > # 2: 2 TRUE NA NA > # 3: 3 TRUE TRUE NA > # 4: 4 TRUE TRUE NA > # 5: 5 NA TRUE TRUE > # 6: 6 NA TRUE TRUE > # 7: 7 NA NA TRUE > # 8: 8 NA NA TRUE > > I want to go through those columns and turn each NA into FALSE. I can > do this by writing code for each column: > > fbq.cp <- copy(fbq) > fbq.cp[is.na(foo), foo:=FALSE] > fbq.cp[is.na(bar), bar:=FALSE] > fbq.cp[is.na(qux), qux:=FALSE] > print(fbq.cp) > # k foo bar qux > # 1: 1 TRUE FALSE FALSE > # 2: 2 TRUE FALSE FALSE > # 3: 3 TRUE TRUE FALSE > # 4: 4 TRUE TRUE FALSE > # 5: 5 FALSE TRUE TRUE > # 6: 6 FALSE TRUE TRUE > # 7: 7 FALSE FALSE TRUE > # 8: 8 FALSE FALSE TRUE > > But I can't figure out how to do it in a loop. More precisely, I can't > figure out how to make the [ operator evaluate its first argument in > the context of the data.table. All of these have no effect: > for (x in c("foo", "bar", "qux")) fbq[is.na(x), eval(x):=FALSE] > for (x in c("foo", "bar", "qux")) fbq[is.na(eval(x)), eval(x):=FALSE] > for (x in c("foo", "bar", "qux")) fbq[eval(is.na(x)), eval(x):=FALSE] > > I'm running R 3.0.2 on Linux, data.table 1.8.10. > > Thanks in advance, > John > > > _______________________________________________ > datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Feb 12 20:22:04 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Wed, 12 Feb 2014 19:22:04 +0000 Subject: [datatable-help] Force evaluation of first argument to [ In-Reply-To: References: <52FBB2EB.2070000@mdowle.plus.com> Message-ID: <52FBC9DC.2010809@mdowle.plus.com> Ha. Yes we certainly don't hold back from making the messages as long and as helpful as possible. If the code knows, or can know what exactly is wrong, it's a deliberate policy to put that info right there into the message. data.table is written by users; i.e. we wrote it for ourselves doing real jobs. I think that may be the root of that. If any messages could more helpful, those suggestions are very welcome. Matt On 12/02/14 17:58, John Laing wrote: > Thanks, Matt! With a slight amendment that works great: > for (x in c("foo", "bar", "qux")) set(fbq, which(is.na > (fbq[[x]])), x, FALSE) > > Which highlights an opportunity to say that I really appreciate the > unusually helpful error messages in this package. > > -John > > > On Wed, Feb 12, 2014 at 12:44 PM, Matt Dowle > wrote: > > > Hi John, > > In examples like this I'd use set() and [[, since it's a bit > easier to write but memory efficient too. > > for (x in c("foo", "bar", "qux")) set(fbq, is.na > (fbq[[x]]), x, FALSE) [untested] > > A downside here is one repetition of the "fbq" symbol, but can > live with that. If you have a large number of columns (and I've > been surprised just how many columns some poeple have!) then > calling set() many times has lower overhead than DT[, :=], see > ?set. Note also that [[ is base R, doesn't copy the column and > often useful to use with data.table. 
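The lower-overhead point above can be checked directly. A rough sketch, assuming the microbenchmark package is available (absolute timings will vary; the point is only the relative per-call cost of a full [.data.table call versus set()):

library(data.table)
library(microbenchmark)
DT <- data.table(a = 1:1000)
microbenchmark(
  bracket = DT[1L, a := 0L],       # full [ call, with its argument handling on every iteration
  direct  = set(DT, 1L, "a", 0L),  # plain assignment by reference
  times = 100
)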
> > Or, use get() in either i or j rather than eval(). > > HTH, Matt > > > > On 12/02/14 17:24, John Laing wrote: >> Let's say I merge together several data.tables such that I wind up >> with lots of NAs: >> >> require(data.table) >> foo <- data.table(k=1:4, foo=TRUE, key="k") >> bar <- data.table(k=3:6, bar=TRUE, key="k") >> qux <- data.table(k=5:8, qux=TRUE, key="k") >> fbq <- merge(merge(foo, bar, all=TRUE), qux, all=TRUE) >> print(fbq) >> # k foo bar qux >> # 1: 1 TRUE NA NA >> # 2: 2 TRUE NA NA >> # 3: 3 TRUE TRUE NA >> # 4: 4 TRUE TRUE NA >> # 5: 5 NA TRUE TRUE >> # 6: 6 NA TRUE TRUE >> # 7: 7 NA NA TRUE >> # 8: 8 NA NA TRUE >> >> I want to go through those columns and turn each NA into FALSE. I can >> do this by writing code for each column: >> >> fbq.cp <- copy(fbq) >> fbq.cp[is.na (foo), foo:=FALSE] >> fbq.cp[is.na (bar), bar:=FALSE] >> fbq.cp[is.na (qux), qux:=FALSE] >> print(fbq.cp) >> # k foo bar qux >> # 1: 1 TRUE FALSE FALSE >> # 2: 2 TRUE FALSE FALSE >> # 3: 3 TRUE TRUE FALSE >> # 4: 4 TRUE TRUE FALSE >> # 5: 5 FALSE TRUE TRUE >> # 6: 6 FALSE TRUE TRUE >> # 7: 7 FALSE FALSE TRUE >> # 8: 8 FALSE FALSE TRUE >> >> But I can't figure out how to do it in a loop. More precisely, I >> can't >> figure out how to make the [ operator evaluate its first argument in >> the context of the data.table. All of these have no effect: >> for (x in c("foo", "bar", "qux")) fbq[is.na (x), >> eval(x):=FALSE] >> for (x in c("foo", "bar", "qux")) fbq[is.na >> (eval(x)), eval(x):=FALSE] >> for (x in c("foo", "bar", "qux")) fbq[eval(is.na >> (x)), eval(x):=FALSE] >> >> I'm running R 3.0.2 on Linux, data.table 1.8.10. >> >> Thanks in advance, >> John >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Thu Feb 13 17:05:37 2014 From: caneff at gmail.com (caneff at gmail.com) Date: Thu, 13 Feb 2014 16:05:37 +0000 Subject: [datatable-help] Merging strings claim that the encodings don't match Message-ID: I have a master DT. I aggregate it in one way, and aggregate it in another with a common key between them. When I try to merge these two, it says that the key does not have the same encoding on both sides. If I call Encoding() on each of the keys, they both are listed as "unknown", so from what I can see they still look the same. I can't create a safe to share reproducible case unfortunately, the simple ones I've tried all work. If you can give more advice on how to debug maybe I can. This is using the latest devel version. I did not have this issue i 1.8.10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mel at mbacou.com Fri Feb 14 12:52:08 2014 From: mel at mbacou.com (Bacou, Melanie) Date: Fri, 14 Feb 2014 06:52:08 -0500 Subject: [datatable-help] Force evaluation of first argument to [ In-Reply-To: <52FBC9DC.2010809@mdowle.plus.com> References: <52FBB2EB.2070000@mdowle.plus.com> <52FBC9DC.2010809@mdowle.plus.com> Message-ID: <52FE0368.6080603@mbacou.com> Hi John, Matt, In this case, why not simply using the standard data.table approach with .SD? fbq.cp[, lapply(.SD, function(x) ifelse(is.na(x), FALSE, x)), .SDcols=c("foo", "bar", "qux")] --Mel. On 2/12/2014 2:22 PM, Matt Dowle wrote: > > Ha. Yes we certainly don't hold back from making the messages as long > and as helpful as possible. 
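For contrast with the set() loop earlier in the thread, it may help to note what the .SD expression above returns. A small sketch, untested, assuming fbq.cp as built in John's example:

res <- fbq.cp[, lapply(.SD, function(x) ifelse(is.na(x), FALSE, x)),
              .SDcols = c("foo", "bar", "qux")]
# res is a new three-column table (k is not carried along) and fbq.cp itself
# is left unchanged -- the copy Arun refers to further down -- whereas the
# set() loop edits the existing table in place.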
If the code knows, or can know what > exactly is wrong, it's a deliberate policy to put that info right > there into the message. data.table is written by users; i.e. we wrote > it for ourselves doing real jobs. I think that may be the root of > that. If any messages could more helpful, those suggestions are very > welcome. > > Matt > > On 12/02/14 17:58, John Laing wrote: >> Thanks, Matt! With a slight amendment that works great: >> for (x in c("foo", "bar", "qux")) set(fbq, which(is.na >> (fbq[[x]])), x, FALSE) >> >> Which highlights an opportunity to say that I really appreciate the >> unusually helpful error messages in this package. >> >> -John >> >> >> On Wed, Feb 12, 2014 at 12:44 PM, Matt Dowle > > wrote: >> >> >> Hi John, >> >> In examples like this I'd use set() and [[, since it's a bit >> easier to write but memory efficient too. >> >> for (x in c("foo", "bar", "qux")) set(fbq, is.na >> (fbq[[x]]), x, FALSE) [untested] >> >> A downside here is one repetition of the "fbq" symbol, but can >> live with that. If you have a large number of columns (and I've >> been surprised just how many columns some poeple have!) then >> calling set() many times has lower overhead than DT[, :=], see >> ?set. Note also that [[ is base R, doesn't copy the column and >> often useful to use with data.table. >> >> Or, use get() in either i or j rather than eval(). >> >> HTH, Matt >> >> >> >> On 12/02/14 17:24, John Laing wrote: >>> Let's say I merge together several data.tables such that I wind up >>> with lots of NAs: >>> >>> require(data.table) >>> foo <- data.table(k=1:4, foo=TRUE, key="k") >>> bar <- data.table(k=3:6, bar=TRUE, key="k") >>> qux <- data.table(k=5:8, qux=TRUE, key="k") >>> fbq <- merge(merge(foo, bar, all=TRUE), qux, all=TRUE) >>> print(fbq) >>> # k foo bar qux >>> # 1: 1 TRUE NA NA >>> # 2: 2 TRUE NA NA >>> # 3: 3 TRUE TRUE NA >>> # 4: 4 TRUE TRUE NA >>> # 5: 5 NA TRUE TRUE >>> # 6: 6 NA TRUE TRUE >>> # 7: 7 NA NA TRUE >>> # 8: 8 NA NA TRUE >>> >>> I want to go through those columns and turn each NA into FALSE. >>> I can >>> do this by writing code for each column: >>> >>> fbq.cp <- copy(fbq) >>> fbq.cp[is.na (foo), foo:=FALSE] >>> fbq.cp[is.na (bar), bar:=FALSE] >>> fbq.cp[is.na (qux), qux:=FALSE] >>> print(fbq.cp) >>> # k foo bar qux >>> # 1: 1 TRUE FALSE FALSE >>> # 2: 2 TRUE FALSE FALSE >>> # 3: 3 TRUE TRUE FALSE >>> # 4: 4 TRUE TRUE FALSE >>> # 5: 5 FALSE TRUE TRUE >>> # 6: 6 FALSE TRUE TRUE >>> # 7: 7 FALSE FALSE TRUE >>> # 8: 8 FALSE FALSE TRUE >>> >>> But I can't figure out how to do it in a loop. More precisely, I >>> can't >>> figure out how to make the [ operator evaluate its first argument in >>> the context of the data.table. All of these have no effect: >>> for (x in c("foo", "bar", "qux")) fbq[is.na (x), >>> eval(x):=FALSE] >>> for (x in c("foo", "bar", "qux")) fbq[is.na >>> (eval(x)), eval(x):=FALSE] >>> for (x in c("foo", "bar", "qux")) fbq[eval(is.na >>> (x)), eval(x):=FALSE] >>> >>> I'm running R 3.0.2 on Linux, data.table 1.8.10. 
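As an aside on why the three quoted loop attempts do nothing rather than raise an error: in each of them i evaluates to a single FALSE, because x is still the character string from the loop, not the column. A minimal illustration:

x <- "foo"
is.na(x)        # FALSE -- is.na() of the string "foo", not of the foo column
fbq[is.na(x)]   # i is FALSE, zero rows selected, so any := in j is a silent no-op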
>>> >>> Thanks in advance, >>> John >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Melanie BACOU International Food Policy Research Institute Agricultural Economist, HarvestChoice Work +1(202)862-5699 E-mail mel at mbacou.com Visit harvestchoice.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Feb 14 13:07:58 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 14 Feb 2014 13:07:58 +0100 Subject: [datatable-help] Force evaluation of first argument to [ In-Reply-To: <52FE0368.6080603@mbacou.com> References: <52FBB2EB.2070000@mdowle.plus.com> <52FBC9DC.2010809@mdowle.plus.com> <52FE0368.6080603@mbacou.com> Message-ID: Melanie, `set` modifies by reference. Yours'll make a copy.? Arun From:?Bacou, Melanie Bacou, Melanie Reply:?Bacou, Melanie mel at mbacou.com Date:?February 14, 2014 at 12:52:56 PM To:?Matt Dowle mdowle at mdowle.plus.com, John Laing john.laing at gmail.com Subject:? Re: [datatable-help] Force evaluation of first argument to [ Hi John, Matt, In this case, why not simply using the standard data.table approach with .SD? fbq.cp[, lapply(.SD, function(x) ifelse(is.na(x), FALSE, x)), .SDcols=c("foo", "bar", "qux")] --Mel. On 2/12/2014 2:22 PM, Matt Dowle wrote: Ha.? Yes we certainly don't hold back from making the messages as long and as helpful as possible.? If the code knows, or can know what exactly is wrong, it's a deliberate policy to put that info right there into the message. data.table is written by users; i.e. we wrote it for ourselves doing real jobs. I think that may be the root of that.? If any messages could more helpful,? those suggestions are very welcome. Matt On 12/02/14 17:58, John Laing wrote: Thanks, Matt! With a slight amendment that works great: for (x in c("foo", "bar", "qux")) set(fbq, which(is.na(fbq[[x]])), x, FALSE) Which highlights an opportunity to say that I really appreciate the unusually helpful error messages in this package. -John On Wed, Feb 12, 2014 at 12:44 PM, Matt Dowle wrote: Hi John, In examples like this I'd use set() and [[,? since it's a bit easier to write but memory efficient too. for (x in c("foo", "bar", "qux"))?? set(fbq, is.na(fbq[[x]]), x, FALSE)?????????? [untested] A downside here is one repetition of the "fbq" symbol,? but can live with that.? If you have a large number of columns? (and I've been surprised just how many columns some poeple have!) then calling set() many times has lower overhead than DT[, :=],? see ?set.?? Note also that [[ is base R, doesn't copy the column and often useful to use with data.table. Or, use get() in either i or j rather than eval(). HTH, Matt On 12/02/14 17:24, John Laing wrote: Let's say I merge together several data.tables such that I wind up with lots of NAs: require(data.table) foo <- data.table(k=1:4, foo=TRUE, key="k") bar <- data.table(k=3:6, bar=TRUE, key="k") qux <- data.table(k=5:8, qux=TRUE, key="k") fbq <- merge(merge(foo, bar, all=TRUE), qux, all=TRUE) print(fbq) # ? ?k ?foo ?bar ?qux # 1: 1 TRUE ? NA ? NA # 2: 2 TRUE ? NA ? NA # 3: 3 TRUE TRUE ? NA # 4: 4 TRUE TRUE ? NA # 5: 5 ? NA TRUE TRUE # 6: 6 ? 
NA TRUE TRUE # 7: 7 ? NA ? NA TRUE # 8: 8 ? NA ? NA TRUE I want to go through those columns and turn each NA into FALSE. I can do this by writing code for each column: fbq.cp <- copy(fbq) fbq.cp[is.na(foo), foo:=FALSE] fbq.cp[is.na(bar), bar:=FALSE] fbq.cp[is.na(qux), qux:=FALSE] print(fbq.cp) # ? ?k ? foo ? bar ? qux # 1: 1 ?TRUE FALSE FALSE # 2: 2 ?TRUE FALSE FALSE # 3: 3 ?TRUE ?TRUE FALSE # 4: 4 ?TRUE ?TRUE FALSE # 5: 5 FALSE ?TRUE ?TRUE # 6: 6 FALSE ?TRUE ?TRUE # 7: 7 FALSE FALSE ?TRUE # 8: 8 FALSE FALSE ?TRUE But I can't figure out how to do it in a loop. More precisely, I can't figure out how to make the [ operator evaluate its first argument in the context of the data.table. All of these have no effect: for (x in c("foo", "bar", "qux")) fbq[is.na(x), eval(x):=FALSE] for (x in c("foo", "bar", "qux")) fbq[is.na(eval(x)), eval(x):=FALSE] for (x in c("foo", "bar", "qux")) fbq[eval(is.na(x)), eval(x):=FALSE] I'm running R 3.0.2 on Linux, data.table 1.8.10. Thanks in advance, John _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Melanie BACOU International Food Policy Research Institute Agricultural Economist, HarvestChoice Work +1(202)862-5699 E-mail mel at mbacou.com Visit harvestchoice.org _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mel at mbacou.com Fri Feb 14 13:47:16 2014 From: mel at mbacou.com (Bacou, Melanie) Date: Fri, 14 Feb 2014 07:47:16 -0500 Subject: [datatable-help] data.table and sp classes - any best practices? Message-ID: <52FE1054.1020406@mbacou.com> I often use data.table in combination with large spatial objects (SpatialPolygonsDataFrame, SpatialPixelsDataFrame, etc.), but I am always worried about using setkey() on a @data slot thinking that I might mess up the link between the data attributes and the spatial features (polygons, points, pixels). I am hoping some of you might be able to clarify how best to manipulate data attributes inside a spatial object using data.table without running into potential errors. Here is a typical use case: # Load a sample SpatialPolygonsDataFrame from GADM load(url("http://biogeo.ucdavis.edu/data/gadm2/R/ETH_adm3.RData")) # My understanding is the data.frame row names should always match the polygon ID slots gadm.rn <- row.names(gadm) gadm.rn[1:5] # [1] "1" "2" "3" "4" "5" pid <- lapply(gadm at polygons, slot, "ID") pid[1:5] # [[1]] # [1] "1" # # [[2]] # [1] "2" # # [[3]] # [1] "3" # # [[4]] # [1] "4" # # [[5]] # [1] "5" # Let's say I need to merge external data into gadm at data using setkey() # Here is my approach gadm at data <- data.table(gadm at data) row.names(gadm at data)[1:5] # [1] "1" "2" "3" "4" "5" # Til now row names are preserved, good. 
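# (Aside: a data.table does not keep data.frame row names -- row.names() on a
# data.table is always "1","2",... -- so the check above would pass whatever the
# original row names were; the explicit `rn` column created next is the reliable
# link back to the polygon IDs.)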
# Let's create an explicit `rn` column to keep the initial `gadm` row names gadm at data[, rn := gadm.rn] # Check the ordering of the first data column gadm at data[, PID][1:5] # [1] 30825 30826 30827 30828 30829 # Now index gadm at data by another column setkey(gadm at data, NAME_3) # Verify that the row order has changed gadm at data[, PID][1:5] # [1] 30859 31100 31101 31145 31016 # What about row names? row.names(gadm at data)[1:5] # [1] "1" "2" "3" "4" "5" # Row names are not preserved, does that mean attributes are now associated # with the wrong polygons? # Let's try to fix that setkey(gadm at data, rn) gadm at data <- gadm at data[gadm.rn] gadm at data[, PID][1:5] # [1] 30825 30826 30827 30828 30829 # I'm now back to the original row order, note that row names are still unchanged row.names(gadm at data)[1:5] # [1] "1" "2" "3" "4" "5" # I assume my spatial object is now correct I don't know whether this approach makes sense at all, or if I should stay away from using data.table inside sp: classes? I much appreciate any suggestion. Thanks, --Mel. -- Melanie BACOU International Food Policy Research Institute Agricultural Economist, HarvestChoice Work +1(202)862-5699 E-mail mel at mbacou.com Visit harvestchoice.org From mel at mbacou.com Fri Feb 14 13:59:05 2014 From: mel at mbacou.com (Bacou, Melanie) Date: Fri, 14 Feb 2014 07:59:05 -0500 Subject: [datatable-help] Force evaluation of first argument to [ In-Reply-To: References: <52FBB2EB.2070000@mdowle.plus.com> <52FBC9DC.2010809@mdowle.plus.com> <52FE0368.6080603@mbacou.com> Message-ID: <52FE1319.7040504@mbacou.com> Arun, thanks for the clarification -- I see I didn't read that thread fully. --Mel. On 2/14/2014 7:07 AM, Arunkumar Srinivasan wrote: > Melanie, > `set` modifies by reference. Yours'll make a copy. > Arun > ------------------------------------------------------------------------ > From: Bacou, Melanie Bacou, Melanie > Reply: Bacou, Melanie mel at mbacou.com > Date: February 14, 2014 at 12:52:56 PM > To: Matt Dowle mdowle at mdowle.plus.com , > John Laing john.laing at gmail.com > Subject: Re: [datatable-help] Force evaluation of first argument to [ >> Hi John, Matt, >> >> In this case, why not simply using the standard data.table approach >> with .SD? >> >> fbq.cp[, lapply(.SD, function(x) ifelse(is.na(x), FALSE, x)), >> .SDcols=c("foo", "bar", "qux")] >> >> --Mel. >> >> >> On 2/12/2014 2:22 PM, Matt Dowle wrote: >>> >>> Ha. Yes we certainly don't hold back from making the messages as >>> long and as helpful as possible. If the code knows, or can know >>> what exactly is wrong, it's a deliberate policy to put that info >>> right there into the message. data.table is written by users; i.e. >>> we wrote it for ourselves doing real jobs. I think that may be the >>> root of that. If any messages could more helpful, those suggestions >>> are very welcome. >>> >>> Matt >>> >>> On 12/02/14 17:58, John Laing wrote: >>>> Thanks, Matt! With a slight amendment that works great: >>>> for (x in c("foo", "bar", "qux")) set(fbq, which(is.na >>>> (fbq[[x]])), x, FALSE) >>>> >>>> Which highlights an opportunity to say that I really appreciate the >>>> unusually helpful error messages in this package. >>>> >>>> -John >>>> >>>> >>>> On Wed, Feb 12, 2014 at 12:44 PM, Matt Dowle >>>> > wrote: >>>> >>>> >>>> Hi John, >>>> >>>> In examples like this I'd use set() and [[, since it's a bit >>>> easier to write but memory efficient too. 
>>>> >>>> for (x in c("foo", "bar", "qux")) set(fbq, is.na >>>> (fbq[[x]]), x, FALSE) [untested] >>>> >>>> A downside here is one repetition of the "fbq" symbol, but can >>>> live with that. If you have a large number of columns (and >>>> I've been surprised just how many columns some poeple have!) >>>> then calling set() many times has lower overhead than DT[, >>>> :=], see ?set. Note also that [[ is base R, doesn't copy the >>>> column and often useful to use with data.table. >>>> >>>> Or, use get() in either i or j rather than eval(). >>>> >>>> HTH, Matt >>>> >>>> >>>> >>>> On 12/02/14 17:24, John Laing wrote: >>>>> Let's say I merge together several data.tables such that I wind up >>>>> with lots of NAs: >>>>> >>>>> require(data.table) >>>>> foo <- data.table(k=1:4, foo=TRUE, key="k") >>>>> bar <- data.table(k=3:6, bar=TRUE, key="k") >>>>> qux <- data.table(k=5:8, qux=TRUE, key="k") >>>>> fbq <- merge(merge(foo, bar, all=TRUE), qux, all=TRUE) >>>>> print(fbq) >>>>> # k foo bar qux >>>>> # 1: 1 TRUE NA NA >>>>> # 2: 2 TRUE NA NA >>>>> # 3: 3 TRUE TRUE NA >>>>> # 4: 4 TRUE TRUE NA >>>>> # 5: 5 NA TRUE TRUE >>>>> # 6: 6 NA TRUE TRUE >>>>> # 7: 7 NA NA TRUE >>>>> # 8: 8 NA NA TRUE >>>>> >>>>> I want to go through those columns and turn each NA into >>>>> FALSE. I can >>>>> do this by writing code for each column: >>>>> >>>>> fbq.cp <- copy(fbq) >>>>> fbq.cp[is.na (foo), foo:=FALSE] >>>>> fbq.cp[is.na (bar), bar:=FALSE] >>>>> fbq.cp[is.na (qux), qux:=FALSE] >>>>> print(fbq.cp) >>>>> # k foo bar qux >>>>> # 1: 1 TRUE FALSE FALSE >>>>> # 2: 2 TRUE FALSE FALSE >>>>> # 3: 3 TRUE TRUE FALSE >>>>> # 4: 4 TRUE TRUE FALSE >>>>> # 5: 5 FALSE TRUE TRUE >>>>> # 6: 6 FALSE TRUE TRUE >>>>> # 7: 7 FALSE FALSE TRUE >>>>> # 8: 8 FALSE FALSE TRUE >>>>> >>>>> But I can't figure out how to do it in a loop. More precisely, >>>>> I can't >>>>> figure out how to make the [ operator evaluate its first >>>>> argument in >>>>> the context of the data.table. All of these have no effect: >>>>> for (x in c("foo", "bar", "qux")) fbq[is.na (x), >>>>> eval(x):=FALSE] >>>>> for (x in c("foo", "bar", "qux")) fbq[is.na >>>>> (eval(x)), eval(x):=FALSE] >>>>> for (x in c("foo", "bar", "qux")) fbq[eval(is.na >>>>> (x)), eval(x):=FALSE] >>>>> >>>>> I'm running R 3.0.2 on Linux, data.table 1.8.10. >>>>> >>>>> Thanks in advance, >>>>> John >>>>> >>>>> >>>>> _______________________________________________ >>>>> datatable-help mailing list >>>>> datatable-help at lists.r-forge.r-project.org >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>> >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> -- >> Melanie BACOU >> International Food Policy Research Institute >> Agricultural Economist, HarvestChoice >> Work +1(202)862-5699 >> E-mailmel at mbacou.com >> Visit harvestchoice.org >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Melanie BACOU International Food Policy Research Institute Agricultural Economist, HarvestChoice Work +1(202)862-5699 E-mail mel at mbacou.com Visit harvestchoice.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From yikelu.home at gmail.com Fri Feb 14 16:07:30 2014 From: yikelu.home at gmail.com (Yike Lu) Date: Fri, 14 Feb 2014 09:07:30 -0600 Subject: [datatable-help] integer64 group by doesn't find all groups In-Reply-To: References: <52FB9FC2.4000305@mdowle.plus.com> <52FBA3D0.60109@mdowle.plus.com> Message-ID: Thanks for the info guys! Wondering if there's any way I can help? On Wed, Feb 12, 2014 at 11:17 AM, caneff at gmail.com wrote: > Yes this isn't a data.table criticism, just a bit64 one in general. > > > On Wed Feb 12 2014 at 11:39:47 AM, Matt Dowle > wrote: > >> >> Sometimes we take the hard road in data.table, to get to a better place. >> Once bit64::integer64 is fully supported, it'll be much easier. All the >> recent radix work for double applies almost automatically to integer64 for >> example, but that radix work had to be done first. >> >> >> On 12/02/14 16:26, caneff at gmail.com wrote: >> >> FYI (and this is a long outstanding argument) this is why I don't like >> the bit64 package. These sorts of errors happen silently. I understand >> that data.table can't use the other integer64 package, but at least there >> it is obvious when things are being coerced. >> >> In my situations, if I am grouping by a int64, it is usually either an >> ID so I can just make it a character vector instead, or it is something >> where I don't mind lost precision so I just make it numeric. >> >> On Wed Feb 12 2014 at 11:22:40 AM, Matt Dowle >> wrote: >> >> >> Hi, >> >> You're doing nothing wrong. Although you can load integer64 using fread >> and create them directly, data.table's grouping and keys don't work on >> them yet. Sorry, just not yet implemented. Because integer64 are >> internally stored as type double (a good idea by package bit64), >> data.table sees them internally as double and doesn't catch that the >> type isn't supported yet (hence no error message such as you get for >> type 'complex'). The particular integer64 numbers in this example are >> quite small so will use the lower bits. In double, those are the most >> precise part of the significand, which would explain why only one group >> comes out here since data.table groups and joins floating point data >> within tolerance. >> >> Matt >> >> On 06/02/14 23:38, Yike Lu wrote: >> > After a long hiatus, I am back to using data.table. Unfortunately, >> > I've encountered a problem. Am I doing something wrong here? >> > >> > require(data.table) >> > >> > dt = data.table(idx = 1:100 %% 3, 1:100) >> > dt[, list(sum(V2)), by = idx] >> > # normal >> > >> > require(bit64) >> > >> > dt2 = data.table(idx = integer64(100) + 1:100 %% 3, 1:100) >> > dt2[, list(sum(V2)), by = idx] >> > # only has one group: >> > # idx V1 >> > #1: 1 5050 >> > >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mhawkes at gcmlp.com Fri Feb 14 22:34:03 2014 From: mhawkes at gcmlp.com (Malcolm Hawkes) Date: Fri, 14 Feb 2014 21:34:03 +0000 Subject: [datatable-help] CJ and setkey sort differently Message-ID: <002E5054D2B84346B6F551575E3A8CA3030A7BE8@DC-GCM-MB-02.gcmlp.com> Ran in to the warning Warning in setkeyv(x, cols, verbose = verbose) : Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. 
You can reproduce it with vec1 <- c("CMDTY", "Copper", "CORPOAS") vec2 <- 1:3 dt <- CJ(vec1, Date) setkey(dt, V1, V2) Issue seems to be that CJ (..., sorted = TRUE) and setkey want to sort the character data in different orders, one case-sensitive, one not. CJ creates V1 V2 1: Corp 1 2: Corp 2 3: Corp 3 4: CORP 1 5: CORP 2 6: CORP 3 And it's keyed as you would expect by V1 then V2 > key(dt) [1] "V1" "V2" But after doing setkey you have V1 V2 1: CORP 1 2: CORP 2 3: CORP 3 4: Corp 1 5: Corp 2 6: Corp 3 data.table version 1.8.10 > R.version _ platform x86_64-w64-mingw32 arch x86_64 os mingw32 system x86_64, mingw32 status major 3 minor 0.2 year 2013 month 09 day 25 svn rev 63987 language R version.string R version 3.0.2 (2013-09-25) nickname Frisbee Sailing > Malcolm Hawkes On-Site Consultant, Investments - RiskManagement Grosvenor Capital Management, L.P. 900 N. Michigan Avenue, Suite 1100 Chicago, IL 60611 mhawkes at gcmlp.com --- Disclosure and Statement of Confidentiality Grosvenor Securities LLC, Member FINRA, Serves as Placement Agent or Distributor for Certain Investment Products Managed/Advised by GCM Grosvenor-Affiliated Entities. The contents of this e-mail message and its attachments (if any) may be proprietary and/or confidential and are intended solely for the addressee(s) hereof. In addition, this e-mail message and its attachments (if any) may be subject to non-disclosure or confidentiality agreements or applicable legal privileges, including privileges protecting communications between attorneys or solicitors and their clients or the work product of attorneys and solicitors. If you are not the named addressee, or if this e-mail message has been addressed to you in error, please do not read, disclose, reproduce, distribute, disseminate or otherwise use this message or any of its attachments. Delivery of this e-mail message to any person other than the intended recipient(s) is not intended in any way to waive privilege or confidentiality. If you have received this e-mail message in error, please alert the sender by reply e-mail; we also request that you immediately delete this e-mail message and its attachments (if any). Grosvenor Capital Management, L.P., GCM Customized Fund Investment Group, L.P. and their affiliated entities (collectively, "GCM Grosvenor") reserve the right to monitor all e-mail communications through their networks. GCM Grosvenor gives no assurances that this e-mail message and its attachments (if any) are free of viruses and other harmful code. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Feb 14 22:38:16 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 14 Feb 2014 22:38:16 +0100 Subject: [datatable-help] CJ and setkey sort differently In-Reply-To: <002E5054D2B84346B6F551575E3A8CA3030A7BE8@DC-GCM-MB-02.gcmlp.com> References: <002E5054D2B84346B6F551575E3A8CA3030A7BE8@DC-GCM-MB-02.gcmlp.com> Message-ID: Malcolm, Thanks for the nice report. I suppose your `dt` creation should be: `dt <- CJ(vec1, vec2)`. The reason is pretty clear. It's an easy fix. Could you please file a bug report? Thank you. Arun From:?Malcolm Hawkes Malcolm Hawkes Reply:?Malcolm Hawkes mhawkes at gcmlp.com Date:?February 14, 2014 at 10:34:21 PM To:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? [datatable-help] CJ and setkey sort differently Ran in to the warning ? Warning in setkeyv(x, cols, verbose = verbose) : ? 
Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. ? You can reproduce it with vec1 <- c("CMDTY", "Copper", "CORPOAS") vec2 <-? 1:3 dt <- CJ(vec1, Date) setkey(dt, V1, V2) ? Issue seems to be that CJ (..., sorted = TRUE) and setkey want to sort the character data in different orders, one case-sensitive, one not. ? CJ creates ? ???? V1 V2 1: Corp? 1 2: Corp? 2 3: Corp? 3 4: CORP? 1 5: CORP? 2 6: CORP? 3 ? And it?s keyed as you would expect by V1 then V2 > key(dt) [1] "V1" "V2" ? ? But after doing setkey you have ? ???? V1 V2 1: CORP? 1 2: CORP? 2 3: CORP? 3 4: Corp? 1 5: Corp? 2 6: Corp? 3 ? ? ? data.table version 1.8.10 ? > R.version ?????????????? _?????????????????????????? platform?????? x86_64-w64-mingw32????????? arch?????????? x86_64????????????????????? os???????????? mingw32???????????????????? system???????? x86_64, mingw32???????????? status???????????????????????????????????? major????????? 3?????????????????????????? minor????????? 0.2???????????????????????? year?????????? 2013??????????????????????? month????????? 09????????????????????????? day??????????? 25????????????????????????? svn rev??????? 63987?????????????????????? language?????? R?????????????????????????? version.string R version 3.0.2 (2013-09-25) nickname?????? Frisbee Sailing???????????? > ? ? Malcolm Hawkes On-Site Consultant, Investments - RiskManagement Grosvenor Capital Management, L.P. 900 N. Michigan Avenue, Suite 1100 Chicago, IL? 60611 mhawkes at gcmlp.com ? ? ? --- Disclosure and Statement of Confidentiality ? Grosvenor Securities LLC, Member FINRA, Serves as Placement Agent or Distributor for Certain Investment Products Managed/Advised by GCM Grosvenor-Affiliated Entities. ? The contents of this e-mail message and its attachments (if any) may be proprietary and/or confidential and are intended solely for the addressee(s) hereof. In addition, this e-mail message and its attachments (if any) may be subject to non-disclosure or confidentiality agreements or applicable legal privileges, including privileges protecting communications between attorneys or solicitors and their clients or the work product of attorneys and solicitors. If you are not the named addressee, or if this e-mail message has been addressed to you in error, please do not read, disclose, reproduce, distribute, disseminate or otherwise use this message or any of its attachments. Delivery of this e-mail message to any person other than the intended recipient(s) is not intended in any way to waive privilege or confidentiality. If you have received this e-mail message in error, please alert the sender by reply e-mail; we also request that you immediately delete this e-mail message and its attachments (if any). Grosvenor Capital Management, L.P., GCM Customized Fund Investment Group, L.P. and their affiliated entities (collectively, ?GCM Grosvenor?) reserve the right to monitor all e-mail communications through their networks. GCM Grosvenor gives no assurances that this e-mail message and its attachments (if any) are free of viruses and other harmful code. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mhawkes at gcmlp.com Fri Feb 14 22:42:00 2014 From: mhawkes at gcmlp.com (Malcolm Hawkes) Date: Fri, 14 Feb 2014 21:42:00 +0000 Subject: [datatable-help] CJ and setkey sort differently In-Reply-To: References: <002E5054D2B84346B6F551575E3A8CA3030A7BE8@DC-GCM-MB-02.gcmlp.com> Message-ID: <002E5054D2B84346B6F551575E3A8CA3030A7C06@DC-GCM-MB-02.gcmlp.com> Arun Oops, yes it should. And vec1 <- c("Corp", "CORP") Took me while to track down, what was causing but got it in the end ? Where / how do I file a bug report ? Thanks Malcolm Malcolm Hawkes On-Site Consultant, Investments - RiskManagement Grosvenor Capital Management, L.P. 900 N. Michigan Avenue, Suite 1100 Chicago, IL 60611 mhawkes at gcmlp.com From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] Sent: Friday, February 14, 2014 3:38 PM To: datatable-help at lists.r-forge.r-project.org; Malcolm Hawkes Subject: Re: [datatable-help] CJ and setkey sort differently Malcolm, Thanks for the nice report. I suppose your `dt` creation should be: `dt <- CJ(vec1, vec2)`. The reason is pretty clear. It's an easy fix. Could you please file a bug report? Thank you. Arun ________________________________ From: Malcolm Hawkes Malcolm Hawkes Reply: Malcolm Hawkes mhawkes at gcmlp.com Date: February 14, 2014 at 10:34:21 PM To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] CJ and setkey sort differently Ran in to the warning Warning in setkeyv(x, cols, verbose = verbose) : Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. You can reproduce it with vec1 <- c("CMDTY", "Copper", "CORPOAS") vec2 <- 1:3 dt <- CJ(vec1, Date) setkey(dt, V1, V2) Issue seems to be that CJ (..., sorted = TRUE) and setkey want to sort the character data in different orders, one case-sensitive, one not. CJ creates V1 V2 1: Corp 1 2: Corp 2 3: Corp 3 4: CORP 1 5: CORP 2 6: CORP 3 And it?s keyed as you would expect by V1 then V2 > key(dt) [1] "V1" "V2" But after doing setkey you have V1 V2 1: CORP 1 2: CORP 2 3: CORP 3 4: Corp 1 5: Corp 2 6: Corp 3 data.table version 1.8.10 > R.version _ platform x86_64-w64-mingw32 arch x86_64 os mingw32 system x86_64, mingw32 status major 3 minor 0.2 year 2013 month 09 day 25 svn rev 63987 language R version.string R version 3.0.2 (2013-09-25) nickname Frisbee Sailing > Malcolm Hawkes On-Site Consultant, Investments - RiskManagement Grosvenor Capital Management, L.P. 900 N. Michigan Avenue, Suite 1100 Chicago, IL 60611 mhawkes at gcmlp.com --- Disclosure and Statement of Confidentiality Grosvenor Securities LLC, Member FINRA, Serves as Placement Agent or Distributor for Certain Investment Products Managed/Advised by GCM Grosvenor-Affiliated Entities. The contents of this e-mail message and its attachments (if any) may be proprietary and/or confidential and are intended solely for the addressee(s) hereof. In addition, this e-mail message and its attachments (if any) may be subject to non-disclosure or confidentiality agreements or applicable legal privileges, including privileges protecting communications between attorneys or solicitors and their clients or the work product of attorneys and solicitors. If you are not the named addressee, or if this e-mail message has been addressed to you in error, please do not read, disclose, reproduce, distribute, disseminate or otherwise use this message or any of its attachments. 
Delivery of this e-mail message to any person other than the intended recipient(s) is not intended in any way to waive privilege or confidentiality. If you have received this e-mail message in error, please alert the sender by reply e-mail; we also request that you immediately delete this e-mail message and its attachments (if any). Grosvenor Capital Management, L.P., GCM Customized Fund Investment Group, L.P. and their affiliated entities (collectively, ?GCM Grosvenor?) reserve the right to monitor all e-mail communications through their networks. GCM Grosvenor gives no assurances that this e-mail message and its attachments (if any) are free of viruses and other harmful code. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Feb 14 22:43:00 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 14 Feb 2014 22:43:00 +0100 Subject: [datatable-help] CJ and setkey sort differently In-Reply-To: <002E5054D2B84346B6F551575E3A8CA3030A7C06@DC-GCM-MB-02.gcmlp.com> References: <002E5054D2B84346B6F551575E3A8CA3030A7BE8@DC-GCM-MB-02.gcmlp.com> <002E5054D2B84346B6F551575E3A8CA3030A7C06@DC-GCM-MB-02.gcmlp.com> Message-ID: Here:?https://r-forge.r-project.org/tracker/?atid=975&group_id=240&func=browse You've to create an account, but that's super easy. Arun From:?Malcolm Hawkes Malcolm Hawkes Reply:?Malcolm Hawkes mhawkes at gcmlp.com Date:?February 14, 2014 at 10:42:07 PM To:?Arunkumar Srinivasan aragorn168b at gmail.com, datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? RE: [datatable-help] CJ and setkey sort differently Arun ? Oops, yes it should.? And vec1 <- c("Corp", "CORP") ? Took me while to track down, what was causing but got it in the end J ? Where / how do I file a bug report ? ? Thanks ? Malcolm ? Malcolm Hawkes On-Site Consultant, Investments - RiskManagement Grosvenor Capital Management, L.P. 900 N. Michigan Avenue, Suite 1100 Chicago, IL? 60611 mhawkes at gcmlp.com ? From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] Sent: Friday, February 14, 2014 3:38 PM To: datatable-help at lists.r-forge.r-project.org; Malcolm Hawkes Subject: Re: [datatable-help] CJ and setkey sort differently ? Malcolm, ? Thanks for the nice report. I suppose your `dt` creation should be: `dt <- CJ(vec1, vec2)`. The reason is pretty clear. It's an easy fix. Could you please file a bug report? Thank you. ? Arun From:?Malcolm HawkesMalcolm Hawkes Reply:?Malcolm Hawkesmhawkes at gcmlp.com Date:?February 14, 2014 at 10:34:21 PM To:?datatable-help at lists.r-forge.r-project.orgdatatable-help@lists.r-forge.r-project.org Subject:? [datatable-help] CJ and setkey sort differently Ran in to the warning ? Warning in setkeyv(x, cols, verbose = verbose) : ? Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. ? You can reproduce it with vec1 <- c("CMDTY", "Copper", "CORPOAS") vec2 <-? 1:3 dt <- CJ(vec1, Date) setkey(dt, V1, V2) ? Issue seems to be that CJ (..., sorted = TRUE) and setkey want to sort the character data in different orders, one case-sensitive, one not. ? CJ creates ? ???? V1 V2 1: Corp? 1 2: Corp? 2 3: Corp? 3 4: CORP? 1 5: CORP? 2 6: CORP? 3 ? 
And it?s keyed as you would expect by V1 then V2 >key(dt) [1] "V1" "V2" ? ? But after doing setkey you have ? ???? V1 V2 1: CORP? 1 2: CORP? 2 3: CORP? 3 4: Corp? 1 5: Corp? 2 6: Corp? 3 ? ? ? data.table version 1.8.10 ? > R.version ?????????????? _?????????????????????????? platform?????? x86_64-w64-mingw32????????? arch?????????? x86_64????????????????????? os???????????? mingw32???????????????????? system???????? x86_64, mingw32???????????? status???????????????????????????????????? major????????? 3?????????????????????????? minor????????? 0.2???????????????????????? year?????????? 2013??????????????????????? month????????? 09????????????????????????? day??????????? 25????????????????????????? svn rev??????? 63987?????????????????????? language?????? R?????????????????????????? version.string R version 3.0.2 (2013-09-25) nickname?????? Frisbee Sailing???????????? >? ? ? Malcolm Hawkes On-Site Consultant, Investments - RiskManagement Grosvenor Capital Management, L.P. 900 N. Michigan Avenue, Suite 1100 Chicago, IL? 60611 mhawkes at gcmlp.com ? ? ? --- ? Disclosure and Statement of Confidentiality ? ? Grosvenor Securities LLC, Member FINRA, Serves as Placement Agent or Distributor for Certain Investment Products Managed/Advised by GCM Grosvenor-Affiliated Entities. ? ? The contents of this e-mail message and its attachments (if any) may be proprietary and/or confidential and are intended solely for the addressee(s) hereof. In addition, this e-mail message and its attachments (if any) may be subject to non-disclosure or confidentiality agreements or applicable legal privileges, including privileges protecting communications between attorneys or solicitors and their clients or the work product of attorneys and solicitors. If you are not the named addressee, or if this e-mail message has been addressed to you in error, please do not read, disclose, reproduce, distribute, disseminate or otherwise use this message or any of its attachments. Delivery of this e-mail message to any person other than the intended recipient(s) is not intended in any way to waive privilege or confidentiality. If you have received this e-mail message in error, please alert the sender by reply e-mail; we also request that you immediately delete this e-mail message and its attachments (if any). Grosvenor Capital Management, L.P., GCM Customized Fund Investment Group, L.P. and their affiliated entities (collectively, ?GCM Grosvenor?) reserve the right to monitor all e-mail communications through their networks. GCM Grosvenor gives no assurances that this e-mail message and its attachments (if any) are free of viruses and other harmful code. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mel at mbacou.com Mon Feb 17 11:14:45 2014 From: mel at mbacou.com (Bacou, Melanie) Date: Mon, 17 Feb 2014 05:14:45 -0500 Subject: [datatable-help] Problem with data.table and FastRWeb Message-ID: <5301E115.5080806@mbacou.com> Hi, I am testing an R script using FastRWeb (through Rserve). FastRWeb works as expected and I can successfully runs Simon Urbanek's examples. Problems arise when I try to merge datatables. It seems FastRWeb cannot find merge.data.table(). I'm using plenty of other libraries (ggplot, raster, RJDBC, etc.) 
that execute successfully through FastRWeb scripts, so I'm guessing it's something peculiar to data.table. Thanks for any help! --Mel. Here are reproducible examples. Test #1: the code below (the entire content of my R script) SUCCEEDS: # test1.R library(data.table) run <- function(...) { oclear() d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) otable(d1) otable(d2) } This returns a simple web page showing 2 tables: 1 a 2 b 3 c v 4 a 6 b 7 Test #2: the code below (the entire content of my R script) FAILS with: Error in `[.default`(x, i) : invalid subscript type 'list' # test2.R library(data.table) run <- function(...) { oclear() d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) otable(d1) otable(d2) setkey(d1, b) setkey(d2, e) otable(d1[d2]) } -- Melanie BACOU International Food Policy Research Institute Agricultural Economist, HarvestChoice Work +1(202)862-5699 E-mail mel at mbacou.com Visit harvestchoice.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Feb 17 12:58:55 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 17 Feb 2014 12:58:55 +0100 Subject: [datatable-help] Problem with data.table and FastRWeb In-Reply-To: <5301E115.5080806@mbacou.com> References: <5301E115.5080806@mbacou.com> Message-ID: Mel, I'm not able to reproduce this on 1.8.11. Which version are you using? I'm not aware of this package, and what 'otable' is supposed to do. But I get no output while running your script, and not the error message as well. On Mon, Feb 17, 2014 at 11:14 AM, Bacou, Melanie wrote: > Hi, > > I am testing an R script using FastRWeb (through Rserve). FastRWeb works > as expected and I can successfully runs Simon Urbanek's examples. Problems > arise when I try to merge datatables. It seems FastRWeb cannot find > merge.data.table(). > > I'm using plenty of other libraries (ggplot, raster, RJDBC, etc.) that > execute successfully through FastRWeb scripts, so I'm guessing it's > something peculiar to data.table. > > Thanks for any help! --Mel. > > > Here are reproducible examples. > > Test #1: the code below (the entire content of my R script) SUCCEEDS: > > # test1.R > library(data.table) > > run <- function(...) { > oclear() > d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) > d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) > otable(d1) > otable(d2) > } > > This returns a simple web page showing 2 tables: > 1 a 2 b 3 c v 4 a 6 b 7 > > Test #2: the code below (the entire content of my R script) FAILS with: > Error in `[.default`(x, i) : invalid subscript type 'list' > > # test2.R > library(data.table) > > run <- function(...) { > oclear() > d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) > d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) > otable(d1) > otable(d2) > setkey(d1, b) > setkey(d2, e) > otable(d1[d2]) > } > > > > > -- > Melanie BACOU > International Food Policy Research Institute > Agricultural Economist, HarvestChoice > Work +1(202)862-5699 > E-mail mel at mbacou.com > Visit harvestchoice.org > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Mon Feb 17 23:39:11 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Mon, 17 Feb 2014 22:39:11 +0000 Subject: [datatable-help] Merging strings claim that the encodings don't match In-Reply-To: References: Message-ID: <53028F8F.4030102@mdowle.plus.com> Think you may have ended up with some strings internally marked ASCII by R, which Encoding() returns as "unknown". That shouldn't be a problem and they should join fine. I've change the new warning in v1.8.11 so if it was that, it should be ok now (commit 1153), please confirm. Matt On 13/02/14 16:05, caneff at gmail.com wrote: > I have a master DT. I aggregate it in one way, and aggregate it in > another with a common key between them. When I try to merge these > two, it says that the key does not have the same encoding on both > sides. If I call Encoding() on each of the keys, they both are listed > as "unknown", so from what I can see they still look the same. > > I can't create a safe to share reproducible case unfortunately, the > simple ones I've tried all work. If you can give more advice on how > to debug maybe I can. > > This is using the latest devel version. I did not have this issue i 1.8.10 > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mel at mbacou.com Tue Feb 18 06:31:51 2014 From: mel at mbacou.com (Bacou, Melanie) Date: Tue, 18 Feb 2014 00:31:51 -0500 Subject: [datatable-help] Problem with data.table and FastRWeb In-Reply-To: References: <5301E115.5080806@mbacou.com> Message-ID: <5302F047.6020404@mbacou.com> Hi Arun, This is a little tricky to reproduce unless you have installed FastRWeb, and then started the FastRWeb server. I'm executing these scripts from the browser through a call to FastRWeb running on a local port. Installation is documented here and is quick and straightforward on Linux: https://rforge.net/FastRWeb/ and an example here: http://jayemerson.blogspot.mx/2011/10/setting-up-fastrwebrserve-on-ubuntu.html I'm using FastRWeb to build a simple web service. As long as I stick to data.frame methods, everything works fine and I get the expected plots and HTML output in the browser. But calls to data.table methods (merge, extract) all seem to default to data.frame, and I really don't know how to debug that. I am copying Simon Urbanek who's the maintainer of FastRWeb, in case this is more of a FastRWeb issue. Here is my session info (I am on CentOS 5 and cannot easily upgrade to R 3.0.2). > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics utils datasets grDevices methods base other attached packages: [1] ggmap_2.3 ggplot2_0.9.3.1 RColorBrewer_1.0-5 raster_2.2-12 [5] rgeos_0.3-3 rgdal_0.8-16 sp_1.0-14 data.table_1.8.10 [9] RJDBC_0.2-3 rJava_0.9-6 DBI_0.2-7 rj_1.1.2-3 loaded via a namespace (and not attached): [1] MASS_7.3-23 RJSONIO_1.0-3 RgoogleMaps_1.2.0.5 [4] colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4 [7] grid_2.15.2 gtable_0.1.2 labeling_0.2 [10] lattice_0.20-24 mapproj_1.2-2 maps_2.3-6 [13] munsell_0.4.2 plyr_1.8 png_0.1-7 [16] proto_0.3-10 reshape2_1.2.2 rj.gd_1.1.0-1 [19] rjson_0.2.13 scales_0.2.3 stringr_0.6.2 [22] tools_2.15.2 Thanks all! --Mel. 
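For reference, a sketch of what test2.R's last line is intended to produce when the data.table method is dispatched -- the same statements run in a plain interactive R session, outside FastRWeb:

library(data.table)
d1 <- data.table(a = c(1, 2, 3), b = c("a", "b", "c"))
d2 <- data.table(e = c("v", "a", "b"), f = c(4, 6, 7))
setkey(d1, b)
setkey(d2, e)
d1[d2]  # keyed join: each value of d2's key e is looked up in d1's key b;
        # "v" has no match in d1, so its a entry comes back NA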
On 2/17/2014 6:58 AM, Arunkumar Srinivasan wrote: > Mel, > I'm not able to reproduce this on 1.8.11. Which version are you using? > I'm not aware of this package, and what 'otable' is supposed to do. > But I get no output while running your script, and not the error > message as well. > > > On Mon, Feb 17, 2014 at 11:14 AM, Bacou, Melanie > wrote: > > Hi, > > I am testing an R script using FastRWeb (through Rserve). FastRWeb > works as expected and I can successfully runs Simon Urbanek's > examples. Problems arise when I try to merge datatables. It seems > FastRWeb cannot find merge.data.table(). > > I'm using plenty of other libraries (ggplot, raster, RJDBC, etc.) > that execute successfully through FastRWeb scripts, so I'm > guessing it's something peculiar to data.table. > > Thanks for any help! --Mel. > > > Here are reproducible examples. > > Test #1: the code below (the entire content of my R script) SUCCEEDS: > > # test1.R > library(data.table) > > run <- function(...) { > oclear() > d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) > d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) > otable(d1) > otable(d2) > } > > This returns a simple web page showing 2 tables: > 1 a > 2 b > 3 c > > v 4 > a 6 > b 7 > > > > Test #2: the code below (the entire content of my R script) FAILS > with: > Error in `[.default`(x, i) : invalid subscript type 'list' > > # test2.R > library(data.table) > > run <- function(...) { > oclear() > d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) > d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) > otable(d1) > otable(d2) > setkey(d1, b) > setkey(d2, e) > otable(d1[d2]) > } > > > > > -- > Melanie BACOU > International Food Policy Research Institute > Agricultural Economist, HarvestChoice > Work+1(202)862-5699 > E-mailmel at mbacou.com > Visitharvestchoice.org > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -- Melanie BACOU International Food Policy Research Institute Agricultural Economist, HarvestChoice Work +1(202)862-5699 E-mail mel at mbacou.com Visit harvestchoice.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Feb 18 11:43:03 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Tue, 18 Feb 2014 10:43:03 +0000 Subject: [datatable-help] Problem with data.table and FastRWeb In-Reply-To: <5302F047.6020404@mbacou.com> References: <5301E115.5080806@mbacou.com> <5302F047.6020404@mbacou.com> Message-ID: <53033937.2090607@mdowle.plus.com> Hi Mel, Thanks for the info. It's likely related to cedta() and we can handle it from data.table's side as follows. Background : http://stackoverflow.com/a/10529888/403310 Type "data.table:::cedta" so you can see the rules. I guess FastWeb is running your code in its own environment. First thing, turn on verbosity : options(data.table.verbose=TRUE) # or for one statement rather than globally, d1[d2,verbose=TRUE] and run your code again. You should see a message "cedta decided '' wasn't data.table aware", where is probably "FastRWeb". This calling environment (let's assume "FastRWeb" from now on) is more like .GlobalEnv than a package; i.e., it's where you run your own code, you've done library(data.table) in that environment, and so it is data.table aware as far as you're concerned. So what to do? 
There are two override mechanisms : The data.table package contains a character vector : > data.table:::cedta.override [1] "gWidgetsWWW" It already contains one package which is similar in nature. You can add FastRWeb to that vector yourself as follows : > assignInNamespace("cedta.override", c("gWidgetsWWW","FastRWeb"), "data.table") > data.table:::cedta.override [1] "gWidgetsWWW" "FastRWeb" But I'll also add FastRWeb to that vector in data.table, so from the next version of data.table you won't have to do it yourself. We'll add new packages as we become aware of them. Alternatively, the package author (Simon in this case) can provide data.table-awareness optionally. This mechanism was added for dplyr so it can control data.table awareness from the caller's end. That's done by setting a variable .datatable.aware=TRUE|FALSE in the calling package's namespace. However, in the case of FastRWeb, the cedta.override on data.table's side seems the right way to go. Matt On 18/02/14 05:31, Bacou, Melanie wrote: > Hi Arun, > > This is a little tricky to reproduce unless you have installed > FastRWeb, and then started the FastRWeb server. I'm executing these > scripts from the browser through a call to FastRWeb running on a local > port. > > Installation is documented here and is quick and straightforward on Linux: > https://rforge.net/FastRWeb/ > and an example here: > http://jayemerson.blogspot.mx/2011/10/setting-up-fastrwebrserve-on-ubuntu.html > > I'm using FastRWeb to build a simple web service. As long as I stick > to data.frame methods, everything works fine and I get the expected > plots and HTML output in the browser. But calls to data.table methods > (merge, extract) all seem to default to data.frame, and I really don't > know how to debug that. > > I am copying Simon Urbanek who's the maintainer of FastRWeb, in case > this is more of a FastRWeb issue. > > Here is my session info (I am on CentOS 5 and cannot easily upgrade to > R 3.0.2). > > > sessionInfo() > R version 2.15.2 (2012-10-26) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics utils datasets grDevices methods base > > other attached packages: > [1] ggmap_2.3 ggplot2_0.9.3.1 RColorBrewer_1.0-5 > raster_2.2-12 > [5] rgeos_0.3-3 rgdal_0.8-16 sp_1.0-14 data.table_1.8.10 > [9] RJDBC_0.2-3 rJava_0.9-6 DBI_0.2-7 rj_1.1.2-3 > > loaded via a namespace (and not attached): > [1] MASS_7.3-23 RJSONIO_1.0-3 RgoogleMaps_1.2.0.5 > [4] colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4 > [7] grid_2.15.2 gtable_0.1.2 labeling_0.2 > [10] lattice_0.20-24 mapproj_1.2-2 maps_2.3-6 > [13] munsell_0.4.2 plyr_1.8 png_0.1-7 > [16] proto_0.3-10 reshape2_1.2.2 rj.gd_1.1.0-1 > [19] rjson_0.2.13 scales_0.2.3 stringr_0.6.2 > [22] tools_2.15.2 > > Thanks all! > --Mel. > > > > > On 2/17/2014 6:58 AM, Arunkumar Srinivasan wrote: >> Mel, >> I'm not able to reproduce this on 1.8.11. Which version are you using? >> I'm not aware of this package, and what 'otable' is supposed to do. >> But I get no output while running your script, and not the error >> message as well. >> >> >> On Mon, Feb 17, 2014 at 11:14 AM, Bacou, Melanie > > wrote: >> >> Hi, >> >> I am testing an R script using FastRWeb (through Rserve). >> FastRWeb works as expected and I can successfully runs Simon >> Urbanek's examples. Problems arise when I try to merge >> datatables. It seems FastRWeb cannot find merge.data.table(). >> >> I'm using plenty of other libraries (ggplot, raster, RJDBC, etc.) 
>> that execute successfully through FastRWeb scripts, so I'm >> guessing it's something peculiar to data.table. >> >> Thanks for any help! --Mel. >> >> >> Here are reproducible examples. >> >> Test #1: the code below (the entire content of my R script) SUCCEEDS: >> >> # test1.R >> library(data.table) >> >> run <- function(...) { >> oclear() >> d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) >> d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) >> otable(d1) >> otable(d2) >> } >> >> This returns a simple web page showing 2 tables: >> 1 a >> 2 b >> 3 c >> >> v 4 >> a 6 >> b 7 >> >> >> >> Test #2: the code below (the entire content of my R script) FAILS >> with: >> Error in `[.default`(x, i) : invalid subscript type 'list' >> >> # test2.R >> library(data.table) >> >> run <- function(...) { >> oclear() >> d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) >> d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) >> otable(d1) >> otable(d2) >> setkey(d1, b) >> setkey(d2, e) >> otable(d1[d2]) >> } >> >> >> >> >> -- >> Melanie BACOU >> International Food Policy Research Institute >> Agricultural Economist, HarvestChoice >> Work+1(202)862-5699 >> E-mailmel at mbacou.com >> Visitharvestchoice.org >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > > -- > Melanie BACOU > International Food Policy Research Institute > Agricultural Economist, HarvestChoice > Work +1(202)862-5699 > E-mailmel at mbacou.com > Visit harvestchoice.org > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mel at mbacou.com Wed Feb 19 07:37:36 2014 From: mel at mbacou.com (Bacou, Melanie) Date: Wed, 19 Feb 2014 01:37:36 -0500 Subject: [datatable-help] Problem with data.table and FastRWeb In-Reply-To: <53033937.2090607@mdowle.plus.com> References: <5301E115.5080806@mbacou.com> <5302F047.6020404@mbacou.com> <53033937.2090607@mdowle.plus.com> Message-ID: <53045130.1070603@mbacou.com> Hi Matt, Thanks very much for your detailed explanation, and for offering to patch data.table. Adding > assignInNamespace("cedta.override", c("gWidgetsWWW","FastRWeb"), "data.table") did the trick here. Perfect! --Mel. On 2/18/2014 5:43 AM, Matt Dowle wrote: > > Hi Mel, > > Thanks for the info. It's likely related to cedta() and we can handle > it from data.table's side as follows. > > Background : > > http://stackoverflow.com/a/10529888/403310 > > Type "data.table:::cedta" so you can see the rules. I guess FastWeb > is running your code in its own environment. First thing, turn on > verbosity : > > options(data.table.verbose=TRUE) # or for one statement > rather than globally, d1[d2,verbose=TRUE] > > and run your code again. You should see a message "cedta decided > '' wasn't data.table aware", where is probably > "FastRWeb". > > This calling environment (let's assume "FastRWeb" from now on) is more > like .GlobalEnv than a package; i.e., it's where you run your own > code, you've done library(data.table) in that environment, and so it > is data.table aware as far as you're concerned. So what to do? 
There > are two override mechanisms : > > The data.table package contains a character vector : > > > data.table:::cedta.override > [1] "gWidgetsWWW" > > It already contains one package which is similar in nature. You can > add FastRWeb to that vector yourself as follows : > > > assignInNamespace("cedta.override", c("gWidgetsWWW","FastRWeb"), > "data.table") > > data.table:::cedta.override > [1] "gWidgetsWWW" "FastRWeb" > > But I'll also add FastRWeb to that vector in data.table, so from the > next version of data.table you won't have to do it yourself. We'll > add new packages as we become aware of them. > > Alternatively, the package author (Simon in this case) can provide > data.table-awareness optionally. This mechanism was added for dplyr > so it can control data.table awareness from the caller's end. That's > done by setting a variable .datatable.aware=TRUE|FALSE in the calling > package's namespace. However, in the case of FastRWeb, the > cedta.override on data.table's side seems the right way to go. > > Matt > > > On 18/02/14 05:31, Bacou, Melanie wrote: >> Hi Arun, >> >> This is a little tricky to reproduce unless you have installed >> FastRWeb, and then started the FastRWeb server. I'm executing these >> scripts from the browser through a call to FastRWeb running on a >> local port. >> >> Installation is documented here and is quick and straightforward on >> Linux: >> https://rforge.net/FastRWeb/ >> and an example here: >> http://jayemerson.blogspot.mx/2011/10/setting-up-fastrwebrserve-on-ubuntu.html >> >> I'm using FastRWeb to build a simple web service. As long as I stick >> to data.frame methods, everything works fine and I get the expected >> plots and HTML output in the browser. But calls to data.table methods >> (merge, extract) all seem to default to data.frame, and I really >> don't know how to debug that. >> >> I am copying Simon Urbanek who's the maintainer of FastRWeb, in case >> this is more of a FastRWeb issue. >> >> Here is my session info (I am on CentOS 5 and cannot easily upgrade >> to R 3.0.2). >> >> > sessionInfo() >> R version 2.15.2 (2012-10-26) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics utils datasets grDevices methods base >> >> other attached packages: >> [1] ggmap_2.3 ggplot2_0.9.3.1 RColorBrewer_1.0-5 >> raster_2.2-12 >> [5] rgeos_0.3-3 rgdal_0.8-16 sp_1.0-14 data.table_1.8.10 >> [9] RJDBC_0.2-3 rJava_0.9-6 DBI_0.2-7 rj_1.1.2-3 >> >> loaded via a namespace (and not attached): >> [1] MASS_7.3-23 RJSONIO_1.0-3 RgoogleMaps_1.2.0.5 >> [4] colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4 >> [7] grid_2.15.2 gtable_0.1.2 labeling_0.2 >> [10] lattice_0.20-24 mapproj_1.2-2 maps_2.3-6 >> [13] munsell_0.4.2 plyr_1.8 png_0.1-7 >> [16] proto_0.3-10 reshape2_1.2.2 rj.gd_1.1.0-1 >> [19] rjson_0.2.13 scales_0.2.3 stringr_0.6.2 >> [22] tools_2.15.2 >> >> Thanks all! >> --Mel. >> >> >> >> >> On 2/17/2014 6:58 AM, Arunkumar Srinivasan wrote: >>> Mel, >>> I'm not able to reproduce this on 1.8.11. Which version are you using? >>> I'm not aware of this package, and what 'otable' is supposed to do. >>> But I get no output while running your script, and not the error >>> message as well. >>> >>> >>> On Mon, Feb 17, 2014 at 11:14 AM, Bacou, Melanie >> > wrote: >>> >>> Hi, >>> >>> I am testing an R script using FastRWeb (through Rserve). >>> FastRWeb works as expected and I can successfully runs Simon >>> Urbanek's examples. Problems arise when I try to merge >>> datatables. 
It seems FastRWeb cannot find merge.data.table(). >>> >>> I'm using plenty of other libraries (ggplot, raster, RJDBC, >>> etc.) that execute successfully through FastRWeb scripts, so I'm >>> guessing it's something peculiar to data.table. >>> >>> Thanks for any help! --Mel. >>> >>> >>> Here are reproducible examples. >>> >>> Test #1: the code below (the entire content of my R script) >>> SUCCEEDS: >>> >>> # test1.R >>> library(data.table) >>> >>> run <- function(...) { >>> oclear() >>> d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) >>> d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) >>> otable(d1) >>> otable(d2) >>> } >>> >>> This returns a simple web page showing 2 tables: >>> 1 a >>> 2 b >>> 3 c >>> >>> v 4 >>> a 6 >>> b 7 >>> >>> >>> >>> Test #2: the code below (the entire content of my R script) >>> FAILS with: >>> Error in `[.default`(x, i) : invalid subscript type 'list' >>> >>> # test2.R >>> library(data.table) >>> >>> run <- function(...) { >>> oclear() >>> d1 <- data.table(a=c(1,2,3), b=c("a","b","c")) >>> d2 <- data.table(e=c("v","a","b"), f=c(4,6,7)) >>> otable(d1) >>> otable(d2) >>> setkey(d1, b) >>> setkey(d2, e) >>> otable(d1[d2]) >>> } >>> >>> >>> >>> >>> -- >>> Melanie BACOU >>> International Food Policy Research Institute >>> Agricultural Economist, HarvestChoice >>> Work+1(202)862-5699 >>> E-mailmel at mbacou.com >>> Visitharvestchoice.org >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >> >> -- >> Melanie BACOU >> International Food Policy Research Institute >> Agricultural Economist, HarvestChoice >> Work +1(202)862-5699 >> E-mailmel at mbacou.com >> Visit harvestchoice.org >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Melanie BACOU International Food Policy Research Institute Agricultural Economist, HarvestChoice Work +1(202)862-5699 E-mail mel at mbacou.com Visit harvestchoice.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From bradleydemarest at gmail.com Thu Feb 27 04:26:56 2014 From: bradleydemarest at gmail.com (bradley demarest) Date: Wed, 26 Feb 2014 20:26:56 -0700 Subject: [datatable-help] Obtain data.table_1.8.11.tar.gz source? Message-ID: I deleted data.table 1.8.11 by prematurely trying to update to 1.9.0. Now I'm stuck with 1.8.10, but I really need melt and cast for an ongoing project. Can anyone provide a link to the 1.8.11 source while the cran build issues are being resolved? Sincerely, Bradley Demarest From mdowle at mdowle.plus.com Thu Feb 27 04:38:51 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Thu, 27 Feb 2014 03:38:51 +0000 Subject: [datatable-help] Obtain data.table_1.8.11.tar.gz source? In-Reply-To: References: Message-ID: <530EB34B.7040804@mdowle.plus.com> Sure, now in the homepage directory. That takes an hour or so to update, so this link should work now : https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.11.tar.gz?root=datatable if not, try here and browse from there : https://r-forge.r-project.org/scm/viewvc.php/www/?root=datatable Matt On 27/02/14 03:26, bradley demarest wrote: > I deleted data.table 1.8.11 by prematurely trying to update to 1.9.0. 
> > Now I'm stuck with 1.8.10, but I really need melt and cast for an > ongoing project. > > Can anyone provide a link to the 1.8.11 source while the cran build > issues are being resolved? > > Sincerely, > Bradley Demarest > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Thu Feb 27 15:43:50 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Thu, 27 Feb 2014 14:43:50 +0000 Subject: [datatable-help] v1.9.2 is now on CRAN Message-ID: <530F4F26.8050801@mdowle.plus.com> It usually takes a few days for binaries to make their way to all mirrors and all platforms. You can install now from source from CRAN, or within an hour the data.table homepage should refresh with the Windows .zip for v1.9.2 (Ctrl-F5 and even clearing the browser cache may be required to refresh). NEWS is on CRAN : http://cran.r-project.org/web/packages/data.table/NEWS Real-time NEWS as we now move on to v1.9.3 is here (nothing yet) : https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable Matt From carrieromichele at gmail.com Thu Feb 27 16:05:07 2014 From: carrieromichele at gmail.com (carrieromichele) Date: Thu, 27 Feb 2014 15:05:07 +0000 Subject: [datatable-help] Possible bug in 1.9.x versions Message-ID: I just installed the new data.table versions. I tried both 1.9.0, available (binary) at http://datatable.r-forge.r-project.org/data.table_1.9.0.zip, and 1.9.2 (CRAN) building from source (using Rtools) After installing I run my BAU scripts and found out that I had different results... this is what I could made reproducible 1.8.10 > library(data.table) data.table 1.8.10 For help type: help("data.table") > set.seed(1) > dt <- data.table(id=rep(1:4, each=3), + var1 = rep(letters[1:3], 4), + var2 = rnorm(12), + key="id,var1") > dt id var1 var2 1: 1 a -0.6264538 2: 1 b 0.1836433 3: 1 c -0.8356286 4: 2 a 1.5952808 5: 2 b 0.3295078 6: 2 c -0.8204684 7: 3 a 0.4874291 8: 3 b 0.7383247 9: 3 c 0.5757814 10: 4 a -0.3053884 11: 4 b 1.5117812 12: 4 c 0.3898432 > > key(dt) [1] "id" "var1" > dt[.(unique(id)), list(var1, var2)] id var1 var2 1: 1 a -0.6264538 2: 1 b 0.1836433 3: 1 c -0.8356286 4: 2 a 1.5952808 5: 2 b 0.3295078 6: 2 c -0.8204684 7: 3 a 0.4874291 8: 3 b 0.7383247 9: 3 c 0.5757814 10: 4 a -0.3053884 11: 4 b 1.5117812 12: 4 c 0.3898432 1.9.0 > library(data.table) data.table 1.9.0 For help type: help("data.table") Warning message: package 'data.table' was built under R version 3.1.0 > set.seed(1) > dt <- data.table(id=rep(1:4, each=3), + var1 = rep(letters[1:3], 4), + var2 = rnorm(12), + key="id,var1") > dt id var1 var2 1: 1 a -0.6264538 2: 1 b 0.1836433 3: 1 c -0.8356286 4: 2 a 1.5952808 5: 2 b 0.3295078 6: 2 c -0.8204684 7: 3 a 0.4874291 8: 3 b 0.7383247 9: 3 c 0.5757814 10: 4 a -0.3053884 11: 4 b 1.5117812 12: 4 c 0.3898432 > > key(dt) [1] "id" "var1" > dt[.(unique(id)), list(var1, var2)] id var1 var2 1: 1 a -0.6264538 2: 1 a 0.1836433 3: 1 a -0.8356286 4: 2 a 1.5952808 5: 2 a 0.3295078 6: 2 a -0.8204684 7: 3 a 0.4874291 8: 3 a 0.7383247 9: 3 a 0.5757814 10: 4 a -0.3053884 11: 4 a 1.5117812 12: 4 a 0.3898432 1.9.2 > library("data.table", lib.loc="C:/Program Files/R/R-3.0.2/library") data.table 1.9.2 For help type: help("data.table") > set.seed(1) > dt <- data.table(id=rep(1:4, each=3), + var1 = rep(letters[1:3], 4), + var2 = rnorm(12), + key="id,var1") Error in forder(x, cols, sort = TRUE, retGrp = FALSE) : 
object 'Cforder' not found > dt id var1 var2 1: 1 a -0.6264538 2: 1 b 0.1836433 3: 1 c -0.8356286 4: 2 a 1.5952808 5: 2 b 0.3295078 6: 2 c -0.8204684 7: 3 a 0.4874291 8: 3 b 0.7383247 9: 3 c 0.5757814 10: 4 a -0.3053884 11: 4 b 1.5117812 12: 4 c 0.3898432 > > key(dt) [1] "id" "var1" > dt[.(unique(id)), list(var1, var2)] Error in `[.data.table`(dt, .(unique(id)), list(var1, var2)) : object 'Cbmerge' not found It seems that in the 1.9.0 version when you join using fewer keys than the whole set of keys, the first values of the remaining keys are "carried forward". Other column looks fine. In the 1.9.2 instead some dependencies seem missing. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Feb 27 16:14:30 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Thu, 27 Feb 2014 15:14:30 +0000 Subject: [datatable-help] Possible bug in 1.9.x versions In-Reply-To: References: Message-ID: <530F5656.2030801@mdowle.plus.com> From those messages, it looks like the install didn't work properly. This can happen on Windows if another process is still using the older .dll. On every release we usually do get reports like this. Since it is Windows, let's try overkill first : 1. Close all R sessions 2. To be sure, reboot. This ensures all locks on open .dlls are fully cleared. 3. Start R 4. remove.package("data.table") 5. install.packages("data.table") 6. require(data.table) 7. test.data.table() -- does it work? 8. Rerun test The Windows .zip for 1.9.2 is now on the homepage, so it's best to use that one please. Matt On 27/02/14 15:05, carrieromichele wrote: > I just installed the new data.table versions. I tried both > 1.9.0, available (binary) at > http://datatable.r-forge.r-project.org/data.table_1.9.0.zip, and 1.9.2 > (CRAN) building from source (using Rtools) > > After installing I run my BAU scripts and found out that I had > different results... 
this is what I could made reproducible > > 1.8.10 > > > library(data.table) > data.table 1.8.10 For help type: help("data.table") > > set.seed(1) > > dt <- data.table(id=rep(1:4, each=3), > + var1 = rep(letters[1:3], 4), > + var2 = rnorm(12), > + key="id,var1") > > dt > id var1 var2 > 1: 1 a -0.6264538 > 2: 1 b 0.1836433 > 3: 1 c -0.8356286 > 4: 2 a 1.5952808 > 5: 2 b 0.3295078 > 6: 2 c -0.8204684 > 7: 3 a 0.4874291 > 8: 3 b 0.7383247 > 9: 3 c 0.5757814 > 10: 4 a -0.3053884 > 11: 4 b 1.5117812 > 12: 4 c 0.3898432 > > > > key(dt) > [1] "id" "var1" > > dt[.(unique(id)), list(var1, var2)] > id var1 var2 > 1: 1 a -0.6264538 > 2: 1 b 0.1836433 > 3: 1 c -0.8356286 > 4: 2 a 1.5952808 > 5: 2 b 0.3295078 > 6: 2 c -0.8204684 > 7: 3 a 0.4874291 > 8: 3 b 0.7383247 > 9: 3 c 0.5757814 > 10: 4 a -0.3053884 > 11: 4 b 1.5117812 > 12: 4 c 0.3898432 > > 1.9.0 > > > > library(data.table) > data.table 1.9.0 For help type: help("data.table") > Warning message: > package 'data.table' was built under R version 3.1.0 > > set.seed(1) > > dt <- data.table(id=rep(1:4, each=3), > + var1 = rep(letters[1:3], 4), > + var2 = rnorm(12), > + key="id,var1") > > dt > id var1 var2 > 1: 1 a -0.6264538 > 2: 1 b 0.1836433 > 3: 1 c -0.8356286 > 4: 2 a 1.5952808 > 5: 2 b 0.3295078 > 6: 2 c -0.8204684 > 7: 3 a 0.4874291 > 8: 3 b 0.7383247 > 9: 3 c 0.5757814 > 10: 4 a -0.3053884 > 11: 4 b 1.5117812 > 12: 4 c 0.3898432 > > > > key(dt) > [1] "id" "var1" > > dt[.(unique(id)), list(var1, var2)] > id var1 var2 > 1: 1 a -0.6264538 > 2: 1 a 0.1836433 > 3: 1 a -0.8356286 > 4: 2 a 1.5952808 > 5: 2 a 0.3295078 > 6: 2 a -0.8204684 > 7: 3 a 0.4874291 > 8: 3 a 0.7383247 > 9: 3 a 0.5757814 > 10: 4 a -0.3053884 > 11: 4 a 1.5117812 > 12: 4 a 0.3898432 > > 1.9.2 > > > library("data.table", lib.loc="C:/Program Files/R/R-3.0.2/library") > data.table 1.9.2 For help type: help("data.table") > > set.seed(1) > > dt <- data.table(id=rep(1:4, each=3), > + var1 = rep(letters[1:3], 4), > + var2 = rnorm(12), > + key="id,var1") > Error in forder(x, cols, sort = TRUE, retGrp = FALSE) : > object 'Cforder' not found > > dt > id var1 var2 > 1: 1 a -0.6264538 > 2: 1 b 0.1836433 > 3: 1 c -0.8356286 > 4: 2 a 1.5952808 > 5: 2 b 0.3295078 > 6: 2 c -0.8204684 > 7: 3 a 0.4874291 > 8: 3 b 0.7383247 > 9: 3 c 0.5757814 > 10: 4 a -0.3053884 > 11: 4 b 1.5117812 > 12: 4 c 0.3898432 > > > > key(dt) > [1] "id" "var1" > > dt[.(unique(id)), list(var1, var2)] > Error in `[.data.table`(dt, .(unique(id)), list(var1, var2)) : > object 'Cbmerge' not found > > It seems that in the 1.9.0 version when you join using fewer keys than > the whole set of keys, the first values of the remaining keys are > "carried forward". Other column looks fine. > > In the 1.9.2 instead some dependencies seem missing. > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From carrieromichele at gmail.com Thu Feb 27 16:26:27 2014 From: carrieromichele at gmail.com (Michele) Date: Thu, 27 Feb 2014 07:26:27 -0800 (PST) Subject: [datatable-help] Possible bug in 1.9.x versions In-Reply-To: <530F5656.2030801@mdowle.plus.com> References: <530F5656.2030801@mdowle.plus.com> Message-ID: <1393514787756-4685932.post@n4.nabble.com> Hi, thanks for the quick response. Still nothing though. 
Using the .zip form r-forge, at http://datatable.r-forge.r-project.org/data.table_1.9.2.zip, doesn't give me errors like `object 'Cforder' not found `, but the below join is still incorrect. > remove.packages("data.table") Removing package from ?C:/Program Files/R/R-3.0.2/library? (as ?lib? is unspecified) > install.packages("C:/Users/MCarrie/Downloads/data.table_1.9.2.zip", repos > = NULL) Warning in install.packages : package ?C:/Users/MCarrie/Downloads/data.table_1.9.2.zip? is not available (for R version 3.0.2) package ?data.table? successfully unpacked and MD5 sums checked > require(data.table) Loading required package: data.table data.table 1.9.2 For help type: help("data.table") Warning message: package ?data.table? was built under R version 3.1.0 > test.data.table() Running C:/Program Files/R/R-3.0.2/library/data.table/tests/tests.Rraw Loading required package: reshape2 Loading required package: reshape Loading required package: plyr Loading required package: ggplot2 Loading required package: hexbin Loading required package: nlme Loading required package: xts Loading required package: zoo Attaching package: ?zoo? The following objects are masked from ?package:base?: as.Date, as.Date.numeric Attaching package: ?xts? The following object is masked from ?package:data.table?: last Loading required package: bit64 Loading required package: gdata gdata: read.xls support for 'XLS' (Excel 97-2004) files gdata: ENABLED. gdata: read.xls support for 'XLSX' (Excel 2007+) files gdata: ENABLED. Attaching package: ?gdata? The following object is masked from ?package:stats?: nobs The following object is masked from ?package:utils?: object.size Test 167.2 not run. If required call library(hexbin) first. Don't know how to automatically pick scale for object of type ITime. Defaulting to continuous Don't know how to automatically pick scale for object of type ITime. Defaulting to continuous Tests 487 and 488 not run. If required call library(reshape) first. Tests 897-899 not run. If required call library(bit64) first. All 1220 tests in inst/tests/tests.Rraw completed ok in 22.115sec on Thu Feb 27 15:19:35 2014 library(data.table) set.seed(1) > dt <- data.table(id=rep(1:4, each=3), + var1 = rep(letters[1:3], 4), + var2 = rnorm(12), + key="i ..." ... [TRUNCATED] > dt id var1 var2 1: 1 a -0.6264538 2: 1 b 0.1836433 3: 1 c -0.8356286 4: 2 a 1.5952808 5: 2 b 0.3295078 6: 2 c -0.8204684 7: 3 a 0.4874291 8: 3 b 0.7383247 9: 3 c 0.5757814 10: 4 a -0.3053884 11: 4 b 1.5117812 12: 4 c 0.3898432 > key(dt) [1] "id" "var1" > dt[.(unique(id)), list(var1, var2)] id var1 var2 1: 1 a -0.6264538 2: 1 a 0.1836433 3: 1 a -0.8356286 4: 2 a 1.5952808 5: 2 a 0.3295078 6: 2 a -0.8204684 7: 3 a 0.4874291 8: 3 a 0.7383247 9: 3 a 0.5757814 10: 4 a -0.3053884 11: 4 a 1.5117812 12: 4 a 0.3898432 -- View this message in context: http://r.789695.n4.nabble.com/Possible-bug-in-1-9-x-versions-tp4685930p4685932.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Thu Feb 27 16:49:12 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Thu, 27 Feb 2014 15:49:12 +0000 Subject: [datatable-help] Possible bug in 1.9.x versions In-Reply-To: <1393514787756-4685932.post@n4.nabble.com> References: <530F5656.2030801@mdowle.plus.com> <1393514787756-4685932.post@n4.nabble.com> Message-ID: <530F5E78.3070906@mdowle.plus.com> Thanks. Yes, I see the same. 1,220 tests plus tests from 37 dependent packages still isn't enough is it. Sigh. Will fix. 
Matt On 27/02/14 15:26, Michele wrote: > Hi, thanks for the quick response. Still nothing though. Using the .zip form > r-forge, at http://datatable.r-forge.r-project.org/data.table_1.9.2.zip, > doesn't give me errors like `object 'Cforder' not found > `, but the below join is still incorrect. > >> remove.packages("data.table") > Removing package from ?C:/Program Files/R/R-3.0.2/library? > (as ?lib? is unspecified) >> install.packages("C:/Users/MCarrie/Downloads/data.table_1.9.2.zip", repos >> = NULL) > Warning in install.packages : > package ?C:/Users/MCarrie/Downloads/data.table_1.9.2.zip? is not available > (for R version 3.0.2) > package ?data.table? successfully unpacked and MD5 sums checked >> require(data.table) > Loading required package: data.table > data.table 1.9.2 For help type: help("data.table") > Warning message: > package ?data.table? was built under R version 3.1.0 >> test.data.table() > Running C:/Program Files/R/R-3.0.2/library/data.table/tests/tests.Rraw > Loading required package: reshape2 > Loading required package: reshape > Loading required package: plyr > Loading required package: ggplot2 > Loading required package: hexbin > Loading required package: nlme > Loading required package: xts > Loading required package: zoo > > Attaching package: ?zoo? > > The following objects are masked from ?package:base?: > > as.Date, as.Date.numeric > > > Attaching package: ?xts? > > The following object is masked from ?package:data.table?: > > last > > Loading required package: bit64 > Loading required package: gdata > gdata: read.xls support for 'XLS' (Excel 97-2004) files > gdata: ENABLED. > > gdata: read.xls support for 'XLSX' (Excel 2007+) files > gdata: ENABLED. > > Attaching package: ?gdata? > > The following object is masked from ?package:stats?: > > nobs > > The following object is masked from ?package:utils?: > > object.size > > Test 167.2 not run. If required call library(hexbin) first. > Don't know how to automatically pick scale for object of type ITime. > Defaulting to continuous > Don't know how to automatically pick scale for object of type ITime. > Defaulting to continuous > Tests 487 and 488 not run. If required call library(reshape) first. > Tests 897-899 not run. If required call library(bit64) first. > All 1220 tests in inst/tests/tests.Rraw completed ok in 22.115sec on Thu Feb > 27 15:19:35 2014 > > library(data.table) > set.seed(1) > >> dt <- data.table(id=rep(1:4, each=3), > + var1 = rep(letters[1:3], 4), > + var2 = rnorm(12), > + key="i ..." ... [TRUNCATED] > >> dt > id var1 var2 > 1: 1 a -0.6264538 > 2: 1 b 0.1836433 > 3: 1 c -0.8356286 > 4: 2 a 1.5952808 > 5: 2 b 0.3295078 > 6: 2 c -0.8204684 > 7: 3 a 0.4874291 > 8: 3 b 0.7383247 > 9: 3 c 0.5757814 > 10: 4 a -0.3053884 > 11: 4 b 1.5117812 > 12: 4 c 0.3898432 > >> key(dt) > [1] "id" "var1" > >> dt[.(unique(id)), list(var1, var2)] > id var1 var2 > 1: 1 a -0.6264538 > 2: 1 a 0.1836433 > 3: 1 a -0.8356286 > 4: 2 a 1.5952808 > 5: 2 a 0.3295078 > 6: 2 a -0.8204684 > 7: 3 a 0.4874291 > 8: 3 a 0.7383247 > 9: 3 a 0.5757814 > 10: 4 a -0.3053884 > 11: 4 a 1.5117812 > 12: 4 a 0.3898432 > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Possible-bug-in-1-9-x-versions-tp4685930p4685932.html > Sent from the datatable-help mailing list archive at Nabble.com. 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From carrieromichele at gmail.com Thu Feb 27 16:55:55 2014 From: carrieromichele at gmail.com (Michele) Date: Thu, 27 Feb 2014 07:55:55 -0800 (PST) Subject: [datatable-help] Possible bug in 1.9.x versions In-Reply-To: <530F5E78.3070906@mdowle.plus.com> References: <530F5656.2030801@mdowle.plus.com> <1393514787756-4685932.post@n4.nabble.com> <530F5E78.3070906@mdowle.plus.com> Message-ID: <1393516555862-4685944.post@n4.nabble.com> :-) it happens to best ones as well! May I ask if you are using R-devel for your development? Just because when loading the .zip version, R says: > Warning message: > package ?data.table? was built under R version 3.1.0 aren't we at 3.0.2? -- View this message in context: http://r.789695.n4.nabble.com/Possible-bug-in-1-9-x-versions-tp4685930p4685944.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Thu Feb 27 17:11:48 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Thu, 27 Feb 2014 16:11:48 +0000 Subject: [datatable-help] Possible bug in 1.9.x versions In-Reply-To: <1393516555862-4685944.post@n4.nabble.com> References: <530F5656.2030801@mdowle.plus.com> <1393514787756-4685932.post@n4.nabble.com> <530F5E78.3070906@mdowle.plus.com> <1393516555862-4685944.post@n4.nabble.com> Message-ID: <530F63C4.3020702@mdowle.plus.com> On 27/02/14 15:55, Michele wrote: > :-) it happens to best ones as well! May I ask if you are using R-devel for > your development? Not only Rdevel, but Rdevel compiled with ASAN, 2.14.0, 3.0.2, and some 3.0.3beta as well for good measure. Here was the CRAN submission covering email : ==== I have rerun R CMD check on : Stated dependency (R 2.14.0) Winbuilder Rdevel. Rdevel r65060 with ASAN Causata, gems and treemap (both Rdevel and 3.0.2 but not 3.0.3beta) devtools::revdep() on R 3.0.2 (43 packages and all their dependencies) ==== But yes that Windows .zip came from Winbuilder R-devel. Ok, so it needs to be the R-release option on Winbuilder that goes on the homepage. Will remember that when the patch is ready. Thanks. Matt > Just because when loading the .zip version, R says: > >> Warning message: >> package ?data.table? was built under R version 3.1.0 > aren't we at 3.0.2? > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Possible-bug-in-1-9-x-versions-tp4685930p4685944.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Fri Feb 28 15:52:16 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Fri, 28 Feb 2014 14:52:16 +0000 Subject: [datatable-help] Possible bug in 1.9.x versions In-Reply-To: <1393516555862-4685944.post@n4.nabble.com> References: <530F5656.2030801@mdowle.plus.com> <1393514787756-4685932.post@n4.nabble.com> <530F5E78.3070906@mdowle.plus.com> <1393516555862-4685944.post@n4.nabble.com> Message-ID: <5310A2A0.1050407@mdowle.plus.com> Now fixed and new Windows .zip for R-release uploaded to data.table homepage. We'll aim to release to CRAN fairly soon. Thanks again Michele. 
From NEWS :

o   When joining to fewer columns than the key has, using one of the
    later key columns explicitly in j repeated the first value. A problem
    introduced by v1.9.2 and not caught by our 1,220 tests, or tests in
    37 dependent packages. Test added. Many thanks to Michele Carrier
    for reporting.

        DT = data.table(a=1:2, b=letters[1:6], key="a,b")  # keyed by a and b
        DT[.(1), b]  # correct result again (joining just to a not b but using b)

Matt
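
For readers who want to check an installed version against this fix, the NEWS example above can be expanded into a small runnable sketch. Only the two DT lines are taken from the NEWS entry; the printed table and the commented results are illustrative, inferred from the behaviour Michele reported and from the fix description, not copied from the thread.

    library(data.table)

    # a = 1:2 recycles to 1,2,1,2,1,2 alongside b = a..f, then rows are
    # sorted by the two-column key (a, b).
    DT = data.table(a=1:2, b=letters[1:6], key="a,b")
    DT
    #    a b
    # 1: 1 a
    # 2: 1 c
    # 3: 1 e
    # 4: 2 b
    # 5: 2 d
    # 6: 2 f

    # Join on the first key column only, but use the second key column in j.
    DT[.(1), b]
    # 1.8.10 and the fixed build described above : "a" "c" "e"
    # 1.9.2 as released                          : "a" "a" "a"  (first value repeated)

This is the same pattern as Michele's original report, where var1 (a later key column used in j while joining on id alone) came back as "a" for every row.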