From sams.james at gmail.com Thu May 1 06:40:44 2014
From: sams.james at gmail.com (James Sams)
Date: Wed, 30 Apr 2014 23:40:44 -0500
Subject: [datatable-help] internal FALSE/TRUE value has been modified
Message-ID: <5361D04C.2090509@gmail.com>

I don't really know what this error message means. A quick example to
show what I'm seeing:

> library(data.table)
data.table 1.9.3 For help type: help("data.table")
> upc_table = data.table(upc=1:100000, upc_ver_uc=rep(c(1,2),
times=50000), is_PL=rep(c(T, F, F, T), each=25000),
product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
> upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc,
upc_ver_uc)]
Warning message:
In `[.data.table`(upc_table, , list(is_PL, product_module_code), :
  internal TRUE value has been modified

When I continue using R, I eventually start getting more errors, such as:

Error in gettext(domain, unlist(args)) : invalid 'string' value
Error during wrapup: invalid 'string' value

and then terminal input/output becomes corrupted. I only start getting
these error messages once I start using data.table; but the messages
don't necessarily occur only with data.table functions.

I don't know if the last statement above is executing correctly or not.
I'm rather confused as to what is going on. I was using a somewhat stale
(maybe a couple of weeks old) svn version of data.table; but I see the
same behavior with the latest data.table (r1263). I'm using CRAN's R 3.1
package for Ubuntu on 13.10 and 14.04.

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
     LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
     LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.9.3

loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4 stringr_0.6.2

--
James Sams
sams.james at gmail.com

From my.r.help at gmail.com Thu May 1 14:42:56 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 01 May 2014 20:42:56 +0800
Subject: [datatable-help] Filtering Based on Previous Observation
In-Reply-To:
References: <535FB180.2060209@gmail.com>
Message-ID: <53624150.6000001@gmail.com>

Awesome, thanks to all of you who have replied. I learned some nice new
data.table/programming tricks!

M

On 04/30/2014 08:00 PM, Gabor Grothendieck wrote:
> On Tue, Apr 29, 2014 at 10:04 AM, Michael Smith wrote:
>> All,
>>
>> Is there some data.table-idiomatic way to filter based on a previous
>> observation/row? For example, I want to remove a row if
>> DT$a[row]==DT$a[row-1].
>>
>> It could be done by first calculating the lag and then filtering based
>> on that, but I wonder if there's a more direct way.
>>
>> The following example works, but my feeling is there should be a more
>> elegant solution:
>>
>> ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) )
>> DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][]
>
> If the unique elements always appear consecutively then the following
> would work.
>
> (For example, if `a` were in ascending order (as in the example) or
> descending order then that would be satisfied. If DT were keyed
> on 'a' then this would always be the case.)
>
> DT[ !duplicated(a) ]
>
> Note that 'a' need not be numeric.
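
The two approaches quoted above can be compared side by side. The following is only a sketch: the fifth row of the sample data is made up (it is not in the thread) so that the difference becomes visible, and the helper variable `keep` is a hypothetical name. It keeps a row only when `a` differs from the row before, without creating and then deleting a lag column.

```r
library(data.table)

# Same shape as the example in the thread, plus a made-up fifth row that
# repeats the value 2 non-consecutively.
DT <- data.table(a = c(1, 2, 2, 3, 2), b = 8:4)

# Keep the first row, then every row whose 'a' differs from the row before.
keep <- c(TRUE, DT$a[-1L] != DT$a[-nrow(DT)])
DT[keep]

# For comparison: this drops every later repeat of a value, adjacent or not,
# so it matches the line above only when equal values sit next to each other.
DT[!duplicated(a)]
```

With the extra row, the first form keeps the trailing 2 while `!duplicated(a)` drops it, which is the caveat about consecutive values made in the quoted reply.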

From mdowle at mdowle.plus.com Thu May 1 17:29:34 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Thu, 01 May 2014 16:29:34 +0100
Subject: [datatable-help] internal FALSE/TRUE value has been modified
In-Reply-To: <5361D04C.2090509@gmail.com>
References: <5361D04C.2090509@gmail.com>
Message-ID: <5362685E.1080303@mdowle.plus.com>

Reproduced, thanks for nice example.

Not sure yet but what R 3.1 now does is store length 1 logical vectors
once only, globally, for efficiency to avoid many new allocations for
the common case of single TRUE or FALSE values passed around at C or R
level (a nice and welcome change). Since data.table modifies vectors by
reference, if that vector is length 1 a new data.table bug as from R 3.1
could be modifying R's internal value of TRUE or FALSE whenever length 1
logical vectors occur. Clearly a serious bug.

The test suite immediately broke the day after the R-devel change was
made (good) and was one reason data.table was in error state in CRAN
checks for quite a while before R 3.1 shipped. It was typically tests of
1-row data.table's including a logical column and modifying that logical
column that broke. We fixed that and put in checks to detect and warn if
R's internal value has been modified, just in case. Those changes were
in v1.9.2 on CRAN. I think I wasn't 100% confident in the detection test
(false positives) so made it a warning instead of an error. Now that
R 3.1 is out and we haven't had any false positives, it should be an error.

The feature of this upc_table is that all the groups are size 1 :

> upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
[1] 1

If we change the example so that one group has more than 1 row, it works
ok :

> upc_table = data.table(upc=c(1:99998,1,1), upc_ver_uc=rep(c(1,2),
times=50000), is_PL=rep(c(T, F, F, T), each=25000),
product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
> upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
[1] 2
> upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc,
upc_ver_uc)]

So it seems the problem is in the single allocation of working memory
for the largest group when that's just 1 and contains a logical column.
Odd, I would have sworn we caught that! Will fix.

R-devel are planning to do more of this small-object-sharing for common
single integer values e.g. 0-10, so we'll need to add more tests
accordingly.

Thanks,
Matt

From harishv_99 at yahoo.com Sat May 3 03:08:42 2014
From: harishv_99 at yahoo.com (Harish)
Date: Fri, 2 May 2014 18:08:42 -0700 (PDT)
Subject: [datatable-help] fread() coercion bug?
Message-ID: <1399079322.73291.YahooMailNeo@web120206.mail.ne1.yahoo.com>

I was trying to use fread() to read data when I got the following error
which made no sense:

In fread(paste0(strData, collapse = "\n"), integer64 = "character") :
Bumped column 2 to type character on data row 13, field contains
'2464.77'. Coercing previously read values in this column from integer
or numeric back to character which may not be lossless; e.g., if '00'
and '000' occurred before they will now be just '0', and there may be
inconsistencies with treatment of ',,' and ',NA,' too (if they occurred
in this column before the bump). If this matters please rerun and set
'colClasses' to 'character' for this column. Please note that column
type detection uses the first 5 rows, the middle 5 rows and the last 5
rows, so hopefully this message should be very rare.
If reporting to datatable-help, please rerun and include the output from
verbose=TRUE.

because "2464.77" is a perfectly legitimate number and there is no
reason to coerce the column to character for that.

Here is how to reproduce it:

    dtT <- data.table( a = 1:72, b=0 )
    dtT[ 13, b := 2464.77 ]
    strData <- capture.output( write.table( dtT, row.names=FALSE, quote=FALSE, sep="\t" ) )
    fread( paste0( strData, collapse="\n" ), integer64="character" )

Note that the following works okay without the integer64="character"
argument:

    dtT <- data.table( a = 1:72, b=0 )
    dtT[ 13, b := 2464.77 ]
    strData <- capture.output( write.table( dtT, row.names=FALSE, quote=FALSE, sep="\t" ) )
    fread( paste0( strData, collapse="\n" ) )

I would appreciate if you could provide some sort of a workaround for
this. The reason I am using the integer64="character" argument is that
I have large numbers at times which seem to have issues once they are
read as integer64 -- and that might have nothing to do with data.table
but I have not had time to look into it. My work-around for that issue
was to read it as character, but I run into the above issue.

Thanks for your help.

Regards,
Harish

From rguy at 123mail.org Sun May 4 08:00:48 2014
From: rguy at 123mail.org (Rguy)
Date: Sat, 3 May 2014 23:00:48 -0700 (PDT)
Subject: [datatable-help] A[B]?
Message-ID: <1399183248863-4689942.post@n4.nabble.com>

I am beginning to learn the data.table package. At the outset,
'data.table.pdf' states:

It is inspired by A[B] syntax in R where A is a matrix and B is a
2-column matrix.

I have used matrices in R but am unfamiliar with the A[B] syntax. When I
check the documentation for 'matrix' I find no discussion of such
syntax. So this "explanation" is in fact a black hole. Please tell your
readers what the package does in such a way that they are not sent on a
wild goose chase. For example:

The data.table package supports an A[B] syntax where A is a data table,
B is a 2 column data table, and the effect of the expression A[B] is...

What does A[B] accomplish?

Thanks.

--
View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com Sun May 4 09:50:14 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sun, 04 May 2014 15:50:14 +0800
Subject: [datatable-help] A[B]?
In-Reply-To: <1399183248863-4689942.post@n4.nabble.com>
References: <1399183248863-4689942.post@n4.nabble.com>
Message-ID: <5365F136.8050807@gmail.com>

See FAQ 2.14
http://datatable.r-forge.r-project.org/datatable-faq.pdf
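
For readers arriving with the same question, here is a minimal sketch of the two forms of A[B] being discussed. The object names (A, B, X, Y) and the sample values are made up for illustration. The data.table half assumes X has a key, so that X[Y] looks up each row of Y in X by that key.

```r
library(data.table)

# Base R: A is a matrix, B is a 2-column matrix of (row, column) pairs.
# A[B] returns one element of A per row of B.
A <- matrix(1:12, nrow = 3)      # 3 x 4 matrix, filled column-wise
B <- cbind(c(1, 3), c(2, 4))     # selects A[1, 2] and A[3, 4]
A[B]

# data.table: X is a keyed data.table and Y is a data.table whose first
# column is matched against X's key. X[Y] returns the matching rows of X,
# in the order of the rows of Y.
X <- data.table(id = c("a", "b", "c"), val = 1:3, key = "id")
Y <- data.table(id = c("c", "a"))
X[Y]
```

In both cases B (or Y) is a table of coordinates and the result has one entry per row of it, which is the analogy the package description seems to be drawing.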

From mdowle at mdowle.plus.com Sun May 4 10:50:32 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Sun, 04 May 2014 09:50:32 +0100
Subject: [datatable-help] fread() coercion bug?
In-Reply-To: <1399079322.73291.YahooMailNeo@web120206.mail.ne1.yahoo.com>
References: <1399079322.73291.YahooMailNeo@web120206.mail.ne1.yahoo.com>
Message-ID: <5365FF58.2050402@mdowle.plus.com>

Reproduced, thanks. Can't think why that is, but will fix. Please file
as a bug so it's not forgotten.

In the meantime, setting the class manually for that column (colClasses
argument) works in this example :

fread( paste0( strData, collapse="\n" ), integer64="character",
colClasses=list(numeric="b"))

Is that workable for the full example? I've used that syntax for
colClasses so you can pass a vector of column names to be read as
numeric more easily, if need be.

Matt

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From carrieromichele at gmail.com Mon May 5 04:43:39 2014
From: carrieromichele at gmail.com (Michele)
Date: Sun, 4 May 2014 19:43:39 -0700 (PDT)
Subject: [datatable-help] Roll + nomatch mixes result
Message-ID: <1399257819547-4689968.post@n4.nabble.com>

Hello,

I think this was recently introduced, because this example comes from a
part of my code that I double and triple checked several times in the
past (I mean I should have noticed before, maybe..):

data <- data.table(code = c(rep("A",26L), rep("B",10L)),
    id = c(rep(1L, 20L), rep(2L, 6L), rep(1L, 10L)),
    date = structure(c(14602, 14638, 14665, 14698, 14726, 14754, 14788,
        14817, 14846, 14882, 14939, 15005, 15029, 15064, 15091, 15125,
        15153, 15328, 15393, 15393, 15393, 15393, 15431, 15461, 15569,
        15569, 14613, 14762, 15110, 15110, 15686, 15686, 14602, 14638,
        14665, 14698), class = "Date"))
filter <- data.table(code = c("A", "B"), id = c(2L, 1L),
    limit1 = structure(c(15564, 15681), class = "Date"),
    limit2 = structure(c(15574, 15691), class = "Date"),
    index_R = c(26610L, 22662L))
setkey(data)
setkey(filter, code, id, limit1)

> filter[data, nomatch=0, roll=T]
   code id     limit1     limit2 index_R
1:    A  2 2012-02-23 2012-08-22   26610
2:    A  2 2012-02-23 2012-08-22   26610
3:    A  2 2012-08-17 2012-12-17   22662
4:    A  2 2012-08-17 2012-12-17   22662
5:    B  1 2011-05-16 2012-08-22   26610
6:    B  1 2011-05-16 2012-08-22   26610
7:    B  1 2012-12-12 2012-12-17   22662
8:    B  1 2012-12-12 2012-12-17   22662
>
> # expected output - workaround using any column from X which is never NA (before doing X[Y, roll=T])
> filter[data, roll=T][!is.na(index_R)]
   code id     limit1     limit2 index_R
1:    A  2 2012-08-17 2012-08-22   26610
2:    A  2 2012-08-17 2012-08-22   26610
3:    B  1 2012-12-12 2012-12-17   22662
4:    B  1 2012-12-12 2012-12-17   22662

btw I'm on 1.9.3, the commit right before the by without by was sadly
removed (sadly because I would need at least a whole week to change all
my code...)

Can you guys reproduce this? Is it already fixed?

Regards,
Michele.

--
View this message in context: http://r.789695.n4.nabble.com/Roll-nomatch-mixes-result-tp4689968.html
Sent from the datatable-help mailing list archive at Nabble.com.

From rguy at 123mail.org Tue May 6 11:57:25 2014
From: rguy at 123mail.org (Rguy)
Date: Tue, 6 May 2014 02:57:25 -0700 (PDT)
Subject: [datatable-help] A[B]?
In-Reply-To: <5365F136.8050807@gmail.com>
References: <1399183248863-4689942.post@n4.nabble.com> <5365F136.8050807@gmail.com>
Message-ID: <1399370245881-4690040.post@n4.nabble.com>

That FAQ does not provide any examples of the A[B] syntax used with data
table objects. It does provide an example using A[B] with matrix
objects, but the example does not translate to data table objects, so
I'm not sure why it's there. I suggest that the FAQ be extended to
provide one, or better yet several, examples of the A[B] syntax applied
to data.table objects.

As far as I have been able to puzzle out so far, A[B] is just another
way to do a merge.

--
View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942p4690040.html
Sent from the datatable-help mailing list archive at Nabble.com.

From rguy at 123mail.org Tue May 6 12:06:57 2014
From: rguy at 123mail.org (Rguy)
Date: Tue, 6 May 2014 03:06:57 -0700 (PDT)
Subject: [datatable-help] Assigning with a compound condition
Message-ID: <1399370817746-4690042.post@n4.nabble.com>

I am experimenting with assigning into a data table (and data frame with
same data) when the assignment involves a compound condition on multiple
columns. Please see the attached file. Assignment into the data table is
about twice as fast as into the data frame, but I wonder if I am using
the optimal syntax for achieving speedy assignment. Any advice much
appreciated.
test_assign.r

--
View this message in context: http://r.789695.n4.nabble.com/Assigning-with-a-compound-condition-tp4690042.html
Sent from the datatable-help mailing list archive at Nabble.com.

From kpm.nachtmann at gmail.com Wed May 7 11:02:15 2014
From: kpm.nachtmann at gmail.com (nachti)
Date: Wed, 7 May 2014 02:02:15 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <1366401278742-4664770.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com>
Message-ID: <1399453335248-4690100.post@n4.nabble.com>

The change of the defaults in 1.9.3 breaks existing code, which should
not happen (see DT FAQ 1.8). It would be good if there were a way for
code to work with different versions of DT and R (e.g. for usage in
packages). See the example here:
https://gist.github.com/nachti/34b2dc46868b9268c5af

I know that 1.9.3 is a development version, but I can't use 1.9.2 due to
http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html
and I can't switch back to an older R version because of missing
permissions on the server.

I have to use different versions of R and DT in parallel. If I rewrite
my code so that it works for 1.9.3, it doesn't work with 1.8.10 any
more. (see also
http://stackoverflow.com/questions/23289646/update-subset-of-data-table-based-on-join-using-data-table-1-9-3-does-not-work-a)

by = key(something) is not the same as by = .EACHI, but even if I can
get a solution using the first, 1.8.10 gives a warning that I shouldn't
do that:

In addition: Warning message:
In `[.data.table` ...:
by is not necessary in this query; it equals all the join columns in
the same order. j is already evaluated by group of x that each row of i
matches to (by-without-by, see ?data.table). Setting by will be slower
because a subset of x is taken and then grouped again. Consider removing
by, or changing it.

nachti

--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4690100.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com Wed May 7 12:10:57 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 7 May 2014 12:10:57 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <1399453335248-4690100.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com> <1399453335248-4690100.post@n4.nabble.com>
Message-ID:

> The change of the defaults in 1.9.3 breaks existing code, which should
> not be (see. DT FAQ 1.8).

Thanks. Yes, that's what will be the case when it hits CRAN. There will
be an option to use the older feature, IIUC. Matt can clarify this point
further.

> I know that 1.9.3 is a development version, but I can't use 1.9.2 due to
> http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html

Can you show us an example that 1.9.2 doesn't but 1.9.3 does?

In your case, you should be using stable 1.9.2 version (at least until
counter measures are in place for by=.EACHI). And you should ask your
administrators to downgrade R, if you don't want that bug to bite you,
until this is fixed. But I'm repeating myself.

Arun

From kpm.nachtmann at gmail.com Wed May 7 13:30:06 2014
From: kpm.nachtmann at gmail.com (nachti)
Date: Wed, 7 May 2014 04:30:06 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <1366401278742-4664770.post@n4.nabble.com> <1399453335248-4690100.post@n4.nabble.com>
Message-ID: <1399462206528-4690105.post@n4.nabble.com>

Arunkumar Srinivasan wrote
> Can you show us an example that 1.9.2 doesn't but 1.9.3 does?

Copied from
http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html

#####
Just another example (maybe to be included in test.data.table), which
does not do what I expected (v. 1.9.2 - it's also fixed in 1.9.3)

> require(data.table)
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: powerpc64-unknown-linux-gnu (64-bit)
...
other attached packages:
[1] data.table_1.9.2

> example(data.table)
> DT
   x y  v v2  m
1: a 1 42 NA 42
2: a 3 42 NA 42
3: a 6 42 NA 42
4: b 1  4 84  5
5: b 3  5 84  5
6: b 6  6 84  5
7: c 1  7 NA  8
8: c 3  8 NA  8
9: c 6  9 NA  8
> setkey(DT)
> DT[J("a"), list(v, y)]
   x  v y
1: a 42 1
> DT[J("a"), list(v, y, i = "text")]
   x  v y    i
1: a 42 1 text

#####
With data.table 1.9.3 it's working fine:

> require(data.table)
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: powerpc64-unknown-linux-gnu (64-bit)
...
other attached packages:
[1] data.table_1.9.3

> example(data.table)
> setkey(DT)
> DT[J("a"), list(v, y)]
    v y
1: 42 1
2: 42 3
3: 42 6
> DT[J("a"), list(v, y, i = "text")]
    v y    i
1: 42 1 text
2: 42 3 text
3: 42 6 text

nachti
#####

Arunkumar Srinivasan wrote
> In your case, you should be using stable 1.9.2 version (at least until
> counter measures are in place for by=.EACHI). And you should ask your
> administrators to downgrade R, if you don't want that bug to bite you,
> until this is fixed. But I'm repeating myself.
>
> Arun

--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4690105.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com Wed May 7 14:27:30 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 7 May 2014 14:27:30 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <1399462206528-4690105.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com> <1399453335248-4690100.post@n4.nabble.com> <1399462206528-4690105.post@n4.nabble.com>
Message-ID:

Once again, thanks for the example. That wasn't a bug. It's how it was
intended to work with prior versions of data.table. But to make things
much more consistent (as per user requests and FRs filed), this change
is now being implemented.

Your point that there should be ways to make sure existing code doesn't
break down is totally valid and we'll do whatever we can to get there.
You've to realise this is a development version - we're working on it.
And these things will get fixed only in due time. Until then, there's no
other way but to get around these issues until we fix it, unfortunately
- unless you or someone else would like to help us.

Arun

From kpm.nachtmann at gmail.com Wed May 7 15:13:10 2014
From: kpm.nachtmann at gmail.com (nachti)
Date: Wed, 7 May 2014 06:13:10 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <1366401278742-4664770.post@n4.nabble.com> <1399453335248-4690100.post@n4.nabble.com> <1399462206528-4690105.post@n4.nabble.com>
Message-ID: <1399468390041-4690112.post@n4.nabble.com>

I have a workaround for it now:

### check data.table version (since 1.9.3 you have to use .EACHI)
odt <- packageVersion("data.table") < "1.9.3"
odt
if (odt) {
    # code for old (stable) datatable versions
} else {
    # code for datatable versions since 1.9.3
}

nachti

--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4690112.html
Sent from the datatable-help mailing list archive at Nabble.com.
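
The version switch above could be filled in along the following lines. This is a sketch under stated assumptions, not a tested recipe for every release: the table DT, the column names and the variable old_dt are made up for illustration, and the 1.9.3 cutoff is taken from the message above rather than checked against the data.table NEWS file.

```r
library(data.table)

# TRUE when the installed data.table still groups implicitly on a join
# (by-without-by); FALSE when by = .EACHI has to be written out.
old_dt <- packageVersion("data.table") < "1.9.3"

# Hypothetical keyed table for the illustration.
DT <- data.table(x = c("a", "a", "b"), v = 1:3, key = "x")

# Sum v once per row of i ("a" and "b") on either side of the change.
# Only the branch that is taken gets evaluated, so .EACHI is never looked
# up on an old version.
res <- if (old_dt) {
  DT[J(c("a", "b")), list(total = sum(v))]
} else {
  DT[J(c("a", "b")), list(total = sum(v)), by = .EACHI]
}
res
```

Keeping the check in one variable means the rest of a script or package only has to branch where a join relies on the grouping behaviour.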
From benweinstein2010 at gmail.com Thu May 8 16:39:40 2014
From: benweinstein2010 at gmail.com (Ben Weinstein)
Date: Thu, 8 May 2014 10:39:40 -0400
Subject: [datatable-help] fread crashes reading R when reading csv
Message-ID: 

Data table crashes

I am having a similar issue to this post:
http://r.789695.n4.nabble.com/fread-crash-td4683394.html

Please see the markdown script:
http://rpubs.com/bw4sz0511/16766

or the text below:

The file is about 550MB; I'm unsure how many rows it actually has (several million). When I try to run fread, RStudio just crashes with no error.

I can read in up to about 15 rows:

require(data.table)
## Loading required package: data.table

# env dist table
env <- fread("EnvData.csv", nrows = 15, verbose = TRUE)
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.543B
## File is opened and mapped ok
## Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
## Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
## Found 4 columns
## First row with 4 fields occurs on line 2 (either column names or first row of data)
## Some fields on line 2 are not type character (or are empty). Treating as a data row and using default column names.
## Count of eol after first data row: 15989212
## Subtracted 0 for last eol and any trailing empty lines, leaving 15989212 data rows
## nrow limited to nrows passed in (15)
## Type codes: 4113 (first 5 rows)
## Type codes: 4113 (after applying colClasses and integer64)
## Type codes: 4113 (after applying drop or select (if supplied)
## Allocating 4 column slots (4 - 0 NULL)
## 0.000s ( 0%) Memory map (rerun may be quicker)
## 0.000s ( 0%) sep and header detection
## 0.702s (100%) Count rows (wc -l)
## 0.000s ( 0%) Column type detection (first, middle and last 5 rows)
## 0.000s ( 0%) Allocation of 15x4 result (xMB) in RAM
## 0.000s ( 0%) Reading data
## 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
## 0.000s ( 0%) Coercing data already read in type bumps (if any)
## 0.000s ( 0%) Changing na.strings to NA
## 0.702s Total

head(env)
##    V1 V2 V3     V4
## 1:  1  2  1  249.3
## 2:  2  3  1  536.9
## 3:  3  4  1 1161.8
## 4:  4  5  1 1234.0
## 5:  5  6  1 1513.4
## 6:  6  7  1 1757.1

However, when I run fread with more than 20 rows, it crashes RStudio.

# not run
env <- fread("EnvData.csv", nrows = 25, verbose = TRUE)

The verbose output up to the crash reads:

Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.543B
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 4 columns
First row with 4 fields occurs on line 2 (either column names or first row of data)
Some fields on line 2 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 15989212
Subtracted 0 for last eol and any trailing empty lines, leaving 15989212 data rows
nrow limited to nrows passed in (25)
Type codes: 4113 (first 5 rows)
Type codes: 4113 (+middle 5 rows)

Looking at the file, nothing seems wrong:

env <- read.csv("EnvData.csv", nrows = 25)
env
##    V1 V2     V3
## 1   2  1  249.3
## 2   3  1  536.9
## 3   4  1 1161.8
## 4   5  1 1234.0
## 5   6  1 1513.4
## 6   7  1 1757.1
## 7   8  1 2176.7
## 8   9  1 2644.0
## 9  10  1 3033.3
## 10 11  1 3721.2
## 11 12  1 4432.8
## 12 13  1 4609.6
## 13 14  1 5378.8
## 14 15  1 5953.6
## 15 16  1 5913.9
## 16 17  1 6281.3
## 17 18  1 6669.7
## 18 19  1 6449.7
## 19 20  1 6218.4
## 20 21  1 6493.4
## 21 22  1 6056.6
## 22 23  1 5275.8
## 23 24  1 4605.2
## 24 25  1 3153.9
## 25 26  1 2532.1

Thanks for your help,

Ben Weinstein

--
Ben Weinstein
PhD Candidate
Ecology and Evolution
Stony Brook University
http://benweinstein.weebly.com/

From stanasa at latinumnetwork.com Thu May 8 20:50:03 2014
From: stanasa at latinumnetwork.com (stanasa)
Date: Thu, 8 May 2014 11:50:03 -0700 (PDT)
Subject: [datatable-help] Fread Skip Question
Message-ID: <1399575003729-4690205.post@n4.nabble.com>

First of all, thank you very much for creating, maintaining and updating this package! Discovering "fread" and the data.table package has made my life a lot easier.

I'm using fread to read large (2-4Gb) .CSV files for subsequent RMySQL bulkloads, and (since the computer I use is a bit memory limited) decided to read them in chunks, using skip and nrows.

I'm noticing that as I go through the file (with a for loop) each individual read takes on average a bit longer (I'm guessing fread parses through the file line by line to reach the skip-to location).

Is there any way to make fread "remember" the end of the last read location for the next iteration? It would speed up my reads from minutes to seconds, I would guess.
Also, should I worry that reusing the same data.table in a for loop causes memory issues?

Many thanks,

Serban Tanasa, Ph.D.
Senior Analyst
Latinum Network
(o) (240) 482-8259
(f) (240) 482-8265

--
View this message in context: http://r.789695.n4.nabble.com/Fread-Skip-Question-tp4690205.html
Sent from the datatable-help mailing list archive at Nabble.com.

From gsee000 at gmail.com Fri May 9 00:57:02 2014
From: gsee000 at gmail.com (G See)
Date: Thu, 8 May 2014 17:57:02 -0500
Subject: [datatable-help] merge zero row data.table
Message-ID: 

Hi,

Is the following error expected?

> library(data.table)
data.table 1.9.3  For help type: help("data.table")
> a <- data.table(BOD, key="Time")
> b <- data.table(BOD, key="Time")[Time < 0] # zero row data.table
> merge(a,b, all=TRUE) # works fine
   Time demand.x demand.y
1:    1      8.3       NA
2:    2     10.3       NA
3:    3     19.0       NA
4:    4     16.0       NA
5:    5     15.6       NA
6:    7     19.8       NA
> merge(b,a, all=TRUE) # error
Error in setcolorder(dt, c(setdiff(names(dt), end), end)) :
  neworder is length 2 but x has 3 columns.

Thanks,
Garrett

p.s. using svn Rev. 1263

From aragorn168b at gmail.com Fri May 9 01:00:18 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 9 May 2014 01:00:18 +0200
Subject: [datatable-help] merge zero row data.table
In-Reply-To: 
References: 
Message-ID: 

Garrett,

Seems like it works fine in 1.9.2. I'd say it's a bug introduced due to changes in 1.9.3. Could you please file it as one? Thanks.

Arun

From gsee000 at gmail.com Fri May 9 01:10:07 2014
From: gsee000 at gmail.com (G See)
Date: Thu, 8 May 2014 18:10:07 -0500
Subject: [datatable-help] merge zero row data.table
In-Reply-To: 
References: 
Message-ID: 

Thanks Arun. Bug filed:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5672&group_id=240&atid=975

From aragorn168b at gmail.com Fri May 9 01:10:42 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 9 May 2014 01:10:42 +0200
Subject: [datatable-help] merge zero row data.table
In-Reply-To: 
References: 
Message-ID: 

Great! Thanks a bunch.

Arun

From fch808 at gmail.com Fri May 9 23:34:56 2014
From: fch808 at gmail.com (FCH808)
Date: Fri, 9 May 2014 14:34:56 -0700 (PDT)
Subject: [datatable-help] Losing header names when using skip argument in fread in R
Message-ID: <1399671296512-4690268.post@n4.nabble.com>

R package: data.table - version. 1.9.2

I have a ";" delimited text file that I need to subset based on the dates that appear in the first column. I used fread() to read the first column only, and return the indices with the dates needed so I could use the min() of the indices to skip to, and the length() for number of rows to read. (In this case I only need 2 sequential days - 2880 rows/readings)

The problem is that header = TRUE only seems to capture the row of data immediately preceding the rows read and uses it as the header info, instead of the actual headers in the first line of the text file.

I wrapped it in a function and timed it, and it seems to be a reasonably quick way to have a minimal impact on RAM usage for the filtering needed.
This file is only about 2 million rows so it wouldn't be a problem just reading the whole thing in and subsetting but I would like a solution that works as my text files get larger.

findRows<-fread("power.txt", header = TRUE, select = 1)
all<-(which(findRows$Date %in% c("14/2/2008", "15/2/2008")) )
skipLines<- min(all)
keepRows<- length(all)
feb<- fread("power.txt", skip = skipLines , nrows = keepRows, header = TRUE)
rm(findRows)

head(feb)
   14/2/2008 00:00:00 0.252 0.000 244.230 1.000 0.000 0.000 0.000
1: 14/2/2008 00:01:00 0.254     0  245.24     1     0     0     0
2: 14/2/2008 00:01:00 0.254     0  245.24     1     0     0     0
3: 14/2/2008 00:02:00 0.254     0  245.31     1     0     0     0
4: 14/2/2008 00:03:00 0.252     0  244.44     1     0     0     0
5: 14/2/2008 00:04:00 0.252     0  244.27     1     0     0     0
6: 14/2/2008 00:05:00 0.252     0  244.62     1     0     0     0

> system.time(loadF())
   user  system elapsed
   0.55    0.01    0.56

I was able to circumvent this by setting header = FALSE and just reading the first line into another tiny dataset and extracting all the column names (since I only ever read the first column the first time around) and setting those names to the data.table but this doesn't seem like the best solution if there is a way to do within the fread() call.
findRows<-fread("power.txt", header = TRUE, select = 1)
all<-(which(findRows$Date %in% c("14/2/2008", "15/2/2008")) )
skipLines<- min(all)
keepRows<- length(all)
feb<- fread("power.txt", skip = (skipLines) , nrows = keepRows, header = FALSE)
rm(findRows)
febNames<- names(fread("power.txt", nrow = 1))
setnames(feb, febNames)

head(feb)
        Date     Time Global_active_power Global_reactive_power Voltage
1: 14/2/2008 00:00:00               0.252                     0  244.23
2: 14/2/2008 00:01:00               0.254                     0  245.24
3: 14/2/2008 00:02:00               0.254                     0  245.31
4: 14/2/2008 00:03:00               0.252                     0  244.44
5: 14/2/2008 00:04:00               0.252                     0  244.27
6: 14/2/2008 00:05:00               0.252                     0  244.62
   Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1:                1              0              0              0
2:                1              0              0              0
3:                1              0              0              0
4:                1              0              0              0
5:                1              0              0              0
6:                1              0              0              0

> system.time(loadF())
   user  system elapsed
   0.61    0.05    0.66

Is there a way to accomplish this within the fread() call that skips to row 610,957 and initially creates the feb data.table instead of having to create another data.table of length 1 just to read the headers?

--
View this message in context: http://r.789695.n4.nabble.com/Losing-header-names-when-using-skip-argument-in-fread-in-R-tp4690268.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com Sat May 10 08:45:03 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 10 May 2014 14:45:03 +0800
Subject: [datatable-help] setkey on .SD
Message-ID: <536DCAEF.9050007@gmail.com>

All,

?data.table says that `.SD` is read-only. However, I could use `setkey` on it. Is this officially supported, or is it dangerous to use on `.SD`, e.g. because unexpected behavior could occur in some corner cases?

Thanks,

M

From kevinushey at gmail.com Mon May 12 00:54:19 2014
From: kevinushey at gmail.com (Kevin Ushey)
Date: Sun, 11 May 2014 15:54:19 -0700
Subject: [datatable-help] Minor request -- make 'copy' an S3 generic?
Message-ID: 

And move the current copy logic to copy.data.table.
This is mainly because I want to implement my own 'copy.environment' function, which performs a deep copy of an environment -- data.table's copy does not do this.

Thanks,
Kevin

From mdowle at mdowle.plus.com Tue May 13 22:16:10 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Tue, 13 May 2014 21:16:10 +0100
Subject: [datatable-help] R/Finance in Chicago on Friday
Message-ID: <53727D8A.40208@mdowle.plus.com>

Looking forward to it. Spaces available. Tutorial on data.table at 8am.

http://www.rinfinance.com/agenda/

Matt

From aragorn168b at gmail.com Fri May 16 19:25:36 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 16 May 2014 19:25:36 +0200
Subject: [datatable-help] setkey on .SD
In-Reply-To: <536DCAEF.9050007@gmail.com>
References: <536DCAEF.9050007@gmail.com>
Message-ID: 

After seeing this post:
http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917
I wrote to Matt about this as well. I've marked this issue to resolve it, as I've not heard back on this issue from Matt yet. Thanks for reporting.

Arun.

On Sat, May 10, 2014 at 8:45 AM, Michael Smith wrote:
> All,
>
> ?data.table says that `.SD` is read-only. However, I could use `setkey`
> on it. Is this officially supported, or is it dangerous to use on `.SD`,
> e.g. since in some corner cases some unexpected behavior could occur.
>
> Thanks,
>
> M
From statquant at outlook.com Tue May 20 14:50:37 2014
From: statquant at outlook.com (statquant3)
Date: Tue, 20 May 2014 05:50:37 -0700 (PDT)
Subject: [datatable-help] learn how to use melt and dcast
Message-ID: <1400590237635-4690882.post@n4.nabble.com>

Guys,
Is there some tutorial about how to use melt and dcast, each time I want to use it I forget how to...
I think the ?dcast is not enough (likely I am too stupid)
Cheers

--
View this message in context: http://r.789695.n4.nabble.com/learn-how-to-use-melt-and-dcast-tp4690882.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com Tue May 20 16:36:15 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Tue, 20 May 2014 22:36:15 +0800
Subject: [datatable-help] learn how to use melt and dcast
In-Reply-To: <1400590237635-4690882.post@n4.nabble.com>
References: <1400590237635-4690882.post@n4.nabble.com>
Message-ID: <537B685F.1040603@gmail.com>

Hadley's JSS article might be a good place to start. It's still for the reshape package, but the reshape2 package is not much different. And using it with data.table should be not much different than using it with a data.frame.

On 05/20/2014 08:50 PM, statquant3 wrote:
> Guys,
> Is there some tutorial about how to use melt and dcast, each time I want to
> use it I forget how to...
> I think the ?dcast is not enough (likely I am too stupid)
> Cheers
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/learn-how-to-use-melt-and-dcast-tp4690882.html
> Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com Tue May 20 21:27:52 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 21:27:52 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
Message-ID: 

Hello everyone,

With the latest commit #1266, the extra functionality offered via rbind (use.names and fill) is also now available to rbindlist. In addition, the implementation is completely moved to C, and is therefore tremendously fast, especially for cases where one has to bind with use.names=TRUE and/or with fill=TRUE. I'll try to put out a benchmark comparing speed differences with the older implementation ASAP.

Note that this change comes with a very low cost to the default speed of rbindlist - with use.names=FALSE and fill=FALSE. As an example, binding 10,000 data.tables with 20 columns each resulted in the new version running in 0.107 seconds, whereas the older version ran in 0.095 seconds.

In addition the documentation for ?rbindlist also has been improved (#5158 from Alexander). Here's the change log from NEWS:

  o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
       -> use.names by default is FALSE for backwards compatibility (doesn't bind by names by default)
       -> rbind(...) now just calls rbindlist() internally, except that 'use.names' is TRUE by default,
          for compatibility with base (and backwards compatibility).
       -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
       -> At least one item of the input list has to have non-null column names.
       -> Duplicate columns are bound in the order of occurrence, like base.
       -> Attributes that might exist in individual items would be lost in the bound result.
       -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
       -> And incredibly fast ;).
       -> Documentation updated in much detail. Closes DR #5158.
     Eddi's (excellent) work on finding factor levels, type coercion of columns etc. are all retained.

Please try it and write back if things aren't working as they were before. The tests that had to be fixed are extremely rare cases. I suspect there should be minimal issue, if at all, in this version. However, I do find the changes here bring consistency to the function.

One (very rare) feature that is not available due to this implementation is the ability to recycle.

dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
lst1 <- list(x=4, y=5, z=as.list(1:3))

rbind(dt1, lst1)
#    x y       z
# 1: 1 4     1,2
# 2: 2 5   1,2,3
# 3: 3 6 1,2,3,4
# 4: 4 5       1
# 5: 4 5       2
# 6: 4 5       3

The 4,5 are recycled very nicely here. This is not possible at the moment. This is because the earlier rbind implementation used as.data.table to convert to data.table, however it takes a copy (very inefficient on huge / many tables). I'd love to add this feature in C as well, as it would help incredibly for use within [.data.table (now that we can fill columns and bind by names faster). Will add a FR.

In summary, I think there should be minimal issues, if any, and it should be much faster (for rbind cases). Please write back what you think, if you happen to try it out.

Arun

From ggrothendieck at gmail.com Tue May 20 22:04:01 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 16:04:01 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To: 
References: 
Message-ID: 

The requirement to set use.names to TRUE if fill is TRUE seems ugly. I suggest that fill be the default for use.names.
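
For readers following the use.names / fill discussion, a small sketch of the two arguments as described in the announcement above. The three tables are made up for illustration and are not from the thread; both arguments are passed explicitly so the example does not depend on how a given version handles fill=TRUE on its own.

```r
library(data.table)

# Hypothetical inputs, for illustration only.
a <- data.table(x = 1:2, y = c("p", "q"))
b <- data.table(y = "r", x = 3L)    # same columns as a, different order
d <- data.table(x = 4L, z = TRUE)   # y is missing, z is new

# Bind by column name rather than by position.
by_name <- rbindlist(list(a, b), use.names = TRUE)

# Bind by name and pad columns absent from an item with NA.
filled <- rbindlist(list(a, b, d), use.names = TRUE, fill = TRUE)

print(by_name)
print(filled)
```

Without fill=TRUE the third table would not fit, since its column set differs from the others; that is the error case Arun mentions wanting to keep available.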

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Tue May 20 22:07:00 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 22:07:00 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To: 
References: 
Message-ID: 

Hi Gabor,

Thanks for the quick response. Just to be clear, you don't have to set use.names=TRUE when fill=TRUE.
If you just set fill=TRUE and use.names happens to be FALSE, then it'll automatically set it to TRUE (with a message/warning), which you can safely ignore. Do you find this still ugly? You'll get the warning if you use rbindlist with just fill=TRUE (because use.names=FALSE by default).

Arun

From ggrothendieck at gmail.com Tue May 20 22:11:16 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 16:11:16 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To: 
References: 
Message-ID: 

Then why not make the default of use.names be fill. Then you don't get the warning and you can tell just from the argument list what the dependencies are.

On Tue, May 20, 2014 at 4:07 PM, Arunkumar Srinivasan wrote:
> Hi Gabor,
>
> Thanks for the quick response. Just to be clear, you don't have to set
> use.names=TRUE when fill=TRUE. If you just set fill=TRUE and use.names
> happens to be FALSE, then it'll automatically set it to TRUE (with a
> message/warning), which you can safely ignore. Do you find this still ugly?
> You'll get the warning if you use rbindlist with just fill=TRUE (because
> use.names=FALSE by default).
>
>
> Arun
>
> From: Gabor Grothendieck ggrothendieck at gmail.com
> Reply: Gabor Grothendieck ggrothendieck at gmail.com
> Date: May 20, 2014 at 10:04:21 PM
> To: Arunkumar Srinivasan aragorn168b at gmail.com
> Cc: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill
> arguments
>
> The requirement to set use.names to TRUE if fill is TRUE seems ugly.
> I suggest that fill be the default for use.names.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Tue May 20 22:17:45 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 22:17:45 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

Because with the current implementation, the case use.names=TRUE and
fill=FALSE (no missing columns, just a different column order) could be
faster than fill=TRUE (on large tables), as fill=TRUE populates the result
with NAs first.

Sometimes it might also be essential to throw an error (to catch bugs?)
when you think the columns are all just interchanged but, in reality,
there are either new columns or duplicated columns.

Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 20, 2014 at 10:11:36 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org
datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill
arguments

Then why not make the default of use.names be fill? Then you don't get
the warning, and you can tell just from the argument list what the
dependencies are.

On Tue, May 20, 2014 at 4:07 PM, Arunkumar Srinivasan wrote:
> Hi Gabor,
>
> Thanks for the quick response. Just to be clear, you don't have to set
> use.names=TRUE when fill=TRUE. If you just set fill=TRUE and use.names
> happens to be FALSE, then it'll automatically set it to TRUE (with a
> message/warning), which you can safely ignore. Do you find this still ugly?
> You'll get the warning if you use rbindlist with just fill=TRUE (because
> use.names=FALSE by default).
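
A minimal sketch of the distinction drawn above, assuming a data.table version in which rbindlist accepts use.names and fill as described in this thread. The table names a, b and c3 are made up for illustration, and the exact error text is not shown because it may differ between versions.

```r
library(data.table)

# Two tables with the same columns in a different order.
a <- data.table(x = 1:2, y = c("p", "q"))
b <- data.table(y = c("r", "s"), x = 3:4)

# use.names=TRUE matches columns by name; nothing has to be filled.
by_name <- rbindlist(list(a, b), use.names = TRUE, fill = FALSE)

# A table that lacks 'y' and carries an extra column 'z'.
c3 <- data.table(x = 5L, z = 9.5)

# fill=TRUE pads the columns missing from either table with NA.
filled <- rbindlist(list(a, c3), use.names = TRUE, fill = TRUE)

# With fill=FALSE the mismatch is expected to surface as an error,
# which is the bug-catching case mentioned above (assumption: the
# installed version raises an error rather than a warning here).
strict <- tryCatch(
  rbindlist(list(a, c3), use.names = TRUE, fill = FALSE),
  error = function(e) conditionMessage(e)
)
```

Keeping fill=FALSE as the stricter setting means a column that is unexpectedly new or missing stops the bind instead of silently producing NAs.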

From aragorn168b at gmail.com  Tue May 20 22:28:39 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 22:28:39 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

I've filed FR #5690 to remind myself of the recycling feature; that'd be
awesome to have.

One feature I forgot to point out in the previous post is that, even when
there are duplicate names, rbind/rbindlist binds them consistently with
'base' when use.names=TRUE. It also fills the duplicate columns properly
(in the order of occurrence) when fill=TRUE.

Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
columns V1 to V500 in random order (all integers for simplicity). We only
need use.names=TRUE here (as all columns are available in all
data.tables).

I think this data is big enough to illustrate the point. Also, I was
curious to see a comparison against dplyr's rbind_all (commit 1504 devel
version), so I've added it to the benchmarks as well.

Here's the data generation. Note: It takes a while for this step to finish.

    require(data.table)   ## 1.9.3 commit 1267
    require(dplyr)        ## commit 1504 devel
    set.seed(1L)
    ## k integer columns of 10 rows each
    foo <- function(k) {
        ans = setDT(lapply(1:k, function(x) sample(10)))
    }
    ## rename the columns to a random selection of V1..Vk
    bar <- function(ans, k, n) {
        bla = sample(paste0("V", 1:k), n)
        setnames(ans, bla)
    }
    n = 10000L
    ll = vector("list", n)
    for (i in 1:n) {
        bla = bar(foo(500L), 500L, 500L)
        ## data.table internal routine, used here to set ll[[i]]
        .Call("Csetlistelt", ll, i, bla)
    }

And here are the timings:

    ## data.table v1.9.3 commit 1267's rbindlist
    ## Timings of three consecutive runs:
    system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
       user  system elapsed
     10.909   0.449  11.843

       user  system elapsed
      5.219   0.386   5.640

       user  system elapsed
      5.355   0.429   5.898

    ## dplyr's rbind_all
    ## Timings of three consecutive runs:
    system.time(ans2 <- rbind_all(ll))
       user  system elapsed
     62.769   0.247  63.941

       user  system elapsed
     62.010   0.335  65.876

       user  system elapsed
     55.345   0.359  60.193

    > identical(ans1, setDT(ans2)) # [1] TRUE

    ## data.table v1.9.2's rbind version:
    ## ran only once as it took a bit longer.
    system.time(ans1 <- do.call("rbind", ll))
       user  system elapsed
    125.356   2.247 139.000

    > identical(ans1, setDT(ans2)) # [1] TRUE

In summary, the newer implementation is about 11-23x faster than
data.table's older implementation and about 5.5-10x faster than dplyr on
this (relatively huge) data.

Arun

From: Arunkumar Srinivasan aragorn168b at gmail.com
Reply: Arunkumar Srinivasan aragorn168b at gmail.com
Date: May 20, 2014 at 9:27:56 PM
To: datatable-help at lists.r-forge.r-project.org
datatable-help at lists.r-forge.r-project.org
Subject: FR #5249 - rbindlist gains use.names and fill arguments

Hello everyone,

With the latest commit #1266, the extra functionality offered via rbind
(use.names and fill) is now also available to rbindlist. In addition, the
implementation has been moved completely to C and is therefore
tremendously fast, especially when binding with use.names=TRUE and/or
fill=TRUE.
I'll try to put out a benchmark comparing speed differences with the
older implementation ASAP.

Note that this change comes at a very low cost to the default speed of
rbindlist (use.names=FALSE and fill=FALSE). As an example, binding 10,000
data.tables with 20 columns each took 0.107 seconds with the new version,
whereas the older version took 0.095 seconds.

In addition, the documentation for ?rbindlist has also been improved
(#5158 from Alexander). Here's the change log from NEWS:

  o  'rbindlist' gains 'use.names' and 'fill' arguments and is now
     implemented entirely in C. Closes #5249
       -> use.names by default is FALSE for backwards compatibility
          (doesn't bind by names by default)
       -> rbind(...) now just calls rbindlist() internally, except that
          'use.names' is TRUE by default, for compatibility with base
          (and backwards compatibility).
       -> fill by default is FALSE. If fill is TRUE, use.names has to be
          TRUE.
       -> At least one item of the input list has to have non-null column
          names.
       -> Duplicate columns are bound in the order of occurrence, like
          base.
       -> Attributes that might exist in individual items would be lost in
          the bound result.
       -> Columns are coerced to the highest SEXPTYPE, if they are
          different, if/when possible.
       -> And incredibly fast ;).
       -> Documentation updated in much detail. Closes DR #5158.
     Eddi's (excellent) work on finding factor levels, type coercion of
     columns etc. are all retained.

Please try it and write back if things aren't working as they were before.
The tests that had to be fixed are extremely rare cases. I suspect there
should be minimal issues, if any, in this version. However, I do find that
the changes here bring consistency to the function.

One (very rare) feature that is not available due to this implementation
is the ability to recycle.

    dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
    lst1 <- list(x=4, y=5, z=as.list(1:3))

    rbind(dt1, lst1)
    #    x y       z
    # 1: 1 4     1,2
    # 2: 2 5   1,2,3
    # 3: 3 6 1,2,3,4
    # 4: 4 5       1
    # 5: 4 5       2
    # 6: 4 5       3

The 4,5 are recycled very nicely here. This is not possible at the moment,
because the earlier rbind implementation used as.data.table to convert to
data.table, which takes a copy (very inefficient on huge / many tables).
I'd love to add this feature in C as well, as it would help incredibly for
use within [.data.table (now that we can fill columns and bind by names
faster). Will add a FR.

In summary, I think there should be minimal issues, if any, and it should
be much faster (for rbind cases). Please write back with what you think if
you happen to try it out.

Arun

From ggrothendieck at gmail.com  Tue May 20 22:49:33 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 16:49:33 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

If I understand this right then the table below shows the valid
logical combinations in order of speed (slowest first). Is that
right? If so then if fill = FALSE and use.names = fill then we get
the fastest case by default.

Furthermore, if you were concerned that we might be T/T when F/T would
be sufficient, I don't think that is likely, since getting F/T is done
by setting use.names = TRUE.

fill/use.names
T/T (slowest)
F/T
F/F (fastest)

On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan wrote:
> I've filed FR #5690 to remind myself of the recycling feature; that'd be
> awesome to have.
>
> One feature I forgot to point out in the previous post is that, even when
> there are duplicate names, rbind/rbindlist binds them consistently with
> 'base' when use.names=TRUE. It also fills the duplicate columns properly
> (in the order of occurrence) when fill=TRUE.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Tue May 20 23:01:52 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 23:01:52 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

I think I understand now what you're trying to say. Going back to an
earlier post, you wrote:

> Then why not make the default of `use.names` be `fill`. Then you don't
> get the warning and you can tell just from the argument list what the
> dependencies are.

You mean to basically do the following?

    rbindlist <- function(l, use.names=fill, fill=FALSE)
    .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)

Is this what you mean? If so, the defaults from the previous versions
will change. Those who use rbind directly without setting use.names will
get different results (assuming I understand you correctly this time).

Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 20, 2014 at 10:49:54 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org
datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill
arguments

If I understand this right then the table below shows the valid
logical combinations in order of speed (slowest first). Is that
right? If so then if fill = FALSE and use.names = fill then we get
the fastest case by default.

Furthermore, if you were concerned that we might be T/T when F/T would
be sufficient, I don't think that is likely, since getting F/T is done
by setting use.names = TRUE.

fill/use.names
T/T (slowest)
F/T
F/F (fastest)

On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan wrote:
> I've filed FR #5690 to remind myself of the recycling feature; that'd be
> awesome to have.
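
The proposed `use.names=fill` default relies on R evaluating default arguments lazily, inside the function, so one argument's default may refer to another argument. A minimal sketch in base R follows; bind_flags is a hypothetical stand-in that only reports the flags it would use and is not part of data.table.

```r
# Hypothetical stand-in for the proposed signature; it binds nothing
# and simply returns the two flags as a named logical vector.
bind_flags <- function(l, use.names = fill, fill = FALSE) {
  # The default for 'use.names' is evaluated only when first needed,
  # so it picks up whatever value 'fill' has in this call.
  c(use.names = use.names, fill = fill)
}

none_set  <- bind_flags(list())                    # both flags FALSE
fill_only <- bind_flags(list(), fill = TRUE)       # use.names follows fill
names_set <- bind_flags(list(), use.names = TRUE)  # by name, no filling
```

With this shape, asking for fill=TRUE alone would never need a warning, while an explicit use.names still overrides the default.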

From ggrothendieck at gmail.com  Tue May 20 23:13:55 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 17:13:55 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

Yes. That is what I intended.

rbindlist on CRAN currently has no fill or use.names arguments. What
combination of the new fill and use.names does the current CRAN rbindlist
correspond to?

On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan wrote:
> I think I understand now what you're trying to say. Going back to an
> earlier post, you wrote:
>
> Then why not make the default of `use.names` be `fill`. Then you don't get
> the warning and you can tell just from the argument list what the
> dependencies are.
>
> You mean to basically do the following?
>
> rbindlist <- function(l, use.names=fill, fill=FALSE)
> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>
> Is this what you mean? If so, the defaults from the previous versions
> will change. Those who use rbind directly without setting use.names will
> get different results (assuming I understand you correctly this time).

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Tue May 20 23:16:27 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 23:16:27 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

In the current CRAN:

rbindlist corresponds to use.names=FALSE and fill=FALSE
rbind corresponds to use.names=TRUE and fill=FALSE

Just to be clear, again, are you suggesting that I change *just* rbindlist's
defaults to use.names=fill and fill=FALSE, or for both?

Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 20, 2014 at 11:14:15 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

Yes. That is what I intended.

rbindlist on CRAN currently has no fill or use.names arguments. What
combo of the new fill and use.names does the current CRAN rbindlist
correspond to?

On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan wrote:
> I think I understand now what you're trying to say. Going back to an earlier
> post, you wrote:
>
> Then why not make the default of `use.names` be `fill`. Then you don't get
> the warning and you can tell just from the argument list what the
> dependencies are.
>
> You mean to basically do?
>
> rbindlist <- function(l, use.names=fill, fill=FALSE)
> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>
> Is this what you mean? If so, the defaults from the previous versions will
> be changed. The ones who use rbind directly without setting use.names will
> have different results.. (assuming I understand you correctly this time).
>
>
> Arun

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From ggrothendieck at gmail.com Wed May 21 01:02:43 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 19:02:43 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

In that case I suggest just changing rbindlist to have use.names = fill
and leave rbind as is.
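
The suggestion above relies on R evaluating default arguments lazily, so the default of one argument can refer to another argument. A minimal sketch of that mechanism, assuming a hypothetical function `bind_sketch` that only reports which options it would use (it is not data.table's rbindlist and does no binding):

```r
# Hypothetical stand-in, for illustration only; NOT data.table's rbindlist.
bind_sketch <- function(l, use.names = fill, fill = FALSE) {
  # Defaults are evaluated lazily inside the function, so `use.names`
  # takes whatever value `fill` has at the moment it is first used.
  if (fill && !use.names) stop("fill=TRUE requires use.names=TRUE")
  list(use.names = use.names, fill = fill)
}

str(bind_sketch(list()))                    # neither set: both FALSE
str(bind_sketch(list(), fill = TRUE))       # use.names follows fill: both TRUE
str(bind_sketch(list(), use.names = TRUE))  # match by name without filling
```

With a signature like this, callers who pass nothing keep the positional behaviour, and callers who pass only fill=TRUE get name matching without having to request it separately.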

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From npgraham1 at gmail.com Wed May 21 02:20:34 2014
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 20 May 2014 20:20:34 -0400
Subject: [datatable-help] rbindlist and unique
Message-ID:

First, I use rbindlist pretty often, and I've been quite happy with it. The
new use.names and fill features definitely scratch an itch for me; I wound
up using rbind_all from dplyr (which worked well, I'm not complaining), but
I'm looking forward to having a data.table implementation. The speed
increase is also welcome. So thank you for the new features! I don't
personally have a preference with respect to the use.names and fill
defaults, so whatever you guys decide will be fine with me.

I do have a question regarding unique, which I use very, very frequently,
and often after rbindlist. I have a fairly large data set (tens of millions
of raw observations), many of which are duplicates. The observations come
from a variety of sources, but the formats and variable names are (nearly)
identical.

The problem is that many "duplicates" aren't perfect duplicates, and some
rows have more information than others. A simple example might look like
this:

> foo
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  1  3 TRUE
6:  1  4   NA
7:  2  3 TRUE
8:  2  4 TRUE
9:  3  1   NA
> unique(foo, by = c("V1", "V2"))
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  3  1   NA

Sometimes V3 is present and sometimes it isn't. V1 and V2 (in my story)
uniquely identify an observation, but if there's a row where I also have V3,
I'd prefer to have that row rather than a row where it's missing. You can
see that a naive use of unique here gets me the less-preferable 2,3 row.
If I only had three columns, this would be easy to solve (sort/setkey first
would do it). However, I have more than a dozen additional columns, and
when I drop duplicates I want to retain the row with the greatest number of
non-missing values. Additionally, some columns are more important than
others. If (to refer again to the example above), there are no rows that
have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so I
can't just condition on !is.na(V3).

Does anybody have any insight or techniques for this sort of thing? I'm
currently sorting on all columns prior to unique, but I'm quite sure that
this loses some information.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/

From ggrothendieck at gmail.com Wed May 21 02:34:10 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 20:34:10 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To:
References:
Message-ID:

On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham wrote:
> Does anybody have any insight or techniques for this sort of thing? I'm
> currently sorting on all columns prior to unique, but I'm quite sure that
> this loses some information.

Append an importance column which ranks the importance of that row
(lower better) and make importance the low order component of the key.

DT[, importance := 0+is.na(V3)]
setkey(DT, V1, V2, importance)
unique(DT, by = c("V1", "V2"))

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From npgraham1 at gmail.com Wed May 21 02:45:16 2014
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 20 May 2014 20:45:16 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To:
References:
Message-ID:

Thanks!
That's a good idea, and a lot simpler than what I was concocting in
my head. I'll give that a try. I think--just for posterity--you mean

DT[, importance := 0 - is.na(V3)]

rather than 0 + is.na(V3), so that rows with V3 are lower than rows without.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/

From ggrothendieck at gmail.com Wed May 21 02:50:54 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 20:50:54 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To:
References:
Message-ID:

On Tue, May 20, 2014 at 8:45 PM, Nathaniel Graham wrote:
> Thanks! That's a good idea, and a lot simpler than what I was concocting in
> my head. I'll give that a try. I think--just for posterity--you mean
>
> DT[, importance := 0 - is.na(V3)]
>
> rather than 0 + is.na(V3), so that rows with V3 are lower than rows without.

0 + is.na(V3) was intended. We want the good rows to have a lower
importance than the bad rows, so 0+is.na(V3) gives a non-NA V3 an
importance of 0 and a V3 which is NA an importance of 1.
When we sort them using setkey, the non-NA row (importance 0) comes first,
so it is the one picked by unique.

> DT[, importance := 0+is.na(V3)]
> setkey(DT, V1, V2, importance)
> unique(DT, by = c("V1", "V2"))
   V1 V2   V3 importance
1:  1  3 TRUE          0
2:  1  4 TRUE          0
3:  2  3 TRUE          0
4:  2  4 TRUE          0
5:  3  1   NA          1

From npgraham1 at gmail.com Wed May 21 02:56:56 2014
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 20 May 2014 20:56:56 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To:
References:
Message-ID:

My mistake, you're correct. I reversed it in my head.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/
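
The importance-column idea discussed in this thread extends to several columns of differing priority, which is the case raised in the original question. A minimal sketch, assuming a hypothetical extra column V4 and made-up weights (a missing V3 costs 2, a missing V4 costs 1); the V1, V2 and V3 values follow the foo example from the thread:

```r
library(data.table)

# foo from the thread, plus a hypothetical second value column V4.
DT <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA),
  V4 = c(NA, "a", "b", NA, "c", "d", NA, "e", NA)
)

# Lower importance is better. The weights are assumptions for illustration:
# a missing V3 is penalised twice as heavily as a missing V4.
DT[, importance := 2 * is.na(V3) + 1 * is.na(V4)]

# Sort so that the most complete row comes first within each (V1, V2) group,
# then keep the first row of each group.
setkey(DT, V1, V2, importance)
best <- unique(DT, by = c("V1", "V2"))
print(best)
```

Every (V1, V2) pair still yields exactly one row, including pairs such as 3,1 where all value columns are missing, because the ranking only reorders rows within a group and never filters any out.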

From aragorn168b at gmail.com Wed May 21 09:23:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 21 May 2014 09:23:12 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
In-Reply-To:
References:
Message-ID:

Great. That makes total sense to me. No defaults are affected as well. Thanks again.

Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 21, 2014 at 1:03:03 AM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

In that case I suggest just changing rbindlist to have use.names = fill
and leave rbind as is.

On Tue, May 20, 2014 at 5:16 PM, Arunkumar Srinivasan wrote:
> In the current CRAN:
>
> rbindlist corresponds to use.names=FALSE and fill = FALSE
> rbind corresponds to use.names=TRUE and fill = FALSE
>
> Just to be clear, again, are you suggesting that I change *just* rbindlist's
> defaults to use.names=fill and fill=FALSE or for both?
>
> Arun
>
> From: Gabor Grothendieck ggrothendieck at gmail.com
> Reply: Gabor Grothendieck ggrothendieck at gmail.com
> Date: May 20, 2014 at 11:14:15 PM
> To: Arunkumar Srinivasan aragorn168b at gmail.com
> Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
>
> Yes. That is what I intended.
>
> rbindlist on CRAN currently has no fill or use.names arguments. What
> combo of the new fill and use.names does the current CRAN rbindlist
> correspond to?
>
> On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan wrote:
>> I think I understand now what you're trying to say. Going back to an earlier
>> post, you wrote:
>>
>> Then why not make the default of `use.names` be `fill`.
Then you don't get
>> the warning and you can tell just from the argument list what the
>> dependencies are.
>>
>> You mean to basically do?
>>
>> rbindlist <- function(l, use.names=fill, fill=FALSE)
>> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>>
>> Is this what you mean? If so, the defaults from the previous versions will
>> be changed. The ones who use rbind directly without setting use.names will
>> have different results (assuming I understand you correctly this time).
>>
>> Arun
>>
>> From: Gabor Grothendieck ggrothendieck at gmail.com
>> Reply: Gabor Grothendieck ggrothendieck at gmail.com
>> Date: May 20, 2014 at 10:49:54 PM
>> To: Arunkumar Srinivasan aragorn168b at gmail.com
>> Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
>>
>> If I understand this right then the table below shows the valid
>> logical combinations in order of speed (slowest first). Is that
>> right? If so then if fill = FALSE and use.names = fill then we get
>> the fastest case by default.
>>
>> Furthermore if you were concerned that we might be T/T when F/T would
>> be sufficient I don't think that is likely since getting F/T is done
>> by setting use.names = TRUE.
>>
>> fill/use.names
>> T/T (slowest)
>> F/T
>> F/F (fastest)
>>
>> On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan wrote:
>>> I've filed FR #5690 to remind myself of the recycling feature; that'd be
>>> awesome to have.
>>>
>>> One feature I forgot to point out in the previous post is that, even when
>>> there are duplicate names, rbind/rbindlist binds them consistent with 'base'
>>> when use.names=TRUE. And it fills the duplicate columns properly (in the
>>> order of occurrence) also when fill=TRUE.
>>>
>>> Okay, on to benchmarks.
I took a set of 10,000 data.tables, each with
>>> columns ranging from V1 to V500 in random order (all integers for
>>> simplicity). We'll need to just use use.names=TRUE (as all columns are
>>> available in all data.tables).
>>>
>>> I think this data is big enough to illustrate the point. Also, I was curious
>>> to see a comparison against dplyr's rbind_all (commit 1504 devel version).
>>> So, I've added it as well to the benchmarks.
>>>
>>> Here's the data generation. Note: It takes a while for this step to finish.
>>>
>>> require(data.table) ## 1.9.3 commit 1267
>>> require(dplyr) ## commit 1504 devel
>>> set.seed(1L)
>>> foo <- function(k) {
>>>     ans = setDT(lapply(1:k, function(x) sample(10)))
>>> }
>>> bar <- function(ans, k, n) {
>>>     bla = sample(paste0("V", 1:k), n)
>>>     setnames(ans, bla)
>>> }
>>> n = 10000L
>>> ll = vector("list", n)
>>> for (i in 1:n) {
>>>     bla = bar(foo(500L), 500L, 500L)
>>>     .Call("Csetlistelt", ll, i, bla)
>>> }
>>>
>>> And here are the timings:
>>>
>>> ## data.table v1.9.3 commit 1267's rbindlist
>>> ## Timings of three consecutive runs:
>>> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
>>>    user  system elapsed
>>>  10.909   0.449  11.843
>>>
>>>    user  system elapsed
>>>   5.219   0.386   5.640
>>>
>>>    user  system elapsed
>>>   5.355   0.429   5.898
>>>
>>> ## dplyr's rbind_all
>>> ## Timings for three consecutive runs
>>> system.time(ans2 <- rbind_all(ll))
>>>    user  system elapsed
>>>  62.769   0.247  63.941
>>>
>>>    user  system elapsed
>>>  62.010   0.335  65.876
>>>
>>>    user  system elapsed
>>>  55.345   0.359  60.193
>>>
>>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> ## data.table v1.9.2's rbind version:
>>> ## ran only once as it took a bit more.
>>> system.time(ans1 <- do.call("rbind", ll))
>>>    user  system elapsed
>>> 125.356   2.247 139.000
>>>
>>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> In summary, the newer implementation is about 11-23x faster than
>>> data.table's older implementation and is about 5.5-10x faster than dplyr on
>>> this (relatively huge) data.
>>>
>>> Arun
>>>
>>> From: Arunkumar Srinivasan aragorn168b at gmail.com
>>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>>> Date: May 20, 2014 at 9:27:56 PM
>>> To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
>>> Subject: FR #5249 - rbindlist gains use.names and fill arguments
>>>
>>> Hello everyone,
>>>
>>> With the latest commit #1266, the extra functionality offered via rbind
>>> (use.names and fill) is also now available to rbindlist. In addition, the
>>> implementation is completely moved to C, and is therefore tremendously fast,
>>> especially for cases where one has to bind with use.names=TRUE and/or
>>> with fill=TRUE. I'll try to put out a benchmark comparing speed differences
>>> with the older implementation ASAP.
>>>
>>> Note that this change comes with a very low cost to the default speed of
>>> rbindlist - with use.names=FALSE and fill=FALSE. As an example, binding
>>> 10,000 data.tables with 20 columns each resulted in the new version running
>>> in 0.107 seconds, whereas the older version ran in 0.095 seconds.
>>>
>>> In addition the documentation for ?rbindlist also has been improved (#5158
>>> from Alexander). Here's the change log from NEWS:
>>>
>>> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
>>>   implemented entirely in C. Closes #5249
>>>   -> use.names by default is FALSE for backwards compatibility
>>>      (doesn't bind by names by default)
>>>   -> rbind(...) now just calls rbindlist() internally, except that
>>>      'use.names' is TRUE by default,
>>>      for compatibility with base (and backwards compatibility).
>>>   -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
>>>   -> At least one item of the input list has to have non-null column names.
>>>   -> Duplicate columns are bound in the order of occurrence, like base.
>>>   -> Attributes that might exist in individual items would be lost in
>>>      the bound result.
>>>   -> Columns are coerced to the highest SEXPTYPE, if they are
>>>      different, if/when possible.
>>>   -> And incredibly fast ;).
>>>   -> Documentation updated in much detail. Closes DR #5158.
>>>   Eddi's (excellent) work on finding factor levels, type coercion of
>>>   columns etc. are all retained.
>>>
>>> Please try it and write back if things aren't working as they were before. The
>>> tests that had to be fixed are extremely rare cases. I suspect there should
>>> be minimal issue, if at all, in this version. However, I do find the changes
>>> here bring consistency to the function.
>>>
>>> One (very rare) feature that is not available due to this implementation is
>>> the ability to recycle.
>>>
>>> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
>>> lst1 <- list(x=4, y=5, z=as.list(1:3))
>>>
>>> rbind(dt1, lst1)
>>> #    x y       z
>>> # 1: 1 4     1,2
>>> # 2: 2 5   1,2,3
>>> # 3: 3 6 1,2,3,4
>>> # 4: 4 5       1
>>> # 5: 4 5       2
>>> # 6: 4 5       3
>>>
>>> The 4,5 are recycled very nicely here. This is not possible at the moment.
>>> This is because the earlier rbind implementation used as.data.table to
>>> convert to data.table, however it takes a copy (very inefficient on huge /
>>> many tables). I'd love to add this feature in C as well, as it would help
>>> incredibly for use within [.data.table (now that we can fill columns and
>>> bind by names faster). Will add a FR.
>>>
>>> In summary, I think there should be minimal issues, if any, and it should be
>>> much faster (for rbind cases). Please write back what you think, if you
>>> happen to try it out.
>>>
>>> Arun
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Wed May 21 12:56:32 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 21 May 2014 12:56:32 +0200
Subject: [datatable-help] R Studio Interactions with data.table
In-Reply-To:
References: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com>
Message-ID:

Zachary,

Fixed in #1269 (v1.9.3). Please do write back if you still experience the
issue after update.

On Sun, Apr 27, 2014 at 9:48 PM, Arunkumar Srinivasan wrote:
> Zack,
> I'm able to reproduce the crash and the occasional warning. Will look into
> it - filed #5647.
> Thanks for reporting.
> Arun.
>
> On Tue, Apr 22, 2014 at 8:07 PM, Zachary Long wrote:
>
>> Hello,
>>
>> I was wondering if an error like this had been addressed before. I am
>> using data table 1.9.2.
>>
>> It appears that the error has to do with the interaction with R-Studio.
>> When I run
>>
>> library(data.table)
>>
>> dt<-data.table(strip="Nov08",date=c("2006-08-01","2006-08-02","2006-08-03","2006-08-04","2006-08-07",
>>     "2006-08-08","2006-08-09","2006-08-10","2006-08-11","2006-08-14"))
>> dt[,forward_date:=c(rep(NA,5),date),by='strip']
>>
>> The result I expect is below, along with a warning message.
>>
>>     strip       date forward_date
>>  1: Nov08 2006-08-01           NA
>>  2: Nov08 2006-08-02           NA
>>  3: Nov08 2006-08-03           NA
>>  4: Nov08 2006-08-04           NA
>>  5: Nov08 2006-08-07           NA
>>  6: Nov08 2006-08-08   2006-08-01
>>  7: Nov08 2006-08-09   2006-08-02
>>  8: Nov08 2006-08-10   2006-08-03
>>  9: Nov08 2006-08-11   2006-08-04
>> 10: Nov08 2006-08-14   2006-08-07
>>
>> However, I don't get this.
>>
>> 1 of two things can happen.
>>
>> 1. My R-Studio will completely crash without warning. All unsaved
>> information is lost.
>> 2. I can get "Error: Value of SET_STRING_ELT() must be a 'CHARSXP' not a
>> 'character'" "In addition: Lost warning messages"
>>
>> Do you know what is the cause here? It seems related to memory
>> allocation, or something under the hood relating to the interaction of
>> R-Studio and data table.
>>
>> Zach
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Wed May 21 13:00:56 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 21 May 2014 13:00:56 +0200
Subject: [datatable-help] rbindlist and unique
In-Reply-To:
References:
Message-ID:

Nathaniel,

Thanks.

> First, I use rbindlist pretty often, and I've been quite happy with it. The
> new use.names and fill features definitely scratch an itch for me; I wound up
> using rbind_all from dplyr (which worked well, I'm not complaining), but I'm
> looking forward to having a data.table implementation.

A data.table implementation (in rbind) has existed since the last release
(v1.9.0/2). This one just builds on it.

Arun

From: Nathaniel Graham npgraham1 at gmail.com
Reply: Nathaniel Graham npgraham1 at gmail.com
Date: May 21, 2014 at 2:20:44 AM
To: data.table source forge datatable-help at lists.r-forge.r-project.org
Subject:
[datatable-help] rbindlist and unique

First, I use rbindlist pretty often, and I've been quite happy with it. The
new use.names and fill features definitely scratch an itch for me; I wound up
using rbind_all from dplyr (which worked well, I'm not complaining), but I'm
looking forward to having a data.table implementation. The speed increase is
also welcome. So thank you for the new features! I don't personally have a
preference with respect to the use.names and fill defaults, so whatever you
guys decide will be fine with me.

I do have a question regarding unique, which I use very, very frequently, and
often after rbindlist. I have a fairly large data set (tens of millions of raw
observations), many of which are duplicates. The observations come from a
variety of sources, but the formats and variable names are (nearly) identical.
The problem is that many "duplicates" aren't perfect duplicates, and some rows
have more information than others. A simple example might look like this:

> foo
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  1  3 TRUE
6:  1  4   NA
7:  2  3 TRUE
8:  2  4 TRUE
9:  3  1   NA
> unique(foo, by = c("V1", "V2"))
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  3  1   NA

Sometimes V3 is present and sometimes it isn't. V1 and V2 (in my story)
uniquely identify an observation, but if there's a row where I also have V3,
I'd prefer to have that row rather than a row where it's missing. You can see
that a naive use of unique here gets me the less-preferable 2,3 row. If I only
had three columns, this would be easy to solve (sort/setkey first would do
it). However, I have more than a dozen additional columns, and when I drop
duplicates I want to retain the row with the greatest number of non-missing
values. Additionally, some columns are more important than others.
If (to refer again to the example above), there are no rows that have V3 for
a given V1 & V2 (like 3,1), I still need to retain a row, so I can't just
condition on !is.na(V3).

Does anybody have any insight or techniques for this sort of thing? I'm
currently sorting on all columns prior to unique, but I'm quite sure that this
loses some information.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From my.r.help at gmail.com Thu May 22 05:54:57 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 22 May 2014 11:54:57 +0800
Subject: [datatable-help] A[B]?
In-Reply-To: <1399370245881-4690040.post@n4.nabble.com>
References: <1399183248863-4689942.post@n4.nabble.com> <5365F136.8050807@gmail.com> <1399370245881-4690040.post@n4.nabble.com>
Message-ID: <537D7511.1000209@gmail.com>

FAQ 1.12?

On 05/06/2014 05:57 PM, Rguy wrote:
> That FAQ does not provide any examples of the A[B] syntax used with data
> table objects. It does provide an example using A[B] with matrix objects,
> but the example does not translate to data table objects, so I'm not sure
> why it's there. I suggest that the FAQ be extended to provide one, or better
> yet several, examples of the A[B] syntax applied to data.table objects.
>
> As far as I have been able to puzzle out so far, A[B] is just another way to
> do a merge.
>
> --
> View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942p4690040.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

From aragorn168b at gmail.com Sat May 24 23:15:03 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 24 May 2014 23:15:03 +0200
Subject: [datatable-help] print.data.table's digits argument
In-Reply-To:
References: <081AB5A6E11243C0B1C75A937463DAC8@gmail.com>
Message-ID:

Fixed this in commit #1275 (v1.9.3). Thanks Frank and Matthew Beckers for
filing it here.

Arun

On Tue, Jun 18, 2013 at 1:39 AM, Frank Erickson wrote:
> Ah, that did the trick! I'll use this quite a lot, I expect. Thanks, Arun.
> --Frank
>
> On Mon, Jun 17, 2013 at 6:19 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
>
>> Dear Frank,
>>
>> Thanks for forwarding to the list. I always seem to forget to
>> "reply-all". Apologies. Managed this time! :)
>>
>> Try this on your data:
>>
>> as.data.table(do.call("cbind", lapply(DT, function(x) {
>>     if (is.list(x)) {
>>         lapply(x, function(y) as.numeric(format(y, digits=2)))
>>     } else
>>         as.numeric(format(x, digits=2))
>> })))
>>
>> Arun
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Sun May 25 04:54:40 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 25 May 2014 04:54:40 +0200
Subject: [datatable-help] Assignment by reference fails silently
In-Reply-To:
References:
Message-ID:

This was a side effect of my not fixing FR #2551 properly. Stricter (and
more extensive) tests have been added now. Fixed in commit #1277 (v1.9.3).
Thanks for reporting.

Arun.

On Sun, Apr 27, 2014 at 9:14 PM, Arunkumar Srinivasan wrote:
> Thanks for reporting. I've added this case under comments to another
> recently filed issue, bug #5442 from Michele (as I am quite sure they're
> related to handling column types in `:=` without grouping).
>
> Arun.
>
> On Fri, Apr 25, 2014 at 7:25 PM, John Laing wrote:
>
>> If I create a logical column in my data.table and try to
>> assign-by-reference a character value to it, the assignment fails silently.
>> That is, it doesn't work but doesn't throw an error:
>>
>> ## make a simple data.table
>> require(data.table)
>> dt <- data.table(a=1:3, b=4:6, c=NA)
>>
>> ## fails silently
>> dt[, c := "foo"]
>> dt
>>
>> In other cases where an action would lead to the implicit conversion of a
>> column, data.table throws an error suggesting that the user convert the
>> column explicitly if that's what they really mean to do. I think that's the
>> right behavior and should be adopted in this case as well.
>>
>> -John
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From talex at privatdemail.net Fri May 30 11:43:03 2014
From: talex at privatdemail.net (talex)
Date: Fri, 30 May 2014 02:43:03 -0700 (PDT)
Subject: [datatable-help] strange undocumented data.table error
Message-ID: <1401442983415-4691467.post@n4.nabble.com>

I ran into a strange error that a search only turns up in the commit of
data.table:

This came up upon running a previously tested working dcast.data.table
expression, on a data.table I have subsetted (with duplicates) using a J()
command. The offending section is this:

Strangely, the data I fed it works until a specific row of the input table is
included, and then it dies. What's bothering data.table?
I tried changing keys and messing with the order of the formula terms just to
see, because I don't understand the error, but that didn't work, of course.

--
View this message in context: http://r.789695.n4.nabble.com/strange-undocumented-data-table-error-tp4691467.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com Sat May 31 06:01:31 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 31 May 2014 12:01:31 +0800
Subject: [datatable-help] `with=F` in the `i` Argument
Message-ID: <5389541B.8040006@gmail.com>

All,

I'm trying to order the rows according to several columns at a time:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
  print(DT[order(i), with = FALSE])

It doesn't work, since `with` seems to be about the `j` argument, but
not the `i` argument, according to `?data.table`.

I found the following workaround, but wonder whether there is a more
elegant way to do it:

for (i in c("a", "b"))
  print(DT[order(DT[, i, with = FALSE])])

Thanks,
M

From gsee000 at gmail.com Sat May 31 06:44:59 2014
From: gsee000 at gmail.com (G See)
Date: Fri, 30 May 2014 23:44:59 -0500
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <5389541B.8040006@gmail.com>
References: <5389541B.8040006@gmail.com>
Message-ID:

Hi Michael,

I would use get()

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
  print(DT[order(get(i))])

For what it's worth, your solution doesn't seem to work in data.table
1.9.3 (svn rev. 1278):

> for (i in c("a", "b"))
+   print(DT[order(DT[, i, with = FALSE])])
Error in forder(DT, DT[, i, with = FALSE]) :
  Column '1' is type 'list' which is not supported for ordering currently.

HTH,
Garrett

On Fri, May 30, 2014 at 11:01 PM, Michael Smith wrote:
> All,
>
> I'm trying to order the rows according to several columns at a time:
>
> DT <- data.table(a = 1:4, b = 8:5)
> for (i in c("a", "b"))
>   print(DT[order(i), with = FALSE])
>
> It doesn't work, since `with` seems to be about the `j` argument, but
> not the `i` argument, according to `?data.table`.
>
> I found the following workaround, but wonder whether there is a more
> elegant way to do it:
>
> for (i in c("a", "b"))
>   print(DT[order(DT[, i, with = FALSE])])
>
> Thanks,
> M
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
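
For readers following this last exchange, the options for ordering by a column whose name is held in a string can be put side by side in a small sketch. Only the `get()` form comes from the thread. The `[[` form and `setorderv()` are additions here, and `setorderv()` is assumed to be available, which should hold for more recent data.table releases but may not for the 1.9.3 development version discussed above. The names `col`, `by_get`, `by_vec` and `sorted_by_b` are purely illustrative.

```r
library(data.table)

DT <- data.table(a = 1:4, b = 8:5)

# Option 1 (suggested in the thread): resolve the name with get() inside i.
for (col in c("a", "b")) {
  print(DT[order(get(col))])
}

# Option 2 (an assumption, not from the thread): pull the column out as a
# plain vector with [[ ]], so order() sees an atomic vector instead of a
# one-column data.table (a list).
for (col in c("a", "b")) {
  print(DT[order(DT[[col]])])
}

# Option 3 (assumes a data.table version that provides setorderv()):
# reorder by reference using column names given as strings. copy() keeps
# the original DT untouched.
sorted_by_b <- setorderv(copy(DT), "b")
print(sorted_by_b)

# Kept for comparison: both subsetting forms should give the same row order.
by_get <- DT[order(get("b"))]
by_vec <- DT[order(DT[["b"]])]
```

The first two forms return a reordered copy on each call, while `setorderv()` modifies its argument in place, which is why the sketch passes it a copy.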