From J.Gorecki at wit.edu.pl Sun Jun 1 13:16:53 2014
From: J.Gorecki at wit.edu.pl (Jan Gorecki)
Date: Sun, 1 Jun 2014 04:16:53 -0700 (PDT)
Subject: [datatable-help] learn how to use melt and dcast
In-Reply-To: <1400590237635-4690882.post@n4.nabble.com>
References: <1400590237635-4690882.post@n4.nabble.com>
Message-ID: <1401621413529-4691552.post@n4.nabble.com>

Hi statquant3,

I found this tutorial very helpful:
http://marcoghislanzoni.com/blog/2013/10/11/pivot-tables-in-r-with-melt-and-cast/

Jan

--
View this message in context: http://r.789695.n4.nabble.com/learn-how-to-use-melt-and-dcast-tp4690882p4691552.html
Sent from the datatable-help mailing list archive at Nabble.com.

From J.Gorecki at wit.edu.pl Wed Jun 4 11:40:45 2014
From: J.Gorecki at wit.edu.pl (Jan Gorecki)
Date: Wed, 4 Jun 2014 02:40:45 -0700 (PDT)
Subject: [datatable-help] data.table syntax Data Warehouse use case simulation
Message-ID: <1401874845225-4691697.post@n4.nabble.com>

Hi All,

I would rather not go deep into a description of the DW star schema model. It may not be necessary, as you have the initial structure and the expected structure. We have our measures (numeric values) in the "facts" tables. Facts are connected to dimensions, which contain the reference field from the facts tables plus some higher level attributes (may be seen as: dim1="Paris", dim1h="France").
I'm looking for a memory, time and syntax optimal solution to denormalize my data by joining the facts table to all the dimension tables.

# populate data
library(data.table)
facts <- data.table(dim1=letters[1:6], dim2=letters[7:12], dim3=letters[13:18], dim4=letters[19:24], quantity = rnorm(6,100,40), value = rnorm(6,1000,200))
dim1 <- data.table(dim1=letters[1:6], dim1h=rep(letters[1:3],2), key="dim1")
dim2 <- data.table(dim2=letters[7:12], dim2h=rep(letters[7:9],2), key="dim2")
dim3 <- data.table(dim3=letters[13:18], dim3h=rep(letters[13:15],2), key="dim3")
dim4 <- data.table(dim4=letters[19:24], dim4h=rep(letters[19:21],2), key="dim4")
# my proposed solution
joinby <- function(master, join, by){
  stopifnot(by %in% names(master) & by %in% names(join))
  join[setkeyv(master,by)]
}
# denormalize
dt <- joinby(joinby(joinby(joinby(facts,dim1,"dim1"),dim2,"dim2"),dim3,"dim3"),dim4,"dim4")
# aggregate - expected results
dt[,list(quantity=sum(quantity),value=sum(value)),by=c("dim1h","dim2h","dim3h","dim4h")]

My solution assumes the column names used in the joins are identical. The syntax isn't that great, but I couldn't figure out anything better. I'm not aware of the performance; it may be an issue because the master table is re-sorted on each join.
Would anybody propose a better (more optimal) solution?

Regards,
Jan

--
View this message in context: http://r.789695.n4.nabble.com/data-table-syntax-Data-Warehouse-use-case-simulation-tp4691697.html

From jmtruppia at gmail.com Fri Jun 6 00:01:36 2014
From: jmtruppia at gmail.com (juancentro)
Date: Thu, 5 Jun 2014 15:01:36 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <1399468390041-4690112.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com> <1399453335248-4690100.post@n4.nabble.com> <1399462206528-4690105.post@n4.nabble.com> <1399468390041-4690112.post@n4.nabble.com>
Message-ID: <1402005696445-4691774.post@n4.nabble.com>

Hi, what's the current status on this one? In the last 1.9.3 by=.EACHI is used.
This is disruptive for current users (it has broken several pieces of my code) but, after complaining and barking, I realized that it is really more intuitive and reasonable to do a by just when a by is explicit.
Are there any plans to release 1.9.3, and which syntax will be kept? I want to be prepared.

thanks!

--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4691774.html

From aragorn168b at gmail.com Fri Jun 6 00:13:02 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 6 Jun 2014 00:13:02 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <1402005696445-4691774.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com> <1399453335248-4690100.post@n4.nabble.com> <1399462206528-4690105.post@n4.nabble.com> <1399468390041-4690112.post@n4.nabble.com> <1402005696445-4691774.post@n4.nabble.com>
Message-ID:

Juancentro,

Matt started a post on this topic in March this year here: http://lists.r-forge.r-project.org/pipermail/datatable-help/2014-March/002430.html
Have you read it or contributed there? If not, when and where did you complain?
Matt also checked all dependent packages on data.table (on CRAN, I believe, bioconductor - not sure) and contacted those authors whose unit tests failed on this issue, IIRC. Are you developing a package that's dependent on data.table? If so, is it already on CRAN or bioconductor? If not, how do you expect us to reach you other than through the mailing list?
And 1.9.3 is a development version, where these things are meant to be ironed out before pushing a *stable* release to CRAN. And IIUC, by the time it is pushed to CRAN, there should be a provision to use the older feature, or some other fix, so that the older feature can be properly deprecated.
As I said before, this is *still* in development, and we've not gotten to it yet.

I think rather that you should be following the mailing list closely (and NEWS) and contribute to the conversations when decisions are being made.

Arun

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From jmtruppia at gmail.com Fri Jun 6 00:18:24 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Thu, 5 Jun 2014 19:18:24 -0300
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <1366401278742-4664770.post@n4.nabble.com> <1399453335248-4690100.post@n4.nabble.com> <1399462206528-4690105.post@n4.nabble.com> <1399468390041-4690112.post@n4.nabble.com> <1402005696445-4691774.post@n4.nabble.com>
Message-ID:

Arun,

I only complained to myself! My published packages don't depend on data.table. My unpublished code does.
But I am not complaining about the change, it is a good one! I was just asking if you had reached a decision. I meant the complaining part as something funny, not to be taken at face value.

From mdowle at mdowle.plus.com Fri Jun 6 02:40:52 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Fri, 06 Jun 2014 01:40:52 +0100
Subject: [datatable-help] internal FALSE/TRUE value has been modified
In-Reply-To: <5362685E.1080303@mdowle.plus.com>
References: <5361D04C.2090509@gmail.com> <5362685E.1080303@mdowle.plus.com>
Message-ID: <53910E14.8030101@mdowle.plus.com>

Now fixed in v1.9.3 :

o The warning "internal TRUE value has been modified" with recently released R 3.1 when grouping a table containing a logical column and where all groups are just 1 row is now fixed and tests added. Thanks to James Sams for the reproducible example. The warning is issued by R and we have asked if it can be upgraded to error.

Matt

On 01/05/14 16:29, Matt Dowle wrote:
>
> Reproduced, thanks for nice example. Not sure yet but what R 3.1 now
> does is store length 1 logical vectors once only, globally, for
> efficiency to avoid many new allocations for the common case of single
> TRUE or FALSE values passed around at C or R level (a nice and welcome
> change).
> Since data.table modifies vectors by reference, if that
> vector is length 1 a new data.table bug as from R 3.1 could be
> modifying R's internal value of TRUE or FALSE whenever length 1
> logical vectors occur. Clearly a serious bug. The test suite
> immediately broke the day after the R-devel change was made (good) and
> was one reason data.table was in error state in CRAN checks for quite
> a while before R 3.1 shipped. It was typically tests of 1-row
> data.table's including a logical column and modifying that logical
> column that broke. We fixed that and put in checks to detect and warn
> if R's internal value has been modified, just in case. Those
> changes were in v1.9.2 on CRAN. I think I wasn't 100% confident in
> the detection test (false positives) so made it a warning instead of
> an error. Now that R 3.1 is out and we haven't had any false
> positives, it should be an error.
>
> The feature of this upc_table is that all the groups are size 1 :
>
> > upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
> [1] 1
>
> If we change the example so that one group has more than 1 row, it
> works ok :
>
> > upc_table = data.table(upc=c(1:99998,1,1), upc_ver_uc=rep(c(1,2),
> times=50000), is_PL=rep(c(T, F, F, T), each=25000),
> product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
> > upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
> [1] 2
> > upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc,
> upc_ver_uc)]
>
> So it seems the problem is in the single allocation of working memory
> for the largest group when that's just 1 and contains a logical
> column. Odd, I would have sworn we caught that! Will fix.
>
> R-devel are planning to do more of this small-object-sharing for
> common single integer values e.g. 0-10, so we'll need to add more
> tests accordingly.
>
> Thanks,
> Matt
>
>
>
> On 01/05/14 05:40, James Sams wrote:
>> I don't really know what this error message means. A quick example to
>> show what I'm seeing:
>>
>> > library(data.table)
>> data.table 1.9.3 For help type: help("data.table")
>> > upc_table = data.table(upc=1:100000, upc_ver_uc=rep(c(1,2),
>> times=50000), is_PL=rep(c(T, F, F, T), each=25000),
>> product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
>> > upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc,
>> upc_ver_uc)]
>> Warning message:
>> In `[.data.table`(upc_table, , list(is_PL, product_module_code), :
>> internal TRUE value has been modified
>>
>> When I continue using R, I eventually start getting more errors, such
>> as:
>>
>> Error in gettext(domain, unlist(args)) : invalid 'string' value
>> Error during wrapup: invalid 'string' value
>>
>> and then terminal input/output becomes corrupted. I only start
>> getting these error messages once I start using data.table; but the
>> messages don't necessarily occur only with data.table functions.
>>
>> I don't know if the last statement above is executing correctly or
>> not. I'm rather confused as to what is going on. I was using a
>> somewhat stale (maybe a couple of weeks old) svn version of
>> data.table; but I see the same behavior with the latest data.table
>> (r1263). I'm using CRAN's R 3.1 package for Ubuntu on 13.10 and 14.04.
>>
>>
>>
>> > sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
>> LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] data.table_1.9.3
>>
>> loaded via a namespace (and not attached):
>> [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4 stringr_0.6.2
>>

From rguy at 123mail.org Fri Jun 6 08:29:01 2014
From: rguy at 123mail.org (Rguy)
Date: Thu, 5 Jun 2014 23:29:01 -0700 (PDT)
Subject: [datatable-help] A[B]?
In-Reply-To: <537D7511.1000209@gmail.com>
References: <1399183248863-4689942.post@n4.nabble.com> <5365F136.8050807@gmail.com> <1399370245881-4690040.post@n4.nabble.com> <537D7511.1000209@gmail.com>
Message-ID: <1402036141545-4691793.post@n4.nabble.com>

In the FAQ, the X[Y] syntax is first mentioned in item 1.11, where it is not explained and no example of its use is provided. In item 1.12, X[Y] is compared to merge, again without any attempt to explain what X[Y] is or does, and with no examples of its use. Also, merge is not discussed correctly: "...the number of rows returned by merge(X,Y) and merge(Y,X) is the same." This can be controlled by the merge arguments by.x, by.y.
I suggest that before discussing the ins and outs of the X[Y] syntax the FAQ explain what it is and provide examples of its use. Think of it as "X[Y] for Dummies".
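
The kind of introductory X[Y] example being asked for might look like the following minimal sketch. It assumes a keyed table X and a lookup table Y; the table and column names (X, Y, id, x, y) are invented for illustration and are not taken from the FAQ.

```r
library(data.table)

# X is the table being looked up in; it is keyed on id.
X <- data.table(id = c("a", "b", "c"), x = 1:3, key = "id")

# Y holds the key values to look up (hypothetical data).
Y <- data.table(id = c("b", "c", "d"), y = c(10, 20, 30))

# X[Y]: for each row of Y, find the matching row(s) of X using X's key.
# Rows of Y without a match in X are kept, with NA in X's columns.
res <- X[Y]
print(res)

# nomatch = 0 drops the rows of Y that have no match in X (an inner join).
inner <- X[Y, nomatch = 0]
print(inner)
```

Read this way, X[Y] is a lookup of Y's rows in X, so the result has one row per row of Y, or more when a key value matches several rows of X.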
As things stand the FAQ are completely useless for getting a grip on X[Y] and it is very frustrating to encounter explanations of a concept that has nowhere been introduced or illustrated.

--
View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942p4691793.html

From mdowle at mdowle.plus.com Mon Jun 9 15:51:11 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Mon, 09 Jun 2014 14:51:11 +0100
Subject: [datatable-help] data.table has moved to GitHub
Message-ID: <5395BBCF.1050800@mdowle.plus.com>

Dear all,

Arun has done an amazing job in transferring everything from R-Forge to GitHub. This includes the full commit history and outstanding bug and feature requests.

    https://github.com/Rdatatable/datatable/

To install the latest version from now on it's :

    devtools:::install_github("datatable", "Rdatatable")

As you may have noticed R-Forge has been stuck in building state for several days now, so you wouldn't have been able to install v1.9.3 from there anyway. Arun has integrated with Travis which gives us the package build and check environment. Windows users will need to install Rtools because install_github() compiles from source, but that is straightforward we believe. We may be able to add building and checking of a compiled .zip for Windows in future. If you're a Windows user please let us know how you get on with Rtools and devtools::install_github().

GitHub should make it easier for you to contribute : just edit the file within the github website and then press "Propose file change". Project members will then review and accept the change.

Public access to the bug and feature request trackers on R-Forge is now turned off. Please use GitHub from now on. Comments couldn't be transferred to GitHub but we can still see them on R-Forge. If you raised an issue on R-Forge you may still get automatic emails from R-Forge as we close them down.

Thanks,
Matt

From jmtruppia at gmail.com Mon Jun 9 22:12:34 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Mon, 9 Jun 2014 17:12:34 -0300
Subject: [datatable-help] dcast.data.table loses column classes when column is date
Message-ID:

Here is a reproducible example

dcast.data.table(data = data.table(id = c(1,1,2,2), ty = c("a","b","a","b"), da = Sys.Date()), formula = id ~ ty)

I don't know how to report a bug; if someone guides me, I'll be much obliged.

Thanks!

From gleynes+r at gmail.com Mon Jun 9 23:18:59 2014
From: gleynes+r at gmail.com (Gene Leynes)
Date: Mon, 9 Jun 2014 16:18:59 -0500
Subject: [datatable-help] data.table has moved to GitHub
In-Reply-To: <5395BBCF.1050800@mdowle.plus.com>
References: <5395BBCF.1050800@mdowle.plus.com>
Message-ID:

I was about to post a question. Do questions now go to an address on github rather than datatable-help at lists.r-forge.r-project.org, or should we use something else for discussion / questions?

From aragorn168b at gmail.com Mon Jun 9 23:29:21 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 9 Jun 2014 23:29:21 +0200
Subject: [datatable-help] dcast.data.table loses column classes when column is date
In-Reply-To:
References:
Message-ID:

Juan,

On how to report a bug:
1) Go to https://github.com and create an account, if you don't already have one.
2) Go to our project page, while signed in: https://github.com/Rdatatable/datatable
3) Click "Issues" on the right side of the page.
4) This issue doesn't already exist. So, hit "New issue" (green button on the right).
5) Provide a title. Fill the body - remember you can format code using Markdown (as well as Github flavoured markdown). For example, to write R-code, you can do:

```S
your R-code
```

The S is the lexer type (for highlighting code using Github flavoured markdown).
6) Add a label (equivalent of tag or tracker type in R-Forge) by clicking on "bug" on the right side.
7) Preview your post, if you want to. Then click "Submit new issue".

---

On the bug itself: This is because `reshape2:::dcast` doesn't preserve attributes. And we wanted to be consistent with their result at the time of writing.
However, since that time, `reshape2` has obtained a newer implementation of "melt", written by Kevin Ushey, where attributes are preserved as long as all the columns that you're asking to be "molten" are of the same type. But this doesn't happen for "factors" by default because that might break existing code - and it therefore obtained a new argument "factorsAsStrings", IIUC. I personally find that these things add a layer of complexity. But that's the case with "melt".

It's really hard to tell from reshape2's ?melt or ?cast what's the case regarding attributes. But my guess is that we should, starting with your post, try to define what's what and document it instead of relying entirely on being consistent with reshape2's behaviour, as we already differ slightly from reshape2.

We're very much younger than reshape2's melt/cast. So, I think we might be able to rectify these things on consistency and rules relatively easily.

Arun
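
One way to see the behaviour discussed in this thread is to cast the example from the original report and inspect the column classes. This is only a sketch: it assumes a data.table release in which dcast() dispatches on data.table objects (the 1.9.x versions in this thread call dcast.data.table() directly), value.var is spelled out here as an assumption about which column is being cast, and the classes reported depend on the installed version, so no output is shown.

```r
library(data.table)

# The table from the report: one Date column that should survive the cast.
DT <- data.table(id = c(1, 1, 2, 2),
                 ty = c("a", "b", "a", "b"),
                 da = Sys.Date())

# Cast to wide form, one column per level of ty.
# value.var is named explicitly rather than left to be guessed.
wide <- dcast(DT, id ~ ty, value.var = "da")

# Report the class of every column in the result. If the Date class
# was dropped, columns a and b show up as "numeric" instead of "Date".
print(sapply(wide, class))
```

Comparing the printed classes with class(DT$da) shows at a glance whether the attribute was preserved.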

From aragorn168b at gmail.com Mon Jun 9 23:34:39 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 9 Jun 2014 23:34:39 +0200
Subject: [datatable-help] data.table has moved to GitHub
In-Reply-To:
References: <5395BBCF.1050800@mdowle.plus.com>
Message-ID:

Hello Gene,

Yes, you can continue to post questions on the mailing list, of course. Especially questions on design changes or proposals for design changes, where you'd like to hear from the entire list, or simply questions on "how to do this in a data.table way / is there a better way" etc, as this is a place to connect to all data.table users subscribed to the mailing list.

Although, there is a "label" on github named "question", which I'm not quite sure of the use of, yet. I suspect it is mostly for developers to communicate with each other when they're not entirely sure of where something falls or if it's a good feature etc. We'll know soon enough :).

Arun

From jmtruppia at gmail.com Mon Jun 9 23:48:07 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Mon, 9 Jun 2014 18:48:07 -0300
Subject: [datatable-help] dcast.data.table loses column classes when column is date
In-Reply-To:
References:
Message-ID:

Posted here https://github.com/Rdatatable/datatable/issues/688

However, I couldn't tag it as a bug (didn't find the option to add a label, sorry!)

From jmtruppia at gmail.com Mon Jun 9 23:54:39 2014
From: jmtruppia at gmail.com (juancentro)
Date: Mon, 9 Jun 2014 14:54:39 -0700 (PDT)
Subject: [datatable-help] data.table has moved to GitHub
In-Reply-To:
References: <5395BBCF.1050800@mdowle.plus.com>
Message-ID: <1402350879328-4691929.post@n4.nabble.com>

This is great news!!! I love GitHub, and don't love R-Forge so much. As for the Rtools and Windows environment, it works great for me. I've been installing Hadley packages from github for a while, and didn't have any issues.

--
View this message in context: http://r.789695.n4.nabble.com/data-table-has-moved-to-GitHub-tp4691915p4691929.html

From mikkel at scarab-solutions.com Tue Jun 10 19:17:54 2014
From: mikkel at scarab-solutions.com (Mikkel Grum)
Date: Tue, 10 Jun 2014 12:17:54 -0500
Subject: [datatable-help] data.table error: invalid subscript type, except it isn't.
Message-ID:

Hello data.table useRs

I've written a function myTable that I've included in a package I've made myself (RAPI). The function calls library(data.table) and does a number of things to the data using the data.table functionality.
On its own (cutting and pasting the code into R) the function works well, but when I install the package and then try to run the function, I get the following error

> library(RAPI)
> myTable(11, '2014-06-09')
data.table 1.9.2 For help type: help("data.table")
Error in `[.default`(x, i) : invalid subscript type 'list'

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.9.2 RODBC_1.3-10 RAPI_1.0

loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4 stringr_0.6.2

The line that fails is
stopsperrow <- myData[, length(timestamp), by = list(name, house, row)]

However, if I type the function name, myTable, and cut and paste the function from the console back into the console, the command produces the desired output without any hiccups! In other words the function in the package is OK, but something about the environment isn't right - if that's the right way to put it.

Any ideas for where I should be looking, or what I should be trying?

Regards
Mikkel

--
Mikkel Grum, PhD
Director, Research and Development
ParqueSoft
Calle 25 #127-220
Cali, Colombia
cel +57 313 730 1976

From my.r.help at gmail.com Wed Jun 11 03:38:28 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Wed, 11 Jun 2014 09:38:28 +0800
Subject: [datatable-help] data.table error: invalid subscript type, except it isn't.
In-Reply-To:
References:
Message-ID: <5397B314.6030506@gmail.com>

Have you imported data.table into your RAPI package?
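
If a missing import is indeed the cause, the change would be made in the package metadata rather than in the function body. The sketch below assumes the package is documented with roxygen2; the function name stops_per_row and the demo table are hypothetical, and only the failing expression is taken from the report above.

```r
# In the package's DESCRIPTION file (not R code), data.table would be listed:
#   Imports: data.table
# and the NAMESPACE file would contain (roxygen2 writes this from the tag below):
#   import(data.table)

library(data.table)  # needed only to run this sketch at the console

#' @import data.table
stops_per_row <- function(myData) {
  # With data.table imported by the package, `[` here is handled by
  # `[.data.table`, so j and by are evaluated inside the table.
  myData[, length(timestamp), by = list(name, house, row)]
}

# Hypothetical data with the column names used in the failing line.
demo <- data.table(name = c("n1", "n1", "n2"),
                   house = c(1, 1, 2),
                   row = c(1, 1, 1),
                   timestamp = Sys.time() + 1:3)

res <- stops_per_row(demo)
print(res)
```

Pasting a function into the console works either way because the global environment is always treated as data.table aware, which would be consistent with the symptom described.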
M On 06/11/2014 01:17 AM, Mikkel Grum wrote: > Hello data.table useRs > > I've written a function myTable that I've included in a package I've > made myself (RAPI). The function calls library(data.table) and does a > number of things to the data using the data.table functionality. On > its own (cutting and pasting the code into R) the function works well, > but when I install the package and then try to run the function, I get > the following error > >> library(RAPI) >> myTable(11, '2014-06-09') > data.table 1.9.2 For help type: help("data.table") > Error in `[.default`(x, i) : invalid subscript type 'list' > >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.9.2 RODBC_1.3-10 RAPI_1.0 > > loaded via a namespace (and not attached): > [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4 stringr_0.6.2 > > > The line that fails is > stopsperrow <- myData[, length(timestamp), by = list(name, house, row)] > > However, if I type the function name, myTable, and cut and paste the > function from the console back into the console, the command produces > the desired output without any hiccups! In other words the function in > the package is OK, but something about the environment isn't right - > if that's the right way to put it. > > Any ideas for where I should be looking, or what I should be trying? > > Regards > Mikkel > From mdowle at mdowle.plus.com Wed Jun 11 22:22:28 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Wed, 11 Jun 2014 21:22:28 +0100 Subject: [datatable-help] Slides for useR! 
data.table tutorial Message-ID: <5398BA84.3060806@mdowle.plus.com> Draft slides are now online for the 3 hour data.table tutorial at useR! on Monday 30 June. user2014.stat.ucla.edu/#tutorials Is there something fundamental that you wished had been explained in a tutorial like this? If so, please let me know. I'm doing another of these long tutorials, jointly with Arun, in London on Monday 15th September : http://www.earl-conference.com/Speakers/Workshop1_DataTable.html Matt -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Thu Jun 12 04:44:04 2014 From: my.r.help at gmail.com (Michael Smith) Date: Thu, 12 Jun 2014 10:44:04 +0800 Subject: [datatable-help] Slides for useR! data.table tutorial In-Reply-To: <5398BA84.3060806@mdowle.plus.com> References: <5398BA84.3060806@mdowle.plus.com> Message-ID: <539913F4.2090102@gmail.com> Hi Matt, You mention GForce in your slides. Is this something that happens behind the scenes, or is it something the user should take care of? (I couldn't find it in the current docs.) Thanks, M On 06/12/2014 04:22 AM, Matt Dowle wrote: > > Draft slides are now online for the 3 hour data.table tutorial at useR! > on Monday 30 June. > > user2014.stat.ucla.edu/#tutorials > > Is there something fundamental that you wished had been explained in a > tutorial like this? If so, please let me know. 
> > > I'm doing another of these long tutorials, jointly with Arun, in London > on Monday 15th September : > > http://www.earl-conference.com/Speakers/Workshop1_DataTable.html > > > Matt > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From lianoglou.steve at gene.com Thu Jun 12 20:17:48 2014 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Thu, 12 Jun 2014 11:17:48 -0700 Subject: [datatable-help] data.table error: invalid subscript type, except it isn't. In-Reply-To: <5397B314.6030506@gmail.com> References: <5397B314.6030506@gmail.com> Message-ID: Hi, On Tue, Jun 10, 2014 at 6:38 PM, Michael Smith wrote: > Have you imported data.table into your RAPI package? This. You shouldn't have a line in your package that explicitly loads the data.table package -- ie. there should be no "library(data.table)" line in your package. Instead you should list "data.table" in the "Imports" field in the DESCRIPTION file of your package, then in the NAMESPACE file you should have an "import(data.table)" line. Once those two things are in place, everything should be feng shui. HTH, -steve > > M > > On 06/11/2014 01:17 AM, Mikkel Grum wrote: >> Hello data.table useRs >> >> I've written a function myTable that I've included in a package I've >> made myself (RAPI). The function calls library(data.table) and does a >> number of things to the data using the data.table functionality. 
On >> its own (cutting and pasting the code into R) the function works well, >> but when I install the package and then try to run the function, I get >> the following error >> >>> library(RAPI) >>> myTable(11, '2014-06-09') >> data.table 1.9.2 For help type: help("data.table") >> Error in `[.default`(x, i) : invalid subscript type 'list' >> >>> sessionInfo() >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] data.table_1.9.2 RODBC_1.3-10 RAPI_1.0 >> >> loaded via a namespace (and not attached): >> [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.4 stringr_0.6.2 >> >> >> The line that fails is >> stopsperrow <- myData[, length(timestamp), by = list(name, house, row)] >> >> However, if I type the function name, myTable, and cut and paste the >> function from the console back into the console, the command produces >> the desired output without any hiccups! In other words the function in >> the package is OK, but something about the environment isn't right - >> if that's the right way to put it. >> >> Any ideas for where I should be looking, or what I should be trying? 
>> >> Regards >> Mikkel >> > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Steve Lianoglou Computational Biologist Genentech
From mikkel at scarab-solutions.com Thu Jun 12 21:15:32 2014 From: mikkel at scarab-solutions.com (Mikkel Grum) Date: Thu, 12 Jun 2014 14:15:32 -0500 Subject: [datatable-help] datatable-help Digest, Vol 52, Issue 7 In-Reply-To: References: Message-ID: Thanks Michael. I had written Depends in all caps in the DESCRIPTION file and assumed that calling library(data.table) within the function would override that anyway. Greatly appreciated -- Mikkel Grum, PhD Director, Research and Development ParqueSoft Calle 25 #127-220 Cali, Colombia cel +57 313 730 1976 website | map | email
From J.Gorecki at wit.edu.pl Fri Jun 13 08:00:35 2014 From: J.Gorecki at wit.edu.pl (Jan Gorecki) Date: Thu, 12 Jun 2014 23:00:35 -0700 (PDT) Subject: [datatable-help] data.table syntax Data Warehouse use case simulation In-Reply-To: <1401874845225-4691697.post@n4.nabble.com> References: <1401874845225-4691697.post@n4.nabble.com> Message-ID: <1402639235630-4692037.post@n4.nabble.com> This has been addressed by joinbyv function: https://github.com/Rdatatable/datatable/pull/694 -- View this message in context: http://r.789695.n4.nabble.com/data-table-syntax-Data-Warehouse-use-case-simulation-tp4691697p4692037.html Sent from the datatable-help mailing list archive at Nabble.com.
From mdowle at mdowle.plus.com Fri Jun 13 10:24:29 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Fri, 13 Jun 2014 09:24:29 +0100 Subject: [datatable-help] Slides for useR!
data.table tutorial In-Reply-To: <539913F4.2090102@gmail.com> References: <5398BA84.3060806@mdowle.plus.com> <539913F4.2090102@gmail.com> Message-ID: <539AB53D.8010607@mdowle.plus.com>

Hi Michael,

It happens automatically. See NEWS for v1.9.2 :

o New optimization: GForce. Rather than grouping the data, the group locations are passed into grouped versions of sum and mean (gsum and gmean) which then compute the result for all groups in a single sequential pass through the column for cache efficiency. Further, since the g* function is called just once, we don't need to find ways to speed up calling sum or mean repetitively for each group. Plan is to add gmin, gmax, gsd, gprod, gwhich.min and gwhich.max. Examples where GForce applies now :
    DT[,sum(x,na.rm=),by=...]                       # yes
    DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
    DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
    DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available, only sum and mean so far.
  GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
  Reminder: to see the optimizations and other info, set verbose=TRUE

Matt

On 12/06/14 03:44, Michael Smith wrote: > Hi Matt, > > You mention GForce in your slides. Is this something that happens behind > the scenes, or is it something the user should take care of? (I couldn't > find it in the current docs.) > > Thanks, > > M > > > > On 06/12/2014 04:22 AM, Matt Dowle wrote: >> Draft slides are now online for the 3 hour data.table tutorial at useR! >> on Monday 30 June. >> >> user2014.stat.ucla.edu/#tutorials >> >> Is there something fundamental that you wished had been explained in a >> tutorial like this? If so, please let me know.
>> >> >> I'm doing another of these long tutorials, jointly with Arun, in London >> on Monday 15th September : >> >> http://www.earl-conference.com/Speakers/Workshop1_DataTable.html >> >> >> Matt >> >> >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Jun 13 14:52:01 2014 From: mdowle at mdowle.plus.com (Matt Dowle) Date: Fri, 13 Jun 2014 13:52:01 +0100 Subject: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed Message-ID: <539AF3F1.7090906@mdowle.plus.com> Have asked on Meta : http://meta.stackoverflow.com/questions/260463/why-has-the-data-table-tag-just-been-renamed-r-data-table Matt From eduard.antonyan at gmail.com Fri Jun 13 20:46:10 2014 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 13 Jun 2014 13:46:10 -0500 Subject: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed In-Reply-To: <539AF3F1.7090906@mdowle.plus.com> References: <539AF3F1.7090906@mdowle.plus.com> Message-ID: holy batman, that was a mess :) On Fri, Jun 13, 2014 at 7:52 AM, Matt Dowle wrote: > > Have asked on Meta : > > http://meta.stackoverflow.com/questions/260463/why-has-the- > data-table-tag-just-been-renamed-r-data-table > > Matt > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/ > listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aragorn168b at gmail.com Fri Jun 13 21:09:42 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 13 Jun 2014 21:09:42 +0200 Subject: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed In-Reply-To: References: <539AF3F1.7090906@mdowle.plus.com> Message-ID: Seems like everything's back to normal. Arun -------------- next part -------------- An HTML attachment was scrubbed... URL:
From eduard.antonyan at gmail.com Fri Jun 13 21:16:59 2014 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 13 Jun 2014 14:16:59 -0500 Subject: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed In-Reply-To: References: <539AF3F1.7090906@mdowle.plus.com> Message-ID: Did it have a better tag wiki before? Seems like it should at least have a link to github in the full description. On Fri, Jun 13, 2014 at 2:09 PM, Arunkumar Srinivasan wrote: > Seems like everything's back to normal.
> > Arun > > From: Eduard Antonyan eduard.antonyan at gmail.com > Reply: Eduard Antonyan eduard.antonyan at gmail.com > Date: June 13, 2014 at 8:46:42 PM > To: Matt Dowle mdowle at mdowle.plus.com > Cc: datatable-help at lists.r-forge.r-project.org > datatable-help at lists.r-forge.r-project.org > Subject: Re: [datatable-help] We don't know why Stack Overflow > data.table tag has just been renamed > > holy batman, that was a mess :) > > > On Fri, Jun 13, 2014 at 7:52 AM, Matt Dowle > wrote: > >> >> Have asked on Meta : >> >> >> http://meta.stackoverflow.com/questions/260463/why-has-the-data-table-tag-just-been-renamed-r-data-table >> >> Matt >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From aragorn168b at gmail.com Fri Jun 13 23:00:12 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 13 Jun 2014 23:00:12 +0200 Subject: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed In-Reply-To: References: <539AF3F1.7090906@mdowle.plus.com> Message-ID: Eddi, Seems to be back: http://stackoverflow.com/tags/data.table/info Arun -------------- next part -------------- An HTML attachment was scrubbed... URL:
From rhylton at verizon.net Sat Jun 14 01:55:12 2014 From: rhylton at verizon.net (Ron Hylton) Date: Fri, 13 Jun 2014 19:55:12 -0400 Subject: [datatable-help] data.table is asking for help Message-ID: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> The code below generates the warning: In setkeyv(x, cols, verbose = verbose) : Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. This is my first attempt at using datatable so I probably did something dumb, but maybe that's useful for someone.
The first case is the one that gives the warnings. I'm also surprised at the timings. I wrote the original algorithm using dataframe & ddply and I expected datatable to be substantially faster; the opposite is true.

The algorithm does the following: Certain columns in the table are keys and others are values in the sense that each row with the same set of keys should have the same set of values. Find all the key sets for which this is not true and return the key sets + conflicting value sets.

Insight into the performance would be appreciated.

Regards, Ron

library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))

-------------- next part -------------- An HTML attachment was scrubbed... URL:
From aragorn168b at gmail.com Sat Jun 14 02:22:30 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 14 Jun 2014 02:22:30 +0200 Subject: [datatable-help] data.table is asking for help In-Reply-To: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> Message-ID:

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well. This is a tricky one. It happens because you're setting key on .SD which should normally not be allowed. What happens is, when you set key the first time, there's no key set (here) and therefore key is set on all the columns x1, x2 and x3. Now, the next group (in the by=.) is passed to your function, it'll have the key already set to x1,x2,x3 (because setkey modifies the object by reference), but .SD has obtained new data corresponding to this group. And data.table sorts this data, knowing that it already has key set.. but if the key is set then the order must be 1:n. But it wouldn't be, as this data isn't sorted. data.table warns in those scenarios.. and that's why you get the warning.

To verify this, you can try:

conflictsTable1 <- function(f, address) {
  u <- unique(setkey(f))
  setattr(f, 'sorted', NULL)
  if (nrow(u) == 1) return(NULL)
  u
}

Basically, we set the key of f (which is equal to .SD as it's only modified by reference) to NULL every time after.. so that .SD for the new group will not have the key set. The ideal scenario here, IIUC, is that setkey(.SD) or things pointing to .SD should not be possible (locking binding doesn't seem to affect things done by reference..). .SD however should retain the key of the data.table, if a key was set, wherever possible.

Arun

-------------- next part -------------- An HTML attachment was scrubbed... URL:
From rhylton at verizon.net Sat Jun 14 02:51:24 2014 From: rhylton at verizon.net (Ron Hylton) Date: Fri, 13 Jun 2014 20:51:24 -0400 Subject: [datatable-help] data.table is asking for help In-Reply-To: References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> Message-ID: <006701cf876a$c4f38310$4eda8930$@verizon.net>

I suspected it was something like this. As one clarification, there is a setkey(test,id) before any setkey(.SD). If setkey(test,id) is changed to setkey(test) so all columns are in the original datatable key then the warning goes away.

However there's another aspect. While I'm relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

-------------- next part -------------- An HTML attachment was scrubbed... URL:
From aragorn168b at gmail.com Sat Jun 14 02:57:02 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 14 Jun 2014 02:57:02 +0200 Subject: [datatable-help] data.table is asking for help In-Reply-To: <006701cf876a$c4f38310$4eda8930$@verizon.net> References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> <006701cf876a$c4f38310$4eda8930$@verizon.net> Message-ID:

> However there's another aspect. While I'm relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

`data.table` is designed for working with *really large* data sets in mind (> 100 or 200 GB in memory even). And therefore, as a design feature, it trades in "referential transparency" for manipulating data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the times they go hand-in-hand). This is perhaps the biggest design choice one needs to be aware of when working/choosing data.tables.
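The trade-off described here can be seen in a small sketch. The table DT and the helper names key_in_place and key_on_copy are made up for illustration and are not from the thread; the only data.table calls used are setkey(), key() and copy().

```r
library(data.table)

# Keys its argument directly. setkey() works by reference, so the
# caller's table is reordered and keyed too.
key_in_place <- function(dt) {
  setkey(dt, x)
  invisible(dt)
}

# Keys an explicit copy, leaving the caller's table alone.
key_on_copy <- function(dt) {
  dt <- copy(dt)
  setkey(dt, x)
  dt
}

DT <- data.table(x = c(3L, 1L, 2L), y = c("c", "a", "b"))

sorted_copy <- key_on_copy(DT)
untouched <- is.null(key(DT)) && identical(DT$x, c(3L, 1L, 2L))
print(untouched)   # TRUE: DT has no key and keeps its row order

key_in_place(DT)
print(key(DT))     # "x": the caller's DT was modified
print(DT$x)        # 1 2 3
```

Taking a copy() first restores the usual R expectation that a function does not alter its argument, at the cost of the extra memory that the by-reference design is meant to avoid.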
It is possible to modify objects by reference using data.table: all the functions that begin with "set*" modify objects by reference. The only other function that does so, outside the "set*" family, is the `:=` operator.

HTH
Arun

From: Ron Hylton rhylton at verizon.net
Reply: Ron Hylton rhylton at verizon.net
Date: June 14, 2014 at 2:52:04 AM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

I suspected it was something like this. As one clarification, there is a setkey(test,id) before any setkey(.SD). If setkey(test,id) is changed to setkey(test) so all columns are in the original data.table key then the warning goes away.

However there's another aspect. While I'm relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
Sent: Friday, June 13, 2014 8:23 PM
To: Ron Hylton; datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well.

This is a tricky one. It happens because you're setting key on .SD which should normally not be allowed.

What happens is, when you set key the first time, there's no key set (here) and therefore key is set on all the columns x1, x2 and x3. Now, when the next group (in the by=) is passed to your function, it'll have the key already set to x1,x2,x3 (because setkey modifies the object by reference), but .SD has obtained new data corresponding to this group. And data.table sorts this data, knowing that it already has key set. But if the key is set then the order must be 1:n, and it wouldn't be, as this data isn't sorted. data.table warns in those scenarios, and that's why you get the warning.
To verify this, you can try:

conflictsTable1 <- function(f, address) {
  u <- unique(setkey(f))
  setattr(f, 'sorted', NULL)
  if (nrow(u) == 1) return(NULL)
  u
}

Basically, we set the key of f (which is the same object as .SD, as it's only modified by reference) to NULL every time afterwards, so that .SD for the new group will not have the key set.

The ideal scenario here, IIUC, is that setkey(.SD), or setkey on things pointing to .SD, should not be possible (locking the binding doesn't seem to affect things done by reference). .SD however should retain the key of the data.table, if a key was set, wherever possible.

Arun

From rhylton at verizon.net Sat Jun 14 03:30:04 2014
From: rhylton at verizon.net (Ron Hylton)
Date: Fri, 13 Jun 2014 21:30:04 -0400
Subject: [datatable-help] data.table is asking for help
In-Reply-To:
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> <006701cf876a$c4f38310$4eda8930$@verizon.net>
Message-ID: <007301cf8770$2bb569b0$83203d10$@verizon.net>

The performance is what puzzles me; the results are correct so the warnings don't matter, and not all the variations I've tried have warnings. On the real dataset (~800,000 rows) data.table takes about 1.5 times longer than data.frame + ddply. I expected it to be substantially faster.
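The by-reference behaviour discussed in this thread (setkey() changing the object it is given, which is what trips up .SD between groups) can be reproduced on a toy table. The following is only a minimal sketch, assuming a reasonably recent data.table release; the table and the helper names key_on_copy and key_in_place are made up for illustration and do not come from the posts.

```r
library(data.table)

# Hypothetical toy table, not the poster's data.
dt <- data.table(id = c("b", "a", "c"), x = c(2, 1, 3))

# Hypothetical helper: keys a private copy and returns it.
key_on_copy <- function(d) {
  d <- copy(d)     # explicit deep copy; the caller's table is left alone
  setkey(d, id)
  d
}

# Hypothetical helper: keys whatever object it is handed.
key_in_place <- function(d) {
  setkey(d, id)    # sorts and keys d by reference; no copy is made
  invisible(d)
}

sorted <- key_on_copy(dt)
key_before <- key(dt)   # dt is still unkeyed and in its original order

key_in_place(dt)
key_after <- key(dt)    # the caller's dt has now been keyed and reordered
```

Comparing key_before and key_after shows that only the second helper altered the caller's table, which is the situation a function lands in when it calls setkey() on .SD.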
From aragorn168b at gmail.com Sat Jun 14 04:34:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 04:34:12 +0200
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <007301cf8770$2bb569b0$83203d10$@verizon.net>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> <006701cf876a$c4f38310$4eda8930$@verizon.net> <007301cf8770$2bb569b0$83203d10$@verizon.net>
Message-ID:

The j-expression is evaluated from within C for each group (unless it's optimised with GForce, a new initiative in data.table). And eval(.SD) or eval(anything(.SD)) is costly.

You can get around it by listing the columns yourself and using .I instead, as follows:

test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
# 0.140 0.001 0.142

Takes about 0.14 seconds. An even faster way is:

system.time({
  ans = test[test[, .I[.N > 1], by=id]$V1] # (1)
  ans = ans[, .N, by=names(ans)] # (2)
  ans = ans[ans[, .I[.N > 1L], by=id]$V1] # (3)
})
# 0.026 0.000 0.027

The idea for the second case is:

(1) Remove all entries where there's just 1 row corresponding to that id.
(2) Aggregate this result by all the columns and get the number of rows in the column N (we won't have to use this column though).
(3) Now, if we aggregate by id and any id has just 1 row, it means that the id had more than 1 row (the filtering in step (1) ensures this), but all of them are the same and we don't need them. So we just filter for those where .N > 1L.

HTH
Arun
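As an illustration of the .I idiom used in this thread, here is a small sketch on made-up data. It assumes standard data.table behaviour: .I holds the row numbers of the current group within the full table, and an unnamed j result is returned in a column called V1. The table and variable names are hypothetical.

```r
library(data.table)

# Hypothetical toy data: ids "a" and "c" have two rows each, "b" has one.
dt <- data.table(id = c("a", "a", "b", "c", "c"),
                 x1 = c(1, 1, 5, 7, 8))

# Inner query: for each id, keep the group's row numbers only when the
# group has more than one row. Groups failing the test contribute no rows.
idx <- dt[, .I[.N > 1L], by = id]

# Outer subset: index the original table with the collected row numbers.
multi <- dt[idx$V1]
```

Because the inner query only touches the grouping column and row numbers, no per-group .SD has to be built, which is the cost the thread is trying to avoid.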
From aragorn168b at gmail.com Sat Jun 14 04:42:26 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 04:42:26 +0200
Subject: [datatable-help] data.table is asking for help
In-Reply-To:
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> <006701cf876a$c4f38310$4eda8930$@verizon.net> <007301cf8770$2bb569b0$83203d10$@verizon.net>
Message-ID:

A slightly simpler version of the 2nd solution is:

system.time({
  ans = test[, .N, by=names(test)]
  ans = ans[ans[, .I[.N > 1L], by=id]$V1]
})
# 0.019 0.000 0.019

The answers are identical; you can check this by doing:

ans[, N := NULL]
setkey(ans)
setkey(ut1)
identical(ans, ut1) # [1] TRUE

Arun
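To make the count-then-filter logic easier to follow, the two steps can be traced on a tiny table. This is a sketch on made-up data, not the poster's dataset; the names dt, counts and conflicts are hypothetical.

```r
library(data.table)

# Hypothetical toy data: "a" repeats one value set (consistent),
# "b" has a single row, "c" has two different value sets (a conflict).
dt <- data.table(id = c("a", "a", "b", "c", "c"),
                 x1 = c(1, 1, 5, 7, 8),
                 x2 = c(2, 2, 6, 9, 9))

# Step 1: one row per distinct (id, x1, x2) combination, with its count N.
counts <- dt[, .N, by = names(dt)]

# Step 2: an id that still has more than one row here has conflicting values.
conflicts <- counts[counts[, .I[.N > 1L], by = id]$V1]
conflicts[, N := NULL]   # drop the helper count column by reference
```

After step 1 the duplicated rows of "a" have collapsed into one, so only "c" survives the filter in step 2.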
URL: From aragorn168b at gmail.com Sat Jun 14 04:45:49 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 14 Jun 2014 04:45:49 +0200 Subject: [datatable-help] data.table is asking for help In-Reply-To: References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> <006701cf876a$c4f38310$4eda8930$@verizon.net> <007301cf8770$2bb569b0$83203d10$@verizon.net> Message-ID: Sorry. But we can simplify it even further: The first step is just unique(test). So, we can do: system.time({ ans = unique(test) ans = ans[ans[, .I[.N > 1L], by=id]$V1] }) # 0.016 0.000 0.016 Identical? setkey(ans) setkey(ut1) identical(ans, ut1) # [1] TRUE Arun From:?Arunkumar Srinivasan aragorn168b at gmail.com Reply:?Arunkumar Srinivasan aragorn168b at gmail.com Date:?June 14, 2014 at 4:42:31 AM To:?Ron Hylton rhylton at verizon.net, datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? Re: [datatable-help] data.table is asking for help A slightly simpler version of the 2nd solution is: system.time({ ans = test[, .N, by=names(test)] ans = ans[ans[, .I[.N > 1L], by=id]$V1] }) # 0.019 0.000 0.019 The answers are identical, you can check this by doing: ans[, N := NULL] setkey(ans) setkey(ut1) identical(ans, ut1) # [1] TRUE Arun From:?Arunkumar Srinivasan aragorn168b at gmail.com Reply:?Arunkumar Srinivasan aragorn168b at gmail.com Date:?June 14, 2014 at 4:34:15 AM To:?Ron Hylton rhylton at verizon.net, datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? Re: [datatable-help] data.table is asking for help The j-expression is evaluated from within C for each group (unless they?re optimised with GForce - a new initiative in data.table). And eval(.SD) or eval(anything(.SD)) is costly. You can get around it by listing the columns by yourself and using .I instead, as follows: test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1] # 0.140 0.001 0.142 Takes about 0.14 seconds. 
An even faster way is: system.time({ ans = test[test[, .I[.N > 1], by=id]$V1] # (1) ans = ans[, .N, by=names(ans)] # (2) ans = ans[ans[, .I[.N > 1L], by=id]$V1] # (3) }) # 0.026 0.000 0.027 The idea for the second case is: (1) remove all entries where there?s just 1 row corresponding to that id. (2) Aggregate this result by all the columns now and get the number of rows in the column N (we won?t have to use this column though). (3) Now, if we aggregate by id and if any id has just 1 row, then it?d mean that that id has had more than 1 rows (step (1) filtering ensures this), but all of them are same and we don?t need them. So we just filter for those where .N > 1L. HTH Arun From:?Ron Hylton rhylton at verizon.net Reply:?Ron Hylton rhylton at verizon.net Date:?June 14, 2014 at 3:30:55 AM To:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? Re: [datatable-help] data.table is asking for help The performance is what puzzles me; the results are correct so the warnings don?t matter, and not all the variations I?ve tried have warnings.? On the real dataset (~800,000 rows) datatable takes about 1.5 times longer than dataframe + ddply.? I expected it to be substantially faster. ? From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] Sent: Friday, June 13, 2014 8:57 PM To: Ron Hylton; datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help ? However there?s another aspect.? While I?m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD. `data.table` is?designed?for working with *really large* data sets in mind (> 100 or 200 GB in?memory even). 
And therefore, as a design feature, it trades in "referential transparency" for manipulating data objects *as efficient as possible* in terms of both *speed* and *memory usage* (most of the times they go hand-in-hand). This is perhaps the biggest design choice one needs to be aware of when working/choosing data.tables. It is possible to modify objects by reference using data.table - All the functions that begin with "set*" modify objects by reference. The only other non "set*" function is `:=` operator. ? HTH Arun From:?Ron Hylton rhylton at verizon.net Reply:?Ron Hylton rhylton at verizon.net Date:?June 14, 2014 at 2:52:04 AM To:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? Re: [datatable-help] data.table is asking for help I suspected it was something like this.? As one clarification, there is a setkey(test,id) before any setkey(.SD).?? If setkey(test,id) is changed to setkey(test) so all columns are in the original datatable key then the warning goes away. ? However there?s another aspect.? While I?m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD. ? From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] Sent: Friday, June 13, 2014 8:23 PM To: Ron Hylton; datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help ? Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well. This is a tricky one. It happens because you?re setting key on .SD which should normally not be allowed. What happens is, when you set key the first time, there?s no key set (here) and therefore key is set on all the columns x1, x2 and x3. Now, the next group (in the by=.) 
is passed to your function, it'll have the key already set to x1,x2,x3 (because setkey modifies the object by reference), but .SD has obtained new data corresponding to this group. And data.table sorts this data, knowing that it already has key set.. but if the key is set then the order must be 1:n. But it wouldn't be, as this data isn't sorted. data.table warns in those scenarios.. and that's why you get the warning.

To verify this, you can try:

conflictsTable1 <- function(f, address) {
    u <- unique(setkey(f))
    setattr(f, 'sorted', NULL)
    if (nrow(u) == 1) return(NULL)
    u
}

Basically, we set the key of f (which is equal to .SD as it's only modified by reference) to NULL everytime after.. so that .SD for the new group will not have the key set.

The ideal scenario here, IIUC, is that setkey(.SD) or things pointing to .SD should not be possible (locking binding doesn't seem to affect things done by reference..). .SD however should retain the key of the data.table, if a key was set, wherever possible.

Arun

From: Ron Hylton rhylton at verizon.net
Reply: Ron Hylton rhylton at verizon.net
Date: June 14, 2014 at 1:55:53 AM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] data.table is asking for help

The code below generates the warning:

In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

This is my first attempt at using datatable so I probably did something dumb, but maybe that's useful for someone. The first case is the one that gives the warnings.

I'm also surprised at the timings. I wrote the original algorithm using dataframe & ddply and I expected datatable to be substantially faster; the opposite is true.

The algorithm does the following:
Certain columns in the table are keys and others are values in the sense that each row with the same set of keys should have the same set of values. Find all the key sets for which this is not true and return the keys sets + conflicting value sets.

Insight into the performance would be appreciated.

Regards,
Ron

library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))

setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))

print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))

print(system.time(uf <- ddply(test, .(id), conflictsFrame)))

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From rhylton at verizon.net Sat Jun 14 04:58:16 2014 From: rhylton at verizon.net (Ron Hylton) Date: Fri, 13 Jun 2014 22:58:16 -0400 Subject: [datatable-help] data.table is asking for help In-Reply-To: References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net> <006701cf876a$c4f38310$4eda8930$@verizon.net> <007301cf8770$2bb569b0$83203d10$@verizon.net> Message-ID: <008301cf877c$7dc9def0$795d9cd0$@verizon.net> Thanks, that very helpful. From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] Sent: Friday, June 13, 2014 10:46 PM To: Ron Hylton; datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help Sorry. But we can simplify it even further: The first step is just unique(test). So, we can do: system.time({ ans = unique(test) ans = ans[ans[, .I[.N > 1L], by=id]$V1] }) # 0.016 0.000 0.016 Identical? setkey(ans) setkey(ut1) identical(ans, ut1) # [1] TRUE Arun From: Arunkumar Srinivasan aragorn168b at gmail.com Reply: Arunkumar Srinivasan aragorn168b at gmail.com Date: June 14, 2014 at 4:42:31 AM To: Ron Hylton rhylton at verizon.net , datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help A slightly simpler version of the 2nd solution is: system.time({ ans = test[, .N, by=names(test)] ans = ans[ans[, .I[.N > 1L], by=id]$V1] }) # 0.019 0.000 0.019 The answers are identical, you can check this by doing: ans[, N := NULL] setkey(ans) setkey(ut1) identical(ans, ut1) # [1] TRUE Arun From: Arunkumar Srinivasan aragorn168b at gmail.com Reply: Arunkumar Srinivasan aragorn168b at gmail.com Date: June 14, 2014 at 4:34:15 AM To: Ron Hylton rhylton at verizon.net , datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help The j-expression is evaluated from within C for each group (unless they?re optimised with GForce - a new initiative in 
data.table). And eval(.SD) or eval(anything(.SD)) is costly. You can get around it by listing the columns by yourself and using .I instead, as follows: test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1] # 0.140 0.001 0.142 Takes about 0.14 seconds. _____ An even faster way is: system.time({ ans = test[test[, .I[.N > 1], by=id]$V1] # (1) ans = ans[, .N, by=names(ans)] # (2) ans = ans[ans[, .I[.N > 1L], by=id]$V1] # (3) }) # 0.026 0.000 0.027 The idea for the second case is: (1) remove all entries where there?s just 1 row corresponding to that id. (2) Aggregate this result by all the columns now and get the number of rows in the column N (we won?t have to use this column though). (3) Now, if we aggregate by id and if any id has just 1 row, then it?d mean that that id has had more than 1 rows (step (1) filtering ensures this), but all of them are same and we don?t need them. So we just filter for those where .N > 1L. HTH Arun From: Ron Hylton rhylton at verizon.net Reply: Ron Hylton rhylton at verizon.net Date: June 14, 2014 at 3:30:55 AM To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help The performance is what puzzles me; the results are correct so the warnings don?t matter, and not all the variations I?ve tried have warnings. On the real dataset (~800,000 rows) datatable takes about 1.5 times longer than dataframe + ddply. I expected it to be substantially faster. From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] Sent: Friday, June 13, 2014 8:57 PM To: Ron Hylton; datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help However there?s another aspect. While I?m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD. 
`data.table` is designed for working with *really large* data sets in mind (> 100 or 200 GB in memory even). And therefore, as a design feature, it trades in "referential transparency" for manipulating data objects *as efficient as possible* in terms of both *speed* and *memory usage* (most of the times they go hand-in-hand). This is perhaps the biggest design choice one needs to be aware of when working/choosing data.tables. It is possible to modify objects by reference using data.table - All the functions that begin with "set*" modify objects by reference. The only other non "set*" function is `:=` operator. HTH Arun From: Ron Hylton rhylton at verizon.net Reply: Ron Hylton rhylton at verizon.net Date: June 14, 2014 at 2:52:04 AM To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help I suspected it was something like this. As one clarification, there is a setkey(test,id) before any setkey(.SD). If setkey(test,id) is changed to setkey(test) so all columns are in the original datatable key then the warning goes away. However there?s another aspect. While I?m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD. From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] Sent: Friday, June 13, 2014 8:23 PM To: Ron Hylton; datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] data.table is asking for help Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well. This is a tricky one. It happens because you?re setting key on .SD which should normally not be allowed. What happens is, when you set key the first time, there?s no key set (here) and therefore key is set on all the columns x1, x2 and x3. Now, the next group (in the by=.) 
is passed to your function, it?ll have the key already set to x1,x2,x3 (because setkey modifies the object by reference), but .SD has obtained new data corresponding to this group. And data.table sorts this data, knowing that it already has key set.. but if the key is set then the order must be 1:n. But it wouldn?t be, as this data isn?t sorted. data.table warns in those scenarios.. and that?s why you get the warning. To verify this, you can try: conflictsTable1 <- function(f, address) { u <- unique(setkey(f)) setattr(f, 'sorted', NULL) if (nrow(u) == 1) return(NULL) u } Basically, we set the key of f (which is equal to .SD as it?s only modified by reference) to NULL everytime after.. so that .SD for the new group will not have the key set. The ideal scenario here, IIUC, is that setkey(.SD) or things pointing to .SD should not be possible (locking binding doesn?t seem to affect things done by reference..). .SD however should retain the key of the data.table, if a key was set, wherever possible. Arun From: Ron Hylton rhylton at verizon.net Reply: Ron Hylton rhylton at verizon.net Date: June 14, 2014 at 1:55:53 AM To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] data.table is asking for help The code below generates the warning: In setkeyv(x, cols, verbose = verbose) : Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. This is my first attempt at using datatable so I probably did something dumb, but maybe that?s useful for someone. The first case is the one that gives the warnings. I?m also surprised at the timings. I wrote the original algorithm using dataframe & ddply and I expected datatable to be substantially faster; the opposite is true. 
The algorithm does the following: Certain columns in the table are keys and others are values in the sense that each row with the same set of keys should have the same set of values. Find all the key sets for which this is not true and return the keys sets + conflicting value sets. Insight into the performance would be appreciated. Regards, Ron library(data.table) library(plyr) conflictsTable1 <- function(f) { u <- unique(setkey(f)) if (nrow(u) == 1) return(NULL) u } conflictsTable2 <- function(f) { u <- unique(f) if (nrow(u) == 1) return(NULL) u } conflictsFrame <- function(f) { u <- unique(f) if (nrow(u) == 1) return(NULL) u } N <- 10000 test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N)) setkey(test,id) print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id])) print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id])) print(system.time(uf <- ddply(test, .(id), conflictsFrame))) _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From roundsjeremiah at gmail.com Sat Jun 14 07:23:10 2014 From: roundsjeremiah at gmail.com (jeremiah rounds) Date: Sat, 14 Jun 2014 01:23:10 -0400 Subject: [datatable-help] Are you aware of this? Message-ID: As a fan of your work I have always been curious if you are aware of this? I find it causes new users to make mistakes. 
> dt = list() > dt$x = 1:10 > dt$y = letters[10:1] > dt = as.data.table(as.data.frame(dt)) > dt x y 1: 1 j 2: 2 i 3: 3 h 4: 4 g 5: 5 f 6: 6 e 7: 7 d 8: 8 c 9: 9 b 10: 10 a > x0 = dt$x > x1 = dt$x > x0[1] = 11 > setkeyv(dt,"y") > x0 [1] 11 2 3 4 5 6 7 8 9 10 > x1 [1] 10 9 8 7 6 5 4 3 2 1 > x1 == x0 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE x0 and x1 have assignments at the same exact time, and since R data.frame's will not do this, it lures people into thinking they are then identical and distinct as they are with data.frame's. My theory is they are not actually copied: they are promised. When x0 has its index 1 changed it induces a copy distinct from dt$x, but x1 has had no operation on it so it refers to dt$x with its promise. Setting the key on dt reorders it and since x1 still hasn't been evaluated it now matches the order of dt. I found new users getting unpredictable results because they would try to use a data.table as a data.frame and induce this with sorts. If you thought you copied something in a particular order in dt by doing the assigning ahead of the setkeyv you make a mistake. You don't really expect x1 assigned maybe a page of code above to have its order changed by a setkeyv. You do if you think about C pointers and references, but in R you really don't think that way. Many R users don't even know what a pointer is. 
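A minimal sketch of one way to guard against the surprise described above, assuming data.table is attached: take an explicit copy of the column with copy() before calling setkeyv(). The names DT, x_ref and x_copy are illustrative only and do not come from the original post.

```r
library(data.table)

# Illustrative table with the same shape as the one in the post
DT <- data.table(x = 1:10, y = letters[10:1])

x_ref  <- DT$x        # plain assignment: may still share memory with the column in DT
x_copy <- copy(DT$x)  # explicit deep copy, independent of DT

setkeyv(DT, "y")      # reorders the columns of DT in place (by reference)

print(x_copy)         # keeps the order it had when it was copied
print(DT$x)           # follows the new key order
```

Whether x_ref changes along with DT depends on whether it still shares memory with the column, which is exactly the behaviour the post describes; the explicit copy avoids relying on that.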
Thanks,
Jeremiah

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] locfit_1.5-9.1       edgeR_3.4.2          limma_3.18.13
[4] data.table_1.9.2     GenomicRanges_1.14.4 XVector_0.2.0
[7] IRanges_1.20.7       BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] grid_3.0.1      lattice_0.20-15 plyr_1.8.1      Rcpp_0.11.1
[5] reshape2_1.4    stats4_3.0.1    stringr_0.6.2   tools_3.0.1

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Sat Jun 14 07:35:16 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 07:35:16 +0200
Subject: [datatable-help] Are you aware of this?
In-Reply-To:
References:
Message-ID:

Jeremiah,

Thanks. Just a few hours ago, I answered a similar question to a post from Ron (pasted below):

`data.table` is designed for working with *really large* data sets in mind (> 100 or 200 GB in memory even). And therefore, as a design feature, it trades in "referential transparency" for manipulating data objects *as efficient as possible* in terms of both *speed* and *memory usage* (most of the times they go hand-in-hand). This is perhaps the biggest design choice one needs to be aware of when working/choosing data.tables.

It is possible to modify objects by reference using data.table - All the functions that begin with "set*" modify objects by reference. The only other non "set*" function is `:=` operator.

There's a pending feature request on adding this point (on explicit copy) to the FAQs, which we've not gotten to, yet.
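The by-reference behaviour quoted above can be sketched as follows. This is only a minimal illustration, assuming data.table is attached; the names DT, alias, snapshot and flag are made up for the example.

```r
library(data.table)

DT <- data.table(id = c("b", "a", "c"), v = 1:3)

alias    <- DT        # plain assignment: another name for the same table, no copy
snapshot <- copy(DT)  # explicit copy: an independent table

DT[, flag := v > 1L]  # `:=` adds the column in place
setkey(DT, id)        # set* functions also modify in place

print(names(alias))     # expected to reflect the changes made through DT
print(names(snapshot))  # still only the original columns
print(key(snapshot))    # the copy was never keyed
```

The point of the sketch is the last two lines: only the table created with copy() is guaranteed to be insulated from later `:=` and set* calls on DT.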
To our knowledge, people do overcome this difference quite quickly. It's not necessary to know about pointers to understand that the object gets modified in-place.

I'm not a python user at all, but recently came to know that this is also a feature there: https://docs.python.org/2/library/copy.html

But point taken. That explicit copy will be required will be added to the FAQs.

Arun

From: jeremiah rounds roundsjeremiah at gmail.com
Reply: jeremiah rounds roundsjeremiah at gmail.com
Date: June 14, 2014 at 7:23:22 AM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] Are you aware of this?

As a fan of your work I have always been curious if you are aware of this? I find it causes new users to make mistakes.

> dt = list()
> dt$x = 1:10
> dt$y = letters[10:1]
> dt = as.data.table(as.data.frame(dt))
> dt
     x y
 1:  1 j
 2:  2 i
 3:  3 h
 4:  4 g
 5:  5 f
 6:  6 e
 7:  7 d
 8:  8 c
 9:  9 b
10: 10 a
> x0 = dt$x
> x1 = dt$x
> x0[1] = 11
> setkeyv(dt,"y")
> x0
 [1] 11  2  3  4  5  6  7  8  9 10
> x1
 [1] 10  9  8  7  6  5  4  3  2  1
> x1 == x0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

x0 and x1 have assignments at the same exact time, and since R data.frame's will not do this, it lures people into thinking they are then identical and distinct as they are with data.frame's. My theory is they are not actually copied: they are promised. When x0 has its index 1 changed it induces a copy distinct from dt$x, but x1 has had no operation on it so it refers to dt$x with its promise. Setting the key on dt reorders it and since x1 still hasn't been evaluated it now matches the order of dt.

I found new users getting unpredictable results because they would try to use a data.table as a data.frame and induce this with sorts. If you thought you copied something in a particular order in dt by doing the assigning ahead of the setkeyv you make a mistake.
You don't really expect x1 assigned maybe a page of code above to have its order changed by a setkeyv. You do if you think about C pointers and references, but in R you really don't think that way. Many R users don't even know what a pointer is.

Thanks,
Jeremiah

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] locfit_1.5-9.1       edgeR_3.4.2          limma_3.18.13
[4] data.table_1.9.2     GenomicRanges_1.14.4 XVector_0.2.0
[7] IRanges_1.20.7       BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] grid_3.0.1      lattice_0.20-15 plyr_1.8.1      Rcpp_0.11.1
[5] reshape2_1.4    stats4_3.0.1    stringr_0.6.2   tools_3.0.1

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From my.r.help at gmail.com Sun Jun 15 05:01:35 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sun, 15 Jun 2014 11:01:35 +0800
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To:
References: <5389541B.8040006@gmail.com>
Message-ID: <539D0C8F.1080005@gmail.com>

Devs,

Is this a bug?
It works in 1.9.2 but not in the 1.9.3 development version: DT <- data.table(a = 1:4, b = 8:5) for (i in c("a", "b")) print(DT[order(DT[, i, with = FALSE])]) Error in forder(DT, DT[, i, with = FALSE]) : Column '1' is type 'list' which is not supported for ordering currently. Thanks, M On 05/31/2014 12:44 PM, G See wrote: > Hi Michael, > > I would use get() > > DT <- data.table(a = 1:4, b = 8:5) > for (i in c("a", "b")) > print(DT[order(get(i))]) > > For what it's worth, your solution doesn't seem to work in data.table > 1.9.3 (svn rev. 1278): > >> for (i in c("a", "b")) > + print(DT[order(DT[, i, with = FALSE])]) > Error in forder(DT, DT[, i, with = FALSE]) : > Column '1' is type 'list' which is not supported for ordering currently. > > > HTH, > Garrett > > On Fri, May 30, 2014 at 11:01 PM, Michael Smith wrote: >> All, >> >> I'm trying to order the rows according to several columns at a time: >> >> DT <- data.table(a = 1:4, b = 8:5) >> for (i in c("a", "b")) >> print(DT[order(i), with = FALSE]) >> >> It doesn't work, since `with` seems to be about the `j` argument, but >> not the `i` argument, according to `?data.table`. >> >> I found the following workaround, but wonder whether there is a more >> elegant way to do it: >> >> for (i in c("a", "b")) >> print(DT[order(DT[, i, with = FALSE])]) >> >> Thanks, >> M >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From aragorn168b at gmail.com Sun Jun 15 10:11:45 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 15 Jun 2014 10:11:45 +0200 Subject: [datatable-help] =?utf-8?Q?=60with=3DF=60_?=in the `i` Argument In-Reply-To: <539D0C8F.1080005@gmail.com> References: <5389541B.8040006@gmail.com> <539D0C8F.1080005@gmail.com> Message-ID: Michael, Thanks. Replacing order with base:::order seems to give the right result. 
So, I'd say this is a case that seem to have escaped current tests. So, yes, bug. Could you please file as one here?

Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 15, 2014 at 5:02:46 AM
To: G See gsee000 at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] `with=F` in the `i` Argument

Devs,

Is this a bug? It works in 1.9.2 but not in the 1.9.3 development version:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
    print(DT[order(DT[, i, with = FALSE])])

Error in forder(DT, DT[, i, with = FALSE]) :
  Column '1' is type 'list' which is not supported for ordering currently.

Thanks,
M

On 05/31/2014 12:44 PM, G See wrote:
> Hi Michael,
>
> I would use get()
>
> DT <- data.table(a = 1:4, b = 8:5)
> for (i in c("a", "b"))
>     print(DT[order(get(i))])
>
> For what it's worth, your solution doesn't seem to work in data.table
> 1.9.3 (svn rev. 1278):
>
>> for (i in c("a", "b"))
> +     print(DT[order(DT[, i, with = FALSE])])
> Error in forder(DT, DT[, i, with = FALSE]) :
>   Column '1' is type 'list' which is not supported for ordering currently.
>
>
> HTH,
> Garrett
>
> On Fri, May 30, 2014 at 11:01 PM, Michael Smith wrote:
>> All,
>>
>> I'm trying to order the rows according to several columns at a time:
>>
>> DT <- data.table(a = 1:4, b = 8:5)
>> for (i in c("a", "b"))
>>     print(DT[order(i), with = FALSE])
>>
>> It doesn't work, since `with` seems to be about the `j` argument, but
>> not the `i` argument, according to `?data.table`.
>> >> I found the following workaround, but wonder whether there is a more >> elegant way to do it: >> >> for (i in c("a", "b")) >> print(DT[order(DT[, i, with = FALSE])]) >> >> Thanks, >> M >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Sun Jun 15 11:15:50 2014 From: my.r.help at gmail.com (Michael Smith) Date: Sun, 15 Jun 2014 17:15:50 +0800 Subject: [datatable-help] `with=F` in the `i` Argument In-Reply-To: References: <5389541B.8040006@gmail.com> <539D0C8F.1080005@gmail.com> Message-ID: <539D6446.3060004@gmail.com> Hi Arun, Filed here: https://github.com/Rdatatable/data.table/issues/696 Thanks, M On 06/15/2014 04:11 PM, Arunkumar Srinivasan wrote: > Michael, > > Thanks. Replacing |order| with |base:::order| seems to give the right > result. So, I?d say this is a case that seem to have escaped current > tests. So, yes, bug. Could you please file as one here > ? > > > Arun > > From: Michael Smith my.r.help at gmail.com > Reply: Michael Smith my.r.help at gmail.com > Date: June 15, 2014 at 5:02:46 AM > To: G See gsee000 at gmail.com > Cc: datatable-help at lists.r-forge.r-project.org > datatable-help at lists.r-forge.r-project.org > > Subject: Re: [datatable-help] `with=F` in the `i` Argument > >> Devs, >> >> Is this a bug? 
It works in 1.9.2 but not in the 1.9.3 development >> version: >> >> DT <- data.table(a = 1:4, b = 8:5) >> for (i in c("a", "b")) >> print(DT[order(DT[, i, with = FALSE])]) >> >> Error in forder(DT, DT[, i, with = FALSE]) : >> Column '1' is type 'list' which is not supported for ordering currently. >> >> >> Thanks, >> >> M >> >> >> On 05/31/2014 12:44 PM, G See wrote: >> > Hi Michael, >> > >> > I would use get() >> > >> > DT <- data.table(a = 1:4, b = 8:5) >> > for (i in c("a", "b")) >> > print(DT[order(get(i))]) >> > >> > For what it's worth, your solution doesn't seem to work in data.table >> > 1.9.3 (svn rev. 1278): >> > >> >> for (i in c("a", "b")) >> > + print(DT[order(DT[, i, with = FALSE])]) >> > Error in forder(DT, DT[, i, with = FALSE]) : >> > Column '1' is type 'list' which is not supported for ordering currently. >> > >> > >> > HTH, >> > Garrett >> > >> > On Fri, May 30, 2014 at 11:01 PM, Michael Smith wrote: >> >> All, >> >> >> >> I'm trying to order the rows according to several columns at a time: >> >> >> >> DT <- data.table(a = 1:4, b = 8:5) >> >> for (i in c("a", "b")) >> >> print(DT[order(i), with = FALSE]) >> >> >> >> It doesn't work, since `with` seems to be about the `j` argument, but >> >> not the `i` argument, according to `?data.table`. 
>> >> >> >> I found the following workaround, but wonder whether there is a more >> >> elegant way to do it: >> >> >> >> for (i in c("a", "b")) >> >> print(DT[order(DT[, i, with = FALSE])]) >> >> >> >> Thanks, >> >> M >> >> _______________________________________________ >> >> datatable-help mailing list >> >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > From aragorn168b at gmail.com Sun Jun 15 11:16:42 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 15 Jun 2014 11:16:42 +0200 Subject: [datatable-help] =?utf-8?Q?=60with=3DF=60_?=in the `i` Argument In-Reply-To: <539D6446.3060004@gmail.com> References: <5389541B.8040006@gmail.com> <539D0C8F.1080005@gmail.com> <539D6446.3060004@gmail.com> Message-ID: Already got the notification. Thanks Michael. Arun From:?Michael Smith my.r.help at gmail.com Reply:?Michael Smith my.r.help at gmail.com Date:?June 15, 2014 at 11:15:55 AM To:?Arunkumar Srinivasan aragorn168b at gmail.com Cc:?G See gsee000 at gmail.com, datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? Re: [datatable-help] `with=F` in the `i` Argument Hi Arun, Filed here: https://github.com/Rdatatable/data.table/issues/696 Thanks, M On 06/15/2014 04:11 PM, Arunkumar Srinivasan wrote: > Michael, > > Thanks. Replacing |order| with |base:::order| seems to give the right > result. So, I?d say this is a case that seem to have escaped current > tests. So, yes, bug. Could you please file as one here > ? 
> > > Arun > > From: Michael Smith my.r.help at gmail.com > Reply: Michael Smith my.r.help at gmail.com > Date: June 15, 2014 at 5:02:46 AM > To: G See gsee000 at gmail.com > Cc: datatable-help at lists.r-forge.r-project.org > datatable-help at lists.r-forge.r-project.org > > Subject: Re: [datatable-help] `with=F` in the `i` Argument > >> Devs, >> >> Is this a bug? It works in 1.9.2 but not in the 1.9.3 development >> version: >> >> DT <- data.table(a = 1:4, b = 8:5) >> for (i in c("a", "b")) >> print(DT[order(DT[, i, with = FALSE])]) >> >> Error in forder(DT, DT[, i, with = FALSE]) : >> Column '1' is type 'list' which is not supported for ordering currently. >> >> >> Thanks, >> >> M >> >> >> On 05/31/2014 12:44 PM, G See wrote: >> > Hi Michael, >> > >> > I would use get() >> > >> > DT <- data.table(a = 1:4, b = 8:5) >> > for (i in c("a", "b")) >> > print(DT[order(get(i))]) >> > >> > For what it's worth, your solution doesn't seem to work in data.table >> > 1.9.3 (svn rev. 1278): >> > >> >> for (i in c("a", "b")) >> > + print(DT[order(DT[, i, with = FALSE])]) >> > Error in forder(DT, DT[, i, with = FALSE]) : >> > Column '1' is type 'list' which is not supported for ordering currently. >> > >> > >> > HTH, >> > Garrett >> > >> > On Fri, May 30, 2014 at 11:01 PM, Michael Smith wrote: >> >> All, >> >> >> >> I'm trying to order the rows according to several columns at a time: >> >> >> >> DT <- data.table(a = 1:4, b = 8:5) >> >> for (i in c("a", "b")) >> >> print(DT[order(i), with = FALSE]) >> >> >> >> It doesn't work, since `with` seems to be about the `j` argument, but >> >> not the `i` argument, according to `?data.table`. 
>> >> >> >> I found the following workaround, but wonder whether there is a more >> >> elegant way to do it: >> >> >> >> for (i in c("a", "b")) >> >> print(DT[order(DT[, i, with = FALSE])]) >> >> >> >> Thanks, >> >> M >> >> _______________________________________________ >> >> datatable-help mailing list >> >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gsee000 at gmail.com Sun Jun 15 17:44:58 2014 From: gsee000 at gmail.com (G See) Date: Sun, 15 Jun 2014 10:44:58 -0500 Subject: [datatable-help] subsetting by second key Message-ID: Hi, I want to subset a data.table using only its second key, which is demonstrated here http://stackoverflow.com/questions/15597685/subsetting-data-table-by-2nd-column-only-of-a-2-column-key-using-binary-search/15597713#15597713 However, I need to subset with more than one value in the secondary key Is this warning expected? What exactly is it telling me? library(data.table) DT <- data.table(iris, key="Species,Petal.Width") DT[J(unique(Species), c(1.5, 2.0)), nomatch=0L] # Sepal.Length Sepal.Width Petal.Length Petal.Width Species #1: 6.0 2.2 5.0 1.5 virginica #2: 6.3 2.8 5.1 1.5 virginica #Warning message: #In as.data.table.list(i) : # Item 2 is of size 2 but maximum size is 3 (recycled leaving a remainder of 1 items) It looks like I can get what I want with either of these; can you confirm that both of these will always return the same result? 
DT[Petal.Width %in% c(1.5, 2.0)] # vector scan
DT[CJ(unique(Species), c(1.5, 2.0)), nomatch=0L]

Thanks,
Garrett
From aragorn168b at gmail.com Sun Jun 15 17:56:05 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 17:56:05 +0200
Subject: [datatable-help] subsetting by second key
In-Reply-To:
References:
Message-ID:

unique(Species) is of length 3, whereas the 2nd entry c(1.5, 2) is of
length 2.

J in J(.) is replaced with list(.) internally (using lazy evaluation),
following which it's converted to a data.table using
as.data.table(list(.)).

And here your list is:

list(c("setosa", "versicolor", "virginica") , c(1.5, 2.0))

which results in the warning because it has to recycle to convert it to
a data.table.

In the example you've linked, J(.) and CJ(.) will return the same result
(because there's just one value in 2nd column). So, the results don't
change. But the general expression is to use CJ(.) along with
nomatch=0L, as you've done.

Those two expressions are equivalent, yes.

Arun
From gsee000 at gmail.com Sun Jun 15 18:03:13 2014
From: gsee000 at gmail.com (G See)
Date: Sun, 15 Jun 2014 11:03:13 -0500
Subject: [datatable-help] subsetting by second key
In-Reply-To:
References:
Message-ID:

Thank you Arun. Should that answer be updated to use CJ(.), then? Is
there an advantage to using J(.) over CJ(.) if you know that you're
only looking for one value in the second column?
From aragorn168b at gmail.com Sun Jun 15 18:04:57 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 18:04:57 +0200
Subject: [datatable-help] subsetting by second key
In-Reply-To:
References:
Message-ID:

Sure, you can update it. No, there's no advantage. I just didn't think
of CJ at the time (probably because I tried it with J and it worked,
because it's just 1 value for the 2nd key col).

Arun
From aragorn168b at gmail.com Sun Jun 15 18:06:34 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 18:06:34 +0200
Subject: [datatable-help] subsetting by second key
In-Reply-To:
References:
Message-ID:

Note that `CJ` by default sorts the columns and sets key to all the
columns, which means the result would be sorted as well. If that's not
desirable, you should be using `CJ` with `sorted=FALSE`.

Arun
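The J(.) versus CJ(.) point from this thread can be sketched on the same `iris` example. This is only an illustration under a few assumptions: the names `lookup`, `by_key` and `by_scan` are made up here, the key is passed as a character vector instead of the comma-separated string used in the thread, and the recycling case is described in a comment rather than run, since its behaviour may differ between data.table versions.

```r
library(data.table)

# Same setup as in the thread, with the key given as a character vector.
DT <- data.table(iris, key = c("Species", "Petal.Width"))

# J(a, b) is list(a, b): the columns are paired up row by row, so a
# length-3 vector and a length-2 vector do not line up (hence the
# recycling warning quoted in the thread). CJ(a, b) builds every
# combination instead: 3 species x 2 widths = 6 lookup rows.
lookup <- CJ(unique(DT$Species), c(1.5, 2.0))

by_key  <- DT[lookup, nomatch = 0L]            # binary search on the key
by_scan <- DT[Petal.Width %in% c(1.5, 2.0)]    # vector scan

nrow(lookup)                      # 6
nrow(by_key) == nrow(by_scan)     # both subsets select the same rows
```

Because `CJ` sorts its result by default, `by_key` comes back in key order; passing `sorted = FALSE` would keep the order in which the values were supplied.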
From mdowle at mdowle.plus.com Tue Jun 17 19:03:09 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Tue, 17 Jun 2014 18:03:09 +0100
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <008301cf877c$7dc9def0$795d9cd0$@verizon.net>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
 <008301cf877c$7dc9def0$795d9cd0$@verizon.net>
Message-ID: <53A074CD.3060805@mdowle.plus.com>

Hi Ron,

Thanks for highlighting this. Two changes now in v1.9.3 on GitHub:

* `setkey` on `.SD` is now an error, rather than warnings for each group
  about rebuilding the key. The new error is similar to when attempting
  to use `:=` in a `.SD` subquery: ".SD is locked. Using set*() functions
  on .SD is reserved for possible future use; a tortuously flexible way
  to modify the original data by group." Thanks to Ron Hylton for
  highlighting the issue on datatable-help here.

* Looping calls to `unique(DT)` such as in `DT[,unique(.SD),by=group]`
  is now faster by avoiding internal overhead of calling `[.data.table`.
  Thanks again to Ron Hylton for highlighting in the same thread. His
  example is reduced from 28 sec to 9 sec, with identical results.

I now get the following (on my slow netbook) with no changes to your code.

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id])) # were warnings, now error
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id])) # was 28s, now 9s
print(system.time(uf <- ddply(test, .(id), conflictsFrame))) # 13s

This just fixes the surprises, basically. Clearly Arun uses data.table
in a better way which is orders of magnitude faster.

Matt

On 14/06/14 03:58, Ron Hylton wrote:
>
> Thanks, that very helpful.
>
> *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 10:46 PM
> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
> *Subject:* Re: [datatable-help] data.table is asking for help
>
> Sorry.
> But we can simplify it even further:
>
> The first step is just `unique(test)`. So, we can do:
>
> system.time({
> ans = unique(test)
> ans = ans[ans[, .I[.N > 1L], by=id]$V1]
> })
> # 0.016 0.000 0.016
>
> Identical?
>
> setkey(ans)
> setkey(ut1)
> identical(ans, ut1) # [1] TRUE
>
> Arun
>
> From: Arunkumar Srinivasan aragorn168b at gmail.com
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
> Date: June 14, 2014 at 4:42:31 AM
> To: Ron Hylton rhylton at verizon.net, datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] data.table is asking for help
>
> A slightly simpler version of the 2nd solution is:
>
> system.time({
> ans = test[, .N, by=names(test)]
> ans = ans[ans[, .I[.N > 1L], by=id]$V1]
> })
> # 0.019 0.000 0.019
>
> The answers are identical, you can check this by doing:
>
> ans[, N := NULL]
> setkey(ans)
> setkey(ut1)
> identical(ans, ut1) # [1] TRUE
>
> Arun
>
> From: Arunkumar Srinivasan aragorn168b at gmail.com
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
> Date: June 14, 2014 at 4:34:15 AM
> To: Ron Hylton rhylton at verizon.net, datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] data.table is asking for help
>
> The j-expression is evaluated from within C for each group
> (unless they're optimised with GForce - a new initiative in
> data.table). And `eval(.SD)` or `eval(anything(.SD))` is costly.
>
> You can get around it by listing the columns by yourself and
> using `.I` instead, as follows:
>
> test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
> # 0.140 0.001 0.142
>
> Takes about 0.14 seconds.
>
> ------------------------------------------------------------------------
>
> An even faster way is:
>
> system.time({
> ans = test[test[, .I[.N > 1], by=id]$V1] # (1)
> ans = ans[, .N, by=names(ans)] # (2)
> ans = ans[ans[, .I[.N > 1L], by=id]$V1] # (3)
> })
> # 0.026 0.000 0.027
>
> The idea for the second case is:
>
> (1) remove all entries where there's just 1 row corresponding
>     to that `id`.
> (2) Aggregate this result by all the columns now and get the
>     number of rows in the column `N` (we won't have to use this
>     column though).
> (3) Now, if we aggregate by `id` and if any id has just 1 row,
>     then it'd mean that that `id` has had more than 1 rows (step
>     (1) filtering ensures this), but all of them are same and we
>     don't need them. So we just filter for those where .N > 1L.
>
> HTH
>
> Arun
>
> From: Ron Hylton rhylton at verizon.net
> Reply: Ron Hylton rhylton at verizon.net
> Date: June 14, 2014 at 3:30:55 AM
> To: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] data.table is asking for help
>
> The performance is what puzzles me; the results are
> correct so the warnings don't matter, and not all the
> variations I've tried have warnings. On the real dataset
> (~800,000 rows) datatable takes about 1.5 times longer
> than dataframe + ddply. I expected it to be substantially
> faster.
>
> *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 8:57 PM
> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
> *Subject:* Re: [datatable-help] data.table is asking for help
>
> However there's another aspect. While I'm relatively
> new to R my understanding is that a function argument
> should be modifiable within the function body without
> affecting the caller, which perhaps conflicts with the
> behavior of .SD.
>
> `data.table` is designed for working with *really large*
> data sets in mind (> 100 or 200 GB in memory even). And
> therefore, as a design feature, it trades in "referential
> transparency" for manipulating data objects *as efficient
> as possible* in terms of both *speed* and *memory usage*
> (most of the times they go hand-in-hand).
>
> This is perhaps the biggest design choice one needs to be
> aware of when working/choosing data.tables. It is possible
> to modify objects by reference using data.table - All the
> functions that begin with "set*" modify objects by
> reference. The only other non "set*" function is `:=`
> operator.
>
> HTH
>
> Arun
>
> From: Ron Hylton rhylton at verizon.net
> Reply: Ron Hylton rhylton at verizon.net
> Date: June 14, 2014 at 2:52:04 AM
> To: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] data.table is asking for help
>
> I suspected it was something like this. As one
> clarification, there is a setkey(test,id) before any
> setkey(.SD). If setkey(test,id) is changed to
> setkey(test) so all columns are in the original
> datatable key then the warning goes away.
>
> However there's another aspect. While I'm relatively
> new to R my understanding is that a function argument
> should be modifiable within the function body without
> affecting the caller, which perhaps conflicts with the
> behavior of .SD.
>
> *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 8:23 PM
> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
> *Subject:* Re: [datatable-help] data.table is asking for help
>
> Nicely reproducible post. Reproducible in v1.9.3
> (latest commit) as well.
>
> This is a tricky one. It happens because you're
> setting key on `.SD` which should normally not be
> allowed.
> What happens is, when you set key the first
> time, there's no key set (here) and therefore key is
> set on all the columns `x1`, `x2` and `x3`.
>
> Now, the next group (in the `by=.`) is passed to your
> function, it'll have the `key` already set to
> `x1,x2,x3` (because `setkey` modifies the object by
> reference), but `.SD` has obtained *new* data
> corresponding to /this/ group. And `data.table` sorts
> this data, knowing that it already has key set.. but
> if the key is set then the order must be 1:n. But it
> wouldn't be, as this data isn't sorted. `data.table`
> warns in those scenarios.. and that's why you get the
> warning.
>
> To verify this, you can try:
>
> conflictsTable1 <- function(f, address) {
>   u <- unique(setkey(f))
>   setattr(f, 'sorted', NULL)
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> Basically, we set the key of `f` (which is equal to
> `.SD` as it's only modified by reference) to `NULL`
> everytime after.. so that `.SD` for the new group will
> not have the key set.
>
> The ideal scenario here, IIUC, is that `setkey(.SD)`
> or things pointing to `.SD` should not be possible
> (locking binding doesn't seem to affect things done by
> reference..). `.SD` however should retain the key of
> the data.table, if a key was set, wherever possible.
>
> Arun
>
> From: Ron Hylton rhylton at verizon.net
> Reply: Ron Hylton rhylton at verizon.net
> Date: June 14, 2014 at 1:55:53 AM
> To: datatable-help at lists.r-forge.r-project.org
> Subject: [datatable-help] data.table is asking for help
>
> The code below generates the warning:
>
> In setkeyv(x, cols, verbose = verbose) :
>   Already keyed by this key but had invalid row
>   order, key rebuilt. If you didn't go under the
>   hood please let datatable-help know so the root
>   cause can be fixed.
>
> This is my first attempt at using datatable so I
> probably did something dumb, but maybe that's
> useful for someone.
> The first case is the one
> that gives the warnings.
>
> I'm also surprised at the timings. I wrote the
> original algorithm using dataframe & ddply and I
> expected datatable to be substantially faster; the
> opposite is true.
>
> The algorithm does the following: Certain columns
> in the table are keys and others are values in the
> sense that each row with the same set of keys
> should have the same set of values. Find all the
> key sets for which this is not true and return the
> keys sets + conflicting value sets.
>
> Insight into the performance would be appreciated.
>
> Regards,
>
> Ron
>
> library(data.table)
> library(plyr)
>
> conflictsTable1 <- function(f) {
>   u <- unique(setkey(f))
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> conflictsTable2 <- function(f) {
>   u <- unique(f)
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> conflictsFrame <- function(f) {
>   u <- unique(f)
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> N <- 10000
> test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>                    x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
> setkey(test,id)
>
> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
> print(system.time(uf <- ddply(test, .(id), conflictsFrame)))
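The last simplification quoted in the thread above (drop duplicated rows, then keep the ids that still have more than one row) can be sketched on a small hand-made table. The data and the names `dedup` and `conflicts` are made up for illustration, and `by = names(test)` is spelled out as an assumption so that the result does not depend on which default `unique()` uses in a given data.table version.

```r
library(data.table)

# Toy data: id "a" has two identical rows (no conflict), id "b" has two
# different value sets (a conflict), id "c" has a single row.
test <- data.table(
  id = c("a", "a", "b", "b", "c"),
  x1 = c(1, 1, 1, 2, 3),
  x2 = c(5, 5, 5, 5, 5)
)

# Step 1: remove rows that are duplicated across all columns.
dedup <- unique(test, by = names(test))

# Step 2: an id that still has more than one row carries conflicting
# values; .I collects the row numbers of those groups.
conflicts <- dedup[dedup[, .I[.N > 1L], by = id]$V1]

conflicts   # only the two rows for id "b" remain
```

Both steps run once over the whole table, which avoids evaluating a function on `.SD` for every group.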
From my.r.help at gmail.com Wed Jun 18 02:34:14 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Wed, 18 Jun 2014 08:34:14 +0800
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <53A074CD.3060805@mdowle.plus.com>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
 <008301cf877c$7dc9def0$795d9cd0$@verizon.net>
 <53A074CD.3060805@mdowle.plus.com>
Message-ID: <53A0DE86.1080001@gmail.com>

Hi Matt,

There was recently another discussion on using setkey on .SD here:
http://r.789695.n4.nabble.com/setkey-on-SD-td4690283.html

So the following code won't work any more in the current 1.9.3 dev
version. I think the idea of using setkey in a "chain" of data.tables
was nice, since it allows setting the key temporarily. The basic idea
is taken from the comment here:
http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917

A <- data.table(
  x = c(1, 2, 3, 4, 5),
  y = letters[1:5])
B <- data.table(
  x = c(1, 2, 3, 1, 4),
  f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
  z = 101:105)
B[, setkey(.SD, x)][
  , .SD[A, roll = TRUE, rollends = FALSE], by = f][
  , setkey(.SD, x)]

Thanks,

M

On 06/18/2014 01:03 AM, Matt Dowle wrote:
>
> Hi Ron,
>
> Thanks for highlighting this. Two changes now in v1.9.3 on GitHub:
>
> * `setkey` on `.SD` is now an error, rather than warnings for each
>   group about rebuilding the key. The new error is similar to when
>   attempting to use `:=` in a `.SD` subquery: ".SD is locked. Using
>   set*() functions on .SD is reserved for possible future use; a
>   tortuously flexible way to modify the original data by
>   group." Thanks to Ron Hylton for highlighting the issue on
>   datatable-help here.
>
> * Looping calls to `unique(DT)` such as
>   in `DT[,unique(.SD),by=group]` is now faster by avoiding internal
>   overhead of calling `[.data.table`. Thanks again to Ron Hylton for
>   highlighting in the same thread.
>   His example is reduced from 28 sec to 9 sec, with identical results.
>
> I now get the following (on my slow netbook) with no changes to your code.
>
> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id])) # were warnings, now error
> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id])) # was 28s, now 9s
> print(system.time(uf <- ddply(test, .(id), conflictsFrame))) # 13s
>
> This just fixes the surprises, basically. Clearly Arun uses data.table
> in a better way which is orders of magnitude faster.
>
> Matt
>
>> From: Ron Hylton rhylton at verizon.net
>> Reply: Ron Hylton rhylton at verizon.net
>> Date: June 14, 2014 at 1:55:53 AM
>> To: datatable-help at lists.r-forge.r-project.org
>> Subject: [datatable-help] data.table is asking for help
>>
>> The code below generates the warning:
>>
>> In setkeyv(x, cols, verbose = verbose) :
>>   Already keyed by this key but had invalid row
>>   order, key rebuilt. If you didn't go under the
>>   hood please let datatable-help know so the root
>>   cause can be fixed.
>>
>> This is my first attempt at using datatable so I
>> probably did something dumb, but maybe that's
>> useful for someone. The first case is the one
>> that gives the warnings.
>>
>> I'm also surprised at the timings. I wrote the
>> original algorithm using dataframe & ddply and I
>> expected datatable to be substantially faster; the
>> opposite is true.
>>
>> The algorithm does the following: Certain columns
>> in the table are keys and others are values in the
>> sense that each row with the same set of keys
>> should have the same set of values. Find all the
>> key sets for which this is not true and return the
>> keys sets + conflicting value sets.
>>
>> Insight into the performance would be appreciated.
>> Regards,
>> Ron
>>
>> library(data.table)
>> library(plyr)
>>
>> conflictsTable1 <- function(f) {
>>   u <- unique(setkey(f))
>>   if (nrow(u) == 1) return(NULL)
>>   u
>> }
>>
>> conflictsTable2 <- function(f) {
>>   u <- unique(f)
>>   if (nrow(u) == 1) return(NULL)
>>   u
>> }
>>
>> conflictsFrame <- function(f) {
>>   u <- unique(f)
>>   if (nrow(u) == 1) return(NULL)
>>   u
>> }
>>
>> N <- 10000
>> test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>>                    x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>>
>> setkey(test,id)
>>
>> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
>> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
>> print(system.time(uf <- ddply(test, .(id), conflictsFrame)))
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From my.r.help at gmail.com Thu Jun 19 05:51:41 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 19 Jun 2014 11:51:41 +0800
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
Message-ID: <53A25E4D.8040206@gmail.com>

I got the following result on my keyed data tables `CS` and `SP`, which seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all columns should have the _same_ length:

> ## Works as expected:
> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1])
[1] TRUE
> ## Works as expected:
> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1])
[1] TRUE
> ## Here's the potential _bug_, when combining both:
> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1])
[1] FALSE

Thanks,
M

From my.r.help at gmail.com Thu Jun 19 05:59:59 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 19 Jun 2014 11:59:59 +0800
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A25E4D.8040206@gmail.com>
References: <53A25E4D.8040206@gmail.com>
Message-ID: <53A2603F.9030204@gmail.com>

By the way, I know it's not reproducible with the code below. Before going into further detail, I first wanted to ask whether this looks like a bug, or whether I've overlooked something obvious and this is expected behavior.
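
For anyone wanting to try the same check without the original data: the test above relies on every column of a data.table having the same length. Below is a minimal sketch on made-up tables. The real `CS` and `SP` are not shown in this thread, so the stand-in tables, their column names (`id`, `t`, `v`) and the helper `same_length` are all assumptions for illustration only. On a release without the bug the check would be expected to come back TRUE.

```r
library(data.table)

# Hypothetical stand-ins for CS and SP, both keyed on (id, t)
CS <- data.table(id = c("a", "a", "b"), t = c(1L, 5L, 2L), v = c(10, 20, 30),
                 key = c("id", "t"))
SP <- data.table(id = c("a", "b", "c"), t = c(3L, 1L, 4L),
                 key = c("id", "t"))

# Same test as in the message, wrapped in a small helper (hypothetical name)
same_length <- function(dt) {
  lens <- vapply(dt, length, integer(1))  # length of every column
  all(lens == lens[1])                    # TRUE if all columns agree
}

# Rolling join that also drops rows of SP without a match
res <- CS[SP, nomatch = 0, roll = TRUE]
print(same_length(res))
```

If this prints FALSE on real data, the result table is malformed and a reduced example is worth reporting.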
Thanks,
M

From mathematical.coffee at gmail.com Fri Jun 20 02:44:08 2014
From: mathematical.coffee at gmail.com (mathematical.coffee)
Date: Thu, 19 Jun 2014 17:44:08 -0700 (PDT)
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To: <1397752015938-4689002.post@n4.nabble.com>
References: <1397752015938-4689002.post@n4.nabble.com>
Message-ID: <1403225048664-4692401.post@n4.nabble.com>

Hi all,

Sorry to resurrect an old thread, but I've been experiencing these problems too and have come up with a reproducible example (for me anyway).

Data.table 1.9.2, R 3.1.0

I was trying to join some tables and got the usual "rerun with allow.cartesian=TRUE" message like Michele, and then got this error:

Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed

However while I was trying to strip down my data to reproduce the error, I now consistently get this one instead:

Error in `[.data.table`(x, y, `:=`(female, female)) :
  object 'bysubl' not found

rather than the TRUE/FALSE one. But they seem to be related.

* x has a column of subjects, some duplicated
* y has a column of subjects, none duplicated, and some not present in x (all subjects of x are in y though).
* y additionally has a binary column `female` that I wish to join into x

(I know there are other ways to do this, but this is a stripped down example and seems to point out something going wrong in data.table so it is just an illustrative example):

```
library(data.table)
x=fread('x.csv')
y=fread('y.csv')
setkey(x, subject)
setkey(y, subject)

x[y]
# Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
# Join results in 33 rows; more than 28 = max(nrow(x),nrow(i)). Check for
# duplicate key values in i, each of which join to the same group in x over
# and over again. If that's ok, try including `j` and dropping `by`
# (by-without-by) so that j runs for each group to avoid the large allocation.
# If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
# Otherwise, please search for this error message in the FAQ, Wiki, Stack
# Overflow and datatable-help for advice.

x[y, female:=female]
# Error in `[.data.table`(x, y, `:=`(female, female)) :
#   object 'bysubl' not found
```

I get the above reproducibly with this dataset.

From now onwards, if I type in 'x' or 'y' into the prompt I get nothing printed at all. Additionally:

```
tables()
# Error in gettext(domain, unlist(args)) : invalid 'string' value
# Error: argument "finally" is missing, with no default
```

The only solution is to restart the R session.

Note: this *doesn't* occur if the column I try to merge (`female` in this case) is continuous, for example. I can only get it if it's logical.

I've attached x.csv and y.csv to this email for you to play with.

I think it might be possible to strip down the tables to less rows (x has 28, y has 26) but in my (not exhaustive) attempts to do so, I didn't get this particular error.

x.csv
y.csv

--
View this message in context: http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002p4692401.html
Sent from the datatable-help mailing list archive at Nabble.com.
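
For readers without the attachments, the shape of the data described in the message above can be sketched with made-up tables. Everything below is hypothetical stand-in data (the real x.csv and y.csv are not reproduced in the archive), and the `score` column is invented for illustration. The `i.` prefix makes explicit that `female` is taken from y. This is only a sketch of an update join as it would be expected to behave in a recent data.table release, not a reproduction of the 1.9.2 error.

```r
library(data.table)

# Made-up stand-ins for x.csv and y.csv (hypothetical data)
x <- data.table(subject = c("a", "a", "b", "c", "d"),
                score   = 1:5,
                key     = "subject")
y <- data.table(subject = letters[1:6],
                female  = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
                key     = "subject")

# Update join: for each row of y, copy its `female` value onto the matching
# rows of x. x is modified by reference, so it keeps its own row count;
# subjects that exist only in y (e and f here) change nothing in x.
x[y, female := i.female]

print(x)
```

Because the assignment happens by reference, x still has five rows afterwards and both rows for subject a receive the same value.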
From aragorn168b at gmail.com Fri Jun 20 02:51:05 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 02:51:05 +0200
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To: <1403225048664-4692401.post@n4.nabble.com>
References: <1397752015938-4689002.post@n4.nabble.com> <1403225048664-4692401.post@n4.nabble.com>
Message-ID:

Hi,

Could you let us know if you're able to reproduce it in the devel version 1.9.3 as well?

Arun

From mathematical.coffee at gmail.com Fri Jun 20 03:01:50 2014
From: mathematical.coffee at gmail.com (Amy)
Date: Fri, 20 Jun 2014 11:01:50 +1000
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To:
References: <1397752015938-4689002.post@n4.nabble.com> <1403225048664-4692401.post@n4.nabble.com>
Message-ID:

Hi Arun,

In 1.9.3 I get the "Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 33 rows; more than 28 = max(nrow(x),nrow(i))...." message and it doesn't assign the column (upon `x[y, female:=female]`), so no, the 'bysubl' error doesn't occur.

But as an aside, shouldn't this command work? If I have x with subjects a, a, b, c, d; y with genders for subjects a--f, shouldn't x[y, female:=female] copy the female column from y to x, duplicating as necessary?

Of course y[x] produces the table I'm after, but in the case that y has extra columns I /don't/ want in the output and x has extra columns I /do/, `y[x]` is then not the table I'm after. (But now we are straying into a different question, my limited understanding of how to use data.table, as opposed to the bug this thread is about).

PS - typo in the data.table Readme in the "if you get latex errors during installation" bit:

devtools:::install_github("datat.able", ...)

"datat.able" --> "data.table".

cheers
Amy

From aragorn168b at gmail.com Fri Jun 20 03:18:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 03:18:12 +0200
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To:
References: <1397752015938-4689002.post@n4.nabble.com> <1403225048664-4692401.post@n4.nabble.com>
Message-ID:

Hi Amy,

Good to know that it's not reproducible in 1.9.3. Matt already fixed it.

The result of X[Y, LHS := RHS] cannot exceed nrow(X) rows, because the assignment is made by reference. If the join X[Y] results in more than nrow(X) rows, then X would have to be re-allocated entirely.

If you only want those that match with X, then you should do: X[Y, female := i.female, nomatch=0L].

If instead you want all the rows from y, then you could do: x[y, allow.cartesian=TRUE].

Arun

From mathematical.coffee at gmail.com Fri Jun 20 03:44:26 2014
From: mathematical.coffee at gmail.com (Amy)
Date: Fri, 20 Jun 2014 11:44:26 +1000
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To:
References: <1397752015938-4689002.post@n4.nabble.com> <1403225048664-4692401.post@n4.nabble.com>
Message-ID:

Thanks for this, I knew not knowing how to do that join was a problem with me not understanding data.table, not a problem with data.table.
Very good to know the 'bysubl' "error" is fixed in 1.9.3 (even if it is brought about by users like me trying to do our joins wrongly :)) thanks, Amy On 20 June 2014 11:18, Arunkumar Srinivasan wrote: > Hi Amy, > > Good to know that it?s not reproducible in 1.9.3. Matt already fixed it. > > X[Y, LHS := RHS] can not exceed nrow(X) because this assignment is made *by > reference*. If the join from X[Y] results in more than nrow(X), then X > will be to be re-allocated entirely. > > If you only want those that match with X, then you should do: X[Y, female > := i.female, nomatch=0L]. > > If instead you want all the rows from y, then you could do: x[y, > allow.cartesian=TRUE]. > > > Arun > > From: Amy mathematical.coffee at gmail.com > Reply: Amy mathematical.coffee at gmail.com > Date: June 20, 2014 at 3:01:50 AM > To: Arunkumar Srinivasan aragorn168b at gmail.com > Cc: datatable-help at lists.r-forge.r-project.org > datatable-help at lists.r-forge.r-project.org > > Subject: Re: [datatable-help] What is going on with R 3.1 ? > > Hi Arun, > > In 1.9.3 I get the "Error in vecseq(f__, len__, if (allow.cartesian) NULL > else as.integer(max(nrow(x), : Join results in 33 rows; more than 28 = > max(nrow(x),nrow(i))...." message and it doesn't assign the column (upon > `x[y, female:=female]`, so no, the error doesn't occur. > > But as an aside, shouldn't it this command work? > If I have x with subjects a, a, b, c, d; y with genders for subjects a--f, > shouldn't x[y, female:=female] copy the female column from y to x, > duplicating as necessary? > Of course y[x] produces the table I'm after, but in the case that y has > extra columns I /don't/ want in the output and x has extra columns I /do/, > `y[x]` is then not the table I'm after. (But now we are straying into a > different question, my limited understanding of how to use data.table, as > opposed to the bug this thread is about). 
> > PS - typo on the data.table Readmein the "if you get latex errors during > installation" bit: > > devtools:::install_github("datat.able", ...) > > "datat.able" --> "data.table". > > cheers > Amy > > > On 20 June 2014 10:51, Arunkumar Srinivasan wrote: > >> Hi, >> >> Could you let us know if you?re able to reproduce it in the devel >> version 1.9.3 as well? >> >> >> Arun >> >> From: mathematical.coffee mathematical.coffee at gmail.com >> Reply: mathematical.coffee mathematical.coffee at gmail.com >> Date: June 20, 2014 at 2:44:50 AM >> To: datatable-help at lists.r-forge.r-project.org >> datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] What is going on with R 3.1 ? >> >> Hi all, >> >> Sorry to resurrect an old thread, but I've been experiencing these >> problems >> too and have come up with a reproducible example (for me anyway). >> >> Data.table 1.9.2, R 3.1.0 >> >> I was trying to join some tables and got the usual "rerun with >> allow.cartesian=TRUE" message like Michele, and then got this error: >> >> Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed >> >> However while I was trying to strip down my data to reproduce the error, I >> now consistently get this one instead: >> >> Error in `[.data.table`(x, y, `:=`(female, female)) : >> object 'bysubl' not found >> >> >> rather than the TRUE/FALSE one. But they seem to be related. >> >> * x has a column of subjects, some duplicated >> * y has a column of subjects, none duplicated, and some not present in x >> (all subjects of x are in y though). 
>> * y additionally has a binary column `female` that I wish to join into x >> >> (I know there are other ways to do this, but this is a stripped down >> example >> and seems to point out something going wrong in data.table so it is just >> an >> illustrative example): >> >> ``` >> library(data.table) >> x=fread('x.csv') >> y=fread('y.csv') >> setkey(x, subject) >> setkey(y, subject) >> >> x[y] >> # Error in vecseq(f__, len__, if (allow.cartesian) NULL else >> as.integer(max(nrow(x), : >> # Join results in 33 rows; more than 28 = max(nrow(x),nrow(i)). Check for >> duplicate key values in i, each of which join to the same group in x over >> and over again. If that's ok, try including `j` and dropping `by` >> (by-without-by) so that j runs for each group to avoid the large >> allocation. >> If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. >> Otherwise, please search for this error message in the FAQ, Wiki, Stack >> Overflow and datatable-help for advice. >> >> x[y, female:=female] >> Error in `[.data.table`(x, y, `:=`(female, female)) : >> object 'bysubl' not found >> ``` >> >> I get the above reproducibly with this dataset. >> >> From now onwards, if I type in 'x' or 'y' into the prompt I get nothing >> printed at all. Additionally: >> >> ``` >> tables() >> # Error in gettext(domain, unlist(args)) : invalid 'string' value >> # Error: argument "finally" is missing, with no default >> ``` >> >> The only solution is to restart the R session. >> >> Note: this *doesn't* occur if the column I try to merge (`female` in this >> case) is continuous, for example. I can only get it if it's logical. >> >> I've attached x.csv and y.csv to this email for you to play with. >> >> I think it might be possible to strip down the tables to less rows (x has >> 28, y has 26) but in my (not exhaustive) attempts to do so, I didn't get >> this particular error. 
>> >> x.csv >> y.csv >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002p4692401.html >> Sent from the datatable-help mailing list archive at Nabble.com. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Fri Jun 20 05:37:07 2014 From: my.r.help at gmail.com (Michael Smith) Date: Fri, 20 Jun 2014 11:37:07 +0800 Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T? In-Reply-To: <53A2603F.9030204@gmail.com> References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com> Message-ID: <53A3AC63.6020301@gmail.com> So let me rephrase my question (haven't received an answer so far): For a given data.table, is there any condition under which the lengths of the vectors in each column may differ? Based on my understanding, each data.table is also a data.frame, and with a data frame this should not be possible. For example, it's not possible to have a data.frame where the first column is a vector of length eight, and the second column is a vector of length nine. Ergo, it's a bug, right? If my understanding is correct, please do let me know and I'll be glad to try to boil this down to something that's reproducible. Thanks, M On 06/19/2014 11:59 AM, Michael Smith wrote: > By the way, I know it's not reproducible with the code below. Before > going into further detail, I first wanted to ask whether this looks like > a bug, or whether I've overlooked something obvious and this is expected > behavior. 
>
> Thanks,
> M
>
> On 06/19/2014 11:51 AM, Michael Smith wrote:
>> I got the following result on my keyed data tables `CS` and `SP`, which
>> seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all
>> columns should have the _same_ length:
>>
>>> ## Works as expected:
>>> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1])
>> [1] TRUE
>>> ## Works as expected:
>>> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1])
>> [1] TRUE
>>> ## Here's the potential _bug_, when combining both:
>>> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1])
>> [1] FALSE
>>
>>
>> Thanks,
>>
>> M
>>

From aragorn168b at gmail.com Fri Jun 20 11:17:13 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 11:17:13 +0200
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A3AC63.6020301@gmail.com>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com> <53A3AC63.6020301@gmail.com>
Message-ID:

> For a given data.table, is there any condition ... Ergo, it's a bug, right?

Yes.

> I'll be glad to try to boil this down to something that's reproducible.

That'd be great.

Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 20, 2014 at 5:37:24 AM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Bug when Merging with nomatch=0 and roll=T?

So let me rephrase my question (haven't received an answer so far):

For a given data.table, is there any condition under which the lengths of the vectors in each column may differ? Based on my understanding, each data.table is also a data.frame, and with a data frame this should not be possible. For example, it's not possible to have a data.frame where the first column is a vector of length eight, and the second column is a vector of length nine. Ergo, it's a bug, right?
If my understanding is correct, please do let me know and I'll be glad to try to boil this down to something that's reproducible. Thanks, M On 06/19/2014 11:59 AM, Michael Smith wrote: > By the way, I know it's not reproducible with the code below. Before > going into further detail, I first wanted to ask whether this looks like > a bug, or whether I've overlooked something obvious and this is expected > behavior. > > Thanks, > M > > On 06/19/2014 11:51 AM, Michael Smith wrote: >> I got the following result on my keyed data tables `CS` and `SP`, which >> seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all >> columns should have the _same_ length: >> >>> ## Works as expected: >>> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1]) >> [1] TRUE >>> ## Works as expected: >>> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1]) >> [1] TRUE >>> ## Here's the potential _bug_, when combining both: >>> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1]) >> [1] FALSE >> >> >> Thanks, >> >> M >> _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Fri Jun 20 13:30:05 2014 From: my.r.help at gmail.com (Michael Smith) Date: Fri, 20 Jun 2014 19:30:05 +0800 Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T? In-Reply-To: References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com> <53A3AC63.6020301@gmail.com> Message-ID: <53A41B3D.3000003@gmail.com> OK, no problem, here's the code. If there are any problems pasting it into R let me know (I used parts of dput, so maybe the email line endings are messed up). If you want I can also file a bug report on github, just let me know. 
CS <-
  data.table(
    structure(list(LPERMCO = c(7L, 33L), datadate = structure(c(15912,
    15912), class = "Date"), me = c(626550.35284, 7766.385)), .Names =
    c("LPERMCO", "datadate", "me"), class = "data.frame", row.names =
    c(NA, -2L)),
    key = "LPERMCO,datadate")
SP <-
  data.table(
    structure(list(PERMCO = c(7L, 7L, 33L, 33L, 33L, 33L), date =
    structure(c(15884, 15917, 15884, 15884, 15917, 15917), class = "Date"),
    RET = c(-0.118303, 0.141225, -0.03137, -0.02533, 0.045967, 0.043694)),
    .Names = c("PERMCO", "date", "RET"), class = "data.frame",
    row.names = c(NA, -6L)),
    key = "PERMCO,date")
sapply(CS[SP, nomatch = 0, roll = T], length)


The relevant output looks like this, both in 1.9.2 and in dev-1.9.3, and
for sapply, the "me" column should be 5 but it's 3:

> CS
   LPERMCO   datadate         me
1:       7 2013-07-26 626550.353
2:      33 2013-07-26   7766.385
> SP
   PERMCO       date       RET
1:      7 2013-06-28 -0.118303
2:      7 2013-07-31  0.141225
3:     33 2013-06-28 -0.031370
4:     33 2013-06-28 -0.025330
5:     33 2013-07-31  0.045967
6:     33 2013-07-31  0.043694
> CS[SP, nomatch = 0, roll = T]
   LPERMCO   datadate         me       RET
1:       7 2013-07-31 626550.353  0.141225
2:      33 2013-06-28   7766.385 -0.031370
3:      33 2013-06-28   7766.385 -0.025330
4:      33 2013-07-31 626550.353  0.045967
5:      33 2013-07-31   7766.385  0.043694
Warning message:
In cbind(LPERMCO = c(" 7", "33", "33", "33", "33"), datadate = c("2013-07-31", :
  number of rows of result is not a multiple of vector length (arg 3)
> sapply(CS[SP, nomatch = 0, roll = T], length)
 LPERMCO datadate       me      RET
       5        5        3        5


Thanks,
M

On 06/20/2014 05:17 PM, Arunkumar Srinivasan wrote:
>> For a given data.table, is there any condition ... Ergo, it's a bug,
>> right?
>
> Yes.
>
>> I'll be glad
>> to try to boil this down to something that's reproducible.
>
> That'd be great.
>> >
>> > Thanks,
>> > M
>> >
>> > On 06/19/2014 11:51 AM, Michael Smith wrote:
>> >> I got the following result on my keyed data tables `CS` and `SP`, which
>> >> seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all
>> >> columns should have the _same_ length:
>> >>
>> >>> ## Works as expected:
>> >>> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1])
>> >> [1] TRUE
>> >>> ## Works as expected:
>> >>> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1])
>> >> [1] TRUE
>> >>> ## Here's the potential _bug_, when combining both:
>> >>> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1])
>> >> [1] FALSE
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> M
>> >>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>

From aragorn168b at gmail.com Fri Jun 20 13:41:59 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 13:41:59 +0200
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A41B3D.3000003@gmail.com>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com> <53A3AC63.6020301@gmail.com> <53A41B3D.3000003@gmail.com>
Message-ID:

Michael,

Excellent example. Perfectly reproducible on 1.9.2 and 1.9.3. And it works fine on 1.8.10. The answer should have only 3 rows.
It would be even nicer if you could file it as a bug report.

PS: On another note, you may also be interested in `CS[SP, roll=TRUE, rollends=TRUE]`.

Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 20, 2014 at 1:30:09 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
OK, no problem, here's the code. If there are any problems pasting it into R let me know (I used parts of dput, so maybe the email line endings are messed up). If you want I can also file a bug report on github, just let me know. CS <- data.table( structure(list(LPERMCO = c(7L, 33L), datadate = structure(c(15912, 15912), class = "Date"), me = c(626550.35284, 7766.385)), .Names = c("LPERMCO", "datadate", "me"), class = "data.frame", row.names = c(NA, -2L )), key = "LPERMCO,datadate") SP <- data.table( structure(list(PERMCO = c(7L, 7L, 33L, 33L, 33L, 33L), date = structure(c(15884, 15917, 15884, 15884, 15917, 15917), class = "Date"), RET = c(-0.118303, 0.141225, -0.03137, -0.02533, 0.045967, 0.043694)), .Names = c("PERMCO", "date", "RET"), class = "data.frame", row.names = c(NA, -6L)), key = "PERMCO,date") sapply(CS[SP, nomatch = 0, roll = T], length) The relevant output looks like this, both in 1.9.2 and in dev-1.9.3, and for sapply, the "me" column should be 5 but it's 3: > CS LPERMCO datadate me 1: 7 2013-07-26 626550.353 2: 33 2013-07-26 7766.385 > SP PERMCO date RET 1: 7 2013-06-28 -0.118303 2: 7 2013-07-31 0.141225 3: 33 2013-06-28 -0.031370 4: 33 2013-06-28 -0.025330 5: 33 2013-07-31 0.045967 6: 33 2013-07-31 0.043694 > CS[SP, nomatch = 0, roll = T] LPERMCO datadate me RET 1: 7 2013-07-31 626550.353 0.141225 2: 33 2013-06-28 7766.385 -0.031370 3: 33 2013-06-28 7766.385 -0.025330 4: 33 2013-07-31 626550.353 0.045967 5: 33 2013-07-31 7766.385 0.043694 Warning message: In cbind(LPERMCO = c(" 7", "33", "33", "33", "33"), datadate = c("2013-07-31", : number of rows of result is not a multiple of vector length (arg 3) > sapply(CS[SP, nomatch = 0, roll = T], length) LPERMCO datadate me RET 5 5 3 5 Thanks, M On 06/20/2014 05:17 PM, Arunkumar Srinivasan wrote: >> For a given data.table, is there any condition ? Ergo, it's a bug, >> right? > > Yes. > >> I'll be glad >> to try to boil this down to something that's reproducible. > > That'd be great. 
>> > >> > Thanks, >> > M >> > >> > On 06/19/2014 11:51 AM, Michael Smith wrote: >> >> I got the following result on my keyed data tables `CS` and `SP`, which >> >> seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all >> >> columns should have the _same_ length: >> >> >> >>> ## Works as expected: >> >>> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1]) >> >> [1] TRUE >> >>> ## Works as expected: >> >>> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1]) >> >> [1] TRUE >> >>> ## Here's the potential _bug_, when combining both: >> >>> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1]) >> >> [1] FALSE >> >> >> >> >> >> Thanks, >> >> >> >> M >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Fri Jun 20 14:23:28 2014 From: my.r.help at gmail.com (Michael Smith) Date: Fri, 20 Jun 2014 20:23:28 +0800 Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T? In-Reply-To: References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com> <53A3AC63.6020301@gmail.com> <53A41B3D.3000003@gmail.com> Message-ID: <53A427C0.5070506@gmail.com> Arun, Thanks for your reply and the issue is here (if there's anything else I can do to help solve this problem let me know): https://github.com/Rdatatable/data.table/issues/700 Also thanks for mentioning rollends. M On 06/20/2014 07:41 PM, Arunkumar Srinivasan wrote: > Michael, > > Excellent example. Perfectly reproducible on 1.9.2 and 1.9.3. And it > works fine on 1.8.10. The answer should've only 3 rows. > It'd be even more nice of you if you could file it as a bug report. > > PS: On another note.. 
you maybe also interested in `CS[SP, roll=TRUE, > rollends=TRUE]` > Arun > > From: Michael Smith my.r.help at gmail.com > Reply: Michael Smith my.r.help at gmail.com > Date: June 20, 2014 at 1:30:09 PM > To: Arunkumar Srinivasan aragorn168b at gmail.com > > Cc: datatable-help at lists.r-forge.r-project.org > datatable-help at lists.r-forge.r-project.org > > Subject: Re: [datatable-help] Bug when Merging with nomatch=0 and roll=T? > >> OK, no problem, here's the code. If there are any problems pasting it >> into R let me know (I used parts of dput, so maybe the email line >> endings are messed up). If you want I can also file a bug report on >> github, just let me know. >> >> CS <- >> data.table( >> structure(list(LPERMCO = c(7L, 33L), datadate = structure(c(15912, >> 15912), class = "Date"), me = c(626550.35284, 7766.385)), .Names = >> c("LPERMCO", >> "datadate", "me"), class = "data.frame", row.names = c(NA, -2L >> )), >> key = "LPERMCO,datadate") >> SP <- >> data.table( >> structure(list(PERMCO = c(7L, 7L, 33L, 33L, 33L, 33L), date = >> structure(c(15884, >> 15917, 15884, 15884, 15917, 15917), class = "Date"), RET = c(-0.118303, >> 0.141225, -0.03137, -0.02533, 0.045967, 0.043694)), .Names = c("PERMCO", >> "date", "RET"), class = "data.frame", row.names = c(NA, -6L)), >> key = "PERMCO,date") >> sapply(CS[SP, nomatch = 0, roll = T], length) >> >> >> The relevant output looks like this, both in 1.9.2 and in dev-1.9.3, and >> for sapply, the "me" column should be 5 but it's 3: >> >> > CS >> LPERMCO datadate me >> 1: 7 2013-07-26 626550.353 >> 2: 33 2013-07-26 7766.385 >> > SP >> PERMCO date RET >> 1: 7 2013-06-28 -0.118303 >> 2: 7 2013-07-31 0.141225 >> 3: 33 2013-06-28 -0.031370 >> 4: 33 2013-06-28 -0.025330 >> 5: 33 2013-07-31 0.045967 >> 6: 33 2013-07-31 0.043694 >> > CS[SP, nomatch = 0, roll = T] >> LPERMCO datadate me RET >> 1: 7 2013-07-31 626550.353 0.141225 >> 2: 33 2013-06-28 7766.385 -0.031370 >> 3: 33 2013-06-28 7766.385 -0.025330 >> 4: 33 2013-07-31 
626550.353 0.045967 >> 5: 33 2013-07-31 7766.385 0.043694 >> Warning message: >> In cbind(LPERMCO = c(" 7", "33", "33", "33", "33"), datadate = >> c("2013-07-31", : >> number of rows of result is not a multiple of vector length (arg 3) >> > sapply(CS[SP, nomatch = 0, roll = T], length) >> LPERMCO datadate me RET >> 5 5 3 5 >> >> >> Thanks, >> M >> >> >> >> >> >> On 06/20/2014 05:17 PM, Arunkumar Srinivasan wrote: >> >> For a given data.table, is there any condition ? Ergo, it's a bug, >> >> right? >> > >> > Yes. >> > >> >> I'll be glad >> >> to try to boil this down to something that's reproducible. >> > >> > That'd be great. >> > >> > >> > Arun >> > >> > From: Michael Smith my.r.help at gmail.com >> > Reply: Michael Smith my.r.help at gmail.com >> > Date: June 20, 2014 at 5:37:24 AM >> > To: datatable-help at lists.r-forge.r-project.org >> > datatable-help at lists.r-forge.r-project.org >> > >> > Subject: Re: [datatable-help] Bug when Merging with nomatch=0 and roll=T? >> > >> >> So let me rephrase my question (haven't received an answer so far): >> >> >> >> For a given data.table, is there any condition under which the lengths >> >> of the vectors in each column may differ? Based on my understanding, >> >> each data.table is also a data.frame, and with a data frame this should >> >> not be possible. For example, it's not possible to have a data.frame >> >> where the first column is a vector of length eight, and the second >> >> column is a vector of length nine. Ergo, it's a bug, right? >> >> >> >> If my understanding is correct, please do let me know and I'll be glad >> >> to try to boil this down to something that's reproducible. >> >> >> >> Thanks, >> >> M >> >> >> >> On 06/19/2014 11:59 AM, Michael Smith wrote: >> >> > By the way, I know it's not reproducible with the code below. 
Before >> >> > going into further detail, I first wanted to ask whether this looks like
>> >> > a bug, or whether I've overlooked something obvious and this is expected
>> >> > behavior.
>> >> >
>> >> > Thanks,
>> >> > M
>> >> >
>> >> > On 06/19/2014 11:51 AM, Michael Smith wrote:
>> >> >> I got the following result on my keyed data tables `CS` and `SP`, which
>> >> >> seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all
>> >> >> columns should have the _same_ length:
>> >> >>
>> >> >>> ## Works as expected:
>> >> >>> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1])
>> >> >> [1] TRUE
>> >> >>> ## Works as expected:
>> >> >>> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1])
>> >> >> [1] TRUE
>> >> >>> ## Here's the potential _bug_, when combining both:
>> >> >>> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1])
>> >> >> [1] FALSE
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> M
>> >> >>
>> >> _______________________________________________
>> >> datatable-help mailing list
>> >> datatable-help at lists.r-forge.r-project.org
>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >>

From aragorn168b at gmail.com Fri Jun 20 14:24:22 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 14:24:22 +0200
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A427C0.5070506@gmail.com>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com> <53A3AC63.6020301@gmail.com> <53A41B3D.3000003@gmail.com> <53A427C0.5070506@gmail.com>
Message-ID:

Awesome. Just got the email notification (from github). Thanks.
Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 20, 2014 at 2:23:32 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Bug when Merging with nomatch=0 and roll=T?

Arun,

Thanks for your reply and the issue is here (if there's anything else I can do to help solve this problem let me know):

https://github.com/Rdatatable/data.table/issues/700

Also thanks for mentioning rollends.

M

On 06/20/2014 07:41 PM, Arunkumar Srinivasan wrote:
> Michael,
>
> Excellent example. Perfectly reproducible on 1.9.2 and 1.9.3. And it
> works fine on 1.8.10. The answer should've only 3 rows.
> It'd be even more nice of you if you could file it as a bug report.
>
> PS: On another note.. you maybe also interested in `CS[SP, roll=TRUE,
> rollends=TRUE]`
> Arun
>
> From: Michael Smith my.r.help at gmail.com
> Reply: Michael Smith my.r.help at gmail.com
> Date: June 20, 2014 at 1:30:09 PM
> To: Arunkumar Srinivasan aragorn168b at gmail.com
>
> Cc: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
>
> Subject: Re: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
>
>> OK, no problem, here's the code. If there are any problems pasting it
>> into R let me know (I used parts of dput, so maybe the email line
>> endings are messed up). If you want I can also file a bug report on
>> github, just let me know.
>> >> >
>> >> > Thanks,
>> >> > M
>> >> >
>> >> > On 06/19/2014 11:51 AM, Michael Smith wrote:
>> >> >> I got the following result on my keyed data tables `CS` and `SP`, which
>> >> >> seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all
>> >> >> columns should have the _same_ length:
>> >> >>
>> >> >>> ## Works as expected:
>> >> >>> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1])
>> >> >> [1] TRUE
>> >> >>> ## Works as expected:
>> >> >>> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1])
>> >> >> [1] TRUE
>> >> >>> ## Here's the potential _bug_, when combining both:
>> >> >>> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1])
>> >> >> [1] FALSE
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> M
>> >> >>
>> >> _______________________________________________
>> >> datatable-help mailing list
>> >> datatable-help at lists.r-forge.r-project.org
>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Fri Jun 20 23:47:20 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 23:47:20 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <539D0C8F.1080005@gmail.com>
References: <5389541B.8040006@gmail.com> <539D0C8F.1080005@gmail.com>
Message-ID:

This is a really tricky one. I was just trying to fix it when I recollected the issues with base:::order from the time during implementation. Consider this case:

require(data.table)
DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))

Consider the cases A and B below:

# case A
DT[base:::order(DT[, "x", with=FALSE])]
#    x y  z
# 1: 1 8 10
# 2: 2 7  9
# 3: 3 5 11
# 4: 4 6 12

Intended right result. Great!

B:

# case B
DT[base:::order(list(x))]
#    x y  z
# 1: 1 8 10

What just happened?!?
So, basically if the list gives TRUE for is.object(.), it understands what the operation is, correctly. But if it's just a list, it has no idea how to deal with it. Also, it silently returns an undesirable result (imo). Similar to the above cases, compare these two:

# case C
DT[base:::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])]
# vs
# case D
DT[base:::order(list(x), list(y))]

Even more crazy case:

# case E
DT[base:::order(DT[, c("x", "y"), with=FALSE])]
# vs
# case F
DT[base:::order(list(x,y))]

While we were testing and implementing forder, it obviously didn't occur to us to check order(.) with a data.table as its argument. And in spite of the fact that the output for DT[order(list(x))] is a bit strange and even dangerous, to be consistent with base:::order, we had implemented it the same way. Now, I'm not so sure. Any ideas justifying these differences?

Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 15, 2014 at 5:02:46 AM
To: G See gsee000 at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] `with=F` in the `i` Argument

Devs,

Is this a bug? It works in 1.9.2 but not in the 1.9.3 development version:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
  print(DT[order(DT[, i, with = FALSE])])
Error in forder(DT, DT[, i, with = FALSE]) :
  Column '1' is type 'list' which is not supported for ordering currently.

Thanks,
M

On 05/31/2014 12:44 PM, G See wrote:
> Hi Michael,
>
> I would use get()
>
> DT <- data.table(a = 1:4, b = 8:5)
> for (i in c("a", "b"))
> print(DT[order(get(i))])
>
> For what it's worth, your solution doesn't seem to work in data.table
> 1.9.3 (svn rev. 1278):
>
>> for (i in c("a", "b"))
> + print(DT[order(DT[, i, with = FALSE])])
> Error in forder(DT, DT[, i, with = FALSE]) :
> Column '1' is type 'list' which is not supported for ordering currently.
>
>
> HTH,
> Garrett
>
> On Fri, May 30, 2014 at 11:01 PM, Michael Smith wrote:
>> All,
>>
>> I'm trying to order the rows according to several columns at a time:
>>
>> DT <- data.table(a = 1:4, b = 8:5)
>> for (i in c("a", "b"))
>> print(DT[order(i), with = FALSE])
>>
>> It doesn't work, since `with` seems to be about the `j` argument, but
>> not the `i` argument, according to `?data.table`.
>>
>> I found the following workaround, but wonder whether there is a more
>> elegant way to do it:
>>
>> for (i in c("a", "b"))
>> print(DT[order(DT[, i, with = FALSE])])
>>
>> Thanks,
>> M
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aragorn168b at gmail.com Sat Jun 21 02:25:27 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 21 Jun 2014 02:25:27 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To:
References: <5389541B.8040006@gmail.com> <539D0C8F.1080005@gmail.com>
Message-ID:

Michael,

Note that in your case, you can also do:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
  DT[order(DT[[i]])]

At the moment, I'm more inclined towards giving an error when any of the arguments to order(.) results in a list. The message could be something like:

DT[order(.)] on data.tables is optimised internally to use data.table's fast ordering. Since the behaviour of base:::order seems inconsistent in the way it handles list input - for ex: compare DT[order(list(x))] and DT[order(data.table(x))], we do not support list columns as input here.
If you're sure, you can use `DT[base:::order(.)]` explicitly. However, this can be avoided most of the time by using `[[` to access the specified columns, which results in a vector.

What do you (all) think?

Arun

From: Arunkumar Srinivasan aragorn168b at gmail.com
Reply: Arunkumar Srinivasan aragorn168b at gmail.com
Date: June 20, 2014 at 11:47:22 PM
To: Michael Smith my.r.help at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] `with=F` in the `i` Argument

This is a really tricky one. I was just trying to fix it when I recollected the issues with base:::order from the time during implementation. Consider this case:

require(data.table)
DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))

Consider the cases A and B below:

# case A
DT[base:::order(DT[, "x", with=FALSE])]
#    x y  z
# 1: 1 8 10
# 2: 2 7  9
# 3: 3 5 11
# 4: 4 6 12

Intended right result. Great!

B:

# case B
DT[base:::order(list(x))]
#    x y  z
# 1: 1 8 10

What just happened?!? So, basically if the list gives TRUE for is.object(.), it understands what the operation is, correctly. But if it's just a list, it has no idea how to deal with it. Also, it silently returns an undesirable result (imo). Similar to the above cases, compare these two:

# case C
DT[base:::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])]
# vs
# case D
DT[base:::order(list(x), list(y))]

Even more crazy case:

# case E
DT[base:::order(DT[, c("x", "y"), with=FALSE])]
# vs
# case F
DT[base:::order(list(x,y))]

While we were testing and implementing forder, it obviously didn't occur to us to check order(.) with a data.table as its argument. And in spite of the fact that the output for DT[order(list(x))] is a bit strange and even dangerous, to be consistent with base:::order, we had implemented it the same way. Now, I'm not so sure. Any ideas justifying these differences?
Arun From:?Michael Smith my.r.help at gmail.com Reply:?Michael Smith my.r.help at gmail.com Date:?June 15, 2014 at 5:02:46 AM To:?G See gsee000 at gmail.com Cc:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? Re: [datatable-help] `with=F` in the `i` Argument Devs, Is this a bug? It works in 1.9.2 but not in the 1.9.3 development version: DT <- data.table(a = 1:4, b = 8:5) for (i in c("a", "b")) print(DT[order(DT[, i, with = FALSE])]) Error in forder(DT, DT[, i, with = FALSE]) : Column '1' is type 'list' which is not supported for ordering currently. Thanks, M On 05/31/2014 12:44 PM, G See wrote: > Hi Michael, > > I would use get() > > DT <- data.table(a = 1:4, b = 8:5) > for (i in c("a", "b")) > print(DT[order(get(i))]) > > For what it's worth, your solution doesn't seem to work in data.table > 1.9.3 (svn rev. 1278): > >> for (i in c("a", "b")) > + print(DT[order(DT[, i, with = FALSE])]) > Error in forder(DT, DT[, i, with = FALSE]) : > Column '1' is type 'list' which is not supported for ordering currently. > > > HTH, > Garrett > > On Fri, May 30, 2014 at 11:01 PM, Michael Smith wrote: >> All, >> >> I'm trying to order the rows according to several columns at a time: >> >> DT <- data.table(a = 1:4, b = 8:5) >> for (i in c("a", "b")) >> print(DT[order(i), with = FALSE]) >> >> It doesn't work, since `with` seems to be about the `j` argument, but >> not the `i` argument, according to `?data.table`. 
>> >> I found the following workaround, but wonder whether there is a more >> elegant way to do it: >> >> for (i in c("a", "b")) >> print(DT[order(DT[, i, with = FALSE])]) >> >> Thanks, >> M >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Sat Jun 21 05:19:00 2014 From: my.r.help at gmail.com (Michael Smith) Date: Sat, 21 Jun 2014 11:19:00 +0800 Subject: [datatable-help] `with=F` in the `i` Argument In-Reply-To: References: <5389541B.8040006@gmail.com> <539D0C8F.1080005@gmail.com> Message-ID: <53A4F9A4.1090808@gmail.com> Hi Arun, If `is.object` gives `FALSE` and you just have a list, you could wrap it in `unlist` as follows. It gives the same result for your cases. (These are just my two cents, maybe someone else has a different opinion.) require(data.table) DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9)) ## Case A. DT[base::order(DT[, "x", with=FALSE])] # OK. ## Case B. DT[base::order(list(x))] # Not OK. DT[base::order(unlist(list(x)))] # Same as case A. # Case C. DT[base::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])] # OK. # Case D. DT[base::order(list(x), list(y))] # Not OK. DT[base::order(unlist(list(x), list(y)))] # Same as case C. ## Case E. DT[base::order(DT[, c("x", "y"), with=FALSE])] # Pads NA for `y`. ## Case F. DT[base::order(list(x, y))] # Not OK. DT[base::order(unlist(list(x, y)))] # Same as case E. 
Thanks,
M

From my.r.help at gmail.com Sat Jun 21 09:39:56 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 21 Jun 2014 15:39:56 +0800
Subject: [datatable-help] Self-Join: Potential Bug?
Message-ID: <53A536CC.3050501@gmail.com>

I'm getting a warning when I run the following code in 1.9.2,
dev-1.9.3-master, and dev-1.9.3-issue_700 (b/c I thought it looks
similar to that issue, but it turns out it's different). In contrast, I
do not get this warning in 1.8.10.

Not sure whether this is a bug or whether I'm missing something:

    X <- data.table(
      structure(list(ID = c(45063L, 45066L, 45172L),
                     date = structure(c(14548, 14487, 14395), class = "Date"),
                     price = c(17.56, 12.49, 10.04)),
                .Names = c("ID", "date", "price"),
                row.names = c(NA, -3L), class = "data.frame"),
      key = "ID,date")
    X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]

The data and the warning message look like this:

    > X
          ID       date price
    1: 45063 2009-10-31 17.56
    2: 45066 2009-08-31 12.49
    3: 45172 2009-05-31 10.04
    > X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]
          ID       date price
    1: 45063 2009-05-31    NA
    2: 45066 2010-05-31    NA
    3: 45172 2009-05-31 10.04
    Warning message:
    In as.data.table.list(i) :
      Item 2 is of size 2 but maximum size is 3 (recycled leaving a remainder of 1 items)

Thanks,
M

From aragorn168b at gmail.com Sat Jun 21 09:43:29 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 21 Jun 2014 09:43:29 +0200
Subject: [datatable-help] Self-Join: Potential Bug?
In-Reply-To: <53A536CC.3050501@gmail.com>
References: <53A536CC.3050501@gmail.com>
Message-ID: 

Michael,

You should be using `CJ`. This is no different from the post from
Garrett See,
http://lists.r-forge.r-project.org/pipermail/datatable-help/2014-June/002619.html
just last week, IIUC.

Arun

From my.r.help at gmail.com Sat Jun 21 09:45:45 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 21 Jun 2014 15:45:45 +0800
Subject: [datatable-help] Self-Join: Potential Bug?
In-Reply-To: 
References: <53A536CC.3050501@gmail.com>
Message-ID: <53A53829.8000503@gmail.com>

Great, thanks a lot for the clarification; I thought I was going crazy.

Cheers,
M

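The difference between `J` and `CJ` in the exchange above can be
sketched as follows. `J(...)` binds its arguments column-wise, so the
two dates are recycled against the three IDs (hence the warning),
whereas `CJ(...)` builds every ID/date combination. The table re-creates
`X` from the post; the names `ids`, `dates`, `grid` and `res` are made up
for illustration, and this is only a sketch of the suggestion, not the
exact code from the linked post.

```r
library(data.table)

# Same three rows as in the post, keyed by ID and date.
X <- data.table(
  ID    = c(45063L, 45066L, 45172L),
  date  = as.Date(c("2009-10-31", "2009-08-31", "2009-05-31")),
  price = c(17.56, 12.49, 10.04),
  key   = c("ID", "date")
)

ids   <- unique(X$ID)
dates <- as.Date(c("2009-05-31", "2010-05-31"))

# CJ() is a cross join: 3 IDs x 2 dates = 6 rows, nothing is recycled.
grid <- CJ(ID = ids, date = dates)

# Join on X's key; combinations that are not in X get price = NA.
res <- X[grid]
print(res)
```

With `CJ` every ID is looked up on both dates, so the result has six
rows and only the combination that exists in `X` carries a price.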
From aragorn168b at gmail.com Tue Jun 24 01:54:49 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 24 Jun 2014 01:54:49 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <53A4F9A4.1090808@gmail.com>
References: <5389541B.8040006@gmail.com> <539D0C8F.1080005@gmail.com> <53A4F9A4.1090808@gmail.com>
Message-ID: 

I've gone ahead and fixed
[#696](https://github.com/Rdatatable/data.table/issues/696) to be
consistent with base, even though I think this is not necessary in
almost all cases.

One could either do:

    DT[order(DT[["a"]])]

Or simply use copy along with setorderv:

    setorderv(copy(DT), cols="a", order=1L, na.last=FALSE)

I find the latter much cleaner, and it can be used if one wants to
reorder by reference as well, by just removing copy.

Arun

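The options mentioned in this thread for ordering by a column whose name
is held in a variable can be put side by side in a short sketch. The
table and the names `col`, `by_get`, `by_vec` and `by_copy` are made up
for illustration; the calls themselves (`get()`, `[[`, `setorderv()`
with `copy()`) are the ones suggested in the messages above.

```r
library(data.table)

DT  <- data.table(a = c(3L, 1L, 4L, 2L), b = 8:5)
col <- "b"  # hypothetical variable holding the column name

# 1. Look the column up by name inside `i`.
by_get <- DT[order(get(col))]

# 2. Extract the column as a plain vector with `[[`.
by_vec <- DT[order(DT[[col]])]

# 3. Sort a copy by reference; DT itself stays untouched.
by_copy <- setorderv(copy(DT), cols = col, order = 1L)

print(by_get)
```

All three give the same row order here. Dropping `copy()` in the third
variant reorders `DT` in place, which avoids allocating a second table.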
From ronin78 at gmail.com Thu Jun 26 22:56:40 2014
From: ronin78 at gmail.com (Matthew DeAngelis)
Date: Thu, 26 Jun 2014 16:56:40 -0400
Subject: [datatable-help] Efficiently checking value of other row in data.table
Message-ID: 

Hello data.table gurus,

I have been using data.table to efficiently work with textual data and I
love it for that purpose.
I have transformed my data so that it looks something like this:

    word         document  position
    I            1         1
    have         1         2
    transformed  1         3
    my           1         4
    data         1         5
    so           2         1
    that         2         2
    it           2         3
    looks        2         4
    something    2         5
    like         2         6
    this         2         7

(I actually use a unique number for each word, so that I am able to use
data.table's excellent features to do lightning-fast word counts. This
has revolutionized my workflow over looping through text files with
Perl.)

My problem is that I sometimes need to search for phrases or to select
words based on their context (for instance, I may want to exclude a word
if it is preceded by "not" or followed by a word that changes its
meaning). Currently, I am using the solution here to create a new column
for a word in another position, like this:

    word         document  position  lead_word
    I            1         1         have
    have         1         2         transformed
    transformed  1         3         my
    my           1         4         data
    data         1         5         NA
    so           2         1         that
    that         2         2         it
    it           2         3         looks
    looks        2         4         something
    something    2         5         like
    like         2         6         this
    this         2         7         NA

using a command like:

    DT[,lead_word:=DT[list(document,position+1),word]]

This approach has two problems, however. First, it consumes more
resources as the dataset grows. I am currently working with a file
containing over 150 million rows, so adding a column is costly. Second,
I may want to check both one and two words ahead, so that I have to add
two columns, and this can quickly get out of hand.

Is there a better way to use data.table to check the value in a row N
distance from the row of interest within a group and select a row based
on that value? Perhaps the .I variable could be useful here?

I appreciate any suggestions.

Regards,
Matt

From mdowle at mdowle.plus.com Fri Jun 27 21:17:18 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Fri, 27 Jun 2014 20:17:18 +0100
Subject: [datatable-help] Efficiently checking value of other row in data.table
In-Reply-To: 
References: 
Message-ID: <53ADC33E.7040509@mdowle.plus.com>

Hi,

Not sure exactly what you need but looks interesting.

Something a bit like this ?

    DT[ word == "good", .SD[ lag(word, N) != "not" ], by=document]

Your idea being you don't want to have to repeat all the pre and post
words alongside each word but rather express it in the query. Makes
sense. Leads to classifying "not good" and "not very good" as both
negative phrases I guess.

Matt

From ronin78 at gmail.com Sat Jun 28 11:55:12 2014
From: ronin78 at gmail.com (Matthew DeAngelis)
Date: Sat, 28 Jun 2014 05:55:12 -0400
Subject: [datatable-help] Efficiently checking value of other row in data.table
In-Reply-To: <53ADC33E.7040509@mdowle.plus.com>
References: <53ADC33E.7040509@mdowle.plus.com>
Message-ID: 

Hi Matt,

You have the right of it. The problem is somewhat complicated, however,
since I would want to substitute "DT[word=="good"..." with
"DT[J("good")..." after setting the key to word and reordering the rows.
Hence the two-step process I have now where I key by document and
position first, create the lag_word column, key by the word and lag_word
columns and query by row.

Matt

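One way to express the context check inside the query, without storing a
`lead_word` or `lag_word` column, is sketched below. It assumes a
data.table release that provides `shift()`, which appears to postdate
the 1.9.2/1.9.3 versions discussed in this thread; the toy table and the
names `keep` and `hits` are made up for illustration. Each group still
allocates a temporary shifted vector, so whether this scales to 150
million rows would need to be measured.

```r
library(data.table)

# Toy corpus: document 1 contains "not good", document 2 "looks good".
DT <- data.table(
  word     = c("this", "is", "not", "good", "it", "looks", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  position = c(1:4, 1:3)
)
setkey(DT, document, position)

# Row numbers (.I) of "good" whose previous word in the same document is
# not "not". The lagged words exist only inside this expression; no
# column is added to DT.
keep <- DT[, .I[word == "good" &
                shift(word, 1L, fill = "", type = "lag") != "not"],
           by = document]$V1

hits <- DT[keep]
print(hits)
```

Looking two words back or one word ahead only changes the `n` and `type`
arguments of `shift()`, so several context checks can be combined in the
same expression instead of adding one column per offset.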
From mdowle at mdowle.plus.com Sun Jun 29 00:00:58 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Sat, 28 Jun 2014 23:00:58 +0100
Subject: [datatable-help] Efficiently checking value of other row in data.table
In-Reply-To: 
References: <53ADC33E.7040509@mdowle.plus.com>
Message-ID: <53AF3B1A.30308@mdowle.plus.com>

Hi Matt,

Great. If you can prepare some dummy data with the appropriate
properties and a parameter or two to scale up the size (or just provide
an online large example to download) and a query that gets to the right
answer but is slow or ugly, then we've got something to chew on ...

Matt

From ggrothendieck at gmail.com Sun Jun 29 22:58:50 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 29 Jun 2014 16:58:50 -0400
Subject: [datatable-help] by row
Message-ID: 

There was some discussion of an .EACHI facility for data.table. Not
sure what happened about that but I have an example that might be
useful:

http://stackoverflow.com/questions/24472254/splitting-a-column-by-factor-within-a-data-frame/24472571#24472571

which shows the code where DT has columns v1, v2 and v3:

    DT[, split(v2, v1), by = names(DT)]

It works well if the rows of DT are unique but if they are not then one
must do something ugly like appending a uniquifying column of
1:nrow(DT), say, and then including that in by and then finally removing
it again at the end.

This suggests two features:

1. The ability to tell it to do the by by row
2. The ability to selectively omit by variables from the output

For example, if one could use a pseudo column .I and if -.I meant do not
include it in the output then one could write:

    DT[, split(v2, v1), by = c(names(DT), -.I)]

Other syntaxes may be thought of too and the main suggestion here is the
possible need for these features rather than the specific syntax.

(By the way, is there an intention to move to the issue system on github
for things like this?)

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Sun Jun 29 23:39:01 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 29 Jun 2014 23:39:01 +0200
Subject: [datatable-help] by row
In-Reply-To:
References:
Message-ID:

Hi,

You write:

> There was some discussion of an .EACHI facility for data.table. Not sure
> what happened about that but I have an example that might be useful:
> http://stackoverflow.com/questions/24472254/splitting-a-column-by-factor-within-a-data-frame/24472571#24472571

by=.EACHI was implemented to remove the implicit "by-without-by" feature during joins. That was implemented quite some time back; check the first FR implemented in the README, following which Matt also posted on the mailing list asking for feedback.

You write:

> which shows the code where DT has columns v1, v2 and v3:
> DT[, split(v2, v1), by = names(DT)]

A small comment on this solution per se: it calls split for each row! I'd approach this a little differently:

    ## 1.9.3
    rbindlist(setDT(dd)[, { ans = list(v2); setattr(ans, 'names', v1); list(list(ans)) },
                        by = list(v1=as.character(v1))]$V1, fill=TRUE)
    #     a  b
    # 1:  1 NA
    # 2:  2 NA
    # 3:  6 NA
    # 4: NA  3
    # 5: NA  4
    # 6: NA  5

We can then add this back to dd by reference. Personally I've never had to call split on a data.table.

You write:

> It works well if the rows of DT are unique but if they are not then
> one must do something ugly like appending a uniquifying column of
> 1:nrow(DT), say, and then including that in by and then finally
> removing it again at the end. This suggests two features:
> 1. The ability to tell it to do the by by row
> 2. The ability to selectively omit by variables from the output

Not sure I follow this entirely, but by= does accept expressions. So, you could do:

    dd[, split(v2,v1), by=1:nrow(dd)]
    #    nrow  a  b
    # 1:    1  1 NA
    # 2:    2  2 NA
    # 3:    3  6 NA
    # 4:    4 NA  3
    # 5:    5 NA  4
    # 6:    6 NA  5

You write:

> (By the way, is there an intention to move to the issue system on
> github for things like this?)

All the issues from R-Forge have already been moved to github, including feature requests, and since then users have filed new FRs/bugs there. So, yes, you can file FRs directly, although in this case, I think the feature already exists (IIUC)?

Arun

From ggrothendieck at gmail.com Mon Jun 30 01:48:33 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 29 Jun 2014 19:48:33 -0400
Subject: [datatable-help] Finding Rdatatable/datatable
Message-ID:

Googling for github data.table gets one to:
https://github.com/arunsrinivasan/datatable
and the DESCRIPTION file and the CRAN page (http://cran.r-project.org/package=data.table) both point to R-Forge, so the github Rdatatable/datatable page is not so easy to find. (I had previously been using the one google leads you to, which is why I could not find the issues.)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Mon Jun 30 02:04:11 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 02:04:11 +0200
Subject: [datatable-help] Finding Rdatatable/datatable
In-Reply-To:
References:
Message-ID:

I'm sorry about that and am not sure what to do about it. Hopefully Rdatatable/data.table will get more hits and turn up as the top hit soon. It's been only a few days since the transition.

Arun

From ronin78 at gmail.com Mon Jun 30 15:24:00 2014
From: ronin78 at gmail.com (Matthew DeAngelis)
Date: Mon, 30 Jun 2014 09:24:00 -0400
Subject: [datatable-help] Efficiently checking value of other row in data.table
In-Reply-To: <53AF3B1A.30308@mdowle.plus.com>
References: <53ADC33E.7040509@mdowle.plus.com> <53AF3B1A.30308@mdowle.plus.com>
Message-ID:

Hi Matt,

Thanks for the suggestion. I am placing an example below that I hope illustrates the problem more clearly. Please let me know if I can provide additional detail or clarification.

Regards,
Matt

First we create a dummy dataset with ten documents containing one million words. There are three unique words in the set.

    library(data.table)
    options(scipen=2)
    set.seed(1000)
    DT<-data.table(wordindex=sample(1:3,1000000,replace=TRUE),docindex=sample(1:10,1000000,replace=TRUE))
    setkey(DT,docindex)
    DT[,position:=seq_len(.N),by=docindex]

    ##          wordindex docindex position
    ##       1:         1        1        1
    ##       2:         1        1        2
    ##       3:         3        1        3
    ##       4:         3        1        4
    ##       5:         1        1        5
    ##      ---
    ##  999996:         2       10    99811
    ##  999997:         2       10    99812
    ##  999998:         3       10    99813
    ##  999999:         1       10    99814
    ## 1000000:         3       10    99815

This is a query to count the occurrences of the first unique word across all documents.
It is also beautiful.

    setkey(DT,wordindex)
    count<-DT[J(1),list(count.1=.N),by=docindex]
    count

    ##     docindex count.1
    ##  1:        1   33533
    ##  2:        2   33067
    ##  3:        3   33538
    ##  4:        4   33053
    ##  5:        5   33231
    ##  6:        6   33002
    ##  7:        7   33369
    ##  8:        8   33353
    ##  9:        9   33485
    ## 10:       10   33225

It gets messier when we have to take the position ahead into account. This is a query to count the occurrences of the first unique word across all documents UNLESS it is followed by the second unique word. We create a new column containing the word one position ahead and then key on both words.

    setkey(DT,docindex,position)
    DT[,lead_wordindex:=DT[list(docindex,position+1)][,wordindex]]

    ##          wordindex docindex position lead_wordindex
    ##       1:         1        1        1              1
    ##       2:         1        1        2              3
    ##       3:         3        1        3              3
    ##       4:         3        1        4              1
    ##       5:         1        1        5              2
    ##      ---
    ##  999996:         2       10    99811              2
    ##  999997:         2       10    99812              3
    ##  999998:         3       10    99813              1
    ##  999999:         1       10    99814              3
    ## 1000000:         3       10    99815             NA

    setkey(DT,wordindex,lead_wordindex)
    countr2<-DT[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
    countr2

    ##     docindex count.1
    ##  1:        1   22301
    ##  2:        2   21835
    ##  3:        3   22490
    ##  4:        4   21830
    ##  5:        5   22218
    ##  6:        6   21914
    ##  7:        7   22370
    ##  8:        8   22265
    ##  9:        9   22211
    ## 10:       10   22190

I have a very large dataset for which the above query fails for memory allocation.
As an alternative, we can create this new column for only the relevant subset of data by filtering the original dataset and then joining it back on the desired position:

    setkey(DT,wordindex)
    filter<-DT[J(1),list(wordindex,docindex,position)]
    filter[,lead_position:=position+1]

    ##         wordindex wordindex docindex position lead_position
    ##      1:         1         1        2    99717         99718
    ##      2:         1         1        3    99807         99808
    ##      3:         1         1        4   100243        100244
    ##      4:         1         1        1        1             2
    ##      5:         1         1        1       42            43
    ##     ---
    ## 332852:         1         1       10    99785         99786
    ## 332853:         1         1       10    99787         99788
    ## 332854:         1         1       10    99798         99799
    ## 332855:         1         1       10    99804         99805
    ## 332856:         1         1       10    99814         99815

    setkey(DT,docindex,position)
    filter[,lead_wordindex:=DT[J(filter[,list(docindex,lead_position)])][,wordindex]]

    ##         wordindex wordindex docindex position lead_position lead_wordindex
    ##      1:         1         1        2    99717         99718             NA
    ##      2:         1         1        3    99807         99808             NA
    ##      3:         1         1        4   100243        100244             NA
    ##      4:         1         1        1        1             2              1
    ##      5:         1         1        1       42            43              1
    ##     ---
    ## 332852:         1         1       10    99785         99786              3
    ## 332853:         1         1       10    99787         99788              3
    ## 332854:         1         1       10    99798         99799              3
    ## 332855:         1         1       10    99804         99805              3
    ## 332856:         1         1       10    99814         99815              3

    setkey(filter,wordindex,lead_wordindex)
    countr2.1<-filter[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
    countr2.1

    ##     docindex count.1
    ##  1:        1   22301
    ##  2:        2   21835
    ##  3:        3   22490
    ##  4:        4   21830
    ##  5:        5   22218
    ##  6:        6   21914
    ##  7:        7   22370
    ##  8:        8   22265
    ##  9:        9   22211
    ## 10:       10   22190

Pretty ugly, I think. In addition, we may want to look more than one word ahead. We have to create yet another column.
The easy but costly way is:

    setkey(DT,docindex,position)
    DT[,lead_lead_wordindex:=DT[list(docindex,position+2)][,wordindex]]

    ##          wordindex docindex position lead_wordindex lead_lead_wordindex
    ##       1:         1        1        1              1                   3
    ##       2:         1        1        2              3                   3
    ##       3:         3        1        3              3                   1
    ##       4:         3        1        4              1                   2
    ##       5:         1        1        5              2                   3
    ##      ---
    ##  999996:         2       10    99811              2                   3
    ##  999997:         2       10    99812              3                   1
    ##  999998:         3       10    99813              1                   3
    ##  999999:         1       10    99814              3                  NA
    ## 1000000:         3       10    99815             NA                  NA

    setkey(DT,wordindex,lead_wordindex,lead_lead_wordindex)
    countr23<-DT[J(1,2,3),list(count.1=.N),by=docindex]
    countr23

    ##     docindex count.1
    ##  1:        1    3684
    ##  2:        2    3746
    ##  3:        3    3717
    ##  4:        4    3727
    ##  5:        5    3700
    ##  6:        6    3779
    ##  7:        7    3702
    ##  8:        8    3756
    ##  9:        9    3702
    ## 10:       10    3744

However, I currently have to use the ugly filter-and-join way because of size. So the question is, is there an easier and more beautiful way?

From macrakis at alum.mit.edu Mon Jun 30 17:37:56 2014
From: macrakis at alum.mit.edu (Stavros Macrakis (Σταῦρος Μακράκης))
Date: Mon, 30 Jun 2014 11:37:56 -0400
Subject: [datatable-help] Speeding up column references with roll
Message-ID:

In the following example, it is about 15-25% faster to use setnames rather than j=list(name=var). Is there some better approach to referencing the other joined column when using roll?

    # Use j=list(name=var)
    calc1 <- function(d) {
      d[ hit==1
       ][ d,list(hittime=time),roll=-20
       ][ !is.na(hittime)
       ]
    }

    # Use setnames
    calc2 <- function(d) {
      temp <- d[ hit==1
               ][ d,time,roll=-20
               ]
      setnames(temp,3,"hittime")
      temp[!is.na(hittime)]
    }

    # Generate sample data
    set.seed(12312391)
    data <- data.table(
              group = sample(1e3,1e7,replace=T),
              time = ceiling(runif(1e7, 0, 1e5)),
              hit = rbinom(1e7, 1, p = 0.1),
              key=c("group","time"))

    # Timing
    system.time(replicate(10,{gc();calc1(data)}))  # => 69 sec
    system.time(replicate(10,{gc();calc2(data)}))  # => 52 sec

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From macrakis at alum.mit.edu Mon Jun 30 18:06:41 2014
From: macrakis at alum.mit.edu (Stavros Macrakis (Σταῦρος Μακράκης))
Date: Mon, 30 Jun 2014 12:06:41 -0400
Subject: [datatable-help] i = !x different from i = (!x)
Message-ID:

DT 1.9.2

    t1 <- data.table(a=1:2,b=0:1,key="a")
    t1[b==0]   # => row 1, OK
    t1[!b]     # => ERROR "object 'b' not found" ??
    t1[(!b)]   # => row 1, OK

Shouldn't !b be equivalent to (!b)? They are both expressions, not symbols.

-s

From macrakis at alum.mit.edu Mon Jun 30 18:30:45 2014
From: macrakis at alum.mit.edu (Stavros Macrakis (Σταῦρος Μακράκης))
Date: Mon, 30 Jun 2014 12:30:45 -0400
Subject: [datatable-help] Error corrupts tables
Message-ID:

    > library(data.table)
    data.table 1.9.2  For help type: help("data.table")
    > test <- data.table(a=1:10,b=1:10%%6==0,key="a")
    > test[b==1][test,b,roll=2]
    Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed

Not sure what the error is there, but even worse...

    > test
    Error in if (!is.null(ns)) ns else tryCatch(loadNamespace(name), error = function(e) { :
      missing value where TRUE/FALSE needed
    > tables()
    Error in if (!is.null(ns)) ns else tryCatch(loadNamespace(name), error = function(e) { :
      missing value where TRUE/FALSE needed

It looks like some data structure has been corrupted.

From aragorn168b at gmail.com Mon Jun 30 18:45:57 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 18:45:57 +0200
Subject: [datatable-help] Error corrupts tables
In-Reply-To:
References:
Message-ID:

Fixed in 1.9.3: https://github.com/Rdatatable/data.table/commit/ddc1d23166932198ee826f8e66176266093b0b41

Arun

From aragorn168b at gmail.com Mon Jun 30 18:50:32 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 18:50:32 +0200
Subject: [datatable-help] i = !x different from i = (!x)
In-Reply-To:
References:
Message-ID:

Not at the moment, because of the checks. If i is a call and its first element is `!`, the `!` is removed and the rest is checked to see whether it's a "name", which is true for `t1[!b]`. And i by default searches the calling scope, because `i` can be a data.table. That is, `X[!Y]` is intended to be used for `Y` being a data.table. Hence the difference between `X[!Y]` and `X[(!Y)]` at the moment.

But in 1.9.3, this'll get better. Have a look at https://github.com/Rdatatable/data.table/issues/697 and https://github.com/Rdatatable/data.table/issues/633

Arun

From aragorn168b at gmail.com Mon Jun 30 19:00:17 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 19:00:17 +0200
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To:
References:
Message-ID:

Once again, this has been fixed in 1.9.3. A join now requires an explicit `by=.EACHI` to perform a by-without-by.
https://github.com/Rdatatable/data.table/blob/master/README.md
Have a look at the first FR (by = .EACHI runs ...) that's been fixed in 1.9.3. There are some changes in the results of joins because of it (these have been discussed for quite some time) to bring more consistency to the DT[i, j, by] syntax. Also have a look at the second FR and the links it points to for the discussions.

In general, it's better to test with the devel version (and have a look at README) for any bugs you may encounter.

Arun

From ggrothendieck at gmail.com Mon Jun 30 20:21:18 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Mon, 30 Jun 2014 14:21:18 -0400
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To:
References:
Message-ID:

On Mon, Jun 30, 2014 at 1:00 PM, Arunkumar Srinivasan wrote:
> Once again, has been fixed in 1.9.3. Now join requires `by=.EACHI`
> (explicit) to perform a by-without-by.
> https://github.com/Rdatatable/data.table/blob/master/README.md

The README would be easier to understand if DT was not undefined in the README. As it stands none of the examples are runnable.
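
A small self-contained sketch of the by=.EACHI change discussed in this thread may help here. It assumes data.table 1.9.4 or later; the tables X and Y and the columns id and v are made up for illustration and are not taken from the README.

```r
library(data.table)

# Hypothetical tables, invented for this illustration only.
X <- data.table(id = c("a", "a", "b", "b", "c"), v = 1:5, key = "id")
Y <- data.table(id = c("a", "b"))

# Without by: j is evaluated once, on the whole result of the join,
# so this is the same as X[Y][, sum(v)].
total <- X[Y, sum(v)]

# With by = .EACHI: j is evaluated once per row of Y, on the rows of X
# that the row joins to (the old implicit "by-without-by" behaviour).
per_group <- X[Y, sum(v), by = .EACHI]

print(total)
print(per_group)
```

Here the first query returns a single sum over all matched rows, while the second returns one row per row of Y, which is what a join did implicitly in 1.9.2 and earlier.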
From ggrothendieck at gmail.com Mon Jun 30 20:41:30 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Mon, 30 Jun 2014 14:41:30 -0400
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To:
References:
Message-ID:

One other comment. I wonder if .EACHI could mean by each row if there were no join specified, so this:

    library(data.table)
    DT <- data.table(
      v1 = factor(c("a", "a", "a", "b", "b", "b")),
      v2 = c(1, 1, 6, 3, 4, 5),
      v3 = c("a", "b", "c", "a", "b", "c"),
      stringsAsFactors=FALSE
    )
    DT[, c(.SD, split(v2, v1)), by = 1:nrow(DT)][, -1, with = FALSE]

could be written:

    DT[, c(.SD, split(v2, v1)), by = .EACHI]

or maybe even:

    DT[, split(v2, v1), by = c(names(DT), .EACHI)]

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From macrakis at alum.mit.edu Mon Jun 30 22:40:24 2014
From: macrakis at alum.mit.edu (Stavros Macrakis (Σταῦρος Μακράκης))
Date: Mon, 30 Jun 2014 16:40:24 -0400
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To:
References:
Message-ID:

OK, I'm retesting in 1.9.3, adding by=.EACHI. I don't see any significant difference in the timings -- setnames is still 25% faster than list(hittime=time). What exactly was fixed?

I also don't see any way to refer to the different time vs. hittime without renaming the second time column.

You mention some FR's, but they're hard to find without the specific numbers.

Where can I find the 1.9.3 reference manual? I think it would be easier for me to understand than the incremental changes in the New Features listings. On my system (MacOSX), build_vignettes=TRUE gives an error in texi2dvi -- would that have generated the refman? If so, how do I fix that?

Thanks,

-s

From aragorn168b at gmail.com Mon Jun 30 23:34:36 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 23:34:36 +0200
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To:
References:
Message-ID:

Your example doesn't work without allow.cartesian=TRUE. You shouldn't be using by=.EACHI here. This by was what was implicit in the earlier versions, which made it slow. Please re-read the README.
Here's the function I tested on 1.9.3:

    calc1 <- function(d) {
        d[ hit==1][ d,list(hittime=time),roll=-20, allow.cartesian=TRUE][ !is.na(hittime)]
    }
    calc2 <- function(d) {
        temp <- d[ hit==1][ d,list(time),roll=-20, allow.cartesian=TRUE]
        setnames(temp,1,"hittime")
        temp[!is.na(hittime)]
    }

    # Generate sample data
    set.seed(12312391)
    data <- data.table(
        group = sample(1e3,1e7,replace=T),
        time = ceiling(runif(1e7, 0, 1e5)),
        hit = rbinom(1e7, 1, p = 0.1),
        key=c("group","time"))

    system.time(ans1 <- calc1(data))
    #  user  system elapsed
    # 2.083   0.189   2.344
    system.time(ans2 <- calc2(data))
    #  user  system elapsed
    # 2.012   0.241   2.426
    identical(ans1, ans2)
    # [1] TRUE

You write:

> I also don't see any way to refer to the different time vs. hittime
> without renaming the second time column.

I don't quite follow what this means, but IIUC I think this is what you're referring to: https://github.com/Rdatatable/data.table/issues/471

You write:

> You mention some FR's, but they're hard to find without the specific
> numbers.

I was mentioning the first two points under NEW FEATURES within Changes in v1.9.3: the one that starts with "by=.EACHI runs j for each group in x that each row of i joins to." and the one that starts with "Accordingly, X[Y, j] now does what X[Y][, j] did." Maybe we should start numbering the fixes for easy reference. Will note it down.

You write:

> Where can I find the 1.9.3 reference manual?

This version is a development version. Necessary changes will be reflected in their corresponding ?... entry. And when we find some time, the introduction and FAQs will be updated. But that's not yet. If you don't wish to keep up-to-date by looking at the NEWS, you'll have to wait until the next stable release on CRAN.

You write:

> On my system (MacOSX), build_vignettes=TRUE gives an error in texi2dvi
> -- would that have generated the refman? If so, how do I fix that?

I'm guessing it's a PDF latex error. If so, you'll have to install what the error message says is missing on your system.
Sorry, can't help you much there.

Arun

-------------- next part --------------
An HTML attachment was scrubbed...
URL: