From aragorn168b at gmail.com Sun Jun 2 10:37:33 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 2 Jun 2013 10:37:33 +0200
Subject: [datatable-help] SO question on possible bug with "set"
Message-ID: <483757DBE9FF420A97E42B30F696A21D@gmail.com>

Hello,

I saw this question on SO yesterday:
http://stackoverflow.com/questions/16877027/negative-number-of-rows-in-data-table-after-incorrect-use-of-set

I think this is a bug in "set". Unfortunately, `set` calls C code internally and I am not good at using gdb within R. Any ideas what the problem is here?

Thanks,
Arun

From statquant at outlook.com Mon Jun 3 12:36:32 2013
From: statquant at outlook.com (statquant3)
Date: Mon, 3 Jun 2013 03:36:32 -0700 (PDT)
Subject: [datatable-help] integer64 and fread
Message-ID: <1370255792184-4668548.post@n4.nabble.com>

Hello,

Today I had to load a file which contained integers and, on some lines, nothing. data.table:::fread seems to cast the blanks to integer64. So I had to use colClasses (and then specify all 20 columns just for this one column, which I wanted loaded as character and then cast to a regular integer).

From my experience with fread (which is the only function I am using now), loading as integer64 is more often a problem than a help, because integer64 cannot yet be used in analysis (not as a key...).

So may I humbly suggest that the default of the integer64 parameter in fread be changed to "character" until the parameter is fully implemented?

--
View this message in context: http://r.789695.n4.nabble.com/integer64-and-fread-tp4668548.html
Sent from the datatable-help mailing list archive at Nabble.com.

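The workaround described in the message above can be sketched as follows. This is only an illustration under assumptions: it presumes a recent data.table release in which `fread` accepts `text=`, a working `integer64=` argument, and a named `colClasses` vector (none of which is confirmed for the 2013 version under discussion), and the CSV content and the column names `id` and `value` are made up.

```r
library(data.table)

# Hypothetical input: "id" holds large integers and is blank on one line.
csv <- "id,value\n1234567890123,1.5\n,2.5\n42,3.5\n"

# Option 1: read every column that would be detected as integer64
# as character instead.
dt1 <- fread(text = csv, integer64 = "character")

# Option 2: override only the one column by name and leave the other
# columns to fread's type detection.
dt2 <- fread(text = csv, colClasses = c(id = "character"))

str(dt1)
str(dt2)
```

With a named `colClasses` there is no need to list all 20 columns to change the type of one of them.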
From eduard.antonyan at gmail.com Mon Jun 3 19:01:17 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 3 Jun 2013 12:01:17 -0500
Subject: [datatable-help] SO question on possible bug with "set"
In-Reply-To: <483757DBE9FF420A97E42B30F696A21D@gmail.com>
References: <483757DBE9FF420A97E42B30F696A21D@gmail.com>
Message-ID:

As far as I can tell there are no out-of-bounds checks in "assign", so when it calls memcpy it just overwrites some part of the data.table structure that it shouldn't. A simple out-of-bounds check would fix the issue.

On Sun, Jun 2, 2013 at 3:37 AM, Arunkumar Srinivasan wrote:
> Hello,
>
> I saw this question on SO yesterday:
> http://stackoverflow.com/questions/16877027/negative-number-of-rows-in-data-table-after-incorrect-use-of-set
>
> I think this is a bug in "set". But unfortunately, `set` calls for c-code
> internally and am not good at using gdb within R. Any ideas of what's the
> problem here?
>
> Thanks,
> Arun
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From wsteitz at gmail.com Thu Jun 6 10:19:56 2013
From: wsteitz at gmail.com (Wolfgang Steitz)
Date: Thu, 6 Jun 2013 10:19:56 +0200
Subject: [datatable-help] Joins with or-condition
Message-ID:

Is it possible to do something like this with data.table?

SELECT sum(a.abc)
FROM table_a as a
LEFT JOIN table_b as b
ON a.x = b.x AND (a.y=b.y or a.y=b.z or a.z=b.y or a.z=b.z)
Group by a.xyz

So my problem is to translate those or-conditions in the ON clause into data.table syntax. I tried doing a cartesian join first, then filtering with the or-conditions, and then doing the group-by, which gives me the desired results. But since I have large datasets (table_a and table_b both have more than 1 million rows), I want to avoid the cartesian join.
Any ideas how to do this in a more clever way?

Thanks,
Wolfgang

From statquant at outlook.com Thu Jun 6 11:10:08 2013
From: statquant at outlook.com (statquant3)
Date: Thu, 6 Jun 2013 02:10:08 -0700 (PDT)
Subject: [datatable-help] Joins with or-condition
In-Reply-To:
References:
Message-ID: <1370509808219-4668804.post@n4.nabble.com>

Can you give us a sample of the data (I know it's painful)? I do not know what ON really does. I have some experience with merges and have always been able to avoid a cartesian join...

From wsteitz at gmail.com Thu Jun 6 11:43:00 2013
From: wsteitz at gmail.com (Wolfgang Steitz)
Date: Thu, 6 Jun 2013 11:43:00 +0200
Subject: [datatable-help] Joins with or-condition
In-Reply-To: <1370509808219-4668804.post@n4.nabble.com>
References: <1370509808219-4668804.post@n4.nabble.com>
Message-ID:

OK, here is a small example. I also changed the SQL a bit; I think it makes more sense this way.

SELECT sum(a.abc * b.abc)
FROM table_a as a
LEFT JOIN table_b as b
ON (a.y=b.y or a.y=b.z or a.z=b.y or a.z=b.z)
Group by a.xyz

table_a
id | y | z | abc   | xyz
------------------------------
1  | l | u | 123.2 | a
2  | l | s | 13.5  | b
3  | s | u | 228.4 | f
4  | a | b | 427.2 | b
5  | a | a | 123.1 | a
6  | b | b | 180.2 | b
7  | c | c | 153.8 | f
8  | d | d | 113.2 | a
9  | d | a | 13.2  | f

table_b
id | y | z | abc
----------------
1  | l | u | 123.2
2  | l | s | 13.5
3  | s | u | 228.4
4  | a | b | 427.2
5  | a | a | 123.1
6  | b | b | 180.2
7  | c | c | 153.8
8  | d | d | 113.2
9  | d | a | 13.2
10 | k | a | 123.1
11 | k | b | 180.2
12 | d | d | 153.8
13 | c | d | 113.2
14 | u | a | 13.2

From statquant at outlook.com Thu Jun 6 13:20:06 2013
From: statquant at outlook.com (statquant3)
Date: Thu, 6 Jun 2013 04:20:06 -0700 (PDT)
Subject: [datatable-help] Joins with or-condition
In-Reply-To:
References: <1370509808219-4668804.post@n4.nabble.com>
Message-ID: <1370517606361-4668810.post@n4.nabble.com>

And the result you want to achieve?

FYI: it is usually better to send reproducible code; we can't really load your tables into R from what you sent us. Try ?dput or this post on SO

From wsteitz at gmail.com Thu Jun 6 14:30:16 2013
From: wsteitz at gmail.com (Wolfgang Steitz)
Date: Thu, 6 Jun 2013 14:30:16 +0200
Subject: [datatable-help] Joins with or-condition
In-Reply-To: <1370517606361-4668810.post@n4.nabble.com>
References: <1370509808219-4668804.post@n4.nabble.com> <1370517606361-4668810.post@n4.nabble.com>
Message-ID:

Sorry for not sending code. I guess I was expecting a simple answer on how this SQL query is translated. For an "and" in the "on", for example, using 'setkey' and then doing a join works.
So here is reproducible code:

require(data.table)
require(sqldf)

table_a <- data.table(y=c("l","l","s","a","a","b","c","d","d"),
                      z=c("u","s","u","b","a","b","c","d","a"),
                      abc=c(123.2,13.5,228.4,427.2,123.1,180.2,153.8,113.2,13.2),
                      xyz=c("a","b","f","b","a","b","f","a","f"))

table_b <- data.table(y=c("l","l","s","a","a","b","c","d","d","k","k","d","c","u"),
                      z=c("u","s","u","b","a","b","c","d","a","a","b","d","d","a"),
                      abc=c(123.2,13.5,228.4,427.2,123.1,180.2,153.8,113.2,13.2,123.1,180.2,153.8,113.2,13.2))

sqldf("SELECT a.xyz, sum(a.abc * b.abc)
       FROM table_a as a
       LEFT JOIN table_b as b
       ON (a.y=b.y or a.y=b.z or a.z=b.y or a.z=b.z)
       Group by a.xyz", drv="SQLite")

From ggrothendieck at gmail.com Thu Jun 6 15:58:31 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Thu, 6 Jun 2013 09:58:31 -0400
Subject: [datatable-help] Joins with or-condition
In-Reply-To:
References: <1370509808219-4668804.post@n4.nabble.com> <1370517606361-4668810.post@n4.nabble.com>
Message-ID:

This will cut down the cartesian product between `a` and `b` by decomposing it into cartesian products between a subset of `a` (namely those rows with a particular `xyz` value) and all of `b`, for each unique value of `xyz` (which is basically part of what the query optimizer in SQL would do for you automatically):

a <- data.table(one = 1, table_a)
b <- data.table(one = 1, table_b)
setkey(a, one, xyz)
setkey(b, one)
f <- function(lev) {
  # rows of a with this xyz value
  a. <- a[J(1, lev)]
  # join all of b to that subset, then keep rows matching any or-condition
  b[a., allow.cartesian = TRUE][y == y.1 | y == z.1 | z == y.1 | z == z.1,
                                list(sum(abc * abc.1))]
}
rbindlist(lapply(unique(a$xyz), f))

(Although it's not a data.table solution, you could alternatively try adding indexes to your SQL query.)

From ggrothendieck at gmail.com Fri Jun 7 03:22:58 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Thu, 6 Jun 2013 21:22:58 -0400
Subject: [datatable-help] Problem with FAQ 2.8
Message-ID:

FAQ 2.8 says:

2.8 What are the scoping rules for j expressions?

Think of the subset as an environment where all the column names are variables. When a variable foo is used in the j of a query such as X[Y,sum(foo)], foo is looked for in the following order:

1. The scope of X's subset; i.e., X's column names.
2. The scope of each row of Y; i.e., Y's column names (join inherited scope)
...

but consider the following (which is modified from this example: https://r-forge.r-project.org/tracker/?func=detail&atid=975&aid=1663&group_id=240):

> d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
> d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")
>
> d1[d2, sum(id2 * val)]
   id1 V1
1:   1  1
2:   2 10
3:   4 NA
>
> d1[d2, sum(id1 * val)]
Error in `[.data.table`(d1, d2, sum(id1 * val)) : object 'id1' not found

Note that column id1 of d1 is not in scope, contrary to point 1. Even stranger is that d1[, id1] works but d1[d2, id1] does not.

Is the FAQ describing how it's supposed to work and the actual behavior is wrong, or is the behavior as intended and the FAQ wrong?

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From michael.nelson at sydney.edu.au Fri Jun 7 05:50:49 2013
From: michael.nelson at sydney.edu.au (Michael Nelson)
Date: Fri, 7 Jun 2013 03:50:49 +0000
Subject: [datatable-help] Problem with FAQ 2.8
In-Reply-To:
References:
Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD7C349828@EX-MBX-PRO-04.mcs.usyd.edu.au>

This is related to FR 2693
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978

What is happening is that the `join` columns must be referenced using their names as defined in `i` (or Y in X[Y] syntax).

The FAQ doesn't explicitly cover how you are supposed to reference the columns used in the join.

Perhaps some binding magic could be used to ensure that either column name could be used. I don't think it is useful to have both defined and available as separate objects; that would mean there were two copies of something that are identical in value (but not name!).

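The naming rule described in the reply above can be seen in a small sketch. It reuses d1 and d2 from the start of this thread and assumes a recent data.table release, where grouping by each row of `i` has to be requested with `by = .EACHI` (in the 2013 version discussed here it was implicit). Only the `i` name of the join column, id2, is used in `j`; how id1 resolves in current releases is deliberately not shown.

```r
library(data.table)

d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")

# Keyed join with one group per row of d2. The join column is referenced
# in j by its name in i (id2); val comes from the matching rows of d1.
res <- d1[d2, sum(id2 * val), by = .EACHI]
print(res)
```

The V1 column should come out as 1, 10 and NA, the same values as in the first query quoted in the thread.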
From ggrothendieck at gmail.com Fri Jun 7 06:34:06 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 7 Jun 2013 00:34:06 -0400
Subject: [datatable-help] Problem with FAQ 2.8
In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD7C349828@EX-MBX-PRO-04.mcs.usyd.edu.au>
References: <6FB5193A6CDCDF499486A833B7AFBDCD7C349828@EX-MBX-PRO-04.mcs.usyd.edu.au>
Message-ID:

Note that the FAQ says that the X variables are "in scope", and the ordinary meaning of being in scope is that such a variable can be referenced in an unqualified manner, so I think it does imply that these variables can be accessed.

I assume from your response that the answer to my question is that the FAQ is wrong and the behavior is as intended.

If that is the case then it would be desirable that the behavior of the software be changed to make the FAQ correct. Having tiny little exceptions like this that are difficult to remember and error prone just makes the software harder to use.
Another possibility would be to outlaw having keys in X and Y which have different names (although that would be drastic and inconvenient, though safer and easier to learn than the current situation).

For example, continuing the code in my post, consider what would happen if this were to occur:

> id1 <- 1
> d1[d2, sum(id1 * val)]
   id1 V1
1:   1  1
2:   2  5
3:   4 NA

It would be difficult to realize without close examination that there is an error in this code (assuming that the writer intended id1 to be taken from d1). Here d1$id1 is not in scope (contrary to the FAQ) and so id1 in the caller is used, resulting in wrong output (relative to the result intended).

Here is another oddity. It seems that in the first case we cannot access id1, but if we do a join and then access the columns in a separate [] then we can.

> if (exists("id1")) rm(id1)
> d1[d2, id1]
Error in `[.data.table`(d1, d2, id1) : object 'id1' not found
> d1[d2][, id1]
[1] 1 2 2 4

Note that in R's merge one can refer to both keys, and in SQL when one does a join one can as well, so the behavior we have been discussing here seems entirely unexpected.

From ggrothendieck at gmail.com Fri Jun 7 06:50:40 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 7 Jun 2013 00:50:40 -0400
Subject: [datatable-help] Problem with FAQ 2.8
In-Reply-To:
References: <6FB5193A6CDCDF499486A833B7AFBDCD7C349828@EX-MBX-PRO-04.mcs.usyd.edu.au>
Message-ID:

One correction to my post.
merge() does not include both key columns in its output; however, that may be less germane because, unlike data.table and SQL, one cannot give merge an expression that refers to them.

> merge(as.data.frame(d1), as.data.frame(d2), by = 1)
  id1 val val2
1   1   1   11
2   2   2   12
3   2   3   12

The situation with SQLite is as described in my post, where both the id1 and id2 columns are output:

> library(sqldf)
> sqldf("select * from d1 join d2 on d1.id1 = d2.id2")
  id1 val id2 val2
1   1   1   1   11
2   2   2   2   12
3   2   3   2   12

and one could refer to them as id1 and id2 if they are distinct names, or as d1.id1 and d2.id2 in the select.

One other possibility for data.table would be to change X[Y] so that in the case of keys with different names both columns appear, as in the SQL example. This would presumably also ensure that both could be referenced in X[Y, j]. However, if the names are the same then there would be no need to output them both and it would be OK to output them as a single commonly named column.

From FErickson at psu.edu Fri Jun 7 14:27:48 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Fri, 7 Jun 2013 07:27:48 -0500
Subject: [datatable-help] Problem with FAQ 2.8
In-Reply-To:
References: <6FB5193A6CDCDF499486A833B7AFBDCD7C349828@EX-MBX-PRO-04.mcs.usyd.edu.au>
Message-ID:

> Here is another oddity. It seems that in the first case we cannot
> access id1 but if we do a join and then access the columns in a
> separate [] then we can.

Similarly, columns with duplicated names are referenced differently in the merge and after it (the column from Y in X[Y] is called i.col in the merge and col.1 after).

Also, in your example, although id1 is not available, that key column has two aliases that work: id2 and i.id2. So, I guess it wouldn't be necessary to "change X[Y] so that in the case of keys with different names both columns appear"; instead the key column could just be given the additional name id1...?

From mdowle at mdowle.plus.com Fri Jun 7 14:32:02 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 07 Jun 2013 13:32:02 +0100
Subject: [datatable-help] Problem with FAQ 2.8
In-Reply-To:
References: <6FB5193A6CDCDF499486A833B7AFBDCD7C349828@EX-MBX-PRO-04.mcs.usyd.edu.au>
Message-ID: <1d9d06c30352a08fbef8b3a6f2c18b01@imap.plus.net>

Hi,

Agreed. A change in the software to match the FAQ makes sense. That FAQ has non-join columns in mind, for which it is true I believe, but yes, it should be true for join columns as well.

The other consideration is rolling joins. With roll=TRUE it is natural to want to know the staleness of the data joined to. The column names usually match, say 'date', but the data is different. That was the primary motivation for the i. and x. prefixes:

X[Y, list(price, daysold = i.date-x.date), roll=TRUE]

The i. prefix is already available, but I don't think I did x. yet.
Anyway, the 'date' in X 'should' be higher in scope, in compliance with FAQ 2.8, so that this should be the same (although less clear to read since it relies on the reader knowing FAQ 2.8):

X[Y, list(price, daysold = i.date-date), roll=TRUE]

That's less useful now that roll takes a limit, although you still might want to know the staleness of data returned within the limit.

I've added a link to this thread to FR 2693 to be addressed. Thanks all.

Matthew

From ggrothendieck at gmail.com Sat Jun 8 05:24:23 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 7 Jun 2013 23:24:23 -0400
Subject: [datatable-help] Problem with FAQ 2.8
In-Reply-To: <1d9d06c30352a08fbef8b3a6f2c18b01@imap.plus.net>
References: <6FB5193A6CDCDF499486A833B7AFBDCD7C349828@EX-MBX-PRO-04.mcs.usyd.edu.au> <1d9d06c30352a08fbef8b3a6f2c18b01@imap.plus.net>
Message-ID:

1. Good point about having non-equal keys when roll=TRUE. Note that in that case we have the following, where the column labelled x1 in the output is wrong. Either it should be labelled y1 (since 10 comes from y1, not from x1) or else it should have the value 1 (since x1 is 1 in the input). If both keys were output with their original column names, or with disambiguating names in the case that they are the same, then there would no longer be a problem.
> X <- data.table(x1 = 1, x2 = 2, key = "x1")
> Y <- data.table(y1 = 10, y2 = 3)
> X[Y]
   x1 x2 y2
1: 10 NA  3
> X[Y,,roll=TRUE]
   x1 x2 y2
1: 10  2  3

2. The issues in this thread seem pretty fundamental to data.table and in my opinion resolving them deserves a high priority.

> > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From aragorn168b at gmail.com Sun Jun 9 23:08:47 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 9 Jun 2013 23:08:47 +0200 Subject: [datatable-help] Follow-up on subsetting data.table with NAs Message-ID: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> Matthew, Regarding your recent answer here: http://stackoverflow.com/a/17008872/559784 I'd a few questions/thoughts and I thought it may be more appropriate to share here (even though I've already written 3 comments!). 1) First, you write that, DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB,] However, you can write this long expression as: DF[which(DF$ColA == DF$ColB), ] 2) Second, you mention that the motivation is not just convenience but speed. By checking: require(data.table) set.seed(45) df <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2)) dt <- data.table(df) system.time(dt[V1 == V2]) # 0.077 seconds system.time(df[!is.na(df$V1) & !is.na(df$V2) & df$V1 == df$V2, ]) # 0.252 seconds system.time(df[which(df$V1 == df$V2), ]) # 0.038 seconds We see that using `which` (in addition to removing NA) is also faster than `DT[V1 == V2]`. In fact, `DT[which(V1 == V2)]` is faster than `DT[V1 == V2]`. I suspect this is because of the snippet below in `[.data.table`: if (is.logical(i)) { if (identical(i,NA)) i = NA_integer_ # see DT[NA] thread re recycling of NA logical else i[is.na(i)] = FALSE # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB] } But at the end `irows <- which(i)` is being done: if (is.logical(i)) { if (length(i)==nrow(x)) irows=which(i) # e.g. 
DT[colA>3,which=TRUE] And this "irows" is what's used to index the corresponding rows. So, is the replacement of `NA` to FALSE really necessary? I may very well have overlooked the purpose of the NA replacement to FALSE for other scenarios, but just by looking at this case, it doesn't seem like it's necessary as you fetch index/row numbers later. 3) And finally, more of a philosophical point. If we agree that subsetting can be done conveniently (using "which") and with no loss of speed (again using "which"), then are there other reasons to change the default behaviour of R's philosophy of handling NAs as unknowns/missing observations? I find I can relate more to the native concept of handling NAs. For example: x <- c(1,2,3,NA) x != 3 # TRUE TRUE FALSE NA makes more sense because `NA != 3` doesn't fall in either TRUE or FALSE, if NA is a missing observation/unknown data. The answer "unknown/missing" seems more appropriate, therefore. I'd be interested in hearing, in addition to Matthew's, other's thoughts and inputs as well. Best regards, Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sun Jun 9 23:47:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 09 Jun 2013 22:47:44 +0100 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> Message-ID: <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> On 09.06.2013 22:08, Arunkumar Srinivasan wrote: > Matthew, > Regarding your recent answer here: http://stackoverflow.com/a/17008872/559784 I'd a few questions/thoughts and I thought it may be more appropriate to share here (even though I've already written 3 comments!). 
> 1) First, you write that, DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB,] > However, you can write this long expression as: DF[which(DF$ColA == DF$ColB), ] Good point. But DT[ColA == ColB] still seems simpler than DF[which(DF$ColA == DF$ColB), ] (in data.table DT[which(ColA == ColB)]). I worry about forgetting I need which() and then have bugs occur when NA occur in the data at some time in future that don't occur now or in test. > 2) Second, you mention that the motivation is not just convenience but speed. By checking: > > require(data.table) > set.seed(45) > df <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2)) > dt <- data.table(df) > system.time(dt[V1 == V2]) > # 0.077 seconds > system.time(df[!is.na(df$V1) & !is.na(df$V2) & df$V1 == df$V2, ]) > # 0.252 seconds > system.time(df[which(df$V1 == df$V2), ]) > # 0.038 seconds > We see that using `which` (in addition to removing NA) is also faster than `DT[V1 == V2]`. In fact, `DT[which(V1 == V2)]` is faster than `DT[V1 == V2]`. I suspect this is because of the snippet below in `[.data.table`: > > if (is.logical(i)) { > if (identical(i,NA)) i = NA_integer_ # see DT[NA] thread re recycling of NA logical > else i[is.na(i)] = FALSE # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB] > } > But at the end `irows <- which(i)` is being done: > > if (is.logical(i)) { > if (length(i)==nrow(x)) irows=which(i) # e.g. DT[colA>3,which=TRUE] > And this "irows" is what's used to index the corresponding rows. So, is the replacement of `NA` to FALSE really necessary? I may very well have overlooked the purpose of the NA replacement to FALSE for other scenarios, but just by looking at this case, it doesn't seem like it's necessary as you fetch index/row numbers later. Interesting. Cool, so dt[V1 == V2] can and should be at least as fast as the which() way. Will file a FR to improve that speed! 
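The point about which() can be sketched in base R. This is an illustration on a made-up logical vector `i`, not the data.table source: replacing NA with FALSE before calling which() gives the same row numbers as calling which() directly, because which() already ignores NA.

```r
# Hypothetical logical index containing a missing value
i <- c(TRUE, NA, FALSE, TRUE)

# which() returns only the positions that are TRUE; NA is dropped
rows_direct <- which(i)

# Same index after first replacing NA with FALSE
i_clean <- i
i_clean[is.na(i_clean)] <- FALSE
rows_clean <- which(i_clean)

# Plain logical indexing, by contrast, keeps an NA element
x <- c(10, 20, 30, 40)
kept_logical <- x[i]         # NA appears where i is NA
kept_which   <- x[which(i)]  # no NA in the result
```

If the row numbers are always obtained through which(), the explicit NA-to-FALSE step looks redundant for this particular case, which is the question raised above.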
> 3) And finally, more of a philosophical point. If we agree that subsetting can be done conveniently (using "which") and with no loss of speed (again using "which"),

Not sure that is agreed yet, but happy to be persuaded.

> then are there other reasons to change the default behaviour of R's philosophy of handling NAs as unknowns/missing observations? I find I can relate more to the native concept of handling NAs. For example:
> x <- c(1,2,3,NA)
> x != 3
> # TRUE TRUE FALSE NA
> makes more sense because `NA != 3` doesn't fall in either TRUE or FALSE, if NA is a missing observation/unknown data. The answer "unknown/missing" seems more appropriate, therefore.

True but the context of where that result is used is all important; i.e., in this case that's `i` of [.data.table or [.data.frame. It may be easier to consider == first. The data.table philosophy is that DT [ x==3 ] should exclude any rows in x that are NA, without needing to do anything special such as needing to know to call which() as well. That differs to data.frame, but is more consistent with SQL. In SQL "where x = 3" doesn't need anything else if x contains some NULL values.

> I'd be interested in hearing, in addition to Matthew's, other's thoughts and inputs as well.
> Best regards,
> Arun

From aragorn168b at gmail.com Mon Jun 10 00:43:25 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 10 Jun 2013 00:43:25 +0200
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To: <114784a0d4675127913b6b4b903d9a0c@imap.plus.net>
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net>
Message-ID:

Matthew,
I personally don't think using "which" takes the simplicity away from the syntax. However, since it's (now) clear (to me) that the philosophy of data.table relates more towards SQL, I don't see a reason for "which". Even in the context of `[` in data.table/data.frame, "missing/unknown" data could be related to R philosophy.
dt <- data.table(x=c(1,3,4,NA), y=c(1:4)) dt[x <= 3] Here, one could argue that we don't know if the 4th row missing value is <= 3 or not. So, the problem comes to a point about what is the action to be taken. Do you give back the rows where no decision could be made or not? But as you rightly pointed out the idea behind data.table to be SQL-like, the current output stands very much. So retaining NA rows becomes invalid as well. Regarding FR4652, thanks for the speedy filing of this! I'm glad to have spotted it. Best regards, Arun. On Sunday, June 9, 2013 at 11:47 PM, Matthew Dowle wrote: > On 09.06.2013 22:08, Arunkumar Srinivasan wrote: > > Matthew, > > Regarding your recent answer here: http://stackoverflow.com/a/17008872/559784 I'd a few questions/thoughts and I thought it may be more appropriate to share here (even though I've already written 3 comments!). > > 1) First, you write that, DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB,] > > However, you can write this long expression as: DF[which(DF$ColA == DF$ColB), ] > > > > Good point. But DT[ColA == ColB] still seems simpler than DF[which(DF$ColA == DF$ColB), ] (in data.table DT[which(ColA == ColB)]). I worry about forgetting I need which() and then have bugs occur when NA occur in the data at some time in future that don't occur now or in test. > > 2) Second, you mention that the motivation is not just convenience but speed. By checking: > > require(data.table) > > set.seed(45) > > df <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2)) > > dt <- data.table(df) > > system.time(dt[V1 == V2]) > > # 0.077 seconds > > system.time(df[!is.na(df$V1) & !is.na(df$V2) & df$V1 == df$V2, ]) > > # 0.252 seconds > > system.time(df[which(df$V1 == df$V2), ]) > > > > # 0.038 seconds > > We see that using `which` (in addition to removing NA) is also faster than `DT[V1 == V2]`. In fact, `DT[which(V1 == V2)]` is faster than `DT[V1 == V2]`. 
I suspect this is because of the snippet below in `[.data.table`: > > if (is.logical(i)) { > > if (identical(i,NA)) i = NA_integer_ # see DT[NA] thread re recycling of NA logical > > else i[is.na(i)] = FALSE # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB] > > } > > > > But at the end `irows <- which(i)` is being done: > > if (is.logical(i)) { > > if (length(i)==nrow(x)) irows=which(i) # e.g. DT[colA>3,which=TRUE] > > > > And this "irows" is what's used to index the corresponding rows. So, is the replacement of `NA` to FALSE really necessary? I may very well have overlooked the purpose of the NA replacement to FALSE for other scenarios, but just by looking at this case, it doesn't seem like it's necessary as you fetch index/row numbers later. > > > > Interesting. Cool, so dt[V1 == V2] can and should be at least as fast as the which() way. Will file a FR to improve that speed! > > 3) And finally, more of a philosophical point. If we agree that subsetting can be done conveniently (using "which") and with no loss of speed (again using "which"), > > > > Not sure that is agreed yet, but happy to be persuaded. > > then are there other reasons to change the default behaviour of R's philosophy of handling NAs as unknowns/missing observations? I find I can relate more to the native concept of handling NAs. For example: > > x <- c(1,2,3,NA) > > x != 3 > > # TRUE TRUE FALSE NA > > makes more sense because `NA != 3` doesn't fall in either TRUE or FALSE, if NA is a missing observation/unknown data. The answer "unknown/missing" seems more appropriate, therefore. > > > > True but the context of where that result is used is all important; i.e., in this case that's `i` of [.data.table or [.data.frame. It may be easier to consider == first. The data.table philosophy is that DT [ x==3 ] should exclude any rows in x that are NA, without needing to do anything special such as needing to know to call which() as well. 
That differs to data.frame, but is more consistent with SQL. In SQL "where x = 3" doesn't need anything else if x contains some NULL values. > > I'd be interested in hearing, in addition to Matthew's, other's thoughts and inputs as well. > > Best regards, > > Arun > > > > > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Jun 10 09:11:16 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 10 Jun 2013 09:11:16 +0200 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> Message-ID: Matthew, Regarding your suggestion of changes regarding Frank's post here: http://stackoverflow.com/a/17008872/559784 I find it a bit more confusing and frankly not like sql. You wrote: "If I haven't understood correctly feel free to correct, otherwise the change will get made eventually. It will need to be done in a way that considers compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude rows with NA in colA but include rows where colA is non-NA but colB is NA. Similarly, DT[colA!=colB] should include rows where either colA or colB is NA but not both. And perhaps DT[colA==colB] should include rows where bothcolA and colB are NA (which it doesn't currently, I believe)." Even though sql (ex: sqldf) has a different way of handling NAs when compared to data.frame, it doesn't seem to find NA == NA. 
That is, df <- data.frame(x = c(1:3,NA), y = c(NA,4:5,NA)) require(sqldf) sqldf("select * from df where x == y") # returns empty data.frame sqldf("select * from df where x != y") x y 1 2 4 2 3 5 That is, at least in sqldf package, NA is not == NA and NA is not != NA which is very much in coherence with R's default NA == NA and NA != NA (both giving NA). But I don't think they it's considered FALSE here. It just acts like the "subset" function where all entries that were evaluated to NAs are simply dropped. But with data.table philosophy NA != NA should be evaluated to TRUE, which I don't think (from what I meagrely understand from sql) is what sql does. Please correct me if I've got it wrong. I think it is clearer and simpler if "NAs are just dropped" after evaluating logical expressions. It would be also easy to document this and easier to grasp, imho. This would also explain Frank's post for NA rows being removed. And probably if there is more consensus an option for "na.rm = TRUE/FALSE" could be added? Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Jun 10 10:05:15 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 10 Jun 2013 09:05:15 +0100 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> Message-ID: <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> Hi Arun, Hm, good point. Is data.table consistent with SQL already, for both == and !=, and so no change needed? And it was correct for Frank to be mistaken. Maybe just some more documentation and examples needed then. Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? : http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch before in the context of joins, not logical subsets. 
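The "NAs are just dropped" reading can be sketched in base R with the same `df` used in the sqldf example above. This is only an illustration of base-R semantics (an assumption about what an `na.rm`-style switch might distinguish, not a data.table feature): `[` with a logical index keeps NA comparisons as all-NA rows, while subset() and which() drop them.

```r
df <- data.frame(x = c(1:3, NA), y = c(NA, 4:5, NA))

# Element-wise comparison: NA wherever either side is NA
cmp <- df$x != df$y

# Plain data.frame indexing: NA comparisons come back as all-NA rows
via_bracket <- df[cmp, ]

# subset() drops rows whose comparison is NA
via_subset <- subset(df, x != y)

# which() gives the same rows as subset()
via_which <- df[which(cmp), ]

# With ==, no row qualifies once NA comparisons are dropped
eq_subset <- subset(df, x == y)
```

Here `via_subset` and `via_which` contain the rows (2, 4) and (3, 5), the same rows the sqldf query for `x != y` returned.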
Thanks, Matthew On 10.06.2013 08:11, Arunkumar Srinivasan wrote: > Matthew, > Regarding your suggestion of changes regarding Frank's post here: http://stackoverflow.com/a/17008872/559784 [1] I find it a bit more confusing and frankly not like sql. > You wrote: "If I haven't understood correctly feel free to correct, otherwise the change will get made eventually. It will need to be done in a way that considers compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude rows with NA in colA but include rows where colA is non-NA but colB is NA. Similarly, DT[colA!=colB] should include rows where either colA or colB is NA but not both. And perhaps DT[colA==colB] should include rows where bothcolA and colB are NA (which it doesn't currently, I believe)." > > Even though sql (ex: sqldf) has a different way of handling NAs when compared to data.frame, it doesn't seem to find NA == NA. That is, > df <- data.frame(x = c(1:3,NA), y = c(NA,4:5,NA)) > require(sqldf) > sqldf("select * from df where x == y") > # returns empty data.frame > sqldf("select * from df where x != y") > > x y > 1 2 4 > 2 3 5 > > That is, at least in sqldf package, NA is not == NA and NA is not != NA which is very much in coherence with R's default NA == NA and NA != NA (both giving NA). But I don't think they it's considered FALSE here. It just acts like the "subset" function where all entries that were evaluated to NAs are simply dropped. But with data.table philosophy NA != NA should be evaluated to TRUE, which I don't think (from what I meagrely understand from sql) is what sql does. Please correct me if I've got it wrong. > I think it is clearer and simpler if "NAs are just dropped" after evaluating logical expressions. It would be also easy to document this and easier to grasp, imho. This would also explain Frank's post for NA rows being removed. > And probably if there is more consensus an option for "na.rm = TRUE/FALSE" could be added? 
> Arun Links: ------ [1] http://stackoverflow.com/a/17008872/559784 -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Jun 10 10:28:46 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 10 Jun 2013 10:28:46 +0200 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> Message-ID: <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> > Hm, good point. Is data.table consistent with SQL already, for both == and !=, and so no change needed? Yes, I believe it's already consistent with SQL. However, the current interpretation of NA (documentation) being treated as FALSE is not needed / untrue, imho (Please see below). > And it was correct for Frank to be mistaken. Yes, it seems like he was mistaken. > Maybe just some more documentation and examples needed then. It'd be much more appropriate if the documentation reflects the role of subsetting in data.table mimicking "subset" function (in order to be in line with SQL) by dropping NA evaluated logicals. From a couple of posts before, where I pasted the code where NAs are replaced to FALSE were not necessary as `irows <- which(i)` makes clear that `which` is being used to get indices and then subset, this fits perfectly well with the interpretation of NA in data.table. > Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? : > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently Ha, I like the idea behind the use of () in evaluating expressions. It's another nice layer towards simplicity in data.table. But I still think there should not be an inconsistency in equivalent logical operations to provide different results. If !(x== .) and x != . 
are indeed different, then I'd suppose replacing `!` with a more appropriate name as it's much easier to get confused otherwise. In essence, either !(x == .) must evaluate to (x != .) if the underlying meaning of these are the same, or the `!` in `!(x==.)` must be replaced to something that's more appropriate for what it's supposed to be. Personally, I prefer the former. It would greatly tighten the structure and consistency. > "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch before in the context of joins, not logical subsets. Yes, I find this option would give more control in evaluating expressions with ease in `i`, by providing both "subset" (default) and the typical data.frame subsetting (na.rm = FALSE). Best regards, Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Jun 10 10:35:59 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 10 Jun 2013 10:35:59 +0200 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> Message-ID: <265F40BB318541E99B6FF75F0C3115AD@gmail.com> Hi Matthew, My view (from the last reply) more or less reflects mnel's comments here: http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 Pasted here for convenience: data.table is mimicing subset in its handling of NA values in logical i arguments. -- the only issue is the ! prefix signifying a not-join, not the way one might expect. Perhaps the not join prefix could have been NJ not ! 
to avoid this confusion -- this might be another discussion to have on the mailing list -- (I think it is a discussion worth having) Arun On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: > > Hm, good point. Is data.table consistent with SQL already, for both == and !=, and so no change needed? > > > > Yes, I believe it's already consistent with SQL. However, the current interpretation of NA (documentation) being treated as FALSE is not needed / untrue, imho (Please see below). > > > And it was correct for Frank to be mistaken. > > > > Yes, it seems like he was mistaken. > > Maybe just some more documentation and examples needed then. > > > > It'd be much more appropriate if the documentation reflects the role of subsetting in data.table mimicking "subset" function (in order to be in line with SQL) by dropping NA evaluated logicals. From a couple of posts before, where I pasted the code where NAs are replaced to FALSE were not necessary as `irows <- which(i)` makes clear that `which` is being used to get indices and then subset, this fits perfectly well with the interpretation of NA in data.table. > > Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? : > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently > > > > Ha, I like the idea behind the use of () in evaluating expressions. It's another nice layer towards simplicity in data.table. But I still think there should not be an inconsistency in equivalent logical operations to provide different results. If !(x== .) and x != . are indeed different, then I'd suppose replacing `!` with a more appropriate name as it's much easier to get confused otherwise. > > In essence, either !(x == .) must evaluate to (x != .) if the underlying meaning of these are the same, or the `!` in `!(x==.)` must be replaced to something that's more appropriate for what it's supposed to be. Personally, I prefer the former. 
It would greatly tighten the structure and consistency. > > "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch before in the context of joins, not logical subsets. > > > > Yes, I find this option would give more control in evaluating expressions with ease in `i`, by providing both "subset" (default) and the typical data.frame subsetting (na.rm = FALSE). > > Best regards, > > Arun > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Jun 10 10:52:41 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 10 Jun 2013 09:52:41 +0100 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: <265F40BB318541E99B6FF75F0C3115AD@gmail.com> References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> Message-ID: <8540a4a91743d5afab595132dd41eb7b@imap.plus.net> Hi, How about ~ instead of ! ? I ruled out - previously to leave + and - available for future use. NJ() may be possible too. Matthew On 10.06.2013 09:35, Arunkumar Srinivasan wrote: > Hi Matthew, > My view (from the last reply) more or less reflects mnel's comments here: http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 > > Pasted here for convenience: > data.table is mimicing subset in its handling of NA values in logical i arguments. -- the only issue is the ! prefix signifying a not-join, not the way one might expect. Perhaps the not join prefix could have been NJ not ! to avoid this confusion -- this might be another discussion to have on the mailing list -- (I think it is a discussion worth having) > > Arun > > On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: > >>> Hm, good point. 
Is data.table consistent with SQL already, for both == and !=, and so no change needed?
>>
>> Yes, I believe it's already consistent with SQL. However, the current interpretation of NA (documentation) being treated as FALSE is not needed / untrue, imho (Please see below).
>>
>>> And it was correct for Frank to be mistaken.
>>
>> Yes, it seems like he was mistaken.
>>
>> Maybe just some more documentation and examples needed then.
>>
>> It'd be much more appropriate if the documentation reflects the role of subsetting in data.table mimicking "subset" function (in order to be in line with SQL) by dropping NA evaluated logicals. From a couple of posts before, where I pasted the code where NAs are replaced to FALSE were not necessary as `irows <- which(i)` makes clear that `which` is being used to get indices and then subset, this fits perfectly well with the interpretation of NA in data.table.
>>
>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :

From aragorn168b at gmail.com Mon Jun 10 10:55:52 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 10 Jun 2013 10:55:52 +0200
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To: <265F40BB318541E99B6FF75F0C3115AD@gmail.com>
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com>
Message-ID: <04CBD144F5C04054AD048861AA496EC9@gmail.com>

(Sorry @Matthew for the double email, I forgot to include the list once again).
However, one inconsistency I find with the use of `!(x==.)` is this:
dt1 <- data.table(x = 0:4, y=5:9)
> dt1[!(x)]
   x y
1: 4 9
Not the correct result! If `!(x==.)` is equal to `x != .`, then the correct result should be the first row, isn't it?
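For comparison, a base-R sketch of what `!` does to a numeric column when it is read as logical negation. It uses a data.frame with the same values as dt1 and shows base-R semantics only, not data.table's not-join: `!x` coerces the numbers to logical, so it is TRUE exactly where x is 0.

```r
DF <- data.frame(x = 0:4, y = 5:9)

# Logical negation of a numeric vector: 0 becomes TRUE, non-zero becomes FALSE
neg <- !DF$x

# Both forms select the single row where x is 0
by_not <- DF[!(DF$x), ]
by_eq  <- DF[DF$x == 0, ]
```

Under this reading the first row (x = 0, y = 5) is selected, whereas the not-join reading of `dt1[!(x)]` treats x as row numbers to exclude and leaves only the last row.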
dt2 <- data.table(x = c(0,3,4,NA), y = c(NA,4,5,NA)) > dt2[!(x)] # ends up in an error Error in seq_len(nrow(x))[-irows] : only 0's may be mixed with negative subscripts It ends up in an error because `NA` is not removed/replaced. Running the same on data.frame gives the results it's supposed to. Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Jun 10 11:04:12 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 10 Jun 2013 10:04:12 +0100 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: <6C61EBA445A74AB78664B2A3AA724B31@gmail.com> References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <6C61EBA445A74AB78664B2A3AA724B31@gmail.com> Message-ID: <5353dd33b7768147973cd8fd2a9ebd1a@imap.plus.net> On 10.06.2013 09:53, Arunkumar Srinivasan wrote: > However, one inconsistency I find with the use of `!(x==.)` is this: > dt1 <- data.table(x = 0:4, y=5:9) >> dt1[!(x)] > > x y > 1: 4 10 > Not the correct result! If `!(x==.)` is equal to `x != .`, then the correct result should be the first row, isn't it? That result makes perfect sense to me. I don't think of !(x==.) being the same as x!=. ! is simply a prefix. It's all the rows that aren't returned if the ! prefix wasn't there. > dt2 <- data.table(x = c(0,3,4,NA), y = c(NA,4,5,NA)) > >> dt2[!(x)] # ends up in an error > Error in seq_len(nrow(x))[-irows] : > only 0's may be mixed with negative subscripts That needs to be fixed. But we're getting quite theoretical here and far away from common use cases. Why would we ever have row numbers of the table, as a column of the table itself and want to select the rows by number not mentioned in that column? It ends up in an error because `NA` is not removed/replaced. 
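The error message above comes from base-R subsetting rules and can be reproduced without data.table. In this sketch, `irows` is simply assumed to hold the values of dt2's x column, as the `seq_len(nrow(x))[-irows]` call in the error suggests: negating a vector that contains NA leaves the NA in place, and NA cannot be combined with negative subscripts.

```r
# Values of dt2$x, assumed here to be what ends up in irows
irows <- c(0, 3, 4, NA)

# Negation keeps the NA, so the index mixes negative values with NA
msg <- tryCatch(
  {
    seq_len(4)[-irows]
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Dropping NA before negating gives a valid negative index
ok <- seq_len(4)[-irows[!is.na(irows)]]
```

With the NA removed first, rows 3 and 4 are excluded and `ok` holds the remaining row numbers, which is presumably what a fix along the lines of "NA is removed/replaced" would produce.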
Links:
------
[1] http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143

From aragorn168b at gmail.com Mon Jun 10 11:21:02 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 10 Jun 2013 11:21:02 +0200
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To: <8540a4a91743d5afab595132dd41eb7b@imap.plus.net>
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net>
Message-ID:

Matthew,

> How about ~ instead of ! ? I ruled out - previously to leave + and - available for future use. NJ() may be possible too.

Both "NJ()" and "~" are okay for me.

> That result makes perfect sense to me. I don't think of !(x==.) being the same as x!=. ! is simply a prefix. It's all the rows that aren't returned if the ! prefix wasn't there.

I understand that `DT[!(x)]` does what `data.table` is designed to do currently. What I failed to mention was that if one were to consider implementing `!(x==.)` as the same as `x != .` then this behaviour has to be changed. Let's forget this point for a moment.

> That needs to be fixed. But we're getting quite theoretical here and far away from common use cases. Why would we ever have row numbers of the table, as a column of the table itself and want to select the rows by number not mentioned in that column?

Probably I did not choose a good example. Suppose that I have a data.table and I want to get all rows where "x == 0". Let's say:

    library(data.table)
    set.seed(45)
    # x and y need the same length, so y is sample(10) here (originally posted as sample(15))
    DT <- data.table(x = sample(c(0,5,10,15), 10, replace=TRUE), y = sample(10))
    DF <- as.data.frame(DT)

To get all rows where x == 0, it could be done with DT[x == 0]. But it makes sense, at least in the context of data.frames, to do equivalently,

    DF[!(DF$x), ]   (or)   DF[DF$x == 0, ]

All I want to say is, I expect `DT[!(x)]` to give the same result as `DT[x == 0]` (even though I fully understand it's not the intended behaviour of data.table), as it's more intuitive and less confusing.

So, changing `!` to `~` or `NJ` is one half of the issue for me. The other is to restore the actual function of `!` in all contexts. I hope I came across with what I wanted to say, better this time.

Best,
Arun
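
As a rough illustration of the two points in this message, here is a minimal base R sketch (no data.table needed). The data frame and its values are made up for the example and are not from the thread. It shows that `!x` on a numeric column matches `x == 0`, and that plain logical indexing keeps a row for an NA while `which()` and `subset()` drop it. The link to how data.table treats a logical `i` is only as described in the discussion, not something this snippet verifies.

```r
# Made-up data for illustration: x is numeric and contains one NA
DF <- data.frame(x = c(0, 5, NA, 10, 0), y = 1:5)

# On a numeric vector, ! coerces to logical: 0 -> TRUE, non-zero -> FALSE, NA -> NA
neg  <- !(DF$x)
eq0  <- DF$x == 0
same <- identical(neg, eq0)

# data.frame-style logical indexing keeps an all-NA row for the NA in x
df_rows <- DF[DF$x == 0, ]

# which() drops the NA position; subset() also discards rows where the condition is NA
which_rows  <- DF[which(DF$x == 0), ]
subset_rows <- subset(DF, x == 0)

print(same)
print(nrow(df_rows))
print(nrow(which_rows))
print(nrow(subset_rows))
```

So in base R the two spellings agree, and the only thing that differs between the data.frame and `subset()` styles is what happens to the NA row. That is the difference an na.rm = TRUE/FALSE switch would control.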
From FErickson at psu.edu Mon Jun 10 15:20:23 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Mon, 10 Jun 2013 08:20:23 -0500
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To:
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net>
Message-ID:

+1 to using ~ for the not-join/join on complement/complement then join. Having some logical-looking i's lead to subsetting and others to not-joins can (for me) lead to mistakes that I'm not likely to catch until much later, if at all.

I'm not sure I follow Arun's second example. If the syntax is changed so that ~ works as ! does now, then presumably !x will be reverted to having only a logical interpretation -- coercing x to logical and taking the subset where x == 0 -- which is the behavior you want. So why is it a separate issue? The remaining difference from data.frames would be that DF[!x] would show NA rows, if any, while DT[!x] would not.

--Frank

From aragorn168b at gmail.com Mon Jun 10 15:38:27 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 10 Jun 2013 15:38:27 +0200
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To:
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net>
Message-ID: <6D506FB418524C52856231CD897FC950@gmail.com>

Frank,

You're right about my final point. I can't recollect why I wrote that now. I guess the `!` function will be restored automatically.

With my second example, all I wanted to establish was that there was another reason to change `!` from performing the action of a "Not Join" because `DT[!x]` is a perfectly valid syntax (for those who have worked with data.frames and have shifted to data.table) which will not perform the intended action as it'll be a Not Join. In addition, `DT[!x]` gives an error when "x" column has NA. This was meant to be an additional argument for not having `!` for Not Join. But this has caused more confusion. Let's forget about my examples :).
To conclude, "~" or "NJ" makes more sense than `!` for "Not join", and of course the function of `!` will be automatically restored to "not" (also preferably with na.rm = TRUE/FALSE). This is what I intended to say from the original discussion. Sorry for any confusion.

Arun

From ggrothendieck at gmail.com Mon Jun 10 16:02:38 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Mon, 10 Jun 2013 10:02:38 -0400
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To:
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net>
Message-ID:

The problem with ~ is that it is using up a special character (of which there are only a few) for a case that does not occur much.

I can think of other things that ~ might be better used for. For example, perhaps ~ x could mean get(x).
One aspect of data.table that tends to be difficult is when you don't know the variable name ahead of time, and this would give a way to specify it concisely.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From mdowle at mdowle.plus.com Mon Jun 10 16:35:57 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Mon, 10 Jun 2013 15:35:57 +0100
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To:
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net>
Message-ID: <344d331d407432f6ae7b71cec416e065@imap.plus.net>

Hm, another good point. We need ~ for formulae, although I can't imagine a formula in i (only in j). But in both i and j we might want to get(x).

I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.

- maybe then? Consistent with - meaning in R. I don't think I actually had a specific use in mind for - and +, to reserve them for, but at the time it just seemed a shame to use up one of -/+ without defining the other. If - does a not join, then, might + be more like merge() (i.e. returning the union of the rows in x and i by join). I think I had something like that in mind, but hadn't thought it through.

Some might say it should be a new argument e.g. notjoin=TRUE, but my thinking there is readability, since we often have many lines in i, j and by in that order, and if the "notjoin=TRUE" followed afterwards it would be far away from the i argument to which it applies. If we incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet more parameters, too.
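
To make the two operations in Matthew's message concrete (a "not join", and a merge()-like union of rows), here is a minimal base R sketch. It uses plain data frames rather than data.table. The tables `X` and `Y`, the key column `k`, and the variable names are hypothetical and chosen only for this example. The prefix forms X[-Y] and X[+Y] are ideas floated in the thread, not existing syntax, so they appear only in comments.

```r
# Hypothetical tables sharing a key column k
X <- data.frame(k = c(0, 0, 1, 1, 3), y = 1:5)
Y <- data.frame(k = c(1, 3, 7), z = c("a", "b", "c"))

# "Not join": rows of X whose key does NOT appear in Y
# (what a prefix such as X[-Y] is imagined to return in the discussion)
not_join <- X[!(X$k %in% Y$k), ]

# merge()-style union: every key from either table, NA where one side has no match
# (what X[+Y] is imagined to return in the discussion)
union_join <- merge(X, Y, by = "k", all = TRUE)

print(not_join)
print(union_join)
```

Here `not_join` keeps only the two k = 0 rows, and `union_join` has six rows: the five rows of X plus one row for k = 7, which exists only in Y.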
From aragorn168b at gmail.com Mon Jun 10 16:52:56 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 10 Jun 2013 16:52:56 +0200
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To: <344d331d407432f6ae7b71cec416e065@imap.plus.net>
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net> <344d331d407432f6ae7b71cec416e065@imap.plus.net>
Message-ID:

Matthew,

It just occurred to me. I'd be glad if you can clarify this. The operation is supposed to be "Not Join". Which means, I'd expect the "!" to be used with "J" as in:

    dt <- data.table(x=c(0,0,1,1,3), y=1:5)
    setkey(dt, "x")
    dt[J(c(1,3))] # join
       x y
    1: 1 3
    2: 1 4
    3: 3 5
    dt[!J(c(1,3))]
       x y
    1: 0 1
    2: 0 2

Here the concept of "Not Join" with the use of "!J(.)" makes total sense. However, extending it to not-join for logical vectors is what seems to be an issue. It's more of a logical indexing than a join (at least in my mind). So, if it is possible to distinguish between "!" and "!J" (by checking if `i` is a data.table or not) to tell if it's a subsetting by logical vector or subsetting by "data.table" and then deciding what to do, would that resolve this issue?

If not, what's the reason behind using "!" as a not-join during logical indexing? Is it still considered as a not-join??

Just a thought. I hope it makes at least a little sense.

Best,
Arun

On Monday, June 10, 2013 at 4:35 PM, Matthew Dowle wrote:
>
> Hm, another good point.
We need ~ for formulae, although I can't > imagine a formula in i (only in j). But in both i and j we might want > to get(x). > > I thought about ^ i.e. X[^Y] in the spirit of regular expression > syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a > prefix. > > - maybe then? Consistent with - meaning in R. I don't think I > actually had a specific use in mind for - and +, to reserve them for, > but at the time it just seemed a shame to use up one of -/+ without > defining the other. If - does a not join, then, might + be more like > merge() (i.e. returning the union of the rows in x and i by join). I > think I had something like that in mind, but hadn't thought it through. > > Some might say it should be a new argument e.g. notjoin=TRUE, but my > thinking there is readability, since we often have many lines in i, j > and by in that order, and if the "notjoin=TRUE" followed afterwards it > would be far away from the i argument to which it applies. If we > incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet > more parameters, too. > > > On 10.06.2013 15:02, Gabor Grothendieck wrote: > > The problem with ~ is that it is using up a special character (of > > which there are only a few) for a case that does not occur much. > > > > I can think of other things that ~ might be better used for. For > > example, perhaps ~ x could mean get(x). One aspect of data.table > > that > > tends to be difficult is when you don't know the variable name ahead > > of time and this woiuld give a way to specify it concisely. > > > > On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan > > wrote: > > > Matthew, > > > > > > How about ~ instead of ! ? I ruled out - previously to leave + > > > and - > > > available for future use. NJ() may be possible too. > > > > > > Both "NJ()" and "~" are okay for me. > > > > > > That result makes perfect sense to me. I don't think of !(x==.) > > > being the > > > same as x!=. ! is simply a prefix. 
It's all the rows that > > > aren't > > > returned if the ! prefix wasn't there. > > > > > > I understand that `DT[!(x)]` does what `data.table` is designed to > > > do > > > currently. What I failed to mention was that if one were to consider > > > implementing `!(x==.)` as the same as `x != .` then this behaviour > > > has to be > > > changed. Let's forget this point for a moment. > > > > > > That needs to be fixed. But we're getting quite theoretical here > > > and far > > > away from common use cases. Why would we ever have row numbers of > > > the > > > table, as a column of the table itself and want to select the rows > > > by number > > > not mentioned in that column? > > > > > > Probably I did not choose a good example. Suppose that I've a > > > data.table and > > > I want to get all rows where "x == 0". Let's say: > > > > > > set.seed(45) > > > DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = > > > sample(15)) > > > > > > DF <- as.data.frame(DT) > > > > > > To get all rows where x == 0, it could be done with DT[x == 0]. But > > > it makes > > > sense, at least in the context of data.frames, to do equivalently, > > > > > > DF[!(DF$x), ] (or) DF[DF$x == 0, ] > > > > > > All I want to say is, I expect `DT[!(x)]` should give the same > > > result as > > > `DT[x == 0]` (even though I fully understand it's not the intended > > > behaviour > > > of data.table), as it's more intuitive and less confusing. > > > > > > So, changing `!` to `~` or `NJ` is one half of the issue for me. The > > > other > > > is to replace the actual function of `!` in all contexts. I hope I > > > came > > > across with what I wanted to say, better this time. > > > > > > Best, > > > > > > Arun > > > > > > > > > On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote: > > > > > > > > > > > > Hi, > > > > > > How about ~ instead of ! ? I ruled out - previously to leave + > > > and - > > > available for future use. NJ() may be possible too. 
> > > > > > Matthew > > > > > > > > > > > > On 10.06.2013 09:35, Arunkumar Srinivasan wrote: > > > > > > Hi Matthew, > > > My view (from the last reply) more or less reflects mnel's comments > > > here: > > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 > > > Pasted here for convenience: > > > data.table is mimicing subset in its handling of NA values in > > > logical i > > > arguments. -- the only issue is the ! prefix signifying a not-join, > > > not the > > > way one might expect. Perhaps the not join prefix could have been NJ > > > not ! > > > to avoid this confusion -- this might be another discussion to have > > > on the > > > mailing list -- (I think it is a discussion worth having) > > > > > > Arun > > > > > > On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: > > > > > > Hm, good point. Is data.table consistent with SQL already, for both > > > == and > > > !=, and so no change needed? > > > > > > Yes, I believe it's already consistent with SQL. However, the > > > current > > > interpretation of NA (documentation) being treated as FALSE is not > > > needed / > > > untrue, imho (Please see below). > > > > > > > > > And it was correct for Frank to be mistaken. > > > > > > Yes, it seems like he was mistaken. > > > > > > Maybe just some more documentation and examples needed then. > > > > > > It'd be much more appropriate if the documentation reflects the role > > > of > > > subsetting in data.table mimicking "subset" function (in order to be > > > in line > > > with SQL) by dropping NA evaluated logicals. From a couple of posts > > > before, > > > where I pasted the code where NAs are replaced to FALSE were not > > > necessary > > > as `irows <- which(i)` makes clear that `which` is being used to get > > > indices > > > and then subset, this fits perfectly well with the interpretation of > > > NA in > > > data.table. > > > > > > Are you happy that DT[!(x==.)] and DT[x!=.] 
do treat NA > > > inconsistently? : > > > > > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently > > > > > > Ha, I like the idea behind the use of () in evaluating expressions. > > > It's > > > another nice layer towards simplicity in data.table. But I still > > > think there > > > should not be an inconsistency in equivalent logical operations to > > > provide > > > different results. If !(x== .) and x != . are indeed different, then > > > I'd > > > suppose replacing `!` with a more appropriate name as it's much > > > easier to > > > get confused otherwise. > > > In essence, either !(x == .) must evaluate to (x != .) if the > > > underlying > > > meaning of these are the same, or the `!` in `!(x==.)` must be > > > replaced to > > > something that's more appropriate for what it's supposed to be. > > > Personally, > > > I prefer the former. It would greatly tighten the structure and > > > consistency. > > > > > > "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch > > > before > > > in the context of joins, not logical subsets. > > > > > > Yes, I find this option would give more control in evaluating > > > expressions > > > with ease in `i`, by providing both "subset" (default) and the > > > typical > > > data.frame subsetting (na.rm = FALSE). > > > Best regards, > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
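
The two readings of `!` debated in this thread can be sketched in base R alone. This is only an illustration under stated assumptions: the vector `x` and the names `rows_subset` and `rows_complement` are made up for the example, and `setdiff()` over row numbers merely stands in for the not-join reading ("all the rows that aren't returned if the ! prefix wasn't there"). It is not data.table's actual implementation.

```r
# Base R only; x stands in for a column that contains a missing value.
x <- c(1, 2, NA, 1, 3)

# As plain logical operations, !(x == 1) and x != 1 agree element-wise:
# both are FALSE TRUE NA FALSE TRUE.
neg <- !(x == 1)
neq <- x != 1
identical(neg, neq)   # TRUE

# subset()-style selection: which() drops NA, so row 3 is excluded.
rows_subset <- which(x != 1)                              # 2 5

# Complement-style selection: every row NOT returned by the positive query.
# Row 3 (the NA) was not matched by x == 1, so it ends up in the complement.
rows_complement <- setdiff(seq_along(x), which(x == 1))   # 2 3 5
```

With an NA present, the subset-style selection drops row 3 while the complement-style selection keeps it, which appears to be the kind of difference being discussed for DT[x != .] versus DT[!(x == .)].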
From FErickson at psu.edu Mon Jun 10 16:55:28 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Mon, 10 Jun 2013 09:55:28 -0500
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To: <344d331d407432f6ae7b71cec416e065@imap.plus.net>
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net> <344d331d407432f6ae7b71cec416e065@imap.plus.net>
Message-ID:

I prefer ~ and/or NJ() over -. The not-join operation is different from the subsetting operation usually associated with -.

I don't know what characters are available for this sort of thing, but @x, @(x,y) seems natural enough as syntax for a getter.

On Mon, Jun 10, 2013 at 9:35 AM, Matthew Dowle wrote:

>
> Hm, another good point. We need ~ for formulae, although I can't
> imagine a formula in i (only in j). But in both i and j we might want to
> get(x).
>
> I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax,
> but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.
>
> - maybe then? Consistent with - meaning in R. I don't think I actually
> had a specific use in mind for - and +, to reserve them for, but at the
> time it just seemed a shame to use up one of -/+ without defining the
> other. If - does a not join, then, might + be more like merge() (i.e.
> returning the union of the rows in x and i by join). I think I had
> something like that in mind, but hadn't thought it through.
>
> Some might say it should be a new argument e.g. notjoin=TRUE, but my
> thinking there is readability, since we often have many lines in i, j and
> by in that order, and if the "notjoin=TRUE" followed afterwards it would be
> far away from the i argument to which it applies.
If we incorporate > merge() into X[Y] using X[+Y] then it might avoid adding yet more > parameters, too. > > > > On 10.06.2013 15:02, Gabor Grothendieck wrote: > >> The problem with ~ is that it is using up a special character (of >> which there are only a few) for a case that does not occur much. >> >> I can think of other things that ~ might be better used for. For >> example, perhaps ~ x could mean get(x). One aspect of data.table that >> tends to be difficult is when you don't know the variable name ahead >> of time and this woiuld give a way to specify it concisely. >> >> On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan >> wrote: >> >>> Matthew, >>> >>> How about ~ instead of ! ? I ruled out - previously to leave + and - >>> available for future use. NJ() may be possible too. >>> >>> Both "NJ()" and "~" are okay for me. >>> >>> That result makes perfect sense to me. I don't think of !(x==.) being >>> the >>> same as x!=. ! is simply a prefix. It's all the rows that aren't >>> returned if the ! prefix wasn't there. >>> >>> I understand that `DT[!(x)]` does what `data.table` is designed to do >>> currently. What I failed to mention was that if one were to consider >>> implementing `!(x==.)` as the same as `x != .` then this behaviour has >>> to be >>> changed. Let's forget this point for a moment. >>> >>> That needs to be fixed. But we're getting quite theoretical here and far >>> away from common use cases. Why would we ever have row numbers of the >>> table, as a column of the table itself and want to select the rows by >>> number >>> not mentioned in that column? >>> >>> Probably I did not choose a good example. Suppose that I've a data.table >>> and >>> I want to get all rows where "x == 0". Let's say: >>> >>> set.seed(45) >>> DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = >>> sample(15)) >>> >>> DF <- as.data.frame(DT) >>> >>> To get all rows where x == 0, it could be done with DT[x == 0]. 
But it >>> makes >>> sense, at least in the context of data.frames, to do equivalently, >>> >>> DF[!(DF$x), ] (or) DF[DF$x == 0, ] >>> >>> All I want to say is, I expect `DT[!(x)]` should give the same result as >>> `DT[x == 0]` (even though I fully understand it's not the intended >>> behaviour >>> of data.table), as it's more intuitive and less confusing. >>> >>> So, changing `!` to `~` or `NJ` is one half of the issue for me. The >>> other >>> is to replace the actual function of `!` in all contexts. I hope I came >>> across with what I wanted to say, better this time. >>> >>> Best, >>> >>> Arun >>> >>> >>> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote: >>> >>> >>> >>> Hi, >>> >>> How about ~ instead of ! ? I ruled out - previously to leave + and - >>> available for future use. NJ() may be possible too. >>> >>> Matthew >>> >>> >>> >>> On 10.06.2013 09:35, Arunkumar Srinivasan wrote: >>> >>> Hi Matthew, >>> My view (from the last reply) more or less reflects mnel's comments here: >>> >>> http://stackoverflow.com/**questions/16239153/dtx-and-** >>> dtx-treat-na-in-x-**inconsistently#**comment23317096_16240143 >>> Pasted here for convenience: >>> data.table is mimicing subset in its handling of NA values in logical i >>> arguments. -- the only issue is the ! prefix signifying a not-join, not >>> the >>> way one might expect. Perhaps the not join prefix could have been NJ not >>> ! >>> to avoid this confusion -- this might be another discussion to have on >>> the >>> mailing list -- (I think it is a discussion worth having) >>> >>> Arun >>> >>> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: >>> >>> Hm, good point. Is data.table consistent with SQL already, for both == >>> and >>> !=, and so no change needed? >>> >>> Yes, I believe it's already consistent with SQL. However, the current >>> interpretation of NA (documentation) being treated as FALSE is not >>> needed / >>> untrue, imho (Please see below). 
>>> >>> >>> And it was correct for Frank to be mistaken. >>> >>> Yes, it seems like he was mistaken. >>> >>> Maybe just some more documentation and examples needed then. >>> >>> It'd be much more appropriate if the documentation reflects the role of >>> subsetting in data.table mimicking "subset" function (in order to be in >>> line >>> with SQL) by dropping NA evaluated logicals. From a couple of posts >>> before, >>> where I pasted the code where NAs are replaced to FALSE were not >>> necessary >>> as `irows <- which(i)` makes clear that `which` is being used to get >>> indices >>> and then subset, this fits perfectly well with the interpretation of NA >>> in >>> data.table. >>> >>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? : >>> >>> >>> http://stackoverflow.com/**questions/16239153/dtx-and-** >>> dtx-treat-na-in-x-**inconsistently >>> >>> Ha, I like the idea behind the use of () in evaluating expressions. It's >>> another nice layer towards simplicity in data.table. But I still think >>> there >>> should not be an inconsistency in equivalent logical operations to >>> provide >>> different results. If !(x== .) and x != . are indeed different, then I'd >>> suppose replacing `!` with a more appropriate name as it's much easier to >>> get confused otherwise. >>> In essence, either !(x == .) must evaluate to (x != .) if the underlying >>> meaning of these are the same, or the `!` in `!(x==.)` must be replaced >>> to >>> something that's more appropriate for what it's supposed to be. >>> Personally, >>> I prefer the former. It would greatly tighten the structure and >>> consistency. >>> >>> "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch >>> before >>> in the context of joins, not logical subsets. >>> >>> Yes, I find this option would give more control in evaluating expressions >>> with ease in `i`, by providing both "subset" (default) and the typical >>> data.frame subsetting (na.rm = FALSE). 
>>> Best regards,
>>>
>>> Arun
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

From eduard.antonyan at gmail.com Mon Jun 10 16:52:31 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 10 Jun 2013 09:52:31 -0500
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To: <344d331d407432f6ae7b71cec416e065@imap.plus.net>
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net> <344d331d407432f6ae7b71cec416e065@imap.plus.net>
Message-ID:

I don't have much to add, except to +1 the suggestion of restoring ! to mean a logical not instead of not-joining, as !(x == 0) and x != 0 or (!(x == 0)) giving different results is just too hard to understand and requires some advanced understanding of what ! means, and how it's parsed internally.

On Mon, Jun 10, 2013 at 9:35 AM, Matthew Dowle wrote:

>
> Hm, another good point. We need ~ for formulae, although I can't
> imagine a formula in i (only in j). But in both i and j we might want to
> get(x).
>
> I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax,
> but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.
>
> - maybe then? Consistent with - meaning in R.
I don't think I actually > had a specific use in mind for - and +, to reserve them for, but at the > time it just seemed a shame to use up one of -/+ without defining the > other. If - does a not join, then, might + be more like merge() (i.e. > returning the union of the rows in x and i by join). I think I had > something like that in mind, but hadn't thought it through. > > Some might say it should be a new argument e.g. notjoin=TRUE, but my > thinking there is readability, since we often have many lines in i, j and > by in that order, and if the "notjoin=TRUE" followed afterwards it would be > far away from the i argument to which it applies. If we incorporate > merge() into X[Y] using X[+Y] then it might avoid adding yet more > parameters, too. > > > > On 10.06.2013 15:02, Gabor Grothendieck wrote: > >> The problem with ~ is that it is using up a special character (of >> which there are only a few) for a case that does not occur much. >> >> I can think of other things that ~ might be better used for. For >> example, perhaps ~ x could mean get(x). One aspect of data.table that >> tends to be difficult is when you don't know the variable name ahead >> of time and this woiuld give a way to specify it concisely. >> >> On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan >> wrote: >> >>> Matthew, >>> >>> How about ~ instead of ! ? I ruled out - previously to leave + and - >>> available for future use. NJ() may be possible too. >>> >>> Both "NJ()" and "~" are okay for me. >>> >>> That result makes perfect sense to me. I don't think of !(x==.) being >>> the >>> same as x!=. ! is simply a prefix. It's all the rows that aren't >>> returned if the ! prefix wasn't there. >>> >>> I understand that `DT[!(x)]` does what `data.table` is designed to do >>> currently. What I failed to mention was that if one were to consider >>> implementing `!(x==.)` as the same as `x != .` then this behaviour has >>> to be >>> changed. Let's forget this point for a moment. 
>>> >>> That needs to be fixed. But we're getting quite theoretical here and far >>> away from common use cases. Why would we ever have row numbers of the >>> table, as a column of the table itself and want to select the rows by >>> number >>> not mentioned in that column? >>> >>> Probably I did not choose a good example. Suppose that I've a data.table >>> and >>> I want to get all rows where "x == 0". Let's say: >>> >>> set.seed(45) >>> DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = >>> sample(15)) >>> >>> DF <- as.data.frame(DT) >>> >>> To get all rows where x == 0, it could be done with DT[x == 0]. But it >>> makes >>> sense, at least in the context of data.frames, to do equivalently, >>> >>> DF[!(DF$x), ] (or) DF[DF$x == 0, ] >>> >>> All I want to say is, I expect `DT[!(x)]` should give the same result as >>> `DT[x == 0]` (even though I fully understand it's not the intended >>> behaviour >>> of data.table), as it's more intuitive and less confusing. >>> >>> So, changing `!` to `~` or `NJ` is one half of the issue for me. The >>> other >>> is to replace the actual function of `!` in all contexts. I hope I came >>> across with what I wanted to say, better this time. >>> >>> Best, >>> >>> Arun >>> >>> >>> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote: >>> >>> >>> >>> Hi, >>> >>> How about ~ instead of ! ? I ruled out - previously to leave + and - >>> available for future use. NJ() may be possible too. >>> >>> Matthew >>> >>> >>> >>> On 10.06.2013 09:35, Arunkumar Srinivasan wrote: >>> >>> Hi Matthew, >>> My view (from the last reply) more or less reflects mnel's comments here: >>> >>> http://stackoverflow.com/**questions/16239153/dtx-and-** >>> dtx-treat-na-in-x-**inconsistently#**comment23317096_16240143 >>> Pasted here for convenience: >>> data.table is mimicing subset in its handling of NA values in logical i >>> arguments. -- the only issue is the ! prefix signifying a not-join, not >>> the >>> way one might expect. 
Perhaps the not join prefix could have been NJ not >>> ! >>> to avoid this confusion -- this might be another discussion to have on >>> the >>> mailing list -- (I think it is a discussion worth having) >>> >>> Arun >>> >>> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: >>> >>> Hm, good point. Is data.table consistent with SQL already, for both == >>> and >>> !=, and so no change needed? >>> >>> Yes, I believe it's already consistent with SQL. However, the current >>> interpretation of NA (documentation) being treated as FALSE is not >>> needed / >>> untrue, imho (Please see below). >>> >>> >>> And it was correct for Frank to be mistaken. >>> >>> Yes, it seems like he was mistaken. >>> >>> Maybe just some more documentation and examples needed then. >>> >>> It'd be much more appropriate if the documentation reflects the role of >>> subsetting in data.table mimicking "subset" function (in order to be in >>> line >>> with SQL) by dropping NA evaluated logicals. From a couple of posts >>> before, >>> where I pasted the code where NAs are replaced to FALSE were not >>> necessary >>> as `irows <- which(i)` makes clear that `which` is being used to get >>> indices >>> and then subset, this fits perfectly well with the interpretation of NA >>> in >>> data.table. >>> >>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? : >>> >>> >>> http://stackoverflow.com/**questions/16239153/dtx-and-** >>> dtx-treat-na-in-x-**inconsistently >>> >>> Ha, I like the idea behind the use of () in evaluating expressions. It's >>> another nice layer towards simplicity in data.table. But I still think >>> there >>> should not be an inconsistency in equivalent logical operations to >>> provide >>> different results. If !(x== .) and x != . are indeed different, then I'd >>> suppose replacing `!` with a more appropriate name as it's much easier to >>> get confused otherwise. >>> In essence, either !(x == .) must evaluate to (x != .) 
if the underlying
>>> meaning of these are the same, or the `!` in `!(x==.)` must be replaced
>>> to
>>> something that's more appropriate for what it's supposed to be.
>>> Personally,
>>> I prefer the former. It would greatly tighten the structure and
>>> consistency.
>>>
>>> "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch
>>> before
>>> in the context of joins, not logical subsets.
>>>
>>> Yes, I find this option would give more control in evaluating expressions
>>> with ease in `i`, by providing both "subset" (default) and the typical
>>> data.frame subsetting (na.rm = FALSE).
>>> Best regards,
>>>
>>> Arun
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

From eduard.antonyan at gmail.com Mon Jun 10 17:06:53 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 10 Jun 2013 10:06:53 -0500
Subject: [datatable-help] Follow-up on subsetting data.table with NAs
In-Reply-To:
References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net> <344d331d407432f6ae7b71cec416e065@imap.plus.net>
Message-ID:

Btw, since we're on the topic of join/not-join syntax does this break others' expectations or is it just me?

> dt = data.table(x = c(1,2,3))
> setkey(dt,x)
> dt[J(1)]
   x
1: 1
> dt[!J(1)]
   x
1: 2
2: 3
> dt[(!J(1))]
Error in eval(expr, envir, enclos) : could not find function "J"
> dt[(J(1))]
Error in eval(expr, envir, enclos) : could not find function "J"

I understand why this happens internally, because the function "()" is read as the head of the expression tree, but it's still pretty weird.

On Mon, Jun 10, 2013 at 9:55 AM, Frank Erickson wrote:

> I prefer ~ and/or NJ() over -. The not-join operation is different from
> the subsetting operation usually associated with -.
>
> I don't know what characters are available for this sort of thing, but @x,
> @(x,y) seems natural enough as syntax for a getter.
>
>
> On Mon, Jun 10, 2013 at 9:35 AM, Matthew Dowle wrote:
>
>>
>> Hm, another good point. We need ~ for formulae, although I can't
>> imagine a formula in i (only in j). But in both i and j we might want to
>> get(x).
>>
>> I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax,
>> but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.
>>
>> - maybe then? Consistent with - meaning in R. I don't think I actually
>> had a specific use in mind for - and +, to reserve them for, but at the
>> time it just seemed a shame to use up one of -/+ without defining the
>> other. If - does a not join, then, might + be more like merge() (i.e.
>> returning the union of the rows in x and i by join). I think I had
>> something like that in mind, but hadn't thought it through.
>>
>> Some might say it should be a new argument e.g. notjoin=TRUE, but my
>> thinking there is readability, since we often have many lines in i, j and
>> by in that order, and if the "notjoin=TRUE" followed afterwards it would be
>> far away from the i argument to which it applies. If we incorporate
>> merge() into X[Y] using X[+Y] then it might avoid adding yet more
>> parameters, too.
>> >> >> >> On 10.06.2013 15:02, Gabor Grothendieck wrote: >> >>> The problem with ~ is that it is using up a special character (of >>> which there are only a few) for a case that does not occur much. >>> >>> I can think of other things that ~ might be better used for. For >>> example, perhaps ~ x could mean get(x). One aspect of data.table that >>> tends to be difficult is when you don't know the variable name ahead >>> of time and this woiuld give a way to specify it concisely. >>> >>> On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan >>> wrote: >>> >>>> Matthew, >>>> >>>> How about ~ instead of ! ? I ruled out - previously to leave + and >>>> - >>>> available for future use. NJ() may be possible too. >>>> >>>> Both "NJ()" and "~" are okay for me. >>>> >>>> That result makes perfect sense to me. I don't think of !(x==.) being >>>> the >>>> same as x!=. ! is simply a prefix. It's all the rows that aren't >>>> returned if the ! prefix wasn't there. >>>> >>>> I understand that `DT[!(x)]` does what `data.table` is designed to do >>>> currently. What I failed to mention was that if one were to consider >>>> implementing `!(x==.)` as the same as `x != .` then this behaviour has >>>> to be >>>> changed. Let's forget this point for a moment. >>>> >>>> That needs to be fixed. But we're getting quite theoretical here and >>>> far >>>> away from common use cases. Why would we ever have row numbers of the >>>> table, as a column of the table itself and want to select the rows by >>>> number >>>> not mentioned in that column? >>>> >>>> Probably I did not choose a good example. Suppose that I've a >>>> data.table and >>>> I want to get all rows where "x == 0". Let's say: >>>> >>>> set.seed(45) >>>> DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = >>>> sample(15)) >>>> >>>> DF <- as.data.frame(DT) >>>> >>>> To get all rows where x == 0, it could be done with DT[x == 0]. 
But it >>>> makes >>>> sense, at least in the context of data.frames, to do equivalently, >>>> >>>> DF[!(DF$x), ] (or) DF[DF$x == 0, ] >>>> >>>> All I want to say is, I expect `DT[!(x)]` should give the same result as >>>> `DT[x == 0]` (even though I fully understand it's not the intended >>>> behaviour >>>> of data.table), as it's more intuitive and less confusing. >>>> >>>> So, changing `!` to `~` or `NJ` is one half of the issue for me. The >>>> other >>>> is to replace the actual function of `!` in all contexts. I hope I came >>>> across with what I wanted to say, better this time. >>>> >>>> Best, >>>> >>>> Arun >>>> >>>> >>>> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote: >>>> >>>> >>>> >>>> Hi, >>>> >>>> How about ~ instead of ! ? I ruled out - previously to leave + and >>>> - >>>> available for future use. NJ() may be possible too. >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 10.06.2013 09:35, Arunkumar Srinivasan wrote: >>>> >>>> Hi Matthew, >>>> My view (from the last reply) more or less reflects mnel's comments >>>> here: >>>> >>>> http://stackoverflow.com/**questions/16239153/dtx-and-** >>>> dtx-treat-na-in-x-**inconsistently#**comment23317096_16240143 >>>> Pasted here for convenience: >>>> data.table is mimicing subset in its handling of NA values in logical i >>>> arguments. -- the only issue is the ! prefix signifying a not-join, not >>>> the >>>> way one might expect. Perhaps the not join prefix could have been NJ >>>> not ! >>>> to avoid this confusion -- this might be another discussion to have on >>>> the >>>> mailing list -- (I think it is a discussion worth having) >>>> >>>> Arun >>>> >>>> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: >>>> >>>> Hm, good point. Is data.table consistent with SQL already, for both == >>>> and >>>> !=, and so no change needed? >>>> >>>> Yes, I believe it's already consistent with SQL. 
However, the current >>>> interpretation of NA (documentation) being treated as FALSE is not >>>> needed / >>>> untrue, imho (Please see below). >>>> >>>> >>>> And it was correct for Frank to be mistaken. >>>> >>>> Yes, it seems like he was mistaken. >>>> >>>> Maybe just some more documentation and examples needed then. >>>> >>>> It'd be much more appropriate if the documentation reflects the role of >>>> subsetting in data.table mimicking "subset" function (in order to be in >>>> line >>>> with SQL) by dropping NA evaluated logicals. From a couple of posts >>>> before, >>>> where I pasted the code where NAs are replaced to FALSE were not >>>> necessary >>>> as `irows <- which(i)` makes clear that `which` is being used to get >>>> indices >>>> and then subset, this fits perfectly well with the interpretation of NA >>>> in >>>> data.table. >>>> >>>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? >>>> : >>>> >>>> >>>> http://stackoverflow.com/**questions/16239153/dtx-and-** >>>> dtx-treat-na-in-x-**inconsistently >>>> >>>> Ha, I like the idea behind the use of () in evaluating expressions. >>>> It's >>>> another nice layer towards simplicity in data.table. But I still think >>>> there >>>> should not be an inconsistency in equivalent logical operations to >>>> provide >>>> different results. If !(x== .) and x != . are indeed different, then I'd >>>> suppose replacing `!` with a more appropriate name as it's much easier >>>> to >>>> get confused otherwise. >>>> In essence, either !(x == .) must evaluate to (x != .) if the underlying >>>> meaning of these are the same, or the `!` in `!(x==.)` must be replaced >>>> to >>>> something that's more appropriate for what it's supposed to be. >>>> Personally, >>>> I prefer the former. It would greatly tighten the structure and >>>> consistency. >>>> >>>> "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch >>>> before >>>> in the context of joins, not logical subsets. 
>>>> >>>> Yes, I find this option would give more control in evaluating >>>> expressions >>>> with ease in `i`, by providing both "subset" (default) and the typical >>>> data.frame subsetting (na.rm = FALSE). >>>> Best regards, >>>> >>>> Arun >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> ______________________________**_________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.**r-project.org >>>> >>>> https://lists.r-forge.r-**project.org/cgi-bin/mailman/** >>>> listinfo/datatable-help >>>> >>> >> ______________________________**_________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.**r-project.org >> https://lists.r-forge.r-**project.org/cgi-bin/mailman/** >> listinfo/datatable-help >> > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Jun 10 17:28:20 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 10 Jun 2013 16:28:20 +0100 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net> <344d331d407432f6ae7b71cec416e065@imap.plus.net> Message-ID: <12b1525284509899b7897fe8f5ef0839@imap.plus.net> Hi Arun, Indeed. ! was introduced for not-join i.e. X[!Y] where i is type data.table. Extending it to vectors seemed to make sense at the time; e.g., X[!"foo"] and X[!3:6] (rather than the X[-3:6] mistake where X[-(3:6)] was intended) were in my mind. I think of everything as a join really; e.g., "where rownumber = i". 
But I think I'm fine with ! being not-join for data.table/list i only. Or is it just logical vector i to be turned off only, and could leave ! as-is for character and integer vector i? Matthew On 10.06.2013 15:52, Arunkumar Srinivasan wrote: > Matthew, > It just occurred to me. I'd be glad if you can clarify this. The operation is supposed to be "Not Join". Which means, I'd expect the "!" to be used with "J" as in: > dt <- data.table(x=c(0,0,1,1,3), y=1:5) > setkey(dt, "x") > dt[J(c(1,3))] # join > > x y > 1: 1 3 > 2: 1 4 > 3: 3 5 > dt[!J(c(1,3))] > > x y > 1: 0 1 > 2: 0 2 > Here the concept of "Not Join" with the use of "!J(.)" makes total sense. However, extending it to not-join for logical vectors is what seems to be an issue. It's more of a logical indexing than a join (at least in my mind). So, if it is possible to distinguish between "!" and "!J" (by checking if `i` is a data.table or not) to tell if it's a subsetting by logical vector or subsetting by "data.table" and then deciding what to do, would that resolve this issue? If not, what's the reason behind using "!" as a not-join during logical indexing? Is it still considered as a not-join?? > Just a thought. I hope it makes at least a little sense. > > Best, > Arun > > On Monday, June 10, 2013 at 4:35 PM, Matthew Dowle wrote: > >> Hm, another good point. We need ~ for formulae, although I can't >> imagine a formula in i (only in j). But in both i and j we might want >> to get(x). >> I thought about ^ i.e. X[^Y] in the spirit of regular expression >> syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a >> prefix. >> - maybe then? Consistent with - meaning in R. I don't think I >> actually had a specific use in mind for - and +, to reserve them for, >> but at the time it just seemed a shame to use up one of -/+ without >> defining the other. If - does a not join, then, might + be more like >> merge() (i.e. returning the union of the rows in x and i by join). 
I >> think I had something like that in mind, but hadn't thought it through. >> Some might say it should be a new argument e.g. notjoin=TRUE, but my >> thinking there is readability, since we often have many lines in i, j >> and by in that order, and if the "notjoin=TRUE" followed afterwards it >> would be far away from the i argument to which it applies. If we >> incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet >> more parameters, too. >> On 10.06.2013 15:02, Gabor Grothendieck wrote: >> >>> The problem with ~ is that it is using up a special character (of >>> which there are only a few) for a case that does not occur much. >>> I can think of other things that ~ might be better used for. For >>> example, perhaps ~ x could mean get(x). One aspect of data.table >>> that >>> tends to be difficult is when you don't know the variable name ahead >>> of time and this woiuld give a way to specify it concisely. >>> On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan >>> wrote: >>> >>>> Matthew, >>>> How about ~ instead of ! ? I ruled out - previously to leave + >>>> and - >>>> available for future use. NJ() may be possible too. >>>> Both "NJ()" and "~" are okay for me. >>>> That result makes perfect sense to me. I don't think of !(x==.) >>>> being the >>>> same as x!=. ! is simply a prefix. It's all the rows that >>>> aren't >>>> returned if the ! prefix wasn't there. >>>> I understand that `DT[!(x)]` does what `data.table` is designed to >>>> do >>>> currently. What I failed to mention was that if one were to consider >>>> implementing `!(x==.)` as the same as `x != .` then this behaviour >>>> has to be >>>> changed. Let's forget this point for a moment. >>>> That needs to be fixed. But we're getting quite theoretical here >>>> and far >>>> away from common use cases. Why would we ever have row numbers of >>>> the >>>> table, as a column of the table itself and want to select the rows >>>> by number >>>> not mentioned in that column? 
>>>> Probably I did not choose a good example. Suppose that I've a >>>> data.table and >>>> I want to get all rows where "x == 0". Let's say: >>>> set.seed(45) >>>> DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = >>>> sample(15)) >>>> DF <- as.data.frame(DT) >>>> To get all rows where x == 0, it could be done with DT[x == 0]. But >>>> it makes >>>> sense, at least in the context of data.frames, to do equivalently, >>>> DF[!(DF$x), ] (or) DF[DF$x == 0, ] >>>> All I want to say is, I expect `DT[!(x)]` should give the same >>>> result as >>>> `DT[x == 0]` (even though I fully understand it's not the intended >>>> behaviour >>>> of data.table), as it's more intuitive and less confusing. >>>> So, changing `!` to `~` or `NJ` is one half of the issue for me. The >>>> other >>>> is to replace the actual function of `!` in all contexts. I hope I >>>> came >>>> across with what I wanted to say, better this time. >>>> Best, >>>> Arun >>>> On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote: >>>> Hi, >>>> How about ~ instead of ! ? I ruled out - previously to leave + >>>> and - >>>> available for future use. NJ() may be possible too. >>>> Matthew >>>> On 10.06.2013 09:35, Arunkumar Srinivasan wrote: >>>> Hi Matthew, >>>> My view (from the last reply) more or less reflects mnel's comments >>>> here: >>>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 [1] >>>> Pasted here for convenience: >>>> data.table is mimicing subset in its handling of NA values in >>>> logical i >>>> arguments. -- the only issue is the ! prefix signifying a not-join, >>>> not the >>>> way one might expect. Perhaps the not join prefix could have been NJ >>>> not ! >>>> to avoid this confusion -- this might be another discussion to have >>>> on the >>>> mailing list -- (I think it is a discussion worth having) >>>> Arun >>>> On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: >>>> Hm, good point. 
Is data.table consistent with SQL already, for both >>>> == and >>>> !=, and so no change needed? >>>> Yes, I believe it's already consistent with SQL. However, the >>>> current >>>> interpretation of NA (documentation) being treated as FALSE is not >>>> needed / >>>> untrue, imho (Please see below). >>>> And it was correct for Frank to be mistaken. >>>> Yes, it seems like he was mistaken. >>>> Maybe just some more documentation and examples needed then. >>>> It'd be much more appropriate if the documentation reflects the role >>>> of >>>> subsetting in data.table mimicking "subset" function (in order to be >>>> in line >>>> with SQL) by dropping NA evaluated logicals. From a couple of posts >>>> before, >>>> where I pasted the code where NAs are replaced to FALSE were not >>>> necessary >>>> as `irows <- which(i)` makes clear that `which` is being used to get >>>> indices >>>> and then subset, this fits perfectly well with the interpretation of >>>> NA in >>>> data.table. >>>> Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA >>>> inconsistently? : >>>> http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently [2] >>>> Ha, I like the idea behind the use of () in evaluating expressions. >>>> It's >>>> another nice layer towards simplicity in data.table. But I still >>>> think there >>>> should not be an inconsistency in equivalent logical operations to >>>> provide >>>> different results. If !(x== .) and x != . are indeed different, then >>>> I'd >>>> suppose replacing `!` with a more appropriate name as it's much >>>> easier to >>>> get confused otherwise. >>>> In essence, either !(x == .) must evaluate to (x != .) if the >>>> underlying >>>> meaning of these are the same, or the `!` in `!(x==.)` must be >>>> replaced to >>>> something that's more appropriate for what it's supposed to be. >>>> Personally, >>>> I prefer the former. It would greatly tighten the structure and >>>> consistency. 
>>>> "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch >>>> before >>>> in the context of joins, not logical subsets. >>>> Yes, I find this option would give more control in evaluating >>>> expressions >>>> with ease in `i`, by providing both "subset" (default) and the >>>> typical >>>> data.frame subsetting (na.rm = FALSE). >>>> Best regards, >>>> Arun >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org [3] >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [4] Links: ------ [1] http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 [2] http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently [3] mailto:datatable-help at lists.r-forge.r-project.org [4] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] mailto:aragorn168b at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Jun 10 19:01:58 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 10 Jun 2013 19:01:58 +0200 Subject: [datatable-help] Follow-up on subsetting data.table with NAs In-Reply-To: <12b1525284509899b7897fe8f5ef0839@imap.plus.net> References: <0F051BE2E84B407D8E1D630107CFF5EB@gmail.com> <114784a0d4675127913b6b4b903d9a0c@imap.plus.net> <77fc9f2e9a18ca56c511ca1cced6af51@imap.plus.net> <2C19FA7BF7224CC581A6A909CAADF830@gmail.com> <265F40BB318541E99B6FF75F0C3115AD@gmail.com> <8540a4a91743d5afab595132dd41eb7b@imap.plus.net> <344d331d407432f6ae7b71cec416e065@imap.plus.net> <12b1525284509899b7897fe8f5ef0839@imap.plus.net> Message-ID: Hi Matthew, Thanks for clarifying this. To me the "not join" operation is very similar to "setdiff" operation but for a data.frame/data.table. So DT[!J(.)] could be interpreted as setdiff(DT, DT[J(.)]). 
No, I'm with you in that it makes much sense in extending it to logical vectors operations as well. And so far, I guess all of them who wrote back also agree with the idea of: 1) !(x == .) and x != . being identical 2) ~(.) (or) NJ(.) (or) -(.) being a NOT JOIN on data.table/list/vectors etc.. I'd love for these two to be on the feature list. I really don't mind the "~", "NJ" or "-". Thanks again, Arun On Monday, June 10, 2013 at 5:28 PM, Matthew Dowle wrote: > > Hi Arun, > Indeed. ! was introduced for not-join i.e. X[!Y] where i is type data.table. Extending it to vectors seemed to make sense at the time; e.g., X[!"foo"] and X[!3:6] (rather than the X[-3:6] mistake where X[-(3:6)] was intended) were in my mind. I think of everything as a join really; e.g., "where rownumber = i". > But I think I'm fine with ! being not-join for data.table/list i only. Or is it just logical vector i to be turned off only, and could leave ! as-is for character and integer vector i? > Matthew > > On 10.06.2013 15:52, Arunkumar Srinivasan wrote: > > Matthew, > > It just occurred to me. I'd be glad if you can clarify this. The operation is supposed to be "Not Join". Which means, I'd expect the "!" to be used with "J" as in: > > dt <- data.table(x=c(0,0,1,1,3), y=1:5) > > setkey(dt, "x") > > dt[J(c(1,3))] # join > > x y > > 1: 1 3 > > 2: 1 4 > > 3: 3 5 > > > > dt[!J(c(1,3))] > > x y > > 1: 0 1 > > 2: 0 2 > > > > Here the concept of "Not Join" with the use of "!J(.)" makes total sense. However, extending it to not-join for logical vectors is what seems to be an issue. It's more of a logical indexing than a join (at least in my mind). So, if it is possible to distinguish between "!" and "!J" (by checking if `i` is a data.table or not) to tell if it's a subsetting by logical vector or subsetting by "data.table" and then deciding what to do, would that resolve this issue? If not, what's the reason behind using "!" as a not-join during logical indexing? Is it still considered as a not-join?? 
> > Just a thought. I hope it makes at least a little sense. > > Best, > > Arun > > > > > > On Monday, June 10, 2013 at 4:35 PM, Matthew Dowle wrote: > > > > > Hm, another good point. We need ~ for formulae, although I can't > > > imagine a formula in i (only in j). But in both i and j we might want > > > to get(x). > > > I thought about ^ i.e. X[^Y] in the spirit of regular expression > > > syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a > > > prefix. > > > - maybe then? Consistent with - meaning in R. I don't think I > > > actually had a specific use in mind for - and +, to reserve them for, > > > but at the time it just seemed a shame to use up one of -/+ without > > > defining the other. If - does a not join, then, might + be more like > > > merge() (i.e. returning the union of the rows in x and i by join). I > > > think I had something like that in mind, but hadn't thought it through. > > > Some might say it should be a new argument e.g. notjoin=TRUE, but my > > > thinking there is readability, since we often have many lines in i, j > > > and by in that order, and if the "notjoin=TRUE" followed afterwards it > > > would be far away from the i argument to which it applies. If we > > > incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet > > > more parameters, too. > > > On 10.06.2013 15:02, Gabor Grothendieck wrote: > > > > The problem with ~ is that it is using up a special character (of > > > > which there are only a few) for a case that does not occur much. > > > > I can think of other things that ~ might be better used for. For > > > > example, perhaps ~ x could mean get(x). One aspect of data.table > > > > that > > > > tends to be difficult is when you don't know the variable name ahead > > > > of time and this woiuld give a way to specify it concisely. > > > > On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan > > > > wrote: > > > > > Matthew, > > > > > How about ~ instead of ! ? 
I ruled out - previously to leave + > > > > > and - > > > > > available for future use. NJ() may be possible too. > > > > > Both "NJ()" and "~" are okay for me. > > > > > That result makes perfect sense to me. I don't think of !(x==.) > > > > > being the > > > > > same as x!=. ! is simply a prefix. It's all the rows that > > > > > aren't > > > > > returned if the ! prefix wasn't there. > > > > > I understand that `DT[!(x)]` does what `data.table` is designed to > > > > > do > > > > > currently. What I failed to mention was that if one were to consider > > > > > implementing `!(x==.)` as the same as `x != .` then this behaviour > > > > > has to be > > > > > changed. Let's forget this point for a moment. > > > > > That needs to be fixed. But we're getting quite theoretical here > > > > > and far > > > > > away from common use cases. Why would we ever have row numbers of > > > > > the > > > > > table, as a column of the table itself and want to select the rows > > > > > by number > > > > > not mentioned in that column? > > > > > Probably I did not choose a good example. Suppose that I've a > > > > > data.table and > > > > > I want to get all rows where "x == 0". Let's say: > > > > > set.seed(45) > > > > > DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = > > > > > sample(15)) > > > > > DF <- as.data.frame(DT) > > > > > To get all rows where x == 0, it could be done with DT[x == 0]. But > > > > > it makes > > > > > sense, at least in the context of data.frames, to do equivalently, > > > > > DF[!(DF$x), ] (or) DF[DF$x == 0, ] > > > > > All I want to say is, I expect `DT[!(x)]` should give the same > > > > > result as > > > > > `DT[x == 0]` (even though I fully understand it's not the intended > > > > > behaviour > > > > > of data.table), as it's more intuitive and less confusing. > > > > > So, changing `!` to `~` or `NJ` is one half of the issue for me. The > > > > > other > > > > > is to replace the actual function of `!` in all contexts. 
I hope I > > > > > came > > > > > across with what I wanted to say, better this time. > > > > > Best, > > > > > Arun > > > > > On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote: > > > > > Hi, > > > > > How about ~ instead of ! ? I ruled out - previously to leave + > > > > > and - > > > > > available for future use. NJ() may be possible too. > > > > > Matthew > > > > > On 10.06.2013 09:35, Arunkumar Srinivasan wrote: > > > > > Hi Matthew, > > > > > My view (from the last reply) more or less reflects mnel's comments > > > > > here: > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143 > > > > > Pasted here for convenience: > > > > > data.table is mimicing subset in its handling of NA values in > > > > > logical i > > > > > arguments. -- the only issue is the ! prefix signifying a not-join, > > > > > not the > > > > > way one might expect. Perhaps the not join prefix could have been NJ > > > > > not ! > > > > > to avoid this confusion -- this might be another discussion to have > > > > > on the > > > > > mailing list -- (I think it is a discussion worth having) > > > > > Arun > > > > > On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote: > > > > > Hm, good point. Is data.table consistent with SQL already, for both > > > > > == and > > > > > !=, and so no change needed? > > > > > Yes, I believe it's already consistent with SQL. However, the > > > > > current > > > > > interpretation of NA (documentation) being treated as FALSE is not > > > > > needed / > > > > > untrue, imho (Please see below). > > > > > And it was correct for Frank to be mistaken. > > > > > Yes, it seems like he was mistaken. > > > > > Maybe just some more documentation and examples needed then. 
> > > > > It'd be much more appropriate if the documentation reflects the role > > > > > of > > > > > subsetting in data.table mimicking "subset" function (in order to be > > > > > in line > > > > > with SQL) by dropping NA evaluated logicals. From a couple of posts > > > > > before, > > > > > where I pasted the code where NAs are replaced to FALSE were not > > > > > necessary > > > > > as `irows <- which(i)` makes clear that `which` is being used to get > > > > > indices > > > > > and then subset, this fits perfectly well with the interpretation of > > > > > NA in > > > > > data.table. > > > > > Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA > > > > > inconsistently? : > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently > > > > > Ha, I like the idea behind the use of () in evaluating expressions. > > > > > It's > > > > > another nice layer towards simplicity in data.table. But I still > > > > > think there > > > > > should not be an inconsistency in equivalent logical operations to > > > > > provide > > > > > different results. If !(x== .) and x != . are indeed different, then > > > > > I'd > > > > > suppose replacing `!` with a more appropriate name as it's much > > > > > easier to > > > > > get confused otherwise. > > > > > In essence, either !(x == .) must evaluate to (x != .) if the > > > > > underlying > > > > > meaning of these are the same, or the `!` in `!(x==.)` must be > > > > > replaced to > > > > > something that's more appropriate for what it's supposed to be. > > > > > Personally, > > > > > I prefer the former. It would greatly tighten the structure and > > > > > consistency. > > > > > "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch > > > > > before > > > > > in the context of joins, not logical subsets. 
> > > > > Yes, I find this option would give more control in evaluating > > > > > expressions > > > > > with ease in `i`, by providing both "subset" (default) and the > > > > > typical > > > > > data.frame subsetting (na.rm = FALSE). > > > > > Best regards, > > > > > Arun > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Sat Jun 15 06:42:27 2013 From: eduard.antonyan at gmail.com (eddi) Date: Fri, 14 Jun 2013 21:42:27 -0700 (PDT) Subject: [datatable-help] rbindlist with unnamed lists Message-ID: <1371271347480-4669570.post@n4.nabble.com> Maybe I'm not understanding smth, but afaict the following is not working as advertised (I don't see it mentioned in the help that the sub-lists have to be named; and I don't think that should be a requirement anyway): *> l = list(list(1,2), list(2,3))* > do.call(rbind, l) [,1] [,2] [1,] 1 2 [2,] 2 3 *> rbindlist(l) Error in alloc.col(ans) : Internal error: length of names (0) is not length of dt (2)* And on an FR note, it would be great if the following worked as well: *> l = list(c(1,2), c(2,3))* > do.call(rbind, l) [,1] [,2] [1,] 1 2 [2,] 2 3 *> rbindlist(l) Error in rbindlist(l) : Item 1 of list input is not a data.frame, data.table or list* -- View this message in context: http://r.789695.n4.nabble.com/rbindlist-with-unnamed-lists-tp4669570.html Sent from the datatable-help mailing list archive at Nabble.com. 
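A possible workaround for the report above, offered only as a minimal sketch: if `rbindlist` in the installed version insists on named items, each sub-list can be given placeholder names before binding. The helper `name_items` and the names `V1`, `V2`, ... are assumptions made for illustration and are not part of data.table; whether the names are actually required depends on the data.table version.

```r
library(data.table)

# Unnamed sub-lists, as in the report above.
l <- list(list(1, 2), list(2, 3))

# Hypothetical helper (not part of data.table): give every item the same
# placeholder names V1, V2, ... so that all items agree on column names.
name_items <- function(x) setNames(as.list(x), paste0("V", seq_along(x)))

dt1 <- rbindlist(lapply(l, name_items))

# The same helper covers the feature request for atomic vectors, because
# as.list() first turns c(1, 2) into list(1, 2).
v <- list(c(1, 2), c(2, 3))
dt2 <- rbindlist(lapply(v, name_items))
```

Both results should have two rows and two columns, matching the shape that `do.call(rbind, l)` produces, but as a data.table rather than a matrix.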
From papucho at me.com Mon Jun 17 08:54:50 2013
From: papucho at me.com (Ivan Alves)
Date: Mon, 17 Jun 2013 08:54:50 +0200
Subject: [datatable-help] merging syntax
Message-ID: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com>

Dear all,

I am not sure I understand the syntax for merging data.tables. I have keyed the two 'satellite' tables from which I want to match information to the main table 'links':

g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)]
setkey(g_ctpty,ctpty_head)
g_iss <- gultimate[,list(iss_head,iss_cty)]
setkey(g_iss,iss_head)

Why are the two below not equivalent?

This works:

data = merge(
    merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"),
    g_iss,
    all.x = TRUE, by = "iss_head"
)

And this does not:

data = g_iss[g_ctpty[links]]

Any guidance would be appreciated.
Kind regards,
Ivan

From aragorn168b at gmail.com Mon Jun 17 08:57:48 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 17 Jun 2013 08:57:48 +0200
Subject: [datatable-help] merging syntax
In-Reply-To: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com>
References: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com>
Message-ID:

Since you have the data as well, why not provide it (at least a small part with which your issue is reproducible)? Isn't that much easier than asking everyone who's willing to help to create data and test your code?

Arun

On Monday, June 17, 2013 at 8:54 AM, Ivan Alves wrote:

> Dear all,
>
> I am not sure I understand the syntax for merging data.tables. I have keyed the two 'satellite' tables from which I want to match information to the main table 'links'
>
> g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)]
> setkey(g_ctpty,ctpty_head)
> g_iss <- gultimate[,list(iss_head,iss_cty)]
> setkey(g_iss,iss_head)
>
> Why are the two below not equivalent?
> > This works: > > data = merge( > merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), > g_iss, > all.x = TRUE, by = "iss_head" > ), > > And this does not: > > data = g_iss[g_ctpty[links]], > > Any guidance would be appreciated. > Kind regards, > Ivan > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Mon Jun 17 14:36:38 2013 From: FErickson at psu.edu (Frank Erickson) Date: Mon, 17 Jun 2013 07:36:38 -0500 Subject: [datatable-help] merging syntax In-Reply-To: References: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com> Message-ID: I think that key(g_ctpty[links]) == key(g_ctpty) == "ctpty_head", which is used when you do your second merge with [, instead of "iss_head". You can check this by running key(g_ctpty[links]) . --Frank On Mon, Jun 17, 2013 at 1:57 AM, Arunkumar Srinivasan wrote: > Since you have the data as well, why not provide it (a small part at > least with which your issue is reproducible)? Isn't it much easier than to > ask everyone who's willing to help to create a data and test your code? > > Arun > > On Monday, June 17, 2013 at 8:54 AM, Ivan Alves wrote: > > Dear all, > > I am not sure I understand the syntax for merging data.tables. I have > keyed the two 'satelite' tables from which I want to match information to > the main table 'links' > > g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)] > setkey(g_ctpty,ctpty_head) > g_iss <- gultimate[,list(iss_head,iss_cty)] > setkey(g_iss,iss_head) > > Why are the two below not equivalent? 
> > This works: > > data = merge( > merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), > g_iss, > all.x = TRUE, by = "iss_head" > ), > > And this does not: > > data = g_iss[g_ctpty[links]], > > Any guidance would be appreciated. > Kind regards, > Ivan > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Mon Jun 17 23:12:55 2013 From: FErickson at psu.edu (Frank Erickson) Date: Mon, 17 Jun 2013 16:12:55 -0500 Subject: [datatable-help] print.data.table's digits argument Message-ID: Hi, I have a data.table with list and float columns. I want to print it with the floats rounded to display only a few significant digits. Using getAnywhere(print.data.table), I see that digits is an option. However, when I use it, it seems to have no effect. Is there a special way to pass arguments to hidden functions like this (on the methods(print) list it shows up as print.data.table*)? 
Anyway, here are the first two lines of my dt: DT <- structure(list(fisyr = 1995:1996, er = list(c(1, 3), c(1, 3)), eg = c(0.0197315833926059, 0.0197315833926059), esal = list( c(2329.89763779528, 2423.6811023622), c(2263.07456978967, 2354.16826003824)), fr = list(c(4, 4), c(4, 4)), fg = c(0.039310363070415, 0.039310363070415), fsal = list(c(2520.85433070866, 2520.85433070866 ), c(2448.55449330784, 2448.55449330784)), mr = list(c(5, 30), c(5, 30)), mg = c(0.0197779376457164, 0.0197779376457164 ), msal = list(c(2571.70078740157, 4215.73622047244), c(2497.94263862333, 4094.82600382409))), .Names = c("fisyr", "er", "eg", "esal", "fr", "fg", "fsal", "mr", "mg", "msal"), class = c("data.table", "data.frame"), row.names = c(NA, -2L)) print(DT,digits=4) # just DT print.data.frame(DT,digits=4) # fisyr er eg esal fr fg fsal mr mg msal # 1 1995 1, 3 0.01973 2330, 2424 4, 4 0.03931 2521, 2521 5, 30 0.01978 2572, 4216 # 2 1996 1, 3 0.01973 2263, 2354 4, 4 0.03931 2449, 2449 5, 30 0.01978 2498, 4095 Printing as a data.frame does the rounding/shortening, but the list columns look ugly (thanks to that extra space), so I'd rather see a data.table output. It's a tiny data.table, so if anyone knows a fancy lapply trick for this, that'd be cool. I'm going to try to find one myself now. Thanks, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Tue Jun 18 00:49:48 2013 From: FErickson at psu.edu (Frank Erickson) Date: Mon, 17 Jun 2013 17:49:48 -0500 Subject: [datatable-help] print.data.table's digits argument In-Reply-To: <521F3F587A7542BBB3EA15C2542D7413@gmail.com> References: <521F3F587A7542BBB3EA15C2542D7413@gmail.com> Message-ID: Hi Arun, Thanks. That looks like a bug in format.data.table to me. I think it should be function(col) not function(col,...). Your last line does not work verbatim on my example, but that looks like the way I should go about it. 
One additional wrinkle: with print.data.frame, it also performs formatting recursively (fixing my list columns), which would call for rapply with the "replace" option, I guess. I tried it, but couldn't get it to work.

FYI, it looks like you missed the reply to all, but I'm sending this back to data.table-help, so it's all good.

Thanks again,

Frank

On Mon, Jun 17, 2013 at 5:04 PM, Arunkumar Srinivasan wrote:

> The issue seems to come from `data.table:::format.data.table`,
> specifically the lines:
>
> do.call("cbind", lapply(x, function(col, ...) {
>     if (is.list(col))
>         col = sapply(col, format.item)
>     format(col, justify = justify, ...)
> }))
>
> Here, it seems that passing `...` inside `lapply` as `function(col, ...)`
> somehow loses the information about "digits". That is, if you just do:
>
> That is, consider:
> dt <- data.table(x=1:5, y=rnorm(5))
>
> If you've
> ff <- function(x, ...) {
>     do.call("cbind", lapply(x, format, ...))
> }
> ff(dt, digits=2)
>
> seems to work.
>
> However, if you do:
>
> ff <- function(x, ...) {
>     do.call("cbind", lapply(x, function(y, ...) {
>         format(y, ...)
>     }))
> }
> ff(dt, digits=2)
>
> won't work!
>
> That said, for now, you can do something like:
>
> as.data.table(do.call("cbind", lapply(dt, function(x) as.numeric(format(x,
> digits=2)))))
>
> until this is resolved..
>
> Arun
>
> On Monday, June 17, 2013 at 11:12 PM, Frank Erickson wrote:
>
> Hi,
>
> I have a data.table with list and float columns. I want to print it with
> the floats rounded to display only a few significant digits. Using
> getAnywhere(print.data.table), I see that digits is an option. However,
> when I use it, it seems to have no effect. Is there a special way to pass
> arguments to hidden functions like this (on the methods(print) list it
> shows up as print.data.table*)?
> > Anyway, here are the first two lines of my dt: > > DT <- structure(list(fisyr = 1995:1996, er = list(c(1, 3), c(1, 3)), > eg = c(0.0197315833926059, 0.0197315833926059), esal = list( > c(2329.89763779528, 2423.6811023622), c(2263.07456978967, > 2354.16826003824)), fr = list(c(4, 4), c(4, 4)), fg = > c(0.039310363070415, > 0.039310363070415), fsal = list(c(2520.85433070866, 2520.85433070866 > ), c(2448.55449330784, 2448.55449330784)), mr = list(c(5, > 30), c(5, 30)), mg = c(0.0197779376457164, 0.0197779376457164 > ), msal = list(c(2571.70078740157, 4215.73622047244), > c(2497.94263862333, > 4094.82600382409))), .Names = c("fisyr", "er", "eg", "esal", > "fr", "fg", "fsal", "mr", "mg", "msal"), class = c("data.table", > "data.frame"), row.names = c(NA, -2L)) > > print(DT,digits=4) > # just DT > print.data.frame(DT,digits=4) > # fisyr er eg esal fr fg fsal mr mg > msal > # 1 1995 1, 3 0.01973 2330, 2424 4, 4 0.03931 2521, 2521 5, 30 0.01978 > 2572, 4216 > # 2 1996 1, 3 0.01973 2263, 2354 4, 4 0.03931 2449, 2449 5, 30 0.01978 > 2498, 4095 > > Printing as a data.frame does the rounding/shortening, but the list > columns look ugly (thanks to that extra space), so I'd rather see a > data.table output. > > It's a tiny data.table, so if anyone knows a fancy lapply trick for this, > that'd be cool. I'm going to try to find one myself now. > > Thanks, > > Frank > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aragorn168b at gmail.com Tue Jun 18 01:19:49 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 18 Jun 2013 01:19:49 +0200 Subject: [datatable-help] print.data.table's digits argument In-Reply-To: References: Message-ID: <081AB5A6E11243C0B1C75A937463DAC8@gmail.com> Dear Frank, Thanks for forwarding to the list. I always seem to forget to "reply-all". Apologies. Managed this time! :) Try this on your data: as.data.table(do.call("cbind", lapply(DT, function(x) { if (is.list(x)) { lapply(x, function(y) as.numeric(format(y, digits=2))) } else as.numeric(format(x, digits=2)) }))) Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Tue Jun 18 01:39:39 2013 From: FErickson at psu.edu (Frank Erickson) Date: Mon, 17 Jun 2013 18:39:39 -0500 Subject: [datatable-help] print.data.table's digits argument In-Reply-To: <081AB5A6E11243C0B1C75A937463DAC8@gmail.com> References: <081AB5A6E11243C0B1C75A937463DAC8@gmail.com> Message-ID: Ah, that did the trick! I'll use this quite a lot, I expect. Thanks, Arun. --Frank On Mon, Jun 17, 2013 at 6:19 PM, Arunkumar Srinivasan wrote: > Dear Frank, > > Thanks for forwarding to the list. I always seem to forget to "reply-all". > Apologies. Managed this time! :) > > Try this on your data: > > as.data.table(do.call("cbind", lapply(DT, function(x) { > if (is.list(x)) { > lapply(x, function(y) as.numeric(format(y, digits=2))) > } else > as.numeric(format(x, digits=2)) > }))) > > > > Arun > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From papucho at me.com Tue Jun 18 17:08:55 2013 From: papucho at me.com (Ivan Alves) Date: Tue, 18 Jun 2013 17:08:55 +0200 Subject: [datatable-help] merging syntax In-Reply-To: References: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com> Message-ID: Hi Frank, Many thanks for the thoughts. It is something that has to do with the keys, that is for sure (see below). A double DT join does not understand that it has to do one join by one variable and the other by another variable: the output simply has lines where the two keys are the same (ctpty_head==iss_head),which is of course not optimal. how do I tell DT to do separate matchings at each join? Setting setkey(links,ctpty_head,iss_head) before the join does not work either. > key(g_ctpty[links]) NULL > key(g_ctpty) [1] "ctpty_head" > key(g_iss) [1] "iss_head" > key(links) NULL Hi Arunkumar, An example would look like follows: g_ctpty = data.table(ctpty_head=c("a","b","c"), ctpty_cty=c("US","DE","JP")) g_iss = data.table(iss_head=c("a","b","c"), iss_cty=c("US","DE","JP")) links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a")) merge( + merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), + g_iss, + all.x = TRUE, by = "iss_head" + ) iss_head ctpty_head ctpty_cty iss_cty 1: a c JP US 2: b a US DE 3: b b DE DE > g_iss[g_ctpty[links]] iss_head iss_cty ctpty_cty iss_head.1 1: a US US b 2: b DE DE b 3: c JP JP a > setkey(links,ctpty_head,iss_head) > g_iss[g_ctpty[links]] iss_head iss_cty ctpty_cty iss_head.1 1: a US US b 2: b DE DE b 3: c JP JP a On 17 Jun 2013, at 14:36, Frank Erickson wrote: > I think that key(g_ctpty[links]) == key(g_ctpty) == "ctpty_head", which is used when you do your second merge with [, instead of "iss_head". You can check this by running key(g_ctpty[links]) . --Frank > > > On Mon, Jun 17, 2013 at 1:57 AM, Arunkumar Srinivasan wrote: > Since you have the data as well, why not provide it (a small part at least with which your issue is reproducible)? 
Isn't it much easier than to ask everyone who's willing to help to create a data and test your code? > > Arun > > On Monday, June 17, 2013 at 8:54 AM, Ivan Alves wrote: > >> Dear all, >> >> I am not sure I understand the syntax for merging data.tables. I have keyed the two 'satelite' tables from which I want to match information to the main table 'links' >> >> g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)] >> setkey(g_ctpty,ctpty_head) >> g_iss <- gultimate[,list(iss_head,iss_cty)] >> setkey(g_iss,iss_head) >> >> Why are the two below not equivalent? >> >> This works: >> >> data = merge( >> merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), >> g_iss, >> all.x = TRUE, by = "iss_head" >> ), >> >> And this does not: >> >> data = g_iss[g_ctpty[links]], >> >> Any guidance would be appreciated. >> Kind regards, >> Ivan >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Tue Jun 18 17:34:19 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Tue, 18 Jun 2013 10:34:19 -0500 Subject: [datatable-help] merging syntax In-Reply-To: References: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com> Message-ID: Frank has already answered why you're getting the results you are. Re "how do I tell DT to do *separate matchings at each join*?": Currently you have to use 'merge' or 'data.table', see this - https://r-forge.r-project.org/tracker/?func=detail&atid=978&aid=4675&group_id=240 and the SO link inside. 
On Tue, Jun 18, 2013 at 10:08 AM, Ivan Alves wrote: > Hi Frank, > > Many thanks for the thoughts. > > It is something that has to do with the keys, that is for sure (see > below). A double DT join does not understand that it has to do one join by > one variable and the other by another variable: the output simply has lines > where the two keys are the same (ctpty_head==iss_head),which is of course > not optimal. > > how do I tell DT to do *separate matchings at each join*? Setting setkey( > links,ctpty_head,iss_head) before the join does not work either. > > > key(g_ctpty[links]) > NULL > > key(g_ctpty) > [1] "ctpty_head" > > key(g_iss) > [1] "iss_head" > > key(links) > NULL > > Hi Arunkumar, > > An example would look like follows: > > g_ctpty = data.table(ctpty_head=c("a","b","c"), > ctpty_cty=c("US","DE","JP")) > g_iss = data.table(iss_head=c("a","b","c"), iss_cty=c("US","DE","JP")) > links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a")) > > merge( > + merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), > + g_iss, > + all.x = TRUE, by = "iss_head" > + ) > iss_head ctpty_head ctpty_cty iss_cty > 1: a c JP US > 2: b a US DE > 3: b b DE DE > > g_iss[g_ctpty[links]] > iss_head iss_cty ctpty_cty iss_head.1 > 1: a US US b > 2: b DE DE b > 3: c JP JP a > > setkey(links,ctpty_head,iss_head) > > g_iss[g_ctpty[links]] > iss_head iss_cty ctpty_cty iss_head.1 > 1: a US US b > 2: b DE DE b > 3: c JP JP a > > On 17 Jun 2013, at 14:36, Frank Erickson wrote: > > I think that key(g_ctpty[links]) == key(g_ctpty) == "ctpty_head", which is > used when you do your second merge with [, instead of "iss_head". You can > check this by running key(g_ctpty[links]) . --Frank > > > On Mon, Jun 17, 2013 at 1:57 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> Since you have the data as well, why not provide it (a small part at >> least with which your issue is reproducible)? 
Isn't it much easier than to >> ask everyone who's willing to help to create a data and test your code? >> >> Arun >> >> On Monday, June 17, 2013 at 8:54 AM, Ivan Alves wrote: >> >> Dear all, >> >> I am not sure I understand the syntax for merging data.tables. I have >> keyed the two 'satelite' tables from which I want to match information to >> the main table 'links' >> >> g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)] >> setkey(g_ctpty,ctpty_head) >> g_iss <- gultimate[,list(iss_head,iss_cty)] >> setkey(g_iss,iss_head) >> >> Why are the two below not equivalent? >> >> This works: >> >> data = merge( >> merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), >> g_iss, >> all.x = TRUE, by = "iss_head" >> ), >> >> And this does not: >> >> data = g_iss[g_ctpty[links]], >> >> Any guidance would be appreciated. >> Kind regards, >> Ivan >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From papucho at me.com Thu Jun 20 18:43:58 2013 From: papucho at me.com (Ivan Alves) Date: Thu, 20 Jun 2013 18:43:58 +0200 Subject: [datatable-help] merging syntax In-Reply-To: References: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com> Message-ID: Many thanks to both Eduard and Frank on the issue of the needed key. One aspect of the merging that is not clear is how to do 'inner' vs. 'outer' 'joins' (like in SQL). 
Whereas it works with merge (using the all.x=TRUE option), how is it done with data.table? In the improved example below g_ctpty = data.table(ctpty_head=c("a","b","c","d"), ctpty_cty=c("US","DE","JP","CN")) g_iss = data.table(iss_head=c("a","b","c","d"), iss_cty=c("US","DE","JP","CN")) links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a")) setkey(g_ctpty,ctpty_head) setkey(g_iss,iss_head) merge( merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), g_iss, all.x = TRUE, by = "iss_head" ) g_iss[g_ctpty[links]] # error links[g_ctpty][g_iss] # still error setkey(links,ctpty_head,iss_head) # keys are needed links[g_ctpty][g_iss] # how to get inner join? How do it not include the last line in the link? (Again, it works with merge). Many thanks. Ivan On 18 Jun 2013, at 17:34, Eduard Antonyan wrote: > Frank has already answered why you're getting the results you are. > > Re "how do I tell DT to do separate matchings at each join?": > > Currently you have to use 'merge' or 'data.table', see this - https://r-forge.r-project.org/tracker/?func=detail&atid=978&aid=4675&group_id=240 and the SO link inside. > > > On Tue, Jun 18, 2013 at 10:08 AM, Ivan Alves wrote: > Hi Frank, > > Many thanks for the thoughts. > > It is something that has to do with the keys, that is for sure (see below). A double DT join does not understand that it has to do one join by one variable and the other by another variable: the output simply has lines where the two keys are the same (ctpty_head==iss_head),which is of course not optimal. > > how do I tell DT to do separate matchings at each join? Setting setkey(links,ctpty_head,iss_head) before the join does not work either. 
> > > key(g_ctpty[links]) > NULL > > key(g_ctpty) > [1] "ctpty_head" > > key(g_iss) > [1] "iss_head" > > key(links) > NULL > > Hi Arunkumar, > > An example would look like follows: > > g_ctpty = data.table(ctpty_head=c("a","b","c"), ctpty_cty=c("US","DE","JP")) > g_iss = data.table(iss_head=c("a","b","c"), iss_cty=c("US","DE","JP")) > links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a")) > > merge( > + merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), > + g_iss, > + all.x = TRUE, by = "iss_head" > + ) > iss_head ctpty_head ctpty_cty iss_cty > 1: a c JP US > 2: b a US DE > 3: b b DE DE > > g_iss[g_ctpty[links]] > iss_head iss_cty ctpty_cty iss_head.1 > 1: a US US b > 2: b DE DE b > 3: c JP JP a > > setkey(links,ctpty_head,iss_head) > > g_iss[g_ctpty[links]] > iss_head iss_cty ctpty_cty iss_head.1 > 1: a US US b > 2: b DE DE b > 3: c JP JP a > > On 17 Jun 2013, at 14:36, Frank Erickson wrote: > >> I think that key(g_ctpty[links]) == key(g_ctpty) == "ctpty_head", which is used when you do your second merge with [, instead of "iss_head". You can check this by running key(g_ctpty[links]) . --Frank >> >> >> On Mon, Jun 17, 2013 at 1:57 AM, Arunkumar Srinivasan wrote: >> Since you have the data as well, why not provide it (a small part at least with which your issue is reproducible)? Isn't it much easier than to ask everyone who's willing to help to create a data and test your code? >> >> Arun >> >> On Monday, June 17, 2013 at 8:54 AM, Ivan Alves wrote: >> >>> Dear all, >>> >>> I am not sure I understand the syntax for merging data.tables. I have keyed the two 'satelite' tables from which I want to match information to the main table 'links' >>> >>> g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)] >>> setkey(g_ctpty,ctpty_head) >>> g_iss <- gultimate[,list(iss_head,iss_cty)] >>> setkey(g_iss,iss_head) >>> >>> Why are the two below not equivalent? 
>>> >>> This works: >>> >>> data = merge( >>> merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), >>> g_iss, >>> all.x = TRUE, by = "iss_head" >>> ), >>> >>> And this does not: >>> >>> data = g_iss[g_ctpty[links]], >>> >>> Any guidance would be appreciated. >>> Kind regards, >>> Ivan >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Jun 20 22:10:55 2013 From: eduard.antonyan at gmail.com (eddi) Date: Thu, 20 Jun 2013 13:10:55 -0700 (PDT) Subject: [datatable-help] fread warning Message-ID: <1371759055656-4669997.post@n4.nabble.com> I'm getting this warning from fread: "Mapped file ok but madvise failed" It seems to be purely a function of file (and, looking at the code, I assume memory page) size and is system dependent, so I can't really give you a reproducible example. Adding or removing a single character anywhere in the file results in the warning disappearing. Two questions - should I care about this warning? And can the code be changed to be aware of this edge case? -- View this message in context: http://r.789695.n4.nabble.com/fread-warning-tp4669997.html Sent from the datatable-help mailing list archive at Nabble.com. 
From eduard.antonyan at gmail.com Thu Jun 20 22:24:02 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Thu, 20 Jun 2013 15:24:02 -0500 Subject: [datatable-help] merging syntax In-Reply-To: References: <8220DA4C-3468-456E-B397-7D9DA5D58591@me.com> Message-ID: I think the following will achieve the same as your merge's: setkey(g_ctpty, ctpty_head) setkey(links, ctpty_head) setkey(g_iss, iss_head) g_iss[data.table(g_ctpty[links], key = "iss_head")] And in general, merge(X, Y, all.x = TRUE) is (more or less) equivalent to Y[X] On Thu, Jun 20, 2013 at 11:43 AM, Ivan Alves wrote: > Many thanks to both Eduard and Frank on the issue of the needed key. One > aspect of the merging that is not clear is how to do 'inner' vs. 'outer' > 'joins' (like in SQL). Whereas it works with merge (using the all.x=TRUE > option), how is it done with data.table? In the improved example below > > g_ctpty = data.table(ctpty_head=c("a","b","c","d"), ctpty_cty=c("US","DE", > "JP","CN")) > g_iss = data.table(iss_head=c("a","b","c","d"), iss_cty=c("US","DE","JP", > "CN")) > links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a")) > setkey(g_ctpty,ctpty_head) > setkey(g_iss,iss_head) > merge( > merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), > g_iss, > all.x = TRUE, by = "iss_head" > ) > g_iss[g_ctpty[links]] # error > links[g_ctpty][g_iss] # still error > setkey(links,ctpty_head,iss_head) # keys are needed > links[g_ctpty][g_iss] # how to get inner join? > > How do it not include the last line in the link? (Again, it works with > merge). Many thanks. > > Ivan > > > On 18 Jun 2013, at 17:34, Eduard Antonyan > wrote: > > Frank has already answered why you're getting the results you are. > > Re "how do I tell DT to do *separate matchings at each join*?": > > Currently you have to use 'merge' or 'data.table', see this - > https://r-forge.r-project.org/tracker/?func=detail&atid=978&aid=4675&group_id=240 and > the SO link inside. 
> > > On Tue, Jun 18, 2013 at 10:08 AM, Ivan Alves wrote: > >> Hi Frank, >> >> Many thanks for the thoughts. >> >> It is something that has to do with the keys, that is for sure (see >> below). A double DT join does not understand that it has to do one join by >> one variable and the other by another variable: the output simply has lines >> where the two keys are the same (ctpty_head==iss_head),which is of course >> not optimal. >> >> how do I tell DT to do *separate matchings at each join*? Setting setkey >> (links,ctpty_head,iss_head) before the join does not work either. >> >> > key(g_ctpty[links]) >> NULL >> > key(g_ctpty) >> [1] "ctpty_head" >> > key(g_iss) >> [1] "iss_head" >> > key(links) >> NULL >> >> Hi Arunkumar, >> >> An example would look like follows: >> >> g_ctpty = data.table(ctpty_head=c("a","b","c"), >> ctpty_cty=c("US","DE","JP")) >> g_iss = data.table(iss_head=c("a","b","c"), iss_cty=c("US","DE","JP")) >> links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a")) >> >> merge( >> + merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), >> + g_iss, >> + all.x = TRUE, by = "iss_head" >> + ) >> iss_head ctpty_head ctpty_cty iss_cty >> 1: a c JP US >> 2: b a US DE >> 3: b b DE DE >> > g_iss[g_ctpty[links]] >> iss_head iss_cty ctpty_cty iss_head.1 >> 1: a US US b >> 2: b DE DE b >> 3: c JP JP a >> > setkey(links,ctpty_head,iss_head) >> > g_iss[g_ctpty[links]] >> iss_head iss_cty ctpty_cty iss_head.1 >> 1: a US US b >> 2: b DE DE b >> 3: c JP JP a >> >> On 17 Jun 2013, at 14:36, Frank Erickson wrote: >> >> I think that key(g_ctpty[links]) == key(g_ctpty) == "ctpty_head", which >> is used when you do your second merge with [, instead of "iss_head". You >> can check this by running key(g_ctpty[links]) . --Frank >> >> >> On Mon, Jun 17, 2013 at 1:57 AM, Arunkumar Srinivasan < >> aragorn168b at gmail.com> wrote: >> >>> Since you have the data as well, why not provide it (a small part at >>> least with which your issue is reproducible)? 
Isn't it much easier than to >>> ask everyone who's willing to help to create a data and test your code? >>> >>> Arun >>> >>> On Monday, June 17, 2013 at 8:54 AM, Ivan Alves wrote: >>> >>> Dear all, >>> >>> I am not sure I understand the syntax for merging data.tables. I have >>> keyed the two 'satelite' tables from which I want to match information to >>> the main table 'links' >>> >>> g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)] >>> setkey(g_ctpty,ctpty_head) >>> g_iss <- gultimate[,list(iss_head,iss_cty)] >>> setkey(g_iss,iss_head) >>> >>> Why are the two below not equivalent? >>> >>> This works: >>> >>> data = merge( >>> merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"), >>> g_iss, >>> all.x = TRUE, by = "iss_head" >>> ), >>> >>> And this does not: >>> >>> data = g_iss[g_ctpty[links]], >>> >>> Any guidance would be appreciated. >>> Kind regards, >>> Ivan >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
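
The advice in this thread can be sketched as follows, under the assumption that a keyed join `Y[X]` matches the key of `Y` against the key of `X` (or against the leading columns of `X` when it has no key), so the intermediate result has to be re-keyed before the second join. `nomatch = 0` is the data.table argument for dropping unmatched rows, which gives the inner-join behaviour asked about earlier. The names `step1`, `outer_res`, `with_na` and `inner_res` are illustrative only.

```r
library(data.table)

g_ctpty <- data.table(ctpty_head = c("a", "b", "c", "d"),
                      ctpty_cty  = c("US", "DE", "JP", "CN"),
                      key = "ctpty_head")
g_iss   <- data.table(iss_head = c("a", "b", "c", "d"),
                      iss_cty  = c("US", "DE", "JP", "CN"),
                      key = "iss_head")
links   <- data.table(ctpty_head = c("a", "b", "c"),
                      iss_head   = c("b", "b", "a"))

# First join: look up each counterparty of `links` in g_ctpty.
setkey(links, ctpty_head)
step1 <- g_ctpty[links]

# Re-key on the column the second join should use, then look up the issuer.
setkey(step1, iss_head)
outer_res <- g_iss[step1]                 # one row per row of `links`

# links[g_ctpty] returns one row per row of g_ctpty, including the
# unmatched "d"; nomatch = 0 drops such rows (inner join).
with_na   <- links[g_ctpty]
inner_res <- links[g_ctpty, nomatch = 0]
```

`outer_res` should carry the same country columns as the nested `merge(..., all.x = TRUE)` call, though column order may differ.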
URL: From statquant at outlook.com Fri Jun 21 09:16:08 2013 From: statquant at outlook.com (statquant3) Date: Fri, 21 Jun 2013 00:16:08 -0700 (PDT) Subject: [datatable-help] About roll in data.table Message-ID: <1371798968488-4670023.post@n4.nabble.com> I wrote a post on S.O http://stackoverflow.com/questions/17216843/rolling-joins-in-data-table-with-multiple-matches/17219629?noredirect=1#comment24955451_17219629 It seems that when "roll" is specified in Y[X, roll=rollValue] even if there are several lines of Y that matches one row of X, just one is selected. I was looking for window joins like in kdb: http://code.kx.com/wiki/Reference/wj et I was thinking that because all the roll logic is already there, may be it would be a good feature to get all rows matched when roll is specified. What do you think ? -- View this message in context: http://r.789695.n4.nabble.com/About-roll-in-data-table-tp4670023.html Sent from the datatable-help mailing list archive at Nabble.com. From statquant at outlook.com Fri Jun 21 11:07:09 2013 From: statquant at outlook.com (statquant3) Date: Fri, 21 Jun 2013 02:07:09 -0700 (PDT) Subject: [datatable-help] data.table mot retaining keys Message-ID: <1371805629057-4670027.post@n4.nabble.com> R) DT=data.table(x=c(1,2),y=c(1,2),z=c(1,2),key='x,y,z') R) DT x y z 1: 1 1 1 2: 2 2 2 R) key(DT) [1] "x" "y" "z" R) key(DT[,list(x,y)]) NULL Can't find this as a feature request, should I fill ? -- View this message in context: http://r.789695.n4.nabble.com/data-table-mot-retaining-keys-tp4670027.html Sent from the datatable-help mailing list archive at Nabble.com. 
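
Whether a column subset keeps its key may depend on the data.table version. A minimal sketch that does not rely on it is to set the key again on the result, assuming the selected columns are a leading part of the original key; `sub` is a hypothetical name.

```r
library(data.table)

DT  <- data.table(x = c(1, 2), y = c(1, 2), z = c(1, 2),
                  key = c("x", "y", "z"))
sub <- DT[, list(x, y)]

# The rows are already ordered by x, y, so this only (re-)marks the key.
setkeyv(sub, c("x", "y"))
key(sub)
```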
From statquant at outlook.com Fri Jun 21 11:48:13 2013 From: statquant at outlook.com (statquant3) Date: Fri, 21 Jun 2013 02:48:13 -0700 (PDT) Subject: [datatable-help] data.table mot retaining keys In-Reply-To: <1371805629057-4670027.post@n4.nabble.com> References: <1371805629057-4670027.post@n4.nabble.com> Message-ID: <1371808093572-4670031.post@n4.nabble.com> Found it #295 Retain key after order-preserving subset -- View this message in context: http://r.789695.n4.nabble.com/data-table-mot-retaining-keys-tp4670027p4670031.html Sent from the datatable-help mailing list archive at Nabble.com. From FErickson at psu.edu Fri Jun 21 22:26:00 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 21 Jun 2013 15:26:00 -0500 Subject: [datatable-help] columns that show up when using both by-without-by and by= Message-ID: Hi, I thought that when joining with J(x) and doing by=y, all the columns involved were put into .BY. However, I see that they are not: DT <- data.table(v1=letters[1:10],v2=1:10,v3=c(TRUE,FALSE),key="v1") DT[J(letters[4:6]),1,by=v3] I think I've just forgotten how to do this correctly (so that both v1 and v3 show up in the output). Any help would be appreciated. Thanks, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Jun 21 23:01:43 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 21 Jun 2013 23:01:43 +0200 Subject: [datatable-help] =?utf-8?q?columns_that_show_up_when_using_both_b?= =?utf-8?q?y-without-by_and_by=3D?= In-Reply-To: References: Message-ID: <218D698660944745B9BB67E5BA94E720@gmail.com> you can just do: > DT[J(letters[4:6]), list(v1=v1, 1),by=v3] If you use `by` only the columns in it will be included in the output. But it seems like what you request is a nicer feature and I am for it. Because, when you just do, DT[J(letters[4:6]), sum(v1)] it gives you "v1". But when using `by`, it disappears. 
I find with your suggestion this would be more consistent. Arun On Friday, June 21, 2013 at 10:26 PM, Frank Erickson wrote: > Hi, > > I thought that when joining with J(x) and doing by=y, all the columns involved were put into .BY. However, I see that they are not: > > DT <- data.table(v1=letters[1:10],v2=1:10,v3=c(TRUE,FALSE),key="v1") > DT[J(letters[4:6]),1,by=v3] > > I think I've just forgotten how to do this correctly (so that both v1 and v3 show up in the output). Any help would be appreciated. > > Thanks, > > Frank > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Fri Jun 21 23:06:24 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 21 Jun 2013 16:06:24 -0500 Subject: [datatable-help] columns that show up when using both by-without-by and by= In-Reply-To: <218D698660944745B9BB67E5BA94E720@gmail.com> References: <218D698660944745B9BB67E5BA94E720@gmail.com> Message-ID: Ok. Thanks, Arun. I could've sworn it did that already, but seeing as it doesn't: yes, this is a feature request. :) --Frank On Fri, Jun 21, 2013 at 4:01 PM, Arunkumar Srinivasan wrote: > you can just do: > > > DT[J(letters[4:6]), list(v1=v1, 1),by=v3] > > If you use `by` only the columns in it will be included in the output. But > it seems like what you request is a nicer feature and I am for it. Because, > when you just do, DT[J(letters[4:6]), sum(v1)] it gives you "v1". But > when using `by`, it disappears. I find with your suggestion this would be > more consistent. > > > Arun > > On Friday, June 21, 2013 at 10:26 PM, Frank Erickson wrote: > > Hi, > > I thought that when joining with J(x) and doing by=y, all the columns > involved were put into .BY. 
However, I see that they are not: > > DT <- data.table(v1=letters[1:10],v2=1:10,v3=c(TRUE,FALSE),key="v1") > DT[J(letters[4:6]),1,by=v3] > > I think I've just forgotten how to do this correctly (so that both v1 and > v3 show up in the output). Any help would be appreciated. > > Thanks, > > Frank > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Jun 21 23:07:59 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 21 Jun 2013 23:07:59 +0200 Subject: [datatable-help] =?utf-8?q?columns_that_show_up_when_using_both_b?= =?utf-8?q?y-without-by_and_by=3D?= In-Reply-To: References: <218D698660944745B9BB67E5BA94E720@gmail.com> Message-ID: <175BF4D7277B4ED483F4F83A71260DCB@gmail.com> Me too, I dint at first get what the problem you were referring to, until I dint see `v1` in the output! :) Arun On Friday, June 21, 2013 at 11:06 PM, Frank Erickson wrote: > Ok. Thanks, Arun. I could've sworn it did that already, but seeing as it doesn't: yes, this is a feature request. :) --Frank > > > On Fri, Jun 21, 2013 at 4:01 PM, Arunkumar Srinivasan wrote: > > you can just do: > > > > > DT[J(letters[4:6]), list(v1=v1, 1),by=v3] > > > > If you use `by` only the columns in it will be included in the output. But it seems like what you request is a nicer feature and I am for it. Because, when you just do, DT[J(letters[4:6]), sum(v1)] it gives you "v1". But when using `by`, it disappears. I find with your suggestion this would be more consistent. 
> > > > > > Arun > > > > > > On Friday, June 21, 2013 at 10:26 PM, Frank Erickson wrote: > > > > > > > Hi, > > > > > > I thought that when joining with J(x) and doing by=y, all the columns involved were put into .BY. However, I see that they are not: > > > > > > DT <- data.table(v1=letters[1:10],v2=1:10,v3=c(TRUE,FALSE),key="v1") > > > DT[J(letters[4:6]),1,by=v3] > > > > > > I think I've just forgotten how to do this correctly (so that both v1 and v3 show up in the output). Any help would be appreciated. > > > > > > Thanks, > > > > > > Frank > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Mon Jun 24 22:51:34 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Mon, 24 Jun 2013 15:51:34 -0500 Subject: [datatable-help] About roll in data.table In-Reply-To: <1371798968488-4670023.post@n4.nabble.com> References: <1371798968488-4670023.post@n4.nabble.com> Message-ID: I want to make sure I understand you correctly. In the example data.table below, are you looking for d[J(2.1), roll = 1, mult = "all"] to return the same rows (from 'b') as d[J(2), mult = "all"]? d = data.table(a = c(1,2,2), b = c(1:3), key = 'a') I think that's an interesting feature to have, but it's not obvious that it'll work well with the rest of data.table. A few questions I have - would you want the 'a' column to be the same as currently, i.e. 
get two rows with a=2.1 in both? What about d[J(c(2.1,2.2)), roll = 1, mult = "all"] - what would that do? What would you set the param defaults to get back current behavior (which arguably is encountered much more frequently)? On Fri, Jun 21, 2013 at 2:16 AM, statquant3 wrote: > I wrote a post on S.O > > http://stackoverflow.com/questions/17216843/rolling-joins-in-data-table-with-multiple-matches/17219629?noredirect=1#comment24955451_17219629 > > It seems that when "roll" is specified in Y[X, roll=rollValue] even if > there > are several lines of Y that matches one row of X, just one is selected. > I was looking for window joins like in kdb: > http://code.kx.com/wiki/Reference/wj et I was thinking that because all > the > roll logic is already there, may be it would be a good feature to get all > rows matched when roll is specified. > > What do you think ? > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/About-roll-in-data-table-tp4670023.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From statquant at outlook.com Tue Jun 25 15:37:51 2013 From: statquant at outlook.com (statquant3) Date: Tue, 25 Jun 2013 06:37:51 -0700 (PDT) Subject: [datatable-help] About roll in data.table In-Reply-To: References: <1371798968488-4670023.post@n4.nabble.com> Message-ID: <1372167471008-4670279.post@n4.nabble.com> What I thought about is the following, whenever you use y[x,roll={-Inf,-a,a,Inf}] (say keyed by {t,u,v}) my understanding is that data.table will case 1) if there is a match on all keys do a regular join case 2) if all but the last key match select the unique row if any such that v of x is the prevailing value of y is rolled iif it is within the bounds defined by the roll argument. What I though about is there might be several rows in y that might be within the roll bounds. Because of this and because we have a mult parameter that can be {first,last,all} why not return the first/last/all the rows of y that where in the bounds ? This is what I call a window join, it is useffull if, say you want to calculate a moving average or a function over all the past elements in the last 2minutes... I wrote a SO post about it (with no answer...) http://stackoverflow.com/questions/17233973/is-it-possible-to-compute-any-window-join-in-data-table -- View this message in context: http://r.789695.n4.nabble.com/About-roll-in-data-table-tp4670023p4670279.html Sent from the datatable-help mailing list archive at Nabble.com. From alexandre.sieira at gmail.com Tue Jun 25 18:09:51 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 25 Jun 2013 13:09:51 -0300 Subject: [datatable-help] =?utf-8?q?Character_matrix_conversion_generating?= =?utf-8?q?_factors_as_columns=3F?= Message-ID: I was trying to convert a character matrix to a data.table with character columns, but got them all as factors instead. Is that to be expected? > m = matrix(rep("hello", 9), ncol=3) > m ? ? ?[,1] ? ?[,2] ? ?[,3] ?? 
[1,] "hello" "hello" "hello"
[2,] "hello" "hello" "hello"
[3,] "hello" "hello" "hello"
> library(data.table)
data.table 1.8.8  For help type: help("data.table")
> str(data.table(m))
Classes ‘data.table’ and 'data.frame':  3 obs. of  3 variables:
 $ V1: Factor w/ 1 level "hello": 1 1 1
 $ V2: Factor w/ 1 level "hello": 1 1 1
 $ V3: Factor w/ 1 level "hello": 1 1 1
 - attr(*, ".internal.selfref")=<externalptr>

I ended up doing something ugly like this to solve it:

> str(data.table(data.frame(m, stringsAsFactors=F)))
Classes ‘data.table’ and 'data.frame':  3 obs. of  3 variables:
 $ X1: chr  "hello" "hello" "hello"
 $ X2: chr  "hello" "hello" "hello"
 $ X3: chr  "hello" "hello" "hello"
 - attr(*, ".internal.selfref")=<externalptr>

I couldn't find any equivalent to 'stringsAsFactors' in the data.table documentation. Is there a better way to do this?

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From statquant at outlook.com Wed Jun 26 09:19:54 2013
From: statquant at outlook.com (statquant3)
Date: Wed, 26 Jun 2013 00:19:54 -0700 (PDT)
Subject: [datatable-help] Character matrix conversion generating factors as columns?
In-Reply-To:
References:
Message-ID: <1372231194894-4670339.post@n4.nabble.com>

Maybe what you should do is set options("stringsAsFactors"=FALSE); then you are sure you won't have factors anymore.

BTW: I read Mat saying that factors for strings are of no use since R 2.12, because of some hash maintained by R internally. Is that true?

--
View this message in context: http://r.789695.n4.nabble.com/Character-matrix-conversion-generating-factors-as-columns-tp4670299p4670339.html
Sent from the datatable-help mailing list archive at Nabble.com.
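
The conversion discussed in this thread can also be written without the detour through data.frame. What follows is only a minimal sketch, assuming a recent data.table release in which as.data.table() accepts a matrix and := accepts a parenthesised character vector of column names; neither assumption was checked against 1.8.8. The names m, dt and cols are illustrative. The explicit as.character() pass is there so that the result does not depend on what the installed version does by default.

```r
library(data.table)

# Character matrix from the original post
m <- matrix(rep("hello", 9), ncol = 3)

# Convert the matrix; the columns are named V1, V2, V3
dt <- as.data.table(m)

# Coerce every column to character by reference, in case the installed
# version produced factors (as data.table(m) did under 1.8.8 in this thread)
cols <- names(dt)
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]

str(dt)
```

Calling as.character() on a factor returns its labels, so the coercion is harmless when the columns are already character.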
From eduard.antonyan at gmail.com Wed Jun 26 14:35:21 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 26 Jun 2013 07:35:21 -0500 Subject: [datatable-help] About roll in data.table In-Reply-To: <1372167471008-4670279.post@n4.nabble.com> References: <1371798968488-4670023.post@n4.nabble.com> <1372167471008-4670279.post@n4.nabble.com> Message-ID: Ok, so as far as I can tell, I understood what you want correctly. What are your answers to my questions about this then? What I thought about is the following, whenever you use y[x,roll={-Inf,-a,a,Inf}] (say keyed by {t,u,v}) my understanding is that data.table will case 1) if there is a match on all keys do a regular join case 2) if all but the last key match select the unique row if any such that v of x is the prevailing value of y is rolled iif it is within the bounds defined by the roll argument. What I though about is there might be several rows in y that might be within the roll bounds. Because of this and because we have a mult parameter that can be {first,last,all} why not return the first/last/all the rows of y that where in the bounds ? This is what I call a window join, it is useffull if, say you want to calculate a moving average or a function over all the past elements in the last 2minutes... I wrote a SO post about it (with no answer...) http://stackoverflow.com/questions/17233973/is-it-possible-to-compute-any-window-join-in-data-table -- View this message in context: http://r.789695.n4.nabble.com/About-roll-in-data-table-tp4670023p4670279.html Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
From FErickson at psu.edu Thu Jun 27 14:29:09 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Thu, 27 Jun 2013 08:29:09 -0400
Subject: [datatable-help] data.table for R 3.0.1
Message-ID:

Hi,

So my network upgraded to R 3.0.1 last night, it seems; and R now says that data.table is outdated:

package ‘data,table’ is not available (for R version 3.0.1)

Does anyone know of a way of forcing R to load the older package? Or know of a newer version of data.table that I can install through another command?

Best,

Frank
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From lianoglou.steve at gene.com Thu Jun 27 15:30:36 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Thu, 27 Jun 2013 06:30:36 -0700
Subject: [datatable-help] data.table for R 3.0.1
In-Reply-To:
References:
Message-ID:

Hi

Does installing like so do the trick for you?

install.packages("data.table", type="source")

?

-Steve

On Thursday, June 27, 2013, Frank Erickson wrote:

> Hi,
>
> So my network upgraded to R 3.0.1 last night, it seems; and R now says
> that data.table is outdated:
>
> package ‘data,table’ is not available (for R version 3.0.1)
>
> Does anyone know of a way of forcing R to load the older package? Or know
> of a newer version of data.table that I can install through another command?
>
> Best,
>
> Frank
>

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From FErickson at psu.edu Thu Jun 27 17:35:04 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Thu, 27 Jun 2013 11:35:04 -0400
Subject: [datatable-help] data.table for R 3.0.1
In-Reply-To:
References:
Message-ID:

Hi Steve,

Thanks for the suggestion.
It doesn't work verbatim; I see ERROR: compilation failed for package 'data.table' and later a warning: 1: running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l "U:\R" C:\Users\FPERIC~1\AppData\Local\Temp\3\RtmpuWYzPl/downloaded_packages/data.table_1.8.8.tar.gz' had status 1 --Frank On Thu, Jun 27, 2013 at 9:30 AM, Steve Lianoglou wrote: > Hi > > Does installing like so dither trick for you? > > Install.packages("data.table", type="source") > > ? > > -Steve > > > On Thursday, June 27, 2013, Frank Erickson wrote: > >> Hi, >> >> So my network upgraded to R 3.0.1 last night, it seems; and R now says >> that data.table is outdated: >> >> package ?data,table? is not available (for R version 3.0.1) >> >> Does anyone know of a way of forcing R to load the older package? Or know >> of a newer version of data.table that I can install through another command? >> >> Best, >> >> Frank >> > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Thu Jun 27 17:46:56 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Thu, 27 Jun 2013 08:46:56 -0700 Subject: [datatable-help] data.table for R 3.0.1 In-Reply-To: References: Message-ID: Hi Frank, On Thu, Jun 27, 2013 at 8:35 AM, Frank Erickson wrote: > Hi Steve, > > Thanks for the suggestion. 
It doesn't work verbatim; I see > > ERROR: compilation failed for package 'data.table' > > and later a warning: > > 1: running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l "U:\R" > C:\Users\FPERIC~1\AppData\Local\Temp\3\RtmpuWYzPl/downloaded_packages/data.table_1.8.8.tar.gz' > had status 1 Hmm -- sorry, this is rather strange and I'm also not well versed in smoking out windows compilation problems. Can you provide more of the installation log so we can see what the error is, exactly? I say it's strange because it seems as if CRAN actually has the compiled version of the latest version of the package for windows there: http://cran.r-project.org/web/packages/data.table/index.html In particular: http://cran.r-project.org/bin/windows/contrib/r-release/data.table_1.8.8.zip Perhaps you might try changing the CRAN mirror you are using to see if you can grab it? Or just download from the link above and install the compiled version locally? As for me -- I try to keep my version of data.table tied to SVN and periodically svn up and recompile ... I've had no compilation problems as of late (I just recompiled now to double check), so I'm not sure what's going on on your setup. Sorry that I can't be of much help here, perhaps some other windows users can chime in. -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From FErickson at psu.edu Thu Jun 27 17:55:08 2013 From: FErickson at psu.edu (Frank Erickson) Date: Thu, 27 Jun 2013 11:55:08 -0400 Subject: [datatable-help] data.table for R 3.0.1 In-Reply-To: References: Message-ID: Hi Steve, Problem solved! If I go back for more information enough times, I'm bound to start a new R session sooner or later, and that seemed to do the trick. I'm not tech-savvy enough to use SVN. Instead I try to avoid updating software by using as little of it as possible (for R packages: just stringr and data.table). 
Thanks again, Frank On Thu, Jun 27, 2013 at 11:46 AM, Steve Lianoglou wrote: > Hi Frank, > > On Thu, Jun 27, 2013 at 8:35 AM, Frank Erickson wrote: > > Hi Steve, > > > > Thanks for the suggestion. It doesn't work verbatim; I see > > > > ERROR: compilation failed for package 'data.table' > > > > and later a warning: > > > > 1: running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l > "U:\R" > > > C:\Users\FPERIC~1\AppData\Local\Temp\3\RtmpuWYzPl/downloaded_packages/data.table_1.8.8.tar.gz' > > had status 1 > > Hmm -- sorry, this is rather strange and I'm also not well versed in > smoking out windows compilation problems. > > Can you provide more of the installation log so we can see what the > error is, exactly? > > I say it's strange because it seems as if CRAN actually has the > compiled version of the latest version of the package for windows > there: > > http://cran.r-project.org/web/packages/data.table/index.html > > In particular: > > > http://cran.r-project.org/bin/windows/contrib/r-release/data.table_1.8.8.zip > > Perhaps you might try changing the CRAN mirror you are using to see if > you can grab it? Or just download from the link above and install the > compiled version locally? > > As for me -- I try to keep my version of data.table tied to SVN and > periodically svn up and recompile ... I've had no compilation problems > as of late (I just recompiled now to double check), so I'm not sure > what's going on on your setup. > > Sorry that I can't be of much help here, perhaps some other windows > users can chime in. > > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Thu Jun 27 18:06:21 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 27 Jun 2013 17:06:21 +0100 Subject: [datatable-help] data.table for R 3.0.1 In-Reply-To: References: Message-ID: <14d1e8336d83e02c8a2435d6d46ab2b0@imap.plus.net> Not sure on that error either, there should be an output log somewhere. But another option is the .zip on the data.table homepage : "Last recommended dev snapshot precompiled for Windows: v1.8.9 rev874 13 May 2013 [5]" http://datatable.r-forge.r-project.org/data.table_1.8.9.zip It was compiled using winbuilder but I don't recall whether winbuilder was R3 at that point. Matthew On 27.06.2013 16:55, Frank Erickson wrote: > Hi Steve, > Problem solved! If I go back for more information enough times, I'm bound to start a new R session sooner or later, and that seemed to do the trick. > I'm not tech-savvy enough to use SVN. Instead I try to avoid updating software by using as little of it as possible (for R packages: just stringr and data.table). > Thanks again, > Frank > > On Thu, Jun 27, 2013 at 11:46 AM, Steve Lianoglou wrote: > >> Hi Frank, >> >> On Thu, Jun 27, 2013 at 8:35 AM, Frank Erickson wrote: >> > Hi Steve, >> > >> > Thanks for the suggestion. It doesn't work verbatim; I see >> > >> > ERROR: compilation failed for package 'data.table' >> > >> > and later a warning: >> > >> > 1: running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l "U:R" >> > C:UsersFPERIC~1AppDataLocalTemp3RtmpuWYzPl/downloaded_packages/data.table_1.8.8.tar.gz' >> > had status 1 >> >> Hmm -- sorry, this is rather strange and I'm also not well versed in >> smoking out windows compilation problems. >> >> Can you provide more of the installation log so we can see what the >> error is, exactly? 
>> >> I say it's strange because it seems as if CRAN actually has the >> compiled version of the latest version of the package for windows >> there: >> >> http://cran.r-project.org/web/packages/data.table/index.html [2] >> >> In particular: >> >> http://cran.r-project.org/bin/windows/contrib/r-release/data.table_1.8.8.zip [3] >> >> Perhaps you might try changing the CRAN mirror you are using to see if >> you can grab it? Or just download from the link above and install the >> compiled version locally? >> >> As for me -- I try to keep my version of data.table tied to SVN and >> periodically svn up and recompile ... I've had no compilation problems >> as of late (I just recompiled now to double check), so I'm not sure >> what's going on on your setup. >> >> Sorry that I can't be of much help here, perhaps some other windows >> users can chime in. >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech Links: ------ [1] mailto:FErickson at psu.edu [2] http://cran.r-project.org/web/packages/data.table/index.html [3] http://cran.r-project.org/bin/windows/contrib/r-release/data.table_1.8.8.zip [4] mailto:lianoglou.steve at gene.com [5] http://datatable.r-forge.r-project.org/data.table_1.8.9.zip -------------- next part -------------- An HTML attachment was scrubbed... URL: From harishv_99 at yahoo.com Sun Jun 30 10:21:36 2013 From: harishv_99 at yahoo.com (Harish) Date: Sun, 30 Jun 2013 01:21:36 -0700 (PDT) Subject: [datatable-help] fread -- multiple header lines and multiple whitespace characters Message-ID: <1372580496.84142.YahooMailNeo@web120202.mail.ne1.yahoo.com> Hi, I am wondering whether it is possible to read a file using fread() with: 1) Multiple header lines, and 2) Multiple whitespace characters separating fields The sample of the input file is as follows: ------------- Garbage header information that I need to skip when reading... Number of lines here are variable. ???????????? 
Serial_Number   PHIv     Lu/W
                    (-)      (lm)     (lm/W)
           ABCDEFG  27.0264 103.58
           HIJKLMNO  33.9143  91.03
Some footer information that spans multiple lines
-------------

To handle the multiple lines of headers, I would have to read the file using fread() first, reprocess the file using a similar algorithm to identify the actual header -- i.e. one line above what fread() would identify as the header, then throw away the names of the columns fread() created and rename them to the actual ones I find. However, this seems to be highly inefficient since I would replicate what fread() did within R -- not to mention I do not quite know how to do that.

As far as handling the multiple (and variable) spaces for separator, I do not see fread() being able to handle this either. read.table() however does with the default sep="" value. Of course, that does not handle the garbage headers and footers that fread() so beautifully avoids with its autostart algorithm.

Any suggestions as to how I would do this easily? I have lots of these files to read, and doing manual editing is not desirable. If there is a hack I can do with fread(), that would be ideal.

Thanks a lot for your help.

Regards,
Harish
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From ggrothendieck at gmail.com Sun Jun 30 16:04:57 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 30 Jun 2013 10:04:57 -0400
Subject: [datatable-help] What is the point of SJ?
Message-ID:

Consider SJ, which I assume was intended to be used like this:

X[ SJ(Y) ]

where X and Y are two data tables.

What is the point of SJ? It seems similar to J except it also adds a key to its argument; however, is it not the case that the key on Y will not be used since it has to do a full scan of Y anyways?

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
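
For readers following along, the behaviour the question refers to can be inspected directly. This is only a minimal sketch, assuming a data.table release in which SJ() is exported and keeps argument names; the column names id and val and all values are made up for illustration. It shows that SJ() returns a keyed, sorted data.table; it does not settle whether that key is actually used during the join.

```r
library(data.table)

# SJ() builds a data.table from its arguments and sets a key on all columns
lookup <- SJ(id = c(3L, 1L, 2L))

key(lookup)   # the key columns set by SJ()
lookup$id     # values are stored in key (sorted) order

# Typical use inside a join: X is keyed, and i is the SJ() result
X <- data.table(id = c(1L, 2L, 3L), val = c("a", "b", "c"), key = "id")
res <- X[SJ(c(3L, 1L))]
res
```

Because the SJ() result is sorted, the joined rows come back in key order rather than in the order the values were typed.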