From statquant at outlook.com Wed May 1 01:10:23 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 30 Apr 2013 16:10:23 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com>
Message-ID: <1367363423208-4665873.post@n4.nabble.com>

Hi, I read the 30 posts and I have to confess that I still do not understand the point of the changes...
Could anyone kindly write an example of the current behaviour and what the new option will bring to the table?
Sorry...

--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4665873.html
Sent from the datatable-help mailing list archive at Nabble.com.

From saporta at scarletmail.rutgers.edu Wed May 1 01:18:39 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Tue, 30 Apr 2013 19:18:39 -0400
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <1367363423208-4665873.post@n4.nabble.com>
References: <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <1367363423208-4665873.post@n4.nabble.com>
Message-ID:

Eddi,

Perhaps you could summarize succinctly, now after a good bit of discussion, what your proposed change is.

-Rick

On Tue, Apr 30, 2013 at 7:10 PM, statquant3 wrote:
> Hi, I read the 30 posts and I have to confess that I still do not understand
> the point of the changes...
> Could anyone kindly write an example of the current behaviour and what the
> new option will bring to the table?
> Sorry...
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4665873.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From p.harding at paniscus.com Wed May 1 11:28:52 2013
From: p.harding at paniscus.com (Paul Harding)
Date: Wed, 1 May 2013 10:28:52 +0100
Subject: [datatable-help] fread on very large file
In-Reply-To: <6215268129090c5164b66264010bea9b@imap.plus.net>
References: <6215268129090c5164b66264010bea9b@imap.plus.net>
Message-ID:

Here is the verbose output:

> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 9186293
Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002200 (+middle 5 rows)
Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
  Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0

But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte):
$ wc spd_all_fixed.csv
168997637 168997638 9078155125 spd_all_fixed.csv

[So fread 9M, wc 168M rows].

Regards
Paul

On 30 April 2013 18:52, Matthew Dowle wrote:
> Hi,
>
> Thanks for reporting this. Please set verbose=TRUE and let us know the
> output.
>
> Thanks, Matthew
>
> On 30.04.2013 18:01, Paul Harding wrote:
>
> Problem with fread on a large file
> The file is 8GB, just short of 200,000,000 lines, produced as SQL output and
> modified by cygwin/perl to remove the second line.
> Using data.table 1.8.8 on R3.0.0 I get an fread error
> fread("data/spd_all_fixed.csv",sep=",")
> Error in fread("data/spd_all_fixed.csv", sep = ",") :
> Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
> 204038,2617097,20110803,0,0
> Looking for the offending line, with line numbers in output so I'm guessing
> this is line 6 of the mid-file chunk examined,
> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
> and comparing to surrounding lines and the first ten lines
> $ head spd_all_fixed.csv
> s_key,i_key,p_key,q,pq,d,l,epi,class
> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
> I can't see any difference. I wonder if this is a bug? I have no problems
> on a small test data set run through an identical process and using the
> same fread command.
> Regards
> Paul
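Independently of wc, the newline count can be cross-checked from R by streaming the file in binary chunks, which avoids loading it whole and any 32-bit limits in a reader; a minimal sketch (count_newlines is a hypothetical helper, not a data.table function, and the chunk size is arbitrary):

count_newlines <- function(path, chunk = 50e6) {
    con <- file(path, open = "rb")
    on.exit(close(con))
    n <- 0
    repeat {
        bytes <- readBin(con, what = "raw", n = chunk)  # read up to 'chunk' bytes
        if (length(bytes) == 0L) break                  # end of file reached
        n <- n + sum(bytes == as.raw(10L))              # 0x0A is '\n'; CRLF files have one per line
    }
    n
}
count_newlines("data/spd_all_fixed.csv")  # should agree with wc: 168997637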
From eduard.antonyan at gmail.com Wed May 1 17:43:21 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 1 May 2013 10:43:21 -0500
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <1367363423208-4665873.post@n4.nabble.com>
Message-ID:

Sure, here's a recap. The most succinct way of putting it is: the meaning of d[i, j, by = b] is very complicated and unintuitive right now because of hidden by's in some cases, and that statement can be made much more readable by making by-without-by's explicit. The longer version follows.

First let's go over what is done currently, in particular what exactly is by-without-by. The following example, adapted from Matthew's examples, illustrates current behavior:

> X = data.table(a = c(1,1,2,2,3,3), b = c(1:6), key = "a")
> Y = data.table(a = c(1,2,1), key = "a")
> X[Y]
   a b
1: 1 1
2: 1 2
3: 1 1
4: 1 2
5: 2 3
6: 2 4
> X[Y, sum(b)]
   a V1
1: 1  3
2: 1  3
3: 2  7

What's happening here is that the action j=sum(b) is performed for each row of Y (or rather each 'a') as if that was a 'by' by the rows of Y. Had Y had unique 'a' values only, this would've been equivalent to doing a 'by' by 'a' after the merge, but there is a difference when Y$a has duplicates. This is interesting behavior that can be used in a variety of situations (it also has an interesting leveraging point - if Y$a *is* unique and you'd like to do 'by=a' after the merge, it's more computationally advantageous to do the 'by' *during* the merge and not after), however it interferes with the naturally established action for d[i, j], where for other i's this would simply do action 'j', without doing an extra hidden 'by'.

The proposal is thus to do the above special 'by' only when explicitly asked to - e.g. by adding a new boolean 'each.i = TRUE', the default value for which would be FALSE. This will make the syntax much more readable and user-friendly, would eliminate a few FAQ points, and would also allow a new kind of action that afaik is actually not possible with the current syntax.

Here are some correspondences - left is new syntax and right is old syntax:

Take 'dt' and apply 'i' (where 'i' is anything, including a join):
dt[i] <-> dt[i]

Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others

Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
dt[i, j, each.i = TRUE] <-> dt[i, j]

Take 'dt' and apply 'i', return j over *both* the cross-apply/by-without-by (for 'i' being a join only) and another specified 'by', think of this as doing by=list(b, rows of Y):
dt[i, j, by = b, each.i = TRUE] <-> afaik there is no direct correspondence in current behavior

On Tuesday, April 30, 2013, Ricardo Saporta wrote:
> Eddi,
>
> Perhaps you could summarize succinctly, now after a good bit of
> discussion, what your proposed change is.
>
> -Rick
>
> On Tue, Apr 30, 2013 at 7:10 PM, statquant3 wrote:
>> Hi, I read the 30 posts and I have to confess that I still do not
>> understand
>> the point of the changes...
>> Could anyone kindly write an example of the current behaviour and what the
>> new option will bring to the table?
>> Sorry...
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4665873.html
>> Sent from the datatable-help mailing list archive at Nabble.com.
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
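To make statquant's request concrete: in current syntax (the proposed each.i argument does not exist yet, so only today's behaviour can actually be run), the two results being discussed differ like this, reusing the X and Y from the recap above:

library(data.table)
X <- data.table(a = c(1,1,2,2,3,3), b = 1:6, key = "a")
Y <- data.table(a = c(1,2,1), key = "a")

X[Y, sum(b)]     # current by-without-by: j runs once per (joined) row of Y
#    a V1
# 1: 1  3
# 2: 1  3
# 3: 2  7

X[Y][, sum(b)]   # join first, then evaluate j once over the whole result
# [1] 13          # (1+2) + (1+2) + (3+4)

Under the proposal, plain X[Y, sum(b)] would return the second result (13), and the first, grouped result would require asking for it explicitly with each.i = TRUE.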
From p.harding at paniscus.com Wed May 1 18:10:50 2013
From: p.harding at paniscus.com (Paul Harding)
Date: Wed, 1 May 2013 17:10:50 +0100
Subject: [datatable-help] fread on very large file
In-Reply-To:
References: <6215268129090c5164b66264010bea9b@imap.plus.net>
Message-ID:

Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends.

$ nl spd_all_fixed.csv | head -n 9186300 |tail
9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0

9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it!

I've experimented by truncating the file. The error varies: either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. The problem arises when the file reaches 4GB, in this case between 80,300,000 and 80,400,000 rows:

-rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
-rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv

> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 80300000
Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Type codes: 000002000 (+last 5 rows)
0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) Sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
171.188s ( 65%) Reading data
1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
-1365231.809s (-518439%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.000s Total

> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 18913
Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
  Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,

Regards,
Paul

On 1 May 2013 10:28, Paul Harding wrote:
> Here is the verbose output:
>
> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 9186293
> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
> data rows
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002200 (+middle 5 rows)
> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
> Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
> 204038,2617097,20110803,0,0
>
> But here is the wc output (via cygwin; newline, word (whitespace delim so
> each word one 'line' here), byte):
> $ wc spd_all_fixed.csv
> 168997637 168997638 9078155125 spd_all_fixed.csv
>
> [So fread 9M, wc 168M rows].
>
> Regards
> Paul
>
> On 30 April 2013 18:52, Matthew Dowle wrote:
>> Hi,
>>
>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>> output.
>>
>> Thanks, Matthew
>>
>> On 30.04.2013 18:01, Paul Harding wrote:
>>
>> Problem with fread on a large file
>> The file is 8GB, just short of 200,000,000 lines, produced as SQL output and
>> modified by cygwin/perl to remove the second line.
>> Using data.table 1.8.8 on R3.0.0 I get an fread error
>> fread("data/spd_all_fixed.csv",sep=",")
>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
>> 204038,2617097,20110803,0,0
>> Looking for the offending line, with line numbers in output so I'm
>> guessing this is line 6 of the mid-file chunk examined,
>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>> and comparing to surrounding lines and the first ten lines
>> $ head spd_all_fixed.csv
>> s_key,i_key,p_key,q,pq,d,l,epi,class
>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>> I can't see any difference. I wonder if this is a bug? I have no problems
>> on a small test data set run through an identical process and using the
>> same fread command.
>> Regards
>> Paul
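The counts above are consistent with the file size being truncated to 32 bits somewhere: 2^32 bytes is exactly 4GB, which is where the truncated-file experiments flip from working to failing, and the 18913 rows seen in the 4.1G file correspond to roughly the first 1MB past that 4GB mark. The arithmetic for the full file, as a sketch (the 32-bit overflow reading is an inference from these numbers, not a confirmed diagnosis):

sz <- file.info("data/spd_all_fixed.csv")$size  # 9078155125 bytes, matching wc
sz > 2^32                                       # TRUE: does not fit in an unsigned 32-bit integer
sz %% 2^32                                      # 488220533 bytes remain after wrapping around 2^32 twice
(sz %% 2^32) / (sz / 168997637)                 # ~9.1 million rows, close to the 9186293 fread reported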
From Ken.Williams at windlogics.com Wed May 1 22:59:08 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Wed, 1 May 2013 20:59:08 +0000
Subject: [datatable-help] Import problem with data.table in packages
Message-ID:

Hi,

I've got a small test package constructed like so:

------------ R/MyCode.R: ---------------
##' Example function.
##'
##' @export
##' @import data.table
myfunc <- function() {
  dt1 <- data.table(time=1:5, key='time')
  dt2 <- data.table(time=3:8, key='time')
  dat <- merge(dt1, dt2, all=TRUE)
}

------------ DESCRIPTION: ---------------
Package: TestMod
Type: Package
Title: My test package
Version: 1.0
Author: Ken Williams
Maintainer: Ken Williams
Description: A test package
License: BSD
Imports:
    data.table
Collate:
    'MyCode.R'
-----------------------------------------------

I process the package with ROxygen in RStudio, which produces an empty `inst/` directory, some docs in `man/`, and a `NAMESPACE` file:

------------ NAMESPACE: ---------------
export(myfunc)
import(data.table)
-----------------------------------------------

Now, if I start a fresh R session and load this package, I get a namespace error:

------------ R 2.15.2 session: ---------------
> library(TestMod)
> myfunc
function ()
{
    dt1 <- data.table(time = 1:5, key = "time")
    dt2 <- data.table(time = 3:8, key = "time")
    dat <- merge(dt1, dt2, all = TRUE)
}

> myfunc()
Error in rbind(deparse.level, ...) :
  could not find function ".rbind.data.table"
-----------------------------------------------

Sometimes, in other (more complicated) code, I instead get the error 'could not find function "data.table"'.

To my eyes, the imports look correct, so I can't see what the problem is:

------------ R 2.15.2 session: ---------------
> getNamespaceImports('TestMod')$data.table
         %between%              %chin%              %like%    .__C__data.table
       "%between%"            "%chin%"            "%like%"  ".__C__data.table"
       .__C__IDate         .__C__ITime        .__T__$:base      .__T__$<-:base
     ".__C__IDate"       ".__C__ITime"      ".__T__$:base"    ".__T__$<-:base"
      .__T__[:base   .rbind.data.table                  :=           alloc.col
    ".__T__[:base" ".rbind.data.table"                ":="         "alloc.col"
    as.chron.IDate      as.chron.ITime       as.data.table            as.IDate
  "as.chron.IDate"    "as.chron.ITime"     "as.data.table"          "as.IDate"
          as.ITime             between             chgroup             chmatch
        "as.ITime"           "between"           "chgroup"           "chmatch"
           chorder                  CJ                copy          data.table
         "chorder"                "CJ"              "copy"        "data.table"
             fread              haskey                hour           IDateTime
           "fread"            "haskey"              "hour"         "IDateTime"
     is.data.table                 key               key<-                last
   "is.data.table"               "key"             "key<-"              "last"
              like                mday               month             quarter
            "like"              "mday"             "month"           "quarter"
         rbindlist                 set             setattr         setcolorder
       "rbindlist"               "set"           "setattr"       "setcolorder"
            setkey             setkeyv            setnames                  SJ
          "setkey"           "setkeyv"          "setnames"                "SJ"
            tables     test.data.table           timetaken          truelength
          "tables"   "test.data.table"         "timetaken"        "truelength"
              wday                week                yday                year
            "wday"              "week"              "yday"              "year"
-----------------------------------------------

Any suggestions?

I think for now, I can work around the problem by doing 'Depends: data.table' in my `DESCRIPTION` file. I'd like to not do that though.

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com

________________________________

CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.
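For reference, the 'Depends: data.table' workaround Ken mentions would look like the following; a sketch only, changing one field of the DESCRIPTION above. The reason it can help is that Depends attaches data.table to the search path when TestMod is loaded, whereas Imports only loads the data.table namespace without attaching it.

------------ DESCRIPTION (workaround): ---------------
Package: TestMod
Type: Package
Title: My test package
Version: 1.0
Author: Ken Williams
Maintainer: Ken Williams
Description: A test package
License: BSD
Depends:
    data.table
Collate:
    'MyCode.R'
-----------------------------------------------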
From mdowle at mdowle.plus.com Wed May 1 23:12:35 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Wed, 01 May 2013 22:12:35 +0100
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To:
References:
Message-ID: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>

Hi,

This rings a bell actually. data.table uses .onLoad currently but it should be using .onAttach, I seem to recall.

http://r.789695.n4.nabble.com/Error-in-a-package-that-imports-data-table-tp4660173p4660637.html

I had a hunt around but couldn't find if we decided data.table should move from .onLoad to .onAttach. Does anyone know/remember?

Thanks, Matthew

On 01.05.2013 21:59, Ken Williams wrote:
> Hi,
>
> I've got a small test package constructed like so:
>
> ------------ R/MyCode.R: ---------------
> ##' Example function.
> ##'
> ##' @export
> ##' @import data.table
> myfunc <- function() {
> dt1 <- data.table(time=1:5, key='time')
> dt2 <- data.table(time=3:8, key='time')
> dat <- merge(dt1, dt2, all=TRUE)
> }
>
> ------------ DESCRIPTION: ---------------
> Package: TestMod
> Type: Package
> Title: My test package
> Version: 1.0
> Author: Ken Williams
> Maintainer: Ken Williams
> Description: A test package
> License: BSD
> Imports:
> data.table
> Collate:
> 'MyCode.R'
> -----------------------------------------------
>
> I process the package with ROxygen in RStudio, which produces an
> empty `inst/` directory, some docs in `man/`, and a `NAMESPACE` file:
>
> ------------ NAMESPACE: ---------------
> export(myfunc)
> import(data.table)
> -----------------------------------------------
>
> Now, if I start a fresh R session and load this package, I get a
> namespace error:
>
> ------------ R 2.15.2 session: ---------------
>> library(TestMod)
>> myfunc
> function ()
> {
> dt1 <- data.table(time = 1:5, key = "time")
> dt2 <- data.table(time = 3:8, key = "time")
> dat <- merge(dt1, dt2, all = TRUE)
> }
>
>> myfunc()
> Error in rbind(deparse.level, ...) :
> could not find function ".rbind.data.table"
> -----------------------------------------------
>
> Sometimes, in other (more complicated) code, I instead get the error
> 'could not find function "data.table"'.
> > To my eyes, the imports look correct, so I can't see what the problem > is: > > ------------ R 2.15.2 session: --------------- >> getNamespaceImports('TestMod')$data.table > %between% %chin% %like% > .__C__data.table > "%between%" "%chin%" "%like%" > ".__C__data.table" > .__C__IDate .__C__ITime .__T__$:base > .__T__$<-:base > ".__C__IDate" ".__C__ITime" ".__T__$:base" > ".__T__$<-:base" > .__T__[:base .rbind.data.table := > alloc.col > ".__T__[:base" ".rbind.data.table" ":=" > "alloc.col" > as.chron.IDate as.chron.ITime as.data.table > as.IDate > "as.chron.IDate" "as.chron.ITime" "as.data.table" > "as.IDate" > as.ITime between chgroup > chmatch > "as.ITime" "between" "chgroup" > "chmatch" > chorder CJ copy > data.table > "chorder" "CJ" "copy" > "data.table" > fread haskey hour > IDateTime > "fread" "haskey" "hour" > "IDateTime" > is.data.table key key<- > last > "is.data.table" "key" "key<-" > "last" > like mday month > quarter > "like" "mday" "month" > "quarter" > rbindlist set setattr > setcolorder > "rbindlist" "set" "setattr" > "setcolorder" > setkey setkeyv setnames > SJ > "setkey" "setkeyv" "setnames" > "SJ" > tables test.data.table timetaken > truelength > "tables" "test.data.table" "timetaken" > "truelength" > wday week yday > year > "wday" "week" "yday" > "year" > ----------------------------------------------- > > Any suggestions? > > I think for now, I can work around the problem by doing 'Depends: > data.table' in my `DESCRIPTION` file. I'd like to not do that > though. > > -- > Ken Williams, Senior Research Scientist > WindLogics > http://windlogics.com > > > ________________________________ > > CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of > the intended recipient(s) and may contain confidential and privileged > information. Any unauthorized review, use, disclosure or distribution > of any kind is strictly prohibited. If you are not the intended > recipient, please contact the sender via reply e-mail and destroy all > copies of the original message. Thank you. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From aragorn168b at gmail.com Thu May 2 00:16:15 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 2 May 2013 00:16:15 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> Message-ID: Eduard, Great. That explains me the difference between `drop` and `.join` here. Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage. However, there's one point *I think* would still disagree with @eddi here, not sure. DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) DT2 <- data.table(x=c(1,2,1)) setkey(DT1, "x") # proposed way and the result: DT1[DT2, sum(y), .join = FALSE] [1] 21 So far nice. 
However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): x V1 1: 1 6 2: 2 9 3: 1 6 Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). Arun On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > Arun, > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > Hope this answers your questions. > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. 
> > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > Arun > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > Arun, > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > (The earlier message was too long and was rejected.) > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > setkey(DT1, "x") > > > > DT2 <- data.table(x=1) > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > setkey(DT1, "x") > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > # what's the output supposed to be for? > > > > DT1[DT2, y, .JOIN=FALSE] > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > Best, > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu May 2 00:20:32 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 2 May 2013 00:20:32 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> Message-ID: <48F69748BB834619B353A12C6D9962A7@gmail.com> Sorry the proposed result was a wrong paste in the last message: # proposed way and the result: DT1[DT2, sum(y), .join = FALSE] [1] 6 9 6 And the last part that it *should* be a data.table is quite obvious then. Arun On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > Eduard, > > Great. That explains me the difference between `drop` and `.join` here. > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage. 
> > However, there's one point *I think* would still disagree with @eddi here, not sure. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 21 > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > Arun > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > Arun, > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > Hope this answers your questions. > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. 
> > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > Arun > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > Arun, > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > setkey(DT1, "x") > > > > > DT2 <- data.table(x=1) > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > setkey(DT1, "x") > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > # what's the output supposed to be for? > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > Best, > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From aragorn168b at gmail.com Thu May 2 00:23:37 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:23:37 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <48F69748BB834619B353A12C6D9962A7@gmail.com>
References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com>
Message-ID:

eddi,

sorry again, I am confused a bit now.

DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]`? c(6,9,6) or 21?

Arun

On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> Sorry the proposed result was a wrong paste in the last message:
>
> # proposed way and the result:
> DT1[DT2, sum(y), .join = FALSE]
> [1] 6 9 6
>
> And the last part that it *should* be a data.table is quite obvious then.
>
> Arun
>
> On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > Eduard,
> >
> > Great. That explains me the difference between `drop` and `.join` here.
> > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.
> >
> > However, there's one point *I think* would still disagree with @eddi here, not sure.
> >
> > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > DT2 <- data.table(x=c(1,2,1))
> > setkey(DT1, "x")
> >
> > # proposed way and the result:
> > DT1[DT2, sum(y), .join = FALSE]
> > [1] 21
> >
> > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join):
> >
> > x V1
> > 1: 1 6
> > 2: 2 9
> > 3: 1 6
> >
> > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted).
> >
> > Arun
> >
> > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote:
> > > Arun,
> > >
> > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently.
> > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining.
> > >
> > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case.
> > >
> > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a single column. It applies to any j.
The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > Hope this answers your questions. > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > Arun > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > Arun, > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. 
Suppose, > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > setkey(DT1, "x") > > > > > > DT2 <- data.table(x=1) > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > setkey(DT1, "x") > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > # what's the output supposed to be for? > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > Best, > > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu May 2 00:28:46 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 1 May 2013 17:28:46 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com> Message-ID: Arun, from my previous email: "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b': dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join): dt[i, j, each.i = TRUE] <-> dt[i, j]" Together with the default being each.i=FALSE, you can see that the answer to your question will be: DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e. [1] 21 and DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e. x V1 1: 1 6 2: 2 9 3: 1 6 On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote: > eddi, > > sorry again, I am confused a bit now. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, > .join = FALSE]` ? c(6,9,6) or 21? > > > Arun > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote: > > Sorry the proposed result was a wrong paste in the last message: > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 6 9 6 > > And the last part that it *should* be a data.table is quite obvious then. 
> > Arun > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > > Eduard, > > Great. That explains me the difference between `drop` and `.join` here. > Even though I don't *need* this feature (I can't recall the last time when > I use a `data.table` for `i` and had to reduce the function, say, sum). > But, I think it can only better the usage. > > However, there's one point *I think* would still disagree with @eddi here, > not sure. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 21 > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` > *should* result in a `data.table` output as follows (it's even more clearer > now that .join is set to TRUE, meaning it's a data.table join): > > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > Basically, `.join = TRUE` is the current functionality unchanged and nice > to be default (as Matthew hinted). > > Arun > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > Arun, > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does > currently. > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is > literally a 'by' by each of the rows of DT2 that are in the join (thus > each.i! - the operation 'y' will be performed for each of the rows of 'i' > and then combined and returned). There is no efficiency issue here that I > can see, but Matthew can correct me on this. As far as I understand the > efficiency comes into play when e.g. the rows of 'i' are unique, and after > the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = > key(DT1)] would be less efficient since the 'by' could've already been done > while joining. > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future > DT1[DT2] - in this expression there is no by-without-by happening in either > case. > > The purpose of this is NOT for j just being a column or an expression that > gets evaluated into a signal column. It applies to any j. The extra > 'by-without-by' column is currently output independently of how many > columns you output in your j-expression, the behavior is very similar as to > when you specify a by=., except that the 'by' happens by a very special > expression, that only exists when joining two data-tables and that > generally doesn't exist before or after the join. > > Hope this answers your questions. > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Eduard, thanks for your reply. But somethings are unclear to me still. > I'll try to explain them below. > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general > (that it is applicable to *every* i operation, which as of now seems > untrue). .JOIN is specific to data.table type for `i`. > > From what I understand from your reply, if (.JOIN = FALSE), then, > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > Is this right? It's a bit confusing because I think you're okay with > "by-without-by" and I got the impression from Sadao that he finds the > syntax of "by-without-by" unaccessible/advanced for basic users. So, just > to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the > "by-without-by" and then result in a "vector", right? > > Matthew explains in the current documentation that DT1[DT2][, y] would > "join" all columns of DT1 and DT2 and then subset. 
I assume the > implementation underneath is *not* DT1[DT2][, y] rather the result is an > efficient equivalence. Then, that of course seems alright to me. > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` > doesn't make sense/has no purpose to me. At least I can't think of any at > the moment. > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as > DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results > in getting evaluated as a scalar for every group in the current > by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. > Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` > instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of > DT1[i, list(x,y)]. > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's > the purpose of `drop` then (and also how it *doesn't* suit here as compared > to .JOIN). > > Arun > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > Arun, > > If the new boolean is false, the result would be the same as without it > and would be equal to current behavior of d[i][, j]. If it's true, it will > only have an effect if i is a join (I think each.i= fits slightly better > for this description than .join=) - this will replicate current underlying > behavior. If you think the cross-apply is something that could work not > just for i being a data-table but other things as well, then it would make > perfect sense to implement that action too when the bool is true. > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan > wrote: > > (The earlier message was too long and was rejected.) > So, from the discussion so far, I see that Matthew is nice enough to > implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > setkey(DT1, "x") > DT2 <- data.table(x=1) > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I > expect here the same output as current DT1[DT2, y] > > The above syntax seems "okay". But my first question is what is > `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > setkey(DT1, "x") > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > # what's the output supposed to be for? > DT1[DT2, y, .JOIN=FALSE] > DT1[DT2, .JOIN = FALSE] > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how > does it work with `subset`? > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > Is this supposed to also do a "cross-apply" on the logical subset? I > guess not. So, .JOIN is an "extra" parameter that comes into play *only* > when `i` is a `data.table`? > > I'd love to have some replies to these questions for me to take a stance > on `.JOIN`. Thank you. > > Best, > Arun. > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From aragorn168b at gmail.com Thu May 2 00:36:32 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:36:32 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com>
Message-ID: <2BC388F2195044B09475B49AC4AB3809@gmail.com>

In retrospect, `.join` is also confusing/untrue (as the data.table join is still being done). I find `cross.apply` clearer.

Arun

On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:
> Eduard,
>
> Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.
>
> Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.
>
> Thanks for the patient clarifications.
>
> Arun
>
> On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
> > Arun, from my previous email:
> >
> > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
> > dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others
> >
> > Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
> > dt[i, j, each.i = TRUE] <-> dt[i, j]"
> >
> > Together with the default being each.i=FALSE, you can see that the answer to your question will be:
> >
> > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
> > [1] 21
> >
> > and
> > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
> > x V1
> > 1: 1 6
> > 2: 2 9
> > 3: 1 6
> >
> > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:
> > > eddi,
> > >
> > > sorry again, I am confused a bit now.
> > >
> > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > > DT2 <- data.table(x=c(1,2,1))
> > > setkey(DT1, "x")
> > >
> > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]`? c(6,9,6) or 21?
> > >
> > > Arun
> > >
> > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> > > > Sorry the proposed result was a wrong paste in the last message:
> > > >
> > > > # proposed way and the result:
> > > > DT1[DT2, sum(y), .join = FALSE]
> > > > [1] 6 9 6
> > > >
> > > > And the last part that it *should* be a data.table is quite obvious then.
> > > >
> > > > Arun
> > > >
> > > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > > > > Eduard,
> > > > >
> > > > > Great. That explains me the difference between `drop` and `.join` here.
> > > > > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.
> > > > >
> > > > > However, there's one point *I think* would still disagree with @eddi here, not sure.
> > > > > > > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > > > > > DT2 <- data.table(x=c(1,2,1)) > > > > > setkey(DT1, "x") > > > > > > > > > > # proposed way and the result: > > > > > DT1[DT2, sum(y), .join = FALSE] > > > > > [1] 21 > > > > > > > > > > > > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > > > > > > > > > x V1 > > > > > 1: 1 6 > > > > > 2: 2 9 > > > > > 3: 1 6 > > > > > > > > > > > > > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > > > > > > > > > Arun > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > > > > > > > > > Arun, > > > > > > > > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > > > > > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > > > > > > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > > > > > > > Hope this answers your questions. > > > > > > > > > > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. 
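To make the two readings above concrete, here is a minimal sketch (assuming data.table 1.8.x semantics as discussed in this thread; exact printing may differ by version):

library(data.table)
DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
setkey(DT1, "x")
DT2 <- data.table(x=1)

DT1[DT2, y]    # by-without-by: j is evaluated once per row of DT2, join column kept
DT1[DT2][, y]  # join first, then extract: a plain vector, c(1L, 2L)

The first form returns a data.table (the join column plus the result of j per group); the second is the two-step chain the thread keeps comparing it against.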
> > > > > > > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > setkey(DT1, "x") > > > > > > > > > DT2 <- data.table(x=1) > > > > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > setkey(DT1, "x") > > > > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > > > > # what's the output supposed to be for? > > > > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Arun. 
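The questions just above can be probed today without any new argument, since the special per-row behaviour only triggers when i is a data.table. A minimal sketch (J() is data.table's existing shorthand for building an i table; nothing here is proposed syntax):

library(data.table)
DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
setkey(DT1, "x")

DT1[x %in% c(1,2), y]  # logical subset in i: an ordinary vector, no grouping
DT1[J(c(1,2)), y]      # i is a data.table: by-without-by runs j per row of i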
> > > > > 
> > > > > 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com  Thu May  2 00:33:35 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:33:35 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: 
References: <1366401278742-4664770.post@n4.nabble.com>
	<1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>
	<1366643879137-4664990.post@n4.nabble.com>
	<-8694790273355420813@unknownmsgid>
	<5AD5B1D231A045329D46159FB5297739@gmail.com>
	<48F69748BB834619B353A12C6D9962A7@gmail.com>
Message-ID: 

Eduard,

Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.

Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.

Thanks for the patient clarifications.

Arun

On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
> Arun, from my previous email:
> 
> "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
> dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others
> 
> Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
> dt[i, j, each.i = TRUE] <-> dt[i, j]"
> 
> Together with the default being each.i=FALSE, you can see that the answer to your question will be:
> 
> DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
> [1] 21
> 
> and
> DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
> x V1
> 1: 1 6
> 2: 2 9
> 3: 1 6
> 
> On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:
> > eddi,
> > 
> > sorry again, I am confused a bit now.
> > 
> > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > DT2 <- data.table(x=c(1,2,1))
> > setkey(DT1, "x")
> > 
> > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]` ? c(6,9,6) or 21?
> > 
> > Arun
> > 
> > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> > 
> > > Sorry the proposed result was a wrong paste in the last message:
> > > 
> > > # proposed way and the result:
> > > DT1[DT2, sum(y), .join = FALSE]
> > > [1] 6 9 6
> > > 
> > > And the last part that it *should* be a data.table is quite obvious then.
> > > 
> > > Arun
> > > 
> > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > > 
> > > > Eduard,
> > > > 
> > > > Great. That explains me the difference between `drop` and `.join` here.
> > > > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.
> > > > 
> > > > However, there's one point *I think* would still disagree with @eddi here, not sure.
> > > > > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > > > > DT2 <- data.table(x=c(1,2,1)) > > > > setkey(DT1, "x") > > > > > > > > # proposed way and the result: > > > > DT1[DT2, sum(y), .join = FALSE] > > > > [1] 21 > > > > > > > > > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > > > > > > > x V1 > > > > 1: 1 6 > > > > 2: 2 9 > > > > 3: 1 6 > > > > > > > > > > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > > > > > > > Arun > > > > > > > > > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > > > > > > > Arun, > > > > > > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > > > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > > > > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > > > > > Hope this answers your questions. > > > > > > > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. 
> > > > > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > setkey(DT1, "x") > > > > > > > > DT2 <- data.table(x=1) > > > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > setkey(DT1, "x") > > > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > > > # what's the output supposed to be for? > > > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > > > > > Best, > > > > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eduard.antonyan at gmail.com Thu May 2 00:47:39 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 1 May 2013 17:47:39 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <2BC388F2195044B09475B49AC4AB3809@gmail.com> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com> <2BC388F2195044B09475B49AC4AB3809@gmail.com> Message-ID: yeah, I think cross.apply is pretty clear as well, at least when an extra 'by' is not there, but I like each.i when there is a 'by'. Either way this is a pretty small consideration for me and I'd be perfectly happy with either. On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan wrote: > In retrospect, `.join` is also confusing/untrue (as the data.table join > is still being done). I find `cross.apply` clearer. > > Arun > > On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote: > > Eduard, > > Yes, that clears it up. If `.join` if FALSE, then there's no > `by-without-by`, basically. `drop` really serves another purpose. > > Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of > the intended purposes of this post to begin with) to mean to apply to *any* > `i` operation. Unless this is true, I'd like to stick to `.join` as it's > what we are setting to FALSE/TRUE here. > > Thanks for the patient clarifications. > > Arun > > On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote: > > Arun, from my previous email: > > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b': > dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by > = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a > join in some cases but not others > > Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by > (will do cross-apply only when 'i' is a join): > dt[i, j, each.i = TRUE] <-> dt[i, j]" > > Together with the default being each.i=FALSE, you can see that the answer > to your question will be: > > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, > allow.cartesian=TRUE][, sum(y)], i.e. > [1] 21 > > and > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, > sum(y), allow.cartesian=TRUE], i.e. > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > > > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > eddi, > > sorry again, I am confused a bit now. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, > .join = FALSE]` ? c(6,9,6) or 21? > > > Arun > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote: > > Sorry the proposed result was a wrong paste in the last message: > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 6 9 6 > > And the last part that it *should* be a data.table is quite obvious then. > > Arun > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > > Eduard, > > Great. That explains me the difference between `drop` and `.join` here. > Even though I don't *need* this feature (I can't recall the last time when > I use a `data.table` for `i` and had to reduce the function, say, sum). > But, I think it can only better the usage. 
> > However, there's one point *I think* would still disagree with @eddi here, > not sure. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 21 > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` > *should* result in a `data.table` output as follows (it's even more clearer > now that .join is set to TRUE, meaning it's a data.table join): > > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > Basically, `.join = TRUE` is the current functionality unchanged and nice > to be default (as Matthew hinted). > > Arun > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > Arun, > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does > currently. > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is > literally a 'by' by each of the rows of DT2 that are in the join (thus > each.i! - the operation 'y' will be performed for each of the rows of 'i' > and then combined and returned). There is no efficiency issue here that I > can see, but Matthew can correct me on this. As far as I understand the > efficiency comes into play when e.g. the rows of 'i' are unique, and after > the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = > key(DT1)] would be less efficient since the 'by' could've already been done > while joining. > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future > DT1[DT2] - in this expression there is no by-without-by happening in either > case. > > The purpose of this is NOT for j just being a column or an expression that > gets evaluated into a signal column. It applies to any j. The extra > 'by-without-by' column is currently output independently of how many > columns you output in your j-expression, the behavior is very similar as to > when you specify a by=., except that the 'by' happens by a very special > expression, that only exists when joining two data-tables and that > generally doesn't exist before or after the join. > > Hope this answers your questions. > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Eduard, thanks for your reply. But somethings are unclear to me still. > I'll try to explain them below. > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general > (that it is applicable to *every* i operation, which as of now seems > untrue). .JOIN is specific to data.table type for `i`. > > From what I understand from your reply, if (.JOIN = FALSE), then, > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > Is this right? It's a bit confusing because I think you're okay with > "by-without-by" and I got the impression from Sadao that he finds the > syntax of "by-without-by" unaccessible/advanced for basic users. So, just > to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the > "by-without-by" and then result in a "vector", right? > > Matthew explains in the current documentation that DT1[DT2][, y] would > "join" all columns of DT1 and DT2 and then subset. I assume the > implementation underneath is *not* DT1[DT2][, y] rather the result is an > efficient equivalence. Then, that of course seems alright to me. > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` > doesn't make sense/has no purpose to me. At least I can't think of any at > the moment. 
> 
> To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)].
> 
> If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN).
> 
> Arun
> 
> On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote:
> 
> Arun,
> 
> If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true.
> 
> On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote:
> 
> (The earlier message was too long and was rejected.)
> So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose,
> 
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=1)
> DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y]
> 
> The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose,
> 
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=c(1,2,1), w=c(11:13))
> # what's the output supposed to be for?
> DT1[DT2, y, .JOIN=FALSE]
> DT1[DT2, .JOIN = FALSE]
> 
> Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`?
> 
> DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored?
> 
> Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`?
> 
> I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you.
> 
> Best,
> Arun.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com  Thu May  2 00:47:37 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:47:37 +0200
Subject: [datatable-help] sorting on floating point column
In-Reply-To: 
References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net>
	<8DC39800AD714C4AA03FDB84ED57BADD@gmail.com>
	<2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net>
Message-ID: <3C10DA3381F64716A180976F14686FF5@gmail.com>

Matthew,

So what's the resolution here? Is it okay to sort in the "proper" order on the key column but use *machine tolerance* for subsetting on the key column?

Arun

On Tuesday, April 30, 2013 at 4:26 PM, Arunkumar Srinivasan wrote:
> Matthew,
> 
> Precisely. That's what I was thinking as well. But was hesitant to tell as I didn't know how complex it would be to implement / change it.
> Since the join requires tolerance, sorting could still be done in the "right" order (by disregarding tolerance during the sort).
> 
> Arun
> 
> On Tuesday, April 30, 2013 at 4:22 PM, Matthew Dowle wrote:
> > 
> > Maybe it doesn't actually need to sort within machine tolerance. If it was precise, the sort would be faster, that's for sure. But at the time, I remember thinking that it should preserve the order of rows within a group of values within machine tolerance (e.g. 3.99999999, 4.00000001, 3.99999999 should be considered 4.0 and the order of those 3 rows maintained). But maybe sorting them to 3.99999999, 3.99999999, 4.00000001 is ok as it's just the join that should be within machine tolerance?
> > Interested in how fast order(y) is, though. Compared to data.table sorting of doubles.
> > Matthew
> > 
> > On 30.04.2013 15:16, Arunkumar Srinivasan wrote:
> > > Matthew,
> > > I see. I didn't think about tolerance. Although
> > > dt[with(dt, order(y)), ]
> > > seems to do the task right (similar to data.frame). I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer.
> > > Arun
> > > 
> > > On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote:
> > > > 
> > > > Hi,
> > > > data.table sorts doubles within machine tolerance :
> > > > > sqrt(.Machine$double.eps)
> > > > [1] 1.490116e-08
> > > > 
> > > > i.e. numbers closer than this are considered equal.
> > > > Otherwise we wouldn't be able to do things like DT[.(3.14)].
> > > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one?
> > > > In the examples section of one of the help pages it has an example which generates a series of numbers very close together using pi. Note that your numbers are both close together, and, very close to 0.
> > > > Matthew
> > > > 
> > > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote:
> > > > > Hi there,
> > > > > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for the words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though.
> > > > > So, here's a reproducible example. I'd be glad to file a bug, if it is one, and be corrected if it's something I am doing wrong.
> > > > > set.seed(45)
> > > > > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7)
> > > > > head(dt)
> > > > > x y
> > > > > 1: 32 5.395395e-08
> > > > > 2: 16 6.956957e-08
> > > > > 3: 12 2.142142e-08
> > > > > 4: 18 5.855856e-08
> > > > > 5: 17 6.216216e-08
> > > > > 6: 14 5.025025e-08
> > > > > setkey(dt, "y") # sort by column y
> > > > > head(dt, 10)
> > > > > x y
> > > > > 1: 47 1.401401e-09
> > > > > 2: 12 2.142142e-08
> > > > > 3: 24 1.391391e-08
> > > > > 4: 43 9.809810e-09 <~~~ obviously false
> > > > > 5: 1 2.932933e-08
> > > > > 6: 48 2.562563e-08
> > > > > 7: 49 1.891892e-08
> > > > > 8: 40 2.182182e-08
> > > > > 9: 9 7.307307e-09 <~~~ obviously false
> > > > > 10: 45 2.482482e-08
> > > > > 
> > > > > Best,
> > > > > Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
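To see the tolerance discussed in the sorting thread above, a minimal sketch built from the thread's own example (data.table 1.8.x; which rows look out of order depends on the random draw):

library(data.table)
set.seed(45)
dt <- data.table(x=sample(50),
                 y=sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7)

sqrt(.Machine$double.eps)          # ~1.49e-08: doubles closer than this sort as equal
setkey(dt, "y")                    # keyed sort: ties within tolerance keep row order
head(dt, 10)                       # can look unsorted at the 1e-9 scale
head(dt[with(dt, order(y)), ], 10) # base R order(): exact comparison, strictly sorted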
From aragorn168b at gmail.com  Thu May  2 01:18:44 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 01:18:44 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: 
References: <1366401278742-4664770.post@n4.nabble.com>
	<1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>
	<1366643879137-4664990.post@n4.nabble.com>
	<-8694790273355420813@unknownmsgid>
	<5AD5B1D231A045329D46159FB5297739@gmail.com>
	<48F69748BB834619B353A12C6D9962A7@gmail.com>
	<2BC388F2195044B09475B49AC4AB3809@gmail.com>
Message-ID: 

Eduard,

What do you mean here: `at least when by is not there`. The "cross.apply" or ".join" or "each.i" was supposedly an option when the "i" argument is a `data.table`, right? I can't find a reason why there would be a `by` there (I mean an explicit by). Do you mean the implicit by when it's true? If not, could you elaborate (maybe with an example)?

Arun

On Thursday, May 2, 2013 at 12:47 AM, Eduard Antonyan wrote:
> yeah, I think cross.apply is pretty clear as well, at least when an extra 'by' is not there, but I like each.i when there is a 'by'. Either way this is a pretty small consideration for me and I'd be perfectly happy with either.
> 
> On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan wrote:
> > In retrospect, `.join` is also confusing/untrue (as the data.table join is still being done). I find `cross.apply` clearer.
> > 
> > Arun
> > 
> > On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:
> > 
> > > Eduard,
> > > 
> > > Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.
> > > 
> > > Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.
> > > 
> > > Thanks for the patient clarifications.
> > > 
> > > Arun
> > > 
> > > On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
> > > > Arun, from my previous email:
> > > > 
> > > > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
> > > > dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others
> > > > 
> > > > Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
> > > > dt[i, j, each.i = TRUE] <-> dt[i, j]"
> > > > 
> > > > Together with the default being each.i=FALSE, you can see that the answer to your question will be:
> > > > 
> > > > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
> > > > [1] 21
> > > > 
> > > > and
> > > > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
> > > > x V1
> > > > 1: 1 6
> > > > 2: 2 9
> > > > 3: 1 6
> > > > 
> > > > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:
> > > > > eddi,
> > > > > 
> > > > > sorry again, I am confused a bit now.
> > > > > 
> > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > > > > DT2 <- data.table(x=c(1,2,1))
> > > > > setkey(DT1, "x")
> > > > > 
> > > > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]` ? c(6,9,6) or 21?
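Both candidate answers to the question above are expressible with today's syntax, which may be the clearest way to see what `.join = FALSE` would change. A quick sketch (`.join` itself is only proposed in this thread and does not exist; allow.cartesian is carried over from the thread's examples):

library(data.table)
DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

DT1[DT2, sum(y)]                          # by-without-by: one sum per row of DT2 -> 6, 9, 6
DT1[DT2, allow.cartesian=TRUE][, sum(y)]  # join all matching rows first, then one total -> 21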
> > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > Sorry the proposed result was a wrong paste in the last message: > > > > > > > > > > > > # proposed way and the result: > > > > > > DT1[DT2, sum(y), .join = FALSE] > > > > > > [1] 6 9 6 > > > > > > > > > > > > > > > > > > And the last part that it *should* be a data.table is quite obvious then. > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > Eduard, > > > > > > > > > > > > > > Great. That explains me the difference between `drop` and `.join` here. > > > > > > > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage. > > > > > > > > > > > > > > However, there's one point *I think* would still disagree with @eddi here, not sure. > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > > > > > > > DT2 <- data.table(x=c(1,2,1)) > > > > > > > setkey(DT1, "x") > > > > > > > > > > > > > > # proposed way and the result: > > > > > > > DT1[DT2, sum(y), .join = FALSE] > > > > > > > [1] 21 > > > > > > > > > > > > > > > > > > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > > > > > > > > > > > > > x V1 > > > > > > > 1: 1 6 > > > > > > > 2: 2 9 > > > > > > > 3: 1 6 > > > > > > > > > > > > > > > > > > > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > > > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > > > > > > > > > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > > > > > > > > > > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > > > > > > > > > > > Hope this answers your questions. 
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > > > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > > > > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > > > > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > > > > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > > > > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > > > > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > > > > > > > > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > > > > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > > > > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. 
Suppose, > > > > > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > > > setkey(DT1, "x") > > > > > > > > > > > DT2 <- data.table(x=1) > > > > > > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > > > setkey(DT1, "x") > > > > > > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > > > > > > # what's the output supposed to be for? > > > > > > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu May 2 01:27:58 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 1 May 2013 18:27:58 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com> <2BC388F2195044B09475B49AC4AB3809@gmail.com> Message-ID: <733367056511996651@unknownmsgid> I mean I find it a little easier to read when joining with each.i=TRUE *and* there is by=b - this is an extra operation that I don't believe has an analog in current syntax (but I haven't thought about this too much). On May 1, 2013, at 6:18 PM, Arunkumar Srinivasan wrote: Eduard, What do you mean here: `at least when by is not there`. The "cross.apply" or ".join" or "each.i" was supposedly an option when "i" argument is a `data.table`, right? I can find a reason why there would be a `by` there? (I mean an explicit by). Do you mean the implicit by when it's true? if not, could you elaborate (maybe with an example)? Arun On Thursday, May 2, 2013 at 12:47 AM, Eduard Antonyan wrote: yeah, I think cross.apply is pretty clear as well, at least when an extra 'by' is not there, but I like each.i when there is a 'by'. Either way this is a pretty small consideration for me and I'd be perfectly happy with either. 
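For the each.i-plus-by combination mentioned in this message, a sketch. Note that each.i is only proposed syntax in this thread, so the line using it is left as a comment, with the nearest current spelling of the each.i=FALSE case below it (the g column is made up here for illustration):

library(data.table)
DT1 <- data.table(x=c(1,1,1,2,2), y=1:5, g=c(1,2,1,2,1))
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

# proposed, hypothetical: DT1[DT2, sum(y), each.i=TRUE, by=g]  (by g within each row of i)
DT1[DT2, allow.cartesian=TRUE][, sum(y), by=g]  # current: join, then group the result by g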
On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan wrote:

In retrospect, `.join` is also confusing/untrue (as the data.table join is still being done). I find `cross.apply` clearer.

Arun

On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:

Eduard,

Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.

Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.

Thanks for the patient clarifications.

Arun

On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:

Arun, from my previous email:

"Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others

Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
dt[i, j, each.i = TRUE] <-> dt[i, j]"

Together with the default being each.i=FALSE, you can see that the answer to your question will be:

DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
[1] 21

and
DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
x V1
1: 1 6
2: 2 9
3: 1 6

On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:

eddi,

sorry again, I am confused a bit now.

DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]` ? c(6,9,6) or 21?

Arun

On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:

Sorry the proposed result was a wrong paste in the last message:

# proposed way and the result:
DT1[DT2, sum(y), .join = FALSE]
[1] 6 9 6

And the last part that it *should* be a data.table is quite obvious then.

Arun

On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:

Eduard,

Great. That explains me the difference between `drop` and `.join` here. Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.

However, there's one point *I think* would still disagree with @eddi here, not sure.

DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

# proposed way and the result:
DT1[DT2, sum(y), .join = FALSE]
[1] 21

So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join):

x V1
1: 1 6
2: 2 9
3: 1 6

Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted).

Arun

On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote:

Arun,

Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently.
No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this.
As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining.

DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case.

The purpose of this is NOT for j just being a column or an expression that gets evaluated into a single column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join.

Hope this answers your questions.

On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote:

Eduard, thanks for your reply. But some things are unclear to me still. I'll try to explain them below.

First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`.

From what I understand from your reply, if (.JOIN = FALSE), then,

DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y]

Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right?

Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me.

If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment.

To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)].

If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN).

Arun

On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote:

Arun,

If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true.

On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote:

(The earlier message was too long and was rejected.)
So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=1) DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=c(1,2,1), w=c(11:13)) # what's the output supposed to be for? DT1[DT2, y, .JOIN=FALSE] DT1[DT2, .JOIN = FALSE] Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. Best, Arun. -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.harding at paniscus.com Thu May 2 11:19:22 2013 From: p.harding at paniscus.com (Paul Harding) Date: Thu, 2 May 2013 10:19:22 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. $ nl spd_all_fixed.csv | head -n 9186300 |tail 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) Detected eol as \r\n (CRLF) in that order, the Windows standard. Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found Found 9 columns First row with 9 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. 
Count of eol after first data row: 80300000
Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Type codes: 000002000 (+last 5 rows)
0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) Sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
171.188s ( 65%) Reading data
1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
-1365231.809s (-518439%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.000s Total
> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 18913
Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,

Regards,
Paul

On 1 May 2013 10:28, Paul Harding wrote:

> Here is the verbose output:
> 
> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 9186293
> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002200 (+middle 5 rows)
> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
> 
> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte):
> $ wc spd_all_fixed.csv
> 168997637 168997638 9078155125 spd_all_fixed.csv
> 
> [So fread 9M, wc 168M rows].
> 
> Regards
> Paul
> 
> On 30 April 2013 18:52, Matthew Dowle wrote:
> 
>> Hi,
>> 
>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>> 
>> Thanks, Matthew
>> 
>> On 30.04.2013 18:01, Paul Harding wrote:
>> 
>> Problem with fread on a large file
>> The file is 8GB, just short of 200,000,000 lines, produced as SQL output and
>> modified by cygwin/perl to remove the second line.
>> Using data.table 1.8.8 on R3.0.0 I get an fread error >> fread("data/spd_all_fixed.csv",sep=",") >> Error in fread("data/spd_all_fixed.csv", sep = ",") : >> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: >> 204038,2617097,20110803,0,0 >> Looking for the offending line,with line numbers in output so I'm >> guessing this is line 6 of the mid-file chunk examined, >> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >> and comparing to surrounding lines and the first ten lines >> $ head spd_all_fixed.csv >> s_key,i_key,p_key,q,pq,d,l,epi,class >> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >> I can't see any difference. I wonder if this is a bug? I have no problems >> on a small test data set run through an identical process and using the >> same fread command. >> Regards >> Paul >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri May 3 11:51:47 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 03 May 2013 10:51:47 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: Hi Paul, Thanks for all this! > The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: Ahah. Are you using a 32bit or 64bit Windows machine? Thanks, Matthew On 02.05.2013 10:19, Paul Harding wrote: > Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. > > $ nl spd_all_fixed.csv | head -n 9186300 |tail > 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 > 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 > 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 > 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 > 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 > 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 > 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 > 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 > 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 > 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 > 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! > I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. 
> The problem arises when the file reaches 4GB, in this case between
> 8,030,000 and 8,040,000 rows:
>
> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02
> spd_all_trunc_8030k.csv
> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06
> spd_all_trunc_8040k.csv
>
>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 80300000
> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999
> data rows
>
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002000 (+middle 5 rows)
> Type codes: 000002000 (+last 5 rows)
> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
> '0.42634430000000001'
> Bumping column 7 from INT64 to REAL on data row 9, field contains
> '0.42634430000000001'
> 0.000s ( 0%) Memory map (rerun may be quicker)
> 0.000s ( 0%) Sep and header detection
> 0.000s ( 0%) Count rows (wc -l)
> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
> 171.188s ( 65%) Reading data
> 1365231.809s (518439%) Allocation for type bumps (if any), including gc
> time if triggered
> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
> 0.000s ( 0%) Changing na.strings to NA
> 0.000s Total
>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 18913
> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data
> rows
>
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002000 (+middle 5 rows)
> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
> Expected sep (',') but ',' ends field 2 on line 6 when detecting types:
> 204650,724540,
> Regards,
> Paul
>
> On 1 May 2013 10:28, Paul Harding wrote:
>
>> Here is the verbose output:
>>
>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>> first 30) ... found
>> Found 9 columns
>> First row with 9 fields occurs on line 1 (either column names or first
>> row of data)
>> All the fields on line 1 are character fields. Treating as the column
>> names.
>> Count of eol after first data row: 9186293
>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
>> data rows
>> Type codes: 000002000 (first 5 rows)
>> Type codes: 000002200 (+middle 5 rows)
>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>
>> Expected sep (',') but '0' ends field 5 on line 6 when detecting
>> types: 204038,2617097,20110803,0,0
>> But here is the wc output (via cygwin; newline, word (whitespace delim
>> so each word one 'line' here), byte)@
>>
>> $ wc spd_all_fixed.csv
>> 168997637 168997638 9078155125 spd_all_fixed.csv
>> [So fread 9M, wc 168M rows].
>> Regards
>> Paul
>>
>>
>> On 30 April 2013 18:52, Matthew Dowle wrote:
>>
>>> Hi,
>>>
>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>> output.
>>>
>>> Thanks, Matthew
>>>
>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>
>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line.
>>>>
>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error
>>>>
>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined,
>>>>
>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>> and comparing to surrounding lines and the first ten lines
>>>>
>>>> $ head spd_all_fixed.csv
>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command.
>>>> Regards
>>>> Paul

Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ggrothendieck at gmail.com  Fri May  3 13:18:59 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 07:18:59 -0400
Subject: [datatable-help] merge/join/match
Message-ID: 

I am moving this discussion which started with mdowle to the list.

Consider this example slightly modified from the data.table FAQ:

> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> out <- X[Y]; out
   x foo bar
1: b   3   4
2: b   4   4
3: b   5   4
4: c   6   2
5: c   7   2
6: d  NA   3

Note that the first column of the output is labelled x even though the
data to produce it comes from y, e.g. "d" in out$x is not in X$x but
does appear in Y$y so clearly the data is coming from y as opposed to
x . In terms of SQL the above would be written:

select Y.y as x, ...

and the need to rename the first column of out suggests that there
may be a deeper problem here.

Here are some ideas to address this (they would require changes to
data.table):

- the default of X[Y,, match=NA] would be changed to a default of
X[Y,,match=0] so that it corresponds to the defaults in R's merge and
in SQL joins.

- the column name of the first column in the example above would be
changed to y if match=0 but be left at x if match=NA.
In the case that match=0 (the proposed new default) x and y are equal so the first column can be validly labelled as x but in the case that match=NA they are not so y would be used as the column name. - the name match= does seem a bit misleading since R's match only matches one item in the target whereas in data.table match matches many if mult="all" and that is the default. Perhaps some thought should be given to a name change here? The above would seem to correspond more closely to R's merge and SQL join defaults. Any use cases or other comments? -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From p.harding at paniscus.com Fri May 3 15:32:16 2013 From: p.harding at paniscus.com (Paul Harding) Date: Fri, 3 May 2013 14:32:16 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: Definitely a 64-bit machine. Here are the details: Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) Installed memory (RAM): 128GB System type: 64-bit Operating System Windows edition: Server 2008 R2 Enterprise SP1 Regards, Paul On 3 May 2013 10:51, Matthew Dowle wrote: > ** > > > > Hi Paul, > > Thanks for all this! > > > The problem arises when the file reaches 4GB, in this case between > 8,030,000 and 8,040,000 rows: > > Ahah. Are you using a 32bit or 64bit Windows machine? > > Thanks, Matthew > > > > On 02.05.2013 10:19, Paul Harding wrote: > > Some supplementary information, here is the portion of the file (with row > numbers, +1 for header) around where fread thinks the file ends. > $ nl spd_all_fixed.csv | head -n 9186300 |tail > 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 > 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 > 9186293 > 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 > 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 > 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 > 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 > 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 > 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 > 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 > 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 > 9186294 (row 9186293 excl header) is where fread thinks the file ends, > mid-line by the look of it! > I've experimented by truncating the file. The error varies, either it > reads too few records or gives the error I reported, presumably determined > by whether the last perceived line is entire. > The problem arises when the file reaches 4GB, in this case between > 8,030,000 and 8,040,000 rows: > -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 > spd_all_trunc_8030k.csv > -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 > spd_all_trunc_8040k.csv > > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) > Detected eol as \r\n (CRLF) in that order, the Windows standard. > Looking for supplied sep ',' on line 30 (the last non blank line in the > first 30) ... found > Found 9 columns > First row with 9 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 80300000 > Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 > data rows > Type codes: 000002000 (first 5 rows) > Type codes: 000002000 (+middle 5 rows) > Type codes: 000002000 (+last 5 rows) > 0%Bumping column 7 from INT to INT64 on data row 9, field contains > '0.42634430000000001' > Bumping column 7 from INT64 to REAL on data row 9, field contains > '0.42634430000000001' > 0.000s ( 0%) Memory map (rerun may be quicker) > 0.000s ( 0%) Sep and header detection > 0.000s ( 0%) Count rows (wc -l) > 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) > 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM > 171.188s ( 65%) Reading data > 1365231.809s (518439%) Allocation for type bumps (if any), including gc > time if triggered > -1365231.809s (-518439%) Coercing data already read in type bumps (if any) > 0.000s ( 0%) Changing na.strings to NA > 0.000s Total > > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) > Detected eol as \r\n (CRLF) in that order, the Windows standard. > Looking for supplied sep ',' on line 30 (the last non blank line in the > first 30) ... found > Found 9 columns > First row with 9 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. > Count of eol after first data row: 18913 > Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data > rows > Type codes: 000002000 (first 5 rows) > Type codes: 000002000 (+middle 5 rows) > Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : > Expected sep (',') but ',' ends field 2 on line 6 when detecting types: > 204650,724540, > Regards, > Paul > > > On 1 May 2013 10:28, Paul Harding wrote: > >> Here is the verbose output: >> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >> Detected eol as \r\n (CRLF) in that order, the Windows standard. >> Looking for supplied sep ',' on line 30 (the last non blank line in the >> first 30) ... found >> Found 9 columns >> First row with 9 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 9186293 >> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 >> data rows >> Type codes: 000002000 (first 5 rows) >> Type codes: 000002200 (+middle 5 rows) >> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >> Expected sep (',') but '0' ends field 5 on line 6 when detecting >> types: 204038,2617097,20110803,0,0 >> But here is the wc output (via cygwin; newline, word (whitespace delim >> so each word one 'line' here), byte)@ >> $ wc spd_all_fixed.csv >> 168997637 168997638 9078155125 spd_all_fixed.csv >> [So fread 9M, wc 168M rows]. >> Regards >> Paul >> >> >> On 30 April 2013 18:52, Matthew Dowle wrote: >> >>> >>> >>> Hi, >>> >>> Thanks for reporting this. Please set verbose=TRUE and let us know the >>> output. >>> >>> Thanks, Matthew >>> >>> >>> >>> On 30.04.2013 18:01, Paul Harding wrote: >>> >>> Problem with fread on a large file >>> The file is 8GB, just short of 200,000 lines, produced as SQLoutput and >>> modified by cygwin/perl to remove the second line. 
>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>> fread("data/spd_all_fixed.csv",sep=",") >>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>> types: 204038,2617097,20110803,0,0 >>> Looking for the offending line,with line numbers in output so I'm >>> guessing this is line 6 of the mid-file chunk examined, >>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>> and comparing to surrounding lines and the first ten lines >>> $ head spd_all_fixed.csv >>> s_key,i_key,p_key,q,pq,d,l,epi,class >>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>> I can't see any difference. I wonder if this is a bug? I have no >>> problems on a small test data set run through an identical process and >>> using the same fread command. >>> Regards >>> Paul >>> >>> >>> >>> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri May 3 15:59:16 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 03 May 2013 14:59:16 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc. Please could you file it as a bug on the tracker. Thanks. Matthew On 03.05.2013 14:32, Paul Harding wrote: > Definitely a 64-bit machine. Here are the details: > > Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) > Installed memory (RAM): 128GB > System type: 64-bit Operating System > Windows edition: Server 2008 R2 Enterprise SP1 > Regards, > Paul > > On 3 May 2013 10:51, Matthew Dowle wrote: > >> Hi Paul, >> >> Thanks for all this! >> >>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >> >> Ahah. Are you using a 32bit or 64bit Windows machine? >> >> Thanks, Matthew >> >> On 02.05.2013 10:19, Paul Harding wrote: >> >>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. 
>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>> mid-line by the look of it!
>>> I've experimented by truncating the file. The error varies, either it
>>> reads too few records or gives the error I reported, presumably determined
>>> by whether the last perceived line is entire.
>>> The problem arises when the file reaches 4GB, in this case between
>>> 8,030,000 and 8,040,000 rows:
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02
>>> spd_all_trunc_8030k.csv
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06
>>> spd_all_trunc_8040k.csv
>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>> first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row
>>> of data)
>>> All the fields on line 1 are character fields. Treating as the column
>>> names.
>>> Count of eol after first data row: 80300000
>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999
>>> data rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Type codes: 000002000 (+last 5 rows)
>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
>>> '0.42634430000000001'
>>> Bumping column 7 from INT64 to REAL on data row 9, field contains
>>> '0.42634430000000001'
>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>> 0.000s ( 0%) Sep and header detection
>>> 0.000s ( 0%) Count rows (wc -l)
>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>> 171.188s ( 65%) Reading data
>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc
>>> time if triggered
>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>> 0.000s ( 0%) Changing na.strings to NA
>>> 0.000s Total
>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>> first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row
>>> of data)
>>> All the fields on line 1 are character fields. Treating as the column
>>> names.
>>> Count of eol after first data row: 18913
>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data
>>> rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types:
>>> 204650,724540,
>>> Regards,
>>> Paul
>>>
>>> On 1 May 2013 10:28, Paul Harding wrote:
>>>
>>>> Here is the verbose output:
>>>>
>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>>> first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first
>>>> row of data)
>>>> All the fields on line 1 are character fields. Treating as the column
>>>> names.
>>>> Count of eol after first data row: 9186293
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
>>>> data rows
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002200 (+middle 5 rows)
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting
>>>> types: 204038,2617097,20110803,0,0
>>>> But here is the wc output (via cygwin; newline, word (whitespace delim
>>>> so each word one 'line' here), byte)@
>>>> $ wc spd_all_fixed.csv
>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>> [So fread 9M, wc 168M rows].
>>>> Regards
>>>> Paul
>>>>
>>>> On 30 April 2013 18:52, Matthew Dowle wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>
>>>>> Thanks, Matthew
>>>>>
>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>
>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line.
>>>>>> >>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>> >>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>> >>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>> and comparing to surrounding lines and the first ten lines >>>>>> >>>>>> $ head spd_all_fixed.csv >>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>> Regards >>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Fri May 3 16:57:19 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 09:57:19 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: A correction - the param is called "nomatch", not "match". This use case seems like smth a user shouldn't really do - in an ideal world you should have them both keyed by the same-name column. As is, my view on it is that data.table is correcting the user mistake of naming the column in Y - y, instead of x, and so the output makes sense and I don't see the need of complicating the behavior by adding more cases one has to go through to figure out what the output columns would be. Similar to asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous column there, would you? On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck wrote: > I am moving this discussion which started with mdowle to the list. > > Consider this example slightly modified from the data.table FAQ: > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > out <- X[Y]; out > x foo bar > 1: b 3 4 > 2: b 4 4 > 3: b 5 4 > 4: c 6 2 > 5: c 7 2 > 6: d NA 3 > > Note that the first column of the output is labelled x even though the > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > does appear in Y$y so clearly the data is coming from y as opposed to > x . In terms of SQL the above would be written: > > select Y.y as x, ... 
> > and the need to renamne the first column of out suggests that there > may be a deeper problem here. > > Here are some ideas to address this (they would require changes to > data.table): > > - the default of X[Y,, match=NA] would be changed to a default of > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > in SQL joins. > > - the column name of the first column in the example above would be > changed to y if match=0 but be left at x if match=NA. In the case > that match=0 (the proposed new default) x and y are equal so the first > column can be validly labelled as x but in the case that match=NA they > are not so y would be used as the column name. > > - the name match= does seem a bit misleading since R's match only > matches one item in the target whereas in data.table match matches > many if mult="all" and that is the default. Perhaps some thought > should be given to a name change here? > > The above would seem to correspond more closely to R's merge and SQL > join defaults. Any use cases or other comments? > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Fri May 3 16:59:05 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 09:59:05 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: I would prefer nomatch=0 as a default though, simply because that's what I do most of the time :) On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan wrote: > A correction - the param is called "nomatch", not "match". > > This use case seems like smth a user shouldn't really do - in an ideal > world you should have them both keyed by the same-name column. > > As is, my view on it is that data.table is correcting the user mistake of > naming the column in Y - y, instead of x, and so the output makes sense and > I don't see the need of complicating the behavior by adding more cases one > has to go through to figure out what the output columns would be. Similar > to asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column there, would you? > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck < > ggrothendieck at gmail.com> wrote: > >> I am moving this discussion which started with mdowle to the list. >> >> Consider this example slightly modified from the data.table FAQ: >> >> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >> > out <- X[Y]; out >> x foo bar >> 1: b 3 4 >> 2: b 4 4 >> 3: b 5 4 >> 4: c 6 2 >> 5: c 7 2 >> 6: d NA 3 >> >> Note that the first column of the output is labelled x even though the >> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >> does appear in Y$y so clearly the data is coming from y as opposed to >> x . In terms of SQL the above would be written: >> >> select Y.y as x, ... >> >> and the need to renamne the first column of out suggests that there >> may be a deeper problem here. 
>>
>> Here are some ideas to address this (they would require changes to
>> data.table):
>>
>> - the default of X[Y,, match=NA] would be changed to a default of
>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>> in SQL joins.
>>
>> - the column name of the first column in the example above would be
>> changed to y if match=0 but be left at x if match=NA. In the case
>> that match=0 (the proposed new default) x and y are equal so the first
>> column can be validly labelled as x but in the case that match=NA they
>> are not so y would be used as the column name.
>>
>> - the name match= does seem a bit misleading since R's match only
>> matches one item in the target whereas in data.table match matches
>> many if mult="all" and that is the default. Perhaps some thought
>> should be given to a name change here?
>>
>> The above would seem to correspond more closely to R's merge and SQL
>> join defaults. Any use cases or other comments?
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ggrothendieck at gmail.com  Fri May  3 17:09:19 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 11:09:19 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: 
Message-ID: 

Yes, sorry. It's nomatch= which presumably derives from the parameter
of the same name in the match() function. If the idea of the nomatch=
name was to leverage off existing argument names in R then I would
prefer all.y= to be consistent with merge() in place of nomatch= since
we are really merging/joining rather than just matching. That would
also allow extension to all types of join by adding an all.x= argument
too.

On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
 wrote:
> I would prefer nomatch=0 as a default though, simply because that's what I
> do most of the time :)
>
>
> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan 
> wrote:
>>
>> A correction - the param is called "nomatch", not "match".
>>
>> This use case seems like smth a user shouldn't really do - in an ideal
>> world you should have them both keyed by the same-name column.
>>
>> As is, my view on it is that data.table is correcting the user mistake of
>> naming the column in Y - y, instead of x, and so the output makes sense and
>> I don't see the need of complicating the behavior by adding more cases one
>> has to go through to figure out what the output columns would be. Similar to
>> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous column
>> there, would you?
>>
>>
>>
>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
>> wrote:
>>>
>>> I am moving this discussion which started with mdowle to the list.
>>>
>>> Consider this example slightly modified from the data.table FAQ:
>>>
>>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>>> > out <- X[Y]; out
>>> x foo bar
>>> 1: b 3 4
>>> 2: b 4 4
>>> 3: b 5 4
>>> 4: c 6 2
>>> 5: c 7 2
>>> 6: d NA 3
>>>
>>> Note that the first column of the output is labelled x even though the
>>> data to produce it comes from y, e.g.
"d" in out$x is not in X$x but >>> does appear in Y$y so clearly the data is coming from y as opposed to >>> x . In terms of SQL the above would be written: >>> >>> select Y.y as x, ... >>> >>> and the need to renamne the first column of out suggests that there >>> may be a deeper problem here. >>> >>> Here are some ideas to address this (they would require changes to >>> data.table): >>> >>> - the default of X[Y,, match=NA] would be changed to a default of >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >>> in SQL joins. >>> >>> - the column name of the first column in the example above would be >>> changed to y if match=0 but be left at x if match=NA. In the case >>> that match=0 (the proposed new default) x and y are equal so the first >>> column can be validly labelled as x but in the case that match=NA they >>> are not so y would be used as the column name. >>> >>> - the name match= does seem a bit misleading since R's match only >>> matches one item in the target whereas in data.table match matches >>> many if mult="all" and that is the default. Perhaps some thought >>> should be given to a name change here? >>> >>> The above would seem to correspond more closely to R's merge and SQL >>> join defaults. Any use cases or other comments? >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 3 17:23:02 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 10:23:02 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: To clarify - that behavior is already implemented in merge (more specifically merge.data.table). I don't really have a view on having it in X[Y] as well - I don't like all.x and all.y as the names, since there are no params named 'x' and 'y' in [.data.table (as opposed to merge), but some param that would do a full outer join could certainly be added. On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck wrote: > Yes, sorry. Its nomatch= which presumably derives from the parameter > of the same name in the match() function. If the idea of the nomatch= > name was to leverage off existing argument names in R then I would > prefer all.y= to be consistent with merge() in place of nomatch= since > we are really merging/joining rather than just matching. That would > also allow extension to all types of join by adding all.an x= argument > too. > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > wrote: > > I would prefer nomatch=0 as a default though, simply because that's what > I > > do most of the time :) > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > > wrote: > >> > >> A correction - the param is called "nomatch", not "match". > >> > >> This use case seems like smth a user shouldn't really do - in an ideal > >> world you should have them both keyed by the same-name column. 
> >> > >> As is, my view on it is that data.table is correcting the user mistake > of > >> naming the column in Y - y, instead of x, and so the output makes sense > and > >> I don't see the need of complicating the behavior by adding more cases > one > >> has to go through to figure out what the output columns would be. > Similar to > >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column > >> there, would you? > >> > >> > >> > >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > >> wrote: > >>> > >>> I am moving this discussion which started with mdowle to the list. > >>> > >>> Consider this example slightly modified from the data.table FAQ: > >>> > >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > >>> > out <- X[Y]; out > >>> x foo bar > >>> 1: b 3 4 > >>> 2: b 4 4 > >>> 3: b 5 4 > >>> 4: c 6 2 > >>> 5: c 7 2 > >>> 6: d NA 3 > >>> > >>> Note that the first column of the output is labelled x even though the > >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but > >>> does appear in Y$y so clearly the data is coming from y as opposed to > >>> x . In terms of SQL the above would be written: > >>> > >>> select Y.y as x, ... > >>> > >>> and the need to renamne the first column of out suggests that there > >>> may be a deeper problem here. > >>> > >>> Here are some ideas to address this (they would require changes to > >>> data.table): > >>> > >>> - the default of X[Y,, match=NA] would be changed to a default of > >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and > >>> in SQL joins. > >>> > >>> - the column name of the first column in the example above would be > >>> changed to y if match=0 but be left at x if match=NA. In the case > >>> that match=0 (the proposed new default) x and y are equal so the first > >>> column can be validly labelled as x but in the case that match=NA they > >>> are not so y would be used as the column name. > >>> > >>> - the name match= does seem a bit misleading since R's match only > >>> matches one item in the target whereas in data.table match matches > >>> many if mult="all" and that is the default. Perhaps some thought > >>> should be given to a name change here? > >>> > >>> The above would seem to correspond more closely to R's merge and SQL > >>> join defaults. Any use cases or other comments? > >>> > >>> -- > >>> Statistics & Software Consulting > >>> GKX Group, GKX Associates Inc. > >>> tel: 1-877-GKX-GROUP > >>> email: ggrothendieck at gmail.com > >>> _______________________________________________ > >>> datatable-help mailing list > >>> datatable-help at lists.r-forge.r-project.org > >>> > >>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >> > >> > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Fri May 3 17:27:44 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 10:27:44 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: Btw the way I think about the "nomatch" name is as follows - normally X[Y] tries to match rows of Y with rows of X, and then "nomatch" tells it what to do when there is *no match*. 
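For example, here's a quick sketch using Gabor's X and Y from upthread
(untested, but this is the behavior I'd expect, since "d" has no match in
X's key):

library(data.table)
X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
Y <- data.table(y=c("b","c","d"), bar=c(4,2,3))
X[Y]             # nomatch=NA (default): the unmatched "d" row is kept, with foo=NA
X[Y, nomatch=0]  # unmatched rows are dropped: only the "b" and "c" rows remain

So nomatch=NA acts like an outer join on Y's rows and nomatch=0 like an
inner join, which is the merge/SQL join default Gabor referred to.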
On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan wrote: > To clarify - that behavior is already implemented in merge (more > specifically merge.data.table). I don't really have a view on having it in > X[Y] as well - I don't like all.x and all.y as the names, since there are > no params named 'x' and 'y' in [.data.table (as opposed to merge), but some > param that would do a full outer join could certainly be added. > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck < > ggrothendieck at gmail.com> wrote: > >> Yes, sorry. Its nomatch= which presumably derives from the parameter >> of the same name in the match() function. If the idea of the nomatch= >> name was to leverage off existing argument names in R then I would >> prefer all.y= to be consistent with merge() in place of nomatch= since >> we are really merging/joining rather than just matching. That would >> also allow extension to all types of join by adding all.an x= argument >> too. >> >> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan >> wrote: >> > I would prefer nomatch=0 as a default though, simply because that's >> what I >> > do most of the time :) >> > >> > >> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan < >> eduard.antonyan at gmail.com> >> > wrote: >> >> >> >> A correction - the param is called "nomatch", not "match". >> >> >> >> This use case seems like smth a user shouldn't really do - in an ideal >> >> world you should have them both keyed by the same-name column. >> >> >> >> As is, my view on it is that data.table is correcting the user mistake >> of >> >> naming the column in Y - y, instead of x, and so the output makes >> sense and >> >> I don't see the need of complicating the behavior by adding more cases >> one >> >> has to go through to figure out what the output columns would be. >> Similar to >> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous >> column >> >> there, would you? >> >> >> >> >> >> >> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck >> >> wrote: >> >>> >> >>> I am moving this discussion which started with mdowle to the list. >> >>> >> >>> Consider this example slightly modified from the data.table FAQ: >> >>> >> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >> >>> > out <- X[Y]; out >> >>> x foo bar >> >>> 1: b 3 4 >> >>> 2: b 4 4 >> >>> 3: b 5 4 >> >>> 4: c 6 2 >> >>> 5: c 7 2 >> >>> 6: d NA 3 >> >>> >> >>> Note that the first column of the output is labelled x even though the >> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >> >>> does appear in Y$y so clearly the data is coming from y as opposed to >> >>> x . In terms of SQL the above would be written: >> >>> >> >>> select Y.y as x, ... >> >>> >> >>> and the need to renamne the first column of out suggests that there >> >>> may be a deeper problem here. >> >>> >> >>> Here are some ideas to address this (they would require changes to >> >>> data.table): >> >>> >> >>> - the default of X[Y,, match=NA] would be changed to a default of >> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >> >>> in SQL joins. >> >>> >> >>> - the column name of the first column in the example above would be >> >>> changed to y if match=0 but be left at x if match=NA. In the case >> >>> that match=0 (the proposed new default) x and y are equal so the first >> >>> column can be validly labelled as x but in the case that match=NA they >> >>> are not so y would be used as the column name. 
>> >>> >> >>> - the name match= does seem a bit misleading since R's match only >> >>> matches one item in the target whereas in data.table match matches >> >>> many if mult="all" and that is the default. Perhaps some thought >> >>> should be given to a name change here? >> >>> >> >>> The above would seem to correspond more closely to R's merge and SQL >> >>> join defaults. Any use cases or other comments? >> >>> >> >>> -- >> >>> Statistics & Software Consulting >> >>> GKX Group, GKX Associates Inc. >> >>> tel: 1-877-GKX-GROUP >> >>> email: ggrothendieck at gmail.com >> >>> _______________________________________________ >> >>> datatable-help mailing list >> >>> datatable-help at lists.r-forge.r-project.org >> >>> >> >>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> > >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Fri May 3 17:36:28 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 3 May 2013 11:36:28 -0400 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: Yes, except that is not really what happens since match() only matches one row whereas with mult="all", the default, all rows are matched which is not really matching in the sense of match(). The current naming confuses matching with joining and its really the latter that is being done. Regarding the existence of merge the advantage of [ is that it will automatically only take the columns needed so merge is not really equivalent to [ in all respects. Furthermore having to use different constructs for different types of merge seems awkward. On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan wrote: > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > tries to match rows of Y with rows of X, and then "nomatch" tells it what to > do when there is *no match*. > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > wrote: >> >> To clarify - that behavior is already implemented in merge (more >> specifically merge.data.table). I don't really have a view on having it in >> X[Y] as well - I don't like all.x and all.y as the names, since there are no >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some >> param that would do a full outer join could certainly be added. >> >> >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck >> wrote: >>> >>> Yes, sorry. Its nomatch= which presumably derives from the parameter >>> of the same name in the match() function. If the idea of the nomatch= >>> name was to leverage off existing argument names in R then I would >>> prefer all.y= to be consistent with merge() in place of nomatch= since >>> we are really merging/joining rather than just matching. That would >>> also allow extension to all types of join by adding all.an x= argument >>> too. >>> >>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan >>> wrote: >>> > I would prefer nomatch=0 as a default though, simply because that's >>> > what I >>> > do most of the time :) >>> > >>> > >>> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan >>> > >>> > wrote: >>> >> >>> >> A correction - the param is called "nomatch", not "match". >>> >> >>> >> This use case seems like smth a user shouldn't really do - in an ideal >>> >> world you should have them both keyed by the same-name column. 
>>> >> >>> >> As is, my view on it is that data.table is correcting the user mistake >>> >> of >>> >> naming the column in Y - y, instead of x, and so the output makes >>> >> sense and >>> >> I don't see the need of complicating the behavior by adding more cases >>> >> one >>> >> has to go through to figure out what the output columns would be. >>> >> Similar to >>> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous >>> >> column >>> >> there, would you? >>> >> >>> >> >>> >> >>> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck >>> >> wrote: >>> >>> >>> >>> I am moving this discussion which started with mdowle to the list. >>> >>> >>> >>> Consider this example slightly modified from the data.table FAQ: >>> >>> >>> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >>> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >>> >>> > out <- X[Y]; out >>> >>> x foo bar >>> >>> 1: b 3 4 >>> >>> 2: b 4 4 >>> >>> 3: b 5 4 >>> >>> 4: c 6 2 >>> >>> 5: c 7 2 >>> >>> 6: d NA 3 >>> >>> >>> >>> Note that the first column of the output is labelled x even though >>> >>> the >>> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >>> >>> does appear in Y$y so clearly the data is coming from y as opposed to >>> >>> x . In terms of SQL the above would be written: >>> >>> >>> >>> select Y.y as x, ... >>> >>> >>> >>> and the need to renamne the first column of out suggests that there >>> >>> may be a deeper problem here. >>> >>> >>> >>> Here are some ideas to address this (they would require changes to >>> >>> data.table): >>> >>> >>> >>> - the default of X[Y,, match=NA] would be changed to a default of >>> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >>> >>> in SQL joins. >>> >>> >>> >>> - the column name of the first column in the example above would be >>> >>> changed to y if match=0 but be left at x if match=NA. In the case >>> >>> that match=0 (the proposed new default) x and y are equal so the >>> >>> first >>> >>> column can be validly labelled as x but in the case that match=NA >>> >>> they >>> >>> are not so y would be used as the column name. >>> >>> >>> >>> - the name match= does seem a bit misleading since R's match only >>> >>> matches one item in the target whereas in data.table match matches >>> >>> many if mult="all" and that is the default. Perhaps some thought >>> >>> should be given to a name change here? >>> >>> >>> >>> The above would seem to correspond more closely to R's merge and SQL >>> >>> join defaults. Any use cases or other comments? >>> >>> >>> >>> -- >>> >>> Statistics & Software Consulting >>> >>> GKX Group, GKX Associates Inc. >>> >>> tel: 1-877-GKX-GROUP >>> >>> email: ggrothendieck at gmail.com >>> >>> _______________________________________________ >>> >>> datatable-help mailing list >>> >>> datatable-help at lists.r-forge.r-project.org >>> >>> >>> >>> >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >>> >> >>> > >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >> >> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. 
tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 3 17:41:10 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 10:41:10 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: Yeah, leveraging not using all columns in a full merge is a good idea - I agree that feature would be nice to have. I'm not using the match() function tbh, so current naming doesn't bug me, but maybe others who do use match() can weigh in if that bothers them as well. On Fri, May 3, 2013 at 10:36 AM, Gabor Grothendieck wrote: > Yes, except that is not really what happens since match() only matches > one row whereas with mult="all", the default, all rows are matched > which is not really matching in the sense of match(). The current > naming confuses matching with joining and its really the latter that > is being done. > > Regarding the existence of merge the advantage of [ is that it will > automatically only take the columns needed so merge is not really > equivalent to [ in all respects. Furthermore having to use different > constructs for different types of merge seems awkward. > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > wrote: > > Btw the way I think about the "nomatch" name is as follows - normally > X[Y] > > tries to match rows of Y with rows of X, and then "nomatch" tells it > what to > > do when there is *no match*. > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > > wrote: > >> > >> To clarify - that behavior is already implemented in merge (more > >> specifically merge.data.table). I don't really have a view on having it > in > >> X[Y] as well - I don't like all.x and all.y as the names, since there > are no > >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some > >> param that would do a full outer join could certainly be added. > >> > >> > >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > >> wrote: > >>> > >>> Yes, sorry. Its nomatch= which presumably derives from the parameter > >>> of the same name in the match() function. If the idea of the nomatch= > >>> name was to leverage off existing argument names in R then I would > >>> prefer all.y= to be consistent with merge() in place of nomatch= since > >>> we are really merging/joining rather than just matching. That would > >>> also allow extension to all types of join by adding all.an x= argument > >>> too. > >>> > >>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > >>> wrote: > >>> > I would prefer nomatch=0 as a default though, simply because that's > >>> > what I > >>> > do most of the time :) > >>> > > >>> > > >>> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > >>> > > >>> > wrote: > >>> >> > >>> >> A correction - the param is called "nomatch", not "match". > >>> >> > >>> >> This use case seems like smth a user shouldn't really do - in an > ideal > >>> >> world you should have them both keyed by the same-name column. > >>> >> > >>> >> As is, my view on it is that data.table is correcting the user > mistake > >>> >> of > >>> >> naming the column in Y - y, instead of x, and so the output makes > >>> >> sense and > >>> >> I don't see the need of complicating the behavior by adding more > cases > >>> >> one > >>> >> has to go through to figure out what the output columns would be. > >>> >> Similar to > >>> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > >>> >> column > >>> >> there, would you? 
> >>> >> > >>> >> > >>> >> > >>> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > >>> >> wrote: > >>> >>> > >>> >>> I am moving this discussion which started with mdowle to the list. > >>> >>> > >>> >>> Consider this example slightly modified from the data.table FAQ: > >>> >>> > >>> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, > key="x") > >>> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > >>> >>> > out <- X[Y]; out > >>> >>> x foo bar > >>> >>> 1: b 3 4 > >>> >>> 2: b 4 4 > >>> >>> 3: b 5 4 > >>> >>> 4: c 6 2 > >>> >>> 5: c 7 2 > >>> >>> 6: d NA 3 > >>> >>> > >>> >>> Note that the first column of the output is labelled x even though > >>> >>> the > >>> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x > but > >>> >>> does appear in Y$y so clearly the data is coming from y as opposed > to > >>> >>> x . In terms of SQL the above would be written: > >>> >>> > >>> >>> select Y.y as x, ... > >>> >>> > >>> >>> and the need to renamne the first column of out suggests that there > >>> >>> may be a deeper problem here. > >>> >>> > >>> >>> Here are some ideas to address this (they would require changes to > >>> >>> data.table): > >>> >>> > >>> >>> - the default of X[Y,, match=NA] would be changed to a default of > >>> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge > and > >>> >>> in SQL joins. > >>> >>> > >>> >>> - the column name of the first column in the example above would be > >>> >>> changed to y if match=0 but be left at x if match=NA. In the case > >>> >>> that match=0 (the proposed new default) x and y are equal so the > >>> >>> first > >>> >>> column can be validly labelled as x but in the case that match=NA > >>> >>> they > >>> >>> are not so y would be used as the column name. > >>> >>> > >>> >>> - the name match= does seem a bit misleading since R's match only > >>> >>> matches one item in the target whereas in data.table match matches > >>> >>> many if mult="all" and that is the default. Perhaps some thought > >>> >>> should be given to a name change here? > >>> >>> > >>> >>> The above would seem to correspond more closely to R's merge and > SQL > >>> >>> join defaults. Any use cases or other comments? > >>> >>> > >>> >>> -- > >>> >>> Statistics & Software Consulting > >>> >>> GKX Group, GKX Associates Inc. > >>> >>> tel: 1-877-GKX-GROUP > >>> >>> email: ggrothendieck at gmail.com > >>> >>> _______________________________________________ > >>> >>> datatable-help mailing list > >>> >>> datatable-help at lists.r-forge.r-project.org > >>> >>> > >>> >>> > >>> >>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >>> >> > >>> >> > >>> > > >>> > >>> > >>> > >>> -- > >>> Statistics & Software Consulting > >>> GKX Group, GKX Associates Inc. > >>> tel: 1-877-GKX-GROUP > >>> email: ggrothendieck at gmail.com > >> > >> > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri May 3 17:45:24 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 3 May 2013 17:45:24 +0200 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: <27580076D6D24A76A7A3807E36E83536@gmail.com> (The third time, I'm growing tired of this 40KB message taking over half-hour to reach me! 
(The third time - I'm growing tired of this 40KB message taking over half
an hour to reach me! :) )

Gabor,

About the behaviour of X[Y]: the current definition of X[Y] is "it's a
join looking up X's rows using Y as an index". By this definition, the
output of X[Y] is very much justified, I think. Y is just used as an
index. To me it feels similar to, say, X[8] (which gives NA, NA with the
same column names as X).

Another thought that occurs to me is, say, in this example:

    X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
    Y <- data.table(y=c("b"), bar=c(4))
    X[Y]

Here again, you query for Y's y values in X's key column and join X's and
Y's columns. There's no Y-value for which X gives NA. The data then comes
from both "X" and "Y" (as opposed to the case "d" you showed, where the
data comes just from "Y"). In this case should it be named "x" or "y"?
Always "x" makes sense to me. And Y[X] would give a "y" instead. However,
I am not that good with SQL joins, so I may very well have missed your
point here.

Regarding `merge`:

    x <- as.data.frame(X)
    y <- as.data.frame(Y)
    merge(x, y, by.x="x", by.y="y", all=TRUE)  # --- (1)
    merge(y, x, by.x="y", by.y="x", all=TRUE)  # --- (2)

(1) always gives the column name "x" and (2) always gives "y". And so
does X[Y] as opposed to Y[X], except for the fact that the operations
X[Y] and Y[X] are not identical (as opposed to merge). So I don't see a
dissimilarity here. Again, I may have misread your point and would love
to be corrected if so.

About `"nomatch"`, I agree with you that the name could be changed to
avoid confusion with R's `match`. Maybe "missing = NA" and "missing = 0"
make more sense?

Best regards,
Arun

From aragorn168b at gmail.com Fri May 3 17:46:53 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 17:46:53 +0200
Subject: [datatable-help] merge/join/match
Message-ID: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>

Gabor,

X[Y] and Y[X] are not necessarily the same operations (meaning, they
don't *have* to give the same output). However, merge(X,Y) and merge(Y,X)
*have* to provide the same output (except for the column order and
names). In that sense, a join is a bit different from a merge, no?

Arun
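To make the asymmetry concrete, here is a minimal sketch using the X and
Y from Gabor's example (keying Y is an extra assumption here; the join
does not require it):

    library(data.table)
    X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
    Y <- data.table(y=c("b","c","d"), bar=c(4,2,3), key="y")
    X[Y]  # one result row per row of Y (times matches); foo is NA for y=="d"
    Y[X]  # one result row per row of X (times matches); bar is NA for x=="a"

X[Y] returns 6 rows and Y[X] returns 7, so the two joins are genuinely
different operations, as Arun says.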
From aragorn168b at gmail.com Fri May 3 17:48:56 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 17:48:56 +0200
Subject: [datatable-help] merge/join/match

Eddi,

You could just set: options(datatable.nomatch = 0) if you use that
extensively.

Arun

From eduard.antonyan at gmail.com Fri May 3 17:50:19 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 10:50:19 -0500
Subject: [datatable-help] merge/join/match
In-Reply-To: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>

Arun, I think Gabor understands that. If I understand him correctly, he
simply wants an option for X[Y, some.option = TRUE] to return
merge(X, Y, all = TRUE). Also, do note that merge(X, Y, all.x = TRUE) and
merge(Y, X, all.x = TRUE) don't have to return the same answer.
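A small sketch of the nomatch= behaviour under discussion, together with
the global option Arun suggests (data as in Gabor's example):

    library(data.table)
    X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
    Y <- data.table(y=c("b","c","d"), bar=c(4,2,3))
    X[Y]                            # default nomatch=NA: "d" kept with foo=NA
    X[Y, nomatch=0]                 # "d" dropped: an inner join
    options(datatable.nomatch = 0)  # change the session-wide default
    X[Y]                            # now drops "d" as well
    options(datatable.nomatch = NA) # restore the usual default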
From eduard.antonyan at gmail.com Fri May 3 17:51:13 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 10:51:13 -0500
Subject: [datatable-help] merge/join/match

Good point - I might do that, though I'll need to be a bit careful, as I
run a lot of scripts on remote computers.

On Fri, May 3, 2013 at 10:48 AM, Arunkumar Srinivasan wrote:
> You could just set: options(datatable.nomatch = 0) if you use that
> extensively.

From ggrothendieck at gmail.com Fri May 3 17:55:38 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 11:55:38 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>

Assuming same-named keys, these are all the same except possibly for row
and column order:

    X[Y,, nomatch=0]
    Y[X,, nomatch=0]
    merge(X, Y)
    merge(Y, X)

That X[Y] is not the same as Y[X] is analogous to the fact that
merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE).
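A sketch checking the four forms above; as Gabor says, it assumes
same-named keys, so the FAQ example is restated with a shared key column
k (the renaming is for illustration only):

    library(data.table)
    X <- data.table(k=c("a","a","b","b","b","c","c"), foo=1:7, key="k")
    Y <- data.table(k=c("b","c","d"), bar=c(4,2,3), key="k")
    X[Y, nomatch=0]  # 5 rows: b,b,b,c,c with bar attached
    Y[X, nomatch=0]  # the same 5 rows, with Y's columns first
    merge(X, Y)      # merge.data.table joins on the shared key: same 5 rows
    merge(Y, X)      # ditto, with the non-key columns swapped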
From ggrothendieck at gmail.com Fri May 3 17:57:45 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 11:57:45 -0400
Subject: [datatable-help] merge/join/match

In my last post it should have read:

That X[Y] is not the same as Y[X] is analogous to the fact that
merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE).
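The corrected analogy, spelled out on the same-keyed tables from the
previous sketch:

    library(data.table)
    X <- data.table(k=c("a","a","b","b","b","c","c"), foo=1:7, key="k")
    Y <- data.table(k=c("b","c","d"), bar=c(4,2,3), key="k")
    X[Y]                     # keeps every row of Y: 6 rows, foo NA for k=="d"
    merge(X, Y, all.y=TRUE)  # same content, column order aside
    Y[X]                     # keeps every row of X: 7 rows, bar NA for k=="a"
    merge(Y, X, all.y=TRUE)  # same content again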
From aragorn168b at gmail.com Fri May 3 18:45:24 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 18:45:24 +0200
Subject: [datatable-help] merge/join/match

Gabor,

Very true. I suppose your request is that x[i], where `i` is a
data.table, should have the same set of options as R's base `merge`
function: by.x=, by.y=, all=TRUE and so on. I like the idea by itself.
However, I am not able to think of a way to do this. I mean, I find the
syntax X[Y, by.x=TRUE] weird / not making sense. That is, even though

    X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE]

(ignoring the reordered columns), the latter two don't seem to make
sense / are redundant (maybe it's because I am used to this syntax).

Arun
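For reference, the merge() options under discussion, run on data.frame
versions of the FAQ tables (as in Arun's message of 17:45; base R's merge
names the join column after by.x):

    x <- data.frame(x=c("a","a","b","b","b","c","c"), foo=1:7)
    y <- data.frame(y=c("b","c","d"), bar=c(4,2,3))
    merge(x, y, by.x="x", by.y="y")            # inner join: 5 rows, column "x"
    merge(x, y, by.x="x", by.y="y", all=TRUE)  # full outer join: 8 rows
    merge(y, x, by.x="y", by.y="x", all=TRUE)  # same 8 rows, column named "y"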
From eduard.antonyan at gmail.com Fri May 3 18:49:11 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 11:49:11 -0500
Subject: [datatable-help] merge/join/match

Arun, it only needs the addition of smth like X[Y, keep.all = TRUE]; all
of the other merge options already exist as either X[Y] or Y[X], with or
without nomatch = 0/NA.
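keep.all= is Eduard's proposed name, not an existing argument; the full
outer join it would spell is already available through merge.data.table:

    library(data.table)
    X <- data.table(k=c("a","a","b","b","b","c","c"), foo=1:7, key="k")
    Y <- data.table(k=c("b","c","d"), bar=c(4,2,3), key="k")
    merge(X, Y, all=TRUE)   # 8 rows: NA bar for the "a" rows, NA foo for "d"
    # X[Y, keep.all=TRUE]   # proposed syntax only - not implemented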
> > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > wrote: > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > tries to match rows of Y with rows of X, and then "nomatch" tells it what > to > do when there is *no match*. > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > wrote: > > > To clarify - that behavior is already implemented in merge (more > specifically merge.data.table). I don't really have a view on having it in > X[Y] as well - I don't like all.x and all.y as the names, since there are > no > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > param that would do a full outer join could certainly be added. > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > wrote: > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > of the same name in the match() function. If the idea of the nomatch= > name was to leverage off existing argument names in R then I would > prefer all.y= to be consistent with merge() in place of nomatch= since > we are really merging/joining rather than just matching. That would > also allow extension to all types of join by adding all.an x= argument > too. > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > wrote: > > I would prefer nomatch=0 as a default though, simply because that's > what I > do most of the time :) > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > wrote: > > > A correction - the param is called "nomatch", not "match". > > This use case seems like smth a user shouldn't really do - in an ideal > world you should have them both keyed by the same-name column. > > As is, my view on it is that data.table is correcting the user mistake > of > naming the column in Y - y, instead of x, and so the output makes > sense and > I don't see the need of complicating the behavior by adding more cases > one > has to go through to figure out what the output columns would be. > Similar to > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column > there, would you? > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > wrote: > > > I am moving this discussion which started with mdowle to the list. > > Consider this example slightly modified from the data.table FAQ: > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > out <- X[Y]; out > > x foo bar > 1: b 3 4 > 2: b 4 4 > 3: b 5 4 > 4: c 6 2 > 5: c 7 2 > 6: d NA 3 > > Note that the first column of the output is labelled x even though > the > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > does appear in Y$y so clearly the data is coming from y as opposed to > x . In terms of SQL the above would be written: > > select Y.y as x, ... > > and the need to renamne the first column of out suggests that there > may be a deeper problem here. > > Here are some ideas to address this (they would require changes to > data.table): > > - the default of X[Y,, match=NA] would be changed to a default of > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > in SQL joins. > > - the column name of the first column in the example above would be > changed to y if match=0 but be left at x if match=NA. In the case > that match=0 (the proposed new default) x and y are equal so the > first > column can be validly labelled as x but in the case that match=NA > they > are not so y would be used as the column name. 
From ggrothendieck at gmail.com Fri May 3 18:50:23 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 12:50:23 -0400
Subject: [datatable-help] merge/join/match

I was thinking all.y= would be implemented first, to replace nomatch=,
and then all.x= later (as the latter involves more work, whereas the
former is trivial). It would also be possible to implement by.x, by.y
and by, such that they default to the current functionality, and that
might be nice too.
From aragorn168b at gmail.com Fri May 3 18:52:34 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 18:52:34 +0200
Subject: [datatable-help] merge/join/match
Message-ID: <267F3BA67F4144DC99C3C1C6784722E9@gmail.com>

Eduard,

Yes, I know. But to maintain consistency with `merge` in base R, you
should be able to express any merge (by.x, by.y, all) with X[Y] or Y[X] -
that is what I understand. As it stands, with Y[X] you wouldn't be able
to get the result of merge(X, Y, all.y=TRUE) (including the column
reordering). This is what I understand from Gabor's post.

Arun
> > > > > > > > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what to > > > > > do when there is *no match*. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > > > specifically merge.data.table). I don't really have a view on having it in > > > > > X[Y] as well - I don't like all.x and all.y as the names, since there are no > > > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > > > param that would do a full outer join could certainly be added. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > > > of the same name in the match() function. If the idea of the nomatch= > > > > > name was to leverage off existing argument names in R then I would > > > > > prefer all.y= to be consistent with merge() in place of nomatch= since > > > > > we are really merging/joining rather than just matching. That would > > > > > also allow extension to all types of join by adding all.an (http://all.an) x= argument > > > > > too. > > > > > > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > I would prefer nomatch=0 as a default though, simply because that's > > > > > what I > > > > > do most of the time :) > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > > > > > > > > > wrote: > > > > > > > > > > > > > > > A correction - the param is called "nomatch", not "match". > > > > > > > > > > This use case seems like smth a user shouldn't really do - in an ideal > > > > > world you should have them both keyed by the same-name column. > > > > > > > > > > As is, my view on it is that data.table is correcting the user mistake > > > > > of > > > > > naming the column in Y - y, instead of x, and so the output makes > > > > > sense and > > > > > I don't see the need of complicating the behavior by adding more cases > > > > > one > > > > > has to go through to figure out what the output columns would be. > > > > > Similar to > > > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > > > > > column > > > > > there, would you? > > > > > > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > > > > > > I am moving this discussion which started with mdowle to the list. > > > > > > > > > > Consider this example slightly modified from the data.table FAQ: > > > > > > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > > > > out <- X[Y]; out > > > > > > > > > > x foo bar > > > > > 1: b 3 4 > > > > > 2: b 4 4 > > > > > 3: b 5 4 > > > > > 4: c 6 2 > > > > > 5: c 7 2 > > > > > 6: d NA 3 > > > > > > > > > > Note that the first column of the output is labelled x even though > > > > > the > > > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > > > > > does appear in Y$y so clearly the data is coming from y as opposed to > > > > > x . In terms of SQL the above would be written: > > > > > > > > > > select Y.y as x, ... 
> > > > > > > > > > and the need to renamne the first column of out suggests that there > > > > > may be a deeper problem here. > > > > > > > > > > Here are some ideas to address this (they would require changes to > > > > > data.table): > > > > > > > > > > - the default of X[Y,, match=NA] would be changed to a default of > > > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > > > > > in SQL joins. > > > > > > > > > > - the column name of the first column in the example above would be > > > > > changed to y if match=0 but be left at x if match=NA. In the case > > > > > that match=0 (the proposed new default) x and y are equal so the > > > > > first > > > > > column can be validly labelled as x but in the case that match=NA > > > > > they > > > > > are not so y would be used as the column name. > > > > > > > > > > - the name match= does seem a bit misleading since R's match only > > > > > matches one item in the target whereas in data.table match matches > > > > > many if mult="all" and that is the default. Perhaps some thought > > > > > should be given to a name change here? > > > > > > > > > > The above would seem to correspond more closely to R's merge and SQL > > > > > join defaults. Any use cases or other comments? > > > > > > > > > > -- > > > > > Statistics & Software Consulting > > > > > GKX Group, GKX Associates Inc. > > > > > tel: 1-877-GKX-GROUP > > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Statistics & Software Consulting > > > > > GKX Group, GKX Associates Inc. > > > > > tel: 1-877-GKX-GROUP > > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Statistics & Software Consulting > > > > > GKX Group, GKX Associates Inc. > > > > > tel: 1-877-GKX-GROUP > > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Fri May 3 18:54:10 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 3 May 2013 12:54:10 -0400 Subject: [datatable-help] merge/join/match In-Reply-To: References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> Message-ID: I think that from the viewpoint of compatibility and convenience it would be best to implement all.x and all.y and not rely on swapping X and Y. 
SQLite did something like this (they implemented left join but not right join based on the idea that all you have to do is swap join arguments) but the problem with it is that it adds a layer of mental specification effort if the actual problem is better stated in the unsupported orientation. On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan wrote: > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of > the other merge options already exist as either X[Y] or Y[X] with or without > nomatch = 0/NA. > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > wrote: >> >> Gabor, >> >> Very true. I suppose your request is that the x[i] where `i` is a >> data.table should have the same set of options like R's base `merge` >> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself. >> However, I am not able to think of a way to do this. I mean, I find the >> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though >> >> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the >> reordered columns) the latter 2 don't seem to make sense/is redundant (maybe >> it's because I am used to this syntax). >> >> Arun >> >> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: >> >> In my last post it should have read: >> >> That X[Y] is not the same as Y[X] is analogous to the fact that >> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) >> >> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck >> wrote: >> >> Assuming same-named keys, then these are all the same except possibly >> for row and column order: >> >> X[Y,,nomatch=0] >> Y[X,,nomatch=0] >> merge(X, Y) >> merge(Y, X) >> >> That X[Y] is not the same as Y[X] is analogous to the fact that >> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) >> >> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan >> wrote: >> >> Gabor, >> >> X[Y] and Y[X] are not necessarily the same operations (meaning, they don't >> *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* >> to provide the same output (except for the column order and names). In >> that >> sense, a join is a bit different from a merge, no? >> >> Arun >> >> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: >> >> Yes, except that is not really what happens since match() only matches >> one row whereas with mult="all", the default, all rows are matched >> which is not really matching in the sense of match(). The current >> naming confuses matching with joining and its really the latter that >> is being done. >> >> Regarding the existence of merge the advantage of [ is that it will >> automatically only take the columns needed so merge is not really >> equivalent to [ in all respects. Furthermore having to use different >> constructs for different types of merge seems awkward. >> >> >> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan >> wrote: >> >> Btw the way I think about the "nomatch" name is as follows - normally X[Y] >> tries to match rows of Y with rows of X, and then "nomatch" tells it what >> to >> do when there is *no match*. >> >> >> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan >> >> wrote: >> >> >> To clarify - that behavior is already implemented in merge (more >> specifically merge.data.table). 
I don't really have a view on having it in >> X[Y] as well - I don't like all.x and all.y as the names, since there are >> no >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some >> param that would do a full outer join could certainly be added. >> >> >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck >> wrote: >> >> >> Yes, sorry. Its nomatch= which presumably derives from the parameter >> of the same name in the match() function. If the idea of the nomatch= >> name was to leverage off existing argument names in R then I would >> prefer all.y= to be consistent with merge() in place of nomatch= since >> we are really merging/joining rather than just matching. That would >> also allow extension to all types of join by adding all.an x= argument >> too. >> >> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan >> wrote: >> >> I would prefer nomatch=0 as a default though, simply because that's >> what I >> do most of the time :) >> >> >> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan >> >> wrote: >> >> >> A correction - the param is called "nomatch", not "match". >> >> This use case seems like smth a user shouldn't really do - in an ideal >> world you should have them both keyed by the same-name column. >> >> As is, my view on it is that data.table is correcting the user mistake >> of >> naming the column in Y - y, instead of x, and so the output makes >> sense and >> I don't see the need of complicating the behavior by adding more cases >> one >> has to go through to figure out what the output columns would be. >> Similar to >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous >> column >> there, would you? >> >> >> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck >> wrote: >> >> >> I am moving this discussion which started with mdowle to the list. >> >> Consider this example slightly modified from the data.table FAQ: >> >> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >> Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >> out <- X[Y]; out >> >> x foo bar >> 1: b 3 4 >> 2: b 4 4 >> 3: b 5 4 >> 4: c 6 2 >> 5: c 7 2 >> 6: d NA 3 >> >> Note that the first column of the output is labelled x even though >> the >> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >> does appear in Y$y so clearly the data is coming from y as opposed to >> x . In terms of SQL the above would be written: >> >> select Y.y as x, ... >> >> and the need to renamne the first column of out suggests that there >> may be a deeper problem here. >> >> Here are some ideas to address this (they would require changes to >> data.table): >> >> - the default of X[Y,, match=NA] would be changed to a default of >> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >> in SQL joins. >> >> - the column name of the first column in the example above would be >> changed to y if match=0 but be left at x if match=NA. In the case >> that match=0 (the proposed new default) x and y are equal so the >> first >> column can be validly labelled as x but in the case that match=NA >> they >> are not so y would be used as the column name. >> >> - the name match= does seem a bit misleading since R's match only >> matches one item in the target whereas in data.table match matches >> many if mult="all" and that is the default. Perhaps some thought >> should be given to a name change here? >> >> The above would seem to correspond more closely to R's merge and SQL >> join defaults. Any use cases or other comments? 
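
As a point of reference while weighing the proposal above: the relabelling it would automate in the match=NA case can be approximated today by renaming after the join. A sketch of that workaround on the quoted example tables (setnames() renames by reference):

library(data.table)
X <- data.table(x = c("a","a","b","b","b","c","c"), foo = 1:7, key = "x")
Y <- data.table(y = c("b","c","d"), bar = c(4, 2, 3))
out <- X[Y]              # first column is labelled x, though "d" exists only in Y$y
setnames(out, "x", "y")  # relabel to record where the data actually came from
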
>> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> >> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 3 18:54:49 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 11:54:49 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: <267F3BA67F4144DC99C3C1C6784722E9@gmail.com> References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> <267F3BA67F4144DC99C3C1C6784722E9@gmail.com> Message-ID: If that's what Gabor wants, then I don't think that makes a lot of sense for an X[Y] syntax. I think you should only be able to get (all.y = T) or (all = T) from X[Y], but not (all.x=T, all.y=F). On Fri, May 3, 2013 at 11:52 AM, Arunkumar Srinivasan wrote: > Eduard, > > Yes I know. But to maintain the consistency with the `merge` in base R, > you should be able to query any merge (by.x, by.y, all) with X[Y] or Y[X] > is what I understand. That is, with Y[X] you wouldn't be able to get the > result of merge(X, Y, all.y=TRUE) (results including the column > reordering). This is what I understand from Gabor's post. > > Arun > > On Friday, May 3, 2013 at 6:49 PM, Eduard Antonyan wrote: > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all > of the other merge options already exist as either X[Y] or Y[X] with or > without nomatch = 0/NA. > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Gabor, > > Very true. I suppose your request is that the x[i] where `i` is a > data.table should have the same set of options like R's base `merge` > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by > itself. However, I am not able to think of a way to do this. I mean, I find > the syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even > though > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > reordered columns) the latter 2 don't seem to make sense/is redundant > (maybe it's because I am used to this syntax). 
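
Laid out side by side, the division of labour Eduard is describing looks as follows; the tables are illustrative, and keep.all is his proposed argument, not one that exists:

library(data.table)
X <- data.table(k = c("a","b"), foo = 1:2, key = "k")
Y <- data.table(k = c("b","c"), bar = 3:4, key = "k")

X[Y, nomatch = 0]        # inner join: row b only
X[Y]                     # right outer join: all rows of Y (b, c)
Y[X]                     # left outer join: all rows of X (a, b)
merge(X, Y, all = TRUE)  # full outer join (a, b, c): the one case X[Y] cannot
                         # express, which the proposed keep.all = TRUE would cover
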
> > Arun > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > In my last post it should have read: > > That X[Y] is not the same as Y[X] is analogous to the fact that > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > wrote: > > Assuming same-named keys, then these are all the same except possibly > for row and column order: > > X[Y,,nomatch=0] > Y[X,,nomatch=0] > merge(X, Y) > merge(Y, X) > > That X[Y] is not the same as Y[X] is analogous to the fact that > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > wrote: > > Gabor, > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > to provide the same output (except for the column order and names). In that > sense, a join is a bit different from a merge, no? > > Arun > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > Yes, except that is not really what happens since match() only matches > one row whereas with mult="all", the default, all rows are matched > which is not really matching in the sense of match(). The current > naming confuses matching with joining and its really the latter that > is being done. > > Regarding the existence of merge the advantage of [ is that it will > automatically only take the columns needed so merge is not really > equivalent to [ in all respects. Furthermore having to use different > constructs for different types of merge seems awkward. > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > wrote: > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > tries to match rows of Y with rows of X, and then "nomatch" tells it what > to > do when there is *no match*. > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > wrote: > > > To clarify - that behavior is already implemented in merge (more > specifically merge.data.table). I don't really have a view on having it in > X[Y] as well - I don't like all.x and all.y as the names, since there are > no > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > param that would do a full outer join could certainly be added. > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > wrote: > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > of the same name in the match() function. If the idea of the nomatch= > name was to leverage off existing argument names in R then I would > prefer all.y= to be consistent with merge() in place of nomatch= since > we are really merging/joining rather than just matching. That would > also allow extension to all types of join by adding all.an x= argument > too. > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > wrote: > > I would prefer nomatch=0 as a default though, simply because that's > what I > do most of the time :) > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > wrote: > > > A correction - the param is called "nomatch", not "match". > > This use case seems like smth a user shouldn't really do - in an ideal > world you should have them both keyed by the same-name column. 
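
Concretely, the "ideal world" Eduard describes is a two-line setup; the column names are arbitrary and the rename is shown only to make the point:

library(data.table)
X <- data.table(x = c("a","b","c"), foo = 1:3, key = "x")
Y <- data.table(y = c("b","c","d"), bar = c(4, 2, 3))
setnames(Y, "y", "x")  # give the join column the same name on both sides
setkey(Y, x)
X[Y]                   # the output's first column, labelled x, is now unambiguous
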
> > As is, my view on it is that data.table is correcting the user mistake > of > naming the column in Y - y, instead of x, and so the output makes > sense and > I don't see the need of complicating the behavior by adding more cases > one > has to go through to figure out what the output columns would be. > Similar to > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column > there, would you? > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > wrote: > > > I am moving this discussion which started with mdowle to the list. > > Consider this example slightly modified from the data.table FAQ: > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > out <- X[Y]; out > > x foo bar > 1: b 3 4 > 2: b 4 4 > 3: b 5 4 > 4: c 6 2 > 5: c 7 2 > 6: d NA 3 > > Note that the first column of the output is labelled x even though > the > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > does appear in Y$y so clearly the data is coming from y as opposed to > x . In terms of SQL the above would be written: > > select Y.y as x, ... > > and the need to renamne the first column of out suggests that there > may be a deeper problem here. > > Here are some ideas to address this (they would require changes to > data.table): > > - the default of X[Y,, match=NA] would be changed to a default of > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > in SQL joins. > > - the column name of the first column in the example above would be > changed to y if match=0 but be left at x if match=NA. In the case > that match=0 (the proposed new default) x and y are equal so the > first > column can be validly labelled as x but in the case that match=NA > they > are not so y would be used as the column name. > > - the name match= does seem a bit misleading since R's match only > matches one item in the target whereas in data.table match matches > many if mult="all" and that is the default. Perhaps some thought > should be given to a name change here? > > The above would seem to correspond more closely to R's merge and SQL > join defaults. Any use cases or other comments? > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eduard.antonyan at gmail.com Fri May 3 18:56:07 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 11:56:07 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> Message-ID: yeah, I disagree with this view. I don't think [] should pursue compatibility with merge. On Fri, May 3, 2013 at 11:54 AM, Gabor Grothendieck wrote: > I think that from the viewpoint of compatibility and convenience it > would be best to implement all.x and all.y and not rely on swapping X > and Y. SQLite did something like this (they implemented left join but > not right join based on the idea that all you have to do is swap join > arguments) but the problem with it is that it adds a layer of mental > specification effort if the actual problem is better stated in the > unsupported orientation. > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan > wrote: > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all > of > > the other merge options already exist as either X[Y] or Y[X] with or > without > > nomatch = 0/NA. > > > > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > > wrote: > >> > >> Gabor, > >> > >> Very true. I suppose your request is that the x[i] where `i` is a > >> data.table should have the same set of options like R's base `merge` > >> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by > itself. > >> However, I am not able to think of a way to do this. I mean, I find the > >> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even > though > >> > >> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > >> reordered columns) the latter 2 don't seem to make sense/is redundant > (maybe > >> it's because I am used to this syntax). > >> > >> Arun > >> > >> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > >> > >> In my last post it should have read: > >> > >> That X[Y] is not the same as Y[X] is analogous to the fact that > >> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > >> > >> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > >> wrote: > >> > >> Assuming same-named keys, then these are all the same except possibly > >> for row and column order: > >> > >> X[Y,,nomatch=0] > >> Y[X,,nomatch=0] > >> merge(X, Y) > >> merge(Y, X) > >> > >> That X[Y] is not the same as Y[X] is analogous to the fact that > >> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > >> > >> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > >> wrote: > >> > >> Gabor, > >> > >> X[Y] and Y[X] are not necessarily the same operations (meaning, they > don't > >> *have* to give the same output). However, merge(X,Y) and merge(Y,X) > *have* > >> to provide the same output (except for the column order and names). In > >> that > >> sense, a join is a bit different from a merge, no? > >> > >> Arun > >> > >> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > >> > >> Yes, except that is not really what happens since match() only matches > >> one row whereas with mult="all", the default, all rows are matched > >> which is not really matching in the sense of match(). The current > >> naming confuses matching with joining and its really the latter that > >> is being done. > >> > >> Regarding the existence of merge the advantage of [ is that it will > >> automatically only take the columns needed so merge is not really > >> equivalent to [ in all respects. 
Furthermore having to use different > >> constructs for different types of merge seems awkward. > >> > >> > >> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > >> wrote: > >> > >> Btw the way I think about the "nomatch" name is as follows - normally > X[Y] > >> tries to match rows of Y with rows of X, and then "nomatch" tells it > what > >> to > >> do when there is *no match*. > >> > >> > >> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > >> > >> wrote: > >> > >> > >> To clarify - that behavior is already implemented in merge (more > >> specifically merge.data.table). I don't really have a view on having it > in > >> X[Y] as well - I don't like all.x and all.y as the names, since there > are > >> no > >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some > >> param that would do a full outer join could certainly be added. > >> > >> > >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > >> wrote: > >> > >> > >> Yes, sorry. Its nomatch= which presumably derives from the parameter > >> of the same name in the match() function. If the idea of the nomatch= > >> name was to leverage off existing argument names in R then I would > >> prefer all.y= to be consistent with merge() in place of nomatch= since > >> we are really merging/joining rather than just matching. That would > >> also allow extension to all types of join by adding all.an x= argument > >> too. > >> > >> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > >> wrote: > >> > >> I would prefer nomatch=0 as a default though, simply because that's > >> what I > >> do most of the time :) > >> > >> > >> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > >> > >> wrote: > >> > >> > >> A correction - the param is called "nomatch", not "match". > >> > >> This use case seems like smth a user shouldn't really do - in an ideal > >> world you should have them both keyed by the same-name column. > >> > >> As is, my view on it is that data.table is correcting the user mistake > >> of > >> naming the column in Y - y, instead of x, and so the output makes > >> sense and > >> I don't see the need of complicating the behavior by adding more cases > >> one > >> has to go through to figure out what the output columns would be. > >> Similar to > >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > >> column > >> there, would you? > >> > >> > >> > >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > >> wrote: > >> > >> > >> I am moving this discussion which started with mdowle to the list. > >> > >> Consider this example slightly modified from the data.table FAQ: > >> > >> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > >> Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > >> out <- X[Y]; out > >> > >> x foo bar > >> 1: b 3 4 > >> 2: b 4 4 > >> 3: b 5 4 > >> 4: c 6 2 > >> 5: c 7 2 > >> 6: d NA 3 > >> > >> Note that the first column of the output is labelled x even though > >> the > >> data to produce it comes from y, e.g. "d" in out$x is not in X$x but > >> does appear in Y$y so clearly the data is coming from y as opposed to > >> x . In terms of SQL the above would be written: > >> > >> select Y.y as x, ... > >> > >> and the need to renamne the first column of out suggests that there > >> may be a deeper problem here. > >> > >> Here are some ideas to address this (they would require changes to > >> data.table): > >> > >> - the default of X[Y,, match=NA] would be changed to a default of > >> X[Y,,match=0] so that it corresponds to the defaults in R's merge and > >> in SQL joins. 
> >>
> >> - the column name of the first column in the example above would be
> >> changed to y if match=0 but be left at x if match=NA. In the case
> >> that match=0 (the proposed new default) x and y are equal so the first
> >> column can be validly labelled as x but in the case that match=NA they
> >> are not so y would be used as the column name.
> >>
> >> - the name match= does seem a bit misleading since R's match only
> >> matches one item in the target whereas in data.table match matches
> >> many if mult="all" and that is the default. Perhaps some thought
> >> should be given to a name change here?
> >>
> >> The above would seem to correspond more closely to R's merge and SQL
> >> join defaults. Any use cases or other comments?
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com Fri May 3 19:01:06 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:01:06 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
Message-ID: <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>

I am wondering, if performing X[Y] as a "merge" in correspondence with R's
base "merge", whether the basic idea of "i" becomes confusing. That is,
when "i" is not a data.table in X[i] it indexes by rows. When `i` is a
data.table, instead of the current definition, which is on par with the
subsetting operation that uses `i` (here a data.table) as an index to
subset X and then JOIN both X and Y, we say: here X and Y are data.tables
and we perform a merge. I think this becomes confusing regarding the
purpose of `i`.

Remember that the main purpose of having the X[Y] is to have the
flexibility of using `j` to filter/subset only the desired columns. So,
for example, if you want to get 1 column of Y out of 100 columns when
joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't
go with the traditional definition of merge.

As much as I like the idea of having consistent syntax, I also love the
feature of X[Y, j].
So I'm confused as to how to deal with this. Arun On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote: > I think that from the viewpoint of compatibility and convenience it > would be best to implement all.x and all.y and not rely on swapping X > and Y. SQLite did something like this (they implemented left join but > not right join based on the idea that all you have to do is swap join > arguments) but the problem with it is that it adds a layer of mental > specification effort if the actual problem is better stated in the > unsupported orientation. > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan > wrote: > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of > > the other merge options already exist as either X[Y] or Y[X] with or without > > nomatch = 0/NA. > > > > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > > wrote: > > > > > > Gabor, > > > > > > Very true. I suppose your request is that the x[i] where `i` is a > > > data.table should have the same set of options like R's base `merge` > > > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself. > > > However, I am not able to think of a way to do this. I mean, I find the > > > syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though > > > > > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > > > reordered columns) the latter 2 don't seem to make sense/is redundant (maybe > > > it's because I am used to this syntax). > > > > > > Arun > > > > > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > > > > > In my last post it should have read: > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > > > > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > > > wrote: > > > > > > Assuming same-named keys, then these are all the same except possibly > > > for row and column order: > > > > > > X[Y,,nomatch=0] > > > Y[X,,nomatch=0] > > > merge(X, Y) > > > merge(Y, X) > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > > > > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > > > wrote: > > > > > > Gabor, > > > > > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > > > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > > > to provide the same output (except for the column order and names). In > > > that > > > sense, a join is a bit different from a merge, no? > > > > > > Arun > > > > > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > > > > > Yes, except that is not really what happens since match() only matches > > > one row whereas with mult="all", the default, all rows are matched > > > which is not really matching in the sense of match(). The current > > > naming confuses matching with joining and its really the latter that > > > is being done. > > > > > > Regarding the existence of merge the advantage of [ is that it will > > > automatically only take the columns needed so merge is not really > > > equivalent to [ in all respects. Furthermore having to use different > > > constructs for different types of merge seems awkward. 
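
Arun's X[Y, j] point, made concrete: in the sketch below extra1 and extra2 stand in for his "100 columns" of Y, and because j is evaluated during the join they are never materialised; merge() would build them all first:

library(data.table)
X <- data.table(x = c("a","b","b","c"), foo = 1:4, key = "x")
Y <- data.table(x = c("b","c"), bar = c(10, 20), extra1 = 1:2, extra2 = 3:4, key = "x")

X[Y, list(foo, bar)]            # join and keep only the wanted columns
# merge(X, Y)[, list(foo, bar)] # same two columns, but all of Y's columns
                                # get materialised along the way
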
> > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > wrote: > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what > > > to > > > do when there is *no match*. > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > > wrote: > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > specifically merge.data.table). I don't really have a view on having it in > > > X[Y] as well - I don't like all.x and all.y as the names, since there are > > > no > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > param that would do a full outer join could certainly be added. > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > wrote: > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > of the same name in the match() function. If the idea of the nomatch= > > > name was to leverage off existing argument names in R then I would > > > prefer all.y= to be consistent with merge() in place of nomatch= since > > > we are really merging/joining rather than just matching. That would > > > also allow extension to all types of join by adding all.an x= argument > > > too. > > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > > > wrote: > > > > > > I would prefer nomatch=0 as a default though, simply because that's > > > what I > > > do most of the time :) > > > > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > > > > > wrote: > > > > > > > > > A correction - the param is called "nomatch", not "match". > > > > > > This use case seems like smth a user shouldn't really do - in an ideal > > > world you should have them both keyed by the same-name column. > > > > > > As is, my view on it is that data.table is correcting the user mistake > > > of > > > naming the column in Y - y, instead of x, and so the output makes > > > sense and > > > I don't see the need of complicating the behavior by adding more cases > > > one > > > has to go through to figure out what the output columns would be. > > > Similar to > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > > > column > > > there, would you? > > > > > > > > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > > > wrote: > > > > > > > > > I am moving this discussion which started with mdowle to the list. > > > > > > Consider this example slightly modified from the data.table FAQ: > > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > > out <- X[Y]; out > > > > > > x foo bar > > > 1: b 3 4 > > > 2: b 4 4 > > > 3: b 5 4 > > > 4: c 6 2 > > > 5: c 7 2 > > > 6: d NA 3 > > > > > > Note that the first column of the output is labelled x even though > > > the > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > > > does appear in Y$y so clearly the data is coming from y as opposed to > > > x . In terms of SQL the above would be written: > > > > > > select Y.y as x, ... > > > > > > and the need to renamne the first column of out suggests that there > > > may be a deeper problem here. > > > > > > Here are some ideas to address this (they would require changes to > > > data.table): > > > > > > - the default of X[Y,, match=NA] would be changed to a default of > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > > > in SQL joins. 
> > > > > > - the column name of the first column in the example above would be > > > changed to y if match=0 but be left at x if match=NA. In the case > > > that match=0 (the proposed new default) x and y are equal so the > > > first > > > column can be validly labelled as x but in the case that match=NA > > > they > > > are not so y would be used as the column name. > > > > > > - the name match= does seem a bit misleading since R's match only > > > matches one item in the target whereas in data.table match matches > > > many if mult="all" and that is the default. Perhaps some thought > > > should be given to a name change here? > > > > > > The above would seem to correspond more closely to R's merge and SQL > > > join defaults. Any use cases or other comments? > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com (http://gmail.com) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri May 3 19:03:43 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 3 May 2013 19:03:43 +0200 Subject: [datatable-help] merge/join/match In-Reply-To: <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com> References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com> Message-ID: Where I say "main purpose", it should be "one of the main advantages of having" Arun On Friday, May 3, 2013 at 7:01 PM, Arunkumar Srinivasan wrote: > I am wondering if performing X[Y] as a "merge" in correspondence with R's base "merge", if the basic idea of "i" becomes confusing. That is, when "i" is not a data.table in X[i] it indexes by rows. When `i` is a data.table, instead of the current definition which is in par with the subletting operation that use `i` (here data.table) as an index to subset X and then JOIN both X and Y, we say, here X and Y are data.tables and we perform a merge. 
I think this becomes confusing regarding the purpose of `i`. > > Remember that the main purpose of having the X[Y] is to have the flexibility of using `j` to to filter/subset only the desired columns. So, for example if you want to get 1 column of Y out of 100 columns when joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't go with the traditional definition of merge. > > As much as I like the idea of having consistent syntax, I also love the feature of X[Y, j]. So I'm confused as to how to deal with this. > > Arun > > > On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote: > > > I think that from the viewpoint of compatibility and convenience it > > would be best to implement all.x and all.y and not rely on swapping X > > and Y. SQLite did something like this (they implemented left join but > > not right join based on the idea that all you have to do is swap join > > arguments) but the problem with it is that it adds a layer of mental > > specification effort if the actual problem is better stated in the > > unsupported orientation. > > > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan > > wrote: > > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of > > > the other merge options already exist as either X[Y] or Y[X] with or without > > > nomatch = 0/NA. > > > > > > > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > > > wrote: > > > > > > > > Gabor, > > > > > > > > Very true. I suppose your request is that the x[i] where `i` is a > > > > data.table should have the same set of options like R's base `merge` > > > > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself. > > > > However, I am not able to think of a way to do this. I mean, I find the > > > > syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though > > > > > > > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > > > > reordered columns) the latter 2 don't seem to make sense/is redundant (maybe > > > > it's because I am used to this syntax). > > > > > > > > Arun > > > > > > > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > > > > > > > In my last post it should have read: > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > > > > > > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > > > > wrote: > > > > > > > > Assuming same-named keys, then these are all the same except possibly > > > > for row and column order: > > > > > > > > X[Y,,nomatch=0] > > > > Y[X,,nomatch=0] > > > > merge(X, Y) > > > > merge(Y, X) > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > > > > > > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > > > > wrote: > > > > > > > > Gabor, > > > > > > > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > > > > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > > > > to provide the same output (except for the column order and names). In > > > > that > > > > sense, a join is a bit different from a merge, no? 
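
The asymmetry being contrasted in this exchange can be seen directly on a pair of illustrative keyed tables:

library(data.table)
X <- data.table(k = c("a","b"), foo = 1:2, key = "k")
Y <- data.table(k = c("b","c","d"), bar = c(4, 2, 3), key = "k")

nrow(X[Y])   # 3: the join is driven by Y, so b, c and d all appear
nrow(Y[X])   # 2: driven by X, so only a and b appear
merge(X, Y)  # 1 row (b); merge(Y, X) returns the same row, columns reordered
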
> > > > > > > > Arun > > > > > > > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > > > > > > > Yes, except that is not really what happens since match() only matches > > > > one row whereas with mult="all", the default, all rows are matched > > > > which is not really matching in the sense of match(). The current > > > > naming confuses matching with joining and its really the latter that > > > > is being done. > > > > > > > > Regarding the existence of merge the advantage of [ is that it will > > > > automatically only take the columns needed so merge is not really > > > > equivalent to [ in all respects. Furthermore having to use different > > > > constructs for different types of merge seems awkward. > > > > > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > > wrote: > > > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what > > > > to > > > > do when there is *no match*. > > > > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > > > > wrote: > > > > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > > specifically merge.data.table). I don't really have a view on having it in > > > > X[Y] as well - I don't like all.x and all.y as the names, since there are > > > > no > > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > > param that would do a full outer join could certainly be added. > > > > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > > wrote: > > > > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > > of the same name in the match() function. If the idea of the nomatch= > > > > name was to leverage off existing argument names in R then I would > > > > prefer all.y= to be consistent with merge() in place of nomatch= since > > > > we are really merging/joining rather than just matching. That would > > > > also allow extension to all types of join by adding all.an x= argument > > > > too. > > > > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > > > > wrote: > > > > > > > > I would prefer nomatch=0 as a default though, simply because that's > > > > what I > > > > do most of the time :) > > > > > > > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > > > > > > > wrote: > > > > > > > > > > > > A correction - the param is called "nomatch", not "match". > > > > > > > > This use case seems like smth a user shouldn't really do - in an ideal > > > > world you should have them both keyed by the same-name column. > > > > > > > > As is, my view on it is that data.table is correcting the user mistake > > > > of > > > > naming the column in Y - y, instead of x, and so the output makes > > > > sense and > > > > I don't see the need of complicating the behavior by adding more cases > > > > one > > > > has to go through to figure out what the output columns would be. > > > > Similar to > > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > > > > column > > > > there, would you? > > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > > > > wrote: > > > > > > > > > > > > I am moving this discussion which started with mdowle to the list. 
> > > > > > > > Consider this example slightly modified from the data.table FAQ: > > > > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > > > out <- X[Y]; out > > > > > > > > x foo bar > > > > 1: b 3 4 > > > > 2: b 4 4 > > > > 3: b 5 4 > > > > 4: c 6 2 > > > > 5: c 7 2 > > > > 6: d NA 3 > > > > > > > > Note that the first column of the output is labelled x even though > > > > the > > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > > > > does appear in Y$y so clearly the data is coming from y as opposed to > > > > x . In terms of SQL the above would be written: > > > > > > > > select Y.y as x, ... > > > > > > > > and the need to renamne the first column of out suggests that there > > > > may be a deeper problem here. > > > > > > > > Here are some ideas to address this (they would require changes to > > > > data.table): > > > > > > > > - the default of X[Y,, match=NA] would be changed to a default of > > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > > > > in SQL joins. > > > > > > > > - the column name of the first column in the example above would be > > > > changed to y if match=0 but be left at x if match=NA. In the case > > > > that match=0 (the proposed new default) x and y are equal so the > > > > first > > > > column can be validly labelled as x but in the case that match=NA > > > > they > > > > are not so y would be used as the column name. > > > > > > > > - the name match= does seem a bit misleading since R's match only > > > > matches one item in the target whereas in data.table match matches > > > > many if mult="all" and that is the default. Perhaps some thought > > > > should be given to a name change here? > > > > > > > > The above would seem to correspond more closely to R's merge and SQL > > > > join defaults. Any use cases or other comments? > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > _______________________________________________ > > > > datatable-help mailing list > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > _______________________________________________ > > > > datatable-help mailing list > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. 
> > > > tel: 1-877-GKX-GROUP
> > > > email: ggrothendieck at gmail.com (http://gmail.com)
> >
> > --
> > Statistics & Software Consulting
> > GKX Group, GKX Associates Inc.
> > tel: 1-877-GKX-GROUP
> > email: ggrothendieck at gmail.com (http://gmail.com)
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com Fri May 3 19:09:42 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:09:42 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

The confusion may very well be due to the fact that X[Y] is not just a
subset of X based on X's and Y's key columns, but rather a `join` (both
X's and Y's columns are "visible" and joined). But then, that was itself
due to a feature request, FR #746.

Arun

On Friday, May 3, 2013 at 7:03 PM, Arunkumar Srinivasan wrote:
> Where I say "main purpose", it should be "one of the main advantages of having"
>
> Arun
>
> On Friday, May 3, 2013 at 7:01 PM, Arunkumar Srinivasan wrote:
> > I am wondering if performing X[Y] as a "merge" in correspondence with
> > R's base "merge", if the basic idea of "i" becomes confusing. That is,
> > when "i" is not a data.table in X[i] it indexes by rows. When `i` is a
> > data.table, instead of the current definition which is in par with the
> > subletting operation that use `i` (here data.table) as an index to
> > subset X and then JOIN both X and Y, we say, here X and Y are
> > data.tables and we perform a merge. I think this becomes confusing
> > regarding the purpose of `i`.
> >
> > Remember that the main purpose of having the X[Y] is to have the
> > flexibility of using `j` to to filter/subset only the desired columns.
> > So, for example if you want to get 1 column of Y out of 100 columns
> > when joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it
> > doesn't go with the traditional definition of merge.
> >
> > As much as I like the idea of having consistent syntax, I also love
> > the feature of X[Y, j]. So I'm confused as to how to deal with this.
> >
> > Arun
> >
> > On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
> > > I think that from the viewpoint of compatibility and convenience it
> > > would be best to implement all.x and all.y and not rely on swapping X
> > > and Y. SQLite did something like this (they implemented left join but
> > > not right join based on the idea that all you have to do is swap join
> > > arguments) but the problem with it is that it adds a layer of mental
> > > specification effort if the actual problem is better stated in the
> > > unsupported orientation.
> > >
> > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan
> > > wrote:
> > > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of
> > > > the other merge options already exist as either X[Y] or Y[X] with or without
> > > > nomatch = 0/NA.
> > > >
> > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan
> > > > wrote:
> > > > > Gabor,
> > > > >
> > > > > Very true. I suppose your request is that the x[i] where `i` is a
> > > > > data.table should have the same set of options like R's base `merge`
> > > > > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself.
> > > > > However, I am not able to think of a way to do this. I mean, I find the
> > > > > syntax X[Y, by.x=TRUE] weird / not making sense.
That is, to me even though > > > > > > > > > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > > > > > reordered columns) the latter 2 don't seem to make sense/is redundant (maybe > > > > > it's because I am used to this syntax). > > > > > > > > > > Arun > > > > > > > > > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > > > > > > > > > In my last post it should have read: > > > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > > > > > > > > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > Assuming same-named keys, then these are all the same except possibly > > > > > for row and column order: > > > > > > > > > > X[Y,,nomatch=0] > > > > > Y[X,,nomatch=0] > > > > > merge(X, Y) > > > > > merge(Y, X) > > > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > > > > > > > > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > > > > > wrote: > > > > > > > > > > Gabor, > > > > > > > > > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > > > > > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > > > > > to provide the same output (except for the column order and names). In > > > > > that > > > > > sense, a join is a bit different from a merge, no? > > > > > > > > > > Arun > > > > > > > > > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > > > > > > > > > Yes, except that is not really what happens since match() only matches > > > > > one row whereas with mult="all", the default, all rows are matched > > > > > which is not really matching in the sense of match(). The current > > > > > naming confuses matching with joining and its really the latter that > > > > > is being done. > > > > > > > > > > Regarding the existence of merge the advantage of [ is that it will > > > > > automatically only take the columns needed so merge is not really > > > > > equivalent to [ in all respects. Furthermore having to use different > > > > > constructs for different types of merge seems awkward. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what > > > > > to > > > > > do when there is *no match*. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > > > > > > wrote: > > > > > > > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > > > specifically merge.data.table). I don't really have a view on having it in > > > > > X[Y] as well - I don't like all.x and all.y as the names, since there are > > > > > no > > > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > > > param that would do a full outer join could certainly be added. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > > > of the same name in the match() function. 
> > > > > If the idea of the nomatch= name was to leverage off existing
> > > > > argument names in R then I would prefer all.y=, to be consistent with
> > > > > merge(), in place of nomatch=, since we are really merging/joining
> > > > > rather than just matching. That would also allow extension to all
> > > > > types of join by adding an all.x= argument too.
> > > > >
> > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan wrote:
> > > > >
> > > > > I would prefer nomatch=0 as a default though, simply because that's
> > > > > what I do most of the time :)
> > > > >
> > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan wrote:
> > > > >
> > > > > A correction - the param is called "nomatch", not "match".
> > > > >
> > > > > This use case seems like smth a user shouldn't really do - in an ideal
> > > > > world you should have them both keyed by the same-name column.
> > > > >
> > > > > As is, my view on it is that data.table is correcting the user mistake of
> > > > > naming the column in Y - y, instead of x - and so the output makes sense,
> > > > > and I don't see the need of complicating the behavior by adding more cases
> > > > > one has to go through to figure out what the output columns would be.
> > > > > Similar to asking for X[J(c("b", "c", "d"))] - you wouldn't want an
> > > > > anonymous column there, would you?
> > > > >
> > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck wrote:
> > > > >
> > > > > I am moving this discussion which started with mdowle to the list.
> > > > >
> > > > > --
> > > > > Statistics & Software Consulting
> > > > > GKX Group, GKX Associates Inc.
> > > > > tel: 1-877-GKX-GROUP
> > > > > email: ggrothendieck at gmail.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Fri May  3 19:14:03 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:14:03 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
 <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

Gabor,

I agree partially with your post in that, since X[Y] *is* a join (/merge), it could also give a "full" join. So,

X[Y]           <~~~ current usage, equivalent to merge(X, Y, by.y=TRUE)
X[Y, all=TRUE] <~~~ equivalent to merge(X, Y, all=TRUE)

Similarly,

Y[X]           <~~~ current usage, equivalent to merge(Y, X, by.x=TRUE)
Y[X, all=TRUE] <~~~ equivalent to merge(Y, X, all=TRUE)

But X[Y, all.x=TRUE] and Y[X, all.y=TRUE] don't make sense to me, as the operation is clear in that you use Y as an index. What do you think?

Arun

On Friday, May 3, 2013 at 7:09 PM, Arunkumar Srinivasan wrote:
> The confusion may very well be due to the fact that X[Y] is not just a subset of X based on X and Y's key columns, but rather a `join` (both X and Y's columns are "visible" and joined). But then that was by itself due to a feature request, FR #746.
>
> Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Fri May  3 19:23:04 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:23:04 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: 
Message-ID: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>

Eddi, still you can add this line to the top of every R script or, even better, directly in the .Rprofile file wherever you run.

Arun

On Friday, May 3, 2013 at 5:51 PM, Eduard Antonyan wrote:
> Good point - I might do that, though I'll need to be a bit careful as I run a lot of scripts on remote computers.
>
> On Fri, May 3, 2013 at 10:48 AM, Arunkumar Srinivasan wrote:
> > Eddi,
> >
> > You could just set: options(datatable.nomatch = 0) if you use that extensively.
> >
> > Arun

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From eduard.antonyan at gmail.com  Fri May  3 19:41:24 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 12:41:24 -0500
Subject: [datatable-help] merge/join/match
In-Reply-To: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>
References: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>
Message-ID: 

I don't like putting it into .Rprofile that much, because then I won't be able to share code. I might put it in scripts where I do this a lot inside a single script, but the situation is more like many little scripts that have one or maybe two merges inside, and it's currently less work to write nomatch=0 than to change the option at the top. .Rprofile would've been a decent solution if there wasn't the code-sharing constraint.

It's a mess either way, what can I say :)

On Fri, May 3, 2013 at 12:23 PM, Arunkumar Srinivasan wrote:
> Eddi, still you can add this line to the top of every R script or, even better, directly in the .Rprofile file wherever you run.
>
> Arun

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Fri May  3 19:48:30 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:48:30 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>
Message-ID: 

Mess, yes indeed! :) If you've got too many little scripts lying around, then maybe it's time to make a package (assuming they're related, or at least some independent functionalities / utility scripts) and use the options parameter once in one file? That was my last dose of ideas. I'm all out now :)

Arun

On Friday, May 3, 2013 at 7:41 PM, Eduard Antonyan wrote:
> I don't like putting it into .Rprofile that much, because then I won't be able to share code.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
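Arun's suggestion in script form - a minimal sketch, assuming (as the thread does) that [.data.table reads its default for nomatch from options("datatable.nomatch"):

library(data.table)

X <- data.table(k = c("a", "b"), v = 1:2, key = "k")
Y <- data.table(k = c("b", "d"), key = "k")

X[Y]                             # shipped default nomatch=NA: "d" kept with v = NA
options(datatable.nomatch = 0)   # e.g. at the top of a script, or in .Rprofile
X[Y]                             # now behaves like X[Y, nomatch = 0]: "d" dropped
options(datatable.nomatch = NA)  # restore the shipped default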
From ggrothendieck at gmail.com  Fri May  3 22:42:06 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 16:42:06 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
 <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

One can view data.table's generalization of indexing as the
realization that all indexing can conceptually be viewed as merging:
indexing with numeric values corresponds to merging with the
data.table's row numbers, and indexing with logical values, L, is
equivalent to merging with which(L). So there are really not two
types, indexing and merging, but just one type - merging - that
covers them all.

On Fri, May 3, 2013 at 1:01 PM, Arunkumar Srinivasan wrote:
> I am wondering whether, if we perform X[Y] as a "merge" in correspondence
> with R's base "merge", the basic idea of "i" becomes confusing.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
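Gabor's observation can be made concrete with a sketch. The table below is hypothetical: rn is an explicit row-number column used as the key, purely so that the merge form is expressible; J() is data.table's alias for list(), used to build the join table:

library(data.table)

DT <- data.table(rn = 1:5, v = letters[1:5], key = "rn")

DT[c(2L, 4L)]     # ordinary numeric indexing
DT[J(c(2L, 4L))]  # the same rows, obtained by merging with row numbers

L <- DT$v %in% c("b", "e")
DT[L]             # ordinary logical indexing
DT[J(which(L))]   # the same rows, obtained by merging with which(L)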
From ggrothendieck at gmail.com  Sat May  4 00:41:00 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 18:41:00 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
 <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

In thinking about this a bit more I can see the argument for leaving
the default at nomatch=NA. Consider these examples of indexing:

> letters[27]
[1] NA

> BOD[7,]
   Time demand
NA   NA     NA

nomatch=NA seems more compatible with these examples than nomatch=0.

(At the same time, this does not mean we could not also change the
argument name from nomatch= to all.y= and add the other merge
arguments (all.x=, by.x=, by.y=, by=) as well, since it remains the
case that R's merge() seems closer than R's match() to this
functionality regardless of the default.)

On Fri, May 3, 2013 at 4:42 PM, Gabor Grothendieck wrote:
> One can view data.table's generalization of indexing as the
> realization that all indexing can conceptually be viewed as merging.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
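The data.table analogue of those base R examples, as a sketch (X here is a small keyed table in the style of the FAQ example quoted earlier in the thread):

library(data.table)

X <- data.table(k = c("a", "b", "c"), foo = 1:3, key = "k")

X[J("d")]               # default nomatch=NA: one row with foo = NA, like letters[27]
X[J("d"), nomatch = 0]  # empty result, like the merge()/SQL inner-join default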
From ggrothendieck at gmail.com  Sat May  4 00:50:28 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 18:50:28 -0400
Subject: [datatable-help] indexing with nomatch=0
Message-ID: 

Consider this example:

> DT[1:4,,nomatch=0]
    a
1:  a
2:  b
3:  c
4: NA

Should it not return only the first 3 rows? It seems to be ignoring
the nomatch=0.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From ggrothendieck at gmail.com  Sat May  4 00:52:31 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 18:52:31 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

The definition of DT was left out by mistake. It should be:

DT <- data.table(a=letters[1:3])

On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck wrote:
> Consider this example:
>
> > DT[1:4,,nomatch=0]
>     a
> 1:  a
> 2:  b
> 3:  c
> 4: NA
>
> Should it not return only the first 3 rows? It seems to be ignoring
> the nomatch=0.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From eduard.antonyan at gmail.com  Sat May  4 00:54:33 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 17:54:33 -0500
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

There is no join'ing happening here, thus nomatch=0 has no effect.

On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck wrote:
> The definition of DT was left out by mistake. It should be:
>
> DT <- data.table(a=letters[1:3])

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From lianoglou.steve at gene.com  Sat May  4 01:00:28 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 3 May 2013 16:00:28 -0700
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

On Fri, May 3, 2013 at 3:54 PM, Eduard Antonyan wrote:
> There is no join'ing happening here, thus nomatch=0 has no effect.

Indeed -- and the result is consistent with doing the same thing to a
base::data.frame:

R> df <- data.frame(a=letters[1:3])
R> df[1:4,,drop=FALSE]
    a
1 | a
2 | b
3 | c
NA| NA

(my print.data.frame function is a monkey-patched version of the
print.data.table function, which is why the output looks a bit
different than what you're likely to see)

-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech

From ggrothendieck at gmail.com  Sat May  4 01:54:50 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 19:54:50 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

data.table is supposed to generalize indexing, and although not
explicitly stated, the generalization seems to be that indexing is
merging with the row numbers, so there is indeed merging going on and
that merging should respect nomatch= for consistency.

On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan wrote:
> There is no join'ing happening here, thus nomatch=0 has no effect.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
From ggrothendieck at gmail.com  Sat May  4 01:55:15 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 19:55:15 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

That is only relevant if nomatch=NA.

On Fri, May 3, 2013 at 7:00 PM, Steve Lianoglou wrote:
> Indeed -- and the result is consistent with doing the same thing to a
> base::data.frame.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From eduard.antonyan at gmail.com  Sat May  4 02:02:39 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 19:02:39 -0500
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

I think I like this proposal - maybe you should write up a few examples
of what current behavior is vs. the proposed behavior.

On Fri, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
> data.table is supposed to generalize indexing, and although not
> explicitly stated, the generalization seems to be that indexing is
> merging with the row numbers, so there is indeed merging going on and
> that merging should respect nomatch= for consistency.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
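Such a write-up could start from a sketch like the following. The "proposed" behaviour does not exist in data.table, so it is emulated here by filtering the subscript by hand:

library(data.table)

DT <- data.table(a = letters[1:3])

DT[1:4, , nomatch = 0]                 # current: nomatch= has no effect on numeric i,
                                       # so a 4th all-NA row is returned
DT[intersect(1:4, seq_len(nrow(DT)))]  # emulated "proposed" nomatch=0: out-of-range
                                       # subscripts dropped, only 3 rows returned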
> > On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan > wrote: > > There is no join'ing happening here, thus nomatch=0 has no effect. > > > > > > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck < > ggrothendieck at gmail.com> > > wrote: > >> > >> The definition of DT was left out by mistake. It should be: > >> > >> DT <- data.table(a=letters[1:3]) > >> > >> > >> On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck > >> wrote: > >> > Consider this example: > >> > > >> >> DT[1:4,,nomatch=0] > >> > a > >> > 1: a > >> > 2: b > >> > 3: c > >> > 4: NA > >> > > >> > Should it not return only the first 3 rows? It seems to be ignoring > >> > the nomatch=0. > >> > > >> > -- > >> > Statistics & Software Consulting > >> > GKX Group, GKX Associates Inc. > >> > tel: 1-877-GKX-GROUP > >> > email: ggrothendieck at gmail.com > >> > >> > >> > >> -- > >> Statistics & Software Consulting > >> GKX Group, GKX Associates Inc. > >> tel: 1-877-GKX-GROUP > >> email: ggrothendieck at gmail.com > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat May 4 02:20:07 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 4 May 2013 02:20:07 +0200 Subject: [datatable-help] =?utf-8?q?indexing_with_nomatch=3D0?= In-Reply-To: References: Message-ID: "Indexing is merging with row numbers, so indeed there's a merging going on" - I hadn't seen it this way until now. But I like this. I see why you expect `nomatch=0` to work on indexing as well. And it makes sense to me. But I am not so much inclined towards the implementation of `merge`-like operations in X[Y] syntax. I'd love to be convinced. I just can't get my mind around the usage X[Y, all.X = TRUE] and even more X[Y, list(2 columns of X, 1 column of Y), all.X=TRUE]. I could just do Y[X, ?] which makes more sense here. I am unable to wrap my head around the need for this feature... Arun On Saturday, May 4, 2013 at 1:54 AM, Gabor Grothendieck wrote: > data.table is supposed to generalize indexing and although not > explicitly stated the generalization seems to be that indexing is > merging with the row numbers so there is indeed merging going on and > that merging should respect nomatch= for consistency. > > On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan > wrote: > > There is no join'ing happening here, thus nomatch=0 has no effect. > > > > > > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck > > wrote: > > > > > > The definition of DT was left out by mistake. It should be: > > > > > > DT <- data.table(a=letters[1:3]) > > > > > > > > > On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck > > > wrote: > > > > Consider this example: > > > > > > > > > DT[1:4,,nomatch=0] > > > > a > > > > 1: a > > > > 2: b > > > > 3: c > > > > 4: NA > > > > > > > > Should it not return only the first 3 rows? It seems to be ignoring > > > > the nomatch=0. > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. 
From ggrothendieck at gmail.com  Sat May  4 04:18:33 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 22:18:33 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To:
References:
Message-ID:

On Fri, May 3, 2013 at 8:20 PM, Arunkumar Srinivasan wrote:
> "Indexing is merging with row numbers, so indeed there's a merging going
> on" - I hadn't seen it this way until now. But I like this. I see why you
> expect `nomatch=0` to work on indexing as well. And it makes sense to me.
>
> But I am not so much inclined towards the implementation of `merge`-like
> operations in X[Y] syntax. I'd love to be convinced. I just can't get my
> mind around the usage X[Y, all.X = TRUE] and even more X[Y, list(2 columns
> of X, 1 column of Y), all.X=TRUE]. I could just do Y[X, ...] which makes
> more sense here. I am unable to wrap my head around the need for this
> feature...

I think many people find data.table confusing until they put substantial
time into it, and if one can leverage their existing knowledge of R then
it should be easier to understand. all.y= would have the exact same
meaning in merge and in [.data.table, so one would immediately know what
to expect if one knew merge. I don't think the same can be said for
nomatch since match() is not really the same thing as merge.

The downsides seem to be:

- It does seem that in order to be consistent with how subscripting works,
all.y = TRUE would need to be the default for data.table whereas all.y =
FALSE is the default for merge.

- all.y seems important to have but all.x is less important, although it
might be included for completeness and symmetry even if less useful.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
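The claimed correspondence between the two spellings, runnable as a quick
check (hypothetical keyed tables):

library(data.table)
X <- data.table(id = c("a", "b", "c"), v = 1:3, key = "id")
Y <- data.table(id = c("b", "c", "d"), w = 4:6, key = "id")
merge(X, Y)        # inner join: rows b and c only
X[Y,,nomatch=0]    # same rows: behaves like merge(X, Y), i.e. all.y=FALSE
X[Y]               # behaves like merge(X, Y, all.y=TRUE): d kept, v is NA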
From ggrothendieck at gmail.com  Sat May  4 11:46:10 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 4 May 2013 05:46:10 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To:
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
	<9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID:

One further comment on nomatch=0 weirdness. It seems that the value
of nomatch= is the row index of the row of X to return if a row in Y
matches no row in X here: X[Y,,nomatch=?]  In ordinary R indexing,
using an index value of 0 means drop the corresponding component and
NA means return an NA. nomatch=1 would presumably return the first
row of X for non-matching rows of Y but, in fact, nomatch= seems to be
restricted to 0 and NA as any other value generates an error message
to this effect. Likely it was decided that values other than 0 and NA
would be too bizarre and most likely represent user error. If all.y=
were used then it would naturally be logical and this artificial
distinction (i.e. between 0/NA on the one hand and everything else on
the other) would not have to be made.

On Fri, May 3, 2013 at 6:41 PM, Gabor Grothendieck wrote:
> In thinking about this a bit more I can see the argument for leaving
> the default at nomatch=NA. Consider these examples of indexing:
>
>> letters[27]
> [1] NA
>> BOD[7,]
>    Time demand
> NA   NA     NA
>
> nomatch=NA seems more compatible with these examples than nomatch=0.
>
> (At the same time this does not mean we could not also change the
> argument name from nomatch= to all.y= and add the other merge
> arguments (all.x=, by.x=, by.y=, by=) as well since it remains the
> case that R's merge() seems closer than R's match() to this
> functionality regardless of the default.)
>
> On Fri, May 3, 2013 at 4:42 PM, Gabor Grothendieck wrote:
>> One can view data.table's generalization of indexing as the
>> realization that all indexing can conceptually be viewed as merging,
>> where indexing with numeric values corresponds to merging with the
>> data.table's row numbers and indexing with logical values, L, is
>> equivalent to merging with which(L), so there are really not two types,
>> indexing and merging, but just one type, merging, that covers them all.
>>
>> On Fri, May 3, 2013 at 1:01 PM, Arunkumar Srinivasan wrote:
>>> I am wondering, if we perform X[Y] as a "merge" in correspondence with
>>> R's base "merge", whether the basic idea of "i" becomes confusing. That
>>> is, when "i" is not a data.table in X[i] it indexes by rows. When `i` is
>>> a data.table, instead of the current definition, which is on par with
>>> the subsetting operation that uses `i` (here a data.table) as an index
>>> to subset X and then JOIN both X and Y, we say, here X and Y are
>>> data.tables and we perform a merge. I think this becomes confusing
>>> regarding the purpose of `i`.
>>>
>>> Remember that the main purpose of having the X[Y] is to have the
>>> flexibility of using `j` to filter/subset only the desired columns. So,
>>> for example, if you want to get 1 column of Y out of 100 columns when
>>> joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it
>>> doesn't go with the traditional definition of merge.
>>>
>>> As much as I like the idea of having consistent syntax, I also love the
>>> feature of X[Y, j]. So I'm confused as to how to deal with this.
>>>
>>> Arun
>>>
>>> On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
>>>
>>> I think that from the viewpoint of compatibility and convenience it
>>> would be best to implement all.x and all.y and not rely on swapping X
>>> and Y. SQLite did something like this (they implemented left join but
>>> not right join based on the idea that all you have to do is swap join
>>> arguments) but the problem with it is that it adds a layer of mental
>>> specification effort if the actual problem is better stated in the
>>> unsupported orientation.
>>>
>>> On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan wrote:
>>>
>>> Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all
>>> of the other merge options already exist as either X[Y] or Y[X] with or
>>> without nomatch = 0/NA.
>>>
>>> On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan wrote:
>>>
>>> Gabor,
>>>
>>> Very true. I suppose your request is that the x[i] where `i` is a
>>> data.table should have the same set of options as R's base `merge`
>>> function, like by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by
>>> itself. However, I am not able to think of a way to do this. I mean, I
>>> find the syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me,
>>> even though X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE]
>>> (ignoring the reordered columns), the latter 2 don't seem to make sense
>>> / seem redundant (maybe it's because I am used to this syntax).
>>>
>>> Arun
>>>
>>> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote:
>>>
>>> In my last post it should have read:
>>>
>>> That X[Y] is not the same as Y[X] is analogous to the fact that
>>> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)
>>>
>>> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck wrote:
>>>
>>> Assuming same-named keys, then these are all the same except possibly
>>> for row and column order:
>>>
>>> X[Y,,nomatch=0]
>>> Y[X,,nomatch=0]
>>> merge(X, Y)
>>> merge(Y, X)
>>>
>>> That X[Y] is not the same as Y[X] is analogous to the fact that
>>> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
>>>
>>> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan wrote:
>>>
>>> Gabor,
>>>
>>> X[Y] and Y[X] are not necessarily the same operations (meaning, they
>>> don't *have* to give the same output). However, merge(X,Y) and merge(Y,X)
>>> *have* to provide the same output (except for the column order and
>>> names). In that sense, a join is a bit different from a merge, no?
>>>
>>> Arun
>>>
>>> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
>>>
>>> Yes, except that is not really what happens since match() only matches
>>> one row whereas with mult="all", the default, all rows are matched,
>>> which is not really matching in the sense of match(). The current
>>> naming confuses matching with joining and it's really the latter that
>>> is being done.
>>>
>>> Regarding the existence of merge, the advantage of [ is that it will
>>> automatically only take the columns needed, so merge is not really
>>> equivalent to [ in all respects. Furthermore, having to use different
>>> constructs for different types of merge seems awkward.
>>>
>>> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan wrote:
>>>
>>> Btw the way I think about the "nomatch" name is as follows - normally
>>> X[Y] tries to match rows of Y with rows of X, and then "nomatch" tells
>>> it what to do when there is *no match*.
>>>
>>> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan wrote:
>>>
>>> To clarify - that behavior is already implemented in merge (more
>>> specifically merge.data.table). I don't really have a view on having it
>>> in X[Y] as well - I don't like all.x and all.y as the names, since there
>>> are no params named 'x' and 'y' in [.data.table (as opposed to merge),
>>> but some param that would do a full outer join could certainly be added.
>>>
>>> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck wrote:
>>>
>>> Yes, sorry. It's nomatch=, which presumably derives from the parameter
>>> of the same name in the match() function. If the idea of the nomatch=
>>> name was to leverage off existing argument names in R then I would
>>> prefer all.y=, to be consistent with merge(), in place of nomatch= since
>>> we are really merging/joining rather than just matching. That would
>>> also allow extension to all types of join by adding an all.x= argument
>>> too.
>>>
>>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan wrote:
>>>
>>> I would prefer nomatch=0 as a default though, simply because that's
>>> what I do most of the time :)
>>>
>>> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan wrote:
>>>
>>> A correction - the param is called "nomatch", not "match".
>>>
>>> This use case seems like smth a user shouldn't really do - in an ideal
>>> world you should have them both keyed by the same-name column.
>>>
>>> As is, my view on it is that data.table is correcting the user mistake
>>> of naming the column in Y - y, instead of x, and so the output makes
>>> sense and I don't see the need of complicating the behavior by adding
>>> more cases one has to go through to figure out what the output columns
>>> would be. Similar to asking for X[J(c("b", "c", "d"))] - you wouldn't
>>> want an anonymous column there, would you?
>>>
>>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck wrote:
>>>
>>> I am moving this discussion which started with mdowle to the list.
>>>
>>> Consider this example slightly modified from the data.table FAQ:
>>>
>>> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>>> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>>> out <- X[Y]; out
>>>
>>>    x foo bar
>>> 1: b   3   4
>>> 2: b   4   4
>>> 3: b   5   4
>>> 4: c   6   2
>>> 5: c   7   2
>>> 6: d  NA   3
>>>
>>> Note that the first column of the output is labelled x even though the
>>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
>>> does appear in Y$y, so clearly the data is coming from y as opposed to
>>> x. In terms of SQL the above would be written:
>>>
>>> select Y.y as x, ...
>>>
>>> and the need to rename the first column of out suggests that there
>>> may be a deeper problem here.
>>>
>>> Here are some ideas to address this (they would require changes to
>>> data.table):
>>>
>>> - the default of X[Y,,match=NA] would be changed to a default of
>>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>>> in SQL joins.
>>>
>>> - the column name of the first column in the example above would be
>>> changed to y if match=0 but be left at x if match=NA. In the case
>>> that match=0 (the proposed new default) x and y are equal so the first
>>> column can be validly labelled as x, but in the case that match=NA they
>>> are not, so y would be used as the column name.
>>>
>>> - the name match= does seem a bit misleading since R's match only
>>> matches one item in the target whereas in data.table match matches
>>> many if mult="all" and that is the default. Perhaps some thought
>>> should be given to a name change here?
>>>
>>> The above would seem to correspond more closely to R's merge and SQL
>>> join defaults. Any use cases or other comments?

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
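For reference, the FAQ example the quoted thread keeps coming back to,
runnable as-is:

library(data.table)
X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
X[Y]              # first column is named x, yet its "d" comes from Y$y
X[Y,,nomatch=0]   # inner-join rows only; x and y then agree, so the label is safe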
From ggrothendieck at gmail.com  Sat May  4 13:26:18 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 4 May 2013 07:26:18 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To:
References:
Message-ID:

The proposal at this point would be:

1. nomatch= would be replaced by all.i= such that
X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE]
X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE]
nomatch= would be deprecated and ultimately removed.

Note that #1 is simple to implement as it only involves changing names
and values of arguments and does not really change any behavior;
however, it's easier to think about because X[Y,,all.i=Z] now has the
same behavior as merge(X, Y, all.y=Z) and so can be quickly understood
by anyone who knows merge in R. In contrast, nomatch= did not even
have the same meaning as in match() since match matches the first
occurrence whereas with mult="all", the default, matching in
data.table matches all occurrences. Note that the default of merge's
all.y= is all.y=FALSE but the default of all.i= is all.i=TRUE in order
that the default behave as indices do. Also note that this solves the
problem that nomatch= can only be 0 or NA since a logical can only
have two non-NA values anyway.

2. If Y were a numeric index vector then all.i= will have the same
effect as if Y were a data.table with Y as its column and is merged
with the row numbers of X. e.g. X[1:4,,all.i=FALSE] would be the
same as X[1:3] if X only had 3 rows, since 4 does not match a row
number of X and is dropped because all.i=FALSE. If Y were a numeric
vector with negative values it would be converted to one with positive
values in such a way as to have the established meaning and then the
same strategy is applied. If Y were logical then it's recycled, giving
YY, and the same strategy is applied to which(YY). This description is
intended to be conceptual and the actual internal mechanism could be
different.

Thus #2 allows one to think of **all** i indexing as merging rather
than as multiple separate concepts (which I believe is consistent with
the original intention of data.table).

On Fri, May 3, 2013 at 8:02 PM, Eduard Antonyan wrote:
> I think I like this proposal - maybe you should write up a few examples
> of what current behavior is vs the proposed behavior.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
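The proposal in runnable terms. all.i= is hypothetical and exists in no
data.table release, so the proposed calls are left commented; only the
current-behavior lines execute:

library(data.table)
DT <- data.table(a = letters[1:3], key = "a")
DT[1:4]                    # today: 4th row all NA, like nomatch=NA
DT[1:4, , nomatch = 0]     # today: nomatch ignored for row numbers
# DT[1:4, , all.i = FALSE] # proposed: unmatched row number 4 dropped -> DT[1:3]
# X[Y, , all.i = TRUE]     # proposed spelling of X[Y, , nomatch = NA]
# X[Y, , all.i = FALSE]    # proposed spelling of X[Y, , nomatch = 0]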
From aragorn168b at gmail.com  Sat May  4 13:35:33 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 4 May 2013 13:35:33 +0200
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To:
References:
Message-ID: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com>

Gabor,
Both points I agree with. It brings enough clarity and consistency to the
syntax.
Does this mean that you don't mind X[Y] not having all functionalities of
`merge`? Because this takes care of the confusion of `nomatch` but still
does not do all merges, iiuc.

Arun

On Saturday, May 4, 2013 at 1:26 PM, Gabor Grothendieck wrote:
> The proposal at this point would be:
>
> 1. nomatch= would be replaced by all.i= such that
> X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE]
> X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE]
> nomatch= would be deprecated and ultimately removed.
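The gap Arun is pointing at, as a sketch (hypothetical tables;
merge.data.table already supports all=, while X[Y] covers only the
one-sided cases):

library(data.table)
X <- data.table(id = c("a", "b"), v = 1:2, key = "id")
Y <- data.table(id = c("b", "c"), w = 3:4, key = "id")
merge(X, Y, all = TRUE)  # full outer join: rows a, b, c
X[Y]                     # right-style join: rows b, c
Y[X]                     # left-style counterpart: rows a, b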
From ggrothendieck at gmail.com  Sat May  4 13:40:41 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 4 May 2013 07:40:41 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com>
References: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com>
Message-ID:

I am not sure but I think that could be handled as a separate issue if
it becomes important. By using all.i= it makes it sufficiently
different from all.y= that users won't expect the same default, and
further they will not necessarily expect that there be an all argument
for the left participant in the merge.

On Sat, May 4, 2013 at 7:35 AM, Arunkumar Srinivasan wrote:
> Gabor,
> Both points I agree with. It brings enough clarity and consistency to the
> syntax.
> Does this mean that you don't mind X[Y] not having all functionalities of
> `merge`? Because this takes care of the confusion of `nomatch` but still
> does not do all merges, iiuc.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
> > Does this mean that you don't mind X[Y] not having all functionalities of > > `merge`? Because this takes care of the confusion of `nomatch` but still > > does not do all merges, iiuc. > > > > Arun > > > > On Saturday, May 4, 2013 at 1:26 PM, Gabor Grothendieck wrote: > > > > The proposal at this point would be: > > > > 1. nomatch= would be replaced by all.i= such that > > X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE] > > X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE] > > nomatch= would be deprecated and ultimately removed. > > > > Note that #1 is simple to implement as it only involves changing names > > and values of arguments and does not really change any behavior; > > however, its easier to think about because X[Y,,all.i=Z] now has the > > same behavior as merge(X, Y, all.y=Z) and so can be quickly understood > > by anyone who knows merge in R. In contrast nomatch= did not even > > have the same meaning as in match() since match matches the first > > occurrence whereas with mult="all", the default, matching in > > data.table matches all occurrences. Note that the default of merge's > > all.y= is all.y=FALSE but the default of all.i= is all.i=TRUE in order > > that the default behave as indices do. Also note that this solves the > > problem that nomatch= can only be 0 or NA since a logical can only > > have two non-NA values anyways. > > > > 2. If Y were a numeric index vector then all.i= will have the same > > effect as if Y were a data.table with Y as its column and is merged > > with the row numbers of X. e.g. X[1:4,,all.i=FALSE] would be the > > same as X[1:3] if X only had 3 rows since 4 does not match a row > > number of X and is dropped because all.i=FALSE. If Y were a numeric > > vector with negative values it would be converted to one with positive > > values in such a way as to have the established meaning and then the > > same strategy is applied. If Y were logical then its recycled giving > > YY and the same strategy is applied to which(YY). This description is > > intended to be conceptual and the actual internal mechanism could be > > different. > > > > Thus #2 allows one to think of **all** i indexing as merging rather > > than as multiple separate concepts (which I believe is consistent with > > the original intention of data.table). > > > > > > > > > > > > > > On Fri, May 3, 2013 at 8:02 PM, Eduard Antonyan > > wrote: > > > > I think I like this proposal - maybe you should write up a few examples of > > what current behavior is, vs the proposed behavior. > > > > > > On Fri, May 3, 2013 at 6:54 PM, Gabor Grothendieck > > wrote: > > > > > > data.table is supposed to generalize indexing and although not > > explicitly stated the generalization seems to be that indexing is > > merging with the row numbers so there is indeed merging going on and > > that merging should respect nomatch= for consistency. > > > > On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan > > wrote: > > > > There is no join'ing happening here, thus nomatch=0 has no effect. > > > > > > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck > > > > wrote: > > > > > > The definition of DT was left out by mistake. It should be: > > > > DT <- data.table(a=letters[1:3]) > > > > > > On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck > > wrote: > > > > Consider this example: > > > > DT[1:4,,nomatch=0] > > > > a > > 1: a > > 2: b > > 3: c > > 4: NA > > > > Should it not return only the first 3 rows? It seems to be ignoring > > the nomatch=0. 
> > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com (http://gmail.com) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karl at huftis.org Sat May 4 14:44:10 2013 From: karl at huftis.org (Karl Ove Hufthammer) Date: Sat, 04 May 2013 14:44:10 +0200 Subject: [datatable-help] indexing with nomatch=0 In-Reply-To: References: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com> Message-ID: <1367671450.6119.2.camel@linux-qcrw.site> la. den 04. 05. 2013 klokka 07.40 (-0400) skreiv Gabor Grothendieck: > I am not sure but I think that could be handled as a separate issue if > it becomes important. By using all.i= it makes it sufficiently > different from all.y= that users won't expect the same default and > further they will not necessarily expect that there be an all argument > for the left participant in the merge. But won?t ?all? (e.g., ?all=TRUE?) automatically match ?all.i?, while at the same time not give the same result as ?all? in ?merge?? That could be confusing. -- Karl Ove Hufthammer From ggrothendieck at gmail.com Sat May 4 15:07:52 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sat, 4 May 2013 09:07:52 -0400 Subject: [datatable-help] indexing with nomatch=0 In-Reply-To: <1367671450.6119.2.camel@linux-qcrw.site> References: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com> <1367671450.6119.2.camel@linux-qcrw.site> Message-ID: The current proposal does not have an all= argument so there is no conflict. Suppose all.x= were later added to [.data.table. The all= argument in merge provides the default value for both all.x= and all.y= and all= itself is set to have a default of all=FALSE; however, for data.table if there were an all.x= added then it would have the default all.x=FALSE while all.i= would have the default of all.i=TRUE thus one could never have an all= argument to [.data.table that provides the default for both so I don't think this would ever be a problem. On Sat, May 4, 2013 at 8:44 AM, Karl Ove Hufthammer wrote: > la. den 04. 05. 2013 klokka 07.40 (-0400) skreiv Gabor Grothendieck: >> I am not sure but I think that could be handled as a separate issue if >> it becomes important. 
By using all.i= it makes it sufficiently >> different from all.y= that users won't expect the same default and >> further they will not necessarily expect that there be an all argument >> for the left participant in the merge. > > But won?t ?all? (e.g., ?all=TRUE?) automatically match ?all.i?, while at > the same time not give the same result as ?all? in ?merge?? That could > be confusing. > > -- > Karl Ove Hufthammer > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From npgraham1 at gmail.com Mon May 6 10:20:11 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Mon, 6 May 2013 04:20:11 -0400 Subject: [datatable-help] bug in 'consistent types for each group' code? Message-ID: I think I found a bug, either in the code that combines the results from grouping with 'by' or in the comparison code for IDate. The following is a simplified description of where and how, where the names have been changed to protect innocent variables. My code runs a function f on a data.table like so: output <- DT[, f(a.date, b.date, etc), by = group] The function f returns a data.table f.out with four columns, two of which are dates. All dates are stored as IDate, and the dates themselves are never changed or altered; some are relevant and most aren't. Explicitly printing the class of each column via print(sapply(f.out, class)) in f before returning always identifies the same classes, in my case "IDate" "Date", "IDate" "Date", "numeric", "integer" Despite this, for a certain group, I get the error columns of j don't evaluate to consistent types for each group: result for group 17 has column 1 type 'double' but expecting type 'integer' Every attempt to identify the problem with group 17 failed; its output looks perfectly correct, and everything checks out, even in debug. Using as.IDate explicitly anywhere before or during making the data.table f.out fixes the problem. As an aside, the error message above is not very helpful in general; I'd like to see *exactly* what isn't matching and where it's coming from. As another aside, when I run code like this, it's often the case that some groups don't end up belonging in the output at all. I can't figure out how to clue data.table to this; I'd like to just return NULL and that group not be in the output. Instead, I'm currently returning a row of obviously wrong output and filtering them later. Is there something I'm missing? ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Mon May 6 17:56:36 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Mon, 6 May 2013 10:56:36 -0500 Subject: [datatable-help] indexing with nomatch=0 In-Reply-To: References: Message-ID: +1; I especially like #2 and the slight conceptual shift it implies On Sat, May 4, 2013 at 6:26 AM, Gabor Grothendieck wrote: > The proposal at this point would be: > > 1. nomatch= would be replaced by all.i= such that > X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE] > X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE] > nomatch= would be deprecated and ultimately removed. 
From eduard.antonyan at gmail.com  Mon May  6 17:56:36 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 6 May 2013 10:56:36 -0500
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: References: Message-ID:

+1; I especially like #2 and the slight conceptual shift it implies

On Sat, May 4, 2013 at 6:26 AM, Gabor Grothendieck wrote:
> The proposal at this point would be:
>
> 1. nomatch= would be replaced by all.i= such that
>      X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE]
>      X[Y,,nomatch=0]  is the same as X[Y,,all.i=FALSE]
>    nomatch= would be deprecated and ultimately removed.
>
> Note that #1 is simple to implement, as it only involves changing the
> names and values of arguments and does not really change any behavior.
> However, it's easier to think about, because X[Y,,all.i=Z] then has
> the same behavior as merge(X, Y, all.y=Z) and so can be quickly
> understood by anyone who knows merge in R. In contrast, nomatch= did
> not even have the same meaning as in match(): match() matches the
> first occurrence, whereas with mult="all" (the default) matching in
> data.table matches all occurrences. Note that the default of merge's
> all.y= is all.y=FALSE, but the default of all.i= is all.i=TRUE, in
> order that the default behave as indices do. Also note that this
> solves the problem that nomatch= can only be 0 or NA, since a logical
> can only have two non-NA values anyway.
>
> 2. If Y were a numeric index vector then all.i= would have the same
> effect as if Y were a data.table with Y as its column and were merged
> with the row numbers of X. E.g. X[1:4,,all.i=FALSE] would be the same
> as X[1:3] if X only had 3 rows, since 4 does not match a row number of
> X and is dropped because all.i=FALSE. If Y were a numeric vector with
> negative values, it would be converted to one with positive values in
> such a way as to have the established meaning, and then the same
> strategy applied. If Y were logical, then it is recycled, giving YY,
> and the same strategy is applied to which(YY). This description is
> intended to be conceptual and the actual internal mechanism could be
> different.
>
> Thus #2 allows one to think of **all** i indexing as merging rather
> than as multiple separate concepts (which I believe is consistent with
> the original intention of data.table).
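To make the renaming in point 1 concrete, a small sketch of today's
behaviour on data.table 1.8.x; all.i= is only the proposed spelling and
does not exist yet:

require(data.table)
X <- data.table(a = c("a","b","c"), x = 1:3, key = "a")
Y <- data.table(a = c("b","d"))
X[Y]             # nomatch=NA (default): unmatched "d" kept with x = NA
X[Y, nomatch=0]  # unmatched "d" dropped; proposed: X[Y, all.i=FALSE]
X[1:4]           # row numbers are not merged today: row 4 comes back NA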
> On Fri, May 3, 2013 at 8:02 PM, Eduard Antonyan wrote:
> > I think I like this proposal - maybe you should write up a few
> > examples of what current behavior is, vs the proposed behavior.
> >
> > On Fri, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
> >> data.table is supposed to generalize indexing and, although not
> >> explicitly stated, the generalization seems to be that indexing is
> >> merging with the row numbers. So there is indeed merging going on,
> >> and that merging should respect nomatch= for consistency.
> >>
> >> On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan wrote:
> >> > There is no join'ing happening here, thus nomatch=0 has no
> >> > effect.
> >> >
> >> > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck wrote:
> >> >> The definition of DT was left out by mistake. It should be:
> >> >>
> >> >>   DT <- data.table(a=letters[1:3])
> >> >>
> >> >> On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck wrote:
> >> >> > Consider this example:
> >> >> >
> >> >> > > DT[1:4,,nomatch=0]
> >> >> >     a
> >> >> > 1:  a
> >> >> > 2:  b
> >> >> > 3:  c
> >> >> > 4: NA
> >> >> >
> >> >> > Should it not return only the first 3 rows? It seems to be
> >> >> > ignoring the nomatch=0.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From Ken.Williams at windlogics.com  Mon May  6 23:26:49 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Mon, 6 May 2013 21:26:49 +0000
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
Message-ID:

> -----Original Message-----
> From: Matthew Dowle [mailto:mdowle at mdowle.plus.com]
> Sent: Wednesday, May 01, 2013 4:13 PM
> Subject: Re: [datatable-help] Import problem with data.table in
> packages
>
> Hi,
>
> This rings a bell actually. data.table uses .onLoad currently but it
> should be using .onAttach, I seem to recall.
>
> http://r.789695.n4.nabble.com/Error-in-a-package-that-imports-data-table-tp4660173p4660637.html
>
> I had a hunt around but couldn't find if we decided data.table should
> move from .onLoad to .onAttach. Does anyone know/remember?

I'm not sure - but maybe a solution would be to explicitly prefix the
package name:

Index: pkg/R/onLoad.R
===================================================================
--- pkg/R/onLoad.R      (revision 855)
+++ pkg/R/onLoad.R      (working copy)
@@ -6,7 +6,7 @@
     if (class(ss)!="{") ss = as.call(c(as.name("{"), ss))
     if (!length(grep("data.table",ss[[2]]))) {
         ss = ss[c(1,NA,2:length(ss))]
-        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(data.table(...,key=key(..1)))")[[1]]
+        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(data.table::data.table(...,key=key(..1)))")[[1]]
         body(tt)=ss
         (unlockBinding)("cbind.data.frame",baseenv())
         assign("cbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE)
@@ -17,7 +17,7 @@
     if (class(ss)!="{") ss = as.call(c(as.name("{"), ss))
     if (!length(grep("data.table",ss[[2]]))) {
         ss = ss[c(1,NA,2:length(ss))]
-        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(`.rbind.data.table`(...))")[[1]]
+        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(`data.table::.rbind.data.table`(...))")[[1]]
         body(tt)=ss
         (unlockBinding)("rbind.data.frame",baseenv())
         assign("rbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE)
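A small sketch of why the "data.table::" qualification helps: the
injected expression is evaluated wherever base's cbind.data.frame runs,
and data.table's exported names need not be visible from there. This is
contrived (baseenv() stands in for an environment where data.table is
not attached) and assumes the package is installed:

expr1 <- parse(text = "data.table(a = 1)")[[1]]
expr2 <- parse(text = "data.table::data.table(a = 1)")[[1]]
try(eval(expr1, baseenv()))  # fails: "data.table" not found from base
eval(expr2, baseenv())       # works: `::` resolves the namespace
                             # explicitly, whether or not it's attached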
From Ken.Williams at windlogics.com  Mon May  6 23:29:57 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Mon, 6 May 2013 21:29:57 +0000
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
Message-ID:

> -----Original Message-----
> From: Ken Williams
> Sent: Monday, May 06, 2013 4:27 PM
>
> I'm not sure - but maybe a solution would be to explicitly prefix the
> package name:

[...]

Sorry, didn't notice the backticks - the second patched line should
probably go like so:

-        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(`.rbind.data.table`(...))")[[1]]
+        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(data.table::.rbind.data.table(...))")[[1]]

-Ken

From mdowle at mdowle.plus.com  Tue May  7 11:18:15 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 07 May 2013 10:18:15 +0100
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
Message-ID: <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>

Hi Ken,
cc Victor

Many thanks. I've just applied and committed your patch. I suspect a
change from .onLoad to .onAttach may still be needed, but let's make one
change at a time. There's a comment in the code that suggests it was in
.onAttach originally and I'd moved it to .onLoad. I wonder if it didn't
work in .onAttach because of the lack of the "data.table::" prefix. Now
that's there, perhaps it can move back to .onAttach.

Also filed so as not to forget :
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2771&group_id=240&atid=975

R-Forge should build the patched version in the next few hours. Please
let us know if it fixes it or not.

Matthew

On 06.05.2013 22:29, Ken Williams wrote:
> [...]

From dkulp at dizz.org  Wed May  8 18:31:39 2013
From: dkulp at dizz.org (David Kulp)
Date: Wed, 8 May 2013 12:31:39 -0400
Subject: [datatable-help] Better hacks?: getting a vector AND using
 'with'; inserting chunks of rows
Message-ID: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>

I must be doing something stupid. I'd like to get a vector from a
data.table column using with=FALSE, instead of a single-column
data.table.

dt <- data.table(x=1:10, y=letters[1:10])
col.name <- 'y'
row.num <- 5
print(dt[row.num, y])        # returns a vector with the letter 'e'. OK.
print(dt[row.num, list(y)])  # returns a data.table. OK.
print(dt[row.num, col.name, with=FALSE])
# returns a data.table... no list syntax here, but I don't get a vector
# back. Not OK.

The best I can do is

unlist(as.list(dt[row.num, col.name, with=FALSE]))

which seems rather hackish. I've read the FAQ and I'm stymied. v1.8.8.
Any help?

----

While I've got your attention, I might as well ask another stupid
question. I can't insert new rows automagically.

dt[11] <- c(11,'k')

Although I can do

df <- as.data.frame(dt)
df[11,] <- c(11,'k')

So I figure you want me to use rbind, even though rbind.data.table is
probably a copy operation.

dt <- rbind(dt, list(x=11, y='k'))

But I'd like to start with an empty data.table and programmatically add
chunks of rows as I run out of space. So I generate a data.table of NA
values and rbind. E.g., here I want to add 5 new rows to the 2 column
table.

dt <- data.table(x=numeric(), y=character())
new.rows <- lapply(1:2, function(c) { rep(NA, 5) })
dt <- rbind(dt, new.rows, use.names=FALSE)

According to the documentation, rbind is supposed to copy by position if
use.names=FALSE, but it doesn't retain the column names. This worked in
v1.8.2; then I upgraded and it stopped working. I know I can fix this by
labeling the columns of new.rows, but I'm guessing that there's a much
better way to simply allocate a new chunk of rows to a growing table,
and I didn't see any info online.

Thanks in advance!!

From FErickson at psu.edu  Wed May  8 22:00:25 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Wed, 8 May 2013 15:00:25 -0500
Subject: [datatable-help] Better hacks?: getting a vector AND using
 'with'; inserting chunks of rows
In-Reply-To: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>
References: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>
Message-ID:

For your first question, this should work:

dt[row.num,][[col.name]]

For the second question, I guess your problem goes away if you aren't
using an (all but) NULL data.table.

dt <- data.table(x=1, y=1)
nr <- data.table(NA, NA)
rbind(dt, nr, use.names=FALSE)
#     x  y
# 1:  1  1
# 2: NA NA

So, if you're dynamically growing your data.table from nothing, you'll
only have to assign the colnames once, after the data.table becomes
non-empty. I've read that R is pretty inefficient at dynamically growing
things, ...as you say, it's a copy operation, right?

I hope this helps.

Best,

Frank

On Wed, May 8, 2013 at 11:31 AM, David Kulp wrote:
> [...]

From dkulp at dizz.org  Fri May 10 03:05:43 2013
From: dkulp at dizz.org (David Kulp)
Date: Thu, 9 May 2013 21:05:43 -0400
Subject: [datatable-help] Better hacks?: getting a vector AND using
 'with'; inserting chunks of rows
In-Reply-To: References: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>
Message-ID:

dt[row.num,][[col.name]] is indeed the solution. And it works of course
for data.frames, too. Maybe it should be added to FAQ #1.3.

Thank you!

On May 8, 2013, at 4:00 PM, Frank Erickson wrote:
> [...]
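On the growing-a-table question more generally: one common pattern that
avoids repeated rbind copies is to collect the chunks in a list and bind
once at the end. A sketch using rbindlist (present in data.table 1.8.x);
the chunk contents here are invented:

require(data.table)
chunks <- vector("list", 100)
for (i in 1:100) {
    # each chunk computed or read as a small data.table
    chunks[[i]] <- data.table(x = i, y = letters[(i %% 26) + 1L])
}
dt <- rbindlist(chunks)  # one allocation instead of 100 incremental copies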
From mdowle at mdowle.plus.com  Sat May 11 01:41:43 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 11 May 2013 00:41:43 +0100
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>
References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
 <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>
Message-ID: <74abbc5c1b8bf617bb8c28f62ec98e93@imap.plus.net>

Hi Ken, Victor,

Have read up again about .onAttach and .onLoad, and I think data.table
is using them correctly after all. So Ken's patch alone should indeed
fix the problem. I've closed #975 now. R-Forge has now (finally) built
and is passing with that patch applied.

fread also has colClasses now, if anyone is waiting for that.

Many thanks,
Matthew

On 07.05.2013 10:18, Matthew Dowle wrote:
> [...]

From mdowle at mdowle.plus.com  Sat May 11 03:39:10 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 11 May 2013 02:39:10 +0100
Subject: [datatable-help] Fwd: fread on very large file
In-Reply-To: <806651da84c7d49b3a9aa134e4951274@imap.plus.net>
References: <6215268129090c5164b66264010bea9b@imap.plus.net>
 <806651da84c7d49b3a9aa134e4951274@imap.plus.net>
Message-ID:

Paul, Vishal,

Commit 859 :
* fread now supports files larger than 4GB on 64bit Windows (#2767
  thanks to Paul Harding) and files between 2GB and 4GB on 32bit
  Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to
  be GetFileSizeEx().

Please test and confirm ok now.

Thanks, Matthew

On 03.05.2013 14:59, Matthew Dowle wrote:
> Oh. Then it's likely a bug with fread on Windows for files > 4GB.
> Think GetFileSize() should be GetFileSizeEx(), iirc.
>
> Please could you file it as a bug on the tracker. Thanks.
>
> Matthew
>
> On 03.05.2013 14:32, Paul Harding wrote:
>> Definitely a 64-bit machine. Here are the details:
>>
>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>> Installed memory (RAM): 128GB
>> System type: 64-bit Operating System
>> Windows edition: Server 2008 R2 Enterprise SP1
>>
>> Regards, Paul
>>
>> On 3 May 2013 10:51, Matthew Dowle wrote:
>>> Hi Paul,
>>> Thanks for all this!
>>>> The problem arises when the file reaches 4GB, in this case between
>>>> 8,030,000 and 8,040,000 rows:
>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>> Thanks, Matthew
>>>
>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>> Some supplementary information; here is the portion of the file
>>>> (with row numbers, +1 for header) around where fread thinks the
>>>> file ends.
>>>>
>>>> $ nl spd_all_fixed.csv | head -n 9186300 | tail
>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>
>>>> 9186294 (row 9186293 excl header) is where fread thinks the file
>>>> ends, mid-line by the look of it!
>>>>
>>>> I've experimented by truncating the file. The error varies: either
>>>> it reads too few records or it gives the error I reported,
>>>> presumably determined by whether the last perceived line is entire.
>>>> The problem arises when the file reaches 4GB, in this case between
>>>> 8,030,000 and 8,040,000 rows:
>>>>
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>
>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> [...]
>>>> Count of eol after first data row: 80300000
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving
>>>> 80299999 data rows
>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
>>>> '0.42634430000000001'
>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains
>>>> '0.42634430000000001'
>>>> [...]
>>>> 171.188s ( 65%) Reading data
>>>> 1365231.809s (518439%) Allocation for type bumps (if any),
>>>> including gc time if triggered
>>>> -1365231.809s (-518439%) Coercing data already read in type bumps
>>>> (if any)
>>>>
>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> [...]
>>>> Count of eol after first data row: 18913
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving
>>>> 18913 data rows
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002000 (+middle 5 rows)
>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>>   Expected sep (',') but ',' ends field 2 on line 6 when detecting
>>>>   types: 204650,724540,
>>>>
>>>> Regards, Paul
>>>>
>>>> On 1 May 2013 10:28, Paul Harding wrote:
>>>>> [...]

From Ken.Williams at windlogics.com  Sat May 11 17:56:42 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Sat, 11 May 2013 15:56:42 +0000
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: <74abbc5c1b8bf617bb8c28f62ec98e93@imap.plus.net>
References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
 <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>
 <74abbc5c1b8bf617bb8c28f62ec98e93@imap.plus.net>
Message-ID: <109C1AB5-2A48-4302-95B7-97737CE43777@windlogics.com>

Awesome. I wasn't totally clear on this namespace stuff, but that sounds
right to me too.

Sent from my iPhone.

On May 10, 2013, at 6:41 PM, "Matthew Dowle" wrote:
> [...]

From mdowle at mdowle.plus.com  Sat May 11 23:56:45 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 11 May 2013 22:56:45 +0100
Subject: [datatable-help] fread(character string) limited to strings
 less than 4096 long?
In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net>
 <2c2af8789733127541fe78c1ccde5412@imap.plus.net>
 <230b0040889556349b21822824a5fb7e@imap.plus.net>
Message-ID: <96847d98ac0d995008db94fd21b24906@imap.plus.net>

Hi,

Have reproduced now, and fixed (commit 862):
* When input is the data as a character string, it is no longer
  truncated to your system's maximum path length, #2649. It was being
  passed through path.expand() even when it wasn't a filename. Many
  thanks to Timothee Carayol for the reproducible report. The limit
  should now be R's character string length limit (2^31-1 bytes = 2GB).
  Test added.

And the persisting nan% in verbose output is also fixed.

Many thanks!
Matthew
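A quick way to exercise the fix, assuming a build that includes commit
862 (input containing a newline is treated as data, not a filename):

require(data.table)
rows  <- paste(sprintf("%d\tx", 1:5000), collapse = "\n")
input <- paste("a\tb", rows, sep = "\n")
nchar(input)        # far beyond 4096, and beyond any path-length limit
nrow(fread(input))  # expect 5000 once the truncation is gone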
>>>> Count of eol after first data row: 1023 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows >>>> Type codes: 33 (first 5 rows) >>>> Type codes: 33 (+middle 5 rows) >>>> Type codes: 33 (+last 5 rows) >>>> 0.000s (-nan%) Memory map (rerun may be quicker) >>>> 0.000s (-nan%) sep and header detection >>>> 0.000s (-nan%) Count rows (wc -l) >>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >>>> 0.000s (-nan%) Reading data >>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered >>>> 0.000s (-nan%) Coercing data already read in type bumps (if any) >>>> 0.000s (-nan%) Changing na.strings to NA >>>> 0.000s Total >>>> 4096 1023 >>>> Input contains a n (or is ""), taking this to be text input (not a filename) >>>> Detected eol as n only (no r afterwards), the UNIX and Mac standard. >>>> Using line 30 to detect sep (the last non blank line in the first 30) ... 't' >>>> Found 2 columns >>>> First row with 2 fields occurs on line 1 (either column names or first row of data) >>>> All the fields on line 1 are character fields. Treating as the column names. >>>> Count of eol after first data row: 1023 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows >>>> Type codes: 33 (first 5 rows) >>>> Type codes: 33 (+middle 5 rows) >>>> Type codes: 33 (+last 5 rows) >>>> 0.000s (-nan%) Memory map (rerun may be quicker) >>>> 0.000s (-nan%) sep and header detection >>>> 0.000s (-nan%) Count rows (wc -l) >>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >>>> 0.000s (-nan%) Reading data >>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered >>>> 0.000s (-nan%) Coercing data already read in type bumps (if any) >>>> 0.000s (-nan%) Changing na.strings to NA >>>> 0.000s Total >>>> 4100 1023 >>>> Input contains a n (or is ""), taking this to be text input (not a filename) >>>> Detected eol as n only (no r afterwards), the UNIX and Mac standard. >>>> Using line 30 to detect sep (the last non blank line in the first 30) ... 't' >>>> Found 2 columns >>>> First row with 2 fields occurs on line 1 (either column names or first row of data) >>>> All the fields on line 1 are character fields. Treating as the column names. >>>> Count of eol after first data row: 1023 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows >>>> Type codes: 33 (first 5 rows) >>>> Type codes: 33 (+middle 5 rows) >>>> Type codes: 33 (+last 5 rows) >>>> 0.000s (-nan%) Memory map (rerun may be quicker) >>>> 0.000s (-nan%) sep and header detection >>>> 0.000s (-nan%) Count rows (wc -l) >>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >>>> 0.000s (-nan%) Reading data >>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered >>>> 0.000s (-nan%) Coercing data already read in type bumps (if any) >>>> 0.000s (-nan%) Changing na.strings to NA >>>> 0.000s Total >>>> 40000 1023 >>>> >>>> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: >>>> >>>>> Hm this is odd. >>>>> >>>>> Could you run the following and paste back the (verbose) results please. 
>>>>> for (n in c(1023:1025, 10000)) { >>>>> >>>>> input = paste( rep('atbn', n), collapse='') >>>>> A = fread(input,verbose=TRUE) >>>>> cat(nchar(input), nrow(A), "n") >>>>> } >>>>> >>>>> On 28.03.2013 14:38, Timoth?e Carayol wrote: >>>>> >>>>>> Curiouser and curiouser.. >>>>>> >>>>>> I can reproduce on two computers with different versions of R and of data.table. >>>>>> >>>>>> Computer 1 (it says unknown-linux but is actually ubuntu): >>>>>> >>>>>> R version 2.15.3 (2013-03-01) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 >>>>>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C >>>>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >>>>>> Computer 2: >>>>>> >>>>>> R version 2.15.2 (2012-10-26) >>>>>> Platform: x86_64-redhat-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>>>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] data.table_1.8.8 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] tools_2.15.2 >>>>>> >>>>>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >>>>>> >>>>>>> Interesting, what's your sessionInfo() please? >>>>>>> >>>>>>> For me it seems to work ok : >>>>>>> >>>>>>> [1] 1022 >>>>>>> [1] 1023 >>>>>>> [1] 1024 >>>>>>> [1] 9999 >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 2.15.2 (2012-10-26) >>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>>>>> >>>>>>> On 27.03.2013 22:49, Timoth?e Carayol wrote: >>>>>>> >>>>>>>> Agree with Muhammad, longer character strings are definitely permitted in R. >>>>>>>> A minimal example that show something strange happening with fread: >>>>>>>> >>>>>>>> for (n in c(1023:1025, 10000)) { >>>>>>>> A >>>>>>>> >>>>>>>> paste( >>>>>>>> rep('atbn', n), >>>>>>>> collapse='' >>>>>>>> ), >>>>>>>> sep='t' >>>>>>>> ) >>>>>>>> print(nrow(A)) >>>>>>>> } >>>>>>>> On my computer, I obtain: >>>>>>>> >>>>>>>> [1] 1022 >>>>>>>> [1] 1023 >>>>>>>> [1] 1023 >>>>>>>> [1] 1023 >>>>>>>> Hope this helps >>>>>>>> Timoth?e >>>>>>>> >>>>>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >>>>>>>>> the R limit for a character string length? What happens at 4097? >>>>>>>>> Matthew >>>>>>>>> >>>>>>>>> > Hi, >>>>>>>>> > >>>>>>>>> > I have an example of a string of 4097 characters which can't be parsed by >>>>>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is >>>>>>>>> > that a known limitation? >>>>>>>>> > >>>>>>>>> > (If I write the string to a file and then fread the file name, it works >>>>>>>>> > too.) >>>>>>>>> > >>>>>>>>> > Let me know if you need the string and/or a bug report. 
>>>>>>>>> > >>>>>>>>> > Thanks >>>>>>>>> > Timoth?e > _______________________________________________ >>>>>>>>> > datatable-help mailing list >>>>>>>>> > datatable-help at lists.r-forge.r-project.org [1] >>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com [5] mailto:mdowle at mdowle.plus.com [6] mailto:mdowle at mdowle.plus.com [7] mailto:timothee.carayol at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Sun May 12 00:16:30 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sat, 11 May 2013 18:16:30 -0400 Subject: [datatable-help] fread: skip Message-ID: I would find it useful if fread had a skip= argument as in read.table since I have files from time to time that have garbage at the top. Another situation I find from time to time is that the header is messed up but one can still read the file if one can skip over the header and specify header = FALSE. An extra feature that would be nice but less important would be if one could specify skip = "string" and have it skip all lines until it found one with "string": in it and then start reading from the matched row onward. Normally the string would be chosen to be a string found in the header and not likely found prior to the header. read.xls in gdata has a similar feature and I find it quite handy at times. -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 00:35:19 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 11 May 2013 23:35:19 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: Message-ID: Hi, Does the auto skip feature of fread cover both of those? From ?fread : " Once the separator is found on line autostart, the number of columns is determined. Then the file is searched backwards from autostart until a row is found that doesn't have that number of columns, or the start of file is reached. Thus, the first data row is found and any human readable banners are automatically skipped. This feature can be particularly useful for loading a set of files which may not all have consistently sized banners. " There were also some issue with header=FALSE in the first release (1.8.8) which have since been fixed in 1.8.9. Matthew On 11.05.2013 23:16, Gabor Grothendieck wrote: > I would find it useful if fread had a skip= argument as in read.table > since I have files from time to time that have garbage at the top. > Another situation I find from time to time is that the header is > messed up but one can still read the file if one can skip over the > header and specify header = FALSE. > > An extra feature that would be nice but less important would be if > one > could specify skip = "string" and have it skip all lines until it > found one with "string": in it and then start reading from the > matched > row onward. Normally the string would be chosen to be a string > found > in the header and not likely found prior to the header. read.xls in > gdata has a similar feature and I find it quite handy at times. > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. 
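A small illustration of the auto-skip behaviour quoted from ?fread, as
of 1.8.9; the banner text is invented, and skip= is only proposed in
this thread, not yet implemented:

require(data.table)
input <- "banner line written by some export tool\nanother banner\na,b\n1,2\n3,4\n"
fread(input)
# The banner lines don't have the detected 2 comma-separated fields, so
# the search upwards from autostart stops below them: 2 rows are read,
# with column names a and b.
# The proposed skip= would force the start instead of relying on the
# heuristic, e.g. fread(input, skip=2)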
From ggrothendieck at gmail.com  Sun May 12 01:47:01 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 11 May 2013 19:47:01 -0400
Subject: [datatable-help] fread: skip
In-Reply-To: References: Message-ID:

Not with the csv I tried. The header is messed up (most of the header
fields are missing) and it misconstrues it as data. The automation is
great, but some way to force its behavior when you know what it should
do seems essential, since heuristics can't be expected to work in all
cases.

On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle wrote:
> [...]

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Sun May 12 09:53:38 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 12 May 2013 09:53:38 +0200
Subject: [datatable-help] ":=" with "by" reassignment/updation + adding
 new column leads to crash
Message-ID: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com>

Hi,

I just discovered a weird R-session crash in data.table. Here's an
example to reproduce the crash. I did not find any bug filed regarding
this issue. Maybe others can verify this? Then I'll file it as a bug.

The issue is this. Suppose you've a data.table with two columns x and y
as follows:

require(data.table)
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)

   x  y
1: 1  6
2: 1  7
3: 1  8
4: 2  9
5: 2 10

Now you want to add a new column "z" by reference, grouped by "x". So
you'd do:

DT[, `:=`(z = .GRP), by = x]

   x  y z
1: 1  6 1
2: 1  7 1
3: 1  8 1
4: 2  9 2
5: 2 10 2

Now, for the sake of producing this error, assume that you assigned "z"
the wrong value and want to change it, and you also realise that you
want to add another column "w" as well. So you go ahead and do (remember
to do the previous step and then this one):

DT[, `:=`(z = .N, w = 2), by = x]  # R session crashes

Here, both the R and RStudio sessions crash with the traceback message:

 *** caught segfault ***
address 0x0, cause 'memory not mapped'

Traceback:
 1: `[.data.table`(DT, , `:=`(z = .GRP, w = 2), by = x)
 2: DT[, `:=`(z = .GRP, w = 2), by = x]

This, on the other hand, works as expected if you assign both columns
the first time:

require(data.table)
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
DT[, `:=`(z = .GRP, w = 2), by = x]  # works fine

That is, if you assign by reference (:=) with "by" and re-assign an
existing column while also creating another column, there seems to be a
segfault. The error may not be limited to this case; it's just what
I've tested.

Here's my sessionInfo() from before the crash:

R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
[...]
other attached packages:
[1] data.table_1.8.8

Best,
Arun

From aragorn168b at gmail.com  Sun May 12 10:12:17 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 12 May 2013 10:12:17 +0200
Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by"
Message-ID: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>

Hi,

Suppose you've a data.table, say:

require(data.table)
DT <- data.table(x = 1:5, y = 6:10)

Suppose you want to group by "x %/% 2" (= 0, 1,1, 2,2) and then
calculate the sum of each column for each group. Then one would do:

DT[, grp := x %/% 2]
DT[, list(x.sum=sum(x), y.sum=sum(y)), by = grp]  # avoid .SD for few columns

Now, assume that you've many, many columns, which would make the use of
`.SD` sensible:

DT[, lapply(.SD, sum), by = grp]
   grp x  y
1:   0 1  6
2:   1 5 15
3:   2 9 19

The issue is that if you create the grouping column ad hoc, then the
column from which the ad-hoc grouping column is derived is not available
to .SD. Let me illustrate this:

DT <- data.table(x = 1:5, y = 6:10)
DT[, lapply(.SD, sum), by = (grp=x %/% 2)]  # ad-hoc grouping column
   grp  y
1:   0  6
2:   1 15
3:   2 19

I think it'd be nice to have the column available to `.SD`, so that we
can save creating a temporary column, grouping, and then deleting it;
"technically" grp *is* a new column (meaning "x" must still be
available).

Any take on this?
Arun
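One possible workaround for the ad-hoc case, sketched from the 1.8.x
documentation rather than tested on every version: .SDcols names exactly
which columns .SD should carry, and that may include x even when x is
used in the by expression:

require(data.table)
DT <- data.table(x = 1:5, y = 6:10)
DT[, lapply(.SD, sum), by = list(grp = x %/% 2), .SDcols = c("x","y")]
#    grp x  y
# 1:   0 1  6
# 2:   1 5 15
# 3:   2 9 19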
The header is messed up (most of the > header > fields are missing) and it misconstrues it as data. That was fixed a while ago in v1.8.9, from NEWS : " [fread] If some column names are blank they are now given default names rather than causing the header row to be read as a data row " > The automation is great but some way to force its behavior when you > know what it should do seems essential since heuristics can't be > expected to work in all cases. I suspect the heuristics in v1.8.9 work on all your examples so far, but ok point taken. fread allows control of 'autostart' already. This is a line number (default 30) within the regular data block used to detect the separator and search upwards from to find the first data row and/or column names. Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off the search upwards part. Line skip+1 will be used to detect the separator when sep="auto" and used as column names according to header="auto"|TRUE|FALSE as usual. It'll be an error to specify both autostart and skip in the same call. If that sounds ok? Matthew > > On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle > wrote: >> >> Hi, >> >> Does the auto skip feature of fread cover both of those? From >> ?fread : >> >> " Once the separator is found on line autostart, the number of >> columns is >> determined. Then the file is searched backwards from autostart until >> a row >> is found that doesn't have that number of columns, or the start of >> file is >> reached. Thus, the first data row is found and any human readable >> banners >> are automatically skipped. This feature can be particularly useful >> for >> loading a set of files which may not all have consistently sized >> banners. " >> >> There were also some issue with header=FALSE in the first release >> (1.8.8) >> which have since been fixed in 1.8.9. >> >> Matthew >> >> >> >> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>> >>> I would find it useful if fread had a skip= argument as in >>> read.table >>> since I have files from time to time that have garbage at the top. >>> Another situation I find from time to time is that the header is >>> messed up but one can still read the file if one can skip over the >>> header and specify header = FALSE. >>> >>> An extra feature that would be nice but less important would be if >>> one >>> could specify skip = "string" and have it skip all lines until it >>> found one with "string": in it and then start reading from the >>> matched >>> row onward. Normally the string would be chosen to be a string >>> found >>> in the header and not likely found prior to the header. read.xls in >>> gdata has a similar feature and I find it quite handy at times. >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. 
>>> tel: 1-877-GKX-GROUP
>>> email: ggrothendieck at gmail.com
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From mdowle at mdowle.plus.com  Sun May 12 12:44:49 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 11:44:49 +0100
Subject: [datatable-help] ":=" with "by" reassignment/updation + adding new column leads to crash
In-Reply-To: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com>
References: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com>
Message-ID: <0b1179be0c07f95992fa421dca7613f1@imap.plus.net>

Hi,

Yes I get that in latest dev too. Thanks for the nice example, please file.

Matthew

On 12.05.2013 08:53, Arunkumar Srinivasan wrote:
> Hi,
> I just discovered some weird R-session crash in data.table. Here's an example to reproduce the crash. I did not find any bug filed regarding this issue. Maybe others can verify this? Then I'll file it as a bug.
> The issue is this. Suppose you've a data.table with two columns x and y as follows:
> require(data.table)
> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
>
> x y
> 1: 1 6
> 2: 1 7
> 3: 1 8
> 4: 2 9
> 5: 2 10
> Now you want to add a new column "z" by reference grouped by "x". So, you'd do:
>
> DT[, `:=`(z = .GRP), by = x]
>
> x y z
> 1: 1 6 1
> 2: 1 7 1
> 3: 1 8 1
> 4: 2 9 2
> 5: 2 10 2
> Now, for the sake of producing this error, assume that you assigned "z" the wrong value and that you want to change it. But you also realised that you want to add another column "w" as well. So, you go ahead and do (remember to do the previous step and then this one):
> DT[, `:=`(z = .N, w = 2), by = x] # R session crashes
> Here, both the R and RStudio sessions crash with the traceback message:
>
> *** caught segfault ***
> address 0x0, cause 'memory not mapped'
> Traceback:
> 1: `[.data.table`(DT, , `:=`(z = .GRP, w = 2), by = x)
> 2: DT[, `:=`(z = .GRP, w = 2), by = x]
> This on the other hand works as expected if you assign both columns the first time.
>
> require(data.table)
> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
> DT[, `:=`(z = .GRP, w = 2), by = x] # works fine
> That is, if you assign by reference (:=) with "by" and re-assign a variable while also creating another variable, there seems to be a segfault. This error may not be limited to this case; it's just the one I've tested.
> Here's my sessionInfo() from before the crash:
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] data.table_1.8.8
> loaded via a namespace (and not attached):
> [1] tools_3.0.0
> Best,
> Arun
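[Note, not from the thread: a possible interim workaround, assuming -- untested
against 1.8.9/dev -- that the segfault only occurs when a single grouped `:=`
call both re-assigns an existing column and adds a new one. Splitting the two
operations into separate grouped calls avoids that combination:

require(data.table)
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
DT[, z := .GRP, by = x]  # first assignment, as in the report above
DT[, z := .N, by = x]    # re-assign the existing column on its own
DT[, w := 2, by = x]     # add the new column in a separate call
]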
From mdowle at mdowle.plus.com  Sun May 12 12:58:08 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 11:58:08 +0100
Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by"
In-Reply-To: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>
References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>
Message-ID: <783cf12994786d84f97a99237c5d70ee@imap.plus.net>

On 12.05.2013 09:12, Arunkumar Srinivasan wrote:
> Hi,
> Suppose you've a data.table, say:
> require(data.table)
> DT <- data.table(x = 1:5, y = 6:10)
>
> Suppose you want to group by "x %/% 2" ( = 0, 1,1, 2,2) and then calculate the sum of each column for each group, then one would do:
> DT[, grp := x %/% 2]
> DT[, list(x.sum=sum(x), y.sum=sum(y)), by = grp] # avoid .SD in case of few columns

I know this isn't the main point (keep scrolling down) but just as an aside :

DT[, lapply(.SD, sum), by = grp, .SDcols=c("x","y")] # intended way to avoid .SD in case of a few columns

> Now, assume that you've many many columns which would make the use of `.SD` sensible.
> DT[, lapply(.SD, sum), by = grp]
>
> grp x y
> 1: 0 1 6
> 2: 1 5 15
> 3: 2 9 19
> The issue is that if you create the grouping column ad-hoc, then the column from which the ad-hoc grouping column is derived is not available to .SD. Let me illustrate this:
>
> DT <- data.table(x = 1:5, y = 6:10)
> DT[, lapply(.SD, sum), by = (grp=x %/% 2)] # ad-hoc creation of grouping column
>
> grp y
> 1: 0 6
> 2: 1 15
> 3: 2 19
> I think it'd be nice to have the column available to `.SD` so that we can save creating a temporary column, grouping and then deleting it, as "technically" it *is* a new column (meaning, "x" must still be available). Any take on this?

.BY is available to j already for that reason, does that work? .BY isn't a
column of .SD because i) it's the same value for every row of .SD i.e.
.BY[[1]] is length 1 and contains this particular group (replicating the
same value would be wasteful) but more significantly ii) it is often a
character group name where running an aggregation function like sum()
would trip up on it.

> Arun

From aragorn168b at gmail.com  Sun May 12 13:54:31 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 12 May 2013 13:54:31 +0200
Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by"
In-Reply-To: <783cf12994786d84f97a99237c5d70ee@imap.plus.net>
References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>
	<783cf12994786d84f97a99237c5d70ee@imap.plus.net>
Message-ID: <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com>

I just realised that I sent it only to MatthewDowle. So, sending it again.
Sorry @Matthew for the double email.

Matthew,

>> .BY is available to j already for that reason, does that work? .BY isn't a column of .SD because i) it's the same value for every row of .SD i.e. .BY[[1]] is length 1 and contains this particular group (replicating the same value would be wasteful)

DT[, print(.BY), by = list(grp = x %/% 2)]

$grp
[1] 0

$grp
[1] 1

$grp
[1] 2

DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x"

y
1: 6
y
1: 7
2: 8
y
1: 9
2: 10

My question is not as to why the BY column is not available in .SD. Rather,
since .BY does not have column "x" in it (rather the result of x%/% 2), why
does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x"
should be available to .SD is my point.
>> but more significantly ii) it is often a character group name where running an aggregation function like sum() would trip up on it.

Again, I don't think so because, I am not asking for .BY columns to be in .SD.

DT[, grp := x %/% 2]
DT[, lapply(.SD, sum), by=grp]

must be equal to:

DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as it's not the grouping column

Arun

From ggrothendieck at gmail.com  Sun May 12 14:26:35 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 12 May 2013 08:26:35 -0400
Subject: [datatable-help] fread: skip
In-Reply-To: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
Message-ID:

1.8.8 is the most recent version on CRAN so I have now installed 1.8.9
from R-Forge now and the sample csv I was using does indeed work
attempting to do the best it can with the mucked up header. Maybe
this is sufficient and a skip is not needed but the fact is that there
is no facility to skip over the bad header had I wanted to.

On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle wrote:
> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>
>> Not with the csv I tried. The header is messed up (most of the header
>> fields are missing) and it misconstrues it as data.
>
> That was fixed a while ago in v1.8.9, from NEWS :
>
> " [fread] If some column names are blank they are now given default names
> rather than causing the header row to be read as a data row "
>
>> The automation is great but some way to force its behavior when you
>> know what it should do seems essential since heuristics can't be
>> expected to work in all cases.
>
> I suspect the heuristics in v1.8.9 work on all your examples so far, but ok
> point taken.
>
> fread allows control of 'autostart' already. This is a line number (default
> 30) within the regular data block used to detect the separator and search
> upwards from to find the first data row and/or column names.
>
> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off
> the search upwards part. Line skip+1 will be used to detect the separator
> when sep="auto" and used as column names according to
> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
> autostart and skip in the same call. If that sounds ok?
>
> Matthew
>
>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>> wrote:
>>>
>>> Hi,
>>>
>>> Does the auto skip feature of fread cover both of those? From ?fread :
>>>
>>> " Once the separator is found on line autostart, the number of columns is
>>> determined. Then the file is searched backwards from autostart until a row
>>> is found that doesn't have that number of columns, or the start of file is
>>> reached. Thus, the first data row is found and any human readable banners
>>> are automatically skipped. This feature can be particularly useful for
>>> loading a set of files which may not all have consistently sized banners. "
>>>
>>> There were also some issue with header=FALSE in the first release (1.8.8)
>>> which have since been fixed in 1.8.9.
>>>
>>> Matthew
>>>
>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>
>>>> I would find it useful if fread had a skip= argument as in read.table
>>>> since I have files from time to time that have garbage at the top.
>>>> Another situation I find from time to time is that the header is >>>> messed up but one can still read the file if one can skip over the >>>> header and specify header = FALSE. >>>> >>>> An extra feature that would be nice but less important would be if one >>>> could specify skip = "string" and have it skip all lines until it >>>> found one with "string": in it and then start reading from the matched >>>> row onward. Normally the string would be chosen to be a string found >>>> in the header and not likely found prior to the header. read.xls in >>>> gdata has a similar feature and I find it quite handy at times. >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> >>>> >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 14:45:03 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 13:45:03 +0100 Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by" In-Reply-To: <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com> References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com> <783cf12994786d84f97a99237c5d70ee@imap.plus.net> <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com> Message-ID: <932fea75ece9b61bde12aa8d54c21a6f@imap.plus.net> On 12.05.2013 12:54, Arunkumar Srinivasan wrote: > I just realised that I sent it only to MatthewDowle. So, sending it again. Sorry @Matthew for the double email. > > Matthew, >>> .BY is available to j already for that reason, does that work? .BY isn't a column of .SD because i) it's the same value for every row of .SD i.e. .BY[[1]] is length 1 and contains this particular group (replicating the same value would be wasteful) > DT[, print(.BY), by = list(grp = x %/% 2)] > > $grp > [1] 0 > $grp > [1] 1 > $grp > [1] 2 > > DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x" > > y > 1: 6 > y > 1: 7 > 2: 8 > y > 1: 9 > 2: 10 > My question is not as to why the BY column is not available in .SD. Rather, since .BY does not have column "x" in it (rather the result of x%/% 2), why does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x" should be available to .SD is my point. Oh I see now. Yes data.table inspects the expressions used in 'by' and considers any columns used there as grouping columns and excludes those from .SD. An example is a date column containing daily observations. DT[, lapply(.SD,sum), by=month(date)] would not wish to sum() the "date" column. In ?data.table I've just changed : .SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s). to .SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in 'by' (or 'keyby'). Further answer below ... >>> but more significantly ii) it is often a character group name where running an aggregation function like sum() would trip up on it. > Again, I don't think so because, I am not asking for .BY columns to be in .SD. 
> DT[, grp := x %/% 2]
> DT[, lapply(.SD, sum), by=grp]
> must be equal to:
> DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as it's not the grouping column

This makes sense in this case because x can be sum()-ed, but isn't true in
general like the month(date) case above.

In these cases you can use .SDcols to include all columns, even the ones
used by 'by':

> DT[, lapply(.SD, sum), by=list(grp=x%/%2)]
grp y
1: 0 6
2: 1 15
3: 2 19

> DT[, lapply(.SD, sum), by=list(grp=x%/%2), .SDcols=names(DT)]
grp x y
1: 0 1 6
2: 1 5 15
3: 2 9 19

> DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)]
x y
1: 1 6
x y
1: 2 7
2: 3 8
x y
1: 4 9
2: 5 10

Arun

From mdowle at mdowle.plus.com  Sun May 12 15:01:49 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 14:01:49 +0100
Subject: [datatable-help] fread: skip
In-Reply-To:
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
Message-ID: <41657922b05299edb07739e0c59add64@imap.plus.net>

Hi,

I suspect you may not have scrolled further down in my reply where I wrote more?

Matthew

On 12.05.2013 13:26, Gabor Grothendieck wrote:
> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9
> from R-Forge now and the sample csv I was using does indeed work
> attempting to do the best it can with the mucked up header. Maybe
> this is sufficient and a skip is not needed but the fact is that there
> is no facility to skip over the bad header had I wanted to.
>
> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle wrote:
>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>
>>> Not with the csv I tried. The header is messed up (most of the header
>>> fields are missing) and it misconstrues it as data.
>>
>> That was fixed a while ago in v1.8.9, from NEWS :
>>
>> " [fread] If some column names are blank they are now given default names
>> rather than causing the header row to be read as a data row "
>>
>>> The automation is great but some way to force its behavior when you
>>> know what it should do seems essential since heuristics can't be
>>> expected to work in all cases.
>>
>> I suspect the heuristics in v1.8.9 work on all your examples so far, but ok
>> point taken.
>>
>> fread allows control of 'autostart' already. This is a line number (default
>> 30) within the regular data block used to detect the separator and search
>> upwards from to find the first data row and/or column names.
>>
>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off
>> the search upwards part. Line skip+1 will be used to detect the separator
>> when sep="auto" and used as column names according to
>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
>> autostart and skip in the same call. If that sounds ok?
>>
>> Matthew
>>
>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Does the auto skip feature of fread cover both of those? From ?fread :
>>>>
>>>> " Once the separator is found on line autostart, the number of columns is
>>>> determined. Then the file is searched backwards from autostart until a row
>>>> is found that doesn't have that number of columns, or the start of file is
>>>> reached. Thus, the first data row is found and any human readable banners
>>>> are automatically skipped.
This feature can be particularly useful >>>> for >>>> loading a set of files which may not all have consistently sized >>>> banners. >>>> " >>>> >>>> There were also some issue with header=FALSE in the first release >>>> (1.8.8) >>>> which have since been fixed in 1.8.9. >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> I would find it useful if fread had a skip= argument as in >>>>> read.table >>>>> since I have files from time to time that have garbage at the >>>>> top. >>>>> Another situation I find from time to time is that the header is >>>>> messed up but one can still read the file if one can skip over >>>>> the >>>>> header and specify header = FALSE. >>>>> >>>>> An extra feature that would be nice but less important would be >>>>> if one >>>>> could specify skip = "string" and have it skip all lines until it >>>>> found one with "string": in it and then start reading from the >>>>> matched >>>>> row onward. Normally the string would be chosen to be a string >>>>> found >>>>> in the header and not likely found prior to the header. read.xls >>>>> in >>>>> gdata has a similar feature and I find it quite handy at times. >>>>> >>>>> -- >>>>> Statistics & Software Consulting >>>>> GKX Group, GKX Associates Inc. >>>>> tel: 1-877-GKX-GROUP >>>>> email: ggrothendieck at gmail.com >>>>> _______________________________________________ >>>>> datatable-help mailing list >>>>> datatable-help at lists.r-forge.r-project.org >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com From aragorn168b at gmail.com Sun May 12 15:14:43 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 12 May 2013 15:14:43 +0200 Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by" In-Reply-To: <932fea75ece9b61bde12aa8d54c21a6f@imap.plus.net> References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com> <783cf12994786d84f97a99237c5d70ee@imap.plus.net> <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com> <932fea75ece9b61bde12aa8d54c21a6f@imap.plus.net> Message-ID: <6235A72AA2834895A051820E0924AAE1@gmail.com> Matthew, Yes, that clarifies things. It makes more sense, especially with the option of being able to use `.SDcols` to include it. And thanks for the change in documentation as well; adds more clarity. Best, Arun On Sunday, May 12, 2013 at 2:45 PM, Matthew Dowle wrote: > On 12.05.2013 12:54, Arunkumar Srinivasan wrote: > > I just realised that I sent it only to MatthewDowle. So, sending it again. Sorry @Matthew for the double email. > > Matthew, > > >> .BY is available to j already for that reason, does that work? .BY isn't a column of .SD because i) it's the same value for every row of .SD i.e. .BY[[1]] is length 1 and contains this particular group (replicating the same value would be wasteful) > > DT[, print(.BY), by = list(grp = x %/% 2)] > > $grp > > [1] 0 > > $grp > > [1] 1 > > $grp > > [1] 2 > > > > DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x" > > y > > 1: 6 > > y > > 1: 7 > > 2: 8 > > y > > 1: 9 > > 2: 10 > > > > My question is not as to why the BY column is not available in .SD. Rather, since .BY does not have column "x" in it (rather the result of x%/% 2), why does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x" should be available to .SD is my point. > > > > > > > > Oh I see now. 
Yes data.table inspects the expressions used in 'by' and considers any columns used there as grouping columns and excludes those from .SD. An example is a date column containing daily observations. DT[, lapply(.SD,sum), by=month(date)] would not wish to sum() the "date" column.
> In ?data.table I've just changed :
> .SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s).
> to
> .SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in 'by' (or 'keyby').
> Further answer below ...
> > >> but more significantly ii) it is often a character group name where running an aggregation function like sum() would trip up on it.
> > Again, I don't think so because, I am not asking for .BY columns to be in .SD.
> > DT[, grp := x %/% 2]
> > DT[, lapply(.SD, sum), by=grp]
> > must be equal to:
> > DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as it's not the grouping column
> >
> > This makes sense in this case because x can be sum()-ed, but isn't true in general like the month(date) case above.
> In these cases you can use .SDcols to include all columns, even the ones used by 'by':
> > DT[, lapply(.SD, sum), by=list(grp=x%/%2)]
> grp y
> 1: 0 6
> 2: 1 15
> 3: 2 19
> > DT[, lapply(.SD, sum), by=list(grp=x%/%2), .SDcols=names(DT)]
> grp x y
> 1: 0 1 6
> 2: 1 5 15
> 3: 2 9 19
> > DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)]
> x y
> 1: 1 6
> x y
> 1: 2 7
> 2: 3 8
> x y
> 1: 4 9
> 2: 5 10
>
> Arun

From ggrothendieck at gmail.com  Sun May 12 15:24:47 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 12 May 2013 09:24:47 -0400
Subject: [datatable-help] fread: skip
In-Reply-To: <41657922b05299edb07739e0c59add64@imap.plus.net>
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
	<41657922b05299edb07739e0c59add64@imap.plus.net>
Message-ID:

Sorry, I did indeed miss the portion of the reply at the very bottom.
Yes, that seems good.

What about colClasses too? I would think that there would be cases
where an automatic approach might not give the result wanted. For
example, order numbers might all be numeric but you would want to
store them as character in case there are leading zeros. In other
cases similar fields might validly have leading zeros but you would
want them regarded as numeric so there is no way to distinguish the
two cases except by having the user indicate their intention.

Also, there exist cases where
- fields are unquoted,
- fields are quoted and doubling the quotes are used to indicate an actual quote and
- where fields are quoted but a backslash quote it used to denote an actual quote.
Ideally all these situations could be handled through some combination
of automatic and specified arguments. In the case of R's read.table
it cannot handle the back slashed quote case but handles the others
mentioned.

On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle wrote:
>
> Hi,
>
> I suspect you may not have scrolled further down in my reply where I
> wrote more?
>
> Matthew
>
> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>
>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9
>> from R-Forge now and the sample csv I was using does indeed work
>> attempting to do the best it can with the mucked up header. Maybe
Maybe >> this is sufficient and a skip is not needed but the fact is that there >> is no facility to skip over the bad header had I wanted to. >> >> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >> wrote: >>> >>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>> >>>> >>>> Not with the csv I tried. The header is messed up (most of the header >>>> fields are missing) and it misconstrues it as data. >>> >>> >>> >>> That was fixed a while ago in v1.8.9, from NEWS : >>> >>> " [fread] If some column names are blank they are now given default >>> names >>> rather than causing the header row to be read as a data row " >>> >>> >>>> The automation is great but some way to force its behavior when you >>>> know what it should do seems essential since heuristics can't be >>>> expected to work in all cases. >>> >>> >>> >>> I suspect the heuristics in v1.8.9 work on all your examples so far, but >>> ok >>> point taken. >>> >>> fread allows control of 'autostart' already. This is a line number >>> (default >>> 30) within the regular data block used to detect the separator and search >>> upwards from to find the first data row and/or column names. >>> >>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning >>> off >>> the search upwards part. Line skip+1 will be used to detect the separator >>> when sep="auto" and used as column names according to >>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>> autostart and skip in the same call. If that sounds ok? >>> >>> Matthew >>> >>> >>> >>>> >>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>> wrote: >>>>> >>>>> >>>>> >>>>> Hi, >>>>> >>>>> Does the auto skip feature of fread cover both of those? From ?fread : >>>>> >>>>> " Once the separator is found on line autostart, the number of >>>>> columns >>>>> is >>>>> determined. Then the file is searched backwards from autostart until a >>>>> row >>>>> is found that doesn't have that number of columns, or the start of file >>>>> is >>>>> reached. Thus, the first data row is found and any human readable >>>>> banners >>>>> are automatically skipped. This feature can be particularly useful for >>>>> loading a set of files which may not all have consistently sized >>>>> banners. >>>>> " >>>>> >>>>> There were also some issue with header=FALSE in the first release >>>>> (1.8.8) >>>>> which have since been fixed in 1.8.9. >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> >>>>>> I would find it useful if fread had a skip= argument as in read.table >>>>>> since I have files from time to time that have garbage at the top. >>>>>> Another situation I find from time to time is that the header is >>>>>> messed up but one can still read the file if one can skip over the >>>>>> header and specify header = FALSE. >>>>>> >>>>>> An extra feature that would be nice but less important would be if one >>>>>> could specify skip = "string" and have it skip all lines until it >>>>>> found one with "string": in it and then start reading from the matched >>>>>> row onward. Normally the string would be chosen to be a string found >>>>>> in the header and not likely found prior to the header. read.xls in >>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>> >>>>>> -- >>>>>> Statistics & Software Consulting >>>>>> GKX Group, GKX Associates Inc. 
>>>>>> tel: 1-877-GKX-GROUP
>>>>>> email: ggrothendieck at gmail.com
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From mdowle at mdowle.plus.com  Sun May 12 16:14:00 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 15:14:00 +0100
Subject: [datatable-help] fread: skip
In-Reply-To:
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
	<41657922b05299edb07739e0c59add64@imap.plus.net>
Message-ID:

Agreed too. colClasses was committed yesterday as luck would have it.

?fread now has :

colClasses : A character vector of classes (named or unnamed), as read.csv.
Or, type list enables setting ranges of columns by numeric position.
colClasses in fread is intended for rare overrides, not for routine use.
fread will only promote a column to a higher type if colClasses requests it.
It won't downgrade a column to a lower type since NAs would result. You have
to coerce such columns afterwards yourself, if you really require data loss.

The tests so far are as follows :

input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n'

test(952, fread(input, colClasses=c(C="character")),
  data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000")))
test(953, fread(input, colClasses=c(C="character",A="numeric")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(954, fread(input, colClasses=c(C="character",A="double")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(955, fread(input, colClasses=list(character="C",double="A")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(956, fread(input, colClasses=list(character=2:3,double="A")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(957, fread(input, colClasses=list(character=1:3)),
  data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
test(958, fread(input, colClasses="character"),
  data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
test(959, fread(input, colClasses=c("character","double","numeric")),
  data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28)))

test(960, fread(input, colClasses=c("character","double")),
  error="colClasses is unnamed and length 2 but there are 3 columns. See")
test(961, fread(input, colClasses=1:3),
  error="colClasses is not type list or character vector")
test(962, fread(input, colClasses=list(1:3)),
  error="colClasses is type list but has no names")
test(963, fread(input, colClasses=list(character="D")),
  error="Column name 'D' in colClasses not found in data")
test(964, fread(input, colClasses=c(D="character")),
  error="Column name 'D' in colClasses not found in data")
test(965, fread(input, colClasses=list(character=0)),
  error="Column number 0 (colClasses..1...1.) is out of range .1,ncol=3.")
test(966, fread(input, colClasses=list(character=2:4)),
  error="Column number 4 (colClasses..1...3.) is out of range .1,ncol=3.")

More detailed/trace info is provided when verbose=TRUE.

On embedded quotes there are known and documented problems still to resolve.
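[For concreteness -- an illustration, not from the thread -- the two
embedded-quote conventions Gabor described would look like this in a csv,
where both B fields are intended to be read as the single value a "b" c:

A,B
1,"a ""b"" c"
2,"a \"b\" c"
]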
The issue there is subtle: when reading character columns, part of fread's speed comes from pointing mkCharLen() directly to the field in memory mapped region of RAM i.e. the field isn't copied into any intermediate buffer at all. But for embedded quotes (either doubled or escaped) we do need to copy to a buffer so we can remove the doubled quote, or escape character (i.e. change the field) before calling mkCharLen(). That's not a problem per se, but just a new twist to the C code to implement. In order to not slow down, it need only copy that field to a buffer if a doubled or escaped quote was actually present in that particular field. Matthew On 12.05.2013 14:24, Gabor Grothendieck wrote: > Sorry, I did indeed miss the portion of the reply at the very bottom. > Yes, that seems good. > > What about colClasses too? I would think that there would be cases > where an automatic approach might not give the result wanted. For > example, order numbers might all be numeric but you would want to > store them as character in case there are leading zeros. In other > cases similar fields might validly have leading zeros but you would > want them regarded as numeric so there is no way to distinguish the > two cases except by having the user indicate their intention. > > Also, there exist cases where > - fields are unquoted, > - fields are quoted and doubling the quotes are used to indicate an > actual quote and > - where fields are quoted but a backslash quote it used to denote an > actual quote. > Ideally all these situations could be handled through some > combination > of automatic and specified arguments. In the case of R's read.table > it cannot handle the back slashed quote case but handles the others > mentioned. > > > On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle > wrote: >> >> Hi, >> >> I suspect you may not have scrolled further down in my reply where I >> wrote >> more? >> >> Matthew >> >> >> >> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>> >>> 1.8.8 is the most recent version on CRAN so I have now installed >>> 1.8.9 >>> from R-Forge now and the sample csv I was using does indeed work >>> attempting to do the best it can with the mucked up header. Maybe >>> this is sufficient and a skip is not needed but the fact is that >>> there >>> is no facility to skip over the bad header had I wanted to. >>> >>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>> wrote: >>>> >>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> Not with the csv I tried. The header is messed up (most of the >>>>> header >>>>> fields are missing) and it misconstrues it as data. >>>> >>>> >>>> >>>> That was fixed a while ago in v1.8.9, from NEWS : >>>> >>>> " [fread] If some column names are blank they are now given >>>> default >>>> names >>>> rather than causing the header row to be read as a data row " >>>> >>>> >>>>> The automation is great but some way to force its behavior when >>>>> you >>>>> know what it should do seems essential since heuristics can't be >>>>> expected to work in all cases. >>>> >>>> >>>> >>>> I suspect the heuristics in v1.8.9 work on all your examples so >>>> far, but >>>> ok >>>> point taken. >>>> >>>> fread allows control of 'autostart' already. This is a line number >>>> (default >>>> 30) within the regular data block used to detect the separator and >>>> search >>>> upwards from to find the first data row and/or column names. >>>> >>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>> turning >>>> off >>>> the search upwards part. 
Line skip+1 will be used to detect the >>>> separator >>>> when sep="auto" and used as column names according to >>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify >>>> both >>>> autostart and skip in the same call. If that sounds ok? >>>> >>>> Matthew >>>> >>>> >>>> >>>>> >>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> Does the auto skip feature of fread cover both of those? From >>>>>> ?fread : >>>>>> >>>>>> " Once the separator is found on line autostart, the number of >>>>>> columns >>>>>> is >>>>>> determined. Then the file is searched backwards from autostart >>>>>> until a >>>>>> row >>>>>> is found that doesn't have that number of columns, or the start >>>>>> of file >>>>>> is >>>>>> reached. Thus, the first data row is found and any human >>>>>> readable >>>>>> banners >>>>>> are automatically skipped. This feature can be particularly >>>>>> useful for >>>>>> loading a set of files which may not all have consistently sized >>>>>> banners. >>>>>> " >>>>>> >>>>>> There were also some issue with header=FALSE in the first >>>>>> release >>>>>> (1.8.8) >>>>>> which have since been fixed in 1.8.9. >>>>>> >>>>>> Matthew >>>>>> >>>>>> >>>>>> >>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>> read.table >>>>>>> since I have files from time to time that have garbage at the >>>>>>> top. >>>>>>> Another situation I find from time to time is that the header >>>>>>> is >>>>>>> messed up but one can still read the file if one can skip over >>>>>>> the >>>>>>> header and specify header = FALSE. >>>>>>> >>>>>>> An extra feature that would be nice but less important would be >>>>>>> if one >>>>>>> could specify skip = "string" and have it skip all lines until >>>>>>> it >>>>>>> found one with "string": in it and then start reading from the >>>>>>> matched >>>>>>> row onward. Normally the string would be chosen to be a >>>>>>> string found >>>>>>> in the header and not likely found prior to the header. >>>>>>> read.xls in >>>>>>> gdata has a similar feature and I find it quite handy at >>>>>>> times. >>>>>>> >>>>>>> -- >>>>>>> Statistics & Software Consulting >>>>>>> GKX Group, GKX Associates Inc. >>>>>>> tel: 1-877-GKX-GROUP >>>>>>> email: ggrothendieck at gmail.com >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >> >> From ggrothendieck at gmail.com Sun May 12 16:44:10 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 12 May 2013 10:44:10 -0400 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: That looks great. It occurred to me in looking at this that one thing that might be useful would be to provide some conversion routines that can be specified as classes in the colClass vector that will convert numbers from Excel representing Dates or date/times to Date and POSIXct class respectively. (The mapping is discussed in R News 4/1.) 
On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle wrote: > > Agreed too. colClasses was committed yesterday as luck would have it. > > ?fread now has : > > colClasses : A character vector of classes (named or unnamed), as > read.csv. Or, type list enables setting ranges of columns by numeric > position. colClasses in fread is intended for rare overrides, not for > routine use. fread will only promote a column to a higher type if colClasses > requests it. It won't downgrade a column to a lower type since NAs would > result. You have to coerce such columns afterwards yourself, if you really > require data loss. > > The tests so far are as follows : > > input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n' > > test(952, fread(input, colClasses=c(C="character")), > data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000"))) > test(953, fread(input, colClasses=c(C="character",A="numeric")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(954, fread(input, colClasses=c(C="character",A="double")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(955, fread(input, colClasses=list(character="C",double="A")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(956, fread(input, colClasses=list(character=2:3,double="A")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(957, fread(input, colClasses=list(character=1:3)), > data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(958, fread(input, colClasses="character"), > data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(959, fread(input, colClasses=c("character","double","numeric")), > data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28))) > > test(960, fread(input, colClasses=c("character","double")), > error="colClasses is unnamed and length 2 but there are 3 columns. See") > test(961, fread(input, colClasses=1:3), error="colClasses is not type list > or character vector") > test(962, fread(input, colClasses=list(1:3)), error="colClasses is type list > but has no names") > test(963, fread(input, colClasses=list(character="D")), error="Column name > 'D' in colClasses not found in data") > test(964, fread(input, colClasses=c(D="character")), error="Column name 'D' > in colClasses not found in data") > test(965, fread(input, colClasses=list(character=0)), error="Column number 0 > (colClasses..1...1.) is out of range .1,ncol=3.") > test(966, fread(input, colClasses=list(character=2:4)), error="Column number > 4 (colClasses..1...3.) is out of range .1,ncol=3.") > > More detailed/trace info is provided when verbose=TRUE. > > > On embedded quotes there are known and documented problems still to resolve. > The issue there is subtle: when reading character columns, part of fread's > speed comes from pointing mkCharLen() directly to the field in memory mapped > region of RAM i.e. the field isn't copied into any intermediate buffer at > all. But for embedded quotes (either doubled or escaped) we do need to copy > to a buffer so we can remove the doubled quote, or escape character (i.e. > change the field) before calling mkCharLen(). That's not a problem per se, > but just a new twist to the C code to implement. In order to not slow down, > it need only copy that field to a buffer if a doubled or escaped quote was > actually present in that particular field. > > Matthew > > > > On 12.05.2013 14:24, Gabor Grothendieck wrote: >> >> Sorry, I did indeed miss the portion of the reply at the very bottom. 
>> Yes, that seems good. >> >> What about colClasses too? I would think that there would be cases >> where an automatic approach might not give the result wanted. For >> example, order numbers might all be numeric but you would want to >> store them as character in case there are leading zeros. In other >> cases similar fields might validly have leading zeros but you would >> want them regarded as numeric so there is no way to distinguish the >> two cases except by having the user indicate their intention. >> >> Also, there exist cases where >> - fields are unquoted, >> - fields are quoted and doubling the quotes are used to indicate an >> actual quote and >> - where fields are quoted but a backslash quote it used to denote an >> actual quote. >> Ideally all these situations could be handled through some combination >> of automatic and specified arguments. In the case of R's read.table >> it cannot handle the back slashed quote case but handles the others >> mentioned. >> >> >> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >> wrote: >>> >>> >>> Hi, >>> >>> I suspect you may not have scrolled further down in my reply where I >>> wrote >>> more? >>> >>> Matthew >>> >>> >>> >>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>> >>>> >>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>> from R-Forge now and the sample csv I was using does indeed work >>>> attempting to do the best it can with the mucked up header. Maybe >>>> this is sufficient and a skip is not needed but the fact is that there >>>> is no facility to skip over the bad header had I wanted to. >>>> >>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>> wrote: >>>>> >>>>> >>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> >>>>>> Not with the csv I tried. The header is messed up (most of the header >>>>>> fields are missing) and it misconstrues it as data. >>>>> >>>>> >>>>> >>>>> >>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>> >>>>> " [fread] If some column names are blank they are now given default >>>>> names >>>>> rather than causing the header row to be read as a data row " >>>>> >>>>> >>>>>> The automation is great but some way to force its behavior when you >>>>>> know what it should do seems essential since heuristics can't be >>>>>> expected to work in all cases. >>>>> >>>>> >>>>> >>>>> >>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, >>>>> but >>>>> ok >>>>> point taken. >>>>> >>>>> fread allows control of 'autostart' already. This is a line number >>>>> (default >>>>> 30) within the regular data block used to detect the separator and >>>>> search >>>>> upwards from to find the first data row and/or column names. >>>>> >>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>> turning >>>>> off >>>>> the search upwards part. Line skip+1 will be used to detect the >>>>> separator >>>>> when sep="auto" and used as column names according to >>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>> autostart and skip in the same call. If that sounds ok? >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>>> >>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Does the auto skip feature of fread cover both of those? From ?fread >>>>>>> : >>>>>>> >>>>>>> " Once the separator is found on line autostart, the number of >>>>>>> columns >>>>>>> is >>>>>>> determined. 
Then the file is searched backwards from autostart until >>>>>>> a >>>>>>> row >>>>>>> is found that doesn't have that number of columns, or the start of >>>>>>> file >>>>>>> is >>>>>>> reached. Thus, the first data row is found and any human readable >>>>>>> banners >>>>>>> are automatically skipped. This feature can be particularly useful >>>>>>> for >>>>>>> loading a set of files which may not all have consistently sized >>>>>>> banners. >>>>>>> " >>>>>>> >>>>>>> There were also some issue with header=FALSE in the first release >>>>>>> (1.8.8) >>>>>>> which have since been fixed in 1.8.9. >>>>>>> >>>>>>> Matthew >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>> read.table >>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>> Another situation I find from time to time is that the header is >>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>> header and specify header = FALSE. >>>>>>>> >>>>>>>> An extra feature that would be nice but less important would be if >>>>>>>> one >>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>> found one with "string": in it and then start reading from the >>>>>>>> matched >>>>>>>> row onward. Normally the string would be chosen to be a string >>>>>>>> found >>>>>>>> in the header and not likely found prior to the header. read.xls in >>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>> >>>>>>>> -- >>>>>>>> Statistics & Software Consulting >>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>> email: ggrothendieck at gmail.com >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>> >>> >>> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 17:20:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 16:20:44 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> For that I think all that needs to be done (now) is adding something very similar to these few lines (from read.table) into fread at R level after the data has been read in : if (colClasses[i] == "factor") as.factor(data[[i]]) else if (colClasses[i] == "Date") as.Date(data[[i]]) else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]]) else methods::as(data[[i]], colClasses[i]) Although I don't quite see why read.table explicity deals with factor, Date and POSIXct separately, rather than leaving them to the methods::as catch all at the end. But reading dates (for example) as character and then converting to Date at R level is going to be relatively slow due to the intermediate character vector and adding all the unique strings to R's global cache. Direct reading of dates (e.g. 
by using Simon U's fasttime package) could be built in at C level at a later date just for speed, without breaking syntax or output types. In the meantime it would work at least. That's the thinking, anyway. I found some discussion in R News 4.1 about Excel dates and times, but not on colClasses or that mapping specifically. Currently in fread if a colClasses name isn't recognised as a basic type like integer|numeric|double|integer64|character, then it's read as character and (to be done) as long as there's an as.() method for it that'll take care of it. Reading numbers (such as offset from epoch) and then as() on that numeric|integer column isn't something I'd considered before (is that what you mean?) Matthew On 12.05.2013 15:44, Gabor Grothendieck wrote: > That looks great. It occurred to me in looking at this that one > thing > that might be useful would be to provide some conversion routines > that > can be specified as classes in the colClass vector that will convert > numbers from Excel representing Dates or date/times to Date and > POSIXct class respectively. (The mapping is discussed in R News > 4/1.) > > On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle > wrote: >> >> Agreed too. colClasses was committed yesterday as luck would have >> it. >> >> ?fread now has : >> >> colClasses : A character vector of classes (named or unnamed), as >> read.csv. Or, type list enables setting ranges of columns by numeric >> position. colClasses in fread is intended for rare overrides, not >> for >> routine use. fread will only promote a column to a higher type if >> colClasses >> requests it. It won't downgrade a column to a lower type since NAs >> would >> result. You have to coerce such columns afterwards yourself, if you >> really >> require data loss. >> >> The tests so far are as follows : >> >> input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n' >> >> test(952, fread(input, colClasses=c(C="character")), >> data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(953, fread(input, colClasses=c(C="character",A="numeric")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(954, fread(input, colClasses=c(C="character",A="double")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(955, fread(input, colClasses=list(character="C",double="A")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(956, fread(input, colClasses=list(character=2:3,double="A")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(957, fread(input, colClasses=list(character=1:3)), >> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(958, fread(input, colClasses="character"), >> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(959, fread(input, >> colClasses=c("character","double","numeric")), >> data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28))) >> >> test(960, fread(input, colClasses=c("character","double")), >> error="colClasses is unnamed and length 2 but there are 3 columns. 
>> See") >> test(961, fread(input, colClasses=1:3), error="colClasses is not >> type list >> or character vector") >> test(962, fread(input, colClasses=list(1:3)), error="colClasses is >> type list >> but has no names") >> test(963, fread(input, colClasses=list(character="D")), >> error="Column name >> 'D' in colClasses not found in data") >> test(964, fread(input, colClasses=c(D="character")), error="Column >> name 'D' >> in colClasses not found in data") >> test(965, fread(input, colClasses=list(character=0)), error="Column >> number 0 >> (colClasses..1...1.) is out of range .1,ncol=3.") >> test(966, fread(input, colClasses=list(character=2:4)), >> error="Column number >> 4 (colClasses..1...3.) is out of range .1,ncol=3.") >> >> More detailed/trace info is provided when verbose=TRUE. >> >> >> On embedded quotes there are known and documented problems still to >> resolve. >> The issue there is subtle: when reading character columns, part of >> fread's >> speed comes from pointing mkCharLen() directly to the field in >> memory mapped >> region of RAM i.e. the field isn't copied into any intermediate >> buffer at >> all. But for embedded quotes (either doubled or escaped) we do need >> to copy >> to a buffer so we can remove the doubled quote, or escape character >> (i.e. >> change the field) before calling mkCharLen(). That's not a problem >> per se, >> but just a new twist to the C code to implement. In order to not >> slow down, >> it need only copy that field to a buffer if a doubled or escaped >> quote was >> actually present in that particular field. >> >> Matthew >> >> >> >> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>> >>> Sorry, I did indeed miss the portion of the reply at the very >>> bottom. >>> Yes, that seems good. >>> >>> What about colClasses too? I would think that there would be >>> cases >>> where an automatic approach might not give the result wanted. For >>> example, order numbers might all be numeric but you would want to >>> store them as character in case there are leading zeros. In other >>> cases similar fields might validly have leading zeros but you would >>> want them regarded as numeric so there is no way to distinguish the >>> two cases except by having the user indicate their intention. >>> >>> Also, there exist cases where >>> - fields are unquoted, >>> - fields are quoted and doubling the quotes are used to indicate an >>> actual quote and >>> - where fields are quoted but a backslash quote it used to denote >>> an >>> actual quote. >>> Ideally all these situations could be handled through some >>> combination >>> of automatic and specified arguments. In the case of R's >>> read.table >>> it cannot handle the back slashed quote case but handles the others >>> mentioned. >>> >>> >>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >>> wrote: >>>> >>>> >>>> Hi, >>>> >>>> I suspect you may not have scrolled further down in my reply where >>>> I >>>> wrote >>>> more? >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> 1.8.8 is the most recent version on CRAN so I have now installed >>>>> 1.8.9 >>>>> from R-Forge now and the sample csv I was using does indeed work >>>>> attempting to do the best it can with the mucked up header. >>>>> Maybe >>>>> this is sufficient and a skip is not needed but the fact is that >>>>> there >>>>> is no facility to skip over the bad header had I wanted to. 
>> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>> Sorry, I did indeed miss the portion of the reply at the very bottom. >>> Yes, that seems good. >>> >>> What about colClasses too? I would think that there would be cases >>> where an automatic approach might not give the result wanted. For >>> example, order numbers might all be numeric but you would want to >>> store them as character in case there are leading zeros. In other >>> cases similar fields might validly have leading zeros but you would >>> want them regarded as numeric, so there is no way to distinguish the >>> two cases except by having the user indicate their intention. >>> >>> Also, there exist cases where >>> - fields are unquoted, >>> - fields are quoted and doubling the quotes is used to indicate an >>> actual quote, and >>> - fields are quoted but a backslash quote is used to denote an >>> actual quote. >>> Ideally all these situations could be handled through some combination >>> of automatic and specified arguments. In the case of R's read.table >>> it cannot handle the backslashed quote case but handles the others >>> mentioned. >>> >>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle wrote: >>>> >>>> Hi, >>>> >>>> I suspect you may not have scrolled further down in my reply where I >>>> wrote more? >>>> >>>> Matthew >>>> >>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>> >>>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>>> from R-Forge now, and the sample csv I was using does indeed work, >>>>> attempting to do the best it can with the mucked-up header. Maybe >>>>> this is sufficient and a skip is not needed, but the fact is that there >>>>> is no facility to skip over the bad header had I wanted to. >>>>> >>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle wrote: >>>>>> >>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>>> >>>>>>> Not with the csv I tried. The header is messed up (most of the header >>>>>>> fields are missing) and it misconstrues it as data. >>>>>> >>>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>>> >>>>>> " [fread] If some column names are blank they are now given default names >>>>>> rather than causing the header row to be read as a data row " >>>>>> >>>>>>> The automation is great but some way to force its behavior when you >>>>>>> know what it should do seems essential since heuristics can't be >>>>>>> expected to work in all cases. >>>>>> >>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but >>>>>> ok, point taken. >>>>>> >>>>>> fread allows control of 'autostart' already. This is a line number (default >>>>>> 30) within the regular data block used to detect the separator and search >>>>>> upwards from to find the first data row and/or column names. >>>>>> >>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off >>>>>> the search upwards part. Line skip+1 will be used to detect the separator >>>>>> when sep="auto" and used as column names according to >>>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>>> autostart and skip in the same call. If that sounds ok? >>>>>> >>>>>> Matthew >>>>>> >>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Does the auto skip feature of fread cover both of those? From ?fread : >>>>>>>> >>>>>>>> " Once the separator is found on line autostart, the number of columns is >>>>>>>> determined. Then the file is searched backwards from autostart until a row >>>>>>>> is found that doesn't have that number of columns, or the start of file is >>>>>>>> reached. Thus, the first data row is found and any human readable banners >>>>>>>> are automatically skipped. This feature can be particularly useful for >>>>>>>> loading a set of files which may not all have consistently sized banners. " >>>>>>>> >>>>>>>> There were also some issues with header=FALSE in the first release (1.8.8) >>>>>>>> which have since been fixed in 1.8.9. >>>>>>>> >>>>>>>> Matthew >>>>>>>> >>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>>> >>>>>>>>> I would find it useful if fread had a skip= argument as in read.table >>>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>>> Another situation I find from time to time is that the header is >>>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>>> header and specify header = FALSE. >>>>>>>>> >>>>>>>>> An extra feature that would be nice but less important would be if one >>>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>>> found one with "string" in it and then start reading from the matched >>>>>>>>> row onward. Normally the string would be chosen to be a string found >>>>>>>>> in the header and not likely found prior to the header. read.xls in >>>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Statistics & Software Consulting >>>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>>> email: ggrothendieck at gmail.com >>>>>>>>> _______________________________________________ >>>>>>>>> datatable-help mailing list >>>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>> >>>>> -- >>>>> Statistics & Software Consulting >>>>> GKX Group, GKX Associates Inc. >>>>> tel: 1-877-GKX-GROUP >>>>> email: ggrothendieck at gmail.com From karl at huftis.org Sun May 12 17:39:50 2013 From: karl at huftis.org (Karl Ove Hufthammer) Date: Sun, 12 May 2013 17:39:50 +0200 Subject: [datatable-help] fread: skip In-Reply-To: <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> Message-ID: <1368373190.5719.19.camel@adrian.site> Sun 12 May 2013 at 16:20 (+0100), Matthew Dowle wrote: > For that I think all that needs to be done (now) is adding something > very similar to these few lines (from read.table) into fread at R level > after the data has been read in : > > if (colClasses[i] == "factor") > as.factor(data[[i]]) > else if (colClasses[i] == "Date") > as.Date(data[[i]]) > else if (colClasses[i] == "POSIXct") > as.POSIXct(data[[i]]) Any chance you could support the 'tz' attribute of 'as.POSIXct' (as a global value for all datetimes would probably be sufficient)? By default, strings are interpreted as being in the locale timezone, which means that some apparently valid datetimes are invalid (because of DST), resulting in loss of information in the resulting POSIXct vectors. See the following 'not-a-bug' for an explanation of the problem: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14845 -- Karl Ove Hufthammer http://huftis.org/ Jabber: karl at huftis.org
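To see Karl's DST point in isolation, a small sketch (the timezone and timestamp here are illustrative; the Sys.setenv(TZ) workaround is the one Gabor suggests further down the thread). During the spring-forward jump -- 02:00 to 03:00 on 2013-03-31 in Europe -- wall-clock times in the skipped hour simply do not exist in the locale timezone:

Sys.setenv(TZ = "Europe/Oslo")
as.POSIXct("2013-03-31 02:30:00")  # falls in the skipped hour: typically NA (platform-dependent)
Sys.setenv(TZ = "GMT")             # pin the timezone and every timestamp parses
as.POSIXct("2013-03-31 02:30:00")  # "2013-03-31 02:30:00 GMT"
Sys.setenv(TZ = "")                # restore the locale default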
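A minimal sketch of how such a converter would plug in (the column name and serial values below are made up for illustration; read.table/read.csv hands unrecognised colClasses entries to methods::as(), which is what makes the pattern work):

setClass("excel.date")
setAs("character", "excel.date",
      function(from) structure(as.numeric(from) - 25569, class = "Date"))

csv <- "id,sold\n1,41406\n2,41407\n"   # 41406 is the Windows-Excel serial for 2013-05-12
read.csv(text = csv, colClasses = c(sold = "excel.date"))
# should give:
#   id       sold
# 1  1 2013-05-12
# 2  2 2013-05-13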
On Sun, May 12, 2013 at 11:20 AM, Matthew Dowle wrote: > > For that I think all that needs to be done (now) is adding something very > similar to these few lines (from read.table) into fread at R level after the > data has been read in : > > if (colClasses[i] == "factor") > as.factor(data[[i]]) > else if (colClasses[i] == "Date") > as.Date(data[[i]]) > else if (colClasses[i] == "POSIXct") > as.POSIXct(data[[i]]) > else methods::as(data[[i]], colClasses[i]) > > Although I don't quite see why read.table explicitly deals with factor, Date > and POSIXct separately, rather than leaving them to the methods::as catch-all > at the end. > > But reading dates (for example) as character and then converting to Date at R > level is going to be relatively slow due to the intermediate character vector > and adding all the unique strings to R's global cache. Direct reading of dates > (e.g. by using Simon U's fasttime package) could be built in at C level at a > later date just for speed, without breaking syntax or output types. In the > meantime it would work at least. That's the thinking, anyway. > > I found some discussion in R News 4.1 about Excel dates and times, but not on > colClasses or that mapping specifically. Currently in fread, if a colClasses > name isn't recognised as a basic type like integer|numeric|double|integer64|character, > then it's read as character and (to be done) as long as there's an as.() method > for it that'll take care of it. Reading numbers (such as offset from epoch) and > then as() on that numeric|integer column isn't something I'd considered before > (is that what you mean?) > > Matthew > > On 12.05.2013 15:44, Gabor Grothendieck wrote: >> That looks great. It occurred to me in looking at this that one thing >> that might be useful would be to provide some conversion routines that >> can be specified as classes in the colClasses vector that will convert >> numbers from Excel representing Dates or date/times to Date and >> POSIXct class respectively. (The mapping is discussed in R News 4/1.) >> >> On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle wrote: >>> >>> Agreed too. colClasses was committed yesterday as luck would have it. >>> >>> ?fread now has : >>> >>> colClasses : A character vector of classes (named or unnamed), as >>> read.csv. Or, type list enables setting ranges of columns by numeric >>> position. colClasses in fread is intended for rare overrides, not for >>> routine use. fread will only promote a column to a higher type if colClasses >>> requests it. It won't downgrade a column to a lower type since NAs would >>> result. You have to coerce such columns afterwards yourself, if you really >>> require data loss.
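A quick illustration of the promote-but-never-downgrade rule just quoted, using the same input as the tests that follow (a sketch of the documented behaviour, not pasted from a session):

input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n'
fread(input)                                   # A detected as integer: 1, 2 (leading zeros lost)
fread(input, colClasses = c(A = "character"))  # promotion honoured: A kept as "01", "002"
fread(input, colClasses = c(C = "integer"))    # downgrade refused: C stays numeric (3.14, 6.28)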
>>> >>> The tests so far are as follows : >>> >>> input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n' >>> >>> test(952, fread(input, colClasses=c(C="character")), >>> data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(953, fread(input, colClasses=c(C="character",A="numeric")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(954, fread(input, colClasses=c(C="character",A="double")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(955, fread(input, colClasses=list(character="C",double="A")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(956, fread(input, colClasses=list(character=2:3,double="A")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(957, fread(input, colClasses=list(character=1:3)), >>> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(958, fread(input, colClasses="character"), >>> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(959, fread(input, colClasses=c("character","double","numeric")), >>> data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28))) >>> >>> test(960, fread(input, colClasses=c("character","double")), >>> error="colClasses is unnamed and length 2 but there are 3 columns. See") >>> test(961, fread(input, colClasses=1:3), error="colClasses is not type >>> list >>> or character vector") >>> test(962, fread(input, colClasses=list(1:3)), error="colClasses is type >>> list >>> but has no names") >>> test(963, fread(input, colClasses=list(character="D")), error="Column >>> name >>> 'D' in colClasses not found in data") >>> test(964, fread(input, colClasses=c(D="character")), error="Column name >>> 'D' >>> in colClasses not found in data") >>> test(965, fread(input, colClasses=list(character=0)), error="Column >>> number 0 >>> (colClasses..1...1.) is out of range .1,ncol=3.") >>> test(966, fread(input, colClasses=list(character=2:4)), error="Column >>> number >>> 4 (colClasses..1...3.) is out of range .1,ncol=3.") >>> >>> More detailed/trace info is provided when verbose=TRUE. >>> >>> >>> On embedded quotes there are known and documented problems still to >>> resolve. >>> The issue there is subtle: when reading character columns, part of >>> fread's >>> speed comes from pointing mkCharLen() directly to the field in memory >>> mapped >>> region of RAM i.e. the field isn't copied into any intermediate buffer at >>> all. But for embedded quotes (either doubled or escaped) we do need to >>> copy >>> to a buffer so we can remove the doubled quote, or escape character (i.e. >>> change the field) before calling mkCharLen(). That's not a problem per >>> se, >>> but just a new twist to the C code to implement. In order to not slow >>> down, >>> it need only copy that field to a buffer if a doubled or escaped quote >>> was >>> actually present in that particular field. >>> >>> Matthew >>> >>> >>> >>> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>>> >>>> >>>> Sorry, I did indeed miss the portion of the reply at the very bottom. >>>> Yes, that seems good. >>>> >>>> What about colClasses too? I would think that there would be cases >>>> where an automatic approach might not give the result wanted. For >>>> example, order numbers might all be numeric but you would want to >>>> store them as character in case there are leading zeros. 
In other >>>> cases similar fields might validly have leading zeros but you would >>>> want them regarded as numeric so there is no way to distinguish the >>>> two cases except by having the user indicate their intention. >>>> >>>> Also, there exist cases where >>>> - fields are unquoted, >>>> - fields are quoted and doubling the quotes are used to indicate an >>>> actual quote and >>>> - where fields are quoted but a backslash quote it used to denote an >>>> actual quote. >>>> Ideally all these situations could be handled through some combination >>>> of automatic and specified arguments. In the case of R's read.table >>>> it cannot handle the back slashed quote case but handles the others >>>> mentioned. >>>> >>>> >>>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >>>> wrote: >>>>> >>>>> >>>>> >>>>> Hi, >>>>> >>>>> I suspect you may not have scrolled further down in my reply where I >>>>> wrote >>>>> more? >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> >>>>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>>>> from R-Forge now and the sample csv I was using does indeed work >>>>>> attempting to do the best it can with the mucked up header. Maybe >>>>>> this is sufficient and a skip is not needed but the fact is that there >>>>>> is no facility to skip over the bad header had I wanted to. >>>>>> >>>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Not with the csv I tried. The header is messed up (most of the >>>>>>>> header >>>>>>>> fields are missing) and it misconstrues it as data. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>>>> >>>>>>> " [fread] If some column names are blank they are now given default >>>>>>> names >>>>>>> rather than causing the header row to be read as a data row " >>>>>>> >>>>>>> >>>>>>>> The automation is great but some way to force its behavior when you >>>>>>>> know what it should do seems essential since heuristics can't be >>>>>>>> expected to work in all cases. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, >>>>>>> but >>>>>>> ok >>>>>>> point taken. >>>>>>> >>>>>>> fread allows control of 'autostart' already. This is a line number >>>>>>> (default >>>>>>> 30) within the regular data block used to detect the separator and >>>>>>> search >>>>>>> upwards from to find the first data row and/or column names. >>>>>>> >>>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>>>> turning >>>>>>> off >>>>>>> the search upwards part. Line skip+1 will be used to detect the >>>>>>> separator >>>>>>> when sep="auto" and used as column names according to >>>>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>>>> autostart and skip in the same call. If that sounds ok? >>>>>>> >>>>>>> Matthew >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Does the auto skip feature of fread cover both of those? From >>>>>>>>> ?fread >>>>>>>>> : >>>>>>>>> >>>>>>>>> " Once the separator is found on line autostart, the number of >>>>>>>>> columns >>>>>>>>> is >>>>>>>>> determined. 
Then the file is searched backwards from autostart >>>>>>>>> until >>>>>>>>> a >>>>>>>>> row >>>>>>>>> is found that doesn't have that number of columns, or the start of >>>>>>>>> file >>>>>>>>> is >>>>>>>>> reached. Thus, the first data row is found and any human readable >>>>>>>>> banners >>>>>>>>> are automatically skipped. This feature can be particularly useful >>>>>>>>> for >>>>>>>>> loading a set of files which may not all have consistently sized >>>>>>>>> banners. >>>>>>>>> " >>>>>>>>> >>>>>>>>> There were also some issue with header=FALSE in the first release >>>>>>>>> (1.8.8) >>>>>>>>> which have since been fixed in 1.8.9. >>>>>>>>> >>>>>>>>> Matthew >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>>>> read.table >>>>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>>>> Another situation I find from time to time is that the header is >>>>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>>>> header and specify header = FALSE. >>>>>>>>>> >>>>>>>>>> An extra feature that would be nice but less important would be if >>>>>>>>>> one >>>>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>>>> found one with "string": in it and then start reading from the >>>>>>>>>> matched >>>>>>>>>> row onward. Normally the string would be chosen to be a string >>>>>>>>>> found >>>>>>>>>> in the header and not likely found prior to the header. read.xls >>>>>>>>>> in >>>>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Statistics & Software Consulting >>>>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>>>> email: ggrothendieck at gmail.com >>>>>>>>>> _______________________________________________ >>>>>>>>>> datatable-help mailing list >>>>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Statistics & Software Consulting >>>>>> GKX Group, GKX Associates Inc. >>>>>> tel: 1-877-GKX-GROUP >>>>>> email: ggrothendieck at gmail.com >>>>> >>>>> >>>>> >>>>> >>> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From ggrothendieck at gmail.com Sun May 12 17:42:15 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 12 May 2013 11:42:15 -0400 Subject: [datatable-help] fread: skip In-Reply-To: <1368373190.5719.19.camel@adrian.site> References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> <1368373190.5719.19.camel@adrian.site> Message-ID: This makes the local time zone GMT: Sys.setenv(TZ = "GMT") and this switches back: Sys.setenv(TZ = "") On Sun, May 12, 2013 at 11:39 AM, Karl Ove Hufthammer wrote: > su. den 12. 05. 
2013 at 16:20 (+0100), Matthew Dowle wrote: >> For that I think all that needs to be done (now) is adding something >> very similar to these few lines (from read.table) into fread at R level >> after the data has been read in : >> >> if (colClasses[i] == "factor") >> as.factor(data[[i]]) >> else if (colClasses[i] == "Date") >> as.Date(data[[i]]) >> else if (colClasses[i] == "POSIXct") >> as.POSIXct(data[[i]]) > > Any chance you could support the 'tz' attribute of 'as.POSIXct' > (as a global value for all datetimes would probably be sufficient)? > > By default, strings are interpreted as being in the locale timezone, > which means that some apparently valid datetimes are invalid (because of > DST), resulting in loss of information in the resulting POSIXct vectors. > > See the following 'not-a-bug' for an explanation of the problem: > https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14845 > > -- > Karl Ove Hufthammer > http://huftis.org/ > Jabber: karl at huftis.org > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 19:33:35 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 18:33:35 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: Since I'm in the fread code at the moment I added 'skip' (rev 864). 4 tests added : > input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n" > fread(input) some bad data 1: A B C 2: 1 3 5 3: 2 4 6 > fread(input, skip=1) A B C 1: 1 3 5 2: 2 4 6 > fread(input, skip=2) V1 V2 V3 1: 1 3 5 2: 2 4 6 > fread(input, skip=2, header=TRUE) 1 3 5 1: 2 4 6 > On 12.05.2013 14:24, Gabor Grothendieck wrote: > Sorry, I did indeed miss the portion of the reply at the very bottom. > Yes, that seems good. > > On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle wrote: >> >> Hi, >> >> I suspect you may not have scrolled further down in my reply where I >> wrote more? >> >> Matthew >> >> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>> >>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>> from R-Forge now, and the sample csv I was using does indeed work, >>> attempting to do the best it can with the mucked-up header. Maybe >>> this is sufficient and a skip is not needed, but the fact is that there >>> is no facility to skip over the bad header had I wanted to. >>> >>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>> wrote: >>>> >>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>> >>>>> Not with the csv I tried. The header is messed up (most of the header >>>>> fields are missing) and it misconstrues it as data. >>>> >>>> That was fixed a while ago in v1.8.9, from NEWS : >>>> >>>> " [fread] If some column names are blank they are now given default names >>>> rather than causing the header row to be read as a data row " >>>> >>>>> The automation is great but some way to force its behavior when you >>>>> know what it should do seems essential since heuristics can't be >>>>> expected to work in all cases. >>>> >>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but >>>> ok, point taken.
>>>> >>>> fread allows control of 'autostart' already. This is a line number >>>> (default >>>> 30) within the regular data block used to detect the separator and >>>> search >>>> upwards from to find the first data row and/or column names. >>>> >>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>> turning >>>> off >>>> the search upwards part. Line skip+1 will be used to detect the >>>> separator >>>> when sep="auto" and used as column names according to >>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify >>>> both >>>> autostart and skip in the same call. If that sounds ok? >>>> >>>> Matthew >>>> >>>> >>>> >>>>> >>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> Does the auto skip feature of fread cover both of those? From >>>>>> ?fread : >>>>>> >>>>>> " Once the separator is found on line autostart, the number of >>>>>> columns >>>>>> is >>>>>> determined. Then the file is searched backwards from autostart >>>>>> until a >>>>>> row >>>>>> is found that doesn't have that number of columns, or the start >>>>>> of file >>>>>> is >>>>>> reached. Thus, the first data row is found and any human >>>>>> readable >>>>>> banners >>>>>> are automatically skipped. This feature can be particularly >>>>>> useful for >>>>>> loading a set of files which may not all have consistently sized >>>>>> banners. >>>>>> " >>>>>> >>>>>> There were also some issue with header=FALSE in the first >>>>>> release >>>>>> (1.8.8) >>>>>> which have since been fixed in 1.8.9. >>>>>> >>>>>> Matthew >>>>>> >>>>>> >>>>>> >>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>> read.table >>>>>>> since I have files from time to time that have garbage at the >>>>>>> top. >>>>>>> Another situation I find from time to time is that the header >>>>>>> is >>>>>>> messed up but one can still read the file if one can skip over >>>>>>> the >>>>>>> header and specify header = FALSE. >>>>>>> >>>>>>> An extra feature that would be nice but less important would be >>>>>>> if one >>>>>>> could specify skip = "string" and have it skip all lines until >>>>>>> it >>>>>>> found one with "string": in it and then start reading from the >>>>>>> matched >>>>>>> row onward. Normally the string would be chosen to be a >>>>>>> string found >>>>>>> in the header and not likely found prior to the header. >>>>>>> read.xls in >>>>>>> gdata has a similar feature and I find it quite handy at >>>>>>> times. >>>>>>> >>>>>>> -- >>>>>>> Statistics & Software Consulting >>>>>>> GKX Group, GKX Associates Inc. >>>>>>> tel: 1-877-GKX-GROUP >>>>>>> email: ggrothendieck at gmail.com >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. 
>>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >> >> From mdowle at mdowle.plus.com Mon May 13 00:01:32 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 23:01:32 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: And skip="string" is also now added and gdata credited (nice idea!) > input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal > data:\nA,B,C\n1,3,5\n2,4,6\n" > cat(input) some,bad,data some,cols 1,2 3,4 real data: A,B,C 1,3,5 2,4,6 > fread(input, skip="B,C") A B C 1: 1 3 5 2: 2 4 6 > fread(input) # autostart handles this case already (since the "real > data:" line doesn't contain 2 * sep) A B C 1: 1 3 5 2: 2 4 6 > fread(input, skip="some,cols") # using skip="string" to get the > middle table some cols 1: 1 2 2: 3 4 Warning message: In fread(input, skip = "some,cols") : Stopped reading at empty line, 2 lines after the 'skip' string was found, but text exists afterwards (discarded): real data: Further example : > input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 > 3\n2 4\n" > cat(input) some,bad,data some,cols 1,2 3,4 real data: A B 1 3 2 4 > fread(input) # with space as separator autostart can't distinguish > the "real data:" line. header wouldn't help here. real data: 1: A B 2: 1 3 3: 2 4 > fread(input, skip="B") # skip="string" needed (skip=n onerous). > Nice! A B 1: 1 3 2: 2 4 > Matthew On 12.05.2013 18:33, Matthew Dowle wrote: > Since I'm in the fread code at the moment I added 'skip' (rev 864). > 4 tests added : > >> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n" >> fread(input) > some bad data > 1: A B C > 2: 1 3 5 > 3: 2 4 6 >> fread(input, skip=1) > A B C > 1: 1 3 5 > 2: 2 4 6 >> fread(input, skip=2) > V1 V2 V3 > 1: 1 3 5 > 2: 2 4 6 >> fread(input, skip=2, header=TRUE) > 1 3 5 > 1: 2 4 6 >> > > > On 12.05.2013 14:24, Gabor Grothendieck wrote: >> Sorry, I did indeed miss the portion of the reply at the very >> bottom. >> Yes, that seems good. >> >> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >> wrote: >>> >>> Hi, >>> >>> I suspect you may not have scrolled further down in my reply where >>> I wrote >>> more? >>> >>> Matthew >>> >>> >>> >>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>> >>>> 1.8.8 is the most recent version on CRAN so I have now installed >>>> 1.8.9 >>>> from R-Forge now and the sample csv I was using does indeed work >>>> attempting to do the best it can with the mucked up header. >>>> Maybe >>>> this is sufficient and a skip is not needed but the fact is that >>>> there >>>> is no facility to skip over the bad header had I wanted to. >>>> >>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>> wrote: >>>>> >>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> Not with the csv I tried. The header is messed up (most of the >>>>>> header >>>>>> fields are missing) and it misconstrues it as data. >>>>> >>>>> >>>>> >>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>> >>>>> " [fread] If some column names are blank they are now given >>>>> default >>>>> names >>>>> rather than causing the header row to be read as a data row " >>>>> >>>>> >>>>>> The automation is great but some way to force its behavior when >>>>>> you >>>>>> know what it should do seems essential since heuristics can't be >>>>>> expected to work in all cases. 
>>>>> >>>>> >>>>> >>>>> I suspect the heuristics in v1.8.9 work on all your examples so >>>>> far, but >>>>> ok >>>>> point taken. >>>>> >>>>> fread allows control of 'autostart' already. This is a line >>>>> number >>>>> (default >>>>> 30) within the regular data block used to detect the separator >>>>> and search >>>>> upwards from to find the first data row and/or column names. >>>>> >>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>> turning >>>>> off >>>>> the search upwards part. Line skip+1 will be used to detect the >>>>> separator >>>>> when sep="auto" and used as column names according to >>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify >>>>> both >>>>> autostart and skip in the same call. If that sounds ok? >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>>> >>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Does the auto skip feature of fread cover both of those? From >>>>>>> ?fread : >>>>>>> >>>>>>> " Once the separator is found on line autostart, the number >>>>>>> of >>>>>>> columns >>>>>>> is >>>>>>> determined. Then the file is searched backwards from autostart >>>>>>> until a >>>>>>> row >>>>>>> is found that doesn't have that number of columns, or the start >>>>>>> of file >>>>>>> is >>>>>>> reached. Thus, the first data row is found and any human >>>>>>> readable >>>>>>> banners >>>>>>> are automatically skipped. This feature can be particularly >>>>>>> useful for >>>>>>> loading a set of files which may not all have consistently >>>>>>> sized >>>>>>> banners. >>>>>>> " >>>>>>> >>>>>>> There were also some issue with header=FALSE in the first >>>>>>> release >>>>>>> (1.8.8) >>>>>>> which have since been fixed in 1.8.9. >>>>>>> >>>>>>> Matthew >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>> read.table >>>>>>>> since I have files from time to time that have garbage at the >>>>>>>> top. >>>>>>>> Another situation I find from time to time is that the header >>>>>>>> is >>>>>>>> messed up but one can still read the file if one can skip over >>>>>>>> the >>>>>>>> header and specify header = FALSE. >>>>>>>> >>>>>>>> An extra feature that would be nice but less important would >>>>>>>> be if one >>>>>>>> could specify skip = "string" and have it skip all lines until >>>>>>>> it >>>>>>>> found one with "string": in it and then start reading from the >>>>>>>> matched >>>>>>>> row onward. Normally the string would be chosen to be a >>>>>>>> string found >>>>>>>> in the header and not likely found prior to the header. >>>>>>>> read.xls in >>>>>>>> gdata has a similar feature and I find it quite handy at >>>>>>>> times. >>>>>>>> >>>>>>>> -- >>>>>>>> Statistics & Software Consulting >>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>> email: ggrothendieck at gmail.com >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. 
>>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>> >>> > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From ggrothendieck at gmail.com Mon May 13 00:19:04 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 12 May 2013 18:19:04 -0400 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: Looks really nice. On Sun, May 12, 2013 at 6:01 PM, Matthew Dowle wrote: > > And skip="string" is also now added and gdata credited (nice idea!) > >> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal >> data:\nA,B,C\n1,3,5\n2,4,6\n" >> cat(input) > > some,bad,data > > some,cols > 1,2 > 3,4 > > > real data: > A,B,C > 1,3,5 > 2,4,6 >> >> fread(input, skip="B,C") > > A B C > 1: 1 3 5 > 2: 2 4 6 >> >> fread(input) # autostart handles this case already (since the "real >> data:" line doesn't contain 2 * sep) > > A B C > 1: 1 3 5 > 2: 2 4 6 >> >> fread(input, skip="some,cols") # using skip="string" to get the middle >> table > > some cols > 1: 1 2 > 2: 3 4 > Warning message: > In fread(input, skip = "some,cols") : > Stopped reading at empty line, 2 lines after the 'skip' string was found, > but text exists afterwards (discarded): real data: > > > Further example : > >> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 >> 4\n" >> cat(input) > > some,bad,data > > some,cols > 1,2 > 3,4 > > real data: > A B > 1 3 > 2 4 >> >> fread(input) # with space as separator autostart can't distinguish the >> "real data:" line. header wouldn't help here. > > real data: > 1: A B > 2: 1 3 > 3: 2 4 >> >> fread(input, skip="B") # skip="string" needed (skip=n onerous). Nice! > > A B > 1: 1 3 > 2: 2 4 >> >> > > Matthew > > > > On 12.05.2013 18:33, Matthew Dowle wrote: >> >> Since I'm in the fread code at the moment I added 'skip' (rev 864). >> 4 tests added : >> >>> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n" >>> fread(input) >> >> some bad data >> 1: A B C >> 2: 1 3 5 >> 3: 2 4 6 >>> >>> fread(input, skip=1) >> >> A B C >> 1: 1 3 5 >> 2: 2 4 6 >>> >>> fread(input, skip=2) >> >> V1 V2 V3 >> 1: 1 3 5 >> 2: 2 4 6 >>> >>> fread(input, skip=2, header=TRUE) >> >> 1 3 5 >> 1: 2 4 6 >>> >>> >> >> >> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>> >>> Sorry, I did indeed miss the portion of the reply at the very bottom. >>> Yes, that seems good. >>> >>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >>> wrote: >>>> >>>> >>>> Hi, >>>> >>>> I suspect you may not have scrolled further down in my reply where I >>>> wrote >>>> more? >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>>> from R-Forge now and the sample csv I was using does indeed work >>>>> attempting to do the best it can with the mucked up header. Maybe >>>>> this is sufficient and a skip is not needed but the fact is that there >>>>> is no facility to skip over the bad header had I wanted to. >>>>> >>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>>> wrote: >>>>>> >>>>>> >>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Not with the csv I tried. The header is messed up (most of the >>>>>>> header >>>>>>> fields are missing) and it misconstrues it as data. 
>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>>> >>>>>> " [fread] If some column names are blank they are now given default >>>>>> names >>>>>> rather than causing the header row to be read as a data row " >>>>>> >>>>>> >>>>>>> The automation is great but some way to force its behavior when you >>>>>>> know what it should do seems essential since heuristics can't be >>>>>>> expected to work in all cases. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, >>>>>> but >>>>>> ok >>>>>> point taken. >>>>>> >>>>>> fread allows control of 'autostart' already. This is a line number >>>>>> (default >>>>>> 30) within the regular data block used to detect the separator and >>>>>> search >>>>>> upwards from to find the first data row and/or column names. >>>>>> >>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>>> turning >>>>>> off >>>>>> the search upwards part. Line skip+1 will be used to detect the >>>>>> separator >>>>>> when sep="auto" and used as column names according to >>>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>>> autostart and skip in the same call. If that sounds ok? >>>>>> >>>>>> Matthew >>>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Does the auto skip feature of fread cover both of those? From >>>>>>>> ?fread : >>>>>>>> >>>>>>>> " Once the separator is found on line autostart, the number of >>>>>>>> columns >>>>>>>> is >>>>>>>> determined. Then the file is searched backwards from autostart until >>>>>>>> a >>>>>>>> row >>>>>>>> is found that doesn't have that number of columns, or the start of >>>>>>>> file >>>>>>>> is >>>>>>>> reached. Thus, the first data row is found and any human readable >>>>>>>> banners >>>>>>>> are automatically skipped. This feature can be particularly useful >>>>>>>> for >>>>>>>> loading a set of files which may not all have consistently sized >>>>>>>> banners. >>>>>>>> " >>>>>>>> >>>>>>>> There were also some issue with header=FALSE in the first release >>>>>>>> (1.8.8) >>>>>>>> which have since been fixed in 1.8.9. >>>>>>>> >>>>>>>> Matthew >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>>> read.table >>>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>>> Another situation I find from time to time is that the header is >>>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>>> header and specify header = FALSE. >>>>>>>>> >>>>>>>>> An extra feature that would be nice but less important would be if >>>>>>>>> one >>>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>>> found one with "string": in it and then start reading from the >>>>>>>>> matched >>>>>>>>> row onward. Normally the string would be chosen to be a string >>>>>>>>> found >>>>>>>>> in the header and not likely found prior to the header. read.xls in >>>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Statistics & Software Consulting >>>>>>>>> GKX Group, GKX Associates Inc. 
>>>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>>> email: ggrothendieck at gmail.com >>>>>>>>> _______________________________________________ >>>>>>>>> datatable-help mailing list >>>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>> >>>>> -- >>>>> Statistics & Software Consulting >>>>> GKX Group, GKX Associates Inc. >>>>> tel: 1-877-GKX-GROUP >>>>> email: ggrothendieck at gmail.com >>>> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Mon May 13 03:31:23 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 13 May 2013 02:31:23 +0100 Subject: [datatable-help] ":=" with "by" reassignment/updation + adding new column leads to crash In-Reply-To: <0b1179be0c07f95992fa421dca7613f1@imap.plus.net> References: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com> <0b1179be0c07f95992fa421dca7613f1@imap.plus.net> Message-ID: Hi, Now fixed in v1.8.9 : o Mixing adding and updating into one DT[, `:=`(existingCol=...,newCol=...), by=...] now works without error or segfault, #2778 and #2528. Many thanks to Arunkumar Srinivasan for reporting both with reproducible examples. Tests added. Matthew On 12.05.2013 11:44, Matthew Dowle wrote: > Hi, > > Yes I get that in latest dev too. Thanks for the nice example, please file. > > Matthew > > On 12.05.2013 08:53, Arunkumar Srinivasan wrote: > >> Hi, >> I just discovered some weird R-session crash in data.table. Here's an example to reproduce the crash. I did not find any bug filed regarding this issue. Maybe others can verify this? Then I'll file it as a bug. >> The issue is this. Suppose you've a data.table with two columns x and y as follows: >> require(data.table) >> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10) >> >> x y >> 1: 1 6 >> 2: 1 7 >> 3: 1 8 >> 4: 2 9 >> 5: 2 10 >> Now you want to add a new column "z" by reference grouped by "x". So, you'd do: >> >> DT[, `:=`(z = .GRP), by = x] >> >> x y z >> 1: 1 6 1 >> 2: 1 7 1 >> 3: 1 8 1 >> 4: 2 9 2 >> 5: 2 10 2 >> Now, for the sake of reproducing this error, assume that you assigned "z" the wrong value and that you want to change it. But you also realised that you want to add another column "w". So, you go ahead and do (remember to do the previous step and then this one): >> DT[, `:=`(z = .N, w = 2), by = x] # R session crashes >> Here, both the R and RStudio sessions crash with the traceback message: >> >> *** caught segfault *** >> address 0x0, cause 'memory not mapped' >> Traceback: >> 1: `[.data.table`(DT, , `:=`(z = .GRP, w = 2), by = x) >> 2: DT[, `:=`(z = .GRP, w = 2), by = x] >> This on the other hand works as expected if you assign both columns the first time. >> >> require(data.table) >> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10) >> DT[, `:=`(z = .GRP, w = 2), by = x] # works fine >> That is, if you assign by reference (:=) with "by" and re-assign a variable while also creating another variable, there seems to be a segfault. This error may not be limited to this case; this is just the case I've tested. >> Here's my sessionInfo() from before the crash: >> >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> other attached packages: >> [1] data.table_1.8.8 >> loaded via a namespace (and not attached): >> [1] tools_3.0.0 >> Best, >> Arun -------------- next part -------------- An HTML attachment was scrubbed... URL:
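For the record, the post-fix behaviour of Arun's example (a sketch against v1.8.9 as described above; the printed table is reconstructed from data.table's grouping semantics, not captured from a session):

require(data.table)                 # v1.8.9 or later
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
DT[, z := .GRP, by = x]             # first grouped assignment by reference
DT[, `:=`(z = .N, w = 2), by = x]   # update z and add w in one call; this segfaulted in 1.8.8
DT
#    x  y z w
# 1: 1  6 3 2
# 2: 1  7 3 2
# 3: 1  8 3 2
# 4: 2  9 2 2
# 5: 2 10 2 2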
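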
>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >> 9186293 >> 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >> 9186294 (row 9186293 excl header) is where fread thinks the file ends, >> mid-line by the look of it! >> I've experimented by truncating the file. The error varies, either it >> reads too few records or gives the error I reported, presumably determined >> by whether the last perceived line is entire. >> The problem arises when the file reaches 4GB, in this case between >> 8,030,000 and 8,040,000 rows: >> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 >> spd_all_trunc_8030k.csv >> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 >> spd_all_trunc_8040k.csv >> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >> Detected eol as \r\n (CRLF) in that order, the Windows standard. >> Looking for supplied sep ',' on line 30 (the last non blank line in the >> first 30) ... found >> Found 9 columns >> First row with 9 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 80300000 >> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 >> data rows >> Type codes: 000002000 (first 5 rows) >> Type codes: 000002000 (+middle 5 rows) >> Type codes: 000002000 (+last 5 rows) >> 0%Bumping column 7 from INT to INT64 on data row 9, field contains >> '0.42634430000000001' >> Bumping column 7 from INT64 to REAL on data row 9, field contains >> '0.42634430000000001' >> 0.000s ( 0%) Memory map (rerun may be quicker) >> 0.000s ( 0%) Sep and header detection >> 0.000s ( 0%) Count rows (wc -l) >> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >> 171.188s ( 65%) Reading data >> 1365231.809s (518439%) Allocation for type bumps (if any), including gc >> time if triggered >> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >> 0.000s ( 0%) Changing na.strings to NA >> 0.000s Total >> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >> Detected eol as \r\n (CRLF) in that order, the Windows standard. >> Looking for supplied sep ',' on line 30 (the last non blank line in the >> first 30) ... found >> Found 9 columns >> First row with 9 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. 
>> Count of eol after first data row: 18913 >> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 >> data rows >> Type codes: 000002000 (first 5 rows) >> Type codes: 000002000 (+middle 5 rows) >> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: >> 204650,724540, >> Regards, >> Paul >> >> >> On 1 May 2013 10:28, Paul Harding wrote: >> >>> Here is the verbose output: >>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>> first 30) ... found >>> Found 9 columns >>> First row with 9 fields occurs on line 1 (either column names or first >>> row of data) >>> All the fields on line 1 are character fields. Treating as the column >>> names. >>> Count of eol after first data row: 9186293 >>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 >>> data rows >>> Type codes: 000002000 (first 5 rows) >>> Type codes: 000002200 (+middle 5 rows) >>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>> types: 204038,2617097,20110803,0,0 >>> But here is the wc output (via cygwin; newline, word (whitespace delim >>> so each word one 'line' here), byte)@ >>> $ wc spd_all_fixed.csv >>> 168997637 168997638 9078155125 spd_all_fixed.csv >>> [So fread 9M, wc 168M rows]. >>> Regards >>> Paul >>> >>> >>> On 30 April 2013 18:52, Matthew Dowle wrote: >>> >>>> >>>> >>>> Hi, >>>> >>>> Thanks for reporting this. Please set verbose=TRUE and let us know the >>>> output. >>>> >>>> Thanks, Matthew >>>> >>>> >>>> >>>> On 30.04.2013 18:01, Paul Harding wrote: >>>> >>>> Problem with fread on a large file >>>> The file is 8GB, just short of 200,000 lines, produced as SQLoutput and >>>> modified by cygwin/perl to remove the second line. >>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>> fread("data/spd_all_fixed.csv",sep=",") >>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>>> types: 204038,2617097,20110803,0,0 >>>> Looking for the offending line,with line numbers in output so I'm >>>> guessing this is line 6 of the mid-file chunk examined, >>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>> and comparing to surrounding lines and the first ten lines >>>> $ head spd_all_fixed.csv >>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>> I can't see any difference. I wonder if this is a bug? 
I have no >>>> problems on a small test data set run through an identical process and >>>> using the same fread command. >>>> Regards >>>> Paul -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon May 13 22:38:31 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 13 May 2013 21:38:31 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Hi Paul, Sorry for that hassle. As you've realised I don't develop data.table on Windows. Those lines are switched in at compile time for Windows, and so I rely on (the truly impressive) winbuilder to compile and test for me. On this occasion, I did submit to winbuilder last night but it didn't reply (even with a compile error), which is extremely unusual. And R-Forge is stuck in 'building' state too (which is not unusual, sadly). I'll let you know when it's passing on winbuilder, and I'll update the Windows .zip on the homepage (since we can't rely on R-Forge) ... Matthew
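A back-of-the-envelope footnote on the bug itself, using the wc figures quoted below (editorial arithmetic, not from the thread). GetFileSize() reports only the low 32 bits of a file's size -- the high word goes in a separate out-parameter, evidently unused here -- so fread saw the true size modulo 2^32, which matches the ~9.2M-row truncation Paul observed:

2^32                  # 4294967296 -- the 4GB boundary
9078155125 %% 2^32    # 488220533 bytes: what a 32-bit size report gives for this file
488220533 / 9186293   # ~53.1 bytes per row, consistent with these ~50-60 byte lines,
                      # so the 9,186,293 rows fread saw were just the first ~0.45GB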
>>>>> >>>>> Thanks, Matthew >>>>> >>>>> On 02.05.2013 10:19, Paul Harding wrote: >>>>> >>>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. >>>>>> >>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! >>>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. >>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>> >>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv >>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv >>>>>> >>>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>>>>> >>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>> Found 9 columns >>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>> Count of eol after first data row: 80300000 >>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows >>>>>> >>>>>> Type codes: 000002000 (first 5 rows) >>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>> Type codes: 000002000 (+last 5 rows) >>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001' >>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001' >>>>>> 0.000s ( 0%) Memory map (rerun may be quicker) >>>>>> 0.000s ( 0%) Sep and header detection >>>>>> 0.000s ( 0%) Count rows (wc -l) >>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>>>>> 171.188s ( 65%) Reading data >>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered >>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >>>>>> 0.000s ( 0%) Changing na.strings to NA >>>>>> 0.000s Total >>>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>>>>> >>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>> Found 9 columns >>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>> All the fields on line 1 are character fields. Treating as the column names. 
>>>>>> Count of eol after first data row: 18913 >>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows >>>>>> >>>>>> Type codes: 000002000 (first 5 rows) >>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540, >>>>>> Regards, >>>>>> Paul >>>>>> >>>>>> On 1 May 2013 10:28, Paul Harding wrote: >>>>>> >>>>>>> Here is the verbose output: >>>>>>> >>>>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>> Found 9 columns >>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>> Count of eol after first data row: 9186293 >>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows >>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>> Type codes: 000002200 (+middle 5 rows) >>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>>>>> >>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte)@ >>>>>>> >>>>>>> $ wc spd_all_fixed.csv >>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>>>>> [So fread 9M, wc 168M rows]. >>>>>>> Regards >>>>>>> Paul >>>>>>> >>>>>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output. >>>>>>>> >>>>>>>> Thanks, Matthew >>>>>>>> >>>>>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>>>>> >>>>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line. 
>>>>>>>>> >>>>>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>>>>> >>>>>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>>>>> >>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>>>>> and comparing to surrounding lines and the first ten lines >>>>>>>>> >>>>>>>>> $ head spd_all_fixed.csv >>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>>>>> Regards >>>>>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon May 13 23:26:57 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 13 May 2013 22:26:57 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Passing on winbuilder now. .zip (rev 874) uploaded to homepage (will take an hour or two to refresh), but available now from here : https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable Matthew On 13.05.2013 21:38, Matthew Dowle wrote: > Hi Paul, > > Sorry for that hassle. As you've realised I don't develop data.table on Windows. Those lines are switched in at compile time for Windows, and so I rely on (the truly impressive) winbuilder to compile and test for me. On this occasion, I did submit to winbuilder last night but it didn't reply (even with a compile error) which is extremely unusual. And R-Forge is stuck in 'building' state too (which is not unusual, sadly). > > I''ll let you know when it's passing on winbuilder, and I'll updated the Windows .zip on the homepage (since we can't rely on R-Forge) ... 
> > Matthew > > On 13.05.2013 16:01, Paul Harding wrote: > >> I'd love to test it, pulled the latest commit with svn, not sure about building from source on windows, got some compilation errors: >> >>> install.packages("pkg/",type="source",repos=NULL) >> Warning in install.packages : >> package 'pkg/' is not available (for R version 3.0.0) >> * installing *source* package 'data.table' ... >> ** libs >> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o >> fread.c: In function 'readfile': >> fread.c:343:9: error: 'hfile' undeclared (first use in this function) >> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in >> fread.c:346:115: error: expected ';' before ')' token >> fread.c:346:115: error: expected statement before ')' token >> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration] >> make: *** [fread.o] Error 1 >> ERROR: compilation failed for package 'data.table' >> Regards >> Paul >> >> On 11 May 2013 02:39, Matthew Dowle wrote: >> >>> Paul, Vishal, >>> >>> Commit 859 : >>> >>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files >>> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to >>> be GetFileSizeEx(). >>> >>> Please test and confirm ok now. >>> >>> Thanks, Matthew >>> >>> On 03.05.2013 14:59, Matthew Dowle wrote: >>> >>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc. >>>> >>>> Please could you file it as a bug on the tracker. Thanks. >>>> >>>> Matthew >>>> >>>> On 03.05.2013 14:32, Paul Harding wrote: >>>> >>>>> Definitely a 64-bit machine. Here are the details: >>>>> >>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) >>>>> Installed memory (RAM): 128GB >>>>> System type: 64-bit Operating System >>>>> Windows edition: Server 2008 R2 Enterprise SP1 >>>>> Regards, >>>>> Paul >>>>> >>>>> On 3 May 2013 10:51, Matthew Dowle wrote: >>>>> >>>>>> Hi Paul, >>>>>> >>>>>> Thanks for all this! >>>>>> >>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>> >>>>>> Ahah. Are you using a 32bit or 64bit Windows machine? >>>>>> >>>>>> Thanks, Matthew >>>>>> >>>>>> On 02.05.2013 10:19, Paul Harding wrote: >>>>>> >>>>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. >>>>>>> >>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! 
>>>>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. >>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>>> >>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv >>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv >>>>>>> >>>>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>>>>>> >>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>> Found 9 columns >>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>> Count of eol after first data row: 80300000 >>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows >>>>>>> >>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>> Type codes: 000002000 (+last 5 rows) >>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001' >>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001' >>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker) >>>>>>> 0.000s ( 0%) Sep and header detection >>>>>>> 0.000s ( 0%) Count rows (wc -l) >>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>>>>>> 171.188s ( 65%) Reading data >>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered >>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >>>>>>> 0.000s ( 0%) Changing na.strings to NA >>>>>>> 0.000s Total >>>>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>>>>>> >>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>> Found 9 columns >>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>> Count of eol after first data row: 18913 >>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows >>>>>>> >>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540, >>>>>>> Regards, >>>>>>> Paul >>>>>>> >>>>>>> On 1 May 2013 10:28, Paul Harding wrote: >>>>>>> >>>>>>>> Here is the verbose output: >>>>>>>> >>>>>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>> Found 9 columns >>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>> All the fields on line 1 are character fields. Treating as the column names. 
>>>>>>>> Count of eol after first data row: 9186293 >>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows >>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>> Type codes: 000002200 (+middle 5 rows) >>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>>>>>> >>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte)@ >>>>>>>> >>>>>>>> $ wc spd_all_fixed.csv >>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>>>>>> [So fread 9M, wc 168M rows]. >>>>>>>> Regards >>>>>>>> Paul >>>>>>>> >>>>>>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output. >>>>>>>>> >>>>>>>>> Thanks, Matthew >>>>>>>>> >>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>>>>>> >>>>>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line. >>>>>>>>>> >>>>>>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>>>>>> >>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>>>>>> >>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>>>>>> and comparing to surrounding lines and the first ten lines >>>>>>>>>> >>>>>>>>>> $ head spd_all_fixed.csv >>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>>>>>> Regards >>>>>>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... 
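The GetFileSize()/GetFileSizeEx() change announced above is the whole story of the 4GB limit: the older Windows call returns only the low 32 bits of the file size (the high word needs a separate out-parameter), which would explain fread seeing these files as their size modulo 2^32. A back-of-envelope check of that explanation in R, an editorial sketch using only the numbers already posted in this thread:

9078155125 %% 2^32                               # 488220533 bytes visible of the 9,078,155,125 byte file
(9078155125 %% 2^32) / (9078155125 / 168997637)  # at the file's average row width: roughly 9.1 million rows

which lines up with fread counting 9,186,293 data rows where wc counted 168,997,637.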
URL: From p.harding at paniscus.com Tue May 14 14:28:53 2013 From: p.harding at paniscus.com (Paul Harding) Date: Tue, 14 May 2013 13:28:53 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Hi Matthew, some frustration until I worked out I needed to rename the zip file to data.table.zip to install! I have regression tested on a 4GB file, and tested on a 19GB whopper. Obviously it is a tad slow, but read.csv would never get there! Delighted, I can't do what I need to do on these big datasets without data.table. All seems fine, correct record count etc. I'm not checking every line of data ;-) > gash.dt<-fread("data/data_extract_1_fixed_trunc_fixed.csv") > big.dt<-fread("data/data_extract_1_fixed.csv",verbose=T) Detected eol as \r\n (CRLF) in that order, the Windows standard. Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=',' Found 16 columns First row with 16 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 214038352 Subtracted 1 for last eol and any trailing empty lines, leaving 214038351 data rows Type codes: 0003330030000000 (first 5 rows) Type codes: 0003330030000000 (+middle 5 rows) Type codes: 0003330030000000 (+last 5 rows) 0.050s ( 0%) Memory map (rerun may be quicker) 0.020s ( 0%) sep and header detection 159.560s ( 35%) Count rows (wc -l) 0.001s ( 0%) Column type detection (first, middle and last 5 rows) 46.267s ( 10%) Allocation of 214038351x16 result (xMB) in RAM 244.760s ( 54%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 5.258s ( 1%) Changing na.strings to NA 455.916s Total $ wc data_extract_1_fixed.csv 214038352 414098500 19745071003 data_extract_1_fixed.csv > tables() NAME NROW MB COLS KEY [1,] big.dt 214,038,351 16330 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw [2,] gash.dt 46,535,426 3551 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw [3,] range.dt 1 1 startdt,enddt [4,] spd.dt 46,535,426 4083 caldate,store_key,item_key,period_key,ulc_category,format,state,pqty,tqty,weekda store_key,item_key,caldate [5,] test.dt 5 1 digits,letters digits Total: 23,966MB On 13 May 2013 22:26, Matthew Dowle wrote: > ** > > > > Passing on winbuilder now. > > .zip (rev 874) uploaded to homepage (will take an hour or two to refresh), > but available now from here : > > > https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable > > Matthew > > > > On 13.05.2013 21:38, Matthew Dowle wrote: > > > > Hi Paul, > > Sorry for that hassle. As you've realised I don't develop data.table on > Windows. Those lines are switched in at compile time for Windows, and so > I rely on (the truly impressive) winbuilder to compile and test for me. On > this occasion, I did submit to winbuilder last night but it didn't reply > (even with a compile error) which is extremely unusual. And R-Forge is > stuck in 'building' state too (which is not unusual, sadly). > > I''ll let you know when it's passing on winbuilder, and I'll updated the > Windows .zip on the homepage (since we can't rely on R-Forge) ... 
> > Matthew > > > > On 13.05.2013 16:01, Paul Harding wrote: > > I'd love to test it, pulled the latest commit with svn, not sure about > building from source on windows, got some compilation errors: > > install.packages("pkg/",type="source",repos=NULL) > Warning in install.packages : > package ?pkg/? is not available (for R version 3.0.0) > * installing *source* package 'data.table' ... > ** libs > gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG > -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 > -mtune=core2 -c fread.c -o fread.o > fread.c: In function 'readfile': > fread.c:343:9: error: 'hfile' undeclared (first use in this function) > fread.c:343:9: note: each undeclared identifier is reported only once for > each function it appears in > fread.c:346:115: error: expected ';' before ')' token > fread.c:346:115: error: expected statement before ')' token > fread.c:350:17: warning: implicit declaration of function 'nanosleep' > [-Wimplicit-function-declaration] > make: *** [fread.o] Error 1 > ERROR: compilation failed for package 'data.table' > Regards > Paul > > > On 11 May 2013 02:39, Matthew Dowle wrote: > >> >> >> Paul, Vishal, >> >> Commit 859 : >> >> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files >> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to >> be GetFileSizeEx(). >> >> >> >> Please test and confirm ok now. >> >> >> >> Thanks, Matthew >> >> >> >> On 03.05.2013 14:59, Matthew Dowle wrote: >> >> >> >> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think >> GetFileSize() should be GetFileSizeEx(), iirc. >> >> Please could you file it as a bug on the tracker. Thanks. >> >> Matthew >> >> >> >> On 03.05.2013 14:32, Paul Harding wrote: >> >> Definitely a 64-bit machine. Here are the details: >> >> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) >> Installed memory (RAM): 128GB >> System type: 64-bit Operating System >> Windows edition: Server 2008 R2 Enterprise SP1 >> Regards, >> Paul >> >> >> On 3 May 2013 10:51, Matthew Dowle wrote: >> >>> >>> >>> Hi Paul, >>> >>> Thanks for all this! >>> >>> > The problem arises when the file reaches 4GB, in this case between >>> 8,030,000 and 8,040,000 rows: >>> >>> Ahah. Are you using a 32bit or 64bit Windows machine? >>> >>> Thanks, Matthew >>> >>> >>> >>> On 02.05.2013 10:19, Paul Harding wrote: >>> >>> Some supplementary information, here is the portion of the file (with >>> row numbers, +1 for header) around where fread thinks the file ends. >>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>> 9186293 >>> 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, >>> mid-line by the look of it! >>> I've experimented by truncating the file. 
The error varies, either it >>> reads too few records or gives the error I reported, presumably determined >>> by whether the last perceived line is entire. >>> The problem arises when the file reaches 4GB, in this case between >>> 8,030,000 and 8,040,000 rows: >>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 >>> spd_all_trunc_8030k.csv >>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 >>> spd_all_trunc_8040k.csv >>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>> first 30) ... found >>> Found 9 columns >>> First row with 9 fields occurs on line 1 (either column names or first >>> row of data) >>> All the fields on line 1 are character fields. Treating as the column >>> names. >>> Count of eol after first data row: 80300000 >>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 >>> data rows >>> Type codes: 000002000 (first 5 rows) >>> Type codes: 000002000 (+middle 5 rows) >>> Type codes: 000002000 (+last 5 rows) >>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains >>> '0.42634430000000001' >>> Bumping column 7 from INT64 to REAL on data row 9, field contains >>> '0.42634430000000001' >>> 0.000s ( 0%) Memory map (rerun may be quicker) >>> 0.000s ( 0%) Sep and header detection >>> 0.000s ( 0%) Count rows (wc -l) >>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>> 171.188s ( 65%) Reading data >>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc >>> time if triggered >>> -1365231.809s (-518439%) Coercing data already read in type bumps (if >>> any) >>> 0.000s ( 0%) Changing na.strings to NA >>> 0.000s Total >>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>> first 30) ... found >>> Found 9 columns >>> First row with 9 fields occurs on line 1 (either column names or first >>> row of data) >>> All the fields on line 1 are character fields. Treating as the column >>> names. >>> Count of eol after first data row: 18913 >>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 >>> data rows >>> Type codes: 000002000 (first 5 rows) >>> Type codes: 000002000 (+middle 5 rows) >>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>> Expected sep (',') but ',' ends field 2 on line 6 when detecting >>> types: 204650,724540, >>> Regards, >>> Paul >>> >>> >>> On 1 May 2013 10:28, Paul Harding wrote: >>> >>>> Here is the verbose output: >>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>>> first 30) ... found >>>> Found 9 columns >>>> First row with 9 fields occurs on line 1 (either column names or first >>>> row of data) >>>> All the fields on line 1 are character fields. Treating as the column >>>> names. 
>>>> Count of eol after first data row: 9186293 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 >>>> data rows >>>> Type codes: 000002000 (first 5 rows) >>>> Type codes: 000002200 (+middle 5 rows) >>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>>> types: 204038,2617097,20110803,0,0 >>>> But here is the wc output (via cygwin; newline, word (whitespace >>>> delim so each word one 'line' here), byte)@ >>>> $ wc spd_all_fixed.csv >>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>> [So fread 9M, wc 168M rows]. >>>> Regards >>>> Paul >>>> >>>> >>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>> >>>>> >>>>> >>>>> Hi, >>>>> >>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the >>>>> output. >>>>> >>>>> Thanks, Matthew >>>>> >>>>> >>>>> >>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>> >>>>> Problem with fread on a large file >>>>> The file is 8GB, just short of 200,000 lines, produced as SQLoutput >>>>> and modified by cygwin/perl to remove the second line. >>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>>>> types: 204038,2617097,20110803,0,0 >>>>> Looking for the offending line,with line numbers in output so I'm >>>>> guessing this is line 6 of the mid-file chunk examined, >>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>> and comparing to surrounding lines and the first ten lines >>>>> $ head spd_all_fixed.csv >>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>> >>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>> I can't see any difference. I wonder if this is a bug? I have no >>>>> problems on a small test data set run through an identical process and >>>>> using the same fread command. >>>>> Regards >>>>> Paul >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >> >> >> >> >> >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue May 14 22:52:05 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 14 May 2013 21:52:05 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Hi Paul, Great to hear, interesting timings. Yup - with a 16GB data.table in RAM, now we're talking. It's this kind of size data.table was intended for. Don't try names(DT)[1]<-"newname" on that! 
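Matthew's parting joke is about copying: names(DT)[1] <- "newname" goes through R's `names<-` replacement function, which duplicates the whole table (painful at 16GB), whereas data.table's setnames() assigns the new name by reference. A minimal sketch:

library(data.table)
DT <- data.table(a = 1:3, b = 4:6)
# names(DT)[1] <- "newname"    # would copy the entire table via `names<-`
setnames(DT, "a", "newname")   # renames in place, no copy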
Have changed the .zip file name on the homepage - thanks for mentioning it. And I see R-Forge is up to date and "Current" status anyway after all that, so via the R-Forge repo should be fine now, too. Matthew On 14.05.2013 13:28, Paul Harding wrote: > Hi Matthew, some frustration until I worked out I needed to rename the zip file to data.table.zip to install! I have regression tested on a 4GB file, and tested on a 19GB whopper. Obviously it is a tad slow, but read.csv would never get there! Delighted, I can't do what I need to do on these big datasets without data.table. All seems fine, correct record count etc. I'm not checking every line of data ;-) > >> gash.dt<-fread("data/data_extract_1_fixed_trunc_fixed.csv") >> big.dt<-fread("data/data_extract_1_fixed.csv",verbose=T) > Detected eol as rn (CRLF) in that order, the Windows standard. > Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=',' > Found 16 columns > First row with 16 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 214038352 > Subtracted 1 for last eol and any trailing empty lines, leaving 214038351 data rows > Type codes: 0003330030000000 (first 5 rows) > Type codes: 0003330030000000 (+middle 5 rows) > Type codes: 0003330030000000 (+last 5 rows) > 0.050s ( 0%) Memory map (rerun may be quicker) > 0.020s ( 0%) sep and header detection > 159.560s ( 35%) Count rows (wc -l) > 0.001s ( 0%) Column type detection (first, middle and last 5 rows) > 46.267s ( 10%) Allocation of 214038351x16 result (xMB) in RAM > 244.760s ( 54%) Reading data > 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered > 0.000s ( 0%) Coercing data already read in type bumps (if any) > 5.258s ( 1%) Changing na.strings to NA > 455.916s Total > > $ wc data_extract_1_fixed.csv > 214038352 414098500 19745071003 data_extract_1_fixed.csv > >> tables() > NAME NROW MB COLS KEY > [1,] big.dt 214,038,351 16330 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw > [2,] gash.dt 46,535,426 3551 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw > [3,] range.dt 1 1 startdt,enddt > [4,] spd.dt 46,535,426 4083 caldate,store_key,item_key,period_key,ulc_category,format,state,pqty,tqty,weekda store_key,item_key,caldate > [5,] test.dt 5 1 digits,letters digits > Total: 23,966MB > > On 13 May 2013 22:26, Matthew Dowle wrote: > >> Passing on winbuilder now. >> >> .zip (rev 874) uploaded to homepage (will take an hour or two to refresh), but available now from here : >> >> https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable [5] >> >> Matthew >> >> On 13.05.2013 21:38, Matthew Dowle wrote: >> >>> Hi Paul, >>> >>> Sorry for that hassle. As you've realised I don't develop data.table on Windows. Those lines are switched in at compile time for Windows, and so I rely on (the truly impressive) winbuilder to compile and test for me. On this occasion, I did submit to winbuilder last night but it didn't reply (even with a compile error) which is extremely unusual. And R-Forge is stuck in 'building' state too (which is not unusual, sadly). >>> >>> I''ll let you know when it's passing on winbuilder, and I'll updated the Windows .zip on the homepage (since we can't rely on R-Forge) ... 
>>> >>> Matthew >>> >>> On 13.05.2013 16:01, Paul Harding wrote: >>> >>>> I'd love to test it, pulled the latest commit with svn, not sure about building from source on windows, got some compilation errors: >>>> >>>>> install.packages("pkg/",type="source",repos=NULL) >>>> Warning in install.packages : >>>> package 'pkg/' is not available (for R version 3.0.0) >>>> * installing *source* package 'data.table' ... >>>> ** libs >>>> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o >>>> fread.c: In function 'readfile': >>>> fread.c:343:9: error: 'hfile' undeclared (first use in this function) >>>> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in >>>> fread.c:346:115: error: expected ';' before ')' token >>>> fread.c:346:115: error: expected statement before ')' token >>>> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration] >>>> make: *** [fread.o] Error 1 >>>> ERROR: compilation failed for package 'data.table' >>>> Regards >>>> Paul >>>> >>>> On 11 May 2013 02:39, Matthew Dowle wrote: >>>> >>>>> Paul, Vishal, >>>>> >>>>> Commit 859 : >>>>> >>>>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files >>>>> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to >>>>> be GetFileSizeEx(). >>>>> >>>>> Please test and confirm ok now. >>>>> >>>>> Thanks, Matthew >>>>> >>>>> On 03.05.2013 14:59, Matthew Dowle wrote: >>>>> >>>>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc. >>>>>> >>>>>> Please could you file it as a bug on the tracker. Thanks. >>>>>> >>>>>> Matthew >>>>>> >>>>>> On 03.05.2013 14:32, Paul Harding wrote: >>>>>> >>>>>>> Definitely a 64-bit machine. Here are the details: >>>>>>> >>>>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) >>>>>>> Installed memory (RAM): 128GB >>>>>>> System type: 64-bit Operating System >>>>>>> Windows edition: Server 2008 R2 Enterprise SP1 >>>>>>> Regards, >>>>>>> Paul >>>>>>> >>>>>>> On 3 May 2013 10:51, Matthew Dowle wrote: >>>>>>> >>>>>>>> Hi Paul, >>>>>>>> >>>>>>>> Thanks for all this! >>>>>>>> >>>>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>>>> >>>>>>>> Ahah. Are you using a 32bit or 64bit Windows machine? >>>>>>>> >>>>>>>> Thanks, Matthew >>>>>>>> >>>>>>>> On 02.05.2013 10:19, Paul Harding wrote: >>>>>>>> >>>>>>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. 
>>>>>>>>> >>>>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>>>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>>>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>>>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>>>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>>>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>>>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>>>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>>>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>>>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>>>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! >>>>>>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. >>>>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>>>>> >>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv >>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv >>>>>>>>> >>>>>>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>>>>>>>> >>>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>>> Found 9 columns >>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>>>> Count of eol after first data row: 80300000 >>>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows >>>>>>>>> >>>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>>>> Type codes: 000002000 (+last 5 rows) >>>>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001' >>>>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001' >>>>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker) >>>>>>>>> 0.000s ( 0%) Sep and header detection >>>>>>>>> 0.000s ( 0%) Count rows (wc -l) >>>>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>>>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>>>>>>>> 171.188s ( 65%) Reading data >>>>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered >>>>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >>>>>>>>> 0.000s ( 0%) Changing na.strings to NA >>>>>>>>> 0.000s Total >>>>>>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>>>>>>>> >>>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>>> Found 9 columns >>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>>> All the fields on line 1 are character fields. Treating as the column names. 
>>>>>>>>> Count of eol after first data row: 18913 >>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows >>>>>>>>> >>>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>>>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540, >>>>>>>>> Regards, >>>>>>>>> Paul >>>>>>>>> >>>>>>>>> On 1 May 2013 10:28, Paul Harding wrote: >>>>>>>>> >>>>>>>>>> Here is the verbose output: >>>>>>>>>> >>>>>>>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>>>> Found 9 columns >>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>>>>> Count of eol after first data row: 9186293 >>>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows >>>>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>>>> Type codes: 000002200 (+middle 5 rows) >>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>>>>>>>> >>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte)@ >>>>>>>>>> >>>>>>>>>> $ wc spd_all_fixed.csv >>>>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>>>>>>>> [So fread 9M, wc 168M rows]. >>>>>>>>>> Regards >>>>>>>>>> Paul >>>>>>>>>> >>>>>>>>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output. >>>>>>>>>>> >>>>>>>>>>> Thanks, Matthew >>>>>>>>>>> >>>>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>>>>>>>> >>>>>>>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line. 
>>>>>>>>>>>> >>>>>>>>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>>>>>>>> >>>>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>>>>>>>> >>>>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>>>>>>>> and comparing to surrounding lines and the first ten lines >>>>>>>>>>>> >>>>>>>>>>>> $ head spd_all_fixed.csv >>>>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>>>>>>>> Regards >>>>>>>>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com [5] https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable [6] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu May 16 01:15:20 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 16 May 2013 01:15:20 +0200 Subject: [datatable-help] 1.8.9 mac version link broken Message-ID: <422151901E6C414384EECBCC8E6D62EE@gmail.com> Hello, I've tried over the last 2 days and the link for data.table mac version (.tgz) http://download.r-forge.r-project.org/bin/macosx/leopard/contrib/latest/data.table_1.8.9.tgz seems to be broken. The installation by: install.packages("data.table", repos="http://R-Forge.R-project.org") therefore fails as well. Any idea when it'll be back online? Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu May 16 01:28:30 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 16 May 2013 00:28:30 +0100 Subject: [datatable-help] 1.8.9 mac version link broken In-Reply-To: <422151901E6C414384EECBCC8E6D62EE@gmail.com> References: <422151901E6C414384EECBCC8E6D62EE@gmail.com> Message-ID: <4621f2ffd31cc139b83088722984ed7c@imap.plus.net> Hi Arun, That's one for R-Forge support (link to their tracker on R-Forge homepage). 
My impression is that Mac folk install from source using R CMD INSTALL as you would on unix really. I'm not even sure what is in the .tgz that isn't in the .tar.gz. Sorry! Matthew On 16.05.2013 00:15, Arunkumar Srinivasan wrote: > Hello, > I've tried over the last 2 days and the link for data.table mac version (.tgz) http://download.r-forge.r-project.org/bin/macosx/leopard/contrib/latest/data.table_1.8.9.tgz seems to be broken. The installation by: install.packages("data.table", repos="http://R-Forge.R-project.org") therefore fails as well. Any idea when it'll be back online? > > Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Thu May 16 06:54:11 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 15 May 2013 21:54:11 -0700 Subject: [datatable-help] 1.8.9 mac version link broken In-Reply-To: <4621f2ffd31cc139b83088722984ed7c@imap.plus.net> References: <422151901E6C414384EECBCC8E6D62EE@gmail.com> <4621f2ffd31cc139b83088722984ed7c@imap.plus.net> Message-ID: Hi, On Wed, May 15, 2013 at 4:28 PM, Matthew Dowle wrote: > > > Hi Arun, > > That's one for R-Forge support (link to their tracker on R-Forge homepage). > My impression is that Mac folk install from source using R CMD INSTALL as > you would on unix really. I'm not even sure what is in the .tgz that isn't > in the .tar.gz. Sorry! Assuming you have Xcode (or just the Xcode command line tools) so that you have a working gcc on your system, you can just do: R> install.packages("data.table", repos="http://R-Forge.R-project.org", type="source") HTH, -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech From caneff at gmail.com Thu May 16 15:46:15 2013 From: caneff at gmail.com (Chris Neff) Date: Thu, 16 May 2013 09:46:15 -0400 Subject: [datatable-help] fread: support int64 package as well, or an option to make 64-bit integers characters Message-ID: Hi all, For reasons I won't go into here, I do not use the bit64 package, instead opting to use int64. When I load something that is a long numeric ID (i.e. something I actually want as a character not an integer), fread loads this as a 64-bit integer using the bit64 encoding. This makes it impossible to convert those to characters without having the bit64 package loaded. Would it be possible to either support int64 as well, or if not an option to convert it to character when it would decide that 64-bit ints are necessary? Thanks, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu May 16 23:00:11 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 16 May 2013 16:00:11 -0500 Subject: [datatable-help] fread: support int64 package as well, or an option to make 64-bit integers characters In-Reply-To: References: Message-ID: Hi Chris, There's the new colClasses argument so you can override particular fields. But see the new ?fread manual page where I've added an integer64 argument, documented but not yet implemented, default "integer64" but can be set to "numeric" (as read.csv) or "character" -- think that's exactly what you're describing. I guess you would set the global option datatable.integer64 to "character". I don't see how we can support int64 because it's implemented as two integer vectors, iiuc. data.table internals need each column to be a single vector.
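In code, the interface described above would look like the sketch below. The integer64 argument was documented but not yet implemented at that point, so this shows the intended usage rather than confirmed 1.8.9 behaviour, and the file name and column name are hypothetical:

# per-field override via the new colClasses argument
DT <- fread("ids.csv", colClasses = c(id = "character"))
# the documented integer64 argument, and the corresponding global option
DT <- fread("ids.csv", integer64 = "character")
options(datatable.integer64 = "character")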
Matthew On 16.05.2013 08:46, Chris Neff wrote: > Hi all, > For reasons I won't go into here, I do not use the bit64 package, instead opting to use int64. When I load something that is a long numeric ID (i.e. something I actually want as a character not an integer), fread loads this as a 64-bit integer using the bit64 encoding. This makes it impossible to convert those to characters without having the bit64 package loaded. > Would it be possible to either support int64 as well, or if not an option to convert it to character when it would decide that 64-bit ints are necessary? > Thanks, > Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Fri May 17 13:34:26 2013 From: caneff at gmail.com (Chris Neff) Date: Fri, 17 May 2013 07:34:26 -0400 Subject: [datatable-help] fread: support int64 package as well, or an option to make 64-bit integers characters In-Reply-To: References: Message-ID: On Thu, May 16, 2013 at 5:00 PM, Matthew Dowle wrote: > ** > > > > Hi Chris, > > There's the new colClasses argument so you can override particular fields. > But see the new ?fread manual page where I've added an integer64 argument, > documented but not yet implemented, default "integer64" but can be set to > "numeric" (as read.csv) or "character" -- think that's exactly what you're > describing. I guess you would set the global option datatable.integer64 to > "character". > That sounds exactly like what I want. > I don't see how we can support int64 because it's implemented as two > integer vectors, iiuc. data.table internals need each column to be a > single vector. > Understood. > Matthew > > > > On 16.05.2013 08:46, Chris Neff wrote: > > Hi all, > For reasons I won't go into here, I do not use the bit64 package, instead > opting to use int64. When I load something that is a long numeric ID (i.e. > something I actually want as a character not an integer), fread loads this > as a 64-bit integer using the bit64 encoding. This makes it impossible to > convert those to characters without having the bit64 package loaded. > Would it be possible to either support int64 as well, or if not an option > to convert it to character when it would decide that 64-bit ints are > necessary? > Thanks, > Chris > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Fri May 17 16:42:17 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Fri, 17 May 2013 11:42:17 -0300 Subject: [datatable-help] Extract Single Column as Vector Message-ID: Sorry if this is a basic question. I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." I am able to obtain this behavior if I know the column name in advance: > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > dt a b 1: 1 4 2: 2 5 3: 3 6 > str(dt[,a]) num [1:3] 1 2 3 However, if I don't, no such luck: > colname="a" > str(dt[,colname,with=F]) Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: $ a: num 1 2 3 - attr(*, ".internal.selfref")=<externalptr> Is there a way to extract an entire column as a vector if I have the column name as a character scalar? Thank you! -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed...
URL: From eduard.antonyan at gmail.com Fri May 17 16:59:18 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 09:59:18 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < alexandre.sieira at gmail.com> wrote: > Sorry if this is a basic question. > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states > that "A single column or single expression returns that type, usually a > vector." > > > I am able to obtain this behavior if I know the column name in advance: > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > dt > > a b > > 1: 1 4 > > 2: 2 5 > > 3: 3 6 > > > str(dt[,a]) > > num [1:3] 1 2 3 > > > However, if I don't, no such luck: > > > colname="a" > > str(dt[,colname,with=F]) > Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: > $ a: num 1 2 3 > - attr(*, ".internal.selfref")=<externalptr> > > Is there a way to extract an entire column as a vector if I have the > column name as a character scalar? > > Thank you! > > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri May 17 17:02:46 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 17 May 2013 17:02:46 +0200 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: dt[, a] and dt[, "a", with=FALSE]. There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. Alexandre, as of now, it could be done as Eduard points out. Arun On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: > > > > Sorry if this is a basic question. > > > > > > > > > > > > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." > > > > > > > > > > > > > > I am able to obtain this behavior if I know the column name in advance: > > > > > > > > > > > > > > > > > > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > > > > > dt > > > > > > a b > > > > > > 1: 1 4 > > > > > > 2: 2 5 > > > > > > 3: 3 6 > > > > > > > str(dt[,a]) > > > > > > num [1:3] 1 2 3 > > > > > > > > > > > > > > However, if I don't, no such luck: > > > > > colname="a" > > > str(dt[,colname,with=F]) > > Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: > > $ a: num 1 2 3 > > - attr(*, ".internal.selfref")=<externalptr> > > > > > > Is there a way to extract an entire column as a vector if I have the column name as a character scalar?
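Pulling the thread's answers together: `[[` takes the column name as a string and returns the bare vector, so it covers the computed-name case that dt[, a] covers for a literal name. A short sketch:

library(data.table)
dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))
colname <- "a"
dt[, a]                      # numeric vector; name known when the code is written
dt[, colname, with = FALSE]  # one-column data.table, not a vector
dt[[colname]]                # numeric vector; name held in a variable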
> > > > Thank you! > > > > -- > > Alexandre Sieira > > CISA, CISSP, ISO 27001 Lead Auditor > > > > "The truth is rarely pure and never simple." > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Fri May 17 17:11:53 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Fri, 17 May 2013 12:11:53 -0300 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: It works perfectly for me with the syntax Eduard mentioned. Thank you very much for the quick response! -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 17 May 2013 at 12:02:50, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: dt[, a] and dt[, "a", with=FALSE]. There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. Alexandre, as of now, it could be done as Eduard points out. Arun On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: Sorry if this is a basic question. I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." I am able to obtain this behavior if I know the column name in advance: > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > dt a b 1: 1 4 2: 2 5 3: 3 6 > str(dt[,a]) num [1:3] 1 2 3 However, if I don't, no such luck: > colname="a" > str(dt[,colname,with=F]) Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: $ a: num 1 2 3 - attr(*, ".internal.selfref")= Is there a way to extract an entire column as a vector if I have the column name as a character scalar? Thank you! -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
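[Editor's note: a compact recap of the answers in this thread, using Alexandre's table. dt[[colname]] is Eduard's suggestion; the get() form is used the same way later in the thread.]

    library(data.table)
    dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))
    colname <- "a"
    dt[[colname]]                  # numeric vector 1 2 3, same as dt[, a]
    dt[, get(colname)]             # also a vector: get() resolves the name within j
    dt[, colname, with = FALSE]    # a one-column data.table, not a vector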
URL: From eduard.antonyan at gmail.com Fri May 17 17:22:14 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 10:22:14 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < aragorn168b at gmail.com> wrote: > Eduard, are we discussing the same thing again :)? Wasn't this somehow > your question as well.. the discrepancy between: > > dt[, a] and dt[, "a", with=FALSE]. > > There should be a drop=TRUE/FALSE option (as in the case of data.frame) > that should be used when you use `with=FALSE`. Until then, the default > option seems to be drop=FALSE, which results in a data.table. > > Alexandre, as of now, it could be done as Eduard points out. > > Arun > > On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > > Use dt[[colname]], but this seems like a bug to me - I would've thought > that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < > alexandre.sieira at gmail.com> wrote: > > Sorry if this is a basic question. > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states > that "A single column or single expression returns that type, usually a > vector." > > > I am able to obtain this behavior if I know the column name in advance: > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > dt > > a b > > 1: 1 4 > > 2: 2 5 > > 3: 3 6 > > > str(dt[,a]) > > num [1:3] 1 2 3 > > > However, if I don't, no such luck: > > > colname="a" > > str(dt[,colname,with=F]) > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > $ a: num 1 2 3 > - attr(*, ".internal.selfref")= > > If there a way to extract an entire column as a vector if I have the > column name as a character scalar? > > Thank you! > > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Fri May 17 21:44:20 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 17 May 2013 14:44:20 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: @Arun and eddi: This question has come up before. http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. 
:) I guess it works that way because ...in dt[ ,a], j is an expression which evaluates to a vector ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: > I don't remember discussing this issue...? What is the conceptual > difference between dt[, a] and dt[, "a", with = F] and what does 'drop' > have to do with this?? > > > On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> Eduard, are we discussing the same thing again :)? Wasn't this somehow >> your question as well.. the discrepancy between: >> >> dt[, a] and dt[, "a", with=FALSE]. >> >> There should be a drop=TRUE/FALSE option (as in the case of data.frame) >> that should be used when you use `with=FALSE`. Until then, the default >> option seems to be drop=FALSE, which results in a data.table. >> >> Alexandre, as of now, it could be done as Eduard points out. >> >> Arun >> >> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >> >> Use dt[[colname]], but this seems like a bug to me - I would've thought >> that dt[, a] and dt[, "a", with = F] should return the exact same thing. >> >> >> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < >> alexandre.sieira at gmail.com> wrote: >> >> Sorry if this is a basic question. >> >> >> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states >> that "A single column or single expression returns that type, usually a >> vector." >> >> >> I am able to obtain this behavior if I know the column name in advance: >> >> >> > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >> >> > dt >> >> a b >> >> 1: 1 4 >> >> 2: 2 5 >> >> 3: 3 6 >> >> > str(dt[,a]) >> >> num [1:3] 1 2 3 >> >> >> However, if I don't, no such luck: >> >> > colname="a" >> > str(dt[,colname,with=F]) >> Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: >> $ a: num 1 2 3 >> - attr(*, ".internal.selfref")= >> >> If there a way to extract an entire column as a vector if I have the >> column name as a character scalar? >> >> Thank you! >> >> -- >> Alexandre Sieira >> CISA, CISSP, ISO 27001 Lead Auditor >> >> "The truth is rarely pure and never simple." >> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
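[Editor's note: a small sketch of the distinction Frank draws above; the shapes noted in the comments are the behaviour reported in this thread.]

    library(data.table)
    dt <- data.table(a = 1:3, b = 4:6)
    str(dt[, a])                  # j is an expression: evaluates to the vector a
    str(dt[, list(a)])            # j is a list: a one-column data.table
    str(dt[, "a", with = FALSE])  # selection mode: same one-column shape as list(a)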
URL: From eduard.antonyan at gmail.com Fri May 17 22:26:40 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 15:26:40 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Well, looking at the documentation: j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or *(when with=FALSE) same as j in [.data.frame.* ... with:* *By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. *When with=FALSE, j works as it does in [.data.frame.* * * The bolded out part of the documentation doesn't match the actual behavior. On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > @Arun and eddi: This question has come up before. > > http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html > (And I'm sure there are other times, too.) I can't say I've heard anyone > arguing about it, though. :) > > I guess it works that way because > ...in dt[ ,a], j is an expression which evaluates to a vector > ...in dt[,"a",with=FALSE] the option turns on the "you must want one or > more columns" mode, translating j from "a" to list(a) > > It's unintuitive if you're expecting data frame behavior (you know, > drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it > shouldn't be much of a surprise. Adding the drop option, and maybe > defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? > > > On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> wrote: > >> I don't remember discussing this issue...? What is the conceptual >> difference between dt[, a] and dt[, "a", with = F] and what does 'drop' >> have to do with this?? >> >> >> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < >> aragorn168b at gmail.com> wrote: >> >>> Eduard, are we discussing the same thing again :)? Wasn't this somehow >>> your question as well.. the discrepancy between: >>> >>> dt[, a] and dt[, "a", with=FALSE]. >>> >>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) >>> that should be used when you use `with=FALSE`. Until then, the default >>> option seems to be drop=FALSE, which results in a data.table. >>> >>> Alexandre, as of now, it could be done as Eduard points out. >>> >>> Arun >>> >>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>> >>> Use dt[[colname]], but this seems like a bug to me - I would've thought >>> that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>> >>> >>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < >>> alexandre.sieira at gmail.com> wrote: >>> >>> Sorry if this is a basic question. >>> >>> >>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' >>> states that "A single column or single expression returns that type, >>> usually a vector." >>> >>> >>> I am able to obtain this behavior if I know the column name in advance: >>> >>> >>> > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>> >>> > dt >>> >>> a b >>> >>> 1: 1 4 >>> >>> 2: 2 5 >>> >>> 3: 3 6 >>> >>> > str(dt[,a]) >>> >>> num [1:3] 1 2 3 >>> >>> >>> However, if I don't, no such luck: >>> >>> > colname="a" >>> > str(dt[,colname,with=F]) >>> Classes ?data.table? and 'data.frame': 3 obs. 
of 1 variable: >>> $ a: num 1 2 3 >>> - attr(*, ".internal.selfref")= >>> >>> Is there a way to extract an entire column as a vector if I have the >>> column name as a character scalar? >>> >>> Thank you! >>> >>> -- >>> Alexandre Sieira >>> CISA, CISSP, ISO 27001 Lead Auditor >>> >>> "The truth is rarely pure and never simple." >>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Fri May 17 22:27:24 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 17 May 2013 16:27:24 -0400 Subject: [datatable-help] zero length list component in j Message-ID: Is this intended? If we use j = list(x = "X", y = numeric(0)) we get a row but if we use just list(y = numeric(0)) then we do not get a row. In the first case it filled in the zero length component with NA and in the second case it just omitted the row entirely: > dd <- data.table(a = 1:3) > dd a 1: 1 2: 2 3: 3 > dd[, list(x = "X", y = numeric(0)), by = a] a x y 1: 1 X NA 2: 2 X NA 3: 3 X NA > dd[, list(y = numeric(0)), by = a] Empty data.table (0 rows) of 2 cols: a,y -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 17 22:33:28 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 15:33:28 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: Maybe I'm missing something, but what else did you expect? Looks like it did its best to compensate for the user not supplying full data in the first example, and there really was nothing to do in the second one. On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck wrote: > Is this intended? If we use j = list(x = "X", y = numeric(0)) we get > a row but if we use just list(y = numeric(0)) then we do not get a > row. In the first case it filled in the zero length component with NA > and in the second case it just omitted the row entirely: > > > dd <- data.table(a = 1:3) > > dd > a > 1: 1 > 2: 2 > 3: 3 > > dd[, list(x = "X", y = numeric(0)), by = a] > a x y > 1: 1 X NA > 2: 2 X NA > 3: 3 X NA > > dd[, list(y = numeric(0)), by = a] > Empty data.table (0 rows) of 2 cols: a,y > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
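[Editor's note: a sketch of the recycling behind Gabor's two results; the comments restate the output he reports above, not a documented guarantee.]

    library(data.table)
    dd <- data.table(a = 1:3)
    # Within each group, the components of j's list are recycled to a common
    # length; a length-0 component next to a length-1 component is padded with NA:
    dd[, list(x = "X", y = numeric(0)), by = a]  # three rows, y all NA
    # When every component of j has length 0, the group contributes no rows:
    dd[, list(y = numeric(0)), by = a]           # empty result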
URL: From ggrothendieck at gmail.com Fri May 17 22:38:53 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 17 May 2013 16:38:53 -0400 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: In the first case it replaced the zero length component with NA and in the second case it did not. Why the difference? On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan wrote: > Maybe I'm missing smth, but what else did you expect? Looks like it did it's > best to compensate for the user not supplying full data in the first > example, and there really was nothing to do in the second one. > > > > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck > wrote: >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get >> a row but if we use just list(y = numeric(0)) then we do not get a >> row. In the first case it filled in the zero length component with NA >> and in the second case it just omitted the row entirely: >> >> > dd <- data.table(a = 1:3) >> > dd >> a >> 1: 1 >> 2: 2 >> 3: 3 >> > dd[, list(x = "X", y = numeric(0)), by = a] >> a x y >> 1: 1 X NA >> 2: 2 X NA >> 3: 3 X NA >> > dd[, list(y = numeric(0)), by = a] >> Empty data.table (0 rows) of 2 cols: a,y >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 17 22:47:55 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 15:47:55 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: well numeric(0) is no data, but because in the first case there was other data to output and you also asked to output `y`, what else was it supposed to do? ( it might help to look at the output of c(numeric(0), numeric(0)) ) On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck wrote: > In the first case it replaced the zero length component with NA and in > the second case it did not. Why the difference? > > On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan > wrote: > > Maybe I'm missing smth, but what else did you expect? Looks like it did > it's > > best to compensate for the user not supplying full data in the first > > example, and there really was nothing to do in the second one. > > > > > > > > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck > > wrote: > >> > >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get > >> a row but if we use just list(y = numeric(0)) then we do not get a > >> row. In the first case it filled in the zero length component with NA > >> and in the second case it just omitted the row entirely: > >> > >> > dd <- data.table(a = 1:3) > >> > dd > >> a > >> 1: 1 > >> 2: 2 > >> 3: 3 > >> > dd[, list(x = "X", y = numeric(0)), by = a] > >> a x y > >> 1: 1 X NA > >> 2: 2 X NA > >> 3: 3 X NA > >> > dd[, list(y = numeric(0)), by = a] > >> Empty data.table (0 rows) of 2 cols: a,y > >> > >> > >> -- > >> Statistics & Software Consulting > >> GKX Group, GKX Associates Inc. 
> >> tel: 1-877-GKX-GROUP > >> email: ggrothendieck at gmail.com > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Sat May 18 00:36:06 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 17 May 2013 18:36:06 -0400 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: Yes, I understand all that but its not inevitable that it had to be that way. If we perform a computation that results in a list with zero length component then the corresponding row won't show up but another possibility might have been that it would show up filled in with NAs. At any rate, the question remains whether this behavior is intended or not. On Fri, May 17, 2013 at 4:47 PM, Eduard Antonyan wrote: > well numeric(0) is no data, but because in the first case there was other > data to output and you also asked to output `y`, what else was it supposed > to do? ( it might help to look at the output of c(numeric(0), numeric(0)) ) > > > On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck > wrote: >> >> In the first case it replaced the zero length component with NA and in >> the second case it did not. Why the difference? >> >> On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan >> wrote: >> > Maybe I'm missing smth, but what else did you expect? Looks like it did >> > it's >> > best to compensate for the user not supplying full data in the first >> > example, and there really was nothing to do in the second one. >> > >> > >> > >> > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck >> > wrote: >> >> >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get >> >> a row but if we use just list(y = numeric(0)) then we do not get a >> >> row. In the first case it filled in the zero length component with NA >> >> and in the second case it just omitted the row entirely: >> >> >> >> > dd <- data.table(a = 1:3) >> >> > dd >> >> a >> >> 1: 1 >> >> 2: 2 >> >> 3: 3 >> >> > dd[, list(x = "X", y = numeric(0)), by = a] >> >> a x y >> >> 1: 1 X NA >> >> 2: 2 X NA >> >> 3: 3 X NA >> >> > dd[, list(y = numeric(0)), by = a] >> >> Empty data.table (0 rows) of 2 cols: a,y >> >> >> >> >> >> -- >> >> Statistics & Software Consulting >> >> GKX Group, GKX Associates Inc. >> >> tel: 1-877-GKX-GROUP >> >> email: ggrothendieck at gmail.com >> >> _______________________________________________ >> >> datatable-help mailing list >> >> datatable-help at lists.r-forge.r-project.org >> >> >> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >> > >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com > > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. 
tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Sat May 18 00:45:25 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 17:45:25 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: Actually, looking at this example: > dd[, ifelse(a < 2, a, integer(0)), by = a] a V1 1: 1 1 2: 2 NA 3: 3 NA I don't quite understand the output. I don't have a coherent story for this and your examples - either your second example should print NA's or this one shouldn't have the last two rows imo. On Fri, May 17, 2013 at 5:36 PM, Gabor Grothendieck wrote: > Yes, I understand all that but its not inevitable that it had to be > that way. If we perform a computation that results in a list with > zero length component then the corresponding row won't show up but > another possibility might have been that it would show up filled in > with NAs. > > At any rate, the question remains whether this behavior is intended or not. > > > > > > On Fri, May 17, 2013 at 4:47 PM, Eduard Antonyan > wrote: > > well numeric(0) is no data, but because in the first case there was other > > data to output and you also asked to output `y`, what else was it > supposed > > to do? ( it might help to look at the output of c(numeric(0), > numeric(0)) ) > > > > > > On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck > > wrote: > >> > >> In the first case it replaced the zero length component with NA and in > >> the second case it did not. Why the difference? > >> > >> On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan > >> wrote: > >> > Maybe I'm missing smth, but what else did you expect? Looks like it > did > >> > it's > >> > best to compensate for the user not supplying full data in the first > >> > example, and there really was nothing to do in the second one. > >> > > >> > > >> > > >> > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck > >> > wrote: > >> >> > >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get > >> >> a row but if we use just list(y = numeric(0)) then we do not get a > >> >> row. In the first case it filled in the zero length component with > NA > >> >> and in the second case it just omitted the row entirely: > >> >> > >> >> > dd <- data.table(a = 1:3) > >> >> > dd > >> >> a > >> >> 1: 1 > >> >> 2: 2 > >> >> 3: 3 > >> >> > dd[, list(x = "X", y = numeric(0)), by = a] > >> >> a x y > >> >> 1: 1 X NA > >> >> 2: 2 X NA > >> >> 3: 3 X NA > >> >> > dd[, list(y = numeric(0)), by = a] > >> >> Empty data.table (0 rows) of 2 cols: a,y > >> >> > >> >> > >> >> -- > >> >> Statistics & Software Consulting > >> >> GKX Group, GKX Associates Inc. > >> >> tel: 1-877-GKX-GROUP > >> >> email: ggrothendieck at gmail.com > >> >> _______________________________________________ > >> >> datatable-help mailing list > >> >> datatable-help at lists.r-forge.r-project.org > >> >> > >> >> > >> >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >> > > >> > > >> > >> > >> > >> -- > >> Statistics & Software Consulting > >> GKX Group, GKX Associates Inc. > >> tel: 1-877-GKX-GROUP > >> email: ggrothendieck at gmail.com > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... 
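[Editor's note: the follow-up below pins this example on base R's ifelse() rather than on data.table. A quick base-R check supports that reading:]

    # ifelse() returns a result the same length as its test argument, recycling
    # the chosen branch to fit; rep(integer(0), length.out = 1) fills with NA.
    ifelse(FALSE, 1L, integer(0))
    # [1] NA
    # So each group with a >= 2 evaluates j to a length-1 NA, not integer(0),
    # which is why those groups still produce rows in the example above.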
URL: From eduard.antonyan at gmail.com Sat May 18 00:51:11 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 17:51:11 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: nm, looks like the above is the doing of `ifelse` and is a different issue On Fri, May 17, 2013 at 5:45 PM, Eduard Antonyan wrote: > Actually, looking at this example: > > > dd[, ifelse(a < 2, a, integer(0)), by = a] > a V1 > 1: 1 1 > 2: 2 NA > 3: 3 NA > > I don't quite understand the output. I don't have a coherent story for > this and your examples - either your second example should print NA's or > this one shouldn't have the last two rows imo. > > > On Fri, May 17, 2013 at 5:36 PM, Gabor Grothendieck < > ggrothendieck at gmail.com> wrote: > >> Yes, I understand all that but its not inevitable that it had to be >> that way. If we perform a computation that results in a list with >> zero length component then the corresponding row won't show up but >> another possibility might have been that it would show up filled in >> with NAs. >> >> At any rate, the question remains whether this behavior is intended or >> not. >> >> >> >> >> >> On Fri, May 17, 2013 at 4:47 PM, Eduard Antonyan >> wrote: >> > well numeric(0) is no data, but because in the first case there was >> other >> > data to output and you also asked to output `y`, what else was it >> supposed >> > to do? ( it might help to look at the output of c(numeric(0), >> numeric(0)) ) >> > >> > >> > On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck >> > wrote: >> >> >> >> In the first case it replaced the zero length component with NA and in >> >> the second case it did not. Why the difference? >> >> >> >> On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan >> >> wrote: >> >> > Maybe I'm missing smth, but what else did you expect? Looks like it >> did >> >> > it's >> >> > best to compensate for the user not supplying full data in the first >> >> > example, and there really was nothing to do in the second one. >> >> > >> >> > >> >> > >> >> > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck >> >> > wrote: >> >> >> >> >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we >> get >> >> >> a row but if we use just list(y = numeric(0)) then we do not get a >> >> >> row. In the first case it filled in the zero length component with >> NA >> >> >> and in the second case it just omitted the row entirely: >> >> >> >> >> >> > dd <- data.table(a = 1:3) >> >> >> > dd >> >> >> a >> >> >> 1: 1 >> >> >> 2: 2 >> >> >> 3: 3 >> >> >> > dd[, list(x = "X", y = numeric(0)), by = a] >> >> >> a x y >> >> >> 1: 1 X NA >> >> >> 2: 2 X NA >> >> >> 3: 3 X NA >> >> >> > dd[, list(y = numeric(0)), by = a] >> >> >> Empty data.table (0 rows) of 2 cols: a,y >> >> >> >> >> >> >> >> >> -- >> >> >> Statistics & Software Consulting >> >> >> GKX Group, GKX Associates Inc. >> >> >> tel: 1-877-GKX-GROUP >> >> >> email: ggrothendieck at gmail.com >> >> >> _______________________________________________ >> >> >> datatable-help mailing list >> >> >> datatable-help at lists.r-forge.r-project.org >> >> >> >> >> >> >> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > >> >> > >> >> >> >> >> >> >> >> -- >> >> Statistics & Software Consulting >> >> GKX Group, GKX Associates Inc. >> >> tel: 1-877-GKX-GROUP >> >> email: ggrothendieck at gmail.com >> > >> > >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. 
>> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Sat May 18 03:34:52 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 17 May 2013 21:34:52 -0400 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` df <- as.data.frame(dt) > identical(df[, "a"], dt[, get("a")]) [1] TRUE > identical(df[, "a"], dt[["a"]]) [1] TRUE > identical(df[, "a"], dt[, "a", with=FALSE]) [1] FALSE rm(df) -Rick Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > Well, looking at the documentation: > > j: A single column name, single expresson of column names, list() of > expressions of column names, an expression or function call that evaluates > to list (including data.frame and data.table which are lists, too), or *(when > with=FALSE) same as j in [.data.frame.* > ... > with:* *By default with=TRUE and j is evaluated within the frame of x. > The column names can be used as variables. *When with=FALSE, j works as > it does in [.data.frame.* > > * > * > The bolded out part of the documentation doesn't match the actual behavior. > > > > On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > >> @Arun and eddi: This question has come up before. >> >> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html >> (And I'm sure there are other times, too.) I can't say I've heard anyone >> arguing about it, though. :) >> >> I guess it works that way because >> ...in dt[ ,a], j is an expression which evaluates to a vector >> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or >> more columns" mode, translating j from "a" to list(a) >> >> It's unintuitive if you're expecting data frame behavior (you know, >> drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it >> shouldn't be much of a surprise. Adding the drop option, and maybe >> defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >> >> >> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan < >> eduard.antonyan at gmail.com> wrote: >> >>> I don't remember discussing this issue...? What is the conceptual >>> difference between dt[, a] and dt[, "a", with = F] and what does 'drop' >>> have to do with this?? >>> >>> >>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < >>> aragorn168b at gmail.com> wrote: >>> >>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow >>>> your question as well.. the discrepancy between: >>>> >>>> dt[, a] and dt[, "a", with=FALSE]. >>>> >>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) >>>> that should be used when you use `with=FALSE`. Until then, the default >>>> option seems to be drop=FALSE, which results in a data.table. >>>> >>>> Alexandre, as of now, it could be done as Eduard points out. 
>>>> >>>> Arun >>>> >>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>> >>>> Use dt[[colname]], but this seems like a bug to me - I would've thought >>>> that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>> >>>> >>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < >>>> alexandre.sieira at gmail.com> wrote: >>>> >>>> Sorry if this is a basic question. >>>> >>>> >>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' >>>> states that "A single column or single expression returns that type, >>>> usually a vector." >>>> >>>> >>>> I am able to obtain this behavior if I know the column name in advance: >>>> >>>> >>>> > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>> >>>> > dt >>>> >>>> a b >>>> >>>> 1: 1 4 >>>> >>>> 2: 2 5 >>>> >>>> 3: 3 6 >>>> >>>> > str(dt[,a]) >>>> >>>> num [1:3] 1 2 3 >>>> >>>> >>>> However, if I don't, no such luck: >>>> >>>> > colname="a" >>>> > str(dt[,colname,with=F]) >>>> Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: >>>> $ a: num 1 2 3 >>>> - attr(*, ".internal.selfref")= >>>> >>>> If there a way to extract an entire column as a vector if I have the >>>> column name as a character scalar? >>>> >>>> Thank you! >>>> >>>> -- >>>> Alexandre Sieira >>>> CISA, CISSP, ISO 27001 Lead Auditor >>>> >>>> "The truth is rarely pure and never simple." >>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat May 18 17:04:46 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 18 May 2013 10:04:46 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> All good points. The thinking here has this mind : myvars = c("col1","col2") DT[, myvars, with=FALSE] We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). I've just changed those two parts of ?data.table (thanks for highlighting) : was : "... or (when with=FALSE) same as j in [.data.frame." now : "... or (when with=FALSE) a vector of names or positions to select." Matthew On 17.05.2013 20:34, Ricardo Saporta wrote: > Hm... Eddi does seem to have a point here. 
While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` > > df <- as.data.frame(dt) > > identical(df[, "a"], dt[, get("a")]) > [1] TRUE > > identical(df[, "a"], dt[["a"]]) > [1] TRUE > > identical(df[, "a"], dt[, "a", with=FALSE]) > [1] FALSE > rm(df) > -Rick > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [14] > > On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > >> Well, looking at the documentation: >> j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (WHEN WITH=FALSE) SAME AS J IN [.DATA.FRAME. >> ... >> with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. WHEN WITH=FALSE, J WORKS AS IT DOES IN [.DATA.FRAME. >> >> The bolded out part of the documentation doesn't match the actual behavior. >> >> On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: >> >>> @Arun and eddi: This question has come up before. >>> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [9] >>> (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) >>> I guess it works that way because >>> ...in dt[ ,a], j is an expression which evaluates to a vector >>> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) >>> It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >>> >>> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: >>> >>>> I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? >>>> >>>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: >>>> >>>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: >>>>> dt[, a] and dt[, "a", with=FALSE]. >>>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. >>>>> Alexandre, as of now, it could be done as Eduard points out. >>>>> >>>>> Arun >>>>> >>>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>>> >>>>>> Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>>>> >>>>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: >>>>>> >>>>>>> Sorry if this is a basic question. >>>>>>> >>>>>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." 
>>>>>>> >>>>>>> I am able to obtain this behavior if I know the column name in advance: >>>>>>> >>>>>>>> dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>>>>> >>>>>>>> dt >>>>>>> >>>>>>> a b >>>>>>> >>>>>>> 1: 1 4 >>>>>>> >>>>>>> 2: 2 5 >>>>>>> >>>>>>> 3: 3 6 >>>>>>> >>>>>>>> str(dt[,a]) >>>>>>> >>>>>>> num [1:3] 1 2 3 >>>>>>> >>>>>>> However, if I don't, no such luck: >>>>>>> >>>>>>>> colname="a" >>>>>>>> str(dt[,colname,with=F]) >>>>>>> Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: >>>>>>> $ a: num 1 2 3 >>>>>>> - attr(*, ".internal.selfref")= >>>>>>> If there a way to extract an entire column as a vector if I have the column name as a character scalar? >>>>>>> Thank you! >>>>>>> >>>>>>> -- >>>>>>> Alexandre Sieira >>>>>>> CISA, CISSP, ISO 27001 Lead Auditor >>>>>>> >>>>>>> "The truth is rarely pure and never simple." >>>>>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>>>> >>>>>> _______________________________________________ >>>>>> datatable-help mailing list >>>>>> datatable-help at lists.r-forge.r-project.org [4] >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org [7] >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [12] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [13] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:alexandre.sieira at gmail.com [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:aragorn168b at gmail.com [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [10] mailto:eduard.antonyan at gmail.com [11] mailto:FErickson at psu.edu [12] mailto:datatable-help at lists.r-forge.r-project.org [13] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [14] mailto:saporta at rutgers.edu [15] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat May 18 17:18:31 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 18 May 2013 10:18:31 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> Message-ID: And FAQ 2.17 has a little more on that : "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and drop drop." 
If it helps to know, I also use DT[["somename"]] quite a bit. Matthew On 18.05.2013 10:04, Matthew Dowle wrote: > All good points. The thinking here has this mind : > > myvars = c("col1","col2") > DT[, myvars, with=FALSE] > > We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). > > I've just changed those two parts of ?data.table (thanks for highlighting) : > > was : > "... or (when with=FALSE) same as j in [.data.frame." > now : > "... or (when with=FALSE) a vector of names or positions to select." > > Matthew > > On 17.05.2013 20:34, Ricardo Saporta wrote: > >> Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` >> >> df <- as.data.frame(dt) >> > identical(df[, "a"], dt[, get("a")]) >> [1] TRUE >> > identical(df[, "a"], dt[["a"]]) >> [1] TRUE >> > identical(df[, "a"], dt[, "a", with=FALSE]) >> [1] FALSE >> rm(df) >> -Rick >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu [14] >> >> On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: >> >>> Well, looking at the documentation: >>> j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (WHEN WITH=FALSE) SAME AS J IN [.DATA.FRAME. >>> ... >>> with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. WHEN WITH=FALSE, J WORKS AS IT DOES IN [.DATA.FRAME. >>> >>> The bolded out part of the documentation doesn't match the actual behavior. >>> >>> On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: >>> >>>> @Arun and eddi: This question has come up before. >>>> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [9] >>>> (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) >>>> I guess it works that way because >>>> ...in dt[ ,a], j is an expression which evaluates to a vector >>>> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) >>>> It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >>>> >>>> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: >>>> >>>>> I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? >>>>> >>>>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: >>>>> >>>>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: >>>>>> dt[, a] and dt[, "a", with=FALSE]. 
>>>>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. >>>>>> Alexandre, as of now, it could be done as Eduard points out. >>>>>> >>>>>> Arun >>>>>> >>>>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>>>> >>>>>>> Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>>>>> >>>>>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: >>>>>>> >>>>>>>> Sorry if this is a basic question. >>>>>>>> >>>>>>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." >>>>>>>> >>>>>>>> I am able to obtain this behavior if I know the column name in advance: >>>>>>>> >>>>>>>>> dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>>>>>> >>>>>>>>> dt >>>>>>>> >>>>>>>> a b >>>>>>>> >>>>>>>> 1: 1 4 >>>>>>>> >>>>>>>> 2: 2 5 >>>>>>>> >>>>>>>> 3: 3 6 >>>>>>>> >>>>>>>>> str(dt[,a]) >>>>>>>> >>>>>>>> num [1:3] 1 2 3 >>>>>>>> >>>>>>>> However, if I don't, no such luck: >>>>>>>> >>>>>>>>> colname="a" >>>>>>>>> str(dt[,colname,with=F]) >>>>>>>> Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: >>>>>>>> $ a: num 1 2 3 >>>>>>>> - attr(*, ".internal.selfref")= >>>>>>>> If there a way to extract an entire column as a vector if I have the column name as a character scalar? >>>>>>>> Thank you! >>>>>>>> >>>>>>>> -- >>>>>>>> Alexandre Sieira >>>>>>>> CISA, CISSP, ISO 27001 Lead Auditor >>>>>>>> >>>>>>>> "The truth is rarely pure and never simple." >>>>>>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>>>>> >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org [4] >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>>> >>>>> _______________________________________________ >>>>> datatable-help mailing list >>>>> datatable-help at lists.r-forge.r-project.org [7] >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org [12] >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [13] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:alexandre.sieira at gmail.com [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:aragorn168b at gmail.com [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [10] mailto:eduard.antonyan at gmail.com [11] mailto:FErickson at psu.edu [12] mailto:datatable-help at lists.r-forge.r-project.org [13] 
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [14] mailto:saporta at rutgers.edu [15] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat May 18 17:21:16 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 18 May 2013 17:21:16 +0200 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> Message-ID: <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> Matthew wrote "..the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output)." That's a very nice way to put it. Arun On Saturday, May 18, 2013 at 5:18 PM, Matthew Dowle wrote: > > And FAQ 2.17 has a little more on that : > "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases > where single columns are selected and all of a sudden a vector is returned rather than a single > column data.frame. In [.data.table we took the opportunity to make it consistent and drop > drop." > > If it helps to know, I also use DT[["somename"]] quite a bit. > > Matthew > > On 18.05.2013 10:04, Matthew Dowle wrote: > > > > All good points. The thinking here has this mind : > > > > myvars = c("col1","col2") > > DT[, myvars, with=FALSE] > > > > We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). > > > > I've just changed those two parts of ?data.table (thanks for highlighting) : > > > > was : > > "... or (when with=FALSE) same as j in [.data.frame." > > now : > > "... or (when with=FALSE) a vector of names or positions to select." > > > > Matthew > > > > On 17.05.2013 20:34, Ricardo Saporta wrote: > > > Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` > > > df <- as.data.frame(dt) > > > > identical(df[, "a"], dt[, get("a")]) > > > [1] TRUE > > > > identical(df[, "a"], dt[["a"]]) > > > [1] TRUE > > > > identical(df[, "a"], dt[, "a", with=FALSE]) > > > [1] FALSE > > > rm(df) > > > > > > -Rick > > > > > > Ricardo Saporta > > > Graduate Student, Data Analytics > > > Rutgers University, New Jersey > > > e: saporta at rutgers.edu (mailto:saporta at rutgers.edu) > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > > > > Well, looking at the documentation: > > > > j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) same as j in [.data.frame. > > > > ... > > > > with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. When with=FALSE, j works as it does in [.data.frame. > > > > > > > > > > > > The bolded out part of the documentation doesn't match the actual behavior. 
> > > > > > > > > > > > On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > > > > > @Arun and eddi: This question has come up before. > > > > > http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html > > > > > (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) > > > > > I guess it works that way because > > > > > ...in dt[ ,a], j is an expression which evaluates to a vector > > > > > ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) > > > > > It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: > > > > > > I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: > > > > > > > Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: > > > > > > > dt[, a] and dt[, "a", with=FALSE]. > > > > > > > There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. > > > > > > > Alexandre, as of now, it could be done as Eduard points out. > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: > > > > > > > > > Sorry if this is a basic question. > > > > > > > > > > > > > > > > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." > > > > > > > > > > > > > > > > > > I am able to obtain this behavior if I know the column name in advance: > > > > > > > > > > > > > > > > > > > > > > > > > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > > > > > > > > dt > > > > > > > > > a b > > > > > > > > > 1: 1 4 > > > > > > > > > 2: 2 5 > > > > > > > > > 3: 3 6 > > > > > > > > > > str(dt[,a]) > > > > > > > > > num [1:3] 1 2 3 > > > > > > > > > > > > > > > > > > However, if I don't, no such luck: > > > > > > > > > > colname="a" > > > > > > > > > > str(dt[,colname,with=F]) > > > > > > > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > > > > > > > $ a: num 1 2 3 > > > > > > > > > - attr(*, ".internal.selfref")= > > > > > > > > > > > > > > > > > > If there a way to extract an entire column as a vector if I have the column name as a character scalar? > > > > > > > > > Thank you! > > > > > > > > > -- > > > > > > > > > Alexandre Sieira > > > > > > > > > CISA, CISSP, ISO 27001 Lead Auditor > > > > > > > > > > > > > > > > > > "The truth is rarely pure and never simple." 
> > > > > > > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > > > > > > > > _______________________________________________ > > > > > > > > > datatable-help mailing list > > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > _______________________________________________ > > > > > > > > datatable-help mailing list > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > datatable-help mailing list > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > > > > datatable-help mailing list > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat May 18 17:23:46 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 18 May 2013 17:23:46 +0200 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> Message-ID: @Matthew, On another note, are there plans to implement "drop=T/F" in data.table? Arun On Saturday, May 18, 2013 at 5:21 PM, Arunkumar Srinivasan wrote: > Matthew wrote "..the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output)." > That's a very nice way to put it. > > > Arun > > > On Saturday, May 18, 2013 at 5:18 PM, Matthew Dowle wrote: > > > > > And FAQ 2.17 has a little more on that : > > "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases > > where single columns are selected and all of a sudden a vector is returned rather than a single > > column data.frame. In [.data.table we took the opportunity to make it consistent and drop > > drop." > > > > If it helps to know, I also use DT[["somename"]] quite a bit. > > > > Matthew > > > > On 18.05.2013 10:04, Matthew Dowle wrote: > > > > > > All good points. The thinking here has this mind : > > > > > > myvars = c("col1","col2") > > > DT[, myvars, with=FALSE] > > > > > > We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). > > > > > > I've just changed those two parts of ?data.table (thanks for highlighting) : > > > > > > was : > > > "... 
or (when with=FALSE) same as j in [.data.frame." > > > now : > > > "... or (when with=FALSE) a vector of names or positions to select." > > > > > > Matthew > > > > > > On 17.05.2013 20:34, Ricardo Saporta wrote: > > > > Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` > > > > df <- as.data.frame(dt) > > > > > identical(df[, "a"], dt[, get("a")]) > > > > [1] TRUE > > > > > identical(df[, "a"], dt[["a"]]) > > > > [1] TRUE > > > > > identical(df[, "a"], dt[, "a", with=FALSE]) > > > > [1] FALSE > > > > rm(df) > > > > > > > > -Rick > > > > > > > > Ricardo Saporta > > > > Graduate Student, Data Analytics > > > > Rutgers University, New Jersey > > > > e: saporta at rutgers.edu (mailto:saporta at rutgers.edu) > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > > > > > Well, looking at the documentation: > > > > > j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) same as j in [.data.frame. > > > > > ... > > > > > with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. When with=FALSE, j works as it does in [.data.frame. > > > > > > > > > > > > > > > The bolded out part of the documentation doesn't match the actual behavior. > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > > > > > > @Arun and eddi: This question has come up before. > > > > > > http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html > > > > > > (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) > > > > > > I guess it works that way because > > > > > > ...in dt[ ,a], j is an expression which evaluates to a vector > > > > > > ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) > > > > > > It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: > > > > > > > I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: > > > > > > > > Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: > > > > > > > > dt[, a] and dt[, "a", with=FALSE]. > > > > > > > > There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. > > > > > > > > Alexandre, as of now, it could be done as Eduard points out. 
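That is, until a drop option exists, either of these returns a plain vector when the column name is held in a variable (a quick sketch; colname is just an example name):

colname <- "a"
dt[[colname]]       # primitive [[ extraction, always a vector
dt[, get(colname)]  # j evaluated within the data.table, also a vector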
> > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > > > > On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > > > Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: > > > > > > > > > > Sorry if this is a basic question. > > > > > > > > > > > > > > > > > > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." > > > > > > > > > > > > > > > > > > > > I am able to obtain this behavior if I know the column name in advance: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > > > > > > > > > dt > > > > > > > > > > a b > > > > > > > > > > 1: 1 4 > > > > > > > > > > 2: 2 5 > > > > > > > > > > 3: 3 6 > > > > > > > > > > > str(dt[,a]) > > > > > > > > > > num [1:3] 1 2 3 > > > > > > > > > > > > > > > > > > > > However, if I don't, no such luck: > > > > > > > > > > > colname="a" > > > > > > > > > > > str(dt[,colname,with=F]) > > > > > > > > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > > > > > > > > $ a: num 1 2 3 > > > > > > > > > > - attr(*, ".internal.selfref")= > > > > > > > > > > > > > > > > > > > > If there a way to extract an entire column as a vector if I have the column name as a character scalar? > > > > > > > > > > Thank you! > > > > > > > > > > -- > > > > > > > > > > Alexandre Sieira > > > > > > > > > > CISA, CISSP, ISO 27001 Lead Auditor > > > > > > > > > > > > > > > > > > > > "The truth is rarely pure and never simple." > > > > > > > > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > > > > > > > > > _______________________________________________ > > > > > > > > > > datatable-help mailing list > > > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > _______________________________________________ > > > > > > > > > datatable-help mailing list > > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > datatable-help mailing list > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Sat May 18 18:19:57 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 18 May 2013 11:19:57 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> Message-ID: In my mind currently, more pressing than drop=T/F is that long thread about by-without-by the other week. Need to find a few hours in a dark room to go through it with fresh eyes, draw together the points and link up with a few FRs and bug reports. I suspect quite a lot might simplify if we do change that, and I think that's likely. Then drop=T/F might go away since that would be what it would do by default, iirc. drop=T/F is entwined with that anyway. Matthew On 18.05.2013 10:23, Arunkumar Srinivasan wrote: > @Matthew, > On another note, are there plans to implement "drop=T/F" in data.table? > > Arun > > On Saturday, May 18, 2013 at 5:21 PM, Arunkumar Srinivasan wrote: > >> Matthew wrote "..the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output)." >> That's a very nice way to put it. >> >> Arun >> >> On Saturday, May 18, 2013 at 5:18 PM, Matthew Dowle wrote: >> >>> And FAQ 2.17 has a little more on that : >>> >>> "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases >>> where single columns are selected and all of a sudden a vector is returned rather than a single >>> column data.frame. In [.data.table we took the opportunity to make it consistent and drop >>> drop." >>> >>> If it helps to know, I also use DT[["somename"]] quite a bit. >>> >>> Matthew >>> >>> On 18.05.2013 10:04, Matthew Dowle wrote: >>> >>>> All good points. The thinking here has this mind : >>>> >>>> myvars = c("col1","col2") >>>> DT[, myvars, with=FALSE] >>>> >>>> We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). >>>> >>>> I've just changed those two parts of ?data.table (thanks for highlighting) : >>>> >>>> was : >>>> "... or (when with=FALSE) same as j in [.data.frame." >>>> now : >>>> "... or (when with=FALSE) a vector of names or positions to select." >>>> >>>> Matthew >>>> >>>> On 17.05.2013 20:34, Ricardo Saporta wrote: >>>> >>>>> Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. 
ie, that the last of the following `identical` statements should evaluate to `TRUE` >>>>> >>>>> df <- as.data.frame(dt) >>>>> > identical(df[, "a"], dt[, get("a")]) >>>>> [1] TRUE >>>>> > identical(df[, "a"], dt[["a"]]) >>>>> [1] TRUE >>>>> > identical(df[, "a"], dt[, "a", with=FALSE]) >>>>> [1] FALSE >>>>> rm(df) >>>>> -Rick >>>>> >>>>> Ricardo Saporta >>>>> Graduate Student, Data Analytics >>>>> Rutgers University, New Jersey >>>>> e: saporta at rutgers.edu [14] >>>>> >>>>> On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: >>>>> >>>>>> Well, looking at the documentation: >>>>>> j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (WHEN WITH=FALSE) SAME AS J IN [.DATA.FRAME. >>>>>> ... >>>>>> with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. WHEN WITH=FALSE, J WORKS AS IT DOES IN [.DATA.FRAME. >>>>>> >>>>>> The bolded out part of the documentation doesn't match the actual behavior. >>>>>> >>>>>> On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: >>>>>> >>>>>>> @Arun and eddi: This question has come up before. >>>>>>> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [9] >>>>>>> (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) >>>>>>> I guess it works that way because >>>>>>> ...in dt[ ,a], j is an expression which evaluates to a vector >>>>>>> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) >>>>>>> It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >>>>>>> >>>>>>> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: >>>>>>> >>>>>>>> I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? >>>>>>>> >>>>>>>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: >>>>>>>> >>>>>>>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: >>>>>>>>> dt[, a] and dt[, "a", with=FALSE]. >>>>>>>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. >>>>>>>>> Alexandre, as of now, it could be done as Eduard points out. >>>>>>>>> >>>>>>>>> Arun >>>>>>>>> >>>>>>>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>>>>>>> >>>>>>>>>> Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>>>>>>>> >>>>>>>>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: >>>>>>>>>> >>>>>>>>>>> Sorry if this is a basic question. >>>>>>>>>>> >>>>>>>>>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." 
>>>>>>>>>>> >>>>>>>>>>> I am able to obtain this behavior if I know the column name in advance: >>>>>>>>>>> >>>>>>>>>>>> dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>>>>>>>>> >>>>>>>>>>>> dt >>>>>>>>>>> >>>>>>>>>>> a b >>>>>>>>>>> >>>>>>>>>>> 1: 1 4 >>>>>>>>>>> >>>>>>>>>>> 2: 2 5 >>>>>>>>>>> >>>>>>>>>>> 3: 3 6 >>>>>>>>>>> >>>>>>>>>>>> str(dt[,a]) >>>>>>>>>>> >>>>>>>>>>> num [1:3] 1 2 3 >>>>>>>>>>> >>>>>>>>>>> However, if I don't, no such luck: >>>>>>>>>>> >>>>>>>>>>>> colname="a" >>>>>>>>>>>> str(dt[,colname,with=F]) >>>>>>>>>>> Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: >>>>>>>>>>> $ a: num 1 2 3 >>>>>>>>>>> - attr(*, ".internal.selfref")= >>>>>>>>>>> If there a way to extract an entire column as a vector if I have the column name as a character scalar? >>>>>>>>>>> Thank you! >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Alexandre Sieira >>>>>>>>>>> CISA, CISSP, ISO 27001 Lead Auditor >>>>>>>>>>> >>>>>>>>>>> "The truth is rarely pure and never simple." >>>>>>>>>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> datatable-help mailing list >>>>>>>>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> datatable-help mailing list >>>>>>>>>> datatable-help at lists.r-forge.r-project.org [4] >>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org [7] >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] >>>>>> >>>>>> _______________________________________________ >>>>>> datatable-help mailing list >>>>>> datatable-help at lists.r-forge.r-project.org [12] >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [13] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:alexandre.sieira at gmail.com [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:aragorn168b at gmail.com [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [10] mailto:eduard.antonyan at gmail.com [11] mailto:FErickson at psu.edu [12] mailto:datatable-help at lists.r-forge.r-project.org [13] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [14] mailto:saporta at rutgers.edu [15] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Sun May 19 19:55:43 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 19 May 2013 13:55:43 -0400 Subject: [datatable-help] Feature request: allow specification of row name column when keep.rownames=TRUE Message-ID: When one uses data.table(..., keep.rownames=TRUE) the name of the resulting column is always "rn". Some way of specifying the column name would be nice. (The default could continue to be "rn".) 
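In the meantime a rename after construction works (a sketch; "rn" is the current hard-coded name and "id" is only an example):

DF <- data.frame(x = 1:2, row.names = c("a", "b"))
DT <- data.table(DF, keep.rownames = TRUE)
setnames(DT, "rn", "id")   # pick whatever column name is wanted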
-- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Mon May 20 15:07:35 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 20 May 2013 14:07:35 +0100 Subject: [datatable-help] data.table slides at R/Finance Message-ID: <9818de222df45dc234991a3b8fd7e3a0@imap.plus.net> are now on the data.table homepage : http://datatable.r-forge.r-project.org/ It was really encouraging to meet all the datatablers there! And got some useful feedback too. (The tutorial slides were meant more as prompts to explain and discuss, so they probably don't come across very well when read cold.) Matthew From jose at memo2.nl Tue May 21 11:10:30 2013 From: jose at memo2.nl (JNV) Date: Tue, 21 May 2013 02:10:30 -0700 (PDT) Subject: [datatable-help] Sum first 3 non zero elements of row Message-ID: <1369127430158-4667563.post@n4.nabble.com> Hi there, I've got this matrix D with, say 10 rows and 20 columns. For each row I want to sum the first 3 non zero elements and put them in a vector z. So if the first row D[1,] is 0 3 5 0 8 9 3 2 4 0 then I want z z<-D[1,2]+D[1,3]+D[1,5] But if there are less than 3 non zero elements, those should be summed. If there are no non zero elements, the result must be zero. So if the first row D[1,] is 0 0 3 0 1 0 0 0 0 0 then I want z z<-D[1,3]+D[1,5] Hope someone can help me out! -- View this message in context: http://r.789695.n4.nabble.com/Sum-first-3-non-zero-elements-of-row-tp4667563.html Sent from the datatable-help mailing list archive at Nabble.com. From ggrothendieck at gmail.com Tue May 21 14:28:56 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Tue, 21 May 2013 08:28:56 -0400 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: <1369127430158-4667563.post@n4.nabble.com> References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: On Tue, May 21, 2013 at 5:10 AM, JNV wrote: > Hi there, > I've got this matrix D with, say 10 rows and 20 columns. For each row I want > to sum the first 3 non zero elements and put them in a vector z. > > So if the first row D[1,] is > 0 3 5 0 8 9 3 2 4 0 > > then I want z > z<-D[1,2]+D[1,3]+D[1,5] > > But if there are less than 3 non zero elements, those should be summed. If > there are no non zero elements, the result must be zero. > > So if the first row D[1,] is > 0 0 3 0 1 0 0 0 0 0 > > then I want z > z<-D[1,3]+D[1,5] > Here is a matrix, D, with those two rows The t(apply(...)) replaces the first non-zero element in each row with 1, the 2nd with 2, etc. (It puts garbage into the elements that are 0.) We then convert this to T/F according to whether each element less than or equal to 3 or not and multiply by the original data which both zaps the garbage in the zero positions and zaps those positions which are a 4th or more non-zero in each row. This multiplication also inserts the correct values into the good positions. Finally we sum the rows using what is left: > D <- matrix( c(0, 0, 3, 0, 5, 3, 0, 0, 8, 1, 9, 0, 3, + 0, 2, 0, 4, 0, 0, 0), 2) > > D [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0 3 5 0 8 9 3 2 4 0 [2,] 0 0 3 0 1 0 0 0 0 0 > > as.data.table(D)[, rowSums((t(apply(.SD > 0, 1, cumsum)) <= 3) * .SD)] [1] 16 4 Not sure if this really benefits from data.table as we could have written this without data.table: > rowSums((t(apply(D > 0, 1, cumsum)) <= 3) * D) [1] 16 4 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. 
tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From jose at memo2.nl Tue May 21 15:38:50 2013 From: jose at memo2.nl (=?ISO-8859-1?Q?Jos=E9_Verhoeven?=) Date: Tue, 21 May 2013 15:38:50 +0200 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: Thank you, it really helped me out! Didn't need the data.table like you proposed. 2013/5/21 Gabor Grothendieck > On Tue, May 21, 2013 at 5:10 AM, JNV wrote: > > Hi there, > > I've got this matrix D with, say 10 rows and 20 columns. For each row I > want > > to sum the first 3 non zero elements and put them in a vector z. > > > > So if the first row D[1,] is > > 0 3 5 0 8 9 3 2 4 0 > > > > then I want z > > z<-D[1,2]+D[1,3]+D[1,5] > > > > But if there are less than 3 non zero elements, those should be summed. > If > > there are no non zero elements, the result must be zero. > > > > So if the first row D[1,] is > > 0 0 3 0 1 0 0 0 0 0 > > > > then I want z > > z<-D[1,3]+D[1,5] > > > > Here is a matrix, D, with those two rows The t(apply(...)) replaces > the first non-zero element in each row with 1, the 2nd with 2, etc. > (It puts garbage into the elements that are 0.) We then convert > this to T/F according to whether each element less than or equal to 3 > or not and multiply by the original data which both zaps the garbage > in the zero positions and zaps those positions which are a 4th or more > non-zero in each row. This multiplication also inserts the correct > values into the good positions. Finally we sum the rows using what is > left: > > > D <- matrix( c(0, 0, 3, 0, 5, 3, 0, 0, 8, 1, 9, 0, 3, > + 0, 2, 0, 4, 0, 0, 0), 2) > > > > D > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] > [1,] 0 3 5 0 8 9 3 2 4 0 > [2,] 0 0 3 0 1 0 0 0 0 0 > > > > as.data.table(D)[, rowSums((t(apply(.SD > 0, 1, cumsum)) <= 3) * .SD)] > [1] 16 4 > > Not sure if this really benefits from data.table as we could have > written this without data.table: > > > rowSums((t(apply(D > 0, 1, cumsum)) <= 3) * D) > [1] 16 4 > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue May 21 15:45:23 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 21 May 2013 14:45:23 +0100 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: <382f840f29fcaad1c3447419940ad28a@imap.plus.net> Yes, think it was meant for r-help. It can be fairly easy to mix up in Nabble since datatable-help is a sub-forum of R there. But the notices are quite clear upon subscribing (which is required to post). Matthew On 21.05.2013 13:28, Gabor Grothendieck wrote: > On Tue, May 21, 2013 at 5:10 AM, JNV wrote: >> Hi there, >> I've got this matrix D with, say 10 rows and 20 columns. For each >> row I want >> to sum the first 3 non zero elements and put them in a vector z. >> >> So if the first row D[1,] is >> 0 3 5 0 8 9 3 2 4 0 >> >> then I want z >> z<-D[1,2]+D[1,3]+D[1,5] >> >> But if there are less than 3 non zero elements, those should be >> summed. If >> there are no non zero elements, the result must be zero. 
>> >> So if the first row D[1,] is >> 0 0 3 0 1 0 0 0 0 0 >> >> then I want z >> z<-D[1,3]+D[1,5] >> > > Here is a matrix, D, with those two rows The t(apply(...)) replaces > the first non-zero element in each row with 1, the 2nd with 2, etc. > (It puts garbage into the elements that are 0.) We then convert > this to T/F according to whether each element less than or equal to 3 > or not and multiply by the original data which both zaps the garbage > in the zero positions and zaps those positions which are a 4th or > more > non-zero in each row. This multiplication also inserts the correct > values into the good positions. Finally we sum the rows using what > is > left: > >> D <- matrix( c(0, 0, 3, 0, 5, 3, 0, 0, 8, 1, 9, 0, 3, > + 0, 2, 0, 4, 0, 0, 0), 2) >> >> D > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] > [1,] 0 3 5 0 8 9 3 2 4 0 > [2,] 0 0 3 0 1 0 0 0 0 0 >> >> as.data.table(D)[, rowSums((t(apply(.SD > 0, 1, cumsum)) <= 3) * >> .SD)] > [1] 16 4 > > Not sure if this really benefits from data.table as we could have > written this without data.table: > >> rowSums((t(apply(D > 0, 1, cumsum)) <= 3) * D) > [1] 16 4 > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From alexandre.sieira at gmail.com Tue May 21 20:06:25 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 21 May 2013 15:06:25 -0300 Subject: [datatable-help] rbindlist and factors Message-ID: I think I found an unexpected behavior with rbindlist when columns are factors: > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > dt1 ? ?a 1: a 2: a 3: a > str(dt1) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 ?- attr(*, ".internal.selfref")=? > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > dt2 ? ?a 1: b 2: b 3: b > str(dt2) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "b": 1 1 1 ?- attr(*, ".internal.selfref")=? If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > rbind(dt1, dt2) ? ?a 1: a 2: a 3: a 4: b 5: b 6: b So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > rbindlist(list(dt1, dt2)) ? ?a 1: a 2: a 3: a 4: a 5: a 6: a > str(rbindlist(list(dt1, dt2))) Classes ?data.table? and 'data.frame': 6 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 1 1 1 ?- attr(*, ".internal.selfref")=? This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. Is this expected behavior? Am I missing something? --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed... 
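A workaround on 1.8.8 until a fixed release is out, sketched below: plain rbind dispatches to data.table's rbind method, which combines the factor levels correctly (at some speed cost relative to rbindlist):

dt1 <- data.table(a = factor(c("a", "a", "a")))
dt2 <- data.table(a = factor(c("b", "b", "b")))
do.call("rbind", list(dt1, dt2))   # 6 rows, levels "a" and "b" both kept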
URL: From aragorn168b at gmail.com Tue May 21 20:08:56 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 21 May 2013 20:08:56 +0200 Subject: [datatable-help] rbindlist and factors In-Reply-To: References: Message-ID: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> This was already addressed here: http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for And was known to be a bug filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 Which has been fixed in the current development version 1.8.9. ( Fixed by commit 879 in v1.8.9 Hope this helps, Arun On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: > I think I found an unexpected behavior with rbindlist when columns are factors: > > > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > > > dt1 > a > 1: a > 2: a > 3: a > > str(dt1) > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > $ a: Factor w/ 1 level "a": 1 1 1 > - attr(*, ".internal.selfref")= > > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > > dt2 > a > 1: b > 2: b > 3: b > > str(dt2) > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > $ a: Factor w/ 1 level "b": 1 1 1 > - attr(*, ".internal.selfref")= > > If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > > > rbind(dt1, dt2) > a > 1: a > 2: a > 3: a > 4: b > 5: b > 6: b > > > So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > > > rbindlist(list(dt1, dt2)) > a > 1: a > 2: a > 3: a > 4: a > 5: a > 6: a > > > str(rbindlist(list(dt1, dt2))) > Classes ?data.table? and 'data.frame': 6 obs. of 1 variable: > $ a: Factor w/ 1 level "a": 1 1 1 1 1 1 > - attr(*, ".internal.selfref")= > > > This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. > > Is this expected behavior? Am I missing something? > > > > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Tue May 21 20:11:39 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 21 May 2013 15:11:39 -0300 Subject: [datatable-help] =?utf-8?q?rbindlist_and_factors?= In-Reply-To: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> References: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> Message-ID: Thank you, I'll wait for the next release then.? It's do.call("rbind", ?) till then, I presume. :) --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." 
Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 21 de maio de 2013 at 15:09:00, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: This was already addressed here: http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for And was known to be a bug filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 Which has been fixed in the current development version 1.8.9. ( Fixed by commit 879 in v1.8.9 Hope this helps, Arun On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: I think I found an unexpected behavior with rbindlist when columns are factors: > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > dt1 ? ?a 1: a 2: a 3: a > str(dt1) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 ?- attr(*, ".internal.selfref")=? > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > dt2 ? ?a 1: b 2: b 3: b > str(dt2) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "b": 1 1 1 ?- attr(*, ".internal.selfref")=? If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > rbind(dt1, dt2) ? ?a 1: a 2: a 3: a 4: b 5: b 6: b So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > rbindlist(list(dt1, dt2)) ? ?a 1: a 2: a 3: a 4: a 5: a 6: a > str(rbindlist(list(dt1, dt2))) Classes ?data.table? and 'data.frame': 6 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 1 1 1 ?- attr(*, ".internal.selfref")=? This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. Is this expected behavior? Am I missing something? --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue May 21 20:14:35 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 21 May 2013 20:14:35 +0200 Subject: [datatable-help] rbindlist and factors In-Reply-To: References: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> Message-ID: You can download 1.8.9 from r-forge and use it. If you're much concerned, you can use devtools and install 1.8.9 in dev mode as follows: >require(devtools) >dev_mode(TRUE) d> install.packages("data.table", repos="http://R-Forge.R-project.org", type="source") d> require(data.table) d> # do whatever calculations you want d> dev_mode(FALSE) > # returns to normal session Arun On Tuesday, May 21, 2013 at 8:11 PM, Alexandre Sieira wrote: > Thank you, I'll wait for the next release then. > > It's do.call("rbind", ?) till then, I presume. :) > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." 
> Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > On 21 de maio de 2013 at 15:09:00, Arunkumar Srinivasan (aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)) wrote: > > > This was already addressed here: > > http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for > > > > And was known to be a bug filed here: > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 > > > > Which has been fixed in the current development version 1.8.9. ( > > Fixed by commit 879 in v1.8.9 > > > > > > > > Hope this helps, > > Arun > > > > > > On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: > > > > > I think I found an unexpected behavior with rbindlist when columns are factors: > > > > > > > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > > > > > > > dt1 > > > a > > > 1: a > > > 2: a > > > 3: a > > > > str(dt1) > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > $ a: Factor w/ 1 level "a": 1 1 1 > > > - attr(*, ".internal.selfref")= > > > > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > > > > dt2 > > > a > > > 1: b > > > 2: b > > > 3: b > > > > str(dt2) > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > $ a: Factor w/ 1 level "b": 1 1 1 > > > - attr(*, ".internal.selfref")= > > > > > > If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > > > > > > > rbind(dt1, dt2) > > > a > > > 1: a > > > 2: a > > > 3: a > > > 4: b > > > 5: b > > > 6: b > > > > > > > > > So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > > > > > > > rbindlist(list(dt1, dt2)) > > > a > > > 1: a > > > 2: a > > > 3: a > > > 4: a > > > 5: a > > > 6: a > > > > > > > str(rbindlist(list(dt1, dt2))) > > > Classes ?data.table? and 'data.frame': 6 obs. of 1 variable: > > > $ a: Factor w/ 1 level "a": 1 1 1 1 1 1 > > > - attr(*, ".internal.selfref")= > > > > > > > > > This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. > > > > > > Is this expected behavior? Am I missing something? > > > > > > > > > > > > -- > > > Alexandre Sieira > > > CISA, CISSP, ISO 27001 Lead Auditor > > > > > > "The truth is rarely pure and never simple." > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Tue May 21 20:15:26 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 21 May 2013 15:15:26 -0300 Subject: [datatable-help] =?utf-8?q?rbindlist_and_factors?= In-Reply-To: References: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> Message-ID: Thank you, Arun! --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 21 de maio de 2013 at 15:14:39, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: You can download 1.8.9 from r-forge and use it. 
If you're much concerned, you can use devtools and install 1.8.9 in dev mode as follows: >require(devtools) >dev_mode(TRUE) d>?install.packages("data.table",?repos="http://R-Forge.R-project.org", type="source") d> require(data.table) d> # do whatever calculations you want d> dev_mode(FALSE) > # returns to normal session Arun On Tuesday, May 21, 2013 at 8:11 PM, Alexandre Sieira wrote: Thank you, I'll wait for the next release then.? It's do.call("rbind", ?) till then, I presume. :) --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 21 de maio de 2013 at 15:09:00, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: This was already addressed here: http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for And was known to be a bug filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 Which has been fixed in the current development version 1.8.9. ( Fixed by commit 879 in v1.8.9 Hope this helps, Arun On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: I think I found an unexpected behavior with rbindlist when columns are factors: > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > dt1 ? ?a 1: a 2: a 3: a > str(dt1) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 ?- attr(*, ".internal.selfref")=? > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > dt2 ? ?a 1: b 2: b 3: b > str(dt2) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "b": 1 1 1 ?- attr(*, ".internal.selfref")=? If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > rbind(dt1, dt2) ? ?a 1: a 2: a 3: a 4: b 5: b 6: b So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > rbindlist(list(dt1, dt2)) ? ?a 1: a 2: a 3: a 4: a 5: a 6: a > str(rbindlist(list(dt1, dt2))) Classes ?data.table? and 'data.frame': 6 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 1 1 1 ?- attr(*, ".internal.selfref")=? This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. Is this expected behavior? Am I missing something? --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
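A quick check afterwards that the development version is the one in use (a sketch; dt1 and dt2 as in the original report):

packageVersion("data.table")   # should report 1.8.9 inside dev_mode
rbindlist(list(dt1, dt2))      # should now keep both factor levels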
URL: From jholtman at gmail.com Tue May 21 20:55:45 2013 From: jholtman at gmail.com (jim holtman) Date: Tue, 21 May 2013 14:55:45 -0400 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: <1369127430158-4667563.post@n4.nabble.com> References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: Is this what you want: > x <- matrix(sample(c(0,1), 200, TRUE, prob = c(10,1)), ncol = 20) > # sum up to at most first 3 non-zero items > xSum <- apply(x, 1, function(.row){ + indx <- which(.row != 0)[1:3] + return(sum(.row[indx], na.rm = TRUE)) + }) > xSum [1] 0 0 1 2 2 2 3 2 3 2 > cbind(apply(x, 1, paste, collapse = '')) [,1] [1,] "00000000000000000000" [2,] "00000000000000000000" [3,] "10000000000000000000" [4,] "00000000001000000001" [5,] "00001000001000000000" [6,] "01001000000000000000" [7,] "01001000000000000100" [8,] "00000010000001000000" [9,] "00100100000001000000" [10,] "00000001000000010000" > On Tue, May 21, 2013 at 5:10 AM, JNV wrote: > Hi there, > I've got this matrix D with, say 10 rows and 20 columns. For each row I > want > to sum the first 3 non zero elements and put them in a vector z. > > So if the first row D[1,] is > 0 3 5 0 8 9 3 2 4 0 > > then I want z > z<-D[1,2]+D[1,3]+D[1,5] > > But if there are less than 3 non zero elements, those should be summed. If > there are no non zero elements, the result must be zero. > > So if the first row D[1,] is > 0 0 3 0 1 0 0 0 0 0 > > then I want z > z<-D[1,3]+D[1,5] > > Hope someone can help me out! > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/Sum-first-3-non-zero-elements-of-row-tp4667563.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Wed May 22 15:31:55 2013 From: statquant at outlook.com (statquant3) Date: Wed, 22 May 2013 06:31:55 -0700 (PDT) Subject: [datatable-help] progress % in fread Message-ID: <1369229515842-4667694.post@n4.nabble.com> I know the % progress counter has been removed, but I loved the feature... Is it still accessible from an option or something ? Cheers Colin -- View this message in context: http://r.789695.n4.nabble.com/progress-in-fread-tp4667694.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Wed May 22 15:48:07 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 22 May 2013 14:48:07 +0100 Subject: [datatable-help] progress % in fread In-Reply-To: <1369229515842-4667694.post@n4.nabble.com> References: <1369229515842-4667694.post@n4.nabble.com> Message-ID: <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> :) Not currently as when I removed it the thinking was speed to also save the 'if(i%%1000 && last print more than 1 second ago)' for each row. But it could be made optional again if people want it: an inner loop could be a 1000 batch with an outer loop containing the if() and printf(x,"%\r"). Matthew On 22.05.2013 14:31, statquant3 wrote: > I know the % progress counter has been removed, but I loved the > feature... > Is it still accessible from an option or something ? 
> > Cheers > Colin > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/progress-in-fread-tp4667694.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Wed May 22 16:44:17 2013 From: statquant at outlook.com (statquant3) Date: Wed, 22 May 2013 07:44:17 -0700 (PDT) Subject: [datatable-help] progress % in fread In-Reply-To: <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> Message-ID: <1369233857528-4667704.post@n4.nabble.com> Don't know about the other but an option does not hurt, I just loaded a 6million rows file and did not know if it would take 1minute or 10 minutes... something like displayProgress=F in the signature maybe ??? or an option... ??? -- View this message in context: http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667704.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Wed May 22 18:32:26 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 22 May 2013 17:32:26 +0100 Subject: [datatable-help] progress % in fread In-Reply-To: <1369233857528-4667704.post@n4.nabble.com> References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> <1369233857528-4667704.post@n4.nabble.com> Message-ID: The problem was knitr (at least) - it doesn't like the \r. I thought about a graphical progress window, like tkProgressBar() but it needs to be updateable from C level. I could try that again now I'm more comfortable calling R from C. So ... how about a tkProgressBar ? Optional, with argument to fread(,progress=getOption("datatable.fread.progress")) by default FALSE. On 22.05.2013 15:44, statquant3 wrote: > Don't know about the other but an option does not hurt, I just loaded > a > 6million rows file and did not know if it would take 1minute or 10 > minutes... > something like displayProgress=F in the signature maybe ??? or an > option... > ??? > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667704.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Thu May 23 09:18:34 2013 From: statquant at outlook.com (statquant3) Date: Thu, 23 May 2013 00:18:34 -0700 (PDT) Subject: [datatable-help] progress % in fread In-Reply-To: References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> <1369233857528-4667704.post@n4.nabble.com> Message-ID: <1369293514874-4667782.post@n4.nabble.com> Hey, progressbar would be fancy of course, but the old [75%] updating on the screen was good enough. I did not get your point about knitr, if this is added back as an option (disabled by default) why would knitr complain about it ? Regards -- View this message in context: http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667782.html Sent from the datatable-help mailing list archive at Nabble.com. 
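To make the tkProgressBar idea concrete, here is a sketch of the interface at R level. In fread the counter would really be updated from C, and the progress argument is only a proposal at this point; the loop below just simulates batches of rows:

library(tcltk)
pb <- tkProgressBar(title = "fread", label = "0% read", min = 0, max = 100)
for (pct in seq(10, 100, by = 10)) {
    Sys.sleep(0.1)   # stand-in for parsing a batch of rows
    setTkProgressBar(pb, pct, label = sprintf("%d%% read", pct))
}
close(pb)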
From mdowle at mdowle.plus.com Thu May 23 13:55:11 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Thu, 23 May 2013 12:55:11 +0100
Subject: [datatable-help] progress % in fread
In-Reply-To: <1369293514874-4667782.post@n4.nabble.com>
References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> <1369233857528-4667704.post@n4.nabble.com> <1369293514874-4667782.post@n4.nabble.com>
Message-ID: <7ba023c6f7e940f8ed92e7ebc1b9fbee@imap.plus.net>

On 23.05.2013 08:18, statquant3 wrote:
> Hey, progressbar would be fancy of course, but the old [75%] updating
> on the screen was good enough.
> I did not get your point about knitr, if this is added back as an
> option (disabled by default) why would knitr complain about it ?

It wouldn't, if left off. What I had in mind was that a user might soon turn on the global option in their .Rprofile, and then it may mess up in knitr, causing them to hunt online and possibly ask about that as an apparent bug. The \r output may affect some tests too (if they check output to the console, which some do), say if a user had progress turned on and then ran test.data.table(). In contrast, a tkProgressBar would work consistently in all environments without any risk of problems caused by the \r, and without the user needing to switch it off and on. Yes, more fancy, but for practical reasons too.

> Regards
> --
> View this message in context:
> http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667782.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From ggrothendieck at gmail.com Fri May 24 18:29:05 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 24 May 2013 12:29:05 -0400
Subject: [datatable-help] logicals in datatable i
Message-ID:

The ?data.table page describing arg i says:

"integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame."

however, I get this:

> packageVersion("data.table")
[1] '1.8.9'
> DT1 <- data.table(a = 1:5)
> DT1[as.logical(NA)]
    a
1: NA

so it seems that if there is a single logical NA it not only is not recycled but it also is not regarded as FALSE (whereas the quoted statement seems to say a logical NA is regarded as FALSE in all cases).

Is this intended?

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Fri May 24 18:40:06 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 24 May 2013 18:40:06 +0200
Subject: [datatable-help] logicals in datatable i
In-Reply-To: References: Message-ID:

Gabor,
Not sure if this is intended, but there is code in `[.data.table` that explicitly assigns NA_integer_ if `i` is NA:

if (is.logical(i)) {
    if (identical(i, NA)) i = NA_integer_
    else i[is.na(i)] = FALSE
}

So, if `i` is JUST `NA`, it's replaced with NA_integer_. If there is more than one element and i contains NAs, they are replaced with FALSE. For example, doing

DT1[as.logical(c(NA, NA))]

would result in recycling and lead to an empty data.table with 0 rows and 1 column.
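Both branches in one quick sketch (on 1.8.9; results in comments):

DT1 <- data.table(a = 1:5)
DT1[NA]                      # identical(i, NA) is TRUE, so i becomes NA_integer_: one row of NA
DT1[as.logical(c(NA, NA))]   # NAs set to FALSE, then recycled: zero rows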
Arun On Friday, May 24, 2013 at 6:29 PM, Gabor Grothendieck wrote: > The ?data.table page describing arg i says: > > "integer and logical vectors work the same way they do in > [.data.frame. Other than NAs in logical i are treated as FALSE and a > single NA logical is not recycled to match the number of rows, as it > is in[.data.frame." > > however, I get this: > > > packageVersion("data.table") > [1] ?1.8.9? > > > DT1 <- data.table(a = 1:5) > > DT1[as.logical(NA)] > > > > a > 1: NA > > so it seems that if there is a single logical NA it not only is not > recycled but it also is not regarded as FALSE (whereas the quoted > statement seems to say a logical NA is regarded as FALSE in all > cases). > > Is this intended? > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com (http://gmail.com) > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri May 24 19:22:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 24 May 2013 18:22:09 +0100 Subject: [datatable-help] logicals in datatable i In-Reply-To: References: Message-ID: <6432f32fcb0a897af36d5263d22ba4fe@imap.plus.net> Yes indeed it's intended and there is this in FAQ 2.17 (differences between DF and DT) : "DT[NA] returns 1 row of NA, but DF[NA] returns a copy of DF containing NA throughout. The symbol NA is type logical in R, and is therefore recycled by [.data.frame. Intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience." I see what you mean Gabor about that sentence in ?data.table. Thanks, will improve that wording. Matthew On 24.05.2013 17:40, Arunkumar Srinivasan wrote: > Gabor, > Not sure if this is intended, but the there's a code that explicitly assigns NA_integer_ if `i` is NA in `[.data.table`: > > if (is.logical(i)) { > if (identical(i, NA)) > i = NA_integer_ > else i[is.na(i)] = FALSE > } > So, if `i` is JUST `NA`, then it's replaced with NA_integer_. If there are more than 1 element and i has NA in them, they are replaced with FALSE. > For ex: doing > DT1[as.logical(c(NA, NA))] > would result in recycling and lead to a 0 rows and 1 column empty data.table. > > Arun > > On Friday, May 24, 2013 at 6:29 PM, Gabor Grothendieck wrote: > >> The ?data.table page describing arg i says: >> "integer and logical vectors work the same way they do in >> [.data.frame. Other than NAs in logical i are treated as FALSE and a >> single NA logical is not recycled to match the number of rows, as it >> is in[.data.frame." >> however, I get this: >> >>> packageVersion("data.table") >> >> [1] '1.8.9' >> >>> DT1 <- data.table(a = 1:5) >>> DT1[as.logical(NA)] >> >> a >> 1: NA >> so it seems that if there is a single logical NA it not only is not >> recycled but it also is not regarded as FALSE (whereas the quoted >> statement seems to say a logical NA is regarded as FALSE in all >> cases). >> Is this intended? >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. 
>> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com [1] >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [2] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] Links: ------ [1] http://gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Tue May 28 19:37:16 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 28 May 2013 14:37:16 -0300 Subject: [datatable-help] Performance observation Message-ID: I was working on some code today and encountered this scenario here where the performance behavior of data.table surprised me a little. Is this expected? > dt = data.table(a=rnorm(1000000)) > system.time( for(i in 1:100000) j = dt[i, a] ) ? usu?rio ? sistema decorrido? ? ?78.064 ? ? 0.426 ? ?78.034? > system.time( for(i in 1:100000) j = dt[i, "a", with=F] ) ? usu?rio ? sistema decorrido? ? ?27.814 ? ? 0.154 ? ?27.810 ? > system.time( for(i in 1:100000) j = dt[["a"]][i] ) ? usu?rio ? sistema decorrido? ? ? 1.227 ? ? 0.006 ? ? 1.225? (sorry about the output in portuguese) Not knowing anything about how data.table is implemented internally, I would have assumed the three syntaxes for accessing the data.table should have similar or at the most a small difference in performance. --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue May 28 20:11:02 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 28 May 2013 19:11:02 +0100 Subject: [datatable-help] Performance observation In-Reply-To: References: Message-ID: Hi, Yes this is expected because `[.data.table` is a function call with associated overhead. You don't want to loop calls to it. Consider all the arguments to `[.data.table` and all the checks that must be done for existence and type of arguments on each call. The idea is to give [.data.table meaty calls which it can chew on. It doesn't like tiny tasks one at a time. `[[` on the other hand is an R primitive. It's part of the language. You can do very limited things with `[[` but in this case (looking up a single column by name or position) in a loop, that's best for the job. I use `[[` on data.table quite a lot. This is also the very reason for set()'s existence: ?set says it's a 'loopable :=' because of the `[.data.table` overhead. There's a feature request to detect when [.data.table is being looped, though : https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978 which would be more helpful of data.table, so at least it told you, rather than having to stumble across it. Hope that helps, Matthew On 28.05.2013 18:37, Alexandre Sieira wrote: > I was working on some code today and encountered this scenario here where the performance behavior of data.table surprised me a little. Is this expected? 
From alexandre.sieira at gmail.com Tue May 28 20:25:57 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Tue, 28 May 2013 15:25:57 -0300
Subject: [datatable-help] Performance observation
In-Reply-To:
References:
Message-ID:

Thank you very much. The documentation on := and set is really clear on this, thanks for pointing that out.

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I

On 28 May 2013 at 15:11:04, Matthew Dowle (mdowle at mdowle.plus.com) wrote:

Hi,

Yes this is expected because `[.data.table` is a function call with associated overhead. You don't want to loop calls to it. Consider all the arguments to `[.data.table` and all the checks that must be done for existence and type of arguments on each call. The idea is to give [.data.table meaty calls which it can chew on. It doesn't like tiny tasks one at a time.

`[[` on the other hand is an R primitive. It's part of the language. You can do very limited things with `[[`, but in this case (looking up a single column by name or position) in a loop, it's best for the job. I use `[[` on data.table quite a lot.

This is also the very reason for set()'s existence: ?set says it's a 'loopable :=' because of the `[.data.table` overhead.

There's a feature request to detect when [.data.table is being looped, though:

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978

which would be more helpful of data.table, so at least it told you, rather than having to stumble across it.

Hope that helps,

Matthew

On 28.05.2013 18:37, Alexandre Sieira wrote:

I was working on some code today and encountered this scenario, where the performance behavior of data.table surprised me a little. Is this expected?

> dt = data.table(a=rnorm(1000000))
> system.time( for(i in 1:100000) j = dt[i, a] )
  usuário  sistema decorrido
   78.064    0.426    78.034

> system.time( for(i in 1:100000) j = dt[i, "a", with=F] )
  usuário  sistema decorrido
   27.814    0.154    27.810

> system.time( for(i in 1:100000) j = dt[["a"]][i] )
  usuário  sistema decorrido
    1.227    0.006     1.225

(sorry about the output in Portuguese)

Not knowing anything about how data.table is implemented internally, I would have assumed the three syntaxes for accessing the data.table would have similar performance, or at most a small difference.

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
From mdowle at mdowle.plus.com Tue May 28 20:26:29 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 28 May 2013 19:26:29 +0100
Subject: [datatable-help] Performance observation
In-Reply-To:
References:
Message-ID:

Here's a nice benchmark that's just been posted on S.O. showing the set() speedup when looped:

http://stackoverflow.com/a/16797392/403310

On 28.05.2013 19:11, Matthew Dowle wrote:
> Hi,
>
> Yes this is expected because `[.data.table` is a function call with associated overhead. You don't want to loop calls to it. Consider all the arguments to `[.data.table` and all the checks that must be done for existence and type of arguments on each call. The idea is to give [.data.table meaty calls which it can chew on. It doesn't like tiny tasks one at a time.
>
> `[[` on the other hand is an R primitive. It's part of the language. You can do very limited things with `[[`, but in this case (looking up a single column by name or position) in a loop, it's best for the job. I use `[[` on data.table quite a lot.
>
> This is also the very reason for set()'s existence: ?set says it's a 'loopable :=' because of the `[.data.table` overhead.
>
> There's a feature request to detect when [.data.table is being looped, though:
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978
>
> which would be more helpful of data.table, so at least it told you, rather than having to stumble across it.
>
> Hope that helps,
>
> Matthew
>
> On 28.05.2013 18:37, Alexandre Sieira wrote:
>
>> I was working on some code today and encountered this scenario, where the performance behavior of data.table surprised me a little. Is this expected?
>>
>>> dt = data.table(a=rnorm(1000000))
>>> system.time( for(i in 1:100000) j = dt[i, a] )
>>   usuário  sistema decorrido
>>    78.064    0.426    78.034
>>
>>> system.time( for(i in 1:100000) j = dt[i, "a", with=F] )
>>   usuário  sistema decorrido
>>    27.814    0.154    27.810
>>
>>> system.time( for(i in 1:100000) j = dt[["a"]][i] )
>>   usuário  sistema decorrido
>>     1.227    0.006     1.225
>>
>> (sorry about the output in Portuguese)
>>
>> Not knowing anything about how data.table is implemented internally, I would have assumed the three syntaxes for accessing the data.table would have similar performance, or at most a small difference.
>>
>> --
>> Alexandre Sieira
>> CISA, CISSP, ISO 27001 Lead Auditor
>>
>> "The truth is rarely pure and never simple."
>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I

From tpassafaro at hotmail.com Tue May 28 22:45:15 2013
From: tpassafaro at hotmail.com (tpassafaro)
Date: Tue, 28 May 2013 13:45:15 -0700 (PDT)
Subject: [datatable-help] how to remove outliers per levels
Message-ID: <1369773915321-4668153.post@n4.nabble.com>

Dear all,

I have data with two columns, age and weights. I would like to remove the weight outliers by age. I have tried the by function per age using boxplot, and I know what the outliers are, but I don't know how to remove them. Could you help me, please?

thanks

--
View this message in context: http://r.789695.n4.nabble.com/how-to-remove-outliers-per-levels-tp4668153.html
Sent from the datatable-help mailing list archive at Nabble.com.
From mdowle at mdowle.plus.com Wed May 29 00:20:11 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 28 May 2013 23:20:11 +0100
Subject: [datatable-help] how to remove outliers per levels
In-Reply-To: <1369773915321-4668153.post@n4.nabble.com>
References: <1369773915321-4668153.post@n4.nabble.com>
Message-ID:

Hi,

This is datatable-help, just for the package data.table: please read again the detailed message you received upon subscription. The question is too general, with no example and no evidence of what you tried yourself. Please search for how to ask good questions. Have you searched on Stack Overflow?

If you need further advice please ask me off list.

Regards, Matthew

> Dear all,
>
> I have data with two columns, age and weights. I would like to remove
> the weight outliers by age. I have tried the by function per age using
> boxplot, and I know what the outliers are, but I don't know how to
> remove them. Could you help me, please?
>
> thanks
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/how-to-remove-outliers-per-levels-tp4668153.html
> Sent from the datatable-help mailing list archive at Nabble.com.
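Since the question stays in the archive, a minimal sketch (not from the thread) of the kind of per-group filter being asked for; the toy data are invented, and the 1.5 * IQR fence, boxplot's default whisker rule, is an assumption about what "outlier" means here:

library(data.table)

set.seed(42)
dt <- data.table(age    = rep(1:3, each = 50),
                 weight = rnorm(150, mean = 50, sd = 5))

# keep, within each age, only the weights inside the boxplot whiskers
cleaned <- dt[, .SD[{
    qs    <- quantile(weight, c(0.25, 0.75), names = FALSE)
    fence <- 1.5 * (qs[2] - qs[1])
    weight >= qs[1] - fence & weight <= qs[2] + fence
}], by = age]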
From sds at gnu.org Thu May 30 21:48:52 2013
From: sds at gnu.org (Sam Steingold)
Date: Thu, 30 May 2013 15:48:52 -0400
Subject: [datatable-help] join results aren't always sorted?
Message-ID: <87ehcofduj.fsf@gnu.org>

Hi,
I have a table:
--8<---------------cut here---------------start------------->8---
> str(dates.dt)
Classes 'data.table' and 'data.frame':  1343 obs. of  4 variables:
 $ sid  : chr  "missing" "missing" "missing" "missing" ...
 $ s.c  : chr  "CLICK" "CLICK" "CLICK" "CLICK" ...
 $ count: int  70559 71555 79985 84385 88147 94130 100195 109031 116890 129726 ...
 $ time : POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr  "sid" "s.c" "time"
> dates.dt
          sid   s.c count                time
   1: missing CLICK 70559 2013-05-15 00:00:00
   2: missing CLICK 71555 2013-05-15 01:00:00
   3: missing CLICK 79985 2013-05-15 02:00:00
   4: missing CLICK 84385 2013-05-15 03:00:00
   5: missing CLICK 88147 2013-05-15 04:00:00
  ---
1339: present SHARE 35295 2013-05-28 19:00:00
1340: present SHARE 36284 2013-05-28 20:00:00
1341: present SHARE 69504 2013-05-28 21:00:00
1342: present SHARE 67037 2013-05-28 22:00:00
1343: present SHARE 61014 2013-05-28 23:00:00
--8<---------------cut here---------------end--------------->8---
I summarise them by various fields:
--8<---------------cut here---------------start------------->8---
> shares <- dates.dt[s.c=="SHARE", list(sum(count)), by="time"]
> clicks <- dates.dt[s.c=="CLICK", list(sum(count)), by="time"]
> str(shares)
Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
 $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
 $ V1  : int  60531 57837 67495 76716 83465 86822 91318 100520 112352 124784 ...
 - attr(*, ".internal.selfref")=<externalptr>
> str(clicks)
Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
 $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
 $ V1  : int  129450 137222 157721 171319 183720 195652 216003 238295 260715 279235 ...
 - attr(*, "sorted")= chr "time"
 - attr(*, ".internal.selfref")=<externalptr>
--8<---------------cut here---------------end--------------->8---
why is clicks but not shares sorted by time?
(if I make "time" the first key in dates.dt, the problem goes away, so,
I guess, this is expected).

What I actually want is a single data table keyed by time with columns
shares, clicks, missing, present, missing/clicks &c.
I can, obviously, construct it by hand:
--8<---------------cut here---------------start------------->8---
setkeyv(shares,"time")
stopifnot(identical(shares$time,clicks$time))
dt <- data.table(time=shares$time, clicks=clicks$V1, shares=shares$V1)
--8<---------------cut here---------------end--------------->8---
but I was wondering if there is a better way.
Thanks.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 13.04 (raring) X 11.0.11303000
http://www.childpsy.net/ http://pmw.org.il http://dhimmi.com
http://jihadwatch.org http://www.memritv.org http://honestreporting.com
Garbage In, Gospel Out
From mdowle at mdowle.plus.com Fri May 31 10:59:38 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 31 May 2013 09:59:38 +0100
Subject: [datatable-help] join results aren't always sorted?
In-Reply-To: <87ehcofduj.fsf@gnu.org>
References: <87ehcofduj.fsf@gnu.org>
Message-ID: <9e44d0277d181000d007b7a42ed0b14f@imap.plus.net>

Hi,

> why is clicks but not shares sorted by time?

The groups (each unique time in this case) are returned in the order of first appearance when you use 'by'. This is important and relied upon. Once the result is known, a quick check is made to see whether it happens to be ordered (using the very fast is.unsorted()) and if so the result is marked as keyed (which is the "sorted" attribute seen in the str() output).

> What I actually want is a single data table keyed by time with ...

How about 'keyby' rather than 'by':

dates.dt[s.c=="SHARE", list(sum(count)), keyby="time"]

Even if I know the data is already sorted in group order, I often use 'keyby' anyway for robustness.

Matthew

On 30.05.2013 20:48, Sam Steingold wrote:
> Hi,
> I have a table:
> --8<---------------cut here---------------start------------->8---
>> str(dates.dt)
> Classes 'data.table' and 'data.frame':  1343 obs. of  4 variables:
>  $ sid  : chr  "missing" "missing" "missing" "missing" ...
>  $ s.c  : chr  "CLICK" "CLICK" "CLICK" "CLICK" ...
>  $ count: int  70559 71555 79985 84385 88147 94130 100195 109031 116890 129726 ...
>  $ time : POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
>  - attr(*, ".internal.selfref")=<externalptr>
>  - attr(*, "sorted")= chr  "sid" "s.c" "time"
>> dates.dt
>           sid   s.c count                time
>    1: missing CLICK 70559 2013-05-15 00:00:00
>    2: missing CLICK 71555 2013-05-15 01:00:00
>    3: missing CLICK 79985 2013-05-15 02:00:00
>    4: missing CLICK 84385 2013-05-15 03:00:00
>    5: missing CLICK 88147 2013-05-15 04:00:00
>   ---
> 1339: present SHARE 35295 2013-05-28 19:00:00
> 1340: present SHARE 36284 2013-05-28 20:00:00
> 1341: present SHARE 69504 2013-05-28 21:00:00
> 1342: present SHARE 67037 2013-05-28 22:00:00
> 1343: present SHARE 61014 2013-05-28 23:00:00
> --8<---------------cut here---------------end--------------->8---
> I summarise them by various fields:
> --8<---------------cut here---------------start------------->8---
>> shares <- dates.dt[s.c=="SHARE", list(sum(count)), by="time"]
>> clicks <- dates.dt[s.c=="CLICK", list(sum(count)), by="time"]
>> str(shares)
> Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
>  $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
>  $ V1  : int  60531 57837 67495 76716 83465 86822 91318 100520 112352 124784 ...
>  - attr(*, ".internal.selfref")=<externalptr>
>> str(clicks)
> Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
>  $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
>  $ V1  : int  129450 137222 157721 171319 183720 195652 216003 238295 260715 279235 ...
>  - attr(*, "sorted")= chr "time"
>  - attr(*, ".internal.selfref")=<externalptr>
> --8<---------------cut here---------------end--------------->8---
> why is clicks but not shares sorted by time?
> (if I make "time" the first key in dates.dt, the problem goes away, so,
> I guess, this is expected).
>
> What I actually want is a single data table keyed by time with columns
> shares, clicks, missing, present, missing/clicks &c.
> I can, obviously, construct it by hand:
> --8<---------------cut here---------------start------------->8---
> setkeyv(shares,"time")
> stopifnot(identical(shares$time,clicks$time))
> dt <- data.table(time=shares$time, clicks=clicks$V1, shares=shares$V1)
> --8<---------------cut here---------------end--------------->8---
> but I was wondering if there is a better way.
> Thanks.
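A small sketch of the 'keyby' suggestion above; the column names come from Sam's table but the toy data here are invented:

library(data.table)

dates.dt <- data.table(
    sid   = "present",
    s.c   = rep(c("CLICK", "SHARE"), each = 4),
    count = c(10L, 20L, 30L, 40L, 1L, 2L, 3L, 4L),
    time  = rep(as.POSIXct("2013-05-15 00:00:00", tz = "UTC") + 3600 * (0:3), 2)
)

# keyby aggregates like 'by', then sets the key on the result,
# so both summaries come back sorted (and keyed) by time
shares <- dates.dt[s.c == "SHARE", list(shares = sum(count)), keyby = "time"]
clicks <- dates.dt[s.c == "CLICK", list(clicks = sum(count)), keyby = "time"]

# with both results keyed by time, a join then builds the single
# wide table asked for, without the setkeyv()/data.table() by hand
combined <- clicks[shares]   # columns: time, clicks, shares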