From statquant at outlook.com Wed May 1 01:10:23 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 30 Apr 2013 16:10:23 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com>
Message-ID: <1367363423208-4665873.post@n4.nabble.com>

Hi, I read the 30 posts and I have to confess that I still do not understand the point of the changes...
Could anyone kindly write an example of the current behaviour and what the new option will bring to the table?
Sorry...

--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4665873.html
Sent from the datatable-help mailing list archive at Nabble.com.

From saporta at scarletmail.rutgers.edu Wed May 1 01:18:39 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Tue, 30 Apr 2013 19:18:39 -0400
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <1367363423208-4665873.post@n4.nabble.com>
References: <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <1367363423208-4665873.post@n4.nabble.com>
Message-ID:

Eddi,

Perhaps you could summarize succinctly, now after a good bit of discussion, what your proposed change is.

-Rick

On Tue, Apr 30, 2013 at 7:10 PM, statquant3 wrote:
> Hi, I read the 30 posts and I have to confess that I still do not understand
> the point of the changes...
> Could anyone kindly write an example of the current behaviour and what the
> new option will bring to the table?
> Sorry...
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4665873.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From p.harding at paniscus.com Wed May 1 11:28:52 2013
From: p.harding at paniscus.com (Paul Harding)
Date: Wed, 1 May 2013 10:28:52 +0100
Subject: [datatable-help] fread on very large file
In-Reply-To: <6215268129090c5164b66264010bea9b@imap.plus.net>
References: <6215268129090c5164b66264010bea9b@imap.plus.net>
Message-ID:

Here is the verbose output:

> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 9186293
Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002200 (+middle 5 rows)
Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
  Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0

But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte):
$ wc spd_all_fixed.csv
168997637 168997638 9078155125 spd_all_fixed.csv

[So fread 9M, wc 168M rows].

Regards
Paul

On 30 April 2013 18:52, Matthew Dowle wrote:
> Hi,
>
> Thanks for reporting this. Please set verbose=TRUE and let us know the
> output.
>
> Thanks, Matthew
>
> On 30.04.2013 18:01, Paul Harding wrote:
>
> Problem with fread on a large file
> The file is 8GB, just short of 200,000,000 lines, produced as SQL output and
> modified by cygwin/perl to remove the second line.
> Using data.table 1.8.8 on R3.0.0 I get an fread error
> fread("data/spd_all_fixed.csv",sep=",")
> Error in fread("data/spd_all_fixed.csv", sep = ",") :
> Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
> 204038,2617097,20110803,0,0
> Looking for the offending line, with line numbers in output so I'm guessing
> this is line 6 of the mid-file chunk examined,
> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
> and comparing to surrounding lines and the first ten lines
> $ head spd_all_fixed.csv
> s_key,i_key,p_key,q,pq,d,l,epi,class
> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
> I can't see any difference. I wonder if this is a bug? I have no problems
> on a small test data set run through an identical process and using the
> same fread command.
> Regards
> Paul
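Independently of wc, the newline count can be cross-checked from R by streaming the file in binary chunks, which avoids loading it whole and any 32-bit limits in a reader; a minimal sketch (count_newlines is a hypothetical helper, not a data.table function, and the chunk size is arbitrary):

count_newlines <- function(path, chunk = 50e6) {
    con <- file(path, open = "rb")
    on.exit(close(con))
    n <- 0
    repeat {
        bytes <- readBin(con, what = "raw", n = chunk)  # read up to 'chunk' bytes
        if (length(bytes) == 0L) break                  # end of file reached
        n <- n + sum(bytes == as.raw(10L))              # 0x0A is '\n'; CRLF files have one per line
    }
    n
}
count_newlines("data/spd_all_fixed.csv")  # should agree with wc: 168997637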
From eduard.antonyan at gmail.com Wed May 1 17:43:21 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 1 May 2013 10:43:21 -0500
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <1367363423208-4665873.post@n4.nabble.com>
Message-ID:

Sure, here's a recap. The most succinct way of putting it is: the meaning of d[i, j, by = b] is very complicated and unintuitive right now because of hidden by's in some cases, and that statement can be made much more readable by making by-without-by's explicit. The longer version follows.

First let's go over what is done currently, in particular what exactly is by-without-by. The following example, adapted from Matthew's examples, illustrates current behavior:

> X = data.table(a = c(1,1,2,2,3,3), b = c(1:6), key = "a")
> Y = data.table(a = c(1,2,1), key = "a")
> X[Y]
   a b
1: 1 1
2: 1 2
3: 1 1
4: 1 2
5: 2 3
6: 2 4
> X[Y, sum(b)]
   a V1
1: 1  3
2: 1  3
3: 2  7

What's happening here is that the action j=sum(b) is performed for each row of Y (or rather each 'a') as if that was a 'by' by the rows of Y. Had Y had unique 'a' values only, this would've been equivalent to doing a 'by' by 'a' after the merge, but there is a difference when Y$a has duplicates. This is interesting behavior that can be used in a variety of situations (it also has an interesting leveraging point - if Y$a *is* unique and you'd like to do 'by=a' after the merge, it's more computationally advantageous to do the 'by' *during* the merge and not after), however it interferes with the naturally established action for d[i, j], where for other i's this would simply do action 'j', without doing an extra hidden 'by'.

The proposal is thus to do the above special 'by' only when explicitly asked to - e.g. by adding a new boolean 'each.i = TRUE', the default value for which would be FALSE. This will make the syntax much more readable and user-friendly, would eliminate a few FAQ points, and would also allow a new kind of action that afaik is actually not possible with the current syntax.

Here are some correspondences - left is new syntax and right is old syntax:

Take 'dt' and apply 'i' (where 'i' is anything, including a join):
dt[i] <-> dt[i]

Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others

Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
dt[i, j, each.i = TRUE] <-> dt[i, j]

Take 'dt' and apply 'i', return j over *both* the cross-apply/by-without-by (for 'i' being a join only) and another specified 'by', think of this as doing by=list(b, rows of Y):
dt[i, j, by = b, each.i = TRUE] <-> afaik there is no direct correspondence in current behavior

On Tuesday, April 30, 2013, Ricardo Saporta wrote:
> Eddi,
>
> Perhaps you could summarize succinctly, now after a good bit of
> discussion, what your proposed change is.
>
> -Rick
>
> On Tue, Apr 30, 2013 at 7:10 PM, statquant3 wrote:
>> Hi, I read the 30 posts and I have to confess that I still do not
>> understand
>> the point of the changes...
>> Could anyone kindly write an example of the current behaviour and what the
>> new option will bring to the table?
>> Sorry...
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4665873.html
>> Sent from the datatable-help mailing list archive at Nabble.com.
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
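To make statquant's request concrete: in current syntax (the proposed each.i argument does not exist yet, so only today's behaviour can actually be run), the two results being discussed differ like this, reusing the X and Y from the recap above:

library(data.table)
X <- data.table(a = c(1,1,2,2,3,3), b = 1:6, key = "a")
Y <- data.table(a = c(1,2,1), key = "a")

X[Y, sum(b)]     # current by-without-by: j runs once per (joined) row of Y
#    a V1
# 1: 1  3
# 2: 1  3
# 3: 2  7

X[Y][, sum(b)]   # join first, then evaluate j once over the whole result
# [1] 13          # (1+2) + (1+2) + (3+4)

Under the proposal, plain X[Y, sum(b)] would return the second result (13), and the first, grouped result would require asking for it explicitly with each.i = TRUE.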
From p.harding at paniscus.com Wed May 1 18:10:50 2013
From: p.harding at paniscus.com (Paul Harding)
Date: Wed, 1 May 2013 17:10:50 +0100
Subject: [datatable-help] fread on very large file
In-Reply-To:
References: <6215268129090c5164b66264010bea9b@imap.plus.net>
Message-ID:

Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends.

$ nl spd_all_fixed.csv | head -n 9186300 |tail
9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0

9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it!

I've experimented by truncating the file. The error varies: either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. The problem arises when the file reaches 4GB, in this case between 80,300,000 and 80,400,000 rows:

-rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
-rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv

> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 80300000
Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Type codes: 000002000 (+last 5 rows)
0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) Sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
171.188s ( 65%) Reading data
1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
-1365231.809s (-518439%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.000s Total

> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 18913
Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
  Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,

Regards,
Paul

On 1 May 2013 10:28, Paul Harding wrote:
> Here is the verbose output:
>
> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 9186293
> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
> data rows
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002200 (+middle 5 rows)
> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
> Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
> 204038,2617097,20110803,0,0
>
> But here is the wc output (via cygwin; newline, word (whitespace delim so
> each word one 'line' here), byte):
> $ wc spd_all_fixed.csv
> 168997637 168997638 9078155125 spd_all_fixed.csv
>
> [So fread 9M, wc 168M rows].
>
> Regards
> Paul
>
> On 30 April 2013 18:52, Matthew Dowle wrote:
>> Hi,
>>
>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>> output.
>>
>> Thanks, Matthew
>>
>> On 30.04.2013 18:01, Paul Harding wrote:
>>
>> Problem with fread on a large file
>> The file is 8GB, just short of 200,000,000 lines, produced as SQL output and
>> modified by cygwin/perl to remove the second line.
>> Using data.table 1.8.8 on R3.0.0 I get an fread error
>> fread("data/spd_all_fixed.csv",sep=",")
>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
>> 204038,2617097,20110803,0,0
>> Looking for the offending line, with line numbers in output so I'm
>> guessing this is line 6 of the mid-file chunk examined,
>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>> and comparing to surrounding lines and the first ten lines
>> $ head spd_all_fixed.csv
>> s_key,i_key,p_key,q,pq,d,l,epi,class
>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>> I can't see any difference. I wonder if this is a bug? I have no problems
>> on a small test data set run through an identical process and using the
>> same fread command.
>> Regards
>> Paul
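The counts above are consistent with the file size being truncated to 32 bits somewhere: 2^32 bytes is exactly 4GB, which is where the truncated-file experiments flip from working to failing, and the 18913 rows seen in the 4.1G file correspond to roughly the first 1MB past that 4GB mark. The arithmetic for the full file, as a sketch (the 32-bit overflow reading is an inference from these numbers, not a confirmed diagnosis):

sz <- file.info("data/spd_all_fixed.csv")$size  # 9078155125 bytes, matching wc
sz > 2^32                                       # TRUE: does not fit in an unsigned 32-bit integer
sz %% 2^32                                      # 488220533 bytes remain after wrapping around 2^32 twice
(sz %% 2^32) / (sz / 168997637)                 # ~9.1 million rows, close to the 9186293 fread reported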
From Ken.Williams at windlogics.com Wed May 1 22:59:08 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Wed, 1 May 2013 20:59:08 +0000
Subject: [datatable-help] Import problem with data.table in packages
Message-ID:

Hi,

I've got a small test package constructed like so:

------------ R/MyCode.R: ---------------
##' Example function.
##'
##' @export
##' @import data.table
myfunc <- function() {
  dt1 <- data.table(time=1:5, key='time')
  dt2 <- data.table(time=3:8, key='time')
  dat <- merge(dt1, dt2, all=TRUE)
}

------------ DESCRIPTION: ---------------
Package: TestMod
Type: Package
Title: My test package
Version: 1.0
Author: Ken Williams
Maintainer: Ken Williams
Description: A test package
License: BSD
Imports:
    data.table
Collate:
    'MyCode.R'
-----------------------------------------------

I process the package with ROxygen in RStudio, which produces an empty `inst/` directory, some docs in `man/`, and a `NAMESPACE` file:

------------ NAMESPACE: ---------------
export(myfunc)
import(data.table)
-----------------------------------------------

Now, if I start a fresh R session and load this package, I get a namespace error:

------------ R 2.15.2 session: ---------------
> library(TestMod)
> myfunc
function ()
{
    dt1 <- data.table(time = 1:5, key = "time")
    dt2 <- data.table(time = 3:8, key = "time")
    dat <- merge(dt1, dt2, all = TRUE)
}

> myfunc()
Error in rbind(deparse.level, ...) :
  could not find function ".rbind.data.table"
-----------------------------------------------

Sometimes, in other (more complicated) code, I instead get the error 'could not find function "data.table"'.

To my eyes, the imports look correct, so I can't see what the problem is:

------------ R 2.15.2 session: ---------------
> getNamespaceImports('TestMod')$data.table
         %between%              %chin%              %like%    .__C__data.table
       "%between%"            "%chin%"            "%like%"  ".__C__data.table"
       .__C__IDate         .__C__ITime        .__T__$:base      .__T__$<-:base
     ".__C__IDate"       ".__C__ITime"      ".__T__$:base"    ".__T__$<-:base"
      .__T__[:base   .rbind.data.table                  :=           alloc.col
    ".__T__[:base" ".rbind.data.table"                ":="         "alloc.col"
    as.chron.IDate      as.chron.ITime       as.data.table            as.IDate
  "as.chron.IDate"    "as.chron.ITime"     "as.data.table"          "as.IDate"
          as.ITime             between             chgroup             chmatch
        "as.ITime"           "between"           "chgroup"           "chmatch"
           chorder                  CJ                copy          data.table
         "chorder"                "CJ"              "copy"        "data.table"
             fread              haskey                hour           IDateTime
           "fread"            "haskey"              "hour"         "IDateTime"
     is.data.table                 key               key<-                last
   "is.data.table"               "key"             "key<-"              "last"
              like                mday               month             quarter
            "like"              "mday"             "month"           "quarter"
         rbindlist                 set             setattr         setcolorder
       "rbindlist"               "set"           "setattr"       "setcolorder"
            setkey             setkeyv            setnames                  SJ
          "setkey"           "setkeyv"          "setnames"                "SJ"
            tables     test.data.table           timetaken          truelength
          "tables"   "test.data.table"         "timetaken"        "truelength"
              wday                week                yday                year
            "wday"              "week"              "yday"              "year"
-----------------------------------------------

Any suggestions?

I think for now, I can work around the problem by doing 'Depends: data.table' in my `DESCRIPTION` file. I'd like to not do that though.

--
Ken Williams, Senior Research Scientist
WindLogics
http://windlogics.com

________________________________

CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.
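For reference, the 'Depends: data.table' workaround Ken mentions would look like the following; a sketch only, changing one field of the DESCRIPTION above. The reason it can help is that Depends attaches data.table to the search path when TestMod is loaded, whereas Imports only loads the data.table namespace without attaching it.

------------ DESCRIPTION (workaround): ---------------
Package: TestMod
Type: Package
Title: My test package
Version: 1.0
Author: Ken Williams
Maintainer: Ken Williams
Description: A test package
License: BSD
Depends:
    data.table
Collate:
    'MyCode.R'
-----------------------------------------------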
From mdowle at mdowle.plus.com Wed May 1 23:12:35 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Wed, 01 May 2013 22:12:35 +0100
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To:
References:
Message-ID: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>

Hi,

This rings a bell actually. data.table uses .onLoad currently but it should be using .onAttach, I seem to recall.

http://r.789695.n4.nabble.com/Error-in-a-package-that-imports-data-table-tp4660173p4660637.html

I had a hunt around but couldn't find if we decided data.table should move from .onLoad to .onAttach. Does anyone know/remember?

Thanks, Matthew

On 01.05.2013 21:59, Ken Williams wrote:
> Hi,
>
> I've got a small test package constructed like so:
>
> ------------ R/MyCode.R: ---------------
> ##' Example function.
> ##'
> ##' @export
> ##' @import data.table
> myfunc <- function() {
> dt1 <- data.table(time=1:5, key='time')
> dt2 <- data.table(time=3:8, key='time')
> dat <- merge(dt1, dt2, all=TRUE)
> }
>
> ------------ DESCRIPTION: ---------------
> Package: TestMod
> Type: Package
> Title: My test package
> Version: 1.0
> Author: Ken Williams
> Maintainer: Ken Williams
> Description: A test package
> License: BSD
> Imports:
> data.table
> Collate:
> 'MyCode.R'
> -----------------------------------------------
>
> I process the package with ROxygen in RStudio, which produces an
> empty `inst/` directory, some docs in `man/`, and a `NAMESPACE` file:
>
> ------------ NAMESPACE: ---------------
> export(myfunc)
> import(data.table)
> -----------------------------------------------
>
> Now, if I start a fresh R session and load this package, I get a
> namespace error:
>
> ------------ R 2.15.2 session: ---------------
>> library(TestMod)
>> myfunc
> function ()
> {
> dt1 <- data.table(time = 1:5, key = "time")
> dt2 <- data.table(time = 3:8, key = "time")
> dat <- merge(dt1, dt2, all = TRUE)
> }
>
>> myfunc()
> Error in rbind(deparse.level, ...) :
> could not find function ".rbind.data.table"
> -----------------------------------------------
>
> Sometimes, in other (more complicated) code, I instead get the error
> 'could not find function "data.table"'.
> > To my eyes, the imports look correct, so I can't see what the problem > is: > > ------------ R 2.15.2 session: --------------- >> getNamespaceImports('TestMod')$data.table > %between% %chin% %like% > .__C__data.table > "%between%" "%chin%" "%like%" > ".__C__data.table" > .__C__IDate .__C__ITime .__T__$:base > .__T__$<-:base > ".__C__IDate" ".__C__ITime" ".__T__$:base" > ".__T__$<-:base" > .__T__[:base .rbind.data.table := > alloc.col > ".__T__[:base" ".rbind.data.table" ":=" > "alloc.col" > as.chron.IDate as.chron.ITime as.data.table > as.IDate > "as.chron.IDate" "as.chron.ITime" "as.data.table" > "as.IDate" > as.ITime between chgroup > chmatch > "as.ITime" "between" "chgroup" > "chmatch" > chorder CJ copy > data.table > "chorder" "CJ" "copy" > "data.table" > fread haskey hour > IDateTime > "fread" "haskey" "hour" > "IDateTime" > is.data.table key key<- > last > "is.data.table" "key" "key<-" > "last" > like mday month > quarter > "like" "mday" "month" > "quarter" > rbindlist set setattr > setcolorder > "rbindlist" "set" "setattr" > "setcolorder" > setkey setkeyv setnames > SJ > "setkey" "setkeyv" "setnames" > "SJ" > tables test.data.table timetaken > truelength > "tables" "test.data.table" "timetaken" > "truelength" > wday week yday > year > "wday" "week" "yday" > "year" > ----------------------------------------------- > > Any suggestions? > > I think for now, I can work around the problem by doing 'Depends: > data.table' in my `DESCRIPTION` file. I'd like to not do that > though. > > -- > Ken Williams, Senior Research Scientist > WindLogics > http://windlogics.com > > > ________________________________ > > CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of > the intended recipient(s) and may contain confidential and privileged > information. Any unauthorized review, use, disclosure or distribution > of any kind is strictly prohibited. If you are not the intended > recipient, please contact the sender via reply e-mail and destroy all > copies of the original message. Thank you. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From aragorn168b at gmail.com Thu May 2 00:16:15 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 2 May 2013 00:16:15 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> Message-ID: Eduard, Great. That explains me the difference between `drop` and `.join` here. Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage. However, there's one point *I think* would still disagree with @eddi here, not sure. DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) DT2 <- data.table(x=c(1,2,1)) setkey(DT1, "x") # proposed way and the result: DT1[DT2, sum(y), .join = FALSE] [1] 21 So far nice. 
However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): x V1 1: 1 6 2: 2 9 3: 1 6 Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). Arun On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > Arun, > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > Hope this answers your questions. > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. 
> > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > Arun > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > Arun, > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > (The earlier message was too long and was rejected.) > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > setkey(DT1, "x") > > > > DT2 <- data.table(x=1) > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > setkey(DT1, "x") > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > # what's the output supposed to be for? > > > > DT1[DT2, y, .JOIN=FALSE] > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > Best, > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu May 2 00:20:32 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 2 May 2013 00:20:32 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> Message-ID: <48F69748BB834619B353A12C6D9962A7@gmail.com> Sorry the proposed result was a wrong paste in the last message: # proposed way and the result: DT1[DT2, sum(y), .join = FALSE] [1] 6 9 6 And the last part that it *should* be a data.table is quite obvious then. Arun On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > Eduard, > > Great. That explains me the difference between `drop` and `.join` here. > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage. 
> > However, there's one point *I think* would still disagree with @eddi here, not sure. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 21 > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > Arun > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > Arun, > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > Hope this answers your questions. > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. 
> > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > Arun > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > Arun, > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > setkey(DT1, "x") > > > > > DT2 <- data.table(x=1) > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > setkey(DT1, "x") > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > # what's the output supposed to be for? > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > Best, > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From aragorn168b at gmail.com Thu May 2 00:23:37 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:23:37 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: <48F69748BB834619B353A12C6D9962A7@gmail.com>
References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com>
Message-ID:

eddi,

sorry again, I am confused a bit now.

DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]`? c(6,9,6) or 21?

Arun

On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> Sorry the proposed result was a wrong paste in the last message:
>
> # proposed way and the result:
> DT1[DT2, sum(y), .join = FALSE]
> [1] 6 9 6
>
> And the last part that it *should* be a data.table is quite obvious then.
>
> Arun
>
> On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > Eduard,
> >
> > Great. That explains me the difference between `drop` and `.join` here.
> > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.
> >
> > However, there's one point *I think* would still disagree with @eddi here, not sure.
> >
> > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > DT2 <- data.table(x=c(1,2,1))
> > setkey(DT1, "x")
> >
> > # proposed way and the result:
> > DT1[DT2, sum(y), .join = FALSE]
> > [1] 21
> >
> > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join):
> >
> > x V1
> > 1: 1 6
> > 2: 2 9
> > 3: 1 6
> >
> > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted).
> >
> > Arun
> >
> > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote:
> > > Arun,
> > >
> > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently.
> > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining.
> > >
> > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case.
> > >
> > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a single column. It applies to any j.
The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > Hope this answers your questions. > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > Arun > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > Arun, > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. 
Suppose, > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > setkey(DT1, "x") > > > > > > DT2 <- data.table(x=1) > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > setkey(DT1, "x") > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > # what's the output supposed to be for? > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > Best, > > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu May 2 00:28:46 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 1 May 2013 17:28:46 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com> Message-ID: Arun, from my previous email: "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b': dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join): dt[i, j, each.i = TRUE] <-> dt[i, j]" Together with the default being each.i=FALSE, you can see that the answer to your question will be: DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e. [1] 21 and DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e. x V1 1: 1 6 2: 2 9 3: 1 6 On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote: > eddi, > > sorry again, I am confused a bit now. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, > .join = FALSE]` ? c(6,9,6) or 21? > > > Arun > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote: > > Sorry the proposed result was a wrong paste in the last message: > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 6 9 6 > > And the last part that it *should* be a data.table is quite obvious then. 
> > Arun > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > > Eduard, > > Great. That explains me the difference between `drop` and `.join` here. > Even though I don't *need* this feature (I can't recall the last time when > I use a `data.table` for `i` and had to reduce the function, say, sum). > But, I think it can only better the usage. > > However, there's one point *I think* would still disagree with @eddi here, > not sure. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 21 > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` > *should* result in a `data.table` output as follows (it's even more clearer > now that .join is set to TRUE, meaning it's a data.table join): > > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > Basically, `.join = TRUE` is the current functionality unchanged and nice > to be default (as Matthew hinted). > > Arun > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > Arun, > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does > currently. > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is > literally a 'by' by each of the rows of DT2 that are in the join (thus > each.i! - the operation 'y' will be performed for each of the rows of 'i' > and then combined and returned). There is no efficiency issue here that I > can see, but Matthew can correct me on this. As far as I understand the > efficiency comes into play when e.g. the rows of 'i' are unique, and after > the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = > key(DT1)] would be less efficient since the 'by' could've already been done > while joining. > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future > DT1[DT2] - in this expression there is no by-without-by happening in either > case. > > The purpose of this is NOT for j just being a column or an expression that > gets evaluated into a signal column. It applies to any j. The extra > 'by-without-by' column is currently output independently of how many > columns you output in your j-expression, the behavior is very similar as to > when you specify a by=., except that the 'by' happens by a very special > expression, that only exists when joining two data-tables and that > generally doesn't exist before or after the join. > > Hope this answers your questions. > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Eduard, thanks for your reply. But somethings are unclear to me still. > I'll try to explain them below. > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general > (that it is applicable to *every* i operation, which as of now seems > untrue). .JOIN is specific to data.table type for `i`. > > From what I understand from your reply, if (.JOIN = FALSE), then, > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > Is this right? It's a bit confusing because I think you're okay with > "by-without-by" and I got the impression from Sadao that he finds the > syntax of "by-without-by" unaccessible/advanced for basic users. So, just > to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the > "by-without-by" and then result in a "vector", right? > > Matthew explains in the current documentation that DT1[DT2][, y] would > "join" all columns of DT1 and DT2 and then subset. 
I assume the > implementation underneath is *not* DT1[DT2][, y] rather the result is an > efficient equivalence. Then, that of course seems alright to me. > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` > doesn't make sense/has no purpose to me. At least I can't think of any at > the moment. > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as > DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results > in getting evaluated as a scalar for every group in the current > by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. > Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` > instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of > DT1[i, list(x,y)]. > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's > the purpose of `drop` then (and also how it *doesn't* suit here as compared > to .JOIN). > > Arun > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > Arun, > > If the new boolean is false, the result would be the same as without it > and would be equal to current behavior of d[i][, j]. If it's true, it will > only have an effect if i is a join (I think each.i= fits slightly better > for this description than .join=) - this will replicate current underlying > behavior. If you think the cross-apply is something that could work not > just for i being a data-table but other things as well, then it would make > perfect sense to implement that action too when the bool is true. > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan > wrote: > > (The earlier message was too long and was rejected.) > So, from the discussion so far, I see that Matthew is nice enough to > implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > setkey(DT1, "x") > DT2 <- data.table(x=1) > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I > expect here the same output as current DT1[DT2, y] > > The above syntax seems "okay". But my first question is what is > `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > setkey(DT1, "x") > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > # what's the output supposed to be for? > DT1[DT2, y, .JOIN=FALSE] > DT1[DT2, .JOIN = FALSE] > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how > does it work with `subset`? > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > Is this supposed to also do a "cross-apply" on the logical subset? I > guess not. So, .JOIN is an "extra" parameter that comes into play *only* > when `i` is a `data.table`? > > I'd love to have some replies to these questions for me to take a stance > on `.JOIN`. Thank you. > > Best, > Arun. > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From aragorn168b at gmail.com Thu May 2 00:36:32 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:36:32 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To:
References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com>
Message-ID: <2BC388F2195044B09475B49AC4AB3809@gmail.com>

In retrospect, `.join` is also confusing/untrue (as the data.table join is still being done). I find `cross.apply` clearer.

Arun

On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:
> Eduard,
>
> Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.
>
> Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.
>
> Thanks for the patient clarifications.
>
> Arun
>
> On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
> > Arun, from my previous email:
> >
> > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
> > dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others
> >
> > Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
> > dt[i, j, each.i = TRUE] <-> dt[i, j]"
> >
> > Together with the default being each.i=FALSE, you can see that the answer to your question will be:
> >
> > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
> > [1] 21
> >
> > and
> > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
> > x V1
> > 1: 1 6
> > 2: 2 9
> > 3: 1 6
> >
> > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:
> > > eddi,
> > >
> > > sorry again, I am confused a bit now.
> > >
> > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > > DT2 <- data.table(x=c(1,2,1))
> > > setkey(DT1, "x")
> > >
> > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]`? c(6,9,6) or 21?
> > >
> > > Arun
> > >
> > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> > > > Sorry the proposed result was a wrong paste in the last message:
> > > >
> > > > # proposed way and the result:
> > > > DT1[DT2, sum(y), .join = FALSE]
> > > > [1] 6 9 6
> > > >
> > > > And the last part that it *should* be a data.table is quite obvious then.
> > > >
> > > > Arun
> > > >
> > > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > > > > Eduard,
> > > > >
> > > > > Great. That explains me the difference between `drop` and `.join` here.
> > > > > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.
> > > > >
> > > > > However, there's one point *I think* would still disagree with @eddi here, not sure.
> > > > > > > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > > > > > DT2 <- data.table(x=c(1,2,1)) > > > > > setkey(DT1, "x") > > > > > > > > > > # proposed way and the result: > > > > > DT1[DT2, sum(y), .join = FALSE] > > > > > [1] 21 > > > > > > > > > > > > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > > > > > > > > > x V1 > > > > > 1: 1 6 > > > > > 2: 2 9 > > > > > 3: 1 6 > > > > > > > > > > > > > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > > > > > > > > > Arun > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > > > > > > > > > Arun, > > > > > > > > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > > > > > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > > > > > > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > > > > > > > Hope this answers your questions. > > > > > > > > > > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. 
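To make the two readings above concrete, here is a minimal sketch (assuming data.table 1.8.x semantics as discussed in this thread; exact printing may differ by version):

library(data.table)
DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
setkey(DT1, "x")
DT2 <- data.table(x=1)

DT1[DT2, y]    # by-without-by: j is evaluated once per row of DT2, join column kept
DT1[DT2][, y]  # join first, then extract: a plain vector, c(1L, 2L)

The first form returns a data.table (the join column plus the result of j per group); the second is the two-step chain the thread keeps comparing it against.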
> > > > > > > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > setkey(DT1, "x") > > > > > > > > > DT2 <- data.table(x=1) > > > > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > setkey(DT1, "x") > > > > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > > > > # what's the output supposed to be for? > > > > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Arun. 
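The questions just above can be probed today without any new argument, since the special per-row behaviour only triggers when i is a data.table. A minimal sketch (J() is data.table's existing shorthand for building an i table; nothing here is proposed syntax):

library(data.table)
DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
setkey(DT1, "x")

DT1[x %in% c(1,2), y]  # logical subset in i: an ordinary vector, no grouping
DT1[J(c(1,2)), y]      # i is a data.table: by-without-by runs j per row of i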
> > > > > 
> > > > > 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com  Thu May  2 00:33:35 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:33:35 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: 
References: <1366401278742-4664770.post@n4.nabble.com>
	<1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>
	<1366643879137-4664990.post@n4.nabble.com>
	<-8694790273355420813@unknownmsgid>
	<5AD5B1D231A045329D46159FB5297739@gmail.com>
	<48F69748BB834619B353A12C6D9962A7@gmail.com>
Message-ID: 

Eduard,

Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.

Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.

Thanks for the patient clarifications.

Arun

On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
> Arun, from my previous email:
> 
> "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
> dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others
> 
> Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
> dt[i, j, each.i = TRUE] <-> dt[i, j]"
> 
> Together with the default being each.i=FALSE, you can see that the answer to your question will be:
> 
> DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
> [1] 21
> 
> and
> DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
> x V1
> 1: 1 6
> 2: 2 9
> 3: 1 6
> 
> On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:
> > eddi,
> > 
> > sorry again, I am confused a bit now.
> > 
> > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > DT2 <- data.table(x=c(1,2,1))
> > setkey(DT1, "x")
> > 
> > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]` ? c(6,9,6) or 21?
> > 
> > Arun
> > 
> > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> > 
> > > Sorry the proposed result was a wrong paste in the last message:
> > > 
> > > # proposed way and the result:
> > > DT1[DT2, sum(y), .join = FALSE]
> > > [1] 6 9 6
> > > 
> > > And the last part that it *should* be a data.table is quite obvious then.
> > > 
> > > Arun
> > > 
> > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > > 
> > > > Eduard,
> > > > 
> > > > Great. That explains me the difference between `drop` and `.join` here.
> > > > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.
> > > > 
> > > > However, there's one point *I think* would still disagree with @eddi here, not sure.
> > > > > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > > > > DT2 <- data.table(x=c(1,2,1)) > > > > setkey(DT1, "x") > > > > > > > > # proposed way and the result: > > > > DT1[DT2, sum(y), .join = FALSE] > > > > [1] 21 > > > > > > > > > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > > > > > > > x V1 > > > > 1: 1 6 > > > > 2: 2 9 > > > > 3: 1 6 > > > > > > > > > > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > > > > > > > Arun > > > > > > > > > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > > > > > > > Arun, > > > > > > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > > > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > > > > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > > > > > Hope this answers your questions. > > > > > > > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. 
> > > > > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > setkey(DT1, "x") > > > > > > > > DT2 <- data.table(x=1) > > > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > setkey(DT1, "x") > > > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > > > # what's the output supposed to be for? > > > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > > > > > Best, > > > > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eduard.antonyan at gmail.com Thu May 2 00:47:39 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 1 May 2013 17:47:39 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <2BC388F2195044B09475B49AC4AB3809@gmail.com> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com> <2BC388F2195044B09475B49AC4AB3809@gmail.com> Message-ID: yeah, I think cross.apply is pretty clear as well, at least when an extra 'by' is not there, but I like each.i when there is a 'by'. Either way this is a pretty small consideration for me and I'd be perfectly happy with either. On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan wrote: > In retrospect, `.join` is also confusing/untrue (as the data.table join > is still being done). I find `cross.apply` clearer. > > Arun > > On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote: > > Eduard, > > Yes, that clears it up. If `.join` if FALSE, then there's no > `by-without-by`, basically. `drop` really serves another purpose. > > Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of > the intended purposes of this post to begin with) to mean to apply to *any* > `i` operation. Unless this is true, I'd like to stick to `.join` as it's > what we are setting to FALSE/TRUE here. > > Thanks for the patient clarifications. > > Arun > > On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote: > > Arun, from my previous email: > > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b': > dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by > = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a > join in some cases but not others > > Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by > (will do cross-apply only when 'i' is a join): > dt[i, j, each.i = TRUE] <-> dt[i, j]" > > Together with the default being each.i=FALSE, you can see that the answer > to your question will be: > > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, > allow.cartesian=TRUE][, sum(y)], i.e. > [1] 21 > > and > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, > sum(y), allow.cartesian=TRUE], i.e. > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > > > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > eddi, > > sorry again, I am confused a bit now. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, > .join = FALSE]` ? c(6,9,6) or 21? > > > Arun > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote: > > Sorry the proposed result was a wrong paste in the last message: > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 6 9 6 > > And the last part that it *should* be a data.table is quite obvious then. > > Arun > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > > Eduard, > > Great. That explains me the difference between `drop` and `.join` here. > Even though I don't *need* this feature (I can't recall the last time when > I use a `data.table` for `i` and had to reduce the function, say, sum). > But, I think it can only better the usage. 
> > However, there's one point *I think* would still disagree with @eddi here, > not sure. > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > DT2 <- data.table(x=c(1,2,1)) > setkey(DT1, "x") > > # proposed way and the result: > DT1[DT2, sum(y), .join = FALSE] > [1] 21 > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` > *should* result in a `data.table` output as follows (it's even more clearer > now that .join is set to TRUE, meaning it's a data.table join): > > x V1 > 1: 1 6 > 2: 2 9 > 3: 1 6 > > Basically, `.join = TRUE` is the current functionality unchanged and nice > to be default (as Matthew hinted). > > Arun > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > Arun, > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does > currently. > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is > literally a 'by' by each of the rows of DT2 that are in the join (thus > each.i! - the operation 'y' will be performed for each of the rows of 'i' > and then combined and returned). There is no efficiency issue here that I > can see, but Matthew can correct me on this. As far as I understand the > efficiency comes into play when e.g. the rows of 'i' are unique, and after > the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = > key(DT1)] would be less efficient since the 'by' could've already been done > while joining. > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future > DT1[DT2] - in this expression there is no by-without-by happening in either > case. > > The purpose of this is NOT for j just being a column or an expression that > gets evaluated into a signal column. It applies to any j. The extra > 'by-without-by' column is currently output independently of how many > columns you output in your j-expression, the behavior is very similar as to > when you specify a by=., except that the 'by' happens by a very special > expression, that only exists when joining two data-tables and that > generally doesn't exist before or after the join. > > Hope this answers your questions. > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Eduard, thanks for your reply. But somethings are unclear to me still. > I'll try to explain them below. > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general > (that it is applicable to *every* i operation, which as of now seems > untrue). .JOIN is specific to data.table type for `i`. > > From what I understand from your reply, if (.JOIN = FALSE), then, > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > Is this right? It's a bit confusing because I think you're okay with > "by-without-by" and I got the impression from Sadao that he finds the > syntax of "by-without-by" unaccessible/advanced for basic users. So, just > to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the > "by-without-by" and then result in a "vector", right? > > Matthew explains in the current documentation that DT1[DT2][, y] would > "join" all columns of DT1 and DT2 and then subset. I assume the > implementation underneath is *not* DT1[DT2][, y] rather the result is an > efficient equivalence. Then, that of course seems alright to me. > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` > doesn't make sense/has no purpose to me. At least I can't think of any at > the moment. 
> 
> To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)].
> 
> If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN).
> 
> Arun
> 
> On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote:
> 
> Arun,
> 
> If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true.
> 
> On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote:
> 
> (The earlier message was too long and was rejected.)
> So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose,
> 
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=1)
> DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y]
> 
> The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose,
> 
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=c(1,2,1), w=c(11:13))
> # what's the output supposed to be for?
> DT1[DT2, y, .JOIN=FALSE]
> DT1[DT2, .JOIN = FALSE]
> 
> Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`?
> 
> DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored?
> 
> Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`?
> 
> I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you.
> 
> Best,
> Arun.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com  Thu May  2 00:47:37 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 00:47:37 +0200
Subject: [datatable-help] sorting on floating point column
In-Reply-To: 
References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net>
	<8DC39800AD714C4AA03FDB84ED57BADD@gmail.com>
	<2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net>
Message-ID: <3C10DA3381F64716A180976F14686FF5@gmail.com>

Matthew,

So what's the resolution here? Is it okay to sort in the "proper" order on the key column but use *machine tolerance* for subsetting on the key column?

Arun

On Tuesday, April 30, 2013 at 4:26 PM, Arunkumar Srinivasan wrote:
> Matthew,
> 
> Precisely. That's what I was thinking as well. But was hesitant to tell as I didn't know how complex it would be to implement / change it.
> Since the join requires tolerance, sorting could still be done in the "right" order (by disregarding tolerance during the sort).
> 
> Arun
> 
> On Tuesday, April 30, 2013 at 4:22 PM, Matthew Dowle wrote:
> > 
> > Maybe it doesn't actually need to sort within machine tolerance. If it was precise, the sort would be faster, that's for sure. But at the time, I remember thinking that it should preserve the order of rows within a group of values within machine tolerance (e.g. 3.99999999, 4.00000001, 3.99999999 should be considered 4.0 and the order of those 3 rows maintained). But maybe sorting them to 3.99999999, 3.99999999, 4.00000001 is ok as it's just the join that should be within machine tolerance?
> > Interested in how fast order(y) is, though. Compared to data.table sorting of doubles.
> > Matthew
> > 
> > On 30.04.2013 15:16, Arunkumar Srinivasan wrote:
> > > Matthew,
> > > I see. I didn't think about tolerance. Although
> > > dt[with(dt, order(y)), ]
> > > seems to do the task right (similar to data.frame). I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer.
> > > Arun
> > > 
> > > On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote:
> > > > 
> > > > Hi,
> > > > data.table sorts doubles within machine tolerance :
> > > > > sqrt(.Machine$double.eps)
> > > > [1] 1.490116e-08
> > > > 
> > > > i.e. numbers closer than this are considered equal.
> > > > Otherwise we wouldn't be able to do things like DT[.(3.14)].
> > > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one?
> > > > In the examples section of one of the help pages it has an example which generates a series of numbers very close together using pi. Note that your numbers are both close together, and, very close to 0.
> > > > Matthew
> > > > 
> > > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote:
> > > > > Hi there,
> > > > > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for the words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though.
> > > > > So, here's a reproducible example. I'd be glad to file a bug, if it is one, and be corrected if it's something I am doing wrong.
> > > > > set.seed(45)
> > > > > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7)
> > > > > head(dt)
> > > > > x y
> > > > > 1: 32 5.395395e-08
> > > > > 2: 16 6.956957e-08
> > > > > 3: 12 2.142142e-08
> > > > > 4: 18 5.855856e-08
> > > > > 5: 17 6.216216e-08
> > > > > 6: 14 5.025025e-08
> > > > > setkey(dt, "y") # sort by column y
> > > > > head(dt, 10)
> > > > > x y
> > > > > 1: 47 1.401401e-09
> > > > > 2: 12 2.142142e-08
> > > > > 3: 24 1.391391e-08
> > > > > 4: 43 9.809810e-09 <~~~ obviously false
> > > > > 5: 1 2.932933e-08
> > > > > 6: 48 2.562563e-08
> > > > > 7: 49 1.891892e-08
> > > > > 8: 40 2.182182e-08
> > > > > 9: 9 7.307307e-09 <~~~ obviously false
> > > > > 10: 45 2.482482e-08
> > > > > 
> > > > > Best,
> > > > > Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
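To see the tolerance discussed in the sorting thread above, a minimal sketch built from the thread's own example (data.table 1.8.x; which rows look out of order depends on the random draw):

library(data.table)
set.seed(45)
dt <- data.table(x=sample(50),
                 y=sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7)

sqrt(.Machine$double.eps)          # ~1.49e-08: doubles closer than this sort as equal
setkey(dt, "y")                    # keyed sort: ties within tolerance keep row order
head(dt, 10)                       # can look unsorted at the 1e-9 scale
head(dt[with(dt, order(y)), ], 10) # base R order(): exact comparison, strictly sorted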
From aragorn168b at gmail.com  Thu May  2 01:18:44 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 2 May 2013 01:18:44 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: 
References: <1366401278742-4664770.post@n4.nabble.com>
	<1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>
	<1366643879137-4664990.post@n4.nabble.com>
	<-8694790273355420813@unknownmsgid>
	<5AD5B1D231A045329D46159FB5297739@gmail.com>
	<48F69748BB834619B353A12C6D9962A7@gmail.com>
	<2BC388F2195044B09475B49AC4AB3809@gmail.com>
Message-ID: 

Eduard,

What do you mean here: `at least when by is not there`. The "cross.apply" or ".join" or "each.i" was supposedly an option when the "i" argument is a `data.table`, right? I can't find a reason why there would be a `by` there (I mean an explicit by). Do you mean the implicit by when it's true? If not, could you elaborate (maybe with an example)?

Arun

On Thursday, May 2, 2013 at 12:47 AM, Eduard Antonyan wrote:
> yeah, I think cross.apply is pretty clear as well, at least when an extra 'by' is not there, but I like each.i when there is a 'by'. Either way this is a pretty small consideration for me and I'd be perfectly happy with either.
> 
> On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan wrote:
> > In retrospect, `.join` is also confusing/untrue (as the data.table join is still being done). I find `cross.apply` clearer.
> > 
> > Arun
> > 
> > On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:
> > 
> > > Eduard,
> > > 
> > > Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.
> > > 
> > > Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.
> > > 
> > > Thanks for the patient clarifications.
> > > 
> > > Arun
> > > 
> > > On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
> > > > Arun, from my previous email:
> > > > 
> > > > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
> > > > dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others
> > > > 
> > > > Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
> > > > dt[i, j, each.i = TRUE] <-> dt[i, j]"
> > > > 
> > > > Together with the default being each.i=FALSE, you can see that the answer to your question will be:
> > > > 
> > > > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
> > > > [1] 21
> > > > 
> > > > and
> > > > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
> > > > x V1
> > > > 1: 1 6
> > > > 2: 2 9
> > > > 3: 1 6
> > > > 
> > > > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:
> > > > > eddi,
> > > > > 
> > > > > sorry again, I am confused a bit now.
> > > > > 
> > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > > > > DT2 <- data.table(x=c(1,2,1))
> > > > > setkey(DT1, "x")
> > > > > 
> > > > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]` ? c(6,9,6) or 21?
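Both candidate answers to the question above are expressible with today's syntax, which may be the clearest way to see what `.join = FALSE` would change. A quick sketch (`.join` itself is only proposed in this thread and does not exist; allow.cartesian is carried over from the thread's examples):

library(data.table)
DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

DT1[DT2, sum(y)]                          # by-without-by: one sum per row of DT2 -> 6, 9, 6
DT1[DT2, allow.cartesian=TRUE][, sum(y)]  # join all matching rows first, then one total -> 21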
> > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > Sorry the proposed result was a wrong paste in the last message: > > > > > > > > > > > > # proposed way and the result: > > > > > > DT1[DT2, sum(y), .join = FALSE] > > > > > > [1] 6 9 6 > > > > > > > > > > > > > > > > > > And the last part that it *should* be a data.table is quite obvious then. > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > Eduard, > > > > > > > > > > > > > > Great. That explains me the difference between `drop` and `.join` here. > > > > > > > Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage. > > > > > > > > > > > > > > However, there's one point *I think* would still disagree with @eddi here, not sure. > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5) > > > > > > > DT2 <- data.table(x=c(1,2,1)) > > > > > > > setkey(DT1, "x") > > > > > > > > > > > > > > # proposed way and the result: > > > > > > > DT1[DT2, sum(y), .join = FALSE] > > > > > > > [1] 21 > > > > > > > > > > > > > > > > > > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join): > > > > > > > > > > > > > > x V1 > > > > > > > 1: 1 6 > > > > > > > 2: 2 9 > > > > > > > 3: 1 6 > > > > > > > > > > > > > > > > > > > > > Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted). > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. > > > > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. > > > > > > > > > > > > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. > > > > > > > > > > > > > > > > The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. > > > > > > > > > > > > > > > > Hope this answers your questions. 
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > > > > > > > > > Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. > > > > > > > > > > > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. > > > > > > > > > > > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then, > > > > > > > > > > > > > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > > > > > > > > > > > > > > > > > Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? > > > > > > > > > > > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. > > > > > > > > > > > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. > > > > > > > > > > > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. > > > > > > > > > > > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). > > > > > > > > > > > > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > > > > > Arun, > > > > > > > > > > > > > > > > > > > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. > > > > > > > > > > > > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > > > > > > > > > > > > > > > > > > > (The earlier message was too long and was rejected.) > > > > > > > > > > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. 
Suppose, > > > > > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > > > setkey(DT1, "x") > > > > > > > > > > > DT2 <- data.table(x=1) > > > > > > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > > > > > > > > > > > > > > > > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > > > > > > > > > > > > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > > > > > > > > > > setkey(DT1, "x") > > > > > > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > > > > > > > > > > # what's the output supposed to be for? > > > > > > > > > > > DT1[DT2, y, .JOIN=FALSE] > > > > > > > > > > > DT1[DT2, .JOIN = FALSE] > > > > > > > > > > > > > > > > > > > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > > > > > > > > > > > > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > > > > > > > > > > > > > > > > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > > > > > > > > > > > > > > > > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > Arun. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu May 2 01:27:58 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 1 May 2013 18:27:58 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> <48F69748BB834619B353A12C6D9962A7@gmail.com> <2BC388F2195044B09475B49AC4AB3809@gmail.com> Message-ID: <733367056511996651@unknownmsgid> I mean I find it a little easier to read when joining with each.i=TRUE *and* there is by=b - this is an extra operation that I don't believe has an analog in current syntax (but I haven't thought about this too much). On May 1, 2013, at 6:18 PM, Arunkumar Srinivasan wrote: Eduard, What do you mean here: `at least when by is not there`. The "cross.apply" or ".join" or "each.i" was supposedly an option when "i" argument is a `data.table`, right? I can find a reason why there would be a `by` there? (I mean an explicit by). Do you mean the implicit by when it's true? if not, could you elaborate (maybe with an example)? Arun On Thursday, May 2, 2013 at 12:47 AM, Eduard Antonyan wrote: yeah, I think cross.apply is pretty clear as well, at least when an extra 'by' is not there, but I like each.i when there is a 'by'. Either way this is a pretty small consideration for me and I'd be perfectly happy with either. 
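For the each.i-plus-by combination mentioned in this message, a sketch. Note that each.i is only proposed syntax in this thread, so the line using it is left as a comment, with the nearest current spelling of the each.i=FALSE case below it (the g column is made up here for illustration):

library(data.table)
DT1 <- data.table(x=c(1,1,1,2,2), y=1:5, g=c(1,2,1,2,1))
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

# proposed, hypothetical: DT1[DT2, sum(y), each.i=TRUE, by=g]  (by g within each row of i)
DT1[DT2, allow.cartesian=TRUE][, sum(y), by=g]  # current: join, then group the result by g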
On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan wrote:

In retrospect, `.join` is also confusing/untrue (as the data.table join is still being done). I find `cross.apply` clearer.

Arun

On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:

Eduard,

Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`, basically. `drop` really serves another purpose.

Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of the intended purposes of this post to begin with) to mean to apply to *any* `i` operation. Unless this is true, I'd like to stick to `.join` as it's what we are setting to FALSE/TRUE here.

Thanks for the patient clarifications.

Arun

On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:

Arun, from my previous email:

"Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a join in some cases but not others

Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by (will do cross-apply only when 'i' is a join):
dt[i, j, each.i = TRUE] <-> dt[i, j]"

Together with the default being each.i=FALSE, you can see that the answer to your question will be:

DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, allow.cartesian=TRUE][, sum(y)], i.e.
[1] 21

and
DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, sum(y), allow.cartesian=TRUE], i.e.
x V1
1: 1 6
2: 2 9
3: 1 6

On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan wrote:

eddi,

sorry again, I am confused a bit now.

DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE, .join = FALSE]` ? c(6,9,6) or 21?

Arun

On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:

Sorry the proposed result was a wrong paste in the last message:

# proposed way and the result:
DT1[DT2, sum(y), .join = FALSE]
[1] 6 9 6

And the last part that it *should* be a data.table is quite obvious then.

Arun

On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:

Eduard,

Great. That explains me the difference between `drop` and `.join` here. Even though I don't *need* this feature (I can't recall the last time when I use a `data.table` for `i` and had to reduce the function, say, sum). But, I think it can only better the usage.

However, there's one point *I think* would still disagree with @eddi here, not sure.

DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

# proposed way and the result:
DT1[DT2, sum(y), .join = FALSE]
[1] 21

So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]` *should* result in a `data.table` output as follows (it's even more clearer now that .join is set to TRUE, meaning it's a data.table join):

x V1
1: 1 6
2: 2 9
3: 1 6

Basically, `.join = TRUE` is the current functionality unchanged and nice to be default (as Matthew hinted).

Arun

On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote:

Arun,

Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently.
No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this.
As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining.

DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case.

The purpose of this is NOT for j just being a column or an expression that gets evaluated into a single column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join.

Hope this answers your questions.

On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote:

Eduard, thanks for your reply. But some things are unclear to me still. I'll try to explain them below.

First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`.

From what I understand from your reply, if (.JOIN = FALSE), then,

DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y]

Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right?

Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me.

If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment.

To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)].

If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN).

Arun

On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote:

Arun,

If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true.

On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote:

(The earlier message was too long and was rejected.)
So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=1) DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=c(1,2,1), w=c(11:13)) # what's the output supposed to be for? DT1[DT2, y, .JOIN=FALSE] DT1[DT2, .JOIN = FALSE] Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. Best, Arun. -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.harding at paniscus.com Thu May 2 11:19:22 2013 From: p.harding at paniscus.com (Paul Harding) Date: Thu, 2 May 2013 10:19:22 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. $ nl spd_all_fixed.csv | head -n 9186300 |tail 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) Detected eol as \r\n (CRLF) in that order, the Windows standard. Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found Found 9 columns First row with 9 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. 
Count of eol after first data row: 80300000
Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Type codes: 000002000 (+last 5 rows)
0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) Sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
171.188s ( 65%) Reading data
1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
-1365231.809s (-518439%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.000s Total
> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 18913
Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,

Regards,
Paul

On 1 May 2013 10:28, Paul Harding wrote:

> Here is the verbose output:
> 
> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 9186293
> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002200 (+middle 5 rows)
> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
> 
> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte):
> $ wc spd_all_fixed.csv
> 168997637 168997638 9078155125 spd_all_fixed.csv
> 
> [So fread 9M, wc 168M rows].
> 
> Regards
> Paul
> 
> On 30 April 2013 18:52, Matthew Dowle wrote:
> 
>> Hi,
>> 
>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>> 
>> Thanks, Matthew
>> 
>> On 30.04.2013 18:01, Paul Harding wrote:
>> 
>> Problem with fread on a large file
>> The file is 8GB, just short of 200,000,000 lines, produced as SQL output and
>> modified by cygwin/perl to remove the second line.
>> Using data.table 1.8.8 on R3.0.0 I get an fread error >> fread("data/spd_all_fixed.csv",sep=",") >> Error in fread("data/spd_all_fixed.csv", sep = ",") : >> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: >> 204038,2617097,20110803,0,0 >> Looking for the offending line,with line numbers in output so I'm >> guessing this is line 6 of the mid-file chunk examined, >> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >> and comparing to surrounding lines and the first ten lines >> $ head spd_all_fixed.csv >> s_key,i_key,p_key,q,pq,d,l,epi,class >> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >> I can't see any difference. I wonder if this is a bug? I have no problems >> on a small test data set run through an identical process and using the >> same fread command. >> Regards >> Paul >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri May 3 11:51:47 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 03 May 2013 10:51:47 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: Hi Paul, Thanks for all this! > The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: Ahah. Are you using a 32bit or 64bit Windows machine? Thanks, Matthew On 02.05.2013 10:19, Paul Harding wrote: > Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. > > $ nl spd_all_fixed.csv | head -n 9186300 |tail > 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 > 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 > 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 > 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 > 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 > 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 > 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 > 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 > 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 > 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 > 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! > I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. 
> The problem arises when the file reaches 4GB, in this case between
> 8,030,000 and 8,040,000 rows:
>
> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02
> spd_all_trunc_8030k.csv
> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06
> spd_all_trunc_8040k.csv
>
>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 80300000
> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999
> data rows
>
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002000 (+middle 5 rows)
> Type codes: 000002000 (+last 5 rows)
> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
> '0.42634430000000001'
> Bumping column 7 from INT64 to REAL on data row 9, field contains
> '0.42634430000000001'
> 0.000s ( 0%) Memory map (rerun may be quicker)
> 0.000s ( 0%) Sep and header detection
> 0.000s ( 0%) Count rows (wc -l)
> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
> 171.188s ( 65%) Reading data
> 1365231.809s (518439%) Allocation for type bumps (if any), including gc
> time if triggered
> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
> 0.000s ( 0%) Changing na.strings to NA
> 0.000s Total
>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 18913
> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data
> rows
>
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002000 (+middle 5 rows)
> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
> Expected sep (',') but ',' ends field 2 on line 6 when detecting types:
> 204650,724540,
> Regards,
> Paul
>
> On 1 May 2013 10:28, Paul Harding wrote:
>
>> Here is the verbose output:
>>
>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>> first 30) ... found
>> Found 9 columns
>> First row with 9 fields occurs on line 1 (either column names or first
>> row of data)
>> All the fields on line 1 are character fields. Treating as the column
>> names.
>> Count of eol after first data row: 9186293
>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
>> data rows
>> Type codes: 000002000 (first 5 rows)
>> Type codes: 000002200 (+middle 5 rows)
>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>
>> Expected sep (',') but '0' ends field 5 on line 6 when detecting
>> types: 204038,2617097,20110803,0,0
>> But here is the wc output (via cygwin; newline, word (whitespace delim
>> so each word one 'line' here), byte)@
>>
>> $ wc spd_all_fixed.csv
>> 168997637 168997638 9078155125 spd_all_fixed.csv
>> [So fread 9M, wc 168M rows].
>> Regards
>> Paul
>>
>>
>> On 30 April 2013 18:52, Matthew Dowle wrote:
>>
>>> Hi,
>>>
>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>> output.
>>>
>>> Thanks, Matthew
>>>
>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>
>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line.
>>>>
>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error
>>>>
>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined,
>>>>
>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>> and comparing to surrounding lines and the first ten lines
>>>>
>>>> $ head spd_all_fixed.csv
>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command.
>>>> Regards
>>>> Paul

Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ggrothendieck at gmail.com  Fri May  3 13:18:59 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 07:18:59 -0400
Subject: [datatable-help] merge/join/match
Message-ID: 

I am moving this discussion which started with mdowle to the list.

Consider this example slightly modified from the data.table FAQ:

> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
> out <- X[Y]; out
   x foo bar
1: b   3   4
2: b   4   4
3: b   5   4
4: c   6   2
5: c   7   2
6: d  NA   3

Note that the first column of the output is labelled x even though the
data to produce it comes from y, e.g. "d" in out$x is not in X$x but
does appear in Y$y so clearly the data is coming from y as opposed to
x . In terms of SQL the above would be written:

select Y.y as x, ...

and the need to rename the first column of out suggests that there
may be a deeper problem here.

Here are some ideas to address this (they would require changes to
data.table):

- the default of X[Y,, match=NA] would be changed to a default of
X[Y,,match=0] so that it corresponds to the defaults in R's merge and
in SQL joins.

- the column name of the first column in the example above would be
changed to y if match=0 but be left at x if match=NA.
In the case that match=0 (the proposed new default) x and y are equal so the first column can be validly labelled as x but in the case that match=NA they are not so y would be used as the column name. - the name match= does seem a bit misleading since R's match only matches one item in the target whereas in data.table match matches many if mult="all" and that is the default. Perhaps some thought should be given to a name change here? The above would seem to correspond more closely to R's merge and SQL join defaults. Any use cases or other comments? -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From p.harding at paniscus.com Fri May 3 15:32:16 2013 From: p.harding at paniscus.com (Paul Harding) Date: Fri, 3 May 2013 14:32:16 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: Definitely a 64-bit machine. Here are the details: Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) Installed memory (RAM): 128GB System type: 64-bit Operating System Windows edition: Server 2008 R2 Enterprise SP1 Regards, Paul On 3 May 2013 10:51, Matthew Dowle wrote: > ** > > > > Hi Paul, > > Thanks for all this! > > > The problem arises when the file reaches 4GB, in this case between > 8,030,000 and 8,040,000 rows: > > Ahah. Are you using a 32bit or 64bit Windows machine? > > Thanks, Matthew > > > > On 02.05.2013 10:19, Paul Harding wrote: > > Some supplementary information, here is the portion of the file (with row > numbers, +1 for header) around where fread thinks the file ends. > $ nl spd_all_fixed.csv | head -n 9186300 |tail > 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 > 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 > 9186293 > 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 > 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 > 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 > 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 > 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 > 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 > 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 > 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 > 9186294 (row 9186293 excl header) is where fread thinks the file ends, > mid-line by the look of it! > I've experimented by truncating the file. The error varies, either it > reads too few records or gives the error I reported, presumably determined > by whether the last perceived line is entire. > The problem arises when the file reaches 4GB, in this case between > 8,030,000 and 8,040,000 rows: > -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 > spd_all_trunc_8030k.csv > -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 > spd_all_trunc_8040k.csv > > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) > Detected eol as \r\n (CRLF) in that order, the Windows standard. > Looking for supplied sep ',' on line 30 (the last non blank line in the > first 30) ... found > Found 9 columns > First row with 9 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. 
> Count of eol after first data row: 80300000 > Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 > data rows > Type codes: 000002000 (first 5 rows) > Type codes: 000002000 (+middle 5 rows) > Type codes: 000002000 (+last 5 rows) > 0%Bumping column 7 from INT to INT64 on data row 9, field contains > '0.42634430000000001' > Bumping column 7 from INT64 to REAL on data row 9, field contains > '0.42634430000000001' > 0.000s ( 0%) Memory map (rerun may be quicker) > 0.000s ( 0%) Sep and header detection > 0.000s ( 0%) Count rows (wc -l) > 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) > 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM > 171.188s ( 65%) Reading data > 1365231.809s (518439%) Allocation for type bumps (if any), including gc > time if triggered > -1365231.809s (-518439%) Coercing data already read in type bumps (if any) > 0.000s ( 0%) Changing na.strings to NA > 0.000s Total > > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) > Detected eol as \r\n (CRLF) in that order, the Windows standard. > Looking for supplied sep ',' on line 30 (the last non blank line in the > first 30) ... found > Found 9 columns > First row with 9 fields occurs on line 1 (either column names or first row > of data) > All the fields on line 1 are character fields. Treating as the column > names. > Count of eol after first data row: 18913 > Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data > rows > Type codes: 000002000 (first 5 rows) > Type codes: 000002000 (+middle 5 rows) > Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : > Expected sep (',') but ',' ends field 2 on line 6 when detecting types: > 204650,724540, > Regards, > Paul > > > On 1 May 2013 10:28, Paul Harding wrote: > >> Here is the verbose output: >> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >> Detected eol as \r\n (CRLF) in that order, the Windows standard. >> Looking for supplied sep ',' on line 30 (the last non blank line in the >> first 30) ... found >> Found 9 columns >> First row with 9 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 9186293 >> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 >> data rows >> Type codes: 000002000 (first 5 rows) >> Type codes: 000002200 (+middle 5 rows) >> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >> Expected sep (',') but '0' ends field 5 on line 6 when detecting >> types: 204038,2617097,20110803,0,0 >> But here is the wc output (via cygwin; newline, word (whitespace delim >> so each word one 'line' here), byte)@ >> $ wc spd_all_fixed.csv >> 168997637 168997638 9078155125 spd_all_fixed.csv >> [So fread 9M, wc 168M rows]. >> Regards >> Paul >> >> >> On 30 April 2013 18:52, Matthew Dowle wrote: >> >>> >>> >>> Hi, >>> >>> Thanks for reporting this. Please set verbose=TRUE and let us know the >>> output. >>> >>> Thanks, Matthew >>> >>> >>> >>> On 30.04.2013 18:01, Paul Harding wrote: >>> >>> Problem with fread on a large file >>> The file is 8GB, just short of 200,000 lines, produced as SQLoutput and >>> modified by cygwin/perl to remove the second line. 
>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>> fread("data/spd_all_fixed.csv",sep=",") >>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>> types: 204038,2617097,20110803,0,0 >>> Looking for the offending line,with line numbers in output so I'm >>> guessing this is line 6 of the mid-file chunk examined, >>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>> and comparing to surrounding lines and the first ten lines >>> $ head spd_all_fixed.csv >>> s_key,i_key,p_key,q,pq,d,l,epi,class >>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>> I can't see any difference. I wonder if this is a bug? I have no >>> problems on a small test data set run through an identical process and >>> using the same fread command. >>> Regards >>> Paul >>> >>> >>> >>> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri May 3 15:59:16 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 03 May 2013 14:59:16 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> Message-ID: <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc. Please could you file it as a bug on the tracker. Thanks. Matthew On 03.05.2013 14:32, Paul Harding wrote: > Definitely a 64-bit machine. Here are the details: > > Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) > Installed memory (RAM): 128GB > System type: 64-bit Operating System > Windows edition: Server 2008 R2 Enterprise SP1 > Regards, > Paul > > On 3 May 2013 10:51, Matthew Dowle wrote: > >> Hi Paul, >> >> Thanks for all this! >> >>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >> >> Ahah. Are you using a 32bit or 64bit Windows machine? >> >> Thanks, Matthew >> >> On 02.05.2013 10:19, Paul Harding wrote: >> >>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. 
>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>> mid-line by the look of it!
>>> I've experimented by truncating the file. The error varies, either it
>>> reads too few records or gives the error I reported, presumably determined
>>> by whether the last perceived line is entire.
>>> The problem arises when the file reaches 4GB, in this case between
>>> 8,030,000 and 8,040,000 rows:
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02
>>> spd_all_trunc_8030k.csv
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06
>>> spd_all_trunc_8040k.csv
>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>> first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row
>>> of data)
>>> All the fields on line 1 are character fields. Treating as the column
>>> names.
>>> Count of eol after first data row: 80300000
>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999
>>> data rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Type codes: 000002000 (+last 5 rows)
>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
>>> '0.42634430000000001'
>>> Bumping column 7 from INT64 to REAL on data row 9, field contains
>>> '0.42634430000000001'
>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>> 0.000s ( 0%) Sep and header detection
>>> 0.000s ( 0%) Count rows (wc -l)
>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>> 171.188s ( 65%) Reading data
>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc
>>> time if triggered
>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>> 0.000s ( 0%) Changing na.strings to NA
>>> 0.000s Total
>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>> first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row
>>> of data)
>>> All the fields on line 1 are character fields. Treating as the column
>>> names.
>>> Count of eol after first data row: 18913
>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data
>>> rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types:
>>> 204650,724540,
>>> Regards,
>>> Paul
>>>
>>> On 1 May 2013 10:28, Paul Harding wrote:
>>>
>>>> Here is the verbose output:
>>>>
>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>>> first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first
>>>> row of data)
>>>> All the fields on line 1 are character fields. Treating as the column
>>>> names.
>>>> Count of eol after first data row: 9186293
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
>>>> data rows
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002200 (+middle 5 rows)
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting
>>>> types: 204038,2617097,20110803,0,0
>>>> But here is the wc output (via cygwin; newline, word (whitespace delim
>>>> so each word one 'line' here), byte)@
>>>> $ wc spd_all_fixed.csv
>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>> [So fread 9M, wc 168M rows].
>>>> Regards
>>>> Paul
>>>>
>>>> On 30 April 2013 18:52, Matthew Dowle wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>
>>>>> Thanks, Matthew
>>>>>
>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>
>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line.
>>>>>> >>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>> >>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>> >>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>> and comparing to surrounding lines and the first ten lines >>>>>> >>>>>> $ head spd_all_fixed.csv >>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>> Regards >>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Fri May 3 16:57:19 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 09:57:19 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: A correction - the param is called "nomatch", not "match". This use case seems like smth a user shouldn't really do - in an ideal world you should have them both keyed by the same-name column. As is, my view on it is that data.table is correcting the user mistake of naming the column in Y - y, instead of x, and so the output makes sense and I don't see the need of complicating the behavior by adding more cases one has to go through to figure out what the output columns would be. Similar to asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous column there, would you? On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck wrote: > I am moving this discussion which started with mdowle to the list. > > Consider this example slightly modified from the data.table FAQ: > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > out <- X[Y]; out > x foo bar > 1: b 3 4 > 2: b 4 4 > 3: b 5 4 > 4: c 6 2 > 5: c 7 2 > 6: d NA 3 > > Note that the first column of the output is labelled x even though the > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > does appear in Y$y so clearly the data is coming from y as opposed to > x . In terms of SQL the above would be written: > > select Y.y as x, ... 
> > and the need to renamne the first column of out suggests that there > may be a deeper problem here. > > Here are some ideas to address this (they would require changes to > data.table): > > - the default of X[Y,, match=NA] would be changed to a default of > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > in SQL joins. > > - the column name of the first column in the example above would be > changed to y if match=0 but be left at x if match=NA. In the case > that match=0 (the proposed new default) x and y are equal so the first > column can be validly labelled as x but in the case that match=NA they > are not so y would be used as the column name. > > - the name match= does seem a bit misleading since R's match only > matches one item in the target whereas in data.table match matches > many if mult="all" and that is the default. Perhaps some thought > should be given to a name change here? > > The above would seem to correspond more closely to R's merge and SQL > join defaults. Any use cases or other comments? > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Fri May 3 16:59:05 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 09:59:05 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: I would prefer nomatch=0 as a default though, simply because that's what I do most of the time :) On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan wrote: > A correction - the param is called "nomatch", not "match". > > This use case seems like smth a user shouldn't really do - in an ideal > world you should have them both keyed by the same-name column. > > As is, my view on it is that data.table is correcting the user mistake of > naming the column in Y - y, instead of x, and so the output makes sense and > I don't see the need of complicating the behavior by adding more cases one > has to go through to figure out what the output columns would be. Similar > to asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column there, would you? > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck < > ggrothendieck at gmail.com> wrote: > >> I am moving this discussion which started with mdowle to the list. >> >> Consider this example slightly modified from the data.table FAQ: >> >> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >> > out <- X[Y]; out >> x foo bar >> 1: b 3 4 >> 2: b 4 4 >> 3: b 5 4 >> 4: c 6 2 >> 5: c 7 2 >> 6: d NA 3 >> >> Note that the first column of the output is labelled x even though the >> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >> does appear in Y$y so clearly the data is coming from y as opposed to >> x . In terms of SQL the above would be written: >> >> select Y.y as x, ... >> >> and the need to renamne the first column of out suggests that there >> may be a deeper problem here. 
>>
>> Here are some ideas to address this (they would require changes to
>> data.table):
>>
>> - the default of X[Y,, match=NA] would be changed to a default of
>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>> in SQL joins.
>>
>> - the column name of the first column in the example above would be
>> changed to y if match=0 but be left at x if match=NA. In the case
>> that match=0 (the proposed new default) x and y are equal so the first
>> column can be validly labelled as x but in the case that match=NA they
>> are not so y would be used as the column name.
>>
>> - the name match= does seem a bit misleading since R's match only
>> matches one item in the target whereas in data.table match matches
>> many if mult="all" and that is the default. Perhaps some thought
>> should be given to a name change here?
>>
>> The above would seem to correspond more closely to R's merge and SQL
>> join defaults. Any use cases or other comments?
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ggrothendieck at gmail.com  Fri May  3 17:09:19 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 11:09:19 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: 
Message-ID: 

Yes, sorry. It's nomatch= which presumably derives from the parameter
of the same name in the match() function. If the idea of the nomatch=
name was to leverage off existing argument names in R then I would
prefer all.y= to be consistent with merge() in place of nomatch= since
we are really merging/joining rather than just matching. That would
also allow extension to all types of join by adding an all.x= argument
too.

On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan
 wrote:
> I would prefer nomatch=0 as a default though, simply because that's what I
> do most of the time :)
>
>
> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan 
> wrote:
>>
>> A correction - the param is called "nomatch", not "match".
>>
>> This use case seems like smth a user shouldn't really do - in an ideal
>> world you should have them both keyed by the same-name column.
>>
>> As is, my view on it is that data.table is correcting the user mistake of
>> naming the column in Y - y, instead of x, and so the output makes sense and
>> I don't see the need of complicating the behavior by adding more cases one
>> has to go through to figure out what the output columns would be. Similar to
>> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous column
>> there, would you?
>>
>>
>>
>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck
>> wrote:
>>>
>>> I am moving this discussion which started with mdowle to the list.
>>>
>>> Consider this example slightly modified from the data.table FAQ:
>>>
>>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>>> > out <- X[Y]; out
>>> x foo bar
>>> 1: b 3 4
>>> 2: b 4 4
>>> 3: b 5 4
>>> 4: c 6 2
>>> 5: c 7 2
>>> 6: d NA 3
>>>
>>> Note that the first column of the output is labelled x even though the
>>> data to produce it comes from y, e.g.
"d" in out$x is not in X$x but >>> does appear in Y$y so clearly the data is coming from y as opposed to >>> x . In terms of SQL the above would be written: >>> >>> select Y.y as x, ... >>> >>> and the need to renamne the first column of out suggests that there >>> may be a deeper problem here. >>> >>> Here are some ideas to address this (they would require changes to >>> data.table): >>> >>> - the default of X[Y,, match=NA] would be changed to a default of >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >>> in SQL joins. >>> >>> - the column name of the first column in the example above would be >>> changed to y if match=0 but be left at x if match=NA. In the case >>> that match=0 (the proposed new default) x and y are equal so the first >>> column can be validly labelled as x but in the case that match=NA they >>> are not so y would be used as the column name. >>> >>> - the name match= does seem a bit misleading since R's match only >>> matches one item in the target whereas in data.table match matches >>> many if mult="all" and that is the default. Perhaps some thought >>> should be given to a name change here? >>> >>> The above would seem to correspond more closely to R's merge and SQL >>> join defaults. Any use cases or other comments? >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 3 17:23:02 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 10:23:02 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: To clarify - that behavior is already implemented in merge (more specifically merge.data.table). I don't really have a view on having it in X[Y] as well - I don't like all.x and all.y as the names, since there are no params named 'x' and 'y' in [.data.table (as opposed to merge), but some param that would do a full outer join could certainly be added. On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck wrote: > Yes, sorry. Its nomatch= which presumably derives from the parameter > of the same name in the match() function. If the idea of the nomatch= > name was to leverage off existing argument names in R then I would > prefer all.y= to be consistent with merge() in place of nomatch= since > we are really merging/joining rather than just matching. That would > also allow extension to all types of join by adding all.an x= argument > too. > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > wrote: > > I would prefer nomatch=0 as a default though, simply because that's what > I > > do most of the time :) > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > > wrote: > >> > >> A correction - the param is called "nomatch", not "match". > >> > >> This use case seems like smth a user shouldn't really do - in an ideal > >> world you should have them both keyed by the same-name column. 
> >> > >> As is, my view on it is that data.table is correcting the user mistake > of > >> naming the column in Y - y, instead of x, and so the output makes sense > and > >> I don't see the need of complicating the behavior by adding more cases > one > >> has to go through to figure out what the output columns would be. > Similar to > >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column > >> there, would you? > >> > >> > >> > >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > >> wrote: > >>> > >>> I am moving this discussion which started with mdowle to the list. > >>> > >>> Consider this example slightly modified from the data.table FAQ: > >>> > >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > >>> > out <- X[Y]; out > >>> x foo bar > >>> 1: b 3 4 > >>> 2: b 4 4 > >>> 3: b 5 4 > >>> 4: c 6 2 > >>> 5: c 7 2 > >>> 6: d NA 3 > >>> > >>> Note that the first column of the output is labelled x even though the > >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but > >>> does appear in Y$y so clearly the data is coming from y as opposed to > >>> x . In terms of SQL the above would be written: > >>> > >>> select Y.y as x, ... > >>> > >>> and the need to renamne the first column of out suggests that there > >>> may be a deeper problem here. > >>> > >>> Here are some ideas to address this (they would require changes to > >>> data.table): > >>> > >>> - the default of X[Y,, match=NA] would be changed to a default of > >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and > >>> in SQL joins. > >>> > >>> - the column name of the first column in the example above would be > >>> changed to y if match=0 but be left at x if match=NA. In the case > >>> that match=0 (the proposed new default) x and y are equal so the first > >>> column can be validly labelled as x but in the case that match=NA they > >>> are not so y would be used as the column name. > >>> > >>> - the name match= does seem a bit misleading since R's match only > >>> matches one item in the target whereas in data.table match matches > >>> many if mult="all" and that is the default. Perhaps some thought > >>> should be given to a name change here? > >>> > >>> The above would seem to correspond more closely to R's merge and SQL > >>> join defaults. Any use cases or other comments? > >>> > >>> -- > >>> Statistics & Software Consulting > >>> GKX Group, GKX Associates Inc. > >>> tel: 1-877-GKX-GROUP > >>> email: ggrothendieck at gmail.com > >>> _______________________________________________ > >>> datatable-help mailing list > >>> datatable-help at lists.r-forge.r-project.org > >>> > >>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >> > >> > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Fri May 3 17:27:44 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 10:27:44 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: Btw the way I think about the "nomatch" name is as follows - normally X[Y] tries to match rows of Y with rows of X, and then "nomatch" tells it what to do when there is *no match*. 
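For example, here's a quick sketch using Gabor's X and Y from upthread
(untested, but this is the behavior I'd expect, since "d" has no match in
X's key):

library(data.table)
X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
Y <- data.table(y=c("b","c","d"), bar=c(4,2,3))
X[Y]             # nomatch=NA (default): the unmatched "d" row is kept, with foo=NA
X[Y, nomatch=0]  # unmatched rows are dropped: only the "b" and "c" rows remain

So nomatch=NA acts like an outer join on Y's rows and nomatch=0 like an
inner join, which is the merge/SQL join default Gabor referred to.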
On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan wrote: > To clarify - that behavior is already implemented in merge (more > specifically merge.data.table). I don't really have a view on having it in > X[Y] as well - I don't like all.x and all.y as the names, since there are > no params named 'x' and 'y' in [.data.table (as opposed to merge), but some > param that would do a full outer join could certainly be added. > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck < > ggrothendieck at gmail.com> wrote: > >> Yes, sorry. Its nomatch= which presumably derives from the parameter >> of the same name in the match() function. If the idea of the nomatch= >> name was to leverage off existing argument names in R then I would >> prefer all.y= to be consistent with merge() in place of nomatch= since >> we are really merging/joining rather than just matching. That would >> also allow extension to all types of join by adding all.an x= argument >> too. >> >> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan >> wrote: >> > I would prefer nomatch=0 as a default though, simply because that's >> what I >> > do most of the time :) >> > >> > >> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan < >> eduard.antonyan at gmail.com> >> > wrote: >> >> >> >> A correction - the param is called "nomatch", not "match". >> >> >> >> This use case seems like smth a user shouldn't really do - in an ideal >> >> world you should have them both keyed by the same-name column. >> >> >> >> As is, my view on it is that data.table is correcting the user mistake >> of >> >> naming the column in Y - y, instead of x, and so the output makes >> sense and >> >> I don't see the need of complicating the behavior by adding more cases >> one >> >> has to go through to figure out what the output columns would be. >> Similar to >> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous >> column >> >> there, would you? >> >> >> >> >> >> >> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck >> >> wrote: >> >>> >> >>> I am moving this discussion which started with mdowle to the list. >> >>> >> >>> Consider this example slightly modified from the data.table FAQ: >> >>> >> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >> >>> > out <- X[Y]; out >> >>> x foo bar >> >>> 1: b 3 4 >> >>> 2: b 4 4 >> >>> 3: b 5 4 >> >>> 4: c 6 2 >> >>> 5: c 7 2 >> >>> 6: d NA 3 >> >>> >> >>> Note that the first column of the output is labelled x even though the >> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >> >>> does appear in Y$y so clearly the data is coming from y as opposed to >> >>> x . In terms of SQL the above would be written: >> >>> >> >>> select Y.y as x, ... >> >>> >> >>> and the need to renamne the first column of out suggests that there >> >>> may be a deeper problem here. >> >>> >> >>> Here are some ideas to address this (they would require changes to >> >>> data.table): >> >>> >> >>> - the default of X[Y,, match=NA] would be changed to a default of >> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >> >>> in SQL joins. >> >>> >> >>> - the column name of the first column in the example above would be >> >>> changed to y if match=0 but be left at x if match=NA. In the case >> >>> that match=0 (the proposed new default) x and y are equal so the first >> >>> column can be validly labelled as x but in the case that match=NA they >> >>> are not so y would be used as the column name. 
>> >>> >> >>> - the name match= does seem a bit misleading since R's match only >> >>> matches one item in the target whereas in data.table match matches >> >>> many if mult="all" and that is the default. Perhaps some thought >> >>> should be given to a name change here? >> >>> >> >>> The above would seem to correspond more closely to R's merge and SQL >> >>> join defaults. Any use cases or other comments? >> >>> >> >>> -- >> >>> Statistics & Software Consulting >> >>> GKX Group, GKX Associates Inc. >> >>> tel: 1-877-GKX-GROUP >> >>> email: ggrothendieck at gmail.com >> >>> _______________________________________________ >> >>> datatable-help mailing list >> >>> datatable-help at lists.r-forge.r-project.org >> >>> >> >>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> > >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Fri May 3 17:36:28 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 3 May 2013 11:36:28 -0400 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: Yes, except that is not really what happens since match() only matches one row whereas with mult="all", the default, all rows are matched which is not really matching in the sense of match(). The current naming confuses matching with joining and its really the latter that is being done. Regarding the existence of merge the advantage of [ is that it will automatically only take the columns needed so merge is not really equivalent to [ in all respects. Furthermore having to use different constructs for different types of merge seems awkward. On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan wrote: > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > tries to match rows of Y with rows of X, and then "nomatch" tells it what to > do when there is *no match*. > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > wrote: >> >> To clarify - that behavior is already implemented in merge (more >> specifically merge.data.table). I don't really have a view on having it in >> X[Y] as well - I don't like all.x and all.y as the names, since there are no >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some >> param that would do a full outer join could certainly be added. >> >> >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck >> wrote: >>> >>> Yes, sorry. Its nomatch= which presumably derives from the parameter >>> of the same name in the match() function. If the idea of the nomatch= >>> name was to leverage off existing argument names in R then I would >>> prefer all.y= to be consistent with merge() in place of nomatch= since >>> we are really merging/joining rather than just matching. That would >>> also allow extension to all types of join by adding all.an x= argument >>> too. >>> >>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan >>> wrote: >>> > I would prefer nomatch=0 as a default though, simply because that's >>> > what I >>> > do most of the time :) >>> > >>> > >>> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan >>> > >>> > wrote: >>> >> >>> >> A correction - the param is called "nomatch", not "match". >>> >> >>> >> This use case seems like smth a user shouldn't really do - in an ideal >>> >> world you should have them both keyed by the same-name column. 
>>> >> >>> >> As is, my view on it is that data.table is correcting the user mistake >>> >> of >>> >> naming the column in Y - y, instead of x, and so the output makes >>> >> sense and >>> >> I don't see the need of complicating the behavior by adding more cases >>> >> one >>> >> has to go through to figure out what the output columns would be. >>> >> Similar to >>> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous >>> >> column >>> >> there, would you? >>> >> >>> >> >>> >> >>> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck >>> >> wrote: >>> >>> >>> >>> I am moving this discussion which started with mdowle to the list. >>> >>> >>> >>> Consider this example slightly modified from the data.table FAQ: >>> >>> >>> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >>> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >>> >>> > out <- X[Y]; out >>> >>> x foo bar >>> >>> 1: b 3 4 >>> >>> 2: b 4 4 >>> >>> 3: b 5 4 >>> >>> 4: c 6 2 >>> >>> 5: c 7 2 >>> >>> 6: d NA 3 >>> >>> >>> >>> Note that the first column of the output is labelled x even though >>> >>> the >>> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >>> >>> does appear in Y$y so clearly the data is coming from y as opposed to >>> >>> x . In terms of SQL the above would be written: >>> >>> >>> >>> select Y.y as x, ... >>> >>> >>> >>> and the need to renamne the first column of out suggests that there >>> >>> may be a deeper problem here. >>> >>> >>> >>> Here are some ideas to address this (they would require changes to >>> >>> data.table): >>> >>> >>> >>> - the default of X[Y,, match=NA] would be changed to a default of >>> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >>> >>> in SQL joins. >>> >>> >>> >>> - the column name of the first column in the example above would be >>> >>> changed to y if match=0 but be left at x if match=NA. In the case >>> >>> that match=0 (the proposed new default) x and y are equal so the >>> >>> first >>> >>> column can be validly labelled as x but in the case that match=NA >>> >>> they >>> >>> are not so y would be used as the column name. >>> >>> >>> >>> - the name match= does seem a bit misleading since R's match only >>> >>> matches one item in the target whereas in data.table match matches >>> >>> many if mult="all" and that is the default. Perhaps some thought >>> >>> should be given to a name change here? >>> >>> >>> >>> The above would seem to correspond more closely to R's merge and SQL >>> >>> join defaults. Any use cases or other comments? >>> >>> >>> >>> -- >>> >>> Statistics & Software Consulting >>> >>> GKX Group, GKX Associates Inc. >>> >>> tel: 1-877-GKX-GROUP >>> >>> email: ggrothendieck at gmail.com >>> >>> _______________________________________________ >>> >>> datatable-help mailing list >>> >>> datatable-help at lists.r-forge.r-project.org >>> >>> >>> >>> >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >>> >> >>> > >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >> >> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. 
tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 3 17:41:10 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 10:41:10 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: Yeah, leveraging not using all columns in a full merge is a good idea - I agree that feature would be nice to have. I'm not using the match() function tbh, so current naming doesn't bug me, but maybe others who do use match() can weigh in if that bothers them as well. On Fri, May 3, 2013 at 10:36 AM, Gabor Grothendieck wrote: > Yes, except that is not really what happens since match() only matches > one row whereas with mult="all", the default, all rows are matched > which is not really matching in the sense of match(). The current > naming confuses matching with joining and its really the latter that > is being done. > > Regarding the existence of merge the advantage of [ is that it will > automatically only take the columns needed so merge is not really > equivalent to [ in all respects. Furthermore having to use different > constructs for different types of merge seems awkward. > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > wrote: > > Btw the way I think about the "nomatch" name is as follows - normally > X[Y] > > tries to match rows of Y with rows of X, and then "nomatch" tells it > what to > > do when there is *no match*. > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > > wrote: > >> > >> To clarify - that behavior is already implemented in merge (more > >> specifically merge.data.table). I don't really have a view on having it > in > >> X[Y] as well - I don't like all.x and all.y as the names, since there > are no > >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some > >> param that would do a full outer join could certainly be added. > >> > >> > >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > >> wrote: > >>> > >>> Yes, sorry. Its nomatch= which presumably derives from the parameter > >>> of the same name in the match() function. If the idea of the nomatch= > >>> name was to leverage off existing argument names in R then I would > >>> prefer all.y= to be consistent with merge() in place of nomatch= since > >>> we are really merging/joining rather than just matching. That would > >>> also allow extension to all types of join by adding all.an x= argument > >>> too. > >>> > >>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > >>> wrote: > >>> > I would prefer nomatch=0 as a default though, simply because that's > >>> > what I > >>> > do most of the time :) > >>> > > >>> > > >>> > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > >>> > > >>> > wrote: > >>> >> > >>> >> A correction - the param is called "nomatch", not "match". > >>> >> > >>> >> This use case seems like smth a user shouldn't really do - in an > ideal > >>> >> world you should have them both keyed by the same-name column. > >>> >> > >>> >> As is, my view on it is that data.table is correcting the user > mistake > >>> >> of > >>> >> naming the column in Y - y, instead of x, and so the output makes > >>> >> sense and > >>> >> I don't see the need of complicating the behavior by adding more > cases > >>> >> one > >>> >> has to go through to figure out what the output columns would be. > >>> >> Similar to > >>> >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > >>> >> column > >>> >> there, would you? 
> >>> >> > >>> >> > >>> >> > >>> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > >>> >> wrote: > >>> >>> > >>> >>> I am moving this discussion which started with mdowle to the list. > >>> >>> > >>> >>> Consider this example slightly modified from the data.table FAQ: > >>> >>> > >>> >>> > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, > key="x") > >>> >>> > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > >>> >>> > out <- X[Y]; out > >>> >>> x foo bar > >>> >>> 1: b 3 4 > >>> >>> 2: b 4 4 > >>> >>> 3: b 5 4 > >>> >>> 4: c 6 2 > >>> >>> 5: c 7 2 > >>> >>> 6: d NA 3 > >>> >>> > >>> >>> Note that the first column of the output is labelled x even though > >>> >>> the > >>> >>> data to produce it comes from y, e.g. "d" in out$x is not in X$x > but > >>> >>> does appear in Y$y so clearly the data is coming from y as opposed > to > >>> >>> x . In terms of SQL the above would be written: > >>> >>> > >>> >>> select Y.y as x, ... > >>> >>> > >>> >>> and the need to renamne the first column of out suggests that there > >>> >>> may be a deeper problem here. > >>> >>> > >>> >>> Here are some ideas to address this (they would require changes to > >>> >>> data.table): > >>> >>> > >>> >>> - the default of X[Y,, match=NA] would be changed to a default of > >>> >>> X[Y,,match=0] so that it corresponds to the defaults in R's merge > and > >>> >>> in SQL joins. > >>> >>> > >>> >>> - the column name of the first column in the example above would be > >>> >>> changed to y if match=0 but be left at x if match=NA. In the case > >>> >>> that match=0 (the proposed new default) x and y are equal so the > >>> >>> first > >>> >>> column can be validly labelled as x but in the case that match=NA > >>> >>> they > >>> >>> are not so y would be used as the column name. > >>> >>> > >>> >>> - the name match= does seem a bit misleading since R's match only > >>> >>> matches one item in the target whereas in data.table match matches > >>> >>> many if mult="all" and that is the default. Perhaps some thought > >>> >>> should be given to a name change here? > >>> >>> > >>> >>> The above would seem to correspond more closely to R's merge and > SQL > >>> >>> join defaults. Any use cases or other comments? > >>> >>> > >>> >>> -- > >>> >>> Statistics & Software Consulting > >>> >>> GKX Group, GKX Associates Inc. > >>> >>> tel: 1-877-GKX-GROUP > >>> >>> email: ggrothendieck at gmail.com > >>> >>> _______________________________________________ > >>> >>> datatable-help mailing list > >>> >>> datatable-help at lists.r-forge.r-project.org > >>> >>> > >>> >>> > >>> >>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >>> >> > >>> >> > >>> > > >>> > >>> > >>> > >>> -- > >>> Statistics & Software Consulting > >>> GKX Group, GKX Associates Inc. > >>> tel: 1-877-GKX-GROUP > >>> email: ggrothendieck at gmail.com > >> > >> > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri May 3 17:45:24 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 3 May 2013 17:45:24 +0200 Subject: [datatable-help] merge/join/match In-Reply-To: References: Message-ID: <27580076D6D24A76A7A3807E36E83536@gmail.com> (The third time, I'm growing tired of this 40KB message taking over half-hour to reach me! 
(The third time - I'm growing tired of this 40KB message taking over half
an hour to reach me! :) )

Gabor,

About the behaviour of X[Y]: the current definition of X[Y] is "it's a
join looking up X's rows using Y as an index". By this definition, the
output of X[Y] is very much justified, I think. Y is just used as an
index. To me it feels similar to, say, X[8] (which gives NA, NA with the
same column names as X).

Another thought that occurs to me is, say, in this example:

    X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
    Y <- data.table(y=c("b"), bar=c(4))
    X[Y]

Here again, you query for Y's y values in X's key column and join X's and
Y's columns. There's no Y-value for which X gives NA. The data then comes
from both "X" and "Y" (as opposed to the case "d" you showed, where the
data comes just from "Y"). In this case should it be named "x" or "y"?
Always "x" makes sense to me. And Y[X] would give a "y" instead. However,
I am not that good with SQL joins, so I may very well have missed your
point here.

Regarding `merge`:

    x <- as.data.frame(X)
    y <- as.data.frame(Y)
    merge(x, y, by.x="x", by.y="y", all=TRUE)  # --- (1)
    merge(y, x, by.x="y", by.y="x", all=TRUE)  # --- (2)

(1) always gives the column name "x" and (2) always gives "y". And so
does X[Y] as opposed to Y[X], except for the fact that the operations
X[Y] and Y[X] are not identical (as opposed to merge). So I don't see a
dissimilarity here. Again, I may have misread your point and would love
to be corrected if so.

About `"nomatch"`, I agree with you that the name could be changed to
avoid confusion with R's `match`. Maybe "missing = NA" and "missing = 0"
make more sense?

Best regards,
Arun

From aragorn168b at gmail.com Fri May 3 17:46:53 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 17:46:53 +0200
Subject: [datatable-help] merge/join/match
Message-ID: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>

Gabor,

X[Y] and Y[X] are not necessarily the same operations (meaning, they
don't *have* to give the same output). However, merge(X,Y) and merge(Y,X)
*have* to provide the same output (except for the column order and
names). In that sense, a join is a bit different from a merge, no?

Arun
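To make the asymmetry concrete, here is a minimal sketch using the X and
Y from Gabor's example (keying Y is an extra assumption here; the join
does not require it):

    library(data.table)
    X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
    Y <- data.table(y=c("b","c","d"), bar=c(4,2,3), key="y")
    X[Y]  # one result row per row of Y (times matches); foo is NA for y=="d"
    Y[X]  # one result row per row of X (times matches); bar is NA for x=="a"

X[Y] returns 6 rows and Y[X] returns 7, so the two joins are genuinely
different operations, as Arun says.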
From aragorn168b at gmail.com Fri May 3 17:48:56 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 17:48:56 +0200
Subject: [datatable-help] merge/join/match

Eddi,

You could just set: options(datatable.nomatch = 0) if you use that
extensively.

Arun

From eduard.antonyan at gmail.com Fri May 3 17:50:19 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 10:50:19 -0500
Subject: [datatable-help] merge/join/match
In-Reply-To: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>

Arun, I think Gabor understands that. If I understand him correctly, he
simply wants an option for X[Y, some.option = TRUE] to return
merge(X, Y, all = TRUE). Also, do note that merge(X, Y, all.x = TRUE) and
merge(Y, X, all.x = TRUE) don't have to return the same answer.
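A small sketch of the nomatch= behaviour under discussion, together with
the global option Arun suggests (data as in Gabor's example):

    library(data.table)
    X <- data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
    Y <- data.table(y=c("b","c","d"), bar=c(4,2,3))
    X[Y]                            # default nomatch=NA: "d" kept with foo=NA
    X[Y, nomatch=0]                 # "d" dropped: an inner join
    options(datatable.nomatch = 0)  # change the session-wide default
    X[Y]                            # now drops "d" as well
    options(datatable.nomatch = NA) # restore the usual default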
From eduard.antonyan at gmail.com Fri May 3 17:51:13 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 10:51:13 -0500
Subject: [datatable-help] merge/join/match

Good point - I might do that, though I'll need to be a bit careful, as I
run a lot of scripts on remote computers.

On Fri, May 3, 2013 at 10:48 AM, Arunkumar Srinivasan wrote:
> You could just set: options(datatable.nomatch = 0) if you use that
> extensively.

From ggrothendieck at gmail.com Fri May 3 17:55:38 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 11:55:38 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>

Assuming same-named keys, these are all the same except possibly for row
and column order:

    X[Y,, nomatch=0]
    Y[X,, nomatch=0]
    merge(X, Y)
    merge(Y, X)

That X[Y] is not the same as Y[X] is analogous to the fact that
merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE).
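A sketch checking the four forms above; as Gabor says, it assumes
same-named keys, so the FAQ example is restated with a shared key column
k (the renaming is for illustration only):

    library(data.table)
    X <- data.table(k=c("a","a","b","b","b","c","c"), foo=1:7, key="k")
    Y <- data.table(k=c("b","c","d"), bar=c(4,2,3), key="k")
    X[Y, nomatch=0]  # 5 rows: b,b,b,c,c with bar attached
    Y[X, nomatch=0]  # the same 5 rows, with Y's columns first
    merge(X, Y)      # merge.data.table joins on the shared key: same 5 rows
    merge(Y, X)      # ditto, with the non-key columns swapped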
From ggrothendieck at gmail.com Fri May 3 17:57:45 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 11:57:45 -0400
Subject: [datatable-help] merge/join/match

In my last post it should have read:

That X[Y] is not the same as Y[X] is analogous to the fact that
merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE).
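The corrected analogy, spelled out on the same-keyed tables from the
previous sketch:

    library(data.table)
    X <- data.table(k=c("a","a","b","b","b","c","c"), foo=1:7, key="k")
    Y <- data.table(k=c("b","c","d"), bar=c(4,2,3), key="k")
    X[Y]                     # keeps every row of Y: 6 rows, foo NA for k=="d"
    merge(X, Y, all.y=TRUE)  # same content, column order aside
    Y[X]                     # keeps every row of X: 7 rows, bar NA for k=="a"
    merge(Y, X, all.y=TRUE)  # same content again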
From aragorn168b at gmail.com Fri May 3 18:45:24 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 18:45:24 +0200
Subject: [datatable-help] merge/join/match

Gabor,

Very true. I suppose your request is that x[i], where `i` is a
data.table, should have the same set of options as R's base `merge`
function: by.x=, by.y=, all=TRUE and so on. I like the idea by itself.
However, I am not able to think of a way to do this. I mean, I find the
syntax X[Y, by.x=TRUE] weird / not making sense. That is, even though

    X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE]

(ignoring the reordered columns), the latter two don't seem to make
sense / are redundant (maybe it's because I am used to this syntax).

Arun
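For reference, the merge() options under discussion, run on data.frame
versions of the FAQ tables (as in Arun's message of 17:45; base R's merge
names the join column after by.x):

    x <- data.frame(x=c("a","a","b","b","b","c","c"), foo=1:7)
    y <- data.frame(y=c("b","c","d"), bar=c(4,2,3))
    merge(x, y, by.x="x", by.y="y")            # inner join: 5 rows, column "x"
    merge(x, y, by.x="x", by.y="y", all=TRUE)  # full outer join: 8 rows
    merge(y, x, by.x="y", by.y="x", all=TRUE)  # same 8 rows, column named "y"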
From eduard.antonyan at gmail.com Fri May 3 18:49:11 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 11:49:11 -0500
Subject: [datatable-help] merge/join/match

Arun, it only needs the addition of smth like X[Y, keep.all = TRUE]; all
of the other merge options already exist as either X[Y] or Y[X], with or
without nomatch = 0/NA.
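keep.all= is Eduard's proposed name, not an existing argument; the full
outer join it would spell is already available through merge.data.table:

    library(data.table)
    X <- data.table(k=c("a","a","b","b","b","c","c"), foo=1:7, key="k")
    Y <- data.table(k=c("b","c","d"), bar=c(4,2,3), key="k")
    merge(X, Y, all=TRUE)   # 8 rows: NA bar for the "a" rows, NA foo for "d"
    # X[Y, keep.all=TRUE]   # proposed syntax only - not implemented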
> > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > wrote: > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > tries to match rows of Y with rows of X, and then "nomatch" tells it what > to > do when there is *no match*. > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > wrote: > > > To clarify - that behavior is already implemented in merge (more > specifically merge.data.table). I don't really have a view on having it in > X[Y] as well - I don't like all.x and all.y as the names, since there are > no > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > param that would do a full outer join could certainly be added. > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > wrote: > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > of the same name in the match() function. If the idea of the nomatch= > name was to leverage off existing argument names in R then I would > prefer all.y= to be consistent with merge() in place of nomatch= since > we are really merging/joining rather than just matching. That would > also allow extension to all types of join by adding all.an x= argument > too. > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > wrote: > > I would prefer nomatch=0 as a default though, simply because that's > what I > do most of the time :) > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > wrote: > > > A correction - the param is called "nomatch", not "match". > > This use case seems like smth a user shouldn't really do - in an ideal > world you should have them both keyed by the same-name column. > > As is, my view on it is that data.table is correcting the user mistake > of > naming the column in Y - y, instead of x, and so the output makes > sense and > I don't see the need of complicating the behavior by adding more cases > one > has to go through to figure out what the output columns would be. > Similar to > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column > there, would you? > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > wrote: > > > I am moving this discussion which started with mdowle to the list. > > Consider this example slightly modified from the data.table FAQ: > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > out <- X[Y]; out > > x foo bar > 1: b 3 4 > 2: b 4 4 > 3: b 5 4 > 4: c 6 2 > 5: c 7 2 > 6: d NA 3 > > Note that the first column of the output is labelled x even though > the > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > does appear in Y$y so clearly the data is coming from y as opposed to > x . In terms of SQL the above would be written: > > select Y.y as x, ... > > and the need to renamne the first column of out suggests that there > may be a deeper problem here. > > Here are some ideas to address this (they would require changes to > data.table): > > - the default of X[Y,, match=NA] would be changed to a default of > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > in SQL joins. > > - the column name of the first column in the example above would be > changed to y if match=0 but be left at x if match=NA. In the case > that match=0 (the proposed new default) x and y are equal so the > first > column can be validly labelled as x but in the case that match=NA > they > are not so y would be used as the column name. 
From ggrothendieck at gmail.com Fri May 3 18:50:23 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 12:50:23 -0400
Subject: [datatable-help] merge/join/match

I was thinking all.y= would be implemented first, to replace nomatch=,
and then all.x= later (as the latter involves more work, whereas the
former is trivial). It would also be possible to implement by.x, by.y
and by, such that they default to the current functionality, and that
might be nice too.
From aragorn168b at gmail.com Fri May 3 18:52:34 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 18:52:34 +0200
Subject: [datatable-help] merge/join/match
Message-ID: <267F3BA67F4144DC99C3C1C6784722E9@gmail.com>

Eduard,

Yes, I know. But to maintain consistency with `merge` in base R, you
should be able to express any merge (by.x, by.y, all) with X[Y] or Y[X] -
that is what I understand. As it stands, with Y[X] you wouldn't be able
to get the result of merge(X, Y, all.y=TRUE) (including the column
reordering). This is what I understand from Gabor's post.

Arun
> > > > > > > > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what to > > > > > do when there is *no match*. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > > > specifically merge.data.table). I don't really have a view on having it in > > > > > X[Y] as well - I don't like all.x and all.y as the names, since there are no > > > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > > > param that would do a full outer join could certainly be added. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > > > of the same name in the match() function. If the idea of the nomatch= > > > > > name was to leverage off existing argument names in R then I would > > > > > prefer all.y= to be consistent with merge() in place of nomatch= since > > > > > we are really merging/joining rather than just matching. That would > > > > > also allow extension to all types of join by adding all.an (http://all.an) x= argument > > > > > too. > > > > > > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > I would prefer nomatch=0 as a default though, simply because that's > > > > > what I > > > > > do most of the time :) > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > > > > > > > > > wrote: > > > > > > > > > > > > > > > A correction - the param is called "nomatch", not "match". > > > > > > > > > > This use case seems like smth a user shouldn't really do - in an ideal > > > > > world you should have them both keyed by the same-name column. > > > > > > > > > > As is, my view on it is that data.table is correcting the user mistake > > > > > of > > > > > naming the column in Y - y, instead of x, and so the output makes > > > > > sense and > > > > > I don't see the need of complicating the behavior by adding more cases > > > > > one > > > > > has to go through to figure out what the output columns would be. > > > > > Similar to > > > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > > > > > column > > > > > there, would you? > > > > > > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > > > > > > I am moving this discussion which started with mdowle to the list. > > > > > > > > > > Consider this example slightly modified from the data.table FAQ: > > > > > > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > > > > out <- X[Y]; out > > > > > > > > > > x foo bar > > > > > 1: b 3 4 > > > > > 2: b 4 4 > > > > > 3: b 5 4 > > > > > 4: c 6 2 > > > > > 5: c 7 2 > > > > > 6: d NA 3 > > > > > > > > > > Note that the first column of the output is labelled x even though > > > > > the > > > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > > > > > does appear in Y$y so clearly the data is coming from y as opposed to > > > > > x . In terms of SQL the above would be written: > > > > > > > > > > select Y.y as x, ... 
> > > > > > > > > > and the need to renamne the first column of out suggests that there > > > > > may be a deeper problem here. > > > > > > > > > > Here are some ideas to address this (they would require changes to > > > > > data.table): > > > > > > > > > > - the default of X[Y,, match=NA] would be changed to a default of > > > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > > > > > in SQL joins. > > > > > > > > > > - the column name of the first column in the example above would be > > > > > changed to y if match=0 but be left at x if match=NA. In the case > > > > > that match=0 (the proposed new default) x and y are equal so the > > > > > first > > > > > column can be validly labelled as x but in the case that match=NA > > > > > they > > > > > are not so y would be used as the column name. > > > > > > > > > > - the name match= does seem a bit misleading since R's match only > > > > > matches one item in the target whereas in data.table match matches > > > > > many if mult="all" and that is the default. Perhaps some thought > > > > > should be given to a name change here? > > > > > > > > > > The above would seem to correspond more closely to R's merge and SQL > > > > > join defaults. Any use cases or other comments? > > > > > > > > > > -- > > > > > Statistics & Software Consulting > > > > > GKX Group, GKX Associates Inc. > > > > > tel: 1-877-GKX-GROUP > > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Statistics & Software Consulting > > > > > GKX Group, GKX Associates Inc. > > > > > tel: 1-877-GKX-GROUP > > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Statistics & Software Consulting > > > > > GKX Group, GKX Associates Inc. > > > > > tel: 1-877-GKX-GROUP > > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Fri May 3 18:54:10 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 3 May 2013 12:54:10 -0400 Subject: [datatable-help] merge/join/match In-Reply-To: References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> Message-ID: I think that from the viewpoint of compatibility and convenience it would be best to implement all.x and all.y and not rely on swapping X and Y. 
SQLite did something like this (they implemented left join but not right join based on the idea that all you have to do is swap join arguments) but the problem with it is that it adds a layer of mental specification effort if the actual problem is better stated in the unsupported orientation. On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan wrote: > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of > the other merge options already exist as either X[Y] or Y[X] with or without > nomatch = 0/NA. > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > wrote: >> >> Gabor, >> >> Very true. I suppose your request is that the x[i] where `i` is a >> data.table should have the same set of options like R's base `merge` >> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself. >> However, I am not able to think of a way to do this. I mean, I find the >> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though >> >> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the >> reordered columns) the latter 2 don't seem to make sense/is redundant (maybe >> it's because I am used to this syntax). >> >> Arun >> >> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: >> >> In my last post it should have read: >> >> That X[Y] is not the same as Y[X] is analogous to the fact that >> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) >> >> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck >> wrote: >> >> Assuming same-named keys, then these are all the same except possibly >> for row and column order: >> >> X[Y,,nomatch=0] >> Y[X,,nomatch=0] >> merge(X, Y) >> merge(Y, X) >> >> That X[Y] is not the same as Y[X] is analogous to the fact that >> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) >> >> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan >> wrote: >> >> Gabor, >> >> X[Y] and Y[X] are not necessarily the same operations (meaning, they don't >> *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* >> to provide the same output (except for the column order and names). In >> that >> sense, a join is a bit different from a merge, no? >> >> Arun >> >> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: >> >> Yes, except that is not really what happens since match() only matches >> one row whereas with mult="all", the default, all rows are matched >> which is not really matching in the sense of match(). The current >> naming confuses matching with joining and its really the latter that >> is being done. >> >> Regarding the existence of merge the advantage of [ is that it will >> automatically only take the columns needed so merge is not really >> equivalent to [ in all respects. Furthermore having to use different >> constructs for different types of merge seems awkward. >> >> >> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan >> wrote: >> >> Btw the way I think about the "nomatch" name is as follows - normally X[Y] >> tries to match rows of Y with rows of X, and then "nomatch" tells it what >> to >> do when there is *no match*. >> >> >> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan >> >> wrote: >> >> >> To clarify - that behavior is already implemented in merge (more >> specifically merge.data.table). 
I don't really have a view on having it in >> X[Y] as well - I don't like all.x and all.y as the names, since there are >> no >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some >> param that would do a full outer join could certainly be added. >> >> >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck >> wrote: >> >> >> Yes, sorry. Its nomatch= which presumably derives from the parameter >> of the same name in the match() function. If the idea of the nomatch= >> name was to leverage off existing argument names in R then I would >> prefer all.y= to be consistent with merge() in place of nomatch= since >> we are really merging/joining rather than just matching. That would >> also allow extension to all types of join by adding all.an x= argument >> too. >> >> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan >> wrote: >> >> I would prefer nomatch=0 as a default though, simply because that's >> what I >> do most of the time :) >> >> >> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan >> >> wrote: >> >> >> A correction - the param is called "nomatch", not "match". >> >> This use case seems like smth a user shouldn't really do - in an ideal >> world you should have them both keyed by the same-name column. >> >> As is, my view on it is that data.table is correcting the user mistake >> of >> naming the column in Y - y, instead of x, and so the output makes >> sense and >> I don't see the need of complicating the behavior by adding more cases >> one >> has to go through to figure out what the output columns would be. >> Similar to >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous >> column >> there, would you? >> >> >> >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck >> wrote: >> >> >> I am moving this discussion which started with mdowle to the list. >> >> Consider this example slightly modified from the data.table FAQ: >> >> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >> Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >> out <- X[Y]; out >> >> x foo bar >> 1: b 3 4 >> 2: b 4 4 >> 3: b 5 4 >> 4: c 6 2 >> 5: c 7 2 >> 6: d NA 3 >> >> Note that the first column of the output is labelled x even though >> the >> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >> does appear in Y$y so clearly the data is coming from y as opposed to >> x . In terms of SQL the above would be written: >> >> select Y.y as x, ... >> >> and the need to renamne the first column of out suggests that there >> may be a deeper problem here. >> >> Here are some ideas to address this (they would require changes to >> data.table): >> >> - the default of X[Y,, match=NA] would be changed to a default of >> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >> in SQL joins. >> >> - the column name of the first column in the example above would be >> changed to y if match=0 but be left at x if match=NA. In the case >> that match=0 (the proposed new default) x and y are equal so the >> first >> column can be validly labelled as x but in the case that match=NA >> they >> are not so y would be used as the column name. >> >> - the name match= does seem a bit misleading since R's match only >> matches one item in the target whereas in data.table match matches >> many if mult="all" and that is the default. Perhaps some thought >> should be given to a name change here? >> >> The above would seem to correspond more closely to R's merge and SQL >> join defaults. Any use cases or other comments? 
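
As a point of reference while weighing the proposal above: the relabelling it would automate in the match=NA case can be approximated today by renaming after the join. A sketch of that workaround on the quoted example tables (setnames() renames by reference):

library(data.table)
X <- data.table(x = c("a","a","b","b","b","c","c"), foo = 1:7, key = "x")
Y <- data.table(y = c("b","c","d"), bar = c(4, 2, 3))
out <- X[Y]              # first column is labelled x, though "d" exists only in Y$y
setnames(out, "x", "y")  # relabel to record where the data actually came from
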
>> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> >> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 3 18:54:49 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 11:54:49 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: <267F3BA67F4144DC99C3C1C6784722E9@gmail.com> References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> <267F3BA67F4144DC99C3C1C6784722E9@gmail.com> Message-ID: If that's what Gabor wants, then I don't think that makes a lot of sense for an X[Y] syntax. I think you should only be able to get (all.y = T) or (all = T) from X[Y], but not (all.x=T, all.y=F). On Fri, May 3, 2013 at 11:52 AM, Arunkumar Srinivasan wrote: > Eduard, > > Yes I know. But to maintain the consistency with the `merge` in base R, > you should be able to query any merge (by.x, by.y, all) with X[Y] or Y[X] > is what I understand. That is, with Y[X] you wouldn't be able to get the > result of merge(X, Y, all.y=TRUE) (results including the column > reordering). This is what I understand from Gabor's post. > > Arun > > On Friday, May 3, 2013 at 6:49 PM, Eduard Antonyan wrote: > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all > of the other merge options already exist as either X[Y] or Y[X] with or > without nomatch = 0/NA. > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Gabor, > > Very true. I suppose your request is that the x[i] where `i` is a > data.table should have the same set of options like R's base `merge` > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by > itself. However, I am not able to think of a way to do this. I mean, I find > the syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even > though > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > reordered columns) the latter 2 don't seem to make sense/is redundant > (maybe it's because I am used to this syntax). 
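
Laid out side by side, the division of labour Eduard is describing looks as follows; the tables are illustrative, and keep.all is his proposed argument, not one that exists:

library(data.table)
X <- data.table(k = c("a","b"), foo = 1:2, key = "k")
Y <- data.table(k = c("b","c"), bar = 3:4, key = "k")

X[Y, nomatch = 0]        # inner join: row b only
X[Y]                     # right outer join: all rows of Y (b, c)
Y[X]                     # left outer join: all rows of X (a, b)
merge(X, Y, all = TRUE)  # full outer join (a, b, c): the one case X[Y] cannot
                         # express, which the proposed keep.all = TRUE would cover
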
> > Arun > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > In my last post it should have read: > > That X[Y] is not the same as Y[X] is analogous to the fact that > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > wrote: > > Assuming same-named keys, then these are all the same except possibly > for row and column order: > > X[Y,,nomatch=0] > Y[X,,nomatch=0] > merge(X, Y) > merge(Y, X) > > That X[Y] is not the same as Y[X] is analogous to the fact that > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > wrote: > > Gabor, > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > to provide the same output (except for the column order and names). In that > sense, a join is a bit different from a merge, no? > > Arun > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > Yes, except that is not really what happens since match() only matches > one row whereas with mult="all", the default, all rows are matched > which is not really matching in the sense of match(). The current > naming confuses matching with joining and its really the latter that > is being done. > > Regarding the existence of merge the advantage of [ is that it will > automatically only take the columns needed so merge is not really > equivalent to [ in all respects. Furthermore having to use different > constructs for different types of merge seems awkward. > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > wrote: > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > tries to match rows of Y with rows of X, and then "nomatch" tells it what > to > do when there is *no match*. > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> > wrote: > > > To clarify - that behavior is already implemented in merge (more > specifically merge.data.table). I don't really have a view on having it in > X[Y] as well - I don't like all.x and all.y as the names, since there are > no > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > param that would do a full outer join could certainly be added. > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > wrote: > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > of the same name in the match() function. If the idea of the nomatch= > name was to leverage off existing argument names in R then I would > prefer all.y= to be consistent with merge() in place of nomatch= since > we are really merging/joining rather than just matching. That would > also allow extension to all types of join by adding all.an x= argument > too. > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > wrote: > > I would prefer nomatch=0 as a default though, simply because that's > what I > do most of the time :) > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > wrote: > > > A correction - the param is called "nomatch", not "match". > > This use case seems like smth a user shouldn't really do - in an ideal > world you should have them both keyed by the same-name column. 
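
Concretely, the "ideal world" Eduard describes is a two-line setup; the column names are arbitrary and the rename is shown only to make the point:

library(data.table)
X <- data.table(x = c("a","b","c"), foo = 1:3, key = "x")
Y <- data.table(y = c("b","c","d"), bar = c(4, 2, 3))
setnames(Y, "y", "x")  # give the join column the same name on both sides
setkey(Y, x)
X[Y]                   # the output's first column, labelled x, is now unambiguous
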
> > As is, my view on it is that data.table is correcting the user mistake > of > naming the column in Y - y, instead of x, and so the output makes > sense and > I don't see the need of complicating the behavior by adding more cases > one > has to go through to figure out what the output columns would be. > Similar to > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > column > there, would you? > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > wrote: > > > I am moving this discussion which started with mdowle to the list. > > Consider this example slightly modified from the data.table FAQ: > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > out <- X[Y]; out > > x foo bar > 1: b 3 4 > 2: b 4 4 > 3: b 5 4 > 4: c 6 2 > 5: c 7 2 > 6: d NA 3 > > Note that the first column of the output is labelled x even though > the > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > does appear in Y$y so clearly the data is coming from y as opposed to > x . In terms of SQL the above would be written: > > select Y.y as x, ... > > and the need to renamne the first column of out suggests that there > may be a deeper problem here. > > Here are some ideas to address this (they would require changes to > data.table): > > - the default of X[Y,, match=NA] would be changed to a default of > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > in SQL joins. > > - the column name of the first column in the example above would be > changed to y if match=0 but be left at x if match=NA. In the case > that match=0 (the proposed new default) x and y are equal so the > first > column can be validly labelled as x but in the case that match=NA > they > are not so y would be used as the column name. > > - the name match= does seem a bit misleading since R's match only > matches one item in the target whereas in data.table match matches > many if mult="all" and that is the default. Perhaps some thought > should be given to a name change here? > > The above would seem to correspond more closely to R's merge and SQL > join defaults. Any use cases or other comments? > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eduard.antonyan at gmail.com Fri May 3 18:56:07 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 3 May 2013 11:56:07 -0500 Subject: [datatable-help] merge/join/match In-Reply-To: References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> Message-ID: yeah, I disagree with this view. I don't think [] should pursue compatibility with merge. On Fri, May 3, 2013 at 11:54 AM, Gabor Grothendieck wrote: > I think that from the viewpoint of compatibility and convenience it > would be best to implement all.x and all.y and not rely on swapping X > and Y. SQLite did something like this (they implemented left join but > not right join based on the idea that all you have to do is swap join > arguments) but the problem with it is that it adds a layer of mental > specification effort if the actual problem is better stated in the > unsupported orientation. > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan > wrote: > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all > of > > the other merge options already exist as either X[Y] or Y[X] with or > without > > nomatch = 0/NA. > > > > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > > wrote: > >> > >> Gabor, > >> > >> Very true. I suppose your request is that the x[i] where `i` is a > >> data.table should have the same set of options like R's base `merge` > >> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by > itself. > >> However, I am not able to think of a way to do this. I mean, I find the > >> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even > though > >> > >> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > >> reordered columns) the latter 2 don't seem to make sense/is redundant > (maybe > >> it's because I am used to this syntax). > >> > >> Arun > >> > >> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > >> > >> In my last post it should have read: > >> > >> That X[Y] is not the same as Y[X] is analogous to the fact that > >> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > >> > >> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > >> wrote: > >> > >> Assuming same-named keys, then these are all the same except possibly > >> for row and column order: > >> > >> X[Y,,nomatch=0] > >> Y[X,,nomatch=0] > >> merge(X, Y) > >> merge(Y, X) > >> > >> That X[Y] is not the same as Y[X] is analogous to the fact that > >> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > >> > >> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > >> wrote: > >> > >> Gabor, > >> > >> X[Y] and Y[X] are not necessarily the same operations (meaning, they > don't > >> *have* to give the same output). However, merge(X,Y) and merge(Y,X) > *have* > >> to provide the same output (except for the column order and names). In > >> that > >> sense, a join is a bit different from a merge, no? > >> > >> Arun > >> > >> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > >> > >> Yes, except that is not really what happens since match() only matches > >> one row whereas with mult="all", the default, all rows are matched > >> which is not really matching in the sense of match(). The current > >> naming confuses matching with joining and its really the latter that > >> is being done. > >> > >> Regarding the existence of merge the advantage of [ is that it will > >> automatically only take the columns needed so merge is not really > >> equivalent to [ in all respects. 
Furthermore having to use different > >> constructs for different types of merge seems awkward. > >> > >> > >> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > >> wrote: > >> > >> Btw the way I think about the "nomatch" name is as follows - normally > X[Y] > >> tries to match rows of Y with rows of X, and then "nomatch" tells it > what > >> to > >> do when there is *no match*. > >> > >> > >> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > >> > >> wrote: > >> > >> > >> To clarify - that behavior is already implemented in merge (more > >> specifically merge.data.table). I don't really have a view on having it > in > >> X[Y] as well - I don't like all.x and all.y as the names, since there > are > >> no > >> params named 'x' and 'y' in [.data.table (as opposed to merge), but some > >> param that would do a full outer join could certainly be added. > >> > >> > >> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > >> wrote: > >> > >> > >> Yes, sorry. Its nomatch= which presumably derives from the parameter > >> of the same name in the match() function. If the idea of the nomatch= > >> name was to leverage off existing argument names in R then I would > >> prefer all.y= to be consistent with merge() in place of nomatch= since > >> we are really merging/joining rather than just matching. That would > >> also allow extension to all types of join by adding all.an x= argument > >> too. > >> > >> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > >> wrote: > >> > >> I would prefer nomatch=0 as a default though, simply because that's > >> what I > >> do most of the time :) > >> > >> > >> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > >> > >> wrote: > >> > >> > >> A correction - the param is called "nomatch", not "match". > >> > >> This use case seems like smth a user shouldn't really do - in an ideal > >> world you should have them both keyed by the same-name column. > >> > >> As is, my view on it is that data.table is correcting the user mistake > >> of > >> naming the column in Y - y, instead of x, and so the output makes > >> sense and > >> I don't see the need of complicating the behavior by adding more cases > >> one > >> has to go through to figure out what the output columns would be. > >> Similar to > >> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > >> column > >> there, would you? > >> > >> > >> > >> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > >> wrote: > >> > >> > >> I am moving this discussion which started with mdowle to the list. > >> > >> Consider this example slightly modified from the data.table FAQ: > >> > >> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > >> Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > >> out <- X[Y]; out > >> > >> x foo bar > >> 1: b 3 4 > >> 2: b 4 4 > >> 3: b 5 4 > >> 4: c 6 2 > >> 5: c 7 2 > >> 6: d NA 3 > >> > >> Note that the first column of the output is labelled x even though > >> the > >> data to produce it comes from y, e.g. "d" in out$x is not in X$x but > >> does appear in Y$y so clearly the data is coming from y as opposed to > >> x . In terms of SQL the above would be written: > >> > >> select Y.y as x, ... > >> > >> and the need to renamne the first column of out suggests that there > >> may be a deeper problem here. > >> > >> Here are some ideas to address this (they would require changes to > >> data.table): > >> > >> - the default of X[Y,, match=NA] would be changed to a default of > >> X[Y,,match=0] so that it corresponds to the defaults in R's merge and > >> in SQL joins. 
> >>
> >> - the column name of the first column in the example above would be
> >> changed to y if match=0 but be left at x if match=NA. In the case
> >> that match=0 (the proposed new default) x and y are equal so the first
> >> column can be validly labelled as x but in the case that match=NA they
> >> are not so y would be used as the column name.
> >>
> >> - the name match= does seem a bit misleading since R's match only
> >> matches one item in the target whereas in data.table match matches
> >> many if mult="all" and that is the default. Perhaps some thought
> >> should be given to a name change here?
> >>
> >> The above would seem to correspond more closely to R's merge and SQL
> >> join defaults. Any use cases or other comments?
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
> >>
> >> --
> >> Statistics & Software Consulting
> >> GKX Group, GKX Associates Inc.
> >> tel: 1-877-GKX-GROUP
> >> email: ggrothendieck at gmail.com
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com Fri May 3 19:01:06 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:01:06 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
Message-ID: <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>

I am wondering, if performing X[Y] as a "merge" in correspondence with R's
base "merge", whether the basic idea of "i" becomes confusing. That is,
when "i" is not a data.table in X[i] it indexes by rows. When `i` is a
data.table, instead of the current definition, which is on par with the
subsetting operation that uses `i` (here a data.table) as an index to
subset X and then JOIN both X and Y, we say: here X and Y are data.tables
and we perform a merge. I think this becomes confusing regarding the
purpose of `i`.

Remember that the main purpose of having the X[Y] is to have the
flexibility of using `j` to filter/subset only the desired columns. So,
for example, if you want to get 1 column of Y out of 100 columns when
joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't
go with the traditional definition of merge.

As much as I like the idea of having consistent syntax, I also love the
feature of X[Y, j].
So I'm confused as to how to deal with this. Arun On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote: > I think that from the viewpoint of compatibility and convenience it > would be best to implement all.x and all.y and not rely on swapping X > and Y. SQLite did something like this (they implemented left join but > not right join based on the idea that all you have to do is swap join > arguments) but the problem with it is that it adds a layer of mental > specification effort if the actual problem is better stated in the > unsupported orientation. > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan > wrote: > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of > > the other merge options already exist as either X[Y] or Y[X] with or without > > nomatch = 0/NA. > > > > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > > wrote: > > > > > > Gabor, > > > > > > Very true. I suppose your request is that the x[i] where `i` is a > > > data.table should have the same set of options like R's base `merge` > > > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself. > > > However, I am not able to think of a way to do this. I mean, I find the > > > syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though > > > > > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > > > reordered columns) the latter 2 don't seem to make sense/is redundant (maybe > > > it's because I am used to this syntax). > > > > > > Arun > > > > > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > > > > > In my last post it should have read: > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > > > > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > > > wrote: > > > > > > Assuming same-named keys, then these are all the same except possibly > > > for row and column order: > > > > > > X[Y,,nomatch=0] > > > Y[X,,nomatch=0] > > > merge(X, Y) > > > merge(Y, X) > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > > > > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > > > wrote: > > > > > > Gabor, > > > > > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > > > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > > > to provide the same output (except for the column order and names). In > > > that > > > sense, a join is a bit different from a merge, no? > > > > > > Arun > > > > > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > > > > > Yes, except that is not really what happens since match() only matches > > > one row whereas with mult="all", the default, all rows are matched > > > which is not really matching in the sense of match(). The current > > > naming confuses matching with joining and its really the latter that > > > is being done. > > > > > > Regarding the existence of merge the advantage of [ is that it will > > > automatically only take the columns needed so merge is not really > > > equivalent to [ in all respects. Furthermore having to use different > > > constructs for different types of merge seems awkward. 
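
Arun's X[Y, j] point, made concrete: in the sketch below extra1 and extra2 stand in for his "100 columns" of Y, and because j is evaluated during the join they are never materialised; merge() would build them all first:

library(data.table)
X <- data.table(x = c("a","b","b","c"), foo = 1:4, key = "x")
Y <- data.table(x = c("b","c"), bar = c(10, 20), extra1 = 1:2, extra2 = 3:4, key = "x")

X[Y, list(foo, bar)]            # join and keep only the wanted columns
# merge(X, Y)[, list(foo, bar)] # same two columns, but all of Y's columns
                                # get materialised along the way
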
> > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > wrote: > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what > > > to > > > do when there is *no match*. > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > > wrote: > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > specifically merge.data.table). I don't really have a view on having it in > > > X[Y] as well - I don't like all.x and all.y as the names, since there are > > > no > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > param that would do a full outer join could certainly be added. > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > wrote: > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > of the same name in the match() function. If the idea of the nomatch= > > > name was to leverage off existing argument names in R then I would > > > prefer all.y= to be consistent with merge() in place of nomatch= since > > > we are really merging/joining rather than just matching. That would > > > also allow extension to all types of join by adding all.an x= argument > > > too. > > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > > > wrote: > > > > > > I would prefer nomatch=0 as a default though, simply because that's > > > what I > > > do most of the time :) > > > > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > > > > > wrote: > > > > > > > > > A correction - the param is called "nomatch", not "match". > > > > > > This use case seems like smth a user shouldn't really do - in an ideal > > > world you should have them both keyed by the same-name column. > > > > > > As is, my view on it is that data.table is correcting the user mistake > > > of > > > naming the column in Y - y, instead of x, and so the output makes > > > sense and > > > I don't see the need of complicating the behavior by adding more cases > > > one > > > has to go through to figure out what the output columns would be. > > > Similar to > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > > > column > > > there, would you? > > > > > > > > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > > > wrote: > > > > > > > > > I am moving this discussion which started with mdowle to the list. > > > > > > Consider this example slightly modified from the data.table FAQ: > > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > > out <- X[Y]; out > > > > > > x foo bar > > > 1: b 3 4 > > > 2: b 4 4 > > > 3: b 5 4 > > > 4: c 6 2 > > > 5: c 7 2 > > > 6: d NA 3 > > > > > > Note that the first column of the output is labelled x even though > > > the > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > > > does appear in Y$y so clearly the data is coming from y as opposed to > > > x . In terms of SQL the above would be written: > > > > > > select Y.y as x, ... > > > > > > and the need to renamne the first column of out suggests that there > > > may be a deeper problem here. > > > > > > Here are some ideas to address this (they would require changes to > > > data.table): > > > > > > - the default of X[Y,, match=NA] would be changed to a default of > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > > > in SQL joins. 
> > > > > > - the column name of the first column in the example above would be > > > changed to y if match=0 but be left at x if match=NA. In the case > > > that match=0 (the proposed new default) x and y are equal so the > > > first > > > column can be validly labelled as x but in the case that match=NA > > > they > > > are not so y would be used as the column name. > > > > > > - the name match= does seem a bit misleading since R's match only > > > matches one item in the target whereas in data.table match matches > > > many if mult="all" and that is the default. Perhaps some thought > > > should be given to a name change here? > > > > > > The above would seem to correspond more closely to R's merge and SQL > > > join defaults. Any use cases or other comments? > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. > > > tel: 1-877-GKX-GROUP > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com (http://gmail.com) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri May 3 19:03:43 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 3 May 2013 19:03:43 +0200 Subject: [datatable-help] merge/join/match In-Reply-To: <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com> References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com> Message-ID: Where I say "main purpose", it should be "one of the main advantages of having" Arun On Friday, May 3, 2013 at 7:01 PM, Arunkumar Srinivasan wrote: > I am wondering if performing X[Y] as a "merge" in correspondence with R's base "merge", if the basic idea of "i" becomes confusing. That is, when "i" is not a data.table in X[i] it indexes by rows. When `i` is a data.table, instead of the current definition which is in par with the subletting operation that use `i` (here data.table) as an index to subset X and then JOIN both X and Y, we say, here X and Y are data.tables and we perform a merge. 
I think this becomes confusing regarding the purpose of `i`. > > Remember that the main purpose of having the X[Y] is to have the flexibility of using `j` to to filter/subset only the desired columns. So, for example if you want to get 1 column of Y out of 100 columns when joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't go with the traditional definition of merge. > > As much as I like the idea of having consistent syntax, I also love the feature of X[Y, j]. So I'm confused as to how to deal with this. > > Arun > > > On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote: > > > I think that from the viewpoint of compatibility and convenience it > > would be best to implement all.x and all.y and not rely on swapping X > > and Y. SQLite did something like this (they implemented left join but > > not right join based on the idea that all you have to do is swap join > > arguments) but the problem with it is that it adds a layer of mental > > specification effort if the actual problem is better stated in the > > unsupported orientation. > > > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan > > wrote: > > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of > > > the other merge options already exist as either X[Y] or Y[X] with or without > > > nomatch = 0/NA. > > > > > > > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan > > > wrote: > > > > > > > > Gabor, > > > > > > > > Very true. I suppose your request is that the x[i] where `i` is a > > > > data.table should have the same set of options like R's base `merge` > > > > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself. > > > > However, I am not able to think of a way to do this. I mean, I find the > > > > syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though > > > > > > > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > > > > reordered columns) the latter 2 don't seem to make sense/is redundant (maybe > > > > it's because I am used to this syntax). > > > > > > > > Arun > > > > > > > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > > > > > > > In my last post it should have read: > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > > > > > > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > > > > wrote: > > > > > > > > Assuming same-named keys, then these are all the same except possibly > > > > for row and column order: > > > > > > > > X[Y,,nomatch=0] > > > > Y[X,,nomatch=0] > > > > merge(X, Y) > > > > merge(Y, X) > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > > > > > > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > > > > wrote: > > > > > > > > Gabor, > > > > > > > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > > > > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > > > > to provide the same output (except for the column order and names). In > > > > that > > > > sense, a join is a bit different from a merge, no? 
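
The asymmetry being contrasted in this exchange can be seen directly on a pair of illustrative keyed tables:

library(data.table)
X <- data.table(k = c("a","b"), foo = 1:2, key = "k")
Y <- data.table(k = c("b","c","d"), bar = c(4, 2, 3), key = "k")

nrow(X[Y])   # 3: the join is driven by Y, so b, c and d all appear
nrow(Y[X])   # 2: driven by X, so only a and b appear
merge(X, Y)  # 1 row (b); merge(Y, X) returns the same row, columns reordered
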
> > > > > > > > Arun > > > > > > > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > > > > > > > Yes, except that is not really what happens since match() only matches > > > > one row whereas with mult="all", the default, all rows are matched > > > > which is not really matching in the sense of match(). The current > > > > naming confuses matching with joining and its really the latter that > > > > is being done. > > > > > > > > Regarding the existence of merge the advantage of [ is that it will > > > > automatically only take the columns needed so merge is not really > > > > equivalent to [ in all respects. Furthermore having to use different > > > > constructs for different types of merge seems awkward. > > > > > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > > wrote: > > > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what > > > > to > > > > do when there is *no match*. > > > > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > > > > wrote: > > > > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > > specifically merge.data.table). I don't really have a view on having it in > > > > X[Y] as well - I don't like all.x and all.y as the names, since there are > > > > no > > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > > param that would do a full outer join could certainly be added. > > > > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > > wrote: > > > > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > > of the same name in the match() function. If the idea of the nomatch= > > > > name was to leverage off existing argument names in R then I would > > > > prefer all.y= to be consistent with merge() in place of nomatch= since > > > > we are really merging/joining rather than just matching. That would > > > > also allow extension to all types of join by adding all.an x= argument > > > > too. > > > > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > > > > wrote: > > > > > > > > I would prefer nomatch=0 as a default though, simply because that's > > > > what I > > > > do most of the time :) > > > > > > > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > > > > > > > wrote: > > > > > > > > > > > > A correction - the param is called "nomatch", not "match". > > > > > > > > This use case seems like smth a user shouldn't really do - in an ideal > > > > world you should have them both keyed by the same-name column. > > > > > > > > As is, my view on it is that data.table is correcting the user mistake > > > > of > > > > naming the column in Y - y, instead of x, and so the output makes > > > > sense and > > > > I don't see the need of complicating the behavior by adding more cases > > > > one > > > > has to go through to figure out what the output columns would be. > > > > Similar to > > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > > > > column > > > > there, would you? > > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > > > > wrote: > > > > > > > > > > > > I am moving this discussion which started with mdowle to the list. 
> > > > > > > > Consider this example slightly modified from the data.table FAQ: > > > > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") > > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > > > out <- X[Y]; out > > > > > > > > x foo bar > > > > 1: b 3 4 > > > > 2: b 4 4 > > > > 3: b 5 4 > > > > 4: c 6 2 > > > > 5: c 7 2 > > > > 6: d NA 3 > > > > > > > > Note that the first column of the output is labelled x even though > > > > the > > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x but > > > > does appear in Y$y so clearly the data is coming from y as opposed to > > > > x . In terms of SQL the above would be written: > > > > > > > > select Y.y as x, ... > > > > > > > > and the need to renamne the first column of out suggests that there > > > > may be a deeper problem here. > > > > > > > > Here are some ideas to address this (they would require changes to > > > > data.table): > > > > > > > > - the default of X[Y,, match=NA] would be changed to a default of > > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge and > > > > in SQL joins. > > > > > > > > - the column name of the first column in the example above would be > > > > changed to y if match=0 but be left at x if match=NA. In the case > > > > that match=0 (the proposed new default) x and y are equal so the > > > > first > > > > column can be validly labelled as x but in the case that match=NA > > > > they > > > > are not so y would be used as the column name. > > > > > > > > - the name match= does seem a bit misleading since R's match only > > > > matches one item in the target whereas in data.table match matches > > > > many if mult="all" and that is the default. Perhaps some thought > > > > should be given to a name change here? > > > > > > > > The above would seem to correspond more closely to R's merge and SQL > > > > join defaults. Any use cases or other comments? > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > _______________________________________________ > > > > datatable-help mailing list > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > _______________________________________________ > > > > datatable-help mailing list > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. 
> > > > tel: 1-877-GKX-GROUP
> > > > email: ggrothendieck at gmail.com (http://gmail.com)
> >
> > --
> > Statistics & Software Consulting
> > GKX Group, GKX Associates Inc.
> > tel: 1-877-GKX-GROUP
> > email: ggrothendieck at gmail.com (http://gmail.com)
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From aragorn168b at gmail.com Fri May 3 19:09:42 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:09:42 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com> <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

The confusion may very well be due to the fact that X[Y] is not just a
subset of X based on X's and Y's key columns, but rather a `join` (both
X's and Y's columns are "visible" and joined). But then, that was itself
due to a feature request, FR #746.

Arun

On Friday, May 3, 2013 at 7:03 PM, Arunkumar Srinivasan wrote:
> Where I say "main purpose", it should be "one of the main advantages of having"
>
> Arun
>
> On Friday, May 3, 2013 at 7:01 PM, Arunkumar Srinivasan wrote:
> > I am wondering if performing X[Y] as a "merge" in correspondence with
> > R's base "merge", if the basic idea of "i" becomes confusing. That is,
> > when "i" is not a data.table in X[i] it indexes by rows. When `i` is a
> > data.table, instead of the current definition which is in par with the
> > subletting operation that use `i` (here data.table) as an index to
> > subset X and then JOIN both X and Y, we say, here X and Y are
> > data.tables and we perform a merge. I think this becomes confusing
> > regarding the purpose of `i`.
> >
> > Remember that the main purpose of having the X[Y] is to have the
> > flexibility of using `j` to to filter/subset only the desired columns.
> > So, for example if you want to get 1 column of Y out of 100 columns
> > when joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it
> > doesn't go with the traditional definition of merge.
> >
> > As much as I like the idea of having consistent syntax, I also love
> > the feature of X[Y, j]. So I'm confused as to how to deal with this.
> >
> > Arun
> >
> > On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
> > > I think that from the viewpoint of compatibility and convenience it
> > > would be best to implement all.x and all.y and not rely on swapping X
> > > and Y. SQLite did something like this (they implemented left join but
> > > not right join based on the idea that all you have to do is swap join
> > > arguments) but the problem with it is that it adds a layer of mental
> > > specification effort if the actual problem is better stated in the
> > > unsupported orientation.
> > >
> > > On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan
> > > wrote:
> > > > Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of
> > > > the other merge options already exist as either X[Y] or Y[X] with or without
> > > > nomatch = 0/NA.
> > > >
> > > > On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan
> > > > wrote:
> > > > > Gabor,
> > > > >
> > > > > Very true. I suppose your request is that the x[i] where `i` is a
> > > > > data.table should have the same set of options like R's base `merge`
> > > > > function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by itself.
> > > > > However, I am not able to think of a way to do this. I mean, I find the
> > > > > syntax X[Y, by.x=TRUE] weird / not making sense.
That is, to me even though > > > > > > > > > > X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the > > > > > reordered columns) the latter 2 don't seem to make sense/is redundant (maybe > > > > > it's because I am used to this syntax). > > > > > > > > > > Arun > > > > > > > > > > On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: > > > > > > > > > > In my last post it should have read: > > > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > > merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) > > > > > > > > > > On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > Assuming same-named keys, then these are all the same except possibly > > > > > for row and column order: > > > > > > > > > > X[Y,,nomatch=0] > > > > > Y[X,,nomatch=0] > > > > > merge(X, Y) > > > > > merge(Y, X) > > > > > > > > > > That X[Y] is not the same as Y[X] is analogous to the fact that > > > > > merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) > > > > > > > > > > On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan > > > > > wrote: > > > > > > > > > > Gabor, > > > > > > > > > > X[Y] and Y[X] are not necessarily the same operations (meaning, they don't > > > > > *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* > > > > > to provide the same output (except for the column order and names). In > > > > > that > > > > > sense, a join is a bit different from a merge, no? > > > > > > > > > > Arun > > > > > > > > > > On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > > > > > > > > > > Yes, except that is not really what happens since match() only matches > > > > > one row whereas with mult="all", the default, all rows are matched > > > > > which is not really matching in the sense of match(). The current > > > > > naming confuses matching with joining and its really the latter that > > > > > is being done. > > > > > > > > > > Regarding the existence of merge the advantage of [ is that it will > > > > > automatically only take the columns needed so merge is not really > > > > > equivalent to [ in all respects. Furthermore having to use different > > > > > constructs for different types of merge seems awkward. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > > > > > wrote: > > > > > > > > > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > > > > tries to match rows of Y with rows of X, and then "nomatch" tells it what > > > > > to > > > > > do when there is *no match*. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan > > > > > > > > > > wrote: > > > > > > > > > > > > > > > To clarify - that behavior is already implemented in merge (more > > > > > specifically merge.data.table). I don't really have a view on having it in > > > > > X[Y] as well - I don't like all.x and all.y as the names, since there are > > > > > no > > > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > > > param that would do a full outer join could certainly be added. > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > > > wrote: > > > > > > > > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > > > of the same name in the match() function. 
> > > > > If the idea of the nomatch= name was to leverage off existing
> > > > > argument names in R then I would prefer all.y=, to be consistent with
> > > > > merge(), in place of nomatch=, since we are really merging/joining
> > > > > rather than just matching. That would also allow extension to all
> > > > > types of join by adding an all.x= argument too.
> > > > >
> > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan wrote:
> > > > >
> > > > > I would prefer nomatch=0 as a default though, simply because that's
> > > > > what I do most of the time :)
> > > > >
> > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan wrote:
> > > > >
> > > > > A correction - the param is called "nomatch", not "match".
> > > > >
> > > > > This use case seems like smth a user shouldn't really do - in an ideal
> > > > > world you should have them both keyed by the same-name column.
> > > > >
> > > > > As is, my view on it is that data.table is correcting the user mistake of
> > > > > naming the column in Y - y, instead of x - and so the output makes sense,
> > > > > and I don't see the need of complicating the behavior by adding more cases
> > > > > one has to go through to figure out what the output columns would be.
> > > > > Similar to asking for X[J(c("b", "c", "d"))] - you wouldn't want an
> > > > > anonymous column there, would you?
> > > > >
> > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck wrote:
> > > > >
> > > > > I am moving this discussion which started with mdowle to the list.
> > > > >
> > > > > --
> > > > > Statistics & Software Consulting
> > > > > GKX Group, GKX Associates Inc.
> > > > > tel: 1-877-GKX-GROUP
> > > > > email: ggrothendieck at gmail.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Fri May  3 19:14:03 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:14:03 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
 <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

Gabor,

I agree partially with your post in that, since X[Y] *is* a join (/merge), it could also give a "full" join. So,

X[Y]           <~~~ current usage, equivalent to merge(X, Y, by.y=TRUE)
X[Y, all=TRUE] <~~~ equivalent to merge(X, Y, all=TRUE)

Similarly,

Y[X]           <~~~ current usage, equivalent to merge(Y, X, by.x=TRUE)
Y[X, all=TRUE] <~~~ equivalent to merge(Y, X, all=TRUE)

But X[Y, all.x=TRUE] and Y[X, all.y=TRUE] don't make sense to me, as the operation is clear in that you use Y as an index. What do you think?

Arun

On Friday, May 3, 2013 at 7:09 PM, Arunkumar Srinivasan wrote:
> The confusion may very well be due to the fact that X[Y] is not just a subset of X based on X and Y's key columns, but rather a `join` (both X and Y's columns are "visible" and joined). But then that was by itself due to a feature request, FR #746.
>
> Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Fri May  3 19:23:04 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:23:04 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: 
Message-ID: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>

Eddi, still you can add this line to the top of every R script or, even better, directly in the .Rprofile file wherever you run.

Arun

On Friday, May 3, 2013 at 5:51 PM, Eduard Antonyan wrote:
> Good point - I might do that, though I'll need to be a bit careful as I run a lot of scripts on remote computers.
>
> On Fri, May 3, 2013 at 10:48 AM, Arunkumar Srinivasan wrote:
> > Eddi,
> >
> > You could just set: options(datatable.nomatch = 0) if you use that extensively.
> >
> > Arun

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From eduard.antonyan at gmail.com  Fri May  3 19:41:24 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 12:41:24 -0500
Subject: [datatable-help] merge/join/match
In-Reply-To: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>
References: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>
Message-ID: 

I don't like putting it into .Rprofile that much, because then I won't be able to share code. I might put it in scripts where I do this a lot inside a single script, but the situation is more like many little scripts that have one or maybe two merges inside, and it's currently less work to write nomatch=0 than to change the option at the top. .Rprofile would've been a decent solution if there wasn't the code-sharing constraint.

It's a mess either way, what can I say :)

On Fri, May 3, 2013 at 12:23 PM, Arunkumar Srinivasan wrote:
> Eddi, still you can add this line to the top of every R script or, even better, directly in the .Rprofile file wherever you run.
>
> Arun

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From aragorn168b at gmail.com  Fri May  3 19:48:30 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 3 May 2013 19:48:30 +0200
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <052D4225A53C4ECF8DC72E5E2A566BD1@gmail.com>
Message-ID: 

Mess, yes indeed! :) If you've got too many little scripts lying around, then maybe it's time to make a package (assuming they're related, or at least some independent functionalities / utility scripts) and use the options parameter once in one file? That was my last dose of ideas. I'm all out now :)

Arun

On Friday, May 3, 2013 at 7:41 PM, Eduard Antonyan wrote:
> I don't like putting it into .Rprofile that much, because then I won't be able to share code.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
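Arun's suggestion in script form - a minimal sketch, assuming (as the thread does) that [.data.table reads its default for nomatch from options("datatable.nomatch"):

library(data.table)

X <- data.table(k = c("a", "b"), v = 1:2, key = "k")
Y <- data.table(k = c("b", "d"), key = "k")

X[Y]                             # shipped default nomatch=NA: "d" kept with v = NA
options(datatable.nomatch = 0)   # e.g. at the top of a script, or in .Rprofile
X[Y]                             # now behaves like X[Y, nomatch = 0]: "d" dropped
options(datatable.nomatch = NA)  # restore the shipped default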
From ggrothendieck at gmail.com  Fri May  3 22:42:06 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 16:42:06 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
 <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

One can view data.table's generalization of indexing as the
realization that all indexing can conceptually be viewed as merging:
indexing with numeric values corresponds to merging with the
data.table's row numbers, and indexing with logical values, L, is
equivalent to merging with which(L). So there are really not two
types, indexing and merging, but just one type - merging - that
covers them all.

On Fri, May 3, 2013 at 1:01 PM, Arunkumar Srinivasan wrote:
> I am wondering whether, if we perform X[Y] as a "merge" in correspondence
> with R's base "merge", the basic idea of "i" becomes confusing.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
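Gabor's observation can be made concrete with a sketch. The table below is hypothetical: rn is an explicit row-number column used as the key, purely so that the merge form is expressible; J() is data.table's alias for list(), used to build the join table:

library(data.table)

DT <- data.table(rn = 1:5, v = letters[1:5], key = "rn")

DT[c(2L, 4L)]     # ordinary numeric indexing
DT[J(c(2L, 4L))]  # the same rows, obtained by merging with row numbers

L <- DT$v %in% c("b", "e")
DT[L]             # ordinary logical indexing
DT[J(which(L))]   # the same rows, obtained by merging with which(L)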
From ggrothendieck at gmail.com  Sat May  4 00:41:00 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 18:41:00 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To: 
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
 <9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID: 

In thinking about this a bit more I can see the argument for leaving
the default at nomatch=NA. Consider these examples of indexing:

> letters[27]
[1] NA

> BOD[7,]
   Time demand
NA   NA     NA

nomatch=NA seems more compatible with these examples than nomatch=0.

(At the same time, this does not mean we could not also change the
argument name from nomatch= to all.y= and add the other merge
arguments (all.x=, by.x=, by.y=, by=) as well, since it remains the
case that R's merge() seems closer than R's match() to this
functionality regardless of the default.)

On Fri, May 3, 2013 at 4:42 PM, Gabor Grothendieck wrote:
> One can view data.table's generalization of indexing as the
> realization that all indexing can conceptually be viewed as merging.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
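The data.table analogue of those base R examples, as a sketch (X here is a small keyed table in the style of the FAQ example quoted earlier in the thread):

library(data.table)

X <- data.table(k = c("a", "b", "c"), foo = 1:3, key = "k")

X[J("d")]               # default nomatch=NA: one row with foo = NA, like letters[27]
X[J("d"), nomatch = 0]  # empty result, like the merge()/SQL inner-join default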
From ggrothendieck at gmail.com  Sat May  4 00:50:28 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 18:50:28 -0400
Subject: [datatable-help] indexing with nomatch=0
Message-ID: 

Consider this example:

> DT[1:4,,nomatch=0]
    a
1:  a
2:  b
3:  c
4: NA

Should it not return only the first 3 rows? It seems to be ignoring
the nomatch=0.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From ggrothendieck at gmail.com  Sat May  4 00:52:31 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 18:52:31 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

The definition of DT was left out by mistake. It should be:

DT <- data.table(a=letters[1:3])

On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck wrote:
> Consider this example:
>
> > DT[1:4,,nomatch=0]
>     a
> 1:  a
> 2:  b
> 3:  c
> 4: NA
>
> Should it not return only the first 3 rows? It seems to be ignoring
> the nomatch=0.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From eduard.antonyan at gmail.com  Sat May  4 00:54:33 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 17:54:33 -0500
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

There is no join'ing happening here, thus nomatch=0 has no effect.

On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck wrote:
> The definition of DT was left out by mistake. It should be:
>
> DT <- data.table(a=letters[1:3])

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From lianoglou.steve at gene.com  Sat May  4 01:00:28 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 3 May 2013 16:00:28 -0700
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

On Fri, May 3, 2013 at 3:54 PM, Eduard Antonyan wrote:
> There is no join'ing happening here, thus nomatch=0 has no effect.

Indeed -- and the result is consistent with doing the same thing to a
base::data.frame:

R> df <- data.frame(a=letters[1:3])
R> df[1:4,,drop=FALSE]
    a
1 | a
2 | b
3 | c
NA| NA

(my print.data.frame function is a monkey-patched version of the
print.data.table function, which is why the output looks a bit
different than what you're likely to see)

-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech

From ggrothendieck at gmail.com  Sat May  4 01:54:50 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 19:54:50 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

data.table is supposed to generalize indexing, and although not
explicitly stated, the generalization seems to be that indexing is
merging with the row numbers, so there is indeed merging going on and
that merging should respect nomatch= for consistency.

On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan wrote:
> There is no join'ing happening here, thus nomatch=0 has no effect.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
From ggrothendieck at gmail.com  Sat May  4 01:55:15 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 19:55:15 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

That is only relevant if nomatch=NA.

On Fri, May 3, 2013 at 7:00 PM, Steve Lianoglou wrote:
> Indeed -- and the result is consistent with doing the same thing to a
> base::data.frame.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From eduard.antonyan at gmail.com  Sat May  4 02:02:39 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 3 May 2013 19:02:39 -0500
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: 
References: 
Message-ID: 

I think I like this proposal - maybe you should write up a few examples
of what current behavior is vs. the proposed behavior.

On Fri, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
> data.table is supposed to generalize indexing, and although not
> explicitly stated, the generalization seems to be that indexing is
> merging with the row numbers, so there is indeed merging going on and
> that merging should respect nomatch= for consistency.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
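Such a write-up could start from a sketch like the following. The "proposed" behaviour does not exist in data.table, so it is emulated here by filtering the subscript by hand:

library(data.table)

DT <- data.table(a = letters[1:3])

DT[1:4, , nomatch = 0]                 # current: nomatch= has no effect on numeric i,
                                       # so a 4th all-NA row is returned
DT[intersect(1:4, seq_len(nrow(DT)))]  # emulated "proposed" nomatch=0: out-of-range
                                       # subscripts dropped, only 3 rows returned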
> > On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan > wrote: > > There is no join'ing happening here, thus nomatch=0 has no effect. > > > > > > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck < > ggrothendieck at gmail.com> > > wrote: > >> > >> The definition of DT was left out by mistake. It should be: > >> > >> DT <- data.table(a=letters[1:3]) > >> > >> > >> On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck > >> wrote: > >> > Consider this example: > >> > > >> >> DT[1:4,,nomatch=0] > >> > a > >> > 1: a > >> > 2: b > >> > 3: c > >> > 4: NA > >> > > >> > Should it not return only the first 3 rows? It seems to be ignoring > >> > the nomatch=0. > >> > > >> > -- > >> > Statistics & Software Consulting > >> > GKX Group, GKX Associates Inc. > >> > tel: 1-877-GKX-GROUP > >> > email: ggrothendieck at gmail.com > >> > >> > >> > >> -- > >> Statistics & Software Consulting > >> GKX Group, GKX Associates Inc. > >> tel: 1-877-GKX-GROUP > >> email: ggrothendieck at gmail.com > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat May 4 02:20:07 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 4 May 2013 02:20:07 +0200 Subject: [datatable-help] =?utf-8?q?indexing_with_nomatch=3D0?= In-Reply-To: References: Message-ID: "Indexing is merging with row numbers, so indeed there's a merging going on" - I hadn't seen it this way until now. But I like this. I see why you expect `nomatch=0` to work on indexing as well. And it makes sense to me. But I am not so much inclined towards the implementation of `merge`-like operations in X[Y] syntax. I'd love to be convinced. I just can't get my mind around the usage X[Y, all.X = TRUE] and even more X[Y, list(2 columns of X, 1 column of Y), all.X=TRUE]. I could just do Y[X, ?] which makes more sense here. I am unable to wrap my head around the need for this feature... Arun On Saturday, May 4, 2013 at 1:54 AM, Gabor Grothendieck wrote: > data.table is supposed to generalize indexing and although not > explicitly stated the generalization seems to be that indexing is > merging with the row numbers so there is indeed merging going on and > that merging should respect nomatch= for consistency. > > On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan > wrote: > > There is no join'ing happening here, thus nomatch=0 has no effect. > > > > > > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck > > wrote: > > > > > > The definition of DT was left out by mistake. It should be: > > > > > > DT <- data.table(a=letters[1:3]) > > > > > > > > > On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck > > > wrote: > > > > Consider this example: > > > > > > > > > DT[1:4,,nomatch=0] > > > > a > > > > 1: a > > > > 2: b > > > > 3: c > > > > 4: NA > > > > > > > > Should it not return only the first 3 rows? It seems to be ignoring > > > > the nomatch=0. > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > -- > > > Statistics & Software Consulting > > > GKX Group, GKX Associates Inc. 
From ggrothendieck at gmail.com  Sat May  4 04:18:33 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 3 May 2013 22:18:33 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To:
References:
Message-ID:

On Fri, May 3, 2013 at 8:20 PM, Arunkumar Srinivasan wrote:
> "Indexing is merging with row numbers, so indeed there's a merging going
> on" - I hadn't seen it this way until now. But I like this. I see why you
> expect `nomatch=0` to work on indexing as well. And it makes sense to me.
>
> But I am not so much inclined towards the implementation of `merge`-like
> operations in X[Y] syntax. I'd love to be convinced. I just can't get my
> mind around the usage X[Y, all.X = TRUE] and even more X[Y, list(2 columns
> of X, 1 column of Y), all.X=TRUE]. I could just do Y[X, ...] which makes
> more sense here. I am unable to wrap my head around the need for this
> feature...

I think many people find data.table confusing until they put substantial
time into it, and if one can leverage their existing knowledge of R then
it should be easier to understand. all.y= would have the exact same
meaning in merge and in [.data.table, so one would immediately know what
to expect if one knew merge. I don't think the same can be said for
nomatch since match() is not really the same thing as merge.

The downsides seem to be:

- It does seem that in order to be consistent with how subscripting works,
all.y = TRUE would need to be the default for data.table whereas all.y =
FALSE is the default for merge.

- all.y seems important to have but all.x is less important, although it
might be included for completeness and symmetry even if less useful.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
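The claimed correspondence between the two spellings, runnable as a quick
check (hypothetical keyed tables):

library(data.table)
X <- data.table(id = c("a", "b", "c"), v = 1:3, key = "id")
Y <- data.table(id = c("b", "c", "d"), w = 4:6, key = "id")
merge(X, Y)        # inner join: rows b and c only
X[Y,,nomatch=0]    # same rows: behaves like merge(X, Y), i.e. all.y=FALSE
X[Y]               # behaves like merge(X, Y, all.y=TRUE): d kept, v is NA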
From ggrothendieck at gmail.com  Sat May  4 11:46:10 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 4 May 2013 05:46:10 -0400
Subject: [datatable-help] merge/join/match
In-Reply-To:
References: <9F2AA0C6058C4CD69CEFB5662A0CC3A5@gmail.com>
	<9D8A261B7D1D47D1B12576E30A793FD9@gmail.com>
Message-ID:

One further comment on nomatch=0 weirdness. It seems that the value
of nomatch= is the row index of the row of X to return if a row in Y
matches no row in X here: X[Y,,nomatch=?]  In ordinary R indexing,
using an index value of 0 means drop the corresponding component and
NA means return an NA. nomatch=1 would presumably return the first
row of X for non-matching rows of Y but, in fact, nomatch= seems to be
restricted to 0 and NA as any other value generates an error message
to this effect. Likely it was decided that values other than 0 and NA
would be too bizarre and most likely represent user error. If all.y=
were used then it would naturally be logical and this artificial
distinction (i.e. between 0/NA on the one hand and everything else on
the other) would not have to be made.

On Fri, May 3, 2013 at 6:41 PM, Gabor Grothendieck wrote:
> In thinking about this a bit more I can see the argument for leaving
> the default at nomatch=NA. Consider these examples of indexing:
>
>> letters[27]
> [1] NA
>> BOD[7,]
>    Time demand
> NA   NA     NA
>
> nomatch=NA seems more compatible with these examples than nomatch=0.
>
> (At the same time this does not mean we could not also change the
> argument name from nomatch= to all.y= and add the other merge
> arguments (all.x=, by.x=, by.y=, by=) as well since it remains the
> case that R's merge() seems closer than R's match() to this
> functionality regardless of the default.)
>
> On Fri, May 3, 2013 at 4:42 PM, Gabor Grothendieck wrote:
>> One can view data.table's generalization of indexing as the
>> realization that all indexing can conceptually be viewed as merging,
>> where indexing with numeric values corresponds to merging with the
>> data.table's row numbers and indexing with logical values, L, is
>> equivalent to merging with which(L), so there are really not two types,
>> indexing and merging, but just one type, merging, that covers them all.
>>
>> On Fri, May 3, 2013 at 1:01 PM, Arunkumar Srinivasan wrote:
>>> I am wondering, if we perform X[Y] as a "merge" in correspondence with
>>> R's base "merge", whether the basic idea of "i" becomes confusing. That
>>> is, when "i" is not a data.table in X[i] it indexes by rows. When `i` is
>>> a data.table, instead of the current definition, which is on par with
>>> the subsetting operation that uses `i` (here a data.table) as an index
>>> to subset X and then JOIN both X and Y, we say, here X and Y are
>>> data.tables and we perform a merge. I think this becomes confusing
>>> regarding the purpose of `i`.
>>>
>>> Remember that the main purpose of having the X[Y] is to have the
>>> flexibility of using `j` to filter/subset only the desired columns. So,
>>> for example, if you want to get 1 column of Y out of 100 columns when
>>> joining, you do: X[Y, list(cols_of_x, one_col_of_y)] and here, it
>>> doesn't go with the traditional definition of merge.
>>>
>>> As much as I like the idea of having consistent syntax, I also love the
>>> feature of X[Y, j]. So I'm confused as to how to deal with this.
>>>
>>> Arun
>>>
>>> On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
>>>
>>> I think that from the viewpoint of compatibility and convenience it
>>> would be best to implement all.x and all.y and not rely on swapping X
>>> and Y. SQLite did something like this (they implemented left join but
>>> not right join based on the idea that all you have to do is swap join
>>> arguments) but the problem with it is that it adds a layer of mental
>>> specification effort if the actual problem is better stated in the
>>> unsupported orientation.
>>>
>>> On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan wrote:
>>>
>>> Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all
>>> of the other merge options already exist as either X[Y] or Y[X] with or
>>> without nomatch = 0/NA.
>>>
>>> On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan wrote:
>>>
>>> Gabor,
>>>
>>> Very true. I suppose your request is that the x[i] where `i` is a
>>> data.table should have the same set of options as R's base `merge`
>>> function, like by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by
>>> itself. However, I am not able to think of a way to do this. I mean, I
>>> find the syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me,
>>> even though X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE]
>>> (ignoring the reordered columns), the latter 2 don't seem to make sense
>>> / seem redundant (maybe it's because I am used to this syntax).
>>>
>>> Arun
>>>
>>> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote:
>>>
>>> In my last post it should have read:
>>>
>>> That X[Y] is not the same as Y[X] is analogous to the fact that
>>> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE)
>>>
>>> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck wrote:
>>>
>>> Assuming same-named keys, then these are all the same except possibly
>>> for row and column order:
>>>
>>> X[Y,,nomatch=0]
>>> Y[X,,nomatch=0]
>>> merge(X, Y)
>>> merge(Y, X)
>>>
>>> That X[Y] is not the same as Y[X] is analogous to the fact that
>>> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE)
>>>
>>> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan wrote:
>>>
>>> Gabor,
>>>
>>> X[Y] and Y[X] are not necessarily the same operations (meaning, they
>>> don't *have* to give the same output). However, merge(X,Y) and merge(Y,X)
>>> *have* to provide the same output (except for the column order and
>>> names). In that sense, a join is a bit different from a merge, no?
>>>
>>> Arun
>>>
>>> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote:
>>>
>>> Yes, except that is not really what happens since match() only matches
>>> one row whereas with mult="all", the default, all rows are matched,
>>> which is not really matching in the sense of match(). The current
>>> naming confuses matching with joining and it's really the latter that
>>> is being done.
>>>
>>> Regarding the existence of merge, the advantage of [ is that it will
>>> automatically only take the columns needed, so merge is not really
>>> equivalent to [ in all respects. Furthermore, having to use different
>>> constructs for different types of merge seems awkward.
>>>
>>> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan wrote:
>>>
>>> Btw the way I think about the "nomatch" name is as follows - normally
>>> X[Y] tries to match rows of Y with rows of X, and then "nomatch" tells
>>> it what to do when there is *no match*.
>>>
>>> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan wrote:
>>>
>>> To clarify - that behavior is already implemented in merge (more
>>> specifically merge.data.table). I don't really have a view on having it
>>> in X[Y] as well - I don't like all.x and all.y as the names, since there
>>> are no params named 'x' and 'y' in [.data.table (as opposed to merge),
>>> but some param that would do a full outer join could certainly be added.
>>>
>>> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck wrote:
>>>
>>> Yes, sorry. It's nomatch=, which presumably derives from the parameter
>>> of the same name in the match() function. If the idea of the nomatch=
>>> name was to leverage off existing argument names in R then I would
>>> prefer all.y=, to be consistent with merge(), in place of nomatch= since
>>> we are really merging/joining rather than just matching. That would
>>> also allow extension to all types of join by adding an all.x= argument
>>> too.
>>>
>>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan wrote:
>>>
>>> I would prefer nomatch=0 as a default though, simply because that's
>>> what I do most of the time :)
>>>
>>> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan wrote:
>>>
>>> A correction - the param is called "nomatch", not "match".
>>>
>>> This use case seems like smth a user shouldn't really do - in an ideal
>>> world you should have them both keyed by the same-name column.
>>>
>>> As is, my view on it is that data.table is correcting the user mistake
>>> of naming the column in Y - y, instead of x, and so the output makes
>>> sense and I don't see the need of complicating the behavior by adding
>>> more cases one has to go through to figure out what the output columns
>>> would be. Similar to asking for X[J(c("b", "c", "d"))] - you wouldn't
>>> want an anonymous column there, would you?
>>>
>>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck wrote:
>>>
>>> I am moving this discussion which started with mdowle to the list.
>>>
>>> Consider this example slightly modified from the data.table FAQ:
>>>
>>> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
>>> Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
>>> out <- X[Y]; out
>>>
>>>    x foo bar
>>> 1: b   3   4
>>> 2: b   4   4
>>> 3: b   5   4
>>> 4: c   6   2
>>> 5: c   7   2
>>> 6: d  NA   3
>>>
>>> Note that the first column of the output is labelled x even though the
>>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but
>>> does appear in Y$y, so clearly the data is coming from y as opposed to
>>> x. In terms of SQL the above would be written:
>>>
>>> select Y.y as x, ...
>>>
>>> and the need to rename the first column of out suggests that there
>>> may be a deeper problem here.
>>>
>>> Here are some ideas to address this (they would require changes to
>>> data.table):
>>>
>>> - the default of X[Y,,match=NA] would be changed to a default of
>>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and
>>> in SQL joins.
>>>
>>> - the column name of the first column in the example above would be
>>> changed to y if match=0 but be left at x if match=NA. In the case
>>> that match=0 (the proposed new default) x and y are equal so the first
>>> column can be validly labelled as x, but in the case that match=NA they
>>> are not, so y would be used as the column name.
>>>
>>> - the name match= does seem a bit misleading since R's match only
>>> matches one item in the target whereas in data.table match matches
>>> many if mult="all" and that is the default. Perhaps some thought
>>> should be given to a name change here?
>>>
>>> The above would seem to correspond more closely to R's merge and SQL
>>> join defaults. Any use cases or other comments?

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
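For reference, the FAQ example the quoted thread keeps coming back to,
runnable as-is:

library(data.table)
X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x")
Y = data.table(y=c("b","c","d"), bar=c(4,2,3))
X[Y]              # first column is named x, yet its "d" comes from Y$y
X[Y,,nomatch=0]   # inner-join rows only; x and y then agree, so the label is safe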
From ggrothendieck at gmail.com  Sat May  4 13:26:18 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 4 May 2013 07:26:18 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To:
References:
Message-ID:

The proposal at this point would be:

1. nomatch= would be replaced by all.i= such that
X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE]
X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE]
nomatch= would be deprecated and ultimately removed.

Note that #1 is simple to implement as it only involves changing names
and values of arguments and does not really change any behavior;
however, it's easier to think about because X[Y,,all.i=Z] now has the
same behavior as merge(X, Y, all.y=Z) and so can be quickly understood
by anyone who knows merge in R. In contrast, nomatch= did not even
have the same meaning as in match() since match matches the first
occurrence whereas with mult="all", the default, matching in
data.table matches all occurrences. Note that the default of merge's
all.y= is all.y=FALSE but the default of all.i= is all.i=TRUE in order
that the default behave as indices do. Also note that this solves the
problem that nomatch= can only be 0 or NA since a logical can only
have two non-NA values anyway.

2. If Y were a numeric index vector then all.i= will have the same
effect as if Y were a data.table with Y as its column and is merged
with the row numbers of X. e.g. X[1:4,,all.i=FALSE] would be the
same as X[1:3] if X only had 3 rows, since 4 does not match a row
number of X and is dropped because all.i=FALSE. If Y were a numeric
vector with negative values it would be converted to one with positive
values in such a way as to have the established meaning and then the
same strategy is applied. If Y were logical then it's recycled, giving
YY, and the same strategy is applied to which(YY). This description is
intended to be conceptual and the actual internal mechanism could be
different.

Thus #2 allows one to think of **all** i indexing as merging rather
than as multiple separate concepts (which I believe is consistent with
the original intention of data.table).

On Fri, May 3, 2013 at 8:02 PM, Eduard Antonyan wrote:
> I think I like this proposal - maybe you should write up a few examples
> of what current behavior is vs the proposed behavior.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
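The proposal in runnable terms. all.i= is hypothetical and exists in no
data.table release, so the proposed calls are left commented; only the
current-behavior lines execute:

library(data.table)
DT <- data.table(a = letters[1:3], key = "a")
DT[1:4]                    # today: 4th row all NA, like nomatch=NA
DT[1:4, , nomatch = 0]     # today: nomatch ignored for row numbers
# DT[1:4, , all.i = FALSE] # proposed: unmatched row number 4 dropped -> DT[1:3]
# X[Y, , all.i = TRUE]     # proposed spelling of X[Y, , nomatch = NA]
# X[Y, , all.i = FALSE]    # proposed spelling of X[Y, , nomatch = 0]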
From aragorn168b at gmail.com  Sat May  4 13:35:33 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 4 May 2013 13:35:33 +0200
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To:
References:
Message-ID: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com>

Gabor,
Both points I agree with. It brings enough clarity and consistency to the
syntax.
Does this mean that you don't mind X[Y] not having all functionalities of
`merge`? Because this takes care of the confusion of `nomatch` but still
does not do all merges, iiuc.

Arun

On Saturday, May 4, 2013 at 1:26 PM, Gabor Grothendieck wrote:
> The proposal at this point would be:
>
> 1. nomatch= would be replaced by all.i= such that
> X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE]
> X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE]
> nomatch= would be deprecated and ultimately removed.
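The gap Arun is pointing at, as a sketch (hypothetical tables;
merge.data.table already supports all=, while X[Y] covers only the
one-sided cases):

library(data.table)
X <- data.table(id = c("a", "b"), v = 1:2, key = "id")
Y <- data.table(id = c("b", "c"), w = 3:4, key = "id")
merge(X, Y, all = TRUE)  # full outer join: rows a, b, c
X[Y]                     # right-style join: rows b, c
Y[X]                     # left-style counterpart: rows a, b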
From ggrothendieck at gmail.com  Sat May  4 13:40:41 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 4 May 2013 07:40:41 -0400
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com>
References: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com>
Message-ID:

I am not sure but I think that could be handled as a separate issue if
it becomes important. By using all.i= it makes it sufficiently
different from all.y= that users won't expect the same default, and
further they will not necessarily expect that there be an all argument
for the left participant in the merge.

On Sat, May 4, 2013 at 7:35 AM, Arunkumar Srinivasan wrote:
> Gabor,
> Both points I agree with. It brings enough clarity and consistency to the
> syntax.
> Does this mean that you don't mind X[Y] not having all functionalities of
> `merge`? Because this takes care of the confusion of `nomatch` but still
> does not do all merges, iiuc.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
> > Does this mean that you don't mind X[Y] not having all functionalities of > > `merge`? Because this takes care of the confusion of `nomatch` but still > > does not do all merges, iiuc. > > > > Arun > > > > On Saturday, May 4, 2013 at 1:26 PM, Gabor Grothendieck wrote: > > > > The proposal at this point would be: > > > > 1. nomatch= would be replaced by all.i= such that > > X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE] > > X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE] > > nomatch= would be deprecated and ultimately removed. > > > > Note that #1 is simple to implement as it only involves changing names > > and values of arguments and does not really change any behavior; > > however, its easier to think about because X[Y,,all.i=Z] now has the > > same behavior as merge(X, Y, all.y=Z) and so can be quickly understood > > by anyone who knows merge in R. In contrast nomatch= did not even > > have the same meaning as in match() since match matches the first > > occurrence whereas with mult="all", the default, matching in > > data.table matches all occurrences. Note that the default of merge's > > all.y= is all.y=FALSE but the default of all.i= is all.i=TRUE in order > > that the default behave as indices do. Also note that this solves the > > problem that nomatch= can only be 0 or NA since a logical can only > > have two non-NA values anyways. > > > > 2. If Y were a numeric index vector then all.i= will have the same > > effect as if Y were a data.table with Y as its column and is merged > > with the row numbers of X. e.g. X[1:4,,all.i=FALSE] would be the > > same as X[1:3] if X only had 3 rows since 4 does not match a row > > number of X and is dropped because all.i=FALSE. If Y were a numeric > > vector with negative values it would be converted to one with positive > > values in such a way as to have the established meaning and then the > > same strategy is applied. If Y were logical then its recycled giving > > YY and the same strategy is applied to which(YY). This description is > > intended to be conceptual and the actual internal mechanism could be > > different. > > > > Thus #2 allows one to think of **all** i indexing as merging rather > > than as multiple separate concepts (which I believe is consistent with > > the original intention of data.table). > > > > > > > > > > > > > > On Fri, May 3, 2013 at 8:02 PM, Eduard Antonyan > > wrote: > > > > I think I like this proposal - maybe you should write up a few examples of > > what current behavior is, vs the proposed behavior. > > > > > > On Fri, May 3, 2013 at 6:54 PM, Gabor Grothendieck > > wrote: > > > > > > data.table is supposed to generalize indexing and although not > > explicitly stated the generalization seems to be that indexing is > > merging with the row numbers so there is indeed merging going on and > > that merging should respect nomatch= for consistency. > > > > On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan > > wrote: > > > > There is no join'ing happening here, thus nomatch=0 has no effect. > > > > > > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck > > > > wrote: > > > > > > The definition of DT was left out by mistake. It should be: > > > > DT <- data.table(a=letters[1:3]) > > > > > > On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck > > wrote: > > > > Consider this example: > > > > DT[1:4,,nomatch=0] > > > > a > > 1: a > > 2: b > > 3: c > > 4: NA > > > > Should it not return only the first 3 rows? It seems to be ignoring > > the nomatch=0. 
> > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > -- > > Statistics & Software Consulting > > GKX Group, GKX Associates Inc. > > tel: 1-877-GKX-GROUP > > email: ggrothendieck at gmail.com (http://gmail.com) > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com (http://gmail.com) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karl at huftis.org Sat May 4 14:44:10 2013 From: karl at huftis.org (Karl Ove Hufthammer) Date: Sat, 04 May 2013 14:44:10 +0200 Subject: [datatable-help] indexing with nomatch=0 In-Reply-To: References: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com> Message-ID: <1367671450.6119.2.camel@linux-qcrw.site> la. den 04. 05. 2013 klokka 07.40 (-0400) skreiv Gabor Grothendieck: > I am not sure but I think that could be handled as a separate issue if > it becomes important. By using all.i= it makes it sufficiently > different from all.y= that users won't expect the same default and > further they will not necessarily expect that there be an all argument > for the left participant in the merge. But won?t ?all? (e.g., ?all=TRUE?) automatically match ?all.i?, while at the same time not give the same result as ?all? in ?merge?? That could be confusing. -- Karl Ove Hufthammer From ggrothendieck at gmail.com Sat May 4 15:07:52 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sat, 4 May 2013 09:07:52 -0400 Subject: [datatable-help] indexing with nomatch=0 In-Reply-To: <1367671450.6119.2.camel@linux-qcrw.site> References: <8BB9278EC50247A3A0B55072048E6E4F@gmail.com> <1367671450.6119.2.camel@linux-qcrw.site> Message-ID: The current proposal does not have an all= argument so there is no conflict. Suppose all.x= were later added to [.data.table. The all= argument in merge provides the default value for both all.x= and all.y= and all= itself is set to have a default of all=FALSE; however, for data.table if there were an all.x= added then it would have the default all.x=FALSE while all.i= would have the default of all.i=TRUE thus one could never have an all= argument to [.data.table that provides the default for both so I don't think this would ever be a problem. On Sat, May 4, 2013 at 8:44 AM, Karl Ove Hufthammer wrote: > la. den 04. 05. 2013 klokka 07.40 (-0400) skreiv Gabor Grothendieck: >> I am not sure but I think that could be handled as a separate issue if >> it becomes important. 
By using all.i= it makes it sufficiently >> different from all.y= that users won't expect the same default and >> further they will not necessarily expect that there be an all argument >> for the left participant in the merge. > > But won?t ?all? (e.g., ?all=TRUE?) automatically match ?all.i?, while at > the same time not give the same result as ?all? in ?merge?? That could > be confusing. > > -- > Karl Ove Hufthammer > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From npgraham1 at gmail.com Mon May 6 10:20:11 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Mon, 6 May 2013 04:20:11 -0400 Subject: [datatable-help] bug in 'consistent types for each group' code? Message-ID: I think I found a bug, either in the code that combines the results from grouping with 'by' or in the comparison code for IDate. The following is a simplified description of where and how, where the names have been changed to protect innocent variables. My code runs a function f on a data.table like so: output <- DT[, f(a.date, b.date, etc), by = group] The function f returns a data.table f.out with four columns, two of which are dates. All dates are stored as IDate, and the dates themselves are never changed or altered; some are relevant and most aren't. Explicitly printing the class of each column via print(sapply(f.out, class)) in f before returning always identifies the same classes, in my case "IDate" "Date", "IDate" "Date", "numeric", "integer" Despite this, for a certain group, I get the error columns of j don't evaluate to consistent types for each group: result for group 17 has column 1 type 'double' but expecting type 'integer' Every attempt to identify the problem with group 17 failed; its output looks perfectly correct, and everything checks out, even in debug. Using as.IDate explicitly anywhere before or during making the data.table f.out fixes the problem. As an aside, the error message above is not very helpful in general; I'd like to see *exactly* what isn't matching and where it's coming from. As another aside, when I run code like this, it's often the case that some groups don't end up belonging in the output at all. I can't figure out how to clue data.table to this; I'd like to just return NULL and that group not be in the output. Instead, I'm currently returning a row of obviously wrong output and filtering them later. Is there something I'm missing? ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Mon May 6 17:56:36 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Mon, 6 May 2013 10:56:36 -0500 Subject: [datatable-help] indexing with nomatch=0 In-Reply-To: References: Message-ID: +1; I especially like #2 and the slight conceptual shift it implies On Sat, May 4, 2013 at 6:26 AM, Gabor Grothendieck wrote: > The proposal at this point would be: > > 1. nomatch= would be replaced by all.i= such that > X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE] > X[Y,,nomatch=0] is the same as X[Y,,all.i=FALSE] > nomatch= would be deprecated and ultimately removed. 
From eduard.antonyan at gmail.com  Mon May  6 17:56:36 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 6 May 2013 10:56:36 -0500
Subject: [datatable-help] indexing with nomatch=0
In-Reply-To: References: Message-ID:

+1; I especially like #2 and the slight conceptual shift it implies

On Sat, May 4, 2013 at 6:26 AM, Gabor Grothendieck wrote:
> The proposal at this point would be:
>
> 1. nomatch= would be replaced by all.i= such that
>      X[Y,,nomatch=NA] is the same as X[Y,,all.i=TRUE]
>      X[Y,,nomatch=0]  is the same as X[Y,,all.i=FALSE]
>    nomatch= would be deprecated and ultimately removed.
>
> Note that #1 is simple to implement, as it only involves changing the
> names and values of arguments and does not really change any behavior.
> However, it's easier to think about, because X[Y,,all.i=Z] then has
> the same behavior as merge(X, Y, all.y=Z) and so can be quickly
> understood by anyone who knows merge in R. In contrast, nomatch= did
> not even have the same meaning as in match(): match() matches the
> first occurrence, whereas with mult="all" (the default) matching in
> data.table matches all occurrences. Note that the default of merge's
> all.y= is all.y=FALSE, but the default of all.i= is all.i=TRUE, in
> order that the default behave as indices do. Also note that this
> solves the problem that nomatch= can only be 0 or NA, since a logical
> can only have two non-NA values anyway.
>
> 2. If Y were a numeric index vector then all.i= would have the same
> effect as if Y were a data.table with Y as its column and were merged
> with the row numbers of X. E.g. X[1:4,,all.i=FALSE] would be the same
> as X[1:3] if X only had 3 rows, since 4 does not match a row number of
> X and is dropped because all.i=FALSE. If Y were a numeric vector with
> negative values, it would be converted to one with positive values in
> such a way as to have the established meaning, and then the same
> strategy applied. If Y were logical, then it is recycled, giving YY,
> and the same strategy is applied to which(YY). This description is
> intended to be conceptual and the actual internal mechanism could be
> different.
>
> Thus #2 allows one to think of **all** i indexing as merging rather
> than as multiple separate concepts (which I believe is consistent with
> the original intention of data.table).
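To make the renaming in point 1 concrete, a small sketch of today's
behaviour on data.table 1.8.x; all.i= is only the proposed spelling and
does not exist yet:

require(data.table)
X <- data.table(a = c("a","b","c"), x = 1:3, key = "a")
Y <- data.table(a = c("b","d"))
X[Y]             # nomatch=NA (default): unmatched "d" kept with x = NA
X[Y, nomatch=0]  # unmatched "d" dropped; proposed: X[Y, all.i=FALSE]
X[1:4]           # row numbers are not merged today: row 4 comes back NA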
> On Fri, May 3, 2013 at 8:02 PM, Eduard Antonyan wrote:
> > I think I like this proposal - maybe you should write up a few
> > examples of what current behavior is, vs the proposed behavior.
> >
> > On Fri, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote:
> >> data.table is supposed to generalize indexing and, although not
> >> explicitly stated, the generalization seems to be that indexing is
> >> merging with the row numbers. So there is indeed merging going on,
> >> and that merging should respect nomatch= for consistency.
> >>
> >> On Fri, May 3, 2013 at 6:54 PM, Eduard Antonyan wrote:
> >> > There is no join'ing happening here, thus nomatch=0 has no
> >> > effect.
> >> >
> >> > On Fri, May 3, 2013 at 5:52 PM, Gabor Grothendieck wrote:
> >> >> The definition of DT was left out by mistake. It should be:
> >> >>
> >> >>   DT <- data.table(a=letters[1:3])
> >> >>
> >> >> On Fri, May 3, 2013 at 6:50 PM, Gabor Grothendieck wrote:
> >> >> > Consider this example:
> >> >> >
> >> >> > > DT[1:4,,nomatch=0]
> >> >> >     a
> >> >> > 1:  a
> >> >> > 2:  b
> >> >> > 3:  c
> >> >> > 4: NA
> >> >> >
> >> >> > Should it not return only the first 3 rows? It seems to be
> >> >> > ignoring the nomatch=0.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From Ken.Williams at windlogics.com  Mon May  6 23:26:49 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Mon, 6 May 2013 21:26:49 +0000
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
Message-ID:

> -----Original Message-----
> From: Matthew Dowle [mailto:mdowle at mdowle.plus.com]
> Sent: Wednesday, May 01, 2013 4:13 PM
> Subject: Re: [datatable-help] Import problem with data.table in
> packages
>
> Hi,
>
> This rings a bell actually. data.table uses .onLoad currently but it
> should be using .onAttach, I seem to recall.
>
> http://r.789695.n4.nabble.com/Error-in-a-package-that-imports-data-table-tp4660173p4660637.html
>
> I had a hunt around but couldn't find if we decided data.table should
> move from .onLoad to .onAttach. Does anyone know/remember?

I'm not sure - but maybe a solution would be to explicitly prefix the
package name:

Index: pkg/R/onLoad.R
===================================================================
--- pkg/R/onLoad.R      (revision 855)
+++ pkg/R/onLoad.R      (working copy)
@@ -6,7 +6,7 @@
     if (class(ss)!="{") ss = as.call(c(as.name("{"), ss))
     if (!length(grep("data.table",ss[[2]]))) {
         ss = ss[c(1,NA,2:length(ss))]
-        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(data.table(...,key=key(..1)))")[[1]]
+        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(data.table::data.table(...,key=key(..1)))")[[1]]
         body(tt)=ss
         (unlockBinding)("cbind.data.frame",baseenv())
         assign("cbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE)
@@ -17,7 +17,7 @@
     if (class(ss)!="{") ss = as.call(c(as.name("{"), ss))
     if (!length(grep("data.table",ss[[2]]))) {
         ss = ss[c(1,NA,2:length(ss))]
-        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(`.rbind.data.table`(...))")[[1]]
+        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(`data.table::.rbind.data.table`(...))")[[1]]
         body(tt)=ss
         (unlockBinding)("rbind.data.frame",baseenv())
         assign("rbind.data.frame",tt,envir=asNamespace("base"),inherits=FALSE)
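A small sketch of why the "data.table::" qualification helps: the
injected expression is evaluated wherever base's cbind.data.frame runs,
and data.table's exported names need not be visible from there. This is
contrived (baseenv() stands in for an environment where data.table is
not attached) and assumes the package is installed:

expr1 <- parse(text = "data.table(a = 1)")[[1]]
expr2 <- parse(text = "data.table::data.table(a = 1)")[[1]]
try(eval(expr1, baseenv()))  # fails: "data.table" not found from base
eval(expr2, baseenv())       # works: `::` resolves the namespace
                             # explicitly, whether or not it's attached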
From Ken.Williams at windlogics.com  Mon May  6 23:29:57 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Mon, 6 May 2013 21:29:57 +0000
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
Message-ID:

> -----Original Message-----
> From: Ken Williams
> Sent: Monday, May 06, 2013 4:27 PM
>
> I'm not sure - but maybe a solution would be to explicitly prefix the
> package name:

[...]

Sorry, didn't notice the backticks - the second patched line should
probably go like so:

-        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(`.rbind.data.table`(...))")[[1]]
+        ss[[2]] = parse(text="if (inherits(..1,'data.table')) return(data.table::.rbind.data.table(...))")[[1]]

-Ken

From mdowle at mdowle.plus.com  Tue May  7 11:18:15 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 07 May 2013 10:18:15 +0100
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
Message-ID: <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>

Hi Ken,
cc Victor

Many thanks. I've just applied and committed your patch. I suspect a
change from .onLoad to .onAttach may still be needed, but let's make one
change at a time. There's a comment in the code that suggests it was in
.onAttach originally and I'd moved it to .onLoad. I wonder if it didn't
work in .onAttach because of the lack of the "data.table::" prefix. Now
that's there, perhaps it can move back to .onAttach.

Also filed so as not to forget :
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2771&group_id=240&atid=975

R-Forge should build the patched version in the next few hours. Please
let us know if it fixes it or not.

Matthew

On 06.05.2013 22:29, Ken Williams wrote:
> [...]

From dkulp at dizz.org  Wed May  8 18:31:39 2013
From: dkulp at dizz.org (David Kulp)
Date: Wed, 8 May 2013 12:31:39 -0400
Subject: [datatable-help] Better hacks?: getting a vector AND using
 'with'; inserting chunks of rows
Message-ID: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>

I must be doing something stupid. I'd like to get a vector from a
data.table column using with=FALSE, instead of a single-column
data.table.

dt <- data.table(x=1:10, y=letters[1:10])
col.name <- 'y'
row.num <- 5
print(dt[row.num, y])        # returns a vector with the letter 'e'. OK.
print(dt[row.num, list(y)])  # returns a data.table. OK.
print(dt[row.num, col.name, with=FALSE])
# returns a data.table... no list syntax here, but I don't get a vector
# back. Not OK.

The best I can do is

unlist(as.list(dt[row.num, col.name, with=FALSE]))

which seems rather hackish. I've read the FAQ and I'm stymied. v1.8.8.
Any help?

----

While I've got your attention, I might as well ask another stupid
question. I can't insert new rows automagically.

dt[11] <- c(11,'k')

Although I can do

df <- as.data.frame(dt)
df[11,] <- c(11,'k')

So I figure you want me to use rbind, even though rbind.data.table is
probably a copy operation.

dt <- rbind(dt, list(x=11, y='k'))

But I'd like to start with an empty data.table and programmatically add
chunks of rows as I run out of space. So I generate a data.table of NA
values and rbind. E.g., here I want to add 5 new rows to the 2 column
table.

dt <- data.table(x=numeric(), y=character())
new.rows <- lapply(1:2, function(c) { rep(NA, 5) })
dt <- rbind(dt, new.rows, use.names=FALSE)

According to the documentation, rbind is supposed to copy by position if
use.names=FALSE, but it doesn't retain the column names. This worked in
v1.8.2; then I upgraded and it stopped working. I know I can fix this by
labeling the columns of new.rows, but I'm guessing that there's a much
better way to simply allocate a new chunk of rows to a growing table,
and I didn't see any info online.

Thanks in advance!!

From FErickson at psu.edu  Wed May  8 22:00:25 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Wed, 8 May 2013 15:00:25 -0500
Subject: [datatable-help] Better hacks?: getting a vector AND using
 'with'; inserting chunks of rows
In-Reply-To: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>
References: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>
Message-ID:

For your first question, this should work:

dt[row.num,][[col.name]]

For the second question, I guess your problem goes away if you aren't
using an (all but) NULL data.table.

dt <- data.table(x=1, y=1)
nr <- data.table(NA, NA)
rbind(dt, nr, use.names=FALSE)
#     x  y
# 1:  1  1
# 2: NA NA

So, if you're dynamically growing your data.table from nothing, you'll
only have to assign the colnames once, after the data.table becomes
non-empty. I've read that R is pretty inefficient at dynamically growing
things, ...as you say, it's a copy operation, right?

I hope this helps.

Best,

Frank

On Wed, May 8, 2013 at 11:31 AM, David Kulp wrote:
> [...]

From dkulp at dizz.org  Fri May 10 03:05:43 2013
From: dkulp at dizz.org (David Kulp)
Date: Thu, 9 May 2013 21:05:43 -0400
Subject: [datatable-help] Better hacks?: getting a vector AND using
 'with'; inserting chunks of rows
In-Reply-To: References: <289CB99A-379E-49C2-86BF-0344FC4C2336@dizz.org>
Message-ID:

dt[row.num,][[col.name]] is indeed the solution. And it works of course
for data.frames, too. Maybe it should be added to FAQ #1.3.

Thank you!

On May 8, 2013, at 4:00 PM, Frank Erickson wrote:
> [...]
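On the growing-a-table question more generally: one common pattern that
avoids repeated rbind copies is to collect the chunks in a list and bind
once at the end. A sketch using rbindlist (present in data.table 1.8.x);
the chunk contents here are invented:

require(data.table)
chunks <- vector("list", 100)
for (i in 1:100) {
    # each chunk computed or read as a small data.table
    chunks[[i]] <- data.table(x = i, y = letters[(i %% 26) + 1L])
}
dt <- rbindlist(chunks)  # one allocation instead of 100 incremental copies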
From mdowle at mdowle.plus.com  Sat May 11 01:41:43 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 11 May 2013 00:41:43 +0100
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>
References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
 <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>
Message-ID: <74abbc5c1b8bf617bb8c28f62ec98e93@imap.plus.net>

Hi Ken, Victor,

Have read up again about .onAttach and .onLoad, and I think data.table
is using them correctly after all. So Ken's patch alone should indeed
fix the problem. I've closed #975 now. R-Forge has now (finally) built
and is passing with that patch applied.

fread also has colClasses now, if anyone is waiting for that.

Many thanks,
Matthew

On 07.05.2013 10:18, Matthew Dowle wrote:
> [...]

From mdowle at mdowle.plus.com  Sat May 11 03:39:10 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 11 May 2013 02:39:10 +0100
Subject: [datatable-help] Fwd: fread on very large file
In-Reply-To: <806651da84c7d49b3a9aa134e4951274@imap.plus.net>
References: <6215268129090c5164b66264010bea9b@imap.plus.net>
 <806651da84c7d49b3a9aa134e4951274@imap.plus.net>
Message-ID:

Paul, Vishal,

Commit 859 :
* fread now supports files larger than 4GB on 64bit Windows (#2767
  thanks to Paul Harding) and files between 2GB and 4GB on 32bit
  Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to
  be GetFileSizeEx().

Please test and confirm ok now.

Thanks, Matthew

On 03.05.2013 14:59, Matthew Dowle wrote:
> Oh. Then it's likely a bug with fread on Windows for files > 4GB.
> Think GetFileSize() should be GetFileSizeEx(), iirc.
>
> Please could you file it as a bug on the tracker. Thanks.
>
> Matthew
>
> On 03.05.2013 14:32, Paul Harding wrote:
>> Definitely a 64-bit machine. Here are the details:
>>
>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>> Installed memory (RAM): 128GB
>> System type: 64-bit Operating System
>> Windows edition: Server 2008 R2 Enterprise SP1
>>
>> Regards, Paul
>>
>> On 3 May 2013 10:51, Matthew Dowle wrote:
>>> Hi Paul,
>>> Thanks for all this!
>>>> The problem arises when the file reaches 4GB, in this case between
>>>> 8,030,000 and 8,040,000 rows:
>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>> Thanks, Matthew
>>>
>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>> Some supplementary information; here is the portion of the file
>>>> (with row numbers, +1 for header) around where fread thinks the
>>>> file ends.
>>>>
>>>> $ nl spd_all_fixed.csv | head -n 9186300 | tail
>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>
>>>> 9186294 (row 9186293 excl header) is where fread thinks the file
>>>> ends, mid-line by the look of it!
>>>>
>>>> I've experimented by truncating the file. The error varies: either
>>>> it reads too few records or it gives the error I reported,
>>>> presumably determined by whether the last perceived line is entire.
>>>> The problem arises when the file reaches 4GB, in this case between
>>>> 8,030,000 and 8,040,000 rows:
>>>>
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>
>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> [...]
>>>> Count of eol after first data row: 80300000
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving
>>>> 80299999 data rows
>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
>>>> '0.42634430000000001'
>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains
>>>> '0.42634430000000001'
>>>> [...]
>>>> 171.188s ( 65%) Reading data
>>>> 1365231.809s (518439%) Allocation for type bumps (if any),
>>>> including gc time if triggered
>>>> -1365231.809s (-518439%) Coercing data already read in type bumps
>>>> (if any)
>>>>
>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> [...]
>>>> Count of eol after first data row: 18913
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving
>>>> 18913 data rows
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002000 (+middle 5 rows)
>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>>   Expected sep (',') but ',' ends field 2 on line 6 when detecting
>>>>   types: 204650,724540,
>>>>
>>>> Regards, Paul
>>>>
>>>> On 1 May 2013 10:28, Paul Harding wrote:
>>>>> [...]

From Ken.Williams at windlogics.com  Sat May 11 17:56:42 2013
From: Ken.Williams at windlogics.com (Ken Williams)
Date: Sat, 11 May 2013 15:56:42 +0000
Subject: [datatable-help] Import problem with data.table in packages
In-Reply-To: <74abbc5c1b8bf617bb8c28f62ec98e93@imap.plus.net>
References: <5d217b902b9d75ed8ac53bdd26d1f7a1@imap.plus.net>
 <9bc9f40dc95cb125ae81d028db99bba0@imap.plus.net>
 <74abbc5c1b8bf617bb8c28f62ec98e93@imap.plus.net>
Message-ID: <109C1AB5-2A48-4302-95B7-97737CE43777@windlogics.com>

Awesome. I wasn't totally clear on this namespace stuff, but that sounds
right to me too.

Sent from my iPhone.

On May 10, 2013, at 6:41 PM, "Matthew Dowle" wrote:
> [...]

From mdowle at mdowle.plus.com  Sat May 11 23:56:45 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 11 May 2013 22:56:45 +0100
Subject: [datatable-help] fread(character string) limited to strings
 less than 4096 long?
In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net>
 <2c2af8789733127541fe78c1ccde5412@imap.plus.net>
 <230b0040889556349b21822824a5fb7e@imap.plus.net>
Message-ID: <96847d98ac0d995008db94fd21b24906@imap.plus.net>

Hi,

Have reproduced now, and fixed (commit 862):
* When input is the data as a character string, it is no longer
  truncated to your system's maximum path length, #2649. It was being
  passed through path.expand() even when it wasn't a filename. Many
  thanks to Timothee Carayol for the reproducible report. The limit
  should now be R's character string length limit (2^31-1 bytes = 2GB).
  Test added.

And the persisting nan% in verbose output is also fixed.

Many thanks!
Matthew
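A quick way to exercise the fix, assuming a build that includes commit
862 (input containing a newline is treated as data, not a filename):

require(data.table)
rows  <- paste(sprintf("%d\tx", 1:5000), collapse = "\n")
input <- paste("a\tb", rows, sep = "\n")
nchar(input)        # far beyond 4096, and beyond any path-length limit
nrow(fread(input))  # expect 5000 once the truncation is gone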
>>>> Count of eol after first data row: 1023 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows >>>> Type codes: 33 (first 5 rows) >>>> Type codes: 33 (+middle 5 rows) >>>> Type codes: 33 (+last 5 rows) >>>> 0.000s (-nan%) Memory map (rerun may be quicker) >>>> 0.000s (-nan%) sep and header detection >>>> 0.000s (-nan%) Count rows (wc -l) >>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >>>> 0.000s (-nan%) Reading data >>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered >>>> 0.000s (-nan%) Coercing data already read in type bumps (if any) >>>> 0.000s (-nan%) Changing na.strings to NA >>>> 0.000s Total >>>> 4096 1023 >>>> Input contains a n (or is ""), taking this to be text input (not a filename) >>>> Detected eol as n only (no r afterwards), the UNIX and Mac standard. >>>> Using line 30 to detect sep (the last non blank line in the first 30) ... 't' >>>> Found 2 columns >>>> First row with 2 fields occurs on line 1 (either column names or first row of data) >>>> All the fields on line 1 are character fields. Treating as the column names. >>>> Count of eol after first data row: 1023 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows >>>> Type codes: 33 (first 5 rows) >>>> Type codes: 33 (+middle 5 rows) >>>> Type codes: 33 (+last 5 rows) >>>> 0.000s (-nan%) Memory map (rerun may be quicker) >>>> 0.000s (-nan%) sep and header detection >>>> 0.000s (-nan%) Count rows (wc -l) >>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >>>> 0.000s (-nan%) Reading data >>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered >>>> 0.000s (-nan%) Coercing data already read in type bumps (if any) >>>> 0.000s (-nan%) Changing na.strings to NA >>>> 0.000s Total >>>> 4100 1023 >>>> Input contains a n (or is ""), taking this to be text input (not a filename) >>>> Detected eol as n only (no r afterwards), the UNIX and Mac standard. >>>> Using line 30 to detect sep (the last non blank line in the first 30) ... 't' >>>> Found 2 columns >>>> First row with 2 fields occurs on line 1 (either column names or first row of data) >>>> All the fields on line 1 are character fields. Treating as the column names. >>>> Count of eol after first data row: 1023 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data rows >>>> Type codes: 33 (first 5 rows) >>>> Type codes: 33 (+middle 5 rows) >>>> Type codes: 33 (+last 5 rows) >>>> 0.000s (-nan%) Memory map (rerun may be quicker) >>>> 0.000s (-nan%) sep and header detection >>>> 0.000s (-nan%) Count rows (wc -l) >>>> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >>>> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >>>> 0.000s (-nan%) Reading data >>>> 0.000s (-nan%) Allocation for type bumps (if any), including gc time if triggered >>>> 0.000s (-nan%) Coercing data already read in type bumps (if any) >>>> 0.000s (-nan%) Changing na.strings to NA >>>> 0.000s Total >>>> 40000 1023 >>>> >>>> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: >>>> >>>>> Hm this is odd. >>>>> >>>>> Could you run the following and paste back the (verbose) results please. 
>>>>> for (n in c(1023:1025, 10000)) { >>>>> >>>>> input = paste( rep('atbn', n), collapse='') >>>>> A = fread(input,verbose=TRUE) >>>>> cat(nchar(input), nrow(A), "n") >>>>> } >>>>> >>>>> On 28.03.2013 14:38, Timoth?e Carayol wrote: >>>>> >>>>>> Curiouser and curiouser.. >>>>>> >>>>>> I can reproduce on two computers with different versions of R and of data.table. >>>>>> >>>>>> Computer 1 (it says unknown-linux but is actually ubuntu): >>>>>> >>>>>> R version 2.15.3 (2013-03-01) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 >>>>>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C >>>>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >>>>>> Computer 2: >>>>>> >>>>>> R version 2.15.2 (2012-10-26) >>>>>> Platform: x86_64-redhat-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>>>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] data.table_1.8.8 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] tools_2.15.2 >>>>>> >>>>>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >>>>>> >>>>>>> Interesting, what's your sessionInfo() please? >>>>>>> >>>>>>> For me it seems to work ok : >>>>>>> >>>>>>> [1] 1022 >>>>>>> [1] 1023 >>>>>>> [1] 1024 >>>>>>> [1] 9999 >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 2.15.2 (2012-10-26) >>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>>>>> >>>>>>> On 27.03.2013 22:49, Timoth?e Carayol wrote: >>>>>>> >>>>>>>> Agree with Muhammad, longer character strings are definitely permitted in R. >>>>>>>> A minimal example that show something strange happening with fread: >>>>>>>> >>>>>>>> for (n in c(1023:1025, 10000)) { >>>>>>>> A >>>>>>>> >>>>>>>> paste( >>>>>>>> rep('atbn', n), >>>>>>>> collapse='' >>>>>>>> ), >>>>>>>> sep='t' >>>>>>>> ) >>>>>>>> print(nrow(A)) >>>>>>>> } >>>>>>>> On my computer, I obtain: >>>>>>>> >>>>>>>> [1] 1022 >>>>>>>> [1] 1023 >>>>>>>> [1] 1023 >>>>>>>> [1] 1023 >>>>>>>> Hope this helps >>>>>>>> Timoth?e >>>>>>>> >>>>>>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is that >>>>>>>>> the R limit for a character string length? What happens at 4097? >>>>>>>>> Matthew >>>>>>>>> >>>>>>>>> > Hi, >>>>>>>>> > >>>>>>>>> > I have an example of a string of 4097 characters which can't be parsed by >>>>>>>>> > fread; however, if I remove any character, it can be parsed just fine. Is >>>>>>>>> > that a known limitation? >>>>>>>>> > >>>>>>>>> > (If I write the string to a file and then fread the file name, it works >>>>>>>>> > too.) >>>>>>>>> > >>>>>>>>> > Let me know if you need the string and/or a bug report. 
>>>>>>>>> > >>>>>>>>> > Thanks >>>>>>>>> > Timoth?e > _______________________________________________ >>>>>>>>> > datatable-help mailing list >>>>>>>>> > datatable-help at lists.r-forge.r-project.org [1] >>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com [5] mailto:mdowle at mdowle.plus.com [6] mailto:mdowle at mdowle.plus.com [7] mailto:timothee.carayol at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Sun May 12 00:16:30 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sat, 11 May 2013 18:16:30 -0400 Subject: [datatable-help] fread: skip Message-ID: I would find it useful if fread had a skip= argument as in read.table since I have files from time to time that have garbage at the top. Another situation I find from time to time is that the header is messed up but one can still read the file if one can skip over the header and specify header = FALSE. An extra feature that would be nice but less important would be if one could specify skip = "string" and have it skip all lines until it found one with "string": in it and then start reading from the matched row onward. Normally the string would be chosen to be a string found in the header and not likely found prior to the header. read.xls in gdata has a similar feature and I find it quite handy at times. -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 00:35:19 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 11 May 2013 23:35:19 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: Message-ID: Hi, Does the auto skip feature of fread cover both of those? From ?fread : " Once the separator is found on line autostart, the number of columns is determined. Then the file is searched backwards from autostart until a row is found that doesn't have that number of columns, or the start of file is reached. Thus, the first data row is found and any human readable banners are automatically skipped. This feature can be particularly useful for loading a set of files which may not all have consistently sized banners. " There were also some issue with header=FALSE in the first release (1.8.8) which have since been fixed in 1.8.9. Matthew On 11.05.2013 23:16, Gabor Grothendieck wrote: > I would find it useful if fread had a skip= argument as in read.table > since I have files from time to time that have garbage at the top. > Another situation I find from time to time is that the header is > messed up but one can still read the file if one can skip over the > header and specify header = FALSE. > > An extra feature that would be nice but less important would be if > one > could specify skip = "string" and have it skip all lines until it > found one with "string": in it and then start reading from the > matched > row onward. Normally the string would be chosen to be a string > found > in the header and not likely found prior to the header. read.xls in > gdata has a similar feature and I find it quite handy at times. > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. 
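A small illustration of the auto-skip behaviour quoted from ?fread, as
of 1.8.9; the banner text is invented, and skip= is only proposed in
this thread, not yet implemented:

require(data.table)
input <- "banner line written by some export tool\nanother banner\na,b\n1,2\n3,4\n"
fread(input)
# The banner lines don't have the detected 2 comma-separated fields, so
# the search upwards from autostart stops below them: 2 rows are read,
# with column names a and b.
# The proposed skip= would force the start instead of relying on the
# heuristic, e.g. fread(input, skip=2)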
From ggrothendieck at gmail.com  Sun May 12 01:47:01 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sat, 11 May 2013 19:47:01 -0400
Subject: [datatable-help] fread: skip
In-Reply-To: References: Message-ID:

Not with the csv I tried. The header is messed up (most of the header
fields are missing) and it misconstrues it as data. The automation is
great, but some way to force its behavior when you know what it should
do seems essential, since heuristics can't be expected to work in all
cases.

On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle wrote:
> [...]

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Sun May 12 09:53:38 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 12 May 2013 09:53:38 +0200
Subject: [datatable-help] ":=" with "by" reassignment/updation + adding
 new column leads to crash
Message-ID: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com>

Hi,

I just discovered a weird R-session crash in data.table. Here's an
example to reproduce the crash. I did not find any bug filed regarding
this issue. Maybe others can verify this? Then I'll file it as a bug.

The issue is this. Suppose you've a data.table with two columns x and y
as follows:

require(data.table)
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)

   x  y
1: 1  6
2: 1  7
3: 1  8
4: 2  9
5: 2 10

Now you want to add a new column "z" by reference, grouped by "x". So
you'd do:

DT[, `:=`(z = .GRP), by = x]

   x  y z
1: 1  6 1
2: 1  7 1
3: 1  8 1
4: 2  9 2
5: 2 10 2

Now, for the sake of producing this error, assume that you assigned "z"
the wrong value and want to change it, and you also realise that you
want to add another column "w" as well. So you go ahead and do (remember
to do the previous step and then this one):

DT[, `:=`(z = .N, w = 2), by = x]  # R session crashes

Here, both the R and RStudio sessions crash with the traceback message:

 *** caught segfault ***
address 0x0, cause 'memory not mapped'

Traceback:
 1: `[.data.table`(DT, , `:=`(z = .GRP, w = 2), by = x)
 2: DT[, `:=`(z = .GRP, w = 2), by = x]

This, on the other hand, works as expected if you assign both columns
the first time:

require(data.table)
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
DT[, `:=`(z = .GRP, w = 2), by = x]  # works fine

That is, if you assign by reference (:=) with "by" and re-assign an
existing column while also creating another column, there seems to be a
segfault. The error may not be limited to this case; it's just what
I've tested.

Here's my sessionInfo() from before the crash:

R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
[...]
other attached packages:
[1] data.table_1.8.8

Best,
Arun

From aragorn168b at gmail.com  Sun May 12 10:12:17 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 12 May 2013 10:12:17 +0200
Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by"
Message-ID: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>

Hi,

Suppose you've a data.table, say:

require(data.table)
DT <- data.table(x = 1:5, y = 6:10)

Suppose you want to group by "x %/% 2" (= 0, 1,1, 2,2) and then
calculate the sum of each column for each group. Then one would do:

DT[, grp := x %/% 2]
DT[, list(x.sum=sum(x), y.sum=sum(y)), by = grp]  # avoid .SD for few columns

Now, assume that you've many, many columns, which would make the use of
`.SD` sensible:

DT[, lapply(.SD, sum), by = grp]
   grp x  y
1:   0 1  6
2:   1 5 15
3:   2 9 19

The issue is that if you create the grouping column ad hoc, then the
column from which the ad-hoc grouping column is derived is not available
to .SD. Let me illustrate this:

DT <- data.table(x = 1:5, y = 6:10)
DT[, lapply(.SD, sum), by = (grp=x %/% 2)]  # ad-hoc grouping column
   grp  y
1:   0  6
2:   1 15
3:   2 19

I think it'd be nice to have the column available to `.SD`, so that we
can save creating a temporary column, grouping, and then deleting it;
"technically" grp *is* a new column (meaning "x" must still be
available).

Any take on this?
Arun
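One possible workaround for the ad-hoc case, sketched from the 1.8.x
documentation rather than tested on every version: .SDcols names exactly
which columns .SD should carry, and that may include x even when x is
used in the by expression:

require(data.table)
DT <- data.table(x = 1:5, y = 6:10)
DT[, lapply(.SD, sum), by = list(grp = x %/% 2), .SDcols = c("x","y")]
#    grp x  y
# 1:   0 1  6
# 2:   1 5 15
# 3:   2 9 19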
The header is messed up (most of the > header > fields are missing) and it misconstrues it as data. That was fixed a while ago in v1.8.9, from NEWS : " [fread] If some column names are blank they are now given default names rather than causing the header row to be read as a data row " > The automation is great but some way to force its behavior when you > know what it should do seems essential since heuristics can't be > expected to work in all cases. I suspect the heuristics in v1.8.9 work on all your examples so far, but ok point taken. fread allows control of 'autostart' already. This is a line number (default 30) within the regular data block used to detect the separator and search upwards from to find the first data row and/or column names. Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off the search upwards part. Line skip+1 will be used to detect the separator when sep="auto" and used as column names according to header="auto"|TRUE|FALSE as usual. It'll be an error to specify both autostart and skip in the same call. If that sounds ok? Matthew > > On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle > wrote: >> >> Hi, >> >> Does the auto skip feature of fread cover both of those? From >> ?fread : >> >> " Once the separator is found on line autostart, the number of >> columns is >> determined. Then the file is searched backwards from autostart until >> a row >> is found that doesn't have that number of columns, or the start of >> file is >> reached. Thus, the first data row is found and any human readable >> banners >> are automatically skipped. This feature can be particularly useful >> for >> loading a set of files which may not all have consistently sized >> banners. " >> >> There were also some issue with header=FALSE in the first release >> (1.8.8) >> which have since been fixed in 1.8.9. >> >> Matthew >> >> >> >> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>> >>> I would find it useful if fread had a skip= argument as in >>> read.table >>> since I have files from time to time that have garbage at the top. >>> Another situation I find from time to time is that the header is >>> messed up but one can still read the file if one can skip over the >>> header and specify header = FALSE. >>> >>> An extra feature that would be nice but less important would be if >>> one >>> could specify skip = "string" and have it skip all lines until it >>> found one with "string": in it and then start reading from the >>> matched >>> row onward. Normally the string would be chosen to be a string >>> found >>> in the header and not likely found prior to the header. read.xls in >>> gdata has a similar feature and I find it quite handy at times. >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. 
>>> tel: 1-877-GKX-GROUP
>>> email: ggrothendieck at gmail.com
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From mdowle at mdowle.plus.com  Sun May 12 12:44:49 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 11:44:49 +0100
Subject: [datatable-help] ":=" with "by" reassignment/updation + adding new column leads to crash
In-Reply-To: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com>
References: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com>
Message-ID: <0b1179be0c07f95992fa421dca7613f1@imap.plus.net>

Hi,

Yes I get that in latest dev too. Thanks for the nice example, please file.

Matthew

On 12.05.2013 08:53, Arunkumar Srinivasan wrote:
> Hi,
> I just discovered some weird R-session crash in data.table. Here's an example to reproduce the crash. I did not find any bug filed regarding this issue. Maybe others can verify this? Then I'll file it as a bug.
> The issue is this. Suppose you've a data.table with two columns x and y as follows:
> require(data.table)
> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
>
> x y
> 1: 1 6
> 2: 1 7
> 3: 1 8
> 4: 2 9
> 5: 2 10
> Now you want to add a new column "z" by reference grouped by "x". So, you'd do:
>
> DT[, `:=`(z = .GRP), by = x]
>
> x y z
> 1: 1 6 1
> 2: 1 7 1
> 3: 1 8 1
> 4: 2 9 2
> 5: 2 10 2
> Now, for the sake of producing this error, assume that you assigned "z" the wrong value and that you want to change it. But you also realised that you want to add another column "w" as well. So, you go ahead and do (remember to do the previous step and then this one):
> DT[, `:=`(z = .N, w = 2), by = x] # R session crashes
> Here, both the R and RStudio sessions crash with the traceback message:
>
> *** caught segfault ***
> address 0x0, cause 'memory not mapped'
> Traceback:
> 1: `[.data.table`(DT, , `:=`(z = .GRP, w = 2), by = x)
> 2: DT[, `:=`(z = .GRP, w = 2), by = x]
> This on the other hand works as expected if you assign both columns the first time.
>
> require(data.table)
> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
> DT[, `:=`(z = .GRP, w = 2), by = x] # works fine
> That is, if you assign by reference (:=) with "by" and re-assign a variable while also creating another variable, there seems to be a segfault. This error may not be limited to this case; it's just the one I've tested.
> Here's my sessionInfo() from before the crash:
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] data.table_1.8.8
> loaded via a namespace (and not attached):
> [1] tools_3.0.0
> Best,
> Arun
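[Note, not from the thread: a possible interim workaround, assuming -- untested
against 1.8.9/dev -- that the segfault only occurs when a single grouped `:=`
call both re-assigns an existing column and adds a new one. Splitting the two
operations into separate grouped calls avoids that combination:

require(data.table)
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
DT[, z := .GRP, by = x]  # first assignment, as in the report above
DT[, z := .N, by = x]    # re-assign the existing column on its own
DT[, w := 2, by = x]     # add the new column in a separate call
]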
From mdowle at mdowle.plus.com  Sun May 12 12:58:08 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 11:58:08 +0100
Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by"
In-Reply-To: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>
References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>
Message-ID: <783cf12994786d84f97a99237c5d70ee@imap.plus.net>

On 12.05.2013 09:12, Arunkumar Srinivasan wrote:
> Hi,
> Suppose you've a data.table, say:
> require(data.table)
> DT <- data.table(x = 1:5, y = 6:10)
>
> Suppose you want to group by "x %/% 2" ( = 0, 1,1, 2,2) and then calculate the sum of each column for each group, then one would do:
> DT[, grp := x %/% 2]
> DT[, list(x.sum=sum(x), y.sum=sum(y)), by = grp] # avoid .SD in case of few columns

I know this isn't the main point (keep scrolling down) but just as an aside :

DT[, lapply(.SD, sum), by = grp, .SDcols=c("x","y")] # intended way to avoid .SD in case of a few columns

> Now, assume that you've many many columns which would make the use of `.SD` sensible.
> DT[, lapply(.SD, sum), by = grp]
>
> grp x y
> 1: 0 1 6
> 2: 1 5 15
> 3: 2 9 19
> The issue is that if you create the grouping column ad-hoc, then the column from which the ad-hoc grouping column is derived is not available to .SD. Let me illustrate this:
>
> DT <- data.table(x = 1:5, y = 6:10)
> DT[, lapply(.SD, sum), by = (grp=x %/% 2)] # ad-hoc creation of grouping column
>
> grp y
> 1: 0 6
> 2: 1 15
> 3: 2 19
> I think it'd be nice to have the column available to `.SD` so that we can save creating a temporary column, grouping and then deleting it, as "technically" it *is* a new column (meaning, "x" must still be available). Any take on this?

.BY is available to j already for that reason, does that work? .BY isn't a
column of .SD because i) it's the same value for every row of .SD i.e.
.BY[[1]] is length 1 and contains this particular group (replicating the
same value would be wasteful) but more significantly ii) it is often a
character group name where running an aggregation function like sum()
would trip up on it.

> Arun

From aragorn168b at gmail.com  Sun May 12 13:54:31 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 12 May 2013 13:54:31 +0200
Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by"
In-Reply-To: <783cf12994786d84f97a99237c5d70ee@imap.plus.net>
References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com>
	<783cf12994786d84f97a99237c5d70ee@imap.plus.net>
Message-ID: <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com>

I just realised that I sent it only to MatthewDowle. So, sending it again.
Sorry @Matthew for the double email.

Matthew,

>> .BY is available to j already for that reason, does that work? .BY isn't a column of .SD because i) it's the same value for every row of .SD i.e. .BY[[1]] is length 1 and contains this particular group (replicating the same value would be wasteful)

DT[, print(.BY), by = list(grp = x %/% 2)]

$grp
[1] 0

$grp
[1] 1

$grp
[1] 2

DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x"

y
1: 6
y
1: 7
2: 8
y
1: 9
2: 10

My question is not as to why the BY column is not available in .SD. Rather,
since .BY does not have column "x" in it (rather the result of x%/% 2), why
does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x"
should be available to .SD is my point.
>> but more significantly ii) it is often a character group name where running an aggregation function like sum() would trip up on it.

Again, I don't think so because, I am not asking for .BY columns to be in .SD.

DT[, grp := x %/% 2]
DT[, lapply(.SD, sum), by=grp]

must be equal to:

DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as it's not the grouping column

Arun

From ggrothendieck at gmail.com  Sun May 12 14:26:35 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 12 May 2013 08:26:35 -0400
Subject: [datatable-help] fread: skip
In-Reply-To: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
Message-ID:

1.8.8 is the most recent version on CRAN so I have now installed 1.8.9
from R-Forge now and the sample csv I was using does indeed work
attempting to do the best it can with the mucked up header. Maybe
this is sufficient and a skip is not needed but the fact is that there
is no facility to skip over the bad header had I wanted to.

On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle wrote:
> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>
>> Not with the csv I tried. The header is messed up (most of the header
>> fields are missing) and it misconstrues it as data.
>
> That was fixed a while ago in v1.8.9, from NEWS :
>
> " [fread] If some column names are blank they are now given default names
> rather than causing the header row to be read as a data row "
>
>> The automation is great but some way to force its behavior when you
>> know what it should do seems essential since heuristics can't be
>> expected to work in all cases.
>
> I suspect the heuristics in v1.8.9 work on all your examples so far, but ok
> point taken.
>
> fread allows control of 'autostart' already. This is a line number (default
> 30) within the regular data block used to detect the separator and search
> upwards from to find the first data row and/or column names.
>
> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off
> the search upwards part. Line skip+1 will be used to detect the separator
> when sep="auto" and used as column names according to
> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
> autostart and skip in the same call. If that sounds ok?
>
> Matthew
>
>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>> wrote:
>>>
>>> Hi,
>>>
>>> Does the auto skip feature of fread cover both of those? From ?fread :
>>>
>>> " Once the separator is found on line autostart, the number of columns is
>>> determined. Then the file is searched backwards from autostart until a row
>>> is found that doesn't have that number of columns, or the start of file is
>>> reached. Thus, the first data row is found and any human readable banners
>>> are automatically skipped. This feature can be particularly useful for
>>> loading a set of files which may not all have consistently sized banners. "
>>>
>>> There were also some issue with header=FALSE in the first release (1.8.8)
>>> which have since been fixed in 1.8.9.
>>>
>>> Matthew
>>>
>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>
>>>> I would find it useful if fread had a skip= argument as in read.table
>>>> since I have files from time to time that have garbage at the top.
>>>> Another situation I find from time to time is that the header is >>>> messed up but one can still read the file if one can skip over the >>>> header and specify header = FALSE. >>>> >>>> An extra feature that would be nice but less important would be if one >>>> could specify skip = "string" and have it skip all lines until it >>>> found one with "string": in it and then start reading from the matched >>>> row onward. Normally the string would be chosen to be a string found >>>> in the header and not likely found prior to the header. read.xls in >>>> gdata has a similar feature and I find it quite handy at times. >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> >>>> >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 14:45:03 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 13:45:03 +0100 Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by" In-Reply-To: <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com> References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com> <783cf12994786d84f97a99237c5d70ee@imap.plus.net> <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com> Message-ID: <932fea75ece9b61bde12aa8d54c21a6f@imap.plus.net> On 12.05.2013 12:54, Arunkumar Srinivasan wrote: > I just realised that I sent it only to MatthewDowle. So, sending it again. Sorry @Matthew for the double email. > > Matthew, >>> .BY is available to j already for that reason, does that work? .BY isn't a column of .SD because i) it's the same value for every row of .SD i.e. .BY[[1]] is length 1 and contains this particular group (replicating the same value would be wasteful) > DT[, print(.BY), by = list(grp = x %/% 2)] > > $grp > [1] 0 > $grp > [1] 1 > $grp > [1] 2 > > DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x" > > y > 1: 6 > y > 1: 7 > 2: 8 > y > 1: 9 > 2: 10 > My question is not as to why the BY column is not available in .SD. Rather, since .BY does not have column "x" in it (rather the result of x%/% 2), why does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x" should be available to .SD is my point. Oh I see now. Yes data.table inspects the expressions used in 'by' and considers any columns used there as grouping columns and excludes those from .SD. An example is a date column containing daily observations. DT[, lapply(.SD,sum), by=month(date)] would not wish to sum() the "date" column. In ?data.table I've just changed : .SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s). to .SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in 'by' (or 'keyby'). Further answer below ... >>> but more significantly ii) it is often a character group name where running an aggregation function like sum() would trip up on it. > Again, I don't think so because, I am not asking for .BY columns to be in .SD. 
> DT[, grp := x %/% 2]
> DT[, lapply(.SD, sum), by=grp]
> must be equal to:
> DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as it's not the grouping column

This makes sense in this case because x can be sum()-ed, but isn't true in
general like the month(date) case above.

In these cases you can use .SDcols to include all columns, even the ones
used by 'by':

> DT[, lapply(.SD, sum), by=list(grp=x%/%2)]
grp y
1: 0 6
2: 1 15
3: 2 19

> DT[, lapply(.SD, sum), by=list(grp=x%/%2), .SDcols=names(DT)]
grp x y
1: 0 1 6
2: 1 5 15
3: 2 9 19

> DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)]
x y
1: 1 6
x y
1: 2 7
2: 3 8
x y
1: 4 9
2: 5 10

Arun

From mdowle at mdowle.plus.com  Sun May 12 15:01:49 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 14:01:49 +0100
Subject: [datatable-help] fread: skip
In-Reply-To:
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
Message-ID: <41657922b05299edb07739e0c59add64@imap.plus.net>

Hi,

I suspect you may not have scrolled further down in my reply where I wrote more?

Matthew

On 12.05.2013 13:26, Gabor Grothendieck wrote:
> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9
> from R-Forge now and the sample csv I was using does indeed work
> attempting to do the best it can with the mucked up header. Maybe
> this is sufficient and a skip is not needed but the fact is that there
> is no facility to skip over the bad header had I wanted to.
>
> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle wrote:
>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>
>>> Not with the csv I tried. The header is messed up (most of the header
>>> fields are missing) and it misconstrues it as data.
>>
>> That was fixed a while ago in v1.8.9, from NEWS :
>>
>> " [fread] If some column names are blank they are now given default names
>> rather than causing the header row to be read as a data row "
>>
>>> The automation is great but some way to force its behavior when you
>>> know what it should do seems essential since heuristics can't be
>>> expected to work in all cases.
>>
>> I suspect the heuristics in v1.8.9 work on all your examples so far, but ok
>> point taken.
>>
>> fread allows control of 'autostart' already. This is a line number (default
>> 30) within the regular data block used to detect the separator and search
>> upwards from to find the first data row and/or column names.
>>
>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off
>> the search upwards part. Line skip+1 will be used to detect the separator
>> when sep="auto" and used as column names according to
>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
>> autostart and skip in the same call. If that sounds ok?
>>
>> Matthew
>>
>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Does the auto skip feature of fread cover both of those? From ?fread :
>>>>
>>>> " Once the separator is found on line autostart, the number of columns is
>>>> determined. Then the file is searched backwards from autostart until a row
>>>> is found that doesn't have that number of columns, or the start of file is
>>>> reached. Thus, the first data row is found and any human readable banners
>>>> are automatically skipped.
This feature can be particularly useful >>>> for >>>> loading a set of files which may not all have consistently sized >>>> banners. >>>> " >>>> >>>> There were also some issue with header=FALSE in the first release >>>> (1.8.8) >>>> which have since been fixed in 1.8.9. >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> I would find it useful if fread had a skip= argument as in >>>>> read.table >>>>> since I have files from time to time that have garbage at the >>>>> top. >>>>> Another situation I find from time to time is that the header is >>>>> messed up but one can still read the file if one can skip over >>>>> the >>>>> header and specify header = FALSE. >>>>> >>>>> An extra feature that would be nice but less important would be >>>>> if one >>>>> could specify skip = "string" and have it skip all lines until it >>>>> found one with "string": in it and then start reading from the >>>>> matched >>>>> row onward. Normally the string would be chosen to be a string >>>>> found >>>>> in the header and not likely found prior to the header. read.xls >>>>> in >>>>> gdata has a similar feature and I find it quite handy at times. >>>>> >>>>> -- >>>>> Statistics & Software Consulting >>>>> GKX Group, GKX Associates Inc. >>>>> tel: 1-877-GKX-GROUP >>>>> email: ggrothendieck at gmail.com >>>>> _______________________________________________ >>>>> datatable-help mailing list >>>>> datatable-help at lists.r-forge.r-project.org >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com From aragorn168b at gmail.com Sun May 12 15:14:43 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 12 May 2013 15:14:43 +0200 Subject: [datatable-help] columns in .SD with grouping ad-hoc using "by" In-Reply-To: <932fea75ece9b61bde12aa8d54c21a6f@imap.plus.net> References: <89F5ECD9ADE0453F903B2D5589382B90@gmail.com> <783cf12994786d84f97a99237c5d70ee@imap.plus.net> <1F61CE3680FB4CE6B6BBAE2696CE594D@gmail.com> <932fea75ece9b61bde12aa8d54c21a6f@imap.plus.net> Message-ID: <6235A72AA2834895A051820E0924AAE1@gmail.com> Matthew, Yes, that clarifies things. It makes more sense, especially with the option of being able to use `.SDcols` to include it. And thanks for the change in documentation as well; adds more clarity. Best, Arun On Sunday, May 12, 2013 at 2:45 PM, Matthew Dowle wrote: > On 12.05.2013 12:54, Arunkumar Srinivasan wrote: > > I just realised that I sent it only to MatthewDowle. So, sending it again. Sorry @Matthew for the double email. > > Matthew, > > >> .BY is available to j already for that reason, does that work? .BY isn't a column of .SD because i) it's the same value for every row of .SD i.e. .BY[[1]] is length 1 and contains this particular group (replicating the same value would be wasteful) > > DT[, print(.BY), by = list(grp = x %/% 2)] > > $grp > > [1] 0 > > $grp > > [1] 1 > > $grp > > [1] 2 > > > > DT[, print(.SD), by = list(grp = x %/% 2)] # no column "x" > > y > > 1: 6 > > y > > 1: 7 > > 2: 8 > > y > > 1: 9 > > 2: 10 > > > > My question is not as to why the BY column is not available in .SD. Rather, since .BY does not have column "x" in it (rather the result of x%/% 2), why does .SD not have "x"? It's as if grp = x%/%2 is a "new column". So, "x" should be available to .SD is my point. > > > > > > > > Oh I see now. 
Yes data.table inspects the expressions used in 'by' and considers any columns used there as grouping columns and excludes those from .SD. An example is a date column containing daily observations. DT[, lapply(.SD,sum), by=month(date)] would not wish to sum() the "date" column.
> In ?data.table I've just changed :
> .SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s).
> to
> .SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in 'by' (or 'keyby').
> Further answer below ...
> > >> but more significantly ii) it is often a character group name where running an aggregation function like sum() would trip up on it.
> > Again, I don't think so because, I am not asking for .BY columns to be in .SD.
> > DT[, grp := x %/% 2]
> > DT[, lapply(.SD, sum), by=grp]
> > must be equal to:
> > DT[, lapply(.SD, sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as it's not the grouping column
> >
> > This makes sense in this case because x can be sum()-ed, but isn't true in general like the month(date) case above.
> In these cases you can use .SDcols to include all columns, even the ones used by 'by':
> > DT[, lapply(.SD, sum), by=list(grp=x%/%2)]
> grp y
> 1: 0 6
> 2: 1 15
> 3: 2 19
> > DT[, lapply(.SD, sum), by=list(grp=x%/%2), .SDcols=names(DT)]
> grp x y
> 1: 0 1 6
> 2: 1 5 15
> 3: 2 9 19
> > DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)]
> x y
> 1: 1 6
> x y
> 1: 2 7
> 2: 3 8
> x y
> 1: 4 9
> 2: 5 10
>
> Arun

From ggrothendieck at gmail.com  Sun May 12 15:24:47 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 12 May 2013 09:24:47 -0400
Subject: [datatable-help] fread: skip
In-Reply-To: <41657922b05299edb07739e0c59add64@imap.plus.net>
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
	<41657922b05299edb07739e0c59add64@imap.plus.net>
Message-ID:

Sorry, I did indeed miss the portion of the reply at the very bottom.
Yes, that seems good.

What about colClasses too? I would think that there would be cases
where an automatic approach might not give the result wanted. For
example, order numbers might all be numeric but you would want to
store them as character in case there are leading zeros. In other
cases similar fields might validly have leading zeros but you would
want them regarded as numeric so there is no way to distinguish the
two cases except by having the user indicate their intention.

Also, there exist cases where
- fields are unquoted,
- fields are quoted and doubling the quotes are used to indicate an actual quote and
- where fields are quoted but a backslash quote it used to denote an actual quote.
Ideally all these situations could be handled through some combination
of automatic and specified arguments. In the case of R's read.table
it cannot handle the back slashed quote case but handles the others
mentioned.

On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle wrote:
>
> Hi,
>
> I suspect you may not have scrolled further down in my reply where I
> wrote more?
>
> Matthew
>
> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>
>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9
>> from R-Forge now and the sample csv I was using does indeed work
>> attempting to do the best it can with the mucked up header. Maybe
Maybe >> this is sufficient and a skip is not needed but the fact is that there >> is no facility to skip over the bad header had I wanted to. >> >> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >> wrote: >>> >>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>> >>>> >>>> Not with the csv I tried. The header is messed up (most of the header >>>> fields are missing) and it misconstrues it as data. >>> >>> >>> >>> That was fixed a while ago in v1.8.9, from NEWS : >>> >>> " [fread] If some column names are blank they are now given default >>> names >>> rather than causing the header row to be read as a data row " >>> >>> >>>> The automation is great but some way to force its behavior when you >>>> know what it should do seems essential since heuristics can't be >>>> expected to work in all cases. >>> >>> >>> >>> I suspect the heuristics in v1.8.9 work on all your examples so far, but >>> ok >>> point taken. >>> >>> fread allows control of 'autostart' already. This is a line number >>> (default >>> 30) within the regular data block used to detect the separator and search >>> upwards from to find the first data row and/or column names. >>> >>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning >>> off >>> the search upwards part. Line skip+1 will be used to detect the separator >>> when sep="auto" and used as column names according to >>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>> autostart and skip in the same call. If that sounds ok? >>> >>> Matthew >>> >>> >>> >>>> >>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>> wrote: >>>>> >>>>> >>>>> >>>>> Hi, >>>>> >>>>> Does the auto skip feature of fread cover both of those? From ?fread : >>>>> >>>>> " Once the separator is found on line autostart, the number of >>>>> columns >>>>> is >>>>> determined. Then the file is searched backwards from autostart until a >>>>> row >>>>> is found that doesn't have that number of columns, or the start of file >>>>> is >>>>> reached. Thus, the first data row is found and any human readable >>>>> banners >>>>> are automatically skipped. This feature can be particularly useful for >>>>> loading a set of files which may not all have consistently sized >>>>> banners. >>>>> " >>>>> >>>>> There were also some issue with header=FALSE in the first release >>>>> (1.8.8) >>>>> which have since been fixed in 1.8.9. >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> >>>>>> I would find it useful if fread had a skip= argument as in read.table >>>>>> since I have files from time to time that have garbage at the top. >>>>>> Another situation I find from time to time is that the header is >>>>>> messed up but one can still read the file if one can skip over the >>>>>> header and specify header = FALSE. >>>>>> >>>>>> An extra feature that would be nice but less important would be if one >>>>>> could specify skip = "string" and have it skip all lines until it >>>>>> found one with "string": in it and then start reading from the matched >>>>>> row onward. Normally the string would be chosen to be a string found >>>>>> in the header and not likely found prior to the header. read.xls in >>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>> >>>>>> -- >>>>>> Statistics & Software Consulting >>>>>> GKX Group, GKX Associates Inc. 
>>>>>> tel: 1-877-GKX-GROUP
>>>>>> email: ggrothendieck at gmail.com
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From mdowle at mdowle.plus.com  Sun May 12 16:14:00 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 12 May 2013 15:14:00 +0100
Subject: [datatable-help] fread: skip
In-Reply-To:
References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net>
	<41657922b05299edb07739e0c59add64@imap.plus.net>
Message-ID:

Agreed too. colClasses was committed yesterday as luck would have it.

?fread now has :

colClasses : A character vector of classes (named or unnamed), as read.csv.
Or, type list enables setting ranges of columns by numeric position.
colClasses in fread is intended for rare overrides, not for routine use.
fread will only promote a column to a higher type if colClasses requests it.
It won't downgrade a column to a lower type since NAs would result. You have
to coerce such columns afterwards yourself, if you really require data loss.

The tests so far are as follows :

input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n'

test(952, fread(input, colClasses=c(C="character")),
  data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000")))
test(953, fread(input, colClasses=c(C="character",A="numeric")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(954, fread(input, colClasses=c(C="character",A="double")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(955, fread(input, colClasses=list(character="C",double="A")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(956, fread(input, colClasses=list(character=2:3,double="A")),
  data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
test(957, fread(input, colClasses=list(character=1:3)),
  data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
test(958, fread(input, colClasses="character"),
  data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
test(959, fread(input, colClasses=c("character","double","numeric")),
  data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28)))

test(960, fread(input, colClasses=c("character","double")),
  error="colClasses is unnamed and length 2 but there are 3 columns. See")
test(961, fread(input, colClasses=1:3),
  error="colClasses is not type list or character vector")
test(962, fread(input, colClasses=list(1:3)),
  error="colClasses is type list but has no names")
test(963, fread(input, colClasses=list(character="D")),
  error="Column name 'D' in colClasses not found in data")
test(964, fread(input, colClasses=c(D="character")),
  error="Column name 'D' in colClasses not found in data")
test(965, fread(input, colClasses=list(character=0)),
  error="Column number 0 (colClasses..1...1.) is out of range .1,ncol=3.")
test(966, fread(input, colClasses=list(character=2:4)),
  error="Column number 4 (colClasses..1...3.) is out of range .1,ncol=3.")

More detailed/trace info is provided when verbose=TRUE.

On embedded quotes there are known and documented problems still to resolve.
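[For concreteness -- an illustration, not from the thread -- the two
embedded-quote conventions Gabor described would look like this in a csv,
where both B fields are intended to be read as the single value a "b" c:

A,B
1,"a ""b"" c"
2,"a \"b\" c"
]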
The issue there is subtle: when reading character columns, part of fread's speed comes from pointing mkCharLen() directly to the field in memory mapped region of RAM i.e. the field isn't copied into any intermediate buffer at all. But for embedded quotes (either doubled or escaped) we do need to copy to a buffer so we can remove the doubled quote, or escape character (i.e. change the field) before calling mkCharLen(). That's not a problem per se, but just a new twist to the C code to implement. In order to not slow down, it need only copy that field to a buffer if a doubled or escaped quote was actually present in that particular field. Matthew On 12.05.2013 14:24, Gabor Grothendieck wrote: > Sorry, I did indeed miss the portion of the reply at the very bottom. > Yes, that seems good. > > What about colClasses too? I would think that there would be cases > where an automatic approach might not give the result wanted. For > example, order numbers might all be numeric but you would want to > store them as character in case there are leading zeros. In other > cases similar fields might validly have leading zeros but you would > want them regarded as numeric so there is no way to distinguish the > two cases except by having the user indicate their intention. > > Also, there exist cases where > - fields are unquoted, > - fields are quoted and doubling the quotes are used to indicate an > actual quote and > - where fields are quoted but a backslash quote it used to denote an > actual quote. > Ideally all these situations could be handled through some > combination > of automatic and specified arguments. In the case of R's read.table > it cannot handle the back slashed quote case but handles the others > mentioned. > > > On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle > wrote: >> >> Hi, >> >> I suspect you may not have scrolled further down in my reply where I >> wrote >> more? >> >> Matthew >> >> >> >> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>> >>> 1.8.8 is the most recent version on CRAN so I have now installed >>> 1.8.9 >>> from R-Forge now and the sample csv I was using does indeed work >>> attempting to do the best it can with the mucked up header. Maybe >>> this is sufficient and a skip is not needed but the fact is that >>> there >>> is no facility to skip over the bad header had I wanted to. >>> >>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>> wrote: >>>> >>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> Not with the csv I tried. The header is messed up (most of the >>>>> header >>>>> fields are missing) and it misconstrues it as data. >>>> >>>> >>>> >>>> That was fixed a while ago in v1.8.9, from NEWS : >>>> >>>> " [fread] If some column names are blank they are now given >>>> default >>>> names >>>> rather than causing the header row to be read as a data row " >>>> >>>> >>>>> The automation is great but some way to force its behavior when >>>>> you >>>>> know what it should do seems essential since heuristics can't be >>>>> expected to work in all cases. >>>> >>>> >>>> >>>> I suspect the heuristics in v1.8.9 work on all your examples so >>>> far, but >>>> ok >>>> point taken. >>>> >>>> fread allows control of 'autostart' already. This is a line number >>>> (default >>>> 30) within the regular data block used to detect the separator and >>>> search >>>> upwards from to find the first data row and/or column names. >>>> >>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>> turning >>>> off >>>> the search upwards part. 
Line skip+1 will be used to detect the >>>> separator >>>> when sep="auto" and used as column names according to >>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify >>>> both >>>> autostart and skip in the same call. If that sounds ok? >>>> >>>> Matthew >>>> >>>> >>>> >>>>> >>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> Does the auto skip feature of fread cover both of those? From >>>>>> ?fread : >>>>>> >>>>>> " Once the separator is found on line autostart, the number of >>>>>> columns >>>>>> is >>>>>> determined. Then the file is searched backwards from autostart >>>>>> until a >>>>>> row >>>>>> is found that doesn't have that number of columns, or the start >>>>>> of file >>>>>> is >>>>>> reached. Thus, the first data row is found and any human >>>>>> readable >>>>>> banners >>>>>> are automatically skipped. This feature can be particularly >>>>>> useful for >>>>>> loading a set of files which may not all have consistently sized >>>>>> banners. >>>>>> " >>>>>> >>>>>> There were also some issue with header=FALSE in the first >>>>>> release >>>>>> (1.8.8) >>>>>> which have since been fixed in 1.8.9. >>>>>> >>>>>> Matthew >>>>>> >>>>>> >>>>>> >>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>> read.table >>>>>>> since I have files from time to time that have garbage at the >>>>>>> top. >>>>>>> Another situation I find from time to time is that the header >>>>>>> is >>>>>>> messed up but one can still read the file if one can skip over >>>>>>> the >>>>>>> header and specify header = FALSE. >>>>>>> >>>>>>> An extra feature that would be nice but less important would be >>>>>>> if one >>>>>>> could specify skip = "string" and have it skip all lines until >>>>>>> it >>>>>>> found one with "string": in it and then start reading from the >>>>>>> matched >>>>>>> row onward. Normally the string would be chosen to be a >>>>>>> string found >>>>>>> in the header and not likely found prior to the header. >>>>>>> read.xls in >>>>>>> gdata has a similar feature and I find it quite handy at >>>>>>> times. >>>>>>> >>>>>>> -- >>>>>>> Statistics & Software Consulting >>>>>>> GKX Group, GKX Associates Inc. >>>>>>> tel: 1-877-GKX-GROUP >>>>>>> email: ggrothendieck at gmail.com >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >> >> From ggrothendieck at gmail.com Sun May 12 16:44:10 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 12 May 2013 10:44:10 -0400 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: That looks great. It occurred to me in looking at this that one thing that might be useful would be to provide some conversion routines that can be specified as classes in the colClass vector that will convert numbers from Excel representing Dates or date/times to Date and POSIXct class respectively. (The mapping is discussed in R News 4/1.) 
On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle wrote: > > Agreed too. colClasses was committed yesterday as luck would have it. > > ?fread now has : > > colClasses : A character vector of classes (named or unnamed), as > read.csv. Or, type list enables setting ranges of columns by numeric > position. colClasses in fread is intended for rare overrides, not for > routine use. fread will only promote a column to a higher type if colClasses > requests it. It won't downgrade a column to a lower type since NAs would > result. You have to coerce such columns afterwards yourself, if you really > require data loss. > > The tests so far are as follows : > > input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n' > > test(952, fread(input, colClasses=c(C="character")), > data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000"))) > test(953, fread(input, colClasses=c(C="character",A="numeric")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(954, fread(input, colClasses=c(C="character",A="double")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(955, fread(input, colClasses=list(character="C",double="A")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(956, fread(input, colClasses=list(character=2:3,double="A")), > data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(957, fread(input, colClasses=list(character=1:3)), > data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(958, fread(input, colClasses="character"), > data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) > test(959, fread(input, colClasses=c("character","double","numeric")), > data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28))) > > test(960, fread(input, colClasses=c("character","double")), > error="colClasses is unnamed and length 2 but there are 3 columns. See") > test(961, fread(input, colClasses=1:3), error="colClasses is not type list > or character vector") > test(962, fread(input, colClasses=list(1:3)), error="colClasses is type list > but has no names") > test(963, fread(input, colClasses=list(character="D")), error="Column name > 'D' in colClasses not found in data") > test(964, fread(input, colClasses=c(D="character")), error="Column name 'D' > in colClasses not found in data") > test(965, fread(input, colClasses=list(character=0)), error="Column number 0 > (colClasses..1...1.) is out of range .1,ncol=3.") > test(966, fread(input, colClasses=list(character=2:4)), error="Column number > 4 (colClasses..1...3.) is out of range .1,ncol=3.") > > More detailed/trace info is provided when verbose=TRUE. > > > On embedded quotes there are known and documented problems still to resolve. > The issue there is subtle: when reading character columns, part of fread's > speed comes from pointing mkCharLen() directly to the field in memory mapped > region of RAM i.e. the field isn't copied into any intermediate buffer at > all. But for embedded quotes (either doubled or escaped) we do need to copy > to a buffer so we can remove the doubled quote, or escape character (i.e. > change the field) before calling mkCharLen(). That's not a problem per se, > but just a new twist to the C code to implement. In order to not slow down, > it need only copy that field to a buffer if a doubled or escaped quote was > actually present in that particular field. > > Matthew > > > > On 12.05.2013 14:24, Gabor Grothendieck wrote: >> >> Sorry, I did indeed miss the portion of the reply at the very bottom. 
>> Yes, that seems good. >> >> What about colClasses too? I would think that there would be cases >> where an automatic approach might not give the result wanted. For >> example, order numbers might all be numeric but you would want to >> store them as character in case there are leading zeros. In other >> cases similar fields might validly have leading zeros but you would >> want them regarded as numeric so there is no way to distinguish the >> two cases except by having the user indicate their intention. >> >> Also, there exist cases where >> - fields are unquoted, >> - fields are quoted and doubling the quotes are used to indicate an >> actual quote and >> - where fields are quoted but a backslash quote it used to denote an >> actual quote. >> Ideally all these situations could be handled through some combination >> of automatic and specified arguments. In the case of R's read.table >> it cannot handle the back slashed quote case but handles the others >> mentioned. >> >> >> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >> wrote: >>> >>> >>> Hi, >>> >>> I suspect you may not have scrolled further down in my reply where I >>> wrote >>> more? >>> >>> Matthew >>> >>> >>> >>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>> >>>> >>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>> from R-Forge now and the sample csv I was using does indeed work >>>> attempting to do the best it can with the mucked up header. Maybe >>>> this is sufficient and a skip is not needed but the fact is that there >>>> is no facility to skip over the bad header had I wanted to. >>>> >>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>> wrote: >>>>> >>>>> >>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> >>>>>> Not with the csv I tried. The header is messed up (most of the header >>>>>> fields are missing) and it misconstrues it as data. >>>>> >>>>> >>>>> >>>>> >>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>> >>>>> " [fread] If some column names are blank they are now given default >>>>> names >>>>> rather than causing the header row to be read as a data row " >>>>> >>>>> >>>>>> The automation is great but some way to force its behavior when you >>>>>> know what it should do seems essential since heuristics can't be >>>>>> expected to work in all cases. >>>>> >>>>> >>>>> >>>>> >>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, >>>>> but >>>>> ok >>>>> point taken. >>>>> >>>>> fread allows control of 'autostart' already. This is a line number >>>>> (default >>>>> 30) within the regular data block used to detect the separator and >>>>> search >>>>> upwards from to find the first data row and/or column names. >>>>> >>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>> turning >>>>> off >>>>> the search upwards part. Line skip+1 will be used to detect the >>>>> separator >>>>> when sep="auto" and used as column names according to >>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>> autostart and skip in the same call. If that sounds ok? >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>>> >>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Does the auto skip feature of fread cover both of those? From ?fread >>>>>>> : >>>>>>> >>>>>>> " Once the separator is found on line autostart, the number of >>>>>>> columns >>>>>>> is >>>>>>> determined. 
Then the file is searched backwards from autostart until >>>>>>> a >>>>>>> row >>>>>>> is found that doesn't have that number of columns, or the start of >>>>>>> file >>>>>>> is >>>>>>> reached. Thus, the first data row is found and any human readable >>>>>>> banners >>>>>>> are automatically skipped. This feature can be particularly useful >>>>>>> for >>>>>>> loading a set of files which may not all have consistently sized >>>>>>> banners. >>>>>>> " >>>>>>> >>>>>>> There were also some issue with header=FALSE in the first release >>>>>>> (1.8.8) >>>>>>> which have since been fixed in 1.8.9. >>>>>>> >>>>>>> Matthew >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>> read.table >>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>> Another situation I find from time to time is that the header is >>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>> header and specify header = FALSE. >>>>>>>> >>>>>>>> An extra feature that would be nice but less important would be if >>>>>>>> one >>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>> found one with "string": in it and then start reading from the >>>>>>>> matched >>>>>>>> row onward. Normally the string would be chosen to be a string >>>>>>>> found >>>>>>>> in the header and not likely found prior to the header. read.xls in >>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>> >>>>>>>> -- >>>>>>>> Statistics & Software Consulting >>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>> email: ggrothendieck at gmail.com >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>> >>> >>> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 17:20:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 16:20:44 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> For that I think all that needs to be done (now) is adding something very similar to these few lines (from read.table) into fread at R level after the data has been read in : if (colClasses[i] == "factor") as.factor(data[[i]]) else if (colClasses[i] == "Date") as.Date(data[[i]]) else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]]) else methods::as(data[[i]], colClasses[i]) Although I don't quite see why read.table explicity deals with factor, Date and POSIXct separately, rather than leaving them to the methods::as catch all at the end. But reading dates (for example) as character and then converting to Date at R level is going to be relatively slow due to the intermediate character vector and adding all the unique strings to R's global cache. Direct reading of dates (e.g. 
by using Simon U's fasttime package) could be built in at C level at a later date just for speed, without breaking syntax or output types. In the meantime it would work at least. That's the thinking, anyway. I found some discussion in R News 4.1 about Excel dates and times, but not on colClasses or that mapping specifically. Currently in fread if a colClasses name isn't recognised as a basic type like integer|numeric|double|integer64|character, then it's read as character and (to be done) as long as there's an as.() method for it that'll take care of it. Reading numbers (such as offset from epoch) and then as() on that numeric|integer column isn't something I'd considered before (is that what you mean?) Matthew On 12.05.2013 15:44, Gabor Grothendieck wrote: > That looks great. It occurred to me in looking at this that one > thing > that might be useful would be to provide some conversion routines > that > can be specified as classes in the colClass vector that will convert > numbers from Excel representing Dates or date/times to Date and > POSIXct class respectively. (The mapping is discussed in R News > 4/1.) > > On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle > wrote: >> >> Agreed too. colClasses was committed yesterday as luck would have >> it. >> >> ?fread now has : >> >> colClasses : A character vector of classes (named or unnamed), as >> read.csv. Or, type list enables setting ranges of columns by numeric >> position. colClasses in fread is intended for rare overrides, not >> for >> routine use. fread will only promote a column to a higher type if >> colClasses >> requests it. It won't downgrade a column to a lower type since NAs >> would >> result. You have to coerce such columns afterwards yourself, if you >> really >> require data loss. >> >> The tests so far are as follows : >> >> input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n' >> >> test(952, fread(input, colClasses=c(C="character")), >> data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(953, fread(input, colClasses=c(C="character",A="numeric")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(954, fread(input, colClasses=c(C="character",A="double")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(955, fread(input, colClasses=list(character="C",double="A")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(956, fread(input, colClasses=list(character=2:3,double="A")), >> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(957, fread(input, colClasses=list(character=1:3)), >> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(958, fread(input, colClasses="character"), >> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >> test(959, fread(input, >> colClasses=c("character","double","numeric")), >> data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28))) >> >> test(960, fread(input, colClasses=c("character","double")), >> error="colClasses is unnamed and length 2 but there are 3 columns. 
>> See") >> test(961, fread(input, colClasses=1:3), error="colClasses is not >> type list >> or character vector") >> test(962, fread(input, colClasses=list(1:3)), error="colClasses is >> type list >> but has no names") >> test(963, fread(input, colClasses=list(character="D")), >> error="Column name >> 'D' in colClasses not found in data") >> test(964, fread(input, colClasses=c(D="character")), error="Column >> name 'D' >> in colClasses not found in data") >> test(965, fread(input, colClasses=list(character=0)), error="Column >> number 0 >> (colClasses..1...1.) is out of range .1,ncol=3.") >> test(966, fread(input, colClasses=list(character=2:4)), >> error="Column number >> 4 (colClasses..1...3.) is out of range .1,ncol=3.") >> >> More detailed/trace info is provided when verbose=TRUE. >> >> >> On embedded quotes there are known and documented problems still to >> resolve. >> The issue there is subtle: when reading character columns, part of >> fread's >> speed comes from pointing mkCharLen() directly to the field in >> memory mapped >> region of RAM i.e. the field isn't copied into any intermediate >> buffer at >> all. But for embedded quotes (either doubled or escaped) we do need >> to copy >> to a buffer so we can remove the doubled quote, or escape character >> (i.e. >> change the field) before calling mkCharLen(). That's not a problem >> per se, >> but just a new twist to the C code to implement. In order to not >> slow down, >> it need only copy that field to a buffer if a doubled or escaped >> quote was >> actually present in that particular field. >> >> Matthew >> >> >> >> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>> >>> Sorry, I did indeed miss the portion of the reply at the very >>> bottom. >>> Yes, that seems good. >>> >>> What about colClasses too? I would think that there would be >>> cases >>> where an automatic approach might not give the result wanted. For >>> example, order numbers might all be numeric but you would want to >>> store them as character in case there are leading zeros. In other >>> cases similar fields might validly have leading zeros but you would >>> want them regarded as numeric so there is no way to distinguish the >>> two cases except by having the user indicate their intention. >>> >>> Also, there exist cases where >>> - fields are unquoted, >>> - fields are quoted and doubling the quotes are used to indicate an >>> actual quote and >>> - where fields are quoted but a backslash quote it used to denote >>> an >>> actual quote. >>> Ideally all these situations could be handled through some >>> combination >>> of automatic and specified arguments. In the case of R's >>> read.table >>> it cannot handle the back slashed quote case but handles the others >>> mentioned. >>> >>> >>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >>> wrote: >>>> >>>> >>>> Hi, >>>> >>>> I suspect you may not have scrolled further down in my reply where >>>> I >>>> wrote >>>> more? >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> 1.8.8 is the most recent version on CRAN so I have now installed >>>>> 1.8.9 >>>>> from R-Forge now and the sample csv I was using does indeed work >>>>> attempting to do the best it can with the mucked up header. >>>>> Maybe >>>>> this is sufficient and a skip is not needed but the fact is that >>>>> there >>>>> is no facility to skip over the bad header had I wanted to. 
>> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>> Sorry, I did indeed miss the portion of the reply at the very bottom. >>> Yes, that seems good. >>> >>> What about colClasses too? I would think that there would be cases >>> where an automatic approach might not give the result wanted. For >>> example, order numbers might all be numeric but you would want to >>> store them as character in case there are leading zeros. In other >>> cases similar fields might validly have leading zeros but you would >>> want them regarded as numeric, so there is no way to distinguish the >>> two cases except by having the user indicate their intention. >>> >>> Also, there exist cases where >>> - fields are unquoted, >>> - fields are quoted and doubling the quotes is used to indicate an >>> actual quote, and >>> - fields are quoted but a backslash quote is used to denote an >>> actual quote. >>> Ideally all these situations could be handled through some combination >>> of automatic and specified arguments. In the case of R's read.table >>> it cannot handle the backslashed quote case but handles the others >>> mentioned. >>> >>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle wrote: >>>> >>>> Hi, >>>> >>>> I suspect you may not have scrolled further down in my reply where I >>>> wrote more? >>>> >>>> Matthew >>>> >>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>> >>>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>>> from R-Forge now, and the sample csv I was using does indeed work, >>>>> attempting to do the best it can with the mucked-up header. Maybe >>>>> this is sufficient and a skip is not needed, but the fact is that there >>>>> is no facility to skip over the bad header had I wanted to. >>>>> >>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle wrote: >>>>>> >>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>>> >>>>>>> Not with the csv I tried. The header is messed up (most of the header >>>>>>> fields are missing) and it misconstrues it as data. >>>>>> >>>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>>> >>>>>> " [fread] If some column names are blank they are now given default names >>>>>> rather than causing the header row to be read as a data row " >>>>>> >>>>>>> The automation is great but some way to force its behavior when you >>>>>>> know what it should do seems essential since heuristics can't be >>>>>>> expected to work in all cases. >>>>>> >>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but >>>>>> ok, point taken. >>>>>> >>>>>> fread allows control of 'autostart' already. This is a line number (default >>>>>> 30) within the regular data block used to detect the separator and search >>>>>> upwards from to find the first data row and/or column names. >>>>>> >>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off >>>>>> the search upwards part. Line skip+1 will be used to detect the separator >>>>>> when sep="auto" and used as column names according to >>>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>>> autostart and skip in the same call. If that sounds ok? >>>>>> >>>>>> Matthew >>>>>> >>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Does the auto skip feature of fread cover both of those? From ?fread : >>>>>>>> >>>>>>>> " Once the separator is found on line autostart, the number of columns is >>>>>>>> determined. Then the file is searched backwards from autostart until a row >>>>>>>> is found that doesn't have that number of columns, or the start of file is >>>>>>>> reached. Thus, the first data row is found and any human readable banners >>>>>>>> are automatically skipped. This feature can be particularly useful for >>>>>>>> loading a set of files which may not all have consistently sized banners. " >>>>>>>> >>>>>>>> There were also some issues with header=FALSE in the first release (1.8.8) >>>>>>>> which have since been fixed in 1.8.9. >>>>>>>> >>>>>>>> Matthew >>>>>>>> >>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>>> >>>>>>>>> I would find it useful if fread had a skip= argument as in read.table >>>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>>> Another situation I find from time to time is that the header is >>>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>>> header and specify header = FALSE. >>>>>>>>> >>>>>>>>> An extra feature that would be nice but less important would be if one >>>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>>> found one with "string" in it and then start reading from the matched >>>>>>>>> row onward. Normally the string would be chosen to be a string found >>>>>>>>> in the header and not likely found prior to the header. read.xls in >>>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Statistics & Software Consulting >>>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>>> email: ggrothendieck at gmail.com >>>>>>>>> _______________________________________________ >>>>>>>>> datatable-help mailing list >>>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>> >>>>> -- >>>>> Statistics & Software Consulting >>>>> GKX Group, GKX Associates Inc. >>>>> tel: 1-877-GKX-GROUP >>>>> email: ggrothendieck at gmail.com From karl at huftis.org Sun May 12 17:39:50 2013 From: karl at huftis.org (Karl Ove Hufthammer) Date: Sun, 12 May 2013 17:39:50 +0200 Subject: [datatable-help] fread: skip In-Reply-To: <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> Message-ID: <1368373190.5719.19.camel@adrian.site> Sun 12 May 2013 at 16:20 (+0100), Matthew Dowle wrote: > For that I think all that needs to be done (now) is adding something > very similar to these few lines (from read.table) into fread at R level > after the data has been read in : > > if (colClasses[i] == "factor") > as.factor(data[[i]]) > else if (colClasses[i] == "Date") > as.Date(data[[i]]) > else if (colClasses[i] == "POSIXct") > as.POSIXct(data[[i]]) Any chance you could support the 'tz' attribute of 'as.POSIXct' (as a global value for all datetimes would probably be sufficient)? By default, strings are interpreted as being in the locale timezone, which means that some apparently valid datetimes are invalid (because of DST), resulting in loss of information in the resulting POSIXct vectors. See the following 'not-a-bug' for an explanation of the problem: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14845 -- Karl Ove Hufthammer http://huftis.org/ Jabber: karl at huftis.org
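To see Karl's DST point in isolation, a small sketch (the timezone and timestamp here are illustrative; the Sys.setenv(TZ) workaround is the one Gabor suggests further down the thread). During the spring-forward jump -- 02:00 to 03:00 on 2013-03-31 in Europe -- wall-clock times in the skipped hour simply do not exist in the locale timezone:

Sys.setenv(TZ = "Europe/Oslo")
as.POSIXct("2013-03-31 02:30:00")  # falls in the skipped hour: typically NA (platform-dependent)
Sys.setenv(TZ = "GMT")             # pin the timezone and every timestamp parses
as.POSIXct("2013-03-31 02:30:00")  # "2013-03-31 02:30:00 GMT"
Sys.setenv(TZ = "")                # restore the locale default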
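A minimal sketch of how such a converter would plug in (the column name and serial values below are made up for illustration; read.table/read.csv hands unrecognised colClasses entries to methods::as(), which is what makes the pattern work):

setClass("excel.date")
setAs("character", "excel.date",
      function(from) structure(as.numeric(from) - 25569, class = "Date"))

csv <- "id,sold\n1,41406\n2,41407\n"   # 41406 is the Windows-Excel serial for 2013-05-12
read.csv(text = csv, colClasses = c(sold = "excel.date"))
# should give:
#   id       sold
# 1  1 2013-05-12
# 2  2 2013-05-13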
On Sun, May 12, 2013 at 11:20 AM, Matthew Dowle wrote: > > For that I think all that needs to be done (now) is adding something very > similar to these few lines (from read.table) into fread at R level after the > data has been read in : > > if (colClasses[i] == "factor") > as.factor(data[[i]]) > else if (colClasses[i] == "Date") > as.Date(data[[i]]) > else if (colClasses[i] == "POSIXct") > as.POSIXct(data[[i]]) > else methods::as(data[[i]], colClasses[i]) > > Although I don't quite see why read.table explicitly deals with factor, Date > and POSIXct separately, rather than leaving them to the methods::as catch-all > at the end. > > But reading dates (for example) as character and then converting to Date at R > level is going to be relatively slow due to the intermediate character vector > and adding all the unique strings to R's global cache. Direct reading of dates > (e.g. by using Simon U's fasttime package) could be built in at C level at a > later date just for speed, without breaking syntax or output types. In the > meantime it would work at least. That's the thinking, anyway. > > I found some discussion in R News 4.1 about Excel dates and times, but not on > colClasses or that mapping specifically. Currently in fread, if a colClasses > name isn't recognised as a basic type like integer|numeric|double|integer64|character, > then it's read as character and (to be done) as long as there's an as.() method > for it that'll take care of it. Reading numbers (such as offset from epoch) and > then as() on that numeric|integer column isn't something I'd considered before > (is that what you mean?) > > Matthew > > On 12.05.2013 15:44, Gabor Grothendieck wrote: >> That looks great. It occurred to me in looking at this that one thing >> that might be useful would be to provide some conversion routines that >> can be specified as classes in the colClasses vector that will convert >> numbers from Excel representing Dates or date/times to Date and >> POSIXct class respectively. (The mapping is discussed in R News 4/1.) >> >> On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle wrote: >>> >>> Agreed too. colClasses was committed yesterday as luck would have it. >>> >>> ?fread now has : >>> >>> colClasses : A character vector of classes (named or unnamed), as >>> read.csv. Or, type list enables setting ranges of columns by numeric >>> position. colClasses in fread is intended for rare overrides, not for >>> routine use. fread will only promote a column to a higher type if colClasses >>> requests it. It won't downgrade a column to a lower type since NAs would >>> result. You have to coerce such columns afterwards yourself, if you really >>> require data loss.
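A quick illustration of the promote-but-never-downgrade rule just quoted, using the same input as the tests that follow (a sketch of the documented behaviour, not pasted from a session):

input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n'
fread(input)                                   # A detected as integer: 1, 2 (leading zeros lost)
fread(input, colClasses = c(A = "character"))  # promotion honoured: A kept as "01", "002"
fread(input, colClasses = c(C = "integer"))    # downgrade refused: C stays numeric (3.14, 6.28)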
>>> >>> The tests so far are as follows : >>> >>> input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n' >>> >>> test(952, fread(input, colClasses=c(C="character")), >>> data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(953, fread(input, colClasses=c(C="character",A="numeric")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(954, fread(input, colClasses=c(C="character",A="double")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(955, fread(input, colClasses=list(character="C",double="A")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(956, fread(input, colClasses=list(character=2:3,double="A")), >>> data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(957, fread(input, colClasses=list(character=1:3)), >>> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(958, fread(input, colClasses="character"), >>> data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000"))) >>> test(959, fread(input, colClasses=c("character","double","numeric")), >>> data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28))) >>> >>> test(960, fread(input, colClasses=c("character","double")), >>> error="colClasses is unnamed and length 2 but there are 3 columns. See") >>> test(961, fread(input, colClasses=1:3), error="colClasses is not type >>> list >>> or character vector") >>> test(962, fread(input, colClasses=list(1:3)), error="colClasses is type >>> list >>> but has no names") >>> test(963, fread(input, colClasses=list(character="D")), error="Column >>> name >>> 'D' in colClasses not found in data") >>> test(964, fread(input, colClasses=c(D="character")), error="Column name >>> 'D' >>> in colClasses not found in data") >>> test(965, fread(input, colClasses=list(character=0)), error="Column >>> number 0 >>> (colClasses..1...1.) is out of range .1,ncol=3.") >>> test(966, fread(input, colClasses=list(character=2:4)), error="Column >>> number >>> 4 (colClasses..1...3.) is out of range .1,ncol=3.") >>> >>> More detailed/trace info is provided when verbose=TRUE. >>> >>> >>> On embedded quotes there are known and documented problems still to >>> resolve. >>> The issue there is subtle: when reading character columns, part of >>> fread's >>> speed comes from pointing mkCharLen() directly to the field in memory >>> mapped >>> region of RAM i.e. the field isn't copied into any intermediate buffer at >>> all. But for embedded quotes (either doubled or escaped) we do need to >>> copy >>> to a buffer so we can remove the doubled quote, or escape character (i.e. >>> change the field) before calling mkCharLen(). That's not a problem per >>> se, >>> but just a new twist to the C code to implement. In order to not slow >>> down, >>> it need only copy that field to a buffer if a doubled or escaped quote >>> was >>> actually present in that particular field. >>> >>> Matthew >>> >>> >>> >>> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>>> >>>> >>>> Sorry, I did indeed miss the portion of the reply at the very bottom. >>>> Yes, that seems good. >>>> >>>> What about colClasses too? I would think that there would be cases >>>> where an automatic approach might not give the result wanted. For >>>> example, order numbers might all be numeric but you would want to >>>> store them as character in case there are leading zeros. 
In other >>>> cases similar fields might validly have leading zeros but you would >>>> want them regarded as numeric so there is no way to distinguish the >>>> two cases except by having the user indicate their intention. >>>> >>>> Also, there exist cases where >>>> - fields are unquoted, >>>> - fields are quoted and doubling the quotes are used to indicate an >>>> actual quote and >>>> - where fields are quoted but a backslash quote it used to denote an >>>> actual quote. >>>> Ideally all these situations could be handled through some combination >>>> of automatic and specified arguments. In the case of R's read.table >>>> it cannot handle the back slashed quote case but handles the others >>>> mentioned. >>>> >>>> >>>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >>>> wrote: >>>>> >>>>> >>>>> >>>>> Hi, >>>>> >>>>> I suspect you may not have scrolled further down in my reply where I >>>>> wrote >>>>> more? >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> >>>>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>>>> from R-Forge now and the sample csv I was using does indeed work >>>>>> attempting to do the best it can with the mucked up header. Maybe >>>>>> this is sufficient and a skip is not needed but the fact is that there >>>>>> is no facility to skip over the bad header had I wanted to. >>>>>> >>>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Not with the csv I tried. The header is messed up (most of the >>>>>>>> header >>>>>>>> fields are missing) and it misconstrues it as data. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>>>> >>>>>>> " [fread] If some column names are blank they are now given default >>>>>>> names >>>>>>> rather than causing the header row to be read as a data row " >>>>>>> >>>>>>> >>>>>>>> The automation is great but some way to force its behavior when you >>>>>>>> know what it should do seems essential since heuristics can't be >>>>>>>> expected to work in all cases. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, >>>>>>> but >>>>>>> ok >>>>>>> point taken. >>>>>>> >>>>>>> fread allows control of 'autostart' already. This is a line number >>>>>>> (default >>>>>>> 30) within the regular data block used to detect the separator and >>>>>>> search >>>>>>> upwards from to find the first data row and/or column names. >>>>>>> >>>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>>>> turning >>>>>>> off >>>>>>> the search upwards part. Line skip+1 will be used to detect the >>>>>>> separator >>>>>>> when sep="auto" and used as column names according to >>>>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>>>> autostart and skip in the same call. If that sounds ok? >>>>>>> >>>>>>> Matthew >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Does the auto skip feature of fread cover both of those? From >>>>>>>>> ?fread >>>>>>>>> : >>>>>>>>> >>>>>>>>> " Once the separator is found on line autostart, the number of >>>>>>>>> columns >>>>>>>>> is >>>>>>>>> determined. 
Then the file is searched backwards from autostart >>>>>>>>> until >>>>>>>>> a >>>>>>>>> row >>>>>>>>> is found that doesn't have that number of columns, or the start of >>>>>>>>> file >>>>>>>>> is >>>>>>>>> reached. Thus, the first data row is found and any human readable >>>>>>>>> banners >>>>>>>>> are automatically skipped. This feature can be particularly useful >>>>>>>>> for >>>>>>>>> loading a set of files which may not all have consistently sized >>>>>>>>> banners. >>>>>>>>> " >>>>>>>>> >>>>>>>>> There were also some issue with header=FALSE in the first release >>>>>>>>> (1.8.8) >>>>>>>>> which have since been fixed in 1.8.9. >>>>>>>>> >>>>>>>>> Matthew >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>>>> read.table >>>>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>>>> Another situation I find from time to time is that the header is >>>>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>>>> header and specify header = FALSE. >>>>>>>>>> >>>>>>>>>> An extra feature that would be nice but less important would be if >>>>>>>>>> one >>>>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>>>> found one with "string": in it and then start reading from the >>>>>>>>>> matched >>>>>>>>>> row onward. Normally the string would be chosen to be a string >>>>>>>>>> found >>>>>>>>>> in the header and not likely found prior to the header. read.xls >>>>>>>>>> in >>>>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Statistics & Software Consulting >>>>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>>>> email: ggrothendieck at gmail.com >>>>>>>>>> _______________________________________________ >>>>>>>>>> datatable-help mailing list >>>>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Statistics & Software Consulting >>>>>> GKX Group, GKX Associates Inc. >>>>>> tel: 1-877-GKX-GROUP >>>>>> email: ggrothendieck at gmail.com >>>>> >>>>> >>>>> >>>>> >>> > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From ggrothendieck at gmail.com Sun May 12 17:42:15 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 12 May 2013 11:42:15 -0400 Subject: [datatable-help] fread: skip In-Reply-To: <1368373190.5719.19.camel@adrian.site> References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> <25ee4399076697e83d55e96d2b0cc93d@imap.plus.net> <1368373190.5719.19.camel@adrian.site> Message-ID: This makes the local time zone GMT: Sys.setenv(TZ = "GMT") and this switches back: Sys.setenv(TZ = "") On Sun, May 12, 2013 at 11:39 AM, Karl Ove Hufthammer wrote: > su. den 12. 05. 
2013 at 16:20 (+0100), Matthew Dowle wrote: >> For that I think all that needs to be done (now) is adding something >> very similar to these few lines (from read.table) into fread at R level >> after the data has been read in : >> >> if (colClasses[i] == "factor") >> as.factor(data[[i]]) >> else if (colClasses[i] == "Date") >> as.Date(data[[i]]) >> else if (colClasses[i] == "POSIXct") >> as.POSIXct(data[[i]]) > > Any chance you could support the 'tz' attribute of 'as.POSIXct' > (as a global value for all datetimes would probably be sufficient)? > > By default, strings are interpreted as being in the locale timezone, > which means that some apparently valid datetimes are invalid (because of > DST), resulting in loss of information in the resulting POSIXct vectors. > > See the following 'not-a-bug' for an explanation of the problem: > https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14845 > > -- > Karl Ove Hufthammer > http://huftis.org/ > Jabber: karl at huftis.org > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Sun May 12 19:33:35 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 18:33:35 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: Since I'm in the fread code at the moment I added 'skip' (rev 864). 4 tests added : > input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n" > fread(input) some bad data 1: A B C 2: 1 3 5 3: 2 4 6 > fread(input, skip=1) A B C 1: 1 3 5 2: 2 4 6 > fread(input, skip=2) V1 V2 V3 1: 1 3 5 2: 2 4 6 > fread(input, skip=2, header=TRUE) 1 3 5 1: 2 4 6 > On 12.05.2013 14:24, Gabor Grothendieck wrote: > Sorry, I did indeed miss the portion of the reply at the very bottom. > Yes, that seems good. > > On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle wrote: >> >> Hi, >> >> I suspect you may not have scrolled further down in my reply where I >> wrote more? >> >> Matthew >> >> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>> >>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>> from R-Forge now, and the sample csv I was using does indeed work, >>> attempting to do the best it can with the mucked-up header. Maybe >>> this is sufficient and a skip is not needed, but the fact is that there >>> is no facility to skip over the bad header had I wanted to. >>> >>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>> wrote: >>>> >>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>> >>>>> Not with the csv I tried. The header is messed up (most of the header >>>>> fields are missing) and it misconstrues it as data. >>>> >>>> That was fixed a while ago in v1.8.9, from NEWS : >>>> >>>> " [fread] If some column names are blank they are now given default names >>>> rather than causing the header row to be read as a data row " >>>> >>>>> The automation is great but some way to force its behavior when you >>>>> know what it should do seems essential since heuristics can't be >>>>> expected to work in all cases. >>>> >>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but >>>> ok, point taken.
>>>> >>>> fread allows control of 'autostart' already. This is a line number >>>> (default >>>> 30) within the regular data block used to detect the separator and >>>> search >>>> upwards from to find the first data row and/or column names. >>>> >>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>> turning >>>> off >>>> the search upwards part. Line skip+1 will be used to detect the >>>> separator >>>> when sep="auto" and used as column names according to >>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify >>>> both >>>> autostart and skip in the same call. If that sounds ok? >>>> >>>> Matthew >>>> >>>> >>>> >>>>> >>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hi, >>>>>> >>>>>> Does the auto skip feature of fread cover both of those? From >>>>>> ?fread : >>>>>> >>>>>> " Once the separator is found on line autostart, the number of >>>>>> columns >>>>>> is >>>>>> determined. Then the file is searched backwards from autostart >>>>>> until a >>>>>> row >>>>>> is found that doesn't have that number of columns, or the start >>>>>> of file >>>>>> is >>>>>> reached. Thus, the first data row is found and any human >>>>>> readable >>>>>> banners >>>>>> are automatically skipped. This feature can be particularly >>>>>> useful for >>>>>> loading a set of files which may not all have consistently sized >>>>>> banners. >>>>>> " >>>>>> >>>>>> There were also some issue with header=FALSE in the first >>>>>> release >>>>>> (1.8.8) >>>>>> which have since been fixed in 1.8.9. >>>>>> >>>>>> Matthew >>>>>> >>>>>> >>>>>> >>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>> read.table >>>>>>> since I have files from time to time that have garbage at the >>>>>>> top. >>>>>>> Another situation I find from time to time is that the header >>>>>>> is >>>>>>> messed up but one can still read the file if one can skip over >>>>>>> the >>>>>>> header and specify header = FALSE. >>>>>>> >>>>>>> An extra feature that would be nice but less important would be >>>>>>> if one >>>>>>> could specify skip = "string" and have it skip all lines until >>>>>>> it >>>>>>> found one with "string": in it and then start reading from the >>>>>>> matched >>>>>>> row onward. Normally the string would be chosen to be a >>>>>>> string found >>>>>>> in the header and not likely found prior to the header. >>>>>>> read.xls in >>>>>>> gdata has a similar feature and I find it quite handy at >>>>>>> times. >>>>>>> >>>>>>> -- >>>>>>> Statistics & Software Consulting >>>>>>> GKX Group, GKX Associates Inc. >>>>>>> tel: 1-877-GKX-GROUP >>>>>>> email: ggrothendieck at gmail.com >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. 
>>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >> >> From mdowle at mdowle.plus.com Mon May 13 00:01:32 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 12 May 2013 23:01:32 +0100 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: And skip="string" is also now added and gdata credited (nice idea!) > input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal > data:\nA,B,C\n1,3,5\n2,4,6\n" > cat(input) some,bad,data some,cols 1,2 3,4 real data: A,B,C 1,3,5 2,4,6 > fread(input, skip="B,C") A B C 1: 1 3 5 2: 2 4 6 > fread(input) # autostart handles this case already (since the "real > data:" line doesn't contain 2 * sep) A B C 1: 1 3 5 2: 2 4 6 > fread(input, skip="some,cols") # using skip="string" to get the > middle table some cols 1: 1 2 2: 3 4 Warning message: In fread(input, skip = "some,cols") : Stopped reading at empty line, 2 lines after the 'skip' string was found, but text exists afterwards (discarded): real data: Further example : > input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 > 3\n2 4\n" > cat(input) some,bad,data some,cols 1,2 3,4 real data: A B 1 3 2 4 > fread(input) # with space as separator autostart can't distinguish > the "real data:" line. header wouldn't help here. real data: 1: A B 2: 1 3 3: 2 4 > fread(input, skip="B") # skip="string" needed (skip=n onerous). > Nice! A B 1: 1 3 2: 2 4 > Matthew On 12.05.2013 18:33, Matthew Dowle wrote: > Since I'm in the fread code at the moment I added 'skip' (rev 864). > 4 tests added : > >> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n" >> fread(input) > some bad data > 1: A B C > 2: 1 3 5 > 3: 2 4 6 >> fread(input, skip=1) > A B C > 1: 1 3 5 > 2: 2 4 6 >> fread(input, skip=2) > V1 V2 V3 > 1: 1 3 5 > 2: 2 4 6 >> fread(input, skip=2, header=TRUE) > 1 3 5 > 1: 2 4 6 >> > > > On 12.05.2013 14:24, Gabor Grothendieck wrote: >> Sorry, I did indeed miss the portion of the reply at the very >> bottom. >> Yes, that seems good. >> >> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >> wrote: >>> >>> Hi, >>> >>> I suspect you may not have scrolled further down in my reply where >>> I wrote >>> more? >>> >>> Matthew >>> >>> >>> >>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>> >>>> 1.8.8 is the most recent version on CRAN so I have now installed >>>> 1.8.9 >>>> from R-Forge now and the sample csv I was using does indeed work >>>> attempting to do the best it can with the mucked up header. >>>> Maybe >>>> this is sufficient and a skip is not needed but the fact is that >>>> there >>>> is no facility to skip over the bad header had I wanted to. >>>> >>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>> wrote: >>>>> >>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>> >>>>>> >>>>>> Not with the csv I tried. The header is messed up (most of the >>>>>> header >>>>>> fields are missing) and it misconstrues it as data. >>>>> >>>>> >>>>> >>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>> >>>>> " [fread] If some column names are blank they are now given >>>>> default >>>>> names >>>>> rather than causing the header row to be read as a data row " >>>>> >>>>> >>>>>> The automation is great but some way to force its behavior when >>>>>> you >>>>>> know what it should do seems essential since heuristics can't be >>>>>> expected to work in all cases. 
>>>>> >>>>> >>>>> >>>>> I suspect the heuristics in v1.8.9 work on all your examples so >>>>> far, but >>>>> ok >>>>> point taken. >>>>> >>>>> fread allows control of 'autostart' already. This is a line >>>>> number >>>>> (default >>>>> 30) within the regular data block used to detect the separator >>>>> and search >>>>> upwards from to find the first data row and/or column names. >>>>> >>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>> turning >>>>> off >>>>> the search upwards part. Line skip+1 will be used to detect the >>>>> separator >>>>> when sep="auto" and used as column names according to >>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify >>>>> both >>>>> autostart and skip in the same call. If that sounds ok? >>>>> >>>>> Matthew >>>>> >>>>> >>>>> >>>>>> >>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Does the auto skip feature of fread cover both of those? From >>>>>>> ?fread : >>>>>>> >>>>>>> " Once the separator is found on line autostart, the number >>>>>>> of >>>>>>> columns >>>>>>> is >>>>>>> determined. Then the file is searched backwards from autostart >>>>>>> until a >>>>>>> row >>>>>>> is found that doesn't have that number of columns, or the start >>>>>>> of file >>>>>>> is >>>>>>> reached. Thus, the first data row is found and any human >>>>>>> readable >>>>>>> banners >>>>>>> are automatically skipped. This feature can be particularly >>>>>>> useful for >>>>>>> loading a set of files which may not all have consistently >>>>>>> sized >>>>>>> banners. >>>>>>> " >>>>>>> >>>>>>> There were also some issue with header=FALSE in the first >>>>>>> release >>>>>>> (1.8.8) >>>>>>> which have since been fixed in 1.8.9. >>>>>>> >>>>>>> Matthew >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>> read.table >>>>>>>> since I have files from time to time that have garbage at the >>>>>>>> top. >>>>>>>> Another situation I find from time to time is that the header >>>>>>>> is >>>>>>>> messed up but one can still read the file if one can skip over >>>>>>>> the >>>>>>>> header and specify header = FALSE. >>>>>>>> >>>>>>>> An extra feature that would be nice but less important would >>>>>>>> be if one >>>>>>>> could specify skip = "string" and have it skip all lines until >>>>>>>> it >>>>>>>> found one with "string": in it and then start reading from the >>>>>>>> matched >>>>>>>> row onward. Normally the string would be chosen to be a >>>>>>>> string found >>>>>>>> in the header and not likely found prior to the header. >>>>>>>> read.xls in >>>>>>>> gdata has a similar feature and I find it quite handy at >>>>>>>> times. >>>>>>>> >>>>>>>> -- >>>>>>>> Statistics & Software Consulting >>>>>>>> GKX Group, GKX Associates Inc. >>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>> email: ggrothendieck at gmail.com >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. 
>>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>> >>> > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From ggrothendieck at gmail.com Mon May 13 00:19:04 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 12 May 2013 18:19:04 -0400 Subject: [datatable-help] fread: skip In-Reply-To: References: <21e4d54932de6cb6dbefa4f5b97d410e@imap.plus.net> <41657922b05299edb07739e0c59add64@imap.plus.net> Message-ID: Looks really nice. On Sun, May 12, 2013 at 6:01 PM, Matthew Dowle wrote: > > And skip="string" is also now added and gdata credited (nice idea!) > >> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal >> data:\nA,B,C\n1,3,5\n2,4,6\n" >> cat(input) > > some,bad,data > > some,cols > 1,2 > 3,4 > > > real data: > A,B,C > 1,3,5 > 2,4,6 >> >> fread(input, skip="B,C") > > A B C > 1: 1 3 5 > 2: 2 4 6 >> >> fread(input) # autostart handles this case already (since the "real >> data:" line doesn't contain 2 * sep) > > A B C > 1: 1 3 5 > 2: 2 4 6 >> >> fread(input, skip="some,cols") # using skip="string" to get the middle >> table > > some cols > 1: 1 2 > 2: 3 4 > Warning message: > In fread(input, skip = "some,cols") : > Stopped reading at empty line, 2 lines after the 'skip' string was found, > but text exists afterwards (discarded): real data: > > > Further example : > >> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 >> 4\n" >> cat(input) > > some,bad,data > > some,cols > 1,2 > 3,4 > > real data: > A B > 1 3 > 2 4 >> >> fread(input) # with space as separator autostart can't distinguish the >> "real data:" line. header wouldn't help here. > > real data: > 1: A B > 2: 1 3 > 3: 2 4 >> >> fread(input, skip="B") # skip="string" needed (skip=n onerous). Nice! > > A B > 1: 1 3 > 2: 2 4 >> >> > > Matthew > > > > On 12.05.2013 18:33, Matthew Dowle wrote: >> >> Since I'm in the fread code at the moment I added 'skip' (rev 864). >> 4 tests added : >> >>> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n" >>> fread(input) >> >> some bad data >> 1: A B C >> 2: 1 3 5 >> 3: 2 4 6 >>> >>> fread(input, skip=1) >> >> A B C >> 1: 1 3 5 >> 2: 2 4 6 >>> >>> fread(input, skip=2) >> >> V1 V2 V3 >> 1: 1 3 5 >> 2: 2 4 6 >>> >>> fread(input, skip=2, header=TRUE) >> >> 1 3 5 >> 1: 2 4 6 >>> >>> >> >> >> On 12.05.2013 14:24, Gabor Grothendieck wrote: >>> >>> Sorry, I did indeed miss the portion of the reply at the very bottom. >>> Yes, that seems good. >>> >>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle >>> wrote: >>>> >>>> >>>> Hi, >>>> >>>> I suspect you may not have scrolled further down in my reply where I >>>> wrote >>>> more? >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 12.05.2013 13:26, Gabor Grothendieck wrote: >>>>> >>>>> >>>>> 1.8.8 is the most recent version on CRAN so I have now installed 1.8.9 >>>>> from R-Forge now and the sample csv I was using does indeed work >>>>> attempting to do the best it can with the mucked up header. Maybe >>>>> this is sufficient and a skip is not needed but the fact is that there >>>>> is no facility to skip over the bad header had I wanted to. >>>>> >>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle >>>>> wrote: >>>>>> >>>>>> >>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Not with the csv I tried. The header is messed up (most of the >>>>>>> header >>>>>>> fields are missing) and it misconstrues it as data. 
>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> That was fixed a while ago in v1.8.9, from NEWS : >>>>>> >>>>>> " [fread] If some column names are blank they are now given default >>>>>> names >>>>>> rather than causing the header row to be read as a data row " >>>>>> >>>>>> >>>>>>> The automation is great but some way to force its behavior when you >>>>>>> know what it should do seems essential since heuristics can't be >>>>>>> expected to work in all cases. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, >>>>>> but >>>>>> ok >>>>>> point taken. >>>>>> >>>>>> fread allows control of 'autostart' already. This is a line number >>>>>> (default >>>>>> 30) within the regular data block used to detect the separator and >>>>>> search >>>>>> upwards from to find the first data row and/or column names. >>>>>> >>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but >>>>>> turning >>>>>> off >>>>>> the search upwards part. Line skip+1 will be used to detect the >>>>>> separator >>>>>> when sep="auto" and used as column names according to >>>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both >>>>>> autostart and skip in the same call. If that sounds ok? >>>>>> >>>>>> Matthew >>>>>> >>>>>> >>>>>> >>>>>>> >>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Does the auto skip feature of fread cover both of those? From >>>>>>>> ?fread : >>>>>>>> >>>>>>>> " Once the separator is found on line autostart, the number of >>>>>>>> columns >>>>>>>> is >>>>>>>> determined. Then the file is searched backwards from autostart until >>>>>>>> a >>>>>>>> row >>>>>>>> is found that doesn't have that number of columns, or the start of >>>>>>>> file >>>>>>>> is >>>>>>>> reached. Thus, the first data row is found and any human readable >>>>>>>> banners >>>>>>>> are automatically skipped. This feature can be particularly useful >>>>>>>> for >>>>>>>> loading a set of files which may not all have consistently sized >>>>>>>> banners. >>>>>>>> " >>>>>>>> >>>>>>>> There were also some issue with header=FALSE in the first release >>>>>>>> (1.8.8) >>>>>>>> which have since been fixed in 1.8.9. >>>>>>>> >>>>>>>> Matthew >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I would find it useful if fread had a skip= argument as in >>>>>>>>> read.table >>>>>>>>> since I have files from time to time that have garbage at the top. >>>>>>>>> Another situation I find from time to time is that the header is >>>>>>>>> messed up but one can still read the file if one can skip over the >>>>>>>>> header and specify header = FALSE. >>>>>>>>> >>>>>>>>> An extra feature that would be nice but less important would be if >>>>>>>>> one >>>>>>>>> could specify skip = "string" and have it skip all lines until it >>>>>>>>> found one with "string": in it and then start reading from the >>>>>>>>> matched >>>>>>>>> row onward. Normally the string would be chosen to be a string >>>>>>>>> found >>>>>>>>> in the header and not likely found prior to the header. read.xls in >>>>>>>>> gdata has a similar feature and I find it quite handy at times. >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Statistics & Software Consulting >>>>>>>>> GKX Group, GKX Associates Inc. 
>>>>>>>>> tel: 1-877-GKX-GROUP >>>>>>>>> email: ggrothendieck at gmail.com >>>>>>>>> _______________________________________________ >>>>>>>>> datatable-help mailing list >>>>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>> >>>>> -- >>>>> Statistics & Software Consulting >>>>> GKX Group, GKX Associates Inc. >>>>> tel: 1-877-GKX-GROUP >>>>> email: ggrothendieck at gmail.com >>>> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Mon May 13 03:31:23 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 13 May 2013 02:31:23 +0100 Subject: [datatable-help] ":=" with "by" reassignment/updation + adding new column leads to crash In-Reply-To: <0b1179be0c07f95992fa421dca7613f1@imap.plus.net> References: <081EFC10403447CCBA8CAE3CD6761DBA@gmail.com> <0b1179be0c07f95992fa421dca7613f1@imap.plus.net> Message-ID: Hi, Now fixed in v1.8.9 : o Mixing adding and updating into one DT[, `:=`(existingCol=...,newCol=...), by=...] now works without error or segfault, #2778 and #2528. Many thanks to Arunkumar Srinivasan for reporting both with reproducible examples. Tests added. Matthew On 12.05.2013 11:44, Matthew Dowle wrote: > Hi, > > Yes I get that in latest dev too. Thanks for the nice example, please file. > > Matthew > > On 12.05.2013 08:53, Arunkumar Srinivasan wrote: > >> Hi, >> I just discovered some weird R-session crash in data.table. Here's an example to reproduce the crash. I did not find any bug filed regarding this issue. Maybe others can verify this? Then I'll file it as a bug. >> The issue is this. Suppose you've a data.table with two columns x and y as follows: >> require(data.table) >> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10) >> >> x y >> 1: 1 6 >> 2: 1 7 >> 3: 1 8 >> 4: 2 9 >> 5: 2 10 >> Now you want to add a new column "z" by reference grouped by "x". So, you'd do: >> >> DT[, `:=`(z = .GRP), by = x] >> >> x y z >> 1: 1 6 1 >> 2: 1 7 1 >> 3: 1 8 1 >> 4: 2 9 2 >> 5: 2 10 2 >> Now, for the sake of reproducing this error, assume that you assigned "z" the wrong value and that you want to change it. But you also realised that you want to add another column "w". So, you go ahead and do (remember to do the previous step and then this one): >> DT[, `:=`(z = .N, w = 2), by = x] # R session crashes >> Here, both the R and RStudio sessions crash with the traceback message: >> >> *** caught segfault *** >> address 0x0, cause 'memory not mapped' >> Traceback: >> 1: `[.data.table`(DT, , `:=`(z = .GRP, w = 2), by = x) >> 2: DT[, `:=`(z = .GRP, w = 2), by = x] >> This on the other hand works as expected if you assign both columns the first time. >> >> require(data.table) >> DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10) >> DT[, `:=`(z = .GRP, w = 2), by = x] # works fine >> That is, if you assign by reference (:=) with "by" and re-assign a variable while also creating another variable, there seems to be a segfault. This error may not be limited to this case; this is just the case I've tested. >> Here's my sessionInfo() from before the crash: >> >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> other attached packages: >> [1] data.table_1.8.8 >> loaded via a namespace (and not attached): >> [1] tools_3.0.0 >> Best, >> Arun -------------- next part -------------- An HTML attachment was scrubbed... URL:
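For the record, the post-fix behaviour of Arun's example (a sketch against v1.8.9 as described above; the printed table is reconstructed from data.table's grouping semantics, not captured from a session):

require(data.table)                 # v1.8.9 or later
DT <- data.table(x = rep(1:2, c(3,2)), y = 6:10)
DT[, z := .GRP, by = x]             # first grouped assignment by reference
DT[, `:=`(z = .N, w = 2), by = x]   # update z and add w in one call; this segfaulted in 1.8.8
DT
#    x  y z w
# 1: 1  6 3 2
# 2: 1  7 3 2
# 3: 1  8 3 2
# 4: 2  9 2 2
# 5: 2 10 2 2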
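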
>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >> 9186293 >> 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >> 9186294 (row 9186293 excl header) is where fread thinks the file ends, >> mid-line by the look of it! >> I've experimented by truncating the file. The error varies, either it >> reads too few records or gives the error I reported, presumably determined >> by whether the last perceived line is entire. >> The problem arises when the file reaches 4GB, in this case between >> 8,030,000 and 8,040,000 rows: >> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 >> spd_all_trunc_8030k.csv >> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 >> spd_all_trunc_8040k.csv >> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >> Detected eol as \r\n (CRLF) in that order, the Windows standard. >> Looking for supplied sep ',' on line 30 (the last non blank line in the >> first 30) ... found >> Found 9 columns >> First row with 9 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 80300000 >> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 >> data rows >> Type codes: 000002000 (first 5 rows) >> Type codes: 000002000 (+middle 5 rows) >> Type codes: 000002000 (+last 5 rows) >> 0%Bumping column 7 from INT to INT64 on data row 9, field contains >> '0.42634430000000001' >> Bumping column 7 from INT64 to REAL on data row 9, field contains >> '0.42634430000000001' >> 0.000s ( 0%) Memory map (rerun may be quicker) >> 0.000s ( 0%) Sep and header detection >> 0.000s ( 0%) Count rows (wc -l) >> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >> 171.188s ( 65%) Reading data >> 1365231.809s (518439%) Allocation for type bumps (if any), including gc >> time if triggered >> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >> 0.000s ( 0%) Changing na.strings to NA >> 0.000s Total >> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >> Detected eol as \r\n (CRLF) in that order, the Windows standard. >> Looking for supplied sep ',' on line 30 (the last non blank line in the >> first 30) ... found >> Found 9 columns >> First row with 9 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. 
>> Count of eol after first data row: 18913 >> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 >> data rows >> Type codes: 000002000 (first 5 rows) >> Type codes: 000002000 (+middle 5 rows) >> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: >> 204650,724540, >> Regards, >> Paul >> >> >> On 1 May 2013 10:28, Paul Harding wrote: >> >>> Here is the verbose output: >>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>> first 30) ... found >>> Found 9 columns >>> First row with 9 fields occurs on line 1 (either column names or first >>> row of data) >>> All the fields on line 1 are character fields. Treating as the column >>> names. >>> Count of eol after first data row: 9186293 >>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 >>> data rows >>> Type codes: 000002000 (first 5 rows) >>> Type codes: 000002200 (+middle 5 rows) >>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>> types: 204038,2617097,20110803,0,0 >>> But here is the wc output (via cygwin; newline, word (whitespace delim >>> so each word one 'line' here), byte)@ >>> $ wc spd_all_fixed.csv >>> 168997637 168997638 9078155125 spd_all_fixed.csv >>> [So fread 9M, wc 168M rows]. >>> Regards >>> Paul >>> >>> >>> On 30 April 2013 18:52, Matthew Dowle wrote: >>> >>>> >>>> >>>> Hi, >>>> >>>> Thanks for reporting this. Please set verbose=TRUE and let us know the >>>> output. >>>> >>>> Thanks, Matthew >>>> >>>> >>>> >>>> On 30.04.2013 18:01, Paul Harding wrote: >>>> >>>> Problem with fread on a large file >>>> The file is 8GB, just short of 200,000 lines, produced as SQLoutput and >>>> modified by cygwin/perl to remove the second line. >>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>> fread("data/spd_all_fixed.csv",sep=",") >>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>>> types: 204038,2617097,20110803,0,0 >>>> Looking for the offending line,with line numbers in output so I'm >>>> guessing this is line 6 of the mid-file chunk examined, >>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>> and comparing to surrounding lines and the first ten lines >>>> $ head spd_all_fixed.csv >>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>> I can't see any difference. I wonder if this is a bug? 
I have no >>>> problems on a small test data set run through an identical process and >>>> using the same fread command. >>>> Regards >>>> Paul -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon May 13 22:38:31 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 13 May 2013 21:38:31 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Hi Paul, Sorry for that hassle. As you've realised I don't develop data.table on Windows. Those lines are switched in at compile time for Windows, and so I rely on (the truly impressive) winbuilder to compile and test for me. On this occasion, I did submit to winbuilder last night but it didn't reply (even with a compile error), which is extremely unusual. And R-Forge is stuck in 'building' state too (which is not unusual, sadly). I'll let you know when it's passing on winbuilder, and I'll update the Windows .zip on the homepage (since we can't rely on R-Forge) ... Matthew
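A back-of-the-envelope footnote on the bug itself, using the wc figures quoted below (editorial arithmetic, not from the thread). GetFileSize() reports only the low 32 bits of a file's size -- the high word goes in a separate out-parameter, evidently unused here -- so fread saw the true size modulo 2^32, which matches the ~9.2M-row truncation Paul observed:

2^32                  # 4294967296 -- the 4GB boundary
9078155125 %% 2^32    # 488220533 bytes: what a 32-bit size report gives for this file
488220533 / 9186293   # ~53.1 bytes per row, consistent with these ~50-60 byte lines,
                      # so the 9,186,293 rows fread saw were just the first ~0.45GB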
>>>>> >>>>> Thanks, Matthew >>>>> >>>>> On 02.05.2013 10:19, Paul Harding wrote: >>>>> >>>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. >>>>>> >>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! >>>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. >>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>> >>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv >>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv >>>>>> >>>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>>>>> >>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>> Found 9 columns >>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>> Count of eol after first data row: 80300000 >>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows >>>>>> >>>>>> Type codes: 000002000 (first 5 rows) >>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>> Type codes: 000002000 (+last 5 rows) >>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001' >>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001' >>>>>> 0.000s ( 0%) Memory map (rerun may be quicker) >>>>>> 0.000s ( 0%) Sep and header detection >>>>>> 0.000s ( 0%) Count rows (wc -l) >>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>>>>> 171.188s ( 65%) Reading data >>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered >>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >>>>>> 0.000s ( 0%) Changing na.strings to NA >>>>>> 0.000s Total >>>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>>>>> >>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>> Found 9 columns >>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>> All the fields on line 1 are character fields. Treating as the column names. 
>>>>>> Count of eol after first data row: 18913 >>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows >>>>>> >>>>>> Type codes: 000002000 (first 5 rows) >>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540, >>>>>> Regards, >>>>>> Paul >>>>>> >>>>>> On 1 May 2013 10:28, Paul Harding wrote: >>>>>> >>>>>>> Here is the verbose output: >>>>>>> >>>>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>> Found 9 columns >>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>> Count of eol after first data row: 9186293 >>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows >>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>> Type codes: 000002200 (+middle 5 rows) >>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>>>>> >>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte)@ >>>>>>> >>>>>>> $ wc spd_all_fixed.csv >>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>>>>> [So fread 9M, wc 168M rows]. >>>>>>> Regards >>>>>>> Paul >>>>>>> >>>>>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output. >>>>>>>> >>>>>>>> Thanks, Matthew >>>>>>>> >>>>>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>>>>> >>>>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line. 
>>>>>>>>> >>>>>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>>>>> >>>>>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>>>>> >>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>>>>> and comparing to surrounding lines and the first ten lines >>>>>>>>> >>>>>>>>> $ head spd_all_fixed.csv >>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>>>>> Regards >>>>>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon May 13 23:26:57 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 13 May 2013 22:26:57 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Passing on winbuilder now. .zip (rev 874) uploaded to homepage (will take an hour or two to refresh), but available now from here : https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable Matthew On 13.05.2013 21:38, Matthew Dowle wrote: > Hi Paul, > > Sorry for that hassle. As you've realised I don't develop data.table on Windows. Those lines are switched in at compile time for Windows, and so I rely on (the truly impressive) winbuilder to compile and test for me. On this occasion, I did submit to winbuilder last night but it didn't reply (even with a compile error) which is extremely unusual. And R-Forge is stuck in 'building' state too (which is not unusual, sadly). > > I''ll let you know when it's passing on winbuilder, and I'll updated the Windows .zip on the homepage (since we can't rely on R-Forge) ... 
> > Matthew > > On 13.05.2013 16:01, Paul Harding wrote: > >> I'd love to test it, pulled the latest commit with svn, not sure about building from source on windows, got some compilation errors: >> >>> install.packages("pkg/",type="source",repos=NULL) >> Warning in install.packages : >> package 'pkg/' is not available (for R version 3.0.0) >> * installing *source* package 'data.table' ... >> ** libs >> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o >> fread.c: In function 'readfile': >> fread.c:343:9: error: 'hfile' undeclared (first use in this function) >> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in >> fread.c:346:115: error: expected ';' before ')' token >> fread.c:346:115: error: expected statement before ')' token >> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration] >> make: *** [fread.o] Error 1 >> ERROR: compilation failed for package 'data.table' >> Regards >> Paul >> >> On 11 May 2013 02:39, Matthew Dowle wrote: >> >>> Paul, Vishal, >>> >>> Commit 859 : >>> >>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files >>> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to >>> be GetFileSizeEx(). >>> >>> Please test and confirm ok now. >>> >>> Thanks, Matthew >>> >>> On 03.05.2013 14:59, Matthew Dowle wrote: >>> >>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc. >>>> >>>> Please could you file it as a bug on the tracker. Thanks. >>>> >>>> Matthew >>>> >>>> On 03.05.2013 14:32, Paul Harding wrote: >>>> >>>>> Definitely a 64-bit machine. Here are the details: >>>>> >>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) >>>>> Installed memory (RAM): 128GB >>>>> System type: 64-bit Operating System >>>>> Windows edition: Server 2008 R2 Enterprise SP1 >>>>> Regards, >>>>> Paul >>>>> >>>>> On 3 May 2013 10:51, Matthew Dowle wrote: >>>>> >>>>>> Hi Paul, >>>>>> >>>>>> Thanks for all this! >>>>>> >>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>> >>>>>> Ahah. Are you using a 32bit or 64bit Windows machine? >>>>>> >>>>>> Thanks, Matthew >>>>>> >>>>>> On 02.05.2013 10:19, Paul Harding wrote: >>>>>> >>>>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. >>>>>>> >>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! 
>>>>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. >>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>>> >>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv >>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv >>>>>>> >>>>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>>>>>> >>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>> Found 9 columns >>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>> Count of eol after first data row: 80300000 >>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows >>>>>>> >>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>> Type codes: 000002000 (+last 5 rows) >>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001' >>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001' >>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker) >>>>>>> 0.000s ( 0%) Sep and header detection >>>>>>> 0.000s ( 0%) Count rows (wc -l) >>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>>>>>> 171.188s ( 65%) Reading data >>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered >>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >>>>>>> 0.000s ( 0%) Changing na.strings to NA >>>>>>> 0.000s Total >>>>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>>>>>> >>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>> Found 9 columns >>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>> Count of eol after first data row: 18913 >>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows >>>>>>> >>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540, >>>>>>> Regards, >>>>>>> Paul >>>>>>> >>>>>>> On 1 May 2013 10:28, Paul Harding wrote: >>>>>>> >>>>>>>> Here is the verbose output: >>>>>>>> >>>>>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>> Found 9 columns >>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>> All the fields on line 1 are character fields. Treating as the column names. 
>>>>>>>> Count of eol after first data row: 9186293 >>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows >>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>> Type codes: 000002200 (+middle 5 rows) >>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>>>>>> >>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte)@ >>>>>>>> >>>>>>>> $ wc spd_all_fixed.csv >>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>>>>>> [So fread 9M, wc 168M rows]. >>>>>>>> Regards >>>>>>>> Paul >>>>>>>> >>>>>>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output. >>>>>>>>> >>>>>>>>> Thanks, Matthew >>>>>>>>> >>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>>>>>> >>>>>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line. >>>>>>>>>> >>>>>>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>>>>>> >>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>>>>>> >>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>>>>>> and comparing to surrounding lines and the first ten lines >>>>>>>>>> >>>>>>>>>> $ head spd_all_fixed.csv >>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>>>>>> Regards >>>>>>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... 
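The GetFileSize()/GetFileSizeEx() change announced above is the whole story of the 4GB limit: the older Windows call returns only the low 32 bits of the file size (the high word needs a separate out-parameter), which would explain fread seeing these files as their size modulo 2^32. A back-of-envelope check of that explanation in R, an editorial sketch using only the numbers already posted in this thread:

9078155125 %% 2^32                               # 488220533 bytes visible of the 9,078,155,125 byte file
(9078155125 %% 2^32) / (9078155125 / 168997637)  # at the file's average row width: roughly 9.1 million rows

which lines up with fread counting 9,186,293 data rows where wc counted 168,997,637.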
URL: From p.harding at paniscus.com Tue May 14 14:28:53 2013 From: p.harding at paniscus.com (Paul Harding) Date: Tue, 14 May 2013 13:28:53 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Hi Matthew, some frustration until I worked out I needed to rename the zip file to data.table.zip to install! I have regression tested on a 4GB file, and tested on a 19GB whopper. Obviously it is a tad slow, but read.csv would never get there! Delighted, I can't do what I need to do on these big datasets without data.table. All seems fine, correct record count etc. I'm not checking every line of data ;-) > gash.dt<-fread("data/data_extract_1_fixed_trunc_fixed.csv") > big.dt<-fread("data/data_extract_1_fixed.csv",verbose=T) Detected eol as \r\n (CRLF) in that order, the Windows standard. Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=',' Found 16 columns First row with 16 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 214038352 Subtracted 1 for last eol and any trailing empty lines, leaving 214038351 data rows Type codes: 0003330030000000 (first 5 rows) Type codes: 0003330030000000 (+middle 5 rows) Type codes: 0003330030000000 (+last 5 rows) 0.050s ( 0%) Memory map (rerun may be quicker) 0.020s ( 0%) sep and header detection 159.560s ( 35%) Count rows (wc -l) 0.001s ( 0%) Column type detection (first, middle and last 5 rows) 46.267s ( 10%) Allocation of 214038351x16 result (xMB) in RAM 244.760s ( 54%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 5.258s ( 1%) Changing na.strings to NA 455.916s Total $ wc data_extract_1_fixed.csv 214038352 414098500 19745071003 data_extract_1_fixed.csv > tables() NAME NROW MB COLS KEY [1,] big.dt 214,038,351 16330 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw [2,] gash.dt 46,535,426 3551 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw [3,] range.dt 1 1 startdt,enddt [4,] spd.dt 46,535,426 4083 caldate,store_key,item_key,period_key,ulc_category,format,state,pqty,tqty,weekda store_key,item_key,caldate [5,] test.dt 5 1 digits,letters digits Total: 23,966MB On 13 May 2013 22:26, Matthew Dowle wrote: > ** > > > > Passing on winbuilder now. > > .zip (rev 874) uploaded to homepage (will take an hour or two to refresh), > but available now from here : > > > https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable > > Matthew > > > > On 13.05.2013 21:38, Matthew Dowle wrote: > > > > Hi Paul, > > Sorry for that hassle. As you've realised I don't develop data.table on > Windows. Those lines are switched in at compile time for Windows, and so > I rely on (the truly impressive) winbuilder to compile and test for me. On > this occasion, I did submit to winbuilder last night but it didn't reply > (even with a compile error) which is extremely unusual. And R-Forge is > stuck in 'building' state too (which is not unusual, sadly). > > I''ll let you know when it's passing on winbuilder, and I'll updated the > Windows .zip on the homepage (since we can't rely on R-Forge) ... 
> > Matthew > > > > On 13.05.2013 16:01, Paul Harding wrote: > > I'd love to test it, pulled the latest commit with svn, not sure about > building from source on windows, got some compilation errors: > > install.packages("pkg/",type="source",repos=NULL) > Warning in install.packages : > package ?pkg/? is not available (for R version 3.0.0) > * installing *source* package 'data.table' ... > ** libs > gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG > -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 > -mtune=core2 -c fread.c -o fread.o > fread.c: In function 'readfile': > fread.c:343:9: error: 'hfile' undeclared (first use in this function) > fread.c:343:9: note: each undeclared identifier is reported only once for > each function it appears in > fread.c:346:115: error: expected ';' before ')' token > fread.c:346:115: error: expected statement before ')' token > fread.c:350:17: warning: implicit declaration of function 'nanosleep' > [-Wimplicit-function-declaration] > make: *** [fread.o] Error 1 > ERROR: compilation failed for package 'data.table' > Regards > Paul > > > On 11 May 2013 02:39, Matthew Dowle wrote: > >> >> >> Paul, Vishal, >> >> Commit 859 : >> >> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files >> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to >> be GetFileSizeEx(). >> >> >> >> Please test and confirm ok now. >> >> >> >> Thanks, Matthew >> >> >> >> On 03.05.2013 14:59, Matthew Dowle wrote: >> >> >> >> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think >> GetFileSize() should be GetFileSizeEx(), iirc. >> >> Please could you file it as a bug on the tracker. Thanks. >> >> Matthew >> >> >> >> On 03.05.2013 14:32, Paul Harding wrote: >> >> Definitely a 64-bit machine. Here are the details: >> >> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) >> Installed memory (RAM): 128GB >> System type: 64-bit Operating System >> Windows edition: Server 2008 R2 Enterprise SP1 >> Regards, >> Paul >> >> >> On 3 May 2013 10:51, Matthew Dowle wrote: >> >>> >>> >>> Hi Paul, >>> >>> Thanks for all this! >>> >>> > The problem arises when the file reaches 4GB, in this case between >>> 8,030,000 and 8,040,000 rows: >>> >>> Ahah. Are you using a 32bit or 64bit Windows machine? >>> >>> Thanks, Matthew >>> >>> >>> >>> On 02.05.2013 10:19, Paul Harding wrote: >>> >>> Some supplementary information, here is the portion of the file (with >>> row numbers, +1 for header) around where fread thinks the file ends. >>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>> 9186293 >>> 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, >>> mid-line by the look of it! >>> I've experimented by truncating the file. 
The error varies, either it >>> reads too few records or gives the error I reported, presumably determined >>> by whether the last perceived line is entire. >>> The problem arises when the file reaches 4GB, in this case between >>> 8,030,000 and 8,040,000 rows: >>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 >>> spd_all_trunc_8030k.csv >>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 >>> spd_all_trunc_8040k.csv >>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>> first 30) ... found >>> Found 9 columns >>> First row with 9 fields occurs on line 1 (either column names or first >>> row of data) >>> All the fields on line 1 are character fields. Treating as the column >>> names. >>> Count of eol after first data row: 80300000 >>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 >>> data rows >>> Type codes: 000002000 (first 5 rows) >>> Type codes: 000002000 (+middle 5 rows) >>> Type codes: 000002000 (+last 5 rows) >>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains >>> '0.42634430000000001' >>> Bumping column 7 from INT64 to REAL on data row 9, field contains >>> '0.42634430000000001' >>> 0.000s ( 0%) Memory map (rerun may be quicker) >>> 0.000s ( 0%) Sep and header detection >>> 0.000s ( 0%) Count rows (wc -l) >>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>> 171.188s ( 65%) Reading data >>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc >>> time if triggered >>> -1365231.809s (-518439%) Coercing data already read in type bumps (if >>> any) >>> 0.000s ( 0%) Changing na.strings to NA >>> 0.000s Total >>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>> first 30) ... found >>> Found 9 columns >>> First row with 9 fields occurs on line 1 (either column names or first >>> row of data) >>> All the fields on line 1 are character fields. Treating as the column >>> names. >>> Count of eol after first data row: 18913 >>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 >>> data rows >>> Type codes: 000002000 (first 5 rows) >>> Type codes: 000002000 (+middle 5 rows) >>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>> Expected sep (',') but ',' ends field 2 on line 6 when detecting >>> types: 204650,724540, >>> Regards, >>> Paul >>> >>> >>> On 1 May 2013 10:28, Paul Harding wrote: >>> >>>> Here is the verbose output: >>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>> Detected eol as \r\n (CRLF) in that order, the Windows standard. >>>> Looking for supplied sep ',' on line 30 (the last non blank line in the >>>> first 30) ... found >>>> Found 9 columns >>>> First row with 9 fields occurs on line 1 (either column names or first >>>> row of data) >>>> All the fields on line 1 are character fields. Treating as the column >>>> names. 
>>>> Count of eol after first data row: 9186293 >>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 >>>> data rows >>>> Type codes: 000002000 (first 5 rows) >>>> Type codes: 000002200 (+middle 5 rows) >>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>>> types: 204038,2617097,20110803,0,0 >>>> But here is the wc output (via cygwin; newline, word (whitespace >>>> delim so each word one 'line' here), byte)@ >>>> $ wc spd_all_fixed.csv >>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>> [So fread 9M, wc 168M rows]. >>>> Regards >>>> Paul >>>> >>>> >>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>> >>>>> >>>>> >>>>> Hi, >>>>> >>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the >>>>> output. >>>>> >>>>> Thanks, Matthew >>>>> >>>>> >>>>> >>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>> >>>>> Problem with fread on a large file >>>>> The file is 8GB, just short of 200,000 lines, produced as SQLoutput >>>>> and modified by cygwin/perl to remove the second line. >>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting >>>>> types: 204038,2617097,20110803,0,0 >>>>> Looking for the offending line,with line numbers in output so I'm >>>>> guessing this is line 6 of the mid-file chunk examined, >>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>> and comparing to surrounding lines and the first ten lines >>>>> $ head spd_all_fixed.csv >>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>> >>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>> I can't see any difference. I wonder if this is a bug? I have no >>>>> problems on a small test data set run through an identical process and >>>>> using the same fread command. >>>>> Regards >>>>> Paul >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >> >> >> >> >> >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue May 14 22:52:05 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 14 May 2013 21:52:05 +0100 Subject: [datatable-help] Fwd: fread on very large file In-Reply-To: References: <6215268129090c5164b66264010bea9b@imap.plus.net> <806651da84c7d49b3a9aa134e4951274@imap.plus.net> Message-ID: Hi Paul, Great to hear, interesting timings. Yup - with a 16GB data.table in RAM, now we're talking. It's this kind of size data.table was intended for. Don't try names(DT)[1]<-"newname" on that! 
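Matthew's parting joke is about copying: names(DT)[1] <- "newname" goes through R's `names<-` replacement function, which duplicates the whole table (painful at 16GB), whereas data.table's setnames() assigns the new name by reference. A minimal sketch:

library(data.table)
DT <- data.table(a = 1:3, b = 4:6)
# names(DT)[1] <- "newname"    # would copy the entire table via `names<-`
setnames(DT, "a", "newname")   # renames in place, no copy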
Have changed the .zip file name on the homepage - thanks for mentioning it. And I see R-Forge is up to date and "Current" status anyway after all that, so via the R-Forge repo should be fine now, too. Matthew On 14.05.2013 13:28, Paul Harding wrote: > Hi Matthew, some frustration until I worked out I needed to rename the zip file to data.table.zip to install! I have regression tested on a 4GB file, and tested on a 19GB whopper. Obviously it is a tad slow, but read.csv would never get there! Delighted, I can't do what I need to do on these big datasets without data.table. All seems fine, correct record count etc. I'm not checking every line of data ;-) > >> gash.dt<-fread("data/data_extract_1_fixed_trunc_fixed.csv") >> big.dt<-fread("data/data_extract_1_fixed.csv",verbose=T) > Detected eol as rn (CRLF) in that order, the Windows standard. > Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=',' > Found 16 columns > First row with 16 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names. > Count of eol after first data row: 214038352 > Subtracted 1 for last eol and any trailing empty lines, leaving 214038351 data rows > Type codes: 0003330030000000 (first 5 rows) > Type codes: 0003330030000000 (+middle 5 rows) > Type codes: 0003330030000000 (+last 5 rows) > 0.050s ( 0%) Memory map (rerun may be quicker) > 0.020s ( 0%) sep and header detection > 159.560s ( 35%) Count rows (wc -l) > 0.001s ( 0%) Column type detection (first, middle and last 5 rows) > 46.267s ( 10%) Allocation of 214038351x16 result (xMB) in RAM > 244.760s ( 54%) Reading data > 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered > 0.000s ( 0%) Coercing data already read in type bumps (if any) > 5.258s ( 1%) Changing na.strings to NA > 455.916s Total > > $ wc data_extract_1_fixed.csv > 214038352 414098500 19745071003 data_extract_1_fixed.csv > >> tables() > NAME NROW MB COLS KEY > [1,] big.dt 214,038,351 16330 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw > [2,] gash.dt 46,535,426 3551 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw > [3,] range.dt 1 1 startdt,enddt > [4,] spd.dt 46,535,426 4083 caldate,store_key,item_key,period_key,ulc_category,format,state,pqty,tqty,weekda store_key,item_key,caldate > [5,] test.dt 5 1 digits,letters digits > Total: 23,966MB > > On 13 May 2013 22:26, Matthew Dowle wrote: > >> Passing on winbuilder now. >> >> .zip (rev 874) uploaded to homepage (will take an hour or two to refresh), but available now from here : >> >> https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable [5] >> >> Matthew >> >> On 13.05.2013 21:38, Matthew Dowle wrote: >> >>> Hi Paul, >>> >>> Sorry for that hassle. As you've realised I don't develop data.table on Windows. Those lines are switched in at compile time for Windows, and so I rely on (the truly impressive) winbuilder to compile and test for me. On this occasion, I did submit to winbuilder last night but it didn't reply (even with a compile error) which is extremely unusual. And R-Forge is stuck in 'building' state too (which is not unusual, sadly). >>> >>> I''ll let you know when it's passing on winbuilder, and I'll updated the Windows .zip on the homepage (since we can't rely on R-Forge) ... 
>>> >>> Matthew >>> >>> On 13.05.2013 16:01, Paul Harding wrote: >>> >>>> I'd love to test it, pulled the latest commit with svn, not sure about building from source on windows, got some compilation errors: >>>> >>>>> install.packages("pkg/",type="source",repos=NULL) >>>> Warning in install.packages : >>>> package 'pkg/' is not available (for R version 3.0.0) >>>> * installing *source* package 'data.table' ... >>>> ** libs >>>> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o >>>> fread.c: In function 'readfile': >>>> fread.c:343:9: error: 'hfile' undeclared (first use in this function) >>>> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in >>>> fread.c:346:115: error: expected ';' before ')' token >>>> fread.c:346:115: error: expected statement before ')' token >>>> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration] >>>> make: *** [fread.o] Error 1 >>>> ERROR: compilation failed for package 'data.table' >>>> Regards >>>> Paul >>>> >>>> On 11 May 2013 02:39, Matthew Dowle wrote: >>>> >>>>> Paul, Vishal, >>>>> >>>>> Commit 859 : >>>>> >>>>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files >>>>> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to >>>>> be GetFileSizeEx(). >>>>> >>>>> Please test and confirm ok now. >>>>> >>>>> Thanks, Matthew >>>>> >>>>> On 03.05.2013 14:59, Matthew Dowle wrote: >>>>> >>>>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think GetFileSize() should be GetFileSizeEx(), iirc. >>>>>> >>>>>> Please could you file it as a bug on the tracker. Thanks. >>>>>> >>>>>> Matthew >>>>>> >>>>>> On 03.05.2013 14:32, Paul Harding wrote: >>>>>> >>>>>>> Definitely a 64-bit machine. Here are the details: >>>>>>> >>>>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors) >>>>>>> Installed memory (RAM): 128GB >>>>>>> System type: 64-bit Operating System >>>>>>> Windows edition: Server 2008 R2 Enterprise SP1 >>>>>>> Regards, >>>>>>> Paul >>>>>>> >>>>>>> On 3 May 2013 10:51, Matthew Dowle wrote: >>>>>>> >>>>>>>> Hi Paul, >>>>>>>> >>>>>>>> Thanks for all this! >>>>>>>> >>>>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>>>> >>>>>>>> Ahah. Are you using a 32bit or 64bit Windows machine? >>>>>>>> >>>>>>>> Thanks, Matthew >>>>>>>> >>>>>>>> On 02.05.2013 10:19, Paul Harding wrote: >>>>>>>> >>>>>>>>> Some supplementary information, here is the portion of the file (with row numbers, +1 for header) around where fread thinks the file ends. 
>>>>>>>>> >>>>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail >>>>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0 >>>>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0 >>>>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13 >>>>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0 >>>>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0 >>>>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0 >>>>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0 >>>>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0 >>>>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0 >>>>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends, mid-line by the look of it! >>>>>>>>> I've experimented by truncating the file. The error varies, either it reads too few records or gives the error I reported, presumably determined by whether the last perceived line is entire. >>>>>>>>> The problem arises when the file reaches 4GB, in this case between 8,030,000 and 8,040,000 rows: >>>>>>>>> >>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv >>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv >>>>>>>>> >>>>>>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T) >>>>>>>>> >>>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>>> Found 9 columns >>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>>>> Count of eol after first data row: 80300000 >>>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows >>>>>>>>> >>>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>>>> Type codes: 000002000 (+last 5 rows) >>>>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001' >>>>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001' >>>>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker) >>>>>>>>> 0.000s ( 0%) Sep and header detection >>>>>>>>> 0.000s ( 0%) Count rows (wc -l) >>>>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows) >>>>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM >>>>>>>>> 171.188s ( 65%) Reading data >>>>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered >>>>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any) >>>>>>>>> 0.000s ( 0%) Changing na.strings to NA >>>>>>>>> 0.000s Total >>>>>>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T) >>>>>>>>> >>>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>>> Found 9 columns >>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>>> All the fields on line 1 are character fields. Treating as the column names. 
>>>>>>>>> Count of eol after first data row: 18913 >>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows >>>>>>>>> >>>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>>> Type codes: 000002000 (+middle 5 rows) >>>>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) : >>>>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540, >>>>>>>>> Regards, >>>>>>>>> Paul >>>>>>>>> >>>>>>>>> On 1 May 2013 10:28, Paul Harding wrote: >>>>>>>>> >>>>>>>>>> Here is the verbose output: >>>>>>>>>> >>>>>>>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T) >>>>>>>>>> Detected eol as rn (CRLF) in that order, the Windows standard. >>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found >>>>>>>>>> Found 9 columns >>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data) >>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names. >>>>>>>>>> Count of eol after first data row: 9186293 >>>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows >>>>>>>>>> Type codes: 000002000 (first 5 rows) >>>>>>>>>> Type codes: 000002200 (+middle 5 rows) >>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) : >>>>>>>>>> >>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so each word one 'line' here), byte)@ >>>>>>>>>> >>>>>>>>>> $ wc spd_all_fixed.csv >>>>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv >>>>>>>>>> [So fread 9M, wc 168M rows]. >>>>>>>>>> Regards >>>>>>>>>> Paul >>>>>>>>>> >>>>>>>>>> On 30 April 2013 18:52, Matthew Dowle wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output. >>>>>>>>>>> >>>>>>>>>>> Thanks, Matthew >>>>>>>>>>> >>>>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote: >>>>>>>>>>> >>>>>>>>>>>> Problem with fread on a large file The file is 8GB, just short of 200,000 lines, produced as SQLoutput and modified by cygwin/perl to remove the second line. 
>>>>>>>>>>>> >>>>>>>>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error >>>>>>>>>>>> >>>>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",") >>>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") : >>>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0 >>>>>>>>>>>> Looking for the offending line,with line numbers in output so I'm guessing this is line 6 of the mid-file chunk examined, >>>>>>>>>>>> >>>>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv >>>>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0 >>>>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0 >>>>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0 >>>>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0 >>>>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0 >>>>>>>>>>>> and comparing to surrounding lines and the first ten lines >>>>>>>>>>>> >>>>>>>>>>>> $ head spd_all_fixed.csv >>>>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class >>>>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0 >>>>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0 >>>>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0 >>>>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0 >>>>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0 >>>>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0 >>>>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0 >>>>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0 >>>>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13 >>>>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command. >>>>>>>>>>>> Regards >>>>>>>>>>>> Paul Links: ------ [1] mailto:mdowle at mdowle.plus.com [2] mailto:p.harding at paniscus.com [3] mailto:mdowle at mdowle.plus.com [4] mailto:mdowle at mdowle.plus.com [5] https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable [6] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Thu May 16 01:15:20 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Thu, 16 May 2013 01:15:20 +0200 Subject: [datatable-help] 1.8.9 mac version link broken Message-ID: <422151901E6C414384EECBCC8E6D62EE@gmail.com> Hello, I've tried over the last 2 days and the link for data.table mac version (.tgz) http://download.r-forge.r-project.org/bin/macosx/leopard/contrib/latest/data.table_1.8.9.tgz seems to be broken. The installation by: install.packages("data.table", repos="http://R-Forge.R-project.org") therefore fails as well. Any idea when it'll be back online? Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu May 16 01:28:30 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 16 May 2013 00:28:30 +0100 Subject: [datatable-help] 1.8.9 mac version link broken In-Reply-To: <422151901E6C414384EECBCC8E6D62EE@gmail.com> References: <422151901E6C414384EECBCC8E6D62EE@gmail.com> Message-ID: <4621f2ffd31cc139b83088722984ed7c@imap.plus.net> Hi Arun, That's one for R-Forge support (link to their tracker on R-Forge homepage). 
My impression is that Mac folk install from source using R CMD INSTALL as you would on unix really. I'm not even sure what is in the .tgz that isn't in the .tar.gz. Sorry! Matthew On 16.05.2013 00:15, Arunkumar Srinivasan wrote: > Hello, > I've tried over the last 2 days and the link for data.table mac version (.tgz) http://download.r-forge.r-project.org/bin/macosx/leopard/contrib/latest/data.table_1.8.9.tgz seems to be broken. The installation by: install.packages("data.table", repos="http://R-Forge.R-project.org") therefore fails as well. Any idea when it'll be back online? > > Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailinglist.honeypot at gmail.com Thu May 16 06:54:11 2013 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 15 May 2013 21:54:11 -0700 Subject: [datatable-help] 1.8.9 mac version link broken In-Reply-To: <4621f2ffd31cc139b83088722984ed7c@imap.plus.net> References: <422151901E6C414384EECBCC8E6D62EE@gmail.com> <4621f2ffd31cc139b83088722984ed7c@imap.plus.net> Message-ID: Hi, On Wed, May 15, 2013 at 4:28 PM, Matthew Dowle wrote: > > > Hi Arun, > > That's one for R-Forge support (link to their tracker on R-Forge homepage). > My impression is that Mac folk install from source using R CMD INSTALL as > you would on unix really. I'm not even sure what is in the .tgz that isn't > in the .tar.gz. Sorry! Assuming you have Xcode (or just the Xcode command line tools) so that you have a working gcc on your system, you can just do: R> install.packages("data.table", repos="http://R-Forge.R-project.org", type="source") HTH, -steve -- Steve Lianoglou Computational Biologist Department of Bioinformatics and Computational Biology Genentech From caneff at gmail.com Thu May 16 15:46:15 2013 From: caneff at gmail.com (Chris Neff) Date: Thu, 16 May 2013 09:46:15 -0400 Subject: [datatable-help] fread: support int64 package as well, or an option to make 64-bit integers characters Message-ID: Hi all, For reasons I won't go into here, I do not use the bit64 package, instead opting to use int64. When I load something that is a long numeric ID (i.e. something I actually want as a character not an integer), fread loads this as a 64-bit integer using the bit64 encoding. This makes it impossible to convert those to characters without having the bit64 package loaded. Would it be possible to either support int64 as well, or if not an option to convert it to character when it would decide that 64-bit ints are necessary? Thanks, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu May 16 23:00:11 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 16 May 2013 16:00:11 -0500 Subject: [datatable-help] fread: support int64 package as well, or an option to make 64-bit integers characters In-Reply-To: References: Message-ID: Hi Chris, There's the new colClasses argument so you can override particular fields. But see the new ?fread manual page where I've added an integer64 argument, documented but not yet implemented, default "integer64" but can be set to "numeric" (as read.csv) or "character" -- think that's exactly what you're describing. I guess you would set the global option datatable.integer64 to "character". I don't see how we can support int64 because it's implemented as two integer vectors, iiuc. data.table internals need each column to be a single vector.
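In code, the interface described above would look like the sketch below. The integer64 argument was documented but not yet implemented at that point, so this shows the intended usage rather than confirmed 1.8.9 behaviour, and the file name and column name are hypothetical:

# per-field override via the new colClasses argument
DT <- fread("ids.csv", colClasses = c(id = "character"))
# the documented integer64 argument, and the corresponding global option
DT <- fread("ids.csv", integer64 = "character")
options(datatable.integer64 = "character")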
Matthew On 16.05.2013 08:46, Chris Neff wrote: > Hi all, > For reasons I won't go into here, I do not use the bit64 package, instead opting to use int64. When I load something that is a long numeric ID (i.e. something I actually want as a character not an integer), fread loads this as a 64-bit integer using the bit64 encoding. This makes it impossible to convert those to characters without having the bit64 package loaded. > Would it be possible to either support int64 as well, or if not an option to convert it to character when it would decide that 64-bit ints are necessary? > Thanks, > Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Fri May 17 13:34:26 2013 From: caneff at gmail.com (Chris Neff) Date: Fri, 17 May 2013 07:34:26 -0400 Subject: [datatable-help] fread: support int64 package as well, or an option to make 64-bit integers characters In-Reply-To: References: Message-ID: On Thu, May 16, 2013 at 5:00 PM, Matthew Dowle wrote: > ** > > > > Hi Chris, > > There's the new colClasses argument so you can override particular fields. > But see the new ?fread manual page where I've added an integer64 argument, > documented but not yet implemented, default "integer64" but can be set to > "numeric" (as read.csv) or "character" -- think that's exactly what you're > describing. I guess you would set the global option datatable.integer64 to > "character". > That sounds exactly like what I want. > I don't see how we can support int64 because it's implemented as two > integer vectors, iiuc. data.table internals need each column to be a > single vector. > Understood. > Matthew > > > > On 16.05.2013 08:46, Chris Neff wrote: > > Hi all, > For reasons I won't go into here, I do not use the bit64 package, instead > opting to use int64. When I load something that is a long numeric ID (i.e. > something I actually want as a character not an integer), fread loads this > as a 64-bit integer using the bit64 encoding. This makes it impossible to > convert those to characters without having the bit64 package loaded. > Would it be possible to either support int64 as well, or if not an option > to convert it to character when it would decide that 64-bit ints are > necessary? > Thanks, > Chris > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Fri May 17 16:42:17 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Fri, 17 May 2013 11:42:17 -0300 Subject: [datatable-help] Extract Single Column as Vector Message-ID: Sorry if this is a basic question. I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." I am able to obtain this behavior if I know the column name in advance: > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > dt a b 1: 1 4 2: 2 5 3: 3 6 > str(dt[,a]) num [1:3] 1 2 3 However, if I don't, no such luck: > colname="a" > str(dt[,colname,with=F]) Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: $ a: num 1 2 3 - attr(*, ".internal.selfref")=<externalptr> Is there a way to extract an entire column as a vector if I have the column name as a character scalar? Thank you! -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed...
URL: From eduard.antonyan at gmail.com Fri May 17 16:59:18 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 09:59:18 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < alexandre.sieira at gmail.com> wrote: > Sorry if this is a basic question. > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states > that "A single column or single expression returns that type, usually a > vector." > > > I am able to obtain this behavior if I know the column name in advance: > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > dt > > a b > > 1: 1 4 > > 2: 2 5 > > 3: 3 6 > > > str(dt[,a]) > > num [1:3] 1 2 3 > > > However, if I don't, no such luck: > > > colname="a" > > str(dt[,colname,with=F]) > Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: > $ a: num 1 2 3 > - attr(*, ".internal.selfref")=<externalptr> > > Is there a way to extract an entire column as a vector if I have the > column name as a character scalar? > > Thank you! > > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri May 17 17:02:46 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 17 May 2013 17:02:46 +0200 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: dt[, a] and dt[, "a", with=FALSE]. There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. Alexandre, as of now, it could be done as Eduard points out. Arun On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: > > > > Sorry if this is a basic question. > > > > > > > > > > > > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." > > > > > > > > > > > > > > I am able to obtain this behavior if I know the column name in advance: > > > > > > > > > > > > > > > > > > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > > > > > dt > > > > > > a b > > > > > > 1: 1 4 > > > > > > 2: 2 5 > > > > > > 3: 3 6 > > > > > > > str(dt[,a]) > > > > > > num [1:3] 1 2 3 > > > > > > > > > > > > > > However, if I don't, no such luck: > > > > > colname="a" > > > str(dt[,colname,with=F]) > > Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: > > $ a: num 1 2 3 > > - attr(*, ".internal.selfref")=<externalptr> > > > > > > Is there a way to extract an entire column as a vector if I have the column name as a character scalar?
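Pulling the thread's answers together: `[[` takes the column name as a string and returns the bare vector, so it covers the computed-name case that dt[, a] covers for a literal name. A short sketch:

library(data.table)
dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))
colname <- "a"
dt[, a]                      # numeric vector; name known when the code is written
dt[, colname, with = FALSE]  # one-column data.table, not a vector
dt[[colname]]                # numeric vector; name held in a variable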
> > > > Thank you! > > > > -- > > Alexandre Sieira > > CISA, CISSP, ISO 27001 Lead Auditor > > > > "The truth is rarely pure and never simple." > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Fri May 17 17:11:53 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Fri, 17 May 2013 12:11:53 -0300 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: It works perfectly for me with the syntax Eduard mentioned. Thank you very much for the quick response! -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 17 May 2013 at 12:02:50, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: dt[, a] and dt[, "a", with=FALSE]. There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. Alexandre, as of now, it could be done as Eduard points out. Arun On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: Sorry if this is a basic question. I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." I am able to obtain this behavior if I know the column name in advance: > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > dt a b 1: 1 4 2: 2 5 3: 3 6 > str(dt[,a]) num [1:3] 1 2 3 However, if I don't, no such luck: > colname="a" > str(dt[,colname,with=F]) Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: $ a: num 1 2 3 - attr(*, ".internal.selfref")= Is there a way to extract an entire column as a vector if I have the column name as a character scalar? Thank you! -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
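[Editor's note: a compact recap of the answers in this thread, using Alexandre's table. dt[[colname]] is Eduard's suggestion; the get() form is used the same way later in the thread.]

    library(data.table)
    dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))
    colname <- "a"
    dt[[colname]]                  # numeric vector 1 2 3, same as dt[, a]
    dt[, get(colname)]             # also a vector: get() resolves the name within j
    dt[, colname, with = FALSE]    # a one-column data.table, not a vector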
URL: From eduard.antonyan at gmail.com Fri May 17 17:22:14 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 10:22:14 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < aragorn168b at gmail.com> wrote: > Eduard, are we discussing the same thing again :)? Wasn't this somehow > your question as well.. the discrepancy between: > > dt[, a] and dt[, "a", with=FALSE]. > > There should be a drop=TRUE/FALSE option (as in the case of data.frame) > that should be used when you use `with=FALSE`. Until then, the default > option seems to be drop=FALSE, which results in a data.table. > > Alexandre, as of now, it could be done as Eduard points out. > > Arun > > On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > > Use dt[[colname]], but this seems like a bug to me - I would've thought > that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < > alexandre.sieira at gmail.com> wrote: > > Sorry if this is a basic question. > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states > that "A single column or single expression returns that type, usually a > vector." > > > I am able to obtain this behavior if I know the column name in advance: > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > dt > > a b > > 1: 1 4 > > 2: 2 5 > > 3: 3 6 > > > str(dt[,a]) > > num [1:3] 1 2 3 > > > However, if I don't, no such luck: > > > colname="a" > > str(dt[,colname,with=F]) > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > $ a: num 1 2 3 > - attr(*, ".internal.selfref")= > > If there a way to extract an entire column as a vector if I have the > column name as a character scalar? > > Thank you! > > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Fri May 17 21:44:20 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 17 May 2013 14:44:20 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: @Arun and eddi: This question has come up before. http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. 
:) I guess it works that way because ...in dt[ ,a], j is an expression which evaluates to a vector ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: > I don't remember discussing this issue...? What is the conceptual > difference between dt[, a] and dt[, "a", with = F] and what does 'drop' > have to do with this?? > > > On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> Eduard, are we discussing the same thing again :)? Wasn't this somehow >> your question as well.. the discrepancy between: >> >> dt[, a] and dt[, "a", with=FALSE]. >> >> There should be a drop=TRUE/FALSE option (as in the case of data.frame) >> that should be used when you use `with=FALSE`. Until then, the default >> option seems to be drop=FALSE, which results in a data.table. >> >> Alexandre, as of now, it could be done as Eduard points out. >> >> Arun >> >> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >> >> Use dt[[colname]], but this seems like a bug to me - I would've thought >> that dt[, a] and dt[, "a", with = F] should return the exact same thing. >> >> >> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < >> alexandre.sieira at gmail.com> wrote: >> >> Sorry if this is a basic question. >> >> >> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states >> that "A single column or single expression returns that type, usually a >> vector." >> >> >> I am able to obtain this behavior if I know the column name in advance: >> >> >> > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >> >> > dt >> >> a b >> >> 1: 1 4 >> >> 2: 2 5 >> >> 3: 3 6 >> >> > str(dt[,a]) >> >> num [1:3] 1 2 3 >> >> >> However, if I don't, no such luck: >> >> > colname="a" >> > str(dt[,colname,with=F]) >> Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: >> $ a: num 1 2 3 >> - attr(*, ".internal.selfref")= >> >> If there a way to extract an entire column as a vector if I have the >> column name as a character scalar? >> >> Thank you! >> >> -- >> Alexandre Sieira >> CISA, CISSP, ISO 27001 Lead Auditor >> >> "The truth is rarely pure and never simple." >> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
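[Editor's note: a small sketch of the distinction Frank draws above; the shapes noted in the comments are the behaviour reported in this thread.]

    library(data.table)
    dt <- data.table(a = 1:3, b = 4:6)
    str(dt[, a])                  # j is an expression: evaluates to the vector a
    str(dt[, list(a)])            # j is a list: a one-column data.table
    str(dt[, "a", with = FALSE])  # selection mode: same one-column shape as list(a)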
URL: From eduard.antonyan at gmail.com Fri May 17 22:26:40 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 15:26:40 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Well, looking at the documentation: j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or *(when with=FALSE) same as j in [.data.frame.* ... with:* *By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. *When with=FALSE, j works as it does in [.data.frame.* * * The bolded out part of the documentation doesn't match the actual behavior. On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > @Arun and eddi: This question has come up before. > > http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html > (And I'm sure there are other times, too.) I can't say I've heard anyone > arguing about it, though. :) > > I guess it works that way because > ...in dt[ ,a], j is an expression which evaluates to a vector > ...in dt[,"a",with=FALSE] the option turns on the "you must want one or > more columns" mode, translating j from "a" to list(a) > > It's unintuitive if you're expecting data frame behavior (you know, > drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it > shouldn't be much of a surprise. Adding the drop option, and maybe > defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? > > > On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan < > eduard.antonyan at gmail.com> wrote: > >> I don't remember discussing this issue...? What is the conceptual >> difference between dt[, a] and dt[, "a", with = F] and what does 'drop' >> have to do with this?? >> >> >> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < >> aragorn168b at gmail.com> wrote: >> >>> Eduard, are we discussing the same thing again :)? Wasn't this somehow >>> your question as well.. the discrepancy between: >>> >>> dt[, a] and dt[, "a", with=FALSE]. >>> >>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) >>> that should be used when you use `with=FALSE`. Until then, the default >>> option seems to be drop=FALSE, which results in a data.table. >>> >>> Alexandre, as of now, it could be done as Eduard points out. >>> >>> Arun >>> >>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>> >>> Use dt[[colname]], but this seems like a bug to me - I would've thought >>> that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>> >>> >>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < >>> alexandre.sieira at gmail.com> wrote: >>> >>> Sorry if this is a basic question. >>> >>> >>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' >>> states that "A single column or single expression returns that type, >>> usually a vector." >>> >>> >>> I am able to obtain this behavior if I know the column name in advance: >>> >>> >>> > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>> >>> > dt >>> >>> a b >>> >>> 1: 1 4 >>> >>> 2: 2 5 >>> >>> 3: 3 6 >>> >>> > str(dt[,a]) >>> >>> num [1:3] 1 2 3 >>> >>> >>> However, if I don't, no such luck: >>> >>> > colname="a" >>> > str(dt[,colname,with=F]) >>> Classes ?data.table? and 'data.frame': 3 obs. 
of 1 variable: >>> $ a: num 1 2 3 >>> - attr(*, ".internal.selfref")= >>> >>> Is there a way to extract an entire column as a vector if I have the >>> column name as a character scalar? >>> >>> Thank you! >>> >>> -- >>> Alexandre Sieira >>> CISA, CISSP, ISO 27001 Lead Auditor >>> >>> "The truth is rarely pure and never simple." >>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Fri May 17 22:27:24 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 17 May 2013 16:27:24 -0400 Subject: [datatable-help] zero length list component in j Message-ID: Is this intended? If we use j = list(x = "X", y = numeric(0)) we get a row but if we use just list(y = numeric(0)) then we do not get a row. In the first case it filled in the zero length component with NA and in the second case it just omitted the row entirely: > dd <- data.table(a = 1:3) > dd a 1: 1 2: 2 3: 3 > dd[, list(x = "X", y = numeric(0)), by = a] a x y 1: 1 X NA 2: 2 X NA 3: 3 X NA > dd[, list(y = numeric(0)), by = a] Empty data.table (0 rows) of 2 cols: a,y -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 17 22:33:28 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 15:33:28 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: Maybe I'm missing something, but what else did you expect? Looks like it did its best to compensate for the user not supplying full data in the first example, and there really was nothing to do in the second one. On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck wrote: > Is this intended? If we use j = list(x = "X", y = numeric(0)) we get > a row but if we use just list(y = numeric(0)) then we do not get a > row. In the first case it filled in the zero length component with NA > and in the second case it just omitted the row entirely: > > > dd <- data.table(a = 1:3) > > dd > a > 1: 1 > 2: 2 > 3: 3 > > dd[, list(x = "X", y = numeric(0)), by = a] > a x y > 1: 1 X NA > 2: 2 X NA > 3: 3 X NA > > dd[, list(y = numeric(0)), by = a] > Empty data.table (0 rows) of 2 cols: a,y > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
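[Editor's note: a sketch of the recycling behind Gabor's two results; the comments restate the output he reports above, not a documented guarantee.]

    library(data.table)
    dd <- data.table(a = 1:3)
    # Within each group, the components of j's list are recycled to a common
    # length; a length-0 component next to a length-1 component is padded with NA:
    dd[, list(x = "X", y = numeric(0)), by = a]  # three rows, y all NA
    # When every component of j has length 0, the group contributes no rows:
    dd[, list(y = numeric(0)), by = a]           # empty result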
URL: From ggrothendieck at gmail.com Fri May 17 22:38:53 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 17 May 2013 16:38:53 -0400 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: In the first case it replaced the zero length component with NA and in the second case it did not. Why the difference? On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan wrote: > Maybe I'm missing smth, but what else did you expect? Looks like it did it's > best to compensate for the user not supplying full data in the first > example, and there really was nothing to do in the second one. > > > > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck > wrote: >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get >> a row but if we use just list(y = numeric(0)) then we do not get a >> row. In the first case it filled in the zero length component with NA >> and in the second case it just omitted the row entirely: >> >> > dd <- data.table(a = 1:3) >> > dd >> a >> 1: 1 >> 2: 2 >> 3: 3 >> > dd[, list(x = "X", y = numeric(0)), by = a] >> a x y >> 1: 1 X NA >> 2: 2 X NA >> 3: 3 X NA >> > dd[, list(y = numeric(0)), by = a] >> Empty data.table (0 rows) of 2 cols: a,y >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Fri May 17 22:47:55 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 15:47:55 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: well numeric(0) is no data, but because in the first case there was other data to output and you also asked to output `y`, what else was it supposed to do? ( it might help to look at the output of c(numeric(0), numeric(0)) ) On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck wrote: > In the first case it replaced the zero length component with NA and in > the second case it did not. Why the difference? > > On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan > wrote: > > Maybe I'm missing smth, but what else did you expect? Looks like it did > it's > > best to compensate for the user not supplying full data in the first > > example, and there really was nothing to do in the second one. > > > > > > > > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck > > wrote: > >> > >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get > >> a row but if we use just list(y = numeric(0)) then we do not get a > >> row. In the first case it filled in the zero length component with NA > >> and in the second case it just omitted the row entirely: > >> > >> > dd <- data.table(a = 1:3) > >> > dd > >> a > >> 1: 1 > >> 2: 2 > >> 3: 3 > >> > dd[, list(x = "X", y = numeric(0)), by = a] > >> a x y > >> 1: 1 X NA > >> 2: 2 X NA > >> 3: 3 X NA > >> > dd[, list(y = numeric(0)), by = a] > >> Empty data.table (0 rows) of 2 cols: a,y > >> > >> > >> -- > >> Statistics & Software Consulting > >> GKX Group, GKX Associates Inc. 
> >> tel: 1-877-GKX-GROUP > >> email: ggrothendieck at gmail.com > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Sat May 18 00:36:06 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Fri, 17 May 2013 18:36:06 -0400 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: Yes, I understand all that but its not inevitable that it had to be that way. If we perform a computation that results in a list with zero length component then the corresponding row won't show up but another possibility might have been that it would show up filled in with NAs. At any rate, the question remains whether this behavior is intended or not. On Fri, May 17, 2013 at 4:47 PM, Eduard Antonyan wrote: > well numeric(0) is no data, but because in the first case there was other > data to output and you also asked to output `y`, what else was it supposed > to do? ( it might help to look at the output of c(numeric(0), numeric(0)) ) > > > On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck > wrote: >> >> In the first case it replaced the zero length component with NA and in >> the second case it did not. Why the difference? >> >> On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan >> wrote: >> > Maybe I'm missing smth, but what else did you expect? Looks like it did >> > it's >> > best to compensate for the user not supplying full data in the first >> > example, and there really was nothing to do in the second one. >> > >> > >> > >> > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck >> > wrote: >> >> >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get >> >> a row but if we use just list(y = numeric(0)) then we do not get a >> >> row. In the first case it filled in the zero length component with NA >> >> and in the second case it just omitted the row entirely: >> >> >> >> > dd <- data.table(a = 1:3) >> >> > dd >> >> a >> >> 1: 1 >> >> 2: 2 >> >> 3: 3 >> >> > dd[, list(x = "X", y = numeric(0)), by = a] >> >> a x y >> >> 1: 1 X NA >> >> 2: 2 X NA >> >> 3: 3 X NA >> >> > dd[, list(y = numeric(0)), by = a] >> >> Empty data.table (0 rows) of 2 cols: a,y >> >> >> >> >> >> -- >> >> Statistics & Software Consulting >> >> GKX Group, GKX Associates Inc. >> >> tel: 1-877-GKX-GROUP >> >> email: ggrothendieck at gmail.com >> >> _______________________________________________ >> >> datatable-help mailing list >> >> datatable-help at lists.r-forge.r-project.org >> >> >> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >> > >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com > > -- Statistics & Software Consulting GKX Group, GKX Associates Inc. 
tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From eduard.antonyan at gmail.com Sat May 18 00:45:25 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 17:45:25 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: Actually, looking at this example: > dd[, ifelse(a < 2, a, integer(0)), by = a] a V1 1: 1 1 2: 2 NA 3: 3 NA I don't quite understand the output. I don't have a coherent story for this and your examples - either your second example should print NA's or this one shouldn't have the last two rows imo. On Fri, May 17, 2013 at 5:36 PM, Gabor Grothendieck wrote: > Yes, I understand all that but its not inevitable that it had to be > that way. If we perform a computation that results in a list with > zero length component then the corresponding row won't show up but > another possibility might have been that it would show up filled in > with NAs. > > At any rate, the question remains whether this behavior is intended or not. > > > > > > On Fri, May 17, 2013 at 4:47 PM, Eduard Antonyan > wrote: > > well numeric(0) is no data, but because in the first case there was other > > data to output and you also asked to output `y`, what else was it > supposed > > to do? ( it might help to look at the output of c(numeric(0), > numeric(0)) ) > > > > > > On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck > > wrote: > >> > >> In the first case it replaced the zero length component with NA and in > >> the second case it did not. Why the difference? > >> > >> On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan > >> wrote: > >> > Maybe I'm missing smth, but what else did you expect? Looks like it > did > >> > it's > >> > best to compensate for the user not supplying full data in the first > >> > example, and there really was nothing to do in the second one. > >> > > >> > > >> > > >> > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck > >> > wrote: > >> >> > >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we get > >> >> a row but if we use just list(y = numeric(0)) then we do not get a > >> >> row. In the first case it filled in the zero length component with > NA > >> >> and in the second case it just omitted the row entirely: > >> >> > >> >> > dd <- data.table(a = 1:3) > >> >> > dd > >> >> a > >> >> 1: 1 > >> >> 2: 2 > >> >> 3: 3 > >> >> > dd[, list(x = "X", y = numeric(0)), by = a] > >> >> a x y > >> >> 1: 1 X NA > >> >> 2: 2 X NA > >> >> 3: 3 X NA > >> >> > dd[, list(y = numeric(0)), by = a] > >> >> Empty data.table (0 rows) of 2 cols: a,y > >> >> > >> >> > >> >> -- > >> >> Statistics & Software Consulting > >> >> GKX Group, GKX Associates Inc. > >> >> tel: 1-877-GKX-GROUP > >> >> email: ggrothendieck at gmail.com > >> >> _______________________________________________ > >> >> datatable-help mailing list > >> >> datatable-help at lists.r-forge.r-project.org > >> >> > >> >> > >> >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >> > > >> > > >> > >> > >> > >> -- > >> Statistics & Software Consulting > >> GKX Group, GKX Associates Inc. > >> tel: 1-877-GKX-GROUP > >> email: ggrothendieck at gmail.com > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... 
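[Editor's note: the follow-up below pins this example on base R's ifelse() rather than on data.table. A quick base-R check supports that reading:]

    # ifelse() returns a result the same length as its test argument, recycling
    # the chosen branch to fit; rep(integer(0), length.out = 1) fills with NA.
    ifelse(FALSE, 1L, integer(0))
    # [1] NA
    # So each group with a >= 2 evaluates j to a length-1 NA, not integer(0),
    # which is why those groups still produce rows in the example above.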
URL: From eduard.antonyan at gmail.com Sat May 18 00:51:11 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 17 May 2013 17:51:11 -0500 Subject: [datatable-help] zero length list component in j In-Reply-To: References: Message-ID: nm, looks like the above is the doing of `ifelse` and is a different issue On Fri, May 17, 2013 at 5:45 PM, Eduard Antonyan wrote: > Actually, looking at this example: > > > dd[, ifelse(a < 2, a, integer(0)), by = a] > a V1 > 1: 1 1 > 2: 2 NA > 3: 3 NA > > I don't quite understand the output. I don't have a coherent story for > this and your examples - either your second example should print NA's or > this one shouldn't have the last two rows imo. > > > On Fri, May 17, 2013 at 5:36 PM, Gabor Grothendieck < > ggrothendieck at gmail.com> wrote: > >> Yes, I understand all that but its not inevitable that it had to be >> that way. If we perform a computation that results in a list with >> zero length component then the corresponding row won't show up but >> another possibility might have been that it would show up filled in >> with NAs. >> >> At any rate, the question remains whether this behavior is intended or >> not. >> >> >> >> >> >> On Fri, May 17, 2013 at 4:47 PM, Eduard Antonyan >> wrote: >> > well numeric(0) is no data, but because in the first case there was >> other >> > data to output and you also asked to output `y`, what else was it >> supposed >> > to do? ( it might help to look at the output of c(numeric(0), >> numeric(0)) ) >> > >> > >> > On Fri, May 17, 2013 at 3:38 PM, Gabor Grothendieck >> > wrote: >> >> >> >> In the first case it replaced the zero length component with NA and in >> >> the second case it did not. Why the difference? >> >> >> >> On Fri, May 17, 2013 at 4:33 PM, Eduard Antonyan >> >> wrote: >> >> > Maybe I'm missing smth, but what else did you expect? Looks like it >> did >> >> > it's >> >> > best to compensate for the user not supplying full data in the first >> >> > example, and there really was nothing to do in the second one. >> >> > >> >> > >> >> > >> >> > On Fri, May 17, 2013 at 3:27 PM, Gabor Grothendieck >> >> > wrote: >> >> >> >> >> >> Is this intended? If we use j = list(x = "X", y = numeric(0)) we >> get >> >> >> a row but if we use just list(y = numeric(0)) then we do not get a >> >> >> row. In the first case it filled in the zero length component with >> NA >> >> >> and in the second case it just omitted the row entirely: >> >> >> >> >> >> > dd <- data.table(a = 1:3) >> >> >> > dd >> >> >> a >> >> >> 1: 1 >> >> >> 2: 2 >> >> >> 3: 3 >> >> >> > dd[, list(x = "X", y = numeric(0)), by = a] >> >> >> a x y >> >> >> 1: 1 X NA >> >> >> 2: 2 X NA >> >> >> 3: 3 X NA >> >> >> > dd[, list(y = numeric(0)), by = a] >> >> >> Empty data.table (0 rows) of 2 cols: a,y >> >> >> >> >> >> >> >> >> -- >> >> >> Statistics & Software Consulting >> >> >> GKX Group, GKX Associates Inc. >> >> >> tel: 1-877-GKX-GROUP >> >> >> email: ggrothendieck at gmail.com >> >> >> _______________________________________________ >> >> >> datatable-help mailing list >> >> >> datatable-help at lists.r-forge.r-project.org >> >> >> >> >> >> >> >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > >> >> > >> >> >> >> >> >> >> >> -- >> >> Statistics & Software Consulting >> >> GKX Group, GKX Associates Inc. >> >> tel: 1-877-GKX-GROUP >> >> email: ggrothendieck at gmail.com >> > >> > >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. 
>> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Sat May 18 03:34:52 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 17 May 2013 21:34:52 -0400 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` df <- as.data.frame(dt) > identical(df[, "a"], dt[, get("a")]) [1] TRUE > identical(df[, "a"], dt[["a"]]) [1] TRUE > identical(df[, "a"], dt[, "a", with=FALSE]) [1] FALSE rm(df) -Rick Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > Well, looking at the documentation: > > j: A single column name, single expresson of column names, list() of > expressions of column names, an expression or function call that evaluates > to list (including data.frame and data.table which are lists, too), or *(when > with=FALSE) same as j in [.data.frame.* > ... > with:* *By default with=TRUE and j is evaluated within the frame of x. > The column names can be used as variables. *When with=FALSE, j works as > it does in [.data.frame.* > > * > * > The bolded out part of the documentation doesn't match the actual behavior. > > > > On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > >> @Arun and eddi: This question has come up before. >> >> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html >> (And I'm sure there are other times, too.) I can't say I've heard anyone >> arguing about it, though. :) >> >> I guess it works that way because >> ...in dt[ ,a], j is an expression which evaluates to a vector >> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or >> more columns" mode, translating j from "a" to list(a) >> >> It's unintuitive if you're expecting data frame behavior (you know, >> drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it >> shouldn't be much of a surprise. Adding the drop option, and maybe >> defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >> >> >> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan < >> eduard.antonyan at gmail.com> wrote: >> >>> I don't remember discussing this issue...? What is the conceptual >>> difference between dt[, a] and dt[, "a", with = F] and what does 'drop' >>> have to do with this?? >>> >>> >>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan < >>> aragorn168b at gmail.com> wrote: >>> >>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow >>>> your question as well.. the discrepancy between: >>>> >>>> dt[, a] and dt[, "a", with=FALSE]. >>>> >>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) >>>> that should be used when you use `with=FALSE`. Until then, the default >>>> option seems to be drop=FALSE, which results in a data.table. >>>> >>>> Alexandre, as of now, it could be done as Eduard points out. 
>>>> >>>> Arun >>>> >>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>> >>>> Use dt[[colname]], but this seems like a bug to me - I would've thought >>>> that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>> >>>> >>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira < >>>> alexandre.sieira at gmail.com> wrote: >>>> >>>> Sorry if this is a basic question. >>>> >>>> >>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' >>>> states that "A single column or single expression returns that type, >>>> usually a vector." >>>> >>>> >>>> I am able to obtain this behavior if I know the column name in advance: >>>> >>>> >>>> > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>> >>>> > dt >>>> >>>> a b >>>> >>>> 1: 1 4 >>>> >>>> 2: 2 5 >>>> >>>> 3: 3 6 >>>> >>>> > str(dt[,a]) >>>> >>>> num [1:3] 1 2 3 >>>> >>>> >>>> However, if I don't, no such luck: >>>> >>>> > colname="a" >>>> > str(dt[,colname,with=F]) >>>> Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: >>>> $ a: num 1 2 3 >>>> - attr(*, ".internal.selfref")= >>>> >>>> If there a way to extract an entire column as a vector if I have the >>>> column name as a character scalar? >>>> >>>> Thank you! >>>> >>>> -- >>>> Alexandre Sieira >>>> CISA, CISSP, ISO 27001 Lead Auditor >>>> >>>> "The truth is rarely pure and never simple." >>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat May 18 17:04:46 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 18 May 2013 10:04:46 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: Message-ID: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> All good points. The thinking here has this mind : myvars = c("col1","col2") DT[, myvars, with=FALSE] We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). I've just changed those two parts of ?data.table (thanks for highlighting) : was : "... or (when with=FALSE) same as j in [.data.frame." now : "... or (when with=FALSE) a vector of names or positions to select." Matthew On 17.05.2013 20:34, Ricardo Saporta wrote: > Hm... Eddi does seem to have a point here. 
While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` > > df <- as.data.frame(dt) > > identical(df[, "a"], dt[, get("a")]) > [1] TRUE > > identical(df[, "a"], dt[["a"]]) > [1] TRUE > > identical(df[, "a"], dt[, "a", with=FALSE]) > [1] FALSE > rm(df) > -Rick > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu [14] > > On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > >> Well, looking at the documentation: >> j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (WHEN WITH=FALSE) SAME AS J IN [.DATA.FRAME. >> ... >> with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. WHEN WITH=FALSE, J WORKS AS IT DOES IN [.DATA.FRAME. >> >> The bolded out part of the documentation doesn't match the actual behavior. >> >> On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: >> >>> @Arun and eddi: This question has come up before. >>> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [9] >>> (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) >>> I guess it works that way because >>> ...in dt[ ,a], j is an expression which evaluates to a vector >>> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) >>> It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >>> >>> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: >>> >>>> I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? >>>> >>>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: >>>> >>>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: >>>>> dt[, a] and dt[, "a", with=FALSE]. >>>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. >>>>> Alexandre, as of now, it could be done as Eduard points out. >>>>> >>>>> Arun >>>>> >>>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>>> >>>>>> Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>>>> >>>>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: >>>>>> >>>>>>> Sorry if this is a basic question. >>>>>>> >>>>>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." 
>>>>>>> >>>>>>> I am able to obtain this behavior if I know the column name in advance: >>>>>>> >>>>>>>> dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>>>>> >>>>>>>> dt >>>>>>> >>>>>>> a b >>>>>>> >>>>>>> 1: 1 4 >>>>>>> >>>>>>> 2: 2 5 >>>>>>> >>>>>>> 3: 3 6 >>>>>>> >>>>>>>> str(dt[,a]) >>>>>>> >>>>>>> num [1:3] 1 2 3 >>>>>>> >>>>>>> However, if I don't, no such luck: >>>>>>> >>>>>>>> colname="a" >>>>>>>> str(dt[,colname,with=F]) >>>>>>> Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: >>>>>>> $ a: num 1 2 3 >>>>>>> - attr(*, ".internal.selfref")= >>>>>>> If there a way to extract an entire column as a vector if I have the column name as a character scalar? >>>>>>> Thank you! >>>>>>> >>>>>>> -- >>>>>>> Alexandre Sieira >>>>>>> CISA, CISSP, ISO 27001 Lead Auditor >>>>>>> >>>>>>> "The truth is rarely pure and never simple." >>>>>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>>>> >>>>>> _______________________________________________ >>>>>> datatable-help mailing list >>>>>> datatable-help at lists.r-forge.r-project.org [4] >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org [7] >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [12] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [13] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:alexandre.sieira at gmail.com [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:aragorn168b at gmail.com [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [10] mailto:eduard.antonyan at gmail.com [11] mailto:FErickson at psu.edu [12] mailto:datatable-help at lists.r-forge.r-project.org [13] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [14] mailto:saporta at rutgers.edu [15] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat May 18 17:18:31 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 18 May 2013 10:18:31 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> Message-ID: And FAQ 2.17 has a little more on that : "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and drop drop." 
If it helps to know, I also use DT[["somename"]] quite a bit. Matthew On 18.05.2013 10:04, Matthew Dowle wrote: > All good points. The thinking here has this mind : > > myvars = c("col1","col2") > DT[, myvars, with=FALSE] > > We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). > > I've just changed those two parts of ?data.table (thanks for highlighting) : > > was : > "... or (when with=FALSE) same as j in [.data.frame." > now : > "... or (when with=FALSE) a vector of names or positions to select." > > Matthew > > On 17.05.2013 20:34, Ricardo Saporta wrote: > >> Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` >> >> df <- as.data.frame(dt) >> > identical(df[, "a"], dt[, get("a")]) >> [1] TRUE >> > identical(df[, "a"], dt[["a"]]) >> [1] TRUE >> > identical(df[, "a"], dt[, "a", with=FALSE]) >> [1] FALSE >> rm(df) >> -Rick >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu [14] >> >> On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: >> >>> Well, looking at the documentation: >>> j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (WHEN WITH=FALSE) SAME AS J IN [.DATA.FRAME. >>> ... >>> with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. WHEN WITH=FALSE, J WORKS AS IT DOES IN [.DATA.FRAME. >>> >>> The bolded out part of the documentation doesn't match the actual behavior. >>> >>> On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: >>> >>>> @Arun and eddi: This question has come up before. >>>> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [9] >>>> (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) >>>> I guess it works that way because >>>> ...in dt[ ,a], j is an expression which evaluates to a vector >>>> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) >>>> It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >>>> >>>> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: >>>> >>>>> I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? >>>>> >>>>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: >>>>> >>>>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: >>>>>> dt[, a] and dt[, "a", with=FALSE]. 
>>>>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. >>>>>> Alexandre, as of now, it could be done as Eduard points out. >>>>>> >>>>>> Arun >>>>>> >>>>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>>>> >>>>>>> Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>>>>> >>>>>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: >>>>>>> >>>>>>>> Sorry if this is a basic question. >>>>>>>> >>>>>>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." >>>>>>>> >>>>>>>> I am able to obtain this behavior if I know the column name in advance: >>>>>>>> >>>>>>>>> dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>>>>>> >>>>>>>>> dt >>>>>>>> >>>>>>>> a b >>>>>>>> >>>>>>>> 1: 1 4 >>>>>>>> >>>>>>>> 2: 2 5 >>>>>>>> >>>>>>>> 3: 3 6 >>>>>>>> >>>>>>>>> str(dt[,a]) >>>>>>>> >>>>>>>> num [1:3] 1 2 3 >>>>>>>> >>>>>>>> However, if I don't, no such luck: >>>>>>>> >>>>>>>>> colname="a" >>>>>>>>> str(dt[,colname,with=F]) >>>>>>>> Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: >>>>>>>> $ a: num 1 2 3 >>>>>>>> - attr(*, ".internal.selfref")= >>>>>>>> If there a way to extract an entire column as a vector if I have the column name as a character scalar? >>>>>>>> Thank you! >>>>>>>> >>>>>>>> -- >>>>>>>> Alexandre Sieira >>>>>>>> CISA, CISSP, ISO 27001 Lead Auditor >>>>>>>> >>>>>>>> "The truth is rarely pure and never simple." >>>>>>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>>>>> >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org [4] >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>>> >>>>> _______________________________________________ >>>>> datatable-help mailing list >>>>> datatable-help at lists.r-forge.r-project.org [7] >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org [12] >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [13] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:alexandre.sieira at gmail.com [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:aragorn168b at gmail.com [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [10] mailto:eduard.antonyan at gmail.com [11] mailto:FErickson at psu.edu [12] mailto:datatable-help at lists.r-forge.r-project.org [13] 
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [14] mailto:saporta at rutgers.edu [15] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat May 18 17:21:16 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 18 May 2013 17:21:16 +0200 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> Message-ID: <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> Matthew wrote "..the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output)." That's a very nice way to put it. Arun On Saturday, May 18, 2013 at 5:18 PM, Matthew Dowle wrote: > > And FAQ 2.17 has a little more on that : > "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases > where single columns are selected and all of a sudden a vector is returned rather than a single > column data.frame. In [.data.table we took the opportunity to make it consistent and drop > drop." > > If it helps to know, I also use DT[["somename"]] quite a bit. > > Matthew > > On 18.05.2013 10:04, Matthew Dowle wrote: > > > > All good points. The thinking here has this mind : > > > > myvars = c("col1","col2") > > DT[, myvars, with=FALSE] > > > > We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). > > > > I've just changed those two parts of ?data.table (thanks for highlighting) : > > > > was : > > "... or (when with=FALSE) same as j in [.data.frame." > > now : > > "... or (when with=FALSE) a vector of names or positions to select." > > > > Matthew > > > > On 17.05.2013 20:34, Ricardo Saporta wrote: > > > Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` > > > df <- as.data.frame(dt) > > > > identical(df[, "a"], dt[, get("a")]) > > > [1] TRUE > > > > identical(df[, "a"], dt[["a"]]) > > > [1] TRUE > > > > identical(df[, "a"], dt[, "a", with=FALSE]) > > > [1] FALSE > > > rm(df) > > > > > > -Rick > > > > > > Ricardo Saporta > > > Graduate Student, Data Analytics > > > Rutgers University, New Jersey > > > e: saporta at rutgers.edu (mailto:saporta at rutgers.edu) > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > > > > Well, looking at the documentation: > > > > j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) same as j in [.data.frame. > > > > ... > > > > with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. When with=FALSE, j works as it does in [.data.frame. > > > > > > > > > > > > The bolded out part of the documentation doesn't match the actual behavior. 
> > > > > > > > > > > > On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > > > > > @Arun and eddi: This question has come up before. > > > > > http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html > > > > > (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) > > > > > I guess it works that way because > > > > > ...in dt[ ,a], j is an expression which evaluates to a vector > > > > > ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) > > > > > It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: > > > > > > I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: > > > > > > > Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: > > > > > > > dt[, a] and dt[, "a", with=FALSE]. > > > > > > > There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. > > > > > > > Alexandre, as of now, it could be done as Eduard points out. > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: > > > > > > > > > Sorry if this is a basic question. > > > > > > > > > > > > > > > > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." > > > > > > > > > > > > > > > > > > I am able to obtain this behavior if I know the column name in advance: > > > > > > > > > > > > > > > > > > > > > > > > > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > > > > > > > > dt > > > > > > > > > a b > > > > > > > > > 1: 1 4 > > > > > > > > > 2: 2 5 > > > > > > > > > 3: 3 6 > > > > > > > > > > str(dt[,a]) > > > > > > > > > num [1:3] 1 2 3 > > > > > > > > > > > > > > > > > > However, if I don't, no such luck: > > > > > > > > > > colname="a" > > > > > > > > > > str(dt[,colname,with=F]) > > > > > > > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > > > > > > > $ a: num 1 2 3 > > > > > > > > > - attr(*, ".internal.selfref")= > > > > > > > > > > > > > > > > > > If there a way to extract an entire column as a vector if I have the column name as a character scalar? > > > > > > > > > Thank you! > > > > > > > > > -- > > > > > > > > > Alexandre Sieira > > > > > > > > > CISA, CISSP, ISO 27001 Lead Auditor > > > > > > > > > > > > > > > > > > "The truth is rarely pure and never simple." 
> > > > > > > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > > > > > > > > _______________________________________________ > > > > > > > > > datatable-help mailing list > > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > _______________________________________________ > > > > > > > > datatable-help mailing list > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > datatable-help mailing list > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > > > > datatable-help mailing list > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat May 18 17:23:46 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 18 May 2013 17:23:46 +0200 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> Message-ID: @Matthew, On another note, are there plans to implement "drop=T/F" in data.table? Arun On Saturday, May 18, 2013 at 5:21 PM, Arunkumar Srinivasan wrote: > Matthew wrote "..the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output)." > That's a very nice way to put it. > > > Arun > > > On Saturday, May 18, 2013 at 5:18 PM, Matthew Dowle wrote: > > > > > And FAQ 2.17 has a little more on that : > > "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases > > where single columns are selected and all of a sudden a vector is returned rather than a single > > column data.frame. In [.data.table we took the opportunity to make it consistent and drop > > drop." > > > > If it helps to know, I also use DT[["somename"]] quite a bit. > > > > Matthew > > > > On 18.05.2013 10:04, Matthew Dowle wrote: > > > > > > All good points. The thinking here has this mind : > > > > > > myvars = c("col1","col2") > > > DT[, myvars, with=FALSE] > > > > > > We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). > > > > > > I've just changed those two parts of ?data.table (thanks for highlighting) : > > > > > > was : > > > "... 
or (when with=FALSE) same as j in [.data.frame." > > > now : > > > "... or (when with=FALSE) a vector of names or positions to select." > > > > > > Matthew > > > > > > On 17.05.2013 20:34, Ricardo Saporta wrote: > > > > Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. ie, that the last of the following `identical` statements should evaluate to `TRUE` > > > > df <- as.data.frame(dt) > > > > > identical(df[, "a"], dt[, get("a")]) > > > > [1] TRUE > > > > > identical(df[, "a"], dt[["a"]]) > > > > [1] TRUE > > > > > identical(df[, "a"], dt[, "a", with=FALSE]) > > > > [1] FALSE > > > > rm(df) > > > > > > > > -Rick > > > > > > > > Ricardo Saporta > > > > Graduate Student, Data Analytics > > > > Rutgers University, New Jersey > > > > e: saporta at rutgers.edu (mailto:saporta at rutgers.edu) > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: > > > > > Well, looking at the documentation: > > > > > j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) same as j in [.data.frame. > > > > > ... > > > > > with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. When with=FALSE, j works as it does in [.data.frame. > > > > > > > > > > > > > > > The bolded out part of the documentation doesn't match the actual behavior. > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: > > > > > > @Arun and eddi: This question has come up before. > > > > > > http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html > > > > > > (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) > > > > > > I guess it works that way because > > > > > > ...in dt[ ,a], j is an expression which evaluates to a vector > > > > > > ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) > > > > > > It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: > > > > > > > I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: > > > > > > > > Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: > > > > > > > > dt[, a] and dt[, "a", with=FALSE]. > > > > > > > > There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. > > > > > > > > Alexandre, as of now, it could be done as Eduard points out. 
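That is, until a drop option exists, either of these returns a plain vector when the column name is held in a variable (a quick sketch; colname is just an example name):

colname <- "a"
dt[[colname]]       # primitive [[ extraction, always a vector
dt[, get(colname)]  # j evaluated within the data.table, also a vector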
> > > > > > > > Arun > > > > > > > > > > > > > > > > > > > > > > > > On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: > > > > > > > > > > > > > > > > > Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: > > > > > > > > > > Sorry if this is a basic question. > > > > > > > > > > > > > > > > > > > > I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." > > > > > > > > > > > > > > > > > > > > I am able to obtain this behavior if I know the column name in advance: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) > > > > > > > > > > > dt > > > > > > > > > > a b > > > > > > > > > > 1: 1 4 > > > > > > > > > > 2: 2 5 > > > > > > > > > > 3: 3 6 > > > > > > > > > > > str(dt[,a]) > > > > > > > > > > num [1:3] 1 2 3 > > > > > > > > > > > > > > > > > > > > However, if I don't, no such luck: > > > > > > > > > > > colname="a" > > > > > > > > > > > str(dt[,colname,with=F]) > > > > > > > > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > > > > > > > > $ a: num 1 2 3 > > > > > > > > > > - attr(*, ".internal.selfref")= > > > > > > > > > > > > > > > > > > > > If there a way to extract an entire column as a vector if I have the column name as a character scalar? > > > > > > > > > > Thank you! > > > > > > > > > > -- > > > > > > > > > > Alexandre Sieira > > > > > > > > > > CISA, CISSP, ISO 27001 Lead Auditor > > > > > > > > > > > > > > > > > > > > "The truth is rarely pure and never simple." > > > > > > > > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > > > > > > > > > _______________________________________________ > > > > > > > > > > datatable-help mailing list > > > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > _______________________________________________ > > > > > > > > > datatable-help mailing list > > > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > datatable-help mailing list > > > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Sat May 18 18:19:57 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 18 May 2013 11:19:57 -0500 Subject: [datatable-help] Extract Single Column as Vector In-Reply-To: References: <5bc2b571908deaccc37f113ab16c95a3@imap.plus.net> <27D0233335F34EE6A75F4617F9E15DC8@gmail.com> Message-ID: In my mind currently, more pressing than drop=T/F is that long thread about by-without-by the other week. Need to find a few hours in a dark room to go through it with fresh eyes, draw together the points and link up with a few FRs and bug reports. I suspect quite a lot might simplify if we do change that, and I think that's likely. Then drop=T/F might go away since that would be what it would do by default, iirc. drop=T/F is entwined with that anyway. Matthew On 18.05.2013 10:23, Arunkumar Srinivasan wrote: > @Matthew, > On another note, are there plans to implement "drop=T/F" in data.table? > > Arun > > On Saturday, May 18, 2013 at 5:21 PM, Arunkumar Srinivasan wrote: > >> Matthew wrote "..the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output)." >> That's a very nice way to put it. >> >> Arun >> >> On Saturday, May 18, 2013 at 5:18 PM, Matthew Dowle wrote: >> >>> And FAQ 2.17 has a little more on that : >>> >>> "In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases >>> where single columns are selected and all of a sudden a vector is returned rather than a single >>> column data.frame. In [.data.table we took the opportunity to make it consistent and drop >>> drop." >>> >>> If it helps to know, I also use DT[["somename"]] quite a bit. >>> >>> Matthew >>> >>> On 18.05.2013 10:04, Matthew Dowle wrote: >>> >>>> All good points. The thinking here has this mind : >>>> >>>> myvars = c("col1","col2") >>>> DT[, myvars, with=FALSE] >>>> >>>> We don't want the type of the result to depend on whether myvars is length 1 or not. Otherwise we may end up with surprises (in production code for example) if myvars becomes length 1 in future. That's a strong principle that data.table follows : the length of an input shouldn't change the type of the output (only the type of the input should be able to change the type of the output). >>>> >>>> I've just changed those two parts of ?data.table (thanks for highlighting) : >>>> >>>> was : >>>> "... or (when with=FALSE) same as j in [.data.frame." >>>> now : >>>> "... or (when with=FALSE) a vector of names or positions to select." >>>> >>>> Matthew >>>> >>>> On 17.05.2013 20:34, Ricardo Saporta wrote: >>>> >>>>> Hm... Eddi does seem to have a point here. While I agree with Frank that once you're used to it, it is rather straightforward to deal with, I can see why one would have the expectation of a vector. 
ie, that the last of the following `identical` statements should evaluate to `TRUE` >>>>> >>>>> df <- as.data.frame(dt) >>>>> > identical(df[, "a"], dt[, get("a")]) >>>>> [1] TRUE >>>>> > identical(df[, "a"], dt[["a"]]) >>>>> [1] TRUE >>>>> > identical(df[, "a"], dt[, "a", with=FALSE]) >>>>> [1] FALSE >>>>> rm(df) >>>>> -Rick >>>>> >>>>> Ricardo Saporta >>>>> Graduate Student, Data Analytics >>>>> Rutgers University, New Jersey >>>>> e: saporta at rutgers.edu [14] >>>>> >>>>> On Fri, May 17, 2013 at 4:26 PM, Eduard Antonyan wrote: >>>>> >>>>>> Well, looking at the documentation: >>>>>> j: A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (WHEN WITH=FALSE) SAME AS J IN [.DATA.FRAME. >>>>>> ... >>>>>> with: By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables. WHEN WITH=FALSE, J WORKS AS IT DOES IN [.DATA.FRAME. >>>>>> >>>>>> The bolded out part of the documentation doesn't match the actual behavior. >>>>>> >>>>>> On Fri, May 17, 2013 at 2:44 PM, Frank Erickson wrote: >>>>>> >>>>>>> @Arun and eddi: This question has come up before. >>>>>>> http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [9] >>>>>>> (And I'm sure there are other times, too.) I can't say I've heard anyone arguing about it, though. :) >>>>>>> I guess it works that way because >>>>>>> ...in dt[ ,a], j is an expression which evaluates to a vector >>>>>>> ...in dt[,"a",with=FALSE] the option turns on the "you must want one or more columns" mode, translating j from "a" to list(a) >>>>>>> It's unintuitive if you're expecting data frame behavior (you know, drop=TRUE, as Arun mentioned), but if you've already seen dt[,list(a)], it shouldn't be much of a surprise. Adding the drop option, and maybe defaulting it to TRUE when with=FALSE might satisfy eddi's concern...? >>>>>>> >>>>>>> On Fri, May 17, 2013 at 10:22 AM, Eduard Antonyan wrote: >>>>>>> >>>>>>>> I don't remember discussing this issue...? What is the conceptual difference between dt[, a] and dt[, "a", with = F] and what does 'drop' have to do with this?? >>>>>>>> >>>>>>>> On Fri, May 17, 2013 at 10:02 AM, Arunkumar Srinivasan wrote: >>>>>>>> >>>>>>>>> Eduard, are we discussing the same thing again :)? Wasn't this somehow your question as well.. the discrepancy between: >>>>>>>>> dt[, a] and dt[, "a", with=FALSE]. >>>>>>>>> There should be a drop=TRUE/FALSE option (as in the case of data.frame) that should be used when you use `with=FALSE`. Until then, the default option seems to be drop=FALSE, which results in a data.table. >>>>>>>>> Alexandre, as of now, it could be done as Eduard points out. >>>>>>>>> >>>>>>>>> Arun >>>>>>>>> >>>>>>>>> On Friday, May 17, 2013 at 4:59 PM, Eduard Antonyan wrote: >>>>>>>>> >>>>>>>>>> Use dt[[colname]], but this seems like a bug to me - I would've thought that dt[, a] and dt[, "a", with = F] should return the exact same thing. >>>>>>>>>> >>>>>>>>>> On Fri, May 17, 2013 at 9:42 AM, Alexandre Sieira wrote: >>>>>>>>>> >>>>>>>>>>> Sorry if this is a basic question. >>>>>>>>>>> >>>>>>>>>>> I'm using R 3.0.0 and data.table 1.8.8. The documentation for 'j' states that "A single column or single expression returns that type, usually a vector." 
>>>>>>>>>>> >>>>>>>>>>> I am able to obtain this behavior if I know the column name in advance: >>>>>>>>>>> >>>>>>>>>>>> dt = data.table(a=c(1, 2, 3), b=c(4, 5, 6)) >>>>>>>>>>> >>>>>>>>>>>> dt >>>>>>>>>>> >>>>>>>>>>> a b >>>>>>>>>>> >>>>>>>>>>> 1: 1 4 >>>>>>>>>>> >>>>>>>>>>> 2: 2 5 >>>>>>>>>>> >>>>>>>>>>> 3: 3 6 >>>>>>>>>>> >>>>>>>>>>>> str(dt[,a]) >>>>>>>>>>> >>>>>>>>>>> num [1:3] 1 2 3 >>>>>>>>>>> >>>>>>>>>>> However, if I don't, no such luck: >>>>>>>>>>> >>>>>>>>>>>> colname="a" >>>>>>>>>>>> str(dt[,colname,with=F]) >>>>>>>>>>> Classes 'data.table' and 'data.frame': 3 obs. of 1 variable: >>>>>>>>>>> $ a: num 1 2 3 >>>>>>>>>>> - attr(*, ".internal.selfref")= >>>>>>>>>>> If there a way to extract an entire column as a vector if I have the column name as a character scalar? >>>>>>>>>>> Thank you! >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Alexandre Sieira >>>>>>>>>>> CISA, CISSP, ISO 27001 Lead Auditor >>>>>>>>>>> >>>>>>>>>>> "The truth is rarely pure and never simple." >>>>>>>>>>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> datatable-help mailing list >>>>>>>>>>> datatable-help at lists.r-forge.r-project.org [1] >>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2] >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> datatable-help mailing list >>>>>>>>>> datatable-help at lists.r-forge.r-project.org [4] >>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> datatable-help mailing list >>>>>>>> datatable-help at lists.r-forge.r-project.org [7] >>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] >>>>>> >>>>>> _______________________________________________ >>>>>> datatable-help mailing list >>>>>> datatable-help at lists.r-forge.r-project.org [12] >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [13] Links: ------ [1] mailto:datatable-help at lists.r-forge.r-project.org [2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] mailto:alexandre.sieira at gmail.com [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:aragorn168b at gmail.com [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://r.789695.n4.nabble.com/Better-hacks-getting-a-vector-AND-using-with-inserting-chunks-of-rows-tt4666592.html [10] mailto:eduard.antonyan at gmail.com [11] mailto:FErickson at psu.edu [12] mailto:datatable-help at lists.r-forge.r-project.org [13] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [14] mailto:saporta at rutgers.edu [15] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From ggrothendieck at gmail.com Sun May 19 19:55:43 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Sun, 19 May 2013 13:55:43 -0400 Subject: [datatable-help] Feature request: allow specification of row name column when keep.rownames=TRUE Message-ID: When one uses data.table(..., keep.rownames=TRUE) the name of the resulting column is always "rn". Some way of specifying the column name would be nice. (The default could continue to be "rn".) 
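In the meantime a rename after construction works (a sketch; "rn" is the current hard-coded name and "id" is only an example):

DF <- data.frame(x = 1:2, row.names = c("a", "b"))
DT <- data.table(DF, keep.rownames = TRUE)
setnames(DT, "rn", "id")   # pick whatever column name is wanted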
-- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From mdowle at mdowle.plus.com Mon May 20 15:07:35 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 20 May 2013 14:07:35 +0100 Subject: [datatable-help] data.table slides at R/Finance Message-ID: <9818de222df45dc234991a3b8fd7e3a0@imap.plus.net> are now on the data.table homepage : http://datatable.r-forge.r-project.org/ It was really encouraging to meet all the datatablers there! And got some useful feedback too. (The tutorial slides were meant more as prompts to explain and discuss, so they probably don't come across very well when read cold.) Matthew From jose at memo2.nl Tue May 21 11:10:30 2013 From: jose at memo2.nl (JNV) Date: Tue, 21 May 2013 02:10:30 -0700 (PDT) Subject: [datatable-help] Sum first 3 non zero elements of row Message-ID: <1369127430158-4667563.post@n4.nabble.com> Hi there, I've got this matrix D with, say 10 rows and 20 columns. For each row I want to sum the first 3 non zero elements and put them in a vector z. So if the first row D[1,] is 0 3 5 0 8 9 3 2 4 0 then I want z z<-D[1,2]+D[1,3]+D[1,5] But if there are less than 3 non zero elements, those should be summed. If there are no non zero elements, the result must be zero. So if the first row D[1,] is 0 0 3 0 1 0 0 0 0 0 then I want z z<-D[1,3]+D[1,5] Hope someone can help me out! -- View this message in context: http://r.789695.n4.nabble.com/Sum-first-3-non-zero-elements-of-row-tp4667563.html Sent from the datatable-help mailing list archive at Nabble.com. From ggrothendieck at gmail.com Tue May 21 14:28:56 2013 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Tue, 21 May 2013 08:28:56 -0400 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: <1369127430158-4667563.post@n4.nabble.com> References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: On Tue, May 21, 2013 at 5:10 AM, JNV wrote: > Hi there, > I've got this matrix D with, say 10 rows and 20 columns. For each row I want > to sum the first 3 non zero elements and put them in a vector z. > > So if the first row D[1,] is > 0 3 5 0 8 9 3 2 4 0 > > then I want z > z<-D[1,2]+D[1,3]+D[1,5] > > But if there are less than 3 non zero elements, those should be summed. If > there are no non zero elements, the result must be zero. > > So if the first row D[1,] is > 0 0 3 0 1 0 0 0 0 0 > > then I want z > z<-D[1,3]+D[1,5] > Here is a matrix, D, with those two rows The t(apply(...)) replaces the first non-zero element in each row with 1, the 2nd with 2, etc. (It puts garbage into the elements that are 0.) We then convert this to T/F according to whether each element less than or equal to 3 or not and multiply by the original data which both zaps the garbage in the zero positions and zaps those positions which are a 4th or more non-zero in each row. This multiplication also inserts the correct values into the good positions. Finally we sum the rows using what is left: > D <- matrix( c(0, 0, 3, 0, 5, 3, 0, 0, 8, 1, 9, 0, 3, + 0, 2, 0, 4, 0, 0, 0), 2) > > D [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0 3 5 0 8 9 3 2 4 0 [2,] 0 0 3 0 1 0 0 0 0 0 > > as.data.table(D)[, rowSums((t(apply(.SD > 0, 1, cumsum)) <= 3) * .SD)] [1] 16 4 Not sure if this really benefits from data.table as we could have written this without data.table: > rowSums((t(apply(D > 0, 1, cumsum)) <= 3) * D) [1] 16 4 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. 
tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From jose at memo2.nl Tue May 21 15:38:50 2013 From: jose at memo2.nl (=?ISO-8859-1?Q?Jos=E9_Verhoeven?=) Date: Tue, 21 May 2013 15:38:50 +0200 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: Thank you, it really helped me out! Didn't need the data.table like you proposed. 2013/5/21 Gabor Grothendieck > On Tue, May 21, 2013 at 5:10 AM, JNV wrote: > > Hi there, > > I've got this matrix D with, say 10 rows and 20 columns. For each row I > want > > to sum the first 3 non zero elements and put them in a vector z. > > > > So if the first row D[1,] is > > 0 3 5 0 8 9 3 2 4 0 > > > > then I want z > > z<-D[1,2]+D[1,3]+D[1,5] > > > > But if there are less than 3 non zero elements, those should be summed. > If > > there are no non zero elements, the result must be zero. > > > > So if the first row D[1,] is > > 0 0 3 0 1 0 0 0 0 0 > > > > then I want z > > z<-D[1,3]+D[1,5] > > > > Here is a matrix, D, with those two rows The t(apply(...)) replaces > the first non-zero element in each row with 1, the 2nd with 2, etc. > (It puts garbage into the elements that are 0.) We then convert > this to T/F according to whether each element less than or equal to 3 > or not and multiply by the original data which both zaps the garbage > in the zero positions and zaps those positions which are a 4th or more > non-zero in each row. This multiplication also inserts the correct > values into the good positions. Finally we sum the rows using what is > left: > > > D <- matrix( c(0, 0, 3, 0, 5, 3, 0, 0, 8, 1, 9, 0, 3, > + 0, 2, 0, 4, 0, 0, 0), 2) > > > > D > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] > [1,] 0 3 5 0 8 9 3 2 4 0 > [2,] 0 0 3 0 1 0 0 0 0 0 > > > > as.data.table(D)[, rowSums((t(apply(.SD > 0, 1, cumsum)) <= 3) * .SD)] > [1] 16 4 > > Not sure if this really benefits from data.table as we could have > written this without data.table: > > > rowSums((t(apply(D > 0, 1, cumsum)) <= 3) * D) > [1] 16 4 > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue May 21 15:45:23 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 21 May 2013 14:45:23 +0100 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: <382f840f29fcaad1c3447419940ad28a@imap.plus.net> Yes, think it was meant for r-help. It can be fairly easy to mix up in Nabble since datatable-help is a sub-forum of R there. But the notices are quite clear upon subscribing (which is required to post). Matthew On 21.05.2013 13:28, Gabor Grothendieck wrote: > On Tue, May 21, 2013 at 5:10 AM, JNV wrote: >> Hi there, >> I've got this matrix D with, say 10 rows and 20 columns. For each >> row I want >> to sum the first 3 non zero elements and put them in a vector z. >> >> So if the first row D[1,] is >> 0 3 5 0 8 9 3 2 4 0 >> >> then I want z >> z<-D[1,2]+D[1,3]+D[1,5] >> >> But if there are less than 3 non zero elements, those should be >> summed. If >> there are no non zero elements, the result must be zero. 
>> >> So if the first row D[1,] is >> 0 0 3 0 1 0 0 0 0 0 >> >> then I want z >> z<-D[1,3]+D[1,5] >> > > Here is a matrix, D, with those two rows The t(apply(...)) replaces > the first non-zero element in each row with 1, the 2nd with 2, etc. > (It puts garbage into the elements that are 0.) We then convert > this to T/F according to whether each element less than or equal to 3 > or not and multiply by the original data which both zaps the garbage > in the zero positions and zaps those positions which are a 4th or > more > non-zero in each row. This multiplication also inserts the correct > values into the good positions. Finally we sum the rows using what > is > left: > >> D <- matrix( c(0, 0, 3, 0, 5, 3, 0, 0, 8, 1, 9, 0, 3, > + 0, 2, 0, 4, 0, 0, 0), 2) >> >> D > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] > [1,] 0 3 5 0 8 9 3 2 4 0 > [2,] 0 0 3 0 1 0 0 0 0 0 >> >> as.data.table(D)[, rowSums((t(apply(.SD > 0, 1, cumsum)) <= 3) * >> .SD)] > [1] 16 4 > > Not sure if this really benefits from data.table as we could have > written this without data.table: > >> rowSums((t(apply(D > 0, 1, cumsum)) <= 3) * D) > [1] 16 4 > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From alexandre.sieira at gmail.com Tue May 21 20:06:25 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 21 May 2013 15:06:25 -0300 Subject: [datatable-help] rbindlist and factors Message-ID: I think I found an unexpected behavior with rbindlist when columns are factors: > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > dt1 ? ?a 1: a 2: a 3: a > str(dt1) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 ?- attr(*, ".internal.selfref")=? > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > dt2 ? ?a 1: b 2: b 3: b > str(dt2) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "b": 1 1 1 ?- attr(*, ".internal.selfref")=? If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > rbind(dt1, dt2) ? ?a 1: a 2: a 3: a 4: b 5: b 6: b So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > rbindlist(list(dt1, dt2)) ? ?a 1: a 2: a 3: a 4: a 5: a 6: a > str(rbindlist(list(dt1, dt2))) Classes ?data.table? and 'data.frame': 6 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 1 1 1 ?- attr(*, ".internal.selfref")=? This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. Is this expected behavior? Am I missing something? --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed... 
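A workaround on 1.8.8 until a fixed release is out, sketched below: plain rbind dispatches to data.table's rbind method, which combines the factor levels correctly (at some speed cost relative to rbindlist):

dt1 <- data.table(a = factor(c("a", "a", "a")))
dt2 <- data.table(a = factor(c("b", "b", "b")))
do.call("rbind", list(dt1, dt2))   # 6 rows, levels "a" and "b" both kept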
URL: From aragorn168b at gmail.com Tue May 21 20:08:56 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 21 May 2013 20:08:56 +0200 Subject: [datatable-help] rbindlist and factors In-Reply-To: References: Message-ID: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> This was already addressed here: http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for And was known to be a bug filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 Which has been fixed in the current development version 1.8.9. ( Fixed by commit 879 in v1.8.9 Hope this helps, Arun On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: > I think I found an unexpected behavior with rbindlist when columns are factors: > > > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > > > dt1 > a > 1: a > 2: a > 3: a > > str(dt1) > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > $ a: Factor w/ 1 level "a": 1 1 1 > - attr(*, ".internal.selfref")= > > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > > dt2 > a > 1: b > 2: b > 3: b > > str(dt2) > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > $ a: Factor w/ 1 level "b": 1 1 1 > - attr(*, ".internal.selfref")= > > If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > > > rbind(dt1, dt2) > a > 1: a > 2: a > 3: a > 4: b > 5: b > 6: b > > > So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > > > rbindlist(list(dt1, dt2)) > a > 1: a > 2: a > 3: a > 4: a > 5: a > 6: a > > > str(rbindlist(list(dt1, dt2))) > Classes ?data.table? and 'data.frame': 6 obs. of 1 variable: > $ a: Factor w/ 1 level "a": 1 1 1 1 1 1 > - attr(*, ".internal.selfref")= > > > This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. > > Is this expected behavior? Am I missing something? > > > > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Tue May 21 20:11:39 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 21 May 2013 15:11:39 -0300 Subject: [datatable-help] =?utf-8?q?rbindlist_and_factors?= In-Reply-To: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> References: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> Message-ID: Thank you, I'll wait for the next release then.? It's do.call("rbind", ?) till then, I presume. :) --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." 
Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 21 de maio de 2013 at 15:09:00, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: This was already addressed here: http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for And was known to be a bug filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 Which has been fixed in the current development version 1.8.9. ( Fixed by commit 879 in v1.8.9 Hope this helps, Arun On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: I think I found an unexpected behavior with rbindlist when columns are factors: > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > dt1 ? ?a 1: a 2: a 3: a > str(dt1) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 ?- attr(*, ".internal.selfref")=? > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > dt2 ? ?a 1: b 2: b 3: b > str(dt2) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "b": 1 1 1 ?- attr(*, ".internal.selfref")=? If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > rbind(dt1, dt2) ? ?a 1: a 2: a 3: a 4: b 5: b 6: b So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > rbindlist(list(dt1, dt2)) ? ?a 1: a 2: a 3: a 4: a 5: a 6: a > str(rbindlist(list(dt1, dt2))) Classes ?data.table? and 'data.frame': 6 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 1 1 1 ?- attr(*, ".internal.selfref")=? This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. Is this expected behavior? Am I missing something? --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue May 21 20:14:35 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 21 May 2013 20:14:35 +0200 Subject: [datatable-help] rbindlist and factors In-Reply-To: References: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> Message-ID: You can download 1.8.9 from r-forge and use it. If you're much concerned, you can use devtools and install 1.8.9 in dev mode as follows: >require(devtools) >dev_mode(TRUE) d> install.packages("data.table", repos="http://R-Forge.R-project.org", type="source") d> require(data.table) d> # do whatever calculations you want d> dev_mode(FALSE) > # returns to normal session Arun On Tuesday, May 21, 2013 at 8:11 PM, Alexandre Sieira wrote: > Thank you, I'll wait for the next release then. > > It's do.call("rbind", ?) till then, I presume. :) > -- > Alexandre Sieira > CISA, CISSP, ISO 27001 Lead Auditor > > "The truth is rarely pure and never simple." 
> Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > On 21 de maio de 2013 at 15:09:00, Arunkumar Srinivasan (aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)) wrote: > > > This was already addressed here: > > http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for > > > > And was known to be a bug filed here: > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 > > > > Which has been fixed in the current development version 1.8.9. ( > > Fixed by commit 879 in v1.8.9 > > > > > > > > Hope this helps, > > Arun > > > > > > On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: > > > > > I think I found an unexpected behavior with rbindlist when columns are factors: > > > > > > > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > > > > > > > dt1 > > > a > > > 1: a > > > 2: a > > > 3: a > > > > str(dt1) > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > $ a: Factor w/ 1 level "a": 1 1 1 > > > - attr(*, ".internal.selfref")= > > > > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > > > > dt2 > > > a > > > 1: b > > > 2: b > > > 3: b > > > > str(dt2) > > > Classes ?data.table? and 'data.frame': 3 obs. of 1 variable: > > > $ a: Factor w/ 1 level "b": 1 1 1 > > > - attr(*, ".internal.selfref")= > > > > > > If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > > > > > > > rbind(dt1, dt2) > > > a > > > 1: a > > > 2: a > > > 3: a > > > 4: b > > > 5: b > > > 6: b > > > > > > > > > So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > > > > > > > rbindlist(list(dt1, dt2)) > > > a > > > 1: a > > > 2: a > > > 3: a > > > 4: a > > > 5: a > > > 6: a > > > > > > > str(rbindlist(list(dt1, dt2))) > > > Classes ?data.table? and 'data.frame': 6 obs. of 1 variable: > > > $ a: Factor w/ 1 level "a": 1 1 1 1 1 1 > > > - attr(*, ".internal.selfref")= > > > > > > > > > This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. > > > > > > Is this expected behavior? Am I missing something? > > > > > > > > > > > > -- > > > Alexandre Sieira > > > CISA, CISSP, ISO 27001 Lead Auditor > > > > > > "The truth is rarely pure and never simple." > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Tue May 21 20:15:26 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 21 May 2013 15:15:26 -0300 Subject: [datatable-help] =?utf-8?q?rbindlist_and_factors?= In-Reply-To: References: <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> <1C7056DCFFCD4B0AA68233BD6D4D2B86@gmail.com> Message-ID: Thank you, Arun! --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 21 de maio de 2013 at 15:14:39, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: You can download 1.8.9 from r-forge and use it. 
If you're much concerned, you can use devtools and install 1.8.9 in dev mode as follows: >require(devtools) >dev_mode(TRUE) d>?install.packages("data.table",?repos="http://R-Forge.R-project.org", type="source") d> require(data.table) d> # do whatever calculations you want d> dev_mode(FALSE) > # returns to normal session Arun On Tuesday, May 21, 2013 at 8:11 PM, Alexandre Sieira wrote: Thank you, I'll wait for the next release then.? It's do.call("rbind", ?) till then, I presume. :) --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 21 de maio de 2013 at 15:09:00, Arunkumar Srinivasan (aragorn168b at gmail.com) wrote: This was already addressed here: http://stackoverflow.com/questions/15933846/rbindlist-two-data-tables-where-one-has-factor-and-other-has-character-type-for And was known to be a bug filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975 Which has been fixed in the current development version 1.8.9. ( Fixed by commit 879 in v1.8.9 Hope this helps, Arun On Tuesday, May 21, 2013 at 8:06 PM, Alexandre Sieira wrote: I think I found an unexpected behavior with rbindlist when columns are factors: > dt1 = data.table(a=as.factor(c("a", "a", "a"))) > dt1 ? ?a 1: a 2: a 3: a > str(dt1) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 ?- attr(*, ".internal.selfref")=? > dt2 = data.table(a=as.factor(c("b", "b", "b"))) > dt2 ? ?a 1: b 2: b 3: b > str(dt2) Classes ?data.table? and 'data.frame': 3 obs. of ?1 variable: ?$ a: Factor w/ 1 level "b": 1 1 1 ?- attr(*, ".internal.selfref")=? If I rbind them, I get the expected value - a table with 6 rows, 3 of which have value "a" and 3 with value "b": > rbind(dt1, dt2) ? ?a 1: a 2: a 3: a 4: b 5: b 6: b So if I do rbindlist(list(dt1, dt2)), I would expect to get the exact same result, only faster. Unfortunately, that is not the case: > rbindlist(list(dt1, dt2)) ? ?a 1: a 2: a 3: a 4: a 5: a 6: a > str(rbindlist(list(dt1, dt2))) Classes ?data.table? and 'data.frame': 6 obs. of ?1 variable: ?$ a: Factor w/ 1 level "a": 1 1 1 1 1 1 ?- attr(*, ".internal.selfref")=? This was executed with R 3.0.1 and data.table 1.8.8 on a Mac OS X 10.8.3. Is this expected behavior? Am I missing something? --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
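A quick check afterwards that the development version is the one in use (a sketch; dt1 and dt2 as in the original report):

packageVersion("data.table")   # should report 1.8.9 inside dev_mode
rbindlist(list(dt1, dt2))      # should now keep both factor levels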
URL: From jholtman at gmail.com Tue May 21 20:55:45 2013 From: jholtman at gmail.com (jim holtman) Date: Tue, 21 May 2013 14:55:45 -0400 Subject: [datatable-help] Sum first 3 non zero elements of row In-Reply-To: <1369127430158-4667563.post@n4.nabble.com> References: <1369127430158-4667563.post@n4.nabble.com> Message-ID: Is this what you want: > x <- matrix(sample(c(0,1), 200, TRUE, prob = c(10,1)), ncol = 20) > # sum up to at most first 3 non-zero items > xSum <- apply(x, 1, function(.row){ + indx <- which(.row != 0)[1:3] + return(sum(.row[indx], na.rm = TRUE)) + }) > xSum [1] 0 0 1 2 2 2 3 2 3 2 > cbind(apply(x, 1, paste, collapse = '')) [,1] [1,] "00000000000000000000" [2,] "00000000000000000000" [3,] "10000000000000000000" [4,] "00000000001000000001" [5,] "00001000001000000000" [6,] "01001000000000000000" [7,] "01001000000000000100" [8,] "00000010000001000000" [9,] "00100100000001000000" [10,] "00000001000000010000" > On Tue, May 21, 2013 at 5:10 AM, JNV wrote: > Hi there, > I've got this matrix D with, say 10 rows and 20 columns. For each row I > want > to sum the first 3 non zero elements and put them in a vector z. > > So if the first row D[1,] is > 0 3 5 0 8 9 3 2 4 0 > > then I want z > z<-D[1,2]+D[1,3]+D[1,5] > > But if there are less than 3 non zero elements, those should be summed. If > there are no non zero elements, the result must be zero. > > So if the first row D[1,] is > 0 0 3 0 1 0 0 0 0 0 > > then I want z > z<-D[1,3]+D[1,5] > > Hope someone can help me out! > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/Sum-first-3-non-zero-elements-of-row-tp4667563.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Wed May 22 15:31:55 2013 From: statquant at outlook.com (statquant3) Date: Wed, 22 May 2013 06:31:55 -0700 (PDT) Subject: [datatable-help] progress % in fread Message-ID: <1369229515842-4667694.post@n4.nabble.com> I know the % progress counter has been removed, but I loved the feature... Is it still accessible from an option or something ? Cheers Colin -- View this message in context: http://r.789695.n4.nabble.com/progress-in-fread-tp4667694.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Wed May 22 15:48:07 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 22 May 2013 14:48:07 +0100 Subject: [datatable-help] progress % in fread In-Reply-To: <1369229515842-4667694.post@n4.nabble.com> References: <1369229515842-4667694.post@n4.nabble.com> Message-ID: <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> :) Not currently as when I removed it the thinking was speed to also save the 'if(i%%1000 && last print more than 1 second ago)' for each row. But it could be made optional again if people want it: an inner loop could be a 1000 batch with an outer loop containing the if() and printf(x,"%\r"). Matthew On 22.05.2013 14:31, statquant3 wrote: > I know the % progress counter has been removed, but I loved the > feature... > Is it still accessible from an option or something ? 
> > Cheers > Colin > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/progress-in-fread-tp4667694.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Wed May 22 16:44:17 2013 From: statquant at outlook.com (statquant3) Date: Wed, 22 May 2013 07:44:17 -0700 (PDT) Subject: [datatable-help] progress % in fread In-Reply-To: <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> Message-ID: <1369233857528-4667704.post@n4.nabble.com> Don't know about the other but an option does not hurt, I just loaded a 6million rows file and did not know if it would take 1minute or 10 minutes... something like displayProgress=F in the signature maybe ??? or an option... ??? -- View this message in context: http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667704.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Wed May 22 18:32:26 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 22 May 2013 17:32:26 +0100 Subject: [datatable-help] progress % in fread In-Reply-To: <1369233857528-4667704.post@n4.nabble.com> References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> <1369233857528-4667704.post@n4.nabble.com> Message-ID: The problem was knitr (at least) - it doesn't like the \r. I thought about a graphical progress window, like tkProgressBar() but it needs to be updateable from C level. I could try that again now I'm more comfortable calling R from C. So ... how about a tkProgressBar ? Optional, with argument to fread(,progress=getOption("datatable.fread.progress")) by default FALSE. On 22.05.2013 15:44, statquant3 wrote: > Don't know about the other but an option does not hurt, I just loaded > a > 6million rows file and did not know if it would take 1minute or 10 > minutes... > something like displayProgress=F in the signature maybe ??? or an > option... > ??? > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667704.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Thu May 23 09:18:34 2013 From: statquant at outlook.com (statquant3) Date: Thu, 23 May 2013 00:18:34 -0700 (PDT) Subject: [datatable-help] progress % in fread In-Reply-To: References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> <1369233857528-4667704.post@n4.nabble.com> Message-ID: <1369293514874-4667782.post@n4.nabble.com> Hey, progressbar would be fancy of course, but the old [75%] updating on the screen was good enough. I did not get your point about knitr, if this is added back as an option (disabled by default) why would knitr complain about it ? Regards -- View this message in context: http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667782.html Sent from the datatable-help mailing list archive at Nabble.com. 
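To make the tkProgressBar idea concrete, here is a sketch of the interface at R level. In fread the counter would really be updated from C, and the progress argument is only a proposal at this point; the loop below just simulates batches of rows:

library(tcltk)
pb <- tkProgressBar(title = "fread", label = "0% read", min = 0, max = 100)
for (pct in seq(10, 100, by = 10)) {
    Sys.sleep(0.1)   # stand-in for parsing a batch of rows
    setTkProgressBar(pb, pct, label = sprintf("%d%% read", pct))
}
close(pb)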
From mdowle at mdowle.plus.com Thu May 23 13:55:11 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Thu, 23 May 2013 12:55:11 +0100
Subject: [datatable-help] progress % in fread
In-Reply-To: <1369293514874-4667782.post@n4.nabble.com>
References: <1369229515842-4667694.post@n4.nabble.com> <3858d94d5b0abc2798f6f7bc002fd8a4@imap.plus.net> <1369233857528-4667704.post@n4.nabble.com> <1369293514874-4667782.post@n4.nabble.com>
Message-ID: <7ba023c6f7e940f8ed92e7ebc1b9fbee@imap.plus.net>

On 23.05.2013 08:18, statquant3 wrote:
> Hey, progressbar would be fancy of course, but the old [75%] updating
> on the screen was good enough.
> I did not get your point about knitr, if this is added back as an
> option (disabled by default) why would knitr complain about it ?

It wouldn't, if left off. What I had in mind was that a user might soon turn on the global option in their .Rprofile, and then it may mess up in knitr, causing them to hunt online and possibly ask about that as an apparent bug. The \r output may affect some tests too (if they check output to the console, which some do), say if a user had progress turned on and then ran test.data.table(). In contrast, a tkProgressBar would work consistently in all environments without any risk of problems caused by the \r, and without the user needing to switch it off and on. Yes, more fancy, but for practical reasons too.

> Regards
> --
> View this message in context:
> http://r.789695.n4.nabble.com/progress-in-fread-tp4667694p4667782.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From ggrothendieck at gmail.com Fri May 24 18:29:05 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Fri, 24 May 2013 12:29:05 -0400
Subject: [datatable-help] logicals in datatable i
Message-ID:

The ?data.table page describing arg i says:

"integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame."

however, I get this:

> packageVersion("data.table")
[1] '1.8.9'
> DT1 <- data.table(a = 1:5)
> DT1[as.logical(NA)]
    a
1: NA

so it seems that if there is a single logical NA it not only is not recycled but it also is not regarded as FALSE (whereas the quoted statement seems to say a logical NA is regarded as FALSE in all cases).

Is this intended?

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Fri May 24 18:40:06 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 24 May 2013 18:40:06 +0200
Subject: [datatable-help] logicals in datatable i
In-Reply-To: References: Message-ID:

Gabor,
Not sure if this is intended, but there is code in `[.data.table` that explicitly assigns NA_integer_ if `i` is NA:

if (is.logical(i)) {
    if (identical(i, NA)) i = NA_integer_
    else i[is.na(i)] = FALSE
}

So, if `i` is JUST `NA`, it's replaced with NA_integer_. If there is more than one element and i contains NAs, they are replaced with FALSE. For example, doing

DT1[as.logical(c(NA, NA))]

would result in recycling and lead to an empty data.table with 0 rows and 1 column.
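Both branches in one quick sketch (on 1.8.9; results in comments):

DT1 <- data.table(a = 1:5)
DT1[NA]                      # identical(i, NA) is TRUE, so i becomes NA_integer_: one row of NA
DT1[as.logical(c(NA, NA))]   # NAs set to FALSE, then recycled: zero rows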
Arun On Friday, May 24, 2013 at 6:29 PM, Gabor Grothendieck wrote: > The ?data.table page describing arg i says: > > "integer and logical vectors work the same way they do in > [.data.frame. Other than NAs in logical i are treated as FALSE and a > single NA logical is not recycled to match the number of rows, as it > is in[.data.frame." > > however, I get this: > > > packageVersion("data.table") > [1] ?1.8.9? > > > DT1 <- data.table(a = 1:5) > > DT1[as.logical(NA)] > > > > a > 1: NA > > so it seems that if there is a single logical NA it not only is not > recycled but it also is not regarded as FALSE (whereas the quoted > statement seems to say a logical NA is regarded as FALSE in all > cases). > > Is this intended? > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com (http://gmail.com) > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri May 24 19:22:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 24 May 2013 18:22:09 +0100 Subject: [datatable-help] logicals in datatable i In-Reply-To: References: Message-ID: <6432f32fcb0a897af36d5263d22ba4fe@imap.plus.net> Yes indeed it's intended and there is this in FAQ 2.17 (differences between DF and DT) : "DT[NA] returns 1 row of NA, but DF[NA] returns a copy of DF containing NA throughout. The symbol NA is type logical in R, and is therefore recycled by [.data.frame. Intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience." I see what you mean Gabor about that sentence in ?data.table. Thanks, will improve that wording. Matthew On 24.05.2013 17:40, Arunkumar Srinivasan wrote: > Gabor, > Not sure if this is intended, but the there's a code that explicitly assigns NA_integer_ if `i` is NA in `[.data.table`: > > if (is.logical(i)) { > if (identical(i, NA)) > i = NA_integer_ > else i[is.na(i)] = FALSE > } > So, if `i` is JUST `NA`, then it's replaced with NA_integer_. If there are more than 1 element and i has NA in them, they are replaced with FALSE. > For ex: doing > DT1[as.logical(c(NA, NA))] > would result in recycling and lead to a 0 rows and 1 column empty data.table. > > Arun > > On Friday, May 24, 2013 at 6:29 PM, Gabor Grothendieck wrote: > >> The ?data.table page describing arg i says: >> "integer and logical vectors work the same way they do in >> [.data.frame. Other than NAs in logical i are treated as FALSE and a >> single NA logical is not recycled to match the number of rows, as it >> is in[.data.frame." >> however, I get this: >> >>> packageVersion("data.table") >> >> [1] '1.8.9' >> >>> DT1 <- data.table(a = 1:5) >>> DT1[as.logical(NA)] >> >> a >> 1: NA >> so it seems that if there is a single logical NA it not only is not >> recycled but it also is not regarded as FALSE (whereas the quoted >> statement seems to say a logical NA is regarded as FALSE in all >> cases). >> Is this intended? >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. 
>> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com [1] >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [2] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [3] Links: ------ [1] http://gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Tue May 28 19:37:16 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Tue, 28 May 2013 14:37:16 -0300 Subject: [datatable-help] Performance observation Message-ID: I was working on some code today and encountered this scenario here where the performance behavior of data.table surprised me a little. Is this expected? > dt = data.table(a=rnorm(1000000)) > system.time( for(i in 1:100000) j = dt[i, a] ) ? usu?rio ? sistema decorrido? ? ?78.064 ? ? 0.426 ? ?78.034? > system.time( for(i in 1:100000) j = dt[i, "a", with=F] ) ? usu?rio ? sistema decorrido? ? ?27.814 ? ? 0.154 ? ?27.810 ? > system.time( for(i in 1:100000) j = dt[["a"]][i] ) ? usu?rio ? sistema decorrido? ? ? 1.227 ? ? 0.006 ? ? 1.225? (sorry about the output in portuguese) Not knowing anything about how data.table is implemented internally, I would have assumed the three syntaxes for accessing the data.table should have similar or at the most a small difference in performance. --? Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue May 28 20:11:02 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 28 May 2013 19:11:02 +0100 Subject: [datatable-help] Performance observation In-Reply-To: References: Message-ID: Hi, Yes this is expected because `[.data.table` is a function call with associated overhead. You don't want to loop calls to it. Consider all the arguments to `[.data.table` and all the checks that must be done for existence and type of arguments on each call. The idea is to give [.data.table meaty calls which it can chew on. It doesn't like tiny tasks one at a time. `[[` on the other hand is an R primitive. It's part of the language. You can do very limited things with `[[` but in this case (looking up a single column by name or position) in a loop, that's best for the job. I use `[[` on data.table quite a lot. This is also the very reason for set()'s existence: ?set says it's a 'loopable :=' because of the `[.data.table` overhead. There's a feature request to detect when [.data.table is being looped, though : https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978 which would be more helpful of data.table, so at least it told you, rather than having to stumble across it. Hope that helps, Matthew On 28.05.2013 18:37, Alexandre Sieira wrote: > I was working on some code today and encountered this scenario here where the performance behavior of data.table surprised me a little. Is this expected? 
From alexandre.sieira at gmail.com Tue May 28 20:25:57 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Tue, 28 May 2013 15:25:57 -0300
Subject: [datatable-help] Performance observation
In-Reply-To:
References:
Message-ID:

Thank you very much. The documentation on := and set is really clear on this, thanks for pointing that out.

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I

On 28 May 2013 at 15:11:04, Matthew Dowle (mdowle at mdowle.plus.com) wrote:

Hi,

Yes this is expected because `[.data.table` is a function call with associated overhead. You don't want to loop calls to it. Consider all the arguments to `[.data.table` and all the checks that must be done for existence and type of arguments on each call. The idea is to give [.data.table meaty calls which it can chew on. It doesn't like tiny tasks one at a time.

`[[` on the other hand is an R primitive. It's part of the language. You can do very limited things with `[[`, but in this case (looking up a single column by name or position) in a loop, it's best for the job. I use `[[` on data.table quite a lot.

This is also the very reason for set()'s existence: ?set says it's a 'loopable :=' because of the `[.data.table` overhead.

There's a feature request to detect when [.data.table is being looped, though:

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978

which would be more helpful of data.table, so at least it told you, rather than having to stumble across it.

Hope that helps,

Matthew

On 28.05.2013 18:37, Alexandre Sieira wrote:

I was working on some code today and encountered this scenario, where the performance behavior of data.table surprised me a little. Is this expected?

> dt = data.table(a=rnorm(1000000))
> system.time( for(i in 1:100000) j = dt[i, a] )
  usuário  sistema decorrido
   78.064    0.426    78.034

> system.time( for(i in 1:100000) j = dt[i, "a", with=F] )
  usuário  sistema decorrido
   27.814    0.154    27.810

> system.time( for(i in 1:100000) j = dt[["a"]][i] )
  usuário  sistema decorrido
    1.227    0.006     1.225

(sorry about the output in Portuguese)

Not knowing anything about how data.table is implemented internally, I would have assumed the three syntaxes for accessing the data.table would have similar performance, or at most a small difference.

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
From mdowle at mdowle.plus.com Tue May 28 20:26:29 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 28 May 2013 19:26:29 +0100
Subject: [datatable-help] Performance observation
In-Reply-To:
References:
Message-ID:

Here's a nice benchmark that's just been posted on S.O. showing the set() speedup when looped:

http://stackoverflow.com/a/16797392/403310

On 28.05.2013 19:11, Matthew Dowle wrote:
> Hi,
>
> Yes this is expected because `[.data.table` is a function call with associated overhead. You don't want to loop calls to it. Consider all the arguments to `[.data.table` and all the checks that must be done for existence and type of arguments on each call. The idea is to give [.data.table meaty calls which it can chew on. It doesn't like tiny tasks one at a time.
>
> `[[` on the other hand is an R primitive. It's part of the language. You can do very limited things with `[[`, but in this case (looking up a single column by name or position) in a loop, it's best for the job. I use `[[` on data.table quite a lot.
>
> This is also the very reason for set()'s existence: ?set says it's a 'loopable :=' because of the `[.data.table` overhead.
>
> There's a feature request to detect when [.data.table is being looped, though:
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978
>
> which would be more helpful of data.table, so at least it told you, rather than having to stumble across it.
>
> Hope that helps,
>
> Matthew
>
> On 28.05.2013 18:37, Alexandre Sieira wrote:
>
>> I was working on some code today and encountered this scenario, where the performance behavior of data.table surprised me a little. Is this expected?
>>
>>> dt = data.table(a=rnorm(1000000))
>>> system.time( for(i in 1:100000) j = dt[i, a] )
>>   usuário  sistema decorrido
>>    78.064    0.426    78.034
>>
>>> system.time( for(i in 1:100000) j = dt[i, "a", with=F] )
>>   usuário  sistema decorrido
>>    27.814    0.154    27.810
>>
>>> system.time( for(i in 1:100000) j = dt[["a"]][i] )
>>   usuário  sistema decorrido
>>     1.227    0.006     1.225
>>
>> (sorry about the output in Portuguese)
>>
>> Not knowing anything about how data.table is implemented internally, I would have assumed the three syntaxes for accessing the data.table would have similar performance, or at most a small difference.
>>
>> --
>> Alexandre Sieira
>> CISA, CISSP, ISO 27001 Lead Auditor
>>
>> "The truth is rarely pure and never simple."
>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I

From tpassafaro at hotmail.com Tue May 28 22:45:15 2013
From: tpassafaro at hotmail.com (tpassafaro)
Date: Tue, 28 May 2013 13:45:15 -0700 (PDT)
Subject: [datatable-help] how to remove outliers per levels
Message-ID: <1369773915321-4668153.post@n4.nabble.com>

Dear all,

I have data with two columns, age and weights. I would like to remove the weight outliers by age. I have tried the by function per age using boxplot, and I know what the outliers are, but I don't know how to remove them. Could you help me, please?

thanks

--
View this message in context: http://r.789695.n4.nabble.com/how-to-remove-outliers-per-levels-tp4668153.html
Sent from the datatable-help mailing list archive at Nabble.com.
From mdowle at mdowle.plus.com Wed May 29 00:20:11 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 28 May 2013 23:20:11 +0100
Subject: [datatable-help] how to remove outliers per levels
In-Reply-To: <1369773915321-4668153.post@n4.nabble.com>
References: <1369773915321-4668153.post@n4.nabble.com>
Message-ID:

Hi,

This is datatable-help, just for the package data.table: please read again the detailed message you received upon subscription. The question is too general, with no example and no evidence of what you tried yourself. Please search for how to ask good questions. Have you searched on Stack Overflow?

If you need further advice please ask me off list.

Regards, Matthew

> Dear all,
>
> I have data with two columns, age and weights. I would like to remove
> the weight outliers by age. I have tried the by function per age using
> boxplot, and I know what the outliers are, but I don't know how to
> remove them. Could you help me, please?
>
> thanks
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/how-to-remove-outliers-per-levels-tp4668153.html
> Sent from the datatable-help mailing list archive at Nabble.com.
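Since the question stays in the archive, a minimal sketch (not from the thread) of the kind of per-group filter being asked for; the toy data are invented, and the 1.5 * IQR fence, boxplot's default whisker rule, is an assumption about what "outlier" means here:

library(data.table)

set.seed(42)
dt <- data.table(age    = rep(1:3, each = 50),
                 weight = rnorm(150, mean = 50, sd = 5))

# keep, within each age, only the weights inside the boxplot whiskers
cleaned <- dt[, .SD[{
    qs    <- quantile(weight, c(0.25, 0.75), names = FALSE)
    fence <- 1.5 * (qs[2] - qs[1])
    weight >= qs[1] - fence & weight <= qs[2] + fence
}], by = age]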
From sds at gnu.org Thu May 30 21:48:52 2013
From: sds at gnu.org (Sam Steingold)
Date: Thu, 30 May 2013 15:48:52 -0400
Subject: [datatable-help] join results aren't always sorted?
Message-ID: <87ehcofduj.fsf@gnu.org>

Hi,
I have a table:
--8<---------------cut here---------------start------------->8---
> str(dates.dt)
Classes 'data.table' and 'data.frame':  1343 obs. of  4 variables:
 $ sid  : chr  "missing" "missing" "missing" "missing" ...
 $ s.c  : chr  "CLICK" "CLICK" "CLICK" "CLICK" ...
 $ count: int  70559 71555 79985 84385 88147 94130 100195 109031 116890 129726 ...
 $ time : POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr  "sid" "s.c" "time"
> dates.dt
          sid   s.c count                time
   1: missing CLICK 70559 2013-05-15 00:00:00
   2: missing CLICK 71555 2013-05-15 01:00:00
   3: missing CLICK 79985 2013-05-15 02:00:00
   4: missing CLICK 84385 2013-05-15 03:00:00
   5: missing CLICK 88147 2013-05-15 04:00:00
  ---
1339: present SHARE 35295 2013-05-28 19:00:00
1340: present SHARE 36284 2013-05-28 20:00:00
1341: present SHARE 69504 2013-05-28 21:00:00
1342: present SHARE 67037 2013-05-28 22:00:00
1343: present SHARE 61014 2013-05-28 23:00:00
--8<---------------cut here---------------end--------------->8---
I summarise them by various fields:
--8<---------------cut here---------------start------------->8---
> shares <- dates.dt[s.c=="SHARE", list(sum(count)), by="time"]
> clicks <- dates.dt[s.c=="CLICK", list(sum(count)), by="time"]
> str(shares)
Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
 $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
 $ V1  : int  60531 57837 67495 76716 83465 86822 91318 100520 112352 124784 ...
 - attr(*, ".internal.selfref")=<externalptr>
> str(clicks)
Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
 $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
 $ V1  : int  129450 137222 157721 171319 183720 195652 216003 238295 260715 279235 ...
 - attr(*, "sorted")= chr "time"
 - attr(*, ".internal.selfref")=<externalptr>
--8<---------------cut here---------------end--------------->8---
why is clicks but not shares sorted by time?
(if I make "time" the first key in dates.dt, the problem goes away, so,
I guess, this is expected).

What I actually want is a single data table keyed by time with columns
shares, clicks, missing, present, missing/clicks &c.
I can, obviously, construct it by hand:
--8<---------------cut here---------------start------------->8---
setkeyv(shares,"time")
stopifnot(identical(shares$time,clicks$time))
dt <- data.table(time=shares$time, clicks=clicks$V1, shares=shares$V1)
--8<---------------cut here---------------end--------------->8---
but I was wondering if there is a better way.
Thanks.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 13.04 (raring) X 11.0.11303000
http://www.childpsy.net/ http://pmw.org.il http://dhimmi.com
http://jihadwatch.org http://www.memritv.org http://honestreporting.com
Garbage In, Gospel Out
From mdowle at mdowle.plus.com Fri May 31 10:59:38 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 31 May 2013 09:59:38 +0100
Subject: [datatable-help] join results aren't always sorted?
In-Reply-To: <87ehcofduj.fsf@gnu.org>
References: <87ehcofduj.fsf@gnu.org>
Message-ID: <9e44d0277d181000d007b7a42ed0b14f@imap.plus.net>

Hi,

> why is clicks but not shares sorted by time?

The groups (each unique time in this case) are returned in the order of first appearance when you use 'by'. This is important and relied upon. Once the result is known, a quick check is made to see whether it happens to be ordered (using the very fast is.unsorted()) and if so the result is marked as keyed (which is the "sorted" attribute seen in the str() output).

> What I actually want is a single data table keyed by time with ...

How about 'keyby' rather than 'by':

dates.dt[s.c=="SHARE", list(sum(count)), keyby="time"]

Even if I know the data is already sorted in group order, I often use 'keyby' anyway for robustness.

Matthew

On 30.05.2013 20:48, Sam Steingold wrote:
> Hi,
> I have a table:
> --8<---------------cut here---------------start------------->8---
>> str(dates.dt)
> Classes 'data.table' and 'data.frame':  1343 obs. of  4 variables:
>  $ sid  : chr  "missing" "missing" "missing" "missing" ...
>  $ s.c  : chr  "CLICK" "CLICK" "CLICK" "CLICK" ...
>  $ count: int  70559 71555 79985 84385 88147 94130 100195 109031 116890 129726 ...
>  $ time : POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
>  - attr(*, ".internal.selfref")=<externalptr>
>  - attr(*, "sorted")= chr  "sid" "s.c" "time"
>> dates.dt
>           sid   s.c count                time
>    1: missing CLICK 70559 2013-05-15 00:00:00
>    2: missing CLICK 71555 2013-05-15 01:00:00
>    3: missing CLICK 79985 2013-05-15 02:00:00
>    4: missing CLICK 84385 2013-05-15 03:00:00
>    5: missing CLICK 88147 2013-05-15 04:00:00
>   ---
> 1339: present SHARE 35295 2013-05-28 19:00:00
> 1340: present SHARE 36284 2013-05-28 20:00:00
> 1341: present SHARE 69504 2013-05-28 21:00:00
> 1342: present SHARE 67037 2013-05-28 22:00:00
> 1343: present SHARE 61014 2013-05-28 23:00:00
> --8<---------------cut here---------------end--------------->8---
> I summarise them by various fields:
> --8<---------------cut here---------------start------------->8---
>> shares <- dates.dt[s.c=="SHARE", list(sum(count)), by="time"]
>> clicks <- dates.dt[s.c=="CLICK", list(sum(count)), by="time"]
>> str(shares)
> Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
>  $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
>  $ V1  : int  60531 57837 67495 76716 83465 86822 91318 100520 112352 124784 ...
>  - attr(*, ".internal.selfref")=<externalptr>
>> str(clicks)
> Classes 'data.table' and 'data.frame':  336 obs. of  2 variables:
>  $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" ...
>  $ V1  : int  129450 137222 157721 171319 183720 195652 216003 238295 260715 279235 ...
>  - attr(*, "sorted")= chr "time"
>  - attr(*, ".internal.selfref")=<externalptr>
> --8<---------------cut here---------------end--------------->8---
> why is clicks but not shares sorted by time?
> (if I make "time" the first key in dates.dt, the problem goes away, so,
> I guess, this is expected).
>
> What I actually want is a single data table keyed by time with columns
> shares, clicks, missing, present, missing/clicks &c.
> I can, obviously, construct it by hand:
> --8<---------------cut here---------------start------------->8---
> setkeyv(shares,"time")
> stopifnot(identical(shares$time,clicks$time))
> dt <- data.table(time=shares$time, clicks=clicks$V1, shares=shares$V1)
> --8<---------------cut here---------------end--------------->8---
> but I was wondering if there is a better way.
> Thanks.
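A small sketch of the 'keyby' suggestion above; the column names come from Sam's table but the toy data here are invented:

library(data.table)

dates.dt <- data.table(
    sid   = "present",
    s.c   = rep(c("CLICK", "SHARE"), each = 4),
    count = c(10L, 20L, 30L, 40L, 1L, 2L, 3L, 4L),
    time  = rep(as.POSIXct("2013-05-15 00:00:00", tz = "UTC") + 3600 * (0:3), 2)
)

# keyby aggregates like 'by', then sets the key on the result,
# so both summaries come back sorted (and keyed) by time
shares <- dates.dt[s.c == "SHARE", list(shares = sum(count)), keyby = "time"]
clicks <- dates.dt[s.c == "CLICK", list(clicks = sum(count)), keyby = "time"]

# with both results keyed by time, a join then builds the single
# wide table asked for, without the setkeyv()/data.table() by hand
combined <- clicks[shares]   # columns: time, clicks, shares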