From vishalhald at gmail.com Tue Apr 2 06:06:37 2013 From: vishalhald at gmail.com (vishal) Date: Mon, 1 Apr 2013 21:06:37 -0700 (PDT) Subject: [datatable-help] Need help with fread function Message-ID: <1364875597879-4663036.post@n4.nabble.com> I am trying to read a 2.5GB pipe-delimited text file in R using fread but I am getting the error below. " Opened file ok, obtained its size on disk (-0.0MB), but couldn't memory map it. This is a 32bit machine. You don't need more RAM per se but this fread function is tuned for 64bit addressability, at the expense of large file support on 32bit machines. You probably need more RAM to store the resulting data.table, anyway. And most speed benefits of data.table are on 64bit with large RAM, too. Please either upgrade to 64bit (e.g. a 64bit netbook with 4GB RAM can cost just £300), or make a case for 32bit large file support to datatable-help." I have used the following syntax. setwd("E:/Projects/Alo") library(data.table) f <- fread("6.txt",header="auto") -- View this message in context: http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Apr 2 09:48:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 02 Apr 2013 08:48:52 +0100 Subject: [datatable-help] Need help with fread function In-Reply-To: <1364875597879-4663036.post@n4.nabble.com> References: <1364875597879-4663036.post@n4.nabble.com> Message-ID: <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> Hi, There seems to be a printing problem in that error message (-0.0MB should say 2.5GB), but other than that I can't think of much else to add that isn't in the error message already. The error message tells us you are using a 32bit computer. Are you on Windows or Linux? 2.5GB is over 2^31 bytes, so at the limit of addressability for 32bit. The file doesn't need to fit in RAM, but it does need to be addressable. For example if you had 1GB of RAM, you should be able to read a 1.5GB file ok. It's not the amount of RAM you have per se, but whether you are 32bit or 64bit. Matthew On 02.04.2013 05:06, vishal wrote: > I am trying to read a 2.5GB pipe-delimited text file in R using fread > but I am > getting the error below. > > " Opened file ok, obtained its size on disk (-0.0MB), but couldn't > memory > map it. This is a 32bit machine. You don't need more RAM per se but > this > fread function is tuned for 64bit addressability, at the expense of > large > file support on 32bit machines. You probably need more RAM to store > the > resulting data.table, anyway. And most speed benefits of data.table > are on > 64bit with large RAM, too. Please either upgrade to 64bit (e.g. a > 64bit > netbook with 4GB RAM can cost just £300), or make a case for 32bit > large > file support to datatable-help." > > I have used the following syntax. > setwd("E:/Projects/Alo") > library(data.table) > f <- fread("6.txt",header="auto") > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036.html > Sent from the datatable-help mailing list archive at Nabble.com. 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From vishalhald at gmail.com Tue Apr 2 09:53:30 2013 From: vishalhald at gmail.com (vishal) Date: Tue, 2 Apr 2013 00:53:30 -0700 (PDT) Subject: [datatable-help] Need help with fread function In-Reply-To: <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> References: <1364875597879-4663036.post@n4.nabble.com> <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> Message-ID: <1364889210584-4663049.post@n4.nabble.com> Hi Matthew, I am using 32 bit windows with 4 GB RAM Vishwesh -- View this message in context: http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036p4663049.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Apr 2 10:44:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 02 Apr 2013 09:44:52 +0100 Subject: [datatable-help] Need help with fread function In-Reply-To: <1364889210584-4663049.post@n4.nabble.com> References: <1364875597879-4663036.post@n4.nabble.com> <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> <1364889210584-4663049.post@n4.nabble.com> Message-ID: Thanks. Looking at Windows docs there is a GetFileSizeEx and maybe that'll make this 2.5GB file work on 32bit. Have filed here : https://r-forge.r-project.org/tracker/?group_id=240&atid=975&func=detail&aid=2655 But the best we can hope for on 32bit is 4GB support, iiuc, using memory mapping which fread relies on. Quite possible that the real limit is around 3.2GB (whether RAM addressability limits apply to mapping files as well or not on Windows I don't know). I'll let you know when it's in v1.8.9 and you can try again. If this quick fix doesn't work, then I'm not planning on trying harder. As the error message says: 64bit is the way forward. Matthew On 02.04.2013 08:53, vishal wrote: > Hi Matthew, > > I am using 32 bit windows with 4 GB RAM > > Vishwesh > > > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036p4663049.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From npgraham1 at gmail.com Tue Apr 2 20:30:32 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 2 Apr 2013 14:30:32 -0400 Subject: [datatable-help] fread on gzipped files Message-ID: I have a moderately large csv file that's gzipped, but not in a tar archive, so it's "filename.csv.gz" that I want to read into a data.table. I'd like to use fread(), but I can't seem to make it work. I'm currently using the following: data.table(read.csv(gzfile("filename.csv.gz","r"))) Various combinations of gzfile, gzcon, file, readLines, and textConnection all produce an error (invalid input). Is there a better way to read in large, compressed files? ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Tue Apr 2 21:12:03 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 02 Apr 2013 20:12:03 +0100 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: Message-ID: <173fc96df68310b80565cdde75586781@imap.plus.net> Hi, fread memory maps the entire uncompressed file and this is baked into the way it works (e.g. skipping to the beginning, middle and last 5 rows to detect column types before starting to read the rows in) and where the convenience and speed comes from. You could uncompress the .gz to a ramdisk first, and then fread the uncompressed file from that ramdisk, is probably the fastest way. Which should still be pretty quick and I guess unlikely much slower than anything we could build into fread (provided you use a ramdisk). Matthew On 02.04.2013 19:30, Nathaniel Graham wrote: > I have a moderately large csv file that's gzipped, but not in a tar > archive, so it's "filename.csv.gz" that I want to read into a data.table. > I'd like to use fread(), but I can't seem to make it work. I'm currently > using the following: > data.table(read.csv(gzfile("filename.csv.gz","r"))) > Various combinations of gzfile, gzcon, file, readLines, and > textConnection all produce an error (invalid input). Is there a better > way to read in large, compressed files? > > ------- > Nathaniel Graham > npgraham1 at gmail.com [1] > npgraham1 at uky.edu [2] Links: ------ [1] mailto:npgraham1 at gmail.com [2] mailto:npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Tue Apr 2 21:36:07 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 2 Apr 2013 15:36:07 -0400 Subject: [datatable-help] fread on gzipped files In-Reply-To: <173fc96df68310b80565cdde75586781@imap.plus.net> References: <173fc96df68310b80565cdde75586781@imap.plus.net> Message-ID: Thanks, but I suspect that it would take longer to setup and then remove a ramdisk than it would to use read.csv and data.table. My files are moderately large (between 200 MB and 3 GB when compressed), but not enormous; I gzip not so much to save space on disk but to speed up reads. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: > ** > > > > Hi, > > fread memory maps the entire uncompressed file and this is baked into the > way it works (e.g. skipping to the beginning, middle and last 5 rows to > detect column types before starting to read the rows in) and where the > convenience and speed comes from. > > You could uncompress the .gz to a ramdisk first, and then fread the > uncompressed file from that ramdisk, is probably the fastest way. Which > should still be pretty quick and I guess unlikely much slower than anything > we could build into fread (provided you use a ramdisk). > > Matthew > > > > On 02.04.2013 19:30, Nathaniel Graham wrote: > > I have a moderately large csv file that's gzipped, but not in a tar > archive, so it's "filename.csv.gz" that I want to read into a data.table. > I'd like to use fread(), but I can't seem to make it work. I'm currently > using the following: > data.table(read.csv(gzfile("filename.csv.gz","r"))) > Various combinations of gzfile, gzcon, file, readLines, and > textConnection all produce an error (invalid input). Is there a better > way to read in large, compressed files? 
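A minimal sketch of the workaround Matthew suggests above -- stream-decompress the .gz to a temporary file (ideally on a ramdisk such as /dev/shm), then fread the plain file. "filename.csv.gz" is the hypothetical name from the question, not a real file:

library(data.table)

tmp <- tempfile(fileext = ".csv")   # point tempdir() at a ramdisk for extra speed
gz  <- gzfile("filename.csv.gz", open = "rb")   # connection decompresses on read
out <- file(tmp, open = "wb")
while (length(chunk <- readBin(gz, what = raw(), n = 64 * 1024^2)) > 0L) {
  writeBin(chunk, out)              # copy ~64MB of decompressed bytes at a time
}
close(gz); close(out)

DT <- fread(tmp)                    # fread then memory-maps the uncompressed copy
unlink(tmp)                         # remove the temporary uncompressed file

This only trades disk space for a second pass over the data; the fread call itself is unchanged.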
> ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 3 10:58:24 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 03 Apr 2013 09:58:24 +0100 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: <173fc96df68310b80565cdde75586781@imap.plus.net> Message-ID: <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Interesting. How much do you find read.csv is sped up by reading gzip'd files? On 02.04.2013 20:36, Nathaniel Graham wrote: > Thanks, but I suspect that it would take longer to setup and then remove > a ramdisk than it would to use read.csv and data.table. My files are > moderately large (between 200 MB and 3 GB when compressed), but not > enormous; I gzip not so much to save space on disk but to speed up reads. > > ------- > Nathaniel Graham > npgraham1 at gmail.com [3] > npgraham1 at uky.edu [4] > > On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: > >> Hi, >> >> fread memory maps the entire uncompressed file and this is baked into the way it works (e.g. skipping to the beginning, middle and last 5 rows to detect column types before starting to read the rows in) and where the convenience and speed comes from. >> >> You could uncompress the .gz to a ramdisk first, and then fread the uncompressed file from that ramdisk, is probably the fastest way. Which should still be pretty quick and I guess unlikely much slower than anything we could build into fread (provided you use a ramdisk). >> >> Matthew >> >> On 02.04.2013 19:30, Nathaniel Graham wrote: >> >>> I have a moderately large csv file that's gzipped, but not in a tar >>> archive, so it's "filename.csv.gz" that I want to read into a data.table. >>> I'd like to use fread(), but I can't seem to make it work. I'm currently >>> using the following: >>> data.table(read.csv(gzfile("filename.csv.gz","r"))) >>> Various combinations of gzfile, gzcon, file, readLines, and >>> textConnection all produce an error (invalid input). Is there a better >>> way to read in large, compressed files? >>> >>> ------- >>> Nathaniel Graham >>> npgraham1 at gmail.com [1] >>> npgraham1 at uky.edu [2] Links: ------ [1] mailto:npgraham1 at gmail.com [2] mailto:npgraham1 at uky.edu [3] mailto:npgraham1 at gmail.com [4] mailto:npgraham1 at uky.edu [5] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Wed Apr 3 22:20:55 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Wed, 3 Apr 2013 16:20:55 -0400 Subject: [datatable-help] fread on gzipped files In-Reply-To: <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> References: <173fc96df68310b80565cdde75586781@imap.plus.net> <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Message-ID: Subjectively, the difference seems substantial, with large loads taking half or a third as long. Whether I use gzip or not, CPU usage isn't especially high, suggesting that I'm either waiting on the hard drive or that the whole process is memory bound. I was all set to produce some timings for comparison, but I'm working from home today and my home machine struggles to accommodate large files---any difference in load times gets swamped by swapping and general flailing on the part of the OS (I've only got 4GB of RAM at home). Hopefully I'll get around to doing some timings on my work machine sometime this week, since I've got no issues with memory there. 
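The "waiting on the hard drive" hunch can be read straight off system.time() output: user + system is CPU actually burned by R, elapsed is wall-clock time, so a large gap between the two means the read spent most of its time waiting on I/O (or swap). A small sketch with a placeholder file name:

library(data.table)

tm <- system.time(DT <- fread("big.csv"))   # "big.csv" is a placeholder
print(tm)
# If elapsed greatly exceeds user.self + sys.self, the load was I/O- or
# swap-bound; if user.self dominates, parsing was the bottleneck.
tm[["elapsed"]] - (tm[["user.self"]] + tm[["sys.self"]])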
------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle wrote: > ** > > > > Interesting. How much do you find read.csv is sped up by reading gzip'd > files? > > > > On 02.04.2013 20:36, Nathaniel Graham wrote: > > Thanks, but I suspect that it would take longer to setup and then remove > a ramdisk than it would to use read.csv and data.table. My files are > moderately large (between 200 MB and 3 GB when compressed), but not > enormous; I gzip not so much to save space on disk but to speed up reads. > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: > >> >> >> Hi, >> >> fread memory maps the entire uncompressed file and this is baked into the >> way it works (e.g. skipping to the beginning, middle and last 5 rows to >> detect column types before starting to read the rows in) and where the >> convenience and speed comes from. >> >> You could uncompress the .gz to a ramdisk first, and then fread the >> uncompressed file from that ramdisk, is probably the fastest way. Which >> should still be pretty quick and I guess unlikely much slower than anything >> we could build into fread (provided you use a ramdisk). >> >> Matthew >> >> >> >> On 02.04.2013 19:30, Nathaniel Graham wrote: >> >> I have a moderately large csv file that's gzipped, but not in a tar >> archive, so it's "filename.csv.gz" that I want to read into a data.table. >> I'd like to use fread(), but I can't seem to make it work. I'm currently >> using the following: >> data.table(read.csv(gzfile("filename.csv.gz","r"))) >> Various combinations of gzfile, gzcon, file, readLines, and >> textConnection all produce an error (invalid input). Is there a better >> way to read in large, compressed files? >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com >> npgraham1 at uky.edu >> >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Fri Apr 5 20:59:47 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Fri, 5 Apr 2013 14:59:47 -0400 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: <173fc96df68310b80565cdde75586781@imap.plus.net> <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Message-ID: As promised, I did some testing. The results (described in detail below) are mixed, but suggest that compression is useful for some large data sets, and that if this is a serious issue for someone, they need to do some careful testing before committing to anything (I know, that should be obvious, but...). Also, my results pretty clearly show that fread() crushes read.csv, regardless of whether the csv file is compressed. Nice job Matthew! I start with Current Population Survey data from the Bureau of Labor Statistics. The file I used get be accessed here: ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData I converted it to a csv file using StatTransfer 8 (I'm lazy), with no quoting of strings. I then compressed the csv file using 7-Zip (gzip, Normal). The resulting files, both with 4937221 obs, 5 variables are: ln_data_1.csv : 133625 KB ln_data_1.csv.gz : 17528 KB Given the file size disparity, this should demonstrate any improvements via compression. Also, for comparison, I show fread below. I've made some formatting changes, but changed nothing else. 
for(i in 1:5) { t1 <- system.time(cps1 <- read.csv("ln_data_1.csv")) print(t1) } user system elapsed 12.32 0.53 12.90 12.51 0.44 13.00 12.39 0.47 12.89 12.36 0.55 12.96 12.43 0.36 12.94 for(i in 1:5) { t2 <- system.time(cps1 <- read.csv("ln_data_1.csv.gz")) print(t2) } user system elapsed 14.04 0.26 14.43 14.00 0.27 14.34 14.07 0.31 14.44 13.93 0.28 14.23 14.02 0.32 14.35 for(i in 1:5) { t3 <- system.time(cps1 <- fread("ln_data_1.csv")) print(t3) } user system elapsed 2.89 0.04 2.94 2.92 0.07 2.98 2.88 0.03 2.95 2.87 0.06 2.95 2.91 0.03 2.95 While the gzipped version uses less system time, total & user time has increased somewhat. The fread function from data.table is dramatically faster. While this isn't strictly a fair comparison because fread produces a data.table while read.csv produces a data.frame, the bias is against fread, not for it. Next, I produce a random 2,000,000x10 matrix, write it to csv, and then read it back into memory as a data.frame (or data.table, for fread). I again use 7-Zip for compression.The resulting files are: test2.csv : 375086 KB test2.csv.gz : 165477 KB > matr <- replicate(10,rnorm(2000000)) > write.csv(matr,"test2.csv") > t1 <- system.time(df <- read.csv("test2.csv")) > t2 <- system.time(df <- read.csv("test2.csv.gz")) > t3 <- system.time(df <- fread("test2.csv")) > t1 user system elapsed 165.32 0.36 166.25 > t2 user system elapsed 116.24 0.16 117.08 > t3 user system elapsed 17.64 0.06 17.83 The switch to strictly floating point numbers is significant. Compression is significant improvement--about 49 seconds or about 30%--although nowhere near enough for read.csv to be comparable to fread. Finally, I produce a 20000x1000 matrix. The resulting files are: test1.csv : 354854 KB test1.csv.gz : 157975 KB matr <- replicate(1000,rnorm(20000)) > write.csv(matr,"test1.csv") > t1 <- system.time(df <- read.csv("test1.csv")) > t2 <- system.time(df <- read.csv("test1.csv.gz")) > t3 <- system.time(df <- fread("test1.csv")) > t1 user system elapsed 206.80 1.14 208.60 > t2 user system elapsed 123.42 0.27 123.99 > t3 user system elapsed 17.24 0.09 17.37 Here, compression is an even larger win, improving by about 83 seconds or roughly 40%. The fread function is again dramatically faster, and unlike read.csv, fread's performance is similar regardless of the shape of the matrix. We could create more detailed tests, varying the number of columns vs rows and their type (strings vs integers vs floats, etc) to get better details, but the basic result is that compression can be a noticeable improvement in performance, but a superior read algorithm trumps that. If it's feasible to combine fread's behavior with gzip, bzip2, or xz compression, it could be a big win for some files, but not for all of them. The advice from http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html to compress csv files appears to hold, although it may not save much time if you have a lot of non-float values or few columns. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Wed, Apr 3, 2013 at 4:20 PM, Nathaniel Graham wrote: > Subjectively, the difference seems substantial, with large loads taking > half or a third as long. Whether I use gzip or not, CPU usage isn't > especially high, suggesting that I'm either waiting on the hard drive > or that the whole process is memory bound. 
I was all set to produce > some timings for comparison, but I'm working from home today and > my home machine struggles to accommodate large files---any difference > in load times gets swamped by swapping and general flailing on the > part of the OS (I've only got 4GB of RAM at home). Hopefully I'll get > around to doing some timings on my work machine sometime this > week, since I've got no issues with memory there. > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle wrote: > >> ** >> >> >> >> Interesting. How much do you find read.csv is sped up by reading gzip'd >> files? >> >> >> >> On 02.04.2013 20:36, Nathaniel Graham wrote: >> >> Thanks, but I suspect that it would take longer to setup and then remove >> a ramdisk than it would to use read.csv and data.table. My files are >> moderately large (between 200 MB and 3 GB when compressed), but not >> enormous; I gzip not so much to save space on disk but to speed up reads. >> >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com >> npgraham1 at uky.edu >> >> >> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: >> >>> >>> >>> Hi, >>> >>> fread memory maps the entire uncompressed file and this is baked into >>> the way it works (e.g. skipping to the beginning, middle and last 5 rows to >>> detect column types before starting to read the rows in) and where the >>> convenience and speed comes from. >>> >>> You could uncompress the .gz to a ramdisk first, and then fread the >>> uncompressed file from that ramdisk, is probably the fastest way. Which >>> should still be pretty quick and I guess unlikely much slower than anything >>> we could build into fread (provided you use a ramdisk). >>> >>> Matthew >>> >>> >>> >>> On 02.04.2013 19:30, Nathaniel Graham wrote: >>> >>> I have a moderately large csv file that's gzipped, but not in a tar >>> archive, so it's "filename.csv.gz" that I want to read into a data.table. >>> I'd like to use fread(), but I can't seem to make it work. I'm currently >>> using the following: >>> data.table(read.csv(gzfile("filename.csv.gz","r"))) >>> Various combinations of gzfile, gzcon, file, readLines, and >>> textConnection all produce an error (invalid input). Is there a better >>> way to read in large, compressed files? >>> ------- >>> Nathaniel Graham >>> npgraham1 at gmail.com >>> npgraham1 at uky.edu >>> >>> >>> >>> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Apr 5 21:38:40 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 05 Apr 2013 20:38:40 +0100 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: <173fc96df68310b80565cdde75586781@imap.plus.net> <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Message-ID: Fantastic, great job here, thanks! One thing to note is that read.csv is much faster when using the standard tricks (colClasses, nrows etc). That's why the speed comparisons in ?fread are careful to link to online resources that list what the tricks are, and then compare read.csv both with and without them to fread. Of course the "friendly" part of fread is that you don't need to learn or know any tricks, so from that point of view it may well be fair to compare no-frills read.csv to fread as you've done. Good to state that so that nobody accuses of unfair comparisons. But even with the tricks applied, fread is still much faster. With-tricks on a compressed file would be interesting for completeness. 
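For reference, the "standard tricks" version of read.csv that Matthew mentions looks like the sketch below. The row count comes from the benchmark description above; the column classes are assumptions for illustration, not taken from the actual ln_data_1.csv:

# Plain call: read.csv guesses types and grows its buffers as it reads.
df1 <- read.csv("ln_data_1.csv")

# Tuned call: declare column classes up front, tell it how many rows to
# expect, and switch off quote/comment scanning so the parser does less work.
df2 <- read.csv("ln_data_1.csv",
                colClasses = c("character", "integer", "integer",
                               "numeric", "character"),   # assumed types
                nrows = 4937221,          # row count reported in the benchmark
                comment.char = "",
                quote = "",               # file was written with no quoting
                stringsAsFactors = FALSE)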
Thinking about it I suppose fread could read .gz directly. Difficult, but possible. For convenience if nothing else. I'll add it to the list to investigate ... Matthew On 05.04.2013 19:59, Nathaniel Graham wrote: > As promised, I did some testing. The results (described in detail below) are mixed, but suggest that compression is useful for some large data sets, and that if this is a serious issue for someone, they need to do some careful testing before committing to anything (I know, that should be obvious, but...). Also, my results pretty clearly show that fread() crushes read.csv, regardless of whether the csv file is compressed. Nice job Matthew! > I start with Current Population Survey data from the Bureau of Labor Statistics. > The file I used get be accessed here: ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData [9] > I converted it to a csv file using StatTransfer 8 (I'm lazy), with no quoting of strings. I then compressed the csv file using 7-Zip (gzip, Normal). The resulting > files, both with 4937221 obs, 5 variables are: > > ln_data_1.csv : 133625 KB > ln_data_1.csv.gz : 17528 KB > Given the file size disparity, this should demonstrate any improvements via compression. Also, for comparison, I show fread below. I've made some > formatting changes, but changed nothing else. > > for(i in 1:5) { > t1 > print(t1) > } > user system elapsed > 12.32 0.53 12.90 > 12.51 0.44 13.00 > 12.39 0.47 12.89 > 12.36 0.55 12.96 > 12.43 0.36 12.94 > > for(i in 1:5) { > t2 > print(t2) > } > user system elapsed > 14.04 0.26 14.43 > 14.00 0.27 14.34 > 14.07 0.31 14.44 > 13.93 0.28 14.23 > 14.02 0.32 14.35 > > for(i in 1:5) { > t3 > print(t3) > } > user system elapsed > 2.89 0.04 2.94 > 2.92 0.07 2.98 > 2.88 0.03 2.95 > 2.87 0.06 2.95 > 2.91 0.03 2.95 > While the gzipped version uses less system time, total & user time has increased somewhat. The fread function from data.table is dramatically > faster. While this isn't strictly a fair comparison because fread produces > a data.table while read.csv produces a data.frame, the bias is against fread, > not for it. > Next, I produce a random 2,000,000x10 matrix, write it to csv, and then read it back into memory as a data.frame (or data.table, for fread). I again use 7-Zip for compression.The resulting files are: > test2.csv : 375086 KB > test2.csv.gz : 165477 KB > >> matr >> write.csv(matr,"test2.csv") >> t1 >> t2 >> t3 > >> t1 > user system elapsed > 165.32 0.36 166.25 >> t2 > user system elapsed > 116.24 0.16 117.08 >> t3 > user system elapsed > 17.64 0.06 17.83 > The switch to strictly floating point numbers is significant. Compression is significant improvement--about 49 seconds or about 30%--although nowhere near enough for read.csv to be comparable to fread. > Finally, I produce a 20000x1000 matrix. The resulting files are: > test1.csv : 354854 KB > test1.csv.gz : 157975 KB > > matr >> write.csv(matr,"test1.csv") >> t1 >> t2 >> t3 >> t1 > user system elapsed > 206.80 1.14 208.60 >> t2 > user system elapsed > 123.42 0.27 123.99 >> t3 > user system elapsed > 17.24 0.09 17.37 > Here, compression is an even larger win, improving by about 83 seconds or roughly 40%. The fread function is again dramatically faster, and unlike read.csv, fread's performance is similar regardless of the shape of the matrix. 
> We could create more detailed tests, varying the number of columns vs rows > and their type (strings vs integers vs floats, etc) to get better details, but the > basic result is that compression can be a noticeable improvement in performance, but a superior read algorithm trumps that. If it's feasible to > combine fread's behavior with gzip, bzip2, or xz compression, it could be a > big win for some files, but not for all of them. The advice from > http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html [10] to compress csv files appears to hold, although > it may not save much time if you have a lot of non-float values or few columns. > > ------- > Nathaniel Graham > npgraham1 at gmail.com [11] > npgraham1 at uky.edu [12] > > On Wed, Apr 3, 2013 at 4:20 PM, Nathaniel Graham wrote: > >> Subjectively, the difference seems substantial, with large loads taking >> half or a third as long. Whether I use gzip or not, CPU usage isn't >> especially high, suggesting that I'm either waiting on the hard drive >> or that the whole process is memory bound. I was all set to produce >> some timings for comparison, but I'm working from home today and >> my home machine struggles to accommodate large files---any difference >> in load times gets swamped by swapping and general flailing on the >> part of the OS (I've only got 4GB of RAM at home). Hopefully I'll get >> around to doing some timings on my work machine sometime this >> week, since I've got no issues with memory there. >> >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com [6] >> npgraham1 at uky.edu [7] >> >> On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle wrote: >> >>> Interesting. How much do you find read.csv is sped up by reading gzip'd files? >>> >>> On 02.04.2013 20:36, Nathaniel Graham wrote: >>> >>>> Thanks, but I suspect that it would take longer to setup and then remove >>>> a ramdisk than it would to use read.csv and data.table. My files are >>>> moderately large (between 200 MB and 3 GB when compressed), but not >>>> enormous; I gzip not so much to save space on disk but to speed up reads. >>>> >>>> ------- >>>> Nathaniel Graham >>>> npgraham1 at gmail.com [3] >>>> npgraham1 at uky.edu [4] >>>> >>>> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: >>>> >>>>> Hi, >>>>> >>>>> fread memory maps the entire uncompressed file and this is baked into the way it works (e.g. skipping to the beginning, middle and last 5 rows to detect column types before starting to read the rows in) and where the convenience and speed comes from. >>>>> >>>>> You could uncompress the .gz to a ramdisk first, and then fread the uncompressed file from that ramdisk, is probably the fastest way. Which should still be pretty quick and I guess unlikely much slower than anything we could build into fread (provided you use a ramdisk). >>>>> >>>>> Matthew >>>>> >>>>> On 02.04.2013 19:30, Nathaniel Graham wrote: >>>>> >>>>>> I have a moderately large csv file that's gzipped, but not in a tar >>>>>> archive, so it's "filename.csv.gz" that I want to read into a data.table. >>>>>> I'd like to use fread(), but I can't seem to make it work. I'm currently >>>>>> using the following: >>>>>> data.table(read.csv(gzfile("filename.csv.gz","r"))) >>>>>> Various combinations of gzfile, gzcon, file, readLines, and >>>>>> textConnection all produce an error (invalid input). Is there a better >>>>>> way to read in large, compressed files? 
>>>>>> >>>>>> ------- >>>>>> Nathaniel Graham >>>>>> npgraham1 at gmail.com [1] >>>>>> npgraham1 at uky.edu [2] Links: ------ [1] mailto:npgraham1 at gmail.com [2] mailto:npgraham1 at uky.edu [3] mailto:npgraham1 at gmail.com [4] mailto:npgraham1 at uky.edu [5] mailto:mdowle at mdowle.plus.com [6] mailto:npgraham1 at gmail.com [7] mailto:npgraham1 at uky.edu [8] mailto:mdowle at mdowle.plus.com [9] http://webmail.plus.net/ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData [10] http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html [11] mailto:npgraham1 at gmail.com [12] mailto:npgraham1 at uky.edu [13] mailto:npgraham1 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.bellot at gmail.com Sun Apr 7 19:01:58 2013 From: david.bellot at gmail.com (David Bellot) Date: Sun, 7 Apr 2013 18:01:58 +0100 Subject: [datatable-help] fread Message-ID: just to say: fread rocks ! Soooo fast ! That's all for today ! David -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.bellot at gmail.com Tue Apr 9 12:32:38 2013 From: david.bellot at gmail.com (David Bellot) Date: Tue, 9 Apr 2013 11:32:38 +0100 Subject: [datatable-help] aggregating data Message-ID: Hi, I have a data.table DT with one of the column named x and I other names, let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. Later on, I aggregate my DT with x like this: agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] The problem is that "x" has 331 unique values as found by length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by changing the key to "x" alone but the problem persists. My DT table has a few millions rows by the way. I'm sure I'm missing something totally obvious :-( !!!! Any idea ? Best, David -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 9 12:39:21 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 09 Apr 2013 11:39:21 +0100 Subject: [datatable-help] aggregating data In-Reply-To: References: Message-ID: That's odd. Please provide result of sessionInfo() and str(DT). Matthew On 09.04.2013 11:32, David Bellot wrote: > Hi, > > I have a data.table DT with one of the column named x and I other names, let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. > > Later on, I aggregate my DT with x like this: > agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] > > The problem is that "x" has 331 unique values as found by length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by changing the key to "x" alone but the problem persists. My DT table has a few millions rows by the way. > > I'm sure I'm missing something totally obvious :-( !!!! > > Any idea ? > Best, > David -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.bellot at gmail.com Wed Apr 10 14:50:01 2013 From: david.bellot at gmail.com (David Bellot) Date: Wed, 10 Apr 2013 13:50:01 +0100 Subject: [datatable-help] aggregating data In-Reply-To: References: Message-ID: actually I found the issue. That was not related to data.table but because I'm comparing float values, it breaks all the time if I do not round() my values before. Basically I have values like 0,1, 1.5, 0.5 etc... 
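The comparison trap being described is easy to reproduce with plain doubles; the table below is made up for illustration and is not David's data:

library(data.table)

0.1 + 0.2 == 0.3                 # FALSE: classic binary floating point
print(0.1 + 0.2, digits = 17)    # 0.30000000000000004

# A column computed in floating point can hold values that print the same
# but differ in the last bits, so unique() reports more values than expected:
DT <- data.table(x = c(0.3, 0.1 + 0.2), y = 1:2)
length(unique(DT$x))             # 2, even though both print as 0.3
DT[, .N, by = round(x, 6)]       # rounding before grouping collapses them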
I know it's bad to do that but I'm not the boss in this project ;-) Just in case other users are reading my email, I can only advise to read that again and again: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html Best, David On Tue, Apr 9, 2013 at 11:39 AM, Matthew Dowle wrote: > ** > > > > That's odd. Please provide result of sessionInfo() and str(DT). > > Matthew > > > > On 09.04.2013 11:32, David Bellot wrote: > > Hi, > > I have a data.table DT with one of the column named x and I other names, > let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. > > Later on, I aggregate my DT with x like this: > agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] > > The problem is that "x" has 331 unique values as found by > length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by > changing the key to "x" alone but the problem persists. My DT table has a > few millions rows by the way. > > I'm sure I'm missing something totally obvious :-( !!!! > > Any idea ? > Best, > David > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 10 15:25:05 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 10 Apr 2013 14:25:05 +0100 Subject: [datatable-help] aggregating data In-Reply-To: References: Message-ID: <31d67766ab3ed22a107a94238e096de7@imap.plus.net> But data.table is floating point aware. You _can_ join to floating point values, and you _can_ group by floating point values. data.table will do that within machine tolerance and take care of it for you. So this may explain why your 'agg' only had 119 rows (because data.table is doing the rounding for you automatically), but length(unique(DT$x)) had 331 ? But, there was a bug or two in this area a few versions ago, mentioned in NEWS. Which is why I asked for sessionInfo() and str(DT) suspecting you had a double column with a slightly older version of data.table. Or, there might be a new problem. If you have to round() in data.table, that doesn't sound right to me. Matthew On 10.04.2013 13:50, David Bellot wrote: > actually I found the issue. That was not related to data.table but because I'm comparing float values, it breaks all the time if I do not round() my values before. Basically I have values like 0,1, 1.5, 0.5 etc... > I know it's bad to do that but I'm not the boss in this project ;-) > > Just in case other users are reading my email, I can only advise to read that again and again: > http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html [1] > > Best, > David > > On Tue, Apr 9, 2013 at 11:39 AM, Matthew Dowle wrote: > >> That's odd. Please provide result of sessionInfo() and str(DT). >> >> Matthew >> >> On 09.04.2013 11:32, David Bellot wrote: >> >>> Hi, >>> >>> I have a data.table DT with one of the column named x and I other names, let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. >>> >>> Later on, I aggregate my DT with x like this: >>> agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] >>> >>> The problem is that "x" has 331 unique values as found by length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by changing the key to "x" alone but the problem persists. My DT table has a few millions rows by the way. >>> >>> I'm sure I'm missing something totally obvious :-( !!!! >>> >>> Any idea ? 
>>> Best, >>> David Links: ------ [1] http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html [2] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From levkowitz at dc-energy.com Wed Apr 10 16:46:24 2013 From: levkowitz at dc-energy.com (Shir Levkowitz) Date: Wed, 10 Apr 2013 10:46:24 -0400 Subject: [datatable-help] Cartesian join invalid key order - bug report Message-ID: I have encountered a bug in the Cartesian join of two data.tables, where the resulting data.table is not sorted by its full key. This is in data.table v1.8.8. Please let me know if this issue has been brought up or if there is any insight regarding it. Thank you, Shir Levkowitz ------------------------------------------------- library(data.table) ###### set up our example data tables test1 <- data.table(a=sample(1:3, 100, replace=TRUE), b=sample(1:3, 100, replace=TRUE), c=sample(1:10, 100,replace=TRUE)) setkey(test1, a,b,c) test2 <- data.table(p=sample(1:3, 100, replace=TRUE), q=sample(1:3, 100, replace=TRUE), r=sample(1:100), w=sample(1:100)) setkey(test2, p,q) ###### a cartesian join - this is where the issue arises test.join <- test1[test2,nomatch=0, allow.cartesian=TRUE] ### have a look at the key k <- key(test.join) k ### if we do a group by, we don't get the right aggregation test.gb <- test.join[,.N,by='a,b,c'] test.gb[a == 1 & b == 1 & c == 1,] ### when really what we want is: test.agg <- aggregate(r ~a+b+c, test.join, length) subset(test.agg, a == 1 & b == 1 & c == 1) ### if we set the same key, we get a warning setkeyv(test.join, k) >> Warning message: In setkeyv(test.join, k) : Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 10 17:06:55 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 10 Apr 2013 16:06:55 +0100 Subject: [datatable-help] Cartesian join invalid key order - bug report In-Reply-To: References: Message-ID: Agreed, new bug. Thanks for reporting. If you could please file on the R-Forge tracker (then you'll get auto updates) or I can file it, don't mind. I will get to the bug list eventually! Thanks, Matthew On 10.04.2013 15:46, Shir Levkowitz wrote: > I have encountered a bug in the Cartesian join of two data.tables, where the resulting data.table is not sorted by its full key. This is in data.table v1.8.8. Please let me know if this issue has been brought up or if there is any insight regarding it. > Thank you, > Shir Levkowitz > > ------------------------------------------------- > > library(data.table) > > ###### set up our example data tables > test1 > b=sample(1:3, 100, replace=TRUE), > c=sample(1:10, 100,replace=TRUE)) > setkey(test1, a,b,c) > > test2 > q=sample(1:3, 100, replace=TRUE), > r=sample(1:100), > w=sample(1:100)) > setkey(test2, p,q) > > ###### a cartesian join - this is where the issue arises > test.join > > ### have a look at the key > k > k > > ### if we do a group by, we don't get the right aggregation > test.gb > test.gb[a == 1 & b == 1 & c == 1,] > ### when really what we want is: > test.agg > subset(test.agg, a == 1 & b == 1 & c == 1) > > ### if we set the same key, we get a warning > setkeyv(test.join, k) >>> Warning message: > In setkeyv(test.join, k) : Already keyed by this key but had invalid row order, key rebuilt. 
If you didn't go under the hood please let datatable-help know so the root cause can be fixed. -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Thu Apr 18 11:28:22 2013 From: statquant at outlook.com (statquant3) Date: Thu, 18 Apr 2013 02:28:22 -0700 (PDT) Subject: [datatable-help] Use of int64 with fread Message-ID: <1366277302282-4664582.post@n4.nabble.com> Hello, a quick question. Given that there is no support yet of int64 in data.table operations. Is it really a good thing to cast automatically int64. I find myself casting back int64 all the time... Cheers -- View this message in context: http://r.789695.n4.nabble.com/Use-of-int64-with-fread-tp4664582.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Thu Apr 18 11:48:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 18 Apr 2013 10:48:09 +0100 Subject: [datatable-help] Use of int64 with fread In-Reply-To: <1366277302282-4664582.post@n4.nabble.com> References: <1366277302282-4664582.post@n4.nabble.com> Message-ID: What do you cast integer64 back to? There is some support, you just can't have integer64 in keys for example yet. They can be useful and work as a value column don't they? (I don't have much need for integer64 myself, so I don't necessarily know.) I've added use.integer64 = TRUE as a global option and argument to fread (not yet committed). That's just a way to turn off the integer64 feature basically, so they'll be read as numeric as read.csv does. Btw, after some to and fro I'm thinking colClasses (when type character vector) would work the same as read.csv, but if type list, then you could pass sets of columns by number or name; i.e., two valid ways to use colClasses in fread would be : colClasses = c(colC="character",colD="character",colE="character",colQ="numeric") # as read.csv or colClasses = list(character=3:6, numeric="colQ") To drop columns use "NULL" in colClasses just as read.csv. To select columns, 'select' may be either character or numeric vector. Sound ok? Matthew On 18.04.2013 10:28, statquant3 wrote: > Hello, > a quick question. > Given that there is no support yet of int64 in data.table operations. > Is it really a good thing to cast automatically int64. I find myself > casting back int64 all the time... > > Cheers > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/Use-of-int64-with-fread-tp4664582.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Thu Apr 18 16:03:18 2013 From: statquant at outlook.com (stat quant) Date: Thu, 18 Apr 2013 16:03:18 +0200 Subject: [datatable-help] Use of int64 with fread In-Reply-To: References: <1366277302282-4664582.post@n4.nabble.com> Message-ID: Hi Matthew, I cast int64 to character, I need to use those as keys but as you pinpoints it is not yet supported, because of it I use strings. I guess numeric (aka double) could be used but comparing numerics can be problematic too. 
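A short base-R illustration of why plain doubles are awkward for full 64-bit IDs (bit64 is the package fread uses for its integer64 columns); the ID value is the one discussed just below:

# Doubles carry 53 bits of mantissa, so distinct integers above 2^53 collapse:
2^53 + 1 == 2^53                 # TRUE

x <- 144454938488621444          # parsed as a double before anything else sees it
sprintf("%.0f", x)               # "144454938488621440" -- already rounded

# bit64::integer64 keeps all 64 bits; build exact values from strings:
library(bit64)
y <- as.integer64("144454938488621444")
y                                # 144454938488621444, held exactly

So a bare numeric literal in a test like DT[myInt64 == 144454938488621444] is rounded before the comparison even happens; writing the constant as as.integer64("144454938488621444") keeps it exact.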
The other usecase is to select stuff, for this case either it is not well suited as doing DT[myInt64 == 144454938488621444] gives me myInt64 equals to 144454938488621440, not sure if the problem is how int64 are displayed or if operator == is different for int64 but that's too much of a problem to me... For the fread that's AWESOME news, It will be usefull for datetime columns, we'll be able to use fasttime on datetime columns! Your way (the OR case) looks better as if you know the first way you can go to the second way in a few lines of code. (May be even internally [hint hint] so we loosers don't have to do anything !!!) 2013/4/18, Matthew Dowle : > > What do you cast integer64 back to? There is some support, you just > can't have integer64 in keys for example yet. They can be useful and > work as a value column don't they? (I don't have much need for > integer64 myself, so I don't necessarily know.) > > I've added use.integer64 = TRUE as a global option and argument to > fread (not yet committed). That's just a way to turn off the integer64 > feature basically, so they'll be read as numeric as read.csv does. > > Btw, after some to and fro I'm thinking colClasses (when type character > vector) would work the same as read.csv, but if type list, then you > could pass sets of columns by number or name; i.e., two valid ways to > use colClasses in fread would be : > > colClasses = > c(colC="character",colD="character",colE="character",colQ="numeric") # > as read.csv > or > colClasses = list(character=3:6, numeric="colQ") > > To drop columns use "NULL" in colClasses just as read.csv. To select > columns, 'select' may be either character or numeric vector. > > Sound ok? > > Matthew > > > On 18.04.2013 10:28, statquant3 wrote: >> Hello, >> a quick question. >> Given that there is no support yet of int64 in data.table operations. >> Is it really a good thing to cast automatically int64. I find myself >> casting back int64 all the time... >> >> Cheers >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/Use-of-int64-with-fread-tp4664582.html >> Sent from the datatable-help mailing list archive at Nabble.com. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From eduard.antonyan at gmail.com Fri Apr 19 21:54:38 2013 From: eduard.antonyan at gmail.com (eddi) Date: Fri, 19 Apr 2013 12:54:38 -0700 (PDT) Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" Message-ID: <1366401278742-4664770.post@n4.nabble.com> Matthew Dowle suggested I put this up for a discussion here. This is continuation of the discussion that started on SO and resulted in FR2696 (I recommend reading the latter first, as it's much more clear). My case for the change boils down to the following: I believe *d[i, j, by = b]* should be always understood to mean *"take d, apply i, return j by b"* instead of the much more complicated current behavior, which is: *"take d, apply i, if i was not a merge, return j by b, if i was a merge, if no by, then return j by key, else if b and b == key, complain and return j by b, else return j by b"* I believe, while disruptive to some current users, this will make data.table much more user-friendly for any future users (one piece of evidence I would suggest for this, besides my plea, is that FAQs 1.13-1.14 (and part of 1.12) would become completely unnecessary). 
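A compact illustration of the by-without-by behaviour being debated, on a toy table under 1.8.x semantics (newer versions of data.table later made the grouped form explicit via by = .EACHI):

library(data.table)

X <- data.table(id = c("a","a","b","b"), v = 1:4, key = "id")
Y <- data.table(id = c("a","b"), key = "id")

# by-without-by: j runs once per row of Y, grouped by the join key,
# even though no 'by' was typed:
X[Y, sum(v)]          # two rows: a -> 3, b -> 7

# Join first, then evaluate j over the whole joined result -- one total:
X[Y][, sum(v)]        # 10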
This is regarding syntax only, and I do NOT propose any changes to underlying behavior, in particular the speed-up when you do a "by" by the key of the join should stay (and should be done iff by=key is present). -- View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770.html Sent from the datatable-help mailing list archive at Nabble.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.nelson at sydney.edu.au Sat Apr 20 01:07:10 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Fri, 19 Apr 2013 23:07:10 +0000 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <1366401278742-4664770.post@n4.nabble.com> References: <1366401278742-4664770.post@n4.nabble.com> Message-ID: <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> I think this proposed change is completely unnecessary. That a function may behave differently is entirely consistent with the various s3 /s4 methods system (although neither are used here). I think that drop = TRUE when implemented will take care of dropping join columns. On 20/04/2013, at 5:54 AM, "eddi" > wrote: Matthew Dowle suggested I put this up for a discussion here. This is continuation of the discussion that started on SO and resulted in FR2696 (I recommend reading the latter first, as it's much more clear). My case for the change boils down to the following: I believe d[i, j, by = b] should be always understood to mean "take d, apply i, return j by b" instead of the much more complicated current behavior, which is: "take d, apply i, if i was not a merge, return j by b, if i was a merge, if no by, then return j by key, else if b and b == key, complain and return j by b, else return j by b" I believe, while disruptive to some current users, this will make data.table much more user-friendly for any future users (one piece of evidence I would suggest for this, besides my plea, is that FAQs 1.13-1.14 (and part of 1.12) would become completely unnecessary). This is regarding syntax only, and I do NOT propose any changes to underlying behavior, in particular the speed-up when you do a "by" by the key of the join should stay (and should be done iff by=key is present). ________________________________ View this message in context: changing data.table by-without-by syntax to require a "by" Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From erikriverson at gmail.com Mon Apr 22 14:27:52 2013 From: erikriverson at gmail.com (Erik Iverson) Date: Mon, 22 Apr 2013 07:27:52 -0500 Subject: [datatable-help] assigning POSIXlt object to a data.table column Message-ID: Hello, Hope all is well with everyone, just wondering if this is a data.table bug or a bug in my understanding: > DT <- data.table(x = 1) > DT$test <- as.POSIXlt(Sys.Date()) Warning message: In `[<-.data.table`(x, j = name, value = value) : Supplied 9 items to be assigned to 1 items of column 'test' (8 unused) Thanks! 
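The "9 items" in that warning come from what a POSIXlt object is underneath -- a plain list of nine components -- which base R shows directly:

x <- as.POSIXlt(Sys.Date())
unclass(x)            # a list: sec, min, hour, mday, mon, year, wday, yday, isdst
length(unclass(x))    # 9 on the R build in this thread -- hence "Supplied 9 items"

data.table's column assignment sees a 9-element list being pushed into a 1-row column, which is exactly what the warning reports.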
--Erik sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] wordcloud_2.4 RColorBrewer_1.0-5 Rcpp_0.10.3 tm_0.5-8.3 [5] ggplot2_0.9.3.1 zoo_1.7-9 data.table_1.8.8 XML_3.96-1.1 loaded via a namespace (and not attached): [1] colorspace_1.2-2 compiler_3.0.0 dichromat_2.0-0 digest_0.6.3 [5] grid_3.0.0 gtable_0.1.2 labeling_0.1 lattice_0.20-15 [9] MASS_7.3-26 munsell_0.4 plyr_1.8 proto_0.3-10 [13] reshape2_1.2.2 scales_0.2.3 slam_0.1-28 stringr_0.6.2 [17] tools_3.0.0 From mdowle at mdowle.plus.com Mon Apr 22 14:40:25 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 22 Apr 2013 13:40:25 +0100 Subject: [datatable-help] assigning POSIXlt object to a data.table column In-Reply-To: References: Message-ID: <91e23509e5f826c38e9519d043c7d616@imap.plus.net> Hi, It's burried in the Notes section of ?data.table : " POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see NEWS for 1.6.3, and IDateTime instead. IDateTime has methods to convert to and from POSIXlt. " The no-support for POSIXlt is set in stone, but the advice there to use IDateTime may not be the best. Bascially - anything but POSIXlt! Btw, please don't assign to DT columns using DT$test<-. See ?":=". Matthew On 22.04.2013 13:27, Erik Iverson wrote: > Hello, > > Hope all is well with everyone, just wondering if this is a > data.table > bug or a bug in my understanding: > >> DT <- data.table(x = 1) >> DT$test <- as.POSIXlt(Sys.Date()) > Warning message: > In `[<-.data.table`(x, j = name, value = value) : > Supplied 9 items to be assigned to 1 items of column 'test' (8 > unused) > > Thanks! 
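Following the advice above (any column type but POSIXlt, and := rather than $<- for assignment), a minimal sketch of the usual alternatives; the column names are illustrative:

library(data.table)

DT <- data.table(x = 1)
DT[, test := as.POSIXct(Sys.Date())]   # POSIXct: one number per value, works fine
DT[, day  := as.IDate(Sys.Date())]     # IDate: integer-backed date from data.table
DT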
> --Erik > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] wordcloud_2.4 RColorBrewer_1.0-5 Rcpp_0.10.3 > tm_0.5-8.3 > [5] ggplot2_0.9.3.1 zoo_1.7-9 data.table_1.8.8 > XML_3.96-1.1 > > loaded via a namespace (and not attached): > [1] colorspace_1.2-2 compiler_3.0.0 dichromat_2.0-0 digest_0.6.3 > [5] grid_3.0.0 gtable_0.1.2 labeling_0.1 > lattice_0.20-15 > [9] MASS_7.3-26 munsell_0.4 plyr_1.8 proto_0.3-10 > [13] reshape2_1.2.2 scales_0.2.3 slam_0.1-28 stringr_0.6.2 > [17] tools_3.0.0 > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Mon Apr 22 15:04:04 2013 From: statquant at outlook.com (statquant3) Date: Mon, 22 Apr 2013 06:04:04 -0700 (PDT) Subject: [datatable-help] Millis in IDateTime Message-ID: <1366635844953-4664970.post@n4.nabble.com> Hello, sorry to come back again on this, but I realized that int is good enough for millisecond resolution. As ITime handles times < 24h = 86400000L. Would that be usefull/easy to modify ? Usually even in finance millis are enough. Cheers -- View this message in context: http://r.789695.n4.nabble.com/Millis-in-IDateTime-tp4664970.html Sent from the datatable-help mailing list archive at Nabble.com. From eduard.antonyan at gmail.com Mon Apr 22 17:17:59 2013 From: eduard.antonyan at gmail.com (eddi) Date: Mon, 22 Apr 2013 08:17:59 -0700 (PDT) Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> Message-ID: <1366643879137-4664990.post@n4.nabble.com> I think you're missing the point Michael. Just because it's possible to do it the way it's done now, doesn't mean that's the best way, as I've tried to argue in the OP. I don't think you've addressed the issue of unnecessary complexity pointed out in OP. -- View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html Sent from the datatable-help mailing list archive at Nabble.com. From gsee000 at gmail.com Tue Apr 23 20:46:12 2013 From: gsee000 at gmail.com (G See) Date: Tue, 23 Apr 2013 13:46:12 -0500 Subject: [datatable-help] Indexing by a logical column Message-ID: Hi, Is the following expected behavior? DT = data.table(x=rep(c("a","b","c"),each=3), TF=c(TRUE,FALSE,TRUE)) #All of these return what I expect: DT[c(TRUE, FALSE, TRUE)] DT[TF==TRUE] DT[DT$TF] #Why doesn't this? DT[TF] #Error in eval(expr, envir, enclos) : object 'TF' not found Thanks, Garrett From mdowle at mdowle.plus.com Tue Apr 23 21:12:23 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 23 Apr 2013 20:12:23 +0100 Subject: [datatable-help] Indexing by a logical column In-Reply-To: References: Message-ID: <99876779ed2f08b323f554266d151a8b@imap.plus.net> Hi, Yes expected. 
From ?data.table: "Advanced: When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." Subsetting by a logical column is the only example I can think of where this is confusing. But we make use of this feature quite a lot e.g. TMP=list(...);DT[TMP] safe in the knowledge that DT[TMP] won't start to fail if DT in future has a column called TMP. When I have a logical column boolCol I wrap with (): DT[(boolCol)]. This avoids the memory allocation and scan of ==TRUE, and avoids the variable name repetition of DT[DT$boolCol] Matthew On 23.04.2013 19:46, G See wrote: > Hi, > > Is the following expected behavior? > > DT = data.table(x=rep(c("a","b","c"),each=3), TF=c(TRUE,FALSE,TRUE)) > > #All of these return what I expect: > > DT[c(TRUE, FALSE, TRUE)] > DT[TF==TRUE] > DT[DT$TF] > > #Why doesn't this? > DT[TF] > #Error in eval(expr, envir, enclos) : object 'TF' not found > > Thanks, > Garrett > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From gsee000 at gmail.com Tue Apr 23 21:16:38 2013 From: gsee000 at gmail.com (G See) Date: Tue, 23 Apr 2013 14:16:38 -0500 Subject: [datatable-help] Indexing by a logical column In-Reply-To: <99876779ed2f08b323f554266d151a8b@imap.plus.net> References: <99876779ed2f08b323f554266d151a8b@imap.plus.net> Message-ID: Thank you. Very helpful, as always. Garrett On Tue, Apr 23, 2013 at 2:12 PM, Matthew Dowle wrote: > > Hi, > Yes expected. From ?data.table: > "Advanced: When i is a single variable name, it is not considered an > expression of column names and is instead evaluated in calling scope." > Subsetting by a logical column is the only example I can think of where this > is confusing. But we make use of this feature quite a lot e.g. > TMP=list(...);DT[TMP] > safe in the knowledge that DT[TMP] won't start to fail if DT in future has a > column called TMP. > When I have a logical column boolCol I wrap with (): DT[(boolCol)]. This > avoids the memory allocation and scan of ==TRUE, and avoids the variable > name repetition of DT[DT$boolCol] > Matthew > > > > On 23.04.2013 19:46, G See wrote: >> >> Hi, >> >> Is the following expected behavior? >> >> DT = data.table(x=rep(c("a","b","c"),each=3), TF=c(TRUE,FALSE,TRUE)) >> >> #All of these return what I expect: >> >> DT[c(TRUE, FALSE, TRUE)] >> DT[TF==TRUE] >> DT[DT$TF] >> >> #Why doesn't this? >> DT[TF] >> #Error in eval(expr, envir, enclos) : object 'TF' not found >> >> Thanks, >> Garrett >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From sds at gnu.org Tue Apr 23 23:41:59 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 17:41:59 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= Message-ID: <87bo94hql4.fsf@gnu.org> Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? 
Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. From sds at gnu.org Tue Apr 23 23:55:58 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 17:55:58 -0400 Subject: [datatable-help] cedta decided 'igraph' wasn't data.table aware Message-ID: <8761zchpxt.fsf@gnu.org> Hi, what does this mean? --8<---------------cut here---------------start------------->8--- > graph <- graph.data.frame(merged[!v,], vertices=ve, directed=FALSE) cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware --8<---------------cut here---------------end--------------->8--- `merged' and `ve' are `data.table' objects, and thus `data.frame' objects too. the igraph function graph.data.frame accepts data.frame. other than the messages (controlled by datatable.verbose), the code appears to work. Bill Dunlap kindly explained that >> cedta ("Calling Environment is Data.Table Aware") >> is a private function in package:data.table could you please offer more detail? what doe the message mean? Thanks. -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://www.memritv.org http://memri.org http://ffii.org http://think-israel.org http://palestinefacts.org http://truepeace.org Perl: all stupidities of UNIX in one. From sds at gnu.org Tue Apr 23 23:57:36 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 17:57:36 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= Message-ID: <874newhpv3.fsf@gnu.org> Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. From sds at gnu.org Wed Apr 24 00:11:01 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 18:11:01 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <874newhpv3.fsf@gnu.org> (Sam Steingold's message of "Tue, 23 Apr 2013 17:57:36 -0400") References: <874newhpv3.fsf@gnu.org> Message-ID: <87zjwogaoa.fsf@gnu.org> I apologize for double posting - my first message appeared to have been rejected. 
> * Sam Steingold [2013-04-23 17:57:36 -0400]: > > Hi, > I got this: > >> dt <- frame[, lapply(.SD, last) ,by=id] > Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 > Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' > Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? > Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> >> > > the help for last does mention xts, but I don't have it installed. > do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://openvotingconsortium.org http://pmw.org.il http://camera.org http://dhimmi.com http://think-israel.org Don't use force -- get a bigger hammer. From michael.nelson at sydney.edu.au Wed Apr 24 02:41:33 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Wed, 24 Apr 2013 00:41:33 +0000 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: <874newhpv3.fsf@gnu.org> References: <874newhpv3.fsf@gnu.org> Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> >From the help for data.table::last ?If x is a data.table, the last row as a one row data.table. Otherwise, whatever xts::last returns. calling lapply(.SD, last) will call last on each column in .SD. Columns within a data.table aren't data.tables thus `xts::last` is called. xts is on the suggests list for data.table, you could use install.packages('data.table, dependencies = 'Suggests') or manually installed xts. OR frame[, last(.SD), by = id] would work without needing xts as would frame[, .SD[.N], by = id] or without having to construct .SD (which is time consuming) frame[frame[, .I[.N],by = id]$V1] or setkey(frame, id) frame[unique(id), mult = 'last'] ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam Steingold [sds at gnu.org] Sent: Wednesday, 24 April 2013 7:57 AM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] there is no package called ?xts? Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. 
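For concreteness, a small self-contained sketch of the .SD[.N] and .I[.N] approaches listed above (toy data; the column names "id" and "val" are invented for illustration):

library(data.table)
frame <- data.table(id = c("a", "a", "b", "b", "b"), val = 1:5)

# last row of each group, building .SD for every group
frame[, .SD[.N], by = id]

# same rows, but only the row numbers are computed per group and the
# subset is done once at the end (usually noticeably faster)
frame[frame[, .I[.N], by = id]$V1]

Both calls return one row per id and neither of them touches xts.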
_______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From eduard.antonyan at gmail.com Wed Apr 24 03:16:42 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Tue, 23 Apr 2013 20:16:42 -0500 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <-7825510419294116677@unknownmsgid> This is great, a lot of cool stuff in one post! On Apr 23, 2013, at 7:42 PM, Michael Nelson wrote: > From the help for data.table::last > > If x is a data.table, the last row as a one row data.table. Otherwise, whatever xts::last returns. > > > calling lapply(.SD, last) will call last on each column in .SD. Columns within a data.table aren't data.tables thus `xts::last` is called. xts is on the suggests list for data.table, > > you could use > > install.packages('data.table, dependencies = 'Suggests') > > or manually installed xts. > > OR > > frame[, last(.SD), by = id] > > would work without needing xts > > as would > > frame[, .SD[.N], by = id] > > or without having to construct .SD (which is time consuming) > > frame[frame[, .I[.N],by = id]$V1] > > or > > setkey(frame, id) > > frame[unique(id), mult = 'last'] > > ________________________________________ > From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam Steingold [sds at gnu.org] > Sent: Wednesday, 24 April 2013 7:57 AM > To: datatable-help at lists.r-forge.r-project.org > Subject: [datatable-help] there is no package called ?xts? > > Hi, > I got this: > >> dt <- frame[, lapply(.SD, last) ,by=id] > Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 > Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' > Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? > Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > > the help for last does mention xts, but I don't have it installed. > do I need to? > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 > http://www.childpsy.net/ http://ffii.org http://think-israel.org > http://mideasttruth.com http://memri.org http://camera.org > Ernqvat guvf ivbyngrf QZPN. 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Wed Apr 24 10:54:21 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 09:54:21 +0100 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <-7825510419294116677@unknownmsgid> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> <-7825510419294116677@unknownmsgid> Message-ID: <78c42b8c3e05690f5e48e9e35b54888b@imap.plus.net> Indeed! Great stuff Michael. I suppose that error ("Error in loadNamespace(name) : there is no package called ?xts?") is yet another valid bug (sigh), if you wouldn't mind filing please Sam. Thanks, Matthew On 24.04.2013 02:16, Eduard Antonyan wrote: > This is great, a lot of cool stuff in one post! > > On Apr 23, 2013, at 7:42 PM, Michael Nelson > wrote: > >> From the help for data.table::last >> >> If x is a data.table, the last row as a one row data.table. >> Otherwise, whatever xts::last returns. >> >> >> calling lapply(.SD, last) will call last on each column in .SD. >> Columns within a data.table aren't data.tables thus `xts::last` is >> called. xts is on the suggests list for data.table, >> >> you could use >> >> install.packages('data.table, dependencies = 'Suggests') >> >> or manually installed xts. >> >> OR >> >> frame[, last(.SD), by = id] >> >> would work without needing xts >> >> as would >> >> frame[, .SD[.N], by = id] >> >> or without having to construct .SD (which is time consuming) >> >> frame[frame[, .I[.N],by = id]$V1] >> >> or >> >> setkey(frame, id) >> >> frame[unique(id), mult = 'last'] >> >> ________________________________________ >> From: datatable-help-bounces at lists.r-forge.r-project.org >> [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam >> Steingold [sds at gnu.org] >> Sent: Wednesday, 24 April 2013 7:57 AM >> To: datatable-help at lists.r-forge.r-project.org >> Subject: [datatable-help] there is no package called ?xts? >> >> Hi, >> I got this: >> >>> dt <- frame[, lapply(.SD, last) ,by=id] >> Finding groups (bysameorder=TRUE) ... done in 0.126secs. >> bysameorder=TRUE and o__ is length 0 >> Optimized j from 'lapply(.SD, last)' to 'list(last(country), >> last(language), last(browser), last(platform), last(uatype), >> last(behavior))' >> Starting dogroups ... Error in loadNamespace(name) : there is no >> package called ?xts? >> Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> >> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne >> -> >> >> the help for last does mention xts, but I don't have it installed. >> do I need to? >> >> -- >> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X >> 11.0.11300000 >> http://www.childpsy.net/ http://ffii.org http://think-israel.org >> http://mideasttruth.com http://memri.org http://camera.org >> Ernqvat guvf ivbyngrf QZPN. 
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Wed Apr 24 11:07:12 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 10:07:12 +0100 Subject: [datatable-help] cedta decided 'igraph' wasn't data.table aware In-Reply-To: <8761zchpxt.fsf@gnu.org> References: <8761zchpxt.fsf@gnu.org> Message-ID: <714d75225a91130b39a1af84f2250232@imap.plus.net> Hi, Oh dear. On first glance, you are having a painful time with data.table! But if you have verbose=TRUE then this seems ok. I think of 'verbose' more like 'trace'. There aren't different levels of verbosity, yet, although that has been suggested before and is on the list to do. So internal tracing messages, notes, progress etc is all mixed in to one 'verbose=TRUE' level at the moment. I very rarely use verbose=TRUE. I just switch it on when debugging. You and others are right to switch it on and use it for tuning, that is the idea, but it's too verbose at the moment as you've found. Anyway, cedta did the right thing here, since 'igraph' indeed is not data.table aware. Setting verbose=FALSE should make the trace message go away. More info about cedta on the single result returned by "[data.table] cedta" : http://stackoverflow.com/questions/10527072/using-data-table-package-inside-my-own-package/10529888#10529888 (have just reread that and it's still correct). Matthew On 23.04.2013 22:55, Sam Steingold wrote: > Hi, what does this mean? > > --8<---------------cut here---------------start------------->8--- >> graph <- graph.data.frame(merged[!v,], vertices=ve, directed=FALSE) > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > --8<---------------cut here---------------end--------------->8--- > > `merged' and `ve' are `data.table' objects, and thus `data.frame' > objects too. > the igraph function graph.data.frame accepts data.frame. > other than the messages (controlled by datatable.verbose), the code > appears to work. > > Bill Dunlap kindly explained that >>> cedta ("Calling Environment is Data.Table Aware") >>> is a private function in package:data.table > > could you please offer more detail? > what doe the message mean? > > Thanks. 
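A minimal sketch of the point above, assuming only that the global verbose option is what produced those trace lines (as described): switching it off silences them without changing any results.

library(data.table)
getOption("datatable.verbose")        # check the current setting
options(datatable.verbose = FALSE)    # quiet: no cedta/trace messages
# options(datatable.verbose = TRUE)   # switch back on when debugging or tuning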
From eduard.antonyan at gmail.com Wed Apr 24 16:47:12 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 09:47:12 -0500 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: @Michael, in the last expression, you probably forgot a J: frame[J(unique(id)), mult = "last"] On Tue, Apr 23, 2013 at 7:41 PM, Michael Nelson < michael.nelson at sydney.edu.au> wrote: > From the help for data.table::last > > If x is a data.table, the last row as a one row data.table. Otherwise, > whatever xts::last returns. > > > calling lapply(.SD, last) will call last on each column in .SD. Columns > within a data.table aren't data.tables thus `xts::last` is called. xts is > on the suggests list for data.table, > > you could use > > install.packages('data.table, dependencies = 'Suggests') > > or manually installed xts. > > OR > > frame[, last(.SD), by = id] > > would work without needing xts > > as would > > frame[, .SD[.N], by = id] > > or without having to construct .SD (which is time consuming) > > frame[frame[, .I[.N],by = id]$V1] > > or > > setkey(frame, id) > > frame[unique(id), mult = 'last'] > > ________________________________________ > From: datatable-help-bounces at lists.r-forge.r-project.org [ > datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam > Steingold [sds at gnu.org] > Sent: Wednesday, 24 April 2013 7:57 AM > To: datatable-help at lists.r-forge.r-project.org > Subject: [datatable-help] there is no package called ?xts? > > Hi, > I got this: > > > dt <- frame[, lapply(.SD, last) ,by=id] > Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE > and o__ is length 0 > Optimized j from 'lapply(.SD, last)' to 'list(last(country), > last(language), last(browser), last(platform), last(uatype), > last(behavior))' > Starting dogroups ... Error in loadNamespace(name) : there is no package > called ?xts? > Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace > -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > > > > the help for last does mention xts, but I don't have it installed. > do I need to? > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X > 11.0.11300000 > http://www.childpsy.net/ http://ffii.org http://think-israel.org > http://mideasttruth.com http://memri.org http://camera.org > Ernqvat guvf ivbyngrf QZPN. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From s_milberg at hotmail.com Wed Apr 24 20:02:15 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Wed, 24 Apr 2013 14:02:15 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <1366643879137-4664990.post@n4.nabble.com> References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com> Message-ID: I'd agree with Eduard, although it's probably too late to change behavior now. Maybe for data.table.2? Eduard's proposal seems more closely aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if requested). S. > Date: Mon, 22 Apr 2013 08:17:59 -0700 > From: eduard.antonyan at gmail.com > To: datatable-help at lists.r-forge.r-project.org > Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" > > I think you're missing the point Michael. Just because it's possible to do it > the way it's done now, doesn't mean that's the best way, as I've tried to > argue in the OP. I don't think you've addressed the issue of unnecessary > complexity pointed out in OP. > > > > -- > View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From sds at gnu.org Wed Apr 24 21:18:03 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 15:18:03 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <78c42b8c3e05690f5e48e9e35b54888b@imap.plus.net> (Matthew Dowle's message of "Wed, 24 Apr 2013 09:54:21 +0100") References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> <-7825510419294116677@unknownmsgid> <78c42b8c3e05690f5e48e9e35b54888b@imap.plus.net> Message-ID: <87sj2fg2l0.fsf@gnu.org> > * Matthew Dowle [2013-04-24 09:54:21 +0100]: > > Indeed! Great stuff Michael. yep! > I suppose that error ("Error in loadNamespace(name) : there is no > package called ?xts?") is yet another valid bug (sigh), if you wouldn't > mind filing please Sam. https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2728&group_id=240&atid=975 -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://openvotingconsortium.org http://dhimmi.com http://memri.org http://pmw.org.il http://www.memritv.org http://jihadwatch.org If abortion is murder, then oral sex is cannibalism. From sds at gnu.org Wed Apr 24 21:25:46 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 15:25:46 -0400 Subject: [datatable-help] head.data.table does not support negative arguments? Message-ID: <87li87g285.fsf@gnu.org> is this a bug? 
--8<---------------cut here---------------start------------->8--- > head(1:10,-3) [1] 1 2 3 4 5 6 7 > head(data.frame(a=1:5,b=5:9),-2) a b 1 1 5 2 2 6 3 3 7 > head(data.table(a=1:5,b=5:9),-2) Error in seq_len(min(n, nrow(x))) : argument must be coercible to non-negative integer Calls: head -> head.data.table --8<---------------cut here---------------end--------------->8--- -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://camera.org http://openvotingconsortium.org http://truepeace.org http://pmw.org.il http://palestinefacts.org Children fear dentists because of pain, adults - because of bills. From sds at gnu.org Wed Apr 24 21:26:57 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 15:26:57 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> (Michael Nelson's message of "Wed, 24 Apr 2013 00:41:33 +0000") References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <87haivg266.fsf@gnu.org> > * Michael Nelson [2013-04-24 00:41:33 +0000]: > > frame[, .SD[.N], by = id] I tried --8<---------------cut here---------------start------------->8--- dt <- frame[, .SD[1] ,by=id] --8<---------------cut here---------------end--------------->8--- (I don't care whether I take first or last, see another message). and I got the note --8<---------------cut here---------------start------------->8--- Finding groups (bysameorder=TRUE) ... done in 0.121secs. bysameorder=TRUE and o__ is length 0 Optimization is on but j left unchanged as '.SD[1]' Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). --8<---------------cut here---------------end--------------->8--- and indeed it runs unbelievably slow (as if I were using data.table) thanks a lot for your detailed reply! -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://www.memritv.org http://palestinefacts.org http://iris.org.il http://think-israel.org non-smoking section in a restaurant == non-peeing section in a swimming pool From mdowle at mdowle.plus.com Wed Apr 24 21:28:42 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 20:28:42 +0100 Subject: [datatable-help] =?utf-8?q?head=2Edata=2Etable_does_not_support_n?= =?utf-8?q?egative_arguments=3F?= In-Reply-To: <87li87g285.fsf@gnu.org> References: <87li87g285.fsf@gnu.org> Message-ID: Yes, known and already filed. Thanks. Matthew On 24.04.2013 20:25, Sam Steingold wrote: > is this a bug? 
> --8<---------------cut here---------------start------------->8--- >> head(1:10,-3) > [1] 1 2 3 4 5 6 7 >> head(data.frame(a=1:5,b=5:9),-2) > a b > 1 1 5 > 2 2 6 > 3 3 7 >> head(data.table(a=1:5,b=5:9),-2) > Error in seq_len(min(n, nrow(x))) : > argument must be coercible to non-negative integer > Calls: head -> head.data.table > --8<---------------cut here---------------end--------------->8--- From mdowle at mdowle.plus.com Wed Apr 24 21:34:12 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 20:34:12 +0100 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <87haivg266.fsf@gnu.org> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> <87haivg266.fsf@gnu.org> Message-ID: Good. Well, all correct, known and expected then. There is a feature request to optimize .SD[i] in DT[,.SD[i],by=...] to not actually create the whole .SD just to get the first or last row (or indeed any subset). Since that's the most natural syntax. I often would like that myself. In the meatime the other suggestions from Michael should be fast. As he said: the one using .I[.N] should be fast. Matthew On 24.04.2013 20:26, Sam Steingold wrote: >> * Michael Nelson [2013-04-24 00:41:33 >> +0000]: >> >> frame[, .SD[.N], by = id] > > I tried > --8<---------------cut here---------------start------------->8--- > dt <- frame[, .SD[1] ,by=id] > --8<---------------cut here---------------end--------------->8--- > (I don't care whether I take first or last, see another message). > and I got the note > --8<---------------cut here---------------start------------->8--- > Finding groups (bysameorder=TRUE) ... done in 0.121secs. > bysameorder=TRUE and o__ is length 0 > Optimization is on but j left unchanged as '.SD[1]' > Starting dogroups ... The result of j is a named list. It's very > inefficient to create the same names over and over again for each > group. When j=list(...), any names are detected, removed and put back > after grouping has completed, for efficiency. Using j=transform(), > for > example, prevents that speedup (consider changing to :=). > --8<---------------cut here---------------end--------------->8--- > and indeed it runs unbelievably slow (as if I were using data.table) > > thanks a lot for your detailed reply! From sds at gnu.org Wed Apr 24 22:29:24 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 16:29:24 -0400 Subject: [datatable-help] variable column names Message-ID: <87a9onfza3.fsf@gnu.org> What do I do if I want to operate on several columns? E.g., --8<---------------cut here---------------start------------->8--- length(names(dt)) = 25 myvars = c("col1","col4","col7") myid = "user" setkeyv(dt,myid) --8<---------------cut here---------------end--------------->8--- and I want to summarize columns in myvars by myid. the point is that nowhere in the code the literal "col4" or "user" may appear. Thanks. -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://palestinefacts.org http://openvotingconsortium.org http://camera.org http://jihadwatch.org The difference between genius and stupidity is that genius has its limits. 
From eduard.antonyan at gmail.com Wed Apr 24 22:35:41 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 15:35:41 -0500 Subject: [datatable-help] variable column names In-Reply-To: <87a9onfza3.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> Message-ID: with = FALSE will let you use literal column names On Wed, Apr 24, 2013 at 3:29 PM, Sam Steingold wrote: > What do I do if I want to operate on several columns? > E.g., > --8<---------------cut here---------------start------------->8--- > length(names(dt)) = 25 > myvars = c("col1","col4","col7") > myid = "user" > setkeyv(dt,myid) > --8<---------------cut here---------------end--------------->8--- > and I want to summarize columns in myvars by myid. > the point is that nowhere in the code the literal "col4" or "user" may > appear. > > Thanks. > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X > 11.0.11300000 > http://www.childpsy.net/ http://ffii.org http://palestinefacts.org > http://openvotingconsortium.org http://camera.org http://jihadwatch.org > The difference between genius and stupidity is that genius has its limits. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 24 22:50:34 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 21:50:34 +0100 Subject: [datatable-help] variable column names In-Reply-To: References: <87a9onfza3.fsf@gnu.org> Message-ID: Or: DT[,lapply(.SD,sum),by=...,.SDcols=myvars] > with = FALSE will let you use literal column names > > > On Wed, Apr 24, 2013 at 3:29 PM, Sam Steingold wrote: > >> What do I do if I want to operate on several columns? >> E.g., >> --8<---------------cut here---------------start------------->8--- >> length(names(dt)) = 25 >> myvars = c("col1","col4","col7") >> myid = "user" >> setkeyv(dt,myid) >> --8<---------------cut here---------------end--------------->8--- >> and I want to summarize columns in myvars by myid. >> the point is that nowhere in the code the literal "col4" or "user" may >> appear. >> >> Thanks. >> >> -- >> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X >> 11.0.11300000 >> http://www.childpsy.net/ http://ffii.org http://palestinefacts.org >> http://openvotingconsortium.org http://camera.org http://jihadwatch.org >> The difference between genius and stupidity is that genius has its >> limits. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Wed Apr 24 22:54:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 21:54:17 +0100 Subject: [datatable-help] variable column names In-Reply-To: References: <87a9onfza3.fsf@gnu.org> Message-ID: <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> where ... 
is eval(myid) iigc > Or: > DT[,lapply(.SD,sum),by=...,.SDcols=myvars] > >> with = FALSE will let you use literal column names >> >> >> On Wed, Apr 24, 2013 at 3:29 PM, Sam Steingold wrote: >> >>> What do I do if I want to operate on several columns? >>> E.g., >>> --8<---------------cut here---------------start------------->8--- >>> length(names(dt)) = 25 >>> myvars = c("col1","col4","col7") >>> myid = "user" >>> setkeyv(dt,myid) >>> --8<---------------cut here---------------end--------------->8--- >>> and I want to summarize columns in myvars by myid. >>> the point is that nowhere in the code the literal "col4" or "user" may >>> appear. >>> >>> Thanks. >>> >>> -- >>> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X >>> 11.0.11300000 >>> http://www.childpsy.net/ http://ffii.org http://palestinefacts.org >>> http://openvotingconsortium.org http://camera.org http://jihadwatch.org >>> The difference between genius and stupidity is that genius has its >>> limits. >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Wed Apr 24 23:01:49 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 22:01:49 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com> Message-ID: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. 
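Putting the suggestions from the "variable column names" thread together, a rough sketch (the data and column names below are invented; in current versions a plain character variable is accepted for by, so eval() is not strictly required):

library(data.table)
dt <- data.table(user = c("u1", "u1", "u2"),
                 col1 = 1:3, col4 = 4:6, col7 = 7:9)
myvars <- c("col1", "col4", "col7")
myid   <- "user"
setkeyv(dt, myid)

# select columns programmatically: j as a character vector with with = FALSE
dt[, c(myid, myvars), with = FALSE]

# summarise myvars by myid without any literal column name appearing in the code
dt[, lapply(.SD, sum), by = myid, .SDcols = myvars]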
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From eduard.antonyan at gmail.com Wed Apr 24 23:22:42 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 16:22:42 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> Message-ID: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > But then what would be analogous to CROSS APPLY in SQL? > > > I'd agree with Eduard, although it's probably too late to change behavior > > now. Maybe for data.table.2? Eduard's proposal seems more closely > > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > > requested). > > > > S. > > > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 > >> From: eduard.antonyan at gmail.com > >> To: datatable-help at lists.r-forge.r-project.org > >> Subject: Re: [datatable-help] changing data.table by-without-by > >> syntax to require a "by" > >> > >> I think you're missing the point Michael. Just because it's possible to > >> do it > >> the way it's done now, doesn't mean that's the best way, as I've tried > >> to > >> argue in the OP. I don't think you've addressed the issue of unnecessary > >> complexity pointed out in OP. > >> > >> > >> > >> -- > >> View this message in context: > >> > http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html > >> Sent from the datatable-help mailing list archive at Nabble.com. > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Thu Apr 25 00:28:08 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 23:28:08 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> Message-ID: <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: > By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). > Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [8], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: > "We table table1 and table2. table1 has a column called rowcount. > > For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [9]" > > On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > >> But then what would be analogous to CROSS APPLY in SQL? >> >> > I'd agree with Eduard, although it's probably too late to change behavior >> > now. Maybe for data.table.2? Eduard's proposal seems more closely >> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >> > requested). >> > >> > S. >> > >> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >> >> To: datatable-help at lists.r-forge.r-project.org [2] >> >>>> Subject: Re: [datatable-help] changing data.table by-without-by >> >> syntax to require a "by" >> >> >> >> I think you're missing the point Michael. Just because it's possible to >> >> do it >> >> the way it's done now, doesn't mean that's the best way, as I've tried >> >> to >> >> argue in the OP. I don't think you've addressed the issue of unnecessary >> >> complexity pointed out in OP. >> >> >> >> >> >> >> >> -- >> >> View this message in context: >> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >> >> Sent from the datatable-help mailing list archive at Nabble.com. 
>> >> _______________________________________________ >> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [4] >> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >> > _______________________________________________ >> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [6] >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:datatable-help at lists.r-forge.r-project.org [7] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9] http://table2.id [10] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 00:41:22 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 23:41:22 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> Message-ID: <290b8e5f0151662cb436cca320fa709f@imap.plus.net> Where I meant CROSS APPLY not CROSS BY (typo) and incorrect with 2 r's. I picked up on that because out of the entire page you seemed to quote a sentence which made no sense. The rest of the article looks great. On 24.04.2013 23:28, Matthew Dowle wrote: > That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? > > Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2), top=c(3,4)) > 1> Y > a top > 1: 1 3 > 2: 2 4 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 1> > > If there was no by-without-by (analogous to CROSS BY), then how would that be done? > > On 24.04.2013 22:22, Eduard Antonyan wrote: > >> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [8], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [9]" >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change behavior >>> > now. Maybe for data.table.2? 
Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>> >>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [4] >>> >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>> > _______________________________________________ >>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [6] >>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:datatable-help at lists.r-forge.r-project.org [7] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9] http://table2.id [10] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Apr 25 00:43:19 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 17:43:19 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> Message-ID: I assumed they meant create a table :) that looks cool, what's i.top ? I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > ** > > > > That sentence on that linked webpage seems incorect English, since table > is a noun not a verb. Should "table" be "join" perhaps? > > Anyway, by-without-by is often used with join inherited scope (JIS). 
For > example, translating their example : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2), top=c(3,4)) > 1> Y > a top > 1: 1 3 > 2: 2 4 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 1> > > > > If there was no by-without-by (analogous to CROSS BY), then how would that be done? > > > > On 24.04.2013 22:22, Eduard Antonyan wrote: > > By that you mean current behavior? You'd get current behavior by > explicitly specifying the appropriate "by" (i.e. "by" equal to the key). > Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using > http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I > can't figure out how by-without-by (or with by-with-by for that matter:) ) > helps with e.g. the first example there: > "We table table1 and table2. table1 has a column called rowcount. > > For each row from table1 we need to select first rowcount rows from table2, > ordered by table2.id" > > > > > On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > >> But then what would be analogous to CROSS APPLY in SQL? >> >> > I'd agree with Eduard, although it's probably too late to change >> behavior >> > now. Maybe for data.table.2? Eduard's proposal seems more closely >> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >> > requested). >> > >> > S. >> > >> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> >> From: eduard.antonyan at gmail.com >> >> To: datatable-help at lists.r-forge.r-project.org >> >> Subject: Re: [datatable-help] changing data.table by-without-by >> >> syntax to require a "by" >> >> >> >> I think you're missing the point Michael. Just because it's possible to >> >> do it >> >> the way it's done now, doesn't mean that's the best way, as I've tried >> >> to >> >> argue in the OP. I don't think you've addressed the issue of >> unnecessary >> >> complexity pointed out in OP. >> >> >> >> >> >> >> >> -- >> >> View this message in context: >> >> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> >> Sent from the datatable-help mailing list archive at Nabble.com. >> >> _______________________________________________ >> >> datatable-help mailing list >> >> datatable-help at lists.r-forge.r-project.org >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >> _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 00:50:04 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 23:50:04 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> Message-ID: <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? 
: 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [8], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>> "We table table1 and table2. table1 has a column called rowcount. >>> >>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [9]" >>> >>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>> >>>> But then what would be analogous to CROSS APPLY in SQL? >>>> >>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>> > requested). >>>> > >>>> > S. >>>> > >>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>> >>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>> >> syntax to require a "by" >>>> >> >>>> >> I think you're missing the point Michael. Just because it's possible to >>>> >> do it >>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>> >> to >>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>> >> complexity pointed out in OP. >>>> >> >>>> >> >>>> >> >>>> >> -- >>>> >> View this message in context: >>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>> >> Sent from the datatable-help mailing list archive at Nabble.com. 
>>>> >> _______________________________________________ >>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [4] >>>> >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>> > _______________________________________________ >>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [6] >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:datatable-help at lists.r-forge.r-project.org [7] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9] http://table2.id [10] mailto:mdowle at mdowle.plus.com [11] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Apr 25 01:01:42 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 18:01:42 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> Message-ID: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > ** > > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > > What about this? : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > 9: 1 4 > 1> > > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> >> >> That sentence on that linked webpage seems incorect English, since table >> is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). 
For >> example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >> By that you mean current behavior? You'd get current behavior by >> explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using >> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I >> can't figure out how by-without-by (or with by-with-by for that matter:) ) >> helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from >> table2, ordered by table2.id" >> >> >> >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change >>> behavior >>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >>> >> From: eduard.antonyan at gmail.com >>> >> To: datatable-help at lists.r-forge.r-project.org >>> >> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible >>> to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of >>> unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> >>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >>> >> datatable-help at lists.r-forge.r-project.org >>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> > >>> _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From viviannevilar at gmail.com Thu Apr 25 02:38:28 2013 From: viviannevilar at gmail.com (Vivianne Vilar) Date: Thu, 25 Apr 2013 10:38:28 +1000 Subject: [datatable-help] fread: coercion of class from integer to character due to NA string. Message-ID: Hi there, I think this is probably a known issue, but just in case, here it is. I am trying to use fread to read a very large csv file, but I am having problems due to the fact that NAs in a numeric column are represented with some letters. For example, in my column of SIC codes I have "Z" to represent NAs. 
Even though I explicitly set those to be NAs in the command: data6281 <- fread("data6281.csv",header=TRUE, na.strings=c("C",".","B","Z","")) I get the warning message that that column was changed to be character even though it is supposed to be integer. With the read.csv I have no problem when I use the command data6281 <- data.table(read.csv("data6281.csv",header=TRUE, colClasses=c("integer","integer","integer","integer","integer","factor","character","factor","numeric","numeric","integer"), na.strings=c("C",".","B","Z",""))) but fread does not allow me to set the column classes since it doesn't accept the argument colClasses. A shame really. fread is much faster, and I love that it shows the % progress. I don't supposed there is a way around this, but if there is I would be glad to know. I would also be happy to provide an example if that's necessary. Cheers, Vivianne Siqueira Campos Vilar ---------------------------------------------- ?Don't worry about the world coming to an end today. It is already tomorrow in Australia.? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 03:25:50 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 25 Apr 2013 02:25:50 +0100 Subject: [datatable-help] fread: coercion of class from integer to character due to NA string. In-Reply-To: References: Message-ID: Hi, Thanks for reporting. Yes all known and will be tackled. colClasses should be next commit hopefully. Matthew > Hi there, > > I think this is probably a known issue, but just in case, here it is. > > I am trying to use fread to read a very large csv file, but I am having > problems due to the fact that NAs in a numeric column are represented with > some letters. For example, in my column of SIC codes I have "Z" to > represent NAs. Even though I explicitly set those to be NAs in the > command: > > data6281 <- fread("data6281.csv",header=TRUE, > na.strings=c("C",".","B","Z","")) > > I get the warning message that that column was changed to be character > even > though it is supposed to be integer. > > With the read.csv I have no problem when I use the command > > data6281 <- data.table(read.csv("data6281.csv",header=TRUE, > colClasses=c("integer","integer","integer","integer","integer","factor","character","factor","numeric","numeric","integer"), > na.strings=c("C",".","B","Z",""))) > > but fread does not allow me to set the column classes since it doesn't > accept the argument colClasses. > > A shame really. fread is much faster, and I love that it shows the % > progress. > > I don't supposed there is a way around this, but if there is I would be > glad to know. > > I would also be happy to provide an example if that's necessary. > > Cheers, > > Vivianne Siqueira Campos Vilar > ---------------------------------------------- > ?Don't worry about the world coming to an end today. It is already > tomorrow > in Australia.? 
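Until colClasses reaches fread, one possible stop-gap is to accept the character column and coerce it afterwards, at which point the letters standing in for NA become NA. This is only a sketch: 'sic' is a placeholder for whatever the affected SIC-code column is actually called.

data6281 <- fread("data6281.csv", header=TRUE, na.strings=c("C", ".", "B", "Z", ""))
data6281[, sic := suppressWarnings(as.integer(sic))]   # non-numeric strings such as "Z" coerce to NA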
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From eduard.antonyan at gmail.com Thu Apr 25 06:16:24 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 23:16:24 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> Message-ID: <9146185881995080674@unknownmsgid> That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > ** > > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > > What about this? : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > 9: 1 4 > 1> > > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> >> >> That sentence on that linked webpage seems incorect English, since table >> is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). For >> example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >> By that you mean current behavior? You'd get current behavior by >> explicitly specifying the appropriate "by" (i.e. "by" equal to the key). 
>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using >> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I >> can't figure out how by-without-by (or with by-with-by for that matter:) ) >> helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from >> table2, ordered by table2.id" >> >> >> >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change >>> behavior >>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >>> >> From: eduard.antonyan at gmail.com >>> >> To: datatable-help at lists.r-forge.r-project.org >>> >> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible >>> to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of >>> unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> >>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >>> >> datatable-help at lists.r-forge.r-project.org >>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> > >>> _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 11:28:43 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 25 Apr 2013 10:28:43 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <9146185881995080674@unknownmsgid> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> Message-ID: <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. 
But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: > That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. > To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. > > On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: > >> that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me >> >> On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: >> >>> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. >>> >>> What about this? : >>> 1> X = data.table(a=1:3,b=1:15, key="a") >>> 1> X >>> a b >>> 1: 1 1 >>> 2: 1 4 >>> 3: 1 7 >>> 4: 1 10 >>> 5: 1 13 >>> 6: 2 2 >>> 7: 2 5 >>> 8: 2 8 >>> 9: 2 11 >>> 10: 2 14 >>> 11: 3 3 >>> 12: 3 6 >>> 13: 3 9 >>> 14: 3 12 >>> 15: 3 15 >>> >>> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) >>> >>> 1> Y >>> a top >>> 1: 1 3 >>> 2: 2 4 >>> 3: 1 2 >>> 1> X[Y, head(.SD,i.top)] >>> a b >>> 1: 1 1 >>> 2: 1 4 >>> 3: 1 7 >>> 4: 2 2 >>> 5: 2 5 >>> 6: 2 8 >>> 7: 2 11 >>> 8: 1 1 >>> >>> 9: 1 4 >>> 1> >>> >>> On 24.04.2013 23:43, Eduard Antonyan wrote: >>> >>>> I assumed they meant create a table :) >>>> that looks cool, what's i.top ? I can get a very similar to yours result by writing: >>>> X[Y][, head(.SD, top[1]), by = a] >>>> and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): >>>> X[Y, head(.SD, i.top), by = a] >>>> >>>> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: >>>> >>>>> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >>>>> >>>>> Anyway, by-without-by is often used with join inherited scope (JIS). 
For example, translating their example : >>>>> >>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>> 1> X >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 1 10 >>>>> 5: 1 13 >>>>> 6: 2 2 >>>>> 7: 2 5 >>>>> 8: 2 8 >>>>> 9: 2 11 >>>>> 10: 2 14 >>>>> 11: 3 3 >>>>> 12: 3 6 >>>>> >>>>> 13: 3 9 >>>>> 14: 3 12 >>>>> 15: 3 15 >>>>> 1> Y = data.table(a=c(1,2), top=c(3,4)) >>>>> 1> Y >>>>> a top >>>>> 1: 1 3 >>>>> 2: 2 4 >>>>> 1> X[Y, head(.SD,i.top)] >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 2 2 >>>>> 5: 2 5 >>>>> >>>>> 6: 2 8 >>>>> 7: 2 11 >>>>> 1> >>>>> >>>>> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >>>>> >>>>> On 24.04.2013 22:22, Eduard Antonyan wrote: >>>>> >>>>>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>>>>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>>>>> "We table table1 and table2. table1 has a column called rowcount. >>>>>> >>>>>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [10]" >>>>>> >>>>>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>>>>> >>>>>>> But then what would be analogous to CROSS APPLY in SQL? >>>>>>> >>>>>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>>>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>>>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>>>>> > requested). >>>>>>> > >>>>>>> > S. >>>>>>> > >>>>>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>>>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>>>>> >>>>>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>>>>> >> syntax to require a "by" >>>>>>> >> >>>>>>> >> I think you're missing the point Michael. Just because it's possible to >>>>>>> >> do it >>>>>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>>>>> >> to >>>>>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>>>>> >> complexity pointed out in OP. >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> -- >>>>>>> >> View this message in context: >>>>>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>>>>> >> Sent from the datatable-help mailing list archive at Nabble.com [4]. 
>>>>>>> >> _______________________________________________ >>>>>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [5] >>>>>>> >>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] >>>>>>> > _______________________________________________ >>>>>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [7] >>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] http://Nabble.com [5] mailto:datatable-help at lists.r-forge.r-project.org [6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [10] http://table2.id [11] mailto:mdowle at mdowle.plus.com [12] mailto:mdowle at mdowle.plus.com [13] mailto:mdowle at mdowle.plus.com [14] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Apr 25 14:45:45 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Thu, 25 Apr 2013 07:45:45 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> Message-ID: <5222879356405645530@unknownmsgid> Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? 
If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > > What about this? : > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > > 9: 1 4 > 1> > > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> >> >> That sentence on that linked webpage seems incorect English, since table >> is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). 
For >> example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >> By that you mean current behavior? You'd get current behavior by >> explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using >> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I >> can't figure out how by-without-by (or with by-with-by for that matter:) ) >> helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from >> table2, ordered by table2.id" >> >> >> >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change >>> behavior >>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >>> >> From: eduard.antonyan at gmail.com >>> >> To: datatable-help at lists.r-forge.r-project.org >>> >> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible >>> to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of >>> unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> >>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >>> >> datatable-help at lists.r-forge.r-project.org >>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> > >>> _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From s_milberg at hotmail.com Thu Apr 25 16:54:36 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Thu, 25 Apr 2013 10:54:36 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <5222879356405645530@unknownmsgid> References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com>, , , , <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net>, , <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net>, , <9146185881995080674@unknownmsgid>, <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net>, <5222879356405645530@unknownmsgid> Message-ID: Whatever the "right" way to do things is, the key issue is that default behavior should not be changed since existing code will rely on it. So even though I tend to agree with Eduard, I would strongly advocate against any change in current behavior. This aside, let me throw my 2 pennies in for the sake of data.table.2: As for CROSS APPLY, to be honest, my experience with SQL has been primarily with MySQL < 5 so I didn't even know that existed. As for your specific example a couple of e-mails ago, I believe this works: X = data.table(a=1:3,b=1:15, key="a") Y = data.table(a=c(1,2,1), top=c(3,4,2)) X[Y][, head(.SD, top[1]), by=list(a, top)] Granted, this is somewhat inefficient since we now have the `top` vector replicated for each value of `a` in `X`. You can probably come up with other examples that are inefficient or just don't work (e.g. `Y = data.table(a=c(1,2,1, 1), top=c(3,4,2,2))`), but the point here isn't whether you should allow CROSS APPLY or not, but what the "correct" syntax for invoking cross apply is. I would argue that the correct output to: X[Y, sum(a * top)] Should be 21, not: a V1 1: 1 3 2: 2 8 3: 1 2 While the output above may be convenient to you, it is not intuitive at all. In fact, it is an advanced caveat to standard behavior ("J is an expression evaluated in the context of X") that isn't straigthforward to circumvent, and would likely bewilder most beginner users of data.table. I think given the parallels between data.table and SQL, "X[Y, sum(a * top)]" should mean "SELECT sum(X.a * Y.top) FROM X INNER JOIN Y USING(a)", not some more complex expression involving a CROSS APPLY. Note that if you want a CROSS APPLY in SQL, you have to ask for it (I guess I picked at terrible example here, since the GROUP is implied...). I think the "correct" way to do the original task would be something along the lines of: X[Y, head(.SD, i.top), cross.apply=TRUE] or some such. That said, data.table is yours. It is a fantastic tool, and if you want to behave in a manner that simplifies your work rather than matches the intuitions of others, then it is your hard earned right that I fully respect. Slightly off topic, why aren't the columns from the Y table available in joint inherited scope when not doing a by without by? I find it odd that: X[Y, sum(a * top), by=b] Produces: Error in `[.data.table`(X, Y, sum(a * top), by = b) : object 'top' not found Finally, is i.top documented? S. From: eduard.antonyan at gmail.com Date: Thu, 25 Apr 2013 07:45:45 -0500 To: mdowle at mdowle.plus.com CC: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. 
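Both behaviours discussed above appear to be reachable today by joining first and then aggregating the join result, where top is an ordinary column. An untested sketch using the thread's toy X and Y:

X[Y][, sum(a * top)]           # a single total over the whole join result (join, then aggregate)
X[Y][, sum(a * top), by = b]   # grouped aggregate with top in scope, avoiding the 'object top not found' error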
The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. 
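To make the distinction concrete, here is a small sketch with the thread's X and Y (illustrative only, not an excerpt from the thread): under the current default, j runs once per row of Y, whereas joining first and then running j gives one answer over the whole join. The proposed explicit by would name the first behaviour instead of inferring it from a missing by.

X[Y, sum(b)]      # by-without-by today: one row per row of Y, sum(b) within each matched group
X[Y][, sum(b)]    # join first, then sum(b) over everything the join returned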
On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: I assumed they meant create a table :) that looks cool, what's i.top ? I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. 
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Thu Apr 25 09:58:32 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 25 Apr 2013 08:58:32 +0100 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> <230b0040889556349b21822824a5fb7e@imap.plus.net> Message-ID: Hi ? I thought I'd follow up on this. Matthew, are you still unable to reproduce it? It is still happening to me after an upgrade to R 3.0.0. And Garrett's case above seems even more severe, with a truncation at 256 characters it seems, so it's not just me, and it does seem to depend on some sort of system configuration. On Thu, Mar 28, 2013 at 3:26 PM, Timoth?e Carayol < timothee.carayol at gmail.com> wrote: > Of course, I'll be happy to help! > > By the way the verbose output was actually from computer 1 (with 1.8.9) so > it seems like the -nan% problem is maybe still there? > > Cheers > Timoth?e > > > On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle wrote: > >> ** >> >> >> >> Hi, >> >> Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with >> v1.8.9 should have the -nan% problem fixed. >> >> I'm a bit stumped for the moment. I've filed a bug report. Probably, if >> I still can't reproduce my end, I'll add some more detailed tracing to >> verbose output and ask you to try again next week if that's ok. >> >> Thanks for reporting! >> >> Matthew >> >> >> >> On 28.03.2013 14:58, Timoth?e Carayol wrote: >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... >> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. 
>> Count of eol after first data row: 1023 >> >> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 4092 1022 >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... >> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 1023 >> >> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 4096 1023 >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... >> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 1023 >> >> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 4100 1023 >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... 
>> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 1023 >> >> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 40000 1023 >> >> >> >> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: >> >>> >>> >>> Hm this is odd. >>> >>> Could you run the following and paste back the (verbose) results please. >>> for (n in c(1023:1025, 10000)) { >>> >>> input = paste( rep('a\tb\n', n), collapse='') >>> A = fread(input,verbose=TRUE) >>> cat(nchar(input), nrow(A), "\n") >>> } >>> >>> >>> >>> >>> >>> On 28.03.2013 14:38, Timoth?e Carayol wrote: >>> >>> Curiouser and curiouser.. >>> >>> I can reproduce on two computers with different versions of R and of >>> data.table. >>> >>> >>> >>> Computer 1 (it says unknown-linux but is actually ubuntu): >>> >>> R version 2.15.3 (2013-03-01) >>> >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> >>> >>> locale: >>> >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>> LC_MONETARY=en_GB.UTF-8 >>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C >>> LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 >>> LC_IDENTIFICATION=C >>> >>> >>> >>> attached base packages: >>> >>> [1] stats graphics grDevices utils datasets methods base >>> >>> >>> >>> other attached packages: >>> >>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >>> >>> Computer 2: >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-redhat-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] data.table_1.8.8 >>> >>> loaded via a namespace (and not attached): >>> [1] tools_2.15.2 >>> >>> >>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >>> >>>> >>>> >>>> Interesting, what's your sessionInfo() please? >>>> >>>> For me it seems to work ok : >>>> >>>> [1] 1022 >>>> [1] 1023 >>>> [1] 1024 >>>> [1] 9999 >>>> >>>> > sessionInfo() >>>> R version 2.15.2 (2012-10-26) >>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>> >>>> >>>> >>>> On 27.03.2013 22:49, Timoth?e Carayol wrote: >>>> >>>> Agree with Muhammad, longer character strings are definitely >>>> permitted in R. 
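As a quick sanity check of that point (a trivial sketch, not taken from the thread): R itself handles strings far longer than 4096 characters, so any truncation seen below is specific to fread's text input handling.

s <- paste(rep("a\tb\n", 10000), collapse="")
nchar(s)    # 40000, well past 4096, with no complaint from R itself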
>>>> A minimal example that shows something strange happening with fread: >>>> for (n in c(1023:1025, 10000)) { >>>> A <- fread( >>>> paste( >>>> rep('a\tb\n', n), >>>> collapse='' >>>> ), >>>> sep='\t' >>>> ) >>>> print(nrow(A)) >>>> } >>>> On my computer, I obtain: >>>> [1] 1022 >>>> [1] 1023 >>>> [1] 1023 >>>> [1] 1023 >>>> Hope this helps >>>> Timothée >>>> >>>> >>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>>> >>>>> Hi, >>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>>>> that >>>>> the R limit for a character string length? What happens at 4097? >>>>> Matthew >>>>> >>>>> > Hi, >>>>> > >>>>> > I have an example of a string of 4097 characters which can't be >>>>> parsed by >>>>> > fread; however, if I remove any character, it can be parsed just >>>>> fine. Is >>>>> > that a known limitation? >>>>> > >>>>> > (If I write the string to a file and then fread the file name, it >>>>> works >>>>> > too.) >>>>> > >>>>> > Let me know if you need the string and/or a bug report. >>>>> > >>>>> > Thanks >>>>> > Timothée >>>>> > _______________________________________________ >>>>> > datatable-help mailing list >>>>> > datatable-help at lists.r-forge.r-project.org >>>>> > >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >>> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 17:32:14 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 25 Apr 2013 16:32:14 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <5222879356405645530@unknownmsgid> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> Message-ID: <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. Thanks, Matthew On 25.04.2013 13:45, Eduard Antonyan wrote: > Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. > The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. > I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element.
The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. > > On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: > >> I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). >> >> As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. >> >> Maybe it helps to consider : >> >> x+y >> >> Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. >> >> I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. >> >> Matthew >> >> On 25.04.2013 05:16, Eduard Antonyan wrote: >> >>> That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. >>> To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. >>> >>> On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: >>> >>>> that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me >>>> >>>> On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: >>>> >>>>> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. >>>>> >>>>> What about this? 
: >>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>> 1> X >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 1 10 >>>>> 5: 1 13 >>>>> 6: 2 2 >>>>> 7: 2 5 >>>>> 8: 2 8 >>>>> 9: 2 11 >>>>> 10: 2 14 >>>>> 11: 3 3 >>>>> 12: 3 6 >>>>> 13: 3 9 >>>>> 14: 3 12 >>>>> 15: 3 15 >>>>> >>>>> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) >>>>> >>>>> 1> Y >>>>> a top >>>>> 1: 1 3 >>>>> 2: 2 4 >>>>> 3: 1 2 >>>>> 1> X[Y, head(.SD,i.top)] >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 2 2 >>>>> 5: 2 5 >>>>> 6: 2 8 >>>>> 7: 2 11 >>>>> 8: 1 1 >>>>> >>>>> 9: 1 4 >>>>> 1> >>>>> >>>>> On 24.04.2013 23:43, Eduard Antonyan wrote: >>>>> >>>>>> I assumed they meant create a table :) >>>>>> that looks cool, what's i.top ? I can get a very similar to yours result by writing: >>>>>> X[Y][, head(.SD, top[1]), by = a] >>>>>> and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): >>>>>> X[Y, head(.SD, i.top), by = a] >>>>>> >>>>>> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: >>>>>> >>>>>>> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >>>>>>> >>>>>>> Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : >>>>>>> >>>>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>>>> 1> X >>>>>>> a b >>>>>>> 1: 1 1 >>>>>>> 2: 1 4 >>>>>>> 3: 1 7 >>>>>>> 4: 1 10 >>>>>>> 5: 1 13 >>>>>>> 6: 2 2 >>>>>>> 7: 2 5 >>>>>>> 8: 2 8 >>>>>>> 9: 2 11 >>>>>>> 10: 2 14 >>>>>>> 11: 3 3 >>>>>>> 12: 3 6 >>>>>>> >>>>>>> 13: 3 9 >>>>>>> 14: 3 12 >>>>>>> 15: 3 15 >>>>>>> 1> Y = data.table(a=c(1,2), top=c(3,4)) >>>>>>> 1> Y >>>>>>> a top >>>>>>> 1: 1 3 >>>>>>> 2: 2 4 >>>>>>> 1> X[Y, head(.SD,i.top)] >>>>>>> a b >>>>>>> 1: 1 1 >>>>>>> 2: 1 4 >>>>>>> 3: 1 7 >>>>>>> 4: 2 2 >>>>>>> 5: 2 5 >>>>>>> >>>>>>> 6: 2 8 >>>>>>> 7: 2 11 >>>>>>> 1> >>>>>>> >>>>>>> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >>>>>>> >>>>>>> On 24.04.2013 22:22, Eduard Antonyan wrote: >>>>>>> >>>>>>>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>>>>>>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>>>>>>> "We table table1 and table2. table1 has a column called rowcount. >>>>>>>> >>>>>>>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [10]" >>>>>>>> >>>>>>>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>>>>>>> >>>>>>>>> But then what would be analogous to CROSS APPLY in SQL? >>>>>>>>> >>>>>>>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>>>>>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>>>>>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>>>>>>> > requested). >>>>>>>>> > >>>>>>>>> > S. >>>>>>>>> > >>>>>>>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>>>>>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>>>>>>> >>>>>>>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>>>>>>> >> syntax to require a "by" >>>>>>>>> >> >>>>>>>>> >> I think you're missing the point Michael. 
Just because it's possible to >>>>>>>>> >> do it >>>>>>>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>>>>>>> >> to >>>>>>>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>>>>>>> >> complexity pointed out in OP. >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> -- >>>>>>>>> >> View this message in context: >>>>>>>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>>>>>>> >> Sent from the datatable-help mailing list archive at Nabble.com [4]. >>>>>>>>> >> _______________________________________________ >>>>>>>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [5] >>>>>>>>> >>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] >>>>>>>>> > _______________________________________________ >>>>>>>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [7] >>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] http://Nabble.com [5] mailto:datatable-help at lists.r-forge.r-project.org [6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [10] http://table2.id [11] mailto:mdowle at mdowle.plus.com [12] mailto:mdowle at mdowle.plus.com [13] mailto:mdowle at mdowle.plus.com [14] mailto:eduard.antonyan at gmail.com [15] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.nelson at sydney.edu.au Fri Apr 26 00:46:37 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Thu, 25 Apr 2013 22:46:37 +0000 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au>, Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD6751B9E5@EX-MBX-PRO-04.mcs.usyd.edu.au> Indeed. Although I prefer .() to J() to be inline with the future implementation of ..() as well. frame[.(unique(id)), mult = "last"] the relevant section from the NEWS New DT[.(...)] syntax (in the style of package plyr) is identical to DT[list(...)], DT[J(...)] and DT[data.table(...)]. We plan to add ..(), too, so that .() and ..() are analogous to the file system's ./ and ../; i.e., .() evaluates within the frame of DT and ..() in the parent scope. From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Eduard Antonyan [eduard.antonyan at gmail.com] Sent: Thursday, 25 April 2013 12:47 AM To: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] there is no package called ?xts? @Michael, in the last expression, you probably forgot a J: frame[J(unique(id)), mult = "last"] On Tue, Apr 23, 2013 at 7:41 PM, Michael Nelson michael.nelson at sydney.edu.au wrote: >From the help for data.table::last If x is a data.table, the last row as a one row data.table. Otherwise, whatever xts::last returns. 
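For concreteness, a minimal sketch of that dispatch (it assumes data.table is attached and xts is not installed; the table and column names are only illustrative):

library(data.table)
frame <- data.table(id = c(1L, 1L, 2L), x = 1:3, key = "id")
last(frame)     # a data.table: data.table's own method returns the last row
last(frame$x)   # a bare vector: handed over to xts::last, so this errors when xts is absent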
calling lapply(.SD, last) will call last on each column in .SD. Columns within a data.table aren't data.tables thus `xts::last` is called. xts is on the suggests list for data.table, you could use install.packages('data.table, dependencies = 'Suggests') or manually installed xts. OR frame[, last(.SD), by = id] would work without needing xts as would frame[, .SD[.N], by = id] or without having to construct .SD (which is time consuming) frame[frame[, .I[.N],by = id]$V1] or setkey(frame, id) frame[unique(id), mult = 'last'] ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam Steingold [sds at gnu.org] Sent: Wednesday, 24 April 2013 7:57 AM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] there is no package called ?xts? Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Fri Apr 26 13:14:02 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 12:14:02 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> Message-ID: <2a827f7db260f284908fac604301eb8e@imap.plus.net> I didn't get any feedback off list on this one. But I'm coming round to the idea. What about by=.JOIN (is that you were thinking .J stood for?) Other possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to brainstorm it. by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be. To change the default for X[Y, j] I'm also coming round to. It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed). We have successfully made non-backwards-compatibile changes in the past by introducing a global option which we slowly migrate to. 
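To make that migration pattern concrete, a rough sketch of how a package-level option is typically set and consulted (datatable.bywithoutby is only the name proposed in this thread, not an option that exists today):

options(datatable.bywithoutby = "warning")                   # user opts in to the transitional setting
bwb <- getOption("datatable.bywithoutby", default = TRUE)    # package side: TRUE stays the default
if (identical(bwb, "warning"))
  warning("implicit by-without-by; an explicit by may be required in future")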
If datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from day one, with default TRUE. That allows those who wish for explicit by to migrate straight away by changing the default to FALSE. Existing users could set it to "warning" to see how many implicit bywithoutby they have. Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time, for say a year, with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis. Then the default could be chaged to "warning" before finally FALSE. Depending on how it goes, the option could be left there to allow TRUE if anyone wanted it, or removed (maybe after two years). Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility. Or ... instead of : X[Y, j, by=.JOIN] what about : X[by=Y, j] Matthew On 25.04.2013 16:32, Matthew Dowle wrote: > I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. > > Thanks, > > Matthew > > On 25.04.2013 13:45, Eduard Antonyan wrote: > >> Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. >> The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. >> I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. >> >> On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: >> >>> I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). >>> >>> As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. >>> >>> Maybe it helps to consider : >>> >>> x+y >>> >>> Fundamentally in R this depends on what x and y are. 
Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. >>> >>> I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. >>> >>> Matthew >>> >>> On 25.04.2013 05:16, Eduard Antonyan wrote: >>> >>>> That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. >>>> To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. >>>> >>>> On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: >>>> >>>>> that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me >>>>> >>>>> On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: >>>>> >>>>>> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. >>>>>> >>>>>> What about this? : >>>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>>> 1> X >>>>>> a b >>>>>> 1: 1 1 >>>>>> 2: 1 4 >>>>>> 3: 1 7 >>>>>> 4: 1 10 >>>>>> 5: 1 13 >>>>>> 6: 2 2 >>>>>> 7: 2 5 >>>>>> 8: 2 8 >>>>>> 9: 2 11 >>>>>> 10: 2 14 >>>>>> 11: 3 3 >>>>>> 12: 3 6 >>>>>> 13: 3 9 >>>>>> 14: 3 12 >>>>>> 15: 3 15 >>>>>> >>>>>> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) >>>>>> >>>>>> 1> Y >>>>>> a top >>>>>> 1: 1 3 >>>>>> 2: 2 4 >>>>>> 3: 1 2 >>>>>> 1> X[Y, head(.SD,i.top)] >>>>>> a b >>>>>> 1: 1 1 >>>>>> 2: 1 4 >>>>>> 3: 1 7 >>>>>> 4: 2 2 >>>>>> 5: 2 5 >>>>>> 6: 2 8 >>>>>> 7: 2 11 >>>>>> 8: 1 1 >>>>>> >>>>>> 9: 1 4 >>>>>> 1> >>>>>> >>>>>> On 24.04.2013 23:43, Eduard Antonyan wrote: >>>>>> >>>>>>> I assumed they meant create a table :) >>>>>>> that looks cool, what's i.top ? I can get a very similar to yours result by writing: >>>>>>> X[Y][, head(.SD, top[1]), by = a] >>>>>>> and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): >>>>>>> X[Y, head(.SD, i.top), by = a] >>>>>>> >>>>>>> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: >>>>>>> >>>>>>>> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >>>>>>>> >>>>>>>> Anyway, by-without-by is often used with join inherited scope (JIS). 
For example, translating their example : >>>>>>>> >>>>>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>>>>> 1> X >>>>>>>> a b >>>>>>>> 1: 1 1 >>>>>>>> 2: 1 4 >>>>>>>> 3: 1 7 >>>>>>>> 4: 1 10 >>>>>>>> 5: 1 13 >>>>>>>> 6: 2 2 >>>>>>>> 7: 2 5 >>>>>>>> 8: 2 8 >>>>>>>> 9: 2 11 >>>>>>>> 10: 2 14 >>>>>>>> 11: 3 3 >>>>>>>> 12: 3 6 >>>>>>>> >>>>>>>> 13: 3 9 >>>>>>>> 14: 3 12 >>>>>>>> 15: 3 15 >>>>>>>> 1> Y = data.table(a=c(1,2), top=c(3,4)) >>>>>>>> 1> Y >>>>>>>> a top >>>>>>>> 1: 1 3 >>>>>>>> 2: 2 4 >>>>>>>> 1> X[Y, head(.SD,i.top)] >>>>>>>> a b >>>>>>>> 1: 1 1 >>>>>>>> 2: 1 4 >>>>>>>> 3: 1 7 >>>>>>>> 4: 2 2 >>>>>>>> 5: 2 5 >>>>>>>> >>>>>>>> 6: 2 8 >>>>>>>> 7: 2 11 >>>>>>>> 1> >>>>>>>> >>>>>>>> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >>>>>>>> >>>>>>>> On 24.04.2013 22:22, Eduard Antonyan wrote: >>>>>>>> >>>>>>>>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>>>>>>>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>>>>>>>> "We table table1 and table2. table1 has a column called rowcount. >>>>>>>>> >>>>>>>>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [10]" >>>>>>>>> >>>>>>>>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>>>>>>>> >>>>>>>>>> But then what would be analogous to CROSS APPLY in SQL? >>>>>>>>>> >>>>>>>>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>>>>>>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>>>>>>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>>>>>>>> > requested). >>>>>>>>>> > >>>>>>>>>> > S. >>>>>>>>>> > >>>>>>>>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>>>>>>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>>>>>>>> >>>>>>>>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>>>>>>>> >> syntax to require a "by" >>>>>>>>>> >> >>>>>>>>>> >> I think you're missing the point Michael. Just because it's possible to >>>>>>>>>> >> do it >>>>>>>>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>>>>>>>> >> to >>>>>>>>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>>>>>>>> >> complexity pointed out in OP. >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> -- >>>>>>>>>> >> View this message in context: >>>>>>>>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>>>>>>>> >> Sent from the datatable-help mailing list archive at Nabble.com [4]. 
>>>>>>>>>> >> _______________________________________________ >>>>>>>>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [5] >>>>>>>>>> >>>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] >>>>>>>>>> > _______________________________________________ >>>>>>>>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [7] >>>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] http://Nabble.com [5] mailto:datatable-help at lists.r-forge.r-project.org [6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [10] http://table2.id [11] mailto:mdowle at mdowle.plus.com [12] mailto:mdowle at mdowle.plus.com [13] mailto:mdowle at mdowle.plus.com [14] mailto:eduard.antonyan at gmail.com [15] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From s_milberg at hotmail.com Fri Apr 26 15:34:38 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Fri, 26 Apr 2013 09:34:38 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <2a827f7db260f284908fac604301eb8e@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com>, , , , <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net>, , <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net>, , <9146185881995080674@unknownmsgid>, <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net>, <5222879356405645530@unknownmsgid>, <64f192ba80ac813986ed256029f0e7e0@imap.plus.net>, <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: Your suggestion for transition seems reasonable, although I still think you should just use a new argument rather than try to change the behavior of by. The most natural thing seems to leave Y as the `i` value, since after all, we are still joining on the key, and then just modify the standard join behavior with the cross.apply=TRUE or some such. This way, you avoid having to have a more complicated description of the `by` argument, where all of a sudden it means 'group by these expressions, unless you use the special expression .XXX, in which case something confusingly similar yet different happens, oh, and by the way, you can only use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?). To some extent your final proposal of by=Y is a little better, but still confusing since now you're using by to join and group, when it's `i` job to do that. Loosely related, what does .JOIN represent? Is it just a flag, or is it a derived variable the way .SD is? If it's just a flag, it seems like a bad idea to use a name to represent it since that is a break from the meaning of all the other .X variables in data.table, which actually contain some kind of derivative data. Finally, when you say "might help in a few related areas e.g. 
X[Y][,j] (which isn't great right now, agreed)", do you mean joint inherited scope will work even when we're not in by-without-by mode? That would be great. S. Date: Fri, 26 Apr 2013 12:14:02 +0100 From: mdowle at mdowle.plus.com To: eduard.antonyan at gmail.com CC: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" I didn't get any feedback off list on this one. But I'm coming round to the idea. What about by=.JOIN (is that you were thinking .J stood for?) Other possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to brainstorm it. by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be. To change the default for X[Y, j] I'm also coming round to. It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed). We have successfully made non-backwards-compatibile changes in the past by introducing a global option which we slowly migrate to. If datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from day one, with default TRUE. That allows those who wish for explicit by to migrate straight away by changing the default to FALSE. Existing users could set it to "warning" to see how many implicit bywithoutby they have. Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time, for say a year, with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis. Then the default could be chaged to "warning" before finally FALSE. Depending on how it goes, the option could be left there to allow TRUE if anyone wanted it, or removed (maybe after two years). Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility. Or ... instead of : X[Y, j, by=.JOIN] what about : X[by=Y, j] Matthew On 25.04.2013 16:32, Matthew Dowle wrote: I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. Thanks, Matthew On 25.04.2013 13:45, Eduard Antonyan wrote: Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. 
The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: I assumed they meant create a table :) that looks cool, what's i.top ? 
I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
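For reference, the behaviours being contrasted in this thread can be reproduced side by side with the X and Y defined above (a sketch only, under current data.table semantics):

library(data.table)
X = data.table(a=1:3, b=1:15, key="a")
Y = data.table(a=c(1,2), top=c(3,4))
X[Y, head(.SD, i.top)]             # by-without-by: j runs once per row of Y, using join inherited i.top
X[Y][, head(.SD, top[1]), by=a]    # join first, then an explicit by over the joined result
X[Y, .N]                           # also grouped per i row today; the proposal is to require an explicit signal for this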
URL: From eduard.antonyan at gmail.com Fri Apr 26 17:17:28 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 26 Apr 2013 10:17:28 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: I indeed offered .J as a shorthand for .JOIN and to ease the pain of having to type extra stuff for users who are relying on current behavior. Sadao is making good points. The question of what does by=list(a, .JOIN) do can still apply though with cross.apply=TRUE syntax, i.e. what does X[Y,j,by=a,cross.apply=TRUE] do? And I think the answer is the same for either syntax - in addition to the cross-apply-by it would group by column 'a'. Btw I think Matthew's examples above (or smth like them) should go into the FAQ or documentation as they were very illuminating and entirely non-obvious to me. If I were to rate all of the above from imo best to worst, it would be: .JOIN (or .J - yes, I'm biased:) ) .EACHI/cross.apply=TRUE .EACHIROW/.EACHJOIN .CROSSAPPLY X[by=Y,j] After typing the above list, I'm actually starting to like .EACHI (each.i=TRUE? <- I like this even better) more and more as it seems to convey the meaning (as far as I currently understand it - my understanding has shifted a little since the start of this conversation) really well. Anyway, sorry for a verbose email - my current vote is 'each.i = TRUE' - I think this conveys the right meaning, satisfies Sadao's points and also has a meaning that transitions well between having a join-i and not having a join-i (when you're not joining, specifying this option wouldn't do anything extra). On Fri, Apr 26, 2013 at 8:34 AM, Sadao Milberg wrote: > Your suggestion for transition seems reasonable, although I still think > you should just use a new argument rather than try to change the behavior > of by. The most natural thing seems to leave Y as the `i` value, since > after all, we are still joining on the key, and then just modify the > standard join behavior with the cross.apply=TRUE or some such. > > This way, you avoid having to have a more complicated description of the > `by` argument, where all of a sudden it means 'group by these expressions, > unless you use the special expression .XXX, in which case something > confusingly similar yet different happens, oh, and by the way, you can only > use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?). > To some extent your final proposal of by=Y is a little better, but still > confusing since now you're using by to join and group, when it's `i` job to > do that. > > Loosely related, what does .JOIN represent? Is it just a flag, or is it a > derived variable the way .SD is? If it's just a flag, it seems like a bad > idea to use a name to represent it since that is a break from the meaning > of all the other .X variables in data.table, which actually contain some > kind of derivative data. > > Finally, when you say "might help in a few related areas e.g. 
X[Y][,j] > (which isn't great right now, agreed)", do you mean joint inherited scope > will work even when we're not in by-without-by mode? That would be great. > > S. > > > ------------------------------ > Date: Fri, 26 Apr 2013 12:14:02 +0100 > From: mdowle at mdowle.plus.com > To: eduard.antonyan at gmail.com > CC: datatable-help at lists.r-forge.r-project.org > > Subject: Re: [datatable-help] changing data.table by-without-by syntax to > require a "by" > > > I didn't get any feedback off list on this one. > But I'm coming round to the idea. > What about by=.JOIN (is that you were thinking .J stood for?) Other > possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to > brainstorm it. > by=.JOIN could be added anyway with no backwards compatibility issues, so > that those who wished to be explicit now could be. > To change the default for X[Y, j] I'm also coming round to. It might > help in a few related areas e.g. X[Y][,j] (which isn't great right now, > agreed). We have successfully made non-backwards-compatibile changes in > the past by introducing a global option which we slowly migrate to. If > datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE > from day one, with default TRUE. That allows those who wish for explicit > by to migrate straight away by changing the default to FALSE. Existing > users could set it to "warning" to see how many implicit bywithoutby they > have. Those calls can gradually be changed to by=.JOIN and in that way > both implicit and explicit work at the same time, for say a year, with > full backwards compatibility by default. This approach allows a slow and > flexible migration path on a per feature basis. Then the default could be > chaged to "warning" before finally FALSE. Depending on how it goes, > the option could be left there to allow TRUE if anyone wanted it, or > removed (maybe after two years). Similar to the removal of J() outside > DT[...] i.e. users can still now very easily write J=data.table in their > .Rprofile if they wish, for backwards compatibility. > Or ... instead of : > X[Y, j, by=.JOIN] > what about : > X[by=Y, j] > Matthew > > On 25.04.2013 16:32, Matthew Dowle wrote: > > > I'd appreciate some input from others whether they agree or not. If you > have a view perhaps let me know off list, or on list, whichever you prefer. > Thanks, > Matthew > > On 25.04.2013 13:45, Eduard Antonyan wrote: > > Well, so can .I or .N or .GRP or .BY, yet those are used as special names, > which is exactly why I suggested .J. > The problem with using 'missingness' is that it already means smth very > different when i is not a join/cross, it means *don't* do a by, thus > introducing the whole case thing one has to through in their head every > time as in OP (which of course becomes automatic after a while, but it's a > cost nonetheless, which is in particular high for new people). So I see > absence of 'by' as an already taken and used signal and thus something else > has to be used for the new signal of cross apply (it doesn't have to be the > specific option I mentioned above). This is exactly why I find optional > turning off of this behavior unsatisfactory, and I don't see that as a > solution to this at all. > I think in the x+y context the appropriate analog is - what if that added > x and y normally, but when x and y were data.frames it did element by > element multiplication instead? 
Yes that's possible to do, and possible to > document, but it's not a good idea, because it takes place of adding them > element by element. The recycling behavior doesn't do that - what that does > is it says it doesn't really make sense to add them as is, but we can do > that after recycling, so let's recycle. It doesn't take the place of > another existing way of adding vectors. > > On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: > > > > I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). > > As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. > > Maybe it helps to consider : > > x+y > > Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. > > > > I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. > > Matthew > > On 25.04.2013 05:16, Eduard Antonyan wrote: > > That's really interesting, I can't currently think of another way of doing > that as after X[Y] is done the necessary information is lost. > To retain that functionality and achieve better readability, as in OP, I > think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good > replacement for current syntax. > > On Apr 24, 2013, at 6:01 PM, Eduard Antonyan > wrote: > > that's an interesting example - I didn't realize current behavior would > do that, I'm not at a PC anymore but I'll definitely think about it and > report back, as it's not immediately obvious to me > > > On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > What about this? 
: > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > > 9: 1 4 > 1> > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > > > That sentence on that linked webpage seems incorect English, since table > is a noun not a verb. Should "table" be "join" perhaps? > Anyway, by-without-by is often used with join inherited scope (JIS). For > example, translating their example : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > > > > > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2), top=c(3,4)) > 1> Y > a top > 1: 1 3 > 2: 2 4 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > > > > > 6: 2 8 > 7: 2 11 > 1> > > > > If there was no by-without-by (analogous to CROSS BY), then how would that be done? > > > > On 24.04.2013 22:22, Eduard Antonyan wrote: > > By that you mean current behavior? You'd get current behavior by > explicitly specifying the appropriate "by" (i.e. "by" equal to the key). > Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using > http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I > can't figure out how by-without-by (or with by-with-by for that matter:) ) > helps with e.g. the first example there: > "We table table1 and table2. table1 has a column called rowcount. > > For each row from table1 we need to select first rowcount rows from table2, > ordered by table2.id" > > > > > On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > > But then what would be analogous to CROSS APPLY in SQL? > > > I'd agree with Eduard, although it's probably too late to change behavior > > now. Maybe for data.table.2? Eduard's proposal seems more closely > > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > > requested). > > > > S. > > > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 > >> From: eduard.antonyan at gmail.com > >> To: datatable-help at lists.r-forge.r-project.org > >> Subject: Re: [datatable-help] changing data.table by-without-by > >> syntax to require a "by" > >> > >> I think you're missing the point Michael. Just because it's possible to > >> do it > >> the way it's done now, doesn't mean that's the best way, as I've tried > >> to > >> argue in the OP. I don't think you've addressed the issue of unnecessary > >> complexity pointed out in OP. > >> > >> > >> > >> -- > >> View this message in context: > >> > http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html > >> Sent from the datatable-help mailing list archive at Nabble.com. 
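To summarise the candidate spellings discussed in this thread side by side (only the first line runs in current data.table, reusing the X and Y above; the commented lines are proposals from this thread, none of them implemented):

X[Y, head(.SD, i.top)]                       # today: implicit by-without-by, grouped per row of Y
# X[Y, head(.SD, i.top), by=.JOIN]           # candidate: special symbol in by (also written .J)
# X[Y, head(.SD, i.top), by=.EACHI]          # candidate: special symbol in by, "each i row"
# X[Y, head(.SD, i.top), each.i=TRUE]        # candidate: separate flag argument
# X[Y, head(.SD, i.top), cross.apply=TRUE]   # candidate: name borrowed from SQL CROSS APPLY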
> >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > _______________________________________________ datatable-help mailing > list datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sds at gnu.org Fri Apr 26 17:45:22 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 11:45:22 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> Message-ID: <87mwslqorx.fsf@gnu.org> I am still missing something: --8<---------------cut here---------------start------------->8--- > dt <- data.table(user=c(rep(4, 5),rep(3, 5)), behavior=c(rep(FALSE,5),rep(TRUE,5)), country=c(rep(1,4),rep(2,6)), language=c(rep(6,6),rep(5,4)), event=1:10, key=c("user","country","language")) > dt user behavior country language event 1: 3 TRUE 2 5 7 2: 3 TRUE 2 5 8 3: 3 TRUE 2 5 9 4: 3 TRUE 2 5 10 5: 3 TRUE 2 6 6 6: 4 FALSE 1 6 1 7: 4 FALSE 1 6 2 8: 4 FALSE 1 6 3 9: 4 FALSE 1 6 4 10: 4 FALSE 2 6 5 > users <- dt[, sum(behavior) > 0, by=user] Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: behavior Optimization is on but j left unchanged as 'sum(behavior) > 0' Starting dogroups ... done dogroups in 0 secs > users user V1 1: 3 TRUE 2: 4 FALSE > setnames(users, "V1", "behavior") --8<---------------cut here---------------end--------------->8--- Now I want to do the same thing as in http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data for both fields > fields <- c("country","language") here is what I tried so far: --8<---------------cut here---------------start------------->8--- dt[, .N, .SDcols=fields, by=eval(list("user",fields))] Error in `[.data.table`(dt, , .N, .SDcols = fields, by = eval(list("user", : The items in the 'by' or 'keyby' list are length (1,2). Each must be same length as rows in x or number of rows returned by i (10). Calls: [ -> [.data.table --8<---------------cut here---------------end--------------->8--- the idea is to do something like --8<---------------cut here---------------start------------->8--- > dt.out <- dt[, .N, by=list(user,country)][, list(country[which.max(N)], max(N)/sum(N)), by=user] > setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", ".support"))) > users <- users[dt.out] user behavior country.name country.support 1: 3 TRUE 2 1.0 2: 4 FALSE 1 0.8 --8<---------------cut here---------------end--------------->8--- except that I do not want to have the literal "country" and "language" and that I am sure there is a way to avoid copying users in > users <- users[dt.out] by a ":=" trick. Thanks. > * Matthew Dowle [2013-04-24 21:54:17 +0100]: > > where ... 
is eval(myid) > iigc >> Or: >> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://palestinefacts.org http://ffii.org http://jihadwatch.org http://thereligionofpeace.com Morning is too early for anything but sleep. From mdowle at mdowle.plus.com Fri Apr 26 18:00:27 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 17:00:27 +0100 Subject: [datatable-help] variable column names In-Reply-To: <87mwslqorx.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> Message-ID: <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> > dt[, sum(behavior) > 0, by=user] user V1 1: 3 TRUE 2: 4 FALSE > dt[, any(behavior), by=user] # same user V1 1: 3 TRUE 2: 4 FALSE > dt[, list(behavior = any(behavior)), by=user] # how to same without > setnames afterwards user behavior 1: 3 TRUE 2: 4 FALSE > fields <- c("country","language") > dt[, list(behavior = any(behavior)), by=c("user",fields)] # by may > be character vector of column names user country language behavior 1: 3 2 5 TRUE 2: 3 2 6 TRUE 3: 4 1 6 FALSE 4: 4 2 6 FALSE HTH Matthew On 26.04.2013 16:45, Sam Steingold wrote: > I am still missing something: > > --8<---------------cut here---------------start------------->8--- >> dt <- data.table(user=c(rep(4, 5),rep(3, 5)), >> behavior=c(rep(FALSE,5),rep(TRUE,5)), > country=c(rep(1,4),rep(2,6)), > language=c(rep(6,6),rep(5,4)), > event=1:10, key=c("user","country","language")) >> dt > user behavior country language event > 1: 3 TRUE 2 5 7 > 2: 3 TRUE 2 5 8 > 3: 3 TRUE 2 5 9 > 4: 3 TRUE 2 5 10 > 5: 3 TRUE 2 6 6 > 6: 4 FALSE 1 6 1 > 7: 4 FALSE 1 6 2 > 8: 4 FALSE 1 6 3 > 9: 4 FALSE 1 6 4 > 10: 4 FALSE 2 6 5 >> users <- dt[, sum(behavior) > 0, by=user] > Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE > and o__ is length 0 > Detected that j uses these columns: behavior > Optimization is on but j left unchanged as 'sum(behavior) > 0' > Starting dogroups ... done dogroups in 0 secs >> users > user V1 > 1: 3 TRUE > 2: 4 FALSE >> setnames(users, "V1", "behavior") > --8<---------------cut here---------------end--------------->8--- > > Now I want to do the same thing as in > > http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data > for both fields >> fields <- c("country","language") > > here is what I tried so far: > > --8<---------------cut here---------------start------------->8--- > dt[, .N, .SDcols=fields, by=eval(list("user",fields))] > Error in `[.data.table`(dt, , .N, .SDcols = fields, by = > eval(list("user", : > The items in the 'by' or 'keyby' list are length (1,2). Each must > be same length as rows in x or number of rows returned by i (10). 
> Calls: [ -> [.data.table > --8<---------------cut here---------------end--------------->8--- > > the idea is to do something like > > --8<---------------cut here---------------start------------->8--- >> dt.out <- dt[, .N, by=list(user,country)][, >> list(country[which.max(N)], max(N)/sum(N)), by=user] >> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", >> ".support"))) >> users <- users[dt.out] > user behavior country.name country.support > 1: 3 TRUE 2 1.0 > 2: 4 FALSE 1 0.8 > --8<---------------cut here---------------end--------------->8--- > > except that I do not want to have the literal "country" and > "language" > and that I am sure there is a way to avoid copying users in >> users <- users[dt.out] > by a ":=" trick. > > Thanks. > >> * Matthew Dowle [2013-04-24 21:54:17 >> +0100]: >> >> where ... is eval(myid) >> iigc >>> Or: >>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] From sds at gnu.org Fri Apr 26 18:26:06 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 12:26:06 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> Message-ID: <87d2thqmw1.fsf@gnu.org> > * Matthew Dowle [2013-04-26 17:00:27 +0100]: > >> dt[, sum(behavior) > 0, by=user] > user V1 > 1: 3 TRUE > 2: 4 FALSE >> dt[, any(behavior), by=user] # same > user V1 > 1: 3 TRUE > 2: 4 FALSE >> dt[, list(behavior = any(behavior)), by=user] # how to same without >> setnames afterwards > user behavior > 1: 3 TRUE > 2: 4 FALSE >> fields <- c("country","language") >> dt[, list(behavior = any(behavior)), by=c("user",fields)] # by may >> be character vector of column names > user country language behavior > 1: 3 2 5 TRUE > 2: 3 2 6 TRUE > 3: 4 1 6 FALSE > 4: 4 2 6 FALSE oh no, this is _not_ what I want! user should be unique and fields should be summarized as described in the SO question (see the code below) > > > On 26.04.2013 16:45, Sam Steingold wrote: >> I am still missing something: >> >> --8<---------------cut here---------------start------------->8--- >>> dt <- data.table(user=c(rep(4, 5),rep(3, 5)), >>> behavior=c(rep(FALSE,5),rep(TRUE,5)), >> country=c(rep(1,4),rep(2,6)), >> language=c(rep(6,6),rep(5,4)), >> event=1:10, key=c("user","country","language")) >>> dt >> user behavior country language event >> 1: 3 TRUE 2 5 7 >> 2: 3 TRUE 2 5 8 >> 3: 3 TRUE 2 5 9 >> 4: 3 TRUE 2 5 10 >> 5: 3 TRUE 2 6 6 >> 6: 4 FALSE 1 6 1 >> 7: 4 FALSE 1 6 2 >> 8: 4 FALSE 1 6 3 >> 9: 4 FALSE 1 6 4 >> 10: 4 FALSE 2 6 5 >>> users <- dt[, sum(behavior) > 0, by=user] >> Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE >> and o__ is length 0 >> Detected that j uses these columns: behavior >> Optimization is on but j left unchanged as 'sum(behavior) > 0' >> Starting dogroups ... 
done dogroups in 0 secs >>> users >> user V1 >> 1: 3 TRUE >> 2: 4 FALSE >>> setnames(users, "V1", "behavior") >> --8<---------------cut here---------------end--------------->8--- >> >> Now I want to do the same thing as in >> >> http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data >> for both fields >>> fields <- c("country","language") >> >> here is what I tried so far: >> >> --8<---------------cut here---------------start------------->8--- >> dt[, .N, .SDcols=fields, by=eval(list("user",fields))] >> Error in `[.data.table`(dt, , .N, .SDcols = fields, by = >> eval(list("user", : >> The items in the 'by' or 'keyby' list are length (1,2). Each must >> be same length as rows in x or number of rows returned by i (10). >> Calls: [ -> [.data.table >> --8<---------------cut here---------------end--------------->8--- >> >> the idea is to do something like >> >> --8<---------------cut here---------------start------------->8--- >>> dt.out <- dt[, .N, by=list(user,country)][, >>> list(country[which.max(N)], max(N)/sum(N)), by=user] >>> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", >>> ".support"))) >>> users <- users[dt.out] >> user behavior country.name country.support >> 1: 3 TRUE 2 1.0 >> 2: 4 FALSE 1 0.8 >> --8<---------------cut here---------------end--------------->8--- >> >> except that I do not want to have the literal "country" and "language" >> and that I am sure there is a way to avoid copying users in >>> users <- users[dt.out] >> by a ":=" trick. >> >> Thanks. >> >>> * Matthew Dowle [2013-04-24 21:54:17 +0100]: >>> >>> where ... is eval(myid) >>> iigc >>>> Or: >>>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://pmw.org.il http://palestinefacts.org http://dhimmi.com http://thereligionofpeace.com Perl: all stupidities of UNIX in one. From mdowle at mdowle.plus.com Fri Apr 26 18:45:53 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 17:45:53 +0100 Subject: [datatable-help] variable column names In-Reply-To: <87d2thqmw1.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> Message-ID: S.O. is probably better for this kind of question then. But if you don't get an answer there, then come back to datatable-help. On 26.04.2013 17:26, Sam Steingold wrote: >> * Matthew Dowle [2013-04-26 17:00:27 >> +0100]: >> >>> dt[, sum(behavior) > 0, by=user] >> user V1 >> 1: 3 TRUE >> 2: 4 FALSE >>> dt[, any(behavior), by=user] # same >> user V1 >> 1: 3 TRUE >> 2: 4 FALSE >>> dt[, list(behavior = any(behavior)), by=user] # how to same >>> without >>> setnames afterwards >> user behavior >> 1: 3 TRUE >> 2: 4 FALSE >>> fields <- c("country","language") >>> dt[, list(behavior = any(behavior)), by=c("user",fields)] # by >>> may >>> be character vector of column names >> user country language behavior >> 1: 3 2 5 TRUE >> 2: 3 2 6 TRUE >> 3: 4 1 6 FALSE >> 4: 4 2 6 FALSE > > oh no, this is _not_ what I want! 
> user should be unique and fields should be summarized as described in > the SO question (see the code below) > > >> >> >> On 26.04.2013 16:45, Sam Steingold wrote: >>> I am still missing something: >>> >>> --8<---------------cut here---------------start------------->8--- >>>> dt <- data.table(user=c(rep(4, 5),rep(3, 5)), >>>> behavior=c(rep(FALSE,5),rep(TRUE,5)), >>> country=c(rep(1,4),rep(2,6)), >>> language=c(rep(6,6),rep(5,4)), >>> event=1:10, key=c("user","country","language")) >>>> dt >>> user behavior country language event >>> 1: 3 TRUE 2 5 7 >>> 2: 3 TRUE 2 5 8 >>> 3: 3 TRUE 2 5 9 >>> 4: 3 TRUE 2 5 10 >>> 5: 3 TRUE 2 6 6 >>> 6: 4 FALSE 1 6 1 >>> 7: 4 FALSE 1 6 2 >>> 8: 4 FALSE 1 6 3 >>> 9: 4 FALSE 1 6 4 >>> 10: 4 FALSE 2 6 5 >>>> users <- dt[, sum(behavior) > 0, by=user] >>> Finding groups (bysameorder=TRUE) ... done in 0secs. >>> bysameorder=TRUE >>> and o__ is length 0 >>> Detected that j uses these columns: behavior >>> Optimization is on but j left unchanged as 'sum(behavior) > 0' >>> Starting dogroups ... done dogroups in 0 secs >>>> users >>> user V1 >>> 1: 3 TRUE >>> 2: 4 FALSE >>>> setnames(users, "V1", "behavior") >>> --8<---------------cut here---------------end--------------->8--- >>> >>> Now I want to do the same thing as in >>> >>> >>> http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data >>> for both fields >>>> fields <- c("country","language") >>> >>> here is what I tried so far: >>> >>> --8<---------------cut here---------------start------------->8--- >>> dt[, .N, .SDcols=fields, by=eval(list("user",fields))] >>> Error in `[.data.table`(dt, , .N, .SDcols = fields, by = >>> eval(list("user", : >>> The items in the 'by' or 'keyby' list are length (1,2). Each must >>> be same length as rows in x or number of rows returned by i (10). >>> Calls: [ -> [.data.table >>> --8<---------------cut here---------------end--------------->8--- >>> >>> the idea is to do something like >>> >>> --8<---------------cut here---------------start------------->8--- >>>> dt.out <- dt[, .N, by=list(user,country)][, >>>> list(country[which.max(N)], max(N)/sum(N)), by=user] >>>> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", >>>> ".support"))) >>>> users <- users[dt.out] >>> user behavior country.name country.support >>> 1: 3 TRUE 2 1.0 >>> 2: 4 FALSE 1 0.8 >>> --8<---------------cut here---------------end--------------->8--- >>> >>> except that I do not want to have the literal "country" and >>> "language" >>> and that I am sure there is a way to avoid copying users in >>>> users <- users[dt.out] >>> by a ":=" trick. >>> >>> Thanks. >>> >>>> * Matthew Dowle [2013-04-24 21:54:17 >>>> +0100]: >>>> >>>> where ... is eval(myid) >>>> iigc >>>>> Or: >>>>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] From sds at gnu.org Fri Apr 26 19:05:39 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 13:05:39 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> Message-ID: <871u9xkysc.fsf@gnu.org> > * Matthew Dowle [2013-04-26 17:45:53 +0100]: > > S.O. is probably better for this kind of question then. > But if you don't get an answer there, then come back to datatable-help. 
http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://thereligionofpeace.com http://pmw.org.il http://jihadwatch.org http://camera.org http://honestreporting.com Apple: making a living off show-offs. From s_milberg at hotmail.com Fri Apr 26 20:48:40 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Fri, 26 Apr 2013 14:48:40 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com>, , , , <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net>, , <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net>, , <9146185881995080674@unknownmsgid>, <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net>, <5222879356405645530@unknownmsgid>, <64f192ba80ac813986ed256029f0e7e0@imap.plus.net>, <2a827f7db260f284908fac604301eb8e@imap.plus.net>, , Message-ID: each.i = TRUE sounds fine to me. Date: Fri, 26 Apr 2013 10:17:28 -0500 Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" From: eduard.antonyan at gmail.com To: s_milberg at hotmail.com CC: mdowle at mdowle.plus.com; datatable-help at lists.r-forge.r-project.org I indeed offered .J as a shorthand for .JOIN and to ease the pain of having to type extra stuff for users who are relying on current behavior. Sadao is making good points. The question of what does by=list(a, .JOIN) do can still apply though with cross.apply=TRUE syntax, i.e. what does X[Y,j,by=a,cross.apply=TRUE] do? And I think the answer is the same for either syntax - in addition to the cross-apply-by it would group by column 'a'. Btw I think Matthew's examples above (or smth like them) should go into the FAQ or documentation as they were very illuminating and entirely non-obvious to me. If I were to rate all of the above from imo best to worst, it would be:.JOIN (or .J - yes, I'm biased:) ).EACHI/cross.apply=TRUE.EACHIROW/.EACHJOIN .CROSSAPPLYX[by=Y,j] After typing the above list, I'm actually starting to like .EACHI (each.i=TRUE? <- I like this even better) more and more as it seems to convey the meaning (as far as I currently understand it - my understanding has shifted a little since the start of this conversation) really well. Anyway, sorry for a verbose email - my current vote is 'each.i = TRUE' - I think this conveys the right meaning, satisfies Sadao's points and also has a meaning that transitions well between having a join-i and not having a join-i (when you're not joining, specifying this option wouldn't do anything extra). On Fri, Apr 26, 2013 at 8:34 AM, Sadao Milberg wrote: Your suggestion for transition seems reasonable, although I still think you should just use a new argument rather than try to change the behavior of by. The most natural thing seems to leave Y as the `i` value, since after all, we are still joining on the key, and then just modify the standard join behavior with the cross.apply=TRUE or some such. This way, you avoid having to have a more complicated description of the `by` argument, where all of a sudden it means 'group by these expressions, unless you use the special expression .XXX, in which case something confusingly similar yet different happens, oh, and by the way, you can only use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?). 
To some extent your final proposal of by=Y is a little better, but still confusing since now you're using by to join and group, when it's `i` job to do that. Loosely related, what does .JOIN represent? Is it just a flag, or is it a derived variable the way .SD is? If it's just a flag, it seems like a bad idea to use a name to represent it since that is a break from the meaning of all the other .X variables in data.table, which actually contain some kind of derivative data. Finally, when you say "might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed)", do you mean joint inherited scope will work even when we're not in by-without-by mode? That would be great. S. Date: Fri, 26 Apr 2013 12:14:02 +0100 From: mdowle at mdowle.plus.com To: eduard.antonyan at gmail.com CC: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" I didn't get any feedback off list on this one. But I'm coming round to the idea. What about by=.JOIN (is that you were thinking .J stood for?) Other possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to brainstorm it. by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be. To change the default for X[Y, j] I'm also coming round to. It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed). We have successfully made non-backwards-compatibile changes in the past by introducing a global option which we slowly migrate to. If datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from day one, with default TRUE. That allows those who wish for explicit by to migrate straight away by changing the default to FALSE. Existing users could set it to "warning" to see how many implicit bywithoutby they have. Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time, for say a year, with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis. Then the default could be chaged to "warning" before finally FALSE. Depending on how it goes, the option could be left there to allow TRUE if anyone wanted it, or removed (maybe after two years). Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility. Or ... instead of : X[Y, j, by=.JOIN] what about : X[by=Y, j] Matthew On 25.04.2013 16:32, Matthew Dowle wrote: I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. Thanks, Matthew On 25.04.2013 13:45, Eduard Antonyan wrote: Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). 
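For anyone joining the thread at this point, a small self-contained illustration of
the behaviour being debated, using only syntax that already works with the
by-without-by default discussed here (data.table 1.8.x at the time of this thread);
the .JOIN / each.i / cross.apply spellings above are proposals, not implemented, and
the toy tables A and B below are made up for the example:

A <- data.table(a = c(1, 1, 2, 2, 3), v = 1:5, key = "a")
B <- data.table(a = c(1, 2))

A[B, sum(v)]      # by-without-by: j runs once per row of B (a=1 -> 3, a=2 -> 7)
A[B][, sum(v)]    # join first, then one j over the whole result ( = 10 )

The question in this thread is whether the first behaviour should stay the default
for X[Y, j], or whether it should require an explicit signal in by.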
This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? 
: 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: I assumed they meant create a table :) that looks cool, what's i.top ? I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. 
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Fri Apr 26 22:34:39 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 26 Apr 2013 15:34:39 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: I disagree with the criticism of data.table's complexity (in the OP). There's nothing wrong with overloading the syntax (that is what CS people call it, right?). As long as Matthew's in control of it, it's likely to have some internal consistency (which, of course, he could explain). However, I like the suggestion to add options (defaulting to something globally adjustable) to disable some of the overloading. Along similar lines (I think), I find unique.data.table very unintuitive. I can see how it could be useful, but strongly prefer base::unique for my current applications. Anyway, I have nothing particular to say about the piece of syntax you all are currently discussing. I just registered with this list to chime in here, instead of further cluttering SO (where eddi answered one of my questions yesterday). These emails sure are wide; must be like 1500px! Interesting to try out this ancient mailing-list form of communication. Please let me know if I should be using "Reply All" or actually quoting that massive thread (as everyone else seems to be doing with each post). Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From sds at gnu.org Sat Apr 27 00:02:31 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 18:02:31 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> <871u9xkysc.fsf@gnu.org> Message-ID: <87wqrpj6h4.fsf@gnu.org> > * Sam Steingold [2013-04-26 13:05:39 -0400]: > >> * Matthew Dowle [2013-04-26 17:45:53 +0100]: >> >> S.O. is probably better for this kind of question then. >> But if you don't get an answer there, then come back to datatable-help. > > http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns downvoted, unlikely to be answered. 
-- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://iris.org.il http://think-israel.org http://americancensorship.org http://pmw.org.il http://mideasttruth.com We have preferences. You have biases. They have prejudices. From mdowle at mdowle.plus.com Sat Apr 27 00:47:55 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 23:47:55 +0100 Subject: [datatable-help] variable column names In-Reply-To: <87wqrpj6h4.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> <871u9xkysc.fsf@gnu.org> <87wqrpj6h4.fsf@gnu.org> Message-ID: <30d6ae8f1a0d6974ebbd54da0d86f3b2@imap.plus.net> On 26.04.2013 23:02, Sam Steingold wrote: >> * Sam Steingold [2013-04-26 13:05:39 -0400]: >> >>> * Matthew Dowle [2013-04-26 17:45:53 >>> +0100]: >>> >>> S.O. is probably better for this kind of question then. >>> But if you don't get an answer there, then come back to >>> datatable-help. >> >> >> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns > > downvoted, unlikely to be answered. I've read it through. Perhaps sleep on it, don't look for 24hrs and look again as if you were trying to answer it yourself. Are there any small changes you can make to make it easier to answer? It wasn't me that downvoted but I suspect it's been done to encourage you to improve the question. Downvotes can (and often are) reversed. I've had many more downvotes than you once, but then I improved it and it went to +10. And, it's Friday and we've all had a long week! Matthew From mdowle at mdowle.plus.com Sat Apr 27 01:35:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 27 Apr 2013 00:35:17 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: Thanks for your comments Frank. Ha, yes it's ancient but still has a place. Yes "reply all": Back To: sender (if it's to someone in particular) and cc the list. But on general topics where lots of people are on the thread, just To: datatable-help alone is fine. Personally I prefer "top posting". Like I'm doing now. I only scroll down if I need to. I didn't notice the history was building up. If you comment inline later, then say "scroll down for comments inline" or something at the top. Note that Nabble collapses the history for you so threads are much easier to read there. Or I tend to read via RSS (gmane) in Outlook, so it feels like an email inbox which turns bold on new posts. You only need to subscribe to post (spam control). Most people turn off mail delivery pretty quickly I imagine (or setup an auto rule to move into a folder, but then you might as well subscribe to RSS I guess). S.O. is quite strict: must be clear questions with a clear answer, only one of which can be accepted. No opinion, voting, discussing or notices (enter mailing lists). 
Chat room is good but for quick chat when people are in the room at the same time. Many companies (sensibly) block chat access, though. Mailing lists allows all timezones a chance at a slower pace. Anonymity is just as acceptable and as easy in both places. Matthew On 26.04.2013 21:34, Frank Erickson wrote: > I disagree with the criticism of data.table's complexity (in the OP). There's nothing wrong with overloading the syntax (that is what CS people call it, right?). As long as Matthew's in control of it, it's likely to have some internal consistency (which, of course, he could explain). However, I like the suggestion to add options (defaulting to something globally adjustable) to disable some of the overloading. Along similar lines (I think), I find unique.data.table very unintuitive. I can see how it could be useful, but strongly prefer base::unique for my current applications. > Anyway, I have nothing particular to say about the piece of syntax you all are currently discussing. I just registered with this list to chime in here, instead of further cluttering SO (where eddi answered one of my questions yesterday). These emails sure are wide; must be like 1500px! Interesting to try out this ancient mailing-list form of communication. Please let me know if I should be using "Reply All" or actually quoting that massive thread (as everyone else seems to be doing with each post). > Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.kryukov at gmail.com Sat Apr 27 01:42:04 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Fri, 26 Apr 2013 16:42:04 -0700 Subject: [datatable-help] variable column names In-Reply-To: <30d6ae8f1a0d6974ebbd54da0d86f3b2@imap.plus.net> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> <871u9xkysc.fsf@gnu.org> <87wqrpj6h4.fsf@gnu.org> <30d6ae8f1a0d6974ebbd54da0d86f3b2@imap.plus.net> Message-ID: On Fri, Apr 26, 2013 at 3:47 PM, Matthew Dowle wrote: > On 26.04.2013 23:02, Sam Steingold wrote: >>> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns >> >> downvoted, unlikely to be answered. > > I've read it through. > > Perhaps sleep on it, don't look for 24hrs and look again as if you were > trying to answer it yourself. Are there any small changes you can make to > make it easier to answer? It wasn't me that downvoted but I suspect it's > been done to encourage you to improve the question. Downvotes can (and often > are) reversed. I've had many more downvotes than you once, but then I > improved it and it went to +10. > > And, it's Friday and we've all had a long week! Beautiful advice, Matthew! Sam - I've provided my answer (and even used Reduce since you seem to be coming from Lisp land), but I also think some of the down votes/comments have their merit. From aragorn168b at gmail.com Sat Apr 27 17:49:13 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 27 Apr 2013 17:49:13 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" Message-ID: Hello, I thought I'd also chip-in my thoughts to eddi's feature request. Short answer: I don't think this feature is necessary. I basically agree with mnel's reply. Long answer: My argument goes along these lines (in addition to the S3/S4 methods mnel mentions). 
If you for example type `[.data.frame` in your R-session, you'd see this snippet: if (is.matrix(i)) return(as.matrix(x)[i]) That is, if you do: df <- data.frame(x=1:5, y=1:5, z=1:5) mm <- matrix(1:12, ncol=3) df[mm] # gives [1] 1 2 3 4 5 1 2 3 4 5 1 2 df <- data.frame(x=1:2, y=1:2, z=1:2) df[mm] # gives [1] 1 2 1 2 1 2 NA NA NA NA NA NA Here, the indexing is a matrix. It's obvious. Now, should this behaviour be changed because people would be confused that subsetting a data.frame resulted in a vector? Or because it's not user friendly? Even better, try out `df[mm, ]`. If `i` is a matrix, this is what the code does. I am not convinced this is "bad" design. Functions take arguments of different types ALL the time and they return outputs *depending on the type of input*. This is why I am not sold on the point of "bad design". It's essential to know the type of objects `i` can take and *understand* it. If a function is designed that takes several types of objects for `i` and their behaviour is documented, and the documented behaviour is consistent, then I can't accept there's a problem. I agree there are people who don't read the manual and "try" things out. But they are going to have problems with every other function in R. For example, "unstack" is a function for which same input type gives different output type. That is, it provides a data.frame if the columns are equal after unstaking and list if they are not. That is, compare the outputs of: df <- data.frame(x=rep(1:3, each=3), y=1:9) unstack(df, y ~ x) with df <- data.frame(x=c(rep(1:3, each=3), 3), y=1:10) unstack(df, y ~ x) But if people don't read the documentation, they wouldn't know this difference until they land up on errors. Now, making it user-friendly would mean that it "always" returns a list. Now, is this "bad" design because it gives two object types for same input? Does it require a change? I personally don't think so. To sum up, what eddi points out as "not being user-friendly" (or arguably "bad design") is everywhere inside R if you look closely. My view is that it's very clear that there should be some effort in understanding a function before using it. Not all functions are plain simple. Some functions have exceptions and some packages have a steep learning curve. Best, Arun. On Sat, Apr 27, 2013 at 12:00 PM, < datatable-help-request at lists.r-forge.r-project.org> wrote: > > Send datatable-help mailing list submissions to > datatable-help at lists.r-forge.r-project.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > or, via email, send a message with subject or body 'help' to > datatable-help-request at lists.r-forge.r-project.org > > You can reach the person managing the list at > datatable-help-owner at lists.r-forge.r-project.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of datatable-help digest..." > > > Today's Topics: > > 1. Re: changing data.table by-without-by syntax to require a > "by" (Frank Erickson) > 2. Re: variable column names (Sam Steingold) > 3. Re: variable column names (Matthew Dowle) > 4. Re: changing data.table by-without-by syntax to require a > "by" (Matthew Dowle) > 5. 
Re: variable column names (Victor Kryukov)
>
> [...]
>
> End of datatable-help Digest, Vol 38, Issue 26
> **********************************************

From statquant at outlook.com  Sun Apr 28 20:38:35 2013
From: statquant at outlook.com (stat quant)
Date: Sun, 28 Apr 2013 20:38:35 +0200
Subject: [datatable-help] Porting data.table to Rcpp
Message-ID: 

Hello list,
I am nearly a beginner when it comes to C++. I like data.table very much and I am
interested in Rcpp too. I am wondering how hard it would be to have a data.table
API so that data.table's greatness could be accessed from C++ via Rcpp.

Cheers
Colin

From mdowle at mdowle.plus.com  Sun Apr 28 23:52:51 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 28 Apr 2013 22:52:51 +0100
Subject: Re: [datatable-help] Porting data.table to Rcpp
In-Reply-To: 
References: 
Message-ID: <991982131055d51a41ec5f5d0eedba87@imap.plus.net>

Hi,
I don't know C++ or Rcpp very well so couldn't estimate how hard. But it rings a
bell as having been discussed before. I searched datatable-help for "Rcpp" with no
luck, but an S.O. search for "[data.table] Rcpp" returns 15 hits, so there may be
clues in there somewhere. If changes are needed to data.table then that's fine by
me.

Matthew

On 28.04.2013 19:38, stat quant wrote:
> Hello list,
> I am nearly a beginner when it comes to C++. I like data.table very much and
> I am interested in Rcpp too. I am wondering how hard it would be to have a
> data.table API so that data.table's greatness could be accessed from C++
> via Rcpp.
>
> Cheers
> Colin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From saporta at scarletmail.rutgers.edu Mon Apr 29 07:29:31 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Mon, 29 Apr 2013 01:29:31 -0400 Subject: [datatable-help] Porting data.table to Rcpp In-Reply-To: References: Message-ID: Hey Colin, This sounds like an interesting idea. What specifically did you have in mind? I would be willing to lend a hand. -Rick Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu On Sun, Apr 28, 2013 at 2:38 PM, stat quant wrote: > Hello list, > I am nearly a beginer whn It comes to C++, I like data.table very much and > I am interested to Rcpp too. > I am wondering how hard would it be to have a data.table API to be able to > access data.table greatness from C++ via Rcpp. > > Cheers > Colin > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Mon Apr 29 15:40:27 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Mon, 29 Apr 2013 08:40:27 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: Message-ID: Thanks Arun, the examples you give are probably interesting in their own right, but your post doesn't address advantages/disadvantages of either current or proposed syntaxes and simply points out the (obvious) fact that current (and other, similar in some ways to current) behavior is possible to implement in R. On Sat, Apr 27, 2013 at 10:49 AM, Arunkumar Srinivasan < aragorn168b at gmail.com> wrote: > Hello, > I thought I'd also chip-in my thoughts to eddi's feature request. > Short answer: I don't think this feature is necessary. I basically agree > with mnel's reply. > Long answer: My argument goes along these lines (in addition to the S3/S4 > methods mnel mentions). If you for example type `[.data.frame` in your > R-session, you'd see this snippet: > > if (is.matrix(i)) > return(as.matrix(x)[i]) > > That is, if you do: > > df <- data.frame(x=1:5, y=1:5, z=1:5) > mm <- matrix(1:12, ncol=3) > df[mm] # gives > [1] 1 2 3 4 5 1 2 3 4 5 1 2 > > df <- data.frame(x=1:2, y=1:2, z=1:2) > df[mm] # gives > [1] 1 2 1 2 1 2 NA NA NA NA NA NA > > Here, the indexing is a matrix. It's obvious. Now, should this behaviour > be changed because people would be confused that subsetting a data.frame > resulted in a vector? Or because it's not user friendly? Even better, try > out `df[mm, ]`. If `i` is a matrix, this is what the code does. I am not > convinced this is "bad" design. Functions take arguments of different types > ALL the time and they return outputs *depending on the type of input*. This > is why I am not sold on the point of "bad design". It's essential to know > the type of objects `i` can take and *understand* it. > > If a function is designed that takes several types of objects for `i` and > their behaviour is documented, and the documented behaviour is consistent, > then I can't accept there's a problem. > > I agree there are people who don't read the manual and "try" things out. > But they are going to have problems with every other function in R. > > For example, "unstack" is a function for which same input type gives > different output type. That is, it provides a data.frame if the columns are > equal after unstaking and list if they are not. 
> [...]

From eduard.antonyan at gmail.com  Mon Apr 29 15:43:19 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 29 Apr 2013 08:43:19 -0500
Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: 
References: 
Message-ID: 

It might help to think of this as an improvement proposal rather than a problem
fix proposal.

On Mon, Apr 29, 2013 at 8:40 AM, Eduard Antonyan wrote:
> Thanks Arun, the examples you give are probably interesting in their own
> right, but your post doesn't address advantages/disadvantages of either
> current or proposed syntaxes and simply points out the (obvious) fact that
> current (and other, similar in some ways to current) behavior is possible
> to implement in R.
> [...]
>> > URL: < >> http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130427/260f3119/attachment-0001.html >> > >> > >> > ------------------------------ >> > >> > Message: 5 >> > Date: Fri, 26 Apr 2013 16:42:04 -0700 >> > From: Victor Kryukov >> > To: Matthew Dowle >> > Cc: datatable-help at lists.r-forge.r-project.org, sds at gnu.org >> >> > Subject: Re: [datatable-help] variable column names >> > Message-ID: >> > > nJgw at mail.gmail.com> >> > Content-Type: text/plain; charset=ISO-8859-1 >> >> > >> > On Fri, Apr 26, 2013 at 3:47 PM, Matthew Dowle >> wrote: >> > > On 26.04.2013 23:02, Sam Steingold wrote: >> > >>> >> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns >> > >> >> > >> downvoted, unlikely to be answered. >> > > >> > > I've read it through. >> > > >> > > Perhaps sleep on it, don't look for 24hrs and look again as if you >> were >> > > trying to answer it yourself. Are there any small changes you can >> make to >> > > make it easier to answer? It wasn't me that downvoted but I suspect >> it's >> > > been done to encourage you to improve the question. Downvotes can >> (and often >> > > are) reversed. I've had many more downvotes than you once, but then I >> > > improved it and it went to +10. >> > > >> > > And, it's Friday and we've all had a long week! >> > >> > Beautiful advice, Matthew! >> > >> > Sam - I've provided my answer (and even used Reduce since you seem to >> > be coming from Lisp land), but I also think some of the down >> > votes/comments have their merit. >> > >> > >> > ------------------------------ >> > >> > _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org >> >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >> > End of datatable-help Digest, Vol 38, Issue 26 >> > ********************************************** >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From s_milberg at hotmail.com Mon Apr 29 22:21:11 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Mon, 29 Apr 2013 16:21:11 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: , , Message-ID: Also, the issue isn't that data.table has different behavior given different types of inputs. I don't think there is anything wrong with doing that. After all, I think everyone here is okay with a data.table as `i` vs. a vector or a variable name producing different outcomes. The concern here is about which other behavior gets triggered. The default behavior when using a data.table for `i` and nothing for `by` is a somewhat advanced outcome that can't be easily predicted or understood by people who understand the basic operation of data.table (i.e. `i` is for join/indexing, `j` is for evaluating expressions in the context of DT, `by` is for split-apply-combine). As a result usage and documentation become more inaccessible than they could be. S. 
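As a minimal sketch of the default behaviour described above (hypothetical tables X and Y, not from the thread), showing what 1.8.x does when `i` is a data.table and no `by` is supplied:

library(data.table)
X <- data.table(id = c("a","a","b","c"), v = 1:4, key = "id")
Y <- data.table(id = c("a","c"))

X[Y]              # plain join: the rows of X matching each row of Y
X[Y, sum(v)]      # by-without-by: j runs once per row of Y,
                  # giving one result row per i row (id "a" and id "c")
X[Y][, sum(v)]    # chained form: join first, then j over the whole joined result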
From eduard.antonyan at gmail.com Mon Apr 29 22:51:26 2013
From: eduard.antonyan at gmail.com (eddi)
Date: Mon, 29 Apr 2013 13:51:26 -0700 (PDT)
Subject: [datatable-help] minor formatting issue
Message-ID: <1367268686060-4665760.post@n4.nabble.com>

When joining vs not-joining the order of the key column is different in the output:

> dt = data.table(a = c(1:4), b = c(1:4), key = "b")
> dt[J(1)]
   b a
1: 1 1
> dt[!J(1)]
   a b
1: 2 2
2: 3 3
3: 4 4

I don't usually care about column order, but this could become a surprise issue for people reading/writing to files.
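One possible workaround for the column-order difference noted above, assuming a data.table version that provides setcolorder() (a sketch, not from the thread):

library(data.table)
dt  <- data.table(a = 1:4, b = 1:4, key = "b")
res <- dt[J(1)]                # the join output leads with the key column: b, a
setcolorder(res, names(dt))    # restore the original a, b order before writing
# write.csv(res, "out.csv", row.names = FALSE)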
-- View this message in context: http://r.789695.n4.nabble.com/minor-formatting-issue-tp4665760.html Sent from the datatable-help mailing list archive at Nabble.com. From michael.nelson at sydney.edu.au Tue Apr 30 00:36:08 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Mon, 29 Apr 2013 22:36:08 +0000 Subject: [datatable-help] minor formatting issue In-Reply-To: <1367268686060-4665760.post@n4.nabble.com> References: <1367268686060-4665760.post@n4.nabble.com> Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD705977FA@EX-MBX-PRO-04.mcs.usyd.edu.au> Good spotting. File a bug report. ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of eddi [eduard.antonyan at gmail.com] Sent: Tuesday, 30 April 2013 6:51 AM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] minor formatting issue When joining vs not-joining the order of the key column is different in the output: > dt = data.table(a = c(1:4), b = c(1:4), key = "b") > dt[J(1)] b a 1: 1 1 > dt[!J(1)] a b 1: 2 2 2: 3 3 3: 4 4 I don't usually care about column order, but this could become a surprise issue for people reading/writing to files. -- View this message in context: http://r.789695.n4.nabble.com/minor-formatting-issue-tp4665760.html Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Tue Apr 30 09:53:39 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 08:53:39 +0100 Subject: [datatable-help] Size of posts - moderation queue Message-ID: <353c37d353379414321baf7ceba35cff@imap.plus.net> Hello, If a thread history grows to be over 40KB then mailman holds in a moderation queue. This hasn't happened much before until now. Not my limit but it seems sensible anyway. So if a thread grows, just chop down the history and it should go through. Thanks, Matthew From aragorn168b at gmail.com Tue Apr 30 09:58:37 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 09:58:37 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <8EDE8235CC054C71A23DAFC87B3F02C1@gmail.com> <09BDD69ADDBD4CDD889884347CB7276D@gmail.com> <4F066386B43546C19538B5BC06EB4882@gmail.com> Message-ID: (The earlier message was too long and was rejected.) So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=1) DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=c(1,2,1), w=c(11:13)) # what's the output supposed to be for? DT1[DT2, y, .JOIN=FALSE] DT1[DT2, .JOIN = FALSE] Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? Is this supposed to also do a "cross-apply" on the logical subset? I guess not. 
So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. Best, Arun. -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Tue Apr 30 11:19:23 2013 From: statquant at outlook.com (statquant3) Date: Tue, 30 Apr 2013 02:19:23 -0700 (PDT) Subject: [datatable-help] Porting data.table to Rcpp In-Reply-To: References: Message-ID: <1367313563752-4665801.post@n4.nabble.com> Ricky, I tend to use data.table + Rcpp "a lot" now, my usual usecase is 1. create a Rcpp function that takes the data.table say "DT" (as a data.frame) 2. create Rcpp::NumericVectors within the function (say "f") 3: return those vectors as a list and add them by reference to the initial data.table with DT[, names(f(DT)):=f(DT)] This works well and is efficient I think, but if you have to do more complicated stuff, requiring setting keys within C++ that's impossible as it is. What is missing I think is: 1. Possibility to modify by reference the data.table within C++ ( in this post http://stackoverflow.com/questions/15731106/passing-by-reference-a-data-frame-and-updating-it-with-rcpp Romain Francois showed me how to create another data.frame sharing the same data than the initial data.table but this is not quite yet what we want (I think)) 2. Possibility to call data.table functions like merge... within C++ This is totally out of my reach I think, as I am barely a user of data.table and Rcpp, but some brighter devs could find this project interesting -- View this message in context: http://r.789695.n4.nabble.com/Porting-data-table-to-Rcpp-tp4665667p4665801.html Sent from the datatable-help mailing list archive at Nabble.com. From eduard.antonyan at gmail.com Tue Apr 30 14:54:33 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Tue, 30 Apr 2013 07:54:33 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <8EDE8235CC054C71A23DAFC87B3F02C1@gmail.com> <09BDD69ADDBD4CDD889884347CB7276D@gmail.com> <4F066386B43546C19538B5BC06EB4882@gmail.com> Message-ID: <-8694790273355420813@unknownmsgid> Arun, If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: (The earlier message was too long and was rejected.) So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=1) DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=c(1,2,1), w=c(11:13)) # what's the output supposed to be for? 
DT1[DT2, y, .JOIN=FALSE] DT1[DT2, .JOIN = FALSE] Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. Best, Arun. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 15:48:07 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 15:48:07 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <-8694790273355420813@unknownmsgid> References: <8EDE8235CC054C71A23DAFC87B3F02C1@gmail.com> <09BDD69ADDBD4CDD889884347CB7276D@gmail.com> <4F066386B43546C19538B5BC06EB4882@gmail.com> <-8694790273355420813@unknownmsgid> Message-ID: <5AD5B1D231A045329D46159FB5297739@gmail.com> Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. >From what I understand from your reply, if (.JOIN = FALSE), then, DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). Arun On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > Arun, > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. 
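To make the two forms under discussion concrete, a sketch using the DT1 and DT2 defined earlier in the thread (behaviour as described above for 1.8.x):

library(data.table)
DT1 <- data.table(x = c(1,1,2,3,3), y = 1:5, z = 6:10)
setkey(DT1, "x")
DT2 <- data.table(x = c(1,2,1), w = 11:13)

DT1[DT2, y]      # by-without-by: y is evaluated once per row of DT2 and the
                 # results are stacked, keeping the join column alongside y
DT1[DT2][, y]    # join first, then evaluate y over the whole joined table,
                 # which here returns just the y values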
> > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > (The earlier message was too long and was rejected.) > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > setkey(DT1, "x") > > DT2 <- data.table(x=1) > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > setkey(DT1, "x") > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > # what's the output supposed to be for? > > DT1[DT2, y, .JOIN=FALSE] > > DT1[DT2, .JOIN = FALSE] > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > Best, > > Arun. > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 15:52:01 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 15:52:01 +0200 Subject: [datatable-help] sorting on floating point column Message-ID: Hi there, I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. set.seed(45) dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) head(dt) x y 1: 32 5.395395e-08 2: 16 6.956957e-08 3: 12 2.142142e-08 4: 18 5.855856e-08 5: 17 6.216216e-08 6: 14 5.025025e-08 setkey(dt, "y") # sort by column y head(dt, 10) x y 1: 47 1.401401e-09 2: 12 2.142142e-08 3: 24 1.391391e-08 4: 43 9.809810e-09 <~~~ obviously false 5: 1 2.932933e-08 6: 48 2.562563e-08 7: 49 1.891892e-08 8: 40 2.182182e-08 9: 9 7.307307e-09 <~~~ obviously false 10: 45 2.482482e-08 Best, Arun -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Tue Apr 30 16:09:03 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Tue, 30 Apr 2013 10:09:03 -0400 Subject: [datatable-help] sorting on floating point column In-Reply-To: References: Message-ID: I'm seeing the same thing as Arun: > dt[, diffs := round(c(NA, diff(y) * 1e8), 3)] > dt x y diffs 1: 19 4.594595e-08 NA 2: 17 7.007007e-08 2.412 3: 45 3.543544e-08 -3.463 4: 38 6.326326e-08 2.783 5: 23 7.847848e-08 1.522 6: 46 5.975976e-08 -1.872 7: 3 3.073073e-08 -2.903 8: 4 9.909910e-08 6.837 9: 16 5.535536e-08 -4.374 10: 25 9.609610e-08 4.074 11: 24 9.309309e-08 -0.300 12: 12 7.000022e-01 70000210.691 13: 31 3.453453e-08 -70000216.547 14: 34 5.565566e-08 2.112 15: 14 1.241241e-08 -4.324 On Tue, Apr 30, 2013 at 9:52 AM, Arunkumar Srinivasan wrote: > Hi there, > > I just saw something strange when I was sorting a column of p-values. I > checked the data.table bug tracker for words "sort" and "floating point" > and there were no hits for this case. There's a bug for "integer 64" sort > on a column though. > > So, here's a reproducible example. I'd be glad to file a bug, if it is and > be corrected if it's something I am doing wrong. > > set.seed(45) > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), > 7000000:7000100), 50)/1e7) > head(dt) > x y > 1: 32 5.395395e-08 > 2: 16 6.956957e-08 > 3: 12 2.142142e-08 > 4: 18 5.855856e-08 > 5: 17 6.216216e-08 > 6: 14 5.025025e-08 > setkey(dt, "y") # sort by column y > head(dt, 10) > x y > 1: 47 1.401401e-09 > 2: 12 2.142142e-08 > 3: 24 1.391391e-08 > 4: 43 9.809810e-09 <~~~ obviously false > 5: 1 2.932933e-08 > 6: 48 2.562563e-08 > 7: 49 1.891892e-08 > 8: 40 2.182182e-08 > 9: 9 7.307307e-09 <~~~ obviously false > 10: 45 2.482482e-08 > > Best, > Arun > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 30 16:09:25 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 15:09:25 +0100 Subject: [datatable-help] sorting on floating point column In-Reply-To: References: Message-ID: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> Hi, data.table sorts double within machine tolerance : > sqrt(.Machine$double.eps) [1] 1.490116e-08 > i.e. numbers closer than this are considered equal. Otherwise we wouldn't be able to do things like DT[.(3.14)]. I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. Matthew On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > Hi there, > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. > So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. 
> > set.seed(45) > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) > head(dt) > x y > 1: 32 5.395395e-08 > 2: 16 6.956957e-08 > 3: 12 2.142142e-08 > 4: 18 5.855856e-08 > 5: 17 6.216216e-08 > 6: 14 5.025025e-08 > setkey(dt, "y") # sort by column y > head(dt, 10) > x y > 1: 47 1.401401e-09 > 2: 12 2.142142e-08 > 3: 24 1.391391e-08 > 4: 43 9.809810e-09 <~~~ obviously false > 5: 1 2.932933e-08 > 6: 48 2.562563e-08 > 7: 49 1.891892e-08 > 8: 40 2.182182e-08 > 9: 9 7.307307e-09 <~~~ obviously false > 10: 45 2.482482e-08 > > Best, > Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 30 16:13:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 15:13:09 +0100 Subject: [datatable-help] sorting on floating point column In-Reply-To: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> Message-ID: <69aefde0f4d3a4eb53b708a7f5df6888@imap.plus.net> Or, perhaps the tolerance should be a function of the range of the column. [The range would be quick to calculate with a single C for loop.] On 30.04.2013 15:09, Matthew Dowle wrote: > Hi, > > data.table sorts double within machine tolerance : > >> sqrt(.Machine$double.eps) > [1] 1.490116e-08 >> > > i.e. numbers closer than this are considered equal. > > Otherwise we wouldn't be able to do things like DT[.(3.14)]. > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? > > In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. > > Matthew > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > >> Hi there, >> I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. >> So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. >> >> set.seed(45) >> dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) >> head(dt) >> x y >> 1: 32 5.395395e-08 >> 2: 16 6.956957e-08 >> 3: 12 2.142142e-08 >> 4: 18 5.855856e-08 >> 5: 17 6.216216e-08 >> 6: 14 5.025025e-08 >> setkey(dt, "y") # sort by column y >> head(dt, 10) >> x y >> 1: 47 1.401401e-09 >> 2: 12 2.142142e-08 >> 3: 24 1.391391e-08 >> 4: 43 9.809810e-09 <~~~ obviously false >> 5: 1 2.932933e-08 >> 6: 48 2.562563e-08 >> 7: 49 1.891892e-08 >> 8: 40 2.182182e-08 >> 9: 9 7.307307e-09 <~~~ obviously false >> 10: 45 2.482482e-08 >> >> Best, >> Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 16:16:03 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 16:16:03 +0200 Subject: [datatable-help] sorting on floating point column In-Reply-To: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> Message-ID: <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> Matthew, I see. I din't think about tolerance. Although dt[with(dt, order(y)), ] seems to do the task right (similar to data.frame). 
I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer. Arun On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote: > > Hi, > data.table sorts double within machine tolerance : > > sqrt(.Machine$double.eps) > [1] 1.490116e-08 > > > > i.e. numbers closer than this are considered equal. > > Otherwise we wouldn't be able to do things like DT[.(3.14)]. > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? > > In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. > > Matthew > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > > Hi there, > > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. > > So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. > > set.seed(45) > > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) > > head(dt) > > x y > > 1: 32 5.395395e-08 > > 2: 16 6.956957e-08 > > 3: 12 2.142142e-08 > > 4: 18 5.855856e-08 > > 5: 17 6.216216e-08 > > 6: 14 5.025025e-08 > > setkey(dt, "y") # sort by column y > > head(dt, 10) > > x y > > 1: 47 1.401401e-09 > > 2: 12 2.142142e-08 > > 3: 24 1.391391e-08 > > 4: 43 9.809810e-09 <~~~ obviously false > > 5: 1 2.932933e-08 > > 6: 48 2.562563e-08 > > 7: 49 1.891892e-08 > > 8: 40 2.182182e-08 > > 9: 9 7.307307e-09 <~~~ obviously false > > 10: 45 2.482482e-08 > > > > Best, > > Arun > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 30 16:22:54 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 15:22:54 +0100 Subject: [datatable-help] sorting on floating point column In-Reply-To: <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> Message-ID: <2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net> Maybe it doesn't actually need to sort within machine tolerance. If it was precise, the sort would be faster, that's for sure. But at the time, I remember thinking that it should preserve the order of rows within a group of values within machine tolerance (e.g. 3.99999999, 4.00000001, 3.99999999 should be consider 4.0 and order of those 3 rows maintained). But maybe sorting them to 3.99999999, 3.99999999, 4.00000001 is ok as it's just the join that should be within machine tolerance? Interested in how fast order(y) is, though. Compared to data.table sorting of doubles. Matthew On 30.04.2013 15:16, Arunkumar Srinivasan wrote: > Matthew, > I see. I din't think about tolerance. Although > dt[with(dt, order(y)), ] > seems to do the task right (similar to data.frame). I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer. 
> > Arun > > On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote: > >> Hi, >> >> data.table sorts double within machine tolerance : >> >>> sqrt(.Machine$double.eps) >> [1] 1.490116e-08 >>> >> >> i.e. numbers closer than this are considered equal. >> >> Otherwise we wouldn't be able to do things like DT[.(3.14)]. >> >> I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? >> >> In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. >> >> Matthew >> >> On 30.04.2013 14:52, Arunkumar Srinivasan wrote: >> >>> Hi there, >>> I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. >>> So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. >>> >>> set.seed(45) >>> dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) >>> head(dt) >>> x y >>> 1: 32 5.395395e-08 >>> 2: 16 6.956957e-08 >>> 3: 12 2.142142e-08 >>> 4: 18 5.855856e-08 >>> 5: 17 6.216216e-08 >>> 6: 14 5.025025e-08 >>> setkey(dt, "y") # sort by column y >>> head(dt, 10) >>> x y >>> 1: 47 1.401401e-09 >>> 2: 12 2.142142e-08 >>> 3: 24 1.391391e-08 >>> 4: 43 9.809810e-09 <~~~ obviously false >>> 5: 1 2.932933e-08 >>> 6: 48 2.562563e-08 >>> 7: 49 1.891892e-08 >>> 8: 40 2.182182e-08 >>> 9: 9 7.307307e-09 <~~~ obviously false >>> 10: 45 2.482482e-08 >>> >>> Best, >>> Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 16:26:21 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 16:26:21 +0200 Subject: [datatable-help] sorting on floating point column In-Reply-To: <2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> <2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net> Message-ID: Matthew, Precisely. That's what I was thinking as well. But was hesitant to tell as I dint know how complex it would be to implement / change it. Since the join requires tolerance, sorting could be still done in the "right" order (by disregarding tolerance during sort). Arun On Tuesday, April 30, 2013 at 4:22 PM, Matthew Dowle wrote: > > Maybe it doesn't actually need to sort within machine tolerance. If it was precise, the sort would be faster, that's for sure. But at the time, I remember thinking that it should preserve the order of rows within a group of values within machine tolerance (e.g. 3.99999999, 4.00000001, 3.99999999 should be consider 4.0 and order of those 3 rows maintained). But maybe sorting them to 3.99999999, 3.99999999, 4.00000001 is ok as it's just the join that should be within machine tolerance? > Interested in how fast order(y) is, though. Compared to data.table sorting of doubles. > Matthew > > On 30.04.2013 15:16, Arunkumar Srinivasan wrote: > > Matthew, > > I see. I din't think about tolerance. Although > > dt[with(dt, order(y)), ] > > seems to do the task right (similar to data.frame). I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. 
Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer. > > Arun > > > > > > On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote: > > > > > > > > Hi, > > > data.table sorts double within machine tolerance : > > > > sqrt(.Machine$double.eps) > > > [1] 1.490116e-08 > > > > > > > > > > i.e. numbers closer than this are considered equal. > > > > > > Otherwise we wouldn't be able to do things like DT[.(3.14)]. > > > > > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? > > > > > > In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. > > > > > > Matthew > > > > > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > > > > Hi there, > > > > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. > > > > So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. > > > > set.seed(45) > > > > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) > > > > head(dt) > > > > x y > > > > 1: 32 5.395395e-08 > > > > 2: 16 6.956957e-08 > > > > 3: 12 2.142142e-08 > > > > 4: 18 5.855856e-08 > > > > 5: 17 6.216216e-08 > > > > 6: 14 5.025025e-08 > > > > setkey(dt, "y") # sort by column y > > > > head(dt, 10) > > > > x y > > > > 1: 47 1.401401e-09 > > > > 2: 12 2.142142e-08 > > > > 3: 24 1.391391e-08 > > > > 4: 43 9.809810e-09 <~~~ obviously false > > > > 5: 1 2.932933e-08 > > > > 6: 48 2.562563e-08 > > > > 7: 49 1.891892e-08 > > > > 8: 40 2.182182e-08 > > > > 9: 9 7.307307e-09 <~~~ obviously false > > > > 10: 45 2.482482e-08 > > > > > > > > Best, > > > > Arun > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Tue Apr 30 17:03:05 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Tue, 30 Apr 2013 10:03:05 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <5AD5B1D231A045329D46159FB5297739@gmail.com> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> Message-ID: Arun, Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. 
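A sketch of the efficiency point above, using the thread's DT1 and a hypothetical DT2u whose rows are unique so that the two forms agree:

library(data.table)
DT1  <- data.table(x = c(1,1,2,3,3), y = 1:5, z = 6:10)
setkey(DT1, "x")
DT2u <- data.table(x = c(1, 3))

DT1[DT2u, sum(y)]              # grouping is done while joining (by-without-by)
DT1[DT2u][, sum(y), by = x]    # join first, then a second grouping pass over the result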
DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. Hope this answers your questions. On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > Eduard, thanks for your reply. But somethings are unclear to me still. > I'll try to explain them below. > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general > (that it is applicable to *every* i operation, which as of now seems > untrue). .JOIN is specific to data.table type for `i`. > > From what I understand from your reply, if (.JOIN = FALSE), then, > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > Is this right? It's a bit confusing because I think you're okay with > "by-without-by" and I got the impression from Sadao that he finds the > syntax of "by-without-by" unaccessible/advanced for basic users. So, just > to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the > "by-without-by" and then result in a "vector", right? > > Matthew explains in the current documentation that DT1[DT2][, y] would > "join" all columns of DT1 and DT2 and then subset. I assume the > implementation underneath is *not* DT1[DT2][, y] rather the result is an > efficient equivalence. Then, that of course seems alright to me. > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` > doesn't make sense/has no purpose to me. At least I can't think of any at > the moment. > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as > DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results > in getting evaluated as a scalar for every group in the current > by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. > Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` > instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of > DT1[i, list(x,y)]. > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's > the purpose of `drop` then (and also how it *doesn't* suit here as compared > to .JOIN). > > Arun > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > Arun, > > If the new boolean is false, the result would be the same as without it > and would be equal to current behavior of d[i][, j]. If it's true, it will > only have an effect if i is a join (I think each.i= fits slightly better > for this description than .join=) - this will replicate current underlying > behavior. If you think the cross-apply is something that could work not > just for i being a data-table but other things as well, then it would make > perfect sense to implement that action too when the bool is true. > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan > wrote: > > (The earlier message was too long and was rejected.) > So, from the discussion so far, I see that Matthew is nice enough to > implement `.JOIN` or `cross.apply`. I've a couple of questions. 
> Suppose,
>
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=1)
> DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this.
> I expect here the same output as the current DT1[DT2, y]
>
> The above syntax seems "okay". But my first question is what is
> `.JOIN=FALSE` supposed to do under these two circumstances? Suppose,
>
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=c(1,2,1), w=c(11:13))
> # what's the output supposed to be for?
> DT1[DT2, y, .JOIN=FALSE]
> DT1[DT2, .JOIN = FALSE]
>
> Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how
> does it work with `subset`?
>
> DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored?
> Is this supposed to also do a "cross-apply" on the logical subset? I
> guess not. So, .JOIN is an "extra" parameter that comes into play *only*
> when `i` is a `data.table`?
>
> I'd love to have some replies to these questions for me to take a stance
> on `.JOIN`. Thank you.
>
> Best,
> Arun.

From p.harding at paniscus.com Tue Apr 30 19:01:50 2013
From: p.harding at paniscus.com (Paul Harding)
Date: Tue, 30 Apr 2013 18:01:50 +0100
Subject: [datatable-help] fread on very large file
Message-ID:

Problem with fread on a large file.

The file is 8GB, just short of 200,000 lines, produced as SQL output and modified by cygwin/perl to remove the second line.

Using data.table 1.8.8 on R 3.0.0 I get an fread error:

fread("data/spd_all_fixed.csv",sep=",")
Error in fread("data/spd_all_fixed.csv", sep = ",") :
  Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0

Looking for the offending line, with line numbers in the output, so I'm guessing this is line 6 of the mid-file chunk examined,

$ grep -n '204038,2617097,201108' spd_all_fixed.csv
8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0

and comparing to surrounding lines and the first ten lines

$ head spd_all_fixed.csv
s_key,i_key,p_key,q,pq,d,l,epi,class
203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13

I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command.

Regards
Paul

From mdowle at mdowle.plus.com Tue Apr 30 19:52:54 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 30 Apr 2013 18:52:54 +0100
Subject: [datatable-help] fread on very large file
In-Reply-To:
References:
Message-ID: <6215268129090c5164b66264010bea9b@imap.plus.net>

Hi,

Thanks for reporting this. Please set verbose=TRUE and let us know the output.
Thanks,
Matthew

On 30.04.2013 18:01, Paul Harding wrote:
> Problem with fread on a large file.
>
> The file is 8GB, just short of 200,000 lines, produced as SQL output and modified by cygwin/perl to remove the second line.
>
> Using data.table 1.8.8 on R 3.0.0 I get an fread error:
>
> fread("data/spd_all_fixed.csv",sep=",")
> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>   Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>
> Looking for the offending line, with line numbers in the output, so I'm guessing this is line 6 of the mid-file chunk examined,
>
> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>
> and comparing to surrounding lines and the first ten lines
>
> $ head spd_all_fixed.csv
> s_key,i_key,p_key,q,pq,d,l,epi,class
> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>
> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command.
>
> Regards
> Paul

From rv15i at yahoo.se Tue Apr 30 20:38:51 2013
From: rv15i at yahoo.se (ravi)
Date: Tue, 30 Apr 2013 19:38:51 +0100 (BST)
Subject: [datatable-help] fread (sep2) on data with a comma as decimal delimiter
Message-ID: <1367347131.53228.YahooMailNeo@web171302.mail.ir2.yahoo.com>

Hi,
I have a huge Excel file that I have converted to a tab delimited file. The numerical data have a comma as a decimal delimiter. I made a compressed version of the file by just taking the first 100 rows. On this, I have confirmed that the following command works fine:

df<-read.table(file=file1,header=TRUE,sep="\t",dec=",",encoding="latin1")

The following data.table also appears to work OK:

dt<-fread(file1,sep="\t")

But the numerical data end up as characters. I would like to have help with the most efficient method of converting these into numeric class. I note that sep2 has not been implemented yet. Is there any workaround? Can I specify the encoding also?
Would appreciate any help that I can get.
Thanks,
Ravi

From mdowle at mdowle.plus.com Tue Apr 30 20:48:32 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 30 Apr 2013 19:48:32 +0100
Subject: [datatable-help] fread (sep2) on data with a comma as decimal delimiter
In-Reply-To: <1367347131.53228.YahooMailNeo@web171302.mail.ir2.yahoo.com>
References: <1367347131.53228.YahooMailNeo@web171302.mail.ir2.yahoo.com>
Message-ID: <79e76ef89c8c532d8c2c7cafa8b0a86e@imap.plus.net>

Hi,

Ah yes, fread is locale aware. So if you Sys.setlocale() for the numeric option to say the decimal separator is a comma, then fread should heed that. Somewhere, either on S.O.
or datatable-help, this has come up before, with an example, and it was successful. Try searching for "[data.table] Sys.setlocale" (I forget that function's spelling exactly).

We could add this locale change as an option to data.table, but it depends on choosing a particular installed locale that has the comma as separator, and doing this in a cross-platform way is not something I know a huge amount about. There was a concern that locale changes are global, but as far as I know it only affects the current R session, and switching back on.exit() should be safe enough (as a way to build it in).

fread uses a stdlib call to read floating point (rather than R, which does it itself in its own C code). It's that stdlib call that is locale aware, and it is quite convenient (and fast) from fread's internals' point of view.

Matthew

On 30.04.2013 19:38, ravi wrote:
> Hi,
> I have a huge Excel file that I have converted to a tab delimited file. The numerical data have a comma as a decimal delimiter. I made a compressed version of the file by just taking the first 100 rows. On this, I have confirmed that the following command works fine:
>
> df<-read.table(file=file1,header=TRUE,sep="\t",dec=",",encoding="latin1")
>
> The following data.table also appears to work OK:
>
> dt<-fread(file1,sep="\t")
>
> But the numerical data end up as characters. I would like to have help with the most efficient method of converting these into numeric class. I note that sep2 has not been implemented yet. Is there any workaround? Can I specify the encoding also?
> Would appreciate any help that I can get.
> Thanks,
> Ravi
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
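A rough sketch of the locale route described in that reply, plus a plain conversion fallback for the character columns. The locale name and the column names below are assumptions (they depend on the platform and on the file), and whether a particular fread build honours LC_NUMERIC is worth checking on a small sample first:

library(data.table)

old <- Sys.getlocale("LC_NUMERIC")          # remember the current setting
Sys.setlocale("LC_NUMERIC", "de_DE.UTF-8")  # decimal separator becomes ','; "German" on Windows
                                            # (R itself warns that changing LC_NUMERIC can affect other code)
dt <- fread(file1, sep = "\t")              # file1 as in the message above
Sys.setlocale("LC_NUMERIC", old)            # restore: the change is global to the session

# Fallback if numeric columns still arrive as character: convert in place.
# "colA" and "colB" are placeholder column names.
for (col in c("colA", "colB")) {
  set(dt, j = col, value = as.numeric(gsub(",", ".", dt[[col]], fixed = TRUE)))
}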