From vishalhald at gmail.com Tue Apr 2 06:06:37 2013 From: vishalhald at gmail.com (vishal) Date: Mon, 1 Apr 2013 21:06:37 -0700 (PDT) Subject: [datatable-help] Need help with fread function Message-ID: <1364875597879-4663036.post@n4.nabble.com> I am trying to read a 2.5GB pipe-delimited text file in R using fread but I am getting the error below. " Opened file ok, obtained its size on disk (-0.0MB), but couldn't memory map it. This is a 32bit machine. You don't need more RAM per se but this fread function is tuned for 64bit addressability, at the expense of large file support on 32bit machines. You probably need more RAM to store the resulting data.table, anyway. And most speed benefits of data.table are on 64bit with large RAM, too. Please either upgrade to 64bit (e.g. a 64bit netbook with 4GB RAM can cost just £300), or make a case for 32bit large file support to datatable-help." I have used the following syntax. setwd("E:/Projects/Alo") library(data.table) f <- fread("6.txt",header="auto") -- View this message in context: http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Apr 2 09:48:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 02 Apr 2013 08:48:52 +0100 Subject: [datatable-help] Need help with fread function In-Reply-To: <1364875597879-4663036.post@n4.nabble.com> References: <1364875597879-4663036.post@n4.nabble.com> Message-ID: <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> Hi, There seems to be a printing problem in that error message (-0.0MB should say 2.5GB), but other than that I can't think of much else to add that isn't in the error message already. The error message tells us you are using a 32bit computer. Are you on Windows or Linux? 2.5GB is over 2^31 bytes, so at the limit of addressability for 32bit. The file doesn't need to fit in RAM, but it does need to be addressable. For example if you had 1GB of RAM, you should be able to read a 1.5GB file ok. It's not the amount of RAM you have per se, but whether you are 32bit or 64bit. Matthew On 02.04.2013 05:06, vishal wrote: > I am trying to read a 2.5GB pipe-delimited text file in R using fread > but I am > getting the error below. > > " Opened file ok, obtained its size on disk (-0.0MB), but couldn't > memory > map it. This is a 32bit machine. You don't need more RAM per se but > this > fread function is tuned for 64bit addressability, at the expense of > large > file support on 32bit machines. You probably need more RAM to store > the > resulting data.table, anyway. And most speed benefits of data.table > are on > 64bit with large RAM, too. Please either upgrade to 64bit (e.g. a > 64bit > netbook with 4GB RAM can cost just £300), or make a case for 32bit > large > file support to datatable-help." > > I have used the following syntax. > setwd("E:/Projects/Alo") > library(data.table) > f <- fread("6.txt",header="auto") > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036.html > Sent from the datatable-help mailing list archive at Nabble.com. 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From vishalhald at gmail.com Tue Apr 2 09:53:30 2013 From: vishalhald at gmail.com (vishal) Date: Tue, 2 Apr 2013 00:53:30 -0700 (PDT) Subject: [datatable-help] Need help with fread function In-Reply-To: <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> References: <1364875597879-4663036.post@n4.nabble.com> <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> Message-ID: <1364889210584-4663049.post@n4.nabble.com> Hi Matthew, I am using 32 bit windows with 4 GB RAM Vishwesh -- View this message in context: http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036p4663049.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Apr 2 10:44:52 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 02 Apr 2013 09:44:52 +0100 Subject: [datatable-help] Need help with fread function In-Reply-To: <1364889210584-4663049.post@n4.nabble.com> References: <1364875597879-4663036.post@n4.nabble.com> <5627b1d0210cebb207c712a33ebd5d0e@imap.plus.net> <1364889210584-4663049.post@n4.nabble.com> Message-ID: Thanks. Looking at Windows docs there is a GetFileSizeEx and maybe that'll make this 2.5GB file work on 32bit. Have filed here : https://r-forge.r-project.org/tracker/?group_id=240&atid=975&func=detail&aid=2655 But the best we can hope for on 32bit is 4GB support, iiuc, using memory mapping which fread relies on. Quite possible that the real limit is around 3.2GB (whether RAM addressability limits apply to mapping files as well or not on Windows I don't know). I'll let you know when it's in v1.8.9 and you can try again. If this quick fix doesn't work, then I'm not planning on trying harder. As the error message says: 64bit is the way forward. Matthew On 02.04.2013 08:53, vishal wrote: > Hi Matthew, > > I am using 32 bit windows with 4 GB RAM > > Vishwesh > > > > > > -- > View this message in context: > > http://r.789695.n4.nabble.com/Need-help-with-fread-function-tp4663036p4663049.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From npgraham1 at gmail.com Tue Apr 2 20:30:32 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 2 Apr 2013 14:30:32 -0400 Subject: [datatable-help] fread on gzipped files Message-ID: I have a moderately large csv file that's gzipped, but not in a tar archive, so it's "filename.csv.gz" that I want to read into a data.table. I'd like to use fread(), but I can't seem to make it work. I'm currently using the following: data.table(read.csv(gzfile("filename.csv.gz","r"))) Various combinations of gzfile, gzcon, file, readLines, and textConnection all produce an error (invalid input). Is there a better way to read in large, compressed files? ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Tue Apr 2 21:12:03 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 02 Apr 2013 20:12:03 +0100 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: Message-ID: <173fc96df68310b80565cdde75586781@imap.plus.net> Hi, fread memory maps the entire uncompressed file and this is baked into the way it works (e.g. skipping to the beginning, middle and last 5 rows to detect column types before starting to read the rows in) and where the convenience and speed comes from. You could uncompress the .gz to a ramdisk first, and then fread the uncompressed file from that ramdisk, is probably the fastest way. Which should still be pretty quick and I guess unlikely much slower than anything we could build into fread (provided you use a ramdisk). Matthew On 02.04.2013 19:30, Nathaniel Graham wrote: > I have a moderately large csv file that's gzipped, but not in a tar > archive, so it's "filename.csv.gz" that I want to read into a data.table. > I'd like to use fread(), but I can't seem to make it work. I'm currently > using the following: > data.table(read.csv(gzfile("filename.csv.gz","r"))) > Various combinations of gzfile, gzcon, file, readLines, and > textConnection all produce an error (invalid input). Is there a better > way to read in large, compressed files? > > ------- > Nathaniel Graham > npgraham1 at gmail.com [1] > npgraham1 at uky.edu [2] Links: ------ [1] mailto:npgraham1 at gmail.com [2] mailto:npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Tue Apr 2 21:36:07 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 2 Apr 2013 15:36:07 -0400 Subject: [datatable-help] fread on gzipped files In-Reply-To: <173fc96df68310b80565cdde75586781@imap.plus.net> References: <173fc96df68310b80565cdde75586781@imap.plus.net> Message-ID: Thanks, but I suspect that it would take longer to setup and then remove a ramdisk than it would to use read.csv and data.table. My files are moderately large (between 200 MB and 3 GB when compressed), but not enormous; I gzip not so much to save space on disk but to speed up reads. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: > ** > > > > Hi, > > fread memory maps the entire uncompressed file and this is baked into the > way it works (e.g. skipping to the beginning, middle and last 5 rows to > detect column types before starting to read the rows in) and where the > convenience and speed comes from. > > You could uncompress the .gz to a ramdisk first, and then fread the > uncompressed file from that ramdisk, is probably the fastest way. Which > should still be pretty quick and I guess unlikely much slower than anything > we could build into fread (provided you use a ramdisk). > > Matthew > > > > On 02.04.2013 19:30, Nathaniel Graham wrote: > > I have a moderately large csv file that's gzipped, but not in a tar > archive, so it's "filename.csv.gz" that I want to read into a data.table. > I'd like to use fread(), but I can't seem to make it work. I'm currently > using the following: > data.table(read.csv(gzfile("filename.csv.gz","r"))) > Various combinations of gzfile, gzcon, file, readLines, and > textConnection all produce an error (invalid input). Is there a better > way to read in large, compressed files? 
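A minimal sketch of the workaround Matthew suggests above -- stream-decompress the .gz to a temporary file (ideally on a ramdisk such as /dev/shm), then fread the plain file. "filename.csv.gz" is the hypothetical name from the question, not a real file:

library(data.table)

tmp <- tempfile(fileext = ".csv")   # point tempdir() at a ramdisk for extra speed
gz  <- gzfile("filename.csv.gz", open = "rb")   # connection decompresses on read
out <- file(tmp, open = "wb")
while (length(chunk <- readBin(gz, what = raw(), n = 64 * 1024^2)) > 0L) {
  writeBin(chunk, out)              # copy ~64MB of decompressed bytes at a time
}
close(gz); close(out)

DT <- fread(tmp)                    # fread then memory-maps the uncompressed copy
unlink(tmp)                         # remove the temporary uncompressed file

This only trades disk space for a second pass over the data; the fread call itself is unchanged.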
> ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 3 10:58:24 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 03 Apr 2013 09:58:24 +0100 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: <173fc96df68310b80565cdde75586781@imap.plus.net> Message-ID: <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Interesting. How much do you find read.csv is sped up by reading gzip'd files? On 02.04.2013 20:36, Nathaniel Graham wrote: > Thanks, but I suspect that it would take longer to setup and then remove > a ramdisk than it would to use read.csv and data.table. My files are > moderately large (between 200 MB and 3 GB when compressed), but not > enormous; I gzip not so much to save space on disk but to speed up reads. > > ------- > Nathaniel Graham > npgraham1 at gmail.com [3] > npgraham1 at uky.edu [4] > > On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: > >> Hi, >> >> fread memory maps the entire uncompressed file and this is baked into the way it works (e.g. skipping to the beginning, middle and last 5 rows to detect column types before starting to read the rows in) and where the convenience and speed comes from. >> >> You could uncompress the .gz to a ramdisk first, and then fread the uncompressed file from that ramdisk, is probably the fastest way. Which should still be pretty quick and I guess unlikely much slower than anything we could build into fread (provided you use a ramdisk). >> >> Matthew >> >> On 02.04.2013 19:30, Nathaniel Graham wrote: >> >>> I have a moderately large csv file that's gzipped, but not in a tar >>> archive, so it's "filename.csv.gz" that I want to read into a data.table. >>> I'd like to use fread(), but I can't seem to make it work. I'm currently >>> using the following: >>> data.table(read.csv(gzfile("filename.csv.gz","r"))) >>> Various combinations of gzfile, gzcon, file, readLines, and >>> textConnection all produce an error (invalid input). Is there a better >>> way to read in large, compressed files? >>> >>> ------- >>> Nathaniel Graham >>> npgraham1 at gmail.com [1] >>> npgraham1 at uky.edu [2] Links: ------ [1] mailto:npgraham1 at gmail.com [2] mailto:npgraham1 at uky.edu [3] mailto:npgraham1 at gmail.com [4] mailto:npgraham1 at uky.edu [5] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Wed Apr 3 22:20:55 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Wed, 3 Apr 2013 16:20:55 -0400 Subject: [datatable-help] fread on gzipped files In-Reply-To: <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> References: <173fc96df68310b80565cdde75586781@imap.plus.net> <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Message-ID: Subjectively, the difference seems substantial, with large loads taking half or a third as long. Whether I use gzip or not, CPU usage isn't especially high, suggesting that I'm either waiting on the hard drive or that the whole process is memory bound. I was all set to produce some timings for comparison, but I'm working from home today and my home machine struggles to accommodate large files---any difference in load times gets swamped by swapping and general flailing on the part of the OS (I've only got 4GB of RAM at home). Hopefully I'll get around to doing some timings on my work machine sometime this week, since I've got no issues with memory there. 
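The "waiting on the hard drive" hunch can be read straight off system.time() output: user + system is CPU actually burned by R, elapsed is wall-clock time, so a large gap between the two means the read spent most of its time waiting on I/O (or swap). A small sketch with a placeholder file name:

library(data.table)

tm <- system.time(DT <- fread("big.csv"))   # "big.csv" is a placeholder
print(tm)
# If elapsed greatly exceeds user.self + sys.self, the load was I/O- or
# swap-bound; if user.self dominates, parsing was the bottleneck.
tm[["elapsed"]] - (tm[["user.self"]] + tm[["sys.self"]])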
------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle wrote: > ** > > > > Interesting. How much do you find read.csv is sped up by reading gzip'd > files? > > > > On 02.04.2013 20:36, Nathaniel Graham wrote: > > Thanks, but I suspect that it would take longer to setup and then remove > a ramdisk than it would to use read.csv and data.table. My files are > moderately large (between 200 MB and 3 GB when compressed), but not > enormous; I gzip not so much to save space on disk but to speed up reads. > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: > >> >> >> Hi, >> >> fread memory maps the entire uncompressed file and this is baked into the >> way it works (e.g. skipping to the beginning, middle and last 5 rows to >> detect column types before starting to read the rows in) and where the >> convenience and speed comes from. >> >> You could uncompress the .gz to a ramdisk first, and then fread the >> uncompressed file from that ramdisk, is probably the fastest way. Which >> should still be pretty quick and I guess unlikely much slower than anything >> we could build into fread (provided you use a ramdisk). >> >> Matthew >> >> >> >> On 02.04.2013 19:30, Nathaniel Graham wrote: >> >> I have a moderately large csv file that's gzipped, but not in a tar >> archive, so it's "filename.csv.gz" that I want to read into a data.table. >> I'd like to use fread(), but I can't seem to make it work. I'm currently >> using the following: >> data.table(read.csv(gzfile("filename.csv.gz","r"))) >> Various combinations of gzfile, gzcon, file, readLines, and >> textConnection all produce an error (invalid input). Is there a better >> way to read in large, compressed files? >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com >> npgraham1 at uky.edu >> >> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Fri Apr 5 20:59:47 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Fri, 5 Apr 2013 14:59:47 -0400 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: <173fc96df68310b80565cdde75586781@imap.plus.net> <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Message-ID: As promised, I did some testing. The results (described in detail below) are mixed, but suggest that compression is useful for some large data sets, and that if this is a serious issue for someone, they need to do some careful testing before committing to anything (I know, that should be obvious, but...). Also, my results pretty clearly show that fread() crushes read.csv, regardless of whether the csv file is compressed. Nice job Matthew! I start with Current Population Survey data from the Bureau of Labor Statistics. The file I used get be accessed here: ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData I converted it to a csv file using StatTransfer 8 (I'm lazy), with no quoting of strings. I then compressed the csv file using 7-Zip (gzip, Normal). The resulting files, both with 4937221 obs, 5 variables are: ln_data_1.csv : 133625 KB ln_data_1.csv.gz : 17528 KB Given the file size disparity, this should demonstrate any improvements via compression. Also, for comparison, I show fread below. I've made some formatting changes, but changed nothing else. 
for(i in 1:5) { t1 <- system.time(cps1 <- read.csv("ln_data_1.csv")) print(t1) } user system elapsed 12.32 0.53 12.90 12.51 0.44 13.00 12.39 0.47 12.89 12.36 0.55 12.96 12.43 0.36 12.94 for(i in 1:5) { t2 <- system.time(cps1 <- read.csv("ln_data_1.csv.gz")) print(t2) } user system elapsed 14.04 0.26 14.43 14.00 0.27 14.34 14.07 0.31 14.44 13.93 0.28 14.23 14.02 0.32 14.35 for(i in 1:5) { t3 <- system.time(cps1 <- fread("ln_data_1.csv")) print(t3) } user system elapsed 2.89 0.04 2.94 2.92 0.07 2.98 2.88 0.03 2.95 2.87 0.06 2.95 2.91 0.03 2.95 While the gzipped version uses less system time, total & user time has increased somewhat. The fread function from data.table is dramatically faster. While this isn't strictly a fair comparison because fread produces a data.table while read.csv produces a data.frame, the bias is against fread, not for it. Next, I produce a random 2,000,000x10 matrix, write it to csv, and then read it back into memory as a data.frame (or data.table, for fread). I again use 7-Zip for compression.The resulting files are: test2.csv : 375086 KB test2.csv.gz : 165477 KB > matr <- replicate(10,rnorm(2000000)) > write.csv(matr,"test2.csv") > t1 <- system.time(df <- read.csv("test2.csv")) > t2 <- system.time(df <- read.csv("test2.csv.gz")) > t3 <- system.time(df <- fread("test2.csv")) > t1 user system elapsed 165.32 0.36 166.25 > t2 user system elapsed 116.24 0.16 117.08 > t3 user system elapsed 17.64 0.06 17.83 The switch to strictly floating point numbers is significant. Compression is significant improvement--about 49 seconds or about 30%--although nowhere near enough for read.csv to be comparable to fread. Finally, I produce a 20000x1000 matrix. The resulting files are: test1.csv : 354854 KB test1.csv.gz : 157975 KB matr <- replicate(1000,rnorm(20000)) > write.csv(matr,"test1.csv") > t1 <- system.time(df <- read.csv("test1.csv")) > t2 <- system.time(df <- read.csv("test1.csv.gz")) > t3 <- system.time(df <- fread("test1.csv")) > t1 user system elapsed 206.80 1.14 208.60 > t2 user system elapsed 123.42 0.27 123.99 > t3 user system elapsed 17.24 0.09 17.37 Here, compression is an even larger win, improving by about 83 seconds or roughly 40%. The fread function is again dramatically faster, and unlike read.csv, fread's performance is similar regardless of the shape of the matrix. We could create more detailed tests, varying the number of columns vs rows and their type (strings vs integers vs floats, etc) to get better details, but the basic result is that compression can be a noticeable improvement in performance, but a superior read algorithm trumps that. If it's feasible to combine fread's behavior with gzip, bzip2, or xz compression, it could be a big win for some files, but not for all of them. The advice from http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html to compress csv files appears to hold, although it may not save much time if you have a lot of non-float values or few columns. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Wed, Apr 3, 2013 at 4:20 PM, Nathaniel Graham wrote: > Subjectively, the difference seems substantial, with large loads taking > half or a third as long. Whether I use gzip or not, CPU usage isn't > especially high, suggesting that I'm either waiting on the hard drive > or that the whole process is memory bound. 
I was all set to produce > some timings for comparison, but I'm working from home today and > my home machine struggles to accommodate large files---any difference > in load times gets swamped by swapping and general flailing on the > part of the OS (I've only got 4GB of RAM at home). Hopefully I'll get > around to doing some timings on my work machine sometime this > week, since I've got no issues with memory there. > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle wrote: > >> ** >> >> >> >> Interesting. How much do you find read.csv is sped up by reading gzip'd >> files? >> >> >> >> On 02.04.2013 20:36, Nathaniel Graham wrote: >> >> Thanks, but I suspect that it would take longer to setup and then remove >> a ramdisk than it would to use read.csv and data.table. My files are >> moderately large (between 200 MB and 3 GB when compressed), but not >> enormous; I gzip not so much to save space on disk but to speed up reads. >> >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com >> npgraham1 at uky.edu >> >> >> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: >> >>> >>> >>> Hi, >>> >>> fread memory maps the entire uncompressed file and this is baked into >>> the way it works (e.g. skipping to the beginning, middle and last 5 rows to >>> detect column types before starting to read the rows in) and where the >>> convenience and speed comes from. >>> >>> You could uncompress the .gz to a ramdisk first, and then fread the >>> uncompressed file from that ramdisk, is probably the fastest way. Which >>> should still be pretty quick and I guess unlikely much slower than anything >>> we could build into fread (provided you use a ramdisk). >>> >>> Matthew >>> >>> >>> >>> On 02.04.2013 19:30, Nathaniel Graham wrote: >>> >>> I have a moderately large csv file that's gzipped, but not in a tar >>> archive, so it's "filename.csv.gz" that I want to read into a data.table. >>> I'd like to use fread(), but I can't seem to make it work. I'm currently >>> using the following: >>> data.table(read.csv(gzfile("filename.csv.gz","r"))) >>> Various combinations of gzfile, gzcon, file, readLines, and >>> textConnection all produce an error (invalid input). Is there a better >>> way to read in large, compressed files? >>> ------- >>> Nathaniel Graham >>> npgraham1 at gmail.com >>> npgraham1 at uky.edu >>> >>> >>> >>> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Apr 5 21:38:40 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 05 Apr 2013 20:38:40 +0100 Subject: [datatable-help] fread on gzipped files In-Reply-To: References: <173fc96df68310b80565cdde75586781@imap.plus.net> <7059b88b20b16767a63fb4bcde6274f9@imap.plus.net> Message-ID: Fantastic, great job here, thanks! One thing to note is that read.csv is much faster when using the standard tricks (colClasses, nrows etc). That's why the speed comparisons in ?fread are careful to link to online resources that list what the tricks are, and then compare read.csv both with and without them to fread. Of course the "friendly" part of fread is that you don't need to learn or know any tricks, so from that point of view it may well be fair to compare no-frills read.csv to fread as you've done. Good to state that so that nobody accuses of unfair comparisons. But even with the tricks applied, fread is still much faster. With-tricks on a compressed file would be interesting for completeness. 
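For reference, the "standard tricks" version of read.csv that Matthew mentions looks like the sketch below. The row count comes from the benchmark description above; the column classes are assumptions for illustration, not taken from the actual ln_data_1.csv:

# Plain call: read.csv guesses types and grows its buffers as it reads.
df1 <- read.csv("ln_data_1.csv")

# Tuned call: declare column classes up front, tell it how many rows to
# expect, and switch off quote/comment scanning so the parser does less work.
df2 <- read.csv("ln_data_1.csv",
                colClasses = c("character", "integer", "integer",
                               "numeric", "character"),   # assumed types
                nrows = 4937221,          # row count reported in the benchmark
                comment.char = "",
                quote = "",               # file was written with no quoting
                stringsAsFactors = FALSE)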
Thinking about it I suppose fread could read .gz directly. Difficult, but possible. For convenience if nothing else. I'll add it to the list to investigate ... Matthew On 05.04.2013 19:59, Nathaniel Graham wrote: > As promised, I did some testing. The results (described in detail below) are mixed, but suggest that compression is useful for some large data sets, and that if this is a serious issue for someone, they need to do some careful testing before committing to anything (I know, that should be obvious, but...). Also, my results pretty clearly show that fread() crushes read.csv, regardless of whether the csv file is compressed. Nice job Matthew! > I start with Current Population Survey data from the Bureau of Labor Statistics. > The file I used get be accessed here: ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData [9] > I converted it to a csv file using StatTransfer 8 (I'm lazy), with no quoting of strings. I then compressed the csv file using 7-Zip (gzip, Normal). The resulting > files, both with 4937221 obs, 5 variables are: > > ln_data_1.csv : 133625 KB > ln_data_1.csv.gz : 17528 KB > Given the file size disparity, this should demonstrate any improvements via compression. Also, for comparison, I show fread below. I've made some > formatting changes, but changed nothing else. > > for(i in 1:5) { > t1 > print(t1) > } > user system elapsed > 12.32 0.53 12.90 > 12.51 0.44 13.00 > 12.39 0.47 12.89 > 12.36 0.55 12.96 > 12.43 0.36 12.94 > > for(i in 1:5) { > t2 > print(t2) > } > user system elapsed > 14.04 0.26 14.43 > 14.00 0.27 14.34 > 14.07 0.31 14.44 > 13.93 0.28 14.23 > 14.02 0.32 14.35 > > for(i in 1:5) { > t3 > print(t3) > } > user system elapsed > 2.89 0.04 2.94 > 2.92 0.07 2.98 > 2.88 0.03 2.95 > 2.87 0.06 2.95 > 2.91 0.03 2.95 > While the gzipped version uses less system time, total & user time has increased somewhat. The fread function from data.table is dramatically > faster. While this isn't strictly a fair comparison because fread produces > a data.table while read.csv produces a data.frame, the bias is against fread, > not for it. > Next, I produce a random 2,000,000x10 matrix, write it to csv, and then read it back into memory as a data.frame (or data.table, for fread). I again use 7-Zip for compression.The resulting files are: > test2.csv : 375086 KB > test2.csv.gz : 165477 KB > >> matr >> write.csv(matr,"test2.csv") >> t1 >> t2 >> t3 > >> t1 > user system elapsed > 165.32 0.36 166.25 >> t2 > user system elapsed > 116.24 0.16 117.08 >> t3 > user system elapsed > 17.64 0.06 17.83 > The switch to strictly floating point numbers is significant. Compression is significant improvement--about 49 seconds or about 30%--although nowhere near enough for read.csv to be comparable to fread. > Finally, I produce a 20000x1000 matrix. The resulting files are: > test1.csv : 354854 KB > test1.csv.gz : 157975 KB > > matr >> write.csv(matr,"test1.csv") >> t1 >> t2 >> t3 >> t1 > user system elapsed > 206.80 1.14 208.60 >> t2 > user system elapsed > 123.42 0.27 123.99 >> t3 > user system elapsed > 17.24 0.09 17.37 > Here, compression is an even larger win, improving by about 83 seconds or roughly 40%. The fread function is again dramatically faster, and unlike read.csv, fread's performance is similar regardless of the shape of the matrix. 
> We could create more detailed tests, varying the number of columns vs rows > and their type (strings vs integers vs floats, etc) to get better details, but the > basic result is that compression can be a noticeable improvement in performance, but a superior read algorithm trumps that. If it's feasible to > combine fread's behavior with gzip, bzip2, or xz compression, it could be a > big win for some files, but not for all of them. The advice from > http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html [10] to compress csv files appears to hold, although > it may not save much time if you have a lot of non-float values or few columns. > > ------- > Nathaniel Graham > npgraham1 at gmail.com [11] > npgraham1 at uky.edu [12] > > On Wed, Apr 3, 2013 at 4:20 PM, Nathaniel Graham wrote: > >> Subjectively, the difference seems substantial, with large loads taking >> half or a third as long. Whether I use gzip or not, CPU usage isn't >> especially high, suggesting that I'm either waiting on the hard drive >> or that the whole process is memory bound. I was all set to produce >> some timings for comparison, but I'm working from home today and >> my home machine struggles to accommodate large files---any difference >> in load times gets swamped by swapping and general flailing on the >> part of the OS (I've only got 4GB of RAM at home). Hopefully I'll get >> around to doing some timings on my work machine sometime this >> week, since I've got no issues with memory there. >> >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com [6] >> npgraham1 at uky.edu [7] >> >> On Wed, Apr 3, 2013 at 4:58 AM, Matthew Dowle wrote: >> >>> Interesting. How much do you find read.csv is sped up by reading gzip'd files? >>> >>> On 02.04.2013 20:36, Nathaniel Graham wrote: >>> >>>> Thanks, but I suspect that it would take longer to setup and then remove >>>> a ramdisk than it would to use read.csv and data.table. My files are >>>> moderately large (between 200 MB and 3 GB when compressed), but not >>>> enormous; I gzip not so much to save space on disk but to speed up reads. >>>> >>>> ------- >>>> Nathaniel Graham >>>> npgraham1 at gmail.com [3] >>>> npgraham1 at uky.edu [4] >>>> >>>> On Tue, Apr 2, 2013 at 3:12 PM, Matthew Dowle wrote: >>>> >>>>> Hi, >>>>> >>>>> fread memory maps the entire uncompressed file and this is baked into the way it works (e.g. skipping to the beginning, middle and last 5 rows to detect column types before starting to read the rows in) and where the convenience and speed comes from. >>>>> >>>>> You could uncompress the .gz to a ramdisk first, and then fread the uncompressed file from that ramdisk, is probably the fastest way. Which should still be pretty quick and I guess unlikely much slower than anything we could build into fread (provided you use a ramdisk). >>>>> >>>>> Matthew >>>>> >>>>> On 02.04.2013 19:30, Nathaniel Graham wrote: >>>>> >>>>>> I have a moderately large csv file that's gzipped, but not in a tar >>>>>> archive, so it's "filename.csv.gz" that I want to read into a data.table. >>>>>> I'd like to use fread(), but I can't seem to make it work. I'm currently >>>>>> using the following: >>>>>> data.table(read.csv(gzfile("filename.csv.gz","r"))) >>>>>> Various combinations of gzfile, gzcon, file, readLines, and >>>>>> textConnection all produce an error (invalid input). Is there a better >>>>>> way to read in large, compressed files? 
>>>>>> >>>>>> ------- >>>>>> Nathaniel Graham >>>>>> npgraham1 at gmail.com [1] >>>>>> npgraham1 at uky.edu [2] Links: ------ [1] mailto:npgraham1 at gmail.com [2] mailto:npgraham1 at uky.edu [3] mailto:npgraham1 at gmail.com [4] mailto:npgraham1 at uky.edu [5] mailto:mdowle at mdowle.plus.com [6] mailto:npgraham1 at gmail.com [7] mailto:npgraham1 at uky.edu [8] mailto:mdowle at mdowle.plus.com [9] http://webmail.plus.net/ftp://ftp.bls.gov/pub/time.series/ln/ln.data.1.AllData [10] http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html [11] mailto:npgraham1 at gmail.com [12] mailto:npgraham1 at uky.edu [13] mailto:npgraham1 at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.bellot at gmail.com Sun Apr 7 19:01:58 2013 From: david.bellot at gmail.com (David Bellot) Date: Sun, 7 Apr 2013 18:01:58 +0100 Subject: [datatable-help] fread Message-ID: just to say: fread rocks ! Soooo fast ! That's all for today ! David -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.bellot at gmail.com Tue Apr 9 12:32:38 2013 From: david.bellot at gmail.com (David Bellot) Date: Tue, 9 Apr 2013 11:32:38 +0100 Subject: [datatable-help] aggregating data Message-ID: Hi, I have a data.table DT with one of the column named x and I other names, let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. Later on, I aggregate my DT with x like this: agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] The problem is that "x" has 331 unique values as found by length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by changing the key to "x" alone but the problem persists. My DT table has a few millions rows by the way. I'm sure I'm missing something totally obvious :-( !!!! Any idea ? Best, David -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 9 12:39:21 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 09 Apr 2013 11:39:21 +0100 Subject: [datatable-help] aggregating data In-Reply-To: References: Message-ID: That's odd. Please provide result of sessionInfo() and str(DT). Matthew On 09.04.2013 11:32, David Bellot wrote: > Hi, > > I have a data.table DT with one of the column named x and I other names, let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. > > Later on, I aggregate my DT with x like this: > agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] > > The problem is that "x" has 331 unique values as found by length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by changing the key to "x" alone but the problem persists. My DT table has a few millions rows by the way. > > I'm sure I'm missing something totally obvious :-( !!!! > > Any idea ? > Best, > David -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.bellot at gmail.com Wed Apr 10 14:50:01 2013 From: david.bellot at gmail.com (David Bellot) Date: Wed, 10 Apr 2013 13:50:01 +0100 Subject: [datatable-help] aggregating data In-Reply-To: References: Message-ID: actually I found the issue. That was not related to data.table but because I'm comparing float values, it breaks all the time if I do not round() my values before. Basically I have values like 0,1, 1.5, 0.5 etc... 
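The comparison trap being described is easy to reproduce with plain doubles; the table below is made up for illustration and is not David's data:

library(data.table)

0.1 + 0.2 == 0.3                 # FALSE: classic binary floating point
print(0.1 + 0.2, digits = 17)    # 0.30000000000000004

# A column computed in floating point can hold values that print the same
# but differ in the last bits, so unique() reports more values than expected:
DT <- data.table(x = c(0.3, 0.1 + 0.2), y = 1:2)
length(unique(DT$x))             # 2, even though both print as 0.3
DT[, .N, by = round(x, 6)]       # rounding before grouping collapses them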
I know it's bad to do that but I'm not the boss in this project ;-) Just in case other users are reading my email, I can only advise to read that again and again: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html Best, David On Tue, Apr 9, 2013 at 11:39 AM, Matthew Dowle wrote: > ** > > > > That's odd. Please provide result of sessionInfo() and str(DT). > > Matthew > > > > On 09.04.2013 11:32, David Bellot wrote: > > Hi, > > I have a data.table DT with one of the column named x and I other names, > let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. > > Later on, I aggregate my DT with x like this: > agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] > > The problem is that "x" has 331 unique values as found by > length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by > changing the key to "x" alone but the problem persists. My DT table has a > few millions rows by the way. > > I'm sure I'm missing something totally obvious :-( !!!! > > Any idea ? > Best, > David > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 10 15:25:05 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 10 Apr 2013 14:25:05 +0100 Subject: [datatable-help] aggregating data In-Reply-To: References: Message-ID: <31d67766ab3ed22a107a94238e096de7@imap.plus.net> But data.table is floating point aware. You _can_ join to floating point values, and you _can_ group by floating point values. data.table will do that within machine tolerance and take care of it for you. So this may explain why your 'agg' only had 119 rows (because data.table is doing the rounding for you automatically), but length(unique(DT$x)) had 331 ? But, there was a bug or two in this area a few versions ago, mentioned in NEWS. Which is why I asked for sessionInfo() and str(DT) suspecting you had a double column with a slightly older version of data.table. Or, there might be a new problem. If you have to round() in data.table, that doesn't sound right to me. Matthew On 10.04.2013 13:50, David Bellot wrote: > actually I found the issue. That was not related to data.table but because I'm comparing float values, it breaks all the time if I do not round() my values before. Basically I have values like 0,1, 1.5, 0.5 etc... > I know it's bad to do that but I'm not the boss in this project ;-) > > Just in case other users are reading my email, I can only advise to read that again and again: > http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html [1] > > Best, > David > > On Tue, Apr 9, 2013 at 11:39 AM, Matthew Dowle wrote: > >> That's odd. Please provide result of sessionInfo() and str(DT). >> >> Matthew >> >> On 09.04.2013 11:32, David Bellot wrote: >> >>> Hi, >>> >>> I have a data.table DT with one of the column named x and I other names, let's say, a1, a2, ... aN. The key of this data.table is made of a1...aN. >>> >>> Later on, I aggregate my DT with x like this: >>> agg = DT[ , list(m=mean(y), c=length(y)), by = c("x") ] >>> >>> The problem is that "x" has 331 unique values as found by length(unique(DT$x)) but my result "agg" only has 119 rows. I tried by changing the key to "x" alone but the problem persists. My DT table has a few millions rows by the way. >>> >>> I'm sure I'm missing something totally obvious :-( !!!! >>> >>> Any idea ? 
>>> Best, >>> David Links: ------ [1] http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html [2] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From levkowitz at dc-energy.com Wed Apr 10 16:46:24 2013 From: levkowitz at dc-energy.com (Shir Levkowitz) Date: Wed, 10 Apr 2013 10:46:24 -0400 Subject: [datatable-help] Cartesian join invalid key order - bug report Message-ID: I have encountered a bug in the Cartesian join of two data.tables, where the resulting data.table is not sorted by its full key. This is in data.table v1.8.8. Please let me know if this issue has been brought up or if there is any insight regarding it. Thank you, Shir Levkowitz ------------------------------------------------- library(data.table) ###### set up our example data tables test1 <- data.table(a=sample(1:3, 100, replace=TRUE), b=sample(1:3, 100, replace=TRUE), c=sample(1:10, 100,replace=TRUE)) setkey(test1, a,b,c) test2 <- data.table(p=sample(1:3, 100, replace=TRUE), q=sample(1:3, 100, replace=TRUE), r=sample(1:100), w=sample(1:100)) setkey(test2, p,q) ###### a cartesian join - this is where the issue arises test.join <- test1[test2,nomatch=0, allow.cartesian=TRUE] ### have a look at the key k <- key(test.join) k ### if we do a group by, we don't get the right aggregation test.gb <- test.join[,.N,by='a,b,c'] test.gb[a == 1 & b == 1 & c == 1,] ### when really what we want is: test.agg <- aggregate(r ~a+b+c, test.join, length) subset(test.agg, a == 1 & b == 1 & c == 1) ### if we set the same key, we get a warning setkeyv(test.join, k) >> Warning message: In setkeyv(test.join, k) : Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 10 17:06:55 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 10 Apr 2013 16:06:55 +0100 Subject: [datatable-help] Cartesian join invalid key order - bug report In-Reply-To: References: Message-ID: Agreed, new bug. Thanks for reporting. If you could please file on the R-Forge tracker (then you'll get auto updates) or I can file it, don't mind. I will get to the bug list eventually! Thanks, Matthew On 10.04.2013 15:46, Shir Levkowitz wrote: > I have encountered a bug in the Cartesian join of two data.tables, where the resulting data.table is not sorted by its full key. This is in data.table v1.8.8. Please let me know if this issue has been brought up or if there is any insight regarding it. > Thank you, > Shir Levkowitz > > ------------------------------------------------- > > library(data.table) > > ###### set up our example data tables > test1 > b=sample(1:3, 100, replace=TRUE), > c=sample(1:10, 100,replace=TRUE)) > setkey(test1, a,b,c) > > test2 > q=sample(1:3, 100, replace=TRUE), > r=sample(1:100), > w=sample(1:100)) > setkey(test2, p,q) > > ###### a cartesian join - this is where the issue arises > test.join > > ### have a look at the key > k > k > > ### if we do a group by, we don't get the right aggregation > test.gb > test.gb[a == 1 & b == 1 & c == 1,] > ### when really what we want is: > test.agg > subset(test.agg, a == 1 & b == 1 & c == 1) > > ### if we set the same key, we get a warning > setkeyv(test.join, k) >>> Warning message: > In setkeyv(test.join, k) : Already keyed by this key but had invalid row order, key rebuilt. 
If you didn't go under the hood please let datatable-help know so the root cause can be fixed. -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Thu Apr 18 11:28:22 2013 From: statquant at outlook.com (statquant3) Date: Thu, 18 Apr 2013 02:28:22 -0700 (PDT) Subject: [datatable-help] Use of int64 with fread Message-ID: <1366277302282-4664582.post@n4.nabble.com> Hello, a quick question. Given that there is no support yet of int64 in data.table operations. Is it really a good thing to cast automatically int64. I find myself casting back int64 all the time... Cheers -- View this message in context: http://r.789695.n4.nabble.com/Use-of-int64-with-fread-tp4664582.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Thu Apr 18 11:48:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 18 Apr 2013 10:48:09 +0100 Subject: [datatable-help] Use of int64 with fread In-Reply-To: <1366277302282-4664582.post@n4.nabble.com> References: <1366277302282-4664582.post@n4.nabble.com> Message-ID: What do you cast integer64 back to? There is some support, you just can't have integer64 in keys for example yet. They can be useful and work as a value column don't they? (I don't have much need for integer64 myself, so I don't necessarily know.) I've added use.integer64 = TRUE as a global option and argument to fread (not yet committed). That's just a way to turn off the integer64 feature basically, so they'll be read as numeric as read.csv does. Btw, after some to and fro I'm thinking colClasses (when type character vector) would work the same as read.csv, but if type list, then you could pass sets of columns by number or name; i.e., two valid ways to use colClasses in fread would be : colClasses = c(colC="character",colD="character",colE="character",colQ="numeric") # as read.csv or colClasses = list(character=3:6, numeric="colQ") To drop columns use "NULL" in colClasses just as read.csv. To select columns, 'select' may be either character or numeric vector. Sound ok? Matthew On 18.04.2013 10:28, statquant3 wrote: > Hello, > a quick question. > Given that there is no support yet of int64 in data.table operations. > Is it really a good thing to cast automatically int64. I find myself > casting back int64 all the time... > > Cheers > > > > -- > View this message in context: > http://r.789695.n4.nabble.com/Use-of-int64-with-fread-tp4664582.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Thu Apr 18 16:03:18 2013 From: statquant at outlook.com (stat quant) Date: Thu, 18 Apr 2013 16:03:18 +0200 Subject: [datatable-help] Use of int64 with fread In-Reply-To: References: <1366277302282-4664582.post@n4.nabble.com> Message-ID: Hi Matthew, I cast int64 to character, I need to use those as keys but as you pinpoints it is not yet supported, because of it I use strings. I guess numeric (aka double) could be used but comparing numerics can be problematic too. 
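A short base-R illustration of why plain doubles are awkward for full 64-bit IDs (bit64 is the package fread uses for its integer64 columns); the ID value is the one discussed just below:

# Doubles carry 53 bits of mantissa, so distinct integers above 2^53 collapse:
2^53 + 1 == 2^53                 # TRUE

x <- 144454938488621444          # parsed as a double before anything else sees it
sprintf("%.0f", x)               # "144454938488621440" -- already rounded

# bit64::integer64 keeps all 64 bits; build exact values from strings:
library(bit64)
y <- as.integer64("144454938488621444")
y                                # 144454938488621444, held exactly

So a bare numeric literal in a test like DT[myInt64 == 144454938488621444] is rounded before the comparison even happens; writing the constant as as.integer64("144454938488621444") keeps it exact.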
The other usecase is to select stuff, for this case either it is not well suited as doing DT[myInt64 == 144454938488621444] gives me myInt64 equals to 144454938488621440, not sure if the problem is how int64 are displayed or if operator == is different for int64 but that's too much of a problem to me... For the fread that's AWESOME news, It will be usefull for datetime columns, we'll be able to use fasttime on datetime columns! Your way (the OR case) looks better as if you know the first way you can go to the second way in a few lines of code. (May be even internally [hint hint] so we loosers don't have to do anything !!!) 2013/4/18, Matthew Dowle : > > What do you cast integer64 back to? There is some support, you just > can't have integer64 in keys for example yet. They can be useful and > work as a value column don't they? (I don't have much need for > integer64 myself, so I don't necessarily know.) > > I've added use.integer64 = TRUE as a global option and argument to > fread (not yet committed). That's just a way to turn off the integer64 > feature basically, so they'll be read as numeric as read.csv does. > > Btw, after some to and fro I'm thinking colClasses (when type character > vector) would work the same as read.csv, but if type list, then you > could pass sets of columns by number or name; i.e., two valid ways to > use colClasses in fread would be : > > colClasses = > c(colC="character",colD="character",colE="character",colQ="numeric") # > as read.csv > or > colClasses = list(character=3:6, numeric="colQ") > > To drop columns use "NULL" in colClasses just as read.csv. To select > columns, 'select' may be either character or numeric vector. > > Sound ok? > > Matthew > > > On 18.04.2013 10:28, statquant3 wrote: >> Hello, >> a quick question. >> Given that there is no support yet of int64 in data.table operations. >> Is it really a good thing to cast automatically int64. I find myself >> casting back int64 all the time... >> >> Cheers >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/Use-of-int64-with-fread-tp4664582.html >> Sent from the datatable-help mailing list archive at Nabble.com. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From eduard.antonyan at gmail.com Fri Apr 19 21:54:38 2013 From: eduard.antonyan at gmail.com (eddi) Date: Fri, 19 Apr 2013 12:54:38 -0700 (PDT) Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" Message-ID: <1366401278742-4664770.post@n4.nabble.com> Matthew Dowle suggested I put this up for a discussion here. This is continuation of the discussion that started on SO and resulted in FR2696 (I recommend reading the latter first, as it's much more clear). My case for the change boils down to the following: I believe *d[i, j, by = b]* should be always understood to mean *"take d, apply i, return j by b"* instead of the much more complicated current behavior, which is: *"take d, apply i, if i was not a merge, return j by b, if i was a merge, if no by, then return j by key, else if b and b == key, complain and return j by b, else return j by b"* I believe, while disruptive to some current users, this will make data.table much more user-friendly for any future users (one piece of evidence I would suggest for this, besides my plea, is that FAQs 1.13-1.14 (and part of 1.12) would become completely unnecessary). 
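A compact illustration of the by-without-by behaviour being debated, on a toy table under 1.8.x semantics (newer versions of data.table later made the grouped form explicit via by = .EACHI):

library(data.table)

X <- data.table(id = c("a","a","b","b"), v = 1:4, key = "id")
Y <- data.table(id = c("a","b"), key = "id")

# by-without-by: j runs once per row of Y, grouped by the join key,
# even though no 'by' was typed:
X[Y, sum(v)]          # two rows: a -> 3, b -> 7

# Join first, then evaluate j over the whole joined result -- one total:
X[Y][, sum(v)]        # 10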
This is regarding syntax only, and I do NOT propose any changes to underlying behavior, in particular the speed-up when you do a "by" by the key of the join should stay (and should be done iff by=key is present). -- View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770.html Sent from the datatable-help mailing list archive at Nabble.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.nelson at sydney.edu.au Sat Apr 20 01:07:10 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Fri, 19 Apr 2013 23:07:10 +0000 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <1366401278742-4664770.post@n4.nabble.com> References: <1366401278742-4664770.post@n4.nabble.com> Message-ID: <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> I think this proposed change is completely unnecessary. That a function may behave differently is entirely consistent with the various s3 /s4 methods system (although neither are used here). I think that drop = TRUE when implemented will take care of dropping join columns. On 20/04/2013, at 5:54 AM, "eddi" > wrote: Matthew Dowle suggested I put this up for a discussion here. This is continuation of the discussion that started on SO and resulted in FR2696 (I recommend reading the latter first, as it's much more clear). My case for the change boils down to the following: I believe d[i, j, by = b] should be always understood to mean "take d, apply i, return j by b" instead of the much more complicated current behavior, which is: "take d, apply i, if i was not a merge, return j by b, if i was a merge, if no by, then return j by key, else if b and b == key, complain and return j by b, else return j by b" I believe, while disruptive to some current users, this will make data.table much more user-friendly for any future users (one piece of evidence I would suggest for this, besides my plea, is that FAQs 1.13-1.14 (and part of 1.12) would become completely unnecessary). This is regarding syntax only, and I do NOT propose any changes to underlying behavior, in particular the speed-up when you do a "by" by the key of the join should stay (and should be done iff by=key is present). ________________________________ View this message in context: changing data.table by-without-by syntax to require a "by" Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From erikriverson at gmail.com Mon Apr 22 14:27:52 2013 From: erikriverson at gmail.com (Erik Iverson) Date: Mon, 22 Apr 2013 07:27:52 -0500 Subject: [datatable-help] assigning POSIXlt object to a data.table column Message-ID: Hello, Hope all is well with everyone, just wondering if this is a data.table bug or a bug in my understanding: > DT <- data.table(x = 1) > DT$test <- as.POSIXlt(Sys.Date()) Warning message: In `[<-.data.table`(x, j = name, value = value) : Supplied 9 items to be assigned to 1 items of column 'test' (8 unused) Thanks! 
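The "9 items" in that warning come from what a POSIXlt object is underneath -- a plain list of nine components -- which base R shows directly:

x <- as.POSIXlt(Sys.Date())
unclass(x)            # a list: sec, min, hour, mday, mon, year, wday, yday, isdst
length(unclass(x))    # 9 on the R build in this thread -- hence "Supplied 9 items"

data.table's column assignment sees a 9-element list being pushed into a 1-row column, which is exactly what the warning reports.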
--Erik sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] wordcloud_2.4 RColorBrewer_1.0-5 Rcpp_0.10.3 tm_0.5-8.3 [5] ggplot2_0.9.3.1 zoo_1.7-9 data.table_1.8.8 XML_3.96-1.1 loaded via a namespace (and not attached): [1] colorspace_1.2-2 compiler_3.0.0 dichromat_2.0-0 digest_0.6.3 [5] grid_3.0.0 gtable_0.1.2 labeling_0.1 lattice_0.20-15 [9] MASS_7.3-26 munsell_0.4 plyr_1.8 proto_0.3-10 [13] reshape2_1.2.2 scales_0.2.3 slam_0.1-28 stringr_0.6.2 [17] tools_3.0.0 From mdowle at mdowle.plus.com Mon Apr 22 14:40:25 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 22 Apr 2013 13:40:25 +0100 Subject: [datatable-help] assigning POSIXlt object to a data.table column In-Reply-To: References: Message-ID: <91e23509e5f826c38e9519d043c7d616@imap.plus.net> Hi, It's burried in the Notes section of ?data.table : " POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see NEWS for 1.6.3, and IDateTime instead. IDateTime has methods to convert to and from POSIXlt. " The no-support for POSIXlt is set in stone, but the advice there to use IDateTime may not be the best. Bascially - anything but POSIXlt! Btw, please don't assign to DT columns using DT$test<-. See ?":=". Matthew On 22.04.2013 13:27, Erik Iverson wrote: > Hello, > > Hope all is well with everyone, just wondering if this is a > data.table > bug or a bug in my understanding: > >> DT <- data.table(x = 1) >> DT$test <- as.POSIXlt(Sys.Date()) > Warning message: > In `[<-.data.table`(x, j = name, value = value) : > Supplied 9 items to be assigned to 1 items of column 'test' (8 > unused) > > Thanks! 
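Following the advice above (any column type but POSIXlt, and := rather than $<- for assignment), a minimal sketch of the usual alternatives; the column names are illustrative:

library(data.table)

DT <- data.table(x = 1)
DT[, test := as.POSIXct(Sys.Date())]   # POSIXct: one number per value, works fine
DT[, day  := as.IDate(Sys.Date())]     # IDate: integer-backed date from data.table
DT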
> --Erik > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] wordcloud_2.4 RColorBrewer_1.0-5 Rcpp_0.10.3 > tm_0.5-8.3 > [5] ggplot2_0.9.3.1 zoo_1.7-9 data.table_1.8.8 > XML_3.96-1.1 > > loaded via a namespace (and not attached): > [1] colorspace_1.2-2 compiler_3.0.0 dichromat_2.0-0 digest_0.6.3 > [5] grid_3.0.0 gtable_0.1.2 labeling_0.1 > lattice_0.20-15 > [9] MASS_7.3-26 munsell_0.4 plyr_1.8 proto_0.3-10 > [13] reshape2_1.2.2 scales_0.2.3 slam_0.1-28 stringr_0.6.2 > [17] tools_3.0.0 > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From statquant at outlook.com Mon Apr 22 15:04:04 2013 From: statquant at outlook.com (statquant3) Date: Mon, 22 Apr 2013 06:04:04 -0700 (PDT) Subject: [datatable-help] Millis in IDateTime Message-ID: <1366635844953-4664970.post@n4.nabble.com> Hello, sorry to come back again on this, but I realized that int is good enough for millisecond resolution. As ITime handles times < 24h = 86400000L. Would that be usefull/easy to modify ? Usually even in finance millis are enough. Cheers -- View this message in context: http://r.789695.n4.nabble.com/Millis-in-IDateTime-tp4664970.html Sent from the datatable-help mailing list archive at Nabble.com. From eduard.antonyan at gmail.com Mon Apr 22 17:17:59 2013 From: eduard.antonyan at gmail.com (eddi) Date: Mon, 22 Apr 2013 08:17:59 -0700 (PDT) Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> Message-ID: <1366643879137-4664990.post@n4.nabble.com> I think you're missing the point Michael. Just because it's possible to do it the way it's done now, doesn't mean that's the best way, as I've tried to argue in the OP. I don't think you've addressed the issue of unnecessary complexity pointed out in OP. -- View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html Sent from the datatable-help mailing list archive at Nabble.com. From gsee000 at gmail.com Tue Apr 23 20:46:12 2013 From: gsee000 at gmail.com (G See) Date: Tue, 23 Apr 2013 13:46:12 -0500 Subject: [datatable-help] Indexing by a logical column Message-ID: Hi, Is the following expected behavior? DT = data.table(x=rep(c("a","b","c"),each=3), TF=c(TRUE,FALSE,TRUE)) #All of these return what I expect: DT[c(TRUE, FALSE, TRUE)] DT[TF==TRUE] DT[DT$TF] #Why doesn't this? DT[TF] #Error in eval(expr, envir, enclos) : object 'TF' not found Thanks, Garrett From mdowle at mdowle.plus.com Tue Apr 23 21:12:23 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 23 Apr 2013 20:12:23 +0100 Subject: [datatable-help] Indexing by a logical column In-Reply-To: References: Message-ID: <99876779ed2f08b323f554266d151a8b@imap.plus.net> Hi, Yes expected. 
From ?data.table: "Advanced: When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." Subsetting by a logical column is the only example I can think of where this is confusing. But we make use of this feature quite a lot e.g. TMP=list(...);DT[TMP] safe in the knowledge that DT[TMP] won't start to fail if DT in future has a column called TMP. When I have a logical column boolCol I wrap with (): DT[(boolCol)]. This avoids the memory allocation and scan of ==TRUE, and avoids the variable name repetition of DT[DT$boolCol] Matthew On 23.04.2013 19:46, G See wrote: > Hi, > > Is the following expected behavior? > > DT = data.table(x=rep(c("a","b","c"),each=3), TF=c(TRUE,FALSE,TRUE)) > > #All of these return what I expect: > > DT[c(TRUE, FALSE, TRUE)] > DT[TF==TRUE] > DT[DT$TF] > > #Why doesn't this? > DT[TF] > #Error in eval(expr, envir, enclos) : object 'TF' not found > > Thanks, > Garrett > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From gsee000 at gmail.com Tue Apr 23 21:16:38 2013 From: gsee000 at gmail.com (G See) Date: Tue, 23 Apr 2013 14:16:38 -0500 Subject: [datatable-help] Indexing by a logical column In-Reply-To: <99876779ed2f08b323f554266d151a8b@imap.plus.net> References: <99876779ed2f08b323f554266d151a8b@imap.plus.net> Message-ID: Thank you. Very helpful, as always. Garrett On Tue, Apr 23, 2013 at 2:12 PM, Matthew Dowle wrote: > > Hi, > Yes expected. From ?data.table: > "Advanced: When i is a single variable name, it is not considered an > expression of column names and is instead evaluated in calling scope." > Subsetting by a logical column is the only example I can think of where this > is confusing. But we make use of this feature quite a lot e.g. > TMP=list(...);DT[TMP] > safe in the knowledge that DT[TMP] won't start to fail if DT in future has a > column called TMP. > When I have a logical column boolCol I wrap with (): DT[(boolCol)]. This > avoids the memory allocation and scan of ==TRUE, and avoids the variable > name repetition of DT[DT$boolCol] > Matthew > > > > On 23.04.2013 19:46, G See wrote: >> >> Hi, >> >> Is the following expected behavior? >> >> DT = data.table(x=rep(c("a","b","c"),each=3), TF=c(TRUE,FALSE,TRUE)) >> >> #All of these return what I expect: >> >> DT[c(TRUE, FALSE, TRUE)] >> DT[TF==TRUE] >> DT[DT$TF] >> >> #Why doesn't this? >> DT[TF] >> #Error in eval(expr, envir, enclos) : object 'TF' not found >> >> Thanks, >> Garrett >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From sds at gnu.org Tue Apr 23 23:41:59 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 17:41:59 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= Message-ID: <87bo94hql4.fsf@gnu.org> Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? 
Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. From sds at gnu.org Tue Apr 23 23:55:58 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 17:55:58 -0400 Subject: [datatable-help] cedta decided 'igraph' wasn't data.table aware Message-ID: <8761zchpxt.fsf@gnu.org> Hi, what does this mean? --8<---------------cut here---------------start------------->8--- > graph <- graph.data.frame(merged[!v,], vertices=ve, directed=FALSE) cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware cedta decided 'igraph' wasn't data.table aware --8<---------------cut here---------------end--------------->8--- `merged' and `ve' are `data.table' objects, and thus `data.frame' objects too. the igraph function graph.data.frame accepts data.frame. other than the messages (controlled by datatable.verbose), the code appears to work. Bill Dunlap kindly explained that >> cedta ("Calling Environment is Data.Table Aware") >> is a private function in package:data.table could you please offer more detail? what doe the message mean? Thanks. -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://www.memritv.org http://memri.org http://ffii.org http://think-israel.org http://palestinefacts.org http://truepeace.org Perl: all stupidities of UNIX in one. From sds at gnu.org Tue Apr 23 23:57:36 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 17:57:36 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= Message-ID: <874newhpv3.fsf@gnu.org> Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. From sds at gnu.org Wed Apr 24 00:11:01 2013 From: sds at gnu.org (Sam Steingold) Date: Tue, 23 Apr 2013 18:11:01 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <874newhpv3.fsf@gnu.org> (Sam Steingold's message of "Tue, 23 Apr 2013 17:57:36 -0400") References: <874newhpv3.fsf@gnu.org> Message-ID: <87zjwogaoa.fsf@gnu.org> I apologize for double posting - my first message appeared to have been rejected. 
> * Sam Steingold [2013-04-23 17:57:36 -0400]: > > Hi, > I got this: > >> dt <- frame[, lapply(.SD, last) ,by=id] > Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 > Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' > Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? > Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> >> > > the help for last does mention xts, but I don't have it installed. > do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://openvotingconsortium.org http://pmw.org.il http://camera.org http://dhimmi.com http://think-israel.org Don't use force -- get a bigger hammer. From michael.nelson at sydney.edu.au Wed Apr 24 02:41:33 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Wed, 24 Apr 2013 00:41:33 +0000 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: <874newhpv3.fsf@gnu.org> References: <874newhpv3.fsf@gnu.org> Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> >From the help for data.table::last ?If x is a data.table, the last row as a one row data.table. Otherwise, whatever xts::last returns. calling lapply(.SD, last) will call last on each column in .SD. Columns within a data.table aren't data.tables thus `xts::last` is called. xts is on the suggests list for data.table, you could use install.packages('data.table, dependencies = 'Suggests') or manually installed xts. OR frame[, last(.SD), by = id] would work without needing xts as would frame[, .SD[.N], by = id] or without having to construct .SD (which is time consuming) frame[frame[, .I[.N],by = id]$V1] or setkey(frame, id) frame[unique(id), mult = 'last'] ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam Steingold [sds at gnu.org] Sent: Wednesday, 24 April 2013 7:57 AM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] there is no package called ?xts? Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. 
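For concreteness, a small self-contained sketch of the .SD[.N] and .I[.N] approaches listed above (toy data; the column names "id" and "val" are invented for illustration):

library(data.table)
frame <- data.table(id = c("a", "a", "b", "b", "b"), val = 1:5)

# last row of each group, building .SD for every group
frame[, .SD[.N], by = id]

# same rows, but only the row numbers are computed per group and the
# subset is done once at the end (usually noticeably faster)
frame[frame[, .I[.N], by = id]$V1]

Both calls return one row per id and neither of them touches xts.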
_______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From eduard.antonyan at gmail.com Wed Apr 24 03:16:42 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Tue, 23 Apr 2013 20:16:42 -0500 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <-7825510419294116677@unknownmsgid> This is great, a lot of cool stuff in one post! On Apr 23, 2013, at 7:42 PM, Michael Nelson wrote: > From the help for data.table::last > > If x is a data.table, the last row as a one row data.table. Otherwise, whatever xts::last returns. > > > calling lapply(.SD, last) will call last on each column in .SD. Columns within a data.table aren't data.tables thus `xts::last` is called. xts is on the suggests list for data.table, > > you could use > > install.packages('data.table, dependencies = 'Suggests') > > or manually installed xts. > > OR > > frame[, last(.SD), by = id] > > would work without needing xts > > as would > > frame[, .SD[.N], by = id] > > or without having to construct .SD (which is time consuming) > > frame[frame[, .I[.N],by = id]$V1] > > or > > setkey(frame, id) > > frame[unique(id), mult = 'last'] > > ________________________________________ > From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam Steingold [sds at gnu.org] > Sent: Wednesday, 24 April 2013 7:57 AM > To: datatable-help at lists.r-forge.r-project.org > Subject: [datatable-help] there is no package called ?xts? > > Hi, > I got this: > >> dt <- frame[, lapply(.SD, last) ,by=id] > Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 > Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' > Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? > Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > > the help for last does mention xts, but I don't have it installed. > do I need to? > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 > http://www.childpsy.net/ http://ffii.org http://think-israel.org > http://mideasttruth.com http://memri.org http://camera.org > Ernqvat guvf ivbyngrf QZPN. 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Wed Apr 24 10:54:21 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 09:54:21 +0100 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <-7825510419294116677@unknownmsgid> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> <-7825510419294116677@unknownmsgid> Message-ID: <78c42b8c3e05690f5e48e9e35b54888b@imap.plus.net> Indeed! Great stuff Michael. I suppose that error ("Error in loadNamespace(name) : there is no package called ?xts?") is yet another valid bug (sigh), if you wouldn't mind filing please Sam. Thanks, Matthew On 24.04.2013 02:16, Eduard Antonyan wrote: > This is great, a lot of cool stuff in one post! > > On Apr 23, 2013, at 7:42 PM, Michael Nelson > wrote: > >> From the help for data.table::last >> >> If x is a data.table, the last row as a one row data.table. >> Otherwise, whatever xts::last returns. >> >> >> calling lapply(.SD, last) will call last on each column in .SD. >> Columns within a data.table aren't data.tables thus `xts::last` is >> called. xts is on the suggests list for data.table, >> >> you could use >> >> install.packages('data.table, dependencies = 'Suggests') >> >> or manually installed xts. >> >> OR >> >> frame[, last(.SD), by = id] >> >> would work without needing xts >> >> as would >> >> frame[, .SD[.N], by = id] >> >> or without having to construct .SD (which is time consuming) >> >> frame[frame[, .I[.N],by = id]$V1] >> >> or >> >> setkey(frame, id) >> >> frame[unique(id), mult = 'last'] >> >> ________________________________________ >> From: datatable-help-bounces at lists.r-forge.r-project.org >> [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam >> Steingold [sds at gnu.org] >> Sent: Wednesday, 24 April 2013 7:57 AM >> To: datatable-help at lists.r-forge.r-project.org >> Subject: [datatable-help] there is no package called ?xts? >> >> Hi, >> I got this: >> >>> dt <- frame[, lapply(.SD, last) ,by=id] >> Finding groups (bysameorder=TRUE) ... done in 0.126secs. >> bysameorder=TRUE and o__ is length 0 >> Optimized j from 'lapply(.SD, last)' to 'list(last(country), >> last(language), last(browser), last(platform), last(uatype), >> last(behavior))' >> Starting dogroups ... Error in loadNamespace(name) : there is no >> package called ?xts? >> Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> >> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne >> -> >> >> the help for last does mention xts, but I don't have it installed. >> do I need to? >> >> -- >> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X >> 11.0.11300000 >> http://www.childpsy.net/ http://ffii.org http://think-israel.org >> http://mideasttruth.com http://memri.org http://camera.org >> Ernqvat guvf ivbyngrf QZPN. 
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Wed Apr 24 11:07:12 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 10:07:12 +0100 Subject: [datatable-help] cedta decided 'igraph' wasn't data.table aware In-Reply-To: <8761zchpxt.fsf@gnu.org> References: <8761zchpxt.fsf@gnu.org> Message-ID: <714d75225a91130b39a1af84f2250232@imap.plus.net> Hi, Oh dear. On first glance, you are having a painful time with data.table! But if you have verbose=TRUE then this seems ok. I think of 'verbose' more like 'trace'. There aren't different levels of verbosity, yet, although that has been suggested before and is on the list to do. So internal tracing messages, notes, progress etc is all mixed in to one 'verbose=TRUE' level at the moment. I very rarely use verbose=TRUE. I just switch it on when debugging. You and others are right to switch it on and use it for tuning, that is the idea, but it's too verbose at the moment as you've found. Anyway, cedta did the right thing here, since 'igraph' indeed is not data.table aware. Setting verbose=FALSE should make the trace message go away. More info about cedta on the single result returned by "[data.table] cedta" : http://stackoverflow.com/questions/10527072/using-data-table-package-inside-my-own-package/10529888#10529888 (have just reread that and it's still correct). Matthew On 23.04.2013 22:55, Sam Steingold wrote: > Hi, what does this mean? > > --8<---------------cut here---------------start------------->8--- >> graph <- graph.data.frame(merged[!v,], vertices=ve, directed=FALSE) > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > cedta decided 'igraph' wasn't data.table aware > --8<---------------cut here---------------end--------------->8--- > > `merged' and `ve' are `data.table' objects, and thus `data.frame' > objects too. > the igraph function graph.data.frame accepts data.frame. > other than the messages (controlled by datatable.verbose), the code > appears to work. > > Bill Dunlap kindly explained that >>> cedta ("Calling Environment is Data.Table Aware") >>> is a private function in package:data.table > > could you please offer more detail? > what doe the message mean? > > Thanks. 
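A minimal sketch of the point above, assuming only that the global verbose option is what produced those trace lines (as described): switching it off silences them without changing any results.

library(data.table)
getOption("datatable.verbose")        # check the current setting
options(datatable.verbose = FALSE)    # quiet: no cedta/trace messages
# options(datatable.verbose = TRUE)   # switch back on when debugging or tuning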
From eduard.antonyan at gmail.com Wed Apr 24 16:47:12 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 09:47:12 -0500 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: @Michael, in the last expression, you probably forgot a J: frame[J(unique(id)), mult = "last"] On Tue, Apr 23, 2013 at 7:41 PM, Michael Nelson < michael.nelson at sydney.edu.au> wrote: > From the help for data.table::last > > If x is a data.table, the last row as a one row data.table. Otherwise, > whatever xts::last returns. > > > calling lapply(.SD, last) will call last on each column in .SD. Columns > within a data.table aren't data.tables thus `xts::last` is called. xts is > on the suggests list for data.table, > > you could use > > install.packages('data.table, dependencies = 'Suggests') > > or manually installed xts. > > OR > > frame[, last(.SD), by = id] > > would work without needing xts > > as would > > frame[, .SD[.N], by = id] > > or without having to construct .SD (which is time consuming) > > frame[frame[, .I[.N],by = id]$V1] > > or > > setkey(frame, id) > > frame[unique(id), mult = 'last'] > > ________________________________________ > From: datatable-help-bounces at lists.r-forge.r-project.org [ > datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam > Steingold [sds at gnu.org] > Sent: Wednesday, 24 April 2013 7:57 AM > To: datatable-help at lists.r-forge.r-project.org > Subject: [datatable-help] there is no package called ?xts? > > Hi, > I got this: > > > dt <- frame[, lapply(.SD, last) ,by=id] > Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE > and o__ is length 0 > Optimized j from 'lapply(.SD, last)' to 'list(last(country), > last(language), last(browser), last(platform), last(uatype), > last(behavior))' > Starting dogroups ... Error in loadNamespace(name) : there is no package > called ?xts? > Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace > -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > > > > the help for last does mention xts, but I don't have it installed. > do I need to? > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X > 11.0.11300000 > http://www.childpsy.net/ http://ffii.org http://think-israel.org > http://mideasttruth.com http://memri.org http://camera.org > Ernqvat guvf ivbyngrf QZPN. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From s_milberg at hotmail.com Wed Apr 24 20:02:15 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Wed, 24 Apr 2013 14:02:15 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <1366643879137-4664990.post@n4.nabble.com> References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com> Message-ID: I'd agree with Eduard, although it's probably too late to change behavior now. Maybe for data.table.2? Eduard's proposal seems more closely aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if requested). S. > Date: Mon, 22 Apr 2013 08:17:59 -0700 > From: eduard.antonyan at gmail.com > To: datatable-help at lists.r-forge.r-project.org > Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" > > I think you're missing the point Michael. Just because it's possible to do it > the way it's done now, doesn't mean that's the best way, as I've tried to > argue in the OP. I don't think you've addressed the issue of unnecessary > complexity pointed out in OP. > > > > -- > View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From sds at gnu.org Wed Apr 24 21:18:03 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 15:18:03 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <78c42b8c3e05690f5e48e9e35b54888b@imap.plus.net> (Matthew Dowle's message of "Wed, 24 Apr 2013 09:54:21 +0100") References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> <-7825510419294116677@unknownmsgid> <78c42b8c3e05690f5e48e9e35b54888b@imap.plus.net> Message-ID: <87sj2fg2l0.fsf@gnu.org> > * Matthew Dowle [2013-04-24 09:54:21 +0100]: > > Indeed! Great stuff Michael. yep! > I suppose that error ("Error in loadNamespace(name) : there is no > package called ?xts?") is yet another valid bug (sigh), if you wouldn't > mind filing please Sam. https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2728&group_id=240&atid=975 -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://openvotingconsortium.org http://dhimmi.com http://memri.org http://pmw.org.il http://www.memritv.org http://jihadwatch.org If abortion is murder, then oral sex is cannibalism. From sds at gnu.org Wed Apr 24 21:25:46 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 15:25:46 -0400 Subject: [datatable-help] head.data.table does not support negative arguments? Message-ID: <87li87g285.fsf@gnu.org> is this a bug? 
--8<---------------cut here---------------start------------->8--- > head(1:10,-3) [1] 1 2 3 4 5 6 7 > head(data.frame(a=1:5,b=5:9),-2) a b 1 1 5 2 2 6 3 3 7 > head(data.table(a=1:5,b=5:9),-2) Error in seq_len(min(n, nrow(x))) : argument must be coercible to non-negative integer Calls: head -> head.data.table --8<---------------cut here---------------end--------------->8--- -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://camera.org http://openvotingconsortium.org http://truepeace.org http://pmw.org.il http://palestinefacts.org Children fear dentists because of pain, adults - because of bills. From sds at gnu.org Wed Apr 24 21:26:57 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 15:26:57 -0400 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> (Michael Nelson's message of "Wed, 24 Apr 2013 00:41:33 +0000") References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> Message-ID: <87haivg266.fsf@gnu.org> > * Michael Nelson [2013-04-24 00:41:33 +0000]: > > frame[, .SD[.N], by = id] I tried --8<---------------cut here---------------start------------->8--- dt <- frame[, .SD[1] ,by=id] --8<---------------cut here---------------end--------------->8--- (I don't care whether I take first or last, see another message). and I got the note --8<---------------cut here---------------start------------->8--- Finding groups (bysameorder=TRUE) ... done in 0.121secs. bysameorder=TRUE and o__ is length 0 Optimization is on but j left unchanged as '.SD[1]' Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). --8<---------------cut here---------------end--------------->8--- and indeed it runs unbelievably slow (as if I were using data.table) thanks a lot for your detailed reply! -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://www.memritv.org http://palestinefacts.org http://iris.org.il http://think-israel.org non-smoking section in a restaurant == non-peeing section in a swimming pool From mdowle at mdowle.plus.com Wed Apr 24 21:28:42 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 20:28:42 +0100 Subject: [datatable-help] =?utf-8?q?head=2Edata=2Etable_does_not_support_n?= =?utf-8?q?egative_arguments=3F?= In-Reply-To: <87li87g285.fsf@gnu.org> References: <87li87g285.fsf@gnu.org> Message-ID: Yes, known and already filed. Thanks. Matthew On 24.04.2013 20:25, Sam Steingold wrote: > is this a bug? 
> --8<---------------cut here---------------start------------->8--- >> head(1:10,-3) > [1] 1 2 3 4 5 6 7 >> head(data.frame(a=1:5,b=5:9),-2) > a b > 1 1 5 > 2 2 6 > 3 3 7 >> head(data.table(a=1:5,b=5:9),-2) > Error in seq_len(min(n, nrow(x))) : > argument must be coercible to non-negative integer > Calls: head -> head.data.table > --8<---------------cut here---------------end--------------->8--- From mdowle at mdowle.plus.com Wed Apr 24 21:34:12 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 20:34:12 +0100 Subject: [datatable-help] =?utf-8?q?there_is_no_package_called_=E2=80=98xt?= =?utf-8?b?c+KAmQ==?= In-Reply-To: <87haivg266.fsf@gnu.org> References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au> <87haivg266.fsf@gnu.org> Message-ID: Good. Well, all correct, known and expected then. There is a feature request to optimize .SD[i] in DT[,.SD[i],by=...] to not actually create the whole .SD just to get the first or last row (or indeed any subset). Since that's the most natural syntax. I often would like that myself. In the meatime the other suggestions from Michael should be fast. As he said: the one using .I[.N] should be fast. Matthew On 24.04.2013 20:26, Sam Steingold wrote: >> * Michael Nelson [2013-04-24 00:41:33 >> +0000]: >> >> frame[, .SD[.N], by = id] > > I tried > --8<---------------cut here---------------start------------->8--- > dt <- frame[, .SD[1] ,by=id] > --8<---------------cut here---------------end--------------->8--- > (I don't care whether I take first or last, see another message). > and I got the note > --8<---------------cut here---------------start------------->8--- > Finding groups (bysameorder=TRUE) ... done in 0.121secs. > bysameorder=TRUE and o__ is length 0 > Optimization is on but j left unchanged as '.SD[1]' > Starting dogroups ... The result of j is a named list. It's very > inefficient to create the same names over and over again for each > group. When j=list(...), any names are detected, removed and put back > after grouping has completed, for efficiency. Using j=transform(), > for > example, prevents that speedup (consider changing to :=). > --8<---------------cut here---------------end--------------->8--- > and indeed it runs unbelievably slow (as if I were using data.table) > > thanks a lot for your detailed reply! From sds at gnu.org Wed Apr 24 22:29:24 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 24 Apr 2013 16:29:24 -0400 Subject: [datatable-help] variable column names Message-ID: <87a9onfza3.fsf@gnu.org> What do I do if I want to operate on several columns? E.g., --8<---------------cut here---------------start------------->8--- length(names(dt)) = 25 myvars = c("col1","col4","col7") myid = "user" setkeyv(dt,myid) --8<---------------cut here---------------end--------------->8--- and I want to summarize columns in myvars by myid. the point is that nowhere in the code the literal "col4" or "user" may appear. Thanks. -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://palestinefacts.org http://openvotingconsortium.org http://camera.org http://jihadwatch.org The difference between genius and stupidity is that genius has its limits. 
From eduard.antonyan at gmail.com Wed Apr 24 22:35:41 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 15:35:41 -0500 Subject: [datatable-help] variable column names In-Reply-To: <87a9onfza3.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> Message-ID: with = FALSE will let you use literal column names On Wed, Apr 24, 2013 at 3:29 PM, Sam Steingold wrote: > What do I do if I want to operate on several columns? > E.g., > --8<---------------cut here---------------start------------->8--- > length(names(dt)) = 25 > myvars = c("col1","col4","col7") > myid = "user" > setkeyv(dt,myid) > --8<---------------cut here---------------end--------------->8--- > and I want to summarize columns in myvars by myid. > the point is that nowhere in the code the literal "col4" or "user" may > appear. > > Thanks. > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X > 11.0.11300000 > http://www.childpsy.net/ http://ffii.org http://palestinefacts.org > http://openvotingconsortium.org http://camera.org http://jihadwatch.org > The difference between genius and stupidity is that genius has its limits. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Wed Apr 24 22:50:34 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 21:50:34 +0100 Subject: [datatable-help] variable column names In-Reply-To: References: <87a9onfza3.fsf@gnu.org> Message-ID: Or: DT[,lapply(.SD,sum),by=...,.SDcols=myvars] > with = FALSE will let you use literal column names > > > On Wed, Apr 24, 2013 at 3:29 PM, Sam Steingold wrote: > >> What do I do if I want to operate on several columns? >> E.g., >> --8<---------------cut here---------------start------------->8--- >> length(names(dt)) = 25 >> myvars = c("col1","col4","col7") >> myid = "user" >> setkeyv(dt,myid) >> --8<---------------cut here---------------end--------------->8--- >> and I want to summarize columns in myvars by myid. >> the point is that nowhere in the code the literal "col4" or "user" may >> appear. >> >> Thanks. >> >> -- >> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X >> 11.0.11300000 >> http://www.childpsy.net/ http://ffii.org http://palestinefacts.org >> http://openvotingconsortium.org http://camera.org http://jihadwatch.org >> The difference between genius and stupidity is that genius has its >> limits. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Wed Apr 24 22:54:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 21:54:17 +0100 Subject: [datatable-help] variable column names In-Reply-To: References: <87a9onfza3.fsf@gnu.org> Message-ID: <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> where ... 
is eval(myid) iigc > Or: > DT[,lapply(.SD,sum),by=...,.SDcols=myvars] > >> with = FALSE will let you use literal column names >> >> >> On Wed, Apr 24, 2013 at 3:29 PM, Sam Steingold wrote: >> >>> What do I do if I want to operate on several columns? >>> E.g., >>> --8<---------------cut here---------------start------------->8--- >>> length(names(dt)) = 25 >>> myvars = c("col1","col4","col7") >>> myid = "user" >>> setkeyv(dt,myid) >>> --8<---------------cut here---------------end--------------->8--- >>> and I want to summarize columns in myvars by myid. >>> the point is that nowhere in the code the literal "col4" or "user" may >>> appear. >>> >>> Thanks. >>> >>> -- >>> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X >>> 11.0.11300000 >>> http://www.childpsy.net/ http://ffii.org http://palestinefacts.org >>> http://openvotingconsortium.org http://camera.org http://jihadwatch.org >>> The difference between genius and stupidity is that genius has its >>> limits. >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Wed Apr 24 23:01:49 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 22:01:49 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com> Message-ID: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. 
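Putting the suggestions from the "variable column names" thread together, a rough sketch (the data and column names below are invented; in current versions a plain character variable is accepted for by, so eval() is not strictly required):

library(data.table)
dt <- data.table(user = c("u1", "u1", "u2"),
                 col1 = 1:3, col4 = 4:6, col7 = 7:9)
myvars <- c("col1", "col4", "col7")
myid   <- "user"
setkeyv(dt, myid)

# select columns programmatically: j as a character vector with with = FALSE
dt[, c(myid, myvars), with = FALSE]

# summarise myvars by myid without any literal column name appearing in the code
dt[, lapply(.SD, sum), by = myid, .SDcols = myvars]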
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From eduard.antonyan at gmail.com Wed Apr 24 23:22:42 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 16:22:42 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> Message-ID: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > But then what would be analogous to CROSS APPLY in SQL? > > > I'd agree with Eduard, although it's probably too late to change behavior > > now. Maybe for data.table.2? Eduard's proposal seems more closely > > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > > requested). > > > > S. > > > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 > >> From: eduard.antonyan at gmail.com > >> To: datatable-help at lists.r-forge.r-project.org > >> Subject: Re: [datatable-help] changing data.table by-without-by > >> syntax to require a "by" > >> > >> I think you're missing the point Michael. Just because it's possible to > >> do it > >> the way it's done now, doesn't mean that's the best way, as I've tried > >> to > >> argue in the OP. I don't think you've addressed the issue of unnecessary > >> complexity pointed out in OP. > >> > >> > >> > >> -- > >> View this message in context: > >> > http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html > >> Sent from the datatable-help mailing list archive at Nabble.com. > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Thu Apr 25 00:28:08 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 23:28:08 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> Message-ID: <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: > By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). > Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [8], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: > "We table table1 and table2. table1 has a column called rowcount. > > For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [9]" > > On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > >> But then what would be analogous to CROSS APPLY in SQL? >> >> > I'd agree with Eduard, although it's probably too late to change behavior >> > now. Maybe for data.table.2? Eduard's proposal seems more closely >> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >> > requested). >> > >> > S. >> > >> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >> >> To: datatable-help at lists.r-forge.r-project.org [2] >> >>>> Subject: Re: [datatable-help] changing data.table by-without-by >> >> syntax to require a "by" >> >> >> >> I think you're missing the point Michael. Just because it's possible to >> >> do it >> >> the way it's done now, doesn't mean that's the best way, as I've tried >> >> to >> >> argue in the OP. I don't think you've addressed the issue of unnecessary >> >> complexity pointed out in OP. >> >> >> >> >> >> >> >> -- >> >> View this message in context: >> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >> >> Sent from the datatable-help mailing list archive at Nabble.com. 
>> >> _______________________________________________ >> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [4] >> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >> > _______________________________________________ >> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [6] >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:datatable-help at lists.r-forge.r-project.org [7] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9] http://table2.id [10] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 00:41:22 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 23:41:22 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> Message-ID: <290b8e5f0151662cb436cca320fa709f@imap.plus.net> Where I meant CROSS APPLY not CROSS BY (typo) and incorrect with 2 r's. I picked up on that because out of the entire page you seemed to quote a sentence which made no sense. The rest of the article looks great. On 24.04.2013 23:28, Matthew Dowle wrote: > That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? > > Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2), top=c(3,4)) > 1> Y > a top > 1: 1 3 > 2: 2 4 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 1> > > If there was no by-without-by (analogous to CROSS BY), then how would that be done? > > On 24.04.2013 22:22, Eduard Antonyan wrote: > >> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [8], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [9]" >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change behavior >>> > now. Maybe for data.table.2? 
Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>> >>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [4] >>> >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>> > _______________________________________________ >>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [6] >>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:datatable-help at lists.r-forge.r-project.org [7] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9] http://table2.id [10] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Apr 25 00:43:19 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 17:43:19 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> Message-ID: I assumed they meant create a table :) that looks cool, what's i.top ? I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > ** > > > > That sentence on that linked webpage seems incorect English, since table > is a noun not a verb. Should "table" be "join" perhaps? > > Anyway, by-without-by is often used with join inherited scope (JIS). 
For > example, translating their example : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2), top=c(3,4)) > 1> Y > a top > 1: 1 3 > 2: 2 4 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 1> > > > > If there was no by-without-by (analogous to CROSS BY), then how would that be done? > > > > On 24.04.2013 22:22, Eduard Antonyan wrote: > > By that you mean current behavior? You'd get current behavior by > explicitly specifying the appropriate "by" (i.e. "by" equal to the key). > Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using > http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I > can't figure out how by-without-by (or with by-with-by for that matter:) ) > helps with e.g. the first example there: > "We table table1 and table2. table1 has a column called rowcount. > > For each row from table1 we need to select first rowcount rows from table2, > ordered by table2.id" > > > > > On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > >> But then what would be analogous to CROSS APPLY in SQL? >> >> > I'd agree with Eduard, although it's probably too late to change >> behavior >> > now. Maybe for data.table.2? Eduard's proposal seems more closely >> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >> > requested). >> > >> > S. >> > >> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> >> From: eduard.antonyan at gmail.com >> >> To: datatable-help at lists.r-forge.r-project.org >> >> Subject: Re: [datatable-help] changing data.table by-without-by >> >> syntax to require a "by" >> >> >> >> I think you're missing the point Michael. Just because it's possible to >> >> do it >> >> the way it's done now, doesn't mean that's the best way, as I've tried >> >> to >> >> argue in the OP. I don't think you've addressed the issue of >> unnecessary >> >> complexity pointed out in OP. >> >> >> >> >> >> >> >> -- >> >> View this message in context: >> >> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> >> Sent from the datatable-help mailing list archive at Nabble.com. >> >> _______________________________________________ >> >> datatable-help mailing list >> >> datatable-help at lists.r-forge.r-project.org >> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >> _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 00:50:04 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Wed, 24 Apr 2013 23:50:04 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> Message-ID: <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? 
: 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [8], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>> "We table table1 and table2. table1 has a column called rowcount. >>> >>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [9]" >>> >>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>> >>>> But then what would be analogous to CROSS APPLY in SQL? >>>> >>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>> > requested). >>>> > >>>> > S. >>>> > >>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>> >>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>> >> syntax to require a "by" >>>> >> >>>> >> I think you're missing the point Michael. Just because it's possible to >>>> >> do it >>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>> >> to >>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>> >> complexity pointed out in OP. >>>> >> >>>> >> >>>> >> >>>> >> -- >>>> >> View this message in context: >>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>> >> Sent from the datatable-help mailing list archive at Nabble.com. 
>>>> >> _______________________________________________ >>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [4] >>>> >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [5] >>>> > _______________________________________________ >>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [6] >>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] mailto:datatable-help at lists.r-forge.r-project.org [5] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] mailto:datatable-help at lists.r-forge.r-project.org [7] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9] http://table2.id [10] mailto:mdowle at mdowle.plus.com [11] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Apr 25 01:01:42 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 18:01:42 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> Message-ID: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > ** > > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > > What about this? : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > 9: 1 4 > 1> > > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> >> >> That sentence on that linked webpage seems incorect English, since table >> is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). 
For >> example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >> By that you mean current behavior? You'd get current behavior by >> explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using >> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I >> can't figure out how by-without-by (or with by-with-by for that matter:) ) >> helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from >> table2, ordered by table2.id" >> >> >> >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change >>> behavior >>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >>> >> From: eduard.antonyan at gmail.com >>> >> To: datatable-help at lists.r-forge.r-project.org >>> >> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible >>> to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of >>> unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> >>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >>> >> datatable-help at lists.r-forge.r-project.org >>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> > >>> _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From viviannevilar at gmail.com Thu Apr 25 02:38:28 2013 From: viviannevilar at gmail.com (Vivianne Vilar) Date: Thu, 25 Apr 2013 10:38:28 +1000 Subject: [datatable-help] fread: coercion of class from integer to character due to NA string. Message-ID: Hi there, I think this is probably a known issue, but just in case, here it is. I am trying to use fread to read a very large csv file, but I am having problems due to the fact that NAs in a numeric column are represented with some letters. For example, in my column of SIC codes I have "Z" to represent NAs. 
Even though I explicitly set those to be NAs in the command: data6281 <- fread("data6281.csv",header=TRUE, na.strings=c("C",".","B","Z","")) I get the warning message that that column was changed to be character even though it is supposed to be integer. With the read.csv I have no problem when I use the command data6281 <- data.table(read.csv("data6281.csv",header=TRUE, colClasses=c("integer","integer","integer","integer","integer","factor","character","factor","numeric","numeric","integer"), na.strings=c("C",".","B","Z",""))) but fread does not allow me to set the column classes since it doesn't accept the argument colClasses. A shame really. fread is much faster, and I love that it shows the % progress. I don't supposed there is a way around this, but if there is I would be glad to know. I would also be happy to provide an example if that's necessary. Cheers, Vivianne Siqueira Campos Vilar ---------------------------------------------- ?Don't worry about the world coming to an end today. It is already tomorrow in Australia.? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 03:25:50 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 25 Apr 2013 02:25:50 +0100 Subject: [datatable-help] fread: coercion of class from integer to character due to NA string. In-Reply-To: References: Message-ID: Hi, Thanks for reporting. Yes all known and will be tackled. colClasses should be next commit hopefully. Matthew > Hi there, > > I think this is probably a known issue, but just in case, here it is. > > I am trying to use fread to read a very large csv file, but I am having > problems due to the fact that NAs in a numeric column are represented with > some letters. For example, in my column of SIC codes I have "Z" to > represent NAs. Even though I explicitly set those to be NAs in the > command: > > data6281 <- fread("data6281.csv",header=TRUE, > na.strings=c("C",".","B","Z","")) > > I get the warning message that that column was changed to be character > even > though it is supposed to be integer. > > With the read.csv I have no problem when I use the command > > data6281 <- data.table(read.csv("data6281.csv",header=TRUE, > colClasses=c("integer","integer","integer","integer","integer","factor","character","factor","numeric","numeric","integer"), > na.strings=c("C",".","B","Z",""))) > > but fread does not allow me to set the column classes since it doesn't > accept the argument colClasses. > > A shame really. fread is much faster, and I love that it shows the % > progress. > > I don't supposed there is a way around this, but if there is I would be > glad to know. > > I would also be happy to provide an example if that's necessary. > > Cheers, > > Vivianne Siqueira Campos Vilar > ---------------------------------------------- > ?Don't worry about the world coming to an end today. It is already > tomorrow > in Australia.? 
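Until colClasses reaches fread, one possible stop-gap is to accept the character column and coerce it afterwards, at which point the letters standing in for NA become NA. This is only a sketch: 'sic' is a placeholder for whatever the affected SIC-code column is actually called.

data6281 <- fread("data6281.csv", header=TRUE, na.strings=c("C", ".", "B", "Z", ""))
data6281[, sic := suppressWarnings(as.integer(sic))]   # non-numeric strings such as "Z" coerce to NA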
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From eduard.antonyan at gmail.com Thu Apr 25 06:16:24 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 24 Apr 2013 23:16:24 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> Message-ID: <9146185881995080674@unknownmsgid> That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > ** > > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > > What about this? : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > 9: 1 4 > 1> > > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> >> >> That sentence on that linked webpage seems incorect English, since table >> is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). For >> example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >> By that you mean current behavior? You'd get current behavior by >> explicitly specifying the appropriate "by" (i.e. "by" equal to the key). 
>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using >> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I >> can't figure out how by-without-by (or with by-with-by for that matter:) ) >> helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from >> table2, ordered by table2.id" >> >> >> >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change >>> behavior >>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >>> >> From: eduard.antonyan at gmail.com >>> >> To: datatable-help at lists.r-forge.r-project.org >>> >> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible >>> to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of >>> unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> >>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >>> >> datatable-help at lists.r-forge.r-project.org >>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> > >>> _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 11:28:43 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 25 Apr 2013 10:28:43 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <9146185881995080674@unknownmsgid> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> Message-ID: <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. 
But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: > That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. > To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. > > On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: > >> that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me >> >> On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: >> >>> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. >>> >>> What about this? : >>> 1> X = data.table(a=1:3,b=1:15, key="a") >>> 1> X >>> a b >>> 1: 1 1 >>> 2: 1 4 >>> 3: 1 7 >>> 4: 1 10 >>> 5: 1 13 >>> 6: 2 2 >>> 7: 2 5 >>> 8: 2 8 >>> 9: 2 11 >>> 10: 2 14 >>> 11: 3 3 >>> 12: 3 6 >>> 13: 3 9 >>> 14: 3 12 >>> 15: 3 15 >>> >>> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) >>> >>> 1> Y >>> a top >>> 1: 1 3 >>> 2: 2 4 >>> 3: 1 2 >>> 1> X[Y, head(.SD,i.top)] >>> a b >>> 1: 1 1 >>> 2: 1 4 >>> 3: 1 7 >>> 4: 2 2 >>> 5: 2 5 >>> 6: 2 8 >>> 7: 2 11 >>> 8: 1 1 >>> >>> 9: 1 4 >>> 1> >>> >>> On 24.04.2013 23:43, Eduard Antonyan wrote: >>> >>>> I assumed they meant create a table :) >>>> that looks cool, what's i.top ? I can get a very similar to yours result by writing: >>>> X[Y][, head(.SD, top[1]), by = a] >>>> and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): >>>> X[Y, head(.SD, i.top), by = a] >>>> >>>> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: >>>> >>>>> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >>>>> >>>>> Anyway, by-without-by is often used with join inherited scope (JIS). 
For example, translating their example : >>>>> >>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>> 1> X >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 1 10 >>>>> 5: 1 13 >>>>> 6: 2 2 >>>>> 7: 2 5 >>>>> 8: 2 8 >>>>> 9: 2 11 >>>>> 10: 2 14 >>>>> 11: 3 3 >>>>> 12: 3 6 >>>>> >>>>> 13: 3 9 >>>>> 14: 3 12 >>>>> 15: 3 15 >>>>> 1> Y = data.table(a=c(1,2), top=c(3,4)) >>>>> 1> Y >>>>> a top >>>>> 1: 1 3 >>>>> 2: 2 4 >>>>> 1> X[Y, head(.SD,i.top)] >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 2 2 >>>>> 5: 2 5 >>>>> >>>>> 6: 2 8 >>>>> 7: 2 11 >>>>> 1> >>>>> >>>>> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >>>>> >>>>> On 24.04.2013 22:22, Eduard Antonyan wrote: >>>>> >>>>>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>>>>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>>>>> "We table table1 and table2. table1 has a column called rowcount. >>>>>> >>>>>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [10]" >>>>>> >>>>>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>>>>> >>>>>>> But then what would be analogous to CROSS APPLY in SQL? >>>>>>> >>>>>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>>>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>>>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>>>>> > requested). >>>>>>> > >>>>>>> > S. >>>>>>> > >>>>>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>>>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>>>>> >>>>>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>>>>> >> syntax to require a "by" >>>>>>> >> >>>>>>> >> I think you're missing the point Michael. Just because it's possible to >>>>>>> >> do it >>>>>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>>>>> >> to >>>>>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>>>>> >> complexity pointed out in OP. >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> -- >>>>>>> >> View this message in context: >>>>>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>>>>> >> Sent from the datatable-help mailing list archive at Nabble.com [4]. 
>>>>>>> >> _______________________________________________ >>>>>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [5] >>>>>>> >>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] >>>>>>> > _______________________________________________ >>>>>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [7] >>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] http://Nabble.com [5] mailto:datatable-help at lists.r-forge.r-project.org [6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [10] http://table2.id [11] mailto:mdowle at mdowle.plus.com [12] mailto:mdowle at mdowle.plus.com [13] mailto:mdowle at mdowle.plus.com [14] mailto:eduard.antonyan at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Thu Apr 25 14:45:45 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Thu, 25 Apr 2013 07:45:45 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> Message-ID: <5222879356405645530@unknownmsgid> Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? 
If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > > What about this? : > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > > 9: 1 4 > 1> > > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > >> >> >> That sentence on that linked webpage seems incorect English, since table >> is a noun not a verb. Should "table" be "join" perhaps? >> >> Anyway, by-without-by is often used with join inherited scope (JIS). 
For >> example, translating their example : >> >> 1> X = data.table(a=1:3,b=1:15, key="a") >> 1> X >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 1 10 >> 5: 1 13 >> 6: 2 2 >> 7: 2 5 >> 8: 2 8 >> 9: 2 11 >> 10: 2 14 >> 11: 3 3 >> 12: 3 6 >> >> >> >> 13: 3 9 >> 14: 3 12 >> 15: 3 15 >> 1> Y = data.table(a=c(1,2), top=c(3,4)) >> 1> Y >> a top >> 1: 1 3 >> 2: 2 4 >> 1> X[Y, head(.SD,i.top)] >> a b >> 1: 1 1 >> 2: 1 4 >> 3: 1 7 >> 4: 2 2 >> 5: 2 5 >> >> >> >> 6: 2 8 >> 7: 2 11 >> 1> >> >> >> >> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >> >> >> >> On 24.04.2013 22:22, Eduard Antonyan wrote: >> >> By that you mean current behavior? You'd get current behavior by >> explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using >> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I >> can't figure out how by-without-by (or with by-with-by for that matter:) ) >> helps with e.g. the first example there: >> "We table table1 and table2. table1 has a column called rowcount. >> >> For each row from table1 we need to select first rowcount rows from >> table2, ordered by table2.id" >> >> >> >> >> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >> >>> But then what would be analogous to CROSS APPLY in SQL? >>> >>> > I'd agree with Eduard, although it's probably too late to change >>> behavior >>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>> > requested). >>> > >>> > S. >>> > >>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >>> >> From: eduard.antonyan at gmail.com >>> >> To: datatable-help at lists.r-forge.r-project.org >>> >> Subject: Re: [datatable-help] changing data.table by-without-by >>> >> syntax to require a "by" >>> >> >>> >> I think you're missing the point Michael. Just because it's possible >>> to >>> >> do it >>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>> >> to >>> >> argue in the OP. I don't think you've addressed the issue of >>> unnecessary >>> >> complexity pointed out in OP. >>> >> >>> >> >>> >> >>> >> -- >>> >> View this message in context: >>> >> >>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >>> >> Sent from the datatable-help mailing list archive at Nabble.com. >>> >> _______________________________________________ >>> >> datatable-help mailing list >>> >> datatable-help at lists.r-forge.r-project.org >>> >> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> > >>> _______________________________________________ >>> > datatable-help mailing list >>> > datatable-help at lists.r-forge.r-project.org >>> > >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From s_milberg at hotmail.com Thu Apr 25 16:54:36 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Thu, 25 Apr 2013 10:54:36 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <5222879356405645530@unknownmsgid> References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com>, , , , <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net>, , <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net>, , <9146185881995080674@unknownmsgid>, <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net>, <5222879356405645530@unknownmsgid> Message-ID: Whatever the "right" way to do things is, the key issue is that default behavior should not be changed since existing code will rely on it. So even though I tend to agree with Eduard, I would strongly advocate against any change in current behavior. This aside, let me throw my 2 pennies in for the sake of data.table.2: As for CROSS APPLY, to be honest, my experience with SQL has been primarily with MySQL < 5 so I didn't even know that existed. As for your specific example a couple of e-mails ago, I believe this works: X = data.table(a=1:3,b=1:15, key="a") Y = data.table(a=c(1,2,1), top=c(3,4,2)) X[Y][, head(.SD, top[1]), by=list(a, top)] Granted, this is somewhat inefficient since we now have the `top` vector replicated for each value of `a` in `X`. You can probably come up with other examples that are inefficient or just don't work (e.g. `Y = data.table(a=c(1,2,1, 1), top=c(3,4,2,2))`), but the point here isn't whether you should allow CROSS APPLY or not, but what the "correct" syntax for invoking cross apply is. I would argue that the correct output to: X[Y, sum(a * top)] Should be 21, not: a V1 1: 1 3 2: 2 8 3: 1 2 While the output above may be convenient to you, it is not intuitive at all. In fact, it is an advanced caveat to standard behavior ("J is an expression evaluated in the context of X") that isn't straigthforward to circumvent, and would likely bewilder most beginner users of data.table. I think given the parallels between data.table and SQL, "X[Y, sum(a * top)]" should mean "SELECT sum(X.a * Y.top) FROM X INNER JOIN Y USING(a)", not some more complex expression involving a CROSS APPLY. Note that if you want a CROSS APPLY in SQL, you have to ask for it (I guess I picked at terrible example here, since the GROUP is implied...). I think the "correct" way to do the original task would be something along the lines of: X[Y, head(.SD, i.top), cross.apply=TRUE] or some such. That said, data.table is yours. It is a fantastic tool, and if you want to behave in a manner that simplifies your work rather than matches the intuitions of others, then it is your hard earned right that I fully respect. Slightly off topic, why aren't the columns from the Y table available in joint inherited scope when not doing a by without by? I find it odd that: X[Y, sum(a * top), by=b] Produces: Error in `[.data.table`(X, Y, sum(a * top), by = b) : object 'top' not found Finally, is i.top documented? S. From: eduard.antonyan at gmail.com Date: Thu, 25 Apr 2013 07:45:45 -0500 To: mdowle at mdowle.plus.com CC: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. 
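Both behaviours discussed above appear to be reachable today by joining first and then aggregating the join result, where top is an ordinary column. An untested sketch using the thread's toy X and Y:

X[Y][, sum(a * top)]           # a single total over the whole join result (join, then aggregate)
X[Y][, sum(a * top), by = b]   # grouped aggregate with top in scope, avoiding the 'object top not found' error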
The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. 
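To make the distinction concrete, here is a small sketch with the thread's X and Y (illustrative only, not an excerpt from the thread): under the current default, j runs once per row of Y, whereas joining first and then running j gives one answer over the whole join. The proposed explicit by would name the first behaviour instead of inferring it from a missing by.

X[Y, sum(b)]      # by-without-by today: one row per row of Y, sum(b) within each matched group
X[Y][, sum(b)]    # join first, then sum(b) over everything the join returned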
On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: I assumed they meant create a table :) that looks cool, what's i.top ? I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. 
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From timothee.carayol at gmail.com Thu Apr 25 09:58:32 2013 From: timothee.carayol at gmail.com (=?ISO-8859-1?Q?Timoth=E9e_Carayol?=) Date: Thu, 25 Apr 2013 08:58:32 +0100 Subject: [datatable-help] fread(character string) limited to strings less than 4096 long? In-Reply-To: References: <83979bd1fc26d19625910fd1ad31f0e4.squirrel@webmail.plus.net> <2c2af8789733127541fe78c1ccde5412@imap.plus.net> <230b0040889556349b21822824a5fb7e@imap.plus.net> Message-ID: Hi ? I thought I'd follow up on this. Matthew, are you still unable to reproduce it? It is still happening to me after an upgrade to R 3.0.0. And Garrett's case above seems even more severe, with a truncation at 256 characters it seems, so it's not just me, and it does seem to depend on some sort of system configuration. On Thu, Mar 28, 2013 at 3:26 PM, Timoth?e Carayol < timothee.carayol at gmail.com> wrote: > Of course, I'll be happy to help! > > By the way the verbose output was actually from computer 1 (with 1.8.9) so > it seems like the -nan% problem is maybe still there? > > Cheers > Timoth?e > > > On Thu, Mar 28, 2013 at 3:19 PM, Matthew Dowle wrote: > >> ** >> >> >> >> Hi, >> >> Thanks. That was from v1.8.8 on computer 2 I hope. Computer 1 with >> v1.8.9 should have the -nan% problem fixed. >> >> I'm a bit stumped for the moment. I've filed a bug report. Probably, if >> I still can't reproduce my end, I'll add some more detailed tracing to >> verbose output and ask you to try again next week if that's ok. >> >> Thanks for reporting! >> >> Matthew >> >> >> >> On 28.03.2013 14:58, Timoth?e Carayol wrote: >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... >> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. 
>> Count of eol after first data row: 1023 >> >> Subtracted 1 for last eol and any trailing empty lines, leaving 1022 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1022x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 4092 1022 >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... >> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 1023 >> >> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 4096 1023 >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... >> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 1023 >> >> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 4100 1023 >> >> Input contains a \n (or is ""), taking this to be text input (not a >> filename) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> >> Using line 30 to detect sep (the last non blank line in the first 30) ... 
>> '\t' >> Found 2 columns >> >> First row with 2 fields occurs on line 1 (either column names or first >> row of data) >> All the fields on line 1 are character fields. Treating as the column >> names. >> Count of eol after first data row: 1023 >> >> Subtracted 0 for last eol and any trailing empty lines, leaving 1023 data >> rows >> Type codes: 33 (first 5 rows) >> >> Type codes: 33 (+middle 5 rows) >> >> Type codes: 33 (+last 5 rows) >> >> 0.000s (-nan%) Memory map (rerun may be quicker) >> >> 0.000s (-nan%) sep and header detection >> >> 0.000s (-nan%) Count rows (wc -l) >> >> 0.000s (-nan%) Column type detection (first, middle and last 5 rows) >> >> 0.000s (-nan%) Allocation of 1023x2 result (xMB) in RAM >> >> 0.000s (-nan%) Reading data >> >> 0.000s (-nan%) Allocation for type bumps (if any), including gc time >> if triggered >> 0.000s (-nan%) Coercing data already read in type bumps (if any) >> >> 0.000s (-nan%) Changing na.strings to NA >> >> 0.000s Total >> >> 40000 1023 >> >> >> >> On Thu, Mar 28, 2013 at 2:55 PM, Matthew Dowle wrote: >> >>> >>> >>> Hm this is odd. >>> >>> Could you run the following and paste back the (verbose) results please. >>> for (n in c(1023:1025, 10000)) { >>> >>> input = paste( rep('a\tb\n', n), collapse='') >>> A = fread(input,verbose=TRUE) >>> cat(nchar(input), nrow(A), "\n") >>> } >>> >>> >>> >>> >>> >>> On 28.03.2013 14:38, Timoth?e Carayol wrote: >>> >>> Curiouser and curiouser.. >>> >>> I can reproduce on two computers with different versions of R and of >>> data.table. >>> >>> >>> >>> Computer 1 (it says unknown-linux but is actually ubuntu): >>> >>> R version 2.15.3 (2013-03-01) >>> >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> >>> >>> locale: >>> >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>> LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>> LC_MONETARY=en_GB.UTF-8 >>> LC_MESSAGES=en_GB.UTF-8 LC_PAPER=C LC_NAME=C >>> LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 >>> LC_IDENTIFICATION=C >>> >>> >>> >>> attached base packages: >>> >>> [1] stats graphics grDevices utils datasets methods base >>> >>> >>> >>> other attached packages: >>> >>> [1] bit64_0.9-2 bit_1.1-10 data.table_1.8.9 colorout_1.0-0 >>> >>> Computer 2: >>> R version 2.15.2 (2012-10-26) >>> Platform: x86_64-redhat-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] data.table_1.8.8 >>> >>> loaded via a namespace (and not attached): >>> [1] tools_2.15.2 >>> >>> >>> On Thu, Mar 28, 2013 at 2:31 PM, Matthew Dowle wrote: >>> >>>> >>>> >>>> Interesting, what's your sessionInfo() please? >>>> >>>> For me it seems to work ok : >>>> >>>> [1] 1022 >>>> [1] 1023 >>>> [1] 1024 >>>> [1] 9999 >>>> >>>> > sessionInfo() >>>> R version 2.15.2 (2012-10-26) >>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>> >>>> >>>> >>>> On 27.03.2013 22:49, Timoth?e Carayol wrote: >>>> >>>> Agree with Muhammad, longer character strings are definitely >>>> permitted in R. 
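As a quick sanity check of that point (a trivial sketch, not taken from the thread): R itself handles strings far longer than 4096 characters, so any truncation seen below is specific to fread's text input handling.

s <- paste(rep("a\tb\n", 10000), collapse="")
nchar(s)    # 40000, well past 4096, with no complaint from R itself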
>>>> A minimal example that shows something strange happening with fread: >>>> for (n in c(1023:1025, 10000)) { >>>> A <- fread( >>>> paste( >>>> rep('a\tb\n', n), >>>> collapse='' >>>> ), >>>> sep='\t' >>>> ) >>>> print(nrow(A)) >>>> } >>>> On my computer, I obtain: >>>> [1] 1022 >>>> [1] 1023 >>>> [1] 1023 >>>> [1] 1023 >>>> Hope this helps >>>> Timothée >>>> >>>> >>>> On Wed, Mar 27, 2013 at 9:23 PM, Matthew Dowle wrote: >>>> >>>>> Hi, >>>>> Nice to hear from you. Nope not known to me. Obviously 4096 is 4k, is >>>>> that >>>>> the R limit for a character string length? What happens at 4097? >>>>> Matthew >>>>> >>>>> > Hi, >>>>> > >>>>> > I have an example of a string of 4097 characters which can't be >>>>> parsed by >>>>> > fread; however, if I remove any character, it can be parsed just >>>>> fine. Is >>>>> > that a known limitation? >>>>> > >>>>> > (If I write the string to a file and then fread the file name, it >>>>> works >>>>> > too.) >>>>> > >>>>> > Let me know if you need the string and/or a bug report. >>>>> > >>>>> > Thanks >>>>> > Timothée >>>>> > _______________________________________________ >>>>> > datatable-help mailing list >>>>> > datatable-help at lists.r-forge.r-project.org >>>>> > >>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >>> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Thu Apr 25 17:32:14 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 25 Apr 2013 16:32:14 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <5222879356405645530@unknownmsgid> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> Message-ID: <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. Thanks, Matthew On 25.04.2013 13:45, Eduard Antonyan wrote: > Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. > The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. > I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element.
The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. > > On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: > >> I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). >> >> As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. >> >> Maybe it helps to consider : >> >> x+y >> >> Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. >> >> I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. >> >> Matthew >> >> On 25.04.2013 05:16, Eduard Antonyan wrote: >> >>> That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. >>> To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. >>> >>> On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: >>> >>>> that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me >>>> >>>> On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: >>>> >>>>> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. >>>>> >>>>> What about this? 
: >>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>> 1> X >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 1 10 >>>>> 5: 1 13 >>>>> 6: 2 2 >>>>> 7: 2 5 >>>>> 8: 2 8 >>>>> 9: 2 11 >>>>> 10: 2 14 >>>>> 11: 3 3 >>>>> 12: 3 6 >>>>> 13: 3 9 >>>>> 14: 3 12 >>>>> 15: 3 15 >>>>> >>>>> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) >>>>> >>>>> 1> Y >>>>> a top >>>>> 1: 1 3 >>>>> 2: 2 4 >>>>> 3: 1 2 >>>>> 1> X[Y, head(.SD,i.top)] >>>>> a b >>>>> 1: 1 1 >>>>> 2: 1 4 >>>>> 3: 1 7 >>>>> 4: 2 2 >>>>> 5: 2 5 >>>>> 6: 2 8 >>>>> 7: 2 11 >>>>> 8: 1 1 >>>>> >>>>> 9: 1 4 >>>>> 1> >>>>> >>>>> On 24.04.2013 23:43, Eduard Antonyan wrote: >>>>> >>>>>> I assumed they meant create a table :) >>>>>> that looks cool, what's i.top ? I can get a very similar to yours result by writing: >>>>>> X[Y][, head(.SD, top[1]), by = a] >>>>>> and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): >>>>>> X[Y, head(.SD, i.top), by = a] >>>>>> >>>>>> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: >>>>>> >>>>>>> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >>>>>>> >>>>>>> Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : >>>>>>> >>>>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>>>> 1> X >>>>>>> a b >>>>>>> 1: 1 1 >>>>>>> 2: 1 4 >>>>>>> 3: 1 7 >>>>>>> 4: 1 10 >>>>>>> 5: 1 13 >>>>>>> 6: 2 2 >>>>>>> 7: 2 5 >>>>>>> 8: 2 8 >>>>>>> 9: 2 11 >>>>>>> 10: 2 14 >>>>>>> 11: 3 3 >>>>>>> 12: 3 6 >>>>>>> >>>>>>> 13: 3 9 >>>>>>> 14: 3 12 >>>>>>> 15: 3 15 >>>>>>> 1> Y = data.table(a=c(1,2), top=c(3,4)) >>>>>>> 1> Y >>>>>>> a top >>>>>>> 1: 1 3 >>>>>>> 2: 2 4 >>>>>>> 1> X[Y, head(.SD,i.top)] >>>>>>> a b >>>>>>> 1: 1 1 >>>>>>> 2: 1 4 >>>>>>> 3: 1 7 >>>>>>> 4: 2 2 >>>>>>> 5: 2 5 >>>>>>> >>>>>>> 6: 2 8 >>>>>>> 7: 2 11 >>>>>>> 1> >>>>>>> >>>>>>> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >>>>>>> >>>>>>> On 24.04.2013 22:22, Eduard Antonyan wrote: >>>>>>> >>>>>>>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>>>>>>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>>>>>>> "We table table1 and table2. table1 has a column called rowcount. >>>>>>>> >>>>>>>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [10]" >>>>>>>> >>>>>>>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>>>>>>> >>>>>>>>> But then what would be analogous to CROSS APPLY in SQL? >>>>>>>>> >>>>>>>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>>>>>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>>>>>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>>>>>>> > requested). >>>>>>>>> > >>>>>>>>> > S. >>>>>>>>> > >>>>>>>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>>>>>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>>>>>>> >>>>>>>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>>>>>>> >> syntax to require a "by" >>>>>>>>> >> >>>>>>>>> >> I think you're missing the point Michael. 
Just because it's possible to >>>>>>>>> >> do it >>>>>>>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>>>>>>> >> to >>>>>>>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>>>>>>> >> complexity pointed out in OP. >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> -- >>>>>>>>> >> View this message in context: >>>>>>>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>>>>>>> >> Sent from the datatable-help mailing list archive at Nabble.com [4]. >>>>>>>>> >> _______________________________________________ >>>>>>>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [5] >>>>>>>>> >>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] >>>>>>>>> > _______________________________________________ >>>>>>>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [7] >>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] http://Nabble.com [5] mailto:datatable-help at lists.r-forge.r-project.org [6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [10] http://table2.id [11] mailto:mdowle at mdowle.plus.com [12] mailto:mdowle at mdowle.plus.com [13] mailto:mdowle at mdowle.plus.com [14] mailto:eduard.antonyan at gmail.com [15] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.nelson at sydney.edu.au Fri Apr 26 00:46:37 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Thu, 25 Apr 2013 22:46:37 +0000 Subject: [datatable-help] =?windows-1252?q?there_is_no_package_called_=91x?= =?windows-1252?q?ts=92?= In-Reply-To: References: <874newhpv3.fsf@gnu.org> <6FB5193A6CDCDF499486A833B7AFBDCD6751B6FB@EX-MBX-PRO-04.mcs.usyd.edu.au>, Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD6751B9E5@EX-MBX-PRO-04.mcs.usyd.edu.au> Indeed. Although I prefer .() to J() to be inline with the future implementation of ..() as well. frame[.(unique(id)), mult = "last"] the relevant section from the NEWS New DT[.(...)] syntax (in the style of package plyr) is identical to DT[list(...)], DT[J(...)] and DT[data.table(...)]. We plan to add ..(), too, so that .() and ..() are analogous to the file system's ./ and ../; i.e., .() evaluates within the frame of DT and ..() in the parent scope. From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Eduard Antonyan [eduard.antonyan at gmail.com] Sent: Thursday, 25 April 2013 12:47 AM To: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] there is no package called ?xts? @Michael, in the last expression, you probably forgot a J: frame[J(unique(id)), mult = "last"] On Tue, Apr 23, 2013 at 7:41 PM, Michael Nelson michael.nelson at sydney.edu.au wrote: >From the help for data.table::last If x is a data.table, the last row as a one row data.table. Otherwise, whatever xts::last returns. 
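For concreteness, a minimal sketch of that dispatch (it assumes data.table is attached and xts is not installed; the table and column names are only illustrative):

library(data.table)
frame <- data.table(id = c(1L, 1L, 2L), x = 1:3, key = "id")
last(frame)     # a data.table: data.table's own method returns the last row
last(frame$x)   # a bare vector: handed over to xts::last, so this errors when xts is absent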
calling lapply(.SD, last) will call last on each column in .SD. Columns within a data.table aren't data.tables thus `xts::last` is called. xts is on the suggests list for data.table, you could use install.packages('data.table, dependencies = 'Suggests') or manually installed xts. OR frame[, last(.SD), by = id] would work without needing xts as would frame[, .SD[.N], by = id] or without having to construct .SD (which is time consuming) frame[frame[, .I[.N],by = id]$V1] or setkey(frame, id) frame[unique(id), mult = 'last'] ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Sam Steingold [sds at gnu.org] Sent: Wednesday, 24 April 2013 7:57 AM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] there is no package called ?xts? Hi, I got this: > dt <- frame[, lapply(.SD, last) ,by=id] Finding groups (bysameorder=TRUE) ... done in 0.126secs. bysameorder=TRUE and o__ is length 0 Optimized j from 'lapply(.SD, last)' to 'list(last(country), last(language), last(browser), last(platform), last(uatype), last(behavior))' Starting dogroups ... Error in loadNamespace(name) : there is no package called ?xts? Calls: [ -> [.data.table -> last -> :: -> getExportedValue -> asNamespace -> getNamespace -> tryCatch -> tryCatchList -> tryCatchOne -> > the help for last does mention xts, but I don't have it installed. do I need to? -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://think-israel.org http://mideasttruth.com http://memri.org http://camera.org Ernqvat guvf ivbyngrf QZPN. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Fri Apr 26 13:14:02 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 12:14:02 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> Message-ID: <2a827f7db260f284908fac604301eb8e@imap.plus.net> I didn't get any feedback off list on this one. But I'm coming round to the idea. What about by=.JOIN (is that you were thinking .J stood for?) Other possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to brainstorm it. by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be. To change the default for X[Y, j] I'm also coming round to. It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed). We have successfully made non-backwards-compatibile changes in the past by introducing a global option which we slowly migrate to. 
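To make that migration pattern concrete, a rough sketch of how a package-level option is typically set and consulted (datatable.bywithoutby is only the name proposed in this thread, not an option that exists today):

options(datatable.bywithoutby = "warning")                   # user opts in to the transitional setting
bwb <- getOption("datatable.bywithoutby", default = TRUE)    # package side: TRUE stays the default
if (identical(bwb, "warning"))
  warning("implicit by-without-by; an explicit by may be required in future")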
If datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from day one, with default TRUE. That allows those who wish for explicit by to migrate straight away by changing the default to FALSE. Existing users could set it to "warning" to see how many implicit bywithoutby they have. Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time, for say a year, with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis. Then the default could be chaged to "warning" before finally FALSE. Depending on how it goes, the option could be left there to allow TRUE if anyone wanted it, or removed (maybe after two years). Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility. Or ... instead of : X[Y, j, by=.JOIN] what about : X[by=Y, j] Matthew On 25.04.2013 16:32, Matthew Dowle wrote: > I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. > > Thanks, > > Matthew > > On 25.04.2013 13:45, Eduard Antonyan wrote: > >> Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. >> The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. >> I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. >> >> On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: >> >>> I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). >>> >>> As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. >>> >>> Maybe it helps to consider : >>> >>> x+y >>> >>> Fundamentally in R this depends on what x and y are. 
Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. >>> >>> I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. >>> >>> Matthew >>> >>> On 25.04.2013 05:16, Eduard Antonyan wrote: >>> >>>> That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. >>>> To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. >>>> >>>> On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: >>>> >>>>> that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me >>>>> >>>>> On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: >>>>> >>>>>> i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. >>>>>> >>>>>> What about this? : >>>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>>> 1> X >>>>>> a b >>>>>> 1: 1 1 >>>>>> 2: 1 4 >>>>>> 3: 1 7 >>>>>> 4: 1 10 >>>>>> 5: 1 13 >>>>>> 6: 2 2 >>>>>> 7: 2 5 >>>>>> 8: 2 8 >>>>>> 9: 2 11 >>>>>> 10: 2 14 >>>>>> 11: 3 3 >>>>>> 12: 3 6 >>>>>> 13: 3 9 >>>>>> 14: 3 12 >>>>>> 15: 3 15 >>>>>> >>>>>> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) >>>>>> >>>>>> 1> Y >>>>>> a top >>>>>> 1: 1 3 >>>>>> 2: 2 4 >>>>>> 3: 1 2 >>>>>> 1> X[Y, head(.SD,i.top)] >>>>>> a b >>>>>> 1: 1 1 >>>>>> 2: 1 4 >>>>>> 3: 1 7 >>>>>> 4: 2 2 >>>>>> 5: 2 5 >>>>>> 6: 2 8 >>>>>> 7: 2 11 >>>>>> 8: 1 1 >>>>>> >>>>>> 9: 1 4 >>>>>> 1> >>>>>> >>>>>> On 24.04.2013 23:43, Eduard Antonyan wrote: >>>>>> >>>>>>> I assumed they meant create a table :) >>>>>>> that looks cool, what's i.top ? I can get a very similar to yours result by writing: >>>>>>> X[Y][, head(.SD, top[1]), by = a] >>>>>>> and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): >>>>>>> X[Y, head(.SD, i.top), by = a] >>>>>>> >>>>>>> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: >>>>>>> >>>>>>>> That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? >>>>>>>> >>>>>>>> Anyway, by-without-by is often used with join inherited scope (JIS). 
For example, translating their example : >>>>>>>> >>>>>>>> 1> X = data.table(a=1:3,b=1:15, key="a") >>>>>>>> 1> X >>>>>>>> a b >>>>>>>> 1: 1 1 >>>>>>>> 2: 1 4 >>>>>>>> 3: 1 7 >>>>>>>> 4: 1 10 >>>>>>>> 5: 1 13 >>>>>>>> 6: 2 2 >>>>>>>> 7: 2 5 >>>>>>>> 8: 2 8 >>>>>>>> 9: 2 11 >>>>>>>> 10: 2 14 >>>>>>>> 11: 3 3 >>>>>>>> 12: 3 6 >>>>>>>> >>>>>>>> 13: 3 9 >>>>>>>> 14: 3 12 >>>>>>>> 15: 3 15 >>>>>>>> 1> Y = data.table(a=c(1,2), top=c(3,4)) >>>>>>>> 1> Y >>>>>>>> a top >>>>>>>> 1: 1 3 >>>>>>>> 2: 2 4 >>>>>>>> 1> X[Y, head(.SD,i.top)] >>>>>>>> a b >>>>>>>> 1: 1 1 >>>>>>>> 2: 1 4 >>>>>>>> 3: 1 7 >>>>>>>> 4: 2 2 >>>>>>>> 5: 2 5 >>>>>>>> >>>>>>>> 6: 2 8 >>>>>>>> 7: 2 11 >>>>>>>> 1> >>>>>>>> >>>>>>>> If there was no by-without-by (analogous to CROSS BY), then how would that be done? >>>>>>>> >>>>>>>> On 24.04.2013 22:22, Eduard Antonyan wrote: >>>>>>>> >>>>>>>>> By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). >>>>>>>>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9], and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: >>>>>>>>> "We table table1 and table2. table1 has a column called rowcount. >>>>>>>>> >>>>>>>>> For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id [10]" >>>>>>>>> >>>>>>>>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: >>>>>>>>> >>>>>>>>>> But then what would be analogous to CROSS APPLY in SQL? >>>>>>>>>> >>>>>>>>>> > I'd agree with Eduard, although it's probably too late to change behavior >>>>>>>>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely >>>>>>>>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if >>>>>>>>>> > requested). >>>>>>>>>> > >>>>>>>>>> > S. >>>>>>>>>> > >>>>>>>>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com [1] >>>>>>>>>> >> To: datatable-help at lists.r-forge.r-project.org [2] >>>>>>>>>> >>>>>>>>>>>> Subject: Re: [datatable-help] changing data.table by-without-by >>>>>>>>>> >> syntax to require a "by" >>>>>>>>>> >> >>>>>>>>>> >> I think you're missing the point Michael. Just because it's possible to >>>>>>>>>> >> do it >>>>>>>>>> >> the way it's done now, doesn't mean that's the best way, as I've tried >>>>>>>>>> >> to >>>>>>>>>> >> argue in the OP. I don't think you've addressed the issue of unnecessary >>>>>>>>>> >> complexity pointed out in OP. >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> -- >>>>>>>>>> >> View this message in context: >>>>>>>>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3] >>>>>>>>>> >> Sent from the datatable-help mailing list archive at Nabble.com [4]. 
>>>>>>>>>> >> _______________________________________________ >>>>>>>>>> >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org [5] >>>>>>>>>> >>>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6] >>>>>>>>>> > _______________________________________________ >>>>>>>>>> > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org [7] >>>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8] Links: ------ [1] mailto:eduard.antonyan at gmail.com [2] mailto:datatable-help at lists.r-forge.r-project.org [3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [4] http://Nabble.com [5] mailto:datatable-help at lists.r-forge.r-project.org [6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [7] mailto:datatable-help at lists.r-forge.r-project.org [8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [9] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [10] http://table2.id [11] mailto:mdowle at mdowle.plus.com [12] mailto:mdowle at mdowle.plus.com [13] mailto:mdowle at mdowle.plus.com [14] mailto:eduard.antonyan at gmail.com [15] mailto:mdowle at mdowle.plus.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From s_milberg at hotmail.com Fri Apr 26 15:34:38 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Fri, 26 Apr 2013 09:34:38 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <2a827f7db260f284908fac604301eb8e@imap.plus.net> References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com>, , , , <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net>, , <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net>, , <9146185881995080674@unknownmsgid>, <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net>, <5222879356405645530@unknownmsgid>, <64f192ba80ac813986ed256029f0e7e0@imap.plus.net>, <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: Your suggestion for transition seems reasonable, although I still think you should just use a new argument rather than try to change the behavior of by. The most natural thing seems to leave Y as the `i` value, since after all, we are still joining on the key, and then just modify the standard join behavior with the cross.apply=TRUE or some such. This way, you avoid having to have a more complicated description of the `by` argument, where all of a sudden it means 'group by these expressions, unless you use the special expression .XXX, in which case something confusingly similar yet different happens, oh, and by the way, you can only use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?). To some extent your final proposal of by=Y is a little better, but still confusing since now you're using by to join and group, when it's `i` job to do that. Loosely related, what does .JOIN represent? Is it just a flag, or is it a derived variable the way .SD is? If it's just a flag, it seems like a bad idea to use a name to represent it since that is a break from the meaning of all the other .X variables in data.table, which actually contain some kind of derivative data. Finally, when you say "might help in a few related areas e.g. 
X[Y][,j] (which isn't great right now, agreed)", do you mean joint inherited scope will work even when we're not in by-without-by mode? That would be great. S. Date: Fri, 26 Apr 2013 12:14:02 +0100 From: mdowle at mdowle.plus.com To: eduard.antonyan at gmail.com CC: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" I didn't get any feedback off list on this one. But I'm coming round to the idea. What about by=.JOIN (is that you were thinking .J stood for?) Other possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to brainstorm it. by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be. To change the default for X[Y, j] I'm also coming round to. It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed). We have successfully made non-backwards-compatibile changes in the past by introducing a global option which we slowly migrate to. If datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from day one, with default TRUE. That allows those who wish for explicit by to migrate straight away by changing the default to FALSE. Existing users could set it to "warning" to see how many implicit bywithoutby they have. Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time, for say a year, with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis. Then the default could be chaged to "warning" before finally FALSE. Depending on how it goes, the option could be left there to allow TRUE if anyone wanted it, or removed (maybe after two years). Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility. Or ... instead of : X[Y, j, by=.JOIN] what about : X[by=Y, j] Matthew On 25.04.2013 16:32, Matthew Dowle wrote: I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. Thanks, Matthew On 25.04.2013 13:45, Eduard Antonyan wrote: Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. 
The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: I assumed they meant create a table :) that looks cool, what's i.top ? 
I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
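For reference, the behaviours being contrasted in this thread can be reproduced side by side with the X and Y defined above (a sketch only, under current data.table semantics):

library(data.table)
X = data.table(a=1:3, b=1:15, key="a")
Y = data.table(a=c(1,2), top=c(3,4))
X[Y, head(.SD, i.top)]             # by-without-by: j runs once per row of Y, using join inherited i.top
X[Y][, head(.SD, top[1]), by=a]    # join first, then an explicit by over the joined result
X[Y, .N]                           # also grouped per i row today; the proposal is to require an explicit signal for this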
URL: From eduard.antonyan at gmail.com Fri Apr 26 17:17:28 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Fri, 26 Apr 2013 10:17:28 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: I indeed offered .J as a shorthand for .JOIN and to ease the pain of having to type extra stuff for users who are relying on current behavior. Sadao is making good points. The question of what does by=list(a, .JOIN) do can still apply though with cross.apply=TRUE syntax, i.e. what does X[Y,j,by=a,cross.apply=TRUE] do? And I think the answer is the same for either syntax - in addition to the cross-apply-by it would group by column 'a'. Btw I think Matthew's examples above (or smth like them) should go into the FAQ or documentation as they were very illuminating and entirely non-obvious to me. If I were to rate all of the above from imo best to worst, it would be: .JOIN (or .J - yes, I'm biased:) ) .EACHI/cross.apply=TRUE .EACHIROW/.EACHJOIN .CROSSAPPLY X[by=Y,j] After typing the above list, I'm actually starting to like .EACHI (each.i=TRUE? <- I like this even better) more and more as it seems to convey the meaning (as far as I currently understand it - my understanding has shifted a little since the start of this conversation) really well. Anyway, sorry for a verbose email - my current vote is 'each.i = TRUE' - I think this conveys the right meaning, satisfies Sadao's points and also has a meaning that transitions well between having a join-i and not having a join-i (when you're not joining, specifying this option wouldn't do anything extra). On Fri, Apr 26, 2013 at 8:34 AM, Sadao Milberg wrote: > Your suggestion for transition seems reasonable, although I still think > you should just use a new argument rather than try to change the behavior > of by. The most natural thing seems to leave Y as the `i` value, since > after all, we are still joining on the key, and then just modify the > standard join behavior with the cross.apply=TRUE or some such. > > This way, you avoid having to have a more complicated description of the > `by` argument, where all of a sudden it means 'group by these expressions, > unless you use the special expression .XXX, in which case something > confusingly similar yet different happens, oh, and by the way, you can only > use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?). > To some extent your final proposal of by=Y is a little better, but still > confusing since now you're using by to join and group, when it's `i` job to > do that. > > Loosely related, what does .JOIN represent? Is it just a flag, or is it a > derived variable the way .SD is? If it's just a flag, it seems like a bad > idea to use a name to represent it since that is a break from the meaning > of all the other .X variables in data.table, which actually contain some > kind of derivative data. > > Finally, when you say "might help in a few related areas e.g. 
X[Y][,j] > (which isn't great right now, agreed)", do you mean joint inherited scope > will work even when we're not in by-without-by mode? That would be great. > > S. > > > ------------------------------ > Date: Fri, 26 Apr 2013 12:14:02 +0100 > From: mdowle at mdowle.plus.com > To: eduard.antonyan at gmail.com > CC: datatable-help at lists.r-forge.r-project.org > > Subject: Re: [datatable-help] changing data.table by-without-by syntax to > require a "by" > > > I didn't get any feedback off list on this one. > But I'm coming round to the idea. > What about by=.JOIN (is that you were thinking .J stood for?) Other > possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to > brainstorm it. > by=.JOIN could be added anyway with no backwards compatibility issues, so > that those who wished to be explicit now could be. > To change the default for X[Y, j] I'm also coming round to. It might > help in a few related areas e.g. X[Y][,j] (which isn't great right now, > agreed). We have successfully made non-backwards-compatibile changes in > the past by introducing a global option which we slowly migrate to. If > datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE > from day one, with default TRUE. That allows those who wish for explicit > by to migrate straight away by changing the default to FALSE. Existing > users could set it to "warning" to see how many implicit bywithoutby they > have. Those calls can gradually be changed to by=.JOIN and in that way > both implicit and explicit work at the same time, for say a year, with > full backwards compatibility by default. This approach allows a slow and > flexible migration path on a per feature basis. Then the default could be > chaged to "warning" before finally FALSE. Depending on how it goes, > the option could be left there to allow TRUE if anyone wanted it, or > removed (maybe after two years). Similar to the removal of J() outside > DT[...] i.e. users can still now very easily write J=data.table in their > .Rprofile if they wish, for backwards compatibility. > Or ... instead of : > X[Y, j, by=.JOIN] > what about : > X[by=Y, j] > Matthew > > On 25.04.2013 16:32, Matthew Dowle wrote: > > > I'd appreciate some input from others whether they agree or not. If you > have a view perhaps let me know off list, or on list, whichever you prefer. > Thanks, > Matthew > > On 25.04.2013 13:45, Eduard Antonyan wrote: > > Well, so can .I or .N or .GRP or .BY, yet those are used as special names, > which is exactly why I suggested .J. > The problem with using 'missingness' is that it already means smth very > different when i is not a join/cross, it means *don't* do a by, thus > introducing the whole case thing one has to through in their head every > time as in OP (which of course becomes automatic after a while, but it's a > cost nonetheless, which is in particular high for new people). So I see > absence of 'by' as an already taken and used signal and thus something else > has to be used for the new signal of cross apply (it doesn't have to be the > specific option I mentioned above). This is exactly why I find optional > turning off of this behavior unsatisfactory, and I don't see that as a > solution to this at all. > I think in the x+y context the appropriate analog is - what if that added > x and y normally, but when x and y were data.frames it did element by > element multiplication instead? 
Yes that's possible to do, and possible to > document, but it's not a good idea, because it takes place of adding them > element by element. The recycling behavior doesn't do that - what that does > is it says it doesn't really make sense to add them as is, but we can do > that after recycling, so let's recycle. It doesn't take the place of > another existing way of adding vectors. > > On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: > > > > I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). > > As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. > > Maybe it helps to consider : > > x+y > > Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. > > > > I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. > > Matthew > > On 25.04.2013 05:16, Eduard Antonyan wrote: > > That's really interesting, I can't currently think of another way of doing > that as after X[Y] is done the necessary information is lost. > To retain that functionality and achieve better readability, as in OP, I > think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good > replacement for current syntax. > > On Apr 24, 2013, at 6:01 PM, Eduard Antonyan > wrote: > > that's an interesting example - I didn't realize current behavior would > do that, I'm not at a PC anymore but I'll definitely think about it and > report back, as it's not immediately obvious to me > > > On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: > > > i. prefix is just a robust way to reference join inherited columns: the > 'top' column in the i table. Like table aliases in SQL. > What about this? 
: > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > 13: 3 9 > 14: 3 12 > 15: 3 15 > > 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) > > > 1> Y > a top > 1: 1 3 > 2: 2 4 > 3: 1 2 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > 6: 2 8 > 7: 2 11 > 8: 1 1 > > 9: 1 4 > 1> > > > On 24.04.2013 23:43, Eduard Antonyan wrote: > > I assumed they meant create a table :) > that looks cool, what's i.top ? I can get a very similar to yours result > by writing: > X[Y][, head(.SD, top[1]), by = a] > and I probably would want the following to produce your result (this might > depend a little on what exactly i.top is): > X[Y, head(.SD, i.top), by = a] > > > On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: > > > That sentence on that linked webpage seems incorect English, since table > is a noun not a verb. Should "table" be "join" perhaps? > Anyway, by-without-by is often used with join inherited scope (JIS). For > example, translating their example : > > 1> X = data.table(a=1:3,b=1:15, key="a") > 1> X > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 1 10 > 5: 1 13 > 6: 2 2 > 7: 2 5 > 8: 2 8 > 9: 2 11 > 10: 2 14 > 11: 3 3 > 12: 3 6 > > > > > 13: 3 9 > 14: 3 12 > 15: 3 15 > 1> Y = data.table(a=c(1,2), top=c(3,4)) > 1> Y > a top > 1: 1 3 > 2: 2 4 > 1> X[Y, head(.SD,i.top)] > a b > 1: 1 1 > 2: 1 4 > 3: 1 7 > 4: 2 2 > 5: 2 5 > > > > > 6: 2 8 > 7: 2 11 > 1> > > > > If there was no by-without-by (analogous to CROSS BY), then how would that be done? > > > > On 24.04.2013 22:22, Eduard Antonyan wrote: > > By that you mean current behavior? You'd get current behavior by > explicitly specifying the appropriate "by" (i.e. "by" equal to the key). > Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using > http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I > can't figure out how by-without-by (or with by-with-by for that matter:) ) > helps with e.g. the first example there: > "We table table1 and table2. table1 has a column called rowcount. > > For each row from table1 we need to select first rowcount rows from table2, > ordered by table2.id" > > > > > On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: > > But then what would be analogous to CROSS APPLY in SQL? > > > I'd agree with Eduard, although it's probably too late to change behavior > > now. Maybe for data.table.2? Eduard's proposal seems more closely > > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > > requested). > > > > S. > > > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 > >> From: eduard.antonyan at gmail.com > >> To: datatable-help at lists.r-forge.r-project.org > >> Subject: Re: [datatable-help] changing data.table by-without-by > >> syntax to require a "by" > >> > >> I think you're missing the point Michael. Just because it's possible to > >> do it > >> the way it's done now, doesn't mean that's the best way, as I've tried > >> to > >> argue in the OP. I don't think you've addressed the issue of unnecessary > >> complexity pointed out in OP. > >> > >> > >> > >> -- > >> View this message in context: > >> > http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html > >> Sent from the datatable-help mailing list archive at Nabble.com. 
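To summarise the candidate spellings discussed in this thread side by side (only the first line runs in current data.table, reusing the X and Y above; the commented lines are proposals from this thread, none of them implemented):

X[Y, head(.SD, i.top)]                       # today: implicit by-without-by, grouped per row of Y
# X[Y, head(.SD, i.top), by=.JOIN]           # candidate: special symbol in by (also written .J)
# X[Y, head(.SD, i.top), by=.EACHI]          # candidate: special symbol in by, "each i row"
# X[Y, head(.SD, i.top), each.i=TRUE]        # candidate: separate flag argument
# X[Y, head(.SD, i.top), cross.apply=TRUE]   # candidate: name borrowed from SQL CROSS APPLY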
> >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > > _______________________________________________ datatable-help mailing > list datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sds at gnu.org Fri Apr 26 17:45:22 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 11:45:22 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> Message-ID: <87mwslqorx.fsf@gnu.org> I am still missing something: --8<---------------cut here---------------start------------->8--- > dt <- data.table(user=c(rep(4, 5),rep(3, 5)), behavior=c(rep(FALSE,5),rep(TRUE,5)), country=c(rep(1,4),rep(2,6)), language=c(rep(6,6),rep(5,4)), event=1:10, key=c("user","country","language")) > dt user behavior country language event 1: 3 TRUE 2 5 7 2: 3 TRUE 2 5 8 3: 3 TRUE 2 5 9 4: 3 TRUE 2 5 10 5: 3 TRUE 2 6 6 6: 4 FALSE 1 6 1 7: 4 FALSE 1 6 2 8: 4 FALSE 1 6 3 9: 4 FALSE 1 6 4 10: 4 FALSE 2 6 5 > users <- dt[, sum(behavior) > 0, by=user] Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: behavior Optimization is on but j left unchanged as 'sum(behavior) > 0' Starting dogroups ... done dogroups in 0 secs > users user V1 1: 3 TRUE 2: 4 FALSE > setnames(users, "V1", "behavior") --8<---------------cut here---------------end--------------->8--- Now I want to do the same thing as in http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data for both fields > fields <- c("country","language") here is what I tried so far: --8<---------------cut here---------------start------------->8--- dt[, .N, .SDcols=fields, by=eval(list("user",fields))] Error in `[.data.table`(dt, , .N, .SDcols = fields, by = eval(list("user", : The items in the 'by' or 'keyby' list are length (1,2). Each must be same length as rows in x or number of rows returned by i (10). Calls: [ -> [.data.table --8<---------------cut here---------------end--------------->8--- the idea is to do something like --8<---------------cut here---------------start------------->8--- > dt.out <- dt[, .N, by=list(user,country)][, list(country[which.max(N)], max(N)/sum(N)), by=user] > setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", ".support"))) > users <- users[dt.out] user behavior country.name country.support 1: 3 TRUE 2 1.0 2: 4 FALSE 1 0.8 --8<---------------cut here---------------end--------------->8--- except that I do not want to have the literal "country" and "language" and that I am sure there is a way to avoid copying users in > users <- users[dt.out] by a ":=" trick. Thanks. > * Matthew Dowle [2013-04-24 21:54:17 +0100]: > > where ... 
is eval(myid) > iigc >> Or: >> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://palestinefacts.org http://ffii.org http://jihadwatch.org http://thereligionofpeace.com Morning is too early for anything but sleep. From mdowle at mdowle.plus.com Fri Apr 26 18:00:27 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 17:00:27 +0100 Subject: [datatable-help] variable column names In-Reply-To: <87mwslqorx.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> Message-ID: <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> > dt[, sum(behavior) > 0, by=user] user V1 1: 3 TRUE 2: 4 FALSE > dt[, any(behavior), by=user] # same user V1 1: 3 TRUE 2: 4 FALSE > dt[, list(behavior = any(behavior)), by=user] # how to same without > setnames afterwards user behavior 1: 3 TRUE 2: 4 FALSE > fields <- c("country","language") > dt[, list(behavior = any(behavior)), by=c("user",fields)] # by may > be character vector of column names user country language behavior 1: 3 2 5 TRUE 2: 3 2 6 TRUE 3: 4 1 6 FALSE 4: 4 2 6 FALSE HTH Matthew On 26.04.2013 16:45, Sam Steingold wrote: > I am still missing something: > > --8<---------------cut here---------------start------------->8--- >> dt <- data.table(user=c(rep(4, 5),rep(3, 5)), >> behavior=c(rep(FALSE,5),rep(TRUE,5)), > country=c(rep(1,4),rep(2,6)), > language=c(rep(6,6),rep(5,4)), > event=1:10, key=c("user","country","language")) >> dt > user behavior country language event > 1: 3 TRUE 2 5 7 > 2: 3 TRUE 2 5 8 > 3: 3 TRUE 2 5 9 > 4: 3 TRUE 2 5 10 > 5: 3 TRUE 2 6 6 > 6: 4 FALSE 1 6 1 > 7: 4 FALSE 1 6 2 > 8: 4 FALSE 1 6 3 > 9: 4 FALSE 1 6 4 > 10: 4 FALSE 2 6 5 >> users <- dt[, sum(behavior) > 0, by=user] > Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE > and o__ is length 0 > Detected that j uses these columns: behavior > Optimization is on but j left unchanged as 'sum(behavior) > 0' > Starting dogroups ... done dogroups in 0 secs >> users > user V1 > 1: 3 TRUE > 2: 4 FALSE >> setnames(users, "V1", "behavior") > --8<---------------cut here---------------end--------------->8--- > > Now I want to do the same thing as in > > http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data > for both fields >> fields <- c("country","language") > > here is what I tried so far: > > --8<---------------cut here---------------start------------->8--- > dt[, .N, .SDcols=fields, by=eval(list("user",fields))] > Error in `[.data.table`(dt, , .N, .SDcols = fields, by = > eval(list("user", : > The items in the 'by' or 'keyby' list are length (1,2). Each must > be same length as rows in x or number of rows returned by i (10). 
> Calls: [ -> [.data.table > --8<---------------cut here---------------end--------------->8--- > > the idea is to do something like > > --8<---------------cut here---------------start------------->8--- >> dt.out <- dt[, .N, by=list(user,country)][, >> list(country[which.max(N)], max(N)/sum(N)), by=user] >> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", >> ".support"))) >> users <- users[dt.out] > user behavior country.name country.support > 1: 3 TRUE 2 1.0 > 2: 4 FALSE 1 0.8 > --8<---------------cut here---------------end--------------->8--- > > except that I do not want to have the literal "country" and > "language" > and that I am sure there is a way to avoid copying users in >> users <- users[dt.out] > by a ":=" trick. > > Thanks. > >> * Matthew Dowle [2013-04-24 21:54:17 >> +0100]: >> >> where ... is eval(myid) >> iigc >>> Or: >>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] From sds at gnu.org Fri Apr 26 18:26:06 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 12:26:06 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> Message-ID: <87d2thqmw1.fsf@gnu.org> > * Matthew Dowle [2013-04-26 17:00:27 +0100]: > >> dt[, sum(behavior) > 0, by=user] > user V1 > 1: 3 TRUE > 2: 4 FALSE >> dt[, any(behavior), by=user] # same > user V1 > 1: 3 TRUE > 2: 4 FALSE >> dt[, list(behavior = any(behavior)), by=user] # how to same without >> setnames afterwards > user behavior > 1: 3 TRUE > 2: 4 FALSE >> fields <- c("country","language") >> dt[, list(behavior = any(behavior)), by=c("user",fields)] # by may >> be character vector of column names > user country language behavior > 1: 3 2 5 TRUE > 2: 3 2 6 TRUE > 3: 4 1 6 FALSE > 4: 4 2 6 FALSE oh no, this is _not_ what I want! user should be unique and fields should be summarized as described in the SO question (see the code below) > > > On 26.04.2013 16:45, Sam Steingold wrote: >> I am still missing something: >> >> --8<---------------cut here---------------start------------->8--- >>> dt <- data.table(user=c(rep(4, 5),rep(3, 5)), >>> behavior=c(rep(FALSE,5),rep(TRUE,5)), >> country=c(rep(1,4),rep(2,6)), >> language=c(rep(6,6),rep(5,4)), >> event=1:10, key=c("user","country","language")) >>> dt >> user behavior country language event >> 1: 3 TRUE 2 5 7 >> 2: 3 TRUE 2 5 8 >> 3: 3 TRUE 2 5 9 >> 4: 3 TRUE 2 5 10 >> 5: 3 TRUE 2 6 6 >> 6: 4 FALSE 1 6 1 >> 7: 4 FALSE 1 6 2 >> 8: 4 FALSE 1 6 3 >> 9: 4 FALSE 1 6 4 >> 10: 4 FALSE 2 6 5 >>> users <- dt[, sum(behavior) > 0, by=user] >> Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE >> and o__ is length 0 >> Detected that j uses these columns: behavior >> Optimization is on but j left unchanged as 'sum(behavior) > 0' >> Starting dogroups ... 
done dogroups in 0 secs >>> users >> user V1 >> 1: 3 TRUE >> 2: 4 FALSE >>> setnames(users, "V1", "behavior") >> --8<---------------cut here---------------end--------------->8--- >> >> Now I want to do the same thing as in >> >> http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data >> for both fields >>> fields <- c("country","language") >> >> here is what I tried so far: >> >> --8<---------------cut here---------------start------------->8--- >> dt[, .N, .SDcols=fields, by=eval(list("user",fields))] >> Error in `[.data.table`(dt, , .N, .SDcols = fields, by = >> eval(list("user", : >> The items in the 'by' or 'keyby' list are length (1,2). Each must >> be same length as rows in x or number of rows returned by i (10). >> Calls: [ -> [.data.table >> --8<---------------cut here---------------end--------------->8--- >> >> the idea is to do something like >> >> --8<---------------cut here---------------start------------->8--- >>> dt.out <- dt[, .N, by=list(user,country)][, >>> list(country[which.max(N)], max(N)/sum(N)), by=user] >>> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", >>> ".support"))) >>> users <- users[dt.out] >> user behavior country.name country.support >> 1: 3 TRUE 2 1.0 >> 2: 4 FALSE 1 0.8 >> --8<---------------cut here---------------end--------------->8--- >> >> except that I do not want to have the literal "country" and "language" >> and that I am sure there is a way to avoid copying users in >>> users <- users[dt.out] >> by a ":=" trick. >> >> Thanks. >> >>> * Matthew Dowle [2013-04-24 21:54:17 +0100]: >>> >>> where ... is eval(myid) >>> iigc >>>> Or: >>>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://ffii.org http://pmw.org.il http://palestinefacts.org http://dhimmi.com http://thereligionofpeace.com Perl: all stupidities of UNIX in one. From mdowle at mdowle.plus.com Fri Apr 26 18:45:53 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 17:45:53 +0100 Subject: [datatable-help] variable column names In-Reply-To: <87d2thqmw1.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> Message-ID: S.O. is probably better for this kind of question then. But if you don't get an answer there, then come back to datatable-help. On 26.04.2013 17:26, Sam Steingold wrote: >> * Matthew Dowle [2013-04-26 17:00:27 >> +0100]: >> >>> dt[, sum(behavior) > 0, by=user] >> user V1 >> 1: 3 TRUE >> 2: 4 FALSE >>> dt[, any(behavior), by=user] # same >> user V1 >> 1: 3 TRUE >> 2: 4 FALSE >>> dt[, list(behavior = any(behavior)), by=user] # how to same >>> without >>> setnames afterwards >> user behavior >> 1: 3 TRUE >> 2: 4 FALSE >>> fields <- c("country","language") >>> dt[, list(behavior = any(behavior)), by=c("user",fields)] # by >>> may >>> be character vector of column names >> user country language behavior >> 1: 3 2 5 TRUE >> 2: 3 2 6 TRUE >> 3: 4 1 6 FALSE >> 4: 4 2 6 FALSE > > oh no, this is _not_ what I want! 
> user should be unique and fields should be summarized as described in > the SO question (see the code below) > > >> >> >> On 26.04.2013 16:45, Sam Steingold wrote: >>> I am still missing something: >>> >>> --8<---------------cut here---------------start------------->8--- >>>> dt <- data.table(user=c(rep(4, 5),rep(3, 5)), >>>> behavior=c(rep(FALSE,5),rep(TRUE,5)), >>> country=c(rep(1,4),rep(2,6)), >>> language=c(rep(6,6),rep(5,4)), >>> event=1:10, key=c("user","country","language")) >>>> dt >>> user behavior country language event >>> 1: 3 TRUE 2 5 7 >>> 2: 3 TRUE 2 5 8 >>> 3: 3 TRUE 2 5 9 >>> 4: 3 TRUE 2 5 10 >>> 5: 3 TRUE 2 6 6 >>> 6: 4 FALSE 1 6 1 >>> 7: 4 FALSE 1 6 2 >>> 8: 4 FALSE 1 6 3 >>> 9: 4 FALSE 1 6 4 >>> 10: 4 FALSE 2 6 5 >>>> users <- dt[, sum(behavior) > 0, by=user] >>> Finding groups (bysameorder=TRUE) ... done in 0secs. >>> bysameorder=TRUE >>> and o__ is length 0 >>> Detected that j uses these columns: behavior >>> Optimization is on but j left unchanged as 'sum(behavior) > 0' >>> Starting dogroups ... done dogroups in 0 secs >>>> users >>> user V1 >>> 1: 3 TRUE >>> 2: 4 FALSE >>>> setnames(users, "V1", "behavior") >>> --8<---------------cut here---------------end--------------->8--- >>> >>> Now I want to do the same thing as in >>> >>> >>> http://stackoverflow.com/questions/16200815/summarize-a-data-table-with-unreliable-data >>> for both fields >>>> fields <- c("country","language") >>> >>> here is what I tried so far: >>> >>> --8<---------------cut here---------------start------------->8--- >>> dt[, .N, .SDcols=fields, by=eval(list("user",fields))] >>> Error in `[.data.table`(dt, , .N, .SDcols = fields, by = >>> eval(list("user", : >>> The items in the 'by' or 'keyby' list are length (1,2). Each must >>> be same length as rows in x or number of rows returned by i (10). >>> Calls: [ -> [.data.table >>> --8<---------------cut here---------------end--------------->8--- >>> >>> the idea is to do something like >>> >>> --8<---------------cut here---------------start------------->8--- >>>> dt.out <- dt[, .N, by=list(user,country)][, >>>> list(country[which.max(N)], max(N)/sum(N)), by=user] >>>> setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", >>>> ".support"))) >>>> users <- users[dt.out] >>> user behavior country.name country.support >>> 1: 3 TRUE 2 1.0 >>> 2: 4 FALSE 1 0.8 >>> --8<---------------cut here---------------end--------------->8--- >>> >>> except that I do not want to have the literal "country" and >>> "language" >>> and that I am sure there is a way to avoid copying users in >>>> users <- users[dt.out] >>> by a ":=" trick. >>> >>> Thanks. >>> >>>> * Matthew Dowle [2013-04-24 21:54:17 >>>> +0100]: >>>> >>>> where ... is eval(myid) >>>> iigc >>>>> Or: >>>>> DT[,lapply(.SD,sum),by=...,.SDcols=myvars] From sds at gnu.org Fri Apr 26 19:05:39 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 13:05:39 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> Message-ID: <871u9xkysc.fsf@gnu.org> > * Matthew Dowle [2013-04-26 17:45:53 +0100]: > > S.O. is probably better for this kind of question then. > But if you don't get an answer there, then come back to datatable-help. 
http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns -- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://thereligionofpeace.com http://pmw.org.il http://jihadwatch.org http://camera.org http://honestreporting.com Apple: making a living off show-offs. From s_milberg at hotmail.com Fri Apr 26 20:48:40 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Fri, 26 Apr 2013 14:48:40 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com>, <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au>, <1366643879137-4664990.post@n4.nabble.com>, , , , <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net>, , <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net>, , <9146185881995080674@unknownmsgid>, <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net>, <5222879356405645530@unknownmsgid>, <64f192ba80ac813986ed256029f0e7e0@imap.plus.net>, <2a827f7db260f284908fac604301eb8e@imap.plus.net>, , Message-ID: each.i = TRUE sounds fine to me. Date: Fri, 26 Apr 2013 10:17:28 -0500 Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" From: eduard.antonyan at gmail.com To: s_milberg at hotmail.com CC: mdowle at mdowle.plus.com; datatable-help at lists.r-forge.r-project.org I indeed offered .J as a shorthand for .JOIN and to ease the pain of having to type extra stuff for users who are relying on current behavior. Sadao is making good points. The question of what does by=list(a, .JOIN) do can still apply though with cross.apply=TRUE syntax, i.e. what does X[Y,j,by=a,cross.apply=TRUE] do? And I think the answer is the same for either syntax - in addition to the cross-apply-by it would group by column 'a'. Btw I think Matthew's examples above (or smth like them) should go into the FAQ or documentation as they were very illuminating and entirely non-obvious to me. If I were to rate all of the above from imo best to worst, it would be:.JOIN (or .J - yes, I'm biased:) ).EACHI/cross.apply=TRUE.EACHIROW/.EACHJOIN .CROSSAPPLYX[by=Y,j] After typing the above list, I'm actually starting to like .EACHI (each.i=TRUE? <- I like this even better) more and more as it seems to convey the meaning (as far as I currently understand it - my understanding has shifted a little since the start of this conversation) really well. Anyway, sorry for a verbose email - my current vote is 'each.i = TRUE' - I think this conveys the right meaning, satisfies Sadao's points and also has a meaning that transitions well between having a join-i and not having a join-i (when you're not joining, specifying this option wouldn't do anything extra). On Fri, Apr 26, 2013 at 8:34 AM, Sadao Milberg wrote: Your suggestion for transition seems reasonable, although I still think you should just use a new argument rather than try to change the behavior of by. The most natural thing seems to leave Y as the `i` value, since after all, we are still joining on the key, and then just modify the standard join behavior with the cross.apply=TRUE or some such. This way, you avoid having to have a more complicated description of the `by` argument, where all of a sudden it means 'group by these expressions, unless you use the special expression .XXX, in which case something confusingly similar yet different happens, oh, and by the way, you can only use .XXX if you are also using i=Y' (and what does by=list(a, .JOIN) do?). 
To some extent your final proposal of by=Y is a little better, but still confusing since now you're using by to join and group, when it's `i` job to do that. Loosely related, what does .JOIN represent? Is it just a flag, or is it a derived variable the way .SD is? If it's just a flag, it seems like a bad idea to use a name to represent it since that is a break from the meaning of all the other .X variables in data.table, which actually contain some kind of derivative data. Finally, when you say "might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed)", do you mean joint inherited scope will work even when we're not in by-without-by mode? That would be great. S. Date: Fri, 26 Apr 2013 12:14:02 +0100 From: mdowle at mdowle.plus.com To: eduard.antonyan at gmail.com CC: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by" I didn't get any feedback off list on this one. But I'm coming round to the idea. What about by=.JOIN (is that you were thinking .J stood for?) Other possibilties: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to brainstorm it. by=.JOIN could be added anyway with no backwards compatibility issues, so that those who wished to be explicit now could be. To change the default for X[Y, j] I'm also coming round to. It might help in a few related areas e.g. X[Y][,j] (which isn't great right now, agreed). We have successfully made non-backwards-compatibile changes in the past by introducing a global option which we slowly migrate to. If datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from day one, with default TRUE. That allows those who wish for explicit by to migrate straight away by changing the default to FALSE. Existing users could set it to "warning" to see how many implicit bywithoutby they have. Those calls can gradually be changed to by=.JOIN and in that way both implicit and explicit work at the same time, for say a year, with full backwards compatibility by default. This approach allows a slow and flexible migration path on a per feature basis. Then the default could be chaged to "warning" before finally FALSE. Depending on how it goes, the option could be left there to allow TRUE if anyone wanted it, or removed (maybe after two years). Similar to the removal of J() outside DT[...] i.e. users can still now very easily write J=data.table in their .Rprofile if they wish, for backwards compatibility. Or ... instead of : X[Y, j, by=.JOIN] what about : X[by=Y, j] Matthew On 25.04.2013 16:32, Matthew Dowle wrote: I'd appreciate some input from others whether they agree or not. If you have a view perhaps let me know off list, or on list, whichever you prefer. Thanks, Matthew On 25.04.2013 13:45, Eduard Antonyan wrote: Well, so can .I or .N or .GRP or .BY, yet those are used as special names, which is exactly why I suggested .J. The problem with using 'missingness' is that it already means smth very different when i is not a join/cross, it means *don't* do a by, thus introducing the whole case thing one has to through in their head every time as in OP (which of course becomes automatic after a while, but it's a cost nonetheless, which is in particular high for new people). So I see absence of 'by' as an already taken and used signal and thus something else has to be used for the new signal of cross apply (it doesn't have to be the specific option I mentioned above). 
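For anyone joining the thread at this point, a small self-contained illustration of
the behaviour being debated, using only syntax that already works with the
by-without-by default discussed here (data.table 1.8.x at the time of this thread);
the .JOIN / each.i / cross.apply spellings above are proposals, not implemented, and
the toy tables A and B below are made up for the example:

A <- data.table(a = c(1, 1, 2, 2, 3), v = 1:5, key = "a")
B <- data.table(a = c(1, 2))

A[B, sum(v)]      # by-without-by: j runs once per row of B (a=1 -> 3, a=2 -> 7)
A[B][, sum(v)]    # join first, then one j over the whole result ( = 10 )

The question in this thread is whether the first behaviour should stay the default
for X[Y, j], or whether it should require an explicit signal in by.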
This is exactly why I find optional turning off of this behavior unsatisfactory, and I don't see that as a solution to this at all. I think in the x+y context the appropriate analog is - what if that added x and y normally, but when x and y were data.frames it did element by element multiplication instead? Yes that's possible to do, and possible to document, but it's not a good idea, because it takes place of adding them element by element. The recycling behavior doesn't do that - what that does is it says it doesn't really make sense to add them as is, but we can do that after recycling, so let's recycle. It doesn't take the place of another existing way of adding vectors. On Apr 25, 2013, at 4:28 AM, Matthew Dowle wrote: I see what you're getting at. But .J may be a column name, which is the current meaning of by = single symbol. And why .J? If not .J, or any single symbol what else instead? A character value such as by="irows" is taken to mean the "irows" column currently (for consistency with by="colA,colB,colC"). But some signal needs to be passed to by=, then (you're suggesting), to trigger the cross apply by each i row. Currently, that signal is missingness (which I like, rely on, and use with join inherited scope). As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an option to turn off by-without-by), since there is no downside. But you've continued to argue for a change to the default, iiuc. Maybe it helps to consider : x+y Fundamentally in R this depends on what x and y are. Most of us probably assume (as a first thought) that x and y are vectors and know that this will apply "+" elementwise, recycling y if necessary. In R we like and write code like this all the time. I think of X[Y, j] in the same way: j is the operation (like +) which is applied for each row of Y. If you need j for the entire set that Y joins to, then like a FAQ says, make j missing too and it's X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on the list: drop=TRUE would do that (as someone mentioned on the S.O. thread). So maybe the new option would be datatable.drop (but with default FALSE not TRUE). If you wanted to turn off by-without-by you might set options(datatable.drop=TRUE). Then you can use data.table how you prefer (explicit by) and I can use it how I prefer. I'm happy to add the argument to [.data.table, and make its default changeable via a global option in the usual way. Matthew On 25.04.2013 05:16, Eduard Antonyan wrote: That's really interesting, I can't currently think of another way of doing that as after X[Y] is done the necessary information is lost. To retain that functionality and achieve better readability, as in OP, I think smth along the lines of X[Y, head(.SD, i.top), by=.J] would be a good replacement for current syntax. On Apr 24, 2013, at 6:01 PM, Eduard Antonyan wrote: that's an interesting example - I didn't realize current behavior would do that, I'm not at a PC anymore but I'll definitely think about it and report back, as it's not immediately obvious to me On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle wrote: i. prefix is just a robust way to reference join inherited columns: the 'top' column in the i table. Like table aliases in SQL. What about this? 
: 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2,1), top=c(3,4,2)) 1> Y a top 1: 1 3 2: 2 4 3: 1 2 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 8: 1 1 9: 1 4 1> On 24.04.2013 23:43, Eduard Antonyan wrote: I assumed they meant create a table :) that looks cool, what's i.top ? I can get a very similar to yours result by writing: X[Y][, head(.SD, top[1]), by = a] and I probably would want the following to produce your result (this might depend a little on what exactly i.top is): X[Y, head(.SD, i.top), by = a] On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle wrote: That sentence on that linked webpage seems incorect English, since table is a noun not a verb. Should "table" be "join" perhaps? Anyway, by-without-by is often used with join inherited scope (JIS). For example, translating their example : 1> X = data.table(a=1:3,b=1:15, key="a") 1> X a b 1: 1 1 2: 1 4 3: 1 7 4: 1 10 5: 1 13 6: 2 2 7: 2 5 8: 2 8 9: 2 11 10: 2 14 11: 3 3 12: 3 6 13: 3 9 14: 3 12 15: 3 15 1> Y = data.table(a=c(1,2), top=c(3,4)) 1> Y a top 1: 1 3 2: 2 4 1> X[Y, head(.SD,i.top)] a b 1: 1 1 2: 1 4 3: 1 7 4: 2 2 5: 2 5 6: 2 8 7: 2 11 1> If there was no by-without-by (analogous to CROSS BY), then how would that be done? On 24.04.2013 22:22, Eduard Antonyan wrote: By that you mean current behavior? You'd get current behavior by explicitly specifying the appropriate "by" (i.e. "by" equal to the key). Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't figure out how by-without-by (or with by-with-by for that matter:) ) helps with e.g. the first example there: "We table table1 and table2. table1 has a column called rowcount. For each row from table1 we need to select first rowcount rows from table2, ordered by table2.id" On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle wrote: But then what would be analogous to CROSS APPLY in SQL? > I'd agree with Eduard, although it's probably too late to change behavior > now. Maybe for data.table.2? Eduard's proposal seems more closely > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if > requested). > > S. > >> Date: Mon, 22 Apr 2013 08:17:59 -0700 >> From: eduard.antonyan at gmail.com >> To: datatable-help at lists.r-forge.r-project.org >> Subject: Re: [datatable-help] changing data.table by-without-by >> syntax to require a "by" >> >> I think you're missing the point Michael. Just because it's possible to >> do it >> the way it's done now, doesn't mean that's the best way, as I've tried >> to >> argue in the OP. I don't think you've addressed the issue of unnecessary >> complexity pointed out in OP. >> >> >> >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html >> Sent from the datatable-help mailing list archive at Nabble.com. 
>> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Fri Apr 26 22:34:39 2013 From: FErickson at psu.edu (Frank Erickson) Date: Fri, 26 Apr 2013 15:34:39 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: I disagree with the criticism of data.table's complexity (in the OP). There's nothing wrong with overloading the syntax (that is what CS people call it, right?). As long as Matthew's in control of it, it's likely to have some internal consistency (which, of course, he could explain). However, I like the suggestion to add options (defaulting to something globally adjustable) to disable some of the overloading. Along similar lines (I think), I find unique.data.table very unintuitive. I can see how it could be useful, but strongly prefer base::unique for my current applications. Anyway, I have nothing particular to say about the piece of syntax you all are currently discussing. I just registered with this list to chime in here, instead of further cluttering SO (where eddi answered one of my questions yesterday). These emails sure are wide; must be like 1500px! Interesting to try out this ancient mailing-list form of communication. Please let me know if I should be using "Reply All" or actually quoting that massive thread (as everyone else seems to be doing with each post). Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From sds at gnu.org Sat Apr 27 00:02:31 2013 From: sds at gnu.org (Sam Steingold) Date: Fri, 26 Apr 2013 18:02:31 -0400 Subject: [datatable-help] variable column names References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> <871u9xkysc.fsf@gnu.org> Message-ID: <87wqrpj6h4.fsf@gnu.org> > * Sam Steingold [2013-04-26 13:05:39 -0400]: > >> * Matthew Dowle [2013-04-26 17:45:53 +0100]: >> >> S.O. is probably better for this kind of question then. >> But if you don't get an answer there, then come back to datatable-help. > > http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns downvoted, unlikely to be answered. 
-- Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000 http://www.childpsy.net/ http://iris.org.il http://think-israel.org http://americancensorship.org http://pmw.org.il http://mideasttruth.com We have preferences. You have biases. They have prejudices. From mdowle at mdowle.plus.com Sat Apr 27 00:47:55 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 26 Apr 2013 23:47:55 +0100 Subject: [datatable-help] variable column names In-Reply-To: <87wqrpj6h4.fsf@gnu.org> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> <871u9xkysc.fsf@gnu.org> <87wqrpj6h4.fsf@gnu.org> Message-ID: <30d6ae8f1a0d6974ebbd54da0d86f3b2@imap.plus.net> On 26.04.2013 23:02, Sam Steingold wrote: >> * Sam Steingold [2013-04-26 13:05:39 -0400]: >> >>> * Matthew Dowle [2013-04-26 17:45:53 >>> +0100]: >>> >>> S.O. is probably better for this kind of question then. >>> But if you don't get an answer there, then come back to >>> datatable-help. >> >> >> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns > > downvoted, unlikely to be answered. I've read it through. Perhaps sleep on it, don't look for 24hrs and look again as if you were trying to answer it yourself. Are there any small changes you can make to make it easier to answer? It wasn't me that downvoted but I suspect it's been done to encourage you to improve the question. Downvotes can (and often are) reversed. I've had many more downvotes than you once, but then I improved it and it went to +10. And, it's Friday and we've all had a long week! Matthew From mdowle at mdowle.plus.com Sat Apr 27 01:35:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 27 Apr 2013 00:35:17 +0100 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <8c8ae0b84f3eed6682e30094fffd8f2d@imap.plus.net> <2ddb700796994869ae11b6e1f36f26e7@imap.plus.net> <9146185881995080674@unknownmsgid> <4c06adb2958f3060002b05d9b3ef9d9d@imap.plus.net> <5222879356405645530@unknownmsgid> <64f192ba80ac813986ed256029f0e7e0@imap.plus.net> <2a827f7db260f284908fac604301eb8e@imap.plus.net> Message-ID: Thanks for your comments Frank. Ha, yes it's ancient but still has a place. Yes "reply all": Back To: sender (if it's to someone in particular) and cc the list. But on general topics where lots of people are on the thread, just To: datatable-help alone is fine. Personally I prefer "top posting". Like I'm doing now. I only scroll down if I need to. I didn't notice the history was building up. If you comment inline later, then say "scroll down for comments inline" or something at the top. Note that Nabble collapses the history for you so threads are much easier to read there. Or I tend to read via RSS (gmane) in Outlook, so it feels like an email inbox which turns bold on new posts. You only need to subscribe to post (spam control). Most people turn off mail delivery pretty quickly I imagine (or setup an auto rule to move into a folder, but then you might as well subscribe to RSS I guess). S.O. is quite strict: must be clear questions with a clear answer, only one of which can be accepted. No opinion, voting, discussing or notices (enter mailing lists). 
Chat room is good but for quick chat when people are in the room at the same time. Many companies (sensibly) block chat access, though. Mailing lists allows all timezones a chance at a slower pace. Anonymity is just as acceptable and as easy in both places. Matthew On 26.04.2013 21:34, Frank Erickson wrote: > I disagree with the criticism of data.table's complexity (in the OP). There's nothing wrong with overloading the syntax (that is what CS people call it, right?). As long as Matthew's in control of it, it's likely to have some internal consistency (which, of course, he could explain). However, I like the suggestion to add options (defaulting to something globally adjustable) to disable some of the overloading. Along similar lines (I think), I find unique.data.table very unintuitive. I can see how it could be useful, but strongly prefer base::unique for my current applications. > Anyway, I have nothing particular to say about the piece of syntax you all are currently discussing. I just registered with this list to chime in here, instead of further cluttering SO (where eddi answered one of my questions yesterday). These emails sure are wide; must be like 1500px! Interesting to try out this ancient mailing-list form of communication. Please let me know if I should be using "Reply All" or actually quoting that massive thread (as everyone else seems to be doing with each post). > Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From victor.kryukov at gmail.com Sat Apr 27 01:42:04 2013 From: victor.kryukov at gmail.com (Victor Kryukov) Date: Fri, 26 Apr 2013 16:42:04 -0700 Subject: [datatable-help] variable column names In-Reply-To: <30d6ae8f1a0d6974ebbd54da0d86f3b2@imap.plus.net> References: <87a9onfza3.fsf@gnu.org> <2dd332222892fc9f613aad5fa6d08d7e.squirrel@webmail.plus.net> <87mwslqorx.fsf@gnu.org> <82f248612d2657b63d5e79c2a1ef98af@imap.plus.net> <87d2thqmw1.fsf@gnu.org> <871u9xkysc.fsf@gnu.org> <87wqrpj6h4.fsf@gnu.org> <30d6ae8f1a0d6974ebbd54da0d86f3b2@imap.plus.net> Message-ID: On Fri, Apr 26, 2013 at 3:47 PM, Matthew Dowle wrote: > On 26.04.2013 23:02, Sam Steingold wrote: >>> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns >> >> downvoted, unlikely to be answered. > > I've read it through. > > Perhaps sleep on it, don't look for 24hrs and look again as if you were > trying to answer it yourself. Are there any small changes you can make to > make it easier to answer? It wasn't me that downvoted but I suspect it's > been done to encourage you to improve the question. Downvotes can (and often > are) reversed. I've had many more downvotes than you once, but then I > improved it and it went to +10. > > And, it's Friday and we've all had a long week! Beautiful advice, Matthew! Sam - I've provided my answer (and even used Reduce since you seem to be coming from Lisp land), but I also think some of the down votes/comments have their merit. From aragorn168b at gmail.com Sat Apr 27 17:49:13 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 27 Apr 2013 17:49:13 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" Message-ID: Hello, I thought I'd also chip-in my thoughts to eddi's feature request. Short answer: I don't think this feature is necessary. I basically agree with mnel's reply. Long answer: My argument goes along these lines (in addition to the S3/S4 methods mnel mentions). 
If you for example type `[.data.frame` in your R-session, you'd see this snippet: if (is.matrix(i)) return(as.matrix(x)[i]) That is, if you do: df <- data.frame(x=1:5, y=1:5, z=1:5) mm <- matrix(1:12, ncol=3) df[mm] # gives [1] 1 2 3 4 5 1 2 3 4 5 1 2 df <- data.frame(x=1:2, y=1:2, z=1:2) df[mm] # gives [1] 1 2 1 2 1 2 NA NA NA NA NA NA Here, the indexing is a matrix. It's obvious. Now, should this behaviour be changed because people would be confused that subsetting a data.frame resulted in a vector? Or because it's not user friendly? Even better, try out `df[mm, ]`. If `i` is a matrix, this is what the code does. I am not convinced this is "bad" design. Functions take arguments of different types ALL the time and they return outputs *depending on the type of input*. This is why I am not sold on the point of "bad design". It's essential to know the type of objects `i` can take and *understand* it. If a function is designed that takes several types of objects for `i` and their behaviour is documented, and the documented behaviour is consistent, then I can't accept there's a problem. I agree there are people who don't read the manual and "try" things out. But they are going to have problems with every other function in R. For example, "unstack" is a function for which same input type gives different output type. That is, it provides a data.frame if the columns are equal after unstaking and list if they are not. That is, compare the outputs of: df <- data.frame(x=rep(1:3, each=3), y=1:9) unstack(df, y ~ x) with df <- data.frame(x=c(rep(1:3, each=3), 3), y=1:10) unstack(df, y ~ x) But if people don't read the documentation, they wouldn't know this difference until they land up on errors. Now, making it user-friendly would mean that it "always" returns a list. Now, is this "bad" design because it gives two object types for same input? Does it require a change? I personally don't think so. To sum up, what eddi points out as "not being user-friendly" (or arguably "bad design") is everywhere inside R if you look closely. My view is that it's very clear that there should be some effort in understanding a function before using it. Not all functions are plain simple. Some functions have exceptions and some packages have a steep learning curve. Best, Arun. On Sat, Apr 27, 2013 at 12:00 PM, < datatable-help-request at lists.r-forge.r-project.org> wrote: > > Send datatable-help mailing list submissions to > datatable-help at lists.r-forge.r-project.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > or, via email, send a message with subject or body 'help' to > datatable-help-request at lists.r-forge.r-project.org > > You can reach the person managing the list at > datatable-help-owner at lists.r-forge.r-project.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of datatable-help digest..." > > > Today's Topics: > > 1. Re: changing data.table by-without-by syntax to require a > "by" (Frank Erickson) > 2. Re: variable column names (Sam Steingold) > 3. Re: variable column names (Matthew Dowle) > 4. Re: changing data.table by-without-by syntax to require a > "by" (Matthew Dowle) > 5. 
Re: variable column names (Victor Kryukov)
>
> [...]
>
> End of datatable-help Digest, Vol 38, Issue 26
> **********************************************

From statquant at outlook.com  Sun Apr 28 20:38:35 2013
From: statquant at outlook.com (stat quant)
Date: Sun, 28 Apr 2013 20:38:35 +0200
Subject: [datatable-help] Porting data.table to Rcpp
Message-ID: 

Hello list,
I am nearly a beginner when it comes to C++. I like data.table very much and I am
interested in Rcpp too. I am wondering how hard it would be to have a data.table
API so that data.table's greatness could be accessed from C++ via Rcpp.

Cheers
Colin

From mdowle at mdowle.plus.com  Sun Apr 28 23:52:51 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 28 Apr 2013 22:52:51 +0100
Subject: Re: [datatable-help] Porting data.table to Rcpp
In-Reply-To: 
References: 
Message-ID: <991982131055d51a41ec5f5d0eedba87@imap.plus.net>

Hi,
I don't know C++ or Rcpp very well so couldn't estimate how hard. But it rings a
bell as having been discussed before. I searched datatable-help for "Rcpp" with no
luck, but an S.O. search for "[data.table] Rcpp" returns 15 hits, so there may be
clues in there somewhere. If changes are needed to data.table then that's fine by
me.

Matthew

On 28.04.2013 19:38, stat quant wrote:
> Hello list,
> I am nearly a beginner when it comes to C++. I like data.table very much and
> I am interested in Rcpp too. I am wondering how hard it would be to have a
> data.table API so that data.table's greatness could be accessed from C++
> via Rcpp.
>
> Cheers
> Colin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From saporta at scarletmail.rutgers.edu Mon Apr 29 07:29:31 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Mon, 29 Apr 2013 01:29:31 -0400 Subject: [datatable-help] Porting data.table to Rcpp In-Reply-To: References: Message-ID: Hey Colin, This sounds like an interesting idea. What specifically did you have in mind? I would be willing to lend a hand. -Rick Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu On Sun, Apr 28, 2013 at 2:38 PM, stat quant wrote: > Hello list, > I am nearly a beginer whn It comes to C++, I like data.table very much and > I am interested to Rcpp too. > I am wondering how hard would it be to have a data.table API to be able to > access data.table greatness from C++ via Rcpp. > > Cheers > Colin > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Mon Apr 29 15:40:27 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Mon, 29 Apr 2013 08:40:27 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: Message-ID: Thanks Arun, the examples you give are probably interesting in their own right, but your post doesn't address advantages/disadvantages of either current or proposed syntaxes and simply points out the (obvious) fact that current (and other, similar in some ways to current) behavior is possible to implement in R. On Sat, Apr 27, 2013 at 10:49 AM, Arunkumar Srinivasan < aragorn168b at gmail.com> wrote: > Hello, > I thought I'd also chip-in my thoughts to eddi's feature request. > Short answer: I don't think this feature is necessary. I basically agree > with mnel's reply. > Long answer: My argument goes along these lines (in addition to the S3/S4 > methods mnel mentions). If you for example type `[.data.frame` in your > R-session, you'd see this snippet: > > if (is.matrix(i)) > return(as.matrix(x)[i]) > > That is, if you do: > > df <- data.frame(x=1:5, y=1:5, z=1:5) > mm <- matrix(1:12, ncol=3) > df[mm] # gives > [1] 1 2 3 4 5 1 2 3 4 5 1 2 > > df <- data.frame(x=1:2, y=1:2, z=1:2) > df[mm] # gives > [1] 1 2 1 2 1 2 NA NA NA NA NA NA > > Here, the indexing is a matrix. It's obvious. Now, should this behaviour > be changed because people would be confused that subsetting a data.frame > resulted in a vector? Or because it's not user friendly? Even better, try > out `df[mm, ]`. If `i` is a matrix, this is what the code does. I am not > convinced this is "bad" design. Functions take arguments of different types > ALL the time and they return outputs *depending on the type of input*. This > is why I am not sold on the point of "bad design". It's essential to know > the type of objects `i` can take and *understand* it. > > If a function is designed that takes several types of objects for `i` and > their behaviour is documented, and the documented behaviour is consistent, > then I can't accept there's a problem. > > I agree there are people who don't read the manual and "try" things out. > But they are going to have problems with every other function in R. > > For example, "unstack" is a function for which same input type gives > different output type. That is, it provides a data.frame if the columns are > equal after unstaking and list if they are not. 
> [...]

From eduard.antonyan at gmail.com  Mon Apr 29 15:43:19 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 29 Apr 2013 08:43:19 -0500
Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"
In-Reply-To: 
References: 
Message-ID: 

It might help to think of this as an improvement proposal rather than a problem
fix proposal.

On Mon, Apr 29, 2013 at 8:40 AM, Eduard Antonyan wrote:
> Thanks Arun, the examples you give are probably interesting in their own
> right, but your post doesn't address advantages/disadvantages of either
> current or proposed syntaxes and simply points out the (obvious) fact that
> current (and other, similar in some ways to current) behavior is possible
> to implement in R.
> [...]
>> > URL: < >> http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130427/260f3119/attachment-0001.html >> > >> > >> > ------------------------------ >> > >> > Message: 5 >> > Date: Fri, 26 Apr 2013 16:42:04 -0700 >> > From: Victor Kryukov >> > To: Matthew Dowle >> > Cc: datatable-help at lists.r-forge.r-project.org, sds at gnu.org >> >> > Subject: Re: [datatable-help] variable column names >> > Message-ID: >> > > nJgw at mail.gmail.com> >> > Content-Type: text/plain; charset=ISO-8859-1 >> >> > >> > On Fri, Apr 26, 2013 at 3:47 PM, Matthew Dowle >> wrote: >> > > On 26.04.2013 23:02, Sam Steingold wrote: >> > >>> >> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns >> > >> >> > >> downvoted, unlikely to be answered. >> > > >> > > I've read it through. >> > > >> > > Perhaps sleep on it, don't look for 24hrs and look again as if you >> were >> > > trying to answer it yourself. Are there any small changes you can >> make to >> > > make it easier to answer? It wasn't me that downvoted but I suspect >> it's >> > > been done to encourage you to improve the question. Downvotes can >> (and often >> > > are) reversed. I've had many more downvotes than you once, but then I >> > > improved it and it went to +10. >> > > >> > > And, it's Friday and we've all had a long week! >> > >> > Beautiful advice, Matthew! >> > >> > Sam - I've provided my answer (and even used Reduce since you seem to >> > be coming from Lisp land), but I also think some of the down >> > votes/comments have their merit. >> > >> > >> > ------------------------------ >> > >> > _______________________________________________ >> > datatable-help mailing list >> > datatable-help at lists.r-forge.r-project.org >> >> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > >> > End of datatable-help Digest, Vol 38, Issue 26 >> > ********************************************** >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From s_milberg at hotmail.com Mon Apr 29 22:21:11 2013 From: s_milberg at hotmail.com (Sadao Milberg) Date: Mon, 29 Apr 2013 16:21:11 -0400 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: , , Message-ID: Also, the issue isn't that data.table has different behavior given different types of inputs. I don't think there is anything wrong with doing that. After all, I think everyone here is okay with a data.table as `i` vs. a vector or a variable name producing different outcomes. The concern here is about which other behavior gets triggered. The default behavior when using a data.table for `i` and nothing for `by` is a somewhat advanced outcome that can't be easily predicted or understood by people who understand the basic operation of data.table (i.e. `i` is for join/indexing, `j` is for evaluating expressions in the context of DT, `by` is for split-apply-combine). As a result usage and documentation become more inaccessible than they could be. S. 
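As a minimal sketch of the default behaviour described above (hypothetical tables X and Y, not from the thread), showing what 1.8.x does when `i` is a data.table and no `by` is supplied:

library(data.table)
X <- data.table(id = c("a","a","b","c"), v = 1:4, key = "id")
Y <- data.table(id = c("a","c"))

X[Y]              # plain join: the rows of X matching each row of Y
X[Y, sum(v)]      # by-without-by: j runs once per row of Y,
                  # giving one result row per i row (id "a" and id "c")
X[Y][, sum(v)]    # chained form: join first, then j over the whole joined result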
From eduard.antonyan at gmail.com Mon Apr 29 22:51:26 2013
From: eduard.antonyan at gmail.com (eddi)
Date: Mon, 29 Apr 2013 13:51:26 -0700 (PDT)
Subject: [datatable-help] minor formatting issue
Message-ID: <1367268686060-4665760.post@n4.nabble.com>

When joining vs not-joining the order of the key column is different in the output:

> dt = data.table(a = c(1:4), b = c(1:4), key = "b")
> dt[J(1)]
   b a
1: 1 1
> dt[!J(1)]
   a b
1: 2 2
2: 3 3
3: 4 4

I don't usually care about column order, but this could become a surprise issue for people reading/writing to files.
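One possible workaround for the column-order difference noted above, assuming a data.table version that provides setcolorder() (a sketch, not from the thread):

library(data.table)
dt  <- data.table(a = 1:4, b = 1:4, key = "b")
res <- dt[J(1)]                # the join output leads with the key column: b, a
setcolorder(res, names(dt))    # restore the original a, b order before writing
# write.csv(res, "out.csv", row.names = FALSE)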
-- View this message in context: http://r.789695.n4.nabble.com/minor-formatting-issue-tp4665760.html Sent from the datatable-help mailing list archive at Nabble.com. From michael.nelson at sydney.edu.au Tue Apr 30 00:36:08 2013 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Mon, 29 Apr 2013 22:36:08 +0000 Subject: [datatable-help] minor formatting issue In-Reply-To: <1367268686060-4665760.post@n4.nabble.com> References: <1367268686060-4665760.post@n4.nabble.com> Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCD705977FA@EX-MBX-PRO-04.mcs.usyd.edu.au> Good spotting. File a bug report. ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of eddi [eduard.antonyan at gmail.com] Sent: Tuesday, 30 April 2013 6:51 AM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] minor formatting issue When joining vs not-joining the order of the key column is different in the output: > dt = data.table(a = c(1:4), b = c(1:4), key = "b") > dt[J(1)] b a 1: 1 1 > dt[!J(1)] a b 1: 2 2 2: 3 3 3: 4 4 I don't usually care about column order, but this could become a surprise issue for people reading/writing to files. -- View this message in context: http://r.789695.n4.nabble.com/minor-formatting-issue-tp4665760.html Sent from the datatable-help mailing list archive at Nabble.com. _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From mdowle at mdowle.plus.com Tue Apr 30 09:53:39 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 08:53:39 +0100 Subject: [datatable-help] Size of posts - moderation queue Message-ID: <353c37d353379414321baf7ceba35cff@imap.plus.net> Hello, If a thread history grows to be over 40KB then mailman holds in a moderation queue. This hasn't happened much before until now. Not my limit but it seems sensible anyway. So if a thread grows, just chop down the history and it should go through. Thanks, Matthew From aragorn168b at gmail.com Tue Apr 30 09:58:37 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 09:58:37 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <8EDE8235CC054C71A23DAFC87B3F02C1@gmail.com> <09BDD69ADDBD4CDD889884347CB7276D@gmail.com> <4F066386B43546C19538B5BC06EB4882@gmail.com> Message-ID: (The earlier message was too long and was rejected.) So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=1) DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=c(1,2,1), w=c(11:13)) # what's the output supposed to be for? DT1[DT2, y, .JOIN=FALSE] DT1[DT2, .JOIN = FALSE] Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? Is this supposed to also do a "cross-apply" on the logical subset? I guess not. 
So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. Best, Arun. -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Tue Apr 30 11:19:23 2013 From: statquant at outlook.com (statquant3) Date: Tue, 30 Apr 2013 02:19:23 -0700 (PDT) Subject: [datatable-help] Porting data.table to Rcpp In-Reply-To: References: Message-ID: <1367313563752-4665801.post@n4.nabble.com> Ricky, I tend to use data.table + Rcpp "a lot" now, my usual usecase is 1. create a Rcpp function that takes the data.table say "DT" (as a data.frame) 2. create Rcpp::NumericVectors within the function (say "f") 3: return those vectors as a list and add them by reference to the initial data.table with DT[, names(f(DT)):=f(DT)] This works well and is efficient I think, but if you have to do more complicated stuff, requiring setting keys within C++ that's impossible as it is. What is missing I think is: 1. Possibility to modify by reference the data.table within C++ ( in this post http://stackoverflow.com/questions/15731106/passing-by-reference-a-data-frame-and-updating-it-with-rcpp Romain Francois showed me how to create another data.frame sharing the same data than the initial data.table but this is not quite yet what we want (I think)) 2. Possibility to call data.table functions like merge... within C++ This is totally out of my reach I think, as I am barely a user of data.table and Rcpp, but some brighter devs could find this project interesting -- View this message in context: http://r.789695.n4.nabble.com/Porting-data-table-to-Rcpp-tp4665667p4665801.html Sent from the datatable-help mailing list archive at Nabble.com. From eduard.antonyan at gmail.com Tue Apr 30 14:54:33 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Tue, 30 Apr 2013 07:54:33 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: References: <8EDE8235CC054C71A23DAFC87B3F02C1@gmail.com> <09BDD69ADDBD4CDD889884347CB7276D@gmail.com> <4F066386B43546C19538B5BC06EB4882@gmail.com> Message-ID: <-8694790273355420813@unknownmsgid> Arun, If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: (The earlier message was too long and was rejected.) So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=1) DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) setkey(DT1, "x") DT2 <- data.table(x=c(1,2,1), w=c(11:13)) # what's the output supposed to be for? 
DT1[DT2, y, .JOIN=FALSE] DT1[DT2, .JOIN = FALSE] Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. Best, Arun. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 15:48:07 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 15:48:07 +0200 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <-8694790273355420813@unknownmsgid> References: <8EDE8235CC054C71A23DAFC87B3F02C1@gmail.com> <09BDD69ADDBD4CDD889884347CB7276D@gmail.com> <4F066386B43546C19538B5BC06EB4882@gmail.com> <-8694790273355420813@unknownmsgid> Message-ID: <5AD5B1D231A045329D46159FB5297739@gmail.com> Eduard, thanks for your reply. But somethings are unclear to me still. I'll try to explain them below. First I prefer .JOIN (or cross.apply) just because `each.i` seems general (that it is applicable to *every* i operation, which as of now seems untrue). .JOIN is specific to data.table type for `i`. >From what I understand from your reply, if (.JOIN = FALSE), then, DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] Is this right? It's a bit confusing because I think you're okay with "by-without-by" and I got the impression from Sadao that he finds the syntax of "by-without-by" unaccessible/advanced for basic users. So, just to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the "by-without-by" and then result in a "vector", right? Matthew explains in the current documentation that DT1[DT2][, y] would "join" all columns of DT1 and DT2 and then subset. I assume the implementation underneath is *not* DT1[DT2][, y] rather the result is an efficient equivalence. Then, that of course seems alright to me. If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I can't think of any at the moment. To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results in getting evaluated as a scalar for every group in the current by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i, list(x,y)]. If you/anyone believes it's wrong, I'd be all ears to clarify as to what's the purpose of `drop` then (and also how it *doesn't* suit here as compared to .JOIN). Arun On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > Arun, > > If the new boolean is false, the result would be the same as without it and would be equal to current behavior of d[i][, j]. If it's true, it will only have an effect if i is a join (I think each.i= fits slightly better for this description than .join=) - this will replicate current underlying behavior. If you think the cross-apply is something that could work not just for i being a data-table but other things as well, then it would make perfect sense to implement that action too when the bool is true. 
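To make the two forms under discussion concrete, a sketch using the DT1 and DT2 defined earlier in the thread (behaviour as described above for 1.8.x):

library(data.table)
DT1 <- data.table(x = c(1,1,2,3,3), y = 1:5, z = 6:10)
setkey(DT1, "x")
DT2 <- data.table(x = c(1,2,1), w = 11:13)

DT1[DT2, y]      # by-without-by: y is evaluated once per row of DT2 and the
                 # results are stacked, keeping the join column alongside y
DT1[DT2][, y]    # join first, then evaluate y over the whole joined table,
                 # which here returns just the y values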
> > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan wrote: > > > (The earlier message was too long and was rejected.) > > So, from the discussion so far, I see that Matthew is nice enough to implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose, > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > setkey(DT1, "x") > > DT2 <- data.table(x=1) > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I expect here the same output as current DT1[DT2, y] > > > > The above syntax seems "okay". But my first question is what is `.JOIN=FALSE` supposed to do under these two circumstances? Suppose, > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10) > > setkey(DT1, "x") > > DT2 <- data.table(x=c(1,2,1), w=c(11:13)) > > # what's the output supposed to be for? > > DT1[DT2, y, .JOIN=FALSE] > > DT1[DT2, .JOIN = FALSE] > > > > Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how does it work with `subset`? > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored? > > > > Is this supposed to also do a "cross-apply" on the logical subset? I guess not. So, .JOIN is an "extra" parameter that comes into play *only* when `i` is a `data.table`? > > > > I'd love to have some replies to these questions for me to take a stance on `.JOIN`. Thank you. > > > > Best, > > Arun. > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 15:52:01 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 15:52:01 +0200 Subject: [datatable-help] sorting on floating point column Message-ID: Hi there, I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. set.seed(45) dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) head(dt) x y 1: 32 5.395395e-08 2: 16 6.956957e-08 3: 12 2.142142e-08 4: 18 5.855856e-08 5: 17 6.216216e-08 6: 14 5.025025e-08 setkey(dt, "y") # sort by column y head(dt, 10) x y 1: 47 1.401401e-09 2: 12 2.142142e-08 3: 24 1.391391e-08 4: 43 9.809810e-09 <~~~ obviously false 5: 1 2.932933e-08 6: 48 2.562563e-08 7: 49 1.891892e-08 8: 40 2.182182e-08 9: 9 7.307307e-09 <~~~ obviously false 10: 45 2.482482e-08 Best, Arun -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Tue Apr 30 16:09:03 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Tue, 30 Apr 2013 10:09:03 -0400 Subject: [datatable-help] sorting on floating point column In-Reply-To: References: Message-ID: I'm seeing the same thing as Arun: > dt[, diffs := round(c(NA, diff(y) * 1e8), 3)] > dt x y diffs 1: 19 4.594595e-08 NA 2: 17 7.007007e-08 2.412 3: 45 3.543544e-08 -3.463 4: 38 6.326326e-08 2.783 5: 23 7.847848e-08 1.522 6: 46 5.975976e-08 -1.872 7: 3 3.073073e-08 -2.903 8: 4 9.909910e-08 6.837 9: 16 5.535536e-08 -4.374 10: 25 9.609610e-08 4.074 11: 24 9.309309e-08 -0.300 12: 12 7.000022e-01 70000210.691 13: 31 3.453453e-08 -70000216.547 14: 34 5.565566e-08 2.112 15: 14 1.241241e-08 -4.324 On Tue, Apr 30, 2013 at 9:52 AM, Arunkumar Srinivasan wrote: > Hi there, > > I just saw something strange when I was sorting a column of p-values. I > checked the data.table bug tracker for words "sort" and "floating point" > and there were no hits for this case. There's a bug for "integer 64" sort > on a column though. > > So, here's a reproducible example. I'd be glad to file a bug, if it is and > be corrected if it's something I am doing wrong. > > set.seed(45) > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), > 7000000:7000100), 50)/1e7) > head(dt) > x y > 1: 32 5.395395e-08 > 2: 16 6.956957e-08 > 3: 12 2.142142e-08 > 4: 18 5.855856e-08 > 5: 17 6.216216e-08 > 6: 14 5.025025e-08 > setkey(dt, "y") # sort by column y > head(dt, 10) > x y > 1: 47 1.401401e-09 > 2: 12 2.142142e-08 > 3: 24 1.391391e-08 > 4: 43 9.809810e-09 <~~~ obviously false > 5: 1 2.932933e-08 > 6: 48 2.562563e-08 > 7: 49 1.891892e-08 > 8: 40 2.182182e-08 > 9: 9 7.307307e-09 <~~~ obviously false > 10: 45 2.482482e-08 > > Best, > Arun > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 30 16:09:25 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 15:09:25 +0100 Subject: [datatable-help] sorting on floating point column In-Reply-To: References: Message-ID: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> Hi, data.table sorts double within machine tolerance : > sqrt(.Machine$double.eps) [1] 1.490116e-08 > i.e. numbers closer than this are considered equal. Otherwise we wouldn't be able to do things like DT[.(3.14)]. I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. Matthew On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > Hi there, > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. > So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. 
> > set.seed(45) > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) > head(dt) > x y > 1: 32 5.395395e-08 > 2: 16 6.956957e-08 > 3: 12 2.142142e-08 > 4: 18 5.855856e-08 > 5: 17 6.216216e-08 > 6: 14 5.025025e-08 > setkey(dt, "y") # sort by column y > head(dt, 10) > x y > 1: 47 1.401401e-09 > 2: 12 2.142142e-08 > 3: 24 1.391391e-08 > 4: 43 9.809810e-09 <~~~ obviously false > 5: 1 2.932933e-08 > 6: 48 2.562563e-08 > 7: 49 1.891892e-08 > 8: 40 2.182182e-08 > 9: 9 7.307307e-09 <~~~ obviously false > 10: 45 2.482482e-08 > > Best, > Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 30 16:13:09 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 15:13:09 +0100 Subject: [datatable-help] sorting on floating point column In-Reply-To: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> Message-ID: <69aefde0f4d3a4eb53b708a7f5df6888@imap.plus.net> Or, perhaps the tolerance should be a function of the range of the column. [The range would be quick to calculate with a single C for loop.] On 30.04.2013 15:09, Matthew Dowle wrote: > Hi, > > data.table sorts double within machine tolerance : > >> sqrt(.Machine$double.eps) > [1] 1.490116e-08 >> > > i.e. numbers closer than this are considered equal. > > Otherwise we wouldn't be able to do things like DT[.(3.14)]. > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? > > In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. > > Matthew > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > >> Hi there, >> I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. >> So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. >> >> set.seed(45) >> dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) >> head(dt) >> x y >> 1: 32 5.395395e-08 >> 2: 16 6.956957e-08 >> 3: 12 2.142142e-08 >> 4: 18 5.855856e-08 >> 5: 17 6.216216e-08 >> 6: 14 5.025025e-08 >> setkey(dt, "y") # sort by column y >> head(dt, 10) >> x y >> 1: 47 1.401401e-09 >> 2: 12 2.142142e-08 >> 3: 24 1.391391e-08 >> 4: 43 9.809810e-09 <~~~ obviously false >> 5: 1 2.932933e-08 >> 6: 48 2.562563e-08 >> 7: 49 1.891892e-08 >> 8: 40 2.182182e-08 >> 9: 9 7.307307e-09 <~~~ obviously false >> 10: 45 2.482482e-08 >> >> Best, >> Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 16:16:03 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 16:16:03 +0200 Subject: [datatable-help] sorting on floating point column In-Reply-To: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> Message-ID: <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> Matthew, I see. I din't think about tolerance. Although dt[with(dt, order(y)), ] seems to do the task right (similar to data.frame). 
I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer. Arun On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote: > > Hi, > data.table sorts double within machine tolerance : > > sqrt(.Machine$double.eps) > [1] 1.490116e-08 > > > > i.e. numbers closer than this are considered equal. > > Otherwise we wouldn't be able to do things like DT[.(3.14)]. > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? > > In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. > > Matthew > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > > Hi there, > > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. > > So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. > > set.seed(45) > > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) > > head(dt) > > x y > > 1: 32 5.395395e-08 > > 2: 16 6.956957e-08 > > 3: 12 2.142142e-08 > > 4: 18 5.855856e-08 > > 5: 17 6.216216e-08 > > 6: 14 5.025025e-08 > > setkey(dt, "y") # sort by column y > > head(dt, 10) > > x y > > 1: 47 1.401401e-09 > > 2: 12 2.142142e-08 > > 3: 24 1.391391e-08 > > 4: 43 9.809810e-09 <~~~ obviously false > > 5: 1 2.932933e-08 > > 6: 48 2.562563e-08 > > 7: 49 1.891892e-08 > > 8: 40 2.182182e-08 > > 9: 9 7.307307e-09 <~~~ obviously false > > 10: 45 2.482482e-08 > > > > Best, > > Arun > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Apr 30 16:22:54 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 30 Apr 2013 15:22:54 +0100 Subject: [datatable-help] sorting on floating point column In-Reply-To: <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> Message-ID: <2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net> Maybe it doesn't actually need to sort within machine tolerance. If it was precise, the sort would be faster, that's for sure. But at the time, I remember thinking that it should preserve the order of rows within a group of values within machine tolerance (e.g. 3.99999999, 4.00000001, 3.99999999 should be consider 4.0 and order of those 3 rows maintained). But maybe sorting them to 3.99999999, 3.99999999, 4.00000001 is ok as it's just the join that should be within machine tolerance? Interested in how fast order(y) is, though. Compared to data.table sorting of doubles. Matthew On 30.04.2013 15:16, Arunkumar Srinivasan wrote: > Matthew, > I see. I din't think about tolerance. Although > dt[with(dt, order(y)), ] > seems to do the task right (similar to data.frame). I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer. 
> > Arun > > On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote: > >> Hi, >> >> data.table sorts double within machine tolerance : >> >>> sqrt(.Machine$double.eps) >> [1] 1.490116e-08 >>> >> >> i.e. numbers closer than this are considered equal. >> >> Otherwise we wouldn't be able to do things like DT[.(3.14)]. >> >> I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? >> >> In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. >> >> Matthew >> >> On 30.04.2013 14:52, Arunkumar Srinivasan wrote: >> >>> Hi there, >>> I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. >>> So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. >>> >>> set.seed(45) >>> dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) >>> head(dt) >>> x y >>> 1: 32 5.395395e-08 >>> 2: 16 6.956957e-08 >>> 3: 12 2.142142e-08 >>> 4: 18 5.855856e-08 >>> 5: 17 6.216216e-08 >>> 6: 14 5.025025e-08 >>> setkey(dt, "y") # sort by column y >>> head(dt, 10) >>> x y >>> 1: 47 1.401401e-09 >>> 2: 12 2.142142e-08 >>> 3: 24 1.391391e-08 >>> 4: 43 9.809810e-09 <~~~ obviously false >>> 5: 1 2.932933e-08 >>> 6: 48 2.562563e-08 >>> 7: 49 1.891892e-08 >>> 8: 40 2.182182e-08 >>> 9: 9 7.307307e-09 <~~~ obviously false >>> 10: 45 2.482482e-08 >>> >>> Best, >>> Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Tue Apr 30 16:26:21 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 30 Apr 2013 16:26:21 +0200 Subject: [datatable-help] sorting on floating point column In-Reply-To: <2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net> References: <17cf7210eff5da9dadf94185f67df182@imap.plus.net> <8DC39800AD714C4AA03FDB84ED57BADD@gmail.com> <2cd9f53b01f908fe6478e3974ddf18e3@imap.plus.net> Message-ID: Matthew, Precisely. That's what I was thinking as well. But was hesitant to tell as I dint know how complex it would be to implement / change it. Since the join requires tolerance, sorting could be still done in the "right" order (by disregarding tolerance during sort). Arun On Tuesday, April 30, 2013 at 4:22 PM, Matthew Dowle wrote: > > Maybe it doesn't actually need to sort within machine tolerance. If it was precise, the sort would be faster, that's for sure. But at the time, I remember thinking that it should preserve the order of rows within a group of values within machine tolerance (e.g. 3.99999999, 4.00000001, 3.99999999 should be consider 4.0 and order of those 3 rows maintained). But maybe sorting them to 3.99999999, 3.99999999, 4.00000001 is ok as it's just the join that should be within machine tolerance? > Interested in how fast order(y) is, though. Compared to data.table sorting of doubles. > Matthew > > On 30.04.2013 15:16, Arunkumar Srinivasan wrote: > > Matthew, > > I see. I din't think about tolerance. Although > > dt[with(dt, order(y)), ] > > seems to do the task right (similar to data.frame). I'm glad that I don't have to convert to data.frame to perform the order. I am not keying by this column. 
Unless one needs this column for keying, I don't think a tolerance option is essential. Although, having it definitely would be only nicer. > > Arun > > > > > > On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote: > > > > > > > > Hi, > > > data.table sorts double within machine tolerance : > > > > sqrt(.Machine$double.eps) > > > [1] 1.490116e-08 > > > > > > > > > > i.e. numbers closer than this are considered equal. > > > > > > Otherwise we wouldn't be able to do things like DT[.(3.14)]. > > > > > > I had a quick look, see arguments of data.table:::ordernumtol which takes "tol" but there is no option provided (yet) to change this. Do we need one? > > > > > > In the examples section of one of the help pages it has an example which generates a series of numers very close together using pi. Note that your numbers are both close together, and, very close to 0. > > > > > > Matthew > > > > > > On 30.04.2013 14:52, Arunkumar Srinivasan wrote: > > > > Hi there, > > > > I just saw something strange when I was sorting a column of p-values. I checked the data.table bug tracker for words "sort" and "floating point" and there were no hits for this case. There's a bug for "integer 64" sort on a column though. > > > > So, here's a reproducible example. I'd be glad to file a bug, if it is and be corrected if it's something I am doing wrong. > > > > set.seed(45) > > > > dt <- data.table(x=sample(50), y= sample(c(seq(0, 1, length.out=1000), 7000000:7000100), 50)/1e7) > > > > head(dt) > > > > x y > > > > 1: 32 5.395395e-08 > > > > 2: 16 6.956957e-08 > > > > 3: 12 2.142142e-08 > > > > 4: 18 5.855856e-08 > > > > 5: 17 6.216216e-08 > > > > 6: 14 5.025025e-08 > > > > setkey(dt, "y") # sort by column y > > > > head(dt, 10) > > > > x y > > > > 1: 47 1.401401e-09 > > > > 2: 12 2.142142e-08 > > > > 3: 24 1.391391e-08 > > > > 4: 43 9.809810e-09 <~~~ obviously false > > > > 5: 1 2.932933e-08 > > > > 6: 48 2.562563e-08 > > > > 7: 49 1.891892e-08 > > > > 8: 40 2.182182e-08 > > > > 9: 9 7.307307e-09 <~~~ obviously false > > > > 10: 45 2.482482e-08 > > > > > > > > Best, > > > > Arun > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Tue Apr 30 17:03:05 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Tue, 30 Apr 2013 10:03:05 -0500 Subject: [datatable-help] changing data.table by-without-by syntax to require a "by" In-Reply-To: <5AD5B1D231A045329D46159FB5297739@gmail.com> References: <1366401278742-4664770.post@n4.nabble.com> <1C247AEE-36B3-4265-9253-4F33817D0EEA@sydney.edu.au> <1366643879137-4664990.post@n4.nabble.com> <-8694790273355420813@unknownmsgid> <5AD5B1D231A045329D46159FB5297739@gmail.com> Message-ID: Arun, Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does currently. No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is literally a 'by' by each of the rows of DT2 that are in the join (thus each.i! - the operation 'y' will be performed for each of the rows of 'i' and then combined and returned). There is no efficiency issue here that I can see, but Matthew can correct me on this. As far as I understand the efficiency comes into play when e.g. the rows of 'i' are unique, and after the join you'd like to do a 'by' by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient since the 'by' could've already been done while joining. 
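A sketch of the efficiency point above, using the thread's DT1 and a hypothetical DT2u whose rows are unique so that the two forms agree:

library(data.table)
DT1  <- data.table(x = c(1,1,2,3,3), y = 1:5, z = 6:10)
setkey(DT1, "x")
DT2u <- data.table(x = c(1, 3))

DT1[DT2u, sum(y)]              # grouping is done while joining (by-without-by)
DT1[DT2u][, sum(y), by = x]    # join first, then a second grouping pass over the result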
DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future DT1[DT2] - in this expression there is no by-without-by happening in either case. The purpose of this is NOT for j just being a column or an expression that gets evaluated into a signal column. It applies to any j. The extra 'by-without-by' column is currently output independently of how many columns you output in your j-expression, the behavior is very similar as to when you specify a by=., except that the 'by' happens by a very special expression, that only exists when joining two data-tables and that generally doesn't exist before or after the join. Hope this answers your questions. On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan wrote: > Eduard, thanks for your reply. But somethings are unclear to me still. > I'll try to explain them below. > > First I prefer .JOIN (or cross.apply) just because `each.i` seems general > (that it is applicable to *every* i operation, which as of now seems > untrue). .JOIN is specific to data.table type for `i`. > > From what I understand from your reply, if (.JOIN = FALSE), then, > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y] > > Is this right? It's a bit confusing because I think you're okay with > "by-without-by" and I got the impression from Sadao that he finds the > syntax of "by-without-by" unaccessible/advanced for basic users. So, just > to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the > "by-without-by" and then result in a "vector", right? > > Matthew explains in the current documentation that DT1[DT2][, y] would > "join" all columns of DT1 and DT2 and then subset. I assume the > implementation underneath is *not* DT1[DT2][, y] rather the result is an > efficient equivalence. Then, that of course seems alright to me. > > If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]` > doesn't make sense/has no purpose to me. At least I can't think of any at > the moment. > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as > DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results > in getting evaluated as a scalar for every group in the current > by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`. > Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` > instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of > DT1[i, list(x,y)]. > > If you/anyone believes it's wrong, I'd be all ears to clarify as to what's > the purpose of `drop` then (and also how it *doesn't* suit here as compared > to .JOIN). > > Arun > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote: > > Arun, > > If the new boolean is false, the result would be the same as without it > and would be equal to current behavior of d[i][, j]. If it's true, it will > only have an effect if i is a join (I think each.i= fits slightly better > for this description than .join=) - this will replicate current underlying > behavior. If you think the cross-apply is something that could work not > just for i being a data-table but other things as well, then it would make > perfect sense to implement that action too when the bool is true. > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan > wrote: > > (The earlier message was too long and was rejected.) > So, from the discussion so far, I see that Matthew is nice enough to > implement `.JOIN` or `cross.apply`. I've a couple of questions. 
> Suppose,
>
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=1)
> DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this.
> I expect here the same output as the current DT1[DT2, y]
>
> The above syntax seems "okay". But my first question is what is
> `.JOIN=FALSE` supposed to do under these two circumstances? Suppose,
>
> DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> setkey(DT1, "x")
> DT2 <- data.table(x=c(1,2,1), w=c(11:13))
> # what's the output supposed to be for?
> DT1[DT2, y, .JOIN=FALSE]
> DT1[DT2, .JOIN = FALSE]
>
> Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how
> does it work with `subset`?
>
> DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored?
> Is this supposed to also do a "cross-apply" on the logical subset? I
> guess not. So, .JOIN is an "extra" parameter that comes into play *only*
> when `i` is a `data.table`?
>
> I'd love to have some replies to these questions for me to take a stance
> on `.JOIN`. Thank you.
>
> Best,
> Arun.

From p.harding at paniscus.com Tue Apr 30 19:01:50 2013
From: p.harding at paniscus.com (Paul Harding)
Date: Tue, 30 Apr 2013 18:01:50 +0100
Subject: [datatable-help] fread on very large file
Message-ID:

Problem with fread on a large file.

The file is 8GB, just short of 200,000 lines, produced as SQL output and modified by cygwin/perl to remove the second line.

Using data.table 1.8.8 on R 3.0.0 I get an fread error:

fread("data/spd_all_fixed.csv",sep=",")
Error in fread("data/spd_all_fixed.csv", sep = ",") :
  Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0

Looking for the offending line, with line numbers in the output, so I'm guessing this is line 6 of the mid-file chunk examined,

$ grep -n '204038,2617097,201108' spd_all_fixed.csv
8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0

and comparing to surrounding lines and the first ten lines

$ head spd_all_fixed.csv
s_key,i_key,p_key,q,pq,d,l,epi,class
203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13

I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command.

Regards
Paul

From mdowle at mdowle.plus.com Tue Apr 30 19:52:54 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 30 Apr 2013 18:52:54 +0100
Subject: [datatable-help] fread on very large file
In-Reply-To:
References:
Message-ID: <6215268129090c5164b66264010bea9b@imap.plus.net>

Hi,

Thanks for reporting this. Please set verbose=TRUE and let us know the output.
Thanks,
Matthew

On 30.04.2013 18:01, Paul Harding wrote:
> Problem with fread on a large file.
>
> The file is 8GB, just short of 200,000 lines, produced as SQL output and modified by cygwin/perl to remove the second line.
>
> Using data.table 1.8.8 on R 3.0.0 I get an fread error:
>
> fread("data/spd_all_fixed.csv",sep=",")
> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>   Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>
> Looking for the offending line, with line numbers in the output, so I'm guessing this is line 6 of the mid-file chunk examined,
>
> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>
> and comparing to surrounding lines and the first ten lines
>
> $ head spd_all_fixed.csv
> s_key,i_key,p_key,q,pq,d,l,epi,class
> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>
> I can't see any difference. I wonder if this is a bug? I have no problems on a small test data set run through an identical process and using the same fread command.
>
> Regards
> Paul

From rv15i at yahoo.se Tue Apr 30 20:38:51 2013
From: rv15i at yahoo.se (ravi)
Date: Tue, 30 Apr 2013 19:38:51 +0100 (BST)
Subject: [datatable-help] fread (sep2) on data with a comma as decimal delimiter
Message-ID: <1367347131.53228.YahooMailNeo@web171302.mail.ir2.yahoo.com>

Hi,
I have a huge Excel file that I have converted to a tab delimited file. The numerical data have a comma as a decimal delimiter. I made a compressed version of the file by just taking the first 100 rows. On this, I have confirmed that the following command works fine:

df<-read.table(file=file1,header=TRUE,sep="\t",dec=",",encoding="latin1")

The following data.table also appears to work OK:

dt<-fread(file1,sep="\t")

But the numerical data end up as characters. I would like to have help with the most efficient method of converting these into numeric class. I note that sep2 has not been implemented yet. Is there any workaround? Can I specify the encoding also?
Would appreciate any help that I can get.
Thanks,
Ravi

From mdowle at mdowle.plus.com Tue Apr 30 20:48:32 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 30 Apr 2013 19:48:32 +0100
Subject: [datatable-help] fread (sep2) on data with a comma as decimal delimiter
In-Reply-To: <1367347131.53228.YahooMailNeo@web171302.mail.ir2.yahoo.com>
References: <1367347131.53228.YahooMailNeo@web171302.mail.ir2.yahoo.com>
Message-ID: <79e76ef89c8c532d8c2c7cafa8b0a86e@imap.plus.net>

Hi,

Ah yes, fread is locale aware. So if you Sys.setlocale() for the numeric option to say the decimal separator is a comma, then fread should heed that. Somewhere, either on S.O.
or datatable-help, this has come up before, with an example, and it was successful. Try searching for "[data.table] Sys.setlocale" (I forget that function's spelling exactly).

We could add this locale change as an option to data.table, but it depends on choosing a particular installed locale that has the comma as separator, and doing this in a cross-platform way is not something I know a huge amount about. There was a concern that locale changes are global, but as far as I know it only affects the current R session, and switching back on.exit() should be safe enough (as a way to build it in).

fread uses a stdlib call to read floating point (rather than R, which does it itself in its own C code). It's that stdlib call that is locale aware, and it is quite convenient (and fast) from fread's internals' point of view.

Matthew

On 30.04.2013 19:38, ravi wrote:
> Hi,
> I have a huge Excel file that I have converted to a tab delimited file. The numerical data have a comma as a decimal delimiter. I made a compressed version of the file by just taking the first 100 rows. On this, I have confirmed that the following command works fine:
>
> df<-read.table(file=file1,header=TRUE,sep="\t",dec=",",encoding="latin1")
>
> The following data.table also appears to work OK:
>
> dt<-fread(file1,sep="\t")
>
> But the numerical data end up as characters. I would like to have help with the most efficient method of converting these into numeric class. I note that sep2 has not been implemented yet. Is there any workaround? Can I specify the encoding also?
> Would appreciate any help that I can get.
> Thanks,
> Ravi
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
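A rough sketch of the locale route described in that reply, plus a plain conversion fallback for the character columns. The locale name and the column names below are assumptions (they depend on the platform and on the file), and whether a particular fread build honours LC_NUMERIC is worth checking on a small sample first:

library(data.table)

old <- Sys.getlocale("LC_NUMERIC")          # remember the current setting
Sys.setlocale("LC_NUMERIC", "de_DE.UTF-8")  # decimal separator becomes ','; "German" on Windows
                                            # (R itself warns that changing LC_NUMERIC can affect other code)
dt <- fread(file1, sep = "\t")              # file1 as in the message above
Sys.setlocale("LC_NUMERIC", old)            # restore: the change is global to the session

# Fallback if numeric columns still arrive as character: convert in place.
# "colA" and "colB" are placeholder column names.
for (col in c("colA", "colB")) {
  set(dt, j = col, value = as.numeric(gsub(",", ".", dt[[col]], fixed = TRUE)))
}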