From J.Gorecki at wit.edu.pl  Sun Jun  1 13:16:53 2014
From: J.Gorecki at wit.edu.pl (Jan Gorecki)
Date: Sun, 1 Jun 2014 04:16:53 -0700 (PDT)
Subject: [datatable-help] learn how to use melt and dcast
In-Reply-To: <1400590237635-4690882.post@n4.nabble.com>
References: <1400590237635-4690882.post@n4.nabble.com>
Message-ID: <1401621413529-4691552.post@n4.nabble.com>

Hi statquant3,

I found this tutorial very helpful:
http://marcoghislanzoni.com/blog/2013/10/11/pivot-tables-in-r-with-melt-and-cast/

Jan


--
View this message in context: http://r.789695.n4.nabble.com/learn-how-to-use-melt-and-dcast-tp4690882p4691552.html
Sent from the datatable-help mailing list archive at Nabble.com.

From J.Gorecki at wit.edu.pl  Wed Jun  4 11:40:45 2014
From: J.Gorecki at wit.edu.pl (Jan Gorecki)
Date: Wed, 4 Jun 2014 02:40:45 -0700 (PDT)
Subject: [datatable-help] data.table syntax Data Warehouse use case
	simulation
Message-ID: <1401874845225-4691697.post@n4.nabble.com>

Hi All,

I would rather not go deep into a description of the DW star schema model. It
may not be necessary, as you have the initial structure and the expected
structure.

We have our measures (numeric values) in the "facts" tables. Facts are
connected to dimensions, which contain the reference field from the facts
tables plus some higher level attributes (may be seen as: dim1="Paris",
dim1h="France").

I'm looking for a memory-, time- and syntax-optimal solution to denormalize
my data by joining the facts table to all the dimension tables.

# populate data
library(data.table)
facts <- data.table(dim1 = letters[1:6], dim2 = letters[7:12],
                    dim3 = letters[13:18], dim4 = letters[19:24],
                    quantity = rnorm(6, 100, 40),
                    value = rnorm(6, 1000, 200))
dim1 <- data.table(dim1 = letters[1:6], dim1h = rep(letters[1:3], 2), key = "dim1")
dim2 <- data.table(dim2 = letters[7:12], dim2h = rep(letters[7:9], 2), key = "dim2")
dim3 <- data.table(dim3 = letters[13:18], dim3h = rep(letters[13:15], 2), key = "dim3")
dim4 <- data.table(dim4 = letters[19:24], dim4h = rep(letters[19:21], 2), key = "dim4")

# my proposed solution
# joins the keyed table `join` to every row of `master` on column `by`;
# note that setkeyv() re-sorts `master` by reference
joinby <- function(master, join, by){
  stopifnot(by %in% names(master), by %in% names(join))
  join[setkeyv(master, by)]
}

# denormalize
dt <- joinby(joinby(joinby(joinby(facts, dim1, "dim1"), dim2, "dim2"),
                    dim3, "dim3"), dim4, "dim4")

# aggregate - expected results
dt[, list(quantity = sum(quantity), value = sum(value)),
   by = c("dim1h", "dim2h", "dim3h", "dim4h")]

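The nested joinby() calls can also be written as a fold over the dimension
tables. The following is only a sketch under a few assumptions: it uses
merge() with all.x = TRUE as the join instead of the joinby() helper, keeps
just two dimensions, and replaces the rnorm() measures with fixed numbers so
the result is easy to check. The names dims, denorm and agg are made up for
the illustration.

```r
library(data.table)

# Toy star schema: two dimensions, fixed measures (assumption for checkability)
facts <- data.table(dim1 = letters[1:6], dim2 = letters[7:12],
                    quantity = c(10, 20, 30, 40, 50, 60),
                    value = c(100, 200, 300, 400, 500, 600))
dims <- list(
  dim1 = data.table(dim1 = letters[1:6], dim1h = rep(letters[1:3], 2)),
  dim2 = data.table(dim2 = letters[7:12], dim2h = rep(letters[7:9], 2))
)

# Fold the dimension tables into the facts table one at a time.
# all.x = TRUE keeps every fact row even if a dimension entry is missing.
denorm <- Reduce(
  function(master, col) merge(master, dims[[col]], by = col, all.x = TRUE),
  names(dims), facts
)

# Aggregate on the higher level attributes
agg <- denorm[, list(quantity = sum(quantity), value = sum(value)),
              by = c("dim1h", "dim2h")]
```

Adding a dimension then only means adding one entry to dims. Whether this is
faster or slower than the keyed joins has not been measured here.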

My solution assumes the column names to be used in the joins are identical.
The syntax isn't that great, but I couldn't figure out anything better.
I'm not sure about the performance; it may be an issue because the master
table is re-sorted on each join.

Would anybody propose a better (more optimal) solution?

Regards,
Jan


--
View this message in context: http://r.789695.n4.nabble.com/data-table-syntax-Data-Warehouse-use-case-simulation-tp4691697.html
Sent from the datatable-help mailing list archive at Nabble.com.

From jmtruppia at gmail.com  Fri Jun  6 00:01:36 2014
From: jmtruppia at gmail.com (juancentro)
Date: Thu, 5 Jun 2014 15:01:36 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to
	require a "by"
In-Reply-To: <1399468390041-4690112.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com>
 <1399453335248-4690100.post@n4.nabble.com>
 <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>
 <1399462206528-4690105.post@n4.nabble.com>
 <etPan.536a26b2.5bd062c2.bfdf@Arunkumars-MacBook-Pro.local>
 <1399468390041-4690112.post@n4.nabble.com>
Message-ID: <1402005696445-4691774.post@n4.nabble.com>

Hi, what's the current status on this one? In the latest 1.9.3, by=.EACHI is
used. This is disruptive for current users (it has broken several pieces of
my code) but, after complaining and barking, I realized that it is really
more intuitive and reasonable to do a by only when a by is explicit.
Are there any plans to release 1.9.3, and which syntax will be kept? I want
to be prepared.
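
For anyone trying to picture the difference, here is a minimal sketch,
assuming a data.table version that implements by = .EACHI (1.9.4 or later on
CRAN). The tables X and Y and the column names are invented for the example.

```r
library(data.table)

X <- data.table(id = c("a", "a", "b", "b", "c"), v = 1:5, key = "id")
Y <- data.table(id = c("a", "b"))

# No by: j is evaluated once, over all rows of X matched by Y
total <- X[Y, sum(v)]

# by = .EACHI: j is evaluated once for each row of Y,
# which is what the old implicit by-without-by did
per_group <- X[Y, list(total = sum(v)), by = .EACHI]
```

Code that relied on the implicit grouping would need by = .EACHI added to
give the per-row result again.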

thanks!


--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4691774.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com  Fri Jun  6 00:13:02 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 6 Jun 2014 00:13:02 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to
 require a "by"
In-Reply-To: <1402005696445-4691774.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com>
 <1399453335248-4690100.post@n4.nabble.com>
 <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>
 <1399462206528-4690105.post@n4.nabble.com>
 <etPan.536a26b2.5bd062c2.bfdf@Arunkumars-MacBook-Pro.local>
 <1399468390041-4690112.post@n4.nabble.com>
 <1402005696445-4691774.post@n4.nabble.com>
Message-ID: <etPan.5390eb6e.238e1f29.dc4@Arunkumars-MacBook-Pro.local>

Juancentro,

Matt started a post on this topic in March this year here: http://lists.r-forge.r-project.org/pipermail/datatable-help/2014-March/002430.html Have you read it or contributed there? If not, when and where did you complain?
Matt also checked all dependent packages on data.table (on CRAN, I believe, bioconductor - not sure) and contacted those authors whose unit tests failed on this issue, IIRC. Are you developing a package that's dependent on data.table? If so, is it already on CRAN or bioconductor? If not, how do you expect us to reach you other than through the mailing list?
And 1.9.3 is a development version, where these things are meant to be ironed out before pushing a *stable* release to CRAN. And IIUC, by the time it's pushed to CRAN, there should be a provision to use the older feature, or some other fix, so that the older feature can be properly deprecated. As I said before, this is *still* in development, and we've not gotten to it yet.

I think rather that you should be following the mailing list closely (and NEWS) and contribute to the conversations when decisions are being made.

Arun

_______________________________________________  
datatable-help mailing list  
datatable-help at lists.r-forge.r-project.org  
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  

From jmtruppia at gmail.com  Fri Jun  6 00:18:24 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Thu, 5 Jun 2014 19:18:24 -0300
Subject: [datatable-help] changing data.table by-without-by syntax to
 require a "by"
In-Reply-To: <etPan.5390eb6e.238e1f29.dc4@Arunkumars-MacBook-Pro.local>
References: <1366401278742-4664770.post@n4.nabble.com>
 <1399453335248-4690100.post@n4.nabble.com>
 <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>
 <1399462206528-4690105.post@n4.nabble.com>
 <etPan.536a26b2.5bd062c2.bfdf@Arunkumars-MacBook-Pro.local>
 <1399468390041-4690112.post@n4.nabble.com>
 <1402005696445-4691774.post@n4.nabble.com>
 <etPan.5390eb6e.238e1f29.dc4@Arunkumars-MacBook-Pro.local>
Message-ID: <CAO2XSvfbxDDj7faobP8giAwqUBzbafD6yD6G1SZ47bw57KeQoQ@mail.gmail.com>

Arun, I only complained to myself! My published packages don't depend on
data.table. My unpublished code does. But I am not complaining about the
change, it is a good one! I was just asking if you had reached a decision.

I meant the complaining part as something funny, not to be taken at face
value.

From mdowle at mdowle.plus.com  Fri Jun  6 02:40:52 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Fri, 06 Jun 2014 01:40:52 +0100
Subject: [datatable-help] internal FALSE/TRUE value has been modified
In-Reply-To: <5362685E.1080303@mdowle.plus.com>
References: <5361D04C.2090509@gmail.com> <5362685E.1080303@mdowle.plus.com>
Message-ID: <53910E14.8030101@mdowle.plus.com>


Now fixed in v1.9.3 :

o  The warning "internal TRUE value has been modified" with recently released R 3.1
   when grouping a table containing a logical column and where all groups are just 1 row
   is now fixed and tests added. Thanks to James Sams for the reproducible example.
   The warning is issued by R and we have asked if it can be upgraded to error.

Matt


On 01/05/14 16:29, Matt Dowle wrote:
>
> Reproduced, thanks for nice example. Not sure yet but what R 3.1 now 
> does is store length 1 logical vectors once only, globally, for 
> efficiency to avoid many new allocations for the common case of single 
> TRUE or FALSE values passed around at C or R level (a nice and welcome 
> change).  Since data.table modifies vectors by reference,  if that 
> vector is length 1 a new data.table bug as from R 3.1 could be 
> modifying R's internal value of TRUE or FALSE whenever length 1 
> logical vectors occur. Clearly a serious bug. The test suite 
> immediately broke the day after the R-devel change was made (good) and 
> was one reason data.table was in error state in CRAN checks for quite 
> a while before R 3.1 shipped.  It was typically tests of 1-row 
> data.table's including a logical column and modifying that logical 
> column that broke. We fixed that and put in checks to detect and warn 
> if R's internal value has been modified, just in case.  Those 
> changes were in v1.9.2 on CRAN.  I think I wasn't 100% confident in 
> the detection test (false positives) so made it a warning instead of 
> an error.  Now that R 3.1 is out and we haven't had any false 
> positives, it should be an error.
>
> The feature of this upc_table is that all the groups are size 1 :
>
> > upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
> [1] 1
>
> If we change the example so that one group has more than 1 row, it 
> works ok :
>
> > upc_table = data.table(upc=c(1:99998,1,1), upc_ver_uc=rep(c(1,2), 
> times=50000), is_PL=rep(c(T, F, F, T), each=25000), 
> product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
> > upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
> [1] 2
> > upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc, 
> upc_ver_uc)]
>
> So it seems the problem is in the single allocation of working memory 
> for the largest group when that's just 1 and contains a logical 
> column.  Odd, I would have sworn we caught that! Will fix.
>
> R-devel are planning to do more of this small-object-sharing for 
> common single integer values e.g. 0-10,  so we'll need to add more 
> tests accordingly.
>
> Thanks,
> Matt
>
>
>
> On 01/05/14 05:40, James Sams wrote:
>> I don't really know what this error message means. A quick example to 
>> show what I'm seeing:
>>
>> > library(data.table)
>> data.table 1.9.3  For help type: help("data.table")
>> > upc_table = data.table(upc=1:100000, upc_ver_uc=rep(c(1,2), 
>> times=50000), is_PL=rep(c(T, F, F, T), each=25000), 
>> product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
>> > upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc, 
>> upc_ver_uc)]
>> Warning message:
>> In `[.data.table`(upc_table, , list(is_PL, product_module_code), :
>>   internal TRUE value has been modified
>>
>> When I continue using R, I eventually start getting more errors, such 
>> as:
>>
>> Error in gettext(domain, unlist(args)) : invalid 'string' value
>> Error during wrapup: invalid 'string' value
>>
>> and then terminal input/output becomes corrupted. I only start 
>> getting these error messages once I start using data.table; but the 
>> messages don't necessarily occur only with data.table functions.
>>
>> I don't know if the last statement above is executing correctly or 
>> not. I'm rather confused as to what is going on. I was using a 
>> somewhat stale (maybe a couple of weeks old) svn version of 
>> data.table; but I see the same behavior with the latest data.table 
>> (r1263). I'm using CRAN's R 3.1 package for Ubuntu on 13.10 and 14.04.
>>
>>
>>
>> > sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C 
>> LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8 
>> LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C LC_ADDRESS=C               
>> LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods base
>>
>> other attached packages:
>> [1] data.table_1.9.3
>>
>> loaded via a namespace (and not attached):
>> [1] plyr_1.8.1    Rcpp_0.11.1   reshape2_1.4  stringr_0.6.2
>>
>


From rguy at 123mail.org  Fri Jun  6 08:29:01 2014
From: rguy at 123mail.org (Rguy)
Date: Thu, 5 Jun 2014 23:29:01 -0700 (PDT)
Subject: [datatable-help] A[B]?
In-Reply-To: <537D7511.1000209@gmail.com>
References: <1399183248863-4689942.post@n4.nabble.com>
 <5365F136.8050807@gmail.com> <1399370245881-4690040.post@n4.nabble.com>
 <537D7511.1000209@gmail.com>
Message-ID: <1402036141545-4691793.post@n4.nabble.com>

In the FAQ, the X[Y] syntax is first mentioned in item 1.11, where it is not
explained and no example of its use is provided.

In item 1.12, X[Y] is compared to merge, again without any attempt to
explain what X[Y] is or does, and with no examples of its use. Also, merge
is not discussed correctly:

"...the number of rows returned by merge(X,Y) and merge(Y,X) is the same."

This can be controlled by the merge arguments all.x and all.y.

I suggest that before discussing the ins and outs of the X[Y] syntax, the
FAQ explain what it is and provide examples of its use. Think of it as "X[Y]
for Dummies". As things stand, the FAQ is completely useless for getting a
grip on X[Y], and it is very frustrating to encounter explanations of a
concept that has nowhere been introduced or illustrated.
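
As an illustration of the kind of introductory example meant here, the
following is a rough sketch and not text from the FAQ. The tables X and Y and
their columns are invented. It assumes both tables are keyed on the join
column.

```r
library(data.table)

X <- data.table(id = c("a", "b", "c"), x = 1:3, key = "id")
Y <- data.table(id = c("b", "c", "d"), y = c(20, 30, 40), key = "id")

# X[Y]: look up each row of Y in X; the result has one row per row of Y,
# with NA in X's columns where Y's id is not found in X
XY <- X[Y]

# Y[X]: the other way round; one row per row of X
YX <- Y[X]

# merge() with defaults keeps only the ids present in both tables
inner <- merge(X, Y)
```

Here X[Y] and Y[X] contain different ids (b, c, d versus a, b, c), while
merge(X, Y) and merge(Y, X) both return the two shared ids.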


--
View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942p4691793.html
Sent from the datatable-help mailing list archive at Nabble.com.

From mdowle at mdowle.plus.com  Mon Jun  9 15:51:11 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Mon, 09 Jun 2014 14:51:11 +0100
Subject: [datatable-help] data.table has moved to GitHub
Message-ID: <5395BBCF.1050800@mdowle.plus.com>


Dear all,

Arun has done an amazing job in transferring everything from R-Forge to 
GitHub. This includes the full commit history and outstanding bug and 
feature requests.

     https://github.com/Rdatatable/datatable/

To install the latest version from now on it's :

     devtools::install_github("datatable", "Rdatatable")

As you may have noticed R-Forge has been stuck in building state for 
several days now, so you wouldn't have been able to install v1.9.3 from 
there anyway.  Arun has integrated with Travis which gives us the 
package build and check environment.  Windows users will need to install 
Rtools because install_github() compiles from source,  but that is 
straightforward we believe.  We may be able to add building and checking 
of a compiled .zip for Windows in future.  If you're a Windows user 
please let us know how you get on with Rtools and 
devtools::install_github().

GitHub should make it easier for you to contribute :  just edit the file 
within the github website and then press "Propose file change". Project 
members will then review and accept the change.

Public access to the bug and feature request trackers on R-Forge is now 
turned off.  Please use GitHub from now on.  Comments couldn't be 
transferred to GitHub but we can still see them on R-Forge.  If you 
raised an issue on R-Forge you may still get automatic emails from 
R-Forge as we close them down.

Thanks,
Matt



From jmtruppia at gmail.com  Mon Jun  9 22:12:34 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Mon, 9 Jun 2014 17:12:34 -0300
Subject: [datatable-help] dcast.data.table loses column classes when column
	is date
Message-ID: <CAO2XSvesTGmM5WQSjPbj8KA_-iCvT4NTs=Poo7ti1MO+LUgXfA@mail.gmail.com>

Here is a reproducible example

library(data.table)
dcast.data.table(data = data.table(id = c(1,1,2,2), ty = c("a","b","a","b"),
                                   da = Sys.Date()),
                 formula = id ~ ty, value.var = "da")

I don't know how to report a bug; if someone can guide me, I'd be much obliged.

Thanks!

From gleynes+r at gmail.com  Mon Jun  9 23:18:59 2014
From: gleynes+r at gmail.com (Gene Leynes)
Date: Mon, 9 Jun 2014 16:18:59 -0500
Subject: [datatable-help] data.table has moved to GitHub
In-Reply-To: <5395BBCF.1050800@mdowle.plus.com>
References: <5395BBCF.1050800@mdowle.plus.com>
Message-ID: <CAOBARVgskFrGHvhWROR9ST-8UAb5PXPNw437aj5ADNDC1sJdtw@mail.gmail.com>

I was about to post a question. Do questions now go to an address on github
rather than datatable-help at lists.r-forge.r-project.org, or should we use
something else for discussion / questions?



From aragorn168b at gmail.com  Mon Jun  9 23:29:21 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 9 Jun 2014 23:29:21 +0200
Subject: [datatable-help] dcast.data.table loses column classes when
 column is date
In-Reply-To: <CAO2XSvesTGmM5WQSjPbj8KA_-iCvT4NTs=Poo7ti1MO+LUgXfA@mail.gmail.com>
References: <CAO2XSvesTGmM5WQSjPbj8KA_-iCvT4NTs=Poo7ti1MO+LUgXfA@mail.gmail.com>
Message-ID: <etPan.53962731.7644a45c.f72c@Arunkumars-MacBook-Pro.local>

Juan,

On how to report a bug:
1) Go to https://github.com and create an account, if you don't already have one.
2) Go to our project page, while signed in: https://github.com/Rdatatable/datatable
3) Click "Issues" on the right side of the page.
4) This issue doesn't already exist. So, hit "New issue" (green button on the right).
5) Provide a title. Fill the body - remember you can format code using Markdown (as well as Github flavoured markdown). For example, to write R-code, you can do:

```S
your R-code
```

The S is the lexer type (for highlighting code using Github flavoured markdown).
6) Add a label (equivalent of tag or tracker type in R-Forge) by clicking on "bug" on the right side.
7) Preview your post, if you want to. Then click "Submit new issue".

---

On the bug itself: This is because `reshape2::dcast` doesn't preserve attributes. And we wanted to be consistent with their result at the time of writing.
However, since that time, `reshape2` has obtained a newer implementation of "melt", written by Kevin Ushey, where attributes are preserved as long as all the columns that you're asking to be "molten" are of the same type. But this doesn't happen for "factors" by default because that might break existing code - and it therefore obtained a new argument "factorsAsStrings", IIUC. I personally find these things add a layer of complexity. But that's the case with "melt".

It's really hard to tell from reshape2's ?melt or ?cast what's the case regarding attributes. But my guess is that we should, starting with your post, try to define what's what and document it instead of relying entirely on being consistent with reshape2's behaviour, as we already differ slightly from reshape2.

We're very much younger than reshape2's melt/cast. So, I think we might be able to rectify these things on consistency and rules relatively easily.
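
Until that is settled, one way to check for the problem and work around it is
sketched below. It assumes a data.table version where dcast() dispatches on
data.table (older versions call dcast.data.table() directly), and it uses a
fixed date instead of Sys.Date() so the result is reproducible. The names
long and wide are just for illustration.

```r
library(data.table)

long <- data.table(id = c(1, 1, 2, 2), ty = c("a", "b", "a", "b"),
                   da = as.Date("2014-06-09"))
wide <- dcast(long, id ~ ty, value.var = "da")

# Inspect the class of each cast column; a dropped "Date" shows up as "numeric"
col_classes <- sapply(wide, function(col) class(col)[1])

# Workaround: restore the Date class by reference. as.Date() leaves columns
# that are already Date unchanged, so this is safe either way.
for (col in c("a", "b")) {
  set(wide, j = col, value = as.Date(wide[[col]], origin = "1970-01-01"))
}
```

The origin "1970-01-01" is the epoch R uses for Date, so the numeric values
left behind after the class is stripped convert back to the original dates.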

Arun


From aragorn168b at gmail.com  Mon Jun  9 23:34:39 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 9 Jun 2014 23:34:39 +0200
Subject: [datatable-help] data.table has moved to GitHub
In-Reply-To: <CAOBARVgskFrGHvhWROR9ST-8UAb5PXPNw437aj5ADNDC1sJdtw@mail.gmail.com>
References: <5395BBCF.1050800@mdowle.plus.com>
 <CAOBARVgskFrGHvhWROR9ST-8UAb5PXPNw437aj5ADNDC1sJdtw@mail.gmail.com>
Message-ID: <etPan.5396286f.684a481a.f72c@Arunkumars-MacBook-Pro.local>

Hello Gene,
Yes, you can continue to post questions on the mailing list, of course. Especially questions on design changes or proposals for design changes, where you'd like to hear from the entire list, or simply questions on "how to do this in a data.table way / is there a better way" etc., as this is a place to connect to all data.table users subscribed to the mailing list.

Although, there is a "label" on github named "question", which I'm not quite sure of the use of, yet. I suspect it is mostly for developers to communicate with each other when they're not entirely sure of where something falls or if it's a good feature etc. We'll know soon enough :).

Arun


From jmtruppia at gmail.com  Mon Jun  9 23:48:07 2014
From: jmtruppia at gmail.com (Juan Manuel Truppia)
Date: Mon, 9 Jun 2014 18:48:07 -0300
Subject: [datatable-help] dcast.data.table loses column classes when
 column is date
In-Reply-To: <etPan.53962731.7644a45c.f72c@Arunkumars-MacBook-Pro.local>
References: <CAO2XSvesTGmM5WQSjPbj8KA_-iCvT4NTs=Poo7ti1MO+LUgXfA@mail.gmail.com>
 <etPan.53962731.7644a45c.f72c@Arunkumars-MacBook-Pro.local>
Message-ID: <CAO2XSvdOU8p2WB=v3FYv3uJozMsVZS=jjySwua-kfHSTzxdkBw@mail.gmail.com>

Posted here https://github.com/Rdatatable/datatable/issues/688

However, I couldn't tag it as a bug (didn't find the option to add a
label, sorry!)

On Mon, Jun 9, 2014 at 6:29 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Juan,
>
> On how to report a bug:
> 1) Go to https://github.com and create an account, if you don't already have
> one.
> 2) Go to our project page, while signed in:
> https://github.com/Rdatatable/datatable
> 3) Click "Issues" on the right side of the page.
> 4) This issue doesn't already exist. So, hit "New issue" (green button on
> the right).
> 5) Provide a title. Fill the body - remember you can format code using
> Markdown (as well as Github flavoured markdown). For example, to write
> R-code, you can do:
>
> ```S
> your R-code
> ```
>
> The S is the lexer type (for highlighting code using Github flavoured
> markdown).
> 6) Add a label (equivalent of tag or tracker type in R-Forge) by clicking on
> "bug" on the right side.
> 7) Preview your post, if you want to. Then click "Submit new issue".
>
> ---
>
> On the bug itself: This is because `reshape2:::dcast` doesn't preserve
> attributes. And we wanted to be consistent with their result at the time of
> writing.
> However, since that time, `reshape2` has obtained newer implementation of
> "melt", written by Kevin Ushey, where attributes are preserved as long as
> all the columns that you're asking for to be "molten" are of the same type.
> But this doesn't happen for "factors" by default because that might break
> existing code - and therefore obtained a new argument "factorsAsStrings",
> IIUC. I personally find these things adding a layer of complexity. But
> that's the case with "melt".
>
> It's really hard to tell from reshape2's ?melt or ?cast what's the case
> regarding attributes. But my guess is that we should, starting with your
> post, try to define what's what and document it instead of relying entirely
> on being consistent with reshape2's behaviour, as we already differ slightly
> from reshape2.
>
> We're very much younger than reshape2's melt/cast. So, I think we might be
> able to rectify these things on consistency and rules relatively easily.
>
> Arun
>
> From: Juan Manuel Truppia jmtruppia at gmail.com
> Reply: Juan Manuel Truppia jmtruppia at gmail.com
> Date: June 9, 2014 at 10:13:05 PM
> To: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject:  [datatable-help] dcast.data.table loses column classes when column
> is date
>
> Here is a reproducible example
>
> dcast.data.table(data = data.table(id = c(1,1,2,2), ty =
> c("a","b","a","b"), da = Sys.Date()), formula = id ~ ty)
>
> I don't know how to report a bug, if someone guides me, I'll be much obliged
>
> Thanks!
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From jmtruppia at gmail.com  Mon Jun  9 23:54:39 2014
From: jmtruppia at gmail.com (juancentro)
Date: Mon, 9 Jun 2014 14:54:39 -0700 (PDT)
Subject: [datatable-help] data.table has moved to GitHub
In-Reply-To: <etPan.5396286f.684a481a.f72c@Arunkumars-MacBook-Pro.local>
References: <5395BBCF.1050800@mdowle.plus.com>
 <CAOBARVgskFrGHvhWROR9ST-8UAb5PXPNw437aj5ADNDC1sJdtw@mail.gmail.com>
 <etPan.5396286f.684a481a.f72c@Arunkumars-MacBook-Pro.local>
Message-ID: <1402350879328-4691929.post@n4.nabble.com>

This is great news!!! I love GitHub, and don't love so much R-Forge.

As for the Rtools and Windows environment, it works great for me. I've been
installing Hadley's packages from GitHub for a while, and haven't had any
issues.


--
View this message in context: http://r.789695.n4.nabble.com/data-table-has-moved-to-GitHub-tp4691915p4691929.html
Sent from the datatable-help mailing list archive at Nabble.com.

From mikkel at scarab-solutions.com  Tue Jun 10 19:17:54 2014
From: mikkel at scarab-solutions.com (Mikkel Grum)
Date: Tue, 10 Jun 2014 12:17:54 -0500
Subject: [datatable-help] data.table error: invalid subscript type,
	except it isn't.
Message-ID: <CAPeJDOSDqfa5JqPjchw=mkU_UUW0Q5dL9ab2t5MBFAxkZ2xJiw@mail.gmail.com>

Hello data.table useRs

I've written a function myTable that I've included in a package I've
made myself (RAPI). The function calls library(data.table) and does a
number of things to the data using the data.table functionality. On
its own (cutting and pasting the code into R) the function works well,
but when I install the package and then try to run the function, I get
the following error

> library(RAPI)
> myTable(11, '2014-06-09')
data.table 1.9.2  For help type: help("data.table")
Error in `[.default`(x, i) : invalid subscript type 'list'

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.9.2 RODBC_1.3-10     RAPI_1.0

loaded via a namespace (and not attached):
[1] plyr_1.8.1    Rcpp_0.11.1   reshape2_1.4  stringr_0.6.2


The line that fails is
stopsperrow <- myData[, length(timestamp), by = list(name, house, row)]

However, if I type the function name, myTable, and cut and paste the
function from the console back into the console, the command produces
the desired output without any hiccups! In other words the function in
the package is OK, but something about the environment isn't right -
if that's the right way to put it.

Any ideas for where I should be looking, or what I should be trying?

Regards
Mikkel

-- 
Mikkel Grum, PhD
Director, Research and Development

ParqueSoft  Calle 25 #127-220  Cali, Colombia
cel +57 313 730 1976
website | map | email

From my.r.help at gmail.com  Wed Jun 11 03:38:28 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Wed, 11 Jun 2014 09:38:28 +0800
Subject: [datatable-help] data.table error: invalid subscript type,
 except it isn't.
In-Reply-To: <CAPeJDOSDqfa5JqPjchw=mkU_UUW0Q5dL9ab2t5MBFAxkZ2xJiw@mail.gmail.com>
References: <CAPeJDOSDqfa5JqPjchw=mkU_UUW0Q5dL9ab2t5MBFAxkZ2xJiw@mail.gmail.com>
Message-ID: <5397B314.6030506@gmail.com>

Have you imported data.table into your RAPI package?

M

On 06/11/2014 01:17 AM, Mikkel Grum wrote:
> Hello data.table useRs
> 
> I've written a function myTable that I've included in a package I've
> made myself (RAPI). The function calls library(data.table) and does a
> number of things to the data using the data.table functionality. On
> its own (cutting and pasting the code into R) the function works well,
> but when I install the package and then try to run the function, I get
> the following error
> 
>> library(RAPI)
>> myTable(11, '2014-06-09')
> data.table 1.9.2  For help type: help("data.table")
> Error in `[.default`(x, i) : invalid subscript type 'list'
> 
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] data.table_1.9.2 RODBC_1.3-10     RAPI_1.0
> 
> loaded via a namespace (and not attached):
> [1] plyr_1.8.1    Rcpp_0.11.1   reshape2_1.4  stringr_0.6.2
> 
> 
> The line that fails is
> stopsperrow <- myData[, length(timestamp), by = list(name, house, row)]
> 
> However, if I type the function name, myTable, and cut and paste the
> function from the console back into the console, the command produces
> the desired output without any hiccups! In other words the function in
> the package is OK, but something about the environment isn't right -
> if that's the right way to put it.
> 
> Any ideas for where I should be looking, or what I should be trying?
> 
> Regards
> Mikkel
> 

From mdowle at mdowle.plus.com  Wed Jun 11 22:22:28 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Wed, 11 Jun 2014 21:22:28 +0100
Subject: [datatable-help] Slides for useR! data.table tutorial
Message-ID: <5398BA84.3060806@mdowle.plus.com>


Draft slides are now online for the 3 hour data.table tutorial at useR! 
on Monday 30 June.

     user2014.stat.ucla.edu/#tutorials

Is there something fundamental that you wished had been explained in a 
tutorial like this?  If so, please let me know.


I'm doing another of these long tutorials, jointly with Arun, in London 
on Monday 15th September :

     http://www.earl-conference.com/Speakers/Workshop1_DataTable.html


Matt



From my.r.help at gmail.com  Thu Jun 12 04:44:04 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 12 Jun 2014 10:44:04 +0800
Subject: [datatable-help] Slides for useR! data.table tutorial
In-Reply-To: <5398BA84.3060806@mdowle.plus.com>
References: <5398BA84.3060806@mdowle.plus.com>
Message-ID: <539913F4.2090102@gmail.com>

Hi Matt,

You mention GForce in your slides. Is this something that happens behind
the scenes, or is it something the user should take care of? (I couldn't
find it in the current docs.)

Thanks,

M


On 06/12/2014 04:22 AM, Matt Dowle wrote:
> 
> Draft slides are now online for the 3 hour data.table tutorial at useR!
> on Monday 30 June.
> 
>     user2014.stat.ucla.edu/#tutorials
> 
> Is there something fundamental that you wished had been explained in a
> tutorial like this?  If so, please let me know.
> 
> 
> I'm doing another of these long tutorials, jointly with Arun, in London
> on Monday 15th September :
> 
>     http://www.earl-conference.com/Speakers/Workshop1_DataTable.html
> 
> 
> Matt
> 
> 
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 

From lianoglou.steve at gene.com  Thu Jun 12 20:17:48 2014
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Thu, 12 Jun 2014 11:17:48 -0700
Subject: [datatable-help] data.table error: invalid subscript type,
 except it isn't.
In-Reply-To: <5397B314.6030506@gmail.com>
References: <CAPeJDOSDqfa5JqPjchw=mkU_UUW0Q5dL9ab2t5MBFAxkZ2xJiw@mail.gmail.com>
 <5397B314.6030506@gmail.com>
Message-ID: <CAHA9McOjPOr0KMzDJcpp3_BN-iF3umJmj5r4pWiYD3ZYUJEWwQ@mail.gmail.com>

Hi,

On Tue, Jun 10, 2014 at 6:38 PM, Michael Smith <my.r.help at gmail.com> wrote:
> Have you imported data.table into your RAPI package?

This.

You shouldn't have a line in your package that explicitly loads the
data.table package -- ie. there should be no "library(data.table)"
line in your package.

Instead you should list "data.table" in the "Imports" field in the
DESCRIPTION file of your package, then in the NAMESPACE file you
should have an "import(data.table)" line.

Once those two things are in place, everything should be feng shui.
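
A minimal sketch of those two entries, written from R so the file contents are visible in one place. The skeleton is created in a temporary directory, and everything except the `Imports: data.table` and `import(data.table)` lines (the title, the version, the body of `myTable`) is a placeholder assumed for illustration.

```r
# Build a throwaway skeleton for a package named RAPI in a temp directory.
pkg <- file.path(tempdir(), "RAPI")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION: list data.table under Imports (field names are case-sensitive).
writeLines(c(
  "Package: RAPI",
  "Version: 1.0",
  "Title: Placeholder Title",
  "Imports: data.table"
), file.path(pkg, "DESCRIPTION"))

# NAMESPACE: import data.table so that `[` called from package code
# is treated as a data.table call.
writeLines(c(
  "import(data.table)",
  "export(myTable)"
), file.path(pkg, "NAMESPACE"))

# R/myTable.R: note there is no library(data.table) call in package code.
writeLines(c(
  "myTable <- function(myData) {",
  "  myData[, length(timestamp), by = list(name, house, row)]",
  "}"
), file.path(pkg, "R", "myTable.R"))

# Read back what was declared.
desc <- read.dcf(file.path(pkg, "DESCRIPTION"))
desc[, "Imports"]
```

To actually install such a skeleton, the remaining mandatory DESCRIPTION fields (License, Description, Author, Maintainer) would also have to be filled in.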

HTH,
-steve

>
> M
>
> On 06/11/2014 01:17 AM, Mikkel Grum wrote:
>> Hello data.table useRs
>>
>> I've written a function myTable that I've included in a package I've
>> made myself (RAPI). The function calls library(data.table) and does a
>> number of things to the data using the data.table functionality. On
>> its own (cutting and pasting the code into R) the function works well,
>> but when I install the package and then try to run the function, I get
>> the following error
>>
>>> library(RAPI)
>>> myTable(11, '2014-06-09')
>> data.table 1.9.2  For help type: help("data.table")
>> Error in `[.default`(x, i) : invalid subscript type 'list'
>>
>>> sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] data.table_1.9.2 RODBC_1.3-10     RAPI_1.0
>>
>> loaded via a namespace (and not attached):
>> [1] plyr_1.8.1    Rcpp_0.11.1   reshape2_1.4  stringr_0.6.2
>>
>>
>> The line that fails is
>> stopsperrow <- myData[, length(timestamp), by = list(name, house, row)]
>>
>> However, if I type the function name, myTable, and cut and paste the
>> function from the console back into the console, the command produces
>> the desired output without any hiccups! In other words the function in
>> the package is OK, but something about the environment isn't right -
>> if that's the right way to put it.
>>
>> Any ideas for where I should be looking, or what I should be trying?
>>
>> Regards
>> Mikkel
>>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


-- 
Steve Lianoglou
Computational Biologist
Genentech

From mikkel at scarab-solutions.com  Thu Jun 12 21:15:32 2014
From: mikkel at scarab-solutions.com (Mikkel Grum)
Date: Thu, 12 Jun 2014 14:15:32 -0500
Subject: [datatable-help] datatable-help Digest, Vol 52, Issue 7
In-Reply-To: <mailman.11.1402480805.24313.datatable-help@lists.r-forge.r-project.org>
References: <mailman.11.1402480805.24313.datatable-help@lists.r-forge.r-project.org>
Message-ID: <CAPeJDOR+XVSA2o65zy=mPnpW-WiywLxZCuMkKozac6g5=g26zA@mail.gmail.com>

Thanks Michael. I had written Depends in all caps in the DESCRIPTION
file and assumed that calling library(data.table) within the function
would override that anyway.

Greatly appreciated


-- 
Mikkel Grum, PhD
Director, Research and Development

ParqueSoft  Calle 25 #127-220  Cali, Colombia
cel +57 313 730 1976
website | map | email

From J.Gorecki at wit.edu.pl  Fri Jun 13 08:00:35 2014
From: J.Gorecki at wit.edu.pl (Jan Gorecki)
Date: Thu, 12 Jun 2014 23:00:35 -0700 (PDT)
Subject: [datatable-help] data.table syntax Data Warehouse use case
	simulation
In-Reply-To: <1401874845225-4691697.post@n4.nabble.com>
References: <1401874845225-4691697.post@n4.nabble.com>
Message-ID: <1402639235630-4692037.post@n4.nabble.com>

This has been addressed by joinbyv function:
https://github.com/Rdatatable/datatable/pull/694


--
View this message in context: http://r.789695.n4.nabble.com/data-table-syntax-Data-Warehouse-use-case-simulation-tp4691697p4692037.html
Sent from the datatable-help mailing list archive at Nabble.com.

From mdowle at mdowle.plus.com  Fri Jun 13 10:24:29 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Fri, 13 Jun 2014 09:24:29 +0100
Subject: [datatable-help] Slides for useR! data.table tutorial
In-Reply-To: <539913F4.2090102@gmail.com>
References: <5398BA84.3060806@mdowle.plus.com> <539913F4.2090102@gmail.com>
Message-ID: <539AB53D.8010607@mdowle.plus.com>

Hi Michael,

It happens automatically.  See NEWS for v1.9.2 :

o  New optimization: GForce. Rather than grouping the data, the group locations are passed into
      grouped versions of sum and mean (gsum and gmean) which then compute the result for all groups
      in a single sequential pass through the column for cache efficiency. Further, since the g*
      function is called just once, we don't need to find ways to speed up calling sum or mean
      repetitively for each group. Plan is to add gmin, gmax, gsd, gprod, gwhich.min and gwhich.max.
      Examples where GForce applies now :
        DT[,sum(x,na.rm=),by=...]                       # yes
        DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
        DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
        DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available, only sum and mean so far.
      GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
      Reminder: to see the optimizations and other info, set verbose=TRUE
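
A small sketch of checking this on a toy table. The table `DT` and its columns `g` and `x` are made up for illustration; the only settings used are the `verbose=TRUE` argument and the `datatable.optimize` option named in the NEWS item above.

```r
library(data.table)

DT <- data.table(g = rep(c("a", "b"), each = 3), x = 1:6)

# Default optimization level: the verbose output reports which
# optimizations were applied to j.
res_gforce <- DT[, sum(x), by = g, verbose = TRUE]

# Level 1 switches GForce off; restore the previous setting afterwards.
old <- options(datatable.optimize = 1)
res_plain <- DT[, sum(x), by = g]
options(old)

# Both paths should give the same per-group totals.
all.equal(res_gforce, res_plain)
```

Comparing the two results is a quick sanity check that the optimization only changes how the sums are computed, not what they are.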

Matt

On 12/06/14 03:44, Michael Smith wrote:
> Hi Matt,
>
> You mention GForce in your slides. Is this something that happens behind
> the scenes, or is it something the user should take care of? (I couldn't
> find it in the current docs.)
>
> Thanks,
>
> M
>
>
>
> On 06/12/2014 04:22 AM, Matt Dowle wrote:
>> Draft slides are now online for the 3 hour data.table tutorial at useR!
>> on Monday 30 June.
>>
>>      user2014.stat.ucla.edu/#tutorials
>>
>> Is there something fundamental that you wished had been explained in a
>> tutorial like this?  If so, please let me know.
>>
>>
>> I'm doing another of these long tutorials, jointly with Arun, in London
>> on Monday 15th September :
>>
>>      http://www.earl-conference.com/Speakers/Workshop1_DataTable.html
>>
>>
>> Matt
>>
>>
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>


From mdowle at mdowle.plus.com  Fri Jun 13 14:52:01 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Fri, 13 Jun 2014 13:52:01 +0100
Subject: [datatable-help] We don't know why Stack Overflow data.table tag
	has just been renamed
Message-ID: <539AF3F1.7090906@mdowle.plus.com>


Have asked on Meta :

http://meta.stackoverflow.com/questions/260463/why-has-the-data-table-tag-just-been-renamed-r-data-table

Matt


From eduard.antonyan at gmail.com  Fri Jun 13 20:46:10 2014
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 13 Jun 2014 13:46:10 -0500
Subject: [datatable-help] We don't know why Stack Overflow data.table
 tag has just been renamed
In-Reply-To: <539AF3F1.7090906@mdowle.plus.com>
References: <539AF3F1.7090906@mdowle.plus.com>
Message-ID: <CAHZcBOoH1fnziRLKAzh5_=uqGypx3XNQ5Snzu+x6ZQ5WAt=82Q@mail.gmail.com>

holy batman, that was a mess :)


On Fri, Jun 13, 2014 at 7:52 AM, Matt Dowle <mdowle at mdowle.plus.com> wrote:

>
> Have asked on Meta :
>
> http://meta.stackoverflow.com/questions/260463/why-has-the-data-table-tag-just-been-renamed-r-data-table
>
> Matt
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

From aragorn168b at gmail.com  Fri Jun 13 21:09:42 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 13 Jun 2014 21:09:42 +0200
Subject: [datatable-help] We don't know why Stack Overflow data.table
 tag has just been renamed
In-Reply-To: <CAHZcBOoH1fnziRLKAzh5_=uqGypx3XNQ5Snzu+x6ZQ5WAt=82Q@mail.gmail.com>
References: <539AF3F1.7090906@mdowle.plus.com>
 <CAHZcBOoH1fnziRLKAzh5_=uqGypx3XNQ5Snzu+x6ZQ5WAt=82Q@mail.gmail.com>
Message-ID: <etPan.539b4c76.721da317.38b@Arunkumars-MacBook-Pro.local>

Seems like everything's back to normal.

Arun

From: Eduard Antonyan eduard.antonyan at gmail.com
Reply: Eduard Antonyan eduard.antonyan at gmail.com
Date: June 13, 2014 at 8:46:42 PM
To: Matt Dowle mdowle at mdowle.plus.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed

holy batman, that was a mess :)


On Fri, Jun 13, 2014 at 7:52 AM, Matt Dowle <mdowle at mdowle.plus.com> wrote:

Have asked on Meta :

http://meta.stackoverflow.com/questions/260463/why-has-the-data-table-tag-just-been-renamed-r-data-table

Matt


_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


From eduard.antonyan at gmail.com  Fri Jun 13 21:16:59 2014
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 13 Jun 2014 14:16:59 -0500
Subject: [datatable-help] We don't know why Stack Overflow data.table
 tag has just been renamed
In-Reply-To: <etPan.539b4c76.721da317.38b@Arunkumars-MacBook-Pro.local>
References: <539AF3F1.7090906@mdowle.plus.com>
 <CAHZcBOoH1fnziRLKAzh5_=uqGypx3XNQ5Snzu+x6ZQ5WAt=82Q@mail.gmail.com>
 <etPan.539b4c76.721da317.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <CAHZcBOo+4T92QtLtrZbnDa4-ih=4OeTW0NVvbwzkaCKnRtNHdQ@mail.gmail.com>

Did it have a better tag wiki before? Seems like it should at least have a
link to github in the full description.


On Fri, Jun 13, 2014 at 2:09 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:

> Seems like everything's back to normal.
>
>  Arun
>
> From: Eduard Antonyan eduard.antonyan at gmail.com
> Reply: Eduard Antonyan eduard.antonyan at gmail.com
> Date: June 13, 2014 at 8:46:42 PM
> To: Matt Dowle mdowle at mdowle.plus.com
> Cc: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject:  Re: [datatable-help] We don't know why Stack Overflow
> data.table tag has just been renamed
>
>  holy batman, that was a mess :)
>
>
> On Fri, Jun 13, 2014 at 7:52 AM, Matt Dowle <mdowle at mdowle.plus.com>
> wrote:
>
>>
>> Have asked on Meta :
>>
>>
>> http://meta.stackoverflow.com/questions/260463/why-has-the-data-table-tag-just-been-renamed-r-data-table
>>
>> Matt
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>

From aragorn168b at gmail.com  Fri Jun 13 23:00:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 13 Jun 2014 23:00:12 +0200
Subject: [datatable-help] We don't know why Stack Overflow data.table
 tag has just been renamed
In-Reply-To: <CAHZcBOo+4T92QtLtrZbnDa4-ih=4OeTW0NVvbwzkaCKnRtNHdQ@mail.gmail.com>
References: <539AF3F1.7090906@mdowle.plus.com>
 <CAHZcBOoH1fnziRLKAzh5_=uqGypx3XNQ5Snzu+x6ZQ5WAt=82Q@mail.gmail.com>
 <etPan.539b4c76.721da317.38b@Arunkumars-MacBook-Pro.local>
 <CAHZcBOo+4T92QtLtrZbnDa4-ih=4OeTW0NVvbwzkaCKnRtNHdQ@mail.gmail.com>
Message-ID: <etPan.539b665c.2d1d5ae9.38b@Arunkumars-MacBook-Pro.local>

Eddi,

Seems to be back: http://stackoverflow.com/tags/data.table/info

Arun

From: Eduard Antonyan eduard.antonyan at gmail.com
Reply: Eduard Antonyan eduard.antonyan at gmail.com
Date: June 13, 2014 at 9:17:20 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: Matt Dowle mdowle at mdowle.plus.com, datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed

Did it have a better tag wiki before? Seems like it should at least have a link to github in the full description.


On Fri, Jun 13, 2014 at 2:09 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
Seems like everything's back to normal.

Arun

From: Eduard Antonyan eduard.antonyan at gmail.com
Reply: Eduard Antonyan eduard.antonyan at gmail.com
Date: June 13, 2014 at 8:46:42 PM
To: Matt Dowle mdowle at mdowle.plus.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] We don't know why Stack Overflow data.table tag has just been renamed

holy batman, that was a mess :)


On Fri, Jun 13, 2014 at 7:52 AM, Matt Dowle <mdowle at mdowle.plus.com> wrote:

Have asked on Meta :

http://meta.stackoverflow.com/questions/260463/why-has-the-data-table-tag-just-been-renamed-r-data-table

Matt


_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


From rhylton at verizon.net  Sat Jun 14 01:55:12 2014
From: rhylton at verizon.net (Ron Hylton)
Date: Fri, 13 Jun 2014 19:55:12 -0400
Subject: [datatable-help] data.table is asking for help
Message-ID: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>

The code below generates the warning:

In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. If you
  didn't go under the hood please let datatable-help know so the root cause
  can be fixed.

This is my first attempt at using datatable so I probably did something
dumb, but maybe that's useful for someone.  The first case is the one that
gives the warnings.

I'm also surprised at the timings.  I wrote the original algorithm using
dataframe & ddply and I expected datatable to be substantially faster; the
opposite is true.

The algorithm does the following:  Certain columns in the table are keys and
others are values in the sense that each row with the same set of keys
should have the same set of values.  Find all the key sets for which this is
not true and return the key sets + conflicting value sets.

Insight into the performance would be appreciated.

Regards,
Ron

library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
                   x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))

setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))

print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))

print(system.time(uf <- ddply(test, .(id), conflictsFrame)))

From aragorn168b at gmail.com  Sat Jun 14 02:22:30 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 02:22:30 +0200
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
Message-ID: <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well.

This is a tricky one. It happens because you're setting a key on .SD, which should normally not be allowed. What happens is this: the first time you call setkey, .SD has no key (here), so the key is set on all of its columns: x1, x2 and x3.

When the next group (in the by=) is passed to your function, .SD still has its key set to x1,x2,x3 (because setkey modifies the object by reference), but it now holds new data corresponding to this group. data.table goes to sort this data and sees that the key is already set, in which case the row order should already be 1:n. It isn't, because the new data isn't sorted. data.table warns in those scenarios, and that's why you get the warning.

To verify this, you can try:

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  setattr(f, 'sorted', NULL)  # drop the key again so the next group's .SD starts unkeyed
  if (nrow(u) == 1) return(NULL)
  u
}

Basically, we reset the key of f (which is .SD itself, since it is only modified by reference) to NULL every time afterwards, so that .SD for the next group will not have the key set.

The ideal scenario here, IIUC, is that setkey(.SD), or setkey on anything pointing to .SD, should not be possible (locking the binding doesn't seem to affect things done by reference). .SD should, however, retain the key of the data.table wherever possible, if a key was set.
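One way to avoid touching .SD at all is to key a copy of it instead. The following is only a sketch under stated assumptions: the function name `conflictsCopy` and the toy table `dt` are hypothetical, and `copy()` allocates a fresh table for every group, so this trades speed and memory for safety rather than fixing the underlying issue.

```r
library(data.table)

# Hypothetical variant of conflictsTable1: setkey() is applied to a
# copy of the group's data, so .SD itself is never modified.
conflictsCopy <- function(f) {
  u <- unique(setkey(copy(f)))
  if (nrow(u) == 1L) return(NULL)
  u
}

# Hypothetical toy data: only id "b" has conflicting values.
dt <- data.table(
  id = c("a", "a", "b", "b", "c"),
  x1 = c(1, 1, 2, 3, 4),
  key = "id"
)

res <- dt[, conflictsCopy(.SD), by = id]
print(res)
```

Because the key is set on the copy, the next group's .SD never inherits a stale key.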


Arun

_______________________________________________  
datatable-help mailing list  
datatable-help at lists.r-forge.r-project.org  
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140614/b737d427/attachment.html>

From rhylton at verizon.net  Sat Jun 14 02:51:24 2014
From: rhylton at verizon.net (Ron Hylton)
Date: Fri, 13 Jun 2014 20:51:24 -0400
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <006701cf876a$c4f38310$4eda8930$@verizon.net>

I suspected it was something like this.  As one clarification, there is a setkey(test,id) before any setkey(.SD).  If setkey(test,id) is changed to setkey(test), so that all columns are in the original data.table's key, then the warning goes away.

However, there's another aspect.  While I'm relatively new to R, my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140613/1026d1a3/attachment-0001.html>

From aragorn168b at gmail.com  Sat Jun 14 02:57:02 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 02:57:02 +0200
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <006701cf876a$c4f38310$4eda8930$@verizon.net>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
Message-ID: <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>

> However there's another aspect. While I'm relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

`data.table` is designed with *really large* data sets in mind (> 100 or 200 GB in memory, even). Therefore, as a design feature, it trades away "referential transparency" in order to manipulate data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time they go hand-in-hand).

This is perhaps the biggest design choice one needs to be aware of when choosing or working with data.tables. It is possible to modify objects by reference using data.table: all the functions that begin with "set*" modify objects by reference. The only other one, which does not begin with "set", is the `:=` operator.
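A small sketch of what modification by reference means in practice. The helper `add_flag` and the tables `dt` and `dt2` are hypothetical names made up for this illustration; `copy()` is data.table's own function for taking an independent copy.

```r
library(data.table)

# Hypothetical helper: adds a column with `:=` inside a function.
add_flag <- function(d) {
  d[, flag := TRUE]
  invisible(d)
}

# Called directly, the caller's table is changed as well.
dt <- data.table(a = 1:3)
add_flag(dt)
changed <- "flag" %in% names(dt)

# Called on a copy, the caller's table is left alone.
dt2 <- data.table(a = 1:3)
add_flag(copy(dt2))
untouched <- !("flag" %in% names(dt2))

print(c(changed = changed, untouched = untouched))
```

The same applies to the "set*" functions such as setkey: if the caller's object must stay as it is, pass a copy.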


HTH

Arun


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140614/32083005/attachment.html>

From rhylton at verizon.net  Sat Jun 14 03:30:04 2014
From: rhylton at verizon.net (Ron Hylton)
Date: Fri, 13 Jun 2014 21:30:04 -0400
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <007301cf8770$2bb569b0$83203d10$@verizon.net>

The performance is what puzzles me; the results are correct so the warnings don't matter, and not all the variations I've tried produce warnings.  On the real dataset (~800,000 rows) data.table takes about 1.5 times longer than data.frame + ddply.  I expected it to be substantially faster.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140613/43c630b0/attachment-0001.html>

From aragorn168b at gmail.com  Sat Jun 14 04:34:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 04:34:12 +0200
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <007301cf8770$2bb569b0$83203d10$@verizon.net>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
Message-ID: <etPan.539bb4a4.54e49eb4.38b@Arunkumars-MacBook-Pro.local>

The j-expression is evaluated from within C for each group (unless it's optimised with GForce, a new initiative in data.table), and eval(.SD) or eval(anything(.SD)) is costly.

You can get around this by listing the columns yourself and using .I instead, as follows:

test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
#  0.140   0.001   0.142

Takes about 0.14 seconds.

An even faster way is:

system.time({
ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)
ans = ans[, .N, by=names(ans)]                  # (2)
ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)
})
#  0.026   0.000   0.027

The idea for the second case is:

(1) Remove all entries where there's just 1 row corresponding to that id.
(2) Aggregate this result by all the columns and get the number of rows in the column N (we won't need to use this column, though).
(3) Now aggregate by id. If an id has just 1 row at this point, it means that the id had more than 1 row originally (the filtering in step (1) ensures this), but all of them were identical, so we don't need them. So we just keep the ids where .N > 1L.
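The three steps can be traced on a toy table. This is a sketch with made-up data; the table `dt` and the intermediate names `s1`, `s2`, `s3` are assumptions introduced here so that each step's result can be inspected separately.

```r
library(data.table)

# Hypothetical toy data: id "a" has a real conflict (values 1 and 2),
# id "b" only repeats the same value, id "c" occurs once.
dt <- data.table(
  id = c("a", "a", "a", "b", "b", "c"),
  x1 = c(1, 1, 2, 5, 5, 9)
)

# (1) Keep only ids that occur on more than one row ("c" is dropped).
s1 <- dt[dt[, .I[.N > 1L], by = id]$V1]

# (2) Group by all columns: one row per distinct (id, x1) pair,
#     with the duplicate count in column N.
s2 <- s1[, .N, by = names(s1)]

# (3) Keep ids that still have more than one distinct row ("b" is dropped).
s3 <- s2[s2[, .I[.N > 1L], by = id]$V1]
print(s3)
```

Only id "a" survives, with one row per distinct value of x1.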

HTH


Arun

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140614/0224a894/attachment-0001.html>

From aragorn168b at gmail.com  Sat Jun 14 04:42:26 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 04:42:26 +0200
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <etPan.539bb4a4.54e49eb4.38b@Arunkumars-MacBook-Pro.local>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
 <etPan.539bb4a4.54e49eb4.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <etPan.539bb693.2ca88611.38b@Arunkumars-MacBook-Pro.local>

A slightly simpler version of the 2nd solution is:

system.time({
ans = test[, .N, by=names(test)]
ans = ans[ans[, .I[.N > 1L], by=id]$V1]
})
#  0.019   0.000   0.019

The answers are identical; you can check this by doing:

ans[, N := NULL]
setkey(ans)
setkey(ut1)
identical(ans, ut1) # [1] TRUE

Arun

From: Arunkumar Srinivasan aragorn168b at gmail.com
Reply: Arunkumar Srinivasan aragorn168b at gmail.com
Date: June 14, 2014 at 4:34:15 AM
To: Ron Hylton rhylton at verizon.net, datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

The j-expression is evaluated from within C for each group (unless they're optimised with GForce - a new initiative in data.table). And eval(.SD) or eval(anything(.SD)) is costly.

You can get around it by listing the columns by yourself and using .I instead, as follows:

test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
#  0.140   0.001   0.142   

Takes about 0.14 seconds.

An even faster way is:

system.time({
ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)   
ans = ans[, .N, by=names(ans)]                  # (2)   
ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)
})

#  0.026   0.000   0.027   

The idea for the second case is:

(1) Remove all entries where there's just 1 row corresponding to that id.
(2) Aggregate this result by all the columns and get the number of rows in the column N (we won't need this column, though).
(3) Now, if we aggregate by id and an id has just 1 row, it means that id originally had more than 1 row (the filtering in step (1) ensures this), but all of them were identical, so we don't need them. So we just keep the ids where .N > 1L.
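
As a rough illustration of steps (1) to (3), here is a minimal sketch on a made-up five-row table. The table `toy` and its values are hypothetical and not part of the original post; only the three data.table calls are taken from the code above.

```r
library(data.table)

# Hypothetical toy data: id "a" has two conflicting rows, id "b" has two
# identical rows, and id "c" has a single row.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4))

# (1) drop ids that occur only once ("c")
ans <- toy[toy[, .I[.N > 1L], by = id]$V1]
# (2) collapse identical rows by grouping on every column
ans <- ans[, .N, by = names(ans)]
# (3) keep ids that still have more than one distinct row ("a")
ans <- ans[ans[, .I[.N > 1L], by = id]$V1]
ans[, N := NULL]   # the count column is no longer needed
print(ans)
```

Only the two rows for id "a" should remain, since "b" collapses to a single row in step (2) and "c" is dropped in step (1).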

HTH


Arun

From: Ron Hylton rhylton at verizon.net
Reply: Ron Hylton rhylton at verizon.net
Date: June 14, 2014 at 3:30:55 AM
To: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

The performance is what puzzles me; the results are correct so the warnings don't matter, and not all the variations I've tried have warnings.  On the real dataset (~800,000 rows) datatable takes about 1.5 times longer than dataframe + ddply.  I expected it to be substantially faster.

From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
Sent: Friday, June 13, 2014 8:57 PM
To: Ron Hylton; datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

> However there's another aspect.  While I'm relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

`data.table` is designed with *really large* data sets in mind (> 100 or 200 GB in memory even). Therefore, as a design feature, it trades away "referential transparency" in order to manipulate data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time they go hand-in-hand).

This is perhaps the biggest design choice one needs to be aware of when working with/choosing data.tables. It is possible to modify objects by reference using data.table: all the functions that begin with "set*" modify objects by reference. The only other function that does so is the `:=` operator.
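
To make the by-reference behaviour concrete, here is a minimal sketch. The names `dt`, `alias` and `snapshot` are made up for illustration; `:=`, `setnames()` and `copy()` are regular data.table functions.

```r
library(data.table)

dt <- data.table(x = 1:3)
alias    <- dt         # no copy: both names refer to the same table
snapshot <- copy(dt)   # explicit deep copy, independent of dt

dt[, y := x * 2L]      # := adds a column in place
setnames(dt, "x", "z") # set* functions also modify in place

names(alias)           # reflects both changes made through dt
names(snapshot)        # unchanged, because it was copied explicitly
```

Plain assignment with `<-` does not copy a data.table, so anything that must survive later in-place changes needs an explicit `copy()`.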

HTH

Arun


From: Ron Hylton rhylton at verizon.net
Reply: Ron Hylton rhylton at verizon.net
Date: June 14, 2014 at 2:52:04 AM
To: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

I suspected it was something like this.  As one clarification, there is a setkey(test,id) before any setkey(.SD).  If setkey(test,id) is changed to setkey(test) so all columns are in the original datatable key then the warning goes away.

However there's another aspect.  While I'm relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
Sent: Friday, June 13, 2014 8:23 PM
To: Ron Hylton; datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well.

This is a tricky one. It happens because you're setting a key on .SD, which should normally not be allowed. What happens is, when you set the key the first time, there's no key set (here) and therefore the key is set on all the columns x1, x2 and x3.

Now, when the next group (in the by=.) is passed to your function, it'll have the key already set to x1,x2,x3 (because setkey modifies the object by reference), but .SD has obtained new data corresponding to this group. data.table then sorts this data, knowing that it already has the key set. But if the key is set then the order must already be 1:n, and it isn't, as this data isn't sorted. data.table warns in those scenarios, and that's why you get the warning.

To verify this, you can try:

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  setattr(f, 'sorted', NULL)  # drop the key so the next group's .SD starts unkeyed
  if (nrow(u) == 1) return(NULL)
  u
}

Basically, we set the key of f (which is the same object as .SD, as it's only modified by reference) to NULL every time afterwards, so that .SD for the new group will not have the key set.

The ideal scenario here, IIUC, is that setkey(.SD) or things pointing to .SD should not be possible (locking the binding doesn't seem to affect things done by reference). .SD however should retain the key of the data.table, if a key was set, wherever possible.
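
The key removal used in the function above can be tried in isolation. This is a small sketch on a made-up table (`dt`, `k_before` and `k_after` are hypothetical names); it only shows that the key lives in the "sorted" attribute and that clearing that attribute removes the key without touching the rows.

```r
library(data.table)

dt <- data.table(x1 = c(3, 1, 2), x2 = c("c", "a", "b"))

setkey(dt)                    # no columns given: key on all columns, sorts in place
k_before <- key(dt)           # the key columns
s_before <- attr(dt, "sorted")  # the same information, stored as an attribute

setattr(dt, "sorted", NULL)   # clear the key marker by reference
k_after <- key(dt)            # no key any more; row order is left as it was
```

In ordinary code, setkey(dt, NULL) is the more usual way to remove a key; setattr() is used here only to mirror the function above.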

Arun


From: Ron Hylton rhylton at verizon.net
Reply: Ron Hylton rhylton at verizon.net
Date: June 14, 2014 at 1:55:53 AM
To: datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] data.table is asking for help

The code below generates the warning:

In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

This is my first attempt at using datatable so I probably did something dumb, but maybe that's useful for someone.  The first case is the one that gives the warnings.

I'm also surprised at the timings.  I wrote the original algorithm using dataframe & ddply and I expected datatable to be substantially faster; the opposite is true.

The algorithm does the following:  Certain columns in the table are keys and others are values in the sense that each row with the same set of keys should have the same set of values.  Find all the key sets for which this is not true and return the key sets + conflicting value sets.

Insight into the performance would be appreciated.

Regards,
Ron

library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))

setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))


From aragorn168b at gmail.com  Sat Jun 14 04:45:49 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 04:45:49 +0200
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <etPan.539bb693.2ca88611.38b@Arunkumars-MacBook-Pro.local>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
 <etPan.539bb4a4.54e49eb4.38b@Arunkumars-MacBook-Pro.local>
 <etPan.539bb693.2ca88611.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <etPan.539bb75d.2901d82.38b@Arunkumars-MacBook-Pro.local>

Sorry. But we can simplify it even further:

The first step is just unique(test). So, we can do:

system.time({
ans = unique(test)
ans = ans[ans[, .I[.N > 1L], by=id]$V1]
})
#  0.016   0.000   0.016

Identical?

setkey(ans)
setkey(ut1)
identical(ans, ut1) # [1] TRUE
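
A small sketch of the same two-step idea on made-up data follows. The table `toy` is hypothetical, and passing `by = names(toy)` is an addition that is not in the code above: it makes the intent (deduplicate on every column) explicit, on the assumption that which columns unique() considers by default for a keyed table may depend on the data.table version.

```r
library(data.table)

# Hypothetical toy data keyed by id, as `test` is in the thread
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4), key = "id")

# Deduplicate on every column explicitly, regardless of the key
u <- unique(toy, by = names(toy))

# Keep only ids that still have more than one distinct row
conflicts <- u[u[, .I[.N > 1L], by = id]$V1]
print(conflicts)
```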

Arun


From rhylton at verizon.net  Sat Jun 14 04:58:16 2014
From: rhylton at verizon.net (Ron Hylton)
Date: Fri, 13 Jun 2014 22:58:16 -0400
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <etPan.539bb75d.2901d82.38b@Arunkumars-MacBook-Pro.local>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
 <etPan.539bb4a4.54e49eb4.38b@Arunkumars-MacBook-Pro.local>
 <etPan.539bb693.2ca88611.38b@Arunkumars-MacBook-Pro.local>
 <etPan.539bb75d.2901d82.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <008301cf877c$7dc9def0$795d9cd0$@verizon.net>

Thanks, that's very helpful.

 

From roundsjeremiah at gmail.com  Sat Jun 14 07:23:10 2014
From: roundsjeremiah at gmail.com (jeremiah rounds)
Date: Sat, 14 Jun 2014 01:23:10 -0400
Subject: [datatable-help] Are you aware of this?
Message-ID: <CAOjnRsbfkojmYAANp2jOOrGbDe2cggp87+mThMJb31gxZVAaEw@mail.gmail.com>

As a fan of your work I have always been curious whether you are aware of
this.  I find it causes new users to make mistakes.


> dt = list()
> dt$x = 1:10
> dt$y = letters[10:1]
> dt = as.data.table(as.data.frame(dt))
> dt
     x y
 1:  1 j
 2:  2 i
 3:  3 h
 4:  4 g
 5:  5 f
 6:  6 e
 7:  7 d
 8:  8 c
 9:  9 b
10: 10 a
> x0 = dt$x
> x1 = dt$x
> x0[1] = 11
> setkeyv(dt,"y")
> x0
 [1] 11  2  3  4  5  6  7  8  9 10
> x1
 [1] 10  9  8  7  6  5  4  3  2  1
> x1 == x0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE


x0 and x1 are assigned at exactly the same point, and since R data.frames
will not do this, it lures people into thinking they are identical in value
and distinct objects, as they would be with a data.frame.  My theory is
that they are not actually copied: they are promised.  When x0 has its
index 1 changed it induces a copy distinct from dt$x, but x1 has had no
operation on it so it still refers to dt$x with its promise.  Setting the
key on dt reorders it, and since x1 still hasn't been evaluated it now
matches the order of dt.

I found new users getting unpredictable results because they would try to
use a data.table as a data.frame and induce this with sorts.  If you
thought you had copied something in a particular order from dt by doing the
assignment ahead of the setkeyv, you made a mistake.  You don't really
expect x1, assigned maybe a page of code above, to have its order changed by
a setkeyv.  You do if you think about C pointers and references, but in R
you really don't think that way.  Many R users don't even know what a
pointer is.


Thanks,
Jeremiah

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] locfit_1.5-9.1       edgeR_3.4.2          limma_3.18.13
[4] data.table_1.9.2     GenomicRanges_1.14.4 XVector_0.2.0
[7] IRanges_1.20.7       BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] grid_3.0.1      lattice_0.20-15 plyr_1.8.1      Rcpp_0.11.1
[5] reshape2_1.4    stats4_3.0.1    stringr_0.6.2   tools_3.0.1

From aragorn168b at gmail.com  Sat Jun 14 07:35:16 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Jun 2014 07:35:16 +0200
Subject: [datatable-help] Are you aware of this?
In-Reply-To: <CAOjnRsbfkojmYAANp2jOOrGbDe2cggp87+mThMJb31gxZVAaEw@mail.gmail.com>
References: <CAOjnRsbfkojmYAANp2jOOrGbDe2cggp87+mThMJb31gxZVAaEw@mail.gmail.com>
Message-ID: <etPan.539bdf14.8138641.38b@Arunkumars-MacBook-Pro.local>

Jeremiah,

Thanks. Just a few hours ago, I answered a similar question to a post from Ron (pasted below):

`data.table` is designed with *really large* data sets in mind (> 100 or 200 GB in memory even). Therefore, as a design feature, it trades in "referential transparency" for manipulating data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time they go hand in hand).

This is perhaps the biggest design choice one needs to be aware of when choosing or working with data.tables. It is possible to modify objects by reference using data.table: all the functions that begin with "set*" modify objects by reference. The only other construct that does so, outside the "set*" functions, is the `:=` operator.
There's a pending feature request on adding this point (on explicit copy) to the FAQs, which we've not gotten to yet.

To our knowledge, people do overcome this difference quite quickly.

It's not necessary to know about pointers to understand that the object gets modified in-place. I'm not a Python user at all, but recently came to know that this is also a feature there: https://docs.python.org/2/library/copy.html

But point taken. The fact that an explicit copy is required will be added to the FAQs.
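
To make the explicit-copy point concrete, here is a minimal sketch rather than a definitive recipe. It assumes `copy()` from data.table is applied to the column vector before the key is set; the names `x_ref` and `x_copy` are made up for illustration, and whether `x_ref` follows the reorder may depend on the data.table and R versions in use.

```r
library(data.table)

dt <- data.table(x = 1:10, y = letters[10:1])

x_ref  <- dt$x        # may still share memory with the column inside dt
x_copy <- copy(dt$x)  # explicit copy, independent of dt from here on

setkeyv(dt, "y")      # reorders dt by reference (in place)

# x_copy keeps the order it had when it was taken
stopifnot(identical(x_copy, 1:10))

# x_ref may or may not have been reordered along with dt (in the session
# reported in this thread it was), so nothing is asserted about it here
print(x_ref)
print(x_copy)
```

Taking the copy before any "set*" call is the safe order of operations; a copy taken after `setkeyv` would only capture the already reordered column.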


Arun


From my.r.help at gmail.com  Sun Jun 15 05:01:35 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sun, 15 Jun 2014 11:01:35 +0800
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
Message-ID: <539D0C8F.1080005@gmail.com>

Devs,

Is this a bug? It works in 1.9.2 but not in the 1.9.3 development version:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
  print(DT[order(DT[, i, with = FALSE])])

Error in forder(DT, DT[, i, with = FALSE]) :
  Column '1' is type 'list' which is not supported for ordering currently.


Thanks,

M


On 05/31/2014 12:44 PM, G See wrote:
> Hi Michael,
> 
> I would use get()
> 
> DT <- data.table(a = 1:4, b = 8:5)
> for (i in c("a", "b"))
>   print(DT[order(get(i))])
> 
> For what it's worth, your solution doesn't seem to work in data.table
> 1.9.3 (svn rev. 1278):
> 
>> for (i in c("a", "b"))
> +   print(DT[order(DT[, i, with = FALSE])])
> Error in forder(DT, DT[, i, with = FALSE]) :
>   Column '1' is type 'list' which is not supported for ordering currently.
> 
> 
> HTH,
> Garrett
> 
> On Fri, May 30, 2014 at 11:01 PM, Michael Smith <my.r.help at gmail.com> wrote:
>> All,
>>
>> I'm trying to order the rows according to several columns at a time:
>>
>> DT <- data.table(a = 1:4, b = 8:5)
>> for (i in c("a", "b"))
>>   print(DT[order(i), with = FALSE])
>>
>> It doesn't work, since `with` seems to be about the `j` argument, but
>> not the `i` argument, according to `?data.table`.
>>
>> I found the following workaround, but wonder whether there is a more
>> elegant way to do it:
>>
>> for (i in c("a", "b"))
>>   print(DT[order(DT[, i, with = FALSE])])
>>
>> Thanks,
>> M
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From aragorn168b at gmail.com  Sun Jun 15 10:11:45 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 10:11:45 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <539D0C8F.1080005@gmail.com>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
 <539D0C8F.1080005@gmail.com>
Message-ID: <etPan.539d5541.725a06fb.38b@Arunkumars-MacBook-Pro.local>

Michael,

Thanks. Replacing order with base:::order seems to give the right result. So, I'd say this is a case that seems to have escaped the current tests. So, yes, it's a bug. Could you please file it as one here: https://github.com/Rdatatable/data.table/issues
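
The two workarounds mentioned in this thread can be sketched side by side. This is an illustration only: `get()` is G See's suggestion, while calling `base::order` on the plain column vector `DT[[col]]` (instead of on a one-column data.table) is an assumption made here to keep the example simple. The loop variable is named `col` purely to avoid confusion with the `i` argument.

```r
library(data.table)

DT <- data.table(a = 1:4, b = 8:5)

for (col in c("a", "b")) {
  # G See's suggestion: look the column up by name inside DT
  by_get <- DT[order(get(col))]

  # assumption: base order applied to the column vector itself
  by_base <- DT[base::order(DT[[col]])]

  # both approaches should yield the same row order
  stopifnot(identical(by_get$a, by_base$a),
            identical(by_get$b, by_base$b))
  print(by_get)
}
```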


Arun


From my.r.help at gmail.com  Sun Jun 15 11:15:50 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sun, 15 Jun 2014 17:15:50 +0800
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <etPan.539d5541.725a06fb.38b@Arunkumars-MacBook-Pro.local>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
 <539D0C8F.1080005@gmail.com>
 <etPan.539d5541.725a06fb.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <539D6446.3060004@gmail.com>

Hi Arun,

Filed here:
https://github.com/Rdatatable/data.table/issues/696

Thanks,
M


From aragorn168b at gmail.com  Sun Jun 15 11:16:42 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 11:16:42 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <539D6446.3060004@gmail.com>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
 <539D0C8F.1080005@gmail.com>
 <etPan.539d5541.725a06fb.38b@Arunkumars-MacBook-Pro.local>
 <539D6446.3060004@gmail.com>
Message-ID: <etPan.539d647a.749abb43.38b@Arunkumars-MacBook-Pro.local>

Already got the notification. Thanks Michael.

Arun


From gsee000 at gmail.com  Sun Jun 15 17:44:58 2014
From: gsee000 at gmail.com (G See)
Date: Sun, 15 Jun 2014 10:44:58 -0500
Subject: [datatable-help] subsetting by second key
Message-ID: <CA+xi=qYHkHMVUFMsLXvGtHELz5xG46jjcppPyytUt1jnEJ_7-g@mail.gmail.com>

Hi,

I want to subset a data.table using only its second key, which is
demonstrated here
http://stackoverflow.com/questions/15597685/subsetting-data-table-by-2nd-column-only-of-a-2-column-key-using-binary-search/15597713#15597713

However, I need to subset with more than one value in the secondary key

Is this warning expected? What exactly is it telling me?

    library(data.table)
    DT <- data.table(iris, key="Species,Petal.Width")
    DT[J(unique(Species), c(1.5, 2.0)), nomatch=0L]
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    #1:          6.0         2.2          5.0         1.5 virginica
    #2:          6.3         2.8          5.1         1.5 virginica
    #Warning message:
    #In as.data.table.list(i) :
    #  Item 2 is of size 2 but maximum size is 3 (recycled leaving a remainder of 1 items)


It looks like I can get what I want with either of these; can you
confirm that both of these will always return the same result?

    DT[Petal.Width %in% c(1.5, 2.0)]  # vector scan
    DT[CJ(unique(Species), c(1.5, 2.0)), nomatch=0L]


Thanks,
Garrett

From aragorn168b at gmail.com  Sun Jun 15 17:56:05 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 17:56:05 +0200
Subject: [datatable-help] subsetting by second key
In-Reply-To: <CA+xi=qYHkHMVUFMsLXvGtHELz5xG46jjcppPyytUt1jnEJ_7-g@mail.gmail.com>
References: <CA+xi=qYHkHMVUFMsLXvGtHELz5xG46jjcppPyytUt1jnEJ_7-g@mail.gmail.com>
Message-ID: <etPan.539dc217.1cf10fd8.38b@Arunkumars-MacBook-Pro.local>

unique(Species) is of length 3, whereas the 2nd entry c(1.5, 2) is of length 2.

J in J(.) is replaced with list(.) internally (using lazy evaluation), following which it's converted to a data.table using as.data.table(list(.)).

And here your list is:

list(c("setosa", "versicolor", "virginica") , c(1.5, 2.0)) which results in the warning because the shorter item has to be recycled to convert the list to a data.table.

In the example you've linked, J(.) and CJ(.) will return the same result (because there's just one value in the 2nd column). So, the results don't change. But the general expression is to use CJ(.) along with nomatch=0L, as you've done.

Those two expressions are equivalent, yes.
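
The difference between recycling and a cross join can be checked with a short sketch. It is written under a few assumptions that are not in the original post: the species are passed as a character vector, the `CJ()` columns are named explicitly, and `lookup`, `by_join` and `by_scan` are illustrative names.

```r
library(data.table)

DT <- data.table(iris, key = c("Species", "Petal.Width"))

# assumption: character species values, joined against the factor key column
sp     <- as.character(unique(DT$Species))
widths <- c(1.5, 2.0)

# J()/list() would pair the items element by element and recycle the shorter one;
# CJ() builds every combination instead: 3 species x 2 widths = 6 rows
lookup <- CJ(Species = sp, Petal.Width = widths)
stopifnot(nrow(lookup) == 6L)

by_join <- DT[lookup, nomatch = 0L]     # binary search on both key columns
by_scan <- DT[Petal.Width %in% widths]  # vector scan on the second key column

# both routes should select the same number of rows
stopifnot(nrow(by_join) == nrow(by_scan))
print(by_join)
```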


Arun


From gsee000 at gmail.com  Sun Jun 15 18:03:13 2014
From: gsee000 at gmail.com (G See)
Date: Sun, 15 Jun 2014 11:03:13 -0500
Subject: [datatable-help] subsetting by second key
In-Reply-To: <etPan.539dc217.1cf10fd8.38b@Arunkumars-MacBook-Pro.local>
References: <CA+xi=qYHkHMVUFMsLXvGtHELz5xG46jjcppPyytUt1jnEJ_7-g@mail.gmail.com>
 <etPan.539dc217.1cf10fd8.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <CA+xi=qZmnAnLjLRJJ_XftP2KzWp1Fsw1ar6Na7qNWtkbJ-x-1w@mail.gmail.com>

Thank you Arun.  Should that answer be updated to use CJ(.), then?  Is
there an advantage to using J(.) over CJ(.) if you know that you're
only looking for one value in the second column?


From aragorn168b at gmail.com  Sun Jun 15 18:04:57 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 18:04:57 +0200
Subject: [datatable-help] subsetting by second key
In-Reply-To: <CA+xi=qZmnAnLjLRJJ_XftP2KzWp1Fsw1ar6Na7qNWtkbJ-x-1w@mail.gmail.com>
References: <CA+xi=qYHkHMVUFMsLXvGtHELz5xG46jjcppPyytUt1jnEJ_7-g@mail.gmail.com>
 <etPan.539dc217.1cf10fd8.38b@Arunkumars-MacBook-Pro.local>
 <CA+xi=qZmnAnLjLRJJ_XftP2KzWp1Fsw1ar6Na7qNWtkbJ-x-1w@mail.gmail.com>
Message-ID: <etPan.539dc429.235ba861.38b@Arunkumars-MacBook-Pro.local>

Sure, you can update it. No, there's no advantage. I just didn't think of CJ at the time (probably because I tried it with J and it worked, because it's just 1 value for the 2nd key col).

Arun


From aragorn168b at gmail.com  Sun Jun 15 18:06:34 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 15 Jun 2014 18:06:34 +0200
Subject: [datatable-help] subsetting by second key
In-Reply-To: <etPan.539dc429.235ba861.38b@Arunkumars-MacBook-Pro.local>
References: <CA+xi=qYHkHMVUFMsLXvGtHELz5xG46jjcppPyytUt1jnEJ_7-g@mail.gmail.com>
 <etPan.539dc217.1cf10fd8.38b@Arunkumars-MacBook-Pro.local>
 <CA+xi=qZmnAnLjLRJJ_XftP2KzWp1Fsw1ar6Na7qNWtkbJ-x-1w@mail.gmail.com>
 <etPan.539dc429.235ba861.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <etPan.539dc48a.354fe9f9.38b@Arunkumars-MacBook-Pro.local>

Note that `CJ` by default sorts the columns and sets the key on all of them, which means the result of the join would be sorted as well. If that's not desirable, you should use `CJ` with `sorted=FALSE`.
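
A small sketch of the sorting behaviour, with made-up inputs and column names (`x`, `y`) that are not from the thread:

```r
library(data.table)

# default: sorted = TRUE, so the combinations are sorted and keyed
sorted_cj <- CJ(x = c("b", "a"), y = 2:1)

# sorted = FALSE keeps the input order and sets no key
unsorted_cj <- CJ(x = c("b", "a"), y = 2:1, sorted = FALSE)

print(key(sorted_cj))    # key on all columns
print(key(unsorted_cj))  # NULL, i.e. no key
print(sorted_cj)
print(unsorted_cj)
```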

Arun

From:?Arunkumar Srinivasan aragorn168b at gmail.com
Reply:?Arunkumar Srinivasan aragorn168b at gmail.com
Date:?June 15, 2014 at 6:04:59 PM
To:?G See gsee000 at gmail.com
Cc:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject:? Re: [datatable-help] subsetting by second key  

Sure, you can update it. No, there's no advantage. I just dint think of CJ at the time (probably because I tried it with J and it worked, because it's just 1 value for the 2nd key col).

Arun

From:?G See gsee000 at gmail.com
Reply:?G See gsee000 at gmail.com
Date:?June 15, 2014 at 6:03:13 PM
To:?Arunkumar Srinivasan aragorn168b at gmail.com
Cc:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject:? Re: [datatable-help] subsetting by second key

Thank you Arun. Should that answer be updated to use CJ(.), then? Is
there an advantage to using J(.) over CJ(.) if you know that you're
only looking for one value in the second column?

On Sun, Jun 15, 2014 at 10:56 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> unique(Species) is of length 3, where as the 2nd entry c(1.5, 2) is of
> length 2.
>
> J in J(.) is replaced with list(.) internally (using lazy evaluation),
> following which it?s converted to a data.table using as.data.table(list(.)).
>
> And here your list is:
>
> list(c("setosa", "versicolor", "virginica") , c(1.5, 2.0)) which results in
> the warning because it has to recycle to convert it to a data.table.
>
> In the example you?ve linked, J(.) and CJ(.) will return the same result
> (because there?s just one value in 2nd column). So, the results don?t
> change. But the general expression is to use CJ(.) along with nomatch=0L, as
> you?ve done.
>
> Those two expressions are equivalent, yes.
>
>
> Arun
>
> From: G See gsee000 at gmail.com
> Reply: G See gsee000 at gmail.com
> Date: June 15, 2014 at 5:45:11 PM
> To: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject: [datatable-help] subsetting by second key
>
> Hi,
>
> I want to subset a data.table using only its second key, which is
> demonstrated here
> http://stackoverflow.com/questions/15597685/subsetting-data-table-by-2nd-column-only-of-a-2-column-key-using-binary-search/15597713#15597713
>
> However, I need to subset with more than one value in the secondary key
>
> Is this warning expected? What exactly is it telling me?
>
> library(data.table)
> DT <- data.table(iris, key="Species,Petal.Width")
> DT[J(unique(Species), c(1.5, 2.0)), nomatch=0L]
> # Sepal.Length Sepal.Width Petal.Length Petal.Width Species
> #1: 6.0 2.2 5.0 1.5 virginica
> #2: 6.3 2.8 5.1 1.5 virginica
> #Warning message:
> #In as.data.table.list(i) :
> # Item 2 is of size 2 but maximum size is 3 (recycled leaving a
> remainder of 1 items)
>
>
> It looks like I can get what I want with either of these; can you
> confirm that both of these will always return the same result?
>
> DT[Petal.Width %in% c(1.5, 2.0)] # vector scan
> DT[CJ(unique(Species), c(1.5, 2.0)), nomatch=0L]
>
>
> Thanks,
> Garrett
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From mdowle at mdowle.plus.com  Tue Jun 17 19:03:09 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Tue, 17 Jun 2014 18:03:09 +0100
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <008301cf877c$7dc9def0$795d9cd0$@verizon.net>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
 <etPan.539bb4a4.54e49eb4.38b@Arunkumars-MacBook-Pro.local>
 <etPan.539bb693.2ca88611.38b@Arunkumars-MacBook-Pro.local>
 <etPan.539bb75d.2901d82.38b@Arunkumars-MacBook-Pro.local>
 <008301cf877c$7dc9def0$795d9cd0$@verizon.net>
Message-ID: <53A074CD.3060805@mdowle.plus.com>


Hi Ron,

Thanks for highlighting this.  Two changes now in v1.9.3 on GitHub:

  *

    |setkey| on |.SD| is now an error, rather than warnings for each group
    about rebuilding the key. The new error is similar to when
    attempting to use |:=| in a |.SD| subquery: |".SD is locked. Using set*()
    functions on .SD is reserved for possible future use; a tortuously
    flexible way to modify the original data by group."| Thanks to Ron
    Hylton for highlighting the issue on datatable-help here
    <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.

  *

    Looping calls to |unique(DT)| such as in |DT[,unique(.SD),by=group]| is
    now faster by avoiding internal overhead of calling |[.data.table|.
    Thanks again to Ron Hylton for highlighting in the same thread
    <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
    His example is reduced from 28 sec to 9 sec, with identical results.


I now get the following (on my slow netbook) with no changes to your code.

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))   # were warnings, now error
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))   # was 28s, now 9s
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))     # 13s

This just fixes the surprises, basically.   Clearly Arun uses data.table 
in a better way which is orders of magnitude faster.

Matt
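
A minimal sketch of one way to keep a keyed per-group helper working once setkey on .SD is an error, assuming a recent data.table release. Keying a copy of .SD is an assumption made here, not something proposed in this thread, and the toy data and the name conflicts are illustrative only. It costs one copy per group, so the by-reference-free approaches quoted below should remain much faster.

```r
library(data.table)

# Toy data: id "a" has two different x1 values, ids "b" and "c" do not.
test <- data.table(id = c("a", "a", "b", "b", "c", "c"),
                   x1 = c(1, 2, 3, 3, 4, 4))

conflicts <- function(f) {
  # copy() gives this call its own table, so setkey() sorts the copy
  # and never touches .SD itself; the cost is one extra copy per group.
  u <- unique(setkey(copy(f)))
  if (nrow(u) == 1L) return(NULL)  # all rows agree: nothing to report
  u
}

# Only groups with conflicting rows contribute to the result.
res <- test[, conflicts(.SD), by = id]
```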


On 14/06/14 03:58, Ron Hylton wrote:
>
> Thanks, that's very helpful.
>
> *From:*Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 10:46 PM
> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
> *Subject:* Re: [datatable-help] data.table is asking for help
>
> Sorry. But we can simplify it even further:
>
> The first step is just |unique(test)|. So, we can do:
>
> |system.time({|
> |ans = unique(test)|
> |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
> |})|
> |#  0.016   0.000   0.016|
>
> Identical?
>
> |setkey(ans)|
> |setkey(ut1)|
> |identical(ans, ut1) # [1] TRUE|
>
> Arun
>
>
> From: Arunkumar Srinivasan aragorn168b at gmail.com 
> <mailto:aragorn168b at gmail.com>
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com 
> <mailto:aragorn168b at gmail.com>
> Date: June 14, 2014 at 4:42:31 AM
> To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>, 
> datatable-help at lists.r-forge.r-project.org 
> <mailto:datatable-help at lists.r-forge.r-project.org> 
> datatable-help at lists.r-forge.r-project.org 
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject:  Re: [datatable-help] data.table is asking for help
>
>
>
>     A slightly simpler version of the 2nd solution is:
>
>     |system.time({|
>
>     |ans = test[, .N, by=names(test)]|
>
>     |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>
>     |})|
>
>     |#  0.019   0.000   0.019|
>
>       
>
>     The answers are identical, you can check this by doing:
>
>     |ans[, N := NULL]|
>
>     |setkey(ans)|
>
>     |setkey(ut1)|
>
>     |identical(ans, ut1) # [1] TRUE|
>
>       
>
>     Arun
>
>
>     From: Arunkumar Srinivasan aragorn168b at gmail.com
>     <mailto:aragorn168b at gmail.com>
>     Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>     <mailto:aragorn168b at gmail.com>
>     Date: June 14, 2014 at 4:34:15 AM
>     To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
>     datatable-help at lists.r-forge.r-project.org
>     <mailto:datatable-help at lists.r-forge.r-project.org>
>     datatable-help at lists.r-forge.r-project.org
>     <mailto:datatable-help at lists.r-forge.r-project.org>
>     Subject:  Re: [datatable-help] data.table is asking for help
>
>
>
>         The j-expression is evaluated from within C for each group
>         (unless they're optimised with GForce - a new initiative in
>         data.table). And |eval(.SD)| or |eval(anything(.SD))| is costly.
>
>         You can get around it by listing the columns by yourself and
>         using |.I| instead, as follows:
>
>         |test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]|
>
>         |#  0.140   0.001   0.142|
>
>           
>
>           
>
>         Takes about 0.14 seconds.
>
>         ------------------------------------------------------------------------
>
>         An even faster way is:
>
>         |system.time({|
>
>         |ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)|
>
>         |ans = ans[, .N, by=names(ans)]                  # (2)|
>
>         |ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)|
>
>         |})|
>
>         |  |
>
>         |#  0.026   0.000   0.027|
>
>           
>
>           
>
>         The idea for the second case is:
>
>         (1) remove all entries where there's just 1 row corresponding
>         to that |id|.
>         (2) Aggregate this result by all the columns now and get the
>         number of rows in the column |N| (we won't have to use this
>         column though).
>         (3) Now, if we aggregate by |id| and any id has just 1 row,
>         it means that |id| originally had more than 1 row (the
>         filtering in step (1) ensures this), but all of them are the
>         same and we don't need them. So we just filter for those
>         where .N > 1L.
>
>         HTH
>
>         Arun
>
>
>         From: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>         Reply: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>         Date: June 14, 2014 at 3:30:55 AM
>         To: datatable-help at lists.r-forge.r-project.org
>         <mailto:datatable-help at lists.r-forge.r-project.org>
>         datatable-help at lists.r-forge.r-project.org
>         <mailto:datatable-help at lists.r-forge.r-project.org>
>         Subject:  Re: [datatable-help] data.table is asking for help
>
>
>
>             The performance is what puzzles me; the results are
>             correct so the warnings don't matter, and not all the
>             variations I've tried have warnings.  On the real dataset
>             (~800,000 rows) datatable takes about 1.5 times longer
>             than dataframe + ddply.  I expected it to be substantially
>             faster.
>
>             *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>             *Sent:* Friday, June 13, 2014 8:57 PM
>             *To:* Ron Hylton;
>             datatable-help at lists.r-forge.r-project.org
>             <mailto:datatable-help at lists.r-forge.r-project.org>
>             *Subject:* Re: [datatable-help] data.table is asking for help
>
>                 However there's another aspect.  While I'm relatively
>                 new to R my understanding is that a function argument
>                 should be modifiable within the function body without
>                 affecting the caller, which perhaps conflicts with the
>                 behavior of .SD.
>
>             `data.table` is designed with *really large* data sets in
>             mind (> 100 or 200 GB in memory even). Therefore, as a
>             design feature, it trades away "referential transparency"
>             in order to manipulate data objects *as efficiently as
>             possible* in terms of both *speed* and *memory usage*
>             (most of the time they go hand-in-hand).
>
>             This is perhaps the biggest design choice one needs to be
>             aware of when choosing and working with data.tables. It is
>             possible to modify objects by reference using data.table:
>             all the functions that begin with "set*" modify objects by
>             reference. The only other one, not named "set*", is the
>             `:=` operator.
>
>             HTH
>
>             Arun
>
>
>             From: Ron Hylton rhylton at verizon.net
>             <mailto:rhylton at verizon.net>
>             Reply: Ron Hylton rhylton at verizon.net
>             <mailto:rhylton at verizon.net>
>             Date: June 14, 2014 at 2:52:04 AM
>             To: datatable-help at lists.r-forge.r-project.org
>             <mailto:datatable-help at lists.r-forge.r-project.org>
>             datatable-help at lists.r-forge.r-project.org
>             <mailto:datatable-help at lists.r-forge.r-project.org>
>             Subject:  Re: [datatable-help] data.table is asking for help
>
>                 I suspected it was something like this.  As one
>                 clarification, there is a setkey(test,id) before any
>                 setkey(.SD).   If setkey(test,id) is changed to
>                 setkey(test) so all columns are in the original
>                 datatable key then the warning goes away.
>
>                 However there's another aspect.  While I'm relatively
>                 new to R my understanding is that a function argument
>                 should be modifiable within the function body without
>                 affecting the caller, which perhaps conflicts with the
>                 behavior of .SD.
>
>                 *From:* Arunkumar Srinivasan
>                 [mailto:aragorn168b at gmail.com]
>                 *Sent:* Friday, June 13, 2014 8:23 PM
>                 *To:* Ron Hylton;
>                 datatable-help at lists.r-forge.r-project.org
>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>                 *Subject:* Re: [datatable-help] data.table is asking
>                 for help
>
>                 Nicely reproducible post. Reproducible in v1.9.3
>                 (latest commit) as well.
>
>                 This is a tricky one. It happens because you're
>                 setting key on |.SD| which should normally not be
>                 allowed. What happens is, when you set key the first
>                 time, there's no key set (here) and therefore key is
>                 set on all the columns |x1|, |x2| and |x3|.
>
>                 Now, when the next group (in the |by=.|) is passed to
>                 your function, it'll have the |key| already set to
>                 |x1,x2,x3| (because |setkey| modifies the object by
>                 reference), but |.SD| has obtained *new* data
>                 corresponding to /this/ group. |data.table| then sorts
>                 this data, knowing that it already has the key set. But
>                 if the key is set, the order should already be 1:n, and
>                 it isn't, as this data isn't sorted. |data.table|
>                 warns in those scenarios, and that's why you get the
>                 warning.
>
>                 To verify this, you can try:
>
>                 |conflictsTable1 <- function(f, address) {|
>
>                 |   u <- unique(setkey(f))|
>
>                 |   setattr(f, 'sorted', NULL)|
>
>                 |   if (nrow(u) == 1) return(NULL)|
>
>                 |   u|
>
>                 |}|
>
>                 Basically, we set the key of |f| (which is equal to
>                 |.SD| as it's only modified by reference) to |NULL|
>                 every time afterwards, so that |.SD| for the new group
>                 will not have the key set.
>
>                 The ideal scenario here, IIUC, is that |setkey(.SD)|
>                 or things pointing to |.SD| should not be possible
>                 (locking binding doesn't seem to affect things done by
>                 reference..). |.SD| however should retain the key of
>                 the data.table, if a key was set, wherever possible.
>
>                 Arun
>
>
>                 From: Ron Hylton rhylton at verizon.net
>                 <mailto:rhylton at verizon.net>
>                 Reply: Ron Hylton rhylton at verizon.net
>                 <mailto:rhylton at verizon.net>
>                 Date: June 14, 2014 at 1:55:53 AM
>                 To: datatable-help at lists.r-forge.r-project.org
>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>                 datatable-help at lists.r-forge.r-project.org
>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>                 Subject:  [datatable-help] data.table is asking for help
>
>                     The code below generates the warning:
>
>                     In setkeyv(x, cols, verbose = verbose) :
>
>                       Already keyed by this key but had invalid row
>                     order, key rebuilt. If you didn't go under the
>                     hood please let datatable-help know so the root
>                     cause can be fixed.
>
>                     This is my first attempt at using datatable so I
>                     probably did something dumb, but maybe that's
>                     useful for someone.  The first case is the one
>                     that gives the warnings.
>
>                     I'm also surprised at the timings.  I wrote the
>                     original algorithm using dataframe & ddply and I
>                     expected datatable to be substantially faster; the
>                     opposite is true.
>
>                     The algorithm does the following:  Certain columns
>                     in the table are keys and others are values in the
>                     sense that each row with the same set of keys
>                     should have the same set of values.  Find all the
>                     key sets for which this is not true and return the
>                     keys sets + conflicting value sets.
>
>                     Insight into the performance would be appreciated.
>
>                     Regards,
>
>                     Ron
>
>                     library(data.table)
>
>                     library(plyr)
>
>                     conflictsTable1 <- function(f) {
>
>                     u <- unique(setkey(f))
>
>                     if (nrow(u) == 1) return(NULL)
>
>                     u
>
>                     }
>
>                     conflictsTable2 <- function(f) {
>
>                     u <- unique(f)
>
>                     if (nrow(u) == 1) return(NULL)
>
>                     u
>
>                     }
>
>                     conflictsFrame <- function(f) {
>
>                     u <- unique(f)
>
>                     if (nrow(u) == 1) return(NULL)
>
>                     u
>
>                     }
>
>                     N <- 10000
>
>                     test <-
>                     data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>                     x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>
>                     setkey(test,id)
>
>                     print(system.time(ut1 <- test[,
>                     conflictsTable1(.SD), by=id]))
>
>                     print(system.time(ut2 <- test[,
>                     conflictsTable2(.SD), by=id]))
>
>                     print(system.time(uf <- ddply(test, .(id),
>                     conflictsFrame)))
>
>                     _______________________________________________
>                     datatable-help mailing list
>                     datatable-help at lists.r-forge.r-project.org
>                     <mailto:datatable-help at lists.r-forge.r-project.org>
>                     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From my.r.help at gmail.com  Wed Jun 18 02:34:14 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Wed, 18 Jun 2014 08:34:14 +0800
Subject: [datatable-help] data.table is asking for help
In-Reply-To: <53A074CD.3060805@mdowle.plus.com>
References: <005b01cf8762$ead3af40$c07b0dc0$@verizon.net>
 <etPan.539b95c6.79838cb2.38b@Arunkumars-MacBook-Pro.local>
 <006701cf876a$c4f38310$4eda8930$@verizon.net>
 <etPan.539b9dde.b03e0c6.38b@Arunkumars-MacBook-Pro.local>
 <007301cf8770$2bb569b0$83203d10$@verizon.net>
 <etPan.539bb4a4.54e49eb4.38b@Arunkumars-MacBook-Pro.local>
 <etPan.539bb693.2ca88611.38b@Arunkumars-MacBook-Pro.local>
 <etPan.539bb75d.2901d82.38b@Arunkumars-MacBook-Pro.local>
 <008301cf877c$7dc9def0$795d9cd0$@verizon.net>
 <53A074CD.3060805@mdowle.plus.com>
Message-ID: <53A0DE86.1080001@gmail.com>

Hi Matt,

There was recently another discussion on using setkey on .SD here:

  http://r.789695.n4.nabble.com/setkey-on-SD-td4690283.html

So the following code won't work any more in the current 1.9.3 dev
version. I think the idea of using setkey in a "chain" of data.tables
was nice, since it allows setting the key temporarily.

The basic idea is taken from the comment here:


http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917


A <-
  data.table(
    x = c(1, 2, 3, 4, 5),
    y = letters[1:5])
B <-
  data.table(
    x = c(1, 2, 3, 1, 4),
    f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
    z = 101:105)
B[, setkey(.SD, x)][
  , .SD[A, roll = TRUE, rollends = FALSE], by = f][
    , setkey(.SD, x)]


Thanks,

M
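
A minimal sketch of the same rolling join without any setkey on .SD, assuming a data.table release that supports the on= argument of [.data.table (it was added in a release later than the 1.9.3 discussed here, so it is not an option taken from this thread). A and B are the tables from the example above; res is an illustrative name.

```r
library(data.table)

A <- data.table(x = c(1, 2, 3, 4, 5), y = letters[1:5])
B <- data.table(x = c(1, 2, 3, 1, 4),
                f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
                z = 101:105)

# on = "x" names the join column for this call only, so no key is set
# on B or on .SD; the rolling join is done separately within each f.
res <- B[, .SD[A, on = "x", roll = TRUE, rollends = FALSE], by = f]
```

Because the join column is given per call, nothing has to be keyed "temporarily" and B itself is left unkeyed.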


On 06/18/2014 01:03 AM, Matt Dowle wrote:
> 
> Hi Ron,
> 
> Thanks for highlighting this.  Two changes now in v1.9.3 on GitHub:
> 
>   *
> 
>     |setkey| on |.SD| is now an error, rather than warnings for each
>     group about rebuilding the key. The new error is similar to when
>     attempting to use |:=| in a |.SD| subquery: |".SD is locked. Using
>     set*() functions on .SD is reserved for possible future use; a
>     tortuously flexible way to modify the original data by
>     group."| Thanks to Ron Hylton for highlighting the issue on
>     datatable-help here
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
> 
>   *
> 
>     Looping calls to |unique(DT)| such as
>     in |DT[,unique(.SD),by=group]| is now faster by avoiding internal
>     overhead of calling |[.data.table|. Thanks again to Ron Hylton for
>     highlighting in the same thread
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>     His example is reduced from 28 sec to 9 sec, with identical results.
> 
> 
> I now get the following (on my slow netbook) with no changes to your code.
> 
> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))   #  were
> warnings,    now error
> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))   #  was
> 28s, now 9s
> print(system.time(uf <- ddply(test, .(id), conflictsFrame)))   # 13s
> 
> This just fixes the surprises, basically.   Clearly Arun uses data.table
> in a better way which is orders of magnitude faster.
> 
> Matt
> 
> 
> On 14/06/14 03:58, Ron Hylton wrote:
>>
>> Thanks, that's very helpful.
>>
>>  
>>
>> *From:*Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>> *Sent:* Friday, June 13, 2014 10:46 PM
>> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>  
>>
>> Sorry. But we can simplify it even further:
>>
>> The first step is just |unique(test)|. So, we can do:
>>
>> |system.time({|
>> |ans = unique(test)|
>> |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>> |})|
>> |#  0.016   0.000   0.016  |
>>
>> Identical?
>>
>> |setkey(ans)|
>> |setkey(ut1)|
>> |identical(ans, ut1) # [1] TRUE|
>>
>>  
>>
>> Arun
>>
>>
>> From: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Date: June 14, 2014 at 4:42:31 AM
>> To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>     A slightly simpler version of the 2nd solution is:
>>
>>     |system.time({|
>>
>>     |ans = test[, .N, by=names(test)]|
>>
>>     |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>>
>>     |})|
>>
>>     |#  0.019   0.000   0.019   |
>>
>>      
>>
>>     The answers are identical, you can check this by doing:
>>
>>     |ans[, N := NULL]|
>>
>>     |setkey(ans)|
>>
>>     |setkey(ut1)|
>>
>>     |identical(ans, ut1) # [1] TRUE|
>>
>>      
>>
>>      
>>
>>     Arun
>>
>>
>>     From: Arunkumar Srinivasan aragorn168b at gmail.com
>>     <mailto:aragorn168b at gmail.com>
>>     Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>>     <mailto:aragorn168b at gmail.com>
>>     Date: June 14, 2014 at 4:34:15 AM
>>     To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
>>     datatable-help at lists.r-forge.r-project.org
>>     <mailto:datatable-help at lists.r-forge.r-project.org>
>>     datatable-help at lists.r-forge.r-project.org
>>     <mailto:datatable-help at lists.r-forge.r-project.org>
>>     Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>         The j-expression is evaluated from within C for each group
>>         (unless they're optimised with GForce - a new initiative in
>>         data.table). And |eval(.SD)| or |eval(anything(.SD))| is costly.
>>
>>         You can get around it by listing the columns by yourself and
>>         using |.I| instead, as follows:
>>
>>         |test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]|
>>
>>         |#  0.140   0.001   0.142    |
>>
>>          
>>
>>          
>>
>>         Takes about 0.14 seconds.
>>
>>         ------------------------------------------------------------------------
>>
>>         An even faster way is:
>>
>>         |system.time({|
>>
>>         |ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)    |
>>
>>         |ans = ans[, .N, by=names(ans)]                  # (2)    |
>>
>>         |ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)|
>>
>>         |})|
>>
>>         | |
>>
>>         |#  0.026   0.000   0.027    |
>>
>>          
>>
>>          
>>
>>         The idea for the second case is:
>>
>>         (1) remove all entries where there's just 1 row corresponding
>>         to that |id|.
>>         (2) Aggregate this result by all the columns now and get the
>>         number of rows in the column |N| (we won't have to use this
>>         column though).
>>         (3) Now, if we aggregate by |id| and if any id has just 1 row,
>>         then it'd mean that that |id| has had more than 1 rows (step
>>         (1) filtering ensures this), but all of them are same and we
>>         don't need them. So we just filter for those where .N > 1L.
>>
>>         HTH
>>
>>          
>>
>>         Arun
>>
>>
>>         From: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>>         Reply: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>>         Date: June 14, 2014 at 3:30:55 AM
>>         To: datatable-help at lists.r-forge.r-project.org
>>         <mailto:datatable-help at lists.r-forge.r-project.org>
>>         datatable-help at lists.r-forge.r-project.org
>>         <mailto:datatable-help at lists.r-forge.r-project.org>
>>         Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>             The performance is what puzzles me; the results are
>>             correct so the warnings don't matter, and not all the
>>             variations I've tried have warnings.  On the real dataset
>>             (~800,000 rows) datatable takes about 1.5 times longer
>>             than dataframe + ddply.  I expected it to be substantially
>>             faster.
>>
>>              
>>
>>             *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>>             *Sent:* Friday, June 13, 2014 8:57 PM
>>             *To:* Ron Hylton;
>>             datatable-help at lists.r-forge.r-project.org
>>             <mailto:datatable-help at lists.r-forge.r-project.org>
>>             *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>              
>>
>>                 However there's another aspect.  While I'm relatively
>>                 new to R my understanding is that a function argument
>>                 should be modifiable within the function body without
>>                 affecting the caller, which perhaps conflicts with the
>>                 behavior of .SD.
>>
>>             `data.table` is designed for working with *really large*
>>             data sets in mind (> 100 or 200 GB in memory even). And
>>             therefore, as a design feature, it trades in "referential
>>             transparency" for manipulating data objects *as efficient
>>             as possible* in terms of both *speed* and *memory usage*
>>             (most of the times they go hand-in-hand).
>>
>>             This is perhaps the biggest design choice one needs to be
>>             aware of when working/choosing data.tables. It is possible
>>             to modify objects by reference using data.table - All the
>>             functions that begin with "set*" modify objects by
>>             reference. The only other non "set*" function is `:=`
>>             operator.
>>
>>              
>>
>>             HTH
>>
>>             Arun
>>
>>
>>             From: Ron Hylton rhylton at verizon.net
>>             <mailto:rhylton at verizon.net>
>>             Reply: Ron Hylton rhylton at verizon.net
>>             <mailto:rhylton at verizon.net>
>>             Date: June 14, 2014 at 2:52:04 AM
>>             To: datatable-help at lists.r-forge.r-project.org
>>             <mailto:datatable-help at lists.r-forge.r-project.org>
>>             datatable-help at lists.r-forge.r-project.org
>>             <mailto:datatable-help at lists.r-forge.r-project.org>
>>             Subject:  Re: [datatable-help] data.table is asking for help
>>
>>              
>>
>>                 I suspected it was something like this.  As one
>>                 clarification, there is a setkey(test,id) before any
>>                 setkey(.SD).   If setkey(test,id) is changed to
>>                 setkey(test) so all columns are in the original
>>                 datatable key then the warning goes away.
>>
>>                  
>>
>>                 However there's another aspect.  While I'm relatively
>>                 new to R my understanding is that a function argument
>>                 should be modifiable within the function body without
>>                 affecting the caller, which perhaps conflicts with the
>>                 behavior of .SD.
>>
>>                  
>>
>>                 *From:* Arunkumar Srinivasan
>>                 [mailto:aragorn168b at gmail.com]
>>                 *Sent:* Friday, June 13, 2014 8:23 PM
>>                 *To:* Ron Hylton;
>>                 datatable-help at lists.r-forge.r-project.org
>>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>>                 *Subject:* Re: [datatable-help] data.table is asking
>>                 for help
>>
>>                  
>>
>>                 Nicely reproducible post. Reproducible in v1.9.3
>>                 (latest commit) as well.
>>
>>                 This is a tricky one. It happens because you're
>>                 setting key on |.SD| which should normally not be
>>                 allowed. What happens is, when you set key the first
>>                 time, there's no key set (here) and therefore key is
>>                 set on all the columns |x1|, |x2| and |x3|.
>>
>>                 Now, the next group (in the |by=.|) is passed to your
>>                 function, it'll have the |key| already set to
>>                 |x1,x2,x3| (because |setkey| modifies the object by
>>                 reference), but |.SD| has obtained *new* data
>>                 corresponding to /this/ group. And |data.table| sorts
>>                 this data, knowing that it already has key set.. but
>>                 if the key is set then the order must be 1:n. But it
>>                 wouldn't be, as this data isn't sorted. |data.table|
>>                 warns in those scenarios.. and that's why you get the
>>                 warning.
>>
>>                 To verify this, you can try:
>>
>>                 |conflictsTable1 <- function(f, address) {|
>>
>>                 |  u <- unique(setkey(f))|
>>
>>                 |  setattr(f, 'sorted', NULL)|
>>
>>                 |  if (nrow(u) == 1) return(NULL)|
>>
>>                 |  u|
>>
>>                 |}|
>>
>>                 Basically, we set the key of |f| (which is equal to
>>                 |.SD| as it's only modified by reference) to |NULL|
>>                 everytime after.. so that |.SD| for the new group will
>>                 not have the key set.
>>
>>                 The ideal scenario here, IIUC, is that |setkey(.SD)|
>>                 or things pointing to |.SD| should not be possible
>>                 (locking binding doesn't seem to affect things done by
>>                 reference..). |.SD| however should retain the key of
>>                 the data.table, if a key was set, wherever possible.
>>
>>                  
>>
>>                 Arun
>>
>>
>>                 From: Ron Hylton rhylton at verizon.net
>>                 <mailto:rhylton at verizon.net>
>>                 Reply: Ron Hylton rhylton at verizon.net
>>                 <mailto:rhylton at verizon.net>
>>                 Date: June 14, 2014 at 1:55:53 AM
>>                 To: datatable-help at lists.r-forge.r-project.org
>>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>>                 datatable-help at lists.r-forge.r-project.org
>>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>>                 Subject:  [datatable-help] data.table is asking for help
>>
>>                  
>>
>>                     The code below generates the warning:
>>
>>                      
>>
>>                     In setkeyv(x, cols, verbose = verbose) :
>>
>>                       Already keyed by this key but had invalid row
>>                     order, key rebuilt. If you didn't go under the
>>                     hood please let datatable-help know so the root
>>                     cause can be fixed.
>>
>>                      
>>
>>                     This is my first attempt at using datatable so I
>>                     probably did something dumb, but maybe that's
>>                     useful for someone.  The first case is the one
>>                     that gives the warnings.
>>
>>                      
>>
>>                     I'm also surprised at the timings.  I wrote the
>>                     original algorithm using dataframe & ddply and I
>>                     expected datatable to be substantially faster; the
>>                     opposite is true.
>>
>>                      
>>
>>                     The algorithm does the following:  Certain columns
>>                     in the table are keys and others are values in the
>>                     sense that each row with the same set of keys
>>                     should have the same set of values.  Find all the
>>                     key sets for which this is not true and return the
>>                     keys sets + conflicting value sets.
>>
>>                      
>>
>>                     Insight into the performance would be appreciated.
>>
>>                      
>>
>>                     Regards,
>>
>>                     Ron
>>
>>                      
>>
>>                     library(data.table)
>>
>>                     library(plyr)
>>
>>                      
>>
>>                     conflictsTable1 <- function(f) {
>>
>>                       u <- unique(setkey(f))
>>
>>                       if (nrow(u) == 1) return(NULL)
>>
>>                       u
>>
>>                     }
>>
>>                      
>>
>>                     conflictsTable2 <- function(f) {
>>
>>                       u <- unique(f)
>>
>>                       if (nrow(u) == 1) return(NULL)
>>
>>                       u
>>
>>                     }
>>
>>                      
>>
>>                     conflictsFrame <- function(f) {
>>
>>                       u <- unique(f)
>>
>>                       if (nrow(u) == 1) return(NULL)
>>
>>                       u
>>
>>                     }
>>
>>                      
>>
>>                     N <- 10000
>>
>>                     test <-
>>                     data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>>                     x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>>
>>                      
>>
>>                     setkey(test,id)
>>
>>                      
>>
>>                     print(system.time(ut1 <- test[,
>>                     conflictsTable1(.SD), by=id]))
>>
>>                      
>>
>>                     print(system.time(ut2 <- test[,
>>                     conflictsTable2(.SD), by=id]))
>>
>>                      
>>
>>                     print(system.time(uf <- ddply(test, .(id),
>>                     conflictsFrame)))
>>
>>                     _______________________________________________
>>                     datatable-help mailing list
>>                     datatable-help at lists.r-forge.r-project.org
>>                     <mailto:datatable-help at lists.r-forge.r-project.org>
>>                     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>

From my.r.help at gmail.com  Thu Jun 19 05:51:41 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 19 Jun 2014 11:51:41 +0800
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
Message-ID: <53A25E4D.8040206@gmail.com>

I got the following result on my keyed data tables `CS` and `SP`, which
seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all
columns should have the _same_ length:

> ## Works as expected:
> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1])
[1] TRUE
> ## Works as expected:
> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1])
[1] TRUE
> ## Here's the potential _bug_, when combining both:
> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1])
[1] FALSE
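
For anyone who wants to run the same check on self-contained data, here is a minimal sketch. The tables below are made up for illustration (they are not the poster's `CS` and `SP`), and the column names `id`, `t`, `v` and `w` are assumptions; on a version without the problem the check is expected to return TRUE.

```
library(data.table)

# Hypothetical stand-ins for CS and SP, both keyed on (id, t)
CS <- data.table(id = c(1L, 1L, 2L), t = c(1L, 5L, 3L),
                 v = c(10, 20, 30), key = c("id", "t"))
SP <- data.table(id = c(1L, 2L, 3L), t = c(4L, 1L, 2L),
                 w = c("p", "q", "r"), key = c("id", "t"))

# Rolling join that also drops the rows of SP that find no match in CS
res <- CS[SP, roll = TRUE, nomatch = 0L]

# Every column of a valid data.table must have the same length
lens <- sapply(res, length)
all(lens == lens[1])
```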


Thanks,

M

From my.r.help at gmail.com  Thu Jun 19 05:59:59 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 19 Jun 2014 11:59:59 +0800
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A25E4D.8040206@gmail.com>
References: <53A25E4D.8040206@gmail.com>
Message-ID: <53A2603F.9030204@gmail.com>

By the way, I know it's not reproducible with the code below. Before
going into further detail, I first wanted to ask whether this looks like
a bug, or whether I've overlooked something obvious and this is expected
behavior.

Thanks,
M


From mathematical.coffee at gmail.com  Fri Jun 20 02:44:08 2014
From: mathematical.coffee at gmail.com (mathematical.coffee)
Date: Thu, 19 Jun 2014 17:44:08 -0700 (PDT)
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To: <1397752015938-4689002.post@n4.nabble.com>
References: <1397752015938-4689002.post@n4.nabble.com>
Message-ID: <1403225048664-4692401.post@n4.nabble.com>

Hi all,

Sorry to resurrect an old thread, but I've been experiencing these problems
too and have come up with a reproducible example (for me anyway).

Data.table 1.9.2, R 3.1.0

I was trying to join some tables and got the usual "rerun with
allow.cartesian=TRUE" message like Michele, and then got this error:

Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed

However while I was trying to strip down my data to reproduce the error, I
now consistently get this one instead:

Error in `[.data.table`(x, y, `:=`(female, female)) : 
  object 'bysubl' not found


rather than the TRUE/FALSE one. But they seem to be related.

* x has a column of subjects, some duplicated
* y has a column of subjects, none duplicated, and some not present in x
(all subjects of x are in y though).
* y additionally has a binary column `female` that I wish to join into x

(I know there are other ways to do this, but this is a stripped down example
and seems to point out something going wrong in data.table so it is just an
illustrative example):

```
library(data.table)
x=fread('x.csv')
y=fread('y.csv')
setkey(x, subject)
setkey(y, subject)

x[y]
# Error in vecseq(f__, len__, if (allow.cartesian) NULL else
# as.integer(max(nrow(x),  :
#   Join results in 33 rows; more than 28 = max(nrow(x),nrow(i)). Check for
#   duplicate key values in i, each of which join to the same group in x over
#   and over again. If that's ok, try including `j` and dropping `by`
#   (by-without-by) so that j runs for each group to avoid the large allocation.
#   If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
#   Otherwise, please search for this error message in the FAQ, Wiki, Stack
#   Overflow and datatable-help for advice.

x[y, female:=female]
# Error in `[.data.table`(x, y, `:=`(female, female)) :
#   object 'bysubl' not found
```

I get the above reproducibly with this dataset.

>From now onwards, if I type in 'x' or 'y' into the prompt I get nothing
printed at all. Additionally:

```
tables()
# Error in gettext(domain, unlist(args)) : invalid 'string' value
# Error: argument "finally" is missing, with no default
```

The only solution is to restart the R session.

Note: this *doesn't* occur if the column I try to merge (`female` in this
case) is continuous, for example. I can only get it if it's logical.

I've attached x.csv and y.csv to this email for you to play with.

I think it might be possible to strip down the tables to fewer rows (x has
28, y has 26) but in my (not exhaustive) attempts to do so, I didn't get
this particular error.

x.csv <http://r.789695.n4.nabble.com/file/n4692401/x.csv>  
y.csv <http://r.789695.n4.nabble.com/file/n4692401/y.csv>  


--
View this message in context: http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002p4692401.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com  Fri Jun 20 02:51:05 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 02:51:05 +0200
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To: <1403225048664-4692401.post@n4.nabble.com>
References: <1397752015938-4689002.post@n4.nabble.com>
 <1403225048664-4692401.post@n4.nabble.com>
Message-ID: <etPan.53a38579.23d86aac.38b@Arunkumars-MacBook-Pro.local>

Hi,

Could you let us know if you're able to reproduce it in the devel version 1.9.3 <https://github.com/Rdatatable/data.table> as well?


Arun


From mathematical.coffee at gmail.com  Fri Jun 20 03:01:50 2014
From: mathematical.coffee at gmail.com (Amy)
Date: Fri, 20 Jun 2014 11:01:50 +1000
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To: <etPan.53a38579.23d86aac.38b@Arunkumars-MacBook-Pro.local>
References: <1397752015938-4689002.post@n4.nabble.com>
 <1403225048664-4692401.post@n4.nabble.com>
 <etPan.53a38579.23d86aac.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <CAApqYgCk6BBp3Ud026QbSrAY4Nidsg=Ti+YEGEkxxYU0Ux5BYw@mail.gmail.com>

Hi Arun,

In 1.9.3 I get the "Error in vecseq(f__, len__, if (allow.cartesian) NULL
else as.integer(max(nrow(x), : Join results in 33 rows; more than 28 =
max(nrow(x),nrow(i))...." message and it doesn't assign the column (upon
`x[y, female:=female]`), so no, the 'bysubl' error doesn't occur.

But as an aside, shouldn't this command work?
If I have x with subjects a, a, b, c, d; y with genders for subjects a--f,
shouldn't x[y, female:=female] copy the female column from y to x,
duplicating as necessary?
Of course y[x] produces the table I'm after, but in the case that y has
extra columns I /don't/ want in the output and x has extra columns I /do/,
`y[x]` is then not the table I'm after. (But now we are straying into a
different question, my limited understanding of how to use data.table, as
opposed to the bug this thread is about).

PS - typo in the data.table README, in the "if you get latex errors during
installation" bit:

devtools:::install_github("datat.able", ...)

"datat.able" --> "data.table".

cheers
Amy



From aragorn168b at gmail.com  Fri Jun 20 03:18:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 03:18:12 +0200
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To: <CAApqYgCk6BBp3Ud026QbSrAY4Nidsg=Ti+YEGEkxxYU0Ux5BYw@mail.gmail.com>
References: <1397752015938-4689002.post@n4.nabble.com>
 <1403225048664-4692401.post@n4.nabble.com>
 <etPan.53a38579.23d86aac.38b@Arunkumars-MacBook-Pro.local>
 <CAApqYgCk6BBp3Ud026QbSrAY4Nidsg=Ti+YEGEkxxYU0Ux5BYw@mail.gmail.com>
Message-ID: <etPan.53a38bd4.3c5991aa.38b@Arunkumars-MacBook-Pro.local>

Hi Amy,

Good to know that it's not reproducible in 1.9.3. Matt already fixed it.

X[Y, LHS := RHS] cannot exceed nrow(X) because this assignment is made by reference. If the join from X[Y] results in more than nrow(X) rows, then X would have to be re-allocated entirely.

If you only want those that match with X, then you should do: X[Y, female := i.female, nomatch=0L].

If instead you want all the rows from y, then you could do: x[y, allow.cartesian=TRUE].
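
To illustrate the first form, here is a small sketch. The tables are made-up stand-ins for x.csv and y.csv (the attachments are not reproduced here), and the `score` column is purely hypothetical; it assumes a data.table version in which the 'bysubl' problem is fixed.

```
library(data.table)

# Hypothetical data: x has a duplicated subject, y has one row per subject
x <- data.table(subject = c("a", "a", "b", "c", "d"),
                score = 1:5, key = "subject")
y <- data.table(subject = letters[1:6],
                female = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
                key = "subject")

# Update join: each row of y that matches in x assigns its own value
# (i.female) to the matching rows of x, by reference.
# Subjects of y that are absent from x (e, f) are dropped by nomatch = 0L.
x[y, female := i.female, nomatch = 0L]

# x keeps its original number of rows and gains the female column
nrow(x)
x$female
```

The `i.` prefix refers to the column of the table in `i` (here y), so the value is looked up per matched subject rather than taken from a column of x.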


Arun


From mathematical.coffee at gmail.com  Fri Jun 20 03:44:26 2014
From: mathematical.coffee at gmail.com (Amy)
Date: Fri, 20 Jun 2014 11:44:26 +1000
Subject: [datatable-help] What is going on with R 3.1 ?
In-Reply-To: <etPan.53a38bd4.3c5991aa.38b@Arunkumars-MacBook-Pro.local>
References: <1397752015938-4689002.post@n4.nabble.com>
 <1403225048664-4692401.post@n4.nabble.com>
 <etPan.53a38579.23d86aac.38b@Arunkumars-MacBook-Pro.local>
 <CAApqYgCk6BBp3Ud026QbSrAY4Nidsg=Ti+YEGEkxxYU0Ux5BYw@mail.gmail.com>
 <etPan.53a38bd4.3c5991aa.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <CAApqYgABXAcLB06tse45HhGwS45pYASTg4Fw4F4S9tJGZHYJ2A@mail.gmail.com>

Thanks for this, I knew not knowing how to do that join was a problem with
me not understanding data.table, not a problem with data.table.

Very good to know the 'bysubl' "error" is fixed in 1.9.3 (even if it is
brought about by users like me trying to do our joins wrongly :))

thanks,
Amy


On 20 June 2014 11:18, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:

> Hi Amy,
>
> Good to know that it?s not reproducible in 1.9.3. Matt already fixed it.
>
> X[Y, LHS := RHS] can not exceed nrow(X) because this assignment is made *by
> reference*. If the join from X[Y] results in more than nrow(X), then X
> will be to be re-allocated entirely.
>
> If you only want those that match with X, then you should do: X[Y, female
> := i.female, nomatch=0L].
>
> If instead you want all the rows from y, then you could do: x[y,
> allow.cartesian=TRUE].
>
>
> Arun
>
> From: Amy mathematical.coffee at gmail.com
> Reply: Amy mathematical.coffee at gmail.com
> Date: June 20, 2014 at 3:01:50 AM
> To: Arunkumar Srinivasan aragorn168b at gmail.com
> Cc: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
>
> Subject:  Re: [datatable-help] What is going on with R 3.1 ?
>
>  Hi Arun,
>
> In 1.9.3 I get the "Error in vecseq(f__, len__, if (allow.cartesian) NULL
> else as.integer(max(nrow(x), : Join results in 33 rows; more than 28 =
> max(nrow(x),nrow(i))...." message and it doesn't assign the column (upon
> `x[y, female:=female]`, so no, the error doesn't occur.
>
> But as an aside, shouldn't it this command work?
> If I have x with subjects a, a, b, c, d; y with genders for subjects a--f,
> shouldn't x[y, female:=female] copy the female column from y to x,
> duplicating as necessary?
> Of course y[x] produces the table I'm after, but in the case that y has
> extra columns I /don't/ want in the output and x has extra columns I /do/,
> `y[x]` is then not the table I'm after. (But now we are straying into a
> different question, my limited understanding of how to use data.table, as
> opposed to the bug this thread is about).
>
> PS - typo on the data.table Readmein the "if you get latex errors during
> installation" bit:
>
> devtools:::install_github("datat.able", ...)
>
> "datat.able" --> "data.table".
>
> cheers
> Amy
>
>
> On 20 June 2014 10:51, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
>
>>  Hi,
>>
>> Could you let us know if you?re able to reproduce it in the devel
>> version 1.9.3 <https://github.com/Rdatatable/data.table> as well?
>>
>>
>>  Arun
>>
>> From: mathematical.coffee mathematical.coffee at gmail.com
>> Reply: mathematical.coffee mathematical.coffee at gmail.com
>> Date: June 20, 2014 at 2:44:50 AM
>> To: datatable-help at lists.r-forge.r-project.org
>> datatable-help at lists.r-forge.r-project.org
>> Subject:  Re: [datatable-help] What is going on with R 3.1 ?
>>
>>  Hi all,
>>
>> Sorry to resurrect an old thread, but I've been experiencing these
>> problems
>> too and have come up with a reproducible example (for me anyway).
>>
>> Data.table 1.9.2, R 3.1.0
>>
>> I was trying to join some tables and got the usual "rerun with
>> allow.cartesian=TRUE" message like Michele, and then got this error:
>>
>> Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed
>>
>> However while I was trying to strip down my data to reproduce the error, I
>> now consistently get this one instead:
>>
>> Error in `[.data.table`(x, y, `:=`(female, female)) :
>> object 'bysubl' not found
>>
>>
>> rather than the TRUE/FALSE one. But they seem to be related.
>>
>> * x has a column of subjects, some duplicated
>> * y has a column of subjects, none duplicated, and some not present in x
>> (all subjects of x are in y though).
>> * y additionally has a binary column `female` that I wish to join into x
>>
>> (I know there are other ways to do this, but this is a stripped down
>> example
>> and seems to point out something going wrong in data.table so it is just
>> an
>> illustrative example):
>>
>> ```
>> library(data.table)
>> x=fread('x.csv')
>> y=fread('y.csv')
>> setkey(x, subject)
>> setkey(y, subject)
>>
>> x[y]
>> # Error in vecseq(f__, len__, if (allow.cartesian) NULL else
>> as.integer(max(nrow(x), :
>> # Join results in 33 rows; more than 28 = max(nrow(x),nrow(i)). Check for
>> duplicate key values in i, each of which join to the same group in x over
>> and over again. If that's ok, try including `j` and dropping `by`
>> (by-without-by) so that j runs for each group to avoid the large
>> allocation.
>> If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
>> Otherwise, please search for this error message in the FAQ, Wiki, Stack
>> Overflow and datatable-help for advice.
>>
>> x[y, female:=female]
>> Error in `[.data.table`(x, y, `:=`(female, female)) :
>> object 'bysubl' not found
>> ```
>>
>> I get the above reproducibly with this dataset.
>>
>> From now onwards, if I type in 'x' or 'y' into the prompt I get nothing
>> printed at all. Additionally:
>>
>> ```
>> tables()
>> # Error in gettext(domain, unlist(args)) : invalid 'string' value
>> # Error: argument "finally" is missing, with no default
>> ```
>>
>> The only solution is to restart the R session.
>>
>> Note: this *doesn't* occur if the column I try to merge (`female` in this
>> case) is continuous, for example. I can only get it if it's logical.
>>
>> I've attached x.csv and y.csv to this email for you to play with.
>>
>> I think it might be possible to strip down the tables to fewer rows (x has
>> 28, y has 26) but in my (not exhaustive) attempts to do so, I didn't get
>> this particular error.
>>
>> x.csv <http://r.789695.n4.nabble.com/file/n4692401/x.csv>
>> y.csv <http://r.789695.n4.nabble.com/file/n4692401/y.csv>
>>

From my.r.help at gmail.com  Fri Jun 20 05:37:07 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Fri, 20 Jun 2014 11:37:07 +0800
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A2603F.9030204@gmail.com>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com>
Message-ID: <53A3AC63.6020301@gmail.com>

So let me rephrase my question (haven't received an answer so far):

For a given data.table, is there any condition under which the lengths
of the vectors in each column may differ? Based on my understanding,
each data.table is also a data.frame, and with a data frame this should
not be possible. For example, it's not possible to have a data.frame
where the first column is a vector of length eight, and the second
column is a vector of length nine. Ergo, it's a bug, right?

If my understanding is correct, please do let me know and I'll be glad
to try to boil this down to something that's reproducible.
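
The premise can be checked in base R alone. This is a minimal sketch, not taken from the data.table sources: `same_col_lengths` is a hypothetical helper name, and the toy inputs are made up for illustration. It uses the same length comparison as the sapply() check quoted below.

```r
# data.frame() refuses columns whose lengths cannot be recycled to match.
err <- tryCatch(data.frame(a = 1:8, b = 1:9),
                error = function(e) conditionMessage(e))

# Hypothetical helper: TRUE when every column of `d` has the same length.
same_col_lengths <- function(d) {
  l <- vapply(d, length, integer(1))
  all(l == l[1])
}

# A well-formed data.frame passes the check.
ok <- same_col_lengths(data.frame(a = 1:3, b = letters[1:3]))

# A plain list with ragged "columns" (not a valid data.frame) fails it.
bad <- same_col_lengths(list(a = 1:5, b = 1:3))
```

Running the helper on a join result gives a quick way to spot a malformed table before filing a report.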

Thanks,
M

On 06/19/2014 11:59 AM, Michael Smith wrote:
> By the way, I know it's not reproducible with the code below. Before
> going into further detail, I first wanted to ask whether this looks like
> a bug, or whether I've overlooked something obvious and this is expected
> behavior.
> 
> Thanks,
> M
> 
> On 06/19/2014 11:51 AM, Michael Smith wrote:
>> I got the following result on my keyed data tables `CS` and `SP`, which
>> seems like a bug (in 1.9.2 and 1.9.3 dev version) to me, since all
>> columns should have the _same_ length:
>>
>>> ## Works as expected:
>>> all((l <- sapply(CS[SP, roll = TRUE], length)) == l[1])
>> [1] TRUE
>>> ## Works as expected:
>>> all((l <- sapply(CS[SP, nomatch = 0], length)) == l[1])
>> [1] TRUE
>>> ## Here's the potential _bug_, when combining both:
>>> all((l <- sapply(CS[SP, nomatch = 0, roll = TRUE], length)) == l[1])
>> [1] FALSE
>>
>>
>> Thanks,
>>
>> M
>>

From aragorn168b at gmail.com  Fri Jun 20 11:17:13 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 11:17:13 +0200
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A3AC63.6020301@gmail.com>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com>
 <53A3AC63.6020301@gmail.com>
Message-ID: <etPan.53a3fc19.78df6a55.38b@Arunkumars-MacBook-Pro.local>

> For a given data.table, is there any condition ... Ergo, it's a bug, right?

Yes.

> I'll be glad
> to try to boil this down to something that's reproducible.

That'd be great.


Arun


From my.r.help at gmail.com  Fri Jun 20 13:30:05 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Fri, 20 Jun 2014 19:30:05 +0800
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <etPan.53a3fc19.78df6a55.38b@Arunkumars-MacBook-Pro.local>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com>
 <53A3AC63.6020301@gmail.com>
 <etPan.53a3fc19.78df6a55.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <53A41B3D.3000003@gmail.com>

OK, no problem, here's the code. If there are any problems pasting it
into R let me know (I used parts of dput, so maybe the email line
endings are messed up). If you want I can also file a bug report on
github, just let me know.

library(data.table)

CS <-
  data.table(
    structure(list(LPERMCO = c(7L, 33L),
                   datadate = structure(c(15912, 15912), class = "Date"),
                   me = c(626550.35284, 7766.385)),
              .Names = c("LPERMCO", "datadate", "me"),
              class = "data.frame", row.names = c(NA, -2L)),
    key = "LPERMCO,datadate")
SP <-
  data.table(
    structure(list(PERMCO = c(7L, 7L, 33L, 33L, 33L, 33L),
                   date = structure(c(15884, 15917, 15884, 15884, 15917, 15917),
                                    class = "Date"),
                   RET = c(-0.118303, 0.141225, -0.03137, -0.02533,
                           0.045967, 0.043694)),
              .Names = c("PERMCO", "date", "RET"),
              class = "data.frame", row.names = c(NA, -6L)),
    key = "PERMCO,date")
sapply(CS[SP, nomatch = 0, roll = TRUE], length)


The relevant output looks like this in both 1.9.2 and dev 1.9.3. In the
sapply result, the "me" column should have length 5 but it has length 3:

> CS
   LPERMCO   datadate         me
1:       7 2013-07-26 626550.353
2:      33 2013-07-26   7766.385
> SP
   PERMCO       date       RET
1:      7 2013-06-28 -0.118303
2:      7 2013-07-31  0.141225
3:     33 2013-06-28 -0.031370
4:     33 2013-06-28 -0.025330
5:     33 2013-07-31  0.045967
6:     33 2013-07-31  0.043694
> CS[SP, nomatch = 0, roll = T]
   LPERMCO   datadate         me       RET
1:       7 2013-07-31 626550.353  0.141225
2:      33 2013-06-28   7766.385 -0.031370
3:      33 2013-06-28   7766.385 -0.025330
4:      33 2013-07-31 626550.353  0.045967
5:      33 2013-07-31   7766.385  0.043694
Warning message:
In cbind(LPERMCO = c(" 7", "33", "33", "33", "33"), datadate =
c("2013-07-31",  :
  number of rows of result is not a multiple of vector length (arg 3)
> sapply(CS[SP, nomatch = 0, roll = T], length)
 LPERMCO datadate       me      RET
       5        5        3        5


Thanks,
M



From aragorn168b at gmail.com  Fri Jun 20 13:41:59 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 13:41:59 +0200
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A41B3D.3000003@gmail.com>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com>
 <53A3AC63.6020301@gmail.com>
 <etPan.53a3fc19.78df6a55.38b@Arunkumars-MacBook-Pro.local>
 <53A41B3D.3000003@gmail.com>
Message-ID: <etPan.53a41e07.2b0d8dbe.38b@Arunkumars-MacBook-Pro.local>

Michael,

Excellent example. Perfectly reproducible on 1.9.2 and 1.9.3, and it works fine on 1.8.10. The answer should have only 3 rows.
It would be even nicer if you could file it as a bug report.
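
To see why 3 rows is the expected answer, here is a base-R emulation of the join. It is only a sketch under stated assumptions, not data.table's implementation: it assumes roll=TRUE means "last observation carried forward" within each company with no rolling backward before the first observation, and that nomatch=0 drops unmatched rows. The data frames copy the values from the example; the variable names `m`, `keep` and `res` are made up for illustration.

```r
# Same values as the CS and SP tables in the example, as plain data.frames.
CS <- data.frame(LPERMCO = c(7L, 33L),
                 datadate = as.Date(c("2013-07-26", "2013-07-26")),
                 me = c(626550.35284, 7766.385))
SP <- data.frame(PERMCO = c(7L, 7L, 33L, 33L, 33L, 33L),
                 date = as.Date(c("2013-06-28", "2013-07-31", "2013-06-28",
                                  "2013-06-28", "2013-07-31", "2013-07-31")),
                 RET = c(-0.118303, 0.141225, -0.03137, -0.02533,
                         0.045967, 0.043694))

# For each SP row, find the latest CS row of the same company whose
# datadate is on or before the SP date; NA when there is none.
m <- vapply(seq_len(nrow(SP)), function(i) {
  idx <- which(CS$LPERMCO == SP$PERMCO[i] & CS$datadate <= SP$date[i])
  if (length(idx) == 0L) NA_integer_
  else idx[which.max(as.numeric(CS$datadate[idx]))]
}, integer(1))

# Emulate nomatch=0: keep only the SP rows that found a match.
keep <- !is.na(m)
res <- data.frame(LPERMCO  = SP$PERMCO[keep],
                  datadate = SP$date[keep],
                  me       = CS$me[m[keep]],
                  RET      = SP$RET[keep])
```

Only the three 2013-07-31 rows of SP have a CS observation on or before their date, so `res` has 3 rows and every column has length 3.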

PS: On another note, you may also be interested in `CS[SP, roll=TRUE, rollends=TRUE]`.
Arun


From my.r.help at gmail.com  Fri Jun 20 14:23:28 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Fri, 20 Jun 2014 20:23:28 +0800
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <etPan.53a41e07.2b0d8dbe.38b@Arunkumars-MacBook-Pro.local>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com>
 <53A3AC63.6020301@gmail.com>
 <etPan.53a3fc19.78df6a55.38b@Arunkumars-MacBook-Pro.local>
 <53A41B3D.3000003@gmail.com>
 <etPan.53a41e07.2b0d8dbe.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <53A427C0.5070506@gmail.com>

Arun,

Thanks for your reply and the issue is here (if there's anything else I
can do to help solve this problem let me know):

https://github.com/Rdatatable/data.table/issues/700

Also thanks for mentioning rollends.

M



From aragorn168b at gmail.com  Fri Jun 20 14:24:22 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 14:24:22 +0200
Subject: [datatable-help] Bug when Merging with nomatch=0 and roll=T?
In-Reply-To: <53A427C0.5070506@gmail.com>
References: <53A25E4D.8040206@gmail.com> <53A2603F.9030204@gmail.com>
 <53A3AC63.6020301@gmail.com>
 <etPan.53a3fc19.78df6a55.38b@Arunkumars-MacBook-Pro.local>
 <53A41B3D.3000003@gmail.com>
 <etPan.53a41e07.2b0d8dbe.38b@Arunkumars-MacBook-Pro.local>
 <53A427C0.5070506@gmail.com>
Message-ID: <etPan.53a427f6.2c27173b.38b@Arunkumars-MacBook-Pro.local>

Awesome. Just got the email notification (from github). Thanks.

Arun


From aragorn168b at gmail.com  Fri Jun 20 23:47:20 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 20 Jun 2014 23:47:20 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <539D0C8F.1080005@gmail.com>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
 <539D0C8F.1080005@gmail.com>
Message-ID: <etPan.53a4abe8.3db012b3.38b@Arunkumars-MacBook-Pro.local>

This is a really tricky one. I was just trying to fix it when I recollected the issues with base:::order from the time during implementation.

Consider this case:

require(data.table)
DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))
Consider the cases A and B below:

# case A
DT[base:::order(DT[, "x", with=FALSE])]
#    x y  z
# 1: 1 8 10
# 2: 2 7  9
# 3: 3 5 11
# 4: 4 6 12
Intended right result. Great!

B:

# case B
DT[base:::order(list(x))]
#    x y  z
# 1: 1 8 10
What just happened?!? So, basically if the list gives TRUE for is.object(.), it understands what the opeation is, correctly. But if it?s just a list, no idea how to deal with it. Also it silently returns undesirable result (imo).

Similar to the above cases, compare these two:

# case C
DT[base:::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])]
# vs
# case D
DT[base:::order(list(x), list(y))]

Even more crazy case:

# case E
DT[base:::order(DT[, c("x", "y"), with=FALSE])]
# vs
# case F
DT[base:::order(list(x,y))]

While we were testing and implementing forder, it obviously didn't occur to us to check the argument to order(.) with a data.table. And in spite of the fact that the output for DT[order(list(x))] is a bit strange and even dangerous, to be consistent with base:::order, we had implemented it the same way.

Now, I'm not so sure. Any ideas justifying these differences?


Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 15, 2014 at 5:02:46 AM
To: G See gsee000 at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] `with=F` in the `i` Argument

Devs,  

Is this a bug? It works in 1.9.2 but not in the 1.9.3 development version:  

DT <- data.table(a = 1:4, b = 8:5)  
for (i in c("a", "b"))  
print(DT[order(DT[, i, with = FALSE])])  

Error in forder(DT, DT[, i, with = FALSE]) :  
Column '1' is type 'list' which is not supported for ordering currently.  


Thanks,  

M  


On 05/31/2014 12:44 PM, G See wrote:  
> Hi Michael,  
>  
> I would use get()  
>  
> DT <- data.table(a = 1:4, b = 8:5)  
> for (i in c("a", "b"))  
> print(DT[order(get(i))])  
>  
> For what it's worth, your solution doesn't seem to work in data.table  
> 1.9.3 (svn rev. 1278):  
>  
>> for (i in c("a", "b"))  
> + print(DT[order(DT[, i, with = FALSE])])  
> Error in forder(DT, DT[, i, with = FALSE]) :  
> Column '1' is type 'list' which is not supported for ordering currently.  
>  
>  
> HTH,  
> Garrett  
>  
> On Fri, May 30, 2014 at 11:01 PM, Michael Smith <my.r.help at gmail.com> wrote:  
>> All,  
>>  
>> I'm trying to order the rows according to several columns at a time:  
>>  
>> DT <- data.table(a = 1:4, b = 8:5)  
>> for (i in c("a", "b"))  
>> print(DT[order(i), with = FALSE])  
>>  
>> It doesn't work, since `with` seems to be about the `j` argument, but  
>> not the `i` argument, according to `?data.table`.  
>>  
>> I found the following workaround, but wonder whether there is a more  
>> elegant way to do it:  
>>  
>> for (i in c("a", "b"))  
>> print(DT[order(DT[, i, with = FALSE])])  
>>  
>> Thanks,  
>> M  
>> _______________________________________________  
>> datatable-help mailing list  
>> datatable-help at lists.r-forge.r-project.org  
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  

From aragorn168b at gmail.com  Sat Jun 21 02:25:27 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 21 Jun 2014 02:25:27 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <etPan.53a4abe8.3db012b3.38b@Arunkumars-MacBook-Pro.local>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
 <539D0C8F.1080005@gmail.com>
 <etPan.53a4abe8.3db012b3.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <etPan.53a4d0f7.5b25ace2.38b@Arunkumars-MacBook-Pro.local>

Michael,

Note that in your case, you can also do:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
    DT[order(DT[[i]])]

At the moment, I'm more inclined towards giving an error when any of the arguments to order(.) results in a list. The message could be something like:

DT[order(.)] on data.tables is optimised internally to use data.table's fast ordering. Since the behaviour of base:::order seems inconsistent in the way it handles list input - for ex: compare DT[order(list(x))] and DT[order(data.table(x))], we do not support list columns as input here. If you're sure, you can use `DT[base:::order(.)]` explicitly. However, this can be avoided most of the time by using `[[` to access the specified columns, which results in a vector.

What do you (all) think?


Arun

From: Arunkumar Srinivasan aragorn168b at gmail.com
Reply: Arunkumar Srinivasan aragorn168b at gmail.com
Date: June 20, 2014 at 11:47:22 PM
To: Michael Smith my.r.help at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] `with=F` in the `i` Argument

This is a really tricky one. I was just trying to fix it when I recollected the issues with base:::order from the time during implementation.

Consider this case:

require(data.table)
DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))

Consider the cases A and B below:

# case A
DT[base:::order(DT[, "x", with=FALSE])]
#    x y  z
# 1: 1 8 10
# 2: 2 7  9
# 3: 3 5 11
# 4: 4 6 12

Intended right result. Great!

B:

# case B
DT[base:::order(list(x))]
#    x y  z
# 1: 1 8 10

What just happened?!? So, basically if the list gives TRUE for is.object(.), it understands what the operation is, correctly. But if it's just a list, it has no idea how to deal with it. Also, it silently returns an undesirable result (imo).

Similar to the above cases, compare these two:

# case C
DT[base:::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])]
# vs
# case D
DT[base:::order(list(x), list(y))]

Even more crazy case:

# case E
DT[base:::order(DT[, c("x", "y"), with=FALSE])]
# vs
# case F
DT[base:::order(list(x,y))]

While we were testing and implementing forder, it obviously didn't occur to us to check the argument to order(.) with a data.table. And in spite of the fact that the output for DT[order(list(x))] is a bit strange and even dangerous, to be consistent with base:::order, we had implemented it the same way.

Now, I'm not so sure. Any ideas justifying these differences?


Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 15, 2014 at 5:02:46 AM
To: G See gsee000 at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] `with=F` in the `i` Argument

Devs,

Is this a bug? It works in 1.9.2 but not in the 1.9.3 development version:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
print(DT[order(DT[, i, with = FALSE])])

Error in forder(DT, DT[, i, with = FALSE]) :
Column '1' is type 'list' which is not supported for ordering currently.


Thanks,

M


On 05/31/2014 12:44 PM, G See wrote:
> Hi Michael,
>
> I would use get()
>
> DT <- data.table(a = 1:4, b = 8:5)
> for (i in c("a", "b"))
> print(DT[order(get(i))])
>
> For what it's worth, your solution doesn't seem to work in data.table
> 1.9.3 (svn rev. 1278):
>
>> for (i in c("a", "b"))
> + print(DT[order(DT[, i, with = FALSE])])
> Error in forder(DT, DT[, i, with = FALSE]) :
> Column '1' is type 'list' which is not supported for ordering currently.
>
>
> HTH,
> Garrett
>
> On Fri, May 30, 2014 at 11:01 PM, Michael Smith <my.r.help at gmail.com> wrote:
>> All,
>>
>> I'm trying to order the rows according to several columns at a time:
>>
>> DT <- data.table(a = 1:4, b = 8:5)
>> for (i in c("a", "b"))
>> print(DT[order(i), with = FALSE])
>>
>> It doesn't work, since `with` seems to be about the `j` argument, but
>> not the `i` argument, according to `?data.table`.
>>
>> I found the following workaround, but wonder whether there is a more
>> elegant way to do it:
>>
>> for (i in c("a", "b"))
>> print(DT[order(DT[, i, with = FALSE])])
>>
>> Thanks,
>> M
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From my.r.help at gmail.com  Sat Jun 21 05:19:00 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 21 Jun 2014 11:19:00 +0800
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <etPan.53a4d0f7.5b25ace2.38b@Arunkumars-MacBook-Pro.local>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
 <539D0C8F.1080005@gmail.com>
 <etPan.53a4abe8.3db012b3.38b@Arunkumars-MacBook-Pro.local>
 <etPan.53a4d0f7.5b25ace2.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <53A4F9A4.1090808@gmail.com>

Hi Arun,

If `is.object` gives `FALSE` and you just have a list, you could wrap it
in `unlist` as follows. It gives the same result for your cases. (These
are just my two cents, maybe someone else has a different opinion.)

require(data.table)
DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))

## Case A.
DT[base::order(DT[, "x", with=FALSE])]  # OK.
## Case B.
DT[base::order(list(x))]                # Not OK.
DT[base::order(unlist(list(x)))]        # Same as case A.

# Case C.
DT[base::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])] # OK.
# Case D.
DT[base::order(list(x), list(y))]         # Not OK.
DT[base::order(unlist(list(x), list(y)))] # Same as case C.

## Case E.
DT[base::order(DT[, c("x", "y"), with=FALSE])] # Pads NA for `y`.
## Case F.
DT[base::order(list(x, y))]             # Not OK.
DT[base::order(unlist(list(x, y)))]     # Same as case E.


Thanks,
M


On 06/21/2014 08:25 AM, Arunkumar Srinivasan wrote:
> Michael,
> 
> Note that in your case, you can also do:
> 
> |DT <- data.table(a = 1:4, b = 8:5)
> for (i in c("a", "b"))
>     DT[order(DT[[i]])]
> |
> 
> At the moment, I'm more inclined towards giving an error when any of the
> arguments to |order(.)| results in a |list|. The message could be
> something like:
> 
> |DT[order(.)] on data.tables is optimised internally to use data.table's fast ordering. Since the behaviour of base:::order seems inconsistent in the way it handles list input - for ex: compare DT[order(list(x))] and DT[order(data.table(x))], we do not support list columns as input here. If you're sure, you can use `DT[base:::order(.)]` explicitly. However, this can be avoided most of the times by using `[[` to access specified columns to result in a vector.
> |
> 
> What do you (all) think?
> 
> 
> Arun
> 
> From: Arunkumar Srinivasan aragorn168b at gmail.com
> <mailto:aragorn168b at gmail.com>
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
> <mailto:aragorn168b at gmail.com>
> Date: June 20, 2014 at 11:47:22 PM
> To: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>
> Cc: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] `with=F` in the `i` Argument
> 
>> This is a really tricky one. I was just trying to fix it when I
>> recollected the issues with |base:::order| from the time during
>> implementation.
>>
>> Consider this case:
>>
>> |require(data.table)
>> DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))
>> |
>>
>> Consider the cases A and B below:
>>
>> |# case A
>> DT[base:::order(DT[, "x", with=FALSE])]
>> #    x y  z
>> # 1: 1 8 10
>> # 2: 2 7  9
>> # 3: 3 5 11
>> # 4: 4 6 12
>> |
>>
>> Intended right result. Great!
>>
>>
>>       B:
>>
>> |# case B
>> DT[base:::order(list(x))]
>> #    x y  z
>> # 1: 1 8 10
>> |
>>
>> What just happened?!? So, basically if the list gives |TRUE| for
>> |is.object(.)|, it understands what the operation is, correctly. But if
>> it's /just/ a list, it has no idea how to deal with it. Also, it silently
>> returns an undesirable result (imo).
>>
>> Similar to the above cases, compare these two:
>>
>> |# case C
>> DT[base:::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])]
>> # vs
>> # case D
>> DT[base:::order(list(x), list(y))]
>> |
>>
>>
>>       Even more crazy case:
>>
>> |# case E
>> DT[base:::order(DT[, c("x", "y"), with=FALSE])]
>> # vs
>> # case F
>> DT[base:::order(list(x,y))]
>> |
>>
>> While we were testing and implementing |forder|, it obviously didn't
>> occur to us to check the argument to |order(.)| with a |data.table|.
>> And in spite of the fact that the output for |DT[order(list(x))]| is a
>> bit strange and even dangerous, to be consistent with |base:::order|,
>> we had implemented it the same way.
>>
>> Now, I'm not so sure. Any ideas justifying these differences?
>>
>>
>>
>> Arun
>>
>> From: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>
>> Reply: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>
>> Date: June 15, 2014 at 5:02:46 AM
>> To: G See gsee000 at gmail.com <mailto:gsee000 at gmail.com>
>> Cc: datatable-help at lists.r-forge.r-project.org
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject:  Re: [datatable-help] `with=F` in the `i` Argument
>>
>>> Devs,
>>>
>>> Is this a bug? It works in 1.9.2 but not in the 1.9.3 development
>>> version:
>>>
>>> DT <- data.table(a = 1:4, b = 8:5)
>>> for (i in c("a", "b"))
>>> print(DT[order(DT[, i, with = FALSE])])
>>>
>>> Error in forder(DT, DT[, i, with = FALSE]) :
>>> Column '1' is type 'list' which is not supported for ordering currently.
>>>
>>>
>>> Thanks,
>>>
>>> M
>>>
>>>
>>> On 05/31/2014 12:44 PM, G See wrote:
>>> > Hi Michael,
>>> >
>>> > I would use get()
>>> >
>>> > DT <- data.table(a = 1:4, b = 8:5)
>>> > for (i in c("a", "b"))
>>> > print(DT[order(get(i))])
>>> >
>>> > For what it's worth, your solution doesn't seem to work in data.table
>>> > 1.9.3 (svn rev. 1278):
>>> >
>>> >> for (i in c("a", "b"))
>>> > + print(DT[order(DT[, i, with = FALSE])])
>>> > Error in forder(DT, DT[, i, with = FALSE]) :
>>> > Column '1' is type 'list' which is not supported for ordering currently.
>>> >
>>> >
>>> > HTH,
>>> > Garrett
>>> >
>>> > On Fri, May 30, 2014 at 11:01 PM, Michael Smith <my.r.help at gmail.com> wrote:
>>> >> All,
>>> >>
>>> >> I'm trying to order the rows according to several columns at a time:
>>> >>
>>> >> DT <- data.table(a = 1:4, b = 8:5)
>>> >> for (i in c("a", "b"))
>>> >> print(DT[order(i), with = FALSE])
>>> >>
>>> >> It doesn't work, since `with` seems to be about the `j` argument, but
>>> >> not the `i` argument, according to `?data.table`.
>>> >>
>>> >> I found the following workaround, but wonder whether there is a more
>>> >> elegant way to do it:
>>> >>
>>> >> for (i in c("a", "b"))
>>> >> print(DT[order(DT[, i, with = FALSE])])
>>> >>
>>> >> Thanks,
>>> >> M
>>> >> _______________________________________________
>>> >> datatable-help mailing list
>>> >> datatable-help at lists.r-forge.r-project.org
>>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 

From my.r.help at gmail.com  Sat Jun 21 09:39:56 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 21 Jun 2014 15:39:56 +0800
Subject: [datatable-help] Self-Join: Potential Bug?
Message-ID: <53A536CC.3050501@gmail.com>

I'm getting a warning when I run the following code in 1.9.2,
dev-1.9.3-master, and dev-1.9.3-issue_700 (b/c I thought it looks
similar to that issue, but it turns out it's different). In contrast, I
do not get this warning in 1.8.10.

Not sure whether this is a bug or whether I'm missing something:

X <- data.table(
  structure(list(ID = c(45063L, 45066L, 45172L), date = structure(c(14548,
14487, 14395), class = "Date"), price = c(17.56, 12.49, 10.04
)), .Names = c("ID", "date", "price"), row.names = c(NA, -3L), class =
"data.frame"),
  key = "ID,date")
X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]


The data and the warning message look like this:

> X
      ID       date price
1: 45063 2009-10-31 17.56
2: 45066 2009-08-31 12.49
3: 45172 2009-05-31 10.04
> X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]
      ID       date price
1: 45063 2009-05-31    NA
2: 45066 2010-05-31    NA
3: 45172 2009-05-31 10.04
Warning message:
In as.data.table.list(i) :
  Item 2 is of size 2 but maximum size is 3 (recycled leaving a
remainder of 1 items)


Thanks,
M

From aragorn168b at gmail.com  Sat Jun 21 09:43:29 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 21 Jun 2014 09:43:29 +0200
Subject: [datatable-help] Self-Join: Potential Bug?
In-Reply-To: <53A536CC.3050501@gmail.com>
References: <53A536CC.3050501@gmail.com>
Message-ID: <etPan.53a537a1.34fd6b4f.38b@Arunkumars-MacBook-Pro.local>

Michael,
You should be using `CJ`. This is no different from the post from Garrett See http://lists.r-forge.r-project.org/pipermail/datatable-help/2014-June/002619.html just last week, IIUC.
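
For the archive, a minimal sketch of the difference (my own reconstruction of the example data, with `ids` and `dates` as illustrative names): J() builds the lookup table column by column, so the 2 dates get recycled against the 3 IDs, which is where the warning comes from. CJ() builds the cross product, so every ID is paired with every date:

```r
library(data.table)

X <- data.table(
  ID    = c(45063L, 45066L, 45172L),
  date  = as.Date(c("2009-10-31", "2009-08-31", "2009-05-31")),
  price = c(17.56, 12.49, 10.04),
  key   = c("ID", "date")
)

ids   <- unique(X$ID)
dates <- as.Date(c("2009-05-31", "2010-05-31"))

# Cross join: 3 IDs x 2 dates = 6 lookup rows, no recycling involved
lookup <- CJ(ids, dates)
res <- X[lookup]
res
```

Only the (45172, 2009-05-31) pair exists in X, so that is the only row of `res` with a non-NA price.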

Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 21, 2014 at 9:40:17 AM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] Self-Join: Potential Bug?

I'm getting a warning when I run the following code in 1.9.2,  
dev-1.9.3-master, and dev-1.9.3-issue_700 (b/c I thought it looks  
similar to that issue, but it turns out it's different). In contrast, I  
do not get this warning in 1.8.10.  

Not sure whether this is a bug or whether I'm missing something:  

X <- data.table(  
structure(list(ID = c(45063L, 45066L, 45172L), date = structure(c(14548,  
14487, 14395), class = "Date"), price = c(17.56, 12.49, 10.04  
)), .Names = c("ID", "date", "price"), row.names = c(NA, -3L), class =  
"data.frame"),  
key = "ID,date")  
X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]  


The data and the warning message look like this:  

> X  
ID date price  
1: 45063 2009-10-31 17.56  
2: 45066 2009-08-31 12.49  
3: 45172 2009-05-31 10.04  
> X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]  
ID date price  
1: 45063 2009-05-31 NA  
2: 45066 2010-05-31 NA  
3: 45172 2009-05-31 10.04  
Warning message:  
In as.data.table.list(i) :  
Item 2 is of size 2 but maximum size is 3 (recycled leaving a  
remainder of 1 items)  


Thanks,  
M  

From my.r.help at gmail.com  Sat Jun 21 09:45:45 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 21 Jun 2014 15:45:45 +0800
Subject: [datatable-help] Self-Join: Potential Bug?
In-Reply-To: <etPan.53a537a1.34fd6b4f.38b@Arunkumars-MacBook-Pro.local>
References: <53A536CC.3050501@gmail.com>
 <etPan.53a537a1.34fd6b4f.38b@Arunkumars-MacBook-Pro.local>
Message-ID: <53A53829.8000503@gmail.com>

Great, thanks a lot for the clarification; I thought I was going crazy.

Cheers,
M

On 06/21/2014 03:43 PM, Arunkumar Srinivasan wrote:
> Michael,
> You should be using `CJ`. This is no different from the post from
> Garrett
> See http://lists.r-forge.r-project.org/pipermail/datatable-help/2014-June/002619.html just
> last week, IIUC.
> 
> Arun
> 
> From: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>
> Reply: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>
> Date: June 21, 2014 at 9:40:17 AM
> To: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject: [datatable-help] Self-Join: Potential Bug?
> 
>> I'm getting a warning when I run the following code in 1.9.2,
>> dev-1.9.3-master, and dev-1.9.3-issue_700 (b/c I thought it looks
>> similar to that issue, but it turns out it's different). In contrast, I
>> do not get this warning in 1.8.10.
>>
>> Not sure whether this is a bug or whether I'm missing something:
>>
>> X <- data.table(
>> structure(list(ID = c(45063L, 45066L, 45172L), date = structure(c(14548,
>> 14487, 14395), class = "Date"), price = c(17.56, 12.49, 10.04
>> )), .Names = c("ID", "date", "price"), row.names = c(NA, -3L), class =
>> "data.frame"),
>> key = "ID,date")
>> X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]
>>
>>
>> The data and the warning message look like this:
>>
>> > X
>> ID date price
>> 1: 45063 2009-10-31 17.56
>> 2: 45066 2009-08-31 12.49
>> 3: 45172 2009-05-31 10.04
>> > X[J(unique(ID), as.Date(c("2009-05-31", "2010-05-31")))]
>> ID date price
>> 1: 45063 2009-05-31 NA
>> 2: 45066 2010-05-31 NA
>> 3: 45172 2009-05-31 10.04
>> Warning message:
>> In as.data.table.list(i) :
>> Item 2 is of size 2 but maximum size is 3 (recycled leaving a
>> remainder of 1 items)
>>
>>
>> Thanks,
>> M
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>

From aragorn168b at gmail.com  Tue Jun 24 01:54:49 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 24 Jun 2014 01:54:49 +0200
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <53A4F9A4.1090808@gmail.com>
References: <5389541B.8040006@gmail.com>
 <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>
 <539D0C8F.1080005@gmail.com>
 <etPan.53a4abe8.3db012b3.38b@Arunkumars-MacBook-Pro.local>
 <etPan.53a4d0f7.5b25ace2.38b@Arunkumars-MacBook-Pro.local>
 <53A4F9A4.1090808@gmail.com>
Message-ID: <etPan.53a8be49.f819e7f.38b@Arunkumars-MacBook-Pro.local>

I've gone ahead and fixed [#696](https://github.com/Rdatatable/data.table/issues/696) to be consistent with base, even though I think this is not necessary in almost all cases. One could either do:

DT[order(DT[["a"]])]

Or simply use copy along with setorderv:

setorderv(copy(DT), cols="a", order=1L, na.last=FALSE)

I find the latter much cleaner, and it can be used if one wants to reorder by reference as well, by just removing copy.
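
As a rough sketch of how the options from this thread fit into the original loop (the loop variable `col` and the result names are just illustrative), all three should give the same ordering:

```r
library(data.table)

DT <- data.table(a = 1:4, b = 8:5)

for (col in c("a", "b")) {
  by_get  <- DT[order(get(col))]                     # Garrett's get() suggestion
  by_vec  <- DT[order(DT[[col]])]                    # extract the column as a vector
  by_copy <- setorderv(copy(DT), cols = col, order = 1L)  # reorder a copy by reference
  stopifnot(identical(by_get$a, by_vec$a),
            identical(by_vec$a, by_copy$a))
}

by_vec  # result of the last iteration, ordered by column b
```

Because setorderv is applied to copy(DT), the original DT is left in its initial row order.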


Arun

From: Michael Smith my.r.help at gmail.com
Reply: Michael Smith my.r.help at gmail.com
Date: June 21, 2014 at 5:19:04 AM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] `with=F` in the `i` Argument

Hi Arun,  

If `is.object` gives `FALSE` and you just have a list, you could wrap it  
in `unlist` as follows. It gives the same result for your cases. (These  
are just my two cents, maybe someone else has a different opinion.)  

require(data.table)  
DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))  

## Case A.  
DT[base::order(DT[, "x", with=FALSE])] # OK.  
## Case B.  
DT[base::order(list(x))] # Not OK.  
DT[base::order(unlist(list(x)))] # Same as case A.  

# Case C.  
DT[base::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])] # OK.  
# Case D.  
DT[base::order(list(x), list(y))] # Not OK.  
DT[base::order(unlist(list(x), list(y)))] # Same as case C.  

## Case E.  
DT[base::order(DT[, c("x", "y"), with=FALSE])] # Pads NA for `y`.  
## Case F.  
DT[base::order(list(x, y))] # Not OK.  
DT[base::order(unlist(list(x, y)))] # Same as case E.  


Thanks,  
M  


On 06/21/2014 08:25 AM, Arunkumar Srinivasan wrote:  
> Michael,  
>  
> Note that in your case, you can also do:  
>  
> |DT <- data.table(a = 1:4, b = 8:5)  
> for (i in c("a", "b"))  
> DT[order(DT[[i]])]  
> |  
>  
> At the moment, I'm more inclined towards giving an error when any of the
> arguments to |order(.)| results in a |list|. The message could be  
> something like:  
>  
> |DT[order(.)] on data.tables is optimised internally to use data.table's fast ordering. Since the behaviour of base:::order seems inconsistent in the way it handles list input - for ex: compare DT[order(list(x))] and DT[order(data.table(x))], we do not support list columns as input here. If you're sure, you can use `DT[base:::order(.)]` explicitly. However, this can be avoided most of the times by using `[[` to access specified columns to result in a vector.  
> |  
>  
> What do you (all) think?  
>  
>  
> Arun  
>  
> From: Arunkumar Srinivasan aragorn168b at gmail.com  
> <mailto:aragorn168b at gmail.com>  
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com  
> <mailto:aragorn168b at gmail.com>  
> Date: June 20, 2014 at 11:47:22 PM  
> To: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>  
> Cc: datatable-help at lists.r-forge.r-project.org  
> datatable-help at lists.r-forge.r-project.org  
> <mailto:datatable-help at lists.r-forge.r-project.org>  
> Subject: Re: [datatable-help] `with=F` in the `i` Argument  
>  
>> This is a really tricky one. I was just trying to fix it when I  
>> recollected the issues with |base:::order| from the time during  
>> implementation.  
>>  
>> Consider this case:  
>>  
>> |require(data.table)  
>> DT <- data.table(x=c(1,4,3,2), y=c(8,6,5,7), z=c(10,12,11,9))  
>> |  
>>  
>> Consider the cases A and B below:  
>>  
>> |# case A  
>> DT[base:::order(DT[, "x", with=FALSE])]  
>> # x y z  
>> # 1: 1 8 10  
>> # 2: 2 7 9  
>> # 3: 3 5 11  
>> # 4: 4 6 12  
>> |  
>>  
>> Intended right result. Great!  
>>  
>>  
>> B:  
>>  
>> |# case B  
>> DT[base:::order(list(x))]  
>> # x y z  
>> # 1: 1 8 10  
>> |  
>>  
>> What just happened?!? So, basically if the list gives |TRUE| for  
>> |is.object(.)|, it understands what the operation is, correctly. But if
>> it's /just/ a list, it has no idea how to deal with it. Also, it silently
>> returns an undesirable result (imo).
>>  
>> Similar to the above cases, compare these two:  
>>  
>> |# case C  
>> DT[base:::order(DT[, "x", with=FALSE], DT[, "y", with=FALSE])]  
>> # vs  
>> # case D  
>> DT[base:::order(list(x), list(y))]  
>> |  
>>  
>>  
>> Even more crazy case:  
>>  
>> |# case E  
>> DT[base:::order(DT[, c("x", "y"), with=FALSE])]  
>> # vs  
>> # case F  
>> DT[base:::order(list(x,y))]  
>> |  
>>  
>> While we were testing and implementing |forder|, it obviously didn't
>> occur to us to check the argument to |order(.)| with a |data.table|.
>> And in spite of the fact that the output for |DT[order(list(x))]| is a  
>> bit strange and even dangerous, to be consistent with |base:::order|,  
>> we had implemented it the same way.  
>>  
>> Now, I'm not so sure. Any ideas justifying these differences?
>>  
>>  
>>  
>> Arun  
>>  
>> From: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>  
>> Reply: Michael Smith my.r.help at gmail.com <mailto:my.r.help at gmail.com>  
>> Date: June 15, 2014 at 5:02:46 AM  
>> To: G See gsee000 at gmail.com <mailto:gsee000 at gmail.com>  
>> Cc: datatable-help at lists.r-forge.r-project.org  
>> datatable-help at lists.r-forge.r-project.org  
>> <mailto:datatable-help at lists.r-forge.r-project.org>  
>> Subject: Re: [datatable-help] `with=F` in the `i` Argument  
>>  
>>> Devs,  
>>>  
>>> Is this a bug? It works in 1.9.2 but not in the 1.9.3 development  
>>> version:  
>>>  
>>> DT <- data.table(a = 1:4, b = 8:5)  
>>> for (i in c("a", "b"))  
>>> print(DT[order(DT[, i, with = FALSE])])  
>>>  
>>> Error in forder(DT, DT[, i, with = FALSE]) :  
>>> Column '1' is type 'list' which is not supported for ordering currently.  
>>>  
>>>  
>>> Thanks,  
>>>  
>>> M  
>>>  
>>>  
>>> On 05/31/2014 12:44 PM, G See wrote:  
>>> > Hi Michael,  
>>> >  
>>> > I would use get()  
>>> >  
>>> > DT <- data.table(a = 1:4, b = 8:5)  
>>> > for (i in c("a", "b"))  
>>> > print(DT[order(get(i))])  
>>> >  
>>> > For what it's worth, your solution doesn't seem to work in data.table  
>>> > 1.9.3 (svn rev. 1278):  
>>> >  
>>> >> for (i in c("a", "b"))  
>>> > + print(DT[order(DT[, i, with = FALSE])])  
>>> > Error in forder(DT, DT[, i, with = FALSE]) :  
>>> > Column '1' is type 'list' which is not supported for ordering currently.  
>>> >  
>>> >  
>>> > HTH,  
>>> > Garrett  
>>> >  
>>> > On Fri, May 30, 2014 at 11:01 PM, Michael Smith <my.r.help at gmail.com> wrote:  
>>> >> All,  
>>> >>  
>>> >> I'm trying to order the rows according to several columns at a time:  
>>> >>  
>>> >> DT <- data.table(a = 1:4, b = 8:5)  
>>> >> for (i in c("a", "b"))  
>>> >> print(DT[order(i), with = FALSE])  
>>> >>  
>>> >> It doesn't work, since `with` seems to be about the `j` argument, but  
>>> >> not the `i` argument, according to `?data.table`.  
>>> >>  
>>> >> I found the following workaround, but wonder whether there is a more  
>>> >> elegant way to do it:  
>>> >>  
>>> >> for (i in c("a", "b"))  
>>> >> print(DT[order(DT[, i, with = FALSE])])  
>>> >>  
>>> >> Thanks,  
>>> >> M  
>>> >> _______________________________________________  
>>> >> datatable-help mailing list  
>>> >> datatable-help at lists.r-forge.r-project.org  
>>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  
>>> _______________________________________________  
>>> datatable-help mailing list  
>>> datatable-help at lists.r-forge.r-project.org  
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  
>  

From ronin78 at gmail.com  Thu Jun 26 22:56:40 2014
From: ronin78 at gmail.com (Matthew DeAngelis)
Date: Thu, 26 Jun 2014 16:56:40 -0400
Subject: [datatable-help] Efficiently checking value of other row in
	data.table
Message-ID: <CAMjp+0exGfWd9Pb7ZQ5WjaoehRR2qUJq+ikLMYs3NpuuQUd0DA@mail.gmail.com>

Hello data.table gurus,

I have been using data.table to efficiently work with textual data and I
love it for that purpose. I have transformed my data so that it looks
something like this:

word         document  position
I            1         1
have         1         2
transformed  1         3
my           1         4
data         1         5
so           2         1
that         2         2
it           2         3
looks        2         4
something    2         5
like         2         6
this         2         7

(I actually use a unique number for each word, so that I am able to use
data.table's excellent features to do lightning-fast word counts. This has
revolutionized my workflow over looping through text files with Perl.)

My problem is that I sometimes need to search for phrases or to select
words based on their context (for instance, I may want to exclude a word if
it is preceded by "not" or followed by a word that changes its meaning).
Currently, I am using the solution here
<http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression>
to create a new column for a word in another position, like this:

word         document  position  lead_word
I            1         1         have
have         1         2         transformed
transformed  1         3         my
my           1         4         data
data         1         5         NA
so           2         1         that
that         2         2         it
it           2         3         looks
looks        2         4         something
something    2         5         like
like         2         6         this
this         2         7         NA

using a command like: DT[,lead_word:=DT[list(document,position+1),word]].

This approach has two problems, however. First, it consumes more resources
as the dataset grows. I am currently working with a file containing over
150 million rows, so adding a column is costly. Second, I may want to check
both one and two words ahead, so that I have to add two columns, and this
can quickly get out of hand.

Is there a better way to use data.table to check the value in a row N
distance from the row of interest within a group and select a row based on
that value? Perhaps the .I variable could be useful here?

I appreciate any suggestions.


Regards,
Matt

From mdowle at mdowle.plus.com  Fri Jun 27 21:17:18 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Fri, 27 Jun 2014 20:17:18 +0100
Subject: [datatable-help] Efficiently checking value of other row in
	data.table
In-Reply-To: <CAMjp+0exGfWd9Pb7ZQ5WjaoehRR2qUJq+ikLMYs3NpuuQUd0DA@mail.gmail.com>
References: <CAMjp+0exGfWd9Pb7ZQ5WjaoehRR2qUJq+ikLMYs3NpuuQUd0DA@mail.gmail.com>
Message-ID: <53ADC33E.7040509@mdowle.plus.com>


Hi,

Not sure exactly what you need but looks interesting.

Something a bit like this ?

DT[ word == "good", .SD[ lag(word, N) != "not" ],  by=document]

Your idea being you don't want to have to repeat all the pre and post 
words alongside each word but rather express it in the query. Makes 
sense.   Leads to classifying "not good" and "not very good" as both 
negative phrases I guess.
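
A runnable sketch of that idea, under some assumptions. stats::lag() does not shift a plain vector, so this uses shift(), which exists only in later data.table releases (1.9.6 onwards) and was not available when this thread was written. The toy DT and N = 1 are made up for illustration. The filter is applied inside j rather than in i, because subsetting to the "good" rows first would hide the preceding words from the lag.

```r
library(data.table)

# Toy data (hypothetical); column names follow the original post.
DT <- data.table(
  word     = c("this", "is", "not", "good", "but", "that", "is", "good"),
  document = c(1, 1, 1, 1, 1, 2, 2, 2),
  position = c(1, 2, 3, 4, 5, 1, 2, 3)
)
setkey(DT, document, position)   # rows must be ordered by position within document

N <- 1L  # how many words back to look (assumption for this sketch)

# Row numbers of "good" whose N-th previous word in the same document is not "not".
# .I gives row numbers in DT, so no lag column is stored in the table.
keep <- DT[, .I[word == "good" &
                (is.na(shift(word, N)) | shift(word, N) != "not")],
           by = document]$V1

result <- DT[keep]
result
```

On this toy table only the "good" in document 2 is kept; the one in document 1 is dropped because it follows "not".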

Matt



From ronin78 at gmail.com  Sat Jun 28 11:55:12 2014
From: ronin78 at gmail.com (Matthew DeAngelis)
Date: Sat, 28 Jun 2014 05:55:12 -0400
Subject: [datatable-help] Efficiently checking value of other row in
	data.table
In-Reply-To: <53ADC33E.7040509@mdowle.plus.com>
References: <CAMjp+0exGfWd9Pb7ZQ5WjaoehRR2qUJq+ikLMYs3NpuuQUd0DA@mail.gmail.com>
 <53ADC33E.7040509@mdowle.plus.com>
Message-ID: <CAMjp+0cc3d9r3mSnM5DRfd9ygTGQgDe2wdbWXOHc2rH-JR617Q@mail.gmail.com>

Hi Matt,

You have the right of it. The problem is somewhat complicated, however,
since I would want to substitute "DT[word=="good"..." with
"DT[J("good")..." after setting the key to word and reordering the rows.
Hence the two-step process I have now where I key by document and position
first, create the lag_word column, key by the word and lag_word columns and
query by row.
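
The two-step process described above might look roughly like this on a toy table. The data and the helper name prev are made up for illustration; this is a sketch of the approach, not the exact code in use.

```r
library(data.table)

# Toy data (hypothetical); column names follow the original post.
DT <- data.table(
  word     = c("this", "is", "not", "good", "that", "is", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  position = c(1L, 2L, 3L, 4L, 1L, 2L, 3L)
)

# Step 1: key by document and position, then look up the word one position back.
setkey(DT, document, position)
prev <- DT[list(DT$document, DT$position - 1L), word]  # NA where there is no previous word
DT[, lag_word := prev]

# Step 2: re-key by word and lag_word so phrases can be found by keyed lookup.
setkey(DT, word, lag_word)
is_good  <- DT[J("good", "is")]    # "is good"
not_good <- DT[J("good", "not")]   # "not good"
```

The cost is visible here: lag_word is a full extra column, and DT is physically reordered twice.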


Matt



From mdowle at mdowle.plus.com  Sun Jun 29 00:00:58 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Sat, 28 Jun 2014 23:00:58 +0100
Subject: [datatable-help] Efficiently checking value of other row in
	data.table
In-Reply-To: <CAMjp+0cc3d9r3mSnM5DRfd9ygTGQgDe2wdbWXOHc2rH-JR617Q@mail.gmail.com>
References: <CAMjp+0exGfWd9Pb7ZQ5WjaoehRR2qUJq+ikLMYs3NpuuQUd0DA@mail.gmail.com>	<53ADC33E.7040509@mdowle.plus.com>
 <CAMjp+0cc3d9r3mSnM5DRfd9ygTGQgDe2wdbWXOHc2rH-JR617Q@mail.gmail.com>
Message-ID: <53AF3B1A.30308@mdowle.plus.com>


Hi Matt,

Great.  If you can prepare some dummy data with the appropriate 
properties and a parameter or two to scale up the size (or just provide 
an online large example to download) and a query that gets to the right 
answer but is slow or ugly,   then we've got something to chew on ...

Matt


From ggrothendieck at gmail.com  Sun Jun 29 22:58:50 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 29 Jun 2014 16:58:50 -0400
Subject: [datatable-help] by row
Message-ID: <CAP01uRmhak-JuS4vzi0--yiUb0z=0bUSFfMYDDhOP-AEccmD=g@mail.gmail.com>

There was some discussion of an .EACHI facility for data.table.  Not
sure what happened about that but I have an example that might be
useful:

http://stackoverflow.com/questions/24472254/splitting-a-column-by-factor-within-a-data-frame/24472571#24472571

which shows the code where DT has columns v1, v2 and v3:

DT[, split(v2, v1), by = names(DT)]

It works well if the rows of DT are unique but if they are not then
one must do something ugly like appending a uniquifying column of
1:nrow(DT), say, and then including that in by and then finally
removing it again at the end.

This suggests two features:

1. The ability to tell it to do the by by row
2. The ability to selectively omit by variables from the output

For example, if one could use a pseudo column .I and if -.I meant do
not include it in the output then one could write:

DT[, split(v2, v1), by = c(names(DT), -.I)]

Other syntaxes may be thought of too and the main suggestion here is
the possible need for these features rather than the specific syntax.

(By the way, is there an intention to move to the issue system on
github for things like this?)

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Sun Jun 29 23:39:01 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 29 Jun 2014 23:39:01 +0200
Subject: [datatable-help] by row
In-Reply-To: <CAP01uRmhak-JuS4vzi0--yiUb0z=0bUSFfMYDDhOP-AEccmD=g@mail.gmail.com>
References: <CAP01uRmhak-JuS4vzi0--yiUb0z=0bUSFfMYDDhOP-AEccmD=g@mail.gmail.com>
Message-ID: <etPan.53b08775.3c5991aa.e9e5@Arunkumars-MacBook-Pro.local>

Hi,

You write: There was some discussion of an .EACHI facility for data.table. Not sure what happened about that but I have an example that might be useful: http://stackoverflow.com/questions/24472254/splitting-a-column-by-factor-within-a-data-frame/24472571#24472571

by=.EACHI was implemented to remove the implicit "by-without-by" feature during joins. That was implemented quite some time back; check the first FR listed in the README, following which Matt also posted on the mailing list asking for feedback.

You write: which shows the code where DT has columns v1, v2 and v3: DT[, split(v2, v1), by = names(DT)]

A small comment on this solution per se: it calls split for each row! I'd approach this a little differently:

## 1.9.3
rbindlist(setDT(dd)[, {  
              ans = list(v2);  
              setattr(ans, 'names', v1);  
              list(list(ans))
              }, by = list(v1=as.character(v1))
           ]$V1,  
fill=TRUE)

#     a  b
# 1:  1 NA
# 2:  2 NA
# 3:  6 NA
# 4: NA  3
# 5: NA  4
# 6: NA  5

We can then add this back to dd by reference. Personally I've never had to call split on a data.table.

You write: It works well if the rows of DT are unique but if they are not then one must do something ugly like appending a uniquifying column of 1:nrow(DT), say, and then including that in by and then finally removing it again at the end.

This suggests two features:

1. The ability to tell it to do the by by row
2. The ability to selectively omit by variables from the output

Not sure I follow this entirely, but by= does accept expressions. So, you could do:

dd[, split(v2,v1), by=1:nrow(dd)]
#    nrow  a  b
# 1:    1  1 NA
# 2:    2  2 NA
# 3:    3  6 NA
# 4:    4 NA  3
# 5:    5 NA  4
# 6:    6 NA  5

You write: (By the way, is there an intention to move to the issue system on github for things like this?)

All of the issues from R-Forge have already been moved to GitHub, including feature requests, and since then users have filed new FRs/bugs there. So, yes, you can file FRs directly, although in this case I think the feature already exists (IIUC)?


Arun


From ggrothendieck at gmail.com  Mon Jun 30 01:48:33 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Sun, 29 Jun 2014 19:48:33 -0400
Subject: [datatable-help] Finding Rdatatable/datatable
Message-ID: <CAP01uR=Gw2Gx-DVV+eWRVUY=7utZaeYf78-nZV1NnrqUL=nF3g@mail.gmail.com>

Googling for
     github data.table
gets one to:
    https://github.com/arunsrinivasan/datatable
and the DESCRIPTION file and
the CRAN page (http://cran.r-project.org/package=data.table)
both point to R-Forge so the github Rdatatable/datatable page is not so
easy to find.

(I had previously been using the one google leads you to which is why I
could not find the issues.)

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Mon Jun 30 02:04:11 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 02:04:11 +0200
Subject: [datatable-help] Finding Rdatatable/datatable
In-Reply-To: <CAP01uR=Gw2Gx-DVV+eWRVUY=7utZaeYf78-nZV1NnrqUL=nF3g@mail.gmail.com>
References: <CAP01uR=Gw2Gx-DVV+eWRVUY=7utZaeYf78-nZV1NnrqUL=nF3g@mail.gmail.com>
Message-ID: <etPan.53b0a97b.2b0d8dbe.e9e5@Arunkumars-MacBook-Pro.local>

I'm sorry about that and am not sure what to do about it. Hopefully Rdatatable/data.table will get more hits and turn up as the top hit soon. It's been only a few days since the transition.

Arun


From ronin78 at gmail.com  Mon Jun 30 15:24:00 2014
From: ronin78 at gmail.com (Matthew DeAngelis)
Date: Mon, 30 Jun 2014 09:24:00 -0400
Subject: [datatable-help] Efficiently checking value of other row in
	data.table
In-Reply-To: <53AF3B1A.30308@mdowle.plus.com>
References: <CAMjp+0exGfWd9Pb7ZQ5WjaoehRR2qUJq+ikLMYs3NpuuQUd0DA@mail.gmail.com>
 <53ADC33E.7040509@mdowle.plus.com>
 <CAMjp+0cc3d9r3mSnM5DRfd9ygTGQgDe2wdbWXOHc2rH-JR617Q@mail.gmail.com>
 <53AF3B1A.30308@mdowle.plus.com>
Message-ID: <CAMjp+0e7bTGmqqagA-n_hKBM_3cWhZ2QKV=VG7K14EX56ih1jA@mail.gmail.com>

Hi Matt,

Thanks for the suggestion. I am placing an example below that I hope
illustrates the problem more clearly. Please let me know if I can provide
additional detail or clarification.


Regards,
Matt


First we create a dummy dataset with ten documents containing one million
words. There are three unique words in the set.

library(data.table)
options(scipen=2)
set.seed(1000)
DT<-data.table(wordindex=sample(1:3,1000000,replace=T),
               docindex=sample(1:10,1000000,replace=T))
setkey(DT,docindex)
DT[,position:=seq.int(1:.N),by=docindex]

##          wordindex docindex position
##       1:         1        1        1
##       2:         1        1        2
##       3:         3        1        3
##       4:         3        1        4
##       5:         1        1        5
##      ---
##  999996:         2       10    99811
##  999997:         2       10    99812
##  999998:         3       10    99813
##  999999:         1       10    99814
## 1000000:         3       10    99815

This is a query to count the occurrences of the first unique word across
all documents. It is also beautiful.

setkey(DT,wordindex)
count<-DT[J(1),list(count.1=.N),by=docindex]
count

##     docindex count.1
##  1:        1   33533
##  2:        2   33067
##  3:        3   33538
##  4:        4   33053
##  5:        5   33231
##  6:        6   33002
##  7:        7   33369
##  8:        8   33353
##  9:        9   33485
## 10:       10   33225

It gets messier when we have to take the position ahead into account. This
is a query to count the occurrences of the first unique word across all
documents UNLESS it is followed by the second unique word. We create a new
column containing the word one position ahead and then key on both words.

setkey(DT,docindex,position)
DT[,lead_wordindex:=DT[list(docindex,position+1)][,wordindex]]

##          wordindex docindex position lead_wordindex
##       1:         1        1        1              1
##       2:         1        1        2              3
##       3:         3        1        3              3
##       4:         3        1        4              1
##       5:         1        1        5              2
##      ---
##  999996:         2       10    99811              2
##  999997:         2       10    99812              3
##  999998:         3       10    99813              1
##  999999:         1       10    99814              3
## 1000000:         3       10    99815             NA

setkey(DT,wordindex,lead_wordindex)
countr2<-DT[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2

##     docindex count.1
##  1:        1   22301
##  2:        2   21835
##  3:        3   22490
##  4:        4   21830
##  5:        5   22218
##  6:        6   21914
##  7:        7   22370
##  8:        8   22265
##  9:        9   22211
## 10:       10   22190

I have a very large dataset for which the above query fails with a memory
allocation error. As an alternative, we can create this new column for only
the relevant subset of data by filtering the original dataset and then
joining it back on the desired position:

setkey(DT,wordindex)
filter<-DT[J(1),list(wordindex,docindex,position)]
filter[,lead_position:=position+1]

##         wordindex wordindex docindex position lead_position
##      1:         1         1        2    99717         99718
##      2:         1         1        3    99807         99808
##      3:         1         1        4   100243        100244
##      4:         1         1        1        1             2
##      5:         1         1        1       42            43
##     ---
## 332852:         1         1       10    99785         99786
## 332853:         1         1       10    99787         99788
## 332854:         1         1       10    99798         99799
## 332855:         1         1       10    99804         99805
## 332856:         1         1       10    99814         99815

setkey(DT,docindex,position)
filter[,lead_wordindex:=DT[J(filter[,list(docindex,lead_position)])][,wordindex]]

##         wordindex wordindex docindex position lead_position lead_wordindex
##      1:         1         1        2    99717         99718             NA
##      2:         1         1        3    99807         99808             NA
##      3:         1         1        4   100243        100244             NA
##      4:         1         1        1        1             2              1
##      5:         1         1        1       42            43              1
##     ---
## 332852:         1         1       10    99785         99786              3
## 332853:         1         1       10    99787         99788              3
## 332854:         1         1       10    99798         99799              3
## 332855:         1         1       10    99804         99805              3
## 332856:         1         1       10    99814         99815              3

setkey(filter,wordindex,lead_wordindex)
countr2.1<-filter[J(c(1,1),c(1,3)),list(count.1=.N),by=docindex]
countr2.1

##     docindex count.1
##  1:        1   22301
##  2:        2   21835
##  3:        3   22490
##  4:        4   21830
##  5:        5   22218
##  6:        6   21914
##  7:        7   22370
##  8:        8   22265
##  9:        9   22211
## 10:       10   22190

Pretty ugly, I think. In addition, we may want to look more than one word
ahead. We have to create yet another column. The easy but costly way is:

setkey(DT,docindex,position)
DT[,lead_lead_wordindex:=DT[list(docindex,position+2)][,wordindex]]

##          wordindex docindex position lead_wordindex lead_lead_wordindex
##       1:         1        1        1              1                   3
##       2:         1        1        2              3                   3
##       3:         3        1        3              3                   1
##       4:         3        1        4              1                   2
##       5:         1        1        5              2                   3
##      ---
##  999996:         2       10    99811              2                   3
##  999997:         2       10    99812              3                   1
##  999998:         3       10    99813              1                   3
##  999999:         1       10    99814              3                  NA
## 1000000:         3       10    99815             NA                  NA

setkey(DT,wordindex,lead_wordindex,lead_lead_wordindex)
countr23<-DT[J(1,2,3),list(count.1=.N),by=docindex]
countr23

##     docindex count.1
##  1:        1    3684
##  2:        2    3746
##  3:        3    3717
##  4:        4    3727
##  5:        5    3700
##  6:        6    3779
##  7:        7    3702
##  8:        8    3756
##  9:        9    3702
## 10:       10    3744

However, I currently have to use the ugly filter-and-join way because of
size.

So the question is, is there an easier and more beautiful way?
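
One possible direction, offered as a sketch rather than a tested answer for 150 million rows: compute the lead inside j with shift(), so that no lead column is ever stored in DT. shift() is an assumption here, since it exists only in later data.table releases (1.9.6 onwards). The table below is a scaled-down copy of the dummy data (n = 1000 is arbitrary). Temporary vectors are still allocated per group, so the memory saving is not guaranteed.

```r
library(data.table)

set.seed(1000)
n <- 1000L  # small size for illustration; scale up as needed
DT <- data.table(wordindex = sample(1:3, n, replace = TRUE),
                 docindex  = sample(1:10, n, replace = TRUE))
setkey(DT, docindex)
DT[, position := seq_len(.N), by = docindex]
setkey(DT, docindex, position)  # shift() relies on rows being in position order

# Word 1 followed by word 1 or 3 (that is, not followed by word 2), per document.
# %in% is FALSE for the NA lead at the end of each document.
countr2 <- DT[, list(count.1 = sum(wordindex == 1L &
                       shift(wordindex, 1L, type = "lead") %in% c(1L, 3L))),
              by = docindex]

# Word 1, then word 2, then word 3, per document.
countr23 <- DT[, list(count.1 = sum(wordindex == 1L &
                        shift(wordindex, 1L, type = "lead") == 2L &
                        shift(wordindex, 2L, type = "lead") == 3L,
                        na.rm = TRUE)),
               by = docindex]
```

Looking further ahead only adds another shift() term to the condition instead of another column and another key.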



From macrakis at alum.mit.edu  Mon Jun 30 17:37:56 2014
From: macrakis at alum.mit.edu (Stavros Macrakis (Σταῦρος Μακράκης))
Date: Mon, 30 Jun 2014 11:37:56 -0400
Subject: [datatable-help] Speeding up column references with roll
Message-ID: <CACLVabX13hU4SzPqx4tbMdLU_C00vhqWaOCm9Vn1ODuRPu7Zxg@mail.gmail.com>

In the following example, it is about 15-25% faster to use setnames rather
than j=list(name=var). Is there some better approach to referencing the
other joined column when using roll?

# Use j=list(name=var)
calc1 <- function(d) {
  d[ hit==1
   ][ d,list(hittime=time),roll=-20
   ][ !is.na(hittime)
   ]
}

# Use setnames
calc2 <- function(d) {
  temp <- d[ hit==1
           ][ d,time,roll=-20
           ]
  setnames(temp,3,"hittime")
  temp[!is.na(hittime)]
}

# Generate sample data
library(data.table)
set.seed(12312391)
data <- data.table(
          group = sample(1e3,1e7,replace=TRUE),
          time = ceiling(runif(1e7, 0, 1e5)),
          hit = rbinom(1e7, 1, prob = 0.1),
  key=c("group","time"))

# Timing

system.time(replicate(10,{gc();calc1(data)}))  # => 69 sec
system.time(replicate(10,{gc();calc2(data)}))  # => 52 sec

From macrakis at alum.mit.edu  Mon Jun 30 18:06:41 2014
From: macrakis at alum.mit.edu (=?UTF-8?B?U3RhdnJvcyBNYWNyYWtpcyAozqPPhM6x4b+mz4HOv8+CIM6czrHOus+BzqzOus63z4Ip?=)
Date: Mon, 30 Jun 2014 12:06:41 -0400
Subject: [datatable-help] i = !x different from i = (!x)
Message-ID: <CACLVabXV76W0O72tvSkzs5NOM6+sSX0xUyG34hX-JaiBGBdGXQ@mail.gmail.com>

DT 1.9.2

t1 <- data.table(a=1:2,b=0:1,key="a")

t1[b==0]   # => row 1, OK
t1[!b]     # => ERROR "object 'b' not found" ??
t1[(!b)]   # => row 1, OK

Shouldn't !b be equivalent to (!b)? They are both expressions, not symbols.

         -s

From macrakis at alum.mit.edu  Mon Jun 30 18:30:45 2014
From: macrakis at alum.mit.edu (=?UTF-8?B?U3RhdnJvcyBNYWNyYWtpcyAozqPPhM6x4b+mz4HOv8+CIM6czrHOus+BzqzOus63z4Ip?=)
Date: Mon, 30 Jun 2014 12:30:45 -0400
Subject: [datatable-help] Error corrupts tables
Message-ID: <CACLVabVOAcJmLjU9OyLpjW+54AgyX-4Ge7aBGqbC8u6rmJxvrQ@mail.gmail.com>

> library(data.table)
data.table 1.9.2  For help type: help("data.table")
> test <- data.table(a=1:10,b=1:10%%6==0,key="a")
> test[b==1][test,b,roll=2]
Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed

Not sure what the error is there, but even worse...

> test
Error in if (!is.null(ns)) ns else tryCatch(loadNamespace(name), error =
function(e) { :
  missing value where TRUE/FALSE needed

> tables()
Error in if (!is.null(ns)) ns else tryCatch(loadNamespace(name), error =
function(e) { :
  missing value where TRUE/FALSE needed

It looks like some data structure has been corrupted.

From aragorn168b at gmail.com  Mon Jun 30 18:45:57 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 18:45:57 +0200
Subject: [datatable-help] Error corrupts tables
In-Reply-To: <CACLVabVOAcJmLjU9OyLpjW+54AgyX-4Ge7aBGqbC8u6rmJxvrQ@mail.gmail.com>
References: <CACLVabVOAcJmLjU9OyLpjW+54AgyX-4Ge7aBGqbC8u6rmJxvrQ@mail.gmail.com>
Message-ID: <etPan.53b19445.5675ff36.e9e5@Arunkumars-MacBook-Pro.local>

Fixed in 1.9.3: https://github.com/Rdatatable/data.table/commit/ddc1d23166932198ee826f8e66176266093b0b41

Arun


From aragorn168b at gmail.com  Mon Jun 30 18:50:32 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 18:50:32 +0200
Subject: [datatable-help] i = !x different from i = (!x)
In-Reply-To: <CACLVabXV76W0O72tvSkzs5NOM6+sSX0xUyG34hX-JaiBGBdGXQ@mail.gmail.com>
References: <CACLVabXV76W0O72tvSkzs5NOM6+sSX0xUyG34hX-JaiBGBdGXQ@mail.gmail.com>
Message-ID: <etPan.53b19558.3db012b3.e9e5@Arunkumars-MacBook-Pro.local>

Not at the moment, because of how `i` is checked. If `i` is a call whose first element is `!`, the `!` is removed and the rest is checked to see whether it is a "name", which is true for `t1[!b]`. A name in `i` is looked up in the calling scope by default, because `i` can be a data.table. That is, `X[!Y]` is intended for the case where `Y` is a data.table. Hence the difference between `X[!Y]` and `X[(!Y)]` at the moment.
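
The difference between the two forms can be seen by inspecting the unevaluated expressions in plain R. This is only an illustrative sketch of the kind of check described above, not data.table's actual source; the names `e1` and `e2` are made up for the example.

```r
# The two expressions as they would be captured, unevaluated, for `i`.
e1 <- quote(!b)      # a call whose function is `!`
e2 <- quote((!b))    # a call whose function is `(`

is.call(e1)                        # TRUE
identical(e1[[1]], as.name("!"))   # TRUE: the leading `!` can be detected and stripped
is.name(e1[[2]])                   # TRUE: what remains is a bare name, `b`

identical(e2[[1]], as.name("("))   # TRUE: the outer `(` hides the `!`
is.name(e2[[2]])                   # FALSE: the argument is the call `!b`, not a name
```

Because `(!b)` does not start with `!`, it is not treated as a not-join on an object named `b` and is evaluated as an ordinary logical expression on the column.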

But in 1.9.3, this'll get better: Have a look at https://github.com/Rdatatable/data.table/issues/697 and https://github.com/Rdatatable/data.table/issues/633

Arun


From aragorn168b at gmail.com  Mon Jun 30 19:00:17 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 19:00:17 +0200
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To: <CACLVabX13hU4SzPqx4tbMdLU_C00vhqWaOCm9Vn1ODuRPu7Zxg@mail.gmail.com>
References: <CACLVabX13hU4SzPqx4tbMdLU_C00vhqWaOCm9Vn1ODuRPu7Zxg@mail.gmail.com>
Message-ID: <etPan.53b197a1.5b25ace2.e9e5@Arunkumars-MacBook-Pro.local>

Once again, this has been fixed in 1.9.3. A join now requires an explicit `by=.EACHI` to perform a by-without-by.
https://github.com/Rdatatable/data.table/blob/master/README.md
Have a look at the first FR (by = .EACHI runs ...) that's been fixed in 1.9.3. It changes what a join returns; these changes have been discussed for quite some time and bring more consistency to the DT[i, j, by] syntax. Also have a look at the second FR and the links it points to for the discussions.

In general, it's better to test with the devel version (and have a look at README) for any bugs you may encounter.

Arun


From ggrothendieck at gmail.com  Mon Jun 30 20:21:18 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Mon, 30 Jun 2014 14:21:18 -0400
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To: <etPan.53b197a1.5b25ace2.e9e5@Arunkumars-MacBook-Pro.local>
References: <CACLVabX13hU4SzPqx4tbMdLU_C00vhqWaOCm9Vn1ODuRPu7Zxg@mail.gmail.com>
 <etPan.53b197a1.5b25ace2.e9e5@Arunkumars-MacBook-Pro.local>
Message-ID: <CAP01uR=qrgupcUUBaOGiATjkB=HHuLAfu_UMAw+JAXc8EpY5eg@mail.gmail.com>

On Mon, Jun 30, 2014 at 1:00 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Once again, has been fixed in 1.9.3. Now join requires `by=.EACHI`
> (explicit) to perform a by-without-by.
> https://github.com/Rdatatable/data.table/blob/master/README.md

The README would be easier to understand if DT were defined in it.
As it stands, none of the examples are runnable.

From ggrothendieck at gmail.com  Mon Jun 30 20:41:30 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Mon, 30 Jun 2014 14:41:30 -0400
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To: <CAP01uR=qrgupcUUBaOGiATjkB=HHuLAfu_UMAw+JAXc8EpY5eg@mail.gmail.com>
References: <CACLVabX13hU4SzPqx4tbMdLU_C00vhqWaOCm9Vn1ODuRPu7Zxg@mail.gmail.com>
 <etPan.53b197a1.5b25ace2.e9e5@Arunkumars-MacBook-Pro.local>
 <CAP01uR=qrgupcUUBaOGiATjkB=HHuLAfu_UMAw+JAXc8EpY5eg@mail.gmail.com>
Message-ID: <CAP01uRnRa0d1dzF8E4gX-4EfJQw7uCS0vU=uv5Z1ps7-iChVdA@mail.gmail.com>

One other comment. I wonder if .EACHI could mean by each row if there
were no join specified so this:

library(data.table)
DT <- data.table(
    v1 = factor(c("a", "a", "a", "b", "b", "b")),
    v2 = c(1, 1, 6, 3, 4, 5),
    v3 = c("a", "b", "c", "a", "b", "c"),
    stringsAsFactors=FALSE
)
DT[, c(.SD, split(v2, v1)), by = 1:nrow(DT)][, -1, with = FALSE]

could be written:

DT[, c(.SD, split(v2, v1)), by = .EACHI]

or maybe even:

DT[, split(v2, v1), by = c(names(DT), .EACHI)]


On Mon, Jun 30, 2014 at 2:21 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Mon, Jun 30, 2014 at 1:00 PM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>> Once again, has been fixed in 1.9.3. Now join requires `by=.EACHI`
>> (explicit) to perform a by-without-by.
>> https://github.com/Rdatatable/data.table/blob/master/README.md
>
> The README would be easier to understand if DT was not undefined in
> the README. As it stands none of the examples are runnable.


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From macrakis at alum.mit.edu  Mon Jun 30 22:40:24 2014
From: macrakis at alum.mit.edu (=?UTF-8?B?U3RhdnJvcyBNYWNyYWtpcyAozqPPhM6x4b+mz4HOv8+CIM6czrHOus+BzqzOus63z4Ip?=)
Date: Mon, 30 Jun 2014 16:40:24 -0400
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To: <etPan.53b197a1.5b25ace2.e9e5@Arunkumars-MacBook-Pro.local>
References: <CACLVabX13hU4SzPqx4tbMdLU_C00vhqWaOCm9Vn1ODuRPu7Zxg@mail.gmail.com>
 <etPan.53b197a1.5b25ace2.e9e5@Arunkumars-MacBook-Pro.local>
Message-ID: <CACLVabWdaHSy=Tpi=fgKJCbkCV3tgktNadEP1Db0yW92nL6oSg@mail.gmail.com>

OK, I'm retesting in 1.9.3, adding by=.EACHI. I don't see any significant
difference in the timings -- setnames is still 25% faster than
list(hittime=time). What exactly was fixed?

I also don't see any way to refer to the different time vs. hittime without
renaming the second time column.

You mention some FR's, but they're hard to find without the specific
numbers.

Where can I find the 1.9.3 reference manual? I think it would be easier to
understand for me than the incremental changes in the New Features
listings. On my system (MacOSX), build_vignettes=TRUE gives an error in
texi2dvi -- would that have generated the refman? If so, how do I fix that?

Thanks,

               -s



From aragorn168b at gmail.com  Mon Jun 30 23:34:36 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 30 Jun 2014 23:34:36 +0200
Subject: [datatable-help] Speeding up column references with roll
In-Reply-To: <CACLVabWdaHSy=Tpi=fgKJCbkCV3tgktNadEP1Db0yW92nL6oSg@mail.gmail.com>
References: <CACLVabX13hU4SzPqx4tbMdLU_C00vhqWaOCm9Vn1ODuRPu7Zxg@mail.gmail.com>
 <etPan.53b197a1.5b25ace2.e9e5@Arunkumars-MacBook-Pro.local>
 <CACLVabWdaHSy=Tpi=fgKJCbkCV3tgktNadEP1Db0yW92nL6oSg@mail.gmail.com>
Message-ID: <etPan.53b1d7ec.2c6e4afd.e9e5@Arunkumars-MacBook-Pro.local>

Your example doesn't work without allow.cartesian=TRUE.

You shouldn't be using by=.EACHI here. This `by` was implicit in earlier versions, which is what made it slow. Please re-read the README.

Here's the function I tested on 1.9.3:

calc1 <- function(d) {
    d[ hit==1][ d,list(hittime=time),roll=-20, allow.cartesian=TRUE][ !is.na(hittime)]
}

calc2 <- function(d) {
  temp <- d[ hit==1][ d,list(time),roll=-20, allow.cartesian=TRUE]
  setnames(temp,1,"hittime")
  temp[!is.na(hittime)]
}

# Generate sample data
set.seed(12312391)
data <- data.table(
          group = sample(1e3,1e7,replace=T),
          time = ceiling(runif(1e7, 0, 1e5)),
          hit = rbinom(1e7, 1, p = 0.1),
  key=c("group","time"))

system.time(ans1 <- calc1(data))
#   user  system elapsed  
#  2.083   0.189   2.344  
system.time(ans2 <- calc2(data))
#   user  system elapsed  
#  2.012   0.241   2.426  
identical(ans1, ans2) # [1] TRUE

You write:
> I also don't see any way to refer to the different time vs. hittime without renaming the second time column.

I don't quite follow what this means, but IIUC I think this is what you're referring to: https://github.com/Rdatatable/data.table/issues/471

You write:
> You mention some FR's, but they're hard to find without the specific numbers.

I was mentioning the first two points under NEW FEATURES within "Changes in v1.9.3": the one that starts with "by=.EACHI runs j for each group in x that each row of i joins to." and the one that starts with "Accordingly, X[Y, j] now does what X[Y][, j] did."
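
As a rough sketch of what those two points mean in practice, here is a toy comparison. The tables `X` and `Y` and their columns are made up for illustration, and the code assumes a data.table version with the 1.9.3 semantics (1.9.4 or later on CRAN).

```r
library(data.table)

# Hypothetical toy tables: X is keyed on id, Y supplies the ids to join on.
X <- data.table(id = c("a", "a", "b"), v = 1:3, key = "id")
Y <- data.table(id = c("a", "b"))

# Without by: j runs once over the whole join result, like X[Y][, sum(v)].
total <- X[Y, sum(v)]
total
# [1] 6

# With by = .EACHI: j runs once per row of Y, on the rows of X it joins to.
per_row <- X[Y, sum(v), by = .EACHI]
per_row$V1
# [1] 3 3
```

The second form is the old implicit by-without-by; it does more work per row of `i`, so it is only worth asking for when the per-row grouping is actually wanted.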

Maybe we should start numbering the fixes for easy reference. Will note it down.

You write:
> Where can I find the 1.9.3 reference manual?

This version is a development version. Necessary changes will be reflected in their corresponding ?... entry. And when we find some time, the introduction and FAQs will be updated. But that's not yet.

If you don't wish to keep up-to-date by looking at the NEWS, you'll have to wait until the next stable release on CRAN.

You write:
> On my system (MacOSX), build_vignettes=TRUE gives an error in texi2dvi -- would that have generated the refman? If so, how do I fix that?

I'm guessing it's a PDF latex error. If so, you'll have to install what the error message says is missing on your system. Sorry, can't help you much there.


Arun
