From sams.james at gmail.com  Thu May  1 06:40:44 2014
From: sams.james at gmail.com (James Sams)
Date: Wed, 30 Apr 2014 23:40:44 -0500
Subject: [datatable-help] internal FALSE/TRUE value has been modified
Message-ID: <5361D04C.2090509@gmail.com>

I don't really know what this error message means. A quick example to 
show what I'm seeing:

 > library(data.table)
data.table 1.9.3  For help type: help("data.table")
 > upc_table = data.table(upc=1:100000, upc_ver_uc=rep(c(1,2), 
times=50000), is_PL=rep(c(T, F, F, T), each=25000), 
product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
 > upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc, 
upc_ver_uc)]
Warning message:
In `[.data.table`(upc_table, , list(is_PL, product_module_code),  :
   internal TRUE value has been modified

When I continue using R, I eventually start getting more errors, such as:

Error in gettext(domain, unlist(args)) : invalid 'string' value
Error during wrapup: invalid 'string' value

and then terminal input/output becomes corrupted. I only start getting 
these error messages once I start using data.table; but the messages 
don't necessarily occur only with data.table functions.

I don't know if the last statement above is executing correctly or not. 
I'm rather confused as to what is going on. I was using a somewhat stale 
(maybe a couple of weeks old) svn version of data.table; but I see the 
same behavior with the latest data.table (r1263). I'm using CRAN's R 3.1 
package for Ubuntu on 13.10 and 14.04.


 > sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C LC_TIME=en_US.UTF-8        
LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C LC_ADDRESS=C               
LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

other attached packages:
[1] data.table_1.9.3

loaded via a namespace (and not attached):
[1] plyr_1.8.1    Rcpp_0.11.1   reshape2_1.4  stringr_0.6.2

-- 
James Sams
sams.james at gmail.com


From my.r.help at gmail.com  Thu May  1 14:42:56 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 01 May 2014 20:42:56 +0800
Subject: [datatable-help] Filtering Based on Previous Observation
In-Reply-To: <CAP01uRk8p-1Cv9-8ZdJq-ytqptQS2tDTgOBfha-k5qxC-L17Nw@mail.gmail.com>
References: <535FB180.2060209@gmail.com>
 <CAP01uRk8p-1Cv9-8ZdJq-ytqptQS2tDTgOBfha-k5qxC-L17Nw@mail.gmail.com>
Message-ID: <53624150.6000001@gmail.com>

Awesome, thanks to all of you who have replied. I learned some nice new
data.table/programming tricks!

M


On 04/30/2014 08:00 PM, Gabor Grothendieck wrote:
> On Tue, Apr 29, 2014 at 10:04 AM, Michael Smith <my.r.help at gmail.com> wrote:
>> All,
>>
>> Is there some data.table-idiomatic way to filter based on a previous
>> observation/row? For example, I want to remove a row if
>> DT$a[row]==DT$a[row-1].
>>
>> It could be done by first calculating the lag and then filtering based
>> on that, but I wonder if there's a more direct way.
>>
>> The following example works, but my feeling is there should be a more
>> elegant solution:
>>
>> ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) )
>> DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][]
> 
> If the unique elements always appear consecutively then the following
> would work.
> 
> (For example, if `a` were in ascending order (as in the example) or
> descending order then  that would be satisfied.  If DT were keyed
> on 'a' then this would always be the case.)
> 
> DT[ !duplicated(a) ]
> 
> Note that 'a' need not be numeric.
> 
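
For readers following the thread, here is a small sketch of the difference between the two approaches discussed above. The extra trailing row in `DT` and the helper vector `keep` are assumptions added for illustration; they are not from the original posts. `!duplicated(a)` drops every repeated value, while comparing each element with its predecessor drops only consecutive repeats.

```r
library(data.table)

# Illustrative data: the value 2 appears again after 3 (not consecutive).
DT <- data.table(a = c(1, 2, 2, 3, 2), b = 9:5)

# Drops every later occurrence of a value, consecutive or not.
by_duplicated <- DT[!duplicated(a)]

# Drops a row only when it equals the row directly before it.
# `keep` is a hypothetical helper name.
keep <- c(TRUE, DT$a[-1] != DT$a[-nrow(DT)])
by_previous <- DT[keep]

print(by_duplicated)
print(by_previous)
```

The two results agree only when equal values always sit next to each other, which is the condition stated in the reply above.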

From mdowle at mdowle.plus.com  Thu May  1 17:29:34 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Thu, 01 May 2014 16:29:34 +0100
Subject: [datatable-help] internal FALSE/TRUE value has been modified
In-Reply-To: <5361D04C.2090509@gmail.com>
References: <5361D04C.2090509@gmail.com>
Message-ID: <5362685E.1080303@mdowle.plus.com>


Reproduced, thanks for the nice example. Not sure yet, but what R 3.1 now 
does is store length-1 logical vectors once only, globally, for 
efficiency, to avoid many new allocations for the common case of single 
TRUE or FALSE values passed around at C or R level (a nice and welcome 
change).  Since data.table modifies vectors by reference, if that 
vector is length 1 then, as of R 3.1, a data.table bug could be modifying 
R's internal value of TRUE or FALSE whenever length-1 logical vectors 
occur. Clearly a serious bug. The test suite broke the day after the 
R-devel change was made (good), and that was one reason data.table 
was in error state in CRAN checks for quite a while before R 3.1 
shipped.  It was typically tests of 1-row data.tables that include a 
logical column and modify that logical column that broke. We fixed 
that and put in checks to detect and warn if R's internal value has 
been modified, just in case.  Those changes were in v1.9.2 on CRAN.  I 
think I wasn't 100% confident in the detection test (false positives) so 
made it a warning instead of an error.  Now that R 3.1 is out and we 
haven't had any false positives, it should be an error.

The feature of this upc_table is that all the groups are size 1 :

 > upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
[1] 1

If we change the example so that one group has more than 1 row, it works 
ok :

 > upc_table = data.table(upc=c(1:99998,1,1), upc_ver_uc=rep(c(1,2), 
times=50000), is_PL=rep(c(T, F, F, T), each=25000), 
product_module_code=rep(1:4, times=25000), ignore.column=2:100001)
 > upc_table[, .N, by=list(upc, upc_ver_uc)][,max(N)]
[1] 2
 > upc = upc_table[, list(is_PL, product_module_code), keyby=list(upc, 
upc_ver_uc)]

So it seems the problem is in the single allocation of working memory 
for the largest group when that's just 1 and contains a logical column.  
Odd, I would have sworn we caught that! Will fix.

R-devel are planning to do more of this small-object-sharing for common 
single integer values e.g. 0-10,  so we'll need to add more tests 
accordingly.
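
As background on "modifies vectors by reference", here is a minimal sketch of how `:=` behaves in a data.table release where this bug is fixed (that is an assumption; the sketch does not reproduce the bug). The object names `DT2` and `DT3` are illustrative only.

```r
library(data.table)

DT  <- data.table(flag = TRUE)   # 1-row table with a logical column
DT2 <- DT                        # a second name for the same object, not a copy
DT3 <- copy(DT)                  # an explicit, independent copy

DT[, flag := FALSE]              # := updates the table in place

print(DT2$flag)                  # follows DT, because it is the same object
print(DT3$flag)                  # unchanged, because it was copied first
```

Because updates happen in place, any vector that R itself shares globally must never be the one that gets written to, which is why the shared length-1 TRUE/FALSE values matter here.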

Thanks,
Matt




From harishv_99 at yahoo.com  Sat May  3 03:08:42 2014
From: harishv_99 at yahoo.com (Harish)
Date: Fri, 2 May 2014 18:08:42 -0700 (PDT)
Subject: [datatable-help] fread() coercion bug?
Message-ID: <1399079322.73291.YahooMailNeo@web120206.mail.ne1.yahoo.com>

I was trying to use fread() to read data when I got the following error which made no sense:

In fread(paste0(strData, collapse = "\n"), integer64 = "character") :
   Bumped column 2 to type character on data row 13, field contains '2464.77'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

because "2464.77" is a perfectly legitimate number and there is no reason to coerce the column to character for that.

Here is how to reproduce it:

   dtT <- data.table( a = 1:72, b=0 )
   dtT[ 13, b := 2464.77 ]

   strData <- capture.output( write.table( dtT, row.names=FALSE, quote=FALSE, sep="\t" ) )
   fread( paste0( strData, collapse="\n" ), integer64="character" )


Note that the following works okay without the integer64="character" argument:
   dtT <- data.table( a = 1:72, b=0 )
   dtT[ 13, b := 2464.77 ]

   strData <- capture.output( write.table( dtT, row.names=FALSE, quote=FALSE, sep="\t" ) )
   fread( paste0( strData, collapse="\n" ) )

I would appreciate if you could provide some sort of a workaround for this.  The reason I am using the integer64="character" argument is that I have large numbers at times which seems to be having issues once it is read as integer64 -- and that might have nothing to do with data.table but I have not had time to look into it.  My work-around for that issue was to read it as character, but I run into the above issue.


Thanks for your help.


Regards,
Harish

From rguy at 123mail.org  Sun May  4 08:00:48 2014
From: rguy at 123mail.org (Rguy)
Date: Sat, 3 May 2014 23:00:48 -0700 (PDT)
Subject: [datatable-help] A[B]?
Message-ID: <1399183248863-4689942.post@n4.nabble.com>

I am beginning to learn the data.table package. At the outset,
'data.table.pdf' states:

It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column
matrix.

I have used matrices in R but am unfamiliar with the A[B] syntax. When I
check the documentation for 'matrix' I find no discussion of such syntax. So
this "explanation" is in fact a black hole. Please tell your readers what
the package does in such a way that they are not sent on a wild goose chase.
For example:

The data.table package supports an A[B] syntax where A is a data table, B is
a 2 column data table, and the effect of the expression A[B] is...

What does A[B] accomplish?

Thanks.


--
View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com  Sun May  4 09:50:14 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sun, 04 May 2014 15:50:14 +0800
Subject: [datatable-help] A[B]?
In-Reply-To: <1399183248863-4689942.post@n4.nabble.com>
References: <1399183248863-4689942.post@n4.nabble.com>
Message-ID: <5365F136.8050807@gmail.com>

See FAQ 2.14
http://datatable.r-forge.r-project.org/datatable-faq.pdf


On 05/04/2014 02:00 PM, Rguy wrote:
> I am beginning to learn the data.table package. At the outset,
> 'data.table.pdf' states:
> 
> It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column
> matrix.
> 
> I have used matrices in R but am unfamiliar with the A[B] syntax. When I
> check the documentation for 'matrix' I find no discussion of such syntax. So
> this "explanation" is in fact a black hole. Please tell your readers what
> the package does in such a way that they are not sent on a wild goose chase.
> For example:
> 
> The data.table package supports an A[B] syntax where A is a data table, B is
> a 2 column data table, and the effect of the expression A[B] is...
> 
> What does A[B] accomplish?
> 
> Thanks.
> 
> 
> 
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 

From mdowle at mdowle.plus.com  Sun May  4 10:50:32 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Sun, 04 May 2014 09:50:32 +0100
Subject: [datatable-help] fread() coercion bug?
In-Reply-To: <1399079322.73291.YahooMailNeo@web120206.mail.ne1.yahoo.com>
References: <1399079322.73291.YahooMailNeo@web120206.mail.ne1.yahoo.com>
Message-ID: <5365FF58.2050402@mdowle.plus.com>


Reproduced, thanks. Can't think why that is, but will fix.  Please file 
as a bug so it's not forgotten.

In the meantime, setting the class manually for that column (colClasses 
argument) works in this example :

fread( paste0( strData, collapse="\n" ), integer64="character", 
colClasses=list(numeric="b"))

Is that workable for the full example?  I've used that syntax for 
colClasses so you can pass a vector of column names to be read as 
numeric more easily, if need be.
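
To illustrate the named-list form of colClasses with more than one column, here is a small sketch. The inline text `txt` and the extra column `c` are made up for this example and are not from the original report.

```r
library(data.table)

# Hypothetical tab-separated input; column c holds only whole numbers,
# so it would normally be detected as integer.
txt <- "a\tb\tc\n1\t0\t0\n2\t2464.77\t1\n"

# One list element per target type, each holding a vector of column names.
dt <- fread(txt, integer64 = "character",
            colClasses = list(numeric = c("b", "c")))

print(sapply(dt, class))
```

Columns not named in the list, such as `a` here, are still left to fread's own type detection.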

Matt




From carrieromichele at gmail.com  Mon May  5 04:43:39 2014
From: carrieromichele at gmail.com (Michele)
Date: Sun, 4 May 2014 19:43:39 -0700 (PDT)
Subject: [datatable-help] Roll + nomatch mixes result
Message-ID: <1399257819547-4689968.post@n4.nabble.com>

Hello,

I think this was recently introduced, because this example comes from a
part of my code that I double- and triple-checked several times in the
past (I mean, I should have noticed before, maybe...):

data <- data.table(code = c(rep("A", 26L), rep("B", 10L)),
                   id = c(rep(1L, 20L), rep(2L, 6L), rep(1L, 10L)),
                   date = structure(c(14602, 14638, 14665, 14698, 14726, 14754, 14788, 14817, 14846, 14882,
                                      14939, 15005, 15029, 15064, 15091, 15125, 15153, 15328, 15393,
                                      15393, 15393, 15393, 15431, 15461, 15569, 15569, 14613, 14762,
                                      15110, 15110, 15686, 15686, 14602, 14638, 14665, 14698),
                                    class = "Date"))
filter <- data.table(code = c("A", "B"),
                     id = c(2L, 1L),
                     limit1 = structure(c(15564, 15681), class = "Date"),
                     limit2 = structure(c(15574, 15691), class = "Date"),
                     index_R = c(26610L, 22662L))
setkey(data)
setkey(filter, code, id, limit1)

> filter[data, nomatch=0, roll=T]
   code id     limit1     limit2 index_R
1:    A  2 2012-02-23 2012-08-22   26610
2:    A  2 2012-02-23 2012-08-22   26610
3:    A  2 2012-08-17 2012-12-17   22662
4:    A  2 2012-08-17 2012-12-17   22662
5:    B  1 2011-05-16 2012-08-22   26610
6:    B  1 2011-05-16 2012-08-22   26610
7:    B  1 2012-12-12 2012-12-17   22662
8:    B  1 2012-12-12 2012-12-17   22662
>
> # expected output - workaround using any column from X which is never NA
> # (before doing X[Y, roll=T])
> filter[data, roll=T][!is.na(index_R)]
   code id     limit1     limit2 index_R
1:    A  2 2012-08-17 2012-08-22   26610
2:    A  2 2012-08-17 2012-08-22   26610
3:    B  1 2012-12-12 2012-12-17   22662
4:    B  1 2012-12-12 2012-12-17   22662

By the way, I'm on 1.9.3, at the commit right before by-without-by was
removed (sadly, because I would need at least a whole week to change all
my code...).

Can you guys reproduce this? Is it already fixed?

Regards,
Michele.


--
View this message in context: http://r.789695.n4.nabble.com/Roll-nomatch-mixes-result-tp4689968.html
Sent from the datatable-help mailing list archive at Nabble.com.

From rguy at 123mail.org  Tue May  6 11:57:25 2014
From: rguy at 123mail.org (Rguy)
Date: Tue, 6 May 2014 02:57:25 -0700 (PDT)
Subject: [datatable-help] A[B]?
In-Reply-To: <5365F136.8050807@gmail.com>
References: <1399183248863-4689942.post@n4.nabble.com>
 <5365F136.8050807@gmail.com>
Message-ID: <1399370245881-4690040.post@n4.nabble.com>

That FAQ does not provide any examples of the A[B] syntax used with data
table objects. It does provide an example using A[B] with matrix objects,
but the example does not translate to data table objects, so I'm not sure
why it's there. I suggest that the FAQ be extended to provide one, or better
yet several, examples of the A[B] syntax applied to data.table objects.

As far as I have been able to puzzle out so far, A[B] is just another way to
do a merge.
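
For anyone else puzzling over the same sentence, here is a hedged side-by-side sketch of the two idioms. The objects `A`, `B`, `X` and `Y` and their contents are invented for illustration. In base R, indexing a matrix by a 2-column matrix picks one element per row of the index; in data.table, `X[Y]` looks up the rows of `X` whose key columns match the rows of `Y`, which is a join.

```r
library(data.table)

# Base R: each row of B is a (row, column) pair selecting one element of A.
A <- matrix(1:12, nrow = 3)
B <- cbind(c(1, 3), c(2, 4))
picked <- A[B]                    # elements A[1, 2] and A[3, 4]

# data.table: each row of Y is matched against the key columns of X.
X <- data.table(id  = c("a", "b", "c"),
                grp = c(1L, 2L, 1L),
                val = c(10, 20, 30),
                key = c("id", "grp"))
Y <- data.table(id = c("a", "c"), grp = c(1L, 1L))
joined <- X[Y]                    # rows of X matching (a, 1) and (c, 1)

print(picked)
print(joined)
```

Read this way, "just another way to do a merge" is a fair summary of the plain `X[Y]` form; the analogy with matrices is only about the shape of the lookup.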


--
View this message in context: http://r.789695.n4.nabble.com/A-B-tp4689942p4690040.html
Sent from the datatable-help mailing list archive at Nabble.com.

From rguy at 123mail.org  Tue May  6 12:06:57 2014
From: rguy at 123mail.org (Rguy)
Date: Tue, 6 May 2014 03:06:57 -0700 (PDT)
Subject: [datatable-help] Assigning with a compound condition
Message-ID: <1399370817746-4690042.post@n4.nabble.com>

I am experimenting with assigning into a data table (and data frame with same
data) when the assignment involves a compound condition on multiple columns.
Please see the attached file.

Assignment into the data table is about twice as fast as into the data
frame, but I wonder if I am using the optimal syntax for achieving speedy
assignment. Any advice much appreciated.

test_assign.r <http://r.789695.n4.nabble.com/file/n4690042/test_assign.r>  
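
The attached script is not reproduced in this archive, so the following is only a generic sketch of the kind of assignment being described, with made-up column names `x`, `y` and `z`. It shows the data.table form next to the data frame form for the same compound condition.

```r
library(data.table)

DT <- data.table(x = 1:6, y = rep(c("a", "b"), 3), z = 0)
DF <- as.data.frame(DT)

# data.table: filter in i, assign by reference in j.
DT[x > 2 & y == "a", z := 1]

# data.frame: build the logical index, then assign into the column.
DF$z[DF$x > 2 & DF$y == "a"] <- 1

print(DT)
```

Both forms update the same rows; the data.table form avoids copying the column, which is one plausible source of the speed difference mentioned above.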


--
View this message in context: http://r.789695.n4.nabble.com/Assigning-with-a-compound-condition-tp4690042.html
Sent from the datatable-help mailing list archive at Nabble.com.

From kpm.nachtmann at gmail.com  Wed May  7 11:02:15 2014
From: kpm.nachtmann at gmail.com (nachti)
Date: Wed, 7 May 2014 02:02:15 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to
	require a "by"
In-Reply-To: <1366401278742-4664770.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com>
Message-ID: <1399453335248-4690100.post@n4.nabble.com>

The change of the defaults in 1.9.3 breaks existing code, which should not
happen (see DT FAQ 1.8). It would be good if code could be written so that
it works with different versions of DT and R (e.g. for use in packages).
See the example here: https://gist.github.com/nachti/34b2dc46868b9268c5af
I know that 1.9.3 is a development version, but I can't use 1.9.2 due to
http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html
and I can't switch back to an older R version because I lack the
permissions on the server. I have to use different versions of R and DT in
parallel. If I rewrite my code so that it works with 1.9.3, it no longer
works with 1.8.10 (see also
http://stackoverflow.com/questions/23289646/update-subset-of-data-table-based-on-join-using-data-table-1-9-3-does-not-work-a).
by = key(something) is not the same as by = .EACHI, but even if I can get a
solution using the former, 1.8.10 gives a warning that I shouldn't do that:

In addition: Warning message:
In `[.data.table` ...:
  by is not necessary in this query; it equals all the join columns in the
same order. j is already evaluated by group of x that each row of i matches
to (by-without-by, see ?data.table). Setting by will be slower because a
subset of x is taken and then grouped again. Consider removing by, or
changing it.
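
To make the behaviour change concrete, here is a minimal sketch assuming a data.table release in which by=.EACHI is available (1.9.3 or later, going by this thread). The table `DT`, the column `v` and the helper `sum_per_key` are hypothetical; the version switch is one possible way to keep a single code path across versions, not an approach endorsed in the thread.

```r
library(data.table)

DT <- data.table(x = c("a", "a", "b", "b"), v = 1:4, key = "x")

# Without by: j is evaluated once over all matched rows.
total <- DT[J(c("a", "b")), sum(v)]

# With by = .EACHI: j is evaluated once per row of i (the old default).
per_key <- DT[J(c("a", "b")), sum(v), by = .EACHI]

# Hypothetical wrapper that chooses the form based on the installed version.
sum_per_key <- function(DT, keys) {
  if (packageVersion("data.table") >= "1.9.3") {
    DT[J(keys), sum(v), by = .EACHI]
  } else {
    DT[J(keys), sum(v)]   # older releases grouped by each row of i implicitly
  }
}

print(total)
print(per_key)
```

A wrapper like this only hides the difference in one place; every join that relied on the implicit grouping would still need to go through it.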

nachti


--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4690100.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com  Wed May  7 12:10:57 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 7 May 2014 12:10:57 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to
 require a "by"
In-Reply-To: <1399453335248-4690100.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com>
 <1399453335248-4690100.post@n4.nabble.com>
Message-ID: <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>

> The change of the defaults in 1.9.3 breaks existing code, which should not be
> (see. DT FAQ 1.8).

Thanks. Yes, that's what will be the case when it hits CRAN. There will be an option to use the older feature, IIUC. Matt can clarify this point further.

> I know that 1.9.3 is a development version, but I can't use 1.9.2 due to
> http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html

Can you show us an example that works in 1.9.3 but not in 1.9.2?

In your case, you should be using the stable 1.9.2 version (at least until countermeasures are in place for by=.EACHI). And you should ask your administrators to downgrade R, if you don't want that bug to bite you, until this is fixed. But I'm repeating myself.

Arun


From kpm.nachtmann at gmail.com  Wed May  7 13:30:06 2014
From: kpm.nachtmann at gmail.com (nachti)
Date: Wed, 7 May 2014 04:30:06 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to
	require a "by"
In-Reply-To: <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>
References: <1366401278742-4664770.post@n4.nabble.com>
 <1399453335248-4690100.post@n4.nabble.com>
 <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>
Message-ID: <1399462206528-4690105.post@n4.nabble.com>

Arunkumar Srinivasan wrote
> The change of the defaults in 1.9.3 breaks existing code, which should not
> be (see. DT FAQ 1.8).
> Thanks. Yes, that's what will be the case when it hits CRAN. There will be
> an option to use the older feature, IIUC. Matt can clarify this point
> further.
> 
> I know that 1.9.3 is a development version, but I can't use 1.9.2 due to
> http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html
> Can you show us an example that 1.9.2 doesn't but 1.9.3 does?

Copied from
http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html 

#####
Just another example (maybe to be included to test.data.table), which does
not do, what I expected (v. 1.9.2 - it's also fixed in 1.9.3)

> require(data.table)

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: powerpc64-unknown-linux-gnu (64-bit)
...
other attached packages:
[1] data.table_1.9.2

> example(data.table)
> DT
   x y  v v2  m
1: a 1 42 NA 42
2: a 3 42 NA 42
3: a 6 42 NA 42
4: b 1  4 84  5
5: b 3  5 84  5
6: b 6  6 84  5
7: c 1  7 NA  8
8: c 3  8 NA  8
9: c 6  9 NA  8

> setkey(DT)
> DT[J("a"), list(v, y)]
   x  v y
1: a 42 1
> DT[J("a"), list(v, y, i = "text")]
   x  v y    i
1: a 42 1 text

##### With data.table 1.9.3 it's working fine:
> require(data.table)

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: powerpc64-unknown-linux-gnu (64-bit)
...
other attached packages:
[1] data.table_1.9.3

> example(data.table)

> setkey(DT)
> DT[J("a"), list(v, y)]
    v y
1: 42 1
2: 42 3
3: 42 6
> DT[J("a"), list(v, y, i = "text")]
    v y    i
1: 42 1 text
2: 42 3 text
3: 42 6 text

nachti 
#####


Arunkumar Srinivasan wrote
> In your case, you should be using stable 1.9.2 version (at least until
> counter measures are in place for by=.EACHI). And you should ask your
> administrators to downgrade R, if you don't want that bug to bite you,
> until this is fixed. But I'm repeating myself.
> 
> Arun


--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4690105.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com  Wed May  7 14:27:30 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 7 May 2014 14:27:30 +0200
Subject: [datatable-help] changing data.table by-without-by syntax to
 require a "by"
In-Reply-To: <1399462206528-4690105.post@n4.nabble.com>
References: <1366401278742-4664770.post@n4.nabble.com>
 <1399453335248-4690100.post@n4.nabble.com>
 <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>
 <1399462206528-4690105.post@n4.nabble.com>
Message-ID: <etPan.536a26b2.5bd062c2.bfdf@Arunkumars-MacBook-Pro.local>

Once again, thanks for the example. That wasn't a bug. It's how it was intended to work in prior versions of data.table. But to make things much more consistent (as per user requests and FRs filed), this change is now being implemented.

Your point that there should be ways to make sure existing code doesn't break is totally valid and we'll do whatever we can to get there. You have to realise this is a development version and we're working on it. These things will get fixed only in due time. Until then, unfortunately, there's no other way but to work around these issues, unless you or someone else would like to help us.

Arun

From:?nachti kpm.nachtmann at gmail.com
Reply:?nachti kpm.nachtmann at gmail.com
Date:?May 7, 2014 at 1:30:16 PM
To:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject:? Re: [datatable-help] changing data.table by-without-by syntax to require a "by"  

Arunkumar Srinivasan wrote
> The change of the defaults in 1.9.3 breaks existing code, which shoud not
> be?
> (see. DT FAQ 1.8).
> Thanks. Yes, that's what will be the case when it hits CRAN. There will be
> an option to use the older feature, IIUC. Matt can clarify this point
> further.
>  
> I know that 1.9.3 is a development version, but I can't use 1.9.2 due to?
> http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html?
> Can you show us an example that 1.9.2 doesn't but 1.9.3 does?

Copied from
http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-td4687469.html  

#####
Just another example (maybe to be included to test.data.table), which does
not do, what I expected (v. 1.9.2 - it's also fixed in 1.9.3)

> require(data.table)

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: powerpc64-unknown-linux-gnu (64-bit)
...
other attached packages:
[1] data.table_1.9.2

> example(data.table)
> DT
   x y  v v2  m
1: a 1 42 NA 42
2: a 3 42 NA 42
3: a 6 42 NA 42
4: b 1  4 84  5
5: b 3  5 84  5
6: b 6  6 84  5
7: c 1  7 NA  8
8: c 3  8 NA  8
9: c 6  9 NA  8

> setkey(DT)
> DT[J("a"), list(v, y)]
   x  v y
1: a 42 1
> DT[J("a"), list(v, y, i = "text")]
   x  v y    i
1: a 42 1 text

##### With data.table 1.9.3 it's working fine:
> require(data.table)

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: powerpc64-unknown-linux-gnu (64-bit)
...
other attached packages:
[1] data.table_1.9.3

> example(data.table)

> setkey(DT)
> DT[J("a"), list(v, y)]
    v y
1: 42 1
2: 42 3
3: 42 6
> DT[J("a"), list(v, y, i = "text")]
    v y    i
1: 42 1 text
2: 42 3 text
3: 42 6 text

nachti  
#####


Arunkumar Srinivasan wrote
> In your case, you should be using stable 1.9.2 version (at least until
> counter measures are in place for by=.EACHI). And you should ask your
> administrators to downgrade R, if you don't want that bug to bite you,
> until this is fixed. But I'm repeating myself.
>  
> Arun


--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4690105.html
Sent from the datatable-help mailing list archive at Nabble.com.
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140507/97103d00/attachment.html>

From kpm.nachtmann at gmail.com  Wed May  7 15:13:10 2014
From: kpm.nachtmann at gmail.com (nachti)
Date: Wed, 7 May 2014 06:13:10 -0700 (PDT)
Subject: [datatable-help] changing data.table by-without-by syntax to
	require a "by"
In-Reply-To: <etPan.536a26b2.5bd062c2.bfdf@Arunkumars-MacBook-Pro.local>
References: <1366401278742-4664770.post@n4.nabble.com>
 <1399453335248-4690100.post@n4.nabble.com>
 <etPan.536a06b3.7545e146.bfdf@Arunkumars-MacBook-Pro.local>
 <1399462206528-4690105.post@n4.nabble.com>
 <etPan.536a26b2.5bd062c2.bfdf@Arunkumars-MacBook-Pro.local>
Message-ID: <1399468390041-4690112.post@n4.nabble.com>

I have a workaround for it now:

### check data.table version (since 1.9.3 you have to use .EACHI)
odt <- packageVersion("data.table") < "1.9.3"
odt

if (odt) {
  # code for old (stable) datatable versions
} else {
  # code for datatable versions since 1.9.3
}
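
A minimal sketch of how the two branches might be filled in. The table `DT` and the `sum(v)` aggregate are made up for illustration, and the new syntax is assumed to be `by = .EACHI` as discussed in this thread:

```r
library(data.table)

# Hypothetical keyed table, only for illustration.
DT <- data.table(x = c("a", "a", "b"), v = 1:3, key = "x")

# TRUE on 1.9.2 and earlier, FALSE from 1.9.3 on.
odt <- packageVersion("data.table") < "1.9.3"

res <- if (odt) {
  # old behaviour: j is evaluated once per row of i ("by-without-by")
  DT[J(c("a", "b")), sum(v)]
} else {
  # new behaviour: grouping by each row of i has to be requested explicitly
  DT[J(c("a", "b")), sum(v), by = .EACHI]
}
res
```

Either branch should give one aggregated row per row of `i`, so the rest of the script does not need to know which version is installed.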

nachti


--
View this message in context: http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4690112.html
Sent from the datatable-help mailing list archive at Nabble.com.

From benweinstein2010 at gmail.com  Thu May  8 16:39:40 2014
From: benweinstein2010 at gmail.com (Ben Weinstein)
Date: Thu, 8 May 2014 10:39:40 -0400
Subject: [datatable-help] fread crashes reading R when reading csv
Message-ID: <CAC28k7RzNbmouBduNo_zhTjk7voTGuw9N87Hn2onfHVV99PQag@mail.gmail.com>

Data table crashes

I am having a similar issue to this post:
http://r.789695.n4.nabble.com/fread-crash-td4683394.html

Please see the markdown script at http://rpubs.com/bw4sz0511/16766 or the text below:

The file is about 550MB; I'm unsure how many rows it actually has (several
million).

When I try to run fread, RStudio just crashes with no error. I can read in
up to about 15 rows.


require(data.table)
## Loading required package: data.table

# env dist table

env <- fread("EnvData.csv", nrows = 15, verbose = TRUE)
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is  0.543B
## File is opened and mapped ok
## Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
## Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
## Found 4 columns
## First row with 4 fields occurs on line 2 (either column names or first row of data)
## Some fields on line 2 are not type character (or are empty). Treating as a data row and using default column names.
## Count of eol after first data row: 15989212
## Subtracted 0 for last eol and any trailing empty lines, leaving 15989212 data rows
## nrow limited to nrows passed in (15)
## Type codes: 4113 (first 5 rows)
## Type codes: 4113 (after applying colClasses and integer64)
## Type codes: 4113 (after applying drop or select (if supplied)
## Allocating 4 column slots (4 - 0 NULL)
##    0.000s (  0%) Memory map (rerun may be quicker)
##    0.000s (  0%) sep and header detection
##    0.702s (100%) Count rows (wc -l)
##    0.000s (  0%) Column type detection (first, middle and last 5 rows)
##    0.000s (  0%) Allocation of 15x4 result (xMB) in RAM
##    0.000s (  0%) Reading data
##    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
##    0.000s (  0%) Coercing data already read in type bumps (if any)
##    0.000s (  0%) Changing na.strings to NA
##    0.702s        Total

head(env)
##    V1 V2 V3     V4
## 1:  1  2  1  249.3
## 2:  2  3  1  536.9
## 3:  3  4  1 1161.8
## 4:  4  5  1 1234.0
## 5:  5  6  1 1513.4
## 6:  6  7  1 1757.1

However, when I run fread with more than 20 rows, it crashes RStudio.

# not run
env <- fread("EnvData.csv", nrows = 25, verbose = TRUE)

With verbose on, the output before the crash reads:

Input contains no \n. Taking this to be a filename to open

File opened, filesize is 0.543B

File is opened and mapped ok

Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.

Using line 30 to detect sep (the last non blank line in the first
'autostart') ... sep=','

Found 4 columns

First row with 4 fields occurs on line 2 (either column names or first row
of data)

Some fields on line 2 are not type character (or are empty). Treating as a
data row and using default column names.

Count of eol after first data row: 15989212

Subtracted 0 for last eol and any trailing empty lines, leaving 15989212
data rows

nrow limited to nrows passed in (25)

Type codes: 4113 (first 5 rows)

Type codes: 4113 (+middle 5 rows)

Looking at the file with read.csv, nothing seems wrong:


env <- read.csv("EnvData.csv", nrows = 25)

env
##    V1 V2     V3
## 1   2  1  249.3
## 2   3  1  536.9
## 3   4  1 1161.8
## 4   5  1 1234.0
## 5   6  1 1513.4
## 6   7  1 1757.1
## 7   8  1 2176.7
## 8   9  1 2644.0
## 9  10  1 3033.3
## 10 11  1 3721.2
## 11 12  1 4432.8
## 12 13  1 4609.6
## 13 14  1 5378.8
## 14 15  1 5953.6
## 15 16  1 5913.9
## 16 17  1 6281.3
## 17 18  1 6669.7
## 18 19  1 6449.7
## 19 20  1 6218.4
## 20 21  1 6493.4
## 21 22  1 6056.6
## 22 23  1 5275.8
## 23 24  1 4605.2
## 24 25  1 3153.9
## 25 26  1 2532.1


Thanks for your help,

Ben Weinstein
-- 
Ben Weinstein
PhD Candidate
Ecology and Evolution
Stony Brook University

http://benweinstein.weebly.com/

From stanasa at latinumnetwork.com  Thu May  8 20:50:03 2014
From: stanasa at latinumnetwork.com (stanasa)
Date: Thu, 8 May 2014 11:50:03 -0700 (PDT)
Subject: [datatable-help] Fread Skip Question
Message-ID: <1399575003729-4690205.post@n4.nabble.com>

First of all, thank you very much for creating, maintaining and updating this
package! Discovering "fread" and the data.table package has made my life a
lot easier.

I'm using fread to read large (2-4 GB) .CSV files for subsequent RMySQL
bulk loads and, since the computer I use is a bit memory-limited, decided to
read them in chunks using skip and nrows. I'm noticing that as I go through
the file (with a for loop), each individual read takes a bit longer on
average (I'm guessing fread parses through the file line by line to reach
the skip-to location).

Is there any way to make fread "remember" the end of the last read location
for the next iteration? 
It would speed up my reads from minutes to seconds, I would guess. 

Also, should I worry that reusing the same data.table in a for loop causes
memory issues?
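
For reference, the chunked loop described above could be sketched roughly as follows. The demo file, the column names and the chunk size are made up; the sketch only shows the skip/nrows bookkeeping and does not make fread remember its position between calls:

```r
library(data.table)

# Hypothetical demo file: one header line followed by 10 data rows.
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", paste(1:10, (1:10) * 1.5, sep = ",")), path)

col_names  <- names(fread(path, header = TRUE, nrows = 0L))  # header only
n_total    <- nrow(fread(path, header = TRUE, select = 1L))  # one column only
chunk_size <- 4L

rows_seen <- 0L
total     <- 0
for (start in seq(0L, n_total - 1L, by = chunk_size)) {
  # skip the header line plus the data rows already processed
  chunk <- fread(path, skip = start + 1L, nrows = chunk_size, header = FALSE)
  setnames(chunk, col_names)
  rows_seen <- rows_seen + nrow(chunk)
  total     <- total + sum(chunk$value)
}
```

Rebinding `chunk` on every iteration leaves the previous chunk unreferenced, so R's garbage collector is free to reclaim it; reusing the same name in the loop should not by itself accumulate memory.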

Many thanks,


Serban Tanasa, Ph.D.
Senior Analyst
Latinum Network

(o) (240) 482-8259
(f)  (240) 482-8265


--
View this message in context: http://r.789695.n4.nabble.com/Fread-Skip-Question-tp4690205.html
Sent from the datatable-help mailing list archive at Nabble.com.

From gsee000 at gmail.com  Fri May  9 00:57:02 2014
From: gsee000 at gmail.com (G See)
Date: Thu, 8 May 2014 17:57:02 -0500
Subject: [datatable-help] merge zero row data.table
Message-ID: <CA+xi=qaM7sZohoKc18QFHr-0uK5GrNj9S0NL8rRqWq7cSEd5RQ@mail.gmail.com>

Hi,

Is the following error expected?

> library(data.table)
data.table 1.9.3  For help type: help("data.table")
> a <- data.table(BOD, key="Time")
> b <- data.table(BOD, key="Time")[Time < 0] # zero row data.table
> merge(a,b, all=TRUE) # works fine
   Time demand.x demand.y
1:    1      8.3       NA
2:    2     10.3       NA
3:    3     19.0       NA
4:    4     16.0       NA
5:    5     15.6       NA
6:    7     19.8       NA
> merge(b,a, all=TRUE) # error
Error in setcolorder(dt, c(setdiff(names(dt), end), end)) :
  neworder is length 2 but x has 3 columns.

Thanks,
Garrett

p.s. using svn Rev. 1263

From aragorn168b at gmail.com  Fri May  9 01:00:18 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 9 May 2014 01:00:18 +0200
Subject: [datatable-help] merge zero row data.table
In-Reply-To: <CA+xi=qaM7sZohoKc18QFHr-0uK5GrNj9S0NL8rRqWq7cSEd5RQ@mail.gmail.com>
References: <CA+xi=qaM7sZohoKc18QFHr-0uK5GrNj9S0NL8rRqWq7cSEd5RQ@mail.gmail.com>
Message-ID: <etPan.536c0c82.1befd79f.bfdf@Arunkumars-MacBook-Pro.local>

Garrett,

Seems like it works fine in 1.9.2. I'd say it's a bug introduced due to changes in 1.9.3. Could you please file it as one? Thanks.

Arun


From gsee000 at gmail.com  Fri May  9 01:10:07 2014
From: gsee000 at gmail.com (G See)
Date: Thu, 8 May 2014 18:10:07 -0500
Subject: [datatable-help] merge zero row data.table
In-Reply-To: <etPan.536c0c82.1befd79f.bfdf@Arunkumars-MacBook-Pro.local>
References: <CA+xi=qaM7sZohoKc18QFHr-0uK5GrNj9S0NL8rRqWq7cSEd5RQ@mail.gmail.com>
 <etPan.536c0c82.1befd79f.bfdf@Arunkumars-MacBook-Pro.local>
Message-ID: <CA+xi=qY_nJAf53fqpuPi0n+KDTm=HR+nLbkT+ubbzvqe=Hu65g@mail.gmail.com>

Thanks Arun.  Bug filed:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5672&group_id=240&atid=975


From aragorn168b at gmail.com  Fri May  9 01:10:42 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 9 May 2014 01:10:42 +0200
Subject: [datatable-help] merge zero row data.table
In-Reply-To: <CA+xi=qY_nJAf53fqpuPi0n+KDTm=HR+nLbkT+ubbzvqe=Hu65g@mail.gmail.com>
References: <CA+xi=qaM7sZohoKc18QFHr-0uK5GrNj9S0NL8rRqWq7cSEd5RQ@mail.gmail.com>
 <etPan.536c0c82.1befd79f.bfdf@Arunkumars-MacBook-Pro.local>
 <CA+xi=qY_nJAf53fqpuPi0n+KDTm=HR+nLbkT+ubbzvqe=Hu65g@mail.gmail.com>
Message-ID: <etPan.536c0ef2.6b68079a.bfdf@Arunkumars-MacBook-Pro.local>

Great! Thanks a bunch.

Arun


From fch808 at gmail.com  Fri May  9 23:34:56 2014
From: fch808 at gmail.com (FCH808)
Date: Fri, 9 May 2014 14:34:56 -0700 (PDT)
Subject: [datatable-help] Losing header names when using skip argument in
	fread in R
Message-ID: <1399671296512-4690268.post@n4.nabble.com>

R package: data.table, version 1.9.2

I have a ";" delimited text file that I need to subset based on the dates
that appear in the first column. I used fread() to read the first column
only, and return the indices with the dates needed so I could use the min()
of the indices to skip to, and the length() for number of rows to read. (In
this case I only need 2 sequential days - 2880 rows/readings)


The problem is that header = TRUE only seems to capture the row of data
immediately preceding the rows read and uses it as the header, instead of
the actual headers in the first line of the text file.


I wrapped it in a function and timed it, and it seems to be a reasonably
quick way to have a minimal impact on RAM usage for the filtering needed.
This file is only about 2 million rows so it wouldn't be a problem just
reading the whole thing in and subsetting but I would like a solution that
works as my text files get larger.


          findRows <- fread("power.txt", header = TRUE, select = 1)
          all <- which(findRows$Date %in% c("14/2/2008", "15/2/2008"))
          skipLines <- min(all)
          keepRows <- length(all)
          feb <- fread("power.txt", skip = skipLines, nrows = keepRows, header = TRUE)
          rm(findRows)

          head(feb)

           14/2/2008 00:00:00 0.252 0.000 244.230 1.000 0.000 0.000 0.000
        1: 14/2/2008 00:01:00 0.254     0  245.24     1     0     0     0
        2: 14/2/2008 00:01:00 0.254  0 245.24  1  0  0  0
        3: 14/2/2008 00:02:00 0.254  0 245.31  1  0  0  0
        4: 14/2/2008 00:03:00 0.252  0 244.44  1  0  0  0
        5: 14/2/2008 00:04:00 0.252  0 244.27  1  0  0  0
        6: 14/2/2008 00:05:00 0.252  0 244.62  1  0  0  0

        > system.time(loadF())
            user  system elapsed 
            0.55    0.01    0.56 


I was able to circumvent this by setting header = FALSE, reading just the
first line into another tiny data.table to extract all the column names
(since I only read the first column the first time around), and setting
those names on the result. This doesn't seem like the best solution
if there is a way to do it within the fread() call.


          findRows <- fread("power.txt", header = TRUE, select = 1)
          all <- which(findRows$Date %in% c("14/2/2008", "15/2/2008"))
          skipLines <- min(all)
          keepRows <- length(all)
          feb <- fread("power.txt", skip = skipLines, nrows = keepRows, header = FALSE)
          rm(findRows)
          # read one row only to recover the column names from the header line
          febNames <- names(fread("power.txt", nrows = 1))
          setnames(feb, febNames)

          head(feb)

                Date     Time Global_active_power Global_reactive_power Voltage
        1: 14/2/2008 00:00:00               0.252                     0  244.23
        2: 14/2/2008 00:01:00               0.254                     0  245.24
        3: 14/2/2008 00:02:00               0.254                     0  245.31
        4: 14/2/2008 00:03:00               0.252                     0  244.44
        5: 14/2/2008 00:04:00               0.252                     0  244.27
        6: 14/2/2008 00:05:00               0.252                     0  244.62
           Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
        1:                1              0              0              0
        2:                1              0              0              0
        3:                1              0              0              0
        4:                1              0              0              0
        5:                1              0              0              0
        6:                1              0              0              0

        > system.time(loadF())
           user  system elapsed 
           0.61    0.05    0.66 


Is there a way to accomplish this within the fread() call that skips to row
610,957 and initially creates the feb data.table instead of having to create
another data.table of length 1 just to read the headers?
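
One possible tidier variant, sketched under assumptions: a recent data.table in which fread() accepts `nrows = 0` (names only) and a `col.names` argument, neither of which is confirmed for 1.9.2. The small file below is a made-up stand-in for power.txt, and the matching rows are assumed to be contiguous, as in the post:

```r
library(data.table)

# Hypothetical stand-in for power.txt: ';'-separated with one header line.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Date;Voltage",
  "13/2/2008;243.10",
  "14/2/2008;244.23",
  "14/2/2008;245.24",
  "15/2/2008;244.44",
  "16/2/2008;244.27"
), path)

# First pass: read only the date column to locate the wanted rows.
dates <- fread(path, sep = ";", header = TRUE, select = 1L)
hits  <- which(dates$Date %in% c("14/2/2008", "15/2/2008"))

# Column names only; no data rows are read.
header <- names(fread(path, sep = ";", header = TRUE, nrows = 0L))

# Second pass: skipping min(hits) lines drops the header line plus the
# data rows before the first hit, so reading starts at the first hit.
feb <- fread(path, sep = ";", header = FALSE,
             skip = min(hits), nrows = length(hits),
             col.names = header)
feb
```

This still reads the header line separately, but it avoids both the extra one-row table and the later setnames() call.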


--
View this message in context: http://r.789695.n4.nabble.com/Losing-header-names-when-using-skip-argument-in-fread-in-R-tp4690268.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com  Sat May 10 08:45:03 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 10 May 2014 14:45:03 +0800
Subject: [datatable-help] setkey on .SD
Message-ID: <536DCAEF.9050007@gmail.com>

All,

?data.table says that `.SD` is read-only. However, I was able to use `setkey`
on it. Is this officially supported, or is it dangerous to use on `.SD`,
e.g. because unexpected behavior could occur in some corner cases?

Thanks,

M

From kevinushey at gmail.com  Mon May 12 00:54:19 2014
From: kevinushey at gmail.com (Kevin Ushey)
Date: Sun, 11 May 2014 15:54:19 -0700
Subject: [datatable-help] Minor request -- make 'copy' an S3 generic?
Message-ID: <CAJXgQP2_bA7riKgqwHu1v9otF=_UqOGdffVbdGCSHds89BZEvQ@mail.gmail.com>

And move the current copy logic to copy.data.table.

This is mainly because I want to implement my own 'copy.environment'
function, which performs a deep copy of an environment -- data.table's
copy does not do this.
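
To make the request concrete, here is a rough sketch of what such a generic could look like in user code. Nothing here is part of data.table: the generic, `copy.default` and `copy.environment` are hypothetical, and cycles between environments are not handled.

```r
library(data.table)

# Hypothetical generic; it masks data.table's copy() where it is defined.
copy <- function(x, ...) UseMethod("copy")

# Everything without a dedicated method falls through to data.table's copy.
copy.default <- function(x, ...) data.table::copy(x)

# Deep copy: a new environment with the same parent, with each binding
# copied in turn (nested environments recurse into this method).
copy.environment <- function(x, ...) {
  out <- new.env(parent = parent.env(x))
  for (nm in ls(x, all.names = TRUE)) {
    val <- get(nm, envir = x, inherits = FALSE)
    assign(nm, copy(val), envir = out)
  }
  out
}

e <- new.env()
e$n <- 10
e$inner <- new.env()
e$inner$v <- 1

e2 <- copy(e)
e2$n <- 20
e2$inner$v <- 2
c(e$n, e$inner$v)   # the original environment is unchanged
```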

Thanks,
Kevin

From mdowle at mdowle.plus.com  Tue May 13 22:16:10 2014
From: mdowle at mdowle.plus.com (Matt Dowle)
Date: Tue, 13 May 2014 21:16:10 +0100
Subject: [datatable-help] R/Finance in Chicago on Friday
Message-ID: <53727D8A.40208@mdowle.plus.com>


Looking forward to it. Spaces available. Tutorial on data.table at 8am.

http://www.rinfinance.com/agenda/

Matt



From aragorn168b at gmail.com  Fri May 16 19:25:36 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 16 May 2014 19:25:36 +0200
Subject: [datatable-help] setkey on .SD
In-Reply-To: <536DCAEF.9050007@gmail.com>
References: <536DCAEF.9050007@gmail.com>
Message-ID: <CAAf756OuFqtsrjtiJmY65tMOKmcQJ-+o0TAbjFic4umxu6uhcg@mail.gmail.com>

After seeing this post:
http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917

I wrote to Matt about this as well. I've marked this issue to resolve it,
as I've not heard back on this issue from Matt yet. Thanks for reporting.

Arun.



From statquant at outlook.com  Tue May 20 14:50:37 2014
From: statquant at outlook.com (statquant3)
Date: Tue, 20 May 2014 05:50:37 -0700 (PDT)
Subject: [datatable-help] learn how to use melt and dcast
Message-ID: <1400590237635-4690882.post@n4.nabble.com>

Guys,
Is there some tutorial about how to use melt and dcast, each time I want to
use it I forget how to...
I think the ?dcast is not enough (likely I am too stupid)
Cheers


--
View this message in context: http://r.789695.n4.nabble.com/learn-how-to-use-melt-and-dcast-tp4690882.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com  Tue May 20 16:36:15 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Tue, 20 May 2014 22:36:15 +0800
Subject: [datatable-help] learn how to use melt and dcast
In-Reply-To: <1400590237635-4690882.post@n4.nabble.com>
References: <1400590237635-4690882.post@n4.nabble.com>
Message-ID: <537B685F.1040603@gmail.com>

Hadley's JSS article might be a good place to start. It was written for the
reshape package, but the reshape2 package is not much different, and
using it with a data.table should not be much different from using it with
a data.frame.
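
As a tiny round trip to jog the memory, assuming a recent data.table in which `melt` and `dcast` can be called directly on a data.table (the table and column names are made up):

```r
library(data.table)

# Hypothetical wide table: one row per id, one column per measurement.
wide <- data.table(id = 1:2, a = c(10, 20), b = c(30, 40))

# melt: wide -> long, one row per (id, measurement) pair
long <- melt(wide, id.vars = "id",
             variable.name = "measure", value.name = "val")

# dcast: long -> wide again, one column per level of `measure`
back <- dcast(long, id ~ measure, value.var = "val")
```

In the dcast formula, the left-hand side names the columns that stay as row identifiers and the right-hand side names the column whose values become the new column names.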



From aragorn168b at gmail.com  Tue May 20 21:27:52 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 21:27:52 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
	arguments
Message-ID: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>

Hello everyone,

With the latest commit #1266, the extra functionality offered via rbind (use.names and fill) is now also available in rbindlist. In addition, the implementation has been moved entirely to C and is therefore tremendously fast, especially for cases where one has to bind with use.names=TRUE and/or fill=TRUE. I'll try to put out a benchmark comparing speed differences with the older implementation ASAP.

Note that this change comes at a very low cost to the default speed of rbindlist (with use.names=FALSE and fill=FALSE). As an example, binding 10,000 data.tables with 20 columns each took 0.107 seconds with the new version, whereas the older version took 0.095 seconds.

In addition, the documentation for ?rbindlist has also been improved (#5158 from Alexander). Here's the change log from NEWS:

  o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
         -> use.names by default is FALSE for backwards compatibility (doesn't bind by names by default)
         -> rbind(...) now just calls rbindlist() internally, except that 'use.names' is TRUE by default,  
            for compatibility with base (and backwards compatibility).
         -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
         -> At least one item of the input list has to have non-null column names.
         -> Duplicate columns are bound in the order of occurrence, like base.
         -> Attributes that might exist in individual items would be lost in the bound result.
         -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
         -> And incredibly fast ;).
         -> Documentation updated in much detail. Closes DR #5158.
     Eddi's (excellent) work on finding factor levels, type coercion of columns etc. are all retained.

Please try it and write back if things aren't working as they were before. The tests that had to be fixed cover extremely rare cases. I suspect there should be minimal issues, if any, in this version. However, I do find that the changes here bring consistency to the function.
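
A small sketch of the two binding modes described in the NEWS entry, using made-up tables:

```r
library(data.table)

# Hypothetical inputs whose columns only partly overlap.
parts <- list(data.table(a = 1, b = 2), data.table(b = 3, c = 4))

# Bind by name; columns missing from an item are filled with NA.
by_name <- rbindlist(parts, use.names = TRUE, fill = TRUE)

# Bind by position: the second item's columns are matched by order, not name.
swapped <- list(data.table(a = 1, b = 2), data.table(b = 3, a = 4))
by_pos  <- rbindlist(swapped, use.names = FALSE)
```

With `use.names = FALSE` the result takes its column names from the first item, so the second item's `b` values end up under `a`. Binding by name is the safer choice whenever column order may differ between items.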

One (very rare) feature that is not available due to this implementation is the ability to recycle.

dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
lst1 <- list(x=4, y=5, z=as.list(1:3))

rbind(dt1, lst1)
#    x y       z
# 1: 1 4     1,2
# 2: 2 5   1,2,3
# 3: 3 6 1,2,3,4
# 4: 4 5       1
# 5: 4 5       2
# 6: 4 5       3

The 4,5 are recycled very nicely here. This is not possible at the moment. It worked before because the earlier rbind implementation used as.data.table to convert to data.table; however, that takes a copy (very inefficient on huge / many tables). I'd love to add this feature in C as well, as it would help incredibly for use within [.data.table (now that we can fill columns and bind by names faster). Will add a FR.

In summary, I think there should be minimal issues, if any, and it should be much faster (for rbind cases). Please write back with what you think, if you happen to try it out.


Arun

From ggrothendieck at gmail.com  Tue May 20 22:04:01 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 16:04:01 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
	arguments
In-Reply-To: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
Message-ID: <CAP01uRnjymCDRx0ANd+Y=NS14H9kBCrbgAesLyt2HEtHRyaEsg@mail.gmail.com>

The requirement to set use.names to TRUE if fill is TRUE seems ugly.
I suggest that fill be the default for use.names.



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Tue May 20 22:07:00 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 22:07:00 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
 arguments
In-Reply-To: <CAP01uRnjymCDRx0ANd+Y=NS14H9kBCrbgAesLyt2HEtHRyaEsg@mail.gmail.com>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRnjymCDRx0ANd+Y=NS14H9kBCrbgAesLyt2HEtHRyaEsg@mail.gmail.com>
Message-ID: <etPan.537bb5e4.643c9869.11385@Arunkumars-MacBook-Pro.local>

Hi Gabor,

Thanks for the quick response. Just to be clear, you don't have to set use.names=TRUE when fill=TRUE. If you just set fill=TRUE and use.names happens to be FALSE, then it'll automatically be set to TRUE (with a message/warning), which you can safely ignore. Do you still find this ugly? You'll get the warning if you use rbindlist with just fill=TRUE (because use.names=FALSE by default).
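
To illustrate the behaviour described above, here is a minimal sketch (an illustration only, not code from this thread; the tables dt_a and dt_b are made up). It passes use.names=TRUE explicitly together with fill=TRUE, so it does not depend on whether a given data.table version warns when use.names is left at its default.

```r
library(data.table)

# Two hypothetical tables that share only column 'b'
dt_a <- data.table(a = 1:2, b = 3:4)
dt_b <- data.table(b = 5L, c = 6L)

# fill=TRUE binds by name and pads columns missing from an item with NA
filled <- rbindlist(list(dt_a, dt_b), use.names = TRUE, fill = TRUE)
print(filled)
```

The result should have the union of the column names (a, b, c), with NA wherever an input table lacked a column.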


Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 20, 2014 at 10:04:21 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

The requirement to set use.names to TRUE if fill is TRUE seems ugly.  
I suggest that fill be the default for use.names.  


From ggrothendieck at gmail.com  Tue May 20 22:11:16 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 16:11:16 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
	arguments
In-Reply-To: <etPan.537bb5e4.643c9869.11385@Arunkumars-MacBook-Pro.local>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRnjymCDRx0ANd+Y=NS14H9kBCrbgAesLyt2HEtHRyaEsg@mail.gmail.com>
 <etPan.537bb5e4.643c9869.11385@Arunkumars-MacBook-Pro.local>
Message-ID: <CAP01uRn9bH1Fs9kndJCNyYZCcBfC29BG58FjPNR2opaefmCdgQ@mail.gmail.com>

Then why not make the default of use.names be fill? Then you don't get
the warning and you can tell just from the argument list what the
dependencies are.



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Tue May 20 22:17:45 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 22:17:45 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
 arguments
In-Reply-To: <CAP01uRn9bH1Fs9kndJCNyYZCcBfC29BG58FjPNR2opaefmCdgQ@mail.gmail.com>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRnjymCDRx0ANd+Y=NS14H9kBCrbgAesLyt2HEtHRyaEsg@mail.gmail.com>
 <etPan.537bb5e4.643c9869.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRn9bH1Fs9kndJCNyYZCcBfC29BG58FjPNR2opaefmCdgQ@mail.gmail.com>
Message-ID: <etPan.537bb869.74b0dc51.11385@Arunkumars-MacBook-Pro.local>

Because with the current implementation, the case use.names=TRUE and fill=FALSE (no missing columns, just a different column order) could be faster than fill=TRUE (on large tables), as the latter populates the result with NAs first.

Sometimes it might be essential to throw an error (to catch bugs?) when you think the columns are all just interchanged, but in reality there are either new columns or duplicated columns.
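
As a rough sketch of the error-catching point (an illustration only; the tables are made up, and the exact error text depends on the data.table version), binding two tables whose column names differ with use.names=TRUE and fill=FALSE is expected to fail, while fill=TRUE accepts the mismatch:

```r
library(data.table)

# Hypothetical tables: the second has a column 'c' that the first lacks
dt_a <- data.table(a = 1:2, b = 3:4)
dt_b <- data.table(b = 5L, c = 6L)

# With use.names=TRUE and fill=FALSE the mismatch is expected to be reported
# as an error instead of being silently padded; capture the message if so
strict <- tryCatch(
    rbindlist(list(dt_a, dt_b), use.names = TRUE, fill = FALSE),
    error = function(e) conditionMessage(e)
)

# Opting in to fill=TRUE accepts the mismatch and pads with NA
lenient <- rbindlist(list(dt_a, dt_b), use.names = TRUE, fill = TRUE)
```

Here strict holds the error message when the strict call fails, which is the kind of early signal described above.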


Arun


From aragorn168b at gmail.com  Tue May 20 22:28:39 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 22:28:39 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
	arguments
In-Reply-To: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
Message-ID: <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>

I've filed FR #5690 to remind myself of the recycling feature; that'd be awesome to have.

One feature I forgot to point out in the previous post is that, even when there are duplicate names, rbind/rbindlist binds them consistently with 'base' when use.names=TRUE. It also fills the duplicate columns properly (in the order of occurrence) when fill=TRUE.

Okay, on to benchmarks. I took a set of 10,000 data.tables, each with columns ranging from V1 to V500 in random order (all integers for simplicity). We'll just need use.names=TRUE (as all columns are available in all data.tables).

I think this data is big enough to illustrate the point. Also, I was curious to see a comparison against dplyr's rbind_all (commit 1504 devel version). So, I've added it as well to the benchmarks.

Here's the data generation. Note: It takes a while for this step to finish.

require(data.table) ## 1.9.3 commit 1267
require(dplyr)      ## commit 1504 devel
set.seed(1L)
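## foo: build a data.table of k integer columns, each a random permutation of 1:10
foo <- function(k) {
    ans = setDT(lapply(1:k, function(x) sample(10)))
}
## bar: rename the columns of ans to a random ordering of n of the names V1..Vk
bar <- function(ans, k, n) {
    bla = sample(paste0("V", 1:k), n)
    setnames(ans, bla)
}
n = 10000L
ll = vector("list", n)
for (i in 1:n) {
    bla = bar(foo(500L), 500L, 500L)
    ## store the table as element i of ll via data.table's internal C routine
    .Call("Csetlistelt", ll, i, bla)
}

And here are the timings: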

## data.table v1.9.3 commit 1267's rbindlist
## Timings of three consecutive runs:
system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
   user  system elapsed  
 10.909   0.449  11.843  
  
    user  system elapsed  
  5.219   0.386   5.640  
   
    user  system elapsed  
  5.355   0.429   5.898  

## dplyr's rbind_all
## Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
   user  system elapsed  
 62.769   0.247  63.941  
  
    user  system elapsed  
 62.010   0.335  65.876  
  
   user  system elapsed  
 55.345   0.359  60.193  

> identical(ans1, setDT(ans2)) # [1] TRUE
  
## data.table v1.9.2's rbind version:
## ran only once as it took a bit more.
system.time(ans1 <- do.call("rbind", ll))
    user  system elapsed  
125.356   2.247 139.000  

> identical(ans1, setDT(ans2)) # [1] TRUE

In summary, the newer implementation is roughly 11-23x faster than data.table's older implementation and roughly 5.5-10x faster than dplyr on this (relatively huge) data.
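
For anyone who wants to recompute those ratios, here is a small sketch in base R using only the elapsed times reported above (the variable names are made up for illustration). The quoted ranges are approximate, so the computed values need not match them exactly.

```r
# Elapsed times (seconds) copied from the runs reported above
new_rbindlist <- c(11.843, 5.640, 5.898)    # data.table 1.9.3 rbindlist
dplyr_rbind   <- c(63.941, 65.876, 60.193)  # dplyr rbind_all
old_rbind     <- 139.000                    # data.table 1.9.2 rbind (single run)

# Speed-up of the new rbindlist over the old rbind, one value per run
speedup_vs_old <- old_rbind / new_rbindlist

# Range of speed-ups over dplyr: the low end pairs the fastest dplyr run with
# the slowest rbindlist run, the high end pairs the slowest with the fastest
speedup_vs_dplyr <- c(min(dplyr_rbind) / max(new_rbindlist),
                      max(dplyr_rbind) / min(new_rbindlist))

print(round(speedup_vs_old, 1))
print(round(speedup_vs_dplyr, 1))
```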

Arun

From: Arunkumar Srinivasan aragorn168b at gmail.com
Reply: Arunkumar Srinivasan aragorn168b at gmail.com
Date: May 20, 2014 at 9:27:56 PM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: FR #5249 - rbindlist gains use.names and fill arguments

Hello everyone,

With the latest commit #1266, the extra functionality offered via rbind (use.names and fill) is also now available to rbindlist. In addition, the implementation is completely moved to C, and is therefore tremendously fast, especially for cases where one has to bind with use.names=TRUE and/or fill=TRUE. I'll try to put out a benchmark comparing speed differences with the older implementation ASAP.

Note that this change comes with a very low cost to the default speed of rbindlist (with use.names=FALSE and fill=FALSE). As an example, binding 10,000 data.tables with 20 columns each took 0.107 seconds with the new version, whereas the older version ran in 0.095 seconds.

In addition, the documentation for ?rbindlist has also been improved (#5158 from Alexander). Here's the change log from NEWS:

  o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
         -> use.names by default is FALSE for backwards compatibility (doesn't bind by names by default)
         -> rbind(...) now just calls rbindlist() internally, except that 'use.names' is TRUE by default,   
            for compatibility with base (and backwards compatibility).
         -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
         -> At least one item of the input list has to have non-null column names.
         -> Duplicate columns are bound in the order of occurrence, like base.
         -> Attributes that might exist in individual items would be lost in the bound result.
         -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
         -> And incredibly fast ;).
         -> Documentation updated in much detail. Closes DR #5158.
     Eddi's (excellent) work on finding factor levels, type coercion of columns etc. are all retained.
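
A minimal sketch of the use.names distinction from the NEWS entry above (an illustration only; the tables are made up, and use.names is passed explicitly so the example does not depend on any version's default):

```r
library(data.table)

# Hypothetical tables with the same columns in a different order
dt_a <- data.table(a = 1:2, b = 3:4)
dt_b <- data.table(b = 5L, a = 6L)

# use.names=FALSE: columns are matched by position, so dt_b's names are ignored
by_position <- rbindlist(list(dt_a, dt_b), use.names = FALSE)

# use.names=TRUE: columns are matched by name
by_name <- rbindlist(list(dt_a, dt_b), use.names = TRUE)

print(by_position)
print(by_name)
```

In the by-position result the value 5 lands in column a, while in the by-name result it lands in column b.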

Please try it and write back if things aren't working as they did before. The tests that had to be fixed cover extremely rare cases. I suspect there should be minimal issues, if any, in this version. However, I do find the changes here bring consistency to the function.

One (very rare) feature that is not available due to this implementation is the ability to recycle.

dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
lst1 <- list(x=4, y=5, z=as.list(1:3))

rbind(dt1, lst1)
#    x y       z
# 1: 1 4     1,2
# 2: 2 5   1,2,3
# 3: 3 6 1,2,3,4
# 4: 4 5       1
# 5: 4 5       2
# 6: 4 5       3

The 4,5 are recycled very nicely here. This is not possible at the moment, because the earlier rbind implementation used as.data.table to convert to data.table; however, it takes a copy (very inefficient on huge / many tables). I'd love to add this feature in C as well, as it would help incredibly for use within [.data.table (now that we can fill columns and bind by names faster). Will add a FR.

In summary, I think there should be minimal issues, if any, and it should be much faster (for rbind cases). Please write back with what you think, if you happen to try it out.


Arun

From ggrothendieck at gmail.com  Tue May 20 22:49:33 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 16:49:33 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
	arguments
In-Reply-To: <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>
Message-ID: <CAP01uR=WxhcAy-TgjhMmyUrnztiYgBgFeFAUe-LTedbEvPgPmQ@mail.gmail.com>

If I understand this right then the table below shows the valid
logical combinations in order of speed (slowest first).  Is that
right?  If so then if fill = FALSE and use.names = fill then we get
the fastest case by default.

Furthermore if you were concerned that we might be T/T when F/T would
be sufficient I don't think that is likely since getting F/T is done
by setting use.names = TRUE.

fill/use.names
T/T (slowest)
F/T
F/F (fastest)




-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Tue May 20 23:01:52 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 23:01:52 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
 arguments
In-Reply-To: <CAP01uR=WxhcAy-TgjhMmyUrnztiYgBgFeFAUe-LTedbEvPgPmQ@mail.gmail.com>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uR=WxhcAy-TgjhMmyUrnztiYgBgFeFAUe-LTedbEvPgPmQ@mail.gmail.com>
Message-ID: <etPan.537bc2c0.238e1f29.11385@Arunkumars-MacBook-Pro.local>

I think I understand now what you're trying to say. Going back to an earlier post, you wrote:

> Then why not make the default of `use.names` be `fill`. Then you don't get the warning and you can tell just from the argument list what the dependencies are.

You mean to basically do this?

rbindlist <- function(l, use.names=fill, fill=FALSE)
.rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)

Is this what you mean? If so, the defaults from the previous versions will change: anyone who calls rbind directly without setting use.names will get different results (assuming I understand you correctly this time).
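
As a side note on how such a signature behaves: R evaluates default arguments lazily, inside the function, so a default of use.names=fill picks up whatever fill turns out to be. A minimal base R sketch follows; flags_sketch is a hypothetical stand-in that only reports the resolved flags and is not data.table code.

```r
# Hypothetical stand-in: it only reports how the two flags resolve.
flags_sketch <- function(l, use.names = fill, fill = FALSE) {
  # 'use.names' defaults to the value of 'fill' when it is first used.
  c(use.names = use.names, fill = fill)
}

flags_sketch(NULL)                    # use.names FALSE, fill FALSE
flags_sketch(NULL, fill = TRUE)       # use.names TRUE,  fill TRUE
flags_sketch(NULL, use.names = TRUE)  # use.names TRUE,  fill FALSE
```

With a signature like this, the only way to end up with fill=TRUE and use.names=FALSE is to pass both explicitly.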


Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 20, 2014 at 10:49:54 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

If I understand this right then the table below shows the valid  
logical combinations in order of speed (slowest first). Is that  
right? If so then if fill = FALSE and use.names = fill then we get  
the fastest case by default.  

Furthermore if you were concerned that we might be T/T when F/T would  
be sufficient I don't think that is likely since getting F/T is done  
by setting use.names = TRUE.  

fill/use.names  
T/T (slowest)  
F/T  
F/F (fastest)


On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan  
<aragorn168b at gmail.com> wrote:  
> I've filed FR #5690 to remind myself of the recycling feature; that'd be
> awesome to have.
>
> One feature I forgot to point out in the previous post is that, even when
> there are duplicate names, rbind/rbindlist binds them consistently with
> 'base' when use.names=TRUE. It also fills the duplicate columns properly
> (in the order of occurrence) when fill=TRUE.
>
> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
> columns ranging from V1 to V500 in random order (all integers for
> simplicity). We only need use.names=TRUE here (as all columns are
> available in all data.tables).
>
> I think this data is big enough to illustrate the point. Also, I was curious
> to see a comparison against dplyr's rbind_all (commit 1504 devel version).
> So, I've added it as well to the benchmarks.
>
> Here's the data generation. Note: It takes a while for this step to finish.
>  
> require(data.table) ## 1.9.3 commit 1267
> require(dplyr) ## commit 1504 devel
> set.seed(1L)
> foo <- function(k) {
>     ans = setDT(lapply(1:k, function(x) sample(10)))
> }
> bar <- function(ans, k, n) {
>     bla = sample(paste0("V", 1:k), n)
>     setnames(ans, bla)
> }
> n = 10000L
> ll = vector("list", n)
> for (i in 1:n) {
>     bla = bar(foo(500L), 500L, 500L)
>     .Call("Csetlistelt", ll, i, bla)
> }
>  
> And here are the timings:  
>  
> ## data.table v1.9.3 commit 1267's rbindlist  
> ## Timings of three consecutive runs:  
> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))  
> user system elapsed  
> 10.909 0.449 11.843  
>  
> user system elapsed  
> 5.219 0.386 5.640  
>  
> user system elapsed  
> 5.355 0.429 5.898  
>  
> ## dplyr's rbind_all  
> ## Timings for three consecutive runs  
> system.time(ans2 <- rbind_all(ll))  
> user system elapsed  
> 62.769 0.247 63.941  
>  
> user system elapsed  
> 62.010 0.335 65.876  
>  
> user system elapsed  
> 55.345 0.359 60.193  
>  
> identical(ans1, setDT(ans2)) # [1] TRUE
>
> ## data.table v1.9.2's rbind version:
> ## ran only once as it took quite a bit longer.
> system.time(ans1 <- do.call("rbind", ll))
> user system elapsed
> 125.356 2.247 139.000
>
> identical(ans1, setDT(ans2)) # [1] TRUE
>
> In summary, the newer implementation is about 11-23x faster than
> data.table's older implementation and about 5.5-10x faster than dplyr on
> this (relatively huge) data.
>  
> Arun  
>  
> From: Arunkumar Srinivasan aragorn168b at gmail.com  
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com  
> Date: May 20, 2014 at 9:27:56 PM  
> To: datatable-help at lists.r-forge.r-project.org  
> datatable-help at lists.r-forge.r-project.org  
> Subject: FR #5249 - rbindlist gains use.names and fill arguments  
>  
> Hello everyone,  
>  
> With the latest commit #1266, the extra functionality offered via rbind
> (use.names and fill) is also now available to rbindlist. In addition, the
> implementation is completely moved to C, and is therefore tremendously fast,
> especially for cases where one has to bind with use.names=TRUE and/or
> with fill=TRUE. I'll try to put out a benchmark comparing speed differences
> with the older implementation ASAP.
>
> Note that this change comes at a very low cost to the default speed of
> rbindlist (with use.names=FALSE and fill=FALSE). As an example, binding
> 10,000 data.tables with 20 columns each took 0.107 seconds with the new
> version, whereas the older version ran in 0.095 seconds.
>
> In addition, the documentation for ?rbindlist has also been improved (#5158
> from Alexander). Here's the change log from NEWS:
>  
> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
>   implemented entirely in C. Closes #5249
>   -> use.names by default is FALSE for backwards compatibility
>      (doesn't bind by names by default)
>   -> rbind(...) now just calls rbindlist() internally, except that
>      'use.names' is TRUE by default, for compatibility with base
>      (and backwards compatibility).
>   -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
>   -> At least one item of the input list has to have non-null column names.
>   -> Duplicate columns are bound in the order of occurrence, like base.
>   -> Attributes that might exist in individual items would be lost in
>      the bound result.
>   -> Columns are coerced to the highest SEXPTYPE, if they are
>      different, if/when possible.
>   -> And incredibly fast ;).
>   -> Documentation updated in much detail. Closes DR #5158.
>   Eddi's (excellent) work on finding factor levels, type coercion of
>   columns etc. are all retained.
>  
> Please try it and write back if things aren't working as they did before. The
> tests that had to be fixed cover extremely rare cases. I suspect there should
> be minimal issues, if any, in this version. However, I do find the changes
> here bring consistency to the function.
>  
> One (very rare) feature that is not available due to this implementation is  
> the ability to recycle.  
>  
> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))  
> lst1 <- list(x=4, y=5, z=as.list(1:3))  
>  
> rbind(dt1, lst1)  
> #    x y       z
> # 1: 1 4     1,2
> # 2: 2 5   1,2,3
> # 3: 3 6 1,2,3,4
> # 4: 4 5       1
> # 5: 4 5       2
> # 6: 4 5       3
>
> The 4,5 are recycled very nicely here. This is not possible at the moment.
> It worked earlier because the old rbind implementation used as.data.table
> to convert its inputs, but that takes a copy (very inefficient on huge /
> many tables). I'd love to add this feature in C as well, as it would help
> incredibly for use within [.data.table (now that we can fill columns and
> bind by names faster). Will add a FR.
>
> In summary, I think there should be minimal issues, if any, and it should be
> much faster (for rbind cases). Please write back with what you think, if you
> happen to try it out.
>  
>  
>  
> Arun  
>  
>  
> _______________________________________________  
> datatable-help mailing list  
> datatable-help at lists.r-forge.r-project.org  
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  


--  
Statistics & Software Consulting  
GKX Group, GKX Associates Inc.  
tel: 1-877-GKX-GROUP  
email: ggrothendieck at gmail.com  

From ggrothendieck at gmail.com  Tue May 20 23:13:55 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 17:13:55 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
	arguments
In-Reply-To: <etPan.537bc2c0.238e1f29.11385@Arunkumars-MacBook-Pro.local>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uR=WxhcAy-TgjhMmyUrnztiYgBgFeFAUe-LTedbEvPgPmQ@mail.gmail.com>
 <etPan.537bc2c0.238e1f29.11385@Arunkumars-MacBook-Pro.local>
Message-ID: <CAP01uRmQab6sRfbR5MJRpU85TZPuX68n6KnLofVkTGkxpGOspw@mail.gmail.com>

Yes.  That is what I intended.

rbindlist on CRAN currently has no fill or use.names arguments.  What
combination of the new fill and use.names does the current CRAN rbindlist
correspond to?


On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> I think I understand now what you're trying to say. Going back to an earlier
> post, you wrote:
>
> Then why not make the default of `use.names` be `fill`. Then you don't get
> the warning and you can tell just from the argument list what the
> dependencies are.
>
> You mean to basically do?
>
> rbindlist <- function(l, use.names=fill, fill=FALSE)
> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>
> Is this what you mean? If so, the defaults from the previous versions will
> be changed. The ones who use rbind directly without setting use.names will
> have different results.. (assuming I understand you correctly this time).
>
>
> Arun
>


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com  Tue May 20 23:16:27 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 20 May 2014 23:16:27 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
 arguments
In-Reply-To: <CAP01uRmQab6sRfbR5MJRpU85TZPuX68n6KnLofVkTGkxpGOspw@mail.gmail.com>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uR=WxhcAy-TgjhMmyUrnztiYgBgFeFAUe-LTedbEvPgPmQ@mail.gmail.com>
 <etPan.537bc2c0.238e1f29.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRmQab6sRfbR5MJRpU85TZPuX68n6KnLofVkTGkxpGOspw@mail.gmail.com>
Message-ID: <etPan.537bc62b.3d1b58ba.11385@Arunkumars-MacBook-Pro.local>

In the current CRAN:

rbindlist corresponds to use.names=FALSE and fill = FALSE
rbind corresponds to use.names=TRUE and fill = FALSE
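
To make the two behaviours concrete, here is a small sketch in base R only (no data.table calls), assuming two made-up data frames a and b whose columns are in a different order. Base rbind on data frames matches columns by name, which is what use.names=TRUE means; renaming b's columns first imitates binding by position.

```r
a <- data.frame(x = 1:2, y = 3:4)
b <- data.frame(y = 5:6, x = 7:8)   # same columns, different order

# Bind by name (use.names=TRUE): base rbind matches data frame columns by name.
by_name <- rbind(a, b)

# Bind by position (use.names=FALSE): ignore b's names and take a's instead.
by_pos <- rbind(a, setNames(b, names(a)))

by_name$x   # 1 2 7 8
by_pos$x    # 1 2 5 6
```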

Just to be clear, again, are you suggesting that I change the defaults to use.names=fill and fill=FALSE for *just* rbindlist, or for both rbindlist and rbind?

Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 20, 2014 at 11:14:15 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

Yes. That is what I intended.

rbindlist on CRAN currently has no fill or use.names arguments. What
combination of the new fill and use.names does the current CRAN rbindlist
correspond to?



From ggrothendieck at gmail.com  Wed May 21 01:02:43 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 19:02:43 -0400
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
	arguments
In-Reply-To: <etPan.537bc62b.3d1b58ba.11385@Arunkumars-MacBook-Pro.local>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uR=WxhcAy-TgjhMmyUrnztiYgBgFeFAUe-LTedbEvPgPmQ@mail.gmail.com>
 <etPan.537bc2c0.238e1f29.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRmQab6sRfbR5MJRpU85TZPuX68n6KnLofVkTGkxpGOspw@mail.gmail.com>
 <etPan.537bc62b.3d1b58ba.11385@Arunkumars-MacBook-Pro.local>
Message-ID: <CAP01uRnWmQ-s5sAQYGbgA81KnWYuRZ_dTWoAgVZ34f1txeRt5w@mail.gmail.com>

In that case I suggest just changing rbindlist to have use.names =
fill and leaving rbind as is.
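
A rough base R sketch of that arrangement, for illustration only. The names bind_worker, rbindlist_sketch and rbind_sketch are hypothetical, and the binding is done on plain data frames rather than with data.table's C code; only the defaults and the rule that fill=TRUE needs use.names=TRUE come from this thread.

```r
# Shared worker (hypothetical): binds a list of data frames.
bind_worker <- function(l, use.names, fill) {
  if (fill && !use.names) stop("fill = TRUE requires use.names = TRUE")
  if (!use.names) {
    # By position: every item takes the first item's column names.
    l <- lapply(l, function(d) setNames(d, names(l[[1L]])))
  } else if (fill) {
    # Fill: add any missing column as NA, then use a common column order.
    all_names <- unique(unlist(lapply(l, names)))
    l <- lapply(l, function(d) {
      for (nm in setdiff(all_names, names(d))) d[[nm]] <- NA
      d[all_names]
    })
  }
  do.call(rbind, l)  # base rbind matches data frame columns by name
}

# Suggested: use.names follows fill, so the default stays the F/F case.
rbindlist_sketch <- function(l, use.names = fill, fill = FALSE)
  bind_worker(l, use.names, fill)

# Left as is: binds by name by default, like base.
rbind_sketch <- function(..., use.names = TRUE, fill = FALSE)
  bind_worker(list(...), use.names, fill)

a <- data.frame(x = 1:2, y = 3:4)
b <- data.frame(y = 5:6, x = 7:8)
d <- data.frame(x = 9L, z = 10L)

rbindlist_sketch(list(a, b))$x                    # bound by position
rbind_sketch(a, b)$x                              # bound by name
names(rbindlist_sketch(list(a, d), fill = TRUE))  # union of column names
```

Passing fill = TRUE alone is enough for rbindlist_sketch, because use.names then defaults to TRUE as well.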

On Tue, May 20, 2014 at 5:16 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> In the current CRAN:
>
> rbindlist corresponds to use.names=FALSE and fill = FALSE
> rbind corresponds to use.names=TRUE and fill = FALSE
>
> Just to be clear, again, are you suggesting that I change *just* rbindlist's
> defaults to use.names=fill and fill=FALSE or for both?
> Arun
>
> From: Gabor Grothendieck ggrothendieck at gmail.com
> Reply: Gabor Grothendieck ggrothendieck at gmail.com
> Date: May 20, 2014 at 11:14:15 PM
>
> To: Arunkumar Srinivasan aragorn168b at gmail.com
> Cc: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject:  Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill
> arguments
>
> Yes. That is what I intended.
>
> rbindlist on CRAN currently has no fill or use.names arguments. What
> combination of the new fill and use.names does the current CRAN rbindlist
> correspond to?
>
>
>
> On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>> I think I understand now what you?re trying to say. Going back to an
>> earlier
>> post, you wrote:
>>
>> Then why not make the default of `use.names` be `fill`. Then you don't get
>> the warning and you can tell just from the argument list what the
>> dependencies are.
>>
>> You mean to basically do?
>>
>> rbindlist <- function(l, use.names=fill, fill=FALSE)
>> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>>
>> Is this what you mean? If so, the defaults from the previous versions will
>> be changed. The ones who use rbind directly without setting use.names will
>> have different results.. (assuming I understand you correctly this time).
>>
>>
>> Arun
>>
>> From: Gabor Grothendieck ggrothendieck at gmail.com
>> Reply: Gabor Grothendieck ggrothendieck at gmail.com
>> Date: May 20, 2014 at 10:49:54 PM
>>
>> To: Arunkumar Srinivasan aragorn168b at gmail.com
>> Cc: datatable-help at lists.r-forge.r-project.org
>> datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and
>> fill
>> arguments
>>
>> If I understand this right then the table below shows the valid
>> logical combinations in order of speed (slowest first). Is that
>> right? If so then if fill = FALSE and use.names = fill then we get
>> the fastest case by default.
>>
>> Furthermore if you were concerned that we might be T/T when F/T would
>> be sufficient I don't think that is likely since getting F/T is done
>> by setting use.names = TRUE.
>>
>> fill/use.names
>> T/T (slowest)
>> F/T
>> F/F (fasetest)
>>
>>
>> On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan
>> <aragorn168b at gmail.com> wrote:
>>> I've filed FR #5690 to remind myself of the recycling feature; that'd be
>>> awesome to have.
>>>
>>> One feature I forgot to point out in the previous post is that, even when
>>> there are duplicate names, rbind/rbindlist binds them consistent with
>>> 'base' when use.names=TRUE. And it fills the duplicate columns properly
>>> (in the order of occurrence) also when fill=TRUE.
>>>
>>> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
>>> columns ranging from V1 to V500 in random order (all integers for
>>> simplicity). We'll need to just use use.names=TRUE (as all columns are
>>> available in all data.tables).
>>>
>>> I think this data is big enough to illustrate the point. Also, I was
>>> curious to see a comparison against dplyr's rbind_all (commit 1504 devel
>>> version). So, I've added it as well to the benchmarks.
>>>
>>> Here's the data generation. Note: It takes a while for this step to
>>> finish.
>>>
>>> require(data.table) ## 1.9.3 commit 1267
>>> require(dplyr) ## commit 1504 devel
>>> set.seed(1L)
>>> ## build a k-column data.table of random integer columns
>>> foo <- function(k) {
>>>     ans = setDT(lapply(1:k, function(x) sample(10)))
>>> }
>>> ## rename the columns to n names sampled from V1..Vk
>>> bar <- function(ans, k, n) {
>>>     bla = sample(paste0("V", 1:k), n)
>>>     setnames(ans, bla)
>>> }
>>> n = 10000L
>>> ll = vector("list", n)
>>> for (i in 1:n) {
>>>     bla = bar(foo(500L), 500L, 500L)
>>>     .Call("Csetlistelt", ll, i, bla)
>>> }
>>>
>>> And here are the timings:
>>>
>>> ## data.table v1.9.3 commit 1267's rbindlist
>>> ## Timings of three consecutive runs:
>>> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
>>> user system elapsed
>>> 10.909 0.449 11.843
>>>
>>> user system elapsed
>>> 5.219 0.386 5.640
>>>
>>> user system elapsed
>>> 5.355 0.429 5.898
>>>
>>> ## dplyr's rbind_all
>>> ## Timings for three consecutive runs
>>> system.time(ans2 <- rbind_all(ll))
>>> user system elapsed
>>> 62.769 0.247 63.941
>>>
>>> user system elapsed
>>> 62.010 0.335 65.876
>>>
>>> user system elapsed
>>> 55.345 0.359 60.193
>>>
>>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> ## data.table v1.9.2's rbind version:
>>> ## ran only once as it took a bit more.
>>> system.time(ans1 <- do.call("rbind", ll))
>>> user system elapsed
>>> 125.356 2.247 139.000
>>>
>>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> In summary, the newer implementation is about 11-23x faster than
>>> data.table's older implementation and about 5.5-10x faster than dplyr on
>>> this (relatively huge) data.
>>>
>>> Arun
>>>
>>> From: Arunkumar Srinivasan aragorn168b at gmail.com
>>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>>> Date: May 20, 2014 at 9:27:56 PM
>>> To: datatable-help at lists.r-forge.r-project.org
>>> datatable-help at lists.r-forge.r-project.org
>>> Subject: FR #5249 - rbindlist gains use.names and fill arguments
>>>
>>> Hello everyone,
>>>
>>> With the latest commit #1266, the extra functionality offered via rbind
>>> (use.names and fill) is also now available to rbindlist. In addition, the
>>> implementation is completely moved to C, and is therefore tremendously
>>> fast, especially for cases where one has to bind with use.names=TRUE
>>> and/or fill=TRUE. I'll try to put out a benchmark comparing speed
>>> differences with the older implementation ASAP.
>>>
>>> Note that this change comes with a very low cost to the default speed of
>>> rbindlist - with use.names=FALSE and fill=FALSE. As an example, binding
>>> 10,000 data.tables with 20 columns each resulted in the new version
>>> running in 0.107 seconds, whereas the older version ran in 0.095 seconds.
>>>
>>> In addition the documentation for ?rbindlist also has been improved
>>> (#5158 from Alexander). Here's the change log from NEWS:
>>>
>>> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
>>>   implemented entirely in C. Closes #5249
>>>   -> use.names by default is FALSE for backwards compatibility
>>>      (doesn't bind by names by default)
>>>   -> rbind(...) now just calls rbindlist() internally, except that
>>>      'use.names' is TRUE by default, for compatibility with base
>>>      (and backwards compatibility).
>>>   -> fill by default is FALSE. If fill is TRUE, use.names has to be
>>>      TRUE.
>>>   -> At least one item of the input list has to have non-null column
>>>      names.
>>>   -> Duplicate columns are bound in the order of occurrence, like
>>>      base.
>>>   -> Attributes that might exist in individual items would be lost in
>>>      the bound result.
>>>   -> Columns are coerced to the highest SEXPTYPE, if they are
>>>      different, if/when possible.
>>>   -> And incredibly fast ;).
>>>   -> Documentation updated in much detail. Closes DR #5158.
>>>   Eddi's (excellent) work on finding factor levels, type coercion of
>>>   columns etc. are all retained.
>>>
>>> Please try it and write back if things aren't working as they were before.
>>> The tests that had to be fixed are extremely rare cases. I suspect there
>>> should be minimal issues, if any, in this version. However, I do find the
>>> changes here bring consistency to the function.
>>>
>>> One (very rare) feature that is not available due to this implementation
>>> is the ability to recycle.
>>>
>>> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
>>> lst1 <- list(x=4, y=5, z=as.list(1:3))
>>>
>>> rbind(dt1, lst1)
>>> #    x y       z
>>> # 1: 1 4     1,2
>>> # 2: 2 5   1,2,3
>>> # 3: 3 6 1,2,3,4
>>> # 4: 4 5       1
>>> # 5: 4 5       2
>>> # 6: 4 5       3
>>>
>>> The 4,5 are recycled very nicely here. This is not possible at the
>>> moment. This is because the earlier rbind implementation used
>>> as.data.table to convert to data.table; however, it takes a copy (very
>>> inefficient on huge / many tables). I'd love to add this feature in C as
>>> well, as it would help incredibly for use within [.data.table (now that
>>> we can fill columns and bind by names faster). Will add a FR.
>>>
>>> In summary, I think there should be minimal issues, if any, and it should
>>> be much faster (for rbind cases). Please write back what you think, if you
>>> happen to try it out.
>>>
>>>
>>>
>>> Arun


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From npgraham1 at gmail.com  Wed May 21 02:20:34 2014
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 20 May 2014 20:20:34 -0400
Subject: [datatable-help] rbindlist and unique
Message-ID: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>

First, I use rbindlist pretty often, and I've been quite happy with it.
 The new use.names and fill features definitely scratch an itch for me; I
wound up using rbind_all from dplyr (which worked well, I'm not
complaining), but I'm looking forward to having a data.table
implementation.  The speed increase is also welcome.  So thank you for the
new features!  I don't personally have a preference with respect to the
use.names and fill defaults, so whatever you guys decide will be fine with
me.

I do have a question regarding unique, which I use very, very frequently,
and often after rbindlist.  I have a fairly large data set (tens of
millions of raw observations), many of which are duplicates.  The
observations come from a variety of sources, but the formats and variable
names are (nearly) identical.

The problem is that many "duplicates" aren't perfect duplicates, and some
rows have more information than others.  A simple example might look like
this:

> foo
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  1  3 TRUE
6:  1  4   NA
7:  2  3 TRUE
8:  2  4 TRUE
9:  3  1   NA
> unique(foo, by = c("V1", "V2"))
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  3  1   NA


Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story)
uniquely identify an observation, but if there's a row where I also have
V3, I'd prefer to have that row rather than a row where it's missing.  You
can see that a naive use of unique here gets me the less-preferable 2,3
row.  If I only had three columns, this would be easy to solve (sort/setkey
first would do it).  However, I have more than a dozen additional columns,
and when I drop duplicates I want to retain the row with the greatest
number of non-missing values.  Additionally, some columns are more
important than others.  If (to refer again to the example above), there are
no rows that have V3 for a given V1 & V2 (like 3,1), I still need to retain
a row, so I can't just condition on !is.na(V3).

Does anybody have any insight or techniques for this sort of thing?  I'm
currently sorting on all columns prior to unique, but I'm quite sure that
this loses some information.


-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/

From ggrothendieck at gmail.com  Wed May 21 02:34:10 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 20:34:10 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>
References: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>
Message-ID: <CAP01uRmwYWdu9yps4Ga2Fx-ZvEsXZqnX1-LAsF=nn3BEyw9bfg@mail.gmail.com>

On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham <npgraham1 at gmail.com> wrote:
> First, I use rbindlist pretty often, and I've been quite happy with it.  The
> new use.names and fill features definitely scratch an itch for me; I wound
> up using rbind_all from dplyr (which worked well, I'm not complaining), but
> I'm looking forward to having a data.table implementation.  The speed
> increase is also welcome.  So thank you for the new features!  I don't
> personally have a preference with respect to the use.names and fill
> defaults, so whatever you guys decide will be fine with me.
>
> I do have a question regarding unique, which I use very, very frequently,
> and often after rbindlist.  I have a fairly large data set (tens of millions
> of raw observations), many of which are duplicates.  The observations come
> from a variety of sources, but the formats and variable names are (nearly)
> identical.
>
> The problem is that many "duplicates" aren't perfect duplicates, and some
> rows have more information than others.  A simple example might look like
> this:
>
>> foo
>    V1 V2   V3
> 1:  1  3 TRUE
> 2:  1  4 TRUE
> 3:  2  3   NA
> 4:  2  4 TRUE
> 5:  1  3 TRUE
> 6:  1  4   NA
> 7:  2  3 TRUE
> 8:  2  4 TRUE
> 9:  3  1   NA
>> unique(foo, by = c("V1", "V2"))
>    V1 V2   V3
> 1:  1  3 TRUE
> 2:  1  4 TRUE
> 3:  2  3   NA
> 4:  2  4 TRUE
> 5:  3  1   NA
>
>
> Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story)
> uniquely identify an observation, but if there's a row where I also have V3,
> I'd prefer to have that row rather than a row where it's missing.  You can
> see that a naive use of unique here gets me the less-preferable 2,3 row.  If
> I only had three columns, this would be easy to solve (sort/setkey first
> would do it).  However, I have more than a dozen additional columns, and
> when I drop duplicates I want to retain the row with the greatest number of
> non-missing values.  Additionally, some columns are more important than
> others.  If (to refer again to the example above), there are no rows that
> have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so I
> can't just condition on !is.na(V3).
>
> Does anybody have any insight or techniques for this sort of thing?  I'm
> currently sorting on all columns prior to unique, but I'm quite sure that
> this loses some information.

Append an importance column which ranks the importance of that row
(lower better) and make importance the low order component of the key.

DT[, importance := 0+is.na(V3)]
setkey(DT, V1, V2, importance)
unique(DT, by = c("V1", "V2"))
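
The same idea might be extended to the case with many columns of differing
importance. The following is only a sketch under assumptions: the column
names V4 and V5, the sample data, and the weights in w are all made up for
illustration and are not from the original question. Each missing value adds
its column's weight to the importance score, so the row that is missing the
least (weighted) information sorts first within each V1/V2 group.

```r
library(data.table)

# Hypothetical data: V1/V2 identify an observation, V3..V5 may be missing.
DT <- data.table(
  V1 = c(1, 1, 2, 2, 3),
  V2 = c(3, 3, 4, 4, 1),
  V3 = c(NA, TRUE, TRUE, TRUE, NA),
  V4 = c("a", "b", NA, "c", NA),
  V5 = c(1.5, NA, 2.5, 3.5, NA)
)

id_cols <- c("V1", "V2")

# Hypothetical weights: a missing V3 costs more than a missing V4 or V5.
w <- c(V3 = 4, V4 = 2, V5 = 1)

# importance = weighted count of missing values (lower is better).
DT[, importance := Reduce(`+`, Map(function(col, wt) wt * is.na(col), .SD, w)),
   .SDcols = names(w)]

# Sort so the best row comes first in each group, then keep that first row.
setkeyv(DT, c(id_cols, "importance"))
best <- unique(DT, by = id_cols)
print(best)
```

A group with no complete row (such as 3,1 here) still keeps one row, because
the score only ranks rows and never filters them out.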


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From npgraham1 at gmail.com  Wed May 21 02:45:16 2014
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 20 May 2014 20:45:16 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To: <CAP01uRmwYWdu9yps4Ga2Fx-ZvEsXZqnX1-LAsF=nn3BEyw9bfg@mail.gmail.com>
References: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>
 <CAP01uRmwYWdu9yps4Ga2Fx-ZvEsXZqnX1-LAsF=nn3BEyw9bfg@mail.gmail.com>
Message-ID: <CALhihUjRJ-704kQgdjD=9iYujEXiS2ZYr9fxUM=h4EixwWoPYQ@mail.gmail.com>

Thanks!  That's a good idea, and a lot simpler than what I was concocting
in my head.  I'll give that a try.  I think, just for posterity, you mean

DT[, importance := 0 - is.na(V3)]

rather than 0 + is.na(V3), so that rows with V3 are lower than rows without.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/



From ggrothendieck at gmail.com  Wed May 21 02:50:54 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 20 May 2014 20:50:54 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To: <CALhihUjRJ-704kQgdjD=9iYujEXiS2ZYr9fxUM=h4EixwWoPYQ@mail.gmail.com>
References: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>
 <CAP01uRmwYWdu9yps4Ga2Fx-ZvEsXZqnX1-LAsF=nn3BEyw9bfg@mail.gmail.com>
 <CALhihUjRJ-704kQgdjD=9iYujEXiS2ZYr9fxUM=h4EixwWoPYQ@mail.gmail.com>
Message-ID: <CAP01uRnwk2evvtAst4wFggOBodsKQkuwW4Ck9RhXYRFcfYq92Q@mail.gmail.com>

On Tue, May 20, 2014 at 8:45 PM, Nathaniel Graham <npgraham1 at gmail.com> wrote:
> Thanks!  That's a good idea, and a lot simpler than what I was concocting in
> my head.  I'll give that a try.  I think, just for posterity, you mean
>
> DT[, importance := 0 - is.na(V3)]
>
> rather than 0 + is.na(V3), so that rows with V3 are lower than rows without.

0 + is.na(V3) was intended.  We want the good rows to have a lower
importance than the bad rows so 0+is.na(V3)  gives a non-NA V3 an
importance of 0 and it gives a V3 which is NA an importance of 1.
When we sort them using setkey the non-NA of 0 comes first so it is
the one picked by unique.

> DT[, importance := 0+is.na(V3)]
> setkey(DT, V1, V2, importance)
> unique(DT, by = c("V1", "V2"))
   V1 V2   V3 importance
1:  1  3 TRUE          0
2:  1  4 TRUE          0
3:  2  3 TRUE          0
4:  2  4 TRUE          0
5:  3  1   NA          1
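
For anyone wanting to check the sign question end to end, here is a
self-contained sketch that rebuilds the example table from the values posted
in this thread and runs both variants. The helper name pick() is made up for
illustration; everything else is the code shown above.

```r
library(data.table)

# The example table from the original question.
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

# pick() is a hypothetical helper: score the rows, sort, keep first per group.
pick <- function(dt, score) {
  dt <- copy(dt)                      # leave the input untouched
  dt[, importance := score(V3)]
  setkey(dt, V1, V2, importance)      # lowest importance first within V1/V2
  unique(dt, by = c("V1", "V2"))
}

plus  <- pick(foo, function(v) 0 + is.na(v))  # NA rows score 1, sort last
minus <- pick(foo, function(v) 0 - is.na(v))  # NA rows score -1, sort first

print(plus)
print(minus)
```

With 0 + is.na(V3) the non-missing rows win wherever one exists; with
0 - is.na(V3) the missing rows are sorted to the front and get picked instead.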

From npgraham1 at gmail.com  Wed May 21 02:56:56 2014
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 20 May 2014 20:56:56 -0400
Subject: [datatable-help] rbindlist and unique
In-Reply-To: <CAP01uRnwk2evvtAst4wFggOBodsKQkuwW4Ck9RhXYRFcfYq92Q@mail.gmail.com>
References: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>
 <CAP01uRmwYWdu9yps4Ga2Fx-ZvEsXZqnX1-LAsF=nn3BEyw9bfg@mail.gmail.com>
 <CALhihUjRJ-704kQgdjD=9iYujEXiS2ZYr9fxUM=h4EixwWoPYQ@mail.gmail.com>
 <CAP01uRnwk2evvtAst4wFggOBodsKQkuwW4Ck9RhXYRFcfYq92Q@mail.gmail.com>
Message-ID: <CALhihUgaMmCiw8HdLNGKaDziSEUwPM+2G0rB5iSNuERC2xXn-Q@mail.gmail.com>

My mistake, you're correct.  I reversed it in my head.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/



From aragorn168b at gmail.com  Wed May 21 09:23:12 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 21 May 2014 09:23:12 +0200
Subject: [datatable-help] FR #5249 - rbindlist gains use.names and fill
 arguments
In-Reply-To: <CAP01uRnWmQ-s5sAQYGbgA81KnWYuRZ_dTWoAgVZ34f1txeRt5w@mail.gmail.com>
References: <etPan.537bacb8.6b8b4567.11385@Arunkumars-MacBook-Pro.local>
 <etPan.537bbaf8.2ae8944a.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uR=WxhcAy-TgjhMmyUrnztiYgBgFeFAUe-LTedbEvPgPmQ@mail.gmail.com>
 <etPan.537bc2c0.238e1f29.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRmQab6sRfbR5MJRpU85TZPuX68n6KnLofVkTGkxpGOspw@mail.gmail.com>
 <etPan.537bc62b.3d1b58ba.11385@Arunkumars-MacBook-Pro.local>
 <CAP01uRnWmQ-s5sAQYGbgA81KnWYuRZ_dTWoAgVZ34f1txeRt5w@mail.gmail.com>
Message-ID: <etPan.537c5460.515f007c.11385@Arunkumars-MacBook-Pro.local>

Great. That makes total sense to me. No defaults are affected as well. Thanks again.

Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 21, 2014 at 1:03:03 AM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

In that case I suggest just changing rbindlist to have use.names =  
fill and leave rbind as is.  
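
A rough sketch of what that suggestion amounts to, relying on R evaluating
default arguments lazily so that one default can refer to another argument.
The wrapper bind_sketch() is hypothetical and only forwards to rbindlist; it
is not data.table's actual source, and the small tables are made up.

```r
library(data.table)

# Hypothetical wrapper: use.names defaults to whatever fill is.
bind_sketch <- function(l, use.names = fill, fill = FALSE) {
  rbindlist(l, use.names = use.names, fill = fill)
}

a <- data.table(x = 1:2, y = 3:4)
b <- data.table(y = 5:6, x = 7:8)   # same columns as a, different order
d <- data.table(x = 9L, z = 10L)    # y is absent, z is extra

by_pos  <- bind_sketch(list(a, b))                    # F/F: bind by position
by_name <- bind_sketch(list(a, b), use.names = TRUE)  # F/T: bind by name
filled  <- bind_sketch(list(a, d), fill = TRUE)       # T/T: use.names follows fill

print(by_pos)
print(by_name)
print(filled)
```

Under this assumption the caller gets the fastest F/F case by default, gets
F/T only by asking for use.names = TRUE, and never needs to spell out
use.names when asking for fill = TRUE.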

On Tue, May 20, 2014 at 5:16 PM, Arunkumar Srinivasan  
<aragorn168b at gmail.com> wrote:  
> In the current CRAN:  
>  
> rbindlist corresponds to use.names=FALSE and fill = FALSE  
> rbind corresponds to use.names=TRUE and fill = FALSE  
>  
> Just to be clear, again, are you suggesting that I change *just* rbindlist's  
> defaults to use.names=fill and fill=FALSE or for both?  
> Arun  
>  
> From: Gabor Grothendieck ggrothendieck at gmail.com  
> Reply: Gabor Grothendieck ggrothendieck at gmail.com  
> Date: May 20, 2014 at 11:14:15 PM  
>  
> To: Arunkumar Srinivasan aragorn168b at gmail.com  
> Cc: datatable-help at lists.r-forge.r-project.org  
> datatable-help at lists.r-forge.r-project.org  
> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill  
> arguments  
>  
> Yes. That is what I intended.  
>  
> rbindlist on CRAN currently has no fill or use.names arguments. What
> combo of the new fill and use.names does the current CRAN rbindlist
> correspond to?
>  
>  
>  
> On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan  
> <aragorn168b at gmail.com> wrote:  
>> I think I understand now what you're trying to say. Going back to an
>> earlier post, you wrote:
>>  
>> Then why not make the default of `use.names` be `fill`. Then you don't get  
>> the warning and you can tell just from the argument list what the  
>> dependencies are.  
>>  
>> You mean to basically do?  
>>  
>> rbindlist <- function(l, use.names=fill, fill=FALSE)  
>> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)  
>>  
>> Is this what you mean? If so, the defaults from the previous versions will  
>> be changed. The ones who use rbind directly without setting use.names will  
>> have different results.. (assuming I understand you correctly this time).  
>>  
>>  
>> Arun  
>>  


--  
Statistics & Software Consulting  
GKX Group, GKX Associates Inc.  
tel: 1-877-GKX-GROUP  
email: ggrothendieck at gmail.com  

From aragorn168b at gmail.com  Wed May 21 12:56:32 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 21 May 2014 12:56:32 +0200
Subject: [datatable-help] R Studio Interactions with data.table
In-Reply-To: <CAAf756NnMGvrozrPu=5dKBsT9jntpLMP=B9qjPPV65H+ppXMUw@mail.gmail.com>
References: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com>
 <CAAf756NnMGvrozrPu=5dKBsT9jntpLMP=B9qjPPV65H+ppXMUw@mail.gmail.com>
Message-ID: <CAAf756N4GQa_yd_X6jZOwi5BSOdLw5DQaYifMPtuysbngQgXKA@mail.gmail.com>

Zachary,
Fixed in #1269 (v1.9.3). Please do write back if you still experience the
issue after update.
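
Separately from the fix, the lagged column in the report could probably be
built without relying on an over-length right-hand side being cut down to the
group size. The following is a sketch only, and it assumes a later data.table
release that provides shift(); it is not the change made in #1269.

```r
library(data.table)

dt <- data.table(
  strip = "Nov08",
  date  = c("2006-08-01", "2006-08-02", "2006-08-03", "2006-08-04", "2006-08-07",
            "2006-08-08", "2006-08-09", "2006-08-10", "2006-08-11", "2006-08-14")
)

# Lag date by 5 rows within each strip; the first 5 rows of a group become NA.
dt[, forward_date := shift(date, n = 5L, type = "lag"), by = strip]
print(dt)
```

Here the right-hand side has exactly one value per row of the group, so no
recycling or truncation warning is involved.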


On Sun, Apr 27, 2014 at 9:48 PM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

> Zack,
> I'm able to reproduce the crash and the occasional warning. Will look into
> it - filed #5647<https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5647&group_id=240&atid=975>.
> Thanks for reporting.
> Arun.
>
>
> On Tue, Apr 22, 2014 at 8:07 PM, Zachary Long <long at dc-energy.com> wrote:
>
>> Hello,
>>
>> I was wondering if an error like this had been addressed before. I am
>> using data.table 1.9.2.
>>
>> It appears that the error has to do with the interaction with R-Studio.
>> When I run
>>
>> library(data.table)
>>
>> dt<-data.table(strip="Nov08",date=c("2006-08-01","2006-08-02","2006-08-03","2006-08-04","2006-08-07",
>>
>>                                     "2006-08-08","2006-08-09","2006-08-10","2006-08-11","2006-08-14"))
>> dt[,forward_date:=c(rep(NA,5),date),by='strip']
>>
>>
>> The result I expect is below, along with a warning message.
>>
>>     strip       date forward_date
>>  1: Nov08 2006-08-01           NA
>>  2: Nov08 2006-08-02           NA
>>  3: Nov08 2006-08-03           NA
>>  4: Nov08 2006-08-04           NA
>>  5: Nov08 2006-08-07           NA
>>  6: Nov08 2006-08-08   2006-08-01
>>  7: Nov08 2006-08-09   2006-08-02
>>  8: Nov08 2006-08-10   2006-08-03
>>  9: Nov08 2006-08-11   2006-08-04
>> 10: Nov08 2006-08-14   2006-08-07
>>
>>
>> However, I don't get this.
>>
>> One of two things can happen.
>>
>> 1. My R-Studio will completely crash without warning. All unsaved
>> information is lost.
>> 2. I can get "Error: Value of SET_STRING_ELT() must be a 'CHARSXP' not a
>> 'character'" "In addition:Lost warning messages"
>>
>> Do you know what is the cause here? It seems related to memory
>> allocation, or something under the hood relating to the interaction of
>> R-Studio and data table.
>>
>> Zach
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>

From aragorn168b at gmail.com  Wed May 21 13:00:56 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 21 May 2014 13:00:56 +0200
Subject: [datatable-help] rbindlist and unique
In-Reply-To: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>
References: <CALhihUiy0z9ZnMFGmyGdozy6zYr=eBP4-PPV_KSDFaCba+=gkQ@mail.gmail.com>
Message-ID: <etPan.537c8768.12200854.11385@Arunkumars-MacBook-Pro.local>

Nathaniel, Thanks.

> First, I use rbindlist pretty often, and I've been quite happy with it. The new use.names and fill features definitely scratch an itch for me; I wound up using rbind_all from dplyr (which worked well, I'm not complaining), but I'm looking forward to having a data.table implementation.

A data.table implementation (in rbind) has existed since the last release (v1.9.0/2). This one just builds on it.


Arun

From: Nathaniel Graham npgraham1 at gmail.com
Reply: Nathaniel Graham npgraham1 at gmail.com
Date: May 21, 2014 at 2:20:44 AM
To: data.table source forge datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] rbindlist and unique

First, I use rbindlist pretty often, and I've been quite happy with it. The new use.names and fill features definitely scratch an itch for me; I wound up using rbind_all from dplyr (which worked well, I'm not complaining), but I'm looking forward to having a data.table implementation. The speed increase is also welcome. So thank you for the new features! I don't personally have a preference with respect to the use.names and fill defaults, so whatever you guys decide will be fine with me.

I do have a question regarding unique, which I use very, very frequently, and often after rbindlist. I have a fairly large data set (tens of millions of raw observations), many of which are duplicates. The observations come from a variety of sources, but the formats and variable names are (nearly) identical.

The problem is that many "duplicates" aren't perfect duplicates, and some rows have more information than others. A simple example might look like this:

> foo
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  1  3 TRUE
6:  1  4   NA
7:  2  3 TRUE
8:  2  4 TRUE
9:  3  1   NA
> unique(foo, by = c("V1", "V2"))
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  3  1   NA


Sometimes V3 is present and sometimes it isn't. V1 and V2 (in my story) uniquely identify an observation, but if there's a row where I also have V3, I'd prefer to have that row rather than a row where it's missing. You can see that a naive use of unique here gets me the less-preferable 2,3 row. If I only had three columns, this would be easy to solve (sort/setkey first would do it). However, I have more than a dozen additional columns, and when I drop duplicates I want to retain the row with the greatest number of non-missing values. Additionally, some columns are more important than others. If (to refer again to the example above), there are no rows that have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so I can't just condition on !is.na(V3).

Does anybody have any insight or techniques for this sort of thing? I'm currently sorting on all columns prior to unique, but I'm quite sure that this loses some information.
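
One possible approach to the question above, offered only as a sketch and not as an answer from the list: score each row by how many non-key values are present, sort so the best-scoring row comes first within each V1/V2 pair, and then let unique() keep that first row. The names id_cols, value_cols, weights and score are hypothetical, and the weights vector is an assumed way of expressing "some columns are more important than others".

```r
library(data.table)

# Rebuild the example table from the message.
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

id_cols    <- c("V1", "V2")                 # columns that identify an observation
value_cols <- setdiff(names(foo), id_cols)  # every other column

# Hypothetical weights: a larger weight marks a more important column.
weights <- c(V3 = 1)

# TRUE where a value is present, one matrix column per value column.
present <- !is.na(as.matrix(foo[, value_cols, with = FALSE]))

# Weighted count of non-missing values in each row.
score <- as.vector((present + 0) %*% weights[value_cols])

# Sort by the id columns, best score first within each id, then keep
# the first row of every (V1, V2) pair.
ord  <- order(foo$V1, foo$V2, -score)
best <- unique(foo[ord], by = id_cols)
print(best)
```

With this ordering the 2,3 pair keeps the row where V3 is TRUE, and the 3,1 pair is still retained even though V3 is missing in its only row. With more than a dozen value columns, only the weights vector would need extending.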


-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/

From my.r.help at gmail.com  Thu May 22 05:54:57 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Thu, 22 May 2014 11:54:57 +0800
Subject: [datatable-help] A[B]?
In-Reply-To: <1399370245881-4690040.post@n4.nabble.com>
References: <1399183248863-4689942.post@n4.nabble.com>
 <5365F136.8050807@gmail.com> <1399370245881-4690040.post@n4.nabble.com>
Message-ID: <537D7511.1000209@gmail.com>

FAQ 1.12?


On 05/06/2014 05:57 PM, Rguy wrote:
> That FAQ does not provide any examples of the A[B] syntax used with data
> table objects. It does provide an example using A[B] with matrix objects,
> but the example does not translate to data table objects, so I'm not sure
> why it's there. I suggest that the FAQ be extended to provide one, or better
> yet several, examples of the A[B] syntax applied to data.table objects.
> 
> As far as I have been able to puzzle out so far, A[B] is just another way to
> do a merge.
> 
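
For readers looking for the kind of example requested above, here is a minimal sketch of A[B] on two keyed data.tables. The tables A and B and their columns are made up for illustration. The sketch assumes both tables are keyed on the same column, so that A[B] looks up each row of B in A.

```r
library(data.table)

# Two small hypothetical tables sharing an "id" column.
A <- data.table(id = c("a", "b", "c"), x = 1:3)
B <- data.table(id = c("b", "c", "d"), y = c(10, 20, 30))
setkey(A, id)
setkey(B, id)

# A[B]: for every row of B, find the matching row(s) of A by key.
# Rows of B with no match in A are kept, with NA in A's columns.
res <- A[B]
print(res)

# nomatch = 0 drops the unmatched rows, giving an inner join,
# which is what merge(A, B) returns by default.
inner <- A[B, nomatch = 0]
print(inner)
```

So A[B] is indeed a join, but unlike the default merge() it keeps every row of B unless nomatch = 0 is given.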

From aragorn168b at gmail.com  Sat May 24 23:15:03 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 24 May 2014 23:15:03 +0200
Subject: [datatable-help] print.data.table's digits argument
In-Reply-To: <CAJd-hdny1FH1fM76A0BJsB_Oxv8B_70T5NWfBaJKKWr_=gRU+Q@mail.gmail.com>
References: <CAJd-hdn9JDP88ZZfj-x_p7s8TauPUxvDFP_VxSdfVGjQbABaxw@mail.gmail.com>
 <081AB5A6E11243C0B1C75A937463DAC8@gmail.com>
 <CAJd-hdny1FH1fM76A0BJsB_Oxv8B_70T5NWfBaJKKWr_=gRU+Q@mail.gmail.com>
Message-ID: <CAAf756PJ9s18psYGTDCw_mv1MQrF0yJpD-NsLP=3kmK3ai9wzA@mail.gmail.com>

Fixed this in commit #1275 (v1.9.3). Thanks Frank and Matthew Beckers for
filing it here: <https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5435&group_id=240&atid=975>.
Arun


On Tue, Jun 18, 2013 at 1:39 AM, Frank Erickson <FErickson at psu.edu> wrote:

> Ah, that did the trick! I'll use this quite a lot, I expect. Thanks, Arun.
> --Frank
>
>
> On Mon, Jun 17, 2013 at 6:19 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
>
>>  Dear Frank,
>>
>> Thanks for forwarding to the list. I always seem to forget to
>> "reply-all". Apologies. Managed this time! :)
>>
>> Try this on your data:
>>
>> as.data.table(do.call("cbind", lapply(DT, function(x) {
>>   if (is.list(x)) {
>>     lapply(x, function(y) as.numeric(format(y, digits=2)))
>>   } else {
>>     as.numeric(format(x, digits=2))
>>   }
>> })))
>>
>>
>>
>> Arun
>>
>>
>>
>
>

From aragorn168b at gmail.com  Sun May 25 04:54:40 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 25 May 2014 04:54:40 +0200
Subject: [datatable-help] Assignment by reference fails silently
In-Reply-To: <CAAf756P+FMBxbF=jPVuoShO2BsB8RUCejj0EhreRVFf=6nxENQ@mail.gmail.com>
References: <CAA3Wa=vdriXBJSvB-fVp+wXAOi=KXRMcMgjtP88KS5GxrJ1hPw@mail.gmail.com>
 <CAAf756P+FMBxbF=jPVuoShO2BsB8RUCejj0EhreRVFf=6nxENQ@mail.gmail.com>
Message-ID: <CAAf756PSBm2wq1AqVy_59pEDnp5W40OiMVSNkFhb00pKAcj1zw@mail.gmail.com>

This was a side effect of my not fixing FR #2551 properly. Stricter (and
more extensive) tests have now been added. Fixed in commit #1277 (v1.9.3).

Thanks for reporting.
Arun.


On Sun, Apr 27, 2014 at 9:14 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:

> Thanks for reporting. I've added this case under comments to another
> recently filed issue, bug #5442 <https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5442&group_id=240&atid=975>,
> from Michele (as I am quite sure they're related to handling column types
> in `:=` without grouping).
>
> Arun.
>
>
> On Fri, Apr 25, 2014 at 7:25 PM, John Laing <john.laing at gmail.com> wrote:
>
>> If I create a logical column in my data.table and try to
>> assign-by-reference a character value to it, the assignment fails silently.
>> That is, it doesn't work but doesn't throw an error:
>>
>> ## make a simple data.table
>> require(data.table)
>> dt <- data.table(a=1:3, b=4:6, c=NA)
>>
>> ## fails silently
>> dt[, c := "foo"]
>> dt
>>
>> In other cases where an action would lead to the implicit conversion of a
>> column, data.table throws an error suggesting that the user convert the
>> column explicitly if that's what they really mean to do. I think that's the
>> right behavior and should be adopted in this case as well.
>>
>> -John
>>
>
>

From talex at privatdemail.net  Fri May 30 11:43:03 2014
From: talex at privatdemail.net (talex)
Date: Fri, 30 May 2014 02:43:03 -0700 (PDT)
Subject: [datatable-help] strange undocumented data.table error
Message-ID: <1401442983415-4691467.post@n4.nabble.com>

I ran into a strange error that a search only turns up in the commit of
data.table:


This came up upon running a previously tested working dcast.data.table
expression, on a data.table I have subsetted (with duplicates) using a J()
command. The offending section is this:


Strangely, the data I fed it works until a specific row of the input table
is included, and then it dies.


What's bothering data.table? I tried changing keys and messing with the
order of the formula terms just to see, because I don't understand the
error, but that didn't work, of course.


--
View this message in context: http://r.789695.n4.nabble.com/strange-undocumented-data-table-error-tp4691467.html
Sent from the datatable-help mailing list archive at Nabble.com.

From my.r.help at gmail.com  Sat May 31 06:01:31 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 31 May 2014 12:01:31 +0800
Subject: [datatable-help] `with=F` in the `i` Argument
Message-ID: <5389541B.8040006@gmail.com>

All,

I'm trying to order the rows according to several columns at a time:

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
  print(DT[order(i), with = FALSE])

It doesn't work, since `with` seems to be about the `j` argument, but
not the `i` argument, according to `?data.table`.

I found the following workaround, but wonder whether there is a more
elegant way to do it:

for (i in c("a", "b"))
  print(DT[order(DT[, i, with = FALSE])])

Thanks,
M

From gsee000 at gmail.com  Sat May 31 06:44:59 2014
From: gsee000 at gmail.com (G See)
Date: Fri, 30 May 2014 23:44:59 -0500
Subject: [datatable-help] `with=F` in the `i` Argument
In-Reply-To: <5389541B.8040006@gmail.com>
References: <5389541B.8040006@gmail.com>
Message-ID: <CA+xi=qbyryLtqCgMqqZ3Gri1t4tLH77kaxGAOHv20zrXZ3k3og@mail.gmail.com>

Hi Michael,

I would use get()

DT <- data.table(a = 1:4, b = 8:5)
for (i in c("a", "b"))
  print(DT[order(get(i))])

For what it's worth, your solution doesn't seem to work in data.table
1.9.3 (svn rev. 1278):

> for (i in c("a", "b"))
+   print(DT[order(DT[, i, with = FALSE])])
Error in forder(DT, DT[, i, with = FALSE]) :
  Column '1' is type 'list' which is not supported for ordering currently.
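
The error above arises because DT[, i, with = FALSE] returns a one-column data.table (a list), not a plain vector. As a hedged sketch of an alternative to get(), the ordering index can be computed outside of `[` with base R's order(), which should not depend on the data.table version in use. The names cols, by_b and by_both are hypothetical.

```r
library(data.table)

DT   <- data.table(a = 1:4, b = 8:5)
cols <- c("a", "b")

# One column at a time: DT[[col]] extracts a plain vector, so
# order() receives an integer vector and not a list.
for (col in cols) {
  idx <- order(DT[[col]])
  print(DT[idx])
}

# Keep the result of ordering by "b" for inspection.
by_b <- DT[order(DT[["b"]])]

# All columns at once: pass each column as a separate argument to
# order(), so later columns break ties in earlier ones. unname()
# avoids a column name being matched to one of order()'s own arguments.
ord     <- do.call(order, unname(as.list(DT)[cols]))
by_both <- DT[ord]
print(by_both)
```

Ordering by "b" reverses the rows of this table, while ordering by "a" and then "b" leaves them as they are, because "a" is already sorted and has no ties.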


HTH,
Garrett

On Fri, May 30, 2014 at 11:01 PM, Michael Smith <my.r.help at gmail.com> wrote:
> All,
>
> I'm trying to order the rows according to several columns at a time:
>
> DT <- data.table(a = 1:4, b = 8:5)
> for (i in c("a", "b"))
>   print(DT[order(i), with = FALSE])
>
> It doesn't work, since `with` seems to be about the `j` argument, but
> not the `i` argument, according to `?data.table`.
>
> I found the following workaround, but wonder whether there is a more
> elegant way to do it:
>
> for (i in c("a", "b"))
>   print(DT[order(DT[, i, with = FALSE])])
>
> Thanks,
> M