From ht at heatherturner.net  Mon Sep  2 16:51:41 2013
From: ht at heatherturner.net (Heather Turner)
Date: Mon, 2 Sep 2013 15:51:41 +0100 (BST)
Subject: [datatable-help] fread coercion of very small number to character
In-Reply-To: <7282782.82.1378128992927.JavaMail.heather@heather-VPCSB3C5E>
Message-ID: <22722073.108.1378133500191.JavaMail.heather@heather-VPCSB3C5E>

Hello,

When reading a file with very small numbers in scientific notation, fread bumps the column type to "character":

> tmp <- fread(files[1], verbose = TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t'
Found 5 columns
First row with 5 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 188308
Subtracted 1 for last eol and any trailing empty lines, leaving 188307 data rows
Type codes: 33302 (first 5 rows)
Type codes: 33302 (+middle 5 rows)
Type codes: 33302 (+last 5 rows)
Bumping column 5 from REAL to STR on data row 361, field contains '1.46761e-313'
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.020s ( 13%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.020s ( 13%) Allocation of 188307x5 result (xMB) in RAM
   0.110s ( 73%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.150s        Total
Warning message:
In fread(files[1], verbose = TRUE) :
  Bumped column 5 to type character on data row 361, field contains '1.46761e-313'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

Perhaps there is some cutoff at e-300, since the preceding number '3.34402e-299' is read in okay.

I can get round this by specifying the column as character using the colClasses argument, then coercing to numeric after the data has been read in. However it would be better if fread could read the data in as numeric in the first place, as read.table does (though much more slowly in my example).

A simple example where type is detected as numeric then bumped to character (Which rows are used as the middle 5? Does not seem  to be rows 7-11 as I would expect...)

> dat <- data.frame(one = LETTERS[1:17], two = 1:17)
> ## use strings here to replicate what I have in my data file
> dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313") 
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)

...
Type codes: 32 (first 5 rows)
Type codes: 32 (+middle 5 rows)
Type codes: 32 (+last 5 rows)
Bumping column 2 from REAL to STR on data row 9, field contains '1.46761e-313'
...

Another example where type is detected as character from the first 5 rows

> dat$two[1:2] <- c("3.34402e-299", "1.46761e-313") 
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)

...
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
...

So aside from the issue of which rows are used for type detection, it does seem that 3.34402e-299 is detected as numeric whilst 1.46761e-313 is detected as character. Compare vs. read.table:

> tmp <- read.table("test.txt", header = TRUE)
> lapply(tmp, class)
$one
[1] "factor"

$two
[1] "numeric"

Best wishes,

Heather

---
Package: data.table
 Version: 1.8.9
 Maintainer: Matthew Dowle <mdowle at mdowle.plus.com>
 Built: R 3.0.1; x86_64-pc-linux-gnu; 2013-06-26 21:24:22 UTC; unix

R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] data.table_1.8.9

loaded via a namespace (and not attached):
[1] compiler_3.0.1 tools_3.0.1


From mdowle at mdowle.plus.com  Tue Sep  3 11:12:36 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 03 Sep 2013 10:12:36 +0100
Subject: [datatable-help] v1.8.10 is now on CRAN
Message-ID: <5225A804.8060308@mdowle.plus.com>

Please see NEWS :

     https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable

As normal it will take a few days to reach all mirrors.

Matthew


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130903/d7fbc88b/attachment.html>

From statquant at outlook.com  Tue Sep  3 16:36:31 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 3 Sep 2013 07:36:31 -0700 (PDT)
Subject: [datatable-help] Bug filled [#4878]
Message-ID: <1378218991304-4675263.post@n4.nabble.com>

I filled a bug [#4878] following  this post
<http://stackoverflow.com/questions/18594017/why-is-data-table-casting-automatically-when-i-assign-all-columns-by-reference/18594544?noredirect=1#18594544>  


--
View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com  Tue Sep  3 16:50:56 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 3 Sep 2013 16:50:56 +0200
Subject: [datatable-help] Bug filled [#4878]
In-Reply-To: <1378218991304-4675263.post@n4.nabble.com>
References: <1378218991304-4675263.post@n4.nabble.com>
Message-ID: <F73C33BF38684D4188BAA994FB41CED0@gmail.com>

Statquant, 

I don't think this is a bug because the default NA is indeed NA_logical_

IF you do:

x <- rep(NA, 10)
class(x) # [1] logical

You should just do:

x <- rep(NA_integer_, 10)
class(x) # [1] integer

>From ?NA (first paragraph)
NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved (http://127.0.0.1:42400/help/library/base/help/reserved) words in the R language.

Arun


On Tuesday, September 3, 2013 at 4:36 PM, statquant3 wrote:

> I filled a bug [#4878] following this post
> <http://stackoverflow.com/questions/18594017/why-is-data-table-casting-automatically-when-i-assign-all-columns-by-reference/18594544?noredirect=1#18594544> 
> 
> 
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263.html
> Sent from the datatable-help mailing list archive at Nabble.com (http://Nabble.com).
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130903/d5a57a89/attachment.html>

From statquant at outlook.com  Tue Sep  3 16:59:52 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 3 Sep 2013 07:59:52 -0700 (PDT)
Subject: [datatable-help] Bug filled [#4878]
In-Reply-To: <F73C33BF38684D4188BAA994FB41CED0@gmail.com>
References: <1378218991304-4675263.post@n4.nabble.com>
 <F73C33BF38684D4188BAA994FB41CED0@gmail.com>
Message-ID: <1378220392511-4675268.post@n4.nabble.com>

Yes x = NA makes x logical but data.table is supposed to keep the type of the
LHS when you do an update That's why you get the usual 
Message d'avis :
In `[.data.table`(DT, , `:=`(a, 1.1)) :
  Coerced 'double' RHS to 'integer' to match the column's type; may have
truncated precision. Either change the target column to 'double' first (by
creating a new 'double' vector length 3 (nrows of entire table) and assign
that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L,
NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or,
set the column type correctly up front when you create the table and stick
to it, please.

So I think it should still be the case even for 1 row data.table


--
View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675268.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com  Tue Sep  3 17:05:02 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 3 Sep 2013 17:05:02 +0200
Subject: [datatable-help] Bug filled [#4878]
In-Reply-To: <1378220392511-4675268.post@n4.nabble.com>
References: <1378218991304-4675263.post@n4.nabble.com>
 <F73C33BF38684D4188BAA994FB41CED0@gmail.com>
 <1378220392511-4675268.post@n4.nabble.com>
Message-ID: <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com>

Seems you're right. I missed that warning message... 

Arun


On Tuesday, September 3, 2013 at 4:59 PM, statquant3 wrote:

> Yes x = NA makes x logical but data.table is supposed to keep the type of the
> LHS when you do an update That's why you get the usual 
> Message d'avis :
> In `[.data.table`(DT, , `:=`(a, 1.1)) :
> Coerced 'double' RHS to 'integer' to match the column's type; may have
> truncated precision. Either change the target column to 'double' first (by
> creating a new 'double' vector length 3 (nrows of entire table) and assign
> that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L,
> NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or,
> set the column type correctly up front when you create the table and stick
> to it, please.
> 
> So I think it should still be the case even for 1 row data.table
> 
> 
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675268.html
> Sent from the datatable-help mailing list archive at Nabble.com (http://Nabble.com).
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130903/86754878/attachment.html>

From statquant at outlook.com  Tue Sep  3 17:04:56 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 3 Sep 2013 08:04:56 -0700 (PDT)
Subject: [datatable-help] Cannot use fread with data.table 1.8.10
Message-ID: <1378220696873-4675269.post@n4.nabble.com>

Just tried the new version, took it from CRAN and had RStudio compiled the
.tar.gz, all went ok.
When I try to load any csv I get the following:
Erreur dans fread("test.csv") : 
  'integer64' must be a single character string: 'integer64', 'double' or
'character'


sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
LC_MONETARY=French_France.1252 LC_NUMERIC=C                   LC_TIME=C                     

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.8.10 vimcom_0.9-8 


--
View this message in context: http://r.789695.n4.nabble.com/Cannot-use-fread-with-data-table-1-8-10-tp4675269.html
Sent from the datatable-help mailing list archive at Nabble.com.

From statquant at outlook.com  Tue Sep  3 17:10:06 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 3 Sep 2013 08:10:06 -0700 (PDT)
Subject: [datatable-help] Bug filled [#4878]
In-Reply-To: <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com>
References: <1378218991304-4675263.post@n4.nabble.com>
 <F73C33BF38684D4188BAA994FB41CED0@gmail.com>
 <1378220392511-4675268.post@n4.nabble.com>
 <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com>
Message-ID: <1378221006699-4675271.post@n4.nabble.com>

You can get the warning doing this (for example)

R) DT = data.table(a=rep(1L,3))
R) DT
   a
1: 1
2: 1
3: 1
R) DT[,a:=1.1]
Message d'avis :
In `[.data.table`(DT, , `:=`(a, 1.1)) :
  Coerced 'double' RHS to 'integer' to match the column's type; may have
truncated precision. Either change the target column to 'double' first (by
creating a new 'double' vector length 3 (nrows of entire table) and assign
that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L,
NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or,
set the column type correctly up front when you create the table and stick
to it, please.


--
View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675271.html
Sent from the datatable-help mailing list archive at Nabble.com.

From statquant at outlook.com  Tue Sep  3 17:19:16 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 3 Sep 2013 08:19:16 -0700 (PDT)
Subject: [datatable-help] Cannot use fread with data.table 1.8.10
In-Reply-To: <1378220696873-4675269.post@n4.nabble.com>
References: <1378220696873-4675269.post@n4.nabble.com>
Message-ID: <1378221556721-4675273.post@n4.nabble.com>

Ok just took the .zip from http://datatable.r-forge.r-project.org/ and it is
now working.
I'll wait and try to compile it from source latter (though it compiled fine
so...)


--
View this message in context: http://r.789695.n4.nabble.com/Cannot-use-fread-with-data-table-1-8-10-tp4675269p4675273.html
Sent from the datatable-help mailing list archive at Nabble.com.

From mdowle at mdowle.plus.com  Tue Sep  3 20:18:53 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 03 Sep 2013 19:18:53 +0100
Subject: [datatable-help] Bug filled [#4878]
In-Reply-To: <1378221006699-4675271.post@n4.nabble.com>
References: <1378218991304-4675263.post@n4.nabble.com>
 <F73C33BF38684D4188BAA994FB41CED0@gmail.com>
 <1378220392511-4675268.post@n4.nabble.com>
 <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com>
 <1378221006699-4675271.post@n4.nabble.com>
Message-ID: <5226280D.4060800@mdowle.plus.com>


Just to clear up this thread, it's plonking.  Search for "plonk" in 
?":=".  I've closed the bug report.

Matthew

On 03/09/13 16:10, statquant3 wrote:
> You can get the warning doing this (for example)
>
> R) DT = data.table(a=rep(1L,3))
> R) DT
>     a
> 1: 1
> 2: 1
> 3: 1
> R) DT[,a:=1.1]
> Message d'avis :
> In `[.data.table`(DT, , `:=`(a, 1.1)) :
>    Coerced 'double' RHS to 'integer' to match the column's type; may have
> truncated precision. Either change the target column to 'double' first (by
> creating a new 'double' vector length 3 (nrows of entire table) and assign
> that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L,
> NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or,
> set the column type correctly up front when you create the table and stick
> to it, please.
>
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675271.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From mdowle at mdowle.plus.com  Tue Sep  3 20:34:37 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 03 Sep 2013 19:34:37 +0100
Subject: [datatable-help] Cannot use fread with data.table 1.8.10
In-Reply-To: <1378221556721-4675273.post@n4.nabble.com>
References: <1378220696873-4675269.post@n4.nabble.com>
 <1378221556721-4675273.post@n4.nabble.com>
Message-ID: <52262BBD.20607@mdowle.plus.com>

That's very odd.   Phew - glad it's working now though!

All I can think is that it was to do with the install process on Windows 
when an R process is open at the same time with data.table loaded in 
it.  We've had similar issues in the past sometimes where a reboot 
followed by reinstall of data.table works.    The reboot ensures that 
every last nuance of .dll usage is cleared. And the reboot also ensures 
that all versions of R are shut down.    Linux seems much better at 
updating shared objects (.so) which are in use by processes,  although 
similar problems have been reported on Linux too when (my best guess is) 
a zombie process holds up something in the install process.      Only 
one or two reports, mind you.

The error about integer64 suggests that maybe the byte code didn't match 
up with the DLL code (since that's a new argument). Something like that, 
anyway, maybe.


On 03/09/13 16:19, statquant3 wrote:
> Ok just took the .zip from http://datatable.r-forge.r-project.org/ and it is
> now working.
> I'll wait and try to compile it from source latter (though it compiled fine
> so...)
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Cannot-use-fread-with-data-table-1-8-10-tp4675269p4675273.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From mdowle at mdowle.plus.com  Tue Sep  3 20:39:03 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 03 Sep 2013 19:39:03 +0100
Subject: [datatable-help] fread coercion of very small number to
	character
In-Reply-To: <22722073.108.1378133500191.JavaMail.heather@heather-VPCSB3C5E>
References: <22722073.108.1378133500191.JavaMail.heather@heather-VPCSB3C5E>
Message-ID: <52262CC7.5020305@mdowle.plus.com>


Hi,

This is a great bug report.  Please could you file it on the tracker so 
it doesn't get forgotten.  That way you'll also get notified 
automatically when the status changes.  Hoping to clear up everything 
related to fread soon.

Matthew

On 02/09/13 15:51, Heather Turner wrote:
> Hello,
>
> When reading a file with very small numbers in scientific notation, fread bumps the column type to "character":
>
>> tmp <- fread(files[1], verbose = TRUE)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t'
> Found 5 columns
> First row with 5 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 188308
> Subtracted 1 for last eol and any trailing empty lines, leaving 188307 data rows
> Type codes: 33302 (first 5 rows)
> Type codes: 33302 (+middle 5 rows)
> Type codes: 33302 (+last 5 rows)
> Bumping column 5 from REAL to STR on data row 361, field contains '1.46761e-313'
>     0.000s (  0%) Memory map (rerun may be quicker)
>     0.000s (  0%) sep and header detection
>     0.020s ( 13%) Count rows (wc -l)
>     0.000s (  0%) Column type detection (first, middle and last 5 rows)
>     0.020s ( 13%) Allocation of 188307x5 result (xMB) in RAM
>     0.110s ( 73%) Reading data
>     0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
>     0.000s (  0%) Coercing data already read in type bumps (if any)
>     0.000s (  0%) Changing na.strings to NA
>     0.150s        Total
> Warning message:
> In fread(files[1], verbose = TRUE) :
>    Bumped column 5 to type character on data row 361, field contains '1.46761e-313'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
>
> Perhaps there is some cutoff at e-300, since the preceding number '3.34402e-299' is read in okay.
>
> I can get round this by specifying the column as character using the colClasses argument, then coercing to numeric after the data has been read in. However it would be better if fread could read the data in as numeric in the first place, as read.table does (though much more slowly in my example).
>
> A simple example where type is detected as numeric then bumped to character (Which rows are used as the middle 5? Does not seem  to be rows 7-11 as I would expect...)
>
>> dat <- data.frame(one = LETTERS[1:17], two = 1:17)
>> ## use strings here to replicate what I have in my data file
>> dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313")
>> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
>> fread("test.txt", verbose = TRUE)
> ...
> Type codes: 32 (first 5 rows)
> Type codes: 32 (+middle 5 rows)
> Type codes: 32 (+last 5 rows)
> Bumping column 2 from REAL to STR on data row 9, field contains '1.46761e-313'
> ...
>
> Another example where type is detected as character from the first 5 rows
>
>> dat$two[1:2] <- c("3.34402e-299", "1.46761e-313")
>> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
>> fread("test.txt", verbose = TRUE)
> ...
> Type codes: 33 (first 5 rows)
> Type codes: 33 (+middle 5 rows)
> Type codes: 33 (+last 5 rows)
> ...
>
> So aside from the issue of which rows are used for type detection, it does seem that 3.34402e-299 is detected as numeric whilst 1.46761e-313 is detected as character. Compare vs. read.table:
>
>> tmp <- read.table("test.txt", header = TRUE)
>> lapply(tmp, class)
> $one
> [1] "factor"
>
> $two
> [1] "numeric"
>
> Best wishes,
>
> Heather
>
> ---
> Package: data.table
>   Version: 1.8.9
>   Maintainer: Matthew Dowle <mdowle at mdowle.plus.com>
>   Built: R 3.0.1; x86_64-pc-linux-gnu; 2013-06-26 21:24:22 UTC; unix
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>   [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods
> [7] base
>
> other attached packages:
> [1] data.table_1.8.9
>
> loaded via a namespace (and not attached):
> [1] compiler_3.0.1 tools_3.0.1
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From tbfowler4 at gmail.com  Thu Sep  5 19:41:26 2013
From: tbfowler4 at gmail.com (Thell Fowler)
Date: Thu, 5 Sep 2013 12:41:26 -0500
Subject: [datatable-help] column of named vectors in data.table and
 possible bug
In-Reply-To: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com>
References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com>
Message-ID: <CAAJPTXiBrXVD+6Gfq3a-N4EQXKnPNRwfxzxLKw9O2v2axqaaLQ@mail.gmail.com>

Perhaps a 'too late' reply, but have you thought about bringing the names
into the DT, using them, then dropping them?

For example:

> DT[, n:=names(DT$B)]
> DT[,list(B=list(B),Names=list(n)),by=A]
   A     B Names
1: 1 6,7,8 a,b,c
2: 2  9,10   d,e
> DT$n<-NULL


On Sat, Aug 24, 2013 at 2:57 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

>  Dear all,
>
> Suppose we've construct a data.table in this manner:
>
> x <- c(1,1,1,2,2)
> y <- 6:10
> setattr(y, 'names', letters[1:5])
> DT<- data.table(A = x, B = y)
>
> DT$B
>  a  b  c  d  e
>  6  7  8  9 10
>
> You see that DT maintains the name of vector B. But if we do:
>
> DT[, names(B), by=A]
>    A V1
> 1: 1  a
> 2: 1  b
> 3: 1  c
> 4: 2  a
> 5: 2  b
> 6: 2  c
>
> There are two things here: First, you see that only the names of the first
> grouping is correct (A = 1). Second, the rest of the result has the same
> names, and the result is also recycled to fit the length. Instead of 5
> rows, we get 6 rows.
>
> A way to get around it would be:
>
> DT[, names(DT$B)[.I], by=A]
>    A V1
> 1: 1  a
> 2: 1  b
> 3: 1  c
> 4: 2  d
> 5: 2  e
>
> However, if one wants to do:
>
> DT[, list(list(B)), by=A]$V1
> [[1]]
> a b c
> 6 7 8
>
> [[2]]
>  a  b
>  9 10
>
> You see that the names are once again wrong (for A = 2). Just the first
> one remains right.
>
> My question is, is it allowed usage of having names for column vectors? If
> so, then this should be a bug. If not, it'd be a great feature to have.
>
> Arun
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


-- 
Sincerely,
Thell
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130905/1f0fa397/attachment.html>

From aragorn168b at gmail.com  Fri Sep  6 11:52:48 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 6 Sep 2013 11:52:48 +0200
Subject: [datatable-help] column of named vectors in data.table and
 possible bug
In-Reply-To: <CAAJPTXiBrXVD+6Gfq3a-N4EQXKnPNRwfxzxLKw9O2v2axqaaLQ@mail.gmail.com>
References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com>
 <CAAJPTXiBrXVD+6Gfq3a-N4EQXKnPNRwfxzxLKw9O2v2axqaaLQ@mail.gmail.com>
Message-ID: <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com>

Hi Thell, 

It's not late :). Thanks for your reply. Yes of course we could do the way you specified. But the usage for the feature I mentioned is quite different. I was thinking of doing something even more efficient for this question on SO (http://stackoverflow.com/questions/17308551/do-callrbind-list-for-uneven-number-of-column): 

Arun


On Thursday, September 5, 2013 at 7:41 PM, Thell Fowler wrote:

> Perhaps a 'too late' reply, but have you thought about bringing the names into the DT, using them, then dropping them?
> 
> For example:
> 
> > DT[, n:=names(DT$B)]
> > DT[,list(B=list(B),Names=list(n)),by=A]
>    A     B Names
> 1: 1 6,7,8 a,b,c
> 2: 2  9,10   d,e
> > DT$n<-NULL
> 
> 
> 
> On Sat, Aug 24, 2013 at 2:57 AM, Arunkumar Srinivasan <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > Dear all, 
> > 
> > Suppose we've construct a data.table in this manner:
> > 
> > x <- c(1,1,1,2,2)
> > y <- 6:10
> > setattr(y, 'names', letters[1:5])
> > DT<- data.table(A = x, B = y)
> > 
> > DT$B
> >  a  b  c  d  e 
> >  6  7  8  9 10 
> > 
> > 
> > You see that DT maintains the name of vector B. But if we do:
> > 
> > DT[, names(B), by=A] 
> >    A V1
> > 1: 1  a
> > 2: 1  b
> > 3: 1  c
> > 4: 2  a
> > 5: 2  b
> > 6: 2  c
> > 
> > 
> > There are two things here: First, you see that only the names of the first grouping is correct (A = 1). Second, the rest of the result has the same names, and the result is also recycled to fit the length. Instead of 5 rows, we get 6 rows. 
> > 
> > A way to get around it would be:
> > 
> > DT[, names(DT$B)[.I], by=A]
> >    A V1
> > 1: 1  a
> > 2: 1  b
> > 3: 1  c
> > 4: 2  d
> > 5: 2  e
> > 
> > 
> > However, if one wants to do:
> > 
> > DT[, list(list(B)), by=A]$V1
> > [[1]]
> > a b c 
> > 6 7 8 
> > 
> > [[2]]
> >  a  b 
> >  9 10 
> > 
> > 
> > You see that the names are once again wrong (for A = 2). Just the first one remains right. 
> > 
> > My question is, is it allowed usage of having names for column vectors? If so, then this should be a bug. If not, it'd be a great feature to have. 
> > 
> > Arun
> > 
> > 
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 
> 
> -- 
> Sincerely,
> Thell 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130906/0cda7a9b/attachment.html>

From mdowle at mdowle.plus.com  Sat Sep  7 01:42:00 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 07 Sep 2013 00:42:00 +0100
Subject: [datatable-help] column of named vectors in data.table and
 possible bug
In-Reply-To: <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com>
References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com>
 <CAAJPTXiBrXVD+6Gfq3a-N4EQXKnPNRwfxzxLKw9O2v2axqaaLQ@mail.gmail.com>
 <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com>
Message-ID: <522A6848.9000500@mdowle.plus.com>

Just caught up with this thread.
> is it allowed usage of having names for column vectors?
It wasn't intended, no.  It would slow down grouping if it had to 
maintain the names attribute too in the subsets.    data.table is 
intended to be used as a list of plain columns and the internals assume 
that.  names(DT$col) might exist though if data.table() has used a 
reference to an input without taking a copy.  It would then copy on 
first := to that column and drop the names attribute at that point.  
Which is why we might like to leave names there and just not use them.

But I'm thinking data.table() should drop names then to make this 
cleaner.  Despite that meaning a copy of the vector has to be taken if 
it has names.  A copy is taken currently anyway.  But in GNU R 3.1.0,  
with list() no longer copying named inputs,  we can do more on that front.

Matthew


From aragorn168b at gmail.com  Sat Sep  7 15:11:14 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 7 Sep 2013 15:11:14 +0200
Subject: [datatable-help] melt for data.table
Message-ID: <B95C8729190A4FCBB5F0D2F9836CEEEE@gmail.com>

Hi everybody, 

In the recent commit (940-944), a faster version of melt, "fmelt" is implemented. Have a look at this post (http://stackoverflow.com/a/18668808/559784) for a benchmark. It'd be great to get some feedback. 

You can download the recent commit from the first link here (https://r-forge.r-project.org/scm/?group_id=240).

Arun

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130907/49b8cb71/attachment.html>

From aragorn168b at gmail.com  Sat Sep  7 18:30:08 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 7 Sep 2013 18:30:08 +0200
Subject: [datatable-help] melt for data.table
In-Reply-To: <B95C8729190A4FCBB5F0D2F9836CEEEE@gmail.com>
References: <B95C8729190A4FCBB5F0D2F9836CEEEE@gmail.com>
Message-ID: <C491E45259884822B86331D4218340B3@gmail.com>

Hi all,

Regarding the earlier email on "fmelt":

After early feedback, the fmelt _function_ has already changed to be a reshape2::melt _method_ for data.table instead. I've deleted the link on S.O. for now and will post again soon here with updated links... 

Thank you for understanding,
Arun


On Saturday, September 7, 2013 at 3:11 PM, Arunkumar Srinivasan wrote:

> Hi everybody, 
> 
> In the recent commit (940-944), a faster version of melt, "fmelt" is implemented. Have a look at this post (http://stackoverflow.com/a/18668808/559784) for a benchmark. It'd be great to get some feedback. 
> 
> You can download the recent commit from the first link here (https://r-forge.r-project.org/scm/?group_id=240).
> 
> Arun
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130907/184a2162/attachment.html>

From aragorn168b at gmail.com  Sun Sep  8 00:23:35 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 8 Sep 2013 00:23:35 +0200
Subject: [datatable-help] column of named vectors in data.table and
 possible bug
In-Reply-To: <522A6848.9000500@mdowle.plus.com>
References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com>
 <CAAJPTXiBrXVD+6Gfq3a-N4EQXKnPNRwfxzxLKw9O2v2axqaaLQ@mail.gmail.com>
 <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com>
 <522A6848.9000500@mdowle.plus.com>
Message-ID: <A5C215C2042C4FBBB3A768700CBA74F9@gmail.com>

Great explanation! Thank you. Got the point. 

Arun


On Saturday, September 7, 2013 at 1:42 AM, Matthew Dowle wrote:

> Just caught up with this thread.
> > is it allowed usage of having names for column vectors?
> 
> It wasn't intended, no. It would slow down grouping if it had to 
> maintain the names attribute too in the subsets. data.table is 
> intended to be used as a list of plain columns and the internals assume 
> that. names(DT$col) might exist though if data.table() has used a 
> reference to an input without taking a copy. It would then copy on 
> first := to that column and drop the names attribute at that point. 
> Which is why we might like to leave names there and just not use them.
> 
> But I'm thinking data.table() should drop names then to make this 
> cleaner. Despite that meaning a copy of the vector has to be taken if 
> it has names. A copy is taken currently anyway. But in GNU R 3.1.0, 
> with list() no longer copying named inputs, we can do more on that front.
> 
> Matthew 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130908/0f045dad/attachment.html>

From mattguzzo12 at gmail.com  Sun Sep  8 06:46:14 2013
From: mattguzzo12 at gmail.com (guzzom)
Date: Sat, 7 Sep 2013 21:46:14 -0700 (PDT)
Subject: [datatable-help] Sub setting multiple ids based on a 2nd data frame
Message-ID: <1378615574149-4675620.post@n4.nabble.com>

Hi All,

I have some telemetry data that spans multiple years (2002 - 2013) with
multiple individuals per year. I want to subset the telemetry data to
include only those data points that fall between specific dates which are
provided in a 2nd data frame. The telemetry df is in the form of:

DF "A"

ID	Date	         Depth	Temp
1	2012-05-12	     10	12
1	2012-05-13	     10	12
1	2012-05-14	     10	12
1	2012-05-15	     10	12
2	2012-05-16	     10	12
2	2012-05-17	     10	12
2	2012-05-18	     10	12
2	2012-05-19	     10	12
3	2012-05-20	     10	12
3	2012-05-21	     10	12
3	2012-05-22	     10	12
3	2012-05-23	     10	12
3	2012-05-24	     10	12

And the df with the dates I want to use to subset is formatted as follows:

 DF "B"

Year	   Start	    End
2002	2002-05-10	2002-11-01
2003	2003-05-11	2003-11-02
2004	2004-05-12	2004-11-03
2005	2005-05-13	2005-11-04
2006	2006-05-14	2006-11-05

So, I want to say, for each ID in DF A, subset and keep only those data
points collected on a date that fall between the start and end date for the
corresponding year from DF B.

I am unsure if a loop is my best bet, or using plyr (which I am unfamiliar
with). I am relatively new to R, so this seems a bit above my head. Any help
is much appreciated.

Thanks in advance!


--
View this message in context: http://r.789695.n4.nabble.com/Sub-setting-multiple-ids-based-on-a-2nd-data-frame-tp4675620.html
Sent from the datatable-help mailing list archive at Nabble.com.

From mdowle at mdowle.plus.com  Sun Sep  8 09:57:06 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 08 Sep 2013 08:57:06 +0100
Subject: [datatable-help] Sub setting multiple ids based on a 2nd data
 frame
In-Reply-To: <1378615574149-4675620.post@n4.nabble.com>
References: <1378615574149-4675620.post@n4.nabble.com>
Message-ID: <522C2DD2.3040906@mdowle.plus.com>

Hi,

Good question.  How about :

http://stackoverflow.com/questions/17867553/data-table-join-using-two-columns-from-one-table-and-one-column-from-other
http://stackoverflow.com/questions/17597508/merging-endpoints-of-a-range-with-a-sequence
http://stackoverflow.com/questions/16666183/find-values-in-a-given-interval-without-a-vector-scan

The syntax for range queries is a bit tricky and we hope to make it 
easier in future :

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=203&group_id=240&atid=978

Matthew


On 08/09/13 05:46, guzzom wrote:
> Hi All,
>
> I have some telemetry data that spans multiple years (2002 - 2013) with
> multiple individuals per year. I want to subset the telemetry data to
> include only those data points that fall between specific dates which are
> provided in a 2nd data frame. The telemetry df is in the form of:
>
> DF "A"
>
> ID	Date	         Depth	Temp
> 1	2012-05-12	     10	12
> 1	2012-05-13	     10	12
> 1	2012-05-14	     10	12
> 1	2012-05-15	     10	12
> 2	2012-05-16	     10	12
> 2	2012-05-17	     10	12
> 2	2012-05-18	     10	12
> 2	2012-05-19	     10	12
> 3	2012-05-20	     10	12
> 3	2012-05-21	     10	12
> 3	2012-05-22	     10	12
> 3	2012-05-23	     10	12
> 3	2012-05-24	     10	12
>
> And the df with the dates I want to use to subset is formatted as follows:
>
>   DF "B"
>
> Year	   Start	    End
> 2002	2002-05-10	2002-11-01
> 2003	2003-05-11	2003-11-02
> 2004	2004-05-12	2004-11-03
> 2005	2005-05-13	2005-11-04
> 2006	2006-05-14	2006-11-05
>
> So, I want to say, for each ID in DF A, subset and keep only those data
> points collected on a date that fall between the start and end date for the
> corresponding year from DF B.
>
> I am unsure if a loop is my best bet, or using plyr (which I am unfamiliar
> with). I am relatively new to R, so this seems a bit above my head. Any help
> is much appreciated.
>
> Thanks in advance!
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Sub-setting-multiple-ids-based-on-a-2nd-data-frame-tp4675620.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From statquant at outlook.com  Tue Sep 10 14:03:57 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 10 Sep 2013 05:03:57 -0700 (PDT)
Subject: [datatable-help] data.table on the command line
Message-ID: <1378814637664-4675755.post@n4.nabble.com>

I would like to try to use data.table awesomeness on the command line.
The usual use case is that you have a file and you would like to quickly
create a summarized other file.
Sometimes you wouldn't need to start R

Something like (I'm just guessing)

$) DTCMD myfile.csv "[1:5, list(a,b,c=sum(d),e=cumsum(f)),
by=grp][,test:='hello']" > newFile.csv

Would someone have an idea about this ?


--
View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755.html
Sent from the datatable-help mailing list archive at Nabble.com.

From mdowle at mdowle.plus.com  Tue Sep 10 14:59:01 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 10 Sep 2013 13:59:01 +0100
Subject: [datatable-help] data.table on the command line
In-Reply-To: <1378814637664-4675755.post@n4.nabble.com>
References: <1378814637664-4675755.post@n4.nabble.com>
Message-ID: <522F1795.6040603@mdowle.plus.com>

Maybe :

http://dirk.eddelbuettel.com/code/littler.html


On 10/09/13 13:03, statquant3 wrote:
> I would like to try to use data.table awesomeness on the command line.
> The usual use case is that you have a file and you would like to quickly
> create a summarized other file.
> Sometimes you wouldn't need to start R
>
> Something like (I'm just guessing)
>
> $) DTCMD myfile.csv "[1:5, list(a,b,c=sum(d),e=cumsum(f)),
> by=grp][,test:='hello']" > newFile.csv
>
> Would someone have an idea about this ?
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/f880ed37/attachment.html>

From statquant at outlook.com  Tue Sep 10 15:10:04 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 10 Sep 2013 06:10:04 -0700 (PDT)
Subject: [datatable-help] data.table on the command line
In-Reply-To: <522F1795.6040603@mdowle.plus.com>
References: <1378814637664-4675755.post@n4.nabble.com>
 <522F1795.6040603@mdowle.plus.com>
Message-ID: <1378818604944-4675759.post@n4.nabble.com>

I thought about this... but that would be linux only then no ?


--
View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675759.html
Sent from the datatable-help mailing list archive at Nabble.com.

From ramine.mossadegh at finra.org  Tue Sep 10 16:35:26 2013
From: ramine.mossadegh at finra.org (ramoss)
Date: Tue, 10 Sep 2013 07:35:26 -0700 (PDT)
Subject: [datatable-help] XLSX Help: Exporting to multiple sheets in excel
Message-ID: <1378823726505-4675767.post@n4.nabble.com>

Hello:

I just discovered the XLSX package.
I know how to export 1 dataframe to 1 excel sheet using the XLSX package.
write.xlsx(x= all, file="c:/reports/outlier.xlsx",
sheetName="outlierdays",row.names= FALSE)

How would I export multiple data frames to multiple sheets?

The data frames names are: all, results2 & stats2 
The excel file is called outlier
The sheets within it are: outlierdays, outlier, normaltest.

Thanks for your help


--
View this message in context: http://r.789695.n4.nabble.com/XLSX-Help-Exporting-to-multiple-sheets-in-excel-tp4675767.html
Sent from the datatable-help mailing list archive at Nabble.com.

From ramine.mossadegh at finra.org  Tue Sep 10 16:50:30 2013
From: ramine.mossadegh at finra.org (ramoss)
Date: Tue, 10 Sep 2013 07:50:30 -0700 (PDT)
Subject: [datatable-help] XLSX Help: Exporting to multiple sheets in
	excel
In-Reply-To: <1378823726505-4675767.post@n4.nabble.com>
References: <1378823726505-4675767.post@n4.nabble.com>
Message-ID: <1378824630144-4675770.post@n4.nabble.com>

I found the answer here: 
http://www.r-bloggers.com/importexport-data-to-and-from-xlsx-files/


--
View this message in context: http://r.789695.n4.nabble.com/XLSX-Help-Exporting-to-multiple-sheets-in-excel-tp4675767p4675770.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com  Tue Sep 10 16:53:46 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 10 Sep 2013 16:53:46 +0200
Subject: [datatable-help] XLSX Help: Exporting to multiple sheets in
 excel
In-Reply-To: <1378824630144-4675770.post@n4.nabble.com>
References: <1378823726505-4675767.post@n4.nabble.com>
 <1378824630144-4675770.post@n4.nabble.com>
Message-ID: <50BF512F01F34C40945422780583764F@gmail.com>

Ramoss,
Glad, but I think you're on the wrong mailing list. This is for help with R package data.table. 

Arun


On Tuesday, September 10, 2013 at 4:50 PM, ramoss wrote:

> I found the answer here: 
> http://www.r-bloggers.com/importexport-data-to-and-from-xlsx-files/
> 
> 
> 
> --
> View this message in context: http://r.789695.n4.nabble.com/XLSX-Help-Exporting-to-multiple-sheets-in-excel-tp4675767p4675770.html
> Sent from the datatable-help mailing list archive at Nabble.com (http://Nabble.com).
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/a13d4e90/attachment.html>

From mdowle at mdowle.plus.com  Tue Sep 10 16:57:17 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 10 Sep 2013 15:57:17 +0100
Subject: [datatable-help] data.table on the command line
In-Reply-To: <1378818604944-4675759.post@n4.nabble.com>
References: <1378814637664-4675755.post@n4.nabble.com>
 <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com>
Message-ID: <522F334D.7030905@mdowle.plus.com>


Hm. How about Rscript -e ?

On 10/09/13 14:10, statquant3 wrote:
> I thought about this... but that would be linux only then no ?
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675759.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From statquant at outlook.com  Tue Sep 10 17:11:53 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 10 Sep 2013 08:11:53 -0700 (PDT)
Subject: [datatable-help] data.table on the command line
In-Reply-To: <522F334D.7030905@mdowle.plus.com>
References: <1378814637664-4675755.post@n4.nabble.com>
 <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com>
 <522F334D.7030905@mdowle.plus.com>
Message-ID: <1378825913051-4675774.post@n4.nabble.com>

I am not convinced...
For example here is what I just tried (I am on windows here)

C:\Travail\futCAC\data>C:\Travail\Tools\R-3.0.1\bin\Rscript.exe -e
"library(data.table);
fread('ORDRES_20120831.csv',nrows=100)[,list(CDTSA,ISIN)][,list(count=.N),by=as.Date(CDTSA)]"

At this point this is like writing a R script
1) I need to require the libraries
2) My Rprofile is printed on the screen

Those 2 could be solved using a wrapper...May be it is as good as it gets 


--
View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675774.html
Sent from the datatable-help mailing list archive at Nabble.com.

From mdowle at mdowle.plus.com  Tue Sep 10 17:18:42 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 10 Sep 2013 16:18:42 +0100
Subject: [datatable-help] data.table on the command line
In-Reply-To: <1378825913051-4675774.post@n4.nabble.com>
References: <1378814637664-4675755.post@n4.nabble.com>
 <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com>
 <522F334D.7030905@mdowle.plus.com> <1378825913051-4675774.post@n4.nabble.com>
Message-ID: <522F3852.2030807@mdowle.plus.com>

 > May be it is as good as it gets

I'm not sure what you need,  but R can many startup options, and there's 
.Rprofile.
Have you really hunted hard?

On 10/09/13 16:11, statquant3 wrote:
> I am not convinced...
> For example here is what I just tried (I am on windows here)
>
> C:\Travail\futCAC\data>C:\Travail\Tools\R-3.0.1\bin\Rscript.exe -e
> "library(data.table);
> fread('ORDRES_20120831.csv',nrows=100)[,list(CDTSA,ISIN)][,list(count=.N),by=as.Date(CDTSA)]"
>
> At this point this is like writing a R script
> 1) I need to require the libraries
> 2) My Rprofile is printed on the screen
>
> Those 2 could be solved using a wrapper...May be it is as good as it gets
>
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675774.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From statquant at outlook.com  Tue Sep 10 17:31:53 2013
From: statquant at outlook.com (statquant3)
Date: Tue, 10 Sep 2013 08:31:53 -0700 (PDT)
Subject: [datatable-help] data.table on the command line
In-Reply-To: <522F3852.2030807@mdowle.plus.com>
References: <1378814637664-4675755.post@n4.nabble.com>
 <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com>
 <522F334D.7030905@mdowle.plus.com> <1378825913051-4675774.post@n4.nabble.com>
 <522F3852.2030807@mdowle.plus.com>
Message-ID: <1378827113153-4675782.post@n4.nabble.com>

Actually I think I wanted something simpler as far as syntax was concerned
but I realize this is a whole new project. I am aware of all the startup
options like -vanilla -noenviron -noprofile etc...

In my previous job we had a nice tool which red csv and allowed csv
manipulation on the command line, data.table provides everything but the
syntax although much simpler than data.frame is a bit more verbose.

I think wrapping it in a script might streamline the syntax, I need to give
it some thoughts I guess.
Sorry for being so fuzzy


--
View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675782.html
Sent from the datatable-help mailing list archive at Nabble.com.

From caneff at gmail.com  Tue Sep 10 19:32:20 2013
From: caneff at gmail.com (Chris Neff)
Date: Tue, 10 Sep 2013 13:32:20 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
Message-ID: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>

I'm pretty sure it is some issue of a column that thinks it is bigger than
it actually is.  I have tried, so far in vain, to make a reproducible
example that I can share.  I have one, but can't share it.

What happens is this:

A data.frame is made:

> d = data.frame(...)

Then I call apply over every row, calling a different function that takes
in a DT as well:

l = apply(d, 1, function(x) func(x[1], x[2], DT))

This returns a data.frame.  If I rbindlist this:

a = rbindlist(l)

I can print a just fine, and it will show me all data like normal. but if I
try to just do

a$x

x is one of the columns that was a key in DT, then it segfaults.  If I ask
for a column that was made by "func" and wasn't a column in DT, it works
fine.  If I ask for only the first 10 rows and then ask for x:

a[1:10]$x

it works fine.

So somewhere these key columns think they are different lengths than they
really are, and when I try to access it I go into memory I shouldn't so I
segfault.  How can I verify this? Is there something about the DT I can
check to see what DT thinks these columns are?


Also, if instead of apply when making the list, I do

l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT))

and rbindlist that, it works fine too.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/c6f9554d/attachment.html>

From caneff at gmail.com  Tue Sep 10 19:47:32 2013
From: caneff at gmail.com (Chris Neff)
Date: Tue, 10 Sep 2013 13:47:32 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
In-Reply-To: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
Message-ID: <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>

Narrowing it down further,

a$x

segfaults and

a[,x]

segfaults but

a[,"x", with=FALSE]

doesn't.


On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff <caneff at gmail.com> wrote:

> I'm pretty sure it is some issue of a column that thinks it is bigger than
> it actually is.  I have tried, so far in vain, to make a reproducible
> example that I can share.  I have one, but can't share it.
>
> What happens is this:
>
> A data.frame is made:
>
> > d = data.frame(...)
>
> Then I call apply over every row, calling a different function that takes
> in a DT as well:
>
> l = apply(d, 1, function(x) func(x[1], x[2], DT))
>
> This returns a data.frame.  If I rbindlist this:
>
> a = rbindlist(l)
>
> I can print a just fine, and it will show me all data like normal. but if
> I try to just do
>
> a$x
>
> x is one of the columns that was a key in DT, then it segfaults.  If I ask
> for a column that was made by "func" and wasn't a column in DT, it works
> fine.  If I ask for only the first 10 rows and then ask for x:
>
> a[1:10]$x
>
> it works fine.
>
> So somewhere these key columns think they are different lengths than they
> really are, and when I try to access it I go into memory I shouldn't so I
> segfault.  How can I verify this? Is there something about the DT I can
> check to see what DT thinks these columns are?
>
>
> Also, if instead of apply when making the list, I do
>
> l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT))
>
> and rbindlist that, it works fine too.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/240e02bb/attachment.html>

From FErickson at psu.edu  Tue Sep 10 20:02:03 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Tue, 10 Sep 2013 14:02:03 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
In-Reply-To: <CAJd-hdkF0P+Kut4aRyHTT7aCZLgaq9Z2rwg0eQ2_-XM8XcsLgQ@mail.gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <CAJd-hdkF0P+Kut4aRyHTT7aCZLgaq9Z2rwg0eQ2_-XM8XcsLgQ@mail.gmail.com>
Message-ID: <CAJd-hd=nUHo4kP3UmEn=zzqbRsZywvXtWK+=U5uBX=EBgxi-Zw@mail.gmail.com>

There's also a[["x"]], I suppose... :)

and, looking at methods(`[`) ...

`[.listof`(a,1)
`[.data.frame`(a,1)

if it's in the 1st column.

Because we can't fully see your example, maybe you'll want to look at these
other segfault stories:
http://stackoverflow.com/search?q=segfault+%5Bdata.table%5D I think they're
both fixed with the latest R and data.table, though.

--Frank

p.s. Sorry for the double reply, Chris; forgot to use "reply to all"


On Tue, Sep 10, 2013 at 1:59 PM, Frank Erickson <FErickson at psu.edu> wrote:

> There's also a[["x"]], I suppose... :)
>
> and, looking at methods(`[`) ...
>
> `[.listof`(a,1)
> `[.data.frame`(a,1)
>
> if it's in the 1st column.
>
> Because we can't fully see your example, maybe you'll want to look at
> these other segfault stories:
> http://stackoverflow.com/search?q=segfault+%5Bdata.table%5D I think
> they're both fixed with the latest R and data.table, though.
>
> --Frank
>
>
>
> On Tue, Sep 10, 2013 at 1:47 PM, Chris Neff <caneff at gmail.com> wrote:
>
>> Narrowing it down further,
>>
>> a$x
>>
>> segfaults and
>>
>> a[,x]
>>
>> segfaults but
>>
>> a[,"x", with=FALSE]
>>
>> doesn't.
>>
>>
>> On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff <caneff at gmail.com> wrote:
>>
>>> I'm pretty sure it is some issue of a column that thinks it is bigger
>>> than it actually is.  I have tried, so far in vain, to make a reproducible
>>> example that I can share.  I have one, but can't share it.
>>>
>>> What happens is this:
>>>
>>> A data.frame is made:
>>>
>>> > d = data.frame(...)
>>>
>>> Then I call apply over every row, calling a different function that
>>> takes in a DT as well:
>>>
>>> l = apply(d, 1, function(x) func(x[1], x[2], DT))
>>>
>>> This returns a data.frame.  If I rbindlist this:
>>>
>>> a = rbindlist(l)
>>>
>>> I can print a just fine, and it will show me all data like normal. but
>>> if I try to just do
>>>
>>> a$x
>>>
>>> x is one of the columns that was a key in DT, then it segfaults.  If I
>>> ask for a column that was made by "func" and wasn't a column in DT, it
>>> works fine.  If I ask for only the first 10 rows and then ask for x:
>>>
>>> a[1:10]$x
>>>
>>> it works fine.
>>>
>>> So somewhere these key columns think they are different lengths than
>>> they really are, and when I try to access it I go into memory I shouldn't
>>> so I segfault.  How can I verify this? Is there something about the DT I
>>> can check to see what DT thinks these columns are?
>>>
>>>
>>> Also, if instead of apply when making the list, I do
>>>
>>> l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT))
>>>
>>> and rbindlist that, it works fine too.
>>>
>>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/12fd9d4b/attachment.html>

From mdowle at mdowle.plus.com  Tue Sep 10 20:02:33 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 10 Sep 2013 19:02:33 +0100
Subject: [datatable-help] data.table segfaulting,
 need help verifying the reason
In-Reply-To: <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
Message-ID: <522F5EB9.2080903@mdowle.plus.com>


Nothing springs to mind.  Latest version v1.8.10 from CRAN right? Or 
v1.8.11 on R-Forge?

On this bit :
 > So somewhere these key columns think they are different lengths than 
they really are, and
 > when I try to access it I go into memory I shouldn't so I segfault. 
  How can I verify this? Is
 > there something about the DT I can check to see what DT thinks these 
columns are?

.Internal(inspect(DT)) reveals the internal structure including length 
and truelength on the column pointer vector as well as each column.

But it's a really odd way of using data.table.  Iterating by row is 
going to kill performance;  data.table likes by column.

If it really has to be by row  then   DT[, fun(.SD,...), by=1:nrow(DT)]  
should be better than apply().

Matthew

On 10/09/13 18:47, Chris Neff wrote:
> Narrowing it down further,
>
> a$x
>
> segfaults and
>
> a[,x]
>
> segfaults but
>
> a[,"x", with=FALSE]
>
> doesn't.
>
>
> On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff <caneff at gmail.com 
> <mailto:caneff at gmail.com>> wrote:
>
>     I'm pretty sure it is some issue of a column that thinks it is
>     bigger than it actually is.  I have tried, so far in vain, to make
>     a reproducible example that I can share.  I have one, but can't
>     share it.
>
>     What happens is this:
>
>     A data.frame is made:
>
>     > d = data.frame(...)
>
>     Then I call apply over every row, calling a different function
>     that takes in a DT as well:
>
>     l = apply(d, 1, function(x) func(x[1], x[2], DT))
>
>     This returns a data.frame.  If I rbindlist this:
>
>     a = rbindlist(l)
>
>     I can print a just fine, and it will show me all data like normal.
>     but if I try to just do
>
>     a$x
>
>     x is one of the columns that was a key in DT, then it segfaults.
>      If I ask for a column that was made by "func" and wasn't a column
>     in DT, it works fine.  If I ask for only the first 10 rows and
>     then ask for x:
>
>     a[1:10]$x
>
>     it works fine.
>
>     So somewhere these key columns think they are different lengths
>     than they really are, and when I try to access it I go into memory
>     I shouldn't so I segfault.  How can I verify this? Is there
>     something about the DT I can check to see what DT thinks these
>     columns are?
>
>
>     Also, if instead of apply when making the list, I do
>
>     l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT))
>
>     and rbindlist that, it works fine too.
>
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/96531771/attachment-0001.html>

From caneff at gmail.com  Tue Sep 10 20:51:35 2013
From: caneff at gmail.com (Chris Neff)
Date: Tue, 10 Sep 2013 14:51:35 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
In-Reply-To: <522F5EB9.2080903@mdowle.plus.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
Message-ID: <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>

On Tue, Sep 10, 2013 at 2:02 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:

>
> Nothing springs to mind.  Latest version v1.8.10 from CRAN right?  Or
> v1.8.11 on R-Forge?
>

Both. And 1.8.8.


>
> On this bit :
>
> > So somewhere these key columns think they are different lengths than
> they really are, and
> > when I try to access it I go into memory I shouldn't so I segfault.  How
> can I verify this? Is
> > there something about the DT I can check to see what DT thinks these
> columns are?
>
> .Internal(inspect(DT)) reveals the internal structure including length and
> truelength on the column pointer vector as well as each column.
>
> But it's a really odd way of using data.table.  Iterating by row is going
> to kill performance;  data.table likes by column.
>

Trust me I know this, this isn't my code :) I'm just the data.table guy who
helps debug. I am helping him with better ways, but I think we can agree
that it should at least not segfault.


I ran inspect on the two versions of the data.table, the one that crashes
that is made by doing rbindlist(apply(d,1,...)) and the one that doesn't
that gets made by doing rbindlist(lapply(1:nrow(d),...)), and changed the
variable names and censored out values.

First the one that fails (accessing either a$k1 or a$k2 will segfault):

> .Internal(inspect(a))
@2cc5be0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
  @3b643d0 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
    @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
    @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
    @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    ...
  ATTRIB:
    @ac6c20 02 LISTSXP g1c0 [MARK]
      TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
      @3ba6ad8 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
        @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
        @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
  @3b64e30 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    ...
  ATTRIB:
    @ac6cc8 02 LISTSXP g1c0 [MARK]
      TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
      @3ba6a68 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
        @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
        @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
  @3b65890 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
    @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    ...
  @1ff5850 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
  @1fc6600 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
  ...
ATTRIB:
  @21f6d48 02 LISTSXP g0c0 []
    TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
    @3efc1f0 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
      @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
      @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
      @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
      @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
      @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
      ...
    TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
    @2556908 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
    TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
    @2701b38 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
      @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
      @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
    TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
    @21f6e28 22 EXTPTRSXP g0c0 []


Secondly the one that works (all values can be accessed fine:

> .Internal(inspect(a))
@45b4850 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
  @33a53a0 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
    @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
    @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
    @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    ...
  @33a5e00 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
    ...
  @33a6860 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
    @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
    ...
  @1ff10f0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
  @3a6d0d0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
  ...
ATTRIB:
  @276c360 02 LISTSXP g0c0 []
    TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
    @1fe5670 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
      @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
      @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
      @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
      @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
      @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
      ...
    TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
    @29cbf38 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
    TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
    @2d539a0 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
      @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
      @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
    TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
    @276c440 22 EXTPTRSXP g0c0 []


It looks to me to be some differences in the ATTRs attached to k1 and k2 in
the first case?  I can't really parse this as well as you can.


> If it really has to be by row  then   DT[, fun(.SD,...), by=1:nrow(DT)]
> should be better than apply().
>
> Matthew
>
>
> On 10/09/13 18:47, Chris Neff wrote:
>
> Narrowing it down further,
>
>  a$x
>
>  segfaults and
>
>  a[,x]
>
>  segfaults but
>
>  a[,"x", with=FALSE]
>
>  doesn't.
>
>
> On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff <caneff at gmail.com> wrote:
>
>> I'm pretty sure it is some issue of a column that thinks it is bigger
>> than it actually is.  I have tried, so far in vain, to make a reproducible
>> example that I can share.  I have one, but can't share it.
>>
>>  What happens is this:
>>
>>  A data.frame is made:
>>
>>  > d = data.frame(...)
>>
>>  Then I call apply over every row, calling a different function that
>> takes in a DT as well:
>>
>>  l = apply(d, 1, function(x) func(x[1], x[2], DT))
>>
>>  This returns a data.frame.  If I rbindlist this:
>>
>>  a = rbindlist(l)
>>
>>  I can print a just fine, and it will show me all data like normal. but
>> if I try to just do
>>
>>  a$x
>>
>>  x is one of the columns that was a key in DT, then it segfaults.  If I
>> ask for a column that was made by "func" and wasn't a column in DT, it
>> works fine.  If I ask for only the first 10 rows and then ask for x:
>>
>>  a[1:10]$x
>>
>>  it works fine.
>>
>>  So somewhere these key columns think they are different lengths than
>> they really are, and when I try to access it I go into memory I shouldn't
>> so I segfault.  How can I verify this? Is there something about the DT I
>> can check to see what DT thinks these columns are?
>>
>>
>>  Also, if instead of apply when making the list, I do
>>
>>  l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT))
>>
>>  and rbindlist that, it works fine too.
>>
>>
>
>
> _______________________________________________
> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/a8a3a504/attachment.html>

From mdowle at mdowle.plus.com  Tue Sep 10 22:06:12 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 10 Sep 2013 21:06:12 +0100
Subject: [datatable-help] data.table segfaulting,
 need help verifying the reason
In-Reply-To: <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
 <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
Message-ID: <522F7BB4.8060300@mdowle.plus.com>


Yes, seems like the columns themselves have names, with inconsistent length.

lapply(a,names)  should reveal the "hidden" names

To remove them :

for (i in 1:ncol(a)) setattr(a[[i]],"names",NULL)

Then lapply(a,names) should be clear.

Then try again the things that segfaulted before.

If this fixes it,  we'll need to establish how the erroneous names got 
in there.


On 10/09/13 19:51, Chris Neff wrote:
>
>
>
> On Tue, Sep 10, 2013 at 2:02 PM, Matthew Dowle <mdowle at mdowle.plus.com 
> <mailto:mdowle at mdowle.plus.com>> wrote:
>
>
>     Nothing springs to mind.  Latest version v1.8.10 from CRAN right? 
>     Or v1.8.11 on R-Forge?
>
>
> Both. And 1.8.8.
>
>
>     On this bit :
>
>     > So somewhere these key columns think they are different lengths
>     than they really are, and
>     > when I try to access it I go into memory I shouldn't so I
>     segfault.  How can I verify this? Is
>     > there something about the DT I can check to see what DT thinks
>     these columns are?
>
>     .Internal(inspect(DT)) reveals the internal structure including
>     length and truelength on the column pointer vector as well as each
>     column.
>
>     But it's a really odd way of using data.table. Iterating by row is
>     going to kill performance; data.table likes by column.
>
>
> Trust me I know this, this isn't my code :) I'm just the data.table 
> guy who helps debug. I am helping him with better ways, but I think we 
> can agree that it should at least not segfault.
>
>
> I ran inspect on the two versions of the data.table, the one that 
> crashes that is made by doing rbindlist(apply(d,1,...)) and the one 
> that doesn't that gets made by doing rbindlist(lapply(1:nrow(d),...)), 
> and changed the variable names and censored out values.
>
> First the one that fails (accessing either a$k1 or a$k2 will segfault):
>
> > .Internal(inspect(a))
> @2cc5be0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
>   @3b643d0 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   ATTRIB:
>     @ac6c20 02 LISTSXP g1c0 [MARK]
>       TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>       @3ba6ad8 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
>         @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>         @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>   @3b64e30 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0)
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   ATTRIB:
>     @ac6cc8 02 LISTSXP g1c0 [MARK]
>       TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>       @3ba6a68 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0)
>         @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>         @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>   @3b65890 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     ...
>   @1ff5850 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
>   @1fc6600 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
>   ...
> ATTRIB:
>   @21f6d48 02 LISTSXP g0c0 []
>     TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>     @3efc1f0 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
>       @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>       @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>       @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
>       @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
>       @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
>       ...
>     TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
>     @2556908 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
>     TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
>     @2701b38 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
>       @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
>       @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
>     TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
>     @21f6e28 22 EXTPTRSXP g0c0 []
>
>
>
>
>
>
> Secondly the one that works (all values can be accessed fine:
>
> > .Internal(inspect(a))
> @45b4850 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100)
>   @33a53a0 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   @33a5e00 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########"
>     ...
>   @33a6860 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0)
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########"
>     ...
>   @1ff10f0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,...
>   @3a6d0d0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,...
>   ...
> ATTRIB:
>   @276c360 02 LISTSXP g0c0 []
>     TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names"
>     @1fe5670 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100)
>       @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1"
>       @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2"
>       @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1"
>       @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2"
>       @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3"
>       ...
>     TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names"
>     @29cbf38 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326
>     TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class"
>     @2d539a0 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
>       @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table"
>       @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame"
>     TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref"
>     @276c440 22 EXTPTRSXP g0c0 []
>
>
>
>
> It looks to me to be some differences in the ATTRs attached to k1 and 
> k2 in the first case?  I can't really parse this as well as you can.
>
>     If it really has to be by row  then   DT[, fun(.SD,...),
>     by=1:nrow(DT)]  should be better than apply().
>
>     Matthew
>
>
>     On 10/09/13 18:47, Chris Neff wrote:
>>     Narrowing it down further,
>>
>>     a$x
>>
>>     segfaults and
>>
>>     a[,x]
>>
>>     segfaults but
>>
>>     a[,"x", with=FALSE]
>>
>>     doesn't.
>>
>>
>>     On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff <caneff at gmail.com
>>     <mailto:caneff at gmail.com>> wrote:
>>
>>         I'm pretty sure it is some issue of a column that thinks it
>>         is bigger than it actually is.  I have tried, so far in vain,
>>         to make a reproducible example that I can share.  I have one,
>>         but can't share it.
>>
>>         What happens is this:
>>
>>         A data.frame is made:
>>
>>         > d = data.frame(...)
>>
>>         Then I call apply over every row, calling a different
>>         function that takes in a DT as well:
>>
>>         l = apply(d, 1, function(x) func(x[1], x[2], DT))
>>
>>         This returns a data.frame.  If I rbindlist this:
>>
>>         a = rbindlist(l)
>>
>>         I can print a just fine, and it will show me all data like
>>         normal. but if I try to just do
>>
>>         a$x
>>
>>         x is one of the columns that was a key in DT, then it
>>         segfaults.  If I ask for a column that was made by "func" and
>>         wasn't a column in DT, it works fine.  If I ask for only the
>>         first 10 rows and then ask for x:
>>
>>         a[1:10]$x
>>
>>         it works fine.
>>
>>         So somewhere these key columns think they are different
>>         lengths than they really are, and when I try to access it I
>>         go into memory I shouldn't so I segfault.  How can I verify
>>         this? Is there something about the DT I can check to see what
>>         DT thinks these columns are?
>>
>>
>>         Also, if instead of apply when making the list, I do
>>
>>         l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT))
>>
>>         and rbindlist that, it works fine too.
>>
>>
>>
>>
>>     _______________________________________________
>>     datatable-help mailing list
>>     datatable-help at lists.r-forge.r-project.org  <mailto:datatable-help at lists.r-forge.r-project.org>
>>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130910/aeba1dce/attachment-0001.html>

From caneff at gmail.com  Wed Sep 11 11:17:50 2013
From: caneff at gmail.com (Chris Neff)
Date: Wed, 11 Sep 2013 05:17:50 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
In-Reply-To: <522F7BB4.8060300@mdowle.plus.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
 <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
 <522F7BB4.8060300@mdowle.plus.com>
Message-ID: <CAAuY0RXrvkk7ZH6AwYaWzJxRvONnTU2zijTN1Pd8aQfARuWV9g@mail.gmail.com>

Indeed, it shows that k1 and k2 both have names of length 2, and both times
the value of names is just the variable names.

Where the names are getting added is by apply.  What the issue with
data.table is that it does not ignore names from short variables. I now
have a small reproducible example I can share:

d <- data.frame(x=1:5)

f <- function(x) {data.table(x=x, y=1:10)}

l <- apply(d, 1, f)

lapply(l, function(x) lapply(x, names)) # All values of x have a name

a <- rbindlist(l) # a$x will segfault after this


The underlying issue is what data.table and data.frame do with rownames and
recycling. Look at this simple case:

x <- 1:5
names(x) <- letters[1:5]

df <- data.frame(x=x, y=1:10)
#Warning message:
#  In data.frame(x = x, y = 1:10) :
#  row names were found from a short variable and have been discarded

lapply(df, names) # no names

dt <- data.table(x=x, y=1:1) # No warning

lapply(dt, names) # x has names, and they get recycled.


So data.table needs to follow data.frame logic for discarding row names
when they would otherwise be recycled.


Bug submitted here:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975
I'm surprised this has never arisen before, it seems like something that
has been around forever.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130911/1846abaa/attachment.html>

From aragorn168b at gmail.com  Wed Sep 11 11:24:29 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 11 Sep 2013 11:24:29 +0200
Subject: [datatable-help] data.table segfaulting,
 need help verifying the reason
In-Reply-To: <CAAuY0RXrvkk7ZH6AwYaWzJxRvONnTU2zijTN1Pd8aQfARuWV9g@mail.gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
 <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
 <522F7BB4.8060300@mdowle.plus.com>
 <CAAuY0RXrvkk7ZH6AwYaWzJxRvONnTU2zijTN1Pd8aQfARuWV9g@mail.gmail.com>
Message-ID: <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com>

Most likely, this (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4882&group_id=240&atid=5335), when fixed, will take care of it? 

Arun


On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote:

> Indeed, it shows that k1 and k2 both have names of length 2, and both times the value of names is just the variable names.
> 
> Where the names are getting added is by apply.  What the issue with data.table is that it does not ignore names from short variables. I now have a small reproducible example I can share: 
> 
> d <- data.frame(x=1:5)
> 
> f <- function(x) {data.table(x=x, y=1:10)}
> 
> l <- apply(d, 1, f)
> 
> lapply(l, function(x) lapply(x, names)) # All values of x have a name 
> 
> a <- rbindlist(l) # a$x will segfault after this
> 
> 
> The underlying issue is what data.table and data.frame do with rownames and recycling. Look at this simple case: 
> 
> x <- 1:5
> names(x) <- letters[1:5]
> 
> df <- data.frame(x=x, y=1:10) 
> #Warning message:
> #  In data.frame(x = x, y = 1:10) :
> #  row names were found from a short variable and have been discarded
> 
> lapply(df, names) # no names
> 
> dt <- data.table(x=x, y=1:1) # No warning
> 
> lapply(dt, names) # x has names, and they get recycled.
> 
> 
> So data.table needs to follow data.frame logic for discarding row names when they would otherwise be recycled. 
> 
> 
> Bug submitted here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975
 (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975)> 
> I'm surprised this has never arisen before, it seems like something that has been around forever.
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130911/df6afe30/attachment.html>

From caneff at gmail.com  Wed Sep 11 11:31:02 2013
From: caneff at gmail.com (Chris Neff)
Date: Wed, 11 Sep 2013 05:31:02 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
In-Reply-To: <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
 <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
 <522F7BB4.8060300@mdowle.plus.com>
 <CAAuY0RXrvkk7ZH6AwYaWzJxRvONnTU2zijTN1Pd8aQfARuWV9g@mail.gmail.com>
 <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com>
Message-ID: <CAAuY0RVn+MiRg9dHACnWARFGVK3RAadqG7+07jZahzjhk8uAjg@mail.gmail.com>

Yes, dropping names altogether in data.table would fix this, and would be
the cleanest thing overall since as is said in that thread data.table
doesn't really work with rownames in mind anyway.

Except it is less of a FR now and more of a bad bug because you can get
segfaults from it.


On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

>  Most likely, this<https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4882&group_id=240&atid=5335>,
> when fixed, will take care of it?
>
> Arun
>
> On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote:
>
> Indeed, it shows that k1 and k2 both have names of length 2, and both
> times the value of names is just the variable names.
>
> Where the names are getting added is by apply.  What the issue with
> data.table is that it does not ignore names from short variables. I now
> have a small reproducible example I can share:
>
> d <- data.frame(x=1:5)
>
> f <- function(x) {data.table(x=x, y=1:10)}
>
> l <- apply(d, 1, f)
>
> lapply(l, function(x) lapply(x, names)) # All values of x have a name
>
> a <- rbindlist(l) # a$x will segfault after this
>
>
> The underlying issue is what data.table and data.frame do with rownames
> and recycling. Look at this simple case:
>
> x <- 1:5
> names(x) <- letters[1:5]
>
> df <- data.frame(x=x, y=1:10)
> #Warning message:
> #  In data.frame(x = x, y = 1:10) :
> #  row names were found from a short variable and have been discarded
>
> lapply(df, names) # no names
>
> dt <- data.table(x=x, y=1:1) # No warning
>
> lapply(dt, names) # x has names, and they get recycled.
>
>
> So data.table needs to follow data.frame logic for discarding row names
> when they would otherwise be recycled.
>
>
> Bug submitted here:
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975
> I'm surprised this has never arisen before, it seems like something that
> has been around forever.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130911/ffabb93f/attachment.html>

From aragorn168b at gmail.com  Wed Sep 11 11:33:06 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 11 Sep 2013 11:33:06 +0200
Subject: [datatable-help] data.table segfaulting,
 need help verifying the reason
In-Reply-To: <CAAuY0RVn+MiRg9dHACnWARFGVK3RAadqG7+07jZahzjhk8uAjg@mail.gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
 <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
 <522F7BB4.8060300@mdowle.plus.com>
 <CAAuY0RXrvkk7ZH6AwYaWzJxRvONnTU2zijTN1Pd8aQfARuWV9g@mail.gmail.com>
 <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com>
 <CAAuY0RVn+MiRg9dHACnWARFGVK3RAadqG7+07jZahzjhk8uAjg@mail.gmail.com>
Message-ID: <F10BA77063774095A1C0188BADC61969@gmail.com>

Chris, 
It's not filed as a FR, IIRC. It's filed under "Internals".

Arun


On Wednesday, September 11, 2013 at 11:31 AM, Chris Neff wrote:

> Yes, dropping names altogether in data.table would fix this, and would be the cleanest thing overall since as is said in that thread data.table doesn't really work with rownames in mind anyway.
> 
> Except it is less of a FR now and more of a bad bug because you can get segfaults from it.
> 
> 
> On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > Most likely, this (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4882&group_id=240&atid=5335), when fixed, will take care of it? 
> > 
> > Arun
> > 
> > 
> > On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote:
> > 
> > 
> > 
> > > Indeed, it shows that k1 and k2 both have names of length 2, and both times the value of names is just the variable names.
> > > 
> > > Where the names are getting added is by apply.  What the issue with data.table is that it does not ignore names from short variables. I now have a small reproducible example I can share: 
> > > 
> > > d <- data.frame(x=1:5)
> > > 
> > > f <- function(x) {data.table(x=x, y=1:10)}
> > > 
> > > l <- apply(d, 1, f)
> > > 
> > > lapply(l, function(x) lapply(x, names)) # All values of x have a name 
> > > 
> > > a <- rbindlist(l) # a$x will segfault after this
> > > 
> > > 
> > > The underlying issue is what data.table and data.frame do with rownames and recycling. Look at this simple case: 
> > > 
> > > x <- 1:5
> > > names(x) <- letters[1:5]
> > > 
> > > df <- data.frame(x=x, y=1:10) 
> > > #Warning message:
> > > #  In data.frame(x = x, y = 1:10) :
> > > #  row names were found from a short variable and have been discarded
> > > 
> > > lapply(df, names) # no names
> > > 
> > > dt <- data.table(x=x, y=1:1) # No warning
> > > 
> > > lapply(dt, names) # x has names, and they get recycled.
> > > 
> > > 
> > > So data.table needs to follow data.frame logic for discarding row names when they would otherwise be recycled. 
> > > 
> > > 
> > > Bug submitted here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975
 (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975)> > > 
> > > I'm surprised this has never arisen before, it seems like something that has been around forever.
> > > 
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130911/d13e31c1/attachment-0001.html>

From caneff at gmail.com  Wed Sep 11 11:55:48 2013
From: caneff at gmail.com (Chris Neff)
Date: Wed, 11 Sep 2013 05:55:48 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
In-Reply-To: <F10BA77063774095A1C0188BADC61969@gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
 <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
 <522F7BB4.8060300@mdowle.plus.com>
 <CAAuY0RXrvkk7ZH6AwYaWzJxRvONnTU2zijTN1Pd8aQfARuWV9g@mail.gmail.com>
 <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com>
 <CAAuY0RVn+MiRg9dHACnWARFGVK3RAadqG7+07jZahzjhk8uAjg@mail.gmail.com>
 <F10BA77063774095A1C0188BADC61969@gmail.com>
Message-ID: <CAAuY0RU8GsROj1Et1TtcGmQtAXJRr_OQaRHsO3Axnud_5sOVTQ@mail.gmail.com>

Oh okay, sorry. Either way it is more than just a slight improvement :)
 But yes that would fix everything.


On Wed, Sep 11, 2013 at 5:33 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

>  Chris,
> It's not filed as a FR, IIRC. It's filed under "Internals".
>
> Arun
>
> On Wednesday, September 11, 2013 at 11:31 AM, Chris Neff wrote:
>
> Yes, dropping names altogether in data.table would fix this, and would be
> the cleanest thing overall since as is said in that thread data.table
> doesn't really work with rownames in mind anyway.
>
> Except it is less of a FR now and more of a bad bug because you can get
> segfaults from it.
>
>
> On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan <
> aragorn168b at gmail.com> wrote:
>
>  Most likely, this<https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4882&group_id=240&atid=5335>,
> when fixed, will take care of it?
>
> Arun
>
> On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote:
>
> Indeed, it shows that k1 and k2 both have names of length 2, and both
> times the value of names is just the variable names.
>
> Where the names are getting added is by apply.  What the issue with
> data.table is that it does not ignore names from short variables. I now
> have a small reproducible example I can share:
>
> d <- data.frame(x=1:5)
>
> f <- function(x) {data.table(x=x, y=1:10)}
>
> l <- apply(d, 1, f)
>
> lapply(l, function(x) lapply(x, names)) # All values of x have a name
>
> a <- rbindlist(l) # a$x will segfault after this
>
>
> The underlying issue is what data.table and data.frame do with rownames
> and recycling. Look at this simple case:
>
> x <- 1:5
> names(x) <- letters[1:5]
>
> df <- data.frame(x=x, y=1:10)
> #Warning message:
> #  In data.frame(x = x, y = 1:10) :
> #  row names were found from a short variable and have been discarded
>
> lapply(df, names) # no names
>
> dt <- data.table(x=x, y=1:1) # No warning
>
> lapply(dt, names) # x has names, and they get recycled.
>
>
> So data.table needs to follow data.frame logic for discarding row names
> when they would otherwise be recycled.
>
>
> Bug submitted here:
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975
> I'm surprised this has never arisen before, it seems like something that
> has been around forever.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130911/bde3b858/attachment.html>

From FErickson at psu.edu  Wed Sep 11 15:52:00 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Wed, 11 Sep 2013 09:52:00 -0400
Subject: [datatable-help] data.table segfaulting,
	need help verifying the reason
In-Reply-To: <CAAuY0RU8GsROj1Et1TtcGmQtAXJRr_OQaRHsO3Axnud_5sOVTQ@mail.gmail.com>
References: <CAAuY0RX44zcWBBmpCszjG_09a_bnDjtZS0LLkdLuZPSGschWWQ@mail.gmail.com>
 <CAAuY0RV7LQHY2E6KQLZif-3-euvojL+VWYYAQGrr08Bn+35b4A@mail.gmail.com>
 <522F5EB9.2080903@mdowle.plus.com>
 <CAAuY0RUbX5o-cmdsBjGpZWHqpUwMguv9xGyhONiZqpuddkCwAQ@mail.gmail.com>
 <522F7BB4.8060300@mdowle.plus.com>
 <CAAuY0RXrvkk7ZH6AwYaWzJxRvONnTU2zijTN1Pd8aQfARuWV9g@mail.gmail.com>
 <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com>
 <CAAuY0RVn+MiRg9dHACnWARFGVK3RAadqG7+07jZahzjhk8uAjg@mail.gmail.com>
 <F10BA77063774095A1C0188BADC61969@gmail.com>
 <CAAuY0RU8GsROj1Et1TtcGmQtAXJRr_OQaRHsO3Axnud_5sOVTQ@mail.gmail.com>
Message-ID: <CAJd-hdmQbMVPLuCZyvcynZr18eiQPKmsE-G9zBQn4EwnayMixg@mail.gmail.com>

@Chris: If your application is like the example given, you might consider
using

CJ(x=1:5,y=1:10)

which is a data.table analogue to

expand.grid(x=1:5,y=1:10)

that automatically sets a key of c("x","y") on the result.

--Frank


On Wed, Sep 11, 2013 at 5:55 AM, Chris Neff <caneff at gmail.com> wrote:

> Oh okay, sorry. Either way it is more than just a slight improvement :)
>  But yes that would fix everything.
>
>
> On Wed, Sep 11, 2013 at 5:33 AM, Arunkumar Srinivasan <
> aragorn168b at gmail.com> wrote:
>
>>  Chris,
>> It's not filed as a FR, IIRC. It's filed under "Internals".
>>
>> Arun
>>
>> On Wednesday, September 11, 2013 at 11:31 AM, Chris Neff wrote:
>>
>> Yes, dropping names altogether in data.table would fix this, and would be
>> the cleanest thing overall since as is said in that thread data.table
>> doesn't really work with rownames in mind anyway.
>>
>> Except it is less of a FR now and more of a bad bug because you can get
>> segfaults from it.
>>
>>
>> On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan <
>> aragorn168b at gmail.com> wrote:
>>
>>  Most likely, this<https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4882&group_id=240&atid=5335>,
>> when fixed, will take care of it?
>>
>> Arun
>>
>> On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote:
>>
>> Indeed, it shows that k1 and k2 both have names of length 2, and both
>> times the value of names is just the variable names.
>>
>> Where the names are getting added is by apply.  What the issue with
>> data.table is that it does not ignore names from short variables. I now
>> have a small reproducible example I can share:
>>
>> d <- data.frame(x=1:5)
>>
>> f <- function(x) {data.table(x=x, y=1:10)}
>>
>> l <- apply(d, 1, f)
>>
>> lapply(l, function(x) lapply(x, names)) # All values of x have a name
>>
>> a <- rbindlist(l) # a$x will segfault after this
>>
>>
>> The underlying issue is what data.table and data.frame do with rownames
>> and recycling. Look at this simple case:
>>
>> x <- 1:5
>> names(x) <- letters[1:5]
>>
>> df <- data.frame(x=x, y=1:10)
>> #Warning message:
>> #  In data.frame(x = x, y = 1:10) :
>> #  row names were found from a short variable and have been discarded
>>
>> lapply(df, names) # no names
>>
>> dt <- data.table(x=x, y=1:1) # No warning
>>
>> lapply(dt, names) # x has names, and they get recycled.
>>
>>
>> So data.table needs to follow data.frame logic for discarding row names
>> when they would otherwise be recycled.
>>
>>
>> Bug submitted here:
>> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975
>> I'm surprised this has never arisen before, it seems like something that
>> has been around forever.
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>>
>>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130911/2a647e59/attachment.html>

From sds at gnu.org  Wed Sep 11 23:35:35 2013
From: sds at gnu.org (Sam Steingold)
Date: Wed, 11 Sep 2013 17:35:35 -0400
Subject: [datatable-help] adding names to j columns is costly
Message-ID: <87d2ofnis8.fsf@gnu.org>

I find myself using setnames(...,"V1","...") very often because setting
them in aggregation is expensive:

--8<---------------cut here---------------start------------->8---
> delays.short <- delays.dt[,sum(count),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 1.262secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'sum(count)'
Starting dogroups ... done dogroups in 8.612 secs
> delays.short <- delays.dt[,list(count=sum(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count))'
Starting dogroups ... done dogroups in 11.918 secs
--8<---------------cut here---------------end--------------->8---

38% difference is a lot (3 seconds is not a big deal, but this is just a
toy dataset).

ISTR that I have asked this question before - is this still (data.table
1.8.10) the state of the art, or am I doing something stupid?

Thanks!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 13.04 (raring) X 11.0.11303000
http://www.childpsy.net/ http://think-israel.org http://truepeace.org
http://thereligionofpeace.com http://americancensorship.org http://iris.org.il
Money does not "play a role", it writes the scenario.

From mdowle at mdowle.plus.com  Thu Sep 12 01:50:02 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Thu, 12 Sep 2013 00:50:02 +0100
Subject: [datatable-help] adding names to j columns is costly
In-Reply-To: <87d2ofnis8.fsf@gnu.org>
References: <87d2ofnis8.fsf@gnu.org>
Message-ID: <523101AA.8040604@mdowle.plus.com>


I don't remember you asking this before!

How many rows does delay.dt have and how many groups?

 > because setting them in aggregation is expensive:

I'm not sure this example is proof of that.  On the contrary, the output 
shows that names are being dropped before grouping commences (they are 
reinstated after grouping), as is correct behaviour.  All I can think is 
that the list() wrapper itself is adding overhead. That might show up as 
this 38% difference if there are a very large number of groups (lots of 
calls to j). In the case of a single aggregate, the list() wrapper could 
be optimized away.  This would be a nice improvement I didn't think of 
before.

Does this theory fit with your experience?   If my guess is correct,  if 
you instead compare two queries where j has list() in both; e.g., 
list(sum(count),max(count))    -vs- list(s=sum(count), m=max(count))  
then I don't think you'll see a speed difference.


On 11/09/13 22:35, Sam Steingold wrote:
> I find myself using setnames(...,"V1","...") very often because setting
> them in aggregation is expensive:
>
> --8<---------------cut here---------------start------------->8---
>> delays.short <- delays.dt[,sum(count),by="delay"]
> Finding groups (bysameorder=TRUE) ... done in 1.262secs. bysameorder=TRUE and o__ is length 0
> Detected that j uses these columns: count
> Optimization is on but j left unchanged as 'sum(count)'
> Starting dogroups ... done dogroups in 8.612 secs
>> delays.short <- delays.dt[,list(count=sum(count)),by="delay"]
> Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0
> Detected that j uses these columns: count
> Optimization is on but j left unchanged as 'list(sum(count))'
> Starting dogroups ... done dogroups in 11.918 secs
> --8<---------------cut here---------------end--------------->8---
>
> 38% difference is a lot (3 seconds is not a big deal, but this is just a
> toy dataset).
>
> ISTR that I have asked this question before - is this still (data.table
> 1.8.10) the state of the art, or am I doing something stupid?
>
> Thanks!
>


From sds at gnu.org  Thu Sep 12 05:54:21 2013
From: sds at gnu.org (Sam Steingold)
Date: Wed, 11 Sep 2013 23:54:21 -0400
Subject: [datatable-help] adding names to j columns is costly
In-Reply-To: <523101AA.8040604@mdowle.plus.com> (Matthew Dowle's message of
 "Thu, 12 Sep 2013 00:50:02 +0100")
References: <87d2ofnis8.fsf@gnu.org> <523101AA.8040604@mdowle.plus.com>
Message-ID: <87wqmmn18y.fsf@gnu.org>

> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-09-12 00:50:02 +0100]:
>
> How many rows does delay.dt have and how many groups?

--8<---------------cut here---------------start------------->8---
> nrow(delays.dt)
[1] 18772831
> nrow(delays.short)
[1] 14893103
--8<---------------cut here---------------end--------------->8---

>> because setting them in aggregation is expensive:
>
> I'm not sure this example is proof of that.  On the contrary, the output
> shows that names are being dropped before grouping commences (they are
> reinstated after grouping), as is correct behaviour.  All I can think is
> that the list() wrapper itself is adding overhead. That might show up as
> this 38% difference if there are a very large number of groups (lots of
> calls to j). In the case of a single aggregate, the list() wrapper could
> be optimized away.  This would be a nice improvement I didn't think of
> before.

Yes, I would love to be able to drop the extra setnames() call.

> Does this theory fit with your experience?

Looks like it.

> If my guess is correct,  if
> you instead compare two queries where j has list() in both; e.g.,
> list(sum(count),max(count))    -vs- list(s=sum(count), m=max(count))
> then I don't think you'll see a speed difference.

--8<---------------cut here---------------start------------->8---
> delays.short <- delays.dt[,list(sum(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.91secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count))'
Starting dogroups ... done dogroups in 11.497 secs
> delays.short <- delays.dt[,list(s=sum(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.91secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count))'
Starting dogroups ... done dogroups in 11.535 secs
> delays.short <- delays.dt[,list(s=sum(count),m=max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.948secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 18.931 secs
> delays.short <- delays.dt[,list(sum(count),max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.968secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 17.872 secs
> delays.short <- delays.dt[,list(sum(count),max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 1.004secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 18.971 secs
> delays.short <- delays.dt[,list(s=sum(count),m=max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.946secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 18.799 secs
--8<---------------cut here---------------end--------------->8---


Thanks for your kind help!

>
> On 11/09/13 22:35, Sam Steingold wrote:
>> I find myself using setnames(...,"V1","...") very often because setting
>> them in aggregation is expensive:
>>
>> --8<---------------cut here---------------start------------->8---
>>> delays.short <- delays.dt[,sum(count),by="delay"]
>> Finding groups (bysameorder=TRUE) ... done in 1.262secs. bysameorder=TRUE and o__ is length 0
>> Detected that j uses these columns: count
>> Optimization is on but j left unchanged as 'sum(count)'
>> Starting dogroups ... done dogroups in 8.612 secs
>>> delays.short <- delays.dt[,list(count=sum(count)),by="delay"]
>> Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0
>> Detected that j uses these columns: count
>> Optimization is on but j left unchanged as 'list(sum(count))'
>> Starting dogroups ... done dogroups in 11.918 secs
>> --8<---------------cut here---------------end--------------->8---
>>
>> 38% difference is a lot (3 seconds is not a big deal, but this is just a
>> toy dataset).
>>
>> ISTR that I have asked this question before - is this still (data.table
>> 1.8.10) the state of the art, or am I doing something stupid?
>>
>> Thanks!
>>

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 13.04 (raring) X 11.0.11303000
http://www.childpsy.net/ http://americancensorship.org http://memri.org
http://mideasttruth.com http://iris.org.il http://truepeace.org
UNIX is a way of thinking.  Windows is a way of not thinking.

From abfriedman at gmail.com  Thu Sep 12 23:32:00 2013
From: abfriedman at gmail.com (Ari Friedman)
Date: Thu, 12 Sep 2013 17:32:00 -0400
Subject: [datatable-help] colClasses and fread
Message-ID: <CAAT1DuNBmEvJJOx71G2XB9Xz5zF+xTXpf256B96s=M3kq7u+gw@mail.gmail.com>

Dear maintainers of that most wonderful package that makes R fast with
big data,

I've recently discovered fread.  It's amazing.  My call to read.fwf on a
4GB file that took all night now takes under a minute after conversion
to csv via csvkit/in2csv.

However, automatic type detection is working very poorly, probably due
to the presence of a large number of columns with high rates of
missingness, plus a large number of character columns with encoded
values (these are medical and diagnostic codes).

Normally I'd specify colClasses, and the warning messages even tell me I
should specify colClasses, but there's no colClasses argument to fread.

Any thoughts on solving this?  Verbose output, warnings, and a
comparison of the guesses vs. what the documentation on the file says it
is are found below.  Unfortunately the data can't be shared, even in
small portions so I can't make this reproducible.

Thanks!
Ari


> dt <- fread('myfile.csv', verbose=TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 30) ... ','
Found 393 columns
First row with 393 fields occurs on line 1 (either column names or
first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 2994440
Subtracted 1 for last eol and any trailing empty lines, leaving
2994439 data rows
Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000
(first 5 rows)
Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000
(+middle 5 rows)
Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000
(+last 5 rows)
0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867'
Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867'
Bumping column 146 from REAL to STR on data row 9, field contains 'V5867'
Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869'
Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869'
Bumping column 147 from REAL to STR on data row 9, field contains 'V5869'
Bumping column 142 from INT to INT64 on data row 10, field contains 'V140'
Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140'
Bumping column 142 from REAL to STR on data row 10, field contains 'V140'
Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885'
Bumping column 17 from INT64 to REAL on data row 12, field contains 'J1885'
Bumping column 17 from REAL to STR on data row 12, field contains 'J1885'
Bumping column 74 from INT to INT64 on data row 12, field contains 'LT'
Bumping column 74 from INT64 to REAL on data row 12, field contains 'LT'
Bumping column 74 from REAL to STR on data row 12, field contains 'LT'
Bumping column 143 from INT to INT64 on data row 13, field contains 'V142'
Bumping column 143 from INT64 to REAL on data row 13, field contains 'V142'
Bumping column 143 from REAL to STR on data row 13, field contains 'V142'
Bumping column 14 from INT to INT64 on data row 22, field contains 'G0431'
Bumping column 14 from INT64 to REAL on data row 22, field contains 'G0431'
Bumping column 14 from REAL to STR on data row 22, field contains 'G0431'
Bumping column 21 from INT to INT64 on data row 23, field contains 'J7060'
Bumping column 21 from INT64 to REAL on data row 23, field contains 'J7060'
Bumping column 21 from REAL to STR on data row 23, field contains 'J7060'
Bumping column 24 from INT to INT64 on data row 27, field contains 'J2405'
Bumping column 24 from INT64 to REAL on data row 27, field contains 'J2405'
Bumping column 24 from REAL to STR on data row 27, field contains 'J2405'
Bumping column 72 from INT to INT64 on data row 35, field contains 'F1'
Bumping column 72 from INT64 to REAL on data row 35, field contains 'F1'
Bumping column 72 from REAL to STR on data row 35, field contains 'F1'
Bumping column 141 from INT to INT64 on data row 35, field contains 'V061'
Bumping column 141 from INT64 to REAL on data row 35, field contains 'V061'
Bumping column 141 from REAL to STR on data row 35, field contains 'V061'
Bumping column 26 from INT to INT64 on data row 37, field contains 'J0690'
Bumping column 26 from INT64 to REAL on data row 37, field contains 'J0690'
Bumping column 26 from REAL to STR on data row 37, field contains 'J0690'
Bumping column 28 from INT to INT64 on data row 37, field contains 'J7030'
Bumping column 28 from INT64 to REAL on data row 37, field contains 'J7030'
Bumping column 28 from REAL to STR on data row 37, field contains 'J7030'
Bumping column 29 from INT to INT64 on data row 37, field contains 'J7040'
Bumping column 29 from INT64 to REAL on data row 37, field contains 'J7040'
Bumping column 29 from REAL to STR on data row 37, field contains 'J7040'
Bumping column 25 from INT to INT64 on data row 43, field contains 'Q9967'
Bumping column 25 from INT64 to REAL on data row 43, field contains 'Q9967'
Bumping column 25 from REAL to STR on data row 43, field contains 'Q9967'
Bumping column 30 from INT to INT64 on data row 43, field contains 'J7030'
Bumping column 30 from INT64 to REAL on data row 43, field contains 'J7030'
Bumping column 30 from REAL to STR on data row 43, field contains 'J7030'
Bumping column 31 from INT to INT64 on data row 43, field contains 'J2405'
Bumping column 31 from INT64 to REAL on data row 43, field contains 'J2405'
Bumping column 31 from REAL to STR on data row 43, field contains 'J2405'
Bumping column 148 from INT to INT64 on data row 44, field contains 'V1551'
Bumping column 148 from INT64 to REAL on data row 44, field contains 'V1551'
Bumping column 148 from REAL to STR on data row 44, field contains 'V1551'
Bumping column 149 from INT to INT64 on data row 44, field contains 'V1588'
Bumping column 149 from INT64 to REAL on data row 44, field contains 'V1588'
Bumping column 149 from REAL to STR on data row 44, field contains 'V1588'
Bumping column 76 from INT to INT64 on data row 45, field contains 'RT'
Bumping column 76 from INT64 to REAL on data row 45, field contains 'RT'
Bumping column 76 from REAL to STR on data row 45, field contains 'RT'
Bumping column 27 from INT to INT64 on data row 53, field contains 'J2405'
Bumping column 27 from INT64 to REAL on data row 53, field contains 'J2405'
Bumping column 27 from REAL to STR on data row 53, field contains 'J2405'
Bumping column 32 from INT to INT64 on data row 56, field contains 'J1885'
Bumping column 32 from INT64 to REAL on data row 56, field contains 'J1885'
Bumping column 32 from REAL to STR on data row 56, field contains 'J1885'
Bumping column 33 from INT to INT64 on data row 56, field contains 'J2270'
Bumping column 33 from INT64 to REAL on data row 56, field contains 'J2270'
Bumping column 33 from REAL to STR on data row 56, field contains 'J2270'
Bumping column 34 from INT to INT64 on data row 56, field contains 'J2405'
Bumping column 34 from INT64 to REAL on data row 56, field contains 'J2405'
Bumping column 34 from REAL to STR on data row 56, field contains 'J2405'
Bumping column 77 from INT to INT64 on data row 65, field contains 'LT'
Bumping column 77 from INT64 to REAL on data row 65, field contains 'LT'
Bumping column 77 from REAL to STR on data row 65, field contains 'LT'
Bumping column 140 from INT to INT64 on data row 74, field contains 'V689'
Bumping column 140 from INT64 to REAL on data row 74, field contains 'V689'
Bumping column 140 from REAL to STR on data row 74, field contains 'V689'
Bumping column 13 from INT to INT64 on data row 103, field contains 'J1100'
Bumping column 13 from INT64 to REAL on data row 103, field contains 'J1100'
Bumping column 13 from REAL to STR on data row 103, field contains 'J1100'
Bumping column 150 from INT to INT64 on data row 104, field contains 'V1508'
Bumping column 150 from INT64 to REAL on data row 104, field contains 'V1508'
Bumping column 150 from REAL to STR on data row 104, field contains 'V1508'
Bumping column 212 from INT to INT64 on data row 107, field contains 'V714'
Bumping column 212 from INT64 to REAL on data row 107, field contains 'V714'
Bumping column 212 from REAL to STR on data row 107, field contains 'V714'
Bumping column 12 from INT to INT64 on data row 113, field contains 'A0427'
Bumping column 12 from INT64 to REAL on data row 113, field contains 'A0427'
Bumping column 12 from REAL to STR on data row 113, field contains 'A0427'
Bumping column 81 from INT to INT64 on data row 113, field contains 'RH'
Bumping column 81 from INT64 to REAL on data row 113, field contains 'RH'
Bumping column 81 from REAL to STR on data row 113, field contains 'RH'
Bumping column 102 from INT to INT64 on data row 113, field contains 'QM'
Bumping column 102 from INT64 to REAL on data row 113, field contains 'QM'
Bumping column 102 from REAL to STR on data row 113, field contains 'QM'
Bumping column 111 from INT to INT64 on data row 113, field contains 'QM'
Bumping column 111 from INT64 to REAL on data row 113, field contains 'QM'
Bumping column 111 from REAL to STR on data row 113, field contains 'QM'
Bumping column 151 from INT to INT64 on data row 294, field contains 'V146'
Bumping column 151 from INT64 to REAL on data row 294, field contains 'V146'
Bumping column 151 from REAL to STR on data row 294, field contains 'V146'
Bumping column 152 from INT to INT64 on data row 294, field contains 'V148'
Bumping column 152 from INT64 to REAL on data row 294, field contains 'V148'
Bumping column 152 from REAL to STR on data row 294, field contains 'V148'
Bumping column 84 from INT to INT64 on data row 346, field contains 'RH'
Bumping column 84 from INT64 to REAL on data row 346, field contains 'RH'
Bumping column 84 from REAL to STR on data row 346, field contains 'RH'
Bumping column 114 from INT to INT64 on data row 346, field contains 'QM'
Bumping column 114 from INT64 to REAL on data row 346, field contains 'QM'
Bumping column 114 from REAL to STR on data row 346, field contains 'QM'
Bumping column 36 from INT to INT64 on data row 348, field contains 'J1644'
Bumping column 36 from INT64 to REAL on data row 348, field contains 'J1644'
Bumping column 36 from REAL to STR on data row 348, field contains 'J1644'
Bumping column 37 from INT to INT64 on data row 348, field contains 'J7030'
Bumping column 37 from INT64 to REAL on data row 348, field contains 'J7030'
Bumping column 37 from REAL to STR on data row 348, field contains 'J7030'
Bumping column 38 from INT to INT64 on data row 348, field contains 'J2405'
Bumping column 38 from INT64 to REAL on data row 348, field contains 'J2405'
Bumping column 38 from REAL to STR on data row 348, field contains 'J2405'
Bumping column 39 from INT to INT64 on data row 349, field contains 'J2405'
Bumping column 39 from INT64 to REAL on data row 349, field contains 'J2405'
Bumping column 39 from REAL to STR on data row 349, field contains 'J2405'
Bumping column 103 from INT to INT64 on data row 702, field contains 'QM'
Bumping column 103 from INT64 to REAL on data row 702, field contains 'QM'
Bumping column 103 from REAL to STR on data row 702, field contains 'QM'
Bumping column 104 from INT to INT64 on data row 702, field contains 'QM'
Bumping column 104 from INT64 to REAL on data row 702, field contains 'QM'
Bumping column 104 from REAL to STR on data row 702, field contains 'QM'
Bumping column 153 from INT to INT64 on data row 815, field contains 'V4561'
Bumping column 153 from INT64 to REAL on data row 815, field contains 'V4561'
Bumping column 153 from REAL to STR on data row 815, field contains 'V4561'
Bumping column 78 from INT to INT64 on data row 891, field contains 'RT'
Bumping column 78 from INT64 to REAL on data row 891, field contains 'RT'
Bumping column 78 from REAL to STR on data row 891, field contains 'RT'
Bumping column 79 from INT to INT64 on data row 891, field contains 'LT'
Bumping column 79 from INT64 to REAL on data row 891, field contains 'LT'
Bumping column 79 from REAL to STR on data row 891, field contains 'LT'
Bumping column 80 from INT to INT64 on data row 891, field contains 'LT'
Bumping column 80 from INT64 to REAL on data row 891, field contains 'LT'
Bumping column 80 from REAL to STR on data row 891, field contains 'LT'
Bumping column 35 from INT to INT64 on data row 892, field contains 'J2270'
Bumping column 35 from INT64 to REAL on data row 892, field contains 'J2270'
Bumping column 35 from REAL to STR on data row 892, field contains 'J2270'
Bumping column 82 from INT to INT64 on data row 931, field contains 'RH'
Bumping column 82 from INT64 to REAL on data row 931, field contains 'RH'
Bumping column 82 from REAL to STR on data row 931, field contains 'RH'
Bumping column 112 from INT to INT64 on data row 931, field contains 'QM'
Bumping column 112 from INT64 to REAL on data row 931, field contains 'QM'
Bumping column 112 from REAL to STR on data row 931, field contains 'QM'
Bumping column 154 from INT to INT64 on data row 1151, field contains 'V4582'
Bumping column 154 from INT64 to REAL on data row 1151, field contains 'V4582'
Bumping column 154 from REAL to STR on data row 1151, field contains 'V4582'
Bumping column 107 from INT to INT64 on data row 1268, field contains 'QM'
Bumping column 107 from INT64 to REAL on data row 1268, field contains 'QM'
Bumping column 107 from REAL to STR on data row 1268, field contains 'QM'
Bumping column 40 from INT to INT64 on data row 1414, field contains 'J2270'
Bumping column 40 from INT64 to REAL on data row 1414, field contains 'J2270'
Bumping column 40 from REAL to STR on data row 1414, field contains 'J2270'
Bumping column 41 from INT to INT64 on data row 1414, field contains 'J7040'
Bumping column 41 from INT64 to REAL on data row 1414, field contains 'J7040'
Bumping column 41 from REAL to STR on data row 1414, field contains 'J7040'
Bumping column 155 from INT to INT64 on data row 1417, field contains 'V8741'
Bumping column 155 from INT64 to REAL on data row 1417, field contains 'V8741'
Bumping column 155 from REAL to STR on data row 1417, field contains 'V8741'
Bumping column 156 from INT to INT64 on data row 1417, field contains 'V1504'
Bumping column 156 from INT64 to REAL on data row 1417, field contains 'V1504'
Bumping column 156 from REAL to STR on data row 1417, field contains 'V1504'
Bumping column 157 from INT to INT64 on data row 1417, field contains 'V2651'
Bumping column 157 from INT64 to REAL on data row 1417, field contains 'V2651'
Bumping column 157 from REAL to STR on data row 1417, field contains 'V2651'
Bumping column 83 from INT to INT64 on data row 1629, field contains 'GP'
Bumping column 83 from INT64 to REAL on data row 1629, field contains 'GP'
Bumping column 83 from REAL to STR on data row 1629, field contains 'GP'
Bumping column 105 from INT to INT64 on data row 1688, field contains 'QM'
Bumping column 105 from INT64 to REAL on data row 1688, field contains 'QM'
Bumping column 105 from REAL to STR on data row 1688, field contains 'QM'
Bumping column 110 from INT to INT64 on data row 1999, field contains 'QM'
Bumping column 110 from INT64 to REAL on data row 1999, field contains 'QM'
Bumping column 110 from REAL to STR on data row 1999, field contains 'QM'
Bumping column 106 from INT to INT64 on data row 2019, field contains 'QM'
Bumping column 106 from INT64 to REAL on data row 2019, field contains 'QM'
Bumping column 106 from REAL to STR on data row 2019, field contains 'QM'
Bumping column 85 from INT to INT64 on data row 2341, field contains 'SH'
Bumping column 85 from INT64 to REAL on data row 2341, field contains 'SH'
Bumping column 85 from REAL to STR on data row 2341, field contains 'SH'
Bumping column 115 from INT to INT64 on data row 2341, field contains 'QN'
Bumping column 115 from INT64 to REAL on data row 2341, field contains 'QN'
Bumping column 115 from REAL to STR on data row 2341, field contains 'QN'
Bumping column 350 from INT to INT64 on data row 2791, field contains 'C'
Bumping column 350 from INT64 to REAL on data row 2791, field contains 'C'
Bumping column 350 from REAL to STR on data row 2791, field contains 'C'
Bumping column 353 from INT to INT64 on data row 2791, field contains 'C'
Bumping column 353 from INT64 to REAL on data row 2791, field contains 'C'
Bumping column 353 from REAL to STR on data row 2791, field contains 'C'
Bumping column 108 from INT to INT64 on data row 2898, field contains 'QM'
Bumping column 108 from INT64 to REAL on data row 2898, field contains 'QM'
Bumping column 108 from REAL to STR on data row 2898, field contains 'QM'
Bumping column 158 from INT to INT64 on data row 3011, field contains 'V441'
Bumping column 158 from INT64 to REAL on data row 3011, field contains 'V441'
Bumping column 158 from REAL to STR on data row 3011, field contains 'V441'
Bumping column 159 from INT to INT64 on data row 3011, field contains 'V1582'
Bumping column 159 from INT64 to REAL on data row 3011, field contains 'V1582'
Bumping column 159 from REAL to STR on data row 3011, field contains 'V1582'
Bumping column 160 from INT to INT64 on data row 3011, field contains 'V5861'
Bumping column 160 from INT64 to REAL on data row 3011, field contains 'V5861'
Bumping column 160 from REAL to STR on data row 3011, field contains 'V5861'
Bumping column 86 from INT to INT64 on data row 3021, field contains 'RH'
Bumping column 86 from INT64 to REAL on data row 3021, field contains 'RH'
Bumping column 86 from REAL to STR on data row 3021, field contains 'RH'
Bumping column 116 from INT to INT64 on data row 3021, field contains 'QM'
Bumping column 116 from INT64 to REAL on data row 3021, field contains 'QM'
Bumping column 116 from REAL to STR on data row 3021, field contains 'QM'
Bumping column 109 from INT to INT64 on data row 3112, field contains 'QM'
Bumping column 109 from INT64 to REAL on data row 3112, field contains 'QM'
Bumping column 109 from REAL to STR on data row 3112, field contains 'QM'
Bumping column 113 from INT to INT64 on data row 5208, field contains 'QM'
Bumping column 113 from INT64 to REAL on data row 5208, field contains 'QM'
Bumping column 113 from REAL to STR on data row 5208, field contains 'QM'
Bumping column 188 from INT to INT64 on data row 8138, field contains 'Y'
Bumping column 188 from INT64 to REAL on data row 8138, field contains 'Y'
Bumping column 188 from REAL to STR on data row 8138, field contains 'Y'
Bumping column 189 from INT to INT64 on data row 8138, field contains 'Y'
Bumping column 189 from INT64 to REAL on data row 8138, field contains 'Y'
Bumping column 189 from REAL to STR on data row 8138, field contains 'Y'
Bumping column 190 from INT to INT64 on data row 8138, field contains 'Y'
Bumping column 190 from INT64 to REAL on data row 8138, field contains 'Y'
Bumping column 190 from REAL to STR on data row 8138, field contains 'Y'
0%Bumping column 161 from INT to INT64 on data row 13758, field contains 'V1582'
Bumping column 161 from INT64 to REAL on data row 13758, field contains 'V1582'
Bumping column 161 from REAL to STR on data row 13758, field contains 'V1582'
Bumping column 231 from INT to INT64 on data row 18303, field contains 'Y'
Bumping column 231 from INT64 to REAL on data row 18303, field contains 'Y'
Bumping column 231 from REAL to STR on data row 18303, field contains 'Y'
Bumping column 87 from INT to INT64 on data row 20592, field contains 'GO'
Bumping column 87 from INT64 to REAL on data row 20592, field contains 'GO'
Bumping column 87 from REAL to STR on data row 20592, field contains 'GO'
Bumping column 192 from INT to INT64 on data row 29413, field contains 'Y'
Bumping column 192 from INT64 to REAL on data row 29413, field contains 'Y'
Bumping column 192 from REAL to STR on data row 29413, field contains 'Y'
Bumping column 193 from INT to INT64 on data row 29413, field contains 'Y'
Bumping column 193 from INT64 to REAL on data row 29413, field contains 'Y'
Bumping column 193 from REAL to STR on data row 29413, field contains 'Y'
Bumping column 194 from INT to INT64 on data row 29413, field contains 'Y'
Bumping column 194 from INT64 to REAL on data row 29413, field contains 'Y'
Bumping column 194 from REAL to STR on data row 29413, field contains 'Y'
Bumping column 96 from INT to INT64 on data row 31954, field contains 'LT'
Bumping column 96 from INT64 to REAL on data row 31954, field contains 'LT'
Bumping column 96 from REAL to STR on data row 31954, field contains 'LT'
Bumping column 191 from INT to INT64 on data row 41091, field contains 'Y'
Bumping column 191 from INT64 to REAL on data row 41091, field contains 'Y'
Bumping column 191 from REAL to STR on data row 41091, field contains 'Y'
Bumping column 162 from INT to INT64 on data row 44469, field contains 'V1582'
Bumping column 162 from INT64 to REAL on data row 44469, field contains 'V1582'
Bumping column 162 from REAL to STR on data row 44469, field contains 'V1582'
Bumping column 163 from INT to INT64 on data row 49003, field contains 'V5865'
Bumping column 163 from INT64 to REAL on data row 49003, field contains 'V5865'
Bumping column 163 from REAL to STR on data row 49003, field contains 'V5865'
Bumping column 90 from INT to INT64 on data row 87095, field contains 'EH'
Bumping column 90 from INT64 to REAL on data row 87095, field contains 'EH'
Bumping column 90 from REAL to STR on data row 87095, field contains 'EH'
Bumping column 120 from INT to INT64 on data row 87095, field contains 'QM'
Bumping column 120 from INT64 to REAL on data row 87095, field contains 'QM'
Bumping column 120 from REAL to STR on data row 87095, field contains 'QM'
Bumping column 213 from INT to INT64 on data row 91672, field contains 'V692'
Bumping column 213 from INT64 to REAL on data row 91672, field contains 'V692'
Bumping column 213 from REAL to STR on data row 91672, field contains 'V692'
Bumping column 338 from INT to INT64 on data row 92112, field contains 'D'
Bumping column 338 from INT64 to REAL on data row 92112, field contains 'D'
Bumping column 338 from REAL to STR on data row 92112, field contains 'D'
Bumping column 339 from INT to INT64 on data row 92112, field contains 'D'
Bumping column 339 from INT64 to REAL on data row 92112, field contains 'D'
Bumping column 339 from REAL to STR on data row 92112, field contains 'D'
Bumping column 214 from INT to INT64 on data row 92181, field contains 'V681'
Bumping column 214 from INT64 to REAL on data row 92181, field contains 'V681'
Bumping column 214 from REAL to STR on data row 92181, field contains 'V681'
Bumping column 91 from INT to INT64 on data row 95380, field contains 'GP'
Bumping column 91 from INT64 to REAL on data row 95380, field contains 'GP'
Bumping column 91 from REAL to STR on data row 95380, field contains 'GP'
Bumping column 216 from INT to INT64 on data row 109576, field contains 'E8499'
Bumping column 216 from INT64 to REAL on data row 109576, field contains 'E8499'
Bumping column 216 from REAL to STR on data row 109576, field contains 'E8499'
4%Bumping column 98 from INT to INT64 on data row 115301, field contains 'GP'
Bumping column 98 from INT64 to REAL on data row 115301, field contains 'GP'
Bumping column 98 from REAL to STR on data row 115301, field contains 'GP'
Bumping column 117 from INT to INT64 on data row 188433, field contains 'QM'
Bumping column 117 from INT64 to REAL on data row 188433, field contains 'QM'
Bumping column 117 from REAL to STR on data row 188433, field contains 'QM'
Bumping column 93 from INT to INT64 on data row 188671, field contains 'LT'
Bumping column 93 from INT64 to REAL on data row 188671, field contains 'LT'
Bumping column 93 from REAL to STR on data row 188671, field contains 'LT'
Bumping column 92 from INT to INT64 on data row 188909, field contains 'RH'
Bumping column 92 from INT64 to REAL on data row 188909, field contains 'RH'
Bumping column 92 from REAL to STR on data row 188909, field contains 'RH'
Bumping column 122 from INT to INT64 on data row 188909, field contains 'QM'
Bumping column 122 from INT64 to REAL on data row 188909, field contains 'QM'
Bumping column 122 from REAL to STR on data row 188909, field contains 'QM'
Bumping column 121 from INT to INT64 on data row 189176, field contains 'QM'
Bumping column 121 from INT64 to REAL on data row 189176, field contains 'QM'
Bumping column 121 from REAL to STR on data row 189176, field contains 'QM'
Bumping column 195 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 195 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 195 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 196 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 196 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 196 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 197 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 197 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 197 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 198 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 198 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 198 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 199 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 199 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 199 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 200 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 200 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 200 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 201 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 201 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 201 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 202 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 202 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 202 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 203 from INT to INT64 on data row 189548, field contains 'Y'
Bumping column 203 from INT64 to REAL on data row 189548, field contains 'Y'
Bumping column 203 from REAL to STR on data row 189548, field contains 'Y'
Bumping column 232 from INT to INT64 on data row 189586, field contains 'U'
Bumping column 232 from INT64 to REAL on data row 189586, field contains 'U'
Bumping column 232 from REAL to STR on data row 189586, field contains 'U'
Bumping column 123 from INT to INT64 on data row 190895, field contains 'QM'
Bumping column 123 from INT64 to REAL on data row 190895, field contains 'QM'
Bumping column 123 from REAL to STR on data row 190895, field contains 'QM'
Bumping column 97 from INT to INT64 on data row 191623, field contains 'NH'
Bumping column 97 from INT64 to REAL on data row 191623, field contains 'NH'
Bumping column 97 from REAL to STR on data row 191623, field contains 'NH'
Bumping column 127 from INT to INT64 on data row 191623, field contains 'QM'
Bumping column 127 from INT64 to REAL on data row 191623, field contains 'QM'
Bumping column 127 from REAL to STR on data row 191623, field contains 'QM'
Bumping column 88 from INT to INT64 on data row 191828, field contains 'RH'
Bumping column 88 from INT64 to REAL on data row 191828, field contains 'RH'
Bumping column 88 from REAL to STR on data row 191828, field contains 'RH'
Bumping column 118 from INT to INT64 on data row 191828, field contains 'QM'
Bumping column 118 from INT64 to REAL on data row 191828, field contains 'QM'
Bumping column 118 from REAL to STR on data row 191828, field contains 'QM'
Bumping column 89 from INT to INT64 on data row 191925, field contains 'RH'
Bumping column 89 from INT64 to REAL on data row 191925, field contains 'RH'
Bumping column 89 from REAL to STR on data row 191925, field contains 'RH'
Bumping column 119 from INT to INT64 on data row 191925, field contains 'QM'
Bumping column 119 from INT64 to REAL on data row 191925, field contains 'QM'
Bumping column 119 from REAL to STR on data row 191925, field contains 'QM'
Bumping column 94 from INT to INT64 on data row 196090, field contains 'RH'
Bumping column 94 from INT64 to REAL on data row 196090, field contains 'RH'
Bumping column 94 from REAL to STR on data row 196090, field contains 'RH'
Bumping column 124 from INT to INT64 on data row 196090, field contains 'QM'
Bumping column 124 from INT64 to REAL on data row 196090, field contains 'QM'
Bumping column 124 from REAL to STR on data row 196090, field contains 'QM'
Bumping column 217 from INT to INT64 on data row 196596, field contains 'E9208'
Bumping column 217 from INT64 to REAL on data row 196596, field contains 'E9208'
Bumping column 217 from REAL to STR on data row 196596, field contains 'E9208'
Bumping column 126 from INT to INT64 on data row 197965, field contains 'QM'
Bumping column 126 from INT64 to REAL on data row 197965, field contains 'QM'
Bumping column 126 from REAL to STR on data row 197965, field contains 'QM'
Bumping column 95 from INT to INT64 on data row 208608, field contains 'LT'
Bumping column 95 from INT64 to REAL on data row 208608, field contains 'LT'
Bumping column 95 from REAL to STR on data row 208608, field contains 'LT'
Bumping column 218 from INT to INT64 on data row 216015, field contains 'E0008'
Bumping column 218 from INT64 to REAL on data row 216015, field contains 'E0008'
Bumping column 218 from REAL to STR on data row 216015, field contains 'E0008'
Bumping column 219 from INT to INT64 on data row 224785, field contains 'E030'
Bumping column 219 from INT64 to REAL on data row 224785, field contains 'E030'
Bumping column 219 from REAL to STR on data row 224785, field contains 'E030'
8%Bumping column 220 from INT to INT64 on data row 233544, field
contains 'E8499'
Bumping column 220 from INT64 to REAL on data row 233544, field contains 'E8499'
Bumping column 220 from REAL to STR on data row 233544, field contains 'E8499'
Bumping column 221 from INT to INT64 on data row 233544, field contains 'E0008'
Bumping column 221 from INT64 to REAL on data row 233544, field contains 'E0008'
Bumping column 221 from REAL to STR on data row 233544, field contains 'E0008'
Bumping column 100 from INT to INT64 on data row 253181, field contains 'GP'
Bumping column 100 from INT64 to REAL on data row 253181, field contains 'GP'
Bumping column 100 from REAL to STR on data row 253181, field contains 'GP'
Bumping column 99 from INT to INT64 on data row 330461, field contains 'GO'
Bumping column 99 from INT64 to REAL on data row 330461, field contains 'GO'
Bumping column 99 from REAL to STR on data row 330461, field contains 'GO'
12%Bumping column 128 from INT to INT64 on data row 419322, field contains 'QN'
Bumping column 128 from INT64 to REAL on data row 419322, field contains 'QN'
Bumping column 128 from REAL to STR on data row 419322, field contains 'QN'
Bumping column 130 from INT to INT64 on data row 420977, field contains 'QN'
Bumping column 130 from INT64 to REAL on data row 420977, field contains 'QN'
Bumping column 130 from REAL to STR on data row 420977, field contains 'QN'
Bumping column 125 from INT to INT64 on data row 426618, field contains 'QN'
Bumping column 125 from INT64 to REAL on data row 426618, field contains 'QN'
Bumping column 125 from REAL to STR on data row 426618, field contains 'QN'
Bumping column 101 from INT to INT64 on data row 446983, field contains 'HN'
Bumping column 101 from INT64 to REAL on data row 446983, field contains 'HN'
Bumping column 101 from REAL to STR on data row 446983, field contains 'HN'
Bumping column 131 from INT to INT64 on data row 446983, field contains 'QN'
Bumping column 131 from INT64 to REAL on data row 446983, field contains 'QN'
Bumping column 131 from REAL to STR on data row 446983, field contains 'QN'
Bumping column 129 from INT to INT64 on data row 448799, field contains 'QN'
Bumping column 129 from INT64 to REAL on data row 448799, field contains 'QN'
Bumping column 129 from REAL to STR on data row 448799, field contains 'QN'
Bumping column 233 from INT to INT64 on data row 455718, field contains 'Y'
Bumping column 233 from INT64 to REAL on data row 455718, field contains 'Y'
Bumping column 233 from REAL to STR on data row 455718, field contains 'Y'
Bumping column 234 from INT to INT64 on data row 458104, field contains 'Y'
Bumping column 234 from INT64 to REAL on data row 458104, field contains 'Y'
Bumping column 234 from REAL to STR on data row 458104, field contains 'Y'
Bumping column 235 from INT to INT64 on data row 458104, field contains 'Y'
Bumping column 235 from INT64 to REAL on data row 458104, field contains 'Y'
Bumping column 235 from REAL to STR on data row 458104, field contains 'Y'
16%Bumping column 204 from INT to INT64 on data row 535636, field contains 'U'
Bumping column 204 from INT64 to REAL on data row 535636, field contains 'U'
Bumping column 204 from REAL to STR on data row 535636, field contains 'U'
Bumping column 205 from INT to INT64 on data row 544450, field contains 'U'
Bumping column 205 from INT64 to REAL on data row 544450, field contains 'U'
Bumping column 205 from REAL to STR on data row 544450, field contains 'U'
Bumping column 206 from INT to INT64 on data row 563578, field contains 'U'
Bumping column 206 from INT64 to REAL on data row 563578, field contains 'U'
Bumping column 206 from REAL to STR on data row 563578, field contains 'U'
Bumping column 207 from INT to INT64 on data row 563578, field contains 'U'
Bumping column 207 from INT64 to REAL on data row 563578, field contains 'U'
Bumping column 207 from REAL to STR on data row 563578, field contains 'U'
Bumping column 208 from INT to INT64 on data row 570116, field contains 'U'
Bumping column 208 from INT64 to REAL on data row 570116, field contains 'U'
Bumping column 208 from REAL to STR on data row 570116, field contains 'U'
Bumping column 209 from INT to INT64 on data row 570116, field contains 'U'
Bumping column 209 from INT64 to REAL on data row 570116, field contains 'U'
Bumping column 209 from REAL to STR on data row 570116, field contains 'U'
24%Bumping column 8 from INT to INT64 on data row 768577, field contains 'F'
Bumping column 8 from INT64 to REAL on data row 768577, field contains 'F'
Bumping column 8 from REAL to STR on data row 768577, field contains 'F'
28%Bumping column 210 from INT to INT64 on data row 948003, field contains 'U'
Bumping column 210 from INT64 to REAL on data row 948003, field contains 'U'
Bumping column 210 from REAL to STR on data row 948003, field contains 'U'
Bumping column 211 from INT to INT64 on data row 948003, field contains 'U'
Bumping column 211 from INT64 to REAL on data row 948003, field contains 'U'
Bumping column 211 from REAL to STR on data row 948003, field contains 'U'
48%Bumping column 222 from INT to INT64 on data row 1567231, field
contains 'E0009'
Bumping column 222 from INT64 to REAL on data row 1567231, field
contains 'E0009'
Bumping column 222 from REAL to STR on data row 1567231, field contains 'E0009'
71%Bumping column 236 from INT to INT64 on data row 2163874, field contains 'U'
Bumping column 236 from INT64 to REAL on data row 2163874, field contains 'U'
Bumping column 236 from REAL to STR on data row 2163874, field contains 'U'
Bumping column 237 from INT to INT64 on data row 2177888, field contains 'U'
Bumping column 237 from INT64 to REAL on data row 2177888, field contains 'U'
Bumping column 237 from REAL to STR on data row 2177888, field contains 'U'
Bumping column 280 from INT to INT64 on data row 2204113, field contains 'invl'
Bumping column 280 from INT64 to REAL on data row 2204113, field contains 'invl'
Bumping column 280 from REAL to STR on data row 2204113, field contains 'invl'
   0.000s (2994439%) Memory map (rerun may be quicker)
   0.000s (2994439%) Sep and header detection
   0.000s (2994439%) Count rows (wc -l)
   0.000s (2994439%) Colmn type detection (first, middle and last 5 rows)
   0.000s (2994439%) Allocation of 5x13 result (xMB) in RAM
  25.710s ( 66%) Reading data
197983.135s (510003%) Allocation for type bumps (if any), including gc
time if triggered
-197977.505s (-509988%) Coercing data already read in type bumps (if any)
-197977.505s (-509988%) Changing na.strings to NA
-197977.505s        Total
There were 50 or more warnings (use warnings() to see the first 50)


Warning messages:
1: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"),  ... :
  Bumped column 146 to type character on data row 9, field contains
'V5867'. Coercing previously read values in this column from integer
or numeric back to character which may not be lossless; e.g., if '00'
and '000' occurred before they will now be just '0', and there may be
inconsistencies with treatment of ',,' and ',NA,' too (if they
occurred in this column before the bump). If this matters please rerun
and set 'colClasses' to 'character' for this column. Please note that
column type detection uses the first 5 rows, the middle 5 rows and the
last 5 rows, so hopefully this message should be very rare. If
reporting to datatable-help, please rerun and include the output from
verbose=TRUE.
2: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"),  ... :
  Bumped column 147 to type character on data row 9, field contains
'V5869'. Coercing previously read values in this column from integer
or numeric back to character which may not be lossless; e.g., if '00'
and '000' occurred before they will now be just '0', and there may be
inconsistencies with treatment of ',,' and ',NA,' too (if they
occurred in this column before the bump). If this matters please rerun
and set 'colClasses' to 'character' for this column. Please note that
column type detection uses the first 5 rows, the middle 5 rows and the
last 5 rows, so hopefully this message should be very rare. If
reporting to datatable-help, please rerun and include the output from
verbose=TRUE.
3: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"),  ... :
  Bumped column 142 to type character on data row 10, field contains
'V140'. Coercing previously read values in this column from integer or
numeric back to character which may not be lossless; e.g., if '00' and
'000' occurred before they will now be just '0', and there may be
inconsistencies with treatment of ',,' and ',NA,' too (if they
occurred in this column before the bump). If this matters please rerun
and set 'colClasses' to 'character' for this column. Please note that
column type detection uses the first 5 rows, the middle 5 rows and the
last 5 rows, so hopefully this message should be very rare. If
reporting to datatable-help, please rerun and include the output from
verbose=TRUE.
[[clipped]]

-----------------------------------------------------
fread's guesses vs. column classes I know to be true:
-----------------------------------------------------

structure(list(DTguess = c("integer", "integer", "integer", "integer",
"integer", "integer", "integer", "character", "integer", "integer",
"integer", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "character", "character", "character",
"character", "character", "character", "character", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "character", "integer64", "integer", "integer", "character",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"numeric", "integer", "integer", "integer", "integer", "character",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"character", "integer", "integer", "character", "character",
"character", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "numeric", "integer", "character", "integer",
"integer", "character", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer"
), actual = c("integer", "integer", "integer", "integer", "integer",
"integer", "character", "character", "integer", "integer", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"integer", "integer", "integer", "integer", "character", "integer",
"character", "integer", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "character", "character", "character",
"character", "character", "character", "character", "character",
"integer", "integer", "integer", "integer", "integer", "character",
"integer", "character", "character", "integer", "integer", "character",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "character",
"integer", "integer", "integer", "character", "integer", "character",
"integer", "character", "integer", "integer", "integer", "integer",
"numeric", "integer", "integer", "integer", "integer", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "character", "character", "character",
"character", "character", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "character", "integer", "integer",
"character", "character", "character", "integer", "character",
"integer", "integer", "integer", "integer", "integer", "numeric",
"integer", "character", "integer", "character", "character",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer", "integer", "integer",
"integer", "integer", "integer", "integer")), .Names = c("DTguess",
"actual"), row.names = c("age", "ageday", "agemonth", "ahour",
"amonth", "asource", "asourceub92", "asource_x", "atype", "aweekend",
"billtype", "cpt1", "cpt2", "cpt3", "cpt4", "cpt5", "cpt6", "cpt7",
"cpt8", "cpt9", "cpt10", "cpt11", "cpt12", "cpt13", "cpt14",
"cpt15", "cpt16", "cpt17", "cpt18", "cpt19", "cpt20", "cpt21",
"cpt22", "cpt23", "cpt24", "cpt25", "cpt26", "cpt27", "cpt28",
"cpt29", "cpt30", "cptccs1", "cptccs2", "cptccs3", "cptccs4",
"cptccs5", "cptccs6", "cptccs7", "cptccs8", "cptccs9", "cptccs10",
"cptccs11", "cptccs12", "cptccs13", "cptccs14", "cptccs15", "cptccs16",
"cptccs17", "cptccs18", "cptccs19", "cptccs20", "cptccs21", "cptccs22",
"cptccs23", "cptccs24", "cptccs25", "cptccs26", "cptccs27", "cptccs28",
"cptccs29", "cptccs30", "cptm1_1", "cptm1_2", "cptm1_3", "cptm1_4",
"cptm1_5", "cptm1_6", "cptm1_7", "cptm1_8", "cptm1_9", "cptm1_10",
"cptm1_11", "cptm1_12", "cptm1_13", "cptm1_14", "cptm1_15", "cptm1_16",
"cptm1_17", "cptm1_18", "cptm1_19", "cptm1_20", "cptm1_21", "cptm1_22",
"cptm1_23", "cptm1_24", "cptm1_25", "cptm1_26", "cptm1_27", "cptm1_28",
"cptm1_29", "cptm1_30", "cptm2_1", "cptm2_2", "cptm2_3", "cptm2_4",
"cptm2_5", "cptm2_6", "cptm2_7", "cptm2_8", "cptm2_9", "cptm2_10",
"cptm2_11", "cptm2_12", "cptm2_13", "cptm2_14", "cptm2_15", "cptm2_16",
"cptm2_17", "cptm2_18", "cptm2_19", "cptm2_20", "cptm2_21", "cptm2_22",
"cptm2_23", "cptm2_24", "cptm2_25", "cptm2_26", "cptm2_27", "cptm2_28",
"cptm2_29", "cptm2_30", "dhour", "died", "dispub04", "dispuniform",
"disp_x", "dqtr", "dshospid", "duration", "dx1", "dx2", "dx3",
"dx4", "dx5", "dx6", "dx7", "dx8", "dx9", "dx10", "dx11", "dx12",
"dx13", "dx14", "dx15", "dx16", "dx17", "dx18", "dx19", "dx20",
"dx21", "dx22", "dx23", "dx24", "dxccs1", "dxccs2", "dxccs3",
"dxccs4", "dxccs5", "dxccs6", "dxccs7", "dxccs8", "dxccs9", "dxccs10",
"dxccs11", "dxccs12", "dxccs13", "dxccs14", "dxccs15", "dxccs16",
"dxccs17", "dxccs18", "dxccs19", "dxccs20", "dxccs21", "dxccs22",
"dxccs23", "dxccs24", "dxpoa1", "dxpoa2", "dxpoa3", "dxpoa4",
"dxpoa5", "dxpoa6", "dxpoa7", "dxpoa8", "dxpoa9", "dxpoa10",
"dxpoa11", "dxpoa12", "dxpoa13", "dxpoa14", "dxpoa15", "dxpoa16",
"dxpoa17", "dxpoa18", "dxpoa19", "dxpoa20", "dxpoa21", "dxpoa22",
"dxpoa23", "dxpoa24", "dx_visit_reason1", "dx_visit_reason2",
"dx_visit_reason3", "ecode1", "ecode2", "ecode3", "ecode4", "ecode5",
"ecode6", "ecode7", "ecode8", "e_ccs1", "e_ccs2", "e_ccs3", "e_ccs4",
"e_ccs5", "e_ccs6", "e_ccs7", "e_ccs8", "e_poa1", "e_poa2", "e_poa3",
"e_poa4", "e_poa5", "e_poa6", "e_poa7", "e_poa8", "female", "hcup_ed",
"hcup_os", "hcup_surgery_broad", "hcup_surgery_narrow", "hispanic_x",
"hospbrth", "hospst", "key", "los", "los_x", "maritalstatusub04",
"mdnum1_r", "mdnum2_r", "medincstq", "momnum_r", "mrn_r", "nchronic",
"ncpt", "ndx", "necode", "neomat", "npr", "opservice", "orproc",
"os_time", "pay1", "pay1_x", "pay2", "pay2_x", "pay3", "pay3_x",
"pl_cbsa", "pl_msa1993", "pl_nchs2006", "pl_ruca10_2005", "pl_ruca2005",
"pl_ruca4_2005", "pl_rucc2003", "pl_uic2003", "pl_ur_cat4", "pr1",
"pr2", "pr3", "pr4", "pr5", "pr6", "pr7", "pr8", "pr9", "pr10",
"pr11", "pr12", "pr13", "pr14", "pr15", "pr16", "pr17", "pr18",
"prccs1", "prccs2", "prccs3", "prccs4", "prccs5", "prccs6", "prccs7",
"prccs8", "prccs9", "prccs10", "prccs11", "prccs12", "prccs13",
"prccs14", "prccs15", "prccs16", "prccs17", "prccs18", "prday1",
"prday2", "prday3", "prday4", "prday5", "prday6", "prday7", "prday8",
"prday9", "prday10", "prday11", "prday12", "prday13", "prday14",
"prday15", "prday16", "prday17", "prday18", "proctype", "pstate",
"pstco", "pstco2", "pointoforiginub04", "pointoforigin_x", "primlang",
"race", "race_x", "readmit", "state_as", "state_ed", "state_os",
"totchg", "totchg_x", "year", "zip3", "zipinc_qrtl", "town",
"zip", "ayear", "dmonth", "bmonth", "byear", "prmonth1", "prmonth2",
"prmonth3", "prmonth4", "prmonth5", "prmonth6", "prmonth7", "prmonth8",
"prmonth9", "prmonth10", "prmonth11", "prmonth12", "prmonth13",
"prmonth14", "prmonth15", "prmonth16", "prmonth17", "prmonth18",
"pryear1", "pryear2", "pryear3", "pryear4", "pryear5", "pryear6",
"pryear7", "pryear8", "pryear9", "pryear10", "pryear11", "pryear12",
"pryear13", "pryear14", "pryear15", "pryear16", "pryear17", "pryear18"
), class = "data.frame")
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130912/27b79e91/attachment-0001.html>

From mdowle at mdowle.plus.com  Fri Sep 13 00:42:22 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Thu, 12 Sep 2013 23:42:22 +0100
Subject: [datatable-help] colClasses and fread
In-Reply-To: <CAAT1DuNBmEvJJOx71G2XB9Xz5zF+xTXpf256B96s=M3kq7u+gw@mail.gmail.com>
References: <CAAT1DuNBmEvJJOx71G2XB9Xz5zF+xTXpf256B96s=M3kq7u+gw@mail.gmail.com>
Message-ID: <5232434E.3050608@mdowle.plus.com>


Is that v1.8.10 as on CRAN?   It doesn't look like it from a few clues 
in the output below.
v1.8.10 has colClasses working, see NEWS.

On 12/09/13 22:32, Ari Friedman wrote:
> Dear maintainers of that most wonderful package that makes R fast with
> big data,
>
> I've recently discovered fread.  It's amazing.  My call to read.fwf on a
> 4GB file that took all night now takes under a minute after conversion
> to csv via csvkit/in2csv.
>
> However, automatic type detection is working very poorly, probably due
> to the presence of a large number of columns with high rates of
> missingness, plus a large number of character columns with encoded
> values (these are medical and diagnostic codes).
>
> Normally I'd specify colClasses, and the warning messages even tell me I
> should specify colClasses, but there's no colClasses argument to fread.
>
> Any thoughts on solving this?  Verbose output, warnings, and a
> comparison of the guesses vs. what the documentation on the file says it
> is are found below.  Unfortunately the data can't be shared, even in
> small portions so I can't make this reproducible.
>
> Thanks!
> Ari
> > dt <- fread('myfile.csv', verbose=TRUE)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ... ','
> Found 393 columns
> First row with 393 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 2994440
> Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows
> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows)
> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows)
> Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows)
> 0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867'
> Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867'
> Bumping column 146 from REAL to STR on data row 9, field contains 'V5867'
> Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869'
> Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869'
> Bumping column 147 from REAL to STR on data row 9, field contains 'V5869'
> Bumping column 142 from INT to INT64 on data row 10, field contains 'V140'
> Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140'
> Bumping column 142 from REAL to STR on data row 10, field contains 'V140'
> Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885'
> Bumping column 17 from INT64 to REAL on data row 12, field contains 'J1885'
> Bumping column 17 from REAL to STR on data row 12, field contains 'J1885'
> Bumping column 74 from INT to INT64 on data row 12, field contains 'LT'
> Bumping column 74 from INT64 to REAL on data row 12, field contains 'LT'
> Bumping column 74 from REAL to STR on data row 12, field contains 'LT'
> Bumping column 143 from INT to INT64 on data row 13, field contains 'V142'
> Bumping column 143 from INT64 to REAL on data row 13, field contains 'V142'
> Bumping column 143 from REAL to STR on data row 13, field contains 'V142'
> Bumping column 14 from INT to INT64 on data row 22, field contains 'G0431'
> Bumping column 14 from INT64 to REAL on data row 22, field contains 'G0431'
> Bumping column 14 from REAL to STR on data row 22, field contains 'G0431'
> Bumping column 21 from INT to INT64 on data row 23, field contains 'J7060'
> Bumping column 21 from INT64 to REAL on data row 23, field contains 'J7060'
> Bumping column 21 from REAL to STR on data row 23, field contains 'J7060'
> Bumping column 24 from INT to INT64 on data row 27, field contains 'J2405'
> Bumping column 24 from INT64 to REAL on data row 27, field contains 'J2405'
> Bumping column 24 from REAL to STR on data row 27, field contains 'J2405'
> Bumping column 72 from INT to INT64 on data row 35, field contains 'F1'
> Bumping column 72 from INT64 to REAL on data row 35, field contains 'F1'
> Bumping column 72 from REAL to STR on data row 35, field contains 'F1'
> Bumping column 141 from INT to INT64 on data row 35, field contains 'V061'
> Bumping column 141 from INT64 to REAL on data row 35, field contains 'V061'
> Bumping column 141 from REAL to STR on data row 35, field contains 'V061'
> Bumping column 26 from INT to INT64 on data row 37, field contains 'J0690'
> Bumping column 26 from INT64 to REAL on data row 37, field contains 'J0690'
> Bumping column 26 from REAL to STR on data row 37, field contains 'J0690'
> Bumping column 28 from INT to INT64 on data row 37, field contains 'J7030'
> Bumping column 28 from INT64 to REAL on data row 37, field contains 'J7030'
> Bumping column 28 from REAL to STR on data row 37, field contains 'J7030'
> Bumping column 29 from INT to INT64 on data row 37, field contains 'J7040'
> Bumping column 29 from INT64 to REAL on data row 37, field contains 'J7040'
> Bumping column 29 from REAL to STR on data row 37, field contains 'J7040'
> Bumping column 25 from INT to INT64 on data row 43, field contains 'Q9967'
> Bumping column 25 from INT64 to REAL on data row 43, field contains 'Q9967'
> Bumping column 25 from REAL to STR on data row 43, field contains 'Q9967'
> Bumping column 30 from INT to INT64 on data row 43, field contains 'J7030'
> Bumping column 30 from INT64 to REAL on data row 43, field contains 'J7030'
> Bumping column 30 from REAL to STR on data row 43, field contains 'J7030'
> Bumping column 31 from INT to INT64 on data row 43, field contains 'J2405'
> Bumping column 31 from INT64 to REAL on data row 43, field contains 'J2405'
> Bumping column 31 from REAL to STR on data row 43, field contains 'J2405'
> Bumping column 148 from INT to INT64 on data row 44, field contains 'V1551'
> Bumping column 148 from INT64 to REAL on data row 44, field contains 'V1551'
> Bumping column 148 from REAL to STR on data row 44, field contains 'V1551'
> Bumping column 149 from INT to INT64 on data row 44, field contains 'V1588'
> Bumping column 149 from INT64 to REAL on data row 44, field contains 'V1588'
> Bumping column 149 from REAL to STR on data row 44, field contains 'V1588'
> Bumping column 76 from INT to INT64 on data row 45, field contains 'RT'
> Bumping column 76 from INT64 to REAL on data row 45, field contains 'RT'
> Bumping column 76 from REAL to STR on data row 45, field contains 'RT'
> Bumping column 27 from INT to INT64 on data row 53, field contains 'J2405'
> Bumping column 27 from INT64 to REAL on data row 53, field contains 'J2405'
> Bumping column 27 from REAL to STR on data row 53, field contains 'J2405'
> Bumping column 32 from INT to INT64 on data row 56, field contains 'J1885'
> Bumping column 32 from INT64 to REAL on data row 56, field contains 'J1885'
> Bumping column 32 from REAL to STR on data row 56, field contains 'J1885'
> Bumping column 33 from INT to INT64 on data row 56, field contains 'J2270'
> Bumping column 33 from INT64 to REAL on data row 56, field contains 'J2270'
> Bumping column 33 from REAL to STR on data row 56, field contains 'J2270'
> Bumping column 34 from INT to INT64 on data row 56, field contains 'J2405'
> Bumping column 34 from INT64 to REAL on data row 56, field contains 'J2405'
> Bumping column 34 from REAL to STR on data row 56, field contains 'J2405'
> Bumping column 77 from INT to INT64 on data row 65, field contains 'LT'
> Bumping column 77 from INT64 to REAL on data row 65, field contains 'LT'
> Bumping column 77 from REAL to STR on data row 65, field contains 'LT'
> Bumping column 140 from INT to INT64 on data row 74, field contains 'V689'
> Bumping column 140 from INT64 to REAL on data row 74, field contains 'V689'
> Bumping column 140 from REAL to STR on data row 74, field contains 'V689'
> Bumping column 13 from INT to INT64 on data row 103, field contains 'J1100'
> Bumping column 13 from INT64 to REAL on data row 103, field contains 'J1100'
> Bumping column 13 from REAL to STR on data row 103, field contains 'J1100'
> Bumping column 150 from INT to INT64 on data row 104, field contains 'V1508'
> Bumping column 150 from INT64 to REAL on data row 104, field contains 'V1508'
> Bumping column 150 from REAL to STR on data row 104, field contains 'V1508'
> Bumping column 212 from INT to INT64 on data row 107, field contains 'V714'
> Bumping column 212 from INT64 to REAL on data row 107, field contains 'V714'
> Bumping column 212 from REAL to STR on data row 107, field contains 'V714'
> Bumping column 12 from INT to INT64 on data row 113, field contains 'A0427'
> Bumping column 12 from INT64 to REAL on data row 113, field contains 'A0427'
> Bumping column 12 from REAL to STR on data row 113, field contains 'A0427'
> Bumping column 81 from INT to INT64 on data row 113, field contains 'RH'
> Bumping column 81 from INT64 to REAL on data row 113, field contains 'RH'
> Bumping column 81 from REAL to STR on data row 113, field contains 'RH'
> Bumping column 102 from INT to INT64 on data row 113, field contains 'QM'
> Bumping column 102 from INT64 to REAL on data row 113, field contains 'QM'
> Bumping column 102 from REAL to STR on data row 113, field contains 'QM'
> Bumping column 111 from INT to INT64 on data row 113, field contains 'QM'
> Bumping column 111 from INT64 to REAL on data row 113, field contains 'QM'
> Bumping column 111 from REAL to STR on data row 113, field contains 'QM'
> Bumping column 151 from INT to INT64 on data row 294, field contains 'V146'
> Bumping column 151 from INT64 to REAL on data row 294, field contains 'V146'
> Bumping column 151 from REAL to STR on data row 294, field contains 'V146'
> Bumping column 152 from INT to INT64 on data row 294, field contains 'V148'
> Bumping column 152 from INT64 to REAL on data row 294, field contains 'V148'
> Bumping column 152 from REAL to STR on data row 294, field contains 'V148'
> Bumping column 84 from INT to INT64 on data row 346, field contains 'RH'
> Bumping column 84 from INT64 to REAL on data row 346, field contains 'RH'
> Bumping column 84 from REAL to STR on data row 346, field contains 'RH'
> Bumping column 114 from INT to INT64 on data row 346, field contains 'QM'
> Bumping column 114 from INT64 to REAL on data row 346, field contains 'QM'
> Bumping column 114 from REAL to STR on data row 346, field contains 'QM'
> Bumping column 36 from INT to INT64 on data row 348, field contains 'J1644'
> Bumping column 36 from INT64 to REAL on data row 348, field contains 'J1644'
> Bumping column 36 from REAL to STR on data row 348, field contains 'J1644'
> Bumping column 37 from INT to INT64 on data row 348, field contains 'J7030'
> Bumping column 37 from INT64 to REAL on data row 348, field contains 'J7030'
> Bumping column 37 from REAL to STR on data row 348, field contains 'J7030'
> Bumping column 38 from INT to INT64 on data row 348, field contains 'J2405'
> Bumping column 38 from INT64 to REAL on data row 348, field contains 'J2405'
> Bumping column 38 from REAL to STR on data row 348, field contains 'J2405'
> Bumping column 39 from INT to INT64 on data row 349, field contains 'J2405'
> Bumping column 39 from INT64 to REAL on data row 349, field contains 'J2405'
> Bumping column 39 from REAL to STR on data row 349, field contains 'J2405'
> Bumping column 103 from INT to INT64 on data row 702, field contains 'QM'
> Bumping column 103 from INT64 to REAL on data row 702, field contains 'QM'
> Bumping column 103 from REAL to STR on data row 702, field contains 'QM'
> Bumping column 104 from INT to INT64 on data row 702, field contains 'QM'
> Bumping column 104 from INT64 to REAL on data row 702, field contains 'QM'
> Bumping column 104 from REAL to STR on data row 702, field contains 'QM'
> Bumping column 153 from INT to INT64 on data row 815, field contains 'V4561'
> Bumping column 153 from INT64 to REAL on data row 815, field contains 'V4561'
> Bumping column 153 from REAL to STR on data row 815, field contains 'V4561'
> Bumping column 78 from INT to INT64 on data row 891, field contains 'RT'
> Bumping column 78 from INT64 to REAL on data row 891, field contains 'RT'
> Bumping column 78 from REAL to STR on data row 891, field contains 'RT'
> Bumping column 79 from INT to INT64 on data row 891, field contains 'LT'
> Bumping column 79 from INT64 to REAL on data row 891, field contains 'LT'
> Bumping column 79 from REAL to STR on data row 891, field contains 'LT'
> Bumping column 80 from INT to INT64 on data row 891, field contains 'LT'
> Bumping column 80 from INT64 to REAL on data row 891, field contains 'LT'
> Bumping column 80 from REAL to STR on data row 891, field contains 'LT'
> Bumping column 35 from INT to INT64 on data row 892, field contains 'J2270'
> Bumping column 35 from INT64 to REAL on data row 892, field contains 'J2270'
> Bumping column 35 from REAL to STR on data row 892, field contains 'J2270'
> Bumping column 82 from INT to INT64 on data row 931, field contains 'RH'
> Bumping column 82 from INT64 to REAL on data row 931, field contains 'RH'
> Bumping column 82 from REAL to STR on data row 931, field contains 'RH'
> Bumping column 112 from INT to INT64 on data row 931, field contains 'QM'
> Bumping column 112 from INT64 to REAL on data row 931, field contains 'QM'
> Bumping column 112 from REAL to STR on data row 931, field contains 'QM'
> Bumping column 154 from INT to INT64 on data row 1151, field contains 'V4582'
> Bumping column 154 from INT64 to REAL on data row 1151, field contains 'V4582'
> Bumping column 154 from REAL to STR on data row 1151, field contains 'V4582'
> Bumping column 107 from INT to INT64 on data row 1268, field contains 'QM'
> Bumping column 107 from INT64 to REAL on data row 1268, field contains 'QM'
> Bumping column 107 from REAL to STR on data row 1268, field contains 'QM'
> Bumping column 40 from INT to INT64 on data row 1414, field contains 'J2270'
> Bumping column 40 from INT64 to REAL on data row 1414, field contains 'J2270'
> Bumping column 40 from REAL to STR on data row 1414, field contains 'J2270'
> Bumping column 41 from INT to INT64 on data row 1414, field contains 'J7040'
> Bumping column 41 from INT64 to REAL on data row 1414, field contains 'J7040'
> Bumping column 41 from REAL to STR on data row 1414, field contains 'J7040'
> Bumping column 155 from INT to INT64 on data row 1417, field contains 'V8741'
> Bumping column 155 from INT64 to REAL on data row 1417, field contains 'V8741'
> Bumping column 155 from REAL to STR on data row 1417, field contains 'V8741'
> Bumping column 156 from INT to INT64 on data row 1417, field contains 'V1504'
> Bumping column 156 from INT64 to REAL on data row 1417, field contains 'V1504'
> Bumping column 156 from REAL to STR on data row 1417, field contains 'V1504'
> Bumping column 157 from INT to INT64 on data row 1417, field contains 'V2651'
> Bumping column 157 from INT64 to REAL on data row 1417, field contains 'V2651'
> Bumping column 157 from REAL to STR on data row 1417, field contains 'V2651'
> Bumping column 83 from INT to INT64 on data row 1629, field contains 'GP'
> Bumping column 83 from INT64 to REAL on data row 1629, field contains 'GP'
> Bumping column 83 from REAL to STR on data row 1629, field contains 'GP'
> Bumping column 105 from INT to INT64 on data row 1688, field contains 'QM'
> Bumping column 105 from INT64 to REAL on data row 1688, field contains 'QM'
> Bumping column 105 from REAL to STR on data row 1688, field contains 'QM'
> Bumping column 110 from INT to INT64 on data row 1999, field contains 'QM'
> Bumping column 110 from INT64 to REAL on data row 1999, field contains 'QM'
> Bumping column 110 from REAL to STR on data row 1999, field contains 'QM'
> Bumping column 106 from INT to INT64 on data row 2019, field contains 'QM'
> Bumping column 106 from INT64 to REAL on data row 2019, field contains 'QM'
> Bumping column 106 from REAL to STR on data row 2019, field contains 'QM'
> Bumping column 85 from INT to INT64 on data row 2341, field contains 'SH'
> Bumping column 85 from INT64 to REAL on data row 2341, field contains 'SH'
> Bumping column 85 from REAL to STR on data row 2341, field contains 'SH'
> Bumping column 115 from INT to INT64 on data row 2341, field contains 'QN'
> Bumping column 115 from INT64 to REAL on data row 2341, field contains 'QN'
> Bumping column 115 from REAL to STR on data row 2341, field contains 'QN'
> Bumping column 350 from INT to INT64 on data row 2791, field contains 'C'
> Bumping column 350 from INT64 to REAL on data row 2791, field contains 'C'
> Bumping column 350 from REAL to STR on data row 2791, field contains 'C'
> Bumping column 353 from INT to INT64 on data row 2791, field contains 'C'
> Bumping column 353 from INT64 to REAL on data row 2791, field contains 'C'
> Bumping column 353 from REAL to STR on data row 2791, field contains 'C'
> Bumping column 108 from INT to INT64 on data row 2898, field contains 'QM'
> Bumping column 108 from INT64 to REAL on data row 2898, field contains 'QM'
> Bumping column 108 from REAL to STR on data row 2898, field contains 'QM'
> Bumping column 158 from INT to INT64 on data row 3011, field contains 'V441'
> Bumping column 158 from INT64 to REAL on data row 3011, field contains 'V441'
> Bumping column 158 from REAL to STR on data row 3011, field contains 'V441'
> Bumping column 159 from INT to INT64 on data row 3011, field contains 'V1582'
> Bumping column 159 from INT64 to REAL on data row 3011, field contains 'V1582'
> Bumping column 159 from REAL to STR on data row 3011, field contains 'V1582'
> Bumping column 160 from INT to INT64 on data row 3011, field contains 'V5861'
> Bumping column 160 from INT64 to REAL on data row 3011, field contains 'V5861'
> Bumping column 160 from REAL to STR on data row 3011, field contains 'V5861'
> Bumping column 86 from INT to INT64 on data row 3021, field contains 'RH'
> Bumping column 86 from INT64 to REAL on data row 3021, field contains 'RH'
> Bumping column 86 from REAL to STR on data row 3021, field contains 'RH'
> Bumping column 116 from INT to INT64 on data row 3021, field contains 'QM'
> Bumping column 116 from INT64 to REAL on data row 3021, field contains 'QM'
> Bumping column 116 from REAL to STR on data row 3021, field contains 'QM'
> Bumping column 109 from INT to INT64 on data row 3112, field contains 'QM'
> Bumping column 109 from INT64 to REAL on data row 3112, field contains 'QM'
> Bumping column 109 from REAL to STR on data row 3112, field contains 'QM'
> Bumping column 113 from INT to INT64 on data row 5208, field contains 'QM'
> Bumping column 113 from INT64 to REAL on data row 5208, field contains 'QM'
> Bumping column 113 from REAL to STR on data row 5208, field contains 'QM'
> Bumping column 188 from INT to INT64 on data row 8138, field contains 'Y'
> Bumping column 188 from INT64 to REAL on data row 8138, field contains 'Y'
> Bumping column 188 from REAL to STR on data row 8138, field contains 'Y'
> Bumping column 189 from INT to INT64 on data row 8138, field contains 'Y'
> Bumping column 189 from INT64 to REAL on data row 8138, field contains 'Y'
> Bumping column 189 from REAL to STR on data row 8138, field contains 'Y'
> Bumping column 190 from INT to INT64 on data row 8138, field contains 'Y'
> Bumping column 190 from INT64 to REAL on data row 8138, field contains 'Y'
> Bumping column 190 from REAL to STR on data row 8138, field contains 'Y'
> 0%Bumping column 161 from INT to INT64 on data row 13758, field contains 'V1582'
> Bumping column 161 from INT64 to REAL on data row 13758, field contains 'V1582'
> Bumping column 161 from REAL to STR on data row 13758, field contains 'V1582'
> Bumping column 231 from INT to INT64 on data row 18303, field contains 'Y'
> Bumping column 231 from INT64 to REAL on data row 18303, field contains 'Y'
> Bumping column 231 from REAL to STR on data row 18303, field contains 'Y'
> Bumping column 87 from INT to INT64 on data row 20592, field contains 'GO'
> Bumping column 87 from INT64 to REAL on data row 20592, field contains 'GO'
> Bumping column 87 from REAL to STR on data row 20592, field contains 'GO'
> Bumping column 192 from INT to INT64 on data row 29413, field contains 'Y'
> Bumping column 192 from INT64 to REAL on data row 29413, field contains 'Y'
> Bumping column 192 from REAL to STR on data row 29413, field contains 'Y'
> Bumping column 193 from INT to INT64 on data row 29413, field contains 'Y'
> Bumping column 193 from INT64 to REAL on data row 29413, field contains 'Y'
> Bumping column 193 from REAL to STR on data row 29413, field contains 'Y'
> Bumping column 194 from INT to INT64 on data row 29413, field contains 'Y'
> Bumping column 194 from INT64 to REAL on data row 29413, field contains 'Y'
> Bumping column 194 from REAL to STR on data row 29413, field contains 'Y'
> Bumping column 96 from INT to INT64 on data row 31954, field contains 'LT'
> Bumping column 96 from INT64 to REAL on data row 31954, field contains 'LT'
> Bumping column 96 from REAL to STR on data row 31954, field contains 'LT'
> Bumping column 191 from INT to INT64 on data row 41091, field contains 'Y'
> Bumping column 191 from INT64 to REAL on data row 41091, field contains 'Y'
> Bumping column 191 from REAL to STR on data row 41091, field contains 'Y'
> Bumping column 162 from INT to INT64 on data row 44469, field contains 'V1582'
> Bumping column 162 from INT64 to REAL on data row 44469, field contains 'V1582'
> Bumping column 162 from REAL to STR on data row 44469, field contains 'V1582'
> Bumping column 163 from INT to INT64 on data row 49003, field contains 'V5865'
> Bumping column 163 from INT64 to REAL on data row 49003, field contains 'V5865'
> Bumping column 163 from REAL to STR on data row 49003, field contains 'V5865'
> Bumping column 90 from INT to INT64 on data row 87095, field contains 'EH'
> Bumping column 90 from INT64 to REAL on data row 87095, field contains 'EH'
> Bumping column 90 from REAL to STR on data row 87095, field contains 'EH'
> Bumping column 120 from INT to INT64 on data row 87095, field contains 'QM'
> Bumping column 120 from INT64 to REAL on data row 87095, field contains 'QM'
> Bumping column 120 from REAL to STR on data row 87095, field contains 'QM'
> Bumping column 213 from INT to INT64 on data row 91672, field contains 'V692'
> Bumping column 213 from INT64 to REAL on data row 91672, field contains 'V692'
> Bumping column 213 from REAL to STR on data row 91672, field contains 'V692'
> Bumping column 338 from INT to INT64 on data row 92112, field contains 'D'
> Bumping column 338 from INT64 to REAL on data row 92112, field contains 'D'
> Bumping column 338 from REAL to STR on data row 92112, field contains 'D'
> Bumping column 339 from INT to INT64 on data row 92112, field contains 'D'
> Bumping column 339 from INT64 to REAL on data row 92112, field contains 'D'
> Bumping column 339 from REAL to STR on data row 92112, field contains 'D'
> Bumping column 214 from INT to INT64 on data row 92181, field contains 'V681'
> Bumping column 214 from INT64 to REAL on data row 92181, field contains 'V681'
> Bumping column 214 from REAL to STR on data row 92181, field contains 'V681'
> Bumping column 91 from INT to INT64 on data row 95380, field contains 'GP'
> Bumping column 91 from INT64 to REAL on data row 95380, field contains 'GP'
> Bumping column 91 from REAL to STR on data row 95380, field contains 'GP'
> Bumping column 216 from INT to INT64 on data row 109576, field contains 'E8499'
> Bumping column 216 from INT64 to REAL on data row 109576, field contains 'E8499'
> Bumping column 216 from REAL to STR on data row 109576, field contains 'E8499'
> 4%Bumping column 98 from INT to INT64 on data row 115301, field contains 'GP'
> Bumping column 98 from INT64 to REAL on data row 115301, field contains 'GP'
> Bumping column 98 from REAL to STR on data row 115301, field contains 'GP'
> Bumping column 117 from INT to INT64 on data row 188433, field contains 'QM'
> Bumping column 117 from INT64 to REAL on data row 188433, field contains 'QM'
> Bumping column 117 from REAL to STR on data row 188433, field contains 'QM'
> Bumping column 93 from INT to INT64 on data row 188671, field contains 'LT'
> Bumping column 93 from INT64 to REAL on data row 188671, field contains 'LT'
> Bumping column 93 from REAL to STR on data row 188671, field contains 'LT'
> Bumping column 92 from INT to INT64 on data row 188909, field contains 'RH'
> Bumping column 92 from INT64 to REAL on data row 188909, field contains 'RH'
> Bumping column 92 from REAL to STR on data row 188909, field contains 'RH'
> Bumping column 122 from INT to INT64 on data row 188909, field contains 'QM'
> Bumping column 122 from INT64 to REAL on data row 188909, field contains 'QM'
> Bumping column 122 from REAL to STR on data row 188909, field contains 'QM'
> Bumping column 121 from INT to INT64 on data row 189176, field contains 'QM'
> Bumping column 121 from INT64 to REAL on data row 189176, field contains 'QM'
> Bumping column 121 from REAL to STR on data row 189176, field contains 'QM'
> Bumping column 195 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 195 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 195 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 196 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 196 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 196 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 197 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 197 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 197 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 198 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 198 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 198 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 199 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 199 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 199 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 200 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 200 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 200 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 201 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 201 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 201 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 202 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 202 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 202 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 203 from INT to INT64 on data row 189548, field contains 'Y'
> Bumping column 203 from INT64 to REAL on data row 189548, field contains 'Y'
> Bumping column 203 from REAL to STR on data row 189548, field contains 'Y'
> Bumping column 232 from INT to INT64 on data row 189586, field contains 'U'
> Bumping column 232 from INT64 to REAL on data row 189586, field contains 'U'
> Bumping column 232 from REAL to STR on data row 189586, field contains 'U'
> Bumping column 123 from INT to INT64 on data row 190895, field contains 'QM'
> Bumping column 123 from INT64 to REAL on data row 190895, field contains 'QM'
> Bumping column 123 from REAL to STR on data row 190895, field contains 'QM'
> Bumping column 97 from INT to INT64 on data row 191623, field contains 'NH'
> Bumping column 97 from INT64 to REAL on data row 191623, field contains 'NH'
> Bumping column 97 from REAL to STR on data row 191623, field contains 'NH'
> Bumping column 127 from INT to INT64 on data row 191623, field contains 'QM'
> Bumping column 127 from INT64 to REAL on data row 191623, field contains 'QM'
> Bumping column 127 from REAL to STR on data row 191623, field contains 'QM'
> Bumping column 88 from INT to INT64 on data row 191828, field contains 'RH'
> Bumping column 88 from INT64 to REAL on data row 191828, field contains 'RH'
> Bumping column 88 from REAL to STR on data row 191828, field contains 'RH'
> Bumping column 118 from INT to INT64 on data row 191828, field contains 'QM'
> Bumping column 118 from INT64 to REAL on data row 191828, field contains 'QM'
> Bumping column 118 from REAL to STR on data row 191828, field contains 'QM'
> Bumping column 89 from INT to INT64 on data row 191925, field contains 'RH'
> Bumping column 89 from INT64 to REAL on data row 191925, field contains 'RH'
> Bumping column 89 from REAL to STR on data row 191925, field contains 'RH'
> Bumping column 119 from INT to INT64 on data row 191925, field contains 'QM'
> Bumping column 119 from INT64 to REAL on data row 191925, field contains 'QM'
> Bumping column 119 from REAL to STR on data row 191925, field contains 'QM'
> Bumping column 94 from INT to INT64 on data row 196090, field contains 'RH'
> Bumping column 94 from INT64 to REAL on data row 196090, field contains 'RH'
> Bumping column 94 from REAL to STR on data row 196090, field contains 'RH'
> Bumping column 124 from INT to INT64 on data row 196090, field contains 'QM'
> Bumping column 124 from INT64 to REAL on data row 196090, field contains 'QM'
> Bumping column 124 from REAL to STR on data row 196090, field contains 'QM'
> Bumping column 217 from INT to INT64 on data row 196596, field contains 'E9208'
> Bumping column 217 from INT64 to REAL on data row 196596, field contains 'E9208'
> Bumping column 217 from REAL to STR on data row 196596, field contains 'E9208'
> Bumping column 126 from INT to INT64 on data row 197965, field contains 'QM'
> Bumping column 126 from INT64 to REAL on data row 197965, field contains 'QM'
> Bumping column 126 from REAL to STR on data row 197965, field contains 'QM'
> Bumping column 95 from INT to INT64 on data row 208608, field contains 'LT'
> Bumping column 95 from INT64 to REAL on data row 208608, field contains 'LT'
> Bumping column 95 from REAL to STR on data row 208608, field contains 'LT'
> Bumping column 218 from INT to INT64 on data row 216015, field contains 'E0008'
> Bumping column 218 from INT64 to REAL on data row 216015, field contains 'E0008'
> Bumping column 218 from REAL to STR on data row 216015, field contains 'E0008'
> Bumping column 219 from INT to INT64 on data row 224785, field contains 'E030'
> Bumping column 219 from INT64 to REAL on data row 224785, field contains 'E030'
> Bumping column 219 from REAL to STR on data row 224785, field contains 'E030'
> 8%Bumping column 220 from INT to INT64 on data row 233544, field contains 'E8499'
> Bumping column 220 from INT64 to REAL on data row 233544, field contains 'E8499'
> Bumping column 220 from REAL to STR on data row 233544, field contains 'E8499'
> Bumping column 221 from INT to INT64 on data row 233544, field contains 'E0008'
> Bumping column 221 from INT64 to REAL on data row 233544, field contains 'E0008'
> Bumping column 221 from REAL to STR on data row 233544, field contains 'E0008'
> Bumping column 100 from INT to INT64 on data row 253181, field contains 'GP'
> Bumping column 100 from INT64 to REAL on data row 253181, field contains 'GP'
> Bumping column 100 from REAL to STR on data row 253181, field contains 'GP'
> Bumping column 99 from INT to INT64 on data row 330461, field contains 'GO'
> Bumping column 99 from INT64 to REAL on data row 330461, field contains 'GO'
> Bumping column 99 from REAL to STR on data row 330461, field contains 'GO'
> 12%Bumping column 128 from INT to INT64 on data row 419322, field contains 'QN'
> Bumping column 128 from INT64 to REAL on data row 419322, field contains 'QN'
> Bumping column 128 from REAL to STR on data row 419322, field contains 'QN'
> Bumping column 130 from INT to INT64 on data row 420977, field contains 'QN'
> Bumping column 130 from INT64 to REAL on data row 420977, field contains 'QN'
> Bumping column 130 from REAL to STR on data row 420977, field contains 'QN'
> Bumping column 125 from INT to INT64 on data row 426618, field contains 'QN'
> Bumping column 125 from INT64 to REAL on data row 426618, field contains 'QN'
> Bumping column 125 from REAL to STR on data row 426618, field contains 'QN'
> Bumping column 101 from INT to INT64 on data row 446983, field contains 'HN'
> Bumping column 101 from INT64 to REAL on data row 446983, field contains 'HN'
> Bumping column 101 from REAL to STR on data row 446983, field contains 'HN'
> Bumping column 131 from INT to INT64 on data row 446983, field contains 'QN'
> Bumping column 131 from INT64 to REAL on data row 446983, field contains 'QN'
> Bumping column 131 from REAL to STR on data row 446983, field contains 'QN'
> Bumping column 129 from INT to INT64 on data row 448799, field contains 'QN'
> Bumping column 129 from INT64 to REAL on data row 448799, field contains 'QN'
> Bumping column 129 from REAL to STR on data row 448799, field contains 'QN'
> Bumping column 233 from INT to INT64 on data row 455718, field contains 'Y'
> Bumping column 233 from INT64 to REAL on data row 455718, field contains 'Y'
> Bumping column 233 from REAL to STR on data row 455718, field contains 'Y'
> Bumping column 234 from INT to INT64 on data row 458104, field contains 'Y'
> Bumping column 234 from INT64 to REAL on data row 458104, field contains 'Y'
> Bumping column 234 from REAL to STR on data row 458104, field contains 'Y'
> Bumping column 235 from INT to INT64 on data row 458104, field contains 'Y'
> Bumping column 235 from INT64 to REAL on data row 458104, field contains 'Y'
> Bumping column 235 from REAL to STR on data row 458104, field contains 'Y'
> 16%Bumping column 204 from INT to INT64 on data row 535636, field contains 'U'
> Bumping column 204 from INT64 to REAL on data row 535636, field contains 'U'
> Bumping column 204 from REAL to STR on data row 535636, field contains 'U'
> Bumping column 205 from INT to INT64 on data row 544450, field contains 'U'
> Bumping column 205 from INT64 to REAL on data row 544450, field contains 'U'
> Bumping column 205 from REAL to STR on data row 544450, field contains 'U'
> Bumping column 206 from INT to INT64 on data row 563578, field contains 'U'
> Bumping column 206 from INT64 to REAL on data row 563578, field contains 'U'
> Bumping column 206 from REAL to STR on data row 563578, field contains 'U'
> Bumping column 207 from INT to INT64 on data row 563578, field contains 'U'
> Bumping column 207 from INT64 to REAL on data row 563578, field contains 'U'
> Bumping column 207 from REAL to STR on data row 563578, field contains 'U'
> Bumping column 208 from INT to INT64 on data row 570116, field contains 'U'
> Bumping column 208 from INT64 to REAL on data row 570116, field contains 'U'
> Bumping column 208 from REAL to STR on data row 570116, field contains 'U'
> Bumping column 209 from INT to INT64 on data row 570116, field contains 'U'
> Bumping column 209 from INT64 to REAL on data row 570116, field contains 'U'
> Bumping column 209 from REAL to STR on data row 570116, field contains 'U'
> 24%Bumping column 8 from INT to INT64 on data row 768577, field contains 'F'
> Bumping column 8 from INT64 to REAL on data row 768577, field contains 'F'
> Bumping column 8 from REAL to STR on data row 768577, field contains 'F'
> 28%Bumping column 210 from INT to INT64 on data row 948003, field contains 'U'
> Bumping column 210 from INT64 to REAL on data row 948003, field contains 'U'
> Bumping column 210 from REAL to STR on data row 948003, field contains 'U'
> Bumping column 211 from INT to INT64 on data row 948003, field contains 'U'
> Bumping column 211 from INT64 to REAL on data row 948003, field contains 'U'
> Bumping column 211 from REAL to STR on data row 948003, field contains 'U'
> 48%Bumping column 222 from INT to INT64 on data row 1567231, field contains 'E0009'
> Bumping column 222 from INT64 to REAL on data row 1567231, field contains 'E0009'
> Bumping column 222 from REAL to STR on data row 1567231, field contains 'E0009'
> 71%Bumping column 236 from INT to INT64 on data row 2163874, field contains 'U'
> Bumping column 236 from INT64 to REAL on data row 2163874, field contains 'U'
> Bumping column 236 from REAL to STR on data row 2163874, field contains 'U'
> Bumping column 237 from INT to INT64 on data row 2177888, field contains 'U'
> Bumping column 237 from INT64 to REAL on data row 2177888, field contains 'U'
> Bumping column 237 from REAL to STR on data row 2177888, field contains 'U'
> Bumping column 280 from INT to INT64 on data row 2204113, field contains 'invl'
> Bumping column 280 from INT64 to REAL on data row 2204113, field contains 'invl'
> Bumping column 280 from REAL to STR on data row 2204113, field contains 'invl'
>     0.000s (2994439%) Memory map (rerun may be quicker)
>     0.000s (2994439%) Sep and header detection
>     0.000s (2994439%) Count rows (wc -l)
>     0.000s (2994439%) Colmn type detection (first, middle and last 5 rows)
>     0.000s (2994439%) Allocation of 5x13 result (xMB) in RAM
>    25.710s ( 66%) Reading data
> 197983.135s (510003%) Allocation for type bumps (if any), including gc time if triggered
> -197977.505s (-509988%) Coercing data already read in type bumps (if any)
> -197977.505s (-509988%) Changing na.strings to NA
> -197977.505s        Total
> There were 50 or more warnings (use warnings() to see the first 50)
>
>
>
> Warning messages:
> 1: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"),  ... :
>    Bumped column 146 to type character on data row 9, field contains 'V5867'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
> 2: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"),  ... :
>    Bumped column 147 to type character on data row 9, field contains 'V5869'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
> 3: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"),  ... :
>    Bumped column 142 to type character on data row 10, field contains 'V140'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
> [[clipped]]
>
> -----------------------------------------------------
> fread's guesses vs. column classes I know to be true:
> -----------------------------------------------------
>
> structure(list(DTguess = c("integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "character", "integer", "integer",
> "integer", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "character", "character", "character",
> "character", "character", "character", "character", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "character", "integer64", "integer", "integer", "character",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "numeric", "integer", "integer", "integer", "integer", "character",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "character", "integer", "integer", "character", "character",
> "character", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "numeric", "integer", "character", "integer",
> "integer", "character", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer"
> ), actual = c("integer", "integer", "integer", "integer", "integer",
> "integer", "character", "character", "integer", "integer", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "integer", "integer", "integer", "integer", "character", "integer",
> "character", "integer", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "integer", "integer", "integer", "integer", "integer", "character",
> "integer", "character", "character", "integer", "integer", "character",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "character",
> "integer", "integer", "integer", "character", "integer", "character",
> "integer", "character", "integer", "integer", "integer", "integer",
> "numeric", "integer", "integer", "integer", "integer", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "character", "character", "character",
> "character", "character", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "character", "integer", "integer",
> "character", "character", "character", "integer", "character",
> "integer", "integer", "integer", "integer", "integer", "numeric",
> "integer", "character", "integer", "character", "character",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer", "integer", "integer",
> "integer", "integer", "integer", "integer")), .Names = c("DTguess",
> "actual"), row.names = c("age", "ageday", "agemonth", "ahour",
> "amonth", "asource", "asourceub92", "asource_x", "atype", "aweekend",
> "billtype", "cpt1", "cpt2", "cpt3", "cpt4", "cpt5", "cpt6", "cpt7",
> "cpt8", "cpt9", "cpt10", "cpt11", "cpt12", "cpt13", "cpt14",
> "cpt15", "cpt16", "cpt17", "cpt18", "cpt19", "cpt20", "cpt21",
> "cpt22", "cpt23", "cpt24", "cpt25", "cpt26", "cpt27", "cpt28",
> "cpt29", "cpt30", "cptccs1", "cptccs2", "cptccs3", "cptccs4",
> "cptccs5", "cptccs6", "cptccs7", "cptccs8", "cptccs9", "cptccs10",
> "cptccs11", "cptccs12", "cptccs13", "cptccs14", "cptccs15", "cptccs16",
> "cptccs17", "cptccs18", "cptccs19", "cptccs20", "cptccs21", "cptccs22",
> "cptccs23", "cptccs24", "cptccs25", "cptccs26", "cptccs27", "cptccs28",
> "cptccs29", "cptccs30", "cptm1_1", "cptm1_2", "cptm1_3", "cptm1_4",
> "cptm1_5", "cptm1_6", "cptm1_7", "cptm1_8", "cptm1_9", "cptm1_10",
> "cptm1_11", "cptm1_12", "cptm1_13", "cptm1_14", "cptm1_15", "cptm1_16",
> "cptm1_17", "cptm1_18", "cptm1_19", "cptm1_20", "cptm1_21", "cptm1_22",
> "cptm1_23", "cptm1_24", "cptm1_25", "cptm1_26", "cptm1_27", "cptm1_28",
> "cptm1_29", "cptm1_30", "cptm2_1", "cptm2_2", "cptm2_3", "cptm2_4",
> "cptm2_5", "cptm2_6", "cptm2_7", "cptm2_8", "cptm2_9", "cptm2_10",
> "cptm2_11", "cptm2_12", "cptm2_13", "cptm2_14", "cptm2_15", "cptm2_16",
> "cptm2_17", "cptm2_18", "cptm2_19", "cptm2_20", "cptm2_21", "cptm2_22",
> "cptm2_23", "cptm2_24", "cptm2_25", "cptm2_26", "cptm2_27", "cptm2_28",
> "cptm2_29", "cptm2_30", "dhour", "died", "dispub04", "dispuniform",
> "disp_x", "dqtr", "dshospid", "duration", "dx1", "dx2", "dx3",
> "dx4", "dx5", "dx6", "dx7", "dx8", "dx9", "dx10", "dx11", "dx12",
> "dx13", "dx14", "dx15", "dx16", "dx17", "dx18", "dx19", "dx20",
> "dx21", "dx22", "dx23", "dx24", "dxccs1", "dxccs2", "dxccs3",
> "dxccs4", "dxccs5", "dxccs6", "dxccs7", "dxccs8", "dxccs9", "dxccs10",
> "dxccs11", "dxccs12", "dxccs13", "dxccs14", "dxccs15", "dxccs16",
> "dxccs17", "dxccs18", "dxccs19", "dxccs20", "dxccs21", "dxccs22",
> "dxccs23", "dxccs24", "dxpoa1", "dxpoa2", "dxpoa3", "dxpoa4",
> "dxpoa5", "dxpoa6", "dxpoa7", "dxpoa8", "dxpoa9", "dxpoa10",
> "dxpoa11", "dxpoa12", "dxpoa13", "dxpoa14", "dxpoa15", "dxpoa16",
> "dxpoa17", "dxpoa18", "dxpoa19", "dxpoa20", "dxpoa21", "dxpoa22",
> "dxpoa23", "dxpoa24", "dx_visit_reason1", "dx_visit_reason2",
> "dx_visit_reason3", "ecode1", "ecode2", "ecode3", "ecode4", "ecode5",
> "ecode6", "ecode7", "ecode8", "e_ccs1", "e_ccs2", "e_ccs3", "e_ccs4",
> "e_ccs5", "e_ccs6", "e_ccs7", "e_ccs8", "e_poa1", "e_poa2", "e_poa3",
> "e_poa4", "e_poa5", "e_poa6", "e_poa7", "e_poa8", "female", "hcup_ed",
> "hcup_os", "hcup_surgery_broad", "hcup_surgery_narrow", "hispanic_x",
> "hospbrth", "hospst", "key", "los", "los_x", "maritalstatusub04",
> "mdnum1_r", "mdnum2_r", "medincstq", "momnum_r", "mrn_r", "nchronic",
> "ncpt", "ndx", "necode", "neomat", "npr", "opservice", "orproc",
> "os_time", "pay1", "pay1_x", "pay2", "pay2_x", "pay3", "pay3_x",
> "pl_cbsa", "pl_msa1993", "pl_nchs2006", "pl_ruca10_2005", "pl_ruca2005",
> "pl_ruca4_2005", "pl_rucc2003", "pl_uic2003", "pl_ur_cat4", "pr1",
> "pr2", "pr3", "pr4", "pr5", "pr6", "pr7", "pr8", "pr9", "pr10",
> "pr11", "pr12", "pr13", "pr14", "pr15", "pr16", "pr17", "pr18",
> "prccs1", "prccs2", "prccs3", "prccs4", "prccs5", "prccs6", "prccs7",
> "prccs8", "prccs9", "prccs10", "prccs11", "prccs12", "prccs13",
> "prccs14", "prccs15", "prccs16", "prccs17", "prccs18", "prday1",
> "prday2", "prday3", "prday4", "prday5", "prday6", "prday7", "prday8",
> "prday9", "prday10", "prday11", "prday12", "prday13", "prday14",
> "prday15", "prday16", "prday17", "prday18", "proctype", "pstate",
> "pstco", "pstco2", "pointoforiginub04", "pointoforigin_x", "primlang",
> "race", "race_x", "readmit", "state_as", "state_ed", "state_os",
> "totchg", "totchg_x", "year", "zip3", "zipinc_qrtl", "town",
> "zip", "ayear", "dmonth", "bmonth", "byear", "prmonth1", "prmonth2",
> "prmonth3", "prmonth4", "prmonth5", "prmonth6", "prmonth7", "prmonth8",
> "prmonth9", "prmonth10", "prmonth11", "prmonth12", "prmonth13",
> "prmonth14", "prmonth15", "prmonth16", "prmonth17", "prmonth18",
> "pryear1", "pryear2", "pryear3", "pryear4", "pryear5", "pryear6",
> "pryear7", "pryear8", "pryear9", "pryear10", "pryear11", "pryear12",
> "pryear13", "pryear14", "pryear15", "pryear16", "pryear17", "pryear18"
> ), class = "data.frame")
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130912/cf2211f1/attachment-0001.html>

From mdowle at mdowle.plus.com  Fri Sep 13 00:52:40 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Thu, 12 Sep 2013 23:52:40 +0100
Subject: [datatable-help] colClasses and fread
In-Reply-To: <5232434E.3050608@mdowle.plus.com>
References: <CAAT1DuNBmEvJJOx71G2XB9Xz5zF+xTXpf256B96s=M3kq7u+gw@mail.gmail.com>
 <5232434E.3050608@mdowle.plus.com>
Message-ID: <523245B8.5090807@mdowle.plus.com>


But I think in the diagnostics you sent,  the final result was still 
correct.   The initial guess may have been poor, but it bumped the 
columns mid read and worked it out.  Why do you need to set colClasses?  
What was wrong in the final result?

(BTW, this thread was failing the mailman size filter (100k message 
size). I let them through and chopped the history on this one for that 
reason. )


On 12/09/13 23:42, Matthew Dowle wrote:
>
> Is that v1.8.10 as on CRAN?   It doesn't look like it from a few clues 
> in the output below.
> v1.8.10 has colClasses working, see NEWS.
>
> On 12/09/13 22:32, Ari Friedman wrote:
>> Dear maintainers of that most wonderful package that makes R fast with
>> big data,
>>
>> I've recently discovered fread.  It's amazing.  My call to read.fwf on a
>> 4GB file that took all night now takes under a minute after conversion
>> to csv via csvkit/in2csv.
>>
>> However, automatic type detection is working very poorly, probably due
>> to the presence of a large number of columns with high rates of
>> missingness, plus a large number of character columns with encoded
>> values (these are medical and diagnostic codes).
>>
>> Normally I'd specify colClasses, and the warning messages even tell me I
>> should specify colClasses, but there's no colClasses argument to fread.
>>
>> Any thoughts on solving this?  Verbose output, warnings, and a
>> comparison of the guesses vs. what the documentation on the file says it
>> is are found below.  Unfortunately the data can't be shared, even in
>> small portions so I can't make this reproducible.
>>
>> Thanks!
>> Ari
>> > dt <- fread('myfile.csv', verbose=TRUE)
>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>> Using line 30 to detect sep (the last non blank line in the first 30) ... ','
>> Found 393 columns
>> First row with 393 fields occurs on line 1 (either column names or first row of data)
>> All the fields on line 1 are character fields. Treating as the column names.
>> Count of eol after first data row: 2994440
>> Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows
>> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows)
>> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows)
>> Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows)
>> 0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867'
>> Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867'
>> Bumping column 146 from REAL to STR on data row 9, field contains 'V5867'
>> Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869'
>> Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869'
>> Bumping column 147 from REAL to STR on data row 9, field contains 'V5869'
>> Bumping column 142 from INT to INT64 on data row 10, field contains 'V140'
>> Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140'
>> Bumping column 142 from REAL to STR on data row 10, field contains 'V140'
>> Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885'
>>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130912/fe885e56/attachment.html>

From mdowle at mdowle.plus.com  Fri Sep 13 20:19:01 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 13 Sep 2013 19:19:01 +0100
Subject: [datatable-help] fread'ing logicals
Message-ID: <52335715.8040109@mdowle.plus.com>


All,

I've implemented skipping columns using NULL in colClasses,   and 
logicals are now also read.  read.csv reads 
"T","F","TRUE","FALSE","True" and "False" as type logical,  so I've 
followed suit.   But I'm wondering about the single letters "T" and 
"F".  To illustrate, the following might be confusing :

 > fread("A,B,C\nD,E,F\n")
    A B     C
1: D E FALSE
 > fread("A,B,C\nD,E,F\nG,H,I\n")
    A B C
1: D E F
2: G H I
 >

Should fread treat "T" and "F" as logical?   Should it read a column of 
only 0's and 1's as logical, too? I think I'd prefer that as it's quite 
common.

I'm also thinking of increasing the number of rows used for type 
detection to the top 500, middle 500 and bottom 500,  since that's a 
very small extra cost to save the relatively much larger cost of mid 
read column bumps. As a parameter, with 500 by default.

Matthew


From caneff at gmail.com  Fri Sep 13 20:25:18 2013
From: caneff at gmail.com (Chris Neff)
Date: Fri, 13 Sep 2013 14:25:18 -0400
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <52335715.8040109@mdowle.plus.com>
References: <52335715.8040109@mdowle.plus.com>
Message-ID: <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>

I would prefer that you stay consistent with read.csv unless you really
have a good reason.  I don't think this is a good enough reason. They can
specify colClasses or change it after the fact.


On Fri, Sep 13, 2013 at 2:19 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:

>
> All,
>
> I've implemented skipping columns using NULL in colClasses,   and logicals
> are now also read.  read.csv reads "T","F","TRUE","FALSE","True" and
> "False" as type logical,  so I've followed suit.   But I'm wondering about
> the single letters "T" and "F".  To illustrate, the following might be
> confusing :
>
> > fread("A,B,C\nD,E,F\n")
>    A B     C
> 1: D E FALSE
> > fread("A,B,C\nD,E,F\nG,H,I\n")
>    A B C
> 1: D E F
> 2: G H I
> >
>
> Should fread treat "T" and "F" as logical?   Should it read a column of
> only 0's and 1's as logical, too? I think I'd prefer that as it's quite
> common.
>
> I'm also thinking of increasing the number of rows used for type detection
> to the top 500, middle 500 and bottom 500,  since that's a very small extra
> cost to save the relatively much larger cost of mid read column bumps. As a
> parameter, with 500 by default.
>
> Matthew
>
>
> ______________________________**_________________
> datatable-help mailing list
> datatable-help at lists.r-forge.**r-project.org<datatable-help at lists.r-forge.r-project.org>
> https://lists.r-forge.r-**project.org/cgi-bin/mailman/**
> listinfo/datatable-help<https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130913/14b173aa/attachment.html>

From chinmay.patil at gmail.com  Sat Sep 14 06:03:43 2013
From: chinmay.patil at gmail.com (Chinmay Patil)
Date: Sat, 14 Sep 2013 12:03:43 +0800
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
Message-ID: <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>

I agree.. One of the criticism I hear about newer packages in R ecosystem
is inconsistency with existing conventions. I would also vote for
consistency with read.csv / read.table


On Sat, Sep 14, 2013 at 2:25 AM, Chris Neff <caneff at gmail.com> wrote:

> I would prefer that you stay consistent with read.csv unless you really
> have a good reason.  I don't think this is a good enough reason. They can
> specify colClasses or change it after the fact.
>
>
> On Fri, Sep 13, 2013 at 2:19 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>
>>
>> All,
>>
>> I've implemented skipping columns using NULL in colClasses,   and
>> logicals are now also read.  read.csv reads "T","F","TRUE","FALSE","True"
>> and "False" as type logical,  so I've followed suit.   But I'm wondering
>> about the single letters "T" and "F".  To illustrate, the following might
>> be confusing :
>>
>> > fread("A,B,C\nD,E,F\n")
>>    A B     C
>> 1: D E FALSE
>> > fread("A,B,C\nD,E,F\nG,H,I\n")
>>    A B C
>> 1: D E F
>> 2: G H I
>> >
>>
>> Should fread treat "T" and "F" as logical?   Should it read a column of
>> only 0's and 1's as logical, too? I think I'd prefer that as it's quite
>> common.
>>
>> I'm also thinking of increasing the number of rows used for type
>> detection to the top 500, middle 500 and bottom 500,  since that's a very
>> small extra cost to save the relatively much larger cost of mid read column
>> bumps. As a parameter, with 500 by default.
>>
>> Matthew
>>
>>
>> ______________________________**_________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.**r-project.org<datatable-help at lists.r-forge.r-project.org>
>> https://lists.r-forge.r-**project.org/cgi-bin/mailman/**
>> listinfo/datatable-help<https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help>
>>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/72c89643/attachment.html>

From lianoglou.steve at gene.com  Sat Sep 14 06:42:51 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 13 Sep 2013 21:42:51 -0700
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
Message-ID: <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>

Hi Chinmay,

On Fri, Sep 13, 2013 at 9:03 PM, Chinmay Patil <chinmay.patil at gmail.com> wrote:
> I agree.. One of the criticism I hear about newer packages in R ecosystem is
> inconsistency with existing conventions.

Out of curiosity, what packages (and criticisms) might those be?

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From chinmay.patil at gmail.com  Sat Sep 14 06:54:55 2013
From: chinmay.patil at gmail.com (Chinmay Patil)
Date: Sat, 14 Sep 2013 12:54:55 +0800
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
 <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
Message-ID: <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>

For eg. I recently heard complains about data.table itself from due to
changes in interface and learning curve that data.table comes with... I
hear similar complaints about some packages like ggplot2, plyr..

Even though all these are great packages.. people don't like radical
changes to interfaces as it makes refactoring older code even more painful.


On Sat, Sep 14, 2013 at 12:42 PM, Steve Lianoglou
<lianoglou.steve at gene.com>wrote:

> Hi Chinmay,
>
> On Fri, Sep 13, 2013 at 9:03 PM, Chinmay Patil <chinmay.patil at gmail.com>
> wrote:
> > I agree.. One of the criticism I hear about newer packages in R
> ecosystem is
> > inconsistency with existing conventions.
>
> Out of curiosity, what packages (and criticisms) might those be?
>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/15110ccb/attachment.html>

From lianoglou.steve at gene.com  Sat Sep 14 07:29:11 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 13 Sep 2013 22:29:11 -0700
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
 <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
 <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>
Message-ID: <CAHA9McPie4n3j4n+w53wqiZ1cXrzGyRegxCqE14FyfLCoVL8Fg@mail.gmail.com>

Thanks for the quick response.

As for the "learning curve" stuff -- no real comment there, but:

> For eg. I recently heard complains about data.table itself from due to
> changes in interface

Could you provide some concrete examples about which changes have
stumped users? Perhaps we can learn from these critiques. I had
thought we were pretty good about discussing any (breaking) changes on
list, but I'd be interested to see where this has failed so it might
perhaps be avoided in the future.

> and learning curve that data.table comes with... I hear
> similar complaints about some packages like ggplot2, plyr..
>
> Even though all these are great packages.. people don't like radical changes
> to interfaces as it makes refactoring older code even more painful.

Still curious to hear what radical changes have come down the pipe.

Thanks for taking the time to comment.

Cheers,
-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From chinmay.patil at gmail.com  Sat Sep 14 07:48:31 2013
From: chinmay.patil at gmail.com (Chinmay Patil)
Date: Sat, 14 Sep 2013 13:48:31 +0800
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <CAHA9McPie4n3j4n+w53wqiZ1cXrzGyRegxCqE14FyfLCoVL8Fg@mail.gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
 <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
 <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>
 <CAHA9McPie4n3j4n+w53wqiZ1cXrzGyRegxCqE14FyfLCoVL8Fg@mail.gmail.com>
Message-ID: <A4C263EA-5160-45CB-9007-276CEAA1E712@gmail.com>

I didn't mean changes in data.table's interface but the way data.table works in itself compared to normal data frames. I know there are valid reasons for structuring data.table's interface the way it is but not all users get it immediately. 

As for data.table, I am not complaining, just saying what other users complaints I have heard of.
I personally love data.table and am willing to put the effort to learn best ways to use it while most users aren't. 

Chinmay

On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:

> Thanks for the quick response.
> 
> As for the "learning curve" stuff -- no real comment there, but:
> 
>> For eg. I recently heard complains about data.table itself from due to
>> changes in interface
> 
> Could you provide some concrete examples about which changes have
> stumped users? Perhaps we can learn from these critiques. I had
> thought we were pretty good about discussing any (breaking) changes on
> list, but I'd be interested to see where this has failed so it might
> perhaps be avoided in the future.
> 
>> and learning curve that data.table comes with... I hear
>> similar complaints about some packages like ggplot2, plyr..
>> 
>> Even though all these are great packages.. people don't like radical changes
>> to interfaces as it makes refactoring older code even more painful.
> 
> Still curious to hear what radical changes have come down the pipe.
> 
> Thanks for taking the time to comment.
> 
> Cheers,
> -steve
> 
> -- 
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech

From mdowle at mdowle.plus.com  Sat Sep 14 11:53:21 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 14 Sep 2013 10:53:21 +0100
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <A4C263EA-5160-45CB-9007-276CEAA1E712@gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
 <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
 <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>
 <CAHA9McPie4n3j4n+w53wqiZ1cXrzGyRegxCqE14FyfLCoVL8Fg@mail.gmail.com>
 <A4C263EA-5160-45CB-9007-276CEAA1E712@gmail.com>
Message-ID: <52343211.7040109@mdowle.plus.com>

On 14/09/13 06:48, Chinmay Patil wrote:
> I didn't mean changes in data.table's interface but the way data.table works in itself compared to normal data frames. I know there are valid reasons for structuring data.table's interface the way it is but not all users get it immediately.

The bottom line in my mind is that even if base syntax was sped up 
(assignment to an unnamed data.frame needn't copy the whole data.frame 
for example), I would still move from 
subset()/transform()/with()/DF[i,j]<-value syntax,  to i,j and by inside 
[...]  with .SD,.I,.N and := in j.  I can do things with that syntax 
that I need to do which aren't always so easy with base syntax (like 
adding columns by reference by group).

And base R syntax is indeed being sped up by pqR, Renjin, Riposte, TERR, 
CXXR, fastr which may feed into GNU R. Once that is mature and the dust 
has settled, I would still move from data.frame to data.table on each of 
them.  Maybe we should market the things that data.table does that base 
R doesn't.   Rather than speed differences.

>
> As for data.table, I am not complaining, just saying what other users complaints I have heard of.
> I personally love data.table and am willing to put the effort to learn best ways to use it while most users aren't.

Great.  data.table is for people like you.

So we'll keep the default fread'ing of "T" and "F" as logicals then for 
consistency with read.csv.

And I still hope to produce a drop-in replacement for read.csv which 
returns a data.frame but uses fread under the hood. That will speed up 
existing code,  but users can use the extra features of fread if they 
want, too.

Matthew

>
> Chinmay
>
> On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:
>
>> Thanks for the quick response.
>>
>> As for the "learning curve" stuff -- no real comment there, but:
>>
>>> For eg. I recently heard complains about data.table itself from due to
>>> changes in interface
>> Could you provide some concrete examples about which changes have
>> stumped users? Perhaps we can learn from these critiques. I had
>> thought we were pretty good about discussing any (breaking) changes on
>> list, but I'd be interested to see where this has failed so it might
>> perhaps be avoided in the future.
>>
>>> and learning curve that data.table comes with... I hear
>>> similar complaints about some packages like ggplot2, plyr..
>>>
>>> Even though all these are great packages.. people don't like radical changes
>>> to interfaces as it makes refactoring older code even more painful.
>> Still curious to hear what radical changes have come down the pipe.
>>
>> Thanks for taking the time to comment.
>>
>> Cheers,
>> -steve
>>
>> -- 
>> Steve Lianoglou
>> Computational Biologist
>> Bioinformatics and Computational Biology
>> Genentech
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From aragorn168b at gmail.com  Sat Sep 14 12:29:03 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 14 Sep 2013 12:29:03 +0200
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <52343211.7040109@mdowle.plus.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
 <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
 <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>
 <CAHA9McPie4n3j4n+w53wqiZ1cXrzGyRegxCqE14FyfLCoVL8Fg@mail.gmail.com>
 <A4C263EA-5160-45CB-9007-276CEAA1E712@gmail.com>
 <52343211.7040109@mdowle.plus.com>
Message-ID: <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com>

Matthew, 

+1 for retaining T and F like read.csv.
+1 for the dropins() feature as well.

Arun


On Saturday, September 14, 2013 at 11:53 AM, Matthew Dowle wrote:

> On 14/09/13 06:48, Chinmay Patil wrote:
> > I didn't mean changes in data.table's interface but the way data.table works in itself compared to normal data frames. I know there are valid reasons for structuring data.table's interface the way it is but not all users get it immediately.
> 
> 
> The bottom line in my mind is that even if base syntax was sped up 
> (assignment to an unnamed data.frame needn't copy the whole data.frame 
> for example), I would still move from 
> subset()/transform()/with()/DF[i,j]<-value syntax, to i,j and by inside 
> [...] with .SD,.I,.N and := in j. I can do things with that syntax 
> that I need to do which aren't always so easy with base syntax (like 
> adding columns by reference by group).
> 
> And base R syntax is indeed being sped up by pqR, Renjin, Riposte, TERR, 
> CXXR, fastr which may feed into GNU R. Once that is mature and the dust 
> has settled, I would still move from data.frame to data.table on each of 
> them. Maybe we should market the things that data.table does that base 
> R doesn't. Rather than speed differences.
> 
> > 
> > As for data.table, I am not complaining, just saying what other users complaints I have heard of.
> > I personally love data.table and am willing to put the effort to learn best ways to use it while most users aren't.
> > 
> 
> 
> Great. data.table is for people like you.
> 
> So we'll keep the default fread'ing of "T" and "F" as logicals then for 
> consistency with read.csv.
> 
> And I still hope to produce a drop-in replacement for read.csv which 
> returns a data.frame but uses fread under the hood. That will speed up 
> existing code, but users can use the extra features of fread if they 
> want, too.
> 
> Matthew
> 
> > 
> > Chinmay
> > 
> > On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou <lianoglou.steve at gene.com (mailto:lianoglou.steve at gene.com)> wrote:
> > 
> > > Thanks for the quick response.
> > > 
> > > As for the "learning curve" stuff -- no real comment there, but:
> > > 
> > > > For eg. I recently heard complains about data.table itself from due to
> > > > changes in interface
> > > > 
> > > 
> > > Could you provide some concrete examples about which changes have
> > > stumped users? Perhaps we can learn from these critiques. I had
> > > thought we were pretty good about discussing any (breaking) changes on
> > > list, but I'd be interested to see where this has failed so it might
> > > perhaps be avoided in the future.
> > > 
> > > > and learning curve that data.table comes with... I hear
> > > > similar complaints about some packages like ggplot2, plyr..
> > > > 
> > > > Even though all these are great packages.. people don't like radical changes
> > > > to interfaces as it makes refactoring older code even more painful.
> > > > 
> > > 
> > > Still curious to hear what radical changes have come down the pipe.
> > > 
> > > Thanks for taking the time to comment.
> > > 
> > > Cheers,
> > > -steve
> > > 
> > > -- 
> > > Steve Lianoglou
> > > Computational Biologist
> > > Bioinformatics and Computational Biology
> > > Genentech
> > > 
> > 
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > 
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/77b01075/attachment.html>

From harishv_99 at yahoo.com  Sat Sep 14 21:05:29 2013
From: harishv_99 at yahoo.com (Harish)
Date: Sat, 14 Sep 2013 12:05:29 -0700 (PDT)
Subject: [datatable-help] "by" on integer64 not working
Message-ID: <1379185529.38620.YahooMailNeo@web120203.mail.ne1.yahoo.com>

I am trying to use "by" on integer64 data and data.table seems to think that there is only one value.? This is reproduced with the following:

library( data.table )
library( bit64 )

DT <- data.table( a=rep( 1:5, 2), b=15:24 )
DT[ , .N, by=a ]
DT[ , a := as.integer64( a ) ]
DT[ , .N, by=a ]

The output I get is:


> DT <- data.table( a=rep( 1:5, 2), b=15:24 )
> DT[ , .N, by=a ]
?? a N
1: 1 2
2: 2 2
3: 3 2
4: 4 2
5: 5 2
> DT[ , a := as.integer64( a ) ]
> DT[ , .N, by=a ]
?? a? N
1: 1 10

Notice that the "by" after converting column "a" to integer64 is different from before.? However, the values of "a" are correct:

> DT$a
integer64
?[1] 1 2 3 4 5 1 2 3 4 5

I am using the latest version of data.table from r-forge (1.8.11 Rev 965).? I also had the same issue with 1.8.10 from CRAN.


Am I doing something wrong or is this a bug?? Thanks for your help.


Regards,
Harish
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/844d86c1/attachment.html>

From harishv_99 at yahoo.com  Sat Sep 14 21:57:55 2013
From: harishv_99 at yahoo.com (Harish)
Date: Sat, 14 Sep 2013 12:57:55 -0700 (PDT)
Subject: [datatable-help] fread() and UTF-8 support
Message-ID: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com>

Does fread() support UTF-8?? I got a text file that is mostly Latin-1 characters but encoded as UTF-8.? When I load the data, the first column name has a few extra characters in the beginning ("???id"), but I do not get this when I convert the same file to ANSI format using Windows Notepad.

I am guessing that UTF-8 encoding puts a few extra characters in the beginning of the text file to indicate that it is an UTF-8 encoding, and fread() is reading that literally as the first column name.

Thanks for the clarification.

Regards,
Harish
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/0ba5a4e8/attachment.html>

From mdowle at mdowle.plus.com  Sat Sep 14 22:06:16 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 14 Sep 2013 21:06:16 +0100
Subject: [datatable-help] "by" on integer64 not working
In-Reply-To: <1379185529.38620.YahooMailNeo@web120203.mail.ne1.yahoo.com>
References: <1379185529.38620.YahooMailNeo@web120203.mail.ne1.yahoo.com>
Message-ID: <5234C1B8.4090306@mdowle.plus.com>


Sorry - haven't got to implementing grouping or keys for integer64 yet. 
All that's been done is integer64 in fread. There's a bug item on the list.

Matthew

On 14/09/13 20:05, Harish wrote:
> I am trying to use "by" on integer64 data and data.table seems to 
> think that there is only one value.  This is reproduced with the 
> following:
>
> library( data.table )
> library( bit64 )
>
> DT <- data.table( a=rep( 1:5, 2), b=15:24 )
> DT[ , .N, by=a ]
> DT[ , a := as.integer64( a ) ]
> DT[ , .N, by=a ]
>
> The output I get is:
>
> > DT <- data.table( a=rep( 1:5, 2), b=15:24 )
> > DT[ , .N, by=a ]
>    a N
> 1: 1 2
> 2: 2 2
> 3: 3 2
> 4: 4 2
> 5: 5 2
> > DT[ , a := as.integer64( a ) ]
> > DT[ , .N, by=a ]
>    a  N
> 1: 1 10
>
> Notice that the "by" after converting column "a" to integer64 is 
> different from before.  However, the values of "a" are correct:
> > DT$a
> integer64
>  [1] 1 2 3 4 5 1 2 3 4 5
>
> I am using the latest version of data.table from r-forge (1.8.11 Rev 
> 965).  I also had the same issue with 1.8.10 from CRAN.
>
> Am I doing something wrong or is this a bug?  Thanks for your help.
>
>
> Regards,
> Harish
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/c640e50b/attachment-0001.html>

From mdowle at mdowle.plus.com  Sat Sep 14 22:33:42 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 14 Sep 2013 21:33:42 +0100
Subject: [datatable-help] fread() and UTF-8 support
In-Reply-To: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com>
References: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com>
Message-ID: <5234C826.6090908@mdowle.plus.com>


Sorry again - nope hadn't given UTF-8 any thought.

Matthew

On 14/09/13 20:57, Harish wrote:
> Does fread() support UTF-8?  I got a text file that is mostly Latin-1 
> characters but encoded as UTF-8.  When I load the data, the first 
> column name has a few extra characters in the beginning ("???id"), but 
> I do not get this when I convert the same file to ANSI format using 
> Windows Notepad.
>
> I am guessing that UTF-8 encoding puts a few extra characters in the 
> beginning of the text file to indicate that it is an UTF-8 encoding, 
> and fread() is reading that literally as the first column name.
>
> Thanks for the clarification.
>
> Regards,
> Harish
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/30a0a15b/attachment.html>

From micheledemeo at gmail.com  Sat Sep 14 22:47:26 2013
From: micheledemeo at gmail.com (MICHELE DE MEO)
Date: Sat, 14 Sep 2013 22:47:26 +0200
Subject: [datatable-help] fread() and UTF-8 support
In-Reply-To: <5234C826.6090908@mdowle.plus.com>
References: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com>
 <5234C826.6090908@mdowle.plus.com>
Message-ID: <CAOfgy=nPnaHW9-QFohhDiNM7afdrLRduTN_C-YWjb4D84dPj2w@mail.gmail.com>

I think it could be very useful the possibility to specify the encoding, as
when you use the function 'file' with read.table .

Michele
Il giorno 14/set/2013 22:33, "Matthew Dowle" <mdowle at mdowle.plus.com> ha
scritto:

>
> Sorry again - nope hadn't given UTF-8 any thought.
>
> Matthew
>
> On 14/09/13 20:57, Harish wrote:
>
>  Does fread() support UTF-8?  I got a text file that is mostly Latin-1
> characters but encoded as UTF-8.  When I load the data, the first column
> name has a few extra characters in the beginning ("???id"), but I do not
> get this when I convert the same file to ANSI format using Windows Notepad.
>
>  I am guessing that UTF-8 encoding puts a few extra characters in the
> beginning of the text file to indicate that it is an UTF-8 encoding, and
> fread() is reading that literally as the first column name.
>
>  Thanks for the clarification.
>
>  Regards,
> Harish
>
>
> _______________________________________________
> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130914/e47b4224/attachment.html>

From mdowle at mdowle.plus.com  Sun Sep 15 10:34:29 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sun, 15 Sep 2013 09:34:29 +0100
Subject: [datatable-help] fread() and UTF-8 support
In-Reply-To: <CAOfgy=nPnaHW9-QFohhDiNM7afdrLRduTN_C-YWjb4D84dPj2w@mail.gmail.com>
References: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com>
 <5234C826.6090908@mdowle.plus.com>
 <CAOfgy=nPnaHW9-QFohhDiNM7afdrLRduTN_C-YWjb4D84dPj2w@mail.gmail.com>
Message-ID: <52357115.7040404@mdowle.plus.com>

Ok, can you file as a feature request please.  Thanks.

Matthew

On 14/09/13 21:47, MICHELE DE MEO wrote:
>
> I think it could be very useful the possibility to specify the 
> encoding, as when you use the function 'file' with read.table .
>
> Michele
>
> Il giorno 14/set/2013 22:33, "Matthew Dowle" <mdowle at mdowle.plus.com 
> <mailto:mdowle at mdowle.plus.com>> ha scritto:
>
>
>     Sorry again - nope hadn't given UTF-8 any thought.
>
>     Matthew
>
>     On 14/09/13 20:57, Harish wrote:
>>     Does fread() support UTF-8?  I got a text file that is mostly
>>     Latin-1 characters but encoded as UTF-8. When I load the data,
>>     the first column name has a few extra characters in the beginning
>>     ("???id"), but I do not get this when I convert the same file to
>>     ANSI format using Windows Notepad.
>>
>>     I am guessing that UTF-8 encoding puts a few extra characters in
>>     the beginning of the text file to indicate that it is an UTF-8
>>     encoding, and fread() is reading that literally as the first
>>     column name.
>>
>>     Thanks for the clarification.
>>
>>     Regards,
>>     Harish
>>
>>
>>     _______________________________________________
>>     datatable-help mailing list
>>     datatable-help at lists.r-forge.r-project.org  <mailto:datatable-help at lists.r-forge.r-project.org>
>>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>     _______________________________________________
>     datatable-help mailing list
>     datatable-help at lists.r-forge.r-project.org
>     <mailto:datatable-help at lists.r-forge.r-project.org>
>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130915/e2803980/attachment.html>

From eduard.antonyan at gmail.com  Sun Sep 15 23:42:16 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sun, 15 Sep 2013 16:42:16 -0500
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
 <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
 <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>
 <CAHA9McPie4n3j4n+w53wqiZ1cXrzGyRegxCqE14FyfLCoVL8Fg@mail.gmail.com>
 <A4C263EA-5160-45CB-9007-276CEAA1E712@gmail.com>
 <52343211.7040109@mdowle.plus.com>
 <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com>
Message-ID: <CAHZcBOr1VGcdbprj9+xpCjYQ82gafYcVxHxt-K0h_TaB3og50w@mail.gmail.com>

+1 for T and F, but definitely not because it's that way in read.csv (which
imo is not a good reason), but rather because those are commonly used
substitutes for TRUE and FALSE.
On Sep 14, 2013 5:29 AM, "Arunkumar Srinivasan" <aragorn168b at gmail.com>
wrote:

>  Matthew,
>
> +1 for retaining T and F like read.csv.
> +1 for the dropins() feature as well.
>
> Arun
>
> On Saturday, September 14, 2013 at 11:53 AM, Matthew Dowle wrote:
>
> On 14/09/13 06:48, Chinmay Patil wrote:
>
> I didn't mean changes in data.table's interface but the way data.table
> works in itself compared to normal data frames. I know there are valid
> reasons for structuring data.table's interface the way it is but not all
> users get it immediately.
>
>
> The bottom line in my mind is that even if base syntax was sped up
> (assignment to an unnamed data.frame needn't copy the whole data.frame
> for example), I would still move from
> subset()/transform()/with()/DF[i,j]<-value syntax, to i,j and by inside
> [...] with .SD,.I,.N and := in j. I can do things with that syntax
> that I need to do which aren't always so easy with base syntax (like
> adding columns by reference by group).
>
> And base R syntax is indeed being sped up by pqR, Renjin, Riposte, TERR,
> CXXR, fastr which may feed into GNU R. Once that is mature and the dust
> has settled, I would still move from data.frame to data.table on each of
> them. Maybe we should market the things that data.table does that base
> R doesn't. Rather than speed differences.
>
>
> As for data.table, I am not complaining, just saying what other users
> complaints I have heard of.
> I personally love data.table and am willing to put the effort to learn
> best ways to use it while most users aren't.
>
>
> Great. data.table is for people like you.
>
> So we'll keep the default fread'ing of "T" and "F" as logicals then for
> consistency with read.csv.
>
> And I still hope to produce a drop-in replacement for read.csv which
> returns a data.frame but uses fread under the hood. That will speed up
> existing code, but users can use the extra features of fread if they
> want, too.
>
> Matthew
>
>
> Chinmay
>
> On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou <lianoglou.steve at gene.com>
> wrote:
>
> Thanks for the quick response.
>
> As for the "learning curve" stuff -- no real comment there, but:
>
> For eg. I recently heard complains about data.table itself from due to
> changes in interface
>
> Could you provide some concrete examples about which changes have
> stumped users? Perhaps we can learn from these critiques. I had
> thought we were pretty good about discussing any (breaking) changes on
> list, but I'd be interested to see where this has failed so it might
> perhaps be avoided in the future.
>
> and learning curve that data.table comes with... I hear
> similar complaints about some packages like ggplot2, plyr..
>
> Even though all these are great packages.. people don't like radical
> changes
> to interfaces as it makes refactoring older code even more painful.
>
> Still curious to hear what radical changes have come down the pipe.
>
> Thanks for taking the time to comment.
>
> Cheers,
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130915/9dada8c4/attachment.html>

From mdowle at mdowle.plus.com  Mon Sep 16 01:35:27 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Mon, 16 Sep 2013 00:35:27 +0100
Subject: [datatable-help] fread'ing logicals
In-Reply-To: <CAHZcBOr1VGcdbprj9+xpCjYQ82gafYcVxHxt-K0h_TaB3og50w@mail.gmail.com>
References: <52335715.8040109@mdowle.plus.com>
 <CAAuY0RUnzOH2TAn_vQF5WyYmHUzhKreLEeOvhuUz0PsoC+JeJQ@mail.gmail.com>
 <CA+kDFFUuiLsS4FZjQpWJSpRW6m8GXzi4tgT2izZmg-_Hkn9uKQ@mail.gmail.com>
 <CAHA9McM6GfCbGLL49-oFw1iYfqFFzOJxVOyxhAB8pZt-Wm_BYw@mail.gmail.com>
 <CA+kDFFVUEyEjjdE_2Tsdx6QQgShySCT4cPUfnHbgnnoL+8=jYA@mail.gmail.com>
 <CAHA9McPie4n3j4n+w53wqiZ1cXrzGyRegxCqE14FyfLCoVL8Fg@mail.gmail.com>
 <A4C263EA-5160-45CB-9007-276CEAA1E712@gmail.com>
 <52343211.7040109@mdowle.plus.com>
 <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com>
 <CAHZcBOr1VGcdbprj9+xpCjYQ82gafYcVxHxt-K0h_TaB3og50w@mail.gmail.com>
Message-ID: <5236443F.3000003@mdowle.plus.com>


Good.  Now committed in v1.8.11 (rev 966).  Also drop and select is done.

o  fread's drop, select and NULL in colClasses are implemented. To drop 
or select columns by name
      or by number. See examples in ?fread.

o  fread now detects T,F,True,False,TRUE and FALSE as type logical, 
consistent with read.csv.

I pasted the new examples from ?fread to this answer as well:
http://stackoverflow.com/a/18702011/403310

Hope this covers everything in this area,  but please shout if anyone 
can think of anything further.

Matthew


On 15/09/13 22:42, Eduard Antonyan wrote:
>
> +1 for T and F, but definitely not because it's that way in read.csv 
> (which imo is not a good reason), but rather because those are 
> commonly used substitutes for TRUE and FALSE.
>
> On Sep 14, 2013 5:29 AM, "Arunkumar Srinivasan" <aragorn168b at gmail.com 
> <mailto:aragorn168b at gmail.com>> wrote:
>
>     Matthew,
>
>     +1 for retaining T and F like read.csv.
>     +1 for the dropins() feature as well.
>
>     Arun
>
>     On Saturday, September 14, 2013 at 11:53 AM, Matthew Dowle wrote:
>
>>     On 14/09/13 06:48, Chinmay Patil wrote:
>>>     I didn't mean changes in data.table's interface but the way
>>>     data.table works in itself compared to normal data frames. I
>>>     know there are valid reasons for structuring data.table's
>>>     interface the way it is but not all users get it immediately.
>>
>>     The bottom line in my mind is that even if base syntax was sped up
>>     (assignment to an unnamed data.frame needn't copy the whole
>>     data.frame
>>     for example), I would still move from
>>     subset()/transform()/with()/DF[i,j]<-value syntax, to i,j and by
>>     inside
>>     [...] with .SD,.I,.N and := in j. I can do things with that syntax
>>     that I need to do which aren't always so easy with base syntax (like
>>     adding columns by reference by group).
>>
>>     And base R syntax is indeed being sped up by pqR, Renjin,
>>     Riposte, TERR,
>>     CXXR, fastr which may feed into GNU R. Once that is mature and
>>     the dust
>>     has settled, I would still move from data.frame to data.table on
>>     each of
>>     them. Maybe we should market the things that data.table does that
>>     base
>>     R doesn't. Rather than speed differences.
>>
>>>
>>>     As for data.table, I am not complaining, just saying what other
>>>     users complaints I have heard of.
>>>     I personally love data.table and am willing to put the effort to
>>>     learn best ways to use it while most users aren't.
>>
>>     Great. data.table is for people like you.
>>
>>     So we'll keep the default fread'ing of "T" and "F" as logicals
>>     then for
>>     consistency with read.csv.
>>
>>     And I still hope to produce a drop-in replacement for read.csv which
>>     returns a data.frame but uses fread under the hood. That will
>>     speed up
>>     existing code, but users can use the extra features of fread if they
>>     want, too.
>>
>>     Matthew
>>
>>>
>>>     Chinmay
>>>
>>>     On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou
>>>     <lianoglou.steve at gene.com <mailto:lianoglou.steve at gene.com>> wrote:
>>>
>>>>     Thanks for the quick response.
>>>>
>>>>     As for the "learning curve" stuff -- no real comment there, but:
>>>>
>>>>>     For eg. I recently heard complains about data.table itself
>>>>>     from due to
>>>>>     changes in interface
>>>>     Could you provide some concrete examples about which changes have
>>>>     stumped users? Perhaps we can learn from these critiques. I had
>>>>     thought we were pretty good about discussing any (breaking)
>>>>     changes on
>>>>     list, but I'd be interested to see where this has failed so it
>>>>     might
>>>>     perhaps be avoided in the future.
>>>>
>>>>>     and learning curve that data.table comes with... I hear
>>>>>     similar complaints about some packages like ggplot2, plyr..
>>>>>
>>>>>     Even though all these are great packages.. people don't like
>>>>>     radical changes
>>>>>     to interfaces as it makes refactoring older code even more
>>>>>     painful.
>>>>     Still curious to hear what radical changes have come down the pipe.
>>>>
>>>>     Thanks for taking the time to comment.
>>>>
>>>>     Cheers,
>>>>     -steve
>>>>
>>>>     -- 
>>>>     Steve Lianoglou
>>>>     Computational Biologist
>>>>     Bioinformatics and Computational Biology
>>>>     Genentech
>>>     _______________________________________________
>>>     datatable-help mailing list
>>>     datatable-help at lists.r-forge.r-project.org
>>>     <mailto:datatable-help at lists.r-forge.r-project.org>
>>>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>     _______________________________________________
>>     datatable-help mailing list
>>     datatable-help at lists.r-forge.r-project.org
>>     <mailto:datatable-help at lists.r-forge.r-project.org>
>>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>     _______________________________________________
>     datatable-help mailing list
>     datatable-help at lists.r-forge.r-project.org
>     <mailto:datatable-help at lists.r-forge.r-project.org>
>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130916/725527a9/attachment-0001.html>

From npgraham1 at gmail.com  Tue Sep 17 20:13:49 2013
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 17 Sep 2013 14:13:49 -0400
Subject: [datatable-help] Error in coercing matrices within j expressions
Message-ID: <CALhihUjoLkB0fOZTy66ALR9vJ1sT6jwxdLHnZHAZeTEk-XOiOQ@mail.gmail.com>

I'm currently using a (moderately) complex function, call
if f(), as a j expression to analyze my data.  The data itself
is about 1.2M rows, which I analyze by group.
A group may have as few as one row or as many as 10K.
The output from the function is a two-column data.table
where the rows are interesting (for my work) pairs of
observations--I have no idea how many pairs will be
interesting until the function runs, but in abstract it could
be every unique combination (so as many as 50M rows
of output for one call to f()).  It is common, and not an
error, for groups to have no meaningful pairs to return.

I've been using the following line to create the output for
f():

indices <- data.table(i = integer(), j = integer())

I then append to 'indices' any useful pairs using:

indices <- rbind(indices, list(idx[i], idx[j]))

This works, but is very, very slow, in part because I'm
using rbind().  I want to switch to using the built-in matrix,
because rbind() should be much faster for them.  Using
the following line to create the matrix:

indices <- matrix(nrow = 0, ncol = 2, dimnames = list(c(NULL),c("i","j")))

results in the following error:

Logical error. Type of column should have been checked by now

Note that the values returned are always integers.  Results are
coerced via:

data.table(indices)

before returning from f().  If I don't explicitly coerce, I get the
following error:

j doesn't evaluate to the same number of columns for each group

If someone could tell me what I'm doing wrong, or some other
equivalent way to noticeably speed up the whole process, I'd
be very grateful.


-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130917/790f2f87/attachment.html>

From FErickson at psu.edu  Tue Sep 17 22:22:03 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Tue, 17 Sep 2013 16:22:03 -0400
Subject: [datatable-help] Error in coercing matrices within j expressions
In-Reply-To: <CALhihUjoLkB0fOZTy66ALR9vJ1sT6jwxdLHnZHAZeTEk-XOiOQ@mail.gmail.com>
References: <CALhihUjoLkB0fOZTy66ALR9vJ1sT6jwxdLHnZHAZeTEk-XOiOQ@mail.gmail.com>
Message-ID: <CAJd-hdmSCZJKqSTqiGqmeW1HLNM8aWN_uGS28HXWaogVm_ULPw@mail.gmail.com>

Hi,

I guess you could put them into a list and then rbind at the end:

indi <- list()
k=1
indi[[k]] <- list(i=2L,j=6L); k <- k+1
indi[[k]] <- list(4L,5L); k <- k+1
rbindlist(indi)
#    i j
# 1: 2 6
# 2: 4 5

For some reason, I couldn't get rbindlist to work unless the first item in
indi had explicit names ("i" and "j"), but names aren't needed for later
items.

This should be better than dynamically growing with rbind each time, but
there may be a faster way. If your criteria for selecting (i,j) can be
written down, there's likely a much faster way than looping like this.

Best,

--Frank


On Tue, Sep 17, 2013 at 2:13 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:

> I'm currently using a (moderately) complex function, call
> if f(), as a j expression to analyze my data.  The data itself
> is about 1.2M rows, which I analyze by group.
> A group may have as few as one row or as many as 10K.
> The output from the function is a two-column data.table
> where the rows are interesting (for my work) pairs of
> observations--I have no idea how many pairs will be
> interesting until the function runs, but in abstract it could
> be every unique combination (so as many as 50M rows
> of output for one call to f()).  It is common, and not an
> error, for groups to have no meaningful pairs to return.
>
> I've been using the following line to create the output for
> f():
>
> indices <- data.table(i = integer(), j = integer())
>
> I then append to 'indices' any useful pairs using:
>
> indices <- rbind(indices, list(idx[i], idx[j]))
>
> This works, but is very, very slow, in part because I'm
> using rbind().  I want to switch to using the built-in matrix,
> because rbind() should be much faster for them.  Using
> the following line to create the matrix:
>
> indices <- matrix(nrow = 0, ncol = 2, dimnames = list(c(NULL),c("i","j")))
>
> results in the following error:
>
> Logical error. Type of column should have been checked by now
>
> Note that the values returned are always integers.  Results are
> coerced via:
>
> data.table(indices)
>
> before returning from f().  If I don't explicitly coerce, I get the
> following error:
>
> j doesn't evaluate to the same number of columns for each group
>
> If someone could tell me what I'm doing wrong, or some other
> equivalent way to noticeably speed up the whole process, I'd
> be very grateful.
>
>
> -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130917/8b761d0d/attachment.html>

From FErickson at psu.edu  Tue Sep 17 23:22:50 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Tue, 17 Sep 2013 17:22:50 -0400
Subject: [datatable-help] Error in coercing matrices within j expressions
In-Reply-To: <CALhihUi-Awfc-UNOj8k=KyZQz8REihEqendTbfNuJiaQgnQ-OQ@mail.gmail.com>
References: <CALhihUjoLkB0fOZTy66ALR9vJ1sT6jwxdLHnZHAZeTEk-XOiOQ@mail.gmail.com>
 <CAJd-hdmSCZJKqSTqiGqmeW1HLNM8aWN_uGS28HXWaogVm_ULPw@mail.gmail.com>
 <CALhihUi-Awfc-UNOj8k=KyZQz8REihEqendTbfNuJiaQgnQ-OQ@mail.gmail.com>
Message-ID: <CAJd-hdmCkP5aq6cvK=kw1Nz=3Kj_ii-=zCQ0ZNY4_Cf7zS7L2Q@mail.gmail.com>

Well, rbindlist(list()) says "Null data.table" (though it doesn't pass the
is.null() test). Maybe someone else has an idea how to deal with the
no-results case. By the way, it's best to use "reply to all" to make sure
you reply to the mailing list, too; they should be able to see your message
quoted below, though.

--Frank


On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:

> Frank,
>
> Thanks.  This seems to have done the trick, so long as I'm careful to
> check for
> zero-length lists and return data.table(i = integer(), j = integer()) in
> those
> cases.  Essentially, I have to test every combination of i and j to see if
> it's
> "interesting" or not, and some groups have a lot of rows.  At the moment
> I'm
> attacking some other low hanging fruit, like speeding up the comparisons
> I have to do.
>
> As a side note, it would be kind of nice if there was a simple way to clue
> data.table to the fact that there are no rows to return, like returning
> NULL
> or NA or similar.
>
> -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
>
>
> On Tue, Sep 17, 2013 at 4:22 PM, Frank Erickson <FErickson at psu.edu> wrote:
>
>> Hi,
>>
>> I guess you could put them into a list and then rbind at the end:
>>
>> indi <- list()
>> k=1
>> indi[[k]] <- list(i=2L,j=6L); k <- k+1
>> indi[[k]] <- list(4L,5L); k <- k+1
>> rbindlist(indi)
>> #    i j
>> # 1: 2 6
>> # 2: 4 5
>>
>> For some reason, I couldn't get rbindlist to work unless the first item
>> in indi had explicit names ("i" and "j"), but names aren't needed for later
>> items.
>>
>> This should be better than dynamically growing with rbind each time, but
>> there may be a faster way. If your criteria for selecting (i,j) can be
>> written down, there's likely a much faster way than looping like this.
>>
>> Best,
>>
>> --Frank
>>
>>
>>
>> On Tue, Sep 17, 2013 at 2:13 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:
>>
>>> I'm currently using a (moderately) complex function, call
>>> if f(), as a j expression to analyze my data.  The data itself
>>> is about 1.2M rows, which I analyze by group.
>>> A group may have as few as one row or as many as 10K.
>>> The output from the function is a two-column data.table
>>> where the rows are interesting (for my work) pairs of
>>> observations--I have no idea how many pairs will be
>>> interesting until the function runs, but in abstract it could
>>> be every unique combination (so as many as 50M rows
>>> of output for one call to f()).  It is common, and not an
>>> error, for groups to have no meaningful pairs to return.
>>>
>>> I've been using the following line to create the output for
>>> f():
>>>
>>> indices <- data.table(i = integer(), j = integer())
>>>
>>> I then append to 'indices' any useful pairs using:
>>>
>>> indices <- rbind(indices, list(idx[i], idx[j]))
>>>
>>> This works, but is very, very slow, in part because I'm
>>> using rbind().  I want to switch to using the built-in matrix,
>>> because rbind() should be much faster for them.  Using
>>> the following line to create the matrix:
>>>
>>> indices <- matrix(nrow = 0, ncol = 2, dimnames =
>>> list(c(NULL),c("i","j")))
>>>
>>> results in the following error:
>>>
>>> Logical error. Type of column should have been checked by now
>>>
>>> Note that the values returned are always integers.  Results are
>>> coerced via:
>>>
>>> data.table(indices)
>>>
>>> before returning from f().  If I don't explicitly coerce, I get the
>>> following error:
>>>
>>> j doesn't evaluate to the same number of columns for each group
>>>
>>> If someone could tell me what I'm doing wrong, or some other
>>> equivalent way to noticeably speed up the whole process, I'd
>>> be very grateful.
>>>
>>>
>>> -------
>>> Nathaniel Graham
>>> npgraham1 at gmail.com
>>> npgraham1 at uky.edu
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130917/5f0e7469/attachment-0001.html>

From npgraham1 at gmail.com  Tue Sep 17 23:42:31 2013
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 17 Sep 2013 17:42:31 -0400
Subject: [datatable-help] Error in coercing matrices within j expressions
In-Reply-To: <CAJd-hdmCkP5aq6cvK=kw1Nz=3Kj_ii-=zCQ0ZNY4_Cf7zS7L2Q@mail.gmail.com>
References: <CALhihUjoLkB0fOZTy66ALR9vJ1sT6jwxdLHnZHAZeTEk-XOiOQ@mail.gmail.com>
 <CAJd-hdmSCZJKqSTqiGqmeW1HLNM8aWN_uGS28HXWaogVm_ULPw@mail.gmail.com>
 <CALhihUi-Awfc-UNOj8k=KyZQz8REihEqendTbfNuJiaQgnQ-OQ@mail.gmail.com>
 <CAJd-hdmCkP5aq6cvK=kw1Nz=3Kj_ii-=zCQ0ZNY4_Cf7zS7L2Q@mail.gmail.com>
Message-ID: <CALhihUg6GNPqBY69=bcpWwWmQfOc277y4Rv4KZD29OFAK-qhiA@mail.gmail.com>

Oops; I meant to reply to all, and then forgot after I discarded and
rewrote my
message a few times.  I suspect (although I'm not absolutely certain) that
if
NULL or similar did the same thing as returning a 0-row data.table with the
appropriate number of columns, some operations could be sped up a bit.
In those cases, the data.table code wouldn't need to check the number and
type of the columns returned.

I suspect that unless someone knows a secret, ultrafast way to iterate
through
a list of all combinations of a set of items and return the subset of those
that
match some criteria, that I'm as close to optimal as I'm likely to get
right now.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu


On Tue, Sep 17, 2013 at 5:22 PM, Frank Erickson <FErickson at psu.edu> wrote:

> Well, rbindlist(list()) says "Null data.table" (though it doesn't pass the
> is.null() test). Maybe someone else has an idea how to deal with the
> no-results case. By the way, it's best to use "reply to all" to make sure
> you reply to the mailing list, too; they should be able to see your message
> quoted below, though.
>
> --Frank
>
>
> On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:
>
>> Frank,
>>
>> Thanks.  This seems to have done the trick, so long as I'm careful to
>> check for
>> zero-length lists and return data.table(i = integer(), j = integer()) in
>> those
>> cases.  Essentially, I have to test every combination of i and j to see
>> if it's
>> "interesting" or not, and some groups have a lot of rows.  At the moment
>> I'm
>> attacking some other low hanging fruit, like speeding up the comparisons
>> I have to do.
>>
>> As a side note, it would be kind of nice if there was a simple way to clue
>> data.table to the fact that there are no rows to return, like returning
>> NULL
>> or NA or similar.
>>
>> -------
>> Nathaniel Graham
>> npgraham1 at gmail.com
>> npgraham1 at uky.edu
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130917/1d45c112/attachment.html>

From FErickson at psu.edu  Tue Sep 17 23:52:54 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Tue, 17 Sep 2013 17:52:54 -0400
Subject: [datatable-help] Error in coercing matrices within j expressions
In-Reply-To: <CALhihUg6GNPqBY69=bcpWwWmQfOc277y4Rv4KZD29OFAK-qhiA@mail.gmail.com>
References: <CALhihUjoLkB0fOZTy66ALR9vJ1sT6jwxdLHnZHAZeTEk-XOiOQ@mail.gmail.com>
 <CAJd-hdmSCZJKqSTqiGqmeW1HLNM8aWN_uGS28HXWaogVm_ULPw@mail.gmail.com>
 <CALhihUi-Awfc-UNOj8k=KyZQz8REihEqendTbfNuJiaQgnQ-OQ@mail.gmail.com>
 <CAJd-hdmCkP5aq6cvK=kw1Nz=3Kj_ii-=zCQ0ZNY4_Cf7zS7L2Q@mail.gmail.com>
 <CALhihUg6GNPqBY69=bcpWwWmQfOc277y4Rv4KZD29OFAK-qhiA@mail.gmail.com>
Message-ID: <CAJd-hdkTOCaWTFvnTcQiTr6yYe1wjSqr8OA5ZNyFDMkvAcvKGg@mail.gmail.com>

Maybe not ultrafast, but with nice syntax:

CJ(i=iset,j=jset)[criterion(i,j)]

I guess it should be parallelizable, but that wouldn't be with data.table,
if I understand this correctly:
http://stackoverflow.com/questions/14759905/data-table-and-parallel-computing


On Tue, Sep 17, 2013 at 5:42 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:

> Oops; I meant to reply to all, and then forgot after I discarded and
> rewrote my
> message a few times.  I suspect (although I'm not absolutely certain) that
> if
> NULL or similar did the same thing as returning a 0-row data.table with the
> appropriate number of columns, some operations could be sped up a bit.
> In those cases, the data.table code wouldn't need to check the number and
> type of the columns returned.
>
> I suspect that unless someone knows a secret, ultrafast way to iterate
> through
> a list of all combinations of a set of items and return the subset of
> those that
> match some criteria, that I'm as close to optimal as I'm likely to get
> right now.
>
>
> -------
> Nathaniel Graham
> npgraham1 at gmail.com
> npgraham1 at uky.edu
>
>
> On Tue, Sep 17, 2013 at 5:22 PM, Frank Erickson <FErickson at psu.edu> wrote:
>
>> Well, rbindlist(list()) says "Null data.table" (though it doesn't pass
>> the is.null() test). Maybe someone else has an idea how to deal with the
>> no-results case. By the way, it's best to use "reply to all" to make sure
>> you reply to the mailing list, too; they should be able to see your message
>> quoted below, though.
>>
>> --Frank
>>
>>
>> On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:
>>
>>> Frank,
>>>
>>> Thanks.  This seems to have done the trick, so long as I'm careful to
>>> check for
>>> zero-length lists and return data.table(i = integer(), j = integer()) in
>>> those
>>> cases.  Essentially, I have to test every combination of i and j to see
>>> if it's
>>> "interesting" or not, and some groups have a lot of rows.  At the moment
>>> I'm
>>> attacking some other low hanging fruit, like speeding up the comparisons
>>> I have to do.
>>>
>>> As a side note, it would be kind of nice if there was a simple way to
>>> clue
>>> data.table to the fact that there are no rows to return, like returning
>>> NULL
>>> or NA or similar.
>>>
>>> -------
>>> Nathaniel Graham
>>> npgraham1 at gmail.com
>>> npgraham1 at uky.edu
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130917/083dfc70/attachment.html>

From npgraham1 at gmail.com  Wed Sep 18 00:14:36 2013
From: npgraham1 at gmail.com (Nathaniel Graham)
Date: Tue, 17 Sep 2013 18:14:36 -0400
Subject: [datatable-help] Error in coercing matrices within j expressions
In-Reply-To: <CAJd-hdkTOCaWTFvnTcQiTr6yYe1wjSqr8OA5ZNyFDMkvAcvKGg@mail.gmail.com>
References: <CALhihUjoLkB0fOZTy66ALR9vJ1sT6jwxdLHnZHAZeTEk-XOiOQ@mail.gmail.com>
 <CAJd-hdmSCZJKqSTqiGqmeW1HLNM8aWN_uGS28HXWaogVm_ULPw@mail.gmail.com>
 <CALhihUi-Awfc-UNOj8k=KyZQz8REihEqendTbfNuJiaQgnQ-OQ@mail.gmail.com>
 <CAJd-hdmCkP5aq6cvK=kw1Nz=3Kj_ii-=zCQ0ZNY4_Cf7zS7L2Q@mail.gmail.com>
 <CALhihUg6GNPqBY69=bcpWwWmQfOc277y4Rv4KZD29OFAK-qhiA@mail.gmail.com>
 <CAJd-hdkTOCaWTFvnTcQiTr6yYe1wjSqr8OA5ZNyFDMkvAcvKGg@mail.gmail.com>
Message-ID: <CALhihUiTsvv5G_fXkG7=JsRVbzt_D+8D-OcHQR5C4YFkVeXgSw@mail.gmail.com>

It hadn't occurred to me to use CJ(), so I'll tinker with that this evening
and
see if there are any gains to be made there.  In theory it's highly
parallelizable,
and one of the posts Matthew points to in his comments (in the post you
reference) shows a way that it can be done (using the old multicore library,
so I'm not exactly sure how it maps to the parallel library).  In my case
though, the whole process appears to be memory bound rather than CPU
bound.  Since my machine is fairly optimal (i7-4770 with 4x8GB DDR3-1600),
I just don't think it's going to get dramatically faster.  That doesn't mean
I won't try...

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu


On Tue, Sep 17, 2013 at 5:52 PM, Frank Erickson <FErickson at psu.edu> wrote:

> Maybe not ultrafast, but with nice syntax:
>
> CJ(i=iset,j=jset)[criterion(i,j)]
>
> I guess it should be parallelizable, but that wouldn't be with data.table,
> if I understand this correctly:
> http://stackoverflow.com/questions/14759905/data-table-and-parallel-computing
>
>
> On Tue, Sep 17, 2013 at 5:42 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:
>
>> Oops; I meant to reply to all, and then forgot after I discarded and
>> rewrote my
>> message a few times.  I suspect (although I'm not absolutely certain)
>> that if
>> NULL or similar did the same thing as returning a 0-row data.table with
>> the
>> appropriate number of columns, some operations could be sped up a bit.
>> In those cases, the data.table code wouldn't need to check the number and
>> type of the columns returned.
>>
>> I suspect that unless someone knows a secret, ultrafast way to iterate
>> through
>> a list of all combinations of a set of items and return the subset of
>> those that
>> match some criteria, that I'm as close to optimal as I'm likely to get
>> right now.
>>
>>
>> -------
>> Nathaniel Graham
>> npgraham1 at gmail.com
>> npgraham1 at uky.edu
>>
>>
>> On Tue, Sep 17, 2013 at 5:22 PM, Frank Erickson <FErickson at psu.edu>wrote:
>>
>>> Well, rbindlist(list()) says "Null data.table" (though it doesn't pass
>>> the is.null() test). Maybe someone else has an idea how to deal with the
>>> no-results case. By the way, it's best to use "reply to all" to make sure
>>> you reply to the mailing list, too; they should be able to see your message
>>> quoted below, though.
>>>
>>> --Frank
>>>
>>>
>>> On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham <npgraham1 at gmail.com>wrote:
>>>
>>>> Frank,
>>>>
>>>> Thanks.  This seems to have done the trick, so long as I'm careful to
>>>> check for
>>>> zero-length lists and return data.table(i = integer(), j = integer())
>>>> in those
>>>> cases.  Essentially, I have to test every combination of i and j to see
>>>> if it's
>>>> "interesting" or not, and some groups have a lot of rows.  At the
>>>> moment I'm
>>>> attacking some other low hanging fruit, like speeding up the comparisons
>>>> I have to do.
>>>>
>>>> As a side note, it would be kind of nice if there was a simple way to
>>>> clue
>>>> data.table to the fact that there are no rows to return, like returning
>>>> NULL
>>>> or NA or similar.
>>>>
>>>> -------
>>>> Nathaniel Graham
>>>> npgraham1 at gmail.com
>>>> npgraham1 at uky.edu
>>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130917/ea70383a/attachment-0001.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 20 15:48:11 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 20 Sep 2013 09:48:11 -0400
Subject: [datatable-help] mapply cannot modify in place when iterating over
	list of DTs
Message-ID: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>

I've encountered the following issue iterating over a list of data.tables.
The issue is only with mapply, not with lapply .


Given a list of data.table's, mapply'ing over the list directly
cannot modify in place.

Also if attempting to add a new column, we get an "Invalid
.internal.selfref" warning.
Modifying an existing column does not issue a warning, but still fails to
modify-in-place

WORKAROUND:
----------
The workaround is to iterate over an index to the list, then to
  modify each data.table via list.of.DTs[[i]][ .. ]

**Interestingly, this issue occurs with `mapply`, but not `lapply`.**


EXAMPLE:
--------
  # Given a list of DT's and two lists of vectors,
  #   we want to add the corresponding vectors as columns to the DT.

## ---------------- ##
##   SAMPLE DATA:   ##
## ---------------- ##
  # list of data.tables
  list.DT <- list(
    DT1=data.table(Col1=111:115, Col2=121:125),
    DT2=data.table(Col1=211:215, Col2=221:225)
    )

  # lists of columns to add
  list.Col3 <- list(131:135, 231:235)
  list.Col4 <- list(141:145, 241:245)


## ------------------------------------ ##
##   Iterating over the list elements   ##
##     adding a new column              ##
## ------------------------------------ ##
##   Will issue warning and             ##
##     will fail to modify in place     ##
## ------------------------------------ ##
  mapply (
      function(DT, C3, C4)
         DT[, c("Col3", "Col4") := list(C3, C4)],

      list.DT,  # iterating over the list
      list.Col3, list.Col4,
      SIMPLIFY=FALSE
    )

  ## Note the lack of change
  list.DT


## ------------------------------------ ##
##   Iterating over an index            ##
## ------------------------------------ ##
  mapply (
      function(i, C3, C4)
         list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)],

      seq(list.DT),   # iterating over an index to the list
      list.Col3, list.Col4,
      SIMPLIFY=FALSE
    )

  ## Note each DT _has_ been modified
  list.DT

## ------------------------------------ ##
##   Iterating over the list elements   ##
##     modifying existing column        ##
## ------------------------------------ ##
##   No warning issued, but             ##
##     Will fail to modify in place     ##
## ------------------------------------ ##
  mapply (
      function(DT, C3, C4)
         DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)],

      list.DT,  # iterating over the list
      list.Col3, list.Col4,
      SIMPLIFY=FALSE
    )

  ## Note the lack of change (compare with output from `mapply`)
  list.DT

## ------------------------------------ ##
##                                      ##
##   `lapply` works as expected.        ##
##                                      ##
## ------------------------------------ ##

  ## NOW WITH lapply
  lapply(list.DT,
    function(DT)
      DT[, newCol := LETTERS[1:5]]
  )

  ## Note the new column:
  list.DT


# ========================== #

##   NON-WORKAROUNDS   ##
##
## I also tried all of the following alternatives
##   in hopes of being able to iterate over the list
##   directly, using `mapply`.
## None of these worked.

# (1) Creating the DTs First, then creating the list from them
    DT1 <- data.table(Col1=111:115, Col2=121:125)
    DT2 <- data.table(Col1=211:215, Col2=221:225)

    list.DT <- list(DT1=DT1,DT2=DT2 )


# (2) Same as 1, and using `copy()` in the call to `list()`
    list.DT <- list(DT1=copy(DT1),
                    DT2=copy(DT2) )

# (3) lapply'ing `copy` and then iterating over that list
    list.DT <- lapply(list.DT, copy)

# (4) Not naming the list elements
    list.DT <- list(DT1, DT2)
    # and tried
    list.DT <- list(copy(DT1), copy(DT2))

## All of the above still failed to modify in place
##   (and also issued the same warning if trying to add a column)
##    when iterating using mapply

  mapply(function(DT, C3, C4)
    DT[, c("Col3", "Col4") := list(C3, C4)],
    list.DT, list.Col3, list.Col4,
    SIMPLIFY=FALSE)


# ========================== #


Ricardo Saporta
Rutgers University, New Jersey
e: saporta at rutgers.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130920/755cad81/attachment.html>

From mdowle at mdowle.plus.com  Fri Sep 20 18:49:29 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 20 Sep 2013 17:49:29 +0100
Subject: [datatable-help] mapply cannot modify in place when iterating
 over list of DTs
In-Reply-To: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>
References: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>
Message-ID: <523C7C99.40308@mdowle.plus.com>


Hi,

What's the warning?

Matthew


On 20/09/13 14:48, Ricardo Saporta wrote:
> I've encountered the following issue iterating over a list of 
> data.tables.
> The issue is only with mapply, not with lapply .
>
> Given a list of data.table's, mapply'ing over the list directly
> cannot modify in place.
>
> Also if attempting to add a new column, we get an "Invalid 
> .internal.selfref" warning.
> Modifying an existing column does not issue a warning, but still fails 
> to modify-in-place
>
> WORKAROUND:
> ----------
> The workaround is to iterate over an index to the list, then to
>   modify each data.table via list.of.DTs[[i]][ .. ]
>
> **Interestingly, this issue occurs with `mapply`, but not `lapply`.**
>
> EXAMPLE:
> --------
>   # Given a list of DT's and two lists of vectors,
>   #   we want to add the corresponding vectors as columns to the DT.
>
> ## ---------------- ##
> ##   SAMPLE DATA:   ##
> ## ---------------- ##
>   # list of data.tables
>   list.DT <- list(
>     DT1=data.table(Col1=111:115, Col2=121:125),
>     DT2=data.table(Col1=211:215, Col2=221:225)
>     )
>
>   # lists of columns to add
>   list.Col3 <- list(131:135, 231:235)
>   list.Col4 <- list(141:145, 241:245)
>
>
> ## ------------------------------------ ##
> ##   Iterating over the list elements   ##
> ##     adding a new column              ##
> ## ------------------------------------ ##
> ##   Will issue warning and             ##
> ##     will fail to modify in place     ##
> ## ------------------------------------ ##
>   mapply (
>       function(DT, C3, C4)
>          DT[, c("Col3", "Col4") := list(C3, C4)],
>       list.DT,  # iterating over the list
>       list.Col3, list.Col4,
>       SIMPLIFY=FALSE
>     )
>
>   ## Note the lack of change
>   list.DT
>
>
> ## ------------------------------------ ##
> ##   Iterating over an index            ##
> ## ------------------------------------ ##
>   mapply (
>       function(i, C3, C4)
>          list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)],
>       seq(list.DT),   # iterating over an index to the list
>       list.Col3, list.Col4,
>       SIMPLIFY=FALSE
>     )
>
>   ## Note each DT _has_ been modified
>   list.DT
>
> ## ------------------------------------ ##
> ##   Iterating over the list elements   ##
> ##     modifying existing column        ##
> ## ------------------------------------ ##
> ##   No warning issued, but             ##
> ##     Will fail to modify in place     ##
> ## ------------------------------------ ##
>   mapply (
>       function(DT, C3, C4)
>          DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)],
>
>       list.DT,  # iterating over the list
>       list.Col3, list.Col4,
>       SIMPLIFY=FALSE
>     )
>
>   ## Note the lack of change (compare with output from `mapply`)
>   list.DT
>
> ## ------------------------------------ ##
> ##                                      ##
> ##   `lapply` works as expected.        ##
> ##                                      ##
> ## ------------------------------------ ##
>   ## NOW WITH lapply
>   lapply(list.DT,
>     function(DT)
>       DT[, newCol := LETTERS[1:5]]
>   )
>
>   ## Note the new column:
>   list.DT
>
>
>
> # ========================== #
>
> ##   NON-WORKAROUNDS   ##
> ##
> ## I also tried all of the following alternatives
> ##   in hopes of being able to iterate over the list
> ##   directly, using `mapply`.
> ## None of these worked.
>
> # (1) Creating the DTs First, then creating the list from them
>     DT1 <- data.table(Col1=111:115, Col2=121:125)
>     DT2 <- data.table(Col1=211:215, Col2=221:225)
>
>     list.DT <- list(DT1=DT1,DT2=DT2 )
>
>
> # (2) Same as 1, and using `copy()` in the call to `list()`
>     list.DT <- list(DT1=copy(DT1),
>                     DT2=copy(DT2) )
>
> # (3) lapply'ing `copy` and then iterating over that list
>     list.DT <- lapply(list.DT, copy)
>
> # (4) Not naming the list elements
>     list.DT <- list(DT1, DT2)
>     # and tried
>     list.DT <- list(copy(DT1), copy(DT2))
>
> ## All of the above still failed to modify in place
> ##   (and also issued the same warning if trying to add a column)
> ##    when iterating using mapply
>
>   mapply(function(DT, C3, C4)
>     DT[, c("Col3", "Col4") := list(C3, C4)],
>     list.DT, list.Col3, list.Col4,
>     SIMPLIFY=FALSE)
>
>
> # ========================== #
>
>
> Ricardo Saporta
> Rutgers University, New Jersey
> e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130920/989dc270/attachment-0001.html>

From spbiggs at hotmail.com  Fri Sep 20 19:16:50 2013
From: spbiggs at hotmail.com (Simon Biggs)
Date: Fri, 20 Sep 2013 14:16:50 -0300
Subject: [datatable-help] fread (boolean?) problem in 1.8.11 rev 971
Message-ID: <BAY167-W8440CA7BCBC3FF2B87E143B4220@phx.gbl>

Hi

I'm finding data.table excellent for processing of significant numbers of large files.  Trying out the latest build I'm seeing some problems (perhaps around the new boolean support?)

Loading my file with a basic call to fread() results in the error:

Error in fread("file.txt") : 
  Expected sep ('	') but 'T' ends field 25 on line 2 when detecting types: 8878	1	1	24	4	AFY057250G12	NA	2012-07-01	2013-06-30	1e+05	49999100.7232666	1.58e+08	0	33176.21	978	NA	1	EUR	0	EUR	0	EUR	TERRINC	HXG	T 
Happy to provide the full file if you let me know how.  First couple of rows reproduced below (file is tab delimited):
 
Key Id AKey PKey Peril PNum LName EDt ExDt PAPt PL PPof PD Prem CurrencyKey RK IsV CurrencyCd minDedAmt minDedCur maxDedAmt maxDedCur userIdTxt1 userIdTxt2 userIdTxt3 userIdTxt4 PStat
8878 1 1 24 4 AFY05 NA 2012-07-01 2013-06-30 1e+05 49999100.7232666 1.58e+08 0 33176.21 978 NA 1 EUR 0 EUR 0 EUR TERRINC HXG T TO NA
8878 1 3 93 4 AFJ02 NA 2012-03-31 2014-03-30 150000 3.5e+07 1.4e+08 0 3688.44 840 NA 1 USD 0 USD 0 USD TERRINC JGP T TO NA
8878 1 6 95 4 AFY08 NA 2012-04-08 2013-04-07 1e+05 29999983.4435654 336907000 0 0 826 NA 1 GBP 0 GBP 0 GBP TERRINC SPT T TU NA
8878 1 7 17 4 AFR1 NA 2012-07-12 2013-06-30 7500000 5e+07 5e+08 0 10319.34 840 NA 1 USD 0 USD 0 USD TERREXC JGP T TO NA

Many thanks for your efforts

Simon
 		 	   		  
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130920/c8970159/attachment.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 20 20:01:16 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 20 Sep 2013 14:01:16 -0400
Subject: [datatable-help] mapply cannot modify in place when iterating
 over list of DTs
In-Reply-To: <523C7C99.40308@mdowle.plus.com>
References: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>
 <523C7C99.40308@mdowle.plus.com>
Message-ID: <CAE7Aa4Q4jvZVmhAUJOLWsdTJ656EQSPM4rUW9QE-brsB46KBSQ@mail.gmail.com>

One warning per DT in the list
  (I added the line breaks)
-Rick
=============================================
Warning messages:

1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :

  Invalid .internal.selfref detected and fixed by taking a copy of the
whole table so that := can add this new column by reference. At an earlier
point, this data.table has been copied by R (or been created manually using
structure() or similar). Avoid key<-, names<- and attr<- which in R
currently (and oddly) may copy the whole data.table. Use set* syntax
instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named
objects); please upgrade to R>=v3.1.0 if that is biting. If this message
doesn't help, please report to datatable-help so the root cause can be
fixed.

2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :

  Invalid .internal.selfref detected and fixed by taking a copy of the
whole table so that := can add this new column by reference. At an earlier
point, this data.table has been copied by R (or been created manually using
structure() or similar). Avoid key<-, names<- and attr<- which in R
currently (and oddly) may copy the whole data.table. Use set* syntax
instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named
objects); please upgrade to R>=v3.1.0 if that is biting. If this message
doesn't help, please report to datatable-help so the root cause can be
fixed.
=============================================


On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:

>
> Hi,
>
> What's the warning?
>
> Matthew
>
>
>
> On 20/09/13 14:48, Ricardo Saporta wrote:
>
>  I've encountered the following issue iterating over a list of
> data.tables.
> The issue is only with mapply, not with lapply .
>
>
> Given a list of data.table's, mapply'ing over the list directly
> cannot modify in place.
>
>  Also if attempting to add a new column, we get an "Invalid
> .internal.selfref" warning.
> Modifying an existing column does not issue a warning, but still fails to
> modify-in-place
>
>  WORKAROUND:
> ----------
> The workaround is to iterate over an index to the list, then to
>   modify each data.table via list.of.DTs[[i]][ .. ]
>
>  **Interestingly, this issue occurs with `mapply`, but not `lapply`.**
>
>
> EXAMPLE:
> --------
>   # Given a list of DT's and two lists of vectors,
>   #   we want to add the corresponding vectors as columns to the DT.
>
>  ## ---------------- ##
> ##   SAMPLE DATA:   ##
> ## ---------------- ##
>   # list of data.tables
>   list.DT <- list(
>     DT1=data.table(Col1=111:115, Col2=121:125),
>     DT2=data.table(Col1=211:215, Col2=221:225)
>     )
>
>    # lists of columns to add
>   list.Col3 <- list(131:135, 231:235)
>   list.Col4 <- list(141:145, 241:245)
>
>
>  ## ------------------------------------ ##
> ##   Iterating over the list elements   ##
> ##     adding a new column              ##
> ## ------------------------------------ ##
> ##   Will issue warning and             ##
> ##     will fail to modify in place     ##
> ## ------------------------------------ ##
>   mapply (
>       function(DT, C3, C4)
>           DT[, c("Col3", "Col4") := list(C3, C4)],
>
>       list.DT,  # iterating over the list
>       list.Col3, list.Col4,
>       SIMPLIFY=FALSE
>     )
>
>    ## Note the lack of change
>   list.DT
>
>
>  ## ------------------------------------ ##
> ##   Iterating over an index            ##
> ## ------------------------------------ ##
>   mapply (
>       function(i, C3, C4)
>          list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)],
>
>       seq(list.DT),   # iterating over an index to the list
>       list.Col3, list.Col4,
>       SIMPLIFY=FALSE
>     )
>
>    ## Note each DT _has_ been modified
>   list.DT
>
>  ## ------------------------------------ ##
> ##   Iterating over the list elements   ##
> ##     modifying existing column        ##
> ## ------------------------------------ ##
> ##   No warning issued, but             ##
> ##     Will fail to modify in place     ##
> ## ------------------------------------ ##
>   mapply (
>       function(DT, C3, C4)
>          DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)],
>
>        list.DT,  # iterating over the list
>       list.Col3, list.Col4,
>       SIMPLIFY=FALSE
>     )
>
>    ## Note the lack of change (compare with output from `mapply`)
>   list.DT
>
>  ## ------------------------------------ ##
> ##                                      ##
> ##   `lapply` works as expected.        ##
> ##                                      ##
> ## ------------------------------------ ##
>
>   ## NOW WITH lapply
>   lapply(list.DT,
>     function(DT)
>       DT[, newCol := LETTERS[1:5]]
>   )
>
>    ## Note the new column:
>   list.DT
>
>
>
>  # ========================== #
>
>  ##   NON-WORKAROUNDS   ##
> ##
> ## I also tried all of the following alternatives
> ##   in hopes of being able to iterate over the list
> ##   directly, using `mapply`.
> ## None of these worked.
>
>  # (1) Creating the DTs First, then creating the list from them
>     DT1 <- data.table(Col1=111:115, Col2=121:125)
>     DT2 <- data.table(Col1=211:215, Col2=221:225)
>
>      list.DT <- list(DT1=DT1,DT2=DT2 )
>
>
>  # (2) Same as 1, and using `copy()` in the call to `list()`
>     list.DT <- list(DT1=copy(DT1),
>                     DT2=copy(DT2) )
>
>  # (3) lapply'ing `copy` and then iterating over that list
>     list.DT <- lapply(list.DT, copy)
>
>  # (4) Not naming the list elements
>     list.DT <- list(DT1, DT2)
>     # and tried
>     list.DT <- list(copy(DT1), copy(DT2))
>
>  ## All of the above still failed to modify in place
> ##   (and also issued the same warning if trying to add a column)
> ##    when iterating using mapply
>
>    mapply(function(DT, C3, C4)
>     DT[, c("Col3", "Col4") := list(C3, C4)],
>     list.DT, list.Col3, list.Col4,
>     SIMPLIFY=FALSE)
>
>
>  # ========================== #
>
>
>  Ricardo Saporta
>  Rutgers University, New Jersey
>  e: saporta at rutgers.edu
>
>
>
> _______________________________________________
> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130920/76f92312/attachment.html>

From mdowle at mdowle.plus.com  Fri Sep 20 20:18:44 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 20 Sep 2013 19:18:44 +0100
Subject: [datatable-help] mapply cannot modify in place when iterating
 over list of DTs
In-Reply-To: <CAE7Aa4Q4jvZVmhAUJOLWsdTJ656EQSPM4rUW9QE-brsB46KBSQ@mail.gmail.com>
References: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>	<523C7C99.40308@mdowle.plus.com>
 <CAE7Aa4Q4jvZVmhAUJOLWsdTJ656EQSPM4rUW9QE-brsB46KBSQ@mail.gmail.com>
Message-ID: <523C9184.2010902@mdowle.plus.com>

Does this sentence from the warning help?

" Also, in R<v3.1.0, list(DT1,DT2) copied the entire DT1 and DT2 (R's 
list() used to copy named objects); please upgrade to R>=v3.1.0 if that 
is biting. "

Matthew

On 20/09/13 19:01, Ricardo Saporta wrote:
> One warning per DT in the list
>   (I added the line breaks)
> -Rick
> =============================================
> Warning messages:
>
> 1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>
>   Invalid .internal.selfref detected and fixed by taking a copy of the 
> whole table so that := can add this new column by reference. At an 
> earlier point, this data.table has been copied by R (or been created 
> manually using structure() or similar). Avoid key<-, names<- and 
> attr<- which in R currently (and oddly) may copy the whole data.table. 
> Use set* syntax instead to avoid copying: ?set, ?setnames and 
> ?setattr. Also, in R<v3.1.0, list(DT1,DT2) copied the entire DT1 and 
> DT2 (R's list() used to copy named objects); please upgrade to 
> R>=v3.1.0 if that is biting. If this message doesn't help, please 
> report to datatable-help so the root cause can be fixed.
>
> 2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>
>   Invalid .internal.selfref detected and fixed by taking a copy of the 
> whole table so that := can add this new column by reference. At an 
> earlier point, this data.table has been copied by R (or been created 
> manually using structure() or similar). Avoid key<-, names<- and 
> attr<- which in R currently (and oddly) may copy the whole data.table. 
> Use set* syntax instead to avoid copying: ?set, ?setnames and 
> ?setattr. Also, in R<v3.1.0, list(DT1,DT2) copied the entire DT1 and 
> DT2 (R's list() used to copy named objects); please upgrade to 
> R>=v3.1.0 if that is biting. If this message doesn't help, please 
> report to datatable-help so the root cause can be fixed.
> =============================================
>
>
>
>
> On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle 
> <mdowle at mdowle.plus.com <mailto:mdowle at mdowle.plus.com>> wrote:
>
>
>     Hi,
>
>     What's the warning?
>
>     Matthew
>
>
>
>     On 20/09/13 14:48, Ricardo Saporta wrote:
>>     I've encountered the following issue iterating over a list of
>>     data.tables.
>>     The issue is only with mapply, not with lapply .
>>
>>     Given a list of data.table's, mapply'ing over the list directly
>>     cannot modify in place.
>>
>>     Also if attempting to add a new column, we get an "Invalid
>>     .internal.selfref" warning.
>>     Modifying an existing column does not issue a warning, but still
>>     fails to modify-in-place
>>
>>     WORKAROUND:
>>     ----------
>>     The workaround is to iterate over an index to the list, then to
>>       modify each data.table via list.of.DTs[[i]][ .. ]
>>
>>     **Interestingly, this issue occurs with `mapply`, but not `lapply`.**
>>
>>     EXAMPLE:
>>     --------
>>       # Given a list of DT's and two lists of vectors,
>>       #   we want to add the corresponding vectors as columns to the DT.
>>
>>     ## ---------------- ##
>>     ##   SAMPLE DATA:   ##
>>     ## ---------------- ##
>>       # list of data.tables
>>       list.DT <- list(
>>         DT1=data.table(Col1=111:115, Col2=121:125),
>>         DT2=data.table(Col1=211:215, Col2=221:225)
>>         )
>>
>>       # lists of columns to add
>>       list.Col3 <- list(131:135, 231:235)
>>       list.Col4 <- list(141:145, 241:245)
>>
>>
>>     ## ------------------------------------ ##
>>     ##   Iterating over the list elements ##
>>     ##     adding a new column  ##
>>     ## ------------------------------------ ##
>>     ##   Will issue warning and ##
>>     ##     will fail to modify in place ##
>>     ## ------------------------------------ ##
>>       mapply (
>>           function(DT, C3, C4)
>>              DT[, c("Col3", "Col4") := list(C3, C4)],
>>           list.DT,  # iterating over the list
>>           list.Col3, list.Col4,
>>           SIMPLIFY=FALSE
>>         )
>>
>>       ## Note the lack of change
>>       list.DT
>>
>>
>>     ## ------------------------------------ ##
>>     ##   Iterating over an index  ##
>>     ## ------------------------------------ ##
>>       mapply (
>>           function(i, C3, C4)
>>              list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)],
>>           seq(list.DT),   # iterating over an index to the list
>>           list.Col3, list.Col4,
>>           SIMPLIFY=FALSE
>>         )
>>
>>       ## Note each DT _has_ been modified
>>       list.DT
>>
>>     ## ------------------------------------ ##
>>     ##   Iterating over the list elements ##
>>     ##     modifying existing column  ##
>>     ## ------------------------------------ ##
>>     ##   No warning issued, but ##
>>     ##     Will fail to modify in place ##
>>     ## ------------------------------------ ##
>>       mapply (
>>           function(DT, C3, C4)
>>              DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)],
>>
>>           list.DT,  # iterating over the list
>>           list.Col3, list.Col4,
>>           SIMPLIFY=FALSE
>>         )
>>
>>       ## Note the lack of change (compare with output from `mapply`)
>>       list.DT
>>
>>     ## ------------------------------------ ##
>>     ##  ##
>>     ##   `lapply` works as expected.  ##
>>     ##  ##
>>     ## ------------------------------------ ##
>>       ## NOW WITH lapply
>>       lapply(list.DT,
>>         function(DT)
>>           DT[, newCol := LETTERS[1:5]]
>>       )
>>
>>       ## Note the new column:
>>       list.DT
>>
>>
>>
>>     # ========================== #
>>
>>     ##   NON-WORKAROUNDS   ##
>>     ##
>>     ## I also tried all of the following alternatives
>>     ##   in hopes of being able to iterate over the list
>>     ##   directly, using `mapply`.
>>     ## None of these worked.
>>
>>     # (1) Creating the DTs First, then creating the list from them
>>         DT1 <- data.table(Col1=111:115, Col2=121:125)
>>         DT2 <- data.table(Col1=211:215, Col2=221:225)
>>
>>         list.DT <- list(DT1=DT1,DT2=DT2 )
>>
>>
>>     # (2) Same as 1, and using `copy()` in the call to `list()`
>>         list.DT <- list(DT1=copy(DT1),
>>                         DT2=copy(DT2) )
>>
>>     # (3) lapply'ing `copy` and then iterating over that list
>>         list.DT <- lapply(list.DT, copy)
>>
>>     # (4) Not naming the list elements
>>         list.DT <- list(DT1, DT2)
>>         # and tried
>>         list.DT <- list(copy(DT1), copy(DT2))
>>
>>     ## All of the above still failed to modify in place
>>     ##   (and also issued the same warning if trying to add a column)
>>     ##    when iterating using mapply
>>
>>       mapply(function(DT, C3, C4)
>>         DT[, c("Col3", "Col4") := list(C3, C4)],
>>         list.DT, list.Col3, list.Col4,
>>         SIMPLIFY=FALSE)
>>
>>
>>     # ========================== #
>>
>>
>>     Ricardo Saporta
>>     Rutgers University, New Jersey
>>     e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>>
>>
>>
>>     _______________________________________________
>>     datatable-help mailing list
>>     datatable-help at lists.r-forge.r-project.org  <mailto:datatable-help at lists.r-forge.r-project.org>
>>     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130920/7c560937/attachment-0001.html>

From mdowle at mdowle.plus.com  Fri Sep 20 20:40:47 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 20 Sep 2013 19:40:47 +0100
Subject: [datatable-help] fread (boolean?) problem in 1.8.11 rev 971
In-Reply-To: <BAY167-W8440CA7BCBC3FF2B87E143B4220@phx.gbl>
References: <BAY167-W8440CA7BCBC3FF2B87E143B4220@phx.gbl>
Message-ID: <523C96AF.3080201@mdowle.plus.com>

Hi,
Many thanks. Now fixed - commit 973.
Matthew

On 20/09/13 18:16, Simon Biggs wrote:
> Hi
>
> I'm finding data.table excellent for processing of significant numbers 
> of large files. Trying out the latest build I'm seeing some problems 
> (perhaps around the new boolean support?)
>
> Loading my file with a basic call to fread() results in the error:
>
> Error in fread("file.txt") :
>    Expected sep ('	') but 'T' ends field 25 on line 2 when detecting types: 8878	1	1	24	4	AFY057250G12	NA	2012-07-01	2013-06-30	1e+05	49999100.7232666	1.58e+08	0	33176.21	978	NA	1	EUR	0	EUR	0	EUR	TERRINC	HXG	T
>
> Happy to provide the full file if you let me know how.  First couple 
> of rows reproduced below (file is tab delimited):
>
> Key Id AKey PKey Peril PNum LName EDt ExDt PAPt PL PPof PD Prem CurrencyKey RK IsV CurrencyCd minDedAmt minDedCur maxDedAmt maxDedCur userIdTxt1 userIdTxt2 userIdTxt3 userIdTxt4 PStat
> 8878 1 1 24 4 AFY05 NA 2012-07-01 2013-06-30 1e+05 49999100.7232666 1.58e+08 0 33176.21 978 NA 1 EUR 0 EUR 0 EUR TERRINC HXG T TO NA
> 8878 1 3 93 4 AFJ02 NA 2012-03-31 2014-03-30 150000 3.5e+07 1.4e+08 0 3688.44 840 NA 1 USD 0 USD 0 USD TERRINC JGP T TO NA
> 8878 1 6 95 4 AFY08 NA 2012-04-08 2013-04-07 1e+05 29999983.4435654 336907000 0 0 826 NA 1 GBP 0 GBP 0 GBP TERRINC SPT T TU NA
> 8878 1 7 17 4 AFR1 NA 2012-07-12 2013-06-30 7500000 5e+07 5e+08 0 10319.34 840 NA 1 USD 0 USD 0 USD TERREXC JGP T TO NA
>
> Many thanks for your efforts
>
> Simon
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130920/56589e39/attachment.html>

From saporta at scarletmail.rutgers.edu  Sun Sep 22 03:44:29 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Sat, 21 Sep 2013 21:44:29 -0400
Subject: [datatable-help] by=".Col" produces NA column names
Message-ID: <CAE7Aa4Qc7FCyHqiWB=iowYvTCGc87m6NN7NkUZfgKLeRWz7ZTQ@mail.gmail.com>

 I submitted the below as bug 4927

I believe the fix is a simple regex modification, but I dont want to mess
with the regex too hastilly and possibly break something.  Would someone
care to double check this?

---------------

Issue:
----
Given a data.table with a dot in the column name, using that column name as
an argument to `by=` produces different results when the column name is
quoted than when it is not.

eg:

  DT
     .Col val
  1:    A   1
  2:    B   2

  identical(DT[, sum(val), by=.Col],
            DT[, sum(val), by=".Col"] )
  # [1] FALSE


Specifically, if quotes are used NAs are produced in place of the column
name.

Examples follow at the bottom of this email.  I believe the issue is in the
regex pattern in a call to `grep` in "[.data.table"

The line is copied and pasted here.
(currently line 743 in "data.table.r", which is inside "if
(any(bynames=="")){..}")

## ORIGINAL
tt = grep("^eval|^[^[:alpha:]
]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]

## SHOULD (I believe) BE CHANGED TO
tt = grep("^eval|^[^(\\.|[:alpha:])
]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]
## ... to allow for the name to start with a period.


## CONTEXT:
            if (any(bynames=="")) {
                if (length(bysubl)<2) stop("When 'by' or 'keyby' is list()
we expect something inside the brackets")
                for (jj in seq_along(bynames)) {
                    if (bynames[jj]=="") {
                        # Best guess. Use "month" in the case of
by=month(date), use "a" in the case of by=a%%2
~~~~ THIS LINE ~~~>     tt = grep("^eval|^[^[:alpha:]
]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]
                        if (!length(tt)) tt = all.vars(bysubl[[jj+1L]])[1L]
                        bynames[jj] = tt
                        # if user doesn't like this inferred name, user has
to use by=list() to name the column
                    }
                }
            }

---------------------------------------------------


EXAMPLE:

DT <- data.table(.Col = LETTERS[c(1:3, 1:3)], val=1:6)

identical(DT[, sum(val), by=.Col],
          DT[, sum(val), by=".Col"] )
# [1] FALSE

## This works as expected
DT[, sum(val), by=.Col]

   .Col V1
1:    A  5
2:    B  7
3:    C  9

## Putting the column name within quotes
##   produces NA in the column names
DT[, sum(val), by=c(".Col")]
DT[, sum(val), by=".Col"]  # both lines, same output

   NA V1  <~~~  NOTICE
1:  A  5
2:  B  7
3:  C  9

# notice if we try to use `keyby` we get the following error
DT[, sum(val), keyby=".Col"]
# Error in setkeyv(ans, names(ans)[seq_along(byval)]) :
#   Column 'NA' is type 'NULL' which is not (currently) allowed as a key
column type.

## and this works correctly too
DT[, sum(val), by=list(.Col=.Col)]
   .Col V1
1:    A  5
2:    B  7
3:    C  9

---------------------------------------------------

Only happen with a dot at the start of the name

## Appears to be only an issue when there is a
DT2 <- data.table(Col. = LETTERS[c(1:3, 1:3)], val=1:6)

DT2[, sum(val), by=Col.]
DT2[, sum(val), by=c("Col.")]

   Col. V1    <~~~ As expected
1:    A  5
2:    B  7
3:    C  9


--
Ricardo Saporta
Rutgers University, New Jersey
e: saporta at rutgers.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130921/ec10abc7/attachment.html>

From saporta at scarletmail.rutgers.edu  Sun Sep 22 04:02:40 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Sat, 21 Sep 2013 22:02:40 -0400
Subject: [datatable-help] mapply cannot modify in place when iterating
 over list of DTs
In-Reply-To: <523C9184.2010902@mdowle.plus.com>
References: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>
 <523C7C99.40308@mdowle.plus.com>
 <CAE7Aa4Q4jvZVmhAUJOLWsdTJ656EQSPM4rUW9QE-brsB46KBSQ@mail.gmail.com>
 <523C9184.2010902@mdowle.plus.com>
Message-ID: <CAE7Aa4TStQqej+j=UMQo6NQJs5Z8214bLoCa0oGJnw+RBSOk+Q@mail.gmail.com>

Matthew,

I did notice the warning, but something doesnt add up:

If the issue is simply that it is being copied when created, then wouldnt
we expect the same warning to arise when we try to modify the table in
using `mapply` or `lapply`? (the latter does not produce a warning.

If on the otherhand, the issue pertains specifically to mapply (which I
assume it does), then why is it only a problem when we iterate over the
list directly, whereas iterating indirectly by using an index does not
produce any warnings.

While overall, this is minor if one is aware of the issue, I think it might
allow for unnoticed bugs to creep into someones code.   Specifically if
using mapply to modify a list of DTs and the user not realizing that the
modifications are not being held.

That being said, I'm not sure how this could even be addressed if the root
is in mapply, but is it worth trying to address?

Rick


On Fri, Sep 20, 2013 at 2:18 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:

>  Does this sentence from the warning help?
>
>
> " Also, in R<v3.1.0, list(DT1,DT2) copied the entire DT1 and DT2 (R's
> list() used to copy named objects); please upgrade to R>=v3.1.0 if that is
> biting. "
>
> Matthew
>
>
> On 20/09/13 19:01, Ricardo Saporta wrote:
>
> One warning per DT in the list
>   (I added the line breaks)
> -Rick
> =============================================
>  Warning messages:
>
>  1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>
>    Invalid .internal.selfref detected and fixed by taking a copy of the
> whole table so that := can add this new column by reference. At an earlier
> point, this data.table has been copied by R (or been created manually using
> structure() or similar). Avoid key<-, names<- and attr<- which in R
> currently (and oddly) may copy the whole data.table. Use set* syntax
> instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named
> objects); please upgrade to R>=v3.1.0 if that is biting. If this message
> doesn't help, please report to datatable-help so the root cause can be
> fixed.
>
>  2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>
>    Invalid .internal.selfref detected and fixed by taking a copy of the
> whole table so that := can add this new column by reference. At an earlier
> point, this data.table has been copied by R (or been created manually using
> structure() or similar). Avoid key<-, names<- and attr<- which in R
> currently (and oddly) may copy the whole data.table. Use set* syntax
> instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named
> objects); please upgrade to R>=v3.1.0 if that is biting. If this message
> doesn't help, please report to datatable-help so the root cause can be
> fixed.
>  =============================================
>
>
>
>
> On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>
>>
>> Hi,
>>
>> What's the warning?
>>
>> Matthew
>>
>>
>>
>> On 20/09/13 14:48, Ricardo Saporta wrote:
>>
>>   I've encountered the following issue iterating over a list of
>> data.tables.
>> The issue is only with mapply, not with lapply .
>>
>>
>> Given a list of data.table's, mapply'ing over the list directly
>> cannot modify in place.
>>
>>  Also if attempting to add a new column, we get an "Invalid
>> .internal.selfref" warning.
>> Modifying an existing column does not issue a warning, but still fails to
>> modify-in-place
>>
>>  WORKAROUND:
>> ----------
>> The workaround is to iterate over an index to the list, then to
>>   modify each data.table via list.of.DTs[[i]][ .. ]
>>
>>  **Interestingly, this issue occurs with `mapply`, but not `lapply`.**
>>
>>
>> EXAMPLE:
>> --------
>>   # Given a list of DT's and two lists of vectors,
>>   #   we want to add the corresponding vectors as columns to the DT.
>>
>>  ## ---------------- ##
>> ##   SAMPLE DATA:   ##
>> ## ---------------- ##
>>   # list of data.tables
>>   list.DT <- list(
>>     DT1=data.table(Col1=111:115, Col2=121:125),
>>     DT2=data.table(Col1=211:215, Col2=221:225)
>>     )
>>
>>    # lists of columns to add
>>   list.Col3 <- list(131:135, 231:235)
>>   list.Col4 <- list(141:145, 241:245)
>>
>>
>>  ## ------------------------------------ ##
>> ##   Iterating over the list elements   ##
>> ##     adding a new column              ##
>> ## ------------------------------------ ##
>> ##   Will issue warning and             ##
>> ##     will fail to modify in place     ##
>> ## ------------------------------------ ##
>>   mapply (
>>       function(DT, C3, C4)
>>           DT[, c("Col3", "Col4") := list(C3, C4)],
>>
>>       list.DT,  # iterating over the list
>>       list.Col3, list.Col4,
>>       SIMPLIFY=FALSE
>>     )
>>
>>    ## Note the lack of change
>>   list.DT
>>
>>
>>  ## ------------------------------------ ##
>> ##   Iterating over an index            ##
>> ## ------------------------------------ ##
>>   mapply (
>>       function(i, C3, C4)
>>          list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)],
>>
>>       seq(list.DT),   # iterating over an index to the list
>>       list.Col3, list.Col4,
>>       SIMPLIFY=FALSE
>>     )
>>
>>    ## Note each DT _has_ been modified
>>   list.DT
>>
>>  ## ------------------------------------ ##
>> ##   Iterating over the list elements   ##
>> ##     modifying existing column        ##
>> ## ------------------------------------ ##
>> ##   No warning issued, but             ##
>> ##     Will fail to modify in place     ##
>> ## ------------------------------------ ##
>>   mapply (
>>       function(DT, C3, C4)
>>          DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)],
>>
>>        list.DT,  # iterating over the list
>>       list.Col3, list.Col4,
>>       SIMPLIFY=FALSE
>>     )
>>
>>    ## Note the lack of change (compare with output from `mapply`)
>>   list.DT
>>
>>  ## ------------------------------------ ##
>> ##                                      ##
>> ##   `lapply` works as expected.        ##
>> ##                                      ##
>> ## ------------------------------------ ##
>>
>>   ## NOW WITH lapply
>>   lapply(list.DT,
>>     function(DT)
>>       DT[, newCol := LETTERS[1:5]]
>>   )
>>
>>    ## Note the new column:
>>   list.DT
>>
>>
>>
>>  # ========================== #
>>
>>  ##   NON-WORKAROUNDS   ##
>> ##
>> ## I also tried all of the following alternatives
>> ##   in hopes of being able to iterate over the list
>> ##   directly, using `mapply`.
>> ## None of these worked.
>>
>>  # (1) Creating the DTs First, then creating the list from them
>>     DT1 <- data.table(Col1=111:115, Col2=121:125)
>>     DT2 <- data.table(Col1=211:215, Col2=221:225)
>>
>>      list.DT <- list(DT1=DT1,DT2=DT2 )
>>
>>
>>  # (2) Same as 1, and using `copy()` in the call to `list()`
>>     list.DT <- list(DT1=copy(DT1),
>>                     DT2=copy(DT2) )
>>
>>  # (3) lapply'ing `copy` and then iterating over that list
>>     list.DT <- lapply(list.DT, copy)
>>
>>  # (4) Not naming the list elements
>>     list.DT <- list(DT1, DT2)
>>     # and tried
>>     list.DT <- list(copy(DT1), copy(DT2))
>>
>>  ## All of the above still failed to modify in place
>> ##   (and also issued the same warning if trying to add a column)
>> ##    when iterating using mapply
>>
>>    mapply(function(DT, C3, C4)
>>     DT[, c("Col3", "Col4") := list(C3, C4)],
>>     list.DT, list.Col3, list.Col4,
>>     SIMPLIFY=FALSE)
>>
>>
>>  # ========================== #
>>
>>
>>  Ricardo Saporta
>>  Rutgers University, New Jersey
>>  e: saporta at rutgers.edu
>>
>>
>>
>>  _______________________________________________
>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130921/425846fe/attachment-0001.html>

From karl at huftis.org  Sun Sep 22 11:38:33 2013
From: karl at huftis.org (Karl Ove Hufthammer)
Date: Sun, 22 Sep 2013 11:38:33 +0200
Subject: [datatable-help] (no subject)
Message-ID: <1379842713.3310.1.camel@adrian.site>


From szehnder at uni-bonn.de  Mon Sep 23 17:08:59 2013
From: szehnder at uni-bonn.de (Simon Zehnder)
Date: Mon, 23 Sep 2013 17:08:59 +0200
Subject: [datatable-help] What the status on fast time and data.table?
Message-ID: <04189485-CEBA-4D04-8EF0-3BD49D0E0E00@uni-bonn.de>

Dear Users and Devels,

I read this thread http://r.789695.n4.nabble.com/About-adding-fastmatch-and-fasttime-to-data-table-td4659622.html and I would like to ask, if there have been any proceedings? What is the status of fast time in data.table?


Best

Simon


From mdowle at mdowle.plus.com  Tue Sep 24 03:25:41 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 24 Sep 2013 02:25:41 +0100
Subject: [datatable-help] What the status on fast time and data.table?
In-Reply-To: <04189485-CEBA-4D04-8EF0-3BD49D0E0E00@uni-bonn.de>
References: <04189485-CEBA-4D04-8EF0-3BD49D0E0E00@uni-bonn.de>
Message-ID: <5240EA15.4040709@mdowle.plus.com>


Hi,
Sorry no progress yet.  But it's on the list.  You currently have to 
read as character and then use Simon's package.  It isn't yet built in.
Matthew


On 23/09/13 16:08, Simon Zehnder wrote:
> Dear Users and Devels,
>
> I read this thread http://r.789695.n4.nabble.com/About-adding-fastmatch-and-fasttime-to-data-table-td4659622.html and I would like to ask, if there have been any proceedings? What is the status of fast time in data.table?
>
>
> Best
>
> Simon
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>


From mdowle at mdowle.plus.com  Tue Sep 24 03:42:38 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Tue, 24 Sep 2013 02:42:38 +0100
Subject: [datatable-help] mapply cannot modify in place when iterating
 over list of DTs
In-Reply-To: <CAE7Aa4TStQqej+j=UMQo6NQJs5Z8214bLoCa0oGJnw+RBSOk+Q@mail.gmail.com>
References: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>
 <523C7C99.40308@mdowle.plus.com>
 <CAE7Aa4Q4jvZVmhAUJOLWsdTJ656EQSPM4rUW9QE-brsB46KBSQ@mail.gmail.com>
 <523C9184.2010902@mdowle.plus.com>
 <CAE7Aa4TStQqej+j=UMQo6NQJs5Z8214bLoCa0oGJnw+RBSOk+Q@mail.gmail.com>
Message-ID: <5240EE0E.8090001@mdowle.plus.com>


Hi,
Basically adding columns by reference to a data.table when it's a member 
of a list of data.table, is really difficult to handle internally.  I 
had to special case internally to get around list() copying, so that the 
binding can change inside the list on the shallow copy when [[ is used.  
A for loop is the way to add columns by reference inside a list of 
data.table, and that should work ok using [[.  But doing that via lapply 
and mapply is really stretching it.  Even catching user expectations in 
this area is difficult.  Ideally we'd catch mapply, yes,  but really 
data.table likes to be rbindlist()-ed and then ops to work on a single 
large data.table.  We can advice to the warning message not to use 
mapply or lapply to add columns by reference to a list of data.table 
(use a for loop instead) ?
Matthew


On 22/09/13 03:02, Ricardo Saporta wrote:
> Matthew,
>
> I did notice the warning, but something doesnt add up:
>
> If the issue is simply that it is being copied when created, then 
> wouldnt we expect the same warning to arise when we try to modify the 
> table in using `mapply` or `lapply`? (the latter does not produce a 
> warning.
>
> If on the otherhand, the issue pertains specifically to mapply (which 
> I assume it does), then why is it only a problem when we iterate over 
> the list directly, whereas iterating indirectly by using an index does 
> not produce any warnings.
> While overall, this is minor if one is aware of the issue, I think it 
> might allow for unnoticed bugs to creep into someones code. 
> Specifically if using mapply to modify a list of DTs and the user not 
> realizing that the modifications are not being held.
>
> That being said, I'm not sure how this could even be addressed if the 
> root is in mapply, but is it worth trying to address?
>
> Rick
>
>
> On Fri, Sep 20, 2013 at 2:18 PM, Matthew Dowle <mdowle at mdowle.plus.com 
> <mailto:mdowle at mdowle.plus.com>> wrote:
>
>     Does this sentence from the warning help?
>
>
>     " Also, in R<v3.1.0, list(DT1,DT2) copied the entire DT1 and DT2
>     (R's list() used to copy named objects); please upgrade to
>     R>=v3.1.0 if that is biting. "
>
>     Matthew
>
>
>     On 20/09/13 19:01, Ricardo Saporta wrote:
>>     One warning per DT in the list
>>       (I added the line breaks)
>>     -Rick
>>     =============================================
>>     Warning messages:
>>
>>     1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>>
>>       Invalid .internal.selfref detected and fixed by taking a copy
>>     of the whole table so that := can add this new column by
>>     reference. At an earlier point, this data.table has been copied
>>     by R (or been created manually using structure() or similar).
>>     Avoid key<-, names<- and attr<- which in R currently (and oddly)
>>     may copy the whole data.table. Use set* syntax instead to avoid
>>     copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
>>     list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to
>>     copy named objects); please upgrade to R>=v3.1.0 if that is
>>     biting. If this message doesn't help, please report to
>>     datatable-help so the root cause can be fixed.
>>
>>     2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>>
>>       Invalid .internal.selfref detected and fixed by taking a copy
>>     of the whole table so that := can add this new column by
>>     reference. At an earlier point, this data.table has been copied
>>     by R (or been created manually using structure() or similar).
>>     Avoid key<-, names<- and attr<- which in R currently (and oddly)
>>     may copy the whole data.table. Use set* syntax instead to avoid
>>     copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
>>     list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to
>>     copy named objects); please upgrade to R>=v3.1.0 if that is
>>     biting. If this message doesn't help, please report to
>>     datatable-help so the root cause can be fixed.
>>     =============================================
>>
>>
>>
>>
>>     On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle
>>     <mdowle at mdowle.plus.com <mailto:mdowle at mdowle.plus.com>> wrote:
>>
>>
>>         Hi,
>>
>>         What's the warning?
>>
>>         Matthew
>>
>>
>>
>>         On 20/09/13 14:48, Ricardo Saporta wrote:
>>>         I've encountered the following issue iterating over a list
>>>         of data.tables.
>>>         The issue is only with mapply, not with lapply .
>>>
>>>         Given a list of data.table's, mapply'ing over the list directly
>>>         cannot modify in place.
>>>
>>>         Also if attempting to add a new column, we get an "Invalid
>>>         .internal.selfref" warning.
>>>         Modifying an existing column does not issue a warning, but
>>>         still fails to modify-in-place
>>>
>>>         WORKAROUND:
>>>         ----------
>>>         The workaround is to iterate over an index to the list, then to
>>>           modify each data.table via list.of.DTs[[i]][ .. ]
>>>
>>>         **Interestingly, this issue occurs with `mapply`, but not
>>>         `lapply`.**
>>>
>>>         EXAMPLE:
>>>         --------
>>>           # Given a list of DT's and two lists of vectors,
>>>           #   we want to add the corresponding vectors as columns to
>>>         the DT.
>>>
>>>         ## ---------------- ##
>>>         ##   SAMPLE DATA:   ##
>>>         ## ---------------- ##
>>>           # list of data.tables
>>>           list.DT <- list(
>>>         DT1=data.table(Col1=111:115, Col2=121:125),
>>>         DT2=data.table(Col1=211:215, Col2=221:225)
>>>             )
>>>
>>>           # lists of columns to add
>>>           list.Col3 <- list(131:135, 231:235)
>>>           list.Col4 <- list(141:145, 241:245)
>>>
>>>
>>>         ## ------------------------------------ ##
>>>         ##   Iterating over the list elements   ##
>>>         ##     adding a new column              ##
>>>         ## ------------------------------------ ##
>>>         ##   Will issue warning and             ##
>>>         ##     will fail to modify in place     ##
>>>         ## ------------------------------------ ##
>>>           mapply (
>>>               function(DT, C3, C4)
>>>                  DT[, c("Col3", "Col4") := list(C3, C4)],
>>>               list.DT,  # iterating over the list
>>>               list.Col3, list.Col4,
>>>               SIMPLIFY=FALSE
>>>             )
>>>
>>>           ## Note the lack of change
>>>           list.DT
>>>
>>>
>>>         ## ------------------------------------ ##
>>>         ##   Iterating over an index            ##
>>>         ## ------------------------------------ ##
>>>           mapply (
>>>               function(i, C3, C4)
>>>                  list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)],
>>>               seq(list.DT),   # iterating over an index to the list
>>>               list.Col3, list.Col4,
>>>               SIMPLIFY=FALSE
>>>             )
>>>
>>>           ## Note each DT _has_ been modified
>>>           list.DT
>>>
>>>         ## ------------------------------------ ##
>>>         ##   Iterating over the list elements   ##
>>>         ##     modifying existing column        ##
>>>         ## ------------------------------------ ##
>>>         ##   No warning issued, but             ##
>>>         ##     Will fail to modify in place     ##
>>>         ## ------------------------------------ ##
>>>           mapply (
>>>               function(DT, C3, C4)
>>>                  DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)],
>>>
>>>               list.DT,  # iterating over the list
>>>               list.Col3, list.Col4,
>>>               SIMPLIFY=FALSE
>>>             )
>>>
>>>           ## Note the lack of change (compare with output from `mapply`)
>>>           list.DT
>>>
>>>         ## ------------------------------------ ##
>>>         ##                ##
>>>         ##   `lapply` works as expected.        ##
>>>         ##                ##
>>>         ## ------------------------------------ ##
>>>           ## NOW WITH lapply
>>>           lapply(list.DT,
>>>             function(DT)
>>>               DT[, newCol := LETTERS[1:5]]
>>>           )
>>>
>>>           ## Note the new column:
>>>           list.DT
>>>
>>>
>>>
>>>         # ========================== #
>>>
>>>         ##   NON-WORKAROUNDS ##
>>>         ##
>>>         ## I also tried all of the following alternatives
>>>         ##   in hopes of being able to iterate over the list
>>>         ##   directly, using `mapply`.
>>>         ## None of these worked.
>>>
>>>         # (1) Creating the DTs First, then creating the list from them
>>>             DT1 <- data.table(Col1=111:115, Col2=121:125)
>>>             DT2 <- data.table(Col1=211:215, Col2=221:225)
>>>
>>>             list.DT <- list(DT1=DT1,DT2=DT2 )
>>>
>>>
>>>         # (2) Same as 1, and using `copy()` in the call to `list()`
>>>             list.DT <- list(DT1=copy(DT1),
>>>         DT2=copy(DT2) )
>>>
>>>         # (3) lapply'ing `copy` and then iterating over that list
>>>             list.DT <- lapply(list.DT, copy)
>>>
>>>         # (4) Not naming the list elements
>>>             list.DT <- list(DT1, DT2)
>>>             # and tried
>>>             list.DT <- list(copy(DT1), copy(DT2))
>>>
>>>         ## All of the above still failed to modify in place
>>>         ##   (and also issued the same warning if trying to add a
>>>         column)
>>>         ##    when iterating using mapply
>>>
>>>           mapply(function(DT, C3, C4)
>>>             DT[, c("Col3", "Col4") := list(C3, C4)],
>>>             list.DT, list.Col3, list.Col4,
>>>             SIMPLIFY=FALSE)
>>>
>>>
>>>         # ========================== #
>>>
>>>
>>>         Ricardo Saporta
>>>         Rutgers University, New Jersey
>>>         e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>>>
>>>
>>>
>>>         _______________________________________________
>>>         datatable-help mailing list
>>>         datatable-help at lists.r-forge.r-project.org  <mailto:datatable-help at lists.r-forge.r-project.org>
>>>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130924/963b7191/attachment-0001.html>

From saporta at scarletmail.rutgers.edu  Tue Sep 24 06:15:18 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Tue, 24 Sep 2013 00:15:18 -0400
Subject: [datatable-help] mapply cannot modify in place when iterating
 over list of DTs
In-Reply-To: <5240EE0E.8090001@mdowle.plus.com>
References: <CAE7Aa4TyhNFbpZ7_J9nseNgwW-N0k9QzgyOBbrprnRxi4umcDw@mail.gmail.com>
 <523C7C99.40308@mdowle.plus.com>
 <CAE7Aa4Q4jvZVmhAUJOLWsdTJ656EQSPM4rUW9QE-brsB46KBSQ@mail.gmail.com>
 <523C9184.2010902@mdowle.plus.com>
 <CAE7Aa4TStQqej+j=UMQo6NQJs5Z8214bLoCa0oGJnw+RBSOk+Q@mail.gmail.com>
 <5240EE0E.8090001@mdowle.plus.com>
Message-ID: <CAE7Aa4R5ob8as9F8aMLF3R+bi-kJoGkposz+d6mjw04eLjYaNA@mail.gmail.com>

On Mon, Sep 23, 2013 at 9:42 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:

>
> Hi,
> Basically adding columns by reference to a data.table when it's a member
> of a list of data.table, is really difficult to handle internally.  I had
> to special case internally to get around list() copying, so that the
> binding can change inside the list on the shallow copy when [[ is used.  A
> for loop is the way to add columns by reference inside a list of
> data.table, and that should work ok using [[.  But doing that via lapply
> and mapply is really stretching it.
>

That makes sense.  I took a whack at it, but couldn't even come close.


> Even catching user expectations in this area is difficult.  Ideally we'd
> catch mapply, yes,  but really data.table likes to be rbindlist()-ed and
> then ops to work on a single large data.table.
>

Agreed.  In the application where this came up, I am dealing with a list of
tables with different dims (hence not rbinding)


> We can advice to the warning message not to use mapply or lapply to add
> columns by reference to a list of data.table (use a for loop instead) ?
>

Perhaps a warning that modifications to the DT's in the list are likely to
not have stuck and to use rbindlist when possible?


>
> Matthew
>
>
>
> On 22/09/13 03:02, Ricardo Saporta wrote:
>
> Matthew,
>
>  I did notice the warning, but something doesnt add up:
>
>  If the issue is simply that it is being copied when created, then
> wouldnt we expect the same warning to arise when we try to modify the table
> in using `mapply` or `lapply`? (the latter does not produce a warning.
>
>  If on the otherhand, the issue pertains specifically to mapply (which I
> assume it does), then why is it only a problem when we iterate over the
> list directly, whereas iterating indirectly by using an index does not
> produce any warnings.
>
>   While overall, this is minor if one is aware of the issue, I think it
> might allow for unnoticed bugs to creep into someones code.   Specifically
> if using mapply to modify a list of DTs and the user not realizing that the
> modifications are not being held.
>
>  That being said, I'm not sure how this could even be addressed if the
> root is in mapply, but is it worth trying to address?
>
>  Rick
>
>
> On Fri, Sep 20, 2013 at 2:18 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>
>>  Does this sentence from the warning help?
>>
>>
>> " Also, in R<v3.1.0, list(DT1,DT2) copied the entire DT1 and DT2 (R's
>> list() used to copy named objects); please upgrade to R>=v3.1.0 if that is
>> biting. "
>>
>>  Matthew
>>
>>
>> On 20/09/13 19:01, Ricardo Saporta wrote:
>>
>> One warning per DT in the list
>>   (I added the line breaks)
>> -Rick
>> =============================================
>>  Warning messages:
>>
>>  1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>>
>>    Invalid .internal.selfref detected and fixed by taking a copy of the
>> whole table so that := can add this new column by reference. At an earlier
>> point, this data.table has been copied by R (or been created manually using
>> structure() or similar). Avoid key<-, names<- and attr<- which in R
>> currently (and oddly) may copy the whole data.table. Use set* syntax
>> instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
>> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named
>> objects); please upgrade to R>=v3.1.0 if that is biting. If this message
>> doesn't help, please report to datatable-help so the root cause can be
>> fixed.
>>
>>  2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) :
>>
>>    Invalid .internal.selfref detected and fixed by taking a copy of the
>> whole table so that := can add this new column by reference. At an earlier
>> point, this data.table has been copied by R (or been created manually using
>> structure() or similar). Avoid key<-, names<- and attr<- which in R
>> currently (and oddly) may copy the whole data.table. Use set* syntax
>> instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0,
>> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named
>> objects); please upgrade to R>=v3.1.0 if that is biting. If this message
>> doesn't help, please report to datatable-help so the root cause can be
>> fixed.
>>  =============================================
>>
>>
>>
>>
>> On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>>
>>>
>>> Hi,
>>>
>>> What's the warning?
>>>
>>> Matthew
>>>
>>>
>>>
>>> On 20/09/13 14:48, Ricardo Saporta wrote:
>>>
>>>   I've encountered the following issue iterating over a list of
>>> data.tables.
>>> The issue is only with mapply, not with lapply .
>>>
>>>
>>> Given a list of data.table's, mapply'ing over the list directly
>>> cannot modify in place.
>>>
>>>  Also if attempting to add a new column, we get an "Invalid
>>> .internal.selfref" warning.
>>> Modifying an existing column does not issue a warning, but still fails
>>> to modify-in-place
>>>
>>>  WORKAROUND:
>>> ----------
>>> The workaround is to iterate over an index to the list, then to
>>>   modify each data.table via list.of.DTs[[i]][ .. ]
>>>
>>>  **Interestingly, this issue occurs with `mapply`, but not `lapply`.**
>>>
>>>
>>> EXAMPLE:
>>> --------
>>>   # Given a list of DT's and two lists of vectors,
>>>   #   we want to add the corresponding vectors as columns to the DT.
>>>
>>>  ## ---------------- ##
>>> ##   SAMPLE DATA:   ##
>>> ## ---------------- ##
>>>   # list of data.tables
>>>   list.DT <- list(
>>>     DT1=data.table(Col1=111:115, Col2=121:125),
>>>     DT2=data.table(Col1=211:215, Col2=221:225)
>>>     )
>>>
>>>    # lists of columns to add
>>>   list.Col3 <- list(131:135, 231:235)
>>>   list.Col4 <- list(141:145, 241:245)
>>>
>>>
>>>  ## ------------------------------------ ##
>>> ##   Iterating over the list elements   ##
>>> ##     adding a new column              ##
>>> ## ------------------------------------ ##
>>> ##   Will issue warning and             ##
>>> ##     will fail to modify in place     ##
>>> ## ------------------------------------ ##
>>>   mapply (
>>>       function(DT, C3, C4)
>>>           DT[, c("Col3", "Col4") := list(C3, C4)],
>>>
>>>       list.DT,  # iterating over the list
>>>       list.Col3, list.Col4,
>>>       SIMPLIFY=FALSE
>>>     )
>>>
>>>    ## Note the lack of change
>>>   list.DT
>>>
>>>
>>>  ## ------------------------------------ ##
>>> ##   Iterating over an index            ##
>>> ## ------------------------------------ ##
>>>   mapply (
>>>       function(i, C3, C4)
>>>          list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)],
>>>
>>>       seq(list.DT),   # iterating over an index to the list
>>>       list.Col3, list.Col4,
>>>       SIMPLIFY=FALSE
>>>     )
>>>
>>>    ## Note each DT _has_ been modified
>>>   list.DT
>>>
>>>  ## ------------------------------------ ##
>>> ##   Iterating over the list elements   ##
>>> ##     modifying existing column        ##
>>> ## ------------------------------------ ##
>>> ##   No warning issued, but             ##
>>> ##     Will fail to modify in place     ##
>>> ## ------------------------------------ ##
>>>   mapply (
>>>       function(DT, C3, C4)
>>>          DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)],
>>>
>>>        list.DT,  # iterating over the list
>>>       list.Col3, list.Col4,
>>>       SIMPLIFY=FALSE
>>>     )
>>>
>>>    ## Note the lack of change (compare with output from `mapply`)
>>>   list.DT
>>>
>>>  ## ------------------------------------ ##
>>> ##                                      ##
>>> ##   `lapply` works as expected.        ##
>>> ##                                      ##
>>> ## ------------------------------------ ##
>>>
>>>   ## NOW WITH lapply
>>>   lapply(list.DT,
>>>     function(DT)
>>>       DT[, newCol := LETTERS[1:5]]
>>>   )
>>>
>>>    ## Note the new column:
>>>   list.DT
>>>
>>>
>>>
>>>  # ========================== #
>>>
>>>  ##   NON-WORKAROUNDS   ##
>>> ##
>>> ## I also tried all of the following alternatives
>>> ##   in hopes of being able to iterate over the list
>>> ##   directly, using `mapply`.
>>> ## None of these worked.
>>>
>>>  # (1) Creating the DTs First, then creating the list from them
>>>     DT1 <- data.table(Col1=111:115, Col2=121:125)
>>>     DT2 <- data.table(Col1=211:215, Col2=221:225)
>>>
>>>      list.DT <- list(DT1=DT1,DT2=DT2 )
>>>
>>>
>>>  # (2) Same as 1, and using `copy()` in the call to `list()`
>>>     list.DT <- list(DT1=copy(DT1),
>>>                     DT2=copy(DT2) )
>>>
>>>  # (3) lapply'ing `copy` and then iterating over that list
>>>     list.DT <- lapply(list.DT, copy)
>>>
>>>  # (4) Not naming the list elements
>>>     list.DT <- list(DT1, DT2)
>>>     # and tried
>>>     list.DT <- list(copy(DT1), copy(DT2))
>>>
>>>  ## All of the above still failed to modify in place
>>> ##   (and also issued the same warning if trying to add a column)
>>> ##    when iterating using mapply
>>>
>>>    mapply(function(DT, C3, C4)
>>>     DT[, c("Col3", "Col4") := list(C3, C4)],
>>>     list.DT, list.Col3, list.Col4,
>>>     SIMPLIFY=FALSE)
>>>
>>>
>>>  # ========================== #
>>>
>>>
>>>  Ricardo Saporta
>>>  Rutgers University, New Jersey
>>>  e: saporta at rutgers.edu
>>>
>>>
>>>
>>>  _______________________________________________
>>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>>
>>>
>>
>>
>
>
> _______________________________________________
> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130924/d90fc865/attachment-0001.html>

From shaklev at gmail.com  Fri Sep 27 05:16:11 2013
From: shaklev at gmail.com (=?UTF-8?Q?Stian_H=C3=A5klev?=)
Date: Thu, 26 Sep 2013 23:16:11 -0400
Subject: [datatable-help] Using data.table to run a function on every row
Message-ID: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>

I'm trying to run a function on every row fulfilling a certain criterium,
which returns a data frame - the idea is then to take the list of data
frames and rbindlist them together for a totally separate data.table. (I'm
extracting several URL links from each forum post, and tagging them with
the forum post they came from).

I tried doing this with a data.table

a <- db[has_url == T, getUrls(text, id)]

and get the message

Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L,  :
  replacement has 11007 rows, data has 29787

Because some rows have several URLs... However, I don't care that these
rowlengths don't match, I still want these rows :) I thought J would just
let me execute arbitrary R code in the context of the rows as variable
names, etc.

Here's the function it's running, but that shouldn't be relevant

getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  a <- data.frame(urls=unlist(matches))
  a$id <- id
  a
}


Thanks, and thanks for an amazing package - data.table has made my life so
much easier. It should be part of base, I think.
Stian Haklev, University of Toronto

-- 
http://reganmian.net/blog -- Random Stuff that Matters
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130926/1413176b/attachment.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 27 08:37:28 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 02:37:28 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
Message-ID: <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>

Hi there,

Try inserting a `by=id` in

   a <- db[(has_url), getUrls(text, id), by=id]

Also, no need for "has_url == T"
instead, use
  (has_url)
If the variable is alread logical.  (Otherwise, you are just slowing things
down ;)


Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu


On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com> wrote:

> I'm trying to run a function on every row fulfilling a certain criterium,
> which returns a data frame - the idea is then to take the list of data
> frames and rbindlist them together for a totally separate data.table. (I'm
> extracting several URL links from each forum post, and tagging them with
> the forum post they came from).
>
> I tried doing this with a data.table
>
> a <- db[has_url == T, getUrls(text, id)]
>
> and get the message
>
> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L,  :
>   replacement has 11007 rows, data has 29787
>
> Because some rows have several URLs... However, I don't care that these
> rowlengths don't match, I still want these rows :) I thought J would just
> let me execute arbitrary R code in the context of the rows as variable
> names, etc.
>
> Here's the function it's running, but that shouldn't be relevant
>
> getUrls <- function(text, id) {
>   matches <- str_match_all(text, url_pattern)
>   a <- data.frame(urls=unlist(matches))
>   a$id <- id
>   a
> }
>
>
> Thanks, and thanks for an amazing package - data.table has made my life so
> much easier. It should be part of base, I think.
> Stian Haklev, University of Toronto
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/a33eb022/attachment.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 27 08:41:19 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 02:41:19 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
Message-ID: <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>

sorry, I probably should have elaborated  (it's late here, in NJ)

The error you are seeing is most likely coming from your getURL function in
that you are adding several ids to a data.frame of varying rows, and `R`
cannot recycle it correctly.

If you instead breakdown by id, then each time you are only assigning one
id and R will be able to recycle appropriately, without issue.

good luck!
Rick


Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu


On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
saporta at scarletmail.rutgers.edu> wrote:

> Hi there,
>
> Try inserting a `by=id` in
>
>    a <- db[(has_url), getUrls(text, id), by=id]
>
> Also, no need for "has_url == T"
> instead, use
>   (has_url)
> If the variable is alread logical.  (Otherwise, you are just slowing
> things down ;)
>
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu
>
>
>
> On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com> wrote:
>
>> I'm trying to run a function on every row fulfilling a certain criterium,
>> which returns a data frame - the idea is then to take the list of data
>> frames and rbindlist them together for a totally separate data.table. (I'm
>> extracting several URL links from each forum post, and tagging them with
>> the forum post they came from).
>>
>> I tried doing this with a data.table
>>
>> a <- db[has_url == T, getUrls(text, id)]
>>
>> and get the message
>>
>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L,  :
>>   replacement has 11007 rows, data has 29787
>>
>> Because some rows have several URLs... However, I don't care that these
>> rowlengths don't match, I still want these rows :) I thought J would just
>> let me execute arbitrary R code in the context of the rows as variable
>> names, etc.
>>
>> Here's the function it's running, but that shouldn't be relevant
>>
>> getUrls <- function(text, id) {
>>   matches <- str_match_all(text, url_pattern)
>>   a <- data.frame(urls=unlist(matches))
>>   a$id <- id
>>   a
>> }
>>
>>
>> Thanks, and thanks for an amazing package - data.table has made my life
>> so much easier. It should be part of base, I think.
>> Stian Haklev, University of Toronto
>>
>> --
>> http://reganmian.net/blog -- Random Stuff that Matters
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/9ebaa2f9/attachment.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 27 08:44:37 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 02:44:37 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
Message-ID: <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>

In fact, you should be able to skip the function altogether and just use:

   db[ (has_url), str_match_all(text, url_pattern), by=id]


(and now, my apologies to all for the email clutter)
good night

On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
saporta at scarletmail.rutgers.edu> wrote:

> sorry, I probably should have elaborated  (it's late here, in NJ)
>
> The error you are seeing is most likely coming from your getURL function
> in that you are adding several ids to a data.frame of varying rows, and `R`
> cannot recycle it correctly.
>
> If you instead breakdown by id, then each time you are only assigning one
> id and R will be able to recycle appropriately, without issue.
>
> good luck!
> Rick
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu
>
>
>
> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
> saporta at scarletmail.rutgers.edu> wrote:
>
>> Hi there,
>>
>> Try inserting a `by=id` in
>>
>>    a <- db[(has_url), getUrls(text, id), by=id]
>>
>> Also, no need for "has_url == T"
>> instead, use
>>   (has_url)
>> If the variable is alread logical.  (Otherwise, you are just slowing
>> things down ;)
>>
>>
>>
>> Ricardo Saporta
>> Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu
>>
>>
>>
>> On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com> wrote:
>>
>>> I'm trying to run a function on every row fulfilling a certain
>>> criterium, which returns a data frame - the idea is then to take the list
>>> of data frames and rbindlist them together for a totally separate
>>> data.table. (I'm extracting several URL links from each forum post, and
>>> tagging them with the forum post they came from).
>>>
>>> I tried doing this with a data.table
>>>
>>> a <- db[has_url == T, getUrls(text, id)]
>>>
>>> and get the message
>>>
>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L,
>>>  :
>>>   replacement has 11007 rows, data has 29787
>>>
>>> Because some rows have several URLs... However, I don't care that these
>>> rowlengths don't match, I still want these rows :) I thought J would just
>>> let me execute arbitrary R code in the context of the rows as variable
>>> names, etc.
>>>
>>> Here's the function it's running, but that shouldn't be relevant
>>>
>>> getUrls <- function(text, id) {
>>>   matches <- str_match_all(text, url_pattern)
>>>   a <- data.frame(urls=unlist(matches))
>>>   a$id <- id
>>>   a
>>> }
>>>
>>>
>>> Thanks, and thanks for an amazing package - data.table has made my life
>>> so much easier. It should be part of base, I think.
>>> Stian Haklev, University of Toronto
>>>
>>> --
>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/002920af/attachment-0001.html>

From mdowle at mdowle.plus.com  Fri Sep 27 14:48:41 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 27 Sep 2013 13:48:41 +0100
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
 <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
Message-ID: <52457EA9.8000803@mdowle.plus.com>


That was my thought too.  I don't know what str_match_all is,  but given 
the unlist() in getUrls(),  it seems to return a list. Rather than 
unlist(),  leave it as list,  and data.table should happily make a 
`list` column where each cell is itself a vector. In fact each cell can 
be anything at all,  even embedded data.table, function definitions, or 
any type of object.
You might need a list(list(str_match_all(...))) in j to do that.

Or what Rick has suggested here might work first time.  It's hard to 
visualise it without a small reproducible example, so we're having to 
make educated guesses.

Many thanks for the kind words about data.table.

Matthew


On 27/09/13 07:44, Ricardo Saporta wrote:
> In fact, you should be able to skip the function altogether and just use:
>
>    db[ (has_url), str_match_all(text, url_pattern), by=id]
>
>
> (and now, my apologies to all for the email clutter)
> good night
>
> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta 
> <saporta at scarletmail.rutgers.edu 
> <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>
>     sorry, I probably should have elaborated  (it's late here, in NJ)
>
>     The error you are seeing is most likely coming from your getURL
>     function in that you are adding several ids to a data.frame of
>     varying rows, and `R` cannot recycle it correctly.
>
>     If you instead breakdown by id, then each time you are only
>     assigning one id and R will be able to recycle appropriately,
>     without issue.
>
>     good luck!
>     Rick
>
>
>     Ricardo Saporta
>     Graduate Student, Data Analytics
>     Rutgers University, New Jersey
>     e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>
>
>
>     On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta
>     <saporta at scarletmail.rutgers.edu
>     <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>
>         Hi there,
>
>         Try inserting a `by=id` in
>
>         a <- db[(has_url), getUrls(text, id), by=id]
>
>         Also, no need for "has_url == T"
>         instead, use
>         (has_url)
>         If the variable is alread logical.  (Otherwise, you are just
>         slowing things down ;)
>
>
>
>         Ricardo Saporta
>         Graduate Student, Data Analytics
>         Rutgers University, New Jersey
>         e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>
>
>
>         On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev
>         <shaklev at gmail.com <mailto:shaklev at gmail.com>> wrote:
>
>             I'm trying to run a function on every row fulfilling a
>             certain criterium, which returns a data frame - the idea
>             is then to take the list of data frames and rbindlist them
>             together for a totally separate data.table. (I'm
>             extracting several URL links from each forum post, and
>             tagging them with the forum post they came from).
>
>             I tried doing this with a data.table
>
>             a <- db[has_url == T, getUrls(text, id)]
>
>             and get the message
>
>             Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L,
>             1L, 2L, 4L,  :
>               replacement has 11007 rows, data has 29787
>
>             Because some rows have several URLs... However, I don't
>             care that these rowlengths don't match, I still want these
>             rows :) I thought J would just let me execute arbitrary R
>             code in the context of the rows as variable names, etc.
>
>             Here's the function it's running, but that shouldn't be
>             relevant
>
>             getUrls <- function(text, id) {
>               matches <- str_match_all(text, url_pattern)
>               a <- data.frame(urls=unlist(matches))
>               a$id <- id
>               a
>             }
>
>
>             Thanks, and thanks for an amazing package - data.table has
>             made my life so much easier. It should be part of base, I
>             think.
>             Stian Haklev, University of Toronto
>
>             -- 
>             http://reganmian.net/blog -- Random Stuff that Matters
>
>             _______________________________________________
>             datatable-help mailing list
>             datatable-help at lists.r-forge.r-project.org
>             <mailto:datatable-help at lists.r-forge.r-project.org>
>             https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/6c9cd699/attachment.html>

From shaklev at gmail.com  Fri Sep 27 17:21:43 2013
From: shaklev at gmail.com (=?UTF-8?Q?Stian_H=C3=A5klev?=)
Date: Fri, 27 Sep 2013 11:21:43 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <52457EA9.8000803@mdowle.plus.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
 <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
 <52457EA9.8000803@mdowle.plus.com>
Message-ID: <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>

I really appreciate all your help - amazingly supportive community. I could
probably figure out a "brute-force" way of doing things, but since I'm
going to be writing a lot of R in the future too, I always want to find the
"correct" way of doing it, which both looks clear, and is quick. (I come
from a background in Ruby, and am always interested in writing very clear
and DRY (do not repeat yourself) code, but I find I still spend a lot of
time in R struggling with various data formats - lists, nested lists,
vectors, matrices, different forms of apply/ddply/for loops etc).

Anyway, a few different points.

I tried db[has_url,], but got "object has_url not found"

I then tried setkey(db, "has_url"), and using this, but somehow it was a
lot slower than what I used to do (I repeated a few times). Not sure if I'm
doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
this once. But good to understand the underlying principles).

setkey(db, "has_url")
> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 17.514   0.334  17.847
> system.time( db[has_url == T, matches := str_match_all(text,
url_pattern)] )
   user  system elapsed
  5.943   0.040   5.984

The second point was how to get out the matches. The idea was that you have
a text field which might contain several urls, which I want to extract, but
I need each URL tagged with the row it came from (so I can link it back to
properties of the post and author, look at whether certain students are
more likely to post certain kinds of URLs etc).

Instead of a function, you'll see above that I rewrote it to use :=, which
creates a new column that holds a list. That worked wonderfully, but now
how do I get these "out" of this data.table, and into a new one.

Made-up example data:
a <- c(1,2,3)
b <- c(2,3,4)
dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,
NULL))

Now my goal is to have a new data.table that looks like this
Name Number
Stian 1
Stian 2
Stian 3
Christian 2
Christian 3
Christian 4

Again, I'm sure I could do this with a for() or lapply? But I'd love to see
the most elegant solution.

Note that this:

getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  data.frame(urls=unlist(matches), id=id)
}

system.time( a <- db[(has_url), getUrls(text, id), by=id] )

Works perfectly, the result is
idurlsid116
https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 16224
http://www.youtube.com/watch?v=JUiGF4TGI9w24 344
http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/
44461
http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html
61575
http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html
75675https://www.facebook.com/photo.php?fbid=10151324672623754 75

which is exactly what I was looking for. So I've really reached my goal,
but I'm curious about the other method as well.

Thanks!
Stian


On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:

>
> That was my thought too.  I don't know what str_match_all is,  but given
> the unlist() in getUrls(),  it seems to return a list.   Rather than
> unlist(),  leave it as list,  and data.table should happily make a `list`
> column where each cell is itself a vector.  In fact each cell can be
> anything at all,  even embedded data.table, function definitions, or any
> type of object.
> You might need a list(list(str_match_all(...))) in j to do that.
>
> Or what Rick has suggested here might work first time.  It's hard to
> visualise it without a small reproducible example, so we're having to make
> educated guesses.
>
> Many thanks for the kind words about data.table.
>
> Matthew
>
>
>
> On 27/09/13 07:44, Ricardo Saporta wrote:
>
> In fact, you should be able to skip the function altogether and just use:
>
>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>
>
>  (and now, my apologies to all for the email clutter)
> good night
>
> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
> saporta at scarletmail.rutgers.edu> wrote:
>
>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>
>>  The error you are seeing is most likely coming from your getURL
>> function in that you are adding several ids to a data.frame of varying
>> rows, and `R` cannot recycle it correctly.
>>
>>  If you instead breakdown by id, then each time you are only assigning
>> one id and R will be able to recycle appropriately, without issue.
>>
>>  good luck!
>> Rick
>>
>>
>>  Ricardo Saporta
>>  Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu
>>
>>
>>
>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>> saporta at scarletmail.rutgers.edu> wrote:
>>
>>> Hi there,
>>>
>>>  Try inserting a `by=id` in
>>>
>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>
>>>  Also, no need for "has_url == T"
>>> instead, use
>>>   (has_url)
>>> If the variable is alread logical.  (Otherwise, you are just slowing
>>> things down ;)
>>>
>>>
>>>
>>>  Ricardo Saporta
>>> Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>>  e: saporta at rutgers.edu
>>>
>>>
>>>
>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com>wrote:
>>>
>>>>  I'm trying to run a function on every row fulfilling a certain
>>>> criterium, which returns a data frame - the idea is then to take the list
>>>> of data frames and rbindlist them together for a totally separate
>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>> tagging them with the forum post they came from).
>>>>
>>>>  I tried doing this with a data.table
>>>>
>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>
>>>>  and get the message
>>>>
>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>> 4L,  :
>>>>   replacement has 11007 rows, data has 29787
>>>>
>>>>  Because some rows have several URLs... However, I don't care that
>>>> these rowlengths don't match, I still want these rows :) I thought J would
>>>> just let me execute arbitrary R code in the context of the rows as variable
>>>> names, etc.
>>>>
>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>
>>>>  getUrls <- function(text, id) {
>>>>   matches <- str_match_all(text, url_pattern)
>>>>   a <- data.frame(urls=unlist(matches))
>>>>   a$id <- id
>>>>   a
>>>> }
>>>>
>>>>
>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>> life so much easier. It should be part of base, I think.
>>>> Stian Haklev, University of Toronto
>>>>
>>>>  --
>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>
>>>>  _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>
>>>
>>
>
>
> _______________________________________________
> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>


-- 
http://reganmian.net/blog -- Random Stuff that Matters
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/792e7921/attachment-0001.html>

From shaklev at gmail.com  Fri Sep 27 17:39:35 2013
From: shaklev at gmail.com (=?UTF-8?Q?Stian_H=C3=A5klev?=)
Date: Fri, 27 Sep 2013 11:39:35 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
 <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
 <52457EA9.8000803@mdowle.plus.com>
 <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>
Message-ID: <CAEKz3tgs5w-=i9ib4LDxOiUhAf3J2MhzeZ50bU=xT7Mk4mnX-w@mail.gmail.com>

OK, so I just realized a few things.

First of all, I should have had has_db in a parenthesis to use it as an
index (like Ricardo did, I just didn't notice that it was important).
However, this still doesn't make much of a difference, because we're only
talking about 146k entries, and most of the time is spent on the string
extraction:

> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
   user  system elapsed
 10.246   0.027  10.275
> system.time( a <- db[has_url == T, getUrls(text, id), by=id] )
   user  system elapsed
 10.094   0.029  10.123

Either way, good to know!

Secondly, I tried this form:
system.time( b <- db[(has_url),
                     j=list(urls = str_match_all(text, url_pattern)),
                     by=id] )


The problem is that it only accepts one value per row, so the output format
looks exactly like what I want - but
> nrow(db) # all records
[1] 146058
> nrow(a) # using the function getUrls
[1] 30019
> nrow(b) # using str_match_all directly with j=list
[1] 11007
> length(unique(a$id)) # similar number of IDs, but not similar number of
URLs
[1] 11007
> length(unique(b$id))
[1] 11007

thanks again,
Stian


On Fri, Sep 27, 2013 at 11:21 AM, Stian H?klev <shaklev at gmail.com> wrote:

> I really appreciate all your help - amazingly supportive community. I
> could probably figure out a "brute-force" way of doing things, but since
> I'm going to be writing a lot of R in the future too, I always want to find
> the "correct" way of doing it, which both looks clear, and is quick. (I
> come from a background in Ruby, and am always interested in writing very
> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
> of time in R struggling with various data formats - lists, nested lists,
> vectors, matrices, different forms of apply/ddply/for loops etc).
>
> Anyway, a few different points.
>
> I tried db[has_url,], but got "object has_url not found"
>
> I then tried setkey(db, "has_url"), and using this, but somehow it was a
> lot slower than what I used to do (I repeated a few times). Not sure if I'm
> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
> this once. But good to understand the underlying principles).
>
> setkey(db, "has_url")
> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  17.514   0.334  17.847
> > system.time( db[has_url == T, matches := str_match_all(text,
> url_pattern)] )
>    user  system elapsed
>   5.943   0.040   5.984
>
> The second point was how to get out the matches. The idea was that you
> have a text field which might contain several urls, which I want to
> extract, but I need each URL tagged with the row it came from (so I can
> link it back to properties of the post and author, look at whether certain
> students are more likely to post certain kinds of URLs etc).
>
> Instead of a function, you'll see above that I rewrote it to use :=, which
> creates a new column that holds a list. That worked wonderfully, but now
> how do I get these "out" of this data.table, and into a new one.
>
> Made-up example data:
> a <- c(1,2,3)
> b <- c(2,3,4)
> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,
> NULL))
>
> Now my goal is to have a new data.table that looks like this
> Name Number
> Stian 1
> Stian 2
> Stian 3
> Christian 2
> Christian 3
> Christian 4
>
> Again, I'm sure I could do this with a for() or lapply? But I'd love to
> see the most elegant solution.
>
> Note that this:
>
> getUrls <- function(text, id) {
>   matches <- str_match_all(text, url_pattern)
>   data.frame(urls=unlist(matches), id=id)
> }
>
> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
> Works perfectly, the result is
> idurlsid116
> https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 162
> 24http://www.youtube.com/watch?v=JUiGF4TGI9w 24 344
> http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/
> 44461
> http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html
> 61575
> http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html
> 75675https://www.facebook.com/photo.php?fbid=10151324672623754 75
>
> which is exactly what I was looking for. So I've really reached my goal,
> but I'm curious about the other method as well.
>
> Thanks!
> Stian
>
>
> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>
>>
>> That was my thought too.  I don't know what str_match_all is,  but given
>> the unlist() in getUrls(),  it seems to return a list.   Rather than
>> unlist(),  leave it as list,  and data.table should happily make a `list`
>> column where each cell is itself a vector.  In fact each cell can be
>> anything at all,  even embedded data.table, function definitions, or any
>> type of object.
>> You might need a list(list(str_match_all(...))) in j to do that.
>>
>> Or what Rick has suggested here might work first time.  It's hard to
>> visualise it without a small reproducible example, so we're having to make
>> educated guesses.
>>
>> Many thanks for the kind words about data.table.
>>
>> Matthew
>>
>>
>>
>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>
>> In fact, you should be able to skip the function altogether and just
>> use:
>>
>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>
>>
>>  (and now, my apologies to all for the email clutter)
>> good night
>>
>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>> saporta at scarletmail.rutgers.edu> wrote:
>>
>>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>>
>>>  The error you are seeing is most likely coming from your getURL
>>> function in that you are adding several ids to a data.frame of varying
>>> rows, and `R` cannot recycle it correctly.
>>>
>>>  If you instead breakdown by id, then each time you are only assigning
>>> one id and R will be able to recycle appropriately, without issue.
>>>
>>>  good luck!
>>> Rick
>>>
>>>
>>>  Ricardo Saporta
>>>  Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>> e: saporta at rutgers.edu
>>>
>>>
>>>
>>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>> saporta at scarletmail.rutgers.edu> wrote:
>>>
>>>> Hi there,
>>>>
>>>>  Try inserting a `by=id` in
>>>>
>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>
>>>>  Also, no need for "has_url == T"
>>>> instead, use
>>>>   (has_url)
>>>> If the variable is alread logical.  (Otherwise, you are just slowing
>>>> things down ;)
>>>>
>>>>
>>>>
>>>>  Ricardo Saporta
>>>> Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>>  e: saporta at rutgers.edu
>>>>
>>>>
>>>>
>>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com>wrote:
>>>>
>>>>>  I'm trying to run a function on every row fulfilling a certain
>>>>> criterium, which returns a data frame - the idea is then to take the list
>>>>> of data frames and rbindlist them together for a totally separate
>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>> tagging them with the forum post they came from).
>>>>>
>>>>>  I tried doing this with a data.table
>>>>>
>>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>>
>>>>>  and get the message
>>>>>
>>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>> 4L,  :
>>>>>   replacement has 11007 rows, data has 29787
>>>>>
>>>>>  Because some rows have several URLs... However, I don't care that
>>>>> these rowlengths don't match, I still want these rows :) I thought J would
>>>>> just let me execute arbitrary R code in the context of the rows as variable
>>>>> names, etc.
>>>>>
>>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>>
>>>>>  getUrls <- function(text, id) {
>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>   a$id <- id
>>>>>   a
>>>>> }
>>>>>
>>>>>
>>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>>> life so much easier. It should be part of base, I think.
>>>>> Stian Haklev, University of Toronto
>>>>>
>>>>>  --
>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>
>>>>>  _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>>
>>>>
>>>
>>
>>
>> _______________________________________________
>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>


-- 
http://reganmian.net/blog -- Random Stuff that Matters
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/8d678a1f/attachment-0001.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 27 17:48:04 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 11:48:04 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
 <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
 <52457EA9.8000803@mdowle.plus.com>
 <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>
Message-ID: <CAE7Aa4QkiVXk=Jwse0EyBJBia3uBv-7c2k1tLGexVbENFWiLPg@mail.gmail.com>

Hi Stian,

Try the following two and look at the difference:

 db[T, matches := str_match_all(text, url_pattern)]
 db[.(T), matches := str_match_all(text, url_pattern)]

;)


On Fri, Sep 27, 2013 at 11:21 AM, Stian H?klev <shaklev at gmail.com> wrote:

> I really appreciate all your help - amazingly supportive community. I
> could probably figure out a "brute-force" way of doing things, but since
> I'm going to be writing a lot of R in the future too, I always want to find
> the "correct" way of doing it, which both looks clear, and is quick. (I
> come from a background in Ruby, and am always interested in writing very
> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
> of time in R struggling with various data formats - lists, nested lists,
> vectors, matrices, different forms of apply/ddply/for loops etc).
>
> Anyway, a few different points.
>
> I tried db[has_url,], but got "object has_url not found"
>
> I then tried setkey(db, "has_url"), and using this, but somehow it was a
> lot slower than what I used to do (I repeated a few times). Not sure if I'm
> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
> this once. But good to understand the underlying principles).
>
> setkey(db, "has_url")
> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  17.514   0.334  17.847
> > system.time( db[has_url == T, matches := str_match_all(text,
> url_pattern)] )
>    user  system elapsed
>   5.943   0.040   5.984
>
> The second point was how to get out the matches. The idea was that you
> have a text field which might contain several urls, which I want to
> extract, but I need each URL tagged with the row it came from (so I can
> link it back to properties of the post and author, look at whether certain
> students are more likely to post certain kinds of URLs etc).
>
> Instead of a function, you'll see above that I rewrote it to use :=, which
> creates a new column that holds a list. That worked wonderfully, but now
> how do I get these "out" of this data.table, and into a new one.
>
> Made-up example data:
> a <- c(1,2,3)
> b <- c(2,3,4)
> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,
> NULL))
>
> Now my goal is to have a new data.table that looks like this
> Name Number
> Stian 1
> Stian 2
> Stian 3
> Christian 2
> Christian 3
> Christian 4
>
> Again, I'm sure I could do this with a for() or lapply? But I'd love to
> see the most elegant solution.
>
> Note that this:
>
> getUrls <- function(text, id) {
>   matches <- str_match_all(text, url_pattern)
>   data.frame(urls=unlist(matches), id=id)
> }
>
> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
> Works perfectly, the result is
> idurlsid116
> https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 162
> 24http://www.youtube.com/watch?v=JUiGF4TGI9w24 344
> http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/
> 44461
> http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html
> 61575
> http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html
> 75675https://www.facebook.com/photo.php?fbid=10151324672623754 75
>
> which is exactly what I was looking for. So I've really reached my goal,
> but I'm curious about the other method as well.
>
> Thanks!
> Stian
>
>
> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>
>>
>> That was my thought too.  I don't know what str_match_all is,  but given
>> the unlist() in getUrls(),  it seems to return a list.   Rather than
>> unlist(),  leave it as list,  and data.table should happily make a `list`
>> column where each cell is itself a vector.  In fact each cell can be
>> anything at all,  even embedded data.table, function definitions, or any
>> type of object.
>> You might need a list(list(str_match_all(...))) in j to do that.
>>
>> Or what Rick has suggested here might work first time.  It's hard to
>> visualise it without a small reproducible example, so we're having to make
>> educated guesses.
>>
>> Many thanks for the kind words about data.table.
>>
>> Matthew
>>
>>
>>
>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>
>> In fact, you should be able to skip the function altogether and just
>> use:
>>
>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>
>>
>>  (and now, my apologies to all for the email clutter)
>> good night
>>
>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>> saporta at scarletmail.rutgers.edu> wrote:
>>
>>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>>
>>>  The error you are seeing is most likely coming from your getURL
>>> function in that you are adding several ids to a data.frame of varying
>>> rows, and `R` cannot recycle it correctly.
>>>
>>>  If you instead breakdown by id, then each time you are only assigning
>>> one id and R will be able to recycle appropriately, without issue.
>>>
>>>  good luck!
>>> Rick
>>>
>>>
>>>  Ricardo Saporta
>>>  Graduate Student, Data Analytics
>>> Rutgers University, New Jersey
>>> e: saporta at rutgers.edu
>>>
>>>
>>>
>>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>> saporta at scarletmail.rutgers.edu> wrote:
>>>
>>>> Hi there,
>>>>
>>>>  Try inserting a `by=id` in
>>>>
>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>
>>>>  Also, no need for "has_url == T"
>>>> instead, use
>>>>   (has_url)
>>>> If the variable is alread logical.  (Otherwise, you are just slowing
>>>> things down ;)
>>>>
>>>>
>>>>
>>>>  Ricardo Saporta
>>>> Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>>  e: saporta at rutgers.edu
>>>>
>>>>
>>>>
>>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com>wrote:
>>>>
>>>>>  I'm trying to run a function on every row fulfilling a certain
>>>>> criterium, which returns a data frame - the idea is then to take the list
>>>>> of data frames and rbindlist them together for a totally separate
>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>> tagging them with the forum post they came from).
>>>>>
>>>>>  I tried doing this with a data.table
>>>>>
>>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>>
>>>>>  and get the message
>>>>>
>>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>> 4L,  :
>>>>>   replacement has 11007 rows, data has 29787
>>>>>
>>>>>  Because some rows have several URLs... However, I don't care that
>>>>> these rowlengths don't match, I still want these rows :) I thought J would
>>>>> just let me execute arbitrary R code in the context of the rows as variable
>>>>> names, etc.
>>>>>
>>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>>
>>>>>  getUrls <- function(text, id) {
>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>   a$id <- id
>>>>>   a
>>>>> }
>>>>>
>>>>>
>>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>>> life so much easier. It should be part of base, I think.
>>>>> Stian Haklev, University of Toronto
>>>>>
>>>>>  --
>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>
>>>>>  _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>>
>>>>
>>>
>>
>>
>> _______________________________________________
>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/089fb156/attachment-0001.html>

From shaklev at gmail.com  Fri Sep 27 18:20:20 2013
From: shaklev at gmail.com (=?UTF-8?Q?Stian_H=C3=A5klev?=)
Date: Fri, 27 Sep 2013 12:20:20 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAE7Aa4QkiVXk=Jwse0EyBJBia3uBv-7c2k1tLGexVbENFWiLPg@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
 <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
 <52457EA9.8000803@mdowle.plus.com>
 <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>
 <CAE7Aa4QkiVXk=Jwse0EyBJBia3uBv-7c2k1tLGexVbENFWiLPg@mail.gmail.com>
Message-ID: <CAEKz3tgH5Jk0YGg=4rYAwOendtzz8TA2hPtEXioqo3ByvX2+fw@mail.gmail.com>

> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 19.610   0.475  20.304
> system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text,
url_pattern))) :
  All items in j=list(...) should be atomic vectors or lists. If you are
trying something like j=list(.SD,newcol=mean(colA)) then use := by group
instead (much quicker), or cbind or merge afterwards.
Timing stopped at: 6.339 0.043 6.403


On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <
saporta at scarletmail.rutgers.edu> wrote:

> Hi Stian,
>
> Try the following two and look at the difference:
>
>   db[T, matches := str_match_all(text, url_pattern)]
>  db[.(T), matches := str_match_all(text, url_pattern)]
>
> ;)
>
>
>
> On Fri, Sep 27, 2013 at 11:21 AM, Stian H?klev <shaklev at gmail.com> wrote:
>
>> I really appreciate all your help - amazingly supportive community. I
>> could probably figure out a "brute-force" way of doing things, but since
>> I'm going to be writing a lot of R in the future too, I always want to find
>> the "correct" way of doing it, which both looks clear, and is quick. (I
>> come from a background in Ruby, and am always interested in writing very
>> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
>> of time in R struggling with various data formats - lists, nested lists,
>> vectors, matrices, different forms of apply/ddply/for loops etc).
>>
>> Anyway, a few different points.
>>
>> I tried db[has_url,], but got "object has_url not found"
>>
>> I then tried setkey(db, "has_url"), and using this, but somehow it was a
>> lot slower than what I used to do (I repeated a few times). Not sure if I'm
>> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
>> this once. But good to understand the underlying principles).
>>
>> setkey(db, "has_url")
>> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>>    user  system elapsed
>>  17.514   0.334  17.847
>> > system.time( db[has_url == T, matches := str_match_all(text,
>> url_pattern)] )
>>    user  system elapsed
>>   5.943   0.040   5.984
>>
>> The second point was how to get out the matches. The idea was that you
>> have a text field which might contain several urls, which I want to
>> extract, but I need each URL tagged with the row it came from (so I can
>> link it back to properties of the post and author, look at whether certain
>> students are more likely to post certain kinds of URLs etc).
>>
>> Instead of a function, you'll see above that I rewrote it to use :=,
>> which creates a new column that holds a list. That worked wonderfully, but
>> now how do I get these "out" of this data.table, and into a new one.
>>
>> Made-up example data:
>> a <- c(1,2,3)
>> b <- c(2,3,4)
>> dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b,
>> NULL))
>>
>> Now my goal is to have a new data.table that looks like this
>> Name Number
>> Stian 1
>> Stian 2
>> Stian 3
>> Christian 2
>> Christian 3
>> Christian 4
>>
>> Again, I'm sure I could do this with a for() or lapply? But I'd love to
>> see the most elegant solution.
>>
>> Note that this:
>>
>> getUrls <- function(text, id) {
>>   matches <- str_match_all(text, url_pattern)
>>   data.frame(urls=unlist(matches), id=id)
>> }
>>
>> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>>
>> Works perfectly, the result is
>> idurlsid116
>> https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 16
>> 224http://www.youtube.com/watch?v=JUiGF4TGI9w 24 344
>> http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/
>> 44461
>> http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html
>> 61575
>> http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html
>> 75675https://www.facebook.com/photo.php?fbid=10151324672623754 75
>>
>> which is exactly what I was looking for. So I've really reached my goal,
>> but I'm curious about the other method as well.
>>
>> Thanks!
>> Stian
>>
>>
>> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>>
>>>
>>> That was my thought too.  I don't know what str_match_all is,  but given
>>> the unlist() in getUrls(),  it seems to return a list.   Rather than
>>> unlist(),  leave it as list,  and data.table should happily make a `list`
>>> column where each cell is itself a vector.  In fact each cell can be
>>> anything at all,  even embedded data.table, function definitions, or any
>>> type of object.
>>> You might need a list(list(str_match_all(...))) in j to do that.
>>>
>>> Or what Rick has suggested here might work first time.  It's hard to
>>> visualise it without a small reproducible example, so we're having to make
>>> educated guesses.
>>>
>>> Many thanks for the kind words about data.table.
>>>
>>> Matthew
>>>
>>>
>>>
>>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>>
>>> In fact, you should be able to skip the function altogether and just
>>> use:
>>>
>>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>>
>>>
>>>  (and now, my apologies to all for the email clutter)
>>> good night
>>>
>>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>>> saporta at scarletmail.rutgers.edu> wrote:
>>>
>>>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>>>
>>>>  The error you are seeing is most likely coming from your getURL
>>>> function in that you are adding several ids to a data.frame of varying
>>>> rows, and `R` cannot recycle it correctly.
>>>>
>>>>  If you instead breakdown by id, then each time you are only assigning
>>>> one id and R will be able to recycle appropriately, without issue.
>>>>
>>>>  good luck!
>>>> Rick
>>>>
>>>>
>>>>  Ricardo Saporta
>>>>  Graduate Student, Data Analytics
>>>> Rutgers University, New Jersey
>>>> e: saporta at rutgers.edu
>>>>
>>>>
>>>>
>>>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>>> saporta at scarletmail.rutgers.edu> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>>  Try inserting a `by=id` in
>>>>>
>>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>>
>>>>>  Also, no need for "has_url == T"
>>>>> instead, use
>>>>>   (has_url)
>>>>> If the variable is alread logical.  (Otherwise, you are just slowing
>>>>> things down ;)
>>>>>
>>>>>
>>>>>
>>>>>  Ricardo Saporta
>>>>> Graduate Student, Data Analytics
>>>>> Rutgers University, New Jersey
>>>>>  e: saporta at rutgers.edu
>>>>>
>>>>>
>>>>>
>>>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com>wrote:
>>>>>
>>>>>>  I'm trying to run a function on every row fulfilling a certain
>>>>>> criterium, which returns a data frame - the idea is then to take the list
>>>>>> of data frames and rbindlist them together for a totally separate
>>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>>> tagging them with the forum post they came from).
>>>>>>
>>>>>>  I tried doing this with a data.table
>>>>>>
>>>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>>>
>>>>>>  and get the message
>>>>>>
>>>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>>> 4L,  :
>>>>>>   replacement has 11007 rows, data has 29787
>>>>>>
>>>>>>  Because some rows have several URLs... However, I don't care that
>>>>>> these rowlengths don't match, I still want these rows :) I thought J would
>>>>>> just let me execute arbitrary R code in the context of the rows as variable
>>>>>> names, etc.
>>>>>>
>>>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>>>
>>>>>>  getUrls <- function(text, id) {
>>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>>   a$id <- id
>>>>>>   a
>>>>>> }
>>>>>>
>>>>>>
>>>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>>>> life so much easier. It should be part of base, I think.
>>>>>> Stian Haklev, University of Toronto
>>>>>>
>>>>>>  --
>>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>>
>>>>>>  _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>>
>>>
>>
>>
>> --
>> http://reganmian.net/blog -- Random Stuff that Matters
>>
>
>


-- 
http://reganmian.net/blog -- Random Stuff that Matters
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/fa82d632/attachment-0001.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 27 19:25:19 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 13:25:19 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAEKz3tgH5Jk0YGg=4rYAwOendtzz8TA2hPtEXioqo3ByvX2+fw@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
 <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
 <52457EA9.8000803@mdowle.plus.com>
 <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>
 <CAE7Aa4QkiVXk=Jwse0EyBJBia3uBv-7c2k1tLGexVbENFWiLPg@mail.gmail.com>
 <CAEKz3tgH5Jk0YGg=4rYAwOendtzz8TA2hPtEXioqo3ByvX2+fw@mail.gmail.com>
Message-ID: <CAE7Aa4Qd3oK6PP5JzxEo8=7RYk-8=2m8S03c7-k24FnefNBN=g@mail.gmail.com>

hm... not sure about `j`  (sorry, I havent taken a close look at your
code), but my comment was to point out that these two statements are
different:

   DT [  TRUE,   ]
   DT [ .(TRUE), ]

The first one is giving you the whole data.table
   DT[TRUE, ]  is the same as DT
(since TRUE is getting recycled)

The second one is giving you all rows within DT where the first column of
the key has a value of TRUE.


Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu


On Fri, Sep 27, 2013 at 12:20 PM, Stian H?klev <shaklev at gmail.com> wrote:

> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>    user  system elapsed
>  19.610   0.475  20.304
> > system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
> Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text,
> url_pattern))) :
>   All items in j=list(...) should be atomic vectors or lists. If you are
> trying something like j=list(.SD,newcol=mean(colA)) then use := by group
> instead (much quicker), or cbind or merge afterwards.
> Timing stopped at: 6.339 0.043 6.403
>
>
> On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <
> saporta at scarletmail.rutgers.edu> wrote:
>
>> Hi Stian,
>>
>> Try the following two and look at the difference:
>>
>>   db[T, matches := str_match_all(text, url_pattern)]
>>  db[.(T), matches := str_match_all(text, url_pattern)]
>>
>> ;)
>>
>>
>>
>> On Fri, Sep 27, 2013 at 11:21 AM, Stian H?klev <shaklev at gmail.com> wrote:
>>
>>> I really appreciate all your help - amazingly supportive community. I
>>> could probably figure out a "brute-force" way of doing things, but since
>>> I'm going to be writing a lot of R in the future too, I always want to find
>>> the "correct" way of doing it, which both looks clear, and is quick. (I
>>> come from a background in Ruby, and am always interested in writing very
>>> clear and DRY (do not repeat yourself) code, but I find I still spend a lot
>>> of time in R struggling with various data formats - lists, nested lists,
>>> vectors, matrices, different forms of apply/ddply/for loops etc).
>>>
>>> Anyway, a few different points.
>>>
>>> I tried db[has_url,], but got "object has_url not found"
>>>
>>> I then tried setkey(db, "has_url"), and using this, but somehow it was a
>>> lot slower than what I used to do (I repeated a few times). Not sure if I'm
>>> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run
>>> this once. But good to understand the underlying principles).
>>>
>>> setkey(db, "has_url")
>>> > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>>>    user  system elapsed
>>>  17.514   0.334  17.847
>>> > system.time( db[has_url == T, matches := str_match_all(text,
>>> url_pattern)] )
>>>    user  system elapsed
>>>   5.943   0.040   5.984
>>>
>>> The second point was how to get out the matches. The idea was that you
>>> have a text field which might contain several urls, which I want to
>>> extract, but I need each URL tagged with the row it came from (so I can
>>> link it back to properties of the post and author, look at whether certain
>>> students are more likely to post certain kinds of URLs etc).
>>>
>>> Instead of a function, you'll see above that I rewrote it to use :=,
>>> which creates a new column that holds a list. That worked wonderfully, but
>>> now how do I get these "out" of this data.table, and into a new one.
>>>
>>> Made-up example data:
>>> a <- c(1,2,3)
>>> b <- c(2,3,4)
>>> dt <- data.table(names=c("Stian", "Christian", "John"),
>>> numbers=list(a,b, NULL))
>>>
>>> Now my goal is to have a new data.table that looks like this
>>> Name Number
>>> Stian 1
>>> Stian 2
>>> Stian 3
>>> Christian 2
>>> Christian 3
>>> Christian 4
>>>
>>> Again, I'm sure I could do this with a for() or lapply? But I'd love to
>>> see the most elegant solution.
>>>
>>> Note that this:
>>>
>>> getUrls <- function(text, id) {
>>>   matches <- str_match_all(text, url_pattern)
>>>   data.frame(urls=unlist(matches), id=id)
>>> }
>>>
>>> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>>>
>>> Works perfectly, the result is
>>> idurlsid116
>>> https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166
>>> 16224http://www.youtube.com/watch?v=JUiGF4TGI9w 24 344
>>> http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/
>>> 44461
>>> http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html
>>> 61575
>>> http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html
>>> 75675https://www.facebook.com/photo.php?fbid=10151324672623754 75
>>>
>>> which is exactly what I was looking for. So I've really reached my goal,
>>> but I'm curious about the other method as well.
>>>
>>> Thanks!
>>> Stian
>>>
>>>
>>> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle <mdowle at mdowle.plus.com>wrote:
>>>
>>>>
>>>> That was my thought too.  I don't know what str_match_all is,  but
>>>> given the unlist() in getUrls(),  it seems to return a list.   Rather than
>>>> unlist(),  leave it as list,  and data.table should happily make a `list`
>>>> column where each cell is itself a vector.  In fact each cell can be
>>>> anything at all,  even embedded data.table, function definitions, or any
>>>> type of object.
>>>> You might need a list(list(str_match_all(...))) in j to do that.
>>>>
>>>> Or what Rick has suggested here might work first time.  It's hard to
>>>> visualise it without a small reproducible example, so we're having to make
>>>> educated guesses.
>>>>
>>>> Many thanks for the kind words about data.table.
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> On 27/09/13 07:44, Ricardo Saporta wrote:
>>>>
>>>> In fact, you should be able to skip the function altogether and just
>>>> use:
>>>>
>>>>     db[ (has_url), str_match_all(text, url_pattern), by=id]
>>>>
>>>>
>>>>  (and now, my apologies to all for the email clutter)
>>>> good night
>>>>
>>>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta <
>>>> saporta at scarletmail.rutgers.edu> wrote:
>>>>
>>>>> sorry, I probably should have elaborated  (it's late here, in NJ)
>>>>>
>>>>>  The error you are seeing is most likely coming from your getURL
>>>>> function in that you are adding several ids to a data.frame of varying
>>>>> rows, and `R` cannot recycle it correctly.
>>>>>
>>>>>  If you instead breakdown by id, then each time you are only
>>>>> assigning one id and R will be able to recycle appropriately, without
>>>>> issue.
>>>>>
>>>>>  good luck!
>>>>> Rick
>>>>>
>>>>>
>>>>>  Ricardo Saporta
>>>>>  Graduate Student, Data Analytics
>>>>> Rutgers University, New Jersey
>>>>> e: saporta at rutgers.edu
>>>>>
>>>>>
>>>>>
>>>>>   On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta <
>>>>> saporta at scarletmail.rutgers.edu> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>>  Try inserting a `by=id` in
>>>>>>
>>>>>>     a <- db[(has_url), getUrls(text, id), by=id]
>>>>>>
>>>>>>  Also, no need for "has_url == T"
>>>>>> instead, use
>>>>>>   (has_url)
>>>>>> If the variable is alread logical.  (Otherwise, you are just slowing
>>>>>> things down ;)
>>>>>>
>>>>>>
>>>>>>
>>>>>>  Ricardo Saporta
>>>>>> Graduate Student, Data Analytics
>>>>>> Rutgers University, New Jersey
>>>>>>  e: saporta at rutgers.edu
>>>>>>
>>>>>>
>>>>>>
>>>>>>  On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev <shaklev at gmail.com>wrote:
>>>>>>
>>>>>>>  I'm trying to run a function on every row fulfilling a certain
>>>>>>> criterium, which returns a data frame - the idea is then to take the list
>>>>>>> of data frames and rbindlist them together for a totally separate
>>>>>>> data.table. (I'm extracting several URL links from each forum post, and
>>>>>>> tagging them with the forum post they came from).
>>>>>>>
>>>>>>>  I tried doing this with a data.table
>>>>>>>
>>>>>>>  a <- db[has_url == T, getUrls(text, id)]
>>>>>>>
>>>>>>>  and get the message
>>>>>>>
>>>>>>>  Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L,
>>>>>>> 4L,  :
>>>>>>>   replacement has 11007 rows, data has 29787
>>>>>>>
>>>>>>>  Because some rows have several URLs... However, I don't care that
>>>>>>> these rowlengths don't match, I still want these rows :) I thought J would
>>>>>>> just let me execute arbitrary R code in the context of the rows as variable
>>>>>>> names, etc.
>>>>>>>
>>>>>>>  Here's the function it's running, but that shouldn't be relevant
>>>>>>>
>>>>>>>  getUrls <- function(text, id) {
>>>>>>>   matches <- str_match_all(text, url_pattern)
>>>>>>>   a <- data.frame(urls=unlist(matches))
>>>>>>>   a$id <- id
>>>>>>>   a
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>  Thanks, and thanks for an amazing package - data.table has made my
>>>>>>> life so much easier. It should be part of base, I think.
>>>>>>> Stian Haklev, University of Toronto
>>>>>>>
>>>>>>>  --
>>>>>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>>>>>
>>>>>>>  _______________________________________________
>>>>>>> datatable-help mailing list
>>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>>
>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> http://reganmian.net/blog -- Random Stuff that Matters
>>>
>>
>>
>
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/73bbefac/attachment-0001.html>

From mdowle at mdowle.plus.com  Fri Sep 27 20:49:15 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Fri, 27 Sep 2013 19:49:15 +0100
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <CAE7Aa4Qd3oK6PP5JzxEo8=7RYk-8=2m8S03c7-k24FnefNBN=g@mail.gmail.com>
References: <CAEKz3tjKNLRHjMZQrRvjj1TxcEoTWJiWJFHTsA+XadcYzDNSkw@mail.gmail.com>
 <CAE7Aa4SgENWr10xukoB5JJYXS68zKPxb2hae45duJV+j4Q16Rw@mail.gmail.com>
 <CAE7Aa4RC5y9CCEn+rLRb_KRPdgcu6iYAZv+_Umf5x2ekKQyQ-g@mail.gmail.com>
 <CAE7Aa4R-_DOkzC3JuJ-nbMFSQ5GLTij5rX3jEHHyjp8wL_YCwg@mail.gmail.com>
 <52457EA9.8000803@mdowle.plus.com>
 <CAEKz3thLgOETxDfAUiaUBbk_zO4PZDKwJV5iyMCoC5usdhi+_g@mail.gmail.com>
 <CAE7Aa4QkiVXk=Jwse0EyBJBia3uBv-7c2k1tLGexVbENFWiLPg@mail.gmail.com>
 <CAEKz3tgH5Jk0YGg=4rYAwOendtzz8TA2hPtEXioqo3ByvX2+fw@mail.gmail.com>
 <CAE7Aa4Qd3oK6PP5JzxEo8=7RYk-8=2m8S03c7-k24FnefNBN=g@mail.gmail.com>
Message-ID: <5245D32B.3020304@mdowle.plus.com>


Stian,

datatable-help isn't really for this kind of question.  It's a very good 
question and belongs on S.O. where you can edit it given comments.  
datatable-help is more for discussion about future developments,  
notices,  things that aren't allowed on S.O.,  etc.

This was your example :

 > a <- c(1,2,3)
 > b <- c(2,3,4)
 > dt <- data.table(names=c("Stian", "Christian", "John"), 
numbers=list(a,b, NULL))

The output of that is :

 > dt
        names numbers
1:     Stian   1,2,3
2: Christian   2,3,4
3:      John

Are you possibly mistaken about the output of list columns?  Those 
commas are just how it displays.  They aren't strings in the numbers 
column.  The `numbers` column is a list column where each item is a vector.

To get the output you asked for it's just :

 > dt[,unlist(numbers),by=names]
        names V1
1:     Stian  1
2:     Stian  2
3:     Stian  3
4: Christian  2
5: Christian  3
6: Christian  4
 >

If I've misunderstood,  then please start again with a new question on S.O.

http://stackoverflow.com/questions/tagged/data.table

Thanks,
Matthew


On 27/09/13 18:25, Ricardo Saporta wrote:
> hm... not sure about `j`  (sorry, I havent taken a close look at your 
> code), but my comment was to point out that these two statements are 
> different:
>
>    DT [  TRUE,   ]
>    DT [ .(TRUE), ]
>
> The first one is giving you the whole data.table
>    DT[TRUE, ]  is the same as DT
> (since TRUE is getting recycled)
>
> The second one is giving you all rows within DT where the first column 
> of the key has a value of TRUE.
>
>
>
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>
>
>
> On Fri, Sep 27, 2013 at 12:20 PM, Stian H?klev <shaklev at gmail.com 
> <mailto:shaklev at gmail.com>> wrote:
>
>     > system.time( db[T, matches := str_match_all(text, url_pattern)] )
>        user  system elapsed
>      19.610   0.475  20.304
>     > system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
>     Error in `[.data.table`(db, .(T), `:=`(matches,
>     str_match_all(text, url_pattern))) :
>       All items in j=list(...) should be atomic vectors or lists. If
>     you are trying something like j=list(.SD,newcol=mean(colA)) then
>     use := by group instead (much quicker), or cbind or merge afterwards.
>     Timing stopped at: 6.339 0.043 6.403
>
>
>     On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta
>     <saporta at scarletmail.rutgers.edu
>     <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>
>         Hi Stian,
>
>         Try the following two and look at the difference:
>
>          db[T, matches := str_match_all(text, url_pattern)]
>          db[.(T), matches := str_match_all(text, url_pattern)]
>
>         ;)
>
>
>
>         On Fri, Sep 27, 2013 at 11:21 AM, Stian H?klev
>         <shaklev at gmail.com <mailto:shaklev at gmail.com>> wrote:
>
>             I really appreciate all your help - amazingly supportive
>             community. I could probably figure out a "brute-force" way
>             of doing things, but since I'm going to be writing a lot
>             of R in the future too, I always want to find the
>             "correct" way of doing it, which both looks clear, and is
>             quick. (I come from a background in Ruby, and am always
>             interested in writing very clear and DRY (do not repeat
>             yourself) code, but I find I still spend a lot of time in
>             R struggling with various data formats - lists, nested
>             lists, vectors, matrices, different forms of
>             apply/ddply/for loops etc).
>
>             Anyway, a few different points.
>
>             I tried db[has_url,], but got "object has_url not found"
>
>             I then tried setkey(db, "has_url"), and using this, but
>             somehow it was a lot slower than what I used to do (I
>             repeated a few times). Not sure if I'm doing it wrong.
>             (Not important - even 15 sec is totally fine, I'll only
>             run this once. But good to understand the underlying
>             principles).
>
>             setkey(db, "has_url")
>             > system.time( db[T, matches := str_match_all(text,
>             url_pattern)] )
>                user  system elapsed
>              17.514   0.334  17.847
>             > system.time( db[has_url == T, matches :=
>             str_match_all(text, url_pattern)] )
>                user  system elapsed
>               5.943   0.040   5.984
>
>             The second point was how to get out the matches. The idea
>             was that you have a text field which might contain several
>             urls, which I want to extract, but I need each URL tagged
>             with the row it came from (so I can link it back to
>             properties of the post and author, look at whether certain
>             students are more likely to post certain kinds of URLs etc).
>
>             Instead of a function, you'll see above that I rewrote it
>             to use :=, which creates a new column that holds a list.
>             That worked wonderfully, but now how do I get these "out"
>             of this data.table, and into a new one.
>
>             Made-up example data:
>             a <- c(1,2,3)
>             b <- c(2,3,4)
>             dt <- data.table(names=c("Stian", "Christian", "John"),
>             numbers=list(a,b, NULL))
>
>             Now my goal is to have a new data.table that looks like this
>             Name Number
>             Stian 1
>             Stian 2
>             Stian 3
>             Christian 2
>             Christian 3
>             Christian 4
>
>             Again, I'm sure I could do this with a for() or lapply?
>             But I'd love to see the most elegant solution.
>
>             Note that this:
>
>             getUrls <- function(text, id) {
>               matches <- str_match_all(text, url_pattern)
>             data.frame(urls=unlist(matches), id=id)
>             }
>
>             system.time( a <- db[(has_url), getUrls(text, id), by=id] )
>
>             Works perfectly, the result is
>
>             	id 	urls 	id
>             1 	16
>             https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166
>             	16
>             2 	24 	http://www.youtube.com/watch?v=JUiGF4TGI9w 	24
>             3 	44
>             http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/
>             	44
>             4 	61
>             http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html
>             	61
>             5 	75
>             http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html
>             	75
>             6 	75
>             https://www.facebook.com/photo.php?fbid=10151324672623754 	75
>
>
>             which is exactly what I was looking for. So I've really
>             reached my goal, but I'm curious about the other method as
>             well.
>
>             Thanks!
>             Stian
>
>
>             On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle
>             <mdowle at mdowle.plus.com <mailto:mdowle at mdowle.plus.com>>
>             wrote:
>
>
>                 That was my thought too.  I don't know what
>                 str_match_all is,  but given the unlist() in
>                 getUrls(),  it seems to return a list.   Rather than
>                 unlist(),  leave it as list,  and data.table should
>                 happily make a `list` column where each cell is itself
>                 a vector.  In fact each cell can be anything at all, 
>                 even embedded data.table, function definitions, or any
>                 type of object.
>                 You might need a list(list(str_match_all(...))) in j
>                 to do that.
>
>                 Or what Rick has suggested here might work first
>                 time.  It's hard to visualise it without a small
>                 reproducible example, so we're having to make educated
>                 guesses.
>
>                 Many thanks for the kind words about data.table.
>
>                 Matthew
>
>
>
>                 On 27/09/13 07:44, Ricardo Saporta wrote:
>>                 In fact, you should be able to skip the function
>>                 altogether and just use:
>>
>>                    db[ (has_url), str_match_all(text, url_pattern),
>>                 by=id]
>>
>>
>>                 (and now, my apologies to all for the email clutter)
>>                 good night
>>
>>                 On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta
>>                 <saporta at scarletmail.rutgers.edu
>>                 <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>>
>>                     sorry, I probably should have elaborated  (it's
>>                     late here, in NJ)
>>
>>                     The error you are seeing is most likely coming
>>                     from your getURL function in that you are adding
>>                     several ids to a data.frame of varying rows, and
>>                     `R` cannot recycle it correctly.
>>
>>                     If you instead breakdown by id, then each time
>>                     you are only assigning one id and R will be able
>>                     to recycle appropriately, without issue.
>>
>>                     good luck!
>>                     Rick
>>
>>
>>                     Ricardo Saporta
>>                     Graduate Student, Data Analytics
>>                     Rutgers University, New Jersey
>>                     e: saporta at rutgers.edu <mailto:saporta at rutgers.edu>
>>
>>
>>
>>                     On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta
>>                     <saporta at scarletmail.rutgers.edu
>>                     <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>>
>>                         Hi there,
>>
>>                         Try inserting a `by=id` in
>>
>>                         a <- db[(has_url), getUrls(text, id), by=id]
>>
>>                         Also, no need for "has_url == T"
>>                         instead, use
>>                         (has_url)
>>                         If the variable is alread logical.
>>                          (Otherwise, you are just slowing things down ;)
>>
>>
>>
>>                         Ricardo Saporta
>>                         Graduate Student, Data Analytics
>>                         Rutgers University, New Jersey
>>                         e: saporta at rutgers.edu
>>                         <mailto:saporta at rutgers.edu>
>>
>>
>>
>>                         On Thu, Sep 26, 2013 at 11:16 PM, Stian
>>                         H?klev <shaklev at gmail.com
>>                         <mailto:shaklev at gmail.com>> wrote:
>>
>>                             I'm trying to run a function on every row
>>                             fulfilling a certain criterium, which
>>                             returns a data frame - the idea is then
>>                             to take the list of data frames and
>>                             rbindlist them together for a totally
>>                             separate data.table. (I'm extracting
>>                             several URL links from each forum post,
>>                             and tagging them with the forum post they
>>                             came from).
>>
>>                             I tried doing this with a data.table
>>
>>                             a <- db[has_url == T, getUrls(text, id)]
>>
>>                             and get the message
>>
>>                             Error in `$<-.data.frame`(`*tmp*`, "id",
>>                             value = c(1L, 6L, 1L, 2L, 4L,  :
>>                             replacement has 11007 rows, data has 29787
>>
>>                             Because some rows have several URLs...
>>                             However, I don't care that these
>>                             rowlengths don't match, I still want
>>                             these rows :) I thought J would just let
>>                             me execute arbitrary R code in the
>>                             context of the rows as variable names, etc.
>>
>>                             Here's the function it's running, but
>>                             that shouldn't be relevant
>>
>>                             getUrls <- function(text, id) {
>>                               matches <- str_match_all(text, url_pattern)
>>                               a <- data.frame(urls=unlist(matches))
>>                               a$id <- id
>>                               a
>>                             }
>>
>>
>>                             Thanks, and thanks for an amazing package
>>                             - data.table has made my life so much
>>                             easier. It should be part of base, I think.
>>                             Stian Haklev, University of Toronto
>>
>>                             -- 
>>                             http://reganmian.net/blog -- Random Stuff
>>                             that Matters
>>
>>                             _______________________________________________
>>                             datatable-help mailing list
>>                             datatable-help at lists.r-forge.r-project.org <mailto:datatable-help at lists.r-forge.r-project.org>
>>                             https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>>
>>
>>
>>                 _______________________________________________
>>                 datatable-help mailing list
>>                 datatable-help at lists.r-forge.r-project.org  <mailto:datatable-help at lists.r-forge.r-project.org>
>>                 https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
>             -- 
>             http://reganmian.net/blog -- Random Stuff that Matters
>
>
>
>
>
>     -- 
>     http://reganmian.net/blog -- Random Stuff that Matters
>
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/307c1649/attachment-0001.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 27 21:01:44 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 15:01:44 -0400
Subject: [datatable-help] unique.data.frame should create a copy, right?
In-Reply-To: <CAHA9McNiuEaEpqTge+Wvb2PODXxqg_+YqcETgvsZBJz7i=reJg@mail.gmail.com>
References: <CAJd-hd=bu141Pscnh-XuZMSvYsnJP9soxrNciozyOPrZyV6iyQ@mail.gmail.com>
 <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com>
 <CAE7Aa4SyCbQBP0tH-HGVNYbLTeGoKp4CkPiOKkwMwX6nT5Ed3A@mail.gmail.com>
 <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com>
 <CAE7Aa4R+uRUxaBrgwJYKWAZpE4TbtKR4JCavarCB8rhVgzLnXQ@mail.gmail.com>
 <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com>
 <CAHA9McMpt8SqPHrsEVuU4tUQhECBfdBxPgBMUJnuVtoMP0Fauw@mail.gmail.com>
 <C2CAE1D34CAD40D8B53240A844CD1334@gmail.com>
 <CAHA9McPn3V_UNPTSCEZ_mz4-pW-pvEp+1PmoFOwbqsqUqdd69Q@mail.gmail.com>
 <CAE7Aa4QVcm+0qfrhK1qUnCsuvWrUvfK6V1ygTXBy7=DGEpqz7A@mail.gmail.com>
 <CAHA9McPGUh2eaRRvjWGxYsg9KT1CMszaQFE5nsOxchJYgZch1w@mail.gmail.com>
 <CAE7Aa4TOcjcwSi-ALFO6aosf=zfLchQZb=joYeZaGOuaw+4eHw@mail.gmail.com>
 <CAHA9McMXpND1nNkTrbx_tdUAXw8qg55pK4syFpKvwdhn_cBZHQ@mail.gmail.com>
 <CAHA9McM0ejf3gGDu1sciwRvFS05TXWON3k2FOg1sR5JewBLLZA@mail.gmail.com>
 <CAE7Aa4SheuYVtYcUALVsH56HZ3a1Kh1AhniZMz3zDQfr=J5Gnw@mail.gmail.com>
 <CAHA9McNiuEaEpqTge+Wvb2PODXxqg_+YqcETgvsZBJz7i=reJg@mail.gmail.com>
Message-ID: <CAE7Aa4QGu4SuKjbFwi-P9zrDJyAa2tLOzxMU-N6B7fTxgePJ5Q@mail.gmail.com>

running some benchmarks at work, I got the following comparing
unique.data.frame to the new
unique(.. , by=..)

> microbenchmark(eval(uDF), eval(uDT))
Unit: milliseconds
      expr      min        lq    median       uq      max neval
 eval(uDF) 28.38505 29.368062 31.705633 33.53874 52.57522   100
 eval(uDT)  6.61314  7.220897  7.597114  9.58860 78.82127   100


well done!


On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:

> Last update here :-)
>
> After more hemming and hawing, I've changed the name of the new
> parameter added to duplicated.data.table and unique.data.table from
> `by.columnss` to just `by`, as it (more or less) is the same idea as
> the `by` in dt[x, i,j,by,...]
>
> Sorry for any inconveniences caused if you've been working off of the
> development version.
>
> -steve
>
>
> On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta
> <saporta at scarletmail.rutgers.edu> wrote:
> > Steve, great stuff!!
> > thanks for making that happen
> >
> > Rick
> >
> >
> > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou
> > <mailinglist.honeypot at gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> As I needed this sooner than I had expected, I just committed this
> >> change. It's in svn revision 889.
> >>
> >> I chose 'by.columns' as the parameter names -- seemed to make more
> >> sense to me, and using the short hand interactively saves a letter,
> >> eg: unique(dt, by=c('some', 'columns')) ;-)
> >>
> >> Here's the note from the NEWS file:
> >>
> >> o  "Uniqueness" tests can now specify arbirtray combinations of
> >> columns to use to test for duplicates. `by.columns` parameter added to
> >> unique.data.table and duplicated.data.table. This allows the user to
> >> test for uniqueness using any combination of columns in the
> >> data.table, where previously the user only had the option to use the
> >> keyed columns (if keyed) or all columns (if not). The default behavior
> >> sets `by.columns=key(dt)` to maintain backward compatability. See
> >> man/duplicated.Rd and tests 986:991 for more information. Thanks to
> >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful
> >> discussions.
> >>
> >> Should work as advertised assuming my unit tests weren't too simplistic.
> >>
> >> Cheers,
> >>
> >> -steve
> >>
> >>
> >>
> >>
> >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou
> >> <mailinglist.honeypot at gmail.com> wrote:
> >> > Thanks for the suggestions, folks.
> >> >
> >> > Matthew: do you have a preference?
> >> >
> >> > -steve
> >> >
> >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta
> >> > <saporta at scarletmail.rutgers.edu> wrote:
> >> >> Steve,
> >> >>
> >> >> I like your suggestion a lot.  I can see putting column specification
> >> >> to
> >> >> good use.
> >> >>
> >> >> As for the argument name, perhaps
> >> >>    'use.columns'
> >> >>
> >> >> And where a value of NULL or FALSE will yield same results as
> >> >> `unique.data.frame`
> >> >>
> >> >>     use.columns=key(x)   # default behavior
> >> >>     use.columns=c("col1name", "col7name")   #etc
> >> >>     use.columns=NULL
> >> >>
> >> >>
> >> >> Thanks as always,
> >> >> Rick
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou
> >> >> <mailinglist.honeypot at gmail.com> wrote:
> >> >>>
> >> >>> Hi folks,
> >> >>>
> >> >>> I actually want to revisit the fix I made here.
> >> >>>
> >> >>> Instead of having `use.key` in the signature to unique.data.table
> (and
> >> >>> duplicated.data.table) to be:
> >> >>>
> >> >>> function(x,
> >> >>>              incomparables=FALSE,
> >> >>>              tolerance=.Machine$double.eps ^ 0.5,
> >> >>>              use.key=TRUE, ...)
> >> >>>
> >> >>> How about we switch out use.key for a parameter that specifies the
> >> >>> column names to use in the uniqueness check, which defaults to
> key(x)
> >> >>> to keep backwards compatibility.
> >> >>>
> >> >>> For argument's sake (like that?), lets call this parameter `columns`
> >> >>> (by.columns? with.columns? whatever) so:
> >> >>>
> >> >>> function(x,
> >> >>>              incomparables=FALSE,
> >> >>>              tolerance=.Machine$double.eps ^ 0.5,
> >> >>>              columns=key(x), ...)
> >> >>>
> >> >>> Then:
> >> >>>
> >> >>> (1) leaving it alone is the backward compatibile behavior;
> >> >>> (2) Perhaps setting it to NULL will use all columns, and make it
> >> >>> equivalent to unique.data.frame (also the same when x has no key);
> and
> >> >>> (3) setting it to any other combo of columns uses those columns as
> the
> >> >>> uniqueness key and filters the rows (only) out of x accordingly.
> >> >>>
> >> >>> What do you folks think? Personally I think this is better on all
> >> >>> accounts then just specifying to use the key or not and the only
> >> >>> question in my mind is the name of the argument -- happy to hear
> other
> >> >>> world views, however, so don't be shy.
> >> >>>
> >> >>> Thanks,
> >> >>> -steve
> >> >>>
> >> >>> --
> >> >>> Steve Lianoglou
> >> >>> Computational Biologist
> >> >>> Bioinformatics and Computational Biology
> >> >>> Genentech
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Steve Lianoglou
> >> > Computational Biologist
> >> > Bioinformatics and Computational Biology
> >> > Genentech
> >>
> >>
> >>
> >> --
> >> Steve Lianoglou
> >> Computational Biologist
> >> Bioinformatics and Computational Biology
> >> Genentech
> >
> >
>
>
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/f02ac148/attachment.html>

From saporta at scarletmail.rutgers.edu  Fri Sep 27 21:09:12 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 15:09:12 -0400
Subject: [datatable-help] unique.data.frame should create a copy, right?
In-Reply-To: <CAHA9McNiuEaEpqTge+Wvb2PODXxqg_+YqcETgvsZBJz7i=reJg@mail.gmail.com>
References: <CAJd-hd=bu141Pscnh-XuZMSvYsnJP9soxrNciozyOPrZyV6iyQ@mail.gmail.com>
 <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com>
 <CAE7Aa4SyCbQBP0tH-HGVNYbLTeGoKp4CkPiOKkwMwX6nT5Ed3A@mail.gmail.com>
 <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com>
 <CAE7Aa4R+uRUxaBrgwJYKWAZpE4TbtKR4JCavarCB8rhVgzLnXQ@mail.gmail.com>
 <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com>
 <CAHA9McMpt8SqPHrsEVuU4tUQhECBfdBxPgBMUJnuVtoMP0Fauw@mail.gmail.com>
 <C2CAE1D34CAD40D8B53240A844CD1334@gmail.com>
 <CAHA9McPn3V_UNPTSCEZ_mz4-pW-pvEp+1PmoFOwbqsqUqdd69Q@mail.gmail.com>
 <CAE7Aa4QVcm+0qfrhK1qUnCsuvWrUvfK6V1ygTXBy7=DGEpqz7A@mail.gmail.com>
 <CAHA9McPGUh2eaRRvjWGxYsg9KT1CMszaQFE5nsOxchJYgZch1w@mail.gmail.com>
 <CAE7Aa4TOcjcwSi-ALFO6aosf=zfLchQZb=joYeZaGOuaw+4eHw@mail.gmail.com>
 <CAHA9McMXpND1nNkTrbx_tdUAXw8qg55pK4syFpKvwdhn_cBZHQ@mail.gmail.com>
 <CAHA9McM0ejf3gGDu1sciwRvFS05TXWON3k2FOg1sR5JewBLLZA@mail.gmail.com>
 <CAE7Aa4SheuYVtYcUALVsH56HZ3a1Kh1AhniZMz3zDQfr=J5Gnw@mail.gmail.com>
 <CAHA9McNiuEaEpqTge+Wvb2PODXxqg_+YqcETgvsZBJz7i=reJg@mail.gmail.com>
Message-ID: <CAE7Aa4Q+K+DDNipRtjTk2vgoHC+ziKLFqigQ7PKrUm8QEBTE-g@mail.gmail.com>

Steve, not to beat a dead horse on the "what to name the new parameter"
discussion,  but I'm wondering what your/others' thoughts are on using
something other than 'by".  Maybe even "uby"

Or perhaps we can have a synonym in the function definition:
   .. function(........ , by=uby, uby)

The reason I bring this up is that as I begin to use this and I am reading
over my own code, I realize that it takes a lot of visual parsing to
distinguish when the "by" in a complex call belongs to "[.data.table" and
when the "by" belongs to "unique.data.table"


Cheers,
Rick


On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou <
mailinglist.honeypot at gmail.com> wrote:

> Last update here :-)
>
> After more hemming and hawing, I've changed the name of the new
> parameter added to duplicated.data.table and unique.data.table from
> `by.columnss` to just `by`, as it (more or less) is the same idea as
> the `by` in dt[x, i,j,by,...]
>
> Sorry for any inconveniences caused if you've been working off of the
> development version.
>
> -steve
>
>
> On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta
> <saporta at scarletmail.rutgers.edu> wrote:
> > Steve, great stuff!!
> > thanks for making that happen
> >
> > Rick
> >
> >
> > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou
> > <mailinglist.honeypot at gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> As I needed this sooner than I had expected, I just committed this
> >> change. It's in svn revision 889.
> >>
> >> I chose 'by.columns' as the parameter names -- seemed to make more
> >> sense to me, and using the short hand interactively saves a letter,
> >> eg: unique(dt, by=c('some', 'columns')) ;-)
> >>
> >> Here's the note from the NEWS file:
> >>
> >> o  "Uniqueness" tests can now specify arbirtray combinations of
> >> columns to use to test for duplicates. `by.columns` parameter added to
> >> unique.data.table and duplicated.data.table. This allows the user to
> >> test for uniqueness using any combination of columns in the
> >> data.table, where previously the user only had the option to use the
> >> keyed columns (if keyed) or all columns (if not). The default behavior
> >> sets `by.columns=key(dt)` to maintain backward compatability. See
> >> man/duplicated.Rd and tests 986:991 for more information. Thanks to
> >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful
> >> discussions.
> >>
> >> Should work as advertised assuming my unit tests weren't too simplistic.
> >>
> >> Cheers,
> >>
> >> -steve
> >>
> >>
> >>
> >>
> >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou
> >> <mailinglist.honeypot at gmail.com> wrote:
> >> > Thanks for the suggestions, folks.
> >> >
> >> > Matthew: do you have a preference?
> >> >
> >> > -steve
> >> >
> >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta
> >> > <saporta at scarletmail.rutgers.edu> wrote:
> >> >> Steve,
> >> >>
> >> >> I like your suggestion a lot.  I can see putting column specification
> >> >> to
> >> >> good use.
> >> >>
> >> >> As for the argument name, perhaps
> >> >>    'use.columns'
> >> >>
> >> >> And where a value of NULL or FALSE will yield same results as
> >> >> `unique.data.frame`
> >> >>
> >> >>     use.columns=key(x)   # default behavior
> >> >>     use.columns=c("col1name", "col7name")   #etc
> >> >>     use.columns=NULL
> >> >>
> >> >>
> >> >> Thanks as always,
> >> >> Rick
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou
> >> >> <mailinglist.honeypot at gmail.com> wrote:
> >> >>>
> >> >>> Hi folks,
> >> >>>
> >> >>> I actually want to revisit the fix I made here.
> >> >>>
> >> >>> Instead of having `use.key` in the signature to unique.data.table
> (and
> >> >>> duplicated.data.table) to be:
> >> >>>
> >> >>> function(x,
> >> >>>              incomparables=FALSE,
> >> >>>              tolerance=.Machine$double.eps ^ 0.5,
> >> >>>              use.key=TRUE, ...)
> >> >>>
> >> >>> How about we switch out use.key for a parameter that specifies the
> >> >>> column names to use in the uniqueness check, which defaults to
> key(x)
> >> >>> to keep backwards compatibility.
> >> >>>
> >> >>> For argument's sake (like that?), lets call this parameter `columns`
> >> >>> (by.columns? with.columns? whatever) so:
> >> >>>
> >> >>> function(x,
> >> >>>              incomparables=FALSE,
> >> >>>              tolerance=.Machine$double.eps ^ 0.5,
> >> >>>              columns=key(x), ...)
> >> >>>
> >> >>> Then:
> >> >>>
> >> >>> (1) leaving it alone is the backward compatibile behavior;
> >> >>> (2) Perhaps setting it to NULL will use all columns, and make it
> >> >>> equivalent to unique.data.frame (also the same when x has no key);
> and
> >> >>> (3) setting it to any other combo of columns uses those columns as
> the
> >> >>> uniqueness key and filters the rows (only) out of x accordingly.
> >> >>>
> >> >>> What do you folks think? Personally I think this is better on all
> >> >>> accounts then just specifying to use the key or not and the only
> >> >>> question in my mind is the name of the argument -- happy to hear
> other
> >> >>> world views, however, so don't be shy.
> >> >>>
> >> >>> Thanks,
> >> >>> -steve
> >> >>>
> >> >>> --
> >> >>> Steve Lianoglou
> >> >>> Computational Biologist
> >> >>> Bioinformatics and Computational Biology
> >> >>> Genentech
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Steve Lianoglou
> >> > Computational Biologist
> >> > Bioinformatics and Computational Biology
> >> > Genentech
> >>
> >>
> >>
> >> --
> >> Steve Lianoglou
> >> Computational Biologist
> >> Bioinformatics and Computational Biology
> >> Genentech
> >
> >
>
>
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130927/a4de1871/attachment-0001.html>

From mdowle at mdowle.plus.com  Sat Sep 28 09:29:40 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Sat, 28 Sep 2013 08:29:40 +0100
Subject: [datatable-help] unique.data.frame should create a copy, right?
In-Reply-To: <CAE7Aa4Q+K+DDNipRtjTk2vgoHC+ziKLFqigQ7PKrUm8QEBTE-g@mail.gmail.com>
References: <CAJd-hd=bu141Pscnh-XuZMSvYsnJP9soxrNciozyOPrZyV6iyQ@mail.gmail.com>
 <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com>
 <CAE7Aa4R+uRUxaBrgwJYKWAZpE4TbtKR4JCavarCB8rhVgzLnXQ@mail.gmail.com>
 <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com>
 <CAHA9McMpt8SqPHrsEVuU4tUQhECBfdBxPgBMUJnuVtoMP0Fauw@mail.gmail.com>
 <C2CAE1D34CAD40D8B53240A844CD1334@gmail.com>
 <CAHA9McPn3V_UNPTSCEZ_mz4-pW-pvEp+1PmoFOwbqsqUqdd69Q@mail.gmail.com>
 <CAE7Aa4QVcm+0qfrhK1qUnCsuvWrUvfK6V1ygTXBy7=DGEpqz7A@mail.gmail.com>
 <CAHA9McPGUh2eaRRvjWGxYsg9KT1CMszaQFE5nsOxchJYgZch1w@mail.gmail.com>
 <CAE7Aa4TOcjcwSi-ALFO6aosf=zfLchQZb=joYeZaGOuaw+4eHw@mail.gmail.com>
 <CAHA9McMXpND1nNkTrbx_tdUAXw8qg55pK4syFpKvwdhn_cBZHQ@mail.gmail.com>
 <CAHA9McM0ejf3gGDu1sciwRvFS05TXWON3k2FOg1sR5JewBLLZA@mail.gmail.com>
 <CAE7Aa4SheuYVtYcUALVsH56HZ3a1Kh1AhniZMz3zDQfr=J5Gnw@mail.gmail.com>
 <CAHA9McNiuEaEpqTge+Wvb2PODXxqg_+YqcETgvsZBJz7i=reJg@mail.gmail.com>
 <CAE7Aa4Q+K+DDNipRtjTk2vgoHC+ziKLFqigQ7PKrUm8QEBTE-g@mail.gmail.com>
Message-ID: <52468564.3040801@mdowle.plus.com>

Oh, good point.

How about putting 'by' first in those situations :

 > DT = data.table(A=rep(1:3,2),B=1:2)
 > unique(by="A",DT)
    A B
1: 1 1
2: 2 2
3: 3 1
 > unique(by="B",DT)
    A B
1: 1 1
2: 2 2
 >

On 27/09/13 20:09, Ricardo Saporta wrote:
> Steve, not to beat a dead horse on the "what to name the new 
> parameter" discussion,  but I'm wondering what your/others' thoughts 
> are on using something other than 'by".  Maybe even "uby"
>
> Or perhaps we can have a synonym in the function definition:
>    .. function(........ , by=uby, uby)
>
> The reason I bring this up is that as I begin to use this and I am 
> reading over my own code, I realize that it takes a lot of visual 
> parsing to distinguish when the "by" in a complex call belongs to 
> "[.data.table" and when the "by" belongs to "unique.data.table"
>
> Cheers,
> Rick
>
>
> On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou 
> <mailinglist.honeypot at gmail.com 
> <mailto:mailinglist.honeypot at gmail.com>> wrote:
>
>     Last update here :-)
>
>     After more hemming and hawing, I've changed the name of the new
>     parameter added to duplicated.data.table and unique.data.table from
>     `by.columnss` to just `by`, as it (more or less) is the same idea as
>     the `by` in dt[x, i,j,by,...]
>
>     Sorry for any inconveniences caused if you've been working off of the
>     development version.
>
>     -steve
>
>
>     On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta
>     <saporta at scarletmail.rutgers.edu
>     <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>     > Steve, great stuff!!
>     > thanks for making that happen
>     >
>     > Rick
>     >
>     >
>     > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou
>     > <mailinglist.honeypot at gmail.com
>     <mailto:mailinglist.honeypot at gmail.com>> wrote:
>     >>
>     >> Hi all,
>     >>
>     >> As I needed this sooner than I had expected, I just committed this
>     >> change. It's in svn revision 889.
>     >>
>     >> I chose 'by.columns' as the parameter names -- seemed to make more
>     >> sense to me, and using the short hand interactively saves a letter,
>     >> eg: unique(dt, by=c('some', 'columns')) ;-)
>     >>
>     >> Here's the note from the NEWS file:
>     >>
>     >> o  "Uniqueness" tests can now specify arbirtray combinations of
>     >> columns to use to test for duplicates. `by.columns` parameter
>     added to
>     >> unique.data.table and duplicated.data.table. This allows the
>     user to
>     >> test for uniqueness using any combination of columns in the
>     >> data.table, where previously the user only had the option to
>     use the
>     >> keyed columns (if keyed) or all columns (if not). The default
>     behavior
>     >> sets `by.columns=key(dt)` to maintain backward compatability. See
>     >> man/duplicated.Rd and tests 986:991 for more information. Thanks to
>     >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for
>     useful
>     >> discussions.
>     >>
>     >> Should work as advertised assuming my unit tests weren't too
>     simplistic.
>     >>
>     >> Cheers,
>     >>
>     >> -steve
>     >>
>     >>
>     >>
>     >>
>     >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou
>     >> <mailinglist.honeypot at gmail.com
>     <mailto:mailinglist.honeypot at gmail.com>> wrote:
>     >> > Thanks for the suggestions, folks.
>     >> >
>     >> > Matthew: do you have a preference?
>     >> >
>     >> > -steve
>     >> >
>     >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta
>     >> > <saporta at scarletmail.rutgers.edu
>     <mailto:saporta at scarletmail.rutgers.edu>> wrote:
>     >> >> Steve,
>     >> >>
>     >> >> I like your suggestion a lot.  I can see putting column
>     specification
>     >> >> to
>     >> >> good use.
>     >> >>
>     >> >> As for the argument name, perhaps
>     >> >>    'use.columns'
>     >> >>
>     >> >> And where a value of NULL or FALSE will yield same results as
>     >> >> `unique.data.frame`
>     >> >>
>     >> >>     use.columns=key(x)   # default behavior
>     >> >>     use.columns=c("col1name", "col7name")   #etc
>     >> >>     use.columns=NULL
>     >> >>
>     >> >>
>     >> >> Thanks as always,
>     >> >> Rick
>     >> >>
>     >> >>
>     >> >>
>     >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou
>     >> >> <mailinglist.honeypot at gmail.com
>     <mailto:mailinglist.honeypot at gmail.com>> wrote:
>     >> >>>
>     >> >>> Hi folks,
>     >> >>>
>     >> >>> I actually want to revisit the fix I made here.
>     >> >>>
>     >> >>> Instead of having `use.key` in the signature to
>     unique.data.table (and
>     >> >>> duplicated.data.table) to be:
>     >> >>>
>     >> >>> function(x,
>     >> >>>  incomparables=FALSE,
>     >> >>>  tolerance=.Machine$double.eps ^ 0.5,
>     >> >>>              use.key=TRUE, ...)
>     >> >>>
>     >> >>> How about we switch out use.key for a parameter that
>     specifies the
>     >> >>> column names to use in the uniqueness check, which defaults
>     to key(x)
>     >> >>> to keep backwards compatibility.
>     >> >>>
>     >> >>> For argument's sake (like that?), lets call this parameter
>     `columns`
>     >> >>> (by.columns? with.columns? whatever) so:
>     >> >>>
>     >> >>> function(x,
>     >> >>>  incomparables=FALSE,
>     >> >>>  tolerance=.Machine$double.eps ^ 0.5,
>     >> >>>              columns=key(x), ...)
>     >> >>>
>     >> >>> Then:
>     >> >>>
>     >> >>> (1) leaving it alone is the backward compatibile behavior;
>     >> >>> (2) Perhaps setting it to NULL will use all columns, and
>     make it
>     >> >>> equivalent to unique.data.frame (also the same when x has
>     no key); and
>     >> >>> (3) setting it to any other combo of columns uses those
>     columns as the
>     >> >>> uniqueness key and filters the rows (only) out of x
>     accordingly.
>     >> >>>
>     >> >>> What do you folks think? Personally I think this is better
>     on all
>     >> >>> accounts then just specifying to use the key or not and the
>     only
>     >> >>> question in my mind is the name of the argument -- happy to
>     hear other
>     >> >>> world views, however, so don't be shy.
>     >> >>>
>     >> >>> Thanks,
>     >> >>> -steve
>     >> >>>
>     >> >>> --
>     >> >>> Steve Lianoglou
>     >> >>> Computational Biologist
>     >> >>> Bioinformatics and Computational Biology
>     >> >>> Genentech
>     >> >>
>     >> >>
>     >> >
>     >> >
>     >> >
>     >> > --
>     >> > Steve Lianoglou
>     >> > Computational Biologist
>     >> > Bioinformatics and Computational Biology
>     >> > Genentech
>     >>
>     >>
>     >>
>     >> --
>     >> Steve Lianoglou
>     >> Computational Biologist
>     >> Bioinformatics and Computational Biology
>     >> Genentech
>     >
>     >
>
>
>
>     --
>     Steve Lianoglou
>     Computational Biologist
>     Bioinformatics and Computational Biology
>     Genentech
>
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130928/09e05341/attachment.html>

From FErickson at psu.edu  Sun Sep 29 06:49:24 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Sun, 29 Sep 2013 00:49:24 -0400
Subject: [datatable-help] new key argument to [.data.table in 1.8.11
Message-ID: <CAJd-hdmVPfhu+6rkY6ribTCVeHHE3F0rZybQP9Q3xWTb0Snwfw@mail.gmail.com>

Hi,

I'm just continuing a discussion with @eddi that would not fit in an SO
comment. If you want to catch up, the references are...
http://r-forge.r-project.org/tracker/index.php?func=detail&aid=4675&group_id=240&atid=978
http://stackoverflow.com/a/19074195/1191259
The SO question (scroll up on the second link) was whether there was a way
to use a "temporary" key for X in an X[Y] join.

@eddi:

+1. Yeah, I like this new option and will probably use it.

Will this also overwrite the key when using [.data.table without doing
joins? That might be backward incompatible I guess, since `key` is already
an argument to `[.data.table`. That is, will x[i,,key='B'] do anything? I
don't think that type of command has had much use until now, and adding a j
argument (that doesn't start with `:=`) always makes a copy (right?), so
maybe backward compatibility would not be an issue there.

Regarding whether it's a reasonable compromise, ... well, I'll be using it,
anyway! I don't know what the feasibility constraints are on implementing
what I initially had in mind, so I'll defer to you and the developers on
that. If "secondary keys" are implemented down the road, that would solve
this problem in most cases.

As far as when I will use it, I guess it depends on the relative cost of
making a copy vs resetting the key on x. If I use the old syntax, I make a
copy, but don't have to change x's key back at the end (one copy, one key
setting). With the new syntax, I'd have to change the key on x back
afterward (zero copies, two key settings). If I know the sorting takes a
long time (e.g., because the key is the whole set of columns), I might
still go with copying.

Best,

Frank
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130929/989f96a5/attachment.html>

From eduard.antonyan at gmail.com  Sun Sep 29 15:47:36 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sun, 29 Sep 2013 08:47:36 -0500
Subject: [datatable-help] new key argument to [.data.table in 1.8.11
In-Reply-To: <CAJd-hdmVPfhu+6rkY6ribTCVeHHE3F0rZybQP9Q3xWTb0Snwfw@mail.gmail.com>
References: <CAJd-hdmVPfhu+6rkY6ribTCVeHHE3F0rZybQP9Q3xWTb0Snwfw@mail.gmail.com>
Message-ID: <CAHZcBOqpsM4n3sdXgS_uMkYQZVCdggmene+tv6n6zvOaXRtqUQ@mail.gmail.com>

There wasn't a 'key' argument before and yes, it will change the key
regardless of whether you're merging or not. Initially I added it just for
the merges, but then realized that there us no conceptual reason to
restrict it just to merges.

Fyi the reason you probably thought there is a key argument before is
because in R shorthand of arguments is valid syntax and you were actually
using 'keyby' (which has not changed).

You raise a good point that I haven't thought of that copying can be faster
than sorting - I will check when that's true. It's easy to implement the
copy version and I did this because I assumed it's the faster option, but
if it's not then might as well copy and do this for merges only.
 On Sep 28, 2013 11:50 PM, "Frank Erickson" <FErickson at psu.edu> wrote:

> Hi,
>
> I'm just continuing a discussion with @eddi that would not fit in an SO
> comment. If you want to catch up, the references are...
>
> http://r-forge.r-project.org/tracker/index.php?func=detail&aid=4675&group_id=240&atid=978
> http://stackoverflow.com/a/19074195/1191259
> The SO question (scroll up on the second link) was whether there was a way
> to use a "temporary" key for X in an X[Y] join.
>
> @eddi:
>
> +1. Yeah, I like this new option and will probably use it.
>
> Will this also overwrite the key when using [.data.table without doing
> joins? That might be backward incompatible I guess, since `key` is already
> an argument to `[.data.table`. That is, will x[i,,key='B'] do anything? I
> don't think that type of command has had much use until now, and adding a j
> argument (that doesn't start with `:=`) always makes a copy (right?), so
> maybe backward compatibility would not be an issue there.
>
> Regarding whether it's a reasonable compromise, ... well, I'll be using
> it, anyway! I don't know what the feasibility constraints are on
> implementing what I initially had in mind, so I'll defer to you and the
> developers on that. If "secondary keys" are implemented down the road, that
> would solve this problem in most cases.
>
> As far as when I will use it, I guess it depends on the relative cost of
> making a copy vs resetting the key on x. If I use the old syntax, I make a
> copy, but don't have to change x's key back at the end (one copy, one key
> setting). With the new syntax, I'd have to change the key on x back
> afterward (zero copies, two key settings). If I know the sorting takes a
> long time (e.g., because the key is the whole set of columns), I might
> still go with copying.
>
> Best,
>
> Frank
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130929/1df4de3e/attachment.html>

From eduard.antonyan at gmail.com  Sun Sep 29 16:02:06 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sun, 29 Sep 2013 09:02:06 -0500
Subject: [datatable-help] new key argument to [.data.table in 1.8.11
In-Reply-To: <CAHZcBOqpsM4n3sdXgS_uMkYQZVCdggmene+tv6n6zvOaXRtqUQ@mail.gmail.com>
References: <CAJd-hdmVPfhu+6rkY6ribTCVeHHE3F0rZybQP9Q3xWTb0Snwfw@mail.gmail.com>
 <CAHZcBOqpsM4n3sdXgS_uMkYQZVCdggmene+tv6n6zvOaXRtqUQ@mail.gmail.com>
Message-ID: <CAHZcBOq7WM=DVRtVJBZ-tRrFRtjN92d3ujJLxF-i_6Gv0bKbwA@mail.gmail.com>

Ah what am I thinking - you'll have to copy and still set a key, so unless
you have to go back to the old key (rarely?) this is strictly faster.


On Sun, Sep 29, 2013 at 8:47 AM, Eduard Antonyan
<eduard.antonyan at gmail.com>wrote:

> There wasn't a 'key' argument before and yes, it will change the key
> regardless of whether you're merging or not. Initially I added it just for
> the merges, but then realized that there us no conceptual reason to
> restrict it just to merges.
>
> Fyi the reason you probably thought there is a key argument before is
> because in R shorthand of arguments is valid syntax and you were actually
> using 'keyby' (which has not changed).
>
> You raise a good point that I haven't thought of that copying can be
> faster than sorting - I will check when that's true. It's easy to implement
> the copy version and I did this because I assumed it's the faster option,
> but if it's not then might as well copy and do this for merges only.
>  On Sep 28, 2013 11:50 PM, "Frank Erickson" <FErickson at psu.edu> wrote:
>
>> Hi,
>>
>> I'm just continuing a discussion with @eddi that would not fit in an SO
>> comment. If you want to catch up, the references are...
>>
>> http://r-forge.r-project.org/tracker/index.php?func=detail&aid=4675&group_id=240&atid=978
>> http://stackoverflow.com/a/19074195/1191259
>> The SO question (scroll up on the second link) was whether there was a
>> way to use a "temporary" key for X in an X[Y] join.
>>
>> @eddi:
>>
>> +1. Yeah, I like this new option and will probably use it.
>>
>> Will this also overwrite the key when using [.data.table without doing
>> joins? That might be backward incompatible I guess, since `key` is already
>> an argument to `[.data.table`. That is, will x[i,,key='B'] do anything? I
>> don't think that type of command has had much use until now, and adding a j
>> argument (that doesn't start with `:=`) always makes a copy (right?), so
>> maybe backward compatibility would not be an issue there.
>>
>> Regarding whether it's a reasonable compromise, ... well, I'll be using
>> it, anyway! I don't know what the feasibility constraints are on
>> implementing what I initially had in mind, so I'll defer to you and the
>> developers on that. If "secondary keys" are implemented down the road, that
>> would solve this problem in most cases.
>>
>> As far as when I will use it, I guess it depends on the relative cost of
>> making a copy vs resetting the key on x. If I use the old syntax, I make a
>> copy, but don't have to change x's key back at the end (one copy, one key
>> setting). With the new syntax, I'd have to change the key on x back
>> afterward (zero copies, two key settings). If I know the sorting takes a
>> long time (e.g., because the key is the whole set of columns), I might
>> still go with copying.
>>
>> Best,
>>
>> Frank
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130929/5f43cc25/attachment.html>

From FErickson at psu.edu  Sun Sep 29 16:58:48 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Sun, 29 Sep 2013 10:58:48 -0400
Subject: [datatable-help] new key argument to [.data.table in 1.8.11
In-Reply-To: <CAHZcBOq7WM=DVRtVJBZ-tRrFRtjN92d3ujJLxF-i_6Gv0bKbwA@mail.gmail.com>
References: <CAJd-hdmVPfhu+6rkY6ribTCVeHHE3F0rZybQP9Q3xWTb0Snwfw@mail.gmail.com>
 <CAHZcBOqpsM4n3sdXgS_uMkYQZVCdggmene+tv6n6zvOaXRtqUQ@mail.gmail.com>
 <CAHZcBOq7WM=DVRtVJBZ-tRrFRtjN92d3ujJLxF-i_6Gv0bKbwA@mail.gmail.com>
Message-ID: <CAJd-hdmU6w40ER=8ZULMO=1opSJv1Q3-tiwpn+dE-t74jU6owg@mail.gmail.com>

>
> There wasn't a 'key' argument before and yes, it will change the key
> regardless of whether you're merging or not. Initially I added it just for
> the merges, but then realized that there us no conceptual reason to
> restrict it just to merges.


Ah, my mistake. I saw "key" under the list of arguments in the
documentation and assumed it applied to [.data.table; but it's actually for
the data.table function.

Ah what am I thinking - you'll have to copy and still set a key, so unless
> you have to go back to the old key (rarely?) this is strictly faster.
>

Yeah, that was my initial use case, a "temporary key". This new
syntax/functionality should be useful when I don't want to go back to the
old key, though.

--Frank

>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130929/dcea2db3/attachment.html>

From harishv_99 at yahoo.com  Sun Sep 29 19:03:24 2013
From: harishv_99 at yahoo.com (Harish)
Date: Sun, 29 Sep 2013 10:03:24 -0700 (PDT)
Subject: [datatable-help] fread() coercing to character when seeing NA
Message-ID: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com>

Hi,

I am trying to get fread() to read NA without coercing the entire column to character, but I am unable to do it.? Please tell me whether I am doing something wrong or this is a bug.

# Load two data tables with a column of integers -- one with NA and one without

dt1 <- fread( "a\n2\n4\n8\n5", na.strings=c("?") )
dt2 <- fread( "a\n2\n4\n?\n5", na.strings=c("?") )


# The contents of both are as expected (or so it seems)
dt1
dt2

# The class of the column with NA is character
class( dt1$a )
class( dt2$a )??? # Not expecting this to be character

# Even setting colClasses does not help

dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"), colClasses=c(a="integer") )
class( dt3$a )


Thanks for your help.

Regards,
Harish
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130929/0a31d139/attachment-0001.html>

From julien.barnier at ens-lyon.fr  Mon Sep 30 16:06:31 2013
From: julien.barnier at ens-lyon.fr (Julien Barnier)
Date: Mon, 30 Sep 2013 16:06:31 +0200
Subject: [datatable-help] fread() coercing to character when seeing NA
In-Reply-To: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com>
References: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com>
Message-ID: <5223628.upPkjNS379@l018198>

Hi,

> dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"), colClasses=c(a="integer"))

I think that running fread with the verbose flag allows to answer your 
question :

R> dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"),colClasses=c(a="integer"), 
verbose=TRUE)
... <snip> ...
Column 1 ('a') has been detected as type 'character'. Ignoring request from 
colClasses to read as 'integer' (a lower type) since NAs would result.
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 4x1 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if 
triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.000s        Total

As your ?a? column contains a character string "?", fread dtermines this 
column as character. And colClasses is ignored as that would result in 
possibly unwanted NA value. And all of this, as I understand it, is because 
the replacement of na.strings by NA happens as the last step of fread, after 
the column type has been set.

So it seems that the only workarounds are either to change your data to 
replace your missing value code by a numerical value (like -9999 or anything 
else), or to convert your column back to numeric after using fread.

Regards,

Julien

-- 
Julien Barnier
Centre Max Weber
ENS de Lyon

From mdowle at mdowle.plus.com  Mon Sep 30 20:58:10 2013
From: mdowle at mdowle.plus.com (Matthew Dowle)
Date: Mon, 30 Sep 2013 19:58:10 +0100
Subject: [datatable-help] fread() coercing to character when seeing NA
In-Reply-To: <5223628.upPkjNS379@l018198>
References: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com>
 <5223628.upPkjNS379@l018198>
Message-ID: <5249C9C2.2000009@mdowle.plus.com>

Yes, exactly.  On the bug list is #2660 " Improve fread na.strings 
handling" :

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2660&group_id=240&atid=975

which points to :

http://stackoverflow.com/questions/15784138/bad-interpretation-of-n-a-using-fread

Matthew

On 30/09/13 15:06, Julien Barnier wrote:
> Hi,
>
>> dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"), colClasses=c(a="integer"))
> I think that running fread with the verbose flag allows to answer your
> question :
>
> R> dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"),colClasses=c(a="integer"),
> verbose=TRUE)
> ... <snip> ...
> Column 1 ('a') has been detected as type 'character'. Ignoring request from
> colClasses to read as 'integer' (a lower type) since NAs would result.
>     0.000s (  0%) Memory map (rerun may be quicker)
>     0.000s (  0%) sep and header detection
>     0.000s (  0%) Count rows (wc -l)
>     0.000s (  0%) Column type detection (first, middle and last 5 rows)
>     0.000s (  0%) Allocation of 4x1 result (xMB) in RAM
>     0.000s (  0%) Reading data
>     0.000s (  0%) Allocation for type bumps (if any), including gc time if
> triggered
>     0.000s (  0%) Coercing data already read in type bumps (if any)
>     0.000s (  0%) Changing na.strings to NA
>     0.000s        Total
>
> As your ?a? column contains a character string "?", fread dtermines this
> column as character. And colClasses is ignored as that would result in
> possibly unwanted NA value. And all of this, as I understand it, is because
> the replacement of na.strings by NA happens as the last step of fread, after
> the column type has been set.
>
> So it seems that the only workarounds are either to change your data to
> replace your missing value code by a numerical value (like -9999 or anything
> else), or to convert your column back to numeric after using fread.
>
> Regards,
>
> Julien
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130930/5b15f51e/attachment.html>

From alexandre.sieira at gmail.com  Mon Sep 30 22:01:47 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Mon, 30 Sep 2013 13:01:47 -0700
Subject: [datatable-help] rbind empty data tables
Message-ID: <etPan.5249d8ab.74b0dc51.7e54@MacBook-Pro-de-Alexandre-Sieira.local>

I encountered the following behavior with data.table 1.8.10 on R 3.0.2 on Mac OS X and was wondering if that is expected:

> dt1 = data.table(a=character())
> dt2 = data.table(a=character())
> dt1
Empty data.table (0 rows) of 1 col: a
> colnames(dt1)
[1] "a"
> dt2
Empty data.table (0 rows) of 1 col: a
> colnames(dt2)
[1] "a"
> rbind(dt1, dt2)
Error in setnames(ret, nm.original) : x has no column names

Enter a frame number, or 0 to exit ??

1: rbind(dt1, dt2)
2: rbind(deparse.level, ...)
3: data.table::.rbind.data.table(...)
4: setnames(ret, nm.original)

If I rbind two zero-row data.table objects with matching column names, I would have expected to get a zero-row data.table back (0 + 0 = 0, after all).

--?
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130930/051f8133/attachment.html>

From alexandre.sieira at gmail.com  Mon Sep 30 22:06:32 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Mon, 30 Sep 2013 13:06:32 -0700
Subject: [datatable-help] =?utf-8?q?rbind_empty_data_tables?=
In-Reply-To: <etPan.5249d8ab.74b0dc51.7e54@MacBook-Pro-de-Alexandre-Sieira.local>
References: <etPan.5249d8ab.74b0dc51.7e54@MacBook-Pro-de-Alexandre-Sieira.local>
Message-ID: <etPan.5249d9c8.238e1f29.7e54@MacBook-Pro-de-Alexandre-Sieira.local>

By the way, this works as I would expect with data.frame on the same environment:

> df1 = data.frame(a=character())
> df2 = data.frame(a=character())
> df1
[1] a
<0 rows> (or row.names with length 0)
> df2
[1] a
<0 rows> (or row.names with length 0)
> rbind(df1, df2)
[1] a
<0 rows> (or row.names with length 0)

--?
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I

On 30 de setembro de 2013 at 13:01:47, Alexandre Sieira (alexandre.sieira at gmail.com) wrote:

I encountered the following behavior with data.table 1.8.10 on R 3.0.2 on Mac OS X and was wondering if that is expected:

> dt1 = data.table(a=character())
> dt2 = data.table(a=character())
> dt1
Empty data.table (0 rows) of 1 col: a
> colnames(dt1)
[1] "a"
> dt2
Empty data.table (0 rows) of 1 col: a
> colnames(dt2)
[1] "a"
> rbind(dt1, dt2)
Error in setnames(ret, nm.original) : x has no column names

Enter a frame number, or 0 to exit ??

1: rbind(dt1, dt2)
2: rbind(deparse.level, ...)
3: data.table::.rbind.data.table(...)
4: setnames(ret, nm.original)

If I rbind two zero-row data.table objects with matching column names, I would have expected to get a zero-row data.table back (0 + 0 = 0, after all).

--?
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130930/2b4e71be/attachment-0001.html>