From ht at heatherturner.net Mon Sep 2 16:51:41 2013 From: ht at heatherturner.net (Heather Turner) Date: Mon, 2 Sep 2013 15:51:41 +0100 (BST) Subject: [datatable-help] fread coercion of very small number to character In-Reply-To: <7282782.82.1378128992927.JavaMail.heather@heather-VPCSB3C5E> Message-ID: <22722073.108.1378133500191.JavaMail.heather@heather-VPCSB3C5E> Hello, When reading a file with very small numbers in scientific notation, fread bumps the column type to "character": > tmp <- fread(files[1], verbose = TRUE) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t' Found 5 columns First row with 5 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. Count of eol after first data row: 188308 Subtracted 1 for last eol and any trailing empty lines, leaving 188307 data rows Type codes: 33302 (first 5 rows) Type codes: 33302 (+middle 5 rows) Type codes: 33302 (+last 5 rows) Bumping column 5 from REAL to STR on data row 361, field contains '1.46761e-313' 0.000s ( 0%) Memory map (rerun may be quicker) 0.000s ( 0%) sep and header detection 0.020s ( 13%) Count rows (wc -l) 0.000s ( 0%) Column type detection (first, middle and last 5 rows) 0.020s ( 13%) Allocation of 188307x5 result (xMB) in RAM 0.110s ( 73%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 0%) Changing na.strings to NA 0.150s Total Warning message: In fread(files[1], verbose = TRUE) : Bumped column 5 to type character on data row 361, field contains '1.46761e-313'. 
Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE. Perhaps there is some cutoff at e-300, since the preceding number '3.34402e-299' is read in okay. I can get round this by specifying the column as character using the colClasses argument, then coercing to numeric after the data has been read in. However it would be better if fread could read the data in as numeric in the first place, as read.table does (though much more slowly in my example). A simple example where type is detected as numeric then bumped to character (Which rows are used as the middle 5? Does not seem to be rows 7-11 as I would expect...) > dat <- data.frame(one = LETTERS[1:17], two = 1:17) > ## use strings here to replicate what I have in my data file > dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313") > write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE) > fread("test.txt", verbose = TRUE) ... Type codes: 32 (first 5 rows) Type codes: 32 (+middle 5 rows) Type codes: 32 (+last 5 rows) Bumping column 2 from REAL to STR on data row 9, field contains '1.46761e-313' ... Another example where type is detected as character from the first 5 rows > dat$two[1:2] <- c("3.34402e-299", "1.46761e-313") > write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE) > fread("test.txt", verbose = TRUE) ... Type codes: 33 (first 5 rows) Type codes: 33 (+middle 5 rows) Type codes: 33 (+last 5 rows) ... 
So aside from the issue of which rows are used for type detection, it does seem that 3.34402e-299 is detected as numeric whilst 1.46761e-313 is detected as character. Compare vs. read.table: > tmp <- read.table("test.txt", header = TRUE) > lapply(tmp, class) $one [1] "factor" $two [1] "numeric" Best wishes, Heather --- Package: data.table Version: 1.8.9 Maintainer: Matthew Dowle Built: R 3.0.1; x86_64-pc-linux-gnu; 2013-06-26 21:24:22 UTC; unix R version 3.0.1 (2013-05-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods [7] base other attached packages: [1] data.table_1.8.9 loaded via a namespace (and not attached): [1] compiler_3.0.1 tools_3.0.1 From mdowle at mdowle.plus.com Tue Sep 3 11:12:36 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 03 Sep 2013 10:12:36 +0100 Subject: [datatable-help] v1.8.10 is now on CRAN Message-ID: <5225A804.8060308@mdowle.plus.com> Please see NEWS : https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable As normal it will take a few days to reach all mirrors. Matthew -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Tue Sep 3 16:36:31 2013 From: statquant at outlook.com (statquant3) Date: Tue, 3 Sep 2013 07:36:31 -0700 (PDT) Subject: [datatable-help] Bug filled [#4878] Message-ID: <1378218991304-4675263.post@n4.nabble.com> I filled a bug [#4878] following this post -- View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263.html Sent from the datatable-help mailing list archive at Nabble.com. 
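The cutoff Heather observes sits at the IEEE-754 subnormal boundary rather than at e-300: the smallest positive normal double is about 2.225e-308, and smaller magnitudes (down to roughly 4.9e-324) are stored as subnormals, which a fast number parser may decline to handle. A short Python illustration of where that boundary falls — illustrative only, not data.table code:

```python
import sys

# IEEE-754 binary64: the smallest positive *normal* double.
smallest_normal = sys.float_info.min      # 2.2250738585072014e-308

normal_val = float("3.34402e-299")        # read fine by fread
subnormal_val = float("1.46761e-313")     # bumped to character by fread

assert normal_val >= smallest_normal           # within the normal range
assert 0.0 < subnormal_val < smallest_normal   # subnormal (denormal)
```

Both values parse as numeric in Python, as they do in R's read.table; a parser that only accepts the normal range would reject the second one, which matches the bump fread reports.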
From aragorn168b at gmail.com Tue Sep 3 16:50:56 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 3 Sep 2013 16:50:56 +0200 Subject: [datatable-help] Bug filled [#4878] In-Reply-To: <1378218991304-4675263.post@n4.nabble.com> References: <1378218991304-4675263.post@n4.nabble.com> Message-ID: Statquant, I don't think this is a bug because the default NA is indeed logical. If you do: x <- rep(NA, 10) class(x) # [1] logical You should just do: x <- rep(NA_integer_, 10) class(x) # [1] integer From ?NA (first paragraph): NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language. Arun On Tuesday, September 3, 2013 at 4:36 PM, statquant3 wrote: > I filled a bug [#4878] following this post > > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263.html > Sent from the datatable-help mailing list archive at Nabble.com (http://Nabble.com). > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From statquant at outlook.com Tue Sep 3 16:59:52 2013 From: statquant at outlook.com (statquant3) Date: Tue, 3 Sep 2013 07:59:52 -0700 (PDT) Subject: [datatable-help] Bug filled [#4878] In-Reply-To: References: <1378218991304-4675263.post@n4.nabble.com> Message-ID: <1378220392511-4675268.post@n4.nabble.com> Yes x = NA makes x logical but data.table is supposed to keep the type of the LHS when you do an update That's why you get the usual Message d'avis : In `[.data.table`(DT, , `:=`(a, 1.1)) : Coerced 'double' RHS to 'integer' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please. So I think it should still be the case even for 1 row data.table -- View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675268.html Sent from the datatable-help mailing list archive at Nabble.com. From aragorn168b at gmail.com Tue Sep 3 17:05:02 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 3 Sep 2013 17:05:02 +0200 Subject: [datatable-help] Bug filled [#4878] In-Reply-To: <1378220392511-4675268.post@n4.nabble.com> References: <1378218991304-4675263.post@n4.nabble.com> <1378220392511-4675268.post@n4.nabble.com> Message-ID: <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com> Seems you're right. I missed that warning message... 
Arun On Tuesday, September 3, 2013 at 4:59 PM, statquant3 wrote: > Yes x = NA makes x logical but data.table is supposed to keep the type of the > LHS when you do an update That's why you get the usual > Message d'avis : > In `[.data.table`(DT, , `:=`(a, 1.1)) : > Coerced 'double' RHS to 'integer' to match the column's type; may have > truncated precision. Either change the target column to 'double' first (by > creating a new 'double' vector length 3 (nrows of entire table) and assign > that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, > NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, > set the column type correctly up front when you create the table and stick > to it, please. > > So I think it should still be the case even for 1 row data.table > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675268.html > Sent from the datatable-help mailing list archive at Nabble.com (http://Nabble.com). > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Tue Sep 3 17:04:56 2013 From: statquant at outlook.com (statquant3) Date: Tue, 3 Sep 2013 08:04:56 -0700 (PDT) Subject: [datatable-help] Cannot use fread with data.table 1.8.10 Message-ID: <1378220696873-4675269.post@n4.nabble.com> Just tried the new version, took it from CRAN and had RStudio compile the .tar.gz; all went ok. 
When I try to load any csv I get the following: Erreur dans fread("test.csv") : 'integer64' must be a single character string: 'integer64', 'double' or 'character' sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C LC_TIME=C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] data.table_1.8.10 vimcom_0.9-8 -- View this message in context: http://r.789695.n4.nabble.com/Cannot-use-fread-with-data-table-1-8-10-tp4675269.html Sent from the datatable-help mailing list archive at Nabble.com. From statquant at outlook.com Tue Sep 3 17:10:06 2013 From: statquant at outlook.com (statquant3) Date: Tue, 3 Sep 2013 08:10:06 -0700 (PDT) Subject: [datatable-help] Bug filled [#4878] In-Reply-To: <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com> References: <1378218991304-4675263.post@n4.nabble.com> <1378220392511-4675268.post@n4.nabble.com> <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com> Message-ID: <1378221006699-4675271.post@n4.nabble.com> You can get the warning doing this (for example) R) DT = data.table(a=rep(1L,3)) R) DT a 1: 1 2: 1 3: 1 R) DT[,a:=1.1] Message d'avis : In `[.data.table`(DT, , `:=`(a, 1.1)) : Coerced 'double' RHS to 'integer' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please. -- View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675271.html Sent from the datatable-help mailing list archive at Nabble.com. 
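The warning statquant reproduces comes down to one rule: assigning elements into an existing column must respect that column's type (so the RHS is coerced, with a warning), while replacing the whole column is how a type legitimately changes. A rough analogy with a typed stdlib-Python container — illustrative only, not data.table's semantics or implementation:

```python
from array import array

col = array('i', [1, 1, 1])   # a typed integer "column"

# Element-wise assignment must match the existing type; here Python
# rejects the float outright, where data.table coerces 1.1 to 1L and warns.
try:
    col[0] = 1.1
except TypeError:
    pass

# Rebinding the name to a new double container replaces the "column"
# wholesale, and the type changes with it.
col = array('d', [1.1, 1.1, 1.1])
print(col.typecode)  # -> d
```

In data.table terms the second step corresponds to assigning a full-length 'double' vector to the column, which is what the warning itself suggests.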
From statquant at outlook.com Tue Sep 3 17:19:16 2013 From: statquant at outlook.com (statquant3) Date: Tue, 3 Sep 2013 08:19:16 -0700 (PDT) Subject: [datatable-help] Cannot use fread with data.table 1.8.10 In-Reply-To: <1378220696873-4675269.post@n4.nabble.com> References: <1378220696873-4675269.post@n4.nabble.com> Message-ID: <1378221556721-4675273.post@n4.nabble.com> Ok just took the .zip from http://datatable.r-forge.r-project.org/ and it is now working. I'll wait and try to compile it from source later (though it compiled fine so...) -- View this message in context: http://r.789695.n4.nabble.com/Cannot-use-fread-with-data-table-1-8-10-tp4675269p4675273.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Sep 3 20:18:53 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 03 Sep 2013 19:18:53 +0100 Subject: [datatable-help] Bug filled [#4878] In-Reply-To: <1378221006699-4675271.post@n4.nabble.com> References: <1378218991304-4675263.post@n4.nabble.com> <1378220392511-4675268.post@n4.nabble.com> <5386096E77AD43C08DD52B0ACB6D81A9@gmail.com> <1378221006699-4675271.post@n4.nabble.com> Message-ID: <5226280D.4060800@mdowle.plus.com> Just to clear up this thread, it's plonking. Search for "plonk" in ?":=". I've closed the bug report. Matthew On 03/09/13 16:10, statquant3 wrote: > You can get the warning doing this (for example) > > R) DT = data.table(a=rep(1L,3)) > R) DT > a > 1: 1 > 2: 1 > 3: 1 > R) DT[,a:=1.1] > Message d'avis : > In `[.data.table`(DT, , `:=`(a, 1.1)) : > Coerced 'double' RHS to 'integer' to match the column's type; may have > truncated precision. Either change the target column to 'double' first (by > creating a new 'double' vector length 3 (nrows of entire table) and assign > that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, > NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. 
Or, > set the column type correctly up front when you create the table and stick > to it, please. > > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Bug-filled-4878-tp4675263p4675271.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Tue Sep 3 20:34:37 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 03 Sep 2013 19:34:37 +0100 Subject: [datatable-help] Cannot use fread with data.table 1.8.10 In-Reply-To: <1378221556721-4675273.post@n4.nabble.com> References: <1378220696873-4675269.post@n4.nabble.com> <1378221556721-4675273.post@n4.nabble.com> Message-ID: <52262BBD.20607@mdowle.plus.com> That's very odd. Phew - glad it's working now though! All I can think is that it was to do with the install process on Windows when an R process is open at the same time with data.table loaded in it. We've had similar issues in the past sometimes where a reboot followed by reinstall of data.table works. The reboot ensures that every last nuance of .dll usage is cleared. And the reboot also ensures that all versions of R are shut down. Linux seems much better at updating shared objects (.so) which are in use by processes, although similar problems have been reported on Linux too when (my best guess is) a zombie process holds up something in the install process. Only one or two reports, mind you. The error about integer64 suggests that maybe the byte code didn't match up with the DLL code (since that's a new argument). Something like that, anyway, maybe. On 03/09/13 16:19, statquant3 wrote: > Ok just took the .zip from http://datatable.r-forge.r-project.org/ and it is > now working. > I'll wait and try to compile it from source later (though it compiled fine > so...) 
> > > > -- > View this message in context: http://r.789695.n4.nabble.com/Cannot-use-fread-with-data-table-1-8-10-tp4675269p4675273.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Tue Sep 3 20:39:03 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 03 Sep 2013 19:39:03 +0100 Subject: [datatable-help] fread coercion of very small number to character In-Reply-To: <22722073.108.1378133500191.JavaMail.heather@heather-VPCSB3C5E> References: <22722073.108.1378133500191.JavaMail.heather@heather-VPCSB3C5E> Message-ID: <52262CC7.5020305@mdowle.plus.com> Hi, This is a great bug report. Please could you file it on the tracker so it doesn't get forgotten. That way you'll also get notified automatically when the status changes. Hoping to clear up everything related to fread soon. Matthew On 02/09/13 15:51, Heather Turner wrote: > Hello, > > When reading a file with very small numbers in scientific notation, fread bumps the column type to "character": > >> tmp <- fread(files[1], verbose = TRUE) > Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. > Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t' > Found 5 columns > First row with 5 fields occurs on line 1 (either column names or first row of data) > All the fields on line 1 are character fields. Treating as the column names. 
> Count of eol after first data row: 188308 > Subtracted 1 for last eol and any trailing empty lines, leaving 188307 data rows > Type codes: 33302 (first 5 rows) > Type codes: 33302 (+middle 5 rows) > Type codes: 33302 (+last 5 rows) > Bumping column 5 from REAL to STR on data row 361, field contains '1.46761e-313' > 0.000s ( 0%) Memory map (rerun may be quicker) > 0.000s ( 0%) sep and header detection > 0.020s ( 13%) Count rows (wc -l) > 0.000s ( 0%) Column type detection (first, middle and last 5 rows) > 0.020s ( 13%) Allocation of 188307x5 result (xMB) in RAM > 0.110s ( 73%) Reading data > 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered > 0.000s ( 0%) Coercing data already read in type bumps (if any) > 0.000s ( 0%) Changing na.strings to NA > 0.150s Total > Warning message: > In fread(files[1], verbose = TRUE) : > Bumped column 5 to type character on data row 361, field contains '1.46761e-313'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE. > > Perhaps there is some cutoff at e-300, since the preceding number '3.34402e-299' is read in okay. > > I can get round this by specifying the column as character using the colClasses argument, then coercing to numeric after the data has been read in. However it would be better if fread could read the data in as numeric in the first place, as read.table does (though much more slowly in my example). 
> > A simple example where type is detected as numeric then bumped to character (Which rows are used as the middle 5? Does not seem to be rows 7-11 as I would expect...) > >> dat <- data.frame(one = LETTERS[1:17], two = 1:17) >> ## use strings here to replicate what I have in my data file >> dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313") >> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE) >> fread("test.txt", verbose = TRUE) > ... > Type codes: 32 (first 5 rows) > Type codes: 32 (+middle 5 rows) > Type codes: 32 (+last 5 rows) > Bumping column 2 from REAL to STR on data row 9, field contains '1.46761e-313' > ... > > Another example where type is detected as character from the first 5 rows > >> dat$two[1:2] <- c("3.34402e-299", "1.46761e-313") >> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE) >> fread("test.txt", verbose = TRUE) > ... > Type codes: 33 (first 5 rows) > Type codes: 33 (+middle 5 rows) > Type codes: 33 (+last 5 rows) > ... > > So aside from the issue of which rows are used for type detection, it does seem that 3.34402e-299 is detected as numeric whilst 1.46761e-313 is detected as character. Compare vs. 
read.table: > >> tmp <- read.table("test.txt", header = TRUE) >> lapply(tmp, class) > $one > [1] "factor" > > $two > [1] "numeric" > > Best wishes, > > Heather > > --- > Package: data.table > Version: 1.8.9 > Maintainer: Matthew Dowle > Built: R 3.0.1; x86_64-pc-linux-gnu; 2013-06-26 21:24:22 UTC; unix > > R version 3.0.1 (2013-05-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods > [7] base > > other attached packages: > [1] data.table_1.8.9 > > loaded via a namespace (and not attached): > [1] compiler_3.0.1 tools_3.0.1 > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From tbfowler4 at gmail.com Thu Sep 5 19:41:26 2013 From: tbfowler4 at gmail.com (Thell Fowler) Date: Thu, 5 Sep 2013 12:41:26 -0500 Subject: [datatable-help] column of named vectors in data.table and possible bug In-Reply-To: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com> References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com> Message-ID: Perhaps a 'too late' reply, but have you thought about bringing the names into the DT, using them, then dropping them? For example: > DT[, n:=names(DT$B)] > DT[,list(B=list(B),Names=list(n)),by=A] A B Names 1: 1 6,7,8 a,b,c 2: 2 9,10 d,e > DT$n<-NULL On Sat, Aug 24, 2013 at 2:57 AM, Arunkumar Srinivasan wrote: > Dear all, > > Suppose we've construct a data.table in this manner: > > x <- c(1,1,1,2,2) > y <- 6:10 > setattr(y, 'names', letters[1:5]) > DT<- data.table(A = x, B = y) > > DT$B > a b c d e > 6 7 8 9 10 > > You see that DT maintains the name of vector B. 
But if we do: > > DT[, names(B), by=A] > A V1 > 1: 1 a > 2: 1 b > 3: 1 c > 4: 2 a > 5: 2 b > 6: 2 c > > There are two things here: First, you see that only the names of the first > grouping is correct (A = 1). Second, the rest of the result has the same > names, and the result is also recycled to fit the length. Instead of 5 > rows, we get 6 rows. > > A way to get around it would be: > > DT[, names(DT$B)[.I], by=A] > A V1 > 1: 1 a > 2: 1 b > 3: 1 c > 4: 2 d > 5: 2 e > > However, if one wants to do: > > DT[, list(list(B)), by=A]$V1 > [[1]] > a b c > 6 7 8 > > [[2]] > a b > 9 10 > > You see that the names are once again wrong (for A = 2). Just the first > one remains right. > > My question is, is it allowed usage of having names for column vectors? If > so, then this should be a bug. If not, it'd be a great feature to have. > > Arun > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -- Sincerely, Thell -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Fri Sep 6 11:52:48 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 6 Sep 2013 11:52:48 +0200 Subject: [datatable-help] column of named vectors in data.table and possible bug In-Reply-To: References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com> Message-ID: <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com> Hi Thell, It's not late :). Thanks for your reply. Yes of course we could do the way you specified. But the usage for the feature I mentioned is quite different. 
I was thinking of doing something even more efficient for this question on SO (http://stackoverflow.com/questions/17308551/do-callrbind-list-for-uneven-number-of-column): Arun On Thursday, September 5, 2013 at 7:41 PM, Thell Fowler wrote: > Perhaps a 'too late' reply, but have you thought about bringing the names into the DT, using them, then dropping them? > > For example: > > > DT[, n:=names(DT$B)] > > DT[,list(B=list(B),Names=list(n)),by=A] > A B Names > 1: 1 6,7,8 a,b,c > 2: 2 9,10 d,e > > DT$n<-NULL > > > > On Sat, Aug 24, 2013 at 2:57 AM, Arunkumar Srinivasan wrote: > > Dear all, > > > > Suppose we've construct a data.table in this manner: > > > > x <- c(1,1,1,2,2) > > y <- 6:10 > > setattr(y, 'names', letters[1:5]) > > DT<- data.table(A = x, B = y) > > > > DT$B > > a b c d e > > 6 7 8 9 10 > > > > > > You see that DT maintains the name of vector B. But if we do: > > > > DT[, names(B), by=A] > > A V1 > > 1: 1 a > > 2: 1 b > > 3: 1 c > > 4: 2 a > > 5: 2 b > > 6: 2 c > > > > > > There are two things here: First, you see that only the names of the first grouping is correct (A = 1). Second, the rest of the result has the same names, and the result is also recycled to fit the length. Instead of 5 rows, we get 6 rows. > > > > A way to get around it would be: > > > > DT[, names(DT$B)[.I], by=A] > > A V1 > > 1: 1 a > > 2: 1 b > > 3: 1 c > > 4: 2 d > > 5: 2 e > > > > > > However, if one wants to do: > > > > DT[, list(list(B)), by=A]$V1 > > [[1]] > > a b c > > 6 7 8 > > > > [[2]] > > a b > > 9 10 > > > > > > You see that the names are once again wrong (for A = 2). Just the first one remains right. > > > > My question is, is it allowed usage of having names for column vectors? If so, then this should be a bug. If not, it'd be a great feature to have. 
> > > > Arun > > > > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > -- > Sincerely, > Thell -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat Sep 7 01:42:00 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 07 Sep 2013 00:42:00 +0100 Subject: [datatable-help] column of named vectors in data.table and possible bug In-Reply-To: <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com> References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com> <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com> Message-ID: <522A6848.9000500@mdowle.plus.com> Just caught up with this thread. > is it allowed usage of having names for column vectors? It wasn't intended, no. It would slow down grouping if it had to maintain the names attribute too in the subsets. data.table is intended to be used as a list of plain columns and the internals assume that. names(DT$col) might exist though if data.table() has used a reference to an input without taking a copy. It would then copy on first := to that column and drop the names attribute at that point. Which is why we might like to leave names there and just not use them. But I'm thinking data.table() should drop names then to make this cleaner. Despite that meaning a copy of the vector has to be taken if it has names. A copy is taken currently anyway. But in GNU R 3.1.0, with list() no longer copying named inputs, we can do more on that front. Matthew From aragorn168b at gmail.com Sat Sep 7 15:11:14 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 7 Sep 2013 15:11:14 +0200 Subject: [datatable-help] melt for data.table Message-ID: Hi everybody, In the recent commit (940-944), a faster version of melt, "fmelt" is implemented. 
Have a look at this post (http://stackoverflow.com/a/18668808/559784) for a benchmark. It'd be great to get some feedback. You can download the recent commit from the first link here (https://r-forge.r-project.org/scm/?group_id=240). Arun -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sat Sep 7 18:30:08 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 7 Sep 2013 18:30:08 +0200 Subject: [datatable-help] melt for data.table In-Reply-To: References: Message-ID: Hi all, Regarding the earlier email on "fmelt": After early feedback, the fmelt _function_ has already changed to be a reshape2::melt _method_ for data.table instead. I've deleted the link on S.O. for now and will post again soon here with updated links... Thank you for understanding, Arun On Saturday, September 7, 2013 at 3:11 PM, Arunkumar Srinivasan wrote: > Hi everybody, > > In the recent commit (940-944), a faster version of melt, "fmelt" is implemented. Have a look at this post (http://stackoverflow.com/a/18668808/559784) for a benchmark. It'd be great to get some feedback. > > You can download the recent commit from the first link here (https://r-forge.r-project.org/scm/?group_id=240). > > Arun > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sun Sep 8 00:23:35 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 8 Sep 2013 00:23:35 +0200 Subject: [datatable-help] column of named vectors in data.table and possible bug In-Reply-To: <522A6848.9000500@mdowle.plus.com> References: <2826ECD8F1A445629ED4A27B96DCF661@gmail.com> <4FB1A379A48F45AB885DEE9C8DC37883@gmail.com> <522A6848.9000500@mdowle.plus.com> Message-ID: Great explanation! Thank you. Got the point. Arun On Saturday, September 7, 2013 at 1:42 AM, Matthew Dowle wrote: > Just caught up with this thread. > > is it allowed usage of having names for column vectors? > > It wasn't intended, no. 
It would slow down grouping if it had to > maintain the names attribute too in the subsets. data.table is > intended to be used as a list of plain columns and the internals assume > that. names(DT$col) might exist though if data.table() has used a > reference to an input without taking a copy. It would then copy on > first := to that column and drop the names attribute at that point. > Which is why we might like to leave names there and just not use them. > > But I'm thinking data.table() should drop names then to make this > cleaner. Despite that meaning a copy of the vector has to be taken if > it has names. A copy is taken currently anyway. But in GNU R 3.1.0, > with list() no longer copying named inputs, we can do more on that front. > > Matthew -------------- next part -------------- An HTML attachment was scrubbed... URL: From mattguzzo12 at gmail.com Sun Sep 8 06:46:14 2013 From: mattguzzo12 at gmail.com (guzzom) Date: Sat, 7 Sep 2013 21:46:14 -0700 (PDT) Subject: [datatable-help] Sub setting multiple ids based on a 2nd data frame Message-ID: <1378615574149-4675620.post@n4.nabble.com> Hi All, I have some telemetry data that spans multiple years (2002 - 2013) with multiple individuals per year. I want to subset the telemetry data to include only those data points that fall between specific dates which are provided in a 2nd data frame. 
The telemetry df is in the form of: DF "A" ID Date Depth Temp 1 2012-05-12 10 12 1 2012-05-13 10 12 1 2012-05-14 10 12 1 2012-05-15 10 12 2 2012-05-16 10 12 2 2012-05-17 10 12 2 2012-05-18 10 12 2 2012-05-19 10 12 3 2012-05-20 10 12 3 2012-05-21 10 12 3 2012-05-22 10 12 3 2012-05-23 10 12 3 2012-05-24 10 12 And the df with the dates I want to use to subset is formatted as follows: DF "B" Year Start End 2002 2002-05-10 2002-11-01 2003 2003-05-11 2003-11-02 2004 2004-05-12 2004-11-03 2005 2005-05-13 2005-11-04 2006 2006-05-14 2006-11-05 So, I want to say, for each ID in DF A, subset and keep only those data points collected on a date that fall between the start and end date for the corresponding year from DF B. I am unsure if a loop is my best bet, or using plyr (which I am unfamiliar with). I am relatively new to R, so this seems a bit above my head. Any help is much appreciated. Thanks in advance! -- View this message in context: http://r.789695.n4.nabble.com/Sub-setting-multiple-ids-based-on-a-2nd-data-frame-tp4675620.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Sun Sep 8 09:57:06 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 08 Sep 2013 08:57:06 +0100 Subject: [datatable-help] Sub setting multiple ids based on a 2nd data frame In-Reply-To: <1378615574149-4675620.post@n4.nabble.com> References: <1378615574149-4675620.post@n4.nabble.com> Message-ID: <522C2DD2.3040906@mdowle.plus.com> Hi, Good question. 
How about : http://stackoverflow.com/questions/17867553/data-table-join-using-two-columns-from-one-table-and-one-column-from-other http://stackoverflow.com/questions/17597508/merging-endpoints-of-a-range-with-a-sequence http://stackoverflow.com/questions/16666183/find-values-in-a-given-interval-without-a-vector-scan The syntax for range queries is a bit tricky and we hope to make it easier in future : https://r-forge.r-project.org/tracker/index.php?func=detail&aid=203&group_id=240&atid=978 Matthew On 08/09/13 05:46, guzzom wrote: > Hi All, > > I have some telemetry data that spans multiple years (2002 - 2013) with > multiple individuals per year. I want to subset the telemetry data to > include only those data points that fall between specific dates which are > provided in a 2nd data frame. The telemetry df is in the form of: > > DF "A" > > ID Date Depth Temp > 1 2012-05-12 10 12 > 1 2012-05-13 10 12 > 1 2012-05-14 10 12 > 1 2012-05-15 10 12 > 2 2012-05-16 10 12 > 2 2012-05-17 10 12 > 2 2012-05-18 10 12 > 2 2012-05-19 10 12 > 3 2012-05-20 10 12 > 3 2012-05-21 10 12 > 3 2012-05-22 10 12 > 3 2012-05-23 10 12 > 3 2012-05-24 10 12 > > And the df with the dates I want to use to subset is formatted as follows: > > DF "B" > > Year Start End > 2002 2002-05-10 2002-11-01 > 2003 2003-05-11 2003-11-02 > 2004 2004-05-12 2004-11-03 > 2005 2005-05-13 2005-11-04 > 2006 2006-05-14 2006-11-05 > > So, I want to say, for each ID in DF A, subset and keep only those data > points collected on a date that fall between the start and end date for the > corresponding year from DF B. > > I am unsure if a loop is my best bet, or using plyr (which I am unfamiliar > with). I am relatively new to R, so this seems a bit above my head. Any help > is much appreciated. > > Thanks in advance! > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Sub-setting-multiple-ids-based-on-a-2nd-data-frame-tp4675620.html > Sent from the datatable-help mailing list archive at Nabble.com. 
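For readers of the archive: a minimal sketch of the join-then-subset approach those links describe. Object and column names follow the question above; the dates are toy values, and this is an editor's illustration rather than code from the thread:

```r
library(data.table)

# Toy versions of DF "A" and DF "B" from the question
A <- data.table(ID    = rep(1:3, each = 4),
                Date  = as.IDate("2012-05-12") + 0:11,
                Depth = 10, Temp = 12)
B <- data.table(Year  = 2012L,
                Start = as.IDate("2012-05-14"),
                End   = as.IDate("2012-05-20"))

# Attach the year to A, join on it, then keep rows with Date in [Start, End]
A[, Year := as.integer(format(Date, "%Y"))]
setkey(A, Year)
setkey(B, Year)
res <- B[A][Date >= Start & Date <= End]
```

The B[A] join carries Start and End onto every telemetry row of the matching year, so the final `[Date >= Start & Date <= End]` is an ordinary subset; the tracker item linked above is about making exactly this kind of range query terser and faster.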
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From statquant at outlook.com Tue Sep 10 14:03:57 2013 From: statquant at outlook.com (statquant3) Date: Tue, 10 Sep 2013 05:03:57 -0700 (PDT) Subject: [datatable-help] data.table on the command line Message-ID: <1378814637664-4675755.post@n4.nabble.com> I would like to try to use data.table awesomeness on the command line. The usual use case is that you have a file and you would like to quickly create a summarized other file. Sometimes you wouldn't need to start R Something like (I'm just guessing) $) DTCMD myfile.csv "[1:5, list(a,b,c=sum(d),e=cumsum(f)), by=grp][,test:='hello']" > newFile.csv Would someone have an idea about this ? -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Sep 10 14:59:01 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 10 Sep 2013 13:59:01 +0100 Subject: [datatable-help] data.table on the command line In-Reply-To: <1378814637664-4675755.post@n4.nabble.com> References: <1378814637664-4675755.post@n4.nabble.com> Message-ID: <522F1795.6040603@mdowle.plus.com> Maybe : http://dirk.eddelbuettel.com/code/littler.html On 10/09/13 13:03, statquant3 wrote: > I would like to try to use data.table awesomeness on the command line. > The usual use case is that you have a file and you would like to quickly > create a summarized other file. > Sometimes you wouldn't need to start R > > Something like (I'm just guessing) > > $) DTCMD myfile.csv "[1:5, list(a,b,c=sum(d),e=cumsum(f)), > by=grp][,test:='hello']" > newFile.csv > > Would someone have an idea about this ? 
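An editor's sketch of what such a wrapper could look like, ahead of the replies below: a tiny shell script around Rscript. The name dtcmd and its two-argument convention are invented for illustration; it assumes R and data.table are installed, and it does no quoting or error handling:

```sh
#!/bin/sh
# dtcmd: apply a data.table expression to a CSV and write the result as CSV.
# Usage: dtcmd myfile.csv '[, list(c = sum(d)), by = grp]' > newFile.csv
infile="$1"
expr="$2"
Rscript --vanilla -e "
  suppressPackageStartupMessages(library(data.table))
  DT <- fread('$infile')
  write.csv(DT${expr}, row.names = FALSE)
"
```

--vanilla keeps the user's .Rprofile and site files out of the way, which also addresses the two objections raised later in this thread (having to load the library by hand, and the Rprofile printing to the screen).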
> > > > -- > View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From statquant at outlook.com Tue Sep 10 15:10:04 2013 From: statquant at outlook.com (statquant3) Date: Tue, 10 Sep 2013 06:10:04 -0700 (PDT) Subject: [datatable-help] data.table on the command line In-Reply-To: <522F1795.6040603@mdowle.plus.com> References: <1378814637664-4675755.post@n4.nabble.com> <522F1795.6040603@mdowle.plus.com> Message-ID: <1378818604944-4675759.post@n4.nabble.com> I thought about this... but that would be linux only then no ? -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675759.html Sent from the datatable-help mailing list archive at Nabble.com. From ramine.mossadegh at finra.org Tue Sep 10 16:35:26 2013 From: ramine.mossadegh at finra.org (ramoss) Date: Tue, 10 Sep 2013 07:35:26 -0700 (PDT) Subject: [datatable-help] XLSX Help: Exporting to multiple sheets in excel Message-ID: <1378823726505-4675767.post@n4.nabble.com> Hello: I just discovered the XLSX package. I know how to export 1 dataframe to 1 excel sheet using the XLSX package. write.xlsx(x= all, file="c:/reports/outlier.xlsx", sheetName="outlierdays",row.names= FALSE) How would I export multiple data frames to multiple sheets? The data frames names are: all, results2 & stats2 The excel file is called outlier The sheets within it are: outlierdays, outlier, normaltest. 
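Though off-topic for this list (as the reply below notes), the usual xlsx-package pattern is worth recording here: write.xlsx creates the workbook on the first call and adds sheets on later calls with append = TRUE. A sketch using the file, sheet, and data frame names from the question (untested by the editor):

```r
library(xlsx)

out <- "c:/reports/outlier.xlsx"

# First call creates the workbook; later calls add sheets to the same file
write.xlsx(all,      file = out, sheetName = "outlierdays", row.names = FALSE)
write.xlsx(results2, file = out, sheetName = "outlier",    row.names = FALSE, append = TRUE)
write.xlsx(stats2,   file = out, sheetName = "normaltest", row.names = FALSE, append = TRUE)
```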
Thanks for your help -- View this message in context: http://r.789695.n4.nabble.com/XLSX-Help-Exporting-to-multiple-sheets-in-excel-tp4675767.html Sent from the datatable-help mailing list archive at Nabble.com. From ramine.mossadegh at finra.org Tue Sep 10 16:50:30 2013 From: ramine.mossadegh at finra.org (ramoss) Date: Tue, 10 Sep 2013 07:50:30 -0700 (PDT) Subject: [datatable-help] XLSX Help: Exporting to multiple sheets in excel In-Reply-To: <1378823726505-4675767.post@n4.nabble.com> References: <1378823726505-4675767.post@n4.nabble.com> Message-ID: <1378824630144-4675770.post@n4.nabble.com> I found the answer here: http://www.r-bloggers.com/importexport-data-to-and-from-xlsx-files/ -- View this message in context: http://r.789695.n4.nabble.com/XLSX-Help-Exporting-to-multiple-sheets-in-excel-tp4675767p4675770.html Sent from the datatable-help mailing list archive at Nabble.com. From aragorn168b at gmail.com Tue Sep 10 16:53:46 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Tue, 10 Sep 2013 16:53:46 +0200 Subject: [datatable-help] XLSX Help: Exporting to multiple sheets in excel In-Reply-To: <1378824630144-4675770.post@n4.nabble.com> References: <1378823726505-4675767.post@n4.nabble.com> <1378824630144-4675770.post@n4.nabble.com> Message-ID: <50BF512F01F34C40945422780583764F@gmail.com> Ramoss, Glad, but I think you're on the wrong mailing list. This is for help with R package data.table. Arun On Tuesday, September 10, 2013 at 4:50 PM, ramoss wrote: > I found the answer here: > http://www.r-bloggers.com/importexport-data-to-and-from-xlsx-files/ > > > > -- > View this message in context: http://r.789695.n4.nabble.com/XLSX-Help-Exporting-to-multiple-sheets-in-excel-tp4675767p4675770.html > Sent from the datatable-help mailing list archive at Nabble.com (http://Nabble.com). 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Sep 10 16:57:17 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 10 Sep 2013 15:57:17 +0100 Subject: [datatable-help] data.table on the command line In-Reply-To: <1378818604944-4675759.post@n4.nabble.com> References: <1378814637664-4675755.post@n4.nabble.com> <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com> Message-ID: <522F334D.7030905@mdowle.plus.com> Hm. How about Rscript -e ? On 10/09/13 14:10, statquant3 wrote: > I thought about this... but that would be linux only then no ? > > > > -- > View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675759.html > Sent from the datatable-help mailing list archive at Nabble.com. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From statquant at outlook.com Tue Sep 10 17:11:53 2013 From: statquant at outlook.com (statquant3) Date: Tue, 10 Sep 2013 08:11:53 -0700 (PDT) Subject: [datatable-help] data.table on the command line In-Reply-To: <522F334D.7030905@mdowle.plus.com> References: <1378814637664-4675755.post@n4.nabble.com> <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com> <522F334D.7030905@mdowle.plus.com> Message-ID: <1378825913051-4675774.post@n4.nabble.com> I am not convinced... 
For example here is what I just tried (I am on windows here) C:\Travail\futCAC\data>C:\Travail\Tools\R-3.0.1\bin\Rscript.exe -e "library(data.table); fread('ORDRES_20120831.csv',nrows=100)[,list(CDTSA,ISIN)][,list(count=.N),by=as.Date(CDTSA)]" At this point this is like writing an R script 1) I need to require the libraries 2) My Rprofile is printed on the screen Those 2 could be solved using a wrapper...Maybe it is as good as it gets -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675774.html Sent from the datatable-help mailing list archive at Nabble.com. From mdowle at mdowle.plus.com Tue Sep 10 17:18:42 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 10 Sep 2013 16:18:42 +0100 Subject: [datatable-help] data.table on the command line In-Reply-To: <1378825913051-4675774.post@n4.nabble.com> References: <1378814637664-4675755.post@n4.nabble.com> <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com> <522F334D.7030905@mdowle.plus.com> <1378825913051-4675774.post@n4.nabble.com> Message-ID: <522F3852.2030807@mdowle.plus.com> > Maybe it is as good as it gets I'm not sure what you need, but R has many startup options, and there's .Rprofile. Have you really hunted hard? On 10/09/13 16:11, statquant3 wrote: > I am not convinced... > For example here is what I just tried (I am on windows here) > > C:\Travail\futCAC\data>C:\Travail\Tools\R-3.0.1\bin\Rscript.exe -e > "library(data.table); > fread('ORDRES_20120831.csv',nrows=100)[,list(CDTSA,ISIN)][,list(count=.N),by=as.Date(CDTSA)]" > > At this point this is like writing an R script > 1) I need to require the libraries > 2) My Rprofile is printed on the screen > > Those 2 could be solved using a wrapper...Maybe it is as good as it gets > > > > > -- > View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675774.html > Sent from the datatable-help mailing list archive at Nabble.com. 
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From statquant at outlook.com Tue Sep 10 17:31:53 2013 From: statquant at outlook.com (statquant3) Date: Tue, 10 Sep 2013 08:31:53 -0700 (PDT) Subject: [datatable-help] data.table on the command line In-Reply-To: <522F3852.2030807@mdowle.plus.com> References: <1378814637664-4675755.post@n4.nabble.com> <522F1795.6040603@mdowle.plus.com> <1378818604944-4675759.post@n4.nabble.com> <522F334D.7030905@mdowle.plus.com> <1378825913051-4675774.post@n4.nabble.com> <522F3852.2030807@mdowle.plus.com> Message-ID: <1378827113153-4675782.post@n4.nabble.com> Actually I think I wanted something simpler as far as syntax was concerned but I realize this is a whole new project. I am aware of all the startup options like --vanilla, --no-environ, --no-init-file etc... In my previous job we had a nice tool which read csv and allowed csv manipulation on the command line; data.table provides everything, but the syntax, although much simpler than data.frame's, is a bit more verbose. I think wrapping it in a script might streamline the syntax, I need to give it some thought I guess. Sorry for being so fuzzy -- View this message in context: http://r.789695.n4.nabble.com/data-table-on-the-command-line-tp4675755p4675782.html Sent from the datatable-help mailing list archive at Nabble.com. From caneff at gmail.com Tue Sep 10 19:32:20 2013 From: caneff at gmail.com (Chris Neff) Date: Tue, 10 Sep 2013 13:32:20 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason Message-ID: I'm pretty sure it is some issue of a column that thinks it is bigger than it actually is. I have tried, so far in vain, to make a reproducible example that I can share. I have one, but can't share it. What happens is this: A data.frame is made: > d = data.frame(...) 
Then I call apply over every row, calling a different function that takes in a DT as well: l = apply(d, 1, function(x) func(x[1], x[2], DT)) This returns a data.frame. If I rbindlist this: a = rbindlist(l) I can print a just fine, and it will show me all data like normal. but if I try to just do a$x x is one of the columns that was a key in DT, then it segfaults. If I ask for a column that was made by "func" and wasn't a column in DT, it works fine. If I ask for only the first 10 rows and then ask for x: a[1:10]$x it works fine. So somewhere these key columns think they are different lengths than they really are, and when I try to access it I go into memory I shouldn't so I segfault. How can I verify this? Is there something about the DT I can check to see what DT thinks these columns are? Also, if instead of apply when making the list, I do l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT)) and rbindlist that, it works fine too. -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Tue Sep 10 19:47:32 2013 From: caneff at gmail.com (Chris Neff) Date: Tue, 10 Sep 2013 13:47:32 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: Message-ID: Narrowing it down further, a$x segfaults and a[,x] segfaults but a[,"x", with=FALSE] doesn't. On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff wrote: > I'm pretty sure it is some issue of a column that thinks it is bigger than > it actually is. I have tried, so far in vain, to make a reproducible > example that I can share. I have one, but can't share it. > > What happens is this: > > A data.frame is made: > > > d = data.frame(...) > > Then I call apply over every row, calling a different function that takes > in a DT as well: > > l = apply(d, 1, function(x) func(x[1], x[2], DT)) > > This returns a data.frame. 
If I rbindlist this: > > a = rbindlist(l) > > I can print a just fine, and it will show me all data like normal. but if > I try to just do > > a$x > > x is one of the columns that was a key in DT, then it segfaults. If I ask > for a column that was made by "func" and wasn't a column in DT, it works > fine. If I ask for only the first 10 rows and then ask for x: > > a[1:10]$x > > it works fine. > > So somewhere these key columns think they are different lengths than they > really are, and when I try to access it I go into memory I shouldn't so I > segfault. How can I verify this? Is there something about the DT I can > check to see what DT thinks these columns are? > > > Also, if instead of apply when making the list, I do > > l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT)) > > and rbindlist that, it works fine too. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Tue Sep 10 20:02:03 2013 From: FErickson at psu.edu (Frank Erickson) Date: Tue, 10 Sep 2013 14:02:03 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: Message-ID: There's also a[["x"]], I suppose... :) and, looking at methods(`[`) ... `[.listof`(a,1) `[.data.frame`(a,1) if it's in the 1st column. Because we can't fully see your example, maybe you'll want to look at these other segfault stories: http://stackoverflow.com/search?q=segfault+%5Bdata.table%5D I think they're both fixed with the latest R and data.table, though. --Frank p.s. Sorry for the double reply, Chris; forgot to use "reply to all" On Tue, Sep 10, 2013 at 1:59 PM, Frank Erickson wrote: > There's also a[["x"]], I suppose... :) > > and, looking at methods(`[`) ... > > `[.listof`(a,1) > `[.data.frame`(a,1) > > if it's in the 1st column. 
> > Because we can't fully see your example, maybe you'll want to look at > these other segfault stories: > http://stackoverflow.com/search?q=segfault+%5Bdata.table%5D I think > they're both fixed with the latest R and data.table, though. > > --Frank > > > > On Tue, Sep 10, 2013 at 1:47 PM, Chris Neff wrote: > >> Narrowing it down further, >> >> a$x >> >> segfaults and >> >> a[,x] >> >> segfaults but >> >> a[,"x", with=FALSE] >> >> doesn't. >> >> >> On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff wrote: >> >>> I'm pretty sure it is some issue of a column that thinks it is bigger >>> than it actually is. I have tried, so far in vain, to make a reproducible >>> example that I can share. I have one, but can't share it. >>> >>> What happens is this: >>> >>> A data.frame is made: >>> >>> > d = data.frame(...) >>> >>> Then I call apply over every row, calling a different function that >>> takes in a DT as well: >>> >>> l = apply(d, 1, function(x) func(x[1], x[2], DT)) >>> >>> This returns a data.frame. If I rbindlist this: >>> >>> a = rbindlist(l) >>> >>> I can print a just fine, and it will show me all data like normal. but >>> if I try to just do >>> >>> a$x >>> >>> x is one of the columns that was a key in DT, then it segfaults. If I >>> ask for a column that was made by "func" and wasn't a column in DT, it >>> works fine. If I ask for only the first 10 rows and then ask for x: >>> >>> a[1:10]$x >>> >>> it works fine. >>> >>> So somewhere these key columns think they are different lengths than >>> they really are, and when I try to access it I go into memory I shouldn't >>> so I segfault. How can I verify this? Is there something about the DT I >>> can check to see what DT thinks these columns are? >>> >>> >>> Also, if instead of apply when making the list, I do >>> >>> l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT)) >>> >>> and rbindlist that, it works fine too. 
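A plausible mechanism for the apply()/lapply() difference, sketched as a small editor's reproduction (func here is a stand-in, not the original poster's function): apply(d, 1, ...) hands the worker each row as a *named* character vector, so x[1] arrives carrying its column's name, and data.frame() preserves names on the vectors it is given; indexing with lapply(1:nrow(d), ...) passes plain unnamed values instead.

```r
library(data.table)

d <- data.frame(one = c("A", "B"), two = c("3", "1"), stringsAsFactors = FALSE)

# Stand-in for func(): builds a one-row data.frame from two values
func <- function(a, b) data.frame(k1 = a, v = nchar(b), stringsAsFactors = FALSE)

# Row-wise via apply(): x is a named character vector, so names can leak
# into the k1 column of each piece before rbindlist() stacks them
a1 <- rbindlist(apply(d, 1, function(x) func(x[1], x[2])))

# Row-wise via lapply() over indices: plain unnamed scalars, no leak
a2 <- rbindlist(lapply(1:nrow(d), function(i) func(d[i, 1], d[i, 2])))

lapply(a1, names)  # may reveal stray names on a1's columns
lapply(a2, names)  # NULL for every column

# Cleanup by reference if stray names are found (the fix suggested later
# in this thread):
for (i in seq_along(a1)) setattr(a1[[i]], "names", NULL)
```

If lapply(a, names) on the real table shows a names attribute whose length differs from the column's, that would match the inconsistent-length hypothesis behind the segfault.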
>>> >>> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Tue Sep 10 20:02:33 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 10 Sep 2013 19:02:33 +0100 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: Message-ID: <522F5EB9.2080903@mdowle.plus.com> Nothing springs to mind. Latest version v1.8.10 from CRAN right? Or v1.8.11 on R-Forge? On this bit : > So somewhere these key columns think they are different lengths than they really are, and > when I try to access it I go into memory I shouldn't so I segfault. How can I verify this? Is > there something about the DT I can check to see what DT thinks these columns are? .Internal(inspect(DT)) reveals the internal structure including length and truelength on the column pointer vector as well as each column. But it's a really odd way of using data.table. Iterating by row is going to kill performance; data.table likes by column. If it really has to be by row then DT[, fun(.SD,...), by=1:nrow(DT)] should be better than apply(). Matthew On 10/09/13 18:47, Chris Neff wrote: > Narrowing it down further, > > a$x > > segfaults and > > a[,x] > > segfaults but > > a[,"x", with=FALSE] > > doesn't. > > > On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff > wrote: > > I'm pretty sure it is some issue of a column that thinks it is > bigger than it actually is. I have tried, so far in vain, to make > a reproducible example that I can share. I have one, but can't > share it. > > What happens is this: > > A data.frame is made: > > > d = data.frame(...) 
> > Then I call apply over every row, calling a different function > that takes in a DT as well: > > l = apply(d, 1, function(x) func(x[1], x[2], DT)) > > This returns a data.frame. If I rbindlist this: > > a = rbindlist(l) > > I can print a just fine, and it will show me all data like normal. > but if I try to just do > > a$x > > x is one of the columns that was a key in DT, then it segfaults. > If I ask for a column that was made by "func" and wasn't a column > in DT, it works fine. If I ask for only the first 10 rows and > then ask for x: > > a[1:10]$x > > it works fine. > > So somewhere these key columns think they are different lengths > than they really are, and when I try to access it I go into memory > I shouldn't so I segfault. How can I verify this? Is there > something about the DT I can check to see what DT thinks these > columns are? > > > Also, if instead of apply when making the list, I do > > l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT)) > > and rbindlist that, it works fine too. > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Tue Sep 10 20:51:35 2013 From: caneff at gmail.com (Chris Neff) Date: Tue, 10 Sep 2013 14:51:35 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: <522F5EB9.2080903@mdowle.plus.com> References: <522F5EB9.2080903@mdowle.plus.com> Message-ID: On Tue, Sep 10, 2013 at 2:02 PM, Matthew Dowle wrote: > > Nothing springs to mind. Latest version v1.8.10 from CRAN right? Or > v1.8.11 on R-Forge? > Both. And 1.8.8. > > On this bit : > > > So somewhere these key columns think they are different lengths than > they really are, and > > when I try to access it I go into memory I shouldn't so I segfault. 
How > can I verify this? Is > > there something about the DT I can check to see what DT thinks these > columns are? > > .Internal(inspect(DT)) reveals the internal structure including length and > truelength on the column pointer vector as well as each column. > > But it's a really odd way of using data.table. Iterating by row is going > to kill performance; data.table likes by column. > Trust me I know this, this isn't my code :) I'm just the data.table guy who helps debug. I am helping him with better ways, but I think we can agree that it should at least not segfault. I ran inspect on the two versions of the data.table, the one that crashes that is made by doing rbindlist(apply(d,1,...)) and the one that doesn't that gets made by doing rbindlist(lapply(1:nrow(d),...)), and changed the variable names and censored out values. First the one that fails (accessing either a$k1 or a$k2 will segfault): > .Internal(inspect(a)) @2cc5be0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100) @3b643d0 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0) @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" ... ATTRIB: @ac6c20 02 LISTSXP g1c0 [MARK] TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" @3ba6ad8 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0) @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" @3b64e30 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0) @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" ... 
ATTRIB: @ac6cc8 02 LISTSXP g1c0 [MARK] TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" @3ba6a68 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0) @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" @3b65890 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" ... @1ff5850 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,... @1fc6600 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,... ... ATTRIB: @21f6d48 02 LISTSXP g0c0 [] TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" @3efc1f0 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100) @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1" @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2" @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3" ... TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names" @2556908 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326 TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class" @2701b38 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0) @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table" @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame" TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref" @21f6e28 22 EXTPTRSXP g0c0 [] Secondly the one that works (all values can be accessed fine: > .Internal(inspect(a)) @45b4850 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100) @33a53a0 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" ... 
@33a5e00 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" ... @33a6860 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" ... @1ff10f0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,... @3a6d0d0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,... ... ATTRIB: @276c360 02 LISTSXP g0c0 [] TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" @1fe5670 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100) @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1" @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2" @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3" ... TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names" @29cbf38 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326 TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class" @2d539a0 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0) @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table" @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame" TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref" @276c440 22 EXTPTRSXP g0c0 [] It looks to me to be some differences in the ATTRs attached to k1 and k2 in the first case? I can't really parse this as well as you can. > If it really has to be by row then DT[, fun(.SD,...), by=1:nrow(DT)] > should be better than apply(). > > Matthew > > > On 10/09/13 18:47, Chris Neff wrote: > > Narrowing it down further, > > a$x > > segfaults and > > a[,x] > > segfaults but > > a[,"x", with=FALSE] > > doesn't. 
> > > On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff wrote: > >> I'm pretty sure it is some issue of a column that thinks it is bigger >> than it actually is. I have tried, so far in vain, to make a reproducible >> example that I can share. I have one, but can't share it. >> >> What happens is this: >> >> A data.frame is made: >> >> > d = data.frame(...) >> >> Then I call apply over every row, calling a different function that >> takes in a DT as well: >> >> l = apply(d, 1, function(x) func(x[1], x[2], DT)) >> >> This returns a data.frame. If I rbindlist this: >> >> a = rbindlist(l) >> >> I can print a just fine, and it will show me all data like normal. but >> if I try to just do >> >> a$x >> >> x is one of the columns that was a key in DT, then it segfaults. If I >> ask for a column that was made by "func" and wasn't a column in DT, it >> works fine. If I ask for only the first 10 rows and then ask for x: >> >> a[1:10]$x >> >> it works fine. >> >> So somewhere these key columns think they are different lengths than >> they really are, and when I try to access it I go into memory I shouldn't >> so I segfault. How can I verify this? Is there something about the DT I >> can check to see what DT thinks these columns are? >> >> >> Also, if instead of apply when making the list, I do >> >> l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT)) >> >> and rbindlist that, it works fine too. >> >> > > > _______________________________________________ > datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Tue Sep 10 22:06:12 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 10 Sep 2013 21:06:12 +0100 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: <522F5EB9.2080903@mdowle.plus.com> Message-ID: <522F7BB4.8060300@mdowle.plus.com> Yes, seems like the columns themselves have names, with inconsistent length. lapply(a,names) should reveal the "hidden" names To remove them : for (i in 1:ncol(a)) setattr(a[[i]],"names",NULL) Then lapply(a,names) should be clear. Then try again the things that segfaulted before. If this fixes it, we'll need to establish how the erroneous names got in there. On 10/09/13 19:51, Chris Neff wrote: > > > > On Tue, Sep 10, 2013 at 2:02 PM, Matthew Dowle > wrote: > > > Nothing springs to mind. Latest version v1.8.10 from CRAN right? > Or v1.8.11 on R-Forge? > > > Both. And 1.8.8. > > > On this bit : > > > So somewhere these key columns think they are different lengths > than they really are, and > > when I try to access it I go into memory I shouldn't so I > segfault. How can I verify this? Is > > there something about the DT I can check to see what DT thinks > these columns are? > > .Internal(inspect(DT)) reveals the internal structure including > length and truelength on the column pointer vector as well as each > column. > > But it's a really odd way of using data.table. Iterating by row is > going to kill performance; data.table likes by column. > > > Trust me I know this, this isn't my code :) I'm just the data.table > guy who helps debug. I am helping him with better ways, but I think we > can agree that it should at least not segfault. > > > I ran inspect on the two versions of the data.table, the one that > crashes that is made by doing rbindlist(apply(d,1,...)) and the one > that doesn't that gets made by doing rbindlist(lapply(1:nrow(d),...)), > and changed the variable names and censored out values. 
> > First the one that fails (accessing either a$k1 or a$k2 will segfault): > > > .Internal(inspect(a)) > @2cc5be0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100) > @3b643d0 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0) > @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" > @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" > @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > ... > ATTRIB: > @ac6c20 02 LISTSXP g1c0 [MARK] > TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" > @3ba6ad8 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0) > @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" > @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" > @3b64e30 16 STRSXP g0c7 [NAM(2),ATT] (len=326, tl=0) > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > ... > ATTRIB: > @ac6cc8 02 LISTSXP g1c0 [MARK] > TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" > @3ba6a68 16 STRSXP g1c2 [MARK,NAM(2)] (len=2, tl=0) > @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" > @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" > @3b65890 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) > @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > ... > @1ff5850 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,... > @1fc6600 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,... > ... 
> ATTRIB: > @21f6d48 02 LISTSXP g0c0 [] > TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" > @3efc1f0 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100) > @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" > @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" > @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1" > @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2" > @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3" > ... > TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names" > @2556908 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326 > TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class" > @2701b38 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0) > @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table" > @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame" > TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref" > @21f6e28 22 EXTPTRSXP g0c0 [] > > > > > > > Secondly the one that works (all values can be accessed fine: > > > .Internal(inspect(a)) > @45b4850 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=13, tl=100) > @33a53a0 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) > @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" > @253e488 09 CHARSXP g1c3 [MARK,gp=0x20,ATT] "#########" > @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e3f8 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > ... > @33a5e00 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e440 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > @253e3b0 09 CHARSXP g1c3 [MARK,gp=0x20] "#########" > ... > @33a6860 16 STRSXP g0c7 [NAM(2)] (len=326, tl=0) > @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb08 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > @24eeb68 09 CHARSXP g1c1 [MARK,gp=0x20] "#########" > ... 
> @1ff10f0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 3,3,3,3,3,... > @3a6d0d0 13 INTSXP g0c7 [NAM(2)] (len=326, tl=0) 2,1,2,1,3,... > ... > ATTRIB: > @276c360 02 LISTSXP g0c0 [] > TAG: @963418 01 SYMSXP g1c0 [MARK,gp=0x4000] "names" > @1fe5670 16 STRSXP g0c7 [NAM(2)] (len=13, tl=100) > @184aed0 09 CHARSXP g1c3 [MARK,gp=0x21,ATT] "k1" > @bf8578 09 CHARSXP g1c2 [MARK,gp=0x21] "k2" > @108be30 09 CHARSXP g1c2 [MARK,gp=0x21] "v1" > @108be68 09 CHARSXP g1c2 [MARK,gp=0x21] "v2" > @108bf10 09 CHARSXP g1c2 [MARK,gp=0x21] "v3" > ... > TAG: @96d200 01 SYMSXP g1c0 [MARK,gp=0x4000] "row.names" > @29cbf38 13 INTSXP g0c1 [] (len=2, tl=0) -2147483648,-326 > TAG: @9638e8 01 SYMSXP g1c0 [MARK,gp=0x4000] "class" > @2d539a0 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0) > @bf8460 09 CHARSXP g1c2 [MARK,gp=0x21] "data.table" > @9f2688 09 CHARSXP g1c2 [MARK,gp=0x21,ATT] "data.frame" > TAG: @1e75218 01 SYMSXP g1c0 [MARK] ".internal.selfref" > @276c440 22 EXTPTRSXP g0c0 [] > > > > > It looks to me to be some differences in the ATTRs attached to k1 and > k2 in the first case? I can't really parse this as well as you can. > > If it really has to be by row then DT[, fun(.SD,...), > by=1:nrow(DT)] should be better than apply(). > > Matthew > > > On 10/09/13 18:47, Chris Neff wrote: >> Narrowing it down further, >> >> a$x >> >> segfaults and >> >> a[,x] >> >> segfaults but >> >> a[,"x", with=FALSE] >> >> doesn't. >> >> >> On Tue, Sep 10, 2013 at 1:32 PM, Chris Neff > > wrote: >> >> I'm pretty sure it is some issue of a column that thinks it >> is bigger than it actually is. I have tried, so far in vain, >> to make a reproducible example that I can share. I have one, >> but can't share it. >> >> What happens is this: >> >> A data.frame is made: >> >> > d = data.frame(...) >> >> Then I call apply over every row, calling a different >> function that takes in a DT as well: >> >> l = apply(d, 1, function(x) func(x[1], x[2], DT)) >> >> This returns a data.frame. 
If I rbindlist this: >> >> a = rbindlist(l) >> >> I can print a just fine, and it will show me all data like >> normal. but if I try to just do >> >> a$x >> >> x is one of the columns that was a key in DT, then it >> segfaults. If I ask for a column that was made by "func" and >> wasn't a column in DT, it works fine. If I ask for only the >> first 10 rows and then ask for x: >> >> a[1:10]$x >> >> it works fine. >> >> So somewhere these key columns think they are different >> lengths than they really are, and when I try to access it I >> go into memory I shouldn't so I segfault. How can I verify >> this? Is there something about the DT I can check to see what >> DT thinks these columns are? >> >> >> Also, if instead of apply when making the list, I do >> >> l = lapply(1:nrow(d), function(i) func(x[i,1],x[i,2],DT)) >> >> and rbindlist that, it works fine too. >> >> >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From caneff at gmail.com Wed Sep 11 11:17:50 2013 From: caneff at gmail.com (Chris Neff) Date: Wed, 11 Sep 2013 05:17:50 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: <522F7BB4.8060300@mdowle.plus.com> References: <522F5EB9.2080903@mdowle.plus.com> <522F7BB4.8060300@mdowle.plus.com> Message-ID: Indeed, it shows that k1 and k2 both have names of length 2, and both times the value of names is just the variable names. Where the names are getting added is by apply. The issue with data.table is that it does not ignore names from short variables. 
I now have a small reproducible example I can share: d <- data.frame(x=1:5) f <- function(x) {data.table(x=x, y=1:10)} l <- apply(d, 1, f) lapply(l, function(x) lapply(x, names)) # All values of x have a name a <- rbindlist(l) # a$x will segfault after this The underlying issue is what data.table and data.frame do with rownames and recycling. Look at this simple case: x <- 1:5 names(x) <- letters[1:5] df <- data.frame(x=x, y=1:10) #Warning message: # In data.frame(x = x, y = 1:10) : # row names were found from a short variable and have been discarded lapply(df, names) # no names dt <- data.table(x=x, y=1:10) # No warning lapply(dt, names) # x has names, and they get recycled. So data.table needs to follow data.frame logic for discarding row names when they would otherwise be recycled. Bug submitted here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975 I'm surprised this has never arisen before; it seems like something that has been around forever. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Wed Sep 11 11:24:29 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 11 Sep 2013 11:24:29 +0200 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: <522F5EB9.2080903@mdowle.plus.com> <522F7BB4.8060300@mdowle.plus.com> Message-ID: <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com> Most likely, this (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4882&group_id=240&atid=5335), when fixed, will take care of it? Arun On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote: > Indeed, it shows that k1 and k2 both have names of length 2, and both times the value of names is just the variable names. > > Where the names are getting added is by apply. What the issue with data.table is that it does not ignore names from short variables. 
I now have a small reproducible example I can share: > > d <- data.frame(x=1:5) > > f <- function(x) {data.table(x=x, y=1:10)} > > l <- apply(d, 1, f) > > lapply(l, function(x) lapply(x, names)) # All values of x have a name > > a <- rbindlist(l) # a$x will segfault after this > > > The underlying issue is what data.table and data.frame do with rownames and recycling. Look at this simple case: > > x <- 1:5 > names(x) <- letters[1:5] > > df <- data.frame(x=x, y=1:10) > #Warning message: > # In data.frame(x = x, y = 1:10) : > # row names were found from a short variable and have been discarded > > lapply(df, names) # no names > > dt <- data.table(x=x, y=1:1) # No warning > > lapply(dt, names) # x has names, and they get recycled. > > > So data.table needs to follow data.frame logic for discarding row names when they would otherwise be recycled. > > > Bug submitted here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975 (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975)> > I'm surprised this has never arisen before, it seems like something that has been around forever. > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... 
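[Editor's note: the repro above hinges on apply() handing its FUN each row as a *named* vector (the names come from the colnames), which is where the stray column names originate. The following is a sketch of a defensive workaround, assuming only that the data.table package is installed; the unname() call is a suggested fix by this editor, not part of the original report, and on versions of data.table where the bug is fixed the workaround is simply harmless.]

```r
library(data.table)

d <- data.frame(x = 1:5)
f <- function(x) data.table(x = x, y = 1:10)

# Show that apply() passes f a *named* vector for each row:
seen <- NULL
invisible(apply(d, 1, function(row) { seen <<- names(row); NULL }))
stopifnot(identical(seen, "x"))   # the column name "x" rides along

# Workaround: strip the names before data.table() ever sees them,
# so no column of the result can carry short recycled names.
l <- apply(d, 1, function(row) f(unname(row)))
a <- rbindlist(l)
stopifnot(all(vapply(a, function(col) is.null(names(col)), logical(1))))
```

With the names stripped at source, a$x is safe to access regardless of how rbindlist() treats hidden column names.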
URL: From caneff at gmail.com Wed Sep 11 11:31:02 2013 From: caneff at gmail.com (Chris Neff) Date: Wed, 11 Sep 2013 05:31:02 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com> References: <522F5EB9.2080903@mdowle.plus.com> <522F7BB4.8060300@mdowle.plus.com> <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com> Message-ID: Yes, dropping names altogether in data.table would fix this, and would be the cleanest thing overall since as is said in that thread data.table doesn't really work with rownames in mind anyway. Except it is less of a FR now and more of a bad bug because you can get segfaults from it. On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan wrote: > Most likely, this, > when fixed, will take care of it? > > Arun > > On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote: > > Indeed, it shows that k1 and k2 both have names of length 2, and both > times the value of names is just the variable names. > > Where the names are getting added is by apply. What the issue with > data.table is that it does not ignore names from short variables. I now > have a small reproducible example I can share: > > d <- data.frame(x=1:5) > > f <- function(x) {data.table(x=x, y=1:10)} > > l <- apply(d, 1, f) > > lapply(l, function(x) lapply(x, names)) # All values of x have a name > > a <- rbindlist(l) # a$x will segfault after this > > > The underlying issue is what data.table and data.frame do with rownames > and recycling. Look at this simple case: > > x <- 1:5 > names(x) <- letters[1:5] > > df <- data.frame(x=x, y=1:10) > #Warning message: > # In data.frame(x = x, y = 1:10) : > # row names were found from a short variable and have been discarded > > lapply(df, names) # no names > > dt <- data.table(x=x, y=1:1) # No warning > > lapply(dt, names) # x has names, and they get recycled. 
> > > So data.table needs to follow data.frame logic for discarding row names > when they would otherwise be recycled. > > > Bug submitted here: > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975 > I'm surprised this has never arisen before, it seems like something that > has been around forever. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Wed Sep 11 11:33:06 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 11 Sep 2013 11:33:06 +0200 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: <522F5EB9.2080903@mdowle.plus.com> <522F7BB4.8060300@mdowle.plus.com> <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com> Message-ID: Chris, It's not filed as a FR, IIRC. It's filed under "Internals". Arun On Wednesday, September 11, 2013 at 11:31 AM, Chris Neff wrote: > Yes, dropping names altogether in data.table would fix this, and would be the cleanest thing overall since as is said in that thread data.table doesn't really work with rownames in mind anyway. > > Except it is less of a FR now and more of a bad bug because you can get segfaults from it. > > > On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan wrote: > > Most likely, this (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4882&group_id=240&atid=5335), when fixed, will take care of it? > > > > Arun > > > > > > On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote: > > > > > > > > > Indeed, it shows that k1 and k2 both have names of length 2, and both times the value of names is just the variable names. > > > > > > Where the names are getting added is by apply. 
What the issue with data.table is that it does not ignore names from short variables. I now have a small reproducible example I can share: > > > > > > d <- data.frame(x=1:5) > > > > > > f <- function(x) {data.table(x=x, y=1:10)} > > > > > > l <- apply(d, 1, f) > > > > > > lapply(l, function(x) lapply(x, names)) # All values of x have a name > > > > > > a <- rbindlist(l) # a$x will segfault after this > > > > > > > > > The underlying issue is what data.table and data.frame do with rownames and recycling. Look at this simple case: > > > > > > x <- 1:5 > > > names(x) <- letters[1:5] > > > > > > df <- data.frame(x=x, y=1:10) > > > #Warning message: > > > # In data.frame(x = x, y = 1:10) : > > > # row names were found from a short variable and have been discarded > > > > > > lapply(df, names) # no names > > > > > > dt <- data.table(x=x, y=1:1) # No warning > > > > > > lapply(dt, names) # x has names, and they get recycled. > > > > > > > > > So data.table needs to follow data.frame logic for discarding row names when they would otherwise be recycled. > > > > > > > > > Bug submitted here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975 (https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975)> > > > > > I'm surprised this has never arisen before, it seems like something that has been around forever. > > > > > > > > > > > > > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
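[Editor's note: until names are dropped at source, Matthew's repair from earlier in the thread, clearing each column's hidden names by reference with setattr(), serves as a workaround on an affected table. A sketch follows; the stray names are planted manually here with setattr() as a stand-in for what apply() produced, since a fixed data.table would no longer create them through the constructor.]

```r
library(data.table)

a <- data.table(k1 = 1:3, v1 = 4:6)
setattr(a[[1L]], "names", c("x", "y", "z"))  # plant stray names on column k1

lapply(a, names)  # reveals the hidden names on k1, as Matthew suggested

# Matthew's fix: remove the names from every column, by reference.
for (i in seq_along(a)) setattr(a[[i]], "names", NULL)

stopifnot(all(vapply(a, function(col) is.null(names(col)), logical(1))))
```

Because setattr() modifies in place, no copy of the (possibly large) table is made while repairing it.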
URL: From caneff at gmail.com Wed Sep 11 11:55:48 2013 From: caneff at gmail.com (Chris Neff) Date: Wed, 11 Sep 2013 05:55:48 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: <522F5EB9.2080903@mdowle.plus.com> <522F7BB4.8060300@mdowle.plus.com> <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com> Message-ID: Oh okay, sorry. Either way it is more than just a slight improvement :) But yes that would fix everything. On Wed, Sep 11, 2013 at 5:33 AM, Arunkumar Srinivasan wrote: > Chris, > It's not filed as a FR, IIRC. It's filed under "Internals". > > Arun > > On Wednesday, September 11, 2013 at 11:31 AM, Chris Neff wrote: > > Yes, dropping names altogether in data.table would fix this, and would be > the cleanest thing overall since as is said in that thread data.table > doesn't really work with rownames in mind anyway. > > Except it is less of a FR now and more of a bad bug because you can get > segfaults from it. > > > On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Most likely, this, > when fixed, will take care of it? > > Arun > > On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote: > > Indeed, it shows that k1 and k2 both have names of length 2, and both > times the value of names is just the variable names. > > Where the names are getting added is by apply. What the issue with > data.table is that it does not ignore names from short variables. I now > have a small reproducible example I can share: > > d <- data.frame(x=1:5) > > f <- function(x) {data.table(x=x, y=1:10)} > > l <- apply(d, 1, f) > > lapply(l, function(x) lapply(x, names)) # All values of x have a name > > a <- rbindlist(l) # a$x will segfault after this > > > The underlying issue is what data.table and data.frame do with rownames > and recycling. 
Look at this simple case: > > x <- 1:5 > names(x) <- letters[1:5] > > df <- data.frame(x=x, y=1:10) > #Warning message: > # In data.frame(x = x, y = 1:10) : > # row names were found from a short variable and have been discarded > > lapply(df, names) # no names > > dt <- data.table(x=x, y=1:1) # No warning > > lapply(dt, names) # x has names, and they get recycled. > > > So data.table needs to follow data.frame logic for discarding row names > when they would otherwise be recycled. > > > Bug submitted here: > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975 > I'm surprised this has never arisen before, it seems like something that > has been around forever. > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Wed Sep 11 15:52:00 2013 From: FErickson at psu.edu (Frank Erickson) Date: Wed, 11 Sep 2013 09:52:00 -0400 Subject: [datatable-help] data.table segfaulting, need help verifying the reason In-Reply-To: References: <522F5EB9.2080903@mdowle.plus.com> <522F7BB4.8060300@mdowle.plus.com> <1CE2AD6E16E241869F7EA71F8E333D99@gmail.com> Message-ID: @Chris: If your application is like the example given, you might consider using CJ(x=1:5,y=1:10) which is a data.table analogue to expand.grid(x=1:5,y=1:10) that automatically sets a key of c("x","y") on the result. --Frank On Wed, Sep 11, 2013 at 5:55 AM, Chris Neff wrote: > Oh okay, sorry. Either way it is more than just a slight improvement :) > But yes that would fix everything. > > > On Wed, Sep 11, 2013 at 5:33 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> Chris, >> It's not filed as a FR, IIRC. It's filed under "Internals". 
>> >> Arun >> >> On Wednesday, September 11, 2013 at 11:31 AM, Chris Neff wrote: >> >> Yes, dropping names altogether in data.table would fix this, and would be >> the cleanest thing overall since as is said in that thread data.table >> doesn't really work with rownames in mind anyway. >> >> Except it is less of a FR now and more of a bad bug because you can get >> segfaults from it. >> >> >> On Wed, Sep 11, 2013 at 5:24 AM, Arunkumar Srinivasan < >> aragorn168b at gmail.com> wrote: >> >> Most likely, this, >> when fixed, will take care of it? >> >> Arun >> >> On Wednesday, September 11, 2013 at 11:17 AM, Chris Neff wrote: >> >> Indeed, it shows that k1 and k2 both have names of length 2, and both >> times the value of names is just the variable names. >> >> Where the names are getting added is by apply. What the issue with >> data.table is that it does not ignore names from short variables. I now >> have a small reproducible example I can share: >> >> d <- data.frame(x=1:5) >> >> f <- function(x) {data.table(x=x, y=1:10)} >> >> l <- apply(d, 1, f) >> >> lapply(l, function(x) lapply(x, names)) # All values of x have a name >> >> a <- rbindlist(l) # a$x will segfault after this >> >> >> The underlying issue is what data.table and data.frame do with rownames >> and recycling. Look at this simple case: >> >> x <- 1:5 >> names(x) <- letters[1:5] >> >> df <- data.frame(x=x, y=1:10) >> #Warning message: >> # In data.frame(x = x, y = 1:10) : >> # row names were found from a short variable and have been discarded >> >> lapply(df, names) # no names >> >> dt <- data.table(x=x, y=1:1) # No warning >> >> lapply(dt, names) # x has names, and they get recycled. >> >> >> So data.table needs to follow data.frame logic for discarding row names >> when they would otherwise be recycled. 
>> >> >> Bug submitted here: >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4890&group_id=240&atid=975 >> I'm surprised this has never arisen before, it seems like something that >> has been around forever. >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sds at gnu.org Wed Sep 11 23:35:35 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 11 Sep 2013 17:35:35 -0400 Subject: [datatable-help] adding names to j columns is costly Message-ID: <87d2ofnis8.fsf@gnu.org> I find myself using setnames(...,"V1","...") very often because setting them in aggregation is expensive: --8<---------------cut here---------------start------------->8--- > delays.short <- delays.dt[,sum(count),by="delay"] Finding groups (bysameorder=TRUE) ... done in 1.262secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'sum(count)' Starting dogroups ... done dogroups in 8.612 secs > delays.short <- delays.dt[,list(count=sum(count)),by="delay"] Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'list(sum(count))' Starting dogroups ... done dogroups in 11.918 secs --8<---------------cut here---------------end--------------->8--- 38% difference is a lot (3 seconds is not a big deal, but this is just a toy dataset). 
ISTR that I have asked this question before - is this still (data.table 1.8.10) the state of the art, or am I doing something stupid? Thanks! -- Sam Steingold (http://sds.podval.org/) on Ubuntu 13.04 (raring) X 11.0.11303000 http://www.childpsy.net/ http://think-israel.org http://truepeace.org http://thereligionofpeace.com http://americancensorship.org http://iris.org.il Money does not "play a role", it writes the scenario. From mdowle at mdowle.plus.com Thu Sep 12 01:50:02 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 12 Sep 2013 00:50:02 +0100 Subject: [datatable-help] adding names to j columns is costly In-Reply-To: <87d2ofnis8.fsf@gnu.org> References: <87d2ofnis8.fsf@gnu.org> Message-ID: <523101AA.8040604@mdowle.plus.com> I don't remember you asking this before! How many rows does delay.dt have and how many groups? > because setting them in aggregation is expensive: I'm not sure this example is proof of that. On the contrary, the output shows that names are being dropped before grouping commences (they are reinstated after grouping), as is correct behaviour. All I can think is that the list() wrapper itself is adding overhead. That might show up as this 38% difference if there are a very large number of groups (lots of calls to j). In the case of a single aggregate, the list() wrapper could be optimized away. This would be a nice improvement I didn't think of before. Does this theory fit with your experience? If my guess is correct, if you instead compare two queries where j has list() in both; e.g., list(sum(count),max(count)) -vs- list(s=sum(count), m=max(count)) then I don't think you'll see a speed difference. On 11/09/13 22:35, Sam Steingold wrote: > I find myself using setnames(...,"V1","...") very often because setting > them in aggregation is expensive: > > --8<---------------cut here---------------start------------->8--- >> delays.short <- delays.dt[,sum(count),by="delay"] > Finding groups (bysameorder=TRUE) ... done in 1.262secs. 
bysameorder=TRUE and o__ is length 0 > Detected that j uses these columns: count > Optimization is on but j left unchanged as 'sum(count)' > Starting dogroups ... done dogroups in 8.612 secs >> delays.short <- delays.dt[,list(count=sum(count)),by="delay"] > Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0 > Detected that j uses these columns: count > Optimization is on but j left unchanged as 'list(sum(count))' > Starting dogroups ... done dogroups in 11.918 secs > --8<---------------cut here---------------end--------------->8--- > > 38% difference is a lot (3 seconds is not a big deal, but this is just a > toy dataset). > > ISTR that I have asked this question before - is this still (data.table > 1.8.10) the state of the art, or am I doing something stupid? > > Thanks! > From sds at gnu.org Thu Sep 12 05:54:21 2013 From: sds at gnu.org (Sam Steingold) Date: Wed, 11 Sep 2013 23:54:21 -0400 Subject: [datatable-help] adding names to j columns is costly In-Reply-To: <523101AA.8040604@mdowle.plus.com> (Matthew Dowle's message of "Thu, 12 Sep 2013 00:50:02 +0100") References: <87d2ofnis8.fsf@gnu.org> <523101AA.8040604@mdowle.plus.com> Message-ID: <87wqmmn18y.fsf@gnu.org> > * Matthew Dowle [2013-09-12 00:50:02 +0100]: > > How many rows does delay.dt have and how many groups? --8<---------------cut here---------------start------------->8--- > nrow(delays.dt) [1] 18772831 > nrow(delays.short) [1] 14893103 --8<---------------cut here---------------end--------------->8--- >> because setting them in aggregation is expensive: > > I'm not sure this example is proof of that. On the contrary, the output > shows that names are being dropped before grouping commences (they are > reinstated after grouping), as is correct behaviour. All I can think is > that the list() wrapper itself is adding overhead. That might show up as > this 38% difference if there are a very large number of groups (lots of > calls to j). 
In the case of a single aggregate, the list() wrapper could > be optimized away. This would be a nice improvement I didn't think of > before. Yes, I would love to be able to drop the extra setnames() call. > Does this theory fit with your experience? Looks like it. > If my guess is correct, if > you instead compare two queries where j has list() in both; e.g., > list(sum(count),max(count)) -vs- list(s=sum(count), m=max(count)) > then I don't think you'll see a speed difference. --8<---------------cut here---------------start------------->8--- > delays.short <- delays.dt[,list(sum(count)),by="delay"] Finding groups (bysameorder=TRUE) ... done in 0.91secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'list(sum(count))' Starting dogroups ... done dogroups in 11.497 secs > delays.short <- delays.dt[,list(s=sum(count)),by="delay"] Finding groups (bysameorder=TRUE) ... done in 0.91secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'list(sum(count))' Starting dogroups ... done dogroups in 11.535 secs > delays.short <- delays.dt[,list(s=sum(count),m=max(count)),by="delay"] Finding groups (bysameorder=TRUE) ... done in 0.948secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'list(sum(count), max(count))' Starting dogroups ... done dogroups in 18.931 secs > delays.short <- delays.dt[,list(sum(count),max(count)),by="delay"] Finding groups (bysameorder=TRUE) ... done in 0.968secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'list(sum(count), max(count))' Starting dogroups ... done dogroups in 17.872 secs > delays.short <- delays.dt[,list(sum(count),max(count)),by="delay"] Finding groups (bysameorder=TRUE) ... done in 1.004secs. 
bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'list(sum(count), max(count))' Starting dogroups ... done dogroups in 18.971 secs > delays.short <- delays.dt[,list(s=sum(count),m=max(count)),by="delay"] Finding groups (bysameorder=TRUE) ... done in 0.946secs. bysameorder=TRUE and o__ is length 0 Detected that j uses these columns: count Optimization is on but j left unchanged as 'list(sum(count), max(count))' Starting dogroups ... done dogroups in 18.799 secs --8<---------------cut here---------------end--------------->8--- Thanks for your kind help! > > On 11/09/13 22:35, Sam Steingold wrote: >> I find myself using setnames(...,"V1","...") very often because setting >> them in aggregation is expensive: >> >> --8<---------------cut here---------------start------------->8--- >>> delays.short <- delays.dt[,sum(count),by="delay"] >> Finding groups (bysameorder=TRUE) ... done in 1.262secs. bysameorder=TRUE and o__ is length 0 >> Detected that j uses these columns: count >> Optimization is on but j left unchanged as 'sum(count)' >> Starting dogroups ... done dogroups in 8.612 secs >>> delays.short <- delays.dt[,list(count=sum(count)),by="delay"] >> Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0 >> Detected that j uses these columns: count >> Optimization is on but j left unchanged as 'list(sum(count))' >> Starting dogroups ... done dogroups in 11.918 secs >> --8<---------------cut here---------------end--------------->8--- >> >> 38% difference is a lot (3 seconds is not a big deal, but this is just a >> toy dataset). >> >> ISTR that I have asked this question before - is this still (data.table >> 1.8.10) the state of the art, or am I doing something stupid? >> >> Thanks! 
>> -- Sam Steingold (http://sds.podval.org/) on Ubuntu 13.04 (raring) X 11.0.11303000 http://www.childpsy.net/ http://americancensorship.org http://memri.org http://mideasttruth.com http://iris.org.il http://truepeace.org UNIX is a way of thinking. Windows is a way of not thinking. From abfriedman at gmail.com Thu Sep 12 23:32:00 2013 From: abfriedman at gmail.com (Ari Friedman) Date: Thu, 12 Sep 2013 17:32:00 -0400 Subject: [datatable-help] colClasses and fread Message-ID: Dear maintainers of that most wonderful package that makes R fast with big data, I've recently discovered fread. It's amazing. My call to read.fwf on a 4GB file that took all night now takes under a minute after conversion to csv via csvkit/in2csv. However, automatic type detection is working very poorly, probably due to the presence of a large number of columns with high rates of missingness, plus a large number of character columns with encoded values (these are medical and diagnostic codes). Normally I'd specify colClasses, and the warning messages even tell me I should specify colClasses, but there's no colClasses argument to fread. Any thoughts on solving this? Verbose output, warnings, and a comparison of the guesses vs. what the documentation on the file says it is are found below. Unfortunately the data can't be shared, even in small portions so I can't make this reproducible. Thanks! Ari > dt <- fread('myfile.csv', verbose=TRUE) Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. Using line 30 to detect sep (the last non blank line in the first 30) ... ',' Found 393 columns First row with 393 fields occurs on line 1 (either column names or first row of data) All the fields on line 1 are character fields. Treating as the column names. 
Count of eol after first data row: 2994440 Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows) Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows) Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows) 0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867' Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867' Bumping column 146 from REAL to STR on data row 9, field contains 'V5867' Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869' Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869' Bumping column 147 from REAL to STR on data row 9, field contains 'V5869' Bumping column 142 from INT to INT64 on data row 10, field contains 'V140' Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140' 
Bumping column 142 from REAL to STR on data row 10, field contains 'V140' Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885' Bumping column 17 from INT64 to REAL on data row 12, field contains 'J1885' Bumping column 17 from REAL to STR on data row 12, field contains 'J1885' Bumping column 74 from INT to INT64 on data row 12, field contains 'LT' Bumping column 74 from INT64 to REAL on data row 12, field contains 'LT' Bumping column 74 from REAL to STR on data row 12, field contains 'LT' Bumping column 143 from INT to INT64 on data row 13, field contains 'V142' Bumping column 143 from INT64 to REAL on data row 13, field contains 'V142' Bumping column 143 from REAL to STR on data row 13, field contains 'V142' Bumping column 14 from INT to INT64 on data row 22, field contains 'G0431' Bumping column 14 from INT64 to REAL on data row 22, field contains 'G0431' Bumping column 14 from REAL to STR on data row 22, field contains 'G0431' Bumping column 21 from INT to INT64 on data row 23, field contains 'J7060' Bumping column 21 from INT64 to REAL on data row 23, field contains 'J7060' Bumping column 21 from REAL to STR on data row 23, field contains 'J7060' Bumping column 24 from INT to INT64 on data row 27, field contains 'J2405' Bumping column 24 from INT64 to REAL on data row 27, field contains 'J2405' Bumping column 24 from REAL to STR on data row 27, field contains 'J2405' Bumping column 72 from INT to INT64 on data row 35, field contains 'F1' Bumping column 72 from INT64 to REAL on data row 35, field contains 'F1' Bumping column 72 from REAL to STR on data row 35, field contains 'F1' Bumping column 141 from INT to INT64 on data row 35, field contains 'V061' Bumping column 141 from INT64 to REAL on data row 35, field contains 'V061' Bumping column 141 from REAL to STR on data row 35, field contains 'V061' Bumping column 26 from INT to INT64 on data row 37, field contains 'J0690' Bumping column 26 from INT64 to REAL on data row 37, field contains 
'J0690' Bumping column 26 from REAL to STR on data row 37, field contains 'J0690' Bumping column 28 from INT to INT64 on data row 37, field contains 'J7030' Bumping column 28 from INT64 to REAL on data row 37, field contains 'J7030' Bumping column 28 from REAL to STR on data row 37, field contains 'J7030' Bumping column 29 from INT to INT64 on data row 37, field contains 'J7040' Bumping column 29 from INT64 to REAL on data row 37, field contains 'J7040' Bumping column 29 from REAL to STR on data row 37, field contains 'J7040' Bumping column 25 from INT to INT64 on data row 43, field contains 'Q9967' Bumping column 25 from INT64 to REAL on data row 43, field contains 'Q9967' Bumping column 25 from REAL to STR on data row 43, field contains 'Q9967' Bumping column 30 from INT to INT64 on data row 43, field contains 'J7030' Bumping column 30 from INT64 to REAL on data row 43, field contains 'J7030' Bumping column 30 from REAL to STR on data row 43, field contains 'J7030' Bumping column 31 from INT to INT64 on data row 43, field contains 'J2405' Bumping column 31 from INT64 to REAL on data row 43, field contains 'J2405' Bumping column 31 from REAL to STR on data row 43, field contains 'J2405' Bumping column 148 from INT to INT64 on data row 44, field contains 'V1551' Bumping column 148 from INT64 to REAL on data row 44, field contains 'V1551' Bumping column 148 from REAL to STR on data row 44, field contains 'V1551' Bumping column 149 from INT to INT64 on data row 44, field contains 'V1588' Bumping column 149 from INT64 to REAL on data row 44, field contains 'V1588' Bumping column 149 from REAL to STR on data row 44, field contains 'V1588' Bumping column 76 from INT to INT64 on data row 45, field contains 'RT' Bumping column 76 from INT64 to REAL on data row 45, field contains 'RT' Bumping column 76 from REAL to STR on data row 45, field contains 'RT' Bumping column 27 from INT to INT64 on data row 53, field contains 'J2405' Bumping column 27 from INT64 to REAL on data 
row 53, field contains 'J2405' Bumping column 27 from REAL to STR on data row 53, field contains 'J2405' Bumping column 32 from INT to INT64 on data row 56, field contains 'J1885' Bumping column 32 from INT64 to REAL on data row 56, field contains 'J1885' Bumping column 32 from REAL to STR on data row 56, field contains 'J1885' Bumping column 33 from INT to INT64 on data row 56, field contains 'J2270' Bumping column 33 from INT64 to REAL on data row 56, field contains 'J2270' Bumping column 33 from REAL to STR on data row 56, field contains 'J2270' Bumping column 34 from INT to INT64 on data row 56, field contains 'J2405' Bumping column 34 from INT64 to REAL on data row 56, field contains 'J2405' Bumping column 34 from REAL to STR on data row 56, field contains 'J2405' Bumping column 77 from INT to INT64 on data row 65, field contains 'LT' Bumping column 77 from INT64 to REAL on data row 65, field contains 'LT' Bumping column 77 from REAL to STR on data row 65, field contains 'LT' Bumping column 140 from INT to INT64 on data row 74, field contains 'V689' Bumping column 140 from INT64 to REAL on data row 74, field contains 'V689' Bumping column 140 from REAL to STR on data row 74, field contains 'V689' Bumping column 13 from INT to INT64 on data row 103, field contains 'J1100' Bumping column 13 from INT64 to REAL on data row 103, field contains 'J1100' Bumping column 13 from REAL to STR on data row 103, field contains 'J1100' Bumping column 150 from INT to INT64 on data row 104, field contains 'V1508' Bumping column 150 from INT64 to REAL on data row 104, field contains 'V1508' Bumping column 150 from REAL to STR on data row 104, field contains 'V1508' Bumping column 212 from INT to INT64 on data row 107, field contains 'V714' Bumping column 212 from INT64 to REAL on data row 107, field contains 'V714' Bumping column 212 from REAL to STR on data row 107, field contains 'V714' Bumping column 12 from INT to INT64 on data row 113, field contains 'A0427' Bumping column 
12 from INT64 to REAL on data row 113, field contains 'A0427' Bumping column 12 from REAL to STR on data row 113, field contains 'A0427' Bumping column 81 from INT to INT64 on data row 113, field contains 'RH' Bumping column 81 from INT64 to REAL on data row 113, field contains 'RH' Bumping column 81 from REAL to STR on data row 113, field contains 'RH' Bumping column 102 from INT to INT64 on data row 113, field contains 'QM' Bumping column 102 from INT64 to REAL on data row 113, field contains 'QM' Bumping column 102 from REAL to STR on data row 113, field contains 'QM' Bumping column 111 from INT to INT64 on data row 113, field contains 'QM' Bumping column 111 from INT64 to REAL on data row 113, field contains 'QM' Bumping column 111 from REAL to STR on data row 113, field contains 'QM' Bumping column 151 from INT to INT64 on data row 294, field contains 'V146' Bumping column 151 from INT64 to REAL on data row 294, field contains 'V146' Bumping column 151 from REAL to STR on data row 294, field contains 'V146' Bumping column 152 from INT to INT64 on data row 294, field contains 'V148' Bumping column 152 from INT64 to REAL on data row 294, field contains 'V148' Bumping column 152 from REAL to STR on data row 294, field contains 'V148' Bumping column 84 from INT to INT64 on data row 346, field contains 'RH' Bumping column 84 from INT64 to REAL on data row 346, field contains 'RH' Bumping column 84 from REAL to STR on data row 346, field contains 'RH' Bumping column 114 from INT to INT64 on data row 346, field contains 'QM' Bumping column 114 from INT64 to REAL on data row 346, field contains 'QM' Bumping column 114 from REAL to STR on data row 346, field contains 'QM' Bumping column 36 from INT to INT64 on data row 348, field contains 'J1644' Bumping column 36 from INT64 to REAL on data row 348, field contains 'J1644' Bumping column 36 from REAL to STR on data row 348, field contains 'J1644' Bumping column 37 from INT to INT64 on data row 348, field contains 
'J7030' Bumping column 37 from INT64 to REAL on data row 348, field contains 'J7030' Bumping column 37 from REAL to STR on data row 348, field contains 'J7030' Bumping column 38 from INT to INT64 on data row 348, field contains 'J2405' Bumping column 38 from INT64 to REAL on data row 348, field contains 'J2405' Bumping column 38 from REAL to STR on data row 348, field contains 'J2405' Bumping column 39 from INT to INT64 on data row 349, field contains 'J2405' Bumping column 39 from INT64 to REAL on data row 349, field contains 'J2405' Bumping column 39 from REAL to STR on data row 349, field contains 'J2405' Bumping column 103 from INT to INT64 on data row 702, field contains 'QM' Bumping column 103 from INT64 to REAL on data row 702, field contains 'QM' Bumping column 103 from REAL to STR on data row 702, field contains 'QM' Bumping column 104 from INT to INT64 on data row 702, field contains 'QM' Bumping column 104 from INT64 to REAL on data row 702, field contains 'QM' Bumping column 104 from REAL to STR on data row 702, field contains 'QM' Bumping column 153 from INT to INT64 on data row 815, field contains 'V4561' Bumping column 153 from INT64 to REAL on data row 815, field contains 'V4561' Bumping column 153 from REAL to STR on data row 815, field contains 'V4561' Bumping column 78 from INT to INT64 on data row 891, field contains 'RT' Bumping column 78 from INT64 to REAL on data row 891, field contains 'RT' Bumping column 78 from REAL to STR on data row 891, field contains 'RT' Bumping column 79 from INT to INT64 on data row 891, field contains 'LT' Bumping column 79 from INT64 to REAL on data row 891, field contains 'LT' Bumping column 79 from REAL to STR on data row 891, field contains 'LT' Bumping column 80 from INT to INT64 on data row 891, field contains 'LT' Bumping column 80 from INT64 to REAL on data row 891, field contains 'LT' Bumping column 80 from REAL to STR on data row 891, field contains 'LT' Bumping column 35 from INT to INT64 on data row 
892, field contains 'J2270' Bumping column 35 from INT64 to REAL on data row 892, field contains 'J2270' Bumping column 35 from REAL to STR on data row 892, field contains 'J2270' Bumping column 82 from INT to INT64 on data row 931, field contains 'RH' Bumping column 82 from INT64 to REAL on data row 931, field contains 'RH' Bumping column 82 from REAL to STR on data row 931, field contains 'RH' Bumping column 112 from INT to INT64 on data row 931, field contains 'QM' Bumping column 112 from INT64 to REAL on data row 931, field contains 'QM' Bumping column 112 from REAL to STR on data row 931, field contains 'QM' Bumping column 154 from INT to INT64 on data row 1151, field contains 'V4582' Bumping column 154 from INT64 to REAL on data row 1151, field contains 'V4582' Bumping column 154 from REAL to STR on data row 1151, field contains 'V4582' Bumping column 107 from INT to INT64 on data row 1268, field contains 'QM' Bumping column 107 from INT64 to REAL on data row 1268, field contains 'QM' Bumping column 107 from REAL to STR on data row 1268, field contains 'QM' Bumping column 40 from INT to INT64 on data row 1414, field contains 'J2270' Bumping column 40 from INT64 to REAL on data row 1414, field contains 'J2270' Bumping column 40 from REAL to STR on data row 1414, field contains 'J2270' Bumping column 41 from INT to INT64 on data row 1414, field contains 'J7040' Bumping column 41 from INT64 to REAL on data row 1414, field contains 'J7040' Bumping column 41 from REAL to STR on data row 1414, field contains 'J7040' Bumping column 155 from INT to INT64 on data row 1417, field contains 'V8741' Bumping column 155 from INT64 to REAL on data row 1417, field contains 'V8741' Bumping column 155 from REAL to STR on data row 1417, field contains 'V8741' Bumping column 156 from INT to INT64 on data row 1417, field contains 'V1504' Bumping column 156 from INT64 to REAL on data row 1417, field contains 'V1504' Bumping column 156 from REAL to STR on data row 1417, field 
contains 'V1504' Bumping column 157 from INT to INT64 on data row 1417, field contains 'V2651' Bumping column 157 from INT64 to REAL on data row 1417, field contains 'V2651' Bumping column 157 from REAL to STR on data row 1417, field contains 'V2651' Bumping column 83 from INT to INT64 on data row 1629, field contains 'GP' Bumping column 83 from INT64 to REAL on data row 1629, field contains 'GP' Bumping column 83 from REAL to STR on data row 1629, field contains 'GP' Bumping column 105 from INT to INT64 on data row 1688, field contains 'QM' Bumping column 105 from INT64 to REAL on data row 1688, field contains 'QM' Bumping column 105 from REAL to STR on data row 1688, field contains 'QM' Bumping column 110 from INT to INT64 on data row 1999, field contains 'QM' Bumping column 110 from INT64 to REAL on data row 1999, field contains 'QM' Bumping column 110 from REAL to STR on data row 1999, field contains 'QM' Bumping column 106 from INT to INT64 on data row 2019, field contains 'QM' Bumping column 106 from INT64 to REAL on data row 2019, field contains 'QM' Bumping column 106 from REAL to STR on data row 2019, field contains 'QM' Bumping column 85 from INT to INT64 on data row 2341, field contains 'SH' Bumping column 85 from INT64 to REAL on data row 2341, field contains 'SH' Bumping column 85 from REAL to STR on data row 2341, field contains 'SH' Bumping column 115 from INT to INT64 on data row 2341, field contains 'QN' Bumping column 115 from INT64 to REAL on data row 2341, field contains 'QN' Bumping column 115 from REAL to STR on data row 2341, field contains 'QN' Bumping column 350 from INT to INT64 on data row 2791, field contains 'C' Bumping column 350 from INT64 to REAL on data row 2791, field contains 'C' Bumping column 350 from REAL to STR on data row 2791, field contains 'C' Bumping column 353 from INT to INT64 on data row 2791, field contains 'C' Bumping column 353 from INT64 to REAL on data row 2791, field contains 'C' Bumping column 353 from REAL to 
STR on data row 2791, field contains 'C' Bumping column 108 from INT to INT64 on data row 2898, field contains 'QM' Bumping column 108 from INT64 to REAL on data row 2898, field contains 'QM' Bumping column 108 from REAL to STR on data row 2898, field contains 'QM' Bumping column 158 from INT to INT64 on data row 3011, field contains 'V441' Bumping column 158 from INT64 to REAL on data row 3011, field contains 'V441' Bumping column 158 from REAL to STR on data row 3011, field contains 'V441' Bumping column 159 from INT to INT64 on data row 3011, field contains 'V1582' Bumping column 159 from INT64 to REAL on data row 3011, field contains 'V1582' Bumping column 159 from REAL to STR on data row 3011, field contains 'V1582' Bumping column 160 from INT to INT64 on data row 3011, field contains 'V5861' Bumping column 160 from INT64 to REAL on data row 3011, field contains 'V5861' Bumping column 160 from REAL to STR on data row 3011, field contains 'V5861' Bumping column 86 from INT to INT64 on data row 3021, field contains 'RH' Bumping column 86 from INT64 to REAL on data row 3021, field contains 'RH' Bumping column 86 from REAL to STR on data row 3021, field contains 'RH' Bumping column 116 from INT to INT64 on data row 3021, field contains 'QM' Bumping column 116 from INT64 to REAL on data row 3021, field contains 'QM' Bumping column 116 from REAL to STR on data row 3021, field contains 'QM' Bumping column 109 from INT to INT64 on data row 3112, field contains 'QM' Bumping column 109 from INT64 to REAL on data row 3112, field contains 'QM' Bumping column 109 from REAL to STR on data row 3112, field contains 'QM' Bumping column 113 from INT to INT64 on data row 5208, field contains 'QM' Bumping column 113 from INT64 to REAL on data row 5208, field contains 'QM' Bumping column 113 from REAL to STR on data row 5208, field contains 'QM' Bumping column 188 from INT to INT64 on data row 8138, field contains 'Y' Bumping column 188 from INT64 to REAL on data row 8138, field 
contains 'Y' Bumping column 188 from REAL to STR on data row 8138, field contains 'Y' Bumping column 189 from INT to INT64 on data row 8138, field contains 'Y' Bumping column 189 from INT64 to REAL on data row 8138, field contains 'Y' Bumping column 189 from REAL to STR on data row 8138, field contains 'Y' Bumping column 190 from INT to INT64 on data row 8138, field contains 'Y' Bumping column 190 from INT64 to REAL on data row 8138, field contains 'Y' Bumping column 190 from REAL to STR on data row 8138, field contains 'Y' 0%Bumping column 161 from INT to INT64 on data row 13758, field contains 'V1582' Bumping column 161 from INT64 to REAL on data row 13758, field contains 'V1582' Bumping column 161 from REAL to STR on data row 13758, field contains 'V1582' Bumping column 231 from INT to INT64 on data row 18303, field contains 'Y' Bumping column 231 from INT64 to REAL on data row 18303, field contains 'Y' Bumping column 231 from REAL to STR on data row 18303, field contains 'Y' Bumping column 87 from INT to INT64 on data row 20592, field contains 'GO' Bumping column 87 from INT64 to REAL on data row 20592, field contains 'GO' Bumping column 87 from REAL to STR on data row 20592, field contains 'GO' Bumping column 192 from INT to INT64 on data row 29413, field contains 'Y' Bumping column 192 from INT64 to REAL on data row 29413, field contains 'Y' Bumping column 192 from REAL to STR on data row 29413, field contains 'Y' Bumping column 193 from INT to INT64 on data row 29413, field contains 'Y' Bumping column 193 from INT64 to REAL on data row 29413, field contains 'Y' Bumping column 193 from REAL to STR on data row 29413, field contains 'Y' Bumping column 194 from INT to INT64 on data row 29413, field contains 'Y' Bumping column 194 from INT64 to REAL on data row 29413, field contains 'Y' Bumping column 194 from REAL to STR on data row 29413, field contains 'Y' Bumping column 96 from INT to INT64 on data row 31954, field contains 'LT' Bumping column 96 from INT64 
to REAL on data row 31954, field contains 'LT' Bumping column 96 from REAL to STR on data row 31954, field contains 'LT' Bumping column 191 from INT to INT64 on data row 41091, field contains 'Y' Bumping column 191 from INT64 to REAL on data row 41091, field contains 'Y' Bumping column 191 from REAL to STR on data row 41091, field contains 'Y' Bumping column 162 from INT to INT64 on data row 44469, field contains 'V1582' Bumping column 162 from INT64 to REAL on data row 44469, field contains 'V1582' Bumping column 162 from REAL to STR on data row 44469, field contains 'V1582' Bumping column 163 from INT to INT64 on data row 49003, field contains 'V5865' Bumping column 163 from INT64 to REAL on data row 49003, field contains 'V5865' Bumping column 163 from REAL to STR on data row 49003, field contains 'V5865' Bumping column 90 from INT to INT64 on data row 87095, field contains 'EH' Bumping column 90 from INT64 to REAL on data row 87095, field contains 'EH' Bumping column 90 from REAL to STR on data row 87095, field contains 'EH' Bumping column 120 from INT to INT64 on data row 87095, field contains 'QM' Bumping column 120 from INT64 to REAL on data row 87095, field contains 'QM' Bumping column 120 from REAL to STR on data row 87095, field contains 'QM' Bumping column 213 from INT to INT64 on data row 91672, field contains 'V692' Bumping column 213 from INT64 to REAL on data row 91672, field contains 'V692' Bumping column 213 from REAL to STR on data row 91672, field contains 'V692' Bumping column 338 from INT to INT64 on data row 92112, field contains 'D' Bumping column 338 from INT64 to REAL on data row 92112, field contains 'D' Bumping column 338 from REAL to STR on data row 92112, field contains 'D' Bumping column 339 from INT to INT64 on data row 92112, field contains 'D' Bumping column 339 from INT64 to REAL on data row 92112, field contains 'D' Bumping column 339 from REAL to STR on data row 92112, field contains 'D' Bumping column 214 from INT to INT64 on 
data row 92181, field contains 'V681' Bumping column 214 from INT64 to REAL on data row 92181, field contains 'V681' Bumping column 214 from REAL to STR on data row 92181, field contains 'V681' Bumping column 91 from INT to INT64 on data row 95380, field contains 'GP' Bumping column 91 from INT64 to REAL on data row 95380, field contains 'GP' Bumping column 91 from REAL to STR on data row 95380, field contains 'GP' Bumping column 216 from INT to INT64 on data row 109576, field contains 'E8499' Bumping column 216 from INT64 to REAL on data row 109576, field contains 'E8499' Bumping column 216 from REAL to STR on data row 109576, field contains 'E8499' 4%Bumping column 98 from INT to INT64 on data row 115301, field contains 'GP' Bumping column 98 from INT64 to REAL on data row 115301, field contains 'GP' Bumping column 98 from REAL to STR on data row 115301, field contains 'GP' Bumping column 117 from INT to INT64 on data row 188433, field contains 'QM' Bumping column 117 from INT64 to REAL on data row 188433, field contains 'QM' Bumping column 117 from REAL to STR on data row 188433, field contains 'QM' Bumping column 93 from INT to INT64 on data row 188671, field contains 'LT' Bumping column 93 from INT64 to REAL on data row 188671, field contains 'LT' Bumping column 93 from REAL to STR on data row 188671, field contains 'LT' Bumping column 92 from INT to INT64 on data row 188909, field contains 'RH' Bumping column 92 from INT64 to REAL on data row 188909, field contains 'RH' Bumping column 92 from REAL to STR on data row 188909, field contains 'RH' Bumping column 122 from INT to INT64 on data row 188909, field contains 'QM' Bumping column 122 from INT64 to REAL on data row 188909, field contains 'QM' Bumping column 122 from REAL to STR on data row 188909, field contains 'QM' Bumping column 121 from INT to INT64 on data row 189176, field contains 'QM' Bumping column 121 from INT64 to REAL on data row 189176, field contains 'QM' Bumping column 121 from REAL to STR 
on data row 189176, field contains 'QM' Bumping column 195 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 195 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 195 from REAL to STR on data row 189548, field contains 'Y' Bumping column 196 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 196 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 196 from REAL to STR on data row 189548, field contains 'Y' Bumping column 197 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 197 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 197 from REAL to STR on data row 189548, field contains 'Y' Bumping column 198 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 198 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 198 from REAL to STR on data row 189548, field contains 'Y' Bumping column 199 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 199 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 199 from REAL to STR on data row 189548, field contains 'Y' Bumping column 200 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 200 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 200 from REAL to STR on data row 189548, field contains 'Y' Bumping column 201 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 201 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 201 from REAL to STR on data row 189548, field contains 'Y' Bumping column 202 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 202 from INT64 to REAL on data row 189548, field contains 'Y' Bumping column 202 from REAL to STR on data row 189548, field contains 'Y' Bumping column 203 from INT to INT64 on data row 189548, field contains 'Y' Bumping column 203 from INT64 to REAL on data row 189548, 
field contains 'Y' Bumping column 203 from REAL to STR on data row 189548, field contains 'Y' Bumping column 232 from INT to INT64 on data row 189586, field contains 'U' Bumping column 232 from INT64 to REAL on data row 189586, field contains 'U' Bumping column 232 from REAL to STR on data row 189586, field contains 'U' Bumping column 123 from INT to INT64 on data row 190895, field contains 'QM' Bumping column 123 from INT64 to REAL on data row 190895, field contains 'QM' Bumping column 123 from REAL to STR on data row 190895, field contains 'QM' Bumping column 97 from INT to INT64 on data row 191623, field contains 'NH' Bumping column 97 from INT64 to REAL on data row 191623, field contains 'NH' Bumping column 97 from REAL to STR on data row 191623, field contains 'NH' Bumping column 127 from INT to INT64 on data row 191623, field contains 'QM' Bumping column 127 from INT64 to REAL on data row 191623, field contains 'QM' Bumping column 127 from REAL to STR on data row 191623, field contains 'QM' Bumping column 88 from INT to INT64 on data row 191828, field contains 'RH' Bumping column 88 from INT64 to REAL on data row 191828, field contains 'RH' Bumping column 88 from REAL to STR on data row 191828, field contains 'RH' Bumping column 118 from INT to INT64 on data row 191828, field contains 'QM' Bumping column 118 from INT64 to REAL on data row 191828, field contains 'QM' Bumping column 118 from REAL to STR on data row 191828, field contains 'QM' Bumping column 89 from INT to INT64 on data row 191925, field contains 'RH' Bumping column 89 from INT64 to REAL on data row 191925, field contains 'RH' Bumping column 89 from REAL to STR on data row 191925, field contains 'RH' Bumping column 119 from INT to INT64 on data row 191925, field contains 'QM' Bumping column 119 from INT64 to REAL on data row 191925, field contains 'QM' Bumping column 119 from REAL to STR on data row 191925, field contains 'QM' Bumping column 94 from INT to INT64 on data row 196090, field 
contains 'RH' Bumping column 94 from INT64 to REAL on data row 196090, field contains 'RH' Bumping column 94 from REAL to STR on data row 196090, field contains 'RH' Bumping column 124 from INT to INT64 on data row 196090, field contains 'QM' Bumping column 124 from INT64 to REAL on data row 196090, field contains 'QM' Bumping column 124 from REAL to STR on data row 196090, field contains 'QM' Bumping column 217 from INT to INT64 on data row 196596, field contains 'E9208' Bumping column 217 from INT64 to REAL on data row 196596, field contains 'E9208' Bumping column 217 from REAL to STR on data row 196596, field contains 'E9208' Bumping column 126 from INT to INT64 on data row 197965, field contains 'QM' Bumping column 126 from INT64 to REAL on data row 197965, field contains 'QM' Bumping column 126 from REAL to STR on data row 197965, field contains 'QM' Bumping column 95 from INT to INT64 on data row 208608, field contains 'LT' Bumping column 95 from INT64 to REAL on data row 208608, field contains 'LT' Bumping column 95 from REAL to STR on data row 208608, field contains 'LT' Bumping column 218 from INT to INT64 on data row 216015, field contains 'E0008' Bumping column 218 from INT64 to REAL on data row 216015, field contains 'E0008' Bumping column 218 from REAL to STR on data row 216015, field contains 'E0008' Bumping column 219 from INT to INT64 on data row 224785, field contains 'E030' Bumping column 219 from INT64 to REAL on data row 224785, field contains 'E030' Bumping column 219 from REAL to STR on data row 224785, field contains 'E030' 8%Bumping column 220 from INT to INT64 on data row 233544, field contains 'E8499' Bumping column 220 from INT64 to REAL on data row 233544, field contains 'E8499' Bumping column 220 from REAL to STR on data row 233544, field contains 'E8499' Bumping column 221 from INT to INT64 on data row 233544, field contains 'E0008' Bumping column 221 from INT64 to REAL on data row 233544, field contains 'E0008' Bumping column 221 from 
REAL to STR on data row 233544, field contains 'E0008'
Bumping column 100 from INT to INT64 on data row 253181, field contains 'GP'
Bumping column 100 from INT64 to REAL on data row 253181, field contains 'GP'
Bumping column 100 from REAL to STR on data row 253181, field contains 'GP'
Bumping column 99 from INT to INT64 on data row 330461, field contains 'GO'
Bumping column 99 from INT64 to REAL on data row 330461, field contains 'GO'
Bumping column 99 from REAL to STR on data row 330461, field contains 'GO'
12%Bumping column 128 from INT to INT64 on data row 419322, field contains 'QN'
Bumping column 128 from INT64 to REAL on data row 419322, field contains 'QN'
Bumping column 128 from REAL to STR on data row 419322, field contains 'QN'
Bumping column 130 from INT to INT64 on data row 420977, field contains 'QN'
Bumping column 130 from INT64 to REAL on data row 420977, field contains 'QN'
Bumping column 130 from REAL to STR on data row 420977, field contains 'QN'
Bumping column 125 from INT to INT64 on data row 426618, field contains 'QN'
Bumping column 125 from INT64 to REAL on data row 426618, field contains 'QN'
Bumping column 125 from REAL to STR on data row 426618, field contains 'QN'
Bumping column 101 from INT to INT64 on data row 446983, field contains 'HN'
Bumping column 101 from INT64 to REAL on data row 446983, field contains 'HN'
Bumping column 101 from REAL to STR on data row 446983, field contains 'HN'
Bumping column 131 from INT to INT64 on data row 446983, field contains 'QN'
Bumping column 131 from INT64 to REAL on data row 446983, field contains 'QN'
Bumping column 131 from REAL to STR on data row 446983, field contains 'QN'
Bumping column 129 from INT to INT64 on data row 448799, field contains 'QN'
Bumping column 129 from INT64 to REAL on data row 448799, field contains 'QN'
Bumping column 129 from REAL to STR on data row 448799, field contains 'QN'
Bumping column 233 from INT to INT64 on data row 455718, field contains 'Y'
Bumping column 233 from INT64 to REAL on data row 455718, field contains 'Y'
Bumping column 233 from REAL to STR on data row 455718, field contains 'Y'
Bumping column 234 from INT to INT64 on data row 458104, field contains 'Y'
Bumping column 234 from INT64 to REAL on data row 458104, field contains 'Y'
Bumping column 234 from REAL to STR on data row 458104, field contains 'Y'
Bumping column 235 from INT to INT64 on data row 458104, field contains 'Y'
Bumping column 235 from INT64 to REAL on data row 458104, field contains 'Y'
Bumping column 235 from REAL to STR on data row 458104, field contains 'Y'
16%Bumping column 204 from INT to INT64 on data row 535636, field contains 'U'
Bumping column 204 from INT64 to REAL on data row 535636, field contains 'U'
Bumping column 204 from REAL to STR on data row 535636, field contains 'U'
Bumping column 205 from INT to INT64 on data row 544450, field contains 'U'
Bumping column 205 from INT64 to REAL on data row 544450, field contains 'U'
Bumping column 205 from REAL to STR on data row 544450, field contains 'U'
Bumping column 206 from INT to INT64 on data row 563578, field contains 'U'
Bumping column 206 from INT64 to REAL on data row 563578, field contains 'U'
Bumping column 206 from REAL to STR on data row 563578, field contains 'U'
Bumping column 207 from INT to INT64 on data row 563578, field contains 'U'
Bumping column 207 from INT64 to REAL on data row 563578, field contains 'U'
Bumping column 207 from REAL to STR on data row 563578, field contains 'U'
Bumping column 208 from INT to INT64 on data row 570116, field contains 'U'
Bumping column 208 from INT64 to REAL on data row 570116, field contains 'U'
Bumping column 208 from REAL to STR on data row 570116, field contains 'U'
Bumping column 209 from INT to INT64 on data row 570116, field contains 'U'
Bumping column 209 from INT64 to REAL on data row 570116, field contains 'U'
Bumping column 209 from REAL to STR on data row 570116, field contains 'U'
24%Bumping column 8 from INT to INT64 on data row 768577, field contains 'F'
Bumping column 8 from INT64 to REAL on data row 768577, field contains 'F'
Bumping column 8 from REAL to STR on data row 768577, field contains 'F'
28%Bumping column 210 from INT to INT64 on data row 948003, field contains 'U'
Bumping column 210 from INT64 to REAL on data row 948003, field contains 'U'
Bumping column 210 from REAL to STR on data row 948003, field contains 'U'
Bumping column 211 from INT to INT64 on data row 948003, field contains 'U'
Bumping column 211 from INT64 to REAL on data row 948003, field contains 'U'
Bumping column 211 from REAL to STR on data row 948003, field contains 'U'
48%Bumping column 222 from INT to INT64 on data row 1567231, field contains 'E0009'
Bumping column 222 from INT64 to REAL on data row 1567231, field contains 'E0009'
Bumping column 222 from REAL to STR on data row 1567231, field contains 'E0009'
71%Bumping column 236 from INT to INT64 on data row 2163874, field contains 'U'
Bumping column 236 from INT64 to REAL on data row 2163874, field contains 'U'
Bumping column 236 from REAL to STR on data row 2163874, field contains 'U'
Bumping column 237 from INT to INT64 on data row 2177888, field contains 'U'
Bumping column 237 from INT64 to REAL on data row 2177888, field contains 'U'
Bumping column 237 from REAL to STR on data row 2177888, field contains 'U'
Bumping column 280 from INT to INT64 on data row 2204113, field contains 'invl'
Bumping column 280 from INT64 to REAL on data row 2204113, field contains 'invl'
Bumping column 280 from REAL to STR on data row 2204113, field contains 'invl'
   0.000s (2994439%) Memory map (rerun may be quicker)
   0.000s (2994439%) Sep and header detection
   0.000s (2994439%) Count rows (wc -l)
   0.000s (2994439%) Column type detection (first, middle and last 5 rows)
   0.000s (2994439%) Allocation of 5x13 result (xMB) in RAM
  25.710s ( 66%) Reading data
197983.135s (510003%) Allocation for type bumps (if any), including gc time if triggered
-197977.505s (-509988%) Coercing data already read in type bumps (if any)
-197977.505s (-509988%) Changing na.strings to NA
-197977.505s Total
There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"), ... :
  Bumped column 146 to type character on data row 9, field contains 'V5867'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
2: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"), ... :
  Bumped column 147 to type character on data row 9, field contains 'V5869'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
3: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"), ... :
  Bumped column 142 to type character on data row 10, field contains 'V140'.
Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE. [[clipped]] ----------------------------------------------------- fread's guesses vs. column classes I know to be true: ----------------------------------------------------- structure(list(DTguess = c("integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", 
"character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "integer64", "integer", "integer", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", 
"integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "numeric", "integer", "integer", "integer", "integer", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "integer", "integer", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "numeric", "integer", "character", "integer", "integer", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer" ), actual = c("integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", 
"character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "character", "integer", "character", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", 
"character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "character", "integer", "character", "character", "integer", "integer", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "integer", "integer", "integer", "character", "integer", "character", "integer", "character", "integer", "integer", "integer", "integer", "numeric", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "integer", "integer", "character", "character", "character", "integer", "character", "integer", "integer", "integer", "integer", "integer", "numeric", "integer", "character", "integer", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", 
"integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer")), .Names = c("DTguess", "actual"), row.names = c("age", "ageday", "agemonth", "ahour", "amonth", "asource", "asourceub92", "asource_x", "atype", "aweekend", "billtype", "cpt1", "cpt2", "cpt3", "cpt4", "cpt5", "cpt6", "cpt7", "cpt8", "cpt9", "cpt10", "cpt11", "cpt12", "cpt13", "cpt14", "cpt15", "cpt16", "cpt17", "cpt18", "cpt19", "cpt20", "cpt21", "cpt22", "cpt23", "cpt24", "cpt25", "cpt26", "cpt27", "cpt28", "cpt29", "cpt30", "cptccs1", "cptccs2", "cptccs3", "cptccs4", "cptccs5", "cptccs6", "cptccs7", "cptccs8", "cptccs9", "cptccs10", "cptccs11", "cptccs12", "cptccs13", "cptccs14", "cptccs15", "cptccs16", "cptccs17", "cptccs18", "cptccs19", "cptccs20", "cptccs21", "cptccs22", "cptccs23", "cptccs24", "cptccs25", "cptccs26", "cptccs27", "cptccs28", "cptccs29", "cptccs30", "cptm1_1", "cptm1_2", "cptm1_3", "cptm1_4", "cptm1_5", "cptm1_6", "cptm1_7", "cptm1_8", "cptm1_9", "cptm1_10", "cptm1_11", "cptm1_12", "cptm1_13", "cptm1_14", "cptm1_15", "cptm1_16", "cptm1_17", "cptm1_18", "cptm1_19", "cptm1_20", "cptm1_21", "cptm1_22", "cptm1_23", "cptm1_24", "cptm1_25", "cptm1_26", "cptm1_27", "cptm1_28", "cptm1_29", "cptm1_30", "cptm2_1", "cptm2_2", "cptm2_3", "cptm2_4", "cptm2_5", "cptm2_6", "cptm2_7", "cptm2_8", "cptm2_9", "cptm2_10", "cptm2_11", "cptm2_12", "cptm2_13", "cptm2_14", "cptm2_15", "cptm2_16", "cptm2_17", "cptm2_18", "cptm2_19", "cptm2_20", "cptm2_21", "cptm2_22", "cptm2_23", "cptm2_24", "cptm2_25", "cptm2_26", "cptm2_27", "cptm2_28", "cptm2_29", "cptm2_30", "dhour", "died", "dispub04", "dispuniform", "disp_x", "dqtr", "dshospid", "duration", "dx1", "dx2", "dx3", "dx4", "dx5", "dx6", "dx7", "dx8", "dx9", "dx10", "dx11", "dx12", "dx13", "dx14", "dx15", "dx16", "dx17", "dx18", "dx19", "dx20", "dx21", "dx22", "dx23", "dx24", 
"dxccs1", "dxccs2", "dxccs3", "dxccs4", "dxccs5", "dxccs6", "dxccs7", "dxccs8", "dxccs9", "dxccs10", "dxccs11", "dxccs12", "dxccs13", "dxccs14", "dxccs15", "dxccs16", "dxccs17", "dxccs18", "dxccs19", "dxccs20", "dxccs21", "dxccs22", "dxccs23", "dxccs24", "dxpoa1", "dxpoa2", "dxpoa3", "dxpoa4", "dxpoa5", "dxpoa6", "dxpoa7", "dxpoa8", "dxpoa9", "dxpoa10", "dxpoa11", "dxpoa12", "dxpoa13", "dxpoa14", "dxpoa15", "dxpoa16", "dxpoa17", "dxpoa18", "dxpoa19", "dxpoa20", "dxpoa21", "dxpoa22", "dxpoa23", "dxpoa24", "dx_visit_reason1", "dx_visit_reason2", "dx_visit_reason3", "ecode1", "ecode2", "ecode3", "ecode4", "ecode5", "ecode6", "ecode7", "ecode8", "e_ccs1", "e_ccs2", "e_ccs3", "e_ccs4", "e_ccs5", "e_ccs6", "e_ccs7", "e_ccs8", "e_poa1", "e_poa2", "e_poa3", "e_poa4", "e_poa5", "e_poa6", "e_poa7", "e_poa8", "female", "hcup_ed", "hcup_os", "hcup_surgery_broad", "hcup_surgery_narrow", "hispanic_x", "hospbrth", "hospst", "key", "los", "los_x", "maritalstatusub04", "mdnum1_r", "mdnum2_r", "medincstq", "momnum_r", "mrn_r", "nchronic", "ncpt", "ndx", "necode", "neomat", "npr", "opservice", "orproc", "os_time", "pay1", "pay1_x", "pay2", "pay2_x", "pay3", "pay3_x", "pl_cbsa", "pl_msa1993", "pl_nchs2006", "pl_ruca10_2005", "pl_ruca2005", "pl_ruca4_2005", "pl_rucc2003", "pl_uic2003", "pl_ur_cat4", "pr1", "pr2", "pr3", "pr4", "pr5", "pr6", "pr7", "pr8", "pr9", "pr10", "pr11", "pr12", "pr13", "pr14", "pr15", "pr16", "pr17", "pr18", "prccs1", "prccs2", "prccs3", "prccs4", "prccs5", "prccs6", "prccs7", "prccs8", "prccs9", "prccs10", "prccs11", "prccs12", "prccs13", "prccs14", "prccs15", "prccs16", "prccs17", "prccs18", "prday1", "prday2", "prday3", "prday4", "prday5", "prday6", "prday7", "prday8", "prday9", "prday10", "prday11", "prday12", "prday13", "prday14", "prday15", "prday16", "prday17", "prday18", "proctype", "pstate", "pstco", "pstco2", "pointoforiginub04", "pointoforigin_x", "primlang", "race", "race_x", "readmit", "state_as", "state_ed", "state_os", "totchg", "totchg_x", 
"year", "zip3", "zipinc_qrtl", "town", "zip", "ayear", "dmonth", "bmonth", "byear", "prmonth1", "prmonth2", "prmonth3", "prmonth4", "prmonth5", "prmonth6", "prmonth7", "prmonth8", "prmonth9", "prmonth10", "prmonth11", "prmonth12", "prmonth13", "prmonth14", "prmonth15", "prmonth16", "prmonth17", "prmonth18", "pryear1", "pryear2", "pryear3", "pryear4", "pryear5", "pryear6", "pryear7", "pryear8", "pryear9", "pryear10", "pryear11", "pryear12", "pryear13", "pryear14", "pryear15", "pryear16", "pryear17", "pryear18" ), class = "data.frame") -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Sep 13 00:42:22 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 12 Sep 2013 23:42:22 +0100 Subject: [datatable-help] colClasses and fread In-Reply-To: References: Message-ID: <5232434E.3050608@mdowle.plus.com> Is that v1.8.10 as on CRAN? It doesn't look like it from a few clues in the output below. v1.8.10 has colClasses working, see NEWS. On 12/09/13 22:32, Ari Friedman wrote: > Dear maintainers of that most wonderful package that makes R fast with > big data, > > I've recently discovered fread. It's amazing. My call to read.fwf on a > 4GB file that took all night now takes under a minute after conversion > to csv via csvkit/in2csv. > > However, automatic type detection is working very poorly, probably due > to the presence of a large number of columns with high rates of > missingness, plus a large number of character columns with encoded > values (these are medical and diagnostic codes). > > Normally I'd specify colClasses, and the warning messages even tell me I > should specify colClasses, but there's no colClasses argument to fread. > > Any thoughts on solving this? Verbose output, warnings, and a > comparison of the guesses vs. what the documentation on the file says it > is are found below. Unfortunately the data can't be shared, even in > small portions so I can't make this reproducible. > > Thanks! 
> Ari
>
> dt <- fread('myfile.csv', verbose=TRUE)
> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
> Using line 30 to detect sep (the last non blank line in the first 30) ... ','
> Found 393 columns
> First row with 393 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 2994440
> Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows
> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows)
> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows)
> Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows)
> 0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867'
> Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867'
> Bumping column 146 from REAL to STR on data row 9, field contains 'V5867'
> Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869'
> Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869'
> Bumping column 147 from REAL to STR on data row 9, field contains 'V5869'
> Bumping column 142 from INT to INT64 on data row 10, field contains 'V140'
> Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140'
> Bumping column 142 from REAL to STR on data row 10, field contains 'V140'
> Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885'
> Bumping column 17 from INT64 to REAL on data row 12, field contains 'J1885'
> Bumping column 17 from REAL to STR on data row 12, field contains 'J1885'
> Bumping column 74 from INT to INT64 on data row 12, field contains 'LT'
> Bumping column 74 from INT64 to REAL on data row 12, field contains 'LT'
> Bumping column 74 from REAL to STR on data row 12, field contains 'LT'
> Bumping column 143 from INT to INT64 on data row 13, field contains 'V142'
> Bumping column 143 from INT64 to REAL on data row 13, field contains 'V142'
> Bumping column 143 from REAL to STR on data row 13, field contains 'V142'
> Bumping column 14 from INT to INT64 on data row 22, field contains 'G0431'
> Bumping column 14 from INT64 to REAL on data row 22, field contains 'G0431'
> Bumping column 14 from REAL to STR on data row 22, field contains 'G0431'
> Bumping column 21 from INT to INT64 on data row 23, field contains 'J7060'
> Bumping column 21 from INT64 to REAL on data row 23, field contains 'J7060'
> Bumping column 21 from REAL to STR on data row 23, field contains 'J7060'
> Bumping column 24 from INT to INT64 on data row 27, field contains 'J2405'
> Bumping column 24 from INT64 to REAL on data row 27, field contains 'J2405'
> Bumping column 24 from REAL to STR on data row 27, field contains 'J2405'
> Bumping column 72 from INT to INT64 on data row 35, field contains 'F1'
> Bumping column 72 from INT64 to REAL on data row 35, field contains 'F1'
> Bumping column 72 from REAL to STR on data row 35, field contains 'F1'
> Bumping column 141 from INT to INT64 on data row 35, field contains 'V061'
> Bumping column 141 from INT64 to REAL on data row 35, field contains 'V061'
> Bumping column 141 from REAL to STR on data row 35, field contains 'V061'
> Bumping column 26 from INT to INT64 on data row 37, field contains 'J0690'
> Bumping column 26 from INT64 to REAL on data row 37, field contains 'J0690'
> Bumping column 26 from REAL to STR on data row 37, field contains 'J0690'
> Bumping column 28 from INT to INT64 on data row 37, field contains 'J7030'
> Bumping column 28 from INT64 to REAL on data row 37, field contains 'J7030'
> Bumping column 28 from REAL to STR on data row 37, field contains 'J7030'
> Bumping column 29 from INT to INT64 on data row 37, field contains 'J7040'
> Bumping column 29 from INT64 to REAL on data row 37, field contains 'J7040'
> Bumping column 29 from REAL to STR on data row 37, field contains 'J7040'
> Bumping column 25 from INT to INT64 on data row 43, field contains 'Q9967'
> Bumping column 25 from INT64 to REAL on data row 43, field contains 'Q9967'
> Bumping column 25 from REAL to STR on data row 43, field contains 'Q9967'
> Bumping column 30 from INT to INT64 on data row 43, field contains 'J7030'
> Bumping column 30 from INT64 to REAL on data row 43, field contains 'J7030'
> Bumping column 30 from REAL to STR on data row 43, field contains 'J7030'
> Bumping column 31 from INT to INT64 on data row 43, field contains 'J2405'
> Bumping column 31 from INT64 to REAL on data row 43, field contains 'J2405'
> Bumping column 31 from REAL to STR on data row 43, field contains 'J2405'
> Bumping column 148 from INT to INT64 on data row 44, field contains 'V1551'
> Bumping column 148 from INT64 to REAL on data row 44, field contains 'V1551'
> Bumping column 148 from REAL to STR on data row 44, field contains 'V1551'
> Bumping column 149 from INT to INT64 on data row 44, field contains 'V1588'
> Bumping column 149 from INT64 to REAL on data row 44, field contains 'V1588'
> Bumping column 149 from REAL to STR on data row 44, field contains 'V1588'
> Bumping column 76 from INT to INT64 on data row 45, field contains 'RT'
> Bumping column 76 from INT64 to REAL on data row 45, field contains 'RT'
> Bumping column 76 from REAL to STR on data row 45, field contains 'RT'
> Bumping column 27 from INT to INT64 on data row 53, field contains 'J2405'
> Bumping column 27 from INT64 to REAL on data row 53, field contains 'J2405'
> Bumping column 27 from REAL to STR on data row 53, field contains 'J2405'
> Bumping column 32 from INT to INT64 on data row 56, field contains 'J1885'
> Bumping column 32 from INT64 to REAL on data row 56, field contains 'J1885'
> Bumping column 32 from REAL to STR on data row 56, field contains 'J1885'
> Bumping column 33 from INT to INT64 on data row 56, field contains 'J2270'
> Bumping column 33 from INT64 to REAL on data row 56, field contains 'J2270'
> Bumping column 33 from REAL to STR on data row 56, field contains 'J2270'
> Bumping column 34 from INT to INT64 on data row 56, field contains 'J2405'
> Bumping column 34 from INT64 to REAL on data row 56, field contains 'J2405'
> Bumping column 34 from REAL to STR on data row 56, field contains 'J2405'
> Bumping column 77 from INT to INT64 on data row 65, field contains 'LT'
> Bumping column 77 from INT64 to REAL on data row 65, field contains 'LT'
> Bumping column 77 from REAL to STR on data row 65, field contains 'LT'
> Bumping column 140 from INT to INT64 on data row 74, field contains 'V689'
> Bumping column 140 from INT64 to REAL on data row 74, field contains 'V689'
> Bumping column 140 from REAL to STR on data row 74, field contains 'V689'
> Bumping column 13 from INT to INT64 on data row 103, field contains 'J1100'
> Bumping column 13 from INT64 to REAL on data row 103, field contains 'J1100'
> Bumping column 13 from REAL to STR on data row 103, field contains 'J1100'
> Bumping column 150 from INT to INT64 on data row 104, field contains 'V1508'
> Bumping column 150 from INT64 to REAL on data row 104, field contains 'V1508'
> Bumping column 150 from REAL to STR on data row 104, field contains 'V1508'
> Bumping column 212 from INT to INT64 on data row 107, field contains 'V714'
> Bumping column 212 from INT64 to REAL on data row 107, field contains 'V714'
> Bumping column 212 from REAL to STR on data row 107, field contains 'V714'
> Bumping column 12 from INT to INT64 on data row 113, field contains 'A0427'
> Bumping column 12 from INT64 to REAL on data row 113, field contains 'A0427'
> Bumping column 12 from REAL to STR on data row 113, field contains 'A0427'
> Bumping column 81 from INT to INT64 on data row 113, field contains 'RH'
> Bumping column 81 from INT64 to REAL on data row 113, field contains 'RH'
> Bumping column 81 from REAL to STR on data row 113, field contains 'RH'
> Bumping column 102 from INT to INT64 on data row 113, field contains 'QM'
> Bumping column 102 from INT64 to REAL on data row 113, field contains 'QM'
> Bumping column 102 from REAL to STR on data row 113, field contains 'QM'
> Bumping column 111 from INT to INT64 on data row 113, field contains 'QM'
> Bumping column 111 from INT64 to REAL on data row 113, field contains 'QM'
> Bumping column 111 from REAL to STR on data row 113, field contains 'QM'
> Bumping column 151 from INT to INT64 on data row 294, field contains 'V146'
> Bumping column 151 from INT64 to REAL on data row 294, field contains 'V146'
> Bumping column 151 from REAL to STR on data row 294, field contains 'V146'
> Bumping column 152 from INT to INT64 on data row 294, field contains 'V148'
> Bumping column 152 from INT64 to REAL on data row 294, field contains 'V148'
> Bumping column 152 from REAL to STR on data row 294, field contains 'V148'
> Bumping column 84 from INT to INT64 on data row 346, field contains 'RH'
> Bumping column 84 from INT64 to REAL on data row 346, field contains 'RH'
> Bumping column 84 from REAL to STR on data row 346, field contains 'RH'
> Bumping column 114 from INT to INT64 on data row 346, field contains 'QM'
> Bumping column 114 from INT64 to REAL on data row 346, field contains 'QM'
> Bumping column 114 from REAL to STR on data row 346, field contains 'QM'
> Bumping column 36 from INT to INT64 on data row 348, field contains 'J1644'
> Bumping column 36 from INT64 to REAL on data row 348, field contains 'J1644'
> Bumping column 36 from REAL to STR on data row 348, field contains 'J1644'
> Bumping column 37 from INT to INT64 on data row 348, field contains 'J7030'
> Bumping column 37 from INT64 to REAL on data row 348, field contains 'J7030'
> Bumping column 37 from REAL to STR on data row 348, field contains 'J7030'
> Bumping column 38 from INT to INT64 on data row 348, field contains 'J2405'
> Bumping column 38 from INT64 to REAL on data row 348, field contains 'J2405'
> Bumping column 38 from REAL to STR on data row 348, field contains 'J2405'
> Bumping column 39 from INT to INT64 on data row 349, field contains 'J2405'
> Bumping column 39 from INT64 to REAL on data row 349, field contains 'J2405'
> Bumping column 39 from REAL to STR on data row 349, field contains 'J2405'
> Bumping column 103 from INT to INT64 on data row 702, field contains 'QM'
> Bumping column 103 from INT64 to REAL on data row 702, field contains 'QM'
> Bumping column 103 from REAL to STR on data row 702, field contains 'QM'
> Bumping column 104 from INT to INT64 on data row 702, field contains 'QM'
> Bumping column 104 from INT64 to REAL on data row 702, field contains 'QM'
> Bumping column 104 from REAL to STR on data row 702, field contains 'QM'
> Bumping column 153 from INT to INT64 on data row 815, field contains 'V4561'
> Bumping column 153 from INT64 to REAL on data row 815, field contains 'V4561'
> Bumping column 153 from REAL to STR on data row 815, field contains 'V4561'
> Bumping column 78 from INT to INT64 on data row 891, field contains 'RT'
> Bumping column 78 from INT64 to REAL on data row 891, field contains 'RT'
> Bumping column 78 from REAL to STR on data row 891, field contains 'RT'
> Bumping column 79 from INT to INT64 on data row 891, field contains 'LT'
> Bumping column 79 from INT64 to REAL on data row 891, field contains 'LT'
> Bumping column 79 from REAL to STR on data row 891, field contains 'LT'
> Bumping column 80 from INT to INT64 on data row 891, field contains 'LT'
> Bumping column 80 from INT64 to REAL on data row 891, field contains 'LT'
> Bumping column 80 from REAL to STR on data row 891, field contains 'LT'
> Bumping column 35 from INT to INT64 on data row 892, field contains 'J2270'
> Bumping column 35 from INT64 to REAL on data row 892, field contains 'J2270'
> Bumping column 35 from REAL to STR on data row 892, field contains 'J2270'
> Bumping column 82 from INT to INT64 on data row 931, field contains 'RH'
> Bumping column 82 from INT64 to REAL on data row 931, field contains 'RH'
> Bumping column 82 from REAL to STR on data row 931, field contains 'RH'
> Bumping column 112 from INT to INT64 on data row 931, field contains 'QM'
> Bumping column 112 from INT64 to REAL on data row 931, field contains 'QM'
> Bumping column 112 from REAL to STR on data row 931, field contains 'QM'
> Bumping column 154 from INT to INT64 on data row 1151, field contains 'V4582'
> Bumping column 154 from INT64 to REAL on data row 1151, field contains 'V4582'
> Bumping column 154 from REAL to STR on data row 1151, field contains 'V4582'
> Bumping column 107 from INT to INT64 on data row 1268, field contains 'QM'
> Bumping column 107 from INT64 to REAL on data row 1268, field contains 'QM'
> Bumping column 107 from REAL to STR on data row 1268, field contains 'QM'
> Bumping column 40 from INT to INT64 on data row 1414, field contains 'J2270'
> Bumping column 40 from INT64 to REAL on data row 1414, field contains 'J2270'
> Bumping column 40 from REAL to STR on data row 1414, field contains 'J2270'
> Bumping column 41 from INT to INT64 on data row 1414, field contains 'J7040'
> Bumping column 41 from INT64 to REAL on data row 1414, field contains 'J7040'
> Bumping column 41 from REAL to STR on data row 1414, field contains 'J7040'
> Bumping column 155 from INT to INT64 on data row 1417, field contains 'V8741'
> Bumping column 155 from INT64 to REAL on data row 1417, field contains 'V8741'
> Bumping column 155 from REAL to STR on data row 1417, field contains 'V8741'
> Bumping column 156 from INT to INT64 on data row 1417, field contains 'V1504'
> Bumping column 156 from INT64 to REAL on data row 1417, field contains 'V1504'
> Bumping column 156 from REAL to STR on data row 1417, field contains 'V1504'
> Bumping column 157 from INT to INT64 on data row 1417, field contains 'V2651'
> Bumping column 157 from INT64 to REAL on data row 1417, field contains 'V2651'
> Bumping column 157 from REAL to STR on data row 1417, field contains 'V2651'
> Bumping column 83 from INT to INT64 on data row 1629, field contains 'GP'
> Bumping column 83 from INT64 to REAL on data row 1629, field contains 'GP'
> Bumping column 83 from REAL to STR on data row 1629, field contains 'GP'
> Bumping column 105 from INT to INT64 on data row 1688, field contains 'QM'
> Bumping column 105 from INT64 to REAL on data row 1688, field contains 'QM'
> Bumping column 105 from REAL to STR on data row 1688, field contains 'QM'
> Bumping column 110 from INT to INT64 on data row 1999, field contains 'QM'
> Bumping column 110 from INT64 to REAL on data row 1999, field contains 'QM'
> Bumping column 110 from REAL to STR on data row 1999, field contains 'QM'
> Bumping column 106 from INT to INT64 on data row 2019, field contains 'QM'
> Bumping column 106 from INT64 to REAL on data row 2019, field contains 'QM'
> Bumping column 106 from REAL to STR on data row 2019, field contains 'QM'
> Bumping column 85 from INT to INT64 on data row 2341, field contains 'SH'
> Bumping column 85 from INT64 to REAL on data row 2341, field contains 'SH'
> Bumping column 85 from REAL to STR on data row 2341, field contains 'SH'
> Bumping column 115 from INT to INT64 on data row 2341, field contains 'QN'
> Bumping column 115 from INT64 to REAL on data row 2341, field contains 'QN'
> Bumping column 115 from REAL to STR on data row 2341, field contains 'QN'
> Bumping column 350 from INT to INT64 on data row 2791, field contains 'C'
> Bumping column 350 from INT64 to REAL on data row 2791, field contains 'C'
> Bumping column 350 from REAL to STR on data row 2791, field contains 'C'
> Bumping column 353 from INT to INT64 on data row 2791, field contains 'C'
> Bumping column 353 from INT64 to REAL on data row 2791, field contains 'C'
> Bumping column 353 from REAL to STR on data row 2791, field contains 'C'
> Bumping column 108 from INT to INT64 on data row 2898, field contains 'QM'
> Bumping column 108 from INT64 to REAL on data row 2898, field contains 'QM'
> Bumping column 108 from REAL to STR on data row 2898, field contains 'QM'
> Bumping column 158 from INT to INT64 on data row 3011, field contains 'V441'
> Bumping column 158 from INT64 to REAL on data row 3011, field contains 'V441'
> Bumping column 158 from REAL to STR on data row 3011, field contains 'V441'
> Bumping column 159 from INT to INT64 on data row 3011, field contains 'V1582'
> Bumping column 159 from INT64 to REAL on data row 3011, field contains 'V1582'
> Bumping column 159 from REAL to STR on data row 3011, field contains 'V1582'
> Bumping column 160 from INT to INT64 on data row 3011, field contains 'V5861'
> Bumping column 160 from INT64 to REAL on data row 3011, field contains 'V5861'
> Bumping column 160 from REAL to STR on data row 3011, field contains 'V5861'
> Bumping column 86 from INT to INT64 on data row 3021, field contains 'RH'
> Bumping column 86 from INT64 to REAL on data row 3021, field contains 'RH'
> Bumping column 86 from REAL to STR on data row 3021, field contains 'RH'
> Bumping
column 116 from INT to INT64 on data row 3021, field contains 'QM' > Bumping column 116 from INT64 to REAL on data row 3021, field contains 'QM' > Bumping column 116 from REAL to STR on data row 3021, field contains 'QM' > Bumping column 109 from INT to INT64 on data row 3112, field contains 'QM' > Bumping column 109 from INT64 to REAL on data row 3112, field contains 'QM' > Bumping column 109 from REAL to STR on data row 3112, field contains 'QM' > Bumping column 113 from INT to INT64 on data row 5208, field contains 'QM' > Bumping column 113 from INT64 to REAL on data row 5208, field contains 'QM' > Bumping column 113 from REAL to STR on data row 5208, field contains 'QM' > Bumping column 188 from INT to INT64 on data row 8138, field contains 'Y' > Bumping column 188 from INT64 to REAL on data row 8138, field contains 'Y' > Bumping column 188 from REAL to STR on data row 8138, field contains 'Y' > Bumping column 189 from INT to INT64 on data row 8138, field contains 'Y' > Bumping column 189 from INT64 to REAL on data row 8138, field contains 'Y' > Bumping column 189 from REAL to STR on data row 8138, field contains 'Y' > Bumping column 190 from INT to INT64 on data row 8138, field contains 'Y' > Bumping column 190 from INT64 to REAL on data row 8138, field contains 'Y' > Bumping column 190 from REAL to STR on data row 8138, field contains 'Y' > 0%Bumping column 161 from INT to INT64 on data row 13758, field contains 'V1582' > Bumping column 161 from INT64 to REAL on data row 13758, field contains 'V1582' > Bumping column 161 from REAL to STR on data row 13758, field contains 'V1582' > Bumping column 231 from INT to INT64 on data row 18303, field contains 'Y' > Bumping column 231 from INT64 to REAL on data row 18303, field contains 'Y' > Bumping column 231 from REAL to STR on data row 18303, field contains 'Y' > Bumping column 87 from INT to INT64 on data row 20592, field contains 'GO' > Bumping column 87 from INT64 to REAL on data row 20592, field contains 'GO' > 
Bumping column 87 from REAL to STR on data row 20592, field contains 'GO' > Bumping column 192 from INT to INT64 on data row 29413, field contains 'Y' > Bumping column 192 from INT64 to REAL on data row 29413, field contains 'Y' > Bumping column 192 from REAL to STR on data row 29413, field contains 'Y' > Bumping column 193 from INT to INT64 on data row 29413, field contains 'Y' > Bumping column 193 from INT64 to REAL on data row 29413, field contains 'Y' > Bumping column 193 from REAL to STR on data row 29413, field contains 'Y' > Bumping column 194 from INT to INT64 on data row 29413, field contains 'Y' > Bumping column 194 from INT64 to REAL on data row 29413, field contains 'Y' > Bumping column 194 from REAL to STR on data row 29413, field contains 'Y' > Bumping column 96 from INT to INT64 on data row 31954, field contains 'LT' > Bumping column 96 from INT64 to REAL on data row 31954, field contains 'LT' > Bumping column 96 from REAL to STR on data row 31954, field contains 'LT' > Bumping column 191 from INT to INT64 on data row 41091, field contains 'Y' > Bumping column 191 from INT64 to REAL on data row 41091, field contains 'Y' > Bumping column 191 from REAL to STR on data row 41091, field contains 'Y' > Bumping column 162 from INT to INT64 on data row 44469, field contains 'V1582' > Bumping column 162 from INT64 to REAL on data row 44469, field contains 'V1582' > Bumping column 162 from REAL to STR on data row 44469, field contains 'V1582' > Bumping column 163 from INT to INT64 on data row 49003, field contains 'V5865' > Bumping column 163 from INT64 to REAL on data row 49003, field contains 'V5865' > Bumping column 163 from REAL to STR on data row 49003, field contains 'V5865' > Bumping column 90 from INT to INT64 on data row 87095, field contains 'EH' > Bumping column 90 from INT64 to REAL on data row 87095, field contains 'EH' > Bumping column 90 from REAL to STR on data row 87095, field contains 'EH' > Bumping column 120 from INT to INT64 on data row 
87095, field contains 'QM' > Bumping column 120 from INT64 to REAL on data row 87095, field contains 'QM' > Bumping column 120 from REAL to STR on data row 87095, field contains 'QM' > Bumping column 213 from INT to INT64 on data row 91672, field contains 'V692' > Bumping column 213 from INT64 to REAL on data row 91672, field contains 'V692' > Bumping column 213 from REAL to STR on data row 91672, field contains 'V692' > Bumping column 338 from INT to INT64 on data row 92112, field contains 'D' > Bumping column 338 from INT64 to REAL on data row 92112, field contains 'D' > Bumping column 338 from REAL to STR on data row 92112, field contains 'D' > Bumping column 339 from INT to INT64 on data row 92112, field contains 'D' > Bumping column 339 from INT64 to REAL on data row 92112, field contains 'D' > Bumping column 339 from REAL to STR on data row 92112, field contains 'D' > Bumping column 214 from INT to INT64 on data row 92181, field contains 'V681' > Bumping column 214 from INT64 to REAL on data row 92181, field contains 'V681' > Bumping column 214 from REAL to STR on data row 92181, field contains 'V681' > Bumping column 91 from INT to INT64 on data row 95380, field contains 'GP' > Bumping column 91 from INT64 to REAL on data row 95380, field contains 'GP' > Bumping column 91 from REAL to STR on data row 95380, field contains 'GP' > Bumping column 216 from INT to INT64 on data row 109576, field contains 'E8499' > Bumping column 216 from INT64 to REAL on data row 109576, field contains 'E8499' > Bumping column 216 from REAL to STR on data row 109576, field contains 'E8499' > 4%Bumping column 98 from INT to INT64 on data row 115301, field contains 'GP' > Bumping column 98 from INT64 to REAL on data row 115301, field contains 'GP' > Bumping column 98 from REAL to STR on data row 115301, field contains 'GP' > Bumping column 117 from INT to INT64 on data row 188433, field contains 'QM' > Bumping column 117 from INT64 to REAL on data row 188433, field contains 'QM' > 
Bumping column 117 from REAL to STR on data row 188433, field contains 'QM' > Bumping column 93 from INT to INT64 on data row 188671, field contains 'LT' > Bumping column 93 from INT64 to REAL on data row 188671, field contains 'LT' > Bumping column 93 from REAL to STR on data row 188671, field contains 'LT' > Bumping column 92 from INT to INT64 on data row 188909, field contains 'RH' > Bumping column 92 from INT64 to REAL on data row 188909, field contains 'RH' > Bumping column 92 from REAL to STR on data row 188909, field contains 'RH' > Bumping column 122 from INT to INT64 on data row 188909, field contains 'QM' > Bumping column 122 from INT64 to REAL on data row 188909, field contains 'QM' > Bumping column 122 from REAL to STR on data row 188909, field contains 'QM' > Bumping column 121 from INT to INT64 on data row 189176, field contains 'QM' > Bumping column 121 from INT64 to REAL on data row 189176, field contains 'QM' > Bumping column 121 from REAL to STR on data row 189176, field contains 'QM' > Bumping column 195 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 195 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 195 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 196 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 196 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 196 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 197 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 197 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 197 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 198 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 198 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 198 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 199 from INT to INT64 on 
data row 189548, field contains 'Y' > Bumping column 199 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 199 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 200 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 200 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 200 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 201 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 201 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 201 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 202 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 202 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 202 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 203 from INT to INT64 on data row 189548, field contains 'Y' > Bumping column 203 from INT64 to REAL on data row 189548, field contains 'Y' > Bumping column 203 from REAL to STR on data row 189548, field contains 'Y' > Bumping column 232 from INT to INT64 on data row 189586, field contains 'U' > Bumping column 232 from INT64 to REAL on data row 189586, field contains 'U' > Bumping column 232 from REAL to STR on data row 189586, field contains 'U' > Bumping column 123 from INT to INT64 on data row 190895, field contains 'QM' > Bumping column 123 from INT64 to REAL on data row 190895, field contains 'QM' > Bumping column 123 from REAL to STR on data row 190895, field contains 'QM' > Bumping column 97 from INT to INT64 on data row 191623, field contains 'NH' > Bumping column 97 from INT64 to REAL on data row 191623, field contains 'NH' > Bumping column 97 from REAL to STR on data row 191623, field contains 'NH' > Bumping column 127 from INT to INT64 on data row 191623, field contains 'QM' > Bumping column 127 from INT64 to REAL on data row 191623, field contains 'QM' > 
Bumping column 127 from REAL to STR on data row 191623, field contains 'QM' > Bumping column 88 from INT to INT64 on data row 191828, field contains 'RH' > Bumping column 88 from INT64 to REAL on data row 191828, field contains 'RH' > Bumping column 88 from REAL to STR on data row 191828, field contains 'RH' > Bumping column 118 from INT to INT64 on data row 191828, field contains 'QM' > Bumping column 118 from INT64 to REAL on data row 191828, field contains 'QM' > Bumping column 118 from REAL to STR on data row 191828, field contains 'QM' > Bumping column 89 from INT to INT64 on data row 191925, field contains 'RH' > Bumping column 89 from INT64 to REAL on data row 191925, field contains 'RH' > Bumping column 89 from REAL to STR on data row 191925, field contains 'RH' > Bumping column 119 from INT to INT64 on data row 191925, field contains 'QM' > Bumping column 119 from INT64 to REAL on data row 191925, field contains 'QM' > Bumping column 119 from REAL to STR on data row 191925, field contains 'QM' > Bumping column 94 from INT to INT64 on data row 196090, field contains 'RH' > Bumping column 94 from INT64 to REAL on data row 196090, field contains 'RH' > Bumping column 94 from REAL to STR on data row 196090, field contains 'RH' > Bumping column 124 from INT to INT64 on data row 196090, field contains 'QM' > Bumping column 124 from INT64 to REAL on data row 196090, field contains 'QM' > Bumping column 124 from REAL to STR on data row 196090, field contains 'QM' > Bumping column 217 from INT to INT64 on data row 196596, field contains 'E9208' > Bumping column 217 from INT64 to REAL on data row 196596, field contains 'E9208' > Bumping column 217 from REAL to STR on data row 196596, field contains 'E9208' > Bumping column 126 from INT to INT64 on data row 197965, field contains 'QM' > Bumping column 126 from INT64 to REAL on data row 197965, field contains 'QM' > Bumping column 126 from REAL to STR on data row 197965, field contains 'QM' > Bumping column 95 from 
INT to INT64 on data row 208608, field contains 'LT' > Bumping column 95 from INT64 to REAL on data row 208608, field contains 'LT' > Bumping column 95 from REAL to STR on data row 208608, field contains 'LT' > Bumping column 218 from INT to INT64 on data row 216015, field contains 'E0008' > Bumping column 218 from INT64 to REAL on data row 216015, field contains 'E0008' > Bumping column 218 from REAL to STR on data row 216015, field contains 'E0008' > Bumping column 219 from INT to INT64 on data row 224785, field contains 'E030' > Bumping column 219 from INT64 to REAL on data row 224785, field contains 'E030' > Bumping column 219 from REAL to STR on data row 224785, field contains 'E030' > 8%Bumping column 220 from INT to INT64 on data row 233544, field contains 'E8499' > Bumping column 220 from INT64 to REAL on data row 233544, field contains 'E8499' > Bumping column 220 from REAL to STR on data row 233544, field contains 'E8499' > Bumping column 221 from INT to INT64 on data row 233544, field contains 'E0008' > Bumping column 221 from INT64 to REAL on data row 233544, field contains 'E0008' > Bumping column 221 from REAL to STR on data row 233544, field contains 'E0008' > Bumping column 100 from INT to INT64 on data row 253181, field contains 'GP' > Bumping column 100 from INT64 to REAL on data row 253181, field contains 'GP' > Bumping column 100 from REAL to STR on data row 253181, field contains 'GP' > Bumping column 99 from INT to INT64 on data row 330461, field contains 'GO' > Bumping column 99 from INT64 to REAL on data row 330461, field contains 'GO' > Bumping column 99 from REAL to STR on data row 330461, field contains 'GO' > 12%Bumping column 128 from INT to INT64 on data row 419322, field contains 'QN' > Bumping column 128 from INT64 to REAL on data row 419322, field contains 'QN' > Bumping column 128 from REAL to STR on data row 419322, field contains 'QN' > Bumping column 130 from INT to INT64 on data row 420977, field contains 'QN' > Bumping column 
130 from INT64 to REAL on data row 420977, field contains 'QN' > Bumping column 130 from REAL to STR on data row 420977, field contains 'QN' > Bumping column 125 from INT to INT64 on data row 426618, field contains 'QN' > Bumping column 125 from INT64 to REAL on data row 426618, field contains 'QN' > Bumping column 125 from REAL to STR on data row 426618, field contains 'QN' > Bumping column 101 from INT to INT64 on data row 446983, field contains 'HN' > Bumping column 101 from INT64 to REAL on data row 446983, field contains 'HN' > Bumping column 101 from REAL to STR on data row 446983, field contains 'HN' > Bumping column 131 from INT to INT64 on data row 446983, field contains 'QN' > Bumping column 131 from INT64 to REAL on data row 446983, field contains 'QN' > Bumping column 131 from REAL to STR on data row 446983, field contains 'QN' > Bumping column 129 from INT to INT64 on data row 448799, field contains 'QN' > Bumping column 129 from INT64 to REAL on data row 448799, field contains 'QN' > Bumping column 129 from REAL to STR on data row 448799, field contains 'QN' > Bumping column 233 from INT to INT64 on data row 455718, field contains 'Y' > Bumping column 233 from INT64 to REAL on data row 455718, field contains 'Y' > Bumping column 233 from REAL to STR on data row 455718, field contains 'Y' > Bumping column 234 from INT to INT64 on data row 458104, field contains 'Y' > Bumping column 234 from INT64 to REAL on data row 458104, field contains 'Y' > Bumping column 234 from REAL to STR on data row 458104, field contains 'Y' > Bumping column 235 from INT to INT64 on data row 458104, field contains 'Y' > Bumping column 235 from INT64 to REAL on data row 458104, field contains 'Y' > Bumping column 235 from REAL to STR on data row 458104, field contains 'Y' > 16%Bumping column 204 from INT to INT64 on data row 535636, field contains 'U' > Bumping column 204 from INT64 to REAL on data row 535636, field contains 'U' > Bumping column 204 from REAL to STR on data 
row 535636, field contains 'U' > Bumping column 205 from INT to INT64 on data row 544450, field contains 'U' > Bumping column 205 from INT64 to REAL on data row 544450, field contains 'U' > Bumping column 205 from REAL to STR on data row 544450, field contains 'U' > Bumping column 206 from INT to INT64 on data row 563578, field contains 'U' > Bumping column 206 from INT64 to REAL on data row 563578, field contains 'U' > Bumping column 206 from REAL to STR on data row 563578, field contains 'U' > Bumping column 207 from INT to INT64 on data row 563578, field contains 'U' > Bumping column 207 from INT64 to REAL on data row 563578, field contains 'U' > Bumping column 207 from REAL to STR on data row 563578, field contains 'U' > Bumping column 208 from INT to INT64 on data row 570116, field contains 'U' > Bumping column 208 from INT64 to REAL on data row 570116, field contains 'U' > Bumping column 208 from REAL to STR on data row 570116, field contains 'U' > Bumping column 209 from INT to INT64 on data row 570116, field contains 'U' > Bumping column 209 from INT64 to REAL on data row 570116, field contains 'U' > Bumping column 209 from REAL to STR on data row 570116, field contains 'U' > 24%Bumping column 8 from INT to INT64 on data row 768577, field contains 'F' > Bumping column 8 from INT64 to REAL on data row 768577, field contains 'F' > Bumping column 8 from REAL to STR on data row 768577, field contains 'F' > 28%Bumping column 210 from INT to INT64 on data row 948003, field contains 'U' > Bumping column 210 from INT64 to REAL on data row 948003, field contains 'U' > Bumping column 210 from REAL to STR on data row 948003, field contains 'U' > Bumping column 211 from INT to INT64 on data row 948003, field contains 'U' > Bumping column 211 from INT64 to REAL on data row 948003, field contains 'U' > Bumping column 211 from REAL to STR on data row 948003, field contains 'U' > 48%Bumping column 222 from INT to INT64 on data row 1567231, field contains 'E0009' > Bumping 
column 222 from INT64 to REAL on data row 1567231, field contains 'E0009' > Bumping column 222 from REAL to STR on data row 1567231, field contains 'E0009' > 71%Bumping column 236 from INT to INT64 on data row 2163874, field contains 'U' > Bumping column 236 from INT64 to REAL on data row 2163874, field contains 'U' > Bumping column 236 from REAL to STR on data row 2163874, field contains 'U' > Bumping column 237 from INT to INT64 on data row 2177888, field contains 'U' > Bumping column 237 from INT64 to REAL on data row 2177888, field contains 'U' > Bumping column 237 from REAL to STR on data row 2177888, field contains 'U' > Bumping column 280 from INT to INT64 on data row 2204113, field contains 'invl' > Bumping column 280 from INT64 to REAL on data row 2204113, field contains 'invl' > Bumping column 280 from REAL to STR on data row 2204113, field contains 'invl' > 0.000s (2994439%) Memory map (rerun may be quicker) > 0.000s (2994439%) Sep and header detection > 0.000s (2994439%) Count rows (wc -l) > 0.000s (2994439%) Colmn type detection (first, middle and last 5 rows) > 0.000s (2994439%) Allocation of 5x13 result (xMB) in RAM > 25.710s ( 66%) Reading data > 197983.135s (510003%) Allocation for type bumps (if any), including gc time if triggered > -197977.505s (-509988%) Coercing data already read in type bumps (if any) > -197977.505s (-509988%) Changing na.strings to NA > -197977.505s Total > There were 50 or more warnings (use warnings() to see the first 50) > > > > Warning messages: > 1: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"), ... : > Bumped column 146 to type character on data row 9, field contains 'V5867'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). 
If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
> 2: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"), ... :
> Bumped column 147 to type character on data row 9, field contains 'V5869'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
> 3: In fread(file.path(sedddir, "active", "NJ_SEDD_2011_CORE.csv"), ... :
> Bumped column 142 to type character on data row 10, field contains 'V140'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
> [[clipped]]
>
> -----------------------------------------------------
> fread's guesses vs.
column classes I know to be true: > ----------------------------------------------------- > > structure(list(DTguess = c("integer", "integer", "integer", "integer", > "integer", "integer", "integer", "character", "integer", "integer", > "integer", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "character", "character", "character", > "character", "character", "character", "character", "character", > 
"character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "character", "character", "character", > "character", "character", "character", "character", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "character", "integer64", "integer", "integer", "character", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "numeric", "integer", "integer", "integer", "integer", "character", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", 
"integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "character", "integer", "integer", "character", "character", > "character", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "numeric", "integer", "character", "integer", > "integer", "character", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer" > ), actual = c("integer", "integer", "integer", "integer", "integer", > "integer", "character", "character", "integer", "integer", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", 
"character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "integer", "integer", "integer", "integer", "character", "integer", > "character", "integer", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "character", "character", "character", > "character", "character", "character", "character", "character", > "integer", 
"integer", "integer", "integer", "integer", "character", > "integer", "character", "character", "integer", "integer", "character", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "character", > "integer", "integer", "integer", "character", "integer", "character", > "integer", "character", "integer", "integer", "integer", "integer", > "numeric", "integer", "integer", "integer", "integer", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "character", "character", "character", > "character", "character", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "character", "integer", "integer", > "character", "character", "character", "integer", "character", > "integer", "integer", "integer", "integer", "integer", "numeric", > "integer", "character", "integer", "character", "character", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer", "integer", "integer", > "integer", "integer", "integer", "integer")), .Names = c("DTguess", > "actual"), row.names = c("age", "ageday", "agemonth", "ahour", > "amonth", "asource", "asourceub92", "asource_x", "atype", "aweekend", > "billtype", "cpt1", 
"cpt2", "cpt3", "cpt4", "cpt5", "cpt6", "cpt7", > "cpt8", "cpt9", "cpt10", "cpt11", "cpt12", "cpt13", "cpt14", > "cpt15", "cpt16", "cpt17", "cpt18", "cpt19", "cpt20", "cpt21", > "cpt22", "cpt23", "cpt24", "cpt25", "cpt26", "cpt27", "cpt28", > "cpt29", "cpt30", "cptccs1", "cptccs2", "cptccs3", "cptccs4", > "cptccs5", "cptccs6", "cptccs7", "cptccs8", "cptccs9", "cptccs10", > "cptccs11", "cptccs12", "cptccs13", "cptccs14", "cptccs15", "cptccs16", > "cptccs17", "cptccs18", "cptccs19", "cptccs20", "cptccs21", "cptccs22", > "cptccs23", "cptccs24", "cptccs25", "cptccs26", "cptccs27", "cptccs28", > "cptccs29", "cptccs30", "cptm1_1", "cptm1_2", "cptm1_3", "cptm1_4", > "cptm1_5", "cptm1_6", "cptm1_7", "cptm1_8", "cptm1_9", "cptm1_10", > "cptm1_11", "cptm1_12", "cptm1_13", "cptm1_14", "cptm1_15", "cptm1_16", > "cptm1_17", "cptm1_18", "cptm1_19", "cptm1_20", "cptm1_21", "cptm1_22", > "cptm1_23", "cptm1_24", "cptm1_25", "cptm1_26", "cptm1_27", "cptm1_28", > "cptm1_29", "cptm1_30", "cptm2_1", "cptm2_2", "cptm2_3", "cptm2_4", > "cptm2_5", "cptm2_6", "cptm2_7", "cptm2_8", "cptm2_9", "cptm2_10", > "cptm2_11", "cptm2_12", "cptm2_13", "cptm2_14", "cptm2_15", "cptm2_16", > "cptm2_17", "cptm2_18", "cptm2_19", "cptm2_20", "cptm2_21", "cptm2_22", > "cptm2_23", "cptm2_24", "cptm2_25", "cptm2_26", "cptm2_27", "cptm2_28", > "cptm2_29", "cptm2_30", "dhour", "died", "dispub04", "dispuniform", > "disp_x", "dqtr", "dshospid", "duration", "dx1", "dx2", "dx3", > "dx4", "dx5", "dx6", "dx7", "dx8", "dx9", "dx10", "dx11", "dx12", > "dx13", "dx14", "dx15", "dx16", "dx17", "dx18", "dx19", "dx20", > "dx21", "dx22", "dx23", "dx24", "dxccs1", "dxccs2", "dxccs3", > "dxccs4", "dxccs5", "dxccs6", "dxccs7", "dxccs8", "dxccs9", "dxccs10", > "dxccs11", "dxccs12", "dxccs13", "dxccs14", "dxccs15", "dxccs16", > "dxccs17", "dxccs18", "dxccs19", "dxccs20", "dxccs21", "dxccs22", > "dxccs23", "dxccs24", "dxpoa1", "dxpoa2", "dxpoa3", "dxpoa4", > "dxpoa5", "dxpoa6", "dxpoa7", "dxpoa8", "dxpoa9", "dxpoa10", > "dxpoa11", 
"dxpoa12", "dxpoa13", "dxpoa14", "dxpoa15", "dxpoa16", > "dxpoa17", "dxpoa18", "dxpoa19", "dxpoa20", "dxpoa21", "dxpoa22", > "dxpoa23", "dxpoa24", "dx_visit_reason1", "dx_visit_reason2", > "dx_visit_reason3", "ecode1", "ecode2", "ecode3", "ecode4", "ecode5", > "ecode6", "ecode7", "ecode8", "e_ccs1", "e_ccs2", "e_ccs3", "e_ccs4", > "e_ccs5", "e_ccs6", "e_ccs7", "e_ccs8", "e_poa1", "e_poa2", "e_poa3", > "e_poa4", "e_poa5", "e_poa6", "e_poa7", "e_poa8", "female", "hcup_ed", > "hcup_os", "hcup_surgery_broad", "hcup_surgery_narrow", "hispanic_x", > "hospbrth", "hospst", "key", "los", "los_x", "maritalstatusub04", > "mdnum1_r", "mdnum2_r", "medincstq", "momnum_r", "mrn_r", "nchronic", > "ncpt", "ndx", "necode", "neomat", "npr", "opservice", "orproc", > "os_time", "pay1", "pay1_x", "pay2", "pay2_x", "pay3", "pay3_x", > "pl_cbsa", "pl_msa1993", "pl_nchs2006", "pl_ruca10_2005", "pl_ruca2005", > "pl_ruca4_2005", "pl_rucc2003", "pl_uic2003", "pl_ur_cat4", "pr1", > "pr2", "pr3", "pr4", "pr5", "pr6", "pr7", "pr8", "pr9", "pr10", > "pr11", "pr12", "pr13", "pr14", "pr15", "pr16", "pr17", "pr18", > "prccs1", "prccs2", "prccs3", "prccs4", "prccs5", "prccs6", "prccs7", > "prccs8", "prccs9", "prccs10", "prccs11", "prccs12", "prccs13", > "prccs14", "prccs15", "prccs16", "prccs17", "prccs18", "prday1", > "prday2", "prday3", "prday4", "prday5", "prday6", "prday7", "prday8", > "prday9", "prday10", "prday11", "prday12", "prday13", "prday14", > "prday15", "prday16", "prday17", "prday18", "proctype", "pstate", > "pstco", "pstco2", "pointoforiginub04", "pointoforigin_x", "primlang", > "race", "race_x", "readmit", "state_as", "state_ed", "state_os", > "totchg", "totchg_x", "year", "zip3", "zipinc_qrtl", "town", > "zip", "ayear", "dmonth", "bmonth", "byear", "prmonth1", "prmonth2", > "prmonth3", "prmonth4", "prmonth5", "prmonth6", "prmonth7", "prmonth8", > "prmonth9", "prmonth10", "prmonth11", "prmonth12", "prmonth13", > "prmonth14", "prmonth15", "prmonth16", "prmonth17", "prmonth18", > 
"pryear1", "pryear2", "pryear3", "pryear4", "pryear5", "pryear6", > "pryear7", "pryear8", "pryear9", "pryear10", "pryear11", "pryear12", > "pryear13", "pryear14", "pryear15", "pryear16", "pryear17", "pryear18" > ), class = "data.frame") > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Sep 13 00:52:40 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Thu, 12 Sep 2013 23:52:40 +0100 Subject: [datatable-help] colClasses and fread In-Reply-To: <5232434E.3050608@mdowle.plus.com> References: <5232434E.3050608@mdowle.plus.com> Message-ID: <523245B8.5090807@mdowle.plus.com> But I think in the diagnostics you sent, the final result was still correct. The initial guess may have been poor, but it bumped the columns mid read and worked it out. Why do you need to set colClasses? What was wrong in the final result? (BTW, this thread was failing the mailman size filter (100k message size). I let them through and chopped the history on this one for that reason. ) On 12/09/13 23:42, Matthew Dowle wrote: > > Is that v1.8.10 as on CRAN? It doesn't look like it from a few clues > in the output below. > v1.8.10 has colClasses working, see NEWS. > > On 12/09/13 22:32, Ari Friedman wrote: >> Dear maintainers of that most wonderful package that makes R fast with >> big data, >> >> I've recently discovered fread. It's amazing. My call to read.fwf on a >> 4GB file that took all night now takes under a minute after conversion >> to csv via csvkit/in2csv. 
>> >> However, automatic type detection is working very poorly, probably due >> to the presence of a large number of columns with high rates of >> missingness, plus a large number of character columns with encoded >> values (these are medical and diagnostic codes). >> >> Normally I'd specify colClasses, and the warning messages even tell me I >> should specify colClasses, but there's no colClasses argument to fread. >> >> Any thoughts on solving this? Verbose output, warnings, and a >> comparison of the guesses vs. what the documentation on the file says it >> is are found below. Unfortunately the data can't be shared, even in >> small portions so I can't make this reproducible. >> >> Thanks! >> Ari >> > dt <- fread('myfile.csv', verbose=TRUE) >> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. >> Using line 30 to detect sep (the last non blank line in the first 30) ... ',' >> Found 393 columns >> First row with 393 fields occurs on line 1 (either column names or first row of data) >> All the fields on line 1 are character fields. Treating as the column names. 
>> Count of eol after first data row: 2994440 >> Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows >> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows) >> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows) >> Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows) >> 0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867' >> Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867' >> Bumping column 146 from REAL to STR on data row 9, field contains 'V5867' >> Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869' >> Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869' >> Bumping column 147 from REAL to STR on data row 9, field contains 'V5869' >> Bumping column 142 from INT to INT64 on data row 10, field contains 'V140' >> Bumping column 142 from INT64 to REAL on 
data row 10, field contains 'V140' >> Bumping column 142 from REAL to STR on data row 10, field contains 'V140' >> Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885' >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Sep 13 20:19:01 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 13 Sep 2013 19:19:01 +0100 Subject: [datatable-help] fread'ing logicals Message-ID: <52335715.8040109@mdowle.plus.com> All, I've implemented skipping columns using NULL in colClasses, and logicals are now also read. read.csv reads "T","F","TRUE","FALSE","True" and "False" as type logical, so I've followed suit. But I'm wondering about the single letters "T" and "F". To illustrate, the following might be confusing : > fread("A,B,C\nD,E,F\n") A B C 1: D E FALSE > fread("A,B,C\nD,E,F\nG,H,I\n") A B C 1: D E F 2: G H I > Should fread treat "T" and "F" as logical? Should it read a column of only 0's and 1's as logical, too? I think I'd prefer that as it's quite common. I'm also thinking of increasing the number of rows used for type detection to the top 500, middle 500 and bottom 500, since that's a very small extra cost to save the relatively much larger cost of mid read column bumps. As a parameter, with 500 by default. Matthew From caneff at gmail.com Fri Sep 13 20:25:18 2013 From: caneff at gmail.com (Chris Neff) Date: Fri, 13 Sep 2013 14:25:18 -0400 Subject: [datatable-help] fread'ing logicals In-Reply-To: <52335715.8040109@mdowle.plus.com> References: <52335715.8040109@mdowle.plus.com> Message-ID: I would prefer that you stay consistent with read.csv unless you really have a good reason. I don't think this is a good enough reason. They can specify colClasses or change it after the fact. On Fri, Sep 13, 2013 at 2:19 PM, Matthew Dowle wrote: > > All, > > I've implemented skipping columns using NULL in colClasses, and logicals > are now also read. 
read.csv reads "T","F","TRUE","FALSE","True" and > "False" as type logical, so I've followed suit. But I'm wondering about > the single letters "T" and "F". To illustrate, the following might be > confusing : > > > fread("A,B,C\nD,E,F\n") > A B C > 1: D E FALSE > > fread("A,B,C\nD,E,F\nG,H,I\n") > A B C > 1: D E F > 2: G H I > > > > Should fread treat "T" and "F" as logical? Should it read a column of > only 0's and 1's as logical, too? I think I'd prefer that as it's quite > common. > > I'm also thinking of increasing the number of rows used for type detection > to the top 500, middle 500 and bottom 500, since that's a very small extra > cost to save the relatively much larger cost of mid read column bumps. As a > parameter, with 500 by default. > > Matthew > > > ______________________________**_________________ > datatable-help mailing list > datatable-help at lists.r-forge.**r-project.org > https://lists.r-forge.r-**project.org/cgi-bin/mailman/** > listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chinmay.patil at gmail.com Sat Sep 14 06:03:43 2013 From: chinmay.patil at gmail.com (Chinmay Patil) Date: Sat, 14 Sep 2013 12:03:43 +0800 Subject: [datatable-help] fread'ing logicals In-Reply-To: References: <52335715.8040109@mdowle.plus.com> Message-ID: I agree.. One of the criticism I hear about newer packages in R ecosystem is inconsistency with existing conventions. I would also vote for consistency with read.csv / read.table On Sat, Sep 14, 2013 at 2:25 AM, Chris Neff wrote: > I would prefer that you stay consistent with read.csv unless you really > have a good reason. I don't think this is a good enough reason. They can > specify colClasses or change it after the fact. > > > On Fri, Sep 13, 2013 at 2:19 PM, Matthew Dowle wrote: > >> >> All, >> >> I've implemented skipping columns using NULL in colClasses, and >> logicals are now also read. 
read.csv reads "T","F","TRUE","FALSE","True" >> and "False" as type logical, so I've followed suit. But I'm wondering >> about the single letters "T" and "F". To illustrate, the following might >> be confusing : >> >> > fread("A,B,C\nD,E,F\n") >> A B C >> 1: D E FALSE >> > fread("A,B,C\nD,E,F\nG,H,I\n") >> A B C >> 1: D E F >> 2: G H I >> > >> >> Should fread treat "T" and "F" as logical? Should it read a column of >> only 0's and 1's as logical, too? I think I'd prefer that as it's quite >> common. >> >> I'm also thinking of increasing the number of rows used for type >> detection to the top 500, middle 500 and bottom 500, since that's a very >> small extra cost to save the relatively much larger cost of mid read column >> bumps. As a parameter, with 500 by default. >> >> Matthew >> >> >> ______________________________**_________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.**r-project.org >> https://lists.r-forge.r-**project.org/cgi-bin/mailman/** >> listinfo/datatable-help >> > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Sat Sep 14 06:42:51 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Fri, 13 Sep 2013 21:42:51 -0700 Subject: [datatable-help] fread'ing logicals In-Reply-To: References: <52335715.8040109@mdowle.plus.com> Message-ID: Hi Chinmay, On Fri, Sep 13, 2013 at 9:03 PM, Chinmay Patil wrote: > I agree.. One of the criticism I hear about newer packages in R ecosystem is > inconsistency with existing conventions. Out of curiosity, what packages (and criticisms) might those be? 
-steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From chinmay.patil at gmail.com Sat Sep 14 06:54:55 2013 From: chinmay.patil at gmail.com (Chinmay Patil) Date: Sat, 14 Sep 2013 12:54:55 +0800 Subject: [datatable-help] fread'ing logicals In-Reply-To: References: <52335715.8040109@mdowle.plus.com> Message-ID: For eg. I recently heard complains about data.table itself from due to changes in interface and learning curve that data.table comes with... I hear similar complaints about some packages like ggplot2, plyr.. Even though all these are great packages.. people don't like radical changes to interfaces as it makes refactoring older code even more painful. On Sat, Sep 14, 2013 at 12:42 PM, Steve Lianoglou wrote: > Hi Chinmay, > > On Fri, Sep 13, 2013 at 9:03 PM, Chinmay Patil > wrote: > > I agree.. One of the criticism I hear about newer packages in R > ecosystem is > > inconsistency with existing conventions. > > Out of curiosity, what packages (and criticisms) might those be? > > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Sat Sep 14 07:29:11 2013 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Fri, 13 Sep 2013 22:29:11 -0700 Subject: [datatable-help] fread'ing logicals In-Reply-To: References: <52335715.8040109@mdowle.plus.com> Message-ID: Thanks for the quick response. As for the "learning curve" stuff -- no real comment there, but: > For eg. I recently heard complains about data.table itself from due to > changes in interface Could you provide some concrete examples about which changes have stumped users? Perhaps we can learn from these critiques. 
I had thought we were pretty good about discussing any (breaking) changes on list, but I'd be interested to see where this has failed so it might perhaps be avoided in the future. > and learning curve that data.table comes with... I hear > similar complaints about some packages like ggplot2, plyr.. > > Even though all these are great packages.. people don't like radical changes > to interfaces as it makes refactoring older code even more painful. Still curious to hear what radical changes have come down the pipe. Thanks for taking the time to comment. Cheers, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From chinmay.patil at gmail.com Sat Sep 14 07:48:31 2013 From: chinmay.patil at gmail.com (Chinmay Patil) Date: Sat, 14 Sep 2013 13:48:31 +0800 Subject: [datatable-help] fread'ing logicals In-Reply-To: References: <52335715.8040109@mdowle.plus.com> Message-ID: I didn't mean changes in data.table's interface but the way data.table works in itself compared to normal data frames. I know there are valid reasons for structuring data.table's interface the way it is but not all users get it immediately. As for data.table, I am not complaining, just saying what other users complaints I have heard of. I personally love data.table and am willing to put the effort to learn best ways to use it while most users aren't. Chinmay On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou wrote: > Thanks for the quick response. > > As for the "learning curve" stuff -- no real comment there, but: > >> For eg. I recently heard complains about data.table itself from due to >> changes in interface > > Could you provide some concrete examples about which changes have > stumped users? Perhaps we can learn from these critiques. I had > thought we were pretty good about discussing any (breaking) changes on > list, but I'd be interested to see where this has failed so it might > perhaps be avoided in the future. 
> >> and learning curve that data.table comes with... I hear >> similar complaints about some packages like ggplot2, plyr.. >> >> Even though all these are great packages.. people don't like radical changes >> to interfaces as it makes refactoring older code even more painful. > > Still curious to hear what radical changes have come down the pipe. > > Thanks for taking the time to comment. > > Cheers, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech From mdowle at mdowle.plus.com Sat Sep 14 11:53:21 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 14 Sep 2013 10:53:21 +0100 Subject: [datatable-help] fread'ing logicals In-Reply-To: References: <52335715.8040109@mdowle.plus.com> Message-ID: <52343211.7040109@mdowle.plus.com> On 14/09/13 06:48, Chinmay Patil wrote: > I didn't mean changes in data.table's interface but the way data.table works in itself compared to normal data frames. I know there are valid reasons for structuring data.table's interface the way it is but not all users get it immediately. The bottom line in my mind is that even if base syntax was sped up (assignment to an unnamed data.frame needn't copy the whole data.frame for example), I would still move from subset()/transform()/with()/DF[i,j]<-value syntax, to i,j and by inside [...] with .SD,.I,.N and := in j. I can do things with that syntax that I need to do which aren't always so easy with base syntax (like adding columns by reference by group). And base R syntax is indeed being sped up by pqR, Renjin, Riposte, TERR, CXXR, fastr which may feed into GNU R. Once that is mature and the dust has settled, I would still move from data.frame to data.table on each of them. Maybe we should market the things that data.table does that base R doesn't. Rather than speed differences. > > As for data.table, I am not complaining, just saying what other users complaints I have heard of. 
> I personally love data.table and am willing to put the effort to learn best ways to use it while most users aren't. Great. data.table is for people like you. So we'll keep the default fread'ing of "T" and "F" as logicals then for consistency with read.csv. And I still hope to produce a drop-in replacement for read.csv which returns a data.frame but uses fread under the hood. That will speed up existing code, but users can use the extra features of fread if they want, too. Matthew > > Chinmay > > On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou wrote: > >> Thanks for the quick response. >> >> As for the "learning curve" stuff -- no real comment there, but: >> >>> For eg. I recently heard complains about data.table itself from due to >>> changes in interface >> Could you provide some concrete examples about which changes have >> stumped users? Perhaps we can learn from these critiques. I had >> thought we were pretty good about discussing any (breaking) changes on >> list, but I'd be interested to see where this has failed so it might >> perhaps be avoided in the future. >> >>> and learning curve that data.table comes with... I hear >>> similar complaints about some packages like ggplot2, plyr.. >>> >>> Even though all these are great packages.. people don't like radical changes >>> to interfaces as it makes refactoring older code even more painful. >> Still curious to hear what radical changes have come down the pipe. >> >> Thanks for taking the time to comment. 
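[Editor's note: the decision above, to follow read.csv for bare "T" and "F", can be checked against base R directly; a small sketch of read.csv's behaviour (via type.convert):

```r
# A column containing only T/F (or TRUE/FALSE etc.) is parsed as logical by
# read.csv; a column with other letters is left as text.
df <- read.csv(text = "x,y\nT,D\nF,E", stringsAsFactors = FALSE)
class(df$x)  # "logical"   -- "T"/"F" were converted, the convention fread keeps
class(df$y)  # "character" -- "D"/"E" are not logical tokens
```

This is the inconsistency Matthew illustrated: whether a column of single letters becomes logical depends only on which letters appear.]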
>> >> Cheers, >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From aragorn168b at gmail.com Sat Sep 14 12:29:03 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sat, 14 Sep 2013 12:29:03 +0200 Subject: [datatable-help] fread'ing logicals In-Reply-To: <52343211.7040109@mdowle.plus.com> References: <52335715.8040109@mdowle.plus.com> <52343211.7040109@mdowle.plus.com> Message-ID: <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com> Matthew, +1 for retaining T and F like read.csv. +1 for the dropins() feature as well. Arun On Saturday, September 14, 2013 at 11:53 AM, Matthew Dowle wrote: > On 14/09/13 06:48, Chinmay Patil wrote: > > I didn't mean changes in data.table's interface but the way data.table works in itself compared to normal data frames. I know there are valid reasons for structuring data.table's interface the way it is but not all users get it immediately. > > > The bottom line in my mind is that even if base syntax was sped up > (assignment to an unnamed data.frame needn't copy the whole data.frame > for example), I would still move from > subset()/transform()/with()/DF[i,j]<-value syntax, to i,j and by inside > [...] with .SD,.I,.N and := in j. I can do things with that syntax > that I need to do which aren't always so easy with base syntax (like > adding columns by reference by group). > > And base R syntax is indeed being sped up by pqR, Renjin, Riposte, TERR, > CXXR, fastr which may feed into GNU R. Once that is mature and the dust > has settled, I would still move from data.frame to data.table on each of > them. Maybe we should market the things that data.table does that base > R doesn't. Rather than speed differences. 
> > > > > As for data.table, I am not complaining, just saying what other users complaints I have heard of. > > I personally love data.table and am willing to put the effort to learn best ways to use it while most users aren't. > > > > > Great. data.table is for people like you. > > So we'll keep the default fread'ing of "T" and "F" as logicals then for > consistency with read.csv. > > And I still hope to produce a drop-in replacement for read.csv which > returns a data.frame but uses fread under the hood. That will speed up > existing code, but users can use the extra features of fread if they > want, too. > > Matthew > > > > > Chinmay > > > > On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou wrote: > > > > > Thanks for the quick response. > > > > > > As for the "learning curve" stuff -- no real comment there, but: > > > > > > > For eg. I recently heard complains about data.table itself from due to > > > > changes in interface > > > > > > > > > > Could you provide some concrete examples about which changes have > > > stumped users? Perhaps we can learn from these critiques. I had > > > thought we were pretty good about discussing any (breaking) changes on > > > list, but I'd be interested to see where this has failed so it might > > > perhaps be avoided in the future. > > > > > > > and learning curve that data.table comes with... I hear > > > > similar complaints about some packages like ggplot2, plyr.. > > > > > > > > Even though all these are great packages.. people don't like radical changes > > > > to interfaces as it makes refactoring older code even more painful. > > > > > > > > > > Still curious to hear what radical changes have come down the pipe. > > > > > > Thanks for taking the time to comment. 
> > > > > > Cheers, > > > -steve > > > > > > -- > > > Steve Lianoglou > > > Computational Biologist > > > Bioinformatics and Computational Biology > > > Genentech > > > > > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From harishv_99 at yahoo.com Sat Sep 14 21:05:29 2013 From: harishv_99 at yahoo.com (Harish) Date: Sat, 14 Sep 2013 12:05:29 -0700 (PDT) Subject: [datatable-help] "by" on integer64 not working Message-ID: <1379185529.38620.YahooMailNeo@web120203.mail.ne1.yahoo.com> I am trying to use "by" on integer64 data and data.table seems to think that there is only one value. This is reproduced with the following: library( data.table ) library( bit64 ) DT <- data.table( a=rep( 1:5, 2), b=15:24 ) DT[ , .N, by=a ] DT[ , a := as.integer64( a ) ] DT[ , .N, by=a ] The output I get is: > DT <- data.table( a=rep( 1:5, 2), b=15:24 ) > DT[ , .N, by=a ] a N 1: 1 2 2: 2 2 3: 3 2 4: 4 2 5: 5 2 > DT[ , a := as.integer64( a ) ] > DT[ , .N, by=a ] a N 1: 1 10 Notice that the "by" after converting column "a" to integer64 is different from before. However, the values of "a" are correct: > DT$a integer64 [1] 1 2 3 4 5 1 2 3 4 5 I am using the latest version of data.table from r-forge (1.8.11 Rev 965). I also had the same issue with 1.8.10 from CRAN. Am I doing something wrong or is this a bug? Thanks for your help. 
Regards, Harish -------------- next part -------------- An HTML attachment was scrubbed... URL: From harishv_99 at yahoo.com Sat Sep 14 21:57:55 2013 From: harishv_99 at yahoo.com (Harish) Date: Sat, 14 Sep 2013 12:57:55 -0700 (PDT) Subject: [datatable-help] fread() and UTF-8 support Message-ID: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com> Does fread() support UTF-8? I got a text file that is mostly Latin-1 characters but encoded as UTF-8. When I load the data, the first column name has a few extra characters in the beginning ("???id"), but I do not get this when I convert the same file to ANSI format using Windows Notepad. I am guessing that UTF-8 encoding puts a few extra characters in the beginning of the text file to indicate that it is an UTF-8 encoding, and fread() is reading that literally as the first column name. Thanks for the clarification. Regards, Harish -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat Sep 14 22:06:16 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 14 Sep 2013 21:06:16 +0100 Subject: [datatable-help] "by" on integer64 not working In-Reply-To: <1379185529.38620.YahooMailNeo@web120203.mail.ne1.yahoo.com> References: <1379185529.38620.YahooMailNeo@web120203.mail.ne1.yahoo.com> Message-ID: <5234C1B8.4090306@mdowle.plus.com> Sorry - haven't got to implementing grouping or keys for integer64 yet. All that's been done is integer64 in fread. There's a bug item on the list. Matthew On 14/09/13 20:05, Harish wrote: > I am trying to use "by" on integer64 data and data.table seems to > think that there is only one value.
This is reproduced with the > following: > > library( data.table ) > library( bit64 ) > > DT <- data.table( a=rep( 1:5, 2), b=15:24 ) > DT[ , .N, by=a ] > DT[ , a := as.integer64( a ) ] > DT[ , .N, by=a ] > > The output I get is: > > > DT <- data.table( a=rep( 1:5, 2), b=15:24 ) > > DT[ , .N, by=a ] > a N > 1: 1 2 > 2: 2 2 > 3: 3 2 > 4: 4 2 > 5: 5 2 > > DT[ , a := as.integer64( a ) ] > > DT[ , .N, by=a ] > a N > 1: 1 10 > > Notice that the "by" after converting column "a" to integer64 is > different from before. However, the values of "a" are correct: > > DT$a > integer64 > [1] 1 2 3 4 5 1 2 3 4 5 > > I am using the latest version of data.table from r-forge (1.8.11 Rev > 965). I also had the same issue with 1.8.10 from CRAN. > > Am I doing something wrong or is this a bug? Thanks for your help. > > > Regards, > Harish > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat Sep 14 22:33:42 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 14 Sep 2013 21:33:42 +0100 Subject: [datatable-help] fread() and UTF-8 support In-Reply-To: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com> References: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com> Message-ID: <5234C826.6090908@mdowle.plus.com> Sorry again - nope hadn't given UTF-8 any thought. Matthew On 14/09/13 20:57, Harish wrote: > Does fread() support UTF-8? I got a text file that is mostly Latin-1 > characters but encoded as UTF-8. When I load the data, the first > column name has a few extra characters in the beginning ("???id"), but > I do not get this when I convert the same file to ANSI format using > Windows Notepad. 
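[Editor's note: until grouping on integer64 is implemented, one possible workaround, a sketch not suggested in the thread, is to group on a character rendering of the key, since integer64 values convert to character losslessly:

```r
library(data.table)
library(bit64)

DT <- data.table(a = rep(1:5, 2), b = 15:24)
DT[, a := as.integer64(a)]

# by=a collapses to a single group in this version, so group on a
# character version of the key instead.
DT[, .N, by = list(a = as.character(a))]
```

The grouping then works on ordinary character keys, at the cost of the group column no longer being integer64.]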
> > I am guessing that UTF-8 encoding puts a few extra characters in the > beginning of the text file to indicate that it is an UTF-8 encoding, > and fread() is reading that literally as the first column name. > > Thanks for the clarification. > > Regards, > Harish > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From micheledemeo at gmail.com Sat Sep 14 22:47:26 2013 From: micheledemeo at gmail.com (MICHELE DE MEO) Date: Sat, 14 Sep 2013 22:47:26 +0200 Subject: [datatable-help] fread() and UTF-8 support In-Reply-To: <5234C826.6090908@mdowle.plus.com> References: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com> <5234C826.6090908@mdowle.plus.com> Message-ID: I think it would be very useful to be able to specify the encoding, as you can with the function 'file' and read.table. Michele On 14/09/2013 22:33, "Matthew Dowle" wrote: > > Sorry again - nope hadn't given UTF-8 any thought. > > Matthew > > On 14/09/13 20:57, Harish wrote: > > Does fread() support UTF-8? I got a text file that is mostly Latin-1 > characters but encoded as UTF-8. When I load the data, the first column > name has a few extra characters in the beginning ("???id"), but I do not > get this when I convert the same file to ANSI format using Windows Notepad. > > I am guessing that UTF-8 encoding puts a few extra characters in the > beginning of the text file to indicate that it is an UTF-8 encoding, and > fread() is reading that literally as the first column name. > > Thanks for the clarification.
> > Regards, > Harish > > > _______________________________________________ > datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sun Sep 15 10:34:29 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sun, 15 Sep 2013 09:34:29 +0100 Subject: [datatable-help] fread() and UTF-8 support In-Reply-To: References: <1379188675.10910.YahooMailNeo@web120205.mail.ne1.yahoo.com> <5234C826.6090908@mdowle.plus.com> Message-ID: <52357115.7040404@mdowle.plus.com> Ok, can you file as a feature request please. Thanks. Matthew On 14/09/13 21:47, MICHELE DE MEO wrote: > > I think it could be very useful the possibility to specify the > encoding, as when you use the function 'file' with read.table . > > Michele > > Il giorno 14/set/2013 22:33, "Matthew Dowle" > ha scritto: > > > Sorry again - nope hadn't given UTF-8 any thought. > > Matthew > > On 14/09/13 20:57, Harish wrote: >> Does fread() support UTF-8? I got a text file that is mostly >> Latin-1 characters but encoded as UTF-8. When I load the data, >> the first column name has a few extra characters in the beginning >> ("???id"), but I do not get this when I convert the same file to >> ANSI format using Windows Notepad. >> >> I am guessing that UTF-8 encoding puts a few extra characters in >> the beginning of the text file to indicate that it is an UTF-8 >> encoding, and fread() is reading that literally as the first >> column name. >> >> Thanks for the clarification. 
>> >> Regards, >> Harish >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Sun Sep 15 23:42:16 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Sun, 15 Sep 2013 16:42:16 -0500 Subject: [datatable-help] fread'ing logicals In-Reply-To: <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com> References: <52335715.8040109@mdowle.plus.com> <52343211.7040109@mdowle.plus.com> <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com> Message-ID: +1 for T and F, but definitely not because it's that way in read.csv (which imo is not a good reason), but rather because those are commonly used substitutes for TRUE and FALSE. On Sep 14, 2013 5:29 AM, "Arunkumar Srinivasan" wrote: > Matthew, > > +1 for retaining T and F like read.csv. > +1 for the dropins() feature as well. > > Arun > > On Saturday, September 14, 2013 at 11:53 AM, Matthew Dowle wrote: > > On 14/09/13 06:48, Chinmay Patil wrote: > > I didn't mean changes in data.table's interface but the way data.table > works in itself compared to normal data frames. I know there are valid > reasons for structuring data.table's interface the way it is but not all > users get it immediately. > > > The bottom line in my mind is that even if base syntax was sped up > (assignment to an unnamed data.frame needn't copy the whole data.frame > for example), I would still move from > subset()/transform()/with()/DF[i,j]<-value syntax, to i,j and by inside > [...] with .SD,.I,.N and := in j. 
I can do things with that syntax > that I need to do which aren't always so easy with base syntax (like > adding columns by reference by group). > > And base R syntax is indeed being sped up by pqR, Renjin, Riposte, TERR, > CXXR, fastr which may feed into GNU R. Once that is mature and the dust > has settled, I would still move from data.frame to data.table on each of > them. Maybe we should market the things that data.table does that base > R doesn't. Rather than speed differences. > > > As for data.table, I am not complaining, just saying what other users > complaints I have heard of. > I personally love data.table and am willing to put the effort to learn > best ways to use it while most users aren't. > > > Great. data.table is for people like you. > > So we'll keep the default fread'ing of "T" and "F" as logicals then for > consistency with read.csv. > > And I still hope to produce a drop-in replacement for read.csv which > returns a data.frame but uses fread under the hood. That will speed up > existing code, but users can use the extra features of fread if they > want, too. > > Matthew > > > Chinmay > > On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou > wrote: > > Thanks for the quick response. > > As for the "learning curve" stuff -- no real comment there, but: > > For eg. I recently heard complains about data.table itself from due to > changes in interface > > Could you provide some concrete examples about which changes have > stumped users? Perhaps we can learn from these critiques. I had > thought we were pretty good about discussing any (breaking) changes on > list, but I'd be interested to see where this has failed so it might > perhaps be avoided in the future. > > and learning curve that data.table comes with... I hear > similar complaints about some packages like ggplot2, plyr.. > > Even though all these are great packages.. people don't like radical > changes > to interfaces as it makes refactoring older code even more painful. 
> > Still curious to hear what radical changes have come down the pipe. > > Thanks for taking the time to comment. > > Cheers, > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Mon Sep 16 01:35:27 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 16 Sep 2013 00:35:27 +0100 Subject: [datatable-help] fread'ing logicals In-Reply-To: References: <52335715.8040109@mdowle.plus.com> <52343211.7040109@mdowle.plus.com> <21C3EB1B544A44CBA3BF1EB156807AD2@gmail.com> Message-ID: <5236443F.3000003@mdowle.plus.com> Good. Now committed in v1.8.11 (rev 966). Also drop and select is done. o fread's drop, select and NULL in colClasses are implemented. To drop or select columns by name or by number. See examples in ?fread. o fread now detects T,F,True,False,TRUE and FALSE as type logical, consistent with read.csv. I pasted the new examples from ?fread to this answer as well: http://stackoverflow.com/a/18702011/403310 Hope this covers everything in this area, but please shout if anyone can think of anything further. 
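[A quick sketch of the new arguments announced above — the file and values are illustrative, see ?fread for the shipped examples:

```r
library(data.table)

# Illustrative CSV; any file with named columns works the same way.
f <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[1:3], c = 4:6), f, row.names = FALSE)

fread(f, select = c("a", "c"))  # keep only columns a and c
fread(f, drop = "b")            # equivalent: drop column b by name
fread(f, drop = 2L)             # or by column number

# T/F now detected as logical, consistent with read.csv:
g <- tempfile(fileext = ".csv")
writeLines(c("flag", "T", "F", "TRUE"), g)
sapply(fread(g), class)  # column flag comes back as logical, not character
```
]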
Matthew On 15/09/13 22:42, Eduard Antonyan wrote: > > +1 for T and F, but definitely not because it's that way in read.csv > (which imo is not a good reason), but rather because those are > commonly used substitutes for TRUE and FALSE. > > On Sep 14, 2013 5:29 AM, "Arunkumar Srinivasan" > wrote: > > Matthew, > > +1 for retaining T and F like read.csv. > +1 for the dropins() feature as well. > > Arun > > On Saturday, September 14, 2013 at 11:53 AM, Matthew Dowle wrote: > >> On 14/09/13 06:48, Chinmay Patil wrote: >>> I didn't mean changes in data.table's interface but the way >>> data.table works in itself compared to normal data frames. I >>> know there are valid reasons for structuring data.table's >>> interface the way it is but not all users get it immediately. >> >> The bottom line in my mind is that even if base syntax was sped up >> (assignment to an unnamed data.frame needn't copy the whole >> data.frame >> for example), I would still move from >> subset()/transform()/with()/DF[i,j]<-value syntax, to i,j and by >> inside >> [...] with .SD,.I,.N and := in j. I can do things with that syntax >> that I need to do which aren't always so easy with base syntax (like >> adding columns by reference by group). >> >> And base R syntax is indeed being sped up by pqR, Renjin, >> Riposte, TERR, >> CXXR, fastr which may feed into GNU R. Once that is mature and >> the dust >> has settled, I would still move from data.frame to data.table on >> each of >> them. Maybe we should market the things that data.table does that >> base >> R doesn't. Rather than speed differences. >> >>> >>> As for data.table, I am not complaining, just saying what other >>> users complaints I have heard of. >>> I personally love data.table and am willing to put the effort to >>> learn best ways to use it while most users aren't. >> >> Great. data.table is for people like you. >> >> So we'll keep the default fread'ing of "T" and "F" as logicals >> then for >> consistency with read.csv. 
>> >> And I still hope to produce a drop-in replacement for read.csv which >> returns a data.frame but uses fread under the hood. That will >> speed up >> existing code, but users can use the extra features of fread if they >> want, too. >> >> Matthew >> >>> >>> Chinmay >>> >>> On 14 Sep, 2013, at 1:29 PM, Steve Lianoglou >>> > wrote: >>> >>>> Thanks for the quick response. >>>> >>>> As for the "learning curve" stuff -- no real comment there, but: >>>> >>>>> For eg. I recently heard complains about data.table itself >>>>> from due to >>>>> changes in interface >>>> Could you provide some concrete examples about which changes have >>>> stumped users? Perhaps we can learn from these critiques. I had >>>> thought we were pretty good about discussing any (breaking) >>>> changes on >>>> list, but I'd be interested to see where this has failed so it >>>> might >>>> perhaps be avoided in the future. >>>> >>>>> and learning curve that data.table comes with... I hear >>>>> similar complaints about some packages like ggplot2, plyr.. >>>>> >>>>> Even though all these are great packages.. people don't like >>>>> radical changes >>>>> to interfaces as it makes refactoring older code even more >>>>> painful. >>>> Still curious to hear what radical changes have come down the pipe. >>>> >>>> Thanks for taking the time to comment. 
>>>> >>>> Cheers, >>>> -steve >>>> >>>> -- >>>> Steve Lianoglou >>>> Computational Biologist >>>> Bioinformatics and Computational Biology >>>> Genentech >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Tue Sep 17 20:13:49 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 17 Sep 2013 14:13:49 -0400 Subject: [datatable-help] Error in coercing matrices within j expressions Message-ID: I'm currently using a (moderately) complex function, call it f(), as a j expression to analyze my data. The data itself is about 1.2M rows, which I analyze by group. A group may have as few as one row or as many as 10K. The output from the function is a two-column data.table where the rows are interesting (for my work) pairs of observations--I have no idea how many pairs will be interesting until the function runs, but in abstract it could be every unique combination (so as many as 50M rows of output for one call to f()). It is common, and not an error, for groups to have no meaningful pairs to return.
I've been using the following line to create the output for f(): indices <- data.table(i = integer(), j = integer()) I then append to 'indices' any useful pairs using: indices <- rbind(indices, list(idx[i], idx[j])) This works, but is very, very slow, in part because I'm using rbind(). I want to switch to using the built-in matrix, because rbind() should be much faster for them. Using the following line to create the matrix: indices <- matrix(nrow = 0, ncol = 2, dimnames = list(c(NULL),c("i","j"))) results in the following error: Logical error. Type of column should have been checked by now Note that the values returned are always integers. Results are coerced via: data.table(indices) before returning from f(). If I don't explicitly coerce, I get the following error: j doesn't evaluate to the same number of columns for each group If someone could tell me what I'm doing wrong, or some other equivalent way to noticeably speed up the whole process, I'd be very grateful. ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Tue Sep 17 22:22:03 2013 From: FErickson at psu.edu (Frank Erickson) Date: Tue, 17 Sep 2013 16:22:03 -0400 Subject: [datatable-help] Error in coercing matrices within j expressions In-Reply-To: References: Message-ID: Hi, I guess you could put them into a list and then rbind at the end: indi <- list() k=1 indi[[k]] <- list(i=2L,j=6L); k <- k+1 indi[[k]] <- list(4L,5L); k <- k+1 rbindlist(indi) # i j # 1: 2 6 # 2: 4 5 For some reason, I couldn't get rbindlist to work unless the first item in indi had explicit names ("i" and "j"), but names aren't needed for later items. This should be better than dynamically growing with rbind each time, but there may be a faster way. If your criteria for selecting (i,j) can be written down, there's likely a much faster way than looping like this. 
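[To illustrate that last point: if the test for an "interesting" pair can be written as a vectorised function of the two values, the per-pair loop can be replaced by evaluating the criterion on all pairs at once. A sketch with a made-up criterion (absolute difference greater than 5); the real criterion is whatever f() tests:

```r
library(data.table)

x <- c(5, 1, 9, 3)  # toy group values

# Evaluate the criterion on the full pair matrix, keeping i < j only.
ok  <- outer(x, x, function(a, b) abs(a - b) > 5) & upper.tri(diag(length(x)))
hit <- which(ok, arr.ind = TRUE)    # matrix with columns "row" and "col"
indices <- data.table(i = hit[, "row"], j = hit[, "col"])
indices
#    i j
# 1: 2 3
# 2: 3 4
```

The trade-off is memory: for a 10K-row group the pair matrix is 10K x 10K logicals (~100 MB), so this may need to be batched for the largest groups.]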
Best, --Frank On Tue, Sep 17, 2013 at 2:13 PM, Nathaniel Graham wrote: > I'm currently using a (moderately) complex function, call > if f(), as a j expression to analyze my data. The data itself > is about 1.2M rows, which I analyze by group. > A group may have as few as one row or as many as 10K. > The output from the function is a two-column data.table > where the rows are interesting (for my work) pairs of > observations--I have no idea how many pairs will be > interesting until the function runs, but in abstract it could > be every unique combination (so as many as 50M rows > of output for one call to f()). It is common, and not an > error, for groups to have no meaningful pairs to return. > > I've been using the following line to create the output for > f(): > > indices <- data.table(i = integer(), j = integer()) > > I then append to 'indices' any useful pairs using: > > indices <- rbind(indices, list(idx[i], idx[j])) > > This works, but is very, very slow, in part because I'm > using rbind(). I want to switch to using the built-in matrix, > because rbind() should be much faster for them. Using > the following line to create the matrix: > > indices <- matrix(nrow = 0, ncol = 2, dimnames = list(c(NULL),c("i","j"))) > > results in the following error: > > Logical error. Type of column should have been checked by now > > Note that the values returned are always integers. Results are > coerced via: > > data.table(indices) > > before returning from f(). If I don't explicitly coerce, I get the > following error: > > j doesn't evaluate to the same number of columns for each group > > If someone could tell me what I'm doing wrong, or some other > equivalent way to noticeably speed up the whole process, I'd > be very grateful. 
> > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Tue Sep 17 23:22:50 2013 From: FErickson at psu.edu (Frank Erickson) Date: Tue, 17 Sep 2013 17:22:50 -0400 Subject: [datatable-help] Error in coercing matrices within j expressions In-Reply-To: References: Message-ID: Well, rbindlist(list()) says "Null data.table" (though it doesn't pass the is.null() test). Maybe someone else has an idea how to deal with the no-results case. By the way, it's best to use "reply to all" to make sure you reply to the mailing list, too; they should be able to see your message quoted below, though. --Frank On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham wrote: > Frank, > > Thanks. This seems to have done the trick, so long as I'm careful to > check for > zero-length lists and return data.table(i = integer(), j = integer()) in > those > cases. Essentially, I have to test every combination of i and j to see if > it's > "interesting" or not, and some groups have a lot of rows. At the moment > I'm > attacking some other low hanging fruit, like speeding up the comparisons > I have to do. > > As a side note, it would be kind of nice if there was a simple way to clue > data.table to the fact that there are no rows to return, like returning > NULL > or NA or similar. 
> > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > On Tue, Sep 17, 2013 at 4:22 PM, Frank Erickson wrote: > >> Hi, >> >> I guess you could put them into a list and then rbind at the end: >> >> indi <- list() >> k=1 >> indi[[k]] <- list(i=2L,j=6L); k <- k+1 >> indi[[k]] <- list(4L,5L); k <- k+1 >> rbindlist(indi) >> # i j >> # 1: 2 6 >> # 2: 4 5 >> >> For some reason, I couldn't get rbindlist to work unless the first item >> in indi had explicit names ("i" and "j"), but names aren't needed for later >> items. >> >> This should be better than dynamically growing with rbind each time, but >> there may be a faster way. If your criteria for selecting (i,j) can be >> written down, there's likely a much faster way than looping like this. >> >> Best, >> >> --Frank >> >> >> >> On Tue, Sep 17, 2013 at 2:13 PM, Nathaniel Graham wrote: >> >>> I'm currently using a (moderately) complex function, call >>> if f(), as a j expression to analyze my data. The data itself >>> is about 1.2M rows, which I analyze by group. >>> A group may have as few as one row or as many as 10K. >>> The output from the function is a two-column data.table >>> where the rows are interesting (for my work) pairs of >>> observations--I have no idea how many pairs will be >>> interesting until the function runs, but in abstract it could >>> be every unique combination (so as many as 50M rows >>> of output for one call to f()). It is common, and not an >>> error, for groups to have no meaningful pairs to return. >>> >>> I've been using the following line to create the output for >>> f(): >>> >>> indices <- data.table(i = integer(), j = integer()) >>> >>> I then append to 'indices' any useful pairs using: >>> >>> indices <- rbind(indices, list(idx[i], idx[j])) >>> >>> This works, but is very, very slow, in part because I'm >>> using rbind(). I want to switch to using the built-in matrix, >>> because rbind() should be much faster for them. 
Using >>> the following line to create the matrix: >>> >>> indices <- matrix(nrow = 0, ncol = 2, dimnames = >>> list(c(NULL),c("i","j"))) >>> >>> results in the following error: >>> >>> Logical error. Type of column should have been checked by now >>> >>> Note that the values returned are always integers. Results are >>> coerced via: >>> >>> data.table(indices) >>> >>> before returning from f(). If I don't explicitly coerce, I get the >>> following error: >>> >>> j doesn't evaluate to the same number of columns for each group >>> >>> If someone could tell me what I'm doing wrong, or some other >>> equivalent way to noticeably speed up the whole process, I'd >>> be very grateful. >>> >>> >>> ------- >>> Nathaniel Graham >>> npgraham1 at gmail.com >>> npgraham1 at uky.edu >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From npgraham1 at gmail.com Tue Sep 17 23:42:31 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 17 Sep 2013 17:42:31 -0400 Subject: [datatable-help] Error in coercing matrices within j expressions In-Reply-To: References: Message-ID: Oops; I meant to reply to all, and then forgot after I discarded and rewrote my message a few times. I suspect (although I'm not absolutely certain) that if NULL or similar did the same thing as returning a 0-row data.table with the appropriate number of columns, some operations could be sped up a bit. In those cases, the data.table code wouldn't need to check the number and type of the columns returned. I suspect that unless someone knows a secret, ultrafast way to iterate through a list of all combinations of a set of items and return the subset of those that match some criteria, that I'm as close to optimal as I'm likely to get right now. 
------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Tue, Sep 17, 2013 at 5:22 PM, Frank Erickson wrote: > Well, rbindlist(list()) says "Null data.table" (though it doesn't pass the > is.null() test). Maybe someone else has an idea how to deal with the > no-results case. By the way, it's best to use "reply to all" to make sure > you reply to the mailing list, too; they should be able to see your message > quoted below, though. > > --Frank > > > On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham wrote: > >> Frank, >> >> Thanks. This seems to have done the trick, so long as I'm careful to >> check for >> zero-length lists and return data.table(i = integer(), j = integer()) in >> those >> cases. Essentially, I have to test every combination of i and j to see >> if it's >> "interesting" or not, and some groups have a lot of rows. At the moment >> I'm >> attacking some other low hanging fruit, like speeding up the comparisons >> I have to do. >> >> As a side note, it would be kind of nice if there was a simple way to clue >> data.table to the fact that there are no rows to return, like returning >> NULL >> or NA or similar. >> >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com >> npgraham1 at uky.edu >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Tue Sep 17 23:52:54 2013 From: FErickson at psu.edu (Frank Erickson) Date: Tue, 17 Sep 2013 17:52:54 -0400 Subject: [datatable-help] Error in coercing matrices within j expressions In-Reply-To: References: Message-ID: Maybe not ultrafast, but with nice syntax: CJ(i=iset,j=jset)[criterion(i,j)] I guess it should be parallelizable, but that wouldn't be with data.table, if I understand this correctly: http://stackoverflow.com/questions/14759905/data-table-and-parallel-computing On Tue, Sep 17, 2013 at 5:42 PM, Nathaniel Graham wrote: > Oops; I meant to reply to all, and then forgot after I discarded and > rewrote my > message a few times. 
I suspect (although I'm not absolutely certain) that > if > NULL or similar did the same thing as returning a 0-row data.table with the > appropriate number of columns, some operations could be sped up a bit. > In those cases, the data.table code wouldn't need to check the number and > type of the columns returned. > > I suspect that unless someone knows a secret, ultrafast way to iterate > through > a list of all combinations of a set of items and return the subset of > those that > match some criteria, that I'm as close to optimal as I'm likely to get > right now. > > > ------- > Nathaniel Graham > npgraham1 at gmail.com > npgraham1 at uky.edu > > > On Tue, Sep 17, 2013 at 5:22 PM, Frank Erickson wrote: > >> Well, rbindlist(list()) says "Null data.table" (though it doesn't pass >> the is.null() test). Maybe someone else has an idea how to deal with the >> no-results case. By the way, it's best to use "reply to all" to make sure >> you reply to the mailing list, too; they should be able to see your message >> quoted below, though. >> >> --Frank >> >> >> On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham wrote: >> >>> Frank, >>> >>> Thanks. This seems to have done the trick, so long as I'm careful to >>> check for >>> zero-length lists and return data.table(i = integer(), j = integer()) in >>> those >>> cases. Essentially, I have to test every combination of i and j to see >>> if it's >>> "interesting" or not, and some groups have a lot of rows. At the moment >>> I'm >>> attacking some other low hanging fruit, like speeding up the comparisons >>> I have to do. >>> >>> As a side note, it would be kind of nice if there was a simple way to >>> clue >>> data.table to the fact that there are no rows to return, like returning >>> NULL >>> or NA or similar. >>> >>> ------- >>> Nathaniel Graham >>> npgraham1 at gmail.com >>> npgraham1 at uky.edu >>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From npgraham1 at gmail.com Wed Sep 18 00:14:36 2013 From: npgraham1 at gmail.com (Nathaniel Graham) Date: Tue, 17 Sep 2013 18:14:36 -0400 Subject: [datatable-help] Error in coercing matrices within j expressions In-Reply-To: References: Message-ID: It hadn't occurred to me to use CJ(), so I'll tinker with that this evening and see if there are any gains to be made there. In theory it's highly parallelizable, and one of the posts Matthew points to in his comments (in the post you reference) shows a way that it can be done (using the old multicore library, so I'm not exactly sure how it maps to the parallel library). In my case though, the whole process appears to be memory bound rather than CPU bound. Since my machine is fairly optimal (i7-4770 with 4x8GB DDR3-1600), I just don't think it's going to get dramatically faster. That doesn't mean I won't try... ------- Nathaniel Graham npgraham1 at gmail.com npgraham1 at uky.edu On Tue, Sep 17, 2013 at 5:52 PM, Frank Erickson wrote: > Maybe not ultrafast, but with nice syntax: > > CJ(i=iset,j=jset)[criterion(i,j)] > > I guess it should be parallelizable, but that wouldn't be with data.table, > if I understand this correctly: > http://stackoverflow.com/questions/14759905/data-table-and-parallel-computing > > > On Tue, Sep 17, 2013 at 5:42 PM, Nathaniel Graham wrote: > >> Oops; I meant to reply to all, and then forgot after I discarded and >> rewrote my >> message a few times. I suspect (although I'm not absolutely certain) >> that if >> NULL or similar did the same thing as returning a 0-row data.table with >> the >> appropriate number of columns, some operations could be sped up a bit. >> In those cases, the data.table code wouldn't need to check the number and >> type of the columns returned. 
>> >> I suspect that unless someone knows a secret, ultrafast way to iterate >> through >> a list of all combinations of a set of items and return the subset of >> those that >> match some criteria, that I'm as close to optimal as I'm likely to get >> right now. >> >> >> ------- >> Nathaniel Graham >> npgraham1 at gmail.com >> npgraham1 at uky.edu >> >> >> On Tue, Sep 17, 2013 at 5:22 PM, Frank Erickson wrote: >> >>> Well, rbindlist(list()) says "Null data.table" (though it doesn't pass >>> the is.null() test). Maybe someone else has an idea how to deal with the >>> no-results case. By the way, it's best to use "reply to all" to make sure >>> you reply to the mailing list, too; they should be able to see your message >>> quoted below, though. >>> >>> --Frank >>> >>> >>> On Tue, Sep 17, 2013 at 5:03 PM, Nathaniel Graham wrote: >>> >>>> Frank, >>>> >>>> Thanks. This seems to have done the trick, so long as I'm careful to >>>> check for >>>> zero-length lists and return data.table(i = integer(), j = integer()) >>>> in those >>>> cases. Essentially, I have to test every combination of i and j to see >>>> if it's >>>> "interesting" or not, and some groups have a lot of rows. At the >>>> moment I'm >>>> attacking some other low hanging fruit, like speeding up the comparisons >>>> I have to do. >>>> >>>> As a side note, it would be kind of nice if there was a simple way to >>>> clue >>>> data.table to the fact that there are no rows to return, like returning >>>> NULL >>>> or NA or similar. >>>> >>>> ------- >>>> Nathaniel Graham >>>> npgraham1 at gmail.com >>>> npgraham1 at uky.edu >>>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Fri Sep 20 15:48:11 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 20 Sep 2013 09:48:11 -0400 Subject: [datatable-help] mapply cannot modify in place when iterating over list of DTs Message-ID: I've encountered the following issue iterating over a list of data.tables. The issue is only with mapply, not with lapply . Given a list of data.table's, mapply'ing over the list directly cannot modify in place. Also if attempting to add a new column, we get an "Invalid .internal.selfref" warning. Modifying an existing column does not issue a warning, but still fails to modify-in-place WORKAROUND: ---------- The workaround is to iterate over an index to the list, then to modify each data.table via list.of.DTs[[i]][ .. ] **Interestingly, this issue occurs with `mapply`, but not `lapply`.** EXAMPLE: -------- # Given a list of DT's and two lists of vectors, # we want to add the corresponding vectors as columns to the DT. ## ---------------- ## ## SAMPLE DATA: ## ## ---------------- ## # list of data.tables list.DT <- list( DT1=data.table(Col1=111:115, Col2=121:125), DT2=data.table(Col1=211:215, Col2=221:225) ) # lists of columns to add list.Col3 <- list(131:135, 231:235) list.Col4 <- list(141:145, 241:245) ## ------------------------------------ ## ## Iterating over the list elements ## ## adding a new column ## ## ------------------------------------ ## ## Will issue warning and ## ## will fail to modify in place ## ## ------------------------------------ ## mapply ( function(DT, C3, C4) DT[, c("Col3", "Col4") := list(C3, C4)], list.DT, # iterating over the list list.Col3, list.Col4, SIMPLIFY=FALSE ) ## Note the lack of change list.DT ## ------------------------------------ ## ## Iterating over an index ## ## ------------------------------------ ## mapply ( function(i, C3, C4) list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)], seq(list.DT), # iterating over an index to the list list.Col3, list.Col4, 
SIMPLIFY=FALSE ) ## Note each DT _has_ been modified list.DT ## ------------------------------------ ## ## Iterating over the list elements ## ## modifying existing column ## ## ------------------------------------ ## ## No warning issued, but ## ## Will fail to modify in place ## ## ------------------------------------ ## mapply ( function(DT, C3, C4) DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)], list.DT, # iterating over the list list.Col3, list.Col4, SIMPLIFY=FALSE ) ## Note the lack of change (compare with output from `mapply`) list.DT ## ------------------------------------ ## ## ## ## `lapply` works as expected. ## ## ## ## ------------------------------------ ## ## NOW WITH lapply lapply(list.DT, function(DT) DT[, newCol := LETTERS[1:5]] ) ## Note the new column: list.DT # ========================== # ## NON-WORKAROUNDS ## ## ## I also tried all of the following alternatives ## in hopes of being able to iterate over the list ## directly, using `mapply`. ## None of these worked. # (1) Creating the DTs First, then creating the list from them DT1 <- data.table(Col1=111:115, Col2=121:125) DT2 <- data.table(Col1=211:215, Col2=221:225) list.DT <- list(DT1=DT1,DT2=DT2 ) # (2) Same as 1, and using `copy()` in the call to `list()` list.DT <- list(DT1=copy(DT1), DT2=copy(DT2) ) # (3) lapply'ing `copy` and then iterating over that list list.DT <- lapply(list.DT, copy) # (4) Not naming the list elements list.DT <- list(DT1, DT2) # and tried list.DT <- list(copy(DT1), copy(DT2)) ## All of the above still failed to modify in place ## (and also issued the same warning if trying to add a column) ## when iterating using mapply mapply(function(DT, C3, C4) DT[, c("Col3", "Col4") := list(C3, C4)], list.DT, list.Col3, list.Col4, SIMPLIFY=FALSE) # ========================== # Ricardo Saporta Rutgers University, New Jersey e: saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Fri Sep 20 18:49:29 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 20 Sep 2013 17:49:29 +0100 Subject: [datatable-help] mapply cannot modify in place when iterating over list of DTs In-Reply-To: References: Message-ID: <523C7C99.40308@mdowle.plus.com> Hi, What's the warning? Matthew On 20/09/13 14:48, Ricardo Saporta wrote: > I've encountered the following issue iterating over a list of > data.tables. > The issue is only with mapply, not with lapply . > > Given a list of data.table's, mapply'ing over the list directly > cannot modify in place. > > Also if attempting to add a new column, we get an "Invalid > .internal.selfref" warning. > Modifying an existing column does not issue a warning, but still fails > to modify-in-place > > WORKAROUND: > ---------- > The workaround is to iterate over an index to the list, then to > modify each data.table via list.of.DTs[[i]][ .. ] > > **Interestingly, this issue occurs with `mapply`, but not `lapply`.** > > EXAMPLE: > -------- > # Given a list of DT's and two lists of vectors, > # we want to add the corresponding vectors as columns to the DT. 
> > ## ---------------- ## > ## SAMPLE DATA: ## > ## ---------------- ## > # list of data.tables > list.DT <- list( > DT1=data.table(Col1=111:115, Col2=121:125), > DT2=data.table(Col1=211:215, Col2=221:225) > ) > > # lists of columns to add > list.Col3 <- list(131:135, 231:235) > list.Col4 <- list(141:145, 241:245) > > > ## ------------------------------------ ## > ## Iterating over the list elements ## > ## adding a new column ## > ## ------------------------------------ ## > ## Will issue warning and ## > ## will fail to modify in place ## > ## ------------------------------------ ## > mapply ( > function(DT, C3, C4) > DT[, c("Col3", "Col4") := list(C3, C4)], > list.DT, # iterating over the list > list.Col3, list.Col4, > SIMPLIFY=FALSE > ) > > ## Note the lack of change > list.DT > > > ## ------------------------------------ ## > ## Iterating over an index ## > ## ------------------------------------ ## > mapply ( > function(i, C3, C4) > list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)], > seq(list.DT), # iterating over an index to the list > list.Col3, list.Col4, > SIMPLIFY=FALSE > ) > > ## Note each DT _has_ been modified > list.DT > > ## ------------------------------------ ## > ## Iterating over the list elements ## > ## modifying existing column ## > ## ------------------------------------ ## > ## No warning issued, but ## > ## Will fail to modify in place ## > ## ------------------------------------ ## > mapply ( > function(DT, C3, C4) > DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)], > > list.DT, # iterating over the list > list.Col3, list.Col4, > SIMPLIFY=FALSE > ) > > ## Note the lack of change (compare with output from `mapply`) > list.DT > > ## ------------------------------------ ## > ## ## > ## `lapply` works as expected. 
## > ## ## > ## ------------------------------------ ## > ## NOW WITH lapply > lapply(list.DT, > function(DT) > DT[, newCol := LETTERS[1:5]] > ) > > ## Note the new column: > list.DT > > > > # ========================== # > > ## NON-WORKAROUNDS ## > ## > ## I also tried all of the following alternatives > ## in hopes of being able to iterate over the list > ## directly, using `mapply`. > ## None of these worked. > > # (1) Creating the DTs First, then creating the list from them > DT1 <- data.table(Col1=111:115, Col2=121:125) > DT2 <- data.table(Col1=211:215, Col2=221:225) > > list.DT <- list(DT1=DT1,DT2=DT2 ) > > > # (2) Same as 1, and using `copy()` in the call to `list()` > list.DT <- list(DT1=copy(DT1), > DT2=copy(DT2) ) > > # (3) lapply'ing `copy` and then iterating over that list > list.DT <- lapply(list.DT, copy) > > # (4) Not naming the list elements > list.DT <- list(DT1, DT2) > # and tried > list.DT <- list(copy(DT1), copy(DT2)) > > ## All of the above still failed to modify in place > ## (and also issued the same warning if trying to add a column) > ## when iterating using mapply > > mapply(function(DT, C3, C4) > DT[, c("Col3", "Col4") := list(C3, C4)], > list.DT, list.Col3, list.Col4, > SIMPLIFY=FALSE) > > > # ========================== # > > > Ricardo Saporta > Rutgers University, New Jersey > e: saporta at rutgers.edu > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From spbiggs at hotmail.com Fri Sep 20 19:16:50 2013 From: spbiggs at hotmail.com (Simon Biggs) Date: Fri, 20 Sep 2013 14:16:50 -0300 Subject: [datatable-help] fread (boolean?) problem in 1.8.11 rev 971 Message-ID: Hi I'm finding data.table excellent for processing of significant numbers of large files. 
Trying out the latest build I'm seeing some problems (perhaps around the new boolean support?) Loading my file with a basic call to fread() results in the error: Error in fread("file.txt") : Expected sep (' ') but 'T' ends field 25 on line 2 when detecting types: 8878 1 1 24 4 AFY057250G12 NA 2012-07-01 2013-06-30 1e+05 49999100.7232666 1.58e+08 0 33176.21 978 NA 1 EUR 0 EUR 0 EUR TERRINC HXG T Happy to provide the full file if you let me know how. First couple of rows reproduced below (file is tab delimited): Key Id AKey PKey Peril PNum LName EDt ExDt PAPt PL PPof PD Prem CurrencyKey RK IsV CurrencyCd minDedAmt minDedCur maxDedAmt maxDedCur userIdTxt1 userIdTxt2 userIdTxt3 userIdTxt4 PStat 8878 1 1 24 4 AFY05 NA 2012-07-01 2013-06-30 1e+05 49999100.7232666 1.58e+08 0 33176.21 978 NA 1 EUR 0 EUR 0 EUR TERRINC HXG T TO NA 8878 1 3 93 4 AFJ02 NA 2012-03-31 2014-03-30 150000 3.5e+07 1.4e+08 0 3688.44 840 NA 1 USD 0 USD 0 USD TERRINC JGP T TO NA 8878 1 6 95 4 AFY08 NA 2012-04-08 2013-04-07 1e+05 29999983.4435654 336907000 0 0 826 NA 1 GBP 0 GBP 0 GBP TERRINC SPT T TU NA 8878 1 7 17 4 AFR1 NA 2012-07-12 2013-06-30 7500000 5e+07 5e+08 0 10319.34 840 NA 1 USD 0 USD 0 USD TERREXC JGP T TO NA Many thanks for your efforts Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Fri Sep 20 20:01:16 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 20 Sep 2013 14:01:16 -0400 Subject: [datatable-help] mapply cannot modify in place when iterating over list of DTs In-Reply-To: <523C7C99.40308@mdowle.plus.com> References: <523C7C99.40308@mdowle.plus.com> Message-ID: One warning per DT in the list (I added the line breaks) -Rick ============================================= Warning messages: 1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. 
At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0 list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>=v3.1.0 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed. 2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<v3.1.0 list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>=v3.1.0 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed. ============================================= On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle wrote: > > Hi, > > What's the warning? > > Matthew > > > > On 20/09/13 14:48, Ricardo Saporta wrote: > > I've encountered the following issue iterating over a list of > data.tables. > The issue is only with mapply, not with lapply . > > > Given a list of data.table's, mapply'ing over the list directly > cannot modify in place. > > Also if attempting to add a new column, we get an "Invalid > .internal.selfref" warning. > Modifying an existing column does not issue a warning, but still fails to > modify-in-place > > WORKAROUND: > ---------- > The workaround is to iterate over an index to the list, then to > modify each data.table via list.of.DTs[[i]][ ..
] > > **Interestingly, this issue occurs with `mapply`, but not `lapply`.** > > > EXAMPLE: > -------- > # Given a list of DT's and two lists of vectors, > # we want to add the corresponding vectors as columns to the DT. > > ## ---------------- ## > ## SAMPLE DATA: ## > ## ---------------- ## > # list of data.tables > list.DT <- list( > DT1=data.table(Col1=111:115, Col2=121:125), > DT2=data.table(Col1=211:215, Col2=221:225) > ) > > # lists of columns to add > list.Col3 <- list(131:135, 231:235) > list.Col4 <- list(141:145, 241:245) > > > ## ------------------------------------ ## > ## Iterating over the list elements ## > ## adding a new column ## > ## ------------------------------------ ## > ## Will issue warning and ## > ## will fail to modify in place ## > ## ------------------------------------ ## > mapply ( > function(DT, C3, C4) > DT[, c("Col3", "Col4") := list(C3, C4)], > > list.DT, # iterating over the list > list.Col3, list.Col4, > SIMPLIFY=FALSE > ) > > ## Note the lack of change > list.DT > > > ## ------------------------------------ ## > ## Iterating over an index ## > ## ------------------------------------ ## > mapply ( > function(i, C3, C4) > list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)], > > seq(list.DT), # iterating over an index to the list > list.Col3, list.Col4, > SIMPLIFY=FALSE > ) > > ## Note each DT _has_ been modified > list.DT > > ## ------------------------------------ ## > ## Iterating over the list elements ## > ## modifying existing column ## > ## ------------------------------------ ## > ## No warning issued, but ## > ## Will fail to modify in place ## > ## ------------------------------------ ## > mapply ( > function(DT, C3, C4) > DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)], > > list.DT, # iterating over the list > list.Col3, list.Col4, > SIMPLIFY=FALSE > ) > > ## Note the lack of change (compare with output from `mapply`) > list.DT > > ## ------------------------------------ ## > ## ## > ## `lapply` works as expected. 
## > ## ## > ## ------------------------------------ ## > > ## NOW WITH lapply > lapply(list.DT, > function(DT) > DT[, newCol := LETTERS[1:5]] > ) > > ## Note the new column: > list.DT > > > > # ========================== # > > ## NON-WORKAROUNDS ## > ## > ## I also tried all of the following alternatives > ## in hopes of being able to iterate over the list > ## directly, using `mapply`. > ## None of these worked. > > # (1) Creating the DTs First, then creating the list from them > DT1 <- data.table(Col1=111:115, Col2=121:125) > DT2 <- data.table(Col1=211:215, Col2=221:225) > > list.DT <- list(DT1=DT1,DT2=DT2 ) > > > # (2) Same as 1, and using `copy()` in the call to `list()` > list.DT <- list(DT1=copy(DT1), > DT2=copy(DT2) ) > > # (3) lapply'ing `copy` and then iterating over that list > list.DT <- lapply(list.DT, copy) > > # (4) Not naming the list elements > list.DT <- list(DT1, DT2) > # and tried > list.DT <- list(copy(DT1), copy(DT2)) > > ## All of the above still failed to modify in place > ## (and also issued the same warning if trying to add a column) > ## when iterating using mapply > > mapply(function(DT, C3, C4) > DT[, c("Col3", "Col4") := list(C3, C4)], > list.DT, list.Col3, list.Col4, > SIMPLIFY=FALSE) > > > # ========================== # > > > Ricardo Saporta > Rutgers University, New Jersey > e: saporta at rutgers.edu > > > > _______________________________________________ > datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdowle at mdowle.plus.com Fri Sep 20 20:18:44 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 20 Sep 2013 19:18:44 +0100 Subject: [datatable-help] mapply cannot modify in place when iterating over list of DTs In-Reply-To: References: <523C7C99.40308@mdowle.plus.com> Message-ID: <523C9184.2010902@mdowle.plus.com> Does this sentence from the warning help? " Also, in R<v3.1.0 list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>=v3.1.0 if that is biting. " Matthew On 20/09/13 19:01, Ricardo Saporta wrote: > One warning per DT in the list > (I added the line breaks) > -Rick > ============================================= > Warning messages: > > 1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : > > Invalid .internal.selfref detected and fixed by taking a copy of the > whole table so that := can add this new column by reference. At an > earlier point, this data.table has been copied by R (or been created > manually using structure() or similar). Avoid key<-, names<- and > attr<- which in R currently (and oddly) may copy the whole data.table. > Use set* syntax instead to avoid copying: ?set, ?setnames and > ?setattr. Also, in R<v3.1.0 list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to > R>=v3.1.0 if that is biting. If this message doesn't help, please > report to datatable-help so the root cause can be fixed. > > 2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : > > Invalid .internal.selfref detected and fixed by taking a copy of the > whole table so that := can add this new column by reference. At an > earlier point, this data.table has been copied by R (or been created > manually using structure() or similar). Avoid key<-, names<- and > attr<- which in R currently (and oddly) may copy the whole data.table. > Use set* syntax instead to avoid copying: ?set, ?setnames and > ?setattr. Also, in R<v3.1.0 list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to > R>=v3.1.0 if that is biting.
If this message doesn't help, please > report to datatable-help so the root cause can be fixed. > ============================================= > > > > > On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle > > wrote: > > > Hi, > > What's the warning? > > Matthew > > > > On 20/09/13 14:48, Ricardo Saporta wrote: >> I've encountered the following issue iterating over a list of >> data.tables. >> The issue is only with mapply, not with lapply . >> >> Given a list of data.table's, mapply'ing over the list directly >> cannot modify in place. >> >> Also if attempting to add a new column, we get an "Invalid >> .internal.selfref" warning. >> Modifying an existing column does not issue a warning, but still >> fails to modify-in-place >> >> WORKAROUND: >> ---------- >> The workaround is to iterate over an index to the list, then to >> modify each data.table via list.of.DTs[[i]][ .. ] >> >> **Interestingly, this issue occurs with `mapply`, but not `lapply`.** >> >> EXAMPLE: >> -------- >> # Given a list of DT's and two lists of vectors, >> # we want to add the corresponding vectors as columns to the DT. 
>> >> ## ---------------- ## >> ## SAMPLE DATA: ## >> ## ---------------- ## >> # list of data.tables >> list.DT <- list( >> DT1=data.table(Col1=111:115, Col2=121:125), >> DT2=data.table(Col1=211:215, Col2=221:225) >> ) >> >> # lists of columns to add >> list.Col3 <- list(131:135, 231:235) >> list.Col4 <- list(141:145, 241:245) >> >> >> ## ------------------------------------ ## >> ## Iterating over the list elements ## >> ## adding a new column ## >> ## ------------------------------------ ## >> ## Will issue warning and ## >> ## will fail to modify in place ## >> ## ------------------------------------ ## >> mapply ( >> function(DT, C3, C4) >> DT[, c("Col3", "Col4") := list(C3, C4)], >> list.DT, # iterating over the list >> list.Col3, list.Col4, >> SIMPLIFY=FALSE >> ) >> >> ## Note the lack of change >> list.DT >> >> >> ## ------------------------------------ ## >> ## Iterating over an index ## >> ## ------------------------------------ ## >> mapply ( >> function(i, C3, C4) >> list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)], >> seq(list.DT), # iterating over an index to the list >> list.Col3, list.Col4, >> SIMPLIFY=FALSE >> ) >> >> ## Note each DT _has_ been modified >> list.DT >> >> ## ------------------------------------ ## >> ## Iterating over the list elements ## >> ## modifying existing column ## >> ## ------------------------------------ ## >> ## No warning issued, but ## >> ## Will fail to modify in place ## >> ## ------------------------------------ ## >> mapply ( >> function(DT, C3, C4) >> DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)], >> >> list.DT, # iterating over the list >> list.Col3, list.Col4, >> SIMPLIFY=FALSE >> ) >> >> ## Note the lack of change (compare with output from `mapply`) >> list.DT >> >> ## ------------------------------------ ## >> ## ## >> ## `lapply` works as expected. 
## >> ## ## >> ## ------------------------------------ ## >> ## NOW WITH lapply >> lapply(list.DT, >> function(DT) >> DT[, newCol := LETTERS[1:5]] >> ) >> >> ## Note the new column: >> list.DT >> >> >> >> # ========================== # >> >> ## NON-WORKAROUNDS ## >> ## >> ## I also tried all of the following alternatives >> ## in hopes of being able to iterate over the list >> ## directly, using `mapply`. >> ## None of these worked. >> >> # (1) Creating the DTs First, then creating the list from them >> DT1 <- data.table(Col1=111:115, Col2=121:125) >> DT2 <- data.table(Col1=211:215, Col2=221:225) >> >> list.DT <- list(DT1=DT1,DT2=DT2 ) >> >> >> # (2) Same as 1, and using `copy()` in the call to `list()` >> list.DT <- list(DT1=copy(DT1), >> DT2=copy(DT2) ) >> >> # (3) lapply'ing `copy` and then iterating over that list >> list.DT <- lapply(list.DT, copy) >> >> # (4) Not naming the list elements >> list.DT <- list(DT1, DT2) >> # and tried >> list.DT <- list(copy(DT1), copy(DT2)) >> >> ## All of the above still failed to modify in place >> ## (and also issued the same warning if trying to add a column) >> ## when iterating using mapply >> >> mapply(function(DT, C3, C4) >> DT[, c("Col3", "Col4") := list(C3, C4)], >> list.DT, list.Col3, list.Col4, >> SIMPLIFY=FALSE) >> >> >> # ========================== # >> >> >> Ricardo Saporta >> Rutgers University, New Jersey >> e: saporta at rutgers.edu >> >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Sep 20 20:40:47 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 20 Sep 2013 19:40:47 +0100 Subject: [datatable-help] fread (boolean?) 
problem in 1.8.11 rev 971 In-Reply-To: References: Message-ID: <523C96AF.3080201@mdowle.plus.com> Hi, Many thanks. Now fixed - commit 973. Matthew On 20/09/13 18:16, Simon Biggs wrote: > Hi > > I'm finding data.table excellent for processing of significant numbers > of large files. Trying out the latest build I'm seeing some problems > (perhaps around the new boolean support?) > > Loading my file with a basic call to fread() results in the error: > > Error in fread("file.txt") : > Expected sep (' ') but 'T' ends field 25 on line 2 when detecting types: 8878 1 1 24 4 AFY057250G12 NA 2012-07-01 2013-06-30 1e+05 49999100.7232666 1.58e+08 0 33176.21 978 NA 1 EUR 0 EUR 0 EUR TERRINC HXG T > > Happy to provide the full file if you let me know how. First couple > of rows reproduced below (file is tab delimited): > > Key Id AKey PKey Peril PNum LName EDt ExDt PAPt PL PPof PD Prem CurrencyKey RK IsV CurrencyCd minDedAmt minDedCur maxDedAmt maxDedCur userIdTxt1 userIdTxt2 userIdTxt3 userIdTxt4 PStat > 8878 1 1 24 4 AFY05 NA 2012-07-01 2013-06-30 1e+05 49999100.7232666 1.58e+08 0 33176.21 978 NA 1 EUR 0 EUR 0 EUR TERRINC HXG T TO NA > 8878 1 3 93 4 AFJ02 NA 2012-03-31 2014-03-30 150000 3.5e+07 1.4e+08 0 3688.44 840 NA 1 USD 0 USD 0 USD TERRINC JGP T TO NA > 8878 1 6 95 4 AFY08 NA 2012-04-08 2013-04-07 1e+05 29999983.4435654 336907000 0 0 826 NA 1 GBP 0 GBP 0 GBP TERRINC SPT T TU NA > 8878 1 7 17 4 AFR1 NA 2012-07-12 2013-06-30 7500000 5e+07 5e+08 0 10319.34 840 NA 1 USD 0 USD 0 USD TERREXC JGP T TO NA > > Many thanks for your efforts > > Simon > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Sun Sep 22 03:44:29 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Sat, 21 Sep 2013 21:44:29 -0400 Subject: [datatable-help] by=".Col" produces NA column names Message-ID: I submitted the below as bug 4927. I believe the fix is a simple regex modification, but I don't want to mess with the regex too hastily and possibly break something. Would someone care to double check this? --------------- Issue: ---- Given a data.table with a dot in the column name, using that column name as an argument to `by=` produces different results when the column name is quoted than when it is not. e.g.: DT .Col val 1: A 1 2: B 2 identical(DT[, sum(val), by=.Col], DT[, sum(val), by=".Col"] ) # [1] FALSE Specifically, if quotes are used, NAs are produced in place of the column name. Examples follow at the bottom of this email. I believe the issue is in the regex pattern in a call to `grep` in "[.data.table". The line is copied and pasted here. (currently line 743 in "data.table.r", which is inside "if (any(bynames=="")){..}") ## ORIGINAL tt = grep("^eval|^[^[:alpha:] ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L] ## SHOULD (I believe) BE CHANGED TO tt = grep("^eval|^[^(\\.|[:alpha:]) ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L] ## ... to allow for the name to start with a period. ## CONTEXT: if (any(bynames=="")) { if (length(bysubl)<2) stop("When 'by' or 'keyby' is list() we expect something inside the brackets") for (jj in seq_along(bynames)) { if (bynames[jj]=="") { # Best guess.
Use "month" in the case of by=month(date), use "a" in the case of by=a%%2 ~~~~ THIS LINE ~~~> tt = grep("^eval|^[^[:alpha:] ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L] if (!length(tt)) tt = all.vars(bysubl[[jj+1L]])[1L] bynames[jj] = tt # if user doesn't like this inferred name, user has to use by=list() to name the column } } } --------------------------------------------------- EXAMPLE: DT <- data.table(.Col = LETTERS[c(1:3, 1:3)], val=1:6) identical(DT[, sum(val), by=.Col], DT[, sum(val), by=".Col"] ) # [1] FALSE ## This works as expected DT[, sum(val), by=.Col] .Col V1 1: A 5 2: B 7 3: C 9 ## Putting the column name within quotes ## produces NA in the column names DT[, sum(val), by=c(".Col")] DT[, sum(val), by=".Col"] # both lines, same output NA V1 <~~~ NOTICE 1: A 5 2: B 7 3: C 9 # notice if we try to use `keyby` we get the following error DT[, sum(val), keyby=".Col"] # Error in setkeyv(ans, names(ans)[seq_along(byval)]) : # Column 'NA' is type 'NULL' which is not (currently) allowed as a key column type. ## and this works correctly too DT[, sum(val), by=list(.Col=.Col)] .Col V1 1: A 5 2: B 7 3: C 9 --------------------------------------------------- Only happens with a dot at the start of the name ## Appears to be only an issue when there is a dot at the start of the name DT2 <- data.table(Col. = LETTERS[c(1:3, 1:3)], val=1:6) DT2[, sum(val), by=Col.] DT2[, sum(val), by=c("Col.")] Col. V1 <~~~ As expected 1: A 5 2: B 7 3: C 9 -- Ricardo Saporta Rutgers University, New Jersey e: saporta at rutgers.edu -------------- next part -------------- An HTML attachment was scrubbed...
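[Editor's illustrative sketch] To see why a leading dot trips the inferred-name grep quoted above, the pattern can be tried outside data.table. The variable names here are made up, and the simplified class `[^.[:alpha:] ]` in the second call is just one way to admit a leading period, not necessarily the patch that was applied.

```r
vars <- c(".Col", "Col.", "month")

# Original pattern: a leading "." matches [^[:alpha:] ] (not a letter,
# not a space), so ".Col" is excluded by invert=TRUE and no name can be
# inferred -- hence the NA column name.
grep("^eval|^[^[:alpha:] ]", vars, invert = TRUE, value = TRUE)
# ".Col" is dropped; "Col." and "month" survive

# Adding "." to the negated class lets a leading period through:
grep("^eval|^[^.[:alpha:] ]", vars, invert = TRUE, value = TRUE)
# all three names are kept
```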
URL: From saporta at scarletmail.rutgers.edu Sun Sep 22 04:02:40 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Sat, 21 Sep 2013 22:02:40 -0400 Subject: [datatable-help] mapply cannot modify in place when iterating over list of DTs In-Reply-To: <523C9184.2010902@mdowle.plus.com> References: <523C7C99.40308@mdowle.plus.com> <523C9184.2010902@mdowle.plus.com> Message-ID: Matthew, I did notice the warning, but something doesn't add up: If the issue is simply that it is being copied when created, then wouldn't we expect the same warning to arise when we try to modify the table using `mapply` or `lapply`? (The latter does not produce a warning.) If, on the other hand, the issue pertains specifically to mapply (which I assume it does), then why is it only a problem when we iterate over the list directly, whereas iterating indirectly by using an index does not produce any warnings? While overall this is minor if one is aware of the issue, I think it might allow unnoticed bugs to creep into someone's code, specifically if using mapply to modify a list of DTs without realizing that the modifications are not being kept. That being said, I'm not sure how this could even be addressed if the root is in mapply, but is it worth trying to address? Rick On Fri, Sep 20, 2013 at 2:18 PM, Matthew Dowle wrote: > Does this sentence from the warning help? > > > " Also, in R<v3.1.0 list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>=v3.1.0 if that is > biting. " > > Matthew > > > On 20/09/13 19:01, Ricardo Saporta wrote: > > One warning per DT in the list > (I added the line breaks) > -Rick > ============================================= > Warning messages: > > 1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : > > Invalid .internal.selfref detected and fixed by taking a copy of the > whole table so that := can add this new column by reference.
At an earlier > point, this data.table has been copied by R (or been created manually using > structure() or similar). Avoid key<-, names<- and attr<- which in R > currently (and oddly) may copy the whole data.table. Use set* syntax > instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named > objects); please upgrade to R>=v3.1.0 if that is biting. If this message > doesn't help, please report to datatable-help so the root cause can be > fixed. > > 2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : > > Invalid .internal.selfref detected and fixed by taking a copy of the > whole table so that := can add this new column by reference. At an earlier > point, this data.table has been copied by R (or been created manually using > structure() or similar). Avoid key<-, names<- and attr<- which in R > currently (and oddly) may copy the whole data.table. Use set* syntax > instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named > objects); please upgrade to R>=v3.1.0 if that is biting. If this message > doesn't help, please report to datatable-help so the root cause can be > fixed. > ============================================= > > > > > On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle wrote: > >> >> Hi, >> >> What's the warning? >> >> Matthew >> >> >> >> On 20/09/13 14:48, Ricardo Saporta wrote: >> >> I've encountered the following issue iterating over a list of >> data.tables. >> The issue is only with mapply, not with lapply . >> >> >> Given a list of data.table's, mapply'ing over the list directly >> cannot modify in place. >> >> Also if attempting to add a new column, we get an "Invalid >> .internal.selfref" warning. 
>> Modifying an existing column does not issue a warning, but still fails to >> modify-in-place >> >> WORKAROUND: >> ---------- >> The workaround is to iterate over an index to the list, then to >> modify each data.table via list.of.DTs[[i]][ .. ] >> >> **Interestingly, this issue occurs with `mapply`, but not `lapply`.** >> >> >> EXAMPLE: >> -------- >> # Given a list of DT's and two lists of vectors, >> # we want to add the corresponding vectors as columns to the DT. >> >> ## ---------------- ## >> ## SAMPLE DATA: ## >> ## ---------------- ## >> # list of data.tables >> list.DT <- list( >> DT1=data.table(Col1=111:115, Col2=121:125), >> DT2=data.table(Col1=211:215, Col2=221:225) >> ) >> >> # lists of columns to add >> list.Col3 <- list(131:135, 231:235) >> list.Col4 <- list(141:145, 241:245) >> >> >> ## ------------------------------------ ## >> ## Iterating over the list elements ## >> ## adding a new column ## >> ## ------------------------------------ ## >> ## Will issue warning and ## >> ## will fail to modify in place ## >> ## ------------------------------------ ## >> mapply ( >> function(DT, C3, C4) >> DT[, c("Col3", "Col4") := list(C3, C4)], >> >> list.DT, # iterating over the list >> list.Col3, list.Col4, >> SIMPLIFY=FALSE >> ) >> >> ## Note the lack of change >> list.DT >> >> >> ## ------------------------------------ ## >> ## Iterating over an index ## >> ## ------------------------------------ ## >> mapply ( >> function(i, C3, C4) >> list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)], >> >> seq(list.DT), # iterating over an index to the list >> list.Col3, list.Col4, >> SIMPLIFY=FALSE >> ) >> >> ## Note each DT _has_ been modified >> list.DT >> >> ## ------------------------------------ ## >> ## Iterating over the list elements ## >> ## modifying existing column ## >> ## ------------------------------------ ## >> ## No warning issued, but ## >> ## Will fail to modify in place ## >> ## ------------------------------------ ## >> mapply ( >> function(DT, 
C3, C4) >> DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)], >> >> list.DT, # iterating over the list >> list.Col3, list.Col4, >> SIMPLIFY=FALSE >> ) >> >> ## Note the lack of change (compare with output from `mapply`) >> list.DT >> >> ## ------------------------------------ ## >> ## ## >> ## `lapply` works as expected. ## >> ## ## >> ## ------------------------------------ ## >> >> ## NOW WITH lapply >> lapply(list.DT, >> function(DT) >> DT[, newCol := LETTERS[1:5]] >> ) >> >> ## Note the new column: >> list.DT >> >> >> >> # ========================== # >> >> ## NON-WORKAROUNDS ## >> ## >> ## I also tried all of the following alternatives >> ## in hopes of being able to iterate over the list >> ## directly, using `mapply`. >> ## None of these worked. >> >> # (1) Creating the DTs First, then creating the list from them >> DT1 <- data.table(Col1=111:115, Col2=121:125) >> DT2 <- data.table(Col1=211:215, Col2=221:225) >> >> list.DT <- list(DT1=DT1,DT2=DT2 ) >> >> >> # (2) Same as 1, and using `copy()` in the call to `list()` >> list.DT <- list(DT1=copy(DT1), >> DT2=copy(DT2) ) >> >> # (3) lapply'ing `copy` and then iterating over that list >> list.DT <- lapply(list.DT, copy) >> >> # (4) Not naming the list elements >> list.DT <- list(DT1, DT2) >> # and tried >> list.DT <- list(copy(DT1), copy(DT2)) >> >> ## All of the above still failed to modify in place >> ## (and also issued the same warning if trying to add a column) >> ## when iterating using mapply >> >> mapply(function(DT, C3, C4) >> DT[, c("Col3", "Col4") := list(C3, C4)], >> list.DT, list.Col3, list.Col4, >> SIMPLIFY=FALSE) >> >> >> # ========================== # >> >> >> Ricardo Saporta >> Rutgers University, New Jersey >> e: saporta at rutgers.edu >> >> >> >> _______________________________________________ >> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> > > -------------- next part 
-------------- An HTML attachment was scrubbed... URL: From karl at huftis.org Sun Sep 22 11:38:33 2013 From: karl at huftis.org (Karl Ove Hufthammer) Date: Sun, 22 Sep 2013 11:38:33 +0200 Subject: [datatable-help] (no subject) Message-ID: <1379842713.3310.1.camel@adrian.site> From szehnder at uni-bonn.de Mon Sep 23 17:08:59 2013 From: szehnder at uni-bonn.de (Simon Zehnder) Date: Mon, 23 Sep 2013 17:08:59 +0200 Subject: [datatable-help] What the status on fast time and data.table? Message-ID: <04189485-CEBA-4D04-8EF0-3BD49D0E0E00@uni-bonn.de> Dear Users and Devels, I read this thread http://r.789695.n4.nabble.com/About-adding-fastmatch-and-fasttime-to-data-table-td4659622.html and I would like to ask, if there have been any proceedings? What is the status of fast time in data.table? Best Simon From mdowle at mdowle.plus.com Tue Sep 24 03:25:41 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 24 Sep 2013 02:25:41 +0100 Subject: [datatable-help] What the status on fast time and data.table? In-Reply-To: <04189485-CEBA-4D04-8EF0-3BD49D0E0E00@uni-bonn.de> References: <04189485-CEBA-4D04-8EF0-3BD49D0E0E00@uni-bonn.de> Message-ID: <5240EA15.4040709@mdowle.plus.com> Hi, Sorry no progress yet. But it's on the list. You currently have to read as character and then use Simon's package. It isn't yet built in. Matthew On 23/09/13 16:08, Simon Zehnder wrote: > Dear Users and Devels, > > I read this thread http://r.789695.n4.nabble.com/About-adding-fastmatch-and-fasttime-to-data-table-td4659622.html and I would like to ask, if there have been any proceedings? What is the status of fast time in data.table? 
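[Editorial note: Matthew's interim suggestion — read the timestamp column as character, then convert with Simon Urbanek's fasttime package — might look like the sketch below. The file name and column name are made up for illustration; `fastPOSIXct()` expects timestamps in "YYYY-MM-DD hh:mm:ss"-style text.]

```r
library(data.table)
library(fasttime)   # provides fastPOSIXct()

# hypothetical file with a 'timestamp' column; force it to character on read
DT <- fread("events.csv", colClasses = c(timestamp = "character"))

# then convert by reference with fasttime
DT[, timestamp := fastPOSIXct(timestamp, tz = "UTC")]
```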
> > > Best > > Simon > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From mdowle at mdowle.plus.com Tue Sep 24 03:42:38 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Tue, 24 Sep 2013 02:42:38 +0100 Subject: [datatable-help] mapply cannot modify in place when iterating over list of DTs In-Reply-To: References: <523C7C99.40308@mdowle.plus.com> <523C9184.2010902@mdowle.plus.com> Message-ID: <5240EE0E.8090001@mdowle.plus.com> Hi, Basically adding columns by reference to a data.table when it's a member of a list of data.table, is really difficult to handle internally. I had to special case internally to get around list() copying, so that the binding can change inside the list on the shallow copy when [[ is used. A for loop is the way to add columns by reference inside a list of data.table, and that should work ok using [[. But doing that via lapply and mapply is really stretching it. Even catching user expectations in this area is difficult. Ideally we'd catch mapply, yes, but really data.table likes to be rbindlist()-ed and then ops to work on a single large data.table. We can advice to the warning message not to use mapply or lapply to add columns by reference to a list of data.table (use a for loop instead) ? Matthew On 22/09/13 03:02, Ricardo Saporta wrote: > Matthew, > > I did notice the warning, but something doesnt add up: > > If the issue is simply that it is being copied when created, then > wouldnt we expect the same warning to arise when we try to modify the > table in using `mapply` or `lapply`? (the latter does not produce a > warning. > > If on the otherhand, the issue pertains specifically to mapply (which > I assume it does), then why is it only a problem when we iterate over > the list directly, whereas iterating indirectly by using an index does > not produce any warnings. 
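[Editorial note: the for-loop approach Matthew recommends, applied to the sample data from Ricardo's original post, would look roughly like this — iterating with `[[` so each member of the list is updated by reference:]

```r
library(data.table)

list.DT   <- list(DT1 = data.table(Col1 = 111:115, Col2 = 121:125),
                  DT2 = data.table(Col1 = 211:215, Col2 = 221:225))
list.Col3 <- list(131:135, 231:235)
list.Col4 <- list(141:145, 241:245)

# a plain for loop over [[ modifies each data.table in place
for (i in seq_along(list.DT)) {
  list.DT[[i]][, c("Col3", "Col4") := list(list.Col3[[i]], list.Col4[[i]])]
}

list.DT   # each DT now carries Col3 and Col4
```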
> While overall, this is minor if one is aware of the issue, I think it > might allow for unnoticed bugs to creep into someones code. > Specifically if using mapply to modify a list of DTs and the user not > realizing that the modifications are not being held. > > That being said, I'm not sure how this could even be addressed if the > root is in mapply, but is it worth trying to address? > > Rick > > > On Fri, Sep 20, 2013 at 2:18 PM, Matthew Dowle > wrote: > > Does this sentence from the warning help? > > > " Also, in R (R's list() used to copy named objects); please upgrade to > R>=v3.1.0 if that is biting. " > > Matthew > > > On 20/09/13 19:01, Ricardo Saporta wrote: >> One warning per DT in the list >> (I added the line breaks) >> -Rick >> ============================================= >> Warning messages: >> >> 1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : >> >> Invalid .internal.selfref detected and fixed by taking a copy >> of the whole table so that := can add this new column by >> reference. At an earlier point, this data.table has been copied >> by R (or been created manually using structure() or similar). >> Avoid key<-, names<- and attr<- which in R currently (and oddly) >> may copy the whole data.table. Use set* syntax instead to avoid >> copying: ?set, ?setnames and ?setattr. Also, in R> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to >> copy named objects); please upgrade to R>=v3.1.0 if that is >> biting. If this message doesn't help, please report to >> datatable-help so the root cause can be fixed. >> >> 2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : >> >> Invalid .internal.selfref detected and fixed by taking a copy >> of the whole table so that := can add this new column by >> reference. At an earlier point, this data.table has been copied >> by R (or been created manually using structure() or similar). 
>> Avoid key<-, names<- and attr<- which in R currently (and oddly) >> may copy the whole data.table. Use set* syntax instead to avoid >> copying: ?set, ?setnames and ?setattr. Also, in R> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to >> copy named objects); please upgrade to R>=v3.1.0 if that is >> biting. If this message doesn't help, please report to >> datatable-help so the root cause can be fixed. >> ============================================= >> >> >> >> >> On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle >> > wrote: >> >> >> Hi, >> >> What's the warning? >> >> Matthew >> >> >> >> On 20/09/13 14:48, Ricardo Saporta wrote: >>> I've encountered the following issue iterating over a list >>> of data.tables. >>> The issue is only with mapply, not with lapply . >>> >>> Given a list of data.table's, mapply'ing over the list directly >>> cannot modify in place. >>> >>> Also if attempting to add a new column, we get an "Invalid >>> .internal.selfref" warning. >>> Modifying an existing column does not issue a warning, but >>> still fails to modify-in-place >>> >>> WORKAROUND: >>> ---------- >>> The workaround is to iterate over an index to the list, then to >>> modify each data.table via list.of.DTs[[i]][ .. ] >>> >>> **Interestingly, this issue occurs with `mapply`, but not >>> `lapply`.** >>> >>> EXAMPLE: >>> -------- >>> # Given a list of DT's and two lists of vectors, >>> # we want to add the corresponding vectors as columns to >>> the DT. 
>>> >>> ## ---------------- ## >>> ## SAMPLE DATA: ## >>> ## ---------------- ## >>> # list of data.tables >>> list.DT <- list( >>> DT1=data.table(Col1=111:115, Col2=121:125), >>> DT2=data.table(Col1=211:215, Col2=221:225) >>> ) >>> >>> # lists of columns to add >>> list.Col3 <- list(131:135, 231:235) >>> list.Col4 <- list(141:145, 241:245) >>> >>> >>> ## ------------------------------------ ## >>> ## Iterating over the list elements ## >>> ## adding a new column ## >>> ## ------------------------------------ ## >>> ## Will issue warning and ## >>> ## will fail to modify in place ## >>> ## ------------------------------------ ## >>> mapply ( >>> function(DT, C3, C4) >>> DT[, c("Col3", "Col4") := list(C3, C4)], >>> list.DT, # iterating over the list >>> list.Col3, list.Col4, >>> SIMPLIFY=FALSE >>> ) >>> >>> ## Note the lack of change >>> list.DT >>> >>> >>> ## ------------------------------------ ## >>> ## Iterating over an index ## >>> ## ------------------------------------ ## >>> mapply ( >>> function(i, C3, C4) >>> list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)], >>> seq(list.DT), # iterating over an index to the list >>> list.Col3, list.Col4, >>> SIMPLIFY=FALSE >>> ) >>> >>> ## Note each DT _has_ been modified >>> list.DT >>> >>> ## ------------------------------------ ## >>> ## Iterating over the list elements ## >>> ## modifying existing column ## >>> ## ------------------------------------ ## >>> ## No warning issued, but ## >>> ## Will fail to modify in place ## >>> ## ------------------------------------ ## >>> mapply ( >>> function(DT, C3, C4) >>> DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)], >>> >>> list.DT, # iterating over the list >>> list.Col3, list.Col4, >>> SIMPLIFY=FALSE >>> ) >>> >>> ## Note the lack of change (compare with output from `mapply`) >>> list.DT >>> >>> ## ------------------------------------ ## >>> ## ## >>> ## `lapply` works as expected. 
## >>> ## ## >>> ## ------------------------------------ ## >>> ## NOW WITH lapply >>> lapply(list.DT, >>> function(DT) >>> DT[, newCol := LETTERS[1:5]] >>> ) >>> >>> ## Note the new column: >>> list.DT >>> >>> >>> >>> # ========================== # >>> >>> ## NON-WORKAROUNDS ## >>> ## >>> ## I also tried all of the following alternatives >>> ## in hopes of being able to iterate over the list >>> ## directly, using `mapply`. >>> ## None of these worked. >>> >>> # (1) Creating the DTs First, then creating the list from them >>> DT1 <- data.table(Col1=111:115, Col2=121:125) >>> DT2 <- data.table(Col1=211:215, Col2=221:225) >>> >>> list.DT <- list(DT1=DT1,DT2=DT2 ) >>> >>> >>> # (2) Same as 1, and using `copy()` in the call to `list()` >>> list.DT <- list(DT1=copy(DT1), >>> DT2=copy(DT2) ) >>> >>> # (3) lapply'ing `copy` and then iterating over that list >>> list.DT <- lapply(list.DT, copy) >>> >>> # (4) Not naming the list elements >>> list.DT <- list(DT1, DT2) >>> # and tried >>> list.DT <- list(copy(DT1), copy(DT2)) >>> >>> ## All of the above still failed to modify in place >>> ## (and also issued the same warning if trying to add a >>> column) >>> ## when iterating using mapply >>> >>> mapply(function(DT, C3, C4) >>> DT[, c("Col3", "Col4") := list(C3, C4)], >>> list.DT, list.Col3, list.Col4, >>> SIMPLIFY=FALSE) >>> >>> >>> # ========================== # >>> >>> >>> Ricardo Saporta >>> Rutgers University, New Jersey >>> e: saporta at rutgers.edu >>> >>> >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Tue Sep 24 06:15:18 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Tue, 24 Sep 2013 00:15:18 -0400 Subject: [datatable-help] mapply cannot modify in place when iterating over list of DTs In-Reply-To: <5240EE0E.8090001@mdowle.plus.com> References: <523C7C99.40308@mdowle.plus.com> <523C9184.2010902@mdowle.plus.com> <5240EE0E.8090001@mdowle.plus.com> Message-ID: On Mon, Sep 23, 2013 at 9:42 PM, Matthew Dowle wrote: > > Hi, > Basically adding columns by reference to a data.table when it's a member > of a list of data.table, is really difficult to handle internally. I had > to special case internally to get around list() copying, so that the > binding can change inside the list on the shallow copy when [[ is used. A > for loop is the way to add columns by reference inside a list of > data.table, and that should work ok using [[. But doing that via lapply > and mapply is really stretching it. > That makes sense. I took a whack at it, but couldn't even come close. > Even catching user expectations in this area is difficult. Ideally we'd > catch mapply, yes, but really data.table likes to be rbindlist()-ed and > then ops to work on a single large data.table. > Agreed. In the application where this came up, I am dealing with a list of tables with different dims (hence not rbinding) > We can advice to the warning message not to use mapply or lapply to add > columns by reference to a list of data.table (use a for loop instead) ? > Perhaps a warning that modifications to the DT's in the list are likely to not have stuck and to use rbindlist when possible? > > Matthew > > > > On 22/09/13 03:02, Ricardo Saporta wrote: > > Matthew, > > I did notice the warning, but something doesnt add up: > > If the issue is simply that it is being copied when created, then > wouldnt we expect the same warning to arise when we try to modify the table > in using `mapply` or `lapply`? (the latter does not produce a warning. 
> > If on the otherhand, the issue pertains specifically to mapply (which I > assume it does), then why is it only a problem when we iterate over the > list directly, whereas iterating indirectly by using an index does not > produce any warnings. > > While overall, this is minor if one is aware of the issue, I think it > might allow for unnoticed bugs to creep into someones code. Specifically > if using mapply to modify a list of DTs and the user not realizing that the > modifications are not being held. > > That being said, I'm not sure how this could even be addressed if the > root is in mapply, but is it worth trying to address? > > Rick > > > On Fri, Sep 20, 2013 at 2:18 PM, Matthew Dowle wrote: > >> Does this sentence from the warning help? >> >> >> " Also, in R> list() used to copy named objects); please upgrade to R>=v3.1.0 if that is >> biting. " >> >> Matthew >> >> >> On 20/09/13 19:01, Ricardo Saporta wrote: >> >> One warning per DT in the list >> (I added the line breaks) >> -Rick >> ============================================= >> Warning messages: >> >> 1: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : >> >> Invalid .internal.selfref detected and fixed by taking a copy of the >> whole table so that := can add this new column by reference. At an earlier >> point, this data.table has been copied by R (or been created manually using >> structure() or similar). Avoid key<-, names<- and attr<- which in R >> currently (and oddly) may copy the whole data.table. Use set* syntax >> instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named >> objects); please upgrade to R>=v3.1.0 if that is biting. If this message >> doesn't help, please report to datatable-help so the root cause can be >> fixed. 
>> >> 2: In `[.data.table`(DT, , `:=`(c("Col3", "Col4"), list(C3, C4))) : >> >> Invalid .internal.selfref detected and fixed by taking a copy of the >> whole table so that := can add this new column by reference. At an earlier >> point, this data.table has been copied by R (or been created manually using >> structure() or similar). Avoid key<-, names<- and attr<- which in R >> currently (and oddly) may copy the whole data.table. Use set* syntax >> instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R> list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named >> objects); please upgrade to R>=v3.1.0 if that is biting. If this message >> doesn't help, please report to datatable-help so the root cause can be >> fixed. >> ============================================= >> >> >> >> >> On Fri, Sep 20, 2013 at 12:49 PM, Matthew Dowle wrote: >> >>> >>> Hi, >>> >>> What's the warning? >>> >>> Matthew >>> >>> >>> >>> On 20/09/13 14:48, Ricardo Saporta wrote: >>> >>> I've encountered the following issue iterating over a list of >>> data.tables. >>> The issue is only with mapply, not with lapply . >>> >>> >>> Given a list of data.table's, mapply'ing over the list directly >>> cannot modify in place. >>> >>> Also if attempting to add a new column, we get an "Invalid >>> .internal.selfref" warning. >>> Modifying an existing column does not issue a warning, but still fails >>> to modify-in-place >>> >>> WORKAROUND: >>> ---------- >>> The workaround is to iterate over an index to the list, then to >>> modify each data.table via list.of.DTs[[i]][ .. ] >>> >>> **Interestingly, this issue occurs with `mapply`, but not `lapply`.** >>> >>> >>> EXAMPLE: >>> -------- >>> # Given a list of DT's and two lists of vectors, >>> # we want to add the corresponding vectors as columns to the DT. 
>>> >>> ## ---------------- ## >>> ## SAMPLE DATA: ## >>> ## ---------------- ## >>> # list of data.tables >>> list.DT <- list( >>> DT1=data.table(Col1=111:115, Col2=121:125), >>> DT2=data.table(Col1=211:215, Col2=221:225) >>> ) >>> >>> # lists of columns to add >>> list.Col3 <- list(131:135, 231:235) >>> list.Col4 <- list(141:145, 241:245) >>> >>> >>> ## ------------------------------------ ## >>> ## Iterating over the list elements ## >>> ## adding a new column ## >>> ## ------------------------------------ ## >>> ## Will issue warning and ## >>> ## will fail to modify in place ## >>> ## ------------------------------------ ## >>> mapply ( >>> function(DT, C3, C4) >>> DT[, c("Col3", "Col4") := list(C3, C4)], >>> >>> list.DT, # iterating over the list >>> list.Col3, list.Col4, >>> SIMPLIFY=FALSE >>> ) >>> >>> ## Note the lack of change >>> list.DT >>> >>> >>> ## ------------------------------------ ## >>> ## Iterating over an index ## >>> ## ------------------------------------ ## >>> mapply ( >>> function(i, C3, C4) >>> list.DT[[i]] [, c("Col3", "Col4") := list(C3, C4)], >>> >>> seq(list.DT), # iterating over an index to the list >>> list.Col3, list.Col4, >>> SIMPLIFY=FALSE >>> ) >>> >>> ## Note each DT _has_ been modified >>> list.DT >>> >>> ## ------------------------------------ ## >>> ## Iterating over the list elements ## >>> ## modifying existing column ## >>> ## ------------------------------------ ## >>> ## No warning issued, but ## >>> ## Will fail to modify in place ## >>> ## ------------------------------------ ## >>> mapply ( >>> function(DT, C3, C4) >>> DT[, c("Col3", "Col4") := list(Col3*1e3, Col4*1e4)], >>> >>> list.DT, # iterating over the list >>> list.Col3, list.Col4, >>> SIMPLIFY=FALSE >>> ) >>> >>> ## Note the lack of change (compare with output from `mapply`) >>> list.DT >>> >>> ## ------------------------------------ ## >>> ## ## >>> ## `lapply` works as expected. 
## >>> ## ## >>> ## ------------------------------------ ## >>> >>> ## NOW WITH lapply >>> lapply(list.DT, >>> function(DT) >>> DT[, newCol := LETTERS[1:5]] >>> ) >>> >>> ## Note the new column: >>> list.DT >>> >>> >>> >>> # ========================== # >>> >>> ## NON-WORKAROUNDS ## >>> ## >>> ## I also tried all of the following alternatives >>> ## in hopes of being able to iterate over the list >>> ## directly, using `mapply`. >>> ## None of these worked. >>> >>> # (1) Creating the DTs First, then creating the list from them >>> DT1 <- data.table(Col1=111:115, Col2=121:125) >>> DT2 <- data.table(Col1=211:215, Col2=221:225) >>> >>> list.DT <- list(DT1=DT1,DT2=DT2 ) >>> >>> >>> # (2) Same as 1, and using `copy()` in the call to `list()` >>> list.DT <- list(DT1=copy(DT1), >>> DT2=copy(DT2) ) >>> >>> # (3) lapply'ing `copy` and then iterating over that list >>> list.DT <- lapply(list.DT, copy) >>> >>> # (4) Not naming the list elements >>> list.DT <- list(DT1, DT2) >>> # and tried >>> list.DT <- list(copy(DT1), copy(DT2)) >>> >>> ## All of the above still failed to modify in place >>> ## (and also issued the same warning if trying to add a column) >>> ## when iterating using mapply >>> >>> mapply(function(DT, C3, C4) >>> DT[, c("Col3", "Col4") := list(C3, C4)], >>> list.DT, list.Col3, list.Col4, >>> SIMPLIFY=FALSE) >>> >>> >>> # ========================== # >>> >>> >>> Ricardo Saporta >>> Rutgers University, New Jersey >>> e: saporta at rutgers.edu >>> >>> >>> >>> _______________________________________________ >>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> > > > _______________________________________________ > datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shaklev at gmail.com Fri Sep 27 05:16:11 2013 From: shaklev at gmail.com (Stian Håklev) Date: Thu, 26 Sep 2013 23:16:11 -0400 Subject: [datatable-help] Using data.table to run a function on every row Message-ID: I'm trying to run a function on every row fulfilling a certain criterion, which returns a data frame - the idea is then to take the list of data frames and rbindlist them together into a totally separate data.table. (I'm extracting several URL links from each forum post, and tagging them with the forum post they came from.)

I tried doing this with a data.table

a <- db[has_url == T, getUrls(text, id)]

and get the message

Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L, :
  replacement has 11007 rows, data has 29787

because some rows have several URLs... However, I don't care that these row lengths don't match, I still want these rows :) I thought j would just let me execute arbitrary R code in the context of the rows as variable names, etc.

Here's the function it's running, but that shouldn't be relevant:

getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  a <- data.frame(urls = unlist(matches))
  a$id <- id
  a
}

Thanks, and thanks for an amazing package - data.table has made my life so much easier. It should be part of base, I think. Stian Haklev, University of Toronto -- http://reganmian.net/blog -- Random Stuff that Matters -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Fri Sep 27 08:37:28 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 27 Sep 2013 02:37:28 -0400 Subject: [datatable-help] Using data.table to run a function on every row In-Reply-To: References: Message-ID: Hi there, Try inserting a `by=id`:

a <- db[(has_url), getUrls(text, id), by=id]

Also, no need for "has_url == T"; instead use (has_url) if the variable is already logical.
(Otherwise, you are just slowing things down ;) Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev wrote: > I'm trying to run a function on every row fulfilling a certain criterium, > which returns a data frame - the idea is then to take the list of data > frames and rbindlist them together for a totally separate data.table. (I'm > extracting several URL links from each forum post, and tagging them with > the forum post they came from). > > I tried doing this with a data.table > > a <- db[has_url == T, getUrls(text, id)] > > and get the message > > Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L, : > replacement has 11007 rows, data has 29787 > > Because some rows have several URLs... However, I don't care that these > rowlengths don't match, I still want these rows :) I thought J would just > let me execute arbitrary R code in the context of the rows as variable > names, etc. > > Here's the function it's running, but that shouldn't be relevant > > getUrls <- function(text, id) { > matches <- str_match_all(text, url_pattern) > a <- data.frame(urls=unlist(matches)) > a$id <- id > a > } > > > Thanks, and thanks for an amazing package - data.table has made my life so > much easier. It should be part of base, I think. > Stian Haklev, University of Toronto > > -- > http://reganmian.net/blog -- Random Stuff that Matters > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Fri Sep 27 08:41:19 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 27 Sep 2013 02:41:19 -0400 Subject: [datatable-help] Using data.table to run a function on every row In-Reply-To: References: Message-ID: sorry, I probably should have elaborated (it's late here, in NJ) The error you are seeing is most likely coming from your getURL function in that you are adding several ids to a data.frame of varying rows, and `R` cannot recycle it correctly. If you instead breakdown by id, then each time you are only assigning one id and R will be able to recycle appropriately, without issue. good luck! Rick Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta < saporta at scarletmail.rutgers.edu> wrote: > Hi there, > > Try inserting a `by=id` in > > a <- db[(has_url), getUrls(text, id), by=id] > > Also, no need for "has_url == T" > instead, use > (has_url) > If the variable is alread logical. (Otherwise, you are just slowing > things down ;) > > > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu > > > > On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev wrote: > >> I'm trying to run a function on every row fulfilling a certain criterium, >> which returns a data frame - the idea is then to take the list of data >> frames and rbindlist them together for a totally separate data.table. (I'm >> extracting several URL links from each forum post, and tagging them with >> the forum post they came from). >> >> I tried doing this with a data.table >> >> a <- db[has_url == T, getUrls(text, id)] >> >> and get the message >> >> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L, : >> replacement has 11007 rows, data has 29787 >> >> Because some rows have several URLs... 
However, I don't care that these >> rowlengths don't match, I still want these rows :) I thought J would just >> let me execute arbitrary R code in the context of the rows as variable >> names, etc. >> >> Here's the function it's running, but that shouldn't be relevant >> >> getUrls <- function(text, id) { >> matches <- str_match_all(text, url_pattern) >> a <- data.frame(urls=unlist(matches)) >> a$id <- id >> a >> } >> >> >> Thanks, and thanks for an amazing package - data.table has made my life >> so much easier. It should be part of base, I think. >> Stian Haklev, University of Toronto >> >> -- >> http://reganmian.net/blog -- Random Stuff that Matters >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From saporta at scarletmail.rutgers.edu Fri Sep 27 08:44:37 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 27 Sep 2013 02:44:37 -0400 Subject: [datatable-help] Using data.table to run a function on every row In-Reply-To: References: Message-ID: In fact, you should be able to skip the function altogether and just use: db[ (has_url), str_match_all(text, url_pattern), by=id] (and now, my apologies to all for the email clutter) good night On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta < saporta at scarletmail.rutgers.edu> wrote: > sorry, I probably should have elaborated (it's late here, in NJ) > > The error you are seeing is most likely coming from your getURL function > in that you are adding several ids to a data.frame of varying rows, and `R` > cannot recycle it correctly. > > If you instead breakdown by id, then each time you are only assigning one > id and R will be able to recycle appropriately, without issue. > > good luck! 
> Rick > > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu > > > > On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta < > saporta at scarletmail.rutgers.edu> wrote: > >> Hi there, >> >> Try inserting a `by=id` in >> >> a <- db[(has_url), getUrls(text, id), by=id] >> >> Also, no need for "has_url == T" >> instead, use >> (has_url) >> If the variable is alread logical. (Otherwise, you are just slowing >> things down ;) >> >> >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu >> >> >> >> On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev wrote: >> >>> I'm trying to run a function on every row fulfilling a certain >>> criterium, which returns a data frame - the idea is then to take the list >>> of data frames and rbindlist them together for a totally separate >>> data.table. (I'm extracting several URL links from each forum post, and >>> tagging them with the forum post they came from). >>> >>> I tried doing this with a data.table >>> >>> a <- db[has_url == T, getUrls(text, id)] >>> >>> and get the message >>> >>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, 4L, >>> : >>> replacement has 11007 rows, data has 29787 >>> >>> Because some rows have several URLs... However, I don't care that these >>> rowlengths don't match, I still want these rows :) I thought J would just >>> let me execute arbitrary R code in the context of the rows as variable >>> names, etc. >>> >>> Here's the function it's running, but that shouldn't be relevant >>> >>> getUrls <- function(text, id) { >>> matches <- str_match_all(text, url_pattern) >>> a <- data.frame(urls=unlist(matches)) >>> a$id <- id >>> a >>> } >>> >>> >>> Thanks, and thanks for an amazing package - data.table has made my life >>> so much easier. It should be part of base, I think. 
>>> Stian Haklev, University of Toronto >>> >>> -- >>> http://reganmian.net/blog -- Random Stuff that Matters >>> >>> _______________________________________________ >>> datatable-help mailing list >>> datatable-help at lists.r-forge.r-project.org >>> >>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Sep 27 14:48:41 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 27 Sep 2013 13:48:41 +0100 Subject: [datatable-help] Using data.table to run a function on every row In-Reply-To: References: Message-ID: <52457EA9.8000803@mdowle.plus.com> That was my thought too. I don't know what str_match_all is, but given the unlist() in getUrls(), it seems to return a list. Rather than unlist(), leave it as list, and data.table should happily make a `list` column where each cell is itself a vector. In fact each cell can be anything at all, even embedded data.table, function definitions, or any type of object. You might need a list(list(str_match_all(...))) in j to do that. Or what Rick has suggested here might work first time. It's hard to visualise it without a small reproducible example, so we're having to make educated guesses. Many thanks for the kind words about data.table. Matthew On 27/09/13 07:44, Ricardo Saporta wrote: > In fact, you should be able to skip the function altogether and just use: > > db[ (has_url), str_match_all(text, url_pattern), by=id] > > > (and now, my apologies to all for the email clutter) > good night > > On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta > > wrote: > > sorry, I probably should have elaborated (it's late here, in NJ) > > The error you are seeing is most likely coming from your getURL > function in that you are adding several ids to a data.frame of > varying rows, and `R` cannot recycle it correctly. 
> > If you instead breakdown by id, then each time you are only > assigning one id and R will be able to recycle appropriately, > without issue. > > good luck! > Rick > > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu > > > > On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta > > wrote: > > Hi there, > > Try inserting a `by=id` in > > a <- db[(has_url), getUrls(text, id), by=id] > > Also, no need for "has_url == T" > instead, use > (has_url) > If the variable is alread logical. (Otherwise, you are just > slowing things down ;) > > > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu > > > > On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev > > wrote: > > I'm trying to run a function on every row fulfilling a > certain criterium, which returns a data frame - the idea > is then to take the list of data frames and rbindlist them > together for a totally separate data.table. (I'm > extracting several URL links from each forum post, and > tagging them with the forum post they came from). > > I tried doing this with a data.table > > a <- db[has_url == T, getUrls(text, id)] > > and get the message > > Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, > 1L, 2L, 4L, : > replacement has 11007 rows, data has 29787 > > Because some rows have several URLs... However, I don't > care that these rowlengths don't match, I still want these > rows :) I thought J would just let me execute arbitrary R > code in the context of the rows as variable names, etc. > > Here's the function it's running, but that shouldn't be > relevant > > getUrls <- function(text, id) { > matches <- str_match_all(text, url_pattern) > a <- data.frame(urls=unlist(matches)) > a$id <- id > a > } > > > Thanks, and thanks for an amazing package - data.table has > made my life so much easier. It should be part of base, I > think. 
> Stian Haklev, University of Toronto
>
> --
> http://reganmian.net/blog -- Random Stuff that Matters
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shaklev at gmail.com Fri Sep 27 17:21:43 2013
From: shaklev at gmail.com (=?UTF-8?Q?Stian_H=C3=A5klev?=)
Date: Fri, 27 Sep 2013 11:21:43 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To: <52457EA9.8000803@mdowle.plus.com>
References: <52457EA9.8000803@mdowle.plus.com>
Message-ID:

I really appreciate all your help - amazingly supportive community. I could probably figure out a "brute-force" way of doing things, but since I'm going to be writing a lot of R in the future too, I always want to find the "correct" way of doing it, which both looks clear and is quick. (I come from a background in Ruby, and am always interested in writing very clear and DRY (don't repeat yourself) code, but I find I still spend a lot of time in R struggling with various data formats: lists, nested lists, vectors, matrices, different forms of apply/ddply/for loops, etc.)

Anyway, a few different points.

I tried db[has_url,], but got "object has_url not found".

I then tried setkey(db, "has_url") and used that, but somehow it was a lot slower than what I used to do (I repeated it a few times). Not sure if I'm doing it wrong. (Not important - even 15 sec is totally fine, I'll only run this once. But it's good to understand the underlying principles.)
setkey(db, "has_url")
> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 17.514   0.334  17.847
> system.time( db[has_url == T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
  5.943   0.040   5.984

The second point was how to get the matches out. The idea was that you have a text field which might contain several URLs, which I want to extract, but I need each URL tagged with the row it came from (so I can link it back to properties of the post and author, look at whether certain students are more likely to post certain kinds of URLs, etc.).

Instead of a function, you'll see above that I rewrote it to use :=, which creates a new column that holds a list. That worked wonderfully, but now how do I get these "out" of this data.table, and into a new one?

Made-up example data:

a <- c(1,2,3)
b <- c(2,3,4)
dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a, b, NULL))

Now my goal is to have a new data.table that looks like this:

Name       Number
Stian      1
Stian      2
Stian      3
Christian  2
Christian  3
Christian  4

Again, I'm sure I could do this with a for() or lapply, but I'd love to see the most elegant solution.

Note that this:

getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  data.frame(urls=unlist(matches), id=id)
}

system.time( a <- db[(has_url), getUrls(text, id), by=id] )

works perfectly; the result is

   id urls                                                                                               id
1: 16 https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166                             16
2: 24 http://www.youtube.com/watch?v=JUiGF4TGI9w                                                         24
3: 44 http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/         44
4: 61 http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html                    61
5: 75 http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html 75
6: 75 https://www.facebook.com/photo.php?fbid=10151324672623754                                          75

which is exactly what I was looking for.
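[Editor's note: for the made-up example above, one idiomatic data.table way to flatten the list column into the desired long table is to unlist() inside j, grouped by the name column. This is a minimal sketch using only the example data from the message; the column names `names` and `numbers` are as given there.]

```r
library(data.table)

a <- c(1, 2, 3)
b <- c(2, 3, 4)
dt <- data.table(names = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# unlist() the list column within each group; a NULL cell unlists to
# length zero, so "John" simply contributes no rows.
long <- dt[, list(number = unlist(numbers)), by = names]
```

Grouping first and unlisting per group is the same mechanism that makes `getUrls(text, id)` with `by=id` recycle the id correctly.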
So I've really reached my goal, but I'm curious about the other method as well.

Thanks!
Stian

On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle wrote:
> [...]
--
http://reganmian.net/blog -- Random Stuff that Matters
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shaklev at gmail.com Fri Sep 27 17:39:35 2013
From: shaklev at gmail.com (=?UTF-8?Q?Stian_H=C3=A5klev?=)
Date: Fri, 27 Sep 2013 11:39:35 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To:
References: <52457EA9.8000803@mdowle.plus.com>
Message-ID:

OK, so I just realized a few things. First of all, I should have put has_url in parentheses to use it as an index (as Ricardo did; I just didn't notice that it was important). However, this still doesn't make much of a difference, because we're only talking about 146k entries, and most of the time is spent on the string extraction:

> system.time( a <- db[(has_url), getUrls(text, id), by=id] )
   user  system elapsed
 10.246   0.027  10.275
> system.time( a <- db[has_url == T, getUrls(text, id), by=id] )
   user  system elapsed
 10.094   0.029  10.123

Either way, good to know!
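[Editor's note: the reason db[has_url,] failed while db[(has_url),] works is that when i is a bare symbol, data.table first looks it up in the calling scope (so that tables and vectors can be passed in), whereas the parentheses make it an expression evaluated with the table's columns in scope. A minimal sketch on invented toy data; the column names mirror the thread:]

```r
library(data.table)

db <- data.table(id = 1:4,
                 text = c("see http://a.example", "no link here",
                          "http://b.example", "plain text"),
                 has_url = c(TRUE, FALSE, TRUE, FALSE))

# db[has_url] would search the calling environment for an object named
# has_url; the parentheses force evaluation inside the table instead.
with_urls <- db[(has_url)]
same_rows <- db[has_url == TRUE]
```

Both forms select the same rows; the `(has_url)` form just skips building the intermediate `== TRUE` comparison vector.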
Secondly, I tried this form:

system.time( b <- db[(has_url), j=list(urls = str_match_all(text, url_pattern)), by=id] )

The problem is that it only accepts one value per row, so the output format looks exactly like what I want - but:

> nrow(db)  # all records
[1] 146058
> nrow(a)  # using the function getUrls
[1] 30019
> nrow(b)  # using str_match_all directly with j=list
[1] 11007
> length(unique(a$id))  # similar number of IDs, but not similar number of URLs
[1] 11007
> length(unique(b$id))
[1] 11007

thanks again,
Stian

On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev wrote:
> [...]
--
http://reganmian.net/blog -- Random Stuff that Matters
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From saporta at scarletmail.rutgers.edu Fri Sep 27 17:48:04 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Fri, 27 Sep 2013 11:48:04 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To:
References: <52457EA9.8000803@mdowle.plus.com>
Message-ID:

Hi Stian,

Try the following two and look at the difference:

db[T, matches := str_match_all(text, url_pattern)]
db[.(T), matches := str_match_all(text, url_pattern)]

;)

On Fri, Sep 27, 2013 at 11:21 AM, Stian Håklev wrote:
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shaklev at gmail.com Fri Sep 27 18:20:20 2013
From: shaklev at gmail.com (=?UTF-8?Q?Stian_H=C3=A5klev?=)
Date: Fri, 27 Sep 2013 12:20:20 -0400
Subject: [datatable-help] Using data.table to run a function on every row
In-Reply-To:
References: <52457EA9.8000803@mdowle.plus.com>
Message-ID:

> system.time( db[T, matches := str_match_all(text, url_pattern)] )
   user  system elapsed
 19.610   0.475  20.304
> system.time( db[.(T), matches := str_match_all(text, url_pattern)] )
Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text, url_pattern))) :
  All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
Timing stopped at: 6.339 0.043 6.403

On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta <saporta at scarletmail.rutgers.edu> wrote:
> [...]
(I'm extracting several URL links from each forum post, and >>>>>> tagging them with the forum post they came from). >>>>>> >>>>>> I tried doing this with a data.table >>>>>> >>>>>> a <- db[has_url == T, getUrls(text, id)] >>>>>> >>>>>> and get the message >>>>>> >>>>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, >>>>>> 4L, : >>>>>> replacement has 11007 rows, data has 29787 >>>>>> >>>>>> Because some rows have several URLs... However, I don't care that >>>>>> these rowlengths don't match, I still want these rows :) I thought J would >>>>>> just let me execute arbitrary R code in the context of the rows as variable >>>>>> names, etc. >>>>>> >>>>>> Here's the function it's running, but that shouldn't be relevant >>>>>> >>>>>> getUrls <- function(text, id) { >>>>>> matches <- str_match_all(text, url_pattern) >>>>>> a <- data.frame(urls=unlist(matches)) >>>>>> a$id <- id >>>>>> a >>>>>> } >>>>>> >>>>>> >>>>>> Thanks, and thanks for an amazing package - data.table has made my >>>>>> life so much easier. It should be part of base, I think. >>>>>> Stian Haklev, University of Toronto >>>>>> >>>>>> -- >>>>>> http://reganmian.net/blog -- Random Stuff that Matters >>>>>> >>>>>> _______________________________________________ >>>>>> datatable-help mailing list >>>>>> datatable-help at lists.r-forge.r-project.org >>>>>> >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>>> >>>>> >>>>> >>>> >>> >>> >>> _______________________________________________ >>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>> >>> >>> >> >> >> -- >> http://reganmian.net/blog -- Random Stuff that Matters >> > > -- http://reganmian.net/blog -- Random Stuff that Matters -------------- next part -------------- An HTML attachment was scrubbed... 
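[Editor's note: Ricardo's fix — group by id so the data.frame built inside j only ever recycles a single id — can be sketched on a toy stand-in for `db` (data made up, `url_pattern` simplified; neither is shown in the thread).]

```r
library(data.table)
library(stringr)

url_pattern <- "https?://[^[:space:]]+"
db <- data.table(id      = 1:3,
                 text    = c("http://a.com http://b.com", "no url", "http://c.com"),
                 has_url = c(TRUE, FALSE, TRUE))

getUrls <- function(text, id) {
  matches <- str_match_all(text, url_pattern)
  data.frame(urls = unlist(matches), id = id)
}

# Without by=id, one call to getUrls sees every id at once, and R cannot
# recycle the id vector when posts contain different numbers of URLs.
# With by=id, each call receives a single id, which recycles cleanly.
a <- db[(has_url), getUrls(text, id), by = id]
nrow(a)  # 3: two URLs from post 1, one from post 3
```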
URL: From saporta at scarletmail.rutgers.edu Fri Sep 27 19:25:19 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 27 Sep 2013 13:25:19 -0400 Subject: [datatable-help] Using data.table to run a function on every row In-Reply-To: References: <52457EA9.8000803@mdowle.plus.com> Message-ID: hm... not sure about `j` (sorry, I havent taken a close look at your code), but my comment was to point out that these two statements are different: DT [ TRUE, ] DT [ .(TRUE), ] The first one is giving you the whole data.table DT[TRUE, ] is the same as DT (since TRUE is getting recycled) The second one is giving you all rows within DT where the first column of the key has a value of TRUE. Ricardo Saporta Graduate Student, Data Analytics Rutgers University, New Jersey e: saporta at rutgers.edu On Fri, Sep 27, 2013 at 12:20 PM, Stian H?klev wrote: > > system.time( db[T, matches := str_match_all(text, url_pattern)] ) > user system elapsed > 19.610 0.475 20.304 > > system.time( db[.(T), matches := str_match_all(text, url_pattern)] ) > Error in `[.data.table`(db, .(T), `:=`(matches, str_match_all(text, > url_pattern))) : > All items in j=list(...) should be atomic vectors or lists. If you are > trying something like j=list(.SD,newcol=mean(colA)) then use := by group > instead (much quicker), or cbind or merge afterwards. > Timing stopped at: 6.339 0.043 6.403 > > > On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta < > saporta at scarletmail.rutgers.edu> wrote: > >> Hi Stian, >> >> Try the following two and look at the difference: >> >> db[T, matches := str_match_all(text, url_pattern)] >> db[.(T), matches := str_match_all(text, url_pattern)] >> >> ;) >> >> >> >> On Fri, Sep 27, 2013 at 11:21 AM, Stian H?klev wrote: >> >>> I really appreciate all your help - amazingly supportive community. 
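[Editor's note: the `DT[TRUE, ]` versus `DT[.(TRUE), ]` distinction Ricardo draws above can be seen on a tiny keyed table; this is a made-up example, not the thread's `db`.]

```r
library(data.table)

DT <- data.table(flag = c(TRUE, FALSE, TRUE), x = 1:3)
setkey(DT, flag)

nrow(DT[TRUE])     # 3: a length-1 logical i is recycled, so this is all of DT
nrow(DT[.(TRUE)])  # 2: .(TRUE) joins to the key, returning rows where flag is TRUE
```

This is also why `db[T, ...]` above touched every row (a full-table assignment) while `db[.(T), ...]` attempted a keyed join instead.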
I >>> could probably figure out a "brute-force" way of doing things, but since >>> I'm going to be writing a lot of R in the future too, I always want to find >>> the "correct" way of doing it, which both looks clear, and is quick. (I >>> come from a background in Ruby, and am always interested in writing very >>> clear and DRY (do not repeat yourself) code, but I find I still spend a lot >>> of time in R struggling with various data formats - lists, nested lists, >>> vectors, matrices, different forms of apply/ddply/for loops etc). >>> >>> Anyway, a few different points. >>> >>> I tried db[has_url,], but got "object has_url not found" >>> >>> I then tried setkey(db, "has_url"), and using this, but somehow it was a >>> lot slower than what I used to do (I repeated a few times). Not sure if I'm >>> doing it wrong. (Not important - even 15 sec is totally fine, I'll only run >>> this once. But good to understand the underlying principles). >>> >>> setkey(db, "has_url") >>> > system.time( db[T, matches := str_match_all(text, url_pattern)] ) >>> user system elapsed >>> 17.514 0.334 17.847 >>> > system.time( db[has_url == T, matches := str_match_all(text, >>> url_pattern)] ) >>> user system elapsed >>> 5.943 0.040 5.984 >>> >>> The second point was how to get out the matches. The idea was that you >>> have a text field which might contain several urls, which I want to >>> extract, but I need each URL tagged with the row it came from (so I can >>> link it back to properties of the post and author, look at whether certain >>> students are more likely to post certain kinds of URLs etc). >>> >>> Instead of a function, you'll see above that I rewrote it to use :=, >>> which creates a new column that holds a list. That worked wonderfully, but >>> now how do I get these "out" of this data.table, and into a new one. 
>>> Made-up example data: >>> a <- c(1,2,3) >>> b <- c(2,3,4) >>> dt <- data.table(names=c("Stian", "Christian", "John"), >>> numbers=list(a,b, NULL)) >>> >>> Now my goal is to have a new data.table that looks like this >>> Name Number >>> Stian 1 >>> Stian 2 >>> Stian 3 >>> Christian 2 >>> Christian 3 >>> Christian 4 >>> >>> Again, I'm sure I could do this with a for() or lapply? But I'd love to >>> see the most elegant solution. >>> >>> Note that this: >>> >>> getUrls <- function(text, id) { >>> matches <- str_match_all(text, url_pattern) >>> data.frame(urls=unlist(matches), id=id) >>> } >>> >>> system.time( a <- db[(has_url), getUrls(text, id), by=id] ) >>> >>> Works perfectly, the result is
>>>    id   urls                                                                                                 id
>>> 1  16   https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166                              16
>>> 2  24   http://www.youtube.com/watch?v=JUiGF4TGI9w                                                           24
>>> 3  44   http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/           44
>>> 4  61   http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html                      61
>>> 5  75   http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html   75
>>> 6  75   https://www.facebook.com/photo.php?fbid=10151324672623754                                            75
>>> >>> which is exactly what I was looking for. So I've really reached my goal, >>> but I'm curious about the other method as well. >>> >>> Thanks! >>> Stian >>> >>> >>> On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle wrote: >>> >>>> >>>> That was my thought too. I don't know what str_match_all is, but >>>> given the unlist() in getUrls(), it seems to return a list. Rather than >>>> unlist(), leave it as list, and data.table should happily make a `list` >>>> column where each cell is itself a vector. In fact each cell can be >>>> anything at all, even embedded data.table, function definitions, or any >>>> type of object. >>>> You might need a list(list(str_match_all(...))) in j to do that. >>>> >>>> Or what Rick has suggested here might work first time. 
It's hard to >>>> visualise it without a small reproducible example, so we're having to make >>>> educated guesses. >>>> >>>> Many thanks for the kind words about data.table. >>>> >>>> Matthew >>>> >>>> >>>> >>>> On 27/09/13 07:44, Ricardo Saporta wrote: >>>> >>>> In fact, you should be able to skip the function altogether and just >>>> use: >>>> >>>> db[ (has_url), str_match_all(text, url_pattern), by=id] >>>> >>>> >>>> (and now, my apologies to all for the email clutter) >>>> good night >>>> >>>> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta < >>>> saporta at scarletmail.rutgers.edu> wrote: >>>> >>>>> sorry, I probably should have elaborated (it's late here, in NJ) >>>>> >>>>> The error you are seeing is most likely coming from your getURL >>>>> function in that you are adding several ids to a data.frame of varying >>>>> rows, and `R` cannot recycle it correctly. >>>>> >>>>> If you instead breakdown by id, then each time you are only >>>>> assigning one id and R will be able to recycle appropriately, without >>>>> issue. >>>>> >>>>> good luck! >>>>> Rick >>>>> >>>>> >>>>> Ricardo Saporta >>>>> Graduate Student, Data Analytics >>>>> Rutgers University, New Jersey >>>>> e: saporta at rutgers.edu >>>>> >>>>> >>>>> >>>>> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta < >>>>> saporta at scarletmail.rutgers.edu> wrote: >>>>> >>>>>> Hi there, >>>>>> >>>>>> Try inserting a `by=id` in >>>>>> >>>>>> a <- db[(has_url), getUrls(text, id), by=id] >>>>>> >>>>>> Also, no need for "has_url == T" >>>>>> instead, use >>>>>> (has_url) >>>>>> If the variable is alread logical. 
(Otherwise, you are just slowing >>>>>> things down ;) >>>>>> >>>>>> >>>>>> >>>>>> Ricardo Saporta >>>>>> Graduate Student, Data Analytics >>>>>> Rutgers University, New Jersey >>>>>> e: saporta at rutgers.edu >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Sep 26, 2013 at 11:16 PM, Stian H?klev wrote: >>>>>> >>>>>>> I'm trying to run a function on every row fulfilling a certain >>>>>>> criterium, which returns a data frame - the idea is then to take the list >>>>>>> of data frames and rbindlist them together for a totally separate >>>>>>> data.table. (I'm extracting several URL links from each forum post, and >>>>>>> tagging them with the forum post they came from). >>>>>>> >>>>>>> I tried doing this with a data.table >>>>>>> >>>>>>> a <- db[has_url == T, getUrls(text, id)] >>>>>>> >>>>>>> and get the message >>>>>>> >>>>>>> Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 6L, 1L, 2L, >>>>>>> 4L, : >>>>>>> replacement has 11007 rows, data has 29787 >>>>>>> >>>>>>> Because some rows have several URLs... However, I don't care that >>>>>>> these rowlengths don't match, I still want these rows :) I thought J would >>>>>>> just let me execute arbitrary R code in the context of the rows as variable >>>>>>> names, etc. >>>>>>> >>>>>>> Here's the function it's running, but that shouldn't be relevant >>>>>>> >>>>>>> getUrls <- function(text, id) { >>>>>>> matches <- str_match_all(text, url_pattern) >>>>>>> a <- data.frame(urls=unlist(matches)) >>>>>>> a$id <- id >>>>>>> a >>>>>>> } >>>>>>> >>>>>>> >>>>>>> Thanks, and thanks for an amazing package - data.table has made my >>>>>>> life so much easier. It should be part of base, I think. 
>>>>>>> Stian Haklev, University of Toronto >>>>>>> >>>>>>> -- >>>>>>> http://reganmian.net/blog -- Random Stuff that Matters >>>>>>> >>>>>>> _______________________________________________ >>>>>>> datatable-help mailing list >>>>>>> datatable-help at lists.r-forge.r-project.org >>>>>>> >>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> _______________________________________________ >>>> datatable-help mailing listdatatable-help at lists.r-forge.r-project.orghttps://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>> >>> >>> -- >>> http://reganmian.net/blog -- Random Stuff that Matters >>> >> >> > > > -- > http://reganmian.net/blog -- Random Stuff that Matters > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Fri Sep 27 20:49:15 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Fri, 27 Sep 2013 19:49:15 +0100 Subject: [datatable-help] Using data.table to run a function on every row In-Reply-To: References: <52457EA9.8000803@mdowle.plus.com> Message-ID: <5245D32B.3020304@mdowle.plus.com> Stian, datatable-help isn't really for this kind of question. It's a very good question and belongs on S.O. where you can edit it given comments. datatable-help is more for discussion about future developments, notices, things that aren't allowed on S.O., etc. This was your example : > a <- c(1,2,3) > b <- c(2,3,4) > dt <- data.table(names=c("Stian", "Christian", "John"), numbers=list(a,b, NULL)) The output of that is : > dt names numbers 1: Stian 1,2,3 2: Christian 2,3,4 3: John Are you possibly mistaken about the output of list columns? Those commas are just how it displays. They aren't strings in the numbers column. The `numbers` column is a list column where each item is a vector. 
To get the output you asked for it's just : > dt[,unlist(numbers),by=names] names V1 1: Stian 1 2: Stian 2 3: Stian 3 4: Christian 2 5: Christian 3 6: Christian 4 > If I've misunderstood, then please start again with a new question on S.O. http://stackoverflow.com/questions/tagged/data.table Thanks, Matthew On 27/09/13 18:25, Ricardo Saporta wrote: > hm... not sure about `j` (sorry, I havent taken a close look at your > code), but my comment was to point out that these two statements are > different: > > DT [ TRUE, ] > DT [ .(TRUE), ] > > The first one is giving you the whole data.table > DT[TRUE, ] is the same as DT > (since TRUE is getting recycled) > > The second one is giving you all rows within DT where the first column > of the key has a value of TRUE. > > > > Ricardo Saporta > Graduate Student, Data Analytics > Rutgers University, New Jersey > e: saporta at rutgers.edu > > > > On Fri, Sep 27, 2013 at 12:20 PM, Stian H?klev > wrote: > > > system.time( db[T, matches := str_match_all(text, url_pattern)] ) > user system elapsed > 19.610 0.475 20.304 > > system.time( db[.(T), matches := str_match_all(text, url_pattern)] ) > Error in `[.data.table`(db, .(T), `:=`(matches, > str_match_all(text, url_pattern))) : > All items in j=list(...) should be atomic vectors or lists. If > you are trying something like j=list(.SD,newcol=mean(colA)) then > use := by group instead (much quicker), or cbind or merge afterwards. > Timing stopped at: 6.339 0.043 6.403 > > > On Fri, Sep 27, 2013 at 11:48 AM, Ricardo Saporta > > wrote: > > Hi Stian, > > Try the following two and look at the difference: > > db[T, matches := str_match_all(text, url_pattern)] > db[.(T), matches := str_match_all(text, url_pattern)] > > ;) > > > > On Fri, Sep 27, 2013 at 11:21 AM, Stian H?klev > > wrote: > > I really appreciate all your help - amazingly supportive > community. 
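[Editor's note: Matthew's `unlist()`-by-group idiom above is self-contained and can be run end to end; naming the output column in j avoids the default `V1`.]

```r
library(data.table)

a  <- c(1, 2, 3)
b  <- c(2, 3, 4)
dt <- data.table(names   = c("Stian", "Christian", "John"),
                 numbers = list(a, b, NULL))

# One output row per element of each list cell; the NULL cell yields no rows,
# so John drops out of the result entirely.
long <- dt[, list(number = unlist(numbers)), by = names]
nrow(long)  # 6: three rows for Stian, three for Christian
```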
I could probably figure out a "brute-force" way > of doing things, but since I'm going to be writing a lot > of R in the future too, I always want to find the > "correct" way of doing it, which both looks clear, and is > quick. (I come from a background in Ruby, and am always > interested in writing very clear and DRY (do not repeat > yourself) code, but I find I still spend a lot of time in > R struggling with various data formats - lists, nested > lists, vectors, matrices, different forms of > apply/ddply/for loops etc). > > Anyway, a few different points. > > I tried db[has_url,], but got "object has_url not found" > > I then tried setkey(db, "has_url"), and using this, but > somehow it was a lot slower than what I used to do (I > repeated a few times). Not sure if I'm doing it wrong. > (Not important - even 15 sec is totally fine, I'll only > run this once. But good to understand the underlying > principles). > > setkey(db, "has_url") > > system.time( db[T, matches := str_match_all(text, > url_pattern)] ) > user system elapsed > 17.514 0.334 17.847 > > system.time( db[has_url == T, matches := > str_match_all(text, url_pattern)] ) > user system elapsed > 5.943 0.040 5.984 > > The second point was how to get out the matches. The idea > was that you have a text field which might contain several > urls, which I want to extract, but I need each URL tagged > with the row it came from (so I can link it back to > properties of the post and author, look at whether certain > students are more likely to post certain kinds of URLs etc). > > Instead of a function, you'll see above that I rewrote it > to use :=, which creates a new column that holds a list. > That worked wonderfully, but now how do I get these "out" > of this data.table, and into a new one. 
> > Made-up example data: > a <- c(1,2,3) > b <- c(2,3,4) > dt <- data.table(names=c("Stian", "Christian", "John"), > numbers=list(a,b, NULL)) > > Now my goal is to have a new data.table that looks like this > Name Number > Stian 1 > Stian 2 > Stian 3 > Christian 2 > Christian 3 > Christian 4 > > Again, I'm sure I could do this with a for() or lapply? > But I'd love to see the most elegant solution. > > Note that this: > > getUrls <- function(text, id) { > matches <- str_match_all(text, url_pattern) > data.frame(urls=unlist(matches), id=id) > } > > system.time( a <- db[(has_url), getUrls(text, id), by=id] ) > > Works perfectly, the result is > > id urls id > 1 16 > https://class.coursera.org/aboriginaled-001/forum/thread?thread_id=166 > 16 > 2 24 http://www.youtube.com/watch?v=JUiGF4TGI9w 24 > 3 44 > http://www.cbc.ca/revisionquest/blog/2010/07/21/july-21-july-24-the-metis-keeping-it-riel/ > 44 > 4 61 > http://www.support-native-american-art.com/Native-American-Medicine-Wheels.html > 61 > 5 75 > http://indigenousfoundations.arts.ubc.ca/home/government-policy/the-residential-school-system.html > 75 > 6 75 > https://www.facebook.com/photo.php?fbid=10151324672623754 75 > > > which is exactly what I was looking for. So I've really > reached my goal, but I'm curious about the other method as > well. > > Thanks! > Stian > > > On Fri, Sep 27, 2013 at 8:48 AM, Matthew Dowle > > > wrote: > > > That was my thought too. I don't know what > str_match_all is, but given the unlist() in > getUrls(), it seems to return a list. Rather than > unlist(), leave it as list, and data.table should > happily make a `list` column where each cell is itself > a vector. In fact each cell can be anything at all, > even embedded data.table, function definitions, or any > type of object. > You might need a list(list(str_match_all(...))) in j > to do that. > > Or what Rick has suggested here might work first > time. 
It's hard to visualise it without a small > reproducible example, so we're having to make educated > guesses. > > Many thanks for the kind words about data.table. > > Matthew > > > > On 27/09/13 07:44, Ricardo Saporta wrote: >> In fact, you should be able to skip the function >> altogether and just use: >> >> db[ (has_url), str_match_all(text, url_pattern), >> by=id] >> >> >> (and now, my apologies to all for the email clutter) >> good night >> >> On Fri, Sep 27, 2013 at 2:41 AM, Ricardo Saporta >> > > wrote: >> >> sorry, I probably should have elaborated (it's >> late here, in NJ) >> >> The error you are seeing is most likely coming >> from your getURL function in that you are adding >> several ids to a data.frame of varying rows, and >> `R` cannot recycle it correctly. >> >> If you instead breakdown by id, then each time >> you are only assigning one id and R will be able >> to recycle appropriately, without issue. >> >> good luck! >> Rick >> >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu >> >> >> >> On Fri, Sep 27, 2013 at 2:37 AM, Ricardo Saporta >> > > wrote: >> >> Hi there, >> >> Try inserting a `by=id` in >> >> a <- db[(has_url), getUrls(text, id), by=id] >> >> Also, no need for "has_url == T" >> instead, use >> (has_url) >> If the variable is alread logical. >> (Otherwise, you are just slowing things down ;) >> >> >> >> Ricardo Saporta >> Graduate Student, Data Analytics >> Rutgers University, New Jersey >> e: saporta at rutgers.edu >> >> >> >> >> On Thu, Sep 26, 2013 at 11:16 PM, Stian >> H?klev > > wrote: >> >> I'm trying to run a function on every row >> fulfilling a certain criterium, which >> returns a data frame - the idea is then >> to take the list of data frames and >> rbindlist them together for a totally >> separate data.table. (I'm extracting >> several URL links from each forum post, >> and tagging them with the forum post they >> came from). 
>> >> I tried doing this with a data.table >> >> a <- db[has_url == T, getUrls(text, id)] >> >> and get the message >> >> Error in `$<-.data.frame`(`*tmp*`, "id", >> value = c(1L, 6L, 1L, 2L, 4L, : >> replacement has 11007 rows, data has 29787 >> >> Because some rows have several URLs... >> However, I don't care that these >> rowlengths don't match, I still want >> these rows :) I thought J would just let >> me execute arbitrary R code in the >> context of the rows as variable names, etc. >> >> Here's the function it's running, but >> that shouldn't be relevant >> >> getUrls <- function(text, id) { >> matches <- str_match_all(text, url_pattern) >> a <- data.frame(urls=unlist(matches)) >> a$id <- id >> a >> } >> >> >> Thanks, and thanks for an amazing package >> - data.table has made my life so much >> easier. It should be part of base, I think. >> Stian Haklev, University of Toronto >> >> -- >> http://reganmian.net/blog -- Random Stuff >> that Matters >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> >> >> >> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > -- > http://reganmian.net/blog -- Random Stuff that Matters > > > > > > -- > http://reganmian.net/blog -- Random Stuff that Matters > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Fri Sep 27 21:01:44 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 27 Sep 2013 15:01:44 -0400 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: running some benchmarks at work, I got the following comparing unique.data.frame to the new unique(.. , by=..) > microbenchmark(eval(uDF), eval(uDT)) Unit: milliseconds expr min lq median uq max neval eval(uDF) 28.38505 29.368062 31.705633 33.53874 52.57522 100 eval(uDT) 6.61314 7.220897 7.597114 9.58860 78.82127 100 well done! On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Last update here :-) > > After more hemming and hawing, I've changed the name of the new > parameter added to duplicated.data.table and unique.data.table from > `by.columnss` to just `by`, as it (more or less) is the same idea as > the `by` in dt[x, i,j,by,...] > > Sorry for any inconveniences caused if you've been working off of the > development version. > > -steve > > > On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta > wrote: > > Steve, great stuff!! > > thanks for making that happen > > > > Rick > > > > > > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou > > wrote: > >> > >> Hi all, > >> > >> As I needed this sooner than I had expected, I just committed this > >> change. It's in svn revision 889. > >> > >> I chose 'by.columns' as the parameter names -- seemed to make more > >> sense to me, and using the short hand interactively saves a letter, > >> eg: unique(dt, by=c('some', 'columns')) ;-) > >> > >> Here's the note from the NEWS file: > >> > >> o "Uniqueness" tests can now specify arbirtray combinations of > >> columns to use to test for duplicates. `by.columns` parameter added to > >> unique.data.table and duplicated.data.table. 
This allows the user to > >> test for uniqueness using any combination of columns in the > >> data.table, where previously the user only had the option to use the > >> keyed columns (if keyed) or all columns (if not). The default behavior > >> sets `by.columns=key(dt)` to maintain backward compatability. See > >> man/duplicated.Rd and tests 986:991 for more information. Thanks to > >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful > >> discussions. > >> > >> Should work as advertised assuming my unit tests weren't too simplistic. > >> > >> Cheers, > >> > >> -steve > >> > >> > >> > >> > >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou > >> wrote: > >> > Thanks for the suggestions, folks. > >> > > >> > Matthew: do you have a preference? > >> > > >> > -steve > >> > > >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta > >> > wrote: > >> >> Steve, > >> >> > >> >> I like your suggestion a lot. I can see putting column specification > >> >> to > >> >> good use. > >> >> > >> >> As for the argument name, perhaps > >> >> 'use.columns' > >> >> > >> >> And where a value of NULL or FALSE will yield same results as > >> >> `unique.data.frame` > >> >> > >> >> use.columns=key(x) # default behavior > >> >> use.columns=c("col1name", "col7name") #etc > >> >> use.columns=NULL > >> >> > >> >> > >> >> Thanks as always, > >> >> Rick > >> >> > >> >> > >> >> > >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou > >> >> wrote: > >> >>> > >> >>> Hi folks, > >> >>> > >> >>> I actually want to revisit the fix I made here. > >> >>> > >> >>> Instead of having `use.key` in the signature to unique.data.table > (and > >> >>> duplicated.data.table) to be: > >> >>> > >> >>> function(x, > >> >>> incomparables=FALSE, > >> >>> tolerance=.Machine$double.eps ^ 0.5, > >> >>> use.key=TRUE, ...) 
> >> >>> > >> >>> How about we switch out use.key for a parameter that specifies the > >> >>> column names to use in the uniqueness check, which defaults to > key(x) > >> >>> to keep backwards compatibility. > >> >>> > >> >>> For argument's sake (like that?), lets call this parameter `columns` > >> >>> (by.columns? with.columns? whatever) so: > >> >>> > >> >>> function(x, > >> >>> incomparables=FALSE, > >> >>> tolerance=.Machine$double.eps ^ 0.5, > >> >>> columns=key(x), ...) > >> >>> > >> >>> Then: > >> >>> > >> >>> (1) leaving it alone is the backward compatibile behavior; > >> >>> (2) Perhaps setting it to NULL will use all columns, and make it > >> >>> equivalent to unique.data.frame (also the same when x has no key); > and > >> >>> (3) setting it to any other combo of columns uses those columns as > the > >> >>> uniqueness key and filters the rows (only) out of x accordingly. > >> >>> > >> >>> What do you folks think? Personally I think this is better on all > >> >>> accounts then just specifying to use the key or not and the only > >> >>> question in my mind is the name of the argument -- happy to hear > other > >> >>> world views, however, so don't be shy. > >> >>> > >> >>> Thanks, > >> >>> -steve > >> >>> > >> >>> -- > >> >>> Steve Lianoglou > >> >>> Computational Biologist > >> >>> Bioinformatics and Computational Biology > >> >>> Genentech > >> >> > >> >> > >> > > >> > > >> > > >> > -- > >> > Steve Lianoglou > >> > Computational Biologist > >> > Bioinformatics and Computational Biology > >> > Genentech > >> > >> > >> > >> -- > >> Steve Lianoglou > >> Computational Biologist > >> Bioinformatics and Computational Biology > >> Genentech > > > > > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From saporta at scarletmail.rutgers.edu Fri Sep 27 21:09:12 2013 From: saporta at scarletmail.rutgers.edu (Ricardo Saporta) Date: Fri, 27 Sep 2013 15:09:12 -0400 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <0F857B92DB0744C69CFC07AAE0C4DCF4@gmail.com> <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: Steve, not to beat a dead horse on the "what to name the new parameter" discussion, but I'm wondering what your/others' thoughts are on using something other than 'by". Maybe even "uby" Or perhaps we can have a synonym in the function definition: .. function(........ , by=uby, uby) The reason I bring this up is that as I begin to use this and I am reading over my own code, I realize that it takes a lot of visual parsing to distinguish when the "by" in a complex call belongs to "[.data.table" and when the "by" belongs to "unique.data.table" Cheers, Rick On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Last update here :-) > > After more hemming and hawing, I've changed the name of the new > parameter added to duplicated.data.table and unique.data.table from > `by.columnss` to just `by`, as it (more or less) is the same idea as > the `by` in dt[x, i,j,by,...] > > Sorry for any inconveniences caused if you've been working off of the > development version. > > -steve > > > On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta > wrote: > > Steve, great stuff!! > > thanks for making that happen > > > > Rick > > > > > > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou > > wrote: > >> > >> Hi all, > >> > >> As I needed this sooner than I had expected, I just committed this > >> change. It's in svn revision 889. 
> >> > >> I chose 'by.columns' as the parameter names -- seemed to make more > >> sense to me, and using the short hand interactively saves a letter, > >> eg: unique(dt, by=c('some', 'columns')) ;-) > >> > >> Here's the note from the NEWS file: > >> > >> o "Uniqueness" tests can now specify arbirtray combinations of > >> columns to use to test for duplicates. `by.columns` parameter added to > >> unique.data.table and duplicated.data.table. This allows the user to > >> test for uniqueness using any combination of columns in the > >> data.table, where previously the user only had the option to use the > >> keyed columns (if keyed) or all columns (if not). The default behavior > >> sets `by.columns=key(dt)` to maintain backward compatability. See > >> man/duplicated.Rd and tests 986:991 for more information. Thanks to > >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful > >> discussions. > >> > >> Should work as advertised assuming my unit tests weren't too simplistic. > >> > >> Cheers, > >> > >> -steve > >> > >> > >> > >> > >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou > >> wrote: > >> > Thanks for the suggestions, folks. > >> > > >> > Matthew: do you have a preference? > >> > > >> > -steve > >> > > >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta > >> > wrote: > >> >> Steve, > >> >> > >> >> I like your suggestion a lot. I can see putting column specification > >> >> to > >> >> good use. > >> >> > >> >> As for the argument name, perhaps > >> >> 'use.columns' > >> >> > >> >> And where a value of NULL or FALSE will yield same results as > >> >> `unique.data.frame` > >> >> > >> >> use.columns=key(x) # default behavior > >> >> use.columns=c("col1name", "col7name") #etc > >> >> use.columns=NULL > >> >> > >> >> > >> >> Thanks as always, > >> >> Rick > >> >> > >> >> > >> >> > >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou > >> >> wrote: > >> >>> > >> >>> Hi folks, > >> >>> > >> >>> I actually want to revisit the fix I made here. 
> >> >>> > >> >>> Instead of having `use.key` in the signature to unique.data.table > (and > >> >>> duplicated.data.table) to be: > >> >>> > >> >>> function(x, > >> >>> incomparables=FALSE, > >> >>> tolerance=.Machine$double.eps ^ 0.5, > >> >>> use.key=TRUE, ...) > >> >>> > >> >>> How about we switch out use.key for a parameter that specifies the > >> >>> column names to use in the uniqueness check, which defaults to > key(x) > >> >>> to keep backwards compatibility. > >> >>> > >> >>> For argument's sake (like that?), lets call this parameter `columns` > >> >>> (by.columns? with.columns? whatever) so: > >> >>> > >> >>> function(x, > >> >>> incomparables=FALSE, > >> >>> tolerance=.Machine$double.eps ^ 0.5, > >> >>> columns=key(x), ...) > >> >>> > >> >>> Then: > >> >>> > >> >>> (1) leaving it alone is the backward compatibile behavior; > >> >>> (2) Perhaps setting it to NULL will use all columns, and make it > >> >>> equivalent to unique.data.frame (also the same when x has no key); > and > >> >>> (3) setting it to any other combo of columns uses those columns as > the > >> >>> uniqueness key and filters the rows (only) out of x accordingly. > >> >>> > >> >>> What do you folks think? Personally I think this is better on all > >> >>> accounts then just specifying to use the key or not and the only > >> >>> question in my mind is the name of the argument -- happy to hear > other > >> >>> world views, however, so don't be shy. 
> >> >>> > >> >>> Thanks, > >> >>> -steve > >> >>> > >> >>> -- > >> >>> Steve Lianoglou > >> >>> Computational Biologist > >> >>> Bioinformatics and Computational Biology > >> >>> Genentech > >> >> > >> >> > >> > > >> > > >> > > >> > -- > >> > Steve Lianoglou > >> > Computational Biologist > >> > Bioinformatics and Computational Biology > >> > Genentech > >> > >> > >> > >> -- > >> Steve Lianoglou > >> Computational Biologist > >> Bioinformatics and Computational Biology > >> Genentech > > > > > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdowle at mdowle.plus.com Sat Sep 28 09:29:40 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Sat, 28 Sep 2013 08:29:40 +0100 Subject: [datatable-help] unique.data.frame should create a copy, right? In-Reply-To: References: <2A8FA620F9814DE48058FCA9F41C0AE6@gmail.com> <6C1BF9F6C1454190AA0457AA23DFB386@gmail.com> Message-ID: <52468564.3040801@mdowle.plus.com> Oh, good point. How about putting 'by' first in those situations : > DT = data.table(A=rep(1:3,2),B=1:2) > unique(by="A",DT) A B 1: 1 1 2: 2 2 3: 3 1 > unique(by="B",DT) A B 1: 1 1 2: 2 2 > On 27/09/13 20:09, Ricardo Saporta wrote: > Steve, not to beat a dead horse on the "what to name the new > parameter" discussion, but I'm wondering what your/others' thoughts > are on using something other than 'by". Maybe even "uby" > > Or perhaps we can have a synonym in the function definition: > .. function(........ 
, by=uby, uby) > > The reason I bring this up is that as I begin to use this and I am > reading over my own code, I realize that it takes a lot of visual > parsing to distinguish when the "by" in a complex call belongs to > "[.data.table" and when the "by" belongs to "unique.data.table" > > Cheers, > Rick > > > On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou > > wrote: > > Last update here :-) > > After more hemming and hawing, I've changed the name of the new > parameter added to duplicated.data.table and unique.data.table from > `by.columns` to just `by`, as it (more or less) is the same idea as > the `by` in dt[x, i,j,by,...] > > Sorry for any inconveniences caused if you've been working off of the > development version. > > -steve > > > On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta > > wrote: > > Steve, great stuff!! > > thanks for making that happen > > > > Rick > > > > > > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou > > > wrote: > >> > >> Hi all, > >> > >> As I needed this sooner than I had expected, I just committed this > >> change. It's in svn revision 889. > >> > >> I chose 'by.columns' as the parameter name -- seemed to make more > >> sense to me, and using the short hand interactively saves a letter, > >> eg: unique(dt, by=c('some', 'columns')) ;-) > >> > >> Here's the note from the NEWS file: > >> > >> o "Uniqueness" tests can now specify arbitrary combinations of > >> columns to use to test for duplicates. `by.columns` parameter > added to > >> unique.data.table and duplicated.data.table. This allows the > user to > >> test for uniqueness using any combination of columns in the > >> data.table, where previously the user only had the option to > use the > >> keyed columns (if keyed) or all columns (if not). The default > behavior > >> sets `by.columns=key(dt)` to maintain backward compatibility. See > >> man/duplicated.Rd and tests 986:991 for more information.
Thanks to > >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for > useful > >> discussions. > >> > >> Should work as advertised assuming my unit tests weren't too > simplistic. > >> > >> Cheers, > >> > >> -steve > >> > >> > >> > >> > >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou > >> > wrote: > >> > Thanks for the suggestions, folks. > >> > > >> > Matthew: do you have a preference? > >> > > >> > -steve > >> > > >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta > >> > > wrote: > >> >> Steve, > >> >> > >> >> I like your suggestion a lot. I can see putting column > specification > >> >> to > >> >> good use. > >> >> > >> >> As for the argument name, perhaps > >> >> 'use.columns' > >> >> > >> >> And where a value of NULL or FALSE will yield same results as > >> >> `unique.data.frame` > >> >> > >> >> use.columns=key(x) # default behavior > >> >> use.columns=c("col1name", "col7name") #etc > >> >> use.columns=NULL > >> >> > >> >> > >> >> Thanks as always, > >> >> Rick > >> >> > >> >> > >> >> > >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou > >> >> > wrote: > >> >>> > >> >>> Hi folks, > >> >>> > >> >>> I actually want to revisit the fix I made here. > >> >>> > >> >>> Instead of having `use.key` in the signature to > unique.data.table (and > >> >>> duplicated.data.table) to be: > >> >>> > >> >>> function(x, > >> >>> incomparables=FALSE, > >> >>> tolerance=.Machine$double.eps ^ 0.5, > >> >>> use.key=TRUE, ...) > >> >>> > >> >>> How about we switch out use.key for a parameter that > specifies the > >> >>> column names to use in the uniqueness check, which defaults > to key(x) > >> >>> to keep backwards compatibility. > >> >>> > >> >>> For argument's sake (like that?), lets call this parameter > `columns` > >> >>> (by.columns? with.columns? whatever) so: > >> >>> > >> >>> function(x, > >> >>> incomparables=FALSE, > >> >>> tolerance=.Machine$double.eps ^ 0.5, > >> >>> columns=key(x), ...) 
> >> >>> > >> >>> Then: > >> >>> > >> >>> (1) leaving it alone is the backward compatible behavior; > >> >>> (2) Perhaps setting it to NULL will use all columns, and > make it > >> >>> equivalent to unique.data.frame (also the same when x has > no key); and > >> >>> (3) setting it to any other combo of columns uses those > columns as the > >> >>> uniqueness key and filters the rows (only) out of x > accordingly. > >> >>> > >> >>> What do you folks think? Personally I think this is better > on all > >> >>> accounts than just specifying to use the key or not and the > only > >> >>> question in my mind is the name of the argument -- happy to > hear other > >> >>> world views, however, so don't be shy. > >> >>> > >> >>> Thanks, > >> >>> -steve > >> >>> > >> >>> -- > >> >>> Steve Lianoglou > >> >>> Computational Biologist > >> >>> Bioinformatics and Computational Biology > >> >>> Genentech > >> >> > >> >> > >> > > >> > > >> > > >> > -- > >> > Steve Lianoglou > >> > Computational Biologist > >> > Bioinformatics and Computational Biology > >> > Genentech > >> > >> > >> > >> -- > >> Steve Lianoglou > >> Computational Biologist > >> Bioinformatics and Computational Biology > >> Genentech > > > > > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Sun Sep 29 06:49:24 2013 From: FErickson at psu.edu (Frank Erickson) Date: Sun, 29 Sep 2013 00:49:24 -0400 Subject: [datatable-help] new key argument to [.data.table in 1.8.11 Message-ID: Hi, I'm just continuing a discussion with @eddi that would not fit in an SO comment. If you want to catch up, the references are...
http://r-forge.r-project.org/tracker/index.php?func=detail&aid=4675&group_id=240&atid=978 http://stackoverflow.com/a/19074195/1191259 The SO question (scroll up on the second link) was whether there was a way to use a "temporary" key for X in an X[Y] join. @eddi: +1. Yeah, I like this new option and will probably use it. Will this also overwrite the key when using [.data.table without doing joins? That might be backward incompatible I guess, since `key` is already an argument to `[.data.table`. That is, will x[i,,key='B'] do anything? I don't think that type of command has had much use until now, and adding a j argument (that doesn't start with `:=`) always makes a copy (right?), so maybe backward compatibility would not be an issue there. Regarding whether it's a reasonable compromise, ... well, I'll be using it, anyway! I don't know what the feasibility constraints are on implementing what I initially had in mind, so I'll defer to you and the developers on that. If "secondary keys" are implemented down the road, that would solve this problem in most cases. As far as when I will use it, I guess it depends on the relative cost of making a copy vs resetting the key on x. If I use the old syntax, I make a copy, but don't have to change x's key back at the end (one copy, one key setting). With the new syntax, I'd have to change the key on x back afterward (zero copies, two key settings). If I know the sorting takes a long time (e.g., because the key is the whole set of columns), I might still go with copying. Best, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Sun Sep 29 15:47:36 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Sun, 29 Sep 2013 08:47:36 -0500 Subject: [datatable-help] new key argument to [.data.table in 1.8.11 In-Reply-To: References: Message-ID: There wasn't a 'key' argument before and yes, it will change the key regardless of whether you're merging or not. 
Initially I added it just for the merges, but then realized that there is no conceptual reason to restrict it just to merges. FYI, the reason you probably thought there was a 'key' argument before is that in R abbreviated argument names are valid syntax, so you were actually using 'keyby' (which has not changed). You raise a good point that I hadn't thought of: copying can be faster than sorting - I will check when that's true. It's easy to implement the copy version and I did this because I assumed it's the faster option, but if it's not then might as well copy and do this for merges only. On Sep 28, 2013 11:50 PM, "Frank Erickson" wrote: > Hi, > > I'm just continuing a discussion with @eddi that would not fit in an SO > comment. If you want to catch up, the references are... > > http://r-forge.r-project.org/tracker/index.php?func=detail&aid=4675&group_id=240&atid=978 > http://stackoverflow.com/a/19074195/1191259 > The SO question (scroll up on the second link) was whether there was a way > to use a "temporary" key for X in an X[Y] join. > > @eddi: > > +1. Yeah, I like this new option and will probably use it. > > Will this also overwrite the key when using [.data.table without doing > joins? That might be backward incompatible I guess, since `key` is already > an argument to `[.data.table`. That is, will x[i,,key='B'] do anything? I > don't think that type of command has had much use until now, and adding a j > argument (that doesn't start with `:=`) always makes a copy (right?), so > maybe backward compatibility would not be an issue there. > > Regarding whether it's a reasonable compromise, ... well, I'll be using > it, anyway! I don't know what the feasibility constraints are on > implementing what I initially had in mind, so I'll defer to you and the > developers on that. If "secondary keys" are implemented down the road, that > would solve this problem in most cases.
> > As far as when I will use it, I guess it depends on the relative cost of > making a copy vs resetting the key on x. If I use the old syntax, I make a > copy, but don't have to change x's key back at the end (one copy, one key > setting). With the new syntax, I'd have to change the key on x back > afterward (zero copies, two key settings). If I know the sorting takes a > long time (e.g., because the key is the whole set of columns), I might > still go with copying. > > Best, > > Frank > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Sun Sep 29 16:02:06 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Sun, 29 Sep 2013 09:02:06 -0500 Subject: [datatable-help] new key argument to [.data.table in 1.8.11 In-Reply-To: References: Message-ID: Ah what am I thinking - you'll have to copy and still set a key, so unless you have to go back to the old key (rarely?) this is strictly faster. On Sun, Sep 29, 2013 at 8:47 AM, Eduard Antonyan wrote: > There wasn't a 'key' argument before and yes, it will change the key > regardless of whether you're merging or not. Initially I added it just for > the merges, but then realized that there us no conceptual reason to > restrict it just to merges. > > Fyi the reason you probably thought there is a key argument before is > because in R shorthand of arguments is valid syntax and you were actually > using 'keyby' (which has not changed). > > You raise a good point that I haven't thought of that copying can be > faster than sorting - I will check when that's true. It's easy to implement > the copy version and I did this because I assumed it's the faster option, > but if it's not then might as well copy and do this for merges only. 
> On Sep 28, 2013 11:50 PM, "Frank Erickson" wrote: > >> Hi, >> >> I'm just continuing a discussion with @eddi that would not fit in an SO >> comment. If you want to catch up, the references are... >> >> http://r-forge.r-project.org/tracker/index.php?func=detail&aid=4675&group_id=240&atid=978 >> http://stackoverflow.com/a/19074195/1191259 >> The SO question (scroll up on the second link) was whether there was a >> way to use a "temporary" key for X in an X[Y] join. >> >> @eddi: >> >> +1. Yeah, I like this new option and will probably use it. >> >> Will this also overwrite the key when using [.data.table without doing >> joins? That might be backward incompatible I guess, since `key` is already >> an argument to `[.data.table`. That is, will x[i,,key='B'] do anything? I >> don't think that type of command has had much use until now, and adding a j >> argument (that doesn't start with `:=`) always makes a copy (right?), so >> maybe backward compatibility would not be an issue there. >> >> Regarding whether it's a reasonable compromise, ... well, I'll be using >> it, anyway! I don't know what the feasibility constraints are on >> implementing what I initially had in mind, so I'll defer to you and the >> developers on that. If "secondary keys" are implemented down the road, that >> would solve this problem in most cases. >> >> As far as when I will use it, I guess it depends on the relative cost of >> making a copy vs resetting the key on x. If I use the old syntax, I make a >> copy, but don't have to change x's key back at the end (one copy, one key >> setting). With the new syntax, I'd have to change the key on x back >> afterward (zero copies, two key settings). If I know the sorting takes a >> long time (e.g., because the key is the whole set of columns), I might >> still go with copying. 
>> >> Best, >> >> Frank >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Sun Sep 29 16:58:48 2013 From: FErickson at psu.edu (Frank Erickson) Date: Sun, 29 Sep 2013 10:58:48 -0400 Subject: [datatable-help] new key argument to [.data.table in 1.8.11 In-Reply-To: References: Message-ID: > > There wasn't a 'key' argument before and yes, it will change the key > regardless of whether you're merging or not. Initially I added it just for > the merges, but then realized that there is no conceptual reason to > restrict it just to merges. Ah, my mistake. I saw "key" under the list of arguments in the documentation and assumed it applied to [.data.table; but it's actually for the data.table function. Ah what am I thinking - you'll have to copy and still set a key, so unless > you have to go back to the old key (rarely?) this is strictly faster. > Yeah, that was my initial use case, a "temporary key". This new syntax/functionality should be useful when I don't want to go back to the old key, though. --Frank > -------------- next part -------------- An HTML attachment was scrubbed... URL: From harishv_99 at yahoo.com Sun Sep 29 19:03:24 2013 From: harishv_99 at yahoo.com (Harish) Date: Sun, 29 Sep 2013 10:03:24 -0700 (PDT) Subject: [datatable-help] fread() coercing to character when seeing NA Message-ID: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com> Hi, I am trying to get fread() to read NA without coercing the entire column to character, but I am unable to do it. Please tell me whether I am doing something wrong or this is a bug.
# Load two data tables with a column of integers -- one with NA and one without dt1 <- fread( "a\n2\n4\n8\n5", na.strings=c("?") ) dt2 <- fread( "a\n2\n4\n?\n5", na.strings=c("?") ) # The contents of both are as expected (or so it seems) dt1 dt2 # The class of the column with NA is character class( dt1$a ) class( dt2$a )  # Not expecting this to be character # Even setting colClasses does not help dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"), colClasses=c(a="integer") ) class( dt3$a ) Thanks for your help. Regards, Harish -------------- next part -------------- An HTML attachment was scrubbed... URL: From julien.barnier at ens-lyon.fr Mon Sep 30 16:06:31 2013 From: julien.barnier at ens-lyon.fr (Julien Barnier) Date: Mon, 30 Sep 2013 16:06:31 +0200 Subject: [datatable-help] fread() coercing to character when seeing NA In-Reply-To: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com> References: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com> Message-ID: <5223628.upPkjNS379@l018198> Hi, > dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"), colClasses=c(a="integer")) I think that running fread with the verbose flag answers your question: R> dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"),colClasses=c(a="integer"), verbose=TRUE) ... ... Column 1 ('a') has been detected as type 'character'. Ignoring request from colClasses to read as 'integer' (a lower type) since NAs would result. 0.000s ( 0%) Memory map (rerun may be quicker) 0.000s ( 0%) sep and header detection 0.000s ( 0%) Count rows (wc -l) 0.000s ( 0%) Column type detection (first, middle and last 5 rows) 0.000s ( 0%) Allocation of 4x1 result (xMB) in RAM 0.000s ( 0%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 0%) Changing na.strings to NA 0.000s Total As your 'a' column contains the character string "?", fread determines this column to be character.
And colClasses is ignored as that would result in possibly unwanted NA values. And all of this, as I understand it, is because the replacement of na.strings by NA happens as the last step of fread, after the column type has been set. So it seems that the only workarounds are either to change your data to replace your missing-value code with a numerical value (like -9999 or anything else), or to convert your column back to numeric after using fread. Regards, Julien -- Julien Barnier Centre Max Weber ENS de Lyon From mdowle at mdowle.plus.com Mon Sep 30 20:58:10 2013 From: mdowle at mdowle.plus.com (Matthew Dowle) Date: Mon, 30 Sep 2013 19:58:10 +0100 Subject: [datatable-help] fread() coercing to character when seeing NA In-Reply-To: <5223628.upPkjNS379@l018198> References: <1380474204.855.YahooMailNeo@web120203.mail.ne1.yahoo.com> <5223628.upPkjNS379@l018198> Message-ID: <5249C9C2.2000009@mdowle.plus.com> Yes, exactly. On the bug list is #2660 "Improve fread na.strings handling": https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2660&group_id=240&atid=975 which points to: http://stackoverflow.com/questions/15784138/bad-interpretation-of-n-a-using-fread Matthew On 30/09/13 15:06, Julien Barnier wrote: > Hi, > >> dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"), colClasses=c(a="integer")) > I think that running fread with the verbose flag allows to answer your > question : > > R> dt3 <- fread( "a\n2\n4\n?\n5", na.strings=c("?"),colClasses=c(a="integer"), > verbose=TRUE) > ... ... > Column 1 ('a') has been detected as type 'character'. Ignoring request from > colClasses to read as 'integer' (a lower type) since NAs would result.
> 0.000s ( 0%) Memory map (rerun may be quicker) > 0.000s ( 0%) sep and header detection > 0.000s ( 0%) Count rows (wc -l) > 0.000s ( 0%) Column type detection (first, middle and last 5 rows) > 0.000s ( 0%) Allocation of 4x1 result (xMB) in RAM > 0.000s ( 0%) Reading data > 0.000s ( 0%) Allocation for type bumps (if any), including gc time if > triggered > 0.000s ( 0%) Coercing data already read in type bumps (if any) > 0.000s ( 0%) Changing na.strings to NA > 0.000s Total > > As your 'a' column contains a character string "?", fread determines this > column as character. And colClasses is ignored as that would result in > possibly unwanted NA values. And all of this, as I understand it, is because > the replacement of na.strings by NA happens as the last step of fread, after > the column type has been set. > > So it seems that the only workarounds are either to change your data to > replace your missing value code by a numerical value (like -9999 or anything > else), or to convert your column back to numeric after using fread. > > Regards, > > Julien > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Mon Sep 30 22:01:47 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Mon, 30 Sep 2013 13:01:47 -0700 Subject: [datatable-help] rbind empty data tables Message-ID: I encountered the following behavior with data.table 1.8.10 on R 3.0.2 on Mac OS X and was wondering if that is expected: > dt1 = data.table(a=character()) > dt2 = data.table(a=character()) > dt1 Empty data.table (0 rows) of 1 col: a > colnames(dt1) [1] "a" > dt2 Empty data.table (0 rows) of 1 col: a > colnames(dt2) [1] "a" > rbind(dt1, dt2) Error in setnames(ret, nm.original) : x has no column names Enter a frame number, or 0 to exit 1: rbind(dt1, dt2) 2: rbind(deparse.level, ...) 3: data.table::.rbind.data.table(...)
4: setnames(ret, nm.original) If I rbind two zero-row data.table objects with matching column names, I would have expected to get a zero-row data.table back (0 + 0 = 0, after all). -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.sieira at gmail.com Mon Sep 30 22:06:32 2013 From: alexandre.sieira at gmail.com (Alexandre Sieira) Date: Mon, 30 Sep 2013 13:06:32 -0700 Subject: [datatable-help] rbind empty data tables In-Reply-To: References: Message-ID: By the way, this works as I would expect with data.frame in the same environment: > df1 = data.frame(a=character()) > df2 = data.frame(a=character()) > df1 [1] a <0 rows> (or row.names with length 0) > df2 [1] a <0 rows> (or row.names with length 0) > rbind(df1, df2) [1] a <0 rows> (or row.names with length 0) -- Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I On 30 September 2013 at 13:01:47, Alexandre Sieira (alexandre.sieira at gmail.com) wrote: I encountered the following behavior with data.table 1.8.10 on R 3.0.2 on Mac OS X and was wondering if that is expected: > dt1 = data.table(a=character()) > dt2 = data.table(a=character()) > dt1 Empty data.table (0 rows) of 1 col: a > colnames(dt1) [1] "a" > dt2 Empty data.table (0 rows) of 1 col: a > colnames(dt2) [1] "a" > rbind(dt1, dt2) Error in setnames(ret, nm.original) : x has no column names Enter a frame number, or 0 to exit 1: rbind(dt1, dt2) 2: rbind(deparse.level, ...) 3: data.table::.rbind.data.table(...) 4: setnames(ret, nm.original) If I rbind two zero-row data.table objects with matching column names, I would have expected to get a zero-row data.table back (0 + 0 = 0, after all). --
Alexandre Sieira CISA, CISSP, ISO 27001 Lead Auditor "The truth is rarely pure and never simple." Oscar Wilde, The Importance of Being Earnest, 1895, Act I -------------- next part -------------- An HTML attachment was scrubbed... URL:
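[For archive readers: until the zero-row rbind error reported above is fixed, one possible workaround is to drop empty tables before binding. This is a hypothetical sketch, not from the thread; `Filter` and `do.call` are base R, and the variable names are illustrative.]

```r
library(data.table)  # behaviour reported above: data.table 1.8.10 on R 3.0.2

dt1 <- data.table(a = character())
dt2 <- data.table(a = character())

# rbind(dt1, dt2) errors when every input has zero rows, so keep only
# the non-empty tables and fall back to the first input if none remain
tabs <- Filter(nrow, list(dt1, dt2))
res  <- if (length(tabs)) do.call(rbind, tabs) else dt1
res   # Empty data.table (0 rows) of 1 col: a
```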