From ggrothendieck at gmail.com Tue Apr 1 13:08:50 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 1 Apr 2014 07:08:50 -0400
Subject: [datatable-help] .I does not respect by
Message-ID:

In the following, .I does not seem to be computed within each group.

> dt <- data.table(a = 1:4, b = 1:2)
> dt[, .I, by = b]
   b .I
1: 1  1
2: 1  3
3: 2  2
4: 2  4
> packageVersion("data.table")
[1] '1.9.3'

This seems contrary to ?data.table.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Tue Apr 1 13:20:15 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 1 Apr 2014 13:20:15 +0200
Subject: [datatable-help] .I does not respect by
In-Reply-To:
References:
Message-ID:

Gabor,

It's the same as in 1.8.10 and 1.9.2. What does it contradict in ?data.table? It says under .I: ".I is an integer vector length .N holding the row locations in x for this group."

The row locations in x for b=1,2,1,2 are 1,2,3,4, which then become 1,3 and 2,4 for the groups b=1 and b=2 respectively.

Arun

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
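The distinction discussed in this thread can be checked with a small sketch, using the same dt as in the original post. The column names row_in_x and row_in_group are made up for illustration; seq_len(.N) is just one way to get a counter that restarts in each group.

```r
library(data.table)

dt <- data.table(a = 1:4, b = 1:2)

# .I gives row numbers in the whole table x, reported group by group
whole <- dt[, list(row_in_x = .I), by = b]

# seq_len(.N) counts from 1 again inside every group
within <- dt[, list(row_in_group = seq_len(.N)), by = b]

print(whole)
print(within)
```

If a per-group counter is what is wanted, seq_len(.N) is probably the form to use; .I is useful when the positions are needed to index back into the full table.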
From ggrothendieck at gmail.com Tue Apr 1 13:45:19 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 1 Apr 2014 07:45:19 -0400
Subject: [datatable-help] .I does not respect by
In-Reply-To:
References:
Message-ID:

From the documentation I would have expected the row locations to start over at 1 for each group, so that .I = 1:.N, but these are not equivalent.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From aragorn168b at gmail.com Tue Apr 1 13:57:58 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Tue, 1 Apr 2014 13:57:58 +0200
Subject: [datatable-help] .I does not respect by
In-Reply-To:
References:
Message-ID:

The key part is "row locations in x". If it were to reset for each group, then that part wouldn't have been necessary.

Perhaps it'd be clearer if it were something like this: ".I is an integer vector which, for each group, holds the corresponding row numbers in/from/of x"? Not sure which preposition is most appropriate here.

Arun
From ggrothendieck at gmail.com Tue Apr 1 14:02:10 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Tue, 1 Apr 2014 08:02:10 -0400
Subject: [datatable-help] .I does not respect by
In-Reply-To:
References:
Message-ID:

Perhaps it could read .I = 1:nrow(x). Note that .I does not equal 1:.N if there are groups.
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From ben.goldstein at gmail.com Tue Apr 1 18:56:58 2014
From: ben.goldstein at gmail.com (bgoldstein)
Date: Tue, 1 Apr 2014 09:56:58 -0700 (PDT)
Subject: [datatable-help] Adding Column to Data.Table not working
Message-ID: <1396371418064-4687970.post@n4.nabble.com>

Up until a recent update (now 1.8.8) I would add a column to a DT the DF way:

DT <- data.table(a=c(1,2,3), b=c(4,5,6))
DT$c <- c(7:9)

Or I could have done it the DT way:

DT[,c:=c(7:9)]

However, when I try to do this now I get the error:

> DT$c = c(7:9)
Error in `[<-.data.table`(x, j = name, value = value) :
  attempt to set index 2/2 in SET_STRING_EL

I can hack around this by doing:

c <- 7:9
DT <- cbind(DT,c)

However, this does not seem desirable and is forcing me to fix a lot of code.

Is this a bug or am I doing something wrong now?

Thanks,
bg

--
View this message in context: http://r.789695.n4.nabble.com/Adding-Column-to-Data-Table-not-working-tp4687970.html
Sent from the datatable-help mailing list archive at Nabble.com.

From lianoglou.steve at gene.com Tue Apr 1 20:12:55 2014
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Tue, 1 Apr 2014 11:12:55 -0700
Subject: [datatable-help] Adding Column to Data.Table not working
In-Reply-To: <1396371418064-4687970.post@n4.nabble.com>
References: <1396371418064-4687970.post@n4.nabble.com>
Message-ID:

Hi,

On Tue, Apr 1, 2014 at 9:56 AM, bgoldstein wrote:
> Up until a recent update (now 1.8.8) I would add a column to a DT the DF way:

It's not clear: are you saying that you are using data.table v1.8.8? If so, could you upgrade to the latest (1.9.2) and try again?
> DT <- data.table(a=c(1,2,3), b=c(4,5,6))
> DT$c <- c(7:9)
>
> Or I could have done the DT way:
> DT[,c:=c(7:9)]
>
> However when I try to do this now I get the error:
>
>> DT$c = c(7:9)
> Error in `[<-.data.table`(x, j = name, value = value) :
>   attempt to set index 2/2 in SET_STRING_EL

I can't explain why this is happening, even in your earlier version (it shouldn't be), but let's all get on the same page first; then we can continue with the debugging.

The output of sessionInfo() after you get this error would also be most helpful.

Thanks,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

From ben.goldstein at gmail.com Tue Apr 1 20:47:10 2014
From: ben.goldstein at gmail.com (bgoldstein)
Date: Tue, 1 Apr 2014 11:47:10 -0700 (PDT)
Subject: [datatable-help] Adding Column to Data.Table not working
In-Reply-To:
References: <1396371418064-4687970.post@n4.nabble.com>
Message-ID: <1396378030276-4687979.post@n4.nabble.com>

So when I update, everything seems to be working. I should have checked that first, so I guess it was just an odd bug in an older version.

Thanks for the commonsense solution.

bg

--
View this message in context: http://r.789695.n4.nabble.com/Adding-Column-to-Data-Table-not-working-tp4687970p4687979.html
Sent from the datatable-help mailing list archive at Nabble.com.

From aragorn168b at gmail.com Wed Apr 2 00:04:26 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 2 Apr 2014 00:04:26 +0200
Subject: [datatable-help] .I does not respect by
In-Reply-To:
References:
Message-ID:

Okay. Added a "Doc" tracker, #5520.

Arun
From aragorn168b at gmail.com Sun Apr 6 11:46:07 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 6 Apr 2014 11:46:07 +0200
Subject: [datatable-help] In 1.9.2, By with factor column do not work the same as in 1.8.10
In-Reply-To:
References:
Message-ID:

This is now fixed with commit #1256 from v1.9.3. Thanks Christophe for filing #5437 and Paul for following up. I'll close it now. Please write back if you find something's still not right.

Arun

From: Paul Johnson pauljohn32 at gmail.com
Reply: Paul Johnson pauljohn32 at gmail.com
Date: March 31, 2014 at 2:03:48 AM
To: DERVIEUX Christophe christophe.dervieux at rte-france.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] In 1.9.2, By with factor column do not work the same as in 1.8.10

Hi

I see this problem too. I was not using data.table before 1.9, so I did not realize it ever behaved differently.

In the examples I've tried, any calculation that I expect to create a factor seems to create an integer that uses the R internal integer of the factor.

When I noticed this, I thought maybe I needed to do more explicit casting to make it come out as a factor.

Here's my function to lag a factor that beats the point into the ground.
lagFactor <- function(x, N){
    xold <- x
    if (is.factor(x)) {
        xlev <- levels(x)
        xnum <- as.numeric(x)
    } else {
        xlev <- unique(x)
        xnum <- match(x, xlev)  # integer codes for non-factor input (was left undefined)
    }
    # shift the codes down by N positions, padding the start with NA
    xlag <- c(rep(NA, N), xnum[-(length(xnum):(length(xnum)-N+1))])
    xlagf <- factor(xlev[xlag], levels = xlev)
    xlagf
}

dat is a data.table with lots of lines; I can give you a copy if you want. Now I'll show you that the result is different inside and outside a data.table.

> xx <- lagFactor(dat$east2b, 1)
> table(xx)
xx
   Yes     No
130232 151885
> levels(xx)
[1] "Yes" "No"
> dat[ , xx := lagFactor(east2b, 1), by = c("sippid"), roll = TRUE]
> table(dat$xx)

     1      2
114963 130095
> levels(dat$xx)
NULL
> table(xx, dat$xx)

xx         1      2
  Yes 114963      0
  No       0 130095

For my case, the only fix is an explicit re-factoring.

pj

On Fri, Mar 28, 2014 at 5:29 AM, DERVIEUX Christophe wrote:

Hi,

I recently updated the data.table package from 1.8.10 to 1.9.2 and found errors in my previous code. See the reproducible example below.

On 1.8.10:

DT<-data.table(X=factor(2006:2012),Y=rep(1:7,2))
DT[,Z:=paste(X,.N,sep=" - "),by=list(X)][]
       X Y        Z
 1: 2006 1 2006 - 2
 2: 2007 2 2007 - 2
 3: 2008 3 2008 - 2
 4: 2009 4 2009 - 2
 5: 2010 5 2010 - 2
 6: 2011 6 2011 - 2
 7: 2012 7 2012 - 2
 8: 2006 1 2006 - 2
 9: 2007 2 2007 - 2
10: 2008 3 2008 - 2
11: 2009 4 2009 - 2
12: 2010 5 2010 - 2
13: 2011 6 2011 - 2
14: 2012 7 2012 - 2

In column Z, I get the level of the factor column X pasted with the count '.N', as expected.

However, in 1.9.2, with the same code:

DT<-data.table(X=factor(2006:2012),Y=rep(1:7,2))
DT[,Z:=paste(X,.N,sep=" - "),by=list(X)][]
       X Y     Z
 1: 2006 1 1 - 2
 2: 2007 2 2 - 2
 3: 2008 3 3 - 2
 4: 2009 4 4 - 2
 5: 2010 5 5 - 2
 6: 2011 6 6 - 2
 7: 2012 7 7 - 2
 8: 2006 1 1 - 2
 9: 2007 2 2 - 2
10: 2008 3 3 - 2
11: 2009 4 4 - 2
12: 2010 5 5 - 2
13: 2011 6 6 - 2
14: 2012 7 7 - 2

As a result, I do not get the levels of factor column X but the numeric values associated with the levels. Is this working normally?
Why has it changed? Is that a bug?

I use this kind of procedure to make labels for ggplot. All my previous code is not working anymore. It's kind of annoying.

Thanks
Christophe

--
Paul E. Johnson
Professor, Political Science      Assoc. Director
1541 Lilac Lane, Room 504         Center for Research Methods
University of Kansas              University of Kansas
http://pj.freefaculty.org         http://quant.ku.edu

From caneff at gmail.com Mon Apr 7 17:31:47 2014
From: caneff at gmail.com (Chris Neff)
Date: Mon, 7 Apr 2014 11:31:47 -0400
Subject: [datatable-help] Is there any overhead to converting back and forth from a data.table to a data.frame?
Message-ID:

I prefer data.tables for all the code processing I do. But others on my team using my functions aren't comfortable with data.tables, so most of the libraries I write end with

return(data.frame(DT))

Is there any copying or other overhead happening there? Since it inherits from data.frame, I think the answer is no.

Now, if I have a function that does such a return, but I wrap that itself in a data.table call:

data.table(func_that_returns_df())

Is there any inefficiency there? Is there a difference between data.table() and as.data.table() here?
From aragorn168b at gmail.com Mon Apr 7 20:25:36 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 7 Apr 2014 20:25:36 +0200
Subject: [datatable-help] Is there any overhead to converting back and forth from a data.table to a data.frame?
In-Reply-To:
References:
Message-ID:

as.data.frame is an S3 generic with a .data.table method and is definitely faster than data.frame(). But it still does copy(.). data.frame(.) would also convert strings to factors by default (if stringsAsFactors=TRUE).

The most efficient way to convert a data.table to a data.frame would be to do things by reference (in place). The code is already available in as.data.frame, just remove the copy(.):

# convert data.table to data.frame by reference
setDF <- function(x) {
    if (!is.data.table(x))
        stop("x must be a data.table")
    setattr(x, "row.names", .set_row_names(nrow(x)))
    setattr(x, "class", "data.frame")
    setattr(x, "sorted", NULL)
    setattr(x, ".internal.selfref", NULL)
}

Now you've a function that'll convert a data.table to a data.frame by reference.

require(data.table)
dat <- data.table(x=1:5, y=6:10)
setDF(dat) # dat is now a data.frame

Probably we should export this function as well, like setDT, so that users can switch between the two as they desire without a performance hit?

Arun
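As a rough sketch of the round trip discussed in this thread: the snippet below assumes a later data.table release in which setDF() is exported alongside setDT() (it is not available in the 1.9.2 release discussed in this archive). The name df_copy is made up for illustration.

```r
library(data.table)

dat <- data.table(x = 1:5, y = 6:10)

# Assumption: an installed data.table version that exports setDF()
setDF(dat)                 # dat is modified in place and is now a plain data.frame
was_plain_df <- is.data.frame(dat) && !is.data.table(dat)

setDT(dat)                 # converted back in place to a data.table
is_dt_again <- is.data.table(dat)

# as.data.frame() returns a separate object and leaves dat as a data.table
df_copy <- as.data.frame(dat)
```

For a function that must hand a data.frame back to callers, ending with setDF() on a table the function itself created is probably the cheapest option; on a table passed in by the caller, it would change the caller's object as well.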
From caneff at gmail.com Mon Apr 7 20:29:04 2014
From: caneff at gmail.com (Chris Neff)
Date: Mon, 7 Apr 2014 14:29:04 -0400
Subject: [datatable-help] Is there any overhead to converting back and forth from a data.table to a data.frame?
In-Reply-To:
References:
Message-ID:

I would appreciate such a function, yes. Thanks for the explanation.
From kevinushey at gmail.com Mon Apr 7 20:40:07 2014
From: kevinushey at gmail.com (Kevin Ushey)
Date: Mon, 7 Apr 2014 11:40:07 -0700
Subject: [datatable-help] Is there any overhead to converting back and forth from a data.table to a data.frame?
In-Reply-To:
References:
Message-ID:

I agree; this would be very useful.
From lianoglou.steve at gene.com Mon Apr 7 20:50:00 2014
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Mon, 7 Apr 2014 11:50:00 -0700
Subject: [datatable-help] Is there any overhead to converting back and forth from a data.table to a data.frame?
In-Reply-To:
References:
Message-ID:

+1 on exporting setDF
--
Steve Lianoglou
Computational Biologist
Genentech

From torres at uniovi.es Tue Apr 8 20:25:06 2014
From: torres at uniovi.es (Emilio Torres Manzanera)
Date: Tue, 08 Apr 2014 20:25:06 +0200
Subject: [datatable-help] fread always read the last line when skipping
Message-ID: <87wqezel19.fsf@uniovi.es>

Dear Sir,

If I use the skip option with a number greater than or equal to the number of rows of the file, it always reads the last record. It would be nice to get a NULL record in such a situation. Do you know how to get this result?
Thank you
Best regards
Emilio

library(data.table)
write.table(iris,"iris.csv",row.names=FALSE, col.names=FALSE)
a <- fread("iris.csv")
dim(a) # 150 records
b <- fread("iris.csv",skip=250,nrows=10) ## We skip a lot of records
b # it always returns record #150

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.9.2

loaded via a namespace (and not attached):
[1] plyr_1.8       reshape2_1.2.2 stringr_0.6.2

--
=================================================
Emilio Torres Manzanera
Fac. de Comercio - Universidad de Oviedo
c/ Luis Moya 261, E-33203 Gijón (Spain)
Tel. 985 182 197
email: torres at uniovi.es
=================================================

From serpalma.v at gmail.com Sun Apr 13 21:35:55 2014
From: serpalma.v at gmail.com (Sergio.pv)
Date: Sun, 13 Apr 2014 12:35:55 -0700 (PDT)
Subject: [datatable-help] Transform characters to numbers and compare
Message-ID: <1397417755341-4688708.post@n4.nabble.com>

I have a data.frame of two vectors.

df <- data.frame(G1=c("b","a","e","d","c"), G2=c("c","d","e","b","a"))

You can see that both vectors have the same characters, but in a different order. I want to convert them into numbers and then compare them. To compare G2 to G1, G1 must be the reference, so the output will be this:

df2 <- data.frame(G1=c("1","2","3","4","5"), G2=c("5","4","3","1","2"))

Is there a way to do this? Thanks

--
View this message in context: http://r.789695.n4.nabble.com/Transform-characters-to-numbers-and-compare-tp4688708.html
Sent from the datatable-help mailing list archive at Nabble.com.
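One possible way to get the numbering asked for above is base R's match(), which returns the position of each value within a reference vector. This is only a sketch of that idea, not an answer taken from the list; it produces integers rather than the character strings shown in the desired output, and stringsAsFactors = FALSE is set here only to keep the columns as plain character vectors.

```r
df <- data.frame(G1 = c("b", "a", "e", "d", "c"),
                 G2 = c("c", "d", "e", "b", "a"),
                 stringsAsFactors = FALSE)

# Number every value by its position in the reference column G1
df2 <- data.frame(G1 = match(df$G1, df$G1),
                  G2 = match(df$G2, df$G1))

print(df2)
```

If strings are really needed, wrapping each match() call in as.character() should give the form shown in the question.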
From aragorn168b at gmail.com Sun Apr 13 21:43:17 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 13 Apr 2014 21:43:17 +0200
Subject: [datatable-help] Transform characters to numbers and compare
In-Reply-To: <1397417755341-4688708.post@n4.nabble.com>
References: <1397417755341-4688708.post@n4.nabble.com>
Message-ID:

Please don't cross-post: http://stackoverflow.com/questions/23047280/transform-characters-to-numbers

Arun

From: Sergio.pv serpalma.v at gmail.com
Reply: Sergio.pv serpalma.v at gmail.com
Date: April 13, 2014 at 9:36:35 PM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] Transform characters to numbers and compare

I have a data.frame of two vectors.

df <- data.frame(G1=c("b","a","e","d","c"), G2=c("c","d","e","b","a"))

You can see that both vectors have the same characters, but in a different order. I want to convert them into numbers and then compare them. To compare G2 to G1, G1 must be the reference, so the output will be this:

df2 <- data.frame(G1=c("1","2","3","4","5"), G2=c("5","4","3","1","2"))

Is there a way to do this? Thanks

--
View this message in context: http://r.789695.n4.nabble.com/Transform-characters-to-numbers-and-compare-tp4688708.html
Sent from the datatable-help mailing list archive at Nabble.com.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ggrothendieck at gmail.com Tue Apr 15 05:59:01 2014 From: ggrothendieck at gmail.com (Gabor Grothendieck) Date: Mon, 14 Apr 2014 23:59:01 -0400 Subject: [datatable-help] fread numerics Message-ID: In light of the change in behavior of read.table / type.convert in R 3.1.0 (it now reads in numerics too long to be represented as factors or strings whereas R 3.0.x read them in as numeric) it would be nice if fread had an argument which specified which way it acts. -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com From kpm.nachtmann at gmail.com Tue Apr 15 17:10:01 2014 From: kpm.nachtmann at gmail.com (nachti) Date: Tue, 15 Apr 2014 08:10:01 -0700 (PDT) Subject: [datatable-help] Change in list( ) behavior inside join In-Reply-To: <53318CDB.3060008@mdowle.plus.com> References: <70C43702-3298-4DE8-8CF9-0D425BAD1A1F@dc-energy.com> <5331885E.9040108@mdowle.plus.com> <53318CDB.3060008@mdowle.plus.com> Message-ID: <1397574601798-4688823.post@n4.nabble.com> Hi there! Just another example (maybe to be included to test.data.table), which does not do, what I expected (v. 1.9.2 - it's also fixed in 1.9.3) > require(data.table) > sessionInfo() R version 3.1.0 (2014-04-10) Platform: powerpc64-unknown-linux-gnu (64-bit) ... other attached packages: [1] data.table_1.9.2 > example(data.table) > DT x y v v2 m 1: a 1 42 NA 42 2: a 3 42 NA 42 3: a 6 42 NA 42 4: b 1 4 84 5 5: b 3 5 84 5 6: b 6 6 84 5 7: c 1 7 NA 8 8: c 3 8 NA 8 9: c 6 9 NA 8 > setkey(DT) > DT[J("a"), list(v, y)] x v y 1: a 42 1 > DT[J("a"), list(v, y, i = "text")] x v y i 1: a 42 1 text ##### With data.table 1.9.3 it's working fine: > require(data.table) > sessionInfo() R version 3.1.0 (2014-04-10) Platform: powerpc64-unknown-linux-gnu (64-bit) ... 
other attached packages:
[1] data.table_1.9.3

> example(data.table)
> setkey(DT)
> DT[J("a"), list(v, y)]
    v y
1: 42 1
2: 42 3
3: 42 6
> DT[J("a"), list(v, y, i = "text")]
    v y    i
1: 42 1 text
2: 42 3 text
3: 42 6 text

nachti

--
View this message in context: http://r.789695.n4.nabble.com/Change-in-list-behavior-inside-join-tp4687469p4688823.html
Sent from the datatable-help mailing list archive at Nabble.com.

From cstanley at cstanley.no-ip.biz Wed Apr 16 18:23:37 2014
From: cstanley at cstanley.no-ip.biz (Clayton Stanley)
Date: Wed, 16 Apr 2014 11:23:37 -0500
Subject: [datatable-help] data.table and aggregating out-of-order columns in result from by
Message-ID:

Copied from this SO post: http://stackoverflow.com/questions/23097461

Here's some interesting behavior that I noticed with data.table 1.9.2

> testFun <- function(val) {
    if (val == 'geteeee') return(data.table(x=4,y=3))
    if (val == 'get') return(data.table(y=3,x=4))
  }
> tbl = data.table(val=c('geteeee', 'get'))
> tbl[, testFun(val), by=val]
       val x y
1: geteeee 4 3
2:     get 3 4
>

When the column orders of the data tables returned from each call to testFun differ (but the tables have the same column names and number of columns), data.table silently binds the tables together without taking into account that they are out of order. This was probably done for speed, but I found the behavior quite unexpected, and would have appreciated at least a warning.

Is there a way that I can get data.table to warn or error when this situation happens?

This happened in my analysis code and caused values for two DVs to be intermixed. The reason why it happened is that in the 'testFun' there is a branch and the returned data table is created within both sides of the branch. The branch is necessary to handle the case where the data table used to create the final returned data table is empty. So on one side of that branch I basically create an empty data table with the correct columns, and on the other side the data table is created from the first.
The point is that the column order for the data tables returned from each side of the branch is different.

Now this is certainly a bug on my part in 'testFun'. However I could have caught the issue much earlier if I had received a warning from data.table when the by operation completed and the resulting tables were bound together.

Also since there isn't a check for column order, it does make me worry that there are other places in my analysis code where the same thing could be happening. What would be ideal is if there was some way for me to tell if that is the case. Perhaps a warning, temporarily increasing a 'safety' level as an options call, etc.

Usually data.table is great at warning me when things are not quite right, so I was surprised when I noticed the current behavior. I understand that this was done for speed. So maybe temporarily increasing a 'safety' level is a way to keep things fast by default and have additional checks (for a speed cost) when the user wants them? This sort of mimics how compiler optimization declarations are done in Common Lisp.

-Clayton

From aragorn168b at gmail.com Wed Apr 16 18:41:48 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 16 Apr 2014 18:41:48 +0200
Subject: [datatable-help] data.table and aggregating out-of-order columns in result from by
In-Reply-To:
References:
Message-ID:

Clayton,

Thanks for posting it here. Here's the first follow-up. Here's an example:

require(data.table) ## 1.9.3 commit 1263
dt <- data.table(x=1:1e7, y=1:1e7)

## data.table optimisation removes names
system.time(ans1 <- dt[, list(z=y), by=x])
#    user  system elapsed
#   7.193   0.275   7.859

## data.table can't optimise to remove names
foo <- function(x) list(z=x)
system.time(ans2 <- dt[, foo(y), by=x])
#    user  system elapsed
#  16.020   0.179  16.411

> identical(ans1, ans2)
[1] TRUE

This is without checking for names, for each of the 1e7 groups.
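
On the calling side, one way to guard against the silent reordering discussed in this thread is to normalise the column order inside the function before it returns. The following is only a sketch: checkedFun is a hypothetical wrapper written for illustration, not a data.table feature, and it assumes the expected column names ("x", "y") are known in advance.

```r
library(data.table)

testFun <- function(val) {
  if (val == "geteeee") return(data.table(x = 4, y = 3))
  if (val == "get")     return(data.table(y = 3, x = 4))
}

# Hypothetical wrapper (not part of data.table): fail loudly on unexpected
# column names, then put the columns into one fixed order before returning.
checkedFun <- function(val, cols = c("x", "y")) {
  ans <- testFun(val)
  stopifnot(setequal(names(ans), cols))
  setcolorder(ans, cols)  # reorders the columns by reference
  ans
}

tbl <- data.table(val = c("geteeee", "get"))
res <- tbl[, checkedFun(val), by = val]
res
```

With the wrapper, every group hands back its columns in the same order, so binding by position gives x = 4 and y = 3 for both groups. The check runs once per group, so it adds per-group overhead of the kind the timings above are measuring.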
Arun From:?Clayton Stanley cstanley at cstanley.no-ip.biz Reply:?Clayton Stanley cstanley at cstanley.no-ip.biz Date:?April 16, 2014 at 6:23:50 PM To:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? [datatable-help] data.table and aggregating out-of-order columns in result from by Copied from this SO post:?http://stackoverflow.com/questions/23097461 Here's some interesting behavior that I noticed with data.table 1.9.2 > testFun <- function(val) { if (val == 'geteeee') return(data.table(x=4,y=3)) if (val == 'get') return(data.table(y=3,x=4)) } > tbl = data.table(val=c('geteeee', 'get')) > tbl[, testFun(val), by=val] val x y 1: geteeee 4 3 2: get 3 4 > When the column order of the data tables returned from each call to testFun are mixed (but have the same name and number of columns), data.table silently binds the tables together without taking into account that they are out of order. This was probably done for speed, but I found the behavior quite unexpected, and would have appreciated at least a warning. Is there a way that I can get data.table to warn or error when this situation happens? This happened in my analysis code and caused values for two DVs to be intermixed. The reason why it happened is that in the 'testFun' there is a branch and the returned data table is created within both sides of the branch. The branch is necessary to handle the case where the data table used to create the final returned data table is empty. So on one side of that branch I basically create an empty data table with the correct columns, and on the other side the data table is created from the first. The point is that the column order for the data tables returned from each side of the branch are different. Now this is certainly a bug on my part in 'testFun'. However I could have caught the issue much earlier if I had received a warning from data.table when the by operation completed and the resulting tables were bound together.? 
Also since there isn't a check for column order, it does make me worry that there are other places in my analysis code where the same thing could be happening. What would be ideal is if there was some way for me to tell if that is the case. Perhaps a warning, temporarily increasing a 'safety' level as an options call, etc. Usually data.table is great at warning me when things are not quite right, so I was surprised when I noticed the current behavior. I understand that this was done for speed. So maybe temporarily increasing a 'safety' level is a way to keep things fast by default and have additional checks (for a speed cost) when the user wants them? This sort of mimics how compiler optimization declarations are done in common lisp. -Clayton _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Wed Apr 16 19:11:09 2014 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Wed, 16 Apr 2014 10:11:09 -0700 Subject: [datatable-help] data.table and aggregating out-of-order columns in result from by In-Reply-To: References: Message-ID: Hi, On Wed, Apr 16, 2014 at 9:41 AM, Arunkumar Srinivasan wrote: > Clayton, > > Thanks for posting it here. Here's the first follow-up. Here's an example: > > require(data.table) ## 1.9.3 comm 1263 > dt <- data.table(x=1:1e7, y=1:1e7) > > ## data.table optimisation removes names > system.time(ans1 <- dt[, list(z=y), by=x]) > > # user system elapsed > # 7.193 0.275 7.859 > > ## data.table can't optimise to remove names > foo <- function(x) list(z=x) > system.time(ans2 <- dt[, foo(y), by=x]) > # user system elapsed > # 16.020 0.179 16.411 > >> identical(ans1, ans2) > [1] TRUE > > This is without checking for names, for each of the 1e7 groups. 
Do you think the ~2x difference in speed is really a result of an optimization based on the "names" thing, or is it due to the mechanics required to invoke a function within each grouping of the second example?

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

From aragorn168b at gmail.com Wed Apr 16 20:32:46 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 16 Apr 2014 20:32:46 +0200
Subject: [datatable-help] data.table and aggregating out-of-order columns in result from by
In-Reply-To:
References:
Message-ID:

Okay here we go, once again. A much more detailed look:

A) Let's start with data.table:

require(data.table) ## 1.9.3 commit 1263
dt <- data.table(x=1:1e7, y=1:1e7)

## with optimisation - the names are removed and added at the end
system.time(dt[, list(z=y), by=x])
#    user  system elapsed
#   7.481   0.253   8.017

## without optimisation + no external function still.
system.time(dt[, {list(z=y)}, by=x])
#    user  system elapsed
#   9.913   0.076  10.408

## without optimisation + external function with unnamed list
foo <- function(x) list(x)
system.time(dt[, foo(y), by=x])
#    user  system elapsed
#  13.742   0.139  14.320

## without optimisation + external function with named list
foo <- function(x) list(z=x)
system.time(dt[, foo(y), by=x])
#    user  system elapsed
#  15.333   0.181  15.911

Summary: The difference between evaluating a named and an unnamed list seems to be around 2.4 seconds without a function and about 1.6 seconds with a function. Evaluating through a function is what seems to bring the slowdown to ~2x when compared to a list with no names.

B) Let's verify it by comparing the same as above separately without any other factors, in a separate C file:

// test.c
#include
#define USE_RINTERNALS
#include
#include
// test function - no checks!
SEXP test(SEXP expr, SEXP env, SEXP n) {
R_len_t i; SEXP ans;
for (i=0; i
#define USE_RINTERNALS
#include
#include
// test function - no checks!
SEXP test(SEXP expr, SEXP env, SEXP n) { R_len_t i; SEXP tmp, nm, ans, j; j = allocVector(INTSXP, 1); ans = eval(expr, env); nm = getAttrib(ans, R_NamesSymbol); for (i=0; i identical(ans1, ans2) [1] TRUE This is without checking for names, for each of the 1e7 groups. Arun From:?Clayton Stanley cstanley at cstanley.no-ip.biz Reply:?Clayton Stanley cstanley at cstanley.no-ip.biz Date:?April 16, 2014 at 6:23:50 PM To:?datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject:? [datatable-help] data.table and aggregating out-of-order columns in result from by Copied from this SO post:?http://stackoverflow.com/questions/23097461 Here's some interesting behavior that I noticed with data.table 1.9.2 > testFun <- function(val) { if (val == 'geteeee') return(data.table(x=4,y=3)) if (val == 'get') return(data.table(y=3,x=4)) } > tbl = data.table(val=c('geteeee', 'get')) > tbl[, testFun(val), by=val] val x y 1: geteeee 4 3 2: get 3 4 > When the column order of the data tables returned from each call to testFun are mixed (but have the same name and number of columns), data.table silently binds the tables together without taking into account that they are out of order. This was probably done for speed, but I found the behavior quite unexpected, and would have appreciated at least a warning. Is there a way that I can get data.table to warn or error when this situation happens? This happened in my analysis code and caused values for two DVs to be intermixed. The reason why it happened is that in the 'testFun' there is a branch and the returned data table is created within both sides of the branch. The branch is necessary to handle the case where the data table used to create the final returned data table is empty. So on one side of that branch I basically create an empty data table with the correct columns, and on the other side the data table is created from the first. 
The point is that the column order for the data tables returned from each side of the branch are different. Now this is certainly a bug on my part in 'testFun'. However I could have caught the issue much earlier if I had received a warning from data.table when the by operation completed and the resulting tables were bound together.? Also since there isn't a check for column order, it does make me worry that there are other places in my analysis code where the same thing could be happening. What would be ideal is if there was some way for me to tell if that is the case. Perhaps a warning, temporarily increasing a 'safety' level as an options call, etc. Usually data.table is great at warning me when things are not quite right, so I was surprised when I noticed the current behavior. I understand that this was done for speed. So maybe temporarily increasing a 'safety' level is a way to keep things fast by default and have additional checks (for a speed cost) when the user wants them? This sort of mimics how compiler optimization declarations are done in common lisp. -Clayton _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From carrieromichele at gmail.com Thu Apr 17 18:26:55 2014 From: carrieromichele at gmail.com (Michele) Date: Thu, 17 Apr 2014 09:26:55 -0700 (PDT) Subject: [datatable-help] What is going on with R 3.1 ? Message-ID: <1397752015938-4689002.post@n4.nabble.com> After installing R 3.1 all my tables crated in previous version of R have lost their pointers > attr(reads, ".internal.selfref")<pointer: (nil)> *all* of them (tens of GB of .RData files). Instead a brand new dt shows: > dt<-data.table(id=1:10)> attr(dt, ".internal.selfref")<pointer: > 0x23caee8> Guys please, what is going on? Has any of you seen this problem? 
I have in both Win7 and Ubuntu 12.04. In Win I can easily downgrade but in Ubuntu I was unable to manually install from source R3.0.2 (using also ubuntu since only few weeks...). R is very unstable now when using data.table. And I get errors I never seen, like: Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE neededError in if (!missing(finally)) on.exit(finally) : missing value where TRUE/FALSE needed Here and example: *R 3.0.2* > library(data.table)data.table 1.9.3 For help type: help("data.table")> > set.seed(1)> dt <- data.table(id=1:5,var=c(rnorm(4L),NA))> > saveRDS(dt,file="dt.RDS")> sessionInfo()R version 3.0.2 > (2013-09-25)Platform: x86_64-w64-mingw32/x64 (64-bit)locale:[1] > LC_COLLATE=English_United Kingdom.1252 [2] LC_CTYPE=English_United > Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252[4] LC_NUMERIC=C > [5] LC_TIME=English_United Kingdom.1252 attached base packages:[1] > stats graphics grDevices utils datasets methods base other > attached packages:[1] data.table_1.9.3loaded via a namespace (and not > attached):[1] plyr_1.8.1 Rcpp_0.11.0 reshape2_1.2.2 stringr_0.6.2 *R 3.1* > library(data.table)data.table 1.9.3 For help type: help("data.table")> dt > <- readRDS("dt.RDS")> dput(dt)structure(list(id = 1:5, var = > c(-0.626453810742332, 0.183643324222082, -0.835628612410047, > 1.59528080213779, NA)), .Names = c("id", "var"), row.names = c(NA, -5L), > class = c("data.table", "data.frame"), .internal.selfref = <pointer: > (nil)>)> sessionInfo()R version 3.1.0 (2014-04-10)Platform: > x86_64-w64-mingw32/x64 (64-bit)locale:[1] LC_COLLATE=English_United > Kingdom.1252 [2] LC_CTYPE=English_United Kingdom.1252 [3] > LC_MONETARY=English_United Kingdom.1252[4] LC_NUMERIC=C > [5] LC_TIME=English_United Kingdom.1252 attached base packages:[1] > stats graphics grDevices utils datasets methods base other > attached packages:[1] data.table_1.9.3loaded via a namespace (and not > attached):[1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 
Unfortunately I can't reproduce an example when I get the error: Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed I'll try again. It's very unstable and I get different results time to time. The ONLY thing that is in common when I get an error is the nil pointer in R 3.1 Thanks a million in advance, Michele -- View this message in context: http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002.html Sent from the datatable-help mailing list archive at Nabble.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From carrieromichele at gmail.com Thu Apr 17 19:15:54 2014 From: carrieromichele at gmail.com (Michele) Date: Thu, 17 Apr 2014 10:15:54 -0700 (PDT) Subject: [datatable-help] What is going on with R 3.1 ? In-Reply-To: <1397752015938-4689002.post@n4.nabble.com> References: <1397752015938-4689002.post@n4.nabble.com> Message-ID: <1397754954723-4689005.post@n4.nabble.com> If I try to /copy()/ the tables to re-create the pointers all the tables get the *same* one, always ending in 788 (opening another R session I get the same result, same pointers accross the table, ending in 788): > dt <- readRDS("dt.RDS")> dt1 <- readRDS("dt1.RDS")> > dput(copy(dt))structure(list(id = 1:5, var = c(-0.626453810742332, > 0.183643324222082, -0.835628612410047, 1.59528080213779, NA)), .Names = > c("id", "var"), row.names = c(NA, -5L), class = c("data.table", > "data.frame"), .internal.selfref = <pointer: 0x00000000003e0788>)> > dput(copy(dt1))structure(list(id = 1:10, var2 = c(9.56605515060971, > 9.83445938796679, 9.14481066107107, 10.5308762543727, NA, NA, > 9.45024506381369, 10.0544601882071, 10.7019565371199, 9.60325830155958), > var3 = c(99.6542887323489, 100.659231456233, 100.282028460177, > 101.423474800432, 98.4134121985332, NA, 98.0406472259105, > 100.253194731594, 100.759881151841, 99.7930775536395)), .Names = c("id", > "var2", "var3"), row.names = c(NA, -10L), class = c("data.table", > "data.frame"), sorted = 
"id", .internal.selfref = <pointer: > 0x00000000003e0788>) -- View this message in context: http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002p4689005.html Sent from the datatable-help mailing list archive at Nabble.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lianoglou.steve at gene.com Thu Apr 17 20:37:12 2014 From: lianoglou.steve at gene.com (Steve Lianoglou) Date: Thu, 17 Apr 2014 11:37:12 -0700 Subject: [datatable-help] What is going on with R 3.1 ? In-Reply-To: <1397754954723-4689005.post@n4.nabble.com> References: <1397752015938-4689002.post@n4.nabble.com> <1397754954723-4689005.post@n4.nabble.com> Message-ID: Hi, On Thu, Apr 17, 2014 at 10:15 AM, Michele wrote: > If I try to copy() the tables to re-create the pointers all the tables get > the same one, always ending in 788 (opening another R session I get the same > result, same pointers accross the table, ending in 788): > >> dt <- readRDS("dt.RDS") >> dt1 <- readRDS("dt1.RDS") >> dput(copy(dt)) > structure(list(id = 1:5, var = c(-0.626453810742332, 0.183643324222082, > -0.835628612410047, 1.59528080213779, NA)), .Names = c("id", > "var"), row.names = c(NA, -5L), class = c("data.table", "data.frame" > ), .internal.selfref = ) >> dput(copy(dt1)) > structure(list(id = 1:10, var2 = c(9.56605515060971, 9.83445938796679, > 9.14481066107107, 10.5308762543727, NA, NA, 9.45024506381369, > 10.0544601882071, 10.7019565371199, 9.60325830155958), var3 = > c(99.6542887323489, > 100.659231456233, 100.282028460177, 101.423474800432, 98.4134121985332, > NA, 98.0406472259105, 100.253194731594, 100.759881151841, 99.7930775536395 > )), .Names = c("id", "var2", "var3"), row.names = c(NA, -10L), class = > c("data.table", > "data.frame"), sorted = "id", .internal.selfref = 0x00000000003e0788>) Is this actually causing a problem for you somewhere? For instance, if you modify-by-reference "dt", like `dt[, z := 1]`, is `dt1` modified as well (it shouldn't be)? 
Let's try another experiment. Forget loading from an *.rds and just create a brand new data.table and look at its .internal.selfref: R> a <- data.table(a=1:10) R> b <- data.table(a=1:10) R> attributes(a)$.internal.selfref R> attributes(b)$.internal.selfref What do you see? Also, your initial post referenced two errors that both mentioned something about a missing logical value ("missing value where TRUE/FALSE needed"), but it's not clear what triggered the error. Have you been able to reproduce this? Next time this happens, can you call "traceback()" and provide the stack trace here? Also, your original post also said: > It's very unstable and I get different results time to time What is very unstable? And what results are different? What exact operations are you performing that are producing different results when you run them more than once? If you could provide a more precise explanation of what the problem is you are actually encountering via some piece of code that is producing the unexpected behavior you are observing, that would be most helpful. Thanks, -steve -- Steve Lianoglou Computational Biologist Genentech From carrieromichele at gmail.com Thu Apr 17 21:22:13 2014 From: carrieromichele at gmail.com (Michele) Date: Thu, 17 Apr 2014 12:22:13 -0700 (PDT) Subject: [datatable-help] What is going on with R 3.1 ? In-Reply-To: References: <1397752015938-4689002.post@n4.nabble.com> <1397754954723-4689005.post@n4.nabble.com> Message-ID: <1397762533637-4689020.post@n4.nabble.com> > a <- data.table(a=1:10) > b <- data.table(a=1:10) > attributes(a)$.internal.selfref > attributes(b)$.internal.selfref The real problem is that my usual codes fail, giving error like Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed when I do like x[y, `:=`(a = i.a, b = i.b)] As I said I'll try to replicate the error. However I already gave you something, if you read my post I show how moving data from 3.0.2 to 3.1 creates the problem. 
Thankfully I just managed to downgrade to 3.0.2 so I don't get fired... -- View this message in context: http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002p4689020.html Sent from the datatable-help mailing list archive at Nabble.com. From my.r.help at gmail.com Fri Apr 18 13:53:23 2014 From: my.r.help at gmail.com (Michael Smith) Date: Fri, 18 Apr 2014 19:53:23 +0800 Subject: [datatable-help] Subsetting with logical Message-ID: <53511233.1060505@gmail.com> Hi All, This is about subsetting using logicals. The code below is self-explanatory (I hope). Is this a bug or a feature? Thanks, M > DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) > ## This does *not* work, but it should (in my humble opinion). > DT[b] Error in eval(expr, envir, enclos) : object 'b' not found > ## This does work, but seems a bit awkward, given that b is already > ## logical. > DT[b == TRUE] a b 1: 1 TRUE 2: 3 TRUE 3: 5 TRUE 4: 7 TRUE > ## With data.frame things work as expected. > DF <- as.data.frame(DT) > DF[DF$b, ] a b 1 1 TRUE 3 3 TRUE 5 5 TRUE 7 7 TRUE > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C LC_TIME=en_US.utf8 [4] LC_COLLATE=en_US.utf8 LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.9.2 colorout_1.0-1 loaded via a namespace (and not attached): [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 From mark at outins.com Fri Apr 18 18:37:43 2014 From: mark at outins.com (Mark Danese) Date: Fri, 18 Apr 2014 16:37:43 +0000 Subject: [datatable-help] fread for flat files Message-ID: Is it possible to pass a vector of column widths to have fread read in a flat file? 
I saw that someone suggested using csvkit to add commas and then use data.table, but that is beyond my skill set.

From brodie.gaslam at yahoo.com Fri Apr 18 22:19:59 2014
From: brodie.gaslam at yahoo.com (brodie gaslam)
Date: Fri, 18 Apr 2014 13:19:59 -0700 (PDT)
Subject: [datatable-help] dplyr vs. data.table benchmarks
Message-ID: <1397852399.42958.YahooMailNeo@web162203.mail.bf1.yahoo.com>

After my original question on SO got shut down, I went ahead and ran my own relatively comprehensive benchmarks. Interestingly `dplyr` and `data.table` appear to be comparable until you start having large numbers of groups (100K+), at which point `data.table` seems to be a fair bit faster. Sharing here as it might be of interest to you guys.

r - data.table vs dplyr: can one do something well the other can't or does poorly? (stackoverflow.com)

data.table vs. dplyr | brodieG (www.brodieg.com)

From my.r.help at gmail.com Sat Apr 19 14:07:16 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Sat, 19 Apr 2014 20:07:16 +0800
Subject: [datatable-help] fread for flat files
In-Reply-To:
References:
Message-ID: <535266F4.6030201@gmail.com>

Probably you could do this from the Linux command line using `sed`, i.e. to replace several spaces with a comma.

https://www.google.com/search?q=sed+replace+space+with+comma

If you're on Windows, you probably can do the same using Cygwin.
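
For the case of a true fixed-width file (a vector of column widths and no separator at all), the split can also be sketched in R itself. This is a minimal sketch under stated assumptions: the three sample records and the column names and widths (8, 3, 1) are illustrative only, and nothing here is an fread feature; it simply cuts each record with substring() and builds a data.table from the pieces.

```r
library(data.table)

# Illustrative fixed-width records (assumed layout: id = 8 chars, age = 3, gender = 1)
recs   <- c("pat000010381", "pat000020292", "pat000030571")
widths <- c(patid = 8, age = 3, gender = 1)

ends   <- cumsum(widths)      # last character of each field
starts <- ends - widths + 1   # first character of each field

# Cut every record at the same character positions, one column per field
cols <- lapply(seq_along(widths),
               function(i) substring(recs, starts[i], ends[i]))
names(cols) <- names(widths)

DT <- as.data.table(cols)
DT[, `:=`(age = as.integer(age), gender = as.integer(gender))]
DT
```

For a real file, recs would come from something like readLines() on the file path. Whether this beats read.fwf on a large file has not been measured here.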
M

On 04/19/2014 12:37 AM, Mark Danese wrote:
> Is it possible to pass a vector of column widths to have fread read in a
> flat file? I saw that someone suggested using csvkit to add commas and
> then use data table, but that is beyond my skill set.

From mark at outins.com Sun Apr 20 09:04:41 2014
From: mark at outins.com (Mark Danese)
Date: Sun, 20 Apr 2014 07:04:41 +0000
Subject: [datatable-help] fread for flat files
In-Reply-To: <535266F4.6030201@gmail.com>
References: <535266F4.6030201@gmail.com>
Message-ID:

Thanks Michael. The flat file format doesn't have spaces between fields. They are all concatenated. It may be possible to use sed with a vector of widths, but I am not a command-line person (yet).

It just may be one of those things that isn't easy to implement in fread. In healthcare in the US there are still a lot of flat files out there. We usually use SAS but I am trying to get away from that. And R can read flat files (read.fwf), but it is pretty slow. From what I understand, read.fwf actually does insert commas and then reads the file. So, it might be possible to hack read.fwf and fread together somehow.

My first experience with fread was to read in a 1.6 GB file in 30 seconds. That was pretty impressive.

On 4/19/14, 5:07 AM, "Michael Smith" wrote:

>Probably you could do this from the Linux command line using `sed`, i.e.
>to replace several spaces with a comma.
>
>https://www.google.com/search?q=sed+replace+space+with+comma
>
>If you're on Windows, you probably can do the same using Cygwin.
>
>M
>
>
>On 04/19/2014 12:37 AM, Mark Danese wrote:
>> Is it possible to pass a vector of column widths to have fread read in a
>> flat file? I saw that someone suggested using csvkit to add commas and
>> then use data table, but that is beyond my skill set.
>> >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-he >>lp >> From my.r.help at gmail.com Sun Apr 20 16:36:01 2014 From: my.r.help at gmail.com (Michael Smith) Date: Sun, 20 Apr 2014 22:36:01 +0800 Subject: [datatable-help] fread for flat files In-Reply-To: References: <535266F4.6030201@gmail.com> Message-ID: <5353DB51.8050308@gmail.com> Not sure exactly what you mean by "flat file." I previously assumed you mean fixed width formatted data, but now you say they are concatenated and there are no spaces. So what's the column separator? Tab, comma, ...? If everything else fails, try Stat/Transfer. M On 04/20/2014 03:04 PM, Mark Danese wrote: > Thanks Michael. The flat file format doesn?t have spaces between fields. > They are all concatenated. It may be possible to use sed with a vector of > widths, but I am not a command-line person (yet). > > It just may be one of those things that isn?t easy to implement in fread. > In healthcare in the US there are still a lot of flat files out there. We > usually use SAS but I am trying to get away from that. And R can read > flat files(read.fwf), but it is pretty slow. From what I understand, > read.fwf actually does insert commas and then reads the file. So, it > might be possible to hack read.fwf and fread together somehow. > > My first experience with fread was to read in a 1.6 GB file in 30 seconds. > That was pretty impressive. > > > On 4/19/14, 5:07 AM, "Michael Smith" wrote: > >> Probably you could do this from the Linux command line using `sed`, i.e. >> to replace several spaces with a comma. >> >> https://www.google.com/search?q=sed+replace+space+with+comma >> >> If you're on Windows, you probably can do the same using Cygwin. 
>>
>> M
>>
>>
>> On 04/19/2014 12:37 AM, Mark Danese wrote:
>>> Is it possible to pass a vector of column widths to have fread read in a
>>> flat file? I saw that someone suggested using csvkit to add commas and
>>> then use data table, but that is beyond my skill set.
>>>
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>

From mark at outins.com Sun Apr 20 17:48:25 2014
From: mark at outins.com (Mark Danese)
Date: Sun, 20 Apr 2014 15:48:25 +0000
Subject: [datatable-help] fread for flat files
In-Reply-To: <5353DB51.8050308@gmail.com>
References: <535266F4.6030201@gmail.com> <5353DB51.8050308@gmail.com>
Message-ID:

There is no column separator for what I am referring to as "flat files", which is the challenge. My apologies if my terminology is off. By "flat file" I mean that every field has a constant width for every row. So, a file with 3 variables might have a patient id character field that is 8 wide, followed by an age in a numeric field 3 wide, followed by a gender as an integer that is 1 wide. It would look like this. Sometimes there are spaces if fields are empty or if the variables are smaller than the space allotted.

pat000010381
pat000020292
pat000030571

which needs to be separated into

Patid    age gender
pat00001 038 1
pat00002 029 2
pat00003 057 1

We have Stat/Transfer and it doesn't do flat files as far as I can tell. As I said, we can do it quickly and easily in SAS. Most datasets ship with a SAS program for conversion, and Anthony Damico has written the R package SAScii to parse SAS load files into an R script that goes to read.fwf(). But I am trying to see if it is possible to allow fread to take a vector to split the file. For the above it would be something like passing sep=c(8,3,1).

All Medicare claims data and most health related national surveys come this way.
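
A minimal sketch of the split described above, assuming the three example records and the widths 8, 3 and 1 from this message. The names `records`, `widths` and `split_record` are made up for illustration, and joining with commas assumes that no field itself contains a comma. The last step assumes an installed data.table whose fread() accepts a `text` argument, so it is guarded and skipped otherwise.

```r
# Example records from the message: id (8 wide), age (3 wide), gender (1 wide).
records <- c("pat000010381", "pat000020292", "pat000030571")
widths  <- c(8, 3, 1)

# Positions implied by the widths: fields span 1-8, 9-11 and 12-12.
col_end   <- cumsum(widths)
col_start <- col_end - widths + 1

# substring() is vectorised over start/end, so one call cuts a record into
# all of its fields; the fields are then rejoined with commas.
split_record <- function(x) {
  paste(substring(x, col_start, col_end), collapse = ",")
}

csv_lines <- c("patid,age,gender",
               vapply(records, split_record, character(1), USE.NAMES = FALSE))
print(csv_lines)

# Optional: hand the delimited lines to fread() if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  DT <- data.table::fread(text = csv_lines)
  print(DT)
}
```

This converts every record in R before parsing, so it is unlikely to match the speed of a native fixed-width reader; it only illustrates how a width vector maps onto start and end positions.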
As an example, the bottom of this file shows how such a file is loaded:

https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core.SAS

This is the first 20% of the list:

*** Read data elements from the ASCII file ***;
INPUT
  @1   AGE          N3PF.
  @4   AGEDAY       N3PF.
  @7   AMONTH       N2PF.
  @9   ASOURCE      N2PF.
  @11  ASOURCEUB92  $CHAR1.
  @12  ASOURCE_X    $CHAR3.
  @15  ATYPE        N2PF.
  @17  AWEEKEND     N2PF.
  @19  DIED         N2PF.
  @21  DISCWT       N11P7F.

On 4/20/14, 7:36 AM, "Michael Smith" wrote:

>Not sure exactly what you mean by "flat file." I previously assumed you
>mean fixed width formatted data, but now you say they are concatenated
>and there are no spaces. So what's the column separator? Tab, comma, ...?
>
>If everything else fails, try Stat/Transfer.
>
>M
>
>On 04/20/2014 03:04 PM, Mark Danese wrote:
>> Thanks Michael. The flat file format doesn't have spaces between fields.
>> They are all concatenated. It may be possible to use sed with a vector of
>> widths, but I am not a command-line person (yet).
>>
>> It just may be one of those things that isn't easy to implement in fread.
>> In healthcare in the US there are still a lot of flat files out there. We
>> usually use SAS but I am trying to get away from that. And R can read
>> flat files(read.fwf), but it is pretty slow. From what I understand,
>> read.fwf actually does insert commas and then reads the file. So, it
>> might be possible to hack read.fwf and fread together somehow.
>>
>> My first experience with fread was to read in a 1.6 GB file in 30 seconds.
>> That was pretty impressive.
>>
>>
>> On 4/19/14, 5:07 AM, "Michael Smith" wrote:
>>
>>> Probably you could do this from the Linux command line using `sed`, i.e.
>>> to replace several spaces with a comma.
>>>
>>> https://www.google.com/search?q=sed+replace+space+with+comma
>>>
>>> If you're on Windows, you probably can do the same using Cygwin.
>>> >>> M >>> >>> >>> On 04/19/2014 12:37 AM, Mark Danese wrote: >>>> Is it possible to pass a vector of column widths to have fread read >>>>in a >>>> flat file? I saw that someone suggested using csvkit to add commas >>>>and >>>> then use data table, but that is beyond my skill set. >>>> >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> datatable-help at lists.r-forge.r-project.org >>>> >>>> >>>>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable- >>>>he >>>> lp >>>> >> From my.r.help at gmail.com Mon Apr 21 02:16:06 2014 From: my.r.help at gmail.com (Michael Smith) Date: Mon, 21 Apr 2014 08:16:06 +0800 Subject: [datatable-help] fread for flat files In-Reply-To: References: <535266F4.6030201@gmail.com> <5353DB51.8050308@gmail.com> Message-ID: <53546346.7000007@gmail.com> Take a look here: http://stackoverflow.com/questions/8630053/unix-cut-command-adding-own-delimiter I think the `sed` command at the bottom should work easily, but you would need to adjust the number of dots to your column width. Once you've figured out the correct command, you can then use it like this: DT <- fread(pipe("sed ..... There is no column separator for what I am referring to as "flat files" > which is the challenge. My apologies if my terminology is off. By ?flat > file? I mean that every field has a constant width for every row. So, a > file with 3 variables might have an patient id character field that is 8 > wide, followed by an age in a numeric field 3 wide, followed by a gender > as an integer that is 1 wide. It would look like this. Sometimes there > are spaces if fields are empty or if the variables are smaller than the > space allotted. > pat000010381 > pat000020292 > pat000030571 > > which needs to be separated into > > Patid age gender > pat00001 038 1 > pat00002 029 2 > pat00003 057 1 > > > We have Stat/Transfer and it doesn?t do flat files as far as I can tell. 
> As I said, we can do it quickly and easily in SAS. Most datasets ship > with a SAS program for conversion and Adam Damico has written the R > package SAScii to parse SAS load files into an R script that goes to > read.fwf(). But I am trying to see if it is possible to allow fread to > take a vector to split the file. For the above it would be something like > passing sep=(8,3,1). > > All Medicare claims data and most health related national surveys come > this way. For an example, the bottom of this file shows how such a file > is loaded > > https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core > .SAS > > This is the first 20% of the list: > *** Read data elements from the ASCII file ***; > INPUT > @1 AGE N3PF. > @4 AGEDAY N3PF. > @7 AMONTH N2PF. > @9 ASOURCE N2PF. > @11 ASOURCEUB92 $CHAR1. > @12 ASOURCE_X $CHAR3. > @15 ATYPE N2PF. > @17 AWEEKEND N2PF. > @19 DIED N2PF. > @21 DISCWT N11P7F. > > > > > On 4/20/14, 7:36 AM, "Michael Smith" wrote: > >> Not sure exactly what you mean by "flat file." I previously assumed you >> mean fixed width formatted data, but now you say they are concatenated >> and there are no spaces. So what's the column separator? Tab, comma, ...? >> >> If everything else fails, try Stat/Transfer. >> >> M >> >> On 04/20/2014 03:04 PM, Mark Danese wrote: >>> Thanks Michael. The flat file format doesn?t have spaces between >>> fields. >>> They are all concatenated. It may be possible to use sed with a vector >>> of >>> widths, but I am not a command-line person (yet). >>> >>> It just may be one of those things that isn?t easy to implement in >>> fread. >>> In healthcare in the US there are still a lot of flat files out there. >>> We >>> usually use SAS but I am trying to get away from that. And R can read >>> flat files(read.fwf), but it is pretty slow. From what I understand, >>> read.fwf actually does insert commas and then reads the file. So, it >>> might be possible to hack read.fwf and fread together somehow. 
>>>
>>> My first experience with fread was to read in a 1.6 GB file in 30 seconds.
>>> That was pretty impressive.
>>>
>>>
>>> On 4/19/14, 5:07 AM, "Michael Smith" wrote:
>>>
>>>> Probably you could do this from the Linux command line using `sed`, i.e.
>>>> to replace several spaces with a comma.
>>>>
>>>> https://www.google.com/search?q=sed+replace+space+with+comma
>>>>
>>>> If you're on Windows, you probably can do the same using Cygwin.
>>>>
>>>> M
>>>>
>>>>
>>>> On 04/19/2014 12:37 AM, Mark Danese wrote:
>>>>> Is it possible to pass a vector of column widths to have fread read in a
>>>>> flat file? I saw that someone suggested using csvkit to add commas and
>>>>> then use data table, but that is beyond my skill set.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>
>

From my.r.help at gmail.com Mon Apr 21 03:13:46 2014
From: my.r.help at gmail.com (Michael Smith)
Date: Mon, 21 Apr 2014 09:13:46 +0800
Subject: [datatable-help] fread for flat files
In-Reply-To: <53546346.7000007@gmail.com>
References: <535266F4.6030201@gmail.com> <5353DB51.8050308@gmail.com> <53546346.7000007@gmail.com>
Message-ID: <535470CA.9080409@gmail.com>

If you want a completely R-centric version, do it as follows. First, create the following file, save it as `flat2csv.R` and make it executable (chmod u+x flat2csv.R).

#!/usr/bin/env Rscript
# Convert fixed-width records read from stdin into comma-separated lines.
col.start <- c(1, 9, 12)   # first character of each field
col.end   <- c(8, 11, 12)  # last character of each field (widths 8, 3, 1)
con <- file("stdin", open = "r")
# Process one record at a time so the whole file is never held in memory.
while (length(this.line <- readLines(con, n = 1, warn = FALSE)) > 0)
    writeLines(paste0(substring(this.line, col.start, col.end), collapse = ","))
close(con)

Then open R and run the following code to read your flat file:

library("data.table")
fread("./flat2csv.R

Take a look here:
>
> http://stackoverflow.com/questions/8630053/unix-cut-command-adding-own-delimiter
>
> I think the `sed` command at the bottom should work easily, but you
> would need to adjust the number of dots to your column width.
>
> Once you've figured out the correct command, you can then use it like this:
>
> DT <- fread(pipe("sed .....
> On the other hand, if you can get it into SAS easily, then you can just
> convert it from there using Stat/Transfer. But I do understand that
> you're looking for a data.table-centric solution, so maybe the above
> attempt will help.
>
> Cheers,
>
> M
>
>
>
> On 04/20/2014 11:48 PM, Mark Danese wrote:
>> There is no column separator for what I am referring to as "flat files"
>> which is the challenge. My apologies if my terminology is off. By "flat
>> file" I mean that every field has a constant width for every row. So, a
>> file with 3 variables might have an patient id character field that is 8
>> wide, followed by an age in a numeric field 3 wide, followed by a gender
>> as an integer that is 1 wide. It would look like this. Sometimes there
>> are spaces if fields are empty or if the variables are smaller than the
>> space allotted.
>> pat000010381
>> pat000020292
>> pat000030571
>>
>> which needs to be separated into
>>
>> Patid age gender
>> pat00001 038 1
>> pat00002 029 2
>> pat00003 057 1
>>
>>
>> We have Stat/Transfer and it doesn't do flat files as far as I can tell.
>> As I said, we can do it quickly and easily in SAS.
Most datasets ship >> with a SAS program for conversion and Adam Damico has written the R >> package SAScii to parse SAS load files into an R script that goes to >> read.fwf(). But I am trying to see if it is possible to allow fread to >> take a vector to split the file. For the above it would be something like >> passing sep=(8,3,1). >> >> All Medicare claims data and most health related national surveys come >> this way. For an example, the bottom of this file shows how such a file >> is loaded >> >> https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core >> .SAS >> >> This is the first 20% of the list: >> *** Read data elements from the ASCII file ***; >> INPUT >> @1 AGE N3PF. >> @4 AGEDAY N3PF. >> @7 AMONTH N2PF. >> @9 ASOURCE N2PF. >> @11 ASOURCEUB92 $CHAR1. >> @12 ASOURCE_X $CHAR3. >> @15 ATYPE N2PF. >> @17 AWEEKEND N2PF. >> @19 DIED N2PF. >> @21 DISCWT N11P7F. >> >> >> >> >> On 4/20/14, 7:36 AM, "Michael Smith" wrote: >> >>> Not sure exactly what you mean by "flat file." I previously assumed you >>> mean fixed width formatted data, but now you say they are concatenated >>> and there are no spaces. So what's the column separator? Tab, comma, ...? >>> >>> If everything else fails, try Stat/Transfer. >>> >>> M >>> >>> On 04/20/2014 03:04 PM, Mark Danese wrote: >>>> Thanks Michael. The flat file format doesn?t have spaces between >>>> fields. >>>> They are all concatenated. It may be possible to use sed with a vector >>>> of >>>> widths, but I am not a command-line person (yet). >>>> >>>> It just may be one of those things that isn?t easy to implement in >>>> fread. >>>> In healthcare in the US there are still a lot of flat files out there. >>>> We >>>> usually use SAS but I am trying to get away from that. And R can read >>>> flat files(read.fwf), but it is pretty slow. From what I understand, >>>> read.fwf actually does insert commas and then reads the file. So, it >>>> might be possible to hack read.fwf and fread together somehow. 
>>>> >>>> My first experience with fread was to read in a 1.6 GB file in 30 >>>> seconds. >>>> That was pretty impressive. >>>> >>>> >>>> On 4/19/14, 5:07 AM, "Michael Smith" wrote: >>>> >>>>> Probably you could do this from the Linux command line using `sed`, >>>>> i.e. >>>>> to replace several spaces with a comma. >>>>> >>>>> https://www.google.com/search?q=sed+replace+space+with+comma >>>>> >>>>> If you're on Windows, you probably can do the same using Cygwin. >>>>> >>>>> M >>>>> >>>>> >>>>> On 04/19/2014 12:37 AM, Mark Danese wrote: >>>>>> Is it possible to pass a vector of column widths to have fread read >>>>>> in a >>>>>> flat file? I saw that someone suggested using csvkit to add commas >>>>>> and >>>>>> then use data table, but that is beyond my skill set. >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> datatable-help mailing list >>>>>> datatable-help at lists.r-forge.r-project.org >>>>>> >>>>>> >>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable- >>>>>> he >>>>>> lp >>>>>> >>>> >> From long at dc-energy.com Tue Apr 22 20:07:59 2014 From: long at dc-energy.com (Zachary Long) Date: Tue, 22 Apr 2014 14:07:59 -0400 Subject: [datatable-help] R Studio Interactions with data.table Message-ID: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com> Hello, I was wondering if an error like this had been addressed before. I am using data table 1.9.2. It appears that the error has to do with the interaction with R-Studio. When I run > library(data.table) > dt<-data.table(strip="Nov08",date=c("2006-08-01","2006-08-02","2006-08-03","2006-08-04","2006-08-07", > "2006-08-08","2006-08-09","2006-08-10","2006-08-11","2006-08-14")) > dt[,forward_date:=c(rep(NA,5),date),by='strip'] The result I expect is below, along with a warning message. 

    strip       date forward_date
 1: Nov08 2006-08-01           NA
 2: Nov08 2006-08-02           NA
 3: Nov08 2006-08-03           NA
 4: Nov08 2006-08-04           NA
 5: Nov08 2006-08-07           NA
 6: Nov08 2006-08-08   2006-08-01
 7: Nov08 2006-08-09   2006-08-02
 8: Nov08 2006-08-10   2006-08-03
 9: Nov08 2006-08-11   2006-08-04
10: Nov08 2006-08-14   2006-08-07

However, I don't get this.

One of two things can happen.

1. My R-Studio will completely crash without warning. All unsaved information is lost.
2. I can get "Error: Value of SET_STRING_ELT() must be a 'CHARSXP' not a 'character'" "In addition:Lost warning messages"

Do you know what is the cause here? It seems related to memory allocation, or something under the hood relating to the interaction of R-Studio and data table.

Zach

From kevinushey at gmail.com Tue Apr 22 20:18:27 2014
From: kevinushey at gmail.com (Kevin Ushey)
Date: Tue, 22 Apr 2014 11:18:27 -0700
Subject: [datatable-help] R Studio Interactions with data.table
In-Reply-To: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com>
References: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com>
Message-ID:

FWIW, I can reproduce this segfault within the console as well, including with the latest SVN version of data.table 1.9.3.

Running under R -d lldb, I don't get a segfault off the bat; I get:

> dt[,forward_date:=c(rep(NA,5),date),by='strip']

Warning message:
In `[.data.table`(dt, , `:=`(forward_date, c(rep(NA, 5), date)), :
  RHS 1 is length 15 (greater than the size (10) of group 1). The last 5 element(s) will be discarded.
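
The warning above comes from the right-hand side being 15 long (5 NAs plus all 10 dates) for a group of 10 rows. A minimal sketch of a length-matched version follows, assuming the intent is a five-row lag within each strip, as the expected output in the original report suggests. `lag_n` is a hypothetical helper, not a data.table function, and this sketch does not explain or fix the segfault itself.

```r
# Dates from the report; all rows belong to the single group "Nov08".
date <- c("2006-08-01", "2006-08-02", "2006-08-03", "2006-08-04", "2006-08-07",
          "2006-08-08", "2006-08-09", "2006-08-10", "2006-08-11", "2006-08-14")

# Hypothetical helper: shift x down by n places, padding the top with NA and
# dropping the last n values, so the result has the same length as x.
# min() keeps the padding from exceeding the group size for short groups.
lag_n <- function(x, n) c(rep(NA, min(n, length(x))), head(x, -n))

forward_date <- lag_n(date, 5)
print(data.frame(date, forward_date))

# With data.table installed, the same helper can be applied per group.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(strip = "Nov08", date = date)
  dt[, forward_date := lag_n(date, 5), by = "strip"]
  print(dt)
}
```

Because the helper always returns exactly as many values as the group has rows, `:=` has nothing to discard.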
I also see the error, if I play around in the console a bit after launching R: > dt[,forward_date:=c(rep(NA,5),date),by='strip'] Process 55776 stopped * thread #1: tid = 0x13b169, 0x000000010002bf60 libR.dylib`Rf_copyMostAttrib(inp=0x0000000101bb0f50, ans=0x0000000101bb0ea8) + 192 at attrib.c:274, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x417e) frame #0: 0x000000010002bf60 libR.dylib`Rf_copyMostAttrib(inp=0x0000000101bb0f50, ans=0x0000000101bb0ea8) + 192 at attrib.c:274 271 PROTECT(ans); 272 PROTECT(inp); 273 for (s = ATTRIB(inp); s != R_NilValue; s = CDR(s)) { -> 274 if ((TAG(s) != R_NamesSymbol) && 275 (TAG(s) != R_DimSymbol) && 276 (TAG(s) != R_DimNamesSymbol)) { 277 installAttrib(ans, TAG(s), CAR(s)); Hopefully this gives a starting point in debugging... Kevin On Tue, Apr 22, 2014 at 11:07 AM, Zachary Long wrote: > Hello, > > I was wondering if an error like this had been addressed before. I am using > data table 1.9.2. > > It appears that the error has to do with the interaction with R-Studio. When > I run > > library(data.table) > dt<-data.table(strip="Nov08",date=c("2006-08-01","2006-08-02","2006-08-03","2006-08-04","2006-08-07", > > "2006-08-08","2006-08-09","2006-08-10","2006-08-11","2006-08-14")) > dt[,forward_date:=c(rep(NA,5),date),by='strip'] > > > The result I expect is below, along with a warning message. > > strip date forward_date > 1: Nov08 2006-08-01 NA > 2: Nov08 2006-08-02 NA > 3: Nov08 2006-08-03 NA > 4: Nov08 2006-08-04 NA > 5: Nov08 2006-08-07 NA > 6: Nov08 2006-08-08 2006-08-01 > 7: Nov08 2006-08-09 2006-08-02 > 8: Nov08 2006-08-10 2006-08-03 > 9: Nov08 2006-08-11 2006-08-04 > 10: Nov08 2006-08-14 2006-08-07 > > > However, I don't get this. > > 1 of two things can happen. > > 1. My R-Studio will completely crash without warning. All unsaved > information is lost. > 2. 
I can get "Error: Value of SET_STRING_ELT() must be a 'CHARSXP' not a > 'character'" "In addition:Lost warning messages" > > Do you know what is the cause here? It seems related to memory allocation, > or something under the hood relating to the interaction of R-Studio and data > table. > > Zach > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From levkowitz at dc-energy.com Tue Apr 22 20:23:10 2014 From: levkowitz at dc-energy.com (Shir Levkowitz) Date: Tue, 22 Apr 2014 14:23:10 -0400 Subject: [datatable-help] R Studio Interactions with data.table In-Reply-To: References: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com> Message-ID: <2543EEE2-369D-4B20-8657-4CD05C4C3339@dc-energy.com> I have gotten a crash right off the bat in console R (3.0.2, data.table 1.9.2) on Linux, *** caught segfault *** address (nil), cause 'unknown' Traceback: 1: `[.data.table`(dt, , `:=`(forward_date, c(rep(NA, 5), date)), by = "strip") 2: dt[, `:=`(forward_date, c(rep(NA, 5), date)), by = "strip"] Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace Selection: 2 Save workspace image? [y/n/c]: n Warning message: In `[.data.table`(dt, , `:=`(forward_date, c(rep(NA, 5), date)), : RHS 1 is length 15 (greater than the size (10) of group 1). The last 5 element(s) will be discarded. It seems like this bug may be related, though three years old: https://r-forge.r-project.org/tracker/?group_id=240&atid=975&func=detail&aid=1664 Shir On Apr 22, 2014, at 2:18 PM, Kevin Ushey wrote: > FWIW, I can reproduce this segfault within the console as well, > including with the latest SVN version of data.table 1.9.3. 
> > Running under R -d lldb, I don't get a segfault off the bat; I get: > >> dt[,forward_date:=c(rep(NA,5),date),by='strip'] > > Warning message: > > In `[.data.table`(dt, , `:=`(forward_date, c(rep(NA, 5), date)), : > RHS 1 is length 15 (greater than the size (10) of group 1). The last > 5 element(s) will be discarded. > > I also see the error, if I play around in the console a bit after launching R: > >> dt[,forward_date:=c(rep(NA,5),date),by='strip'] > > Process 55776 stopped > > * thread #1: tid = 0x13b169, 0x000000010002bf60 > libR.dylib`Rf_copyMostAttrib(inp=0x0000000101bb0f50, > ans=0x0000000101bb0ea8) + 192 at attrib.c:274, queue = > 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, > address=0x417e) > > frame #0: 0x000000010002bf60 > libR.dylib`Rf_copyMostAttrib(inp=0x0000000101bb0f50, > ans=0x0000000101bb0ea8) + 192 at attrib.c:274 > 271 PROTECT(ans); > 272 PROTECT(inp); > 273 for (s = ATTRIB(inp); s != R_NilValue; s = CDR(s)) { > -> 274 if ((TAG(s) != R_NamesSymbol) && > 275 (TAG(s) != R_DimSymbol) && > 276 (TAG(s) != R_DimNamesSymbol)) { > 277 installAttrib(ans, TAG(s), CAR(s)); > > Hopefully this gives a starting point in debugging... > > Kevin > > On Tue, Apr 22, 2014 at 11:07 AM, Zachary Long wrote: >> Hello, >> >> I was wondering if an error like this had been addressed before. I am using >> data table 1.9.2. >> >> It appears that the error has to do with the interaction with R-Studio. When >> I run >> >> library(data.table) >> dt<-data.table(strip="Nov08",date=c("2006-08-01","2006-08-02","2006-08-03","2006-08-04","2006-08-07", >> >> "2006-08-08","2006-08-09","2006-08-10","2006-08-11","2006-08-14")) >> dt[,forward_date:=c(rep(NA,5),date),by='strip'] >> >> >> The result I expect is below, along with a warning message. 
>> >> strip date forward_date >> 1: Nov08 2006-08-01 NA >> 2: Nov08 2006-08-02 NA >> 3: Nov08 2006-08-03 NA >> 4: Nov08 2006-08-04 NA >> 5: Nov08 2006-08-07 NA >> 6: Nov08 2006-08-08 2006-08-01 >> 7: Nov08 2006-08-09 2006-08-02 >> 8: Nov08 2006-08-10 2006-08-03 >> 9: Nov08 2006-08-11 2006-08-04 >> 10: Nov08 2006-08-14 2006-08-07 >> >> >> However, I don't get this. >> >> 1 of two things can happen. >> >> 1. My R-Studio will completely crash without warning. All unsaved >> information is lost. >> 2. I can get "Error: Value of SET_STRING_ELT() must be a 'CHARSXP' not a >> 'character'" "In addition:Lost warning messages" >> >> Do you know what is the cause here? It seems related to memory allocation, >> or something under the hood relating to the interaction of R-Studio and data >> table. >> >> Zach >> >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From pauljohn32 at gmail.com Thu Apr 24 00:25:49 2014 From: pauljohn32 at gmail.com (Paul Johnson) Date: Wed, 23 Apr 2014 17:25:49 -0500 Subject: [datatable-help] What is going on with R 3.1 ? In-Reply-To: <1397762533637-4689020.post@n4.nabble.com> References: <1397752015938-4689002.post@n4.nabble.com> <1397754954723-4689005.post@n4.nabble.com> <1397762533637-4689020.post@n4.nabble.com> Message-ID: I was going to say you don't mention if you re-compiled data.table to go with the new R and such. update.packages(checkBuilt=TRUE) because you have not mentioned if you did that... Then I was going to say "show us the commands that produced those errors you mentioned and post us an example object to test". I still think you should. 

But I've just tested your example and I see the same, but am not sure something is wrong because the .internal.selfref is not saved by the RDS.

In a session, I save a new data.table as usual:

> dt<-data.table(id=1:10)
> attr(dt, ".internal.selfref")
> saveRDS(dt, "/tmp/dt.rds")

close down, re-start

> library(data.table)
data.table 1.9.2  For help type: help("data.table")
> attr(dt, ".internal.selfref")
NULL

Before R-3.1, you mean to say the pointers were the same after saving and re-opening the file? That seems impossible to me that the pointer would be restored, it is session specific. Yes?

The data table is still there, it still works.

> dt2[ , new2 := rnorm(10)]
> dt2
    id       new2
 1:  1 -0.3064382
 2:  2 -0.3414318
 3:  3 -0.5758131
 4:  4 -0.2792946
 5:  5 -0.1887096
 6:  6 -2.7454482
 7:  7 -0.2169927
 8:  8  1.0065699
 9:  9 -2.0388283
10: 10 -2.5366451
> attr(dt, ".internal.selfref")
NULL

It just doesn't know who it is :)

I expect the best thing is to provide the code & example to reproduce the trouble you see, because, at least within R 3.1, I don't have the troubles you do. But I'm also not trying to use .rds objects that were written in the previous version.

pj

On Thu, Apr 17, 2014 at 2:22 PM, Michele wrote:
>> a <- data.table(a=1:10)
>> b <- data.table(a=1:10)
>> attributes(a)$.internal.selfref
>
>> attributes(b)$.internal.selfref
>
>
> The real problem is that my usual codes fail, giving error like
>
> Error in if (!is.null(lhs)) { : missing value where TRUE/FALSE needed
>
> when I do like x[y, `:=`(a = i.a, b = i.b)]
>
> As I said I'll try to replicate the error. However I already gave you
> something, if you read my post I show how moving data from 3.0.2 to 3.1
> creates the problem. Thankfully I just managed to downgrade to 3.0.2 so I
> don't get fired...
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002p4689020.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Paul E. Johnson Professor, Political Science Assoc. Director 1541 Lilac Lane, Room 504 Center for Research Methods University of Kansas University of Kansas http://pj.freefaculty.org http://quant.ku.edu From michael.nelson at sydney.edu.au Thu Apr 24 12:05:22 2014 From: michael.nelson at sydney.edu.au (Michael Nelson) Date: Thu, 24 Apr 2014 10:05:22 +0000 Subject: [datatable-help] What is going on with R 3.1 ? In-Reply-To: References: <1397752015938-4689002.post@n4.nabble.com> <1397754954723-4689005.post@n4.nabble.com>, Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCDB81D73D1@ex-mbx-pro-05> See Simon Urbanek's answer here (and Matt Dowle's Comments) http://stackoverflow.com/questions/15195220/assigning-by-reference-into-loaded-package-datasets/15208059#15208059 ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of Steve Lianoglou [lianoglou.steve at gene.com] Sent: Friday, 18 April 2014 4:37 AM To: Michele Cc: datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] What is going on with R 3.1 ? 
Hi, On Thu, Apr 17, 2014 at 10:15 AM, Michele wrote: > If I try to copy() the tables to re-create the pointers all the tables get > the same one, always ending in 788 (opening another R session I get the same > result, same pointers accross the table, ending in 788): > >> dt <- readRDS("dt.RDS") >> dt1 <- readRDS("dt1.RDS") >> dput(copy(dt)) > structure(list(id = 1:5, var = c(-0.626453810742332, 0.183643324222082, > -0.835628612410047, 1.59528080213779, NA)), .Names = c("id", > "var"), row.names = c(NA, -5L), class = c("data.table", "data.frame" > ), .internal.selfref = ) >> dput(copy(dt1)) > structure(list(id = 1:10, var2 = c(9.56605515060971, 9.83445938796679, > 9.14481066107107, 10.5308762543727, NA, NA, 9.45024506381369, > 10.0544601882071, 10.7019565371199, 9.60325830155958), var3 = > c(99.6542887323489, > 100.659231456233, 100.282028460177, 101.423474800432, 98.4134121985332, > NA, 98.0406472259105, 100.253194731594, 100.759881151841, 99.7930775536395 > )), .Names = c("id", "var2", "var3"), row.names = c(NA, -10L), class = > c("data.table", > "data.frame"), sorted = "id", .internal.selfref = 0x00000000003e0788>) Is this actually causing a problem for you somewhere? For instance, if you modify-by-reference "dt", like `dt[, z := 1]`, is `dt1` modified as well (it shouldn't be)? Let's try another experiment. Forget loading from an *.rds and just create a brand new data.table and look at its .internal.selfref: R> a <- data.table(a=1:10) R> b <- data.table(a=1:10) R> attributes(a)$.internal.selfref R> attributes(b)$.internal.selfref What do you see? Also, your initial post referenced two errors that both mentioned something about a missing logical value ("missing value where TRUE/FALSE needed"), but it's not clear what triggered the error. Have you been able to reproduce this? Next time this happens, can you call "traceback()" and provide the stack trace here? 
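
traceback() is mainly useful interactively, right after an uncaught error. A minimal base-R sketch of recording the call stack from a script is below. `check` is a hypothetical stand-in that merely triggers the same kind of error (branching on NA); it is not data.table's internal code, and `stack` and `result` are illustrative names.

```r
# Stand-in function: if() cannot branch on NA, which raises an error of the
# "missing value where TRUE/FALSE needed" kind.
check <- function(lhs) if (lhs) "assigned" else "skipped"

stack <- NULL
result <- tryCatch(
  withCallingHandlers(
    check(NA),
    # The calling handler runs while the failing frames are still on the
    # stack, so sys.calls() records the full call chain at that moment.
    error = function(e) stack <<- sys.calls()
  ),
  # The exiting handler then turns the error into its message text.
  error = function(e) conditionMessage(e)
)

print(result)
print(stack)
```

Posting the printed stack alongside the error message gives the same information a traceback() call would give in an interactive session.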
Also, your original post also said: > It's very unstable and I get different results time to time What is very unstable? And what results are different? What exact operations are you performing that are producing different results when you run them more than once? If you could provide a more precise explanation of what the problem is you are actually encountering via some piece of code that is producing the unexpected behavior you are observing, that would be most helpful. Thanks, -steve -- Steve Lianoglou Computational Biologist Genentech _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From carrieromichele at gmail.com Thu Apr 24 14:58:42 2014 From: carrieromichele at gmail.com (Michele) Date: Thu, 24 Apr 2014 05:58:42 -0700 (PDT) Subject: [datatable-help] What is going on with R 3.1 ? In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCDB81D73D1@ex-mbx-pro-05> References: <1397752015938-4689002.post@n4.nabble.com> <1397754954723-4689005.post@n4.nabble.com> <6FB5193A6CDCDF499486A833B7AFBDCDB81D73D1@ex-mbx-pro-05> Message-ID: Hi, that's very helpful. My problem, however, happens also when loading .RData created in 3.0.1 (or 3.0.2) into R 3.1 So it is not related to serialization issues I guess. I noticed that it happens with tables containing NA values. I suspect R3.1 treats NA in a different way internally, because without missing values all is fine! I'll make a reproducible example this weekend when I'm back from my holiday. In the meantime sorry for leaving this thread like this. 
Thank you a lot Michele On 24 April 2014 11:09, Michael Nelson [via R] < ml-node+s789695n4689370h58 at n4.nabble.com> wrote: > See Simon Urbanek's answer here (and Matt Dowle's Comments) > > http://stackoverflow.com/questions/15195220/assigning-by-reference-into-loaded-package-datasets/15208059#15208059 > ________________________________________ > From: [hidden email][[hidden > email] ] on behalf > of Steve Lianoglou [[hidden email]] > > Sent: Friday, 18 April 2014 4:37 AM > To: Michele > Cc: [hidden email] > Subject: Re: [datatable-help] What is going on with R 3.1 ? > > Hi, > > On Thu, Apr 17, 2014 at 10:15 AM, Michele <[hidden email]> > wrote: > > > If I try to copy() the tables to re-create the pointers all the tables > get > > the same one, always ending in 788 (opening another R session I get the > same > > result, same pointers accross the table, ending in 788): > > > >> dt <- readRDS("dt.RDS") > >> dt1 <- readRDS("dt1.RDS") > >> dput(copy(dt)) > > structure(list(id = 1:5, var = c(-0.626453810742332, 0.183643324222082, > > -0.835628612410047, 1.59528080213779, NA)), .Names = c("id", > > "var"), row.names = c(NA, -5L), class = c("data.table", "data.frame" > > ), .internal.selfref = ) > >> dput(copy(dt1)) > > structure(list(id = 1:10, var2 = c(9.56605515060971, 9.83445938796679, > > 9.14481066107107, 10.5308762543727, NA, NA, 9.45024506381369, > > 10.0544601882071, 10.7019565371199, 9.60325830155958), var3 = > > c(99.6542887323489, > > 100.659231456233, 100.282028460177, 101.423474800432, 98.4134121985332, > > NA, 98.0406472259105, 100.253194731594, 100.759881151841, > 99.7930775536395 > > )), .Names = c("id", "var2", "var3"), row.names = c(NA, -10L), class = > > c("data.table", > > "data.frame"), sorted = "id", .internal.selfref = > 0x00000000003e0788>) > > Is this actually causing a problem for you somewhere? > > For instance, if you modify-by-reference "dt", like `dt[, z := 1]`, is > `dt1` modified as well (it shouldn't be)? 
> > Let's try another experiment. Forget loading from an *.rds and just > create a brand new data.table and look at its .internal.selfref: > > R> a <- data.table(a=1:10) > R> b <- data.table(a=1:10) > R> attributes(a)$.internal.selfref > R> attributes(b)$.internal.selfref > > What do you see? > > Also, your initial post referenced two errors that both mentioned > something about a missing logical value ("missing value where > TRUE/FALSE needed"), but it's not clear what triggered the error. Have > you been able to reproduce this? > > Next time this happens, can you call "traceback()" and provide the > stack trace here? > > Also, your original post also said: > > > It's very unstable and I get different results time to time > > What is very unstable? And what results are different? What exact > operations are you performing that are producing different results > when you run them more than once? > > If you could provide a more precise explanation of what the problem is > you are actually encountering via some piece of code that is producing > the unexpected behavior you are observing, that would be most helpful. > > Thanks, > -steve > > -- > Steve Lianoglou > Computational Biologist > Genentech > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > ------------------------------ > If you reply to this email, your message will be added to the discussion > below: > > http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-tp4689002p4689370.html > To unsubscribe from What is going on with R 3.1 ?, click here > . 
-- *PRIVATE**T:* +44 (0)77 3248 1517 *|* * E:* carrieromichele at gmail.com *OFFICET:* +44 (0)20 8236 8992 *|* * E:* michele.carriero at evolve-analytics.com *T:* www.evolve-analytics.com From john.laing at gmail.com Fri Apr 25 19:25:13 2014 From: john.laing at gmail.com (John Laing) Date: Fri, 25 Apr 2014 13:25:13 -0400 Subject: [datatable-help] Assignment by reference fails silently Message-ID: If I create a logical column in my data.table and try to assign-by-reference a character value to it, the assignment fails silently. That is, it doesn't work but doesn't throw an error: ## make a simple data.table require(data.table) dt <- data.table(a=1:3, b=4:6, c=NA) ## fails silently dt[, c := "foo"] dt In other cases where an action would lead to the implicit conversion of a column, data.table throws an error suggesting that the user convert the column explicitly if that's what they really mean to do. I think that's the right behavior and should be adopted in this case as well. -John From my.r.help at gmail.com Sat Apr 26 03:08:28 2014 From: my.r.help at gmail.com (Michael Smith) Date: Sat, 26 Apr 2014 09:08:28 +0800 Subject: [datatable-help] Subsetting with logical In-Reply-To: <53511233.1060505@gmail.com> References: <53511233.1060505@gmail.com> Message-ID: <535B070C.9080304@gmail.com> Here's another example, maybe more to the point. Shouldn't the second line also work, since `b` is logical already? DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) DT[b] # Doesn't work. DT[identity(b)] # Does work. On 04/18/2014 07:53 PM, Michael Smith wrote: > Hi All, > > This is about subsetting using logicals.
The code below is > self-explanatory (I hope). Is this a bug or a feature? > > Thanks, > > M > > >> DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) >> ## This does *not* work, but it should (in my humble opinion). >> DT[b] > Error in eval(expr, envir, enclos) : object 'b' not found >> ## This does work, but seems a bit awkward, given that b is already >> ## logical. >> DT[b == TRUE] > a b > 1: 1 TRUE > 2: 3 TRUE > 3: 5 TRUE > 4: 7 TRUE >> ## With data.frame things work as expected. >> DF <- as.data.frame(DT) >> DF[DF$b, ] > a b > 1 1 TRUE > 3 3 TRUE > 5 5 TRUE > 7 7 TRUE >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > LC_TIME=en_US.utf8 > [4] LC_COLLATE=en_US.utf8 LC_MONETARY=en_US.utf8 > LC_MESSAGES=en_US.utf8 > [7] LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 > LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.9.2 colorout_1.0-1 > > loaded via a namespace (and not attached): > [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 > From smartpink111 at yahoo.com Sat Apr 26 03:15:14 2014 From: smartpink111 at yahoo.com (arun) Date: Fri, 25 Apr 2014 18:15:14 -0700 (PDT) Subject: [datatable-help] Subsetting with logical In-Reply-To: <53511233.1060505@gmail.com> References: <53511233.1060505@gmail.com> Message-ID: <1398474914.87222.YahooMailNeo@web142606.mail.bf1.yahoo.com> Hi M, Check this link: http://stackoverflow.com/questions/16191083/subset-data-table-by-logical-column A.K. On Friday, April 18, 2014 7:53 AM, Michael Smith wrote: Hi All, This is about subsetting using logicals. The code below is self-explanatory (I hope). Is this a bug or a feature? Thanks, M > DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) > ## This does *not* work, but it should (in my humble opinion).
> DT[b] Error in eval(expr, envir, enclos) : object 'b' not found > ## This does work, but seems a bit awkward, given that b is already > ## logical. > DT[b == TRUE] a b 1: 1 TRUE 2: 3 TRUE 3: 5 TRUE 4: 7 TRUE > ## With data.frame things work as expected. > DF <- as.data.frame(DT) > DF[DF$b, ] a b 1 1 TRUE 3 3 TRUE 5 5 TRUE 7 7 TRUE > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C LC_TIME=en_US.utf8 [4] LC_COLLATE=en_US.utf8 LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.9.2 colorout_1.0-1 loaded via a namespace (and not attached): [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From my.r.help at gmail.com Sat Apr 26 04:19:52 2014 From: my.r.help at gmail.com (Michael Smith) Date: Sat, 26 Apr 2014 10:19:52 +0800 Subject: [datatable-help] Subsetting with logical In-Reply-To: <1398474914.87222.YahooMailNeo@web142606.mail.bf1.yahoo.com> References: <53511233.1060505@gmail.com> <1398474914.87222.YahooMailNeo@web142606.mail.bf1.yahoo.com> Message-ID: <535B17C8.2040703@gmail.com> A.K., Thanks a lot for the link. Looking at `?data.table`, maybe the documentation could be changed to read something like this, with the text in brackets added: integer and logical vectors work the same way they do in \code{\link{[.data.frame}} (but see the \dQuote{Advanced} note below about an exception for single variable names).
The thing is that I did read the documentation, but I stopped reading at that point because it said to expect the same behavior as with data.frame, which is not what happened in my example code. And based on your link to SO, other people have had the same issue too. M On 04/26/2014 09:15 AM, arun wrote: > Hi M, > > Check this link: > http://stackoverflow.com/questions/16191083/subset-data-table-by-logical-column > > A.K. > > On 04/26/2014 09:08 AM, Michael Smith wrote:> Here's another example, maybe more to the point. Shouldn't the second > line also work, since `b` is logical already? > > DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) > DT[b] # Doesn't work. > DT[identity(b)] # Does work. > > > On Friday, April 18, 2014 7:53 AM, Michael Smith wrote: > Hi All, > > This is about subsetting using logicals. The code below is > self-explanatory (I hope). Is this a bug or a feature? > > Thanks, > > M > > >> DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) >> ## This does *not* work, but it should (in my humble opinion). >> DT[b] > Error in eval(expr, envir, enclos) : object 'b' not found >> ## This does work, but seems a bit awkward, given that b is already >> ## logical. >> DT[b == TRUE] > a b > 1: 1 TRUE > 2: 3 TRUE > 3: 5 TRUE > 4: 7 TRUE >> ## With data.frame things work as expected. 
>> DF <- as.data.frame(DT) >> DF[DF$b, ] > a b > 1 1 TRUE > 3 3 TRUE > 5 5 TRUE > 7 7 TRUE >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > LC_TIME=en_US.utf8 > [4] LC_COLLATE=en_US.utf8 LC_MONETARY=en_US.utf8 > LC_MESSAGES=en_US.utf8 > [7] LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 > LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.9.2 colorout_1.0-1 > > loaded via a namespace (and not attached): > [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From danmbox at gmail.com Sun Apr 27 13:47:01 2014 From: danmbox at gmail.com (Dan Muresan) Date: Sun, 27 Apr 2014 14:47:01 +0300 Subject: [datatable-help] data.table: partition like aggregate (FUN = list) Message-ID: How do I achieve the following partitioning effect: dt = data.table (x=10:14, y=20:24) aggregate (dt, by = list (dt$x %% 3), FUN = list) Group.1 x y 1 0 12 22 2 1 10, 13 20, 23 3 2 11, 14 21, 24 If I try the following it doesn't work (and I think I know why): dt [, by = x %% 3, j = list(y)] x y 1: 1 20 2: 1 23 3: 2 21 4: 2 24 5: 0 22 (while with j = max (y) it of course works, generating a "V1" column) Also, how do I name the result of the j-expression (by default the resulting column is "V1")? 
From jholtman at gmail.com Sun Apr 27 19:00:58 2014 From: jholtman at gmail.com (jim holtman) Date: Sun, 27 Apr 2014 13:00:58 -0400 Subject: [datatable-help] data.table: partition like aggregate (FUN = list) In-Reply-To: References: Message-ID: try this: > dt = data.table (x=10:14, y=20:24) > # create new 'x' column > dt[, newX := x] > dt[ + , list(x = toString(x) + , y = toString(y) + ) + , key = newX %% 3 + ] newX x y 1: 0 12 22 2: 1 10, 13 20, 23 3: 2 11, 14 21, 24 Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Sun, Apr 27, 2014 at 7:47 AM, Dan Muresan wrote: > How do I achieve the following partitioning effect: > > dt = data.table (x=10:14, y=20:24) > aggregate (dt, by = list (dt$x %% 3), FUN = list) > > Group.1 x y > 1 0 12 22 > 2 1 10, 13 20, 23 > 3 2 11, 14 21, 24 > > If I try the following it doesn't work (and I think I know why): > > dt [, by = x %% 3, j = list(y)] > x y > 1: 1 20 > 2: 1 23 > 3: 2 21 > 4: 2 24 > 5: 0 22 > > (while with j = max (y) it of course works, generating a "V1" column) > > Also, how do I name the result of the j-expression (by default the > resulting column is "V1")? > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aragorn168b at gmail.com Sun Apr 27 20:51:44 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 27 Apr 2014 20:51:44 +0200 Subject: [datatable-help] data.table: partition like aggregate (FUN = list) In-Reply-To: References: Message-ID: You can do: ## R < 3.1 dt[, list(x=list(x), y=list(y)), by=list(grp = x %% 3)] ## R >= 3.1 (for now, until bug #5585 is fixed) dt[, list(x=list(I(x)), y=list(I(y))), by=list(grp = x %% 3)] On Sun, Apr 27, 2014 at 1:47 PM, Dan Muresan wrote: > How do I achieve the following partitioning effect: > > dt = data.table (x=10:14, y=20:24) > aggregate (dt, by = list (dt$x %% 3), FUN = list) > > Group.1 x y > 1 0 12 22 > 2 1 10, 13 20, 23 > 3 2 11, 14 21, 24 > > If I try the following it doesn't work (and I think I know why): > > dt [, by = x %% 3, j = list(y)] > x y > 1: 1 20 > 2: 1 23 > 3: 2 21 > 4: 2 24 > 5: 0 22 > > (while with j = max (y) it of course works, generating a "V1" column) > > Also, how do I name the result of the j-expression (by default the > resulting column is "V1")? > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From aragorn168b at gmail.com Sun Apr 27 21:09:02 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 27 Apr 2014 21:09:02 +0200 Subject: [datatable-help] Subsetting with logical In-Reply-To: <535B17C8.2040703@gmail.com> References: <53511233.1060505@gmail.com> <1398474914.87222.YahooMailNeo@web142606.mail.bf1.yahoo.com> <535B17C8.2040703@gmail.com> Message-ID: Michael, I agree a note in the documentation should make things clearer. Thanks for the post. Matt has already written a bit on this change here on a post from GSee. I have filed a request so that we don't forget.
https://r-forge.r-project.org/tracker/?func=detail&atid=5356&aid=5643&group_id=240 On Sat, Apr 26, 2014 at 4:19 AM, Michael Smith wrote: > A.K., > > Thanks a lot for the link. > > Looking at `?data.table`, maybe the documentation could be changed to > read something like this, with the text in bracket added: > > integer and logical vectors work the same way they do in > \code{\link{[.data.frame}} (but see the \dQuote{Advanced} note below > about an exception for single variable names). > > The thing is that I did read the documentation, but I stopped reading at > that point because it said to expect the same behavior as with > data.frame, which is not what happened in my example code. And based on > your link to SO, other people have had the same issue too. > > M > > > On 04/26/2014 09:15 AM, arun wrote: > > Hi M, > > > > Check this link: > > > http://stackoverflow.com/questions/16191083/subset-data-table-by-logical-column > > > > A.K. > > > > > > > On 04/26/2014 09:08 AM, Michael Smith wrote:> Here's another example, > maybe more to the point. Shouldn't the second > > line also work, since `b` is logical already? > > > > DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) > > DT[b] # Doesn't work. > > DT[identity(b)] # Does work. > > > > > > > > On Friday, April 18, 2014 7:53 AM, Michael Smith > wrote: > > Hi All, > > > > This is about subsetting using logicals. The code below is > > self-explanatory (I hope). Is this a bug or a feature? > > > > Thanks, > > > > M > > > > > >> DT <- data.table(a = 1:8, b = c(TRUE, FALSE)) > >> ## This does *not* work, but it should (in my humble opinion). > >> DT[b] > > Error in eval(expr, envir, enclos) : object 'b' not found > >> ## This does work, but seems a bit awkward, given that b is already > >> ## logical. > >> DT[b == TRUE] > > a b > > 1: 1 TRUE > > 2: 3 TRUE > > 3: 5 TRUE > > 4: 7 TRUE > >> ## With data.frame things work as expected. 
> >> DF <- as.data.frame(DT) > >> DF[DF$b, ] > > a b > > 1 1 TRUE > > 3 3 TRUE > > 5 5 TRUE > > 7 7 TRUE > >> sessionInfo() > > R version 3.0.2 (2013-09-25) > > Platform: x86_64-redhat-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C > > LC_TIME=en_US.utf8 > > [4] LC_COLLATE=en_US.utf8 LC_MONETARY=en_US.utf8 > > LC_MESSAGES=en_US.utf8 > > [7] LC_PAPER=en_US.utf8 LC_NAME=C LC_ADDRESS=C > > > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.utf8 > > LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] data.table_1.9.2 colorout_1.0-1 > > > > loaded via a namespace (and not attached): > > [1] plyr_1.8.1 Rcpp_0.11.1 reshape2_1.2.2 stringr_0.6.2 > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From aragorn168b at gmail.com Sun Apr 27 21:14:14 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 27 Apr 2014 21:14:14 +0200 Subject: [datatable-help] Assignment by reference fails silently In-Reply-To: References: Message-ID: Thanks for reporting. I've added this case as a comment on another recently filed issue, bug #5442 from Michele (as I am quite sure they're both related to the handling of column types in `:=` without grouping). Arun. On Fri, Apr 25, 2014 at 7:25 PM, John Laing wrote: > If I create a logical column in my data.table and try to > assign-by-reference a character value to it, the assignment fails silently.
> That is, it doesn't work but doesn't throw an error: > > ## make a simple data.table > require(data.table) > dt <- data.table(a=1:3, b=4:6, c=NA) > > ## fails silently > dt[, c := "foo"] > dt > > In other cases where an action would lead to the implicit conversion of a > column, data.table throws an error suggesting that the user convert the > column explicitly if that's what they really mean to do. I think that's the > right behavior and should be adopted in this case as well. > > -John > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Sun Apr 27 21:48:39 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 27 Apr 2014 21:48:39 +0200 Subject: [datatable-help] R Studio Interactions with data.table In-Reply-To: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com> References: <87CF8322-0A75-44D7-A246-85ED860ABC2A@dc-energy.com> Message-ID: Zack, I'm able to reproduce the crash and the occasional warning. Will look into it - filed #5647. Thanks for reporting. Arun. On Tue, Apr 22, 2014 at 8:07 PM, Zachary Long wrote: > Hello, > > I was wondering if an error like this had been addressed before. I am > using data table 1.9.2. > > It appears that the error has to do with the interaction with R-Studio. > When I run > > library(data.table) > > dt<-data.table(strip="Nov08",date=c("2006-08-01","2006-08-02","2006-08-03","2006-08-04","2006-08-07", > > "2006-08-08","2006-08-09","2006-08-10","2006-08-11","2006-08-14")) > dt[,forward_date:=c(rep(NA,5),date),by='strip'] > > > The result I expect is below, along with a warning message. 
> > strip date forward_date > 1: Nov08 2006-08-01 NA > 2: Nov08 2006-08-02 NA > 3: Nov08 2006-08-03 NA > 4: Nov08 2006-08-04 NA > 5: Nov08 2006-08-07 NA > 6: Nov08 2006-08-08 2006-08-01 > 7: Nov08 2006-08-09 2006-08-02 > 8: Nov08 2006-08-10 2006-08-03 > 9: Nov08 2006-08-11 2006-08-04 > 10: Nov08 2006-08-14 2006-08-07 > > > However, I don't get this. > > 1 of two things can happen. > > 1. My R-Studio will completely crash without warning. All unsaved > information is lost. > 2. I can get "Error: Value of SET_STRING_ELT() must be a 'CHARSXP' not a > 'character'" "In addition:Lost warning messages" > > Do you know what is the cause here? It seems related to memory allocation, > or something under the hood relating to the interaction of R-Studio and > data table. > > Zach > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From aragorn168b at gmail.com Sun Apr 27 22:08:16 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Sun, 27 Apr 2014 22:08:16 +0200 Subject: [datatable-help] What is going on with R 3.1 ? In-Reply-To: <1397754954723-4689005.post@n4.nabble.com> References: <1397752015938-4689002.post@n4.nabble.com> <1397754954723-4689005.post@n4.nabble.com> Message-ID: Michele, I've tried with 1.8.10 and 1.9.3, on R 3.0.3 and R 3.1. And in all cases, when I load the saved file and do: attr(dt, ".internal.selfref") I get: Which of course is expected, and upon another assignment by reference data.table shallow copies, over-allocates and sets the external pointer (under the hood), so things are fine. I'm not able to reproduce any of the issues you mention. I understand you're having a hard time replicating the error as well. I'm not sure how or what we could do without a way to nicely reproduce this. Arun.
On Thu, Apr 17, 2014 at 7:15 PM, Michele wrote: > If I try to *copy()* the tables to re-create the pointers all the tables > get the *same* one, always ending in 788 (opening another R session I get > the same result, same pointers across the table, ending in 788): > > > dt <- readRDS("dt.RDS") > > dt1 <- readRDS("dt1.RDS") > > dput(copy(dt)) > structure(list(id = 1:5, var = c(-0.626453810742332, 0.183643324222082, > -0.835628612410047, 1.59528080213779, NA)), .Names = c("id", > "var"), row.names = c(NA, -5L), class = c("data.table", "data.frame" > ), .internal.selfref = ) > > dput(copy(dt1)) > structure(list(id = 1:10, var2 = c(9.56605515060971, 9.83445938796679, > 9.14481066107107, 10.5308762543727, NA, NA, 9.45024506381369, > 10.0544601882071, 10.7019565371199, 9.60325830155958), var3 = c(99.6542887323489, > 100.659231456233, 100.282028460177, 101.423474800432, 98.4134121985332, > NA, 98.0406472259105, 100.253194731594, 100.759881151841, 99.7930775536395 > )), .Names = c("id", "var2", "var3"), row.names = c(NA, -10L), class = c("data.table", > "data.frame"), sorted = "id", .internal.selfref = ) > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > From danmbox at gmail.com Mon Apr 28 05:56:52 2014 From: danmbox at gmail.com (Dan Muresan) Date: Mon, 28 Apr 2014 06:56:52 +0300 Subject: [datatable-help] data.table: partition like aggregate (FUN = list) In-Reply-To: References: Message-ID: Hi, > dt[, list(x=list(I(x)), y=list(I(y))), by=list(grp = x %% 3)] This seems to work and is pretty natural. I guess the idea was to wrap the result twice in list(). Thanks.
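A minimal sketch of the "wrap twice in list()" idea from this thread, assuming a data.table release in which bug #5585 has since been fixed, so the I() workaround is no longer needed. The inner list() turns each group's values into a single list element; the outer list() builds the columns of the result, and its argument names are what replace the default "V1". The name `res` is only illustrative.

```r
library(data.table)

dt <- data.table(x = 10:14, y = 20:24)

# Outer list(): one entry per result column; the names x and y become column names.
# Inner list(): keeps all values of a group together as one list element.
res <- dt[, list(x = list(x), y = list(y)), by = list(grp = x %% 3)]
print(res)

# Pull the members of one group back out of the list column
res[grp == 1, x[[1]]]
```

Groups appear in order of first occurrence in dt, so the row order can differ from what aggregate() prints.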
From my.r.help at gmail.com Tue Apr 29 16:04:48 2014 From: my.r.help at gmail.com (Michael Smith) Date: Tue, 29 Apr 2014 22:04:48 +0800 Subject: [datatable-help] Filtering Based on Previous Observation Message-ID: <535FB180.2060209@gmail.com> All, Is there some data.table-idiomatic way to filter based on a previous observation/row? For example, I want to remove a row if DT$a[row]==DT$a[row-1]. It could be done by first calculating the lag and then filtering based on that, but I wonder if there's a more direct way. The following example works, but my feeling is there should be a more elegant solution: ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) ) DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][] Thanks, M From chinmay.patil at gmail.com Wed Apr 30 09:59:52 2014 From: chinmay.patil at gmail.com (Chinmay Patil) Date: Wed, 30 Apr 2014 15:59:52 +0800 Subject: [datatable-help] Filtering Based on Previous Observation In-Reply-To: <535FB180.2060209@gmail.com> References: <535FB180.2060209@gmail.com> Message-ID: You can try DT ## a b ## 1: 1 8 ## 2: 2 7 ## 3: 2 6 ## 4: 3 5 DT[c(T,diff(a)!=0),] ## a b ## 1: 1 8 ## 2: 2 7 ## 3: 3 5 On Tue, Apr 29, 2014 at 10:04 PM, Michael Smith wrote: > All, > > Is there some data.table-idiomatic way to filter based on a previous > observation/row? For example, I want to remove a row if > DT$a[row]==DT$a[row-1]. > > It could be done by first calculating the lag and then filtering based > on that, but I wonder if there's a more direct way. 
> > The following example works, but my feeling is there should be a more > elegant solution: > > ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) ) > DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][] > > Thanks, > M > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From my.r.help at gmail.com Wed Apr 30 10:55:18 2014 From: my.r.help at gmail.com (Michael Smith) Date: Wed, 30 Apr 2014 16:55:18 +0800 Subject: [datatable-help] Filtering Based on Previous Observation In-Reply-To: References: <535FB180.2060209@gmail.com> Message-ID: <5360BA76.1050109@gmail.com> Chinmay, Kudos, that's a nice one! It also can be generalized to longer lags. Thanks! M On 04/30/2014 03:59 PM, Chinmay Patil wrote: > You can try > > DT > ## a b > ## 1: 1 8 > ## 2: 2 7 > ## 3: 2 6 > ## 4: 3 5 > > DT[c(T,diff(a)!=0),] > ## a b > ## 1: 1 8 > ## 2: 2 7 > ## 3: 3 5 > > > > > On Tue, Apr 29, 2014 at 10:04 PM, Michael Smith > wrote: > > All, > > Is there some data.table-idiomatic way to filter based on a previous > observation/row? For example, I want to remove a row if > DT$a[row]==DT$a[row-1]. > > It could be done by first calculating the lag and then filtering based > on that, but I wonder if there's a more direct way. 
> > The following example works, but my feeling is there should be a more > elegant solution: > > ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) ) > DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na > (L.a)][, L.a := NULL][] > > Thanks, > M > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > From my.r.help at gmail.com Wed Apr 30 10:59:54 2014 From: my.r.help at gmail.com (Michael Smith) Date: Wed, 30 Apr 2014 16:59:54 +0800 Subject: [datatable-help] Filtering Based on Previous Observation In-Reply-To: <5360BA76.1050109@gmail.com> References: <535FB180.2060209@gmail.com> <5360BA76.1050109@gmail.com> Message-ID: <5360BB8A.6070503@gmail.com> ... and here's a new challenge. What if `a` is character? DT <- data.table(a = as.character(c(1, 2, 2, 3)), b = 8:5) DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][] I still would like to compare equality with the previous observation/row, and remove it if it's the same. (`diff` doesn't work with characters.) M On 04/30/2014 04:55 PM, Michael Smith wrote: > Chinmay, > > Kudos, that's a nice one! It also can be generalized to longer lags. > Thanks! > > M > > On 04/30/2014 03:59 PM, Chinmay Patil wrote: >> You can try >> >> DT >> ## a b >> ## 1: 1 8 >> ## 2: 2 7 >> ## 3: 2 6 >> ## 4: 3 5 >> >> DT[c(T,diff(a)!=0),] >> ## a b >> ## 1: 1 8 >> ## 2: 2 7 >> ## 3: 3 5 >> >> >> >> >> On Tue, Apr 29, 2014 at 10:04 PM, Michael Smith > > wrote: >> >> All, >> >> Is there some data.table-idiomatic way to filter based on a previous >> observation/row? For example, I want to remove a row if >> DT$a[row]==DT$a[row-1]. >> >> It could be done by first calculating the lag and then filtering based >> on that, but I wonder if there's a more direct way. 
>> >> The following example works, but my feeling is there should be a more >> elegant solution: >> >> ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) ) >> DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][] >> >> Thanks, >> M >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> From aragorn168b at gmail.com Wed Apr 30 11:01:54 2014 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Wed, 30 Apr 2014 11:01:54 +0200 Subject: [datatable-help] Filtering Based on Previous Observation In-Reply-To: <5360BB8A.6070503@gmail.com> References: <535FB180.2060209@gmail.com> <5360BA76.1050109@gmail.com> <5360BB8A.6070503@gmail.com> Message-ID: Perhaps do: DT[, GRP := .GRP, by=a] And then continue as Chinmay's earlier solution on the column GRP? Arun From: Michael Smith my.r.help at gmail.com Reply: Michael Smith my.r.help at gmail.com Date: April 30, 2014 at 11:00:21 AM To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org Subject: Re: [datatable-help] Filtering Based on Previous Observation ... and here's a new challenge. What if `a` is character? DT <- data.table(a = as.character(c(1, 2, 2, 3)), b = 8:5) DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][] I still would like to compare equality with the previous observation/row, and remove it if it's the same. (`diff` doesn't work with characters.) M On 04/30/2014 04:55 PM, Michael Smith wrote: > Chinmay, > > Kudos, that's a nice one! It also can be generalized to longer lags. > Thanks!
> > M > > On 04/30/2014 03:59 PM, Chinmay Patil wrote: >> You can try >> >> DT >> ## a b >> ## 1: 1 8 >> ## 2: 2 7 >> ## 3: 2 6 >> ## 4: 3 5 >> >> DT[c(T,diff(a)!=0),] >> ## a b >> ## 1: 1 8 >> ## 2: 2 7 >> ## 3: 3 5 >> >> >> >> >> On Tue, Apr 29, 2014 at 10:04 PM, Michael Smith > > wrote: >> >> All, >> >> Is there some data.table-idiomatic way to filter based on a previous >> observation/row? For example, I want to remove a row if >> DT$a[row]==DT$a[row-1]. >> >> It could be done by first calculating the lag and then filtering based >> on that, but I wonder if there's a more direct way. >> >> The following example works, but my feeling is there should be a more >> elegant solution: >> >> ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) ) >> DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na >> (L.a)][, L.a := NULL][] >> >> Thanks, >> M >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -------------- next part -------------- An HTML attachment was scrubbed... URL: From chinmay.patil at gmail.com Wed Apr 30 11:11:03 2014 From: chinmay.patil at gmail.com (Chinmay Patil) Date: Wed, 30 Apr 2014 17:11:03 +0800 Subject: [datatable-help] Filtering Based on Previous Observation In-Reply-To: <5360BB8A.6070503@gmail.com> References: <535FB180.2060209@gmail.com> <5360BA76.1050109@gmail.com> <5360BB8A.6070503@gmail.com> Message-ID: Try DT[c(T, tail(a, -1) != head(a,-1))] ## a b ## 1: 1 8 ## 2: 2 7 ## 3: 3 5 On Wed, Apr 30, 2014 at 4:59 PM, Michael Smith wrote: > ... and here's a new challenge. What if `a` is character? 
> > DT <- data.table(a = as.character(c(1, 2, 2, 3)), b = 8:5) > DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][] > > I still would like to compare equality with the previous > observation/row, and remove it if it's the same. (`diff` doesn't work > with characters.) > > M > > > > On 04/30/2014 04:55 PM, Michael Smith wrote: > > Chinmay, > > > > Kudos, that's a nice one! It also can be generalized to longer lags. > > Thanks! > > > > M > > > > On 04/30/2014 03:59 PM, Chinmay Patil wrote: > >> You can try > >> > >> DT > >> ## a b > >> ## 1: 1 8 > >> ## 2: 2 7 > >> ## 3: 2 6 > >> ## 4: 3 5 > >> > >> DT[c(T,diff(a)!=0),] > >> ## a b > >> ## 1: 1 8 > >> ## 2: 2 7 > >> ## 3: 3 5 > >> > >> > >> > >> > >> On Tue, Apr 29, 2014 at 10:04 PM, Michael Smith >> > wrote: > >> > >> All, > >> > >> Is there some data.table-idiomatic way to filter based on a previous > >> observation/row? For example, I want to remove a row if > >> DT$a[row]==DT$a[row-1]. > >> > >> It could be done by first calculating the lag and then filtering > based > >> on that, but I wonder if there's a more direct way. > >> > >> The following example works, but my feeling is there should be a > more > >> elegant solution: > >> > >> ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) ) > >> DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na > >> (L.a)][, L.a := NULL][] > >> > >> Thanks, > >> M > >> _______________________________________________ > >> datatable-help mailing list > >> datatable-help at lists.r-forge.r-project.org > >> > >> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >> > >> > -------------- next part -------------- An HTML attachment was scrubbed... 
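
The head/tail comparison above extends to longer lags, as noted earlier in the thread. A minimal sketch, assuming a hypothetical helper `first_of_run()` (not part of data.table) and the character-valued `DT` from the thread; it compares each value with the one `lag` rows before it and keeps the first `lag` rows unconditionally:

```r
library(data.table)

# Hypothetical helper, not part of data.table: TRUE where x differs from
# the value `lag` rows earlier. The first `lag` rows have nothing to be
# compared with, so they are always kept.
first_of_run <- function(x, lag = 1L) {
  n <- length(x)
  if (n <= lag) return(rep(TRUE, n))
  c(rep(TRUE, lag), tail(x, -lag) != head(x, -lag))
}

# Same data as in the thread, with `a` stored as character.
DT <- data.table(a = as.character(c(1, 2, 2, 3)), b = 8:5)

# Drop every row whose `a` equals the `a` of the row before it.
res <- DT[first_of_run(a)]
print(res)
```

Because `!=` is used rather than `diff()`, the same helper works for numeric and character columns. Missing values in `a` make the comparison NA, and how such rows should be treated is left open here.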

From ggrothendieck at gmail.com Wed Apr 30 14:00:10 2014
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Wed, 30 Apr 2014 08:00:10 -0400
Subject: [datatable-help] Filtering Based on Previous Observation
In-Reply-To: <535FB180.2060209@gmail.com>
References: <535FB180.2060209@gmail.com>
Message-ID:

On Tue, Apr 29, 2014 at 10:04 AM, Michael Smith wrote:
> All,
>
> Is there some data.table-idiomatic way to filter based on a previous
> observation/row? For example, I want to remove a row if
> DT$a[row]==DT$a[row-1].
>
> It could be done by first calculating the lag and then filtering based
> on that, but I wonder if there's a more direct way.
>
> The following example works, but my feeling is there should be a more
> elegant solution:
>
> ( DT <- data.table(a = c(1, 2, 2, 3), b = 8:5) )
> DT[, L.a := c(NA, head(a, -1))][a != L.a | is.na(L.a)][, L.a := NULL][]

If the unique elements always appear consecutively then the following
would work. (For example, if `a` were in ascending order (as in the
example) or descending order then that would be satisfied. If DT
were keyed on 'a' then this would always be the case.)

DT[ !duplicated(a) ]

Note that 'a' need not be numeric.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From f.harrell at vanderbilt.edu Wed Apr 30 14:28:27 2014
From: f.harrell at vanderbilt.edu (Frank Harrell)
Date: Wed, 30 Apr 2014 07:28:27 -0500
Subject: [datatable-help] fread will not respect stringsAsFactors
Message-ID: <5360EC6B.7040607@vanderbilt.edu>

In R 3.1.0 and data.table 1.9.2, I can't get stringsAsFactors to work:

require(data.table)
s=fread('http://biostat.mc.vanderbilt.edu/tmp/support2.csv',
        stringsAsFactors=TRUE)
attributes(s$sex)   # NULL; is character

Frank

From aragorn168b at gmail.com Wed Apr 30 15:04:48 2014
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 30 Apr 2014 15:04:48 +0200
Subject: [datatable-help] fread will not respect stringsAsFactors
In-Reply-To: <5360EC6B.7040607@vanderbilt.edu>
References: <5360EC6B.7040607@vanderbilt.edu>
Message-ID:

Hi Frank,

IIRC it's not implemented yet (irrespective of R or data.table version).
I think there's a bug/FR somewhere. Matt hasn't gotten to it yet.

Arun

From: Frank Harrell f.harrell at vanderbilt.edu
Reply: Frank Harrell f.harrell at vanderbilt.edu
Date: April 30, 2014 at 2:28:46 PM
To: datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] fread will not respect stringsAsFactors

In R 3.1.0 and data.table 1.9.2, I can't get stringsAsFactors to work:

require(data.table)
s=fread('http://biostat.mc.vanderbilt.edu/tmp/support2.csv',
        stringsAsFactors=TRUE)
attributes(s$sex)   # NULL; is character

Frank
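
While `fread` does not honour `stringsAsFactors`, the character columns can be converted after reading. A minimal sketch, assuming a reasonably recent data.table and a small inline sample standing in for the CSV from the thread (the values are made up); `chr_cols` is only an illustrative name:

```r
library(data.table)

# Inline sample in place of the CSV from the thread (made-up values).
# fread treats a string containing a newline as the data itself.
s <- fread("sex,age\nmale,60\nfemale,55\nmale,71\n")

# Names of all columns that were read as character.
chr_cols <- names(s)[vapply(s, is.character, logical(1))]

# Convert those columns to factor by reference, leaving the rest untouched.
if (length(chr_cols) > 0L) {
  s[, (chr_cols) := lapply(.SD, as.factor), .SDcols = chr_cols]
}

str(s$sex)
```

The conversion happens in place, so no copy of `s` is made. Whether the installed data.table version already supports the argument directly can be checked in `?fread`.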